Gitea stops responding on HTTP #15826

Closed
2 tasks done
dpedu opened this issue May 10, 2021 · 6 comments · Fixed by #15831
Comments

@dpedu

dpedu commented May 10, 2021

  • Gitea version (or commit ref): Docker image - "latest" tag
  • Git version: 2a9b8d1 - from docker image's metadata
  • Operating system: Ubuntu Trusty, running Gitea in Docker 18 with the public gitea/gitea image.
  • Database (use [x]):
    • SQLite
  • Can you reproduce the bug at https://try.gitea.io:
    • No
  • Log gist:
    I've had this happen a few times, but not yet with debug logging enabled. It's on now; will update. I didn't see anything out of the ordinary in the default log that I had.

I did however get a trace of all the goroutines: https://gist.github.com/dpedu/86b729acd51d2132950328a4040e0092

Description

I run a web-facing copy of Gitea (not exposed directly; it sits behind Nginx and Varnish) of which I am the only user. After 2 - 3 days of uptime, the instance seems to stop responding to HTTP requests. The reverse proxy in front of it shows a 504 timeout error.

I can connect to the Gitea instance directly with my browser, and while it accepts the connection it never responds to HTTP requests; the problem looks the same with the reverse proxies between Gitea and me removed.

The ssh interface still works fine - I can clone, push, pull, etc.

Looking at the log, Gitea is logging as if it were still serving normal traffic. It logs lines like Started GET and Completed GET as usual, but inspecting the traffic in my reverse proxy shows it is not actually replying.

Looking at the goroutine trace above, it looks like there are very many goroutines trying to acquire some lock in the template layer. More than 300! My instance was not using much CPU at all when viewed in htop in the bad state; it seemed like something was locked up rather than overloaded.

In ps, there were several zombie git processes that were children of gitea. I'm not sure whether this is a cause or a result of the other problem:

root     25150  7788 25150  7788  0 May08 ?        00:00:17       docker-containerd-shim -namespace moby -workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/0e642201efc1eeb1a8daa108d88efa2b469d915524496bf4278c7660c62dfed9 -address /var/run/docker/containerd/docker-containerd.sock -containerd-binary /usr/bin/docker-containerd -runtime-root /var/run/docker/runtime-runc
root     25168 25150 25168 25168  0 May08 ?        00:00:00         /bin/s6-svscan /etc/s6
root     25219 25168 25168 25168  0 May08 ?        00:00:00           s6-supervise gitea
1000     25221 25219 25221 25221  4 May08 ?        02:13:53             /app/gitea/gitea web
1000     13231 25221 25221 25221  0 19:58 ?        00:00:00               [git] <defunct>               <---- this line repeated 16x
root     25220 25168 25168 25168  0 May08 ?        00:00:00           s6-supervise openssh
root     25222 25220 25222 25222  0 May08 ?        00:00:00             sshd: /usr/sbin/sshd -D -e [listener] 0 of 10-100 startups

Here's the lsof -p output for Gitea as well. I snipped out the "normal" stuff like my SQLite database. What was left was about 400 of the TYPE=sock lines, with about 10x fewer TYPE=FIFO lines sprinkled in.

COMMAND   PID     USER   FD   TYPE DEVICE  SIZE/OFF      NODE NAME
gitea   25221     1000   84u  sock    0,8       0t0 226911955 can't identify protocol
gitea   25221     1000   85u  sock    0,8       0t0 226911157 can't identify protocol
gitea   25221     1000   86r  FIFO   0,10       0t0 226911175 pipe
gitea   25221     1000   87u  sock    0,8       0t0 226911183 can't identify protocol
gitea   25221     1000   88u  sock    0,8       0t0 226911203 can't identify protocol
gitea   25221     1000   89u  sock    0,8       0t0 226912114 can't identify protocol
gitea   25221     1000   90u  sock    0,8       0t0 226912271 can't identify protocol
gitea   25221     1000   91r  FIFO   0,10       0t0 226935531 pipe
gitea   25221     1000   92u  sock    0,8       0t0 226940493 can't identify protocol
gitea   25221     1000   93u  sock    0,8       0t0 226940498 can't identify protocol

When I restart my Gitea container, everything is back to normal afterwards.

I first remember encountering this problem in mid-April. I pull the latest Docker image each time I've had to manually restart it, so it has appeared in more than one version.

There are other http-based services running on the same machine that don't show similar issues.

I have lots and lots of code checked into public repos, so the instance attracts quite a bit of bot traffic scraping the HTML. Sometimes I have to user-agent ban bots that are too aggressive and push CPU usage too high.

Screenshots

N/A

@zeripath
Contributor

OK, the two longest blocked goroutines are:

goroutine 11721443 [semacquire, 32 minutes]:
sync.runtime_SemacquireMutex(0xc0023cb05c, 0x1500, 0x1)
	/usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc0023cb058)
	/usr/local/go/src/sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
	/usr/local/go/src/sync/mutex.go:81
github.com/unrolled/render.(*Render).HTML(0xc0023caf00, 0x7fa43d312928, 0xc0109ac4c0, 0xc8, 0x2ab5409, 0xc, 0x26764e0, 0xc00635ac90, 0x0, 0x0, ...)
	/go/src/code.gitea.io/gitea/vendor/github.com/unrolled/render/render.go:392 +0x453
code.gitea.io/gitea/modules/context.(*Context).HTML(0xc00f30e900, 0xc8, 0x2ab5409, 0xc)
	/go/src/code.gitea.io/gitea/modules/context/context.go:189 +0x270
code.gitea.io/gitea/routers/repo.FileHistory(0xc00f30e900)
	/go/src/code.gitea.io/gitea/routers/repo/commit.go:257 +0x6e8
goroutine 11722759 [semacquire, 32 minutes]:
sync.runtime_SemacquireMutex(0xc0023cb05c, 0xc0086a0400, 0x1)
	/usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc0023cb058)
	/usr/local/go/src/sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
	/usr/local/go/src/sync/mutex.go:81
github.com/unrolled/render.(*Render).HTML(0xc0023caf00, 0x7fa43d312928, 0xc00d2ce9c0, 0xc8, 0x2a66faf, 0x9, 0x26764e0, 0xc007f12060, 0x0, 0x0, ...)
	/go/src/code.gitea.io/gitea/vendor/github.com/unrolled/render/render.go:392 +0x453
code.gitea.io/gitea/modules/context.(*Context).HTML(0xc00b1a7500, 0xc8, 0x2a66faf, 0x9)
	/go/src/code.gitea.io/gitea/modules/context/context.go:189 +0x270
code.gitea.io/gitea/routers/repo.renderCode(0xc00b1a7500)
	/go/src/code.gitea.io/gitea/routers/repo/view.go:739 +0xa05
code.gitea.io/gitea/routers/repo.Home(0xc00b1a7500)
	/go/src/code.gitea.io/gitea/routers/repo/view.go:632 +0x305
code.gitea.io/gitea/modules/web.Wrap.func1(0x3fcce60, 0xc00d2ce9c0, 0xc00cf2e600)
	/go/src/code.gitea.io/gitea/modules/web/route.go:64 +0x1dc

But they're both blocked writing HTML, waiting to acquire a lock at:

https://github.com/unrolled/render/blob/b7951a4bb643b89b5e45e179cee661139809c294/render.go#L392

This code worries me, though, as it looks like anything that holds up rendering blocks it for everyone; essentially we can only render one HTML template at a time?!
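
To illustrate the pathology (a deliberately simplified sketch of the pattern, not the actual unrolled/render code): if a single mutex is held across both template execution and the write to the client, one stalled client serializes every other render.

package renderlock

import (
    "bytes"
    "html/template"
    "net/http"
    "sync"
)

// renderer models an HTML renderer that serializes all output behind one
// mutex. The names here are illustrative only.
type renderer struct {
    mu   sync.Mutex
    tmpl *template.Template
}

func (r *renderer) HTML(w http.ResponseWriter, name string, data interface{}) error {
    r.mu.Lock()
    defer r.mu.Unlock() // held until the client write below finishes

    var buf bytes.Buffer
    if err := r.tmpl.ExecuteTemplate(&buf, name, data); err != nil {
        return err
    }
    // If this write blocks (e.g. the peer has stopped reading and its TCP
    // window is full), every other request queues up on r.mu above - which
    // is exactly what the 300+ blocked goroutines in the trace look like.
    _, err := buf.WriteTo(w)
    return err
}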

The thing holding the lock therefore is:

goroutine 11721218 [IO wait]:
internal/poll.runtime_pollWait(0x7fa43bd763b8, 0x77, 0xffffffffffffffff)
	/usr/local/go/src/runtime/netpoll.go:222 +0x55
internal/poll.(*pollDesc).wait(0xc0063da318, 0x77, 0x2600, 0x7699e, 0xffffffffffffffff)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitWrite(...)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:96
internal/poll.(*FD).Write(0xc0063da300, 0xc00cd1ae7e, 0xade6, 0x7f182, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:282 +0x325
net.(*netFD).Write(0xc0063da300, 0xc00cd1ae7e, 0xade6, 0x7f182, 0x1000, 0x0, 0x0)
	/usr/local/go/src/net/fd_posix.go:73 +0x4f
net.(*conn).Write(0xc010438568, 0xc00cd1ae7e, 0xade6, 0x7f182, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:195 +0x8e
net/http.checkConnErrorWriter.Write(0xc002394000, 0xc00cd1ae7e, 0xade6, 0x7f182, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/http/server.go:3512 +0x56
bufio.(*Writer).Write(0xc003c22200, 0xc00cd1ae7e, 0xade6, 0x7f182, 0xc00cc63530, 0x2aa5bb1, 0xc)
	/usr/local/go/src/bufio/bufio.go:639 +0x13d
net/http.(*chunkWriter).Write(0xc008386040, 0xc00cd1a000, 0xbc64, 0x80000, 0x7b8269, 0xc002394088, 0x0)
	/usr/local/go/src/net/http/server.go:383 +0xaa
bufio.(*Writer).Write(0xc003c22480, 0xc00cd1a000, 0xbc64, 0x80000, 0xe, 0x56eef20, 0xc005faa1e8)
	/usr/local/go/src/bufio/bufio.go:639 +0x13d
net/http.(*response).write(0xc008386000, 0xbc64, 0xc00cd1a000, 0xbc64, 0x80000, 0x0, 0x0, 0x2594ae0, 0xc00cc6e0a8, 0x2594ae0)
	/usr/local/go/src/net/http/server.go:1615 +0x38a
net/http.(*response).Write(0xc008386000, 0xc00cd1a000, 0xbc64, 0x80000, 0x1b69679, 0xc008386000, 0xc8)
	/usr/local/go/src/net/http/server.go:1573 +0x56
code.gitea.io/gitea/modules/context.(*Response).Write(0xc003c22540, 0xc00cd1a000, 0xbc64, 0x80000, 0xc, 0xc00e5909d0, 0x2aa5bb1)
	/go/src/code.gitea.io/gitea/modules/context/response.go:46 +0xbb
bytes.(*Buffer).WriteTo(0xc00de5bda0, 0x7fa43d312928, 0xc003c22540, 0x3fcce60, 0xc003c22540, 0x2676401)
	/usr/local/go/src/bytes/buffer.go:239 +0xb7
github.com/unrolled/render.HTML.Render(0xc00e4a3fc8, 0x18, 0xc8, 0x2ab5409, 0xc, 0xc0069ed500, 0x3fc4c38, 0xc00699d2c0, 0x7fa43d312928, 0xc003c22540, ...)
	/go/src/code.gitea.io/gitea/vendor/github.com/unrolled/render/engine.go:102 +0x174
github.com/unrolled/render.(*Render).Render(0xc0023caf00, 0x7fa43d312928, 0xc003c22540, 0x3fad6c0, 0xc00d90ff80, 0x26764e0, 0xc0034f5d70, 0x277d0e0, 0xc0025224e0)
	/go/src/code.gitea.io/gitea/vendor/github.com/unrolled/render/render.go:369 +0x6c
github.com/unrolled/render.(*Render).HTML(0xc0023caf00, 0x7fa43d312928, 0xc003c22540, 0xc8, 0x2ab5409, 0xc, 0x26764e0, 0xc0034f5d70, 0x0, 0x0, ...)
	/go/src/code.gitea.io/gitea/vendor/github.com/unrolled/render/render.go:424 +0x325
code.gitea.io/gitea/modules/context.(*Context).HTML(0xc00f30e000, 0xc8, 0x2ab5409, 0xc)
	/go/src/code.gitea.io/gitea/modules/context/context.go:189 +0x270
code.gitea.io/gitea/routers/repo.Commits(0xc00f30e000)

Which is weird, because that looks like it's blocked writing to the response.

This is very odd - and frankly very concerning.

  • The lock in unrolled/render appears completely unjustified and highly inefficient. Why is it there?
  • Why is the write in the response blocking and not timing out? (See the sketch below.)
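
On the second point: a write on a Go net.Conn has no timeout unless a deadline has been set on the connection, so once the peer stops reading and its socket send buffer fills, Write simply blocks. The standard knob for this is http.Server's WriteTimeout, which sets one absolute deadline per request. A rough sketch (the values here are illustrative, not Gitea's actual configuration):

package main

import (
    "net/http"
    "time"
)

func main() {
    srv := &http.Server{
        Addr: ":3000",
        // One absolute deadline per request: if the whole response has not
        // been written within WriteTimeout, later writes fail with a
        // timeout error instead of blocking forever.
        ReadTimeout:  30 * time.Second,
        WriteTimeout: 30 * time.Second,
        Handler:      http.DefaultServeMux,
    }
    _ = srv.ListenAndServe()
}

The catch is that this is a single deadline for the whole response, which penalizes large but otherwise healthy transfers; hence the per-write approach discussed below.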

@zeripath
Contributor

zeripath commented May 10, 2021

OK, we can add a write deadline (and a read deadline) to the connection here:

func (wl *wrappedListener) Accept() (net.Conn, error) {
    var c net.Conn
    // Set keepalive on TCPListeners connections.
    if tcl, ok := wl.Listener.(*net.TCPListener); ok {
        tc, err := tcl.AcceptTCP()
        if err != nil {
            return nil, err
        }
        _ = tc.SetKeepAlive(true)                  // see http.tcpKeepAliveListener
        _ = tc.SetKeepAlivePeriod(3 * time.Minute) // see http.tcpKeepAliveListener
        c = tc
    } else {
        var err error
        c, err = wl.Listener.Accept()
        if err != nil {
            return nil, err
        }
    }
    closed := int32(0)
    c = wrappedConn{
        Conn:   c,
        server: wl.server,
        closed: &closed,
    }
    wl.server.wg.Add(1)
    return c, nil
}

Adding it at around line 222 would likely be best. Agh, no, it doesn't work like that! It's a deadline, not a timeout, which means we need to change the write function to refresh the deadline. Damn.

The question is what the timeout should be, but I suspect certainly nothing more than 5s, and likely even 500ms is too much.
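
Something like the following is what I mean: a rough sketch only (the type and field names are made up here, not what ends up in the PR). Wrap the accepted net.Conn and push the write deadline forward on every Write, so a slow-but-healthy transfer keeps going while a genuinely stalled peer errors out after the per-write timeout.

package graceful // illustrative placement only

import (
    "net"
    "time"
)

// deadlineConn refreshes the write deadline before every write.
type deadlineConn struct {
    net.Conn
    perWriteTimeout time.Duration // e.g. 5 * time.Second
}

func (c deadlineConn) Write(p []byte) (int, error) {
    // A single deadline set at Accept time would eventually expire even for
    // perfectly healthy long-running responses; refreshing it per write
    // turns it into an effective per-write timeout.
    if err := c.Conn.SetWriteDeadline(time.Now().Add(c.perWriteTimeout)); err != nil {
        return 0, err
    }
    return c.Conn.Write(p)
}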

zeripath added a commit to zeripath/gitea that referenced this issue May 11, 2021
In go-gitea#15826 it has become apparent that there are a few occasions when a response can
hang during writing, and because there is no timeout go will happily just block
interminably. This PR adds a fixed 5 second timeout to all writes to a connection.

Fix go-gitea#15826

Signed-off-by: Andrew Thornton <art27@cantab.net>
@zeripath
Contributor

I've opened a second issue to deal with the render-lock issue, because I think your real problem here is the blocked ResponseWriter.

@zeripath
Contributor

OK, we've merged the unrolled/render change in #15845, preventing the render-lock problem.

@dpedu
Author

dpedu commented May 12, 2021

That's great! I've pulled an updated image built from ffbd0fe and am running it.

What should I do with this issue? If you'd like, I will close it in a few days, once it's clear it isn't hitting the same problem anymore. My instance had locked up again in the two days since this was opened, so it shouldn't take very long.

@zeripath
Contributor

I still think #15831 is needed, as I suspect you'll find that although the rendering pipeline no longer gets blocked, there are still zombie network connections.

zeripath added a commit that referenced this issue May 14, 2021
In #15826 it has become apparent that there are a few occasions when a response can
hang during writing, and because there is no timeout go will happily just block
interminably. This PR adds a fixed 5 second timeout to all writes to a connection.

Fix #15826

Signed-off-by: Andrew Thornton <art27@cantab.net>
zeripath added a commit to zeripath/gitea that referenced this issue May 14, 2021
Backport go-gitea#15831

In go-gitea#15826 it has become apparent that there are a few occasions when a response can
hang during writing, and because there is no timeout go will happily just block
interminably. This PR adds a fixed 5 second timeout to all writes to a connection.

Fix go-gitea#15826

Signed-off-by: Andrew Thornton <art27@cantab.net>
zeripath added a commit that referenced this issue May 14, 2021
Backport #15831

In #15826 it has become apparent that there are a few occasions when a response can
hang during writing, and because there is no timeout go will happily just block
interminably. This PR adds a fixed 5 second timeout to all writes to a connection.

Fix #15826

Signed-off-by: Andrew Thornton <art27@cantab.net>
AbdulrhmnGhanem pushed a commit to kitspace/gitea that referenced this issue Aug 10, 2021
In go-gitea#15826 it has become apparent that there are a few occasions when a response can
hang during writing, and because there is no timeout go will happily just block
interminably. This PR adds a fixed 5 second timeout to all writes to a connection.

Fix go-gitea#15826

Signed-off-by: Andrew Thornton <art27@cantab.net>
go-gitea locked and limited conversation to collaborators Oct 19, 2021