Help debugging a hard server hang, QB 13 #4584

tomz · 5 months ago

I've tar'd the log files so if there's a secure place to send them, let me know.

The start of the issue is mostly marked by the following log entry type:

2024-07-17 07:02:53,698 [pool-2-thread-94144] WARN com.pmease.quickbuild.DefaultBuildEngine - Build request for 'root/project/configuration' is ignored as there is an identical request in queue.

(Hundreds more entries follow where only the configuration differs.)
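To put numbers on those entries, here is a minimal sketch that tallies the ignored requests per configuration and per minute. It assumes the log lines follow the exact layout of the entry quoted above. The function name `tally_ignored` is made up for illustration and is not part of QuickBuild.

```python
import re
from collections import Counter

# Pattern built from the single WARN line quoted above; other log layouts
# would need a different expression (assumption).
IGNORED = re.compile(
    r"^(?P<minute>\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2},\d{3} .*"
    r"Build request for '(?P<cfg>[^']+)' is ignored "
    r"as there is an identical request in queue"
)


def tally_ignored(lines):
    """Count ignored build requests per configuration and per minute."""
    per_config = Counter()
    per_minute = Counter()
    for line in lines:
        match = IGNORED.search(line)
        if match:
            per_config[match.group("cfg")] += 1
            per_minute[match.group("minute")] += 1
    return per_config, per_minute


# Demo on the entry quoted above plus one unrelated line.
sample = [
    "2024-07-17 07:02:53,698 [pool-2-thread-94144] WARN "
    "com.pmease.quickbuild.DefaultBuildEngine - Build request for "
    "'root/project/configuration' is ignored as there is an identical "
    "request in queue.",
    "some unrelated log line",
]
per_config, per_minute = tally_ignored(sample)
for cfg, count in per_config.most_common(10):
    print(count, cfg)
```

Passing an open log file instead of `sample` works the same way, since the function only iterates over lines. A handful of configurations dominating `per_config` would point at specific schedules, while an even spread would suggest the queue as a whole stopped draining.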

About 10 minutes after this started, the web UI became completely blocked. After waiting ~5 minutes to see whether it would recover, I got many other repeated errors like these:

2024-07-17 07:13:20,033 [pool-2-thread-94354] WARN com.mchange.v2.resourcepool.BasicResourcePool - com.mchange.v2.resourcepool.BasicResourcePool@1761d8e1 -- an attempt to checkout a resource was interrupted, and the pool is still live: some other thread must have either interrupted the Thread attempting checkout!

2024-07-17 07:13:20,034 [pool-2-thread-94354] WARN org.hibernate.engine.jdbc.spi.SqlExceptionHelper - SQL Error: 0, SQLState: null
2024-07-17 07:13:20,034 [pool-2-thread-94354] ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper - An SQLException was provoked by the following failure: java.lang.InterruptedException

We need to understand what happened. @robinshen do you have any idea?

--
Tom Z

robinshen ADMIN · 5 months ago

Do you have an external program calling the RESTful API frequently? Or do you have a configuration scheduled to run too frequently? You may send logs to [robin AT pmease DOT com]

tomz · 5 months ago

I've seen up to roughly 150 items in the build queue, but the build systems and the number of things being triggered have hardly changed over the past few months in which we have seen this. We've added a few more Perforce streams, each with its own set of build configurations, but that likely adds only about 10 more frequently polled configurations (checking for p4 changes at a 2-minute interval), and the queue hasn't really grown. In fact, because we're close to a release, activity is going down as we gate changes through triage.

I had sent you an email with the server dump. Please let us know if you find anything actionable.

tomz · 5 months ago

At the time of the latest failure, there were likely no RESTful API calls. It did happen at almost exactly 7 AM local time, however. I can try to look at schedules to figure something out, as the previous failure also occurred very close to 7 AM.
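One way to check the 7 AM pattern across older logs is to pull out the first time the "identical request in queue" warning appears on each day. The sketch below assumes the timestamp layout from the entries quoted earlier. `first_onset_per_day` is a hypothetical helper name, and the marker string is just the tail of that warning.

```python
import re
from datetime import datetime

# Timestamp prefix as seen in the quoted log lines, e.g. "2024-07-17 07:02:53,698 ".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} ")
MARKER = "is ignored as there is an identical request in queue"


def first_onset_per_day(lines, marker=MARKER):
    """Map each date to the earliest time a line containing `marker` was logged."""
    onsets = {}
    for line in lines:
        if marker not in line:
            continue
        match = TIMESTAMP.match(line)
        if not match:
            continue  # continuation line or a different layout; skip it
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        day, moment = stamp.date(), stamp.time()
        if day not in onsets or moment < onsets[day]:
            onsets[day] = moment
    return onsets


def print_onsets(lines):
    """Print one 'date time' row per day, oldest first."""
    for day, moment in sorted(first_onset_per_day(lines).items()):
        print(day, moment)
```

If the onset times cluster around 07:00 on every failing day, a schedule firing at that hour becomes the prime suspect. If they are scattered, the two 7 AM failures may be coincidence.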