Cannot cancel request when agent server hangs. #4571

hung · 1 year ago

Hello Mr. Robin Shen,

We are facing an issue in production with the agent-master connection.
We cannot reproduce the issue, but here are the symptoms:

The build request hung in CHECKING_BUILD_CONDITION status.
We could not cancel the build request, and other requests were left pending in the queue.
The build server was unauthorized, but the build was still not cancelled.
We could not open or remotely access the server at that time.
The server's log stopped at "Active build agent {agentAddress} timed out..."
After restarting the server, the log continued with:
Job still exists on job node...
Unable to find job (job class:...
Error processing build request...

I think the cause is that the proxy cannot be created in the getNodeService function when the agent server is lagging.
So we would like to add a timeout to that function.
Please let me know your opinion.
Thank you.
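
A minimal sketch of the timeout idea above, not QuickBuild's actual code. getNodeService is internal to QB and its signature is not shown in this thread, so the sketch bounds an arbitrary blocking call instead. The names TimedCall and callWithTimeout are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs a blocking call on a worker thread and
// gives up waiting after timeoutMillis.
class TimedCall {
    static <T> T callWithTimeout(Callable<T> task, long timeoutMillis) throws Exception {
        // Daemon thread, so a stuck worker cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "timed-call");
            t.setDaemon(true);
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupt the worker; this only helps if the call is interruptible.
            future.cancel(true);
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Note that this only unblocks the caller. If the wrapped call is stuck in I/O that ignores interrupts, the worker thread may linger until the underlying socket times out.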

robinshen ADMIN · 1 year ago

Do you mean agent machine is down in this case, but build job never finishes? Have you tried the "Check Condition Timeout" in general setting of the configuration?

hung · 1 year ago

Do you mean agent machine is down in this case, but build job never finishes?
-> Yes, the job only finishes when I reboot the server
Have you tried the "Check Condition Timeout" in general setting of the configuration?
-> Check Condition Timeout was set to 5 minutes.

robinshen ADMIN · 1 year ago

I tested a hard shutdown of a build agent, and the job eventually terminated when the QB server hit an error while testing job connectivity:

Caused by: com.pmease.quickbuild.QuickbuildException: Error testing job.
 	at com.pmease.quickbuild.grid.GridTaskFuture.testJobs(GridTaskFuture.java:111)
 	at com.pmease.quickbuild.grid.GridTaskFuture.get(GridTaskFuture.java:150)
 	... 6 more
 Caused by: com.caucho.hessian.client.HessianRuntimeException: com.caucho.hessian.client.HessianRuntimeException: Error connecting 'http://192.168.71.9:8811/service/node'
 	at com.caucho.hessian.client.HessianProxy.sendRequest(HessianProxy.java:285)
 	at com.caucho.hessian.client.HessianProxy.invoke(HessianProxy.java:171)
 	at com.sun.proxy.$Proxy93.testGridJob(Unknown Source)
 	at com.pmease.quickbuild.grid.GridNode$1.testGridJob(GridNode.java:254)
 	at com.pmease.quickbuild.grid.GridTaskFuture.testJobs(GridTaskFuture.java:89)
 	... 7 more
 Caused by: com.caucho.hessian.client.HessianRuntimeException: Error connecting 'http://192.168.71.9:8811/service/node'
 	at com.caucho.hessian.client.HessianURLConnection.getOutputStream(HessianURLConnection.java:101)
 	at com.caucho.hessian.client.HessianProxy.sendRequest(HessianProxy.java:283)
 	... 11 more
 Caused by: java.net.ConnectException: Operation timed out (Connection timed out)
 	at java.net.PlainSocketImpl.socketConnect(Native Method)
 	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
 	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
 	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
 	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
 	at java.net.Socket.connect(Socket.java:607)
 	at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
 	at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
 	at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
 	at sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
 	at sun.net.www.http.HttpClient.New(HttpClient.java:339)
 	at sun.net.www.http.HttpClient.New(HttpClient.java:357)
 	at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1228)
 	at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1162)
 	at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1056)
 	at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:990)
 	at sun.net.www.protocol.http.HttpURLConnection.getOutputStream0(HttpURLConnection.java:1342)
 	at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1317)
 	at com.caucho.hessian.client.HessianURLConnection.getOutputStream(HessianURLConnection.java:99)

What I did:

Run the QB server on one machine
Run a QB agent on another machine
Create a test configuration with the build condition set to run a Groovy script that sleeps for 5 minutes:

groovy:
sleep(300000);

Edit the pre-queue script in the configuration's advanced settings to run the script below, so that manually triggered jobs also evaluate the build condition:

request.respectBuildCondition

Configure the master step to run on the build agent
Trigger the test configuration manually, then power off the build agent machine
The build agent becomes inactive after a short while, and the job keeps running. After 5 minutes or so, the job is terminated due to the connectivity test failure

Can you please do the same to see if it works at your side?

hung · 1 year ago

Yes, it worked on our servers.
But these are not the symptoms our servers showed when the issue occurred.
I simulated a similar case in which the server hangs:

Run the agent on a physical server
Run a fork bomb on the agent server
Trigger builds continuously
After a period of time, the build hangs and the request stops at CHECKING_BUILD_CONDITION status,
even though "Check Condition Timeout" was set to 5

robinshen ADMIN · 1 year ago

hung wrote: "Run fork bomb in agent server"

What do you mean by this? Does this happen even when the agent server is not down?

hung · 1 year ago

I ran a fork bomb command to simulate an overloaded server.
Actually, when the server was hanging, I couldn't access it.

robinshen ADMIN · 1 year ago

Are things working fine when the agent machine is not down, even if you are running a fork bomb on the agent?

hung · 1 year ago

In the simulation:
The agent machine lagged and disconnected from the server, but it could still execute other commands after the fork bomb stopped.
The fork bomb command stopped after a period of time, so I had to run it repeatedly.

In reality:
The agent machine was overloaded and I could not do anything with it.

When the server was overloaded, we could not cancel a build request in "CHECKING_BUILD_CONDITION" status in the queue,
even though the agent was disconnected.

#10

hung · 1 year ago

Hello Mr. Robin Shen,
Is there any update?
If you need more information, please let me know.

#11

robinshen ADMIN · 1 year ago

I am testing an agent under high load while it checks the build condition, but I cannot reproduce the problem so far. Can you share your fork bomb command or something similar, and also send me (robin AT pmease DOT com) a database backup of the test instance?

#12

hung · 1 year ago

For information security reasons, I cannot send you an email.
I will detail the steps to reproduce the issue below:
Server: QuickBuild 14.0.11
Database: H2
Configuration:
General Setting > Build Condition > "If specified script evaluates to true"
Script:

groovy:
	sleep(60000);
	return true;

Concurrent: Yes
Check Condition Timeout: 1

Advanced Setting > Pre-Queue Script:

groovy:
request.setRespectBuildCondition(true);

Agent:
Start 3 agents that have the same resource
In one terminal, run the fork bomb command continuously: :(){ :|:& };: (1)
The log line "bash: fork: retry: Resource temporarily unavailable" is printed continuously
After a few minutes the log stops; run command (1) again
While command (1) is running, create builds continuously

#13

robinshen ADMIN · 1 year ago

Thanks for the detailed info. I tested, and my observations are below:

1. The Groovy sleep method is not cancellable; please use "Thread.sleep" instead.
2. I configured the master step as a command build step running "sleep 3600". After the fork bomb runs for some time, the build cannot be cancelled, although the console prints "killing process xxx". This is reasonable, as system calls are very slow or even impossible in such a case. However, the agent process still responds to job queries without any problems, so the QB server thinks that the build is still running and keeps waiting for it to be cancelled.
3. I then shut down the agent machine directly, and the build exited with a socket timeout error within 5 minutes (as the socket read timeout is set to 300s).

So in such a panic case, just remove the agent from the grid or reboot it.
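
To illustrate the point about cancellable sleeps, here is a small plain-Java sketch (not QB code) showing that Thread.sleep reacts to a thread interrupt, which is what Future.cancel(true) delivers. As far as I know, the Groovy GDK sleep(long) is documented to sleep for the full duration even if interrupted, which would explain why it cannot be cancelled. The class and method names are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class InterruptibleSleepDemo {
    // Returns true if the sleeping task received the interrupt and stopped early.
    static boolean cancelSleepingTask() throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        AtomicBoolean interrupted = new AtomicBoolean(false);
        CountDownLatch started = new CountDownLatch(1);

        Future<?> task = executor.submit(() -> {
            started.countDown();
            try {
                Thread.sleep(60_000); // same duration as the build condition script
            } catch (InterruptedException e) {
                interrupted.set(true); // cancellation arrived as an interrupt
            }
        });

        started.await();    // make sure the task is running before cancelling
        task.cancel(true);  // true = interrupt the worker thread
        executor.shutdown();
        boolean finished = executor.awaitTermination(5, TimeUnit.SECONDS);
        return finished && interrupted.get();
    }
}
```

The worker ends within the 5 second wait instead of sleeping the full minute, because the interrupt makes Thread.sleep throw InterruptedException.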

#14

hung · 1 year ago

Hello Mr. Robin Shen,

1. I changed the check build condition script

sleep(60000) => Thread.sleep(60000)

but I still cannot cancel the request in the queue when the server hangs.
2. It is OK if the server hangs in BUILDING status, because the queue will not get stuck and other builds will continue to be processed on other agents.
3. When the server hangs, restarting is necessary.
But we would like the QuickBuild master to detect the request that makes the queue stuck and remove it.
The request is stuck in the check_build_condition state beyond the timeout duration anyway.
Please give us some suggestions.
Thank you so much.

#15

robinshen ADMIN · 1 year ago

It is possible that cancellation will not work when system resources are very low. Have you tried stopping the panicked agent machine in this case? On my side, the build request exits within 5 minutes due to the socket read timeout, and the QB server can continue to process other build requests on the same configuration.
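
For readers unfamiliar with socket read timeouts, the following is a self-contained sketch of the mechanism using plain java.net sockets. It is not QB's Hessian client code, and it uses a short timeout rather than the 300s mentioned above. The local "silent peer" stands in for an agent that accepted the connection but never answers; the names are hypothetical.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class ReadTimeoutDemo {
    // Connects to a local listener that never sends data.
    // Returns true if the blocking read gave up after timeoutMillis.
    static boolean readTimesOut(int timeoutMillis) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket silentPeer = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, silentPeer.getLocalPort())) {
            // Without a read timeout, read() would block indefinitely.
            client.setSoTimeout(timeoutMillis);
            try {
                client.getInputStream().read();
                return false; // the peer answered or closed the connection
            } catch (SocketTimeoutException e) {
                return true;  // no data arrived within the timeout
            }
        }
    }
}
```

A read timeout only fires when the peer goes silent. An overloaded agent that still answers job queries keeps the connection alive, which matches the behaviour described earlier in the thread.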

#16

hung · 1 year ago

Is it possible to cancel the request on the QB server side?
Stopping or restarting the QB agent machine is the final solution.
But when the server hangs, we cannot reach it remotely via ssh.
So I think it would be best if there were a way to delete the request from the QB server machine first and then deal with the frozen server afterward.

#17

robinshen ADMIN · 1 year ago

The problem is that the cancellation command can be issued successfully to the agent, and the server thinks that the build is still running. The QB server cannot simply exit in this case, as that may leave the build running on its own and cause other serious problems, such as running subsequent builds while the previous build is still running, with possible damage to workspaces, etc.

#18

absalom1 · 1 year ago

The build request stuck issue happens while the request is in the "CHECKING_BUILD_CONDITION" status.
So, there is no build object on the QB server side and the workspace is not created on the agent side because none of the build steps have been processed yet.

Hence, I think, the cancellation of the request, while in "CHECKING_BUILD_CONDITION" status, does not affect the workspace.

Restoring the agent quickly is the best solution, but it could take some time if the agent has a critical HW fault.

How about supporting the cancellation of a request in the "CHECKING_BUILD_CONDITION" status or releasing the configuration lock for this case?

#19

robinshen ADMIN · 1 year ago

QB will write files to the agent workspace even in the checking build condition phase. For instance, a git repository has to be cloned at this stage to check whether there are SCM changes, even if no other steps are running.

Nevertheless, I filed a ticket to forcibly cancel a build request without ensuring completion of the remote task: QB-4099 (An option to cancel build request without ensuring remote task execution completion)

However, please note that this may cause concurrent modifications to the same workspace if the agent is not actually dead.

#20

absalom1 · 8 months ago

Hello. I have something to ask.
Does the build assignment process check the status of the build agent just before assigning a build?

If many build agents are connected to a master, checking their status could be delayed while the master processes many builds and web page requests.
Then a build could be assigned to a build agent that is not accessible but has not been checked for a few minutes.

How about checking the real status of the assigned build agent before starting the process with CHECKING_BUILD_CONDITION?

#21

robinshen ADMIN · 8 months ago

QB does not check agent status when dispatching jobs, as many agents may be connected and checking the status of each of them can be very time consuming. Instead, agents report their status asynchronously. If the server does not hear from an agent for some time (the agent timeout in system settings), the agent is removed from the active agent list and will not be considered for job dispatch next time.
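
The asynchronous reporting described above can be sketched roughly as follows. This is an illustration of the general heartbeat pattern, not QB's implementation. AgentRegistry and its methods are hypothetical names, and time is passed in explicitly to keep the example deterministic.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry: agents push heartbeats, the server prunes silent ones.
class AgentRegistry {
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    AgentRegistry(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever an agent reports its status.
    void heartbeat(String agent, long nowMillis) {
        lastSeen.put(agent, nowMillis);
    }

    // Called periodically by the server; returns the agents that were dropped.
    List<String> removeTimedOut(long nowMillis) {
        List<String> removed = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastSeen.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() > timeoutMillis) {
                removed.add(entry.getKey());
                it.remove();
            }
        }
        return removed;
    }

    // Agents currently eligible for job dispatch.
    Set<String> activeAgents() {
        return new TreeSet<>(lastSeen.keySet());
    }
}
```

The trade-off is visible in the sketch: between an agent's last heartbeat and the next pruning pass, a dead agent still counts as active and can be handed a job.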

#22

absalom1 · 8 months ago

Yes, I agree that checking many agents can be very time-consuming.
My system has 600 for the "Agent Timeout," and maybe I need to try it with 0 :)

#23

drdt · 8 months ago

Mine is set to 60, because with anything less than that we get false positives due to network hiccups.

#24

ngocanhnu · 8 months ago

Hello Mr. Robin Shen,

We hope that you can improve removeTimedoutAgents to cancel the request that the alerted agent (the node of the master step) is holding, because sometimes the CHECKING_BUILD_CONDITION step hangs for longer than it takes the system to detect that the agent has timed out.
Note that it would be best if the request's thread could be stopped immediately, so the configuration can be unlocked as soon as possible.
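
A rough sketch of what this request could look like, assuming the server kept track of which pending requests each agent holds. This is purely illustrative: AgentRequestTracker, track and cancelAllFor are hypothetical names, and nothing here reflects how removeTimedoutAgents is actually implemented in QB.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Future;

// Hypothetical bookkeeping: remember which in-flight requests depend on which agent.
class AgentRequestTracker {
    private final Map<String, List<Future<?>>> pendingByAgent = new ConcurrentHashMap<>();

    // Record that a request (e.g. a build condition check) is running on an agent.
    void track(String agent, Future<?> request) {
        pendingByAgent.computeIfAbsent(agent, k -> new CopyOnWriteArrayList<>()).add(request);
    }

    // When an agent times out, cancel everything it was holding.
    // Returns the number of requests that were actually cancelled.
    int cancelAllFor(String agent) {
        List<Future<?>> held = pendingByAgent.remove(agent);
        if (held == null) {
            return 0;
        }
        int cancelled = 0;
        for (Future<?> request : held) {
            if (request.cancel(true)) { // true = interrupt the request's thread
                cancelled++;
            }
        }
        return cancelled;
    }
}
```

As noted earlier in the thread, cancelling on the server side without confirmation from the agent risks concurrent modifications to the same workspace if the agent is not actually dead.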

#25

absalom1 · 8 months ago

Thanks for sharing yours.

[#QB-4099] An option to cancel build request without ensuring remote task execution completion - PMEase