Your browser was unable to load all of the resources. They may have been blocked by your firewall, proxy or browser configuration.
Press Ctrl+F5 or Ctrl+Shift+R to have your browser try again.

step aborted on AIX due to "Unable to find job" #4660

drdt ·

I am completely failing to diagnose an error I am getting on a single platform, in fact a single machine; it happens on one of my AIX agents, but on no other system of any platform.

The scenario is that I have a job running with the master step on Windows, but then is a child step that runs on multiple unix platforms (via repeat parameters). For one machine, I am seeing this behavior where the step runs to completion, but then sits idle for about six minutes, and finally fails with this error:

ERROR - Step 'master>...>For each unix platform?paramBuildPlatform=aixopenssl30>For each build target?paramBuildTarget=all_unix' is failed: Unable to find job '87a86137---7c8f7c7772ce' on node 'aix72rel3-1:8813'.

The "For each build target" step runs on Windows.
It calls a child step "On selected build box" which runs on host aix72rel3-1, and completes successfully.
It is at the end of "For each build target" that the build fails.

I also see simultaneous agent log errors on both systems involved:

aix.log: ERROR grid.GridJob - Error notifying task node of job finishing (job class: 87a86137---7c8f7c7772ce, job id: com.pmease.quickbuild.stepsupport.StepExecutionJob, task node: ws22DR1-02:8823)
aix.log: ERROR grid.NodeServiceImpl - Unable to find job '87a86137---7c8f7c7772ce' on node 'aix72rel3-1:8813' (Job is ever started: true).

win.log:ERROR grid.GridTaskFuture - Unable to find job (job class: com.pmease.quickbuild.stepsupport.StepExecutionJob, job id: 87a86137---7c8f7c7772ce, build id: 49669, job node: aix72rel3-1:8813)

I suspect the root cause is a problem with the aix72rel3-1 host; however, I am at a loss how to proceed in debugging. Can you give me some pointers as to what might be the issue?

I am using QuickBuild 14.0.23.

  • replies 16
  • views 57
  • stars 0
robinshen ADMIN ·

Are there any more messages being logged after this line in aix.log?

Error notifying task node of job finishing

drdt ·

Yes, there is a stack dump (see below) if I run with TRACE debugging enabled, and then the calling step logs "INFO - Executing post-execute action...". I can send you the whole log via email, just remind me how.

Also worthy of note: the log I captured was via copy/paste from the step log window while the job was running. However, once the job completed, the final log saved didn't include any of the logs from the step that ran on the affected machine. I'll send both versions so you can see what I mean.

ERROR - Step 'master>...>For each unix platform?paramBuildPlatform=aixopenssl30>For each build target?paramBuildTarget=all_unix' is failed: Unable to find job '87a86137---7c8f7c7772ce' on node 'aix72rel3-1:8813'.
12:22:34,058 [master>For each build target?paramBuildTarget=all_unix@ws22DR1-02:8823] TRACE - Stack trace for the failure
    java.lang.RuntimeException: Error executing step process job.
        at com.pmease.quickbuild.stepsupport.StepProcessTask.reduce(StepProcessTask.java:126)
        at com.pmease.quickbuild.stepsupport.StepProcessTask.reduce(StepProcessTask.java:19)
        at com.pmease.quickbuild.grid.GridTaskFuture.get(GridTaskFuture.java:168)
        at com.pmease.quickbuild.grid.GridTaskFuture.get(GridTaskFuture.java:172)
        at com.pmease.quickbuild.stepsupport.ParallelStep.triggerChildren(ParallelStep.java:83)
        at com.pmease.quickbuild.stepsupport.CompositeStep.run(CompositeStep.java:133)
        at com.pmease.quickbuild.stepsupport.Step.doExecute(Step.java:680)
        at com.pmease.quickbuild.stepsupport.Step.execute(Step.java:577)
        at com.pmease.quickbuild.stepsupport.StepExecutionJob.executeStepAwareJob(StepExecutionJob.java:31)
        at com.pmease.quickbuild.stepsupport.StepAwareJob.executeBuildAwareJob(StepAwareJob.java:56)
        at com.pmease.quickbuild.BuildAwareJob.execute(BuildAwareJob.java:77)
        at com.pmease.quickbuild.grid.GridJob.run(GridJob.java:131)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:840)
    Caused by: java.lang.RuntimeException: Error executing step execution job.
        at com.pmease.quickbuild.stepsupport.StepExecutionTask.reduce(StepExecutionTask.java:29)
        at com.pmease.quickbuild.stepsupport.StepExecutionTask.reduce(StepExecutionTask.java:19)
        at com.pmease.quickbuild.grid.GridTaskFuture.get(GridTaskFuture.java:168)
        at com.pmease.quickbuild.grid.GridTaskFuture.get(GridTaskFuture.java:172)
        at com.pmease.quickbuild.stepsupport.StepProcessJob.executeStepAwareJob(StepProcessJob.java:46)
        ... 8 more
    Caused by: com.pmease.quickbuild.QuickbuildException: Unable to find job '87a86137-9a7b-4768-8cd4-7c8f7c7772ce' on node 'aix72rel3-1:8813'.
        at com.pmease.quickbuild.grid.GridTaskFuture.testJobs(GridTaskFuture.java:91)
        at com.pmease.quickbuild.grid.GridTaskFuture.get(GridTaskFuture.java:150)
        ... 10 more

Because I have set the parent step to ignore failures of the child steps, the build goes on to complete successfully.

drdt ·

I misspoke; it is happening on all of the AIX machines that I have tested so far. So it may be a generic AIX error.

robinshen ADMIN ·

The stack trace is for the error "Unable to find job", and it is caused by a key error in in your original post:

Error notifying task node of job finishing (job class: 87a86137---7c8f7c7772ce, job id: com.pmease.quickbuild.stepsupport.StepExecutionJob, task node: ws22DR1-02:8823)

I checked the code, and the log should contain stack trace for this error. Which QB version are you using? Are you able to reproduce this with a simple configuration running a test step on AIX?

drdt ·

I am using v14.0.23. Upgrading to 15.x is not viable because we still have agents running Java 8 that cannot be upgraded.

By setting the log level to TRACE, I was able to get more details error message, including a stack dump, from both agents at the time of the failure:

aix: 2025-08-25 11:14:21,257 [pool-2-thread-3] ERROR com.pmease.quickbuild.grid.GridJob - Error notifying task node of job finishing
aix:   com.caucho.hessian.client.HessianConnectionException: 500: java.net.SocketTimeoutException: Read timed out
aix:   Caused by: java.net.SocketTimeoutException: Read timed out
aix: 2025-08-25 11:14:21,281 [pool-2-thread-3] TRACE com.pmease.quickbuild.grid.GridJob - Finished execution of job

win: 2025-08-25 11:14:21,297 [qtp1166226770-30] TRACE com.pmease.quickbuild.grid.NodeServiceImpl - Handling job finish event
win: 2025-08-25 11:14:21,297 [qtp1166226770-30] DEBUG com.pmease.quickbuild.grid.GridTaskFuture - Notifying task of job finished

aix: 2025-08-25 11:14:21,313 [qtp239362799-68] WARN org.eclipse.jetty.server.HttpChannel - handleException /file_transfer java.io.FileNotFoundException:
aix:    /opt/qbagent-8813/temp/job_0c56b18d-c11f-43f3-9c34-1e46c6d46bfd.log (A file or directory in the path name does not exist.)

win: 2025-08-25 11:14:22,777 [qtp1166226770-30] ERROR com.pmease.quickbuild.grid.GridTaskFuture - Error finishing job.
win:    Caused by: java.io.IOException: Server returned HTTP response code: 
win:       500 for URL: http://172.17.160.240:8813/file_transfer?file=%2Fopt%2Fqbagent-8813%2Ftemp%2Fjob_0c56b18d-c11f-43f3-9c34-1e46c6d46bfd.log
drdt ·

Full stack dump on AIX agent:

2025-08-25 11:14:21,257 [pool-2-thread-3] ERROR com.pmease.quickbuild.grid.GridJob - Error notifying task node of job finishing (job class: 0c56b18d-c11f-43f3-9c34-1e46c6d46bfd, job id: com.p
mease.quickbuild.stepsupport.StepExecutionJob, task node: ws22DR1-02:8823)
 com.caucho.hessian.client.HessianConnectionException: 500: java.net.SocketTimeoutException: Read timed out
        at com.caucho.hessian.client.HessianURLConnection.sendRequest(HessianURLConnection.java:153)
        at com.caucho.hessian.client.HessianProxy.sendRequest(HessianProxy.java:300)
        at com.caucho.hessian.client.HessianProxy.invoke(HessianProxy.java:171)
        at jdk.proxy3/jdk.proxy3.$Proxy47.gridJobFinished(Unknown Source)
        at com.pmease.quickbuild.grid.GridNode$1.gridJobFinished(GridNode.java:249)
        at com.pmease.quickbuild.grid.GridJob.run(GridJob.java:152)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1595)
 Caused by: java.net.SocketTimeoutException: Read timed out
        at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
        at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
        at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:486)
        at java.base/sun.net.www.protocol.http.HttpURLConnection$10.run(HttpURLConnection.java:2071)
        at java.base/sun.net.www.protocol.http.HttpURLConnection$10.run(HttpURLConnection.java:2066)
        at java.base/java.security.AccessController.doPrivileged(AccessController.java:751)
        at java.base/sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:2065)
        at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1635)
        at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1615)
        at com.caucho.hessian.client.HessianURLConnection.sendRequest(HessianURLConnection.java:133)
        ... 10 more
 Caused by: java.net.SocketTimeoutException: Read timed out
        at java.base/sun.nio.ch.NioSocketImpl.timedRead(NioSocketImpl.java:278)
        at java.base/sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:304)
        at java.base/sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:346)
        at java.base/sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:796)
        at java.base/java.net.Socket$SocketInputStream.read(Socket.java:1099)
        at java.base/java.io.BufferedInputStream.fill(BufferedInputStream.java:291)
        at java.base/java.io.BufferedInputStream.read1(BufferedInputStream.java:347)
        at java.base/java.io.BufferedInputStream.implRead(BufferedInputStream.java:420)
        at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:399)
        at java.base/sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:827)
        at java.base/sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:759)
        at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1706)
        at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1615)
        at java.base/java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:531)
        at com.caucho.hessian.client.HessianURLConnection.sendRequest(HessianURLConnection.java:120)
        ... 10 more
2025-08-25 11:14:21,281 [pool-2-thread-3] TRACE com.pmease.quickbuild.grid.GridJob - Finished execution of job (job class: com.pmease.quickbuild.stepsupport.StepExecutionJob, job Id: 0c56b18d
-c11f-43f3-9c34-1e46c6d46bfd)...
2025-08-25 11:14:21,313 [qtp239362799-68] WARN org.eclipse.jetty.server.HttpChannel - handleException /file_transfer java.io.FileNotFoundException: /opt/qbagent-8813/temp/job_0c56b18d-c11f-43
f3-9c34-1e46c6d46bfd.log (A file or directory in the path name does not exist.)
drdt ·

Full stack dump on Windows agent:

2025-08-25 11:14:22,777 [qtp1166226770-30] ERROR com.pmease.quickbuild.grid.GridTaskFuture - Error finishing job.
 java.lang.RuntimeException: java.io.IOException: Server returned HTTP response code: 500 for URL: http://172.17.160.240:8813/file_transfer?file=%2Fopt%2Fqbagent-8813%2Ftemp%2Fjob_0c56b18d-c11f-43f3-9c34-1e46c6d46bfd.log
        at com.pmease.quickbuild.bootstrap.BootstrapUtils.wrapAsUnchecked(BootstrapUtils.java:56)
        at com.pmease.quickbuild.util.ExceptionUtils.wrapAsUnchecked(ExceptionUtils.java:82)
        at com.pmease.quickbuild.util.ServletUtils.download(ServletUtils.java:86)
        at com.pmease.quickbuild.util.ServletUtils.download(ServletUtils.java:42)
        at com.pmease.quickbuild.util.ServletUtils.download(ServletUtils.java:35)
        at com.pmease.quickbuild.grid.CopyFileJob.execute(CopyFileJob.java:63)
        at com.pmease.quickbuild.grid.GridNode.executeJob(GridNode.java:724)
        at com.pmease.quickbuild.grid.GridImpl.transferFile(GridImpl.java:169)
        at com.pmease.quickbuild.BuildAwareTask.jobFinished(BuildAwareTask.java:30)
        at com.pmease.quickbuild.grid.GridTaskFuture.jobFinished(GridTaskFuture.java:199)
        at com.pmease.quickbuild.grid.NodeServiceImpl.gridJobFinished(NodeServiceImpl.java:208)
        at jdk.internal.reflect.GeneratedMethodAccessor86.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:568)
        at com.caucho.hessian.server.HessianSkeleton.invoke(HessianSkeleton.java:302)
        at com.caucho.hessian.server.HessianSkeleton.invoke(HessianSkeleton.java:198)
        at com.caucho.hessian.server.HessianServlet.invoke(HessianServlet.java:399)
        at com.caucho.hessian.server.HessianServlet.service(HessianServlet.java:379)
        at com.pmease.quickbuild.grid.GridServlet.service(GridServlet.java:36)
        at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:799)
        at org.eclipse.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1631)
        at com.pmease.quickbuild.Quickbuild$DisableTraceFilter.doFilter(Quickbuild.java:1620)
        at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
        at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:548)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1624)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1594)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
        at org.eclipse.jetty.server.Server.handle(Server.java:516)
        at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487)
        at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
        at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
        at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
        at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
        at java.base/java.lang.Thread.run(Thread.java:840)
 Caused by: java.io.IOException: Server returned HTTP response code: 500 for URL: http://172.17.160.240:8813/file_transfer?file=%2Fopt%2Fqbagent-8813%2Ftemp%2Fjob_0c56b18d-c11f-43f3-9c34-1e46c6d46bfd.log
        at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:2000)
        at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1589)
        at com.pmease.quickbuild.util.ServletUtils.download(ServletUtils.java:75)
        ... 50 more
robinshen ADMIN ·

The real issue is that aix agent is unable to connect to machine ws22DR1-02:8823 due to read timeout. Please login to the aix agent, and check if below command works:

telnet ws22DR1-02 8823

drdt ·

2025-08-26 13:08:13 build@aix72rel3-1:~ $ telnet ws22DR1-02 8823
Trying...
Connected to ws22DR1-02.iscinternal.com.
Escape character is '^]'.
testing connection
HTTP/1.1 505 HTTP/0.9 not supported
Content-Type: text/html;charset=iso-8859-1
Content-Length: 65
Connection: close

[h1]Bad Message 505[/h1][pre]reason: HTTP/0.9 not supported[/pre]Connection closed.
2025-08-26 13:08:20 build@aix72rel3-1:~ $

I ran another test running the parent step on unix instead of windows, and got the same errors.

robinshen ADMIN ·

Seems that the port is accessible. Are you able to reproduce this issue with a simple test configuration running child step on your AIX box?

drdt ·

I have not. I made very simple test case and the problem didn't appear. I am still trying.

I have been making incremental reductions to the affected configuration but haven't narrowed it down. My most recent change resulted in a test run where no error was reported by the build, the other symptoms remained:

  • six minute delay returning from the aix step
  • aix step log missing from full build log
  • same errors in agent quickbuild.log

I thought the build was trying to upload the step log from aix agent to the server, but was uploading it to the agent running the parent step instead. However, I see the same behavior using the server as the master step.

And of course, why it doesn't happen in every circumstance is beyond me. I thought it was related to my use of a post-execute action, but I can't get conclusive results. I am still trying.

robinshen ADMIN ·

For your test, please make sure the parent step runs on machine ws22DR1-02:8823, and a child step inside it runs on your aix machine. Also a socket read timeout often indicates that the target machine (ws22DR1-02:8823) is overloaded if network connection is fine.

drdt ·

It is not a problem with the ws22DR1-02 box. I have successfully replicated the failure with the parent step running on:

  • mac build agent
  • windows build agent
  • windows user agent
  • red hat linux build agent
  • the quickbuild server (ubuntu).

I am further narrowing it down by removing additional steps to create a minimum viable test case.

drdt ·

I set up another test case, this time running the master on the aix box, and the child step on ubuntu. The first thing I observed was that it took five minutes for the job to complete the "CHECKING_BUILD_CONDITION" pre-step. Then I got similar issues; this time the ubuntu step suffered the timeout:

11:44:29,236 ERROR - Step 'master>Build>Sync Perforce and Build>UNIX>Build unix platforms>For each unix platform?paramBuildPlatform=lnxsuse15x64>For each build target?paramBuildTarget=all_unix' is failed.
java.lang.RuntimeException: Error executing step process job.
Caused by: java.lang.RuntimeException: Error executing step execution job.
Caused by: com.caucho.hessian.client.HessianConnectionException: 500: java.net.SocketTimeoutException: Read timed out
Caused by: java.net.SocketTimeoutException: Read timed out

With these associated errors in the agent logs:

ubuntu: ERROR com.pmease.quickbuild.grid.GridJob - Error notifying task node of job finishing (job class: 4b3f5185-c6f2-43c1-a16e-f688e8d6ac23, job id: com.pmease.quickbuild.stepsupport.StepExecutionJob, task node: aix72rel3-1:8813)
ubuntu: com.pmease.quickbuild.RemotingException: Unable to find task future of job '4b3f5185-c6f2-43c1-a16e-f688e8d6ac23' at node 'aix72rel3-1:8813'

aix: TRACE com.pmease.quickbuild.grid.NodeServiceImpl - Handling job finish event (job class: com.pmease.quickbuild.stepsupport.StepExecutionJob, job id: 4b3f5185-c6f2-43c1-a16e-f688e8d6ac23, job node: reubdock-26:8813)...
aix: ERROR com.pmease.quickbuild.RemotingSerializerFactory - Error invoking hessian method.
aix: com.pmease.quickbuild.QuickbuildException: Unable to find task future of job '4b3f5185-c6f2-43c1-a16e-f688e8d6ac23' at node 'aix72rel3-1:8813'

I am certain this is some configuration or firewall setting on my aix boxes. But I need direction to find it.

robinshen ADMIN ·

Do you have some security scan running in the middle between two machines? We have customers experiencing very odd issues (connection timeout, file download timeout for certain file types etc, but network testing is fine), and it turns out to be security scanning software dropping packages. You may also change port QB agents are running, or protect the traffic with SSL to see if it makes any differences.

drdt ·

Thank you, this is a good suggestion. What are safe ports I can use for the agent outside of the 88xx range?

Also I see from an earlier posting that there is another port used internally, can you tell me what that is, just to see if perhaps it is being blocked on the agent?