I am failing to diagnose an error that occurs on a single machine: it happens on one of my AIX agents, but on no other system of any platform.
The scenario is that I have a job whose master step runs on Windows, with a child step that runs on multiple Unix platforms (via repeat parameters). On one machine, the step runs to completion, then sits idle for about six minutes, and finally fails with this error:
ERROR - Step 'master>...>For each unix platform?paramBuildPlatform=aixopenssl30>For each build target?paramBuildTarget=all_unix' is failed: Unable to find job '87a86137---7c8f7c7772ce' on node 'aix72rel3-1:8813'.
The "For each build target" step runs on Windows.
It calls a child step "On selected build box" which runs on host aix72rel3-1, and completes successfully.
It is at the end of "For each build target" that the build fails.
I also see simultaneous agent log errors on both systems involved:
aix.log: ERROR grid.GridJob - Error notifying task node of job finishing (job class: 87a86137---7c8f7c7772ce, job id: com.pmease.quickbuild.stepsupport.StepExecutionJob, task node: ws22DR1-02:8823)
aix.log: ERROR grid.NodeServiceImpl - Unable to find job '87a86137---7c8f7c7772ce' on node 'aix72rel3-1:8813' (Job is ever started: true).
win.log: ERROR grid.GridTaskFuture - Unable to find job (job class: com.pmease.quickbuild.stepsupport.StepExecutionJob, job id: 87a86137---7c8f7c7772ce, build id: 49669, job node: aix72rel3-1:8813)
I suspect the root cause is a problem with the aix72rel3-1 host; however, I am at a loss as to how to proceed with debugging. Can you give me some pointers as to what the issue might be?
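Since the AIX agent log says it could not notify the task node ws22DR1-02:8823, one check that might help narrow this down is whether each agent can open a TCP connection to the other's port. Below is a minimal sketch of such a check. It assumes that the host:port strings in the logs are the addresses the agents listen on, and that Python 3 is available on both machines; neither is confirmed. The helper names parse_node, can_connect and check_nodes are hypothetical and are not part of QuickBuild.

```python
import socket


def parse_node(node):
    """Split a 'host:port' string, as it appears in the logs, into (host, port)."""
    host, _, port = node.rpartition(":")
    return host, int(port)


def can_connect(node, timeout=5.0):
    """Return True if a TCP connection to 'host:port' succeeds within the timeout."""
    host, port = parse_node(node)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_nodes(nodes, timeout=5.0):
    """Map each 'host:port' string to whether it was reachable."""
    return {node: can_connect(node, timeout) for node in nodes}
```

Running check_nodes(["ws22DR1-02:8823"]) on the AIX box and check_nodes(["aix72rel3-1:8813"]) on the Windows box would show whether the connection works in both directions. A successful connect only shows that the port is reachable at that moment; it would not rule out a connection being dropped during the idle period.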
I am using QuickBuild 14.0.23.