
"Disconnect Tolerance" variable is useless #4277

thang.dv2 ·

Hello all,
"Specify disconnect tolerance in seconds when the step is marked as failed upon network disconnection. Use 0 to mark step as failed as soon as network disconnection is detected."

  • When we set it to 0, the build is cancelled about 2 minutes after the agent disconnects.
  • But when we set this variable to a non-zero value (for example, 10), the build runs forever if the agent is disconnected from the server.
    The build hits this problem because the "setLastDisconnectDate" method is never called.
    Please review this. Thank you so much!
robinshen ADMIN ·

Which QB version are you using?

thang.dv2 ·

The latest QB 8 version.

robinshen ADMIN ·
thang.dv2 ·

Thank you so much for your quick support.

thang.dv2 ·
Disconnect Tolerance.PNG

Dear Mr. @robinshen,
After checking, I would like to share some results and some problems for you to consider:

  • 1: The Disconnect Tolerance value has to be set above 2:15 (~140s). If the value is < 140s, it does not seem to work.
  • 2: The time to wait for a reconnect equals the Disconnect Tolerance value, but the wait only succeeds while the step has not finished yet.
  • 3: If the agent never reconnects, the build finishes as failed after waiting (Disconnect Tolerance + 2:20 (~140s)).
  • 4: It only applies to the master step, not to any child step. Setting this variable on a child step does not work: the step always fails on reconnect or after ~140s.

We are also facing a big problem: the build fails with the error log "Unable to find job on node" when the job on the agent finishes while master and agent are disconnected, and they reconnect afterwards.
I think this case should be improved: if the job has finished on the agent, the agent should wait for the period defined by the Disconnect Tolerance value before finishing the job.
Thank you so much!

robinshen ADMIN ·

The disconnect tolerance does not guarantee that a build can tolerate all network disconnections, as that would be very difficult. Instead it solves the most common issue: when a step runs for a long time, and the network disconnects and reconnects later while the step is running, QB should ignore network heartbeat failures and should not mark the step as failed.

If network is disconnected while child step communicates with parent step (this will happen when step is finishing, child step is starting, etc), the build will still fail.
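The rule described in the setting's help text and in the reply above can be sketched roughly as follows. This is only an illustration, not QuickBuild's code: the class name, the method `shouldFailStep`, and its parameters are hypothetical, and the real heartbeat handling is certainly more involved.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of a disconnect-tolerance check; not QuickBuild's implementation.
class DisconnectToleranceSketch {

    /**
     * Decides whether a running step should be marked as failed.
     *
     * @param disconnectedSince when heartbeats started failing, or null if the node is connected
     * @param now               the current time
     * @param toleranceSeconds  the step's disconnect tolerance; 0 means "fail immediately"
     */
    static boolean shouldFailStep(Instant disconnectedSince, Instant now, int toleranceSeconds) {
        if (disconnectedSince == null) {
            return false; // heartbeats are healthy, nothing to tolerate
        }
        if (toleranceSeconds == 0) {
            return true; // fail as soon as a disconnection is detected
        }
        // Otherwise ignore heartbeat failures until the outage exceeds the tolerance.
        Duration outage = Duration.between(disconnectedSince, now);
        return outage.getSeconds() > toleranceSeconds;
    }

    public static void main(String[] args) {
        Instant t0 = Instant.parse("2020-01-01T00:00:00Z");
        // 60s outage with a 120s tolerance: the step keeps running.
        System.out.println(shouldFailStep(t0, t0.plusSeconds(60), 120));
        // 130s outage with a 120s tolerance: the step is failed.
        System.out.println(shouldFailStep(t0, t0.plusSeconds(130), 120));
    }
}
```

Note that a check like this only covers heartbeat failures while a step is running. It says nothing about failures in the parent/child communication mentioned above, which still fail the build.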

thang.dv2 ·

Excuse me, could you explain why it fails after ~140s even when it is set on the child step?

robinshen ADMIN ·

Do you mean if this setting is specified as 140s on a child step, the child step will fail after 140s even if network is not disconnected?

thang.dv2 ·

Sorry, I mean that if Disconnect Tolerance is set to any non-zero value on a child step, the child step fails after 140s even if the network has reconnected.

robinshen ADMIN ·

I tried below and it works:

  1. Create a test configuration with two steps. Master step runs on server, and a child step simply running sleep command for 300 seconds on an agent.
  2. Define network disconnect tolerance as 120s on the child step.
  3. Now run the build, and unplug the agent's cable when the build has run for 60s.
  4. Wait for another 60s, and plug the cable back in.
  5. The build still succeeds when the child step reaches 300s.

Can you please try this set up at your side to see if it works?

thang.dv2 ·

Oh, I checked again and realized that the child step fails after 140s only in some cases.
I set Disconnect Tolerance = 300.
1 - With Thread.sleep(milliseconds):
groovy:
Thread.sleep(300000)
It only succeeds if the disconnect time is < ~25s.
2 - With sleep(milliseconds):
groovy:
sleep(300000)
It only succeeds if the disconnect time is < ~150s.
3 - With a sync step (Step type: checkout):
It only succeeds if the disconnect time is < ~14s.
=> So the Disconnect Tolerance variable does not seem to work on a child step. Some child steps always fail even if the network reconnects. Please check again for us. Thank you.

robinshen ADMIN ·

What is your server and agent OS? And how do you disconnect network for testing? I am unplugging the cable. Also please test with a brand new QB installation, and send me [robin AT pmease DOT com] the database backup.

thang.dv2 ·

Both are Ubuntu. I unplugged the cable too (or disabled the Ethernet port). I am using 8.0.42. We are unable to share data; please understand.
I will test Disconnect Tolerance on a child step again. Please test again with a sync step (Step type: checkout) and a script step that runs a bash shell, instead of a sleep step. Could you also explain why the time allowed for a reconnect depends on the type of step?
Thank you.

robinshen ADMIN ·

I do not want you to share your data. Just that you set up a test QB instance with minimum settings to demonstrate the issue, and send me the backup.

Steps differ because "sleep" in a groovy script is not interruptible while "Thread.sleep" is interruptible. To isolate the issue, please use Thread.sleep for all your testing.
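The difference can be illustrated with a small Java sketch. It assumes that Groovy's `sleep(long)` behaves as its documentation describes (it keeps sleeping for the full duration even if the thread is interrupted), and it emulates that with a retry loop. All class and method names here are made up for the illustration.

```java
import java.util.function.LongUnaryOperator;

// Hypothetical demo of interruptible vs. non-interruptible sleeping.
class SleepInterruptSketch {

    /** Plain Thread.sleep: an interrupt ends the sleep early. Returns elapsed milliseconds. */
    static long interruptibleSleep(long millis) {
        long start = System.nanoTime();
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            // Interrupted: stop sleeping right away.
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    /** Emulates a sleep that swallows interrupts and sleeps on until the full time has passed. */
    static long uninterruptibleSleep(long millis) {
        long start = System.nanoTime();
        long remaining = millis;
        while (remaining > 0) {
            try {
                Thread.sleep(remaining);
            } catch (InterruptedException e) {
                // Ignore the interrupt and keep sleeping for the remaining time.
            }
            remaining = millis - (System.nanoTime() - start) / 1_000_000;
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    /** Runs a sleeper on a worker thread, interrupts it once, and returns how long it slept. */
    static long runAndInterrupt(LongUnaryOperator sleeper, long sleepMs, long interruptAfterMs) {
        long[] elapsed = new long[1];
        Thread worker = new Thread(() -> elapsed[0] = sleeper.applyAsLong(sleepMs));
        worker.start();
        try {
            Thread.sleep(interruptAfterMs);
            worker.interrupt();
            worker.join(); // join makes the worker's write to elapsed[0] visible here
        } catch (InterruptedException e) {
            throw new IllegalStateException("demo thread was interrupted", e);
        }
        return elapsed[0];
    }

    public static void main(String[] args) {
        System.out.println("Thread.sleep style, ms slept: "
                + runAndInterrupt(SleepInterruptSketch::interruptibleSleep, 600, 100));
        System.out.println("interrupt-swallowing style, ms slept: "
                + runAndInterrupt(SleepInterruptSketch::uninterruptibleSleep, 600, 100));
    }
}
```

If a step gets interrupted when the connection is judged lost, a script using the first style stops promptly while one using the second keeps running, which would explain why the two scripts showed different timings in the tests above.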

thang.dv2 ·

We are unable to share any data, files, or documents. Please understand.

  1. I set up a sync step that syncs ~200 GB of data in ~20 minutes.
    The step:
<?xml version="1.0" encoding="UTF-8"?>
 
<com.pmease.quickbuild.repositorysupport.CheckoutStep revision="0.15.0.1">
  <name>Sync</name>
  <inheritable>true</inheritable>
  <enabled>true</enabled>
  <executeCondition class="com.pmease.quickbuild.setting.step.executecondition.AllPreviousSiblingStepsSuccessful"/>
  <nodeMatcher class="com.pmease.quickbuild.setting.step.nodematcher.ParentNodeMatcher"/>
  <nodePreference class="com.pmease.quickbuild.setting.step.nodepreference.PreferLeastLoadedNode"/>
  <timeout>0</timeout>
  <disconnectTolerance>300</disconnectTolerance>
  <showLinksInLog>false</showLinksInLog>
  <preExecuteAction class="com.pmease.quickbuild.setting.step.executeaction.NoAction"/>
  <postExecuteAction class="com.pmease.quickbuild.setting.step.executeaction.NoAction"/>
  <repetitions/>
  <repositoryName>MAIN</repositoryName>
</com.pmease.quickbuild.repositorysupport.CheckoutStep>
  2. Define network disconnect tolerance as 300s on this child step.
  3. Run the build, and disable the agent's network (unplug the cable) when the build has run for 60s.
  • 4.1. Never reconnect => the build fails after ~3m15s (2m15s after disconnecting), caused by: com.caucho.hessian.client.HessianRuntimeException: Error connecting 'http://serverURL/service/server'
  • 4.2. Wait < ~10s, then enable the network (plug the cable in) => the build succeeds.
  • 4.3. Wait > ~10s, then enable the network (plug the cable in) => the build always fails, caused by: com.caucho.hessian.client.HessianRuntimeException: Error connecting 'http://serverURL/service/server'
robinshen ADMIN ·

If it is a checkout step, the copy operation itself will fail when the cable is unplugged. Network disconnect tolerance only makes sure that QB can tolerate server/agent heartbeat failures. It cannot recover the step from its own network operation failures. To verify, you may change the step to a simple sleep step (calling groovy Thread.sleep or calling an external sleep command via a Command/Batch execution step).

thang.dv2 ·

I got it, ok I will check more. Thank you.

thang.dv2 ·

Sorry, it does not work with a sleep step:

<?xml version="1.0" encoding="UTF-8"?>
<com.pmease.quickbuild.plugin.basis.ScriptStep revision="0.15.1">

  <name>Slepp</name>
  <inheritable>true</inheritable>
  <enabled>true</enabled>
  <executeCondition class="com.pmease.quickbuild.setting.step.executecondition.AllPreviousSiblingStepsSuccessful"/>
  <nodeMatcher class="com.pmease.quickbuild.setting.step.nodematcher.ParentNodeMatcher"/>
  <nodePreference class="com.pmease.quickbuild.setting.step.nodepreference.PreferLeastLoadedNode"/>
  <timeout>0</timeout>
  <disconnectTolerance>240</disconnectTolerance>
  <showLinksInLog>false</showLinksInLog>
  <preExecuteAction class="com.pmease.quickbuild.setting.step.executeaction.NoAction"/>
  <postExecuteAction class="com.pmease.quickbuild.setting.step.executeaction.NoAction"/>
  <repetitions/>
  <script>groovy:
Thread.sleep(300000)</script>
</com.pmease.quickbuild.plugin.basis.ScriptStep>

Sleep is 300s, disconnect tolerance is 240s.
If the network never reconnects => the build fails after ~3m15s (2m15s after disconnecting), caused by: com.caucho.hessian.client.HessianRuntimeException: Error connecting 'http://serverURL/service/server'
If I wait < ~10s after disconnecting, then enable the network (plug the cable in) => the build succeeds.
If I wait > ~10s after disconnecting, then enable the network (plug the cable in) => the build always fails, caused by: com.pmease.quickbuild.QuickbuildException: Unable to find job '7f147565-99fa-4f9e-96d4-fdc08f14b20f' on node 'AgentUbuntu_18.04:8814'.

thang.dv2 ·

Please note that QB is able to recover the step from its own network operation failures when we set Disconnect Tolerance on the master step instead of the child step.

robinshen ADMIN ·

Please start with a new QB instance and follow steps below to see if it works:

  1. Create a test configuration with two steps. Master step runs on server, and a child step simply running sleep command for 300 seconds on an agent.
  2. Define network disconnect tolerance as 240s on the child step.
  3. Now run the build, and unplug the agent's cable when the build has run for 60s.
  4. Wait for another 60s, and plug the cable back in.
  5. The build still succeeds when the child step reaches 300s.
thang.dv2 ·

Ah, now I understand why our test results differ.

  • In your scenario the master step runs on the server. In mine it does not: my master step runs on the agent that runs the other steps.
  • I tested your scenario and it worked well when the network reconnected.
    But I don't know why it does not work when the master step runs on an agent. Please let us know.

Currently, our system serves about 10k users with 2k agents connected, so we must reduce the load on the server, and the master step has to run on an agent. Please take that into consideration.
Thank you so much!

robinshen ADMIN ·

If master step runs on agent, and if agent can be disconnected from server, you will need to set disconnect tolerance on master step also. This setting should be specified on any step whose running node can be disconnected from other nodes.
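As a rough illustration of this rule, the sketch below walks a simplified step tree and lists every step that runs on an agent but still has a disconnect tolerance of 0. The `Step` class and its fields are hypothetical stand-ins invented for this example; they are not QuickBuild's API or data model.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical audit of a step tree; the Step type is a simplified stand-in.
class ToleranceAudit {

    /** Minimal model of a step: where it runs, its tolerance, and its child steps. */
    static final class Step {
        final String name;
        final boolean runsOnAgent;
        final int disconnectToleranceSeconds;
        final List<Step> children = new ArrayList<>();

        Step(String name, boolean runsOnAgent, int disconnectToleranceSeconds) {
            this.name = name;
            this.runsOnAgent = runsOnAgent;
            this.disconnectToleranceSeconds = disconnectToleranceSeconds;
        }

        Step add(Step child) {
            children.add(child);
            return this;
        }
    }

    /** Returns the names of steps that run on an agent without any disconnect tolerance. */
    static List<String> stepsMissingTolerance(Step root) {
        List<String> missing = new ArrayList<>();
        collect(root, missing);
        return missing;
    }

    private static void collect(Step step, List<String> missing) {
        if (step.runsOnAgent && step.disconnectToleranceSeconds == 0) {
            missing.add(step.name);
        }
        for (Step child : step.children) {
            collect(child, missing);
        }
    }

    public static void main(String[] args) {
        // Mirrors the failing setup in this thread: master on an agent with no tolerance,
        // child step with a tolerance of 240s.
        Step master = new Step("master", true, 0).add(new Step("Sleep", true, 240));
        System.out.println(stepsMissingTolerance(master));
    }
}
```

For the setup discussed in this thread, the audit flags only the master step, which matches the advice to set the tolerance there as well.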