
builds not stopped after network failure #3659

MarcinN ·
Dear Support,

We have two sites where our build servers are located, and from time to time we lose connections between the build nodes and the master server. The reasons vary (e.g. once the build nodes couldn't resolve the master through DNS), but the bad part is that in such an event the build keeps "running" for more than a day, even though the timeout is defined as 360 minutes.

As a development guy, I'm not keen on chasing elusive network issues, but I also don't want builds hanging around for too long. As a workaround, I'd simply like to write a "watchdog" script that stops such builds automatically. Unfortunately, I can't find the appropriate method. Since build deletion is done through system.BuildManager.delete, I tried to guess the right method and parameters for stopping a build, without success.
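To illustrate, this is roughly the kind of watchdog I have in mind (a rough Python sketch; the server address, credentials, REST paths and especially the stop call are placeholders/guesses on my part, since I couldn't find the documented way to do this):

    import time
    import requests

    SERVER = "http://qb-master:8810"    # placeholder master address
    AUTH = ("admin", "secret")          # placeholder credentials
    MAX_MINUTES = 360                   # same limit as our build timeout

    def running_builds():
        # Placeholder endpoint - whatever returns the currently running builds.
        resp = requests.get(SERVER + "/rest/builds?status=RUNNING", auth=AUTH)
        resp.raise_for_status()
        return resp.json()

    def stop_build(build_id):
        # Placeholder call - this is exactly the method I am looking for.
        requests.post(SERVER + "/rest/builds/%s/stop" % build_id, auth=AUTH)

    while True:
        for build in running_builds():
            # Assuming a startTime field in epoch milliseconds (also a placeholder).
            elapsed_minutes = (time.time() - build["startTime"] / 1000.0) / 60.0
            if elapsed_minutes > MAX_MINUTES:
                stop_build(build["id"])
        time.sleep(300)                 # re-check every 5 minutes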

Could You please tell me how I can stop a build automatically?

Thank You and best regards,
Marcin N.
robinshen ADMIN ·
Are you able to stop the build manually via the stop build button in such an event?
MarcinN ·
Hi Robin,

Yes, it seems to work - after a few minutes the build is stopped.

Thank You for answer and best regards,
Marcin N.

P.S. I had a few more cases this night, and what I saw were Java heap space exceptions in the logs. Restarting the agent service resolves the problem immediately. Do you have any suggestions on what to check on the next occurrence, so the root cause can be found?
robinshen ADMIN ·
Which version are you using? Also, please show me the full stack trace of the error from the log if it is still available.
MarcinN ·
QB version on the master node: PMEase QuickBuild, 6.1.29, 2016-10-05. I somehow couldn't figure out how to check the QB version on the build nodes - a hint would be nice.

As for the call stacks, they all seem to look the same:

2016-11-20 19:12:30,560 [DefaultQuartzScheduler_Worker-32] ERROR org.quartz.core.JobRunShell - Job default.41b87061-d1cb-45ff-9000-756d0541c6c5 threw an unhandled Exception:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2367)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:612)
at java.lang.StringBuilder.append(StringBuilder.java:219)
at sun.reflect.MethodAccessorGenerator.generateName(MethodAccessorGenerator.java:772)
at sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:288)
at sun.reflect.MethodAccessorGenerator.generateSerializationConstructor(MethodAccessorGenerator.java:113)
at sun.reflect.ReflectionFactory.newConstructorForSerialization(ReflectionFactory.java:331)
at java.io.ObjectStreamClass.getSerializableConstructor(ObjectStreamClass.java:1376)
at java.io.ObjectStreamClass.access$1500(ObjectStreamClass.java:72)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:493)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1133)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at com.pmease.quickbuild.log.BuildLogger.log(BuildLogger.java:246)
at com.pmease.quickbuild.log.BuildLogger.info(BuildLogger.java:150)
at com.pmease.quickbuild.plugin.basis.CommandBuildStep$1.consume(CommandBuildStep.java:246)
at com.pmease.quickbuild.execution.LineConsumer.processBuffer(LineConsumer.java:86)
at com.pmease.quickbuild.execution.LineConsumer.write(LineConsumer.java:73)
at java.io.OutputStream.write(OutputStream.java:116)
at com.pmease.quickbuild.execution.ProcessStreamPumper$1.run(ProcessStreamPumper.java:91)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

but still I think this is already just a symptom (e.g. lack of memory to send a new status report to the master node) and the actual leak was caused by something else. I'll keep looking; the problem happened on at least 3 build nodes, so maybe we will be able to catch it red-handed ;)

---

But coming back to my earlier request - may I ask You for the routine that will stop the build automatically?

Thank You and best regards,
Marcin N.
robinshen ADMIN ·
Please increase the build agent memory by editing <build agent install dir>/conf/wrapper.conf and raising the property "wrapper.java.maxmemory"; the default may be too small for your site. Build agent program files are always synced with the server, so its version is the same as the QB server's.
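For example (the value is in MB; 768 below is only an illustration - pick whatever the agent host can spare, and restart the agent service afterwards so the wrapper picks up the change):

    # <build agent install dir>/conf/wrapper.conf
    wrapper.java.maxmemory=768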

As to stopping builds automatically upon network issues: the timeout uses the same mechanism to stop builds as clicking the stop button. Would you mind sending me your database backup and letting me know the build in question, so that I can have a closer look? You may send it to [robin AT pmease DOT com]
MarcinN ·
As recommended, I've increased the amount of memory (from 512 to 768 MB), but the issue is still there:

jvm 1 | 2016-11-22 01:06:05,401 DEBUG - Sending measurement data report [Node: [build08:8811], beginTime: [Tue Nov 22 01:06:05 CET 2016], endTime: [Tue Nov 22 01:06:05 CET 2016]
jvm 1 | source=[build08:8811], metricName=[jvm.memoryUsage], timestamp=[Tue Nov 22 01:06:05 CET 2016], value=[0.25337515967828206]
jvm 1 | source=[build08:8811], metricName=[cpu.totalUsage], timestamp=[Tue Nov 22 01:06:05 CET 2016], value=[0.10281950658634739]
jvm 1 | source=[build08:8811], metricName=[disk.usage.C:], timestamp=[Tue Nov 22 01:06:05 CET 2016], value=[0.9733087270529143]
jvm 1 | source=[build08:8811], metricName=[disk.frees.C:], timestamp=[Tue Nov 22 01:06:05 CET 2016], value=[1.1460931584E10]
jvm 1 | source=[build08:8811], metricName=[memory.usage], timestamp=[Tue Nov 22 01:06:05 CET 2016], value=[0.26679440239041924]
jvm 1 | source=[build08:8811], metricName=[steps.finishedOnNode], timestamp=[Tue Nov 22 01:06:05 CET 2016], value=[10.0]
jvm 1 | source=[build08:8811], metricName=[steps.successfulOnNode], timestamp=[Tue Nov 22 01:06:05 CET 2016], value=[10.0]] to server ...
...
jvm 1 | 2016-11-22 01:11:05,390 DEBUG - Sending measurement data report [Node: [build08:8811], beginTime: [Tue Nov 22 01:11:05 CET 2016], endTime: [Tue Nov 22 01:11:05 CET 2016]
jvm 1 | source=[build08:8811], metricName=[jvm.memoryUsage], timestamp=[Tue Nov 22 01:11:05 CET 2016], value=[0.25352614278439095]
jvm 1 | source=[build08:8811], metricName=[cpu.totalUsage], timestamp=[Tue Nov 22 01:11:05 CET 2016], value=[0.0906573916171494]
jvm 1 | source=[build08:8811], metricName=[disk.usage.C:], timestamp=[Tue Nov 22 01:11:05 CET 2016], value=[0.8923979820479262]
jvm 1 | source=[build08:8811], metricName=[disk.frees.C:], timestamp=[Tue Nov 22 01:11:05 CET 2016], value=[4.6203092992E10]
jvm 1 | source=[build08:8811], metricName=[memory.usage], timestamp=[Tue Nov 22 01:11:05 CET 2016], value=[0.32918764466833694]
jvm 1 | source=[build08:8811], metricName=[steps.finishedOnNode], timestamp=[Tue Nov 22 01:11:05 CET 2016], value=[20.0]
jvm 1 | source=[build08:8811], metricName=[steps.successfulOnNode], timestamp=[Tue Nov 22 01:11:05 CET 2016], value=[20.0]] to server ...
jvm 1 | 2016-11-22 01:11:05,593 DEBUG - Notifying task of job finished (job id cbdb776b-0f41-4e44-94ac-435f446e85c5, build id: N/A, job node: NRHT011PIM:8820)...
jvm 1 | 2016-11-22 01:11:05,593 DEBUG - Finished notifying task of job finished (job id cbdb776b-0f41-4e44-94ac-435f446e85c5, build id: N/A, job node: NRHT011PIM:8820)...
jvm 1 | 2016-11-22 01:11:05,609 DEBUG - Finished measurement polling task in 313 millis
jvm 1 | 2016-11-22 01:11:06,871 DEBUG - Removing obsolete live log entries of build: 129659
jvm 1 | 2016-11-22 01:16:05,307 DEBUG - Begin measurement polling task ...
jvm 1 | 2016-11-22 01:16:05,338 DEBUG - Begin collecting measurement data on node [build08:8811]
jvm 1 | 2016-11-22 01:16:05,338 DEBUG - Supplier [jvm] costs 0 millis
jvm 1 | 2016-11-22 01:16:05,338 DEBUG - Supplier [cpu] costs 0 millis
jvm 1 | 2016-11-22 01:16:05,338 DEBUG - Trying to find mount point of path: [C:\buildagent\storage] in disks: [[Lorg.hyperic.sigar.FileSystem;@5922c6]
jvm 1 | 2016-11-22 01:16:05,400 DEBUG - Supplier [fs] costs 60 millis
jvm 1 | 2016-11-22 01:16:05,416 DEBUG - Supplier [mem] costs 15 millis
jvm 1 | 2016-11-22 01:16:05,416 DEBUG - Supplier [swap] costs 0 millis
jvm 1 | 2016-11-22 01:16:05,416 DEBUG - Supplier [builds] costs 0 millis
jvm 1 | 2016-11-22 01:16:05,416 DEBUG - Finished collecting measurement data on node [build08:8811], 7 measurements collected
jvm 1 | 2016-11-22 01:16:05,525 DEBUG - Sending measurement data report [Node: [build08:8811], beginTime: [Tue Nov 22 01:16:05 CET 2016], endTime: [Tue Nov 22 01:16:05 CET 2016]
jvm 1 | source=[build08:8811], metricName=[jvm.memoryUsage], timestamp=[Tue Nov 22 01:16:05 CET 2016], value=[0.7930833698447742]
jvm 1 | source=[build08:8811], metricName=[cpu.totalUsage], timestamp=[Tue Nov 22 01:16:05 CET 2016], value=[0.5315953547953338]
jvm 1 | source=[build08:8811], metricName=[disk.usage.C:], timestamp=[Tue Nov 22 01:16:05 CET 2016], value=[0.8974218379635359]
jvm 1 | source=[build08:8811], metricName=[disk.frees.C:], timestamp=[Tue Nov 22 01:16:05 CET 2016], value=[4.404590592E10]
jvm 1 | source=[build08:8811], metricName=[memory.usage], timestamp=[Tue Nov 22 01:16:05 CET 2016], value=[0.504693286435439]
jvm 1 | source=[build08:8811], metricName=[steps.finishedOnNode], timestamp=[Tue Nov 22 01:16:05 CET 2016], value=[0.0]
jvm 1 | source=[build08:8811], metricName=[steps.successfulOnNode], timestamp=[Tue Nov 22 01:16:05 CET 2016], value=[0.0]] to server ...
jvm 1 | 2016-11-22 01:16:05,525 DEBUG - Sending measurement data report [Node: [build08:8811], beginTime: [Tue Nov 22 01:16:05 CET 2016], endTime: [Tue Nov 22 01:16:05 CET 2016]
jvm 1 | source=[build08:8811], metricName=[jvm.memoryUsage], timestamp=[Tue Nov 22 01:16:05 CET 2016], value=[0.7930833698447742]
jvm 1 | source=[build08:8811], metricName=[cpu.totalUsage], timestamp=[Tue Nov 22 01:16:05 CET 2016], value=[0.5315953547953338]
jvm 1 | source=[build08:8811], metricName=[disk.usage.C:], timestamp=[Tue Nov 22 01:16:05 CET 2016], value=[0.8974218379635359]
jvm 1 | source=[build08:8811], metricName=[disk.frees.C:], timestamp=[Tue Nov 22 01:16:05 CET 2016], value=[4.404590592E10]
jvm 1 | source=[build08:8811], metricName=[memory.usage], timestamp=[Tue Nov 22 01:16:05 CET 2016], value=[0.504693286435439]
jvm 1 | source=[build08:8811], metricName=[steps.finishedOnNode], timestamp=[Tue Nov 22 01:16:05 CET 2016], value=[0.0]
jvm 1 | source=[build08:8811], metricName=[steps.successfulOnNode], timestamp=[Tue Nov 22 01:16:05 CET 2016], value=[0.0]] to server ...
jvm 1 | 2016-11-22 01:16:05,837 DEBUG - Notifying task of job finished (job id c06896d8-ac47-46bf-a050-3f4bfadce22a, build id: N/A, job node: NRHT011PIM:8820)...
jvm 1 | 2016-11-22 01:16:05,837 DEBUG - Finished notifying task of job finished (job id c06896d8-ac47-46bf-a050-3f4bfadce22a, build id: N/A, job node: NRHT011PIM:8820)...
jvm 1 | 2016-11-22 01:16:05,915 DEBUG - Finished measurement polling task in 585 millis
jvm 1 | 2016-11-22 01:19:15,753 ERROR - Error connecting server.
jvm 1 | java.lang.OutOfMemoryError: Java heap space
jvm 1 | at java.util.concurrent.locks.ReentrantLock.<init>(ReentrantLock.java:234)
jvm 1 | at java.util.concurrent.ConcurrentHashMap$Segment.<init>(ConcurrentHashMap.java:289)
jvm 1 | at java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:613)
jvm 1 | at java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:644)
jvm 1 | at com.caucho.hessian.io.SerializerFactory.getSerializer(SerializerFactory.java:237)
jvm 1 | at com.pmease.quickbuild.RemotingSerializerFactory.getSerializer(RemotingSerializerFactory.java:63)
jvm 1 | at com.caucho.hessian.io.HessianOutput.writeObject(HessianOutput.java:322)
jvm 1 | at com.caucho.hessian.io.HessianOutput.call(HessianOutput.java:132)
jvm 1 | at com.caucho.hessian.client.HessianProxy.sendRequest(HessianProxy.java:297)
jvm 1 | at com.caucho.hessian.client.HessianProxy.invoke(HessianProxy.java:171)
jvm 1 | at $Proxy21.connect(Unknown Source)
jvm 1 | at com.pmease.quickbuild.grid.AgentConnectivityTask.run(AgentConnectivityTask.java:51)
jvm 1 | at java.lang.Thread.run(Thread.java:619)
jvm 1 | Exception in thread "pool-1-thread-4"
jvm 1 | java.lang.OutOfMemoryError: Java heap space
jvm 1 | at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:39)
jvm 1 |
jvm 1 | at java.nio.ByteBuffer.allocate(ByteBuffer.java:312)
jvm 1 | at sun.nio.cs.StreamEncoder.<init>(StreamEncoder.java:176)
jvm 1 | at sun.nio.cs.StreamEncoder.<init>(StreamEncoder.java:156)
jvm 1 | at sun.nio.cs.StreamEncoder.forOutputStreamWriter(StreamEncoder.java:40)
jvm 1 | at java.io.OutputStreamWriter.<init>(OutputStreamWriter.java:83)
jvm 1 | at java.io.PrintStream.<init>(PrintStream.java:125)
jvm 1 | at sun.net.www.http.HttpClient.openServer(HttpClient.java:396)
jvm 1 | at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
jvm 1 | at sun.net.www.http.HttpClient.<init>(HttpClient.java:233)
jvm 1 | at sun.net.www.http.HttpClient.New(HttpClient.java:306)
jvm 1 | at sun.net.www.http.HttpClient.New(HttpClient.java:323)
jvm 1 | at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:790)
jvm 1 |
jvm 1 | at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:731)
jvm 1 | at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:656)
jvm 1 | at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:834)
jvm 1 | at com.caucho.hessian.client.HessianURLConnection.getOutputStream(HessianURLConnection.java:101)
jvm 1 | at com.caucho.hessian.client.HessianProxy.sendRequest(HessianProxy.java:283)
jvm 1 | at com.caucho.hessian.client.HessianProxy.invoke(HessianProxy.java:171)
jvm 1 | at $Proxy53.isGridJobActive(Unknown Source)
jvm 1 | at com.pmease.quickbuild.grid.GridJob$1.run(GridJob.java:87)
jvm 1 | at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
jvm 1 | at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
jvm 1 | at java.lang.Thread.run(Thread.java:619)

Please note the sudden jump in memory usage from 0.25 (~192 MB) at 01:11:05 up to 0.79 (~606 MB) at 01:16:05; the Java heap space exception was hit at 01:19:15. During the build process the maximum observed value was 0.48 (~250 MB), but that was before I extended the maximum heap size.

Is there any easy way to figure out what was happening?

Once again, thank You for Your help!

Best regards,
Marcin
MarcinN ·
I've managed to catch the VM red-handed, and it seems we have 2.2 million com.pmease.quickbuild.log.BuildLogEntryEx instances (>86 MB), 2.3 million Strings/Longs/Dates (>44 MB/>34 MB/>42 MB respectively) and 2.3 million char[]s (>492 MB) - just over 700 MB in total.
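(In case it is useful to anyone else: a per-class breakdown like this can be obtained with the standard JDK tools against the agent JVM - the PID and file name below are only examples, and there are of course other ways to get the same data.)

    jps -l                                          (find the agent JVM's process id)
    jmap -histo:live 1234                           (per-class instance counts and sizes)
    jmap -dump:live,format=b,file=agent.hprof 1234  (full heap dump, e.g. for Eclipse MAT)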

There were issues with MSBuild, so we had set it to verbose logging - it's possible that this produced such a large number of log entries. However, I'm worried that a build agent can "deadlock" itself like this and cannot recover on its own...

Is there any configuration that would somehow limit the amount of buffered log entries?

Thank You for answer and best regards,
Marcin N.

P.S. This time the build was closed properly, so I guess there is no point in sending You the database backup.
MarcinN ·
Hi Robin,

I was finally able to solve the mystery.

Those strange memory leaks on the build agents were caused by excessive logging during automatic test execution. It turns out you'd better not push 10k log entries every second when the master node is more than 900 km away.

I think there is no point in investigating this further.

Robin - once again thank You a lot for Your help and support.

Best regards,
Marcin
robinshen ADMIN ·
Thanks for the findings. The QB agent does not send build logs to the QB server frequently; it does so only when someone is accessing the live log. Currently the QB agent stores log entries in memory and trims them down to 10000 lines every 5 minutes, so if a step generates too many log entries, it can consume a lot of memory. We will reduce the log trimming interval in future versions. To work around the issue for now, you may set the build log level to "WARN" in the general setting of the configuration.
MarcinN ·
Hi Robin,

Once again thank You very much.

We removed the redundant diagnostic logs from our automatic test routines and now it works fine - the first night in a week without losing any nodes!

If I may suggest an improvement: it would have helped a lot in this case if the step logs had not been lost. Unfortunately, in all of the cases there were no logs for the failed step (just an empty file).

Thanks for consideration and best regards,
Marcin
robinshen ADMIN ·
It is difficult to transfer build logs when an out-of-memory error occurs, as it can prevent any further logic from running - in particular, transferring the logs may itself require memory. In such a case, one can only check the agent log to find the possible cause.