Job run failures, agent crashes

Over the last couple of weeks, I have hit the same issue several times. I have jobs, each with 1 or 2 tests, running in Docker containers (Chrome, Firefox) on AWS instances.

These ran pretty much flawlessly until recently. Now the jobs fail, and the failure causes the instance to crash and go into an offline/disconnected state.

I have done a bunch of troubleshooting on my end, for example running through every single test and making sure there are no “auto-heal” occurrences that seem to cause timeouts here and there.

Jobs that have had zero issues over the last year are now suddenly crashing as well. I am posting a couple of screenshots of the most recent failure, from this morning. This particular job has worked flawlessly all year.

I would love to know how to fix this for consistent test runs.

For reference, the normal duration of this job is under 3 minutes.

If I can provide more info please let me know.

I can tack on to this thread. @marcel.bauer and I work together, and I can confirm this recent string of agent crashes. Here are some additional details about our setup:

  • We’re using the containerized TP agent and the Chrome and Firefox Selenium drivers. The agent version we’re on is 3.3.1.
  • We’ve noticed that agents frequently “hang” when reporting test run progress to the TestProject manager. During this hang, the currently running job disappears from the monitoring page, and the agent does not resume the test run until the containers are spun down and back up again. Once the containers are back up, the test resumes and then fails.
@eldar, do you have any leads on what could be causing these disconnects when the remote TestProject agent tries to report status to the TestProject manager? Or can you point us toward someone who might be able to help us out with this?
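
In case it helps whoever picks this up, here is a minimal sketch of how the container state could be captured right after a hang, before the containers are spun down and the evidence is lost. It assumes the Docker CLI is available on the AWS instance and only reads the `State` block that `docker inspect` prints. The container name in the usage line is hypothetical, and memory exhaustion is just one possible cause we would want to rule out, not something we have confirmed.

```python
import json
import subprocess
import sys


def summarize_state(inspect_json: str) -> dict:
    """Pull the crash-relevant fields out of `docker inspect <container>` output."""
    # `docker inspect` prints a JSON array with one object per container.
    data = json.loads(inspect_json)
    state = data[0].get("State", {})
    return {
        "status": state.get("Status"),        # e.g. "running", "exited"
        "oom_killed": state.get("OOMKilled"),  # True if the kernel OOM killer stopped it
        "exit_code": state.get("ExitCode"),
        "finished_at": state.get("FinishedAt"),
    }


def inspect_container(name: str) -> dict:
    """Run the Docker CLI for one container and summarize its state."""
    result = subprocess.run(
        ["docker", "inspect", name],
        check=True,
        capture_output=True,
        text=True,
    )
    return summarize_state(result.stdout)


if __name__ == "__main__":
    # Container names are passed on the command line.
    for container_name in sys.argv[1:]:
        print(container_name, inspect_container(container_name))
```

Usage would look like `python container_state.py tp-agent`, where `container_state.py` and `tp-agent` are placeholder names for the script and the agent container. If `oom_killed` comes back as true, that would point at memory on the instance rather than at the reporting call itself.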

