Official high-scale Selenium Grid support?

TL;DR – The TestProject Agent absolutely has the capability to run a job at high concurrency via a Selenium Grid with, in theory, whatever browsers that Grid supports, but this isn’t an officially supported use case yet.

I have been evaluating TestProject for use at my company, and one of the goals we have is to be able to run a full UI test suite at high concurrency from within our network for a reasonable price. TestProject will absolutely do this for us, but only with some hacky workarounds.

Normally, a TestProject Agent isn’t supposed to run more than 10 worker threads, which makes it very difficult to run a large test suite (job) at high concurrency. We would have to split the job up into multiple pieces across multiple agents, which is a huge pain and limitation.

However, when you boot up the testproject-agent, there’s a --max-workers option with no limit (and a corresponding environment variable, TP_MAX_WORKERS). When an execution is sent from the TestProject cloud to an agent, the agent will run it up to its configured max concurrency, regardless of the concurrency level the cloud reports for it. There are also CHROME, EDGE, and FIREFOX environment variables that let you set an external Selenium server for each browser type, and they can all point to a Selenium Grid instance.
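As a sketch, here’s what this could look like with the dockerized agent. The flag and variable names come from the paragraph above; the image name, API key variable, and Grid URL are my assumptions for illustration, not verified settings:

```shell
# Hypothetical example: point all three browser types at one internal
# Selenium Grid and raise the worker cap via TP_MAX_WORKERS.
# The image name and TP_API_KEY variable are assumptions; check the
# agent's own docs before relying on them.
docker run -d \
  -e TP_API_KEY="<your api key>" \
  -e TP_MAX_WORKERS="20" \
  -e CHROME="http://selenium-grid.internal:4444/wd/hub" \
  -e EDGE="http://selenium-grid.internal:4444/wd/hub" \
  -e FIREFOX="http://selenium-grid.internal:4444/wd/hub" \
  testproject/agent:latest
```

All three browser variables can point at the same Grid hub, since the Grid itself routes sessions to nodes with the requested browser.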

(It should be noted that each “worker” an agent runs is a separate Java thread, so even with an external Selenium Grid it’s still possible to run the agent host out of resources with too many workers. With remote Selenium, though, the max workers number should be able to go way higher than 10, or the recommended (cpu threads)/2, since the browsers themselves run elsewhere.)
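To illustrate why remote workers are so much lighter than the (cpu threads)/2 guideline suggests: with an external Grid, each worker thread mostly just blocks on network I/O to the Grid instead of driving a local browser. A toy Python sketch (not TestProject code) of many I/O-bound workers on one host:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_test(i):
    # Stand-in for a test whose WebDriver commands all execute on a
    # remote Selenium Grid: the local thread spends its time waiting.
    time.sleep(0.05)
    return i

# Far more workers than (cpu threads)/2 is fine when each one only waits
# on I/O; the threads consume almost no CPU on the agent host.
with ThreadPoolExecutor(max_workers=50) as pool:
    results = list(pool.map(run_test, range(50)))

assert results == list(range(50))
```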

There are some quirks with this workaround, though. The TestProject cloud enforces that each agent has max workers set to 10 or fewer, and when the agent posts its capabilities to the TestProject cloud, a number higher than 10 causes the entire request to fail. When I tested registering a brand-new agent with 20 max workers, the TestProject cloud wasn’t able to store much of anything about the agent, including which browsers it had installed. To work around this, you can first run the agent with max workers <=10, shut it down, and then bring it back up at the higher max workers count. The first run generates and posts valid capabilities, and the second run can then go higher than 10 max workers with otherwise identical settings.
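The two-phase startup could be scripted roughly like this. The binary name and --max-workers flag come from the post; how long registration takes, and whether a plain kill shuts the agent down cleanly, are guesses you’d need to verify:

```shell
# Phase 1: register with an allowed worker count so capabilities post cleanly.
testproject-agent --max-workers 10 &
AGENT_PID=$!
sleep 60                        # rough guess at how long registration takes
kill "$AGENT_PID"
wait "$AGENT_PID" 2>/dev/null

# Phase 2: restart with otherwise identical settings but higher concurrency.
testproject-agent --max-workers 20
```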

I also noticed that when the TestProject Agent reports its current status, if it’s using more workers than the cloud thinks its maximum is, the cloud will reject that update as well. So if you did the first run with 1 max worker and the second with 20, the TestProject cloud will only ever show 0 or 1 current workers.

And on top of that, it seems like the TestProject cloud execution queue waits until that current workers number goes back down before sending another execution to the agent. I haven’t fully verified the behavior here, but it can act unpredictably when the TestProject cloud sometimes rejects worker count updates for being too high. Either way it’s not really a problem for us; the unlimited concurrency is much more important than the ability to run multiple jobs at the same time on one agent.

Another limitation of the TestProject Agent with a remote Selenium Grid is that it currently only supports the default browser versions, and only for Chrome, Firefox, and Microsoft Edge. However, the integrations with Sauce Labs and BrowserStack also work via a Selenium connection, and there more browsers are supported, with more versions, and even on different operating systems. You can even trick the TestProject Agent into running one of these browser versions on your own Selenium Grid by using a Sauce Labs or BrowserStack browser type, but setting "cloud:URL": "http://<selenium grid>/wd/hub" in the capabilities document. You need that browser type, at the right version and on the right operating system, running in your Selenium Grid for it to work, but it can work.
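The capability override described above might look something like this as a Python dict. The "cloud:URL" key is the one named in the post (the Grid address stays a placeholder); the browser, version, and platform values are illustrative Sauce Labs-style capabilities, not values I’ve confirmed:

```python
import json

# Hypothetical capabilities document: request a Sauce Labs/BrowserStack-style
# browser type, but redirect the session to your own Grid via "cloud:URL".
capabilities = {
    "browserName": "safari",      # illustrative: a browser the agent itself lacks
    "version": "14",              # illustrative version
    "platform": "macOS 11",       # illustrative OS
    # Placeholder address, exactly as in the post:
    "cloud:URL": "http://<selenium grid>/wd/hub",
}

print(json.dumps(capabilities, indent=2))
```

For this to succeed, your Grid must actually have a node offering that browser, version, and OS combination.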

In our particular case, if we decide to go all-in on TestProject, we will be using an autoscaling Selenium Grid on an autoscaling Kubernetes cluster that supports EC2 spot instances. Selenium doesn’t natively support autoscaling on Kubernetes, but there are a couple of ways to do it safely. One uses the HorizontalPodAutoscaler to scale a Selenium deployment (in our case based on Prometheus metrics via the prometheus-exporter for Kubernetes), where the Selenium pods have a “preStop” lifecycle hook that tells Selenium to start draining jobs and waits for up to 2m (or whatever timeout we set). The other would be a small, custom Kubernetes controller that creates, manages, and deletes Selenium pods directly, so that we can be smarter than the HorizontalPodAutoscaler and terminate only the idle pods.
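For the first approach, the preStop piece could be sketched roughly like this, assuming Selenium Grid 4’s node drain endpoint; names, ports, and timings below are placeholders, and an older Grid would need a different drain mechanism:

```yaml
# Fragment of a Selenium node pod spec (illustrative names and ports).
containers:
  - name: selenium-node
    image: selenium/node-chrome
    lifecycle:
      preStop:
        exec:
          command:
            - /bin/sh
            - -c
            - |
              # Tell the node to stop accepting new sessions, then wait
              # out any in-flight session before the pod is killed.
              curl -s -X POST http://localhost:5555/se/grid/node/drain \
                -H "X-REGISTRATION-SECRET: ${SE_REGISTRATION_SECRET:-}"
              sleep 120
terminationGracePeriodSeconds: 150
```

The HorizontalPodAutoscaler side would then scale this deployment on a Prometheus-backed metric such as queued or active sessions.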

Hello @dakota.sullivan, and thank you for sharing this information.
Please note that the TestProject Agent is designed to work with a maximum of 10 workers per agent.
Exceeding this number in any form or method is considered abusing the system and will most likely cause undesirable behaviour and/or serious issues with your executions, performance, report credibility, etc… (you also mentioned encountering some issues already).
Furthermore, if you encounter issues caused by this, we most likely won’t be able to support you with them.

The best method would be to use more agents, with up to 10 workers per agent.
We will update you here if there is any update on the maximum number of workers each agent can use.

If I have a remote Selenium Grid with, say, 30 slots, and a server running the TestProject Agent that connects to it, why would I still be limited to 10 workers per agent? If I use all 30 slots simultaneously, it’s the same amount of load on your APIs regardless of whether I use one agent or three. The only difference is how many API calls a single agent makes, and how many worker threads a single agent manages. In this use case, the browsers run on completely different servers than the agent, so there’s much less overhead on the agent server.