Wikimedia's Beta Cluster (aka deployment-prep) needs to be replaced with something completely different.
The Beta Cluster Wikitech page describes the project's ambitions like this:
The Beta Cluster aims to provide a staging area that closely resembles the Wikimedia production environment. It runs MediaWiki and extensions from their master branch, allowing developers and power users to test new code before it goes live on Wikimedia websites.
This was written in early 2013, nearly a decade ago. Back then, the Wikimedia technical community and the WMF were much smaller. The Beta Cluster was one of the first projects on Wikimedia Labs (now known as Wikimedia Cloud Services).
From the very beginning, the Beta Cluster has attempted to reuse the same Puppet code used in production, with the intention that community members could use Beta to test changes. This hasn't always been easy, as a large part of the code was not designed to run outside production; there was even an attempt in 2015 to build a "stabler Beta Cluster" with an explicit goal of having Puppet do all of the provisioning.
To summarize: The original intention of the Beta Cluster was to allow testing changes to both MediaWiki and the underlying infrastructure.
As far as I can tell, the Beta Cluster was never maintained by the same people taking care of the equivalent production infrastructure. The people maintaining production infrastructure (originally called TechOps, these days known as the SRE team) have different needs than MediaWiki developers and testers do.
The nature of the Beta Cluster made it very inflexible for the infrastructure people: for example, it was hard to test multiple changes for the same component at the same time, and you needed to be very careful not to break the cluster entirely, because that would be disruptive to the MediaWiki developers.
Over time, the SRE team developed other systems for testing the infrastructure. Today, the main way to test infrastructure changes in a production-like environment is Pontoon. Pontoon's primary aim is to simplify starting disposable 'stacks' that are largely independent of each other and are much closer to the actual production environment than standard Cloud VPS instances are. Cloud VPS itself has also moved away from its original use case of being a development environment for services that were either currently living or planned to live in production.
A staging cluster that's trying to emulate production as closely as possible should be maintained by the same people maintaining production. Otherwise it's going to be impossible to keep up with all the changes, and with code that for whatever reason can't easily be used outside the environment it was originally written for.
Rather unsurprisingly, this kind of environment hasn't been very stable. Even worse, since the people responding to outages are usually not familiar with the system, most fixes end up being hacks that decrease the long-term reliability of the entire platform.
Beta Cluster outages get noticed very quickly, which suggests that people rely on the Beta Cluster working at least somewhat. However, not everyone needs it for the same reason. Common reasons seem to include:
Some features are hard to configure. Others need specialized dependencies. Either way, considering the Beta Cluster is (at least currently) only for code already merged to the master branch, we should instead focus on making it easier to run those features locally, or on making the relevant interfaces safer so we can be more confident that, for example, something storing files on the local disk also works properly with a Swift backend.
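To make the "safer interfaces" idea a bit more concrete, here is a minimal sketch of a shared contract check that runs the same assertions against two storage backends. Everything in it is an assumption for illustration: `FileStore`, `LocalDiskStore`, `InMemoryObjectStore` and `check_contract` are hypothetical names, not MediaWiki's actual file backend API, and the in-memory class only stands in for an object store such as Swift.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class FileStore(ABC):
    """Hypothetical storage interface (not MediaWiki's real file backend API)."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

    @abstractmethod
    def delete(self, key: str) -> None: ...


class LocalDiskStore(FileStore):
    """Stores each key as a file below a root directory."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def put(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        try:
            return (self.root / key).read_bytes()
        except FileNotFoundError:
            # Normalise the error so callers see the same behaviour everywhere.
            raise KeyError(key) from None

    def delete(self, key: str) -> None:
        try:
            (self.root / key).unlink()
        except FileNotFoundError:
            raise KeyError(key) from None


class InMemoryObjectStore(FileStore):
    """Stand-in for an object store such as Swift: flat keys, no directories."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = bytes(data)

    def get(self, key: str) -> bytes:
        return self.objects[key]

    def delete(self, key: str) -> None:
        del self.objects[key]


def check_contract(store: FileStore) -> bool:
    """Run the same behavioural checks against any FileStore implementation."""
    store.put("a/b.txt", b"hello")
    assert store.get("a/b.txt") == b"hello"
    store.put("a/b.txt", b"world")  # overwriting an existing key must work
    assert store.get("a/b.txt") == b"world"
    store.delete("a/b.txt")
    try:
        store.get("a/b.txt")
    except KeyError:
        return True  # a missing key is reported the same way by every backend
    return False


def run_contract_on_all() -> dict[str, bool]:
    """Check both backends and report which ones satisfy the contract."""
    with tempfile.TemporaryDirectory() as tmp:
        stores: dict[str, FileStore] = {
            "local": LocalDiskStore(Path(tmp)),
            "object": InMemoryObjectStore(),
        }
        return {name: check_contract(store) for name, store in stores.items()}
```

If feature code only ever talks to the interface, and every backend passes the same contract check, then testing a feature against the local disk backend says a lot more about how it will behave on the object store.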
The current model for Beta Cluster maintenance has been unsustainable for years, and it shows. It doesn't work unless the cluster is maintained by the SRE team directly, which is not optimal for the SRE team. Therefore I think it's reasonable to conclude that we need to replace the current Beta Cluster with a different solution (well, solutions) that is more sustainable to maintain and solves the same problems more efficiently.
What might that solution look like? Honestly, I'm not completely sure. What I do know is that we need to drop the requirement to be as close as possible to production, and instead focus on what we need in order to work on MediaWiki as efficiently as possible.
There are a couple of promising projects I'd like to showcase:
Thanks to Tyler Cipriani for providing me access to the 2018 Beta Cluster Survey, which provided helpful insights on how people use the Beta Cluster.