We just kicked off a major effort to address the availability issue Sid mentions above. The highlight is that we're moving to GCP, which should provide better underlying reliability. But interestingly, we found that only ~20% of our downtime minutes were from underlying infrastructure, whereas ~70% came from features that didn't scale.
So the more exciting part of the project is to tighten the feedback loop between development and deployment with a continuous delivery pipeline. This may be obvious to some people, but it's harder to pull off when you've got an open source project, an on-prem product, and a large-scale SaaS sharing the same code base. I'm calling it "Open-core SaaS", and there are only a handful of companies that run a large, multi-tenant service based on an open source project.
Maybe I just can't find it, but what URLs are pingdom.com checking? Do both issues have the same content? Or does one have 50 comments and another has 1?
We've still got work to do on the 99% and on merge request page load, but the overall situation has improved dramatically.
We've still got work to do on availability, so I changed the 'feature' to reflect this: https://gitlab.com/gitlab-com/www-gitlab-com/commit/3b3bddf5...