
How Microsoft’s GitHub just told everyone that AI agents are ‘behind’ recent outages: We hear the pain you are…
GitHub CTO Vlad Fedorov has published a public apology after two major incidents left hundreds of repositories and more than 2,000 pull requests in broken states. The platform’s April uptime has dropped below 85 percent, far short of its 99.9 percent SLA, driven by a surge in AI agent workflows so sharp that GitHub now says it must design for 30 times its current scale. Ghostty developer Mitchell Hashimoto, GitHub’s 1,299th user, has already announced he’s leaving.

GitHub’s CTO has published a rare public apology, confirming that an explosion in AI-driven development workflows is the main culprit behind the platform’s worsening reliability problems, and that the company badly underestimated how much capacity it would need to keep things running. The post, written by Chief Technology Officer Vlad Fedorov, admits that two recent incidents were “not acceptable.” It details a platform under serious strain: uptime in April has slipped below 85 percent, far short of the 99.9 percent its service level agreement promises. And it closes with a blunt two-word admission: “We’re sorry.”

The timing is pointed. The apology dropped just hours after Ghostty developer Mitchell Hashimoto, GitHub’s 1,299th ever user, who joined in February 2008, publicly announced he was pulling his popular terminal emulator off the platform. Hashimoto had been keeping a daily journal marking every date a GitHub outage disrupted his work, and almost every day had an X against it. “This is no longer a place for serious work,” he wrote.

The AI agent surge nobody planned for

So what actually happened? GitHub saw the traffic wave coming, just not at this scale. The company began executing a 10x capacity expansion plan in October 2025, but by February 2026 it was clear the platform needed to be designed for 30 times today’s scale.

The main driver, GitHub says, is a sharp acceleration in agentic development workflows since late December 2025, with repository creation, pull request activity, API usage, and large-repository workloads all growing fast, simultaneously. That last part is the killer: it’s not one system buckling, it’s everything at once. A single pull request, Fedorov explains, can touch Git storage, mergeability checks, branch protection, GitHub Actions, search, notifications, permissions, webhooks, APIs, background jobs, caches, and databases. At scale, small inefficiencies compound fast.
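
Fedorov’s post names the mechanism but includes no code, so here is a minimal, purely illustrative Go sketch of one of the compounding effects he describes: retry amplification. Everything in it (callDependency, the attempt counts) is invented for illustration; it shows how immediate retries multiply the physical load on an already-degraded dependency, and how capped exponential backoff with jitter spreads that load out instead.

    package main

    import (
        "errors"
        "fmt"
        "math/rand"
        "time"
    )

    // callDependency stands in for any downstream call (cache, database,
    // search index). While the dependency is degraded, every attempt fails.
    func callDependency() error {
        return errors.New("dependency overloaded")
    }

    // naiveRetry retries immediately: one logical request becomes up to
    // maxAttempts physical requests, multiplying load on the dependency
    // exactly when it can least absorb it.
    func naiveRetry(maxAttempts int) int {
        attempts := 0
        for i := 0; i < maxAttempts; i++ {
            attempts++
            if callDependency() == nil {
                break
            }
        }
        return attempts
    }

    // jitteredBackoff spaces retries out with capped exponential backoff
    // plus full jitter, so failing clients decorrelate instead of retrying
    // in lockstep and re-creating the original spike.
    func jitteredBackoff(maxAttempts int, base, maxDelay time.Duration) int {
        attempts, delay := 0, base
        for i := 0; i < maxAttempts; i++ {
            attempts++
            if callDependency() == nil {
                break
            }
            time.Sleep(time.Duration(rand.Int63n(int64(delay)))) // full jitter
            if delay *= 2; delay > maxDelay {
                delay = maxDelay
            }
        }
        return attempts
    }

    func main() {
        logical, physical := 1000, 0
        for i := 0; i < logical; i++ {
            physical += naiveRetry(5)
        }
        // 1,000 logical requests become 5,000 physical ones: a 5x surge
        // landing on a system that is already overloaded.
        fmt.Printf("naive: %d logical -> %d physical requests\n", logical, physical)

        // Same amplification, but spread out over time rather than hitting
        // as one synchronized spike.
        _ = jitteredBackoff
    }

The arithmetic, not the code, is the point: a retry policy that looks harmless per request quietly multiplies traffic platform-wide the moment a shared dependency slows down.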

Two bad weeks that made the apology inevitable

Two specific incidents pushed things to a breaking point. On April 23, a merge queue bug produced incorrect merge commits when a merge group contained more than one pull request, inadvertently reverting changes from previously merged pull requests. A total of 658 repositories and 2,092 pull requests were affected. No data was lost, but default branches were left in incorrect states that GitHub couldn’t safely repair automatically.

Then came April 27. GitHub’s Elasticsearch cluster became overloaded, likely from a botnet attack, and stopped returning search results, breaking large parts of the UI for pull requests, issues, and projects. Elasticsearch, Fedorov notes, was one of the systems not yet fully isolated, because other higher-risk areas had taken priority. That calculus clearly didn’t hold.

On the fix side, GitHub has been moving webhooks out of MySQL, redesigning session caches, overhauling authentication flows to reduce database load, and accelerating a migration of performance-sensitive code from Ruby into Go. The Azure migration, despite being blamed by some, has actually helped, allowing GitHub to spin up compute faster. A multi-cloud architecture is now in the works for longer-term resilience.

GitHub’s stated priority order going forward is availability first, then capacity, then new features. The platform has also updated its status page to include live availability numbers and committed to flagging incidents both large and small, so developers no longer have to guess whether the problem is on their end or GitHub’s. Whether those promises translate into a platform developers can actually rely on again is now the only question that matters.
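
GitHub hasn’t said what the Elasticsearch isolation will look like, but the failure mode it describes, search timing out and whole pages rendering blank, is the textbook case for a bounded timeout plus an explicit degraded fallback. The Go sketch below is a hypothetical illustration (searchBackend, SearchResult, and the timeout values are all invented), showing how a page can render an honest “search unavailable” state instead of silently displaying no results.

    package main

    import (
        "context"
        "errors"
        "fmt"
        "time"
    )

    // SearchResult is a stand-in for whatever a search-backed page renders.
    type SearchResult struct {
        Items    []string
        Degraded bool // true when search was unavailable, not merely empty
    }

    // searchBackend simulates an overloaded search cluster that never
    // answers within the caller's deadline.
    func searchBackend(ctx context.Context, query string) ([]string, error) {
        select {
        case <-time.After(5 * time.Second): // cluster is far too slow
            return nil, errors.New("search backend overloaded")
        case <-ctx.Done():
            return nil, ctx.Err()
        }
    }

    // searchWithFallback bounds how long the page waits for search and, on
    // failure, returns an explicitly degraded result instead of an error
    // that would blank the whole page.
    func searchWithFallback(query string, timeout time.Duration) SearchResult {
        ctx, cancel := context.WithTimeout(context.Background(), timeout)
        defer cancel()

        items, err := searchBackend(ctx, query)
        if err != nil {
            return SearchResult{Degraded: true}
        }
        return SearchResult{Items: items}
    }

    func main() {
        res := searchWithFallback("is:open is:pr", 300*time.Millisecond)
        if res.Degraded {
            // Render a banner, not an empty list of pull requests.
            fmt.Println("search is temporarily unavailable; showing cached view")
            return
        }
        fmt.Println(res.Items)
    }

The detail worth noting is the Degraded flag: it lets the UI distinguish “no matches” from “search is down,” which is exactly the ambiguity that made the April 27 incident look like wholesale breakage.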

Here is the full blog post from GitHub CTO Vlad Fedorov, published April 28, 2026:

I wanted to give an update on GitHub’s availability in light of two recent incidents. Both of those incidents are not acceptable, and we are sorry for the impact they had on you. I wanted to share some details on them, as well as explain what we’ve done and what we’re doing to improve our reliability.

We started executing our plan to increase GitHub’s capacity by 10X in October 2025 with a goal of substantially improving reliability and failover. By February 2026, it was clear that we needed to design for a future that requires 30X today’s scale.

The main driver is a rapid change in how software is being built. Since the second half of December 2025, agentic development workflows have accelerated sharply. By nearly every measure, the direction is already clear: repository creation, pull request activity, API usage, automation, and large-repository workloads are all growing quickly.

This exponential growth does not stress one system at a time. A pull request can touch Git storage, mergeability checks, branch protection, GitHub Actions, search, notifications, permissions, webhooks, APIs, background jobs, caches, and databases. At high scale, small inefficiencies compound: queues deepen, cache misses become database load, indexes fall behind, retries amplify traffic, and one slow dependency can affect several product experiences.

Our priorities are clear: availability first, then capacity, then new features. We are reducing unnecessary work, improving caching, isolating critical services, removing single points of failure, and moving performance-sensitive paths into systems designed for these workloads. This is distributed systems work: reducing hidden coupling, limiting blast radius, and making GitHub degrade gracefully when one subsystem is under pressure. We’re making progress quickly, but these incidents are examples of where there’s still work to do.

Short term, we had to resolve a variety of bottlenecks that appeared faster than expected, from moving webhooks to a different backend (out of MySQL), to redesigning the user session cache, to redoing authentication and authorization flows to substantially reduce database load. We also leveraged our migration to Azure to stand up a lot more compute.

Next we focused on isolating critical services like Git and GitHub Actions from other workloads and minimizing the blast radius by reducing single points of failure. This work started with careful analysis of dependencies and different tiers of traffic to understand what needs to be pulled apart and how we can minimize the impact of various attacks on legitimate traffic. Then we addressed those in order of risk. Similarly, we accelerated parts of the migration of performance- and scale-sensitive code out of the Ruby monolith into Go.

While we were already in the process of migrating out of our smaller custom data centers into the public cloud, we started working on a path to multi-cloud. This longer-term measure is necessary to achieve the level of resilience, low latency, and flexibility that will be needed in the future.

The number of repositories on GitHub is growing faster than ever, but a much harder scaling challenge is the rise of large monorepos. For the last three months, we’ve been investing heavily in response to this trend, both within the Git system and in the pull request experience.

On April 23, pull requests experienced a regression affecting merge queue operations. Pull requests merged through the merge queue using the squash merge method produced incorrect merge commits when a merge group contained more than one pull request.
In affected cases, changes from previously merged pull requests and prior commits were inadvertently reverted by subsequent merges. During the impact window, 658 repositories and 2,092 pull requests were affected.

On April 27, an incident affected our Elasticsearch subsystem, which powers several search-backed experiences across GitHub, including parts of pull requests, issues, and projects. What we know now is that the cluster became overloaded (likely due to a botnet attack) and stopped returning search results. There was no data loss, and Git operations and APIs were not impacted. However, parts of the UI that depended on search showed no results, which caused a significant disruption.

We have also heard clear feedback that customers need greater transparency during incidents. We recently updated the GitHub status page to include availability numbers. We have also committed to statusing incidents both large and small, so you do not have to guess whether an issue is on your side or ours.

The team at GitHub is incredibly passionate about our work. We hear the pain you’re experiencing. We read every email, social post, and support ticket, and we take it all to heart. We’re sorry.

We are committed to improving availability, increasing resilience, scaling for the future of software development, and communicating more transparently along the way.

— Vlad Fedorov, GitHub CTO
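
Neither the post nor GitHub’s status history explains how the squash regression actually reverted earlier work, so the following Go sketch is a purely hypothetical reconstruction, not GitHub’s root cause. It models one way a merge queue could produce exactly that symptom: building each squash commit from a pull request’s own tree against the old base, instead of replaying its diff on top of what earlier pull requests in the group already landed.

    package main

    import "fmt"

    // tree models a commit snapshot: file path -> contents.
    type tree map[string]string

    func clone(t tree) tree {
        c := make(tree, len(t))
        for k, v := range t {
            c[k] = v
        }
        return c
    }

    // pr carries only the files a pull request changed, relative to the
    // base branch it was opened against.
    type pr struct {
        name    string
        changes tree
    }

    // applyDiff replays a PR's changed files on top of whatever the branch
    // currently contains: the correct way to squash a merge group.
    func applyDiff(branch tree, p pr) tree {
        next := clone(branch)
        for path, contents := range p.changes {
            next[path] = contents
        }
        return next
    }

    func main() {
        base := tree{"a.txt": "v1", "b.txt": "v1"}
        group := []pr{
            {name: "PR-A", changes: tree{"a.txt": "v2"}},
            {name: "PR-B", changes: tree{"b.txt": "v2"}},
        }

        // Correct: each squash is computed against the updated branch, so
        // both changes survive.
        correct := clone(base)
        for _, p := range group {
            correct = applyDiff(correct, p)
        }
        fmt.Println("correct:", correct) // a.txt:v2 b.txt:v2

        // Buggy: each squash snapshots old base + that PR only, so PR-B's
        // commit silently reverts PR-A's change when it lands second.
        buggy := clone(base)
        for _, p := range group {
            buggy = applyDiff(base, p)
        }
        fmt.Println("buggy:  ", buggy) // a.txt:v1 (reverted), b.txt:v2
    }

Whatever the real defect was, the shape matches the symptom Fedorov describes: nothing is lost from Git’s object store, but the default branch ends up pointing at a commit whose tree is missing changes that had already merged.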



