I’m Taylor, the CTO here at YNAB. One of my jobs is to make sure that YNAB is always available for our customers. On April 28th, 2017, and then again on May 1st, our web app went down.
First, the rest of the team and I are genuinely sorry that YNAB went down, and that you couldn’t budget during that time (and the first of the month, no less! The worst!). We know that you rely on us and that having access to your budget at all times is critically important. We have identified and solved the root issue, and have taken steps to ensure it cannot happen again.
What Happened & How We Fixed It
At its core, the outage was caused by a database that could not keep up with the load. We always see more load on our servers near the beginning and end of the month, but a combination of factors caused this to be more than the database could handle beginning on Friday, April 28th. Simply upgrading the database wasn’t possible (we were already running on the largest database our provider offers), so we took steps to reduce the load. Although that helped, we needed to pull the entire app down multiple times over several days (meaning you couldn’t budget!), and by Monday, May 1st, it was obvious that reducing the load wasn’t enough.
On Monday afternoon our database provider discovered the root problem was a database setting that was set too low for the way our database operates, and this hadn’t been picked up in their monitoring. They increased that setting, which not only solved the problem but made our database much faster and more stable than it’s ever been.
Immediately after fixing this issue, we began the process of moving to an even faster database. In addition, we are architecting changes that will allow us to split YNAB up across multiple database servers which will allow us to scale more easily as YNAB grows. We are also looking at ways to make our applications less dependent upon our servers in general so that downtime or degraded performance is a non-issue, but that’s for another post.
And for the more technical types (or those with a lot of extra time on their hands)…
More Than You Ever Wanted To Know About What Happened & How We Fixed It
On the morning of April 28th, the database began to exhibit strange behavior. It worked harder and harder but fell further and further behind. As too many requests piled into the system, it became a self-perpetuating problem. Many requests couldn’t be serviced within 30 seconds, and many new requests were stuck in line behind these slow requests. In short, the database had hit some sort of performance wall.
At 12:06 pm EDT, it became obvious that the database could not keep up, so we made the hard decision to take the application offline to give it a chance to catch up. When we do this, the application goes into “maintenance mode”. From then on, anyone who tries to use YNAB sees a message saying that we’re working on YNAB. However, one minute later, at 12:07 pm EDT, the database crashed.
Normally our maintenance is brief and planned. In this case, however, our “maintenance” message made it look like we were purposely taking the application down on the last Friday of the month. It caused understandable confusion: “Why are they working on things on a Friday afternoon??” I’m sorry about that! We later changed the message to say that we were having trouble.
Fortunately, database crashes are a rare thing. But since they’re not unheard of, we always run two exact copies of the database in parallel. So, our provider immediately switched to this failover database without issue. By 12:15 pm EDT, the new database was up and running right where the other one left off, and we brought the application back online.
Unfortunately, when a new database starts up, it has what’s known as a “cold cache”. A large database can take minutes to hours to get “warmed up”, and until then it will run slower than normal. We expected this to get better over time as the cache warmed up, and we tweaked some other settings to help. Over the next couple of hours, performance improved, but not nearly quickly enough. Then, at 3:42 pm EDT, we were notified we needed to go offline again to give the new database time to recover. (For you techies, our Write Ahead Log (WAL) was consistently growing faster than it could be shipped to S3 storage. In addition, the database wasn’t able to hit checkpoints quickly enough.)
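For those who want to see the WAL problem in numbers: whenever the database generates WAL faster than it can ship it off to archive storage, the backlog grows without bound, no matter how small the gap. A tiny illustration (the rates below are made-up numbers, not ours):

```python
# Illustrative only: why a WAL backlog grows when write volume outpaces
# archiving throughput. All rates here are assumed for the example.

def wal_backlog_mb(write_rate_mb_s: float, ship_rate_mb_s: float, seconds: int) -> float:
    """Return the WAL backlog (in MB) after `seconds` at the given rates."""
    return max(0.0, (write_rate_mb_s - ship_rate_mb_s) * seconds)

# If the database generates WAL at 12 MB/s but can only ship 10 MB/s to
# archive storage, a 2 MB/s deficit compounds into 7200 MB after an hour:
print(wal_backlog_mb(12, 10, 3600))  # 7200.0
```

The only ways out are to ship faster (more I/O capacity) or write less (throttling), which is exactly the pair of levers described in the rest of this post.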
So, we took the application offline again at 3:42 pm EDT, and began planning to prevent this error from immediately happening when we came back online. We knew we were likely to get hit with an even heavier load when we came back online because there would be a rush of people trying to budget all at once. So, we coded a “throttle” we could use to reduce load if needed. It allowed us to come out of maintenance mode gradually for a percentage of customers. That gave us the ability to spread the initial rush out over time, and give our database a chance to catch up.
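A throttle like that can be surprisingly small. Here is a minimal sketch of the idea, percentage-based admission keyed on a stable hash of the customer ID. The names (`is_allowed`, `customer_id`) are illustrative, not our actual code:

```python
# Sketch of a percentage-based rollout throttle. Hashing the customer ID
# (rather than rolling dice per request) means each customer is either
# consistently admitted or consistently held in maintenance mode, instead
# of flickering between the two on every request.
import zlib

def is_allowed(customer_id: str, rollout_percent: int) -> bool:
    """Admit a stable ~rollout_percent of customers."""
    bucket = zlib.crc32(customer_id.encode()) % 100
    return bucket < rollout_percent

# Ramp from 0% toward 100% while watching database load:
admitted = sum(is_allowed(f"customer-{n}", 25) for n in range(10_000))
print(admitted)  # roughly a quarter of the 10,000 customers
```

Raising `rollout_percent` in steps is what let us spread the post-maintenance rush out over time.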
We pushed this throttle out at 4:55 pm EDT. Over the next couple of hours, we throttled up requests for more and more web customers and monitored performance. By 7:28 pm EDT, our web application was fully operational for 100% of web customers, and the database handled the load admirably. However, in our testing, we discovered that the requests coming from our mobile application were slowing the database down disproportionately. So, we kept much of the mobile application traffic throttled down so that it would not overload the server.
We throttled mobile traffic up and down over the weekend to keep the application performant for web customers, and generally continued to see acceptable database performance. By Monday morning, we had discovered a bug in our mobile applications that was causing them to send too much traffic to the server. This was introduced only a couple of weeks prior and was a partial explanation for the unprecedented behavior we were seeing. So, we quickly coded and released a fixed version of both the Android and iPhone apps. It takes time for those updates to be installed, and it was certainly going to help, but we knew it wasn’t a panacea.
In the meantime, on Monday morning, we were told by our database provider that our database WAL was again growing too quickly. We throttled our traffic down in order to give the database the breathing room it needed, but it wasn’t enough. At 1:33 pm EDT, we had to take the application offline again. At 2:19 pm EDT we came back online and continued to throttle requests as both our provider and we searched for answers.
Shortly thereafter, our database provider discovered a setting on our database that wasn’t high enough for our particular load: “PIOPS” (provisioned I/O operations per second), which caps how many disk operations per second the server is allowed to perform. While investigating, they realized that our database had repeatedly exceeded that cap, but this had been hidden from their monitoring. At 6:40 pm EDT, they quintupled the setting and reduced our “checkpoint_completion_target” setting, which immediately and decisively fixed the issue. Databases can get overloaded in many ways, and the nature of each is different. These settings had never been an issue for other clients but were in fact at the heart of this particular one.
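For the curious, these two settings interact: PostgreSQL spreads each checkpoint’s writes over a fraction of the checkpoint interval set by `checkpoint_completion_target`, so lowering the target makes checkpoints finish sooner but demands a higher burst write rate, which the quintupled PIOPS could now absorb. A back-of-the-envelope sketch (all numbers are assumptions for the arithmetic, not our actual configuration):

```python
# Back-of-the-envelope: how checkpoint_completion_target trades checkpoint
# duration against write intensity. Numbers are illustrative assumptions.

def checkpoint_write_rate_mb_s(dirty_mb: float, timeout_s: float, completion_target: float) -> float:
    """PostgreSQL paces a checkpoint's writes over roughly
    (checkpoint_timeout * checkpoint_completion_target) seconds, so a
    lower target concentrates the same writes into a shorter window."""
    return dirty_mb / (timeout_s * completion_target)

# Say a checkpoint must flush 3000 MB of dirty buffers, with a 5-minute timeout:
relaxed = checkpoint_write_rate_mb_s(3000, 300, 0.9)     # spread over 270 s
aggressive = checkpoint_write_rate_mb_s(3000, 300, 0.5)  # spread over 150 s
print(relaxed, aggressive)  # the lower target nearly doubles the required write rate
```

With the old PIOPS cap, that extra write rate wasn’t available and checkpoints fell behind; with five times the headroom, the more aggressive pacing became safe.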
Since May 1st, we have been running without throttling of any kind, and continue to monitor this closely. I am thrilled to report that since that change, the database is faster and more stable than it’s ever been.
Immediately after fixing this issue, we began the process of moving to an even faster database, backed by an infrastructure more appropriate for our load. In addition, we are architecting changes that will allow us to split YNAB up across multiple database servers (“sharding” for you tech folks). We knew that growth would lead us there eventually, as this architecture will ensure that we can scale to handle increased demand.
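The core idea behind sharding is simple: route all of a given budget’s traffic to one of several database servers, so each server only ever sees a fixed slice of the total load. A minimal sketch of such a router; the shard names and route-by-budget-ID scheme are assumptions for illustration, not our actual architecture:

```python
# Sketch of shard routing: a stable hash of the budget ID picks the
# database server, so one budget's data and traffic always land on the
# same shard, and adding shards divides the load.
import zlib

SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_for(budget_id: str) -> str:
    """Deterministically map a budget to one database server."""
    return SHARDS[zlib.crc32(budget_id.encode()) % len(SHARDS)]

# Every request for this budget is served by the same shard:
print(shard_for("budget-abc"))
```

(In practice, resharding and rebalancing make this harder than a modulo, which is part of why it is an architecture project rather than a settings change.)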
We are also looking at ways to make our applications less dependent upon our servers in general so that downtime or degraded performance is a non-issue, but that’s for another post…