I’m Taylor, the CTO here at YNAB. I want to apologize for a problem YNAB had at the beginning of the year, and let you know what we’ve done to fix it.
From mid-December onward, our servers struggled to keep up with the load on certain days. During these incidents, our mobile apps often weren’t able to sync their changes to other devices or create new accounts, and for some customers our web app was sometimes so slow as to be unusable. Although the problem was correlated with the high traffic we expect to see with a new year and the first of the month, the traffic itself wasn’t the problem.
After an extensive investigation, on Tuesday, Feb 4th we discovered the root cause and permanently fixed the issue.
We are sorry. And I want to be clear, we are not simply apologizing for you being “inconvenienced”, because we understand it is so much more than that. Not being able to use YNAB means not being able to make financial decisions with your partner. It’s not being able to know how much to send to your credit card company, and not knowing whether you can afford something you’ve been saving up for. We take our responsibility to you seriously, we’ve fixed the underlying issue, and we ask that you continue to hold us to the highest standard, because that’s what you as our customers deserve!
First, now that we’ve fixed this underlying issue, we have confidence in our servers’ ability to weather a much higher load, both at the beginning of each month and for a long time to come.
Second, we are currently upgrading our fleet of databases to significantly faster hardware, which will let them handle even more than their current capacity.
And lastly, we plan to make architectural changes that will allow both our web and our mobile apps to rely upon these servers even less, reducing their load, and reducing the impact if there are future incidents.
The Technical Nitty Gritty: More About What Happened
(The section below is likely of more interest to the technical folks reading this, but we’ve tried to explain the issue in a way that won’t require you to hold a PhD in Computer Science.)
At its core, YNAB makes heavy use of a database server for virtually everything it and its clients do. At various times over the past few weeks, our database’s performance would degrade severely and instantly, stay degraded for a varying number of hours, and then recover as suddenly as it had started.
Although these incidents were correlated with periods of high load, particularly near the start of January and February, the load itself did not appear to be the cause, and reducing load on the server made little meaningful difference. Furthermore, there were no other symptoms or warnings visible to us or our database provider that could explain what was happening.
After intensive investigations both internally and with our provider, we discovered that one of the disks our database uses for “temporary” tables can only operate at “burst” capacity for a limited amount of time. This allows the disk to act like a much faster disk 99.9% of the time, but if it operates under high load for too long, it runs out of these “burst credits” and operates at a bare minimum of performance. Only once its load has been reduced for long enough does the disk regain its “burst” performance.
Very few of the millions of databases operated by our database provider ever make real use of this type of disk, and for those that do, this burst capacity has always been sufficient. In fact, this burst capacity had also been sufficient for YNAB’s database for the last few years, until the disk sustained high load for longer than it ever had in the past.
This sustained load led to hours in which this disk had 0 burst credits, operating over 10 times slower than it normally does, in turn causing the entire database to operate at a fraction of its capacity.
Unfortunately, this disk’s performance issues were undetectable to the YNAB team because neither the disk’s performance metrics nor its burst credit metrics are made available to us. In fact, this disk is so rarely used by databases that our database provider doesn’t even monitor or alert on its performance on their end. Once we realized that the issue was with this slow disk, we instructed our database to stop using it and to use its primary disk, which has guaranteed high performance, instead. The issue was immediately resolved.
What This Means for You
All that to say, your budgets are back to being lightning quick whenever you need them. We want to apologize again if you experienced any of these issues at the beginning of this year—because that doesn’t cut it in our book. If we’ve dinged any trust we’ve built with you so far, we’ll keep working hard to restore it in the days and months to come.