For many years leading up to the mainstream adoption of Agile, continuous deployment, and cloud computing, we IT people lived in a world where we put a lot of effort into keeping things running and changed them only a couple of times a year.
I remember working for a company that, at a certain point, moved to a different office. Shortly after, there was a small “Welcome to the new office” party. We all had some drinks and put on the silly hats. Later that night, the system administrator told me, with some sadness and a lot of pride in his voice, that this was the first time in nine years the FTP server had gone down: it had to be moved from one building to the other. That same server went down again later in the year, because the Christmas lights on the facade of the building short-circuited during the holidays and took down the electricity in our part of the building.
Yes, that was our plan for keeping the FTP server running: be very careful and hope for the best.
The way software was run in that era hid some fundamentally bad system design choices and made them very uneconomical to fix. The problems they caused were rare, and most of the time that made them negligible. So they were not worth fixing, or even thinking about, unless you had some severe penalty hanging over your head.
Things took a different turn when we got to Agile and Lean software development ideas and Continuous Deployment, and even more so when Kubernetes and AWS Auto Scaling started managing our services for us. Those tools stop and start services left and right: because network traffic is not reaching your service, because of a rolling upgrade of the service code, or simply because almost no one is using your service overnight.
Even when Kubernetes and Auto Scaling have decided that your service has achieved nirvana, there is a decent chance one of the servers running it will experience a hardware failure.
Hope for the best but prepare for the worst
So what kind of bad system design am I referring to? I find a lot of these come from developers who forget to wear their DevOps hats once in a while. They inadvertently design software that is optimized for their local development setup, which then often has an impact on what happens in production. Here are a few examples from recent experience:
Running an embedded database in production probably does not happen very often in the wild, but since it hurt my team so much, I am going to put it out there. An embedded database limits you to a single replica of your data, and that is a big part of the problem. When the hammer comes down on your service, it is hard to say whether the database will be left in a consistent state. Even if the data is consistent, an index might not be, and that can leave you waiting for hours while it is rebuilt.
Many databases support an embedded mode to facilitate local development and automated testing. Don’t let that mode make it to production.
Your service received a message from a queue. It acknowledged the message and put it in an in-memory list, to be processed together with other messages within a second. It’s just a second, right? WHAT ARE THE CHANCES!
The idea here is to process messages in batches, which can be a significant performance boost. Still, because the messages were acknowledged before they were processed, a restart might leave some of them partially or entirely unprocessed, and the queue will not deliver them again. If it happens under considerable load, chances are pretty high karma will catch up with you. I hope that was not your billing service we were talking about.
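One way to keep the batching while closing that gap is to acknowledge messages only after the batch has been processed. Here is a minimal Python sketch of the idea. `FakeQueue` is a hypothetical in-memory stand-in for a broker with explicit acknowledgements, and `process_batch` is a name I made up for illustration; a real broker client will have its own API.

```python
from dataclasses import dataclass, field


@dataclass
class FakeQueue:
    """Hypothetical in-memory stand-in for a broker with explicit acks."""
    pending: dict = field(default_factory=dict)  # messages not yet acknowledged

    def publish(self, msg_id, body):
        self.pending[msg_id] = body

    def receive(self, max_messages):
        # Hands out messages without removing them, so anything that is
        # never acknowledged is simply delivered again on the next call.
        return list(self.pending.items())[:max_messages]

    def ack(self, msg_id):
        self.pending.pop(msg_id, None)


def process_batch(queue, handler, batch_size=10):
    """Process up to batch_size messages and acknowledge them only after
    the handler has finished. Returns the number of messages processed."""
    batch = queue.receive(batch_size)
    if not batch:
        return 0
    handler([body for _, body in batch])  # a crash here leaves the batch unacked
    for msg_id, _ in batch:
        queue.ack(msg_id)
    return len(batch)
```

The price is that a message can be delivered more than once, for example when the service dies after handling the batch but before acknowledging it, so the handler needs to tolerate duplicates.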
Lock files are not very typical of modern software, but they are widely used by Linux utilities to avoid running multiple instances. Still, one of the most popular Java libraries uses this mechanism. The idea is to write one or more files into a particular folder as an indicator that the software is already running. In its startup code, the software checks for the presence of those files and exits with an error if they are there.
The problem is that those files are not removed automatically when your service gets restarted abruptly, and then it ends up stuck until someone cleans up the files and restarts it. That is no problem if you restart it once or twice a year, but it can be a real bugger if you do it a few times a day.
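If you control the code, one alternative is to rely on a lock held by the operating system rather than on the mere existence of a file. The sketch below is a minimal Python example for Unix-like systems using the standard `fcntl.flock` call; the function name `acquire_single_instance_lock` is my own. The kernel drops the lock when the process exits, however abruptly, so a leftover file does not block the next start.

```python
import fcntl
import os


def acquire_single_instance_lock(path):
    """Try to take an exclusive, non-blocking lock on the file at `path`.

    Returns the open file object on success (keep it open for as long as
    the process should hold the lock), or None if another holder exists.
    """
    handle = open(path, "a+")  # creates the file if it does not exist yet
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        # Someone else holds the lock: another instance is running.
        handle.close()
        return None
    # Purely informational: record who holds the lock.
    handle.seek(0)
    handle.truncate()
    handle.write(str(os.getpid()))
    handle.flush()
    return handle
```

A service would call this once at startup and exit if it gets `None` back. Note that the check is no longer "does the file exist" but "is anyone holding the lock", which is the part that survives a crash.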
The last issue, partial updates across multiple services or data stores, is probably the most important one, since you can avoid the others simply by being aware of them. This one is inherent in the way we solve many of the problems we face today.
Its essence is that we often need to update multiple services or data stores for each message or command that we get. Those services or stores are usually independent of each other, and there is no easy way to perform this cross-boundary change in a transactional manner. The danger is that interrupting the processing of the message might leave only some of those systems updated. Again, if you restart your services a couple of times a year, at a carefully selected time outside of business hours, you can very likely avoid this issue. In the cloud or on Kubernetes, chances are you will need to take it very seriously.
If you have accepted the trade-offs that come with eventual consistency, I have a couple of pointers that should give you an idea of how to deal with this problem:
- Saga pattern - in short, you use coordination logic that applies the change to each service or store in turn. If there is a failure along the way, the changes that were already applied successfully are reverted by “compensation” logic.
- Event sourcing - for each external service or store, your service has a piece of logic responsible for updating it, and it does so based on a sequence of events from a central store. All changes are derived from the same event log.
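To make the saga idea a bit more concrete, here is a minimal Python sketch. Everything in it is made up for illustration: `run_saga`, the in-memory `inventory` and `payments` stand-ins, and the step functions. It shows only the compensation flow, not a production-ready coordinator.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order. If an action raises,
    undo the already completed steps in reverse order, then re-raise."""
    completed = []
    for action, compensation in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            raise
        completed.append(compensation)


# Hypothetical in-memory stand-ins for two independent services.
inventory = {"widget": 5}
payments = []


def reserve_stock():
    inventory["widget"] -= 1


def release_stock():
    inventory["widget"] += 1


def charge_card():
    raise RuntimeError("payment service unavailable")


def refund_card():
    payments.pop()


order_saga = [
    (reserve_stock, release_stock),
    (charge_card, refund_card),
]
```

Running `run_saga(order_saga)` reserves the stock, fails on the charge, and releases the stock again before re-raising the error. Keep in mind that the coordinator itself can be killed between steps, so in a real system its progress would also have to be stored outside the process, and the sketch ignores the case where a compensation fails.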
Relying on your service executing sections of your code uninterrupted is a dangerous assumption. It is best to adopt the mindset that your system can be abruptly killed at any point. Keep your state external and keep it consistent.
In the next part of this blog post, you can find out what data is worth all this trouble and what data is not, as well as how to test whether your service is crash-ready.