As people on the technical side of Big Data, building pipelines to ingest and process large amounts of data, we often find ourselves building distributed, loosely coupled software solutions. We use queues, Event Hubs, Hadoop and Service Fabric clusters, and large data stores like Data Warehouse. As any engineer will tell you, the more moving parts, the more likely it is that something will break.
Breakages in code can take many forms. Usually, they will be simple exceptions that are easily handled with an exception handler and a retry. They are often what we call transient errors: temporary situations, lasting only a few seconds, that cause unexpected behaviour. Every component of a large solution should be prepared to handle these, especially when deployed to the cloud.
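To make the handler-plus-retry idea concrete, here is a minimal sketch in Python. The names `TransientError` and `with_retry` are made up for illustration, and the exponential backoff is just one reasonable choice; in a real component you would catch whatever exception types your client library raises for temporary failures.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def with_retry(operation, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call operation(), retrying on TransientError with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                # Out of retries: this is no longer a transient problem,
                # so let it surface to whatever handles real failures.
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

Anything that still fails after the last attempt is exactly the kind of error the rest of this post is about.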
Sometimes, however, the component can be unable to handle the exception - or even end up in a state that it can't recover from without outside help. There could be problems in the underlying VM, or a connection can become unrecoverable. In a conventional application, we could notify an administrator who can perform the fix manually. In a larger distributed system, that could become a significant drain on support resources.
The goal behind a self-healing system is that we find ways to detect unrecoverable errors and take the necessary actions automatically. We will implement those actions as part of the software, deploying them together with the main system and maintaining them as we would any other part. The exact form they take will also depend on the software and the platform of choice.
A common way to enable self-healing is by writing a monitoring service. This would be a continuously running service that keeps tabs on everything else in the solution. It can use various APIs to detect the state of machines and applications running elsewhere. Cloud platform providers often have REST APIs that can be periodically polled for the state of the PaaS and SaaS parts of the solution. The monitoring system can then take the necessary actions to resolve any states that might be invalid, such as rebooting a machine or triggering a clean-up routine.
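The core of such a monitoring service can be sketched as a loop that maps observed states to remedies. Everything named here is an assumption for illustration: `check_state` stands in for whatever platform API you poll, and the state names and actions in `remedies` are whatever makes sense for your solution.

```python
import time


def heal_once(check_state, remedies):
    """Run one monitoring pass and return the (component, state) pairs acted on.

    check_state: callable returning {component_name: state}.
    remedies:    dict mapping an invalid state to a callable(component_name).
    """
    taken = []
    for component, state in check_state().items():
        action = remedies.get(state)
        if action is not None:  # states without a remedy count as healthy
            action(component)
            taken.append((component, state))
    return taken


def run_forever(check_state, remedies, interval_seconds=60, sleep=time.sleep):
    """Poll periodically; a real service would add logging and error handling."""
    while True:
        heal_once(check_state, remedies)
        sleep(interval_seconds)
```

Keeping the single pass separate from the loop makes the healing logic easy to test without waiting on a timer.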
When you're working in Azure, basic self-healing systems can be written in a more distributed and simpler way. Specifically, Azure gives us the power to use Alerts based on metrics. Think 'An Event Hub has not received any new messages in the last 10 minutes'. Most Azure users will know these alerts, because they can be used to alert admins by sending an email when they are triggered. Less well-known is that they can also be used to trigger webhooks, allowing us to build our monitoring application as a web service that just responds to REST calls.
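A webhook-driven healer can be as small as the sketch below, which uses only the Python standard library. The payload field names (`data.essentials.alertRule` and `monitorCondition`) are an assumption modelled on Azure Monitor's common alert schema, so check them against the payload your alerts actually send. The rule name and the `restart_ingest` action are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def restart_ingest(rule):
    """Hypothetical healing action; replace with a real restart or clean-up."""
    print(f"healing triggered by alert rule {rule}")


# Hypothetical mapping from alert rule name to healing action.
REMEDIES = {"eventhub-no-messages": restart_ingest}


def handle_alert(payload, remedies):
    """Run the remedy for a fired alert; return the rule name, or None if ignored."""
    if not isinstance(payload, dict):
        return None
    essentials = payload.get("data", {}).get("essentials", {})
    if essentials.get("monitorCondition") != "Fired":
        return None  # ignore 'Resolved' notifications and unknown shapes
    rule = essentials.get("alertRule")
    action = remedies.get(rule)
    if action is None:
        return None
    action(rule)
    return rule


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_response(400)
            self.end_headers()
            return
        handle_alert(payload, REMEDIES)
        self.send_response(200)
        self.end_headers()


def serve(port=8080):
    """Start the webhook listener (blocks until interrupted)."""
    HTTPServer(("", port), AlertHandler).serve_forever()
```

Calling `serve()` starts the listener. A production version would also need HTTPS and some way to authenticate the caller, which this sketch leaves out.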
Azure simplifies this even further for us, though. Alerts can natively trigger Logic Apps. This means we can write our reactive logic in a set of Logic Apps instead of an entire monitoring application, and have them use the full power of Logic Apps workflows to react intelligently. Additionally, they can be deployed through Resource Manager templates, and if your templates are written well, can automatically adapt to the environment they're deployed in. They're also serverless and have Microsoft's uptime guarantees.
In a simple example that we use at a client, we have a Spark Streaming application running in HDInsight, pushing to Event Hub. On that Event Hub, an alert will fire if there are no messages in the last 5 minutes. Considering the speed of the data, that means something fatal has happened that did not lead to an automatic restart. When the alert fires, it triggers a Logic App that simply hits the Livy API on HDInsight to kill the currently running task and start a new one with the correct settings. This has made the pipeline a lot more robust, reduced the number of support calls enormously, and requires almost no maintenance or development time.
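Our actual fix lives in a Logic App, but the two Livy calls it makes can be sketched in Python for illustration. This assumes the streaming job was submitted as a Livy batch, that the cluster exposes Livy at a base URL such as `https://<cluster>.azurehdinsight.net/livy` with basic authentication, and that the batch id is already known. The function names are made up. `DELETE /batches/{id}` and `POST /batches` are the standard Livy REST endpoints for killing and submitting a batch.

```python
import base64
import json
import urllib.request


def livy_request(base_url, method, path, user, password, body=None):
    """Build (but do not send) an authenticated Livy REST request."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(base_url.rstrip("/") + path, data=data, method=method)
    req.add_header("Authorization", f"Basic {token}")
    req.add_header("Content-Type", "application/json")
    # Livy expects this header on POST/DELETE when CSRF protection is enabled.
    req.add_header("X-Requested-By", user)
    return req


def restart_batch(base_url, batch_id, job, user, password, send=urllib.request.urlopen):
    """Kill one Livy batch, then submit a replacement described by `job`.

    `job` is the JSON body for POST /batches (for example 'file', 'className'
    and 'args'); its contents depend entirely on your application.
    """
    send(livy_request(base_url, "DELETE", f"/batches/{batch_id}", user, password))
    return send(livy_request(base_url, "POST", "/batches", user, password, body=job))
```

The `send` parameter exists so the sequence can be exercised without a cluster. In the Logic App, the same two steps are simply two HTTP actions.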
In the future, the new Event Grid service in Azure will make this kind of native self-healing even easier to implement. Bringing the same kind of reactive power to Azure Functions and Automation, and expanding it for all components, will give us many new tools to make our pipelines more robust and low-maintenance.
There is much more to write about self-healing systems. There are some pitfalls, more techniques and even testing your healing capabilities to cover, but those will be for another day, another post. I hope this one will at least whet your appetite to get started building better, stronger, faster :)