Sep 5, 2017



This is a really key part of the cloud operating model: writing one set of approaches to technology that permeate the business and spread their benefits across all divisions. We've just started work on formalising this part of the Cloud Operating Model. Here's a sneak peek:


More coming on this soon. For now you can check out the reusability branch on GitHub.

New Posts
  • I stopped the other day to think about something that someone in the office said about a customer project after a good 3 months of work. It made me sit up and wonder what constitutes obsolete software these days. Before the cloud we could go years before upgrading to something new. I remember that as recently as 2-3 years ago there were still applications from the early days of .NET running on .NET 1.1, warts and all. The comment was a really innocuous one but it made me think. We've been working on a Service Fabric application for a customer to process and generate events based on a set of rules. It's a very common pattern and with Service Fabric we make light work of it. However, Microsoft has now released Azure Event Grid, which is completely serverless. You give it some rules and it goes off and works out how to route messages and which services to invoke based on inputs. You don't need to worry about scale or load because it's designed to scale on demand. Whenever I used to do the cloud introduction workshops for startups years ago, I used to discuss the idea of cadence of software delivery in the cloud. It seems to be on steroids at the moment! Every pain point that you feel, Microsoft or other cloud vendors seem to be pre-empting, delivering something new to fix it before you're aware of it. As software developers we're used to spending a lot of our time building plumbing or infrastructure code to bridge services and route user requests; it's the same for data scientists spending 80% of their time cleaning data. It just seems these days that by the time we build a system we've already accumulated a bunch of technical debt. I've been working with the cloud for a long time now and I'm very aware that customer drivers for the cloud are pushing the pace of change beyond anything we've seen. Many customers are waiting for the next best thing to plug a gap that they see as essential; otherwise they can't deploy.
This is something we're getting very used to now: not waiting years for software! It's worth stopping and thinking every once in a while. I did a talk and broke the back of the new Machine Learning services in Azure a few weeks ago and now I have something in production. We use a combination of Agile and CRISP-DM and a secret sauce as part of our methodology; the new Workbench tool shrinks the time we need to spend on the Data Understanding and Data Preparation phases. It's things like this that make us more productive, but they happen so quickly that we might miss the change and carry on in our own crappy way with the best of intentions. The moral of this post is to keep your eyes open for all of the crazy new tech you're seeing. Often it will waste your time, as the beta cycle is shrunk (delivery cadence!), but sometimes you'll find something that works really well for you. Keep your eyes on the cloud; if you blink you'll miss the innovation. Happy trails!
  • As people on the technical side of Big Data, building pipelines to ingest and process large amounts of data, we often find ourselves building distributed, loosely coupled software solutions. We use queues, Event Hubs, Hadoop and Service Fabric clusters, and large data stores like Data Warehouse. As any engineer will tell you, the more moving parts, the more likely it is that something will break. Breakages in code can take many forms. Usually, they will be simple exceptions that are easily handled with an exception handler and a retry. They are often what we call transient errors: temporary situations that last only a few seconds and cause unexpected behaviour. Every component of a large solution should be prepared to handle these, especially when deployed to the cloud. Sometimes, however, a component is unable to handle the exception, or even ends up in a state that it can't recover from without outside help. There could be problems in the underlying VM, or a connection can become unrecoverable. In a conventional application, we could notify an administrator who can perform the fix manually. In a larger distributed system that could become a significant drain on support resources. The goal behind a self-healing system is to find ways to detect unrecoverable errors and take the necessary actions automatically. We implement those actions as part of the software, deploying them together with the main system and maintaining them as we would any other part. The exact form they take will depend on the software and the platform of choice. A common way to enable self-healing is by writing a monitoring service. This is a continuously running service that keeps tabs on everything else in the solution. It can use various APIs to detect the state of machines and applications running elsewhere. Cloud platform providers often have REST APIs that can be periodically polled for the state of the PaaS and SaaS parts of the solution.
The monitoring system can then take the necessary actions to resolve any states that might be invalid, such as rebooting a machine or triggering a clean-up routine. When you're working in Azure, basic self-healing systems can be written in a simpler, more distributed way. Specifically, Azure gives us the power to use Alerts based on metrics. Think 'An Event Hub has not received any new messages in the last 10 minutes'. Most Azure users will know these alerts, because they can be used to notify admins by email when they are triggered. Less well known is that they can also trigger webhooks, allowing us to build our monitoring application as a web service that just responds to REST calls. Azure simplifies this even further for us, though. Alerts can natively trigger Logic Apps. This means we can write our reactive logic in a set of Logic Apps instead of an entire monitoring application, and have them use the full power of Logic Apps workflows to react intelligently. Additionally, they can be deployed through Resource Manager templates and, if your templates are written well, can automatically adapt to the environment they're deployed in. They're also serverless and have Microsoft's uptime guarantees. In a simple example that we use at a client, we have a Spark Streaming application running in HDInsight, pushing to an Event Hub. On that Event Hub, an alert will fire if there have been no messages in the last 5 minutes. Considering the speed of the data, that means something fatal has happened that did not lead to an automatic restart. When the alert fires, it triggers a Logic App that simply hits the Livy API on HDInsight to kill the currently running task and start a new one with the correct settings. This has made the pipeline a lot more robust, reduced the number of support calls enormously, and requires almost no maintenance or development time.
In the future, the new Event Grid service in Azure will make this kind of native self-healing even easier to implement. Bringing the same kind of reactive power to Azure Functions and Automation, and expanding it to all components, will give us many new tools to make our pipelines more robust and low-maintenance. There is much more to write about self-healing systems. There are some pitfalls, more techniques and even testing your healing capabilities to cover, but those will be for another day, another post. I hope this one will at least whet your appetite to get started building better, stronger, faster :)
  • We've been working to define the fundamental driver for using a Cloud Operating Model against a Cloud Strategy. Feedback, as always, is welcome here.
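
For the self-healing post above, here is a rough sketch in Python of the kind of logic the alert-triggered workflow performs: when the "no messages" alert fires, kill the running Spark job through Livy and submit a fresh one. This is not the Logic App from the client project, just an illustration. The function names, the `status` field on the alert payload and the contents of `job_spec` are assumptions made for the example; the `/batches` endpoints are the standard Livy batch API. Authentication is left out, and the sketch assumes the cluster runs only this one streaming job.

```python
import json
import urllib.request
from typing import Callable, Optional

# A transport takes (HTTP method, URL, optional JSON body) and returns parsed JSON.
# Injecting it lets us swap the real HTTP call for a fake one when testing.
Transport = Callable[[str, str, Optional[dict]], dict]


def http_send(method: str, url: str, body: Optional[dict] = None) -> dict:
    """Plain urllib transport. Authentication is deliberately omitted here."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        url,
        data=data,
        method=method,
        headers={"Content-Type": "application/json", "X-Requested-By": "self-healing"},
    )
    with urllib.request.urlopen(request) as response:
        raw = response.read()
    return json.loads(raw) if raw else {}


def restart_streaming_job(livy_url: str, job_spec: dict, send: Transport = http_send) -> int:
    """Kill every active Livy batch, then submit a new one. Returns the new batch id."""
    # Livy returns its list of batches under the "sessions" key.
    batches = send("GET", f"{livy_url}/batches", None).get("sessions", [])
    for batch in batches:
        # Assumption: any active batch on this cluster is the stuck streaming job.
        if batch.get("state") in ("starting", "running"):
            send("DELETE", f"{livy_url}/batches/{batch['id']}", None)
    created = send("POST", f"{livy_url}/batches", job_spec)
    return created["id"]


def handle_alert(payload: dict, livy_url: str, job_spec: dict,
                 send: Transport = http_send) -> Optional[int]:
    """React to an alert webhook call. The payload shape is an assumed example."""
    if payload.get("status") != "Activated":
        return None  # e.g. a "Resolved" notification: nothing to heal
    return restart_streaming_job(livy_url, job_spec, send)
```

In a real deployment you would call `handle_alert` from whatever receives the webhook, passing your cluster's Livy endpoint and the job settings you want the replacement job to start with. Keeping the transport injectable also gives you a cheap way to test the healing path without touching a live cluster.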