Aug 25, 2017

Replaying change records


Working with customers, it is often desirable to maintain an on-premises operational system but have that system stream its data to the cloud for new kinds of work such as analysis or business intelligence. Streams of transactional information are pretty straightforward in this situation, because each new record is a self-contained entity. Sometimes, though, you want to take copies of relational data where frequent snapshots just aren't viable, either because there is too much data to copy that often or because running a large select against the table impedes the performance of the operational system. One solution is to stream the changes made to the table instead, otherwise known as change data capture (CDC).


The stream of insert, update and delete statements is useful in itself for insight into how your data changes over time; a spike in update statements, for instance, might reveal a background process you were previously unaware of. But to use the data itself you'll often want to turn the stream back into a reasonably current view of the on-premises data. One method is to "replay" each change record against the destination as it arrives, which is fine if the data doesn't change much; if it does, replaying becomes burdensome, both in processing terms and possibly financially as well.
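
As a rough sketch of what record-by-record replay means, here is a minimal Python illustration. It assumes a hypothetical change-record shape (an operation code, a key and the new column values); none of the field names come from a specific CDC product.

```python
# Minimal sketch of record-by-record replay against a destination copy.
# Assumes a hypothetical change-record shape: "op" (I/U/D), the key
# column and the new column values.
destination = {}  # stand-in for the destination table, keyed by employee_id


def replay(record, table):
    """Apply a single change record to the destination table."""
    key = record["employee_id"]
    if record["op"] in ("I", "U"):
        # Insert or update: write the new values for this key.
        table[key] = {"name": record["name"], "department": record["department"]}
    elif record["op"] == "D":
        # Delete: remove the row if it exists.
        table.pop(key, None)


# Every change is applied as it arrives, in the order it was captured.
incoming = [
    {"op": "I", "employee_id": 2, "name": "Sam", "department": "Sales"},
    {"op": "U", "employee_id": 2, "name": "Sam", "department": "Finance"},
]
for record in incoming:
    replay(record, destination)
```

Every arriving record touches the destination, so a busy source table means a lot of small writes, which is exactly the cost the rest of this post tries to avoid.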


A possible solution is to periodically (at a frequency of your choosing) derive the latest view of the data from the accumulated changes. People often get distracted by making sure every insert, update and delete is applied in exactly the right order, to cover edge cases that happened long ago, when what you actually care about is "what does the data look like now?". In this fictional example we have the current view of the data.

And the changes received since.
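
Sketched in Python with invented values (the names, departments and sequence numbers are purely illustrative), the current view and the change stream might look like this:

```python
# Current view of the employees table at the destination (invented values).
current_view = {
    1: {"name": "Anna", "department": "Sales"},
}

# Change records received since, in the order they were captured.
changes = [
    {"seq": 1, "op": "U", "employee_id": 1, "name": "Anna", "department": "Marketing"},
    {"seq": 2, "op": "U", "employee_id": 1, "name": "Anna", "department": "Finance"},
    {"seq": 3, "op": "D", "employee_id": 1, "name": None, "department": None},
    {"seq": 4, "op": "I", "employee_id": 2, "name": "Ben", "department": "Sales"},
    {"seq": 5, "op": "U", "employee_id": 2, "name": "Ben", "department": "Finance"},
    {"seq": 6, "op": "I", "employee_id": 3, "name": "Anna", "department": "Sales"},
]
```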

A replay would take each change record and apply it against the destination in turn. For employee_id 1 that means two updates before a final delete; employee 2 would be added and then updated; and employee 3 (who was previously employee 1) would be added. But if we take just the most recent record for each employee, we get the near-current state of each employee record directly.

So we know that employee 1 needs removing and employees 2 and 3 need upserting. How that is done varies with your platform of choice, but by applying the changes as batch deletes and upserts you end up with the following.
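
Continuing the invented example above, a minimal sketch of the approach (an illustration, not a prescription for any particular platform) keeps only the latest change per employee and then applies the net result as one batch of deletes and one batch of upserts:

```python
def latest_changes(changes):
    """Keep only the most recent change record for each employee_id."""
    latest = {}
    for record in sorted(changes, key=lambda r: r["seq"]):
        latest[record["employee_id"]] = record
    return latest


def apply_batch(current_view, changes):
    """Apply the net effect of the change stream as batch deletes and upserts."""
    latest = latest_changes(changes)
    deletes = [key for key, rec in latest.items() if rec["op"] == "D"]
    upserts = {key: rec for key, rec in latest.items() if rec["op"] != "D"}

    for key in deletes:  # batch delete
        current_view.pop(key, None)
    for key, rec in upserts.items():  # batch upsert (insert or overwrite)
        current_view[key] = {"name": rec["name"], "department": rec["department"]}
    return current_view


# With the current_view and changes from the earlier sketch this yields:
# {2: {'name': 'Ben', 'department': 'Finance'},
#  3: {'name': 'Anna', 'department': 'Sales'}}
```

However many intermediate changes were captured, only one delete set and one upsert set are shipped to the destination per refresh, which is where the saving comes from.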

The end result should be a process that is easier to think about, easier to implement, and one that saves a lot of processing effort.
