Moving data from Lake into Warehouse: the power of Streams

One of our clients has a large archive of data, in wildly varying schemas, dumped in a folder structure in Azure Data Lake. They're split up by date and time, but every folder contains gzipped CSVs with wildly varying structures and sizes. To make that data useful, we needed to ingest that data into an Azure Data Warehouse.

The default way to do this in Data Warehouse would be to use PolyBase. It can handle Gzipped CSVs quite well out of the box, and it's very efficient for the larger files. In our case, some of the files are several gigabytes, so using polybase is a natural fit for those.

However, there are also thousands of smaller files, from a few kilobytes up to hundreds of megabytes. For those, especially the small ones, polybase requires too much setup for the payoff. So we went looking for a better way, preferably one that is also reusable for Azure SQL DB (which doesn't support polybase).

One thing you can do in both DB and DW is bulk copying. Specifically, the .Net Bulk Copy Client can read data from any DataReader and stream it in to a table without keeping the entire file in memory, in a safe and fast way. A DataReader feeds the Bulk Copy Client line by line, value by value, all of the data. Unfortunately, there are no DataReaders that read directly from Data Lake, let alone Gzipped CSV files, so we decided to write that capability ourselves.

We know that lots of usable packages for data manipulation can handle Streams. If we could read the Data Lake files with Streams, we could feed them through some steps and then encapsulate the last step as a DataReader ourselves. This would give us the ability to go from Data Lake into DW without needing to do any file copying or keeping an entire file in memory.

We initially tried working with the Stream the Data Lake client itself can give us, but we found that it was forcibly disconnected often. It just isn't designed for streaming, but rather for downloading onto disk or memory and then disposing of the stream. However, Data Lake also provides the ability to read chunks of the data as bytes, using just offset and count. That closely mirrors what a Stream can do. We ended up writing our own Stream implementation with position tracking, that reads chunks from Data Lake with the correct offset whenever it is read from. We even added some read-ahead logic to make better use of the inbound bandwidth. This class also handles reconnection and retries for DL.

Now that we have a Stream, we could start wiring this with some other packages. .Net's BufferedStream over the top allows us to decide how large the chunks we read should be, with a lot of automated management. A GzipStream (from System.IO.Compression) will read that BufferedStream and decompress it to its proper bytes for us. After that, we used a StreamReader that we feed into a CsvReader (from the CsvHelper package, a great little package for this kind of stuff). We do some set up to read the first row to determine columns, but aside from that the CsvReader gives us all the power we need to implement an IDataReader with it. Now we have a full chain of streams, from reading bytes in Data Lake, to buffering, decompressing, reading as CSV and pumping the value into a table in Data Warehouse. All without keeping anything larger than a few MB buffer in memory!

It's important to note that this method is fast, but will never beat PolyBase for raw speed. Bulk Copy is always restricted by the speed of the DW Control Node, which PolyBase handily sidesteps by distributing the operations across all nodes. Additionally, due to Bulk Copy transactions taking up to 15 seconds to commit at the end of streaming, and other speed limits in how quickly we can decompress a file, it's important to run copies for as many files at a time as possible. The more you can hit the DW with at the same time, the more throughput you can get.

To make the application a bit more powerful, we added automatic creation of destination tables, some logic to handle files with missing columns, and an automatic fallback to polybase for the largest of files.

All in all, it's been an interesting bit of development, and has definitely shown the power of writing and composing Streams. Because they're so widely supported by other packages, mainly with the intention of using them to read large files from disk, we can use them to set up intricate but very efficient flows of data. Give it a try!