Extracting data from SQL into Parquet

Lately I've needed to extract some data from a SQL database (and others) and move it into Azure to be worked on using other tools. We're big users of Parquet which is available across a range of tools such as Azure SQL Data Warehouse, Azure Data Lake Analytics and of course, Spark.

The first couple of times I needed to do this were one-off tasks and so I took the more common route of exporting data in a different format (such as CSV) and then using tools like FastParquet in Python to write the data out to Parquet. But the last couple of times I've needed a more robust, easy to deploy solution, preferably written in C#, naturally this meant pulling in the Parquet .NET library. But what to do when you don't know the schema of the data you want to extract until you extract it?

Without wanting to write a lot of code to handle all of the variations I turned to C# reflection to inspect the schema of the data being read from SQL and then using that to create the DataSet schema. A little bit more effort needed to go in so that null values were correctly handled as specifying a data field as System.Int32 means that giving it a null value results in an error, so instead the field should be created as a Nullable<System.Int32> type.

Anyway, for the fun of it I created a demo app targeting .NET Core to show this in action, the code is available on Github as a Gist and is under an MIT license, so help yourselves. The code currently has a query which is pulling out data from a demo movies database I have set up (and used in my last post) locally and is spitting out around 76,000 movie records.

If you're on Windows and want to look at the Parquet file after you've written it to disk then check out the Apache Parquet Viewer available on the Microsoft Store.