Sep 9, 2017

Learning from the world of Apache Spark part 1


This is the first in a series of posts where I'm going to review features that we've copied into Parquet.NET. This post is about automatic schema merging. To understand what this is, we'll look at the basic behaviour in Spark and then relate it back to set theory.

 

Spark abstracts the idea of a schema away from us by letting us read a whole directory of files whose schemas may be similar or identical. A key characteristic is that a superset schema is needed on many occasions. Spark will infer the schema automatically for timestamp, date, numeric and string types across all of its data sources, including Parquet.

 

For example, this will load all product data into memory:

 

val product_details = spark.read.parquet("/parquet/product_*").persist()

 

However, one drawback of this is that the schema it reads can be wrong: the schema is inferred from a scan of only the first n rows, so for larger files it may not pick up the other product schemas.

 

The alternative is to merge the schemas so that the values are automatically combined.

 

val df = spark.read.option("mergeSchema", "true").parquet("/parquet/product_1", "/parquet/product_2")
df.printSchema()

 

In set theory this is a union operation, which lets us take the values 1, 2, 3, 4 in Set 1 and combine them with the values 2, 2, 3, 4, 5 in Set 2. The union gives us the unique combination 1, 2, 3, 4, 5.
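
As a rough illustration of that union behaviour in code, here's a minimal C# sketch using HashSet (chosen only because Parquet.NET is a .NET library; the numbers are the ones from the example above, not real schemas):

using System;
using System.Collections.Generic;
using System.Linq;

class UnionExample
{
    static void Main()
    {
        // Set 1 and Set 2 from the example above
        var set1 = new HashSet<int> { 1, 2, 3, 4 };
        var set2 = new HashSet<int> { 2, 2, 3, 4, 5 }; // the duplicate 2 is dropped automatically

        // UnionWith folds set2 into set1, keeping each value only once
        set1.UnionWith(set2);

        Console.WriteLine(string.Join(", ", set1.OrderBy(x => x))); // 1, 2, 3, 4, 5
    }
}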

 

 

 

 

Parquet.NET now allows for a more explicit union without you having to worry about the schema merge yourself. It will automatically pick out the similarities and differences between the two schemas and combine them.

 

var ds1 = new DataSet(schema1);
var ds2 = new DataSet(schema2);
var ds3 = ds1.Merge(ds2); // ds3's schema is the union of schema1 and schema2
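
For context, schema1 and schema2 above might be defined along these lines. This is only a sketch: it assumes the SchemaElement-based API from the 1.x releases, and the field names are invented for illustration.

using Parquet.Data;

// two hypothetical product schemas: schema2 adds a "discount" field that schema1 lacks
var schema1 = new Schema(
    new SchemaElement<int>("id"),
    new SchemaElement<string>("name"));

var schema2 = new Schema(
    new SchemaElement<int>("id"),
    new SchemaElement<string>("name"),
    new SchemaElement<double>("discount"));

// after ds1.Merge(ds2), the merged dataset should expose id, name and discount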

 

Look for it formally in release 1.3; it's floating around in one of the alpha releases at the moment.

 

Happy trails!
