This week at the Spark Summit has been interesting and fun, starting with a short course on Spark tuning and best practices. This covered topics such as:
Memory Usage
Specifically, we discussed Spark's Tungsten in-memory storage format for Spark SQL/DataFrames, which provides compact and efficient storage compared with the JVM objects used by traditional RDDs.
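To get a feel for why a packed binary layout beats per-field objects, here is a toy Python sketch. This is not Spark's actual UnsafeRow layout, just an illustration of object overhead versus a fixed-width binary row; the record values are made up.

```python
import struct
import sys

# A hypothetical (int, float) record.
record = (42, 3.14)

# Generic object storage: each field is a full heap object plus a
# container, loosely analogous to JVM object headers and pointers
# when rows live as objects in an RDD.
object_size = sys.getsizeof(record) + sum(sys.getsizeof(f) for f in record)

# Tungsten-style storage: fields packed into a fixed-width binary row.
packed = struct.pack("<qd", record[0], record[1])  # 8-byte long + 8-byte double
packed_size = len(packed)  # 16 bytes total

print(object_size, packed_size)
```

The packed row is a fraction of the size of the object version, and (in the real Tungsten format) can be compared and hashed without deserialising into objects at all.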
Data locality was also discussed here: keeping data close to the computation that uses it increases efficiency and reduces the need to shuffle data around the cluster. A number of strategies for achieving this were discussed.
Broadcast variables
Use of broadcast variables can vastly improve the performance of wide transformations (such as joins) by transmitting small datasets to every node in the cluster, which avoids shuffling large amounts of data around.
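The idea behind a broadcast (map-side) join can be sketched in plain Python. The tables and names below are hypothetical, and this is only a conceptual model: the small table is shipped whole to every worker as a hash map, so each partition of the large table can join locally without being shuffled by key.

```python
# Small dimension table: broadcast in full to every node.
small_table = {1: "EMEA", 2: "APAC", 3: "AMER"}

# One partition of the large fact table: stays where it is.
large_table = [(1, 100.0), (2, 250.0), (1, 75.0), (3, 30.0)]

def map_side_join(partition, broadcast_lookup):
    """Join one partition of the large table against the broadcast map,
    keeping only rows whose key exists in the small table (inner join)."""
    return [(key, value, broadcast_lookup[key])
            for key, value in partition
            if key in broadcast_lookup]

joined = map_side_join(large_table, small_table)
print(joined)  # [(1, 100.0, 'EMEA'), (2, 250.0, 'APAC'), (1, 75.0, 'EMEA'), (3, 30.0, 'AMER')]
```

In Spark SQL the equivalent effect comes from the `broadcast()` join hint (or the automatic broadcast threshold), rather than hand-rolling the lookup like this.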
Catalyst
Catalyst is Spark's SQL optimizer, which leverages advanced programming language features to build an extensible query optimizer. This topic covered anti-patterns such as partially cached DataFrames, user-defined functions and Cartesian products, providing techniques for recognising and avoiding them.
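The Cartesian product anti-pattern is easy to quantify with some back-of-the-envelope arithmetic (the row counts below are made up for illustration): a join that loses its equality condition has to consider every pair of rows, so the work grows multiplicatively rather than linearly.

```python
# Hypothetical table sizes.
left_rows = 1_000_000
right_rows = 50_000

# An equi-join with a hash table does work roughly proportional to the
# larger input, while a Cartesian product compares every pair of rows.
equi_join_work = max(left_rows, right_rows)
cartesian_work = left_rows * right_rows

print(cartesian_work // equi_join_work)  # 50000x more work
```

This is why Catalyst makes you opt in explicitly (e.g. via a cross-join) before it will plan a Cartesian product.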
Tuning/shuffling
This covered techniques for avoiding shuffling data around the cluster; generally this involves restricting queries and partitioning data appropriately.
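Partitioning by key is the core trick here, and it can be sketched in a few lines of plain Python (the rows and partition count are hypothetical): once every occurrence of a key lives in a single partition, a later aggregation or join on that key can run locally instead of shuffling.

```python
NUM_PARTITIONS = 4

def partition_for(key):
    """Hash-partition a key, as a partitioner does inside Spark."""
    return hash(key) % NUM_PARTITIONS

rows = [("alice", 1), ("bob", 2), ("alice", 3), ("carol", 4)]

partitions = [[] for _ in range(NUM_PARTITIONS)]
for key, value in rows:
    partitions[partition_for(key)].append((key, value))

# All rows for a given key now sit in exactly one partition.
```

Co-partitioning both sides of a join on the same key with the same partitioner is what lets Spark skip the shuffle entirely.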
Cluster sizing
General guidelines were provided around sizing a cluster for a given dataset.
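As a rough illustration of the kind of arithmetic involved, here is a back-of-the-envelope sketch. The 128 MB partition size and cores-per-executor figures are common rules of thumb, not numbers from the course, and the dataset size is made up.

```python
# Hypothetical inputs.
dataset_gb = 500
partition_mb = 128            # common target partition size
cores_per_executor = 4        # one task runs per core at a time

# Each partition becomes one task.
num_partitions = (dataset_gb * 1024) // partition_mb  # 4000 tasks

# Executors needed to run every task in a single wave; fewer executors
# simply means the tasks run in multiple waves.
executors_one_wave = num_partitions // cores_per_executor  # 1000 executors

print(num_partitions, executors_one_wave)
```

In practice you would size well below a single wave and let tasks run in several waves, trading job latency against cluster cost.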
We also encountered the Databricks UI. This new interface provides a mechanism to create and maintain clusters, manage your data, create, run and monitor jobs, and create and manage notebooks. It in many ways collates a disparate set of interfaces into a single location, making many aspects of using and managing a Spark cluster much simpler. The day ended with BEER!!! (well, a very entertaining Spark meetup with my new favorite speaker Holden Karau)