Spark Structured Streaming | elastacloud-channels

We have written Spark Streaming applications for previous customers. In Spark Streaming the key abstraction was the DStream, which for all intents and purposes was a sequence of RDDs. This model did not make use of the Tungsten memory optimizations or the Catalyst query optimizer. Apache Spark then developed Spark Structured Streaming to take advantage of these features and bring the efficiency of Spark SQL to real-time streaming.

Structured Streaming debuted in Spark 2.0 as an experimental release and has improved in subsequent versions of Spark. Spark 2.1 added functionality such as metrics, event-time watermarks, additional file formats and Kafka support, but it was still considered experimental. Spark 2.2 made Structured Streaming generally available (GA), improved Kafka support and added support for stateful processing.

There are still many limitations on what you can do with streams, particularly around joins: joining streams is largely unsupported except in a few cases, and then only against static DataFrames/Datasets. One of the most interesting streaming announcements we have come across at Spark Summit is that stream-to-stream joins will be supported in Spark 2.3. Unfortunately we will have to wait a few months for that release, but it is a feature we expect to find useful.