Spark 2.0: Under the Hood

Spark 2.0 is a major release of Spark. It packages structured API improvements (the unification of DataFrame and Dataset, and the new SparkSession entry point), MLlib model export/import, and various updates in the platform libraries. Spark 2.0 is built on Scala 2.11 and supports SQL 2003. It eases development with a streamlined API, faster and more optimized code compilation, and a smarter query optimizer.
Whole-stage code generation improves performance by up to 10x over the existing engine. I/O operations are also optimized through the built-in cache and Parquet storage.

Spark 2.0 continues to support standard SQL and brings a new ANSI SQL parser that extends its querying capabilities drastically over other libraries. On the API side, Spark 2.0 merges the DataFrame and Dataset APIs.

RDDs, DataFrames & Datasets

RDD was the standard abstraction of Spark. From Spark 2.0 on, Dataset becomes the new primary abstraction. The RDD API remains available, but as a low-level API used mostly for runtime and library development.

The Dataset API is a superset of the DataFrame API, which was released in Spark 1.3; in Spark 2.0 the two are unified, with DataFrame becoming an alias for Dataset[Row]. Together they bring better performance and flexibility to the platform compared to the RDD API, and Dataset will also replace RDD as the abstraction for streaming in future releases. Spark 2.0 supports infinite (unbounded) DataFrames, which lets you run the usual operations (aggregate, groupBy, etc.) over a whole stream of data.
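A minimal sketch of the two views side by side, assumed to run in spark-shell; people.json is a hypothetical file with one {"name": ..., "age": ...} object per line.

import spark.implicits._

case class Person(name: String, age: Long)

// Typed view: Dataset[Person] allows compile-time-checked lambdas
val ds = spark.read.json("people.json").as[Person]
ds.filter(p => p.age > 21).show()

// Untyped view: a DataFrame is just Dataset[Row]
val df = ds.toDF()
df.filter("age > 21").show()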

Spark Session 

Spark 2.0 brings a single point of entry: SparkSession, essentially a combination of SQLContext, HiveContext and, in the future, StreamingContext. All the APIs available on those contexts are available on SparkSession as well. Internally, a SparkSession holds a SparkContext for the actual computation.
When you start the interactive shell (spark-shell), a SparkSession is created automatically and exposed as the variable spark.
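Outside the shell, a SparkSession can be built explicitly. A minimal sketch (the app name and master are placeholders):

import org.apache.spark.sql.SparkSession

// Build (or reuse) a session; enableHiveSupport() replaces the old
// HiveContext and needs Hive classes on the classpath
val spark = SparkSession.builder()
  .appName("Spark2Demo")
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()

// The underlying SparkContext is still available for low-level work
val sc = spark.sparkContext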

Here is a sample of using Spark SQL. The reader can pull data from various types of input formats such as CSV, JSON, ORC, Avro, Parquet, text, JDBC, tables, and so on. Below is an example of reading data from Amazon S3.
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3n.awsAccessKeyId", "yourAccessKeyId")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "yourSecretKey")

Set these properties with your Amazon credentials, then read the data via the bucket URL.

val dF = spark.read.json("s3n://bucket-name/*/*/*/*/*/*/*/*.json.gz")
This reads the files in the S3 bucket into a DataFrame and allows you to query them on the fly.



The data is read as a DataFrame, and we can print the schema Spark inferred from it:
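// Show the schema inferred from the JSON files
dF.printSchema()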
This runs up to 10x faster than the previous version, Spark 1.6. The registerTempTable method is now deprecated; we use its replacement, createOrReplaceTempView:

dF.createOrReplaceTempView("logStash")
Once the temporary view is created, queries can be triggered through the SparkSession:


spark.table("logStash")


spark.sql("select Records.eventName,Records.eventTime from logStash").show() 
The query works as shown above; because Records is an array column, the data is returned as a WrappedArray of rows.
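If you want one row per record instead of a WrappedArray, explode can be used. A sketch, assuming the same DataFrame and that Records is an array of structs:

import org.apache.spark.sql.functions.{col, explode}

// Flatten the Records array into one row per record
dF.select(explode(col("Records")).as("r"))
  .select("r.eventName", "r.eventTime")
  .show()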

Spark 2.0 Compiler update 

Performance plays a vital role in Spark's popularity and has led many big data developers to prefer it over existing MapReduce engines. Spark 2.0 updates its execution layer to reduce the number of CPU cycles consumed by I/O operations during query execution. It ships with the second-generation Tungsten engine, which emits optimized bytecode at runtime that collapses an entire query into a single function, eliminating virtual function calls and leveraging CPU registers for intermediate data. This whole-stage code generation promises around a 10x improvement in performance.
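A quick way to see this in action is explain(): operators compiled by whole-stage code generation are marked with a leading asterisk in the physical plan.

// Sum a billion ids; the *-prefixed operators in the plan are
// fused into a single generated function by Tungsten
val query = spark.range(1000L * 1000 * 1000).selectExpr("sum(id)")
query.explain()
query.show()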

Spark Structured Streaming 

The existing Spark Streaming API provides real-time streaming and data processing. It evolved as one of the first attempts in the big data space at unifying batch and streaming computation. Its first streaming API, called DStream, was introduced in Spark 0.7 and offered a highly scalable, fault-tolerant system with high throughput. Spark 2.0 now enables Structured Streaming.

Structured Streaming enables applications to make decisions in real time. The existing system supported just the streaming of data; to add more intelligence to real-time processing, Structured Streaming lets you combine business logic with the data stream on the fly. The streaming API now works as a full stack, so developers no longer need to depend on external applications to apply logic to streams.

Structured Streaming is an extension of the Dataset/DataFrame API, and the newly introduced SparkSession will soon support a streaming context as well. This unification should make adoption easy for existing Spark users, allowing them to leverage their knowledge of the Spark batch API to answer new questions in real time. Key features will include support for event-time-based processing, out-of-order/delayed data, sessionization, and tight integration with non-streaming data sources and sinks.
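A minimal sketch of the new model, the classic streaming word count, assuming a text socket on localhost:9999 as the source (for example, one opened with nc -lk 9999):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("StructuredWordCount").getOrCreate()
import spark.implicits._

// Treat the socket stream as an unbounded DataFrame of lines
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// The same Dataset/DataFrame operations work on the stream
val wordCounts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

// Continuously print the updated counts to the console
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()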



 

Contributors

Sadagopan K V