Spark Structured Streaming

Comparison of Spark Streaming, Structured Streaming, and Kafka Streams

What is a Watermark?

It is a mechanism for handling late data. Essentially, it is a threshold that specifies how long the system waits for late-arriving events. If an event falls within the watermark interval, its data is used in the computation for that time interval; otherwise the event's data is dropped.
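
As a minimal PySpark sketch (the streaming DataFrame events and the columns eventTime and word are assumptions for illustration, not from the original), a 10-minute watermark combined with a windowed aggregation looks like this:

from pyspark.sql.functions import window

# "events" is an assumed streaming DataFrame with an event-time column "eventTime".
# Tolerate data arriving up to 10 minutes late, then count words per 5-minute window.
windowedCounts = events \
    .withWatermark("eventTime", "10 minutes") \
    .groupBy(
        window(events.eventTime, "5 minutes"),
        events.word) \
    .count()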

Unsupported Operations in Spark Structured Streaming

  • Multiple streaming aggregations are not yet supported on streaming Datasets.
  • Limit and taking the first N rows are not supported on streaming Datasets.
  • Distinct operations are not supported on streaming Datasets.
  • Sorting operations are supported on streaming Datasets only after an aggregation and in Complete Output Mode (see the sketch after this list).
  • A few types of outer joins on streaming Datasets are not supported. The list is here.
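
To make the sorting restriction concrete, here is a hedged sketch that assumes a streaming DataFrame wordCounts produced by a groupBy().count() aggregation; sorting it is only accepted together with the Complete output mode:

# "wordCounts" is an assumed streaming DataFrame built from an aggregation.
# Sorting a streaming Dataset is rejected unless it follows an aggregation
# and the query uses outputMode("complete").
query = wordCounts \
    .orderBy("count", ascending=False) \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()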

Details and credits are here

Sample code on Spark Structured Streaming (Apache site)

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

# The application name is illustrative
spark = SparkSession.builder.appName("StructuredStreamingExample").getOrCreate()

# Read text from socket
socketDF = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

socketDF.isStreaming    # Returns True for DataFrames that have streaming sources (a property, not a method)

socketDF.printSchema()

# Read all the csv files written atomically in a directory
userSchema = StructType().add("name", "string").add("age", "integer")
csvDF = spark \
    .readStream \
    .option("sep", ";") \
    .schema(userSchema) \
    .csv("/path/to/directory")  # Equivalent to format("csv").load("/path/to/directory")