Spark Streaming: Streaming Word Count

Streaming data from an HDFS location and processing it to get word counts for new files

We will walk through a simple example of streaming data from an HDFS folder on your Hadoop cluster into Spark, where we will process it to get the word count of new files created at that location. As we add new files to the folder, Spark Streaming picks them up and counts their words. Let's get started.
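Before wiring up the streaming context, it helps to see the core transformation on a plain Scala collection. This is the same flatMap / map / reduce-by-key pattern the streaming code below applies to each batch; the sample lines here are made up for illustration, and groupBy plus sum stands in for Spark's reduceByKey:

```scala
// Word count on an in-memory collection, mirroring the streaming pipeline
val lines = Seq("spark streaming word count", "spark word")
// Split each line into words
val words = lines.flatMap(_.split(" "))
// Pair each word with 1, then sum the 1s per word
// (groupBy + sum is the plain-Scala equivalent of reduceByKey(_ + _))
val wordCounts = words
  .map(x => (x, 1))
  .groupBy(_._1)
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
// wordCounts: Map(spark -> 2, streaming -> 1, word -> 2, count -> 1)
```

In the streaming app, Spark performs this same computation on every 10-second batch of new file contents.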


Prepare Application

Now let's write the code for our streaming app to stream data from an HDFS location.

//import libraries
import org.apache.spark._
import org.apache.spark.streaming._

//App Code
object StreamingWordCount {
  def main(args: Array[String]) {
    //Configure the app; local[2] gives one thread to receive data and one to process it
    val sparkConf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    //Create Streaming Context with 10 seconds batch size
    val ssc = new StreamingContext(sparkConf, Seconds(10))
    //Getting lines from new files created at the HDFS location
    val lines = ssc.textFileStream(args(0))
    //Split lines into words and count each word per batch
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    //Start processing the stream and wait for it to terminate
    ssc.start()
    ssc.awaitTermination()
  }
}

Note: build the StreamingContext from a SparkConf rather than creating a second SparkContext. If one is already running (for example, the sc provided by spark-shell) and you create another, Spark fails with org.apache.spark.SparkException: Only one SparkContext may be running in this JVM. In that case, either reuse the existing context with new StreamingContext(sc, Seconds(10)), or set spark.driver.allowMultipleContexts = true (not recommended).

Run Application

Once your application code is ready, you can call the main function and pass the HDFS location as a parameter, as follows.
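From spark-shell, one way to launch the app is to call its main function directly, passing the HDFS directory to watch. The path below is a made-up example; substitute the folder on your own cluster:

```scala
// Invoke the streaming app with an HDFS directory to watch (hypothetical path)
StreamingWordCount.main(Array("hdfs://localhost:9000/user/spark/streaming-input"))
```

While the app is running, copy new text files into that directory; each 10-second batch will print the word counts for files that appeared since the last batch.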




Dataguy! -


Hi! I am Amit, your DataGuy. Folks call me 'AD' as well. I have worked for over a decade in multiple roles: developer, DBA, data engineer, and data architect.