Posts

Showing posts with the label Scala

Moving Data [Spark Streaming - simple data files]

Spark Streaming a local directory

Introduction: For our streaming exercise, we will create two classes (one being a utility class that initially only handles the log4j properties for our application). We will compile our classes manually and use spark-submit to run our application. Note: for this example, we are creating the logging class inline in our main class. This example therefore helps you slowly get to grips with the dynamics of Spark Scala programming and gradually grows your knowledge. We will not use the "spark/bin/run-example" script that most examples rely on, because I want you to understand step by step what you are doing. Doing it my way also makes it easier for you to alter your code.

Use Case: We are creating a streaming Apache Spark Scala program that watches a directory for new files and counts the number of words in each file.

Example: Create our class FileStream.scala (which contains the StreamingUtility class), which will for this e...
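The excerpt cuts off before the code, so here is a minimal sketch of what such a FileStream.scala could look like. This is an assumption based on the description above, not the post's actual code: the StreamingUtility is reduced to a single log4j call, the watch directory /tmp/streaming-input is a hypothetical placeholder, and the Spark 1.6-era streaming API is assumed.

    import org.apache.log4j.{Level, Logger}
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object FileStream {

      // Inline logging utility (assumption: the post's StreamingUtility
      // similarly just tames the log4j output for the application)
      object StreamingUtility {
        def setLogLevels(): Unit = Logger.getRootLogger.setLevel(Level.WARN)
      }

      def main(args: Array[String]): Unit = {
        StreamingUtility.setLogLevels()

        val conf = new SparkConf().setAppName("FileStream")
        val ssc  = new StreamingContext(conf, Seconds(10)) // 10-second batches

        // Watch a local directory for newly created files
        // (hypothetical path; replace with your own directory)
        val lines = ssc.textFileStream("/tmp/streaming-input")

        // Count the words arriving in each batch of new files
        lines.flatMap(_.split("\\s+")).count().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }

Compiled into a jar, a program like this would be run along the lines of spark-submit --class FileStream filestream.jar, matching the manual compile-and-submit workflow the post describes.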

Moving Data [Apache Spark]

So I decided I'm going to use a real-world example and run some transformations against it. I decided on http://dumps.wikimedia.org , so that I have some nicely sized files with which I can really see the advantages Apache Spark brings to the table. My system is loaded with Apache Spark 1.6.0 and Scala 2.10.5.

Let's do this. First, open the Spark shell:

    spark-shell

Next, load an SQLContext:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc) // sc is an existing SparkContext

Next, import the following packages into the shell session:

    import sqlContext.implicits._
    import org.apache.spark.sql._

Now you can start by loading the data from the "pagecounts-20151222" file into a Resilient Distributed Dataset (RDD). RDDs have transformations and actions; the first() action returns the first element in the RDD.

Now load the data from the file into a new RDD:

    val wikiHits = sc.textFile("/home/osboxes/Downloads/pagecounts-20151222")

Do some ...
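Since the walk-through is truncated, here is a consolidated sketch of the shell session, extended with one illustrative transformation chain. The filter assumes the usual Wikimedia pagecounts line layout of "project page_title view_count bytes" per line, which is an assumption on my part rather than something stated in the excerpt:

    // In spark-shell; sc is the SparkContext the shell provides
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._
    import org.apache.spark.sql._

    // Load the pagecounts dump into an RDD of text lines
    val wikiHits = sc.textFile("/home/osboxes/Downloads/pagecounts-20151222")

    // Actions trigger computation; first() returns a single element
    wikiHits.first()

    // A transformation chain: keep only English-Wikipedia rows
    // (assumes "project page_title view_count bytes" per line)
    val enHits = wikiHits.map(_.split(" "))
                         .filter(f => f.length == 4 && f(0) == "en")
    enHits.count() // another action: the number of matching lines

Transformations like map and filter are lazy; nothing is read from disk until an action such as first() or count() forces evaluation.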