Posts

Showing posts from April, 2016

Design Dimensional Model

The art of designing a dimensional model is exactly that: an art. And, as any reputable artist knows, you can hone your artistic skills through regular exposure to alternative design approaches. Your goal is to hone your skills, and that takes practice. So, even if you know all the techniques of dimensional model design, my advice is: practice this art some more. For some, this is a topic that is "old news". They say: "We know all about designing a dimensional model." And that can largely be true, because you've been working with facts and dimensions for one, two, three, or more years. However, I start to worry about such people when they use statements like: "We should actually use an SCD (slowly changing dimension) on this fact, and not this dimension". Some of you might see this statement as acceptable. However, for purists when it comes to Ralph Kimball's design, that statement is an irritation o
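To see why "an SCD on a fact" jars a Kimball purist, here is a minimal sketch of a Type 2 slowly changing dimension in plain Scala. The `CustomerDim` case class, its fields, and the sample rows are all hypothetical illustrations, not from the original post; the point is that version history lives in the dimension table while fact rows simply reference the surrogate key that was current at transaction time.

```scala
// Hypothetical Type 2 SCD sketch: each attribute change creates a new
// versioned dimension row; the fact table is never "slowly changed".
case class CustomerDim(
  customerKey: Long,   // surrogate key, unique per version
  customerId: String,  // natural (business) key, stable across versions
  city: String,        // tracked attribute
  validFrom: String,   // date this version became effective
  validTo: String,     // "9999-12-31" marks the open-ended current row
  isCurrent: Boolean
)

// When the customer moves city, we close the old row and insert a new one.
val history = Seq(
  CustomerDim(1L, "C001", "London", "2014-01-01", "2016-03-31", isCurrent = false),
  CustomerDim(2L, "C001", "Paris",  "2016-04-01", "9999-12-31", isCurrent = true)
)

// Facts recorded before the move point at customerKey = 1, later facts at
// customerKey = 2, so history is preserved without touching the fact table.
val current = history.filter(_.isCurrent)
```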

Moving Data [Spark Streaming - simple data files]

Spark Streaming a local directory

Introduction

For our streaming exercise, we will create two classes (one being a utility class that initially only handles the log4j properties for our application). We will manually compile our classes and use spark-submit to run our application. Note: for this example, we are creating the logging class inline in our main class. This example therefore helps you slowly get to grips with the dynamics of Spark Scala programming, and gradually grows your knowledge. We will not use the "spark/bin/run-example" script that most examples use, as I want you to understand step by step what you are doing. Doing it my way also makes it easier for you to alter the code.

Use Case

We are creating a streaming Apache Spark Scala program that watches a directory for new files and counts the number of words in each file.

Example

Create our class FileStream.scala (which contains the StreamingUtility class), which will for this exercise,
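The preview cuts off before the code, but the use case above can be sketched roughly as follows, using the Spark 1.6-era streaming API the post targets. The directory path, batch interval, and object name are assumptions for illustration, not the post's actual FileStream.scala:

```scala
// Minimal sketch of a file-watching word count, assuming Spark 1.6 / Scala 2.10.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FileStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FileStream").setMaster("local[2]")

    // Micro-batch interval of 10 seconds (an arbitrary choice for the sketch).
    val ssc = new StreamingContext(conf, Seconds(10))

    // Watch a local directory; each file that appears becomes a stream of lines.
    val lines = ssc.textFileStream("file:///tmp/streaming-input")

    // Split lines into words and count occurrences within each batch.
    val wordCounts = lines.flatMap(_.split("\\s+"))
                          .map(word => (word, 1))
                          .reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Compiled into a jar, this would be launched with spark-submit as the post describes; dropping a text file into /tmp/streaming-input during a batch interval produces per-word counts on the console.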

Moving Data [Apache Spark]

So I decided I'm going to use a real-world example and run some transformations against it. I settled on http://dumps.wikimedia.org , so that I have some nicely sized files with which I can really see the advantages Apache Spark brings to the table. My system is loaded with Apache Spark 1.6.0 and Scala 2.10.5. Let's do this. First, open the Spark shell: spark-shell Next, load an SQLContext: val sqlContext = new org.apache.spark.sql.SQLContext(sc) // sc is the existing SparkContext Next, import the following packages into the shell session: import sqlContext.implicits._ import org.apache.spark.sql._ Now you can start by loading the data from the "pagecounts-20151222" file into a Resilient Distributed Dataset (RDD). RDDs have transformations and actions; the first() action returns the first element in the RDD. Now load the data from the file into a new RDD: val wikiHits = sc.textFile("/home/osboxes/Downloads/pagecounts-20151222") Do some
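The preview ends just as the transformations begin; a plausible continuation inside the same spark-shell session might look like the sketch below. The parsing logic is an assumption based on the space-separated layout of Wikimedia pagecounts files (project, page title, view count, bytes); the variable names beyond wikiHits are illustrative:

```scala
// Continuing in spark-shell, where sc is the existing SparkContext.
val wikiHits = sc.textFile("/home/osboxes/Downloads/pagecounts-20151222")

// first() is an action: it triggers computation and returns the first line.
wikiHits.first()

// Transformations are lazy: nothing runs until an action is called.
// Split each line on spaces and keep only well-formed 4-column records
// (assumed layout: project pageTitle viewCount bytes).
val parsed = wikiHits.map(_.split(" "))
                     .filter(_.length == 4)
                     .map(cols => (cols(0), cols(2).toLong))

// Aggregate views per project, then pull a small sample back to the driver.
val viewsPerProject = parsed.reduceByKey(_ + _)
viewsPerProject.take(5).foreach(println)
```

The take(5) at the end is the action that actually kicks off the whole chain, which is where a large dump file lets you see Spark's distributed execution pay off.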