Posts

Microsoft SQL Server and SSIS cheat sheet

Microsoft SQL Server and SSIS cheat sheet

Below are some short guidelines for when you are regularly working with SSIS and MS SQL Server. Some of these can also be used for knowledge reviews. Are you using:

BULK INSERT
Remember to ensure your text files are delimited with LF (line feed) only, and not CR (carriage return) characters. Open the file in Notepad++ to verify this. Alternatively, do a search using Extended mode for "\r\n" and replace these with "\n".

MERGE
Very handy compared to separate INSERT / UPDATE statements. Using MERGE also means the combination of INSERT / UPDATE is atomic - a big plus. The gains far outweigh the cons regarding performance, which in my experience has been negligible.

XACT_ABORT
A session setting that controls how processing continues after an exception is thrown in your stored procedure. With this setting on, your entire batch is aborted and processing jumps to the catch block of your try/catch logic. …
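To illustrate the MERGE and XACT_ABORT points together, here is a minimal T-SQL sketch. The table and column names (dbo.TargetTable, dbo.StagingTable, Id, Value) are made up for illustration and are not from the post itself.

SET XACT_ABORT ON;  -- abort and roll back the whole transaction on most errors

BEGIN TRY
    BEGIN TRANSACTION;

    -- One atomic upsert instead of separate INSERT / UPDATE statements
    MERGE dbo.TargetTable AS tgt
    USING dbo.StagingTable AS src
        ON tgt.Id = src.Id
    WHEN MATCHED THEN
        UPDATE SET tgt.Value = src.Value
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (Id, Value) VALUES (src.Id, src.Value);

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF XACT_STATE() <> 0
        ROLLBACK TRANSACTION;
    THROW;  -- re-raise so the caller (e.g. an SSIS package) sees the failure
END CATCH;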

Notepad++ Regular expressions

Notepad++ regular expressions

For those of you that regularly use Notepad++, you will have come across the "Regular expression" option on the Replace window. I'm sure these will work in older versions too, but I've tested and been using them against version 6.9. As a developer, I've used this option to replace elements in my text files (example: replacing invalid date delimiter characters). Let me give you some tips on finding and replacing characters in files needed for processing:

Find what: (\d{4}+)\.+(\d{2}+)\.(\d{2}+)
Replace with: \1-\2-\3

Synopsis: We wish to search for a number (a date in our case) which starts off with 4 digits, followed by a decimal point, followed by 2 digits, followed by a decimal point, followed by 2 digits. We then wish to replace what was found with these groups of numbers, separating them with a hyphen character. The plus + symbol after each character group ensures the separate groups are looked at as …
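If you ever need to apply the same clean-up outside Notepad++, here is a minimal sketch of the equivalent find-and-replace scripted in Scala, using a simplified (non-possessive) form of the pattern above; the sample input line is made up for illustration.

object DateDelimiterFix {
  def main(args: Array[String]): Unit = {
    // Simplified equivalent of (\d{4}+)\.+(\d{2}+)\.(\d{2}+)  ->  \1-\2-\3
    val datePattern = """(\d{4})\.(\d{2})\.(\d{2})""".r
    val line = "Processed on 2016.03.21 at head office"   // made-up sample input
    val fixed = datePattern.replaceAllIn(line, "$1-$2-$3")
    println(fixed)                                         // Processed on 2016-03-21 at head office
  }
}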

Code Reviews

Code Review Checklist

Code reviews are a very important part of the development process, in any environment. They are another safety net in ensuring a software deliverable of high quality. These are some points to keep in mind about code reviews. Take note of what needs to happen before, during and after a code review:

Make sure that your code is self-explanatory and has an adequate amount of code documentation and comments. These comments need to describe the problem the code is addressing in a clear and concise manner.

Automated tests are good for testing chunks of functionality. Include automated tests with your changes to make it easier to validate that your code works. In a SQL-heavy environment, calling the different stored procedures with specific input parameter values and inspecting the return result, or the result stored in the db, is a good midway point in testing the code (see the sketch below).

Your code should be sitting in source control.

Before a code review commences: set a date and time for the review to commence, and …
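As a hedged illustration of the stored-procedure testing mentioned above, here is a minimal T-SQL sketch; the procedure name (dbo.usp_GetCustomerOrders), its parameter and the expected outcome are hypothetical and only show the shape of such a check.

-- Hypothetical smoke test: call the SP with known input values and inspect the outcome
DECLARE @ReturnCode INT;

EXEC @ReturnCode = dbo.usp_GetCustomerOrders @CustomerId = 42;

IF @ReturnCode <> 0
    RAISERROR ('usp_GetCustomerOrders smoke test failed', 16, 1);

-- For SPs that write to the db, follow up with a SELECT against the affected table
-- and compare the stored result to the values you expect.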

Deadlocks by Design

Deadlocks by Design

In my previous posts, I talked a little about deadlocks and how to handle them. In this post I focus on handling deadlocks through proper design. To lighten the load on support developers in a complex environment such as retail, where deadlocking of resources tends to be the norm, designing for this scenario is important.

One way to tackle this problem is to retry the transaction. This is a fairly common approach to the issue. It also makes sense as a first step to solving the problem, because we don't change indexes on the tables, which might trigger a more complex integration testing phase. The basic pattern for implementing a retry is depicted below - I've implemented this using MS SQL Server, but the pattern remains the same. Please note that it is also better to log the retry of the transaction in a table in the db, so that further opportunities exist for easier analysis of the event, because the information is stored in the db as opposed to a …
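Below is a minimal sketch of such a retry pattern in T-SQL. The unit of work and the table names (dbo.Accounts, dbo.RetryLog) are hypothetical placeholders; only the retry-on-deadlock structure (error 1205) and the logging of each retry reflect the pattern described above.

DECLARE @Retry INT = 0, @MaxRetries INT = 3;

WHILE @Retry < @MaxRetries
BEGIN
    BEGIN TRY
        BEGIN TRANSACTION;

        -- Hypothetical unit of work that may deadlock
        UPDATE dbo.Accounts SET Balance = Balance - 100 WHERE AccountId = 1;
        UPDATE dbo.Accounts SET Balance = Balance + 100 WHERE AccountId = 2;

        COMMIT TRANSACTION;
        BREAK;  -- success, leave the retry loop
    END TRY
    BEGIN CATCH
        IF XACT_STATE() <> 0
            ROLLBACK TRANSACTION;

        IF ERROR_NUMBER() = 1205 AND @Retry < @MaxRetries - 1
        BEGIN
            SET @Retry += 1;

            -- Log the retry in the db for later analysis
            INSERT INTO dbo.RetryLog (EventTime, ErrorNumber, RetryAttempt)
            VALUES (SYSDATETIME(), 1205, @Retry);

            WAITFOR DELAY '00:00:01';  -- brief back-off before retrying
        END
        ELSE
            THROW;  -- not a deadlock, or retries exhausted
    END CATCH;
END;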

Design Dimensional Model

Design Dimensional Model

The art of designing a dimensional model is exactly that. An art. And, as any reputable artist will know, you can hone your artistic skills with regular exposure to alternative design approaches. Your goal is to hone your skills, and that takes practice. So, if you know all the techniques in designing a dimensional model, my advice is: practice this art some more.

For some, this is a topic that is "old news". They say: "We know all about designing a dimensional model." I say, well, this can largely be true, because you've been working with facts and dimensions for 1, 2, 3 or more years. However, I start to worry about such ones when they use statements like: "We should actually use an SCD (slowly changing dimension) on this fact, and not this dimension". Some of you might see this statement as acceptable. However, for people who are purists when it comes to Ralph Kimball's design, that statement is an irritation …

Moving Data [Spark Streaming - simple data files]

Spark Streaming a local directory

Introduction
For our streaming exercise, we will create two classes (one being a utility class that initially only handles the log4j properties for our application). We will manually compile our classes and use spark-submit to run our application. Note: for this example, we are creating the logging class inline in our main class. This example therefore helps you get to grips with the dynamics of Spark Scala programming and slowly grows your knowledge. We will not use the "spark/bin/run-example" script that most examples use, as I want you to understand step by step what you are doing. Doing it my way also makes it easier for you to alter the code.

Use Case
We are creating a streaming Apache Spark Scala program that watches a directory for new files and counts the number of words in each file.

Example
Create our class FileStream.scala (which contains the StreamingUtility class), which will, for this exercise, …
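A minimal sketch of what FileStream.scala could look like follows; the watched directory path and the 10-second batch interval are assumptions made for illustration, and the log4j handling of the StreamingUtility class is omitted to keep the sketch short.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FileStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FileStream").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))   // assumed 10 second micro-batches

    // Watch a local directory for newly created text files (assumed path)
    val lines = ssc.textFileStream("file:///home/osboxes/stream-in")

    // Count the words arriving in each batch
    val wordCounts = lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Once compiled against the Spark jars, the resulting jar can be run with spark-submit as mentioned in the introduction.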

Moving Data [Apache Spark]

So I decided I'm going to use a real-world example and do some transformations against it. I decided on http://dumps.wikimedia.org, so that I have some nicely sized files with which I can really see the advantages Apache Spark brings to the table. My system is loaded with Apache Spark 1.6.0 and Scala 2.10.5. Let's do this:

First, open the Spark shell:
spark-shell

Next, load an SQLContext:
val sqlContext = new org.apache.spark.sql.SQLContext(sc) // sc is an existing SparkContext

Next, import the following packages into the shell session:
import sqlContext.implicits._
import org.apache.spark.sql._

Now you can start by loading the data from the "pagecounts-20151222" file into a Resilient Distributed Dataset (RDD). RDDs have transformations and actions; the first() action returns the first element in the RDD.

Load the data from the file into a new RDD:
val wikiHits = sc.textFile("/home/osboxes/Downloads/pagecounts-20151222")

Do some …
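Picking up where the excerpt cuts off, here is a minimal sketch of some first transformations and actions on the wikiHits RDD, assuming the standard pagecounts line format of project, page title, request count and bytes, separated by spaces:

// Basic actions
wikiHits.count()   // number of lines in the dump file
wikiHits.first()   // first record in the RDD

// Split each line into its fields: project, page title, requests, bytes
val fields = wikiHits.map(_.split(" ")).filter(_.length == 4)

// Keep only English Wikipedia entries and list the ten most requested pages
val topEnPages = fields
  .filter(f => f(0) == "en")
  .map(f => (f(1), f(2).toLong))
  .top(10)(Ordering.by(_._2))

topEnPages.foreach(println)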