Posts

Showing posts from 2016

Notepad++ Regular expressions

Notepad++ regular expressions For those of you who regularly use Notepad++, you will have come across the "Regular expression" option on the Replace window. I'm sure these expressions will work for older versions, but I've tested and been using them against version 6.9. As a developer, I've used this option to replace elements in my text files (for example, replacing invalid date delimiter characters). Let me give you some tips on finding and replacing characters in files needed for processing: Find what : (\d{4}+)\.(\d{2}+)\.(\d{2}+) Replace with : \1-\2-\3 Synopsis : We wish to search for a number (e.g. a date, in our case) which starts off with 4 digits; followed by a decimal point; followed by 2 digits; followed by a decimal point; followed by 2 digits. We then wish to replace what was found with these groups of numbers, separating them with a hyphen character. The plus + symbol after each character group ensures the separate groups are looked at as
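As a quick sanity check of the same pattern outside Notepad++, here is a minimal Scala sketch of my own (not from the post). Java/Scala regex also accepts the possessive {n}+ quantifiers, but the replacement syntax uses $1..$3 rather than Notepad++'s \1..\3:

// Illustration only: apply the same find/replace in the Scala REPL.
val datePattern = """(\d{4}+)\.(\d{2}+)\.(\d{2}+)"""
val sample      = "Invoice dated 2016.03.07, settled 2016.03.21"
println(sample.replaceAll(datePattern, "$1-$2-$3"))
// prints: Invoice dated 2016-03-07, settled 2016-03-21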

Code Reviews

Code Review Checklist Code reviews are a very important part of the development process in any environment. They are another safety net in ensuring a software deliverable of high quality. These are some points to keep in mind about code reviews; take note of what needs to happen before, during and after a code review: Make sure that your code is self-explanatory and has an adequate amount of code documentation and comments. These comments need to describe the problem being addressed in a clear and concise manner. Automated tests are good for testing chunks of functionality; include automated tests with your changes to make it easier to validate that your code works. In a SQL-heavy environment, calling the different stored procedures with specific input parameter values and inspecting the return result, or the result stored in the database, is a good midway point in testing the code. Your code should be sitting in source control. Before a code review commences: Set a date and time for the review to commence, and sen

DeadLocks by Design

Deadlocks by Design In my previous posts, I talked a little about deadlocks and how to handle them. In this post I focus on handling deadlocks through proper design. To lighten the load on support developers in a complex environment such as retail, where deadlocking of resources tends to be the norm, designing for this scenario is important. One way to tackle this problem is to retry the transaction. This is a fairly common approach to the issue. It also makes sense as a first step to solving the problem, because we don't change indexes on the tables, which might trigger a more complex integration testing phase. The basic pattern for implementing a retry is depicted below - I've implemented this using MS SQL Server, but the pattern remains the same. Please note that it is also better to log the retry of the transaction in a table in the database, so that further opportunities exist for easier analysis of the event, because the information is stored in the database as opposed to a
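As a rough illustration of the retry idea (my own sketch, not the post's T-SQL), a client-side version in Scala over JDBC against SQL Server could look like this; error number 1205 is SQL Server's deadlock-victim error, and the names here are illustrative:

import java.sql.SQLException

// Minimal sketch of a deadlock-retry wrapper (illustration, not the post's T-SQL pattern).
object DeadlockRetry {
  def withRetry[T](retriesLeft: Int)(work: => T): T =
    try work
    catch {
      case e: SQLException if e.getErrorCode == 1205 && retriesLeft > 0 =>
        // In a real system, also log the retry to a table for later analysis.
        Thread.sleep(200)
        withRetry(retriesLeft - 1)(work)
    }
}

// Hypothetical usage, assuming stmt is a java.sql.Statement on an open connection:
// val rows = DeadlockRetry.withRetry(3) {
//   stmt.executeUpdate("UPDATE Stock SET Qty = Qty - 1 WHERE Sku = 'ABC123'")
// }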

Design Dimensional Model

Design Dimensional Model The art of designing a dimensional model is exactly that: an art. And, as any reputable artist knows, you can hone your artistic skills with regular exposure to alternative design approaches. Your goal is to hone your skills, and that takes practice. So, if you know all the techniques in designing a dimensional model, my advice is: practice this art some more. For some this is a topic that is "old news". They say: "We know all about designing a dimensional model." I say, well, this can largely be true, because you've been working with facts and dimensions for 1, 2, 3 or more years. However, I start to worry about such ones when they use statements like: "We should actually use an SCD (slowly changing dimension) on this fact, and not this dimension". Some of you might see this statement as acceptable. However, for people who are purists when it comes to Ralph Kimball's design, that statement is an irritation o

Moving Data [Spark Streaming - simple data files]

Spark Streaming a local directory Introduction  For our streaming exercise, we will create two classes (one being a utility class that initially only handles the log4j properties for our application). We will manually compile our classes and use spark-submit to run our application. Note: for this example, we are creating the logging class inline in our main class. This example therefore helps you slowly get to grips with the dynamics of Spark Scala programming and gradually grows your knowledge. We will not use the "spark/bin/run-example" script that most examples use, as I want you to understand step by step what you are doing. You are also more easily able to alter your code by doing it my way. Use Case  We are creating a streaming Apache Spark Scala program that reads a directory for new files and counts the number of words in each file. Example  Create our class FileStream.scala (which contains the StreamingUtility class), which will for this exercise,
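For reference, a minimal sketch of such a streaming word count (assuming Spark 1.x; the object name, directory path and 10-second batch interval are my own illustration, not the post's actual FileStream.scala):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal sketch: watch a local directory and count words in newly arriving files.
object FileStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FileStreamSketch").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(10))          // 10-second batches

    val lines = ssc.textFileStream("file:///tmp/stream-in")     // picks up new files only
    val wordCounts = lines.flatMap(_.split("\\s+"))
                          .map(word => (word, 1))
                          .reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

The real class in the post also carries the StreamingUtility log4j setup, which is omitted here.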

Moving Data [Apache Spark]

So I decided I'm going to use a real-world example and do some transformations against it. I decided on http://dumps.wikimedia.org, so that I have some nicely sized files with which I can really see the advantages Apache Spark brings to the table. My system is loaded with Apache Spark 1.6.0 and Scala 2.10.5. Let's do this: First, open the Spark shell: spark-shell Next, load an SQLContext: val sqlContext = new org.apache.spark.sql.SQLContext(sc) // sc is an existing SparkContext  Next, import the following packages into the shell session: import sqlContext.implicits._ import org.apache.spark.sql._ Now, you can start by loading the data from the "pagecounts-20151222" file into a Resilient Distributed Dataset (RDD). RDDs have transformations and actions; the first() action returns the first element in the RDD.  Now load the data from the file into a new RDD: val wikiHits = sc.textFile("/home/osboxes/Downloads/pagecounts-20151222") Do some
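As a rough sketch of the kind of transformations one might run next on that RDD (my own illustration, assuming the standard pagecounts line format of project, page title, request count and bytes transferred):

wikiHits.first()                                     // action: peek at the first record

// transformation pipeline: keep the English project only, keyed by page title
val enHits = wikiHits
  .map(_.split(" "))
  .filter(fields => fields.length == 4 && fields(0) == "en")
  .map(fields => (fields(1), fields(2).toLong))      // (page title, request count)

// top 10 most requested English pages in this dump
enHits.sortBy(_._2, ascending = false).take(10).foreach(println)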

RSS feeds to your webpage [part 2]

Have a look at your management settings in https://analytics.google.com/ on the Admin tab. This process goes slowly, but it helps to know why blank ads are shown on your website: 1. Google does preliminary checks on your website to ensure it passes their checks for conformity, etc. 2. Your website pages include the ad code generated by AdSense. After you have passed these steps and you notice your Ad Unit status being "New", you need to be patient. Google could take a few days to approve your application. You will receive a notification via email on the outcome. At the same time, if approved, your website will start showing ads. That then means the cash can start rolling in.... Yippee!! Happy income generation, people. www.silvafox.co.za

RSS feeds to your webpage [part 1]

So you would like to add some dynamism to your static HTML pages. Well, ensure your hosted website service provider supports PHP scripting, which would be strange if they did not. Download the rss2html file to which you will apply your changes. Download from this location. Unzip this file and add it to the root of your website. Here are your checkpoints: Ensure your feed is reachable via the following link: http://<blogname>.blogspot.com/feeds/posts/default?alt=rss. Ensure you replace <blogname> with your blog name. Modify your rss2html.php file as below: Configuration of rss2html Unzip the folder. Open the rss2html-docs.txt file in Notepad++ or another plain text editor and read the instructions. It's actually very simple to implement by modifying a few lines of code in the rss2html.php file: Tell the script where to find your RSS Feed: $XMLfilename = "http://your_blog.blogspot.com/feeds/posts/default?alt=rss"; Tell the script where to f
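As a quick way to exercise the first checkpoint (that the feed URL is reachable), here is a small standalone Scala sketch of my own, separate from the rss2html.php setup; replace <blogname> before running:

import scala.xml.XML

// Illustration only: fetch the blog's RSS feed and print the item titles.
object FeedCheck {
  def main(args: Array[String]): Unit = {
    val feedUrl = "http://<blogname>.blogspot.com/feeds/posts/default?alt=rss"
    val feed    = XML.load(new java.net.URL(feedUrl))   // throws if the feed is unreachable
    (feed \\ "item" \ "title").foreach(title => println(title.text))
  }
}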

"Bigness" data - some thoughts about the big data journey

The " Bigness " data. Ideas of Big Data If you are reading this blog, you are likely to be a nerd. Everywhere - Yes, data is everywhere, and the volumes are exponentially increasing by day. There exists a saying: "Where there is a will there is relatives". Are your measures trustworthy? Ensure their integrity, but at the same time give attention to the training needed to its users. Data Detectives - The cool part for me of data science, is the detecting fraud part. Fraudsters love lots and lots of data. The more the merrier. Its also easier to hide abnormalities when lots of data are present. Therefore, if you find your organization at the stage where you are IYO consuming and processing big data, then its time you consider employing a data detective. Machine Learning - This will only be useful if you actually start using what your system is learning about trends, etc OODA Loop - What came first? The egg or the chicken. This scenario gives ri

High speed data loading with SSIS

Tasks or Containers should be run in parallel. Container A below illustrates the parallelism we want to achieve; Container B illustrates a sequential workflow. In the package properties, you want to play with the property MaxConcurrentExecutables. With it, you can configure how many executables run in parallel. Other things that are very important to keep in mind are: Reduce the number of columns - Yes, ensure you reference only the columns that you actually need. A lot of wasted space is saved by following this golden rule. Reduce the number of rows - If you are loading from a flat file, split large files where possible. Reduce column width - Here the data type is VERY important. Overkill in data type usage is a common problem. The smaller the data type (ensure you are not truncating important field information :-) ), the more efficient your processing will be. Retrieve data using sp_executesql - When you are retrieving data, you may find you can directly using the Tabl

Importing a database dump in MySQL

The below command is one that you need to remember: mysql -u <user> -p < db_backup.dump I just fired up the Windows command line, ensured my local MySQL server was running, and then ran the above command to do the import. Remember that you may specify the exact path to the dump file as well, if it does not reside in the working directory of MySQL. Also note that the .dump file could also have originally been exported as a .sql file, in which case the process is the same. A 400 MB database dump took about 6 minutes to load on my system. If you are running an SSD setup, expect much faster load times :-) Happy importing and exporting, people. www.silvafox.co.za

Moving Data [part 2]

So we've put off diving into the world of Apache Spark, purely because we were used to the other tools in industry. One could say we were in a comfort zone with, for the most part, vendor-specific load functions like LOAD DATA INFILE (MySQL), etc. We could move data, and we weren't interested in looking into new ways of doing things. However, since we've entered the world of transformations and actions with Apache Spark, it doesn't seem like we will be looking back. It's a skill set that is rare in industry at this stage, and it's sought after. Come talk to us about moving terabytes and terabytes of data in the most efficient way possible. www.silvafox.co.za

Moving Data [part 1]

Recent data transformation needs have driven us to re-look at the types of offerings we give our clients. Don't get me wrong, we love building reports. However, what businesses want is intelligence from their valuable data. This is still a challenge today. Stay tuned for how we used Apache Spark to move data in a way no other tool could before it. www.silvafox.co.za