High speed data loading with SSIS

Tasks and containers should run in parallel where possible. Container A below illustrates the parallelism we want to achieve, while Container B illustrates a sequential workflow:
At the package level, tune the MaxConcurrentExecutables property, which controls how many executables can run in parallel. (The default value of -1 allows the number of logical processors plus two.)

Other things that are very important to keep in mind are:
  1. Reduce the number of columns - Reference only the columns you actually need. Following this golden rule saves a lot of wasted buffer space.
  2. Reduce the number of rows - If you are loading from a flat file, split large files where possible.
  3. Reduce column width - Data types are VERY important here. Oversized data types are a common problem. The smaller the data type (as long as you are not truncating important field information :-) ), the more efficient your processing will be.
  4. Retrieve data using sp_executesql - When retrieving data, you may be tempted to use the Table or View access mode directly. Instead, use the SQL command mode and write a SELECT statement, which SSIS submits via sp_executesql. This is faster than the OPENROWSET command that the Table or View option generates.
  5. Use the Fast Parse property - This property boosts performance for Flat File sources. You cannot use it, however, if you need to support locale-specific formats for integer, date, and time data types.
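As a sketch of point 4, here is the kind of query you would put in SQL command mode instead of selecting the whole table (the table and column names are hypothetical):

```sql
-- Avoid the Table or View access mode, which SSIS turns into an
-- OPENROWSET over the entire table. With SQL command mode, SSIS
-- submits the statement via sp_executesql, and you can trim
-- columns and rows at the source.
SELECT OrderID, CustomerID, OrderDate, TotalDue  -- only the columns you need
FROM dbo.SalesOrders
WHERE OrderDate >= '2015-01-01';                 -- filter rows early
```

This also puts points 1 and 2 (fewer columns, fewer rows) into practice in the same statement.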
Keeping transformations in buffers - Try to keep consecutive transformations of the same type (row-based transformations). That way, when one transformation completes, its buffer becomes available for processing by the next transformation in the data flow. If the next transformation is also row-based, the data stays in the same buffer, and data flow performance is optimal. If the next transformation is asynchronous, a new buffer is required, and the data is copied from one memory buffer to the next - even in this case, it's still faster than I/O operations that store data on disk between transformations. With heavy buffer usage, keep an eye out for out-of-memory problems, where no further buffers can be created because of memory limitations.

If the bottleneck is your transformations - Try the following to improve performance:
  1. Perform work in the source query - Data type casting, cleansing (using the ISNULL or TRIM functions, for example), aggregations, and sorts are common operations you can push into the source query. Doing so takes advantage of the database engine's capabilities, especially if the database server and the execution server are separate.
  2. Reduce the number of rows - Use a WHERE clause (for relational sources), or add a Conditional Split to filter data early in the data flow.
  3. Remove unneeded columns to save space
  4. Use the Full Cache option - This places all lookup rows in memory. Remember that the cache must be loaded before pipeline processing can continue.
  5. Preload the cache for Lookup transformations - Use the Cache Transform transformation to preload the cache in a separate data flow.
  6. Replace the Slowly Changing Dimension (SCD) transformation with a Merge Join transformation configured as a left outer join, which is often much faster.
  7. Use an Execute SQL task instead of an OLE DB Command transformation
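Point 1 above can be sketched as a single source query; the table and column names here are hypothetical:

```sql
-- Push casting, cleansing, filtering, and aggregation into the
-- source query so the database engine does the work instead of
-- the SSIS pipeline:
SELECT
    CAST(c.CustomerID AS int)     AS CustomerID,   -- explicit cast
    LTRIM(RTRIM(c.CustomerName))  AS CustomerName, -- cleansing
    ISNULL(c.Region, 'Unknown')   AS Region,       -- NULL handling
    SUM(o.TotalDue)               AS TotalSales    -- aggregation
FROM dbo.Customers AS c
JOIN dbo.Orders AS o
    ON o.CustomerID = c.CustomerID
WHERE o.OrderDate >= '2015-01-01'                  -- filter rows early
GROUP BY
    CAST(c.CustomerID AS int),
    LTRIM(RTRIM(c.CustomerName)),
    ISNULL(c.Region, 'Unknown');
```

Every row and column this query removes or cleans is work the data flow no longer has to do.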
If the bottleneck is the destination component - Improve the data flow design with the recommendations below:
  1. Optimize the OLE DB destination - When a table or view is your target, use the Fast Load option, which performs a bulk insert instead of row-by-row inserts.
  2. Use the SQL Server destination instead of the OLE DB destination - The SQL Server destination can be up to 25 percent faster because the pipeline engine can bypass the network layer during the data load. However, the package must execute on the same server as the target database, and all data types in the pipeline must match the data types in the target table.
  3. Set data types explicitly - Cast data types explicitly in the source query for a relational source, or manually edit the data type for each column in the Flat File Connection Manager for a flat file source.
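Point 3 can be sketched as follows; the staging table and the target column types are hypothetical:

```sql
-- Cast each column in the source query to match the destination
-- table's data types exactly, so the pipeline carries no wider
-- types than the target actually needs:
SELECT
    CAST(ProductID   AS int)           AS ProductID,
    CAST(ProductName AS nvarchar(50))  AS ProductName,
    CAST(ListPrice   AS decimal(9, 2)) AS ListPrice
FROM dbo.StagingProducts;
```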
Buffer efficiency
Enable the BufferSizeTuning logging event, then execute your package and examine the log to see how many rows were placed in each buffer. Compare this with the DefaultBufferMaxRows property: if the buffers are holding fewer rows than that setting allows, increase the buffer size (DefaultBufferSize) so that each buffer can hold more rows. Also experiment with the EngineThreads property, which should be set in line with the number of processors on your server.
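If you log to SQL Server, the BufferSizeTuning messages can be read back with a query along these lines (assuming the default dbo.sysssislog log table; 'User:BufferSizeTuning' is how the custom event appears in the log):

```sql
-- Inspect BufferSizeTuning messages after a package run.
-- Assumes the SQL Server log provider writing to dbo.sysssislog.
SELECT starttime, message
FROM dbo.sysssislog
WHERE event = 'User:BufferSizeTuning'
ORDER BY starttime;
```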

As you think about your design, pay close attention to how it will affect performance. Yes, there is more than one way to skin this cat. However, take the time to understand the various approaches and how each one affects performance.

Peace out people.


