Tuesday, March 15, 2011

Performance Improvement in SSIS

(E) EXTRACT IMPROVEMENT


1) If there is a Flat file source / Derived Column Transformation, then set "Fast Parse" to "True.
    - It is available only in Flat File Source & Derived Column Transformation
    - It is specified at column level
    - Default value is False,
    - When we set it true, it will avoid some kind of pre-execute validations and considers all your data fine


    Steps

  1. Right-click the Flat File source or Data Conversion transformation, and then click Show Advanced Editor.
  2. In the Advanced Editor dialog box, click the Input and Output Properties tab.
  3. In the Inputs and Outputs pane, click the column for which you want to enable fast parse.
  4. In the Properties window, expand the Custom Properties node, and then set the FastParse property to True.
  5. Click OK.

2) Set packet size to 32767 for Connection Manager.
    - This will bump up the packet size from 4K (which is default)
    - This needs network admin to enable "Jumbo Frames"




3) In OLEDB source, use T-SQL Query instead of table as a direct input 
    - This will allow you to choose specific columns instead of pulling all the columns
    - We can specify nolock which avoids locking the table
    - We can use sort, group by, joins, forumlated columns instead of using different transformations like Sort,       
       Merge Join, Derived Column, Aggregate transformations.


4) In Cache connection manager, try to use create a file instead of using memory


5) If same OLEDB source connection, you are using at multiple places, then set "RetainSameConnection" property to "True"


     - This will allow engine to use the same connection every time
     - Default value : False. This will create connection - get data - close connection every time. 
     - by making it to TRUE, above activities will be done only once.




6) Divide source into a chunk instead of having a single master pool.  


(T) TRANSFORM IMPROVEMENT


1) Use Transformation based on the usage and buffer matrix


Behind the scenes, the data flow engine uses a buffer-oriented architecture to efficiently load and manipulate datasets in memory.
  • Row Transformations - 
    - They either manipulate data / create new fields using the data that is available in that row. 
    - They might create new columns but not new rows
    - Each output row has a 1:1 relationship with an input row
    - Also known as synchronous transformations
    Uses existing buffer rather than new buffer
    - Examples -  Derived Column, Data Conversion, Multicast, and Lookup. 

  • Partially blocking transformations 
    - They are often used to combine datasets using multiple data inputs. 
    - As a result, their output may have the same, greater, or fewer records than the total number of input records. 
    - Also known as asynchronous transformations
    - Output of the transformation is copied into a new buffer and a new thread may be introduced into the data flow
    - Examples - Merge, Merge Join, and Union All. 

  • Blocking transformations 
    - They must read and process all input records before creating any output records. 
    - They perform the most work and can have the greatest impact on available resources rather than above 2 categories
    - Also known as asynchronous transformations
    - With partially blocking transformations, the output of the transformation is copied into a new buffer and a new thread may be introduced into the data flow
    - Example - Aggregate and Sort. 


2) Use MERGE statement when you have to do INSERT-UPDATE-DELETE from source to target instead of using multiple transformations


3) Choice of type of cache inside Lookup Transformation
     - Full Cache : for small dataset
     - No Cache : for volatile dataset
     - Partial Cache : for large dataset


4)  Sort, Merge Join, Union All, Pivot, Aggregation SCD, Data Conversion can be easily replaced by normal T-SQL


      - There will be much more control on all the objects
      - T-SQL operation will be much more faster than SSIS Transformations because all the 
         buffers won't be used.



5) Make datatype as narrow as possible so that they will allocate less memory




(L) LOAD IMPROVEMENT


1) Try to execute the package on your destination server, rather than source server. 


      - LOAD is expensive operation than EXTRACT
      - So we can execute the package on the same server as destination server


2) Make a smart choice between Dropping/Keeping Index


      - It is not necessary to keep index always OR drop index always before you load.
      - If there is a clustered index, don't drop because data is sorted using this key. And dropping  
        and rebuilding clustered index will take even more time.
      - If there is a single non-clustered index and you expect more than 100% new data, then 
        dropping and re-creating index will help.
      - If there are multiple non-clustered index, probably leave them as it is.


   But these are not thumb rules, trial and error will always give you the best result.


3) If there is a huge huge load on destination, probably partitioning a table will help


4) If there is a huge huge load on destination, probably partitioning a table will help


5) Setting proper value of "Rows per batch" &  "Maximum Insert Commit Size"




Rows per batch - how many rows you want to send to insert the data
Maximum insert Commit Size -  how may rows you want to commit in one shot
      - If the value is 2147483647, these many rows will be committed in one single transaction and 
        they will be committed.
      - If you really have these many rows to load, better you define proper value in this commit 
        size. Let's say if you define 100000, then 1 lac rows will be committed in one shot. A huge 
        DML operation in one single transaction will degrade the performance.
      - If it is 0, it means, a package might stop responding, if the same table is being used by 
        some other source.





3 comments:

  1. How do you manage/configure "Rows per batch" & "Maximum Insert Commit Size" if you are NOT using an OLE DB destination connection manager? (SSIS and SQL Server are on the same machine.)
    Thanks

    ReplyDelete
    Replies
    1. I am going to guess that you would do this within the select / insert statement itself using a loop and set rowcount, etc.

      Delete
  2. Poor article, some generic recommendations don't apply in all cases. Unclear recommendations: it's better to give one clear recommendation and explain it when it is good or bad; than a bunch of unclear recommendations, some of which may actually cause trouble.

    ReplyDelete