Hands-On Microsoft SQL Server 2008 Integration Services, Part 42


In the Component Properties tab, you specify common and custom properties for this component. The properties you can specify are:

- The number of seconds before the command times out, in the CommandTimeout field. By default, this field has a value of 0, which indicates an infinite timeout.
- The code page value in the DefaultCodePage field, used when the data source is unable to provide the code page information.
- The LocaleID, which can be changed from the default value to any other Windows locale ID.
- The ValidateExternalMetadata value, which can be changed from the default value of True to False if you do not want the component to be validated during the validation phase.
- The SQLCommand field, where you type the SQL statement that this transformation runs for each row of the input data. You can use parameters in your SQL statement and map these parameters to the input columns, so that the SQL statement modifies the data made available by the OLE DB connection manager based on the values in the input columns. By default, these parameters are named Param_0, Param_1, and so forth, and you cannot change these names. However, if you use a stored procedure with SQL variables in it, you can use the variable names, which makes the parameter mappings easier. This option is explained later in the chapter (refer to Figure 10-9).

The mappings between the parameters used in the SQL statement and the input columns are defined in the Column Mappings tab. In this tab, after typing the SQL statement in the SQLCommand field, you will see the Available Destination Columns populated with parameters for you. (You may have to click Refresh to see the parameters.) The automatic population of parameters depends on the capabilities of the OLE DB provider you specified earlier. For some third-party OLE DB providers that do not support deriving parameter information from the SQL statement, you will need to manually create parameter columns in the External Columns node of OLE DB Command Input by going to the Input And Output Properties tab, assigning them names such as Param_0, Param_1, and so on, and specifying a value of 1 for the DBParamInfoFlags custom property of each column. You can then manually create mappings between the Available Input Columns and the Available Destination Columns by using the drag-and-drop technique.

Split and Join Transformations

These transformations can create multiple copies of input data, split input data into one or more outputs, merge multiple inputs, or add columns to the pipeline by looking up exact matches in a reference table. After reading through the descriptions of these split and join transformations, you will complete your first Hands-On exercise for this chapter, which also covers some of the Row transformations.

Conditional Split Transformation

Sometimes you will need to work with a particular data set—i.e., a subset—separately from the rest of the data coming through the pipeline. For example, you may want to apply different business rules to different types of customers. In such cases, you will need to split the data into multiple data sets matching specific criteria, using conditions defined in the Conditional Split transformation. This transformation allows you to create more than one output and assign a condition (filter) to each output for the type of data that can pass through it.
Among the outputs, one output has to be a default output for the rows that meet no criteria. When an input data row hits the Conditional Split transformation, the transformation passes the row through a set of conditions one by one and routes it to the output whose criteria it matches first. Each row can be diverted to only one output, and that output is the first one for which the condition evaluates to True. This transformation has one input, one error output, and, as you can make out, multiple outputs.

The user interface of the Conditional Split Transformation Editor is similar to that of a property expression. You can select variables and columns from the top-left section and functions and operators from the top-right section of the editor window. You can build an expression using variables, columns, functions, and operators in the Condition field for an output. As you add a condition, an output is created with an order number specified in the Order column and a name assigned to it in the Output Name field. The order number plays a vital role in routing rows to outputs, as each row is matched against these conditions in ascending order and is diverted to the output for which the condition becomes true first. Once the row has been diverted, the rest of the conditions are ignored, as a row is always sent to only one output. If a row doesn't meet any condition, it is sent in the end to the default output of the transformation. The Output Name can be changed to a more appropriate name that suits your data set. This transformation has a mandatory default output built in for you.

The syntax you use to build an expression for a condition uses the expression grammar. For example, if you want to split customers on the basis of countries—e.g., you want to separate out rows for UK, USA, and the rest of the world—your Conditional Split transformation will have three outputs with the following configurations:

Order   Output Name   Condition
1       Case 1        [Country] == "UK"
2       Case 2        [Country] == "USA"

The default output will collect all the rows for which Country is neither UK nor USA.

Multicast Transformation

The Conditional Split transformation enables you to divert a row to a specific output and split a data set. It doesn't allow you to divert a row to more than one output and hence does not create a copy of the data set. However, sometimes you may need to create copies of a data set at run time so that you can apply multiple sets of transformations to the same data. For example, working within the same ETL, you may want to treat your sales data differently while preparing data for balance sheets than while using it to calculate the commission paid to salespersons. Another example is when you want to load data to your relational database as well as to an analytical reporting data mart, both having different data models and obviously different loading requirements. In this case, you will create two sets of data, apply different transformations to each to bring them in line with the data model requirements, and then load the data to the destination databases. This transformation has one input and supports multiple outputs to perform the multicast operation. The Multicast transformation creates copies of a data set by sending every row to every output. It does not apply any transformation to the data other than diverting the data set to more than one output.
The transformation is simple to use and doesn't require much configuration. In fact, when you open the Multicast Transformation Editor, you will see nothing to configure, just two blank panes aligned side by side. This is because you configure this transformation by connecting its outputs to the inputs of multiple components, not by setting properties. On the Data Flow surface, when you click the Multicast transformation after connecting its first output to a downstream component, you will see that it provides another output (a green arrow emerging from the transformation) for you to connect to another downstream component. If you double-click the Multicast transformation after connecting the second output to another downstream component, you will see two outputs listed in the left pane of the editor window, and the right pane shows the properties of the output selected in the left pane. You can modify the name of an output and write a description for it if you want, and that's it.

Union All Transformation

This component works similarly to the UNION ALL command of T-SQL and combines two or more inputs into a single output. To support this functionality, the transformation has multiple inputs and one output and does not support an error output. At run time, the transformation picks up all the rows from one input and sends them to the output, then picks up all the rows from the second input and sends them to the output after the rows of the first input, and so on until all the inputs have been combined. Note that this transformation can select inputs in any random order, based on when data is available at the inputs. Also, it does not sort the records in any way; all the rows from a particular input are output together.

On the Data Flow Designer, when you connect the first input to this transformation, it copies the metadata from that input to its output, so that the columns of the first input are mapped to the output columns created by the transformation. Columns of any input you connect after the first one that have matching metadata are also mapped to the corresponding output columns. Any column that does not exist in the first input but is part of a subsequent input is not copied to the output columns by the transformation; you must create and map such a column manually in the Output Column Name field of the output against the corresponding input column. If you don't create and map a column that exists in subsequent inputs but not in the first input, that column will be ignored in the output. You can also change mappings in the editor by clicking in the Input field and choosing an option from the drop-down list.

Merge Transformation

As its name suggests, the Merge transformation combines two inputs into a single output, but it requires that the inputs be sorted and that the columns have matching metadata. To perform the merge functionality, this transformation has two inputs and one output. It does not have an error output. The Merge transformation can even provide you sorted output by inserting rows from the inputs based on their key values. This transformation is similar to the Union All transformation, with some key differences:

- The Merge transformation requires sorted inputs, whereas the Union All transformation does not.
- The Merge transformation can combine only two inputs, whereas Union All can combine more than two inputs.
- The Merge transformation can provide sorted output, whereas Union All cannot.

When you add a Merge transformation into a package's data flow, you won't be able to open its editor until you have attached two sorted inputs that have properly defined sort-key positions for the columns on which they are sorted. The input data sets must be sorted—for example, using a Sort transformation—which must be indicated by the IsSorted property on the outputs and the SortKeyPosition property of the output columns of the upstream components. These properties are discussed in Chapter 15, where you can refer to Figure 15-4 and Figure 15-5 to see how they are used. Once the sorted inputs are connected to the inputs of this transformation and you are able to open the editor, you will see the output columns mapped to the input columns of both inputs. This transformation creates the output by copying the metadata of the first input you connect to it. You can make changes to the mappings; however, the metadata of the mapped columns must match. If the second input has a column that is not in the output columns list, you can add and map this column manually.

Merge Join Transformation

The Merge Join transformation joins two sorted inputs into one output using a full, left, or inner join. For merging two inputs, this transformation supports two inputs and one output and does not support an error output. The joins work exactly as they do in T-SQL. For the benefit of those who haven't worked with joins, here is a quick refresher.

Inner Join   An inner join returns the rows that have a join key present in both data sets. For example, suppose you have a table for all the customers, a table for all the products that your company has sold in the last ten years, and a date table. An inner join will return the customers who purchased any product in the year 2009, along with the products they purchased.

Left Outer Join   A left outer join returns all the rows from the first (left side of the join) table and the rows from the second table for the matching keys. For the rows that don't have a matching key in the second table, the corresponding output columns are filled with nulls. For the customers and products example, a left outer join will list all the customers, with products listed against the customers who made purchases in 2009 and nulls against the customers who didn't make a purchase in 2009.

Full Outer Join   A full outer join returns all the rows from both tables joined by the key—simply put, it is a list of all the data. If rows from the first table don't have a matching key in the second table, the corresponding second-table columns are filled with nulls. Similarly, if rows from the second table don't have a matching key in the first table, the corresponding first-table columns are filled with nulls.

When you add a Merge Join transformation into a data flow, you won't be able to open the editor until you have attached two sorted inputs—one to the merge join left input and the other to the merge join right input. The input data sets must be physically sorted—for example, using a Sort transformation—which must be indicated by the IsSorted property on the outputs and the SortKeyPosition property of the output columns of the upstream components. These properties are discussed in Chapter 15, where you can refer to Figure 15-4 and Figure 15-5 to see how they are used.
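As a quick T-SQL illustration of the three join types described in the refresher above, the following sketch shows roughly what each one returns for the customers-and-purchases example. The table and column names here are assumptions made purely for illustration; they are not objects from the book's sample databases.

    -- Illustrative sketch only: assumes tables dbo.Customer(CustomerID, CustomerName)
    -- and dbo.Purchase(CustomerID, ProductName, PurchaseYear).

    -- Inner join: only the customers who purchased something in 2009
    SELECT c.CustomerName, p.ProductName
    FROM dbo.Customer AS c
    INNER JOIN dbo.Purchase AS p
        ON p.CustomerID = c.CustomerID AND p.PurchaseYear = 2009;

    -- Left outer join: every customer; ProductName is NULL for customers
    -- with no purchase in 2009
    SELECT c.CustomerName, p.ProductName
    FROM dbo.Customer AS c
    LEFT OUTER JOIN dbo.Purchase AS p
        ON p.CustomerID = c.CustomerID AND p.PurchaseYear = 2009;

    -- Full outer join: all rows from both tables; unmatched rows on either
    -- side are padded with NULLs for the other side's columns
    SELECT c.CustomerName, p.ProductName
    FROM dbo.Customer AS c
    FULL OUTER JOIN dbo.Purchase AS p
        ON p.CustomerID = c.CustomerID;

The Merge Join transformation produces the equivalent result sets, except that it operates on the two sorted pipeline inputs rather than on tables in a database.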
Once the sorted inputs have been connected to the left and right inputs of this transformation, you can open the editor. You can choose among the inner join, left outer join, and full outer join types in the Join Type field. You can also swap your inputs by clicking the Swap Inputs button for a left outer join. After selecting the join type, you can specify the join keys if they have not already been picked up by the transformation. The joined columns must have matching metadata, and the join key must be in the same order as specified by the sort key. Next, you can select the output columns by selecting the check boxes or using the drop-down lists in the Input columns in the lower section of the Transformation Editor. You will also be able to assign an alias to each output column. Once you have selected the required columns for the output, you can close the editor, as there is no other configuration in the editor; however, there are a couple of important properties that you may want to configure for this transformation. You can access these properties in the Properties window by pressing the F4 key while the transformation is selected.

The first performance-related property is MaxBuffersPerInput, which lets you suggest an integer value for the number of buffers to use for each input. Merge Join is a semi-blocking transformation; that is, it needs a set of rows before it can start outputting any records. As you can imagine, the transformation needs a good set of rows, or buffers, for each input before it can join them and start producing the resultant rows. The second reason for this transformation to collect large sets of data is to avoid deadlocking between the threads that are processing the data. If you are working with really large data sets, the Merge Join transformation can cache huge amounts of data, putting high pressure on the memory available on the computer. This could affect the performance of other applications on the server. To avoid such a situation, you can throttle the memory requirements using the MaxBuffersPerInput property. While this property allows you some control, it is only a suggested value, not a hard value that forces the Merge Join to use only the specified number of buffers. If the transformation senses a risk of thread deadlocking, or it doesn't have enough data to perform the join, it will increase the number of buffers beyond what has been specified; however, it stays within the specified limit if no such risk exists. The default value for this property is five buffers per input, which works well for most requirements. You can specify a larger value to improve performance if you are working with large data sets and have enough free memory on the server. Alternatively, you can decrease the value from the default if the server is already struggling with memory pressure. But as you can imagine, reducing this value to 0 will disable all throttling and can adversely affect performance. You need to be cautious when configuring this transformation and pick a realistic setting, as memory usage can go well off the mark when the value is set too low for a large data set. Another issue has been observed when the two input data streams hit this transformation at wildly different times. Testing your data flow is the best way to find the right balance with this transformation.
Last, you can specify whether to treat null values as equal by using the TreatNullsAsEqual property of the transformation; the default is True. If you decide not to treat nulls as equal, the transformation treats nulls the same way the database engine does.

Cache Transform

The Lookup transformation in Integration Services 2008 can use an in-memory cache or a cache file to load the reference data it uses to perform lookups on the data in the pipeline. The in-memory cache or the cache file needs to be created before the Lookup transformation starts execution. This means you can create a reference data cache either before the package starts running or during the same package execution, but before the Lookup transformation runs. Using a Cache Transform, you can write distinct data rows to a Cache Connection Manager that, depending on how you have configured it, writes this data into the cache and can also persist the cached data to a file on the hard disk.

The user interface of the Cache Transformation Editor is simple and has only two pages. In the Connection Manager page, you specify the Cache Connection Manager, and in the Mappings page, you map input columns to the destination columns being written to the Cache Connection Manager. Here, you have to map all the input columns to destination columns; otherwise, the component will throw an error. If you change the data type of a column later—e.g., you increase the length of an input column—the metadata can be corrected in the Cache Connection Manager, which displays the columns and their metadata in its Columns tab.

While configuring the Cache Connection Manager, you will also need to configure an Index Position for the columns. By default, the index position is 0 for all columns, indicating that they are non-index columns. You can assign positive integer values such as 1, 2, 3, and so on to the columns that participate in indexing. The number assigned to the index position of a column indicates the order in which the Lookup transformation compares rows in the reference data set to rows in the input data source. The selection of columns as index columns depends on which columns you need to look up while configuring a Lookup transformation. For example, if you are matching car details based on the manufacturer, the model description, and the body style, you will assign the index positions as shown in this table:

Column Name         Index Position   Description
Manufacturer        1                Participates in the lookup operation and should be mapped to an input column in the Lookup transformation; this is the first column on which the comparison is made.
Model Description   2                Participates in the lookup operation and should be mapped to an input column in the Lookup transformation; this is the second column on which the comparison is made.
Body Style          3                Participates in the lookup operation and should be mapped to an input column in the Lookup transformation; this is the third column on which the comparison is made.
All other columns   0                Do not participate in the lookup operation mappings; however, they can be selected in order to add them into the data flow.

As the Cache Connection Manager writes data to memory, it gets tied to the cache. Hence, you cannot use multiple Cache Transforms to write to the same Cache Connection Manager.
The Cache Transform that is called first during package execution writes data to the Cache Connection Manager, while all subsequent Cache Transforms fail. As mentioned earlier, you can create a cache file using a Cache Transform in a separate package or in the same package, run before the Lookup transformation. Once the reference data is persisted to a cache file, which is a raw file, you can use this cache file among multiple Data Flow tasks within the same package, between multiple packages on the same server, or between multiple packages on different servers. The only consideration to keep in mind is that the data in the cache file is only as current as the time it was loaded into the file. If your reference data doesn't change very often, you can use this technique to perform quite fast lookup operations.

Lookup Transformation

With the ever-increasing use of the Web to capture data, lookup operations have become quite important. Web page designers tend to ask users to fill in only the most critical data in a web form, and the form fills in the rest of the information for them. For example, you may ask a visitor to your site to fill in a street address and postal code during registration, and based on these two pieces of information, you can fill in the city and state address fields. This is done by looking up a database table that contains all the postal information keyed on postal codes. So, you simply look for the row that contains the particular postcode, and you will be able to complete the address fields. You can perform such lookup operations in Integration Services as well, using the Lookup transformation. While loading a data warehouse, you will frequently use a Lookup transformation to look up surrogate keys in a dimension table by matching on the business key.

This transformation lets you perform lookups by joining the input columns in the data flow with columns in a reference data set. The resulting values can be included in the data flow as new columns, or you can replace existing column values. For example, when the value in the postal code column of a record in the data flow equals the value in the postal code column of a record in the reference data set, the Lookup transformation will be able to get the data for all the other address columns. This equality operation makes the join an equijoin, requiring that all the values in the data flow match at least one value in the reference data set. In a complex lookup operation, you may join multiple columns in the input to multiple columns in the reference data set.

This transformation has undergone a major upgrade since its predecessor in Integration Services 2005. The user interface has also changed and provides more flexibility along with performance improvements. The Lookup Transformation Editor provides five pages for configuring its properties. When you open the editor, you choose options in the General page to decide how you want to configure the transformation. Let's start with the simplest one—the Lookup transformation allows you to configure its outputs on the basis of how you want to handle non-matching rows.
The "Specify how to handle rows with no matching entries" option in the General page provides four options to choose from:

- Ignore failure
- Redirect rows to error output
- Fail component (default)
- Redirect rows to no match output

When the transformation looks up an input data key against the reference data set, you can have either a match or no match. When the input row key is found in the reference data set, the row is called a match row and is sent to the match output. If a key in the input row is not found in the reference data set, the row is called a no-match row, and this option allows you to decide how you want to handle such rows. In Integration Services 2005, no-match rows are treated as errors and are sent to the error output, whereas in Integration Services 2008 you can choose to redirect non-matching rows to a no-match output. Choosing an option other than "Redirect rows to no match output" will treat no-match rows as error rows and send them to the error output.

Next, you can select how you would like to cache your reference data set. The Cache Mode section provides three options to choose from:

- Full cache: Selecting this option forces the reference set to be prepared and loaded into cache memory before the lookup operation is performed. For example, if you are looking up postal addresses, all the rows with postal information will be loaded into memory before the actual lookup is performed against this reference data set. As you can imagine, a lookup against a reference set held in memory is faster than a lookup in which the component has to connect to an outside data store to get the lookup values. Full cache mode is the best-performing option if you have enough physical memory on the server to hold the complete reference data set. When you choose this option, two things happen in the user interface:
  - The Advanced page is disabled; the options on this page are not used in full cache mode.
  - The Cache Connection Manager option becomes available. The Lookup transformation can use either an OLE DB Connection Manager or a Cache Connection Manager in full cache mode; this is a new feature in this version. The Cache Connection Manager can read the reference data from a cache file that can be prepared separately from the lookup process.
  You can prepare the reference data set separately, in the same Data Flow task before the Lookup transformation, in an earlier Data Flow task in the same package, or even in a separate package, using a Cache Transform. When the data is prepared and persisted to a cache file (.caw), the Lookup transformation can load the data from the cache file through a Cache Connection Manager faster than it can from relational sources. So, by utilizing full cache mode and sharing a cache file between multiple lookups within the same package or across multiple packages, you can achieve high levels of performance, especially when the reference data set is not small. The only caveat here is that your reference data set must not change; if it changes with each data load, you have to be careful while using the Cache Transform and the Cache Connection Manager in your package.
- Partial cache: Selecting this option enables you to apply limits on the amount of memory used for caching reference data.
When you select partial cache mode, the Advanced page options become available, but the Cache Connection Manager option is disabled, as it is used only in full cache mode. So, you can use only an OLE DB Connection Manager to collect the reference data from a relational source. In the Advanced page, you can limit the cache size to a value that you specify.
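To relate the Lookup transformation's equijoin behavior back to plain T-SQL, the postal-code example described earlier is logically similar to the query sketched below. The table and column names are assumptions made only for illustration, and the transformation performs the match row by row against its cached reference set rather than issuing a set-based join like this.

    -- Illustrative sketch only: assumes an input table stg.Registration(StreetAddress, PostalCode)
    -- and a reference table ref.PostalCode(PostalCode, City, State).

    -- Equijoin semantics of the Lookup: each input row's PostalCode must match a value
    -- in the reference set; matching rows pick up the City and State reference columns.
    SELECT r.StreetAddress,
           r.PostalCode,
           pc.City,
           pc.State
    FROM stg.Registration AS r
    INNER JOIN ref.PostalCode AS pc
        ON pc.PostalCode = r.PostalCode;

Input rows whose postal code finds no match in the reference set are the no-match rows discussed above; in the transformation they go to the error output or to the no-match output, depending on the option selected in the General page.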
