Hands-On Microsoft SQL Server 2008 Integration Services, Part 38

Figure 9-5  Flat File Advanced Editor showing FastParse and UseBinaryFormat properties

OLE DB Source

You will be using an OLE DB source whenever you need to extract data from a relational database that is OLE DB–compliant, such as Microsoft SQL Server, an Oracle database server, IBM's DB2 database server, an Access database, or even an Excel file. An OLE DB provider is used in an OLE DB Connection Manager to connect to the database, and the OLE DB source uses that OLE DB Connection Manager to bring the data into the data flow. As a reminder, if you use an Integration Services Data Source inside an OLE DB Connection Manager, you can make the Data Source or Data Source View available through the OLE DB Connection Manager to all the packages within an Integration Services project. Not all OLE DB providers are the same; check the documentation of the OLE DB provider you want to use in your package, as different types of providers come with different limitations.

The OLE DB source has one regular output and one error output. This data flow component has a custom user interface that is similar to the Excel source adapter discussed earlier, and its properties can also be configured using the Advanced Editor. When you open the OLE DB Source Editor, it takes you to the Connection Manager page, where you can specify an OLE DB Connection Manager from the drop-down list provided in the OLE DB Connection Manager field. The OLE DB source provides the following four data access mode options:

- Table Or View: Specify the name of the table or the view from which you want to extract data.
- Table Name Or View Name Variable: Specify the name of the variable that holds the name of the table or view.
- SQL Command: Write an SQL statement to extract data. You have the option of using a parameterized SQL query here.
- SQL Command From Variable: Specify the name of the variable that holds the SQL statement to be passed on to the OLE DB source.

For more details on these options, refer to the section "Excel Source," where they are discussed in detail. In the Columns page, you can see how the Output Columns map to the External Columns; the matching names are filled in for you in the Output Column list, and you can change them if you wish. In the Error Output page, you can specify how the OLE DB source should handle an error or a truncation for each column.

Raw File Source

All the prebuilt Integration Services sources other than the Raw File source require a connection manager to connect to a data source. The Raw File source instead connects directly to the file containing the raw data and brings that raw data into the data flow. The raw data written in the raw file is native to the source and requires no parsing or translation during import, so the Raw File source can extract the data much faster. Because of these fast read and write operations, you can use raw files to stage data; for example, in scenarios where you export data from your system for later transformation and loading back into a similar system, this may be an ideal choice. The Raw File source has only one output and no error output.
The Raw File source has been given a custom user interface in this version of Integration Services, but that doesn't mean you can't use the Advanced Editor to configure its properties. The custom UI is very simple and contains only two pages:

- Connection Manager: Though this page is named like the ones in other components where you select a connection manager, this component doesn't use any SSIS connection manager; rather, it connects to the raw file directly. In the Access Mode field, you can specify a filename or choose to get the filename from a variable. Depending on your choice of access mode, the interface changes the available fields to collect the relevant information: choosing File Name lets you specify the raw filename in the File Name field, and choosing File Name From Variable lets you specify the name of the variable that holds the raw filename in the Variable Name field.
- Column Mappings: Shows the mappings of external columns to output columns and allows you to rename the output columns.

Script Component Source

The preconfigured data flow sources in Integration Services have only one output available; for example, the Flat File source, Excel source, and OLE DB source all have a single output. If you need to output data to more than one downstream component in the data flow, you can't do that using these preconfigured components. The only option you have in this case, or in other similar cases when the existing source components do not meet your requirements, is to create your own data flow source. You can write a custom component, but an easier option is to use the Script Component as a source. The Script Component is not shown under Data Flow Sources in the Toolbox as a data source, but it can be configured as one. When you drop the Script Component onto the Data Flow Designer surface, you are asked to select whether you want to configure this component as a source, a transformation, or a destination. Based on your choice, the Script Component customizes the interface and the options appropriate for that purpose. As a source, the Script Component doesn't have any input and has only one output to start with; you can add additional outputs using the Add Output button in the Inputs and Outputs page. Configuring a Script Component as a data flow source is covered in Chapter 11.
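To give a flavor of what Chapter 11 covers, here is a minimal C# sketch of a Script Component configured as a source. It assumes a single output named Output 0 with two columns, CustomerID and CustomerName, defined on the Inputs and Outputs page; those names, and the hard-coded sample data, are illustrative assumptions rather than anything prescribed by the component.

```csharp
// Minimal sketch of a Script Component source (SSIS 2008, C# script).
// Assumes one output, "Output 0", with columns CustomerID (DT_I4) and
// CustomerName (DT_WSTR) added on the Inputs and Outputs page, so the
// designer generates the Output0Buffer class used below.
public override void CreateNewOutputRows()
{
    // In a real package these rows would come from a store that has no
    // preconfigured source adapter; sample data keeps the sketch short.
    string[] customers = { "1,Alice", "2,Bob" };

    foreach (string line in customers)
    {
        string[] parts = line.Split(',');

        // Add a row to the output buffer and populate its columns.
        Output0Buffer.AddRow();
        Output0Buffer.CustomerID = int.Parse(parts[0]);
        Output0Buffer.CustomerName = parts[1];
    }
}
```

Each additional output you add on the Inputs and Outputs page gets its own generated buffer class (Output1Buffer, and so on), which is how a script source can feed more than one downstream path.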
XML Source

The XML source reads XML data from an XML file or from a variable containing XML data and brings that data into the data flow. This source has a custom user interface to edit its properties but also uses the Advanced Editor for the configuration of some of them. When you open the XML Source Editor, the Connection Manager page appears, where you specify how you want to connect to and access data from the XML data file. Depending upon the option you choose, the interface changes to collect the relevant information:

- XML file location: Lets you specify the location and the filename of the XML data file in the XML Location field.
- XML file from variable: Allows you to use a variable to specify the XML data filename and location. You then provide the name of the variable containing the XML data file details in the Variable Name field.
- XML data from variable: Accesses XML data directly from a variable; you specify the name of the variable in the Variable Name field.

Next you can choose schema options: use an inline schema or provide an XML schema definition (XSD) file. When the XML data file itself contains the XSD schema that validates its structure and data, you use the inline schema option; otherwise, you have to supply an external schema definition (XSD) file. If you don't have an XSD file, you can generate one by clicking the Generate XSD button and providing a location and a name for the XSD file. This file is required to interpret the relationships among the elements in the XML data file. In the Columns page, you can map output columns to external columns, and in the Error Output page, you can specify how the XML source should handle errors and truncations of data for each column. This source can have multiple regular outputs and multiple error outputs.

Data Flow Transformations

Once the data has been captured by source adapters and passed on to the data flow path, you can modify this data using the wide range of data flow transformations provided in SSIS. You can use data flow transformations to aggregate column values, update column values, add columns to the data flow, merge data, and accomplish many more data modifications. Data flow transformations can have single or multiple inputs or outputs, depending upon the type of transformation you choose. A data flow transformation receives data on its input columns from the output columns of a data flow source or another transformation. After applying transformations to the data, it provides the output through its output columns to the input columns of the downstream component, which can be another data flow transformation or a data flow destination. Some data flow transformations also have error outputs, to which they send the data rows that fail to transform so that those rows can be dealt with separately. Data flow transformations do not have external columns, though you can create one with external columns programmatically.

The following tables list the 29 data flow transformations, grouped into categories on the basis of the function they perform in Integration Services. Future service packs or add-ons may bring in more transformations. These transformations provide a rich set of functionality in many areas, such as data cleansing, data standardization, BI functionality, loading slowly changing data warehouse dimension tables, pivoting and unpivoting, and a facility to write script using the Script Component transformation. However, if you still need functionality that can't be met by the preconfigured components, you can write custom transformations with synchronous or asynchronous outputs. Transformations with synchronous outputs modify data inline, that is, as the data rows flow through the component one at a time, whereas transformations with asynchronous outputs cannot process each row independently of all other rows; for example, an Aggregate transformation needs all the rows before it can perform an operation across them. You will learn more about synchronous and asynchronous components and programming options in Chapter 11. The following sections contain brief definitions of the data flow transformations; the details of configuration and usage exercises are covered in the next chapter.
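Before the catalog of transformations, here is a small, hedged C# sketch of the synchronous case: a Script Component transformation that modifies each row inline as it flows through. The input name Input 0 and the column name CustomerName are assumptions made for the example, not names taken from the book.

```csharp
// Minimal sketch of a synchronous Script Component transformation (C# script).
// Each row is handled as it arrives and no rows are buffered, which is what
// makes the output synchronous with the input.
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Assumed column: CustomerName (DT_WSTR). Trim and upper-case it in place.
    if (!Row.CustomerName_IsNull)
    {
        Row.CustomerName = Row.CustomerName.Trim().ToUpper();
    }
}
```

An asynchronous component, by contrast, hands rows to its output only after it has buffered everything it needs, which is why transformations such as Aggregate and Sort block the pipeline.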
Business Intelligence Transformations

This category groups together the transformations that allow you to perform business intelligence operations such as data cleansing, data standardization, text mining, and running DMX prediction queries.

Transformation | Description
Fuzzy Grouping | Performs data cleansing and standardizes values in column data.
Fuzzy Lookup | Uses fuzzy matching to cleanse or standardize data.
Slowly Changing Dimension | Configures the updating of slowly changing dimensions in data warehouse dimension tables.
Term Extraction | Extracts terms from text in the input columns and writes the extracted terms to an output column.
Term Lookup | Performs a lookup and count for the terms in a table that are defined in a lookup table.
Data Mining Query | Runs DMX queries for performing prediction queries against data mining models.

Row Transformations

The transformations in this category allow you to update column values or create new columns on a row-by-row basis.

Transformation | Description
Character Map | Applies string functions to string data type columns.
Copy Column | Creates new columns in the transformation output by copying input columns.
Data Conversion | Converts the data type of a column and optionally copies the converted data to a new output column.
Derived Column | Creates new derivations of data by applying expressions using a combination of input columns, variables, functions, and operators; the result can be used to modify an existing column or can be copied into a new column in the output.
Export Column | Exports data from a pipeline column into a file. This transformation can be especially useful to export DT_TEXT, DT_NTEXT, or DT_IMAGE data type data from the pipeline into a file.
Import Column | Reads data from a file and adds it to the columns in the data flow.
Script Component | Hosts and runs a script that can be used to transform data.
OLE DB Command | Updates, inserts, or deletes rows using SQL commands in a data flow.

Rowset Transformations

The transformations in this category work on collections of rows and allow you to perform operations such as aggregate, sort, sample, pivot, and unpivot row sets.

Transformation | Description
Aggregate | Performs aggregate functions such as average, sum, and count and copies the results to the output columns.
Sort | Sorts input columns in ascending or descending order and copies the sorted data to the output columns.
Percentage Sampling | Creates a sample data set by specifying a percentage to randomly select input rows.
Row Sampling | Creates a sample data set by specifying the exact number of output rows to randomly select input rows.
Pivot | Pivots the input data on a column value to get a less normalized but compact view of data.
Unpivot | Creates a more normalized version of a denormalized table.

Split and Join Transformations

This category groups the transformations that allow you to distribute rows to different outputs, multicast input rows, join multiple inputs into one output, and perform lookup operations.

Transformation | Description
Conditional Split | Routes data rows to different outputs, depending on the content of the data.
Multicast | Distributes the input data set to multiple outputs.
Union All | Merges multiple input data sets into one output.
Merge | Merges two sorted data sets into a single output data set.
Merge Join | Joins two sorted input data sets using a FULL, LEFT, or INNER join into a single output.
Cache Transform | Prepares a cache file to be used by a Lookup transformation by writing data in the pipeline to a Cache connection manager.
Lookup | Performs lookups by joining data in the input columns with columns in a reference data set.
Auditing Transformations

These transformations are used to add audit information and count rows.

Transformation | Description
Audit | Includes system environment information in the data flow.
Row Count | Counts the rows flowing through the data flow and writes the final count to a variable.

Data Flow Destinations

A data flow destination is the last component in the Integration Services data flow and writes the data it receives to an external data store. This component has no output columns, though it may have error output columns for redirecting error rows. A data flow destination writes the data received at its input columns to an external data store or to an in-memory data set via its external columns interface. You can configure a data flow destination to handle errors in data in the same way you configure a data flow source: you can choose to fail the component, ignore the error, or redirect the error rows to the error output columns. If the data flow destination can't write some of the data rows to the external data store, because of errors in the data or because the data doesn't match the external columns, it can redirect those rows to the error output columns, depending on how it is configured to handle error rows. Integration Services provides 12 preconfigured destinations plus a Script Component that can be used as a destination. The following tables list the available destinations and the functions they perform.

Destination | Description
ADO NET destination | Writes data to ADO.NET–compliant databases.
Data Mining Model Training destination | Passes the data through the data mining model algorithms to train data mining models.
DataReader destination | Exposes the data in the data flow using the ADO.NET DataReader interface for applications such as SQL Server Reporting Services.
Dimension Processing destination | Loads and processes an SQL Server Analysis Services dimension.
Excel destination | Writes data from a data flow to an Excel workbook.
Flat File destination | Writes data from a data flow to a flat file.
OLE DB destination | Loads data from a data flow into an OLE DB–compliant data store.
Partition Processing destination | Loads and processes an SQL Server Analysis Services partition.
Raw File destination | Writes data to a raw file.
Recordset destination | Creates and populates an in-memory ADO recordset.
Script Component | Hosts and runs a script that can be used to load data. Though not shown under Data Flow Destinations in the Toolbox, the Script Component can be used as a data flow destination; this component is covered in Chapter 11.
SQL Server Compact destination | Inserts rows into an SQL Server Compact database.
SQL Server destination | Bulk-loads data into an SQL Server table or view.

All these destinations have an input, and some of them have an error output, to meet most of your requirements for writing data to external stores. However, if you find that the existing data flow destinations do not do what you want, you can build a custom destination to suit your requirements using the Integration Services object model. The following table lists the inputs and outputs available in each data flow destination.
Data Flow Destination | Input | Output | Error Output | Custom UI | Connection Manager
ADO NET destination | 1 | No | Yes | Yes | ADO.NET Connection Manager
Data Mining Model Training destination | 1 | No | No | Yes | Analysis Services Connection Manager
DataReader destination | 1 | No | No | No | Not Required
Dimension Processing destination | 1 | No | No | Yes | Analysis Services Connection Manager
Excel destination | 1 | No | 1 | Yes | Excel Connection Manager
Flat File destination | 1 | No | No | Yes | Flat File Connection Manager
OLE DB destination | 1 | No | 1 | Yes | OLE DB Connection Manager
Partition Processing destination | 1 | No | No | Yes | Analysis Services Connection Manager
Raw File destination | 1 | No | No | Yes | Not Required
Recordset destination | 1 | No | No | No | Not Required
SQL Server Compact destination | 1 | No | 1 | No | OLE DB Connection Manager
SQL Server destination | 1 | No | No | Yes | OLE DB Connection Manager

ADO NET Destination

You can use an ADO NET destination to load data into a table or view of an ADO.NET–compliant database. If you have a database into which you can load data using either an ADO NET destination or an OLE DB destination, it is generally preferable to use the OLE DB destination because of its performance and resilience; however, the ADO NET destination provides more options for loading data into databases where an OLE DB destination cannot be used. The ADO NET Destination Editor has three pages: Connection Manager, Mappings, and Error Output. The Connection Manager page enables you to specify an ADO.NET Connection Manager and the table or view into which the data is to be loaded; this page also has options to create a new connection manager or a new table. In the Mappings page, you can map input columns to the destination columns. Finally, in the Error Output page, you can configure the error response conditions.

Data Mining Model Training Destination

Data changes with time, and the type of information that can be gathered from the data changes along with it. Mining data for useful and relevant information requires regular data mining model training so that the model can keep up with the challenges of ever-changing data. The Data Mining Model Training destination passes data through the data mining model algorithms to train the data mining models.
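To round out the earlier source and transformation sketches, here is a similarly hedged C# sketch of the Script Component acting as a destination, the option listed in the table above for targets that have no preconfigured adapter. The input name Input 0, the column CustomerName, and the output file path are assumptions made purely for illustration.

```csharp
// Minimal sketch of a Script Component destination (C# script).
// A destination has an input but no regular output; each incoming row is
// written to an external store, here a plain text file chosen for the example.
private System.IO.StreamWriter writer;

public override void PreExecute()
{
    base.PreExecute();
    // Open the target file once, before any rows arrive.
    writer = new System.IO.StreamWriter(@"C:\Temp\customers.txt");
}

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Assumed column: CustomerName (DT_WSTR).
    writer.WriteLine(Row.CustomerName);
}

public override void PostExecute()
{
    base.PostExecute();
    // Close the file after the last row has been processed.
    writer.Close();
}
```

PreExecute and PostExecute run once per execution, so the file is opened and closed only once rather than once per row.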
