Learning Management Marketing and Customer Support_9 pptx

470643 c15.qxd 3/8/04 11:20 AM Page 482

482 Chapter 15

WHAT IS A RELATIONAL DATABASE?

An entity relationship diagram describes the layout of data for a simple credit card database:

ACCOUNT TABLE: Account ID, Customer ID, Interest Rate, Credit Limit, Amount Due, Account Type, Minimum Payment, Last Payment Amt

TRANSACTION TABLE: Transaction ID, Account ID, Vendor ID, Date, Time, Amount, Authorization Code

VENDOR TABLE: Vendor ID, Vendor Name, Vendor Type

CUSTOMER TABLE: Customer ID, Household ID, Customer Name, Gender, Date of Birth, FICO Score

HOUSEHOLD TABLE: Household ID, Number of Children, ZIP Code

A single transaction occurs at exactly one vendor, but each vendor may have multiple transactions. One account has multiple transactions, but each transaction is associated with exactly one account. A customer may have one or more accounts, but each account belongs to exactly one customer. Likewise, one or more customers may be in a household.

An E-R diagram can be used to show the tables and fields in a relational database. Each box shows a single table and its columns. The lines between them show relationships, such as 1-many, 1-1, and many-to-many. Because each table corresponds to an entity, this is called a physical design. Sometimes, the physical design of a database is very complicated. For instance, the TRANSACTION TABLE might actually be split into a separate table for each month of transactions. In this case, the above E-R diagram is still useful; it represents the logical structure of the data, as business users would understand it.

With respect to data mining, relational databases (and SQL) have some limitations. First, they provide little support for time series. This makes it hard to figure out from transaction data such things as the second product purchased, the last three promos a customer responded to, or the ordering of events; these can require very complicated SQL. Another problem is that two common operations inadvertently eliminate rows. When a field contains a missing value (NULL), it automatically fails any comparison, even "not equals". Also, the default join operation (called an inner join) eliminates rows that do not match, which means that customers may inadvertently be left out of a data pull. Finally, the set of operations in SQL is not particularly rich, especially for text fields and dates. The result is that every database vendor extends standard SQL to include slightly different sets of functionality.

Database schema can also illuminate unusual findings in the data. For instance, we once worked with a file of call detail records in the United States that had city and state fields for the destination of every call. The file contained over two hundred state codes—that is a lot of states. What was happening? We learned that the city and state fields were never used by operational systems, so their contents were automatically suspicious—data that is not used is not likely to be correct. Instead of the city and state, all location information was derived from zip codes. These redundant fields were inaccurate because the state field was written first and the city field, with 14 characters, was written second. Longer city names overwrote the state field next to it. So, "WEST PALM BEACH, FL" ended up putting the "H" in the state field, becoming "WEST PALM BEAC, HL," and "COLORADO SPRINGS, CO" became "COLORADO SPRIN, GS." Understanding the data layout helped us figure out this interesting but admittedly uncommon problem.

Metadata

Metadata goes beyond the database schema to let business users know what types of information are stored in the database.
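The two row-eliminating operations described above are easy to reproduce. This sketch uses Python's built-in sqlite3 module purely for illustration; the table and column names echo the E-R diagram but the rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER, name TEXT, gender TEXT);
    CREATE TABLE account  (account_id INTEGER, customer_id INTEGER);
    INSERT INTO customer VALUES (1, 'Alice', 'F'), (2, 'Bob', NULL), (3, 'Carol', 'M');
    INSERT INTO account  VALUES (10, 1), (11, 2);
""")

# A NULL fails every comparison, even "not equals", so Bob appears in
# NEITHER of these two complementary-looking queries.
females = conn.execute(
    "SELECT name FROM customer WHERE gender = 'F' ORDER BY name").fetchall()
others = conn.execute(
    "SELECT name FROM customer WHERE gender <> 'F' ORDER BY name").fetchall()

# The default (inner) join drops customers with no matching account row,
# so Carol silently falls out of this data pull.
joined = conn.execute("""
    SELECT c.name
    FROM customer c JOIN account a ON c.customer_id = a.customer_id
    ORDER BY c.name
""").fetchall()
```

Pulling Bob back requires an explicit `gender IS NULL` test, and keeping Carol requires a left outer join; both are easy to forget.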
This is, in essence, documentation about the system, including information such as:

- The values legally allowed in each field
- A description of the contents of each field (for instance, is the start date the date of the sale or the date of activation?)
- The date when the data was loaded
- An indication of how recently the data has been updated (how long after the billing cycle does the billing data land in this system?)
- Mappings to other systems (the status code in table A is the status code field in table B in such-and-such source system)

When available, metadata provides an invaluable service. When not available, this type of information needs to be gleaned, usually from friendly database administrators and analysts—a perhaps inefficient use of everyone's time. For a data warehouse, metadata provides discipline, since changes to the warehouse must be reflected in the metadata to be communicated to users. Overall, a good metadata system helps ensure the success of a data warehouse by making users more aware of and comfortable with the contents. For data miners, metadata provides valuable assistance in tracking down and understanding data.

Business Rules

The highest level of abstraction is business rules. These describe why relationships exist and how they are applied. Some business rules are easy to capture, because they represent the history of the business—what marketing campaigns took place when, what products were available when, and so on. Other types of rules are more difficult to capture and often lie buried deep inside code fragments and old memos. No one may remember why the fraud detection system ignores claims under $500. Presumably there was a good business reason, but the reason, the business rule, is often lost once the rule is embedded in computer code. Business rules have a close relationship to data mining.
Some data mining techniques, such as market basket analysis and decision trees, produce explicit rules. Often, these rules may already be known. For instance, learning that conference calling is sold with call waiting may not be interesting, since this feature is only sold as part of a bundle. Or a direct mail response model that ends up targeting only wealthy areas may reflect the fact that the historical data used to build the model was biased, because the model set only had responders in these areas. Discovering business rules in the data is both a success and a failure. Finding these rules is a successful application of sophisticated algorithms. However, in data mining, we want actionable patterns, and such patterns are not actionable.

A General Architecture for Data Warehousing

The multitiered approach to data warehousing recognizes that data needs come in many different forms. It provides a comprehensive system for managing data for decision support. The major components of this architecture (see Figure 15.3) are:

- Source systems are where the data comes from.
- Extraction, transformation, and load (ETL) tools move data between different data stores.
- The central repository is the main store for the data warehouse.
- The metadata repository describes what is available and where.
- Data marts provide fast, specialized access for end users and applications.
- Operational feedback integrates decision support back into the operational systems.
- End users are the reason for developing the warehouse in the first place.

Figure 15.3 The multitiered approach to data warehousing includes a central repository, data marts, end-user tools, and tools that connect all these pieces together. [Figure annotations: Operational systems are where the data comes from; these are usually mainframe or midrange systems, and some data may be provided by external vendors. Extraction, transformation, and load tools move data between systems. The central data store is a relational database with a logical data model. Departmental data warehouses and metadata support applications used by end users. Networks using standard protocols like ODBC connect end users to the data. End users are the raison d'etre of the data warehouse; they act on the information and knowledge gained from the data.]

One or more of these components exist in virtually every system called a data warehouse. They are the building blocks of decision support throughout an enterprise. The following discussion of these components follows a data-flow approach. The data is like water. It originates in the source systems and flows through the components of the data warehouse, ultimately delivering information and value to end users. These components rest on a technological foundation consisting of hardware, software, and networks; this infrastructure must be sufficiently robust both to meet the needs of end users and to meet growing data and processing requirements.

Source Systems

Data originates in the source systems, typically operational systems and external data feeds. These are designed for operational efficiency, not for decision support, and the data reflects this reality. For instance, transactional data might be rolled off every few months to reduce storage needs. The same information might be represented in different ways. For example, one retail point-of-sale source system represented returned merchandise using a "returned item" flag. That is, except when the customer made a new purchase at the same time. In this case, there would be a negative amount in the purchase field. Such anomalies abound in the real world. Often, information of interest for customer relationship management is not gathered as intended.
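The returned-merchandise anomaly above is exactly the kind of inconsistency that has to be reconciled before the data is usable. A minimal sketch, with invented field names, that folds both representations into one:

```python
def normalize_return(record):
    """Unify the two ways the point-of-sale system marked a return:
    an explicit returned-item flag, or a negative purchase amount when
    the return coincided with a new purchase at the same time.
    Field names here are illustrative, not from any real system."""
    amount = float(record["purchase_amount"])
    is_return = record.get("returned_item") == "Y" or amount < 0
    return {"amount": abs(amount), "is_return": is_return}
```

This collapses the combined purchase-and-return case to a single number; a real cleansing step would want to split the purchase from the return, but the point is that both encodings must map to one representation before analysis.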
Here, for instance, are six ways that business customers might be distinguished from consumers in a telephone company:

- Using a customer type indicator: "B" or "C," for business versus consumer.
- Using rate plans: Some are only sold to business customers; others to consumers.
- Using acquisition channels: Some channels are reserved for business, others for consumers.
- Using number of lines: 1 or 2 for consumer, more for business.
- Using credit class: Businesses have a different set of credit classes from consumers.
- Using a model score based on businesslike calling patterns.

(Needless to say, these definitions do not always agree.) One challenge in data warehousing is arriving at a consistent definition that can be used across the business. The key to achieving this is metadata that documents the precise meaning of each field, so everyone using the data warehouse is speaking the same language.

Gathering the data for decision support stresses operational systems, since these systems were originally designed for transaction processing. Bringing the data together in a consistent format is almost always the most expensive part of implementing a data warehousing solution.

The source systems offer other challenges as well. They generally run on a wide range of hardware, and much of the software is built in-house or highly customized. These systems are commonly mainframe and midrange systems and generally use complicated and proprietary file structures. Mainframe systems were designed for holding and processing data, not for sharing it. Although systems are becoming more open, getting access to the data is always an issue, especially when different systems are supporting very different parts of the organization. And systems may be geographically dispersed, further contributing to the difficulty of bringing the data together.
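The disagreement among such definitions is easy to check mechanically. The field names and plan codes below are invented for illustration; the point is only that independent heuristics can vote differently on the same customer:

```python
# Hypothetical plan codes and field names -- real systems differ.
BUSINESS_PLANS = {"BIZ-100", "BIZ-FLEX"}

def business_votes(cust):
    """Apply three of the competing business-vs-consumer definitions
    and return each definition's verdict."""
    return {
        "type_code": cust.get("customer_type") == "B",
        "num_lines": cust.get("num_lines", 1) > 2,
        "rate_plan": cust.get("rate_plan") in BUSINESS_PLANS,
    }

def definitions_agree(cust):
    """True only when every definition gives the same answer."""
    return len(set(business_votes(cust).values())) == 1
```

A customer flagged "B" who nonetheless has one line and a consumer rate plan makes the definitions disagree, which is precisely the situation consistent metadata is meant to resolve.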
Extraction, Transformation, and Load

Extraction, transformation, and load (ETL) tools solve the problem of gathering data from disparate systems by providing the ability to map and move data from source systems to other environments. Traditionally, data movement and cleansing have been the responsibility of programmers, who wrote special-purpose code as the need arose. Such application-specific code becomes brittle as systems multiply and source systems change.

Although programming may still be necessary, there are now products that solve the bulk of the ETL problems. These tools make it possible to specify source systems and mappings between different tables and files. They provide the ability to verify data and spit out error reports when loads do not succeed. The tools also support looking up values in tables (so only known product codes, for instance, are loaded into the data warehouse). The goal of these tools is to describe where data comes from and what happens to it—not to write the step-by-step code for pulling data from one system and putting it into another. Standard procedural languages, such as COBOL and RPG, focus on each step instead of the bigger picture of what needs to be done. ETL tools often provide a metadata interface, so end users can understand what is happening to "their" data during the loading of the central repository.

This genre of tools is often so good at processing data that we are surprised that such tools remain embedded in IT departments and are not more generally used by data miners. Mastering Data Mining has a case study from 1998 on using one of these tools, from Ab Initio, for analyzing hundreds of gigabytes of call detail records—a quantity of data that would still be challenging to analyze today.

Central Repository

The central repository is the heart of the data warehouse. It is usually a relational database accessed through some variant of SQL.
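The lookup-and-reject behavior these tools provide can be sketched in a few lines. Field names and codes here are hypothetical; real ETL tools express the same mapping declaratively rather than in hand-written code:

```python
KNOWN_PRODUCT_CODES = {"P001", "P002", "P003"}  # illustrative lookup table

def load_rows(source_rows):
    """Map source fields to warehouse fields, rejecting any row whose
    product code fails the lookup; rejected rows go to an error report
    instead of silently entering the warehouse."""
    loaded, errors = [], []
    for row in source_rows:
        if row["prod_cd"] not in KNOWN_PRODUCT_CODES:
            errors.append({"row": row, "reason": "unknown product code"})
            continue
        loaded.append({
            "product_code": row["prod_cd"],          # source -> warehouse mapping
            "amount":       round(float(row["amt"]), 2),
        })
    return loaded, errors
```

The mapping describes what should happen to each field; the error report is what keeps a bad product code from quietly polluting the central repository.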
One of the advantages of relational databases is their ability to run on powerful, scalable machines by taking advantage of multiple processors and multiple disks (see the sidebar "Background on Parallel Technology"). Most statistical and data mining packages, for instance, can run multiple processing threads at the same time. However, each thread represents one task, running on one processor. More hardware does not make any given task run faster (except when other tasks happen to be interfering with it). Relational databases, on the other hand, can take a single query and, in essence, create multiple threads all running at the same time for one query. As a result, data-intensive applications on powerful computers often run more quickly when using a relational database than when using non-parallel-enabled software—and data mining is a very data-intensive application.

A key component in the central repository is a logical data model, which describes the structure of the data inside a database in terms familiar to business users. Often, the data model is confused with the physical layout (or schema) of the database, but there is a critical difference between the two. The purpose of the physical layout is to maximize performance and to provide information to database administrators (DBAs). The purpose of the logical data model is to communicate the contents of the database to a wider, less technical audience. The business user must be able to understand the logical data model—entities, attributes, and relationships. The physical layout is an implementation of the logical data model, incorporating compromises and choices along the way to optimize performance.

When embarking on a data warehousing project, many organizations feel compelled to develop a comprehensive, enterprise-wide data model. These efforts are often surprisingly unsuccessful.
The logical data model for the data warehouse does not have to be quite as uncompromising as an enterprise-wide model. For instance, a conflict between product codes in the logical data model for the data warehouse can be (but not necessarily should be) resolved by including both product hierarchies—a decision that takes 10 minutes to make. In an enterprise-wide effort, resolving conflicting product codes can require months of investigations and meetings.

TIP Data warehousing is a process. Be wary of any large database called a data warehouse that does not have a process in place for updating the system to meet end user needs. Such a data warehouse will eventually fade into disuse, because end users' needs are likely to evolve, but the system will not.

BACKGROUND ON PARALLEL TECHNOLOGY

Parallel technology is the key to scalable hardware, and it comes in two flavors: symmetric multiprocessing systems (SMPs) and massively parallel processing systems (MPPs), both of which are shown in the following figure.

An SMP machine is centered on a bus, a special network present in all computers that connects processing units to memory and disk drives. The bus acts as a central communication device, so SMP systems are sometimes called shared everything. Every processing unit can access all the memory and all the disk drives. This form of parallelism is quite popular because an SMP box supports the same applications as uniprocessor boxes—and some applications can take advantage of additional hardware with minimal changes to code. However, SMP technology has its limitations because it places a heavy burden on the central bus, which becomes saturated as the processing load increases. Contention for the central bus is often what limits the performance of SMPs. They tend to work well when they have fewer than 10 to 20 processing units.

MPPs, on the other hand, behave like separate computers connected by a very high-speed network, sometimes called a switch. Each processing unit has its own memory and its own disk storage. Some nodes may be specialized for processing and have minimal disk storage, and others may be specialized for storage and have lots of disk capacity. The bus connecting the processing unit to memory and disk drives never gets saturated. However, one drawback is that some memory and some disk drives are now local and some are remote—a distinction that can make MPPs harder to program. Programs designed for one processor can always run on one processor in an MPP—but they require modifications to take advantage of all the hardware. MPPs are truly scalable so long as the network connecting the processors can supply more bandwidth, and faster networks are generally easier to design than faster buses. There are MPP-based computers with thousands of nodes and thousands of disks.

Both SMPs and MPPs have their advantages. Recognizing this, the vendors of these computers are making them more similar. SMP vendors are connecting their SMP computers together in clusters that start to resemble MPP boxes. At the same time, MPP vendors are replacing their single-processing units with SMP units, creating a very similar architecture. However, regardless of how powerful the hardware is, software needs to be designed to take advantage of these machines. Fortunately, the largest database vendors have invested years of research into enabling their products to do so.

[Figure: Parallel computers build on the basic Von Neumann uniprocessor architecture. A simple computer follows the architecture laid out by Von Neumann: a processing unit communicates to memory and disk over a local bus. (Memory stores both data and the executable program.) The speed of the processor, bus, and memory limits performance and scalability. The symmetric multiprocessor (SMP) has a shared-everything architecture; it expands the capabilities of the bus to support multiple processors, more memory, and a larger disk. The capacity of the bus limits performance and scalability, so SMP architectures usually max out with fewer than 20 processing units. The massively parallel processor (MPP) has a shared-nothing architecture; it introduces a high-speed network (also called a switch) that connects independent processor/memory/disk components. SMP and MPP systems are scalable because more processing units, disk drives, and memory can be added to the system. MPP architectures are very scalable, but fewer software packages can take advantage of all the hardware.]

Data warehousing is a process for managing the decision-support system of record. A process is something that can adjust to users' needs as they are clarified and change over time. A process can respond to changes in the business as needs change over time. The central repository itself is going to be a brittle, little-used system without the realization that as users learn about data and about the business, they are going to want changes and enhancements on the time scale of marketing (days and weeks) rather than on the time scale of IT (months).

Metadata Repository

We have already discussed metadata in the context of the data hierarchy. It can also be considered a component of the data warehouse. As such, the metadata repository is an often overlooked component of the data warehousing environment.
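The way a shared-nothing database decomposes a single query across many nodes can be sketched in miniature. Here the "nodes" run sequentially in plain Python, so this is only the shape of the computation, not a parallel implementation:

```python
def parallel_sum(rows, num_nodes=4):
    """Shared-nothing (MPP-style) query decomposition in miniature:
    partition the rows across nodes, let each node aggregate its own
    partition with no shared memory, then merge the partial results."""
    partitions = [rows[i::num_nodes] for i in range(num_nodes)]  # round-robin partitioning
    partials = [sum(part) for part in partitions]  # each node's local aggregate
    return sum(partials)                           # coordinator merges the partials
```

The partition-aggregate-merge shape is what lets one query keep every processing unit busy at once, which is why a parallel database can speed up a single data-intensive task while a single-threaded tool cannot.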
The lowest level of metadata is the database schema, the physical layout of the data. When used correctly, though, metadata is much more. It answers questions posed by end users about the availability of data, gives them tools for browsing through the contents of the data warehouse, and gives everyone more confidence in the data. This confidence is the basis for new applications and an expanded user base. A good metadata system should include the following:

- The annotated logical data model. The annotations should explain the entities and attributes, including valid values.
- Mapping from the logical data model to the source systems.
- The physical schema.
- Mapping from the logical model to the physical schema.
- Common views and formulas for accessing the data. What is useful to one user may be useful to others.
- Load and update information.
- Security and access information.
- Interfaces for end users and developers, so they share the same description of the database.

In any data warehousing environment, each of these pieces of information is available somewhere—in scripts written by the DBA, in email messages, in documentation, in the system tables in the database, and so on. A metadata repository makes this information available to the users, in a format they can readily understand. The key is giving users access so they feel comfortable with the data warehouse, with the data it contains, and with knowing how to use it.

Data Marts

Data warehouses do not actually do anything (except store and retrieve data effectively). Applications are needed to realize value, and these often take the form of data marts. A data mart is a specialized system that brings together the data needed for a department or related applications. Data marts are often used for reporting systems and slicing-and-dicing data. Such data marts often use OLAP technology, which is discussed later in this chapter. Another [...]
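The checklist above maps naturally onto a per-field record in a metadata repository. This is only a sketch; the field names and example values are invented:

```python
from dataclasses import dataclass

@dataclass
class FieldMetadata:
    """One field's entry in a metadata repository."""
    name: str
    description: str
    valid_values: list    # the values legally allowed in the field
    source_mapping: str   # mapping back to the source system
    last_loaded: str      # load and update information

repo = {
    "status_code": FieldMetadata(
        name="status_code",
        description="Account status at the end of the billing cycle",
        valid_values=["A", "C", "S"],
        source_mapping="BILLING.CUST_STATUS",  # hypothetical source field
        last_loaded="2004-03-01",
    ),
}
```

Even a structure this simple answers the questions the text lists: what the field means, what values it may take, where it came from, and how fresh it is.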
