Learning Management Marketing and Customer Support_2 docx

[Figure 7.11: Running a neural network on 10 examples from the validation set can help determine how to interpret results.]

Neural Networks for Time Series

In many business problems, the data naturally falls into a time series. Examples of such series are the closing price of IBM stock, the daily value of the Swiss franc to U.S. dollar exchange rate, or a forecast of the number of customers who will be active on any given date in the future. For financial time series, someone who is able to predict the next value, or even whether the series is heading up or down, has a tremendous advantage over other investors. Although predominant in the financial industry, time series appear in other areas, such as forecasting and process control. Financial time series, though, are the most studied, since a small advantage in predictive power translates into big profits.

Neural networks are easily adapted for time-series analysis, as shown in Figure 7.12. The network is trained on the time-series data, starting at the oldest point in the data. The training then moves to the second oldest point, and the oldest point goes to the next set of units in the input layer, and so on. The network trains like a feed-forward, back propagation network trying to predict the next value in the series at each step.

[Figure 7.12: A time-delay neural network remembers the previous few training examples and uses them as input into the network. The network then works like a feed-forward, back propagation network.]

Notice that the time-series network is not limited to data from just a single time series.
It can take multiple inputs. For instance, to predict the value of the Swiss franc to U.S. dollar exchange rate, other time-series information might be included, such as the volume of the previous day's transactions, the U.S. dollar to Japanese yen exchange rate, the closing value of the stock exchange, and the day of the week. In addition, non-time-series data, such as the reported inflation rate in the countries over the period of time under investigation, might also be candidate features.

The number of historical units controls the length of the patterns that the network can recognize. For instance, keeping 10 historical units on a network predicting the closing price of a favorite stock will allow the network to recognize patterns that occur within 2-week time periods (since prices are set only on weekdays). Relying on such a network to predict the value 3 months in the future is not recommended.

Actually, by modifying the input, a feed-forward network can be made to work like a time-delay neural network. Consider the time series with 10 days of history shown in Table 7.5. The network will include two features: the day of the week and the closing price. Creating a time series with a time lag of three requires adding new features for the historical, lagged values. (Day-of-the-week does not need to be copied, since it does not really change.) The result is Table 7.6. This data can now be input into a feed-forward, back propagation network without any special support for time series.
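As a concrete sketch, the lag-creation step just described can be written as a short Python function. The function name and list layout are illustrative, not from the book; early rows get None where no history exists yet, matching the blank cells at the top of the lagged table.

```python
def add_lags(prices, n_lags):
    """Turn a price series into rows of (price_t, price_t-1, ..., price_t-n_lags).

    Rows near the start of the series have None where no lagged
    value is available yet.
    """
    rows = []
    for i, price in enumerate(prices):
        row = [price]
        for lag in range(1, n_lags + 1):
            row.append(prices[i - lag] if i - lag >= 0 else None)
        rows.append(row)
    return rows

# The 10 closing prices from the time-series example.
prices = [40.25, 41.00, 39.25, 39.75, 40.50,
          40.50, 40.75, 41.25, 42.00, 41.50]
lagged = add_lags(prices, 2)
# lagged[2] is [39.25, 41.00, 40.25]: day 3 with its two lagged prices
```

Each row can then be fed, together with the unchanged day-of-week feature, into an ordinary feed-forward network.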
Table 7.5  Time Series

DATA ELEMENT    DAY-OF-WEEK    CLOSING PRICE
1               1              $40.25
2               2              $41.00
3               3              $39.25
4               4              $39.75
5               5              $40.50
6               1              $40.50
7               2              $40.75
8               3              $41.25
9               4              $42.00
10              5              $41.50

Table 7.6  Time Series with Time Lag

DATA      DAY-OF-   CLOSING   PREVIOUS         PREVIOUS-1
ELEMENT   WEEK      PRICE     CLOSING PRICE    CLOSING PRICE
1         1         $40.25
2         2         $41.00    $40.25
3         3         $39.25    $41.00           $40.25
4         4         $39.75    $39.25           $41.00
5         5         $40.50    $39.75           $39.25
6         1         $40.50    $40.50           $39.75
7         2         $40.75    $40.50           $40.50
8         3         $41.25    $40.75           $40.50
9         4         $42.00    $41.25           $40.75
10        5         $41.50    $42.00           $41.25

How to Know What Is Going on Inside a Neural Network

Neural networks are opaque. Even knowing all the weights on all the nodes throughout the network does not give much insight into why the network produces the results that it produces. This lack of understanding has some philosophical appeal—after all, we do not understand how human consciousness arises from the neurons in our brains. As a practical matter, though, opaqueness impairs our ability to understand the results produced by a network. If only we could ask it to explain its decisions in the form of rules. Unfortunately, the same nonlinear characteristics of neural network nodes that make them so powerful also make them unable to produce simple rules. Eventually, research into rule extraction from networks may bring unequivocally good results. Until then, the trained network itself is the rule, and other methods are needed to peer inside to understand what is going on.

A technique called sensitivity analysis can be used to get an idea of how opaque models work. Sensitivity analysis does not provide explicit rules, but it does indicate the relative importance of the inputs to the result of the network. Sensitivity analysis uses the test set to determine how sensitive the output of the network is to each input. The following are the basic steps:

1. Find the average value for each input.
We can think of this average value as the center of the test set.

2. Measure the output of the network when all inputs are at their average value.

3. Measure the output of the network when each input is modified, one at a time, to be at its minimum and maximum values (usually –1 and 1, respectively).

For some inputs, the output of the network changes very little for the three values (minimum, average, and maximum). The network is not sensitive to these inputs (at least when all other inputs are at their average value). Other inputs have a large effect on the output of the network. The network is sensitive to these inputs. The amount of change in the output measures the sensitivity of the network for each input. Using these measures for all the inputs creates a relative measure of the importance of each feature. Of course, this method is entirely empirical and looks only at each variable independently. Neural networks are interesting precisely because they can take interactions between variables into account.

There are variations on this procedure. It is possible to modify the values of two or three features at the same time to see if combinations of features have a particular importance. Sometimes, it is useful to start from a location other than the center of the test set. For instance, the analysis might be repeated for the minimum and maximum values of the features to see how sensitive the network is at the extremes. If sensitivity analysis produces significantly different results for these three situations, then there are higher-order effects in the network that are taking advantage of combinations of features.

When using a feed-forward, back propagation network, sensitivity analysis can take advantage of the error measures calculated during the learning phase instead of having to test each feature independently.
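The three-step perturbation procedure above can be sketched in a few lines of Python. Here `network` stands in for any trained model mapping a list of inputs (each scaled to [–1, 1]) to a single output; the function and variable names are illustrative.

```python
def sensitivity(network, test_set):
    """Score each input by how much swinging it to -1/+1 moves the output."""
    n_inputs = len(test_set[0])
    # Step 1: the average value of each input -- the "center" of the set.
    center = [sum(row[i] for row in test_set) / len(test_set)
              for i in range(n_inputs)]
    # Step 2: the output of the network at the center.
    base = network(center)
    # Step 3: swing each input to its minimum and maximum, one at a time.
    scores = []
    for i in range(n_inputs):
        lo, hi = list(center), list(center)
        lo[i], hi[i] = -1.0, 1.0
        scores.append(max(abs(network(lo) - base),
                          abs(network(hi) - base)))
    return scores  # larger score = network more sensitive to that input

# Toy model: output depends strongly on input 0, weakly on input 1.
model = lambda x: 0.9 * x[0] + 0.1 * x[1]
scores = sensitivity(model, [[0.2, 0.4], [-0.2, -0.4]])
# scores[0] is much larger than scores[1], as expected
```

As the text notes, this checks each variable independently; probing pairs of inputs at once is a straightforward extension of the same loop.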
The validation set is fed into the network to produce the output, and the output is compared to the predicted output to calculate the error. The network then propagates the error back through the units, not to adjust any weights but to keep track of the sensitivity of each input. The error is a proxy for the sensitivity, determining how much each input affects the output in the network. Accumulating these sensitivities over the entire test set determines which inputs have the larger effect on the output. In our experience, though, the values produced in this fashion are not particularly useful for understanding the network.

TIP: Neural networks do not produce easily understood rules that explain how they arrive at a given result. Even so, it is possible to understand the relative importance of inputs into the network by using sensitivity analysis. Sensitivity analysis can be a manual process where each feature is tested one at a time relative to the other features. It can also be more automated by using the sensitivity information generated by back propagation. In many situations, understanding the relative importance of inputs is almost as good as having explicit rules.

Self-Organizing Maps

Self-organizing maps (SOMs) are a variant of neural networks used for undirected data mining tasks such as cluster detection. The Finnish researcher Dr. Teuvo Kohonen invented self-organizing maps, which are also called Kohonen Networks. Although used originally for images and sounds, these networks can also recognize clusters in data. They are based on the same underlying units as feed-forward, back propagation networks, but SOMs differ in two respects: they have a different topology, and, because back propagation is not applicable to them, they have an entirely different method for training.

What Is a Self-Organizing Map?
The self-organizing map (SOM), an example of which is shown in Figure 7.13, is a neural network that can recognize unknown patterns in the data. Like the networks we've already looked at, the basic SOM has an input layer and an output layer. Each unit in the input layer is connected to one source, just as in the networks for predictive modeling. Also, like those networks, each unit in the SOM has an independent weight associated with each incoming connection (this is actually a property of all neural networks). However, the similarity between SOMs and feed-forward, back propagation networks ends here.

The output layer consists of many units instead of just a handful. Each of the units in the output layer is connected to all of the units in the input layer. The output layer is arranged in a grid, as if the units were in the squares on a checkerboard. Even though the units are not connected to each other in this layer, the grid-like structure plays an important role in the training of the SOM, as we will see shortly.

How does an SOM recognize patterns? Imagine one of the booths at a carnival where you throw balls at a wall filled with holes. If the ball lands in one of the holes, then you have your choice of prizes. Training an SOM is like being at the booth blindfolded, and initially the wall has no holes. This is very similar to the situation when you start looking for patterns in large amounts of data and don't know where to start. Each time you throw the ball, it dents the wall a little bit. Eventually, when enough balls land in the same vicinity, the indentation breaks through the wall, forming a hole. Now, when another ball lands at that location, it goes through the hole. You get a prize—at the carnival, this is a cheap stuffed animal; with an SOM, it is an identifiable cluster. Figure 7.14 shows how this works for a simple SOM.
When a member of the training set is presented to the network, the values flow forward through the network to the units in the output layer. The units in the output layer compete with each other, and the one with the highest value "wins." The reward is to adjust the weights leading up to the winning unit to strengthen its response to the input pattern. This is like making a little dent in the network.

[Figure 7.13: The self-organizing map is a special kind of neural network that can be used to detect clusters. The output units compete with each other for the output of the network. The output layer is laid out like a grid; each unit is connected to all the input units, but not to the other output units. The input layer is connected to the inputs.]

There is one more aspect to the training of the network. Not only are the weights for the winning unit adjusted, but the weights for units in its immediate neighborhood are also adjusted to strengthen their response to the inputs. This adjustment is controlled by a neighborliness parameter that controls the size of the neighborhood and the amount of adjustment. Initially, the neighborhood is rather large, and the adjustments are large. As the training continues, the neighborhoods and adjustments decrease in size.

Neighborliness actually has several practical effects. One is that the output layer behaves more like a connected fabric, even though the units are not directly connected to each other: clusters similar to each other should be closer together than more dissimilar clusters. More importantly, though, neighborliness allows a group of units to represent a single cluster. Without it, the network would tend to find as many clusters in the data as there are units in the output layer—introducing bias into the cluster detection.
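A minimal sketch of one such training step, under assumed settings (a 4 x 4 grid, two inputs, and a Gaussian neighborliness falloff; the grid size, learning rate, and radius are illustrative choices, not values from the text):

```python
import math
import random

GRID = 4  # 4 x 4 output grid (an assumption for this sketch)
random.seed(0)
# One weight per incoming connection: each output unit holds a
# weight vector the same length as the input.
weights = [[[random.uniform(-1, 1), random.uniform(-1, 1)]
            for _ in range(GRID)] for _ in range(GRID)]

def train_step(x, lr=0.5, radius=1.0):
    # The winner is the unit whose weight vector is closest to the input.
    best, best_d = (0, 0), float("inf")
    for r in range(GRID):
        for c in range(GRID):
            d = sum((w - xi) ** 2 for w, xi in zip(weights[r][c], x))
            if d < best_d:
                best, best_d = (r, c), d
    br, bc = best
    # Neighborliness: the adjustment falls off with grid distance,
    # so units near the winner also strengthen their response.
    for r in range(GRID):
        for c in range(GRID):
            g = math.exp(-((r - br) ** 2 + (c - bc) ** 2) / (2 * radius ** 2))
            for i, xi in enumerate(x):
                weights[r][c][i] += lr * g * (xi - weights[r][c][i])
    return best

# Repeated presentation of the same input "dents" the map: the
# winning unit's weights converge toward the input pattern.
for _ in range(30):
    winner = train_step([0.8, -0.3])
```

In a full implementation, the learning rate and radius would shrink as training continues, as the text describes.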
[Figure 7.14: An SOM finds the output unit that does the best job of recognizing a particular input, shown as the winning output unit and its path through the network.]

Typically, an SOM identifies fewer clusters than it has output units. This is inefficient when using the network to assign new records to the clusters, since the new inputs are fed through the network to unused units in the output layer. To determine which units are actually used, we apply the SOM to the validation set. The members of the validation set are fed through the network, keeping track of the winning unit in each case. Units with no hits or with very few hits are discarded. Eliminating these units increases the run-time performance of the network by reducing the number of calculations needed for new instances.

Once the final network is in place—with the output layer restricted only to the units that identify specific clusters—it can be applied to new instances. An unknown instance is fed into the network and is assigned to the cluster at the output unit with the largest weight. The network has identified clusters, but we do not know anything about them. We will return to the problem of identifying clusters a bit later.

The original SOMs used two-dimensional grids for the output layer. This was an artifact of earlier research into recognizing features in images composed of a two-dimensional array of pixel values. The output layer can really have any structure—with neighborhoods defined in three dimensions, as a network of hexagons, or laid out in some other fashion.

Example: Finding Clusters

A large bank is interested in increasing the number of home equity loans that it sells, which provides an illustration of the practical use of clustering.
The bank decides that it needs to understand customers who currently have home equity loans to determine the best strategy for increasing its market share. To start this process, demographics are gathered on 5,000 customers who have home equity loans and 5,000 customers who do not have them. Even though the proportion of customers with home equity loans is less than 50 percent, it is a good idea to have equal weights in the training set.

The data that is gathered has fields like the following:

- Appraised value of house
- Amount of credit available
- Amount of credit granted
- Age
- Marital status
- Number of children
- Household income

This data forms a good training set for clustering. The input values are mapped so they all lie between –1 and +1; these are used to train an SOM. The network identifies five clusters in the data, but it does not give any information about the clusters. What do these clusters mean?

A common technique for comparing different clusters that works particularly well with neural network techniques is the average member technique. Find the most average member of each of the clusters—the center of the cluster. This is similar to the approach used for sensitivity analysis. To do this, find the average value for each feature in each cluster. Since all the features are numbers, this is not a problem for neural networks.

For example, say that half the members of a cluster are male and half are female, and that male maps to –1.0 and female to +1.0. The average member for this cluster would have a value of 0.0 for this feature. In another cluster, there may be nine females for every male. For this cluster, the average member would have a value of 0.8. This averaging works very well with neural networks, since all inputs have to be mapped into a numeric range.

TIP: Self-organizing maps, a type of neural network, can identify clusters, but they do not identify what makes the members of a cluster similar to each other.
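The average-member computation just described reduces to a per-cluster mean of each feature. A sketch in plain Python (the function name and the cluster assignments are illustrative; the gender coding matches the –1.0/+1.0 example above):

```python
def cluster_centers(records, assignments):
    """records: list of feature vectors; assignments: a cluster id per record.

    Returns the average member (per-feature mean) of each cluster.
    """
    sums, counts = {}, {}
    for rec, cid in zip(records, assignments):
        if cid not in sums:
            sums[cid] = [0.0] * len(rec)
            counts[cid] = 0
        for i, value in enumerate(rec):
            sums[cid][i] += value
        counts[cid] += 1
    return {cid: [s / counts[cid] for s in sums[cid]] for cid in sums}

# The nine-females-to-one-male example from the text, with
# male = -1.0 and female = +1.0: the average member scores 0.8.
records = [[-1.0]] + [[1.0]] * 9
centers = cluster_centers(records, [0] * 10)
# centers[0][0] is 0.8
```

This only works cleanly because every input has already been mapped into a numeric range, as the text points out.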
A powerful technique for comparing clusters is to determine the center or average member of each cluster. Using the test set, calculate the average value for each feature in the data. These average values can then be displayed in the same graph to determine the features that make a cluster unique.

These average values can be plotted using parallel coordinates, as in Figure 7.15, which shows the centers of the five clusters identified in the banking example. In this case, the bank noted that one of the clusters was particularly interesting, consisting of married customers in their forties with children. A bit more investigation revealed that these customers also had children in their late teens. Members of this cluster had more home equity lines than members of other clusters.

[Figure 7.15: The centers of five clusters are compared on the same graph, plotted across Available Credit, Credit Balance, Age, Marital Status, Num Children, and Income on a scale from –1.0 to 1.0. This simple visualization technique (called parallel coordinates) helps identify interesting clusters. One cluster stands out: high-income customers with children in the middle age group who are taking out large loans.]

[...] areas:

Fraud detection. New cases of fraud are likely to be similar to known cases. MBR can find and flag them for further investigation.

Customer response prediction. The next customers likely to respond to an offer are probably similar to previous customers who have responded. MBR can easily identify the next likely customers.

Medical treatments. The most effective treatment for a given patient is probably the...
...neighbors along the Hudson and Delaware rivers, but rather its neighbors based on descriptive variables—in this case, population and median home value. The scatter plot shows New York towns arranged by these two variables. Figure 8.1 shows that, measured this way, Brooklyn and Queens are close neighbors, and both are far from Manhattan. Although Manhattan is nearly as populous as Brooklyn and Queens, its home...

...dimensions and the choice of a distance metric are crucial to any nearest-neighbor approach.

The first stage of MBR finds the closest neighbor on the scatter plot shown in Figure 8.1. Then the next closest neighbor is found, and so on, until the desired number are available. In this case, the number of neighbors is two, and the nearest ones turn out to be Shelter Island (which really is an island), way out by the tip of Long Island's North Fork, and North Salem, a town in Northern Westchester near the Connecticut border. These towns fall at about the middle of a list sorted by population and near the top of one sorted by home value. Although they are many miles apart, along these two dimensions Shelter Island and North Salem are very similar to Tuxedo. Once the neighbors...

...separate codes, distributed as follows in the training set (Table 8.2). The number and types of codes assigned to stories varied. Almost all the stories had region and subject codes and, on average, almost three region codes per story. At the other extreme, relatively few stories contained government and product codes, and such stories rarely had more than one such code.

Table 8.2  Six Types of Codes Used...
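Returning to the Tuxedo example, the neighbor search on the two descriptive variables can be sketched as follows. The population and home-value figures below are illustrative placeholders, not the census numbers from the text; both variables are min-max scaled so that neither dominates the distance.

```python
# Hypothetical (population, median home value) pairs for a few towns.
towns = {
    "Tuxedo":         (3300,    250000),
    "Shelter Island": (2200,    290000),
    "North Salem":    (5200,    310000),
    "Brooklyn":       (2465000, 180000),
    "Manhattan":      (1537000, 500000),
}

def neighbors(target, k=2):
    """Rank towns by Euclidean distance from the target on scaled axes."""
    pops = [p for p, _ in towns.values()]
    vals = [v for _, v in towns.values()]

    def scale(x, xs):  # min-max scaling so both variables count equally
        return (x - min(xs)) / (max(xs) - min(xs))

    tp, tv = towns[target]
    dists = []
    for name, (p, v) in towns.items():
        if name == target:
            continue
        d = ((scale(p, pops) - scale(tp, pops)) ** 2 +
             (scale(v, vals) - scale(tv, vals)) ** 2) ** 0.5
        dists.append((d, name))
    return [name for _, name in sorted(dists)[:k]]

print(neighbors("Tuxedo"))  # -> ['Shelter Island', 'North Salem']
```

With these placeholder numbers, the two small towns come out closest to Tuxedo while the populous boroughs are far away, mirroring the behavior described in the text.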
...salary, and gender, and suggests that standard distance is a good metric for nearest neighbors. The four most common distance functions for numeric fields are:

- Absolute value of the difference: |A – B|
- Square of the difference: (A – B)^2
- Normalized absolute value: |A – B| / (maximum difference)
- Absolute value of difference of standardized values: |(A – mean)/(standard deviation) – (B – mean)/(standard deviation)|

...distance and combination functions, which often requires a bit of trial and error and intuition.

Example: Using MBR to Estimate Rents in Tuxedo, New York

The purpose of this example is to illustrate how MBR works by estimating the cost of renting an apartment in the target town. It does so by combining data on rents in several similar towns—its nearest neighbors. MBR works by first identifying neighbors and then...

...In fact, the problem lay elsewhere. The bank had initially used only general customer information; it had not combined information from the many different systems servicing its customers. The bank returned to the problem of identifying customers, but this time it included more information—from the deposits system, the credit card system, and so on. The basic methods remained the same, so we will not go into...
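The four distance functions in the list above translate directly into code. A sketch (function names are illustrative; `xs` is the full set of values for the field, which the normalized and standardized variants need):

```python
def absolute(a, b):
    # Absolute value of the difference: |A - B|
    return abs(a - b)

def squared(a, b):
    # Square of the difference: (A - B)^2
    return (a - b) ** 2

def normalized(a, b, xs):
    # |A - B| / (maximum difference over the field)
    return abs(a - b) / (max(xs) - min(xs))

def standardized(a, b, xs):
    # |(A - mean)/sd - (B - mean)/sd|, using the population std deviation
    mean = sum(xs) / len(xs)
    sd = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return abs((a - mean) / sd - (b - mean) / sd)

ages = [20, 30, 40, 50, 60]
print(normalized(20, 60, ages))  # -> 1.0, the maximum possible distance
```

The normalized form keeps every field's contribution in [0, 1], which is why it combines well across fields with very different scales.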
...claims, and a mushroom hunter spotting morels are all following a similar process. Each first identifies similar cases from experience and then applies knowledge of those cases to the problem at hand. This is the essence of memory-based reasoning: a database of known records is searched to find preclassified records similar to a new record, and these neighbors are used for classification and estimation.

...statisticians started to use them and understand them better. A neural network consists of artificial neurons connected together. Each neuron mimics its biological counterpart, taking various inputs, combining them, and producing an output. Since digital neurons process numbers, the activation function characterizes the neuron. In most cases, this function takes the weighted sum of its inputs and applies an S-shaped...

[Figure 8.1: Based on 2000 census population and home value, the town of... (caption truncated). The accompanying table lists, among other columns, Shelter Island (population 2,228, rent $804) and North Salem (population 5,173, rent $1,150); the full layout is not recoverable.]
