
8 CC CC

Now we are ready for the second, and final, sorting operation. We start by counting the number of occurrences of each character in the first, or more significant, position. There is one A in this position, along with four B's and three C's. Starting with the lowest character value, the key that has an A in this position ends up in the first slot of the output. The B's follow, starting with the second slot.

Remember that the keys are already in order by their second, or less significant, character, because of our previous sort. This is important when we consider the order of keys that have the same first character. For example, in Figure lesser.char, the asterisks mark the keys that start with B. [5] They are in order by their second character, since we have just sorted on that character. Therefore, when we rearrange the keys by their first character position, those that have the same first character will be in order by their second character as well, since we always move keys from the input to the output in order of their position in the input. That means that the B records have the same order in the output that they had in the input, as you can see in Figure greater.char.

This may seem like a lot of work to sort a few strings. However, the advantage of this method when we have many keys to be sorted is that the processing for each pass is extremely simple, requiring only a few machine instructions per byte handled (fewer than 30 on the 80x86 family).

More significant character sort (Figure greater.char)

     Input   Output
1    BA      AB
2    CA      BA
3    BA      BA
4    AB      BB
5    CB      BC
6    BB      CA
7    BC      CB
8    CC      CC

On the Home Stretch

Having finished sorting the keys, we have only to retrieve the records we want from the input file in sorted order and write them to the output file. But how do we know which records we need to read and in what order?
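To make the two sorting passes described above concrete, here is a minimal sketch of a distribution counting sort for fixed-length keys. It is not the book's listing: the names counting_pass and radix_sort, and the use of std::string keys, are assumptions made purely for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One distribution counting pass: reorder `keys` by the character at
// `position`, keeping keys with equal characters in their input order.
std::vector<std::string> counting_pass(const std::vector<std::string>& keys,
                                       std::size_t position)
{
    // Count the occurrences of each character value in this position.
    std::array<std::size_t, 256> count{};
    for (const std::string& key : keys) {
        assert(position < key.size());
        ++count[static_cast<unsigned char>(key[position])];
    }

    // Convert the counts into the first output slot for each character value.
    std::array<std::size_t, 256> next_slot{};
    std::size_t slot = 0;
    for (std::size_t c = 0; c < 256; ++c) {
        next_slot[c] = slot;
        slot += count[c];
    }

    // Move keys from input to output in input order; this is what
    // preserves the ordering established by the previous pass.
    std::vector<std::string> output(keys.size());
    for (const std::string& key : keys) {
        unsigned char c = static_cast<unsigned char>(key[position]);
        output[next_slot[c]++] = key;
    }
    return output;
}

// Sort fixed-length keys with one pass per character position,
// starting with the least significant (rightmost) position.
std::vector<std::string> radix_sort(std::vector<std::string> keys,
                                    std::size_t key_length)
{
    for (std::size_t position = key_length; position-- > 0; ) {
        keys = counting_pass(keys, position);
    }
    return keys;
}
```

Calling counting_pass with position 0 on the eight keys BA, CA, BA, AB, CB, BB, BC, CC produces AB, BA, BA, BB, BC, CA, CB, CC: the single A key takes the first slot, the four B keys follow in their input order, and the three C keys come last.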
The sort function requires more than just the keys to be sorted. We also have to give it a list of the record numbers to which those keys correspond, so that it can rearrange that list in the same way that it rearranges the keys. Then we can use the rearranged list of record numbers to retrieve the records in the correct order, which completes the task of our program.

The Code

Let's start our examination of the mailing list program by looking at the header file that defines its main constants and structures, which is shown in Figure mail.00a.

The main header file for the mailing list program (mail\mail.h) (Figure mail.00a)
codelist/mail.00a

Now we're ready to look at the implementation of these algorithms, starting with function main (Figure mail.00).

main function definition (from mail\mail.cpp) (Figure mail.00)
codelist/mail.00

This function begins by checking the number of arguments with which it was called, and exits with an informative message if there aren't enough. Otherwise, it constructs the output file name and opens the file for binary input. Then it calls the initialize function (Figure mail.01), which sets up the selection criteria according to input arguments 3 through 6 (minimum spent, maximum spent, earliest last-here date, and latest last-here date). Now we are ready to call process (Figure mail.02), to select the records that meet those criteria.

initialize function definition (from mail\mail.cpp) (Figure mail.01)
codelist/mail.01

process function definition (from mail\mail.cpp) (Figure mail.02)
codelist/mail.02

The first order of business in process is to set up the buffering for the list (output) and data files. It is important to note that we are using a large buffer for the list file and for the first pass through the data file, but are changing the buffer size to the size of one record for the second pass through the data file. What is the reason for this change?
Determining the Proper Buffer Size

On the first pass through the data file, we are going to read every record in physical order, so a large buffer is useful in reducing the number of physical disk accesses needed. This analysis, however, does not apply to the second pass through the data file. In this case, using a bigger buffer for the data file would actually reduce performance, since reading a large amount of data at once is helpful only if you are going to use the data that you are reading. [6] On the second pass, we will read the data records in order of their ZIP codes, forcing us to move to a different position in the data file for each record rather than reading them consecutively. Using a big buffer in this situation would mean that most of the data in the buffer would be irrelevant.

Preparing to Read the Key File

Continuing in process, we calculate the number of records in the data file, which determines how large our record selection bitmap should be. Then we call the macro allocate_bitmap, which is defined in bitfunc.h (Figure bitfunc.00a), to allocate storage for the bitmap.

The header file for the bitmap functions (mail\bitfunc.h) (Figure bitfunc.00a)
codelist/bitfunc.00a

Of course, each byte of a bitmap can store eight bits, so the macro divides the number of bits we need by eight and adds one byte to the result. The extra byte is to accommodate any remainder after the division by eight.

Now that we have allocated our bitmap, we can read through the data file and select the records that meet our criteria. After initializing our counts of "items read" and "found" to zero, we are ready to start reading records. Of course, we could calculate the number of times through the loop rather than continue until we run off the end of the input file, since we know how many records there are in the file.
However, since we are processing records in batches, the last of which is likely to be smaller than the rest, we might as well take advantage of the fact that when we get a short count of items_read, the operating system is telling us that we have reached the end of the file.

Reading the Key File

The first thing we do in the "infinite" loop is to read a set of processing_batch records (to avoid the overhead of calling the operating system to read each record). Now we are ready to process one record at a time in the inner loop. Of course, we want to know whether the record we are examining meets our selection criteria: the customer must have spent at least min_spent and no more than max_spent, and must have last been here no earlier than min_date and no later than max_date. If the record fails to meet any of these criteria, we skip the remainder of the processing for this record via "continue". However, let's suppose that a record passes these four tests. In that case, we increment items_found. Then we want to set the bit in the found bitmap that corresponds to this record. To do this, we need to calculate the current record number, by adding the number of records read before the current processing batch (total_items_read) to the entry number in the current batch (i). Now we are ready to call setbit (Figure bitfunc.00).

setbit function definition (from mail\bitfunc.cpp) (Figure bitfunc.00)
codelist/bitfunc.00

Setting a Bit in a Bitmap

The setbit function is quite simple. Since there are eight bits in a byte, we have to calculate which byte we need to access and which bit within that byte. Once we have calculated these two values, we can retrieve the appropriate byte from the bitmap. In order to set the bit we are interested in, we need to create a "mask" to isolate that bit from the others in the same byte.
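The byte and bit arithmetic described above can be sketched as follows. This is an illustrative sketch rather than the listing in Figure bitfunc.00: the names bitmap_bytes, set_bit, and test_bit and the use of std::vector are assumptions, chosen so that the example stands on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Bytes needed to hold `bit_count` bits: divide by eight and add one
// byte to cover any remainder, following the sizing rule in the text.
std::size_t bitmap_bytes(std::size_t bit_count)
{
    return bit_count / 8 + 1;
}

// Turn on the bit for `element` and report whether it was already on.
bool set_bit(std::vector<unsigned char>& bitmap, std::size_t element)
{
    std::size_t byte_number = element / 8;                        // which byte
    unsigned bit_number = static_cast<unsigned>(element % 8);     // which bit in it
    assert(byte_number < bitmap.size());

    // The mask has a 1 in the position of interest and 0 everywhere else.
    unsigned char mask = static_cast<unsigned char>(1u << bit_number);
    bool was_set = (bitmap[byte_number] & mask) != 0;
    bitmap[byte_number] |= mask;                                  // OR turns the bit on
    return was_set;
}

// Report whether the bit for `element` is currently on.
bool test_bit(const std::vector<unsigned char>& bitmap, std::size_t element)
{
    std::size_t byte_number = element / 8;
    unsigned bit_number = static_cast<unsigned>(element % 8);
    assert(byte_number < bitmap.size());
    return (bitmap[byte_number] & (1u << bit_number)) != 0;
}
```

For record number 10, for instance, the byte number is 1 and the bit number is 2, so the mask is 4 and only that one bit of the second byte changes.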
The statement that creates this mask, mask = 1 << bitnumber;, may seem mysterious, but all we are doing is generating a value that has a 1 in the same position as the bit we are interested in and 0 in all other positions. Therefore, after we perform a "logical or" operation of the mask and the byte from the bitmap, the resulting value, stored back into the bitmap, will have the desired bit set.

This setbit function also returns the value that the bit had before we set it. Thus, if we want to know whether we have actually changed the bit from off to on, we don't have to make a call to testbit before the call to setbit; we can use the return value from setbit instead. This would be useful, for example, in an application where the bitmap was being used to allocate some resource, such as a printer, which cannot be used by more than one process at a time. The function would call setbit and, if that bit had already been set, would return an error indicating that the resource was not available.

Now we have a means of keeping track of which records have been selected. However, we also need to save the ZIP code for each selected record for our sort. Unfortunately, we don't know how many records are going to be selected until we select them. This is easily dealt with in the case of the bitmap, which is so economical of storage that we can comfortably allocate a bit for every record in the file; ZIP codes, which take ten bytes apiece, pose a more difficult problem. We need a method of allocation which can provide storage for an unknown number of ZIP codes.

Allocate as You Go

Of course, we could use a simple linked list. In that approach, every time we found a record that matches our criteria, we would allocate storage for a ZIP code and a pointer to the next ZIP code. However, this consumes storage very rapidly, as additional memory is required to keep track of every allocation of storage.
When very small blocks of ten bytes or so are involved, the overhead can easily exceed the amount of storage actually used for our purposes, so that allocating 250,000 14-byte blocks can easily take 7.5 megabytes or more, rather than the 3.5 megabytes that we might expect.

To avoid this inefficiency, we can allocate larger blocks that can accommodate a number of ZIP codes each, and keep track of the addresses of each of these larger blocks so that we can retrieve individual ZIP codes later. That is the responsibility of the code inside the if statement that compares current_zip_entry with ZIP_BLOCK_ENTRIES. [7] To understand how this works, let's look back at the lines in process that set current_zip_block to -1 and current_zip_entry to ZIP_BLOCK_ENTRIES. This initialization ensures that the code in the "if" will be executed for the first selected record. We start by incrementing current_zip_block (to zero, in this case) and setting current_zip_entry to zero, to start a new block. Then we allocate storage for a new block of ZIP codes (zip_block) and set up ...
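The block-at-a-time allocation described above could be sketched along the following lines. Only the names current_zip_block, current_zip_entry, and ZIP_BLOCK_ENTRIES come from the text; the ZipStore type, its add and get functions, ZIP_LENGTH, and the block size of 1000 entries are hypothetical choices for illustration, not the book's code.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <string>
#include <vector>

constexpr std::size_t ZIP_LENGTH = 10;           // bytes per ZIP code, per the text
constexpr std::size_t ZIP_BLOCK_ENTRIES = 1000;  // assumed value for this sketch

using ZipCode = std::array<char, ZIP_LENGTH>;
using ZipBlock = std::array<ZipCode, ZIP_BLOCK_ENTRIES>;

struct ZipStore {
    std::vector<std::unique_ptr<ZipBlock>> blocks;      // addresses of the large blocks
    int current_zip_block = -1;                         // no block allocated yet
    std::size_t current_zip_entry = ZIP_BLOCK_ENTRIES;  // forces allocation on first add

    // Save one ZIP code, starting a new block when the current one is full.
    void add(const char* zip)
    {
        if (current_zip_entry == ZIP_BLOCK_ENTRIES) {
            ++current_zip_block;
            current_zip_entry = 0;
            blocks.push_back(std::unique_ptr<ZipBlock>(new ZipBlock()));  // zero-filled
        }
        ZipCode& slot =
            (*blocks[static_cast<std::size_t>(current_zip_block)])[current_zip_entry++];
        std::size_t length = std::min(std::strlen(zip), ZIP_LENGTH);
        std::copy(zip, zip + length, slot.begin());
    }

    // Retrieve the n-th saved ZIP code from its block and entry numbers.
    std::string get(std::size_t n) const
    {
        assert(n / ZIP_BLOCK_ENTRIES < blocks.size());
        const ZipCode& slot = (*blocks[n / ZIP_BLOCK_ENTRIES])[n % ZIP_BLOCK_ENTRIES];
        return std::string(slot.begin(), std::find(slot.begin(), slot.end(), '\0'));
    }
};
```

With these assumed sizes, saving 2500 ZIP codes costs three allocations of 10,000 bytes each instead of 2500 separate small allocations, so the per-allocation bookkeeping overhead is paid only three times.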