KEY CONCEPTS & TECHNIQUES IN GIS, Part 3

[...] coordinate systems. The good news is that most GIS these days relieve us from the burden of translating between the hundreds of projections and coordinate systems. The bad news is that we still need to understand how this works, so that we can ask the right questions in case the metadata fails to report on these necessities.

[Figure 6 Subset of a typical metadata tree. Contents: Identification Information (Citation, Description, Time Period of Content, Status); Spatial Reference (Horizontal Coordinate System Definition: planar; Map Projection: Lambert conformal conic with standard parallels 43.000000 and 45.500000, longitude of central meridian -120.500000, latitude of projection origin 41.750000, false easting 1312336.000000, false northing 0.000000; abscissa and ordinate resolution 0.004096; Horizontal Datum: NAD83; Ellipsoid: GRS80 with semi-major axis 6378137.000000 and flattening ratio 298.572222); Keywords; Access Constraints; Reference Information (Metadata Date, Metadata Contact, Metadata Standard Name, Metadata Standard Version)]

Contrary to Dutch or Kansas experiences, as well as to the way we store data in a GIS, the Earth is not flat. Given that calculations in spherical geometry are very complicated, leading to rounding errors, and that thousands of calculations are performed each time we ask the GIS to do something, manufacturers have decided to adopt the simple two-dimensional view of a paper map. Generations of cartographers have developed a myriad of ways to map positions on a sphere to coordinates on flat paper. Even the best of these projections have some flaws, and the main difference between projections is the kind of distortion that they introduce to the data (see Figure 7). It is, for example, impossible to design a map that measures the distances between all cities correctly. We can have a table that lists all these distances, but there is no way to draw them properly on a two-dimensional surface.

[Figure 7 The effect of different projections: Lambert Conformal Conic, Winkel Tripel, Mollweide, Orthographic, Azimuthal Equidistant]

Many novices to geographic data confuse the concepts of projections and coordinate systems. The former just describes the way we project points from a sphere onto a flat surface. The latter determines how we index positions and perform measurements on the result of the projection process. The confusion arises from the fact that many geographic coordinate systems consist of a projection and a mathematical coordinate system, and that sometimes the same name is used for a geographic coordinate system and the projection(s) it is based on (e.g. the Universal Transverse Mercator or UTM system). In addition, geographic coordinate systems differ in their metric (do the numbers that make up a coordinate represent feet, meters or decimal degrees?), the definition of their origin, and the assumed shape of the Earth, also known as its geodetic datum. It goes beyond the scope of this book to explain all these concepts, but the reader is invited to visit the USGS website at http://erg.usgs.gov/isb/pubs/factsheets/fs07701.html for more information on this subject.

Sometimes (e.g. when we try to incorporate old sketches or undocumented maps), we do not have the information that a GIS needs to match different datasets. In that case, we have to resort to a process known as rubber sheeting, where we interactively link as many individually identifiable points in both datasets as we can, until we have enough information to perform a geometric transformation. This assumes that we have one master dataset whose coordinates we trust and an unknown or untrusted dataset whose coordinates we try to improve.
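To see what the projection parameters in a metadata record such as Figure 6 mean in practice, here is a minimal sketch that converts a geographic WGS84 coordinate into that Lambert conformal conic system. It assumes the pyproj library, and the test location is arbitrary; neither is prescribed by the text.

```python
from pyproj import CRS, Transformer

# Lambert conformal conic definition built from the Figure 6 parameters.
# PROJ expects the false easting/northing in meters; 400,000 m corresponds
# to the 1,312,336 ft reported in the metadata.
lambert = CRS.from_proj4(
    "+proj=lcc +lat_1=43 +lat_2=45.5 +lat_0=41.75 +lon_0=-120.5 "
    "+x_0=400000 +y_0=0 +datum=NAD83 +units=ft +no_defs"
)

to_lambert = Transformer.from_crs("EPSG:4326", lambert, always_xy=True)

lon, lat = -122.68, 45.52                  # an arbitrary test location
x_ft, y_ft = to_lambert.transform(lon, lat)
print(f"Easting: {x_ft:,.1f} ft   Northing: {y_ft:,.1f} ft")
```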
2.5 Geographic web services

The previous sections describe a state of data acquisition that is rapidly becoming outdated in some application areas. Among the first questions that one should ask before embarking on a GIS project is: how unique is this project? If it is not too specialized, then chances are that there is a market for providing this service, or at least the data for it. This is particularly pertinent in application areas where the geography changes constantly, such as weather services, traffic monitoring, or real estate markets. Here it would be prohibitively expensive to constantly collect data for just one application, and one should look on the web for either the data or, if one is lucky, even the analysis results.

Web-based geographic data provision has come a long (and sometimes unexpected) way. In the 1990s and the first few years of the new millennium, the emphasis was on FTP servers and web portals that provided access either to public domain data (the USGS and US Census Bureau played a prominent role in the US) or to commercial data, most commonly imagery. Standardization efforts, especially those aimed at congruence with other IT standards, helped geographic services to become mainstream. Routing services (like it or not, MapQuest has become a household name for what geography is about), neighborhood searches such as local.yahoo.com, and geodemographics have helped to catapult geographic web services out of the academic realm and into the marketplace.

There is an emerging market for non-GIS applications that are nevertheless based on the provision of decentralized geodata in the widest sense. Many near real-time applications, such as sending half a million volunteers on door-to-door canvassing during the 2004 presidential elections in the US, the forecasting of avalanche risks and the subsequent day-to-day operation of ski lifts in the European Alps, or the coordination of emergency management efforts during the 2004 tsunami, have only been possible because of the interoperability of web services.

The majority of web services are commercial, accessible only for a fee (commercial providers might have special provisions in case of emergencies). As this is a very new market, the rates are fluctuating and negotiable, but they can be substantial if there are many (as in millions of) individual queries. The biggest potential lies in the emergence of middle-tier applications not aimed at the end user, which take raw data and transform it so that it can be combined with other web services. Examples include concierge services that map attractions around hotels together with continuously updated restaurant menus, department store sales, cinema schedules, etc., or a nature conservation website that continuously maps GPS locations of collared elephants against updated satellite imagery rendered in a 3-D landscape that changes according to the direction of the track. In some respects, this spells the demise of GIS as we know it, because the tasks that one would usually perform in a GIS are now executed on a central server that combines individual services the same way that an end consumer used to combine GIS functions.
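Under the hood, most of these combinations rest on standardized service interfaces. The sketch below shows what a single request to an OGC Web Map Service looks like; the endpoint URL and layer name are placeholders rather than services mentioned in the text.

```python
import requests

WMS_URL = "https://example.org/geoserver/wms"   # hypothetical WMS endpoint

params = {
    "service": "WMS",
    "version": "1.3.0",
    "request": "GetMap",
    "layers": "demo:land_use",                  # placeholder layer name
    "crs": "EPSG:4326",
    "bbox": "41.75,-124.60,46.30,-116.50",      # lat/lon order for EPSG:4326 in WMS 1.3.0
    "width": 800,
    "height": 600,
    "format": "image/png",
}

response = requests.get(WMS_URL, params=params, timeout=30)
response.raise_for_status()

with open("map.png", "wb") as f:                # the rendered map, ready to display
    f.write(response.content)
```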
Similar to the way that a Unix shell script programmer combines little programs to develop highly customized applications, web services application programmers now combine traditional GIS functionality with commercial services (like the one that performs a secure credit card transaction) to provide highly specialized functionality at a fraction of the price of a GIS installation. This form of outsourcing can have great economic benefits and, as in the case of emergency applications, may be the only way to compile crucial information at short notice. But it comes at the price of losing control over how data is combined. The next chapter will deal with this issue of quality control in some detail.

3 Handling Uncertainty

The only way to be justifiably confident about the data one is working with is to collect all the primary data oneself and to have complete control over all aspects of acquisition and processing. In the light of the costs involved in creating or accessing existing data, this is not a realistic proposition for most readers.

GIS owe their right to existence to their use in a larger spatial decision-making process. By basing our decisions on GIS data and procedures, we put faith in the truthfulness of the data and the appropriateness of the procedures. Practical experience has tested that faith often enough for the GIS community to come up with ways and means to handle the uncertainty associated with data and procedures over which we do not have complete control. This chapter will introduce aspects of spatial data quality and then discuss metadata management as the best method to deal with spatial data quality.

3.1 Spatial data quality

Quality, in very general terms, is a relative concept. Nothing is or has innate quality; rather, quality is related to purpose. Even the best weather map is pretty useless for navigation or orientation purposes. Spatial data quality is therefore described along characterizing dimensions such as positional accuracy or thematic precision. Other dimensions are completeness, consistency, lineage, semantics and time.

One of the most often misinterpreted concepts is that of accuracy, which is often seen as synonymous with quality although it is only one, and not an overly significant, part of it. Accuracy is the inverse of error, or in other words the difference between what is supposed to be encoded and what actually is encoded. 'Supposed to be encoded' means that accuracy is measured relative to the world model of the person compiling the data, which, as discussed above, is dependent on the purpose. Knowing for what purpose data has been collected is therefore crucial in estimating data quality.

This notion of accuracy can now be applied to the positional, the temporal and the attribute components of geographic data. Spatial accuracy, in turn, can be applied to points as well as to the connections between points that we use to depict lines and the boundaries of area features. Given the number of points that are used in a typical GIS database, the determination of spatial accuracy can itself be the basis for a dissertation in spatial statistics. The same reasoning applies to the temporal component of geographic data. Temporal accuracy would then describe how close the recorded time for a crime event, for instance, is to when that crime actually took place.
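To make positional accuracy concrete, the sketch below computes a root mean square error from a handful of check points whose 'true' positions are known from a more trustworthy survey. All coordinates are invented for illustration.

```python
import math

# Reference ("true") coordinates and the coordinates encoded in the database,
# both in the same projected coordinate system (meters).
reference = [(478200.0, 5012340.0), (478955.5, 5011802.2), (480110.0, 5013001.7)]
encoded   = [(478203.1, 5012337.4), (478951.0, 5011806.0), (480114.2, 5012998.9)]

def rmse(ref, enc):
    """Root mean square positional error over matched point pairs."""
    squared = [(rx - ex) ** 2 + (ry - ey) ** 2
               for (rx, ry), (ex, ey) in zip(ref, enc)]
    return math.sqrt(sum(squared) / len(squared))

print(f"Positional RMSE: {rmse(reference, encoded):.2f} m")
```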
Thematic accuracy, finally, deals with how close the match is between the attribute value that should be there and the value that has actually been encoded. For quantitative measures this is determined similarly to positional accuracy. For qualitative measures, such as the correct land use classification of a pixel in a remotely sensed image, an error classification matrix is used.

Precision, on the other hand, refers to the amount of detail that can be discerned in the spatial, temporal or thematic aspects of geographic information. Data modelers prefer the term 'resolution' as it avoids a word that is often confused with accuracy. Precision is indirectly related to accuracy because it determines to a degree the world model against which accuracy is measured. A database with lower precision automatically has lower accuracy demands, which are easier to fulfill. For example, one land use categorization might just distinguish commercial use from residential, transport and green space, while another distinguishes different kinds of residential (single-family, small rental, large condominium) or commercial uses (markets, repair facilities, manufacturing, power production). Assigning the correct thematic association to each pixel or feature is considerably more difficult in the second case, and in many instances not necessary. Determining the accuracy and precision requirements is part of the thought process that should precede every data model design, which in turn is the first step in building a GIS database.

Accuracy and precision are the two most commonly described dimensions of data quality. Probably next in order of importance is database consistency. In traditional databases, this is accomplished by normalizing the tables, whereas in geographic databases topology is used to enforce spatial and temporal consistency. The classical example is a cadastre of property boundaries: no two properties should overlap. Topological rules are used to enforce this commonsense requirement, in this case the rule that two-dimensional objects must intersect at one-dimensional objects. Similarly, one can use topology to ascertain that no two events take place at the same time at the same location. Historically, the importance of the discovery of topological rules for GIS database design can hardly be overestimated.

Next in order of commonly sought data quality characteristics is completeness. It can be applied to the conceptual model as well as to its implementation. Data model completeness is a matter of mental rigor at the beginning of a GIS project: how do we know that we have captured all the relevant aspects of our project? A stakeholder meeting might be the best answer to that problem. Particularly on the implementation side, we have to deal with a surprising characteristic of completeness referred to as over-completeness. We speak of an error of commission when data is stored that should not be there because it is outside the spatial, temporal or thematic bounds of the specification.

Important information can be gleaned from the lineage of a dataset. Lineage describes where the data originally comes from and what transformations it has gone through. Though a more indirect measure than the previously described aspects of data quality, it sometimes helps us make better sense of a dataset than accuracy figures that are measured against an unknown or unrealistic model.
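Returning to the cadastre example above, a minimal consistency check for overlapping parcels might look like the following sketch. It assumes the shapely library, and the parcel geometries are invented; the second and third parcels overlap on purpose.

```python
from itertools import combinations
from shapely.geometry import Polygon

parcels = {
    "parcel_1": Polygon([(0, 0), (10, 0), (10, 10), (0, 10)]),
    "parcel_2": Polygon([(10, 0), (20, 0), (20, 10), (10, 10)]),
    "parcel_3": Polygon([(19, 0), (30, 0), (30, 10), (19, 10)]),
}

# Flag every pair whose interiors intersect; merely touching along a shared
# boundary (a one-dimensional intersection) is allowed by the topological rule.
for (name_a, geom_a), (name_b, geom_b) in combinations(parcels.items(), 2):
    if geom_a.overlaps(geom_b):
        overlap = geom_a.intersection(geom_b).area
        print(f"Consistency violation: {name_a} overlaps {name_b} "
              f"by {overlap:.1f} square units")
```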
One of the difficulties with measuring data quality is that it is by definition relative to the world model, and that it is very difficult to unambiguously describe one's world model. This is the realm of semantics and has, as described in the previous chapter, initiated a whole new branch of information science that tries to unambiguously describe all relevant aspects of a world model. So far, these ontology description languages are able to handle only static representations, which is clearly a shortcoming now that even GIS are moving into the realm of process orientation.

3.2 How to handle data quality issues

Many jurisdictions now require mandatory data quality reports when transferring data. Individual and agency reputations need to be protected, particularly when geographic information is used to support administrative decisions subject to appeal. On the private market, firms need to safeguard against possible litigation by those who allege to have suffered harm through the use of products that were of insufficient quality to meet their needs. Finally, there is the basic scientific requirement of being able to describe how close information is to the truth it represents.

The scientific community has developed formal models of uncertainty that help us to understand how uncertainty propagates through spatial processing and decision-making. The difficulty lies in communicating uncertainty to different levels of users in less abstract ways. There is no one-size-fits-all approach to assessing the fitness for use of geographic information and reducing uncertainty to manageable levels for any given application. In a first step it is necessary to convey to users that uncertainty is present in geographic information, as it is in their everyday lives, and to provide strategies that help to absorb that uncertainty.

In applying such a strategy, consideration has initially to be given to the type of application, the nature of the decision to be made and the degree to which system outputs are utilized within the decision-making process. Ideally, this prior knowledge permits an assessment of the final product quality specifications to be made before a project is undertaken; however, this may have to be decided later, when the level of uncertainty becomes known. Data, software, hardware and spatial processes are combined to provide the necessary information products. Assuming that the uncertainty in a product can be detected and modeled, the next consideration is how the various uncertainties may best be communicated to the user. Finally, the user must decide what product quality is acceptable for the application and whether the uncertainty present is appropriate for the given task. There are two choices available here: either reject the product as unsuitable and select uncertainty reduction techniques to create a more accurate product, or absorb (accept) the uncertainty present and use the product for its intended purpose.

In summary, the description of data quality is a lot more than the mere portrayal of errors. A thorough account of data quality has the chance to be as exhaustive as the data itself. Combining all the aspects of data quality in one or more reports is referred to as metadata (see Chapter 2).
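The formal models of uncertainty propagation mentioned above are often explored with simple simulation. The sketch below, using only the Python standard library, perturbs the vertices of a hypothetical parcel with an assumed positional error and observes how that error propagates into the computed area; every number in it is an assumption made for illustration.

```python
import random
import statistics

# Vertices of a hypothetical parcel (projected coordinates, meters) and an
# assumed standard error of 0.5 m on every coordinate.
vertices = [(0.0, 0.0), (100.0, 0.0), (100.0, 60.0), (0.0, 60.0)]
sigma = 0.5

def shoelace_area(pts):
    """Planar polygon area via the shoelace formula."""
    n = len(pts)
    s = sum(pts[i][0] * pts[(i + 1) % n][1] - pts[(i + 1) % n][0] * pts[i][1]
            for i in range(n))
    return abs(s) / 2.0

# Monte Carlo simulation: jitter every vertex and recompute the area.
areas = []
for _ in range(10_000):
    jittered = [(x + random.gauss(0, sigma), y + random.gauss(0, sigma))
                for x, y in vertices]
    areas.append(shoelace_area(jittered))

print(f"Nominal area     : {shoelace_area(vertices):.1f} m^2")
print(f"Mean simulated   : {statistics.mean(areas):.1f} m^2")
print(f"Std of simulated : {statistics.stdev(areas):.1f} m^2")
```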
4 Spatial Search

Among the most elementary database operations is the quest to find a data item in a database. Regular databases typically use an indexing scheme that works like a library catalog: we might search for an item alphabetically by author, by title or by subject. A modern alternative is the set of indexes built by desktop or Internet search engines, which are basically very big lookup tables for data that is physically distributed all over the place.

Spatial search works somewhat differently. One reason is that a spatial coordinate consists of two indices at the same time, x and y; this is like looking for author and title at the same time. The second reason is that most people, when they look for a location, do not refer to it by its x/y coordinate. We therefore have to translate between a spatial reference and the way it is stored in a GIS database. Finally, we often describe the place that we are after indirectly, such as when looking for all dry cleaners within a city to check for the use of a certain chemical.

In the following we will look at spatial queries, starting with some very basic examples and ending with rather complex queries that actually require some spatial analysis before they can be answered. This chapter deliberately omits any discussion of special indexing methods, which would be of interest to a computer scientist but perhaps not to the intended audience of this book.

4.1 Simple spatial querying

When we open a spatial dataset in a GIS, the default view of the data is to see it displayed like a map (see Figure 8). Even the most basic systems then allow you to use a query tool to point to an individual feature and retrieve its attributes. The key word here is 'feature'; that is, we are looking at databases that actually store features rather than field data. If the database is raster-based, then we have different options, depending on the sophistication of the system.

[Figure 8 Simple query by location: clicking a parcel on the map returns its attributes, e.g. Parcel# 231-12-687, Owner: John Doe, Zoning: A3, Value: 179,820]

Let's have a more detailed look at the right part of Figure 8. What is displayed here is an elevation dataset. The visual representation suggests that we have contour lines, but this does not necessarily mean that this is the way the data is actually stored and hence can be queried. If it is indeed line data, then the current cursor position would give us nothing, because there is no information stored for anything in between the lines. If the data is stored as areas (each plateau of equal elevation forming one area), then we could move around between any two lines and would always get the same elevation value; only once we cross a line would we 'jump' to the next higher or lower plateau. Finally, the data could be stored as a raster dataset, but rather than representing thousands of different elevation values by as many colors, we may make life easier for the computer as well as for ourselves (interpreting the color values) by displaying similar elevation values with only one out of, say, 16 different color values. In this case, the hovering cursor could still query the underlying pixel and give us the more detailed information that we could not possibly distinguish by hue.

This example illustrates another crucial aspect of GIS: the way we store data has a major impact on what information can be retrieved. We will revisit this theme repeatedly throughout the book. Basically, data that is not stored, like the area between contour lines, cannot simply be queried. It would require rather sophisticated analytical techniques to interpolate between the lines to come up with a guesstimate for the elevation when the cursor is between them. If, on the other hand, the elevation is explicitly stored for every location on the screen, then the spatial query is nothing but a simple lookup.
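The 'simple lookup' case can be sketched in a few lines: all the query tool has to do is translate the cursor's map coordinates into a row and column of the stored grid. The grid origin, cell size and values below are made-up assumptions.

```python
import numpy as np

origin_x, origin_y = 500000.0, 4650000.0   # upper-left corner of the grid (m)
cell_size = 30.0                           # 30 m pixels
elevation = np.array([[120, 125, 131],
                      [118, 122, 128],
                      [115, 119, 124]], dtype=float)

def query_elevation(x, y):
    """Return the raster value under a cursor position given in map coordinates."""
    col = int((x - origin_x) // cell_size)
    row = int((origin_y - y) // cell_size)   # rows count downwards from the top
    if 0 <= row < elevation.shape[0] and 0 <= col < elevation.shape[1]:
        return elevation[row, col]
    return None                              # cursor is outside the dataset

print(query_elevation(500045.0, 4649935.0))  # falls in row 2, column 1 -> 119.0
```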
4.2 Conditional querying

Conditional queries are just one notch up on the level of complication. Within a GIS, the condition can be either attribute- or geometry-based. To keep it simple and get the idea across, let's for now look at attributes only (see Figure 9). Here, we have a typical excerpt from an attribute table with multiple variables.

[Figure 9 Conditional query or query by (multiple) attributes: a table of six properties with the columns property number, area (m2), owner (TULATU, BRAUDO, ANUNKU, SILMA), tax code (A or B) and soil quality (high, medium, low); the areas range from 30,200 to 120,200 m2]

A conditional query works like a filter that initially accesses the whole database. Similar to the way we search for a URL in an Internet search engine, we now provide the system with all the criteria that have to be fulfilled for us to be interested in the final presentation of records. Basically, what we are doing is rejecting ever more records until we end up with a manageable number of them. If our query is "Select the best property that is >40,000 m2, does not belong to Silma, has tax code 'B', and has soils of high quality", then we first exclude record #5 because it does not fulfill the first criterion. Our selection set, after this first step, contains all records but #5. Next, we exclude record #6 because our query specified that we do not want this owner. In the third step, we reduce the number of candidates to two, because only records #1 and #3 survived up to here and fulfill the third criterion. In the fourth step, we are down to just one record, which may now be presented to us either in a window listing all its attributes or by being highlighted on the map.

Keep in mind that this is a pedagogical example. In a real case, we might end up with any number of final records, including zero; in that case, our query was overly restrictive. It depends on the actual application whether this is something we can live with, and therefore whether we should alter the query. Also, this conditional query is fairly elementary in the way it is phrased. If the GIS database is more than just a simple table, then the appropriate way to query the database may be to use one dialect or another of the structured query language SQL.
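As a sketch of what such a conditional query looks like in SQL, the snippet below rebuilds a tiny version of Figure 9 in an in-memory SQLite database and applies the four criteria in a single WHERE clause. The table name, column names and most of the attribute values are assumptions; only the three rows that the text explicitly describes are included.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE parcels (
    property_no INTEGER, area_m2 INTEGER, owner TEXT,
    tax_code TEXT, soil_quality TEXT)""")
conn.executemany("INSERT INTO parcels VALUES (?, ?, ?, ?, ?)", [
    (1, 100000, "TULATU", "B", "High"),   # passes every criterion
    (5,  30200, "ANUNKU", "A", "Low"),    # fails the minimum area
    (6, 120200, "SILMA",  "B", "High"),   # excluded because of the owner
])

rows = conn.execute("""
    SELECT property_no, owner, area_m2
    FROM parcels
    WHERE area_m2 > 40000
      AND owner <> 'SILMA'
      AND tax_code = 'B'
      AND soil_quality = 'High'
""").fetchall()

print(rows)   # only the record(s) meeting all four criteria remain
```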
4.3 The query process

One of the true benefits of a GIS is that we have a choice whether we want to use a tabular or a map interface for our query. We can even mix and match as part of the query process. As this book is process-oriented, let's have a look at the individual steps. This is particularly important as we are dealing increasingly often with Internet GIS user interfaces, which are difficult to navigate if the sequence and the various options on the way are not well understood (see Figure 10).

First, we have to make sure that the data we want to query is actually available. Usually, there is some table of contents window or a legend that tells us about the data layers currently loaded. Then, depending on the system, we may have to select the one data layer we want to query. If we want to find out about soil conditions and the 'roads' layer is active (the terminology may vary a little bit), then our query result will be empty. Now we have to decide whether we want to use the map or the tabular interface. In the first instance, we pan around the map and use the identify tool to [...]

[Figure 10 The relationship between spatial ... (caption truncated in the source): a flowchart of query steps such as List Coverages (Soil, Elevation, Precipitation, Roads), Display Database or Display Coverage, List Records, List Fields, Zoom and Cursor Query, with a sample identify result for Road B at 37°13'22''S, 177°46'13''W]

[...] around the hotel we are staying at. In the second case, we may want to specify 'Thai cuisine under $40' to filter the display. Finally, we may follow the second approach and then make our final decision based on the visual display of what other features of interest are near the two or three restaurants depicted.

4.4 Selection

Most of the above examples ended with us selecting one or more records for subsequent [...] from simple mapping systems to true GIS. Even the selection process, though, comes at different levels of sophistication. Let's look at Figure 11 for an easy and a complicated example. In the left part of the figure, our graphical selection neatly encompasses three features. In this case, there is no ambiguity – the records for the three features are displayed and we can embark on performing our calculations [...] respect to combined purchase price or whatever. On the right, our selection area overlaps only partly with two of the features. The question now is: do we treat the two features as if they got fully selected, or do we work with only those parts that fall within our search area? If it is the latter, then we have to perform some additional calculations that we will encounter in the following two chapters [...]

[Figure 11 Partial and complete selection of features: two attribute tables of forest stands (North, C–2 East–1, C–2 East–2; species Pine and Mix), one listing areas of 20, 10 and 40 and the other areas of 10, 5 and 30, illustrating complete versus partial selection]

[...] have glanced over in the above example is that we actually used one geometry to select some other geometries. Figure 12 is a further illustration of the principle. Here, we use [...]

[...] chapter, the others should have a sincere look at the following. Boolean logic was invented by the English mathematician George Boole (1815–64) and underlies almost all our work with computers. Most of us have encountered Boolean logic in queries using Internet search engines. In essence, his logic can be described as the kind of mathematics that we can do if we have nothing but zeros and ones. What made him [...] combine those zeros and ones, and their powerfulness once they are combined. The three basic operators in Boolean logic are NOT, OR and AND. Figure 13 illustrates the effect of the three operators. Let's assume we have two GIS layers, one depicting income and the other depicting literacy. Also assume that the two variables can be in one of two states only, high or low. Then each location can be a combination of high or low income with high or low literacy.

Now we can look at Figure 13. On the left side we have one particular spatial configuration – not all that realistic, because it is not usual to have population data in equally sized spatial units, but it makes it a lot easier to understand the principle. For each area, we can read the values of the two variables [...]

[Figure 13 Simple Boolean logic operations: a grid of areas labeled with combinations of high/low income and high/low literacy (the legend includes LI: Low Income), alongside result maps for operations such as 'not HI' and 'HL and HI']

Now we can query our database and, depending on our use of Boolean operators, we gain very different insights. In the right half of the figure, we see the results of four different queries (we get to even more than four different possible outcomes by combining two or more operations). In the first instance, we don't query about literacy at all. All we want to make sure is that we reject areas of high income, which leaves us with the four highlighted areas. The NOT operator is a unary operator – it affects only the descriptor directly after the operator, in this first instance the income layer.
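A minimal sketch of how these Boolean operators play out on two binary layers, in the spirit of Figure 13, is shown below. It assumes NumPy, and the grid values are invented; each cell stands for one of the equally sized areas, with True meaning 'high' and False meaning 'low'.

```python
import numpy as np

high_income   = np.array([[True,  False, True ],
                          [False, True,  False],
                          [True,  False, False],
                          [False, True,  True ]])
high_literacy = np.array([[True,  True,  False],
                          [False, True,  True ],
                          [False, False, True ],
                          [True,  True,  False]])

not_high_income = ~high_income                 # NOT: unary, flips a single layer
hi_and_hl = high_income & high_literacy        # AND: both conditions must hold
hi_or_hl  = high_income | high_literacy        # OR: at least one condition holds

print("areas that are not high income :", int(not_high_income.sum()))
print("high income AND high literacy  :", int(hi_and_hl.sum()))
print("high income OR high literacy   :", int(hi_or_hl.sum()))
```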
