Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard (Addison-Wesley, Sep 2002, ISBN 0201700522)


General Category
After the code point value and the name, the next most important property that a Unicode character has is its general category. Seven primary categories exist: letter, number, punctuation, symbol, mark, separator, and miscellaneous. Each is subdivided into additional categories.

Letters
The Unicode standard uses the term "letter" rather loosely in assigning things to this general category. Whatever counts as the basic unit of meaning in a particular writing system, whether it represents a phoneme, a syllable, or a whole word or idea, is assigned to the "letter" category. The major exception to this rule comprises marks that combine typographically with other characters, which are categorized as "marks" instead of "letters." They include not only diacritical marks and tone marks, but also vowel signs in those consonantal writing systems where the vowels are written as marks applied to the consonants.

Some writing systems, such as the Latin, Greek, and Cyrillic alphabets, also have the concept of "case." That is, two series of letterforms are used together, with one series, the "uppercase," used for the first letter of a sentence or a proper name, or for emphasis, and the other series, the "lowercase," used for most other letters.

Uppercase Letter (Lu)
In cased writing systems, the uppercase letters are placed in this category.

Lowercase Letter (Ll)
In cased writing systems, the lowercase letters are placed in this category.

Titlecase Letter (Lt)
Titlecase is reserved for a few special characters in Unicode. These characters are basically examples of compatibility characters: characters that were included for round-trip compatibility with some other standard. Every titlecase letter is actually a glyph representing two letters, the first of which is uppercase and the second of which is lowercase.

For example, the Serbian letter nje (њ) can be thought of as a ligature of the Cyrillic letter n (н) and the Cyrillic soft sign (ь). When Serbian is written using the Latin alphabet (as is done in Croatian, which is almost the same language), this letter is written using the letters nj. Existing Serbian and Croatian standards were designed to provide a one-to-one mapping between every Cyrillic character used in Serbian and the corresponding Latin character used in Croatian. This approach required using a single character code to represent the nj digraph in Croatian, and Unicode carries that character forward. Capital Nje in Cyrillic (Њ) thus can convert to either NJ or Nj in Latin depending on the context. The fully uppercase form, NJ, is U+01CA LATIN CAPITAL LETTER NJ, and the combined upper-lower form, U+01CB LATIN CAPITAL LETTER N WITH SMALL LETTER J, is considered a "titlecase" letter.

Three Serbian characters have a titlecase Latin form: љ (lje, which converts to lj), њ (nje, which converts to nj), and џ (dzhe, which converts to dž). These characters were the only three titlecase letters in Unicode 2.x.

Unicode 3.0 added several Greek letters to this category. Some early Greek texts represented certain diphthongs by writing a small letter iota underneath the other vowel rather than after it. For example, you'd see "ai" written as ᾳ. If you just capitalized the alpha ("Ai"), you'd get the titlecase version: ᾼ. In the fully uppercase version ("AI"), the small iota becomes a regular iota again: ΑΙ. These characters are all in the Extended Greek section of the standard and are used only in writing ancient Greek texts. In modern Greek, these diphthongs are written using a regular iota; for example, "ai" is written as αι.

Modifier Letter (Lm)
Just as some things you might conceptually think of as "letters" (vowel signs in various languages) are classified as "marks" in Unicode, the opposite also occurs. The modifier letters are independent forms that don't combine typographically with the characters around them, which is why Unicode doesn't classify them as "marks" (Unicode marks, by definition, combine typographically with their neighbors). Instead of carrying their own sounds, the modifier letters generally modify the sounds of their neighbors. In other words, conceptually they're diacritical marks. Because they occur in the middle of words, most text-analysis processes treat them as letters, so they're classified as letters.

The Unicode modifier letters are generally either International Phonetic Alphabet characters or characters that are used to transliterate certain "real" letters in non-Latin writing systems that don't seem to correspond to a regular Latin letter. For example, U+02BC MODIFIER LETTER APOSTROPHE is typically used to represent the glottal stop, the sound made by (or, more accurately, the absence of sound represented by) the Arabic letter alef, so the Arabic letter is often transliterated as this character. Likewise, U+02B2 MODIFIER LETTER SMALL J is used to represent palatalization, and thus is sometimes used in transliteration as the counterpart of the Cyrillic soft sign.

Other Letter (Lo)
This catch-all category includes everything that's conceptually a "letter," but that doesn't fit into one of the other "letter" categories. Letters from uncased alphabets such as Arabic and Hebrew fall into this category, as do syllables from syllabic writing systems like Kana and Hangul and the Han ideographs.

Marks
Like letters, marks are part of words and carry linguistic information. Unlike letters, marks combine typographically with other characters. For example, U+0308 COMBINING DIAERESIS may look like ¨ when shown alone, but is usually drawn on top of the letter that precedes it. That is, U+0061 LATIN SMALL LETTER A followed by U+0308 COMBINING DIAERESIS isn't drawn as "a¨", but rather as "ä". All of the Unicode combining marks do this kind of thing.

Non-spacing Mark (Mn)
Most of the Unicode combining marks fall into this category. Non-spacing marks don't take up any horizontal space along a line of text; they combine completely with the character that precedes them and fit entirely into that character's space. The various diacritical marks used in European languages, such as the acute and grave accents, the circumflex, the diaeresis, and the cedilla, fall into this category.

Combining Spacing Mark (Mc)
Spacing combining marks interact typographically with their neighbors, but still take up horizontal space along a line of text. All of these characters are vowel signs or other diacritical marks in the various Indian and Southeast Asian writing systems. For example, U+093F DEVANAGARI VOWEL SIGN I (◌ि) is a spacing combining mark. Thus U+0915 DEVANAGARI LETTER KA followed by U+093F DEVANAGARI VOWEL SIGN I is drawn as कि: the vowel sign attaches to the left-hand side of the consonant. Not all spacing combining marks reorder, however: U+0940 DEVANAGARI VOWEL SIGN II (◌ी) is also a combining spacing mark. When it follows U+0915 DEVANAGARI LETTER KA, you get की: the vowel attaches to the right-hand side of the consonant, but the two combine typographically.

Enclosing Mark (Me)
Enclosing marks completely surround the characters they modify.
For example, U+20DD COMBINING ENCLOSING CIRCLE is drawn as a ring around the character that precedes it. These ten characters are generally used to create symbols.

Numbers
The Unicode characters that represent numeric quantities are given the "number" property (technically, it should be called the "numeral" property, but that's life). The characters in these categories have additional properties that govern their interpretation as numerals. This category is subdivided as follows.

Decimal-Digit Number (Nd)
The characters in this category can be used as decimal digits. This category includes not only the digits with which we're all familiar ("0123456789"), but similar sets of digits used with other writing systems, such as the Thai digits ("๐๑๒๓๔๕๖๗๘๙").

Letter Number (Nl)
The characters in this category can be either letters or numerals. Many are compatibility composites whose decompositions consist of letters. The Roman numerals and the Hangzhou numerals are the only characters in this category.

Other Number (No)
All of the characters that belong in the "number" category, but not in one of the other subcategories, fall into this one. This category includes various numeric presentation forms, such as superscripts, subscripts, and circled numbers; various fractions; and numerals used in various numeration systems other than the Arabic positional notation used in the West.

Punctuation
This category attempts to make sense of the various punctuation characters in Unicode. It breaks down as follows.

Opening Punctuation (Ps)
For punctuation marks, such as parentheses and brackets, that occur in opening-closing pairs, the "opening" characters in these pairs are assigned to this category.

Closing Punctuation (Pe)
For punctuation marks, such as parentheses and brackets, that occur in opening-closing pairs, the "closing" characters in these pairs are assigned to this category.

Initial-Quote Punctuation (Pi)
Quotation marks occur in opening-closing pairs, just like parentheses do. The problem is that which is which depends on the language. For example, both French and Russian use quotation marks that look like this: «», but they use them differently.
«In French, a quotation is set off like this.»
»But in Russian, a quotation is set off like this.«
This category is equivalent to either Ps or Pe, depending on the language.

Final-Quote Punctuation (Pf)
The counterpart to the Pi category, Pf is also used with quotation marks whose usage varies depending on language. It's equivalent to either Ps or Pe depending on language. It's always the opposite of Pi.

Dash Punctuation (Pd)
This category is self-explanatory. It includes all hyphens and dashes.

Connector Punctuation (Pc)
Characters in this category, such as the middle dot and the underscore, get treated as part of the word in which they appear. That is, they "connect" series of letters together into single words: This_is_all_one_word. An important example is U+30FB KATAKANA MIDDLE DOT, which is used like a hyphen in Japanese.

Other Punctuation (Po)
Punctuation marks that don't fit into any of the other subcategories, including obvious things like the period, comma, and question mark, fall into this category.

Symbols
This group of categories contains various symbols.

Currency Symbol (Sc)
Self-explanatory.

Mathematical Symbol (Sm)
Mathematical operators.

Modifier Symbol (Sk)
This category contains two main types of characters: the "spacing" versions of the combining marks and a few other symbols whose purpose is to modify the meaning of the preceding character in some way. Unlike modifier letters, modifier symbols don't necessarily modify the meanings of letters, and they don't necessarily get counted as parts of words.

Other Symbol (So)
This category contains all symbols that didn't fit into one of the other categories.

Separators
These characters mark the boundaries between units of text.

Space Separator (Zs)
This category includes all of the space characters (yes, there's more than one space character).

Paragraph Separator (Zp)
There is exactly one character in this category: the Unicode paragraph separator (U+2029). As its name suggests, it marks the boundary between paragraphs.

Line Separator (Zl)
There's also only one character in this category: the Unicode line separator (U+2028). As its name suggests, it forces a line break without ending a paragraph.

Even though the ASCII carriage-return and line-feed characters are often used as line and paragraph separators, they're not placed in either of these categories. Likewise, the ASCII tab character isn't considered a Unicode space character, even though it probably should be. They're all put in the "Cc" category.

Miscellaneous
A number of special character categories don't really fit in with the others.

Control Characters (Cc)
The codes corresponding to the C0 and C1 control characters from the ISO 2022 standard appear in this category. The Unicode standard doesn't officially assign any semantics to these characters (which include the ASCII control characters), but most systems that use Unicode text treat these characters the same way as they treat their counterparts in the source standards. For example, most processes treat the ASCII linefeed character as a line or paragraph separator. The original idea was to leave the definitions of these code points open, as ISO 2022 does. Over time, however, various Unicode processes and algorithms have attached semantics to these code points, effectively nailing the ISO 6429 semantics to many of them.

Formatting Characters (Cf)
Unicode includes some "control" characters of its own: characters with no visual representation of their own that are used to control how the characters around them are drawn or handled by various processes. These characters are assigned to this category.

Surrogates (Cs)
The code points in the UTF-16 surrogate range belong to this category. Technically, the code points in the surrogate range are treated as unassigned and reserved, but Unicode implementations based on UTF-16 often treat them as characters, handling surrogate pairs the same way as combining character sequences are handled.

Private-Use Characters (Co)
The code points in the private-use ranges are assigned to this category.

Unassigned Code Points (Cn)
All unassigned and noncharacter code points, other than those in the surrogate range, are given this category. These code points aren't listed in the Unicode Character Database (their omission gives them this category), but are listed explicitly in DerivedGeneralCategory.txt.
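
The two-letter codes described above can be looked up programmatically. The following is a minimal sketch, not part of the original text, that assumes Python and its standard `unicodedata` module. The `PRIMARY` table, the `describe` helper, and the `SAMPLES` string are illustrative names introduced here, and the sample characters were chosen as one representative per category.

```python
import unicodedata

# Primary categories, keyed by the first letter of the two-letter code.
# (Illustrative table; the names follow the seven categories in the text.)
PRIMARY = {
    "L": "letter",
    "M": "mark",
    "N": "number",
    "P": "punctuation",
    "S": "symbol",
    "Z": "separator",
    "C": "miscellaneous",
}


def describe(ch):
    """Return (two-letter general category, primary category name) for ch."""
    code = unicodedata.category(ch)
    return code, PRIMARY[code[0]]


# One sample per subcategory, written as escapes so the source stays ASCII.
SAMPLES = (
    "Aa\u01CB\u02BC\u05D0"        # Lu Ll Lt Lm Lo
    "\u0308\u093F\u20DD"          # Mn Mc Me
    "5\u2160\u00B2"               # Nd Nl No
    "()\u00AB\u00BB-_."           # Ps Pe Pi Pf Pd Pc Po
    "$+^\u00A9"                   # Sc Sm Sk So
    " \u2029\u2028"               # Zs Zp Zl
    "\t\u200D\uD800\uE000\uFFFF"  # Cc Cf Cs Co Cn
)

if __name__ == "__main__":
    for ch in SAMPLES:
        code, primary = describe(ch)
        # Control, surrogate, private-use, and unassigned code points have
        # no character name, so fall back to a placeholder.
        name = unicodedata.name(ch, "<unnamed>")
        print(f"U+{ord(ch):04X}  {code}  {primary:<13}  {name}")
```

The output depends on the Unicode version bundled with the interpreter, which `unicodedata.unidata_version` reports. Categories of individual characters have occasionally been revised between versions, so results for other characters may differ from what an older description of the standard states.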

Date posted: 26/03/2019, 17:13
