Document number: WG14 N589/X3J11 96-053
1996-07-02

Extended Characters in C9X
==========================

Clive Feather
=============

Abstract
--------

This is an attempt to shape N574 (Tom Plum et al. on Extended Identifiers
and Extended Literals in C++) into a change proposal for C9X. Essentially,
the changes I have made are:

(1) Make universal character names behave like trigraphs.
(2) Use different wording for extended identifiers so as to reduce the
    chances of confusion.

Revision
--------

1996-07-02: added universal names for those characters without, and
adjusted other wording to clarify that those names cannot be used. Also
clarified the principles behind extended identifier characters.

Trigraphs
---------

The concept here is simple, but the detailed wording is hard. The basic
points are described here.

The concept of a "universal character name" is introduced. This is the
code point in Unicode, but does *not* imply the use of Unicode internally.
The horizontal and vertical tabs, form feed, and new line do not have
universal names.

Sequences like ??u1234 and ??U12345678 are defined to be trigraph sequences
(even though they are more than 3 physical source characters long). They
are mapped into an internal representation, which need not match any
particular external form. The ??U form cannot be used where the ??u form
could.

The source character set is required to contain:
* 4 control codes which can only be represented as themselves;
* space, 52 letters, 10 digits, and 20 graphic characters which can only
  be represented as themselves;
* 9 graphic characters which must be representable as ??/ and ??u005C
  style trigraphs, and might have other physical source representations;
* a large number of extended identifier characters (listed in Annex Q)
  which must be representable as ??u00C0 style trigraphs, and might have
  other physical source representations.

[The proposed contents of Annex Q include only characters with names that
can be represented in the ??u00C0 style.
Should the list be revised to include characters with ??U12345678 style
representations, no changes to the actual proposal are required.]

The source character set can contain other characters, and ??u and ??U
trigraphs are a way of representing some such characters. The existing
wording in 5.2.1, 6.1.3.4, 6.1.4, and 6.1.7 makes it clear that these
characters may only appear, after preprocessing, in string literals and
similar constructs, and have an implementation-defined mapping to the
execution character set. No wording changes are therefore required to
handle these new characters.

Extended identifiers
--------------------

The change is simple and self-contained. The concept "non-digit"
(basically, the characters that can appear anywhere in an identifier,
unlike digits) is enhanced to include "extended-identifier-character". The
characters that comprise this are then listed in a normative annex.

Detailed changes
----------------

In subclause 5.2.1, insert a new paragraph between the second and third
paragraphs:

    Each character in the source character set has a *universal character
    name*. This is a number: the index of the character in ISO 10646:????
    written as 4 or 8 hexadecimal digits. There need not be any other
    physical source file representation for any character which can be
    represented by a trigraph.

Amend the blocks of characters in the (originally) third paragraph to be:

        the 26 uppercase letters of the English alphabet

            A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
+           (universal character names 0041 to 005A)

        the 26 lowercase letters of the English alphabet

            a b c d e f g h i j k l m n o p q r s t u v w x y z
+           (universal character names 0061 to 007A)

        the 10 decimal digits

            0 1 2 3 4 5 6 7 8 9
+           (universal character names 0030 to 0039)

        the following 29 graphic characters

|           ! " #             (universal character names 0021 to 0023)
|           % & '             (universal character names 0025 to 0027)
|           ( ) * + , - . /   (universal character names 0028 to 002F)
|           : ; < = > ?
                              (universal character names 003A to 003F)
|           [ \ ] ^ _         (universal character names 005B to 005F)
|           { | } ~           (universal character names 007B to 007E)

|       the space character (universal character name 0020), and control
|       characters representing horizontal tab, vertical tab, and form feed
|       (universal character names 0009, 000B, and 000C respectively).

Insert before "In the execution ..." the sentence:

    The source character set shall contain all the characters listed in
    Annex Q. [6a]

and add the footnote:

    [6a] The implementation must have some internal representation for
    each of these characters, but it need not correspond to any particular
    external form.

Replace subclause 5.2.1.1 with:

    5.2.1.1 Trigraph sequences

|   All occurrences in a source file of the following sequences of
    characters (called trigraph sequences [7]) are replaced with the
    corresponding single character. [7a]

    [7] The trigraph sequences enable the input of characters that are not
    defined in the Invariant Code Set as described in ISO 646:1983, which
    is a subset of the seven-bit ASCII code set.
+   The name "trigraph" comes from the initial 3-character code, and
+   not the length of the entire sequence.

+   [7a] If the replacement character does not exist in the source character
+   set, the implementation should use an internal representation which
+   behaves, for the purposes of the Standard, as if it were a single-byte
+   character.

        ??=  #        ??(  [        ??/  \
        ??)  ]        ??'  ^        ??<  {
        ??!  |        ??>  }        ??-  ~

+       ??u followed by 4 hexadecimal  )  the character whose universal
+           digits in either case      )  character name is given by the
+       ??U followed by 8 hexadecimal  )  value of the hexadecimal number
+           digits in either case      )

+   In trigraphs beginning with ??u, the value shall not lie within the
+   following ranges (inclusive): 0020-0022, 0025-003F, 0041-005A, 005F,
+   0061-007A. [7b] In trigraphs beginning with ??U, the value shall not
+   be less than 00010000. (All numbers in this paragraph are in
+   hexadecimal.)
+   [7b] These are the characters in the Invariant Code Set.

|   No other trigraph sequences exist. Each ? that is not part of one of
    the trigraphs listed above is not changed.

|   Examples

    The following source line

        printf("Eh???/n");

    becomes (after replacement of the trigraph sequence ??/)

        printf("Eh?\n");

+   If the implementation has a $ character, then the following source
+   lines are equivalent:
+       printf("$2.50");
+       printf("??u00242.50");

In subclause 6.1.2 syntax, add a further alternative to nondigit:

    nondigit: one of
            extended-identifier-character
            _ a b c d e f g h i j k l m n o p q r s t u v w x y z
              A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

and add the definition:

    extended-identifier-character:
            any character listed in Annex Q.

Add an annex (designated as Q in these instructions):

    Annex Q (Normative)

    Extended identifier characters

    An extended identifier character is a character that can be used in an
    identifier (see 6.1.2) but is not in the minimal basic source character
    set defined in subclause 5.2.1. The extended identifier characters are
    precisely those characters whose universal character name lies within
    one of the following ranges (all inclusive):

        00C0-00D6 (Latin)
        00D8-00F6 (Latin)
        ...
        FFDA-FFDC (CJK Unified Ideographs)

[Replacing the ellipsis by the full list. The list should be in strict
numerical order.]
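
Illustration (not part of the proposed wording)

The restrictions on the values in ??u and ??U trigraphs, as stated in the
replacement text for 5.2.1.1, could be checked along the following lines.
This is only a sketch: the function names short_name_allowed and
long_name_allowed are hypothetical and do not appear in the proposal, and
the sketch assumes the hexadecimal digits have already been converted to an
unsigned long value.

```c
#include <stdbool.h>

/* Hypothetical helper: may `value` follow ??u (4 hexadecimal digits)?
   The excluded ranges are those listed in the proposed 5.2.1.1. */
static bool short_name_allowed(unsigned long value)
{
    if (value > 0xFFFFUL)                       /* more than 4 hex digits */
        return false;
    if (value >= 0x0020UL && value <= 0x0022UL)
        return false;
    if (value >= 0x0025UL && value <= 0x003FUL)
        return false;
    if (value >= 0x0041UL && value <= 0x005AUL)
        return false;
    if (value == 0x005FUL)
        return false;
    if (value >= 0x0061UL && value <= 0x007AUL)
        return false;
    return true;
}

/* Hypothetical helper: may `value` follow ??U (8 hexadecimal digits)?
   The proposed text only requires that it not be less than 00010000. */
static bool long_name_allowed(unsigned long value)
{
    return value >= 0x00010000UL && value <= 0xFFFFFFFFUL;
}
```

Under these rules 0024 and 005C are acceptable after ??u, 0041 is not, and
any value that fits in 4 hexadecimal digits is rejected after ??U.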