Public Comment Number PC-UK0027

ISO/IEC CD 9899 (SC22N2620) Public Comment
===========================================

Date: 1998-01-03
Author: Clive D.W. Feather
Author Affiliation: Self
Postal Address: Demon Internet Limited
                322 Regents Park Road
                London N3 2QQ
                United Kingdom
E-mail Address:
Telephone Number: +44 181 371 1138
Fax Number: +44 181 371 1037
Number of individual comments: 1

Comment 1.
Category: Inconsistency
Committee Draft subsection: 5.1.1.2, 5.2.1, 5.2.1.2, 6.1.2, 6.1.2.5, 6.8
Title: inconsistencies in use of "basic" and "extended" character sets
       and in their relationship to UCNs
Detailed description:

The Standard uses the terms "basic character set" and "extended
character set" at various places. However, the exact meaning of these
two is not clear, and this leads to confusion.

Consider the UTF-8 encoding (codes from 0 to 127 are single byte, codes
from 128 to 255 form part of multibyte characters with length from 2 to
5 bytes). There are five possible execution character sets:

    [1] The 95 characters required by 5.2.1p3, plus the null character.
    [2] The 128 single byte characters.
    [3] The 2**31 multibyte characters.
    [4] Set [3] minus set [1].
    [5] Set [3] minus set [2].

(and corresponding source sets).

It is unclear whether the "basic character set" means [1] or [2]. The
use of the wording "at least the following members" in 5.2.1p3 implies
that the basic set can be larger than [1]. On the other hand, if the
term is taken to represent [2], then 5.1.1.2p2 would forbid using
\u0040 to represent the @ sign, something which I do not believe was
intended, since it means that the \u form would be forbidden for *all*
characters in the implementation-defined "basic" set.

Consideration of this and related matters has led me to believe that it
is most useful to have terms for [1] and for [4], while on the other
hand there is little or no need to refer to [2], [3], and [5].
Therefore "basic character set" should represent [1] and "extended
character set" should represent [4]. To do this requires a number of
changes.

Replace 5.2.1p1, second sentence, by:

    Each set is further divided into a /basic/ set, whose contents are
    given by this subclause, and an /extended/ set, consisting of zero
    or more locale-specific members (which are not members of the basic
    set).

In 5.2.1p3, delete "at least" in the first sentence, and in the fourth
sentence change "In the execution character set" to "In the basic
execution character set".

Delete the last sentence of 5.2.1p3 ("If any other characters ... the
behavior is undefined"). It is useless for several reasons:

- If translation phase 1 is taken literally, all members of the
  extended character set are replaced by UCNs, which consist of members
  of the basic character set (this point is further addressed below).
  While some are converted back in translation phase 5, all such
  characters are included in the exemptions.

- It does not allow for UCNs in identifiers.

- If such a character was encountered, the preprocessing token it is in
  is either not converted to a token (in which case the sentence does
  not apply) or *is* converted; in the latter case, the constraint of
  6.1p2 is violated and this sentence has no effect.

Delete 5.1.1.2p2, and replace it by a constraint at the end of 5.2.1
(forming a new paragraph 6):

    Constraint

    A universal-character-name shall not specify (in either form) a
    character short identifier less than 00A0 other than the following:
    0024 0040 0060

This is a more consistent position for the restriction, and it has the
useful side effect of making it clear what the UCNs of the basic
character set *are*.

Replace 5.2.1.2p1, first bullet, by:

    - The basic character set shall be present and shall be encoded
      using single-byte characters.

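As an illustration only, the constraint proposed above for the end of
5.2.1 (no character short identifier below 00A0 except 0024, 0040 and
0060) can be written as a small predicate. This is a minimal sketch; the
function name ucn_value_permitted is hypothetical and appears neither in
the Committee Draft nor in any implementation.

```c
/* Hypothetical helper, not part of the Committee Draft: returns nonzero
   if a universal-character-name may specify the given character short
   identifier under the constraint proposed for the end of 5.2.1. */
int ucn_value_permitted(unsigned long code)
{
    /* The three permitted values below 00A0: $ (0024), @ (0040)
       and ` (0060). */
    if (code == 0x0024UL || code == 0x0040UL || code == 0x0060UL)
        return 1;

    /* Every other value below 00A0 is excluded by the constraint. */
    return code >= 0x00A0UL;
}
```

Under this reading \u0040 is acceptable as a spelling of the @ sign,
while \u0041 (the letter A, a member of the basic character set)
violates the constraint.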
There is no longer a need to check for the shift states of comments,
string literals, and so on, because during translation phase 1 these
will have been converted to a stateless representation using UCNs.
Therefore replace 5.2.1.2p2 by:

    If a source file does not consist of a valid sequence of multibyte
    characters, the behavior is undefined.

In 6.1.2.5p2, replace "required source character set enumerated in
5.1.2" with "basic execution character set" (note that the execution
set is more sensible in this context than the source set).

The second sentence of 6.1.2p2 restricts UCNs in identifiers to those
listed in annex H. If some other UCN appears, it is unclear whether the
behavior is undefined, or whether the UCN is not part of the
identifier. This is further complicated by the example in footnote 122.
If the text appeared in a source file, by translation phase 4 it would
be processed as:

    #define THIS\u0024AND\u0024THAT(a,b) ((a)+(b))

and so the replacement list *does* begin with a character required by
subclause 5.2.1, and thus this is unambiguously a definition of the
object-like macro THIS. However, this completely wrecks the whole point
of 6.8p4 and FN122 (added in TC1).

Replace the second sentence of 6.1.2p2 with:

    Only universal-character-names corresponding to the characters
    listed in annex I are nondigits.[20]

and append to footnote 20:

    Since 00A0 is not listed in annex I, but 00C0 is, the sequence of
    characters

        a\u00C0b\u00A0

    consists of two preprocessing tokens; the first is an identifier
    made up of three nondigits.

(note also the correction to the annex cited).

Replace 6.8p4 by:

    In the definition of an object-like macro, either the replacement
    list shall be separated from the identifier by white space, or it
    shall begin with one of the 26 graphic characters in the basic
    character set other than ( _ or \ (and thus shall not begin with a
    universal-character-name).[122]
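
As an illustration only, the set of 26 graphic characters named in the
proposed 6.8p4 can be spelled out as the 29 graphic characters of the
basic character set minus ( _ and \. This is a minimal sketch; the
function name may_begin_unspaced_replacement_list is hypothetical.

```c
#include <string.h>

/* Hypothetical helper, not part of the Committee Draft: returns nonzero
   if c is one of the 26 graphic characters of the basic character set
   other than ( _ and \, i.e. a character that may begin the replacement
   list of an object-like macro without intervening white space. */
int may_begin_unspaced_replacement_list(char c)
{
    /* The 29 basic graphic characters with ( _ and \ removed. */
    static const char allowed[] = "!\"#%&')*+,-./:;<=>?[]^{|}~";

    /* strchr would also match the terminating null, so exclude it. */
    return c != '\0' && strchr(allowed, c) != NULL;
}
```

Since every universal-character-name begins with \, which is excluded
from the set, a replacement list written without white space cannot
begin with one. A definition such as the THIS\u0024AND\u0024THAT
example would therefore fall outside the permitted forms rather than
silently defining the object-like macro THIS.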