Public Comment Number PC-UK0078
ISO/IEC CD 9899 (SC22N2620) Public Comment
===========================================

Date: 1998-02-25
Author: N.M Maclaren
Author Affiliation: Self
Postal Address: University of Cambridge, Computer Laboratory, New Museums Site, Pembroke Street, Cambridge CB3 3QG, United Kingdom
E-mail Address:
Telephone Number: +44 1223 334761
Fax Number: +44 1223 334679
Number of individual comments: 1

Comment 1.
Category: Normative change to existing feature retaining the original intent
Committee Draft subsection: 5.1.1.2, 5.2.1
Title: Universal character name handling
Detailed description:

A nasty little problem arises in code like the following:

    #define str(a) #a
    str("$")

In phase 1, the second line is mapped to str("\u0024") or perhaps str("\U00000024"). In phase 4, this will be mapped to #"\u0024" and (by 6.8.3.2 The # operator paragraph 2) to "\"\\u0024\"". In phase 5, this will be mapped to the execution character set, but there is no explicit statement of the relative priority of mapping escape sequences and mapping universal character names. So it is probably mapped to the sequence of characters:

    '"','\\','u','0','0','2','4','"','\0'

but, if universal character names take priority, it is instead mapped to:

    '"','\$','"','\0'

which leads to undefined behaviour. In either case, this is a quiet change from C89.

There are quite a lot of similar ambiguities, commented on elsewhere, that need some sort of resolution. The more I think about it, the less I think that these problems can be solved by tweaking, so here is a radical solution that I believe maintains all the functionality and resolves the problem. It is based on the principle that universal character names have a similar purpose to trigraphs and therefore should be treated similarly.

I think that the following changes are all that are NECESSARY, but some more cleaning up may be desirable.

5.1.1.2 Translation phases

Phase 1 should be rewritten as:

    1. Physical source file multibyte characters are mapped to the source character set (introducing new-line characters for end-of-line indicators) if necessary. Secondly, trigraph sequences are replaced by corresponding single-character internal representations. Thirdly, universal-character-names are replaced by the corresponding single-character internal representations.

Phase 5 should be rewritten as:

    5. Each source character set member and escape sequence in character constants and string literals is converted to a member of the execution character set.

Footnote 6 should be rewritten as:

    6. The process of handling extended characters is specified in terms of mapping to a single-character encoding that includes the union of the whole source character set and the characters specified by ISO/IEC 10646-1, and, in the case of character literals and strings, further mapping to the execution character set. In practical terms, however, any internal encoding may be used, so long as an actual character encountered in the input, and the same character expressed in the input as a universal-character-name (i.e., using the \U or \u notation), are handled equivalently.

Constraint 2 could be deleted, as it is now unnecessary.

5.2.1 Character sets

A new paragraph 6 should be added:

    6. Source characters shall be encoded as if the source character set included the whole of ISO/IEC 10646 as single characters, using an unspecified mapping to integral values (except as specified above).
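
The phase 4 and phase 5 steps walked through in the detailed description can be sketched in C as follows. This is only an illustrative model of the text manipulation, not a description of how any translator is implemented. The function names stringify_spelling and decode_escapes_first are hypothetical, and the second function assumes the reading in which escape sequences take priority over universal character names. It handles only the \" and \\ escapes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Phase 4 sketch (hypothetical helper): apply the # operator's rule to the
   spelling of a string literal argument. The result is wrapped in double
   quotes, and a backslash is inserted before every " and \ character.
   Returns the length written, excluding the terminating null character. */
size_t stringify_spelling(const char *in, char *out, size_t cap)
{
    size_t n = 0;
    assert(cap >= 2 * strlen(in) + 3);   /* worst case: every char escaped */
    out[n++] = '"';
    for (; *in != '\0'; ++in) {
        if (*in == '"' || *in == '\\')
            out[n++] = '\\';
        out[n++] = *in;
    }
    out[n++] = '"';
    out[n] = '\0';
    return n;
}

/* Phase 5 sketch (hypothetical helper), with escape sequences taking
   priority: strip the delimiting quotes and convert only the two simple
   escapes for " and \ to single characters. Everything else, including
   the text u0024, is copied through unchanged. */
size_t decode_escapes_first(const char *lit, char *out, size_t cap)
{
    size_t len = strlen(lit), n = 0;
    assert(len >= 2 && lit[0] == '"' && lit[len - 1] == '"');
    assert(cap >= len);
    for (size_t i = 1; i + 1 < len; ++i) {
        if (lit[i] == '\\' && i + 2 < len)
            ++i;                 /* drop the backslash, keep what follows */
        out[n++] = lit[i];
    }
    out[n] = '\0';
    return n;
}
```

Given the eight-character spelling of the argument after phase 1 (a double quote, a backslash, u0024, and a double quote), stringify_spelling yields the thirteen-character spelling "\"\\u0024\"" quoted above, and decode_escapes_first turns that back into the eight characters of the first sequence. When writing such a spelling as a C string literal, splitting it as "\"\\" "u0024\"" avoids depending on how the translator itself treats a backslash pair followed by u.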