Public Comment Number PC-UK0078

ISO/IEC CD 9899 (SC22N2620) Public Comment

=========================================== 

Date: 1998-02-25
Author: N.M. Maclaren
Author Affiliation: Self
Postal Address:
    University of Cambridge,
    Computer Laboratory,
    New Museums Site,
    Pembroke Street,
    Cambridge CB3 3QG,
    United Kingdom
E-mail Address: <nmm1@cam.ac.uk>
Telephone Number: +44 1223 334761
Fax Number:       +44 1223 334679
Number of individual comments: 1


Comment 1. 

Category: Normative change to existing feature retaining the original intent

Committee Draft subsection: 5.1.1.2, 5.2.1

Title: Universal character name handling


Detailed description:

A nasty little problem arises in code like the following:

    #define str(a) #a
    str("$")

In phase 1, the second line is mapped to str("\u0024") or perhaps
str("\U00000024").  In phase 4, this will be mapped to #"\u0024" and
(by 6.8.3.2 The # operator paragraph 2) to "\"\\u0024\"".  In phase 5,
this will be mapped to the execution character set, but there is no
explicit statement of the priority of mapping escape sequences and
universal character names.  So it is probably mapped to the sequence of
characters:

    '"','\\','u','0','0','2','4','"','\0'

but possibly (if universal character names take priority) to

    '"','\$','"','\0'

which leads to undefined behaviour, because \$ is not a defined escape
sequence.  In either case, this is a quiet change from C89, where the
result is simply '"','$','"','\0'.  Many similar ambiguities are
commented on elsewhere, and they all need some sort of resolution.
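
The two readings can be made concrete with a small sketch.  It assumes an
implementation that leaves '$' untouched in phase 1, as C89 compilers do;
the function and array names are illustrative only and do not come from
the Committee Draft.  The array spells out the first reading character by
character, so that no universal character name processing is involved in
the example itself.

```c
#include <stddef.h>
#include <string.h>

#define str(a) #a

/* What the # operator yields when phase 1 leaves '$' as itself:
   the three characters quote, dollar, quote. */
const char *stringized_plain(void)
{
    return str("$");
}

/* The first reading from the text above: the backslash introduced by
   phase 1 has been doubled by the # operator, so phase 5 sees an escaped
   backslash followed by the letter u and four digits. */
static const char first_reading[] =
    { '"', '\\', 'u', '0', '0', '2', '4', '"', '\0' };

const char *reading_one(void)
{
    return first_reading;
}

/* Number of characters in the first reading, excluding the terminator. */
size_t reading_one_length(void)
{
    return strlen(first_reading);
}
```

Under C89 rules the program sees three characters; under the first
reading it sees eight, which is the quiet change in question.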

The more I think about it, the less I think that these problems can be
solved by tweaking, so here is a radical solution that I believe
maintains all the functionality and resolves the problem.  It is
based on the principle that universal character names have a similar
purpose to trigraphs and therefore should be treated similarly.  I think
that the following changes are all that are NECESSARY, but some more
cleaning up may be desirable.


5.1.1.2 Translation phases

Phase 1 should be rewritten as:

1.  Firstly, physical source file multibyte characters are mapped to the
    source character set (introducing new-line characters for end-of-line
    indicators) if necessary.  Secondly, trigraph sequences are replaced
    by corresponding single-character internal representations.  Thirdly,
    universal-character-names are replaced by the corresponding
    single-character internal representations.
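
As an illustration only, the third step might be sketched as follows.
The sketch assumes that the internal representation is an array of
uint32_t code points, that the host character set is ASCII-compatible,
and that universal-character-names are replaced wherever they occur, as
trigraphs are; trigraph replacement itself is omitted.  The names
phase1_ucns, read_ucn and hex_value are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Value of a hexadecimal digit, or -1 if c is not one. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* If s starts with a backslash, u or U, and 4 or 8 hex digits, store the
   value and return the number of characters consumed; otherwise return 0. */
static size_t read_ucn(const char *s, uint32_t *value)
{
    size_t digits, i;
    uint32_t v = 0;

    if (s[0] != '\\') return 0;
    if (s[1] == 'u') digits = 4;
    else if (s[1] == 'U') digits = 8;
    else return 0;

    for (i = 0; i < digits; i++) {
        int h = hex_value(s[2 + i]);
        if (h < 0) return 0;   /* also stops at the terminating null */
        v = (v << 4) | (uint32_t)h;
    }
    *value = v;
    return 2 + digits;
}

/* Map source text to single-character internal values, writing at most
   cap of them to out.  Returns the number of values written. */
size_t phase1_ucns(const char *src, uint32_t *out, size_t cap)
{
    size_t n = 0;

    while (*src != '\0' && n < cap) {
        uint32_t v;
        size_t used = read_ucn(src, &v);
        if (used == 0) {       /* ordinary character: copy its value */
            v = (unsigned char)*src;
            used = 1;
        }
        out[n++] = v;
        src += used;
    }
    return n;
}
```

With this treatment, a '$' typed directly and the same character written
as a universal-character-name produce identical internal sequences, so
later phases cannot tell them apart.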


Phase 5 should be rewritten as:

5.  Each source character set member and escape sequence in character
    constants and string literals is converted to a member of the execution
    character set.
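
With universal-character-names already removed in phase 1, phase 5 has
only ordinary escape sequences to deal with.  A minimal sketch, assuming
that spelling holds the text between the quotes of a string literal and
handling only a handful of simple escapes, is given below;
decode_escapes is a hypothetical name.  An unrecognised escape such as
backslash followed by '$' is reported as an error here, standing in for
the undefined behaviour mentioned in the detailed description.

```c
#include <stddef.h>

/* Convert the spelling of a string literal body to execution characters.
   Returns the number of characters stored (excluding the terminator), or
   -1 for an unrecognised escape or a buffer that is too small. */
long decode_escapes(const char *spelling, char *out, size_t cap)
{
    size_t n = 0;

    if (cap == 0) return -1;

    while (*spelling != '\0') {
        char c = *spelling++;
        if (c == '\\') {
            switch (*spelling++) {
            case '\\': c = '\\'; break;
            case '"':  c = '"';  break;
            case '\'': c = '\''; break;
            case 'n':  c = '\n'; break;
            case 't':  c = '\t'; break;
            default:   return -1;  /* unknown escape or trailing backslash */
            }
        }
        if (n + 1 >= cap) return -1;  /* keep room for the terminator */
        out[n++] = c;
    }
    out[n] = '\0';
    return (long)n;
}
```

Applied to the stringized form of "$", which is spelled with an escaped
quote on each side of the dollar sign, the sketch yields the three
characters quote, dollar, quote, matching the C89 result.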


Footnote 6 should be rewritten as:

6. The process of handling extended characters is specified in terms of
    mapping to a single-character encoding that includes the union of the
    whole source character set and the characters specified by ISO/IEC
    10646-1, and, in the case of character literals and strings, further
    mapping to the execution character set.  In practical terms, however,
    any internal encoding may be used, so long as an actual character
    encountered in the input, and the same character expressed in the
    input as a universal-character-name (i.e., using the \U or \u
    notation), are handled equivalently.


Constraint 2 could be deleted, as it is now unnecessary.


5.2.1  Character sets

A new paragraph 6 should be added:

6.  Source characters shall be encoded as if the source character set
    included the whole of ISO/IEC 10646 as single characters, using an
    unspecified mapping to integral values (except as specified above).