Public Comment Number PC-UK0027

ISO/IEC CD 9899 (SC22N2620) Public Comment

=========================================== 

Date: 1998-01-03
Author: Clive D.W. Feather
Author Affiliation: Self
Postal Address:
    Demon Internet Limited
    322 Regents Park Road
    London
    N3  2QQ
    United Kingdom
E-mail Address: <clive@demon.net>
Telephone Number: +44 181 371 1138
Fax Number:       +44 181 371 1037
Number of individual comments: 1


Comment 1. 
Category: Inconsistency
Committee Draft subsection: 5.1.1.2, 5.2.1, 5.2.1.2, 6.1.2, 6.1.2.5, 6.8

Title: inconsistencies in use of "basic" and "extended" character sets
       and in their relationship to UCNs

Detailed description:

The Standard uses the terms "basic character set" and "extended character
set" at various places. However, the exact meaning of the two terms is
not clear, and this leads to confusion.

Consider the UTF-8 encoding (codes from 0 to 127 are single byte; bytes
with values from 128 to 255 form part of multibyte characters with length
from 2 to 6 bytes). There are five candidate execution character sets:

[1] The 95 characters required by 5.2.1p3, plus the null character.
[2] The 128 single byte characters.
[3] The 2**31 multibyte characters.
[4] Set [3] minus set [1].
[5] Set [3] minus set [2].

(and corresponding source sets).
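
As an illustration only, the five sets can be written as predicates on
code values. This is a minimal sketch that assumes ASCII-compatible code
values (as in UTF-8) and reads "the 95 characters" as the 91 letters,
digits, and graphic characters plus space, horizontal tab, vertical tab,
and form feed. The function names are hypothetical and appear nowhere in
the Standard.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: ASCII-compatible code values, as in UTF-8. */

/* Set [1]: the 95 characters of 5.2.1p3 plus the null character. */
static bool in_set1(uint32_t c)
{
    if (c == 0x00 || c == 0x09 || c == 0x0B || c == 0x0C)
        return true;                 /* null, HT, VT, FF */
    if (c < 0x20 || c > 0x7E)
        return false;                /* outside space .. tilde */
    return c != 0x24 && c != 0x40 && c != 0x60;   /* excludes $ @ ` */
}

/* Set [2]: the 128 single byte characters. */
static bool in_set2(uint32_t c) { return c < 0x80; }

/* Set [3]: the 2**31 multibyte characters. */
static bool in_set3(uint32_t c) { return c < 0x80000000u; }

/* Set [4]: set [3] minus set [1]. */
static bool in_set4(uint32_t c) { return in_set3(c) && !in_set1(c); }

/* Set [5]: set [3] minus set [2]. */
static bool in_set5(uint32_t c) { return in_set3(c) && !in_set2(c); }
```

Under these assumptions the @ sign (code 0x40) lies in [2] and [4] but not
in [1], which is the case that makes the choice between [1] and [2] matter.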

It is unclear whether the "basic character set" means [1] or [2]. The use
of the wording "at least the following members" in 5.2.1p3 implies that
the basic set can be larger than [1]. On the other hand, if the term is
taken to represent [2], then 5.1.1.2p2 would forbid using \u0040 to
represent the @ sign, something which I do not believe was intended, since
it means that the \u form would be forbidden for *all* characters in the
implementation-defined "basic" set.

Consideration of this and related matters has led me to believe that it
is most useful to have terms for [1] and for [4], while on the other hand
there is little or no need to refer to [2], [3], and [5]. Therefore
"basic character set" should represent [1] and "extended character set"
should represent [4]. To do this requires a number of changes.


Replace 5.2.1p1, second sentence, by:

    Each set is further divided into a /basic/ set, whose contents are
    given by this subclause, and an /extended/ set, consisting of zero
    or more locale-specific members (which are not members of the basic
    set).

In 5.2.1p3, delete "at least" in the first sentence, and in the fourth
sentence change "In the execution character set" to "In the basic
execution character set".


Delete the last sentence of 5.2.1p3 ("If any other characters ... the
behavior is undefined"). It is useless for several reasons:
- If translation phase 1 is taken literally, all members of the extended
  character set are replaced by UCNs, which consist of members of the
  basic character set (this point is further addressed below). While some
  are converted back in translation phase 5, all such characters are
  included in the exemptions.
- It does not allow for UCNs in identifiers.
- If such a character were encountered, the preprocessing token containing
  it is either not converted to a token (in which case the sentence does
  not apply) or *is* converted; in the latter case, the constraint of
  6.1p2 is violated and this sentence has no effect.


Delete 5.1.1.2p2, and replace it by a constraint at the end of 5.2.1
(forming a new paragraph 6):

    Constraint

    A universal-character-name shall not specify (in either form) a
    character short identifier less than 00A0 other than the following:
        0024  0040  0060

This is a more consistent position for the restriction, and it has the
useful side effect of making it clear what the UCNs of the basic character
set *are*.
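
The proposed constraint reduces to a simple test on the value of the
character short identifier. The following is a sketch only; the function
name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if a universal-character-name with this
 * value would satisfy the proposed constraint, i.e. the value is not
 * less than 00A0, or is one of 0024, 0040, 0060. */
static bool ucn_value_permitted(uint32_t v)
{
    if (v >= 0x00A0)
        return true;
    return v == 0x0024 || v == 0x0040 || v == 0x0060;
}
```

With this test \u0040 (the @ sign) is accepted, while \u0041 (the letter
A, a member of the basic character set) is rejected.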


Replace 5.2.1.2p1, first bullet, by:

    - The basic character set shall be present and shall be encoded
      using single-byte characters.


There is no longer a need to check for the shift states of comments, string
literals, and so on, because during translation phase 1 these will have
been converted to a stateless representation using UCNs. Therefore replace
5.2.1.2p2 by:

    If a source file does not consist of a valid sequence of multibyte
    characters, the behavior is undefined.


In 6.1.2.5p2, replace "required source character set enumerated in 5.2.1"
with "basic execution character set" (note that the execution set is more
sensible in this context than the source set).


The second sentence of 6.1.2p2 restricts UCNs in identifiers to those listed
in annex H. If some other UCN appears, it is unclear whether the behavior
is undefined, or whether the UCN is not part of the identifier.

This is further complicated by the example in footnote 122. If the text
appeared in a source file, by translation phase 4 it would be processed
as:

    #define THIS\u0024AND\u0024THAT(a,b) ((a)+(b))

and so the replacement list *does* begin with a character required by
subclause 5.2.1, and thus this is unambiguously a definition of the
object-like macro THIS. However, this completely wrecks the whole point
of 6.8p4 and FN122 (added in TC1).

Replace the second sentence of 6.1.2p2 with:

    Only universal-character-names corresponding to the characters listed
    in annex I are nondigits.[20]

and append to footnote 20:

    Since 00A0 is not listed in annex I, but 00C0 is, the sequence of
    characters a\u00C0b\u00A0 consists of two preprocessing tokens; the
    first is an identifier made up of three nondigits.

(note also the correction to the annex cited).
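
The intended effect of the footnote can be sketched as a scanner that
stops the identifier at the first UCN not listed in the annex. This is an
illustration under stated assumptions: annex_i_lists is a placeholder that
knows only the two facts given in the footnote (00C0 listed, 00A0 not),
only the four-digit \u form is handled, and ASCII-compatible code values
are assumed. All function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder for the annex I table: a real implementation would
 * consult the full list. Here only 00C0 is treated as listed. */
static bool annex_i_lists(uint32_t v) { return v == 0x00C0; }

/* Parse \uXXXX at s. On success store the value and return 6;
 * otherwise return 0. Stops safely at a terminating null. */
static size_t parse_ucn4(const char *s, uint32_t *out)
{
    uint32_t v = 0;
    if (s[0] != '\\' || s[1] != 'u')
        return 0;
    for (int i = 2; i < 6; i++) {
        char c = s[i];
        uint32_t d;
        if (c >= '0' && c <= '9')      d = (uint32_t)(c - '0');
        else if (c >= 'A' && c <= 'F') d = (uint32_t)(c - 'A') + 10;
        else if (c >= 'a' && c <= 'f') d = (uint32_t)(c - 'a') + 10;
        else return 0;
        v = v * 16 + d;
    }
    *out = v;
    return 6;
}

/* Number of source characters in the identifier at the start of s
 * (0 if s does not begin with an identifier). */
static size_t identifier_length(const char *s)
{
    size_t n = 0;
    for (;;) {
        unsigned char c = (unsigned char)s[n];
        uint32_t v;
        size_t k;
        if (c == '_' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
            n++;                       /* ordinary nondigit */
        else if (n > 0 && c >= '0' && c <= '9')
            n++;                       /* digit, not in first position */
        else if ((k = parse_ucn4(s + n, &v)) != 0 && annex_i_lists(v))
            n += k;                    /* UCN that counts as a nondigit */
        else
            return n;                  /* identifier ends here */
    }
}
```

Applied to the source text a\u00C0b\u00A0, the scanner consumes a, \u00C0,
and b (8 source characters, three nondigits) and stops before \u00A0,
which is left for a second preprocessing token.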

Replace 6.8p4 by:

    In the definition of an object-like macro, either the replacement list
    shall be separated from the identifier by white space, or it shall
    begin with one of the 26 graphic characters in the basic character set
    other than ( _ or \ (and thus shall not begin with a universal-
    character-name).[122]
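
A sketch of the check that the proposed 6.8p4 implies is given below. It
assumes the 26 characters are the 29 graphic characters of 5.2.1 less
( _ and \, and it treats an empty replacement list as raising no issue,
which the proposed wording does not address explicitly. The function name
and parameters are hypothetical.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical check for an object-like macro definition.
 * separated_by_space: white space follows the macro identifier.
 * repl: the replacement list as it appears after translation phase 1. */
static bool replacement_start_ok(bool separated_by_space, const char *repl)
{
    /* The 29 graphic characters of 5.2.1 other than ( _ and \ */
    static const char allowed[] = "!\"#%&')*+,-./:;<=>?[]^{|}~";

    if (separated_by_space)
        return true;
    if (repl[0] == '\0')
        return true;   /* assumption: empty list, nothing to separate */
    return strchr(allowed, repl[0]) != NULL;
}
```

For the footnote 122 example, the text after THIS begins with the
backslash of \u0024, so the check fails and the definition is rejected
instead of silently defining the object-like macro THIS.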