Document number: WG14 N589/X3J11 96-053    1996-07-02


                    Extended Characters in C9X
                    ==========================
                          Clive Feather
                          =============


Abstract
--------

This is an attempt to shape N574 (Tom Plum et al. on Extended Identifiers
and Extended Literals in C++) into a change proposal for C9X. Essentially,
the changes I have made are:
(1) Make universal character names behave like trigraphs.
(2) Use different wording for extended identifiers so as to reduce the
    chances of confusion.


Revision
--------

1996-07-02: added universal names for those characters that lacked them, and
adjusted other wording to clarify that those names cannot be used. Also
clarified the principles behind extended identifier characters.


Trigraphs
---------

The concept here is simple, but the detailed wording is hard. The basic
points are described here.

The concept of a "universal character name" is introduced. This is the code
point in Unicode, but does *not* imply the use of Unicode internally. The
horizontal and vertical tabs, form feed, and new line do not have universal
names.

Sequences like ??u1234 and ??U12345678 are defined to be trigraph sequences
(even though more than 3 physical source characters long). They are mapped
into an internal representation, which need not match any particular external
form. The ??U form cannot be used where the ??u form could.
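
The decoding rule can be sketched in C as follows. This is only an
illustration under stated assumptions, not proposed wording: the names
hex_value, in_invariant_code_set, and parse_universal_name are hypothetical,
and the forbidden ranges are copied from the replacement text for 5.2.1.1
under "Detailed changes" below.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: value of one hexadecimal digit (either case), or -1. */
static int hex_value(char c)
{
    static const char digits[] = "0123456789abcdef";
    const char *p;

    if (c == '\0')
        return -1;              /* strchr would otherwise match the terminator */
    p = strchr(digits, tolower((unsigned char)c));
    return p ? (int)(p - digits) : -1;
}

/* Ranges that the 4-digit form may not name (the Invariant Code Set). */
static int in_invariant_code_set(unsigned long v)
{
    return (v >= 0x0020UL && v <= 0x0022UL)
        || (v >= 0x0025UL && v <= 0x003FUL)
        || (v >= 0x0041UL && v <= 0x005AUL)
        ||  v == 0x005FUL
        || (v >= 0x0061UL && v <= 0x007AUL);
}

/*
 * If s begins with a permitted u-form (4 digits) or U-form (8 digits)
 * sequence, store the universal character name in *out and return the
 * number of source characters consumed (7 or 11). Otherwise return 0.
 */
static size_t parse_universal_name(const char *s, unsigned long *out)
{
    size_t ndigits, i;
    unsigned long value = 0;

    if (s[0] != '?' || s[1] != '?')
        return 0;
    if (s[2] == 'u')
        ndigits = 4;
    else if (s[2] == 'U')
        ndigits = 8;
    else
        return 0;

    for (i = 0; i < ndigits; i++) {
        int h = hex_value(s[3 + i]);
        if (h < 0)
            return 0;           /* too few hexadecimal digits */
        value = value * 16 + (unsigned long)h;
    }

    /* The 8-digit form cannot be used where the 4-digit form could. */
    if (ndigits == 8 && value < 0x10000UL)
        return 0;
    if (ndigits == 4 && in_invariant_code_set(value))
        return 0;

    *out = value;
    return 3 + ndigits;
}
```

The sketch returns a number rather than a character because the mapping
from universal character name to internal representation is left to the
implementation.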

The source character set is required to contain:
 *  4 control codes which can only be represented as themselves;
 *  space, 52 letters, 10 digits, and 20 graphic characters which can
    only be represented as themselves;
 *  9 graphic characters which must be representable as ??/ and ??u005C
    style trigraphs, and might have other physical source representations;
 *  a large number of extended identifier characters (listed in Annex Q)
    which must be representable as ??u00C0 style trigraphs, and might
    have other physical source representations.

[The proposed contents of Annex Q include only characters with names that
can be represented in the ??u00C0 style. Should the list be revised to
include characters with ??U12345678 style representations, no changes to
the actual proposal are required.]

The source character set can contain other characters, and ??u and ??U
trigraphs are a way of representing some such characters. The existing
wording in 5.2.1, 6.1.3.4, 6.1.4, and 6.1.7 makes it clear that these
characters may only appear, after preprocessing, in string literals and
similar constructs, and have an implementation-defined mapping to the
execution character set. No wording changes are therefore required to
handle these new characters.


Extended identifiers
--------------------

The change is simple and self-contained. The concept "non-digit" (basically,
the characters that can appear anywhere in an identifier, unlike digits) is
extended to include "extended-identifier-character". The characters that
make up this set are then listed in a normative annex.
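
A membership test over the annex could be sketched as below. This is an
illustration only: the type and function names are hypothetical, and the
table holds just the three ranges actually shown in this paper, not the full
Annex Q list. The binary search relies on the list being in strict numerical
order.

```c
#include <stddef.h>

/* Hypothetical type: one inclusive range of universal character names. */
struct ucn_range {
    unsigned long first;
    unsigned long last;
};

/*
 * Excerpt only: the ranges that appear in this paper. The remainder of
 * Annex Q is elided there and is deliberately not reproduced here.
 * Entries must be in ascending order.
 */
static const struct ucn_range annex_q_excerpt[] = {
    { 0x00C0UL, 0x00D6UL },
    { 0x00D8UL, 0x00F6UL },
    { 0xFFDAUL, 0xFFDCUL }
};

/* Binary search of the sorted range table. */
static int is_extended_identifier_character(unsigned long ucn)
{
    size_t lo = 0;
    size_t hi = sizeof annex_q_excerpt / sizeof annex_q_excerpt[0];

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (ucn < annex_q_excerpt[mid].first)
            hi = mid;
        else if (ucn > annex_q_excerpt[mid].last)
            lo = mid + 1;
        else
            return 1;
    }
    return 0;
}

/* nondigit: underscore, the 52 letters, or an extended identifier character. */
static int is_nondigit(unsigned long ucn)
{
    return ucn == 0x005FUL
        || (ucn >= 0x0041UL && ucn <= 0x005AUL)
        || (ucn >= 0x0061UL && ucn <= 0x007AUL)
        || is_extended_identifier_character(ucn);
}
```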


Detailed changes
----------------

In subclause 5.2.1, insert a new paragraph between the second and third
paragraphs:

    Each character in the source character set has a *universal character
    name*. This is a number: the index of the character in ISO 10646:????
    written as 4 or 8 hexadecimal digits. There need not be any other
    physical source file representation for any character which can be
    represented by a trigraph.

Amend the blocks of characters in the (originally) third paragraph to be:

    the 26 uppercase letters of the English alphabet

        A  B  C  D  E  F  G  H  I  J  K  L  M
        N  O  P  Q  R  S  T  U  V  W  X  Y  Z
 +      (universal character names 0041 to 005A)

    the 26 lowercase letters of the English alphabet

        a  b  c  d  e  f  g  h  i  j  k  l  m
        n  o  p  q  r  s  t  u  v  w  x  y  z
 +      (universal character names 0061 to 007A)

    the 10 decimal digits

        0  1  2  3  4  5  6  7  8  9
 +      (universal character names 0030 to 0039)

    the following 29 graphic characters

 |      !  "  #                  (universal character names 0021 to 0023)
 |      %  &  '                  (universal character names 0025 to 0027)
 |      (  )  *  +  ,  -  .  /   (universal character names 0028 to 002F)
 |      :  ;  <  =  >  ?         (universal character names 003A to 003F)
 |      [  \  ]  ^  _            (universal character names 005B to 005F)
 |      {  |  }  ~               (universal character names 007B to 007E)

 |  the space character (universal character name 0020), and control
 |  characters representing horizontal tab, vertical tab, and form feed
 |  (universal character names 0009, 000B, and 000C respectively).

Insert before "In the execution ..." the sentence:

    The source character set shall contain all the characters listed
    in Annex Q. [6a]

and add the footnote:

    [6a] The implementation must have some internal representation for each
        of these characters, but it need not correspond to any particular
        external form.


Replace subclause 5.2.1.1 with:

    5.2.1.1 Trigraph sequences

 |  All occurrences in a source file of the following sequences of
    characters (called trigraph sequences [7]) are replaced with the
    corresponding single character. [7a]

    [7] The trigraph sequences enable the input of characters that are not
        defined in the Invariant Code Set as described in ISO 646:1983,
        which is a subset of the seven-bit ASCII code set.
 +      The name "trigraph" comes from the initial 3-character code, and
 +      not the length of the entire sequence.

 +  [7a] If the replacement character does not exist in the source character
 +      set, the implementation should use an internal representation which
 +      behaves, for the purposes of the Standard, as if it were a single-byte
 +      character.

        ??=                               #
        ??(                               [
        ??/                               \
        ??)                               ]
        ??'                               ^
        ??<                               {
        ??!                               |
        ??>                               }
        ??-                               ~
 +      ??u followed by 4 hexadecimal  )  the character whose universal
 +          digits in either case      )  character name is given by the
 +      ??U followed by 8 hexadecimal  )  value of the hexadecimal number
 +          digits in either case      )

 +  In trigraphs beginning with ??u, the value shall not lie within the
 +  following ranges (inclusive): 0020-0022, 0025-003F, 0041-005A, 005F,
 +  0061-007A. [7b] In trigraphs beginning with ??U, the value shall not
 +  be less than 00010000. (All numbers in this paragraph are in
 +  hexadecimal.)

 +  [7b] These are the characters in the Invariant Code Set.

 |  No other trigraph sequences exist. Each ? that is not part of one of
    the trigraphs listed above is not changed.

 |  Examples

    The following source line
        printf("Eh???/n");
    becomes (after replacement of the trigraph sequence ??/)
        printf("Eh?\n");

 +  If the implementation has a $ character, then the following source
 +  lines are equivalent:
 +      printf("$2.50");
 +      printf("??u00242.50");
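
[Illustration, not proposed wording: the replacement of the nine
three-character trigraphs can be sketched in C as below. The function name
replace_classic_trigraphs is hypothetical, and the ??u and ??U forms are
deliberately left untouched here, since their result depends on the
implementation's internal representation. Question marks in the code are
kept apart so that the sketch is not itself altered by trigraph
replacement.]

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy src to dst, replacing each of the nine three-character trigraphs
 * with its single-character equivalent. dst must have room for at least
 * strlen(src) + 1 characters. Any question mark that does not start one
 * of these trigraphs is copied unchanged.
 */
static void replace_classic_trigraphs(const char *src, char *dst)
{
    /* third[i] is the character after the two question marks;
       result[i] is the corresponding replacement character. */
    static const char third[]  = "=(/)'<!>-";
    static const char result[] = "#[\\]^{|}~";

    while (*src != '\0') {
        const char *p;

        if (src[0] == '?' && src[1] == '?' && src[2] != '\0'
            && (p = strchr(third, src[2])) != NULL) {
            *dst++ = result[p - third];
            src += 3;
        } else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
}
```

Applied to the first example line, the scan copies the first question mark
unchanged (it is followed by two more question marks, which is not a
trigraph) and then replaces the ??/ that follows.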


In subclause 6.1.2 syntax, add a further alternative to nondigit:

    nondigit: one of
        extended-identifier-character
        _ a b c d e f g h i j k l m
          n o p q r s t u v w x y z
          A B C D E F G H I J K L M
          N O P Q R S T U V W X Y Z

and add the definition:

    extended-identifier-character:
        any character listed in Annex Q.

Add an annex (designated as Q in these instructions):

    Annex Q
    (Normative)
    Extended identifier characters

    An extended identifier character is a character that can be used in an
    identifier (see 6.1.2) but is not in the minimal basic source character
    set defined in subclause 5.2.1. The extended identifier characters are
    precisely those characters whose universal character name lies within
    one of the following ranges (all inclusive):
        00C0-00D6 (Latin)
        00D8-00F6 (Latin)
        ...
        FFDA-FFDC (CJK Unified Ideographs)

[The ellipsis is to be replaced by the full list, which should be in strict
numerical order.]