Public Comment Number PC-____

ISO/IEC CD 9899 (SC22N2620) Public Comment

===========================================

Date: 1998-02-09
Author: Clive D.W. Feather
Author Affiliation: Self
Postal Address:
    Demon Internet Limited
    322 Regents Park Road
    London
    N3  2QQ
    United Kingdom
E-mail Address: <clive@demon.net>
Telephone Number: +44 181 371 1138
Fax Number:       +44 181 371 1037
Number of individual comments: 1


Comment 1.
Category: Inconsistency
Committee Draft subsection: various

Title: problems with UCNs

Detailed description:

Further examination of UCNs shows that they have many problems associated
with them, and in particular that they produce behaviour very different
from that of C89.

The following example was presented in comp.std.c by Antoine Leca
<Antoine.Leca@renault.fr> and is summarised by me:

What is the effect of the following code?

    #include <stdio.h>

    #define str(s) #s

    int main(void)
    {
        printf ("    # of <%s> is <%s>\n", "$", str ("$"));
        return 0;
    }

Since $ is not part of the basic character set, this is not strictly
conforming. However, assume that the implementation has a representation
for $. Then, under C89 the output is clearly:

    # of <$> is <"$">


Under C9X, the output is probably one of:

    # of <$> is <"\u0024">
or
    # of <$> is <"\$">

At Translation Phase 1, both $s will be converted to \u0024, and so the
source will become:

    #include <stdio.h>

    #define str(s) #s

    int main(void)
    {
        printf ("    # of <%s> is <%s>\n", "\u0024", str ("\u0024"));
        return 0;
    }

When the # operator is applied as part of the expansion of str, the \ is
doubled, producing the line:

        printf ("    # of <%s> is <%s>\n", "\u0024", "\"\\u0024\"");

in accordance with 6.8.3.2p2.
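
The doubling of the \ can be observed without any UCN being involved. The
following is a minimal sketch that uses the portable escape sequence \n in
place of \u0024; the function names are illustrative only and do not come
from the draft.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define str(s) #s

/* Returns the result of applying # to the string literal "\n".
   The preprocessor inserts a \ before each " and each \ of the
   literal, so the expansion is spelled "\"\\n\"". */
const char *stringized_newline(void)
{
    return str("\n");
}

/* Prints the stringized spelling between angle brackets. The string
   holds four characters: a quote, a backslash, the letter n, a quote. */
void show_stringized_newline(void)
{
    const char *s = stringized_newline();

    assert(strlen(s) == 4);
    printf("<%s>\n", s);
}
```

Calling show_stringized_newline() writes the spelling of the literal, not a
new-line character, which is the behaviour the doubled \ is meant to
preserve for ordinary escape sequences.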

Now, when TP5 is reached one has to decide whether the UCN is recognised
first, generating:

        printf ("    # of <%s> is <%s>\n", "\u0024", "\"\$\"");

and undefined behaviour because of the escape sequence \$ - though I would
expect at least some implementations to generate:

    # of <$> is <"\$">

- or else the escape sequence \\ is recognised first, generating the output:

    # of <$> is <"\u0024">

Neither, however, is what the naive programmer would expect, and neither
interpretation allows a non-basic character to remain in a string that has
the # operator applied to it.


Another serious issue with UCNs is that they do not mix well with systems
such as ISO 2022. Consider a situation where redundant shift sequences
appear within string literals in source files. In C89 these sequences will
be retained throughout the translation process and will appear when the
literal is output by the program. In C9X the characters in the literal
will be converted to UCNs and the shift sequences lost; a new set of,
possibly different, shift sequences has to be added during TP5. For some
applications this is a Quiet Change from C89.
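
The loss of redundant shift sequences can be sketched as follows. This is
only an illustration under stated assumptions, not a description of any
actual translator: the function name is hypothetical, every escape is
assumed to be a three-byte designation of the ISO-2022-JP kind (for example
ESC ( B for ASCII and ESC ( J for JIS X 0201 Roman), and the initial state
is assumed to be ASCII.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ESC '\x1b'

/* Hypothetical helper: copies src to dst, writing a designation only
   when the active set really changes, as a translator might when it
   regenerates shift sequences in TP5. dst must have room for
   strlen(src) + 1 bytes. Returns the number of bytes written,
   excluding the terminating null character. */
size_t regenerate_shifts(const char *src, char *dst)
{
    char active[3] = "(B";   /* designation last written (assumed ASCII) */
    char wanted[3] = "(B";   /* designation last read from src */
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        if (src[0] == ESC && src[1] != '\0' && src[2] != '\0') {
            /* Remember the designation but do not copy it yet. */
            wanted[0] = src[1];
            wanted[1] = src[2];
            src += 3;
            continue;
        }
        if (strcmp(active, wanted) != 0) {
            /* The set really changed: write one fresh designation. */
            dst[n++] = ESC;
            dst[n++] = wanted[0];
            dst[n++] = wanted[1];
            strcpy(active, wanted);
        }
        dst[n++] = *src++;
    }
    if (strcmp(active, wanted) != 0) {
        /* Keep a final change of state, e.g. a return to ASCII. */
        dst[n++] = ESC;
        dst[n++] = wanted[0];
        dst[n++] = wanted[1];
    }
    dst[n] = '\0';
    return n;
}
```

Under these assumptions the literal "\x1b(Bhello" holds eight bytes before
its terminator in C89, and all eight are output by the program; the sketch
reduces it to the five bytes of "hello", so the program's output differs
even though the text displayed is the same.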