Last modified 2024-11-04

Understanding the Standard - part 2

Joining things together

Reading the C Standard is not an easy task. The wording has to be read carefully, and you always have to remember that certain words have special technical meanings, and may not mean what you think they do. Some apparently simple questions are only answered by assembling pieces from several different sections.

With all this in mind, your editor has asked me to put together a guide to some of the commonest pitfalls in Understanding the Standard. My first article explained some of the special terms in the standard, such as "undefined". This article talks about how the Standard connects different parts of your program together; the various modules you write, the library, and the operating system. I cover three topics: linkage, how and why to declare main, and freestanding implementations.

Linkage

If you use the same name in two different places, you might be referring to the same object, or you might not. For example, if you declare the variable i in two functions, you expect them to be independent. On the other hand, you expect a function call to refer to the function with the same name, and not for the compiler to think that they are different. The Standard uses the term "linkage" to describe this sort of thing, and discusses it in subclauses 6.1.2.2 and 6.7.

As you undoubtedly know, linkage of identifiers is controlled by the places in which declarations and definitions are placed, and by the keywords static and extern. However, you might not be aware of some of the subtleties that the Standard includes.

Sidebar: declarations and definitions.

The terms "declaration" and "definition" are often misunderstood. A declaration is a statement that describes an object, for example by giving its type. So a function prototype is a declaration, as is a statement like extern int i;. A definition, on the other hand, is a declaration that also reserves memory for the object. So a function definition is the declaration that includes the function body, and a variable definition is the one that causes the variable to actually exist; for example, int i = 10; is a definition. All definitions are also declarations, but declarations don't have to be definitions.

In general, every variable and function in your program must have exactly one definition, but can have any number of declarations. However, declarations that are not definitions are only useful when the definition is not in scope, for example because it is in a different module, or (in the case of functions), because it comes later on in the source file.

As we will see, the Standard also has the concept of tentative definitions, which are declarations that might be definitions but might not.
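As a minimal sketch of the distinction drawn in this sidebar, the fragment below pairs each declaration with its definition. The names limit and clamp are invented purely for illustration.

```c
/* Declarations only: they describe the names but reserve nothing. */
extern int limit;        /* declares a variable defined elsewhere (here, further down) */
int clamp(int value);    /* function prototype: a declaration, not a definition */

/* Definitions: each name gets exactly one. */
int limit = 10;          /* the initializer makes this the definition of limit */

int clamp(int value)     /* the body makes this the definition of clamp */
{
    return value > limit ? limit : value;
}
```

The declarations at the top could be repeated any number of times, in this file or in others, but each of the two definitions may appear only once in the whole program.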

The Standard talks about three kinds of linkage:

External linkage is used to connect identifiers in different modules. Where the same identifier occurs with external linkage in two or more modules, they all refer to the same variable or function. Only variables and functions with external linkage can be accessed by name in more than one module.
Internal linkage is used to connect identifiers in different parts of the same module. When an identifier has internal linkage, different uses within the same module all refer to the same variable or function, but this is independent from the use in any other module.
Identifiers with no linkage are not related to any other identifier in the program. For example, variables defined within function bodies have no linkage, and each declaration with no linkage generates a new variable.
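The three kinds of linkage can be seen side by side in a small sketch like the following. All of the identifiers (shared_total, module_calls, record, calls_so_far) are hypothetical names chosen for this illustration.

```c
int shared_total = 0;          /* external linkage: other modules can name it */
static int module_calls = 0;   /* internal linkage: private to this module */

int record(int amount)         /* external linkage */
{
    int scratch = amount * 2;  /* no linkage */

    {
        int scratch = amount + 1;  /* no linkage: a brand new variable */
        shared_total += scratch;   /* uses the inner scratch */
    }

    module_calls++;
    return scratch;            /* the outer scratch, untouched by the inner one */
}

int calls_so_far(void)         /* external linkage; exposes the private counter */
{
    return module_calls;
}
```

Another module could declare extern int shared_total; and see the same variable, but it has no way of referring to module_calls by name.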

The rules for functions and variables are different enough that they are best considered separately.

The big surprise with function declarations is that the keyword extern has no meaning whatsoever; in particular, it does not mean that the function has external linkage! Instead, the meaning of a function declaration depends only on whether or not it uses the keyword static [*]. Functions must have either internal or external linkage. Furthermore, every declaration of a function in the same module refers to the same function and must have the same linkage. Thus it is the first declaration (which might be within an included header file, of course) which determines the linkage. If this declaration uses the keyword static, the function has internal linkage, and all subsequent declarations can include or omit the static without effect; the function remains an internal one that cannot be accessed by name from another module. If, on the other hand, the first declaration does not use static, then the function has external linkage, and none of the subsequent declarations may use static either (a compiler will probably, but need not, detect this error).

* Of course, using both static and extern in the same declaration is a syntax error.

There are two subtleties here, both to do with the fact that a function can be declared within another function body - a "block scope declaration" (block scope declarations cannot use the keyword static). Firstly, if the first declaration is a block scope declaration, then its effect vanishes at the end of the block it appears in. Nevertheless, the above rules apply because the next declaration, which will be treated as another "first declaration", must specify the same linkage, which must therefore be external. Secondly, if a block scope declaration is not the first declaration, but the first declaration is hidden by a declaration of a variable in an outer block, then the inner declaration always specifies external linkage, and in this case the first declaration must also have external linkage.

The following code illustrates these situations (the comment "1-D" means that this is a "first declaration"):

    static int fn_a (void);  /* Internal linkage (1-D) */
    int fn_a (void);         /* Remains internal linkage */
    extern int fn_a (void);  /* Remains internal linkage; extern has no effect */

    int fn_b (void);         /* External linkage (1-D) */
    static int fn_b (void);  /* Error - fn_b has external linkage */

    extern int fn_c (void);  /* External linkage (1-D) */

    static int fn_d (void);  /* Internal linkage (1-D) */

    int main (void)
    {
        int fn_a (void);          /* Remains internal linkage */
        int fn_b (void);          /* Remains external linkage */
        static int fn_a (void);   /* Error - block scope must not be static */

        int fn_e (void);          /* External linkage (1-D) */
        int fn_f (void);          /* External linkage (1-D) */

        int fn_c, fn_d;           /* Variable declarations hide the functions */

        {
            int fn_c ();          /* Remains external linkage */
            int fn_d ();          /* Error - hidden fn_d has internal linkage */

            /* fn_c here refers to the external linkage function once more */
        }

        /* fn_c here refers to the variable */
        return 0;
    }

    int fn_e (void);              /* Remains external linkage */
    static int fn_f (void);       /* Error - fn_f has external linkage */

Now we're ready to go on to variables. Unlike with functions, the keyword extern is useful here. Firstly, variables declared inside function bodies (this includes the parameters of the function) without the keyword extern have no linkage. Therefore each separate declaration is also the definition of the variable, and it follows that such a variable can only be declared once in any given block (though of course the same name can be redeclared in an inner block, yielding a new variable). As we know, if static is used, the variable persists for the duration of the program. Otherwise it is created when the block is entered and destroyed when it is left.
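The difference in lifetime between the two kinds of block scope variable might be sketched as follows. The function next_ticket and its variables are invented for this example.

```c
int next_ticket(void)
{
    static int issued = 0;  /* no linkage, but persists for the whole program */
    int fresh = 0;          /* no linkage, created anew on every entry to the block */

    issued++;
    fresh++;

    /* issued keeps counting across calls; fresh is always 1 at this point */
    return issued * 10 + fresh;
}
```

Successive calls return 11, 21, 31 and so on: the tens digit comes from the static variable, which remembers its value, while the units digit comes from the automatic one, which does not.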

For variables declared within a function body with the keyword extern, and for variables declared outside a function body, the rules are somewhat more complex. They depend not only on which keyword has been used, but also on whether the declaration includes an initializer (an equals sign and an initial value for the variable). The first rule to remember is that a variable can only have one definition in the entire program. If it has internal linkage, all declarations of the variable are in the same module, and only one of them can be a definition. If it has external linkage, then it can have declarations in several modules. In this case, only one module can contain a definition. This is a point that is often misunderstood. Many linkers allow a variable to have several definitions, and provided that they all agree, there is no problem. However, other linkers will either complain if there are two definitions, or your program may silently go wrong (as I discussed in my first article, this is what is called "undefined behaviour"), and the Standard prohibits it. Therefore you should always take care to ensure there is only one definition for each variable, and that is why these rules are so important. Another, minor, point is that if the variable is never used (use within a sizeof expression does not count), it need not have a definition. Of course, if it is used, there must be one.

After we've considered that lot, the remainder of what the Standard says can best be expressed as a set of simple rules.

If the declaration initializes the variable, then it is a definition of the variable. Thus only one declaration can initialize the variable. In addition, it is an error (which the compiler must report) to initialize a declaration using extern within a function body.

If no declaration initializes the variable, and there is at least one declaration which does not include the keyword extern, then all those declarations together act as a definition, and the initial value of the variable is zero or a null pointer. The Standard calls these declarations tentative definitions. (See later for another note on tentative definitions.)

If the first declaration of the variable uses the keyword static, the variable has internal linkage. All subsequent declarations must use either static or extern.

If the first declaration of the variable does not use the keyword static, then the variable has external linkage. No declarations may use the keyword static.

Finally, the two subtle points that applied to functions also apply to variables. If the first declaration is within a function body, or any declaration is within an inner block with the first declaration hidden from it, the variable has external linkage, and no declaration may use static.

Again, let's have some examples.

    static int var_a = 1;      /* Internal linkage, definition */
    static int var_a;          /* Remains internal linkage */
    extern int var_a;          /* Remains internal linkage */
    int var_a;                 /* Error - var_a has internal linkage */
    static int var_a = 1;      /* Error - more than one definition */

    static int var_b;          /* Internal linkage, tentative definition */
    extern int var_b = 1;      /* Remains internal linkage, definition overrides
                                  tentative definition */

    int var_c = 1;             /* External linkage, definition */
    extern int var_c;          /* Remains external linkage */
    int var_c;                 /* Remains external linkage */
    static int var_c;          /* Error - var_c has external linkage */
    static int var_c = 1;      /* Error - more than one definition */

    extern int var_d = 1;      /* External linkage, definition */

    static int var_e;          /* Internal linkage, tentative definition */
    int var_e = 1;             /* Error - var_e has internal linkage */

    static int var_f;          /* Internal linkage, tentative definition */
    int var_g;                 /* External linkage, tentative definition */
    extern int var_h;          /* External linkage */
    extern int var_f;          /* Remains internal linkage */
    static int var_f;          /* Remains internal linkage, another tentative
                                  definition */

    int main (void)
    {
        extern int var_i;      /* External linkage */
        extern int var_j;      /* External linkage */
        extern int var_k;      /* External linkage */

        auto int var_f, var_g; /* Auto declarations hide the previous ones */

        {
            extern int var_f;  /* Error - hidden var_f has internal linkage */
            extern int var_g;  /* Remains external linkage */

            /* var_g here refers to the external linkage variable once more */
        }

        /* var_g here refers to the auto variable */
        return 0;
    }

    int var_i;                 /* Remains external linkage, tentative definition */
    extern int var_j;          /* Remains external linkage */
    static int var_k;          /* Error - var_k has external linkage */

If that is the complete source file, and if we delete the lines with the "Error" comments, then some of the variables mentioned are defined and some are not, as follows:

variable	linkage	defined
`var_a`	internal	yes
`var_b`	internal	yes
`var_c`	external	yes
`var_d`	external	yes
`var_e`	internal	yes
`var_f`	internal	yes
`var_g`	external	yes
`var_h`	external	no
`var_i`	external	yes
`var_j`	external	no
`var_k`	external	no

There is one more point to remember about tentative definitions that end up acting as the definition of a variable. With most types of variable, the only effect is that the variable is given the initial value zero. However, if the variable is an array, and no declaration gives the array a size, then it is defined with one element [*]. So, in the following code:

    int array_a [];
    int array_b [];
    int array_c [];

    extern int array_a [5];
    extern int array_b [] = { 1, 2, 3 };

arrays array_a and array_b are given sizes (5 and 3 respectively) by the second declarations, but array_c is not, and so will have only 1 element.

* This is not explicitly stated in the Standard, but is ISO's interpretation of the wording.

By this point you are probably beginning to panic slightly. Thankfully, however, you don't have to remember all of this. Instead, all you have to do is to obey a few simple rules.

If you want a variable or function to have internal linkage, specify static in all its declarations. At most one declaration can initialize a variable.
If you want a variable or function to have external linkage, you must not use static in any of its declarations. In all the declarations other than the definition, use extern.
- For functions, it doesn't matter whether you use extern in the definition or not.
- For variables, if you want to use extern, you must initialize the variable explicitly. If you don't initialize it, don't use extern. Despite the rules about tentative definitions, you should only have one definition of the variable.
Every declaration in a header file should use extern. Header files should never include initializations or function bodies, nor should they mention variables or functions with internal linkage.

These rules don't make use of all the features that the Standard allows, but there again, the Standard was designed to bring together diverse previous practices. Instead, they are easy to remember, practical, and easy to understand when you read the resulting code.
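Applied to a small module, the simple rules above might look like the sketch below. It is shown as one file for brevity; the first part stands for what would go in a header (say counter.h, a hypothetical name) and the rest for the single source file that owns the definitions. All identifiers are invented for the example.

```c
/* ---- header part: every declaration uses extern, nothing is initialized ---- */
extern int counter_limit;
extern int counter_bump(void);

/* ---- source part: the one place where things are defined ---- */
int counter_limit = 3;               /* external linkage: no static, explicit initializer */

static int counter_value = 0;        /* internal linkage: static, initialized once */
static int counter_clamp(int value); /* internal linkage: static on every declaration */

static int counter_clamp(int value)
{
    return value > counter_limit ? counter_limit : value;
}

int counter_bump(void)               /* extern is optional on a function definition */
{
    counter_value = counter_clamp(counter_value + 1);
    return counter_value;
}
```

Nothing with internal linkage appears in the header part, so other modules see only counter_limit and counter_bump.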

Defining `main()`

If you asked a group of experts on the C Standard what is the single most common violation in application programs, I suspect that most of them would reply 'main being defined as void'.

The Standard is very clear: subclause 5.1.2.2.1 says that main must be defined in one of the following two ways (or their equivalents not using prototypes):

    int main (void)
    { /* ... */ }

    int main (int argc, char *argv [])
    { /* ... */ }

"But," says the programmer, "making it void works for me". Unfortunately, that's not a defence when something fails. There are two parts to this; let's examine them separately.

Firstly, why does main return a value? Well, if we look at subclauses 5.1.2.2.3 and 7.10.4.3, we see that the answer is that the implementation (and here this usually means part of the operating system) uses the value to determine whether your program succeeded or failed. The Standard says that you can return zero, the value EXIT_SUCCESS, or the value EXIT_FAILURE (the last two are macros defined in <stdlib.h>). Zero and EXIT_SUCCESS both imply that the program "succeeded", while EXIT_FAILURE (obviously) implies that it "failed". Of course, what "success" and "failure" actually mean depends very much on the implementation, and there isn't much that the Standard can say about it; the same applies if you return a value other than one of those three. However, we can look at some common arrangements.

Unix looks at the bottom 8 bits only of the value; scripts and other programs can examine the result and decide what to do. Zero (EXIT_SUCCESS equals zero) is success, or true, while any other value is failure, or false (Unix scripts can have "if" and "while" instructions that depend on the results of programs).
MS-DOS does much the same, placing the value in the batch variable ERRORLEVEL where scripts can use it.
VMS treats odd values as meaning success and even values as meaning failure. C run-time systems have to treat zero specially, converting it into some odd number. The value can be used in much the same way as with Unix.
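Because the meaning of other values varies so much between systems, one portable habit is to return only the two macros. The helper below is a sketch of that idea; status_for and run_job are hypothetical names, not anything the Standard provides.

```c
#include <stdlib.h>   /* EXIT_SUCCESS, EXIT_FAILURE */

/* Map a count of errors onto one of the portable status values. */
int status_for(int error_count)
{
    return error_count == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}

/*
 * Typical use, assuming some function run_job() that returns an error count:
 *
 *     int main (void)
 *     {
 *         return status_for (run_job ());
 *     }
 */
```

Written this way, the program leaves it to the implementation to turn "success" and "failure" into whatever the host system expects, whether that is zero on Unix or an odd number on VMS.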

The other half of the problem with declaring main as void is to do with the ways in which functions are called in particular systems. The implementation is expecting main to return a value, but it doesn't! So what will be returned instead? Let's consider the four most common cases of this.

The first case is when the author of the C compiler you are using has explicitly decided to allow main to be declared as void. That's his or her prerogative, and there's nothing wrong with it. The Standard allows an implementation to do this, and many (including many popular MS-DOS compilers) have done so. On such systems, the compiler will generate a return value instead if one is needed; whether it's "success" or "failure" is, of course, another question.

So long as your program only runs under those compilers, you've got nothing to worry about. However, one day you may want to move to a new system where void doesn't work. Then you'll be in trouble. Why not save yourself the aggravation in the first place by using int?

The second case involves those systems where the return values from int functions are placed in a register. In this case, the part of the operating system that calls main will expect to find a value in that register, but nothing will have been placed there. Instead, it will find some random value, left there by a previous calculation, and use it. How it will interpret that value, of course, is beyond your control. If you don't look at the status of your program, you'll never find the problem. But it's lurking.

The third case is when the return value is placed somewhere on the stack. What will now happen depends on the exact way in which the stack is laid out, and how it is cleared after a function call. Let's consider a function call with four arguments: a, b, c, d. One arrangement is that the caller will push the parameters on to the stack, followed by the return address, and will take the result (if one is expected) from the location of the last parameter, after which it pops off the rest. Thus:

	`int` function	`void` function
Before call:	`... x y z`	`... x y z`
Arguments pushed:	`... x y z a b c d`	`... x y z a b c d`
Function called:	`... x y z a b c d addr`	`... x y z a b c d addr`
Result placed:	`... x y z a b c r addr`	[no action]
Return operation:	`... x y z a b c r`	`... x y z a b c d`
Result popped:	`... x y z a b c`	`... x y z a b c`
Other arguments popped:	`... x y z`	`... x y z`

So, if the function is a void one, but the caller thinks it returns an int, the value of the fourth parameter is "returned".

However, another common arrangement is for the called function to pop the arguments instead:

	`int` function	`void` function
Before call:	`... x y z`	`... x y z`
Arguments pushed:	`... x y z a b c d`	`... x y z a b c d`
Function called:	`... x y z a b c d addr`	`... x y z a b c d addr`
Arguments removed:	`... x y z addr`	`... x y z addr`
Result placed:	`... x y z r addr`	[no action]
Return operation:	`... x y z r`	`... x y z`
Result popped:	`... x y z`	`... x y`

And now, we see, the stack is corrupted! Since various things, such as cleaning up closed files, happen after main returns, we can see that disaster lurks.
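The two tables can be replayed with a toy model of the stack. This is only a simulation written in portable C, not a description of how any real compiler or calling convention works; the structure, the function names and the values 99 (the result r) and 777 (the return address) are all invented for the illustration.

```c
#include <stddef.h>

enum { STACK_MAX = 16 };

struct toy_stack {
    int slot[STACK_MAX];
    size_t depth;
};

static void push(struct toy_stack *s, int v) { s->slot[s->depth++] = v; }
static int pop(struct toy_stack *s) { return s->slot[--s->depth]; }

/* Push x y z, then the arguments a b c d, then the return address. */
static void set_up_call(struct toy_stack *s)
{
    s->depth = 0;
    push(s, 1);  push(s, 2);  push(s, 3);               /* x y z */
    push(s, 10); push(s, 20); push(s, 30); push(s, 40); /* a b c d */
    push(s, 777);                                       /* addr */
}

/* First arrangement: caller pops. Returns what the caller takes as the result. */
int value_seen_by_caller(int callee_returns_int)
{
    struct toy_stack s;

    set_up_call(&s);
    if (callee_returns_int)
        s.slot[s.depth - 2] = 99;  /* result placed over the last argument */
    (void)pop(&s);                 /* return operation removes addr */
    return pop(&s);                /* caller pops the "result" */
}

/* Second arrangement: callee pops. Returns the stack depth after the call. */
size_t depth_after_call(int callee_returns_int)
{
    struct toy_stack s;

    set_up_call(&s);
    (void)pop(&s);                 /* callee sets the return address aside */
    s.depth -= 4;                  /* callee removes the four arguments */
    if (callee_returns_int)
        push(&s, 99);              /* result placed */
    (void)pop(&s);                 /* caller, expecting an int, pops a result */
    return s.depth;
}
```

In this model value_seen_by_caller(0) yields 40, the value of the fourth argument, while depth_after_call(0) yields 2 rather than 3: the caller has popped z, which was never part of the call.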

The fourth and final case is the rare compiler that won't let you declare main as void. If you find one of these, it's worth paying extra for! Because there are no other functions with more than one possible prototype, and because there must not be a prototype for main in any standard header, making this test requires special code in the compiler. Anyone who's gone to the effort of getting that right will probably have put lots of other well-directed effort in as well.

Freestanding implementations

"What", I hear you cry, "is a freestanding implementation?" Well, if you examine subclause 5.1.2, you will see that it talks about two different kinds of execution environment: hosted and freestanding. If you're the average C programmer, you will only have used hosted implementations up to now. Freestanding implementations are rather more specialised, and in general are used for things like writing operating systems - where you don't have the basic facilities that the Standard library needs - and for code for embedded systems, where you want your final program to contain the absolute minimum, even if it means not having functions like printf available.

The differences between hosted and freestanding implementations are best described in a table:

Feature	Hosted	Freestanding
Standard headers available	all 15	`<float.h> <limits.h> <stdarg.h> <stddef.h>` (the implementation may provide others)
Available library functions and macros	all listed in the Standard	those in the above 4 headers (the implementation may provide others)
Reserved identifiers	all listed in the Standard	all listed in the Standard (this was decided by WG14 last December; the wording of subclause 5.1.2.1 will be changed in a future Technical Corrigendum to bring it into line with this decision).
Function called to run the program	`main`	implementation-defined
Arguments for that function	`argc` and `argv` or none	implementation-defined

As you can see, the main difference is that freestanding implementations only provide a "stripped-down" library. You may never come across one in practice, but at least you now know what the term means when you find it in your reading of the Standard.
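To give a flavour of what code for a freestanding implementation looks like, here is a sketch that relies only on two of the four guaranteed headers. The function names are hypothetical, and how such code is started is left out, since that is implementation-defined.

```c
#include <limits.h>   /* CHAR_BIT: available even when freestanding */
#include <stddef.h>   /* size_t:   available even when freestanding */

/* A stand-in for strlen, which a freestanding library need not provide. */
size_t text_length(const char *text)
{
    size_t n = 0;

    while (text[n] != '\0')
        n++;
    return n;
}

/* Number of bits in the storage of an unsigned int, using only <limits.h>. */
size_t uint_bits(void)
{
    return sizeof(unsigned int) * CHAR_BIT;
}
```

The same code compiles unchanged on a hosted implementation; it simply declines to depend on anything a freestanding one might lack.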

All linked up?

Hopefully you now understand how a Standard program links together. The next article in this series will cover the topics of international character sets, how the Standard allows you to use them, and what it requires you to do.
