N1990: Array Types and Bounds Checking

Submitter: Martin Uecker
Submission Date: December 9, 2015
Subject: N1990: Array Types and Bounds Checking
Summary:

The lack of a bounded array type in C is often cited as the cause of many programming errors and security problems. Here, I would like to point out that C already has a bounded array type, which could be used in combination with better compiler support to protect against out-of-bounds accesses in many cases. Minor language extensions could significantly extend the usefulness of existing array types with known length.

1. Local/global arrays.

Consider the following code (inside a function body):

  int N = 4;
  int x[N];
  x[N] = 1;	// undefined behaviour (6.5.6)

Because accessing an array past its end is undefined behavior, compilers are free to insert bounds checks automatically, and to a limited degree they are already able to do this.

Also consider the following example:

  void bar(int N, int p[static N]);
  int N = 3;
  int x[N];
  bar(4, x);

With the 'static' keyword the array pointed to by 'p' must have at least 'N' elements, as specified in 6.7.6.3(7). When a shorter array is passed to the function, this could be detected at run-time or sometimes already at compile-time. The compiler 'clang' already warns about this case when it happens to detect it at compile-time.

For example, suppose 'snprintf' had the following prototype (note that the first two arguments are swapped):

  int snprintf(size_t size, char str[static size], const char *format, ...); 

A compiler could then automatically detect certain cases where the array pointed to by 'str' is smaller than 'size', preventing buffer overruns.

Note that existing functions could often be enhanced in this way without breaking API or ABI. Changing the type of the argument from pointer to array-with-length only adds additional type information about the intended use of the argument which the compiler could use to emit diagnostics or insert additional run-time checks. 

It should be noted that for 'snprintf' and other standard functions, keeping the order of arguments in the API the same would require a forward declaration of 'size', which GCC supports as a GNU extension:

  int snprintf(size_t size; char str[static size], size_t size, const char *format, ...);

2. Arrays passed to functions.

Another important situation to consider is when arrays are passed to a function and then accessed:

  void foo(int N, int x[N])
  {
 	x[N] = 1;
  }

Because 'x' is not of array type, but of pointer type, this is not (necessarily) undefined behavior. In fact, specifying the length 'N' has no effect at all. There are different solutions:

A. Ideally, the language would be changed to make 'x' have type array-of-int-of-length-'N' - while still passing a reference on the stack to retain pass-by-reference semantics and ABI compatibility. Unfortunately, this would also cause a backwards compatibility problem with 'sizeof'. (BTW: 'sizeof' on function arguments declared as arrays should probably be deprecated.)

B. Alternatively, a special keyword could be used here to indicate the changed semantics. For example:

void foo(int N, int x[array N])
{
	x[N] = 1; // undefined behaviour!
}

could specify that 'x' has type array-of-int-of-length-'N' inside the function body. In this case, 'sizeof(x)' should yield 'sizeof(int) * N' and a compiler could add bounds checking.

Note: A and B would also be useful in combination with some kind of special array notation, e.g. as introduced with Intel's Cilk Plus.

C. Without any changes to the language, the user could rewrite the code as:

  void foo(int N, int (*x)[N])
  {
  	(*x)[N] = 1; // undefined behaviour!  
  }

This works, but it requires more subtle changes when adapting an existing code base, and the result is more difficult to read.

D. A potential solution is to change the meaning of 'static' to also imply that accesses beyond the specified length are undefined (without changing the type of 'x'):

  void foo(int N, int x[static N])
  {
  	x[N] = 1; // undefined behaviour under the changed semantics
  }

Since there is not much code using 'static', and such code is unlikely to access the array beyond the specified size, such a change in semantics might be acceptable.




3. Arrays in structs

Additional changes to the language could be considered to support the use of arrays-with-length in more situations. For example, when an array or pointer-to-array is a member of a struct, one could consider allowing the size expression to reference previous members of the struct:

struct foo {
	int N;
	int (*a)[N];	// pointer to array of length N
};

or

struct bart {
	int N;
	int a[N]; 	// last element, similar to flexible array member
};


4. Compatibility problems with old code

Although this is already undefined behavior, old code sometimes assumes that it can write beyond the end of an array. A common case is the use of an array of length 1 as the last element of a struct (instead of a flexible array member without a size expression).

  struct foo {
  	int a[1];
  } *x = malloc(...);

  x->a[5] = 3;

Compilers should provide options to relax bounds-checking to support old code.



Conclusion:

Very minor changes to the language, in combination with improved compiler support, would enhance the existing array type in C into a bounded array type that could protect against out-of-bounds accesses in many situations, while maintaining full ABI and API compatibility.