C Types and Type Punning
Page Contents
References
- https://en.cppreference.com/w/c/language/type
- https://en.wikipedia.org/wiki/Type_punning
- https://www.cocoawithlove.com/2008/04/using-pointers-to-recast-in-c-is-bad.html
- https://stackoverflow.com/questions/98650/what-is-the-strict-aliasing-rule/51228315#51228315
- https://cellperformance.beyond3d.com/articles/2006/06/understanding-strict-aliasing.html
- https://adriann.github.io/undefined_behavior.html
- https://blog.regehr.org/archives/959
- https://stackoverflow.com/questions/98650/what-is-the-strict-aliasing-rule
- https://books.google.co.uk/books?id=VhnjDQAAQBAJ&pg=PA23&lpg=PA23&dq=examples+of+type+punning+bugs&source=bl&ots=p-uVSN0bvH&sig=ACfU3U1vF1xJy6gDF8VDPKXk4p705JL2Tg&hl=en&sa=X&ved=2ahUKEwj44u3ugf_oAhX6ThUIHc2yBqY4ChDoATADegQIBxAB#v=onepage&q=examples%20of%20type%20punning%20bugs&f=false
- https://davmac.wordpress.com/2010/02/26/c99-revisited/
- https://www.cocoawithlove.com/2008/04/using-pointers-to-recast-in-c-is-bad.html
- https://blog.regehr.org/archives/959
- https://developers.redhat.com/blog/2020/06/02/the-joys-and-perils-of-c-and-c-aliasing-part-1 -- REALLY GOOD!
- https://developers.redhat.com/blog/2020/06/03/the-joys-and-perils-of-aliasing-in-c-and-c-part-2 -- REALLY GOOD!
- https://mail.python.org/pipermail/python-dev/2003-July/036909.html
- https://en.wikipedia.org/wiki/Aliasing_(computing)#Aliased_pointers
To Read
https://stackoverflow.com/questions/11373203/accessing-inactive-union-member-and-undefined-behavior
Types
Before we can talk about type punning, we must know what a type is and what kinds of types there are...
In C an "object" is is just a region of data storage that represents a value. The
way in which an object is interpretted, i.e., the meaning that the value of the object is given,
is determined by the object's type. For example, if, in an 8-bit block of memory, the value 0x80 is
stored, when accessed as an unsigned char
type, it has the value 128. However, when the same
object is accessed as a signed char
it would have the value -128. Thus, to quote the
standard verbatim, the meaning of a value stored in an object ... is determined by the type
of the expression used to access it
.
All the above are unqualified types. They can be qualified using the keywords
const
, volatile
or restrict
.
When it comes to accessing an object, the effective type is its declared type, or in the case of allocated objects, which have no declared type, the type of the lvalue referencing the object. The standard goes into real details and says this (parapgraphs added for readability):
The effective type of an object for an access to its stored value is the declared type of the object, if any [Allocated objects have no declared type].
If a value is stored into an object having no declared type through an lvalue having a type that is not a character type, then the type of the lvalue becomes the effective type of the object for that access and for subsequent accesses that do not modify the stored value.
If a value is copied into an object having no declared type using memcpy or memmove, or is copied as an array of character type, then the effective type of the modified object for that access and for subsequent accesses that do not modify the value is the effective type of the object from which the value is copied, if it has one.
For all other accesses to an object having no declared type, the effective type of the object is simply the type of the lvalue used for the access...
Punning a.k.a. Aliasing
WikiPedia has the following to say about type punning.
In C and C++, constructs such as pointer type conversion and union ... are provided in order to permit many kinds of type punning, although some kinds are not actually supported by the standard language.
Type aliasing or punning is when an object is referred to using different types, which may or may not be "compatible". In the C99 standard it states:
All declarations that refer to the same object or function shall have compatible type; otherwise, the behavior is undefined.
Remember that an "object" is just a region of data storage that represents a value. This means that any variables that we declare that refers to the same region of data storage must have compatible types, otherwise the behaviour of the program when accessing said storage is undefined.
This begs the question, "what is a compatible type?". The standard has this to say about compatible types (paragraph formatting and bullet points added for readability):
Two types have compatible type if their types are the same. Additional rules for determining whether two types are compatible are described ... for type specifiers, ... for type qualifiers, and ... for declarators [footnote: two types need not be identical to be compatible] ...
Moreover, two structure, union, or enumerated types declared in separate translation units are compatible if their tags and members satisfy the following requirements:
- If one is declared with a tag, the other shall be declared with the same tag.
- If both are complete types, then the following additional requirements apply:
- there shall be a one-to-one correspondence between their members such that each pair of corresponding members are declared with compatible types, and such that
- if one member of a corresponding pair is declared with a name, the other member is declared with the same name.
- For two structures, corresponding members shall be declared in the same order.
- For two structures or unions, corresponding bit-fields shall have the same widths.
- For two enumerations, corresponding members shall have the same values.
So, what about type specifiers? The standard says this:
Each enumerated type shall be compatible with char, a signed integer type, or an unsigned integer type. The choice of type is implementation-defined, but shall be capable of representing the values of all the members of the enumeration ...
For type qualifiers:
For two qualified types to be compatible, both shall have the identically qualified version of a compatible type; the order of type qualifiers within a list of specifiers or qualifiers does not affect the specified type.
For type declarators:
For two pointer types to be compatible, both shall be identically qualified and both shall be pointers to compatible types ...
... For two array types to be compatible, both shall have compatible element types, and if both size specifiers are present, and are integer constant expressions, then both size specifiers shall have the same constant value. If the two array types are used in a context which requires them to be compatible, it is undefined behavior if the two size specifiers evaluate to unequal values ...
... For two function types to be compatible, both shall specify compatible return types. Moreover, the parameter type lists, if both are present, shall agree in the number of parameters and in use of the ellipsis terminator; corresponding parameters shall have compatible types ...
As is typical of the standard it can be a rather terse read and contain some vocabulary that we might need to understand.
A type specifier is the type you declare a variable with. For example "unsigned int varName;
"
declares a variable varName
with the type specified unsigned int
.
A type qualifier is something like const
or volatile
. It gives us
a property of the type - it doesn't change the underlying type, it just changes a property of the
type: is it read only, can in change unpredictably etc?
A declaritor is something that gives an identifier a type and possibly a qualifier and
reserves "room" for it. For example, the declaration struct MyStruct temp;
declares an identifier "temp" that references an object of type MyStruct
.
Memory has been reserved for the object, possibly on the stack, for example, if this it is
a local variable to a function.
A tag of a structure/union/enum is the name given to that type of structure/union/enum:
struct MyStruct { ... } # The "tag" is "MyStruct" union MyUnion { ... } # The "tag" is "MyUnion", etc
The standard also says:
An object shall have its stored value accessed only by an lvalue expression that has one of the following types [footnote: The intent of this list is to specify those circumstances in which an object may or may not be aliased]:
- a type compatible with the effective type of the object,
- a qualified version of a type compatible with the effective type of the object,
- a type that is the signed or unsigned type corresponding to the effective type of the object,
- a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
- an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
- a character type.
With an exception for unions:
One special guarantee is made in order to simplify the use of unions: if a union contains several structures that share a common initial sequence (see below), and if the union object currently contains one of these structures, it is permitted to inspect the common initial part of any of them anywhere that a declaration of the complete type of the union is visible.
...
The following is a valid fragment:
union { struct { int alltypes; } n; struct { int type; int intnode; } ni; struct { int type; double doublenode; } nf; } u; u.nf.type = 1; u.nf.doublenode = 3.14; /* ... */ if (u.n.alltypes == 1) if (sin(u.nf.doublenode) == 0.0) /* ... */
For example, a signed int
and unsigned int
have compatible types,
as far as I can see from the above definitions because they are the signed or unsigned type corresponding to the effective type
. Thus, I think integer is the "base" type and then it signedness is extra.
But, a short
is not however compatible with an int
from the above
definitions. Why can we not alias/type-pun an int as shown below?
int a = 10; short *b = (short *)&a; // b aliases a or is a type pun of a printf("%hd\n", *b); // What value does this print out??
Why is type puning sometimes seen as dangerous? The biggest problem is the mention of "undefined behaviour".
One reason accessing the object a
through the pointer b
can result
in "undefined behaviour" is that the C standard makes no assumptions about the
endianess of the platform. Hence *b
could access the MSB or LSB of the short
depending on what platform we compile this on! Thus, this is not a safe operation as it could result
in all sorts of issues should the program ever be ported onto machines with differing endianess.