Unions in C

Paul J. Lucas

Posted on January 13, 2024

Introduction

A union is syntactically just like a struct and is used to store data for any one of its members at any one time. For example:

union value {
  long   i;
  double f;
  char   c;
  char  *s;
};

union value v;
v.i = 42;                  // value is now 42
v.c = 'a';                 // value is now 'a' (no more 42)

union value *pv = &v;
pv->s = malloc(6);         // -> works too
strcpy( pv->s, "hello" );

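The size of a union is at least the size of its largest member; the compiler may add trailing padding so that the size is a multiple of the strictest member alignment.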
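
As a rough check of the size rule, here is a minimal sketch. The names union padded and padded_size_remainder are illustrative only and are not part of the original example; the exact sizes depend on the platform, so only relationships that always hold are asserted.

```c
#include <assert.h>   // static_assert (a macro before C23)
#include <stddef.h>   // size_t

union value {
  long   i;
  double f;
  char   c;
  char  *s;
};

// The union must be big enough to hold any one of its members.
static_assert( sizeof(union value) >= sizeof(long),   "too small for long" );
static_assert( sizeof(union value) >= sizeof(double), "too small for double" );
static_assert( sizeof(union value) >= sizeof(char*),  "too small for char*" );

// Hypothetical union whose largest member (5 bytes) is typically not a
// multiple of the strictest alignment (that of int), so the compiler
// will likely add trailing padding.
union padded {
  char bytes[5];
  int  n;
};

// The size of any type is a multiple of its alignment, so this returns 0.
size_t padded_size_remainder( void ) {
  return sizeof(union padded) % _Alignof(union padded);
}
```

On a typical platform where int is 4 bytes, sizeof(union padded) would be 8 rather than 5, but that exact figure is an assumption about the platform.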

A common use-case for a union would be in a compiler or interpreter where a token is any one of a character literal, integer literal, floating-point literal, string literal, identifier, operator, etc. It would be wasteful to use a struct since only one member would ever have a value.

Initialization

Since all members have the same offset, their order mostly doesn’t matter, except that the first member is the one that is initialized when an initializer list without a designator is used, so the value given must be convertible to that member’s type:

union value v = { 42 };    // as if: v.i = 42

That said, 0 is a valid initializer for any built-in type, so { 0 } works regardless of the first member’s type.

Alternatively, you can use a designated initializer to specify a member:

union value v = { .c = 'a' };

Which Member?

One obvious problem with a union is, after you store a value in a particular member, how do you later remember which member that was? With a union by itself, you generally can’t. You need some other variable to “remember” the member you last stored a value in. Often, this is done using an enumeration and a struct:

enum token_kind {
   TOKEN_NONE,
   TOKEN_INT,
   TOKEN_FLOAT,
   TOKEN_CHAR,
   TOKEN_STR
};

struct token {
  enum token_kind kind;
  union {                  // "anonymous" union
    long   i;
    double f;
    char   c;
    char  *s;
  };
};

struct token t = { .kind = TOKEN_CHAR, .c = 'a' };

When a union is used inside a struct, it’s often made an anonymous union, that is, a union without a name. In this case, the union members behave as if they’re direct members of their enclosing struct, except that they all have the same offset.

Anonymous unions (and structs) are only supported starting in C11.
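
To show how the kind member is meant to be used, here is a minimal sketch of a function that switches on kind and reads only the matching union member. The function token_format and its output formats are hypothetical, chosen just for illustration.

```c
#include <assert.h>
#include <stdio.h>    // snprintf, size_t, NULL

enum token_kind {
  TOKEN_NONE,
  TOKEN_INT,
  TOKEN_FLOAT,
  TOKEN_CHAR,
  TOKEN_STR
};

struct token {
  enum token_kind kind;
  union {                  // "anonymous" union
    long   i;
    double f;
    char   c;
    char  *s;
  };
};

// Hypothetical helper: formats a token into buf, reading only the union
// member that kind says was stored last.  Returns what snprintf returns.
int token_format( struct token const *t, char *buf, size_t buf_size ) {
  assert( t != NULL );
  switch ( t->kind ) {
    case TOKEN_INT:   return snprintf( buf, buf_size, "%ld", t->i );
    case TOKEN_FLOAT: return snprintf( buf, buf_size, "%g", t->f );
    case TOKEN_CHAR:  return snprintf( buf, buf_size, "'%c'", t->c );
    case TOKEN_STR:   return snprintf( buf, buf_size, "\"%s\"", t->s );
    case TOKEN_NONE:  break;
  }
  return snprintf( buf, buf_size, "(none)" );
}
```

Listing every enumerator as a case without a default lets many compilers warn when a new kind is added but not handled.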

Type Punning

Type punning is a technique to read or write an object as if it were of a type other than what it was declared as. Since this circumvents the type system, you really have to know what you’re doing. In C (but not C++), a union can be used for type punning. For example, here’s a way to get the value of a 32-bit integer with the high and low order 16-bit halves swapped:

#include <stdint.h>        // uint16_t, uint32_t

uint32_t swap16of32( uint32_t n ) {
  union {
    uint32_t u32;
    uint16_t u16[2];
  } u = { n };
  uint16_t const t16 = u.u16[0];
  u.u16[0] = u.u16[1];
  u.u16[1] = t16;
  return u.u32;
}

The union members u32 and u16[2] “overlay” each other allowing you to read and write a uint32_t as if it were a 2-element array of uint16_t. (You could alternatively write a version that used uint8_t[4] and reversed the entire byte order depending on your particular need.)
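
The uint8_t[4] variant mentioned above could look something like the following sketch. The name swap8of32 is made up here to parallel swap16of32.

```c
#include <stdint.h>        // uint8_t, uint32_t

// Sketch: returns n with the order of all four of its bytes reversed.
uint32_t swap8of32( uint32_t n ) {
  union {
    uint32_t u32;
    uint8_t  u8[4];
  } u = { n };
  uint8_t t8 = u.u8[0];    // swap the outer pair of bytes
  u.u8[0] = u.u8[3];
  u.u8[3] = t8;
  t8 = u.u8[1];            // swap the inner pair of bytes
  u.u8[1] = u.u8[2];
  u.u8[2] = t8;
  return u.u32;
}
```

Because the bytes are reversed in memory, the result is the same on both little-endian and big-endian CPUs.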

You can also use unions to do type punning of unrelated types, for example int32_t and float, allowing you to access the sign, exponent, and mantissa individually. (However, the result depends on the CPU’s floating-point format and byte order.)
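
As a sketch of the float case, the following assumes that float is an IEEE 754 32-bit value and that float and integers share the same byte order, which holds on most current CPUs but is not guaranteed by C. It uses uint32_t rather than int32_t so the shifts are well defined. The names float_parts and float_split are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

// ASSUMPTION: float is IEEE 754 binary32 (1 sign bit, 8 exponent bits,
// 23 mantissa bits).  At least check that the sizes match.
static_assert( sizeof(float) == sizeof(uint32_t), "float must be 32 bits" );

struct float_parts {
  unsigned sign;           // 1 if negative
  unsigned exponent;       // biased exponent (bias is 127)
  uint32_t mantissa;       // fraction bits, without the implicit leading 1
};

// Hypothetical helper: splits f into its three bit fields by writing the
// float member of a union and reading the uint32_t member.
struct float_parts float_split( float f ) {
  union {
    float    f;
    uint32_t u32;
  } u = { .f = f };
  struct float_parts const parts = {
    .sign     = (unsigned)(u.u32 >> 31),
    .exponent = (unsigned)((u.u32 >> 23) & 0xFFu),
    .mantissa = u.u32 & 0x7FFFFFu
  };
  return parts;
}
```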

Restricted Class Hierarchies in C

Another use for unions is to implement class hierarchies in C, but only “restricted” class hierarchies. A “restricted” class hierarchy is one used only to implement a solution to a problem where all the classes are known. Users are not permitted to extend the hierarchy via derivation.

This can be partially achieved via final in C++ or fully achieved via sealed in Java or Kotlin.

Of course C doesn’t have either classes or inheritance, but restricted class hierarchies can be implemented via structs and a union.

The token example shown previously is a simple example of this: all the kinds of tokens are known and there’s one member in the union to hold the data for each kind. But what if there’s more than one member per kind?

For a larger example, consider cdecl, a program that can parse a C or C++ declaration (aka, “gibberish”) and explain it in English:

cdecl> explain int *const (*p)[4]
declare p as pointer to array 4 of constant pointer to integer

During parsing, cdecl creates an abstract syntax tree (AST) of nodes where each node contains information for a particular kind of declaration. For example, the previous declaration could be represented as an AST like the following (expressed in JSON):

{
  "name": "p",
  "kind": "pointer",
  "pointer": {
    "to": {
      "kind": "array",
      "array": {
        "size": 4,
        "of": {
          "kind": "pointer",
          "type": "const",
          "pointer": {
            "to": {
              "kind": "built-in type",
              "type": "int"
            }
          }
        }
      }
    }
  }
}

For this example, let’s consider a subset of the kinds of nodes in a C++ declaration (to keep the example shorter):

enum c_ast_kind {
  K_BUILTIN,               // e.g., int
  K_CLASS_STRUCT_UNION,
  K_TYPEDEF,
  K_ARRAY,
  K_ENUM,
  K_POINTER,
  K_REFERENCE,             // C++ reference
  K_CONSTRUCTOR,           // C++ constructor
  K_DESTRUCTOR,            // C++ destructor
  K_FUNCTION,
  K_OPERATOR,              // C++ overloaded operator
  // ...
};
typedef enum c_ast_kind c_ast_kind_t;

And declare some structs to contain the information needed for each kind:

struct c_array_ast {
  c_ast_t            *of_ast;          // array of ...
  unsigned            size;
};

struct c_enum_ast {
  c_ast_t            *of_ast;          // fixed type, if any
  unsigned            bit_width;       // width when > 0
  char const         *enum_name;       // enumeration name
};

struct c_function_ast {
  c_ast_t            *ret_ast;         // return type
  c_ast_list_t        param_ast_list;  // parameters
};

struct c_operator_ast {
  c_ast_t            *ret_ast;         // return type
  c_ast_list_t        param_ast_list;  // parameters
  c_operator_t const *operator;        // operator info
};

struct c_ptr_ref_ast {
  c_ast_t            *to_ast;          // pointer/ref to ...
};

struct c_typedef_ast {
  c_ast_t const      *for_ast;         // typedef for ...
  unsigned            bit_width;       // width when > 0
};

Notice that, of the AST information declared thus far, there are similarities, specifically:

  1. The nodes point to one other node and the pointer is declared first.
  2. Functions and operators both have return types and parameter lists and the parameter lists are declared second.
  3. For nodes that have bit-field widths, the width is also declared second.

The fact that the same members in different structs are at the same offset is convenient because it means that code that, say, iterates over the parameters of a function will also work for the parameters of an operator. Having noticed this, we can make an effort to keep the same members in any remaining structs at the same offsets. For example, the information for K_BUILTIN could be declared as:

struct c_builtin_ast {
  unsigned bit_width;                  // width when > 0
};

because that’s all the information that’s needed for a built-in type. However, the bit_width member wouldn’t be at the same offset as the same member in either c_enum_ast or c_typedef_ast. To fix that so code that accesses bit_width can do so for any type that has it, we need to insert an unused pointer (a void pointer will do):

struct c_builtin_ast {
  void    *reserved;                   // instead of for/to
  unsigned bit_width;                  // width when > 0
};

If you think inserting unused members might waste space, remember that, once all these structs are put into the same union, the union will be the size of the largest member anyway; hence inserting unused members doesn’t waste space.

While using a named member like reserved is fine, if you want to help guarantee that the member can never be accessed directly, you can employ a macro:

#define DECL_UNUSED(T) \
  _Alignas(T) char UNIQUE_NAME(unused)[ sizeof(T) ]

struct c_builtin_ast {
  DECL_UNUSED(c_ast_t*);               // instead of for/to
  unsigned bit_width;                  // width when > 0
};

See here for details on UNIQUE_NAME.
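
For readers without the linked details, here is one way such a macro could be written. This is only a sketch under the assumption that UNIQUE_NAME pastes the current line number onto a prefix; the helper macros NAME2 and NAME2_HELPER are hypothetical, and the actual definition may differ.

```c
#include <assert.h>   // static_assert (a macro before C23)
#include <stddef.h>   // offsetof

// ASSUMPTION: one possible UNIQUE_NAME.  The extra level of macros makes
// __LINE__ expand to a number before ## pastes the tokens together.
#define NAME2_HELPER(A,B)    A##B
#define NAME2(A,B)           NAME2_HELPER(A,B)
#define UNIQUE_NAME(PREFIX)  NAME2(NAME2(PREFIX,_),__LINE__)

#define DECL_UNUSED(T) \
  _Alignas(T) char UNIQUE_NAME(unused)[ sizeof(T) ]

typedef struct c_ast c_ast_t;   // an incomplete type suffices for pointers

struct c_enum_ast {
  c_ast_t    *of_ast;           // fixed type, if any
  unsigned    bit_width;        // width when > 0
  char const *enum_name;        // enumeration name
};

struct c_builtin_ast {
  DECL_UNUSED(c_ast_t*);        // expands to a member like unused_<line>
  unsigned bit_width;           // width when > 0
};

// The padding member puts bit_width at the same offset in both structs.
static_assert(
  offsetof( struct c_builtin_ast, bit_width ) ==
  offsetof( struct c_enum_ast, bit_width ),
  "bit_width offsets must match"
);
```

Since the member name includes the line number, two uses of DECL_UNUSED on the same line would collide, so under this sketch each use needs its own line.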

We can apply the same fix for the information for K_CONSTRUCTOR so param_ast_list is at the same offset as in c_function_ast and c_operator_ast (constructors don’t have return types):

struct c_ctor_ast {
  DECL_UNUSED(c_ast_t*);               // instead of ret_ast
  c_ast_list_t  param_ast_list;        // parameter(s)
};

And again apply the same fix for the information for K_CLASS_STRUCT_UNION so csu_name is at the same offset as enum_name in c_enum_ast:

struct c_csu_ast {
  DECL_UNUSED(c_ast_t*);              // instead of for/to
  DECL_UNUSED(unsigned);              // instead of bit_width
  char const *csu_name;
};

Given all those declarations (assume that for any struct c_X_ast, there’s a typedef struct c_X_ast c_X_ast_t), we can now put them all inside an anonymous union inside a struct for an AST node:

struct c_ast {
  c_ast_kind_t kind;
  char const  *name;
  c_type_t     type;
  // ...

  union {
    c_array_ast_t    array;
    c_builtin_ast_t  builtin;
    c_csu_ast_t      csu;
    c_ctor_ast_t     ctor;
    c_enum_ast_t     enum_;
    c_function_ast_t func;
    c_operator_ast_t oper;
    c_ptr_ref_ast_t  ptr_ref;
    c_typedef_ast_t  tdef;
    // ...
  };
};
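
To illustrate why the matching offsets are convenient, here is a cut-down sketch in which one function reads the parameter list of either a function or an operator node. The c_ast_list_t shown is a stand-in (an array and a count) since the real type isn’t shown in this article, operator_name simplifies the operator member, and c_ast_param_count is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>   // size_t

typedef struct c_ast c_ast_t;

// ASSUMPTION: stand-in for the real c_ast_list_t.
typedef struct {
  c_ast_t **params;
  size_t    count;
} c_ast_list_t;

typedef enum { K_FUNCTION, K_OPERATOR } c_ast_kind_t;

typedef struct {
  c_ast_t      *ret_ast;          // return type
  c_ast_list_t  param_ast_list;   // parameters
} c_function_ast_t;

typedef struct {
  c_ast_t      *ret_ast;          // return type
  c_ast_list_t  param_ast_list;   // parameters
  char const   *operator_name;    // simplified operator info
} c_operator_ast_t;

struct c_ast {
  c_ast_kind_t kind;
  union {
    c_function_ast_t func;
    c_operator_ast_t oper;
  };
};

// Works for both kinds: func and oper begin with the same member types in
// the same order, so param_ast_list is at the same offset in each and may
// be read through func even when oper was the member stored.
size_t c_ast_param_count( c_ast_t const *ast ) {
  assert( ast->kind == K_FUNCTION || ast->kind == K_OPERATOR );
  return ast->func.param_ast_list.count;
}
```

C permits this kind of access for structs in a union that share a common initial sequence, which is what keeping the members in the same order provides.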

Safeguards

One problem with this approach is that, if you modify any of the structs, you might inadvertently change the offset of some member so that it is no longer at the same offset as the same member in another struct. One way to guard against this is via offsetof (from <stddef.h>) and static_assert (a macro in <assert.h> for _Static_assert until C23, where it became a keyword):

static_assert(
  offsetof( c_operator_ast_t, param_ast_list ) ==
  offsetof( c_function_ast_t, param_ast_list ),
  "offsetof param_ast_list in c_operator_ast_t & c_function_ast_t must equal"
);

static_assert(
  offsetof( c_csu_ast_t, csu_name ) ==
  offsetof( c_enum_ast_t, enum_name ),
  "offsetof csu_name != offsetof enum_name"
);

// More for other members ....

Now you’ll get a compile-time error if any of the offsets change inadvertently.

Conclusion

Take-aways for unions in C:

  • They can be used either for storing data for any one member at any one time or for type punning.
  • When type punning between very different types, the result is CPU-dependent (byte order and, for floating-point, the value representation).
  • They can be used to implement restricted class hierarchies.

You can also use unions in C++, but that’s a story for another time.
