Friday, February 14, 2025

Embracing Undefined Behaviour

Suppose you're a pretty seasoned C developer. You've read quite a lot of articles about Undefined Behaviour in C; perhaps you've even read the C standard text to clarify some obscure points.

You are always careful to avoid UB in your code, and you use all the relevant tools in your arsenal - ubsan, compiler warnings, static analyzers - just to be on the safe side. You are pretty sure you understand quite a lot of aspects of UB, while also knowing that in C it's extremely challenging to comprehend UB's full extent and effects, especially in the presence of compiler optimizations.

You are familiar with the adage about UB being used to enable compiler optimizations, and you've even seen countless examples in articles and blog posts - like this one, taken from Wikipedia:

int foo(unsigned char x)
{
    int value = 2147483600; /* assuming 32-bit int and 8-bit char */
    value += x;
    if (value < 2147483600)
        bar();
    return value;
}

Of course, you're not naive enough to write something like that. You've also come across more complex examples that demonstrate how compilers can misbehave due to accidental UB, so you understand the dangers well.

However, you recognize that many of these examples are specifically designed to frighten you with the dangers of UB - and for good reason: readers must learn to avoid it. This contrasts starkly with the equally well-advertised fact that the UB rules (which you must not break!) are what allow for important compiler optimizations.

You find it a bit difficult to wrap your head around what this means in practice.

One day, you're tasked with writing a function that starts from the unsigned value 0, adds 7 to it N times in a loop (where N is a small positive integer), and returns the unsigned result. Yes, it's a silly example—but bear with me.

You assign this to a junior colleague who doesn't have as much exposure to UB as you do. Accustomed to the common idiom of using an int as the induction variable in loops, he comes up with this solution:

unsigned int compute_something_1(int j) {
    unsigned int something = 0;
    for (int i = 0; i <= j; i++) {
        something += 7;
    }
    return something;
}

Being overly cautious about UB, you know that signed/unsigned comparisons are risky (though most compilers go to some pains to produce what the programmer actually had in mind in the well-formed case, because a sign mismatch in a comparison is one of the most common mistakes in C, especially in a loop's induction variable, as is the case here). You also know that signed integer overflow is UB.
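
As a quick refresher on that comparison pitfall (a minimal standalone illustration of my own, unrelated to the task at hand): when a signed and an unsigned operand meet in a comparison, the signed one is converted to unsigned, with surprising results.

#include <stdio.h>

int main(void) {
    /* -1 is converted to unsigned int (i.e. UINT_MAX) by the usual
       arithmetic conversions, so the comparison below is false. */
    if (-1 < 1u)
        printf("intuitive: -1 is less than 1u\n");
    else
        printf("surprise: -1 is NOT less than 1u\n");
    return 0;
}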

Wanting to avoid these pitfalls, you rewrite your colleague's code like this:

unsigned int compute_something_2(unsigned int j) {
    unsigned int something = 0;
    for (unsigned int i = 0; i <= j; i++) {
        something += 7;
    }
    return something;
}

Now, the signedness matches everywhere, including the induction variable. Everything seems fine.

Until your curious coworker fires up godbolt to check the assembly output of the two snippets under GCC -O2 (https://godbolt.org/z/8rEj1qhWq):

compute_something_1:
        add w1, w0, 1
        cmp w0, 0
        lsl w0, w1, 3
        sub w0, w0, w1
        csel w0, w0, wzr, ge
        ret

 

compute_something_2:
        mov w1, 0
        mov w2, 0
.L2:
        add w1, w1, 1
        add w2, w2, 7
        cmp w0, w1
        bcs .L2
        mov w0, w2
        ret

What happened here? 

Why did the compiler optimize the naive version by hoisting the calculation out of the loop (as you expected it would), yet leave your "carefully thought-out" implementation unoptimized?

(I use ARM64 code generation because the assembly is clearer - X86_64 lowers to an LEA instruction, which is an optimal but sometimes confusing way to combine an integer addition with a multiplication. Feel free to compare the X86_64 output and you'll see essentially the same effect.)

The reason this happens is incorrect communication of value bounds to the compiler.

In the first case, the compiler assumes that the signed induction variable i cannot overflow past INT_MAX (since signed overflow is UB) and optimizes accordingly - in this case, by hoisting the calculation outside the loop.
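
In other words, the compiler is free to treat compute_something_1 as if it had been written in closed form, roughly like the sketch below (this is just my C rendition of the ARM64 output above, not actual compiler output):

unsigned int compute_something_1_as_optimized(int j) {
    /* 8*(j+1) - (j+1) == 7*(j+1), which is what the lsl/sub pair computes */
    unsigned int runs = (unsigned int)j + 1;   /* add w1, w0, 1 */
    unsigned int result = 8u * runs - runs;    /* lsl + sub */
    return (j >= 0) ? result : 0u;             /* csel ..., ge */
}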

In our "corrected" case however, the compiler actually loses information: Unsigned overflow is NOT Undefined Behaviour - it's well specified to wrap-around to 0. The compiler cannot prove that the loop terminates (and indeed it does not terminate for the case j == MAX_UINT), and leaves the loop unoptimized to cater for the case we indeed did want an infinite loop.

The point I want to make here is: our well-intentioned effort to eliminate Undefined Behaviour made the compiler unable to optimize our code!

Note that the two code snippets are not functionally identical - there are values of j for which they produce different results. This is, "of course", our own mistake - we didn't provide value bounds to the compiler. Remember that we assumed "N is a small positive integer"? We assumed that, but the compiler cannot.
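
To make the mismatch concrete, here is a small hypothetical driver of my own (assuming the two functions above are in scope), showing one such value:

#include <stdio.h>

int main(void) {
    /* The signed version treats a negative argument as "run zero times". */
    printf("%u\n", compute_something_1(-1));   /* prints 0 */

    /* Passing -1 to the unsigned version converts it to UINT_MAX, so the
       condition i <= j is always true and the call (practically) never
       returns - which is why it stays commented out here. */
    /* printf("%u\n", compute_something_2(-1)); */
    return 0;
}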

This mistake - not communicating the bounds of a value to the compiler - is at least as common as the first one, mixing signed and unsigned in a comparison. Although I have no direct evidence, I'd dare to say it's even more common than that.

There are several solutions to this specific problem, and all of them are somewhat unsatisfactory: 

Assume that the upper bound of N is 256.

The obvious solution - you can check parameter bounds explicitly inside the function.

#define N 256

unsigned int compute_something_2_bounds_checked(unsigned int j) {
    if (j >= N) {
        return 0;
    }
    unsigned int something = 0;
    for (unsigned int i = 0; i <= j; i++) {
        something += 7;
    }
    return something;
}
This works, and the compiler does perform the hoisting optimization:

compute_something_2_bounds_checked:
        add w1, w0, 1
        cmp w0, 256
        lsl w0, w1, 3
        sub w0, w0, w1
        csel w0, w0, wzr, cc
        ret

The reason this is unsatisfactory is that we have to decide what to do in the out-of-bounds case, and check for it every time we use N. This is in fact one of the counter-arguments to automatic bounds checking - not that the bounds check is by itself expensive (most modern branch predictors are smart enough to make it essentially costless), but the mere fact that you have to insert explicit bounds checks everywhere complicates the control flow graph to the degree that other optimizations are abandoned as undecidable! (Notwithstanding the fact that this does not happen in the quite simple example presented here.)

Nor can we rely on the value range propagation performed by the compiler - it does its best, but it cannot always succeed. For example, compute_something() might be compiled into a library, where the values the callers will pass are simply not visible.
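
To illustrate (a hypothetical caller of my own, under the assumption that both functions end up in the same translation unit): if the compiler can see the definition, it may inline compute_something_2 and propagate the fact that j stays below 256 into the inlined loop; if the definition sits in a separately compiled library, no such propagation is possible.

unsigned int sum_small_inputs(void) {
    unsigned int total = 0;
    /* j's range is visibly [0, 255] at every call below */
    for (unsigned int j = 0; j < 256; j++)
        total += compute_something_2(j);
    return total;
}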

A very common suggestion is to check your invariants using assert(). This communicates the relevant bounds to the compiler, and has the side effect of enabling optimizations (a fact that is not widely known!).

This works...

#include <assert.h>
#define N 256

unsigned int compute_something_2_assertion(unsigned int j) {
    assert(j < N);
    unsigned int something = 0;
    for (unsigned int i = 0; i <= j; i++) {
        something += 7;
    }
    return something;
}

...but...

.LC0:
        .string "/app/example.c"
.LC1:
        .string "j < N"
compute_something_2_assertion:
        cmp w0, 255
        bhi .L7
        add w0, w0, 1
        lsl w1, w0, 3
        sub w0, w1, w0
        ret
.L7:
        stp x29, x30, [sp, -16]!
        adrp x3, .LANCHOR0
        adrp x1, .LC0
        mov x29, sp
        adrp x0, .LC1
        add x3, x3, :lo12:.LANCHOR0
        add x1, x1, :lo12:.LC0
        add x0, x0, :lo12:.LC1
        mov w2, 5
        bl __assert_fail
        .set .LANCHOR0,. + 0
__PRETTY_FUNCTION__.0:
.string "compute_something_2_bounds_checked"

...it inserts all the overhead of checking for the bound, PLUS the additional overhead of aborting the program if the value is out of bounds.

Another issue with this approach is that it's not unusual for production code to be compiled with NDEBUG defined. This eliminates the entire assertion, together with the optimization that it provides:

#define NDEBUG
#include <assert.h>
#define N 256

unsigned int compute_something_2_assertion_off(unsigned int j) {
    assert(j < N);
    unsigned int something = 0;
    for (unsigned int i = 0; i <= j; i++) {
        something += 7;
    }
    return something;
}

We're back to square one - and worse: an optimization turned off in production builds, which is exactly where we need it most:

compute_something_2_assertion_off:
        mov w1, 0
        mov w2, 0
.L2:
        add w1, w1, 1
        add w2, w2, 7
        cmp w0, w1
        bcs .L2
        mov w0, w2
        ret

Another solution is to use a parameter type whose range is a strict subset of the induction variable's range, so that the loop bound can never get anywhere near the point of wrap-around:

unsigned int compute_something_2_implicit_domain(unsigned char j) {
    unsigned int something = 0;
    for (unsigned int i = 0; i <= j; i++) {
        something += 7;
    }
    return something;
}


compute_something_2_implicit_domain:
        and w0, w0, 255
        add w0, w0, 1
        lsl w1, w0, 3
        sub w0, w1, w0
        ret

This one looks clever, but it's not only more contrived and largely unknown, it also does not communicate our intention to the reader of the code. What happens if the bound changes? And, of course, there's the problem that the bound might not fit a convenient narrower type (e.g. it's 300 instead of 256), in which case we're back to the explicit-checking solutions.

There is one more way to communicate bounds to the compiler exactly the way we want. Unfortunately, it entails the use of a compiler intrinsic, plus a macro if you care about readability (and you should):

#define ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
#define N 256

unsigned int compute_something_2_assume(unsigned int j) {
    ASSUME(j < N);
    unsigned int something = 0;
    for (unsigned int i = 0; i <= j; i++) {
        something += 7;
    }
    return something;
}

This works wonderfully, assuming (pun intended) you do it. It's definitely not a common idiom.

compute_something_2_assume:
        add w0, w0, 1
        lsl w1, w0, 3
        sub w0, w1, w0
        ret
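
As an aside: the intrinsic is compiler-specific, so if portability matters the macro usually grows a few branches. A rough sketch of my own (not exhaustively tested across the listed compilers) might look like this:

/* __assume() is MSVC, __builtin_assume() is Clang, and
   __builtin_unreachable() works on GCC and Clang; otherwise fall back
   to a no-op that at least keeps the code compiling. */
#if defined(_MSC_VER)
#  define ASSUME(cond) __assume(cond)
#elif defined(__clang__)
#  define ASSUME(cond) __builtin_assume(cond)
#elif defined(__GNUC__)
#  define ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
#else
#  define ASSUME(cond) ((void)0)
#endif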

But the biggest problem with all these solutions is that they are a big digression from the central issue I want to present - they apply only to this specific instance of the problem (and not perfectly - I didn't take the time to check the "solutions" for exact equivalence).

This is because there are other kinds of UB beyond the signed vs. unsigned overflow example shown here, and all of them can be used to enable optimizations, sometimes producing counterintuitive results exactly when we are being explicitly careful to avoid invoking UB!

A particularly thorny example is pointer aliasing: the mental model of keeping track of exactly what might alias, and when, can get so burdensome that the best solution you often have is to use -fno-strict-aliasing - which is said to also turn off several optimizations. Though I have yet to observe this personally, I have no reason to believe it doesn't happen (I once intentionally compiled a large proprietary codebase with and without -fno-strict-aliasing just to compare the differences in the assembly output - and, curiously, there were none - but I have no doubt that I was just being lucky).
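
For a taste of what those aliasing-based optimizations look like, here is a minimal sketch of my own (not taken from that codebase): under strict aliasing the compiler may assume that an int object and a float object never overlap, so it can keep the loaded value in a register across the store.

int strict_aliasing_demo(int *i, float *f) {
    int a = *i;
    *f = 1.0f;      /* under strict aliasing, this cannot modify *i */
    return a + *i;  /* so the compiler may return 2*a without reloading *i */
}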

The takeaway lesson is that merely avoiding UB is not enough.

We must avoid UB (something that is already really hard), *while at the same time knowing what the side-effects of avoiding UB are*, in order not to fall into the trap of unintentionally under-specifying something that the possibility of UB would otherwise have specified implicitly - even if we never actually invoke UB.

Looks like we have to *embrace* Undefined Behaviour rather than just knowing how to avoid it.

John Regehr, who has written a lot about UB in C/C++ and Rust, believes that "Undefined Behaviour" may itself be a bad term. Perhaps we should call it something else - but what?

 

If you are interested in a more thorough discussion of UB, I suggest you read the many relevant articles and blog entries by John Regehr and Chris Lattner. This blog post was inspired by Chris Lattner's series about Undefined Behaviour, and particularly the section "Advantages of Undefined Behavior in C, with Examples" in the first article of the series - I wanted to present a concrete case of an optimization enabled by the assumed absence of UB in a program, and how that benefit can be lost by unintentionally under-specifying assumed behaviour.