While this has a negative performance impact on x86_64, it has a
positive performance impact on smaller machines, which is where we're
actually using this code. For example, on a Cortex-A53:
  Before: fiat32: 228605 cycles per call
  After:  fiat32: 188307 cycles per call

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>