32-bit increment on a 16-bit system [closed]
Which of the 2 code parts will execute faster on a 16-bit machine?
1:
uint16_t inc = 1;
uint32_t sum = 0;
inc++; // 16 bit increment
sum += (uint32_t)inc; // inc has to be cast
2:
uint32_t inc = 1;
uint32_t sum = 0;
inc++; // 32 bit increment
sum += inc; // no cast for inc
Solution 1:[1]
That depends a lot on the particular 16 Bit machine, as well as the compiler's optimizer.
To simulate the behavior that really matters in a program, I have put the code inside a function operating on global variables.
Variant 1 (uint16_t inc):
#include <stdint.h>
uint16_t inc = 1;
uint32_t sum = 0;
void f() {
inc++;
sum += inc;
}
Variant 2 (uint32_t inc):
#include <stdint.h>
uint32_t inc = 1;
uint32_t sum = 0;
void f() {
inc++;
sum += inc;
}
With the variables declared locally and nothing else in the function, the code has no observable behavior and could be optimized away entirely. So, to reproduce this example, it is important to declare the variables globally (or to otherwise produce an observable side effect on both inc and sum).
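As a side note, another way to keep the compiler from discarding the code is volatile. This is not what the answer uses (volatile forces extra stack traffic, so it does not simulate globals faithfully), but a minimal sketch looks like this:

```c
#include <stdint.h>

uint32_t g(void) {
    volatile uint16_t inc = 1; /* every access to inc must be emitted */
    uint32_t sum = 0;
    inc++;                     /* 16-bit increment cannot be folded away */
    sum += (uint32_t)inc;      /* the volatile read must also be emitted */
    return sum;
}
```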
Motorola 68000
Let's take the Motorola 68000 as an example.
On a Motorola 68000 (not even a 68020), which is a 16/32 Bit machine, the above code with global variables compiles to the following disassembly (produced with m68k-linux-gnu-gcc-12 -O2):
Variant 1 (uint16_t inc):
00000000 <f>:
0: 3039 0000 0000 movew 0 <f>,%d0
6: 5240 addqw #1,%d0
8: 33c0 0000 0000 movew %d0,0 <f>
e: 0280 0000 ffff andil #65535,%d0
14: d1b9 0000 0000 addl %d0,0 <f>
1a: 4e75 rts
Variant 2 (uint32_t inc):
00000000 <f>:
0: 2039 0000 0000 movel 0 <f>,%d0
6: 5280 addql #1,%d0
8: 23c0 0000 0000 movel %d0,0 <f>
e: d1b9 0000 0000 addl %d0,0 <f>
14: 4e75 rts
Note the additional andil #65535,%d0 instruction in variant 1, which performs the upcast of d0 before the addition. However, we also need to take the load and store cycles into account: movew is faster than movel. So, without doing an exact cycle count from the 68000 user manual, we can say that on the 68000 the uint32_t variant of this code is faster: it saves 3 bus cycles (no andil #65535,%d0 to fetch and execute) at the cost of 2 extra bus cycles (movel instead of movew from/to memory, twice), for a net saving of 1 bus cycle.
However, that is 1 CPU architecture/family plus compiler, and one situation of types, a totally insufficient sample size for extrapolating a generic statement.
For example, the picture already changes when the types are switched from unsigned to signed: the compiler/optimizer can then replace andil #65535,%d0 with a cheaper extl %d0 sign extension.
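For reference, the signed counterpart of variant 1 can be written as below; whether the compiler actually emits extl instead of the andil mask depends on the compiler and flags (a sketch, not verified against m68k-linux-gnu-gcc-12):

```c
#include <stdint.h>

int16_t inc = 1;   /* signed instead of unsigned */
int32_t sum = 0;

void f(void) {
    inc++;
    sum += inc;    /* signed widening: a sign-extend (extl) may replace the andil mask */
}
```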
Intel 8086
On an Intel 8086, using Bruce's C Compiler (bcc), the picture is very different.
The disassembly, produced with bcc -O -0 -S, is as follows.
Variant 1 (uint16_t inc):
.text
export _f
_f:
push bp
mov bp,sp
inc word ptr [_inc]
mov ax,[_inc]
mov bx,dx
mov di,#_sum
call laddul
mov [_sum],ax
mov [_sum+2],bx
pop bp
ret
Variant 2 (uint32_t inc):
.text
export _f
_f:
push bp
mov bp,sp
mov ax,[_inc]
mov si,[_inc+2]
mov bx,#_inc
call lincl
mov ax,[_sum]
mov bx,[_sum+2]
mov di,#_inc
call laddul
mov [_sum],ax
mov [_sum+2],bx
pop bp
ret
We can see that on the Intel 8086, another 16-Bit CPU, there is a huge difference between using uint16_t and uint32_t for inc. On the 8086, uint16_t can safely be assumed to be much faster than uint32_t: the 32-Bit variant needs significantly more instructions and even calls library subroutines (lincl, laddul) to perform the 32-Bit increment and addition.
Another Caveat
The above analysis is based on instruction count and bus fetch count; additional cycles spent inside the CPU are not taken into account. In a real-world scenario those matter too, so a profiler should be used.
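Lacking a profiler, a crude wall-clock measurement can at least rank the two variants on the target machine; clock() resolution and optimizer effects make this only a rough indicator. The loop counts and function names below are my own, not from the answer:

```c
#include <stdint.h>
#include <time.h>

uint16_t inc16 = 1;
uint32_t inc32 = 1;
uint32_t sum = 0;

/* Time one million iterations of the uint16_t variant. */
long bench16(void) {
    clock_t t0 = clock();
    for (long i = 0; i < 1000000L; i++) {
        inc16++;
        sum += (uint32_t)inc16;
    }
    return (long)(clock() - t0);
}

/* Time one million iterations of the uint32_t variant. */
long bench32(void) {
    clock_t t0 = clock();
    for (long i = 0; i < 1000000L; i++) {
        inc32++;
        sum += inc32;
    }
    return (long)(clock() - t0);
}
```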
Conclusion
This is very much CPU-dependent; a generic statement about all 16-Bit platforms is not possible. It largely depends on how well the 16-Bit platform in question deals with 32-Bit operands. 16-Bit CPUs with strong support for 32-Bit operands, like the Motorola 68000, will perform slightly better with uint32_t than with uint16_t in this example, whereas 16-Bit CPUs with no support for 32-Bit operands at all, like the 8086, perform very poorly in the uint32_t case and much better in the uint16_t case.
To some extent it also depends on the compiler, although we may assume in good faith that an optimization level like GCC's -O2 produces close-to-optimal machine code, at least for small functions like these.
Overall, I would first optimize for clarity and declare data in the size that it has "by nature". Second, I would optimize for convenience and make sure that data that is passed around is passed around in a way convenient for developers. Third, I would profile if there are any performance or footprint issues, and if so, where they are, and only then optimize code, with the help of a profiler and the disassembly.
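One standard middle ground, not mentioned above, is to let the implementation choose: <stdint.h> defines uint_fast16_t as the fastest unsigned type with at least 16 bits, so the same source adapts to whatever is cheapest on each platform. A minimal sketch:

```c
#include <stdint.h>

/* uint_fast16_t: "at least 16 bits, whichever width is fastest here";
   the implementation may pick a 16-bit or a 32-bit type per platform. */
uint_fast16_t inc = 1;
uint32_t sum = 0;

void f(void) {
    inc++;
    sum += (uint32_t)inc;
}
```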
Solution 2:[2]
There is no single answer.
Normally, 16-bit arithmetic is faster on a 16-bit machine, but it depends on the compiler, the optimization level, and the architecture.
You have to at least compile it and inspect the assembly.
In this particular case/example you may end up with a single mov instruction in both cases, because the result can be computed at compile time.
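A sketch of what that means: with the variables local and all values known at compile time, an optimizer can constant-fold the entire computation, so the choice of type no longer matters at all:

```c
#include <stdint.h>

uint32_t g(void) {
    uint16_t inc = 1;
    uint32_t sum = 0;
    inc++;                /* value is known at compile time: inc == 2 */
    sum += (uint32_t)inc;
    return sum;           /* an optimizer can fold the whole body to: return 2 */
}
```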
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
