'Efficient uint32_t division on 8bit CPU

I'm trying to improve a synchronization algorithm on an ATmega1284P. Whenever a synchronization event is triggered by some external hardware source, I need to measure the total cycles of an internal counter that have passed since the last sync event, divide that value by four and set this as the new counter overflow value that will trigger the corresponding counter IRQ. Thus I sync my counter to tick four times as fast as the external event source I'm syncing to.

In pseudocode, calculating the counter overflow value when the external event is received looks like this:

totalCounterCyclesSinceLastSync = counterCyclesSinceLastTick + counterCyclesPerTick * counterTicksSinceLastSync;
counterCyclesPerTick = totalCounterCyclesSinceLastSync / 4;
writeToRegister(counterCyclesPerTick);
counterTicksSinceLastSync = 0

While the CPU can't handle atomic arithmetic operations on more than 8 bits, the counter register that stores the overflow value is already 16 bits wide. This means that the variables in the pseudocode above need to be of uint32_t. As far as I know the ATmegaX series don't even have instructions for division of uint8_t, let alone variables four times the size. But since the application is time critical, I can't afford to waste hundreds or thousands of CPU cycles for some elaborate library calls to do this division.

So, here is my question: Do any of you know if I can count on elaborate compiler optimizations to translate this in the most efficient way? Or can you think of an (existing) algorithm for such an application?

Some background on how the counter works

The counter is a hardware counter integrated into the ATmega1284P chip, with it's counting frequency locked to the CPU speed with a fixed divider (e.g. 8). With every counter cycle, the counter value will be incremented by 1. When the counter "overflows", that is when it reaches counterCyclesPerCounterTick, the counter will tick, that is trigger the corresponding IRQ, where I can execute a short routine. Also, the counter value will be reset to 0 by the CPU. In this event, after performing my IRQ routine, I increment the value of counterCyclesPassedSinceLastCounterTick so that even after some counter ticks have passed (resulting in resets of the counter value) I know how much total counter cycles have passed.

If you need more info, please drop your question in the comments. I'm kind of a newbie in this field of low level programming, so I'm not sure if this is enough info to get a grasp of the problem.



Solution 1:[1]

As Lundin said just divide counterCyclesPerTick by 4. If that introduces too much error you can calculate the error as uint8_t and one to counterCyclesPerTick every time the errors is greater than 4. But overall I think more time will be spend on the multiplication than the division by 4 unless counterCyclesPerTick is a power of 2 or some other happy constant that optimizes really well.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Goswin von Brederlow