Nice little trick I found in your link: A poor man's sleep routine :) for (uint3...

pizza234 · on Sept 30, 2020

For busy waiting in general, AFAIK one should signal the intention to the compiler, via intrinsics (`_mm_pause` in C, `std::sync::atomic::spin_loop_hint` in Rust, and so on). The compiler and processor will attempt to make the best out of the pause, e.g. send the processor to a lower-consumption state.

It may be overkill for such cases, but if one keeps the strategy as reference, it's best to know the optimal one :-)

I even think it's not overkill for this case, as they're actually working around the compiler, and a clean approach is preferrable regardless.

simias · on Sept 30, 2020

>e.g. send the processor to a lower-consumption state.

You have to be very careful in a bootloader, since you're initializing the clocks and other very low level subsystems you don't want to risk putting the CPU in a state you won't be able to get out of.

In general the standard method is to use a dummy busy delay like the "nop loop" shown above in the very early stages until you get to the point where you can enable interrupts and standby states and implement a proper sleep method that will effectively shut the CPU down waiting for an external event.

If you're talking about userland code (or even driver code) using busyloops you're right though.

saagarjha · on Sept 30, 2020

Isn’t _mm_pause an Intel intrinsic?

pizza234 · on Sept 30, 2020

Based on what I see on [Stack Overflow](https://stackoverflow.com/a/51908785), it's recognized by all the major compilers

The [Microsoft Compiler help](https://docs.microsoft.com/en-us/cpp/intrinsics/x86-intrinsi...) includes it in the "x86 intrinsics list".

I've just compiled with success, for testing purposes, on an AMD processor.

C, relevant section:

    #include <immintrin.h>

    void test() {
        _mm_pause();
    }

Output ASM, relevant section:

    test:
    .LFB4006:
      .cfi_startproc
      endbr64
      pushq %rbp
      .cfi_def_cfa_offset 16
      .cfi_offset 6, -16
      movq  %rsp, %rbp
      .cfi_def_cfa_register 6
      rep nop
      nop
      nop
      popq  %rbp
      .cfi_def_cfa 7, 8
      ret
      .cfi_endproc
    .LFE4006:
      .size test, .-test
      .globl  main
      .type main, @function

It should be the `rep nop`.

Including `ammintrin.h` yields the same `rep nop`.

saagarjha · on Sept 30, 2020

By “Intel” I really meant “x86” :P Is there an ARM compiler that recognizes it?

secondcoming · on Sept 30, 2020

Unlikely, intrinsics are CPU, and sometimes CPU model, specific. Compiler builtins will instead use the intrinsic if it's available/suitable or generate equivalent code if it's not.

pizza234 · on Sept 30, 2020

Ah! Sorry for that; I have no idea.

It seems that at least some ARMs have [timers that can be used](https://electronics.stackexchange.com/q/450971), which is a different strategy, but still more appropriate (I suppose).

kbumsik · on Sept 30, 2020

To be fair, this is not a "sleep" routine, but a delay routine. The purpose of the function is to give a slight delay until other part of the hardware is ready. This is actually the most deterministic (so it means great) way to do it.

More sophisticated does not always means better.

simias · on Sept 30, 2020

These busy loops are not super deterministic in practice unless they're calibrated beforehand or if you know very well 1/ your CPU pipeline 2/ the frequency of your clock source 3/ the frequency of your CPU PLL (which can be dynamic on modern CPUs depending on temperature and other factors) and a few other things like whether the caches and branch prediction are enabled etc...

If you look at the Linux kernel code dealing with calibrating delay loops you can see that it's far from trivial in the general case (and not necessarily very accurate): https://elixir.bootlin.com/linux/latest/source/init/calibrat...

If you want a precise and deterministic delay you almost always want a hardware timer. The problem of course is that in these early boot stages you can't always use one easily, so a rough busy delay approximation does the job just fine.

kbumsik · on Sept 30, 2020

The OP delay loop is for MCU/Arduino where embedded devs typically have all control of the PLLs and know exact CPU timings. You can see that the other part of the bootloader example fixes all related clocks manually. [1]

These MCUs are designed for hard realtime systems so the silicon design is required to have all timings well-known in the datasheets and required to be deterministic. The core (Cortex-M0) design removes nondeterministic aspects of the CPU so there are neither caches nor branch predictions in it. So the busy loops have deterministic timings.

Of course it won't be trivial in MPU/Linux spaces as you mention.

Lots of programming practices are quite different between MPU and MCU worlds. I personally find it interesting that many people are talking about MPU/Linux/x86 here. It's good to learn about them :)

[1] https://github.com/arduino/ArduinoCore-samd/blob/master/boot...

friendzis · on Sept 30, 2020

Poor man's sleeps (various NOP loops) are just that. In all cases you should try and utilize on-chip sleep instruments, timers most probably.

On a Cortex-sized MCU that is most probably as simple as installing appropriate interrupt handler and launching the timer by writing appropriate data to some register(s).

yitchelle · on Sept 30, 2020

I have to support this advice. Poor man's sleep is not really sleep as it is still executing NOPs. The on-chip sleep mechanism puts various parts of the silicon into an off-state. The target of sleep is to reduce current consumption when the hardware is idle.

stjohnswarts · on Oct 1, 2020

Sleep is a generic term. In this context of an arduino the sleep term is perfectly applicable. On a bigger processor with pipelining/memorycontroller/tlbs I'd agree with you.

kbumsik · on Sept 30, 2020

But the example here is to simply delay, not to sleep for power consumption.

yitchelle · on Sept 30, 2020

It might be fine if it is just for delay, then you run into problem of portability to the system with different instruction cycle times.

Not saying it is bad, but technical debt is high from several angles.

Someone · on Sept 30, 2020

But a delay that uses less power still is preferred. Busy-looping should be avoided whenever possible (I wouldn’t know whether that’s the case here)

kbumsik · on Sept 30, 2020

The delay is just 10ms and this is a bootloader code that runs only once at bootup. As a bootloader for an MCU, a less error-prone way with less code size is more preferrable.

friendzis · on Oct 1, 2020

I can agree that justification to use nop loops somewhere along the lines of "core starts executing code when, judging by real-world testing, hardware peripherals do not guarantee stable state. Thorough testing suggests that 8ms+safety margin delay mitigates issues related to hardware readiness. 2500 cycle nop-loop guarantees required 10ms delay on fastest clock speeds and 50ms delay on slowest core speeds without causing observable issues. 50ms is hereby deemed acceptable." would acceptable in most commercial design documents.

rhn_mk1 · on Sept 30, 2020

> In all cases

Aren't you overgeneralizing a little? There is nothing wrong with using loops when execution time is known and constant (like an Arduino). It saves the effort of setting up interrupts, and they can be used with interrupts disabled.

They won't be so accurate: time spent in interrupts will not be counted, and loop setup/teardown won't as well, but it's rare to need such precision over short scales, e.g. when delaying a LED blink.

Never use those when execution time is not guaranteed though, like in emulation or binaries executing on different MCUs.

friendzis · on Oct 1, 2020

> > In all cases you should try

Yes, in all cases you should try to avoid nop loops, similarly like in high level applications you should try and avoid blocking IO in favor of non-blocking IO. And it does not only mean "avoid nop-loops in code". It also means design your thing in a way where random waits (nop loops are random waits) are unnecessary in the first place. Wait on readiness flag, use timers. Sometimes you have shitty hardware that either does not indicate readiness or indicates readiness falsely, happens. There definitely are situations where nop loops are the only sane available solution.

> using loops when execution time is known and constant

Um... Cortex and larger cores normally can run at multiple clock speeds (e.g. via PLL or clock multiplier config), so this speed is constant only as long as you do not change core clock configuration. Furthermore, if the core in question is even more advanced and employs dynamic PLL, nop-loop execution time is not constant.

> (like an Arduino)

Yes, if you run on small cores like ATmega this can be more or less true.

> It saves the effort of setting up interrupts, and they can be used with interrupts disabled.

Sure and that is one of the main advantages of nop-loops. However, unless you have all "userland" interrupts disabled (all except fault handlers), you run inherently concurrent code which by definition cannot have constant real world run time. To be fair, in the context of a bootloader application flow can be controlled and you can avoid most concurrency issues.

> Never use those when execution time is not guaranteed though

This is the core issue. Execution time is most probably not guaranteed. Nop loops can work during development, hobby project or when you need just a tiny little bit of delay and hardware sync is unavailable.

kbumsik · on Sept 30, 2020

I agree. Using timer is not as deterministic as precisely-calculated NOP loop.

And why even bother setting timer peripherals if it is to delay for 10ms once in the entire program?