To OS or not to OS

On an embedded system, should you have an OS or run on the metal?

This one will run and run, so I’ll call this Part 1 for now.

I’ve spent the majority of my professional career railing against the use of OSes for Embedded Systems. A few years ago I analysed my reasoning behind that and realised a lot of it was founded in the arrogant belief that no-one could write code that was as well optimised as mine. That may or may (more probably) not be true, but there are plenty of other reasons for seriously considering a lightweight OS for your next project.

Following that little thought investigation, I now invert the discussion, and start off with “Why wouldn’t I run an OS underneath this?”. The fact is, on day one, every project starts off small, manageable and with a simple set of needs…you don’t need an OS in that environment, plain and simple. But, as your project grows, you need to do more and more things and, without an OS, you’ll find yourself re-inventing stuff that you get for free in OSville. Ah, you say, but I already have a library for timers, and message passing, and task switching, and queues….congratulations, we call that an OS, it’s just that you didn’t.

There are legitimate reasons for going OS-commando, the main one being that you’re really short of memory (RAM or Flash). Like it or not, an OS is going to gobble some of it up (a mutex semaphore in FreeRTOS on Cortex-M takes 80 bytes, and that hurts when you’ve only got 4096 of ’em around), so defining your own can really help…but be careful: allocating one bit in the bit-addressable RAM area just saved you 79 7/8 bytes of memory, but that isn’t the end of the story; you’ve still got the care and feeding of that structure to deal with. It’s surprising just how much Flash memory, in comparative terms, that care and feeding can take, and not too many people would claim that FreeRTOS is the most super-efficient RTOS in its RAM allocation.

Similar considerations apply on the Flash side. A reasonably complete FreeRTOS implementation on an STM32F103 in release configuration is about 6K…you can come down a ways from there if you start chopping out options, but the total spend will still be a four digit number, and that’s a fair proportion of a budget that might only be 16K or 32K.
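To put those numbers side by side, here is a minimal sketch of the budget arithmetic. The helper name is made up for illustration, and "about 6K" is assumed to mean 6144 bytes; the other figures are the ones quoted above.

```c
#include <stdint.h>

/* Hypothetical helper: share of a memory budget taken by one item,
 * in hundredths of a percent, using integer maths only. */
static uint32_t budget_share_centipercent(uint32_t used_bytes,
                                          uint32_t total_bytes)
{
    /* Widen to 64 bits so the multiply cannot overflow. */
    return (uint32_t)(((uint64_t)used_bytes * 10000u) / total_bytes);
}
```

With these assumptions, one 80 byte mutex in 4096 bytes of RAM comes out at 195 (just under 2%), while 6144 bytes of Flash is 3750 (37.5%) of a 16K part and 1875 (18.75%) of a 32K part.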

One thing that an OS doesn’t have to do, though, is slow you down, and that’s the main criticism I hear (and, indeed, was one of my primary prejudices). The fact is that most of the time, for most of your code, the 1-2% overhead the OS brings along for the ride really doesn’t matter. It does matter when you’ve got a time critical task to handle, and that’s often (mostly?) done in interrupt code, so how fast an RTOS handles interrupts is much more important than how it handles base code.

Most Real Time OSes offer ‘Zero latency interrupts’, or some equivalent term. All that really means is that the OS doesn’t pre-service the interrupt for you; it doesn’t grab the interrupt and perform the initial handling of it before passing it off to your code. That does happen in desktop OSes, and you’ll hear the terms ‘top half handler’ and ‘bottom half handler’ used to reflect this split between OS-controlled and Application-controlled code.

With a Zero Latency interrupt, your response time would be exactly the same as in the OS free case, because the interrupt lands in your code just like it did in the OS-free case. Indeed, response time could even be better. How? Well, let’s look at a lazy-assed implementation of an OS free app (one of mine, so I can criticise…it’s available here if you want a laugh). In this app communication is arranged through simple flags…you set a flag in one place, and that triggers a task in another. The code to set a flag looks like this;

void flag_post(uint32_t flag_to_set)

… and the denter / dleave routines;

void denter_critical(void)

void dleave_critical(void)
    if (!--_critDepth)
}

As you can see, all interrupts are turned off while we go fiddle with flags and critDepth…and during that time the CPU is away with the fairies and isn’t going to respond to any other maskable interrupt, no matter how much it yells. That will show itself up as jitter in interrupt response time (there is another reason for jitter, we’ll come back to that later).
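For anyone who wants to play with the idea off-target, here is a minimal sketch of a nested global-interrupt-disable scheme of this kind. It is a reconstruction under assumptions, not the app's actual code: `pending_flags`, `crit_depth`, `irqs_enabled` and the two `fake_*` routines are names invented here, and the `fake_*` routines merely stand in for the CMSIS `__disable_irq()` / `__enable_irq()` intrinsics so that it builds on a desktop compiler.

```c
#include <stdint.h>

/* Host-side stand-ins (assumed) for __disable_irq()/__enable_irq(). */
static volatile int irqs_enabled = 1;
static void fake_disable_irq(void) { irqs_enabled = 0; }
static void fake_enable_irq(void)  { irqs_enabled = 1; }

static volatile uint32_t pending_flags; /* hypothetical flag word */
static volatile uint32_t crit_depth;    /* plays the role of _critDepth */

void denter_critical(void)
{
    fake_disable_irq();  /* mask first, then count the nesting level */
    crit_depth++;
}

void dleave_critical(void)
{
    if (!--crit_depth)   /* only the outermost leave unmasks again */
        fake_enable_irq();
}

void flag_post(uint32_t flag_to_set)
{
    denter_critical();
    pending_flags |= flag_to_set; /* read-modify-write, hence the guard */
    dleave_critical();
}
```

The depth counter is what lets critical sections nest: an inner enter/leave pair leaves interrupts masked, and only the last leave turns them back on. Everything between the first enter and that last leave is time during which no maskable interrupt gets a look in.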

So, how on earth could an RTOS be faster? Let’s consider the equivalent criticality setting in FreeRTOS for an M3 CPU (you’ll find this in portmacro.h, and I’ve hacked the formatting around a bit);

portFORCE_INLINE static void vPortRaiseBASEPRI( void )
{
    uint32_t ulNewBASEPRI;

    __asm volatile
    (
        "mov %0, %1\n" \
        "msr basepri, %0\n" \
        "isb\n" \
        "dsb\n" \
        :"=r" (ulNewBASEPRI) : "i" ( configMAX_SYSCALL_INTERRUPT_PRIORITY )
    );
}

…not a __disable_irq in sight! What FreeRTOS does is temporarily raise the minimum priority an interrupt needs in order to be recognised by the CPU. That has exactly the same effect as __disable_irq for any interrupt whose priority is at or below whatever is selected for configMAX_SYSCALL_INTERRUPT_PRIORITY, but leaves higher priority interrupts enabled. So, if I really need that fast response, I just give the interrupt a super-high priority and it will get serviced sharpish…the only constraint being that I cannot use OS services within that interrupt.
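The masking rule itself is simple enough to model on a desktop. The sketch below is only that, a model: `irq_is_masked` is a made-up helper, not a FreeRTOS or CMSIS call. It assumes the usual Cortex-M convention that a lower numeric value means a higher priority, and that writing zero to BASEPRI switches the masking off.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of BASEPRI masking on a Cortex-M3 (hypothetical helper).
 * Lower numeric value = higher priority. BASEPRI == 0 means no masking;
 * otherwise interrupts whose priority value is >= BASEPRI are held off. */
static bool irq_is_masked(uint8_t basepri, uint8_t irq_priority)
{
    return basepri != 0u && irq_priority >= basepri;
}
```

So with BASEPRI raised to, say, 0x50 (an illustrative value, not one taken from any real configuration), an interrupt at 0x40 still gets through, while ones at 0x50 or 0x60 have to wait until BASEPRI is lowered again.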

End result; I’ve got the option of slightly jittery interrupts and OS support, or interrupts faster than the native case, but if I want to use OS features in conjunction with them then I have to jump through some more hoops. Of course you could do the BASEPRI trick in your own code, but someone has already written, tested, debugged and documented it for you, so why bother?

Finally, remember I said that there was another source of jitter? Well, taking an M3 as an example; it should theoretically be able to always respond to an interrupt within 12 clock cycles, but other factors (bus latencies, peripheral response times, flash caches and speeds etc.) may conspire to prevent that…so you get response jitter. In real world applications it is often more important to be slower and jitter free than to be faster and a bit wobbly, so several manufacturers have added the capability to ‘stretch’ the number of cycles to respond to an interrupt, so it’s always the same. On the NXP LPC134x CPUs, that register is called IRQLATENCY, and has a default value of 0x10, meaning that, in general, the CPU will hit your code in response to an unmasked, highest priority interrupt request 16 clock cycles after the request is generated…whether that is enough delay to remove jitter depends on exactly how you’ve got the whole system configured, so you can put a longer value in that register if you need it.
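To get a feel for what a cycle count like that means in wall-clock terms, here is a small conversion sketch. The function name and the clock frequencies used below are assumptions for illustration; plug in whatever your own clock tree actually delivers.

```c
#include <stdint.h>

/* Hypothetical helper: convert a cycle count to nanoseconds (rounded down)
 * for a given core clock. 64-bit intermediate avoids overflow. */
static uint32_t cycles_to_ns(uint32_t cycles, uint32_t core_clock_hz)
{
    return (uint32_t)(((uint64_t)cycles * 1000000000ull) / core_clock_hz);
}
```

With the default of 0x10 (16 cycles) and an assumed 72 MHz core clock, that works out at 222 ns from request to first instruction of the handler; at an assumed 48 MHz it would be 333 ns.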

I started off this post by being a bit anti-OS, which I have been for most of my career, but when you start peeling back the covers you start to understand that an OS, be it FreeRTOS, RTX, ChibiOS, NuttX or one of the hundreds of others that are out there, is really just a big library of code that you don’t have to write for yourself.  Know your problem, know your chip, and don’t just trust your execution environment decisions to blind prejudice.

Join the discussion here.


The main beef I have with MCU operating systems actually has less to do with handing over control of the MCU to a higher level layer in general, and more to do with the fact that the common C/C++ languages suck horribly at optimising a binary image properly to get rid of all the function calls. A regular framework is already bad, but an OS makes it even worse, and much more so when using concurrency features (which is the main point of using an OS). This is where I think Rust and especially the RTFM framework come in rather nicely…