To OS or not to OS

On an embedded system, should you have an OS or run on the metal?

This one will run and run, so I’ll call this Part 1 for now.

I’ve spent the majority of my professional career railing against the use of OSes for Embedded Systems. A few years ago I analysed my reasoning behind that and realised a lot of it was founded in the arrogant belief that no-one could write code that was as well optimised as mine. That may or may (more probably) not be true, but there are plenty of other reasons for seriously considering a lightweight OS for your next project.

Following that little thought investigation, I now invert the discussion and start off with “Why wouldn’t I run an OS underneath this?”. The fact is, on day one, every project starts off small, manageable and with a simple set of needs…you don’t need an OS in that environment, plain and simple. But as your project grows you need to do more and more things and, without an OS, you’ll find yourself re-inventing stuff that you get for free in OSville. Ah, you say, but I already have a library for timers, and message passing, and task switching, and queues…congratulations, we call that an OS – it’s just that you didn’t call it one.

There are legitimate reasons for going OS-commando, the main one being that you’re really short of memory (RAM or Flash). Like it or not, an OS is going to gobble some of it up (a mutex semaphore in FreeRTOS on Cortex-M takes 80 bytes, and that hurts when you’ve only got 4096 of ’em around), so defining your own can really help…but be careful: allocating one bit in the bit-addressable RAM area just saved you 79 7/8 bytes of RAM, but it isn’t the end of the story – you’ve still got the care and feeding of that structure to deal with. It’s surprising just how much Flash, in comparative terms, that care and feeding can take – and that’s despite the fact that not too many people would claim FreeRTOS is the most super-efficient RTOS in its RAM allocation.
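
For the curious, ‘bit-addressable RAM’ on a Cortex-M3 means the bit-band alias region. Here’s a hedged sketch of what claiming a single lock bit that way might look like – the macro and all the names are mine, purely for illustration;

#include <stdint.h>

/* Map bit 'b' of the SRAM word at address 'a' onto its own word in the
   Cortex-M3 bit-band alias region, so a single store writes just that bit */
#define BITBAND_SRAM(a, b) \
    (*(volatile uint32_t *)(0x22000000u + (((uint32_t)(a) - 0x20000000u) * 32u) + ((b) * 4u)))

static volatile uint32_t lock_bits;            /* one word holds 32 one-bit 'mutexes' */
#define UART_LOCK BITBAND_SRAM(&lock_bits, 0)  /* bit 0: a hypothetical UART lock */

/* UART_LOCK = 1; ...critical stuff... UART_LOCK = 0;
   Note a bit-band write alone isn't an atomic test-and-set - that's exactly
   the 'care and feeding' mentioned above */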

Similar considerations apply on the Flash side. A reasonably complete FreeRTOS implementation on an STM32F103 in release configuration is about 6K…you can come down a ways from there if you start chopping out options, but the total spend will still be a four-digit number, and that’s a fair proportion of a budget that might only be 16K or 32K.

One thing that an OS doesn’t have to do, though, is slow you down, and that’s the main criticism I hear (and, indeed, it was one of my primary prejudices). The fact is that most of the time, for most of your code, the 1-2% overhead the OS brings along for the ride really doesn’t matter. It does matter when you’ve got a time-critical task to handle, and that’s often (mostly?) done in interrupt code, so how fast an RTOS handles interrupts is much more important than how it handles base code.

Most Real Time OSes offer ‘zero latency interrupts’, or some equivalent term. All that really means is that the OS doesn’t pre-service the interrupt for you; it doesn’t grab the interrupt and perform the initial handling before passing it off to your code. That does happen in desktop OSes, and you’ll hear the terms ‘top half handler’ and ‘bottom half handler’ used to reflect this split between OS-controlled and application-controlled code.

With a zero latency interrupt, your response time is exactly the same as in the OS-free case, because the interrupt lands directly in your code. Indeed, response time could even be better. How? Well, let’s look at a lazy-assed implementation of an OS-free app (one of mine, so I can criticise…it’s available here if you want a laugh). In this app communication is arranged through simple flags…you set a flag in one place, and that triggers a task in another. The code to set a flag looks like this;

/* 'flags' is shared between interrupt and mainline code, hence volatile */
static volatile uint32_t flags;

void flag_post(uint32_t flag_to_set)
{
    denter_critical();      /* no interrupts while we fiddle with shared state */
    flags |= flag_to_set;
    dleave_critical();
}

… and the denter / dleave routines;

/* critical sections can nest, so keep a depth count (shared state too) */
static volatile uint32_t _critDepth;

void denter_critical(void)
{
    __disable_irq();        /* mask ALL maskable interrupts */
    _critDepth++;
}

void dleave_critical(void)
{
    if (!--_critDepth)      /* only unmask again when the outermost */
    {                       /* critical section is left */
        __enable_irq();
    }
}

..so, as you can see, all interrupts are turned off while we go fiddle with flags and _critDepth…and during that time the CPU is away with the fairies and isn’t going to respond to any other maskable interrupt, no matter how much it yells. That will show up as jitter in interrupt response time (there is another reason for jitter; we’ll come back to that later).

So, how on earth could an RTOS be faster? Let’s consider the equivalent critical-section entry in FreeRTOS for an M3 CPU (you’ll find this in portmacro.h, and I’ve hacked the formatting around a bit);

portFORCE_INLINE static void vPortRaiseBASEPRI( void )
{
    uint32_t ulNewBASEPRI;
    __asm volatile
    (
        "mov %0, %1\n" \
        "msr basepri, %0\n" \
        "isb\n" \
        "dsb\n" \
        :"=r" (ulNewBASEPRI) : "i" ( configMAX_SYSCALL_INTERRUPT_PRIORITY )
    );
}

…not a __disable_irq in sight! What FreeRTOS does is temporarily raise the minimum interrupt priority that the CPU will recognise. That has exactly the same effect as __disable_irq for any interrupt with a lower priority than whatever is selected for configMAX_SYSCALL_INTERRUPT_PRIORITY, but leaves higher-priority interrupts enabled. So, if I really need that fast response, I just give my interrupt a super-high priority and it will get serviced sharpish…the only constraint being that I cannot use OS services within that interrupt.
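
As a sketch of what that looks like in practice (my own illustrative code: plain CMSIS calls on an STM32F1-class part, with TIM2 standing in for whatever needs the fast response, and assuming configMAX_SYSCALL_INTERRUPT_PRIORITY corresponds to priority 5);

void fast_irq_setup(void)
{
    NVIC_SetPriority(TIM2_IRQn, 0);  /* 0 = most urgent priority on Cortex-M */
    NVIC_EnableIRQ(TIM2_IRQn);
}

void TIM2_IRQHandler(void)
{
    /* Priority 0 is above configMAX_SYSCALL_INTERRUPT_PRIORITY, so this runs
       even while FreeRTOS is in a critical section...but no ...FromISR()
       calls are allowed in here */
    TIM2->SR = 0;                    /* acknowledge the timer interrupt */
}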

End result: I’ve got the option of slightly jittery interrupts with OS support, or interrupts faster than the native case – but if I want to use OS features in conjunction with them then I have to jump through some more hoops. Of course you could do the BASEPRI trick in your own code, but someone has already written, tested, debugged and documented it for you, so why bother?
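
(And if you do want to roll it yourself – say, because there’s no OS at all – it’s only a few lines. A sketch using the standard CMSIS intrinsics; the priority constant is my own assumption for a part with 4 priority bits;)

static uint32_t basepri_raise(void)
{
    uint32_t old = __get_BASEPRI();
    __set_BASEPRI(5 << 4);  /* mask priorities 5..15, leave 0..4 live */
    __DSB();
    __ISB();
    return old;
}

static void basepri_restore(uint32_t old)
{
    __set_BASEPRI(old);
}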

Finally, remember I said that there was another source of jitter? Well, taking an M3 as an example: it should theoretically be able to respond to an interrupt within 10 clock cycles, every time, but other factors (bus latencies, peripheral response times, flash caches and speeds etc.) may conspire to prevent that…so you get response jitter. In real-world applications it is often more important to be slower and jitter-free than faster and a bit wobbly, so several manufacturers have added the capability to ‘stretch’ the number of cycles taken to respond to an interrupt so that it’s always the same. On the NXP LPC134x CPUs that register is called IRQLATENCY, and it has a default value of 0x10, meaning that, in general, the CPU will hit your code 16 clock cycles after an unmasked, highest-priority interrupt request is generated. Whether that is enough delay to remove jitter in your configuration depends on exactly how you’ve got the whole system set up, so you can put a longer value in that register if you need to.
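
If you want to play with it, it’s a single register write. The sketch below assumes NXP’s CMSIS headers expose the register as LPC_SYSCON->IRQLATENCY – check your own header and user manual;

LPC_SYSCON->IRQLATENCY = 0x20;  /* stretch the interrupt response floor to 32 cycles */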

I started off this post by being a bit anti-OS, which I have been for most of my career, but when you start peeling back the covers you begin to understand that an OS, be it FreeRTOS, RTX, ChibiOS, NuttX or one of the hundreds of others out there, is really just a big library of code that you don’t have to write for yourself. Know your problem, know your chip, and don’t leave your execution environment decisions to blind prejudice.

Join the discussion here.

Ghetto CPU busy-meter

Even in the embedded world, you really want to know how busy your CPU is.

Many, or even most, embedded programmers have no idea how ‘busy’ their CPU is. By ‘busy’, I mean what proportion of the available CPU cycles are being spent doing useful work for the target application, and how much time it’s idle, waiting for something to do. On your desktop PC you’ve got some sort of CPU busy-meter available through Task Manager, Activity Monitor or System Monitor, depending on which OS religion you follow. Wouldn’t it be useful to have something similar on that embedded device you’re working on? Far too many people think you need an OS for that kind of info, but it turns out it’s trivially easy to do…and it should be table stakes for getting started on a new design.

In most embedded design patterns, you’ve got some sort of loop where the application sits, waiting for something to happen. That might be a timer expiring, an interrupt arriving or some other event. The general pattern looks something like this;

while (1)
{
    flagSet = flag_get();
    if (!flagSet)
    {
        __WFI();
    }
    else
    {
         <Do things>
    }
}

In this specific example the flags are set by interrupt routines, and the __WFI instruction is basically ‘Wait For Interrupt’. You can see how this works out…we sit waiting for an interrupt in the __WFI (which is generally a low-power mode), one arrives and sets a flag or two, and then any outstanding flags are processed before returning to the __WFI for the whole thing to happen again.
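
For completeness, here’s one plausible shape for flag_get(), borrowing the flags variable and the denter/dleave critical-section helpers from the previous post – a sketch, not necessarily what any real app does;

uint32_t flag_get(void)
{
    uint32_t f;

    denter_critical();  /* flags are written from interrupt context too */
    f = flags;
    flags = 0;          /* read-and-clear so no event gets lost or double-counted */
    dleave_critical();
    return f;
}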

Well, this will be easy…all we need to do is provide some indication to the outside world, at the right places in the system, of when we’re busy and when we’re not. Something like;

while (1)
{
    flagSet = flag_get();
    if (!flagSet)
    {
        <Indicate Idle>
        __WFI();
        <Indicate Not Idle>
    }
    else
    {
         <Do things>
    }
}

That will work. Your split between ‘busy’ and ‘idle’ might not be quite so clean, but I don’t think I’ve ever seen a system where it’s not possible to differentiate between the two cases.

There is one issue, which is that the interrupt routine will be called before the <Indicate Not Idle> gets processed, but interrupt routines should be short in comparison to the total amount of work done…and I’ll give you a fix for that later anyway.

The easiest way to <Indicate Idle> and <Indicate Not Idle> is just by setting or clearing the state of a spare pin on your device. That also means you’ve got to set up the pin to be an output first of all…so our code now looks like;

<Set up output pin>
while (1)
{
    flagSet = flag_get();
    if (!flagSet)
    {
        <Indicate Idle>
        __WFI();
        <Indicate Not Idle>
    }
    else
    {
         <Do things>
    }
}

Great – we can now see, approximately, how busy our CPU is just by monitoring the pin. The code for each of these routines will vary according to the particular CPU you’re using, but here’s a typical example for an STM32F103, implemented using macros and CMSIS for port pin B12;

#define SETUP_BUSY GPIOB->CRH=((GPIOB->CRH)&0xFFF0FFFF)|0x30000 /* B12: 50MHz push-pull output */
#define AM_IDLE    GPIOB->BRR=(1<<12)                           /* drive B12 low */
#define AM_BUSY    GPIOB->BSRR=(1<<12)                          /* drive B12 high */
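
One assumption baked into SETUP_BUSY is that the GPIOB peripheral clock is already running. On the STM32F103 that’s one more line, somewhere early in your init code;

RCC->APB2ENR |= RCC_APB2ENR_IOPBEN;  /* enable the GPIOB clock before touching CRH */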

These magic incantations will vary from CPU to CPU, but the principle holds fine. You might be even luckier and have a pin on your CPU that changes state in WFI mode without any of this AM_IDLE/AM_BUSY mucking about – typically a clock output that only runs while the CPU is active, for example.

This works great if you’ve got a scope to see the output on, but I like something a bit more low-tech that can hang off the port so we’ve got a permanent indication of just how busy the system is.

An LED is a good first step – the CPU will be switching between busy and idle at quite a speed, faster than the human eye can follow, so the LED’s brightness is proportional to how busy the system is. If you’re a proper bodger you don’t even need a series resistor for the LED…the resistor is only there to limit the current through it, and since the maximum current out of the CPU pin will generally be considerably less than the maximum the LED can cope with, you’re golden…but do check the specs for your chip and your LEDs!

Unfortunately, the human eye is rather non-linear, so it won’t easily spot the difference between a 60% loaded CPU and an 80% loaded one, so something slightly more sophisticated is in order. If we put a low-pass RC filter (or integrator, or smoother…they’re all the same thing, it just depends how you want to think about it) on the output pin then you’ll get a DC voltage out which is proportional to how busy the CPU is. Let’s go one step further and put a potential divider on the front to set the maximum input to the filter to be 1V. For a 3v3 system, something like this;

[Circuit diagram: potential divider from the CPU pin feeding an RC low-pass filter, output to a multimeter]

Hey, I’ve now got a meter that reads from idle (0.00V) to fully loaded (1.00V) directly on the display. Meters with an analogue bar on the display are particularly suitable for this job. Increase the value of the capacitor if you need to reduce jitter in the reading, at the expense of making the response to load changes slower.
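
To put some illustrative numbers on it (these are my values, pick your own): with R1 = 22k on top and R2 = 10k to ground, the divider gives 3.3 × 10/(22+10) ≈ 1.03V at 100% load. The filter behind it sees a source impedance of R1∥R2 ≈ 6.9k, so a 10µF capacitor gives a time constant of around 70ms – slow enough to smooth out the switching, quick enough to follow genuine load changes.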

Just how cool is that? Of course, the practical implementation doesn’t look as pretty as the circuit diagram, but it does the job;

[Photo: the divider, filter and meter lashed together on the bench]

Remember how we said that one of the limitations of this technique was that the Interrupt routine got called before the Busy Indicator got set? If that is a problem (because, for example, you’re worried that you’re spending a lot of time in interrupts) just put another AM_BUSY call at the start of your ISR. Problem solved. Go to town, put one at the start of all of your ISRs.

Similarly, there’s no reason you have to restrict this technique to just telling you when you’re busy….if you bracket a specific routine in your code with the AM_BUSY tags you can read directly off your meter what percentage of your CPU time is being spent in it. You can even have multiple pins tagging different bits of code… knock yourself out.
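
As a sketch (pin B13 is my arbitrary choice of second pin, set up as an output in the same way B12 was);

#define PROFILE_ON  GPIOB->BSRR=(1<<13)  /* entering the routine under test */
#define PROFILE_OFF GPIOB->BRR=(1<<13)   /* leaving it again */

void routine_under_test(void)
{
    PROFILE_ON;
    /* ...the code you're curious about... */
    PROFILE_OFF;
}

A meter on B13, wired up exactly as before, now reads the percentage of CPU time spent inside routine_under_test().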

At the other end of this post, you didn’t know how busy your CPU got. Now, you’ve got no excuses.

Join the discussion here.