SWO Instrumentation – Borrowing

The nice thing about standard file formats is that they’re standard file formats, which means you can use other tools with them…even ones they weren’t really designed for in the first place.

José Fonseca’s gprof2dot is a great example of exactly that.  It takes a variety of file formats and twists and mangles them into nicer graphs than the orbuculum tools do natively. You’ll need Python and GraphViz installed, then just create your source file exactly the same as if you were using KCacheGrind;

>ofiles/orbstat -e firmware.elf -z test.out

…but then, rather than starting KCachegrind on the resulting output, stick it into gprof2dot instead.  The file is in callgrind format, so the incantation is straightforward;

>gprof2dot -f callgrind test.out > a.dot

…and finally, the dot magic;

>dot -Tpdf a.dot > o.pdf

and you’ll get a rather nice graph showing you exactly where your code spent its time, with the hot paths highlighted;

You’ll find plenty of options for changing the format and layout on José’s github page, and plenty more tools for grokking callgrind format files floating around on the web.

A cheap shot today, but hopefully it still hit home.

SWO Instrumentation, building the orchestra

We’ve already established that post processing delivers a lot of value just from the perspective of really understanding what your code is doing, so it makes sense that the better the post processing tool, the better that understanding will be. Enter KCacheGrind, considered by some to be the best execution understanding tool out there.

GraphViz is a great set of tools, but if you read the previous article you might have been left with the feeling that we were leaving a lot of really useful information on the cutting room floor. The instrumentation we added to the code delivered not only what was calling what, but it also provided timestamp information (measured in processor clock cycles) indicating how long each routine was taking…and we threw that away. Turning that raw data into useful information isn’t trivial because you’ve got to deal with things like interrupts and call trees, but once you’ve got it there’s a whole treasure trove of information to explore. Let’s grab some data and format it into a file suitable as an input for KCacheGrind…back to our old friend orbstat;

>ofiles/orbstat -e firmware.elf -z test.out

The configuration of the target is exactly the same as for the last post, since it’s using the same data as input…just a different format for output. After you’ve got some data, poke it into KCacheGrind;

>kcachegrind test.out

…and you’ll be rewarded with something that looks a bit like this;

So, this all needs a bit of explaining, and I’m no cachegrind expert so you might find Google to be your friend, but let’s give it a go. To the left is the set of routines that were recognized during the sampling interval, together with the total amount of time spent in each of them, both cumulatively (first column) and specifically in that routine (second). The number of calls is the third column and the rest should be pretty obvious….you can click around in those to centre on any routine.

Top right are the various analyses of the code, including source code views, number of cycles taken in the code and the split of execution time between this routine and the routines it calls. You can change the way the metrics are presented to be relative to the whole execution time, or to the parent, and you can express them as percentages or absolute processor cycles…..the ‘Callee Map’, for example, allows you to see how the overall execution time divvies up between the called routine and its inferiors, like this;

…and you can see what was going on in the source code too;

To the lower right you’ll also find the disassembly of the matching source code, for when you really want to get down and dirty with what is going on;

There’s one small gotcha with the disassembly view; it uses objdump, and chances are your system-wide objdump might not understand the binaries you’re compiling for your target. You will also find that KCacheGrind can crash because the format of objdump doesn’t match what it expects (specifically, the text Address XXXX is out of bounds can appear at the end of a disassembly, which upsets it terribly). No big deal, just create a new file called something like objdumpkc and put the following contents in it;

#!/bin/sh
~/bin/armgcc/bin/arm-none-eabi-objdump "$@" | sed '/is out of bounds./d'

…then mark the file as executable and point the OBJDUMP environment variable to it before calling KCacheGrind, like this;

OBJDUMP=objdumpkc kcachegrind ....

Problem solved.

Most interesting, especially on complex projects, are the graphical views you can get of the relationship between routines and the execution splits between them. Those use graphviz just like in the previous article, but there are a lot more steroids in play.

Those are shown in the bottom right panel (you can re-arrange all this lot by the way if you don’t like the layout) and, again, can be done in absolute, relative or cycles formats…Here’s where my timer handler is spending its time in response to the timer interrupt;

…and here’s the same information as a percentage of the total time the CPU spent doing stuff;

As you can see cachegrind is a much richer, more interactive way of understanding your system. It still has its limits (it’s a static snapshot after all, although CTRL-R will reload the file) but when you’re trying to figure out exactly where your CPU has gone, and which bits of your system need to be re-engineered, it’s a wonderful tool.

The one remaining problem is when to grab the samples, ‘cos at the moment it’s just when you happen to stop the acquisition. Getting more sophisticated (tell me what my system is doing in response to event X, for example) needs us to start looking at triggers. That one is coming up.

SWO – Instrumentation, first tunes

Instrumenting your application, passing the resulting data over the SWO link and post processing it on the host side turns a lot of textual data from the SWO into something that’s much more useful and easily digested. That’s where orbstat comes in.

The GNU compiler is mature and well tested, and over the years various smart people have bolted bits and pieces onto it to make their own lives easier…the trouble is it’s not always very easy to figure out exactly how to use that stuff in your own code, and the instrumentation functionality is a pretty good example of that.

Figuring out where your program is spending its time, what is calling what and just how often is difficult enough on a desktop platform, never mind on an embedded one without the convenience of a screen, keyboard and copious storage. No matter, we can use the functions those smart people already developed, together with the SWO to offload the resulting data, to deliver powerful visualization from even the most inscrutable red LED flasher.

To do this we will be using the __cyg_profile_func_enter and __cyg_profile_func_exit capabilities. The prototypes look something like this;

void __cyg_profile_func_enter (void *func, void *caller);
void __cyg_profile_func_exit (void *func, void *caller);

As you might guess from their names, it’s possible to convince gcc to insert calls to these two functions at the entry and exit of any function. Doing that just needs a single additional incantation on your gcc command line, -finstrument-functions.

…and that’s it! Any code compiled with this option will automatically call the entry and exit routines as it runs. OK, so it’ll be slower than it would be on a normal day, but you’re not running your CPU at 100% load anyway are you? are you?

There’s one gotcha though. -finstrument-functions will instrument every function…including the entry and exit ones, resulting in a vicious circle and a visit to the Hardfault_Handler in short order. Fortunately, the smart folks thought that one through too, and you can label individual functions to not be instrumented with the __attribute__ ((no_instrument_function)) decorator. There are probably entire files you don’t want instrumenting too, and they even covered that with an exclude file list that accepts partial matches. I’m not really interested in profiling my OS or the CMSIS, so my exclude list looks like;
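For reference, the gcc switches involved take comma-separated substrings that are matched against file and function names. A sketch of the shape this takes (the fragments here are illustrative examples, not the author’s actual list);

```shell
# Illustrative only: exclude anything whose path matches FreeRTOS or CMSIS,
# plus the profiling hooks themselves, from instrumentation.
arm-none-eabi-gcc -finstrument-functions \
    -finstrument-functions-exclude-file-list=FreeRTOS,CMSIS \
    -finstrument-functions-exclude-function-list=__cyg_profile \
    -c main.c -o main.o
```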


So, we’ve got a mechanism for figuring out when we enter and leave routines. All we need to do now is offload that from the CPU to somewhere we can post-process it. That’s a job for the SWO. Here’s a simple implementation that will report function entries and exits over a dedicated ITM channel;

#include "config.h"
#define TRACE_CHANNEL (30)
#define DELAY_TIME (0)

__attribute__ ((no_instrument_function))
void __cyg_profile_func_enter (void *this_fn, void *call_site)
{
    if (!(ITM->TER&(1<<TRACE_CHANNEL))) return;

    // This is not atomic, but by using the stack for
    // storing oldIntStat it doesn't matter
    uint32_t oldIntStat=__get_PRIMASK();
    __set_PRIMASK(1);

    while (ITM->PORT[TRACE_CHANNEL].u32 == 0);

    // This is CYCCNT - number of cycles of the CPU clock
    ITM->PORT[TRACE_CHANNEL].u32 = ((*((uint32_t *)0xE0001004))&0x03FFFFFF)|0x40000000;
    while (ITM->PORT[TRACE_CHANNEL].u32 == 0);
    ITM->PORT[TRACE_CHANNEL].u32 = (uint32_t)(call_site)&0xFFFFFFFE;
    while (ITM->PORT[TRACE_CHANNEL].u32 == 0);
    ITM->PORT[TRACE_CHANNEL].u32 = (uint32_t)this_fn&0xFFFFFFFE;

    for (uint32_t d=0; d<DELAY_TIME; d++) asm volatile ("NOP");
    __set_PRIMASK(oldIntStat);
}

__attribute__ ((no_instrument_function))
void __cyg_profile_func_exit (void *this_fn, void *call_site)
{
    if (!(ITM->TER&(1<<TRACE_CHANNEL))) return;

    uint32_t oldIntStat=__get_PRIMASK();
    __set_PRIMASK(1);

    while (ITM->PORT[TRACE_CHANNEL].u32 == 0);
    ITM->PORT[TRACE_CHANNEL].u32 = ((*((uint32_t *)0xE0001004))&0x03FFFFFF)|0x50000000;
    while (ITM->PORT[TRACE_CHANNEL].u32 == 0);
    ITM->PORT[TRACE_CHANNEL].u32 = (uint32_t)(call_site)&0xFFFFFFFE;
    while (ITM->PORT[TRACE_CHANNEL].u32 == 0);
    ITM->PORT[TRACE_CHANNEL].u32 = (uint32_t)this_fn&0xFFFFFFFE;

    for (uint32_t d=0; d<DELAY_TIME; d++) asm volatile ("NOP");
    __set_PRIMASK(oldIntStat);
}

You’ll recall from our previous discussions that each of the ITM channels runs independently, so you can have this code running while still throwing this visualization out of the port. There’s also a bit of slowdown code in there just in case the link gets flooded but, in reality, it’s not been a big deal as long as you don’t try to use the link heavily for anything else at the same time (the channel pretty much rate-limits itself due to the busy-spins in the code above … the slower your SWO, the slower your CPU).
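On the host side, each report starts with a timestamp word followed by the two address words, so picking it apart is straightforward. A minimal sketch of that decode (this mirrors the constants in the target code above, but it is not orbstat’s actual implementation);

```c
#include <stdint.h>

/* Markers from the instrumentation above: top nibble 0x4 flags an
 * entry report, 0x5 an exit report; the low 26 bits carry the CYCCNT
 * snapshot taken at the moment of the call. */
#define MARKER_ENTRY 0x4u
#define MARKER_EXIT  0x5u

static uint32_t report_marker(uint32_t word) { return word >> 28; }
static uint32_t report_cycles(uint32_t word) { return word & 0x03FFFFFF; }

/* The address words have bit 0 (the Thumb bit) cleared on the target,
 * so they can be used directly for symbol lookup. */
static uint32_t report_address(uint32_t word) { return word & 0xFFFFFFFE; }
```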

The CPU needs to be configured to pass the SWO along. Assuming you’re using orbuculum with the orbtrace.init macros then that bit is easy enough with the following in your .gdbinit (this example is using the Bluepill variant of the Blackmagic probe…it’s an exercise for the reader to use a different probe);

source ../../orbuculum/Support/gdbtrace.init
target extended-remote /dev/ttyACM0
monitor swdp_scan
file ofiles/firmware.elf
attach 1
set mem inaccessible-by-default off
set print pretty load start
# ======================================
# Change these lines for a different CPU or probe
monitor traceswo 2250000
prepareSWD 72000000 2250000 0 0
# ======================================
dwtSyncTAP 3
dwtCycEna 1
ITMTSPrescale 3
ITMEna 1

So, we’ve got the stuff into the PC…how do we handle it? If you just want to see it’s there you can use orbcat, with a command like;

>orbcat -c 1,"0x%08x\n"

…Great, but not exactly useful. What we really need is something that takes these data, maps them across to the elf file containing the firmware that’s running on the CPU and turns the whole thing back into sensible data.

Fortunately, orbstat does that for you. The magic command you’re looking for is;

>orbstat -e firmware.elf -y xx.dot

This will swallow that output, cross reference it with the debug information in your elf file, and write the whole resulting mess out to a GraphViz input file. All that’s needed then is to ask graphviz nicely to turn it into a perty picture;

>dot -Tpdf xx.dot > o.pdf

…and here’s the result for a trivially simple app (Click for a bigger image). Suddenly I can see exactly what is calling what, and how often, per second;

This is only the start. Pretty pictures are one thing, and a useful one at that, but being able to fly around that picture and dive in and out of elements of it is something quite a lot better. That’s the next installment.

SWO – There’s an App for that

Exploiting the SWO link for software logging and hardware state reporting delivers huge advantages in comparison with traditional debug techniques, but when extended with applications on the host side the benefit gained is amplified considerably.

The creation of apps for Orbuculum is really only just getting underway.  Any number of applications can attach to it simultaneously to deliver multiple views and insight into the operation of the target, and only a few of those have been created so far.

The main orbuculum program collects data from the SWO pin on the CPU and both processes it locally and re-issues it via TCP/IP port 3443. The format of these data is exactly the same as that which arrives from the pin, which also happens to be the format that the Segger JLink probe issues. By default the JLink uses port 2332…believe it or not, the choice of port 3443 for Orbuculum was made for very specific reasons which did not include consideration of the JLink port number, so that was quite a co-incidence! Applications designed to use the TCP/IP flow from a JLink device can now also be used with a USB serial port or Black Magic Probe. Conversely, with a simple modification to change the source port number, orbuculum post-processing apps can hook directly to a JLink device, or via orbuculum itself – everyone’s a winner.

The orbuculum suite currently includes a couple of simple example applications that use the infrastructure…creating new ones is trivial based on the code you can find in these examples.

The simplest of these existing apps is orbdump, which dumps data directly to a file. That’s useful when you just want to take a sample period for later processing…perhaps pushing it into something like Sigrok for processing in conjunction with other data. A command line something like this will dump 3 seconds of data directly into the file output.swo;

>orbdump -l 3000 -o output.swo

We’ve already mentioned orbtop. That tool is used for creating unix-style top output, but it features one little Easter Egg. There’s an option, -o <filename>, which dumps the processed sample data to a file, and an example shell script in Support/orbplot_top uses these data to produce pie charts of the distribution of the CPU load, a bit like this;

Frequently an application needs to merge multiple data sources as a precursor to using it in other apps. If you’ve got orbuculum producing several fifos with independent data in each there are unix tools that can do that, something like;

>(tail -f swo/fifo1 & tail -f swo/fifo2 ) | cat > output

The problem with this is that you can never be completely sure of the order in which data are merged into the output file. A dedicated tool, orbcat, is therefore provided: it hooks to the TCP/IP port of orbuculum and takes the same output format specifiers (but without the fifo names), dumping the resulting flow either to stdout or to a file for use by other tools, like this;

>orbcat -c 0,"%c" -c 1,"%c" -c 2,"%c" -c 3,"Z=%d\n" -c 4,"Temp=%d\n"

Since each value arrives discretely for each channel, it is possible to be certain that each one is completely written before the next – whatever order they’re written in on the target is the order they’ll be received in on the host (watch out for target OS issues here though!). This can resolve the problem of inconsistent intermingling. Indeed, it’s possible to go further and use the enforced sequencing to advantage on the host. For example, we can write two characters and an int into a csv file on the host with an orbcat line like the following;

>orbcat -c 5,"%c" -c 6,", %c" -c 7,", %d\n"

which would result in lines that look something like;

a, b, 45
g, w, -453

Always bear in mind that there is no (real) limit to the number of simultaneous apps that can use the dataflow from the orbuculum TCP/IP port, nor on the re-use of data for multiple dumps; perhaps there’s a reason for creating two csv files, with the data above in a different order, for example.

Orbuculum is only just at the start of its lifecycle. It can collect and distribute SWO data, but it’s the apps that make use of these data that make it powerful, and there are plenty more of those to be created for many different purposes.

For now, the most interesting app that comes with the suite is orbstat, and that will be the subject of the next post.

SWO – The Hard Stuff

SWO’s credibility as a debug solution comes from its ability to support multiple software output channels, but its real capability is only realised when you use the hardware monitoring functions it offers too.

In my previous post I alluded to the hardware capabilities that the SWO ITM macrocell offered by virtue of the Data Watchpoint & Trace (DWT) macrocell. In this post we’re going to scratch the surface of what you can do with that.

DWT messages are encoded in exactly the same way as software ones, but they are generated automatically by hardware rather than programmatically. You’ll recall that event counters, exceptions, PC value and data traces can all be output by the DWT, so in this post we’ll provide a couple of examples of how to use that functionality.

If you’ve got orbuculum running, you’ll notice one extra fifo in its output directory alongside whatever you have defined. That fifo is called hwevent and is a simple continuous dump of whatever DWT events you’ve got switched on. By default, with the standard gdb orbuculum startup script, no events are requested for reporting, and so that fifo remains empty. From the gdb command line (assuming you’ve included the line source ../orbuculum/Support/gdbtrace.init in your .gdbinit file) you can find out quite a lot about the possibilities for configuring the ITM & DWT;

gdb>help orbuculum

GDB SWO Trace Configuration Helpers

Setup Device
enableSTM32F1SWD : Enable SWO on STM32F1 pins
prepareSWD : Prepare SWD output in specified format

Configure DWT
dwtPOSTCNT : Enable POSTCNT underflow event counter packet generation
dwtFOLDEVT : Enable folded-instruction counter overflow event packet generation
dwtLSUEVT : Enable LSU counter overflow event packet generation
dwtSLEEPEVT : Enable Sleep counter overflow event packet generation
dwtDEVEVT : Enable Exception counter overflow event packet generation
dwtCPIEVT : Enable CPI counter overflow event packet generation
dwtTraceException : Enable Exception Trace Event packet generation
dwtSamplePC : Enable PC sample using POSTCNT interval
dwtSyncTap : Set how often Sync packets are sent out (None, CYCCNT[24], CYCCNT[26] or CYCCNT[28])
dwtPostTap : Sets the POSTCNT tap (CYCCNT[6] or CYCCNT[10])
dwtPostInit : Sets the initial value for the POSTCNT counter
dwtPostReset : Sets the reload value for the POSTCNT counter
dwtCycEna : Enable or disable CYCCNT

Configure ITM
ITMId : Set the ITM ID for this device
ITMGTSFreq : Set Global Timestamp frequency
ITMTSPrescale : Set Timestamp Prescale
ITMSWDEna : TS counter uses Processor Clock, or clock from TPIU Interface
ITMTXEna : Control if DWT packets are forwarded to the ITM
ITMSYNCEna : Control if sync packets are transmitted
ITMTSEna : Enable local timestamp generation
ITMEna : Master Enable for ITM
ITMTER : Set Trace Enable Register bitmap for 32*<Block>
ITMTPR : Enable block 8*bit access from unprivileged code

There is another layer of help information below this top layer (beware that gdb doesn’t like MixedCase when you’re trying to do tab completion);

gdb>help dwttraceexception
dwtTraceException <0|1> : Enable Exception Trace Event packet generation

Understanding some of these options does need a bit of perusal of the DWT and ITM technical documentation, I’m afraid, but I’ll get around to writing something up on some of the more useful of them eventually (or, if someone else fancies making a textual contribution, it would be gratefully received….)

OK, so let’s give that a go, and see what we get in the hwevent fifo now;

gdb>dwtTraceException 1

>cat hwevent

The ‘1’ in the first column is the event type (an Exception Trace Event), followed by the time in uS since the previous event. That is followed by the condition, and by the Exception itself. This particular trace is for an otherwise idle FreeRTOS application with a 1mS system tick timer. You can see that the CPU entered the thread state and 989uS later dealt with a SysTick event that took 6uS to handle, and that this process continued throughout the sample time…that’s quite a level of insight for no code changes at all!
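For the curious, the packet behind each of those lines is tiny. A hedged sketch of the decode, based on my reading of the exception trace packet description in the ARMv7-M Architecture Reference Manual rather than on orbuculum’s actual code;

```c
#include <stdint.h>

/* Sketch of the 16-bit exception trace payload: bits [8:0] hold the
 * exception number and bits [13:12] the function - 1 = entered,
 * 2 = exited, 3 = returned to. (Assumption: field positions taken
 * from the ARMv7-M ARM packet descriptions.) */
typedef struct
{
    uint16_t exceptionNumber;  /* e.g. 15 = SysTick */
    uint8_t  eventType;        /* 1 = enter, 2 = exit, 3 = return */
} exceptionEvent;

static exceptionEvent decodeExceptionTrace(uint16_t payload)
{
    exceptionEvent e;
    e.exceptionNumber = payload & 0x1FF;
    e.eventType = (payload >> 12) & 0x3;
    return e;
}
```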

There are 993uS to 1003uS between SysTicks in this sample, and that brings us to one of the big problems with this technique. To save bandwidth across the link the timestamps are generated on the host rather than the target, so they are inevitably inaccurate and, even with this compromise, the TRACESWO link quickly becomes overloaded. You will see ITM Overflow warning messages from orbuculum itself in any realistic application using Exception Tracing…effective use of Exception Tracing will have to wait until parallel trace is available. By the way, there is a great description of CORTEX-M exceptions available here.

So, instead, let’s move on to something that works reasonably well even within the constraints of TRACESWO. Interrupt the application and type;

gdb>dwtTraceException 0
gdb>dwtSamplePC 1

…and again we can look at the hwevent fifo;

>cat hwevent


Basically, we can set an interval at which we want the DWT to sample the current value of the Program Counter (by means of the dwtPostTap and dwtPostReset options) and it will tell us the value of the PC at that interval. If the target is sleeping then the PC has no meaningful value, so the special value **SLEEP** is returned instead.
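The arithmetic behind choosing that interval is simple: POSTCNT ticks every time the selected CYCCNT tap bit toggles (every 64 or 1024 processor cycles, per the dwtPostTap options listed earlier), and a sample is emitted each time it wraps. A back-of-envelope sketch, assuming those tap values;

```c
#include <stdint.h>

/* Rough PC sampling interval: tapCycles is 64 or 1024 (the dwtPostTap
 * setting), reload is the dwtPostReset value, cpuHz the core clock.
 * A sample appears roughly every (reload + 1) taps. */
static double pcSampleIntervalUs(uint32_t tapCycles, uint32_t reload, double cpuHz)
{
    return (double)tapCycles * (double)(reload + 1u) * 1e6 / cpuHz;
}
```

So with the 1024-cycle tap, a reload of 9 and a 72MHz core, you’d see a sample roughly every 142uS.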

Using combinations of these options you can provide information to homebrewed applications that parse the hwevent fifo to infer things about the behaviour of your target, but there are alternative ways of getting information which can be easier to use.

In a previous note I mentioned that orbuculum exports a TCP/IP interface on port 3443…we can hook applications to this port and parse the data that are returned. The easiest example (which is completely useless) is;

>telnet localhost 3443

(Oh, CTRL-] followed by q will get you out of that).

Fortunately, the orbuculum gnomes have provided slightly more useful applications than that. The first of these is orbtop, which takes the PC samples, looks them up in the matching firmware elf file (assuming you compiled it with debug info in there) and marshals them into something distinctly useful;

>orbtop -e ../STM32F103-skel/ofiles/firmware.elf

98.91% 4360 ** SLEEPING **
 0.36% 16   USB_LP_CAN1_RX0_IRQHandler
 0.18% 8    xTaskIncrementTick
 0.13% 6    Suspend
99.58% 4408 Samples
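Under the hood this is plain statistical profiling: each PC sample is looked up in the elf’s symbol information and tallied against the routine that owns that address. A toy sketch of the lookup step (the address table here is invented for illustration, not taken from the real firmware);

```c
#include <stdint.h>
#include <stddef.h>

/* Invented symbol table entries: name plus [start, end) address range. */
typedef struct { const char *name; uint32_t start, end; } symbolEntry;

static const symbolEntry symbols[] =
{
    { "xTaskIncrementTick", 0x08000100u, 0x08000180u },
    { "Suspend",            0x08000180u, 0x080001C0u },
};

/* Return the routine owning a sampled PC, or NULL if it's unknown. */
static const char *routineForPC(uint32_t pc)
{
    for (size_t i = 0; i < sizeof(symbols) / sizeof(symbols[0]); i++)
        if ((pc >= symbols[i].start) && (pc < symbols[i].end))
            return symbols[i].name;
    return NULL;
}
```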

I think that’s enough for now. I doubt you were expecting a full top implementation for your target, with no target software instrumentation needed, but we’re still nowhere near the limits of what we can do.

Till next time….

SWO – starting the Steroids

Basic Single Wire Output replaces a serial port for debug purposes, but that’s hardly scratching the surface of the full capability of what’s behind that pin. To get more out of it needs additional software on the host side, and that’s where Orbuculum makes its first appearance.

If you’re following along at home, and you’re of that kind of engineering mentality, you will have looked at the SWO output from the last blog post and noticed that every valid data byte was interspersed with a 0x00. That doesn’t matter to most terminal programs (although it will screw up flashy terminal handling in case you were trying to get clever) and it’s really just a way of the ITM reminding you that it’s still there, and would still like to play.

The ITM is documented in The ARMv7-M Architecture Reference Manual, which is a right riveting read. It can output four different types of data;

  • Software Trace: Messages generated by program code
  • Hardware Trace: Messages generated by the DWT, which the ITM then outputs
  • Time Stamps: Either relative to the CPU clock or the SWO clock
  • Extension Packets: These aren’t used much in CORTEX-M, but the one facility they do provide is a ‘page extension’ to extend the number of available stimulus ports from 32 to 256.

The minimalist pseudo-serial port output from the last post is actually a degenerate example of the use of Software Trace outputting one byte messages from ITM channel 0. That’s the reason you’re seeing the 0’s interspersed with the data… but a lot more functionality is available.

An ITM message is, in general, a data packet of 8 to 32 bits. Program code can send out chunks of 8–32 bits via 32 ‘stimulus ports’. A write of 1, 2 or 4 bytes to a stimulus port 0..31 on the target side will result in an ITM Software message being encoded and sent over the link. This effectively means you’ve got 32 individual channels of up to 32-bit width multiplexed onto a single serial link, and handled by the hardware. You can do that kind of thing just using software and a conventional serial port, but the ITM embeds that functionality in code you don’t have to write.
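The encoding the hardware performs for each of those writes is compact: one header byte carrying the stimulus port number and a size code, then the payload bytes, least significant first. A sketch of what that framing looks like (my reading of the software source packet description in the ARMv7-M ARM, not code from orbuculum);

```c
#include <stdint.h>
#include <stddef.h>

/* Frame one ITM software source packet: header bits [7:3] are the
 * stimulus port, bits [1:0] the size code (1 = one payload byte,
 * 2 = two, 3 = four). Returns bytes written, or 0 for a bad size. */
static size_t itmSoftwareEncode(uint8_t *buf, uint8_t port, uint32_t data, size_t len)
{
    static const uint8_t sizeCode[5] = { 0, 1, 2, 0, 3 };
    if ((len != 1) && (len != 2) && (len != 4)) return 0;
    buf[0] = (uint8_t)((port << 3) | sizeCode[len]);
    for (size_t i = 0; i < len; i++)
        buf[1 + i] = (uint8_t)(data >> (8 * i));  /* LSB first */
    return len + 1;
}
```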

This makes the ITM Software channels ideal for separating different types of debug information for processing by the host; Channel 31 is reserved for Operating System support information, and 0 is generally used for 8 bit serial data (as we’ve already seen). The others are pretty much available for whatever purpose you wish. There’s no CMSIS support for anything other than Channel 0, but adding support for the other channels is trivial;

static __INLINE uint32_t ITM_SendChar (uint32_t c, uint32_t ch)
{
    if ((CoreDebug->DEMCR & CoreDebug_DEMCR_TRCENA_Msk) && /* Trace enabled */
        (ITM->TCR & ITM_TCR_ITMENA_Msk) &&                 /* ITM enabled */
        (ITM->TER & (1ul << c)))                           /* ITM Port c enabled */
    {
        while (ITM->PORT[c].u32 == 0);                     /* Port available? */
        ITM->PORT[c].u8 = (uint8_t) ch;                    /* Write data */
    }
    return (ch);
}

I’ll leave it as an exercise for the reader to create 16 and 32-bit variants of the write routine…or to extend this one.

Anyway, while we’re here we’ll take a quick look at the hardware messages that the ITM conveys. These messages originate from the DWT and are encoded in a very similar way to the software ones. However, the message types are much more standardised, and offer an incredibly rich insight into the operation of the CPU, considering how minimal the implementation is. The defined messages are;

  • ID0 : Event Counter: the DWT maintains event counters for a number of distinct event types. When these counters ‘wrap around’ to zero then this event is emitted.
  • ID1: Exception Trace: One of the most versatile messages, this reports which interrupt is Entered, Exited or Returned to. By monitoring exception trace messages the host can identify exactly how interrupts are being handled.
  • ID2: Periodic Program Counter Sample Packets: the DWT can be configured to sample and report the current value of the Program Counter (PC). This allows statistical profiling and code coverage of an application running on the target without any code changes.
  • ID3-23: Data Trace Packets: These messages allow you to trigger events when certain data locations are accessed, values are changed or program locations hit. You might question how these messages differ from the capability afforded by the Debug module, but they’re much more intended for monitoring flows and triggering actions, rather than the interventional stuff that the Debug macrocell is generally used for.

You can see why the DWT is a bit of a Cinderella…it’s doing quite a lot of useful work and there’s a rich seam to be mined here, so we’ll be back to give it more attention in a future post.

Obviously the ITM has limited bandwidth, especially in comparison to the TRACEDATA pins, and it’s quite possible that it can be flooded by multiple data sources contending for its use. When that occurs there is a priority order to the messages that are output, with the end result that if you start seeing overflow messages, you can be reasonably sure that you are losing useful data. Unfortunately, the available bandwidth is the Achilles heel of the TRACESWO pin.

Let’s consider the flexibility that the software source packets afford as a simple example of the use of the ITM. Doing this requires some software on the host side which, until recently, was limited and mostly only available in expensive (costing more than zero) proprietary packages, although OpenOCD and Sigrok both have some decode capability.

Orbuculum was created during early summer 2017 to capture and decode these SWO (and, specifically, ITM) flows. Running on OSX or Linux, Orbuculum has significantly opened up the potential that SWO offers. In its core form it receives the data stream from the ITM (which may, optionally, have been through the TPIU multiplexer), presents it via TCP/IP port 3443 to any number of subsidiary client applications, and simultaneously creates FIFOs delivering the decoded serial data to any local application that wants to use it.

The TCP/IP link is another thing we’ll deal with later, but for now, as an example, let’s consider an application where we want three debug serial flows (debug, clientEvents and Actions) with a 32-bit signed value Z and a 16-bit signed value Temperature.

Orbuculum can connect via a USB logic level UART, a Segger debug probe or, the default, a Black Magic Debug probe. For now, let’s assume we’re using the BMP, but it’s only a couple of slightly different command line options to connect to either a Segger or a logic level USB UART.

Anyway, the command line to achieve all this functionality would be;

orbuculum -b swo/ -c 0,debug,"%c" -c 1,clientEvents,"%c" -c 2,Actions,"%c" -c 3,Z,"Z=%d\n" \
                                  -c 4,Temperature,"Temp=%d\n"

When orbuculum is running it will create, in the directory swo/, the following files;

debug
clientEvents
Actions
Z
Temperature

(+1 more file, which we’re not going to deal with in this post)

These can be streamed via the ‘cat’ command, or copied to a regular file. On the target side writing to one of the ITM channels (0 = debug, 1 = clientEvents etc.) with the appropriate length message will cause that number of octets (comms people say ‘Octets’ rather than ‘Bytes’ cos we’re pedantic) to be sent over the link to pop out and be processed by Orbuculum on the host.

As with the simple serial streaming case we talked about in the last post, some configuration is required to get all the various bits and pieces of SWO pointing in the right direction and running at the same speed. In general you’ll find it’s easier to do that from the debug port rather than target program code, and there are gdb scripts and libraries for exactly that purpose shipped with Orbuculum.

Orbuculum is designed to be a pretty hardy piece of code. It will deal with the target (and the debug interface) appearing and disappearing as the debug cycle takes place. The intention is that it behaves more as a daemon than as a regular application program so that it becomes part of the instrumentation infrastructure that supports your debug activities. Typically, I have several windows open each cat’ing one of the debug flows, and those windows are maintained through restarts, pauses and reboots of the target.

So, you now have the ability to stream multiple, independent, information flows from your target to your host. More sophisticated exploitation of this capability will be the subject of the next few posts, once we’ve dealt with the hardware side messages from the DWT, SWO’s Cinderella.

Single Wire Output

SWO is the underloved younger brother of SemiHosting. Only available on M3 and above, it provides a flexible window into the behaviour of your target. In the simplest use case, it’s a high speed, output-only debug serial port.

Understanding SWO needs a bit of background about the various bits of the ARM CoreSight Debug architecture that participate in it. CORTEX-M doesn’t implement full-on CoreSight; it’s more a lightweight version of it, and there are only three component subsystems that have a role, at least for the kinds of cores we’re talking about today;

  • Embedded Trace Macrocell (ETM): Provides live tracing of what the CPU is actually doing
  • Instrumentation Trace Macrocell (ITM): Provides multi-channel program-controlled data output
  • Data Watchpoint & Trace (DWT): Provides watchpoints and change-tracking output

ARM have a bit of a habit of talking in TLAs (Three Letter Acronyms) that make this stuff more impenetrable than it needs to be, but once you’re in the club you can use the TLAs too to keep the riff-raff out, so try and keep up.

Now, each of these three data sources is configured either programmatically or via the debug port. Their output flows through to the Trace Port Interface Unit (TPIU…I guess they ran out of combinations of only three letters), and that talks to the outside world. The block diagram of the TPIU looks like this;

The TPIU consists of a number of functional blocks; Interfaces to the ETM, ITM and APB (ARM Peripheral Bus, for config and management), a formatter to frame up the data from these sources and a serialiser to turn it into an appropriate format to be sent over the wire. DWT is the poor stepchild here. It sends its data via the ITM and never seems to get mentioned in letters home…but when we talk about ITM, you can assume the DWT is along for the ride too.

The formatter multiplexes the available data sources into packets that are sixteen bytes long. The formatting of this multiplexed packet is really rather clever (see Section D4 in here) and is designed to minimise the overhead that the multiplexing imposes. When you’re only using the TPIU for ITM output (See, you’re getting the hang of these TLAs) the formatter can be bypassed and the ITM data are passed directly to the Serialiser, thus reducing overhead and simplifying the packet format. That is indeed the way the SWO is often used in ‘simple’ implementations.

The serialiser is interesting. You’ll notice it has both a TRACESWO output and a four bit TraceData output too.

The four bit TraceData, in conjunction with the TRACECLK output, is used for ‘parallel trace’. It has higher bandwidth than the single wire output (which allows it to do new things) but, importantly, it’s fed from the same data sources so, modulo bandwidth limitations, you can do the same things with the TRACESWO output that you can do with the TRACEDATA outputs. We’ll deal with TRACEDATA extensively in a future post, but for now TRACESWO is the star of the show.

The serialiser kicks data out of the TRACESWO pin at a rate governed by TRACECLKIN (which is fed on-chip by some clock source or other). Data can be sent out either Manchester encoded, or in the UART format that will be more familiar to many people. You’ll hear the terms NRZ (Non-Return to Zero) and RZ (Return to Zero) used to describe these formats. You can Google for more information easily enough, but the important thing is that an RZ protocol also encodes the clocking information (at the expense of double the bandwidth requirement) whereas an NRZ protocol requires you to know the bit rate ahead of time. If you’re developing custom hardware to swallow the TRACESWO output you’d want to use RZ; if you’re hoping to use a TTL UART, then it’s NRZ all the way. The NRZ TRACESWO output format is hardwired as 8 data bits, 1 stop bit, no parity.
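To see why the RZ flavour costs double the bandwidth, here’s a little sketch of Manchester encoding an octet. I’m using the G.E. Thomas 0→01 / 1→10 convention purely for illustration; check what polarity your capture hardware actually expects;

```c
#include <stdint.h>

/* Manchester-encode one octet, LSB first.  Every data bit becomes two
   half-bit symbols (0 -> 01, 1 -> 10), so the encoded stream carries its
   own clock but needs twice the wire bandwidth of plain NRZ. */
uint16_t manchester_encode(uint8_t octet)
{
    uint16_t out = 0;

    for (int bit = 0; bit < 8; bit++)
    {
        uint16_t symbol = (octet & (1u << bit)) ? 0x2u : 0x1u; /* 10 or 01 */
        out |= symbol << (2 * bit);
    }

    return out;
}
```

Sixteen half-bit symbols out for eight data bits in – that’s the bandwidth tax, and it’s why the receiver never needs to be told the bit rate in advance.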

So, let’s recap where we are. By the appropriate configuration of registers we can get realtime logging, exception and even execution trace out of our CPU via a single pin. We can even get those data out via a logic level UART connection (and yes, you can just capture the output using one of those horrible USB to UART adaptors). Next step – how do you grok the data on the host side?

Well, if all you want is an extra serial output for debug then that’s easy – configure the TPIU to bypass the formatter and to spit out the messages in NRZ format, then make sure you write to ITM channel 0 and hang a USB to UART adaptor off the SWO pin with a terminal application on the host. You’re done. You’ll even find a suitable call in the CMSIS, ITM_SendChar, which will send a single character over the link on channel 0 to drop out on your host.

The magic incantations to get all of this going fall into two parts; the first is chip specific to configure the SWO pin for use, the second is CORTEX-M generic, to configure the ITM, DWT, ETM and TPIU (although, in reality, you can largely ignore the ETM if you’re just wanting simple debug output, and the DWT just needs to provide sync to the ITM). Something like this suffices for a STM32F103;

/* STM32 specific configuration to enable the TRACESWO IO pin */
AFIO->MAPR |= (2 << 24);                        // Disable JTAG to release TRACESWO
DBGMCU->CR |= DBGMCU_CR_TRACE_IOEN;             // Enable IO trace pins for Async trace
/* End of STM32 Specific instructions */

/* Configure the Trace Port Interface Unit */
*((volatile unsigned *)(0xE0040010)) = 625;     // TPIU_ACPR: Output bits at 72000000/(625+1)~=115.2kbps
*((volatile unsigned *)(0xE00400F0)) = 2;       // TPIU_SPPR: Use Async (NRZ) mode pin protocol
*((volatile unsigned *)(0xE0040304)) = 0;       // TPIU_FFCR: Bypass the formatter and send output directly

/* Configure the DWT and ITM */
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk; // Enable access to the trace registers
DWT->CTRL = 0x400003FE;                         // DWT needs to provide sync for ITM
ITM->LAR = 0xC5ACCE55;                          // Unlock access to the ITM registers
ITM->TPR = 0x0000000F;                          // Trace access privilege from user level code, please
ITM->TCR = 0x0001000D;                          // Enable the ITM with sync, using TraceBus ID 1
ITM->TER = 1;                                   // Only enable stimulus port 0, which ITM_SendChar uses

while (1)
    for (uint32_t i = 'A'; i <= 'Z'; i++)
        ITM_SendChar(i);
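That 625 in TPIU_ACPR (the register at 0xE0040010) is just a prescaler; the output bit rate is TRACECLKIN/(ACPR+1). A hypothetical little helper for picking the value looks like this;

```c
#include <stdint.h>

/* Pick the nearest TPIU_ACPR prescaler value for a wanted SWO baud rate;
   the hardware divides TRACECLKIN by (ACPR + 1) to clock the serialiser. */
uint32_t swo_prescaler(uint32_t traceclkin_hz, uint32_t baud)
{
    return ((traceclkin_hz + baud / 2) / baud) - 1; /* round to nearest */
}
```

For an exact 115200 from a 72MHz TRACECLKIN the value would actually be 624; 625 gives you just over 115.0kbps, which is comfortably within the tolerance a UART receiver can lock onto.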

So, there you go, serial port debug with low overhead and without it actually costing you a serial port on the target. The good stuff, however, using the real capabilities of the ITM and DWT, you only get if you spend more effort understanding those two macrocells, and if you put real decode software on the host side. That’s the subject of the next installment.

While you’re waiting for that to land, there’s a short YouTube Video from ARM giving a better overview of this stuff than I ever could.



SemiHosting is one of the oldest ARM debug support mechanisms, and even today it has a couple of advantages over most of the alternatives.

In general the debug options for ARM CORTEX CPUs are confusing to the newcomer. The embedded world expects everyone to already be an expert, with the end result that you’ve got to be living in it for a fairly significant length of time before the fog finally starts to clear.

I’m assuming that anyone reading this stuff has already got their head around I/O bits and serial ports, so let’s concentrate on SemiHosting as our first entry into this wonderful world. This is obviously just an intro; you should look at the ARM documents when you want the real lowdown. I should state upfront that I don’t generally use SemiHosting, I find other techniques more suitable, but this should give you enough of a foothold to start using it if it looks like it floats your boat.

SemiHosting has been around since the 1990s. It allows the application running on your Target (embedded CPU) to access the Input and Output capabilities of the Host that is connected over the debug link. It does this by ‘tunneling’ the I/O requests over that link for various file descriptors. You’ll recall that file descriptors 0 and 1 are stdin and stdout in the Unix world, so one of the things you get with SemiHosting in addition to file access is remote screen and keyboard for your target application. Bargain.

It’s important to be aware that when an app is compiled with SemiHosting it will not work without the debugger connected. This is a big restriction. It also switches the CPU into Debug mode while it’s active, where it doesn’t play nicely with interrupts and stuff. Let’s be honest, SemiHosting is really useful for testing routines that take chunks of data in or throw chunks of data out, because that’s where the file handling bit comes in. It’s not great for realtime oriented stuff either, because it’s not a particularly fast technique. Its big advantages are that it’s properly bidirectional and it integrates cleanly, with no (or very little) glue, with the filesystem on the host.

So, how does it work? Turns out the implementation is slightly different depending on whether you’re on an ARMv6-M or ARMv7-M (M0 or M3/4 etc.) as distinct from any other ARM family CPU. In the former case the BKPT (Breakpoint) instruction is used, other ARM CPUs use SVC (Service) calls…that distinction doesn’t really matter though, unless you’re stepping through machine code trying to figure out what’s going on….so let’s stick with the CORTEX-M case.

When the application on the target wants to perform a SemiHosting call in regular code it performs a BKPT 0xAB instruction with the operation to be performed in R0, and parameters in other registers. A few examples of ARM-set standard actions are;

1 – SYS_OPEN : Open a file on the host
2 – SYS_CLOSE: You can figure this one out
3 – SYS_WRITEC: Write a single character
5 – SYS_WRITE: Write a block
6 – SYS_READ: ...and so it goes on

Obviously each of these calls needs parameters and returns results. The reference above gives you all the info you need on what those actually are…although in reality you mostly use libraries to realise a SemiHosting implementation, so you don’t need this level of detail. One question I always had was why SemiHosting was implemented with BKPT/SVC and not just a library…well, if you think about it, an exception-based calling routine will work anywhere, with any language and from any processor state (pushing the CPU into a Debug state), so it’s a much cleaner implementation than the alternatives that you might dream up.
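To make the marshalling concrete, here’s a host-testable sketch of a SYS_WRITE wrapper. On a real target, do_trap would be a couple of lines of inline assembly ending in BKPT 0xAB with the operation in R0 and a pointer to the parameter block in R1; here it’s a pluggable hook so the marshalling can be exercised on the host. sh_write and do_trap are illustrative names, not anyone’s real API;

```c
#include <stdint.h>
#include <stddef.h>

#define SYS_WRITE 0x05  /* from the ARM-set standard actions above */

/* On target this would be the BKPT 0xAB trap; host-side it's replaceable */
typedef uintptr_t (*semihost_trap_t)(uintptr_t op, uintptr_t param);
semihost_trap_t do_trap;

/* SYS_WRITE returns the number of bytes NOT written, so 0 means success */
uintptr_t sh_write(uintptr_t fd, const void *buf, size_t len)
{
    /* The parameter block convention: file descriptor, buffer, length */
    uintptr_t block[3] = { fd, (uintptr_t)buf, len };
    return do_trap(SYS_WRITE, (uintptr_t)block);
}
```

Writing to file descriptor 1 is exactly how printf ends up on your host’s screen.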

So, we’ve reached the BKPT/SVC handler, and we’ve got our marching orders in the various registers….how does this get conveyed to the connected debugger? That depends on the compiler and debugger you’re using, but let’s stay in a GCC/GDB world where everything is documented and transparent.

In that case the handler marshals everything and sends it over the GDB link. That’s all documented in the Remote Protocol section of the GDB manual, and specifically the File I/O Remote Protocol Extension. I’m not going to regurgitate all of that stuff here for the purposes of padding a blog, but suffice to say that requests from the target eventually pop up at the host end where GDB (or, if you’re using something like Segger or OpenOCD, initially the debug driver) handles it and returns the results back to the target.

OK, so that’s the mechanics, and you understand the limitations, so how to use it in the real world? Turns out it’s pretty straightforward, just add the magic incantation

--specs=rdimon.specs

to your linker options (replace that with

--specs=nosys.specs

when you want to turn SemiHosting off). That will load up the BKPT/SVC handling routines and allow you to use printf/scanf and all the file handling stuff in your application. One thing that folks do forget is an initialisation call that’s needed at the start of main (or leastways, before you do any SemiHostery) if you’re not running newlib;

extern void initialise_monitor_handles(void); /* prototype */

initialise_monitor_handles();                 /* …call this before any SemiHosted I/O */

You’ll probably need to switch on the semihosting options on your host side debug stub, and the MCUOnEclipse site has good info on doing that.

You don’t need to do anything extra if you’re running a Black Magic Probe…one of its big advantages is that it’s all handled natively.

So, there you have it. Zero to SemiHosting-competent in ten minutes. If you can cope with an output-only channel, though, there are better, faster, more flexible options. More to follow.

Debug communication with an ARM CORTEX-M target

Textual debug input/output is a way of life on the desktop. With no screen and no keyboard, surely you’ve got fewer options in an embedded system?

In some ways you’ve got more options for getting debug out of an embedded system than you have on a desktop one. Just the other day I posted an example of using a single digital output pin to convey external information about how busy a system is, something that’s rather more involved to achieve on the desktop. So let’s do a brief survey of some of the options that let you figure out what that inscrutable lump of highly refined sand on your bench is actually doing.

The basic option, that’s been around as long as embedded systems themselves, is the single I/O bit. For output it can be used to indicate entry into a specific bit of code, how loaded the system is, any number of error conditions and a thousand and one other things. Most designs feature at least one LED hooked onto an output pin to give a visual indication to the user without needing any special equipment beyond a Mk.1 eyeball. In my designs I always have a minimum of one (not red) LED which does normal service as a ‘heartbeat’ showing that the system is alive. It serves double duty as the exception indication LED when the system ends up in an unrecoverable situation. Believe me, it can be difficult to spot that situation quickly without it (you’ll implement that on your second design, immediately after you’ve spent an hour staring at your board wondering why it’s not responding)….don’t underestimate how useful that is. Frankly, if you can spare the bits, put an RGB LED on there (a Wurth 150141M173100 is only about 30c) and you’ve got eight different conditions you can show, even if you choose not to provision it for production. Stick that LED on PWM outputs and you’ve got any colour under the rainbow. Perhaps not really too useful, but cool anyway.

On the input side a single bit lets you trigger certain paths through the code or change runtime conditions. It’s very difficult to get a clean switching action on an input bit without a bit of additional circuitry…generally you can work around that by sampling twice (Google ‘Switch Debouncing’ for more than enough examples of how to do that) and software is always cheaper than hardware – in per unit cost at least. The lack of clean switching action can bite you if you sample very quickly or use an interrupt to capture the state change event though…and it’s considerably worse if you just use a shorting wire, pair of tweezers or whatever other conductive implement you happen to have on your desk. The one-liner algorithm description for a software debounce is simple enough; after sensing a state change, wait for 20ms or so and check if the input is still in its new state…if it is, then the transition is valid.
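That one-liner turns into code easily enough. A sketch, with all names illustrative, intended to be called once per millisecond tick;

```c
#include <stdint.h>
#include <stdbool.h>

#define DEBOUNCE_MS 20  /* how long a new state must hold before we believe it */

typedef struct
{
    bool stable;        /* last accepted state */
    bool candidate;     /* raw state we're currently timing */
    uint32_t held_ms;   /* how long candidate has persisted */
} debounce_t;

/* Feed in the raw pin state once per ms; returns the debounced state */
bool debounce_tick(debounce_t *d, bool raw)
{
    if (raw != d->candidate)
    {
        d->candidate = raw; /* new transition -- restart the timer */
        d->held_ms = 0;
    }
    else if (d->candidate != d->stable)
    {
        if (++d->held_ms >= DEBOUNCE_MS)
        {
            d->stable = d->candidate; /* held long enough -- accept it */
        }
    }

    return d->stable;
}
```

Any contact bounce shorter than 20ms just keeps restarting the timer, so it never makes it through to the accepted state.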

Moving on from the single I/O bit approaches, we very quickly end up wanting to spit out serial data; debug strings, data values, error conditions and operational conditions really help colour in what a system is actually doing, as opposed to what you think it should be doing. There’s so much value in a system reporting what it’s up to that we often find output serial ports fitted for debug with no corresponding input, and there are multiple options for getting that.

Let’s consider what the various options for serial I/O are;

  • A real serial port. Normally this is configured for asynchronous operation (i.e. with start and stop bits) and it’s sometimes referred to as a UART (Universal Asynchronous Receiver/Transmitter), which is often used to implement the RS232 communications protocol. It’s not really RS232 though, ‘cos RS232 specifies a lot of things that are ‘interpreted liberally’ in a debug port; the signaling levels might be 0 & 3V rather than the positive and negative 12V signaling that a ‘real’ RS232 port generally uses. With the advent of uber-cheap USB ‘TTL Serial’ interfaces from FTDI and others this kind of debug port has become very popular, and you’ll often find logic level serial interfaces on debug probes like the Black Magic Probe or the Segger JTAG.
  • Overlaid functionality on a debug channel. If we’ve got debug communication established with a CPU via JTAG or SWD then that channel can also be used for bidirectional debug communication. On ARM it’s generally known as ‘Semihosting’ and it’s a virtually no-cost channel in hardware terms, but fast it isn’t. It does have a few distinct advantages though, and we’ll talk about those later.
  • Single Wire Output. When the JTAG interface is in SWD mode there are spare pins, one of which (the one that’s normally used for TDO) can be used for serial debug output. There’s quite a sophisticated infrastructure behind this pin on-chip and it’s a powerful capability. We’ll start to investigate that in a series of future posts. The big problem with SWO is that it’s output only, and if you’ve got a minimal debug setup on your board (SWCLK, SWDIO, Gnd) then SWO needs another pin. The big brother of SWO is TRACE output, which is effectively parallel SWO, but that’s for discussion quite a lot later on.
  • Real Time Terminal (RTT). This one isn’t as well known as the other options, but it leverages the Segger hardware in a very clever way to deliver high speed communication with minimal target overhead. Basically, you set aside an area of memory on the target for ring buffers, and then the debugger dips into those buffers while the target is running to exchange data. Since the debug capability on a CORTEX CPU doesn’t impact the runtime speed of the target this is a pretty quick mechanism; the target ‘cost’ is limited to the ring buffers and the simple memory copies to get the stuff to/from them. Other probes could do this, but generally don’t, at least today.
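The ring buffer trick is worth seeing in miniature. This is an illustrative model of an RTT-style ‘up’ buffer, not Segger’s real control block layout: the target only ever advances the write index and never blocks, while the probe reads the memory behind its back and advances the read index;

```c
#include <stdint.h>

#define RB_SIZE 16  /* tiny, for illustration -- real RTT buffers are bigger */

typedef struct
{
    uint8_t buf[RB_SIZE];
    volatile uint32_t wr;   /* only ever advanced by the target */
    volatile uint32_t rd;   /* only ever advanced by the debug probe */
} ring_t;

/* Returns the number of octets actually accepted -- never blocks the target */
uint32_t rb_write(ring_t *r, const uint8_t *d, uint32_t len)
{
    uint32_t n = 0;

    while (n < len)
    {
        uint32_t next = (r->wr + 1) % RB_SIZE;
        if (next == r->rd)
        {
            break;          /* full -- drop rather than stall the target */
        }
        r->buf[r->wr] = d[n++];
        r->wr = next;
    }

    return n;
}
```

Because each side owns exactly one index, no locking is needed; that’s what makes it safe for the probe to rummage around in target RAM while the code keeps running.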

So, that’s a quick overview of the various techniques I’m aware of, but perhaps there are more (or variations on a theme) that are worth documenting too? Of course, no one of these has to be used exclusively, and it’s quite common to see them used in combination on any given target. As a quick example, when I have a system that gets into a panic condition, I call the following routine;

static bool isSet = false;   /* current state of the heartbeat/error LED */

void GenericsAssertDA(char *msg, char *file, uint32_t line)

/* Lock up tighter than a Ducks A?? and spin, flashing the error LED */

{
    while (1)
    {
        dbgprint("%s: %s line %d" EOL, (msg == NULL) ? "Assert Fail" : msg, file, line);
        GPIO_WriteBit((GPIO_TypeDef *)GPIOport[GETGPIOPORT(PIN_HB_LED)], (1 << GETGPIOPIN(PIN_HB_LED)), (isSet = !isSet));

        uint32_t nopCount = 1250000;
        while (nopCount--)
        {
            __asm__("NOP");
        }
    }
}
…this is just a few lines of code, but they’re mostly there for a good reason. The first thing we do in a panic is to switch off all the tasks, then dump an error message out of the debug serial port, before inverting the state of the error LED. We delay in a busy loop with the NOPs to avoid relying on the majority of the chip being in an operational condition. The nopCount initial value is set to make the LED flicker quite quickly. This sequence is repeated continually in case you miss the first serial output (y’know, cos you didn’t have the serial port connected, or whatever).

A GCC preprocessor definition adds quite a lot of value to what you get out;

#define ASSERT_MSG(x,y) if (!(x)) GenericsAssertDA((y), __FILE__, __LINE__)
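If you want to see what the macro captures, here’s a host-side sketch with GenericsAssertDA stubbed out to record its arguments instead of locking up and flashing the LED (the recording variables are purely for illustration);

```c
#include <stdint.h>

/* Stub: just remember what we were told rather than spinning forever */
static const char *last_msg;
static uint32_t last_line;

void GenericsAssertDA(char *msg, char *file, uint32_t line)
{
    last_msg = msg ? msg : "Assert Fail";
    (void)file;                 /* a real handler would print this too */
    last_line = line;
}

/* The macro from the post -- __FILE__ and __LINE__ are filled in at the
   point of use, so every assert site self-identifies for free */
#define ASSERT_MSG(x, y) if (!(x)) GenericsAssertDA((y), __FILE__, __LINE__)
```

That free file-and-line stamping is the whole point of doing it as a macro rather than a function.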

Over the next few posts I’ll start digging into these debug options, and show just how powerful they really can be with the right processing hanging off the other end.

Now you at least appreciate that there’s a whole range of options for debug communication with your target, and the more sophisticated ones aren’t really more expensive than the simpler ones, they just need more setting up.

To OS or not to OS

On an embedded system, should you have an OS or run on the metal?

This one will run and run, so I’ll call this Part 1 for now.

I’ve spent the majority of my professional career railing against the use of OSes for Embedded Systems. A few years ago I analysed my reasoning behind that and realised a lot of it was founded in the arrogant belief that no-one could write code that was as well optimised as mine. That may or may (more probably) not be true, but there are plenty of other reasons for seriously considering a lightweight OS for your next project.

Following that little thought investigation, I now invert the discussion, and start off with “Why wouldn’t I run an OS underneath this?”. The fact is, on day one, every project starts off small, manageable and with a simple set of needs…you don’t need an OS in that environment, plain and simple. But, as your project grows, you need to do more and more things and, without an OS, you’ll find yourself re-inventing stuff that you get for free in OSville. Ah, you say, but I already have a library for timers, and message passing, and task switching, and queues….congratulations, we call that an OS, it’s just that you didn’t.

There are legitimate reasons for going OS-commando, the main one being if you’re really short of memory (RAM or Flash). Like it or not an OS is going to gobble some of it up (a Mutex semaphore in FreeRTOS on CORTEX-M takes 80 bytes, and that hurts when you’ve only got 4096 of ’em around), so defining your own can really help…but be careful; allocating one bit in the bit-addressable RAM area just saved you 79 7/8 bytes of memory, but it isn’t the end of the story, because you’ve still got the care and feeding of that structure to deal with. It’s surprising just how much Flash memory, in comparative terms, that care and feeding can take, and not too many people would claim that FreeRTOS is the most super-efficient RTOS in its RAM allocation anyway.

Similar considerations apply on the flash side. A reasonably complete FreeRTOS implementation on a STM32F103 in release configuration is about 6K…you can come down a ways from there if you start chopping out options, but the total spend will still be a four digit number, and that’s a fair proportion of a budget that might only be 16K or 32K.

One thing that an OS doesn’t have to do though is slow you down, and that’s the main criticism I hear (and, indeed, was one of my primary prejudices). The fact is that most of the time, for most of your code, the 1-2% overhead the OS brings along for the ride really doesn’t matter. It does matter when you’ve got a time critical task to handle, and that’s often (mostly?) done in interrupt code, so how fast a RTOS handles interrupts is much more important than how it handles base code.

Most Real Time OSes offer ‘Zero latency interrupts’, or some equivalent term. All that really means is that the OS doesn’t pre-service the interrupt for you; it doesn’t grab the interrupt and perform the initial handling of it before passing it off to your code. That does happen in desktop OSes, and you’ll hear the terms ‘top half handler’ and ‘bottom half handler’ used to reflect this split between OS-controlled and Application-controlled code.

With a Zero Latency interrupt the interrupt lands in your code just as it would in the OS-free case, so your response time is exactly the same. Indeed, response time could even be better. How? Well, let’s look at a lazy-assed implementation of an OS free app (one of mine, so I can criticise…it’s available here if you want a laugh). In this app communication is arranged through simple flags….you set a flag in one place, and that triggers a task in another. The code to set a flag looks like this;

void flag_post(uint32_t flag_to_set)
{
    denter_critical();
    _flags |= flag_to_set;      /* the shared flag word (name illustrative) */
    dleave_critical();
}

… and the denter / dleave routines;

void denter_critical(void)
{
    __disable_irq();
    _critDepth++;
}

void dleave_critical(void)
{
    if (!--_critDepth)
    {
        __enable_irq();
    }
}
..so, as you can see, all interrupts are turned off while we go fiddle with flags and critDepth..and during that time the CPU is away with the fairies and isn’t going to respond to any other maskable interrupt, no matter how much it yells. That will show itself up as jitter in interrupt response time  (there is another reason for jitter, we’ll come back to that later).
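The depth counting is the subtle bit; interrupts should only come back on when the outermost dleave fires, however deeply the critical sections nest. A host-side model, with irq_enabled standing in for the real PRIMASK state (the flag and the plain counter are illustrative), makes that behaviour testable;

```c
#include <stdbool.h>

static bool irq_enabled = true;     /* models the CPU's interrupt enable state */
static unsigned int _critDepth = 0; /* how deeply critical sections are nested */

void denter_critical(void)
{
    irq_enabled = false;            /* on target: __disable_irq() */
    _critDepth++;
}

void dleave_critical(void)
{
    if (!--_critDepth)
    {
        irq_enabled = true;         /* only the outermost leave re-enables */
    }
}
```

Forget the counter and a nested leave re-enables interrupts while the outer caller still thinks it’s safe, which is exactly the sort of bug an OS’s tested primitives save you from.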

So, how on earth could an RTOS be faster? Let’s consider the equivalent criticality setting in FreeRTOS for an M3 CPU (you’ll find this in portmacro.h, and I’ve hacked the formatting around a bit);

portFORCE_INLINE static void vPortRaiseBASEPRI( void )
{
    uint32_t ulNewBASEPRI;

    __asm volatile
    (
        "mov %0, %1      \n" \
        "msr basepri, %0 \n" \
        "isb             \n" \
        "dsb             \n" \
        : "=r" ( ulNewBASEPRI ) : "i" ( configMAX_SYSCALL_INTERRUPT_PRIORITY )
    );
}

…not a __disable_irq in sight! What FreeRTOS does is to temporarily raise the minimum priority interrupt that will be recognised by the CPU.  That has exactly the same effect as __disable_irq for any interrupt with a lower priority than whatever is selected for configMAX_SYSCALL_INTERRUPT_PRIORITY, but will leave higher priority interrupts enabled. So, if I really need that fast response, I just give it a super-high priority and it will get serviced sharpish…the only constraint being that I cannot use OS services within that interrupt.
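The masking rule is easy to model; on CORTEX-M a lower priority number means higher urgency, and a nonzero BASEPRI pends anything whose priority number is greater than or equal to it. An illustrative host-side sketch (irq_taken is my name, not an ARM or FreeRTOS API);

```c
#include <stdint.h>
#include <stdbool.h>

/* Would an interrupt with this priority number be taken right now, given
   the current BASEPRI?  Zero BASEPRI means the masking is disabled. */
bool irq_taken(uint8_t irq_priority, uint8_t basepri)
{
    if (basepri == 0)
    {
        return true;            /* no masking -- everything runs */
    }

    return irq_priority < basepri; /* lower number = more urgent = runs */
}
```

So with BASEPRI sitting at configMAX_SYSCALL_INTERRUPT_PRIORITY, anything you’ve given a numerically smaller (more urgent) priority sails straight past the critical section, which is where that better-than-native response comes from.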

End result; I’ve got the option of slightly jittery interrupts and OS support, or interrupts faster than the native case, but if I want to use OS features in conjunction with them then I have to jump through some more hoops. Of course you could do the BASEPRI trick in your own code, but someone has already written, tested, debugged and documented it for you, so why bother?

Finally, remember I said that there was another source of jitter? Well, taking an M3 as an example; it should theoretically be able to always respond to an interrupt within 10 clock cycles, but other factors (bus latencies, peripheral response times, flash caches and speeds etc.) may conspire to prevent that…so you get response jitter. In real world applications it is often more important to be slower and jitter free than to be faster and a bit wobbly, so several manufacturers have added the capability to ‘stretch’ the number of cycles to respond to an interrupt, so it’s always the same. On the NXP LPC134x CPUs, that register is called IRQLATENCY, and has a default value of 0x10, meaning that, in general, the CPU will hit your code in response to an unmasked, highest priority interrupt request 16 clock cycles after the request is generated. Whether that is enough delay to remove jitter in your configuration depends on exactly how you’ve got the whole system configured, so you can put a longer value in that register if you need it.

I started off this post by being a bit anti-OS, which I have been for most of my career, but when you start peeling back the covers you start to understand that an OS, be it FreeRTOS, RTX, ChibiOS, NuttX or one of the hundreds of others that are out there, is really just a big library of code that you don’t have to write for yourself.  Know your problem, know your chip, and don’t just trust your execution environment decisions to blind prejudice.
