SWO Instrumentation – Borrowing

The nice thing about standard file formats is that they’re standard file formats, which means you can use other tools with them…even ones they weren’t really designed for in the first place.

José Fonseca’s gprof2dot is a great example of exactly that.  It takes a variety of file formats and twists and mangles them into nicer graphs than the orbuculum tools produce natively. You’ll need Python and GraphViz installed, then just create your source file exactly the same way as if you were using KCachegrind;

>ofiles/orbstat -e firmware.elf -z test.out

…but then, rather than starting KCachegrind on the resulting output, stick it into gprof2dot instead.  The file is in callgrind format, so the incantation is straightforward;

>gprof2dot -f callgrind test.out > a.dot

…and finally, the dot magic;

>dot -Tpdf a.dot -o o.pdf

and you’ll get a rather nice graph showing you exactly where your code spent its time, with the hot paths highlighted;

You’ll find plenty of options for changing the format and layout on José’s GitHub page, and plenty more tools for grokking callgrind format files floating around on the web.

A cheap shot today, but hopefully it still hit home.

SWO Instrumentation, building the orchestra

We’ve already established that post processing delivers a lot of value just from the perspective of really understanding what your code is doing, so it makes sense that the better the post processing tool, the better that understanding will be. Enter KCacheGrind, considered by some to be the best execution understanding tool out there.

GraphViz is a great set of tools, but if you read the previous article you might have been left with the feeling that we were leaving a lot of really useful information on the cutting room floor. The instrumentation we added to the code delivered not only what was calling what, but it also provided timestamp information (measured in processor clock cycles) indicating how long each routine was taking…and we threw that away. Turning that raw data into useful information isn’t trivial because you’ve got to deal with things like interrupts and call trees, but once you’ve got it there’s a whole treasure trove of information to explore. Let’s grab some data and format it into a file suitable as an input for KCachegrind…back to our old friend orbstat;

>ofiles/orbstat -e firmware.elf -z test.out

The configuration of the target is exactly the same as for the last post, since it’s using the same data as input…just a different format for output. After you’ve got some data, poke it into KCacheGrind;

>kcachegrind test.out

…and you’ll be rewarded with something that looks a bit like this;

So, this all needs a bit of explaining, and I’m no cachegrind expert so you might find Google to be your friend, but let’s give it a go. To the left is the set of routines that were recognized during the sampling interval, together with the total amount of time spent in each of them, both cumulatively (first column) and specifically in that routine (second). The number of calls is the third column and the rest should be pretty obvious…you can click around in those to centre on any routine.

Top right are the various analyses of the code, including source code views, number of cycles taken in the code and the split of execution time between this routine and the routines it has called. You can change the way the metrics are presented to be relative to the whole execution time, or to the parent, and you can express them as percentages or absolute processor cycles…the ‘Callee Map’, for example, allows you to see how the overall execution time divvies up between the called routine and its inferiors, like this;

…and you can see what was going on in the source code too;

To the lower right you’ll also find the disassembly of the matching source code, for when you really want to get down and dirty with what is going on;

There’s one small gotcha with the disassembly view; it uses objdump, and chances are your system-wide objdump might not understand the binaries you’re compiling for your target. You will also find that KCacheGrind can crash because the output of objdump doesn’t match what it expects (specifically, the text ‘Address XXXX is out of bounds’ can appear at the end of a disassembly, which upsets it terribly). No big deal, just create a new file called something like objdumpkc and put the following contents in it;

#!/bin/sh
~/bin/armgcc/bin/arm-none-eabi-objdump "$@" | sed '/is out of bounds./d'

…then mark the file as executable and point the OBJDUMP environment variable to it before calling KCacheGrind, like this;

OBJDUMP=objdumpkc kcachegrind ....

Problem solved.

Most interesting, especially on complex projects, are the graphical views you can get of the relationship between routines and the execution splits between them. Those use graphviz just like in the previous article, but there are a lot more steroids in play.

Those are shown in the bottom right panel (you can re-arrange all this lot by the way if you don’t like the layout) and, again, can be done in absolute, relative or cycles formats…Here’s where my timer handler is spending its time in response to the timer interrupt;

…and here’s the same information as a percentage of the total time the CPU spent doing stuff;

As you can see cachegrind is a much richer, more interactive way of understanding your system. It still has its limits (it’s a static snapshot after all, although CTRL-R will reload the file) but when you’re trying to figure out exactly where your CPU has gone, and which bits of your system need to be re-engineered, it’s a wonderful tool.

The one remaining problem is when to grab the samples, ‘cos at the moment it’s just when you happen to stop the acquisition. Getting more sophisticated (tell me what my system is doing in response to event X, for example) needs us to start looking at triggers. That one is coming up.

SWO – Instrumentation, first tunes

Instrumenting your application, passing the resulting data over the SWO link and post processing it on the host side turns a lot of textual data from the SWO into something that’s much more useful and easily digested. That’s where orbstat comes in.

The GNU compiler is mature and well tested, and over the years various smart people have bolted bits and pieces onto it to make their own lives easier…the trouble is it’s not always very easy to figure out exactly how to use that stuff in your own code, and the instrumentation functionality is a pretty good example of that.

Figuring out where your program is spending its time, what is calling what and just how often is difficult enough on a desktop platform, never mind on an embedded one without the convenience of a screen, keyboard and copious storage. No matter, we can use the functions those smart people already developed, together with the SWO to offload the resulting data, to deliver powerful visualization from even the most inscrutable red LED flasher.

To do this we will be using the __cyg_profile_func_enter and __cyg_profile_func_exit capabilities. The prototypes look something like this;

void __cyg_profile_func_enter (void *func, void *caller);
void __cyg_profile_func_exit (void *func, void *caller);

As you might guess from their names, it’s possible to convince gcc to insert calls to these two functions at the entry and exit of any function. Doing that just needs a single additional incantation on your gcc command line: -finstrument-functions.

…and that’s it! Any code compiled with this option will automatically call the entry and exit routines as it runs. OK, so it’ll be slower than it would be on a normal day, but you’re not running your CPU at 100% load anyway are you? are you?

There’s one gotcha though. -finstrument-functions will instrument every function…including the entry and exit ones, resulting in a vicious circle and a visit to the HardFault_Handler in short order. Fortunately, the smart folks thought that one through too, and you can label individual functions to not be instrumented with the __attribute__ ((no_instrument_function)) decorator. There are probably entire files you don’t want instrumenting too, and they even covered that with an exclude file list that accepts partial matches. I’m not really interested in profiling my OS or the CMSIS, so my exclude list looks like;


So, we’ve got a mechanism for figuring out when we enter and leave routines. All we need to do now is offload that from the CPU to somewhere we can post-process it. That’s a job for the SWO. Here’s a simple implementation that will report function entries and exits over ITM Channel 1;

#include "config.h"

#define TRACE_CHANNEL (30)
#define DELAY_TIME (0)

__attribute__ ((no_instrument_function))
void __cyg_profile_func_enter (void *this_fn, void *call_site)
{
    if (!(ITM->TER&(1<<TRACE_CHANNEL))) return;

    // This is not atomic, but by using the stack for
    // storing oldIntStat it doesn't matter
    uint32_t oldIntStat=__get_PRIMASK();
    __disable_irq();

    while (ITM->PORT[TRACE_CHANNEL].u32 == 0);

    // This is CYCCNT - number of cycles of the CPU clock
    ITM->PORT[TRACE_CHANNEL].u32 = ((*((uint32_t *)0xE0001004))&0x03FFFFFF)|0x40000000;
    while (ITM->PORT[TRACE_CHANNEL].u32 == 0);
    ITM->PORT[TRACE_CHANNEL].u32 = (uint32_t)(call_site)&0xFFFFFFFE;
    while (ITM->PORT[TRACE_CHANNEL].u32 == 0);
    ITM->PORT[TRACE_CHANNEL].u32 = (uint32_t)this_fn&0xFFFFFFFE;

    __set_PRIMASK(oldIntStat);
    for (uint32_t d=0; d<DELAY_TIME; d++) asm volatile ("NOP");
}

__attribute__ ((no_instrument_function))
void __cyg_profile_func_exit (void *this_fn, void *call_site)
{
    if (!(ITM->TER&(1<<TRACE_CHANNEL))) return;

    uint32_t oldIntStat=__get_PRIMASK();
    __disable_irq();

    while (ITM->PORT[TRACE_CHANNEL].u32 == 0);
    ITM->PORT[TRACE_CHANNEL].u32 = ((*((uint32_t *)0xE0001004))&0x03FFFFFF)|0x50000000;
    while (ITM->PORT[TRACE_CHANNEL].u32 == 0);
    ITM->PORT[TRACE_CHANNEL].u32 = (uint32_t)(call_site)&0xFFFFFFFE;
    while (ITM->PORT[TRACE_CHANNEL].u32 == 0);
    ITM->PORT[TRACE_CHANNEL].u32 = (uint32_t)this_fn&0xFFFFFFFE;

    __set_PRIMASK(oldIntStat);
    for (uint32_t d=0; d<DELAY_TIME; d++) asm volatile ("NOP");
}

You’ll recall from our previous discussions that each of the ITM channels runs independently, so you can have this code running while still throwing this visualization out of the port. There’s also a bit of slowdown code in there just in case the link gets flooded but, in reality, it’s not been a big deal as long as you don’t try to use the link heavily for anything else at the same time (the channel pretty much rate-limits itself due to the busy-spins in the code above … the slower your SWO, the slower your CPU).

The CPU needs to be configured to pass the SWO along. Assuming you’re using orbuculum with the gdbtrace.init macros then that bit is easy enough with the following in your .gdbinit (this example is using the Bluepill variant of the Blackmagic probe…it’s an exercise for the reader to use a different probe);

source ../../orbuculum/Support/gdbtrace.init
target extended-remote /dev/ttyACM0
monitor swdp_scan
file ofiles/firmware.elf
attach 1
set mem inaccessible-by-default off
set print pretty load start
# ======================================
# Change these lines for a different CPU or probe
monitor traceswo 2250000
prepareSWD 72000000 2250000 0 0
# ======================================
dwtSyncTAP 3
dwtCycEna 1
ITMTSPrescale 3
ITMEna 1

So, we’ve got the stuff into the PC…how do we handle it? If you just want to see that it’s there you can use orbcat, with a command like;

>orbcat -c 1,"0x%08x\n"

…Great, but not exactly useful. What we really need is something that takes these data, maps them across to the elf file containing the firmware that’s running on the CPU and turns the whole thing back into sensible data.

Fortunately, orbstat does that for you. The magic command you’re looking for is;

>orbstat -e firmware.elf -y xx.dot

This will swallow that output, cross reference it with the debug information in your elf file, and write the whole resulting mess out to a GraphViz input file. All that’s needed then is to ask graphviz nicely to turn it into a perty picture;

>dot -Tpdf xx.dot -o o.pdf

…and here’s the result for a trivially simple app. Suddenly I can see exactly what is calling what, and how often, per second;

This is only the start. Pretty pictures are one thing, and a useful one at that, but being able to fly around that picture and dive in and out of elements of it is something quite a lot better. That’s the next installment.