VGA Video out on BluePill

One major limitation during the development of an embedded system, especially for programmers who are used to PCs, is the lack of video output. That’s exactly what Vidout provides, using only 24% of the CPU on a STM32F103.

It would be great to be able to, at least temporarily, add video to a system during development and then remove it when it goes to the field. Unfortunately, creating video output is generally a rather resource intensive activity, and not one that can normally be patched in for development only. It is also ‘Hard Realtime’, in that the generation of video output must meet tight timing constraints if the display is to look right.

Over the years I’ve built a lot of systems which I’ve claimed to have Realtime characteristics, but even so I’ve never really bumped against the limits of what is possible on a commodity processor, so, with a few evenings free over the Christmas break, I thought I’d see how far I could get at implementing a VGA output for debug use on a pretty much bottom-of-the-barrel CORTEX-M3; the STM32F103C8-based BluePill board.

Software-based video output is a hard realtime problem that demands a deterministic response. Given that pixel durations are measured in nanoseconds, and any mis-timing is immediately obvious as ‘fizzing’ or deformities in the presented image, this was going to be a major challenge, just the thing for chrimbo. Thus, Vidout was born.

The project objective was simple enough; produce stable video using that board and no (or minimal) additional components. The starting point was the December 2012 blog by Artekit that produced 400×200 bitmapped video using this processor…so it was obviously possible to create some kind of output. A quick read-through of that blog revealed a few limitations;

  • The video output was a bitmap buffer only, which means you need a lot of RAM to store it (10K Bytes). This is a heck of a hit on a part with only 20K available in the first place, some of which is presumably already used by your target application!
  • It uses three interrupts and relies on the interaction between two timers to generate the precise signalling needed. That’s a lot of resources being used on a constrained part.
  • The code is GPL…there’s nothing wrong with GPL, but it can make it difficult to fold the code into projects, especially if you want to be able to leave it in there as ‘sleeper’ code.
  • I didn’t write it.

What I really wanted was a character oriented driver using minimal resources that could be bolted into existing projects for debug and monitoring purposes. More powerful CPUs have been used to create video exploiting configurable ‘raster generators’ that create the image dynamically on the fly (e.g. by Cliff Biffle with ‘Glitch’). The combination of these two ideas form the basis of this new implementation.

Martin Hinner has collected the VGA standard timing information on his site, and that’s a great resource. Vidout uses the same timing as the Artekit article – the VGA 800×600 output frame, which leads to a line time of 28.444uS and a frame rate of 56Hz. With these timings, and with each horizontal pixel ‘doubled up’ to give a horizontal resolution of 400 pixels, each pixel lasts only a few tens of nanoseconds…if we can’t hit those timings accurately then the video will be corrupt. For simplicity, and due in part to constraints set by this specific CPU, the pinout is the same as the Artekit code.

Assuming for a moment that we have the source material for an image to be displayed (remember, in Artekit that’s just a static block of memory) then there are three distinct and separate tasks to be performed;

  • Creating and maintaining the ‘frame protocol’ so the monitor will display the image
  • Calculating the pixels to be output for each line of the frame
  • Outputting the pixels for each line

In Vidout these tasks are all performed in vidout.c, and we’ll run through them in turn;

The frame protocol

A single VGA 800×600 frame consists of the following distinct elements;

  • A frame sync pulse, lasting 57uS.
  • A ‘back porch’ of 22 lines (plus any additional lines to centre the image vertically)
  • A sequence of 28.444uS image lines containing the actual data to be displayed
  • Any remaining lines to complete the frame

A timer is used to generate the horizontal line timing. Two channels are used; the first one actually generates the line sync pulse and the second one triggers the line state machine. This allows the ‘back porch’ to be produced automatically by the difference between the two compare values, thus avoiding software timing loops, which are never a good thing.
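For orientation, here's a minimal sketch of the kind of timer setup involved, using CMSIS register names for the STM32F103. The choice of TIM1 and the exact compare values are illustrative assumptions (at 72MHz a 28.444uS line is 2048 counts), not the actual Vidout configuration;

#include "stm32f10x.h"

void lineTimerInit(void)
{
    RCC->APB2ENR |= RCC_APB2ENR_TIM1EN;  /* Clock the timer                        */
    TIM1->ARR    = 2047;                 /* 28.444uS line period at 72MHz          */
    TIM1->CCR1   = 144;                  /* ~2uS HSYNC pulse width on channel 1    */
    TIM1->CCR2   = 400;                  /* Fire the line state machine once the
                                            sync + back porch time has elapsed     */
    TIM1->CCMR1 |= (0x6 << 4);           /* OC1M = PWM mode 1; CH1 output active
                                            while CNT < CCR1                       */
    TIM1->CCER  |= TIM_CCER_CC1E;        /* Route the sync pulse to the CH1 pin    */
    TIM1->BDTR  |= TIM_BDTR_MOE;         /* TIM1 outputs need the master enable    */
    TIM1->DIER  |= TIM_DIER_CC2IE;       /* Interrupt on the CH2 compare match     */
    TIM1->CR1   |= TIM_CR1_CEN;          /* Go (the NVIC enable for TIM1_CC_IRQn
                                            is still needed elsewhere)             */
}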

Video Line Output

The line handler is contained in TIM_IRQHandler and runs in response to the timer interrupt. In Pseudo-code it looks something like this;


Clear the interrupt and move to the next scanline
Stop any existing DMA that's in progress

switch (based on scanLine of Frame)
{
case FRAME_START ... FRAME_BACKPORCH - 1:
  Enable Vertical Sync Pulse

case FRAME_BACKPORCH ... FRAME_OUTPUT_START - 1:
  Output a blank line

case FRAME_OUTPUT_START ... FRAME_OUTPUT_END:
  Start DMA output of the prepared pixel line
  IF (this is the last repetition of this line)
  {
    Set a DMA interrupt to occur at the end of this transfer
    Make sure the other buffer will be transmitted next time
  }

case FRAME_OUTPUT_END + 1 ... FRAME_END - 1:
  Send out a zeroed line

case FRAME_END:
  Reset to the top of the frame
}

You may not have seen this ‘…’ syntax before – it’s a GCC extension for covering multiple case values. If your compiler doesn’t have it then you can just, rather more untidily, substitute if ((x>a) && (x<b)) kinds of filters if you prefer.
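By way of illustration (the line values here are arbitrary stand-ins, not Vidout's real frame constants), the two styles look like this;

#include <stdio.h>

static void classify(int scanLine)
{
    /* GCC extension: "low ... high" covers an inclusive range of case values */
    switch (scanLine)
    {
        case 0 ... 21:
            printf("%d: blanking\n", scanLine);
            break;

        default:
            printf("%d: active\n", scanLine);
            break;
    }

    /* Portable (if less tidy) equivalent for compilers without the extension */
    if ((scanLine >= 0) && (scanLine <= 21))
    {
        /* ...blanking handling... */
    }
}

int main(void)
{
    classify(3);
    classify(100);
    return 0;
}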

This is the most timing critical part of the whole implementation so it’s run from RAM (to avoid wait states) and is maximally optimised. Wait states are a huge problem when you’re running tight realtime responses, especially with interrupts in the mix; Flash memory is too slow to feed instructions to the CPU at its full operating speed, so a small, fast, cache is inserted between the CPU and the Flash. This buffers the Flash and generally fixes the majority of the problem. Unfortunately, interrupts generate exceptions to the regular flow of execution and the benefit of the cache is negated. Whenever possible, keep interrupt handlers in fast RAM and not slow Flash, to avoid a disproportionate performance hit both to the interrupt handler and the base level code that’s getting interrupted.
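With GCC the usual way of doing that is a section attribute on the handler, matched by an output section in the linker script that lives in RAM and is copied out of Flash by the startup code (the repository ships a suitable linkfile, as noted later). A hedged sketch, where the section name is just a convention rather than anything Vidout-specific;

/* Place this routine's code in a RAM-resident section. ".ramfunc" must match
   whatever your linker script locates (and your startup code copies) into RAM. */
#define RAM_FUNC __attribute__((section(".ramfunc"), noinline))

RAM_FUNC void TIM_IRQHandler(void)
{
    /* ...the timing critical line handling described above... */
}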

The line handler operates as a state machine which performs actions depending on which line of the frame is being processed. You can easily see from the code each of the segments of the frame being output, with the biggest part being the output of the active display lines.

The output of the display pixels is the heart of the system and is performed using a SPI peripheral fed from a DMA channel. The transfer is triggered by this routine and it’s essential that it is triggered at exactly the same time (relative to the line sync pulse) for each line, otherwise jitter will occur which will be visible as fizzing around the characters. This is a particular issue in this implementation because each individual line is output twice to ‘stretch’ the Y resolution, which means the handling for each new or repeated line is slightly different; You can see the practical consequence of this where the DMA IFCR register Transmission Complete (TC3) bit is reset before every line transmission even though it’s actually only set on every other line transmission…this simply equalises the time taken on both paths through the code.
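For reference, the per-line 'kick' on the STM32F103 comes down to a handful of register writes. This is a hedged sketch only; the buffer name and transfer length are illustrative, SPI1 is assumed to already be configured for DMA-fed transmission, and the DMA1 channel 3 / SPI1_TX pairing follows from the TC3 flag mentioned above;

#include "stm32f10x.h"

#define LINE_WORDS (25)                      /* 400 pixels as 16 bit words, illustrative */
extern uint16_t pixelBuffer[2][LINE_WORDS];  /* double line buffer, illustrative name    */

static void startLineDMA(uint32_t half)
{
    DMA1_Channel3->CCR  &= ~1U;              /* EN=0; the channel must be disabled
                                                before it can be reloaded               */
    DMA1->IFCR           = DMA_IFCR_CTCIF3;  /* Clear TC3 (done on every line so that
                                                both code paths take the same time)     */
    DMA1_Channel3->CMAR  = (uint32_t)&pixelBuffer[half][0];
    DMA1_Channel3->CPAR  = (uint32_t)&SPI1->DR;
    DMA1_Channel3->CNDTR = LINE_WORDS;
    DMA1_Channel3->CCR  |= 1U;               /* EN=1; SPI1_TX requests now drain the
                                                buffer out of the MOSI pin              */
}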

Video Line Generation

So, we now know how the frame protocol is generated, but where do the pixel data for the frame come from? Well, since each line is repeated twice there are two line intervals (2×28.44uS) worth of time available to calculate each row of output, so this can be done in a separate, lower priority, thread of execution than the frame handler. This job is done by the DMA_CHANNEL_IRQHandler.

Two lines of output pixel data are maintained; This allows a line to be output while the next one is being generated. The basic process is that as soon as a line has been transmitted an interrupt is triggered that starts the creation of the next one. This interrupt is deliberately set to have a lower priority than the line interrupt so that the second ‘copy’ of a line can be transmitted while the next one is still being generated.


Acknowledge Interrupt
if (attempting to output line greater than number on screen)
{
  Setup first line ready for next frame into one half of pixelbuffer
  Setup zeros in other half of pixelbuffer
}
else
{
   Prepare next line for output into the free half of the pixelbuffer
}

Pixel data lines are created by indexing into a character font using a character index from the displayFile (in the rasterLine routine)….this allows the display file to just contain the characters (for an 8×16 font, that’s 50 characters per line at a 400 pixel resolution) and the pixel buffer to be constructed dynamically as it’s needed. That means that only 900 bytes are needed for a complete 50×18 screen displayed at 400×288 pixel resolution, rather than the 14KBytes that would be needed for a full pixel based representation. It also makes manipulation of this buffer much faster, simply because there’s less of it.
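A hedged sketch of the sort of expansion rasterLine performs (the names and array layouts here are illustrative assumptions based on the description above, not the actual Vidout data structures);

#include <stdint.h>

#define CHARS_PER_LINE (50)
#define TEXT_ROWS      (18)

extern const uint8_t font8x16[256][16];                /* 8 pixels (one byte) per glyph row */
extern uint8_t displayFile[TEXT_ROWS][CHARS_PER_LINE]; /* 900 bytes of character codes      */

/* Expand one pixel row of one text row into an output line buffer */
static void rasterTextLine(uint8_t *pixelLine, uint32_t textRow, uint32_t glyphRow)
{
    for (uint32_t c = 0; c < CHARS_PER_LINE; c++)
    {
        /* Look up the character in the display file, then the matching row of its glyph */
        pixelLine[c] = font8x16[displayFile[textRow][c]][glyphRow];
    }
}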

The original design was intended to be text only, but quite often it is also useful to have a small amount of graphical output. Full screen graphics are prohibitive but a small window is manageable. For this reason rasterLine (supported by displayFile) also supports a small, configurable, graphical window which can be overlaid anywhere on screen. The overall effect is a large graphic ‘sprite’ that can be moved freely around.

The Result

Results were better than expected. It is possible to generate good quality VGA video output at resolutions of up to 100×36 characters, although you’ve got the CPU on its knees at that point with near enough 100% of it in use;

Hires (800 x 400) output displaying 100 x 18 text.

Interestingly, this mode bounces against the limits of the memory busses too. If the graphic is overlaid on the text you start to see fizzing towards the right hand end of the text line – that’s because the DMA is having to wait for access to the RAM, and it causes visible artifacts.

More realistically 50×18 is easily possible, even when running with a partial graphic window. Even cooler, the graphic window can be moved around with no impact on the text;

Hires (800 x 400) displaying 50 x 18 text

Here’s a video showing how the graphic and text layers can be moved independently (this is the code in the example main.c you’ll find in the repository).

Size isn’t too shabby either, bearing in mind that there’s a complete 8×16 character set gobbling up 4K of the Flash (which you can obviously reduce if you don’t care about certain characters);

~/Develop/vidout$ make
Compiling thirdparty/CMSIS/src/system_stm32f10x.c
Compiling thirdparty/CMSIS/src/core_cm3.c
Compiling vidout/rasterLine.c
Compiling vidout/displayFile.c
Compiling vidout/vidout.c
Compiling app/main.c
Assembling thirdparty/CMSIS/src/startup_stm32f10x_md.s
Built Release version
text data bss dec hex filename
7704 40 3140 10884 2a84 ofiles/firmware.elf
~/Develop/vidout$

This size is also with a 192×80 graphic panel inserted. Without that the bss comes down to around 1220 Bytes.

Performance Testing

As you know by now, I’m a bit obsessive about knowing just how busy my CPU is. Producing the VGA image only requires code running under interrupt, so it’s pretty easy to instrument the system to measure this. Simply place AM_BUSY calls at the start of each interrupt routine and an AM_IDLE call before entering WFI in the idle loop. Pin B12 will be high whenever the CPU is doing something, thus giving an easy readout of the performance of the code. In the previous article on the subject I used a couple of resistors and a multimeter to give me a direct readout of this figure, but there’s another useful trick here; Just put your scope onto the output pin and low pass filter the output using its maths facilities – the resulting trace shows you just how busy your CPU is over time and, by adjusting the cutoff frequency, you can set different ‘time horizons’ for assessing this busy factor.
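The instrumentation itself can be as simple as a pin set and a pin clear. A minimal sketch for pin B12 on the STM32F103 (assuming the pin has already been configured as an output; the macro names simply follow the convention used above);

#include "stm32f10x.h"

/* PB12 high whenever the CPU is doing something, low while it sleeps */
#define AM_BUSY (GPIOB->BSRR = (1U << 12))         /* atomic set   */
#define AM_IDLE (GPIOB->BSRR = (1U << (12 + 16)))  /* atomic reset */

void idleLoop(void)
{
    while (1)
    {
        AM_IDLE;
        __WFI();     /* sleep; each interrupt handler starts with AM_BUSY */
    }
}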

Here’s the output for Vidout based on a 1KHz cutoff frequency. The drop in busy factor when pixel lines don’t need to be generated (at the start and end of a frame) is very easy to see;

Busy/Free pin (Yellow Trace) with Low Pass Filtered version (Pink Trace)

The yellow trace is the state of the busy/free pin and, as you can see, it’s changing too fast to really be visible, but it’s the low-passed version of this trace that is really quite revealing – the dip at the end of one frame and the start of the next, when no pixel lines are being generated, is clearly visible. The ‘bump’ is the part of the screen where there’s both a graphic block and text (so you can see the incremental impact of the graphic block) and the rest of the line is flat. Overall, we’re using about 24% of the CPU to create this video output.

Using Vidout

You can find Vidout on github, complete with a demonstration main app. Using it is trivial. Ensure that at least one very high priority interrupt level is available – that will be used for the Line Handler. Absolutely nothing must get in the way of that being able to run immediately the timer fires. A lower priority interrupt is used for the Line Creator. Simply call vidInit from your code and the video generator will be installed and will start outputting. A handle to a displayFile will be returned. It is by manipulating the displayFile that the screen is updated….take a look in displayFile.h for APIs to manipulate both the graphic and text elements of a screen, while main.c gives you an example of how to use it all.
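A hedged sketch of what that looks like in practice; the exact vidInit signature and the dfWriteString call here are hypothetical placeholders, and displayFile.h plus the example main.c are the authority on the real API;

#include "vidout.h"        /* assumed header name */
#include "displayFile.h"

int main(void)
{
    /* Install the video generator; from here on the video runs under interrupt */
    struct displayFile *d = vidInit();

    /* Hypothetical call: update the text layer via whatever displayFile.h provides */
    dfWriteString(d, 0, 0, "Hello from Vidout");

    while (1)
    {
        /* ...the rest of your application carries on as normal... */
    }
}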

The good stuff is all in the vidout directory, which you can simply copy into your own project, but be careful to make sure you have the support in your linkfile for RAM based routines. There are comments explaining this in the source, and there’s a suitable linkfile there that you can copy from. The code uses CMSIS to identify registers and bits, but if you’re averse to CMSIS then you can easily replace those definitions with direct memory addresses, there aren’t so very many of them and they’re all in vidout.c.

Porting

Porting Vidout to other CPUs should be straightforward. All that’s needed is a timer that can generate the horizontal sync pulse and trigger the state machine, and a peripheral that can pump bits out quickly enough to deliver the image data to the monitor. In the STM32F103 implementation that’s done by SPI, but other CPUs have other peripherals that can also do it (perhaps direct DMA output or special programmable logic such as exists on the i.MX RT or PSoC parts)….if you do port Vidout to another CPU, please send the patches so they can be folded into the repository for others to use.

Alan Assis is busy porting Vidout to NuttX so it’s worth keeping an eye on his blog for the current status. Once initialised, Vidout only uses interrupt level code so it’s pretty much transparent to any OS that might be running – it certainly runs alongside FreeRTOS just fine.

Enjoy.

SWO Instrumentation – Borrowing

The nice thing about standard file formats, is that they’re standard file formats, which means you can use other tools with them…even ones you hadn’t really designed for in the first place.

José Fonseca’s gprof2dot is a great example of exactly that.  It takes a variety of file formats and twists and mangles them into nicer graphs than the orbuculum tools do natively. You’ll need Python and GraphViz installed, then just create your source file exactly the same as if you were using KCacheGrind;

>ofiles/orbstat -e firmware.elf -z test.out

…but then, rather than starting KCachegrind on the resulting output, stick it into gprof2dot instead.  The file is in callgrind format, so the incantation is straightforward;

>gprof2dot -f callgrind test.out > a.dot

…and finally, the dot magic;

>dot -Tpdf a.dot -o o.pdf

and you’ll get a rather nice graph showing you exactly where your code spent its time, with the hot paths highlighted;
You’ll find plenty of options for changing the format and layout on José’s github page, and plenty more tools for grokking callgrind format files floating around on the web. 

A cheap shot today, but hopefully it still hit home.

SWO Instrumentation, building the orchestra

We’ve already established that post processing delivers a lot of value just from the perspective of really understanding what your code is doing, so it makes sense that the better the post processing tool, the better that understanding will be. Enter KCacheGrind, considered by some to be the best execution understanding tool out there.

GraphViz is a great set of tools, but if you read the previous article you might have got left with the feeling that we were leaving a lot of really useful information on the cutting room floor. The instrumentation we added to the code delivered not only what was calling what, but it also provided timestamp information (measured in processor clock cycles) indicating how long each routine was taking…and we threw that away. Turning that raw data into useful information isn’t trivial because you’ve got to deal with things like interrupts and call trees, but once you’ve got it there’s a whole treasure trove of information to explore. Let’s grab some data and format it into a file suitable as an input for KCacheGrind…back to our old friend orbstat;

>ofiles/orbstat -e firmware.elf -z test.out

The configuration of the target is exactly the same as for the last post, since it’s using the same data as input…just a different format for output. After you’ve got some data, poke it into KCacheGrind;

>kcachegrind test.out

…and you’ll be rewarded with something that looks a bit like this;

So, this all needs a bit of explaining, and I’m no cachegrind expert so you might find Google to be your friend, but let’s give it a go. To the left is the set of routines that were recognized during the sampling interval, together with the total amount of time spent in each of them, both cumulatively (first column) and specifically in that routine (second). The number of calls is the third column and the rest should be pretty obvious….you can click around in those to centre on any routine.

Top right are the various analyses of the code, including source code views, number of cycles taken in the code and the split of execution time between this routine and the routines it calls. You can change the way the metrics are presented to be relative to the whole execution time, or to the parent, and you can express them as percentages and absolute processor cycles…..the ‘Callee Map’, for example, allows you to see how the overall execution time divvies up between the called routine and its inferiors, like this;

…and you can see what was going on in the source code too;

To the lower right you’ll also find the disassembly of the matching source code, for when you really want to get down and dirty with what is going on;

There’s one small gotcha with the disassembly view; it uses objdump and chances are your system-wide objdump might not understand the binaries you’re compiling for your target. You will also find that KCacheGrind can crash because the format of objdump doesn’t match what it expects (specifically, the text Address XXXX is out of bounds can appear at the end of a disassembly, which upsets it terribly). No big deal, just create a new file called something like objdumpkc and put the following contents in it;

#!/bin/sh
~/bin/armgcc/bin/arm-none-eabi-objdump "$@" | sed '/is out of bounds./d'

…then mark the file as executable and point the OBJDUMP environment variable to it before calling KCacheGrind, like this;

OBJDUMP=objdumpkc kcachegrind ....

Problem solved.

Most interesting, especially on complex projects, are the graphical views you can get of the relationship between routines and the execution splits between them. Those use graphviz just like in the previous article, but there are a lot more steroids in play.

Those are shown in the bottom right panel (you can re-arrange all this lot by the way if you don’t like the layout) and, again, can be done in absolute, relative or cycles formats…Here’s where my timer handler is spending its time in response to the timer interrupt;

…and here’s the same information as a percentage of the total time the CPU spent doing stuff;

As you can see cachegrind is a much richer, more interactive way of understanding your system. It still has its limits (it’s a static snapshot after all, although CTRL-R will reload the file) but when you’re trying to figure out exactly where your CPU has gone, and which bits of your system need to be re-engineered, it’s a wonderful tool.

The one remaining problem is when to grab the samples, ‘cos at the moment it’s just when you happen to stop the acquisition. Getting more sophisticated (tell me what my system is doing in response to event X, for example) needs us to start looking at triggers. That one is coming up.

SWO – Instrumentation, first tunes

Instrumenting your application, passing the resulting data over the SWO link and post processing it on the host side turns a lot of textual data from the SWO into something that’s much more useful and easily digested. That’s where orbstat comes in.

The GNU compiler is mature and well tested, and over the years various smart people have bolted bits and pieces onto it to make their own lives easier…the trouble is it’s not always very easy to figure out exactly how to use that stuff in your own code, and the instrumentation functionality is a pretty good example of that.

Figuring out where your program is spending its time, what is calling what and just how often is difficult enough on a desktop platform, never mind about on an embedded one without the convenience of a screen, keyboard and copious storage. No matter, we can use the functions those smart people already developed, together with the SWO to offload the resulting data, to deliver powerful visualization from even the most inscrutable red led flasher.

To do this we will be using the __cyg_profile_func_enter and __cyg_profile_func_exit capabilities. The prototypes look something like this;

void __cyg_profile_func_enter (void *func, void *caller);
void __cyg_profile_func_exit (void *func, void *caller);

As you might guess from their names, it’s possible to convince gcc to insert calls to these two functions at the entry and exit of any function. Doing that just needs a single additional incantation on your gcc command line: -finstrument-functions.

…and that’s it! Any code compiled with this option will automatically call the entry and exit routines as it runs. OK, so it’ll be slower than it would be on a normal day, but you’re not running your CPU at 100% load anyway are you? are you?

There’s one gotcha though. -finstrument-functions will instrument every function…including the entry and exit ones, resulting in a vicious circle and a visit to the HardFault_Handler in short order. Fortunately, the smart folks thought that one through too, and you can label individual functions to not be instrumented with the __attribute__ ((no_instrument_function)) decorator. There are probably entire files you don’t want instrumenting too, and they even covered that with an exclude file list that accepts partial matches. I’m not really interested in profiling my OS or the CMSIS, so my exclude list looks like;

-finstrument-functions-exclude-file-list=CMSIS,FreeRTOS
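Pulled together on the command line, that looks something like this (the paths and file names are illustrative);

>arm-none-eabi-gcc -c -O2 -g -mcpu=cortex-m3 -mthumb \
      -finstrument-functions \
      -finstrument-functions-exclude-file-list=CMSIS,FreeRTOS \
      app/main.c -o ofiles/main.o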

So, we’ve got a mechanism for figuring out when we enter and leave routines. All we need to do now is offload that from the CPU to somewhere we can post-process it. That’s a job for the SWO. Here’s a simple implementation that will report function entries and exits over a dedicated ITM channel (set by TRACE_CHANNEL below);

#include "config.h"
#define TRACE_CHANNEL (30)
#define DELAY_TIME (0)

__attribute__ ((no_instrument_function))
              void __cyg_profile_func_enter (void *this_fn, void *call_site)
{
    if (!(ITM->TER&(1<<TRACE_CHANNEL))) return;
    uint32_t oldIntStat=__get_PRIMASK();

    // This is not atomic, but by using the stack for
    //storing oldIntStat it doesn't matter
    __disable_irq();
    while (ITM->PORT[TRACE_CHANNEL].u32 == 0);

    // This is CYCCNT - number of cycles of the CPU clock  
    ITM->PORT[TRACE_CHANNEL].u32 = ((*((uint32_t *)0xE0001004))&0x03FFFFFF)|0x40000000;
    while (ITM->PORT[TRACE_CHANNEL].u32 == 0);
    ITM->PORT[TRACE_CHANNEL].u32 = (uint32_t)(call_site)&0xFFFFFFFE;
    while (ITM->PORT[TRACE_CHANNEL].u32 == 0);

    ITM->PORT[TRACE_CHANNEL].u32 = (uint32_t)this_fn&0xFFFFFFFE;
    for (uint32_t d=0; d<DELAY_TIME; d++) asm volatile ("NOP");

    __set_PRIMASK(oldIntStat);
}

__attribute__ ((no_instrument_function))
              void __cyg_profile_func_exit (void *this_fn, void *call_site)
{
    if (!(ITM->TER&(1<<TRACE_CHANNEL))) return;
    uint32_t oldIntStat=__get_PRIMASK();
    __disable_irq();
    while (ITM->PORT[TRACE_CHANNEL].u32 == 0);
    ITM->PORT[TRACE_CHANNEL].u32 = ((*((uint32_t *)0xE0001004))&0x03FFFFFF)|0x50000000;
    while (ITM->PORT[TRACE_CHANNEL].u32 == 0);
    ITM->PORT[TRACE_CHANNEL].u32 = (uint32_t)(call_site)&0xFFFFFFFE;
    while (ITM->PORT[TRACE_CHANNEL].u32 == 0);
    ITM->PORT[TRACE_CHANNEL].u32 = (uint32_t)this_fn&0xFFFFFFFE;
    for (uint32_t d=0; d<DELAY_TIME; d++) asm volatile ("NOP");
    __set_PRIMASK(oldIntStat);
}

You’ll recall from our previous discussions that each of the ITM channels runs independently, so you can have this code running while still throwing this visualization out of the port. There’s also a bit of slowdown code in there just in case the link gets flooded but, in reality, it’s not been a big deal as long as you don’t try to use the link heavily for anything else at the same time (the channel pretty much rate-limits itself due to the busy-spins in the code above … the slower your SWO, the slower your CPU).

The CPU needs to be configured to pass the SWO along. Assuming you’re using orbuculum with the gdbtrace.init macros then that bit is easy enough with the following in your .gdbinit (this example is using the Bluepill variant of the Blackmagic probe…it’s an exercise for the reader to use a different probe);

source ../../orbuculum/Support/gdbtrace.init
target extended-remote /dev/ttyACM0
monitor swdp_scan
file ofiles/firmware.elf
attach 1
set mem inaccessible-by-default off
set print pretty load start
# ======================================
# Change these lines for a different CPU or probe
enableSTM32F1SWD
monitor traceswo 2250000
prepareSWD 72000000 2250000 0 0
# ======================================
dwtSyncTAP 3
dwtCycEna 1
ITMId 1
ITMGTSFreq 3
ITMTSPrescale 3
ITMTXEna 1
ITMSYNCEna 1
ITMEna 1
ITMTER 0 0xFFFFFFFF
ITMTPR 0xFFFFFFFF

So, we’ve got the stuff into the PC…how do we handle it? If you just want to see that it’s there you can use orbcat, with a command like;

>orbcat -c 1,"0x%08x\n"
0x414aa2e0
0x08000efe
0x0800770c
0x514aa46c
0x08000efe
0x0800770c
0x414ab5b6
0x08001f42
0x080076ac
0x514ac871
0x08001f42
0x080076ac
0x414b0212
0xfffffffc
0x08006200
0x414b03cd
0x08006400

…Great, but not exactly useful. What we really need is something that takes these data, maps them across to the elf file containing the firmware that’s running on the CPU and turns the whole thing back into sensible data.

Fortunately, orbstat does that for you. The magic command you’re looking for is;

>orbstat -e firmware.elf -y xx.dot

This will swallow that output, cross reference it with the debug information in your elf file, and write the whole resulting mess out to a GraphViz input file. All that’s needed then is to ask graphviz nicely to turn it into a perty picture;

>dot -Tpdf xx.dot -o o.pdf

…and here’s the result for a trivially simple app (Click for a bigger image). Suddenly I can see exactly what is calling what, and how often, per second;

This is only the start. Pretty pictures are one thing, and a useful one at that, but being able to fly around that picture and dive in and out of elements of it is something quite a lot better. That’s the next installment.

SWO – There’s an App for that

Exploiting the SWO link for software logging and hardware state reporting delivers huge advantages in comparison with traditional debug techniques, but when extended with applications on the host side the benefit gained is amplified considerably.

The creation of apps for Orbuculum is really only just getting underway.  Any number of applications can attach to it simultaneously to deliver multiple views and insight into the operation of the target, and only a few of those have been created so far.

The main orbuculum program collects data from the SWO pin on the CPU and both processes it locally and re-issues it via TCP/IP port 3443. The format of these data is exactly the same as that which is issued from the port, which also happens to be the format that the Segger JLink probe issues. By default the JLink uses port 2332…believe it or not, the choice of port 3443 for Orbuculum was made for very specific reasons which did not include consideration of the JLink port number, so that was quite a coincidence! Applications designed to use the TCP/IP flow from a JLink device can now also be used with a USB Serial port, or Black Magic Probe. Conversely, with a bit of simple modification to change the port number, orbuculum post-processing apps can hook directly to a JLink device (just change the source port number), or via orbuculum itself – everyone’s a winner.

The orbuculum suite currently includes a couple of simple example applications that use the infrastructure…creating new ones is trivial based on the code you can find in these examples.

The simplest of these existing apps is orbdump, which dumps data directly to a file. That’s useful when you just want to take a sample period for later processing…perhaps pushing it into something like Sigrok for processing in conjunction with other data. A command line something like this will dump 3 seconds of data directly into the file output.swo;

>orbdump -l 3000 -o output.swo

We’ve already mentioned orbtop. That tool is used for creating unix-style top output, but it features one little Easter Egg. There’s an option, -o <filename>, which dumps the processed sample data to a file, and an example shell script in Support/orbplot_top uses these data to produce pie charts of the distribution of the CPU load, a bit like this;

Frequently an application needs to merge multiple data sources as a precursor to using it in other apps. If you’ve got orbuculum producing several fifos with independent data in each there are unix tools that can do that, something like;

>(tail -f swo/fifo1 & tail -f swo/fifo2 ) | cat > output

The problem with this is that you can never be completely sure of the order of data merging in the output file, so a dedicated tool, orbcat, is provided to hook to the TCP/IP port of orbuculum and take the same output format specifiers (but without the fifo names), dumping the resulting flow either to stdout or to a file for use by other tools, like this;

>orbcat -c 0,"%c" -c 1,"%c" -c 2,"%c" -c 3,"Z=%d\n" -c 4,"Temp=%d\n"

Since each value arrives discretely for each channel it is possible to be certain that each one is completely written before the next – whatever order they’re written in on the target is the order they will be received in on the host (watch out for target OS issues here though!). This can resolve the problem of inconsistent intermingling. Indeed, it’s possible to go further, and use the enforced sequencing to advantage on the host. For example, we can write two characters and an int into a csv file on the host with an orbcat line like the following;

>orbcat -c 5,"%c" -c 6," ,%c" -c 7," ,%d\n"

which would result in lines that look something like;

a, b, 45
g, w, -453
...etc

Always bear in mind that there is no (real) limit to the number of simultaneous apps that can use the dataflow from the orbuculum TCP/IP port, nor on the re-use of data for multiple dumps; perhaps there’s a reason for creating two csv files, with the data above in a different order, for example.

Orbuculum is only just at the start of its lifecycle. It can collect and distribute SWO data, but it’s the apps that make use of these data that make it powerful, and there are plenty more of those to be created for many different purposes.

For now, the most interesting app that comes with the suite is orbstat, and that will be the subject of the next post.

SWO – The Hard Stuff

SWO’s credibility as a debug solution comes from its ability to support multiple software output channels, but its real capability is only realised when you use the hardware monitoring functions it offers too.

In my previous post I alluded to the hardware capabilities that the SWO ITM macrocell offered by virtue of the Data Watchpoint & Trace (DWT) macrocell. In this post we’re going to scratch the surface of what you can do with that.

DWT messages are encoded in exactly the same way as software ones, but they are generated automatically by hardware rather than programmatically. You’ll recall that event counters, exceptions, PC value and data traces can all be output by the DWT, so in this post we’ll provide a couple of examples of how to use that functionality.

If you’ve got orbuculum running, you’ll notice one extra fifo in its output directory alongside whatever you have defined. That fifo is called hwevent and is a simple continuous dump of whatever DWT events you’ve got switched on. By default, with the standard gdb orbuculum startup script, no events are requested for reporting, and so that fifo remains empty. From the gdb command line (assuming you’ve included the line source ../orbuculum/Support/gdbtrace.init in your .gdbinit file) you can find out quite a lot about the possibilities for configuring the ITM & DWT;

gdb>help orbuculum

GDB SWO Trace Configuration Helpers
===================================

Setup Device
------------
enableSTM32F1SWD : Enable SWO on STM32F1 pins
prepareSWD : Prepare SWD output in specified format

Configure DWT
-------------
dwtPOSTCNT : Enable POSTCNT underflow event counter packet generation
dwtFOLDEVT : Enable folded-instruction counter overflow event packet generation
dwtLSUEVT : Enable LSU counter overflow event packet generation
dwtSLEEPEVT : Enable Sleep counter overflow event packet generation
dwtDEVEVT : Enable Exception counter overflow event packet generation
dwtCPIEVT : Enable CPI counter overflow event packet generation
dwtTraceException : Enable Exception Trace Event packet generation
dwtSamplePC : Enable PC sample using POSTCNT interval
dwtSyncTap : Set how often Sync packets are sent out (None, CYCCNT[24], CYCCNT[26] or CYCCNT[28])
dwtPostTap : Sets the POSTCNT tap (CYCCNT[6] or CYCCNT[10])
dwtPostInit : Sets the initial value for the POSTCNT counter
dwtPostReset : Sets the reload value for the POSTCNT counter
dwtCycEna : Enable or disable CYCCNT

Configure ITM
-------------
ITMId : Set the ITM ID for this device
ITMGTSFreq : Set Global Timestamp frequency
ITMTSPrescale : Set Timestamp Prescale
ITMSWDEna : TS counter uses Processor Clock, or clock from TPIU Interface
ITMTXEna : Control if DWT packets are forwarded to the ITM
ITMSYNCEna : Control if sync packets are transmitted
ITMTSEna : Enable local timestamp generation
ITMEna : Master Enable for ITM
ITMTER : Set Trace Enable Register bitmap for 32*<Block>
ITMTPR : Enable block 8*bit access from unprivledged code

There is another layer of help information below this top layer (beware that gdb doesn’t like MixedCase when you’re trying to do tab completion);

gdb>help dwttraceexception
dwtTraceException <0|1> Enable Exception Trace Event packet generation

Understanding some of these options does need a bit of perusal of the  DWT and ITM technical documentation I’m afraid, but I’ll get around to writing something up on some of the more useful of them eventually (or, if someone else fancies making a textual contribution, it would be gratefully received….)

OK, so let’s give that a go, and see what we get in the hwevent fifo now;

gdb>dwtTraceException 1

>cat hwevent
1,2,Resume,Thread
1,989,Enter,SysTick
1,6,Exit,SysTick
1,1,Resume,Thread
1,989,Enter,SysTick
1,4,Exit,SysTick
1,1,Resume,Thread
1,996,Enter,SysTick
1,5,Exit,SysTick
1,2,Resume,Thread
1,996,Enter,SysTick
1,6,Exit,SysTick
1,2,Resume,Thread
1,985,Enter,SysTick
...etc

The ‘1’ in the first column is the event type (an Exception Trace Event), followed by the time in uS since the previous event. That is followed by the condition, and by the Exception itself. This particular trace is for an otherwise idle FreeRTOS application with a 1mS system tick timer. You can see that the CPU entered the thread state and 989uS later dealt with a SysTick event that took 6uS to handle, and that that process continued during the sample time…that’s quite a level of insight for no code changes at all!

There are 993uS to 1003uS between SysTicks in this sample, and that brings us to one of the big problems with this technique. To save bandwidth across the link the timestamps are generated on the host rather than the target, so they are inevitably inaccurate and, even with this compromise, the TRACESWO quickly becomes overloaded. You will see ITM Overflow warning messages from orbuculum itself in any realistic application using Exception Tracing…the effective use of Exception Tracing will have to wait until the parallel trace is available. By the way, there is a great description about CORTEX-M exceptions available here.

So, instead, let’s move on to something that does work reasonably OK even within the constraints of TRACESWO. Interrupt the application and type;

gdb>dwtTraceException 0
gdb>dwtSamplePC 1

…and again we can look at the hwevent fifo;

>cat hwevent

2,1,**SLEEP**
2,2,**SLEEP**
2,1,**SLEEP**
2,2,0x08002f70
2,2,**SLEEP**
2,1,**SLEEP**
2,1,**SLEEP**
...etc

Basically, we can set an interval at which we want the DWT to sample the current value of the Program Counter (by means of the dwtPostTap and dwtPostReset options) and it will tell us the value of PC at that interval. If the target is sleeping then obviously the PC has no meaningful value, so the special value **SLEEP** is returned instead.
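The sample rate is a trade-off against SWO bandwidth. As a hedged example (the exact argument conventions are my assumption, so check the help text for each command), stretching the interval between samples might look something like;

gdb>dwtPostTap 1
gdb>dwtPostReset 15
gdb>dwtSamplePC 1

Here POSTCNT is fed from a higher tap of CYCCNT and reloaded with a larger value, so PC samples arrive less often and the link is less likely to overflow.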

Using combinations of these options you can provide information to homebrewed applications that parse the hwevent fifo to infer things about the behaviour of your target, but there are alternative ways of getting information which can be easier to use.

In a previous note I mentioned that orbuculum exports a TCP/IP interface on port 3443…we can hook applications to this port and parse the data that are returned. The easiest example (which is completely useless) is;

>telnet localhost 3443

(Oh, CTRL-] followed by q will get you out of that).

Fortunately, the orbuculum gnomes have provided slightly more useful applications than that. The first of these is orbtop, which takes the PC samples, looks them up in the matching firmware elf file (assuming you compiled it with debug info in there) and marshals them into something distinctly useful;

>orbtop -e ../STM32F103-skel/ofiles/firmware.elf

98.91% 4360 ** SLEEPING **
 0.36% 16   USB_LP_CAN1_RX0_IRQHandler
 0.18% 8    xTaskIncrementTick
 0.13% 6    Suspend
-----------------
99.58% 4408 Samples

I think that’s enough for now. I doubt you were expecting a full top implementation for your target, with no target software instrumentation needed, but we’re still nowhere near the limits of what we can do.

Till next time….

SWO – starting the Steroids

Basic Single Wire Output replaces a serial port for debug purposes, but that’s hardly scratching the surface of the full capability of what’s behind that pin. To get more out of it needs additional software on the host side, and that’s where Orbuculum makes its first appearance.

If you’re following along at home, and you’re of that kind of engineering mentality, you will have looked at the SWO output from the last blog post and noticed that every valid data byte was interspersed with a 0x00. That doesn’t matter to most terminal programs (although it will screw up flashy terminal handling in case you were trying to get clever) and it’s really just a way of the ITM reminding you that it’s still there, and would still like to play.

The ITM is documented in The ARMv7-M Architecture Reference Manual which is a right riveting read. It can actually output four different types of data;

  • Software Trace: Messages generated by program code
  • Hardware Trace: Messages generated by the DWT, which the ITM then outputs
  • Time Stamps: Either relative to the CPU clock or the SWO clock
  • Extension Packets: These aren’t used much in CORTEX-M, but the one facility they do provide is a ‘page extension’ to extend the number of available stimulus ports from 32 to 256.

The minimalist pseudo-serial port output from the last post is actually a degenerate example of the use of Software Trace outputting one byte messages from ITM channel 0. That’s the reason you’re seeing the 0’s interspersed with the data… but a lot more functionality is available.

An ITM message is, in general, a data packet of 8 to 32 bits. Program code can send out chunks of 8-32 bits via 32 ‘stimulus ports’. A write to stimulus port 0..31 on the target side of 1, 2 or 4 bytes will result in an ITM Software message being encoded and sent over the link. This effectively means you’ve got 32 individual channels of up to 32 bit width multiplexed onto a single serial link, and handled by the hardware. You can do that kind of thing just using software and a conventional serial port, but the ITM embeds that functionality in code you don’t have to write.

This makes the ITM Software channels ideal for separating different types of debug information for processing by the host; Channel 31 is reserved for Operating System support information, and 0 is generally used for 8 bit serial data (as we’ve already seen). The others are pretty much available for whatever purpose you wish. There’s no CMSIS support for anything other than Channel 0, but adding support for the other channels is trivial;

static __INLINE uint32_t ITM_SendChar (uint32_t channel, uint32_t ch)
{
    if ((CoreDebug->DEMCR & CoreDebug_DEMCR_TRCENA_Msk) && /* Trace enabled */
         (ITM->TCR & ITM_TCR_ITMENA_Msk) && /* ITM enabled */
         (ITM->TER & (1ul << channel) ) /* This stimulus port enabled */
        )
    {
        while (ITM->PORT[channel].u32 == 0); // Port available?
        ITM->PORT[channel].u8 = (uint8_t) ch; // Write data
    }
    return (ch);
}

I’ll leave it as an exercise for the reader how to create 16 and 32 bit variants of the write routine…or extend this one.

Anyway, while we’re here we’ll take a quick look at the hardware messages that the ITM conveys. These messages originate from the DWT and are encoded in a very similar way to the software ones. However, the message types are much more standardised, and offer an incredibly rich insight into the operation of the CPU, considering how minimal the implementation is. The defined messages are;

  • ID0 : Event Counter: the DWT maintains event counters for a number of distinct event types. When these counters ‘wrap around’ to zero then this event is emitted.
  • ID1: Exception Trace: One of the most versatile messages, this reports which interrupt is Entered, Exited or Returned to. By monitoring exception trace messages the host can identify exactly how interrupts are being handled.
  • ID2: Periodic Program Counter Sample Packets: the DWT can be configured to sample and report the current value of the Program Counter (PC). This allows statistical profiling and code coverage of an application running on the target without any code changes.
  • ID3-23: Data Trace Packets: These messages allow you to trigger events when certain data locations are accessed, values are changed or program locations hit. You might question how these messages differ from the capability afforded by the Debug module, but it’s much more intended for monitoring flows and triggering actions, rather than the interventional stuff that the Debug macrocell is generally used for.

You can see why the DWT is a bit of a Cinderella…it’s doing quite a lot of useful work and there’s a rich seam to be mined here, so we’ll be back to give it more attention in a future post.

Obviously the ITM has limited bandwidth, especially in comparison to the TRACEDATA pins, and it’s quite possible that it can be flooded by multiple data sources contending for its use. When that occurs there is a priority order to the messages that are output, with the end result that if you start seeing overflow messages, you can be reasonably sure that you are losing useful data. Unfortunately, the available bandwidth is the Achilles heel of the TRACESWO pin.

Let’s consider the flexibility that the software source packets afford as a simple example of the use of the ITM. Doing this requires some software on the host side which, until recently, was limited and mostly only available in expensive (costing more than zero) proprietary packages, although OpenOCD and Sigrok both have some decode capability.

Orbuculum was created during early summer 2017 to capture and decode these SWO (and, specifically, ITM) flows. Running on OSX or Linux, Orbuculum has significantly opened up the potential that SWO offers. In its core form it receives the data stream from the ITM (which may, optionally, have been through the TPIU multiplexer) and both presents it via TCP/IP port 3443 to any number of subsidiary client applications while simultaneously creating FIFOs delivering the decoded serial data to any local application that wants to use it.

The TCP/IP link is another thing we’ll deal with later, but for now, as an example, let’s consider an application where we want three debug serial flows (debug, clientEvents and Actions) with a 32-bit signed value Z and a 16-bit signed value Temperature.

Orbuculum can connect via a USB logic level UART, a Segger debug probe or, the default, a Black Magic Debug probe. For now, let’s assume we’re using the BMP, but it’s only a couple of slightly different command line options to connect to either a Segger or a logic level USB UART.

Anyway, the command line to achieve all this functionality would be;

orbuculum -b swo/ -c 0,debug,"%c" -c 1,clientEvents,"%c" -c 2,Actions,"%c" -c 3,Z,"Z=%d\n" \
                                  -c 4,Temperature,"Temp=%d\n"

When orbuculum is running it will create, in the directory swo/, the following files;

swo/
  debug
  clientEvents
  Actions
  Z
  Temperature

(+1 more file, which we’re not going to deal with in this post)

These can be streamed via the ‘cat’ command, or copied to a regular file. On the target side writing to one of the ITM channels (0 = debug, 1 = clientEvents etc.) with the appropriate length message will cause that number of octets (comms people say ‘Octets’ rather than ‘Bytes’ cos we’re pedantic) to be sent over the link to pop out and be processed by Orbuculum on the host.
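On the target side, and matching the channel map above, the wider values just use the 16 and 32 bit views of the stimulus port. A hedged sketch, assuming the same 'wait until the port is free' discipline as ITM_SendChar and that the relevant channels have been enabled in the TER;

#include "stm32f10x.h"     /* pulls in the CMSIS ITM definitions */

static void reportZ(int32_t z)
{
    while (ITM->PORT[3].u32 == 0);      /* wait for stimulus port 3 to be free */
    ITM->PORT[3].u32 = (uint32_t)z;     /* one 32 bit write = one ITM message  */
}

static void reportTemperature(int16_t temp)
{
    while (ITM->PORT[4].u32 == 0);      /* wait for stimulus port 4 to be free */
    ITM->PORT[4].u16 = (uint16_t)temp;  /* a 16 bit write on channel 4         */
}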

As with the simple serial streaming case we talked about in the last post, some configuration is required to get all the various bits and pieces of SWO pointing in the right direction and running at the same speed. In general you’ll find it’s easier to do that from the debug port rather than target program code, and there are gdb scripts and libraries for exactly that purpose shipped with Orbuculum.

Orbuculum is designed to be a pretty hardy piece of code. It will deal with the target (and the debug interface) appearing and disappearing as the debug cycle takes place. The intention is that it behaves more as a daemon than as a regular application program so that it becomes part of the instrumentation infrastructure that supports your debug activities. Typically, I have several windows open each cat’ing one of the debug flows, and those windows are maintained through restarts, pauses and reboots of the target.

So, you now have the ability to stream multiple, independent, information flows from your target to your host. More sophisticated exploitation of this capability will be the subject of the next few posts, once we’ve dealt with the hardware side messages from the DWT, SWOs Cinderella.

Single Wire Output

SWO is the underloved younger brother of SemiHosting. Only available on M3 and above, it provides a flexible window into the behaviour of your target. In the simplest use case, it’s a high speed output only debug serial port.

Understanding SWO needs a bit of background about the various bits of the ARM CoreSight Debug architecture that participate in it. CORTEX-M doesn’t implement full-on CoreSight, it’s more a sort of lightweight version of it and there are only three component subsystems that have a role, at least, for the kinds of cores we’re talking about today;

  • Embedded Trace Macrocell (ETM): Provides live tracing of what the CPU is actually doing
  • Instrumentation Trace Macrocell (ITM): Provides multi-channel program-controlled data output
  • Data Watchpoint & Trace (DWT): Provides watchpoints and change-tracking output

ARM have a bit of a habit of talking in TLAs (Three Letter Acronyms) that make this stuff more impenetrable than it needs to be, but once you’re in the club you can use the TLAs too to keep the riff-raff out, so try and keep up.

Now, each of these three data sources are configured either programmatically or via the debug port. Their output flows through to the  Trace Port Interface Unit (TPIU…I guess they ran out of combinations of only three letters) and that talks to the outside world. The block diagram of the TPIU looks like this;

The TPIU consists of a number of functional blocks; Interfaces to the ETM, ITM and APB (ARM Peripheral Bus, for config and management), a formatter to frame up the data from these sources and a serialiser to turn it into an appropriate format to be sent over the wire. DWT is the poor stepchild here. It sends its data via the ITM and never seems to get mentioned in letters home…but when we talk about ITM, you can assume the DWT is along for the ride too.

The formatter multiplexes the available data sources into packets that are sixteen bytes long. The formatting of this multiplexed packet is really rather clever (see Section D4 in here) and is designed to minimise the overhead that the multiplexing imposes. When you’re only using the TPIU for ITM output (See, you’re getting the hang of these TLAs) the formatter can be bypassed and the ITM data are passed directly to the Serialiser, thus reducing overhead and simplifying the packet format. That is indeed the way the SWO is often used in ‘simple’ implementations.

The serialiser is interesting. You’ll notice it has both a TRACESWO output and a four bit TraceData output too.

The four bit TraceData, in conjunction with the TRACECLK output, is used for ‘parallel trace’. It has higher bandwidth than the single wire output (which allows it to do new things) but, importantly, it’s fed from the same data sources so, modulo bandwidth limitations, you can do the same things with the TRACESWO output that you can do with the TRACEDATA outputs. We’ll deal with TRACEDATA extensively in a future post, but for now TRACESWO is the star of the show.

The serialiser kicks data out of the TRACESWO pin at a rate governed by the TRACECLKIN (which is fed on-chip by some clock source or other). Data can be sent out either Manchester encoded, or in UART format that will be more familiar to many people. You’ll hear the terms NRZ (Non-Return to Zero) and RZ (Return to Zero) used to describe these formats. You can Google for more information easily enough, but the important thing is that a RZ protocol also encodes the clocking information (at the expense of double the bandwidth requirement) whereas a NRZ protocol requires you to know the bitrate ahead of time. If you’re developing custom hardware to swallow the TRACESWO output you’d want to use RZ, if you’re hoping to use a TTL UART, then it’s NRZ all the way. The NRZ TRACESWO output format is hardwired as 8 databits, 1 stop bit, no parity.

So, let’s recap where we are. By the appropriate configuration of registers we can get realtime logging, exception and even execution trace out of our CPU via a single pin. We can even get those data out via a logic level UART connection (and yes, you can just capture the output using one of those horrible USB to UART adaptors). Next step – how do you grok the data on the host side?

Well, if all you want is an extra serial output for debug then that’s easy – configure the TPIU to bypass the formatter and to spit out the messages in NRZ format, then make sure you write to ITM channel 0 and hang a USB to UART adaptor off the SWO pin with a terminal application on the host. You’re done. You’ll even find a suitable call in the CMSIS, ITM_SendChar, which will send a single character over the link on channel 0 to drop out on your host.

The magic incantations to get all of this going fall into two parts; the first is chip specific to configure the SWO pin for use, the second is CORTEX-M generic, to configure the ITM, DWT, ETM and TPIU (although, in reality, you can largely ignore the ETM if you’re just wanting simple debug output, and the DWT just needs to provide sync to the ITM). Something like this suffices for a STM32F103;

/* STM32 specific configuration to enable the TRACESWO IO pin */
RCC->APB2ENR |= RCC_APB2ENR_AFIOEN;
AFIO->MAPR |= (2 << 24); // Disable JTAG to release TRACESWO
DBGMCU->CR |= DBGMCU_CR_TRACE_IOEN; // Enable IO trace pins for Async trace
/* End of STM32 Specific instructions */

*((volatile unsigned *)(0xE0040010)) = 625; // Output bits at 72000000/(625+1) = ~115.2kbps.
*((volatile unsigned *)(0xE00400F0)) = 2; // Use Async mode pin protocol
*((volatile unsigned *)(0xE0040304)) = 0; // Bypass the TPIU formatter and send output directly

/* Configure Trace Port Interface Unit */
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk; // Enable access to registers
DWT->CTRL = 0x400003FE; // DWT needs to provide sync for ITM
ITM->LAR = 0xC5ACCE55; // Allow access to the Control Register
ITM->TPR = 0x0000000F; // Trace access privilege from user level code, please
ITM->TCR = 0x0001000D;
ITM->TER = 1; // Only enable stimulus port 0

while (1)
{
    for (uint32_t i='A'; i<='Z'; i++)
        ITM_SendChar(i);
}

So, there you go, serial port debug with low overhead and without it actually costing you a serial port on the target. The good stuff, however, using the real capabilities of the ITM and DWT, you only get if you spend more effort understanding those two macrocells, and if you put real decode software on the host side. That’s the subject of the next installment.

While you’re waiting for that to land, there’s a short YouTube Video from ARM giving a better overview of this stuff than I ever could.

 

SemiHosting

SemiHosting is one of the oldest ARM debug support mechanisms which even today has a couple of advantages over most of the alternatives.

In general the debug options for ARM CORTEX CPUs are confusing to the newcomer. The embedded world expects everyone to already be an expert, with the end result that you’ve got to be living in it for a fairly significant length of time before the fog finally starts to clear.

I’m assuming that anyone reading this stuff has already got their head around I/O bits and serial ports, so let’s concentrate on SemiHosting as our first entry into this wonderful world. This is obviously just an intro, you should look at the ARM documents when you want the real lowdown. I should state upfront that I don’t generally use SemiHosting, I find other techniques more suitable, but this should give you enough of a foothold to start using it if it looks like it floats your boat.

SemiHosting has been around since the 1990s. It allows the application running on your Target (embedded CPU) to access the Input and Output capabilities of the Host that is connected over the debug link. It does this by ‘tunneling’ the I/O requests over that link for various file descriptors. You’ll recall that file descriptors 0 and 1 are stdin and stdout in the Unix world, so one of the things you get with SemiHosting in addition to file access is remote screen and keyboard for your target application. Bargain.

It’s important to be aware that when an app is compiled with SemiHosting it will not work without the debugger connected. This is a big restriction. It also switches the CPU into Debug mode while it’s active, where it doesn’t play nicely with interrupts and stuff. Let’s be honest, SemiHosting is really useful for testing routines that take chunks of data in or throw chunks of data out because that’s where the file handling bit comes in. It’s not great for realtime oriented stuff either because it’s not a particularly fast technique. Its big advantages are that it’s properly bidirectional and it integrates cleanly, with no (or very little) glue, with the filesystem on the host.

So, how does it work? Turns out the implementation is slightly different depending on whether you’re on an ARMv6-M or ARMv7-M (M0 or M3/4 etc.) as distinct from any other ARM family CPU. In the former case the BKPT (Breakpoint) instruction is used, while other ARM CPUs use SVC (Service) calls…that distinction doesn’t really matter though unless you’re stepping through machine code trying to figure out what’s going on…so let’s stick with the CORTEX-M case.

When the application on the target wants to perform a SemiHosting call in regular code it executes a BKPT 0xAB instruction with the operation to be performed in R0 and its parameters in other registers (there’s a minimal sketch of this just after the list below). A few examples of ARM-set standard actions are;

1 – SYS_OPEN : Open a file on the host
2 – SYS_CLOSE: You can figure this one out
3 – SYS_WRITEC: Write a single character
5 – SYS_WRITE: Write a block
6 – SYS_READ: ...and so it goes on
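
Just to make the mechanism concrete, here’s a minimal sketch of a raw call on CORTEX-M using SYS_WRITEC from the list above. The register convention (operation number in R0, pointer to the parameter block in R1, result handed back in R0) is the standard one; the little wrapper functions are mine and purely illustrative.

#include <stdint.h>

/* Perform one SemiHosting operation: op code in R0, pointer to the parameter
 * block in R1, result comes back in R0. Only works with a debugger attached
 * to catch the BKPT! */
static inline uint32_t sh_call(uint32_t op, void *param)
{
    register uint32_t r0 __asm__("r0") = op;
    register void    *r1 __asm__("r1") = param;

    __asm__ volatile ("bkpt 0xAB" : "+r" (r0) : "r" (r1) : "memory");
    return r0;
}

/* SYS_WRITEC (operation 3): R1 points at the character to write */
static void sh_writec(char c)
{
    sh_call(3, &c);
}

With a debugger attached and semihosting enabled, sh_writec('!') pops up on the host console; without one, the BKPT just stops the core, which is exactly the limitation mentioned earlier. The library route described below does essentially this for you.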

Obviously each of these calls needs parameters and returns results. The reference above gives you all the info you need on what those actually are…although in reality you mostly use libraries to realise a SemiHosting implementation, so you don’t need this level of detail. One question I always had was why SemiHosting was implemented with BKPT/SVC and not just a library…well, if you think about it, an exception-based calling routine will work anywhere, with any language and from any processor state (pushing the CPU into a Debug state), so it’s a much cleaner implementation than the alternatives you might dream up.

So, we’ve reached the BKPT/SVC handler, and we’ve got our marching orders in the various registers…how does this get conveyed to the connected debugger? That depends on the compiler and debugger you’re using, but let’s stay in a GCC/GDB world where everything is documented and transparent.

In that case the handler marshals everything and sends it over the GDB link. That’s all documented in the Remote Protocol section of the GDB manual, and specifically the File I/O Remote Protocol Extension. I’m not going to regurgitate all of that stuff here for the purposes of padding a blog, but suffice to say that requests from the target eventually pop up at the host end, where GDB (or, if you’re using something like Segger or OpenOCD, initially the debug driver) handles them and returns the results to the target.

OK, so that’s the mechanics, and you understand the limitations; how do you use it in the real world? Turns out it’s pretty straightforward: just add the magic incantation

--specs=rdimon.specs

to your linker options (replace that with

--specs=nosys.specs

when you want to turn SemiHosting off). That will load up the BKPT/SVC handling routines and allow you to use printf/scanf and all the file handling stuff in your application. One thing that folks do forget is an initialisation call that’s needed at the start of main (or leastways, before you do any SemiHostery) if you’re not running newlib;

extern void initialise_monitor_handles(void); /* prototype */
int main(void)
{
    initialise_monitor_handles();
    ....

You’ll probably need to switch on the semihosting options on your host side debug stub, and the MCUOnEclipse site has good info on doing that.
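
Once the handles are initialised and the host side is listening, ordinary stdio just works. Here’s a minimal sketch of the sort of thing you can then do; the file name is purely illustrative.

#include <stdio.h>

extern void initialise_monitor_handles(void);

int main(void)
{
    initialise_monitor_handles();

    /* stdout is tunnelled to the debugger's console... */
    printf("Hello from the target\n");

    /* ...and file I/O lands on the HOST filesystem, not the target */
    FILE *log = fopen("target_log.txt", "w");
    if (log)
    {
        fprintf(log, "Booted OK\n");
        fclose(log);
    }

    while (1) {}
}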

You don’t need to do anything extra if you’re running a Black Magic Probe…one of its big advantages is that it’s all handled natively.

So, there you have it. Zero to SemiHosting-competent in ten minutes. If you can cope with an output-only channel, though, there are better, faster, more flexible options. More to follow.

Debug communication with an ARM CORTEX-M target

Textual debug input/output is a way of life on the desktop. With no screen and no keyboard, surely you’ve got fewer options in an embedded system?

In some ways you’ve got more options for getting debug out of an embedded system than you have on a desktop one. Just the other day I posted an example of using a single digital output pin to convey external information about how busy a system is, something that’s rather more involved to achieve on the desktop. So let’s do a brief survey of some of the options that let you figure out what that inscrutable lump of highly refined sand on your bench is actually doing.

The basic option, which has been around as long as embedded systems themselves, is the single I/O bit. For output it can be used to indicate entry into a specific bit of code, how loaded the system is, any number of error conditions and a thousand and one other things. Most designs feature at least one LED hooked onto that output pin to give a visual indication to the user without needing any special equipment beyond a Mk.1 eyeball. In my designs I always have a minimum of one (not red) LED which does normal service as a ‘heartbeat’ showing that the system is alive. It serves double duty as the exception indication LED when the system ends up in an unrecoverable situation. Believe me, that situation can otherwise be difficult to spot quickly (you’ll implement this on your second design, immediately after you’ve spent an hour staring at your board wondering why it’s not responding)…don’t underestimate how useful it is. Frankly, if you can spare the bits, put an RGB LED on there (a Wurth 150141M173100 is only about 30c) and you’ve got eight different conditions you can show, even if you choose not to provision it for production. Stick that LED on PWM outputs and you’ve got any colour under the rainbow. Perhaps not really that useful, but cool anyway.
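
Just to make the heartbeat idea concrete, here’s roughly the shape of mine. It’s only a sketch: it assumes FreeRTOS (which the panic routine further down also assumes) and a hypothetical hbLedWrite() wrapper around whatever GPIO write routine you’re using.

#include <stdbool.h>
#include "FreeRTOS.h"
#include "task.h"

extern void hbLedWrite(bool isOn);   /* Hypothetical wrapper around your GPIO write routine */

/* Toggle the (not red!) heartbeat LED at around 1Hz so a glance tells you
 * the scheduler is still running. */
static void heartbeatTask(void *params)
{
    bool isOn = false;
    (void)params;

    while (1)
    {
        isOn = !isOn;
        hbLedWrite(isOn);
        vTaskDelay(pdMS_TO_TICKS(500));
    }
}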

On the input side a single bit lets you trigger certain paths through the code or change runtime conditions. It’s very difficult to get a clean switching action on an input bit without a bit of additional circuitry…generally you can work around that by sampling twice (Google ‘Switch Debouncing’ for more than enough examples of how to do that), and software is always cheaper than hardware, in per-unit cost at least. The lack of clean switching action can bite you if you sample very quickly or use an interrupt to capture the state change event though…and it’s considerably worse if you just use a shorting wire, a pair of tweezers or whatever other conductive implement you happen to have on your desk. The one-liner algorithm description for a software debounce is simple enough: after sensing a state change, wait for 20ms or so and check if it’s still changed; if it is then the transition is valid. There’s a sketch of that just below.
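
In code, that one-liner comes out something like this. Again it’s only a sketch: rawButtonRead() and delayMs() are hypothetical stand-ins for your own pin read and delay routines.

#include <stdbool.h>
#include <stdint.h>

extern bool rawButtonRead(void);   /* Hypothetical: instantaneous pin state */
extern void delayMs(uint32_t ms);  /* Hypothetical: busy or RTOS delay */

/* Software debounce: report a new state only if it's still there ~20ms
 * after we first saw it change. */
bool debouncedButtonRead(void)
{
    static bool lastState = false;
    bool now = rawButtonRead();

    if (now != lastState)
    {
        delayMs(20);                 /* Let the contacts stop bouncing */
        if (rawButtonRead() == now)  /* Still changed? Then it's real */
        {
            lastState = now;
        }
    }

    return lastState;
}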

Moving on from the single I/O bit approaches, we very quickly end up wanting to spit out serial data; debug strings, data values, error conditions and operational status really help colour in what a system is actually doing, as opposed to what you think it should be doing. There’s so much value in a system reporting what it’s up to that we often find serial ports fitted purely for debug output, with no corresponding input, and there are multiple options for getting that.

Let’s consider what the various options for serial I/O are;

  • A real serial port. Normally this is configured for asynchronous operation (i.e. with start and stop bits) and it’s sometimes referred to as a UART (Universal Asynchronous Receiver/Transmitter), which is often used to implement the RS232 communications protocol. It’s not really RS232 though, ‘cos RS232 specifies a lot of things that are ‘interpreted liberally’ in a debug port; the signaling levels might be 0 and 3V rather than the positive and negative 12V signaling that a ‘real’ RS232 port generally uses. With the advent of uber-cheap USB ‘TTL Serial’ interfaces from FTDI and others this kind of debug port has become very popular, and you’ll often find logic-level serial interfaces on debug probes like the Black Magic Probe or the Segger JTAG.
  • Overlaid functionality on a debug channel. If we’ve got debug communication established with a CPU via JTAG or SWD then that channel can also be used for bidirectional debug communication. On ARM it’s generally known as ‘Semihosting’ and it’s a virtually no-cost channel in hardware terms, but fast it isn’t. It does have a few distinct advantages though, and we’ll talk about those later.
  • Single Wire Output. When the JTAG interface is in SWD mode there are spare pins, one of which (the one that’s normally used for TDO) can be used for serial debug output. There’s quite a sophisticated infrastructure behind this pin on-chip and it’s a powerful capability. We’ll start to investigate that in a series of future posts. The big problem with SWO is that it’s output only, and if you’ve got a minimal debug setup on your board (SWCLK, SWDIO, Gnd) then SWO needs another pin. The big brother of SWO is TRACE Output, which is effectively parallel SWO, but that’s for discussion quite a lot later on.
  • Real Time Terminal (RTT). This one isn’t as well known as the other options, but it leverages the Segger hardware in a very clever way to deliver high-speed communication with minimal target overhead. Basically, you put aside an area of memory on the target for ring buffers and then the debugger dips into those buffers while the target is running to exchange data. Since the debug capability on a CORTEX CPU doesn’t impact the runtime speed of the target this is a pretty quick mechanism; the target ‘cost’ is limited to the ring buffers and the simple memory copies to get the stuff to/from them. Other probes could do this, but generally don’t, at least today. There’s a tiny usage sketch just after this list.
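
Target-side RTT use is about as simple as debug output gets, once Segger’s freely downloadable RTT sources are dropped into the build. A minimal sketch:

#include "SEGGER_RTT.h"

int main(void)
{
    /* Buffer 0 is the default 'Terminal' up-channel; the J-Link host software
     * reads the ring buffer on the fly while the target free-runs. */
    SEGGER_RTT_WriteString(0, "RTT up and running\r\n");

    while (1)
    {
        /* printf-style output is available too, via SEGGER_RTT_printf() */
    }
}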

So, that’s a quick overview of the various techniques I’m aware of, but perhaps there are more (or variations on a theme) that are worth documenting too? Of course, none of these has to be used exclusively, and it’s quite common to see them used in combination on any given target. As a quick example, when I have a system that gets into a panic condition, I call the following routine;

/* Lock up tighter than a Ducks A?? and spin, flashing the error LED */
void GenericsAssertDA(char *msg, char *file, uint32_t line)
{
    vTaskSuspendAll();
    while (1)
    {
        dbgprint("%s: %s line %d" EOL,(msg==NULL)?"Assert Fail":msg,file,line);
        GPIO_WriteBit((GPIO_TypeDef *)GPIOport[GETGPIOPORT(PIN_HB_LED)], (1<<GETGPIOPIN(PIN_HB_LED)),(isSet=!isSet));

        uint32_t nopCount=1250000;
        while (nopCount--)
        {
            __asm__("NOP");
        }
    }
}

…this is just a few lines of code, but they’re mostly there for a good reason. The first thing we do in a panic is to switch off all the tasks, then dump an error message out of the debug serial port, before inverting the state of the error LED. We delay in a busy loop with the NOPs to avoid relying on the majority of the chip being in an operational condition. The nopCount initial value is set to make the LED flicker quite quickly. This sequence is repeated continually in case you miss the first serial output (y’know, cos you didn’t have the serial port connected, or whatever).

A GCC preprocessor definition adds quite a lot of value to what you get out;

#define ASSERT_MSG(x,y) if (!(x)) GenericsAssertDA((y), __FILE__, __LINE__)
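
A typical call site then looks like this; the buffer and its size are purely illustrative (pvPortMalloc because we’re in FreeRTOS land here):

uint8_t *frameBuffer = pvPortMalloc(FRAME_BYTES);   /* FRAME_BYTES is illustrative */
ASSERT_MSG(frameBuffer != NULL, "Frame buffer allocation failed");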

Over the next few posts I’ll start digging into these debug options, and show just how powerful they really can be with the right processing hanging off the other end.

Now you at least appreciate that there’s a whole range of options for debug communication with your target, and the more sophisticated ones aren’t really any more expensive than the simpler ones; they just need more setting up.