SWO – There’s an App for that

Exploiting the SWO link for software logging and hardware state reporting delivers huge advantages in comparison with traditional debug techniques, but when extended with applications on the host side the benefit gained is amplified considerably.

The creation of apps for Orbuculum is really only just getting underway.  Any number of applications can attach to it simultaneously to deliver multiple views and insight into the operation of the target, and only a few of those have been created so far.

The main orbuculum program collects data from the SWO pin on the CPU, processes it locally and re-issues it via TCP/IP port 3443. The data are issued from that port in exactly the format in which they arrive, which also happens to be the format that the Segger JLink probe issues. By default the JLink uses port 2332…believe it or not, the choice of port 3443 for Orbuculum was made for very specific reasons which did not include consideration of the JLink port number, so that was quite a coincidence! Applications designed to use the TCP/IP flow from a JLink device can now also be used with a USB serial port or Black Magic Probe. Conversely, with a simple change of the port number, orbuculum post-processing apps can hook directly to a JLink device, or via orbuculum itself – everyone’s a winner.

The orbuculum suite currently includes a couple of simple example applications that use the infrastructure…creating new ones is trivial based on the code you can find in these examples.

The simplest of these existing apps is orbdump, which dumps data directly to a file. That’s useful when you just want to take a sample period for later processing…perhaps pushing it into something like Sigrok for processing in conjunction with other data. A command line something like this will dump 3 seconds of data directly into the file output.swo;

>orbdump -l 3000 -o output.swo

We’ve already mentioned orbtop. That tool creates unix-style top output, but it features one little Easter Egg: an option -o <filename> which dumps the processed sample data to a file. An example shell script in Support/orbplot_top uses these data to produce pie charts of the distribution of CPU load, a bit like this;

Frequently an application needs to merge multiple data sources as a precursor to using it in other apps. If you’ve got orbuculum producing several fifos with independent data in each there are unix tools that can do that, something like;

>(tail -f swo/fifo1 & tail -f swo/fifo2 ) | cat > output

The problem with this is that you can never be completely sure of the order in which data are merged into the output file, so a dedicated tool, orbcat, is provided. It hooks to the TCP/IP port of orbuculum and takes the same output format specifiers (but without the fifo names), dumping the resulting flow either to stdout or to a file for use by other tools, like this;

>orbcat -c 0,"%c" -c 1,"%c" -c 2,"%c" -c 3,"Z=%d\n" -c 4,"Temp=%d\n"

Since each value arrives discretely for each channel it is possible to be certain that each one is completely written before the next – whatever order they’re written in on the target is the order they’ll be received in on the host (watch out for target OS issues here though!). This resolves the problem of inconsistent intermingling. Indeed, it’s possible to go further and use the enforced sequencing to advantage on the host. For example, we can write two characters and an int into a csv file on the host with an orbcat line like the following;

>orbcat -c 5,"%c" -c 6," ,%c" -c 7," ,%d\n"

which would result in lines that look something like;

a, b, 45
g, w, -453
...etc

Always bear in mind that there is no (real) limit to the number of simultaneous apps that can use the dataflow from the orbuculum TCP/IP port, nor on the re-use of data for multiple dumps; perhaps there’s a reason for creating two csv files, with the data above in a different order, for example.

Orbuculum is only just at the start of its lifecycle. It can collect and distribute SWO data, but it’s the apps that make use of these data that make it powerful, and there are plenty more of those to be created for many different purposes.

For now, the most interesting app that comes with the suite is orbstat, and that will be the subject of the next post.

SWO – The Hard Stuff

SWO’s credibility as a debug solution comes from its ability to support multiple software output channels, but its real capability is only realised when you use the hardware monitoring functions it offers too.

In my previous post I alluded to the hardware capabilities that the SWO ITM macrocell offered by virtue of the Data Watchpoint & Trace (DWT) macrocell. In this post we’re going to scratch the surface of what you can do with that.

DWT messages are encoded in exactly the same way as software ones, but they are generated automatically by hardware rather than programmatically. You’ll recall that event counters, exceptions, PC value and data traces can all be output by the DWT, so in this post we’ll provide a couple of examples of how to use that functionality.

If you’ve got orbuculum running, you’ll notice one extra fifo in its output directory alongside whatever you have defined. That fifo is called hwevent and is a simple continuous dump of whatever DWT events you’ve got switched on. By default, with the standard gdb orbuculum startup script, no events are requested for reporting, and so that fifo remains empty. From the gdb command line (assuming you’ve included the line source ../orbuculum/Support/gdbtrace.init in your .gdbinit file) you can find out quite a lot about the possibilities for configuring the ITM & DWT;

gdb>help orbuculum

GDB SWO Trace Configuration Helpers
===================================

Setup Device
------------
enableSTM32F1SWD : Enable SWO on STM32F1 pins
prepareSWD : Prepare SWD output in specified format

Configure DWT
-------------
dwtPOSTCNT : Enable POSTCNT underflow event counter packet generation
dwtFOLDEVT : Enable folded-instruction counter overflow event packet generation
dwtLSUEVT : Enable LSU counter overflow event packet generation
dwtSLEEPEVT : Enable Sleep counter overflow event packet generation
dwtDEVEVT : Enable Exception counter overflow event packet generation
dwtCPIEVT : Enable CPI counter overflow event packet generation
dwtTraceException : Enable Exception Trace Event packet generation
dwtSamplePC : Enable PC sample using POSTCNT interval
dwtSyncTap : Set how often Sync packets are sent out (None, CYCCNT[24], CYCCNT[26] or CYCCNT[28])
dwtPostTap : Sets the POSTCNT tap (CYCCNT[6] or CYCCNT[10])
dwtPostInit : Sets the initial value for the POSTCNT counter
dwtPostReset : Sets the reload value for the POSTCNT counter
dwtCycEna : Enable or disable CYCCNT

Configure ITM
-------------
ITMId : Set the ITM ID for this device
ITMGTSFreq : Set Global Timestamp frequency
ITMTSPrescale : Set Timestamp Prescale
ITMSWDEna : TS counter uses Processor Clock, or clock from TPIU Interface
ITMTXEna : Control if DWT packets are forwarded to the ITM
ITMSYNCEna : Control if sync packets are transmitted
ITMTSEna : Enable local timestamp generation
ITMEna : Master Enable for ITM
ITMTER : Set Trace Enable Register bitmap for 32*<Block>
ITMTPR : Enable block 8*bit access from unprivileged code

There is another layer of help information below this top layer (beware that gdb doesn’t like MixedCase when you’re trying to do tab completion);

gdb>help dwttraceexception
dwtTraceException <0|1> Enable Exception Trace Event packet generation

Understanding some of these options does need a bit of perusal of the DWT and ITM technical documentation I’m afraid, but I’ll get around to writing something up on the more useful of them eventually (or, if someone else fancies making a textual contribution, it would be gratefully received…)

OK, so let’s give that a go, and see what we get in the hwevent fifo now;

gdb>dwtTraceException 1

>cat hwevent
1,2,Resume,Thread
1,989,Enter,SysTick
1,6,Exit,SysTick
1,1,Resume,Thread
1,989,Enter,SysTick
1,4,Exit,SysTick
1,1,Resume,Thread
1,996,Enter,SysTick
1,5,Exit,SysTick
1,2,Resume,Thread
1,996,Enter,SysTick
1,6,Exit,SysTick
1,2,Resume,Thread
1,985,Enter,SysTick
...etc

The ‘1’ in the first column is the event type (an Exception Trace Event), followed by the time in µs since the previous event. That is followed by the condition, and by the Exception itself. This particular trace is for an otherwise idle FreeRTOS application with a 1ms system tick timer. You can see that the CPU entered the thread state and 989µs later dealt with a SysTick event that took 6µs to handle, and that that process continued during the sample time…that’s quite a level of insight for no code changes at all!
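If you want to post-process these lines yourself, pulling them apart is straightforward. This is a minimal sketch, assuming only the comma-separated layout shown above (the struct and function names here are my own, not anything orbuculum defines):

```c
#include <stdio.h>
#include <string.h>

/* One decoded exception-trace line from the hwevent fifo, e.g. "1,989,Enter,SysTick". */
typedef struct {
    int  type;        /* event type: 1 = Exception Trace Event */
    int  delta_us;    /* microseconds since the previous event */
    char action[16];  /* Enter / Exit / Resume */
    char name[32];    /* exception or state name, e.g. SysTick, Thread */
} HwEvent;

/* Parse one line; returns 1 on success, 0 if the line doesn't match. */
static int parse_hwevent(const char *line, HwEvent *ev)
{
    return sscanf(line, "%d,%d,%15[^,],%31[^,\n]",
                  &ev->type, &ev->delta_us, ev->action, ev->name) == 4;
}
```

From there it’s only a few lines more to, say, total up SysTick handler time by summing delta_us on Exit events.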

There are 993µs to 1003µs between SysTicks in this sample, and that brings us to one of the big problems with this technique. To save bandwidth across the link the timestamps are generated on the host rather than the target, so they are inevitably inaccurate and, even with this compromise, the TRACESWO link quickly becomes overloaded. You will see ITM Overflow warning messages from orbuculum itself in any realistic application using Exception Tracing…its effective use will have to wait until parallel trace is available. By the way, there is a great description of CORTEX-M exceptions available here.

So, instead, let’s move on to something that does work reasonably OK even within the constraints of TRACESWO. Interrupt the application and type;

gdb>dwtTraceException 0
gdb>dwtSamplePC 1

…and again we can look at the hwevent fifo;

>cat hwevent

2,1,**SLEEP**
2,2,**SLEEP**
2,1,**SLEEP**
2,2,0x08002f70
2,2,**SLEEP**
2,1,**SLEEP**
2,1,**SLEEP**
...etc

Basically, we can set an interval at which we want the DWT to sample the current value of the Program Counter (by means of the dwtPostTap and dwtPostReset options) and it will tell us the value of the PC at that interval. If the target is sleeping then the PC has no meaningful value, so the special value **SLEEP** is returned instead.

Using combinations of these options you can provide information to homebrewed applications that parse the hwevent fifo to infer things about the behaviour of your target, but there are alternative ways of getting information which can be easier to use.

In a previous note I mentioned that orbuculum exports a TCP/IP interface on port 3443…we can hook applications to this port and parse the data that are returned. The easiest example (which is completely useless) is;

>telnet localhost 3443

(Oh, CTRL-] followed by q will get you out of that).

Fortunately, the orbuculum gnomes have provided slightly more useful applications than that. The first of these is orbtop, which takes the PC samples, looks them up in the matching firmware elf file (assuming you compiled it with debug info in there) and marshals them into something distinctly useful;

>orbtop -e ../STM32F103-skel/ofiles/firmware.elf

98.91% 4360 ** SLEEPING **
 0.36% 16   USB_LP_CAN1_RX0_IRQHandler
 0.18% 8    xTaskIncrementTick
 0.13% 6    Suspend
-----------------
99.58% 4408 Samples

I think that’s enough for now. I doubt you were expecting a full top implementation for your target, with no target software instrumentation needed, but we’re still nowhere near the limits of what we can do.

Till next time….

SWO – starting the Steroids

Basic Single Wire Output replaces a serial port for debug purposes, but that’s hardly scratching the surface of the full capability of what’s behind that pin. To get more out of it needs additional software on the host side, and that’s where Orbuculum makes its first appearance.

If you’re following along at home, and you’re of that kind of engineering mentality, you will have looked at the SWO output from the last blog post and noticed that every valid data byte was interspersed with a 0x00. That doesn’t matter to most terminal programs (although it will screw up flashy terminal handling in case you were trying to get clever) and it’s really just a way of the ITM reminding you that it’s still there, and would still like to play.

The ITM is documented in The ARMv7-M Architecture Reference Manual, which is a right riveting read. It can output four different types of data;

  • Software Trace: Messages generated by program code
  • Hardware Trace: Messages generated by the DWT, which the ITM then outputs
  • Time Stamps: Either relative to the CPU clock or the SWO clock
  • Extension Packets: These aren’t used much in CORTEX-M, but the one facility they do provide is a ‘page extension’ to extend the number of available stimulus ports from 32 to 256.

The minimalist pseudo-serial port output from the last post is actually a degenerate example of the use of Software Trace outputting one byte messages from ITM channel 0. That’s the reason you’re seeing the 0’s interspersed with the data… but a lot more functionality is available.

An ITM message is, in general, a data packet of 8 to 32 bits. Program code can send out chunks of 8-32 bits via 32 ‘stimulus ports’. A write of 1, 2 or 4 bytes to stimulus port 0..31 on the target side will result in an ITM Software message being encoded and sent over the link. This effectively means you’ve got 32 individual channels of up to 32 bit width multiplexed onto a single serial link, and handled by the hardware. You can do that kind of thing just using software and a conventional serial port, but the ITM embeds that functionality in logic you don’t have to write.
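For the curious, the framing of a Software source message is simple enough to decode by hand. Per the ARMv7-M ARM (Appendix D4), the header byte carries the port number in its top five bits and the payload size in the bottom two; a sketch of the decode (my function names, nothing standard) looks like this:

```c
#include <stdint.h>

/* Payload size from the bottom two bits of a source packet header:
   0b01 = 1 byte, 0b10 = 2 bytes, 0b11 = 4 bytes (0b00 is not a source packet). */
static int itm_payload_len(uint8_t header)
{
    static const int len[4] = { 0, 1, 2, 4 };
    return len[header & 0x03];
}

/* Bit 2 distinguishes hardware (DWT) source packets from software ones. */
static int itm_is_hardware(uint8_t header)
{
    return (header >> 2) & 1;
}

/* The stimulus port (or DWT discriminator) number is in the top five bits. */
static int itm_port(uint8_t header)
{
    return header >> 3;
}
```

So, for example, a one-byte write to stimulus port 1 travels over the wire as the header 0x09 followed by the data byte.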

This makes the ITM Software channels ideal for separating different types of debug information for processing by the host; Channel 31 is reserved for Operating System support information, and 0 is generally used for 8 bit serial data (as we’ve already seen). The others are pretty much available for whatever purpose you wish. There’s no CMSIS support for anything other than Channel 0, but adding support for the other channels is trivial;

static __INLINE uint32_t ITM_SendCharToChannel (uint32_t ch, uint32_t channel)
{
    if ((CoreDebug->DEMCR & CoreDebug_DEMCR_TRCENA_Msk) && /* Trace enabled */
        (ITM->TCR & ITM_TCR_ITMENA_Msk) &&                 /* ITM enabled */
        (ITM->TER & (1ul << channel))                      /* This stimulus port enabled */
       )
    {
        while (ITM->PORT[channel].u32 == 0);               /* Wait for port to be available */
        ITM->PORT[channel].u8 = (uint8_t) ch;              /* Write the character */
    }
    return (ch);
}

I’ll leave it as an exercise for the reader how to create 16 and 32 bit variants of the write routine…or extend this one.
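As a starting point, a 32-bit flavour might look something like the following. It’s a sketch against a stand-in for ITM->PORT[n] so it compiles off-target; on the real device you’d keep the DEMCR/TCR/TER guards from the routine above too:

```c
#include <stdint.h>

/* Stand-in for an ITM stimulus port register (ITM->PORT[n] on hardware);
   reading zero means the port FIFO is busy, a write pushes data onto the link. */
typedef union {
    volatile uint8_t  u8;
    volatile uint16_t u16;
    volatile uint32_t u32;
} StimPort;

/* 32-bit channel write: a single word-wide store emits one 4-byte ITM packet. */
static uint32_t ITM_SendWord(StimPort *port, uint32_t value)
{
    while (port->u32 == 0)  /* wait until the port is available */
        ;
    port->u32 = value;      /* word write => 4-byte payload on the wire */
    return value;
}
```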

Anyway, while we’re here we’ll take a quick look at the hardware messages that the ITM conveys. These messages originate from the DWT and are encoded in a very similar way to the software ones. However, the message types are much more standardised, and offer an incredibly rich insight into the operation of the CPU, considering how minimal the implementation is. The defined messages are;

  • ID0: Event Counter: the DWT maintains event counters for a number of distinct event types. When these counters ‘wrap around’ to zero then this event is emitted.
  • ID1: Exception Trace: One of the most versatile messages, this reports which interrupt is Entered, Exited or Returned to. By monitoring exception trace messages the host can identify exactly how interrupts are being handled.
  • ID2: Periodic Program Counter Sample Packets: the DWT can be configured to sample and report the current value of the Program Counter (PC). This allows statistical profiling and code coverage of an application running on the target without any code changes.
  • ID3-23: Data Trace Packets: These messages allow you to trigger events when certain data locations are accessed, values are changed or program locations hit. You might question how these messages differ from the capability afforded by the Debug module, but it’s much more intended for monitoring flows and triggering actions, rather than the interventional stuff that the Debug macrocell is generally used for.

You can see why the DWT is a bit of a Cinderella…it’s doing quite a lot of useful work and there’s a rich seam to be mined here, so we’ll be back to give it more attention in a future post.

Obviously the ITM has limited bandwidth, especially in comparison to the TRACEDATA pins, and it’s quite possible that it can be flooded by multiple data sources contending for its use. When that occurs there is a priority order to the messages that are output, with the end result that if you start seeing overflow messages, you can be reasonably sure that you are losing useful data. Unfortunately, the available bandwidth is the Achilles heel of the TRACESWO pin.

Let’s consider the flexibility that the software source packets afford as a simple example of the use of the ITM. Doing this requires some software on the host side which, until recently, was limited and mostly only available in expensive (costing more than zero) proprietary packages, although OpenOCD and Sigrok both have some decode capability.

Orbuculum was created during early summer 2017 to capture and decode these SWO (and, specifically, ITM) flows. Running on OSX or Linux, Orbuculum has significantly opened up the potential that SWO offers. In its core form it receives the data stream from the ITM (which may, optionally, have been through the TPIU multiplexer), presents it via TCP/IP port 3443 to any number of subsidiary client applications, and simultaneously creates FIFOs delivering the decoded serial data to any local application that wants to use it.

The TCP/IP link is another thing we’ll deal with later, but for now, as an example, let’s consider an application where we want three debug serial flows (debug, clientEvents and Actions) with a 32-bit signed value Z and a 16-bit signed value Temperature.

Orbuculum can connect via a USB logic level UART, a Segger debug probe or, the default, a Black Magic Debug probe. For now, let’s assume we’re using the BMP, but it’s only a couple of slightly different command line options to connect to either a Segger or a logic level USB UART.

Anyway, the command line to achieve all this functionality would be;

orbuculum -b swo/ -c 0,debug,"%c" -c 1,clientEvents,"%c" -c 2,Actions,"%c" -c 3,Z,"Z=%d\n" \
                                  -c 4,Temperature,"Temp=%d\n"

When orbuculum is running it will create, in the directory swo/, the following files;

swo/
  debug
  clientEvents
  Actions
  Z
  Temperature

(+1 more file, which we’re not going to deal with in this post)

These can be streamed via the ‘cat’ command, or copied to a regular file. On the target side writing to one of the ITM channels (0 = debug, 1 = clientEvents etc.) with the appropriate length message will cause that number of octets (comms people say ‘Octets’ rather than ‘Bytes’ cos we’re pedantic) to be sent over the link to pop out and be processed by Orbuculum on the host.
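On the target side, feeding those channels is then just a matter of different-width stores to the matching stimulus ports. A sketch, again using a stand-in for ITM->PORT[] so it compiles off-target (the report_* names are mine, and the ready/enabled checks a real routine needs are omitted):

```c
#include <stdint.h>

/* Stand-in for the 32 ITM stimulus ports (ITM->PORT[] on hardware). */
typedef union {
    volatile uint8_t  u8;
    volatile uint16_t u16;
    volatile uint32_t u32;
} StimPort;

static StimPort port[32];

/* Channel 3 carries Z as a 32-bit value: one word write = one "Z=%d" line on the host. */
static void report_Z(int32_t z)    { port[3].u32 = (uint32_t)z; }

/* Channel 4 carries Temperature as a 16-bit value. */
static void report_temp(int16_t t) { port[4].u16 = (uint16_t)t; }
```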

As with the simple serial streaming case we talked about in the last post, some configuration is required to get all the various bits and pieces of SWO pointing in the right direction and running at the same speed. In general you’ll find it’s easier to do that from the debug port rather than target program code, and there are gdb scripts and libraries for exactly that purpose shipped with Orbuculum.

Orbuculum is designed to be a pretty hardy piece of code. It will deal with the target (and the debug interface) appearing and disappearing as the debug cycle takes place. The intention is that it behaves more as a daemon than as a regular application program so that it becomes part of the instrumentation infrastructure that supports your debug activities. Typically, I have several windows open each cat’ing one of the debug flows, and those windows are maintained through restarts, pauses and reboots of the target.

So, you now have the ability to stream multiple, independent, information flows from your target to your host. More sophisticated exploitation of this capability will be the subject of the next few posts, once we’ve dealt with the hardware side messages from the DWT, SWO’s Cinderella.

Single Wire Output

SWO is the underloved younger brother of SemiHosting. Only available on M3 and above, it provides a flexible window into the behaviour of your target. In the simplest use case, it’s a high-speed, output-only debug serial port.

Understanding SWO needs a bit of background about the various bits of the ARM CoreSight Debug architecture that participate in it. CORTEX-M doesn’t implement full-on CoreSight, it’s more a lightweight version of it, and there are only three component subsystems that have a role, at least for the kinds of cores we’re talking about today;

  • Embedded Trace Macrocell (ETM): Provides live tracing of what the CPU is actually doing
  • Instrumentation Trace Macrocell (ITM): Provides multi-channel program-controlled data output
  • Data Watchpoint & Trace (DWT): Provides watchpoints and change-tracking output

ARM have a bit of a habit of talking in TLAs (Three Letter Acronyms) that make this stuff more impenetrable than it needs to be, but once you’re in the club you can use the TLAs too to keep the riff-raff out, so try and keep up.

Now, each of these three data sources is configured either programmatically or via the debug port. Their output flows through to the Trace Port Interface Unit (TPIU…I guess they ran out of combinations of only three letters) and that talks to the outside world. The block diagram of the TPIU looks like this;

The TPIU consists of a number of functional blocks; Interfaces to the ETM, ITM and APB (ARM Peripheral Bus, for config and management), a formatter to frame up the data from these sources and a serialiser to turn it into an appropriate format to be sent over the wire. DWT is the poor stepchild here. It sends its data via the ITM and never seems to get mentioned in letters home…but when we talk about ITM, you can assume the DWT is along for the ride too.

The formatter multiplexes the available data sources into packets that are sixteen bytes long. The formatting of this multiplexed packet is really rather clever (see Section D4 in here) and is designed to minimise the overhead that the multiplexing imposes. When you’re only using the TPIU for ITM output (See, you’re getting the hang of these TLAs) the formatter can be bypassed and the ITM data are passed directly to the Serialiser, thus reducing overhead and simplifying the packet format. That is indeed the way the SWO is often used in ‘simple’ implementations.

The serialiser is interesting. You’ll notice it has both a TRACESWO output and a four bit TraceData output too.

The four bit TraceData, in conjunction with the TRACECLK output, is used for ‘parallel trace’. It has higher bandwidth than the single wire output (which allows it to do new things) but, importantly, it’s fed from the same data sources so, modulo bandwidth limitations, you can do the same things with the TRACESWO output that you can do with the TRACEDATA outputs. We’ll deal with TRACEDATA extensively in a future post, but for now TRACESWO is the star of the show.

The serialiser kicks data out of the TRACESWO pin at a rate governed by the TRACECLKIN (which is fed on-chip by some clock source or other). Data can be sent out either Manchester encoded, or in UART format that will be more familiar to many people. You’ll hear the terms NRZ (Non-Return to Zero) and RZ (Return to Zero) used to describe these formats. You can Google for more information easily enough, but the important thing is that a RZ protocol also encodes the clocking information (at the expense of double the bandwidth requirement) whereas a NRZ protocol requires you to know the bitrate ahead of time. If you’re developing custom hardware to swallow the TRACESWO output you’d want to use RZ, if you’re hoping to use a TTL UART, then it’s NRZ all the way. The NRZ TRACESWO output format is hardwired as 8 databits, 1 stop bit, no parity.
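The bitrate arithmetic is worth having to hand: in NRZ mode the output runs at TRACECLKIN/(prescaler+1), so picking the prescaler for a wanted baud rate is just a division. A tiny sketch (swo_prescaler is my name, not a register), assuming that divide-by-(n+1) relationship:

```c
#include <stdint.h>

/* NRZ SWO bitrate is TRACECLKIN / (prescaler + 1), so for a desired baud
   rate the value to program is clock/baud - 1. */
static uint32_t swo_prescaler(uint32_t traceclk_hz, uint32_t baud)
{
    return traceclk_hz / baud - 1u;
}
```

At 72MHz that gives 624 for an exact 115200bps; anything within a few percent of the nominal rate keeps a UART receiver happy.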

So, let’s recap where we are. By the appropriate configuration of registers we can get realtime logging, exception and even execution trace out of our CPU via a single pin. We can even get those data out via a logic level UART connection (and yes, you can just capture the output using one of those horrible USB to UART adaptors). Next step – how do you grok the data on the host side?

Well, if all you want is an extra serial output for debug then that’s easy – configure the TPIU to bypass the formatter and to spit out the messages in NRZ format, then make sure you write to ITM channel 0 and hang a USB to UART adaptor off the SWO pin with a terminal application on the host. You’re done. You’ll even find a suitable call in the CMSIS, ITM_SendChar, which will send a single character over the link on channel 0 to drop out on your host.

The magic incantations to get all of this going fall into two parts; the first is chip specific to configure the SWO pin for use, the second is CORTEX-M generic, to configure the ITM, DWT, ETM and TPIU (although, in reality, you can largely ignore the ETM if you’re just wanting simple debug output, and the DWT just needs to provide sync to the ITM). Something like this suffices for a STM32F103;

/* STM32 specific configuration to enable the TRACESWO IO pin */
RCC->APB2ENR |= RCC_APB2ENR_AFIOEN;
AFIO->MAPR |= (2 << 24); // Disable JTAG to release TRACESWO
DBGMCU->CR |= DBGMCU_CR_TRACE_IOEN; // Enable IO trace pins for Async trace
/* End of STM32 Specific instructions */

/* Configure the TPIU */
*((volatile unsigned *)(0xE0040010)) = 625; // Output bits at 72000000/(625+1)~=115.2kbps
*((volatile unsigned *)(0xE00400F0)) = 2; // Use Async (NRZ) pin protocol
*((volatile unsigned *)(0xE0040304)) = 0; // Bypass the TPIU formatter and send output directly

/* Configure the DWT and ITM */
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk; // Enable access to the trace registers
DWT->CTRL = 0x400003FE; // DWT needs to provide sync for the ITM
ITM->LAR = 0xC5ACCE55; // Unlock access to the ITM registers
ITM->TPR = 0x0000000F; // Trace access privilege from user level code, please
ITM->TCR = 0x0001000D; // Enable ITM with sync and DWT forwarding, TraceBus ID 1
ITM->TER = 1; // Only enable stimulus port 0

while (1)
{
    for (uint32_t i = 'A'; i <= 'Z'; i++)
        ITM_SendChar(i);
}

So, there you go, serial port debug with low overhead and without it actually costing you a serial port on the target. The good stuff, using the real capabilities of the ITM and DWT, only comes if you spend more effort understanding those two macrocells and put real decode software on the host side. That’s the subject of the next installment.

While you’re waiting for that to land, there’s a short YouTube Video from ARM giving a better overview of this stuff than I ever could.

 

SemiHosting

SemiHosting is one of the oldest ARM debug support mechanisms which even today has a couple of advantages over most of the alternatives.

In general the debug options for ARM CORTEX CPUs are confusing to the newcomer. The embedded world expects everyone to already be an expert, with the end result that you’ve got to be living in it for a fairly significant length of time before the fog finally starts to clear.

I’m assuming that anyone reading this stuff has already got their head around I/O bits and serial ports, so let’s concentrate on SemiHosting as our first entry into this wonderful world. This is obviously just an intro, you should look at the ARM documents when you want the real lowdown. I should state upfront that I don’t generally use SemiHosting, I find other techniques more suitable, but this should give you enough of a foothold to start using it if it looks like it floats your boat.

SemiHosting has been around since the 1990s. It allows the application running on your Target (embedded CPU) to access the Input and Output capabilities of the Host that is connected over the debug link. It does this by ‘tunneling’ the I/O requests over that link for various file descriptors. You’ll recall that file descriptors 0 and 1 are stdin and stdout in the Unix world, so one of the things you get with SemiHosting in addition to file access is remote screen and keyboard for your target application. Bargain.

It’s important to be aware that when an app is compiled with SemiHosting it will not work without the debugger connected. This is a big restriction. It also switches the CPU into Debug mode while it’s active, where it doesn’t play nicely with interrupts and stuff. Let’s be honest, SemiHosting is really useful for testing routines that take chunks of data in or throw chunks of data out, because that’s where the file handling bit comes in. It’s not great for realtime oriented stuff either, because it’s not a particularly fast technique. Its big advantages are that it’s properly bidirectional and it integrates cleanly, with no (or very little) glue, with the filesystem on the host.

So, how does it work? It turns out the implementation is slightly different depending on whether you’re on an ARMv6-M or ARMv7-M core (M0 or M3/4 etc.) as distinct from any other ARM family CPU. In the former case the BKPT (Breakpoint) instruction is used; other ARM CPUs use SVC (Service) calls…that distinction doesn’t really matter though unless you’re stepping through machine code trying to figure out what’s going on…so let’s stick with the CORTEX-M case.

When the application on the target wants to perform a SemiHosting call in regular code it performs a BKPT 0xAB instruction with the operation to be performed in R0, and parameters in other registers. A few examples of ARM-set standard actions are;

1 – SYS_OPEN : Open a file on the host
2 – SYS_CLOSE: You can figure this one out
3 – SYS_WRITEC: Write a single character
5 – SYS_WRITE: Write a block
6 – SYS_READ: ...and so it goes on

Obviously each of these calls needs parameters and returns results. The reference above gives you all the info you need on what those actually are…although in reality you mostly use libraries to realise a SemiHosting implementation so you don’t need this level of detail. One question I always had was why SemiHosting was implemented with BKPT/SVC and not just a library…well, if you think about it, an exception-based calling routine will work anywhere, with any language and from any processor state (pushing the CPU into a Debug state), so it’s a much cleaner implementation than the alternatives that you might dream up.
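To make the shape concrete, here’s a sketch of how a SYS_WRITE wrapper marshals its arguments. semihost_raw() stands in for the BKPT 0xAB trap (op in R0, parameter block address in R1); on target it would be a two-instruction asm shim, but it’s stubbed out here so the sketch compiles on a host:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define SYS_WRITE 0x05  /* write a block to a file descriptor */

/* Host-side stub standing in for the BKPT 0xAB trap; it just records
   the request so the sketch is checkable off-target. */
static int last_op;
static uintptr_t last_block[3];
static intptr_t semihost_raw(int op, void *block)
{
    last_op = op;
    memcpy(last_block, block, sizeof last_block);
    return 0;  /* SYS_WRITE returns the number of bytes NOT written */
}

/* SYS_WRITE takes a three-word parameter block: fd, buffer address, length. */
static intptr_t sh_write(int fd, const void *buf, size_t len)
{
    uintptr_t block[3] = { (uintptr_t)fd, (uintptr_t)buf, (uintptr_t)len };
    return semihost_raw(SYS_WRITE, block);
}
```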

So, we’ve reached the BKPT/SVC handler, and we’ve got our marching orders in the various registers…how does this get conveyed to the connected debugger? That depends on the compiler and debugger you’re using, but let’s stay in a GCC/GDB world where everything is documented and transparent.

In that case the handler marshals everything and sends it over the GDB link. That’s all documented in the Remote Protocol section of the GDB manual, and specifically the File I/O Remote Protocol Extension. I’m not going to regurgitate all of that stuff here for the purposes of padding a blog, but suffice to say that requests from the target eventually pop up at the host end where GDB (or, if you’re using something like Segger or OpenOCD, initially the debug driver) handles it and returns the results back to the target.

OK, so that’s the mechanics, and you understand the limitations, so how to use it in the real world? Turns out it’s pretty straightforward, just add the magic incantation

--specs=rdimon.specs

to your linker options (replace that with

--specs=nosys.specs

when you want to turn SemiHosting off). That will load up the BKPT/SVC handling routines and allow you to use printf/scanf and all the file handling stuff in your application. One thing that folks do forget is an initialisation call that’s needed at the start of main (or leastways, before you do any SemiHostery) if you’re not running newlib;

extern void initialise_monitor_handles(void); /* prototype */

int main(void)
{
    initialise_monitor_handles();
    ....

You’ll probably need to switch on the semihosting options on your host side debug stub, and the MCUOnEclipse site has good info on doing that.

You don’t need to do anything extra if you’re running a Black Magic Probe…one of its big advantages is that it’s all handled natively.

So, there you have it. Zero to SemiHosting-competent in ten minutes. If you can cope with an output-only channel, though, there are better, faster, more flexible options. More to follow.

Debug communication with an ARM CORTEX-M target

Textual debug input/output is a way of life on the desktop. With no screen and no keyboard, surely you’ve got fewer options in an embedded system?

In some ways you’ve got more options for getting debug out of an embedded system than you have on a desktop one. Just the other day I posted an example of using a single digital output pin to convey external information about how busy a system is – something that’s rather more involved to achieve on the desktop. So let’s do a brief survey of some of the options that let you figure out what that inscrutable lump of highly refined sand on your bench is actually doing.

The basic option, that’s been around as long as embedded systems themselves, is the single I/O bit. For output it can be used to indicate entry into a specific bit of code, how loaded the system is, any number of error conditions and a thousand and one other things. Most designs feature at least one LED hooked onto that output pin to give a visual indication to the user without needing any special equipment beyond a Mk.1 eyeball. In my designs I always have a minimum of one (not red) LED which does normal service as a ‘heartbeat’ showing that the system is alive. It serves double duty as the exception indication LED when the system ends up in an unrecoverable situation. Believe me, it can be difficult to spot that situation quickly (you’ll implement that on your second design, immediately after you’ve spent an hour staring at your board wondering why it’s not responding)…don’t underestimate how useful that is. Frankly, if you can spare the bits, put an RGB LED on there (a Wurth 150141M173100 is only about 30c) and you’ve got eight different conditions you can show, even if you choose not to provision it for production. Stick that LED on PWM outputs and you’ve got any colour under the rainbow. Perhaps not really too useful, but cool anyway.

On the input side a single bit lets you trigger certain paths through the code or change runtime conditions. It’s very difficult to get a clean switching action on an input bit without a bit of additional circuitry…generally you can work around that by sampling twice (Google ‘Switch Debouncing’ for more than enough examples of how to do that) and software is always cheaper than hardware – in per-unit cost at least. The lack of clean switching action can bite you if you sample very quickly or use an interrupt to capture the state change event though…and it’s considerably worse if you just use a shorting wire, a pair of tweezers or whatever other conductive implement you happen to have on your desk. The one-liner algorithm description for a software debounce is simple enough: after sensing a state change, wait for 20ms or so and check if it’s still changed…if it is, then the transition is valid.
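To put a little flesh on that one-liner, here’s a hedged sketch of such a debouncer. The names and the 20ms constant are mine, and the raw pin read and millisecond tick are passed in as parameters so you can drop in whatever your hardware provides;

```c
#include <stdbool.h>
#include <stdint.h>

/* 'raw' is the instantaneous pin reading, 'now_ms' a free-running millisecond
 * tick; call this as often as you like (each poll, or from a timer). A change
 * in the raw input is only accepted once it has held steady for DEBOUNCE_MS. */
#define DEBOUNCE_MS 20u

typedef struct
{
    bool stable;        /* last accepted (debounced) state */
    bool candidate;     /* raw value we're waiting to confirm */
    uint32_t changedAt; /* tick at which the candidate first appeared */
} debounce_t;

bool debounceSample(debounce_t *d, bool raw, uint32_t now_ms)
{
    if (raw == d->stable)
    {
        d->candidate = raw;          /* no change pending */
    }
    else if (raw != d->candidate)
    {
        d->candidate = raw;          /* new candidate, restart the clock */
        d->changedAt = now_ms;
    }
    else if ((now_ms - d->changedAt) >= DEBOUNCE_MS)
    {
        d->stable = raw;             /* held long enough, accept it */
    }

    return d->stable;
}
```

A bounce simply restarts the candidate clock, so the tweezer-and-shorting-wire crowd get the same protection as everyone else.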

Moving on from the single I/O bit approaches, we very quickly end up wanting to spit out serial data; debug strings, data values, error conditions and operational conditions really help colour in what a system is actually doing, as opposed to what you think it should be doing. There’s so much value in a system reporting what it’s up to that we often find output-only serial ports fitted for debug, with no corresponding input, and there are multiple options for getting that.

Let’s consider what the various options for serial I/O are;

  • A real serial port. Normally this is configured for asynchronous operation (i.e. with start and stop bits) and it’s sometimes referred to as a UART (Universal Asynchronous Receiver/Transmitter), which is often used to implement the RS232 communications protocol. It’s not really RS232 though, ‘cos RS232 specifies a lot of things that are ‘interpreted liberally’ in a debug port; the signaling levels might be 0 & 3V rather than the positive and negative 12V signaling that a ‘real’ RS232 port generally uses. With the advent of uber-cheap USB ‘TTL Serial’ interfaces from FTDI and others this kind of debug port has become very popular, and you’ll often find logic-level serial interfaces on debug probes like the Black Magic Probe or the Segger J-Link.
  • Overlaid functionality on a debug channel. If we’ve got debug communication established with a CPU via JTAG or SWD then that channel can also be used for bidirectional debug communication. On ARM it’s generally known as ‘Semihosting’ and it’s a virtually no-cost channel in hardware terms, but fast it isn’t. It does have a few distinct advantages though, and we’ll talk about those later.
  • Single Wire Output. When the JTAG interface is in SWD mode there are spare pins, one of which (the one that’s normally used for TDO) can be used for serial debug output. There’s quite a sophisticated infrastructure behind this pin on-chip and it’s a powerful capability. We’ll start to investigate that in a series of future posts. The big problem with SWO is that it’s output only, and if you’ve got a minimal debug setup on your board (SWCLK, SWDIO, Gnd) then SWO needs another pin. The big brother of SWO is TRACE output, which is effectively parallel SWO, but that’s for discussion quite a lot later on.
  • Real Time Terminal (RTT). This one isn’t as well known as the other options, but it leverages the Segger hardware in a very clever way to deliver high speed communication with minimal target overhead. Basically, you put aside an area of memory on the target for ring buffers and then the debugger dips into those buffers while the target is running to exchange data. Since the debug capability on a CORTEX CPU doesn’t impact the runtime speed of the target this is a pretty quick mechanism; the target ‘cost’ is limited to the ring buffers and the simple memory copies to get the stuff to/from the buffers. Other probes could do this, but generally don’t, at least today.

So, that’s a quick overview of the various techniques I’m aware of, but perhaps there are more (or variations on a theme) that are worth documenting too? Of course, no one of these has to be used exclusively, and it’s quite common to see them used in combination on any given target. As a quick example, when I have a system that gets into a panic condition, I call the following routine;

static bool isSet; /* current state of the error LED */

void GenericsAssertDA(char *msg, char *file, uint32_t line)
/* Lock up tighter than a Duck's A?? and spin, flashing the error LED */
{
    vTaskSuspendAll();

    while (1)
    {
        dbgprint("%s: %s line %d" EOL, (msg == NULL) ? "Assert Fail" : msg, file, line);
        GPIO_WriteBit((GPIO_TypeDef *)GPIOport[GETGPIOPORT(PIN_HB_LED)], (1 << GETGPIOPIN(PIN_HB_LED)), (isSet = !isSet));

        uint32_t nopCount = 1250000;
        while (nopCount--)
        {
            __asm__("NOP");
        }
    }
}

…this is just a few lines of code, but they’re mostly there for a good reason. The first thing we do in a panic is switch off all the tasks, then dump an error message out of the debug serial port, before inverting the state of the error LED. We delay in a busy loop with the NOPs to avoid relying on the majority of the chip being in an operational condition. The nopCount initial value is set to make the LED flicker quite quickly. This sequence repeats continually in case you miss the first serial output (y’know, ‘cos you didn’t have the serial port connected, or whatever).

A GCC preprocessor definition adds quite a lot of value to what you get out;

#define ASSERT_MSG(x,y) do { if (!(x)) GenericsAssertDA((y), __FILE__, __LINE__); } while (0)

Over the next few posts I’ll start digging into these debug options, and show just how powerful they really can be with the right processing hanging off the other end.

Now you at least appreciate that there’s a whole range of options for debug communication with your target, and the more sophisticated ones aren’t really any more expensive than the simpler ones; they just need more setting up.

To OS or not to OS

On an embedded system, should you have an OS or run on the metal?

This one will run and run, so I’ll call this Part 1 for now.

I’ve spent the majority of my professional career railing against the use of OSes for embedded systems. A few years ago I analysed my reasoning behind that and realised a lot of it was founded in the arrogant belief that no-one could write code that was as well optimised as mine. That may, or may (more probably) not, be true, but there are plenty of other reasons for seriously considering a lightweight OS for your next project.

Following that little thought investigation, I now invert the discussion, and start off with “Why wouldn’t I run an OS underneath this?”. The fact is, on day one, every project starts off small, manageable and with a simple set of needs…you don’t need an OS in that environment, plain and simple. But, as your project grows, you need to do more and more things and, without an OS, you’ll find yourself re-inventing stuff that you get for free in OSville. Ah, you say, but I already have a library for timers, and message passing, and task switching, and queues….congratulations, we call that an OS, it’s just that you didn’t.

There are legitimate reasons for going OS-commando, the main one being that you’re really short of memory (RAM or Flash). Like it or not, an OS is going to gobble some of it up (a Mutex semaphore in FreeRTOS on CORTEX-M takes 80 bytes; that hurts when you’ve only got 4096 of ’em around), so defining your own can really help…but be careful: allocating one bit in the bit-addressable RAM area just saved you 79 7/8 bytes of memory, but that isn’t the end of the story, because you’ve still got the care and feeding of that structure to deal with. It’s surprising just how much Flash memory, in comparative terms, that care and feeding can take, and not too many people would claim that FreeRTOS is the most super-efficient RTOS in its RAM allocation.

Similar considerations apply on the Flash side. A reasonably complete FreeRTOS implementation on a STM32F103 in release configuration is about 6K…you can come down a ways from there if you start chopping out options, but the total spend will still be a four digit number, and that’s a fair proportion of a budget that might only be 16K or 32K.

One thing that an OS doesn’t have to do though is slow you down, and that’s the main criticism I hear (and, indeed, was one of my primary prejudices). The fact is that most of the time, for most of your code, the 1-2% overhead the OS brings along for the ride really doesn’t matter. It does matter when you’ve got a time critical task to handle, and that’s often (mostly?) done in interrupt code, so how fast a RTOS handles interrupts is much more important than how it handles base code.

Most Real Time OSes offer ‘Zero Latency Interrupts’, or some equivalent term. All that really means is that the OS doesn’t pre-service the interrupt for you; it doesn’t grab the interrupt and perform the initial handling of it before passing it off to your code. That does happen in desktop OSes, and you’ll hear the terms ‘top half handler’ and ‘bottom half handler’ used to reflect this split between OS-controlled and application-controlled code.

With a Zero Latency interrupt your response time is exactly the same as in the OS-free case, because the interrupt lands directly in your code. Indeed, response time could even be better. How? Well, let’s look at a lazy-assed implementation of an OS-free app (one of mine, so I can criticise…it’s available here if you want a laugh). In this app communication is arranged through simple flags…you set a flag in one place, and that triggers a task in another. The code to set a flag looks like this;

void flag_post(uint32_t flag_to_set)
{
    denter_critical();
    flags|=flag_to_set;
    dleave_critical();
}

… and the denter / dleave routines;

void denter_critical(void)
{
     __disable_irq();
     _critDepth++;
}

void dleave_critical(void)
{
    if (!--_critDepth)
    {
        __enable_irq();
    }
}

…so, as you can see, all interrupts are turned off while we go fiddle with flags and _critDepth…and during that time the CPU is away with the fairies and isn’t going to respond to any other maskable interrupt, no matter how much it yells. That will show up as jitter in interrupt response time (there is another reason for jitter; we’ll come back to that later).
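For completeness, here’s a hedged, host-runnable sketch of the consuming side of that flag mechanism. flag_get matches the naming used elsewhere in these posts, but the body is my reconstruction, and the IRQ intrinsics are stubbed out so it stands alone; on target they’d be the CMSIS __disable_irq()/__enable_irq();

```c
#include <stdint.h>

static volatile uint32_t flags;      /* pending event flags */
static volatile uint32_t _critDepth; /* critical section nesting depth */

static void __disable_irq(void) { /* CPSID i on a real target */ }
static void __enable_irq(void)  { /* CPSIE i on a real target */ }

void denter_critical(void)
{
    __disable_irq();
    _critDepth++;
}

void dleave_critical(void)
{
    if (!--_critDepth)
    {
        __enable_irq();
    }
}

void flag_post(uint32_t flag_to_set)
{
    denter_critical();
    flags |= flag_to_set;
    dleave_critical();
}

uint32_t flag_get(void)
{
    denter_critical();
    uint32_t pending = flags;   /* snapshot and clear in one critical section */
    flags = 0;
    dleave_critical();
    return pending;
}
```

Note that the consumer suffers exactly the same interrupts-off window as the poster, which is the jitter we’re complaining about.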

So, how on earth could an RTOS be faster? Let’s consider the equivalent critical section entry in FreeRTOS for an M3 CPU (you’ll find this in portmacro.h, and I’ve hacked the formatting around a bit);

portFORCE_INLINE static void vPortRaiseBASEPRI( void )
{
    uint32_t ulNewBASEPRI;
    __asm volatile
    (
        "mov %0, %1\n" \
        "msr basepri, %0\n" \
        "isb\n" \
        "dsb\n" \
        :"=r" (ulNewBASEPRI) : "i" ( configMAX_SYSCALL_INTERRUPT_PRIORITY )
    );
}

…not a __disable_irq in sight! What FreeRTOS does is temporarily raise the minimum priority of interrupt that will be recognised by the CPU. That has exactly the same effect as __disable_irq for any interrupt with a lower priority than whatever is selected for configMAX_SYSCALL_INTERRUPT_PRIORITY, but leaves higher priority interrupts enabled. So, if I really need that fast response, I just give it a super-high priority and it will get serviced sharpish…the only constraint being that I cannot use OS services within that interrupt.

End result; I’ve got the option of slightly jittery interrupts and OS support, or interrupts faster than the native case, but if I want to use OS features in conjunction with them then I have to jump through some more hoops. Of course you could do the BASEPRI trick in your own code, but someone has already written, tested, debugged and documented it for you, so why bother?

Finally, remember I said that there was another source of jitter? Well, taking an M3 as an example: it should theoretically be able to respond to an interrupt within 10 clock cycles, but other factors (bus latencies, peripheral response times, flash caches and speeds etc.) may conspire to prevent that…so you get response jitter. In real world applications it is often more important to be slower and jitter-free than to be faster and a bit wobbly, so several manufacturers have added the capability to ‘stretch’ the number of cycles taken to respond to an interrupt, so it’s always the same. On the NXP LPC134x CPUs that register is called IRQLATENCY, and it has a default value of 0x10, meaning that, in general, the CPU will hit your code in response to an unmasked, highest priority interrupt request 16 clock cycles after the request is generated. Whether that is enough delay to remove jitter in your configuration depends on exactly how you’ve got the whole system configured, so you can put a larger value in that register if you need to.

I started off this post by being a bit anti-OS, which I have been for most of my career, but when you start peeling back the covers you start to understand that an OS, be it FreeRTOS, RTX, ChibiOS, NuttX or one of the hundreds of others that are out there, is really just a big library of code that you don’t have to write for yourself.  Know your problem, know your chip, and don’t just trust your execution environment decisions to blind prejudice.

Join the discussion here.

Ghetto CPU busy-meter

Even in the embedded world, you really want to know how busy your CPU is.

Many, or even most, embedded programmers have no idea how ‘busy’ their CPU is. By ‘busy’, I mean what proportion of the available CPU cycles are being spent doing useful work in the name of the target application, and how much time it’s idle, waiting for something to do. On your desktop PC you’ve got some sort of CPU busy meter available through the Task Manager, Activity Monitor or System Monitor, depending on which OS religion you follow. Wouldn’t it be useful to have something similar on that embedded device you’re working on? Far too many people think you need an OS for that kind of info, but it turns out it’s trivially easy to do…and it should be table stakes for getting started on a new design.

In most embedded design patterns, you’ve got some sort of loop where the application sits, waiting for something to happen. That might be a timer expiring, an interrupt arriving or some other event. The general pattern looks something like this;

while (1)
{
    flagSet = flag_get();
    if (!flagSet)
    {
        __WFI();
    }
    else
    {
         <Do things>
    }
}

In this specific example the flags are set via interrupt routines, and the __WFI instruction is basically “Wait For Interrupt”. You can see how this works out…we sit waiting for an interrupt in the __WFI (which is generally a low power mode), one arrives, sets a flag or two, and then any outstanding flags are processed before returning to the __WFI for the whole thing to happen again.

Well, this will be easy … all we need to do is to provide some indication to the outside world of when we’re busy and when we’re not at the right places in the system. Something like;

while (1)
{
    flagSet = flag_get();
    if (!flagSet)
    {
        <Indicate Idle>
        __WFI();
        <Indicate Not Idle>
    }
    else
    {
         <Do things>
    }
}

That will work. Your split between ‘busy’ and ‘idle’ might not be quite so clean, but I don’t think I’ve ever seen a system where it’s not possible to differentiate between the two cases.

There is one issue, which is that the interrupt routine will be called before the <Indicate Not Idle> gets processed, but interrupt routines should be short in comparison to the total amount of work done…and I’ll give you a fix for that later anyway.

The easiest way to <Indicate Idle> and <Indicate Not Idle> is just by setting or clearing the state of a spare pin on your device. That also means you’ve got to set up the pin to be an output first of all…so our code now looks like;

<Set up output pin>
while (1)
{
    flagSet = flag_get();
    if (!flagSet)
    {
        <Indicate Idle>
        __WFI();
        <Indicate Not Idle>
    }
    else
    {
         <Do things>
    }
}

Great, we can now see, approximately, how busy our CPU is just by monitoring the pin. The code for each of these routines will vary according to the particular CPU you’re using, but here’s a typical example for a STM32F103 implemented using macros and CMSIS for port pin B12;

#define SETUP_BUSY GPIOB->CRH=((GPIOB->CRH)&0xFFF0FFFF)|0x30000
#define AM_IDLE    GPIOB->BRR=(1<<12)
#define AM_BUSY    GPIOB->BSRR=(1<<12)

These magic incantations will vary from CPU to CPU, but the principle holds fine. You might be even luckier and on your CPU have a pin that changes state when it’s in WFI mode without any of this AM_IDLE/AM_BUSY mucking about – typically that would be a clock that only runs while the CPU is active for example.

This works great if you’ve got a scope to see the output on, but I like something a bit more low-tech that can hang off the port so we’ve got a permanent indication of just how busy the system is.

An LED is a good first step – the CPU will be switching between busy and idle at quite a speed, faster than the human eye can follow, so the brightness is proportional to how busy the system is. If you’re a proper bodger you don’t even need a series resistor for the LED…the resistor is only there to limit the current through it, and since the maximum current out of the CPU pin will generally be considerably less than the max the LED can cope with, you’re golden…but do check the specs for your chip and your LEDs!

Unfortunately, the human eye is rather non-linear, so it won’t easily spot the difference between a 60% loaded CPU and an 80% loaded one, so something slightly more sophisticated is in order. If we put a low pass RC filter (or integrator, or smoother…they’re all the same thing, it just depends how you want to think about it) on the output pin then you’ll get a DC voltage out which is proportional to how busy the CPU is. Let’s go one step further and put a potential divider on the front to set the maximum input to the filter to be 1V, for a 3v3 system.

Hey, I’ve now got a meter that reads from idle (0.00V) to fully loaded (1.00V) directly on the display. Meters with an analogue bar on the display are particularly suitable for this job. Increase the value of the capacitor if you need to reduce jitter in the reading, at the expense of making the response to load changes slower.
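As a sanity check on the maths (the 23k/10k divider values here are my illustrative assumptions, not necessarily the ones in the original circuit), the steady-state meter reading is simply the supply voltage, scaled by the divider ratio, scaled again by the fraction of time the pin spends high;

```c
/* Sketch of the averaging maths behind the meter: the RC filter output
 * settles at VDD * (divider ratio) * (fraction of time the pin is high).
 * A 23k over 10k divider scales 3.3V down to 1.0V at full load. */
double meterVolts(double vdd, double rTop, double rBottom, double busyFraction)
{
    /* Potential divider attenuates; the filter then averages the duty cycle */
    return vdd * (rBottom / (rTop + rBottom)) * busyFraction;
}
```

So a 50% loaded CPU reads about 0.50V, and the meter scale becomes a direct percentage readout.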

Just how cool is that? Of course, the practical implementation doesn’t look as pretty as the circuit diagram, but it does the job;

Remember how we said that one of the limitations of this technique was that the Interrupt routine got called before the Busy Indicator got set? If that is a problem (because, for example, you’re worried that you’re spending a lot of time in interrupts) just put another AM_BUSY call at the start of your ISR. Problem solved. Go to town, put one at the start of all of your ISRs.

Similarly, there’s no reason you have to restrict this technique to just telling you when you’re busy….if you bracket a specific routine in your code with the AM_BUSY tags you can read directly off your meter what percentage of your CPU time is being spent in it. You can even have multiple pins tagging different bits of code… knock yourself out.

At the start of this post, you didn’t know how busy your CPU got. Now you’ve got no excuses.

Join the discussion here.