VGA Video out on BluePill

One major limitation during the development of an embedded system, especially for programmers who are used to PCs, is the lack of video output. That’s exactly what Vidout provides, using only 24% of the CPU on an STM32F103.

It would be great to be able to, at least temporarily, add video to a system during development and then remove it when it goes to the field. Unfortunately, creating video output is generally a rather resource intensive activity, and not one that can normally be patched in for development only. It is also ‘Hard Realtime’, in that the generation of video output must meet tight timing constraints if the display is to look right.

Over the years I’ve built a lot of systems which I’ve claimed to have Realtime characteristics, but even so I’ve never really bumped against the limits of what is possible on a commodity processor, so, with a few evenings free over the Christmas break, I thought I’d see how far I could get at implementing a VGA output for debug use on a pretty much bottom-of-the-barrel CORTEX-M3; The STM32F103C8 based Bluepill board.

Software-based video output is a hard realtime problem that demands deterministic response. Given that the duration of pixel times is measured in nanoseconds, and any mis-timing is very obvious in the form of ‘fizzing’ or deformities in the presented image, this was going to be a major challenge, just the thing for chrimbo. Thus, Vidout was born.

The project objective was simple enough; produce stable video using that board and no (or minimal) additional components. The starting point was the December 2012 blog by Artekit that produced 400×200 bitmapped video using this processor…so it was obviously possible to create some kind of output. A quick read-through of that blog brought a few limitations to light;

  • The video output was a bitmap buffer only, which means you need a lot of RAM to store it (10K Bytes). This is a heck of a hit on a part with only 20K available in the first place, some of which is presumably already used by your target application!
  • It uses three interrupts and relies on the interaction between two timers to generate the precise signalling needed. That’s a lot of resources being used on a constrained part.
  • The code is GPL…there’s nothing wrong with GPL, but it can make it difficult to fold the code into projects, especially if you want to be able to leave it in there as ‘sleeper’ code.
  • I didn’t write it.

What I really wanted was a character oriented driver using minimal resources that could be bolted into existing projects for debug and monitoring purposes. More powerful CPUs have been used to create video exploiting configurable ‘raster generators’ that create the image dynamically on the fly (e.g. by Cliff Biffle with ‘Glitch’). The combination of these two ideas form the basis of this new implementation.

Martin Hinner has collected the VGA standard timing information on his site, and that’s a great resource. Vidout uses the same timing as the Artekit article – the VGA 800×600 output frame, which leads to a line rate of 28.444uS and a frame rate of 56Hz. With these timings, and each horizontal pixel ‘doubled up’ to give a horizontal resolution of 400 pixels, each pixel has a duration of around 55nS…if we can’t achieve these levels of accuracy then the video will be corrupt. For simplicity, and due in part to constraints set by this specific CPU, the pinout is the same as the Artekit code.

Assuming for a moment that we have the source material for an image to be displayed (remember, in Artekit that’s just a static block of memory) then there are three distinct and separate tasks to be performed;

  • Creating and maintaining the ‘frame protocol’ so the monitor will display the image
  • Calculating the pixels to be output for each line of the frame
  • Outputting the pixels for each line

In Vidout these tasks are all performed in vidout.c, and we’ll run through them in turn;

The frame protocol

A single VGA 800×600 frame consists of the following distinct elements;

  • A frame sync pulse, lasting 57uS.
  • A ‘back porch’ of 22 lines (plus any additional lines to centre the image vertically)
  • A sequence of 28.444uS image lines containing the actual data to be displayed
  • Any remaining lines to complete the frame

A timer is used to generate the horizontal line timing. Two channels are used; The first one actually generates the line sync pulse and the second one triggers the line state machine. This allows the ‘back porch’ to be produced automatically by the difference between the two timers, thus avoiding software timing loops, which are never a good thing.
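As a rough sketch of the arithmetic behind that timer setup: at an assumed 72MHz timer clock, the 28.444uS line period and the standard 800×600@56Hz pulse widths reduce to whole tick counts. The clock frequency, durations and helper names here are my assumptions, not values lifted from vidout.c;

```c
#include <stdint.h>

/* Assumed clocking: 72MHz timer, standard 800x600@56Hz horizontal timings */
#define TIMER_CLK_HZ   72000000u
#define LINE_PERIOD_NS 28444u     /* 28.444uS scanline               */
#define HSYNC_PULSE_NS 2000u      /* hsync pulse width (channel 1)   */
#define HBACKPORCH_NS  3556u      /* back porch before pixels start  */

/* Convert a duration to timer ticks, rounding to the nearest tick */
static uint32_t ns_to_ticks(uint32_t ns)
{
    return (uint32_t)(((uint64_t)ns * TIMER_CLK_HZ + 500000000u)
                      / 1000000000u);
}

/* Channel 1's compare ends the sync pulse; channel 2 fires after the
   back porch and triggers the line state machine. The gap between the
   two compare values produces the porch with no software delay loop. */
static void timer_setup_values(uint32_t *arr, uint32_t *ccr1,
                               uint32_t *ccr2)
{
    *arr  = ns_to_ticks(LINE_PERIOD_NS) - 1;              /* period  */
    *ccr1 = ns_to_ticks(HSYNC_PULSE_NS);                  /* sync    */
    *ccr2 = ns_to_ticks(HSYNC_PULSE_NS + HBACKPORCH_NS);  /* trigger */
}
```

Note how the nanosecond durations land on exact tick counts at 72MHz – that’s no accident, it’s why this line rate is achievable with no accumulating error.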

Video Line Output

The line handler is contained in TIM_IRQHandler and runs in response to the timer interrupt. In Pseudo-code it looks something like this;


Clear the interrupt and move to the next scanline
Stop any existing DMA that's in progress

switch (based on scanLine of Frame)
{
case FRAME_START ... FRAME_BACKPORCH - 1:
  Enable Vertical Sync Pulse

case FRAME_BACKPORCH ... FRAME_OUTPUT_START - 1:
  Output a blank line

case FRAME_OUTPUT_START ... FRAME_OUTPUT_END:
  Start DMA output of the prepared pixel line
  IF (this is the last repetition of this line)
  {
    Set a DMA interrupt to occur at the end of this transfer
    Make sure the other buffer will be transmitted next time
  }

case FRAME_OUTPUT_END + 1 ... FRAME_END - 1:
  Send out a zeroed line

case FRAME_END:
  Reset to the top of the frame
}

You may not have seen this ‘…’ syntax before – it’s a GCC extension for covering multiple case values. If your compiler doesn’t have it then you can just, rather more untidily, substitute if ((x>a) && (x<b)) kinds of filters if you prefer.
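As a toy illustration of that extension, here’s a function mapping a scanline number to a frame region using case ranges. The boundary values are illustrative only, not Vidout’s real frame constants;

```c
/* GCC case-range extension: each case covers an inclusive range.
   Line boundaries below are made up for illustration. */
enum region { SYNC, PORCH, ACTIVE, BLANK };

static enum region classify(int line)
{
    switch (line) {
    case 0 ... 1:    return SYNC;   /* vertical sync pulse      */
    case 2 ... 23:   return PORCH;  /* back porch / blank lines */
    case 24 ... 311: return ACTIVE; /* visible pixel lines      */
    default:         return BLANK;  /* remainder of the frame   */
    }
}
```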

This is the most timing critical part of the whole implementation so it’s run from RAM (to avoid wait states) and is maximally optimised. Wait states are a huge problem when you’re running tight realtime responses, especially with interrupts in the mix; Flash memory is too slow to feed instructions to the CPU at its full operating speed, so a small, fast, cache is inserted between the CPU and the Flash. This buffers the Flash and generally fixes the majority of the problem. Unfortunately, interrupts generate exceptions to the regular flow of execution and the benefit of the cache is negated. Whenever possible, keep interrupt handlers in fast RAM and not slow Flash, to avoid a disproportionate performance hit both to the interrupt handler and the base level code that’s getting interrupted.
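Placing a routine in RAM with GCC looks something like the sketch below. The “.ramfunc” section name and linker fragment are my assumptions; check the linkfile supplied with Vidout for the names it actually uses;

```c
#include <stdint.h>

/* Sketch of placing code in RAM with GCC. ".ramfunc" is an assumed
   section name. (The guard is only so this compiles on a Mach-O
   host, which spells section attributes differently.) */
#if defined(__APPLE__)
#define RAMFUNC __attribute__((noinline))
#else
#define RAMFUNC __attribute__((section(".ramfunc"), noinline))
#endif

/* A handler marked this way executes from zero-wait-state SRAM, so
   an interrupt doesn't pay the Flash cache-miss penalty on entry.  */
RAMFUNC uint32_t ram_resident(uint32_t x)
{
    return x * x;
}

/* The linker must gather the section into RAM and the startup code
   must copy it there at boot, exactly as .data is handled, e.g.:
     .data : { ... *(.ramfunc) ... } > RAM AT > FLASH               */
```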

The line handler operates as a state machine which performs actions depending on which line of the frame is being processed. You can easily see from the code each of the segments of the frame being output, with the biggest part being the output of the active display lines.

The output of the display pixels is the heart of the system and is performed using a SPI peripheral fed from a DMA channel. The transfer is triggered by this routine and it’s essential that it is triggered at exactly the same time (relative to the line sync pulse) for each line, otherwise jitter will occur which will be visible as fizzing around the characters. This is a particular issue in this implementation because each individual line is transmitted twice to ‘stretch’ the Y resolution. This means the handling for each new or repeated line is slightly different; you can see the practical consequence of this where the DMA IFCR register Transmission Complete (TC3) bit is reset before every line transmission even though it’s actually only set on every other line transmission…this simply equalises the time taken on both paths through the code.
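The per-line restart can be sketched as follows. The struct models an STM32F1 DMA channel’s registers (bit positions per the reference manual) so the logic is checkable off-target; everything else, including the function and macro names, is an assumed rendering of the behaviour described above, not the actual vidout.c code;

```c
#include <stdint.h>

/* Minimal model of an STM32F1 DMA channel's registers */
typedef struct {
    uint32_t CCR;    /* control: EN is bit 0, TCIE is bit 1 */
    uint32_t CNDTR;  /* transfer count remaining            */
    uint32_t CMAR;   /* memory source address               */
} DmaChan;

#define CCR_EN    (1u << 0)
#define CCR_TCIE  (1u << 1)
#define CTCIF3    (1u << 9)   /* IFCR: clear channel 3 TC flag */

static void start_line(DmaChan *ch, uint32_t *ifcr,
                       uint32_t pixels_addr, uint32_t nbytes,
                       int last_repetition)
{
    ch->CCR &= ~CCR_EN;       /* stop any transfer in progress      */
    *ifcr    = CTCIF3;        /* cleared on EVERY line, even though
                                 it only sets on alternate ones -
                                 this equalises both paths' timing  */
    ch->CMAR  = pixels_addr;  /* this scanline's pixel bytes        */
    ch->CNDTR = nbytes;
    if (last_repetition)      /* second copy of the line going out:
                                 interrupt so the next gets built   */
        ch->CCR |= CCR_TCIE;
    ch->CCR |= CCR_EN;        /* start clocking pixels into the SPI */
}
```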

Video Line Generation

So, we now know how the frame protocol is generated, but where does the pixel data for the frame come from? Well, since each line is repeated twice there are two line intervals (2×28.444uS) worth of time available to calculate each row of output, so this can be done in a separate, lower priority, thread of execution than the frame handler. This job is done by the DMA_CHANNEL_IRQHandler.

Two lines of output pixel data are maintained; This allows a line to be output while the next one is being generated. The basic process is that as soon as a line has been transmitted an interrupt is triggered that starts the creation of the next one. This interrupt is deliberately set to have a lower priority than the line interrupt so that the second ‘copy’ of a line can be transmitted while the next one is still being generated.


Acknowledge Interrupt
if (attempting to output line greater than number on screen)
{
  Setup first line ready for next frame into one half of pixelbuffer
  Setup zeros in other half of pixelbuffer
}
else
{
   Prepare next line for output in the free half of the pixelbuffer
}

Pixel data lines are created by indexing into a character font using a character index from the displayFile (in the rasterLine routine)…this allows the display file to contain just the characters (for an 8×16 font, that’s 50 characters per line at a 400 pixel resolution) and the pixel buffer to be constructed dynamically as it’s needed. That means that only 900 bytes are needed for a complete 50×18 screen displayed at 400×288 pixel resolution, rather than the 14KBytes that would be needed for a full pixel-based representation. It also makes manipulation of this buffer much faster, simply because there’s less of it.
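The core of that expansion can be sketched like this; the names, the 256-entry 8×16 font layout and the omission of the graphics-window merge are all simplifying assumptions, so treat it as the idea rather than the real rasterLine;

```c
#include <stdint.h>

#define CHARS_PER_LINE 50   /* 50 chars x 8 px = 400 px wide */
#define FONT_HEIGHT    16   /* 8x16 glyphs                   */

/* Build one scanline of pixel bytes from a row of the character
   display file by indexing the font: one font byte = 8 pixels.  */
static void raster_line(uint8_t pixels[CHARS_PER_LINE],
                        const uint8_t *displayFile,
                        const uint8_t font[][FONT_HEIGHT],
                        unsigned textRow, unsigned glyphRow)
{
    const uint8_t *chars = &displayFile[textRow * CHARS_PER_LINE];
    for (unsigned i = 0; i < CHARS_PER_LINE; i++)
        pixels[i] = font[chars[i]][glyphRow];
}
```

Each of a character cell’s 16 scanlines is produced by a different glyphRow, and (per the frame protocol above) each generated line is transmitted twice, so the full-resolution image only ever exists two scanlines at a time.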

The original design was intended to be text only, but quite often it is also useful to have a small amount of graphical output. Full screen graphics are prohibitive but a small window is manageable. For this reason rasterLine (supported by displayFile) also supports a small, configurable, graphical window which can be overlaid anywhere on screen. The overall effect is a large graphic ‘sprite’ that can be moved freely around.

The Result

Results were better than expected. It is possible to generate good quality VGA video output at resolutions of up to 100×36 characters, although you’ve got the CPU on its knees at that point, with near enough 100% of it in use;

Hires (800 x 400) output displaying 100 x 18 text.

Interestingly, this mode bounces against the limits of the memory buses too. If the graphic is overlaid on the text you start to see fizzing towards the right hand end of the text line – that’s because the DMA is having to wait for access to the RAM, and that causes visible artifacts.

More realistically 50×18 is easily possible, even when running with a partial graphic window. Even cooler, the graphic window can be moved around with no impact on the text;

Hires (800 x 400) displaying 50 x 18 text

Here’s a video showing how the graphic and text layers can be moved independently (this is the code in the example main.c you’ll find in the repository).

Size isn’t too shabby either, bearing in mind that there’s a complete 8×16 character set gobbling up 4K of the Flash (which you can obviously reduce if you don’t care about certain characters);

~/Develop/vidout$ make
Compiling thirdparty/CMSIS/src/system_stm32f10x.c
Compiling thirdparty/CMSIS/src/core_cm3.c
Compiling vidout/rasterLine.c
Compiling vidout/displayFile.c
Compiling vidout/vidout.c
Compiling app/main.c
Assembling thirdparty/CMSIS/src/startup_stm32f10x_md.s
Built Release version
text data bss dec hex filename
7704 40 3140 10884 2a84 ofiles/firmware.elf
~/Develop/vidout$

This size is also with a 192×80 graphic panel inserted. Without that the bss comes down to around 1220 Bytes.

Performance Testing

As you know by now, I’m a bit obsessive about knowing just how busy my CPU is. Producing the VGA image only requires code running under interrupt, so it’s pretty easy to instrument the system to measure this. Simply place AM_BUSY calls at the start of each interrupt routine and an AM_IDLE call before entering WFI in the idle loop. Pin B12 will be high whenever the CPU is doing something, thus giving an easy readout of the performance of the code. In the previous article on the subject I used a couple of resistors and a multimeter to give me a direct readout of this figure, but there’s another useful trick here; Just put your scope onto the output pin and low pass filter the output using its maths facilities – the resulting trace shows you just how busy your CPU is over time and, by adjusting the cutoff frequency, you can set different ‘time horizons’ for assessing this busy factor.
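The instrumentation itself is tiny. On the real part AM_BUSY/AM_IDLE would write GPIOB->BSRR to drive pin B12; in this sketch the register is modelled by a variable so the mechanism is visible off-target (macro bodies here are my guess at the technique, not Vidout’s exact definitions);

```c
#include <stdint.h>

static uint32_t bsrr_shadow;              /* stands in for GPIOB->BSRR */
#define PIN_B12    (1u << 12)
#define AM_BUSY()  (bsrr_shadow = PIN_B12)        /* drive B12 high */
#define AM_IDLE()  (bsrr_shadow = PIN_B12 << 16)  /* drive B12 low  */

/* First statement of every interrupt handler marks us busy */
static void some_IRQHandler(void)
{
    AM_BUSY();
    /* ...interrupt work... */
}

/* Idle loop marks idle just before sleeping; the next interrupt's
   AM_BUSY() raises the pin again, so pin-high time = CPU-busy time */
static void idle_task(void)
{
    AM_IDLE();
    /* __WFI(); */
}
```

A single BSRR write is one store instruction, so the measurement overhead is negligible compared with what it’s measuring.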

Here’s the output for Vidout based on a 1KHz cutoff frequency. The drop in busy factor when pixel lines don’t need to be generated (at the start and end of a frame) is very easy to see;

Busy/Free pin (Yellow Trace) with Low Pass Filtered version (Pink Trace)

The yellow trace is the state of the busy/free pin and, as you can see, it’s changing too fast to really be visible, but it’s the low-passed version of this trace that is really quite revealing – the dip, marking the end of one frame and the start of the next (when no pixel lines are being generated), is clearly visible. The ‘bump’ is the part of the screen where there’s both a graphic block and text (so you can see the incremental impact of the graphic block), and the rest of the line is flat. Overall, we’re using about 24% of the CPU to create this video output.

Using Vidout

You can find Vidout on github, complete with a demonstration main app. Using it is trivial. Ensure that at least one very high priority interrupt level is available – that will be used for the Line Handler. Absolutely nothing must get in the way of that being able to run immediately the timer fires. A lower priority interrupt is used for the Line Creator. Simply call vidInit from your code and the video generator will be installed and will start outputting. A handle to a displayFile will be returned. It is by manipulating the displayFile that the screen is updated…take a look in displayFile.h for APIs to manipulate both the graphic and text elements of a screen, while main.c gives you an example of how to use it all.

The good stuff is all in the vidout directory, which you can simply copy into your own project, but be careful to make sure you have the support in your linkfile for RAM based routines. There are comments explaining this in the source, and there’s a suitable linkfile there that you can copy from. The code uses CMSIS to identify registers and bits, but if you’re averse to CMSIS then you can easily replace those definitions with direct memory addresses, there aren’t so very many of them and they’re all in vidout.c.

Porting

Porting Vidout to other CPUs should be straightforward. All that’s needed is a timer that can generate the horizontal sync pulse and trigger the state machine, and a peripheral that can pump bits out quickly enough to deliver the image data to the monitor. In the STM32F103 implementation that’s done by SPI, but other CPUs have other peripherals that can also do it (perhaps direct DMA output or special programmable logic such as exists on the IMXRT or PSoCs)…if you do port Vidout to another CPU, please send the patches so they can be folded into the repository for others to use.

Alan Assis is busy porting Vidout to NuttX so it’s worth keeping an eye on his blog for current status. Once initialised Vidout only uses interrupt level code so it’s pretty much transparent to any OS that might be running – it certainly runs alongside FreeRTOS just fine.

Enjoy.
