Like many complex problems, the question of how to meet real-time requirements has many answers. On a Technologic Systems SBC or CoM running Linux, application-specific algorithms can be implemented in at least four different layers:
- Scripting languages
- Compiled applications
- Device drivers
- FPGA logic
Developers who begin with the assumption that their entire embedded application will be implemented in their language of choice are likely to end up investing a lot of engineering time in a system that doesn't work as expected. The recommended practice for most embedded applications is to split them into layers, based primarily on latency requirements, and implement each layer in the appropriate way. After doing this analysis, designers may be pleasantly surprised to find that 90% of their application should be written in Perl or Python. The purpose of this paper is to guide embedded system developers towards an application architecture that fully meets latency requirements while minimizing cost and time to market.
Latency, for our purposes, means the time between critical events. For example, with a touchscreen human-machine interface (HMI), users will prefer a device that seems to respond immediately to input. Human reaction time to visual stimuli is around 200ms, so a touchscreen interface that responds in 100ms or faster can be expected to feel "snappy." The latency being measured here is the time between when the user presses a button and when the screen is re-drawn accordingly. For an HMI system with no other real-time requirements, this is a very realistic goal.
Real-time is a buzzword that is applied to many systems with a wide variety of latency requirements. A hard real-time system is one where failure to respond to certain events in time is a failure of the entire system, so latency must be guaranteed. A firm or soft real-time system is one where an occasional failure to meet deadlines degrades the quality of the system but is not catastrophic. The user interface example just mentioned is an example of one with a soft real-time constraint, because the UI feeling slow on rare occasions is not problematic.
Latency should not be confused with CPU throughput or bus bandwidth. CPU throughput or CPU performance is simply how quickly the processor can execute instructions. Bandwidth is how quickly data can be transferred to or from an external device. An Arm9™-based CPU running at 800MHz may have excellent CPU throughput and fairly good interrupt latency, but score poorly on polling latency and bus bandwidth. Hardware and software factors affect these qualities.
System developers planning a real-time system need to define the parameters of what real-time means for their particular application. In most cases, there are only a small number of events or classes of events that need to be responded to with a limited latency. Other systems might have multiple sets of timing concerns; for example, a set of inputs may need to be sampled at precisely 1kHz by a device that is still presenting a responsive HMI as described above.
Planning Data Flow and Execution Flow in an Embedded System
Pictured below is a typical hardware-software stack for a system built around a Technologic Systems SBC or TS-SOCKET product.
Scripting languages are easy to work in but may have poor efficiency and latency performance. On the other end of the spectrum, an FPGA has superb latency performance but is challenging to program. System designers may have a preconceived notion of where to implement their application algorithms. Our recommendation is to have an open mind regarding what techniques will be useful or not useful. The following questions should be answered before software development begins and before any custom hardware has been built:
- What events in my system have latency requirements?
- What latency do I require?
- How complex is the low-latency task?
- Is it good enough to meet the latency requirement 99+% of the time or do I need a guarantee?
- Does the application need sophisticated software stacks provided by Linux?
The answers to these questions inform the decision of which tools to consider using to improve the latency performance of your system. The following tools should be evaluated:
Linux Kernel Driver
Implementing I/O functionality inside a Linux kernel driver is a standard method that many developers expect to use. Functionality inside the kernel can respond to external interrupts and, in many cases, use DMA to move data without tying up the processor. A kernel driver can achieve typical interrupt response latency measured in microseconds. However, time-consuming calculations cannot be executed in an IRQ handler without degrading the performance of the entire system. Also, since interrupts can be masked out by other interrupt handlers, latency cannot be guaranteed. Other disadvantages of writing code that runs inside the kernel are the overall difficulty of writing multithreaded OS code and the fact that kernel APIs are subject to change in newer versions, which limits the portability of your solution.
Despite these factors, there are situations where a Linux kernel driver is the best and only viable solution. A kernel driver is ideal for streaming significant amounts of data to or from an external device because DMA can be used.
User Space Driver with Real-time Priority
In many applications, the kernel development described above is not necessary. For systems with only one real-time task, or where the real-time tasks can be condensed into one thread, this thread can be implemented in a simple user space application. The thread can be given real-time priority using the sched_setscheduler() function call with a real-time policy such as SCHED_FIFO; the nice() call alone only adjusts priority within the normal time-sharing scheduler. It is important that the thread spend time sleeping, so that all other threads have a chance to run.
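As a minimal sketch of requesting real-time priority from user space, the following Python fragment asks Linux for the SCHED_FIFO policy. The helper name request_realtime_priority and the priority value 10 are illustrative assumptions, not part of any Technologic Systems utility. The call normally needs root privileges (or CAP_SYS_NICE), so the sketch reports failure instead of raising.

```python
import os


def request_realtime_priority(priority=10):
    """Try to switch the calling thread to SCHED_FIFO.

    Returns True on success and False if the platform lacks the call
    or the kernel refuses it (typically for lack of privileges).
    The default priority of 10 is an arbitrary illustrative value.
    """
    if not hasattr(os, "sched_setscheduler"):
        return False  # the scheduler API is not exposed on this platform

    # Clamp the request to the range the kernel accepts for SCHED_FIFO.
    lowest = os.sched_get_priority_min(os.SCHED_FIFO)
    highest = os.sched_get_priority_max(os.SCHED_FIFO)
    priority = max(lowest, min(highest, priority))

    try:
        # A pid of 0 means "the caller".
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
    except OSError:
        return False
    return True


if __name__ == "__main__":
    if request_realtime_priority():
        print("running with SCHED_FIFO priority")
    else:
        print("could not obtain real-time priority; running with default policy")
```

A thread that succeeds here will preempt ordinary threads whenever it is runnable, which is why it must sleep or block regularly.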
The user space driver solution can respond to interrupts using the /proc/irq subsystem. This allows for a system with latency in the microseconds to be implemented in Linux user-space.
Technologic Systems has implemented many standard device drivers in user space. Many of these user space utilities are open source and are recommended as sample code for developers writing their own user space drivers. Some examples are
Increased Kernel Tick Rate
The Linux kernel tick rate determines the granularity of the usleep() function. If interrupts are not used, a thread with real-time priority must sleep using usleep(). The parameter passed to usleep() specifies a minimum number of microseconds to sleep, but in practice the thread will not wake up until a kernel tick. If the kernel tick rate is 100Hz, which is common, the tick period is 10ms, so the typical latency of a system implemented this way is at least 10ms. Newer kernels often default to a tick rate of 1000Hz (a 1ms tick period), which results in an order of magnitude latency reduction for real-time tasks.
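The effect of the tick rate can be illustrated with a deliberately simplified model in which a sleep always ends on a whole number of ticks. The function name wake_latency_us is a hypothetical helper, and the model ignores scheduling delay and where in the tick period the sleep begins, so treat the results as lower bounds rather than measurements.

```python
def wake_latency_us(requested_us, tick_hz):
    """Estimate when a tick-driven sleep returns, in microseconds.

    Simplified model: the requested sleep is rounded up to a whole
    number of kernel ticks, and at least one tick always elapses.
    """
    tick_us = 1_000_000 // tick_hz
    ticks = -(-requested_us // tick_us)  # ceiling division
    return max(ticks, 1) * tick_us


if __name__ == "__main__":
    for hz in (100, 1000):
        print(hz, "Hz tick:", wake_latency_us(100, hz), "us for a 100 us request")
```

Under this model a 100 microsecond request costs a full 10ms at 100Hz but only 1ms at 1000Hz, which is the order-of-magnitude improvement described above.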
FPGA Logic
When functionality is implemented in the FPGA, extremely low latency, measured in nanoseconds, can be guaranteed, but the complexity of the low latency task is limited by the fact that it must be implemented in hardware. In many cases, Technologic Systems has already implemented low-latency functionality such as digital counters and PWM outputs. In other cases, developers can use an opencore FPGA project provided by Technologic Systems to add their own simple functionality.
Bare Metal Code
The reason that Linux is used at all on a typical embedded system is that it provides excellent support for high-speed interfaces such as USB, Ethernet, SATA, and SD cards. For a system that must implement complex tasks with low latency but doesn't need these external interfaces, no operating system is needed. The Arm9™-based CPU can be treated the same way an 8-bit or 16-bit microcontroller would be treated, where application code runs without an intermediate OS layer. The Technologic Systems TS-BOOTROM will load any binary it finds in a bootable partition. If you need to write your application with no OS, contact Technologic Systems for further assistance.
A Real-time OS
Systems with complex hard real-time requirements probably cannot be built around Linux. An RTOS is required. A wide variety of open source and proprietary RTOSes are portable to Technologic Systems' products. If you need to implement an OS port, Technologic Systems can provide OS-independent APIs for many hardware features such as SD cards, NAND flash with XNAND, ADC devices, GPIO, CAN, XUARTs, and more.
Case Study: The Serial Port
UARTs are ubiquitous in embedded systems. The standard way of implementing a UART is with a dedicated IC such as the 16550. Modern SoC CPUs normally have similar functionality on chip. In both cases, the UART hardware will normally have FIFOs of 4 bytes or 16 bytes. Streaming data in or out of the UART presents a well-known latency issue. For example, with a 4 byte FIFO (32 bits, counting 8 bits per byte), running the UART at 115200 baud, streaming data for 1 second would require servicing the device at least 115200/32 = 3600 times, which means responding to each request with a latency of less than 300us.
If no handshaking lines are used, receiving UART data is a hard real-time requirement, because if the receive FIFO is not serviced in time, data will be lost. Sending data, or receiving data when the rate can be throttled by the UART hardware, is a soft real-time requirement, because if the FIFO is not serviced in time, it only results in a slight loss of serial bandwidth.
Most Linux systems use a kernel UART driver. The UART hardware interrupts the CPU when a receive FIFO is full or nearly full, or when a transmit FIFO is empty or nearly empty. Data bytes are moved from RAM to the UART by an interrupt service routine. This works well in many systems, but in a system with a large number of UARTs moving a lot of data, the ISR overhead can be a problem. The more time the CPU spends servicing interrupt requests, the worse the typical latency gets for other interrupt-based functionality.
To expand the capabilities of UARTs, Technologic Systems uses another latency reduction tool: custom FPGA logic. The XUART is a proprietary UART implemented in an FPGA with extremely deep FIFOs. Since 256 bytes at a time can be moved to or from an XUART, instead of only 4 or 16, software latency requirements are greatly reduced. For example, in order to receive at 115200 baud, an XUART can be serviced at only 100Hz with no overflow. The XUART driver thus has no need to be inside the kernel and is implemented in user space. To prevent FIFO overflows, the XUART driver, xuartctl, also uses interrupts. This code is open source and is recommended as a reference for any developers writing user space utilities that handle interrupts.
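The FIFO arithmetic in this case study can be checked with a short calculation. The sketch below is illustrative only: fifo_service_budget is a hypothetical helper, and it follows the text in counting 8 bits per byte. Passing bits_per_byte=10 would account for the start and stop bits of a typical 8N1 frame and gives slightly more relaxed numbers.

```python
def fifo_service_budget(baud, fifo_bytes, bits_per_byte=8):
    """Return (minimum service rate in Hz, maximum service interval in us).

    The FIFO must be emptied or refilled once per fifo_bytes transferred,
    so the service rate is the byte rate divided by the FIFO depth.
    """
    bytes_per_second = baud / bits_per_byte
    rate_hz = bytes_per_second / fifo_bytes
    interval_us = 1e6 / rate_hz
    return rate_hz, interval_us


if __name__ == "__main__":
    for depth in (4, 16, 256):
        rate, interval = fifo_service_budget(115200, depth)
        print(f"{depth:3d} byte FIFO: {rate:7.2f} services/s, "
              f"one every {interval:8.1f} us at most")
```

With a 256 byte FIFO the maximum interval is well above the 10ms period of a 100Hz service loop, which is consistent with the claim that the XUART can be handled from user space.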
Case Study: Data Acquisition
Many embedded systems need to move data from sensors into RAM at a constant rate anywhere from 100Hz to 100kHz or more. If the ADC chips were connected directly to a CPU bus, this would be a daunting challenge. At Technologic Systems, our normal practice is to put an FPGA with blockram between the CPU and the ADC. The FPGA collects ADC samples at a programmable rate, storing them in a blockram FIFO. The FPGA logic solves all the primary latency issues involved with interfacing with the ADC chip and collecting each sample at a precise regular interval. There is still a latency concern because the CPU must empty the FIFO at a regular rate or it will overflow and data will be lost.
For a concrete example, consider the TS-ADC16, which can collect samples on 16 channels and has a FIFO that holds 512 samples. With all 16 channels active, the FIFO holds 512/16 = 32 samples per channel. At 500Hz sampling, a new set of samples arrives every 2ms, so the FIFO fills up in 32 * 2ms = 64ms. With a kernel tick period of 10ms, a simple user space utility can easily keep up. No interrupt handling is needed. It is still a good idea to elevate the data acquisition thread to real-time priority.
At 5kHz sampling, the FIFO fills up in 6.4ms. If a 1000Hz kernel is used (tick period of 1ms), then the same solution should still work. If a 1000Hz kernel is not available, user space interrupts would also be effective.
At 50kHz sampling, the FIFO fills up in 640us. An interrupt-driven user space application would probably still work, but a kernel driver may be a better solution, since kernel-based interrupt handling has lower overhead. A kernel driver may also be able to use DMA or FPGA bus-mastering, which would greatly reduce the number of CPU cycles used on data transfer.
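The three sampling rates above can be compared against the kernel tick period with a small calculation. This is a rough sketch under stated assumptions: the helper names are hypothetical, all 16 channels are assumed active, and the "keeps up" test only checks that the FIFO fill time exceeds one tick period. A real design would want a comfortable margin on top of that.

```python
def fifo_fill_time_ms(fifo_samples, channels, sample_rate_hz):
    """Time in milliseconds for the FIFO to fill when every channel is sampled."""
    samples_per_channel = fifo_samples / channels
    return samples_per_channel * 1000.0 / sample_rate_hz


def tick_polling_keeps_up(fill_time_ms, tick_hz):
    """True if a thread woken once per kernel tick can drain the FIFO in time."""
    tick_period_ms = 1000.0 / tick_hz
    return fill_time_ms > tick_period_ms


if __name__ == "__main__":
    for rate in (500, 5000, 50000):
        fill = fifo_fill_time_ms(512, 16, rate)
        print(f"{rate:6d} Hz: FIFO fills in {fill:6.2f} ms, "
              f"100Hz tick ok: {tick_polling_keeps_up(fill, 100)}, "
              f"1000Hz tick ok: {tick_polling_keeps_up(fill, 1000)}")
```

When neither tick rate is fast enough, as at 50kHz, the options are the interrupt-driven user space application or the kernel driver discussed above.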
Case Study: Control Loop for a DC Motor
Consider a system that controls a DC motor, maintaining it in a fixed position based on user input. The outputs to the motor are two PWM signals that control an H-bridge, thus controlling the force applied in either direction by the motor. The inputs from the motor are quadrature signals indicating position.
Putting out a PWM with a precise frequency and duty cycle is a task that is quite simple and has extremely low latency requirements, so it is implemented in the FPGA. The same is true of interpreting quadrature signals. On top of this FPGA firmware layer, a software layer must implement a control loop that looks at the history of position (quadrature) and force (PWM) and repeatedly adjusts the PWM output. As with the data acquisition task, this can be implemented by a real-time user space thread that reads inputs, adjusts outputs, and then sleeps until the next kernel tick, thus adjusting the output at 100Hz or 1000Hz. For many systems, this solution is simple and adequate.
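The software layer of such a control loop might look like the following sketch. Everything here is an assumption made for illustration: the loop uses simple proportional control rather than a controller that weighs the full history of position and force, the read_position and set_drive callbacks stand in for whatever register access the FPGA exposes, and FakeMotor is a toy stand-in for real hardware so the example can run anywhere.

```python
import time


def run_control_loop(read_position, set_drive, target, gain, period_s, iterations):
    """Run a fixed-period proportional position loop.

    read_position() returns the current position (for example a quadrature
    count); set_drive(x) takes a signed drive level in the range -1.0..1.0
    that the caller maps onto the two PWM outputs of the H-bridge.
    """
    next_wake = time.monotonic()
    for _ in range(iterations):
        error = target - read_position()
        drive = max(-1.0, min(1.0, gain * error))  # clamp to the valid range
        set_drive(drive)

        # Sleep until an absolute deadline so timing errors do not accumulate.
        next_wake += period_s
        delay = next_wake - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        # If the deadline was already missed, continue immediately.


class FakeMotor:
    """Toy plant: each update moves the position in proportion to the drive."""

    def __init__(self):
        self.position = 0.0

    def read_position(self):
        return self.position

    def set_drive(self, drive):
        self.position += 10.0 * drive


if __name__ == "__main__":
    motor = FakeMotor()
    run_control_loop(motor.read_position, motor.set_drive,
                     target=100.0, gain=0.05, period_s=0.001, iterations=50)
    print("final position:", motor.position)
```

Note that nothing in this loop guarantees the period. Under Linux a late wake-up simply shortens or skips the next sleep, which is acceptable only for soft real-time requirements.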
Suppose that, for safety reasons, there is a requirement that the control loop is guaranteed to run at 1000Hz and never skip a tick between updates. Linux can never provide this guarantee. Running without an OS may be a good solution in this case. If communications via Ethernet, USB, and other interfaces that require a sophisticated software stack are not needed, then Linux itself may not be needed. Even if one of those interfaces is required, it could be that an API for it is provided by the CPU manufacturer or by the open-source community, allowing that functionality to be integrated into the bare metal executable.
However, let's extend this example again and suppose that human input for the motor positioning comes from a USB joystick or from a touchscreen GUI or from buttons on a web page served by the embedded system. These functionalities are much easier to implement with an OS such as Linux. It is not obvious how to build a system that meets all these requirements. One way would be a system with two CPUs, one for motor control and one for user interface. Another option would be to find the right RTOS. Another would be to attempt to implement the entire motor control algorithm in the FPGA, possibly using a softcore microprocessor.
When a single system has hard real-time requirements, complex calculations inside the low latency task, and a requirement for sophisticated OS driver stacks, then there is no easy solution. Picking a solution would be a decision informed by budget and quantity.
There are a variety of issues that embedded Linux developers need to think about early in the product development process. (One of those would be gracefully handling power outages.) For many systems, the most important concern is how to meet latency requirements. The availability of a reprogrammable FPGA is an added variable that greatly increases flexibility and options. Designers need to understand all their options well enough to know which ones to use and which to avoid. Some of the likely architecture mistakes are:
- using C or C++ when Bash or Perl is adequate
- writing OS kernel code when user space drivers are adequate
- implementing very low latency tasks in software when FPGA logic is required
- implementing high-bandwidth, low latency software in user space when kernel code is required
This paper has only scratched the surface of the issues facing developers of high-performance systems. Our goal is to prevent wasted investment in techniques that will not work effectively. If you need advice on your real-time system architecture, please contact Technologic Systems.
Linux® is the registered trademark of Linus Torvalds in the U.S. and other countries.
Arm and Arm9 are registered trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere.