Using the CMSIS‐DSP Library in IAR Embedded Workbench for Arm

Felipe Torrezan edited this page Sep 25, 2024 · 8 revisions

This article shows how to start using the CMSIS-DSP Library directly from IAR Embedded Workbench for Arm without having to build it from source. Further below, we go through some performance comparisons.

Warning

The IAR Embedded Workbench for Arm V9.xx ships with pre-built CMSIS-DSP Libraries based on version 1.8.0, released alongside CMSIS 5.7.0. For new projects, it is recommended to use the latest stable version.

Arm Cortex-M instruction groups

Cortex-M cores have optional silicon features for Digital Signal Processing (DSP) and for different types of floating-point units (FPU). The Cortex-M4 introduced the DSP instructions. Some -M4/-M7/-M33 cores have an optional single-precision (SP) FPU, and the -M7 also has an optional double-precision (DP) FPU.

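| Group    | # instructions | Cortex-M0~M3 | Cortex-M4 | Cortex-M7 | Cortex-M23 | Cortex-M33 |
|----------|----------------|--------------|-----------|-----------|------------|------------|
| DSP      | 80             | No           | Yes       | Yes       | No         | Optional   |
| SP float | 25             | No           | Optional  | Optional  | No         | Optional   |
| DP float | 14             | No           | No        | Optional  | No         | No         |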

In order to distinguish when an FPU is included, a silicon vendor might opt for using the "Cortex-MxF" denomination, where 'x' is the core variant, such as Cortex-M4F.
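
Whether the compiler is targeting a core with these instruction groups can also be checked at build time. The following is a minimal sketch, assuming the toolchain defines the Arm C Language Extensions (ACLE) feature macros __ARM_FP and __ARM_FEATURE_DSP. The helper names fpu_description() and has_dsp_instructions() are hypothetical and are not part of CMSIS-DSP. Where the macros are not defined, the code falls back to reporting no FPU and no DSP instructions.

```c
#include <stdio.h>

/* Hypothetical helper: describes the FPU the compiler is targeting.
   Assumption: the toolchain follows ACLE, where __ARM_FP is a bitmap
   with bit 2 (0x04) = single precision and bit 3 (0x08) = double
   precision, and is left undefined when there is no hardware FPU. */
const char *fpu_description(void)
{
#if defined(__ARM_FP) && (__ARM_FP & 0x08)
  return "double-precision FPU";
#elif defined(__ARM_FP) && (__ARM_FP & 0x04)
  return "single-precision FPU";
#else
  return "no hardware FPU (soft float)";
#endif
}

/* Hypothetical helper: 1 if the DSP instruction group is available
   according to the ACLE macro __ARM_FEATURE_DSP, otherwise 0. */
int has_dsp_instructions(void)
{
#if defined(__ARM_FEATURE_DSP) && (__ARM_FEATURE_DSP == 1)
  return 1;
#else
  return 0;
#endif
}
```

Calling printf("%s\n", fpu_description()) from a test application is one way to confirm that the selected Target Device matches the expected row of the table above.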

Using the CMSIS-DSP Library

Target-optimized pre-built CMSIS-DSP libraries are automatically installed with IAR Embedded Workbench for Arm.

To use the CMSIS-DSP Library, a few General Options need to be selected.

First, choose a supported Target Device, such as the ST STM32F407VG (Cortex-M4):

image

Then, enable the DSP library option in the Library Configuration tab:

image

And that is it!

Based on your selection, a pre-built CMSIS-DSP Library, built for high speed, will be automatically imported into the project. At the same time, the directory containing the necessary library header files (<arm_math.h>, ...) will be implicitly added to the search path used by the C/C++ Preprocessor.

Note

Alternatively, for advanced tweaking of the CMSIS-DSP Library internal options, the original library project containing build configurations for the supported core variants, alongside its full source code, is provided under arm\CMSIS\DSP\Projects\IAR\, enabling end users to customize their libraries.

Test-driving the CMSIS-DSP Library

Using the main() function below, we can compare performance when calculating √2 with the C Standard Math Library (<math.h>) functions and then with the CMSIS-DSP Library counterpart (<arm_math.h>):

#include <arm_math.h>
#include <math.h>

const float32_t f_input_cmsis_dsp = 2.0f;
const float f_input = 2.0f;
const double d_input = 2.0;

float32_t f_result_cmsis_dsp;
float f_result;
double d_result;

int main(void)
{
  while (1)
  {
    /* Function from <math.h> (double) */
    d_result = sqrt(d_input);

    /* Function from <math.h> (float) */
    f_result = sqrtf(f_input);

    /* Function from CMSIS-DSP <arm_math.h> (typedef float float32_t) */
    arm_sqrt_f32(f_input_cmsis_dsp, &f_result_cmsis_dsp);
  }
}

Verifying results

In the debugger, you can add the result variables to the Watch window to make sure they are correct:

image
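
The same check can be expressed in code as a comparison against a reference value within a tolerance, since single-precision and double-precision results will not match bit for bit. This is a minimal sketch; is_close() and SQRT2_REFERENCE are hypothetical names introduced here for illustration, and the tolerance is an assumption to be chosen per application.

```c
#include <stdbool.h>

/* Reference value of the square root of 2, to double precision. */
#define SQRT2_REFERENCE 1.41421356237309504880

/* Hypothetical helper: true when actual is within tolerance of
   expected. The absolute difference is computed by hand so that no
   math library call is needed. */
bool is_close(double actual, double expected, double tolerance)
{
  double diff = actual - expected;
  if (diff < 0.0)
  {
    diff = -diff;
  }
  return diff <= tolerance;
}
```

In the test application, a call such as is_close(f_result_cmsis_dsp, SQRT2_REFERENCE, 1e-6) could be evaluated after each function call. A tolerance of 1e-6 is loose enough for a float result and still rejects a clearly wrong value.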

Benchmarking functions

Now we take advantage of the Arm Cortex-M CoreSight™ cycle counters for some performance insights. In IAR Embedded Workbench, these can be easily accessed via the Registers window, under the "Current CPU Registers" group, such as CYCLECOUNTER and CCSTEP:

image

The CYCLECOUNTER register gives the total execution time in number of cycles. The CCSTEP register, on the other hand, counts the number of cycles since the last time the program execution was halted.

By clicking in the left margin of the code editor, you can set a code breakpoint on each line of the while(1) loop that contains a function call, so that CCSTEP counts the number of cycles between two consecutive breakpoints.

image
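
A comparable measurement can be approximated in software by sampling a free-running cycle counter before and after a call and subtracting. The sketch below shows only that arithmetic, assuming 32-bit counter snapshots. Where the snapshots come from is target-specific: on Cortex-M3/-M4/-M7 parts the DWT cycle counter, exposed by CMSIS-Core as DWT->CYCCNT, is a common choice, but that wiring is an assumption and is not shown here. elapsed_cycles() is a hypothetical helper.

```c
#include <stdint.h>

/* Hypothetical helper: cycles elapsed between two snapshots of a
   32-bit free-running cycle counter. Unsigned subtraction is modulo
   2^32, so the result stays correct across a single counter
   wraparound between the two snapshots. */
uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
  return end - start;
}
```

Unlike the breakpoint-based approach, a software measurement includes the cycles spent reading the counter itself, so the two methods should not be expected to agree exactly.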

We used this breakpoint-based benchmarking technique with that main() function on different devices. The table below shows the number of cycles each function required to execute, taken directly from the CCSTEP register:

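| Function       | # cycles, Cortex-M3 (soft float) | # cycles, Cortex-M4 (SP float) | # cycles, Cortex-M7 (DP float) |
|----------------|----------------------------------|--------------------------------|--------------------------------|
| sqrt()         | 689                              | 695                            | 17                             |
| sqrtf()        | 229                              | 17                             | 17                             |
| arm_sqrt_f32() | 252                              | 14                             | 14                             |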

Warning

Only for reference. Precise results will vary when changing the compiler/library/target combination.

This simple test immediately gives important insights:

  • The first is that using double data types on target variants without a double-precision (DP) FPU imposes substantial overhead. In this case, sqrt() took 695 cycles on the Cortex-M4 (SP FPU only) versus 17 cycles on the Cortex-M7 with a DP FPU, roughly 40 times as many.
  • The second is that the CMSIS-DSP Library offered the most performant option for this particular function for hardware equipped with FPU.
  • The third is that software implementation from the IAR C Runtime Library for the square root function already performs incredibly well for targets without FPU when the compiler optimization level is set to Low (ideal for debugging). It is important to mention that the C Runtime Library in IAR Embedded Workbench comes with multiple function implementations for different optimization objectives. For example, if we set the optimization objective to High: Speed in our test application, the performance can be even higher!
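
The first insight also applies to code that uses double unintentionally. In C, a floating-point constant without the f suffix has type double, so mixing it with a float operand promotes the whole expression to double precision. The sketch below contrasts the two forms. The function names are hypothetical, and the cycle cost of each form was not measured for this article.

```c
/* Stays in single precision: 0.5f is a float constant. */
float half_sp(float x)
{
  return x * 0.5f;
}

/* 0.5 is a double constant, so x is converted to double, the multiply
   is done in double precision, and the result is converted back to
   float on return. */
float half_dp(float x)
{
  return (float)(x * 0.5);
}
```

Both functions return the same value. On a target without a DP FPU, the second form may pull in software double-precision routines unless the compiler is able to simplify the expression.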

Example Projects using the CMSIS-DSP Library

IAR Embedded Workbench for Arm includes thousands of example projects from dozens of silicon providers, which you can download directly from the Information Center. Among them are some projects that use the CMSIS-DSP library.

In this article, we will open the ST > STM32Fxx > CMSIS and STM32F4xx stdperiph lib 1.2.0RC2 > DSP Lib workspace by clicking on the "Open project" icon.

image

The DSP Lib workspace comes with 11 projects, each of which explores a different functionality provided by the CMSIS-DSP Library.

image

Overview

Among the available projects, let's select the "Frequency Bin Example" (arm_fft_bin_example). It demonstrates how to find the maximum energy bin of a 10 kHz test signal with uniformly distributed white noise. By performing the Fast Fourier Transform (FFT) on the input signal, the example finds the signal's maximum energy bin in the frequency domain.

The main() function in the arm_fft_bin_example_f32.c file uses four functions from the CMSIS-DSP library. The code is well commented.

/* ---------------------------------------------------------------------- 
* Max magnitude FFT Bin test 
* ------------------------------------------------------------------- */ 
int32_t main(void) 
{ 
  arm_status status; 
  arm_cfft_radix4_instance_f32 S; 
  float32_t maxValue; 
   
  status = ARM_MATH_SUCCESS; 
   
  /* Initialize the CFFT/CIFFT module */  
  status = arm_cfft_radix4_init_f32(&S, fftSize, ifftFlag, doBitReverse); 
   
  /* Process the data through the CFFT/CIFFT module */ 
  arm_cfft_radix4_f32(&S, testInput_f32_10khz);
   
  /* Process the data through the Complex Magnitude Module for  
  calculating the magnitude at each bin */ 
  arm_cmplx_mag_f32(testInput_f32_10khz, testOutput, fftSize);  
   
  /* Calculates maxValue and returns corresponding BIN value */ 
  arm_max_f32(testOutput, fftSize, &maxValue, &testIndex);
  
/* ... */

The arm_fft_bin_data.c file contains the test input signal (testInput_f32_10khz) used in the main() function, stored as a C array of 1024 complex samples (2048 float32_t values):

#include "arm_math.h"

/* ----------------------------------------------------------------------
Test Input signal contains 10KHz signal + Uniformly distributed white noise
** ------------------------------------------------------------------- */

float32_t testInput_f32_10khz[2048] = 
{   
  -0.865129623056441, 0.000000000000000, -2.655020678073846, 0.000000000000000, /* ... */
  -2.899160484012034, 0.000000000000000,  2.563004262857762, 0.000000000000000, /* ... */
   0.048366940168201, 0.000000000000000, -0.145696461188734, 0.000000000000000, /* ... */

/* ... */

The CFFT/CIFFT module processes the input data as complex numbers (z = a + b * i). In this example, each sample in the test input signal array is treated as a complex number, with its real part (a) stored at the even indexes ([0], [2], [4], ..., [2046]) and its imaginary part (b) stored at the odd indexes ([1], [3], [5], ..., [2047]). Because every imaginary part in the array is zero, the test input signal contains only real data (z = a + 0 * i). The testOutput array has only 1024 elements because arm_cmplx_mag_f32() produces one real-valued magnitude for each complex bin.

By plotting the input signal in the time domain, we can inspect its true nature:
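
The interleaved layout can be sketched in plain C as follows. These helpers are hypothetical stand-ins written for illustration, not the CMSIS-DSP implementation. To stay free of math library calls, the second helper computes the squared magnitude rather than the magnitude itself, which still preserves which bin is the largest. Note that n complex values occupy 2 * n floats on input and produce n floats on output.

```c
#include <stddef.h>

/* Hypothetical helper: store n real samples as interleaved complex
   values (re, im, re, im, ...) with every imaginary part set to zero.
   The destination must hold 2 * n floats. */
void pack_real_as_complex(const float *real, float *interleaved, size_t n)
{
  for (size_t i = 0; i < n; i++)
  {
    interleaved[2 * i]     = real[i]; /* even index: real part */
    interleaved[2 * i + 1] = 0.0f;    /* odd index: imaginary part */
  }
}

/* Hypothetical helper: squared magnitude (re^2 + im^2) of each of the
   n complex values. The output holds n floats, half the input length. */
void magnitude_squared(const float *interleaved, float *out, size_t n)
{
  for (size_t i = 0; i < n; i++)
  {
    float re = interleaved[2 * i];
    float im = interleaved[2 * i + 1];
    out[i] = re * re + im * im;
  }
}
```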

image

Debugging

Then we run the demo with a code breakpoint set to halt the execution right after the maximum energy bin has been found. These example projects come with some form of validation. In this particular case, the if() statement validates testIndex against refIndex. When the execution reaches this breakpoint, you can hover the mouse pointer over these variables to see their current values as tooltips, or read their values from the Watch window:

image
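
The validation step amounts to a maximum search followed by an index comparison. The sketch below is a plain C stand-in for that logic, not the CMSIS-DSP implementation of arm_max_f32(). The names find_max() and index_matches_reference() are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: finds the largest value in data[0..n-1] and the
   index where it first occurs. Assumes n >= 1. */
void find_max(const float *data, size_t n, float *max_value, uint32_t *max_index)
{
  float best = data[0];
  uint32_t best_index = 0;

  for (size_t i = 1; i < n; i++)
  {
    if (data[i] > best)
    {
      best = data[i];
      best_index = (uint32_t)i;
    }
  }

  *max_value = best;
  *max_index = best_index;
}

/* Hypothetical helper: 1 when the index found matches the expected
   reference index, otherwise 0. */
int index_matches_reference(uint32_t found_index, uint32_t reference_index)
{
  return found_index == reference_index;
}
```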

A frequency-domain analysis shows that the bin 214 (corresponding to index 213 in the testOutput array) has the signal's maximum energy:

image
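
An array index from the magnitude output maps to a frequency as f = k * fs / N, where k is the index, fs the sample rate, and N the FFT size. The helper below is a hypothetical sketch of that formula. The sample rate of the test signal is not stated in this article, so the 48 kHz used in the usage note is an assumption.

```c
#include <stdint.h>

/* Hypothetical helper: center frequency in Hz of FFT bin index k for
   an fft_size-point FFT of a signal sampled at sample_rate_hz. */
double bin_to_hz(uint32_t k, double sample_rate_hz, uint32_t fft_size)
{
  return (double)k * sample_rate_hz / (double)fft_size;
}
```

For example, bin_to_hz(213, 48000.0, 1024) evaluates to 9984.375, which would be consistent with a 10 kHz test tone if the signal were sampled at 48 kHz.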

Benchmarking FFT

By running the demonstration on different targets, using the cycle counting technique (CCSTEP) presented earlier, we obtained the following results:

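| Function                   | # cycles, Cortex-M3 (soft float) | # cycles, Cortex-M4 (SP float) | # cycles, Cortex-M7 (DP float) |
|----------------------------|----------------------------------|--------------------------------|--------------------------------|
| arm_cfft_radix4_init_f32() | 69                               | 66                             | 68                             |
| arm_cfft_radix4_f32()      | 1764005                          | 98583                          | 99969                          |
| arm_cmplx_mag_f32()        | 376590                           | 14870                          | 14870                          |
| arm_max_f32()              | 24348                            | 7972                           | 7972                           |
| TOTAL                      | 2165012                          | 121491                         | 122879                         |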

Warning

Only for reference. Precise results will vary when changing the compiler/library/target combination.

In practical terms, a Cortex-M running at 100 MHz takes about 21.7 ms to run this code on the 1024-sample signal if it is a Cortex-M3, and only about 1.2 ms if it is a Cortex-M4 or -M7. With the hardware FPU and DSP instructions of those cores, this FFT algorithm delivers its result more than an order of magnitude faster.

image
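
The conversion from cycle counts to execution time is a division by the clock frequency. The helpers below are a hypothetical sketch of that arithmetic. They assume one clock cycle per counted cycle at a fixed clock rate, and the 100 MHz figure is the same illustrative value used in the text.

```c
#include <stdint.h>

/* Hypothetical helper: execution time in milliseconds for a given
   cycle count at a fixed clock frequency in Hz. */
double cycles_to_ms(uint64_t cycles, double clock_hz)
{
  return (double)cycles / clock_hz * 1000.0;
}

/* Hypothetical helper: how many times faster the improved case is
   compared with the baseline, based on cycle counts alone. */
double speedup(uint64_t baseline_cycles, uint64_t improved_cycles)
{
  return (double)baseline_cycles / (double)improved_cycles;
}
```

With the TOTAL row above, cycles_to_ms(2165012, 100e6) is about 21.65 and cycles_to_ms(121491, 100e6) is about 1.21, while speedup(2165012, 121491) is about 17.8.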

Summary

For those looking for legacy versions of the library, IAR Embedded Workbench for Arm 9.40.1+ ships with ready-to-use, pre-built CMSIS-DSP Libraries for the supported target devices. These pre-built libraries are based on the earlier version 1.8.0 (released alongside CMSIS 5.7.0), which does not include the library's latest features. Whether you use the pre-built libraries or the latest stable version, IAR Embedded Workbench for Arm provides highly optimized CMSIS-DSP code for Cortex-M applications that require higher compute performance.