Daisy boot loader and ITCMRAM

Does the Daisy bootloader support loading code into ITCMRAM from QSPI?

Why would you want that? Do you realize that the ITCM is only 64 KB, which is just half of what you can store in internal flash?

I only need to put my CPU-intensive DSP code into ITCMRAM. Executing from tightly coupled memory is substantially faster than executing from flash, particularly QSPI flash.

I was mistaken in thinking this was a boot loader function; it is handled by the linker with a little help from the startup code. Below is how I got it working, in case others find this useful.

This is my current memory usage. You can see that about 5.6K of my code is now running in ITCMRAM.

Memory region         Used Size  Region Size  %age Used
           FLASH:          0 GB       128 KB      0.00%
         DTCMRAM:      105696 B       128 KB     80.64%
            SRAM:       59088 B       512 KB     11.27%
      RAM_D2_DMA:        8256 B        32 KB     25.20%
          RAM_D2:          0 GB       256 KB      0.00%
          RAM_D3:          0 GB        64 KB      0.00%
         ITCMRAM:        5616 B        64 KB      8.57%
           SDRAM:          0 GB        64 MB      0.00%
       QSPIFLASH:      130200 B      7936 KB      1.60%

First step is to add the following to the .lds file. This tells the linker to create a code segment in ITCMRAM.

	.itcmram_text (NOLOAD) :
	{
		. = ALIGN(4);
		_sitcmram_text = .;

		PROVIDE(__itcmram_text_start__ = _sitcmram_text);
		*(.itcmram_text)
		*(.itcmram_text*)
		. = ALIGN(4);
		_eitcmram_text = .;

		PROVIDE(__itcmram_text_end__ = _eitcmram_text);
	} > ITCMRAM AT > QSPIFLASH

	_sitext = LOADADDR(.itcmram_text);

Second step is to declare externs for the segment addresses and add a loop to Reset_Handler() in startup_stm32h750xx.c that copies the code from flash to ITCMRAM:

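/* Linker symbols: only their addresses are meaningful, hence the & below */
extern uint32_t _sitext, _sitcmram_text, _eitcmram_text;
...
	/* pSource and pDest are uint32_t pointers; declare them here if the
	   startup code does not already do so */
	for (pSource = &_sitext, pDest = &_sitcmram_text;
	     pDest != &_eitcmram_text; pSource++, pDest++)
		*pDest = *pSource;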
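
The copy loop can be pulled into a small helper so it can be exercised on a host machine before it goes into the startup code. This is only a minimal sketch, not libDaisy code: `copy_words` is a hypothetical name, and on the target the three arguments would be `&_sitcmram_text`, `&_eitcmram_text` and `&_sitext` from the linker script.

```c
#include <stdint.h>

/* Hypothetical helper: copy 32-bit words from src into [dst, dst_end).
 * Assumes both ends are 4-byte aligned, which the ALIGN(4) directives
 * in the linker section provide. Returns the number of words copied. */
uint32_t copy_words(uint32_t *dst, const uint32_t *dst_end, const uint32_t *src)
{
    uint32_t count = 0;

    /* "<" rather than "!=" so a misaligned end cannot run past the region */
    while (dst < dst_end) {
        *dst++ = *src++;
        count++;
    }
    return count;
}
```

On the target the call would look like `copy_words(&_sitcmram_text, &_eitcmram_text, &_sitext);`, placed before any ITCM-resident function is called.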

CPU-intensive functions are placed in ITCMRAM by preceding the function definition with this attribute:

#define ITCM_MEM_SECTION __attribute__((section(".itcmram_text")))
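
As an illustration of how the macro might be applied, here is a minimal sketch. `apply_gain` is a hypothetical function, not something from this project, and the `#if` guard is an assumption added only so the same file also builds on a non-ARM host for testing.

```c
#include <stddef.h>

/* On the ARM target the attribute places the function in .itcmram_text.
 * On other hosts it expands to nothing so the sketch still builds. */
#if defined(__arm__)
#define ITCM_MEM_SECTION __attribute__((section(".itcmram_text")))
#else
#define ITCM_MEM_SECTION
#endif

/* Hypothetical hot loop: scale a block of samples in place. */
ITCM_MEM_SECTION void apply_gain(float *samples, size_t count, float gain)
{
    for (size_t i = 0; i < count; i++)
        samples[i] *= gain;
}
```

After building for the target, the map file should list the function under `.itcmram_text` rather than `.text`.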

It would be good to see these changes incorporated in the libdaisy code base.

It won’t always be substantially faster, because any CPU intensive code will normally run from the MCU cache. But there should be some improvements and it’s certainly better than not utilizing that memory section at all.

I would also suggest excluding the first 4 bytes from the ITCM section, because its address is 0x0 and a pointer to that address is indistinguishable from a null pointer. This occasionally creates very interesting bugs.

True, if your inner loop fits in the cache there may not be much of a win. In my case the inner loop is running an FFT. Good point about address 0; it is probably a good idea to plant an infinite loop there.

I just found an interesting and relevant application note from ST on this topic:
AN4891, STM32H72x, STM32H73x, and single-core STM32H74x/75x system architecture and performance

It applies to the STM32H750IBK6 used in Daisy and even gives an FFT application as an example!

Thanks for posting this!
Though I agree, it does seem like it should be part of the libdaisy code.
It’s great for quick ISRs and things like that.

I am curious about this as well. I have been able to load and execute code out of ITCMRAM using similar steps as outlined above for default internal flash apps but I’m not sure if it’s possible for Daisy bootloader apps.

We don’t have it set up to make that easy. I don’t think it specifically handles ITCMRAM (although I’m not sure that’s necessary). If you can fit your whole program in there, you can just modify one of the linker scripts to put the text section in ITCMRAM, and I think it might work.

Thanks for the info!

“If you can fit your whole program in there, you can just modify one of the linker scripts to put the text section in ITCMRAM, and I think it might work.”

ITCMRAM being only 64 KB, I don’t think that’s particularly useful in most cases (I think just a barebones program linking libDaisy already comes close to that, if not exceeds it). Along the lines of others here, I was thinking more of specifically marking some ISRs etc. to go in ITCMRAM.

Though, I guess perhaps overall it’s a little less necessary on Daisy vs. other platforms since I imagine it’s less common to write to the internal flash, where e.g. something like ESP32 requires ISRs to go into IRAM if they need to be available to execute during a flash write.

ITCMRAM is the fastest memory on the microcontroller. It is also not cached, so there are no cache misses, which means access times are deterministic. Determinism is a very good thing for time-sensitive, CPU-intensive operations such as DSP algorithms running in the audio loop. So depending on the application, there may be a performance advantage in selectively putting inner-loop DSP code into ITCMRAM. I do this for FFTs running in the audio callback. Other code, particularly the UI code writing to the display, runs in the main program loop and runs just fine out of flash.

I have had success putting code in ITCMRAM when booting with the Daisy bootloader from QSPI flash but until this afternoon have never been able to get it to work when booting from internal flash or more recently SRAM. I finally got it to work by skipping the first 1K by setting the ITCMRAM base address to 0x400. Anyone got a theory why this works, and conversely why a base address of 0 only works when booting from QSPI? The map files show the same load and store addresses in all three cases.

Could be that I am writing through an uninitialized pointer that gets set to zero in the SRAM and internal flash cases but gets some other random value when booting from QSPI. I’ll see if I can repro the issue with a small test case.

No idea why exactly it fails in this case, but ITCMRAM starts at address 0x0, so a pointer to its first byte is equivalent to nullptr. Anything that lives at the very start of this memory area will always fail a check that its pointer is not nullptr. You shouldn’t need to waste that much space; try skipping just the first 4 bytes.
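
The null-pointer problem can be shown on a host with plain addresses. This is a minimal sketch: `passes_null_check` is a hypothetical helper, and converting an integer address to a pointer is implementation-defined, although with the usual ARM and desktop compilers address 0 converts to the null pointer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: would code that guards an access with
 * "if (ptr != NULL)" accept an object placed at this address? */
bool passes_null_check(uintptr_t address)
{
    const void *ptr = (const void *)address;
    return ptr != NULL;
}
```

An object at address 0 is rejected by such a guard, while one at address 4 is accepted, which is why starting the region at 0x00000004 is enough to avoid the problem.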

What changes did you make to the linker script to get ITCMRAM text section to work with BOOT_SRAM apps? I have not had any success. I tried using an offset base address for the section to avoid 0x00000000, but no luck. Here’s what the relevant linker script section looks like:

	.itcmram_text (NOLOAD) :
	{
		. = ALIGN(4);
		_sitcmram_text = .;

		PROVIDE(__itcmram_text_start__ = _sitcmram_text);
		*(.itcmram_text)
		*(.itcmram_text*)
		. = ALIGN(4);
		_eitcmram_text = .;

		PROVIDE(__itcmram_text_end__ = _eitcmram_text);
	} > ITCMRAM AT > SRAM

	_sitext = LOADADDR(.itcmram_text);

Note the AT > SRAM instead of QSPIFLASH - if I try to use QSPIFLASH the DFU write does not succeed for BOOT_SRAM. I assume SRAM would be needed here since that’s where the normal text section goes, but I have no idea how the daisy bootloader is handling 1) flashing the program data into QSPIFLASH and 2) loading it from QSPIFLASH to SRAM… so I assume my ITCM text section is just getting lost in that process somehow.

My full linker script is below. It is tailored quite a bit to my needs, for example I needed 64K for DMA buffers. I put .bss into RAM_D3, I use the first 64K of QSPI flash to store my persistent settings, and avoid the first 4 bytes of ITCMRAM. Your needs may be different.

Don’t forget to add this right to the top of main:

    extern uint32_t _sitext, _sitcmram_text, _eitcmram_text;
    memcpy(&_sitcmram_text, &_sitext, ((uint8_t*)&_eitcmram_text) - ((uint8_t*)&_sitcmram_text));

Also check you haven’t put any code into ITCMRAM that will get executed before main is called. In particular, look out for functions called from the constructor of a global object. This has bitten me more than once.
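
The early-call pitfall can be reproduced on a host. This is a minimal sketch, assuming GCC or Clang: `__attribute__((constructor))` stands in for the constructor of a global C++ object, and `dsp_function`, `global_init` and `copy_itcm_code` are hypothetical names. On the target, a call like this would jump into ITCMRAM before the memcpy at the top of main has filled it.

```c
#include <stdbool.h>

bool itcm_copied = false;      /* set once the copy at the top of main() has run */
bool called_too_early = false; /* records a call made before the copy */

/* Hypothetical stand-in for a function that would live in ITCMRAM. */
void dsp_function(void)
{
    if (!itcm_copied)
        called_too_early = true;
}

/* GCC and Clang run constructor functions before main(),
 * much like the constructors of global C++ objects. */
__attribute__((constructor)) void global_init(void)
{
    dsp_function();
}

/* Stand-in for the memcpy at the top of main(). */
void copy_itcm_code(void)
{
    itcm_copied = true;
}
```

By the time main starts, `called_too_early` is already set, which corresponds to executing from an ITCMRAM that has not been filled yet.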

Linker script

/* Generated by LinkerScriptGenerator [http://visualgdb.com/tools/LinkerScriptGenerator]
 * Target: STM32H750IB
 * The file is provided under the BSD license.
 */

ENTRY(Reset_Handler)

MEMORY
{
	FLASH       (RX)  : ORIGIN = 0x08000000, LENGTH = 128K
	DTCMRAM     (RWX) : ORIGIN = 0x20000000, LENGTH = 128K
	SRAM        (RWX) : ORIGIN = 0x24000000, LENGTH = 512K
	RAM_D2_DMA  (RWX) : ORIGIN = 0x30000000, LENGTH = 64K
	RAM_D2      (RWX) : ORIGIN = 0x30010000, LENGTH = 256K - LENGTH(RAM_D2_DMA)
	RAM_D3      (RWX) : ORIGIN = 0x38000000, LENGTH = 64K
	ITCMRAM     (RWX) : ORIGIN = 0x00000004, LENGTH = 64K - 4
	SDRAM       (RWX) : ORIGIN = 0xc0000000, LENGTH = 64M
	QSPIFLASH0  (RX)  : ORIGIN = 0x90000000, LENGTH = 64K    /* Reserved for configuration parameters */
	QSPIFLASH   (RX)  : ORIGIN = 0x90040000, LENGTH = 8M - LENGTH(QSPIFLASH0)

}

_estack = 0x20020000;

SECTIONS
{
	.isr_vector :
	{
		. = ALIGN(4);
		KEEP(*(.isr_vector))
		. = ALIGN(4);
	} > SRAM

	.text :
	{
		. = ALIGN(4);
		_stext = .;

		*(.text)
		*(.text*)
		*(.rodata)
		*(.rodata*)
		*(.glue_7)
		*(.glue_7t)
		KEEP(*(.init))
		KEEP(*(.fini))
		. = ALIGN(4);
		_etext = .;

	} > SRAM

	.ARM.extab :
	{
		. = ALIGN(4);
		*(.ARM.extab)
		*(.gnu.linkonce.armextab.*)
		. = ALIGN(4);
	} > SRAM

	.exidx :
	{
		. = ALIGN(4);
		PROVIDE(__exidx_start = .);
		*(.ARM.exidx*)
		. = ALIGN(4);
		PROVIDE(__exidx_end = .);
	} > SRAM

	.ARM.attributes :
	{
		*(.ARM.attributes)
	} > SRAM

	.preinit_array :
	{
		PROVIDE(__preinit_array_start = .);
		KEEP(*(.preinit_array*))
		PROVIDE(__preinit_array_end = .);
	} > SRAM

	.init_array :
	{
		PROVIDE(__init_array_start = .);
		KEEP(*(SORT(.init_array.*)))
		KEEP(*(.init_array*))
		PROVIDE(__init_array_end = .);
	} > SRAM

	.fini_array :
	{
		PROVIDE(__fini_array_start = .);
		KEEP(*(.fini_array*))
		KEEP(*(SORT(.fini_array.*)))
		PROVIDE(__fini_array_end = .);
	} > SRAM

	.sram1_bss (NOLOAD) :
	{
		. = ALIGN(4);
		_ssram1_bss = .;

		PROVIDE(__sram1_bss_start__ = _ssram1_bss);
		*(.sram1_bss)
		*(.sram1_bss*)
		. = ALIGN(4);
		_esram1_bss = .;

		PROVIDE(__sram1_bss_end__ = _esram1_bss);
	} > RAM_D2_DMA

	.data :
	{
		. = ALIGN(4);
		_sdata = .;

		PROVIDE(__data_start__ = _sdata);
		*(.data)
		*(.data*)
		. = ALIGN(4);
		_edata = .;

		PROVIDE(__data_end__ = _edata);
	} > RAM_D2 AT > SRAM

	_sidata = LOADADDR(.data);

	.itcmram_text (NOLOAD) :
	{
		. = ALIGN(4);
		_sitcmram_text = .;

		PROVIDE(__itcmram_text_start__ = _sitcmram_text);
		*(.itcmram_text)
		*(.itcmram_text*)
		. = ALIGN(4);
		_eitcmram_text = .;

		PROVIDE(__itcmram_text_end__ = _eitcmram_text);
	} > ITCMRAM AT > SRAM

	_sitext = LOADADDR(.itcmram_text);

	.bss (NOLOAD) :
	{
		. = ALIGN(4);
		_sbss = .;

		PROVIDE(__bss_start__ = _sbss);
		*(.bss)
		*(.bss*)
		*(COMMON)
		. = ALIGN(4);
		_ebss = .;

		PROVIDE(__bss_end__ = _ebss);
	} > RAM_D3

	.dtcmram_bss (NOLOAD) :
	{
		. = ALIGN(4);
		_sdtcmram_bss = .;

		PROVIDE(__dtcmram_bss_start__ = _sdtcmram_bss);
		*(.dtcmram_bss)
		*(.dtcmram_bss*)
		. = ALIGN(4);
		_edtcmram_bss = .;

		PROVIDE(__dtcmram_bss_end__ = _edtcmram_bss);
	} > DTCMRAM

	.sdram_bss (NOLOAD) :
	{
		. = ALIGN(4);
		_ssdram_bss = .;

		PROVIDE(__sdram_bss_start = _ssdram_bss);
		*(.sdram_bss)
		*(.sdram_bss*)
		. = ALIGN(4);
		_esdram_bss = .;

		PROVIDE(__sdram_bss_end = _esdram_bss);
	} > RAM_D3

	.qspiflash_text :
	{
		. = ALIGN(4);
		_sqspiflash_text = .;

		PROVIDE(__qspiflash_text_start = _sqspiflash_text);
		*(.qspiflash_text)
		*(.qspiflash_text*)
		. = ALIGN(4);
		_eqspiflash_text = .;

		PROVIDE(__qspiflash_text_end = _eqspiflash_text);
	} > QSPIFLASH

	.qspiflash_data :
	{
		. = ALIGN(4);
		_sqspiflash_data = .;

		PROVIDE(__qspiflash_data_start = _sqspiflash_data);
		*(.qspiflash_data)
		*(.qspiflash_data*)
		. = ALIGN(4);
		_eqspiflash_data = .;

		PROVIDE(__qspiflash_data_end = _eqspiflash_data);
	} > QSPIFLASH

	.qspiflash_bss (NOLOAD) :
	{
		. = ALIGN(4);
		_sqspiflash_bss = .;

		PROVIDE(__qspiflash_bss_start = _sqspiflash_bss);
		*(.qspiflash_bss)
		*(.qspiflash_bss*)
		. = ALIGN(4);
		_eqspiflash_bss = .;

		PROVIDE(__qspiflash_bss_end = _eqspiflash_bss);
	} > QSPIFLASH

	.heap (NOLOAD) :
	{
		. = ALIGN(4);
		PROVIDE(__heap_start__ = .);
		KEEP(*(.heap))
		. = ALIGN(4);
		PROVIDE(__heap_end__ = .);
	} > RAM_D2

	PROVIDE(end = .);

	.reserved_for_stack (NOLOAD) :
	{
		. = ALIGN(4);
		PROVIDE(__reserved_for_stack_start__ = .);
		KEEP(*(.reserved_for_stack))
		. = ALIGN(4);
		PROVIDE(__reserved_for_stack_end__ = .);
	} > DTCMRAM

	.qspiflash_cfg (NOLOAD) :
	{
		. = ALIGN(4);
		PROVIDE(__qspiflash_cfg_start__ = .);
		KEEP(*(.qspiflash_cfg))
		. = ALIGN(4);
		PROVIDE(__qspiflash_cfg_end__ = .);
	} > QSPIFLASH0

    DISCARD :
    {
        libc.a ( * )
        libm.a ( * )
        libgcc.a ( * )
    }

}

Thanks so much! I will double check but the ITCM parts of this look identical to what I was doing.

“Don’t forget to add this right to the top of main:”

Is there a reason this has to be done at the top of main and not as part of Reset_Handler() in the libDaisy core startup .c file (i.e. where the .data contents are copied into RAM and the .bss is zeroed out?)

I was trying to copy the ITCM memory segment immediately after those other operations in the reset handler, so I wonder if it has to be done after SystemInit() is called (i.e. at the top of main() or just before the call to main() in the reset handler). EDIT: scratch that, SystemInit() isn’t called for Daisy bootloader apps.

I will report back when I get a chance to try it.