External Flash Erase Speed

Hi Folks

Been messing with saving files into external flash - and it’s horribly slow:

86 seconds to erase and write a 3MB file.

Seems a little too slow TBH, any ideas how to speed this up?

many thanks!

Oh, it’s obvious why your code is slow just by looking at a terminal screenshot! (No, actually it isn’t: nobody can tell what’s wrong without seeing what you’re doing.)

There are a few ways to make QSPI writes faster, e.g. erase in larger units (64KB blocks instead of the 4KB sectors that the current libDaisy code uses), or use DMA for writing large blocks of data (to avoid polling). I imagine some of this can be done once there is an official flash writer solution for libDaisy. You’ll still be limited by USB FS bandwidth (12 Mbit/s).

Hey antisvin thanks for your reply :slight_smile:

Screenshot is to show the time taken!

The actual libDaisy functions seem dog slow - I think they are a lot slower than they should be.

Yeah - I did have a quick look at the QSPI internals. dsy_qspi_erase just calls dsy_qspi_erasesector, so I don’t think doing larger blocks is going to speed it up by any major factor. The same goes for dsy_qspi_write - that just calls dsy_qspi_writepage and splits the data up into page-sized chunks.
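For context, splitting a write into page-sized pieces is normal for NOR flash, since a page program command generally cannot cross a page boundary. A minimal sketch of that chunking, assuming a 256-byte page; `chunked_write`, `write_page_fn` and `count_only` are hypothetical names for illustration, not libDaisy functions:

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 256u /* assumed page size; check the flash datasheet */

/* Hypothetical callback that programs one chunk; returns 0 on success. */
typedef int (*write_page_fn)(uint32_t addr, const uint8_t *data, size_t len);

/* Dummy backend for trying the logic on a PC: accepts every chunk. */
int count_only(uint32_t addr, const uint8_t *data, size_t len)
{
    (void)addr;
    (void)data;
    (void)len;
    return 0;
}

/* Split [addr, addr + len) so that no chunk crosses a page boundary.
 * Returns the number of page-program calls made, or -1 on error. */
int chunked_write(uint32_t addr, const uint8_t *data, size_t len,
                  write_page_fn write_page)
{
    int calls = 0;
    while(len > 0)
    {
        /* Bytes left in the page that contains addr. */
        size_t room  = PAGE_SIZE - (addr % PAGE_SIZE);
        size_t chunk = len < room ? len : room;
        if(write_page(addr, data, chunk) != 0)
            return -1;
        addr += (uint32_t)chunk;
        data += chunk;
        len -= chunk;
        calls++;
    }
    return calls;
}
```

With these assumptions a 3MB file needs 12,288 page programs, each followed by a wait for the chip to report ready.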

I’m more interested in why the internal functions are slow and how to get them faster.

I don’t think I have the skills/time to write a DMA-based system at the moment - and I think it would be super messy from my code’s point of view.

thanks for your help tho!

Erasing in larger units will give a faster erase time. If you don’t understand that, you should do some research - erasing is not fast, and there’s overhead from the number of commands that you send (so sending 16x fewer erase commands would save some time). Other changes would be non-trivial. But you may also be doing something in your code that makes things worse, e.g. using SDRAM to store data (higher access latency than SRAM) or switching QSPI modes between writes, etc.
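To put a rough number on the command count, here is a small sketch that counts erase commands for a given erase unit. `erase_commands` is a hypothetical helper, and it assumes the start address is aligned to the unit:

```c
#include <stdint.h>

/* Number of erase commands needed to cover `len` bytes starting at an
 * aligned address, for a given erase unit size (ceiling division). */
uint32_t erase_commands(uint32_t len, uint32_t unit)
{
    return (len + unit - 1u) / unit;
}
```

For a 3MB file this gives 768 commands with 4KB sectors and 48 with 64KB blocks. How much wall time that saves also depends on the per-command erase times in the flash datasheet.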

Hi antisvin

are you saying that I should mod

int dsy_qspi_erasesector(uint32_t addr)
{
    uint8_t             use_qpi = 0;
    QSPI_CommandTypeDef s_command;
    if(use_qpi)
    {
        s_command.InstructionMode = QSPI_INSTRUCTION_4_LINES;
        s_command.Instruction     = SECTOR_ERASE_QPI_CMD;
        s_command.AddressMode     = QSPI_ADDRESS_4_LINES;
    }
    else
    {
        s_command.InstructionMode = QSPI_INSTRUCTION_1_LINE;
        s_command.Instruction     = SECTOR_ERASE_CMD;
        s_command.AddressMode     = QSPI_ADDRESS_1_LINE;
    }
    s_command.AddressSize       = QSPI_ADDRESS_24_BITS;
    s_command.AlternateByteMode = QSPI_ALTERNATE_BYTES_NONE;
    s_command.DataMode          = QSPI_DATA_NONE;
    s_command.DummyCycles       = 0;
    s_command.NbData            = 0;
    s_command.DdrMode           = QSPI_DDR_MODE_DISABLE;
    s_command.DdrHoldHalfCycle  = QSPI_DDR_HHC_ANALOG_DELAY;
    s_command.SIOOMode          = QSPI_SIOO_INST_EVERY_CMD;
    s_command.Address           = addr;
    if(write_enable(&qspi_handle.hqspi) != DSY_MEMORY_OK)
    {
        return DSY_MEMORY_ERROR;
    }
    if(HAL_QSPI_Command(
           &qspi_handle.hqspi, &s_command, HAL_QPSI_TIMEOUT_DEFAULT_VALUE)
       != DSY_MEMORY_OK)
    {
        return DSY_MEMORY_ERROR;
    }
    if(autopolling_mem_ready(&qspi_handle.hqspi, HAL_QPSI_TIMEOUT_DEFAULT_VALUE)
       != DSY_MEMORY_OK)
    {
        return DSY_MEMORY_ERROR;
    }
    return DSY_MEMORY_OK;
}

to erase more than the usual IS25LP080D_SECTOR_SIZE?

Interesting!

Ok I took your advice

I modded that to use block sizes instead of sector sizes - it appears to work. I haven’t done a full test yet, but it seems legit.

And it’s got 62 secs down to 11 secs - nice win! (There was a bug in my first tests causing a sector to be written twice, which in turn took twice as long as it should have.)

Here’s the code to put into qspi.c:

// Forward declaration, since dsy_qspi_eraseblock is defined further down.
int dsy_qspi_eraseblock(uint32_t addr);

int dsy_qspi_eraseblocks(uint32_t start_adr, uint32_t end_adr)
{
    uint32_t block_addr;
    uint32_t block_size = IS25LP064A_BLOCK_SIZE; // 64kB blocks for now.

    // Round the start address down to a block boundary.
    start_adr = start_adr - (start_adr % block_size);
    while(end_adr >= start_adr)
    {
        block_addr = start_adr & 0x0FFFFFFF;
        if(dsy_qspi_eraseblock(block_addr) != DSY_MEMORY_OK)
        {
            return DSY_MEMORY_ERROR;
        }
        start_adr += block_size;
    }
    return DSY_MEMORY_OK;
}

int dsy_qspi_eraseblock(uint32_t addr)
{
    uint8_t             use_qpi = 0;
    QSPI_CommandTypeDef s_command;
    if(use_qpi)
    {
        s_command.InstructionMode = QSPI_INSTRUCTION_4_LINES;
        s_command.Instruction     = SECTOR_ERASE_QPI_CMD;
        s_command.AddressMode     = QSPI_ADDRESS_4_LINES;
    }
    else
    {
        s_command.InstructionMode = QSPI_INSTRUCTION_1_LINE;
        s_command.Instruction     = BLOCK_ERASE_CMD;
        s_command.AddressMode     = QSPI_ADDRESS_1_LINE;
    }
    s_command.AddressSize       = QSPI_ADDRESS_24_BITS;
    s_command.AlternateByteMode = QSPI_ALTERNATE_BYTES_NONE;
    s_command.DataMode          = QSPI_DATA_NONE;
    s_command.DummyCycles       = 0;
    s_command.NbData            = 0;
    s_command.DdrMode           = QSPI_DDR_MODE_DISABLE;
    s_command.DdrHoldHalfCycle  = QSPI_DDR_HHC_ANALOG_DELAY;
    s_command.SIOOMode          = QSPI_SIOO_INST_EVERY_CMD;
    s_command.Address           = addr;
    if(write_enable(&qspi_handle.hqspi) != DSY_MEMORY_OK)
    {
        return DSY_MEMORY_ERROR;
    }
    if(HAL_QSPI_Command(
           &qspi_handle.hqspi, &s_command, HAL_QPSI_TIMEOUT_DEFAULT_VALUE)
       != DSY_MEMORY_OK)
    {
        return DSY_MEMORY_ERROR;
    }
    if(autopolling_mem_ready(&qspi_handle.hqspi, HAL_QPSI_TIMEOUT_DEFAULT_VALUE)
       != DSY_MEMORY_OK)
    {
        return DSY_MEMORY_ERROR;
    }
    return DSY_MEMORY_OK;
}

That use_qpi condition is left over from the old command and would still erase in 4K sectors. It’s better to just remove it from your function in case use_qpi gets set to true later. There also doesn’t seem to be an equivalent command for QPI block erasure.

I would say that dsy_qspi_eraseblocks could be more generic if it did something like:

  1. align start / end to 4k sector size
  2. erase in sectors until first address aligned to 64k block is reached
  3. erase in blocks until last address aligned to 64k block is reached
  4. erase remaining sectors.
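The four steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: `erase_range`, `erase_fn` and the counting stubs are hypothetical names, the sizes are assumed to be 4KB and 64KB, `end` is treated as exclusive, and real code would call the libDaisy sector and block erase functions instead of the stubs.

```c
#include <stdint.h>

#define SECTOR_SIZE 0x1000u  /* 4KB, assumed */
#define BLOCK_SIZE  0x10000u /* 64KB, assumed */

/* Hypothetical erase callback; returns 0 on success. */
typedef int (*erase_fn)(uint32_t addr);

/* Counting stubs so the planning logic can be tried on a PC. */
static uint32_t sector_calls, block_calls;
int count_sector(uint32_t addr) { (void)addr; sector_calls++; return 0; }
int count_block(uint32_t addr)  { (void)addr; block_calls++;  return 0; }

/* Erase every sector touching [start, end), using a 64KB block erase
 * wherever a whole aligned block fits and 4KB sector erases elsewhere.
 * Addresses are flash offsets, so end is assumed to be far from 2^32. */
int erase_range(uint32_t start, uint32_t end,
                erase_fn erase_sector, erase_fn erase_block)
{
    /* Step 1: align start down and end up to the sector size. */
    uint32_t addr = start & ~(SECTOR_SIZE - 1u);
    uint32_t stop = (end + SECTOR_SIZE - 1u) & ~(SECTOR_SIZE - 1u);

    while(addr < stop)
    {
        int rc;
        if((addr % BLOCK_SIZE) == 0 && stop - addr >= BLOCK_SIZE)
        {
            rc = erase_block(addr); /* step 3: whole aligned block */
            addr += BLOCK_SIZE;
        }
        else
        {
            rc = erase_sector(addr); /* steps 2 and 4: leading/trailing */
            addr += SECTOR_SIZE;
        }
        if(rc != 0)
            return -1;
    }
    return 0;
}
```

For a 4,172,152-byte file starting at offset 0, this plan issues 63 block erases and 11 sector erases, instead of 1019 sector erases.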

I would expect something like that to be added to libDaisy itself. And if you just want to erase the whole chip every time, it’s possible to do it with a single CHIP_ERASE_CMD command.

I’m curious, do you use some kind of metadata for your files or just require patches to use hardcoded addresses?

Yeah, I saw the CHIP_ERASE_CMD - got me all excited!!

And yeah, that is a great idea - read the whole flash, wipe it, then write. In my initial tests the actual write is pretty god damn fast:

// Erasing 4,172,152 bytes in 4kb blocks in External Flash takes 61.6 secs
// Writing 4,172,152 bytes in 4kb blocks to External Flash takes 3.5 secs
// Erasing and Writing 4,172,152 bytes in 4kb blocks to External Flash takes 64.0 secs
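Dividing those figures out gives a rough per-command cost. The sketch below is plain arithmetic on the numbers above, not a datasheet value, and `ms_per_erase` is a hypothetical helper:

```c
#include <stdint.h>

/* Average time per erase command in milliseconds, given the total erase
 * time and the number of bytes covered in `unit`-sized erases. */
double ms_per_erase(double total_secs, uint32_t bytes, uint32_t unit)
{
    uint32_t commands = (bytes + unit - 1u) / unit; /* ceiling division */
    return total_secs * 1000.0 / (double)commands;
}
```

`ms_per_erase(61.6, 4172152, 4096)` comes out at roughly 60 ms per 4KB sector over 1019 sectors. That suggests most of the time may be the chip’s own erase cycle rather than driver overhead, but it is worth comparing against the sector erase time in the flash datasheet.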

I’ll try the chip wipe later on - see what speeds I get.

I was going to mod dsy_qspi_erase as you describe to use 64KB then 4KB - easy enough - just too lazy at the moment - I shouldn’t even be awake - it’s Sunday!

I did try that use_qpi and got all excited when the whole erase/write took 3.5 secs - then when testing it - of course it didn’t actually work :slight_smile:

“I’m curious, do you use some kind of metadata for your files or just require patches to use hardcoded addresses?”

Don’t quite understand what you mean here - I’m ZZZzzzzzz brained at the moment. I’m just writing a system and some code to upload a zip to external flash easily, plus the code to read that zip and load any files I want into memory (sf2, wavs, images) so I can use them. It’s the hard work before I actually start any DSP etc. There’s no metadata.

I mean that you could preprocess those files to generate a single binary with an index for them, i.e. the number of files, then the name and size of each file, followed by the contents of all files (you may have to insert extra padding bytes to align object addresses to 4 bytes). Then you will be able to use it as a primitive read-only file system, with access to files by name in your patches rather than using hardcoded addresses.

So by metadata I mean an index for those files. This is the simplest way to approach it, but it could be good enough if you just want to share the same static data among different patches.
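A minimal sketch of such an index, assuming a fixed 16-byte name field and native (little-endian) byte order on both the PC that builds the blob and the Daisy. The layout, `index_entry` and `find_file` are all made up for illustration; nothing here is an existing libDaisy format.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NAME_LEN 16u /* assumed fixed-width, NUL-padded file name field */

/* Assumed blob layout:
 *   uint32_t count;
 *   { char name[NAME_LEN]; uint32_t size; } entries[count];
 *   file contents, each padded up to a multiple of 4 bytes. */
typedef struct
{
    char     name[NAME_LEN];
    uint32_t size;
} index_entry;

/* Return a pointer to the named file's data inside `blob` and store its
 * size in *size_out, or return NULL if the name is not in the index. */
const uint8_t *find_file(const uint8_t *blob, const char *name,
                         uint32_t *size_out)
{
    uint32_t count;
    memcpy(&count, blob, sizeof count);

    const uint8_t *entries = blob + sizeof count;
    const uint8_t *data    = entries + (size_t)count * sizeof(index_entry);

    for(uint32_t i = 0; i < count; i++)
    {
        index_entry e;
        memcpy(&e, entries + (size_t)i * sizeof e, sizeof e);
        if(strncmp(e.name, name, NAME_LEN) == 0)
        {
            *size_out = e.size;
            return data;
        }
        data += (e.size + 3u) & ~3u; /* skip this file plus its padding */
    }
    return NULL;
}
```

With the QSPI flash memory-mapped, `blob` would simply be the flash base address, so a patch could ask for a file by name and read it in place.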

I think I understand, I have a tiny zip lib I wrote for that:

// Zip Archive Functions
// move this into its own .h/.c file
// it’s super handy
std::string ZipFileName(char* Data, int FP);
bool ZipLocalSig(char* Data, int FP);
uint32_t ZipFileSize(char* Data, int FP);
uint16_t ZipCompressionMethod(char* Data, int FP);
uint32_t ZipCompressedFileSize(char* Data, int FP);
uint32_t ZipNextFile(char* Data, int FP);
char* ZipData(char* Data, int FP);

I made that zip lib because zlib and the like are too bloated, and really you only need them if you want to create zip files.

In code I simply do this to list and read any file I like by name:

void DumpFlashFile()
{
    MapQFlash();
    uint32_t* Header = (uint32_t*)ExternalFlash;     // file header data here
    char*     BS     = (char*)ExternalFlash + 4096;  // raw zip data starts here

    UartSerialSendString(">>> FlashFile: " ANSIYELLOW "%s" ANSIGREEN " FileSize:%d CRC32: 0x%08X\r\n" ANSIWHITE, (char*)&Header[5], Header[2], Header[3]);
    UartSerialSendString(">>> Zip Contains:\r\n");

    // Super tiny zip file reader. Because we read from the local file headers only, this only supports properly formed zip files,
    // which a few archivers don't produce (e.g. tar). I suggest using 7za to create .zip files, since it correctly fills out the zip local file header.

    // Iterate over all the files in the zip archive
    uint32_t FP = 0;
    while ((ZipLocalSig(BS, FP)) && (FP < Header[2]))
    {
        UartSerialSendString(">>> " ANSIYELLOW "%s" ANSIWHITE " %dkb %dkb\r\n", ZipFileName(BS, FP).c_str(), ZipCompressedFileSize(BS, FP) / 1024, ZipFileSize(BS, FP) / 1024);
        FP = ZipNextFile(BS, FP);
    }
    UnmapQFlash();
}
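The bodies of the Zip helpers are not shown in the thread. As a rough sketch of what such accessors might look like in plain C, the functions below read fields straight out of a ZIP local file header using the standard field offsets. The lowercase names are hypothetical and are not the author’s implementation.

```c
#include <stdint.h>

/* Read little-endian fields without assuming alignment. */
static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}
static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16)
           | ((uint32_t)p[3] << 24);
}

/* Byte offsets of fields within a ZIP local file header. */
enum
{
    ZIP_METHOD   = 8,  /* compression method (0 = stored, 8 = deflate) */
    ZIP_CSIZE    = 18, /* compressed size */
    ZIP_USIZE    = 22, /* uncompressed size */
    ZIP_NAMELEN  = 26, /* file name length */
    ZIP_EXTRALEN = 28, /* extra field length */
    ZIP_HDR_SIZE = 30  /* fixed part; the file name follows */
};

/* True if a local file header ("PK\3\4") starts at offset fp. */
int zip_local_sig(const uint8_t *d, uint32_t fp)
{
    return rd32(d + fp) == 0x04034b50u;
}
uint16_t zip_method(const uint8_t *d, uint32_t fp)
{
    return rd16(d + fp + ZIP_METHOD);
}
uint32_t zip_compressed_size(const uint8_t *d, uint32_t fp)
{
    return rd32(d + fp + ZIP_CSIZE);
}
uint32_t zip_file_size(const uint8_t *d, uint32_t fp)
{
    return rd32(d + fp + ZIP_USIZE);
}
/* Pointer to the (possibly compressed) file data after the header. */
const uint8_t *zip_data(const uint8_t *d, uint32_t fp)
{
    return d + fp + ZIP_HDR_SIZE + rd16(d + fp + ZIP_NAMELEN)
           + rd16(d + fp + ZIP_EXTRALEN);
}
/* Offset of the next local file header. */
uint32_t zip_next_file(const uint8_t *d, uint32_t fp)
{
    return fp + ZIP_HDR_SIZE + rd16(d + fp + ZIP_NAMELEN)
           + rd16(d + fp + ZIP_EXTRALEN) + zip_compressed_size(d, fp);
}
```

Walking local headers like this only works when the sizes are stored in the local header. Archives written in streaming mode put them in a trailing data descriptor instead, which is presumably what the comment about properly formed zip files refers to.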

Ok, I see. I thought that the archive in the log was decompressed before writing or something like that. So you’re actually using archives for indexing files. This is pretty neat, but it also means that you would have to extract and preload everything you need to RAM.

Yeah - I have the zip in external flash, then I can read and decompress (using tinflate) whenever I like in code. I’ve done it before, and the Seed has plenty of power to do that very quickly. Because of this I can have tons of files in flash and use them whenever I like - super handy. Whilst 64MB is pretty awesome, we always run out of memory one way or another.

Because I’ve written the PC comms side of it as well, I can do this:

make zippy build/qflashdata.zip reboot all flash

that will:

  1. Zip up everything in the qflashdata folder in my project
  2. upload and write that zip to the seeds external flash (only uploads if the zip has changed)
  3. reboot the seed into DFU mode
  4. compile my project
  5. upload and run on the seed
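The thread doesn’t say how the “only uploads if the zip has changed” check in step 2 is done. One plausible approach, given that the flash header printed by DumpFlashFile carries a file size and a CRC32, is to compare those against the freshly built zip. The sketch below uses the common CRC-32 (reflected polynomial 0xEDB88320); whether the author uses this variant is an assumption, and `crc32_update` and `needs_upload` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320, init and final XOR
 * of 0xFFFFFFFF). Pass 0 as `crc` to start a new checksum. */
uint32_t crc32_update(uint32_t crc, const uint8_t *data, size_t len)
{
    crc = ~crc;
    for(size_t i = 0; i < len; i++)
    {
        crc ^= data[i];
        for(int b = 0; b < 8; b++)
        {
            /* Shift right; XOR in the polynomial if the low bit was set. */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

/* Returns 1 if the new zip differs from what the flash header says is
 * already stored (by size or CRC32), 0 if the upload can be skipped. */
int needs_upload(uint32_t stored_crc, uint32_t stored_size,
                 const uint8_t *zip, size_t len)
{
    return stored_size != len || stored_crc != crc32_update(0, zip, len);
}
```

A table-driven CRC would be faster, but for a few megabytes on the PC side the bitwise loop is usually quick enough.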

it’s not quite there yet - but almost - once I have that - I’m happy and can actually start coding some DSP :slight_smile:

The problem with putting everything in SDRAM is not its size, but its bandwidth. It becomes a bottleneck if used too intensively (e.g. in something like a reverb with lots of delay lines… and bats!)

So being able to read from QSPI may work better in some cases, I would expect it to work faster if you need random IO that has lots of cache misses. Ideally you would want to avoid both and preload only data that is going to be used for each audio block to internal SRAM in advance.

Yeah - I’ve gotta learn all that internal ram/sdram caching/bandwidth fun later - when I port my reverb across - which uses 18 dlines at the moment :slight_smile:

Hey I tried that chip erase:

It gets the time down to 8.9 secs from 11.1 - which is pretty cool (since the 11.1 was only erasing 4MB).

Here’s the code if anyone else wants to try this. I had to call:

dsy_qspi_init(&seed->qspi_handle);
dsy_qspi_erasechip();
dsy_qspi_deinit();
dsy_qspi_init(&seed->qspi_handle);

before it would work. Without a reinitialization - it just doesn’t work at all.

int dsy_qspi_erasechip()
{
    // use_qpi is always 0 here, so only the else branch below runs.
    uint8_t             use_qpi = 0;
    QSPI_CommandTypeDef s_command;
    if(use_qpi)
    {
        s_command.InstructionMode = QSPI_INSTRUCTION_4_LINES;
        s_command.Instruction     = SECTOR_ERASE_QPI_CMD;
        s_command.AddressMode     = QSPI_ADDRESS_4_LINES;
    }
    else
    {
        s_command.InstructionMode = QSPI_INSTRUCTION_1_LINE;
        s_command.Instruction     = CHIP_ERASE_CMD;
        s_command.AddressMode     = QSPI_ADDRESS_NONE; // chip erase takes no address
    }
    s_command.AddressSize       = QSPI_ADDRESS_24_BITS;
    s_command.AlternateByteMode = QSPI_ALTERNATE_BYTES_NONE;
    s_command.DataMode          = QSPI_DATA_NONE;
    s_command.DummyCycles       = 0;
    s_command.NbData            = 0;
    s_command.DdrMode           = QSPI_DDR_MODE_DISABLE;
    s_command.DdrHoldHalfCycle  = QSPI_DDR_HHC_ANALOG_DELAY;
    s_command.SIOOMode          = QSPI_SIOO_INST_EVERY_CMD;
    s_command.Address           = 0;
    if(write_enable(&qspi_handle.hqspi) != DSY_MEMORY_OK)
    {
        return DSY_MEMORY_ERROR;
    }
    if(HAL_QSPI_Command(
           &qspi_handle.hqspi, &s_command, HAL_QPSI_TIMEOUT_DEFAULT_VALUE)
       != DSY_MEMORY_OK)
    {
        return DSY_MEMORY_ERROR;
    }
    if(autopolling_mem_ready(&qspi_handle.hqspi, HAL_QPSI_TIMEOUT_DEFAULT_VALUE)
       != DSY_MEMORY_OK)
    {
        return DSY_MEMORY_ERROR;
    }
    return DSY_MEMORY_OK;
}