start_command_read_blocks() function
Receiving one or more blocks of data happens as follows:

1. send command + receive r1 response (SDSPI_CMD_R1_SIZE bytes total)
2. keep receiving bytes until TOKEN_BLOCK_START is encountered (this may take a while, depending on card's read speed)
3. receive up to SDSPI_MAX_DATA_LEN = 512 bytes of actual data
4. receive 2 bytes of CRC
5. for multi block transfers, go to step 2

These steps can be done separately, but that leads to a less than optimal performance on large transfers because of delays between each step. For example, if steps 3 and 4 are separate SPI transactions queued one after another, there will be ~16 microseconds of dead time between end of step 3 and the beginning of step 4. A delay between two blocking SPI transactions in step 2 is even higher (~60 microseconds).

To improve read performance the following sequence is adopted:

1. Do the first transfer: command + r1 response + 8 extra bytes. Set pre_scan_data_ptr to point to the 8 extra bytes, and set pre_scan_data_size to 8.
2. Search pre_scan_data_size bytes for TOKEN_BLOCK_START. If found, the rest of the bytes contain part of the actual data. Store pointer to and size of that extra data as extra_data_{ptr,size}. If not found, fall back to polling for TOKEN_BLOCK_START, 8 bytes at a time (in poll_data_token function). Deal with extra data in the same way, by setting extra_data_{ptr,size}.
3. Receive the remaining 512 - extra_data_size bytes, plus 4 extra bytes (i.e. 516 - extra_data_size). Of the 4 extra bytes, first two will capture the CRC value, and the other two will capture 0xff 0xfe sequence indicating the start of the next block. Actual scanning is done by setting pre_scan_data_ptr to point to these last 2 bytes, and setting pre_scan_data_size = 2, then going to step 2 to receive the next block.
When the final block is being received, the number of extra bytes is 2 (only for CRC), because we don't need to wait for the start token of the next block, and some cards get confused by these two extra bytes.

With this approach the delay between blocks of a multi-block transfer is ~95 microseconds, of which 35 microseconds are spent doing the CRC check. Further speedup is possible by pipelining transfers and CRC checks, at the expense of one extra temporary buffer.
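
The token scan in step 2 and the transfer-size arithmetic in step 3 can be illustrated with a minimal sketch. This is not the driver's actual code: scan_result_t, scan_for_start_token() and next_transfer_size() are hypothetical names introduced here, and the value 0xfe for TOKEN_BLOCK_START is an assumption inferred from the 0xff 0xfe sequence mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOKEN_BLOCK_START 0xfe /* assumed value, inferred from the 0xff 0xfe sequence */
#define BLOCK_SIZE        512  /* SDSPI_MAX_DATA_LEN in the description above */
#define CRC_SIZE          2
#define NEXT_TOKEN_SIZE   2    /* bytes read ahead to catch the next start token */

/* Hypothetical result type: where the payload bytes that trailed the token live. */
typedef struct {
    bool found;
    const uint8_t *extra_data_ptr;
    size_t extra_data_size;
} scan_result_t;

/* Step 2: search the pre-scan bytes for the start token. Everything after
 * the token already belongs to the data block. */
static scan_result_t scan_for_start_token(const uint8_t *pre_scan_data_ptr,
                                          size_t pre_scan_data_size)
{
    scan_result_t res = { .found = false, .extra_data_ptr = NULL, .extra_data_size = 0 };
    for (size_t i = 0; i < pre_scan_data_size; ++i) {
        if (pre_scan_data_ptr[i] == TOKEN_BLOCK_START) {
            res.found = true;
            res.extra_data_ptr = pre_scan_data_ptr + i + 1;
            res.extra_data_size = pre_scan_data_size - i - 1;
            break;
        }
    }
    return res;
}

/* Step 3: number of bytes to request in the next transfer. The final block
 * only needs the CRC; other blocks also read ahead for the next token. */
static size_t next_transfer_size(size_t extra_data_size, bool is_last_block)
{
    assert(extra_data_size <= BLOCK_SIZE);
    size_t tail = CRC_SIZE + (is_last_block ? 0 : NEXT_TOKEN_SIZE);
    return BLOCK_SIZE - extra_data_size + tail;
}
```

For example, if the token is the fourth of the 8 pre-scan bytes, 4 bytes of payload are already in hand, so the next transfer requests 516 - 4 = 512 bytes, or 514 - 4 = 510 bytes for the final block. When the scan finds nothing, the caller would fall back to polling as described in step 2.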