So the contiguous-access method you proposed is indeed how I was rearranging the pipeline. Doing this in the 32-byte loop shaves another microsecond off the XRAM/YRAM benchmark, at which point I think I'll start using the 0.14 µs-level measure, because the remaining differences are getting really small. The take-away is: it is indeed faster, but only slightly.
To understand exactly why, I went back to the pipeline documentation. The mov instructions we're using go through a 3-stage register-fetch + memory-access sequence that occupies the CPU's LS region, so at best one can issue every 3 cycles. With no data dependency, that should be the limit. With a data dependency, the WB stage that follows the 3-stage sequence pushes the limit to 4 cycles per instruction. So freeing up the pipeline cannot improve things by more than 25%.
Now the only thing I still don't understand is how the absolute time can be so small. Even eliminating every source of overhead, the function needs at least 6 pipeline cycles per 4 bytes of data, yet I measured only two Iϕ cycles per 4 bytes. Maybe my understanding that Iϕ is the frequency of the pipeline is incorrect; do you have any information on this aspect?
Quote:
Hmmm... like I said, there is a terrible little cache in this CPU; I think it is 32 bytes? So we might be "thrashing the cache" by moving back and forth. What happens if we trash more registers and keep the accesses contiguous, something like this? Using whichever registers... I just adapted the inline code sample I had:
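The inline sample itself didn't survive in the quote, but the pattern it describes — load a contiguous run into several temporaries, then store them all, rather than alternating a load and a store between the two regions — would look roughly like this in C. The function name and the 8-word (32-byte) unroll factor are my own guesses, not the quoted code:

```c
#include <stdint.h>
#include <stddef.h>

/* Copy 32 bytes at a time: 8 contiguous loads into temporaries,
 * then 8 contiguous stores, so accesses to each region stay
 * sequential instead of ping-ponging between source and dest. */
static void copy32_contiguous(uint32_t *dst, const uint32_t *src, size_t words)
{
    for (size_t i = 0; i + 8 <= words; i += 8) {
        uint32_t a = src[i+0], b = src[i+1], c = src[i+2], d = src[i+3];
        uint32_t e = src[i+4], f = src[i+5], g = src[i+6], h = src[i+7];
        dst[i+0] = a; dst[i+1] = b; dst[i+2] = c; dst[i+3] = d;
        dst[i+4] = e; dst[i+5] = f; dst[i+6] = g; dst[i+7] = h;
    }
}
```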
Oh, so I was unsure about this bit. The SH7724 documentation hints at such a ridiculous cache being used, but it seemed really wrong, so I mostly ignored it. Do we have quantified evidence of this? I think I'll try some test programs, since it's normally easy to find cache parameters by manipulating memory, and I'd like to be sure.