Performance advice for large array (framebuffer) please?

So my Zig based PinePhone OS finally has touch screen working :slight_smile: Now I do some UI stuff to make testing easier. First job, buttons! I have the following code, but it seems slow when I used the same code to clear the screen (visible drawing, maybe .25s). I have no way to time this code on device yet.

I wanted to check if I am exposed to anything like bounds checking when doing it this way, and if so, how to avoid without dropping down to assembly again please.

I hope to also get double buffering working, which should hide drawing artifacts, but I would still like to be optimal on this code path, since next I will need all the other pixel-by-pixel drawing primitives.

Any advice very welcome, thanks.

```zig
// defined in linker.ld within .data section: LONG(fb0);
export var fb0 = [_]u32{0x80FF0000} ** (720 * 1440); // red background to framebuffer

var fgColor: u32 = 0x80FFFFFF; // white

pub fn fillRect(sx: usize, sy: usize, width: usize, height: usize) void {
    var x: usize = sx;
    var y: usize = sy;
    const xLimit: usize = sx + width;
    const yLimit: usize = sy + height;
    while (y < yLimit) {
        x = sx;
        while (x < xLimit) {
            fb0[(y * windowPitch) + x] = fgColor;
            x += 1;
        }
        y += 1;
    }
}
```

1 Like

@whitehexagon I fixed the formatting of the code block, please use:

```
your code here
```

Not ` (single backtick) for the start and end of each line individually, you may have to switch to the “standard markdown editor” (first button / Ctrl-M) to see the difference. Using single backtick is for inline code within text descriptions not for multiline code examples.

1 Like

Hi! 0.25s to clear ~4MB is suspiciously slow - even naive code on A53 should do this in milliseconds. This suggests a hardware-level issue, not a Zig code issue.

Still worth a try:

Align the framebuffer for optimal SIMD:

```zig
export var fb0: [720 * 1440]u32 align(64) = [_]u32{0x80FF0000} ** (720 * 1440);
```

Replace inner loop with `@memset`:

```zig
pub fn fillRect(sx: usize, sy: usize, width: usize, height: usize) void {
    for (sy..sy + height) |y| {
        const row_start = y * windowPitch + sx;
        @memset(fb0[row_start..][0..width], fgColor);
    }
}
```

For full screen clear, single call: `@memset(&fb0, color);`

Make sure you compile with `-Doptimize=ReleaseFast`.

But the real issue is likely memory mapping:

If your framebuffer is mapped as Device memory (strongly ordered), each 4-byte write waits for completion before the next - that’s ~4 million slow sequential writes.

You could map the framebuffer region as Normal Non-Cacheable (write-combining) in your MMU page tables. Point the AttrIndx field of those pages at a MAIR_EL1 slot whose attribute byte is 0x44. This alone could be 10-100x faster.

If you’re using cached memory instead, you’ll need to flush the cache after drawing (dc cvac per cache line + dsb) so the display controller sees the updated pixels.

This would require some inline assembly though.

11 Likes

If you can access the DMA controller, use it to implement double buffering.

2 Likes

this is screaming for the good old DMA, at least on some device it can do some sort of memset very efficiently and since it’s DMA it won’t ruin your CPU.

2 Likes

This looks like a job for a GPU, not a CPU. Can’t you use Vulkan?

@Sze - thanks for the fix-up. The forum editor is painfully slow from here (.5 char/s?), so I have to edit offline and messed up the paste.

@M64 - Thanks for the ideas! I’m baremetal based, and probably haven’t got things set up optimally. I currently build ReleaseSmall, since I had some memory issues with other options. The inner memset is nice, but since I’ll be doing lots of line rendering next, not just horizontal lines :wink: I need something that also works fast for single pixels. Hence I wanted to ensure that array access like this doesn’t have overheads.
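On the overhead question, a minimal sketch of a per-pixel write. Zig's runtime safety checks (array index checks included) are enabled in Debug and ReleaseSafe and disabled by default in ReleaseFast and ReleaseSmall, so a ReleaseSmall build should already turn the indexed store into a plain store. `@setRuntimeSafety(false)` makes that explicit for one scope regardless of build mode. The names `fb`, `pitch`, `rows` and `putPixel` are placeholders for this sketch, and the 720-pixel pitch is an assumption since `windowPitch` is not shown in the thread.

```zig
const std = @import("std");

// Assumed geometry for the sketch: 720 pixels per row, 1440 rows.
const pitch: usize = 720;
const rows: usize = 1440;

// Stand-in for the real framebuffer (fb0 in the original post).
var fb = [_]u32{0} ** (pitch * rows);

/// Writes one pixel without an index check, in any build mode.
/// The caller must guarantee x < pitch and y < rows.
pub inline fn putPixel(x: usize, y: usize, color: u32) void {
    @setRuntimeSafety(false);
    fb[y * pitch + x] = color;
}
```

With the check removed, clipping becomes the caller's job, so a line routine would normally clip its endpoints once up front rather than testing every pixel.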

I wasnt aware there was SIMD support, another topic for me to learn :slight_smile:

I havent configured anything memory wise, so the framebuffer is just part of .data and from my basic understanding probably just sits in regular SDRAM. I tried a DSB and it didnt help, but I shall investigate the MMU you mentioned. I can see in the A64 manual there is GPUMMU, but not clear if there is an MMU.

I’m not using the GPU since it required a driver as far as I understand, and I try to avoid 3rd party code there. So the display data is coming from RAM and then kinda direct to LCD.

@dimdin , @pierrelgol - Thanks. So I can see the A64 SoC has a DMA controller, but doesnt say much about it other than 8 channels, and lists a bunch of registers. I shall try and find a better resource. I havent done anything with DMA since Amiga copper bar blitting, if that is similar concept, great if so, I can use it for my bitmap fonts!

But I dont think DMA will help with pixel based line drawing etc that I’ll soon need.

Anyway, thanks to everyone for your support!

What I did try was using a second layer in the rendering pipeline, and switching opacity between layers. Not really double buffering, but somehow worked better. So I’ll focus on a true double buffer next, and try to find a way to measure performance. Mainly I wanted to ensure I wasnt missing something obvious like the way I declared the array or accessed it etc.

Perhaps there is a better or newer guide.
But for now see: Section 3.11. DMA - page 196

1 Like

I’d do some back-of-the-napkin estimates to check if the 0.25 seconds you see make sense:

720 * 1440 = 1036800

…about a million u32 writes. For a 1 GHz CPU, 0.25 s works out to roughly 250 clock cycles per write (edit: 250, not the 1000 I first wrote). That is so slow that bounds checking or a debug build without optimization can’t be the reason. I would bet that the problem you’re seeing is not Zig related.
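The same estimate as a small calculation, in case anyone wants to plug in other numbers. The 1 GHz clock is the assumption made above, not a measured value, and `cyclesPerWrite` is just a helper name for this sketch.

```zig
const std = @import("std");

/// Cycles available per write, given a clock rate, an elapsed time expressed
/// as the fraction num/den of a second, and the number of writes performed.
pub fn cyclesPerWrite(clock_hz: u64, num: u64, den: u64, writes: u64) u64 {
    const total_cycles = clock_hz * num / den;
    return total_cycles / writes;
}

pub fn main() void {
    const writes: u64 = 720 * 1440; // one u32 per pixel
    // Assumed 1 GHz clock, observed ~0.25 s (1/4 of a second).
    const per_write = cyclesPerWrite(1_000_000_000, 1, 4, writes);
    std.debug.print("{d} writes, ~{d} cycles each\n", .{ writes, per_write });
}
```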

3 Likes

Update: So last night I got a simple double-buffer framebuffer working :slight_smile: My sleep deprived brain had the idea of just updating the framebuffer memory pointers in the SoC Display-Engine, and this works! I think it uses DMA under the covers to transfer the pixels direct to the LCD. Great!
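For anyone following along, the pointer-swap idea can be sketched roughly like this. It is not my actual Display Engine code: `scanout_address` is a plain variable standing in for the hardware register that holds the framebuffer address, and `buffers`, `backBuffer` and `swapBuffers` are names made up for the sketch. On the device the store would be a volatile write to the memory-mapped register.

```zig
const std = @import("std");

const pixel_count = 720 * 1440;

// Two full-size buffers; contents start out uninitialised.
var buffers: [2][pixel_count]u32 = undefined;

// Index of the buffer currently being displayed.
var front: u1 = 0;

// Hypothetical stand-in for the display hardware's framebuffer address register.
var scanout_address: usize = 0;

/// Returns the off-screen buffer, which is safe to draw into.
pub fn backBuffer() []u32 {
    return &buffers[front ^ 1];
}

/// Makes the off-screen buffer visible by publishing its address.
/// No pixels are copied by the CPU.
pub fn swapBuffers() void {
    front ^= 1;
    scanout_address = @intFromPtr(&buffers[front]);
}
```

Drawing goes to `backBuffer()` and a single `swapBuffers()` call flips the display, which is why the switch itself costs so little compared with filling the buffer.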

I added a button to switch buffers manually, and it is fast :slight_smile: Fast enough that I noticed my new touch-screen code was generating spurious duplicated touch events, oopsi! I was hoping that was the reason for the slowness, but if I call fillRect/clearScreen on the off-screen buffer, just before switching, I no longer get the flood-fill effect, but there is still a very noticeable delay.

I’m going to try some of the other suggestions here today, and try again to see if I can get ReleaseFast working. Strangely enough the point at which that crashes is in the initialisation of the Display Engine.

I did discover that MMU is A53 not SoC feature, so that is another interesting avenue to explore. I assume u-boot must have done some of this setup already, since I havent configured anything on memory side of things thus far.

@LucasSantos91 - yeah, there is a built in Mali 400MP2 GPU, but I think it requires a driver, shame because I was hoping to do some OpenGL ES on the device.

@dimdin - yes, that document is very dog-eared at this point :slight_smile: But I seem to be missing the knowledge to make sense of the DMA descriptors, packages, half-packages. DRQ, ports, etc. My UI is going to be 2D so I will for sure investigate this topic further.

@floooh - thanks, that’s what I was trying to rule out (Zig) because if I have to poke a million pixels a frame for my UI I need to be sure my Zig code path is optimal, hence my concern about bounds checking etc.

[Edit] PS I just found the A53 has built in performance monitors :slight_smile:

3 Likes

So I think I got the cycles counter (PMCR_EL0 & PMCCNTR_EL0) working:

```
Switch to backbuffer only:            ~1,500
Memset to clear before swap:    ~205,000,000
asm loop to clear before swap:  ~205,000,000
asm clear 1/10th before swap:    ~20,000,000
```

So nothing to do with Zig, of course! so you are all free to enjoy your week :wink: and I begin the journey to learn about MMU.
Thanks for the help to get this far, and I will feedback once I have the solution.

1 Like

Update: So I think I managed to implement the MMU support from Zig :slight_smile: And yes, it did take me this long to understand enough of ARM architectural reference manual to be able to implement even a simple block based translation table :slight_smile: Wow, that is one meaty document!

But I also took the time to implement all the relevant registers in Zig rather than adding to my growing volume of global assembly routines. Although I still used G.A. for various memory barriers and some cache hacks.

It was a strange challenge, because along the way I discovered a partially configured, but seemingly not enabled, EL2 4KB table (must be part of u-boot).

Now I have a 64KB EL1 table setup, but I am not convinced it is being used, since I get the same test results.

I did manage to rule out the graphics pipeline. A memset of same-sized ‘regular memory’ is just as slow. I tried with an array in Zig, and also a chunk of memory defined in my linker file.

So I suspect my new table is not actually being used, but not sure how to validate that yet.

As an aside, I used a packed struct for the table entry modeling. And I did wonder if there was a way in Zig to represent the various page size ‘output address’ types without duplicating the entire struct for each level & granule page type. I.e. I could just use a single field [47:17], but then I’ll have to fiddle with bits to set the values. Maybe a setter on the struct would be enough, but then my struct would be inflated with page type info.

64KB

```zig
    _17: u12 = 0,   // bits[n-1:17] = [28:17] - RES0
    OA: u19 = 0,    // Output address[47:n] = [47:29]
```

4KB

```zig
    _17: u4 = 0,    // bits[n-1:17] = [20:17] - RES0
    OA: u27 = 0,    // Output address[47:n] = [47:21]
```
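One possible way to avoid the duplication is a function that returns the struct type, with the field widths computed at comptime from `n`. This is only a sketch: `BlockDescriptor`, `fromPhysical` and `physical` are made-up names, the attribute bits are lumped into opaque `lower`, `_12` and `upper` fields rather than modelled individually, and it assumes a Zig version where `std.meta.Int` is available.

```zig
const std = @import("std");

/// Block descriptor whose output address starts at bit n
/// (n = 29 for the 64KB case above, n = 21 for the 4KB case).
pub fn BlockDescriptor(comptime n: comptime_int) type {
    return packed struct(u64) {
        lower: u12 = 0, // bits [11:0], not modelled in this sketch
        _12: u5 = 0, // bits [16:12], not modelled in this sketch
        _17: std.meta.Int(.unsigned, n - 17) = 0, // bits [n-1:17] - RES0
        OA: std.meta.Int(.unsigned, 48 - n) = 0, // Output address [47:n]
        upper: u16 = 0, // bits [63:48], not modelled in this sketch

        const Self = @This();

        /// Builds an entry from a physical address, dropping bits below n
        /// and above 47.
        pub fn fromPhysical(pa: u64) Self {
            return .{ .OA = @truncate(pa >> n) };
        }

        /// Reconstructs the physical address held in OA.
        pub fn physical(self: Self) u64 {
            return @as(u64, self.OA) << n;
        }
    };
}

pub const Block64K = BlockDescriptor(29);
pub const Block4K = BlockDescriptor(21);
```

Each instantiation is still exactly 64 bits, and the page-size information lives in the type rather than in the entry, so the struct is not inflated.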
2 Likes

@M64 MMU is working thanks! Although it turned out an extra step was required:

```zig
var sctrl = SCTLR_EL1.peek();
sctrl.M = 1;
sctrl.C = 0;
sctrl.I = 1;
SCTLR_EL1.poke(sctrl);
```

With the ‘I’ bit set (instruction caching?), clearScreen() drops from 205 million cycles down to 4.8 million cycles, ~42x faster :slight_smile:

It’s a 24 MHz clock signal on the PinePhone, so I think I am right in saying that this is probably 1 instruction per cycle? So ~0.2s. Not sure if that is reasonable, but at least there is no longer a visible draw effect now :slight_smile:
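For reference, the three SCTLR_EL1 bits touched above sit at bit 0 (M, MMU enable), bit 2 (C, data cache enable) and bit 12 (I, instruction cache enable). A cut-down sketch of how that could be modelled follows. It is not my full register definition: every other bit is folded into padding fields here, and `SctlrEl1` is a name made up for the sketch.

```zig
const std = @import("std");

/// Partial view of SCTLR_EL1; only M, A, C and I are named.
pub const SctlrEl1 = packed struct(u64) {
    M: u1 = 0, // bit 0: MMU enable
    A: u1 = 0, // bit 1: alignment check enable
    C: u1 = 0, // bit 2: data cache enable
    _3: u9 = 0, // bits [11:3], not modelled in this sketch
    I: u1 = 0, // bit 12: instruction cache enable
    _13: u51 = 0, // bits [63:13], not modelled in this sketch
};

pub fn main() void {
    const value = SctlrEl1{ .M = 1, .C = 0, .I = 1 };
    const raw: u64 = @bitCast(value);
    std.debug.print("SCTLR_EL1 bits set by the sketch: 0x{x}\n", .{raw});
}
```

A real update should start from the value read back from the register, as in the peek/poke sequence above, so that bits this sketch does not model are preserved.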

4 Likes

@dimdin , @pierrelgol So I now have Zig based implementation of the DMA controller on the phone :slight_smile: I tried something a bit sneaky, and made the blit source a fake register (u32 using ‘IO’ Mode) holding my clear color. It works! thanks :slight_smile:

I havent found a good way to get wall-clock performance measurements yet, but we are now down to ~740,000 cycles! but I guess this is now kinda async.

Oh and don’t laugh, but I only just discovered the function of a PLL. So I need to get some Zig together to peek ‘n’ poke that next.

3 Likes

Awesome! I am glad you kept digging and got rewarded with that speedup! I always find it exciting to learn about hardware specifics and how they affect performance.

1 Like