pouët.net

Software Triangle Rasterization on ARM CPU

category: general [glöplog]
Hey guys (and girls), I would like to get some opinions on triangle rasterization from the great minds that hang out here. Mainly what you found worked best and what to avoid. I'm currently writing a software triangle rasterizer for Windows Mobile running on an ARM CPU. The target surface is always a 16-bit RGB565 surface.

To start off, I had defined a few fixed point types which would be handy for different purposes. All are based on the standard 32-bit integer:

* fx24 - 8.24

* fx16 - 16.16

* fx10 - 22.10

* fx8 - 24.8

Starting with the vertex type, I was contemplating between a collection of different vertex types to handle different scenarios OR just one BIG vertex type that would suit all purposes. I'll go with the all in one.

The vertices will be fx10, which is nice to work with, specifically for multiplication and division operations. UVs will use the fx24 type for precision when iterating over a texture. That limits me to 256 x 256 textures, but I don't see that as so bad. RAM is quite limited to begin with and 256 x 256 textures are so nice to work with. I use fx8 for the colors and keep the channels stored individually rather than as a packed 16-bit word, which makes them easy to interpolate.
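To make the formats concrete, here is a rough C sketch of the four types plus two helpers. The helper names (pack_rgb565, texel_offset_256) are made up for illustration, and it assumes each fx8 channel holds 0..255 in its integer part and that textures are stored as 256 x 256 row-major arrays:

```c
#include <assert.h>
#include <stdint.h>

/* The four fixed-point formats, all stored in a 32-bit integer. */
typedef int32_t fx24; /*  8.24 */
typedef int32_t fx16; /* 16.16 */
typedef int32_t fx10; /* 22.10 */
typedef int32_t fx8;  /* 24.8  */

/* Hypothetical helper: pack three fx8 channels into one RGB565 pixel.
 * Assumes the integer part of every channel is in 0..255. */
static inline uint16_t pack_rgb565(fx8 r, fx8 g, fx8 b)
{
    assert(r >= 0 && g >= 0 && b >= 0);
    uint32_t r8 = ((uint32_t)r >> 8) & 0xFF; /* drop the 8 fraction bits */
    uint32_t g8 = ((uint32_t)g >> 8) & 0xFF;
    uint32_t b8 = ((uint32_t)b >> 8) & 0xFF;
    /* keep the top 5/6/5 bits of each 8-bit channel */
    return (uint16_t)(((r8 >> 3) << 11) | ((g8 >> 2) << 5) | (b8 >> 3));
}

/* Hypothetical helper: texel offset into a 256 x 256 row-major texture.
 * The 8 integer bits of an 8.24 value are exactly the 0..255 texel
 * coordinate, which is where the 256 x 256 limit comes from. */
static inline uint32_t texel_offset_256(fx24 u, fx24 v)
{
    return (((uint32_t)v >> 24) << 8) | ((uint32_t)u >> 24);
}
```

Treating the 8.24 value as unsigned when shifting means U and V wrap around at 256 for free, which is one reason this layout is pleasant for tiled textures.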

Here is my fabulous data structure:

struct TVertex2D
{
    fx10 fxX, fxY, fxZ;  /* position, 22.10 */
    fx24 fxU, fxV;       /* texture coordinates, 8.24 */
    fx8  fxR, fxG, fxB;  /* color channels, 24.8 */
};

For rasterization, I've chosen to stick to triangles as they are easier to manage. So given three vertices [v0, v1, v2], I was thinking of sorting them such that v0.y > v1.y > v2.y. That makes v0 to v2 the longest edge, and v0 to v1 to v2 the other side, UNLESS it's a flat-top or flat-bottom triangle, in which case two of the y's are equal. So would it be better to split the triangle along the v1.y line and handle each case differently?

Then when it comes to scanning an edge I was thinking about the classic way of having an edge list that is the size of the screen height and storing the start and end x of each scanline for each y. But I don't like the idea of clearing out that list for each triangle to be rendered. So then I was thinking maybe a list that kept the x1, x2 and Y. The line raster function would take care of figuring out whether to draw from x2 to x1 or vice versa.

So... thoughts so far?
added on the 2009-02-11 18:03:37 by Phred Phred
"But I don't like the idea of clearing out that list for each triangle to be rendered."

Even democoders can't code? :P I don't like the idea of having a list.

for y=starty to endy
  x1=x1+x1d
  x2=x2+x2d
  rasterize_here
next y
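
In C the same loop could look roughly like the sketch below. It is only an illustration under a few assumptions: 16.16 edge values, a solid fill color, spans that are already clipped to the surface, and a made-up function name (fill_section). It draws the span before stepping, so the first scanline uses the starting x values.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fx16; /* 16.16 fixed point */

/* Fill scanlines [y_start, y_end) between two edges that are stepped
 * on the fly; no per-scanline edge list is stored anywhere.
 * Assumes every span lies inside the surface (no clipping here). */
static void fill_section(uint16_t *fb, int pitch,
                         int y_start, int y_end,
                         fx16 x1, fx16 x1d,
                         fx16 x2, fx16 x2d,
                         uint16_t color)
{
    assert(fb != 0 && pitch > 0);
    for (int y = y_start; y < y_end; ++y) {
        int xl = x1 >> 16;          /* integer part of each edge */
        int xr = x2 >> 16;
        if (xl > xr) {              /* draw left to right either way */
            int t = xl; xl = xr; xr = t;
        }
        uint16_t *row = fb + y * pitch;
        for (int x = xl; x < xr; ++x)
            row[x] = color;         /* "rasterize_here" */
        x1 += x1d;                  /* step both edges to the next line */
        x2 += x2d;
    }
}
```

A full triangle would call this twice, once for the part above the middle vertex and once for the part below, swapping in the new slope for the short edge in between.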
added on the 2009-02-11 18:37:02 by Oswald Oswald
I use 16.16 for all my fixed point needs unless the algorithm needs extra precision. Mixing different fixed point formats is doable but makes things much more complicated. It just sucks balls if you have to use a different correction factor in the subtexel correction for RGB and UV.. That just increases the amount of work you have to do.

If your ARM CPU supports the ARMv5E DSP instruction set you may want to use 1.15 format for RGBA and maybe even for UV as well. For UV you can still multiply with the texture width/height afterwards. ARMv5E makes this a cheap operation.

24 bit subtexel precision is overkill.
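
For illustration, the two formats could be handled like this in C. The names are invented for this sketch, and whether a compiler actually maps the 16 x 16 multiply onto the ARMv5E signed multiply instructions is an assumption about code generation, not something shown here:

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fx16; /* 16.16 */
typedef int16_t q15;  /* 1.15, range [-1, 1) */

/* 16.16 * 16.16 -> 16.16, using a 64-bit intermediate so the
 * product cannot overflow before the shift. */
static inline fx16 fx16_mul(fx16 a, fx16 b)
{
    return (fx16)(((int64_t)a * b) >> 16);
}

/* 1.15 * 1.15 -> 1.15. Only a 16 x 16 -> 32 multiply is needed. */
static inline q15 q15_mul(q15 a, q15 b)
{
    return (q15)(((int32_t)a * b) >> 15);
}

/* Scale a normalized 1.15 texture coordinate by the texture size
 * afterwards to get an integer texel index. */
static inline int32_t q15_to_texel(q15 u, int32_t tex_size)
{
    assert(tex_size > 0);
    return ((int32_t)u * tex_size) >> 15;
}
```

With one format for everything, the subtexel correction factor only has to be computed once and can be reused for RGB and UV.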
added on the 2009-02-11 19:10:16 by torus torus
Oh - regarding that left and right edge array: Don't do it, it sucks. Do a google search on fatmap.txt and fatmap2.txt. They'll teach you how to interpolate the edges without lame arrays.
added on the 2009-02-11 19:12:27 by torus torus
Does anyone else think that a software rasterizer in 2009 is anachronistic?
added on the 2009-02-11 19:19:32 by friol friol
What Oswald said. For each triangle you know the starting and ending Y and you don't need to clear the list, just write the new stuff in it to erase old stuff and loop only on the scanlines the triangle occupies.

My old engine on gamepark had some messy and bloated triangle rasterizing code. That was because I was following the method of splitting the triangle into parts, as in some tutorials I had read. Then someone else suggested the edge list method to me (I have a struct edge with the x element, and then an array of those that I malloc later depending on how many elements I need in a scene; I think bigger structs generally slow down my code). The code is cleaner and shorter, and on my PC it runs at almost double the speed of the old rasterizer. I haven't ported the engine to gamepark yet to see whether I get the same gain in performance or whether there is some cache drawback from having the list of structs as big as the screen height. There are actually a lot of issues: I am back at flat rendering, have to rewrite the basic shadings again, and there are some bugs. But edge list ftw, go for it!
added on the 2009-02-11 19:23:42 by Optimus Optimus
Quote:
Oh - regarding that left and right edge array: Don't do it, it sucks. Do a google search on fatmap.txt and fatmap2.txt. They'll teach you how to interpolate the edges without lame arrays.


Hmm, really? And why is it lame? I'll check the links anyways..
added on the 2009-02-11 19:25:25 by Optimus Optimus
Actually, I have a question too, or rather something I'd like to hear suggestions about. Another thing that changed in my new engine, and that produced the bug that made me abandon the project for a while, is this:

In the past, for each edge I recalculated the dc value (I mean, with color c1 at the left edge and color c2 at the right, dc = (c2 - c1) / length of the polygon's raster line). But later I read in a tutorial that a single polygon has one and the same dc (or du or dv or d-anything) across its plane, so it only needs to be calculated once. I now calculate it from the scanline in the middle of the polygon (actually the one that splits the triangle in two) and step by dc from left to right. The big problem is that I get some dirty pixels at the right edge (especially with du, dv in texture mapping). Probably the value overshoots the one it was supposed to reach exactly at the right edge. I could hack it by clamping the value or by decreasing dc a tiny bit so it doesn't get there, but I would prefer a clean solution over a hack. I thought the cause might be that I am not using subpixel or subtexel accuracy? I once tried to implement that too, but I got overflows (I have 16:16 fixed point) and abandoned the attempt :(

What the fuck I should do? Maybe read those tuts?
added on the 2009-02-11 19:32:50 by Optimus Optimus
Has anyone implemented span buffers with good results?

Torus: I plan to later make different routines optimized for different ARM platforms V5, XScale, V6 SIMD and so on but for now, I'm assuming plain old V4 architecture.

Also, when it comes to sorting points, what works better? Clockwise (or counter-clockwise), or sorting the y's from smallest to largest? It seems both approaches change how you scan the edges.
added on the 2009-02-11 19:39:39 by Phred Phred
Otinanist: "dc" is only constant if you have no perspective correct texturing, and no perspective projections at all.

phred: you can sort 3 numbers with 3 comparisons. and you dont need any list while rasterizing a triangle, you can do it on the fly.
added on the 2009-02-11 20:12:50 by Oswald Oswald
Quote:
Otinanist: "dc" is only constant if you have no perspective correct texturing, and no perspective projections at all.


Hmm yes, now that I think about it, that's true. But my engine isn't using perspective correction yet. For this one, the question remains: what did I do wrong?
added on the 2009-02-11 20:15:03 by Optimus Optimus
Quote:

Hmm, really? And why is it lame? I'll check the links anyways..


Assume you have 480 lines in your array. If you store each line with 4 bytes (two shorts), the entire array will be around 2 KB in size (480 x 4 = 1920 bytes). That'll be 1/4 of your precious data cache, assuming an 8 KB one.

That won't be a problem if you just render triangles with a small height and you only touch a small portion of the array. Unfortunately perspective projection will give you a good amount of very high yet narrow triangles.

You need the cache badly for the framebuffer and the textures.

added on the 2009-02-11 20:18:18 by torus torus
Quote:
Also, when it comes to sorting points, what works better? Clockwise (or counter-clockwise), or sorting the y's from smallest to largest? It seems both approaches change how you scan the edges.


Just sort by Y. Your triangles ought to enter rasterization only if they are clockwise (or counterclockwise) in the first place. It's best to do this test as early as possible, e.g. before clipping.

If you want to keep it simple you can do your clockwise/counterclockwise test as part of your UV gradient setup. Inside this setup you will sooner or later have to do a divide. The sign of the divisor will depend on the winding of the triangle.
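
A rough C sketch of that setup, for the affine (non-perspective) case only. It uses plain integer screen coordinates and made-up names (GradVertex, gradient_divisor, u_gradient_x) to keep the arithmetic readable; with fixed-point coordinates the scaling needs extra care. Which sign of the divisor means clockwise depends on whether y points up or down on your surface.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical vertex for this sketch: integer position, one attribute. */
typedef struct { int32_t x, y, u; } GradVertex;

/* Divisor shared by all gradients of the triangle. It equals twice the
 * signed area, so its sign flips with the winding and zero means the
 * triangle is degenerate. */
static int64_t gradient_divisor(const GradVertex *a, const GradVertex *b,
                                const GradVertex *c)
{
    return (int64_t)(b->x - a->x) * (c->y - a->y)
         - (int64_t)(c->x - a->x) * (b->y - a->y);
}

/* du/dx as 16.16, constant over the whole triangle in the affine case.
 * Returns 0 and leaves *dudx untouched for a degenerate triangle. */
static int u_gradient_x(const GradVertex *a, const GradVertex *b,
                        const GradVertex *c, int32_t *dudx)
{
    assert(dudx != 0);
    int64_t d = gradient_divisor(a, b, c);
    if (d == 0)
        return 0;
    int64_t n = (int64_t)(b->u - a->u) * (c->y - a->y)
              - (int64_t)(c->u - a->u) * (b->y - a->y);
    *dudx = (int32_t)((n * 65536) / d); /* scale to 16.16, then divide */
    return 1;
}
```

Swapping two vertices flips the sign of both the numerator and the divisor, so the gradient itself stays the same while the divisor alone tells you the winding.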


added on the 2009-02-11 20:24:53 by torus torus
Quote:
Does anyone else think that a software rasterizer in 2009 is anachronistic?

Intel doesn't, apparently.
added on the 2009-02-11 20:43:42 by ryg ryg
:-???

that is a GPU!
added on the 2009-02-11 21:02:45 by friol friol
Quote:

phred: you can sort 3 numbers with 3 comparisons. and you dont need any list while rasterizing a triangle, you can do it on the fly.


Exactly where I was headed with all this. I wanted to fully leverage the power of the ARM CPU using assembler and conditional instructions to avoid branches.

I was thinking of something of an algorithm that sorted v0 vs v1 and then v0 vs v2. Then it was just a matter of doing v1 vs v2. I wonder if that can be done in less than 3 comparisons....
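
Something like this sketch in C, with a cut-down hypothetical vertex (Vec2fx) and ascending y order as an assumption; flip the comparison for v0.y > v1.y > v2.y. It uses ordinary branches, so a conditional-instruction version in assembler would replace the if with predicated moves.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fx10; /* 22.10 */

/* Hypothetical minimal vertex, just enough to show the sort. */
typedef struct { fx10 x, y; } Vec2fx;

/* Swap the two vertices if the first one has the larger y. */
static inline void order_pair(Vec2fx *a, Vec2fx *b)
{
    if (a->y > b->y) {
        Vec2fx t = *a; *a = *b; *b = t;
    }
}

/* Sort three vertices by ascending y with exactly three comparisons:
 * v0 vs v1, v0 vs v2 (v0 is now the smallest), then v1 vs v2. */
static void sort_by_y(Vec2fx *v0, Vec2fx *v1, Vec2fx *v2)
{
    assert(v0 != 0 && v1 != 0 && v2 != 0);
    order_pair(v0, v1);
    order_pair(v0, v2);
    order_pair(v1, v2);
}
```

As for fewer than 3: three values can be arranged in six orders, and two yes/no comparisons can only tell four cases apart, so three is the worst-case minimum for a comparison-based sort.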
added on the 2009-02-11 21:55:51 by Phred Phred
Well yes, if the rasterization can be done without a list and without making a mess, then I should try fixing my old rasterization code. And read those txt tuts one day..
added on the 2009-02-11 22:29:39 by Optimus Optimus
As a matter of fact, I wrote such an engine for my job, for WM5/WM6 on ARM.
Have a look at OpenGL ES 1 for these systems: it's free, it has optional integer fixed-point formats for vertices, and there are just one or two DLLs to get. Then if you configure a quiet OpenGL rendering it should be fine.

There is also D3D Mobile, actually installed by default on all WM5/WM6 with a software implementation, but this default version is very poor (no way to manage Z buffers)
added on the 2009-02-11 23:07:53 by krabob krabob
Just wanted to say how happy I am that people are still writing triangle rasterizers. Keep the legacy alive! And what almost everyone else said: do what you can on the fly.

friol: it's a massively parallel cpu, the rest is packaging.
added on the 2009-02-11 23:20:47 by forty forty
Quote:

As a matter of fact, I wrote such an engine for my job, for WM5/WM6 on ARM.
Have a look at OpenGL ES 1 for these systems: it's free, it has optional integer fixed-point formats for vertices, and there are just one or two DLLs to get. Then if you configure a quiet OpenGL rendering it should be fine.


Ya I had looked at OpenGL ES a little but only for hardware accelerated devices. The software implementations I've tried are slow as shit!

Quote:

There is also D3D Mobile, actually installed by default on all WM5/WM6 with a software implementation, but this default version is very poor (no way to manage Z buffers)


D3D would have been awesome IF everyone had done their part and properly implemented drivers. A lot of these devices have hardware that helps D3D quite a bit, but for some reason drivers don't get written. There are a few with hardware D3D and it's ok. The software D3D is just shit!
added on the 2009-02-11 23:24:41 by Phred Phred
Does anyone here actually work at ARM Holdings or any of its affiliates?
@KammutierSpule: Kusma works for ARM. He also happens to have written a bloody fast triangle rasterizer that he used on the GBA.

@Phred: Ditch the phone and switch to GBA plzkkthxbaibai :)
I work for NEC on ARM11 MPCore (co-designed by ARM and NEC) software. Trying out how far one could push triangle rasterization on a 4-core ARM system has been on my wishlist for several months, but unfortunately I haven't had time for it.
I worked on a parallelized realtime 6DOF terrain raycaster with some help of torus (thanks mate!) though, and results were nice..
But then again, decent devices (e.g. iPhone) nowadays have awesome Imagination SGX 3d hw, so what the heck..
added on the 2009-02-12 09:07:16 by arm1n arm1n
