## Fast software rasteriser in JavaScript?

**category:**code [glöplog]

I wanted to go all integer and add-/shift-only, no multiplies. Uses more regs, but ARM (GBA) has "plenty". That's what I did a while back here:

That gave me 4 divs and 8 multiplies per block, everything else were adds and shifts.

**Code:**

```
calculate delta for u/z per x and y (duzdx and duzdy)
set up start value uz0 for minx/miny
step through blocks
interpolate u/z
if (semi-)filled block
calculate z and then u for 4 corner points
(tri-)linerarly interpolate u over block
```

That gave me 4 divs and 8 multiplies per block, everything else were adds and shifts.

I obviously need to interpolate 1/z too...

ARM32 really doesn't have enough regs to be cavalier about it for this kind of code... 16 GPRs, minus PC and probably the SP too. The reset we can definitely spill, but still, that leaves 14, and you probably lose another 2 for loop counters (unless you fully unroll a block) and temps. For each interpolated attribute that we want to leave fully in regs, we need 3 registers (current value, x-step, y-step) to a first-order approximation. If you naively take the 3 edge equations plus u/w, v/w, 1/w, that's already 2 too many for the partial blocks (the full blocks don't need edge eqs).

With a bit of spilling (specifically, spilling the y-steps), you can get 2 registers per attribute at some cost. You can also use that for the three half-edge functions, c1+c2+c3=const., so you can rewrite "c3 >= 0" as "C - (c1+c2) >= 0" <=> " C >= c1 + c2". Some extra work per pixel but it means c3 only needs one reg "permanently" (holding C) and one temp.

You can break out the old SW rendering bag of tricks and start packing multiple values into registers and effectively do SIMD with regular integer ops. That's a fine technique to polish a good loop but I wouldn't recommend going down that route until you're sure you got the basic algorithm down, as it makes further changes painful.

You can use a 2DH bary rasterizer, using barycentric u/w, v/w, 1/w - which correspond to c1, c2 and c1+c2+c3. This makes stepping cheaper and frees up lots of regs, but recovering the interpolated texture coords per pixel now requires 2 mul-adds. Worth considering if you have fast mul-add, not so great otherwise (i.e. not something you'd want on ARM, generally).

What I described in my prev post is just evaluating 1/w *once*, at the middle of the block, and pretending it's constant. One divide and a couple of muls get you all you need to do interpolation setup from that. This is nice and simple but the UVs at the block corners generally won't match. How bad that is depends on how steep the tri is. Worth trying for 4x4, I'd suspect it looks pretty crappy for 8x8.

If you take 1/w at the corners, you still shouldn't compute the 4 corners for every block - adjacent blocks share corners (after all, that's the whole reason for doing this in the first place - you want the corners to match). If your main block sweep is left-right, it's easy to get down to only computing 2 corner values (thus 2 divs) per block, and recycling 2 values from the previous block (only computing them from scratch if you step down a block row). You can get down to only computing 1 value in most blocks if you also cache the corner values for the previous block row so you don't need to recompute them next time. This gets messy though since there's no guarantee there was a block right above the current one in the previous block row, so you need to keep track of which of the cached corner values are actually valid (it's always a single interval, but still... ugh).

Finally, some pedantry: Affine interpolation is linear interpolation. Not bilinear, certainly not trilinear. If you compute 4 arbitrary values at the block corners and interpolate between them, you'll generally end up with bilinear, but still not trilinear. :)

With a bit of spilling (specifically, spilling the y-steps), you can get 2 registers per attribute at some cost. You can also use that for the three half-edge functions, c1+c2+c3=const., so you can rewrite "c3 >= 0" as "C - (c1+c2) >= 0" <=> " C >= c1 + c2". Some extra work per pixel but it means c3 only needs one reg "permanently" (holding C) and one temp.

You can break out the old SW rendering bag of tricks and start packing multiple values into registers and effectively do SIMD with regular integer ops. That's a fine technique to polish a good loop but I wouldn't recommend going down that route until you're sure you got the basic algorithm down, as it makes further changes painful.

You can use a 2DH bary rasterizer, using barycentric u/w, v/w, 1/w - which correspond to c1, c2 and c1+c2+c3. This makes stepping cheaper and frees up lots of regs, but recovering the interpolated texture coords per pixel now requires 2 mul-adds. Worth considering if you have fast mul-add, not so great otherwise (i.e. not something you'd want on ARM, generally).

What I described in my prev post is just evaluating 1/w *once*, at the middle of the block, and pretending it's constant. One divide and a couple of muls get you all you need to do interpolation setup from that. This is nice and simple but the UVs at the block corners generally won't match. How bad that is depends on how steep the tri is. Worth trying for 4x4, I'd suspect it looks pretty crappy for 8x8.

If you take 1/w at the corners, you still shouldn't compute the 4 corners for every block - adjacent blocks share corners (after all, that's the whole reason for doing this in the first place - you want the corners to match). If your main block sweep is left-right, it's easy to get down to only computing 2 corner values (thus 2 divs) per block, and recycling 2 values from the previous block (only computing them from scratch if you step down a block row). You can get down to only computing 1 value in most blocks if you also cache the corner values for the previous block row so you don't need to recompute them next time. This gets messy though since there's no guarantee there was a block right above the current one in the previous block row, so you need to keep track of which of the cached corner values are actually valid (it's always a single interval, but still... ugh).

Finally, some pedantry: Affine interpolation is linear interpolation. Not bilinear, certainly not trilinear. If you compute 4 arbitrary values at the block corners and interpolate between them, you'll generally end up with bilinear, but still not trilinear. :)

oops. yes, bilinear... :)

I wouldn't mind having most of the y-values in memory. Anyway 14 regs better than what you get with x86.

Yeah. That was definitely on my mind, because it saves a lot of muls and divs, but you're right that it gets messy... Maybe keep an array in memory that you fill on the fly while you step through blocks.

I wouldn't mind having most of the y-values in memory. Anyway 14 regs better than what you get with x86.

**Quote:**

If you take 1/w at the corners, you still shouldn't compute the 4 corners for every block - adjacent blocks share corners (after all, that's the whole reason for doing this in the first place - you want the corners to match). If your main block sweep is left-right, it's easy to get down to only computing 2 corner values (thus 2 divs) per block, and recycling 2 values from the previous block (only computing them from scratch if you step down a block row). You can get down to only computing 1 value in most blocks if you also cache the corner values for the previous block row so you don't need to recompute them next time. This gets messy though since there's no guarantee there was a block right above the current one in the previous block row, so you need to keep track of which of the cached corner values are actually valid (it's always a single interval, but still... ugh).

Yeah. That was definitely on my mind, because it saves a lot of muls and divs, but you're right that it gets messy... Maybe keep an array in memory that you fill on the fly while you step through blocks.

If you scan top-down, all you need is 2 block rows worth of values ("current" and "prev" row, switch pointers every row). Storing this is not the problem, this kind of case is: (numbers=corners, # = partially/fully covered blocks)

While in the top block row, you didn't calculate 1/w at either corner 1 or 4 (because the block wasn't covered). By the time you render the bottom-left block, you have 5 and 8 (from the adjacent block to the left) and you already know you need to calculate 7, but you also need to calculate 4.

It's not hard to fix, but the extra bookkeeping, testing and branching is seriously annoying.

**Code:**

```
1---2---3
| | # |
4---5---6
| # | # |
7---8---9
```

While in the top block row, you didn't calculate 1/w at either corner 1 or 4 (because the block wasn't covered). By the time you render the bottom-left block, you have 5 and 8 (from the adjacent block to the left) and you already know you need to calculate 7, but you also need to calculate 4.

It's not hard to fix, but the extra bookkeeping, testing and branching is seriously annoying.

And the smaller the triangle the less this makes sense.

Have you tried adaptive subdivision?

Calc the UV of the barycenter affine and perpective corrected.

Deviate > N (like 1texel), subdivide (compute 3 edge, that split 1 triangle into 3).

Have you tried adaptive subdivision?

Calc the UV of the barycenter affine and perpective corrected.

Deviate > N (like 1texel), subdivide (compute 3 edge, that split 1 triangle into 3).

T21: We were talking about speed optimization. If you already have blocks it makes sense to use them imo. I used 8x8 blocks, 32bit fixed-point u/z, duzdx, duzdy values and 32bit arithmetic and in a 1024x1024 window everything was just fine (read: like the float renderer) after some fiddling with precision.

ryg: Swapping pointers sounds good, but, yeah, logic when to calculate the first row and column messes the whole thing up, I guess... :(

ryg: Swapping pointers sounds good, but, yeah, logic when to calculate the first row and column messes the whole thing up, I guess... :(