pouët.net

Fast software rasteriser in JavaScript?

category: code [glöplog]
Most of the WebGL-less renderers in JavaScript tend to rely on the Canvas API for drawing polygons. This translates to something like this:

Code:
context.fillStyle = "#ff0000";
context.beginPath();
context.moveTo( x0, y0 );
context.lineTo( x1, y1 );
context.lineTo( x2, y2 );
context.lineTo( x0, y0 );
context.fill();

This gives ok-ish performance.

http://dl.dropbox.com/u/7508542/three.js/software_renderer/disneyc.html

However, as you can see there is some z-fighting on the edges, and also spikes on the contour. This is because there is no way to disable antialiasing in the Canvas API, so expanding the polygons by 0.5 pixels in screen space is needed to hide these gaps:

http://dl.dropbox.com/u/7508542/three.js/software_renderer/triangles.html
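
One way to do such an expansion (a sketch only; not necessarily how three.js implements it, and `expandTriangle` is an illustrative name) is to push each vertex a fraction of a pixel away from the triangle's centroid so that adjacent fills overlap slightly:

```javascript
function expandTriangle( v0, v1, v2, amount ) {

	var cx = ( v0.x + v1.x + v2.x ) / 3;
	var cy = ( v0.y + v1.y + v2.y ) / 3;

	[ v0, v1, v2 ].forEach( function ( v ) {

		// move each vertex along the direction centroid -> vertex
		var dx = v.x - cx, dy = v.y - cy;
		var length = Math.sqrt( dx * dx + dy * dy ) || 1;
		v.x += ( dx / length ) * amount;
		v.y += ( dy / length ) * amount;

	} );

}
```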

Another limitation of this approach is that you can't easily do vertex colors or illumination effects other than flat shading.

Over the weekend I challenged myself to write a software renderer. It was primarily for learning purposes, but I ended up getting better performance:

http://dl.dropbox.com/u/7508542/three.js/software_renderer/disneys.html

So, rather than using the API I showed before, it now looks like this:

Code:
var canvas = document.createElement( 'canvas' );
canvas.width = 1024;
canvas.height = 1024;
document.body.appendChild( canvas );

var context = canvas.getContext( '2d' );
var imagedata = context.getImageData( 0, 0, canvas.width, canvas.height );
var data = imagedata.data;

With that you get the pixel buffer of the canvas. Once you have such a buffer, I'm sure you all know how the rest works.
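
For anyone who doesn't: a minimal sketch (names illustrative) of writing one RGBA pixel into a raw buffer with the same layout as `ImageData.data` - 4 bytes per pixel, row-major:

```javascript
var width = 4, height = 4;
var pixels = new Uint8ClampedArray( width * height * 4 );

function drawPixel( x, y, r, g, b ) {

	var offset = ( x + y * width ) * 4;
	pixels[ offset ] = r;
	pixels[ offset + 1 ] = g;
	pixels[ offset + 2 ] = b;
	pixels[ offset + 3 ] = 255; // opaque

}

drawPixel( 1, 2, 255, 0, 0 ); // a red pixel at ( 1, 2 )
// in the browser you would then push the buffer back with
// context.putImageData( imagedata, 0, 0 );
```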

So far I've tried three approaches:

Scan line (info)

http://dl.dropbox.com/u/7508542/three.js/software_renderer/raster.html

Half space (info)

http://dl.dropbox.com/u/7508542/three.js/software_renderer/raster1.html
http://dl.dropbox.com/u/7508542/three.js/software_renderer/raster2.html
http://dl.dropbox.com/u/7508542/three.js/software_renderer/raster3.html

And barycentric (info)

http://dl.dropbox.com/u/7508542/three.js/software_renderer/raster4.html

....

The scan line one seems the fastest, although I'm aware that testing only big polygons is not realistic.

The half-space one seems to be the easiest to work with. However, most of the performance optimisations from that article turned out not to help in JavaScript (where every number is a float), so the Disney head demo basically uses the half-space code, but only half implemented.

The barycentric one also seems pretty simple, but my attempts to optimise it haven't had much success so far. Still, I see a lot of potential in this one.
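
For reference, the core idea shared by the half-space and barycentric variants can be sketched in a few lines (illustrative only, not the thread's actual code; it assumes one particular winding order, and a real rasteriser would add sub-pixel precision and incremental stepping):

```javascript
function rasterize( x0, y0, x1, y1, x2, y2, plot ) {

	var minX = Math.min( x0, x1, x2 ), maxX = Math.max( x0, x1, x2 );
	var minY = Math.min( y0, y1, y2 ), maxY = Math.max( y0, y1, y2 );

	for ( var y = minY; y <= maxY; y ++ ) {

		for ( var x = minX; x <= maxX; x ++ ) {

			// one edge function per edge; the pixel is inside when all agree
			var e0 = ( x1 - x0 ) * ( y - y0 ) - ( y1 - y0 ) * ( x - x0 );
			var e1 = ( x2 - x1 ) * ( y - y1 ) - ( y2 - y1 ) * ( x - x1 );
			var e2 = ( x0 - x2 ) * ( y - y2 ) - ( y0 - y2 ) * ( x - x2 );
			if ( e0 >= 0 && e1 >= 0 && e2 >= 0 ) plot( x, y );

		}

	}

}
```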

This is the drawTriangle routine used in the first Disney head test (using half space):

https://github.com/mrdoob/three.js/blob/dev/examples/js/renderers/SoftwareRenderer2.js#L176

I managed to add interpolants:

https://github.com/mrdoob/three.js/blob/dev/examples/js/renderers/SoftwareRenderer2.js#L270

But performance wasn't too good...

So, at this point I'm basically looking for tips and things to try that could increase performance and/or allow more features.

Thanks in advance!
added on the 2012-04-24 19:26:27 by mrdoob mrdoob
Don't implement Bresenham, it's really slow - tried that :\
A pity that -moz-crisp-edges can't be enforced in Chrome somehow; that would save us from jumping through burning rings..
added on the 2012-04-24 19:32:39 by mog mog
Actually, now that I know how to do interpolants, I can give it another go at optimising the barycentric approach.
added on the 2012-04-24 19:38:09 by mrdoob mrdoob
There might be a small improvement from making drawPixel inline and calculating offset incrementally in the main loop rather than as a multiplication for each pixel, but that's probably small potatoes.
added on the 2012-04-24 20:13:47 by spite spite
The "add interpolant" thingy is not worth it when the CPU can add floats in 1 clock cycle ;)
All JS engines have type inference and they will happily use unboxed floats everywhere they can. The only obvious optimization is to inline the drawPixel code and get rid of the offset = (x+y*canvasWidth)*4 crazy talk in favour of a simple offset += 4 and offset += offToTheNextScanline, which should all be type-inferred to integers.
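
The incremental-offset idea p01 describes might look like this (a sketch with made-up names, not code from the thread; the "drawPixel" body is inlined as a single alpha write):

```javascript
var width = 8; // buffer width in pixels

function fillRect( x0, y0, x1, y1, buffer ) {

	var offset = ( x0 + y0 * width ) * 4;       // computed once, not per pixel
	var rowStep = ( width - ( x1 - x0 ) ) * 4;  // jump to the next scanline start

	for ( var y = y0; y < y1; y ++ ) {

		for ( var x = x0; x < x1; x ++ ) {

			buffer[ offset + 3 ] = 255; // inlined "drawPixel": touch alpha only
			offset += 4;

		}

		offset += rowStep;

	}

}
```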

Nice job. And the software Q3 map viewer you posted on the Twitters was bloody cool!
added on the 2012-04-24 21:29:59 by p01 p01
half-space and barycentric are the same. say you have barycentric (u,v,w) coordinates. barycentric u is 1 at one of the triangle vertices, and 0 along the opposite edge - it's just a normalized signed distance from that edge.

in the opposite direction, take a half-space distance formula for any given edge. evaluate it at the vertex opposite that edge. that gives you the distance of that vertex from the edge. divide the components of your edge equation by that value, and you now have one normalized barycentric coordinate.
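
In code, the normalization described above might look like this (a sketch with illustrative names, not code from the thread):

```javascript
function edge( ax, ay, bx, by, px, py ) {

	// half-space edge function: 0 on the line a -> b, signed elsewhere
	return ( bx - ax ) * ( py - ay ) - ( by - ay ) * ( px - ax );

}

function barycentric( x0, y0, x1, y1, x2, y2, px, py ) {

	// divide each edge function by its value at the opposite vertex
	var u = edge( x1, y1, x2, y2, px, py ) / edge( x1, y1, x2, y2, x0, y0 );
	var v = edge( x2, y2, x0, y0, px, py ) / edge( x2, y2, x0, y0, x1, y1 );
	return [ u, v, 1 - u - v ]; // the coords sum to 1, so w comes for free

}
```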

don't ever interpolate more than three values over the triangle directly. if you want non-perspective interpolation from 2D screen-space rasterization, you can use the barycentric coordinates (or equivalently half-space distances) directly, i.e. interpolated_r = vert0_r*u + vert1_r*v + vert2_r*w, or alternatively interpolated_r = vert0_r + (vert1_r-vert0_r)*u + (vert2_r-vert0_r)*v (since barycentric coords always sum to 1, you can just compute w as 1-u-v). so that's two values to step (or just re-compute from u, v every pixel).
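
The second formula above, as a one-liner (illustrative sketch; here u weights vertex 1 and v weights vertex 2, with vertex 0 taking the remainder):

```javascript
function interpolate( a0, a1, a2, u, v ) {

	// equivalent to a0*(1-u-v) + a1*u + a2*v, with one multiply saved
	return a0 + ( a1 - a0 ) * u + ( a2 - a0 ) * v;

}
```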

for perspective correction, you just compute perspective-correct barycentric U and V (using upper-case here). these are barycentric coordinates for the clip-space not screen-space triangle. you can do this using either a homogeneous 2D rasterizer, or by linearly interpolating U/w, V/w and 1/w in screen-space. then per-pixel you can reconstruct U and V, after which you can handle an arbitrary number of perspective-correct interpolants (same formula as above, only now using U/V not u/v).
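
A sketch of that perspective-correct scheme along a single edge (illustrative names, not code from the thread): interpolate U/w and 1/w linearly in screen space, then divide per pixel.

```javascript
function lerp( a, b, t ) { return a + ( b - a ) * t; }

// two example vertices with a clip-space w and a coordinate U to interpolate
var U0 = 0, w0 = 1; // near vertex
var U1 = 1, w1 = 4; // far vertex

function perspectiveU( t ) {

	// step U/w and 1/w linearly in screen space...
	var Uw = lerp( U0 / w0, U1 / w1, t );
	var rw = lerp( 1 / w0, 1 / w1, t );

	// ...then recover the perspective-correct U with one divide per pixel
	return Uw / rw;

}
```

Note how the screen-space midpoint lands much closer to the near vertex's U, as perspective demands.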
added on the 2012-04-24 22:48:32 by ryg ryg
oh, and it's best to bias things so you don't test for >0, but for >=0! (just subtract 1 from everything)

that way, "if ( cx1 > 0 && cx2 > 0 && cx3 > 0)" can be turned into "if ((cx1 | cx2 | cx3) >= 0)" (among other things). and there's nice tricks to increase the range. damnit i guess i'll just spend some time tonight tinkering with it and seeing how far i can get :)
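
The sign trick works because JavaScript's bitwise OR operates on 32-bit integers, and any negative operand contributes a set sign bit:

```javascript
function allNonNegative( cx1, cx2, cx3 ) {

	// the sign bit of any negative 32-bit int survives the OR,
	// making the combined result negative
	return ( cx1 | cx2 | cx3 ) >= 0;

}
```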
added on the 2012-04-24 22:59:10 by ryg ryg
also don't evaluate edge equation at all 4 corners, you can simplify that a lot too by computing once which corner is the near/far corner for every edge. yeah, definitely gotta code that up tonight :)
added on the 2012-04-24 23:01:01 by ryg ryg
Quote:
Actually, now that I know how to do interpolants, I can give it another go at optimising the barycentric approach.

Got it to work.
http://dl.dropbox.com/u/7508542/three.js/software_renderer/raster41.html

But for whatever reason, the version where all the computations are done per pixel in the triangle is faster... (this one does the computations only for the 3 vertices).

This version already assumes:

Code:
var u1 = 0, v1 = 0;
var u2 = 0, v2 = 1;
var u3 = 1, v3 = 0;

http://dl.dropbox.com/u/7508542/three.js/software_renderer/raster42.html

Still, slower...
added on the 2012-04-25 01:26:37 by mrdoob mrdoob
Having tried that, time to read/understand p01 and ryg comments... :)

ryg: Looking forward to seeing what you get!
added on the 2012-04-25 01:27:19 by mrdoob mrdoob
Quote:
The "add interpolant" thingy is not worth it when CPU can add Floats in 1 clock cycle ;)

Indeed. I just removed that and it's 2x+ faster...
http://dl.dropbox.com/u/7508542/three.js/software_renderer/raster43.html
added on the 2012-04-25 02:10:00 by mrdoob mrdoob
Maybe you should change your metric to triangles/sec rather than fps in your benchmarks. Not that it changes how fast your program goes, but it would give a better indication with those random-size triangles.
Quote:
that way, "if ( cx1 > 0 && cx2 > 0 && cx3 > 0)" can be turned into "if ((cx1 | cx2 | cx3) >= 0)" (among other things).

That's so smart. At first I was like "No way, that's not right!", and then it struck me how a negative number in the second expression would set the sign bit just right. Nice :)
spent an hour and a half tinkering with it ("it" being the half-space variant). always bothered me that nick's code, which is really pretty half-assed, is the "standard" implementation of the idea. here you go:

https://gist.github.com/2486101

same as the original half-space rast code, i.e. just fills tris and colors pixels based on whether they were fully or partially covered blocks.

start with raster3_ryg4.html. note that i changed the number of tris drawn/frame from 100 to 1000 and added a millisecond timer. i also changed the meanings of variables a bit from the original to make the code look nicer :). here's the basic ideas that went into this:

  • as explained above, change the tests to be ">=0" not ">0", which allows everything to be framed in terms of integer sign tests. i just took the original edge setup code and subtract 1 at the end, no sweat. (this can be done a bit slicker by reworking the if condition - exercise to the reader).
  • flip the signs of dy* so that the terms are added not subtracted. (makes things more symmetric)
  • note that everything that happens per pixel advances the edge functions in multiples of 16 (the sub-pixel resolution). well, rather than multiplying the pixel-coordinate terms by 16, shift down the constant term by 4! (this works only because we changed the tests to be ">=0", i.e. check the sign bit). as a side effect we get rid of the whole "fdx"/"fdy" business.
  • the trivial accept/reject tests are completely reworked to do something sane (by far the sloppiest part of nick's original code). the original code computes the edge functions at all 4 corners of every block, only to do a few sign tests. this is bullshit. what we really need to know is whether at least one edge function is <0 over the full block (=> trivial reject), or whether all of them are >=0 over the full block (=> trivial accept). well, an edge function is <0 over the full block precisely if its maximum over the full block is <0, and it's >=0 over a full block precisely when its minimum over the full block is >=0. the terms are independent, so we can minimize/maximize over them individually, which in turn happens to only depend on the sign of the slope of the edge equations. the punchline is that it all boils down to a couple constant offsets (nmin/nmax) from the top-left corner edge values (which we need anyway to seed the hit-testing).
  • there's also p01i's manual inlining and the or-to-test-signs i mentioned earlier. both of these aren't huge but hey why not :)
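
The trivial accept/reject offsets from the list above can be sketched like this (an illustrative reconstruction of the idea, not the gist's code): for an edge function E = c + dx*x + dy*y over a q x q block, its min and max over the block are the top-left value plus constants that depend only on the signs of dx and dy.

```javascript
function blockOffsets( dx, dy, q ) {

	// maximize/minimize each term independently over the block extent
	var nmin = 0, nmax = 0;
	if ( dx > 0 ) nmax += dx * q; else nmin += dx * q;
	if ( dy > 0 ) nmax += dy * q; else nmin += dy * q;
	return { nmin: nmin, nmax: nmax };

}

// per block, with c = edge value at the top-left corner:
//   c + nmax <  0  -> edge negative over the whole block   (trivial reject)
//   c + nmin >= 0  -> edge non-negative over the whole block (trivial accept)
```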


for what it's worth, that code also pretty nicely illustrates how dedicated hardware rasterizers work. i do a few multiplies per block for the initial edge equation setup, but you could do that incrementally as well - at which point you're left with nothing but a few adds and sign tests per pixel. all the "brains" are in the triangle setup (which is still pretty simple).

raster3_ryg5.html adds some more icing to the cake. note that drawPixel will bail if the pixel has already been written in the current frame (alpha=255). well, we happen to sometimes write 8x8 pixels at a time. so for every 8x8 block, i keep a flag saying whether that block has been written as a solid block in the current frame. no matter what the edges say, if this block has been solidly filled, all the pixels in it are going to fail the test in drawPixel, so we needn't even bother determining exact coverage (again, for those keeping score, this is analogous to hierarchical z test in current GPUs). this kind of thing nicely falls out of doing this sort of hierarchical rasterizer and is tricky to do with low overhead for pure scan-line rasterizers.
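
The solid-block early-out boils down to very little state (a sketch with illustrative names; the actual gist code differs):

```javascript
var blocksPerRow = 1024 / 8; // one flag per 8x8 block of a 1024-wide canvas
var blockFilled = new Uint8Array( blocksPerRow * blocksPerRow );

function blockNeedsRasterizing( bx, by ) {

	// a solidly-filled block can be skipped outright
	return blockFilled[ bx + by * blocksPerRow ] === 0;

}

function markBlockSolid( bx, by ) {

	// call this whenever an 8x8 block is written as a full solid block
	blockFilled[ bx + by * blocksPerRow ] = 1;

}
```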

end result? original version did 100 tris/frame and ran at 15fps (reported) on my machine. raster3_ryg5 does 1000 tris/frame and runs at 58fps, so very roughly a 38-40x speedup. of course that's still purely flat shading without any interpolation :)
added on the 2012-04-25 06:25:33 by ryg ryg
actually, the 40x is deceptive, since after all most of the late tris are mostly culled (that's the point!). but it's plenty fast alright.

added raster3_ryg6.html that shows how to recover u/v from the edge equations (the stuff i tried to explain earlier), to get the same gouraud as in the barycentric code. still plenty fast. note that in this code, the u weight corresponds to the third vertex and v to the first.
added on the 2012-04-25 06:49:32 by ryg ryg
I really love this approach. It's another proof of how competent JS is for heavy processing these days.

Just a note: when getting down-and-dirty with JS (sub-)optimizations, please consider trying many different browsers throughout the optimization process, since all have slightly different strengths & weaknesses. Just making it fast on Chrome (for instance) is, well, ... (insert rant about the web being about platform independence etc). Especially given that some of the most interesting use cases for a SW rasterizer are running 3D on platforms that do not support WebGL (e.g. IE, many mobile devices, etc).
added on the 2012-04-25 09:41:02 by marcus256 marcus256
Well, that will keep me entertained for the rest of the day... ^^
Thanks for taking the challenge ryg!
added on the 2012-04-25 11:56:39 by mrdoob mrdoob
Quote:
1000 tris/frame and runs at 58fps

Quote:
how competent JS is for heavy processing

Am I missing something ?...
added on the 2012-04-25 11:58:00 by MsK` MsK`
Quote:
i keep a flag saying whether that block has been written as a solid block in the current frame. no matter what the edges say, if this block has been solidly filled, all the pixels in it are going to fail the test in drawPixel, so we needn't even bother determining exact coverage (again, for those keeping score, this is analogous to hierarchical z test in current GPUs).

Is there a good article explaining hierarchical z testing? I don't quite see how per-pixel z testing fits into this 8x8 block discarding.

added on the 2012-04-25 12:09:03 by mrdoob mrdoob
Oh wait. I guess what you would do is store the lowest z of the triangle in that block's buffer entry. And if the new triangle has a vertex with a higher z, you draw doing the z test per pixel... ?
added on the 2012-04-25 12:42:34 by mrdoob mrdoob
Yes, or you store min/max Z per block, so you can support all types of Z-tests (GREATER/LESS/EQUAL etc).
added on the 2012-04-25 14:04:39 by marcus256 marcus256
...not sure about actual implementations, but I guess you could make some sort of quad-tree structure from this (with min/max Z for each node in the tree) for fast culling. Also, you could probably calculate min/max Z of the triangle per block/square, instead of using the min/max for the entire triangle (not sure which would be more efficient).
added on the 2012-04-25 14:07:53 by marcus256 marcus256
yes, you store coarse resolution min/max Z, and also calculate min/max Z bounds per 8x8 block. (you can use a similar trick as for the edge equations). that again gives you trivial accept/reject status for Z test.
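
The three outcomes of that coarse Z comparison can be sketched as follows (an illustrative reconstruction assuming a LESS depth test, where smaller z is nearer; names are made up):

```javascript
function classifyBlockZ( triMinZ, triMaxZ, blockMinZ, blockMaxZ ) {

	if ( triMinZ >= blockMaxZ ) return 'reject'; // behind everything in the block
	if ( triMaxZ < blockMinZ ) return 'accept';  // in front of everything: skip the per-pixel test
	return 'test';                               // ranges overlap: per-pixel z needed

}
```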
added on the 2012-04-25 15:53:09 by ryg ryg
we really need a thread thumbing functionality on pouet.
this thread is top of the month :D
added on the 2012-04-25 15:59:47 by psenough psenough
Quote:
Am I missing something ?...


Javascript and in-browser frameworks are considerably less slow and useless than they used to be, especially in Chrome. But in absolute terms it's still very unimpressive what anyone has managed to do with it all. I suspect there's a reason for that.
added on the 2012-04-25 16:04:35 by doomdoom doomdoom
