4k intro, C vs assembler?

category: code [glöplog]
yzi, yes, I am aware of that side of the coin too. However, it is "the amount of nontrivial information" that you really need to increase, not the uncompressed size. However good the generic compressor is, a generic representation of your nontrivial information won't beat manual hand-tuning.

Just for the record, I am not saying compressors are bad. I love my compressors. I am just saying that they are suboptimal, for obvious reasons.
added on the 2014-02-04 22:25:29 by introspec introspec
gasman, I'm actually loving this discussion, because I am experimenting with both approaches. I just cannot help but think that trying to think like a compressor is a slightly unnatural practice :) which does not make it any less potentially viable, of course!

yzi, run-time code generation is a very common approach in the 8-bit world. It is not at all unusual to have compact executables that generate kilobytes of specialised unrolled code. This would correspond to huge compression ratios for code that is otherwise not very compressible. Compressors are simply not good enough to understand what you are doing at that level.
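To make the idea concrete, here is a hypothetical C sketch of the principle (a real generator would of course itself be a handful of bytes of Z80): a tiny loop emits a long unrolled routine at run time, so the executable stays small while the "decompressed" code is kilobytes.

Code:
/* Emit an unrolled "fill memory" routine into buf at run time.
   Z80 opcodes: 0x77 = LD (HL),A; 0x23 = INC HL; 0xC9 = RET. */
#include <stdint.h>

unsigned emit_unrolled_fill(uint8_t *buf, unsigned n)
{
    unsigned len = 0;
    for (unsigned i = 0; i < n; i++) {
        buf[len++] = 0x77;  /* LD (HL),A */
        buf[len++] = 0x23;  /* INC HL */
    }
    buf[len++] = 0xC9;      /* RET */
    return len;             /* ~2n+1 bytes of code from a tiny loop */
}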
added on the 2014-02-04 22:32:32 by introspec introspec
Well yeah, I know what you're saying. My experience with 1k and Crinkler is that basically anything in the code can affect anything else, and even though there are some general rules of thumb as to what usually compresses well, quite often it's worth trying everything just in case. The optimal way would be to have a brute-force algorithm generate all possible codings for the idea, and then select the one that happens to compress best. And even better, wrap that in another algorithm that tries out all possible ideas. ;) Because... these things are pretty abstract art anyway, so who cares about the details in the result as long as it gives a nice feeling. But luckily, no such algorithm can currently be made, so you just have to trust your intuition. Which is why it's fun.
added on the 2014-02-04 22:42:02 by yzi yzi
introspec: for the sake of argument, let's say you have a fully optimal code segment with commutative operations - the uncompressed code is as small as it can get. however, since the order of the operations doesn't matter, there's a possibility that there are permutations of that particular code segment where a context-modelling compressor like crinkler can pick up on patterns that can reduce the code size based on the data before or after it. i've done that in 4k's myself where simply re-ordering some of the operations has won me space even though the uncompressed code size didn't change.
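a toy illustration of the point in C (hypothetical names; the effect is of course only visible in the assembled bytes, where the two versions produce different streams for the packer's context models to chew on):

Code:
/* both orders give the same result, but they assemble to different
   byte streams, so one may compress noticeably better than the other */
void step_a(float *x, float *y, float *z, float dx, float dy, float dz)
{
    *x += dx; *y += dy; *z += dz;
}

void step_b(float *x, float *y, float *z, float dx, float dy, float dz)
{
    *z += dz; *x += dx; *y += dy;  /* same result, different bytes */
}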
added on the 2014-02-04 22:43:53 by Gargaj Gargaj
OK, let's set a few things straight here...

Quote:
the very fact that you can compress your assembly by a compressor is telling you something about how size (in)efficient your code is.

It is telling you how much redundancy there is in your data. Every structured encoding of information contains redundancy, and typical instruction encodings contain quite a lot. If it didn't have any redundancy, it would mean that there was only one possible way to encode a particular program. That would be quite a weird encoding.

Quote:
My real experience with size codes is assembly on 8-bit platforms; and there I would normally expect that a program that was truly well optimized for size cannot actually be compressed any further.

It might be that you can create a stream of instructions that cannot be compressed (very much), but that doesn't mean that you couldn't create a different, equivalent stream of instructions which would compress to less than your "minimal" stream.

yzi gave an example with the pushes. Another one that I encounter often is with small loops that run for just a few iterations. It is often smaller to unroll the loop and let the compressor squeeze it than to suffer the overhead of the loop logic. Similarly, it is often smaller to inline small functions, even if called several times, than to have all the redundant call instructions.
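A sketch of that trade-off in C (illustrative only; in practice the effect plays out at the instruction level, and whether it pays off depends on the surrounding code):

Code:
/* The loop is smaller uncompressed; the unrolled version is larger
   but consists of near-identical patterns, which compress to almost
   nothing - so after compression it can win. */
void clear_looped(unsigned char *buf)
{
    for (int i = 0; i < 4; i++)   /* counter + compare + branch overhead */
        buf[i] = 0;
}

void clear_unrolled(unsigned char *buf)
{
    buf[0] = 0; buf[1] = 0;       /* four repeats of the same pattern */
    buf[2] = 0; buf[3] = 0;
}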

Generally, you could say that the (de)compressor has a single, general mechanism for reducing redundancy, implemented once, whereas if you do it manually, you have to explicitly implement it every time, and you are confined to the limits of the instruction encoding.

Quote:
Compressors are pretty stupid normally, you should always be able to beat one.

I think you are a bit out of touch here. State-of-the-art compressors today are actually very smart. Otherwise they wouldn't spend several minutes compressing a few kilobytes on a multi-GHz machine.

Quote:
Compressing by one of the standard compressors should literally increase the size of the program, even if decompresser length is not taken into the account.

Depends on what you mean by "standard compressors". Zip? It's 25 years old. Compression technology has moved on.

Yes, that's a strawman argument, I know. Sorry. ;)
added on the 2014-02-04 22:55:43 by Blueberry Blueberry
Quote:
the very fact that you can compress your assembly by a compressor is telling you something about how size (in)efficient your code is.

My experience with writing 8-bit ASM is very limited, but are you saying that it's possible to write code on such platforms whose entropy is close to minimal? I could understand it if it were a simple compression method like RLE, but I find this quite hard to believe, given things like the tables you need on limited platforms and the amount of repetition that any code has. It's impossible to avoid repetition when doing anything sensible.
added on the 2014-02-04 22:57:49 by Preacher Preacher
I am just saying a very simple thing: your code is only one part of the equation; the decompressor is the other. It is their tandem that needs to be optimised. What are code generators, if not decompressors hand-tuned for a particular task? Not having a generic decompressor, however good it is, as part of your code does not automatically imply that your code is less efficient.

I would be perfectly happy to accept that one also needs to take into account workflow and the man-hours that go into these prods nowadays. It may well be that using generic compressors is the best way to address this kind of concern. I am just surprised that thinking about hand-tuning these things is such a radical idea these days :)

Gargaj, I do not really understand what your example shows. If I am thinking about a code (or data) generator, I can easily imagine cases where a specific order in the resulting code can be achieved by a shorter generator. Sometimes one moves data around and modifies the generator completely, so that a few extra bytes can be squeezed from the result.

Blueberry, I usually test the redundancy of my code by running it through one of the modern compressors. These days I tend to use rar (I know it is not the top one, but it tends to be enough to see how well you are doing). The thing is, I can very easily generate practical examples of unrolled code that will not be compressed by any of them. It is a bit like procedural textures. No-one seems to try to develop extreme compressors for textures, because the idea seems absurd. Well, some code is not that different from procedural textures, actually.

Preacher, I would be very careful with the notion of zero entropy, especially given how computationally limited these platforms are :) I just think that the instruction sets are smaller on smaller platforms, so their combinatorial complexity is lower and easier to handle. It is actually nice that you mentioned tables. People spend ages trying to squeeze extra bytes out of commonly used things like sine tables. A typical routine to generate an exact 256-byte table of 8-bit sine values (one full period) is about 60 bytes long. If small approximation errors are allowed, one can go below 50 bytes. Would you be able to find a generic compressor that would give you this kind of compression?
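For reference, this is what such a routine computes, written as a C sketch (the Z80 original naturally uses integer tricks rather than libm):

Code:
/* One full period of sine: 256 entries, 8-bit unsigned, midpoint 128. */
#include <math.h>
#include <stdint.h>

void make_sine_table(uint8_t table[256])
{
    for (int i = 0; i < 256; i++)
        table[i] = (uint8_t)(128.0 + 127.5 * sin(i * 6.283185307179586 / 256.0));
}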
added on the 2014-02-04 23:28:14 by introspec introspec
gasman, but seriously, given how much more simplistic the 8-bit decompressors are, do you honestly think that re-writing "Haiku" in a more compressor-friendly style would work just as well as your original code did? I do not mean it definitely won't; I just know from experience the kind of optimizations one ends up using in this sort of code, and cannot possibly imagine the compressor coming up with something on the same scale of craziness :)
added on the 2014-02-05 00:01:28 by introspec introspec
tl;dr
go make an intro about it.
added on the 2014-02-05 01:27:10 by las las
Interesting discussion.

Some retrospective:
It has been a long time since I started this thread, and obviously there is no apparent result for the "go make an intro about it".
My 4k aims were the tunnel effect in this demo.
For better or worse, the maze generation and polygonization are done by the CPU; the code turned out a little big for 4k, so I had to find a way to compress it - hence this thread.

In the end, I didn't have enough space for music, so I wrote this tiny GPU synth. But I couldn't find any composer to use the synth. At some point I asked gloom for help with music; he was willing to help as long as it was a standard mp3 song. After that, efforts were directed into the demo and the entire 4k concept was pretty much thrown away...
added on the 2014-02-05 08:35:32 by TLM TLM
introspec: For the case of generated code, I agree with you completely. No matter how smart the compressor is, there will always be examples of systematic code where it cannot detect the pattern. I use code generation (and specialized code transformation) a lot myself.

It's a tricky balance, though. If the generation is sufficiently primitive (e.g. just copying things around) the compressor can often do a better job. For instance, when I convert music for my 4ks, I unfold all patterns into one long stream of notes per track, because the overhead compared to just storing the patterns (in the context of Crinkler) is smaller than the code for unfolding it would be.
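A sketch of that unfolding step, with a hypothetical data layout (an order list indexing fixed-length patterns; the real converter's format will differ):

Code:
/* Write one flat stream of notes per track at conversion time and
   store only that; the packer then finds the repetitions itself. */
#include <stdint.h>

#define PATTERN_LEN 32

unsigned unfold_track(const uint8_t *order, unsigned positions,
                      const uint8_t patterns[][PATTERN_LEN], uint8_t *out)
{
    unsigned n = 0;
    for (unsigned p = 0; p < positions; p++)
        for (unsigned i = 0; i < PATTERN_LEN; i++)
            out[n++] = patterns[order[p]][i];  /* flat note stream */
    return n;
}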

Sometimes a middle ground is optimal. The classic example is delta coding. For your sine table, my guess is that a delta-encoded quarter of the table along with code for decoding and mirroring it could be smaller than generating it from scratch (depending on what you are compressing it together with).
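A sketch of that middle ground, assuming 8-bit samples and a 64-entry delta-coded quarter wave (the exact mirroring convention depends on how you define the table):

Code:
/* Store 64 deltas of the first quarter; integrate, mirror, reflect. */
#include <stdint.h>

void decode_sine(const int8_t delta[64], uint8_t table[256])
{
    uint8_t v = 128;                      /* sine starts at the midpoint */
    for (int i = 0; i < 64; i++) {
        v += delta[i];                    /* first quarter: sum the deltas */
        table[i] = v;
    }
    for (int i = 0; i < 64; i++)
        table[64 + i] = table[63 - i];    /* second quarter: mirror */
    for (int i = 0; i < 128; i++)
        table[128 + i] = 255 - table[i];  /* second half: reflect */
}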

You have a good point concerning the computational power of the 8-bit platforms: Good compression is quite computationally intensive, so for these platforms, you would indeed often be better off not compressing.
added on the 2014-02-05 08:42:44 by Blueberry Blueberry
Blueberry, I fully agree with the view that there needs to be a balance; I was trying to restore it somewhat here. Also, I had fallen behind regarding modern compression techniques, so reading about Crinkler and then experimenting with PAQ8 was definitely very useful.

Regarding the sine tables, it is interesting. The best "exact" solution for Z80 that I know about is 63 bytes long and uses a 16-byte table of deltas of deltas, and then explicitly uses mirroring and reflecting. The "approximate" solutions work more like fixed-point ODE solvers; they are usually shorter by 15-20 bytes. The shortest solutions on the ZX Spectrum tend to use the floating-point calculator in ROM (e.g. Gasman's sine-table generator in Haiku is 38 bytes long).
added on the 2014-02-05 16:00:54 by introspec introspec
introspec, forget about Z80-ish routines. This is x86, and sine generation looks shorter there.
added on the 2014-02-05 17:05:26 by g0blinish g0blinish
Just bumping this topic to add my personal experience.

I was trying to save some bytes in a very small intro coded in C and tried many options to get the smallest compressed code possible using Crinkler. I tested every part of the code with different versions to see which gave the smallest size after compression. I moved every section of code around, and every piece of data; I tested every possible combination until it gave the best compressed size. It is surprising how much you can save just by moving everything around to find the order that makes the compressor happy, although this becomes harder with more code and data. For a 1k it's not a problem to do by hand; for a 4k it would start to take a lot of time; any bigger and you would need a lot of late nights.

After all this I believe I have the smallest possible size to do what I want in C: setting up an OpenGL window, compiling a shader, hiding the cursor, a floating-point timer sent to the shader, some basic sound output, and escape-key and timer-based exit; the compressed size of the code alone was about 194 bytes. But could I save any bytes by rewriting in assembly? At first I tried to inline some assembly to replace small parts of the C code, but it always ended up compressing to the same size, or often larger. To really see any difference, the whole thing needed to be coded from scratch in assembly.

Now I am very rusty with assembly, since my last experience was on Amiga 20 years ago, so my first step was simply to take the asm output from Visual Studio and start from that. Exactly like yzi said earlier in this thread. I was surprised at how good some of the optimisations from the C compiler already were. It would hold some known data in a register for the entire length of the code to save for later use; it even pre-pushed some data in blocks for later calls. It was obviously optimising over the entire code, not just individual segments, which was quite impressive; with longer and more complex code this would obviously not always be the case. It looked like it might be hard work to get any smaller.

OK, so the C compiler did a great job, but it was making the smallest possible uncompressed code. For example, it would xor a register and then push that everywhere, instead of pushing 0, which is the smallest way to do it without compression. But I found that replacing all those register pushes with push 0 actually compressed smaller, even though the uncompressed code is double the size. I also found some other small savings by reusing registers; this has to be done by hand, because you have to know exactly what values are being stored and whether they are values you want to use again later. Some savings were also made with conditional branch checks and the way some data was being handled. The C compiler uses generic and flexible ways to handle variables and conditional checks, because it does not know exactly what sort of values might occur and how you want to use them. Once you start working with floating point, the asm code gets really large; converting from integers to floats is very ugly. Sometimes it is also smaller not to bother with the stack at all.
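To illustrate, here are the two encodings as raw x86 bytes, written out as C arrays purely for illustration:

Code:
/* Smallest uncompressed: clear a register once, then one-byte pushes. */
unsigned char v1[] = { 0x33, 0xC0,              /* xor eax,eax */
                       0x50, 0x50, 0x50 };      /* push eax (x3): 5 bytes */

/* Smallest compressed: twice the size, but one repeating 2-byte
   pattern, which the context modeller all but eliminates. */
unsigned char v2[] = { 0x6A, 0x00,              /* push 0 */
                       0x6A, 0x00,
                       0x6A, 0x00 };            /* 6 bytes, highly regular */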

Final compressed code size in assembly was 170 bytes, down from the original 194 bytes in C, and I think there could possibly be a couple more bytes saved somewhere. 24 bytes sounds small, but compared to the total code size it is over 12% smaller. With more code the relative saving could be much higher, but it would also take a lot of hard work.
added on the 2014-08-21 10:46:38 by drift drift
Ok, I have a question.

Struggling with my 1k (because I use IQ's framework, and I don't know what happens behind the scenes), I think a simple hello world, just loading an empty shader and little else, was near 700 bytes iirc.

So, how did you get 194 bytes? Where can I learn what goes on behind the scenes, and how to achieve the smallest possible size in terms of how Visual Studio compiles? Also, is it an EXE masqueraded as a COM?

Because I was puzzled: there was almost no space left if you just used the framework, yet the good 1ks at Assembly had fades in and out, real music, camera changes, and I wondered how they fit.

So, is there some technical information on that?
added on the 2014-08-21 10:57:32 by Optimus Optimus
Being no coder, I ended up with 1.1kb without a shader.
I suspect the trick is in the MSVC compiler flags.
Anyone care to comment on this?
added on the 2014-08-21 11:23:24 by numtek numtek
OK, that's because you, and I as well, are using Crinkler, which automatically adds around 400 bytes (not sure exactly how much). There is the PE header, the depacker, and then the dll import code.

It has already been mentioned that Crinkler is overkill for a 1k, but for people like myself who have no experience with writing a compressor it is the only option. Maybe one day I will look at making my own compressor for 1k, but it would take a lot of time.

So yes, what I have would be closer to 600 bytes with no data. I know my data section compressed is 360 bytes (mostly just shader code), and the main code was 194 bytes in C.

I think my OpenGL framework is very small; I spent some time optimising it. I was planning to release the C source, and maybe the assembly source, after I had released the intro, but I am waiting for a party that has a 1k compo, and there are not many. Maybe I will just put it in a 4k compo, but that doesn't seem fair.

Anyway I bet you can save some bytes somewhere, if you want to mail me at jeffsymons101@gmail.com I might be able to help you.
added on the 2014-08-21 11:25:08 by drift drift
TinyC versus Assembler (TASM or MASM) - what is best?
added on the 2014-08-21 11:27:56 by g0blinish g0blinish
Also, I don't know about DirectX; I haven't bothered to learn it yet. Some people say the code is smaller with DirectX 9, but I have no personal experience, so I can't comment.

Maybe for the sake of the community I can post some general advice on a minimal OpenGL setup here. Right now I have to spend some family time; I will be back in a few hours and can write more then.
added on the 2014-08-21 11:28:02 by drift drift
Hah, of course, Crinkler. I am curious now whether the size will be smaller if I throw away Crinkler.
added on the 2014-08-21 11:46:53 by Optimus Optimus
The absolute minimum size you can get to without header/section/import tricks is 1.5k. This is a 512-byte header, a 512-byte code section, and a 512-byte rdata section (which needs to be there for the import table, but you can use the rest of it for your own data).

Probably the least-effort trick to get it down to 1k would be to place the import table in the code section. Still, this gives you 512 bytes of uncompressed code + data + import table, whereas Crinkler gives you around 600 bytes of compressed ditto. ;)
added on the 2014-08-21 12:11:31 by Blueberry Blueberry
Maybe just post your basic opengl framework code here? :)

Standard OpenGL setup/mainloop for one fullscreen quad shader is pretty compact to write.
Already using glCreateShaderProgramv?
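For reference, a minimal sketch, assuming a current GL context and a driver that exposes glCreateShaderProgramv (GL 4.1 / ARB_separate_shader_objects); the helper name is made up:

Code:
/* glCreateShaderProgramv replaces the whole glCreateShader /
   glShaderSource / glCompileShader / glAttachShader / glLinkProgram
   dance with a single call. */
#include <windows.h>
#include <GL/gl.h>

#define GL_FRAGMENT_SHADER 0x8B30

typedef GLuint (APIENTRY *PFNCSPV)(GLenum, GLsizei, const char **);

GLuint make_program(const char *src)   /* hypothetical helper name */
{
    PFNCSPV pCreateShaderProgramv =
        (PFNCSPV)wglGetProcAddress("glCreateShaderProgramv");
    return pCreateShaderProgramv(GL_FRAGMENT_SHADER, 1, &src);
}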
added on the 2014-08-21 12:18:53 by las las
Hey, I'm back. Give me a few minutes and I will write something up.

Sorry Optimus, but as Blueberry says, you need some sort of tricks to do anything in a Windows exe under 1k.

The first option is the public release of Crinkler, which is not optimal for 1k but is better than nothing. I am actually pleased with what I have been able to do so far under 1k using Crinkler, and there were some good 1k releases recently that use it too.

Or you can try Hitchikers' 1k pack; I personally have not looked at it, so I'm not sure whether it still works.

The other option is to make your own 1k compressor. For me, having no idea about compression methods or PE headers, it would take a long time to develop something better than Crinkler. Maybe one day, when I have some motivation and time, I will try to learn how to do it.

As Las says, there is not much to setting up a basic OpenGL framework. I am not claiming to be an expert, but I will share what I know. I was going to release my source with my 1k, but as I said, I have been waiting for a 1k compo. Then again, I can't complain about there not being enough 1k compos if there are not enough people making 1k intros, so maybe if I share some info now, more people can make new and better 1ks and we might see more competitions. Just give me some time to type it up.
added on the 2014-08-21 13:52:34 by drift drift
A matter of taste: basically 90% of the 1ks end up being a heavily crunched single fragment shader + some strange ambient music - both of which I find rather boring.

Nevertheless, all these size-crunching tricks are interesting for 4k/8k, for which I personally prefer using nasm.

Maybe we should try to specify some kind of minimal 1k framework. I would suggest it needs:

Code:
Window creation
Fullscreen mode switch
Hide mouse cursor
OpenGL setup
Setup a single fragment shader
Setup audio with zeroed audiobuffer // You will need something to fill that one up
Start playing audio
mainloop (EXIT ON ESC, EXIT ON TIME) {
    Fetch and pass audio position to the shader
    Render
    Prevent bad things using peekmessage
    Swap
}
Exit
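A rough C sketch of that skeleton, under the usual assumptions (the "static" window-class trick, a mostly zeroed pixel format descriptor as the usual size hack, no error handling; the audio part is omitted where marked, and the entry-point name is made up):

Code:
#include <windows.h>
#include <GL/gl.h>

/* zeroed nSize/nVersion is the usual size hack; drivers accept it */
static PIXELFORMATDESCRIPTOR pfd =
    { 0, 0, PFD_DOUBLEBUFFER | PFD_SUPPORT_OPENGL | PFD_DRAW_TO_WINDOW,
      PFD_TYPE_RGBA, 32 };

void entry(void)   /* hypothetical entry point */
{
    /* a borderless maximized window doubles as the fullscreen switch */
    HDC dc = GetDC(CreateWindowA("static", 0,
        WS_POPUP | WS_VISIBLE | WS_MAXIMIZE, 0, 0, 0, 0, 0, 0, 0, 0));
    SetPixelFormat(dc, ChoosePixelFormat(dc, &pfd), &pfd);
    wglMakeCurrent(dc, wglCreateContext(dc));
    ShowCursor(FALSE);

    /* ... compile shader, set up audio with zeroed buffer here ... */

    MSG msg;
    do {
        PeekMessageA(&msg, 0, 0, 0, PM_REMOVE);  /* prevent bad things */
        /* fetch audio position / timer and pass it to the shader here */
        glRects(-1, -1, 1, 1);                   /* fullscreen quad */
        SwapBuffers(dc);
    } while (!GetAsyncKeyState(VK_ESCAPE));
    ExitProcess(0);
}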
added on the 2014-08-21 16:45:29 by las las
Quote:
Already using glCreateShaderProgramv?

I tried using this to compile shaders, but separable shader programs didn't seem to be compatible with fixed-function features like gl_Color. Is there some trick I'm missing here? Can I use the created program object without setting up a full GL_PROGRAM_SEPARABLE-compatible pipeline?
added on the 2014-08-21 21:57:24 by cce cce