Tiny Intro Toolbox Thread

category: code [glöplog]
I was imagining things. :) The one-write version is not any faster.

Is there something that can be assumed in general about which registers are preserved by DOS calls?
added on the 2019-04-17 12:34:04 by Blueberry Blueberry
@Blueberry: you could look here and here, not sure it really preserves the things which are *not* mentioned in the "return values" list
added on the 2019-04-17 12:45:05 by HellMood HellMood
On a real dos you can just trace into the interrupts and look if the service routines preserve explicitly (pusha, pushf, push *s), never tried that on a dosbox though ^^
added on the 2019-04-17 12:46:34 by HellMood HellMood
I was curious because evidently the WRITE call (int 21h, AH=40h) preserves BX. But this doesn't seem to be mentioned in any of those docs.
added on the 2019-04-17 13:04:25 by Blueberry Blueberry
@Blueberry: i debugged into int21h for AH = 3Ch (virtual box, msdos 6.22, debug.exe, "t" & "p") and from what i saw, it pushed/popped everything it changes inside, besides the return values. Of course i can't cover every case but there is also this : "If calls are made by the approved method the contents of all registers are preserved through calls". That's also what one would assume intuitively, right?
added on the 2019-04-17 19:34:13 by HellMood HellMood
Also, German source (translated): "In general, register contents do not need to be saved before calling INT 21h functions, because they are preserved. This of course does not apply to the registers in which the INT 21h function in question returns values."
added on the 2019-04-17 19:38:26 by HellMood HellMood
You can't rely on this behavior because different versions of DOS can do slightly different things. Don't assume what you see in DOS 6.22 works in DOSBox, or FreeDOS, or DOS 3.3, or MS-DOS 7.x (i.e. what comes with Win9x).

Most DOS calls are supposed to preserve all regs that don't return results, but many DOSes trash BP. Ralf Brown's Interrupt List has a few known gotchas.
added on the 2019-04-22 17:09:02 by trixter trixter
Do you want High Resolution and PC speaker in 32 bytes?
> here you go <
added on the 2019-04-26 10:56:33 by HellMood HellMood
When tinkering with "secret modes" I found the not-so-secret mode 0x6A, which I gave too little attention since it didn't work in DOSBox. Apparently this allows 800x600 resolution with no overhead, applied here to "dragon fade"
added on the 2019-05-01 00:51:55 by HellMood HellMood
It might be overlooked, but if you are interested in coding your 256 Byter with the help of SSE (didn't see much of that yet for tiny intros...) to get some serious speedup over an FPU version you can check out Frigo's lovely Kali-set and my SSE (level 4.1 needed) implementation in that product thread. Thanks also to TomCat for additional help.

The speedup of the double pixel interleaved loop is > 400% while keeping the size, so it actually makes sense for that kind of algorithm. For others, the overhead may be too much.

What keeps me puzzled is how to optimize the conversion of an xmm register that holds your RGB values as single floats into an RGB dword integer. Especially when there's a chance that some floats are too big for a DWORD, I needed an extra minps (plus the xmm2 setup for the 255.0 mask, which costs maybe another 10 bytes) to avoid artefacts:
Code:
;xmm1 = R|G|B|...
minps xmm1,xmm2 ;clamp to a maximum of 255.0 (xmm2=255.0|255.0|255.0|255.0)
cvtps2dq xmm1,xmm1 ;int conversion
packuswb xmm1,xmm1
packuswb xmm1,xmm1 ;dword to byte => 2 times needed
movd eax,xmm1
stosd ;plot pixel

Any other idea to do this shorter ? May be it's even better to do that in non-SSE code as it's not speed critical anymore...
added on the 2019-09-27 21:18:42 by Kuemmel Kuemmel
Seems after a while of digging I solved it by myself. So the following sequence seems to do the correct conversion with no artefacts, avoiding the minps and the mask setup. Using first packssdw does the trick :-)
Code:
;xmm1 = R|G|B|...
cvtps2dq xmm1,xmm1 ;int conversion
packssdw xmm1,xmm1
packuswb xmm1,xmm1 ;dword to byte
movd eax,xmm1
stosd ;plot pixel
added on the 2019-09-29 10:09:15 by Kuemmel Kuemmel
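To see why packssdw helps, here is a small Python model of a single dword lane. It is not SSE code and the function names are made up for illustration. It assumes the default round-to-nearest-even mode and that cvtps2dq returns 80000000h for NaN or out-of-range inputs.

```python
import math

INT32_INDEFINITE = -2**31  # assumed cvtps2dq result for NaN / out-of-range input


def cvtps2dq(x):
    """Model one lane of cvtps2dq (round to nearest, halves to even)."""
    if not math.isfinite(x):
        return INT32_INDEFINITE
    r = round(x)  # Python's round() also rounds halves to even
    return r if -2**31 <= r < 2**31 else INT32_INDEFINITE


def sat(v, lo, hi):
    """Saturate v into the range lo..hi."""
    return max(lo, min(hi, v))


def to_signed16(w):
    """Reinterpret the low 16 bits of w as a signed word."""
    w &= 0xFFFF
    return w - 0x10000 if w & 0x8000 else w


def pack_ss_us(d):
    """packssdw, then packuswb: signed dword -> signed word -> unsigned byte."""
    return sat(sat(d, -32768, 32767), 0, 255)


def pack_us_us(d):
    """packuswb twice: each half of the dword is treated as a signed word."""
    lo = sat(to_signed16(d), 0, 255)        # low word  -> byte
    hi = sat(to_signed16(d >> 16), 0, 255)  # high word -> byte
    return sat(to_signed16(lo | (hi << 8)), 0, 255)  # byte pair as word -> byte


for f in (200.4, 300.0, 40000.0, -3.0):
    d = cvtps2dq(f)
    print(f, pack_us_us(d), pack_ss_us(d))
```

In this model 40000.0 comes out as 0 with two packuswb (the low word 9C40h is read as a negative number) but as 255 with packssdw followed by packuswb. Floats beyond the signed dword range still end up as 0 in both variants.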
I can confirm this. I wrote a test for this: DUMP6.ASM

Here is the result:
I put together some information which is useful if you wanna do some sizecoding under DOS
Still pretty sure it can be done in 10 bytes.

10 bytes:
Code:
fistp word [si]
lodsw
neg ah
jz ok
shl ah,1
salc
ok:
but values 8000h..80FFh will return 0FFh.
added on the 2021-02-26 16:29:57 by Jin X Jin X
The same in 9 bytes:
Code:
fistp word [si]
lodsw
neg ah
jz ok
cwd
xchg ax,dx
ok:
added on the 2021-02-26 17:13:17 by Jin X Jin X
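A rough Python model of what the 10-byte sequence leaves in AL, given the 16-bit integer that fistp stored. The function name is hypothetical; the 9-byte variant should give the same results, since cwd copies the same sign bit that shl ah,1 moves into the carry.

```python
def clamp_to_byte(value16):
    """Model of: lodsw / neg ah / jz ok / shl ah,1 / salc (result in AL)."""
    ax = value16 & 0xFFFF
    ah, al = ax >> 8, ax & 0xFF
    if ah == 0:                 # neg ah sets ZF: value is already 0..255
        return al
    ah = (-ah) & 0xFF           # neg ah
    carry = ah >> 7             # shl ah,1 shifts the top bit into CF
    return 0xFF if carry else 0x00  # salc: AL = 0FFh if CF is set, else 0


for v in (100, 300, -1, -32768):
    print(v, clamp_to_byte(v))
```

Positive overflow clamps to 255 and negative values clamp to 0, except for 8000h..80FFh, where neg ah leaves 80h unchanged and the result is 0FFh, as noted above.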
Code:
fistp word [si]
lodsw

... if you do that repeatedly, you will overwrite your own code without precautions ;)
added on the 2021-02-26 18:44:07 by HellMood HellMood
Yeah, of course. We assume that si is restored every time.
Else you can change it to mov ax,[si].
added on the 2021-02-26 22:44:21 by Jin X Jin X
In the Sseraf intro, I have used a nice way to convert float->int. (SSE conversions kinda suck.)

First you add a magic constant so that your step is 1 ulp, store to memory and read the fractional bits. For example, float32bits(z∈0..255 + 2²³) = 0x4b0000zz.
What's new for me:
- You can convert not only to integer, but also to fixed-point. For example, float32bits(y∈0..1 + 2²³⁻¹⁶) = 0x4300yyyy (where y is 0.16 fixed-point).
- Normally the addition uses the current SSE rounding, but you can use the fixed-point trick to simulate a floor(): float32bits(z∈0..255.99 + 2²³⁻⁸) = 0x4700zz??. (You then read only the second-to-lowest byte).
added on the 2021-04-04 02:08:54 by rrrola rrrola
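The three bit patterns rrrola lists can be checked with a few lines of Python. This is only a model: the addition is done in double precision and then rounded to float32 by struct, which matches a single float32 addition for the inputs used here. The helper names are invented for illustration.

```python
import struct


def float32_bits(x):
    """Return the IEEE 754 single-precision bit pattern of x as an integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def to_int_byte(z):
    """z in 0..255: add 2^23 so that 1 ulp = 1, then read the lowest byte."""
    return float32_bits(z + 2.0**23) & 0xFF


def to_fixed_0_16(y):
    """y in 0..1: add 2^(23-16) so that 1 ulp = 2^-16, read the low 16 bits."""
    return float32_bits(y + 2.0**7) & 0xFFFF


def floor_byte(z):
    """z in 0..255.99: add 2^(23-8), read only the second-to-lowest byte."""
    return (float32_bits(z + 2.0**15) >> 8) & 0xFF


print(hex(float32_bits(200.0 + 2.0**23)))  # 0x4b0000c8
print(hex(float32_bits(0.5 + 2.0**7)))     # 0x43008000
print(hex(float32_bits(3.75 + 2.0**15)))   # 0x470003c0
```

In the last case the low byte (c0h) holds the fraction and is simply ignored, which is what turns the rounding addition into a floor().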
@rrrola: Thanks for sharing ! Can you elaborate also on the packing concept algo ? I think that's the first ever 256 Byte with a code packer. You pack the first 2 bytes of each SSE instruction (into what format ?) and stick to the 0x0f encoded instructions in the entire code ?
added on the 2021-04-04 09:35:43 by Kuemmel Kuemmel
The intro is split into two parts: only the second SSE part is packed.
All used SSE instructions look like 0F ‹opcode=28|29|5x› rm ‹optional offset›.

The packed data is split into a bitstream (commands) and a bytestream (data).
The command bits in the party version:
- 10: set XX=new byte, write 0F XX, copy one extra literal byte
- 11: write 0F XX, copy one extra literal byte
- 0: copy one literal byte

So a new SSE instruction costs 18 bits (instead of 24), an SSE instruction with a repeated opcode costs 10 bits (instead of 24), and everything else expands to 9 bits (instead of 8): loops, offsets, the esc check...
The packing method is predictable and fun to hand-optimize for: all the SSE instructions are put at the end (so that the rest doesn't expand to 9/8) and I use same-opcode chunks as much as possible.

The unpacker code is in the "tools" directory: it uses the starting registers as much as possible, and overlaps the bitstream and the bytestream. I've managed to get AH=0 after unpacking for the video mode.
added on the 2021-04-04 10:29:18 by rrrola rrrola
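A minimal Python sketch of a decoder for the three commands described above. It is not the unpacker from the "tools" directory: it keeps the bitstream and the bytestream as two separate inputs, and the function name and the order in which bits are consumed are assumptions made for illustration.

```python
def unpack(bits, data):
    """bits: sequence of 0/1 command bits; data: the bytestream."""
    bits = iter(bits)
    data = iter(data)
    out = bytearray()
    opcode = None                     # the remembered XX (unset until a "10")
    for b in bits:
        if b == 0:                    # 0: copy one literal byte
            out.append(next(data))
            continue
        if next(bits) == 0:           # 10: set XX = new byte
            opcode = next(data)
        out += bytes([0x0F, opcode])  # 10 and 11: write 0F XX
        out.append(next(data))        # ...and copy one extra literal byte
    return bytes(out)


# movaps xmm0,xmm1 (0F 28 C1), movaps xmm2,xmm3 (0F 28 D3), ret (C3)
packed_bits = [1, 0, 1, 1, 0]
packed_data = bytes([0x28, 0xC1, 0xD3, 0xC3])
print(unpack(packed_bits, packed_data).hex())
```

The example spends 5 command bits plus 4 data bytes, i.e. 18 + 10 + 9 = 37 bits for 7 output bytes.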
Stats for the party version:
Code:
'0F RR xx' * 24: 72 -> 30 bytes
'0F NN xx' * 18: 54 -> 40.5 bytes
'xx' * 39: 39 -> 43.875 bytes
Packed: 165 -> 114.375 bytes, rounded down (3 bits merged with the bytestream)
Not packed: 28 (unpacker) + 114 bytes (intro code)
114 + 28 + 114 = 256
added on the 2021-04-04 10:37:25 by rrrola rrrola
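The numbers in these stats follow directly from the per-command costs (10, 18 and 9 bits). A short Python check, with dictionary keys that are just labels chosen for this sketch:

```python
from fractions import Fraction

# Bits per packed item, from the command description above.
COST_BITS = {"repeat": 10, "new": 18, "literal": 9}
# Unpacked size in bytes of each item.
RAW_BYTES = {"repeat": 3, "new": 3, "literal": 1}
# Item counts from the party version stats.
COUNTS = {"repeat": 24, "new": 18, "literal": 39}


def packed_bytes(counts):
    """Total packed size in bytes, kept exact as a fraction."""
    return Fraction(sum(COST_BITS[k] * n for k, n in counts.items()), 8)


def raw_bytes(counts):
    """Total size in bytes before packing."""
    return sum(RAW_BYTES[k] * n for k, n in counts.items())


print(raw_bytes(COUNTS), float(packed_bytes(COUNTS)))
print(int(packed_bytes(COUNTS)) + 28 + 114)
```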
@rrrola: Thanks for the additional details! Very nice to see these concepts providing benefits.

For x86 and tiny code footprints the separation of bit and bytestreams is the best option (except for byte based packers ofc).

@all: Since it seemed to be kind of a surprise to some sizecoders at the sizecoding discord:
BT (bit test) and its variants are able to address a bitfield with 2^16 bits or more (depending on register width) using a single index.
BT [mem],reg
The index in reg is signed: the lowest bit has index 0.
The instruction accesses the whole 16-bit (or 32-bit) word even when it needs only one bit.
added on the 2021-04-04 13:32:01 by rrrola rrrola
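A small Python model of how BT [mem],reg picks its bit, following the description above: the signed index is split into a word offset (floor division) and a bit number inside that word. The function name and the flat bytearray standing in for memory are assumptions for illustration, and the base must be large enough that negative indexes stay inside the array.

```python
def bt(mem, base, index, width=16):
    """Model of BT [base],reg with a 16- or 32-bit operand; returns CF."""
    index &= (1 << width) - 1
    if index >= 1 << (width - 1):          # reinterpret the register as signed
        index -= 1 << width
    unit = width // 8
    addr = base + unit * (index // width)  # floor division: negative goes below base
    word = int.from_bytes(mem[addr:addr + unit], "little")  # whole word is read
    return (word >> (index % width)) & 1


mem = bytearray(16)
mem[8] = 0x01    # bit 0 of the word at base
mem[10] = 0x02   # bit 1 of the next word, i.e. index 17
mem[7] = 0x80    # bit 15 of the word below base, i.e. index -1
print(bt(mem, 8, 0), bt(mem, 8, 17), bt(mem, 8, 0xFFFF), bt(mem, 8, 1))
```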
Tailored to the intro or not, this comes close to the first real packer for 256 bytes, strong!
I'm as well guilty of not knowing that the bit testing index can skyrocket when used on memory xD
Please do join us on the sizecoding discord =)
added on the 2021-04-04 14:14:28 by HellMood HellMood
@HellMood: For x86 it's a first I think. introspec for example did this on Z80 already (as he mentioned on the Discord):