Tiny Intro Toolbox Thread

category: code
*FCOMI ST(0), ST(1)
*6 bytes only
:-( I would delete my last post if I could
And as long it runs with a non emulated DOS on a current machine - you have to deal with the fact that it is a totally valid demoscene related coding platform.
(was las's response to xTr1m and hArDy)

For me, everything fits into the DOS platform that can run during the boot process on any current machine (from the boot sector), or that at least runs from a USB-booted DOS :-)

- mode 13h
- vesa 640x480 truecolor
- PC speaker

I'm not sure about midi music. No hw midi support in current machines.
So I don't feel the midi intro is a DOS product, just a DOSBox product (or maybe a retro DOS product).
A recent find for me, which might have been overlooked because the FPU instruction fisttp was only introduced with SSE3: if you want to truncate a float for further use, you can do it like this without changing the rounding mode:
Code:
fisttp word[address]
fild word[address]

Even though it goes through memory, it's also much faster than using frndint.
added on the 2019-04-06 18:50:52 by Kuemmel Kuemmel
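To make the difference concrete, here is a small Python model (not DOS code) of the two behaviours. It assumes the FPU is in its default rounding mode (round to nearest, ties to even) for frndint, and that the value fits into a signed 16-bit word. The function names are made up for illustration.

```python
import math

def fisttp(x: float) -> int:
    """Model of FISTTP: convert to integer by chopping toward zero."""
    return math.trunc(x)

def frndint_default(x: float) -> int:
    """Model of FRNDINT in the default mode: nearest, ties to even."""
    return round(x)

if __name__ == "__main__":
    for x in (2.7, -2.7, 2.5, 3.5):
        print(x, "fisttp:", fisttp(x), "frndint:", frndint_default(x))
```

So fisttp + fild gives 2 for 2.7 and -2 for -2.7, where frndint without a changed control word would give 3 and -3.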
Puls uses this - it's about 6 bytes shorter:
Code:
for (uint16 i = 0; i < 65536; i++) // automatic wrap
{
  int16 dx = (i * 0xCCCD - Center) >> 8;  // automatic modulo to ±32767
  int16 dy = (i * 0xCCCD - Center) >> 16; // ±20480
  //use dx, dy
}

Center is a constant that maps (159.5, 99.5) to (0,0), can be killed using segment magic.
The >>8 and >>16 are also free (pusha + byte addressing into the stack).
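Why the constant works can be checked with a few lines of Python. This is only a model of the multiplication, without the Center offset: 0xCCCD is roughly 65536 * 256 / 320, so the top byte of the 32-bit product is the row and the next byte is the column scaled to 0..255. The helper name is hypothetical.

```python
RRROLA = 0xCCCD  # 52429, roughly 65536 * 256 / 320

def yx_from_offset(i: int) -> tuple[int, int]:
    """Return (DH, DL) of the 32-bit product i * 0xCCCD for a 16-bit offset i."""
    product = (i & 0xFFFF) * RRROLA   # what MUL leaves in DX:AX
    dx = product >> 16                # high word
    return dx >> 8, dx & 0xFF         # DH = row, DL = column scaled to 0..255

if __name__ == "__main__":
    for i in (0, 319, 320, 63999):
        print(i, yx_from_offset(i))
```

For all 64000 visible offsets of mode 13h, DH equals i // 320 and DL equals (i % 320) * 4 // 5, so no DIV is needed.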

Pyrit goes one step further, it uses add instead of mul:
Code:
mov dx,0xA000-10-20-20-4
mov es,dx       ; dx:bx = YX:XX = 0x9fca:0
; the visible pixels are A0000..AF9FF, I want X=0 Y=0 in the center

;Each pixel: cx=T dx:bx=YX:XX(init=9fca:0) di=adr(init=-4)
X:inc dx        ; part of "dx:bx += 0x0000CCCD"
X2:
  stosb
  pusha         ; adr:   -18 -16 -14 -12 -10  -8  -6  -4  -2
  fninit        ; stack:  di  si  bp  sp  bx  dx  cx  ax   0
  mov bx,es     ; s16: pixadr 100 9?? -2 ..X..Y T result
  mov di,-4     ;di = address of pushed ax
  ...
  add bx,0xCCCD ; dx:bx = YXX += 0000CCCD
jnc X2
jnz X           ; do 65536 pixels
Both Puls and Pyrit use ADD and both spend about 26 bytes on the loop with different tradeoffs.

Puls has YYXX in memory. The long memory addition doesn't set the correct flags at the end, so [ES:BP+SI] is used for the pixel address and INC BP for the loop test. BP is wrong during the first frame.
Code:
 8  push 0x9FCE | pop es ... pop 4 bytes | push es | push bp  ; INIT
 5  XY: L: fild word[di] | dec di | jpo L                     ; FILD
 7  add dword[di],0xCCCD                                      ; YYXX++
 4  inc bp | mov [es:bp+si],al                                ; DRAW, ADDR++
 2  jnz XY

Pyrit has YYXX in DX:BX. The low-word addition sets the zero flag correctly at the end, so [ES:DI] could be used for the address. But DI needs to be -4 in the inner loop for other reasons.
Code:
 5  assume bx=0 | mov dx,0x9FCA | mov es,dx                                ; INIT
 1  XY: inc dx | XY2:                                                      ; YYXX++ (first part)
11  pusha | mov di,-4 | fild word[di+4-9] | fild word[di+4-8] | popa       ; FILD
 1  stosb                                                                  ; DRAW, ADDR++
 6  add bx,0xCCCD | jnc XY2                                                ; YYXX++ (second part)
 2  jnz XY

I think Pyrit's version is more versatile. Puls uses PUSHA/POPA anyway to get more registers in the pixel computation, so they're almost free. And having the approximation of Y:X in DH:DL is useful (some intros need only that).

The MUL version can probably be smaller because of more free registers. I don't know yet.
added on the 2019-04-06 23:29:34 by rrrola rrrola
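A rough Python model of the ADD variant, to show that stepping DX:BX by 0x0000CCCD with a 16-bit add plus "inc dx on carry" visits exactly the same values as the multiplication, and that the loop ends by itself after 65536 pixels. It starts from DX:BX = 0 and ignores the 0x9FCA segment offset. The generator name is made up.

```python
RRROLA = 0xCCCD

def walk_pixels(dx: int = 0, bx: int = 0):
    """Yield (dx, bx) once per pixel, stepping DX:BX by 0x0000CCCD."""
    while True:
        yield dx, bx
        total = bx + RRROLA           # add bx,0xCCCD
        bx = total & 0xFFFF
        if total > 0xFFFF:            # carry set: jnc X2 not taken
            if bx == 0:               # jnz X not taken either: loop ends
                return
            dx = (dx + 1) & 0xFFFF    # X: inc dx

if __name__ == "__main__":
    print(sum(1 for _ in walk_pixels()))
```

Because 0xCCCD is odd, BX only returns to zero after 65536 additions, which is what makes the final JNZ work as the loop test.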

Back then, when i was diving into all the floating point stuff (2015/16), i came across this thread - of course - and learned about what we (the sizecoders) coined the "Rrrola constant" (0xCCCD). When p01 was amazed about how short the idea *really* is, i looked into the source of "puls" and was a bit puzzled by that 7 byte ADD - that's not really short is it? In my own tries of making that magic constant work, i reverted to using MUL and aligning manually with opcode of "PUSH <word>" which works the same way as offsetting DX with ES (which i find really funny btw). My most optimized version i can think of now is the 52 byte version of the tunnel included in Neontube 64b. However, i initially was satisfied with 8 bit precision of X, which can look chunky. It's not too hard to align the values on the stack though (and correcting BX again for optimizing the stack accesses) which leads to this 25 byte version (following the previous comment convention):

Code:
 4  push 0x9FBA | pop es ; [SI] contains alignment(!)                       ; INIT
 3  X: mov ax,0xCCCD
 5  mul di | sub dh,[si] | xchg bx,ax                                       ; DX:BX = YYXX
10  pusha | xor bx,bx | fild word [bx-8] | fild word [bx-9] | popa          ; FILD
 3  stosb | jmp short X                                                     ; DRAW, ADDR

SI is "locked" in that version, but two further bytes can be spared if the stack access happens in other ways than [BX +- signed byte]. With 8 bit precision, it is 22 bytes.

Also, recently, i found that the "Rrrola trick" works in textmode, although you'd have to spend 6 more bytes for conversion and aspect ratio (see here ).

In this 25 byte version, as well as in the two 26 byte versions, there is no explicit frame counter. Back then i found a really nice trick to get a frame counter from MUL, which sets the carry flag every time but twice; keeping CL as is (0xFF) and reusing the mod byte of "INT 0x10" (ADC) saves another byte, explained here. In the Pyrit code a similar idea could save one byte (ADC with zero in reg/mem instead of JNC + INC), but only if a zero is available in a register or an easily accessible memory location.
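The "every time but twice" part is easy to check in a Python model: MUL with a 16-bit operand sets the carry flag whenever the high word DX of the result is non-zero, and with AX = 0xCCCD that fails only for DI = 0 and DI = 1. The function names are invented for this sketch.

```python
RRROLA = 0xCCCD

def mul_sets_carry(di: int) -> bool:
    """MUL r16 sets CF (and OF) exactly when the high word DX is non-zero."""
    return ((di & 0xFFFF) * RRROLA) >> 16 != 0

def offsets_without_carry() -> list[int]:
    """All 16-bit values of DI for which 'mul di' leaves the carry clear."""
    return [di for di in range(0x10000) if not mul_sets_carry(di)]

if __name__ == "__main__":
    print(offsets_without_carry())
```

So over one full pass of DI through 0..65535 the carry is clear exactly twice, which is the event the ADC trick counts.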

It's "better" to use a synced timer anyway [0:0x46C], but as of now, that requires more space than the versions above.
added on the 2019-04-07 14:11:15 by HellMood HellMood
Another idea is to put the values already ordered on the stack. That "locks" BX, too, but leads to this 24 bytes version
Code:
 4  push 0x9FBA | pop es ; [SI] contains alignment(!)              ; INIT
 3  X: mov ax,0xCCCD
 6  mul di | sub dh,[si] | push dx | push ax                       ; stack = YYXX
 8  fild word [bx-4] | fild word [bx-5] | pop dx | pop ax          ; FILD
 3  stosb | jmp short X                                            ; DRAW, ADDR++
added on the 2019-04-07 20:26:43 by HellMood HellMood
Note: both my versions already loop infinitely, so strictly speaking their core is two bytes less (23/22 bytes) while they require additional space later on for example for checking frame ends or continuous sync to the timer, which again needs ~two more bytes (inc reg + jmp short (3) in rrrolas examples vs pop ds + add/sub anim_reg,[0x46c] (5), alternatively the mentioned int10 reusing ADC trick (3-5) which is hard to perform and locks yet another register)

I guess the approaches are not downright comparable bit by bit due to the different constraints. The number of possibilities is amazing =)
added on the 2019-04-07 20:51:38 by HellMood HellMood

Which was the first product which used the CCCDh constant?
Was it Puls from 2009?

Yes, probably.
Previously I thought I had seen this earlier at the HugiCompos or ChristmasCompos, maybe from Digimind.

But yesterday I did a little research. With the help of TotalCommander I searched for the string CCCD in *.ASM and for the hex bytes CDCC in *.COM. And I didn't find anything earlier (only some false positives caused by float constants).

So yes, Rrrola, it's from you!

I needed to make a float max() and came up with the same solutions as in your post, but I'm unable to test the FCMOV variant because it requires a Pentium Pro. DOSBox doesn't seem to be able to emulate it, and I don't have a way to run FreeDOS. Is there a good emulator which could run an intro using these instructions?
added on the 2019-04-11 14:01:44 by fizzer fizzer
Okay, I just noticed DOSBox-X is able to emulate it. Thanks!
added on the 2019-04-11 14:06:05 by fizzer fizzer
Maybe a common task that you'd like to avoid, but sometimes can't: clamping a float (sign unknown) to an unsigned integer byte. I came up with this. It's not sexy, but seems short (15 bytes). Any shorter ideas?
Code:
        fistp word[si]        ;store float in int
        test word[si],0xff00
        mov al,byte[si]
        jz skip_clamp         ;...already in range of 0...255
        stc                   ;? > 255 => carry = 1
        jns skip_min
        clc                   ;? < 0 => carry = 0
skip_min:
        salc                  ;carry=0=> al=0, carry=1=> al=255
skip_clamp:
added on the 2019-04-11 19:18:49 by Kuemmel Kuemmel
11 bytes:

Code:
        fistp word [si]
        lodsw
        test ah, ah
        jz skip_clamp
        add ah, ah
        cmc
        salc
skip_clamp:
added on the 2019-04-11 20:24:49 by frag frag
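For checking such variants without a debugger, a flag-level model in Python helps. The sketch below follows the 11-byte sequence instruction by instruction (starting after the fistp, i.e. from the stored 16-bit integer) and compares it against a plain clamp. Function names are hypothetical.

```python
def clamp_reference(v: int) -> int:
    """What the routine should compute for a signed 16-bit v."""
    return max(0, min(255, v))

def clamp_11_bytes(v: int) -> int:
    """Step through lodsw / test ah,ah / jz / add ah,ah / cmc / salc."""
    ax = v & 0xFFFF                  # lodsw: AX = stored int16
    al, ah = ax & 0xFF, ax >> 8
    if ah == 0:                      # test ah,ah / jz skip_clamp
        return al                    # already in 0..255
    carry = ah >= 0x80               # add ah,ah: CF = old sign bit
    carry = not carry                # cmc
    return 0xFF if carry else 0x00   # salc: AL = 0xFF if CF else 0x00

if __name__ == "__main__":
    print([clamp_11_bytes(v) for v in (-5, 0, 200, 255, 256, 32767)])
```

Running it over all 65536 inputs shows the 11-byte version agrees with the reference everywhere.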
Oops, 9 bytes:
Code:
        fistp word [si]
        lodsw
        rol ah, 1
        jz skip_clamp
        cmc
        salc
skip_clamp:
added on the 2019-04-11 20:30:33 by frag frag
@frag are you sure about the zero flag after rol ah,1 ?

this was my first thought:
Code:
        FISTP WORD [SI]
        LODSW
        TEST AH,AH
        JZ ok
        JNS negative
        STC
negative:
        SALC
ok:
Nice! I can't afford to have SI incremented, but that still gets it down to 10 bytes :-)
added on the 2019-04-11 20:51:19 by Kuemmel Kuemmel
but use SHL or ADD AH,AH instead of ROL
Shit, forgot all the x86 asm lol. Of course rol will not change ZF.
shl, add would not work for 0x80xx.
Still pretty sure it can be done in 10 bytes.

Your JNS must be JS by the way.
added on the 2019-04-11 22:03:33 by frag frag
Kuemmel knows... but 80xxh is a very low negative number, so maybe it works for him.
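The 0x80xx remark can be made precise with a small Python model of the 9-byte sequence with SHL in place of ROL (SHL does update the zero flag). Assuming the same setup as before (signed 16-bit input, result in AL), it lists every input where the shortcut disagrees with a plain clamp. Names are made up for the sketch.

```python
def clamp_shl_9_bytes(v: int) -> int:
    """lodsw / shl ah,1 / jz skip_clamp / cmc / salc."""
    ax = v & 0xFFFF
    al, ah = ax & 0xFF, ax >> 8
    carry = ah >= 0x80               # shl ah,1: CF = bit shifted out
    ah = (ah << 1) & 0xFF
    if ah == 0:                      # jz skip_clamp: also taken for AH = 0x80
        return al
    return 0x00 if carry else 0xFF   # cmc / salc

def failing_inputs() -> list[int]:
    """Signed 16-bit inputs where the shortcut differs from a real clamp."""
    return [v for v in range(-32768, 32768)
            if clamp_shl_9_bytes(v) != max(0, min(255, v))]

if __name__ == "__main__":
    bad = failing_inputs()
    print(len(bad), bad[0], bad[-1])
```

In this model only 0x8001..0x80FF (that is, -32767..-32513) come out wrong, so the variant is usable whenever the float can never get that far below zero.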
This thread has been a tremendous help for me to get started on Tiny Intro coding for DOS. It's time I give something back. :)

Include this function in your Mode 13h intro and call it after producing each frame to dump a sequence of BMP images that can be merged in VirtualDub or similar.

Does not assume any specific state on entry. Preserves all registers, segment registers, flags and the palette index. Trashes 1078 bytes just prior to A0000.

If you want more than 9999 frames, you can just increase the number of trailing '0' in the filename (while keeping it at max 8 chars total) and change the "mov cx, 4" accordingly.

Code:
FrameDump:
    pusha
    push ds
    push es
    lahf
    push ax
    cld

    push cs
    pop ds

    ; Update filename
    mov bx, Extension
    mov cx, 4
.incloop:
    dec bx
    inc byte [bx]
    cmp byte [bx], '9'
    jle .incdone
    mov byte [bx], '0'
    loop .incloop

    ; Exit after 9999 frames
    mov ax, 0x3
    int 0x10
    int 0x20

.incdone:
    push 0xa000-1536/16
    pop es

    ; Copy header
    mov si, BMP
    mov di, 1536-(14+40+256*4)
    mov cx, 14+40
    rep movsb

    ; Get palette
    mov dx, 0x3c8
    in al, dx
    push ax
    mov dx, 0x3c7
    mov al, 0
    out dx, al
    mov dx, 0x3c9
    mov cx, 256
.palette:
    xor eax, eax
    in al, dx
    shl eax, 8
    in al, dx
    shl eax, 8
    in al, dx
    shl eax, 2
    stosd
    loop .palette
    mov dx, 0x3c8
    pop ax
    out dx, al

    ; Create File
    mov cx, 0
    mov dx, Filename
    mov ah, 0x3c
    int 0x21
    jc .done
    push ax

    ; Write data
    pop bx
    push bx
    mov cx, 14+40+256*4+320*200
    push es
    pop ds
    mov dx, 1536-(14+40+256*4)
    mov ah, 0x40
    int 0x21

    ; Close file
    pop bx
    mov ah, 0x3e
    int 0x21

.done:
    pop ax
    sahf
    pop es
    pop ds
    popa
    ret

Filename:
    db "dump0000"
Extension:
    db ".bmp",0

BMP:
    ; File header
    db "BM"
    dd 14+40+256*4+320*200
    dw 0,0
    dd 14+40+256*4

    ; Info header
    dd 40, 320, -200 ; Header size, width, height
    dw 1, 8          ; Planes, depth
    dd 0,0,0,0,0,0

Use as you wish. :)
added on the 2019-04-15 12:13:32 by Blueberry Blueberry
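For anyone who wants to verify the dumped files on the PC side, here is a Python sketch that builds the same 54-byte BMP header as the data at the BMP label (as read from the listing above) and mirrors the palette conversion. It is only a cross-check, not part of the intro; the function names are invented.

```python
import struct

WIDTH, HEIGHT, COLORS = 320, 200, 256
PIXEL_OFFSET = 14 + 40 + COLORS * 4   # file header + info header + palette

def bmp_header() -> bytes:
    """File header and info header for an 8-bit, top-down 320x200 image."""
    file_header = struct.pack("<2sIHHI", b"BM",
                              PIXEL_OFFSET + WIDTH * HEIGHT,  # total file size
                              0, 0,                           # reserved
                              PIXEL_OFFSET)                   # offset of pixel data
    info_header = struct.pack("<IiiHH6I", 40, WIDTH, -HEIGHT, # negative = top-down
                              1, 8,                           # planes, bits per pixel
                              0, 0, 0, 0, 0, 0)
    return file_header + info_header

def dac_to_bgra(r6: int, g6: int, b6: int) -> bytes:
    """Scale 6-bit VGA DAC values to 8 bits, stored B, G, R, 0 like the stosd."""
    return bytes((b6 << 2, g6 << 2, r6 << 2, 0))

if __name__ == "__main__":
    print(len(bmp_header()), PIXEL_OFFSET)
```

The 1078 bytes of header plus palette are exactly the area the routine trashes in front of A0000, which is why a single write can cover header, palette and pixels.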
Thanks, Blueberry!
I use this to save a series of paletted, vertically-flipped TGA files. Just "call SCREENSHOT". Uses the stack and 768 bytes of memory right after the intro.
Code:
SCREENSHOT:
  pusha
  pushf
  push ds
  push cs
  pop ds

;read palette
  mov ax,255
  mov di,PALETTE+256*3-1
PALREAD:
  mov dx,3C7h
  out dx,al
  inc dx
  inc dx
  mov cx,3
  push ax
PALRGB:
  in al,dx
  shl al,2
  mov [di],al   ; [di] = b*4 g*4 r*4
  dec di
  loop PALRGB
  pop ax
  dec ax
  jns PALREAD

;increase filename number
  mov di,FILENAME + (HEADER-FILENAME) - 5
INCNAME:
  inc byte[di]
  cmp byte[di],':'
  jb ENDINCNAME
  mov byte[di],'0'
  dec di
  jmp INCNAME
ENDINCNAME:

;write the TGA file and return
  mov ah,3Ch    ; create file
  mov dx,FILENAME
  xor cx,cx
  int 21h
  xchg ax,bx    ; bx=handle
  mov ah,40h    ; write header and palette
  mov dx,HEADER
  mov cx,18+256*3
  int 21h
  push 0A000h
  pop ds
  mov ah,40h    ; write pixels
  cwd           ; dx=0
  mov cx,320*200
  int 21h
  mov ah,3Eh    ; close file
  int 21h
  pop ds
  popf
  popa
  ret

FILENAME db "0000/.tga" ;,0
HEADER db 0,1,1
       dw 0,256
       db 24
       dw 0,0,320,200
       db 8,00100000b

section .bss align=1
PALETTE: resb 256*3
added on the 2019-04-15 23:54:14 by rrrola rrrola
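As a cross-check of the HEADER bytes, this Python sketch packs the same 18-byte TGA header field by field. The field meanings in the comments follow the usual TGA layout as I understand it; the function name is made up.

```python
import struct

def tga_header(width: int = 320, height: int = 200) -> bytes:
    """18-byte header for an uncompressed, colour-mapped 8-bit TGA."""
    return struct.pack(
        "<BBBHHBHHHHBB",
        0,             # ID length
        1,             # colour map type: palette present
        1,             # image type: uncompressed, colour-mapped
        0, 256,        # first palette index, number of palette entries
        24,            # bits per palette entry (stored B, G, R)
        0, 0,          # x origin, y origin
        width, height,
        8,             # bits per pixel
        0b00100000,    # descriptor: bit 5 set = rows stored top to bottom
    )

if __name__ == "__main__":
    print(tga_header().hex(" "))
```

The header is followed by 256*3 palette bytes and then 320*200 pixel bytes, which matches the two write calls in the routine.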
10-byte clamp to unsigned byte. The trick is to test for negatives first.
Code:
        fistp word[si]
        lodsw
        add ah,ah
        jc NEGATIVE  ; 8000..FFFF -> FF
        jz OK        ; 0000..00FF
                     ; 0100..7FFF -> 00 (carry=0 here)
NEGATIVE:
        salc
OK:
added on the 2019-04-16 00:26:36 by rrrola rrrola
Disregard that, I forgot CMC. It's still 11 bytes.
added on the 2019-04-16 00:28:59 by rrrola rrrola
Signed clamp is easier. Still 11 bytes, but you can use other multipliers, which might save space elsewhere. Result in AH.
Code:
        fistp word[di]  ; assume di=sp
        pop ax
        imul si         ; si=100h -> dh:dl:ah:al = signbit:ah:al:0
        jnc OK
        mov ah,7Fh
        sub ah,dh       ; ah: FF->80, 00->7F
OK:
Instead of pop|imul, you can also do mov ax,si | imul word[di].
added on the 2019-04-16 01:12:54 by rrrola rrrola
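The signed version can be modelled the same way. The Python sketch below assumes SI = 0x100 as in the comment and treats the input as the signed 16-bit value left by fistp; it returns AH as an unsigned byte. The function name is hypothetical.

```python
def signed_clamp_imul(v: int) -> int:
    """Follow pop ax / imul si (SI = 0x100) / jnc OK / mov ah,7Fh / sub ah,dh."""
    product = v * 0x100                           # signed 32-bit result in DX:AX
    ah = (product >> 8) & 0xFF                    # AH = low byte of v
    dh = (product >> 24) & 0xFF                   # DH = 0x00 or 0xFF (sign fill)
    overflow = not -0x8000 <= product <= 0x7FFF   # IMUL: CF set if DX:AX != sign-extended AX
    if overflow:
        ah = (0x7F - dh) & 0xFF                   # 0x7F for positive, 0x80 for negative
    return ah

if __name__ == "__main__":
    print([hex(signed_clamp_imul(v)) for v in (-1000, -128, -1, 0, 127, 128, 1000)])
```

The carry from IMUL is set exactly when the value does not fit into -128..127, so the overflow test comes for free with the multiplication.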
I use this to save a series of paletted, vertically-flipped TGA files. Just "call SCREENSHOT". Uses the stack and 768 bytes of memory right after the intro.

Ah, I completely forgot about the pushf/popf instructions. :)

This tendency of old, simple image formats to store the image bottom-up is quite annoying. Good thing that (an appropriate variation of) the BMP format allows you to put a negative height to flip the image to top-down.

I had an earlier version that split the write into two in order not to trash the memory before the screen area. But it seemed the one-write version was faster (though still quite slow) when writing to a USB stick in FreeDOS. I could be imagining things, though...
added on the 2019-04-16 14:10:35 by Blueberry Blueberry