12 byte patch for speeding up tiny msdos intros

category: code [glöplog]
From the names i read upon my research in some forums, this is kind of old news to at least some of you, but maybe it is not so common knowledge that the patch is actually really small.

I am talking about "write combining", which in the usecase of tiny intros collects writing memory accesses to the VRAM to burst write it later for speeding up. I stumbled across "WC" when looking into, and thinking about the usefulness of, debug registers, control registers, test registers (which as of now just seem to be fancy storage alternatives that may save one byte or another) and finally model specific registers (MSR).

In short, you mark a specific memory region to have (or not have) WC and then things run faster. There is really a lot about this to find online so i come directly to the point:
Code:mov cx,0x259 mov eax,0x01010101 pop dx ;=0, save=CDQ, 1b more wrmsr

This sets WC for the segment A0000 (to B0000). I tested that on FreeDos on my Samsung R519 notebook with a Intel Intel(R) Core(TM)2 Duo CPU T6500 and a NVIDIA G105M.

This works on Pentium and higher, DOSBOX does not recognize the WRMSR command ...

I did a simple test of repeatedly writing pixels to the screen, with a flashing animation which does 256 screens per flash, and because i am lazy, just used https://www.all8.com/tools/bpm.htm to measure the speed ^^

Code:; comment out for no speedup mov cx,0x259 mov eax,0x01010101 pop dx ;=0, save=CDQ, 1b more wrmsr mov ax,0x13 int 0x10 push 0xa000 pop es X: stosb loop X mov al,16 inc bl test bl,0x80 jnz Z dec ax Z: jmp short X

without speedup : 40
with speedup : 149

Yes, that's 372% performance.

Of course, the boost becomes a lot smaller when using a lot of calculations between pixel writes. I tested again with the tiny intro Lucy 64b.

without speedup : 76
with speedup : 98

That's 130% performance.

I tested again with the known program MTRRLFBE instead of my own patch and the numbers didn't change, so i guess i did everything right.

So, where did i get the information from?

Number 0x259 from here and also in the Intel Manual
Pages 3137-3140 of "Intel® 64 and IA-32 Architectures Software Developer’s Manual Combined Volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D, and 4" ( here )
General Video about Graphics Boost for DOS : here
On bare DOS you may need "HDPMI", found here.

Since the technique itself seems fairly common, i'm happy to get more insights about the whole topic or comments to get more speed or shrink the patch size =)
added on the 2019-03-17 20:13:03 by HellMood HellMood
MOV EAX,0x01010101 can be one byte shorter (assuming SI=0x100):
Code:inc si | push si | push si | pop eax

You need to set all of EDX (not only DX), so CDQ or MOV EDX,EAX is the way to go. (Or RDMSR.)
added on the 2019-03-17 21:36:35 by rrrola rrrola
I felt there was a shorter way, thanks :) I am assuming top of EDX is already zero, i would of course revert to CDQ if the code fails (and it will fail on random values in EDX)
added on the 2019-03-17 21:42:23 by HellMood HellMood
Quite interesting, didn't know that ! Does that also enhance speed for true colour modes like 640x480 and STOSD under FreeDOS despite the page flipping required in those modes ?
added on the 2019-03-17 21:48:01 by Kuemmel Kuemmel
@Kuemmel Yes it does, and that's where the speedup get's real because with rising resolution, the pixel pushing requires more time in relation to calculation. I didn't benchmark that yet. Also there is there "normal" way to do it, in contrast to "IA32_MTRR_FIX16K_A0000" which i am using now. You would specify the real 32 bit adress of the LFB (VESA). You can test it with the MTRRLFBE tool linked above (Look into the video for parameters) I didn't yet delve into that ;)

Another note, of course the maximum speed up is achieved when having double buffering. In case there is place for that, that's a real boost =)
added on the 2019-03-17 21:58:03 by HellMood HellMood
Here are my results for 640x480 TrueColor Recursive RayTracing,
the fx from my seminar

Desktop PC with nVidia 1050
no speedup
318 timer ticks
15,3792472024 RDTSC EDX,EAX

write combine
166 ticks
8,1419034369 RDTSC EDX,EAX

lower fpu precision
310 timer ticks
15,2191496823 RDTSC EDX,EAX

both speedups
161 ticks
8,163713452 RDTSC EDX,EAX

Notebook, integrated gfx
no speedup
494 timer ticks
15,514199760 RDTSC EDX,EAX

write combine
279 ticks
8,2305766180 RDTSC EDX,EAX

lower fpu precision
478 timer ticks
14,2801465456 RDTSC EDX,EAX

both speedups
264 ticks
8,296854544 RDTSC EDX,EAX

For the other speedup I tried this code:
Code: CALL @F DW 0x06BF @@: POP BX FLDCW [BX]

You can read about this at Ped's source code:
These speedup is more powerfull under Dosbox!

And one more thing:
you may noticed already if you write your temp vars to the address 100H, you may get a code cache conflict easily... and your code will run very slowly,
(the interesting thing, if i run the code from volkov commander, the problem never appear)

So if you want to MOV [SI],AX, first change SI, like
@Tomcat thanks for the detailed benchmarking =) And also for the FPU speed up trick, that went under my radar. So, "write combine" can speed up real hi res intros by a factor of 2 on real hardware, that's awesome to know. I wonder what else is there in our weaponry which gives speed boosts for the mere price of a few bytes.
added on the 2019-03-18 10:46:56 by HellMood HellMood
nice! i'm also thinking on writing a TSR for on-the-fly VGA write combining switching, as Mode-X/16 color modes (tbmp VGA latches/graphic controller/ALU magic) are entirely incompatible with write combine but it's handy at other times for mode0x13 stuff :)
the only issue here is DOS extender/DPMI applications, as the would not allow you to fiddle with MSR so easily
added on the 2019-03-18 13:47:07 by wbcbz7 wbcbz7
Thumbs up for starting this thread.
added on the 2019-03-18 14:29:54 by Adok Adok
I was aware about this, but just running tools like Fastvid or MTRRLFBE. Never knew how to enable it with my own code or more info about what it does.
I think it's support from Pentium Pro and on. I always run it on start up and does big difference especially in high resolutions.
added on the 2019-03-18 21:23:34 by Optimus Optimus
I was noticing some interesting stuff with it.

Before that, I used to rep stosd, made a difference if you write 8bits, 16 or 32bits at once.
But after enabling it, a rep stosb is same fast as rep stosd. Makes sense, if it collects the writes and bursts them out later.
Then, I write byte per byte for vertical columns (e.g. for wolfenstein or voxel). At that point it lost it's effectiveness, I was as slow as writting individual bytes, not packing and writting 32bit at once. I guess it makes sense, you need to write linearly if possible.

One more interesting thing I noticed, games that use Mode-X tricks to write 2-3-4 pixels at once (by changing that mask register), will sometimes draw garbage when write combining is enabled. Doom for example, the border around the game when you make the screen smaller, comes out as garbage. But the game area is correct. Why is that? I guess, when writting the border background, a byte is written, the switch frequently the mask register to select which pixel to write for the byte. But WC will keep the writes and write them later when the mask register can be at any arbitrary state, not the originals intended per byte written. While the game are rendering (of walls, floors and sprites) from what I remember, would minimise the mask register changes, would render all columns at 0,4,8,etc then switch the mask, render the 1,5,9, etc. WC is agnosting of what mode-x mask register changes while you were writing the bytes, will just burst the pixels all at once later. Wolfesntein 3D must be different, everything (border and game area) come garbled.
added on the 2019-03-18 21:33:40 by Optimus Optimus
Are you sure the garble is from writing only, and not read/write copying? It works like this: you read one byte and write one byte over the bus, but in the VGA, 32 bits get copied. The 3 extra bytes are never transferred over the bus. This only works with byte-size loads and stores, because inside the VGA the ”latches” (or whatever they’re called) are only 32 bits wide.
added on the 2019-03-18 23:48:32 by yzi yzi
i remember having had "write combing" (sic) as a flag in a bios back then. obviosuly enabling this had massive impact on performance. I wonder whether this bios setting governed the general availa bility of that feature, or maybe it was a "global force enable"? in other words: what are the disadvantages or problems resulting from using this? there must be a tradeoff, otherwise it wouldn't be optional.
added on the 2019-03-19 00:50:28 by jco jco
Instructions don’t do exactly what they’re supposed to.
added on the 2019-03-19 06:24:59 by yzi yzi
"For example, a Write/Read/Write combination to a specific address would lead to the write combining order of Read/Write/Write which can lead to obtaining wrong values with the first read (which potentially relies on the write before)." - source

The next thing that comes to mind is a cellular automata effect like amoeba, depending on how you implement it, it works slightly different with WC or even not at all as expected.

@optimus, thanks for the insights =)

@wbc, you are one of the names i saw in the forums ;) why would MTRRLFBE even need DPMI on real hardware? The tiny patch just works without it. Seems i fail to understand something here ^^
added on the 2019-03-19 11:47:05 by HellMood HellMood
why would MTRRLFBE even need DPMI on real hardware? The tiny patch just works without it. Seems i fail to understand something here ^^

yup, while you are running in real mode (and in case of EMM386/QEMM, in V86, can't remember if they allow to fiddle with MSRs though), nothing can restrict you from accessing MSRs and DPMI/whatever is not required at all (and that's why your patch works fine :)
but in case of TSR doing the same task (i.e. hooking INT 0x10 and switching write combining state depending on mode being set) your code can (if being asserted by protected mode application) actually running in V86 mode, where WRMSR\RDMSR are virtualized and either handled correctly or causing a general protection fault, depending on DPMI server/DOS extender design :)
added on the 2019-03-20 18:36:44 by wbcbz7 wbcbz7
TomCat tested my codegrinder true colour versions (linky) with WC, FPU and SSE...none of them showed any speedup...I wonder why...may be if there's too much other stuff going on between writing the next byte/word to the screen, so it just decides somehow to write it without speedup...?

Reminds me on ryg's headline about WC though it was in another context ;-)
added on the 2019-03-20 21:17:53 by Kuemmel Kuemmel
If EDX=0000????, EAX=0 and ECX=0xFF, this is even shorter:
Code:cwd | inc dx | div ecx

It sets EDX to 0x1, which just enables write combining for B0000..B3FFF.
But neither WinXP nor DOSBox clear the top 16 bits of registers between program runs, ymmv.
added on the 2019-03-21 23:37:13 by rrrola rrrola
@rrrola, dude that's hilarious and beautiful. How did you even get this idea :D Didn't work on my notebook though ... i have another weird idea, which so far just works on dosbox, where it's useless ^^ (it's also not shorter)
Code:pop ss | mov sp,6a1eh | popad

debug view

@Kuemmel I'm not even sure that your link was so out of context : "This means all pending write-combining buffers get flushed, and then the read is performed without any cache" (from that article) which would explain why so far, i couldn't create a counter example to show that WC can fail badly. Unless the "only" drawback would be less performance than without WC, but i don't know enough about that.
added on the 2019-03-22 12:19:31 by HellMood HellMood
if I sped up pluto128 it would have runned too fast. i wouldn't want that :) anyways nice to see you figure out new things even though i dont understand any of it.
added on the 2019-03-22 15:20:31 by rudi rudi
@rudi don't worry about that. To prevent running too fast you need only one extra byte for the instruction
is WC the thing or not? I had more tests with my truecolor intros,
and now I can report it, that not everything is gold which is shining.

So let's look at the intros one by one

1. RT
Code: MOV CX,640/5*4/2 REP MOVSW

2-3x speed bonus, 318vs116, 494vs264

Code: movd dword[es:di],xmm1

nothing or 4-5x speed bonus?!, 103vs103, 245vs49

usually nothing but in only one case on my laptop with integrated VGA
I was able to measure very good result (without volkov commander!)

Same intro but the FPU version
Code: stosb dec ch jnz rgb_loop inc di

no speed bonus, 257vs257, 161vs161

Code:VideoPtr: FISTP WORD [ES:BX+0FFFCH] INC BX JPO loop3x SUB WORD [BYTE SI+(VideoPtr+3-264)],DX

no WC speed bonus, 610vs583, 591vs582
(little achievement by the FPU tweak)


4-5x speed bonus or less?, 1114vs248, 803vs616

after this I've tried to replace REP MOVSW to REP MOVSD in the 1st intro,
but the result not changed.


3-4x speed bonus!, 174vs53, 127vs46

Code: CMP [ES:DI+3],DL JA pixelok loop3x: STOSB INC BX JPO loop3x XCHG AX,DX STOSB skip: SUB DI,SP

no WC speed bonus, 243vs240, 265vs263
(little achievement by the FPU tweak)

This is very similar code to POKEBALL, but this time we are reading from the vmem.
what more, we are writing only one sphere/frame, not the whole screen.

So I think ryg's article has some rights.
and with some restrictions we can reach more than 4 times speedup!
- 1st: not to read from the WC mem, 2nd: use regular STOS instructions.

foot note:
I don't know why VC matter, maybe the intro runs in different memory segment
with and without the Volkov Commander.
Volkov Commander! Must be a distant relative of mine.
added on the 2019-03-23 08:24:59 by Adok Adok
One more important note, that the upper half of ECX must be zero... otherwise your machine will freeze.

I made a test pack with 256b intros for write combine. After DOS reboot you can run a batch file and you will get a text file for result. It is available on request (email prefered).

Here are my final results in table format:
(ECX bug fixed since my previous post).
BB Image

Some of the intros run nearly constant speed, like SEASHELL or COLORFUL.
here WC doesn't help at all...
reasons could be: not filling the whole screen, or reading from vmem, or drawing too slow.

Other intros gain benefit of WC. The speedup is: from 2.0x to 4,3x !!!

The other thing which shown in the table... The running speed how different can be with or without VolkovCommander :-(
(and also... different CPUs could have different behavior)
@yzi: The garble appears on Doom/Wolf because of the mode x tricks seemingly. They had the pattern of each 4 pixels.
added on the 2019-03-24 12:33:20 by Optimus Optimus