pouët.net

The ASM instruction you always wanted, but never had?

category: code [glöplog]
evl,
Quote:
I miss a simple swap.w on 68000

and that's one thing rlwinm does (amongst others).
added on the 2010-07-13 18:53:56 by Paranoid Paranoid
o yeah, absolute value on the 68k would be nice too.
added on the 2010-07-13 19:02:43 by sigflup sigflup
superplek: exactly, int 21h was the way to call the OS. Most people here are asking for new HW, but I only ask for new SW :(
added on the 2010-07-13 19:19:22 by iq iq
I frequently miss these on the 65816:
- JSL [$12],Y ;Subroutine Jump, Direct Page Indirect Long Indexed Y
- INC $123456 ;Increment/Decrement, Absolute Long
- PLK ;Pull byte from stack, put into program bank reg (normal long jumps work just as well for that, of course)

But maybe that's just me...
added on the 2010-07-13 20:58:44 by d4s d4s
something that enables stackless use of the x86 fpu
added on the 2010-07-13 20:59:56 by Gargaj Gargaj
though for starters, an FLD from EAX would be good too.
added on the 2010-07-13 21:00:19 by Gargaj Gargaj
FILD i mean.
added on the 2010-07-13 21:01:01 by Gargaj Gargaj
Ok, here's a longer and more detailed one for the Z80


As I already said, faster index register instructions would be nice:
LD A,(IX+5) takes 19 t-states while e.g. LD A,(HL) takes 7 t-states.
But 19 t-states is just fine for 3.5 MHz, and they can be useful, as Optimus already explained! :-)

A faster clear-memory instruction would indeed be quite nice, since even the fastest way,
pushing via the stack, takes 11 t-states. LDIR running at 4 t-states would be kickass!

I love the way the Game Boy's custom Z80 does the 'load (HL) with register and increment HL' instruction.
##GBZ80
LDI (HL),anyregister ;8 t-states
##Z80
LD (HL),anyregister ; 7 t-states
INC HL ; 6 t-states

I don't see any use for multiplication when you can use tables, but barrel shifts would indeed have been nice.


I personally think all these 'upgrades' just kill the creativity.
added on the 2010-07-13 21:28:14 by MuffinHop MuffinHop
By creativity I mean the coder's creativity in finding new ways to code (ofc. not entirely) - at least that's how it is for me (I think).
added on the 2010-07-13 21:35:25 by MuffinHop MuffinHop
This post starts with a lie: I need more than one instruction. A family, so to say. I also already have them natively on the C64x+ DSP.

Here it is:

SHFL RegOut, RegA, RegB

Shuffle: interleaves the lower 16 bits of RegA with the lower 16 bits of RegB and places the result in the 32-bit register RegOut:

E.g:

RegA: <abcdefghijklmnop>
RegB: <ABCDEFGHIJKLMNOP>
RegOut: <aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpP>


Of course we need the reverse thing: DEAL:

DEAL RegOut, RegA

Deal: de-interleaves the bits from RegA and sorts them into MSB16 and LSB16 of RegOut:

E.g:
RegA: <aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpP>
RegOut: <ABCDEFGHIJKLMNOPabcdefghijklmnop>
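
For reference, this is what SHFL and DEAL boil down to in plain C - a minimal sketch using the usual shift-and-mask spreading, with my own function names, not the DSP mnemonics:

Code:
#include <stdint.h>

/* Spread the low 16 bits of x so that bit i ends up in bit 2*i. */
static uint32_t spread16(uint32_t x)
{
    x &= 0x0000FFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

/* Gather the even bits of x back into the low 16 bits (inverse of spread16). */
static uint32_t compact16(uint32_t x)
{
    x &= 0x55555555u;
    x = (x | (x >> 1)) & 0x33333333u;
    x = (x | (x >> 2)) & 0x0F0F0F0Fu;
    x = (x | (x >> 4)) & 0x00FF00FFu;
    x = (x | (x >> 8)) & 0x0000FFFFu;
    return x;
}

/* SHFL-style interleave, laid out as in the example above:
   bit 15 of a lands in bit 31, bit 15 of b in bit 30, and so on. */
static uint32_t shfl(uint32_t a, uint32_t b)
{
    return (spread16(a) << 1) | spread16(b);
}

/* DEAL-style de-interleave: the even bits go to the upper 16 bits,
   the odd bits to the lower 16 bits (the inverse of shfl above). */
static uint32_t deal(uint32_t x)
{
    return (compact16(x) << 16) | compact16(x >> 1);
}

Four shift/mask rounds per 16-bit half in each direction - exactly the kind of boilerplate a single instruction makes disappear.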


-----------------

Why do I need these:

Well - interleaving bits this way gives the best cache-hit ratios regardless of the direction you scan through a 2D array. The only thing that matters is cache locality in general and the Manhattan distance between subsequent reads. In extreme cases this roxors hard and gives a factor-10 performance gain!

Such an instruction would *rock* 2D matrix mathematics for matrices larger than your first-level cache. And for the extra demoscener factor: They would totally rock all 2D software texture mappers.


Fun fact: in the x86 world we're even about to see these instructions in the future. Not in this exact form, but as a side effect of carry-less multiplication (aka Galois / GF(2) multiplication).

Now dig this:

RegA: <abcdefghijklmnop>

Result = RegA (Galois-multiply) RegA

Results in: <a0b0c0d0e0f0g0h0i0j0k0l0m0n0o0p>

Or in other words: a square done as a carry-less multiplication spreads the bits out and fills the gaps with zeroes. So it takes only two of these multiplies and a humble OR to interleave bits and get all the cache-hit galore I talked about.
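
That claim is easy to check in C. A tiny sketch with a handrolled 16x16 carry-less multiply (clmul16() is just a shift-and-XOR loop here, not the hardware instruction):

Code:
#include <assert.h>
#include <stdint.h>

/* Carry-less (GF(2)) multiply of two 16-bit values, done the slow way:
   schoolbook multiplication, but with XOR instead of ADD. */
static uint32_t clmul16(uint16_t a, uint16_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 16; i++)
        if (b & (1u << i))
            r ^= (uint32_t)a << i;
    return r;
}

int main(void)
{
    uint16_t x  = 0xBEEF;
    uint32_t sq = clmul16(x, x);   /* the GF(2) "square" of x */

    /* The cross terms cancel in pairs, so bit i of x simply ends up
       in bit 2*i of the result, with zeroes in between. */
    for (int i = 0; i < 16; i++) {
        assert(((sq >> (2 * i)) & 1u) == ((x >> i) & 1u));
        if (i < 15)
            assert(((sq >> (2 * i + 1)) & 1u) == 0u);
    }
    return 0;
}

Interleaving two registers is then two of these squares, a shift and an OR - just like above.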

ARM NEON has had this feature for two years now, btw. As an intrinsic it is called:

poly8x8_t vmul_p8 (poly8x8_t, poly8x8_t)


Last but not least: Such an instruction has great use in crypto-applications as well.

added on the 2010-07-13 23:15:50 by torus torus
On my oldskool hobby platform of choice, I like all the instructions I have (the MUL is fucking slow, but it functions). I just wish every single byte didn't take 4 cycles to read. I have nearly 5MHz at my disposal but it's no faster than a C64.
added on the 2010-07-14 06:50:50 by trixter trixter
torus: What do you mean, "about to see"? Carryless multiply is there already, starting with Westmere (the 32nm shrink of Nehalem aka Core i7). http://en.wikipedia.org/wiki/CLMUL_instruction_set

Meanwhile in Iceland: http://software.intel.com/en-us/articles/prototype-primitives-guide/. This has the shuffle (bitinterleave11) but no inverse. But it's there in both vector and scalar variants. Halfway there, I guess. :)

Quote:
Such an instruction would *rock* 2D matrix mathematics for matrices larger than your first-level cache. And for the extra demoscener factor: They would totally rock all 2D software texture mappers.

The difference is fairly minuscule for anything that has sequential-access inner loops (such as matrix multiplies), because you can efficiently step indices in Z-order.

Take a matrix multiply, A is m x p and B is p x n so C=AB is m x n.

Code:
for (int i=0; i<m; i++) {
    for (int j=0; j<n; j++) {
        float sum = 0.0f;
        for (int k=0; k<p; k++)
            sum += A(i,k) * B(k,j);
        C(i,j) = sum;
    }
}


Now, with shuf and the matrix stored in morton order, a first attempt at the inner loop would look something like this:
Code:
; in: esi = &A(i,0), edi = &B(0,j), ebx = p
; eax = k
innerLoop:
    mov   edx, eax
    xor   ecx, ecx
    shuf  edx, ecx          ; edx=shuf(X,0)
    shuf  ecx, eax          ; ecx=shuf(0,X)
    movss xmm1, [esi+edx*4]
    inc   eax
    mulss xmm1, [edi+ecx*4]
    addss xmm0, xmm1
    cmp   eax, ebx          ; k<p?
    jb    innerLoop


and realizing that shuf(0, X) = shuf(X,0) << 1, we can get rid of some fat:

Code:
; in: esi = &A(i,0), edi = &B(0,j), ebx = p
; eax = k
innerLoop:
    mov   edx, eax
    xor   ecx, ecx
    shuf  edx, ecx          ; edx=shuf(X,0)
    movss xmm1, [esi+edx*4]
    inc   eax
    mulss xmm1, [edi+edx*8]
    addss xmm0, xmm1
    cmp   eax, ebx          ; k<p?
    jb    innerLoop


One instruction fewer. But watch what happens when we just step in Morton order directly:


Code:
; in: esi = &A(i,0), edi = &B(0,j), ebx = shuf(p, 0)
; eax = shuf(k, 0)
innerLoop:
    movss xmm1, [esi+eax*4]
    mulss xmm1, [edi+eax*8]
    sub   eax, 055555555h   ; step k, part 1: k -= shuf(-1,0)
    addss xmm0, xmm1
    and   eax, 055555555h   ; step k, part 2: k &= shuf(~0,0)
    cmp   eax, ebx          ; shuf(k,0) < shuf(p,0) <=> k<p
    jb    innerLoop


Two instructions fewer and fewer registers. Short version: if you want Morton-order matrices, just go ahead already - you don't need special instructions to access them quickly. :)
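
For the record, the same inner loop in C for anyone allergic to x86 - a sketch, with names of my choosing; Arow and Bcol point at &A(i,0) and &B(0,j) in the Morton-ordered arrays:

Code:
#include <stdint.h>

/* kZ and pZ are k and p with their bits already spread to the even
   positions, i.e. shuf(k,0) and shuf(p,0). Spreading preserves order,
   so kZ < pZ is the same test as k < p. */
static float dot_morton(const float *Arow, const float *Bcol, uint32_t pZ)
{
    float sum = 0.0f;
    for (uint32_t kZ = 0; kZ < pZ; kZ = (kZ - 0x55555555u) & 0x55555555u)
        sum += Arow[kZ] * Bcol[kZ << 1];   /* shuf(0,k) = shuf(k,0) << 1 */
    return sum;
}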
added on the 2010-07-14 10:10:29 by ryg ryg
@Gargaj: Never understood why that (and store to EAX) wasn't possible...
added on the 2010-07-14 10:11:55 by raer raer
rarefluid: Well, Intel stopped extending the x87 instruction set ages ago. Nowadays it's just deprecated, but even pre-SSE, I think the main problem was that the encoding space for x87 is full. Worse, FPU opcodes use ModRM bytes (which are normally used to specify register targets) but then allocate the bits normally used for destination register specification as an opcode extension. You do not want to introduce a special destination register encoding that has a different form of ModRM byte, because that, together with the SIB decoding logic, is the one piece of the instruction decoder that's actually used without tons of special cases across the existing x86 instruction set. And you don't want to start allocating FPU opcodes from other ranges either; that would royally fuck up instruction decoder design - the kind of fuck-up that makes the critical path longer (i.e. decreases the max clock rate).

Outside 4ks where you want the shorter opcodes, the problem for all practical purposes doesn't exist anymore. Since SSE2 (i.e. P4 and after) you can movd between GPRs and the XMM regs.
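
With intrinsics that path looks roughly like this (just a sketch):

Code:
#include <emmintrin.h>   /* SSE2 */

/* int -> float without touching x87: typically compiles to movd + cvtdq2ps. */
static float int_to_float_sse2(int x)
{
    __m128i vi = _mm_cvtsi32_si128(x);   /* movd xmm, r32 */
    __m128  vf = _mm_cvtepi32_ps(vi);    /* cvtdq2ps */
    return _mm_cvtss_f32(vf);
}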
added on the 2010-07-14 10:34:02 by ryg ryg
ryg,

yep, I knew about that incremental Morton-order trick. It works well and it is fast. But what if you have to jump to a new location often, just to iterate over three or four elements? Then you have to interleave the bits of the new location again, and that becomes costly.

Think of the SVDs I calculated last year. If I remember right, a big part of it was the diagonal reduction step. The motion through the data was always along a rather thin band around the diagonal of the matrix.

Calling address = interleave(x,y) would be very convenient here...

Nice to see that the instruction made it into the Larrabee architecture, btw. Any idea what the 2:1 bit interleave pattern is used for?

Cheers,
Nils
added on the 2010-07-14 10:42:42 by torus torus
Ok. Didn't think about the decoders. The ModR/M encoding is seriously fucked up anyway...
added on the 2010-07-14 10:45:01 by raer raer
i always miss
Code: add.l (a0)+,(a1)+

on 68k (also for .w and .b, for all the other address regs, and with increment, decrement or neither of those behaviours on the address regs)

instead, i have to trash a reg and use at least 2 instructions. ;(
added on the 2010-07-14 10:51:46 by scicco scicco
oh, and the same goes for the missing sub support, of course
added on the 2010-07-14 10:53:01 by scicco scicco
6502:
Code:
ADD #imm      ; fuck CLC
MOV zp,#imm
DBNZX rel     ; instead of DEX / BNE
PHX / PHY


SuperH-2:
Code:
A decent barrel-shifter
MOVZX.B / MOVZX.W   ; instead of MOV / EXTU


x86:
Code:
MOVDQBLUBA XMM0,XMM1,[address]   ; Move DQ Byte Lookup Byte Aligned,
                                 ; i.e. set XMM0[0..7] = [address + (u8)XMM1[0..7]], etc.
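
In C terms, that would be a 16-way parallel byte table lookup - roughly this (just a sketch of what I mean):

Code:
#include <stdint.h>

/* Each destination byte is fetched from table[] at the offset given by
   the corresponding source byte - i.e. 16 byte lookups in one go. */
static void movdqbluba(uint8_t dst[16], const uint8_t idx[16], const uint8_t *table)
{
    for (int i = 0; i < 16; i++)
        dst[i] = table[idx[i]];
}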

added on the 2010-07-14 11:26:43 by mic mic
Quote:
Think of the SVDs I calculated last year. If I remember right, a big part of it was the diagonal reduction step. The motion through the data was always along a rather thin band around the diagonal of the matrix.

Calling address = interleave(x,y) would be very convenient here...

The first step in SVD computations is to reduce the matrix to bidiagonal form, in what is basically a variant of the QR decomposition, usually done with either Householder reflections or Givens rotations. Both work by rows/columns, and the matrix is dense at that point.

Nobody stores a bidiagonal matrix in 2D - Morton order or no, that's just asking for piss-poor cache usage. You just store the main diagonal and the super/subdiagonal as vectors.

The final solution step reduces the bidiagonal matrix to a diagonal matrix iteratively, and it needs to apply updates to the basis vectors as it goes along. But that step always goes right along the diagonal (or at most one row/col off it), so just having the increment is fine.

Quote:
Nice to see that the instruction made it into the Larrabee architecture, btw. Any idea what the 2:1 bit interleave pattern is used for?

I guess it's for 3D arrays.

Code:
; in: eax=X, ecx=Y, edx=Z
bitinterleave11 eax, ecx
bitinterleave21 eax, edx
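
In plain C, the pedestrian way to build the same kind of 3D Morton code is the usual mask cascade - a sketch (names mine), equivalent up to bit ordering to the bitinterleave11/bitinterleave21 pair:

Code:
#include <stdint.h>

/* Spread the low 10 bits of x so that bit i ends up in bit 3*i
   (two zero bits of gap - that's where the 2:1 pattern comes in). */
static uint32_t part1by2(uint32_t x)
{
    x &= 0x000003FFu;
    x = (x | (x << 16)) & 0xFF0000FFu;
    x = (x | (x <<  8)) & 0x0300F00Fu;
    x = (x | (x <<  4)) & 0x030C30C3u;
    x = (x | (x <<  2)) & 0x09249249u;
    return x;
}

/* 3D Morton code: x, y and z bits interleaved as ...z1y1x1z0y0x0. */
static uint32_t morton3(uint32_t x, uint32_t y, uint32_t z)
{
    return part1by2(x) | (part1by2(y) << 1) | (part1by2(z) << 2);
}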
added on the 2010-07-14 11:51:27 by ryg ryg
On the 68060, I have often missed CVALID - an MMU instruction which forces a cache line to go valid without reading any data from memory. And the instruction should not be privileged, of course.

Proper write merging would mostly do the trick, though.
added on the 2010-07-14 15:38:49 by Blueberry Blueberry
hm, I guess a nice shuffle instruction would speed up chunky-to-planar routines on the Amiga quite a bit, no?
added on the 2010-07-14 19:06:53 by arm1n arm1n
spike: only on slower cpus (ie less than a decent 040). Otherwise it's all about memory performance.
added on the 2010-07-14 19:21:27 by Psycho Psycho
I tend to miss "djnz a" on the z80. Or whatever isn't orthogonal in the instruction set (why can I add HL,DE but not add DE,HL ?)
Quote:
Which instructions would you add to a .. 6502 ..CPU if you had the chance?


None. All instructions are there (I'm sure ;-) - but I would definitely like to have them all execute in a single cycle, so I can split the rasters like hell and push my GTIA to proper bandwidth on my Atari without changing the rest of the chips.
added on the 2010-07-14 20:00:01 by JAC! JAC!
