How is GL's multmatrix so fast?

BeginTiming();
glLoadMatrix(a);
glMultMatrix(b);
glGetFloatv(GL_CURRENTMATRIX, c);
EndTiming();
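
For reference, on Win32 that pseudocode might be fleshed out roughly like this - a sketch only; QueryPerformanceCounter, the iteration count and GL_MODELVIEW_MATRIX are my own choices here, not the actual benchmark code:

/* Rough Win32 sketch of BeginTiming/EndTiming and the timed GL loop.
   Assumes a current GL context and GL_MODELVIEW matrix mode. */
#include <stdio.h>
#include <windows.h>
#include <GL/gl.h>

static LARGE_INTEGER t0, t1, freq;

void BeginTiming(void)
{
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
}

double EndTiming(void)   /* returns elapsed seconds */
{
    QueryPerformanceCounter(&t1);
    return (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
}

void TimeGlMult(const GLfloat a[16], const GLfloat b[16], GLfloat c[16], int iterations)
{
    int i;
    BeginTiming();
    for (i = 0; i < iterations; ++i) {
        glLoadMatrixf(a);
        glMultMatrixf(b);
    }
    /* reading the result back also forces any outstanding GL work to finish */
    glGetFloatv(GL_MODELVIEW_MATRIX, c);
    printf("%f million mults/sec\n", iterations / EndTiming() / 1e6);
}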

Have you timed it like this, or before getting the result back? You know, GL runs in a parallel thread…

should be logical, but we’re never sure

Well, I’ll be darned. It’s time for me to eat crow.

Knackered - I didn’t believe that there was no difference between my algorithm and yours, so I wrote up a little benchmark program to compare them.

To my surprise (and embarrassment), my algorithm came out last (no matter the compiler). Yours was second place and the OGL routines were indeed the fastest.

Here are the results of the program on my 1.3 GHz Athlon, using the .exe from the Intel compiler:

Zeno: 9.6 Million mults/sec
Knackered: 13.5 Million mults/sec
OGL: 22.8 Million mults/sec

If anyone else wants to try (or check my stuff) I put the source code and two .exe files up here:
www.sciencemeetsart.com/wade/temp/benchmark.zip
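
For anyone who doesn’t want to grab the zip, the kind of scalar routine being compared is roughly the following - a generic sketch in OpenGL’s column-major layout, not Zeno’s or knackered’s actual code (those differ mainly in how they unroll and order these loops):

/* Generic row-by-column 4x4 multiply, column-major as OpenGL expects.
   Element (row, col) of a matrix m lives at m[col * 4 + row]. */
void MatMult4x4(const float a[16], const float b[16], float r[16])
{
    int col, row, k;
    for (col = 0; col < 4; ++col)
        for (row = 0; row < 4; ++row) {
            float sum = 0.0f;
            for (k = 0; k < 4; ++k)
                sum += a[k * 4 + row] * b[col * 4 + k];
            r[col * 4 + row] = sum;
        }
}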

Learn something new every day

– Zeno

[This message has been edited by Zeno (edited 04-28-2002).]

just wanted to note that you don’t create any OpenGL context… surprised it didn’t crash

and, did you check that the results are actually correct? Possibly you just get an error back from the GL functions because there is no GL context… dunno
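
Something like this would make the check explicit - just a sketch; MatMult4x4 is the generic CPU routine sketched above, and whether glGetError even behaves sensibly without a context is exactly the open question:

/* Compare the GL result against a CPU reference; the tolerance is arbitrary. */
#include <math.h>
#include <windows.h>
#include <GL/gl.h>

int GlResultLooksOk(const float a[16], const float b[16], const float gl_result[16])
{
    float ref[16];
    int i;
    if (glGetError() != GL_NO_ERROR)
        return 0;                          /* GL itself flagged a problem */
    MatMult4x4(a, b, ref);                 /* CPU reference result */
    for (i = 0; i < 16; ++i)
        if (fabs(gl_result[i] - ref[i]) > 1e-4)
            return 0;
    return 1;
}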

Here is a 3DNow! 4x4 matrix multiply routine you can test, BUT I do not recall where it came from. I just happened to find a file called 3dnow4x4.asm in my “other” directory.
I sure hope the formatting is correct, since I don’t have a web page to post it on.

;******************************************************************************
; Routine: void _glMul_4x4 (float *r, const float *a, const float *b)
; Input: r - matrix (4x4) address
; a - matrix (4x4) address
; b - matrix (4x4) address
; Output: r = a * b, using standard matrix multiplication rules
; Uses: eax, ecx, edx, mm0 - mm7
; UPDATED 01/21/00
;
;******************************************************************************
ALIGN 32
PUBLIC __glMul_4x4
__glMul_4x4 PROC

r = 4 ;stack offset for result address
a = 8 ;stack offset for matrix a address
b = 12 ;stack offset for matrix b address

a_11 = 0 ;local stack frame layout
a_12 = 4
a_13 = 8
a_14 = 12

a_21 = 16
a_22 = 20
a_23 = 24
a_24 = 28

a_31 = 32
a_32 = 36
a_33 = 40
a_34 = 44

a_41 = 48
a_42 = 52
a_43 = 56
a_44 = 60

    femms
    mov         eax,[esp+a]			;source a
    mov         ecx,[esp+b]			;source b
    mov         edx,[esp+r]			;result r

    sub         esp,64				;T_ local work space to store temp results

    movd        mm0,[eax + a_21]    ;       | a_21
    movd        mm1,[eax + a_11]    ;       | a_11
    movd        mm6,[eax + a_12]    ;       | a_12
    punpckldq   mm1,mm0             ; a_21  | a_11  
    movd        mm5,[eax + a_22]    ;       | a_22
    pfmul       mm1,[ecx]           ; a_21 * b_12 | a_11 * b_11     
    punpckldq   mm6,mm5             ; a_22  | a_12      
    movd        mm7,[eax + a_32]    ;       | a_32
    movd        mm5,[eax + a_42]    ;       | a_42
    pfmul       mm6,[ecx]           ; a_22 * b_12 | a_12 * b_11     
    movd        mm2,[eax + a_31]    ;       | a_31
    punpckldq   mm7,mm5             ; a_42  | a_32
    movd        mm0,[eax + a_41]    ;       | a_41
    pfmul       mm7,[ecx+8]         ; a_42 * b_14 | a_32 * b_13
    punpckldq   mm2,mm0             ; a_41  | a_31
    pfadd       mm6,mm7             ; a_42 * b_14 + a_22 * b_12 | a_32 * b_13 + a_12 * b_11
    pfmul       mm2,[ecx+8]         ; a_41 * b_14 | a_31 * b_13
    pfacc       mm6,mm6             ;       | a_12 * b_11 + a_22 * b_12 + a_32 * b_13 + a_42 * b_14
    pfadd       mm1,mm2             ; a_21 * b_12 + a_41 * b_14 | a_11 * b_11 + a_31 * b_13
    movd        [esp+4],mm6         ; T_12
    pfacc       mm1,mm1             ;       | a_21 * b_12 + a_41 * b_14 + a_11 * b_11 + a_31 * b_13
    movd        [esp],mm1           ; T_11

    movd        mm0,[eax + a_23]    ;       | a_23
    movd        mm1,[eax + a_13]    ;       | a_13
    movd        mm6,[eax + a_14]    ;       | a_14
    punpckldq   mm1,mm0             ; a_23  | a_13  
    movd        mm5,[eax + a_24]    ;       | a_24
    pfmul       mm1,[ecx]           ; a_23 * b_12 | a_13 * b_11     
    punpckldq   mm6,mm5             ; a_24  | a_14      
    movd        mm7,[eax + a_34]    ;       | a_34
    movd        mm5,[eax + a_44]    ;       | a_44
    pfmul       mm6,[ecx]           ; a_24 * b_12 | a_14 * b_11     
    movd        mm2,[eax + a_33]    ;       | a_33
    punpckldq   mm7,mm5             ; a_44  | a_34
    movd        mm0,[eax + a_43]    ;       | a_43
    pfmul       mm7,[ecx+8]         ; a_44 * b_14 | a_34 * b_13
    punpckldq   mm2,mm0             ; a_43  | a_33
    pfadd       mm6,mm7				; a_44 * b_14 + a_24 * b_12 | a_34 * b_13 + a_14 * b_11
    pfmul       mm2,[ecx+8]         ; a_43 * b_14 | a_33 * b_13
    pfacc       mm6,mm6             ;       | a_44 * b_14 + a_24 * b_12 + a_34 * b_13 + a_14 * b_11
    pfadd       mm1,mm2             ; a_43 * b_14 + a_23 * b_12 | a_33 * b_13 + a_13 * b_11
    movd        [esp+12],mm6        ; T_14
    pfacc       mm1,mm1             ;       | a_43 * b_14 + a_23 * b_12 + a_33 * b_13 + a_13 * b_11
    movd        [esp+8],mm1			; T_13

    movd        mm0,[eax + a_21]    ;       | a_21
    movd        mm1,[eax + a_11]    ;       | a_11
    movd        mm6,[eax + a_12]    ;       | a_12
    punpckldq   mm1,mm0             ; a_21  | a_11  
    movd        mm5,[eax + a_22]    ;       | a_22
    pfmul       mm1,[ecx+16]        ; a_21 * b_22 | a_11 * b_21     
    punpckldq   mm6,mm5             ; a_22  | a_12      
    movd        mm7,[eax + a_32]    ;       | a_32
    movd        mm5,[eax + a_42]    ;       | a_42
    pfmul       mm6,[ecx+16]        ; a_22 * b_22 | a_12 * b_21     
    movd        mm2,[eax + a_31]    ;       | a_31
    punpckldq   mm7,mm5             ; a_42  | a_32
    movd        mm0,[eax + a_41]    ;       | a_41
    pfmul       mm7,[ecx+24]        ; a_42 * b_24 | a_32 * b_23
    punpckldq   mm2,mm0             ; a_41  | a_31
    pfadd       mm6,mm7				; a_42 * b_24 + a_22 * b_22 | a_32 * b_23 + a_12 * b_21
    pfmul       mm2,[ecx+24]        ; a_41 * b_24 | a_31 * b_23
    pfacc       mm6,mm6				;       | a_42 * b_24 + a_22 * b_22 + a_32 * b_23 + a_12 * b_21
    pfadd       mm1,mm2				; a_41 * b_24 + a_21 * b_22 | a_31 * b_23 + a_11 * b_21
    movd        [esp+20],mm6		; T_22
    pfacc       mm1,mm1				;		|a_41 * b_24 + a_21 * b_22 + a_31 * b_23 + a_11 * b_21
    movd        [esp+16],mm1		; T_21

    movd        mm0,[eax + a_23]    ;       | a_23
    movd        mm1,[eax + a_13]    ;       | a_13
    movd        mm6,[eax + a_14]    ;       | a_14
    punpckldq   mm1,mm0             ; a_23  | a_13  
    movd        mm5,[eax + a_24]    ;       | a_24
    pfmul       mm1,[ecx+16]        ; a_23 * b_22 | a_13 * b_21 
    punpckldq   mm6,mm5             ; a_24  | a_14      
    movd        mm7,[eax + a_34]    ;       | a_34
    movd        mm5,[eax + a_44]    ;       | a_44
    pfmul       mm6,[ecx+16]        ; a_24 * b_22 | a_14 * b_21     
    movd        mm2,[eax + a_33]    ;       | a_33
    punpckldq   mm7,mm5             ; a_44  | a_34
    movd        mm0,[eax + a_43]    ;       | a_43
    pfmul       mm7,[ecx+24]        ; a_44 * b_24 | a_34 * b_23
    punpckldq   mm2,mm0             ; a_43  | a_33
    pfadd       mm6,mm7				; a_24 * b_22 + a_44 * b_24 | a_14 * b_21 + a_34 * b_23
    pfmul       mm2,[ecx+24]        ; a_43 * b_24 | a_33 * b_23
    pfacc       mm6,mm6				;		|a_24 * b_22 + a_44 * b_24 + a_14 * b_21 + a_34 * b_23
    pfadd       mm1,mm2             ; a_43 * b_24 + a_23 * b_22 | a_33 * b_23 + a_13 * b_21
    movd        [esp+28],mm6        ; T_24
    pfacc       mm1,mm1             ;       | a_43 * b_24 + a_23 * b_22 + a_33 * b_23 + a_13 * b_21
    movd        [esp+24],mm1		; T_23

    movd        mm0,[eax + a_21]    ;       | a_21
    movd        mm1,[eax + a_11]    ;       | a_11
    movd        mm6,[eax + a_12]    ;       | a_12
    punpckldq   mm1,mm0             ; a_21  | a_11  
    movd        mm5,[eax + a_22]    ;       | a_22
    pfmul       mm1,[ecx+32]        ; a_21 * b_32 | a_11 * b_31     
    punpckldq   mm6,mm5             ; a_22  | a_12      
    movd        mm7,[eax + a_32]    ;       | a_32
    movd        mm5,[eax + a_42]    ;       | a_42
    pfmul       mm6,[ecx+32]        ; a_22 * b_32 | a_12 * b_31 
    movd        mm2,[eax + a_31]    ;       | a_31
    punpckldq   mm7,mm5             ; a_42  | a_32
    movd        mm0,[eax + a_41]    ;       | a_41
    pfmul       mm7,[ecx+40]        ; a_42 * b_34 | a_32 * b_33
    punpckldq   mm2,mm0             ; a_41  | a_31
    pfadd       mm6,mm7             ; a_42 * b_34 + a_22 * b_32 | a_32 * b_33 + a_12 * b_31
    pfmul       mm2,[ecx+40]        ; a_41 * b_34 | a_31 * b_33
    pfacc       mm6,mm6             ;       | a_42 * b_34 + a_22 * b_32 + a_32 * b_33 + a_12 * b_31
    pfadd       mm1,mm2             ; a_41 * b_34 + a_21 * b_32 | a_31 * b_33 + a_11 * b_31
    movd        [esp+36],mm6        ; T_32
    pfacc       mm1,mm1             ;       | a_41 * b_34 + a_21 * b_32 + a_31 * b_33 + a_11 * b_31
    movd        [esp+32],mm1		; T_31

    movd        mm0,[eax + a_23]    ;       | a_23
    movd        mm1,[eax + a_13]    ;       | a_13
    movd        mm6,[eax + a_14]    ;       | a_14
    punpckldq   mm1,mm0             ; a_23  | a_13  
    movd        mm5,[eax + a_24]    ;       | a_24
    pfmul       mm1,[ecx+32]        ; a_23 * b_32 | a_13 * b_31
    punpckldq   mm6,mm5             ; a_24  | a_14
    movd        mm7,[eax + a_34]    ;       | a_34
    movd        mm5,[eax + a_44]    ;       | a_44
    pfmul       mm6,[ecx+32]        ; a_24 * b_32 | a_14 * b_31
    movd        mm2,[eax + a_33]    ;       | a_33
    punpckldq   mm7,mm5             ; a_44  | a_34
    movd        mm0,[eax + a_43]    ;       | a_43
    pfmul       mm7,[ecx+40]        ; a_44 * b_34 | a_34 * b_33
    punpckldq   mm2,mm0             ; a_43  | a_33
    pfadd       mm6,mm7             ; a_44 * b_34 + a_24 * b_32 | a_34 * b_33 + a_14 * b_31
    pfmul       mm2,[ecx+40]        ; a_43 * b_34 | a_33 * b_33
    pfacc       mm6,mm6             ;       | a_44 * b_34 + a_24 * b_32 + a_34 * b_33 + a_14 * b_31
    pfadd       mm1,mm2             ; a_43 * b_34 + a_23 * b_32 | a_33 * b_33 + a_13 * b_31
    movd        [esp+44],mm6        ; T_34
    pfacc       mm1,mm1             ;       | a_43 * b_34 + a_23 * b_32 + a_33 * b_33 + a_13 * b_31
    movd        [esp+40],mm1		; T_33

    movd        mm0,[eax + a_21]    ;       | a_21
    movd        mm1,[eax + a_11]    ;       | a_11
    movd        mm6,[eax + a_12]    ;       | a_12
    punpckldq   mm1,mm0             ; a_21  | a_11  
    movd        mm5,[eax + a_22]    ;       | a_22
    pfmul       mm1,[ecx+48]        ; a_21 * b_42 | a_11 * b_41     
    punpckldq   mm6,mm5             ; a_22  | a_12      
    movd        mm7,[eax + a_32]    ;       | a_32
    movd        mm5,[eax + a_42]    ;       | a_42
    pfmul       mm6,[ecx+48]        ; a_22 * b_42 | a_12 * b_41     
    movd        mm2,[eax + a_31]    ;       | a_31
    punpckldq   mm7,mm5             ; a_42  | a_32
    movd        mm0,[eax + a_41]    ;       | a_41
    pfmul       mm7,[ecx+56]        ; a_42 * b_44 | a_32 * b_43
    punpckldq   mm2,mm0             ; a_41  | a_31
    pfadd       mm6,mm7				; a_42 * b_44 + a_22 * b_42 | a_32 * b_43 + a_12 * b_41
    pfmul       mm2,[ecx+56]        ; a_41 * b_44 | a_31 * b_43
    pfacc       mm6,mm6				;		|a_42 * b_44 + a_22 * b_42 + a_32 * b_43 + a_12 * b_41
    pfadd       mm1,mm2				; a_41 * b_44 + a_21 * b_42 | a_31 * b_43 + a_11 * b_41
    movd        [esp+52],mm6		; T_42
    pfacc       mm1,mm1				;		| a_41 * b_44 + a_21 * b_42 + a_31 * b_43 + a_11 * b_41
    movd        [esp+48],mm1		; T_41
    movd        mm0,[eax + a_23]    ;       | a_23
    movd        mm1,[eax + a_13]    ;       | a_13
    movd        mm6,[eax + a_14]    ;       | a_14
    punpckldq   mm1,mm0             ; a_23  | a_13  
    movd        mm5,[eax + a_24]    ;       | a_24
    pfmul       mm1,[ecx+48]        ; a_23 * b_42 | a_13 * b_41
    punpckldq   mm6,mm5             ; a_24  | a_14
    movd        mm7,[eax + a_34]    ;       | a_34
    movd        mm5,[eax + a_44]    ;       | a_44
    pfmul       mm6,[ecx+48]        ; a_24 * b_42 | a_14 * b_41
    movd        mm2,[eax + a_33]    ;       | a_33
    punpckldq   mm7,mm5             ; a_44  | a_34
    movd        mm0,[eax + a_43]    ;       | a_43
    pfmul       mm7,[ecx+56]        ; a_44 * b_44 | a_34 * b_43
    punpckldq   mm2,mm0             ; a_43  | a_33
    pfadd       mm6,mm7             ; a_44 * b_44 + a_24 * b_42 | a_34 * b_43 + a_14 * b_41
    pfmul       mm2,[ecx+56]        ; a_43 * b_44 | a_33 * b_43
    pfacc       mm6,mm6             ;       | a_44 * b_44 + a_24 * b_42 + a_34 * b_43 + a_14 * b_41
    pfadd       mm1,mm2             ; a_43 * b_44 + a_23 * b_42 | a_33 * b_43 + a_13 * b_41
    movd        [esp+60],mm6        ; T_44
    pfacc       mm1,mm1             ;       | a_43 * b_44 + a_23 * b_42 + a_33 * b_43 + a_13 * b_41
    movd        [esp+56],mm1		; T_43
    movq        mm3,[esp]			;MOVE FROM LOCAL TEMP MATRIX TO ADDRESS OF RESULT
    movq        mm4,[esp+8]
    movq        [edx],mm3
    movq        [edx+8],mm4

    movq        mm3,[esp+16]
    movq        mm4,[esp+24]
    movq        [edx+16],mm3
    movq        [edx+24],mm4

    movq        mm3,[esp+32]
    movq        mm4,[esp+40]
    movq        [edx+32],mm3
    movq        [edx+40],mm4

    movq        mm3,[esp+48]
    movq        mm4,[esp+56]
    movq        [edx+48],mm3
    movq        [edx+56],mm4

    add         esp,64
    femms

    ret

__glMul_4x4 ENDP
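
If anyone wants to drop this into the benchmark, the C-side declaration would be something along these lines - my guess from the stack offsets above (cdecl, arguments r, a, b), not part of the original file:

/* Hypothetical C prototype and usage for the routine above. */
extern void _glMul_4x4(float *r, const float *a, const float *b);

void Example3DNowMult(void)
{
    static float a[16], b[16], r[16];   /* fill a and b before calling */
    _glMul_4x4(r, a, b);                /* r = a * b */
}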

Daveperman wrote:

> Have you timed it like this, or before
> getting the result back? You know, GL runs
> in a parallel thread…

If it did, pretty much anything you did would take milliseconds because of the synchronization overhead. On most current hardware, it runs just as a library linked into your process space, talking to the hardware directly, and the “second entity” is the GPU running DMA.

Speaking of knackered’s “milliseconds”; I can see no way that you can spend MILLISECONDS on a simple matrix multiply. Not even on a hundred of them. There’s one million cycles in a millisecond (give or take). Benchmarks of routines at this level should be measured in CYCLES, and should specify the hardware used, as well as where source and destination reside before each iteration of the benchmark (RAM, L2, L1).
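
For what it’s worth, a cycle-level measurement can be sketched like this with rdtsc and MSVC inline asm - rdtsc is not serializing and the overhead isn’t subtracted, so treat the numbers as ballpark only (MatMult4x4 stands in for whichever routine you’re measuring):

/* Rough cycle count for one call of a matrix multiply routine. */
unsigned __int64 ReadTSC(void)
{
    unsigned __int64 t;
    __asm {
        rdtsc
        mov dword ptr [t], eax      /* low 32 bits of the timestamp counter */
        mov dword ptr [t + 4], edx  /* high 32 bits */
    }
    return t;
}

unsigned __int64 CyclesForOneMult(const float *a, const float *b, float *r)
{
    unsigned __int64 start = ReadTSC();
    MatMult4x4(a, b, r);            /* or the routine under test */
    return ReadTSC() - start;
}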

Saying that a matrix mult will “thrash the cache” is similarly out of whack with reality. Both an Athlon and a Pentium 4 fit a 4x4 float matrix in a single cache line, assuming it’s aligned; otherwise it takes two. A Pentium III needs two cache lines, or three in the unaligned case. Thus, if you really wanted to, you could conceivably fit all three matrices in the line fetch buffers (write combiners) on a P-III!

just wanted to note that you don’t create any OpenGL context… surprised it didn’t crash

Yeah, so was I, but it didn’t crash and gave the same result as the others, so I figured no context is needed for that particular call.

– Zeno

funny

Originally posted by jwatte:
Speaking of knackered’s “milliseconds”; I can see no way that you can spend MILLISECONDS on a simple matrix multiply. Not even on a hundred of them. There’s one million cycles in a millisecond (give or take). Benchmarks of routines at this level should be measured in CYCLES, and should specify the hardware used, as well as where source and destination reside before each iteration of the benchmark (RAM, L2, L1).

I don’t know much about caches and such things, Jwatte - I appreciate you educating me. The reason I’m talking in milliseconds is that I’m measuring the time before rendering anything, then measuring it again after the swapbuffer. In between those two measurements my scenegraph gets traversed, during which something like 70 to 80 matrix mults happen… now, if I use GL to multiply the matrices, the time spent is 16 milliseconds less than if I do the mults myself.
Hope that clears some things up.

Knackered,

That seems counter-intuitive, if that’s the only difference. 80 matmuls should never take 16 milliseconds, no way. Are you measuring over many frames and averaging? Which timing function do you use? On Windows, timeGetTime() and GetTickCount() are notoriously unreliable; they drop ticks under heavy load and give, at best, millisecond accuracy.

How about tracing through your matmul in the debugger to see if it goes off into a 100-times loop or something? How about trying it with a profiler? (Try the demo version of VTune from Intel’s developer site.)
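
Something along these lines would isolate just the matmult portion of the frame - a sketch; QueryPerformanceCounter plus averaging over many frames, with do_all_matmults standing in for whatever your scenegraph traversal calls:

/* Time only the matrix mults inside the frame, averaged over many frames. */
#include <stdio.h>
#include <windows.h>

static LARGE_INTEGER g_freq;
static double g_seconds = 0.0;
static int    g_frames  = 0;

void TimeMatMultsThisFrame(void (*do_all_matmults)(void))
{
    LARGE_INTEGER t0, t1;
    if (g_frames == 0)
        QueryPerformanceFrequency(&g_freq);
    QueryPerformanceCounter(&t0);
    do_all_matmults();                  /* the ~80 mults done during traversal */
    QueryPerformanceCounter(&t1);
    g_seconds += (double)(t1.QuadPart - t0.QuadPart) / (double)g_freq.QuadPart;
    if (++g_frames == 1000)
        printf("matmult time: %f ms per frame (avg over %d frames)\n",
               g_seconds / g_frames * 1000.0, g_frames);
}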

The thing is, the frame rate drops dramatically too - it’s physically apparent that it’s slower. I’m using GetTickCount; yes, I know it’s not as accurate as QueryPerformanceCounter, but I just banged it in to give me a quick measurement - and as I say, it’s a dramatic frame rate difference anyway.
No, I’m not going into some loop or other; the code is just as I detailed.
A mystery?

I gave the stuff a little benchmark.
I used QueryPerformanceCounter and switched optimizations off for the calling loop.
80 matmults with knackered’s original matmult function took about 0.8 microseconds, but one has to respect the overhead of the non-optimized loop. I also found that, in any case, if one copies the a and b matrices to two temporary ones and calculates with them, it gets a bit faster.
However, I only have a P2 350, so these results are probably irrelevant.
I also learnt that the FPU calculation time depends on the values you put in: when I didn’t give the matrices an initial value, it was 10 or even 100 times slower (presumably because uninitialized memory can contain denormals or NaNs, which the FPU handles much more slowly).

[This message has been edited by Michael Steinberg (edited 04-28-2002).]

2nd Edit:
I guess the whole benchmark is irrelevant. When I set initial values, the CPU probably fetches the three matrices into the cache and then only works from there.

[This message has been edited by Michael Steinberg (edited 04-28-2002).]

Long time no post

Here’s a link to another algorithm for performing matrix mults. Might be interesting to bench against some of these implementations:
http://lib-www.lanl.gov/numerical/bookcpdf/c2-11.pdf

Apologies if this alg. has already been covered above. I didn’t read through all the code in detail (esp. the assembly version). BTW, great thread.

Regards.

I’m a little surprised that OGL doesn’t lose simply because of all the API overhead. After all, you’ve passed in two matrices now, and they have to be all copied around and stuff…

Hint 1: The value of MatrixMode may affect OGL matrix performance. Some modes are probably faster/slower than others.

Hint 2: Nah, I won’t tell you, this should be too obvious.

  • Matt

I didn’t believe the results it gave me, so I looked at the app. (This has nothing to do with my Hint 2, BTW. That was something else.)

You haven’t set up a GL context or anything. Those entry points probably just point to a “RET” instruction!

Also, in a fair comparison, glGetFloatv is going to absolutely destroy the GL driver because of, e.g., big switch statement overhead.

  • Matt

I need to stop posting on this thread pretty soon…I keep looking like an idiot

Yes, Matt, the lack of a context must have made those functions no-ops. And, of course, the reason the right answer was appearing anyway is that I did the OGL test LAST, using the same arrays, so the answer was already there. Sigh.

I put up a new main and two new .exe files. Here are the results when I create a context using GLUT:

Zeno: 10.2 Million mults/sec
Knackered: 12.0 Million mults/sec
OGL: 1.7 Million mults/sec
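
For anyone reproducing this, the context setup is only a few lines of GLUT - a sketch of the idea, with RunBenchmarks standing in for the timing code; not necessarily the exact code in the new zip:

/* Minimal GLUT setup so the gl* calls hit a real driver. */
#include <GL/glut.h>

extern void RunBenchmarks(void);    /* hypothetical: runs the three timed loops */

int main(int argc, char **argv)
{
    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_RGB | GLUT_DOUBLE);
    glutCreateWindow("matmult benchmark");   /* creates a GL context and makes it current */
    glMatrixMode(GL_MODELVIEW);

    RunBenchmarks();
    return 0;
}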

Sorry for all the mistakes here, guys. At least we’re getting at the truth.

– Zeno

Then why am I getting this drop in frame rate, if my method is the fastest? I’ve told you the whole story, there’s nothing more I can add… the mult is inlined, too.

Very interesting thread.

Knackered: Is it possible you’re getting a processor stall due to a write-read pairing? Try dropping a small operation or two in between the MatMult and glLoadMatrix calls.

I’ve never run into a clear case of a dependent read stall, so I have no idea if this is what’s actually slowing you down.

Depends on what ASM the compiler is generating, I guess.

I might play with this today. It’s a very curious problem. 8)
– Jeff

Edit: I found the spaced version to be slightly slower than the tightly executed version. My bench (quick&dirty) also shows Knackered’s code outperforming the OpenGL version, at 150% of the OpenGL speed (nVidia’s 28.32 detonators on a GF2MX).

[This message has been edited by Thaellin (edited 04-30-2002).]

Zeno, no offense, but your benchmark is a little unfair.

I noticed a few things that are worth mentioning.

1.) The custom benchmarks don’t upload the results to OpenGL.

2.) The custom benchmarks always work on the same matrices, making them L1-cache-local after the first call (OK, an interrupt or a process context switch will kick them out once in a while).

3.) The OpenGL version reads the results back (most likely over the AGP bus). Why?

4.) glLoadMatrixf and glMultMatrixf make copies of the data, so it is unlikely to get L1 cache hits for successive calls.

Since I know it’s a lot easier to criticise someone else’s work than to do it better, I’ll put a new benchmark together when I get home.
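
Just to make points 1-4 concrete, the timed loops could look something like this - a sketch with made-up array sizes, not the benchmark I’ll be posting:

/* Walk a large array of matrices so successive iterations are not L1-resident,
   and hand the result to OpenGL instead of reading anything back. */
#include <windows.h>
#include <GL/gl.h>

#define NUM_MATRICES 4096               /* 4096 * 64 bytes = 256 KB, bigger than a typical L1 */

static float g_a[NUM_MATRICES][16];
static float g_b[NUM_MATRICES][16];

void BenchCustom(void)                  /* CPU multiply, then upload */
{
    float r[16];
    int i;
    for (i = 0; i < NUM_MATRICES; ++i) {
        MatMult4x4(g_a[i], g_b[i], r);  /* generic routine sketched earlier */
        glLoadMatrixf(r);               /* both paths end with the matrix in GL */
    }
}

void BenchOpenGL(void)                  /* let the driver do the multiply */
{
    int i;
    for (i = 0; i < NUM_MATRICES; ++i) {
        glLoadMatrixf(g_a[i]);
        glMultMatrixf(g_b[i]);          /* no glGetFloatv readback in the timed loop */
    }
}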

Regards,

LG

Zeno, no offense, but your benchmark is a little unfair.

None taken. Benchmarks are difficult to make fair and I don’t really have any experience.

  1. That’s true. I forgot that the idea was to eventually give these matrices to OpenGL, not just get the answer.

  2. Yes, that was actually on purpose. I wanted them to be in cache so I could see which one was more efficient without worrying as much about memory issues.

  3. The opengl one reads results back for the same reason as I mentioned in number 1). For whatever reason, I had it in my head that we wanted the answer on the CPU.

  4. True…but there’s no way around this, is there? I guess I could load once, then push and pop and mult many times.
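
Roughly something like this, I suppose - just a sketch, and whether it keeps the driver’s copy cache-hot is pure guesswork:

/* Load the base matrix once, then push/mult/pop repeatedly so only one matrix
   is re-sent per iteration. */
#include <windows.h>
#include <GL/gl.h>

void MultManyTimes(const float base[16], const float m[16], int count)
{
    int i;
    glLoadMatrixf(base);
    for (i = 0; i < count; ++i) {
        glPushMatrix();
        glMultMatrixf(m);
        glPopMatrix();          /* throw the product away; we only care about the cost of the mult */
    }
}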

Anyway, thanks for the comments. Feel free to use my timer code if you put a benchmark together. I think that part is right, at least.

– Zeno

Okidoki, I wrote a new benchmark that works on uncached data and uploads the resulting matrices to OpenGL.

On a 1.5 GHz P4 running W2K and the latest Detonators, I get the following (using the MS compiler):

Zeno’s: 1.92 Million iterations/s
Knackered: 1.31 Million iterations/s
OpenGL: 1.95 Million iterations/s

Looks different, eh?

Knackered, maybe you have vsync enabled and the extra cycles make you miss the next retrace? I mean, 16 ms is really a bummer and I don’t think the matrix code alone can cause that (and 16 ms smells like a 60 Hz refresh rate).
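
One quick way to rule vsync out in code (besides the driver control panel) is the swap-control extension - a sketch, assuming the driver exports wglSwapIntervalEXT:

/* Disable vsync via WGL_EXT_swap_control, if the driver exposes it.
   Call once after the GL context is current. */
#include <windows.h>
#include <GL/gl.h>

typedef BOOL (WINAPI *PFNWGLSWAPINTERVALEXTPROC)(int interval);

void DisableVsyncIfPossible(void)
{
    PFNWGLSWAPINTERVALEXTPROC wglSwapIntervalEXT =
        (PFNWGLSWAPINTERVALEXTPROC)wglGetProcAddress("wglSwapIntervalEXT");
    if (wglSwapIntervalEXT)
        wglSwapIntervalEXT(0);          /* 0 = don't wait for vertical retrace */
}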

Oh, you can grab the benchmark & source here.

EDIT:Fixed the URL

Regards,

LG

[This message has been edited by lgrosshennig (edited 05-01-2002).]