BeginTiming();
glLoadMatrixf(a);
glMultMatrixf(b);
glGetFloatv(GL_MODELVIEW_MATRIX, c);
EndTiming();
Have you tested it like this, or with the timing stopped before you read back the result? You know GL can run asynchronously…
It should be fine logically, but we’re never sure.
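For reference, a plain C version of the kind of 4x4 multiply being benchmarked in this thread might look like this (column-major to match OpenGL’s layout; the function name is made up, this is not anyone’s posted code):

```c
#include <string.h>

/* r = a * b, 4x4 column-major (OpenGL layout): element (row i, col j) is m[j*4 + i]. */
static void mat4_mul(float r[16], const float a[16], const float b[16])
{
    float t[16]; /* temp so r may alias a or b */
    for (int j = 0; j < 4; ++j)
        for (int i = 0; i < 4; ++i)
            t[j*4 + i] = a[0*4 + i] * b[j*4 + 0]
                       + a[1*4 + i] * b[j*4 + 1]
                       + a[2*4 + i] * b[j*4 + 2]
                       + a[3*4 + i] * b[j*4 + 3];
    memcpy(r, t, sizeof t);
}
```

Timing this against glMultMatrixf is what the rest of the thread is arguing about.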
Well, I’ll be darned. It’s time for me to eat crow.
Knackered - I didn’t believe that there was no difference between my algorithm and yours, so I wrote up a little benchmark program to compare them.
To my surprise (and embarrassment), my algorithm came out last (no matter the compiler). Yours was second place and the OGL routines were indeed the fastest.
Here are the results of the program from my Athlon 1.3 using the .exe from the Intel compiler:
Zeno: 9.6 Million mults/sec
Knackered: 13.5 Million mults/sec
OGL: 22.8 Million mults/sec
If anyone else wants to try (or check my stuff) I put the source code and two .exe files up here:
www.sciencemeetsart.com/wade/temp/benchmark.zip
Learn something new every day
– Zeno
[This message has been edited by Zeno (edited 04-28-2002).]
Just wanted to note that you don’t create an OpenGL context anywhere… I’m surprised it didn’t crash.
And did you check that the results are correct? Possibly you’re just getting an error back from the GL call because there’s no GL context… dunno.
Here is a 3DNow! 4x4 matrix multiply routine you can test, BUT I do not recall where it came from. I just happened to find a file called 3dnow4x4.asm in my “other” directory.
I sure hope the formatting is correct, since I don’t have a web page to post it on.
;******************************************************************************
; Routine: void _glMul_4x4 (const float *r, const float *a, const float *b)
; Input: r - matrix (4x4) address
; a - matrix (4x4) address
; b - matrix (4x4) address
; Output: r = a * b, using standard matrix multiplication rules
; Uses: eax, ecx, edx, mm0 - mm7
; UPDATED 01/21/00
;***************************************************************************
ALIGN 32
PUBLIC __glMul_4x4
__glMul_4x4 PROC

r = 4  ;stack offset for result address
a = 8  ;stack offset for matrix a address
b = 12 ;stack offset for matrix b address

a_11 = 0 ;local stack frame layout
a_12 = 4
a_13 = 8
a_14 = 12
a_21 = 16
a_22 = 20
a_23 = 24
a_24 = 28
a_31 = 32
a_32 = 36
a_33 = 40
a_34 = 44
a_41 = 48
a_42 = 52
a_43 = 56
a_44 = 60

femms
mov eax,[esp+a] ;source a
mov ecx,[esp+b] ;source b
mov edx,[esp+r] ;result r
sub esp,64 ;T_ local work space to store temp results

movd mm0,[eax + a_21] ; | a_21
movd mm1,[eax + a_11] ; | a_11
movd mm6,[eax + a_12] ; | a_12
punpckldq mm1,mm0 ; a_21 | a_11
movd mm5,[eax + a_22] ; | a_22
pfmul mm1,[ecx] ; a_21 * b_12 | a_11 * b_11
punpckldq mm6,mm5 ; a_22 | a_12
movd mm7,[eax + a_32] ; | a_32
movd mm5,[eax + a_42] ; | a_42
pfmul mm6,[ecx] ; a_22 * b_12 | a_12 * b_11
movd mm2,[eax + a_31] ; | a_31
punpckldq mm7,mm5 ; a_42 | a_32
movd mm0,[eax + a_41] ; | a_41
pfmul mm7,[ecx+8] ; a_42 * b_14 | a_32 * b13
punpckldq mm2,mm0 ; a_41 | a_31
pfadd mm6,mm7 ; a_42 * b_14 + a_22 * b_12 | a_32 * b13 + a_12 * b_11
pfmul mm2,[ecx+8] ; a_41 * b_14 | a_31 * b13
pfacc mm6,mm6 ; | a_12 * b_11 + a_22 * b_12 + a_32 * b_13 + a_42 * b_14
pfadd mm1,mm2 ; a_21 * b_12 + a_41 * b_14 | a_11 * b_11 + a_31 * b13
movd [esp+4],mm6 ; T_12
pfacc mm1,mm1 ; | a_21 * b_12 + a_41 * b_14 + a_11 * b_11 + a_31 * b13
movd [esp],mm1 ; T_11

movd mm0,[eax + a_23] ; | a_23
movd mm1,[eax + a_13] ; | a_13
movd mm6,[eax + a_14] ; | a_14
punpckldq mm1,mm0 ; a_23 | a_13
movd mm5,[eax + a_24] ; | a_24
pfmul mm1,[ecx] ; a_23 * b_12 | a_13 * b_11
punpckldq mm6,mm5 ; a_24 | a_14
movd mm7,[eax + a_34] ; | a_34
movd mm5,[eax + a_44] ; | a_44
pfmul mm6,[ecx] ; a_24 * b_12 | a_14 * b_11
movd mm2,[eax + a_33] ; | a_33
punpckldq mm7,mm5 ; a_44 | a_34
movd mm0,[eax + a_43] ; | a_43
pfmul mm7,[ecx+8] ; a_44 * b_14 | a_34 * b_13
punpckldq mm2,mm0 ; a_43 | a_33
pfadd mm6,mm7 ; a_44 * b_14 + a_24 * b_12 | a_34 * b_13 + a_14 * b_11
pfmul mm2,[ecx+8] ; a_43 * b_12 | a_33 * b11
pfacc mm6,mm6 ; | a_44 * b_14 + a_24 * b_12 + a_34 * b_13 + a_14 * b_11
pfadd mm1,mm2 ; a_43 * b_12 + a_23 * b_12 | a_33 * b11 + a_13 * b_11
movd [esp+12],mm6 ; T_14
pfacc mm1,mm1 ; | a_43 * b_12 + a_23 * b_12 + a_33 * b11 + a_13 * b_11
movd [esp+8],mm1 ; T_13

movd mm0,[eax + a_21] ; | a_21
movd mm1,[eax + a_11] ; | a_11
movd mm6,[eax + a_12] ; | a_12
punpckldq mm1,mm0 ; a_21 | a_11
movd mm5,[eax + a_22] ; | a_22
pfmul mm1,[ecx+16] ; a_21 * b_22 | a_11 * b_21
punpckldq mm6,mm5 ; a_22 | a_12
movd mm7,[eax + a_32] ; | a_32
movd mm5,[eax + a_42] ; | a_42
pfmul mm6,[ecx+16] ; a_22 * b_22 | a_12 * b_21
movd mm2,[eax + a_31] ; | a_31
punpckldq mm7,mm5 ; a_42 | a_32
movd mm0,[eax + a_41] ; | a_41
pfmul mm7,[ecx+24] ; a_42 * b_24 | a_32 * b_23
punpckldq mm2,mm0 ; a_41 | a_31
pfadd mm6,mm7 ; a_42 * b_24 + a_22 * b_22 | a_32 * b_23 + a_12 * b_21
pfmul mm2,[ecx+24] ; a_41 * b_24 | a_31 * b_23
pfacc mm6,mm6 ; | a_42 * b_24 + a_22 * b_22 + a_32 * b_23 + a_12 * b_21
pfadd mm1,mm2 ; a_41 * b_24 + a_21 * b_22 | a_31 * b_23 + a_11 * b_21
movd [esp+20],mm6 ; T_22
pfacc mm1,mm1 ; | a_41 * b_24 + a_21 * b_22 + a_31 * b_23 + a_11 * b_21
movd [esp+16],mm1 ; T_21

movd mm0,[eax + a_23] ; | a_23
movd mm1,[eax + a_13] ; | a_13
movd mm6,[eax + a_14] ; | a_14
punpckldq mm1,mm0 ; a_23 | a_13
movd mm5,[eax + a_24] ; | a_24
pfmul mm1,[ecx+16] ; a_23 * b_22 | a_13 * b_21
punpckldq mm6,mm5 ; a_24 | a_14
movd mm7,[eax + a_34] ; | a_34
movd mm5,[eax + a_44] ; | a_44
pfmul mm6,[ecx+16] ; a_24 * b_22 | a_14 * b_21
movd mm2,[eax + a_33] ; | a_33
punpckldq mm7,mm5 ; a_44 | a_34
movd mm0,[eax + a_43] ; | a_43
pfmul mm7,[ecx+24] ; a_44 * b_24 | a_34 * b_23
punpckldq mm2,mm0 ; a_43 | a_33
pfadd mm6,mm7 ; a_24 * b_22 + a_44 * b_24 | a_14 * b_21 + a_34 * b_23
pfmul mm2,[ecx+24] ; a_43 * b_24 | a_33 * b_23
pfacc mm6,mm6 ; | a_24 * b_22 + a_44 * b_24 + a_14 * b_21 + a_34 * b_23
pfadd mm1,mm2 ; a_43 * b_24 + a_23 * b_22 | a_33 * b_23 + a_14 * b_21
movd [esp+28],mm6 ; T_24
pfacc mm1,mm1 ; | a_43 * b_24 + a_23 * b_22 + a_33 * b_23 + a_14 * b_21
movd [esp+24],mm1 ; T_23

movd mm0,[eax + a_21] ; | a_21
movd mm1,[eax + a_11] ; | a_11
movd mm6,[eax + a_12] ; | a_12
punpckldq mm1,mm0 ; a_21 | a_11
movd mm5,[eax + a_22] ; | a_22
pfmul mm1,[ecx+32] ; a_21 * b_32 | a_11 * b_31
punpckldq mm6,mm5 ; a_22 | a_12
movd mm7,[eax + a_32] ; | a_32
movd mm5,[eax + a_42] ; | a_42
pfmul mm6,[ecx+32] ; a_22 * b_32 | a_12 * b_31
movd mm2,[eax + a_31] ; | a_31
punpckldq mm7,mm5 ; a_42 | a_32
movd mm0,[eax + a_41] ; | a_41
pfmul mm7,[ecx+40] ; a_42 * b_34 | a_32 * b33
punpckldq mm2,mm0 ; a_41 | a_31
pfadd mm6,mm7 ; a_42 * b_34 + a_22 * b_32 | a_32 * b33 + a_12 * b_31
pfmul mm2,[ecx+40] ; a_41 * b_34 | a_31 * b33
pfacc mm6,mm6 ; | a_42 * b_34 + a_22 * b_32 + a_32 * b33 + a_12 * b_31
pfadd mm1,mm2 ; a_41 * b_34 + a_21 * b_32 | a_31 * b33 + a_11 * b_31
movd [esp+36],mm6 ; T_32
pfacc mm1,mm1 ; | a_41 * b_34 + a_21 * b_32 + a_31 * b33 + a_11 * b_31
movd [esp+32],mm1 ; T_31

movd mm0,[eax + a_23] ; | a_23
movd mm1,[eax + a_13] ; | a_13
movd mm6,[eax + a_14] ; | a_14
punpckldq mm1,mm0 ; a_23 | a_13
movd mm5,[eax + a_24] ; | a_24
pfmul mm1,[ecx+32] ; a_23 * b_32 | a_13 * b_31
punpckldq mm6,mm5 ; a_24 | a_14
movd mm7,[eax + a_34] ; | a_34
movd mm5,[eax + a_44] ; | a_44
pfmul mm6,[ecx+32] ; a_24 * b_32 | a_14 * b_31
movd mm2,[eax + a_33] ; | a_33
punpckldq mm7,mm5 ; a_44 | a_34
movd mm0,[eax + a_43] ; | a_43
pfmul mm7,[ecx+40] ; a_44 * b_34 | a_34 * b_33
punpckldq mm2,mm0 ; a_43 | a_33
pfadd mm6,mm7 ; a_44 * b_34 + a_24 * b_32 | a_34 * b_33 + a_14 * b_31
pfmul mm2,[ecx+40] ; a_43 * b_34 | a_33 * b_33
pfacc mm6,mm6 ; | a_44 * b_34 + a_24 * b_32 + a_34 * b_33 + a_14 * b_31
pfadd mm1,mm2 ; a_43 * b_34 + a_23 * b_32 | a_33 * b_33 + a_13 * b_31
movd [esp+44],mm6 ; T_34
pfacc mm1,mm1 ; | a_43 * b_34 + a_23 * b_32 + a_33 * b_33 + a_13 * b_31
movd [esp+40],mm1 ; T_33

movd mm0,[eax + a_21] ; | a_21
movd mm1,[eax + a_11] ; | a_11
movd mm6,[eax + a_12] ; | a_12
punpckldq mm1,mm0 ; a_21 | a_11
movd mm5,[eax + a_22] ; | a_22
pfmul mm1,[ecx+48] ; a_21 * b_42 | a_11 * b_41
punpckldq mm6,mm5 ; a_22 | a_12
movd mm7,[eax + a_32] ; | a_32
movd mm5,[eax + a_42] ; | a_42
pfmul mm6,[ecx+48] ; a_22 * b_42 | a_12 * b_41
movd mm2,[eax + a_31] ; | a_31
punpckldq mm7,mm5 ; a_42 | a_32
movd mm0,[eax + a_41] ; | a_41
pfmul mm7,[ecx+56] ; a_42 * b_44 | a_32 * b_43
punpckldq mm2,mm0 ; a_41 | a_31
pfadd mm6,mm7 ; a_42 * b_44 + a_22 * b_42 | a_32 * b_43 + a_12 * b_41
pfmul mm2,[ecx+56] ; a_41 * b_44 | a_31 * b_43
pfacc mm6,mm6 ; | a_42 * b_44 + a_22 * b_42 + a_32 * b_43 + a_12 * b_41
pfadd mm1,mm2 ; a_41 * b_44 + a_21 * b_42 | a_31 * b_43 + a_11 * b_41
movd [esp+52],mm6 ; T_42
pfacc mm1,mm1 ; | a_41 * b_44 + a_21 * b_42 + a_31 * b_43 + a_11 * b_41
movd [esp+48],mm1 ; T_41

movd mm0,[eax + a_23] ; | a_23
movd mm1,[eax + a_13] ; | a_13
movd mm6,[eax + a_14] ; | a_14
punpckldq mm1,mm0 ; a_23 | a_13
movd mm5,[eax + a_24] ; | a_24
pfmul mm1,[ecx+48] ; a_23 * b_42 | a_13 * b_41
punpckldq mm6,mm5 ; a_24 | a_14
movd mm7,[eax + a_34] ; | a_34
movd mm5,[eax + a_44] ; | a_44
pfmul mm6,[ecx+48] ; a_24 * b_42 | a_14 * b_41
movd mm2,[eax + a_33] ; | a_33
punpckldq mm7,mm5 ; a_44 | a_34
movd mm0,[eax + a_43] ; | a_43
pfmul mm7,[ecx+56] ; a_44 * b_44 | a_34 * b_43
punpckldq mm2,mm0 ; a_43 | a_33
pfadd mm6,mm7 ; a_44 * b_44 + a_24 * b_42 | a_34 * b_43 + a_14 * b_41
pfmul mm2,[ecx+56] ; a_43 * b_44 | a_33 * b_43
pfacc mm6,mm6 ; | a_44 * b_44 + a_24 * b_42 + a_34 * b_43 + a_14 * b_41
pfadd mm1,mm2 ; a_43 * b_44 + a_23 * b_42 | a_33 * b_43 + a_13 * b_41
movd [esp+60],mm6 ; T_44
pfacc mm1,mm1 ; | a_43 * b_44 + a_23 * b_42 + a_33 * b_43 + a_13 * b_41
movd [esp+56],mm1 ; T_43

movq mm3,[esp] ;MOVE FROM LOCAL TEMP MATRIX TO ADDRESS OF RESULT
movq mm4,[esp+8]
movq [edx],mm3
movq [edx+8],mm4
movq mm3,[esp+16]
movq mm4,[esp+24]
movq [edx+16],mm3
movq [edx+24],mm4
movq mm3,[esp+32]
movq mm4,[esp+40]
movq [edx+32],mm3
movq [edx+40],mm4
movq mm3,[esp+48]
movq mm4,[esp+56]
movq [edx+48],mm3
movq [edx+56],mm4
add esp,64
femms
ret
__glMul_4x4 ENDP
Daveperman wrote:
> have you tested it like this or before the
> getting of the result? you know gl runs in a
> parallel thread…
If it did, pretty much anything you did would take milliseconds because of the synchronization overhead. On most current hardware, it runs just as a library linked into your process space, talking to the hardware directly, and the “second entity” is the GPU running DMA.
Speaking of knackered’s “milliseconds”; I can see no way that you can spend MILLISECONDS on a simple matrix multiply. Not even on a hundred of them. There’s one million cycles in a millisecond (give or take). Benchmarks of routines at this level should be measured in CYCLES, and should specify the hardware used, as well as where source and destination reside before each iteration of the benchmark (RAM, L2, L1).
Saying that a matrix mult will “thrash the cache” is similarly out of whack with reality. An Athlon or a Pentium IV fits a 4x4 float matrix in a single cache line, assuming it’s aligned; otherwise it’s two. A Pentium III needs two cache lines, or three in the unaligned case. Thus, if you really wanted, you could conceivably fit all three matrices in line fetch buffers (write combiners) on a P-III!
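That cache-line arithmetic can be checked mechanically: a 4x4 float matrix is 64 bytes, so its footprint depends only on line size and alignment (64-byte lines for the Athlon/P4 case and 32-byte lines for the P-III case are assumptions about those parts):

```c
/* How many cache lines an object of `size` bytes spans, given its starting
   address (only the offset within a line matters) and the line size in bytes. */
static unsigned lines_spanned(unsigned long addr, unsigned long size, unsigned long line)
{
    unsigned long first = addr / line;
    unsigned long last  = (addr + size - 1) / line;
    return (unsigned)(last - first + 1);
}
```

With a 64-byte matrix this gives 1 line aligned / 2 unaligned at a 64-byte line size, and 2 / 3 at a 32-byte line size — matching the figures above.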
Just wanted to note that you don’t create an OpenGL context anywhere… I’m surprised it didn’t crash.
Yeah, so was I - but it didn’t crash and gave the same result as the others, so I figured no context is needed for that particular call.
– Zeno
funny
Originally posted by jwatte:
Speaking of knackered’s “milliseconds”; I can see no way that you can spend MILLISECONDS on a simple matrix multiply. Not even on a hundred of them. There’s one million cycles in a millisecond (give or take). Benchmarks of routines at this level should be measured in CYCLES, and should specify the hardware used, as well as where source and destination reside before each iteration of the benchmark (RAM, L2, L1).
I don’t know much about caches and such things, Jwatte - I appreciate you educating me. The reason I’m talking in milliseconds is that I’m measuring the time before rendering anything, then measuring it again after the swapbuffer - in between those two measurements my scenegraph gets traversed, during which something like 70 to 80 matrix mults happen. Now, if I use GL to multiply the matrices, the time spent is 16 milliseconds less than if I do the mults myself.
Hope that clears some things up.
Knackered,
That seems unintuitive, if that’s the only difference. 80 matmuls should never take 16 milliseconds, no way. Are you measuring over many frames and averaging? Which timing function do you use? On Windows, timeGetTime() and GetTickCount() are notoriously unreliable; they drop ticks under heavy load and give, at best, millisecond accuracy.
How about tracing through your matmul in the debugger and see if it goes off in a 100-times loop or something? How about trying it with a profiler? (try getting the demo version of VTune from Intel’s developer site)
The thing is, the frame rate drops dramatically too - it’s physically apparent that it’s slower. I’m using GetTickCount - yes, I know it’s not as accurate as QueryPerformanceCounter, but I just banged it in to give me a quick measurement. As I say, it’s a dramatic frame rate difference anyway.
No, I’m not going into some loop or other, the code is just as I detailed.
A mystery?
I gave the stuff a little benchmark.
I used QueryPerformanceCounter and switched optimizations off for the calling loop.
80 matmults with knackered’s original matmult function took about 0.8 microseconds, but one has to respect the overhead of the non-optimized loop. I also found that, in any case, if one copies the a and b matrices to two temporary ones and calculates with those, it gets a bit faster.
However, I only have a P2 350, so these results are probably irrelevant.
I also learnt that the FPU calculation time depends on the values you put in. When I didn’t give the matrices an initial value, it was 10 or even 100 times slower.
[This message has been edited by Michael Steinberg (edited 04-28-2002).]
2nd Edit:
I guess the whole benchmark is irrelevant. When I set initial values, the CPU probably fetches the three matrices into the cache, so it’s probably only working from cache then.
[This message has been edited by Michael Steinberg (edited 04-28-2002).]
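A possible explanation for the values-dependent FPU timing observed above (my speculation, not something established in the thread) is that uninitialized stack garbage reinterpreted as floats often forms denormals or NaNs, which x86 FPUs of that era handle through a slow microcode path. A small sketch for classifying such operands (the helper name is made up):

```c
#include <math.h>

/* A denormal (subnormal) float is nonzero but smaller than FLT_MIN;
   arithmetic on denormals and NaNs takes a slow path on many FPUs. */
static int is_slow_operand(float x)
{
    int c = fpclassify(x);
    return c == FP_SUBNORMAL || c == FP_NAN;
}
```

Initializing the matrices to ordinary values avoids both effects, which is consistent with the 10-100x swing reported.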
Long time no post
Here’s a link to another algorithm for performing matrix mults. Might be interesting to bench against some of these implementations:
http://lib-www.lanl.gov/numerical/bookcpdf/c2-11.pdf
Apologies if this alg. has already been covered above. I didn’t read through all the code in detail (esp. the assembly version). BTW, great thread.
Regards.
I’m a little surprised that OGL doesn’t lose simply because of all the API overhead. After all, you’ve passed in two matrices now, and they have to be all copied around and stuff…
Hint 1: The value of MatrixMode may affect OGL matrix performance. Some modes are probably faster/slower than others.
Hint 2: Nah, I won’t tell you, this should be too obvious.
I didn’t believe the results it gave me, so I looked at the app. (This has nothing to do with my Hint 2, BTW. That was something else.)
You haven’t set up a GL context or anything. Those entry points probably just point to a “RET” instruction!
Also, in a fair comparison, glGetFloatv is going to absolutely destroy the GL driver because of, e.g., big switch statement overhead.
I need to stop posting on this thread pretty soon…I keep looking like an idiot
Yes, Matt, the lack of a context must have made those functions no-ops. And, of course, the reason the right answer was appearing anyway is that I did the OGL test LAST, using the same arrays, so the answer was already there. Sigh.
I put up a new main and two new .exe files. Here are the results when I create a context using GLUT:
Zeno: 10.2 Million mults/sec
Knackered: 12.0 Million mults/sec
OGL: 1.7 Million mults/sec
Sorry for all the mistakes here, guys. At least we’re getting at the truth.
– Zeno
Then why am I getting this drop in frame rate, if my method is the fastest? I’ve told you the whole story, there’s nothing more I can add… the mult is inlined too.
Very interesting thread.
Knackered: Is it possible you’re getting a processor stall due to a write-read pairing? Try dropping a /small/ operation or two in between the MatMult and glLoadMatrix calls.
I’ve never run into a clear case of a dependent read stall, so I have no idea if this is what’s actually slowing you down.
Depends on what ASM the compiler is generating, I guess.
I might play with this today. It’s a very curious problem. 8)
– Jeff
Edit: I found the spaced version to be slightly slower than the tightly executed version. My bench (quick&dirty) also shows Knackered’s code outperforming the OpenGL version, at 150% of the OpenGL speed (nVidia’s 28.32 detonators on a GF2MX).
[This message has been edited by Thaellin (edited 04-30-2002).]
Zeno, no offense, but your benchmark is a little unfair.
I noticed a few things that are worth mentioning.
1.) The custom benchmarks don’t upload the results to OpenGL.
2.) The custom benchmarks always work on the same matrices, making them L1-cache-local after the first call (OK, an interrupt or a process context switch will kick them out once in a while).
3.) The OpenGL version reads the results back (most likely over the AGP bus). Why?
4.) glLoadMatrixf and glMultMatrixf make copies of the data, so it is unlikely to get L1 cache hits for successive calls.
Since I know it’s a lot easier to criticize someone else’s work than to do it better, I’ll put a new benchmark together when I get home.
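One way to address point 2 - the matrices becoming L1-resident after the first call - is to stride through pools of matrices much larger than the cache, something like this sketch (the pool size and the plain C multiply are stand-ins, not anyone’s posted code):

```c
#include <string.h>

/* Enough 64-byte matrices that successive iterations cannot all sit in L1
   (4096 * 64 B = 256 KB per pool; L1 data caches of that era were 8-64 KB). */
#define POOL 4096

/* Plain column-major 4x4 multiply, standing in for whichever routine is under test. */
static void mat4_mul(float r[16], const float a[16], const float b[16])
{
    float t[16];
    for (int j = 0; j < 4; ++j)
        for (int i = 0; i < 4; ++i)
            t[j*4 + i] = a[i]      * b[j*4]
                       + a[4 + i]  * b[j*4 + 1]
                       + a[8 + i]  * b[j*4 + 2]
                       + a[12 + i] * b[j*4 + 3];
    memcpy(r, t, sizeof t);
}

/* One benchmark pass: each iteration reads matrices the recent ones didn't touch,
   so the multiply is measured against memory, not against a warm L1. */
static void bench_pass(float (*as)[16], float (*bs)[16], float (*rs)[16], int iters)
{
    for (int i = 0; i < iters; ++i)
        mat4_mul(rs[i % POOL], as[i % POOL], bs[i % POOL]);
}
```

Whether you want the warm-cache or cold-cache number depends on which question you are asking, which is exactly why the two benchmarks in this thread disagree.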
Regards,
LG
Zeno, no offense, but your benchmark is a little unfair.
None taken. Benchmarks are difficult to make fair and I don’t really have any experience.
That’s true. I forgot that the idea was to eventually give these matrices to opengl, not just get the answer.
Yes, that was actually on purpose. I wanted them to be in cache so I could see which one was more efficient without worrying as much about memory issues.
The opengl one reads results back for the same reason as I mentioned in number 1). For whatever reason, I had it in my head that we wanted the answer on the CPU.
True…but there’s no way around this, is there? I guess I could load once, then push and pop and mult many times.
Anyway, thanks for the comments. Feel free to use my timer code if you put a benchmark together. I think that part is right at least
– Zeno
Okidoki, I wrote a new benchmark that works on uncached data and uploads the resulting matrices to OpenGL.
On a 1.5Ghz P4 running W2K & latest Detonators I get the following (using the MS compiler):
Zenos: 1.92 Million iterations/s
Knackered: 1.31 Million iterations/s
OpenGL: 1.95 Million iterations/s
Looks different eh?
Knackered, maybe you have vsync enabled and the extra cycles make you miss the next retrace? I mean, 16 ms is really a bummer and I don’t think the matrix code alone can cause that (and 16 ms smells like a 60 Hz refresh rate).
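The 60 Hz arithmetic, spelled out: at 60 Hz the frame budget is 1000/60 ≈ 16.7 ms, and with vsync a frame that misses its retrace waits for the next one, so a few extra cycles of work can cost a whole extra period. A toy model (the simple round-up behaviour is my simplifying assumption):

```c
/* Toy model: with vsync on, a frame becomes visible only at the next retrace,
   so its effective time is the work time rounded up to a whole refresh period. */
static double vsynced_frame_ms(double work_ms, double refresh_hz)
{
    double period_ms = 1000.0 / refresh_hz;       /* ~16.67 ms at 60 Hz */
    int periods = (int)(work_ms / period_ms) + 1; /* wait for the next retrace */
    return periods * period_ms;
}
```

At 60 Hz, 10 ms of work still shows a 16.7 ms frame, but push the work to 17 ms and the frame jumps to 33.3 ms - which would look exactly like a sudden 16 ms difference from a handful of matrix mults.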
Oh you can grab the benchmark & source here
EDIT:Fixed the URL
Regards,
LG
[This message has been edited by lgrosshennig (edited 05-01-2002).]