bf-blender v2.33a / intel compiler optimized (windows)

User-contributed CVS development builds. Please test and give feedback!

Moderators: jesterKing, stiv

Antares
Posts: 0
Joined: Fri Nov 14, 2003 4:04 pm

Post by Antares »

There are a lot of good points in what you wrote.
Actually, I've already done a little research on using aligned memory in Blender.

The problem is that many structures have to be reordered to fit the required alignment.

One piece of good news is that Blender's memory management goes through a centralized module, which could be changed so that it returns, for instance, 128-bit aligned memory blocks.
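To make the idea concrete, here is a minimal sketch of how a central allocator could hand out aligned blocks. The function names are hypothetical and are not Blender's actual allocator API; the sketch simply over-allocates with malloc and rounds the address up, and it does not guard against size overflow.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper names; not Blender's actual allocator API. */

/* Return a block of `size` bytes whose address is a multiple of `align`
 * (a power of two, e.g. 16 for 128-bit SSE loads). The pointer that
 * malloc returned is stored just before the aligned block so that the
 * matching free routine can find it again. */
void *aligned_malloc_sketch(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);

    /* Room for the payload, worst-case padding, and the saved pointer. */
    void *raw = malloc(size + align - 1 + sizeof(void *));
    if (raw == NULL)
        return NULL;

    uintptr_t start = (uintptr_t)raw + sizeof(void *);
    uintptr_t aligned = (start + align - 1) & ~(uintptr_t)(align - 1);

    ((void **)aligned)[-1] = raw;   /* remember the original pointer */
    return (void *)aligned;
}

/* Free a block obtained from aligned_malloc_sketch. */
void aligned_free_sketch(void *ptr)
{
    if (ptr != NULL)
        free(((void **)ptr)[-1]);
}
```

Because every allocation would pass through one module, only that module needs to change; callers keep using the same entry points and get 16-byte aligned blocks for free.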

I've also optimized other functions, like the 4x4 matrix-vector multiplication.

Code:

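/* Computes vec = vec * mat (row vector times 4x4 matrix), in place.
 * Both mat and vec must be 16-byte aligned (movaps faults otherwise).
 * Uses SSE1 instructions only (movaps/movss instead of the SSE2
 * movdqa/movd), so it also runs on a PIII. */
void _fastcall Mat4MulVec4fl( float mat[][4], float *vec)
{
_asm {
	mov			eax,	mat
	mov			ebx,	vec

	; load the four matrix rows
	movaps		xmm4,	xmmword ptr [eax]
	movaps		xmm5,	xmmword ptr [eax+16]
	movaps		xmm6,	xmmword ptr [eax+32]
	movaps		xmm7,	xmmword ptr [eax+48]

	; load each vector component and broadcast it to all four lanes
	movss		xmm0,	dword ptr [ebx]
	movss		xmm1,	dword ptr [ebx+4]
	movss		xmm2,	dword ptr [ebx+8]
	movss		xmm3,	dword ptr [ebx+12]
	shufps		xmm0,	xmm0,	0
	shufps		xmm1,	xmm1,	0
	shufps		xmm2,	xmm2,	0
	shufps		xmm3,	xmm3,	0

	; scale each row by its vector component
	mulps		xmm0,	xmm4
	mulps		xmm1,	xmm5
	mulps		xmm2,	xmm6
	mulps		xmm3,	xmm7

	; sum the four scaled rows
	addps		xmm2,	xmm3
	addps		xmm1,	xmm2
	addps		xmm0,	xmm1

	; write the result back over vec
	movaps		xmmword ptr[ebx],	xmm0
}
}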
This function is at least two orders of magnitude faster than typical versions that are not hand-optimized (built with /O2 and similar settings), and also about 10 to 20% faster than the code generated by the Intel compiler.
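For comparison, the assembly routine above computes the row vector times the matrix, in place. A plain C sketch of the same arithmetic could look like the following; the name mat4_mul_vec4_ref is made up for illustration, and this is not claimed to be the version that was benchmarked.

```c
#include <assert.h>

/* Hypothetical reference routine: vec = vec * mat (row vector times
 * 4x4 matrix), written in plain C. Each output component j is the sum
 * over rows i of vec[i] * mat[i][j]. */
void mat4_mul_vec4_ref(float mat[][4], float *vec)
{
    assert(mat && vec);

    /* Copy the inputs first, because vec is overwritten in place. */
    const float x = vec[0], y = vec[1], z = vec[2], w = vec[3];
    int j;

    for (j = 0; j < 4; j++)
        vec[j] = x * mat[0][j] + y * mat[1][j] + z * mat[2][j] + w * mat[3][j];
}
```

The SSE version adds the four products in a different order, so the two can differ in the last bits of a float; a comparison between them should allow a small tolerance.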

I haven't had time to really contribute this work to the Blender project.

To your PS:
My build can also be used on a PIII with SSE only, since SSE already handles SIMD single-precision floating-point operations. SSE2 mainly adds SIMD support for double precision (and 128-bit integer operations). So having SSE is what matters, and its successor won't help a lot.

SpookyElectric
Posts: 0
Joined: Thu Aug 19, 2004 11:28 am

Post by SpookyElectric »

Maybe alignment can be a lower priority, since there is an instruction to load unaligned data (movups). It's just presumably slower.
Multiplying matrices doesn't happen all that often. It's the vector * matrix case, and specifically loops of these, that gets called often enough that manual optimizations will (I think) have a noticeable effect on the overall speed.
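A rough sketch of both points, using compiler intrinsics rather than inline assembly so it builds with gcc as well: the matrix rows are loaded once outside the loop, and unaligned loads and stores (_mm_loadu_ps / _mm_storeu_ps, which map to movups) are used so no particular alignment is required. The function name and the HAVE_SSE_SKETCH macro are invented for this example, and a scalar fallback is included for targets without SSE.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>          /* SSE1 intrinsics */
#define HAVE_SSE_SKETCH 1       /* hypothetical macro for this example */
#endif

/* Hypothetical batch routine: transforms `count` consecutive 4-float
 * vectors in place, v = v * mat for each of them. */
void mat4_mul_vec4_array(float mat[][4], float *vecs, size_t count)
{
    size_t i;
    assert(mat && vecs);

#ifdef HAVE_SSE_SKETCH
    /* Load the four rows once; unaligned loads accept any address. */
    const __m128 r0 = _mm_loadu_ps(mat[0]);
    const __m128 r1 = _mm_loadu_ps(mat[1]);
    const __m128 r2 = _mm_loadu_ps(mat[2]);
    const __m128 r3 = _mm_loadu_ps(mat[3]);

    for (i = 0; i < count; i++) {
        float *v = vecs + 4 * i;
        /* _mm_set1_ps broadcasts one component to all four lanes. */
        __m128 acc = _mm_mul_ps(_mm_set1_ps(v[0]), r0);
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(v[1]), r1));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(v[2]), r2));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(v[3]), r3));
        _mm_storeu_ps(v, acc);  /* unaligned store of the result */
    }
#else
    /* Scalar fallback with the same arithmetic. */
    for (i = 0; i < count; i++) {
        float *v = vecs + 4 * i;
        const float x = v[0], y = v[1], z = v[2], w = v[3];
        int j;
        for (j = 0; j < 4; j++)
            v[j] = x * mat[0][j] + y * mat[1][j] + z * mat[2][j] + w * mat[3][j];
    }
#endif
}
```

Whether the unaligned variant is measurably slower than an aligned one would have to be benchmarked on the target CPU; the sketch only shows that the loop itself does not depend on alignment.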

Anyways, lack of time plagues us all. When I get some time (which won't be all that soon) I'll look into things some more on my end, in the UN*X/gcc world.
