bf-blender v2.33a / intel compiler optimized (windows)

User-contributed CVS development builds. Please test and give feedback!

Moderators: jesterKing, stiv

Antares
Posts: 0
Joined: Fri Nov 14, 2003 4:04 pm

bf-blender v2.33a / intel compiler optimized (windows)

Post by Antares »

hi all

Ton told me to create a thread about this topic here. Perhaps some of you already know the builds I did and still do.

My Blender builds use the special Intel compiler, which applies basic
improvements like rewriting and restructuring code, jump prediction,
loop unrolling, parallel instruction execution using the pipelining
technology of modern Intel and AMD processors, and executing floating point
operations in parallel with integer operations when possible.

But the main advantage comes from using the more specialized instruction
sets of Intel (and AMD) processors, like SSE (Streaming SIMD Extensions) and
SSE2. These instruction sets, together with 8x 128-bit registers, are the
direct successors of the MMX technology, which introduced SIMD (Single
Instruction Multiple Data) execution. This means a CPU can perform the same
operation (like adding X to every element of an array Y in a loop) on data that
lies close together in memory and is logically connected.

Because Blender uses single precision floating point values for its calculations, a
processor supporting SSE2 can perform such an operation on four values at once in three steps
(load, execute, store), where the scalar version normally requires at least 12 steps.
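
To make this concrete, here is a minimal sketch in C (just an illustration of the idea, not code from my build) that uses SSE intrinsics to add a constant X to an array of floats four elements at a time:

Code:

#include <xmmintrin.h>  /* SSE intrinsics, available in MSVC and gcc */

/* Illustration only: add the constant x to every element of y[0..n-1],
   four floats per instruction. Assumes n is a multiple of 4. */
void add_x_to_array(float *y, float x, int n)
{
	__m128 vx = _mm_set1_ps(x);            /* broadcast x into all four lanes */
	int i;

	for (i = 0; i < n; i += 4) {
		__m128 vy = _mm_loadu_ps(y + i);   /* load four floats at once */
		vy = _mm_add_ps(vy, vx);           /* four additions in one instruction */
		_mm_storeu_ps(y + i, vy);          /* store four results */
	}
}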

My Blender version is also compiled with automated CPU dispatching, which
means it checks at runtime what CPU it is running on and uses the appropriate
function set, which then uses the specific instruction sets for a performance
increase.

This means my builds run on ANY x86 CPU, including Pentium through Pentium IV and AMD
processors.
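
A simplified sketch of what the dispatching boils down to (the compiler generates this glue automatically; the function below is only an illustration, not code from the build): query the CPUID feature bits once, then call the matching code path.

Code:

/* Illustration only (MSVC inline assembly, 32-bit): detect SSE2 support at runtime. */
int cpu_has_sse2(void)
{
	unsigned int features = 0;

	_asm {
		push	ebx             ; cpuid clobbers ebx
		mov		eax, 1          ; CPUID function 1: feature flags
		cpuid
		mov		features, edx
		pop		ebx
	}

	return (features >> 26) & 1;    /* EDX bit 26 = SSE2 (bit 25 = SSE) */
}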

Main compiler settings (Intel compiler, used from within MSVC++ 6) are:
/QaxW: automated CPU dispatching and usage of SSE(2) if available
/O3: optimize for maximum speed and enable high-level optimizations
/Ob2: inline any function, at the compiler's discretion

These settings won't work for the following modules:
FTF_ftfont
KX_ketsji
SG_SceneGraph

I also use Intel compiler specific profiling for the Blender kernel functions (module:
BKE_blenkernel) and the primary rendering functions (module:
BRE_render), which also increases performance and decreases the size of the
binary a little bit.

All this normally cuts render times by 20-60% and improves GUI
responsiveness a little bit.

A fully functional one-month demo version of this compiler is available
at www.intel.com. All the important documentation on how to include and invoke
the compiler is included in the demo version; setting it up can be done within 30
seconds.

Link to discussion thread at www.elysiun.com:
http://www.elysiun.com/forum/viewtopic.php?t=17626

Link to windows-binary:
http://www.silentthunder.de/files/blenderintel.zip



Some other efforts of mine in this area involve rewriting some Blender functions in assembler. One of these was the following function:

Code:

void MTC_Mat4MulMat4(float m1[][4], float m2[][4], float m3[][4])
{
  /* matrix product: c[j][k] = a[j][i].b[i][k] */
	m1[0][0] = m2[0][0]*m3[0][0] + m2[0][1]*m3[1][0] + m2[0][2]*m3[2][0] + m2[0][3]*m3[3][0];
	m1[0][1] = m2[0][0]*m3[0][1] + m2[0][1]*m3[1][1] + m2[0][2]*m3[2][1] + m2[0][3]*m3[3][1];
	m1[0][2] = m2[0][0]*m3[0][2] + m2[0][1]*m3[1][2] + m2[0][2]*m3[2][2] + m2[0][3]*m3[3][2];
	m1[0][3] = m2[0][0]*m3[0][3] + m2[0][1]*m3[1][3] + m2[0][2]*m3[2][3] + m2[0][3]*m3[3][3];

	m1[1][0] = m2[1][0]*m3[0][0] + m2[1][1]*m3[1][0] + m2[1][2]*m3[2][0] + m2[1][3]*m3[3][0];
	m1[1][1] = m2[1][0]*m3[0][1] + m2[1][1]*m3[1][1] + m2[1][2]*m3[2][1] + m2[1][3]*m3[3][1];
	m1[1][2] = m2[1][0]*m3[0][2] + m2[1][1]*m3[1][2] + m2[1][2]*m3[2][2] + m2[1][3]*m3[3][2];
	m1[1][3] = m2[1][0]*m3[0][3] + m2[1][1]*m3[1][3] + m2[1][2]*m3[2][3] + m2[1][3]*m3[3][3];

	m1[2][0] = m2[2][0]*m3[0][0] + m2[2][1]*m3[1][0] + m2[2][2]*m3[2][0] + m2[2][3]*m3[3][0];
	m1[2][1] = m2[2][0]*m3[0][1] + m2[2][1]*m3[1][1] + m2[2][2]*m3[2][1] + m2[2][3]*m3[3][1];
	m1[2][2] = m2[2][0]*m3[0][2] + m2[2][1]*m3[1][2] + m2[2][2]*m3[2][2] + m2[2][3]*m3[3][2];
	m1[2][3] = m2[2][0]*m3[0][3] + m2[2][1]*m3[1][3] + m2[2][2]*m3[2][3] + m2[2][3]*m3[3][3];

	m1[3][0] = m2[3][0]*m3[0][0] + m2[3][1]*m3[1][0] + m2[3][2]*m3[2][0] + m2[3][3]*m3[3][0];
	m1[3][1] = m2[3][0]*m3[0][1] + m2[3][1]*m3[1][1] + m2[3][2]*m3[2][1] + m2[3][3]*m3[3][1];
	m1[3][2] = m2[3][0]*m3[0][2] + m2[3][1]*m3[1][2] + m2[3][2]*m3[2][2] + m2[3][3]*m3[3][2];
	m1[3][3] = m2[3][0]*m3[0][3] + m2[3][1]*m3[1][3] + m2[3][2]*m3[2][3] + m2[3][3]*m3[3][3];

}
After running a profiler on the rendering process, this function came up with a lot of calls, so I rewrote it using the SSE2 instruction set, assuming 16-byte aligned memory. This is the result:

Code:

_asm {
		mov			eax,	m1			; eax -> destination matrix m1
		mov			ebx,	m2			; ebx -> left operand m2
		mov			ecx,	m3			; ecx -> right operand m3

		; keep all four rows of m3 in xmm4..xmm7 for the whole routine
		movdqa		xmm4,	xmmword ptr [ecx]
		movdqa		xmm5,	xmmword ptr [ecx+16]
		movdqa		xmm6,	xmmword ptr [ecx+32]
		movdqa		xmm7,	xmmword ptr [ecx+48]
		
		; row 0 of m1: broadcast each element of m2 row 0, multiply by the rows of m3, sum
		movss		xmm0,	dword ptr[ebx]
		shufps		xmm0,	xmm0,	0		
		movss		xmm1,	dword ptr[ebx+4]
		shufps		xmm1,	xmm1,	0
		movss		xmm2,	dword ptr[ebx+8]
		shufps		xmm2,	xmm2,	0
		movss		xmm3,	dword ptr[ebx+12]
		shufps		xmm3,	xmm3,	0	
			
		mulps		xmm0,	xmm4	
		mulps		xmm1,	xmm5
		mulps		xmm2,	xmm6
		mulps		xmm3,	xmm7

		addps		xmm2,	xmm3
		addps		xmm1,	xmm2
		addps		xmm0,	xmm1

		movdqa		xmmword ptr[eax],	xmm0


		
		; row 1 of m1
		movss		xmm0,	dword ptr[ebx+16]
		shufps		xmm0,	xmm0,	0		
		movss		xmm1,	dword ptr[ebx+20]
		shufps		xmm1,	xmm1,	0
		movss		xmm2,	dword ptr[ebx+24]
		shufps		xmm2,	xmm2,	0
		movss		xmm3,	dword ptr[ebx+28]
		shufps		xmm3,	xmm3,	0	
			
		mulps		xmm0,	xmm4	
		mulps		xmm1,	xmm5
		mulps		xmm2,	xmm6
		mulps		xmm3,	xmm7

		addps		xmm2,	xmm3
		addps		xmm1,	xmm2
		addps		xmm0,	xmm1

		movdqa		xmmword ptr[eax+16],	xmm0	

	

		; row 2 of m1
		movss		xmm0,	dword ptr[ebx+32]
		shufps		xmm0,	xmm0,	0		
		movss		xmm1,	dword ptr[ebx+36]
		shufps		xmm1,	xmm1,	0
		movss		xmm2,	dword ptr[ebx+40]
		shufps		xmm2,	xmm2,	0
		movss		xmm3,	dword ptr[ebx+44]
		shufps		xmm3,	xmm3,	0	
			
		mulps		xmm0,	xmm4	
		mulps		xmm1,	xmm5
		mulps		xmm2,	xmm6
		mulps		xmm3,	xmm7

		addps		xmm2,	xmm3
		addps		xmm1,	xmm2
		addps		xmm0,	xmm1

		movdqa		xmmword ptr[eax+32],	xmm0	

		
		; row 3 of m1
		movss		xmm0,	dword ptr[ebx+48]
		shufps		xmm0,	xmm0,	0		
		movss		xmm1,	dword ptr[ebx+52]
		shufps		xmm1,	xmm1,	0
		movss		xmm2,	dword ptr[ebx+56]
		shufps		xmm2,	xmm2,	0
		movss		xmm3,	dword ptr[ebx+60]
		shufps		xmm3,	xmm3,	0	
			
		mulps		xmm0,	xmm4	
		mulps		xmm1,	xmm5
		mulps		xmm2,	xmm6
		mulps		xmm3,	xmm7

		addps		xmm2,	xmm3
		addps		xmm1,	xmm2
		addps		xmm0,	xmm1

		movdqa		xmmword ptr[eax+48],	xmm0
	}
After testing both routines with random single precision floating point values, the result was the following:
std code: 2.934 sec for 51000 loops
my code: 2.936 sec for 13000000 loops

That's more than 250 times faster than the std version ^^

I hope this posting wasn't too long :)

Thanks for reading

theeth
Posts: 500
Joined: Wed Oct 16, 2002 5:47 am
Location: Montreal
Contact:

Post by theeth »

Three cheers for assembly optimized functions!

I really appreciate the efforts you are putting into this, and I don't think I'm the only one :)

Martin
Life is what happens to you when you're busy making other plans.
- John Lennon

solmax
Posts: 86
Joined: Fri Oct 18, 2002 2:47 am
Contact:

Post by solmax »

theeth wrote: I really appreciate the efforts you are putting into this, and I don't think I'm the only one
Absolutely right! The first optimized build (2.30 I think) helped me cut rendering times in the middle of a production by almost 50%!

so, another three cheers from me, too :)

best regards, marin

solmax
Posts: 86
Joined: Fri Oct 18, 2002 2:47 am
Contact:

Post by solmax »

and a quick benchmark:

4004x1600px, 16 OSA

regular 2.32a: 348.04 sec
optimized bf: 240.62 sec

The optimized version is approx. 30% faster.

Not the same gain in performance as with the first optimized build (I measured some 50%), but still a lot. Of course other rendering features, such as raytracing or AO, should be tested as well, since my render uses the regular (unified) scanline renderer without shadow-casting lights.

edit: oops, missed the elysiun link..

greets marin
Last edited by solmax on Fri May 28, 2004 9:25 pm, edited 1 time in total.

arangel
Posts: 0
Joined: Wed Oct 22, 2003 2:54 pm
Location: Brasília - Brazil
Contact:

Post by arangel »

Thanks a lot Antares. I've been a fan of your builds for some time now!
Is there a chance that this method finds its way into the official releases of Blender? That would be nice, too.
Alexandre Rangel
Multimedia Designer
www.3Dzine.com.br

Antares
Posts: 0
Joined: Fri Nov 14, 2003 4:04 pm

Post by Antares »

arangel wrote: Thanks a lot Antares. I've been a fan of your builds for some time now!
Is there a chance that this method finds its way into the official releases of Blender? That would be nice, too.
I've talked to Ton about this issue and offered to buy an Intel compiler licence in the name of the Blender Foundation, or to donate the appropriate amount of money so that the foundation itself can buy it.

We are speaking roughly about 400-500 euro here for a single Windows licence.

First, Ton said it would be more important to inform people about the "how it works" and everything around it. The main reason for this is to motivate others to create optimized versions.

Another point was to give developers the possibility of testing versions like that and to think about how to integrate such stuff into the official versions.
Problems that might occur are analyzing such an optimized build, debugging it (a hard time here) and getting it onto the quality level of "normal" Blender releases. Ton thought that this might break one or more development timeframes at the moment.

He also said he prefers saving other people's money if possible, but if needed he would (of course) accept such a donation ^^


The first step is done (again) by creating this thread. Perhaps I have to bring it up on the mailing lists too, so that it really gets accepted among the developers.

Again...
1. I would pay!! (at least for a Windows licence)
2. I can tell others about my experience so far (like I did after theeth asked how this works ^^)

I can give the initial spark for this, but at least SOMEBODY has to be interested in pushing this further.

There aren't a lot of things I can do to improve Blender, since I am a student and don't have that much time. Originally my goal was to get involved in all this Blender project stuff and really contribute some lines of code, but I have found no opportunity to do so so far.

ideasman
Posts: 0
Joined: Tue Feb 25, 2003 2:37 pm

Post by ideasman »

It's a tricky one... I'm new to this, but even from the benchmarks it doesn't seem that GCC is THAT much worse than other compilers.

I remember I set my CFLAGS to use -O3 and AMD XP and other optimizations,
recompiled blender and python, used psyco for my script, and gained a 50% speed increase on one of my scripts (26 sec -> 13 sec).

It sounds off topic, but aren't you comparing your new Blender to a Blender release that didn't use heavy optimization in the first place?

Assembly optimizations look promising, can they be contributed to BF-Blender???
Just check for the correct architecture, maybe add PowerPC optimizations too.

Faster mesh manipulation is desirable too; it would be interesting to see how that could be sped up.

So many renderers do _everything_ that people don't care so much about pretty pictures on a web site, rather how long they took to render.

IMHO Anything that speeds up the render process is good, and gives Blender a great advantage over other renderers.

I think a render farm that can be used over the internet would be awesome, and I have made progress in this area. If it could be made available to new users, imagine how it would change users' experience of rendering in Blender (I'm thinking broadband here).

'In Blender my complex scene renders in 2 minutes rather than an hour' (without the render farm, or in some other app).
If it was integrated well, the user wouldn't even need to fiddle about, it would just be like a normal render..... hmmm /rant.

- Cam

Antares
Posts: 0
Joined: Fri Nov 14, 2003 4:04 pm

Post by Antares »

ideasman wrote: It's a tricky one... I'm new to this, but even from the benchmarks it doesn't seem that GCC is THAT much worse than other compilers.
Well... my builds have always been the fastest so far :)

As Ton said (and he's right there):
it's about informing people and accelerating the progress of making Blender faster.

One word about the assembler stuff:
I already rewrote some functions, compiled them, and they worked well.
The only problem is that the asm code is MSVC specific and cannot be used elsewhere. In addition, such builds would ONLY run on P4+ processors, because they use SSE2 without automated CPU dispatching.

Caronte
Posts: 76
Joined: Wed Oct 16, 2002 12:53 am
Location: Valencia-Spain-Europe

Post by Caronte »

This build doesn't have the new OSA :(

Antares
Posts: 0
Joined: Fri Nov 14, 2003 4:04 pm

Post by Antares »

Caronte wrote: This build doesn't have the new OSA :(
That's because it isn't the latest CVS version. It is a build of the code available one day after Blender 2.33a was released.

ideasman
Posts: 0
Joined: Tue Feb 25, 2003 2:37 pm

Post by ideasman »

What's this new OSA? Haven't heard about it (sorry, off topic)

Mel_Q
Posts: 41
Joined: Wed Oct 16, 2002 12:00 am

small but noticable...

Post by Mel_Q »

ideasman wrote: What's this new OSA? Haven't heard about it (sorry, off topic)
http://www.blender3d.org/cms/Rendering_ ... 320.0.html

crsrma
Posts: 0
Joined: Tue Mar 30, 2004 3:47 pm

Re: bf-blender v2.33a / intel compiler optimized (windows)

Post by crsrma »

Antares wrote: That's more than 250 times faster than the std version ^^
Holy :shock:!
Nice job, Antares. Do you think that sort of performance would be similar throughout the entire code? On second thought, forget I brought that up. What a hellish project that would be to undertake. :)

Antares
Posts: 0
Joined: Fri Nov 14, 2003 4:04 pm

Re: bf-blender v2.33a / intel compiler optimized (windows)

Post by Antares »

crsrma wrote:
Antares wrote: That's more than 250 times faster than the std version ^^
Holy :shock:!
Nice job, Antares. Do you think that sort of performance would be similar throughout the entire code? On second thought, forget I brought that up. What a hellish project that would be to undertake. :)
No!
Definitely not. While 250 times sounds great, you have to know that this routine only used about 1 second of CPU time when I ran my profiling during rendering, so even a 250x speedup of it saves basically just one second overall.

This is some part of it (time in ms / calls / function name):

Code:

110.454.720	3886	GHOST_SystemWin32::processEvents(bool)(ghost_systemwin32.obj)
4.005.562	86	_PIL_sleep_ms(time.obj)
2.244.008	677052	_testshadowbuf(shadbuf.obj)
1.910.368	1258617	_boxsampleclip.(image.obj)
1.586.608	2601516	_BLI_linklist_prepend_arena(bli_linklist.obj)
1.343.953	321540	_shade_lamp_loop(rendercore.obj)
1.290.993	64	_hypermesh_subdivide(subsurf.obj)
1.217.247	10151	FTPixmapGlyph::Render(class
1.173.797	4315664	_BLI_memarena_alloc(bli_memarena.obj)
948.509	1241787	_boxsample.(image.obj)
856.676	53022	_zbufinvulGL(zbuf.obj)
813.261	321540	_do_material_tex(texture.obj)
799.277	321544	_shadepixel.(rendercore.obj)
722.073	2031283	_MTC_Mat4MulVec4fl(arithb.obj)
722.073	2031283	_Mat4MulVec4fl(arithb.obj)
As you can see, arithmetic functions like "Mat4MulVec4fl" aren't taking too much time, so improvements there won't give that much of a speed increase.

It would be much better if the time-eating functions at the top of the list were improved...

SpookyElectric
Posts: 0
Joined: Thu Aug 19, 2004 11:28 am

Post by SpookyElectric »

Hello. I guess this is kind of an old thread, but rendering a time-consuming animation brought it to my mind. I don't use Windows, and I am still using a P3. I noticed the recent gcc 3 series has SSE support. To some extent it can optimize floating point operations into SSE operations as I understand it, but I don't know how well; I'll have to check the assembly it dumps out someday.
It seems, though, that a v4sf vector datatype is supported for SSE, and PowerPC AltiVec supports rather similar operations.
Multiplying 4x4 matrices, I hope, doesn't happen a lot; if it does, something is coded wrong. Multiplying a vector by a 4x4 matrix, however, is a very important function to speed up. I think it may be worthwhile to change the way the math functions work a little bit, and the way data is stored a little bit:

* Math functions should always be inlined.
* Vectors / verts should maybe be stored in 128 bits (so four floats, even if the fourth isn't used), so four values can be loaded at once with movaps (I think that can easily be done with no impact on the rest of the code). Structures involving vectors and matrices should be sized for 128-bit boundaries (I think they're sized for 8 now). I don't think this will have a noticeably negative impact on anything anywhere in the code from what I've seen.
* It would be best, I think, to use the xmm "intrinsic" operations, which it seems are standard in both Microsoft's compilers and gcc. Basically there is no need to have actual assembly in the C code, just C code that has a 1:1 mapping to assembly instructions (see the sketch after this list). Better yet is to orient the code so the compiler will take care of things for you, but I think in the case of things like multiplying with matrices it will require manually coding the function.
* I assume somewhere there is a function to apply a 4x4 matrix to an entire mesh. This should probably be part of the optimized math library, rather than having this function call the library. For the duration of the loop the matrix can be stored in registers, and that would speed things up considerably. This will help much, much more than just speeding up the individual functions.
* Add math functions for performing RGBA color operations. Most such operations can be pulled off even with just MMX, and I'm sure they happen a lot (like adding texture layers).
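
To make the intrinsics idea concrete, here is a rough sketch (the function and parameter names are invented, this is not existing Blender code) that transforms a whole vertex array by one 4x4 matrix, keeping the matrix rows in registers for the duration of the loop and using the same row-vector convention as the MTC_Mat4MulMat4 code earlier in the thread:

Code:

#include <xmmintrin.h>   /* SSE intrinsics; the same source builds with MSVC and gcc */

/* Sketch only: out[i] = in[i] * mat (row-vector convention).  Each vertex is
   stored as four floats; with 16-byte aligned storage the unaligned loads/stores
   below could be replaced by _mm_load_ps/_mm_store_ps (the movaps idea above). */
void transform_verts_sse(float (*out)[4], float (*in)[4], float mat[4][4], int count)
{
	/* the four matrix rows stay in registers for the whole loop */
	__m128 row0 = _mm_loadu_ps(mat[0]);
	__m128 row1 = _mm_loadu_ps(mat[1]);
	__m128 row2 = _mm_loadu_ps(mat[2]);
	__m128 row3 = _mm_loadu_ps(mat[3]);
	int i;

	for (i = 0; i < count; i++) {
		__m128 r;
		r = _mm_mul_ps(_mm_set1_ps(in[i][0]), row0);               /* x * row 0 */
		r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(in[i][1]), row1)); /* + y * row 1 */
		r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(in[i][2]), row2)); /* + z * row 2 */
		r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(in[i][3]), row3)); /* + w * row 3 */
		_mm_storeu_ps(out[i], r);
	}
}

As far as I know gcc 3.x ships the same xmmintrin.h header (enabled with -msse), so no MSVC-specific inline assembly would be needed.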

There probably need to be two modes of optimization: runtime and compile time. Runtime would use a branch (or function pointer) to determine which optimized version to use; compile time just uses the right one and avoids wasting time on the branch.
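
A rough sketch of what I mean, with invented names (not existing Blender code):

Code:

/* Sketch only.  Two implementations exist; which one gets used is decided
   either once at startup (runtime dispatch) or when building (compile-time
   dispatch, no branch at all). */
void mat4_mul_vec4_c(float mat[4][4], float vec[4]);    /* generic C version */
void mat4_mul_vec4_sse(float mat[4][4], float vec[4]);  /* SSE version */
int  cpu_has_sse(void);                                 /* e.g. a CPUID check */

/* runtime mode: pick the function pointer once, pay one indirect call per use */
void (*mat4_mul_vec4)(float mat[4][4], float vec[4]) = mat4_mul_vec4_c;

void init_math_dispatch(void)
{
	if (cpu_has_sse())
		mat4_mul_vec4 = mat4_mul_vec4_sse;
}

/* compile-time mode: the build configuration picks the version directly */
#ifdef USE_SSE_MATH
#define MAT4_MUL_VEC4 mat4_mul_vec4_sse
#else
#define MAT4_MUL_VEC4 mat4_mul_vec4_c
#endif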

From searching the mailing list archives, there seems to be some resistance to manual optimizations. I agree they should be used in as few places as possible, and should probably all be put in one library, but I think they are definitely worthwhile. Comparing custom versions of math routines like you did with the optimized ones the compiler spits out shows this. (Though maybe a test / benchmarking suite is worth coding, to make it apparent if the compiler itself ever gets better than us at optimizing our code.)

Have you done any more work on optimized versions of the other functions? I would be interested in helping out with this project. I don't really have time to work on it at the moment, but maybe in a couple of weeks I will have time to look into some minor improvements. Ideally, though, the bulk of the Blender source should be put under review to use a more consistent mathematics system that lends itself to AltiVec and SSE style optimizations.


PS: Can you try to make your work SSE / PIII friendly? You only need SSE (as opposed to MMX) to handle four floats at once; I don't think whatever added functionality from SSE2 you're using is that much of a speed gain, and then it's available to more users. I'll gladly do PIII testing if I can compile it with gcc. And I think there are some Mac G4s I have access to on which I could potentially test AltiVec versions of the math routines.
