### bf-blender v2.33a / intel compiler optimized (windows)

Posted: **Fri May 28, 2004 3:01 pm**

hi all

Ton asked me to start a thread about this topic here. Perhaps some of you already know the builds I have made and still make.

My Blender builds use the Intel compiler, which applies general optimizations such as rewriting and restructuring code, optimizing for branch prediction, loop unrolling, parallel instruction execution using the pipelining of modern Intel and AMD processors, and executing floating-point operations in parallel with integer operations where possible.

But the main advantage comes from using the more specialized instruction sets of Intel (and AMD) processors, like SSE (Streaming SIMD Extensions) and SSE2. These instruction sets, together with eight 128-bit registers, are the direct successors of the MMX technology, which supported SIMD (Single Instruction, Multiple Data) execution. This means a CPU can apply the same operation (like adding X to every element of an array Y in a loop) to several data items at once, as long as they lie close together in memory and are logically connected.

Because Blender uses single-precision floating-point calculations, a processor supporting SSE/SSE2 can hold four values in one 128-bit register. It can therefore do a given operation on four values in three steps (loading, executing, storing) where it would normally need at least 12 steps (three for each value).
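
To make the step count concrete, here is a minimal sketch in C (not taken from the Blender sources) of the "add X to every element of array Y" loop written with SSE intrinsics. The function name `add_to_array` is made up for this example, and the code falls back to a plain loop when it is compiled without SSE support. Each pass through the vector loop does one load, one add and one store for four floats.

```c
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>   /* SSE intrinsics */
#define USE_SSE 1
#else
#define USE_SSE 0
#endif

/* Hypothetical helper: y[i] += x for all i in [0, n). */
void add_to_array(float *y, size_t n, float x)
{
    size_t i = 0;
#if USE_SSE
    __m128 vx = _mm_set1_ps(x);              /* x copied into all four lanes */
    for (; i + 4 <= n; i += 4) {
        __m128 vy = _mm_loadu_ps(y + i);     /* step 1: load four floats */
        vy = _mm_add_ps(vy, vx);             /* step 2: four additions at once */
        _mm_storeu_ps(y + i, vy);            /* step 3: store four floats */
    }
#endif
    for (; i < n; ++i)                       /* leftover elements, one at a time */
        y[i] += x;
}
```

The unaligned load/store intrinsics are used here so the sketch works for any `float` pointer; with 16-byte aligned data the aligned variants (`_mm_load_ps`, `_mm_store_ps`) could be used instead.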

My Blender version is also compiled with automatic CPU dispatching, which means it checks which CPU it is running on and uses the appropriate set of functions, which in turn use the specific instruction sets of that CPU for better performance.
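
The Intel compiler generates this dispatching automatically. The hand-written sketch below only illustrates the idea and is not what the compiler emits: the names `scale_generic`, `scale_sse2` and `select_scale` are hypothetical, and the CPU check assumes GCC or Clang on x86, where `__builtin_cpu_supports` is available. On other compilers the sketch simply returns the generic version.

```c
#include <stddef.h>

#if (defined(__GNUC__) || defined(__clang__)) && defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics (also pulls in SSE) */
#define HAVE_SSE2_PATH 1
#else
#define HAVE_SSE2_PATH 0
#endif

typedef void (*scale_fn)(float *v, size_t n, float s);

/* Generic path: plain C, runs on any CPU. */
static void scale_generic(float *v, size_t n, float s)
{
    for (size_t i = 0; i < n; ++i)
        v[i] *= s;
}

#if HAVE_SSE2_PATH
/* Vector path: four multiplications per instruction. */
static void scale_sse2(float *v, size_t n, float s)
{
    size_t i = 0;
    __m128 vs = _mm_set1_ps(s);
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(v + i, _mm_mul_ps(_mm_loadu_ps(v + i), vs));
    for (; i < n; ++i)                /* leftover elements */
        v[i] *= s;
}
#endif

/* Checks the CPU at run time and returns the matching implementation. */
scale_fn select_scale(void)
{
#if HAVE_SSE2_PATH
    if (__builtin_cpu_supports("sse2"))
        return scale_sse2;
#endif
    return scale_generic;
}
```

A caller would fetch the pointer once, for example `scale_fn scale = select_scale();`, and then call `scale(data, count, 2.0f)` wherever the operation is needed, so the CPU check is not repeated on every call.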

This means my builds run on ANY x86 CPU, from the Pentium through the Pentium 4, as well as on AMD processors.

Main compiler settings in MSVC++ 6 are:

/Qaxw: automatic CPU dispatching and use of SSE(2) if available

/O3: optimize for maximum speed and enable high-level optimizations

/Ob2: inline any function, at the compiler's discretion

These settings won't work for the following modules:

FTF_ftfont

KX_ketsji

SG_SceneGraph

I also use Intel-compiler-specific profiling for the Blender kernel functions (module: BKE_blenkernel) and the primary rendering functions (module: BRE_render), which also increases performance and decreases the size of the binary a little bit.

All this normally cuts render times by 20-60% and makes the GUI a little more responsive.

A fully functional one-month demo version of this compiler is available at www.intel.com. All the important documentation on how to integrate and invoke the compiler is included in the demo version, and the setup can be done within 30 seconds.

Link to discussion thread at www.elysiun.com:

http://www.elysiun.com/forum/viewtopic.php?t=17626

Link to windows-binary:

http://www.silentthunder.de/files/blenderintel.zip

Another effort of mine in this area is rewriting some Blender functions in assembler. One of these was the following function:

```
void MTC_Mat4MulMat4(float m1[][4], float m2[][4], float m3[][4])
{
    /* matrix product: c[j][k] = a[j][i].b[i][k] */
    m1[0][0] = m2[0][0]*m3[0][0] + m2[0][1]*m3[1][0] + m2[0][2]*m3[2][0] + m2[0][3]*m3[3][0];
    m1[0][1] = m2[0][0]*m3[0][1] + m2[0][1]*m3[1][1] + m2[0][2]*m3[2][1] + m2[0][3]*m3[3][1];
    m1[0][2] = m2[0][0]*m3[0][2] + m2[0][1]*m3[1][2] + m2[0][2]*m3[2][2] + m2[0][3]*m3[3][2];
    m1[0][3] = m2[0][0]*m3[0][3] + m2[0][1]*m3[1][3] + m2[0][2]*m3[2][3] + m2[0][3]*m3[3][3];
    m1[1][0] = m2[1][0]*m3[0][0] + m2[1][1]*m3[1][0] + m2[1][2]*m3[2][0] + m2[1][3]*m3[3][0];
    m1[1][1] = m2[1][0]*m3[0][1] + m2[1][1]*m3[1][1] + m2[1][2]*m3[2][1] + m2[1][3]*m3[3][1];
    m1[1][2] = m2[1][0]*m3[0][2] + m2[1][1]*m3[1][2] + m2[1][2]*m3[2][2] + m2[1][3]*m3[3][2];
    m1[1][3] = m2[1][0]*m3[0][3] + m2[1][1]*m3[1][3] + m2[1][2]*m3[2][3] + m2[1][3]*m3[3][3];
    m1[2][0] = m2[2][0]*m3[0][0] + m2[2][1]*m3[1][0] + m2[2][2]*m3[2][0] + m2[2][3]*m3[3][0];
    m1[2][1] = m2[2][0]*m3[0][1] + m2[2][1]*m3[1][1] + m2[2][2]*m3[2][1] + m2[2][3]*m3[3][1];
    m1[2][2] = m2[2][0]*m3[0][2] + m2[2][1]*m3[1][2] + m2[2][2]*m3[2][2] + m2[2][3]*m3[3][2];
    m1[2][3] = m2[2][0]*m3[0][3] + m2[2][1]*m3[1][3] + m2[2][2]*m3[2][3] + m2[2][3]*m3[3][3];
    m1[3][0] = m2[3][0]*m3[0][0] + m2[3][1]*m3[1][0] + m2[3][2]*m3[2][0] + m2[3][3]*m3[3][0];
    m1[3][1] = m2[3][0]*m3[0][1] + m2[3][1]*m3[1][1] + m2[3][2]*m3[2][1] + m2[3][3]*m3[3][1];
    m1[3][2] = m2[3][0]*m3[0][2] + m2[3][1]*m3[1][2] + m2[3][2]*m3[2][2] + m2[3][3]*m3[3][2];
    m1[3][3] = m2[3][0]*m3[0][3] + m2[3][1]*m3[1][3] + m2[3][2]*m3[2][3] + m2[3][3]*m3[3][3];
}
```

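After running a profiler on the rendering process, this function came up with a lot of calls, so I rewrote it using the SSE2 instruction set for aligned memory. This is the result: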

```
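_asm {
; eax -> m1 (result), ebx -> m2, ecx -> m3
mov eax, m1
mov ebx, m2
mov ecx, m3
; keep the four rows of m3 in xmm4..xmm7 (movdqa needs 16-byte aligned memory)
movdqa xmm4, xmmword ptr [ecx]
movdqa xmm5, xmmword ptr [ecx+16]
movdqa xmm6, xmmword ptr [ecx+32]
movdqa xmm7, xmmword ptr [ecx+48]
; row 0: load m2[0][0..3] and copy each value into all four lanes
movss xmm0, dword ptr[ebx]
shufps xmm0, xmm0, 0
movss xmm1, dword ptr[ebx+4]
shufps xmm1, xmm1, 0
movss xmm2, dword ptr[ebx+8]
shufps xmm2, xmm2, 0
movss xmm3, dword ptr[ebx+12]
shufps xmm3, xmm3, 0
; multiply each value by the matching row of m3, then add the four products
mulps xmm0, xmm4
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
addps xmm2, xmm3
addps xmm1, xmm2
addps xmm0, xmm1
; store row 0 of m1
movdqa xmmword ptr[eax], xmm0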
; row 1 of m1, computed from m2[1][0..3]
movss xmm0, dword ptr[ebx+16]
shufps xmm0, xmm0, 0
movss xmm1, dword ptr[ebx+20]
shufps xmm1, xmm1, 0
movss xmm2, dword ptr[ebx+24]
shufps xmm2, xmm2, 0
movss xmm3, dword ptr[ebx+28]
shufps xmm3, xmm3, 0
mulps xmm0, xmm4
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
addps xmm2, xmm3
addps xmm1, xmm2
addps xmm0, xmm1
movdqa xmmword ptr[eax+16], xmm0
; row 2 of m1, computed from m2[2][0..3]
movss xmm0, dword ptr[ebx+32]
shufps xmm0, xmm0, 0
movss xmm1, dword ptr[ebx+36]
shufps xmm1, xmm1, 0
movss xmm2, dword ptr[ebx+40]
shufps xmm2, xmm2, 0
movss xmm3, dword ptr[ebx+44]
shufps xmm3, xmm3, 0
mulps xmm0, xmm4
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
addps xmm2, xmm3
addps xmm1, xmm2
addps xmm0, xmm1
movdqa xmmword ptr[eax+32], xmm0
; row 3 of m1, computed from m2[3][0..3]
movss xmm0, dword ptr[ebx+48]
shufps xmm0, xmm0, 0
movss xmm1, dword ptr[ebx+52]
shufps xmm1, xmm1, 0
movss xmm2, dword ptr[ebx+56]
shufps xmm2, xmm2, 0
movss xmm3, dword ptr[ebx+60]
shufps xmm3, xmm3, 0
mulps xmm0, xmm4
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
addps xmm2, xmm3
addps xmm1, xmm2
addps xmm0, xmm1
movdqa xmmword ptr[eax+48], xmm0
}
```
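
For readers who do not want to use inline assembler, the same broadcast, multiply and add pattern can probably be written with SSE intrinsics in C. The following is only a sketch and not part of my build: the name `mat4_mul_sse` is hypothetical, it uses unaligned loads and stores so the matrices do not need to be 16-byte aligned, and it falls back to plain loops when compiled without SSE support.

```c
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>   /* SSE intrinsics */
#define MAT4_USE_SSE 1
#else
#define MAT4_USE_SSE 0
#endif

/* Hypothetical name. Computes m1 = m2 * m3, with the same argument
   order as MTC_Mat4MulMat4. */
void mat4_mul_sse(float m1[][4], float m2[][4], float m3[][4])
{
#if MAT4_USE_SSE
    /* The four rows of m3 stay in registers for the whole function. */
    __m128 r0 = _mm_loadu_ps(m3[0]);
    __m128 r1 = _mm_loadu_ps(m3[1]);
    __m128 r2 = _mm_loadu_ps(m3[2]);
    __m128 r3 = _mm_loadu_ps(m3[3]);

    for (int j = 0; j < 4; ++j) {
        /* Broadcast each element of row j of m2, scale the matching
           row of m3 with it, and add the four scaled rows together. */
        __m128 acc = _mm_mul_ps(_mm_set1_ps(m2[j][0]), r0);
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(m2[j][1]), r1));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(m2[j][2]), r2));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(m2[j][3]), r3));
        _mm_storeu_ps(m1[j], acc);   /* row j of the result */
    }
#else
    /* Plain C fallback; m1 must not overlap m2 or m3 here. */
    for (int j = 0; j < 4; ++j)
        for (int k = 0; k < 4; ++k)
            m1[j][k] = m2[j][0]*m3[0][k] + m2[j][1]*m3[1][k]
                     + m2[j][2]*m3[2][k] + m2[j][3]*m3[3][k];
#endif
}
```

Whether this gets close to the hand-written assembler depends on the compiler, so it would have to be measured in the same way as the two routines above.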

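After testing both routines with random single-precision floating-point values, the result was the following:

std code: 2.934 sec for 51000 loops

my code: 2.936 sec for 13000000 loops

That's more than 250 times faster than the std version (13000000 / 51000 ≈ 255 in about the same time) ^^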

I hope this posting wasn't too long.

Thanks for reading.