Hanoi towers as a pure recursion benchmark, part 2 – Code Optimizations

by · June 1, 2005

As an aside from the 1st part, we try to optimize the performances of the solutions found in part 1, with special attention to the assembly version.

First, let’s take two timings of the versions presented in part 1, compiling in release with optimizations turned on, running on an Athlon64 3000+ 1.8GHz and with times taken with timeGetTime.

These are the times (native C++, assembly) :

  • Assembly : 1012ms
  • C++ : 1605ms

There are a couple of optimization we could try to make the code faster.

Quick, trivial optimization
As we know from part 1, we are doing O(2^n) moves to find the solution. This means that we are calling the function 2^n-1 times with the howmany parameter set to 1. A quick optimization (and the first and last we’ll make to the high level languages solutions) is to avoid the middle recursion. We apply the optimization to both C++ code and Assembly code :

C/C++

int Hanoi_Native(int from, int to, int use, int howmany)
{
if (howmany == 1)
{
return 1;
}
else
{
int imovs = 0;
imovs += Hanoi_Native(from, use, to, howmany - 1);
imovs += 1;
imovs += Hanoi_Native(use, to, from, howmany - 1);
return imovs;
}
}

x86 Assembly

mov ebx, from
mov ecx, to
mov edx, use
mov eax, howmany
xor esi, esi

call HanoiAsm
mov retcode, esi
jmp EndHanoiAsm
HanoiAsm:
dec eax
jz NextLoop
push eax
xchg ecx, edx
call HanoiAsm
xchg ecx, edx
xchg edx, ebx
pop eax
call HanoiAsm
xchg edx, ebx
NextLoop:
inc esi
ret
EndHanoiAsm:
Result
The times are considerably faster :

  • Assembly : 645ms
  • C++ : 1233ms

It should be noted that the x86 assembly solution is somewhat cheating in respect of the C++ one. Infact we’ve applied a great knowledge of our problem to optimize it in this way : we’ve only one parameter passed on the stack, we are skipping stack frame and we have literally demolished every structure of the algorithm to suite our needs.

The second optimization on the assembly version is something yesterday I couldn’t find how to do. It is, infact the removal of the push eax and pop eax from the loop. The problem was that I could effectively calculate eax knowing how deep we were in recursion, but I couldn’t find a way to do it. The idea struck me today, we can simply use the difference of esp and a given base we’d save in ebp.

Assembly, no push/pop

push ebp
mov ebx, from
mov ecx, to
mov edx, use
xor esi, esi

mov eax, howmany
mov ebp, esp
shl eax, 2
sub ebp, eax
sar eax, 2

call HanoiAsm
pop ebp
mov retcode, esi
jmp EndHanoiAsm
HanoiAsm:
dec eax
jz NextLoop
xchg ecx, edx
call HanoiAsm
mov eax, esp
xchg ecx, edx
sub eax, ebp
xchg edx, ebx
sar eax, 2
call HanoiAsm
xchg edx, ebx
NextLoop:
inc esi
ret
EndHanoiAsm:
Athlon64 3000+ 1.8GHz Winchester, 512KB L2 Cache (average over 15 samples)

  • Optimized Assembly : 707ms
  • Assembly : 641ms

Well, the result speak for themselves : the new version takes 707ms, while the old one 641ms. This is NOT an optimization but a waste of time. Time to understand why.

The first thing to do is to try the code on other processors.

Pentium-4 550 3.4GHz Prescott 1MB L2 Cache (average over 15 samples)

  • Optimized Assembly : 748ms
  • Assembly : 703ms

Pentium-M 1.5GHz Banias 1MB L2 Cache (average over 15 samples)

  • Optimized Assembly : 1202ms
  • Assembly : 1262ms

So, on a PentiumM the faster version is the one which is plainly slower on an Athlon64 and on a P4 Prescott. So this is an optimization heavy dependent on the processor’s architecture. (Also on such a task, I would’ve expected the Prescott 3.4GHz to far beat the Athlon64 – it seems this code is very k8-friendly).

Now, our target is the best optimization for the Athlon64, for a valid reason : it’s my own CPU :)

Then the next step is this one : starting with the original version we’ll add the instructions one by one, till we know which one is causing such a great slowdown. And here come the surprises.

… SURPRISE!

The surprise is actually a kinda terrible fact. I’ve copy/pasted the loop code and the second version is still slower. Time to pull AMD CodeAnalyst out of the toolbox.

The results are an incredible number of mispredicted branches. The funny thing (well, it depends of your definition of “funny”) is that sometimes it is enough to add a nop before the loop to raise astronomically the running times.

Admitting defeat

In the end I admit my defeat. While here you read only a few lines, I’ve spent literally hours adding instructions, probing with CodeAnalyst, crashing my PC (a bad side effect of CodeAnalyst with the wrong event probes configured) only to find next to nothing.
I couldn’t find a single valid reason why adding a few instructions before the really long loop could change so wildly the performance of the loop itself, causing the last call of the loop to be constantly mispredicted. This happened also when the instructions added would have not had an influence at all (such as adding a mov esi, eax before the first instruction of the function).

Benchmark binary here (x86, Win32)

Discussion4 Comments

  1. lahgrooto says:

    Hello, i read your site, this a best site from me, thanks!

  2. ZS Pansota says:

    it is a good effort for sharing knowledge.

  3. ZS Pansota says:

    hi!
    sir any one who reads, plz. give me the full assembly code of hanoi towers game,i cannot find it at Google.

    im in 6th smester of telecom Engg,CIIt Lahore.pakistan

    the code that i put in emulator and that runs game.
    thanks.

    ZS Pansota

Add a Comment