Unbelievable Optimization X

This is something which happened some years ago.

There was some-project I won’t name(*) (let’s just say it was a server emulator of some popular MMORPG), with their development forum. I found a post suggesting an optimization they were going to implement. Basically the file saving phase was something like this :


if (someplayer->hp != 0) fprintf(fout, "hp=%d", someplayer->hp);
if (someplayer->something != 0) fprintf(fout, "hp=%d", someplayer->something );
if (someplayer->foobar != 0) fprintf(fout, "hp=%d", someplayer->foobar );
if (someplayer->purplet != 0) fprintf(fout, "hp=%d", someplayer->purplet );
if (someplayer->dontknow != 0) fprintf(fout, "hp=%d", someplayer->dontknow );

Of course, most attributes were defaulting at 0 (or the value was different in the if — whatever) and many I/O operations were saved (also file length was greatly reduced due to this “optimization”). The reason of this high effectiveness was that most attributes were not used unless in a particular condition, and so most of the time they had a default value.

Comes the Pentium 4 and its light-years long pipeline.

Developers forums started writing about how crappy was the new architecture (the P4 debuted at 1.5GHz - at that speed it was slower than a P3 1GHz, not to mention the Athlon - the real power of the P4 was revealed only when it debuted a new revision and frequencies of 2.0GHz and above) and how much a single conditional jump could stall the pipeline and lose tiny precious clock cycles.

Then the idea struck a developer of that project. What if we remove all the if conditions ? No more conditional jumps yippieee!

This is the kind of ideas you feel dumber only hearing it. Ok, so you pull out the ifs and suddenly you have about double the number of fprintf. Since there is file caching (so that not particular much time is lost waiting for the disk I/O) the performances should be higher, right ? Wrong.
And the reason is simple. A single fprintf may be something like hundreds of conditional jumps. Really. (Not to mention the additional load in terms of disk I/O which whatever superscsi you have, it’s slower than the slowest of the P4s)

Let’s analyze (lightly) what happens on a call to fprintf. After all the arguments are pushed on the stack, the procedure is called. Here the format string is parsed to find the % sign. There is comparison/conditional jump for every character, but we can be sure that most of this conditionals are correctly predicted and only a fraction of them are mispredicted (probably the ones where the % is found). After the % char is found, a quite complex loop (thus involving a number of conditional jumps) parses the input string. For %d tokens this loop ends quite quickly but for more complex tokens this can be quite expensive. In the meantime, parsed characters were being passed to the file buffers, ready to be written out (we’ll see later what this means). When the %something token is parsed, the argument is written out to the disk. This involves things like sign extension and stuff like that and no, most of it (in Visual C++ RTL at least) is not done using movsx and alikes, but with slow C code using various if, to mantain compatibility across platforms (the VC++ RTL should run on x86, x86/64, IA64, Alpha, Mips and PowerPC, even if the support for most of the listed architectures is now dropped) . After this a number should be converted from the internal representation to the decimal one, and this involves a bazillion of jumps and divisions by 10 (in M$ Visual C++ you can see all this code in the crt/src/output.c file). After the string is prepared it’s stored in the files buffers. Overflows checks must be made (more jumps! and memory copies, with an high chance of trashing what is in L1 cache if not even L2) and when they are emptied (and sooner or later they will be) controls goes to ring-0 code. Apart from the jumps, going to ring-0 is an uber-slow operation in itself. Also ring-0 code has quite an high chance of accessing a resource which needs a synchronization lock with an high chance of a spinlock loop or, worse, an anticipated task switch.

But boy, you saved a conditional jump.

(*) There is a secret hint hidden in the post about the identity of this project…

Edit : I wrote Pentium 4 1.5MHz :D. While it would be funny to check if the P4 1.5MHz could beat the 8086 at 4.77MHz or the 286 at 8-12Mhz, usually Pentium 4 users runs them slightly faster ;)

Comments

One Response to “Unbelievable Optimization X”

  1. Florian on January 29th, 2007 3:24 am

    Hi,
    I found your blog via google by accident and have to admit that youve a really interesting blog :-)
    Just saved your feed in my reader, have a nice day :)

  • About

    A blog about whatever gets on my computer or any other programmable device I own. From Linux to .NET.

  • Social Networking

  •  

    View Marco Mastropaolo's profile on LinkedIn
    www.flickr.com
    This is a Flickr badge showing public photos from Marco Mastropaolo. Make your own badge here.
  • Blogroll