Unbelievable Optimization X
This is something which happened some years ago.
There was some-project I won’t name(*) (let’s just say it was a server emulator of some popular MMORPG), with their development forum. I found a post suggesting an optimization they were going to implement. Basically the file saving phase was something like this :
if (someplayer->hp != 0) fprintf(fout, "hp=%d", someplayer->hp);
if (someplayer->something != 0) fprintf(fout, "hp=%d", someplayer->something );
if (someplayer->foobar != 0) fprintf(fout, "hp=%d", someplayer->foobar );
if (someplayer->purplet != 0) fprintf(fout, "hp=%d", someplayer->purplet );
if (someplayer->dontknow != 0) fprintf(fout, "hp=%d", someplayer->dontknow );
Of course, most attributes were defaulting at 0 (or the value was different in the if — whatever) and many I/O operations were saved (also file length was greatly reduced due to this “optimization”). The reason of this high effectiveness was that most attributes were not used unless in a particular condition, and so most of the time they had a default value.
Comes the Pentium 4 and its light-years long pipeline.
Developers forums started writing about how crappy was the new architecture (the P4 debuted at 1.5GHz - at that speed it was slower than a P3 1GHz, not to mention the Athlon - the real power of the P4 was revealed only when it debuted a new revision and frequencies of 2.0GHz and above) and how much a single conditional jump could stall the pipeline and lose tiny precious clock cycles.
Then the idea struck a developer of that project. What if we remove all the if conditions ? No more conditional jumps yippieee!
This is the kind of ideas you feel dumber only hearing it. Ok, so you pull out the ifs and suddenly you have about double the number of fprintf. Since there is file caching (so that not particular much time is lost waiting for the disk I/O) the performances should be higher, right ? Wrong.
And the reason is simple. A single fprintf may be something like hundreds of conditional jumps. Really. (Not to mention the additional load in terms of disk I/O which whatever superscsi you have, it’s slower than the slowest of the P4s)
Let’s analyze (lightly) what happens on a call to fprintf. After all the arguments are pushed on the stack, the procedure is called. Here the format string is parsed to find the % sign. There is comparison/conditional jump for every character, but we can be sure that most of this conditionals are correctly predicted and only a fraction of them are mispredicted (probably the ones where the % is found). After the % char is found, a quite complex loop (thus involving a number of conditional jumps) parses the input string. For %d tokens this loop ends quite quickly but for more complex tokens this can be quite expensive. In the meantime, parsed characters were being passed to the file buffers, ready to be written out (we’ll see later what this means). When the %something token is parsed, the argument is written out to the disk. This involves things like sign extension and stuff like that and no, most of it (in Visual C++ RTL at least) is not done using movsx and alikes, but with slow C code using various if, to mantain compatibility across platforms (the VC++ RTL should run on x86, x86/64, IA64, Alpha, Mips and PowerPC, even if the support for most of the listed architectures is now dropped) . After this a number should be converted from the internal representation to the decimal one, and this involves a bazillion of jumps and divisions by 10 (in M$ Visual C++ you can see all this code in the crt/src/output.c file). After the string is prepared it’s stored in the files buffers. Overflows checks must be made (more jumps! and memory copies, with an high chance of trashing what is in L1 cache if not even L2) and when they are emptied (and sooner or later they will be) controls goes to ring-0 code. Apart from the jumps, going to ring-0 is an uber-slow operation in itself. Also ring-0 code has quite an high chance of accessing a resource which needs a synchronization lock with an high chance of a spinlock loop or, worse, an anticipated task switch.
But boy, you saved a conditional jump.
(*) There is a secret hint hidden in the post about the identity of this project…
Edit : I wrote Pentium 4 1.5MHz :D. While it would be funny to check if the P4 1.5MHz could beat the 8086 at 4.77MHz or the 286 at 8-12Mhz, usually Pentium 4 users runs them slightly faster ![]()
Code Snippet II : Handy tridimensional vector class for C++
As you may have seen, I’ve changed the title of a previous post and I have started a series named “Code Snippets”. These are small pieces of code containing utility functions or classes.
So without further delays here is the code for a Vector3f class, implementing a tridimensional vector using low precision floating point numbers, for tridimensional graphics or audio. Released under BSD license.
Note : the class has no constructors so that object allocation does not have any significant performance impact. An handy static method called Make is provided to construct a Vector3f given its components on the fly.
Code Snippet I : Wrapper for CLR string to native string marshalling
A frequent (too much frequent) need in Managed C++ is going from System::String objects to char* or wchar_t* strings. The boring side of the marshalling is that the memory allocated with the StringToHGlobalxxx functions should be freed with a matching call to FreeHGlobal.
The solution is simple : using the power of C++ destructors we can create a temporary object for the entire time we need the string. Whenever the object goes out of scope, the string is automagically freed.
You can find the code (released to public domain) here.
Please note that assignments (and copy-constructor) are disabled (they are in an empty private method) to preserve the destruction safety (the free is thus execute once and only once).
As an example you can do :
void mystrcpy(char* dest, const System::String __gc* source)
{
strcpy(dest, StringAutoMarshal(source));
}
with no need to manage the marshalling manually.
.NET Type Viewer
I’ve uploaded a page with a small utility I’ve written yesterday in about 2 hours to solve a problem I had with some code of mine.
.NET Type Viewer is a small program using reflection to inspect .NET modules, with special power features for mixed mode DLLs.