CodeBetter.Com
CodeBetter.Com
RSS 2.0 via Feedburner
           Do you Twitter? Follow us @CodeBetter

Greg Young [MVP]


Performance Contest #1 Results and Analysis

 

Sorry for the delay in getting the results out. There is a really funny story behind it with me running 7 hours of tests in debug mode then aggregating results and getting it all ready before I realized what I had done :-/

You can find the original contest here http://codebetter.com/blogs/gregyoung/archive/2006/07/27/147736.aspx.

Before I announce the winners I want to look at various optimizations for the problem, some which were used and some which were not used.  Many seemed to have realized that the suggestion towards threading in this case was a red herring (well sort of, further information below). Large parts of this problem are spent in the creation of objects, the hash function and tokenizing the string. Let's look at some strategies to coming up with a good solution.

Order is Important!

One of the key optimizations for the tokenizer is realizing that the order of processing is extremely important. If we know any information about the domain our code will run in we can often make domain specific optimizations by changing the ordering of our statements.

In the English language for instance characters appear on an order of magnitude more frequently than punctuation. In order to optimize for this case we should check our most often circumstances first. This is better shown with a simple example. The following two functions are both doing the exact same thing but the second example is much faster than the first.

        static count CountWrongOrder(string s) {

            count ret = new count();

            foreach (char c in s) {

                if (c == ' ' || c == '\t' || c == '\n') {

                    ret.spaces++;

                } else if (char.IsDigit(c)) {

                    ret.numbers++;

                } else if (char.IsUpper(c)) {

                    ret.uppercase++;

                } else if (char.IsLower(c)) {

                    ret.lowercase++;

                }

            }

            return ret;

        }

        static count CountRightOrder(string s) {

            count ret = new count();

            foreach (char c in s) {

                if (char.IsLower(c)) {

                    ret.lowercase++;

                } else if (c == ' ' || c =='\t' || c=='\n') {

                    ret.spaces++;

                } else if (char.IsUpper(c)) {

                    ret.uppercase++;

                } else if (char.IsDigit(c)) {

                    ret.numbers++;

                }

            }

            return ret;

        }

Right and wrong order

 

Run across the string "This is some normal english text. Occasionally you will also get a number such as 2" (concat’ed 10000) times the second version is almost twice the speed of the first. In our most often case in the first example (a lower case letter) we have to fail through all of the other conditions where as in the second example it ius our first condition. The goal should be to check the mutually exclusive conditions in an order that is correct when you statistically analyze the data.

ToCharArray() ?!

I saw many submissions calling Data.ToCharArray() to get a character array representing the data so they could read it. There is another method on the string object that does a similar job, which although hidden to you is much faster than ToCharArray(). Consider the following test.

        static int ORCharsNewArray(string s) {

            char[] tmp = s.ToCharArray();

            int ret = 0;

            foreach (char c in tmp) {

                ret |= c;

            }

            return ret;

        }

        static int ORCharsSameArray(string s) {

            int ret = 0;

            foreach (char c in s) {

                ret |= c;

            }

            return ret;

        }

        static void Main(string[] args) {

            string foo = "123456789012345678901234567890";

            Stopwatch s = new Stopwatch();

            s.Start();

            for (int i = 0; i < 100000; i++) {

                int bar = ORCharsNewArray(foo);

            }

            s.Stop();

            Console.WriteLine(s.ElapsedTicks);

            s.Reset();

            s.Start();

            for (int i = 0; i < 100000; i++) {

                int bar = ORCharsSameArray(foo);

            }

            s.Stop();

            Console.WriteLine(s.ElapsedTicks);

        }

ToCharArray vs get_Chars

 

In debug mode they have nearly identical performance but in release the ORCharsSameArray function is about 3 times faster. When you use a foreach on a string (or if you use an index) a special method get_Chars is called (you can’t call this on your own). We can see this occurring by looking at the IL in question.

.method private hidebysig static int32 ORCharsSameArray(string s) cil managed

{

      .maxstack 2

      .locals init (

            [0] int32 num1,

            [1] char ch1,

            [2] string text1,

            [3] int32 num2)

      L_0000: ldc.i4.0

      L_0001: stloc.0

      L_0002: ldarg.0

      L_0003: stloc.2

      L_0004: ldc.i4.0

      L_0005: stloc.3

      L_0006: br.s L_0018

      L_0008: ldloc.2

      L_0009: ldloc.3

      L_000a: callvirt instance char string::get_Chars(int32)

      L_000f: stloc.1

      L_0010: ldloc.0

      L_0011: ldloc.1

      L_0012: or

      L_0013: stloc.0

      L_0014: ldloc.3

      L_0015: ldc.i4.1

      L_0016: add

      L_0017: stloc.3

      L_0018: ldloc.3

      L_0019: ldloc.2

      L_001a: callvirt instance int32 string::get_Length()

      L_001f: blt.s L_0008

      L_0021: ldloc.0

      L_0022: ret

}

IL of ORCharsSameArray

 

This method lets you view the data inside of the string as if it were an array of chars.  Since the data is only being read there is no need to generate an array. The reason that they are roughly the same speed in debug is that the method does not get inlined, in release mode the method is inlined and it is nearly as efficient as unsafe code.

Threading (The Red Herring?)

A few tried threaded solutions only one did it efficiently. Garmon got the algorithm right.

The problem with threading is locking … In order to maintain a hash properly between two threads one needs to setup critical sections around the hash. The big problem comes in that on every probe to the hash one has to worry about a resize of the hash by the other thread. In order to get around this one has to create a hash per thread, since each thread has its own hash they can operate on their own hash without locking as the other thread will not touch the hash.

Once the two threads are complete one then must merge the two hashes. This will always be a O(n) operation at best generally O(hashsize) which is why threading was not much of an issue here unless you get absolutely huge data sets with lots of repetition. It takes a lot to make up for the O(n) operation on the copy, the creation of the second hash, the delay to actually start the thread, and the additional memory overhead caused by having two hashes.

The key to threads being successful is a high rate of repetition in the data!

Use a Map

All of the submissions used if conditionals to tokenize the data.  There were cases such as

            if ((c >= 'a') && (c <= 'z')) {

                builder.Append(char.ToUpper(value));

            }

Conditional Tokenizing

 

Could we pre-generate this information into a table and simply do a table lookup?

        enum Operations {

            EndWord = 0x0100,

            MoveNext = 0x0200

        }

        static string FormatEntry(int value) {

            return string.Format("0x{0:x4}", value);

        }

 

        static void MainToRun(string[] args) {

            Console.WriteLine("UInt16 [] map = {");

            for (int i = 0; i < 255; i++) {

                char c = (char)i;

                if ((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9')) {

                    Console.Write(FormatEntry(c | (int)Operations.MoveNext));

                } else if (c >= 'A' && c <= 'Z') {

                    Console.Write(FormatEntry(char.ToLower(c) | (int)Operations.MoveNext));

                } else if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {

                    Console.Write(FormatEntry(0 | (int)Operations.EndWord));

                } else {

                    Console.Write(FormatEntry(0));

                }

                if (i < 254) {

                    Console.Write(", ");

                }

                if ((i + 1) % 8 == 0) {

                    Console.Write("\n");

                }

            }

            Console.Write("\n}\n");

        }

Code to generate map

 

Note that this code is doing any transformations that we may want and is also storing some additional information in the high bits of the int16(we only use 8 bits for the char). In particular it is storing a bit that tells us if what was read is a word terminator or not and it tells us whether or not we should add to the current position of the output buffer. This code will produce output similar to the following

        static readonly UInt16 [] map = {

0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,

0x0000, 0x0100, 0x0100, 0x0000, 0x0000, 0x0100, 0x0000, 0x0000,

0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,

0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,

0x0100, 0x0000, 0x0000, 0x0000, 0x0224, 0x0000, 0x0000, 0x0000,

0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x022d, 0x0000, 0x0000,

0x0230, 0x0231, 0x0232, 0x0233, 0x0234, 0x0235, 0x0236, 0x0237,

0x0238, 0x0239, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,

0x0000, 0x0261, 0x0262, 0x0263, 0x0264, 0x0265, 0x0266, 0x0267,

0x0268, 0x0269, 0x026a, 0x026b, 0x026c, 0x026d, 0x026e, 0x026f,

0x0270, 0x0271, 0x0272, 0x0273, 0x0274, 0x0275, 0x0276, 0x0277,

0x0278, 0x0279, 0x027a, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,

0x0000, 0x0261, 0x0262, 0x0263, 0x0264, 0x0265, 0x0266, 0x0267,

0x0268, 0x0269, 0x026a, 0x026b, 0x026c, 0x026d, 0x026e, 0x026f,

0x0270, 0x0271, 0x0272, 0x0273, 0x0274, 0x0275, 0x0276, 0x0277,

0x0278, 0x0279, 0x027a, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,

0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,

0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,

0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,

0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,

0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,

0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,

0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,

0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,

0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,

0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,

0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,

0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,

0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,

0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,

0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,

0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000

Map outputted by example

 

Now that we have this map saved off we can write an extremely elegant tokenizer that is also faster than our prior incarnations which used conditionals.

        public unsafe WordEntry[] CountWords(string _Text) {

            table.Clear();

            int loc = 0;

            int lastloc = 0;

            byte [] buf = new byte[_Text.Length * 2];

            fixed (byte* buffer = buf) {

                fixed (char* c = _Text) {

                    char* current = c;

                    char* stop = c + _Text.Length;

                    while (current < stop) {

                        UInt16 val = map[*current];

                        int add = (val >> 9);

                        if (add > 0) {

                            buffer[loc] = (byte) (val & 0xFF);

                            loc++;

                        } else if (loc != lastloc && ((val >> 8) & 1) == 1) {

                            table.Increment(buffer + lastloc, loc - lastloc );

                            loc += 4;

                            lastloc = loc;

                        }

                        current++;

                    }

                    if (loc != lastloc) {

                        table.Increment(buffer + lastloc, loc - lastloc);

                    }

                    return table.ToArray();

                }

            }

        }

Map Based Parser

 

Clean, compact, and efficient … what more can you ask for?

A Side Challenge

I gave myself a side challenge yesterday, could I tokenize the string two chars at a time with only a single branch in my loop? It took me a little while to come up with an answer to this problem but it did exist. Here is the code.

                    while (current < stop) {

                        Int32 chars = *current;

                        chars = (chars ) >> 16 | ((chars & 0xFF) << 8);

                        Int32 processed = parsermap[chars];

                        Int16* tmp2 = (Int16*)tmp;

                        *tmp2 = (Int16)(processed & 0xFFFF);

                        int add = (processed >> 19); //number to add to output pointer

                        tmp += add;

                        tmp2 = (Int16*)current;

                        tmp2 += (processed >> 17) & 0x03; //number to add to buffer pointer (1 or 2)

                        current = (Int32*)tmp2;

                        if ((processed & 0x010000) > 0) {

                            Int32* t = (Int32*) tmp;

                            *t=0;

                            table.Increment(last, (int) (tmp - last));

                            tmp += 4;

                            last = tmp;

                        }

                    }

Parse 2 chars at a time with only one if statement?

 

In order to really look at this code you will also need to look at the code that generates a 64k map that it references (no I will not post the 64k map hereJ). The complete code and the map can be found in the MappedWordCounter.cs file. If people would like I can explain further just how it works but basically it sets high bits in the map to define how much to move forward both the buffer and the output pointers (it also stores a bit flag as to whether or not the 2 chars contained a word terminator as did the previous tokenizing example). There is however one odd case which is worth discussion.

If you read two characters and the first is an ignored or termination character. In the case of an ignored character you don’t want to end up with a “hole” in your array as such you have to not move the output buffer.  In this case one is added to the buffer making it on an odd letter and you add zero to the output as you don’t want to include the character. This means that in this case you read and write the second character twice. A further special case can be seen when you have two ignored or termination characters except the buffer moves two while the output moves zero positions.

Initial Hash Table Size

In the tests where a new object is created for every call the hash table size becomes extremely important. If the table is too big it will be allocating useless memory and will need to clear that memory. If the table is too small it will have to go through numerous growth operations which are a fairly big hit. I have yet to find a “good” way of initializing the table except for using domain information such as “every 9 letters or so will on average produce a new word”.

In general it is best to error a bit on the side of caution and allocate slightly more memory to your hash table than to error the other direction and to force a growth of your table.

I made YoungCompressed use a 150000 entry table for every entry, check out the performance characteristics compared to the other entries which were using a text.length / 7 sized table.

The Default Equality Comparison is SLOW

No submissions used this but it can quickly speed up many of them.  I am going to pick on Brandon Grossutti’s submission here but the optimization applies to all of them that use a dictionary<>. By simply changing Brandon’s code from

Dictionary<string, int> MyWords = new Dictionary<string, int>();

To

Dictionary<string, int> MyWords = new Dictionary<string, int> (1, StringComparer.Ordinal);

Took about 7% off the total running time of the submission. This is an important one to remember.

Write a Custom Hash Table

It is well known that code which is designed to be generic will often be less efficient than code tailor made for the job at hand (in other words generality comes at a loss of performance). The .NET dictionary and hash classes are both very generic. By writing your own version of these classes there is a performance gain to be had. I wrote a few varying forms of my hash table that are included with the code.

The first version of the table works in a very similar fashion to the standard table. It stores  the string keys as char * and uses a modulus based system for finding the next available slot (i.e. slot = (slot + 1) % tablesize). This method is quite fast but there are a few optimizations which can be made.

One of these optimizations is to get rid of the modulus which is a slow operation on most machines (and it is used extremely often, every probe). In order to get rid of the modulus we need to make some special requirements on the hash table. To be exact you need to control the size of the hash table to only be a power of two (this can be seen in "bithash" and is in my overall submission).

        /// <summary>

        /// calculates the next highest power of two

        /// </summary>

        /// <param name="number"></param>

        private static UInt32 CalculateNextPowerOfTwo(uint _Number) {

            for (int i = 8; i < 32; i++) {

                uint current = (uint) 1 << i;

                if (current > _Number) {

                    return current;

                }

            }

            return uint.MaxValue;

        }

Code to find the next item that is a power of two given a number

 

If the table is a power of two you can use a bit mask instead of a modulus to insure that your current value is within the realm of your hash. Once the hash is assured to be a power of two one can calculate the bitmask for the hash table as hashsize – 1. All that is then required is the mask which can be seen here.

uint slot = (uint)(Hash & m_HashMask);

The bit mask is significantly faster on most architectures, this change yielded nearly a 10% gain for the algorithm on my machine.

Buffer?

Since I am using char* in my hash table, I do not need to initialize a specific string for it (and I can deal with as a unit32* when I need to do things like comparisons). The big problem you will run into here is that you need to keep the memory that the string is in pinned. This can either be done explicitly or by holding a GCHandle for the object (such as I do with the actual data in the hash table). If you do not do this you can end up with seemingly random bugs that occur when the GC compacts the heap (the string you were pointing to has moved).

In order to get around the nightmare of trying to maintain GCHandles my tokenizer puts the data into a big output buffer. This output buffer is unsafe and it is pinned for the duration of the operation (so we know it won’t move).

This methodology is not very general but it does help prevent a lot of other overhead.

Equality Compare (Custom Hash)

One of the slowest items in the code is the equality compare of an entry in the hash table.  Consider the following code

                    for(int j=0;j<_Length;j++) {

                        if(_Word[j] != Current->Chars[j]) {

                            //failed

                        }

                    }

 

Naïve Equality Compare

 

This code goes through character by character comparing our string in our hash to the string we wish to use (keep in mind that this gets called at least once for every probe to the hash). A better way of doing this would be unroll the loop a bit by using a (uint32*) to perform our comparison, as such we could compare two characters every iteration of the loop instead of just one (remember that a character is 16 bits).

We will however run into a problem comparing two characters at a time. What happens if we have a string with an odd number of characters? We only really have two choices about what to do here.  The first option is to loop 1 character short and then check the last character, this is a generally good solution but it can be optimized further if you can control the data coming in.

Since there is only one client for this hash we can definitely control the data coming in. Imagine the words cat, dog, and “the” in memory as shown below.

C

A

T

D

O

G

T

H

E

 

 

 

 

In this case we would need to use the above method (where we special case the last character and read it separately). Since we control the memory layout however we can avoid this by “aligning” the memory as follows.

C

A

T

0

D

O

G

0

T

H

E