My last two posts have been about Performance Measurement and Viewing Unmanaged Code in VS.NET, these posts while rather interesting have really been to set us up for the this discussion. If you have not read these posts you may want to as I will be using information from them often in this post.
There are a lot of posts on the Internet discussing varying methods of looping using for loops and which perform best. These posts also generally give advice as to how you should handle your looping based upon performance metrics. Most of these types of posts suffer from an affliction I have discussed previously; they look at IL (Intermediate Language) to rationalize performance which is just flat out wrong as subtle differences can cause JIT (Just in Time) optimizations to go haywire.
In this post we will look at the unmanaged code produced by varying constructs to come up with a definitive answer to the performance question. We will also look at some other optimization methods that people claim happen but do not always happen and discuss some reasons why they may not be happening and of course set some rules for fast looping.
In trying to apply this the version of your framework is important as the JIT may vary from version to version, I do not believe anything here will change based upon the platform but that is a possibility. I used version 2.0.50727 for all examples in this document. Other versions such as Rotor, Mono, or 1.x will most likely show differing behaviors.
Hoisting?
Before I start getting into code I would like to discuss the concept of hoisting as it will be the focal point of this entire discussion. When dealing with loops we can break any given loop into three sections
1) The preamble (setting up for the loop, doing things such as setting our counter to 0)
2) The code within the loop
3) The code to increment our counter and check for the end of iteration possibly jumping back to step two
For the rest of this post I will color code the disassembled assembly as section one (green), section two (blue), section 3 (yellow) to make things easier to follow
Any code that is in step two or three will be run N times as the loop is executed. The code in step one will only be executed a single time when we first enter the loop. The concept of hoisting involves moving code out of steps two and three and into step one. Hoisting obviously gives you a huge performance gain as the code will only be run once.
Simple Hoisting Example
If you read alot articles on for loop performance they will tell you to not manually hoist because the JIT will end up doing it for you. In order to test this I have created the following test code (which can be used in conjunction with the harness discussed in the previous Performance Measurement post)
|
class Program {
static int[] foo = new int[100000];
static int Dummy;
static void Test1NoHoist() {
int total = 0;
for (int i = 0; i < foo.Length; i++) {
total|=i;
}
Dummy = total;
}
static void Test1WithHoist() {
int total = 0;
int length = foo.Length;
for (int i = 0; i < length; i++) {
total|=i;
}
Dummy = total;
}
static void Main(string[] args) {
TestHarness.Test("Test1NoHoist", 10000, new TestHandler(Test1NoHoist));
TestHarness.Test("Test1WithHoist", 10000, new TestHandler(Test1WithHoist));
Console.ReadLine();
}
} |
|
Listing 1: Basic manual hoisting example |
You will quickly notice that in Test1WithHoist we manually took the check Length property on foo and moved it to before the loop; this is a common optimization if you come from a C/C++ world. The thought behind this is that since Length translates to get_Length() that we are saving ourselves from having to call the method n times (remember that this would be in the second or third section and everything in the third section is called n times).
An astute reader will notice an oddity dealing with the “Dummy” variable that is assigned after the loop. We will come back this a bit later on… good catch ;)
The by running our performance test, we can see that the code executes in roughly equivalent time.
|
Test |
Total Time (ns) |
Average (ns) |
|
Test1NoHoist |
820583612.35 |
82058.36 |
|
Test1WithHoist |
812735363.70 |
81273.53 |
Let’s take a look at the generated code to offer proof that the two bits of code should run in equivalent time (if you want to see the results for yourself remember to follow the instructions for getting JIT optimized code from Viewing Unmanaged Code in VS.NET.
|
Test1NoHoist |
Test1WithHoist |
|
00000000 xor ecx,ecx
00000002 xor edx,edx
00000004 mov eax,dword ptr ds:[022B1EC4h]
00000009 mov eax,dword ptr [eax+4]
0000000c test eax,eax
0000000e jle 00000019
00000010 or ecx,edx
00000012 add edx,1
00000015 cmp eax,edx
00000017 jg 00000010
00000019 mov dword ptr ds:[00912FE8h],ecx
0000001f ret
|
00000000 xor ecx,ecx
00000002 mov eax,dword ptr ds:[022B1EC4h]
00000007 mov edx,dword ptr [eax+4]
0000000a xor eax,eax
0000000c test edx,edx
0000000e jle 00000019
00000010 or ecx,eax
00000012 add eax,1
00000015 cmp eax,edx
00000017 jl 00000010
00000019 mov dword ptr ds:[00912FE8h],ecx
0000001f ret |
|
Listing 2: Disassembled versions of our functions |
|
A kind of neat item in these two bits of code is that the only real difference is that EAX and EDX are interchanged … This makes them look a bit more different than they are but rest assured they are pretty much identical. In both cases, the access to the stop variable has been hoisted out of the loop (in section three, both are comparing their counter to a register that was preloaded in section one). To better illustrate the point, here is a disassembly of the first for loop (without manual hoisting) when run without JIT optimizations (i.e. it will not be hoisted).
|
0000002a xor esi,esi
0000002c nop
0000002d jmp 00000034
0000002f nop
00000030 or edi,esi
00000032 nop
00000033 inc esi
00000034 mov eax,dword ptr ds:[02275A34h] ;inlined length
00000039 cmp esi,dword ptr [eax+4]
0000003c setl al
0000003f movzx eax,al
00000042 mov ebx,eax
00000044 test ebx,ebx
00000046 jne 0000002F
|
|
Listing 3: First loop (no hoisting with optimizations disabled) non-important code removed |
If this does not convince you to NEVER release code that is not being optimized I don’t know what will. It should also give you an idea of how much work the JIT optimizer really does. What has happened here is that the property was inlined, but it was inlined into section three. The optimizer was smart enough to realize this and to move it up to the preamble. +1 for the optimizer
Non-Trivial Hoisting
Now that we have gone through a simple example, let’s try a more difficult one for the optimizer. In this example we will use a method GetUpperBound(0) which will not be inlined for us, let’s take a look at how well the JIT handles it. Here is the testing code to add to our previous code.
|
static void Test1NoHoistNoInlining() {
int total = 0;
for (int i = 0; i < foo.GetUpperBound(0); i++) {
total |= i;
}
Dummy = total;
}
static void Test1WithHoistNoInlining() {
int total = 0;
int length = foo.GetUpperBound(0);
for (int i = 0; i < length; i++) {
total |= i;
}
Dummy = total;
} |
|
Listing 4: Tests without inlining available |
Next we should add the following to our Main to run the tests.
|
TestHarness.Test("Test1NoHoist", 10000, new TestHandler(Test1NoHoistNoInlining));
TestHarness.Test("Test1WithHoist", 10000, new TestHandler(Test1WithHoistNoInlining));
|
|
Listing 5: Calling our new methods |
When we run the tests in release mode, we get quite different results than previously. I am including the previous results in the table for comparison.
|
Test |
Total Time (ns) |
Average (ns) |
|
Test1NoHoist |
820583612.35 |
82058.36 |
|
Test1WithHoist |
812735363.70 |
81273.53 |
|
Test1NoHoistNoInlining |
25457518879.48 |
2545751.88 |
|
Test1WithHoistNnInlining |
824641533.67 |
82464.15 |
Interesting, not manually hoisting makes our function 32x slower. Before we even get into code on this one I will put my money on the optimizer not hoisting the call. I believe however there are some reasons for this that we will discuss after looking through the code.
|
Test1NoHoistNoInlining |
Test1WithHoistNoInlining |
|
00000004 xor esi,esi
00000006 mov ecx,dword ptr ds:[022B1EC4h]
0000000c xor edx,edx
0000000e cmp dword ptr [ecx],ecx
00000010 call 792661F8
00000015 test eax,eax
00000017 jle 00000031
00000019 or edi,esi
0000001b add esi,1
0000001e mov ecx,dword ptr ds:[022B1EC4h]
00000024 xor edx,edx
00000026 cmp dword ptr [ecx],ecx
00000028 call 792661F8
0000002d cmp eax,esi
0000002f jg 00000019
|
00000003 mov ecx,dword ptr ds:[022B1EC4h]
00000009 xor edx,edx
0000000b cmp dword ptr [ecx],ecx
0000000d call 792661A8
00000012 mov edx,eax
00000014 xor eax,eax
00000016 test edx,edx
00000018 jle 00000023
0000001a or esi,eax
0000001c add eax,1
0000001f cmp eax,edx
00000021 jl 0000001A |
|
Listing 6: Disassembled versions |
|
As we suspected the code on the left hand side is not automatically hoisting the call of the function for us. It would seem that the JIT will not automatically hoist method calls for us, that instead we have to explicitly state that we want them hoisted on our own.
It seems that that the JIT cannot get to the point of having a value in a register or memory that it can consider its result and as such feels the need to make the call on every iteration. This would make sense in general as I may be depending upon the behavior of being called at every interval. Consider the following code.
|
static int t = 0;
static bool KeepGoing() {
Console.WriteLine("Still Going!");
t++;
return t < 10;
}
static void TestShortMethod() {
for (; KeepGoing();) {
System.Threading.Thread.Sleep(100);
}
} |
|
Listing 7: Odd but valid code |
Obviously this is nasty code but it is still valid. This is a simplified example it appears that it was a conscious decision to not support these types of situations as they can quickly become extremely complex from the JIT point of view, my guess is that it won’t deal with anything beyond simply returning a variable which will be inlined into a simple read anyway. In my opinion, something like this should still be inlined (and left in section 3) to avoid the overhead of setting up the call on every iteration but I may be missing some other case that makes this more difficult. Based upon these results we can create the following rule.
If you wish to use a method call for your stop condition that does anything more complex than simply returning a variable or is not inlinable for other reasons such as being virtual; you should hoist it manually by placing it in the first part of your for loop in order to make it explicit to the JIT that you do not wish the behavior to be called on every iteration.
Mark Lubischer brought up an excellent point here. We can in fact also hoist the call by using for like this.
|
static int MarkWithoutInlining() {
int total = 0;
int[] length = new int[10000];
for (int i = 0, j = length.GetUpperBound(0); i < j; i++) {
total |= i;
}
return total;
} |
This is a much better way of handling our hoisting as it better defines the scope of our variable while the behavior is in fact equivalent to our original hoist.
|
static int MarkWithoutInlining() {
int total = 0;
00000000 push edi
00000001 push esi
00000002 push ebx
00000003 xor ebx,ebx
int[] length = new int[10000];
00000005 mov edx,2710h
0000000a mov ecx,7915982Ah
0000000f call FFB21D98
00000014 mov edi,eax
for (int i = 0, j = length.GetUpperBound(0); i < j; i++) {
00000016 xor esi,esi
00000018 mov ecx,edi
0000001a xor edx,edx
0000001c cmp dword ptr [ecx],ecx
0000001e call 792664C8
00000023 test eax,eax
00000025 jle 00000039
00000027 mov edx,dword ptr [edi+4]
total |= length ;
0000002a cmp esi,edx
0000002c jae 0000003F
0000002e or ebx,dword ptr [edi+esi*4+8]
for (int i = 0, j = length.GetUpperBound(0); i < j; i++) {
00000032 add esi,1
00000035 cmp esi,eax
00000037 jl 0000002A
}
return total;
00000039 mov eax,ebx
0000003b pop ebx
0000003c pop esi
0000003d pop edi
0000003e ret
0000003f call 792B42E9
00000044 int 3 |
A Bit About Foreach
Foreach is not an IL concept, it is a compiler concept. Compilers generate normal for loops when they see a foreach being used to iterate an array. If you are interested in seeing how foreach loops work I would highly recommend taking a look with ILDASM or Reflector. I will not discuss heavily foreach loops as they have at this moment no difference when they reach the native level.
Foreach could in the future offer many benefits over a general for loop. Since the foreach loop is explicitly stating that you want to iterate through the array (not allowing things like addition and subtraction to your counter), other things such as hoisting array bounds checks (which we will discuss shortly) would be much more easily accomplished. I would imagine that in the future foreach will be the preferred iteration construct.
I will assure you though that as of now foreach is not faster than the equivalent for loop when dealing with arrays, in fact both VB.NET and C# insure that they are identical. There is one case where foreach will be slower than a for loop and that is if you do not actually use the item within your loop, the for version will obviously be faster since it never loads the variable where as the foreach does this implicitly on every iteration. Foreach can however offer you great performance increases when dealing with other types that implement IEnumerable (not including collections as they simply perform an indexed lookup in their enumerator).
Logically one can come up with a great example for this when looking at a linked list. The for loop will cause ∑ n total operations on the iteration where as the foreach will only cause n. The reason for this is every index operation would have to start at the beginning of the list and iterate n nodes in order to return the nth node where as the enumerator will simply remember the last node it was on and give the next node. For this reason we will add our second rule.
When dealing with items that are enumerable as opposed to dealing with arrays or collections; prefer foreach to a for loop as the enumerator will often offer a much faster way of enumerating than using an index.
Array Bounds Check Hoisting
The check to find out whether or not we are valid to continue is not the only thing that can be hoisted from inside of a loop. In fact every time the variable is used in a comparison within the loop is an opportunity for hoisting to occur. The first example of this type of hoist we will look at is array bounds checking.
Array bounds are checked automatically by the CLR in order to prevent things like buffer overflows from occurring. Every time that you access an array in safe code you will in fact have a comparison occur to insure that you are within the range of the array. If you are not within a valid range an IndexOutOfRangeException will occur as opposed to writing happily beyond the end of your array as many other language such as C would do.
The problem with array bounds checks is that they are extremely redundant when dealing with loops. Consider the following code that shows what is happening.
|
static void SampleArrayBoundsCheck() {
int total = 0;
for (int i = 0; i < foo.Length; i++) {
if (i < foo.Length) {
total |= foo[ i ];
}
}
} |
|
Listing 9: Sample Array Bounds Code |
Naturally you are not issuing these checks in your code, but this example helps to make what’s really going on a bit clearer. When looking at this code anyone who has read the first few chapters of a C# book would scratch their head and wonder why all of the redundancy has been placed into the code. It is obvious that if our counter has a constraint to stay below foo.Length that it will in fact always succeed the conditional of being below foo.Length. To test how intelligent the JIT is with handling this situation we can use the following code. Note that for this test we will simply be looking at the native code generated as opposed to measuring performance.
|
static int [] SampleArrayBoundsCheck() {
int [] Destination = new int[foo.Length];
for (int i = 0; i < foo.Length; i++) {
Destination[ i ]= foo[ i ];
|