Thursday, August 8, 2013

GetHashCode(): is it that important?

Introduction



The idea for this blog entry came up after a pair code review session.
I noticed the following method (from a class that holds picture coordinates):

public override int GetHashCode()
{
    return X.GetHashCode() + Y.GetHashCode(); // Algo. 1
}

At first glance it looked like a fair implementation.

I know a few rules of thumb for a solid GetHashCode() implementation:
•  The hash algorithm must be deterministic (for a given input, the output must always be the same).
•  Equal objects must have the same hash code.
•  Objects with the same hash code aren't necessarily equal.

This entry reviews several implementations of GetHashCode().
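As a baseline, here is a minimal sketch of the coordinate class (the name XyPoint is taken from the dictionary benchmark later in this entry; the exact field layout is an assumption) with an Equals/GetHashCode pair that satisfies the rules above:

```csharp
// Minimal sketch: the XyPoint name comes from the benchmark below,
// but the field layout is an assumption.
public struct XyPoint
{
    public int X;
    public int Y;

    public override bool Equals(object obj)
    {
        return obj is XyPoint p && p.X == X && p.Y == Y;
    }

    public override int GetHashCode()
    {
        // Deterministic, and equal points get equal hash codes;
        // unequal points may still collide (rule 3).
        return X.GetHashCode() + Y.GetHashCode(); // Algo. 1
    }
}
```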


Preliminary testing


Why isn't it a solid implementation?
Well, let's test it; the tests will reveal the answer.
The test defines an image 1024 × 1024 in dimensions, ~1M pixels in total.
Our class represents a single pixel on the image
and overrides the GetHashCode() method.
Now let's calculate the hash code for each and every pixel on the image surface.
The values are then compared and summarized by colliding hash code values.

The number of collisions per pixel determines the pixel's color:
white for collision-free pixels, and darker colors for increasingly
"popular" (see legend) hash code values.
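The counting part of such a test can be sketched as follows (a rough sketch, assuming the XyPoint struct with public X/Y fields described above; the bitmap rendering is omitted):

```csharp
using System;
using System.Collections.Generic;

// Count how many pixels share each hash code value on a 1024 x 1024 surface.
var counts = new Dictionary<int, int>();
for (int x = 0; x < 1024; x++)
{
    for (int y = 0; y < 1024; y++)
    {
        int h = new XyPoint { X = x, Y = y }.GetHashCode();
        counts[h] = counts.TryGetValue(h, out int c) ? c + 1 : 1;
    }
}
// counts[h] == 1 means a collision-free (white) pixel;
// larger counts map to darker colors on the collision map.
Console.WriteLine(counts.Count); // distinct hash values out of 1,048,576 pixels
```

With Algo. 1, X + Y only ranges over 0..2046, so the million pixels are crammed into at most 2,047 distinct hash values.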

This produces the following collision map for the 1st algorithm:

[Collision map image for Algo. 1]
The results are even worse than expected!
This is mainly because the .NET implementation of int.GetHashCode()
returns the int value itself, so the hash is simply X + Y,
and every pixel on the same anti-diagonal collides:

Pixel(X=50, Y=55).GetHashCode() == 105
Pixel(X=55, Y=50).GetHashCode() == 105
Pixel(X=54, Y=51).GetHashCode() == 105
And so on and on...
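This is easy to verify in isolation; for Int32 the hash is the value itself, so the sum collapses whole anti-diagonals into one hash code:

```csharp
using System;

// int.GetHashCode() returns the value itself in .NET,
// so X + Y collides for every pixel on the same anti-diagonal.
Console.WriteLine(50.GetHashCode());                     // 50
Console.WriteLine(50.GetHashCode() + 55.GetHashCode());  // 105
Console.WriteLine(55.GetHashCode() + 50.GetHashCode());  // 105 again
```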


Suggested alternative algorithms


Let’s try a slightly better solution:
public override int GetHashCode()
{
    return X.ToString().GetHashCode() 
              + Y.ToString().GetHashCode(); // Algo. 2
}

You might expect that combining X and Y with the XOR operator
would produce considerably better results:
public override int GetHashCode()
{
    return X.ToString().GetHashCode() 
              ^ Y.ToString().GetHashCode(); // Algo. 3
}

A common solution is to use a Mersenne prime (31 = 2^5 - 1) as a multiplier:
public override int GetHashCode()
{
    return X + Y * 31; // Algo 4.
}
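The same idea generalizes to any number of fields: start from a seed and fold each field in with the prime multiplier, letting the arithmetic overflow wrap. A hedged sketch (the seed 17 is a common convention, not taken from the article):

```csharp
public override int GetHashCode()
{
    // Multiply-add variant of Algo. 4; overflow is intentional,
    // so wrap the arithmetic in unchecked.
    unchecked
    {
        int hash = 17;           // arbitrary non-zero seed (assumption)
        hash = hash * 31 + X;
        hash = hash * 31 + Y;
        return hash;
    }
}
```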

Assuming image dimensions are restricted to 64K (both X and Y fit in 16 bits),
we can create a "perfect hash":
public override int GetHashCode()
{
    return (X << 16) + Y; // Algo.5
}
 

Testing output


2nd algorithm result:

[Collision map image for Algo. 2]

3rd algorithm result (although it looks better, it's actually a bit worse):

[Collision map image for Algo. 3]

4th algorithm result (good distribution across the field,
pretty impressive for such a simple algorithm):

[Collision map image for Algo. 4]

There is no point in showing the 5th algorithm's result: the image is pure white,
no collisions at all!
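That claim is easy to check: with both coordinates below 2^16, (X << 16) + Y is injective, so no two pixels can share a hash code. A quick sanity-check sketch:

```csharp
using System;
using System.Collections.Generic;

// Every (X, Y) pair on a 1024 x 1024 surface gets a distinct hash.
var seen = new HashSet<int>();
for (int x = 0; x < 1024; x++)
    for (int y = 0; y < 1024; y++)
        seen.Add((x << 16) + y);

Console.WriteLine(seen.Count); // 1048576 -> one hash per pixel, zero collisions
```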


Collision effect!


Why are collisions so important?
Mainly for one big reason: performance!

Let's measure the time it takes to
calculate the hash code and to insert/look up entries in a dictionary (Dictionary<XyPoint, int>).

This table shows the times (1 million POCO items)
for our algorithms (tested on an Intel E8400 machine):

Time in ms.            Algo. 1   Algo. 2   Algo. 3   Algo. 4   Algo. 5
HashCode calculation        34       280       283        30        31
Dictionary insert       19,300     2,587     2,758       547        72
Dictionary lookup       18,614     2,468     2,692       577        98
Total time              37,948     5,335     5,733     1,154       201
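The benchmark can be sketched roughly like this (assuming the XyPoint struct with public X/Y fields; the original harness and exact loop shape are not shown in the article):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Insert 1M points, then look them all up; chained collisions
// show up directly in these timings.
var dict = new Dictionary<XyPoint, int>();

var sw = Stopwatch.StartNew();
for (int x = 0; x < 1024; x++)
    for (int y = 0; y < 1024; y++)
        dict[new XyPoint { X = x, Y = y }] = x + y;
sw.Stop();
Console.WriteLine($"Insert: {sw.ElapsedMilliseconds} ms");

sw.Restart();
for (int x = 0; x < 1024; x++)
    for (int y = 0; y < 1024; y++)
        _ = dict[new XyPoint { X = x, Y = y }]; // walks the collision chain
sw.Stop();
Console.WriteLine($"Lookup: {sw.ElapsedMilliseconds} ms");
```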


Conclusion


Collisions cause a huge performance impact.
Even a good hash algorithm can suffer from a bad implementation
due to incompatibility or misuse.
Although Algo. 1 had one of the fastest hash calculations,
its overall time was about 190 times slower than Algo. 5!
The reason is that colliding values are chained into the same bucket,
so every insert and lookup needs an additional search through the chain.
It's incredible that one small, gray line of code can degrade (or boost) performance this much.

Probably the next time you override GetHashCode(),
you'll spend a little more time finding a better solution.

I hope this entry sheds some light on the subject and emphasizes its importance.
