I spent about 10 hours from Friday evening until Sunday evening beating my head on the wall trying to squeeze every last bit of performance out of some checksum code I wrote. I need to compute a lot of CRC32s (hundreds of thousands of files that make up roughly 5 TB). On the target machine, with an installed crc32 program (that used ZLib under the covers) I was able to do about 341 MBps. On the implementation I custom wrote I initially was getting 64 MBps. This clearly was a hangup since with this volume of data, I need all the speed I could get. I also didn't want a dependency on a 3rd party library, so I set about trying to squeeze all the performance I could out of my own implementation.

I initially re-wrote it to use several threads in parallel,computing the CRC in a block fashion, and then combining them with some tricky Galois field algebra, but this was another dead end. Minimum performance gain. Back to the drawing board, and my next plan was to operate on either 64bit words or 32bit half-words, rather than individual bytes (since nearly every CPU anymore is 64bit). Another dead end.

By this point I was getting pretty frustrated, and I nearly accepted the fact that the only performance I was going to gain was by doing the files themselves in parallel, giving me only about 250 MBps, an improvement of course, but still an admission of defeat. For some reason, I decided to try adding the compiler optimization flag to the makefile. For GCC, -O3 will give the maximum optimization. I gave it another run, and my initial quick implementation was on par with ZLib application. Success! But at the same time the realization that I wasted several hours on several pointless iterations of the code, for naught! A good lesson for all - compiler optimization is your friend! That and I also discovered in the GCC docs that even if you declare a function inline, it will NOT be inlined unless you specify a level of optimization. In case you are interested, here is the base of the CRC implementation I wrote. You need to either compute, or hardcode, a lookup table, and then pass chunks (some files are too big to load all in RAM) of the file to this running CRC checksum function. This function also assumes that the initial crc value passed in is 0.

uint32_t crc32(uint32_t crc, const uint8_t *ptr, const size_t len)
    crc ^= (~0UL);

    for (size_t i = 0; i < len; ++i) {
      crc = _table[(crc ^ *ptr++) & 0xFF] ^ (crc >> 8);

    return (crc ^ (~0UL));