I recently received an email from someone who had a question about some code in one of my github
repositories. It made me think of various code hacks or optimizations I've used throughout the years. The question posed to me
was specifically about this function and how
the assembly routine worked to produce a log base-2. The BSR instruction
takes a value in the source operand and scans from the most significant bit to the least significant bit (hence the reverse in the name)
searching for the first set bit. If it finds no bits set, it clears the Zero Flag, otherwise it sets ZF and loads the bit index
with the set bit into the destination operand. This becomes really useful when you consider how log base-2 and binary numbers work: any
number's most significant bit, in base 2, ends up being the logarithm for that base. This is a really quick way to calculate this value
without much overhead as long as you are fine with a potentially rounded integer result. If you need real numbers, use the libraries.
Another one that I've seen a few times, and used a few times myself is using & (n-1) to compute the modulo of a number n, provided that n is a power of 2. This works very well and is quite fast (although the compiler generated assembly for modulo is also pretty fast. It almost always (at least in the dumps I've looked at) uses a series of moves, shifts, adds and multiplies to produce a result, that surprisingly enough, is always faster than using the DIV instruction in assembly, which itself will save the remainder in DX) and is an acceptable substitute when you know that n is a power of 2. To understand how it works, consider what any 2^n - 1 value looks like in binary. Its all 1s, from the significant bit of n-1 down to the least significant bit. When we bitwise and any number with n-1 we are basically masking off any bits that would be part of a remainder! If this isnt clicking, try working out a few examples by hand to see how it works.
Another that I've never used but that is pretty well known is the Fast Inverse Square Root. This is frequently attributed to John Carmack of id Softawre fame though he has confirmed that he was not responsible for its creation. I won't even attempt to explain the algorithm, and I cant even imagine how someone came up with it. (Chris Lomont wrote up a really interesting paper regarding the approximation of the constant that is used in this algorithm. Its long and dense, but worth a read if you are interested). I think its a great example though of some of the lengths people will go to to squeeze performance out of their programs and also of how unintuitive some of these tricks can be. Just take a look at the source code comment in the Quake 3 source code ;)
Another one that I've seen a few times, and used a few times myself is using & (n-1) to compute the modulo of a number n, provided that n is a power of 2. This works very well and is quite fast (although the compiler generated assembly for modulo is also pretty fast. It almost always (at least in the dumps I've looked at) uses a series of moves, shifts, adds and multiplies to produce a result, that surprisingly enough, is always faster than using the DIV instruction in assembly, which itself will save the remainder in DX) and is an acceptable substitute when you know that n is a power of 2. To understand how it works, consider what any 2^n - 1 value looks like in binary. Its all 1s, from the significant bit of n-1 down to the least significant bit. When we bitwise and any number with n-1 we are basically masking off any bits that would be part of a remainder! If this isnt clicking, try working out a few examples by hand to see how it works.
Another that I've never used but that is pretty well known is the Fast Inverse Square Root. This is frequently attributed to John Carmack of id Softawre fame though he has confirmed that he was not responsible for its creation. I won't even attempt to explain the algorithm, and I cant even imagine how someone came up with it. (Chris Lomont wrote up a really interesting paper regarding the approximation of the constant that is used in this algorithm. Its long and dense, but worth a read if you are interested). I think its a great example though of some of the lengths people will go to to squeeze performance out of their programs and also of how unintuitive some of these tricks can be. Just take a look at the source code comment in the Quake 3 source code ;)
C
Assembly