12 minutes 5 seconds
Speaker 1
00:00:00 - 00:00:20
If I gave you 3 cubes and asked how many colors there are, well, it's not that hard of a question. But what about a million cubes? How many unique colors are in this pile? And how would you even go about figuring that out? Today we're talking about HyperLogLog, an algorithm which does just that.
Speaker 1
00:00:20 - 00:00:48
Now before you click away, you don't need to know anything about programming to understand how this works. It simply takes a little observation about flipping coins and extrapolates that out to its logical conclusion. This is such an important algorithm that despite being relatively new, it's used by every major tech company and most modern scalable databases. Google even spent time and resources refining this technique. I think you'll find this really interesting, even if you don't know anything about programming.
Speaker 1
00:00:49 - 00:01:07
So the problem at hand is counting distinct or unique elements. Our example had 3 distinct colors. We could add more cubes, but still only have 3 distinct colors: red, green, and blue. As humans, we can just look at this and immediately know how many colors there are. But how would a computer count those colors?
Speaker 1
00:01:08 - 00:01:43
Well, one solution might be to keep a stack of cubes that we've already seen, and then reference that whenever we need to check a new cube to see if it's a new color or not. So we take our first cube and place it on the stack, but because this is the first one, there's nothing to do except put it on the stack. Then we pick up the next cube and check the stack to see if we've seen this color before. We haven't, so we also put this one on top of the pile. We repeat this process, but this time we have a duplicate, and we know it's a duplicate because when we scan through our little tower of cubes, we can see that there's a matching cube on the tower already, so we can just discard it.
Speaker 1
00:01:43 - 00:02:13
Then we just continue this process, adding new colors and discarding duplicates until we get done with our mountain of cubes. When we're done, we're left with a stack representing all the colors that were in the original pile with all the duplicates removed. And then we just count the number of colors in the stack to determine how many distinct elements there were. This definitely works, but I think you can appreciate that it will take a long time if you have a lot of items and many colors to check. Each time we pick up a new cube, we have to check all the items in the stack to determine if it's a new color.
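For anyone who wants to see that stack-scanning idea in code, here's a minimal Python sketch; the color strings are just made-up example data.

```python
def count_distinct_naive(cubes):
    """Count distinct colors by scanning a growing stack of already-seen colors."""
    seen = []                        # the stack of colors kept so far
    for cube in cubes:
        duplicate = False
        for color in seen:           # compare against every cube already on the stack
            if color == cube:
                duplicate = True     # matching cube found, so this one gets discarded
                break
        if not duplicate:
            seen.append(cube)        # a new color, so place it on the stack
    return len(seen)

print(count_distinct_naive(["red", "green", "blue", "red", "blue"]))  # prints 3
```

The nested loop is exactly the slowdown described above: every new cube has to be compared against everything already on the stack.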
Speaker 1
00:02:13 - 00:02:37
Instead of using a stack of cubes, we could instead use this magic chest. When we throw a cube inside the chest, it will place that cube on a shelf depending on its color. Red goes to this spot on the shelf, and blue goes to that spot on the shelf. It's a magic chest after all, so it has spots for every imaginable color inside. To count the number of distinct elements, we just toss all of our cubes inside the chest.
Speaker 1
00:02:37 - 00:03:01
It will nom its way through all the cubes, sorting them into their individual spots on the shelves. And at the end, we just ask it how many unique colors it ate. The problem with this approach is that while the chest is magic, physics is still physics. Inside that chest it needs a huge amount of shelving for all the potential colors, even if we didn't actually see those colors in our pile. And this is where HyperLogLog comes in.
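In code, the closest everyday stand-in for the magic chest is a hash set: rather than pre-building a shelf spot for every imaginable color, it creates a spot the first time a color shows up, and duplicates just land in the existing spot. A minimal sketch, again with made-up color strings:

```python
def count_distinct_exact(cubes):
    """Count distinct colors with a set: every color gets its own spot on the shelf."""
    shelf = set()
    for cube in cubes:
        shelf.add(cube)    # duplicates land in an existing spot, so the set doesn't grow
    return len(shelf)

print(count_distinct_exact(["red", "green", "blue", "red", "blue"]))  # prints 3
```

Even this practical version still needs memory proportional to the number of distinct colors it has seen, which is the space problem HyperLogLog avoids.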
Speaker 1
00:03:02 - 00:03:24
It's faster than the stack of cubes, and it takes up a whole lot less space than the magic chest. To understand how this algorithm works, we're going to flip some coins. Specifically, we're going to look for a run of heads or tails. So if I flip the coin 5 times and get 5 heads, we consider that a run of 5. Mathematically the chance of this happening is 1 over 2 to the n.
Speaker 1
00:03:24 - 00:03:50
So the probability of getting 5 heads in a row is 1 over 32. The probability of this isn't really that extreme, so we could probably flip coins for, I don't know, like an hour and get 5 heads in a row. But how about 20 heads in a row? Well, that works out to 1 over 1,048,576. You might be flipping that coin for weeks before you got 20 heads in a row.
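The arithmetic behind those two numbers is just 1 over 2 to the n; a quick Python check:

```python
# Probability of n heads in a row with a fair coin is 1 / 2**n.
for n in (5, 20):
    print(f"{n} heads in a row: 1 in {2**n:,}")
# 5 heads in a row: 1 in 32
# 20 heads in a row: 1 in 1,048,576
```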
Speaker 1
00:03:50 - 00:04:15
It's not impossible, it's just very uncommon. And this is the key insight which makes the whole algorithm work. The length of your longest run is roughly equivalent to how long you've been flipping coins. If you only have 5 heads in a row, you probably just started flipping the coin. And if you've seen a hundred heads in a row, well, you've probably been flipping that coin for a really, really long time.
Speaker 1
00:04:16 - 00:04:48
HyperLogLog uses this concept, but instead of flipping coins, we're going to look at binary sequences and check for runs of zeros. So we'll take our original cube and pass it through this mechanical transmogrifier, which spits out a binary sequence. The details of this really aren't that important, but what you do need to know is that when you put in a red cube, you always get out the same binary sequence. And when you put in a blue cube, you get a different binary sequence, but you always get that same sequence out. So it's a consistent one-way transformation.
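In practice the transmogrifier is a hash function: the same input always produces the same bit pattern, and different inputs produce patterns that look random. Here's an illustrative sketch; the choice of SHA-1 and 32 bits is my own, and real implementations typically use faster non-cryptographic hashes.

```python
import hashlib

def transmogrify(color, bits=32):
    """Map a color to a fixed-width bit string; the same color always gives the same bits."""
    value = int.from_bytes(hashlib.sha1(color.encode("utf-8")).digest()[:4], "big")
    return format(value, f"0{bits}b")

print(transmogrify("red"))   # the same 32-bit pattern every time "red" goes in
print(transmogrify("blue"))  # a different, but equally consistent, pattern
```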
Speaker 1
00:04:50 - 00:05:05
Now we have everything we need to run the algorithm. We transform our first cube and get a binary sequence. Starting at the right-hand side, we count how many zeros there are in a row. This is the run of zeros that we're looking for. In this case, there are 2 zeros in a row.
Speaker 1
00:05:05 - 00:05:30
So we record that in a scorecard off to the side and feed in the next cube. This binary sequence happens to have 4 zeros in a row, which is larger than our current best run. So we update our scorecard from 2 to 4. The next cube is a color that we've already seen, and it transforms back to the same sequence that we had before, with 2 zeros in a row. This is less than our best score, so we don't need to make any changes, we just discard it.
Speaker 1
00:05:30 - 00:05:59
Because the transformation from color to binary is consistent, we can see that duplicates are essentially ignored, because we've already accounted for that the first time that we saw that binary sequence. And that is basically it. We run through this process for the rest of the cubes, updating the scorecard whenever we see a longer run of zeros. And in the end we are left with a number that represents the longest run of zeros that we've seen out of all the cubes we fed to the algorithm. Now here's the cool part.
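Putting the pieces so far together, here's a minimal single-scorecard sketch in Python. The hash helper mirrors the transmogrifier sketch above; again, the specific hash is just an illustrative choice.

```python
import hashlib

def hash_bits(color, bits=32):
    """Consistent bit pattern for a color (the 'transmogrifier')."""
    value = int.from_bytes(hashlib.sha1(color.encode("utf-8")).digest()[:4], "big")
    return format(value, f"0{bits}b")

def trailing_zero_run(bit_string):
    """Length of the run of zeros at the right-hand end of the bit string."""
    run = 0
    for bit in reversed(bit_string):
        if bit != "0":
            break
        run += 1
    return run

def longest_run_of_zeros(cubes):
    """Single-scorecard version: remember only the longest run of trailing zeros."""
    scorecard = 0
    for cube in cubes:
        scorecard = max(scorecard, trailing_zero_run(hash_bits(cube)))
    return scorecard

print(longest_run_of_zeros(["red", "green", "blue", "red", "blue"]))
```

Duplicates can never change the answer, because the same color always hashes to the same run length, and the scorecard only moves when a longer run shows up.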
Speaker 1
00:06:00 - 00:06:36
Remembering back to the coin flipping, the length of a run is roughly proportional to how long we were flipping that coin. So if the longest run of zeros we've seen is pretty small, well that means we haven't really looked at that many colors, just like we weren't flipping coins for all that long. But if our final scorecard is really high, chances are good that we've looked at millions of distinct colors to get there. The chance of getting 20 zeros in a row in our binary sequence is the same probability as flipping 20 heads in a row. So if we find a binary sequence with 20 zeros in a row, we've probably seen on the order of a million different colors.
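As a back-of-the-envelope check, a single-scorecard guess is roughly 2 raised to the longest run:

```python
longest_run = 20
print(f"longest run of {longest_run} zeros -> roughly {2 ** longest_run:,} distinct items seen")
# longest run of 20 zeros -> roughly 1,048,576 distinct items seen
```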
Speaker 1
00:06:36 - 00:06:59
Now this does mean it's a probabilistic algorithm. It won't tell you exactly how many colors it has seen; it's just guessing at how many colors it has probably seen. But that guess is surprisingly good, usually within 2% of the actual distinct count. And this algorithm can scale to trillions of items. It's important to note that this only works with random events.
Speaker 1
00:07:00 - 00:07:21
There are equal chances of getting heads or tails when flipping a coin. And similarly, our mechanical transmogrifier gives us a binary sequence that is completely random. There's a 50-50 chance that any particular bit is a 1 or a 0. But there is one notable edge case that we need to talk about. What happens if we just get unlucky?
Speaker 1
00:07:21 - 00:07:44
Maybe the first color has 20 zeros in a row. It's unlikely, but it's totally possible. We've only seen this one item, but the algorithm will estimate that we've already seen around a million distinct colors, which is obviously pretty wrong. The solution to this is just to keep multiple scorecards. We use the first few bits of the binary sequence as an index value, which picks the scorecard to use.
Speaker 1
00:07:45 - 00:08:09
Otherwise, the algorithm is basically the same, and when we're done we're left with a list of numbers representing the longest runs, rather than a single value. And then we just take the harmonic mean of all these values to make our final guess. By splitting it up into multiple scorecards, we reduce the impact of getting unlucky. It also has some nice side benefits, like being able to adjust the precision of the algorithm. If you want more accurate guesses, you just increase the number of scorecards.
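Here's a simplified sketch of that multi-scorecard scheme, following the description above: the first few bits of the hash pick a scorecard, the remaining bits supply the zero run, and a harmonic-mean combination produces the guess. The index_bits value, the color-N test data, and the hash are all my own illustrative picks, and I've left out the bias-correction constant and the small/large-count corrections that real HyperLogLog implementations apply, so this version's guesses are much rougher than the ~2% figure mentioned earlier.

```python
import hashlib

def hash32(item):
    """Consistent 32-bit integer hash for an item (SHA-1 is just an illustrative choice)."""
    return int.from_bytes(hashlib.sha1(str(item).encode("utf-8")).digest()[:4], "big")

def trailing_zero_run(value, width):
    """Run of zeros at the low end of a width-bit value."""
    if value == 0:
        return width
    run = 0
    while value & 1 == 0:
        value >>= 1
        run += 1
    return run

def fill_scorecards(items, index_bits=6):
    """Fill 2**index_bits scorecards: leading bits pick the card, the rest give the run."""
    m = 1 << index_bits
    cards = [0] * m
    for item in items:
        h = hash32(item)
        card = h >> (32 - index_bits)                 # first few bits choose the scorecard
        rest = h & ((1 << (32 - index_bits)) - 1)     # remaining bits supply the zero run
        cards[card] = max(cards[card], trailing_zero_run(rest, 32 - index_bits))
    return cards

def estimate(cards):
    """Harmonic-mean combination of the per-card guesses (2**run each)."""
    m = len(cards)
    harmonic_mean = m / sum(2.0 ** -run for run in cards)
    return m * harmonic_mean

cubes = [f"color-{i}" for i in range(10_000)] * 3     # 10,000 distinct colors, each seen 3 times
print(round(estimate(fill_scorecards(cubes))))        # right order of magnitude for 10,000
```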
Speaker 1
00:08:10 - 00:08:38
And if you want a more memory-friendly algorithm, you reduce the number. These scorecards are relatively small numbers, so you can actually compress them into 5 or 6 bits, saving a fair amount of memory. The union of 2 sets is also lossless, and what that means is basically that we can run this algorithm on 2 different computers and merge their scorecards together. This gives us the total cardinality from both computers, even though we ran the algorithm separately and didn't transfer any data between the 2 computers.
Speaker 1
00:08:38 - 00:09:44
For example, we might have 3 unique colors on both of our computers, and so locally the number of distinct elements is 3. But there's a duplicate shared between the 2 computers, so we can't simply add up the individual unique counts and get a total of 6, because in this shared set there are really only 5 unique values. For small datasets, it's not a problem to just exchange all the data and count it in one spot. But if you have large amounts of data, this is just impossible to do, and that's why HyperLogLog is so helpful here. We can calculate the local cardinalities and then merge all the results together to get a representation of the total cardinality of the whole system. In practice, this means we can run it on thousands of computers in parallel, counting trillions of elements, and then merge those results together to get a single estimate of the cardinality across all those computers. The only real tradeoff here is that you're getting an estimate of the cardinality rather than the true value, but that's often a perfectly acceptable tradeoff, since the alternative of calculating the true distinct count is an enormously expensive process.
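A sketch of what that merge looks like, assuming both machines used the same hash function and the same number of scorecards; the card values here are made up for illustration:

```python
def merge_scorecards(cards_a, cards_b):
    """Lossless union: take the elementwise maximum of the two machines' scorecards."""
    return [max(a, b) for a, b in zip(cards_a, cards_b)]

machine_1 = [3, 0, 5, 2]   # four scorecards from the first computer (made-up values)
machine_2 = [1, 4, 5, 0]   # four scorecards from the second computer
print(merge_scorecards(machine_1, machine_2))   # [3, 4, 5, 2]
```

Feeding the merged scorecards into the same estimator gives the combined cardinality, exactly as if a single machine had seen all the data, because each card simply records the longest run either machine observed for that slot.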
Speaker 1
00:09:45 - 00:10:19
Now, in an effort to keep this accessible to everyone, I did gloss over some details and some of the finer, nitty-gritty aspects of the algorithm. There's a bunch of formal math and simulation data proving that this actually works, so if you're interested in that, there are links down in the description. I know this video and the style is a little different from the sorts of videos I normally make, but I've worked with these probabilistic algorithms a lot in the past and I just find them super interesting, so I thought, you know, maybe you would too. In some ways, this is a pretty advanced algorithm. You won't find it in college textbooks or in a job interview.
Speaker 1
00:10:20 - 00:10:51
But hopefully over the last 10 minutes or so I've convinced you that it's also not that complicated. At its core, it leverages a pretty simple idea. I firmly believe that anyone can learn programming, no matter how complicated it might seem, if the concepts are presented in a visual and interactive manner. If you've ever wanted to learn how to program, but were a little daunted about where to start, I think Brilliant.org is the best option. Like this video, the lessons are all presented in a visual manner that helps develop a deep understanding of the concepts.
Speaker 1
00:10:51 - 00:11:21
One of the things I love about the programming lessons on Brilliant is the fast, iterative nature. You can make a change and see how it reacts immediately. This short feedback loop means you can make a ton of progress really quickly, which I find helpful to stay motivated when learning a new skill. And Brilliant has a huge catalog of lessons, ranging from programming to data structures and algorithms to foundational math. When I was working as a full-time software engineer, I often had to go out and learn new skills for a project or refresh old knowledge.
Speaker 1
00:11:22 - 00:11:53
My calculus skills are pretty rusty at this point, but I have a moderate knowledge of statistics. And so something like Brilliant is great because it has a selection of lessons ranging from beginner all the way up to expert for whatever topic you're working on. If you've been thinking about taking up programming or have tried in the past and it just didn't stick, I think you should give Brilliant a try. It's free for the first 30 days if you visit brilliant.org slash breaking taps or you can click the link down below. The first 200 folks will also get 20% off Brilliant's annual premium subscription.
Speaker 1
00:11:54 - 00:12:05
There are some other neat algorithms that tackle related problems in a probabilistic manner, so if there's interest, maybe I'll make some videos about those in the future. Otherwise, thanks for watching!