Science is great for innovation and improving our lives, but let’s face it: there are some things we’ve pretty much got down pat. You wouldn’t expect, for example, that we could improve on something like… like counting.

So it may come as a surprise that a group of computer scientists has done just that: found a new way to solve a decades-old problem that, on the face of it, looks very simple – how many distinct things are there in front of me?

It’s a harder problem – and a smarter solution – than you might think.

**The Distinct Elements Problem**

Computers can be very smart, but they can also be very, very… not-smart. Just look at the recent explosion of AI chatbots for evidence of that: they’re great at sounding intelligent, but put ‘em to the test and you might just find yourself in an ouroboros of bullshit.

And sometimes, it’s the things that seem almost laughably simple to a human that cause the most trouble. Take counting, for example – specifically, counting distinct objects. For us, it’s easy: we look at the collection of objects, and our brain just kind of automatically sorts them into groups for us. We barely have to work at it at all.

For computers, on the other hand, it’s a fundamental and decades-old problem. And it’s one that really needs to be answered, since its applications in the modern world span everything from network traffic analysis – think Facebook or Twitter monitoring how many people are logged in at any given time – to fraud detection, to bioinformatics, to text analysis, and much more.

Now, obviously, we’ve been able to do those things for a while now, and that’s because this counting question – properly known as the Distinct Elements Problem – does have answers. They’re just not very good ones.

“Earlier known algorithms all were ‘hashing based,’ and the quality of that algorithm depended on the quality of hash functions that algorithm chooses,” explained Vinodchandran Variyam, a professor in the University of Nebraska–Lincoln’s School of Computing, in a statement last year.

But, together with colleagues Sourav Chakraborty of the Indian Statistical Institute and Kuldeep Meel of the University of Toronto, he discovered a way to massively simplify the problem: “The new algorithm only uses a sampling strategy, and quality analysis can be done using elementary techniques.”

**How does it work?**

The new method, since named the CVM algorithm in honor of its inventors, drastically reduces memory requirements – an important advantage in this modern age of big data – and it does so using a neat trick of probability theory. To illustrate the concept, consider the example used both by Variyam and his colleagues and by a recent article in Quanta Magazine: imagine you’re counting the number of unique words in Shakespeare’s *Hamlet*, but you have only enough memory to store 100 words at a time.

First, you do the obvious: you record the first 100 unique words you come across. You’re now out of space – so you take a coin and flip it for each word. Heads, it stays; tails, you forget it.

At the end of this process, you’ll have around 50 unique words in your list. You restart the process from before – but this time, if you come to a word already on the list, you flip the coin again to see whether or not to delete it. Once you reach 100 words, you run through the list again, flipping a coin for each word and deleting or keeping it as prompted.

In round two, things are a tiny bit more complex: instead of one head to keep a word in the list, you’ll need two in a row – anything else, and it gets deleted. Similarly, in round three, you’ll need to get three heads in a row for it to stay; round four will need four in a row, and so on until you reach the end of *Hamlet*.

There’s method in the madness – and it’s a smart one, too. By working through the text like this, you’ve ensured that every word in your list had the same probability of being there: 1/2^*k*, where *k* is the number of times you had to work through the list. So, let’s say it took you six rounds to get to the end of *Hamlet*, and you’re left with a list of 61 distinct words: you can then multiply 61 by 2^6 to get an estimate of the number of words.

We’ll save you opening your calculator app: the answer is 3,904 – and according to Variyam and co, the actual answer is 3,967 (yes, they counted). If you have a memory that can store more than 100 words, the accuracy goes up further: with the ability to store 1,000 words, the algorithm estimates the answer as 3,964 – barely a rounding error – and “of course,” Variyam told Quanta, “if the [memory] is so big that it fits all the words, then we can get 100 percent accuracy.”
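The coin-flipping procedure described above is short enough to sketch in code. Here’s a minimal Python version of the idea – note that the function name and structure are our own, and the paper’s exact presentation differs in some details (for instance, it caps how the purge step can fail), so treat this as an illustration rather than the authors’ implementation:

```python
import random

def cvm_estimate(words, capacity, coin=random.random):
    """Estimate the number of distinct items in `words` while never
    storing more than `capacity` of them at once."""
    kept = set()
    p = 1.0                        # keep-probability: 1/2**rounds so far
    for word in words:
        kept.discard(word)         # a repeat must re-earn its place
        if coin() < p:
            kept.add(word)         # survives the current round's coin flips
        while len(kept) >= capacity:
            # Memory is full: every stored word must survive one more flip,
            # and surviving the next round now costs one extra head.
            kept = {w for w in kept if coin() < 0.5}
            p /= 2
    return len(kept) / p           # scale the sample up by 2**rounds
```

If the capacity exceeds the true number of distinct words, `p` never drops below 1.0 and the answer is exact; with a tight capacity, the result is a random estimate whose accuracy improves with memory, just as in the *Hamlet* example.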

**A simple approach**

So, it’s effective – but what makes the algorithm even more intriguing is its simplicity. “The new algorithm is astonishingly simple and easy to implement,” Andrew McGregor, a professor in the College of Information and Computer Sciences at the University of Massachusetts, Amherst, told Quanta. “I wouldn’t be surprised if this became the default way the [distinct elements] problem is approached in practice.”

Indeed, since its posting in January 2023 – and barring a few minor quibbles and bugs in the meantime – the algorithm has attracted attention and admiration from many other computer scientists. That means that, while the paper detailing the algorithm has not been peer-reviewed in the official sense, it definitely has been reviewed by peers. In fact, Donald Knuth, author of *The Art of Computer Programming* and so-called “father of the analysis of algorithms,” wrote a paper in praise of the algorithm back in May 2023: “ever since I saw it […] I’ve been unable to resist trying to explain the ideas to just about everybody I meet,” he commented.

Meanwhile, various teams – Chakraborty, Variyam, and Meel included – have spent the last year investigating and fine-tuning the algorithm. Some, Variyam said, are already teaching it in their computer science courses.

“We believe that this will be a mainstream algorithm that is taught in the first computer science course on algorithms in general and probabilistic algorithm in particular,” he said. Knuth agrees: “It’s wonderfully suited to teaching students who are learning the basics of computer science,” he wrote in his May paper. “I’m pretty sure that something like this will eventually become a standard textbook topic.”

So, how did such a breakthrough algorithm evade notice for so long? According to Variyam, it’s not as unlikely as it sounds.

“It is surprising that this simple algorithm had not been discovered earlier,” he said. “It is not uncommon in science that simplicity is missed for several years.”

The paper is posted on arXiv and appeared in the Proceedings of the 30th Annual European Symposium on Algorithms (ESA 2022).
