Civilizing the Random Forest


In my previous post, I explained that I was pondering the rather uneven distribution of features among the trees in a random forest. It didn’t feel right to just verify whether I could achieve a more uniform distribution. I felt compelled to explore how the forest behaves under this wild notion of democracy. Yes, the random forest is akin to a mathematical proof that democracy works: a bunch of ignoramuses casting votes for the final result. And time and again, it has been shown to be superior to relying on a single tree, even the wisest one.

It all sounded easy. It always does. Just create a bunch of trees, compile the results, summarize those results, and incorporate a more uniform distribution. It always sounds easy until I actually try to do something.

Long ago, in a galaxy far, far away, when the world seemed full of promise, I built a class, ARandomForest, with a uniform distribution of features. I also wrote a separate script to run some tests. I used the simple Iris dataset (not that I know what it is, probably some flowers, likely remnants of the good old hippie era) and ran it through these tests. After fixing a few bugs that prevented my code from running, I obtained the first results. The accuracy of the professional scikit-learn library was perfect. The accuracy of my code was also perfect. In both cases that was without cross-validation, but it was only a first attempt to debug the code and check that it runs.

Okay, not bad; it runs. It’s just that the data is so straightforward that even a deaf and blind dwarf could figure out how to sort it. So I tried something more complex: data about wines and their quality, available somewhere on the University of California, Irvine page. And the results were devastating. My algorithm was clearly inferior to scikit-learn’s. So my idea was nonsense, I’m no good, I will never do anything sensible… But it gave me no peace: how could a more uniform selection of features result in worse outcomes? I speculated that, perhaps by sheer luck, the random forest from those ugly scientists at scikit-learn happened to pick the best features… So I ran the test over and over with different seeds for the random generator… and scikit-learn was always better. It didn’t feel right. They may be superior, but they must be employing some unfair black magic. I’ll catch them.

I wanted to compare apples to apples, so I needed the same algorithm twice: once with the trick and once without. So I made a copy of my naive random forest classifier and, in this version, dropped the more uniform distribution and just selected features randomly, as described in lectures. Then I ran the tests comparing three solutions. Here are the results:

  • classic is scikit-learn’s random forest, written by those unfair villain scientists who dare to be better than me
  • uniform is my naive implementation with the trick of a more uniform distribution of features
  • naive is just like uniform but without the trick

          cross validation   accuracy   precision   recall   f1
classic   0.59 +/- 0.03      0.59       0.29        0.26     0.26
uniform   0.54 +/- 0.05      0.56       0.27        0.23     0.21
naive     0.36 +/- 0.13      0.40       0.12        0.17     0.12

1 means perfect, less than 0.5 means utter crap
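For anyone who wants to produce the same five columns for their own classifier, here is a minimal sketch using scikit-learn. Several things in it are assumptions and not taken from the actual test script: the helper name `evaluate` is made up, the 5-fold cross-validation, the 75/25 hold-out split and the macro averaging of precision, recall and f1 are guesses, and the bundled `load_wine` data is only a stand-in so the code runs offline (it is a different and much easier dataset than the wine-quality data from UCI).

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import cross_val_score, train_test_split


def evaluate(model, X, y, seed=0):
    """Hypothetical helper: compute the five table columns for one classifier."""
    # Cross-validated accuracy over 5 folds (the fold count is an assumption).
    cv_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")

    # One stratified hold-out split for the remaining metrics (split size is an assumption).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Macro averaging (an assumption) weights every class equally;
    # zero_division=0 scores a never-predicted class as 0 instead of warning.
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="macro", zero_division=0
    )
    return {
        "cv_mean": cv_scores.mean(),
        "cv_std": cv_scores.std(),
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Stand-in data, NOT the UCI wine-quality set used for the table above.
X, y = load_wine(return_X_y=True)
results = evaluate(RandomForestClassifier(n_estimators=100, random_state=0), X, y)
print({name: round(float(value), 2) for name, value in results.items()})
```

Any object with `fit` and `predict` that scikit-learn can clone should work in place of `RandomForestClassifier`, so the same helper could score all three variants on equal terms.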

It became very interesting. While those devil charlatans who wrote scikit-learn are always much better than me, the difference isn’t as vast as day and night. But then, my naive implementation without the trick is as dumb as one can be. I wonder if there’s perhaps some contest for the worst classification algorithm ever—I might have a chance to win.

Anyway, what’s really, really interesting is that the trick made a huge difference. So my intuition isn’t totally wrong and I’m not the biggest idiot under the sun.
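For anyone who skipped the previous post, the difference between the two variants can be pictured roughly like this. It is only a sketch and not the actual ARandomForest code: it assumes each tree gets a fixed subset of `k` features, and the function names and the "pick the least used features first" rule are made up for illustration. The naive version lets every tree draw its features independently, while the uniform version keeps a usage counter so no feature gets left behind.

```python
import numpy as np


def naive_feature_sets(n_features, n_trees, k, rng):
    """Each tree independently draws k distinct features at random."""
    return [rng.choice(n_features, size=k, replace=False) for _ in range(n_trees)]


def uniform_feature_sets(n_features, n_trees, k, rng):
    """Each tree takes the k least-used features so far (ties broken at random)."""
    counts = np.zeros(n_features, dtype=int)
    feature_sets = []
    for _ in range(n_trees):
        # lexsort treats the LAST key as primary: sort by usage count,
        # then by a random key so equally used features are shuffled.
        order = np.lexsort((rng.random(n_features), counts))
        chosen = order[:k]
        counts[chosen] += 1
        feature_sets.append(chosen)
    return feature_sets


def usage_counts(feature_sets, n_features):
    """How many trees use each feature."""
    return np.bincount(np.concatenate(feature_sets), minlength=n_features)


# Example numbers are arbitrary: 11 features, 100 trees, 3 features per tree.
n_features, n_trees, k = 11, 100, 3
rng = np.random.default_rng(42)

naive_sets = naive_feature_sets(n_features, n_trees, k, rng)
uniform_sets = uniform_feature_sets(n_features, n_trees, k, rng)

naive_counts = usage_counts(naive_sets, n_features)
uniform_counts = usage_counts(uniform_sets, n_features)

print("naive  :", naive_counts)    # usage varies from feature to feature
print("uniform:", uniform_counts)  # usage differs by at most 1 between features
```

With the least-used-first rule, the usage counts of any two features can never differ by more than one, which is one way to make sure every option is heard among the voters. The naive draw gives no such guarantee.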

What’s more worrisome is that I now need to somehow modify scikit-learn. I had a brief look at their code, and it’s not a trivial thing to do. They knew I’d come for them, and they prepared by applying some human-made obfuscation to the code.

If only time were a bit more stretchable. If only a day were as long as on the moon (and I wouldn’t need to sleep or do other boring stuff).

Ahh… and here is the code.

Democracy may work, but all options must be equally heard among the voters. It’s now my personal quest to prove it!

