Second encounter with PCA


In the dangerous forest of dimension reduction.

In my previous post about PCA, I said that PCA is like a beautiful and playful woman. Oh boy, was I so wrong. Like almost always when it comes to women.

I assumed that the reconstruction error of principal component analysis would be significant for non-linear data. If that were true, it would be easy to build a simple tool to validate data linearity in just one line of code: "is this data linear, and can it be used for linear regression or logistic regression?" Moreover, there was the possibility of building a simple tool that would drop one column at a time and record the scores, and in this simple way calculate which columns are most non-linearly correlated with the rest of the data. Data engineers would have an easier job: "Hey, I just ran five lines of code, and I now know that features X and Y are correlated with the other data in a highly non-linear way. Please examine how they can be transformed." The world would be much easier, so I had to check it.
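The envisioned drop-one-column tool could be sketched like this. This is purely hypothetical – the function names and the choice of reduced-rank PCA reconstruction error as the per-column score are my own illustration, not the code from the actual tests:

```python
import numpy as np

def pca_recon_error(X, k):
    """Mean squared reconstruction error when keeping only the
    top-k principal components (PCA via SVD on centered data)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    X_rec = (Xc @ Vt[:k].T) @ Vt[:k]
    return np.mean((X_rec - Xc) ** 2)

def score_columns(X):
    """Hypothetical per-column linearity score: drop each column in
    turn and record how the reduced-rank reconstruction error changes
    relative to the full dataset's error."""
    d = X.shape[1]
    baseline = pca_recon_error(X, d - 1)
    scores = {}
    for j in range(d):
        X_drop = np.delete(X, j, axis=1)
        scores[j] = pca_recon_error(X_drop, X_drop.shape[1] - 1) - baseline
    return scores
```

As the rest of the post explains, this scoring idea turned out not to work with PCA, but it shows the mechanics of the tool I had in mind.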

I ran some tests that can be found here, and it was all in vain; the PCA reconstruction error is always around 1e-30. With all principal components kept, PCA is just an orthogonal change of basis, so reconstruction is exact up to floating-point noise no matter how non-linear the data is. The beautiful dream of an easy life is crushed and lies at the bottom of the deepest ocean. And while I haven't put a terrible amount of effort into this project, there are still interesting lessons in the mistakes that led me to the false assumption.
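A minimal sketch of why the error is always tiny (my own toy example, not the linked tests): build deliberately non-linear data, run PCA via SVD, keep all components, and reconstruct.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(500, 1))
# Strongly non-linear 3-D data: a raw feature, its square, and a sine of it
X = np.hstack([x, x**2, np.sin(x)])

# PCA via SVD on the centered data
mean = X.mean(axis=0)
Xc = X - mean
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

# Keep ALL components: the projection is an orthogonal change of basis
Z = Xc @ Vt.T              # scores in the principal-component basis
X_rec = Z @ Vt + mean      # rotate back and un-center

err = np.mean((X - X_rec) ** 2)
print(err)  # floating-point noise, on the order of 1e-30
```

The non-linearity never enters the picture: `Vt` is orthogonal, so `Z @ Vt` recovers `Xc` exactly up to machine precision. Only by discarding components does reconstruction error become informative at all.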

First mistake: I validated the data manually. Differences that seemed big to me were not really so big. Looking at a few columns, finding a big error here and there – this is not the way. In general, what hurts me in data science is how much of it is "let's look and form opinions visually." There is no procedure, no algorithm; decisions are driven not by data but by human intuition. "This spike looks ok to me," "I feel this missing value can be replaced with the mean," "I don't think this difference is meaningful," "Let's use 80%, seems ok," "Business wants to see 5 clusters" – rules of thumb, habits, wishes… it doesn't sound like science. And I made exactly the same error. Shame on me.

Second mistake: Before jumping to conclusions, I did not double-check the results that were fundamental to them. That is a common and very understandable error, as I have very limited resources (there is no such thing as free time when you're the caring father of two youngsters with a full-time job and a damn long commute)… so I can't double-check everything. Re-checking everything drains resources exponentially, but… that's the tradeoff. In my case, I believe I just have to accept that sometimes I'll be wrong.

About the idea of an automated, single-line, one-number-out linearity check – I believe there is still a way to do it, just not with PCA. But that must wait. Next in the pipeline is an idea for a clustering algorithm. And one day I'd love to move into deep learning – it feels so wrong, the way it's done.

So, I misunderstood some results, did a project based on those assumptions, failed miserably, and learned I was wrong. I learned what PCA really is. Moving on. Maybe there will be a second approach to a one-line linearity check – an idea born of a false assumption… may time be on my side.

