{"id":61,"date":"2024-02-21T22:38:36","date_gmt":"2024-02-21T22:38:36","guid":{"rendered":"https:\/\/aulendil.net\/hallucinations\/?p=61"},"modified":"2024-02-24T23:53:02","modified_gmt":"2024-02-24T23:53:02","slug":"second-encounter-with-pca","status":"publish","type":"post","link":"https:\/\/aulendil.net\/hallucinations\/second-encounter-with-pca\/","title":{"rendered":"Second encounter with PCA"},"content":{"rendered":"\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"640\" src=\"https:\/\/aulendil.net\/hallucinations\/wp-content\/uploads\/2024\/02\/dle-lost-in-math-edited.webp\" alt=\"\" class=\"wp-image-66\" style=\"width:649px;height:auto\" srcset=\"https:\/\/aulendil.net\/hallucinations\/wp-content\/uploads\/2024\/02\/dle-lost-in-math-edited.webp 1024w, https:\/\/aulendil.net\/hallucinations\/wp-content\/uploads\/2024\/02\/dle-lost-in-math-edited-300x188.webp 300w, https:\/\/aulendil.net\/hallucinations\/wp-content\/uploads\/2024\/02\/dle-lost-in-math-edited-768x480.webp 768w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">In the dangerous forest of dimension reduction.<\/figcaption><\/figure>\n\n\n\n<p>In my <a href=\"https:\/\/aulendil.net\/hallucinations\/index.php\/2024\/02\/17\/first-encounter-with-principal-component-analysis\/\" data-type=\"link\" data-id=\"https:\/\/aulendil.net\/hallucinations\/index.php\/2024\/02\/17\/first-encounter-with-principal-component-analysis\/\">previous post about PCA<\/a>, I said that PCA is like a beautiful and playful woman. Oh boy, was I so wrong. Like almost always when it comes to women.<\/p>\n\n\n\n<p>I assumed that the reconstruction error for principal component analysis is significant for non-linear data. 
If that were true, it would be easy to build a simple tool to validate data linearity &#8211; effectively one line of code answering: &#8220;is this data linear, and can it be used for linear regression or logistic regression?&#8221; Moreover, it would be possible to build a simple tool that drops one column at a time and records the score, and in this way calculate which columns are most non-linearly correlated with the rest of the data. Data engineers would have an easier job: &#8220;Hey, I just ran five lines of code, and I now know that features X and Y are correlated with the other data in a highly non-linear way. Please examine how they can be transformed.&#8221; The world would be much easier, so I had to check it.<\/p>\n\n\n\n<p>I ran some tests that <a href=\"https:\/\/github.com\/adaslesniak\/ai-xp03\/blob\/main\/README.md\" data-type=\"link\" data-id=\"https:\/\/github.com\/adaslesniak\/ai-xp03\/blob\/main\/README.md\">can be found here<\/a>, and it was all in vain: the PCA reconstruction error is always around 1e-30, so the beautiful dream of an easy life is crushed and lying at the bottom of the deepest ocean. But while I haven&#8217;t put a terrible amount of effort into this project, there are still interesting lessons in the mistakes that led me to the false assumption.<\/p>\n\n\n\n<p>First mistake: I validated the data manually. Differences that seemed big to me were not really so big. Looking at a few columns, finding a big error here and there &#8211; this is not the way. In general, what hurts me in data science is how much &#8220;let&#8217;s look and form an opinion visually&#8221; there is. There is no procedure, no algorithm; decisions are driven not by data but by human intuition. 
&#8220;For me, this spike looks ok,&#8221; &#8220;I feel this missing value can be replaced with the mean,&#8221; &#8220;I don&#8217;t think this difference is meaningful,&#8221; &#8220;Let&#8217;s use 80%, seems ok,&#8221; &#8220;Business wants to see 5 clusters&#8221; &#8211; rules of thumb, habits, wishes&#8230; it doesn&#8217;t sound like science. And I made exactly the same error. Shame on me.<\/p>\n\n\n\n<p>Second mistake: Before jumping to conclusions, I did not double-check the results that were fundamental to them. That is a common error and very understandable, as I have very limited resources (there is no such thing as free time when you are a caring father of two youngsters and have a full-time job with a damn long commute)&#8230; so I can&#8217;t double-check everything. Re-checking everything drains resources exponentially, but&#8230; that&#8217;s the tradeoff. In my case, I believe I just have to accept that sometimes I&#8217;ll be wrong.<\/p>\n\n\n\n<p>About the idea of an automated single-line, one-number-out tool for a linearity check &#8211; I believe there is still a way to do this, just not with PCA. But that must wait. Next in the pipeline is to try an idea for a clustering algorithm. And one day I&#8217;d love to move into deep learning &#8211; the way it&#8217;s done feels so wrong.<\/p>\n\n\n\n<p>So, I misunderstood some results, did a project based on those assumptions, failed miserably, and learned I was wrong. I learned what PCA really is. Moving on. Maybe there will be a second approach to the one-line linearity check, an idea born of a false assumption&#8230; may time be on my side.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In my previous post about PCA, I said that PCA is like a beautiful and playful woman. Oh boy, was I so wrong. Like almost always when it comes to women. I assumed that the reconstruction error for principal component analysis is significant for non-linear data. 
If it were true, it would be easy to [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[9],"tags":[],"_links":{"self":[{"href":"https:\/\/aulendil.net\/hallucinations\/wp-json\/wp\/v2\/posts\/61"}],"collection":[{"href":"https:\/\/aulendil.net\/hallucinations\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/aulendil.net\/hallucinations\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/aulendil.net\/hallucinations\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/aulendil.net\/hallucinations\/wp-json\/wp\/v2\/comments?post=61"}],"version-history":[{"count":5,"href":"https:\/\/aulendil.net\/hallucinations\/wp-json\/wp\/v2\/posts\/61\/revisions"}],"predecessor-version":[{"id":112,"href":"https:\/\/aulendil.net\/hallucinations\/wp-json\/wp\/v2\/posts\/61\/revisions\/112"}],"wp:attachment":[{"href":"https:\/\/aulendil.net\/hallucinations\/wp-json\/wp\/v2\/media?parent=61"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/aulendil.net\/hallucinations\/wp-json\/wp\/v2\/categories?post=61"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/aulendil.net\/hallucinations\/wp-json\/wp\/v2\/tags?post=61"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}
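
The core finding of the post is easy to reproduce. Below is a minimal sketch of the experiment (my own reconstruction, assuming numpy and scikit-learn &#8211; not the exact script from the linked ai-xp03 repo): when every principal component is kept, PCA is just an invertible rotation of the data, so the reconstruction error stays near machine epsilon no matter how non-linear the columns are.

```python
# Build strongly non-linear data, run PCA with all components,
# and measure the round-trip reconstruction error.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 1))
# Columns that are highly non-linear functions of the same variable.
data = np.hstack([x, np.sin(3 * x), np.exp(x), x ** 3])

pca = PCA()  # n_components=None keeps all components
reduced = pca.fit_transform(data)
restored = pca.inverse_transform(reduced)

# Error is at floating-point noise level (~1e-30), exactly as observed
# in the post: full-rank PCA loses no information, linear or not.
error = np.mean((data - restored) ** 2)
print(f"full-rank reconstruction error: {error:.2e}")
```

Reconstruction error only becomes non-trivial when components are dropped (`n_components < n_features`), and even then it measures linear redundancy between columns, not the linearity of the underlying relationship &#8211; which is why the one-line linearity check could not work this way.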