{"id":130,"date":"2024-02-27T14:45:06","date_gmt":"2024-02-27T14:45:06","guid":{"rendered":"https:\/\/aulendil.net\/hallucinations\/?p=130"},"modified":"2024-02-27T14:54:54","modified_gmt":"2024-02-27T14:54:54","slug":"clustering-frankenstain","status":"publish","type":"post","link":"https:\/\/aulendil.net\/hallucinations\/clustering-frankenstain\/","title":{"rendered":"Clustering Frankenstain"},"content":{"rendered":"\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"576\" src=\"https:\/\/aulendil.net\/hallucinations\/wp-content\/uploads\/2024\/02\/dle_uneven_fight-edited.webp\" alt=\"\" class=\"wp-image-139\" srcset=\"https:\/\/aulendil.net\/hallucinations\/wp-content\/uploads\/2024\/02\/dle_uneven_fight-edited.webp 1024w, https:\/\/aulendil.net\/hallucinations\/wp-content\/uploads\/2024\/02\/dle_uneven_fight-edited-256x144.webp 256w, https:\/\/aulendil.net\/hallucinations\/wp-content\/uploads\/2024\/02\/dle_uneven_fight-edited-512x288.webp 512w, https:\/\/aulendil.net\/hallucinations\/wp-content\/uploads\/2024\/02\/dle_uneven_fight-edited-768x432.webp 768w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Frankenstain is alive. Barely. <\/p>\n\n\n\n<p>When learning about popular clustering methods I was wondering that they all are based on points physical proximity. This is a reasonable assumption, but it&#8217;s utterly annoying. I am an introvert and I don&#8217;t want to be squashed in to one cluster with the nearest bunch of extroverts. So it&#8217;s personal.<\/p>\n\n\n\n<p>With clustering points in similar neighbourhood there is no way to identify rural areas, nor introverts who have sparse connections with others. It doesn&#8217;t matter if I read philosophy and other guy plants trees &#8211; we are far away in feature space, even if we both have deep passions and are quite similar. 
So I started wondering about a different metric of similarity between points.<\/p>\n\n\n\n<p>The idea is to transform the data from physical coordinates into an adjacency matrix that contains the distance between each pair of points. Yes, we are squashing all dimensions into one single distance measure, but that&#8217;s how clustering is done anyway &#8211; nobody cares about what makes an individual an individual. However, that is only the first step; it&#8217;s just preparing the data to extract what really matters for clustering.<\/p>\n\n\n\n<p>Now comes the next step. For each pair of points, calculate a similarity score. This score is based on how each point relates to every other point in the dataset. So, two outliers who keep themselves away from most of the population can be thought of as similar. Therefore, we reintroduce some idea of the points&#8217; identity &#8211; but this time, in one dimension. This identity may be very abstract, but relations to others are what really shapes an individual &#8211; a common enemy makes friends. The core point is that the algorithm should be able to distinguish rural populations from dense clusters.<\/p>\n\n\n\n<p>On paper, it all sounds so nice that I had to check it. It&#8217;s easy to explain, so it should be easy to code. Yeah, of course, like hell. Let paper be damned for its false promises.<\/p>\n\n\n\n<p>The first issue is that Python isn&#8217;t my primary language, and using numpy arrays made it hard to debug. The results were unholy wrong. They made no sense. The computer was laughing in my face so hard that I had to cover my ears. So there came this question: &#8220;Is my idea dumb? It makes no sense.&#8221;<\/p>\n\n\n\n<p>It&#8217;s always easy to give up. But I hate giving up&#8230; I hate giving up even more than I hate doing actual work. So, painfully, step by step, using simpler and simpler examples, I found out that I had the comparison wrong. 
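The two steps described above &#8211; pairwise distances squashed into an adjacency matrix, then a similarity score built from how each point relates to everyone else &#8211; can be sketched roughly like this. This is a minimal numpy sketch of my reading of the idea; the function and variable names are mine, not taken from the actual repo:

```python
import numpy as np

def relational_similarity(points):
    """Score how similarly each pair of points relates to the rest
    of the dataset. 0 means two points see everyone else identically."""
    # Step 1: adjacency matrix of pairwise euclidean distances.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    n = len(points)
    sim = np.zeros((n, n))
    for a in range(n):
        for b in range(n):
            # Step 2: compare the distance profiles of a and b,
            # leaving out the a-b / b-a entries themselves
            # (keeping them in lets that one distance dominate the score).
            others = [k for k in range(n) if k != a and k != b]
            if others:
                sim[a, b] = np.abs(dist[a, others] - dist[b, others]).mean()
    return sim
```

On three points where two sit close together and one lies far away, the two close points get a much lower (more similar) score with each other than either gets with the outlier, which is the behaviour the idea calls for.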
When searching for the most similar points, I was checking whether the score was higher, instead of whether it was lower (0 means identical points). Let the keys &#8220;>&#8221; and &#8220;&lt;&#8221; be damned forever and removed from all the keyboards in the world.<\/p>\n\n\n\n<p>So, it worked on a very simple dataset &#8211; a few points in one corner, a few in another. I tried more complex data, and it all went kaput. It made no sense again. I almost thought I had it under control, but this treacherous algorithm was attacking my resolve with full might again. Painfully debugging how the similarity score is calculated, I found out that I had forgotten to remove the distance between the compared points themselves, and since it was represented twice (the distance a-b in one vector and b-a in the other), it dominated the similarity score. I&#8217;m not sure if I&#8217;m more dumb for making such a mistake or smart for finding it.<\/p>\n\n\n\n<p>I fixed it and used a very simple dataset (3 points in the center and 6 spaced quite evenly some distance around them), and it worked. Wow, hurray, I&#8217;m a genius&#8230; let&#8217;s apply it to some more realistic data&#8230; I&#8217;m an idiot, it makes no sense, the clustering is totally wrong.<\/p>\n\n\n\n<p>That&#8217;s my life in short. One moment I&#8217;m brilliant, only to become an utter idiot the very next second. But deep down, I believe in one thing: never ever give up! Unless you have more important or urgent stuff to do, that is.<\/p>\n\n\n\n<p>And yeah &#8211; <a href=\"https:\/\/github.com\/adaslesniak\/ai-xp04\" data-type=\"link\" data-id=\"https:\/\/github.com\/adaslesniak\/ai-xp04\">here is the Frankenstain<\/a>. For now, I need to do some more useful work, but I am gonna get back to this mischievous code and science the reason out of it.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Frankenstain is alive. Barely. 
While learning about popular clustering methods, I noticed that they are all based on points&#8217; physical proximity. This is a reasonable assumption, but it&#8217;s utterly annoying. I am an introvert and I don&#8217;t want to be squashed into one cluster with the nearest bunch of extroverts. So it&#8217;s personal. [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/aulendil.net\/hallucinations\/wp-json\/wp\/v2\/posts\/130"}],"collection":[{"href":"https:\/\/aulendil.net\/hallucinations\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/aulendil.net\/hallucinations\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/aulendil.net\/hallucinations\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/aulendil.net\/hallucinations\/wp-json\/wp\/v2\/comments?post=130"}],"version-history":[{"count":5,"href":"https:\/\/aulendil.net\/hallucinations\/wp-json\/wp\/v2\/posts\/130\/revisions"}],"predecessor-version":[{"id":140,"href":"https:\/\/aulendil.net\/hallucinations\/wp-json\/wp\/v2\/posts\/130\/revisions\/140"}],"wp:attachment":[{"href":"https:\/\/aulendil.net\/hallucinations\/wp-json\/wp\/v2\/media?parent=130"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/aulendil.net\/hallucinations\/wp-json\/wp\/v2\/categories?post=130"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/aulendil.net\/hallucinations\/wp-json\/wp\/v2\/tags?post=130"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}