First encounter with clustering


Too much for this small human brain of mine.

Unsupervised learning is as challenging to comprehend as the Gordian Knot, precisely because it starts from a fundamental unknown: what to expect is not predefined, and that is the very essence of its complexity.

However, one aspect particularly struck me. There’s no simple answer to the question, ‘Are there clusters in the data?’ Given the plethora of algorithms and parameters available to run each algorithm, it often feels like, ‘If you want some clusters, I’ll find them for you. What clusters do you want?’ This feels extremely subjective and not driven by the data, as if there’s no objective answer. It feels more akin to some mystical practice than a straightforward scientific endeavor.

A few days ago, there was a lesson about networks and their representation using an adjacency matrix. I’ve been contemplating that for clustering purposes, each dataset (at least, every numerical one) can be transformed into an adjacency matrix that represents the similarity of different entries—a pure representation of the relationships between data points. The algorithm would proceed as follows:

  • Normalize each data column using standard (z-score) normalization. I’m hesitant to use min-max normalization since it’s more sensitive to outliers and could behave unpredictably with new data that falls outside the current min-max range.
  • Calculate an adjacency matrix that stores the simple Euclidean distance between every pair of vectors in the original data. There’s no need for a tensor: this adjacency matrix carries no meaning beyond the relationships themselves, so adjusting the spread of each column (e.g., how to encode categorical variables) must happen at an earlier data-engineering stage.
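The two steps above can be sketched in a few lines of numpy. This is a minimal illustration, not a tested implementation; the function name and the toy data are made up for the example.

```python
import numpy as np

def distance_matrix(X):
    """Z-score each column, then return the matrix of pairwise
    Euclidean distances between rows (the 'adjacency matrix')."""
    # standard (z-score) normalization, column by column
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # pairwise Euclidean distances via broadcasting
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

# tiny toy dataset: two nearby points and one far away
X = np.array([[1.0, 2.0],
              [1.1, 2.1],
              [8.0, 9.0]])
D = distance_matrix(X)
```

The result is symmetric with a zero diagonal, which is exactly the “pure relationships, nothing else” representation described above.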

This approach creates a map abstracted away from the raw values: the preprocessed data keeps all the spatial information about the relationships between entries and nothing else, a clean abstraction. The next steps would be:

  • Remove nodes that are least similar to anything else; these are far away from basically everything. This acts as automatic noise removal. One clear advantage here is the automatic detection of outliers—not quite clusters but still a distinct category.
  • Remove nodes with high betweenness scores; they disrupt clustering. These nodes, which can be automatically detected in graph analysis, don’t belong to any cluster but lie between them, so they confuse clustering algorithms.

For removing nodes, both outliers and the in-between ones, there must be some automated algorithm for detecting the cutoff point. It’s not too challenging: sort the scores, draw a straight line from the first to the last value, and find where the actual distribution deviates most from that line, thereby detecting the knee point.
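That knee-point rule is simple enough to sketch; this is a minimal version assuming the scores are one-dimensional (e.g. nearest-neighbor distances or betweenness values).

```python
import numpy as np

def knee_point(values):
    """Index of the knee in a sorted 1-D score list: the point of
    maximum vertical deviation from the straight line joining the
    first and last sorted values."""
    y = np.sort(np.asarray(values, dtype=float))
    x = np.arange(len(y))
    # straight line from (0, y[0]) to (n-1, y[-1])
    line = y[0] + (y[-1] - y[0]) * x / (len(y) - 1)
    return int(np.argmax(np.abs(y - line)))

# demo: scores that stay flat, then jump — knee at the jump
scores = [0.1, 0.12, 0.11, 0.13, 0.5, 3.0, 3.2]
knee = knee_point(scores)
```

Everything past the knee index would be the candidate set for removal.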

As a result, we should obtain clear spatial data about the relationships between entries, free from noise. If there are any clusters, they should be distinctly detectable. Additionally, we could run further tests on those outliers, combining the strengths of Gaussian Mixture Models and DBSCAN.

Of course, as with any spatial information, there is one parameter that can’t be omitted. But it is a very intuitive one: zoom, or level of detail. It still has to be applied, since it’s purely the user’s choice whether to focus on many small clusters or on a high-level overview, but it’s as intuitive as zooming in Google Maps.
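One concrete way to realize that zoom parameter (my assumption, not spelled out in the post) is a distance threshold: connect every pair of entries closer than the threshold and take connected components as clusters. Zooming out means raising the threshold.

```python
import numpy as np

def clusters_at_zoom(D, threshold):
    """Connect every pair with distance < threshold, then return
    connected-component labels via simple min-label propagation.
    A larger threshold 'zooms out' (fewer, bigger clusters)."""
    n = len(D)
    labels = np.arange(n)
    changed = True
    while changed:                # fine for small n
        changed = False
        for i in range(n):
            for j in range(n):
                if D[i, j] < threshold and labels[i] != labels[j]:
                    m = min(labels[i], labels[j])
                    labels[i] = labels[j] = m
                    changed = True
    return labels

# demo: two tight groups on a line
pts = np.array([0.0, 0.1, 5.0, 5.1])
D = np.abs(pts[:, None] - pts[None, :])
labels = clusters_at_zoom(D, threshold=1.0)
```

At threshold 1.0 the demo yields two clusters; at a threshold wider than the whole range, everything merges into one, which is the “zoomed all the way out” view.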

I would love to run some experiments with this idea; unfortunately, it’s not as crucial to what really matters as other ideas I want to pursue, so… Leaving it here, and maybe one day I’ll get back to it.
