Detecting Patterns in dataset using clustering algorithm

I have a dataset that contains a cross pattern. How can I filter out this cross-like pattern?

I tried DBSCAN, but it didn't work effectively. I also can't use any clustering algorithm that requires the number of clusters to be specified, because the data cleaning needs to be automated.

[Image: data pattern]



Solution 1:[1]

Sorry, forget SVM; that's for classification. Honestly, I didn't read your question carefully the first time, and I just re-read what you posted originally. Try Mean Shift, which determines the number of clusters automatically. Here's an example. Hopefully you can adapt it for your specific use.

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1], [1, -1], [1, -1]]
X, _ = make_blobs(n_samples=10000, centers=centers, cluster_std=0.2)

# #############################################################################
# Compute clustering with MeanShift

# Estimate the bandwidth automatically from the data
bandwidth = estimate_bandwidth(X, quantile=0.6, n_samples=5000)

ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_

labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)

print("number of estimated clusters : %d" % n_clusters_)

# #############################################################################
# Plot result
import matplotlib.pyplot as plt
from itertools import cycle

plt.figure(1)
plt.clf()

colors = cycle('bgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    my_members = labels == k
    cluster_center = cluster_centers[k]
    plt.plot(X[my_members, 0], X[my_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()


Or, try this.

import numpy as np
from sklearn.cluster import MeanShift
from matplotlib import pyplot as plt
# Importing Axes3D registers the 3D projection on older Matplotlib versions
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401
from sklearn.datasets import make_blobs
  
# We will be using the make_blobs method 
# in order to generate our own data. 
  
clusters = [[2, 2, 2], [7, 7, 7], [5, 13, 13]] 
  
X, _ = make_blobs(n_samples = 150, centers = clusters, 
                                   cluster_std = 0.60) 
   
# Fit the model, then store the
# coordinates of the cluster centers
ms = MeanShift() 
ms.fit(X) 
cluster_centers = ms.cluster_centers_ 
   
# Finally, plot the data points
# and cluster centers in a 3D graph.
fig = plt.figure() 
  
ax = fig.add_subplot(111, projection ='3d') 
  
ax.scatter(X[:, 0], X[:, 1], X[:, 2], marker ='o') 
  
ax.scatter(cluster_centers[:, 0], cluster_centers[:, 1], 
           cluster_centers[:, 2], marker ='x', color ='red', 
           s = 300, linewidth = 5, zorder = 10) 
  
plt.show() 

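To adapt either example to data cleaning, one option is to use the fitted labels as a mask and drop the clusters you don't want. Here is a minimal sketch that assumes the unwanted points end up in their own, comparatively small cluster. Whether that holds for a cross-shaped pattern depends on the bandwidth and on your data. The helper name drop_small_clusters, the 20% threshold, and the stand-in blobs are all made up for illustration.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs


def drop_small_clusters(X, labels, min_fraction=0.2):
    """Keep points whose cluster holds at least min_fraction of all points.

    Hypothetical helper; expects non-negative integer labels such as
    those produced by MeanShift with its default settings.
    """
    counts = np.bincount(labels)  # number of points per cluster label
    keep_labels = np.flatnonzero(counts >= min_fraction * len(labels))
    mask = np.isin(labels, keep_labels)
    return X[mask], mask


# Stand-in data: two large blobs plus a small group standing in for
# the unwanted pattern (an assumption, not the asker's dataset)
X, _ = make_blobs(n_samples=[450, 450, 100],
                  centers=[[0, 0], [6, 6], [0, 6]],
                  cluster_std=0.4, random_state=0)

bandwidth = estimate_bandwidth(X, quantile=0.2, random_state=0)
labels = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(X).labels_

X_clean, mask = drop_small_clusters(X, labels)
print("kept %d of %d points" % (len(X_clean), len(X)))
```

If the unwanted cluster is not the small one in your data, the same mask idea works with any other rule for picking which labels to keep.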

Several clustering methods determine the number of clusters automatically instead of taking it as a parameter. Check out the link below for some ideas of how to move forward with your project.

https://scikit-learn.org/stable/modules/clustering.html
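
As a rough starting point, here is a sketch that runs three scikit-learn estimators that don't take a cluster count on the same toy data and reports how many clusters each one finds. The data and the eps and min_samples values are placeholders I picked for illustration. DBSCAN and OPTICS swap the cluster count for density parameters, which still need tuning for real data.

```python
from sklearn.cluster import DBSCAN, OPTICS, MeanShift
from sklearn.datasets import make_blobs

# Toy data with three well-separated blobs (illustrative only)
X, _ = make_blobs(n_samples=300, centers=[[1, 1], [-1, -1], [1, -1]],
                  cluster_std=0.3, random_state=0)

# None of these estimators takes the number of clusters as input
estimators = {
    "MeanShift": MeanShift(bin_seeding=True),
    "DBSCAN": DBSCAN(eps=0.3, min_samples=10),
    "OPTICS": OPTICS(min_samples=10),
}

n_found = {}
for name, est in estimators.items():
    labels = est.fit_predict(X)
    # DBSCAN and OPTICS label noise points as -1; that is not a cluster
    n_found[name] = len(set(labels.tolist()) - {-1})
    print("%s: %d clusters" % (name, n_found[name]))
```

Comparing the counts and plotting the labels for your own data should show fairly quickly which method, if any, isolates the pattern you want to remove.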

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1