How to use K-means to classify a histogram of a 1D data set using two features in time and value

井民全, Jing, mqjing@gmail.com

At first step, we crate a demo dataset by generating a 100 samples data array with mixed two distribution in which the first distribution has 30 samples with properties of means as 40 with standard deviation 1 and the second distribution has 70 samples with means as 60 and the standard deviation as 10.

The second step, let k-means to classify the histogram of data and provide the cut result.

Key points

(1) How to get the histogram bin for k-means

i = 1

bins = np.linspace(0, len(X), len(X))

axes[i].hist(X, bins=bins, fc='#AAAAFF')

axes[i].set_title("(b) Show the frequency histogram of X.")

axes[i].set_ylabel("Histogram")

axes[i].set_xlabel("Values")

# get distributed bins

p = axes[i].patches

lstHistogram = [patch.get_height() for patch in p]

(2) How to use k-means to classify the 1D data

# run k-means

kmeans = KMeans(n_clusters=2)

lstLabel = kmeans.fit_predict(list(zip([x for x in range(0, len(lstHistogram))], lstHistogram)))

print(lstLabel)

# get the k-means result

lstIndex = np.where(np.abs(np.diff(lstLabel)) == 1)[0]

print("The cut points, lstIndex = ", lstIndex)

Code

np.random.seed(1)

N = 100 # total samples

# normal distribution list. mean = 40, std= 1, number of samples = 30%

arrA = np.random.normal(40, 1, int(0.3 * N))

# normal distribution list. mean = 60, std= 10, number of samples = 70%

arrB = np.random.normal(60, scale=10, size=int(0.7 * N))

X = np.concatenate((arrA, arrB))

fig, axes = plt.subplots(nrows=3, sharex=True, gridspec_kw={"height_ratios": [1,1,1]})

fig.suptitle('Fig. Test of Normal Distribution and K-means.')

fig.canvas.set_window_title('Test of Normal Distribution and K-means')

# show all samples

i = 0

axes[i].plot([x for x in range(0, X.size)], X, "tab:red", label="X")

axes[i].set_title("(a) Data X with two distributions.")

axes[i].set_ylabel("Value")

axes[i].set_xlabel("Samples")

axes[i].grid()

axes[i].legend()

# show histogram

i = 1

bins = np.linspace(0, N, N)

axes[i].hist(X, bins=bins, fc='#AAAAFF')

axes[i].set_title("(b) Show the frequency histogram of X.")

axes[i].set_ylabel("Histogram")

axes[i].set_xlabel("Values")

# get distributed bins

p = axes[i].patches

lstHistogram = [patch.get_height() for patch in p]

# run k-means

kmeans = KMeans(n_clusters=2)

lstLabel = kmeans.fit_predict(list(zip([x for x in range(0, len(lstHistogram))], lstHistogram)))

print(lstLabel)

lstIndex = np.where(np.abs(np.diff(lstLabel)) == 1)[0]

print("The cut points, lstIndex = ", lstIndex)

# show k-means result

i = 2

axes[i].plot([x for x in range(0, len(lstLabel))], lstLabel, "tab:red", label="k-means label")

axes[i].set_title("(c) Show K-means result for the X distribution.")

axes[i].set_ylabel("Label{0/1}")

axes[i].set_xlabel("Values")

axes[i].grid()

axes[i].legend()

plt.tight_layout() # Prvent the figure title overlaps axes label

plt.show()

Result

Figure

Stdout

fig.canvas.set_window_title('Test of Normal Distribution and K-means')

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

The cut points, lstIndex = [49]

井民全觀點 (Jing's Perspective)

2022年8月21日星期日

[python, k-means, 1d] How to use K-means to classify a histogram of a 1D data set using two features in time and value

Key points

(1) How to get the histogram bin for k-means

(2) How to use k-means to classify the 1D data

Code

Result

Figure

Stdout

References

2022年8月21日 星期日

[python, k-means, 1d] How to use K-means to classify a histogram of a 1D data set using two features in time and value

Key points

(1) How to get the histogram bin for k-means

(2) How to use k-means to classify the 1D data

Code

Result

Figure

Stdout

References

2022年8月21日星期日