Is there a clustering algorithm for my specific situation? - r

This is a scatterplot of my data:
In my opinion, I see two clusters here. A large, wide, highly dense one up top, and a smaller, lower density one in the bottom right:
When I do k-means clustering, the data ends up getting split right down the middle:
My question: Is there a clustering algorithm out there that would be able to discern the clusters that I am seeing/describing?

I don't see any code or data here, but based on your description I'd recommend DBSCAN or Affinity Propagation, and you can try a Gaussian mixture model (GMM) as well. You can find sample code below.
https://scikit-learn.org/stable/modules/clustering.html#dbscan
https://scikit-learn.org/stable/modules/clustering.html#affinity-propagation
https://scikit-learn.org/stable/modules/mixture.html#mixture
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
                            random_state=0)
X = StandardScaler().fit_transform(X)

# #############################################################################
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f"
      % metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f"
      % metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, labels))

# #############################################################################
# Plot result
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]
    class_member_mask = (labels == k)

    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
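Since the question is tagged r, here is a minimal GMM sketch in R as well, using the mclust package (the package choice and the column names x and y are my assumptions; the question's actual data isn't shown):

library(mclust)
# Fit a two-component Gaussian mixture; G = 2 matches the two clusters
# described in the question. Mclust selects the covariance model by BIC,
# so one component can be wide and dense while the other is small and diffuse.
fit <- Mclust(df[, c("x", "y")], G = 2)
# Inspect the fit and plot the resulting classification
summary(fit)
plot(fit, what = "classification")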

Related

How to use a for loop over negative (and positive) values in R?

I am trying to use a for-loop over a range of positive and negative values and then plot the results. However, I'm having trouble getting R to plot the correct values, since the negative values seem to screw up the indexing.
More precisely, the code I am running is:
# Setup objects
R = (1:20)
rejection = rep(NA, 20)
t = seq(from = -10, to = 10, by = 1)
avg_rej_freq = rep(NA, 21)

# Test a hypothesis for each possible value of x and each replication
for (x in t) {
  for (r in R) {
    # Generate 1 observation from N(x,1)
    y = rnorm(1, x, 1)
    # Take the average of this observation
    avg_y = mean(y)
    # Test this observation using the test we found in part a
    if (avg_y >= 1 + pnorm(.95))
      {rejection[r] = 1}
    if (y < 1 + pnorm(.95))
      {rejection[r] = 0}
  }
  # Calculate the average rejection frequency across the 20 samples
  avg_rej_freq[x] = mean(rejection)
}

# Plot the different values of x against the average rejection frequency
plot(t, avg_rej_freq)
The resulting graph should look something like this:
# Define the rejection probability for n=1
rej_prob = function(x) (1 - pnorm(1 - x + qnorm(0.95)))
# Plot it
curve(rej_prob, from = -10, to = 10, xlab = expression(theta),
      ylab = "Rejection probability")
...but there's clearly something wrong with my code that is shifting the positive values on the graph over to the left.
Any help on how to fix this would be much appreciated!
Yep, as you suspected, the negative indices are causing problems. R has no notion of a "negative first" position in a vector: a negative index means "everything except that position", and an index of zero is silently ignored, so those assignments never store what you intend. Instead, use seq_along to produce a vector of positive indices and loop over those:
# Setup objects
R = (1:20)
rejection = rep(NA, 20)
t = seq(from = -10, to = 10, by = 1)
avg_rej_freq = rep(NA, 21)

# Test a hypothesis for each possible value of x and each replication
for (x in seq_along(t)) {
  for (r in R) {
    # Generate 1 observation from N(x,1)
    # Now we ask for the value of t at index x rather than t directly
    y = rnorm(1, t[x], 1)
    # Take the average of this observation
    avg_y = mean(y)
    # Test this observation using the test we found in part a
    if (avg_y >= 1 + pnorm(.95))
      {rejection[r] = 1}
    if (y < 1 + pnorm(.95))
      {rejection[r] = 0}
  }
  # Calculate the average rejection frequency across the 20 samples
  avg_rej_freq[x] = mean(rejection)
}

# Plot the different values of x against the average rejection frequency
plot(t, avg_rej_freq)
which produces the following plot:
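To see exactly why the original loop failed, here is a quick console demonstration of what negative and zero indices do in R (my own illustration, not part of the original code):

# Negative indices mean "everything except", and zero indices are ignored
v <- rep(NA, 3)
v[-1] <- 5   # assigns to every position EXCEPT the first
v[0] <- 9    # silently does nothing
v
# [1] NA  5  5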
Not sure why you want to simulate with for-loops when pnorm() is vectorized, but here is your code with the mistakes corrected (check the comments):
# Test a hypothesis for each possible value of x and each replication
for (x in t) {
  for (r in R) {
    # Generate 1 observation from N(x,1)
    y = rnorm(1, x, 1)
    # no need to take the average since you have a single observation
    # Test this observation using the test we found in part a
    rejection[r] = ifelse(y >= 1 + pnorm(.95), 1, 0)
  }
  # Calculate the average rejection frequency across the 20 samples
  # R vector indices start at 1, so shift your x values s.t. negative
  # values become positive
  avg_rej_freq[x - min(t) + 1] = mean(rejection)
}

# Define the rejection probability for n=1
rej_prob = function(x) (1 - pnorm(1 - x + qnorm(0.95)))
# Plot it
curve(rej_prob, from = -10, to = 10, xlab = expression(theta),
      ylab = "Rejection probability")
# plot your points
points(t, avg_rej_freq, pch=19, col='red')
Not sure why you need the for-loops at all: what you are doing can be collapsed into one line. The rest of the code is taken from @Sandipan Dey:
R <- 20
t <- seq(from = -10, to = 10, by = 1)
# All the for-loops collapsed into this one line:
avg_rej_freq <- rowMeans(matrix(rnorm(R * length(t), t), 21) >= 1 + pnorm(.95))
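# How the one-liner works: rnorm() recycles the mean vector t over the
# R * length(t) draws, and matrix(..., 21) refills column-wise, so row i
# holds R draws with mean t[i]; rowMeans() of the logical comparison then
# gives the rejection frequency for each t[i].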
rej_prob <- function(x) 1 - pnorm(1 - x + qnorm(0.95))
curve(rej_prob, from = -10, to = 10, xlab = expression(theta),
      ylab = "Rejection probability")
# plot your points
points(t, avg_rej_freq, pch=19, col='red')

How to plot scatter plot with original variables after scaling with K-means

I have scaled my original data X1:
scaler = StandardScaler()
X1_scaled = pd.DataFrame(scaler.fit_transform(X1),columns = X1.columns)
and then performed k-means clustering:
kmeans = KMeans(
    init="random",
    n_clusters=3,
    n_init=10,
    max_iter=300,
    random_state=123)
X1['label'] = kmeans.fit_predict(X1_scaled[['Wn', 'LL']])
# get centroids
centroids = kmeans.cluster_centers_
cen_x = [i[0] for i in centroids]
cen_y = [i[1] for i in centroids]
Now I would like to plot the original data (X1) and the centroids, but the centroids are scaled, so when I plot the results:
g = sns.scatterplot(x=X1.Wn, y=X1.LL, hue=X1.label,
                    data=X1, palette='colorblind',
                    legend='full')
g = sns.scatterplot(cen_x, cen_y, s=80, color='black')
the centroids fall outside the clusters.
How can I plot the original data, with the groups and the centroids?
this is the image I got:
and this is what I would like to have, but with the original data and not the scaled data:
You can call scaler.inverse_transform() on the centroids. (Note that sns.scatterplot is an axes-level function and returns an ax, not a FacetGrid.)
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

X1 = pd.DataFrame({'Wn': np.random.rand(30) * 12, 'LL': np.random.rand(30) * 6})
scaler = StandardScaler()
X1_scaled = pd.DataFrame(scaler.fit_transform(X1), columns=X1.columns)

kmeans = KMeans(init="random",
                n_clusters=3,
                n_init=10,
                max_iter=300,
                random_state=123)
X1['label'] = kmeans.fit_predict(X1_scaled[['Wn', 'LL']])

# get centroids
centroids = scaler.inverse_transform(kmeans.cluster_centers_)
cen_x = [i[0] for i in centroids]
cen_y = [i[1] for i in centroids]

ax = sns.scatterplot(x='Wn', y='LL', hue='label',
                     data=X1, palette='colorblind',
                     legend='full')
sns.scatterplot(x=cen_x, y=cen_y, s=80, color='black', ax=ax)
plt.tight_layout()
plt.show()

Holoviews tap stream of correlation heatmap and regression plot

I want to make a correlation heatmap for a DataFrame and a regression plot for each pair of variables. I have tried to read all the docs but am still having a very hard time connecting the two plots, so that when I tap the heatmap, the corresponding regression plot shows up.
Here's some example code:
import holoviews as hv
from holoviews import opts
import seaborn as sns
import numpy as np
import pandas as pd
hv.extension('bokeh')
df = sns.load_dataset('tips')
df = df[['total_bill', 'tip', 'size']]
corr = df.corr()
heatmap = hv.HeatMap((corr.columns, corr.index, corr))\
    .opts(tools=['tap', 'hover'], height=400, width=400, toolbar='above')
m, b = np.polyfit(df.tip, df.total_bill, deg=1)
x = np.linspace(df.tip.min(), df.tip.max())
y = m*x + b
curve = hv.Curve((x, y))\
    .opts(height=400, width=400, color='red', ylim=(0, 100))
points = hv.Scatter((df.tip, df.total_bill))
hv.Layout((points * curve) + heatmap).cols(2)
I adapted the relevant parts of the Tap example from the docs (http://holoviews.org/reference/streams/bokeh/Tap.html) to your code. Maybe this clears up the confusion.
import pandas as pd
import numpy as np
import holoviews as hv
from holoviews import opts
hv.extension('bokeh', width=90)
import seaborn as sns
# Declare dataset
df = sns.load_dataset('tips')
df = df[['total_bill', 'tip', 'size']]
# Declare HeatMap
corr = df.corr()
heatmap = hv.HeatMap((corr.columns, corr.index, corr))
# Declare Tap stream with heatmap as source and initial values
posxy = hv.streams.Tap(source=heatmap, x='total_bill', y='tip')
# Define function to compute the regression plot based on tap location
def tap_histogram(x, y):
    m, b = np.polyfit(df[x], df[y], deg=1)
    x_data = np.linspace(df.tip.min(), df.tip.max())
    y_data = m*x_data + b
    return hv.Curve((x_data, y_data), x, y) * hv.Scatter((df[x], df[y]), x, y)
tap_dmap = hv.DynamicMap(tap_histogram, streams=[posxy])
(heatmap + tap_dmap).opts(
    opts.Scatter(height=400, width=400, color='red', ylim=(0, 100), framewise=True),
    opts.HeatMap(tools=['tap', 'hover'], height=400, width=400, toolbar='above'),
    opts.Curve(framewise=True)
)
Two common problems we face while modeling are collinearity and nonlinearity. Collinearity can be visualized with a correlation heatmap, but it becomes hard to explore with a large number of variables/features. In the following application, you can hover the mouse over the heatmap to check the correlation coefficient between any two variables. When you tap, the scatter plot is updated with a second-degree fitted curve to reveal the nonlinearity between the two variables.
With the help of @doopler, I changed the code a little bit and am sharing it here:
import numpy as np
import pandas as pd
import holoviews as hv
hv.extension('bokeh')
# generate random data
df = pd.DataFrame(data={'col_1': np.random.normal(5, 2, 100)})
df['col_2'] = df.col_1 + np.random.gamma(5, 2, 100)
df['col_3'] = df.col_1*2 + np.random.normal(0, 10, 100)
df['col_4'] = df.col_1**2 + np.random.normal(0, 10, 100)
df['col_5'] = np.sin(df.col_1)
df['col_6'] = np.cos(df.col_1)
corr = df.corr().abs()
# mask the upper triangle of the heatmap
corr.values[np.triu_indices_from(corr, 0)] = np.nan
heatmap = hv.HeatMap((corr.columns, corr.index, corr))\
    .opts(tools=['hover'], height=400, width=400, fontsize=9,
          toolbar='above', colorbar=False, cmap='Blues',
          invert_yaxis=True, xrotation=90, xlabel='', ylabel='',
          title='Correlation Coefficient Heatmap (absolute value)')
# define tap stream with heatmap as source
tap_xy = hv.streams.Tap(source=heatmap, x='col_1', y='col_4')
# calculate correlation plot based on tap
def tap_corrplot(x, y):
    # drop missing values if there are any
    df_notnull = df[[x, y]].dropna(how='any')
    # fit a 2nd degree line/curve
    m1, m2, b = np.polyfit(df_notnull[x], df_notnull[y], deg=2)
    # generate data to plot the fitted line/curve
    x_curve = np.linspace(df[x].min(), df[x].max())
    y_curve = m1*x_curve**2 + m2*x_curve + b
    curve = hv.Curve((x_curve, y_curve), x, y)\
        .opts(color='#fc4f30', framewise=True)
    scatter = hv.Scatter((df[x], df[y]), x, y)\
        .opts(height=400, width=400, fontsize=9, size=5,
              alpha=0.2, ylim=(df[y].min(), df[y].max()),
              color='#30a2da', framewise=True,
              title='Correlation Plot (2nd degree fit)')
    return curve * scatter
# map tap in heatmap with correlation plot
tap_dmap = hv.DynamicMap(tap_corrplot, streams=[tap_xy])
layout = heatmap + tap_dmap
layout
In case you need to run it as a Bokeh application:
from bokeh.server.server import Server
renderer = hv.renderer('bokeh')
app = renderer.app(layout)
server = Server({'/': app}, port=0)
server.start()
server.show('/')
The code works well with Jupyter Lab. If you use Jupyter Notebook, check this link.

Arc length of piecewise spline using R

I know there are many ways to calculate the arc length of a curve, but I am looking for an efficient way to calculate the arc length of a piecewise spline through irregularly spaced points.
The actual curve I'm trying to find the length of is quite complex (contour line) so here is a quick example using a circle where the actual arclength is known to be 2*pi:
# Generate "random" data
set.seed(50)
theta = seq(0, 2*pi, length.out = 50) + runif(50, -0.05, 0.05)
theta = c(0, theta[theta >=0 & theta <= 2*pi], 2*pi)
data = data.frame(x = cos(theta), y = sin(theta))
# Bezier Curve fit
library("bezier")
bezierArcLength(data, t1=0, t2=1)$arc.length
# Calculate arc length using euclidean distance
library("dplyr")
data$eucdist = sqrt((data$x - lag(data$x))^2 + (data$y - lag(data$y))^2)
print(paste("Euclidean distance:", sum(data$eucdist[-1])))
print(paste("Actual distance:", 2*pi))
# Output
Bezier distance: 5.864282
Euclidean distance: 6.2779
Actual distance: 6.2831
The closest thing I have found is https://www.rdocumentation.org/packages/pracma/versions/1.9.9/topics/arclength but I would have to parameterise my data as some function(t) ...spline(data, t)... to use arclength. I tried this, but the fitted spline ran along the middle of the circle rather than along the circumference.
Another alternative I have been (unsuccessfully) trying is to fit piecewise splines and determine the length of each spline.
Any help would be much appreciated!
EDIT: Added alternate method using the Bezier package, but the arc length found is even worse than just using the Euclidean method.
In the absence of community answers, I've cobbled together a solution which seems to work for what I was after! I'll leave my code here in case anyone has the same question and comes across this.
# Libraries
library("bezier")
library("pracma")
library("dplyr")

# Very slow for loops, sorry! Didn't write it as an apply function
output = data.frame()
for (i in 1:100) {
  # Generate "random" data
  # set.seed(50)
  theta = seq(0, 2*pi, length.out = 50) + runif(50, -0.1, 0.1)
  theta = sort(theta)
  theta = c(0, theta[theta >= 0 & theta <= 2*pi], 2*pi)
  data = data.frame(x = cos(theta), y = sin(theta))

  # Bezier curve fit
  b = bezierArcLength(data, t1=0, t2=1)$arc.length

  # pracma piecewise cubic spline, parameterised by angle
  t = atan2(data$y, data$x)
  t = t + ifelse(t < 0, 2*pi, 0)
  csx <- cubicspline(t, data$x)
  csy <- cubicspline(t, data$y)
  # differentiate the splines and integrate the speed to get arc length
  dcsx = csx; dcsx$coefs = t(apply(csx$coefs, 1, polyder))
  dcsy = csy; dcsy$coefs = t(apply(csy$coefs, 1, polyder))
  ds <- function(t) sqrt(ppval(dcsx, t)^2 + ppval(dcsy, t)^2)
  s = integral(ds, t[1], t[length(t)])

  # Calculate arc length using euclidean distance
  data$eucdist = sqrt((data$x - lag(data$x))^2 + (data$y - lag(data$y))^2)
  e = sum(data$eucdist[-1])

  # Use path distance as the parametric variable instead of angle
  data$d = c(0, cumsum(data$eucdist[-1]))
  csx <- cubicspline(data$d, data$x)
  csy <- cubicspline(data$d, data$y)
  dcsx = csx; dcsx$coefs = t(apply(csx$coefs, 1, polyder))
  dcsy = csy; dcsy$coefs = t(apply(csy$coefs, 1, polyder))
  ds <- function(t) sqrt(ppval(dcsx, t)^2 + ppval(dcsy, t)^2)
  d = integral(ds, data$d[1], data$d[nrow(data)])

  # Actual value
  a = 2*pi

  # Append to result
  output = rbind(
    output,
    data.frame(bezier=b, cubic.spline=s, cubic.spline.error=(s-a)/a*100,
               euclidean.dist=e, euclidean.dist.error=(e-a)/a*100,
               dist.spline=d, dist.spline.error=(d-a)/a*100))
}
# Summary
apply(output, 2, mean)
# Summary output
      bezier   cubic.spline   cubic.spline.error   euclidean.dist   euclidean.dist.error   dist.spline   dist.spline.error
5.857931e+00   6.283180e+00        -7.742975e-05     6.274913e+00          -1.316564e-01   6.283085683        -0.001585570
I still don't quite understand what bezierArcLength does, but I'm very happy with my solution using cubicspline from the pracma package as it is a lot more accurate.
Other solutions are still more than welcome!
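For anyone who comes across this later: pracma also ships an arclength() function that does the polygon refinement for you, so the differentiate-and-integrate step above can be skipped. Here is a rough sketch feeding it the same path-distance parameterisation (my own untested variation on the code above, so treat it as an assumption rather than a benchmarked solution):

library(pracma)
# Parameterise the curve by cumulative chord length, as in the code above
d <- c(0, cumsum(sqrt(diff(data$x)^2 + diff(data$y)^2)))
csx <- cubicspline(d, data$x)
csy <- cubicspline(d, data$y)
# arclength() expects a parameterisation f(t) returning a point on the curve
f <- function(t) c(ppval(csx, t), ppval(csy, t))
arclength(f, d[1], d[length(d)])$length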

r deSolve - plotting time evolution pde

Suppose we have a PDE that describes the evolution of a variable y(t,x) over time t and space x, and I would like to plot its evolution in a three-dimensional diagram (t, x, y). With deSolve I can solve the PDE, but I have no idea how to obtain this kind of diagram.
The example from the deSolve package documentation is the following, where y is APHIDS, t = 0,...,200 and x = 1,...,60:
library(deSolve)

Aphid <- function(t, APHIDS, parameters) {
  deltax  <- c(0.5, rep(1, numboxes - 1), 0.5)
  Flux    <- -D * diff(c(0, APHIDS, 0)) / deltax
  dAPHIDS <- -diff(Flux) / delx + APHIDS * r
  list(dAPHIDS)
}

D <- 0.3    # m2/day  diffusion rate
r <- 0.01   # /day    net growth rate
delx <- 1   # m       thickness of boxes
numboxes <- 60

Distance <- seq(from = 0.5, by = delx, length.out = numboxes)
APHIDS <- rep(0, times = numboxes)
APHIDS[30:31] <- 1
state <- c(APHIDS = APHIDS)  # initialise state variables

times <- seq(0, 200, by = 1)
out <- ode.1D(state, times, Aphid, parms = 0, nspec = 1, names = "Aphid")
"out" produces a matrix containing all the data that we need, t, y(x1), y(x2), ... y(x60). How can I produce a surface plot to show the evolution and variability of y in (t,x)?
The approach changes a bit depending on the package you use, but it can be done with little effort because out[, -1] is already in the matrix form needed to draw a surface. Here are two examples, using the rgl and plot3D packages.
out2 <- out[, -1]
AphID <- 1:ncol(out2)

library(rgl)
persp3d(times, AphID, out2, col="gray50", zlab="y")
# If you want to change color with the value of the z-axis:
# persp3d(times, AphID, out2, zlab="y", col=topo.colors(256)[cut(c(out2), 256)])

library(plot3D)
mat <- mesh(times, AphID)
surf3D(mat$x, mat$y, out2, bty="f", ticktype="detailed",
       xlab="times", ylab="AphID", zlab="y")
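As a further option, deSolve can draw this kind of surface itself: the output of ode.1D has an image() method. A brief sketch, based on my reading of the deSolve documentation (treat the exact arguments as an assumption):

# deSolve's own image() method for ode.1D output draws the surface directly;
# method = "persp" gives a perspective surface instead of a flat image
image(out, method = "persp", grid = Distance,
      xlab = "time, days", ylab = "distance on plant, m", main = "Aphid density")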
