K-means clustering in ggplot

I am using the k-means algorithm in R to separate my data into clusters. I would like to plot the results with ggplot, which I managed to do;
however, the results seem to differ between ggplot and cluster::clusplot.
So I wanted to ask what I am missing. For example, I know that the scaling is different, but I was wondering why all points are inside the bounds when using clusplot, while with ggplot they are not.
Is it just because of the scaling?
So are the two results below exactly the same?
library(cluster)
library(ggfortify)
x <- rbind(matrix(rnorm(2000, sd = 123), ncol = 2),
matrix(rnorm(2000, mean = 800, sd = 123), ncol = 2))
colnames(x) <- c("x", "y")
x <- data.frame(x)
A <- kmeans(x, centers = 3, nstart = 50, iter.max = 500)
cluster::clusplot(cbind(x$x, x$y), A$cluster, color = T, shade = T)
autoplot(kmeans(x, centers = 3, nstart = 50, iter.max = 500), data = x, frame.type = 'norm')

For me, I get the same plot using either clusplot or ggplot. With ggplot, however, you first have to run a PCA on your data to get the same plot as clusplot. Maybe that is where your issue lies.
Here, with your example, I did:
x <- rbind(matrix(rnorm(2000, sd = 123), ncol = 2),
matrix(rnorm(2000, mean = 800, sd = 123), ncol = 2))
colnames(x) <- c("x", "y")
x <- data.frame(x)
A <- kmeans(x, centers = 3, nstart = 50, iter.max = 500)
cluster::clusplot(cbind(x$x, x$y), A$cluster, color = T, shade = T)
pca_x = princomp(x)
x_cluster = data.frame(pca_x$scores, A$cluster)
library(ggplot2)
# plot the PCA scores, colored and filled by k-means cluster
ggplot(x_cluster, aes(x = Comp.1, y = Comp.2, color = as.factor(A.cluster), fill = as.factor(A.cluster))) + geom_point() +
  stat_ellipse(type = "t", geom = "polygon", alpha = 0.4)
[Figure: the plot produced by clusplot]
[Figure: the corresponding plot produced by ggplot]
I hope this helps you figure out why your plots differ.
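
One likely reason why points fall outside the bounds in the ggplot version is the ellipse itself rather than the scaling. As far as I know, clusplot by default (span = TRUE) draws the smallest ellipse that contains every point of a cluster, whereas frame.type = 'norm' in autoplot (like stat_ellipse) draws a confidence ellipse at roughly the 95% level, so some points are expected to lie outside it. A minimal base-R sketch, assuming a chi-squared approximation to that ellipse (the exact radius used by stat_ellipse differs slightly) and an arbitrary seed, measures the share of points outside it per cluster:

```r
set.seed(1)  # arbitrary seed, only for reproducibility
x <- rbind(matrix(rnorm(2000, sd = 123), ncol = 2),
           matrix(rnorm(2000, mean = 800, sd = 123), ncol = 2))
colnames(x) <- c("x", "y")
A <- kmeans(x, centers = 3, nstart = 50, iter.max = 500)

# Same projection as in the answer: principal component scores
scores <- princomp(x)$scores[, 1:2]

# Share of each cluster's points outside an approximate 95% normal ellipse
outside <- sapply(split(as.data.frame(scores), A$cluster), function(s) {
  d2 <- mahalanobis(s, colMeans(s), cov(s))  # squared Mahalanobis distance
  mean(d2 > qchisq(0.95, df = 2))
})
print(round(outside, 3))
```

Any non-zero share means that some points sit outside the normal-theory ellipse, which the spanning ellipse drawn by clusplot never allows.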

Related

Overlay two contours of bivariate gaussian distribution using ggplot2

I want to overlay the contours of two bivariate Gaussian distributions on the same plot with ggplot2, using a different color for each. I looked at a previous post about how to plot contours of a bivariate Gaussian (Plot multivariate Gaussian contours with ggplot2), but that only plots one distribution. I tried using stat_density2d, but was unsuccessful. Here is my code with a reproducible example.
library(ggplot2)
set.seed(13)
m1 <- c(.5, -.5)
sigma1 <- matrix(c(1,.5,.5,1), nrow=2)
m2 <- c(0, 0)
sigma2 <- matrix(c(140,67,67,42), nrow=2)
# grid of points at which the density is evaluated
data.grid <- expand.grid(s.1 = seq(-25, 25, length.out=200),
                         s.2 = seq(-25, 25, length.out=200))
q.samp <- cbind(data.grid, prob = mvtnorm::dmvnorm(data.grid, mean = m2,
                                                   sigma = sigma2))
ggplot(q.samp, aes(x=s.1, y=s.2, z=prob)) +
  geom_contour() +
  coord_fixed(xlim = c(-25, 25), ylim = c(-25, 25), ratio = 1)

If I follow your code and create q1.samp and q2.samp from your parameters:
q2.samp = cbind(data.grid, prob = mvtnorm::dmvnorm(data.grid, mean = m2, sigma=sigma2))
q1.samp = cbind(data.grid, prob = mvtnorm::dmvnorm(data.grid, mean = m1, sigma=sigma1))
then I can do this:
ggplot() +
geom_contour(data=q1.samp,aes(x=s.1,y=s.2,z=prob)) +
geom_contour(data=q2.samp,aes(x=s.1,y=s.2,z=prob),col="red")
then I get one set of contours in the default colour and one in red.

Another option would be to combine the data into one data.frame and map color to the "origin" of where the data came from. This gives you a handy legend, should you need one, and all its benefits (like mapping color).
q1.samp = cbind(data.grid, prob = mvtnorm::dmvnorm(data.grid, mean = m1, sigma=sigma1))
q1.samp$origin <- "q1"
q2.samp = cbind(data.grid, prob = mvtnorm::dmvnorm(data.grid, mean = m2, sigma=sigma2))
q2.samp$origin <- "q2"
q <- rbind(q1.samp, q2.samp)
ggplot(q, aes(x=s.1, y=s.2, z=prob, color = origin)) +
geom_contour() +
coord_fixed(xlim = c(-25, 25), ylim = c(-25, 25), ratio = 1)

How do I plot a probability heatmap with ggplot2? [closed]

Say I have a bunch of data with x and y coordinates and TRUE/FALSE for each row:
library(tidyverse)
set.seed(666) #666 for the devil
x <- rnorm(1000, 50, 10)
y <- sample(1:100, 1000, replace = T)
result <- sample(c(T, F), 1000, prob = c(1, 9),replace = T)
data <- tibble(x, y, result)
Now, I want to make a plot that shows the likelihood of an area being TRUE based on that data. I could group the data into little squares (or whatever), calculate the percentage of TRUE in each, and then plot that, but I wonder whether there is something in ggplot2 that will do this for me automatically.

ggplot(data, aes(x = x, y = y, z = as.numeric(result))) +
stat_summary_2d(bins = 20, color = "grey", fun = mean) +
theme_classic()
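
The manual approach described in the question (group the points into squares and compute the share of TRUE in each) can be sketched in base R as follows. This is only an illustration of what stat_summary_2d with fun = mean computes conceptually; the 20 breaks per axis and the names bin_x, bin_y and cells are my own choices, and the bin edges will not necessarily match the ones ggplot2 picks.

```r
set.seed(666)
x <- rnorm(1000, 50, 10)
y <- sample(1:100, 1000, replace = TRUE)
result <- sample(c(TRUE, FALSE), 1000, prob = c(1, 9), replace = TRUE)

# Assign every point to one of 20 x 20 rectangular cells
bin_x <- cut(x, breaks = 20)
bin_y <- cut(y, breaks = 20)

# Proportion of TRUE per cell; cells without any points are NA
cells <- tapply(result, list(bin_x, bin_y), mean)

dim(cells)
range(cells, na.rm = TRUE)
```

The resulting matrix can be inspected directly or passed to image(cells) for a quick base-R heatmap.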

This is not done entirely in ggplot2, but the following produces what I think you are asking for:
library(tidyverse)
library(broom)
set.seed(666) #666 for the devil
data.frame(x = rnorm(1000, 50, 10),
y = sample(1:100, 1000, replace = T),
result = sample(c(T, F), 1000, prob = c(1, 9), replace = T)) %>%
do(augment(glm(result ~ x * y, data = ., family = "binomial"), type.predict = "response")) %>%
ggplot(aes(x, y, color = .fitted)) +
geom_point()
Using geom_hex instead of geom_point also looks interesting.

This seems to be another solution:
library(tidyverse)
#Preparing data
set.seed(666) #666 for the devil
data <- tibble(x = rnorm(1000, 50, 10),
y = sample(1:100, 1000, replace = T),
result = sample(c(T, F), 1000,
prob = c(1, 9), replace = T)) %>%
filter(result == TRUE)
#Plotting with ggplot
ggplot(data, aes(x, y)) +
geom_bin2d()

Interpolate curved line between start and end points for ggplot2

I'd like to create a sankey-like plot that I can create in ggplot2 where there are curved lines between my start and end locations. Currently, I have data that looks like this:
df <- data.frame(Line = rep(letters[1:4], 2),
Location = rep(c("Start", "End"), each=4),
X = rep(c(1, 10), each = 4),
Y = c(c(1,3, 5, 15), c(9,12, 14, 6)),
stringsAsFactors = F)
For example:
  Line Location  X Y
1    a    Start  1 1
2    a      End 10 9
and creates a plot that looks something like this:
library(ggplot2)
ggplot(df) +
geom_path(aes(x= X, y= Y, group = Line))
I would like to see the data come out like this:
This is another option for setting up the data:
df2 <- data.frame(Line = letters[1:4],
Start.X= rep(1, 4),
Start.Y = c(1,3,5,15),
End.X = rep(10, 4),
End.Y = c(9,12,14,6))
For example:
  Line Start.X Start.Y End.X End.Y
1    a       1       1    10     9
I can find examples of how to add a curve to the graphics of base R but these examples don't demonstrate how to get a data frame of the points in between in order to draw that curve. I would prefer to use dplyr for data manipulation. I imagine this will require a for-loop to build a table of the interpolated points.
These examples are similar but do not produce an s-shaped curve:
Plotting lines on map - gcIntermediate
http://flowingdata.com/2011/05/11/how-to-map-connections-with-great-circles/
Thank you in advance!

The code below creates curved lines via a logistic function. You could use whatever function you like instead, but this is the main idea. I should note that, for anything other than graphical purposes, creating a curved line out of two points is a bad idea: it suggests that the data show a certain type of relationship that two points cannot actually support.
df <- data.frame(Line = rep(letters[1:4], 2),
Location = rep(c("Start", "End"), each=4),
X = rep(c(1, 10), each = 4),
Y = c(c(1,3, 5, 15), c(9,12, 14, 6)),
stringsAsFactors = F)
# logistic function for curved lines
logistic = function(x, y, midpoint = mean(x)) {
  ry = range(y)
  if (y[1] < y[2]) {
    sign = 2
  } else {
    sign = -2
  }
  steepness = sign*diff(range(x)) / diff(ry)
  out = (ry[2] - ry[1]) / (1 + exp(-steepness * (x - midpoint))) + ry[1]
  return(out)
}
# an example
x = c(1, 10)
y = c(1, 9)
xnew = seq(1, 10, .5)
ynew = logistic(xnew, y)
plot(x, y, type = 'b', bty = 'n', las = 1)
lines(xnew, ynew, col = 2, type = 'b')
# applying the function to your example
xnew = seq(min(df$X), max(df$X), .1) # new x grid
m = matrix(NA, length(xnew), 4) # matrix to store results
uniq = unique(df$Line) # loop over all unique values in df$Line
for (i in seq_along(uniq)) {
  m[, i] = logistic(xnew, df$Y[df$Line == uniq[i]])
}
# base R plot
matplot(xnew, m, type = 'b', las = 1, bty = 'n', pch = 1)
# put stuff in a dataframe for ggplot
df2 = data.frame(x = rep(xnew, ncol(m)),
y = c(m),
group = factor(rep(1:ncol(m), each = nrow(m))))
library(ggplot2)
ggplot(df) +
geom_path(aes(x= X, y= Y, group = Line, color = Line)) +
geom_line(data = df2, aes(x = x, y = y, group = group, color = group))

joinPolys function in PBSmapping gives NULL output

I am using the function joinPolys in the R package PBSmapping to find intersections between polygons. However it is giving a NULL output with my data, even though I am pretty sure the intersection is non-empty.
I've created an example from https://code.google.com/p/pbs-mapping/issues/detail?id=31. In the link, the code is designed to show a case where the code does work (but doesn't work for me). The example is as follows:
Code that does not work:
require(PBSmapping)
polyA <- data.frame(PID=rep(1,4),POS=1:4,X=c(0,1,1,0),Y=c(0,0,1,1))
polyB <- data.frame(PID=rep(1,4),POS=1:4,X=c(.5,1.5,1.5,.5),Y=c(.5,.5,1.5,1.5))
# Plot polygons
plotPolys(polyA, xlim=c(0,3), ylim=c(0,3))
addPolys(polyB, border=2)
# returns NULL
print(joinPolys(polyA, polyB))
However, in other cases, the code does work:
require(PBSmapping)
N <- 4
X = cos(seq(0, 2*pi, length = N))
Y = sin(seq(0, 2*pi, length = N))
require(PBSmapping)
polysA1 = data.frame(PID = rep(1, N), POS = 1:N,
X = 5*X, Y = 5*Y)
polysB1 = data.frame(PID = rep(1, N), POS = 1:N,
X = 5*X + 5, Y = 5*Y)
plotMap(NULL, xlim = c(-10, 10), ylim = c(-10, 10))
addPolys(polysA1, col = 'blue', lty = 12, density = 0, pch = 16)
addPolys(polysB1, col = 'red', lty = 12, density = 0, pch = 16)
addPolys(joinPolys(polysA1, polysB1), col = 2)
print(head(joinPolys(polysA1, polysB1)))
I am using R version 3.1.3, and Ubuntu 14.04.2 LTS.
Thanks in advance! I'm new to stackoverflow, so please let me know if there is anything else I can provide.
Cheers

Attaching multiple covariates to ggplot scatterplot

I'm producing a plot like this:
library(ggplot2)
library(cluster)  # provides agnes()
data.dist = matrix(
c(10, -10, 10, -10, 10, -10, 10, -10, 10),
nrow=3,
ncol=3,
byrow = TRUE)
hc <- agnes(dist(data.dist), method = "ward", diss = TRUE)
cluster <- cutree(hc, k=2)
xy <- data.frame(cmdscale(dist(data.dist)), factor(cluster))
names(xy) <- c("x", "y", "cluster")
xy$model <- rownames(xy)
ggplot(xy, aes(x, y)) + geom_point(aes(colour=cluster), size=3)
Which gives me:
However, let's say I want to attach another covariate, say a binary variable c(1, 0, 1) to the data and display all 1 using one symbol (say an X) and all 0 using another symbol (say a dot). How can I accomplish this?

xy<-data.frame(x=rnorm(3),y=rnorm(3),cluster=as.factor(c(1,0,1)),another=as.factor(c(1,1,0)) )
ggplot(xy, aes(x, y,shape=another)) + geom_point(aes(colour=cluster), size=3)
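
To get the specific symbols asked for in the question (an X for 1 and a dot for 0), the shape scale can be set by hand. A small sketch, assuming the standard R plotting symbol codes 4 (cross) and 16 (filled circle); the data are the same made-up values as above, and the seed is arbitrary:

```r
library(ggplot2)

set.seed(1)  # arbitrary seed so the random coordinates are reproducible
xy <- data.frame(x = rnorm(3), y = rnorm(3),
                 cluster = as.factor(c(1, 0, 1)),
                 another = as.factor(c(1, 1, 0)))

p <- ggplot(xy, aes(x, y, shape = another)) +
  geom_point(aes(colour = cluster), size = 3) +
  # symbol 4 is an X-shaped cross, symbol 16 a filled dot
  scale_shape_manual(values = c("1" = 4, "0" = 16))
print(p)
```

Naming the entries of values after the factor levels keeps the mapping explicit, so it does not depend on the order of the levels.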
