Creating clusters based on a plot - r

I have a dataset like this:
Region  Year  Month  rate  residuals
1       2010  1      0.5   0.5
2       2010  1      4.0   0.5
The dataset continues like this; it has 15,000 observations.
I created a scatter plot:
plot(df$full.residuals, df$rate, main="Scatterplot",
xlab="Residuals", ylab="Rate")
Now I am stuck on how to show clusters in this plot. Does anyone know how to create clusters and display them in the plot?

First, I generated some more random data points, because two points are too few to form clusters. One option is k-means clustering with kmeans(). Here I chose 2 clusters; change the number of centers as needed. The factoextra package can then produce a nice visualization like this:
library(factoextra)
set.seed(123)
df <- data.frame(rate = runif(20, 0, 1),
full.residuals = runif(20, 0, 1))
kmeans_cluster <- kmeans(scale(df), 2, nstart = 5)
kmeans_cluster$cluster
#> [1] 2 2 2 2 2 1 2 2 1 1 2 2 2 2 1 2 2 1 1 2
fviz_cluster(kmeans_cluster, data = df,
palette = c("#2E9FDF", "#00AFBB"),
geom = "point",
ellipse.type = "convex",
ggtheme = theme_bw())
Created on 2022-08-18 with reprex v2.0.2
I would suggest having a look at this link for some extra information about using this package.
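If you would rather keep the base plot() call from the question, a minimal sketch is to store the k-means assignment as a column and colour the points by it. It reuses the random data above; the column name cluster and the choice of 2 centers are assumptions made for illustration, not part of the original data.

```r
# Same simulated data as above (the real df has about 15,000 rows)
set.seed(123)
df <- data.frame(rate = runif(20, 0, 1),
                 full.residuals = runif(20, 0, 1))

# Cluster on the two standardised columns; 2 centers is an arbitrary choice
km <- kmeans(scale(df[, c("rate", "full.residuals")]), centers = 2, nstart = 5)

# Keep the assignment next to the data ('cluster' is a hypothetical column name)
df$cluster <- factor(km$cluster)

# Base-graphics scatter plot, one colour per cluster
plot(df$full.residuals, df$rate,
     col = as.integer(df$cluster), pch = 19,
     main = "Scatterplot by cluster",
     xlab = "Residuals", ylab = "Rate")
legend("topright", legend = levels(df$cluster),
       col = seq_along(levels(df$cluster)), pch = 19, title = "Cluster")
```

Because the assignment lives in df$cluster, it can also be used later for summaries such as table(df$cluster).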

Related

R: "Animate" Points on a Scatter Plot

I am working with R. Suppose I have the following data frame:
my_data <- data.frame(
"col" = c("red","red","red","red","red","blue","blue","blue","blue","blue","green", "green", "green", "green","green"),
"x_cor" = c(1,2,5,6,7,4,9,1,0,1,4,4,7,8,2),
"y_cor" = c(2,3,4,5,9,5,8,1,3,9,11,5,7,9,1),
"frame_number" = c(1,2,3,4,5, 1,2,3,4,5, 1,2,3,4,5)
)
my_data$col = as.factor(my_data$col)
head(my_data)
col x_cor y_cor frame_number
1 red 1 2 1
2 red 2 3 2
3 red 5 4 3
4 red 6 5 4
5 red 7 9 5
6 blue 4 5 1
In R, is it possible to create a (two-dimensional) graph that will "animate" each colored point to a new position based on the "frame number"?
For example:
I started following the instructions from this website here: https://www.datanovia.com/en/blog/gganimate-how-to-create-plots-with-beautiful-animation-in-r/
First, I made a static graph:
library(ggplot2)
library(gganimate)
p <- ggplot(
my_data,
aes(x = x_cor, y=y_cor, colour = col)
)
Then, I tried to animate it:
p + transition_time(frame_number) +
labs(title = "frame_number: {frame_number}")
Unfortunately, this produced an empty plot and the following warnings:
There were 50 or more warnings (use warnings() to see the first 50)
1: Cannot get dimensions of plot table. Plot region might not be fixed
2: values must be length 1,
but FUN(X[[1]]) result is length 15
Can someone please show me how to fix this problem?
Thanks

How to tag data point to a cluster?

I have completed and plotted the DBSCAN cluster in R markdown.
This is my code currently:
dbscan.8=fpc::dbscan(current.matrix, eps=2, MinPts=log(33359)) #list generated
fviz_cluster(dbscan.8, data=current.matrix, stand=FALSE, ellipse=FALSE,
show.clust.cent=FALSE, geom="point", palette="jco",
ggtheme=theme_classic()) # Plot the clusters
How do I add a new column to the original data frame (current.matrix) that contains the cluster each row belongs to?
Thank you!
Using an example dataset:
library(factoextra)
library(fpc)
dat = data.frame(scale(iris[,-5]))
clus = dbscan(dat,1.5)
The clustering looks like this
viz = fviz_cluster(clus, data=dat,
stand=FALSE, ellipse=FALSE, show.clust.cent=FALSE, geom="point",
palette="jco")
print(viz)
The cluster information is already stored in the object returned by fviz_cluster:
head(viz$data)
name x y coord cluster
1 1 -2.257141 -0.4784238 5.323576 1
2 2 -2.074013 0.6718827 4.752956 1
3 3 -2.356335 0.3407664 5.668437 1
4 4 -2.291707 0.5953999 5.606421 1
5 5 -2.381863 -0.6446757 6.088877 1
6 6 -2.068701 -1.4842053 6.482388 1
The cluster assignment is also stored in the dbscan object, as $cluster. So you can do:
dat$cluster = viz$data$cluster
or:
dat$cluster = clus$cluster
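Put together, attaching the labels is a single step, sketched here on the iris example. Assuming current.matrix from the question is a numeric matrix, the equivalent would be data.frame(current.matrix, cluster = dbscan.8$cluster). In fpc::dbscan a cluster value of 0 marks noise points. The name dat_clustered is made up for this example.

```r
library(fpc)

dat <- data.frame(scale(iris[, -5]))
clus <- fpc::dbscan(dat, eps = 1.5)

# One cluster id per input row, in the original row order (0 = noise)
dat_clustered <- data.frame(dat, cluster = clus$cluster)

head(dat_clustered)
table(dat_clustered$cluster)
```

Building a new data frame rather than overwriting dat keeps the scaled inputs free of the label column, which matters if the clustering is rerun on dat.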

How do you plot the first few values of a PCA

I've run a PCA on a moderately sized data set, but I only want to visualize some of the points from that analysis. They are repeat observations, and I want to see how close the paired observations are to each other on the plot. I've arranged the data so that the first 18 individuals are the ones I want to plot, but I can't find a way to plot only those 18 points without restricting the analysis itself to them instead of using the whole data set (43 individuals).
# My data file
TrialsMR<-read.csv("NER_Trials_Matrix_Retrials.csv", row.names = 1)
# I ran the PCA of all of my values (without the categorical variable in col 8)
R.pca <- PCA(TrialsMR[,-8], graph = FALSE)
# When I try to plot only the first 18 individuals with this method, I get an error
fviz_pca_ind(R.pca[1:18,],
labelsize = 4,
pointsize = 1,
col.ind = TrialsMR$Bands,
palette = c("red", "blue", "black", "cyan", "magenta", "yellow", "gray", "green3", "pink" ))
# This is the error
Error in R.pca[1:18, ] : incorrect number of dimensions
The 18 individuals are each paired up, so only using 9 colours shouldn't cause an error (I hope).
Could anyone help me plot just the first 18 points from a PCA of my whole data set?
My data frame looks similar to this in structure
TrialsMR
Trees Bushes Shrubs Bands
JOHN1 1 4 18 BLUE
JOHN2 2 6 25 BLUE
CARL1 1 3 12 GREEN
CARL2 2 4 15 GREEN
GREG1 1 1 15 RED
GREG2 3 11 26 RED
MIKE1 1 7 19 PINK
MIKE2 1 1 25 PINK
where each band corresponds to a specific individual that has been tested twice.
R.pca is a list, so it cannot be subset with R.pca[1:18, ]; that is what causes the "incorrect number of dimensions" error. Run the PCA on the whole data set and use the select.ind argument to choose which individuals to draw, e.g.:
data(iris) # test data
If you want to rename the rows by group so they are easy to identify in the plot, do it before the PCA. For example, number setosa from 101, versicolor from 201, and virginica from 301:
new_series <- c(101:150, 201:250, 301:350) # there are 50 of each
rownames(iris) <- new_series
R.pca <- prcomp(iris[,1:4], scale. = TRUE) # pca
library(factoextra)
fviz_pca_ind(X= R.pca, labelsize = 4, pointsize = 1,
select.ind= list(name = new_series[1:120]), # 120 out of 150 selected
col.ind = iris$Species ,
palette = c("blue", "red", "green" ))
Always refer to R documentation first before using a new function.
R documentation: fviz_pca {factoextra}
X
an object of class PCA [FactoMineR]; prcomp and princomp [stats]; dudi and pca [ade4]; expOutput/epPCA [ExPosition].
select.ind, select.var
a selection of individuals/variables to be drawn. Allowed values are NULL or a list containing the arguments name, cos2 or contrib
For your particular dummy data, this should do:
R.pca <- prcomp(TrialsMR[,1:3], scale. = TRUE)
fviz_pca_ind(X= R.pca,
select.ind= list(name = row.names(TrialsMR)[1:4]), # 4 out of 8
pointsize = 1, labelsize = 4,
col.ind = TrialsMR$Bands,
palette = c("blue", "green" )) + ylim(-1,1)
Dummy Data:
TrialsMR <- read.table( text = "Trees Bushes Shrubs Bands
JOHN1 1 4 18 BLUE
JOHN2 2 6 25 BLUE
CARL1 1 3 12 GREEN
CARL2 2 4 15 GREEN
GREG1 1 1 15 RED
GREG2 3 11 26 RED
MIKE1 1 7 19 PINK
MIKE2 1 1 25 PINK", header = TRUE)
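If factoextra is not a requirement, the same idea can be sketched in base R: run the PCA on all rows, then plot only the scores of the rows of interest. This uses the dummy data above and prcomp rather than FactoMineR::PCA; keep, scores and band are names introduced for this illustration.

```r
TrialsMR <- read.table(text = "Trees Bushes Shrubs Bands
JOHN1 1 4 18 BLUE
JOHN2 2 6 25 BLUE
CARL1 1 3 12 GREEN
CARL2 2 4 15 GREEN
GREG1 1 1 15 RED
GREG2 3 11 26 RED
MIKE1 1 7 19 PINK
MIKE2 1 1 25 PINK", header = TRUE)

# PCA on the whole data set (numeric columns only)
pca <- prcomp(TrialsMR[, 1:3], scale. = TRUE)

# Rows to draw; this would be 1:18 for the real data
keep <- 1:4
scores <- pca$x[keep, 1:2]
band <- factor(TrialsMR$Bands[keep])

# Plot only the selected individuals on the first two components
plot(scores, col = as.integer(band), pch = 19, xlab = "PC1", ylab = "PC2")
text(scores, labels = rownames(scores), pos = 3, cex = 0.7)
legend("topleft", legend = levels(band),
       col = seq_along(levels(band)), pch = 19)
```

The components are still estimated from all 8 rows; only the drawing step is restricted to the selected ones.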

r dendrogram - groupLabels not match real labels (package dendextend)

Let's do a quick 3-clusters classification on the iris dataset with the FactoMineR package:
library(FactoMineR)
model <- HCPC(iris[,1:4], nb.clust = 3)
summary(model$data.clust$clust)
1 2 3
50 62 38
We see that 50 observations are in cluster 1, 62 in cluster 2 and 38 in cluster 3.
Now, we want to visualize these 3 clusters in a dendrogram, using the dendextend package, which makes it easy to draw good-looking ones:
library(dendextend)
library(dplyr)
model$call$t$tree %>%
as.dendrogram() %>%
color_branches(k = 3, groupLabels = unique(model$data.clust$clust)) %>%
plot()
The problem is that the labels on the dendrogram don't match the true labels of the classification. Cluster 2 should be the biggest one (62 observations according to the data), but on the dendrogram it is clearly the smallest.
I tried different things but nothing has worked so far, so if you have any idea what input to give to groupLabels = in order to match the real labels, that would be great.
Looking inside dendextend::color_branches, we can see that group labels are assigned using the command g <- dendextend::cutree(dend, k = k, h = h, order_clusters_as_data = FALSE).
This fact can be used for building a map between the cluster labels assigned by HCPC and group labels assigned by dendextend::color_branches.
library(FactoMineR)
library(dendextend)
library(dplyr)
model <- HCPC(iris[,1:4], nb.clust = 3)
clust.hcpc <- as.numeric(model$data.clust$clust)
clust.cutree <- dendextend::cutree(model$call$t$tree, k=3, order_clusters_as_data = FALSE)
idx <- order(as.numeric(names(clust.cutree)))
clust.cutree <- clust.cutree[idx]
( tbl <- table(clust.hcpc, clust.cutree) )
###########
          clust.cutree
clust.hcpc  1  2  3
         1 50  0  0
         2  0  0 62
         3  0 36  2
This table shows that HCPC cluster labels 2 and 3 correspond to group labels 3 and 2, respectively. (Surprisingly, the rule does not hold for two observations: they are in HCPC cluster 3 but fall in group 3.)
The group labels that need to be passed to dendextend::color_branches can be found as follows:
( lbls <- apply(tbl,2,which.max) )
##############
1 2 3
1 3 2
Here is the dendrogram:
model$call$t$tree %>%
color_branches(k=3, groupLabels =lbls) %>%
set("labels_cex", .5) %>%
plot(horiz=T)
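The reason the two numberings can disagree can be sketched without FactoMineR or dendextend. Base cutree numbers clusters by their first appearance in the data, while, as I understand it, order_clusters_as_data = FALSE numbers them from left to right along the dendrogram. The code below rebuilds a left-to-right numbering by hand from hc$order and tabulates it against the data-order numbering. It is an illustration on a plain hclust tree, not dendextend's implementation, and by_data, by_dend and left_to_right are names chosen here.

```r
# A plain hierarchical clustering of the iris measurements
hc <- hclust(dist(scale(iris[, 1:4])), method = "ward.D2")

# Numbering 1: by first appearance in the data (stats::cutree)
by_data <- cutree(hc, k = 3)

# Numbering 2: by first appearance along the dendrogram, left to right
left_to_right <- unique(by_data[hc$order])
by_dend <- match(by_data, left_to_right)

# Each row and column has exactly one non-zero cell: a pure relabelling
(tbl <- table(by_data, by_dend))

# Data-order label of each branch group, read from left to right
left_to_right
```

A cross-table like this is the same device the answer uses to map HCPC labels onto the groups drawn by color_branches.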

Survival Curve in R with survfit

I want to plot a survival curve using the following data. The data file is called A.txt and the object is called A:
A <- read.table(text = "
Time Status Group
8 1 A
8 1 A
8 1 A
9 1 A
9 1 A
9 1 A
15 0 A
15 0 A
7 1 B
7 1 B
8 1 B
9 1 B
10 1 B
10 1 B
15 0 B
15 0 B", header = TRUE)
I tried to plot a survival curve using this code:
title(main="Trial for Survival Curve")
fit <- survfit(Surv(Time, Status) ~ Group, data = A)
par(col.lab="red")
legend(10, .9, c("A", "B"), pch=c(2,3),col=2:3)
plot(fit, lty=2:3, col=2:3,lwd=5:5, xlab='Time(Days)',
ylab='% Survival',mark.time=TRUE,mark=2:3)
I would like to put marks (a triangle for A and a "+" for B) at every time the survival percentage decreases, for instance at Day 7 and Day 8. I want these marks throughout the graph, but they are added only at the end of the experiment.
First, I'd recommend rearranging the plotting calls:
par(col.lab="red")
plot(fit, lty=2:3, col=2:3,lwd=5:5, xlab='Time(Days)',
ylab='% Survival',mark.time=TRUE,mark=2:3)
title(main="Trial for Survival Curve")
legend(10, .9, c("A", "B"), pch=c(2,3),col=2:3)
You can add points to the survival plot with the points function. However, it looks like there's a small bug, which you can get around fairly easily:
firsty <- 1 ## Gets around bug
points(fit[1], col = 2, pch = 2) # Plots first group in fit
points(fit[2], col = 3, pch = 3) # Plots second group in fit
The points are plotted at the bottom of the "cliff" in the survival plot.
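An alternative sketch that does not rely on the points method for survfit objects: summary(fit) returns the event times and the survival estimate after each drop, so the marks can be drawn with plain points(). The names s and grp are introduced here; the plotting symbols follow the question (triangle for A, plus for B).

```r
library(survival)

# Same data as in the question, typed in directly
A <- data.frame(
  Time   = c(8, 8, 8, 9, 9, 9, 15, 15, 7, 7, 8, 9, 10, 10, 15, 15),
  Status = c(1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0),
  Group  = rep(c("A", "B"), each = 8)
)

fit <- survfit(Surv(Time, Status) ~ Group, data = A)

plot(fit, lty = 2:3, col = 2:3, lwd = 5,
     xlab = "Time (Days)", ylab = "% Survival")
title(main = "Trial for Survival Curve")

# summary() lists only the times at which an event occurred
s <- summary(fit)
grp <- as.integer(s$strata)   # 1 = Group A, 2 = Group B

# Mark the bottom of each drop, one symbol and colour per group
points(s$time, s$surv, col = c(2, 3)[grp], pch = c(2, 3)[grp])
legend(10, 0.9, c("A", "B"), pch = c(2, 3), col = 2:3)
```

Censored observations do not appear in summary(fit) by default, so no marks are drawn at Day 15.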
