R: converting grob objects to ggplot/plotly [duplicate]

I am working with the R programming language. I am trying to convert a "grob" object into a "ggplot" object (the goal is eventually to convert the ggplot object into a "plotly" object).
I am looking for the simplest way to convert a "grob" to a "ggplot". The computer I am using has no USB port or internet connection; it only has R with some preloaded libraries (e.g. ggplot2, ggpubr).
In my example, I generated some data, fit a statistical model (random forest), and plotted the results on "compressed" axes (t-SNE). The code below can be copy/pasted into R, and the resulting plot ("final_plot") is the object that I want to convert to ggplot:
library(cluster)
library(Rtsne)
library(dplyr)
library(randomForest)
library(caret)
library(ggplot2)
library(plotly)
#PART 1: Create Data
#generate 4 random variables: response_variable ~ var_1, var_2, var_3
var_1 <- rnorm(10000, 1, 4)
var_2 <- rnorm(10000, 10, 5)
var_3 <- sample(LETTERS[1:4], 10000, replace = TRUE, prob = c(0.1, 0.2, 0.65, 0.05))
response_variable <- sample(LETTERS[1:2], 10000, replace = TRUE, prob = c(0.4, 0.6))
#put them into a data frame called "f"
f <- data.frame(var_1, var_2, var_3, response_variable)
#declare var_3 and response_variable as factors
f$response_variable = as.factor(f$response_variable)
f$var_3 = as.factor(f$var_3)
#create id
f$ID <- seq_along(f[, 1])
#PART 2: random forest
#split data into train set and test set
index = createDataPartition(f$response_variable, p=0.7, list = FALSE)
train = f[index,]
test = f[-index,]
#create random forest statistical model
rf = randomForest(response_variable ~ var_1 + var_2 + var_3, data=train, ntree=20, mtry=2)
#have the model predict the test set
pred = predict(rf, test, type = "prob")
#pred[, 2] is the predicted probability of class "B", so the label should be "B" when it exceeds 0.5
labels = as.factor(ifelse(pred[, 2] > 0.5, "B", "A"))
confusionMatrix(labels, test$response_variable)
#PART 3: Visualize in 2D (source: https://dpmartin42.github.io/posts/r/cluster-mixed-types)
gower_dist <- daisy(test[, -c(4, 5)], metric = "gower")
gower_mat <- as.matrix(gower_dist)
labels = data.frame(labels)
labels$ID = test$ID
tsne_obj <- Rtsne(gower_dist, is_distance = TRUE)
tsne_data <- tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(cluster = factor(labels$labels),
         name = labels$ID)
plot = ggplot(aes(x = X, y = Y), data = tsne_data) +
  geom_point(aes(color = labels$labels))
plotly_plot = ggplotly(plot)
a = tsne_obj$Y
a = data.frame(a)
data = a
data$class = labels$labels
decisionplot <- function(model, data, class = NULL, predict_type = "class",
                         resolution = 100, showgrid = TRUE, ...) {
  if(!is.null(class)) cl <- data[, class] else cl <- 1
  data <- data[, 1:2]
  k <- length(unique(cl))
  plot(data, col = as.integer(cl) + 1L, pch = as.integer(cl) + 1L, ...)
  # make grid
  r <- sapply(data, range, na.rm = TRUE)
  xs <- seq(r[1, 1], r[2, 1], length.out = resolution)
  ys <- seq(r[1, 2], r[2, 2], length.out = resolution)
  g <- cbind(rep(xs, each = resolution), rep(ys, times = resolution))
  colnames(g) <- colnames(r)
  g <- as.data.frame(g)
  ### guess how to get class labels from predict
  ### (unfortunately not very consistent between models)
  p <- predict(model, g, type = predict_type)
  if(is.list(p)) p <- p$class
  p <- as.factor(p)
  if(showgrid) points(g, col = as.integer(p) + 1L, pch = ".")
  z <- matrix(as.integer(p), nrow = resolution, byrow = TRUE)
  contour(xs, ys, z, add = TRUE, drawlabels = FALSE,
          lwd = 2, levels = (1:(k - 1)) + .5)
  invisible(z)
}
model <- randomForest(class ~ ., data = data, mtry = 2, ntree = 500)
#this is the final plot (note: decisionplot() draws with base graphics and returns the grid matrix invisibly, not a grob)
final_plot = decisionplot(model, data, class = "class", main = "rf (1)")
From here, I am trying to convert this object ("final_plot") into a ggplot object:
library(ggpubr)
final = ggpubr::as_ggplot(final_plot)
But this gives me the following error:
Error in gList(...) : only 'grobs' allowed in "gList"
From there, I would eventually use this command to convert the ggplot into a plotly object:
plotly_plot = ggplotly(final)
Does anyone know if there is a straightforward way to convert "final_plot" into a ggplot object (and then into plotly)? I don't have the ggplotify library.
Thanks
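A hedged sketch of one possible workaround (not from the original thread): the gList error occurs because decisionplot() draws with base graphics and only returns the prediction grid invisibly as a plain matrix, so final_plot is not a grob at all and as_ggplot() has nothing to wrap. Assuming the model and data objects defined above (with t-SNE columns X1 and X2), one can skip the conversion and redraw the same decision regions directly in ggplot2, which ggplotly() accepts natively; the resolution value and the geom choices here are my assumptions:
library(ggplot2)
library(plotly)
# rebuild the prediction grid that decisionplot() computes internally
resolution <- 100
r  <- sapply(data[, 1:2], range, na.rm = TRUE)
xs <- seq(r[1, 1], r[2, 1], length.out = resolution)
ys <- seq(r[1, 2], r[2, 2], length.out = resolution)
grid <- expand.grid(X1 = xs, X2 = ys)   # X1/X2 match the t-SNE columns above
grid$pred <- predict(model, grid)       # predicted class at each grid point
# decision regions as semi-transparent tiles, observations as points
gg <- ggplot() +
  geom_tile(data = grid, aes(x = X1, y = X2, fill = pred), alpha = 0.3) +
  geom_point(data = data, aes(x = X1, y = X2, color = class)) +
  theme_minimal()
plotly_plot <- ggplotly(gg)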

Related

Unable to plot my QDA classification method

I have created a few types of classification, but when performing QDA I am unable to plot my predictions. Does anyone know how to plot this?
set.seed(20220719)
#splitting training and testing data
ii = createDataPartition(classification[,3], p = .75, list = F)
training = classification[ii, ] #predictors for training
testing = classification[-ii, ] #predictors for testing
#fitting the model
qda_mod = qda(Group ~., training)
#predicting testing data
p2 = predict(qda_mod, testing)$class
tab1 = table(Predicted = p2, Actual = testing$Group)
tab1
data snippet:
classification
X1 X2 Group
1 -1.007927175 0.399027128 0
2 -0.472479667 0.839121791 1
3 0.745229326 -1.279741933 1
4 -0.597907906 -1.942435976 1
5 0.186984091 -1.541910328 1
6 -0.395736986 -0.120650487 1
7 -0.155861012 1.193432933 0
8 0.382043985 -1.700433181 1
9 0.684346226 -0.890674936 1
10 0.453268993 0.674205724 1
Looking for an output similar to this:
The predicted groups are colored by their actual classification in red and green; these colors come from your predict call. A mix of red and green within a group indicates incorrect classification predictions. You can use the following code:
library(caret)
library(MASS)
set.seed(20220719)
#splitting training and testing data
ii = createDataPartition(classification[,3], p = .75, list = F)
training = classification[ii, ] #predictors for training
testing = classification[-ii, ] #predictors for testing
#fitting the model
qda_mod = qda(Group ~., training)
#predicting testing data
p2 = predict(qda_mod, testing)
#tab1 = table(Predicted = p2, Actual = testing$Group)
#tab1
# plot
par(mfrow=c(1,1))
plot(p2$posterior[,2], p2$class, col=testing$Group+10)
Output:
For extra info check this link.
@Oliver made some nice functions in this post; maybe this helps:
decisionplot <- function(model, data, class = NULL, predict_type = "class",
                         resolution = 100, showgrid = TRUE, ...) {
  if(!is.null(class)) cl <- data[, class] else cl <- 1
  data <- data[, 1:2]
  k <- length(unique(cl))
  plot(data, col = as.integer(cl) + 1L, pch = as.integer(cl) + 1L, ...)
  # make grid
  r <- sapply(data, range, na.rm = TRUE)
  xs <- seq(r[1, 1], r[2, 1], length.out = resolution)
  ys <- seq(r[1, 2], r[2, 2], length.out = resolution)
  g <- cbind(rep(xs, each = resolution), rep(ys, times = resolution))
  colnames(g) <- colnames(r)
  g <- as.data.frame(g)
  ### guess how to get class labels from predict
  ### (unfortunately not very consistent between models)
  p <- predict(model, g, type = predict_type)
  if(is.list(p)) p <- p$class
  p <- as.factor(p)
  if(showgrid) points(g, col = as.integer(p) + 1L, pch = ".")
  z <- matrix(as.integer(p), nrow = resolution, byrow = TRUE)
  contour(xs, ys, z, add = TRUE, drawlabels = FALSE,
          lwd = 2, levels = (1:(k - 1)) + .5)
  invisible(z)
}
decisionplot(qda_mod, training, class = "Group")
Output:

Shapley values for the three clusters by cluster number KMeans algorithm

I am trying to replicate this article: https://cast42.github.io/blog/datascience/python/clustering/altair/shap/2020/04/23/explain-clusters-to-business.html#Kmeans-clustering, but using R rather than Python.
What I haven't managed to reproduce is the "Shapley values for the three clusters" part:
for cnr in df_km['cluster'].unique():
    shap.summary_plot(shap_values[cnr], X, max_display=30, show=False)
    plt.title(f'Cluster {cnr}')
    plt.show()
This is what I have gotten so far; note that I want to output the graph according to the label variable of the classification model.
Thanks!
# Package names
packages <- c("splitstackshape", "shapr", "Matrix", "xgboost", "SHAPforxgboost")
# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages])
}
# load what the code below uses (fviz_cluster() additionally needs factoextra)
library(xgboost)
library(SHAPforxgboost)
library(factoextra)
winequality <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", sep = ";")
#KMeans classifier attribute evaluation
winequality_escale <- scale(winequality)
set.seed(123)
km.res_3 <- kmeans(winequality_escale, 3, nstart = 25)
km.res_3$size
km.res_3$centers
aggregate(winequality, by = list(cluster = km.res_3$cluster), mean)
k3 <- fviz_cluster(km.res_3, data = winequality_escale,
                   palette = c("#2E9FDF", "#00AFBB", "#E7B800"),
                   ellipse.type = "euclid", star.plot = TRUE, repel = TRUE,
                   ggtheme = theme_minimal()) + ggtitle("k = 3")
winequality <- as.matrix(winequality)
model <- xgboost(
  data = winequality,
  label = km.res_3$cluster,
  nrounds = 20,
  verbose = FALSE)
shap_values <- shap.values(xgb_model = model, X_train = winequality)
shap_values$mean_shap_score
shap_scores <- shap_values$shap_score
# shap.prep() returns the long-format SHAP data, either from the model...
shap_long <- shap.prep(xgb_model = model, X_train = winequality)
# ...or from a given shap_contrib matrix; the two calls are equivalent
shap_long <- shap.prep(shap_contrib = shap_scores, X_train = winequality)
# SHAP summary plot
shap.plot.summary(shap_long)
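A hedged sketch of one way to approximate the per-cluster loop from the article (this subsetting approach is my assumption, not something SHAPforxgboost provides directly): filter the SHAP score matrix to the rows belonging to each k-means cluster and draw one summary plot per cluster, mirroring the Python for-loop above:
library(ggplot2)
for (cnr in sort(unique(km.res_3$cluster))) {
  rows <- which(km.res_3$cluster == cnr)
  # long-format SHAP data restricted to the observations in this cluster
  shap_long_cnr <- shap.prep(shap_contrib = shap_scores[rows, ],
                             X_train = winequality[rows, ])
  print(shap.plot.summary(shap_long_cnr) + ggtitle(paste("Cluster", cnr)))
}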

R: looping and visualizing "run times" in R

I am using the R programming language. I want to learn how to measure and plot the run time of different procedures as the size of the data increases.
I found a previous stackoverflow post that answers a similar question: Plot the run time of three functions
It seems that the "microbenchmark" library in R should be able to accomplish this task.
Suppose I simulate the following data:
#load libraries
library(microbenchmark)
library(dplyr)
library(ggplot2)
library(Rtsne)
library(cluster)
library(dbscan)
library(plotly)
#simulate data
var_1 <- rnorm(1000, 1, 4)
var_2 <- rnorm(1000, 10, 5)
var_3 <- sample(LETTERS[1:4], 1000, replace = TRUE, prob = c(0.1, 0.2, 0.65, 0.05))
var_4 <- sample(LETTERS[1:2], 1000, replace = TRUE, prob = c(0.4, 0.6))
#put them into a data frame called "f"
f <- data.frame(var_1, var_2, var_3, var_4)
#declare var_3 and var_4 as factors
f$var_3 = as.factor(f$var_3)
f$var_4 = as.factor(f$var_4)
#add id
f$ID <- seq_along(f[, 1])
Now, I want to measure the run time of 7 different procedures:
#Procedure 1:
gower_dist <- daisy(f[, -5], metric = "gower")
gower_mat <- as.matrix(gower_dist)
#Procedure 2
lof <- lof(gower_dist, k=3)
#Procedure 3
lof <- lof(gower_dist, k=5)
#Procedure 4
tsne_obj <- Rtsne(gower_dist, is_distance = TRUE)
tsne_data <- tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(name = f$ID)
#Procedure 5
tsne_obj <- Rtsne(gower_dist, perplexity = 10, is_distance = TRUE)
tsne_data <- tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(name = f$ID)
#Procedure 6
plot = ggplot(aes(x = X, y = Y), data = tsne_data) + geom_point()
#Procedure 7
tsne_obj <- Rtsne(gower_dist, is_distance = TRUE)
tsne_data <- tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(name = f$ID,
         lof = lof,
         var1 = f$var_1,
         var2 = f$var_2,
         var3 = f$var_3)
p1 <- ggplot(aes(x = X, y = Y, size = lof, key = name, var1 = var1,
                 var2 = var2, var3 = var3), data = tsne_data) +
  geom_point(shape = 1, col = "red") +
  theme_minimal()
ggplotly(p1, tooltip = c("lof", "name", "var1", "var2", "var3"))
Using the "microbenchmark" library, I can find out the time of individual functions:
procedure_1_part_1 <- microbenchmark(daisy(f[, -5], metric = "gower"))
procedure_1_part_2 <- microbenchmark(as.matrix(gower_dist))
Here is where I get stuck:
I want to make a graph of the run times like this:
https://umap-learn.readthedocs.io/en/latest/benchmarking.html
Can someone please show me how to make this graph, and how to use the microbenchmark statement for multiple functions at once, for different sizes of the data frame "f" (e.g. its first 5, 10, 50, 100, 200, 500, 1000 rows)?
microbenchmark(cbind(gower_dist <- daisy(f[1:5, -5], metric = "gower"), gower_mat <- as.matrix(gower_dist)))
microbenchmark(cbind(gower_dist <- daisy(f[1:10, -5], metric = "gower"), gower_mat <- as.matrix(gower_dist)))
microbenchmark(cbind(gower_dist <- daisy(f[1:50, -5], metric = "gower"), gower_mat <- as.matrix(gower_dist)))
etc
I could manually run each of these, copy the results into Excel, and plot them, but this would also take a long time. Is there a quicker way to make such a graph?
Thanks
Create a function that does all the steps of the analysis and pass that into microbenchmark. In pseudocode, something along the lines of
runAnalysis <- function(x, size) {
  x <- x[1:size, ]
  # further steps of the analysis
}
xy <- microbenchmark(
  subset_5 = runAnalysis(x = f, size = 5),
  subset_50 = runAnalysis(x = f, size = 50),
  times = 1
)
The run times (in nanoseconds) are in xy$time and the names of the runs in xy$expr, which can be used to create the graph you want.
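A hedged extension of that answer (runAnalysis is the placeholder from the pseudocode above; times = 5 and the use of the median are my assumptions): benchmark every subset size in a loop, collect the results into one data frame, and plot run time against size, similar in spirit to the umap-learn benchmarking page:
library(microbenchmark)
library(ggplot2)
sizes <- c(5, 10, 50, 100, 200, 500, 1000)
results <- lapply(sizes, function(s) {
  bm <- microbenchmark(runAnalysis(x = f, size = s), times = 5)
  # microbenchmark records times in nanoseconds; convert to seconds
  data.frame(size = s, seconds = median(bm$time) / 1e9)
})
results <- do.call(rbind, results)
ggplot(results, aes(x = size, y = seconds)) +
  geom_line() +
  geom_point() +
  labs(x = "number of rows", y = "median run time (seconds)")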

How to locate individual samples that have been misclassified using kNN, in R?

Using the iris dataset in R, I am looking at classification using kNN. I am interested in finding the observations in the test set that have been misclassified. I was able to produce scatter plots which give a visual of the misclassified observations, but how can I locate and list all of them? I have included the code I used to get the scatter plots below, which is from https://rpubs.com/Tonnia/irisknn
set.seed(12345)
allrows <- 1:nrow(iris)
trainrows <- sample(allrows, replace = F, size = 0.8*length(allrows))
train_iris <- iris[trainrows, 1:4]
train_label <- iris[trainrows, 5]
table(train_label)
test_iris <- iris[-trainrows, 1:4]
test_label <- iris[-trainrows, 5]
table(test_label)
library(class)
library(dplyr)    # for the %>% pipe used below
library(ggplot2)
#replicate(0, 30) returns an empty vector, so use numeric(30) to hold the 30 error rates
error.train <- numeric(30)
for(k in 1:30) {
  pred_iris <- knn(train = train_iris, test = train_iris, cl = train_label, k)
  error.train[k] <- 1 - mean(pred_iris == train_label)
}
error.test <- numeric(30)
for(k in 1:30) {
  pred_iris <- knn(train = train_iris, test = test_iris, cl = train_label, k)
  error.test[k] <- 1 - mean(pred_iris == test_label)
}
plot(error.train, type="o", ylim=c(0,0.15), col="blue", xlab = "K values", ylab = "Misclassification errors")
lines(error.test, type = "o", col="red")
legend("topright", legend=c("Training error","Test error"), col = c("blue","red"), lty=1:1)
pred_iris <- knn(train = train_iris, test = test_iris, cl = train_label, 6)
result <- cbind(test_iris, pred_iris)
combinetest <- cbind(test_iris, test_label)
result %>%
  ggplot(aes(x = Petal.Width, y = Petal.Length, color = pred_iris)) +
  geom_point(size = 3)
combinetest %>%
  ggplot(aes(x = Petal.Width, y = Petal.Length, color = test_label)) +
  geom_point(size = 3)
In your code, pred_iris holds the predictions of the currently trained model.
Once you have the combinetest data, around the end of your code, you could do something like:
combinetest[test_label != pred_iris, ]
to get the observations whose prediction differs from their label.
Alternatively, with a more readable tidyverse syntax:
library(tidyverse)
combinetest %>%
filter(test_label != pred_iris)
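A small hedged addition (not part of the original answer): if you also want the positions of the misclassified observations, or their original row names in iris, which() gives you both:
# positions of the misclassified observations within the test set
misclassified <- which(test_label != pred_iris)
misclassified
# their original row names in the iris data frame
rownames(test_iris)[misclassified]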

PCA vs t-SNE for data-visualisation: is there a way of doing loading plots with t-SNE?

In the past I have been used to visualising data with PCA and loading plots, using stats::prcomp and ggbiplot, like this:
I've recently been introduced to t-SNE (late to the game here), which has been revolutionary for dimension-reduction analysis and exploring patterns in data.
While t-SNE picks up clusters in my data that PCA can't clearly distinguish, the downside is that there is no loading plot to understand which characteristics influence each cluster in space.
Is there any way with t-SNE I can form some sort of loading plot? Or are there any complementary methods associated with t-SNE that can be used? I'd really like to keep the t-SNE plot over the PCA plot for visualizing a particular dataset, but without the loading plot it's not as informative.
Any ideas? Dummy data below:
library(FD)
library(ggbiplot)
library(cluster)  # daisy()
library(Rtsne)
library(dplyr)    # %>%
library(ggplot2)
# generate random data
a <- sample(x = 1:5, size = 100, replace = TRUE)
b <- sample(x = 1:5, size = 100, replace = TRUE)
c <- sample(x = 1:5, size = 100, replace = TRUE)
d <- sample(x = 1:5, size = 100, replace = TRUE)
e <- sample(x = 1:5, size = 100, replace = TRUE)
f <- sample(x = 1:5, size = 100, replace = TRUE)
g <- sample(x = 1:5, size = 100, replace = TRUE)
# make ordinal
a <- ordered(a, levels = 1:5)
b <- ordered(b, levels = 1:5)
c <- ordered(c, levels = 1:5)
d <- ordered(d, levels = 1:5)
e <- ordered(e, levels = 1:5)
f <- ordered(f, levels = 1:5)
g <- ordered(g, levels = 1:5)
#as data frame (note: cbind() on ordered factors stores their integer codes)
table <- as.data.frame(cbind(a, b, c, d, e, f, g))
#PCA
pca <- prcomp(table, center = TRUE, scale. = TRUE)
ggbiplot(pca)
#t-SNE
gower_dist <- daisy(table, metric = "gower")
tsne_obj <- Rtsne(gower_dist, is_distance = TRUE)
tsne_data <- tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y"))
ggplot(aes(x = X, y = Y), data = tsne_data) +
  geom_point()
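A hedged sketch of one complementary heuristic (an approximation I am suggesting, not an established t-SNE feature — t-SNE is non-linear, so true loadings do not exist): correlate each original variable with the two t-SNE coordinates and draw the correlation vectors over the embedding as a pseudo-loading plot. The arrow scaling factor is arbitrary:
library(grid)  # arrow() and unit()
# `table` already holds plain integer codes (cbind() coerced the ordered factors)
loadings <- cor(table, tsne_data[, c("X", "Y")], method = "spearman")
loadings <- data.frame(var = rownames(loadings),
                       X = loadings[, "X"], Y = loadings[, "Y"])
scale_factor <- 5  # arbitrary, purely for arrow visibility
ggplot(tsne_data, aes(x = X, y = Y)) +
  geom_point(alpha = 0.4) +
  geom_segment(data = loadings,
               aes(x = 0, y = 0, xend = X * scale_factor, yend = Y * scale_factor),
               arrow = arrow(length = unit(0.2, "cm")), colour = "red") +
  geom_text(data = loadings,
            aes(x = X * scale_factor * 1.1, y = Y * scale_factor * 1.1, label = var),
            colour = "red")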
