How can categorical data be clustered in 3 dimensions in R?

I have data obtained from a survey and I would like to analyze it, build clusters, and display them in 3D, since that makes the information easier to grasp.
The data consist of many question columns that respondents answer on an ordinal scale: Agree (1), Somewhat agree (0.8), Neutral (0.6), Somewhat disagree (0.4), Disagree (0.2), plus one numerical rating question, which means the variables are essentially categorical (ordinal).
A sample of the dataset is shown below:
q1,q2,q3,q4,q5,q6,q7
1,0.8,0.6,0.2,0.2,0.4,10
0.2,1,0.4,0.4,0.4,0.4,9
0.6,1,0.2,0.4,0.2,0.2,6
I am trying to write some code in R based on the following reference: https://plotly.com/r/t-sne-and-umap-projections/
And the code I've tried to run is the following:
library(cluster)
library(dplyr)   # needed for %>%, mutate(), group_by(), do() below

# `data` is the survey data frame shown above
gower_df <- daisy(data,
                  metric = "gower",
                  type = list(logratio = 2))

# Average silhouette width for PAM with k = 2..10 clusters
silhouette <- c(NA)
for (i in 2:10) {
  pam_clusters <- pam(as.matrix(gower_df),
                      diss = TRUE,
                      k = i)
  silhouette <- c(silhouette, pam_clusters$silinfo$avg.width)
}
plot(1:10, silhouette,
     xlab = "Clusters",
     ylab = "Silhouette Width")
lines(1:10, silhouette)

# Final PAM clustering and per-cluster summaries
pam_ <- pam(gower_df, diss = TRUE, k = 2)
data[pam_$medoids, ]
pam_summary <- data %>%
  mutate(cluster = pam_$clustering) %>%
  group_by(cluster) %>%
  do(cluster_summary = summary(.))
pam_summary$cluster_summary[[1]]

library(Rtsne)
library(ggplot2)

# 2D t-SNE embedding of the Gower dissimilarities
tsne_object <- Rtsne(gower_df, is_distance = TRUE)
tsne_df <- tsne_object$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(cluster = factor(pam_$clustering))
ggplot(tsne_df, aes(x = X, y = Y)) +
  geom_point(aes(color = cluster))

library(plotly)
library(umap)
fig2 <- plot_ly(tsne_df)
fig2
But this only gives me a 2D representation. Any idea how I can make it 3D?
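Not from the linked tutorial, but a minimal sketch of one way to get a 3D view, assuming the gower_df and pam_ objects created above: Rtsne() accepts dims = 3, and plot_ly() can draw an interactive 3D scatter with type = "scatter3d".
library(Rtsne)
library(plotly)
library(dplyr)

# 3-dimensional t-SNE embedding of the Gower dissimilarities
set.seed(42)
tsne_3d <- Rtsne(gower_df, is_distance = TRUE, dims = 3)
tsne_df_3d <- tsne_3d$Y %>%
  data.frame() %>%
  setNames(c("X", "Y", "Z")) %>%
  mutate(cluster = factor(pam_$clustering))

# Interactive 3D scatter coloured by PAM cluster
plot_ly(tsne_df_3d,
        x = ~X, y = ~Y, z = ~Z,
        color = ~cluster,
        type = "scatter3d",
        mode = "markers")
The umap package exposes a similar n_components setting (see umap.defaults), so the same plot_ly() call should also work on a 3-component UMAP layout.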

Related

How can I select one plot from the partial dependence plot in R?

Here is the code for the partial dependence plot, using example data. First I fit a random forest model, then I made the partial dependence plot.
rm(list = ls())
library(tidyverse)
library(mlbench)
library(randomForest)
library(caret)
library(edarf)

data("Sonar")
df <- Sonar
rm(Sonar)

# Clean up variable names (because I'm a bit OCD)
df <- df %>% rename(V01 = V1, V02 = V2, V03 = V3, V04 = V4,
                    V05 = V5, V06 = V6, V07 = V7, V08 = V8,
                    V09 = V9)

# Get minimum class frequency
min <- min(table(df$Class))

set.seed(223)
df_rf <- df %>% na.omit()
fit_rf <- randomForest(data = df_rf,
                       Class ~ .,
                       ntree = 500,
                       importance = TRUE,
                       sampsize = c(min, min))

# Add predicted values to data frame
df_rf <- df_rf %>%
  mutate(predicted = predict(fit_rf))

# Get performance measures
confusionMatrix(df_rf$predicted, df_rf$Class, positive = "R")

# Get variable importance measures
imp_df <- data.frame(importance(fit_rf, scale = FALSE, type = 1))

# Tidy up and sort the data frame
imp_df <- imp_df %>%
  mutate(names = rownames(imp_df)) %>%
  arrange(desc(MeanDecreaseAccuracy))

# Save top predictor names as character vector
nm <- as.character(imp_df$names)[1:10]

# Get partial dependence values for top predictors
pd_df <- partial_dependence(fit = fit_rf,
                            vars = nm,
                            data = df_rf,
                            n = c(100, 200))

# Plot partial dependence using edarf
plot_pd(pd_df)
Then I got the result: a single figure that combines the partial dependence plots for all the top predictors. However, I need to select just one of these panels. Is there any way to do that?
You could use your pd_df data frame directly: first pivot it to long format over the class columns M and R, then plot whichever variable you want in ggplot. Here is an example with V11:
library(ggplot2)
library(tidyr)
library(dplyr)
pd_df %>%
  pivot_longer(cols = c(M, R)) %>%
  ggplot(aes(x = V11, y = value, color = name)) +
  geom_line() +
  geom_point() +
  labs(x = "value", y = "prediction")
Created on 2023-01-09 with reprex v2.0.2
You can replace V11 with whichever other variable you want.

How to specify groups with colors in qqplot()?

I have created a qqplot (with quantiles of a beta distribution) from a dataset containing two groups. To visualize which points belong to which group, I would like to color them. I have tried the following:
res <- beta.mle(data$values)  # estimate parameters of beta distribution
qqplot(qbeta(ppoints(500), res$param[1], res$param[2]), data$values,
       col = data$group,
       ylab = "Quantiles of data",
       xlab = "Quantiles of Beta Distribution")
The result is shown here:
I have seen solutions that specify a "col" vector for qqnorm; however, this does not seem to work with qqplot, as simply half the points are colored in one color and half in the other, regardless of group. Is there a way to fix this?
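Not taken from the answers below, but a hedged sketch of why the base-R colours get scrambled, assuming data$values and data$group as in the question: qqplot() draws the sample in sorted order, so the colour vector has to be reordered the same way (the second answer below uses the same g[order(x)] idea, just in ggplot).
# qqplot() plots the sample values in ascending order, so reorder
# the group colours to match the sorted values
res <- Rfast::beta.mle(data$values)
ord <- order(data$values)
qqplot(qbeta(ppoints(length(data$values)), res$param[1], res$param[2]),
       data$values,
       col = as.integer(factor(data$group))[ord],
       ylab = "Quantiles of data",
       xlab = "Quantiles of Beta Distribution")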
I simulated some data just to show how to add color in ggplot.
Libraries
library(tidyverse)
# install.packages("Rfast")
Data
# Simulating data from a beta distribution
x <- rbeta(n = 1000, shape1 = .5, shape2 = .5)
# Estimating parameters
res <- Rfast::beta.mle(x)

data <-
  tibble(
    simulated_data = sort(x),
    quantile_data = qbeta(ppoints(length(x)), res$param[1], res$param[2])
  ) %>%
  # Creating a group variable using quartiles
  mutate(group = cut(x = simulated_data,
                     quantile(simulated_data, seq(0, 1, .25)),
                     include.lowest = TRUE))
Code
data %>%
  # Adding group variable as color
  ggplot(aes(x = quantile_data, y = simulated_data, col = group)) +
  geom_point()
Output
For those who are wondering how to work with pre-defined groups, this is the code that worked for me:
library(tidyverse)
library(Rfast)

res <- beta.mle(x)

# Make sure the groups are not numerical
# (else the color scale might turn out continuous)
g <- plyr::mapvalues(g, c("1", "2"), c("Group1", "Group2"))

data <-
  tibble(
    my_data = sort(x),
    quantile_data = qbeta(ppoints(length(x)), res$param[1], res$param[2]),
    group = g[order(x)]
  )

data %>%
  # Adding group variable as color
  ggplot(aes(x = quantile_data, y = my_data, col = group)) +
  geom_point()
result

How do you compare similarities between variables in an R data frame, based on two categorical variables and one numeric variable?

I have a data frame with three variables of interest: LGA (location), Offence Category, and Total (numeric).
What I am hoping to do is compare the distance/similarity between each LGA, based on the Total value, in order to create a heat map or similar structure. Is this possible? And if so, what would the process be?
Here is a snippet of the data frame:
I don't really understand your question, but here is an example of a heatmap and a clustered heatmap for 'similar' data:
# Load libraries
library(tidyverse)
library(readxl)
library(httr)

# Find some data
url1 <- "https://www.bocsar.nsw.gov.au/Documents/lga/NewSouthWales.xlsx"

# Get the data and remove missing data points (NA's)
GET(url1, write_disk(tf <- tempfile(fileext = ".xlsx")))
df <- read_excel(path = tf, 2L, skip = 5) %>%
  na.omit()

df2 <- df %>%
  # Format the data to "long format" for plotting
  pivot_longer(cols = -c(`Premises type`)) %>%
  # Change "Premises type" and "name" to factors
  mutate(`Premises type` = factor(
    `Premises type`, levels = unique(`Premises type`))
  ) %>%
  mutate(name = factor(
    name, levels = unique(name))
  ) %>%
  # Remove the "Total" counts
  filter(`Premises type` != "Total")

# Define colours for text (white for dark fill, black for light fill)
hcl <- farver::decode_colour(viridisLite::inferno(length(df2$value)), "rgb", "hcl")
label_col <- ifelse(hcl[, "l"] > 50, "black", "white")

# Plot the data (log scale for fill)
ggplot(df2, aes(y = fct_rev(`Premises type`),
                x = name, fill = log(value))) +
  geom_tile() +
  geom_text(aes(label = value, color = factor(value)),
            show.legend = FALSE, size = 2.5) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1.05),
        axis.title = element_blank()) +
  scale_color_manual(values = label_col) +
  scale_fill_viridis_c(option = "inferno", na.value = "black")
And a clustered heatmap (similar Premises Type / Crime types cluster together):
# Load the raw data and format for pheatmap (expects a matrix)
dm <- read_excel(path = tf, 2L, skip = 5) %>%
  na.omit() %>%
  column_to_rownames(var = "Premises type")
# Plot the data
pheatmap::pheatmap(as.matrix(dm), scale = "row")
Edit
I haven't used it before, so I don't know if the output is correct, but based on this SO post you can use cluster::daisy() to get the Gower dissimilarity for "Premises type" and then plot it using pheatmap, e.g.
library(cluster)
pheatmap::pheatmap(as.matrix(daisy(dm)))
Edit 2
You only need two variables for this heatmap (i.e. "Local government Area" (Character) and "Total" (Numeric) should be fine):
# Load libraries
library(tidyverse)
library(readxl)
library(httr)
library(cluster)
library(pheatmap)

# Find some data
url1 <- "https://www.bocsar.nsw.gov.au/Documents/lga/NewSouthWales.xlsx"

# Get the data and remove missing data points (NA's)
GET(url1, write_disk(tf <- tempfile(fileext = ".xlsx")))
df <- read_excel(path = tf, 2L, skip = 5) %>%
  na.omit()

# Select two variables, then set the Premises type as the rownames
df3 <- df %>%
  select(`Premises type`, Robbery) %>%
  column_to_rownames(var = "Premises type")
# (in your case, use "column_to_rownames(`Local government Area`)")

# Then plot the heatmap
pheatmap(daisy(as.matrix(df3)),
         labels_row = rownames(df3),
         labels_col = rownames(df3))
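Not part of the answer above, but a hedged sketch of how the same approach might be adapted to the three columns described in the question (the data frame and column names used here, crime_df, LGA, Offence Category and Total, are assumptions, since only a description of the data is given): pivot the offence categories into columns so each row is one LGA, then plot the Gower dissimilarities between LGAs.
library(tidyverse)
library(cluster)
library(pheatmap)

# Assumed columns: LGA (character), `Offence Category` (character), Total (numeric)
lga_wide <- crime_df %>%
  pivot_wider(names_from = `Offence Category`, values_from = Total) %>%
  column_to_rownames(var = "LGA")

# Gower dissimilarity between LGAs, shown as a clustered heatmap
pheatmap(as.matrix(daisy(lga_wide, metric = "gower")),
         labels_row = rownames(lga_wide),
         labels_col = rownames(lga_wide))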

Visualize test and training set distribution with ggplot2

I am trying to visualize the distribution of a dataset and its splits into test and training data, to check whether the split is stratified.
The minimal example uses the iris dataset, which has a Species column that is a factor with 3 levels. The following code snippet shows a nice plot with the count for each label; however, I would like to see the percentage/probability of the labels within the respective set, to see the distribution of the training and test sets.
library("tidyverse")
data(iris)
n = nrow(iris)
idxTrain <- sample(1:n, size = round(0.7*n), replace = F)
train <- iris[idxTrain,]
test <- iris[-idxTrain,]
iris$Set <- rep("Train", time = nrow(iris))
iris$Set[-idxTrain] <- "Test"
ggplot(iris, aes(x = Species, fill = Set)) + geom_bar(position = "dodge")
I tried calculating the percentage as shown below; however, this does not work because it computes the percentage over the whole data frame, which gives a distribution similar to the counts.
geom_bar(aes(y = (..count..)/sum(..count..)))
How can I plot the percentage of each label within each set efficiently?
Bonus: Including the whole dataset, train and test.
library("tidyverse")
data(iris)
n = nrow(iris)
idxTrain <- sample(1:n, size = round(0.7*n), replace = F)
train <- iris[idxTrain,]
test <- iris[-idxTrain,]
iris$Set <- rep("Train", time = nrow(iris))
iris$Set[-idxTrain] <- "Test"
You need a separate data frame for the labels:
df_labs <-
  iris %>%
  group_by(Species) %>%
  count(Set) %>%
  mutate(pct = n / sum(n)) %>%
  filter(Set == "Test")
which you then use as the data for the label geom (or a text geom):
ggplot(iris, aes(x = Species, fill = Set)) +
  geom_bar(position = "dodge") +
  geom_label(data = df_labs, aes(label = scales::percent(pct), y = n / 2))
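Not from the answer above, but a hedged alternative sketch if you want the bar heights themselves to be within-set percentages, and (for the bonus) the whole dataset shown as a third set, reusing the iris data with the Set column created in the question:
library(tidyverse)

# Stack the full dataset as a third "set" next to Train and Test,
# then compute the share of each Species within each set
plot_df <- bind_rows(iris,
                     iris %>% mutate(Set = "All")) %>%
  count(Set, Species) %>%
  group_by(Set) %>%
  mutate(pct = n / sum(n)) %>%
  ungroup()

ggplot(plot_df, aes(x = Species, y = pct, fill = Set)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::percent)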

Predicted vs. Actual plot

I'm new to R and statistics and haven't been able to figure out how to plot predicted values vs. actual values after running a multiple linear regression. I have come across similar questions (I just haven't been able to understand the code), so I would greatly appreciate it if you could explain the code.
This is what I have done so far:
# Attach file containing variables and responses
q <- read.csv("C:/Users/A/Documents/Design.csv")
attach(q)
# Run a linear regression
model <- lm(qo~P+P1+P4+I)
# Summary of linear regression results
summary(model)
The point of the predicted vs. actual plot is so I can graphically see how well my regression fits my actual data.
It would be better if you provided a reproducible example, but here's an example I made up:
set.seed(101)
dd <- data.frame(x = rnorm(100), y = rnorm(100),
                 z = rnorm(100))
dd$w <- with(dd,
             rnorm(100, mean = x + 2*y + z, sd = 0.5))
It's (much) better to use the data argument; you should almost never use attach().
m <- lm(w ~ x + y + z, dd)
plot(predict(m), dd$w,
     xlab = "predicted", ylab = "actual")
abline(a = 0, b = 1)
Besides the predicted vs. actual plot, you can get an additional set of diagnostic plots that help you visually assess the goodness of fit.
# First execute the previous code by Ben Bolker, then:
par(mfrow = c(2, 2))
plot(m)
A tidy way of doing this would be to use broom::augment():
library(tidyverse)
library(broom)     # provides augment()
library(cowplot)

set.seed(101)
# Using Ben's data above:
dd <- data.frame(x = rnorm(100), y = rnorm(100),
                 z = rnorm(100))
dd$w <- with(dd, rnorm(100, mean = x + 2*y + z, sd = 0.5))
m <- lm(w ~ x + y + z, dd)

m %>% augment() %>%
  ggplot() +
  geom_point(aes(.fitted, w)) +
  geom_smooth(aes(.fitted, w), method = "lm", se = FALSE, color = "lightgrey") +
  labs(x = "Fitted", y = "Actual") +
  theme_bw()
This works especially nicely for deeply nested regression lists.
To illustrate, consider a nested list of regressions:
Reglist <- list()
Reglist$Reg1 <- dd %>% do(reg = lm(as.formula("w~x*y*z"), data = .)) %>% mutate( Name = "Type 1")
Reglist$Reg2 <- dd %>% do(reg = lm(as.formula("w~x+y*z"), data = .)) %>% mutate( Name = "Type 2")
Reglist$Reg3 <- dd %>% do(reg = lm(as.formula("w~x"), data = .)) %>% mutate( Name = "Type 3")
Reglist$Reg4 <- dd %>% do(reg = lm(as.formula("w~x+z"), data = .)) %>% mutate( Name = "Type 4")
Now the power of the above tidy plotting framework comes to life:
Graph_Creator <- function(Reglist){
  Reglist %>% pull(reg) %>% .[[1]] %>% augment() %>%
    ggplot() +
    geom_point(aes(.fitted, w)) +
    geom_smooth(aes(.fitted, w), method = "lm", se = FALSE, color = "lightgrey") +
    labs(x = "Fitted", y = "Actual",
         title = paste0("Regression Type: ", Reglist$Name)) +
    theme_bw()
}

Reglist %>% map(~Graph_Creator(.)) %>%
  cowplot::plot_grid(plotlist = ., ncol = 1)
Same as @Ben Bolker's solution, but producing a ggplot object instead of using base R:
# First generate the dd data set using the code in Ben's solution, then:
require(ggpubr)
m <- lm(w ~ x + y + z, dd)
ggscatter(x = "prediction",
          y = "actual",
          data = data.frame(prediction = predict(m),
                            actual = dd$w)) +
  geom_abline(intercept = 0,
              slope = 1)
