Cluster Analysis Visualisation: Colouring the Clusters by a categorical variable - r

Salut folks! I'm still quite new to ggplot and trying to understand it, but I really need some help here.
Edit: Reproducible Data of my Dataset "Daten_ohne_Cluster_NA", first 25 rows
structure(list(ntaxa = c(2, 2, 2, 2, 2, 2, 2, 5, 5, 5, 5, 5,
6, 6, 6, 6, 6, 5, 8, 8, 7, 7, 6, 5, 5), mpd.obs.z = c(-1.779004391,
-1.721014957, -1.77727283, -1.774642404, -1.789386039, -1.983401439,
-0.875426386, -2.276052068, -2.340365105, -2.203126078, -2.394158227,
-2.278173635, -1.269075471, -1.176760985, -1.313045215, -1.164289676,
-1.247549961, -0.868174033, -2.057106804, -2.03154772, -1.691850922,
-1.224391713, -0.93993654, -0.39315089, -0.418380361), mntd.obs.z = c(-1.759874454,
-1.855202792, -1.866281778, -1.798439855, -1.739998395, -1.890847575,
-0.920672112, -1.381541177, -1.382847758, -1.394870597, -1.339878669,
-1.349541665, -0.516793786, -0.525476292, -0.557425575, -0.539534996,
-0.521299478, -0.638951825, -1.06467985, -1.033009266, -0.758380203,
-0.572401837, -0.166616844, 0.399510209, 0.314591018), pe = c(0.046370234,
0.046370234, 0.046370234, 0.046370234, 0.046370234, 0.046370234,
0.071665745, 0.118619482, 0.118619482, 0.118619482, 0.118619482,
0.118619482, 0.205838414, 0.205838414, 0.205838414, 0.205838414,
0.205838414, 0.179091659, 0.215719118, 0.215719118, 0.212092271,
0.315391478, 0.312205596, 0.305510773, 0.305510773), ECO_NUM = c(1,
6, 6, 1, 7, 6, 6, 6, 6, 6, 6, 7, 7, 6, 1, 6, 6, 6, 6, 6, 6, 7,
7, 7, 6)), row.names = c(NA, -25L), class = c("tbl_df", "tbl",
"data.frame"))
(1) I prepared my data frame like this:
Daten_Cluster <- Daten[, c("ntaxa", "mpd.obs.z", "mntd.obs.z", "pe", "ECO_NUM")]
(2) I removed all the NAs with na.omit. It is 6 variables with 3811 objects each. The column ECO_NUM represents the different ecoregions as a categorical factor coded with numbers.
(3) Then I ran a cluster analysis with kmeans. I used 31 groups as there are 31 ecoregions in my dataset and the aim is to colour the plot by ecoregion later on.
Biomes_Clus <- kmeans(Daten_Cluster_ohne_NA, 31, iter.max = 10, nstart = 25)
(4) Then I followed the online instructions from datanovia.com on how to visualise a k-means cluster analysis (I always just follow these how-tos as I have no idea how to do it all by myself). I tried to change the arguments accordingly to colour by ecoregion.
fviz_cluster(Biomes_Clus, data = Daten_Cluster_ohne_NA,
geom = "point",
ellipse.type = "convex",
ggtheme = theme_bw(),
) +
stat_mean(aes(color = Daten_Cluster_ohne_NA$ECO_NUM), size = 4)
I get more than 50 warnings here, I guess one for each object, saying: In grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size) : unimplemented pch value '30'
I know that there are not enough pch-symbols for 31 groups, but I also don't need them - I just would like to have it coloured.
I also tried out the other function ggscatter and created my own color-palette (called P36) with more than 31 colours to have enough colours for the ecoregions.
ggscatter(
ind.coord, x = "Dim.1", y = "Dim.2",
color = "Species", palette = "P36", ellipse = TRUE, ellipse.type = "convex",
legend = "right", ggtheme = theme_bw(),
xlab = paste0("Dim 1 (", variance.percent[1], "% )" ),
ylab = paste0("Dim 2 (", variance.percent[2], "% )" )
) +
stat_mean(aes(color = cluster), size = 4)
The error here is that a discrete value was supplied to a continuous scale. The question is: how can I easily colour the outcome of my kmeans run (which worked) not by the newly clustered groups but by the ecoregions (to visualise whether there is a difference between the clusters and the ecoregion groups)?
I appreciate your help, and my group partner and I would be very thankful!! :)
Greetings
Evelyn
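One possible way to approach this, shown as a minimal sketch rather than a definitive answer: run kmeans on the numeric variables, project the points with prcomp, and map colour to factor(ECO_NUM) in a plain ggplot call. Converting ECO_NUM to a factor is what gives a discrete colour scale. The data below are simulated stand-ins for Daten_Cluster_ohne_NA, and leaving ECO_NUM out of the clustering, using 3 clusters, and drawing the hulls with chull are all assumptions made for illustration.

```r
library(ggplot2)

set.seed(1)
# Simulated stand-in for Daten_Cluster_ohne_NA (hypothetical values)
dat <- data.frame(
  ntaxa      = rpois(200, 5),
  mpd.obs.z  = rnorm(200),
  mntd.obs.z = rnorm(200),
  pe         = runif(200),
  ECO_NUM    = sample(c(1, 6, 7), 200, replace = TRUE)
)

# Assumption: cluster on the scaled numeric variables only,
# so that the ecoregion code does not drive the clustering
vars <- scale(dat[, c("ntaxa", "mpd.obs.z", "mntd.obs.z", "pe")])
km   <- kmeans(vars, centers = 3, nstart = 25)

# Project onto the first two principal components for plotting
pca <- prcomp(vars)
plot_df <- data.frame(
  Dim.1     = pca$x[, 1],
  Dim.2     = pca$x[, 2],
  cluster   = factor(km$cluster),
  ecoregion = factor(dat$ECO_NUM)   # factor => discrete colour scale
)

# Convex hull of each k-means cluster, drawn as an outline
hulls <- do.call(rbind, lapply(split(plot_df, plot_df$cluster), function(d) {
  d[chull(d$Dim.1, d$Dim.2), ]
}))

p <- ggplot(plot_df, aes(Dim.1, Dim.2)) +
  geom_polygon(data = hulls, aes(group = cluster), fill = NA, colour = "grey40") +
  geom_point(aes(colour = ecoregion)) +
  theme_bw()
print(p)

# Cross-tabulation of clusters against ecoregions
tab <- table(cluster = km$cluster, ecoregion = dat$ECO_NUM)
print(tab)
```

The outlines show the k-means groups while the point colours show the ecoregions, so any mismatch between the two is visible directly; the cross-tabulation gives the same comparison as counts.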

Related

R: Creating new column to represent hi/mid/low bins by mean and standard deviation

I've got a batch of survey data that I'd like to be able to subset on a few specific columns which have 0-10 scale data (e.g. Rank your attitude towards x as 0 to 10) so that I can plot using ggplot() + facet_grid. Faceting will use 3 hi/med/low bins, with cut-offs one standard deviation above and below the mean. I have working code, which splits the overall dataframe into 3 parts like so:
# Generate sample data:
climate_experience_data <- structure(list(Q4 = c(2, 3, 3, 5, 4, 3), Q5 = c(1, 3, 3, 3, 2,
2), Q6 = c(4, 3, 3, 3, 4, 4), Q7 = c(4, 2, 3, 5, 5, 5), Q53_1 = c(5,
8, 4, 5, 4, 5)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
# Acquire Q53_1 data as a factor
political_scale <- factor(climate_experience_data$Q53_1, levels = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
# Convert back to the original values (as.numeric() on a factor returns the level codes 1-11)
political_values <- as.numeric(as.character(political_scale))
# Generate thresholds based on mean and standard deviation
low_threshold <- round(mean(political_values, na.rm = TRUE) - sd(political_values, na.rm = TRUE), digits = 0)
high_threshold <- round(mean(political_values, na.rm = TRUE) + sd(political_values, na.rm = TRUE), digits = 0)
# Generate low/med/high bins based on Mean and SD
political_lr_low <- filter(climate_experience_data, Q53_1 <= low_threshold)
political_lr_mid <- filter(climate_experience_data, Q53_1 < high_threshold & Q53_1 > low_threshold)
political_lr_high <- filter(climate_experience_data, Q53_1 >= high_threshold)
What I've realised is that this approach really doesn't lend itself to faceting. What I suspect is that I need to use a combination of mutate() across() where() and group_by() to add data to a new column Q53_scale with "hi" "med" "low" based on where Q53_1 values fall in relation to those low/high thresholds (e.g. SD +1 over mean and -1 under mean). My first few dozen attempts have fallen short - has anyone managed to use sd() to bin data for faceting in this way?
library(tidyverse)
climate_experience_data <- structure(list(Q4 = c(2, 3, 3, 5, 4, 3), Q5 = c(
1, 3, 3, 3, 2,
2
), Q6 = c(4, 3, 3, 3, 4, 4), Q7 = c(4, 2, 3, 5, 5, 5), Q53_1 = c(
5,
8, 4, 5, 4, 5
)), row.names = c(NA, -6L), class = c(
"tbl_df",
"tbl", "data.frame"
))
climate_experience_data %>%
mutate(
bin = case_when(
Q53_1 > mean(Q53_1) + sd(Q53_1) ~ "high",
Q53_1 < mean(Q53_1) - sd(Q53_1) ~ "low",
TRUE ~ "medium"
) %>% factor(levels = c("low", "medium", "high"))
) %>%
ggplot(aes(Q4, Q5)) +
geom_point() +
facet_grid(~bin)
Created on 2022-03-10 by the reprex package (v2.0.0)
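As an alternative to the case_when() binning above, the same three bins can be sketched in base R with cut(). This is only an illustration: the column name Q53_scale is taken from the question, and the data frame holds just the Q53_1 values from the sample data. Note one difference at the boundaries: cut() with its default right = TRUE puts a value exactly equal to mean - sd into "low", whereas the case_when() version keeps it in "medium".

```r
# Sample values of Q53_1 from the question
climate_experience_data <- data.frame(Q53_1 = c(5, 8, 4, 5, 4, 5))

m <- mean(climate_experience_data$Q53_1, na.rm = TRUE)
s <- sd(climate_experience_data$Q53_1, na.rm = TRUE)

# Three intervals: (-Inf, m - s], (m - s, m + s], (m + s, Inf)
climate_experience_data$Q53_scale <- cut(
  climate_experience_data$Q53_1,
  breaks = c(-Inf, m - s, m + s, Inf),
  labels = c("low", "medium", "high")
)

print(climate_experience_data)
```

Because Q53_scale is already an ordered set of factor levels, it can be passed straight to facet_grid(~ Q53_scale) and the panels appear in low/medium/high order.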

Ggplot error : haven_labelled/vctrs_vctr/double

I am new here and still learning R, and I have run into an error.
Here is what I get from the console:
Don't know how to automatically pick scale for object of type haven_labelled/vctrs_vctr/double. Defaulting to continuous.
I don't know what I can do to make it work. I want to get a scatterplot.
ggplot(data = diagnoza, aes(x = Plecc, y = P32.01))
Don't know how to automatically pick scale for object of type haven_labelled/vctrs_vctr/double. Defaulting to continuous.
Adding geom_point as suggested by @zx8754 gives me a scatter plot. There is still the warning you reported, which is related to some of your variables being of type haven_labelled, so I guess you imported your data from SPSS.
To get rid of this warning you could convert your variables to R factors using haven::as_factor. Probably it would be best to do that for the whole dataset after importing your data.
diagnoza <- structure(list(Plecc = c(2, 2, 2, 1, 2, 1, 1, 1, 2, 2, 1, 2,
1, 1, 1, 1, 2, 1, 1, 2), P32.01 = structure(c(3, 4, 5, 5, 5,
5, 5, 4, 3, 5, 3, 4, 3, 4, 5, 5, 5, 3, 4, 5), label = "P32.01. odpoczynek w domu (oglądanie TV)", format.spss = "F1.0", display_width = 12L, labels = c(Nigdy = 1,
Rzadko = 2, `Od czasu do czasu` = 3, Często = 4, `Bardzo często` = 5
), class = c("haven_labelled", "vctrs_vctr", "double"))), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
library(haven)
library(ggplot2)
# Convert labelled vector to a factor
diagnoza$P32.01 <- haven::as_factor(diagnoza$P32.01)
ggplot(data = diagnoza, aes(x = Plecc, y = P32.01)) +
geom_point()
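To convert every labelled column of the dataset in one step, as suggested above, something along these lines might work. It is a sketch that assumes dplyr is available alongside haven; the small tibble is a hypothetical stand-in for diagnoza built with haven::labelled(), and diagnoza_factors is just an illustrative name.

```r
library(haven)
library(dplyr)

# Hypothetical stand-in for the imported SPSS data
diagnoza <- tibble(
  Plecc  = c(2, 1, 2),
  P32.01 = labelled(
    c(3, 4, 5),
    labels = c(Nigdy = 1, Rzadko = 2, `Od czasu do czasu` = 3,
               Często = 4, `Bardzo często` = 5)
  )
)

# Convert every haven_labelled column to an ordinary R factor;
# columns without value labels (here Plecc) are left untouched
diagnoza_factors <- diagnoza %>%
  mutate(across(where(is.labelled), as_factor))

str(diagnoza_factors)
```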

Double index/category bar plot in R? [duplicate]

For a sample dataframe:
df <- structure(list(year = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3,
3, 3, 4, 4, 4, 4, 4), imd.quintile = c(1, 2, 3, 4, 5, 1, 2, 3,
4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5), average_antibiotic = c(1.17153515458827,
1.11592565388857, 1.09288449967773, 1.07442652168281, 1.06102887394413,
1.0560582933182, 1.00678980505929, 0.992997489072538, 0.978343676071694,
0.967900478870214, 1.02854157116164, 0.98339099101476, 0.981198852494798,
0.971392872980818, 0.962289579742817, 1.00601488964457, 0.951187417739673,
0.950706064156994, 0.939174499710836, 0.934948233015044)), .Names = c("year",
"imd.quintile", "average_antibiotic"), row.names = c(NA, -20L
), vars = "year", drop = TRUE, class = c("grouped_df", "tbl_df",
"tbl", "data.frame"))
I want to produce a grouped bar chart, very similar to this post.
I want year on the x axis and average_antibiotic on the y axis, with five bars per year: one for each imd.quintile, which should appear in the legend.
I have tried a couple of options (based on the post and elsewhere), but can't make it work.
ggplot(df, aes(x = imd.quintile, y = average_antibiotic)) +
geom_col() +
facet_wrap(~ year)
ggplot(df, aes(x = imd.quintile, y = average_antibiotic)) +
geom_bar(aes(fill = imd.quintile), position = "dodge", stat="identity")
Any ideas?
I believe you are looking for something like this:
library(ggplot2)
ggplot(df ) +
geom_col(aes(x = year, y = average_antibiotic, group=imd.quintile, fill=imd.quintile), position = "dodge" )
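A small variation on the call above, offered as a sketch: because imd.quintile is numeric, ggplot2 uses a continuous colour gradient for the fill. Wrapping it in factor() should give five distinct colours and a discrete legend instead. The data frame here is a simplified stand-in with made-up average_antibiotic values, not the data from the question.

```r
library(ggplot2)

# Simplified stand-in data (hypothetical values)
df <- data.frame(
  year               = rep(1:4, each = 5),
  imd.quintile       = rep(1:5, times = 4),
  average_antibiotic = round(seq(1.17, 0.93, length.out = 20), 2)
)

# factor() makes both the x axis and the fill discrete,
# so each year gets five dodged bars with their own colours
p <- ggplot(df, aes(x = factor(year), y = average_antibiotic,
                    fill = factor(imd.quintile))) +
  geom_col(position = "dodge") +
  labs(x = "year", fill = "imd.quintile")
print(p)
```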

Set bubble size according to categorical data

Keep in mind, I am very new to R.
I have a dataset from a public opinion survey, and would like to represent the answers through a bubble chart, though the data is categorical, not numeric.
From dataset "Arab4" I have question/variable "Q713" with all of the observations coded as 1, 2, 3, 4, or 5 as the response options. I would like to plot the bubbles (stacked on top of one another by "country") with the size of the bubble corresponding to the percent of the vote share that answer got. For example, if 49% of respondents in Israel voted for option 1 under question "Q", then the bubble size would represent 49% and be situated above the Israel category label with the color of the bubble corresponding to the response type (1, 2, 3, 4, or 5).
I have the following code, giving me a blank chart, and I know to eventually use the "points" command with more specifications.
What I need help with is defining the radius of the circles from the data I have.
plot(Arab4$Country, Arab4$Q713, type= "n", xlab = FALSE, ylab=FALSE)
points(Arab4$country, Arab4$q713)
Here is some dput from the data set
dput(Arab4$q713[1:50])
structure(c(3, 5, 3, 3, 1, 3, 5, 5, 5, 5, 3, 2, 2, 3, 1, 1, 4,
2, 3, 5, 5, 5, 2, 5, 4, 2, 5, 2, 5, 3, 5, 5, 2, 2, 5, 2, 1, 2,
1, 2, 5, 3, 4, 5, 1, 1, 1, 4, 5, 3), labels = structure(c(1,
2, 3, 4, 5, 98, 99), .Names = c("Promoting democracy", "Promoting economic development",
"Resolving the Arab-Israeli conflict", "Promoting women’s rights",
"The US should not get involved", "Don't know (Do not read)",
"Decline to answer (Do not read)")), class = "labelled")
Any ideas would help! Thanks!
As others have commented, this really is not a bubble chart as you only have 2 dimensions and the size of the circle does not add anything (other than perhaps visual appeal). But with that disclaimer, here is one approach to what I think you are trying to achieve. This requires the ggplot2 and reshape2 libraries.
library(ggplot2)
library(reshape2)
# create simulated data
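dat <- data.frame(Egypt=sample(1:5, 20, replace=TRUE), Libya=sample(1:5, 20, replace=TRUE))
# tabulate each country; lapply() always returns a list, so melt() yields the columns Var1, value, L1
dat.tab <- lapply(dat, table)
dat.long <- melt(dat.tab)
colnames(dat.long) <- c("Response", "Count", "Country")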
ggplot(dat.long, aes(x=Country, y=Count, color=Country)) +
geom_point(aes(size=Count))
EDIT Here is another approach, using the data manipulation tools of the dplyr package to get you all the way to proportions:
# using dat from above again; dplyr provides %>%, group_by(), count() and mutate()
library(dplyr)
dat.long <- melt(dat)
colnames(dat.long) <- c("Country", "Response")
dat.tab <- dat.long %>%
group_by(Country) %>%
count(Response) %>%
mutate(prop = prop.table(n))
ggplot(dat.tab, aes(x=Country, y=prop, color=Country)) +
geom_point(aes(size=prop))
You will need to do a little additional work to remove unwanted values (98, 99) if they are truly unwanted.
hth.
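For the extra step of dropping the 98 and 99 codes before computing shares, a sketch along the following lines may help. The data are simulated, the column names country and q713 are assumed from the question, and placing the response on the y axis is just one possible layout.

```r
library(dplyr)
library(ggplot2)

set.seed(42)
# Simulated survey responses (hypothetical), including the 98/99 codes
survey <- data.frame(
  country = rep(c("Egypt", "Libya"), each = 50),
  q713    = sample(c(1:5, 98, 99), 100, replace = TRUE)
)

shares <- survey %>%
  filter(!q713 %in% c(98, 99)) %>%   # drop "Don't know" / "Decline to answer"
  count(country, q713) %>%           # responses per country and answer
  group_by(country) %>%
  mutate(prop = n / sum(n)) %>%      # share of each answer within a country
  ungroup()

# One bubble per country and answer; size and colour follow the answer share and type
p <- ggplot(shares, aes(x = country, y = factor(q713))) +
  geom_point(aes(size = prop, colour = factor(q713))) +
  labs(y = "q713", colour = "q713")
print(p)
```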

Grouping iGraph Vertices in a weighted network by color/subgroup in R

I am struggling to group my network by the subgroups. I currently have the following network:
Current Network
I have assigned each node to a subgroup, and I would like to plot the nodes of each subgroup clustered together, to get a graph that looks like this:
Goal
Most algorithms seem to cluster based on the weights in the graph, but I want to cluster based on the node colors/labelled subgroups. This is the code I have now for this network:
#Graph with Weighted matrix
g_weighted<-graph.adjacency(WeightedMatrix, mode="undirected", weighted = TRUE)
#Make nodes different colors based on different classes
numberofclasses<-length(table(ConnectedVertexColor))
V(g_weighted)$color=ConnectedVertexColor
Node_Colors <- rainbow(numberofclasses, alpha=0.5)
for(i in 1:numberofclasses){
# replace the i-th class name with the i-th color
V(g_weighted)$color=gsub(unique(ConnectedVertexColor)[i],Node_Colors[i],V(g_weighted)$color)
}
#Plot with iGraph
plot.igraph(g_weighted,
edge.width=500*E(g_weighted)$weight,
vertex.size=15,
layout=layout.fruchterman.reingold, ##LAYOUT BY CLASS
title="Weighted Network",
edge.color=ifelse(WeightedMatrix > 0, "palegreen4","red4")
)
legend(x=-1.5, y=-1.1, c(unique(ConnectedVertexColor)), pch = 19, col=Node_Colors, bty="n")
ConnectedVertexColor is a vector that contains information about whether the node is a lipid, Nucleotide, Carb or AA. I have tried the command V(g_weighted)$community<-ConnectedVertexColor
but I cannot get this to translate into useful information for iGraph.
Thanks in advance for any advice.
Since you do not provide data, I am making a guess based on your "Current Network" picture. Of course, what you need is a layout for the graph. Below I provide two functions to create layouts that might meet your needs.
First, some data that looks a bit like yours.
library(igraph)
EL = structure(c(1, 5, 4, 2, 7, 4, 7, 6, 6, 2, 9, 6, 3, 10,
7, 8, 3, 9, 8, 5, 3, 4, 10, 13, 12, 12, 13, 12, 13, 15, 15,
11, 11, 14, 14, 11, 11, 11, 15, 15, 11, 11, 13, 13, 11, 13),
.Dim = c(23L, 2L))
g2 = graph_from_edgelist(EL, directed = FALSE)
Groups = c(rep(1, 10), 2,2,3,3,3)
plot(g2, vertex.color=rainbow(3)[Groups])
First Layout
GroupByVertex01 = function(Groups, spacing = 5) {
Position = (order(Groups) + spacing*Groups)
Angle = Position * 2 * pi / max(Position)
matrix(c(cos(Angle), sin(Angle)), ncol=2)
}
GBV1 = GroupByVertex01(Groups)
plot(g2, vertex.color=rainbow(3)[Groups], layout=GBV1)
Second Layout
GroupByVertex02 = function(Groups) {
numGroups = length(unique(Groups))
GAngle = (1:numGroups) * 2 * pi / numGroups
Centers = matrix(c(cos(GAngle), sin(GAngle)), ncol=2)
x = y = c()
for(i in 1:numGroups) {
curGroup = which(Groups == unique(Groups)[i])
VAngle = (1:length(curGroup)) * 2 * pi / length(curGroup)
x = c(x, Centers[i,1] + cos(VAngle) / numGroups )
y = c(y, Centers[i,2] + sin(VAngle) / numGroups)
}
matrix(c(x, y), ncol=2)
}
GBV2 = GroupByVertex02(Groups)
plot(g2, vertex.color=rainbow(3)[Groups], layout=GBV2)
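A third option, sketched here under stated assumptions rather than taken from the answer above, is to keep a force-directed layout but give edges within a group a larger weight, since layout_with_fr() treats larger weights as stronger attraction. The ring graph, the two groups, and the weights 10 and 1 are all hypothetical choices, and this only pulls a group together when its members are actually connected to each other.

```r
library(igraph)

# Hypothetical example graph: a ring of 10 vertices in two groups
g      <- make_ring(10)
groups <- rep(1:2, each = 5)

# Endpoints of every edge as vertex indices
e    <- ends(g, E(g), names = FALSE)
same <- groups[e[, 1]] == groups[e[, 2]]

# Assumed weights: strong attraction within a group, weak between groups
w <- ifelse(same, 10, 1)

set.seed(1)
lay <- layout_with_fr(g, weights = w)
plot(g, layout = lay, vertex.color = rainbow(2)[groups])
```

Unlike the two circular layouts above, this keeps the general look of a Fruchterman-Reingold plot, but the result depends on the random start and on how strongly the weights differ.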
