I have created an alluvial diagram in R as follows:
library(alluvial)
df <- data.frame(Variable = c("X1", "X2", "X3", "X4", "X5", "X6"),
Pearson1 = c(6, 3, 2, 5, 4, 1),
Spearman1 = c(6, 5, 1, 2, 3, 4),
Kendall1 = c(6, 5, 1, 2, 3, 4),
Pearson2 = c(6, 5, 1, 2, 3, 4),
Spearman2 = c(6, 5, 1, 2, 4, 3),
Kendall2 = c(6, 5, 1, 2, 3, 4))
df$freq <- 1
alluvial(df[1:7], freq = df$freq, cex = 0.7, col = "red")
which results in a diagram in which every flow is drawn in red.
How can I give some specific lines a color other than red? E.g. X1 from Variable to Pearson1, and then again from Kendall1 to Spearman2, and X3 in all states? As far as I can see, I can't do that with alluvial(). How can I recreate the above alluvial diagram with another function?
ggalluvial allows aesthetics to vary along a single "flow" (or alluvium). The documentation describes a trick: use geom_flow with stat = "alluvium" and specify lode.guidance = "frontback".
The actual aesthetic (color) needs to be added to the data. geom_flow and geom_stratum require different columns for the aesthetic (try what happens when you use the same column for both). I am passing the colors directly and using scale_fill_identity, but you could of course also use arbitrary category values and then define your colors with scale_fill_manual.
library(ggalluvial)
#> Loading required package: ggplot2
library(tidyverse)
df <- data.frame(Variable = c("X1", "X2", "X3", "X4", "X5", "X6"),
Pearson1 = c(6, 3, 2, 5, 4, 1),
Spearman1 = c(6, 5, 1, 2, 3, 4),
Kendall1 = c(6, 5, 1, 2, 3, 4),
Pearson2 = c(6, 5, 1, 2, 3, 4),
Spearman2 = c(6, 5, 1, 2, 4, 3),
Kendall2 = c(6, 5, 1, 2, 3, 4))
df_long <-
df %>%
## reshape your data in order to bring it to the right shape
mutate(across(everything(), as.character)) %>%
rownames_to_column("ID") %>%
pivot_longer(-ID) %>%
## correct order of your x
mutate(
name = factor(name, levels = names(df)),
## now hard code where you want to change the color.
## lodes need a different highlighting than your strata
## there are of course many ways to add this information, I am using case_when here
## you could also create separate vectors and add them to your data frame
highlight_lode = case_when(
ID == 3 ~ "blue",
ID == 1 & name %in% c("Variable", "Kendall1", "Pearson2") ~ "orange",
TRUE ~ "red"
),
highlight_stratum = case_when(
ID == 3 ~ "blue",
ID == 1 & name %in% c(
"Variable", "Pearson1", "Kendall1", "Pearson2",
"Spearman2"
) ~ "orange",
TRUE ~ "red"
)
)
ggplot(df_long,
## now use different color aesthetics in geom_flow and geom_stratum
aes(x = name, stratum = value, alluvium = ID, label = value)) +
## I took this trick with lode guidance from the documentation - this allows varying aesthetics over one flow.
geom_flow(aes(fill = highlight_lode), stat = "alluvium", lode.guidance = "frontback", color = "darkgray") +
geom_stratum(aes(fill = highlight_stratum)) +
geom_text(stat = "stratum") +
## as I have named the colors directly, it is appropriate to use scale_identity
scale_fill_identity()
#> Warning: Using the `size` aesthetic in this geom was deprecated in ggplot2 3.4.0.
#> ℹ Please use `linewidth` in the `default_aes` field and elsewhere instead.
Created on 2023-01-29 with reprex v2.0.2
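As a follow-up to the scale_manual remark above, here is a minimal sketch of that variant. It assumes only the first three columns of the question's data (to keep it short), and the group labels focus_a, focus_b and other are hypothetical names I made up for illustration. The colors are then defined once in the scale instead of being stored in the data.

```r
library(ggplot2)
library(ggalluvial)
library(dplyr)
library(tidyr)
library(tibble)

# First three columns of the data from the question, to keep the sketch short
df <- data.frame(
  Variable  = c("X1", "X2", "X3", "X4", "X5", "X6"),
  Pearson1  = c(6, 3, 2, 5, 4, 1),
  Spearman1 = c(6, 5, 1, 2, 3, 4)
)

df_long <- df %>%
  mutate(across(everything(), as.character)) %>%
  rownames_to_column("ID") %>%
  pivot_longer(-ID) %>%
  mutate(
    name = factor(name, levels = names(df)),
    # hypothetical group labels instead of literal color names;
    # as in the code above, a flow is labeled on the axis it starts from
    lode_group = case_when(
      ID == "3" ~ "focus_a",
      ID == "1" & name == "Variable" ~ "focus_b",
      TRUE ~ "other"
    ),
    # strata at both ends of the highlighted flow get the label
    stratum_group = case_when(
      ID == "3" ~ "focus_a",
      ID == "1" & name %in% c("Variable", "Pearson1") ~ "focus_b",
      TRUE ~ "other"
    )
  )

# the colors now live in the scale, not in the data
group_colors <- c(focus_a = "blue", focus_b = "orange", other = "red")

p <- ggplot(df_long, aes(x = name, stratum = value, alluvium = ID, label = value)) +
  geom_flow(aes(fill = lode_group), stat = "alluvium",
            lode.guidance = "frontback", color = "darkgray") +
  geom_stratum(aes(fill = stratum_group)) +
  geom_text(stat = "stratum") +
  scale_fill_manual(values = group_colors)

p
```

Compared with scale_fill_identity, this version also gives you a legend with readable labels, and changing a color means editing group_colors only.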
I've got a batch of survey data that I'd like to subset on a few specific columns containing 0-10 scale data (e.g. "Rank your attitude towards x from 0 to 10") so that I can plot using ggplot() + facet_grid(). Faceting will use three hi/med/low bins, with cut-offs at one standard deviation above and below the mean. I have working code, which splits the overall dataframe into 3 parts like so:
library(dplyr)
# Generate sample data:
climate_experience_data <- structure(list(Q4 = c(2, 3, 3, 5, 4, 3), Q5 = c(1, 3, 3, 3, 2,
2), Q6 = c(4, 3, 3, 3, 4, 4), Q7 = c(4, 2, 3, 5, 5, 5), Q53_1 = c(5,
8, 4, 5, 4, 5)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
# Acquire Q53_1 data as a factor
political_scale <- factor(climate_experience_data$Q53_1, levels = 0:10)
# Convert via as.character() to get the scale values rather than the factor's level codes
political_values <- as.numeric(as.character(political_scale))
# Generate thresholds based on mean and standard deviation
low_threshold <- round(mean(political_values, na.rm = TRUE) - sd(political_values, na.rm = TRUE), digits = 0)
high_threshold <- round(mean(political_values, na.rm = TRUE) + sd(political_values, na.rm = TRUE), digits = 0)
# Generate low/med/high bins based on Mean and SD
political_lr_low <- filter(climate_experience_data, Q53_1 <= low_threshold)
political_lr_mid <- filter(climate_experience_data, Q53_1 < high_threshold & Q53_1 > low_threshold)
political_lr_high <- filter(climate_experience_data, Q53_1 >= high_threshold)
What I've realised is that this approach really doesn't lend itself to faceting. I suspect I need some combination of mutate(), across(), where() and group_by() to add a new column Q53_scale containing "hi", "med" or "low", depending on where each Q53_1 value falls relative to the low/high thresholds (one SD below and one SD above the mean). My first few dozen attempts have fallen short. Has anyone managed to use sd() to bin data for faceting in this way?
library(tidyverse)
climate_experience_data <- structure(list(Q4 = c(2, 3, 3, 5, 4, 3), Q5 = c(
1, 3, 3, 3, 2,
2
), Q6 = c(4, 3, 3, 3, 4, 4), Q7 = c(4, 2, 3, 5, 5, 5), Q53_1 = c(
5,
8, 4, 5, 4, 5
)), row.names = c(NA, -6L), class = c(
"tbl_df",
"tbl", "data.frame"
))
climate_experience_data %>%
mutate(
bin = case_when(
Q53_1 > mean(Q53_1) + sd(Q53_1) ~ "high",
Q53_1 < mean(Q53_1) - sd(Q53_1) ~ "low",
TRUE ~ "medium"
) %>% factor(levels = c("low", "medium", "high"))
) %>%
ggplot(aes(Q4, Q5)) +
geom_point() +
facet_grid(~bin)
Created on 2022-03-10 by the reprex package (v2.0.0)
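For comparison, here is a minimal base R sketch of the same binning using cut(). It assumes only the Q53_1 values from the sample data. One difference to be aware of: cut() uses right-closed intervals by default, so a value exactly equal to mean - sd would land in "low" here, whereas the case_when() version above would call it "medium".

```r
# Q53_1 values from the sample data in the question
q53 <- c(5, 8, 4, 5, 4, 5)

m <- mean(q53, na.rm = TRUE)
s <- sd(q53, na.rm = TRUE)

# Break points at one SD below and one SD above the mean
bin <- cut(
  q53,
  breaks = c(-Inf, m - s, m + s, Inf),
  labels = c("low", "medium", "high")
)

data.frame(Q53_1 = q53, bin = bin)
table(bin)
```

With this sample only the value 8 ends up in "high" and nothing falls into "low", but the empty level is kept in the factor.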
Salut folks! I'm still quite new to ggplot and trying to understand it, but I really need some help here.
Edit: Reproducible Data of my Dataset "Daten_ohne_Cluster_NA", first 25 rows
structure(list(ntaxa = c(2, 2, 2, 2, 2, 2, 2, 5, 5, 5, 5, 5,
6, 6, 6, 6, 6, 5, 8, 8, 7, 7, 6, 5, 5), mpd.obs.z = c(-1.779004391,
-1.721014957, -1.77727283, -1.774642404, -1.789386039, -1.983401439,
-0.875426386, -2.276052068, -2.340365105, -2.203126078, -2.394158227,
-2.278173635, -1.269075471, -1.176760985, -1.313045215, -1.164289676,
-1.247549961, -0.868174033, -2.057106804, -2.03154772, -1.691850922,
-1.224391713, -0.93993654, -0.39315089, -0.418380361), mntd.obs.z = c(-1.759874454,
-1.855202792, -1.866281778, -1.798439855, -1.739998395, -1.890847575,
-0.920672112, -1.381541177, -1.382847758, -1.394870597, -1.339878669,
-1.349541665, -0.516793786, -0.525476292, -0.557425575, -0.539534996,
-0.521299478, -0.638951825, -1.06467985, -1.033009266, -0.758380203,
-0.572401837, -0.166616844, 0.399510209, 0.314591018), pe = c(0.046370234,
0.046370234, 0.046370234, 0.046370234, 0.046370234, 0.046370234,
0.071665745, 0.118619482, 0.118619482, 0.118619482, 0.118619482,
0.118619482, 0.205838414, 0.205838414, 0.205838414, 0.205838414,
0.205838414, 0.179091659, 0.215719118, 0.215719118, 0.212092271,
0.315391478, 0.312205596, 0.305510773, 0.305510773), ECO_NUM = c(1,
6, 6, 1, 7, 6, 6, 6, 6, 6, 6, 7, 7, 6, 1, 6, 6, 6, 6, 6, 6, 7,
7, 7, 6)), row.names = c(NA, -25L), class = c("tbl_df", "tbl",
"data.frame"))
(1) I prepared my Dataframe like this:
Daten_Cluster <- Daten[, c("ntaxa", "mpd.obs.z", "mntd.obs.z", "pe", "ECO_NUM")]
(2) I removed all the NAs with na.omit. The result has 6 variables with 3811 observations each. The column ECO_NUM represents the different ecoregions as a categorical factor coded with numbers.
(3) Then I ran a cluster analysis with kmeans. I used 31 groups because there are 31 ecoregions in my dataset, and the aim is to colour the plot by ecoregion later on.
Biomes_Clus <- kmeans(Daten_Cluster_ohne_NA, 31, iter.max = 10, nstart = 25)
(4) Then I followed the online instructions from datanovia.com on how to visualise a k-means cluster analysis (I always just follow these how-tos, as I have no idea how to do it all by myself). I tried to change the arguments accordingly to colour by ecoregion.
fviz_cluster(Biomes_Clus, data = Daten_Cluster_ohne_NA,
geom = "point",
ellipse.type = "convex",
ggtheme = theme_bw(),
) +
stat_mean(aes(color = Daten_Cluster_ohne_NA$ECO_NUM), size = 4)
I get more than 50 warnings here, I guess one for each object, saying: In grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size) : unimplemented pch value '30'
I know that there are not enough pch-symbols for 31 groups, but I also don't need them - I just would like to have it coloured.
I also tried out the other function ggscatter and created my own color-palette (called P36) with more than 31 colours to have enough colours for the ecoregions.
ggscatter(
ind.coord, x = "Dim.1", y = "Dim.2",
color = "Species", palette = "P36", ellipse = TRUE, ellipse.type = "convex",
legend = "right", ggtheme = theme_bw(),
xlab = paste0("Dim 1 (", variance.percent[1], "% )" ),
ylab = paste0("Dim 2 (", variance.percent[2], "% )" )
) +
stat_mean(aes(color = cluster), size = 4)
The error here is that a discrete value was supplied to a continuous scale. The question is: how can I easily colour the outcome of my k-means analysis (which worked), not by the newly clustered groups but by the ecoregions (to visualise whether there is a difference between the clusters and the ecoregion groups)?
I appreciate your help; my group partner and I would be very thankful!! :)
Greetings
Evelyn
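Not a full answer, but a minimal sketch of one possible approach. It assumes it is acceptable to leave factoextra aside and draw the points with plain ggplot2, that ECO_NUM is kept out of the clustering variables, and it uses only the first 12 rows of the posted data (three columns, rounded to 3 decimals) with 2 clusters instead of 31. The names eco, features and plot_data are mine. The key step is wrapping ECO_NUM in factor() so that ggplot2 treats it as a discrete variable and gives each ecoregion its own colour.

```r
library(ggplot2)

# First 12 rows of the posted data, rounded to 3 decimals
eco <- data.frame(
  ntaxa     = c(2, 2, 2, 2, 2, 2, 2, 5, 5, 5, 5, 5),
  mpd.obs.z = c(-1.779, -1.721, -1.777, -1.775, -1.789, -1.983,
                -0.875, -2.276, -2.340, -2.203, -2.394, -2.278),
  pe        = c(0.046, 0.046, 0.046, 0.046, 0.046, 0.046,
                0.072, 0.119, 0.119, 0.119, 0.119, 0.119),
  ECO_NUM   = c(1, 6, 6, 1, 7, 6, 6, 6, 6, 6, 6, 7)
)

# Cluster on the standardised numeric variables only (assumption: ECO_NUM is left out)
features <- scale(eco[, c("ntaxa", "mpd.obs.z", "pe")])
set.seed(1)
km <- kmeans(features, centers = 2, nstart = 25)

# Project onto the first two principal components for plotting
pca <- prcomp(features)
plot_data <- data.frame(
  Dim.1     = pca$x[, 1],
  Dim.2     = pca$x[, 2],
  cluster   = factor(km$cluster),
  ecoregion = factor(eco$ECO_NUM)  # discrete, so each ecoregion gets its own colour
)

p <- ggplot(plot_data, aes(Dim.1, Dim.2, colour = ecoregion, shape = cluster)) +
  geom_point(size = 3) +
  theme_bw()

p
```

With 31 clusters the shape aesthetic would run out of symbols, so for the full data you would probably drop shape = cluster and, for example, facet by cluster instead.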
I am working with a long-format longitudinal dataset where each person has 1, 2 or 3 time points. In order to perform certain analyses I need to make sure that each person has the same number of rows, even if some rows consist of NAs because they did not complete a given time point.
Here is a sample of the data before adding the rows:
structure(list(Values = c(23, 24, 45, 12, 34, 23), P_ID = c(1,
1, 2, 2, 2, 3), Event_code = c(1, 2, 1, 2, 3, 1), Site_code = c(1,
1, 3, 3, 3, 1)), class = "data.frame", row.names = c(NA, -6L))
This is the data I aim to get after adding the relevant rows:
structure(list(Values = c(23, 24, NA, 45, 12, 34, 23, NA, NA),
P_ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3), Event_code = c(1, 2,
3, 1, 2, 3, 1, 2, 3), Site_code = c(1, 1, 1, 3, 3, 3, 1,
1, 1)), class = "data.frame", row.names = c(NA, -9L))
I want to come up with code that would automatically add rows to the dataset depending on whether the participant has had 1, 2 or 3 visits. Ideally it would set the rest of the data to NA while copying P_ID and Site_code, but if that is not possible I would be satisfied with just creating the right number of rows.
We could use fill() after complete():
library(dplyr)
library(tidyr)
ExpandedDataset %>%
complete(P_ID, Event_code) %>%
fill(Site_code)
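Here is a self-contained sketch of the same idea applied to the sample data from the question. The name visits is my own, and grouping by P_ID before fill() is an extra precaution I added so that a site code can never be carried over from one participant to the next (for example when someone's first visit is the missing one).

```r
library(dplyr)
library(tidyr)

# Sample data from the question
visits <- data.frame(
  Values     = c(23, 24, 45, 12, 34, 23),
  P_ID       = c(1, 1, 2, 2, 2, 3),
  Event_code = c(1, 2, 1, 2, 3, 1),
  Site_code  = c(1, 1, 3, 3, 3, 1)
)

completed <- visits %>%
  # add one row for every P_ID / Event_code combination that is missing
  complete(P_ID, Event_code) %>%
  # fill the site code within each participant only
  group_by(P_ID) %>%
  fill(Site_code, .direction = "downup") %>%
  ungroup()

completed
```

The result has 9 rows (3 per participant); Values stays NA for the added rows, while Site_code is copied from the participant's existing rows.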
I came up with fairly long code, but you could wrap it in a function to make it easier to use:
Here's your dataframe:
df <- data.frame(ID = c(rep("P1", 2), rep("P2", 3), "P3"),
Event = c("baseline", "visit 2", "baseline", "visit 2", "visit 3", "baseline"),
Event_code = c(1, 2, 1, 2, 3, 1),
Site_code = c(1, 1, 2, 2, 2, 1))
How many records do you have per ID? (table() counts them; summary() only does so when ID is a factor.)
values <- table(df$ID)
What is the maximum number of records for a single patient?
target <- max(values)
Which specific patients have fewer records than the maximum?
uncompliant <- names(which(values<target))
And how many records do you have for those patients who have missing information?
rowcount <- values[which(values<target)]
So now, let's create the vectors of the data frame we will add to your original one. First, IDs:
IDs <- vector()
for(i in 1:length(rowcount)){
y <- rep(uncompliant[i], target - rowcount[i])
IDs <- c(IDs, y)
}
And now, the sitecodes:
SC <- vector()
for(i in 1:length(rowcount)){
y <- rep(unique(df$Site_code[which(df$ID == uncompliant[i])]), target - rowcount[i])
SC <- c(SC, y)
}
Finally, a data frame with the values we will introduce:
introduce <- data.frame(ID = IDs, Event = rep(NA, length(IDs)),
Event_code = rep(NA, length(IDs)),
Site_code = SC)
Combine the original dataframe with the new values to be added and sort it so it looks nice:
final <- as.data.frame(rbind(df, introduce))
final <- final[order(final$ID), ]
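Wrapped into a function, the steps above might look roughly like this. This is only a sketch: pad_records is a hypothetical name, it assumes the column names of the data frame above (ID, Event, Event_code, Site_code), and it assumes each ID has a single site code (it takes the first one if there are several).

```r
# Hypothetical helper: pad every ID up to the maximum number of records
pad_records <- function(df) {
  # number of records per ID, as a named integer vector
  counts <- vapply(split(df$ID, df$ID), length, integer(1))
  target <- max(counts)
  short  <- names(counts)[counts < target]
  if (length(short) == 0) return(df)

  # one block of NA rows per ID that has fewer records than the maximum
  padding <- do.call(rbind, lapply(short, function(id) {
    n_missing <- target - counts[[id]]
    data.frame(
      ID         = rep(id, n_missing),
      Event      = NA_character_,
      Event_code = NA_real_,
      Site_code  = unique(df$Site_code[df$ID == id])[1]
    )
  }))

  final <- rbind(df, padding)
  final <- final[order(final$ID), ]
  rownames(final) <- NULL
  final
}

df <- data.frame(ID = c(rep("P1", 2), rep("P2", 3), "P3"),
                 Event = c("baseline", "visit 2", "baseline", "visit 2", "visit 3", "baseline"),
                 Event_code = c(1, 2, 1, 2, 3, 1),
                 Site_code = c(1, 1, 2, 2, 2, 1))

padded <- pad_records(df)
padded
```

Each ID ends up with three rows, and the added rows carry the ID and site code with NA everywhere else.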
For a sample dataframe:
df <- structure(list(year = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3,
3, 3, 4, 4, 4, 4, 4), imd.quintile = c(1, 2, 3, 4, 5, 1, 2, 3,
4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5), average_antibiotic = c(1.17153515458827,
1.11592565388857, 1.09288449967773, 1.07442652168281, 1.06102887394413,
1.0560582933182, 1.00678980505929, 0.992997489072538, 0.978343676071694,
0.967900478870214, 1.02854157116164, 0.98339099101476, 0.981198852494798,
0.971392872980818, 0.962289579742817, 1.00601488964457, 0.951187417739673,
0.950706064156994, 0.939174499710836, 0.934948233015044)), .Names = c("year",
"imd.quintile", "average_antibiotic"), row.names = c(NA, -20L
), vars = "year", drop = TRUE, class = c("grouped_df", "tbl_df",
"tbl", "data.frame"))
I want to produce a grouped bar chart, very similar to this post.
I want year on the x axis and average_antibiotic on the y axis, with five bars per year, one for each imd.quintile (which is the legend).
I have tried a couple of options (based on the post and elsewhere), but can't make it work.
ggplot(df, aes(x = imd.quintile, y = average_antibiotic)) +
geom_col() +
facet_wrap(~ year)
ggplot(df, aes(x = imd.quintile, y = average_antibiotic)) +
geom_bar(aes(fill = imd.quintile), position = "dodge", stat="identity")
Any ideas?
I believe you are looking for something like this:
library(ggplot2)
ggplot(df) +
geom_col(aes(x = year, y = average_antibiotic, group = imd.quintile, fill = imd.quintile), position = "dodge")
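A small variation on the above, in case you want a discrete legend with one color per quintile: because imd.quintile is numeric, mapping it to fill directly gives a continuous color gradient. Converting it to a factor should give five separate legend keys, and the explicit group is then no longer needed. This sketch rebuilds the sample data as a plain data frame with the values rounded to two decimals; the name abx is my own.

```r
library(ggplot2)

# Sample data from the question as a plain data frame (values rounded to 2 decimals)
abx <- data.frame(
  year               = rep(1:4, each = 5),
  imd.quintile       = rep(1:5, times = 4),
  average_antibiotic = c(1.17, 1.12, 1.09, 1.07, 1.06,
                         1.06, 1.01, 0.99, 0.98, 0.97,
                         1.03, 0.98, 0.98, 0.97, 0.96,
                         1.01, 0.95, 0.95, 0.94, 0.93)
)

# A factor makes fill discrete: one bar color and one legend key per quintile
abx$imd.quintile <- factor(abx$imd.quintile)

p <- ggplot(abx, aes(x = year, y = average_antibiotic, fill = imd.quintile)) +
  geom_col(position = "dodge")

p
```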