display mean value (rearrange data frame?) - r

I want to boxplot two groups (A and B) and display the mean value on each box plot.
I have 30 lines and 2 columns : each line contains the value of group A (col 1) and group B (col 2).
I did a boxplot with graphic boxplot
boxplot(Data_Q4$Group.A,Data_Q4$Group.B,names=c("group A","group B"))
but it seems like adding a mean point on the boxplot necessiting ggplot 2.
I tried many things but it already send me an error message
! Aesthetics must be either length 1 or the same as the data (30): x...
It seems my problem come from y axis. I need him to take the data from columns A and B but I don't know how to do this.
if my data was with value column and group columns (A or B for each line) it would work but I don't know how to rearrange it so that I get 2 columns (value and groups) and 60 lines with the values of the groups.
and then I do dataQ4 %>% ggplot(aes(x=group,y=value))+geom_boxplot+stat_summary(fun.y=mean)
I think it will be ok.
so my problem is to rearrange my data frame so that I can use ggplot and boxplot it
thanks for your help !
I share here my data :
dput(Data_Q4) structure(list(Group.A = c(1.25310535, 0.5546414, 0.301283, 1.29312466, 0.99455579, 0.5141743, 2.0078324, 0.42224244, 2.17877257, 3.21778902, 0.55782935, 0.59461765, 0.97739581, 0.20986658, 0.30944786, 1.10593627, 0.77418776, 0.08967408, 1.10817666, 0.24726425, 1.57198685, 4.83281274, 0.43113213, 2.73038931, 1.13683142, 0.81336825, 0.83700649, 1.7847654, 2.31247163, 2.90988727), Group.B = c(2.94928948, 0.70302878, 0.69016263, 1.25069011, 0.43649776, 0.22462232, 0.39231981, 1.5763435, 0.42792839, 0.19608026, 0.37724368, 0.07071508, 0.03962611, 0.38580831, 2.63928857, 0.78220807, 0.66454197, 0.9568569, 0.02484568, 0.21600677, 0.88031195, 0.13567357, 0.68181725, 0.20116062, 0.4834762, 0.50102846, 0.15668497, 0.71992076, 0.68549794, 0.86150777)), class = "data.frame", row.names = c(NA, -30L))

First I create some random data:
df <- data.frame(group = rep(c("A", "B"), 15),
value = runif(30, 0, 10))
You can use the following code:
library(tidyverse)
ggplot(data = df,
aes(x = group, y = value)) +
geom_boxplot() +
stat_summary(fun.y = mean, color = "darkred", position = position_dodge(0.75),
geom = "point", shape = 18, size = 3,
show.legend = FALSE)
Output:
The red dots represent the mean.
Using your data:
You can use the following code:
library(tidyverse)
library(reshape)
dataQ4 %>%
melt() %>%
ggplot(aes(x = variable, y = value)) +
geom_boxplot() +
stat_summary(fun.y = mean, color = "darkred", position = position_dodge(0.75),
geom = "point", shape = 18, size = 3,
show.legend = FALSE)
Output:

Related

Adding geom_line between data points with different geom_boxplot fill variable

Hi I have a much larger data frame but a sample dummy df is as follows:
set.seed(23)
df = data.frame(name = c(rep("Bob",8),rep("Tom",8)),
topic = c(rep(c("Reading","Writing"),8)),
subject = c(rep(c("English","English","Spanish","Spanish"),4)),
exam = c(rep("First",4),rep("Second",4),rep("First",4),rep("Second",4)),
score = sample(1:100,16))
I have to plot it in the way shown in the picture below (for my original data frame) but with lines connecting the scores corresponding to each name between the first and second class in the exam variable, I tried geom_line(aes(group=name)) but the lines are not connected in the right way. Is there any way to connect the points that also respects the grouping by the fill variable similar to how the position_dodge() helps separate the points by their fill grouping? Thanks a lot!
library(ggplot2)
df %>% ggplot(aes(x=topic,y=score,fill=exam)) +
geom_boxplot(outlier.shape = NA) +
geom_point(size=1.75,position = position_dodge(width = 0.75)) +
facet_grid(~subject,switch = "y")
One option to achieve your desired result would be to group the lines by name and topic and do the dodging of lines manually instead of relying on position_dogde. To this end convert topic to a numeric for the geom_line and shift the position by the necessary amount to align the lines with the dodged points:
set.seed(23)
df <- data.frame(
name = c(rep("Bob", 8), rep("Tom", 8)),
topic = c(rep(c("Reading", "Writing"), 8)),
subject = c(rep(c("English", "English", "Spanish", "Spanish"), 4)),
exam = c(rep("First", 4), rep("Second", 4), rep("First", 4), rep("Second", 4)),
score = sample(1:100, 16)
)
library(ggplot2)
ggplot(df, aes(x = topic, y = score, fill = exam)) +
geom_boxplot(outlier.shape = NA) +
geom_point(size = 1.75, position = position_dodge(width = 0.75)) +
geom_line(aes(
x = as.numeric(factor(topic)) + .75 / 4 * ifelse(exam == "First", -1, 1),
group = interaction(name, topic)
)) +
facet_grid(~subject, switch = "y")

How to specify unique geom assignments to facets?

Below I have simulated a dataset where an assignment was given to 5 groups of individuals on 5 different days (a new group with 200 new individuals each day). TrialStartDate denotes the date on which the assignment was given to each individual (ID), and TrialEndDate denotes when each individual finished the assignment.
set.seed(123)
data <-
data.frame(
TrialStartDate = rep(c(sample(seq(as.Date('2019/02/01'), as.Date('2019/02/15'), by="day"), 5)), each = 200),
TrialFinishDate = sample(seq(as.Date('2019/02/01'), as.Date('2019/02/15'), by = "day"), 1000,replace = T),
ID = seq(1,1000, 1)
)
I am interested in comparing how long individuals took to complete the trial depending on when they started the trial (i.e., assuming TrialStartDate has an effect on the length of time it takes to complete the trial).
To visualize this, I want to make a barplot showing counts of IDs on each TrialFinishDate where bars are colored by TrialStartDate (since each TrialStartDate acts as a grouping variable). The best I have come up with so far is by faceting like this:
data%>%
group_by(TrialStartDate, TrialFinishDate)%>%
count()%>%
ggplot(aes(x = TrialFinishDate, y = n, col = factor(TrialStartDate), fill = factor(TrialStartDate)))+
geom_bar(stat = "identity")+
facet_wrap(~TrialStartDate, ncol = 1)
However, I also want to add a vertical line to each facet showing when the TrialStartDate was for each group (preferably colored the same as the bars). When attempting to add vertical lines with geom_vline, it adds all the lines to each facet:
data%>%
group_by(TrialStartDate, TrialFinishDate)%>%
count()%>%
ggplot(aes(x = TrialFinishDate, y = n, col = factor(TrialStartDate), fill = factor(TrialStartDate)))+
geom_bar(stat = "identity")+
geom_vline(xintercept = unique(data$TrialStartDate))+
facet_wrap(~TrialStartDate, ncol = 1)
How can we make the vertical lines unique to the respective group in each facet?
You're specifying xintercept outside of aes, so the faceting is not respected.
This should do the trick:
data %>%
group_by(TrialStartDate, TrialFinishDate)%>%
count()%>%
ggplot(aes(x = TrialFinishDate, y = n, col = factor(TrialStartDate), fill = factor(TrialStartDate)))+
geom_bar(stat = "identity")+
geom_vline(aes(xintercept = TrialStartDate))+
facet_wrap(~TrialStartDate, ncol = 1)
Note geom_vline(aes(xintercept = TrialStartDate))

R: ggplot2: how to separate labels in stat_summary

I try to plot labels above bars with the stat_summary function and a custom function that I wrote. There are three bars and each should be labeled with the letters a:c, respectively. However, instead of putting one label per bar, all three labels are placed on top of each other:
codes <- c ("a", "b", "c")
simple_y <- function(x) {
return (data.frame (y = mean (x) + 1, label = codes))
}
ggplot (iris, mapping = aes (x = Species, y = Sepal.Length)) +
geom_bar (stat = "summary", fun.y = "mean", fill = "blue", width = 0.7, colour = "black", size = 0.7) +
stat_summary (fun.data = simple_y, geom = "text", size = 10)
I do understand why this is not working: each time the simply_y-function is recycled, it sees the whole codes - vector. However, I have no clue how to tell R to separate the three labels. Is it possible to tell R to subsequently use the n_th element of an input-vector when recycling a function?
Does anybody have a good hint?
I would consider doing something like this:
labels <-
tibble(
Species = factor(c("setosa", "versicolor", "virginica")),
codes = c("a", "b", "c")
)
iris %>%
group_by(Species) %>%
summarize(Mean = mean(Sepal.Length)) %>%
ungroup() %>%
left_join(labels, by = "Species") %>%
ggplot(aes(x = Species, y = Mean)) +
geom_col(fill = "blue", width = 0.7, color = "black", size = 0.7) +
geom_text(aes(y = Mean + 0.3, label = codes), size = 6, show.legend = FALSE)
First, you can generate the data frame with means separately, avoiding the need for geom_bar and stat_summary. Then after joining the manual labels/codes to that summarized data frame, it's pretty straightforward to add them with geom_text.

R: PCA ggplot Error "arguments imply differing number of rows"

I have a dataset:
https://docs.google.com/spreadsheets/d/1ZgyRQ2uTw-MjjkJgWCIiZ1vpnxKmF3o15a5awndttgo/edit?usp=sharing
that I'm trying to apply PCA analysis and to achieve a graph based on graph provided in this post:
https://stats.stackexchange.com/questions/61215/how-to-interpret-this-pca-biplot-coming-from-a-survey-of-what-areas-people-are-i
However, an error doesn't seem to go away:
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names =
TRUE, :
arguments imply differing number of rows: 0, 1006
Following is my code that I have trouble finding the source of error. Would like to have some help for error detection. Any hints?
The goal is to produced a PCA graph grouped by levels of Happiness.in.life. I modified the original code to fit with my dataset. Originally, group is determined by Genders, which has 2 levels. What I'm attempting to do is to build a graph based on 5 levels of Happiness.in.life. However, it doesn't seem I can use the old code...
Thanks!
library(magrittr)
library(dplyr)
library(tidyr)
df <- happiness_reduced %>% dplyr::select(Happiness.in.life:Internet.usage, Happiness.in.life)
head(df)
vars_on_hap <- df %>% dplyr::select(-Happiness.in.life)
head(vars_on_hap)
group<-df$Happiness.in.life
fit <- prcomp(vars_on_hap)
pcData <- data.frame(fit$x)
vPCs <- fit$rotation[, c("PC1", "PC2")] %>% as.data.frame()
multiple <- min(
(max(pcData[,"PC1"]) - min(pcData[,"PC1"]))/(max(vPCs[,"PC1"])-
min(vPCs[,"PC1"])),
(max(pcData[,"PC2"]) - min(pcData[,"PC2"]))/(max(vPCs[,"PC2"])-
min(vPCs[,"PC2"]))
)
ggplot(pcData, aes(x=PC1, y=PC2)) +
geom_point(aes(colour=groups)) +
coord_equal() +
geom_text(data=vPCs,
aes(x = fit$rotation[, "PC1"]*multiple*0.82,
y = fit$rotation[,"PC2"]*multiple*0.82,
label=rownames(fit$rotation)),
size = 2, vjust=1, color="black") +
geom_segment(data=vPCs,
aes(x = 0,
y = 0,
xend = fit$rotation[,"PC1"]*multiple*0.8,
yend = fit$rotation[,"PC2"]*multiple*0.8),
arrow = arrow(length = unit(.2, 'cm')),
color = "grey30")
Here is an approach on how to plot the result of PCA in ggplot2:
library(tidyverse)
library(ggrepel)
A good idea (not in all cases for instance if they are all in the same units) is to scale the variables prior to PCA
hapiness %>% #this is the data from google drive. In the future try not top post such links on SO because they tend to be unusable after some time has passed
select(-Happiness.in.life) %>%
prcomp(center = TRUE, scale. = TRUE) -> fit
Now we can proceed to plotting the fit:
fit$x %>% #coordinates of the points are in x element
as.data.frame()%>% #convert matrix to data frame
select(PC1, PC2) %>% #select the first two PC
bind_cols(hapiness = as.factor(hapiness$Happiness.in.life)) %>% #add the coloring variable
ggplot() +
geom_point(aes(x = PC1, y = PC2, colour = hapiness)) + #plot points and color
geom_segment(data = fit$rotation %>% #data we want plotted by geom_segment is in rotation element
as.data.frame()%>%
select(PC1, PC2) %>%
rownames_to_column(), #get to row names so you can label after
aes(x = 0, y = 0, xend = PC1 * 7, yend = PC2* 7, group = rowname), #I scaled the rotation by 7 so it fits in the plot nicely
arrow = arrow(angle = 20, type = "closed", ends = "last",length = unit(0.2,"cm")),
color = "grey30") +
geom_text_repel(data = fit$rotation %>%
as.data.frame()%>%
select(PC1, PC2) %>%
rownames_to_column(),
aes(x = PC1*7,
y = PC2*7,
label = rowname)) +
coord_equal(ratio = fit$sdev[2]^2 / fit$sdev[1]^2) + #I like setting the ratio to the ratio of eigen values
xlab(paste("PC1", round(fit$sdev[1]^2/ sum(fit$sdev^2) *100, 2), "%")) +
ylab(paste("PC2", round(fit$sdev[2]^2/ sum(fit$sdev^2) *100, 2), "%")) +
theme_bw()
Look at all them happy people on the left (well it is hard to notice because of the colors used, I suggest using the palette jco from ggpubr library) get_palette('jco', 5) ie scale_color_manual(values = get_palette('jco', 5))
quite a similar plot can be achieved with library ggord:
library(ggord)
ggord(fit, grp_in = as.factor(hapiness$Happiness.in.life),
size = 1, ellipse = F, ext = 1.2, vec_ext = 5)
the major difference is ggord uses equal scaling for axes. Also I scaled the rotation by 5 instead of 7 as in the first plot.
As you can see I do not like many intermediate data frames.

Boxplot width in ggplot with cross classified groups

I am making boxplots with ggplot with data that is classified by 2 factor variables. I'd like to have the box sizes reflect sample size via varwidth = TRUE but when I do this the boxes overlap.
1) Some sample data with a 3 x 2 structure
data <- data.frame(group1= sample(c("A","B","C"),100, replace = TRUE),group2= sample(c("D","E"),100, replace = TRUE) ,response = rnorm(100, mean = 0, sd = 1))
2) Default boxplots: ggplot without variable width
ggplot(data = data, aes(y = response, x = group1, color = group2)) + geom_boxplot()
I like how the first level of grouping is shown.
Now I try to add variable widths...
3) ...and What I get when varwidth = TRUE
ggplot(data = data, aes(y = response, x = group1, color = group2)) + geom_boxplot(varwidth = T)
This overlap seems to occur whether I use color = group2 or group = group2 in both the main call to ggplot and in the geom_boxplot statement. Fussing with position_dodge doesn't seem to help either.
4) A solution I don't like visually is to make unique factors by combining my group1 and group2
data$grp.comb <- paste(data$group1, data$group2)
ggplot(data = data, aes(y = response, x = grp.comb, color = group2)) + geom_boxplot()
I prefer having things grouped to reflect the cross classification
5) The way forward:
I'd like to either a)figure out how to either make varwidth = TRUE not cause the boxes to overlap or b)manually adjusted the space between the combined groups so that boxes within the 1st level of grouping are closer together.
I think your problem can be solved best by using facet_wrap.
library(ggplot2)
data <- data.frame(group1= sample(c("A","B","C"),100, replace = TRUE), group2=
sample(c("D","E"),100, replace = TRUE) ,response = rnorm(100, mean = 0, sd = 1))
ggplot(data = data, aes(y = response, x = group2, color = group2)) +
geom_boxplot(varwidth = TRUE) +
facet_wrap(~group1)
Which gives:
A recent update to ggplot2 makes it so that the code provided by #N Brouwer in (3) works as expected:
# library(devtools)
# install_github("tidyverse/ggplot2")
packageVersion("ggplot2") # works with v2.2.1.9000
library(ggplot2)
set.seed(1234)
data <- data.frame(group1= sample(c("A","B","C"), 100, replace = TRUE),
group2= sample(c("D","E"), 100, replace = TRUE),
response = rnorm(100, mean = 0, sd = 1))
ggplot(data = data, aes(y = response, x = group1, color = group2)) +
geom_boxplot(varwidth = T)
(I'm a new user and can't post images inline)
fig 1
This question has been answered here ggplot increase distance between boxplots
The answer involves using the position = position_dodge() argument of geom_boxplot().
For your example:
data <- data.frame(group1= sample(c("A","B","C"),100, replace = TRUE), group2=
sample(c("D","E"),100, replace = TRUE) ,response = rnorm(100, mean = 0, sd = 1))
ggplot(data = data, aes(y = response, x = group1, color = group2)) +
geom_boxplot(position = position_dodge(1))

Resources