R programming - ggplot2 boxplot labeling by group issue - r

Currently I have a data frame where I want to plot three variables into one boxplot:
livingsetting factor outcome
1 1 CKD 2
2 1 CKD 13
3 1 CKD 23
4 13 CKD 12
5 7 CKD -14
The livingsetting variable contains factors "1", "7", and "13".
The factor variable contains factors "CKD", "HD", and "Transplant".
The outcome variable is a continuous outcome variable.
This is my code for the boxplot:
ggplot(df, aes(x = interaction(livingsetting, factor),
y= outcome)) + geom_boxplot(aes(fill = livingsetting)) + xlab("Factors")+ ylab("Y")
And my plot looks like this:
The x-axis labels show 1.CKD, 13.CKD, 7.CKD, 1.HD, 13.HD, etc., but is it possible to tweak the xlab part so that the boxplot shows "CKD", "HD", and "Transplant" as the labels?
(so that the individual boxplots are grouped in threes).
For example, the first red, green, and blue plots will be labeled as "CKD" (as the group), the second red, green, and blue plots will be labeled as "HD", etc.

Here is an example illustrating my comment from above. You don't need interaction, since each aesthetic will create another boxplot:
df <- read.table(text = " livingsetting factor outcome
1 7 BLA 2
2 1 BLA 13
3 1 CKD 23
4 13 CKD 12
5 7 CKD -14", header = TRUE, row.names = 1)
df$livingsetting <- as.factor(df$livingsetting)
library(ggplot2)
ggplot(data = df, aes(x = factor, y = outcome, fill = livingsetting)) +
geom_boxplot()
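A fuller sketch of the same idea may make the grouping easier to see. The data below is simulated (it is not the asker's data) and covers all three levels of both variables. With fill mapped to livingsetting, ggplot2 dodges three boxes at each x position, so the axis shows only CKD, HD, and Transplant.

```r
library(ggplot2)

set.seed(1)  # simulated data, for illustration only
demo <- expand.grid(
  livingsetting = factor(c(1, 7, 13)),
  factor        = c("CKD", "HD", "Transplant"),
  rep           = 1:10
)
demo$outcome <- rnorm(nrow(demo), mean = 10, sd = 8)

# x carries the group label; fill splits each group into three dodged boxes
p <- ggplot(demo, aes(x = factor, y = outcome, fill = livingsetting)) +
  geom_boxplot() +
  labs(x = "Factors", y = "Y")
p
```

Here the axis titles are set with labs(), which does the same job as the xlab() and ylab() calls in the question.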

Is there a reason not to use facet_wrap or facet_grid? Unless I'm misunderstanding what you're looking for, this is a perfect use-case for faceting, and then you don't need interaction.
You should be able to change to this:
ggplot(df, aes(x = livingsetting, y = outcome)) +
geom_boxplot(aes(fill = livingsetting)) +
facet_wrap(~ factor)
This uses the dataframe as is, rather than getting the interaction, and adds labels for the factor variable to the tops of the facets, rather than on the tick labels (though you could do that if that's something you want).
Hope that helps!

Related

R: how to filter within aes()

As an R-beginner, there's one hurdle that I just can't find the answer to. I have a table where I can see the amount of responses to a question according to gender.
Response Gender n
1 1 84
1 2 79
2 1 42
2 2 74
3 1 84
3 2 79
etc.
I want to plot these in a column chart: on the y I want the n (or its proportions), and on the x I want to have two separate bars: one for gender 1, and one for gender 2. It should look like the following example that I was given:
The example that I want to emulate
However, when I try to filter the columns according to gender inside aes(), it returns an error! Could anyone tell me why my approach is not working? And is there another practical way to filter the columns of the table that I have?
ggplot(table) +
geom_col(aes(x = select(filter(table, gender == 1), Q),
y = select(filter(table, gender == 1), n),
fill = select(filter(table, gender == 2), n), position = "dodge")
Maybe something like this:
library(dplyr) # for the %>% pipe
library(RColorBrewer)
library(ggplot2)
df %>%
ggplot(aes(x=factor(Response), y=n, fill=factor(Gender)))+
geom_col(position=position_dodge())+
scale_fill_brewer(palette = "Set1")+
theme_light()
Your approach does not work because you are assigning the x and y variables as if they came from two different datasets (one for x and one for y). In line with the solution from TarJae, think of it in terms of the axes of a chart: assign to the x axis the categorical variable you are comparing, and to the y axis the numerical variable that determines the height of the bars. Finally, you want to compare groups by color, so each group gets a different color; that is where your grouping variable comes in (here, I use fill).
library(dplyr) ## For piping
library(ggplot2) ## For plotting
df %>%
ggplot(aes(x = Response, y = n, fill = as.character(Gender))) +
geom_bar(stat = "identity", position = "dodge")
I am adding stat = "identity" because the default in geom_bar is to count the occurrences in your data (i.e., as if your data were not aggregated). I am adding position = "dodge" so that the bars are not stacked. I recommend this resource for more information: https://r4ds.had.co.nz/index.html
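Since the question also mentions plotting proportions instead of raw counts, here is a minimal sketch of one way to do that. It assumes the proportions should be computed within each gender, and it uses only the six rows shown in the question's table (the rows behind "etc." are not included, so the actual shares would differ).

```r
library(dplyr)
library(ggplot2)

# Counts copied from the rows visible in the question's table
responses <- data.frame(
  Response = c(1, 1, 2, 2, 3, 3),
  Gender   = c(1, 2, 1, 2, 1, 2),
  n        = c(84, 79, 42, 74, 84, 79)
)

# Share of each response within its gender group
props <- responses %>%
  group_by(Gender) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

p_prop <- ggplot(props, aes(x = factor(Response), y = prop, fill = factor(Gender))) +
  geom_col(position = "dodge") +
  labs(x = "Response", y = "Proportion within gender", fill = "Gender")
p_prop
```

The filtering the question attempted inside aes() is not needed: the whole table goes to ggplot, and fill separates the genders.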

how to have a graphics matrix in R

Suppose you have three variables:
gestation of the mom
height of the mom
weight of the baby at birth
My two x variables are:
gestation of the mom
height of the mom
and my y variable is:
weight of the baby at birth
I would like a matrix of plots showing the baby's birth weight as a function of the mom's gestation, and the baby's birth weight as a function of the mom's height.
I tried this:
pairs((baby$bwt~baby$gestation+baby$age))
I get a plot matrix like the one in this picture: matrix_picture
But I would like to know how to get only y as a function of x1 and y as a function of x2, because my picture shows every pairing. In other words, I would like to keep only the first row of the matrix.
Thanks for reading.
EDIT :
[matrix2_picture][2]
As you can see, the x-axis always has the same range (0 to 300), but I would like a range that suits each plot for better visualisation. For example, age can never reach 200 or 300, so I would like its x-axis to run from a minimum of 10 to a maximum of 50.
Thanks
EDIT2:
[matrix3][3]
One last question: if I want to get the same thing as in the picture, how can I do it with ggplot?
The first plot is the mom's gestation as a function of the baby's birth weight, the second is the mom's age as a function of the baby's birth weight, and the last is the mom's height as a function of the baby's birth weight.
I tried this:
df3 <- reshape2::melt(baby, "bwt")
ggplot(df3, aes(x=bwt, y=value)) +
geom_point() + facet_grid(.~variable,scales="free")
But I get this:
[matrix3][4]
As you can see, my y-axis always has the same range, unlike when I used pairs.
Thanks a lot!
[2]: https://i.stack.imgur.com/jppCJ.png
[3]: https://i.stack.imgur.com/TnEBe.png
[4]: https://i.stack.imgur.com/BPOUP.png
Last edit:
Do you know how to do the same thing, but for the residuals against each variable?
Something like the function pairs(), but with residuals.
reg=lm(formula=baby$bwt~baby$weight+baby$gestation+baby$age)
summary(reg)
plot(reg)
I would like to have the residuals of baby$bwt as a function of these three variables (weight, gestation, age).
As far as I know, there is no solution using pairs. There are several other options; the one I know uses ggplot2.
First generating some dummy data:
df <- data.frame(
`gestation of the mom` = rnorm(20,300,30),
`height of the mom` = rnorm(20,170,10),
`weight of the baby at birth` = rnorm(20,50,5))
>df
gestation.of.the.mom height.of.the.mom weight.of.the.baby.at.birth
1 304.9339 165.7853 52.92590
2 219.7718 185.3528 43.06043
3 310.6279 166.5677 56.19357
4 278.8190 179.8276 54.33385
5 247.4760 186.6949 51.95354
Then reshaping the data frame for ggplot:
df2 <- reshape2::melt(df, "weight.of.the.baby.at.birth")
>df2
weight.of.the.baby.at.birth variable value
1 52.92590 gestation.of.the.mom 304.9339
2 43.06043 gestation.of.the.mom 219.7718
3 56.19357 gestation.of.the.mom 310.6279
4 54.33385 gestation.of.the.mom 278.8190
5 51.95354 gestation.of.the.mom 247.4760
...
21 52.92590 height.of.the.mom 165.7853
22 43.06043 height.of.the.mom 185.3528
23 56.19357 height.of.the.mom 166.5677
24 54.33385 height.of.the.mom 179.8276
25 51.95354 height.of.the.mom 186.6949
Then plotting:
library(ggplot2)
ggplot(df2, aes(x=value, y=weight.of.the.baby.at.birth)) +
geom_point() + facet_grid(.~variable)
Output:
You can find other answers in: Pairs scatter plot; one vs many, and Plot one numeric variable against n numeric variables in n plots.
EDIT1:
To make the scales be different, add the scales="free" argument to facet_grid:
ggplot(df2, aes(x=value, y=weight.of.the.baby.at.birth)) +
geom_point() + facet_grid(.~variable, scales="free")
Output:
EDIT2:
As you want the fixed variable on your x axis, map it to x (and value to y), and move variable to the rows of facet_grid so that each panel gets its own y scale:
ggplot(df2, aes(x=weight.of.the.baby.at.birth, y=value)) +
geom_point() + facet_grid(variable~., scales="free")
Output:
EDIT3:
Creating the model:
reg = lm(weight.of.the.baby.at.birth ~ gestation.of.the.mom + height.of.the.mom, data = df)
Adding a column with the residuals (before reshaping), and then reshaping:
df$resid = resid(reg)
df2 <- reshape2::melt(df, c("weight.of.the.baby.at.birth","resid"))
Plotting:
ggplot(df2, aes(x=value, y=resid)) +
geom_point() + facet_grid(.~variable, scales="free")
Output:
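The reshaping step can also be written with tidyr::pivot_longer instead of reshape2::melt. This is only an alternative sketch, not part of the original answer. The data is simulated, and the short column names (gestation, height, bwt) and the object name baby_sim are assumptions made for readability.

```r
library(tidyr)
library(ggplot2)

set.seed(42)  # simulated data with the same shape as the dummy data above
baby_sim <- data.frame(
  gestation = rnorm(20, 300, 30),
  height    = rnorm(20, 170, 10),
  bwt       = rnorm(20, 50, 5)
)

# One row per (observation, predictor) pair; bwt stays in its own column
long <- pivot_longer(baby_sim, cols = -bwt,
                     names_to = "variable", values_to = "value")

# Each panel gets its own x range, while the y axis (bwt) is shared
p_long <- ggplot(long, aes(x = value, y = bwt)) +
  geom_point() +
  facet_wrap(~ variable, scales = "free_x")
p_long
```

facet_wrap with scales = "free_x" plays the same role here as facet_grid(. ~ variable, scales = "free") in the answer.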

ggplot2 - How to plot length of time using geom_bar?

I am trying to show different growing season lengths by displaying crop planting and harvest dates at multiple regions.
My final goal is a graph that looks like this:
which was taken from an answer to this question. Note that the dates are in julian days (day of year).
My first attempt to reproduce a similar plot is:
library(data.table)
library(ggplot2)
mydat <- "Region\tCrop\tPlanting.Begin\tPlanting.End\tHarvest.Begin\tHarvest.End\nCenter-West\tSoybean\t245\t275\t1\t92\nCenter-West\tCorn\t245\t336\t32\t153\nSouth\tSoybean\t245\t1\t1\t122\nSouth\tCorn\t183\t336\t1\t153\nSoutheast\tSoybean\t275\t336\t1\t122\nSoutheast\tCorn\t214\t336\t32\t122"
# read data as data table
mydat <- setDT(read.table(textConnection(mydat), sep = "\t", header=TRUE))
# melt data table
m <- melt(mydat, id.vars=c("Region","Crop"), variable.name="Period", value.name="value")
# plot stacked bars
ggplot(m, aes(x=Crop, y=value, fill=Period, colour=Period)) +
geom_bar(stat="identity") +
facet_wrap(~Region, nrow=3) +
coord_flip() +
theme_bw(base_size=18) +
scale_colour_manual(values = c("Planting.Begin" = "black", "Planting.End" = "black",
"Harvest.Begin" = "black", "Harvest.End" = "black"), guide = "none")
However, there are a few issues with this plot:
Because the bars are stacked, the values on the x-axis are aggregated and end up too high - out of the 1-365 scale that represents day of year.
I need to combine Planting.Begin and Planting.End in the same color, and do the same to Harvest.Begin and Harvest.End.
Also, a "void" (or a completely uncolored bar) needs to be created between Planting.Begin and Harvest.End.
Perhaps the graph could be achieved with geom_rect or geom_segment, but I really want to stick to geom_bar since it's more customizable (for example, it accepts scale_colour_manual in order to add black borders to the bars).
Any hints on how to create such graph?
I don't think this is something you can do with geom_bar or geom_col. A more general approach would be to use geom_rect to draw rectangles. To do this, we need to reshape the data a bit:
library(dplyr) # provides the %>% pipe
plotdata <- mydat %>%
dplyr::mutate(Crop = factor(Crop)) %>%
tidyr::pivot_longer(Planting.Begin:Harvest.End, names_to="period") %>%
tidyr::separate(period, c("Type","Event")) %>%
tidyr::pivot_wider(names_from=Event, values_from=value)
# Region Crop Type Begin End
# <chr> <fct> <chr> <int> <int>
# 1 Center-West Soybean Planting 245 275
# 2 Center-West Soybean Harvest 1 92
# 3 Center-West Corn Planting 245 336
# 4 Center-West Corn Harvest 32 153
# 5 South Soybean Planting 245 1
# ...
We've used tidyr to reshape the data so we have one row per rectangle that we want to draw, and we've also made Crop a factor. We can then plot it like this:
ggplot(plotdata) +
aes(ymin=as.numeric(Crop)-.45, ymax=as.numeric(Crop)+.45, xmin=Begin, xmax=End, fill=Type) +
geom_rect(color="black") +
facet_wrap(~Region, nrow=3) +
theme_bw(base_size=18) +
scale_y_continuous(breaks=seq_along(levels(plotdata$Crop)), labels=levels(plotdata$Crop))
The part that's a bit messy here is that we want a discrete scale for y but geom_rect prefers numeric values, so since the values are factors now, we use the numeric codes of the factor to create the ymin and ymax positions. Then we need to relabel the y axis with the names of the factor levels.
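To make the factor-to-position step concrete, here is a small base R sketch of what as.numeric() returns for a factor like Crop. The vector crop below is a stand-in for plotdata$Crop.

```r
crop <- factor(c("Soybean", "Corn", "Soybean", "Corn"))

levels(crop)      # "Corn" "Soybean" (levels are sorted alphabetically by default)
as.numeric(crop)  # 2 1 2 1

# Each rectangle spans 0.45 units either side of its integer position
ymin <- as.numeric(crop) - 0.45
ymax <- as.numeric(crop) + 0.45
```

Because the codes follow the level order, Corn is drawn at y = 1 and Soybean at y = 2, which is why the breaks and labels in scale_y_continuous are built from levels(plotdata$Crop).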
If you also wanted to get the month names on the x axis you could do something like
dateticks <- seq.Date(as.Date("2020-01-01"), as.Date("2020-12-01"),by="month")
# then add this to your plot
... +
scale_x_continuous(breaks=lubridate::yday(dateticks),
labels=lubridate::month(dateticks, label=TRUE, abbr=TRUE))
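If you would rather not depend on lubridate, the same breaks and labels can be computed with base R. This is a sketch, not part of the original answer; the names month_breaks and month_labels are made up for the example.

```r
# 2020 serves only as a reference year; it is a leap year, so March starts on day 61
dateticks <- seq(as.Date("2020-01-01"), as.Date("2020-12-01"), by = "month")

month_breaks <- as.integer(format(dateticks, "%j"))  # day of year of each month's first day
month_labels <- month.abb                            # "Jan", "Feb", ..., "Dec"

head(month_breaks, 3)  # 1 32 61
```

These can be passed to scale_x_continuous(breaks = month_breaks, labels = month_labels). For data from a non-leap year, the breaks from March onward would be one day earlier.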

using geom_bar to plot the sum of values by criteria in R

I'm new to R and I am trying to use ggplot to create a set of bar charts, one per id, in a single figure. Each bar must represent the sum of the values in column d by month-year (which is column c). d has NA values as well as numeric values.
My dataframe, df, is something like this, but it has actually around 10000 rows:
#Example of my data
a=c(1,1,1,1,1,1,1,1,3)
b=c("2007-12-03", "2007-12-10", "2007-12-17", "2007-12-24", "2008-01-07", "2008-01-14", "2008-01-21", "2008-01-28","2008-02-04")
c=format(as.Date(b),"%m-%Y")
d=c(NA,NA,NA,NA,NA,4.80, 0.00, 5.04, 3.84)
df=data.frame(a,b,c,d)
df
a b c d
1 1 2007-12-03 12-2007 NA
2 1 2007-12-10 12-2007 NA
3 1 2007-12-17 12-2007 NA
4 1 2007-12-24 12-2007 NA
5 1 2008-01-07 01-2008 NA
6 1 2008-01-14 01-2008 4.80
7 1 2008-01-21 01-2008 0.00
8 1 2008-01-28 01-2008 5.04
9 3 2008-02-04 02-2008 3.84
I tried to do my graph using this:
mplot<-ggplot(df,aes(y=d,x=c))+
geom_bar()+
theme(axis.text.x = element_text(angle=90, vjust=0.5))+
facet_wrap(~ a)
I read from the help of geom_bar():
"geom_bar uses stat_count by default: it counts the number of cases at each x position"
So I thought it would work like that, but I'm getting this error:
Error: stat_count() must not be used with a y aesthetic.
For the sample I'm providing, I would like to have the graph for id 1 that shows the months with NA empty and the 01-2008 with 9.84. Then for the second id, I would like to have again the months with NA empty and 02-2008 with 3.84.
I also tried summing the data per month with aggregate and sum before plotting, and then using identity in the stat parameter of geom_bar, but I'm getting NA in some months and I don't know the reason.
I really appreciate your help.
You should use geom_col not geom_bar. See the help text:
There are two types of bar charts: geom_bar makes the height of the bar proportional to the number of cases in each group (or if the weight aesthetic is supplied, the sum of the weights). If you want the heights of the bars to represent values in the data, use geom_col instead. geom_bar uses stat_count by default: it counts the number of cases at each x position. geom_col uses stat_identity: it leaves the data as is.
So your final line of code should be:
ggplot(df, aes(y=d, x=c)) + geom_col() + theme(axis.text.x = element_text(angle=90, vjust=0.5))+facet_wrap(~ a)
Do you want something like this:
mplot = ggplot(df, aes(x = b, y = d))+
geom_bar(stat = "identity", position = "dodge")+
facet_wrap(~ a)
mplot
I am using x = b instead of x = c for now.
No need to use geom_col as suggested by @Jan. Simply use the weight aesthetic instead:
ggplot(iris, aes(Species, weight=Sepal.Width)) + geom_bar() + ggtitle("summed sepal width")
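On the NA totals the question mentions when aggregating first: one likely reason is that sum() returns NA whenever any value in the group is NA, unless na.rm = TRUE is set. The sketch below assumes that is the cause. It rebuilds the sample data under the name dat (a name chosen for this example) and sums per id and month with dplyr before plotting.

```r
library(dplyr)
library(ggplot2)

b <- c("2007-12-03", "2007-12-10", "2007-12-17", "2007-12-24",
       "2008-01-07", "2008-01-14", "2008-01-21", "2008-01-28", "2008-02-04")
dat <- data.frame(
  a = c(1, 1, 1, 1, 1, 1, 1, 1, 3),
  c = format(as.Date(b), "%m-%Y"),
  d = c(NA, NA, NA, NA, NA, 4.80, 0.00, 5.04, 3.84)
)

# na.rm = TRUE drops missing values; a month with only NAs sums to 0
totals <- dat %>%
  group_by(a, c) %>%
  summarise(total = sum(d, na.rm = TRUE), .groups = "drop")

p_totals <- ggplot(totals, aes(x = c, y = total)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
  facet_wrap(~ a)
p_totals
```

With the sample data, id 1 gets 0 for 12-2007 (an empty bar) and 9.84 for 01-2008, and id 3 gets 3.84 for 02-2008. Note that c is a character column, so the x axis is ordered alphabetically rather than chronologically.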

Creating a line chart in r for the average value of groups

I'm trying to create simple line charts with R that connect data points showing the average of groups of respondents (it would also be nice to label them or distinguish them with different colors, etc.).
My data is in long format and sorted as shown (I also have it in wide format, if that's of any value):
ID gender week class motivation
1 male 0 1 100
1 male 6 1 120
1 male 10 1 130
...
2 female 0 1 90
2 female 6 1 NA
2 female 10 1 117
...
3 male 0 2 89
3 male 6 2 112
3 male 10 2 NA
...
Basically, every respondent was measured a total of n times and the occasions (week) were the same for everyone. Some respondents were missing during one or more occasions. Let's say for motivation. Variables like gender, class and ID don't change, motivation does.
I tried to get a line chart using ggplot2
## define base for the graphs and store in object 'plot'
plot <- ggplot(data = DataRlong, aes(x = week, y = motivation, group = gender))
plot + geom_line()
As grouping variable, I want to use class or gender for example.
However, my approach does not lead to lines that connect the averages per group.
I also get vertical lines for each measurement occasion. What does this mean? The only way I could imagine fixing this is to create a new variable average.motivation, compute the average for every group per occasion, and then assign this average to all members of the group. However, this would mean that I would have to do this for every single grouping variable when I want to display group lines based on another factor.
Also, how does the plot handle missing data? (If one member of a group has a missing value, I still want the group average of this occasion to calculate the point rather than omitting the whole occasion for that group ).
Edit:
Thank you, the solution with dplyr works great for all my categorical variables.
Now, I'm trying to figure out how I can distinguish between subgroups by colouring their lines based on a second/third factor.
For example, I plot 20 lines for the groups of "class2", but rather than having all of them in 20 different colors, I would like them to use the same colour, if they belong to the same type of class ("class_type", e.g. A, B or C =20 lines, three groups of colours).
I've added the second factor to "mean_data2". That works well. Next, I've tried to change the colour argument in ggplot, (also tried as in geom_line), but that way, I don't have 20 lines anymore.
mean_data2 <- group_by(DataRlong, class2, class_type, occ) %>%
summarise(procras = mean(procras, na.rm = TRUE))
library(ggplot2)
ggplot(na.omit(mean_data2), aes(x = occ, y = procras,
colour=class2)) + geom_point() + geom_line(aes(colour=class_type))
You can also use the dplyr package to aggregate the data:
library(dplyr)
mean_data <- group_by(data, gender, week) %>%
summarise(motivation = mean(motivation, na.rm = TRUE))
You can use na.omit() to get rid of the NA values as follows:
library(ggplot2)
ggplot(na.omit(mean_data), aes(x = week, y = motivation, colour = gender)) +
geom_point() + geom_line()
There is no need here to explicitly use the group aesthetic because ggplot will automatically group the lines by the categorical variables in your plot. And the only categorical variable you have is gender. (See this answer for more information).
Another possibility is using stat_summary, so you can do it with ggplot alone (fun replaces the fun.y argument, which was deprecated in ggplot2 3.3.0).
ggplot(data = DataRlong, aes(x = week, y = motivation, group = gender)) +
stat_summary(geom = "line", fun = mean)
You almost certainly have to make sure those grouping variables are factors.
I'm not quite sure what you want, but here's a shot...
library("ggplot2")
df <- read.table(textConnection("ID gender week class motivation
1 male 0 1 100
1 male 6 1 120
1 male 10 1 130
2 female 0 1 90
2 female 6 1 NA
2 female 10 1 117
3 male 0 2 89
3 male 6 2 112
3 male 10 2 NA"), header=TRUE, stringsAsFactors=FALSE)
df2 <- aggregate(df$motivation, by=list(df$gender, df$week),
function(x)mean(x, na.rm=TRUE))
names(df2) <- c("gender", "week", "avg")
df2$gender <- factor(df2$gender)
ggplot(data = df2[!is.na(df2$avg), ],
aes(x = week, y = avg, group=gender, color=gender)) +
geom_point()+geom_line()
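For the follow-up in the question's edit (one line per class, coloured by class type), one option is to map group to the fine-grained variable and colour to the coarser one. The sketch below uses hypothetical summary data: mean_demo, its six classes, and the A/B/C assignment are invented for illustration, and only the column names follow the question.

```r
library(ggplot2)

# Hypothetical summary data: 6 classes, each belonging to one of 3 class types
mean_demo <- expand.grid(class2 = factor(1:6), occ = c(0, 6, 10))
mean_demo$class_type <- c("A", "B", "C")[(as.integer(mean_demo$class2) - 1) %/% 2 + 1]
set.seed(3)
mean_demo$procras <- rnorm(nrow(mean_demo), mean = 50, sd = 5)

# group decides which points are joined into a line; colour only sets the hue
p_lines <- ggplot(mean_demo, aes(x = occ, y = procras,
                                 group = class2, colour = class_type)) +
  geom_point() +
  geom_line()
p_lines
```

This draws six lines in three colours. Setting colour = class_type without a group aesthetic joins all points of the same type into a single line, which would explain why the 20 lines in the question disappeared.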
