I have a peculiar problem with ordering boxplots along the x-axis. I am adding two boxplots from different data frames to the same plot, and each time I add the second geom_boxplot, R reorders my x-axis alphabetically instead of following the ordered levels of factor(x).
So, I have two data frames of different lengths looking something like this:
df1:
id value
1 A 1
2 A 2
3 A 3
4 A 5
5 B 10
6 B 8
7 B 1
8 C 3
9 C 7
df2:
id value
1 A 4
2 A 5
3 B 6
4 B 8
There are always more observations per id in df1 than in df2, and there are some ids in df1 that are not available in df2.
I'd like df1 to be sorted by the median(value) (ascending) and to first plot boxplots for each id in that order.
Then I add a second layer with boxplots for all other measurements per id from df2, which should maintain the same order on the x-axis.
Here's how I approached that:
vec <- df1 %>%
  group_by(id) %>%
  summarize(m = median(value)) %>%
  arrange(m) %>%
  pull(id)
p1 <- df1 %>%
  ggplot(aes(x = factor(id, levels = vec), y = value)) +
  geom_boxplot()
p1
p2 <- p1 +
  geom_boxplot(data = df2, aes(x = factor(id, levels = vec), y = value))
p2
p1 shows the right order (ids are ordered based on ascending medians), p2 always throws my order off and goes back to plotting ids alphabetically (my id is a character column with names actually). I tried with sample dataframes and the above code achieves what is required. Hence, I am not sure what could be specifically wrong about my data so that the code fails when applied to the specific data and not the above mock data.
Any ideas?
Thanks a lot in advance!
If I understood correctly, this should work.
library(tidyverse)
# Sample data
df1 <-
  tibble(
    id = c("A","A","A","A","B","B","B","C","C"),
    value = c(1,2,3,5,10,8,1,3,7),
    type = "df1"
  )
df2 <-
  tibble(
    id = c("A","A","B","B"),
    value = c(4,5,6,8),
    type = "df2"
  )
# Create single data.frame
df <-
  df1 %>%
  bind_rows(df2) %>%
  # Reorder id by median(value)
  mutate(id = fct_reorder(id, value, median))
df %>%
  ggplot(aes(id, y = value, fill = type)) +
  geom_boxplot()
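Note that the code above orders the ids by the median of the combined data, whereas the question asks for the order given by df1 alone. A minimal sketch of an alternative, assuming the same sample data: compute the order from df1 only and fix it on the axis with scale_x_discrete(limits = ...), so the order does not depend on how the factor levels of the two layers are merged. The fill and width values are only illustrative.

```r
library(dplyr)
library(ggplot2)

df1 <- tibble(
  id = c("A","A","A","A","B","B","B","C","C"),
  value = c(1,2,3,5,10,8,1,3,7)
)
df2 <- tibble(
  id = c("A","A","B","B"),
  value = c(4,5,6,8)
)

# ids ordered by ascending median of value, computed from df1 only
vec <- df1 %>%
  group_by(id) %>%
  summarize(m = median(value)) %>%
  arrange(m) %>%
  pull(id)

# id stays a plain character column; the scale limits set the axis order
p <- ggplot(df1, aes(x = id, y = value)) +
  geom_boxplot() +
  geom_boxplot(data = df2, fill = "grey80", width = 0.4) +
  scale_x_discrete(limits = vec)
p
```

Because the limits are set on the scale itself, ids that are missing from df2 simply get no second box.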
I actually need help building on this question:
ggplot2 graphic order by grouped variable instead of in alphabetical order.
I need to produce a similar graph, and I have a problem with the black points. I have data where the column names are dates and the rows are filled with 0 or 1, and I need to plot a point wherever the value is 1. To reproduce, here is a small sample (in my dataset, there are over 300 columns):
df <- data.frame(id=c(1,2,3),
"26April1970"=c(0,0,1),
"14August1970"=c(0,1,0))
I need to plot the dates on the x axis, match the id to the canton and show the points where the value is 1.
Could anyone help?
Try this:
library(dplyr)
library(tidyr)
library(ggplot2)
plot_data = df %>%
  ## put data in long format
  pivot_longer(-id, names_to = "colname") %>%
  ## keep only 1s
  filter(value == 1) %>%
  ## convert dates to Date class
  mutate(date = as.Date(colname, format = "%d%B%Y"))
plot_data
# # A tibble: 2 x 4
# id colname value date
# <dbl> <chr> <dbl> <date>
# 1 2 14August1970 1 1970-08-14
# 2 3 26April1970 1 1970-04-26
## plot
ggplot(plot_data, aes(x = date, y = factor(id))) +
geom_point()
Using this data:
df <- data.frame(id=c(1,2,3),
"26April1970"=c(0,0,1),
"14August1970"=c(0,1,0), check.names = FALSE)
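The check.names = FALSE part matters: by default data.frame() passes column names through make.names(), which prefixes a name starting with a digit with "X", and the date format above then no longer matches. A small sketch of the difference (the object names checked and unchecked are illustrative):

```r
# Default: names are made syntactically valid, so a leading digit gets an "X" prefix
checked <- data.frame(id = c(1, 2, 3),
                      "26April1970" = c(0, 0, 1),
                      "14August1970" = c(0, 1, 0))
names(checked)

# check.names = FALSE keeps the names exactly as written
unchecked <- data.frame(id = c(1, 2, 3),
                        "26April1970" = c(0, 0, 1),
                        "14August1970" = c(0, 1, 0),
                        check.names = FALSE)
names(unchecked)

# Parsing month names with %B depends on the locale (an English locale is assumed here)
as.Date(names(unchecked)[-1], format = "%d%B%Y")
```

If the data are already loaded with the "X" prefix, stripping it first with sub("^X", "", colname) should work as well.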
Maybe you are looking for this:
library(ggplot2)
library(dplyr)
library(tidyr)
#Data
df <- data.frame(id=c(1,2,3),
"26April1970"=c(0,0,1),
"14August1970"=c(0,1,0))
#Code
df %>% pivot_longer(-id) %>%
ggplot(aes(x=name,y=factor(value)))+
geom_point(aes(color=factor(value)))+
scale_color_manual(values=c('transparent','black'))+
theme(legend.position = 'none')+xlab('Date')+ylab('value')
I have a list of decimal numbers, ranging from 1 to 40K and I am trying to plot a frequency histogram together with the total sum of a given bin. I'm attempting to do it using ggplot2 but getting lost on how to use the same x axis bins from the histogram:
sales <- data.frame(amount = runif(100, min=0, max=40000))
h <- hist(sales$amount, breaks=b)
sales$groups <- cut(sales$amount, breaks=h$breaks)
ggplot(sales,aes(x=groups)) +
geom_bar(stat="count")+
geom_bar(aes(x=groups, y=amount), stat="identity") +
scale_y_continuous(sec.axis = sec_axis(~.*5, name = "sum"))
I managed to create both graphs independently, but they seem to overwrite each other.
If I understand correctly, you are trying to plot two different variables (Count and Sum) in the same bar graph. As they have very different ranges, you need to define a secondary y axis.
First, the grammar of ggplot2 asks for one column for x values, one column for y values and one or several columns for groups (this is a very brief and rough summary of my understanding of how ggplot2 works).
Here, the idea is to have your "breaks" as the x variable, a second column with all the y values to be plotted, and a group column indicating whether a y value belongs to the group "Count" or "amount". You can achieve this using the dplyr and tidyr packages:
set.seed(123)
sales <- data.frame(amount = runif(100, min=0, max=40000))
b = 4
h <- hist(sales$amount, breaks=b)
sales$groups <- cut(sales$amount, breaks=h$breaks)
library(tidyr)
library(dplyr)
sales %>% group_by(groups) %>% mutate(Count = n()) %>%
pivot_longer(.,cols = c(Count, amount), names_to = "Variable", values_to = "Value")
# A tibble: 200 x 3
# Groups: groups [4]
groups Variable Value
<fct> <chr> <dbl>
1 (1e+04,2e+04] Count 27
2 (1e+04,2e+04] amount 11503.
3 (3e+04,4e+04] Count 27
4 (3e+04,4e+04] amount 31532.
5 (1e+04,2e+04] Count 27
6 (1e+04,2e+04] amount 16359.
7 (3e+04,4e+04] Count 27
8 (3e+04,4e+04] amount 35321.
9 (3e+04,4e+04] Count 27
10 (3e+04,4e+04] amount 37619.
# … with 190 more rows
However, if you plot this directly, you will get a poor plot in which the bars for "Count" are tiny compared to those for "amount":
library(ggplot2)
library(tidyr)
library(dplyr)
sales %>% group_by(groups) %>% mutate(Count = n()) %>%
pivot_longer(.,cols = c(Count, amount), names_to = "Variable", values_to = "Value")%>%
ggplot(aes(x=groups, y = Value, fill = Variable)) +
geom_bar(stat="identity", position = position_dodge())
So, you can try to add a secondary y axis using the sec.axis argument of scale_y_continuous. However, this won't change your plot; it simply creates a "fake" right axis whose scale is the left axis transformed by the formula you pass to sec.axis.
So, if you want both groups of values to be visible on your graph, you need to either scale down "amount" or scale up "Count" so that both groups have a similar range of values.
Here, as you want to have the sum on the right axis, we will scale down "amount" so that its values fall in the same range as the "Count" values.
On the graph, you can see that the "amount" values reach around 40000 whereas the maximum value of "Count" is 30. So, you can choose the following scale factor: 40000 / 30 = 1333.333, rounded here to 1300.
So now, if you create a second column called "Amount" that is "amount" divided by 1300, "Amount" and "Count" will be in the same range. Your data will now look like this:
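A minimal sketch of that point, assuming explicit breaks at multiples of 10000 in place of h$breaks and using layer_data() to read back the computed bar heights: adding sec.axis leaves the bars themselves untouched.

```r
library(ggplot2)
library(dplyr)
library(tidyr)

set.seed(123)
sales <- data.frame(amount = runif(100, min = 0, max = 40000))
# Assumption: explicit breaks instead of h$breaks from hist()
sales$groups <- cut(sales$amount, breaks = seq(0, 40000, by = 10000))

long <- sales %>%
  group_by(groups) %>%
  mutate(Count = n()) %>%
  ungroup() %>%
  pivot_longer(cols = c(Count, amount), names_to = "Variable", values_to = "Value")

p_plain <- ggplot(long, aes(x = groups, y = Value, fill = Variable)) +
  geom_bar(stat = "identity", position = position_dodge())

# Same plot, plus a right-hand axis that is just the left axis times 1300
p_sec <- p_plain +
  scale_y_continuous(name = "Count", sec.axis = sec_axis(~ . * 1300, name = "Sum"))

# Compare the bar heights computed for the two plots
identical(layer_data(p_plain)$y, layer_data(p_sec)$y)
```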
library(dplyr)
library(tidyr)
sales %>% group_by(groups) %>% mutate(Count = n()) %>%
mutate(Amount = amount /1300) %>%
pivot_longer(.,cols = c(Count, Amount), names_to = "Variable", values_to = "Value")
# A tibble: 200 x 4
# Groups: groups [4]
amount groups Variable Value
<dbl> <fct> <chr> <dbl>
1 24000. (2e+04,3e+04] Count 30
2 24000. (2e+04,3e+04] Amount 18.5
3 13313. (1e+04,2e+04] Count 30
4 13313. (1e+04,2e+04] Amount 10.2
5 19545. (1e+04,2e+04] Count 30
6 19545. (1e+04,2e+04] Amount 15.0
7 38179. (3e+04,4e+04] Count 20
8 38179. (3e+04,4e+04] Amount 29.4
9 19316. (1e+04,2e+04] Count 30
10 19316. (1e+04,2e+04] Amount 14.9
# … with 190 more rows
For the secondary y axis to reflect the real "amount" values, you pass the inverse scale factor to sec_axis, i.e. you multiply by 1300.
Altogether, you get the following code:
library(ggplot2)
library(dplyr)
library(tidyr)
sales %>% group_by(groups) %>% mutate(Count = n()) %>%
mutate(Amount = amount /1300) %>%
pivot_longer(.,cols = c(Count, Amount), names_to = "Variable", values_to = "Value") %>%
ggplot(aes(x=groups, y = Value, fill = Variable)) +
geom_bar(stat="identity", position = position_dodge()) +
scale_y_continuous(name = "Count",sec.axis = sec_axis(~.*1300, name = "Sum"))
Thus, you get the illusion of having plotted two different groups of values on two different scales.
Hope that this long explanation was helpful for you.
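One caveat with the code above: each bin still contains one row per sale, so the dodged bars appear to be drawn on top of each other rather than added up. If the goal is the total per bin, a minimal sketch (assuming explicit breaks at multiples of 10000, and a scale factor k computed from the data rather than fixed at 1300) is to summarise to one row per bin first:

```r
library(ggplot2)
library(dplyr)
library(tidyr)

set.seed(123)
sales <- data.frame(amount = runif(100, min = 0, max = 40000))
# Assumption: explicit breaks instead of h$breaks from hist()
sales$groups <- cut(sales$amount, breaks = seq(0, 40000, by = 10000))

# One row per bin: number of sales and total amount
per_bin <- sales %>%
  group_by(groups) %>%
  summarise(Count = n(), Sum = sum(amount))

# Factor that maps the largest sum onto the largest count
k <- max(per_bin$Sum) / max(per_bin$Count)

per_bin %>%
  mutate(Sum = Sum / k) %>%
  pivot_longer(cols = c(Count, Sum), names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = groups, y = Value, fill = Variable)) +
  geom_col(position = position_dodge()) +
  # The right axis undoes the scaling so it reads in the original units
  scale_y_continuous(name = "Count", sec.axis = sec_axis(~ . * k, name = "Sum"))
```

Computing k from the summarised data keeps both sets of bars at a comparable height even if the number of bins or the data change.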
I have a data.table / data frame with lists as values. I would like to make a box or violin plot of the values, one violin/box representing one row of my data set, but I can't figure out how.
Example:
library(data.table)
library(ggplot2)
test.dt <- data.table(id = c('a','b','c'), v1 = list(c(1,0,10),1:5,3))
ggplot(data = test.dt, aes(x = as.factor(id), y = v1)) + geom_boxplot()
I get the following message:
Warning message:
Computation failed in stat_boxplot():
'x' must be atomic
So my guess is that maybe I should split the lists of the values to rows somehow. I.e.: the row with a as id would be transformed to 3 rows (corresponding to the length of the vector in v1) with the same id, but the values would be split among them.
Firstly I don't know how to transform the data.table as mentioned, secondly I don't know either if this would be the solution at all.
Indeed, you need to unnest your dataset before plotting:
library(tidyverse)
# Recent tidyr versions ask for the list column to be named explicitly via cols
unnest(test.dt, cols = v1) %>%
  ggplot(data = ., aes(x = as.factor(id), y = v1)) + geom_boxplot()
I believe what you are looking for is the very handy unnest() function. The following code works:
library(data.table)
library(tidyverse)
test.dt <- data.table(id = c('a','b','c'), v1 = list(c(1,0,10),1:5,3))
test.dt = test.dt %>% unnest(cols = v1)
ggplot(test.dt, aes(x = as.factor(id), y = v1)) +
geom_boxplot()
If you don't want to import the whole tidyverse, the unnest() function is from the tidyr package.
This is what unnest() does with example data:
> data.table(id = c('a','b','c'), v1 = list(c(1,0,10),1:5,3))
id v1
1: a 1, 0,10
2: b 1,2,3,4,5
3: c 3
> data.table(id = c('a','b','c'), v1 = list(c(1,0,10),1:5,3)) %>% unnest()
id v1
1: a 1
2: a 0
3: a 10
4: b 1
5: b 2
6: b 3
7: b 4
8: b 5
9: c 3
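Since the data is already a data.table, the same reshaping can probably be done without tidyr. A minimal sketch, assuming all list elements are numeric (the as.numeric() call is there so that every group returns the same column type; long.dt is an illustrative name):

```r
library(data.table)
library(ggplot2)

test.dt <- data.table(id = c('a','b','c'), v1 = list(c(1,0,10), 1:5, 3))

# One row per element of each list entry, with id repeated
long.dt <- test.dt[, .(v1 = as.numeric(unlist(v1))), by = id]

ggplot(long.dt, aes(x = as.factor(id), y = v1)) +
  geom_boxplot()
```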
I have two columns in a data.frame, that should have levels sorted in the same order, but I don't know how to do it in a straightforward manner.
Here's the situation:
library(ggplot2)
library(dplyr)
library(magrittr)
set.seed(1)
df1 <- data.frame(rating = sample(c("GOOD","BAD","AVERAGE"),10,T),
div = sample(c("A","B","C"),10,T),
n = sample(100,10,T))
# I'm adding a label column that I use for plotting purposes
df1 <- df1 %>% group_by(rating) %>% mutate(label = paste0(rating," (",sum(n),")")) %>% ungroup
# # A tibble: 10 x 4
# rating div n label
# <fctr> <fctr> <int> <chr>
# 1 BAD C 48 BAD (220)
# 2 BAD B 87 BAD (220)
# 3 BAD C 44 BAD (220)
# 4 GOOD B 25 GOOD (77)
# 5 AVERAGE B 8 AVERAGE (117)
# 6 AVERAGE C 10 AVERAGE (117)
# 7 AVERAGE A 32 AVERAGE (117)
# 8 GOOD B 52 GOOD (77)
# 9 AVERAGE C 67 AVERAGE (117)
# 10 BAD C 41 BAD (220)
# rating levels are sorted
df1$rating <- factor(df1$rating,c("BAD","AVERAGE","GOOD"))
ggplot(df1,aes(x=rating,y=n,fill=div)) + geom_col() # plots in the order I want
ggplot(df1,aes(x=label,y=n,fill=div)) + geom_col() # doesn't because levels aren't sorted
How do I copy the factor order from one column to another?
I can make it work this way but I think it's really awkward:
lvls <- df1 %>% select(rating,label) %>% unique %>% arrange(rating) %>% extract2("label")
df1$label <- factor(df1$label,lvls)
ggplot(df1,aes(x=label,y=n,fill=div)) + geom_col()
Instead of adding a label column and using aes(x = label, ...), you may stick with aes(x = rating, ...) and create the labels in scale_x_discrete:
ggplot(df1, aes(x = rating, y = n, fill = div)) +
  geom_col() +
  scale_x_discrete(labels = df1 %>%
                     group_by(rating) %>%
                     summarize(n = sum(n)) %>%
                     mutate(lab = paste0(rating, " (", n, ")")) %>%
                     pull(lab))
Once you have set the levels of rating, you can use forcats to set the levels of label by the order of rating like this...
library(forcats)
df1 <- df1 %>% group_by(rating) %>%
mutate(label=paste0(rating," (",sum(n),")")) %>%
ungroup %>%
arrange(rating) %>% #sort by rating
mutate(label=fct_inorder(label)) #set levels by order in which they appear
Or you can use forcats::fct_reorder to do the same thing...
df1$label <- fct_reorder(df1$label, as.numeric(df1$rating))
The plot then has the bars in the right order.
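For comparison, a base R sketch of the same idea on a small made-up data frame (the values are illustrative, not the sampled data above): sort the labels by rating and use their order of first appearance as the levels.

```r
# Made-up data for illustration
df1 <- data.frame(rating = c("BAD", "GOOD", "AVERAGE", "BAD", "GOOD"),
                  n = c(48, 25, 8, 87, 52))
df1$rating <- factor(df1$rating, levels = c("BAD", "AVERAGE", "GOOD"))

# Label = rating plus the total n within that rating
df1$label <- paste0(df1$rating, " (", ave(df1$n, df1$rating, FUN = sum), ")")

# Order the rows by rating, then take each label in order of first appearance
df1$label <- factor(df1$label, levels = unique(df1$label[order(df1$rating)]))
levels(df1$label)
```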
I want to make a stacked barchart that describes abundances of taxa at two locations in three different seasons. I'm using ggplot2. Making the plot is ok, but I have 48 taxa so I end up with a lot of different colours in the bar. There are only eight taxa that occur frequently and abundantly, so I'd like to group the others into "Other" for the plot.
My data looks like this:
SampleID TransectID SampleYear Season Location Taxa1 Taxa2 Taxa3 .... Taxa48
BW15001 1 2015 fall SiteA 25 0 0 0
BW15001 2 2015 fall SiteA 32 0 0 2
BW15001 2 2015 fall SiteA 6 0 45 0
BW15001 3 2015 fall SiteA 78 1 2 0
This is what I have tried (modified from here):
y <- rowSums(invert[6:54])
x<-invert[6:54]/y
x<-invert[,order(-colSums(x))]
#Extract list of top N Taxa
N<-8
taxa_list<-colnames(x)[1:N]
#remove "__Unknown__" and add it to others
taxa_list<-taxa_list[!grepl("Unknown",taxa_list)]
N<-length(taxa_list)
#Generate a new table with everything added to Others
new_x<-data.frame(x[,colnames(x) %in% taxa_list],
Others=rowSums(x[,!colnames(x) %in% taxa_list]))
df<-NULL
for (i in 1:dim(new_x)[2]){
tmp<-data.frame(row.names=NULL,Sample=rownames(new_x),
Taxa=rep(colnames(new_x)[i],dim(new_x) [1]),Value=new_x[,i],Type=grouping_info[,1])
if(i==1){df<-tmp} else {df<-rbind(df,tmp)}
}
To plot the graph:
colours <- c("#F0A3FF", "#0075DC", "#993F00","#4C005C","#2BCE48","#FFCC99","#808080","#94FFB5","#8F7C00","#9DCC00","#C20088","#003380","#FFA405","#FFA8BB","#426600","#FF0010","#5EF1F2","#00998F","#740AFF","#990000","#FFFF00");
library(ggplot2)
p<-ggplot(df,aes(Sample,Value,fill=Taxa))+
geom_bar(stat="identity")+
facet_grid(. ~ Type, drop=TRUE,scale="free",space="free_x")
p<-p+scale_fill_manual(values=colours[1:(N+1)])
p<-p+theme_bw()+ylab("Proportions")
p<-p+ scale_y_continuous(expand = c(0,0))+
theme(strip.background = element_rect(fill="gray85"))+
theme(panel.spacing = unit(0.3, "lines"))
p<-p+theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
p
The main problem that I would like help with today is pulling out the main taxa and lumping the rest as "Other". I think I can figure out how to group the graph by Season and Location using facet_grid() later...
Thanks!
Expanding on my comment. Take a look at the forcats package. Without a full example, it's hard to say, but the following should work:
library(tidyverse)
library(forcats)
temp <- df %>%
gather(taxa, amount, -c(1:5))
# Reshape the data so that there is one record per unit of amount
tidy_df <- temp[rep(rownames(temp), times = temp$amount), ]
tidy_df %>%
select(-amount) %>%
mutate(taxa = fct_lump(taxa, n = 2)) %>% # Check out this line
ggplot(., aes(x = SampleID, fill = taxa)) +
geom_bar()
You can change fct_lump(taxa, n = 2) to fct_lump(taxa, n = 8) to group the top 8 categories. Alternatively, you can use fct_lump(taxa, prop = 0.9) to lump things up by proportions.
If you are simply going after the "presence" of the taxa in a sample (and not the value or amount), things are a bit simpler and can likely be handled in one pipe:
df %>%
gather(taxa, amount, -c(1:5)) %>%
mutate(amount = na_if(amount, 0)) %>%
na.omit() %>%
mutate(taxa = fct_lump(taxa, n = 2)) %>%
ggplot(., aes(x = SampleID, fill = taxa)) +
geom_bar()
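If the abundances matter, replicating rows can be avoided by weighting the lumping instead. A minimal sketch, assuming a forcats version that provides fct_lump_n() with its w argument, and using a small made-up table (invert, with illustrative values) in place of the real data:

```r
library(dplyr)
library(tidyr)
library(forcats)
library(ggplot2)

# Made-up abundance table for illustration
invert <- tibble(SampleID = c("S1", "S1", "S2", "S2"),
                 Taxa1 = c(25, 32, 6, 78),
                 Taxa2 = c(0, 0, 0, 1),
                 Taxa3 = c(0, 0, 45, 2),
                 Taxa4 = c(0, 2, 0, 0))

long <- invert %>%
  pivot_longer(starts_with("Taxa"), names_to = "taxa", values_to = "amount") %>%
  # Keep the 2 taxa with the largest total amount, lump the rest
  mutate(taxa = fct_lump_n(taxa, n = 2, w = amount, other_level = "Other"))

levels(long$taxa)

ggplot(long, aes(x = SampleID, y = amount, fill = taxa)) +
  geom_col()
```

With the real data, n = 8 would keep the eight most abundant taxa.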
One way of doing it:
library(plyr)
library(reshape2) # provides melt()
library(ggplot2)
d=data.frame(SampleID=rep('BW15001',4),
             TransectID=c(1,2,2,3),
             SampleYear=rep(2015,4),
             Taxa1=c(25,32,6,78),
             Taxa2=c(0,0,0,1),
             Taxa3=c(0,0,45,3))
#Reshape the df so that all taxa columns are melted into two
d=melt(d,id=colnames(d[,1:3]))
d$variable=as.character(d$variable)
# rename all uninteresting taxa as 'other'
`%ni%` <- Negate(`%in%`) # Here I decided to select the ones to keep, but the other way around is fine as well of course
d[d$variable %ni% c('Taxa1','Taxa2'),'variable']='Other' #here you could add a function to automatically determine which taxa you want to keep, as you already did
# aggregate all data for 'other'
d=ddply(d,colnames(d[,1:4]),summarise,value=sum(value))
#make your plot, this one is just a bad example
#(no facet_grid by Type here, since this sample data has no Type column)
ggplot(d,aes(SampleID,value,fill=variable))+
  geom_bar(stat="identity")
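The step of choosing which taxa to keep could be automated along these lines. A base R sketch on made-up data (the column names, values and N are illustrative): rank the taxa by their column totals, keep the top N and sum the rest into "Other".

```r
# Made-up abundance table for illustration
invert <- data.frame(SampleID = c("S1", "S1", "S2", "S2"),
                     Taxa1 = c(25, 32, 6, 78),
                     Taxa2 = c(0, 0, 0, 1),
                     Taxa3 = c(0, 0, 45, 2),
                     Taxa4 = c(0, 2, 0, 0))
N <- 2

# Taxa columns only, then the names of the N columns with the largest totals
taxa <- invert[, grep("^Taxa", names(invert))]
top <- names(sort(colSums(taxa), decreasing = TRUE))[seq_len(N)]
rest <- setdiff(names(taxa), top)

# Keep the top taxa as they are and add the row-wise sum of everything else
lumped <- cbind(invert["SampleID"], taxa[top], Other = rowSums(taxa[rest]))
lumped
```

The resulting table can then be melted and plotted as in the code above.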