Subgroup axes ggplot2 similar to Excel PivotChart - r

I am trying to plot Months with Year as subgroups on a chart with ggplot2. That is, something that looks like this:
A similar question was answered here, but I am hoping there is a better way that avoids hardcoding the axis labels.
The R code for the data frame is as follows:
set.seed(100)
df = data.frame( Year = rep(c(rep(2013, 12), rep(2014, 9)), 2)
, Month = rep(rep(month.abb, 2)[1:21], 2)
, Condition = rep(c("A", "B"), each=21)
, Value = runif(42))
As a bonus, I would appreciate learning how to plot smoothed totals by year without introducing a new variable (if this is possible?). If I use dplyr to summarise and group_by Year and Month, the order of the months is not preserved.
Notice, Month now starts at Apr:
group_by(df, Year, Month) %>% summarise(total = sum(Value)) %>% head
Source: local data frame [6 x 3]
Groups: Year
Year Month total
1 2013 Apr 0.4764846
2 2013 Aug 0.9194172
3 2013 Dec 1.2308575
4 2013 Feb 0.7960212
5 2013 Jan 1.0185700
6 2013 Jul 1.6943562

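Try this:
library(ggplot2)
library(gtable)
library(grid) # grid.newpage(), grid.draw()
library(magrittr) # %>%
# an ordered factor keeps the months in calendar order
df$Month <- factor(df$Month, levels=month.abb)
p <- ggplot(df, aes(Month, Value, colour=Condition, group=Condition))+
facet_grid(.~Year) + geom_line() + theme_minimal()
g <- ggplotGrob(p)
# move the strip row to the bottom of the table; the index 3 depends on
# the ggplot2 version, so check g$layout if the result looks wrong
g2 <- g[-3,] %>%
gtable_add_rows(heights = g$heights[3], nrow(g)-3) %>%
gtable_add_grob(g[3,], t = nrow(g)-2, l=1, r=ncol(g))
grid.newpage()
grid.draw(g2)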
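A possible alternative that avoids editing the gtable by hand is sketched below. It assumes a ggplot2 version that supports the switch argument of facet_grid() and the strip.placement theme setting, and it reuses the question's data. The same ordered factor also keeps the months in calendar order when summarising with dplyr, which addresses the bonus question about month order.

```r
library(ggplot2)
library(dplyr)

set.seed(100)
df <- data.frame(
  Year      = rep(c(rep(2013, 12), rep(2014, 9)), 2),
  Month     = rep(rep(month.abb, 2)[1:21], 2),
  Condition = rep(c("A", "B"), each = 21),
  Value     = runif(42)
)

# An ordered factor keeps Jan..Dec order in plots and in group_by() output
df$Month <- factor(df$Month, levels = month.abb)

p <- ggplot(df, aes(Month, Value, colour = Condition, group = Condition)) +
  geom_line() +
  # switch = "x" draws the Year strips below the panels;
  # free scales/space give 2014 only the nine months it has
  facet_grid(. ~ Year, switch = "x", scales = "free_x", space = "free_x") +
  theme_minimal() +
  # put the strips outside (below) the month labels and close the panel gap
  theme(strip.placement = "outside", panel.spacing = unit(0, "lines"))
print(p)

# Totals per Year and Month now come back in calendar order
totals <- df %>%
  group_by(Year, Month) %>%
  summarise(total = sum(Value), .groups = "drop")
head(totals)
```

The Year labels then sit underneath the month labels, similar to an Excel PivotChart axis, without hardcoding any label text.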

Related

Cumulative stacked bar plot with the same variable with ggplot2

I have a data frame, producers, with two columns: person_id and year.
# A tibble: 3,207 x 2
person_id year
<chr> <chr>
1 GASH1991-04-30 2020
2 LOSP1969-06-29 2020
3 CRGM1989-08-26 2020
4 CEVE1954-07-15 2020
5 HERR1998-01-06 2020
6 TOLR1951-04-09 2020
7 BEAM1953-09-07 2020
8 ANRJ1977-07-06 2020
9 PAMH1982-02-06 2020
10 AKTE1967-11-15 2020
# ... with 3,197 more rows
I can summarise this dataframe to obtain cumulative sum:
producers %>%
select(person_id, year) %>%
group_by(year) %>%
distinct(person_id) %>%
summarise(total = n()) %>%
ungroup() %>%
mutate(cum = cumsum(total))
# A tibble: 3 x 3
year total cum
<chr> <int> <int>
1 2019 456 456
2 2020 1832 2288
3 2021 160 2448
And I can make a cumulative bar plot like this:
ggplot(producers, aes(x = as.factor(year), y = as.integer(cum))) +
geom_bar(position = "stack", stat = "identity") +
ylim(0,3000) +
xlab("Year") +
ylab("Producers") +
theme_classic()
But what I really want is something like this:
I've been trying with aes(fill = year) and other arguments but I can't get it. Thanks for your responses.
Here's an approach. Ultimately, we'll need two "year" variables, one to mark the category within each stack, and one to mark which stack we want it to appear in. Here, I set up year2 for the 2nd one, and filter out the values that shouldn't appear yet in each stack.
df2 <- data.frame(
year = 2019:2021,
total = c(456, 1832, 160)
)
library(tidyverse)
df2 %>%
crossing(year2 = df2$year) %>% # make copy for each year
filter(year <= year2) %>% # keep just the years up to current year
ggplot(aes(year2, total, fill = fct_rev(as.factor(year)))) +
geom_col() +
scale_fill_discrete(name = "Year")
ggplot2 works best with data in a long format where you have one variable to plot and then various identifying variables to control the fill, color, and facetting. Here I explicitly build a repeated data frame using map_dfr which essentially is running a for loop for each year in the input dataset. In dat_long the new column yearid becomes the x-axis identifier so within 2021 we can access the data for year 2019 through 2021 to control the color fill.
library(ggplot2)
library(dplyr)
library(purrr)
library(forcats)
year = c(2019, 2020, 2021)
sum = c(456, 1832, 160)
cumsum = c(456, 2288, 2448)
dat <- data.frame(year, sum)
# note: don't need the cumsum column
# instead, create long, replicated data where we repeat
# each years entry for every year that comes after it
dat_long <-
map_dfr(unique(dat$year),
~filter(dat, year <= .x) %>%
mutate(yearid = .x))
ggplot(data = dat_long,
aes(x = yearid,
y = sum,
# note: use factor to get discrete color palette, fct_rev to stack 2021 on top
fill = fct_rev(factor(year)))) +
geom_col()
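Both answers start from the yearly totals. As a rough sketch, the same idea can be carried through from raw rows shaped like producers. The person IDs below are made up for illustration, and year is kept as character as in the question.

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(forcats)

# Hypothetical stand-in for `producers` (IDs invented for this example)
producers <- tibble(
  person_id = c("a", "b", "b", "c", "d", "e", "f"),
  year      = c("2019", "2019", "2019", "2020", "2020", "2020", "2021")
)

# Distinct persons per year
totals <- producers %>%
  distinct(year, person_id) %>%
  count(year, name = "total")

# One copy of each year's total for every stack (year2) it should appear in
stacked <- totals %>%
  crossing(year2 = totals$year) %>%
  filter(year <= year2)

p <- ggplot(stacked, aes(year2, total, fill = fct_rev(factor(year)))) +
  geom_col() +
  labs(x = "Year", y = "Producers", fill = "Year") +
  theme_classic()
print(p)
```

Each bar's height equals the cumulative count, and the fill shows how much each year contributed.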

Further scaling plotted data by year into intervals

I am practicing visualizing data with R with a dataset on certain incidents worldwide. I created a data frame only containing the number of incidents per year with the plyr count function.
library(plyr)
df_incidents <- count(df$iyear)
names(df_incidents)[names(df_incidents) == "x"] <- "year"
names(df_incidents)[names(df_incidents) == "freq"] <- "incidents"
df_incidents
Output:
year incidents
1970 651
1971 471
1972 568
1973 473
1974 581
... all the way to 2018
I visualised the above data with ggplot(df_incidents,aes(x=year,y=incidents)) + geom_bar(stat="identity") which returned a histogram of incidents per year, but I am unable to further group year into intervals of 5 years.
Should I alter my ggplot statement to scale the data or further process my df_incidents data frame into distinctive groups of approx 5 years from 1970?
You can either keep one bar per year and set the axis breaks with scale_x_continuous(), or create a new grouping variable with the cut() function. Here are both approaches:
library(ggplot2)
library(dplyr)
set.seed(123)
#Data
df_incidents <- data.frame(year=1978:2018,
incidents=round(runif(41,500,1000),0))
#Plot option 1
ggplot(df_incidents,aes(x=year,y=incidents))+
geom_bar(stat = 'identity',color='black',fill='cyan')+
scale_x_continuous(breaks = seq(1978,2018,by=5))
Output:
And the second approach:
#Plot option 2
df_incidents %>%
mutate(Cutyear=cut(year,breaks = seq(1978,2018,by=5),include.lowest = T,right = F)) %>%
group_by(Cutyear) %>%
summarise(incidents=sum(incidents,na.rm=T)) %>%
ggplot(aes(x=Cutyear,y=incidents))+
geom_bar(stat = 'identity',color='black',fill='cyan')
Output:
You can bin your years into categories like below:
df_incidents %>% mutate(Binned= cut(year, breaks = c(1970, 1975, ....,2020))) %>%
group_by(Binned) %>% summarize(Incidents= sum(incidents)) %>% # sum the yearly counts within each bin
ggplot(.,aes(x=Binned,y=Incidents)) + geom_bar(stat="identity")
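A minimal runnable sketch of the binning idea follows. It assumes regular 5-year bins from 1970 built with seq() and uses synthetic incident counts, since the real data is not shown. Setting right = FALSE makes each bin include its lower bound, so 1970 is not dropped, and dig.lab = 4 keeps the labels from switching to scientific notation.

```r
library(dplyr)
library(ggplot2)

set.seed(1)
# Synthetic stand-in for the real counts
df_incidents <- data.frame(
  year      = 1970:2018,
  incidents = round(runif(49, 400, 1000))
)

binned <- df_incidents %>%
  mutate(Binned = cut(year,
                      breaks  = seq(1970, 2020, by = 5),
                      right   = FALSE,  # bins are [1970,1975), [1975,1980), ...
                      dig.lab = 4)) %>% # print 1970, not 1.97e+03
  group_by(Binned) %>%
  summarise(incidents = sum(incidents))

p <- ggplot(binned, aes(x = Binned, y = incidents)) +
  geom_col()
print(p)
```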

How to plot dates as dates (not numbers or character) on x axis of ggplot?

I have a huge data set containing bacteria samples (4 types of bacteria) from 10 water resources from 2010 until 2019. Some values are missing, so they should be excluded from the plot and the analysis.
I want to plot a time series for each type of bacteria for each resource for all years.
What is the best way to do that?
library("ggplot2")
BactData= read.csv('Råvannsdata_Bergen_2010_2018a.csv', sep='\t',header=TRUE)
summary(BactData,na.rm = TRUE)
df$Date = as.Date( df$Date, '%d/%m/%Y')
#require(ggplot2)
ggplot( data = df, aes( Date,BactData$Svartediket_CB )) + geom_line()
#plot(BactData$Svartediket_CB,col='brown')
plot(BactData$Svartediket_CP,col='cyan')
plot(BactData$Svartediket_EC,col='magenta')
plot(BactData$Svartediket_IE,col='darkviolet')
Using plot() is not satisfactory because the x axis shows just index numbers, not dates. I tried to use ggplot but got an error message. I am a beginner in R.
Error message
Error in df$Date : object of type 'closure' is not subsettable
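The data is a CSV file with a tab delimiter.
The error occurs because no object named df was ever created, so df$Date refers to R's built-in function df() (a closure), which cannot be subset with $. Use BactData throughout; this will do the trick: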
BactData = read.csv('Råvannsdata_Bergen_2010_2018a.csv', sep='\t',header=TRUE, stringsAsFactors = F)
colnames(BactData)[1] <- "Date"
library(lubridate)
BactData$Date = dmy(BactData$Date) # converts strings to date class
ggplot(data = BactData, aes(Date, Svartediket_CB )) + geom_line()
You can filter for any year using dplyr with lubridate. For example, 2017:
library(dplyr)
BactData %>% filter(year(Date) == 2017) %>%
ggplot(aes(Date, Svartediket_CB )) + geom_line()
Or for two years:
library(dplyr)
BactData %>% filter(year(Date) == 2017 | year(Date) == 2018) %>%
ggplot(aes(Date, Svartediket_CB )) + geom_line()
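To draw all four bacteria types at once and skip the missing values, one option is to reshape the data to long format first. The sketch below uses synthetic numbers with the column names from the question. It assumes every measurement column follows a Resource_Type naming pattern, which may not hold for the whole file.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(42)
# Synthetic stand-in for the real file: one water resource, four bacteria types
BactData <- data.frame(
  Date           = seq(as.Date("2017-01-01"), by = "month", length.out = 24),
  Svartediket_CB = rpois(24, 20),
  Svartediket_CP = rpois(24, 5),
  Svartediket_EC = rpois(24, 10),
  Svartediket_IE = rpois(24, 3)
)
BactData$Svartediket_CP[c(3, 10)] <- NA  # simulate missing samples

bact_long <- BactData %>%
  # one row per date and measurement column
  pivot_longer(-Date, names_to = "series", values_to = "count") %>%
  # split "Svartediket_CB" into resource and bacteria type
  separate(series, into = c("resource", "bacteria"), sep = "_") %>%
  # drop missing values so lines connect the available samples
  filter(!is.na(count))

p <- ggplot(bact_long, aes(Date, count, colour = bacteria)) +
  geom_line() +
  facet_wrap(~ resource) +
  scale_x_date(date_breaks = "6 months", date_labels = "%b %Y")
print(p)
```

Because Date is a real Date column, scale_x_date() controls the tick spacing and label format directly.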

R stacked bar charts including "other" (using ggplot2)

I want to make a stacked barchart that describes abundances of taxa at two locations in three different seasons. I'm using ggplot2. Making the plot is ok, but I have 48 taxa so I end up with a lot of different colours in the bar. There are only eight taxa that occur frequently and abundantly, so I'd like to group the others into "Other" for the plot.
My data looks like this:
SampleID TransectID SampleYear Season Location Taxa1 Taxa2 Taxa3 .... Taxa48
BW15001 1 2015 fall SiteA 25 0 0 0
BW15001 2 2015 fall SiteA 32 0 0 2
BW15001 2 2015 fall SiteA 6 0 45 0
BW15001 3 2015 fall SiteA 78 1 2 0
This is what I have tried (modified from here):
y <- rowSums(invert[6:54])
x<-invert[6:54]/y
x<-invert[,order(-colSums(x))]
#Extract list of top N Taxa
N<-8
taxa_list<-colnames(x)[1:N]
#remove "__Unknown__" and add it to others
taxa_list<-taxa_list[!grepl("Unknown",taxa_list)]
N<-length(taxa_list)
#Generate a new table with everything added to Others
new_x<-data.frame(x[,colnames(x) %in% taxa_list],
Others=rowSums(x[,!colnames(x) %in% taxa_list]))
df<-NULL
for (i in 1:dim(new_x)[2]){
tmp<-data.frame(row.names=NULL,Sample=rownames(new_x),
Taxa=rep(colnames(new_x)[i],dim(new_x) [1]),Value=new_x[,i],Type=grouping_info[,1])
if(i==1){df<-tmp} else {df<-rbind(df,tmp)}
}
To plot the graph:
colours <- c("#F0A3FF", "#0075DC", "#993F00","#4C005C","#2BCE48","#FFCC99","#808080","#94FFB5","#8F7C00","#9DCC00","#C20088","#003380","#FFA405","#FFA8BB","#426600","#FF0010","#5EF1F2","#00998F","#740AFF","#990000","#FFFF00");
library(ggplot2)
p<-ggplot(df,aes(Sample,Value,fill=Taxa))+
geom_bar(stat="identity")+
facet_grid(. ~ Type, drop=TRUE,scale="free",space="free_x")
p<-p+scale_fill_manual(values=colours[1:(N+1)])
p<-p+theme_bw()+ylab("Proportions")
p<-p+ scale_y_continuous(expand = c(0,0))+
theme(strip.background = element_rect(fill="gray85"))+
theme(panel.spacing = unit(0.3, "lines"))
p<-p+theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
p
The main problem that I would like help with today is pulling out the main taxa and lumping the rest as "Other". I think I can figure out how to group the graph by Season and Location using facet_grid() later...
Thanks!
Expanding on my comment. Take a look at the forcats package. Without a full example, it's hard to say, but the following should work:
library(tidyverse)
library(forcats)
temp <- df %>%
gather(taxa, amount, -c(1:5))
# Reshape the data so that there is one record per unit of amount
tidy_df <- temp[rep(rownames(temp), times = temp$amount), ]
tidy_df %>%
select(-amount) %>%
mutate(taxa = fct_lump(taxa, n = 2)) %>% # Check out this line
ggplot(., aes(x = SampleID, fill = taxa)) +
geom_bar()
You can change fct_lump(taxa, n = 2) to fct_lump(taxa, n = 8) to group the top 8 categories. Alternatively, you can use fct_lump(taxa, prop = 0.9) to lump things up by proportions.
If you are simply going after the "presence" of the taxa in a sample (and not the value or amount), things are a bit simpler and can likely be handled in one pipe:
df %>%
gather(taxa, amount, -c(1:5)) %>%
mutate(amount = na_if(amount, 0)) %>%
na.omit() %>%
mutate(taxa = fct_lump(taxa, n = 2)) %>%
ggplot(., aes(x = SampleID, fill = taxa)) +
geom_bar()
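If the abundances themselves matter, a lighter variant avoids replicating rows by passing the amounts as weights. This sketch assumes a forcats version that provides fct_lump_n() with the w argument, and it uses a small made-up table in place of the 48-taxa data.

```r
library(dplyr)
library(tidyr)
library(forcats)
library(ggplot2)

# Small made-up table standing in for the real data
invert <- data.frame(
  SampleID = c("S1", "S2", "S3"),
  Season   = c("fall", "fall", "spring"),
  Taxa1    = c(25, 32, 6),
  Taxa2    = c(0, 1, 0),
  Taxa3    = c(0, 0, 45),
  Taxa4    = c(2, 0, 1)
)

lumped <- invert %>%
  pivot_longer(starts_with("Taxa"), names_to = "taxa", values_to = "amount") %>%
  # keep the n taxa with the largest total amount, lump the rest into "Other"
  mutate(taxa = fct_lump_n(taxa, n = 2, w = amount)) %>%
  group_by(SampleID, Season, taxa) %>%
  summarise(amount = sum(amount), .groups = "drop")

p <- ggplot(lumped, aes(SampleID, amount, fill = taxa)) +
  geom_col(position = "fill") +  # stack as proportions within each sample
  facet_grid(. ~ Season, scales = "free_x", space = "free_x") +
  ylab("Proportions")
print(p)
```

For the real data, n = 8 would keep the eight most abundant taxa.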
One way of doing it:
library(plyr)
d=data.frame(SampleID=rep('BW15001',4),
TransectID=c(1,2,2,3),
SampleYear=rep(2015,4),
Taxa1=c(25,32,6,78),
Taxa2=c(0,0,0,1),
Taxa3=c(0,0,45,3))
#Reshape the df so that all taxa columns are melted into two
d=melt(d,id=colnames(d[,1:3]))
d$variable=as.character(d$variable)
# rename all uninteresting taxa as 'other'
`%ni%` <- Negate(`%in%`) # Here I decided to select the ones to keep, but the other way around is fine as well of course
d[d$variable %ni% c('Taxa1','Taxa2'),'variable']='Other' #here you could add a function to automatically determine which taxta you want to keep, as you already did
# aggregate all data for 'other'
d=ddply(d,colnames(d[,1:4]),summarise,value=sum(value))
#make your plot, this one is just a bad example
ggplot(d,aes(SampleID,value,fill=variable))+
geom_bar(stat="identity")+
facet_grid(. ~ Type, drop=TRUE,scale="free",space="free_x")

box plot for multiple observations

I have multiple observations of rainfall for the same station over around 14 years. The data frame looks something like this:
df (starting from 01/01/2000)
v1 v2 v3 v4 v5 v6 ........ v20
1 1 2 4 8 9..............
1.4 4 3.8..................
1.5 3 1.6....................
1.6 8 .....................
.
.
.
.
until 31/12/2013, i.e. 5114 daily observations in total
where v1, v2, ..., v20 are rainfall simulations for the same point. I want to draw a box plot for each month that shows the median and the range of quantiles when all the simulations are taken together.
I can plot box plot for single monthly values using :
df$month<-factor(month.name,levels=month.name)
library(reshape2)
df.long<-melt(df,id.vars="month")
ggplot(df.long,aes(month,value))+geom_boxplot()
but here the data is daily and there are multiple observations, so I don't know where to start.
Sample data:
df = data.frame(matrix(rnorm(20), nrow=5114,ncol=100))
In case you want to work with a zoo object, here is the date index:
date<-seq(as.POSIXct("2000-01-01 00:00:00","GMT"),as.POSIXct("2013-12-31 00:00:00","GMT"), by="1440 min")
If you want, you can also convert the data to a zoo object:
x <- zoo(df, order.by=seq(as.POSIXct("2000-01-01 00:00:00","GMT"), as.POSIXct("2013-12-31 00:00:00","GMT"), by="1440 min"))
I am not familiar with zoo, so I converted your sample to a data frame. Your idea of using melt() is the right approach. After that, you need to aggregate the rain amount by month. It is worth looking up aggregate() and other options. Here, I used dplyr and tidyr to arrange the sample data. I hope this lets you move forward.
### zoo to data frame by # Joshua Ulrich
### http://stackoverflow.com/questions/14064097/r-convert-between-zoo-object-and-data-frame-results-inconsistent-for-different
zoo.to.data.frame <- function(x, index.name="Date") {
stopifnot(is.zoo(x))
xn <- if(is.null(dim(x))) deparse(substitute(x)) else colnames(x)
setNames(data.frame(index(x), x, row.names=NULL), c(index.name,xn))
}
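### to data frame (x is the zoo object built in the question; reshape2 supplies melt())
library(zoo)
library(reshape2)
foo <- zoo.to.data.frame(x)
str(foo)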
library(dplyr)
library(tidyr)
### wide to long data frame, aggregate rain amount by Date
ana <- foo %>%
melt(., id.vars = "Date") %>%
group_by(Date) %>%
summarize(rain = sum(value))
### Aggregate rain amount by year and month
bob <- ana %>%
separate(Date, c("year", "month", "date")) %>%
group_by(year, month) %>%
summarize(rain = sum(rain))
### Drawing a ggplot figure
ggplot(data = bob, aes(x = month, y = rain)) +
geom_boxplot()
I just found an easier way to do it; however, your answer really helped, jazzuro.
install.packages("reshape2")
library(dplyr)
library(reshape2)
library(zoo) # zoo(), as.yearmon()
require(ggplot2)
df = data.frame(matrix(rnorm(20), nrow=5114,ncol=100))
x <- zoo(df, order.by=seq(as.POSIXct("2000-01-01 00:00:00","GMT"),
as.POSIXct("2013-12-31 00:00:00","GMT"), by="1440 min"))
# monthly mean of every simulation column
v<-aggregate(x, as.yearmon, mean)
# month number for each of the 14 x 12 year-months
months<- rep(1:12,14)
lol<-data.frame(v,months)
df.m <- melt(lol, id.var = "months")
View(df.m)
p <- ggplot(df.m, aes(factor(months), value))
# a discrete fill gives one colour per month
p + geom_boxplot(aes(fill = factor(months)))
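For comparison, here is a sketch of the same idea without zoo, using plain Date values with dplyr and tidyr. It assumes 20 simulation columns filled with synthetic values and monthly totals as the quantity of interest; swap sum() for mean() to match the code above.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(1)
dates <- seq(as.Date("2000-01-01"), as.Date("2013-12-31"), by = "day")
# Synthetic daily rainfall for 20 simulations (columns V1..V20)
sims  <- as.data.frame(matrix(rexp(length(dates) * 20), ncol = 20))
daily <- data.frame(date = dates, sims)

monthly <- daily %>%
  # one row per day and simulation
  pivot_longer(-date, names_to = "simulation", values_to = "rain") %>%
  mutate(
    year  = as.integer(format(date, "%Y")),
    # month.abb avoids locale-dependent month names and fixes the order
    month = factor(month.abb[as.integer(format(date, "%m"))], levels = month.abb)
  ) %>%
  # monthly total per simulation and year
  group_by(simulation, year, month) %>%
  summarise(rain = sum(rain), .groups = "drop")

# Each box pools all simulations and all years for that calendar month
p <- ggplot(monthly, aes(month, rain)) +
  geom_boxplot()
print(p)
```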

Resources