I am practicing visualizing data with R with a dataset on certain incidents worldwide. I created a data frame only containing the number of incidents per year with the plyr count function.
library(plyr)
df_incidents <- count(df$iyear)
names(df_incidents)[names(df_incidents) == "x"] <- "year"
names(df_incidents)[names(df_incidents) == "freq"] <- "incidents"
df_incidents
Output:
year incidents
1970 651
1971 471
1972 568
1973 473
1974 581
... all the way to 2018
I visualised the above data with ggplot(df_incidents,aes(x=year,y=incidents)) + geom_bar(stat="identity"), which returned a bar chart of incidents per year, but I am unable to group the years further into intervals of 5 years.
Should I alter my ggplot statement to scale the data, or process my df_incidents data frame further into distinct groups of roughly 5 years starting from 1970?
You can either keep one bar per year and set the axis breaks with scale_x_continuous(), or create a new grouping variable with the cut() function. Here are both approaches:
library(ggplot2)
library(dplyr)
set.seed(123)
#Data
df_incidents <- data.frame(year=1978:2018,
incidents=round(runif(41,500,1000),0))
#Plot option 1
ggplot(df_incidents,aes(x=year,y=incidents))+
geom_bar(stat = 'identity',color='black',fill='cyan')+
scale_x_continuous(breaks = seq(1978,2018,by=5))
Output:
And the second approach:
#Plot option 2
df_incidents %>%
mutate(Cutyear=cut(year,breaks = seq(1978,2018,by=5),include.lowest = T,right = F)) %>%
group_by(Cutyear) %>%
summarise(incidents=sum(incidents,na.rm=T)) %>%
ggplot(aes(x=Cutyear,y=incidents))+
geom_bar(stat = 'identity',color='black',fill='cyan')
Output:
You can bin your years into categories like below:
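# right = FALSE gives left-closed bins such as [1970,1975); with the default, 1970 itself would become NA
df_incidents %>% mutate(Binned = cut(year, breaks = c(1970, 1975, ....,2020), right = FALSE)) %>%
# sum() adds up the yearly counts in each bin; count() does not work inside summarize()
group_by(Binned) %>% summarize(Incidents = sum(incidents)) %>%
ggplot(aes(x=Binned,y=Incidents)) + geom_bar(stat="identity")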
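As a minimal sketch of the binning idea above, assuming the years run from 1970 to 2018 and using made-up incident counts (the numbers are illustrative, not taken from the real dataset), the breaks can be generated with seq() and the counts summed per bin in base R. The column name period and the bin labels are my own choices.

```r
# Hypothetical data: one row per year, with illustrative counts
set.seed(1)
df_incidents <- data.frame(year = 1970:2018,
                           incidents = sample(400:1000, 49, replace = TRUE))

# Left-closed 5-year bins: [1970,1975), [1975,1980), ..., [2015,2020)
breaks <- seq(1970, 2020, by = 5)
labels <- paste(head(breaks, -1), head(breaks, -1) + 4, sep = "-")
df_incidents$period <- cut(df_incidents$year, breaks = breaks,
                           right = FALSE, labels = labels)

# Sum the yearly counts within each bin
binned <- aggregate(incidents ~ period, data = df_incidents, FUN = sum)
binned
```

The resulting binned data frame has one row per 5-year period and can be passed to ggplot(binned, aes(period, incidents)) + geom_col().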
I have a data frame producers with two columns: person_id and year.
# A tibble: 3,207 x 2
person_id year
<chr> <chr>
1 GASH1991-04-30 2020
2 LOSP1969-06-29 2020
3 CRGM1989-08-26 2020
4 CEVE1954-07-15 2020
5 HERR1998-01-06 2020
6 TOLR1951-04-09 2020
7 BEAM1953-09-07 2020
8 ANRJ1977-07-06 2020
9 PAMH1982-02-06 2020
10 AKTE1967-11-15 2020
# ... with 3,197 more rows
I can summarise this dataframe to obtain cumulative sum:
producers %>%
select(person_id, year) %>%
group_by(year) %>%
distinct(person_id) %>%
summarise(total = n()) %>%
ungroup() %>%
mutate(cum = cumsum(total))
# A tibble: 3 x 3
year total cum
<chr> <int> <int>
1 2019 456 456
2 2020 1832 2288
3 2021 160 2448
And I can make a cumulative bar plot like this:
ggplot(producers, aes(x = as.factor(year), y = as.integer(cum))) +
geom_bar(position = "stack", stat = "identity") +
ylim(0,3000) +
xlab("Year") +
ylab("Producers") +
theme_classic()
But what I really want is something like this:
I've been trying with aes(fill = year) and other arguments but I can't get it. Thanks for your responses.
Here's an approach. Ultimately, we'll need two "year" variables, one to mark the category within each stack, and one to mark which stack we want it to appear in. Here, I set up year2 for the 2nd one, and filter out the values that shouldn't appear yet in each stack.
df2 <- data.frame(
year = 2019:2021,
total = c(456, 1832, 160)
)
library(tidyverse)
df2 %>%
crossing(year2 = df2$year) %>% # make copy for each year
filter(year <= year2) %>% # keep just the years up to current year
ggplot(aes(year2, total, fill = fct_rev(as.factor(year)))) +
geom_col() +
scale_fill_discrete(name = "Year")
ggplot2 works best with data in a long format, where you have one variable to plot and various identifying variables to control the fill, color, and faceting. Here I explicitly build a repeated data frame using map_dfr, which essentially runs a for loop over each year in the input dataset. In dat_long the new column yearid becomes the x-axis identifier, so within the 2021 bar we can access the rows for years 2019 through 2021 to control the color fill.
library(ggplot2)
library(dplyr)
library(purrr)
library(forcats)
year = c(2019, 2020, 2021)
sum = c(456, 1832, 160)
cumsum = c(456, 2288, 2448)
dat <- data.frame(year, sum)
# note: don't need the cumsum column
# instead, create long, replicated data where we repeat
# each year's entry for every year that comes after it
dat_long <-
map_dfr(unique(dat$year),
~filter(dat, year <= .x) %>%
mutate(yearid = .x))
ggplot(data = dat_long,
aes(x = yearid,
y = sum,
# note: use factor to get discrete color palette, fct_rev to stack 2021 on top
fill = fct_rev(factor(year)))) +
geom_col()
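Both answers rely on the same idea: each bar is built from all years up to and including its own, so the bar heights equal the cumulative totals. A small base R sketch to check this, using the three totals from the question (the name heights is my own, and merge() is used here as a cross join in place of crossing() or map_dfr):

```r
dat <- data.frame(year = c(2019, 2020, 2021),
                  total = c(456, 1832, 160))

# Cross join: merge() on data frames with no shared columns pairs every row with every row
pairs <- merge(dat, data.frame(yearid = dat$year))

# Keep only the years that should already appear in each stack
dat_long <- pairs[pairs$year <= pairs$yearid, ]

# Height of each stacked bar = sum of its segments
heights <- aggregate(total ~ yearid, data = dat_long, FUN = sum)

# The bar heights match the cumulative sum of the yearly totals
all(heights$total == cumsum(dat$total))
```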
I am fairly new to coding with R. I am working with relatively easy data just a large amount of it. I am trying to create a timeline based on the years these species of whale were sighted. So I just have two variables (year and species) and 151 observations. There are a total of 25 species to plot on this time line and I have provided a small example of my data below.
year species
1792 Megaptera novaeangliae
1792 Physeter macrocephalus
1793 Physeter macrocephalus
1832 Physeter macrocephalus
1833 Physeter macrocephalus
I have tried creating the timeline using timelineS and timelineG as well as vistime. TimelineG gets close to creating what I want but it does not seem to plot anything. The code is as follows:
timelineG(t8, start="year", end="year", names="species")
I am just kind of stuck. I do have the month and day that the species were sighted so I can add that back if needed. Thank you in advance for any guidance.
An example with ggplot:
library(tidyverse)
df <- tibble::tribble(
~year, ~species,
1792L, "Megaptera novaeangliae",
1792L, "Physeter macrocephalus",
1793L, "Physeter macrocephalus",
1832L, "Physeter macrocephalus",
1833L, "Physeter macrocephalus"
)
df %>%
ggplot(aes(x = year, y = species)) +
geom_point()
Or with the timelineG function:
library(timelineS)
df %>%
group_by(species) %>%
summarise(start = min(year),
end = max(year)) %>%
timelineG(start = "start", end = "end", names = "species")
Or like this, using timelineS:
dta <- data.frame(species = c('Megaptera novaeangliae','Physeter macrocephalus',
'Physeter macrocephalus','Physeter macrocephalus',
'Physeter macrocephalus'),
year = c(1792,1792,1793,1832,1833))
#str(dta)
dta$year <- as.Date(ISOdate(dta$year, 1, 1)) # assuming beginning of year for year
# str(dta)
timelineS(dta, main = "BGoodwin's species example")
I have a huge data set containing bacteria samples (4 types of bacteria) from 10 water resources from 2010 until 2019. Some values are missing, so they should be excluded from the plot and the analysis.
I want to plot a time series for each type of bacteria for each resource for all years.
What is the best way to do that?
library("ggplot2")
BactData= read.csv('Råvannsdata_Bergen_2010_2018a.csv', sep='\t',header=TRUE)
summary(BactData,na.rm = TRUE)
df$Date = as.Date( df$Date, '%d/%m/%Y')
#require(ggplot2)
ggplot( data = df, aes( Date,BactData$Svartediket_CB )) + geom_line()
#plot(BactData$Svartediket_CB,col='brown')
plot(BactData$Svartediket_CP,col='cyan')
plot(BactData$Svartediket_EC,col='magenta')
plot(BactData$Svartediket_IE,col='darkviolet')
Using plot is not satisfactory because the x axis is just numbers, not dates. I tried to use ggplot but got an error message. I am a beginner in R.
Error message
Error in df$Date : object of type 'closure' is not subsettable
The data is a CSV file with a tab delimiter.
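The error occurs because you never created an object called df, so R finds the built-in function df() (the density of the F distribution) instead, and a function (a "closure") cannot be subset with $. Use BactData throughout. This will do the trick:
BactData = read.csv('Råvannsdata_Bergen_2010_2018a.csv', sep='\t',header=TRUE, stringsAsFactors = F)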
colnames(BactData)[1] <- "Date"
library(lubridate)
BactData$Date = dmy(BactData$Date) # converts strings to date class
ggplot(data = BactData, aes(Date, Svartediket_CB )) + geom_line()
You can filter for any year using dplyr with lubridate. For example, 2017:
library(dplyr)
BactData %>% filter(year(Date) == 2017) %>%
ggplot(aes(Date, Svartediket_CB )) + geom_line()
Or for two years
library(dplyr)
BactData %>% filter(year(Date) == 2017 | year(Date) == 2018) %>%
ggplot(aes(Date, Svartediket_CB )) + geom_line()
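To plot all four bacteria types for one resource at once, one option is to stack the measurement columns into long format and facet by series. The sketch below uses synthetic data as a stand-in for BactData: the Svartediket_* column names come from the question, but the dates and values are made up, and the names long, Series and Value are my own.

```r
library(ggplot2)

# Synthetic stand-in for BactData: values are made up, with some missing
set.seed(42)
BactData <- data.frame(
  Date = seq(as.Date("2017-01-01"), by = "month", length.out = 24),
  Svartediket_CB = rpois(24, 5),
  Svartediket_CP = rpois(24, 2),
  Svartediket_EC = rpois(24, 1),
  Svartediket_IE = rpois(24, 3)
)
BactData$Svartediket_CP[c(3, 10)] <- NA

# Stack the four measurement columns into long format
cols <- c("Svartediket_CB", "Svartediket_CP", "Svartediket_EC", "Svartediket_IE")
long <- data.frame(
  Date = rep(BactData$Date, times = length(cols)),
  Series = rep(cols, each = nrow(BactData)),
  Value = unlist(BactData[cols], use.names = FALSE)
)
long <- long[!is.na(long$Value), ]  # drop missing measurements

# One panel per bacteria type, each with its own y scale
p <- ggplot(long, aes(Date, Value)) +
  geom_line() +
  facet_wrap(~ Series, scales = "free_y")
p
```

With the real file, the same reshaping would apply after the Date column has been converted as shown above; extending cols to other resources is left as an assumption about how those columns are named.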
I want to make a stacked barchart that describes abundances of taxa at two locations in three different seasons. I'm using ggplot2. Making the plot is ok, but I have 48 taxa so I end up with a lot of different colours in the bar. There are only eight taxa that occur frequently and abundantly, so I'd like to group the others into "Other" for the plot.
My data looks like this:
SampleID TransectID SampleYear Season Location Taxa1 Taxa2 Taxa3 .... Taxa48
BW15001 1 2015 fall SiteA 25 0 0 0
BW15001 2 2015 fall SiteA 32 0 0 2
BW15001 2 2015 fall SiteA 6 0 45 0
BW15001 3 2015 fall SiteA 78 1 2 0
This is what I have tried (modified from here):
y <- rowSums(invert[6:54])
x<-invert[6:54]/y
x<-invert[,order(-colSums(x))]
#Extract list of top N Taxa
N<-8
taxa_list<-colnames(x)[1:N]
#remove "__Unknown__" and add it to others
taxa_list<-taxa_list[!grepl("Unknown",taxa_list)]
N<-length(taxa_list)
#Generate a new table with everything added to Others
new_x<-data.frame(x[,colnames(x) %in% taxa_list],
Others=rowSums(x[,!colnames(x) %in% taxa_list]))
df<-NULL
for (i in 1:dim(new_x)[2]){
tmp<-data.frame(row.names=NULL,Sample=rownames(new_x),
Taxa=rep(colnames(new_x)[i],dim(new_x) [1]),Value=new_x[,i],Type=grouping_info[,1])
if(i==1){df<-tmp} else {df<-rbind(df,tmp)}
}
To plot the graph:
colours <- c("#F0A3FF", "#0075DC", "#993F00","#4C005C","#2BCE48","#FFCC99","#808080","#94FFB5","#8F7C00","#9DCC00","#C20088","#003380","#FFA405","#FFA8BB","#426600","#FF0010","#5EF1F2","#00998F","#740AFF","#990000","#FFFF00");
library(ggplot2)
p<-ggplot(df,aes(Sample,Value,fill=Taxa))+
geom_bar(stat="identity")+
facet_grid(. ~ Type, drop=TRUE,scale="free",space="free_x")
p<-p+scale_fill_manual(values=colours[1:(N+1)])
p<-p+theme_bw()+ylab("Proportions")
p<-p+ scale_y_continuous(expand = c(0,0))+
theme(strip.background = element_rect(fill="gray85"))+
theme(panel.spacing = unit(0.3, "lines"))
p<-p+theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
p
The main problem that I would like help with today is pulling out the main taxa and lumping the rest as "Other". I think I can figure out how to group the graph by Season and Location using facet_grid() later...
Thanks!
Expanding on my comment. Take a look at the forcats package. Without a full example, it's hard to say, but the following should work:
library(tidyverse)
library(forcats)
temp <- df %>%
gather(taxa, amount, -c(1:5))
# Reshape the data so that there is one record per unit of amount
tidy_df <- temp[rep(rownames(temp), times = temp$amount), ]
tidy_df %>%
select(-amount) %>%
mutate(taxa = fct_lump(taxa, n = 2)) %>% # Check out this line
ggplot(., aes(x = SampleID, fill = taxa)) +
geom_bar()
You can change fct_lump(taxa, n = 2) to fct_lump(taxa, n = 8) to group the top 8 categories. Alternatively, you can use fct_lump(taxa, prop = 0.9) to lump things up by proportions.
If you are simply going after the "presence" of the taxa in a sample (and not the value or amount), things are a bit simpler and can likely be handled in one pipe:
df %>%
gather(taxa, amount, -c(1:5)) %>%
mutate(amount = na_if(amount, 0)) %>%
na.omit() %>%
mutate(taxa = fct_lump(taxa, n = 2)) %>%
ggplot(., aes(x = SampleID, fill = taxa)) +
geom_bar()
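If the data stay in wide format, the same lumping can be sketched in base R by ranking taxa on their total abundance and summing the rest into one column. The table below is made up for illustration (three samples, five taxa), and the names taxa_cols, keep and lumped are my own.

```r
# Illustrative wide table: 3 samples x 5 taxa (counts are made up)
invert <- data.frame(
  SampleID = c("S1", "S2", "S3"),
  Taxa1 = c(25, 32, 78),
  Taxa2 = c(0, 0, 1),
  Taxa3 = c(0, 45, 2),
  Taxa4 = c(1, 0, 0),
  Taxa5 = c(10, 12, 9)
)
taxa_cols <- setdiff(names(invert), "SampleID")

# Rank taxa by total abundance and keep the N most abundant
N <- 2
totals <- sort(colSums(invert[taxa_cols]), decreasing = TRUE)
keep <- names(totals)[seq_len(N)]

# Collapse all remaining taxa into a single "Other" column
lumped <- data.frame(
  SampleID = invert$SampleID,
  invert[keep],
  Other = rowSums(invert[setdiff(taxa_cols, keep)])
)
lumped
```

Setting N to 8 would keep the eight most abundant taxa; lumped can then be reshaped to long format for a stacked bar chart.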
One way of doing it:
library(plyr)     # ddply()
library(reshape2) # melt()
library(ggplot2)
d=data.frame(SampleID=rep('BW15001',4),
TransectID=c(1,2,2,3),
SampleYear=rep(2015,4),
Taxa1=c(25,32,6,78),
Taxa2=c(0,0,0,1),
Taxa3=c(0,0,45,3))
#Reshape the df so that all taxa columns are melted into two
d=melt(d,id=colnames(d[,1:3]))
d$variable=as.character(d$variable)
# rename all uninteresting taxa as 'other'
`%ni%` <- Negate(`%in%`) # Here I decided to select the ones to keep, but the other way around is fine as well of course
d[d$variable %ni% c('Taxa1','Taxa2'),'variable']='Other' #here you could add a function to automatically determine which taxa you want to keep, as you already did
# aggregate all data for 'other'
d=ddply(d,colnames(d[,1:4]),summarise,value=sum(value))
#make your plot, this one is just a bad example
#(no faceting here, because d has no Type column)
ggplot(d,aes(SampleID,value,fill=variable))+
geom_bar(stat="identity")
I am trying to plot Months with Year as subgroups on a chart with ggplot2. That is, something that looks like this:
A similar question was answered here, but I am hoping there is a better way that avoids hardcoding the axis labels.
The R code for the data frame is as follows:
set.seed(100)
df = data.frame( Year = rep(c(rep(2013, 12), rep(2014, 9)), 2)
, Month = rep(rep(month.abb, 2)[1:21], 2)
, Condition = rep(c("A", "B"), each=21)
, Value = runif(42))
As a bonus, I would appreciate learning how to plot smoothed totals by year without introducing a new variable (if this is possible?). If I use dplyr to summarise and group_by Year and Month, the order of the months is not preserved.
Notice, Month now starts at Apr:
group_by(df, Year, Month) %>% summarise(total = sum(Value)) %>% head
Source: local data frame [6 x 3]
Groups: Year
Year Month total
1 2013 Apr 0.4764846
2 2013 Aug 0.9194172
3 2013 Dec 1.2308575
4 2013 Feb 0.7960212
5 2013 Jan 1.0185700
6 2013 Jul 1.6943562
Try this:
df$Month <- factor(df$Month, levels=month.abb)
p <- ggplot(df, aes(Month, Value, colour=Condition, group=Condition))+
facet_grid(.~Year) + geom_line() + theme_minimal()
library(gtable)
library(grid) # for grid.newpage() and grid.draw()
g <- ggplotGrob(p)
g2 <- g[-3,] %>%
gtable_add_rows(heights = g$heights[3], nrow(g)-3) %>%
gtable_add_grob(g[3,], t = nrow(g)-2, l=1, r=ncol(g))
grid.newpage()
grid.draw(g2)
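The factor conversion in the first line of this answer also addresses the ordering problem from the question: once Month is a factor with levels in calendar order, grouped summaries sort by those levels rather than alphabetically. A minimal base R sketch using the data frame from the question (aggregate() stands in for group_by() and summarise() here, and totals is my own name):

```r
set.seed(100)
df <- data.frame(Year = rep(c(rep(2013, 12), rep(2014, 9)), 2),
                 Month = rep(rep(month.abb, 2)[1:21], 2),
                 Condition = rep(c("A", "B"), each = 21),
                 Value = runif(42))

# Calendar-ordered factor instead of plain character strings
df$Month <- factor(df$Month, levels = month.abb)

# Total per Year and Month, sorted by year and then by month level
totals <- aggregate(Value ~ Year + Month, data = df, FUN = sum)
totals <- totals[order(totals$Year, totals$Month), ]
head(totals)  # the first rows now start at Jan rather than Apr
```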