How to plot the availability of a variable by year? - r

year <- c(2000:2014)
group <- c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A",
"B","B","B","B","B","B","B","B","B","B","B","B","B","B","B",
"C","C","C","C","C","C","C","C","C","C","C","C","C","C","C")
value <- sample(1:5, 45, replace=TRUE)
df <- data.frame(year,group,value)
df$value[df$value==1] <- NA
year group value
1 2000 A NA
2 2001 A 2
3 2002 A 2
...
11 2010 A 2
12 2011 A 3
13 2012 A 5
14 2013 A NA
15 2014 A 3
16 2000 B 2
17 2001 B 3
...
26 2010 B NA
27 2011 B 5
28 2012 B 4
29 2013 B 3
30 2014 B 5
31 2000 C 5
32 2001 C 4
33 2002 C 3
34 2003 C 4
...
44 2013 C 5
45 2014 C 3
Above is the sample dataframe for my question.
Each group (A, B or C) has values from 2000 to 2014, but in some years the value might be missing for some of the groups.
The graph I would like to plot is as below:
x-axis is year
y-axis is group (i.e. A, B & C should be shown on the y-axis)
the bar or line represents the value availability of each group
If the value is NA, then the bar would not show at that time point.
ggplot2 is preferred if possible.
Can anyone help?
Thank you.
I think my description is confusing. I am expecting a graph like below, BUT the x-axis would be year. And the bar or line represents the availability of the value for a given group across the year.
In the sample dataframe of group A, we have
2012 A 5
2013 A NA
2014 A 3
Then there should be nothing at the point of group A in 2013, and then a dot would be presented at the point of group A in 2014.

You can use geom_errorbar with no range (geom_errorbarh for the horizontal version), then subset to the rows with complete.cases (or !is.na(df$value)):
library(ggplot2)
set.seed(10)
year <- c(2000:2014)
group <- c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A",
"B","B","B","B","B","B","B","B","B","B","B","B","B","B","B",
"C","C","C","C","C","C","C","C","C","C","C","C","C","C","C")
value <- sample(1:5, 45, replace=TRUE)
df <- data.frame(year,group,value)
df$value[df$value==1] <- NA
no_na_df <- df[complete.cases(df), ]
ggplot(no_na_df, aes(x=year, y = group)) +
geom_errorbarh(aes(xmax = year, xmin = year), size = 2)
Edit:
To get a continuous bar, you can use this slightly unappealing method. It is necessary to make a numeric representation of the group data, to give the bars a width. Thereafter, we can make the scale represent the variables as discrete again.
# factor() first: since R 4.0, data.frame() keeps strings as character
df$group_n <- as.numeric(factor(df$group))
no_na_df <- df[complete.cases(df), ]
ggplot(no_na_df, aes(xmin=year-0.5, xmax=year+0.5, y = group_n)) +
geom_rect(aes(ymin = group_n-0.1, ymax = group_n+0.1)) +
scale_y_discrete(limits = levels(factor(df$group)))
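As a possible alternative (a sketch, not part of the answer above): geom_tile accepts a discrete y directly, so the numeric helper column may not be needed. The names `available` and `p`, and the tile height of 0.2, are my own choices for illustration.

```r
library(ggplot2)

set.seed(10)
df <- data.frame(
  year  = rep(2000:2014, times = 3),
  group = rep(c("A", "B", "C"), each = 15),
  value = sample(1:5, 45, replace = TRUE)
)
df$value[df$value == 1] <- NA  # treat 1 as "missing", as in the question

# Keep only the year/group combinations where a value is available
available <- df[!is.na(df$value), ]

# One tile per available year; tiles of adjacent years touch and read as a bar
p <- ggplot(available, aes(x = year, y = group)) +
  geom_tile(width = 1, height = 0.2)
print(p)
```

Swapping geom_tile(...) for geom_point() should give the dot-per-year version described in the question.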

Related

Stacked Bar Graph of Count of Variables within date bins

Using R, I am trying to make a simple stacked bar graph of the counts of different settlement types by date. I have 3 ways of accounting for date. Below is an example of my database
ID Settlement Start End Mid
01 Urban 200 400 300
02 Rural 450 850 650
03 Military 1300 1400 1350
04 Castle 2 1000 501
so far I have
count(ratData, vars = "Settlement")
which returns
Settlement freq
1 78
2 Castle 25
3 Cave 3
4 Fortification 5
5 Hill Fort 2
6 Industrial (quarry) 1
7 Manor 2
8 Military 4
9 Military camp 1
10 Military Camp 3
11 Military site 1
12 Mining 1
13 Monastic 15
14 Monastic/Rural? 1
15 Port 5
16 River-site 2
17 Roman fort 1
18 Roman Fort 1
19 Roman settlement 3
20 Rural 22
21 Settlement 2
22 urban 1
23 Urban 123
24 Villa 4
25 Wic 13
Then to plot
ggplot(v, aes(x=Settlement, y=freq)) + geom_bar(stat='identity', fill='lightblue', color='black')
This however shows settlement type on the x axis instead of stacking the settlement types. This is missing date data. I would like to bin them into 100 year bins from 1-1500 and make a stacked bar graph of settlement types per bin to illustrate presence over time.
This should do the trick. The cut function is very useful in situations like this where you need to create a categorical variable based on some range of a continuous variable. I've gone the Tidyverse route but there are base R options as well.
library(dplyr)
library(ggplot2)
# Some dummy data that resembles your problem
s <- data.frame(ID = 1:100,
Settlement = c(rep('Urban', 50), rep('Rural', 20), rep('Military', 10), rep('Castle', 20)),
Start = signif(rnorm(100, 500, 100), 2),
End = signif(rnorm(100, 1000, 100), 2))
s$Mid <- s$Start + ((s$End - s$Start) / 2)
# Find the range of the mid variable to decide on cut locations
r <- range(s$Mid)
# Make a new factor variable based on year bins - you will need to change the labels to match your actual data
s$group <- cut(s$Mid, 5, labels = c('575-640', '641-705', '706-770', '771-835', '836-900'))
# Frequency count per factor level
grouped <- s %>%
group_by(group) %>%
count(Settlement)
# You'll need to clean up axis labels, etc.
ggplot(grouped, aes(x = group, y = n, fill = Settlement)) +
geom_bar(stat = 'identity')
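For the 100-year bins from 1 to 1500 asked for in the question, cut can also take explicit break points instead of a number of bins. A minimal sketch, using the four Mid values from the question's example table; the names `mid`, `breaks`, `labels` and `bin` are only for illustration:

```r
# Mid values from the four example rows in the question
mid <- c(300, 650, 1350, 501)

# Bin edges every 100 years; cut() uses right-closed intervals by default,
# so (0,100] covers years 1-100, (100,200] covers 101-200, and so on
breaks <- seq(0, 1500, by = 100)
labels <- paste(head(breaks, -1) + 1, tail(breaks, -1), sep = "-")

bin <- cut(mid, breaks = breaks, labels = labels)
as.character(bin)
# "201-300" "601-700" "1301-1400" "501-600"
```

The resulting factor can be used in place of s$group in the answer above, so the labels no longer have to be typed by hand.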

Boxplot not plotting all data

I'm trying to plot a boxplot for a time series (e.g. http://www.r-graph-gallery.com/146-boxplot-for-time-series/) and can get every other example to work, bar my last one. I have averages per month for six years (2011 to 2016) and have data for 2014 and 2015 (albeit in small quantities), but for some reason, boxes aren't being shown for the 2014 and 2015 data.
My input data has three columns: year, month and residency index (a value between 0 and 1). There are multiple individuals (in this example, 37) each with an average residency index per month per year (including 2014 and 2015).
For example:
year month RI
2015 1 NA
2015 2 NA
2015 3 NA
2015 4 NA
2015 5 NA
2015 6 NA
2015 7 0.387096774
2015 8 0.580645161
2015 9 0.3
2015 10 0.225806452
2015 11 0.3
2015 12 0.161290323
2016 1 0.096774194
2016 2 0.103448276
2016 3 0.161290323
2016 4 0.366666667
2016 5 0.258064516
2016 6 0.266666667
2016 7 0.387096774
2016 8 0.129032258
2016 9 0.133333333
2016 10 0.032258065
2016 11 0.133333333
2016 12 0.129032258
which is repeated for each individual fish.
My code:
#make boxplot
boxplot(RI$RI~RI$month+RI$year,
xaxt="n",xlab="",col=my_colours,pch=20,cex=0.3,ylab="Residency Index (RI)", ylim=c(0,1))
abline(v=seq(0,12*6,12)+0.5,col="grey")
axis(1,labels=unique(RI$year),at=seq(6,12*6,12))
The average trend line works as per the other examples.
a=aggregate(RI$RI,by=list(RI$month,RI$year),mean, na.rm=TRUE)
lines(a[,3],type="l",col="red",lwd=2)
Any help on this matter would be greatly appreciated.
Your problem seems to be the missing values (NA) in your data; the other values are plotted correctly. I've simplified your code a bit.
boxplot(RI$RI ~ RI$month + RI$year,
ylab="Residency Index (RI)")
a <- aggregate(RI ~ month + year, data = RI, FUN = mean, na.rm = TRUE)
lines(c(rep(NA, 6), a[,3]), type="l", col="red", lwd=2)
Also, I believe that maybe a boxplot is not the best way to depict your data. You only have one value per year/month, when a boxplot would require more. Maybe a simple scatter plot will do better.
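The rep(NA, 6) padding is needed because the formula interface of aggregate drops rows with NA by default (na.action = na.omit), so the six empty months of 2015 disappear from the result. A sketch of a way to keep them, assuming a single fish with the values shown in the question (`a_drop` and `a_keep` are names I made up):

```r
# One fish, 2015-2016, values copied from the question
RI <- data.frame(
  year  = rep(2015:2016, each = 12),
  month = rep(1:12, times = 2),
  RI    = c(rep(NA, 6),
            0.387096774, 0.580645161, 0.3, 0.225806452, 0.3, 0.161290323,
            0.096774194, 0.103448276, 0.161290323, 0.366666667, 0.258064516,
            0.266666667, 0.387096774, 0.129032258, 0.133333333, 0.032258065,
            0.133333333, 0.129032258)
)

# Default: rows with NA are removed before grouping, so 6 groups vanish
a_drop <- aggregate(RI ~ month + year, data = RI, FUN = mean, na.rm = TRUE)

# na.pass keeps every month/year group; all-NA groups get NaN as their mean
a_keep <- aggregate(RI ~ month + year, data = RI, FUN = mean, na.rm = TRUE,
                    na.action = na.pass)

nrow(a_drop)  # 18
nrow(a_keep)  # 24
```

With a_keep, lines(a_keep[, 3], ...) should line up with the boxplot positions without manual padding, since lines() leaves a gap at NaN values.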

Aggregation on 2 columns while keeping two unique R

So I have this:
Staff Result Date Days
1 50 2007 4
1 75 2006 5
1 60 2007 3
2 20 2009 3
2 11 2009 2
And I want to get to this:
Staff Result Date Days
1 55 2007 7
1 75 2006 5
2 15 2009 5
I want to have the Staff ID and Date be unique in each row, but I want to sum 'Days' and mean 'Result'
I can't work out how to do this in R, I'm sure I need to do lots of aggregations but I keep getting different results to what I am aiming for.
Many thanks
The simplest way to do this is to group_by Staff and Date and summarise the results with the dplyr package:
require(dplyr)
df <- data.frame(Staff = c(1,1,1,2,2),
Result = c(50, 75, 60, 20, 11),
Date = c(2007, 2006, 2007, 2009, 2009),
Days = c(4, 5, 3, 3, 2))
df %>%
group_by(Staff, Date) %>%
summarise(Result = floor(mean(Result)),
Days = sum(Days)) %>%
data.frame
Staff Date Result Days
1 1 2006 75 5
2 1 2007 55 7
3 2 2009 15 5
You can aggregate on two variables by using a formula and then merge the two aggregates
merge(aggregate(Result ~ Staff + Date, data=df, mean),
aggregate(Days ~ Staff + Date, data=df, sum))
Staff Date Result Days
1 1 2006 75.0 5
2 1 2007 55.0 7
3 2 2009 15.5 5
Here is another option with data.table
library(data.table)
setDT(df)[, .(Result = floor(mean(Result)), Days = sum(Days)), .(Staff, Date)]
# Staff Date Result Days
#1: 1 2007 55 7
#2: 1 2006 75 5
#3: 2 2009 15 5

Changing X-axis values in Time Series plot with R

I'm a newer R user and I need help with a time series plot. I created a time series plot, and cannot figure out how to change my x-axis values to correspond to my sample dates. My data is as follows:
Year Month Level
2009 8 350
2009 9 210
2009 10 173
2009 11 166
2009 12 153
2010 1 141
2010 2 129
2010 3 124
2010 4 103
2010 5 69
2010 6 51
2010 7 49
2010 8 51
2010 9 51
Let's say this data is saved as the name "data.csv"
data = read.table("data.csv", sep = ",", header = T)
data.ts = ts(data, frequency = 1)
plot(data.ts[, 3], ylab = "level", main = "main", axes = T)
I've also tried passing start = c(2009, 8) to the ts function, but I still get the wrong values.
When I plot this my x axis does not correlate to August 2009 through Sept. 2010. It will either increase by year or just by decimal. I've looked up many examples online and also through the ? help on R, but cannot find a way to relabel my axis values. Any help would be appreciated.
Using base coding, you can accomplish this in a few steps. As described in this SO answer, you can identify your "Month" and "Year" data as a date if you use as.Date and paste functions together and incorporate a day (i.e., first day of the month; "1"). For the purposes of this answer, I will simply refer to the data you provided as df:
df$date<-with(df,as.Date(paste(Year,Month,'1',sep='-'),format='%Y-%m-%d'))
df
Year Month Level date
1 2009 8 350 2009-08-01
2 2009 9 210 2009-09-01
3 2009 10 173 2009-10-01
4 2009 11 166 2009-11-01
5 2009 12 153 2009-12-01
6 2010 1 141 2010-01-01
7 2010 2 129 2010-02-01
8 2010 3 124 2010-03-01
9 2010 4 103 2010-04-01
10 2010 5 69 2010-05-01
11 2010 6 51 2010-06-01
12 2010 7 49 2010-07-01
13 2010 8 51 2010-08-01
14 2010 9 51 2010-09-01
Then you can use your basic plot, axis, and mtext functions to control how you want to visualize the data and your axes. For instance:
xmin<-min(df$date,na.rm=T);xmax<-max(df$date,na.rm=T) #ESTABLISH X-VALUES (MIN & MAX)
ymin<-min(df$Level,na.rm=T);ymax<-max(df$Level,na.rm=T) #ESTABLISH Y-VALUES (MIN & MAX)
xseq<-seq.Date(xmin,xmax,by='1 month') #CREATE DATE SEQUENCE THAT INCREASES BY MONTH FROM DATE MINIMUM TO MAXIMUM
yseq<-round(seq(0,ymax,by=50),0) # CREATE SEQUENCE FROM 0-350 BY 50
par(mar=c(1,1,0,0),oma=c(6,5,3,2)) #CONTROLS YOUR IMAGE MARGINS
plot(Level~date,data=df,type='b',ylim=c(0,ymax),axes=F,xlab='',ylab='');box() #PLOT LEVEL AS A FUNCTION OF DATE, REMOVE AXES FOR FUTURE CUSTOMIZATION
axis.Date(side=1,at=xseq,format='%Y-%m',labels=T,las=3) #ADD X-AXIS LABELS WITH "YEAR-MONTH" FORMAT
axis(side=2,at=yseq,las=2) #ADD Y-AXIS LABELS
mtext('Date (Year-Month)',side=1,line=5) #X-AXIS LABEL
mtext('Level',side=2,line=4) #Y-AXIS LABEL
library(data.table)
library(ggplot2)
library(scales)
data<-data.table(datetime=seq(as.POSIXct("2009/08/01",format="%Y/%m/%d"),
as.POSIXct("2010/09/01",format="%Y/%m/%d"),by="1 month"),
Level=c(350,210,173,166,153,141,129,124,103,69,51,49,51,51))
ggplot(data)+
geom_point(aes(x=datetime,y=Level),col="brown1",size=1)+
scale_x_datetime(labels = date_format("%Y/%m"),breaks = "1 month")+
theme(axis.text.x = element_text(angle = 90, hjust = 1,vjust=0.3))
Example using xts package:
library(xts)
ts1 <- xts(data$Level, as.POSIXct(sprintf("%d-%d-01", data$Year, data$Month)))
# or ts1 <- xts(data$Level, as.yearmon(data$Year + (data$Month-1)/12))
plot(ts1)
If you are using ggplot2:
library(ggplot2)
autoplot(ts1)
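On the original ts attempt: a likely reason start = c(2009, 8) did not help is that frequency = 1 tells ts the observations are yearly. A minimal sketch with frequency = 12, using only the Level column from the question (the names `level` and `level_ts` are mine):

```r
# Level values from the question, August 2009 through September 2010
level <- c(350, 210, 173, 166, 153, 141, 129, 124, 103, 69, 51, 49, 51, 51)

# frequency = 12 means 12 observations per year; start is c(year, month)
level_ts <- ts(level, start = c(2009, 8), frequency = 12)

start(level_ts)  # 2009 8
end(level_ts)    # 2010 9

plot(level_ts, ylab = "level", main = "main")
```

The default axis of plot() on a ts object still labels time in (fractional) years, so the date-based approaches above remain the better choice when month labels are wanted.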

How to calculate time-weighted average and create lags

I have searched the forum but found nothing that answers, or even hints at, how to do what I want.
I have yearly measurements of exposure from which I wish to calculate an individual-level annual average based on each individual's entry into the study. For each row, the one-year exposure assignment should include data from the preceding 12 months, starting from the last month before joining the study.
As an example the first person in the sample data joined the study on Feb 7, 2002. His exposure will include a contribution of January 2002 (annual average is 18) and February to December 2001 (annual average is 19). The time weighted average for this person would be (1/12*18) + (11/12*19). The two year average exposure for the same person would extend back from January 2002 to February 2000.
Similarly, the last person, who joined the study in December 2004, will have a contribution of 11 months from 2004 and one month from 2003, so his annual average exposure will be (11/12*5), derived from the 2004 annual average, plus (1/12*6), which comes from the 2003 annual average.
How can I calculate the 1-, 2- and 5-year average exposure going back from the date of entry into the study? How can I use lags in the manner that I have described?
Sample data is accessed from this link
https://drive.google.com/file/d/0B_4NdfcEvU7La1ZCd2EtbEdaeGs/view?usp=sharing
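To make the one-year arithmetic described above concrete, here is a minimal sketch. The function name `one_year_twa` and its argument names are hypothetical; the numbers are the ones quoted in the two examples.

```r
# Time-weighted one-year average: the months of the entry year before joining
# are weighted by the entry year's annual average, the remaining months by
# the previous year's annual average
one_year_twa <- function(entry_month, avg_entry_year, avg_previous_year) {
  months_entry_year <- entry_month - 1        # full months before joining
  months_previous   <- 12 - months_entry_year # months taken from the year before
  (months_entry_year / 12) * avg_entry_year +
    (months_previous / 12) * avg_previous_year
}

one_year_twa(2, 18, 19)  # joined Feb 2002: (1/12*18) + (11/12*19), about 18.92
one_year_twa(12, 5, 6)   # joined Dec 2004: (11/12*5) + (1/12*6), about 5.08
```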
This is not an elegant answer, but I would like to share what I tried. First, I rearranged the data frame so that I could identify the key (base) year for each subject. I created id to number the participants. variable comes from the column names (e.g., pol_2000) in your original data set, and entryYear and entryMonth both come from entry. check flags which year is the base year for each participant. Next, I extracted six rows for each participant using getMyRows from the SOfun package. Then I used lapply to do the math you described in your question. For the two- and five-year averages, I divided the total by the number of years (2 or 5). I was not sure what the final output should look like, so I took the base-year row for each subject and added three columns to it.
library(stringi)
library(SOfun)
devtools::install_github("hadley/tidyr")
library(tidyr)
library(dplyr)
library(reshape2) # provides melt(), used below
### Big thanks to BondedDust for this function
### http://stackoverflow.com/questions/6987478/convert-a-month-abbreviation-to-a-numeric-month-in-r
mo2Num <- function(x) match(tolower(x), tolower(month.abb))
### Arrange the data frame.
ana <- foo %>%
mutate(id = 1:n()) %>%
melt(id.vars = c("id","entry")) %>%
arrange(id) %>%
mutate(variable = as.numeric(gsub("^.*_", "", variable)),
entryYear = as.numeric(stri_extract_last(entry, regex = "\\d+")),
entryMonth = mo2Num(substr(entry, 3,5)) - 1,
check = ifelse(variable == entryYear, "Y", "N"))
### Find a base year for each subject and get some parts of data for each participant.
indx <- which(ana$check == "Y")
bob <- getMyRows(ana, pattern = indx, -5:0)
### Get one-year average
cathy <- lapply(bob, function(x){
x$one <- ((x[6,6] / 12) * x[6,4]) + (((12-x[5,6])/12) * x[5,4])
x
})
one <- unnest(lapply(cathy, `[`, i = 6, j = 8))
### Get two-year average
cathy <- lapply(bob, function(x){
x$two <- (((x[6,6] / 12) * x[6,4]) + x[5,4] + (((12-x[4,6])/12) * x[4,4])) / 2
x
})
two <- unnest(lapply(cathy, `[`, i = 6, j =8))
### Get five-year average
cathy <- lapply(bob, function(x){
x$five <- (((x[6,6] / 12) * x[6,4]) + x[5,4] + x[4,4] + x[3,4] + x[2,4] + (((12-x[2,6])/12) * x[1,4])) / 5
x
})
five <- unnest(lapply(cathy, `[`, i =6 , j =8))
### Combine the results with the key observations
final <- cbind(ana[which(ana$check == "Y"),], one, two, five)
colnames(final) <- c(names(ana), "one", "two", "five")
# id entry variable value entryYear entryMonth check one two five
#6 1 07feb2002 2002 18 2002 1 Y 18.916667 18.500000 18.766667
#14 2 06jun2002 2002 16 2002 5 Y 16.583333 16.791667 17.150000
#23 3 16apr2003 2003 14 2003 3 Y 15.500000 15.750000 16.050000
#31 4 26may2003 2003 16 2003 4 Y 16.666667 17.166667 17.400000
#39 5 11jun2003 2003 13 2003 5 Y 13.583333 14.083333 14.233333
#48 6 20feb2004 2004 3 2004 1 Y 3.000000 3.458333 3.783333
#56 7 25jul2004 2004 2 2004 6 Y 2.000000 2.250000 2.700000
#64 8 19aug2004 2004 4 2004 7 Y 4.000000 4.208333 4.683333
#72 9 19dec2004 2004 5 2004 11 Y 5.083333 5.458333 4.800000
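The same weighting can be sketched in base R for any window of k years, without the SOfun dependency. This is only an illustration under assumptions: `twa_k_years` is a hypothetical helper, annual averages are assumed to be stored in a numeric vector named by year, and the value for 2000 is made up (18 and 19 for 2002 and 2001 come from the question).

```r
# k-year time-weighted average going back from the month before entry:
# partial entry year + (k - 1) full years + partial year k years earlier
twa_k_years <- function(annual, entry_year, entry_month, k) {
  m  <- entry_month - 1                        # months of the entry year counted
  yr <- function(y) annual[[as.character(y)]]  # look up an annual average by year
  full_years <- if (k > 1) (entry_year - 1):(entry_year - k + 1) else integer(0)
  total <- (m / 12) * yr(entry_year) +
    sum(vapply(full_years, yr, numeric(1))) +
    ((12 - m) / 12) * yr(entry_year - k)
  total / k
}

# Illustrative annual averages; the 2000 value is an assumption
annual <- c("2000" = 18, "2001" = 19, "2002" = 18)

twa_k_years(annual, entry_year = 2002, entry_month = 2, k = 1)  # about 18.92
twa_k_years(annual, entry_year = 2002, entry_month = 2, k = 2)  # 18.5
```

Applying it per participant (for example with mapply over entry year and month) would need one such named vector per row of the original data.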
