Averaging a monthly time series with incomplete observations - r

I have the following dataset:
id observation_date Observation_value
1 2015-02-23 5
1 2015-02-24 6
1 2015-03-01 24
1 2015-07-16 2
1 2015-09-28 9
1 2015-12-05 12
I would like to create monthly averages of Observation_value. For months that have no observations, I would like to fill in a value based on the average between the surrounding months that do have data.

Using the data in the Note at the end (we have added a second id), read it into a zoo object with read.zoo, splitting by column 1 and using column 2 as the index, converted to yearmon class. The same call aggregates with mean over each year/month, giving the zoo object z. Converting z to ts fills in the missing months with NA; converting back to zoo and applying na.approx then fills in the NAs by linear interpolation (or use na.spline or na.locf, depending on what you want). fortify.zoo(zz) and fortify.zoo(zz, melt = TRUE) can be used to convert zoo objects to data frames.
library(zoo)
z <- read.zoo(dat, FUN = as.yearmon, index = 2, split = 1, aggregate = mean)
zz <- na.approx(as.zoo(as.ts(z)))
giving
> zz
1 2
Feb 2015 5.5 5.5
Mar 2015 24.0 24.0
Apr 2015 18.5 18.5
May 2015 13.0 13.0
Jun 2015 7.5 7.5
Jul 2015 2.0 2.0
Aug 2015 5.5 5.5
Sep 2015 9.0 9.0
Oct 2015 10.0 10.0
Nov 2015 11.0 11.0
Dec 2015 12.0 12.0
Note
Lines <- "id observation_date Observation_value
1 2015-02-23 5
1 2015-02-24 6
1 2015-03-01 24
1 2015-07-16 2
1 2015-09-28 9
1 2015-12-05 12
2 2015-02-23 5
2 2015-02-24 6
2 2015-03-01 24
2 2015-07-16 2
2 2015-09-28 9
2 2015-12-05 12"
dat <- read.table(text = Lines, header = TRUE)
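For readers without zoo, the same idea can be sketched in base R: average by month, then interpolate linearly over the month positions with approx. This is only an illustrative alternative for a single id, not the method used in the answer above; the names monthly_mean, all_months, pos and filled are made up for the example.

```r
# Data for id 1 from the question
dat <- data.frame(
  observation_date = as.Date(c("2015-02-23", "2015-02-24", "2015-03-01",
                               "2015-07-16", "2015-09-28", "2015-12-05")),
  Observation_value = c(5, 6, 24, 2, 9, 12)
)

# Mean per calendar month, keyed by "YYYY-MM" (sorts chronologically)
monthly_mean <- tapply(dat$Observation_value,
                       format(dat$observation_date, "%Y-%m"), mean)

# Every month between the first and last observed month
first <- as.Date(paste0(names(monthly_mean)[1], "-01"))
last  <- as.Date(paste0(names(monthly_mean)[length(monthly_mean)], "-01"))
all_months <- format(seq(first, last, by = "month"), "%Y-%m")

# Linear interpolation over month positions fills the gaps
pos <- match(names(monthly_mean), all_months)
filled <- data.frame(
  month = all_months,
  value = approx(pos, as.numeric(monthly_mean),
                 xout = seq_along(all_months))$y
)
filled
```

Because the interpolation runs over month positions rather than calendar days, it should agree with the na.approx result on the regular monthly series.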

Related

How to lump sum the number of days of a data of several year?

I have data similar to this. I would like to compute a running day count (I'm not sure whether "lump sum" is the right term) and store it in a new column "date", so that the new column counts the days across the 3 years of data in ascending order.
year month day
2011 1 5
2011 2 14
2011 8 21
2012 2 24
2012 3 3
2012 4 4
2012 5 6
2013 2 14
2013 5 17
2013 6 24
I wrote this code, but the result is wrong and the code is too long. It doesn't count February correctly, since February has only 28 days. Is there a shorter way?
cday <- function(data,syear=2011,smonth=1,sday=1){
year <- data[1]
month <- data[2]
day <- data[3]
cmonth <- c(0,31,28,31,30,31,30,31,31,30,31,30,31)
date <- (year-syear)*365+sum(cmonth[1:month])+day
for(yr in c(syear:year)){
if(yr==year){
if(yr%%4==0&&month>2){date<-date+1}
}else{
if(yr%%4==0){date<-date+1}
}
}
return(date)
}
op10$day.no <- apply(op10[,c("year","month","day")],1,cday)
I expect the result like this:
year month day date
2011 1 5 5
2011 1 14 14
2011 1 21 21
2011 1 24 24
2011 2 3 31
2011 2 4 32
2011 2 6 34
2011 2 14 42
2011 2 17 45
2011 2 24 52
Thank you for helping!!
Use Date classes. Dates and times are complicated, so look for tools that do this for you rather than writing your own. Pick whichever of these you want:
df$date = with(df, as.Date(paste(year, month, day, sep = "-")))
df$julian_day = as.integer(format(df$date, "%j"))
df$days_since_2010 = as.integer(df$date - as.Date("2010-12-31"))
df
# year month day date julian_day days_since_2010
# 1 2011 1 5 2011-01-05 5 5
# 2 2011 2 14 2011-02-14 45 45
# 3 2011 8 21 2011-08-21 233 233
# 4 2012 2 24 2012-02-24 55 420
# 5 2012 3 3 2012-03-03 63 428
# 6 2012 4 4 2012-04-04 95 460
# 7 2012 5 6 2012-05-06 127 492
# 8 2013 2 14 2013-02-14 45 776
# 9 2013 5 17 2013-05-17 137 868
# 10 2013 6 24 2013-06-24 175 906
# using this data
df = read.table(text = "year month day
2011 1 5
2011 2 14
2011 8 21
2012 2 24
2012 3 3
2012 4 4
2012 5 6
2013 2 14
2013 5 17
2013 6 24", header = TRUE)
This is all using base R. If you handle dates and times frequently, you may also want to look at the lubridate package.
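As a small check of why the Date class is preferable to a hand-written month table, the sketch below shows that leap years are handled without any special casing. The three dates are chosen only for illustration, and origin, d, days_since and day_of_year are example names.

```r
# Day before the first day to be counted
origin <- as.Date("2010-12-31")

# Dates around the end of February in a common year and a leap year
d <- as.Date(c("2011-02-28", "2012-02-29", "2012-03-01"))

# Subtracting Dates gives a difference in days; leap day is accounted for
days_since <- as.integer(d - origin)

# Day of the year, restarting at 1 every January
day_of_year <- as.integer(format(d, "%j"))

data.frame(d, days_since, day_of_year)
```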

How to add means to an existing column in R

I am manipulating a dataset but I can't get it right.
Here's an example, where df is the name of the data frame.
year ID value
2013 1 10
2013 2 20
2013 3 10
2014 1 20
2014 2 20
2014 3 30
2015 1 20
2015 2 10
2015 3 30
So I tried to make another data frame df1 <- aggregate(value ~ year, df, mean, na.rm=TRUE)
And made this data frame df1:
year ID value
2013 avg 13.3
2014 avg 23.3
2015 avg 20
But I want to add each yearly mean as an extra row of df, after the rows for that year.
The expected form is:
year ID value
2013 1 10
2013 2 20
2013 3 10
2013 avg 13.3
2014 1 20
2014 2 20
2014 3 30
2014 avg 23.3
2015 1 20
2015 2 10
2015 3 30
2015 avg 20
Here is an option with data.table where we convert the 'data.frame' to 'data.table' (setDT(df)), group by 'year', get the mean of 'value' with 'ID' set to 'avg', then use rbindlist to rbind both datasets and order by 'year'
library(data.table)
rbindlist(list(setDT(df), df[, .(ID = 'avg', value = mean(value)), year]))[order(year)]
# year ID value
# 1: 2013 1 10.00000
# 2: 2013 2 20.00000
# 3: 2013 3 10.00000
# 4: 2013 avg 13.33333
# 5: 2014 1 20.00000
# 6: 2014 2 20.00000
# 7: 2014 3 30.00000
# 8: 2014 avg 23.33333
# 9: 2015 1 20.00000
#10: 2015 2 10.00000
#11: 2015 3 30.00000
#12: 2015 avg 20.00000
Or using the OP's method, rbind both the datasets and then order
df2 <- rbind(df, transform(df1, ID = 'avg'))
df2 <- df2[order(df2$year),]
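For completeness, the rbind approach can be run end to end as below. This is a self-contained sketch that rebuilds df from the question and assumes that aligning df1's columns to df's order before binding is acceptable.

```r
# Data from the question
df <- data.frame(
  year  = rep(2013:2015, each = 3),
  ID    = rep(1:3, times = 3),
  value = c(10, 20, 10, 20, 20, 30, 20, 10, 30)
)

# One mean per year
df1 <- aggregate(value ~ year, df, mean)
df1$ID <- "avg"

# Append the mean rows; ID becomes character because of "avg"
df2 <- rbind(df, df1[names(df)])

# order() keeps ties in their existing order, so each "avg" row
# lands after the rows of its own year
df2 <- df2[order(df2$year), ]
df2
```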

categorize based on date ranges in R

How do I categorize each row in a large R dataframe (>2 million rows) based on date range definitions in a separate, much smaller R dataframe (12 rows)?
My large dataframe, captures, looks similar to this when called via head(captures) :
id date sex
1 160520 2016-11-22 1
2 1029735 2016-11-12 1
3 1885200 2016-11-05 1
4 2058366 2015-09-26 2
5 2058367 2015-09-26 1
6 2058368 2015-09-26 1
My small dataframe, seasons, looks similar to this in its entirety:
Season Opening.Date Closing.Date
2016 2016-09-24 2017-01-15
2015 2015-09-26 2016-01-10
2014 2014-09-27 2015-01-11
2013 2013-09-28 2014-01-12
2012 2012-09-22 2013-01-13
2011 2011-09-24 2012-01-08
2010 2010-09-25 2011-01-16
2009 2009-09-26 2010-01-17
2008 2008-09-27 2009-01-18
2007 2007-09-22 2008-01-13
2006 2006-09-23 2007-01-14
2005 2005-09-24 2006-01-15
I need to add a 'season' column to my captures dataframe where the value would be determined based on if and where captures$date falls in the ranges defined in seasons.
Here is a long-hand solution I came up with that isn't working for me because my dataframe is so large.
#add packages
library(dplyr)
library(lubridate)
#make blank column
captures$season=NA
for (i in 1:length(seasons$Season)){
for (j in 1:length(captures$id)){
captures$season[j]=ifelse(between(captures$date[j],ymd(seasons$Opening.Date[i]),ymd(seasons$Closing.Date[i])),seasons$Season[i],captures$season[j])
}
}
Again, this doesn't work for me as R crashes every time. I also realize this doesn't take advantage of vectorization in R. Any help here is appreciated!
Here's a solution using non-equi joins from data.table:
require(data.table) # v1.10.4+
setDT(captures) # convert data.frames to data.tables
setDT(seasons)
ans <- seasons[captures, Season,
on=.(Opening.Date<=date, Closing.Date>=date),
mult="first"]
# [1] 2016 2016 2016 2015 2015 2015
captures[, season := ans]
For each row in captures, the index of the first matching row (mult="first") in seasons is found based on the condition provided to the on argument. The value of Season for the corresponding indices is then returned and saved under ans. It is then added as a new column to captures by reference.
I've shown it in two steps for sake of understanding.
You can see the first matching indices by using which=TRUE instead:
seasons[captures,
on=.(Opening.Date<=date, Closing.Date>=date),
mult="first",
which=TRUE]
# [1] 1 1 1 2 2 2
It would indeed be great if you could do a join efficiently based on a range of values instead of equality. Unfortunately, I don't know whether a general solution exists. For the time being, I suggest using a single for loop.
Vectorization pays off most along the tallest data. That is, if we loop over one data.frame and vectorize over the other, it makes more sense to vectorize over the longer one and loop over the shorter one. With this in mind, we'll loop over the rows of seasons and vectorize over the 2M rows of data.
Your data:
txt <- "Season Opening.Date Closing.Date
2016 2016-09-24 2017-01-15
2015 2015-09-26 2016-01-10
2014 2014-09-27 2015-01-11
2013 2013-09-28 2014-01-12
2012 2012-09-22 2013-01-13
2011 2011-09-24 2012-01-08
2010 2010-09-25 2011-01-16
2009 2009-09-26 2010-01-17
2008 2008-09-27 2009-01-18
2007 2007-09-22 2008-01-13
2006 2006-09-23 2007-01-14
2005 2005-09-24 2006-01-15"
seasons <- read.table(text = txt, header = TRUE)
seasons[2:3] <- lapply(seasons[2:3], as.Date)
txt <- " id date sex
1 160520 2016-11-22 1
2 1029735 2016-11-12 1
3 1885200 2016-11-05 1
4 2058366 2015-09-26 2
5 2058367 2015-09-26 1
6 2058368 2015-09-26 1"
dat <- read.table(text = txt, header = TRUE)
dat$date <- as.Date(dat$date)
To start the process, we assume that the season of every row is not yet defined:
dat$season <- NA
Loop around each of the seasons' rows:
for (i in seq_len(nrow(seasons))) {
dat$season <- ifelse(is.na(dat$season) &
dat$date >= seasons$Opening.Date[i] &
dat$date <= seasons$Closing.Date[i],
seasons$Season[i], dat$season)
}
dat
# id date sex season
# 1 160520 2016-11-22 1 2016
# 2 1029735 2016-11-12 1 2016
# 3 1885200 2016-11-05 1 2016
# 4 2058366 2015-09-26 2 2015
# 5 2058367 2015-09-26 1 2015
# 6 2058368 2015-09-26 1 2015
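If even the 12-iteration loop is unwanted, a fully vectorized base R lookup is possible with findInterval. The sketch below is an assumption-laden illustration rather than part of the answer above: it assumes the seasons do not overlap, treats both the opening and closing dates as inclusive, uses only three of the seasons, and adds a hypothetical capture (id 999) that falls outside every season.

```r
seasons <- data.frame(
  Season       = c(2016, 2015, 2014),
  Opening.Date = as.Date(c("2016-09-24", "2015-09-26", "2014-09-27")),
  Closing.Date = as.Date(c("2017-01-15", "2016-01-10", "2015-01-11"))
)
captures <- data.frame(
  id   = c(160520, 1029735, 2058366, 999),  # 999 is a made-up row
  date = as.Date(c("2016-11-22", "2016-11-12", "2015-09-26", "2016-06-01"))
)

# findInterval needs the break points in increasing order
s <- seasons[order(seasons$Opening.Date), ]

# Index of the latest season that opened on or before each capture date
idx <- findInterval(as.numeric(captures$date), as.numeric(s$Opening.Date))
idx[idx == 0] <- NA  # captured before the first opening date

# Keep the match only if the capture is not after that season's closing date
inside <- !is.na(idx) & captures$date <= s$Closing.Date[idx]
captures$season <- ifelse(inside, s$Season[idx], NA)
captures
```

The work is a binary search per capture plus one vectorized comparison, so it should scale to millions of rows without an explicit loop.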
You could try sqldf. Note that I had to change the dot in Opening.Date and Closing.Date to an underscore (Opening_Date and Closing_Date).
library(sqldf)
captures$season <- sqldf("select Season from seasons s, captures c
where c.date >= s.Opening_Date and c.date <= s.Closing_Date")
captures
id date sex Season
1 160520 2016-11-22 1 2016
2 1029735 2016-11-12 1 2016
3 1885200 2016-11-05 1 2016
4 2058366 2015-09-26 2 2015
5 2058367 2015-09-26 1 2015
6 2058368 2015-09-26 1 2015
data
txt <- "Season Opening_Date Closing_Date
2016 2016-09-24 2017-01-15
2015 2015-09-26 2016-01-10
2014 2014-09-27 2015-01-11
2013 2013-09-28 2014-01-12
2012 2012-09-22 2013-01-13
2011 2011-09-24 2012-01-08
2010 2010-09-25 2011-01-16
2009 2009-09-26 2010-01-17
2008 2008-09-27 2009-01-18
2007 2007-09-22 2008-01-13
2006 2006-09-23 2007-01-14
2005 2005-09-24 2006-01-15"
seasons <- read.table(text = txt, header = TRUE)
seasons[2:3] <- lapply(seasons[2:3], as.Date)
txt <- " id date sex
1 160520 2016-11-22 1
2 1029735 2016-11-12 1
3 1885200 2016-11-05 1
4 2058366 2015-09-26 2
5 2058367 2015-09-26 1
6 2058368 2015-09-26 1"
captures <- read.table(text = txt, header = TRUE)
captures$date <- as.Date(captures$date)

Replace duplicate values using multiple conditions in r

I am new to R and I have the following data (an example) as a csv file. I want to replace any duplicate values with zero or a letter if they occur on consecutive days within the same year and month. I only need to keep one average.
Year Month Day Average
2013 8 28 2.3
2013 8 29 2.3
2013 8 30 1.7
2013 8 31 1.7
2014 8 7 3
2014 8 6 3
2014 8 8 3
2014 8 9 3
2014 9 11 5.8
2014 9 12 5.8
2014 9 13 5.8
The result I expect is something like this
Year Month Day Average
2013 8 28 2.3
2013 8 29 0
2013 8 30 1.7
2013 8 31 0
2014 8 7 3
2014 8 6 0
2014 8 8 0
2014 8 9 0
2014 9 11 5.8
2014 9 12 0
2014 9 13 0
I would also like to be able to delete the rows whose duplicate values were replaced, like this:
Year Month Day Average
2013 8 28 2.3
2013 8 30 1.7
2014 8 7 3
2014 9 11 5.8
I need two files: one with the duplicate values replaced by zero or a letter, and another that has only the averages without the duplicate values.
Thank you in advance!!
This uses dplyr for the data.frame manipulation, lubridate for date manipulation, and diff to find consecutive repeated values.
Note that I've also sorted the dates to keep the earliest one, so the result does not exactly match the example solution.
library(dplyr)
library(lubridate)
df1 <- read.table(
text = "
Year Month Day Average
2013 8 28 2.3
2013 8 29 2.3
2013 8 30 1.7
2013 8 31 1.7
2014 8 7 3
2014 8 6 3
2014 8 8 3
2014 8 9 3
2014 9 11 5.8
2014 9 12 5.8
2014 9 13 5.8",
header = T)
df2 <- read.table(
text = "
Year Month Day Average
2013 8 28 2.3
2013 8 29 0
2013 8 30 1.7
2013 8 31 0
2014 8 7 3
2014 8 6 0
2014 8 8 0
2014 8 9 0
2014 9 11 5.8
2014 9 12 0
2014 9 13 0",
header = T)
df3 <- read.table(
text = "
Year Month Day Average
2013 8 28 2.3
2013 8 30 1.7
2014 8 7 3
2014 9 11 5.8",
header = T)
df2 <- df1 %>%
mutate(date = ymd(paste(Year, Month, Day, sep = "-"))) %>%
arrange(date) %>%
mutate(is_consecutive_average = c(FALSE, diff(Average) == 0)) %>%
mutate(is_consecutive_day = c(FALSE, diff(date) == 1)) %>%
mutate(Average = Average * !(is_consecutive_average & is_consecutive_day)) %>%
select(-is_consecutive_average, -is_consecutive_day, -date)
df2
## Year Month Day Average
## 1 2013 8 28 2.3
## 2 2013 8 29 0.0
## 3 2013 8 30 1.7
## 4 2013 8 31 0.0
## 5 2014 8 6 3.0
## 6 2014 8 7 0.0
## 7 2014 8 8 0.0
## 8 2014 8 9 0.0
## 9 2014 9 11 5.8
## 10 2014 9 12 0.0
## 11 2014 9 13 0.0
df3 <- df2 %>%
filter(Average != 0)
df3
## Year Month Day Average
## 1 2013 8 28 2.3
## 2 2013 8 30 1.7
## 3 2014 8 6 3.0
## 4 2014 9 11 5.8
Here's a data.table solution:
Read in the data
data <- readr::read_csv(
text,
col_names = TRUE,
trim_ws = TRUE
)
library( data.table )
setDT( data )
Convert the date values to a nicer format, and sort
data[ , date := as.Date( paste0( Year, "-", Month, "-", Day ) ) ]
setorder( data, date )
Create new columns for previous date and average values
data[ , prev.date := shift( date, 1L, type = "lag" ) ]
data[ , prev.average := shift( Average, 1L, type = "lag" ) ]
Mark the points where a new "group" should be created, based on your criteria. Also mark the very first record as the start of a new group, since we can assume that it is.
data[ , group := 0L
][ as.integer( date - prev.date ) > 1L |
Average != prev.average, group := 1L
][ 1L, group := 1L ]
Get your first desired output by replacing particular values with zeros
data[ group != 1L, Average := 0 ]
first.output <- data[ , .( date, Average ) ]
head( first.output, 3 )
date Average
1: 2013-08-28 2.3
2: 2013-08-29 0.0
3: 2013-08-30 1.7
Now mark the groups as unique numbers
data[ , group := cumsum( group ) ]
And get your second output by aggregating to maximum "Average" value (which will be the only one not equal to zero), and the minimum "date" value (the first in that group):
second.output <- data[ , .( date = min( date ),
Average = max( Average ) ),
by = group ][ , .( date, Average ) ]
head( second.output, 3 )
date Average
1: 2013-08-28 2.3
2: 2013-08-30 1.7
3: 2014-08-06 3.0
NOTE: you could likely get second.output by simply removing rows with a zero "Average" value from the first.output, but it would remove any groups where the "Average" really is zero, so I think this method is safer.
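The same rule (a value is a duplicate when it equals the previous day's value and the previous row is exactly one day earlier) can also be sketched in base R. This is an illustrative alternative on a shortened copy of the question's data; repeat_of_prev, replaced and deduped are example names. Dropping rows by the logical flag rather than by Average != 0 avoids the problem with genuine zeros mentioned in the note above.

```r
# Shortened copy of the question's data
df <- data.frame(
  Year    = c(2013, 2013, 2013, 2013, 2014, 2014),
  Month   = c(8, 8, 8, 8, 8, 8),
  Day     = c(28, 29, 30, 31, 7, 6),
  Average = c(2.3, 2.3, 1.7, 1.7, 3, 3)
)

# Build real dates and sort so that consecutive rows are consecutive in time
df$date <- as.Date(paste(df$Year, df$Month, df$Day, sep = "-"))
df <- df[order(df$date), ]

# TRUE when a row repeats the previous day's value; the first row never does
repeat_of_prev <- c(FALSE,
                    as.integer(diff(df$date)) == 1 & diff(df$Average) == 0)

# Output 1: duplicates replaced by zero
replaced <- df
replaced$Average[repeat_of_prev] <- 0

# Output 2: duplicate rows dropped
deduped <- df[!repeat_of_prev, ]
```

Each result could then be written out with write.csv to produce the two files the question asks for.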

Taking Average and Median by Month and then Ordering by Date and Factor in R

Let's suppose I have the following data:
set.seed(123)
Dates <- c("2013-10-07","2013-10-14","2013-11-21","2013-11-28" , "2013-12-04" , "2013-12-11","2013-01-18","2013-01-18")
Dates.New <- c(Dates,Dates)
Values <- sample(seq(1:10),16,replace = TRUE)
Factor <- c(rep("Group 1",8),rep("Group 2",8))
df <- data.frame(Dates.New,Values,Factor)
df[sample(1:nrow(df)),]
This returns
Dates.New Values Factor
4 2013-11-28 9 Group 1
1 2013-10-07 3 Group 1
5 2013-12-04 10 Group 1
13 2013-12-04 7 Group 2
11 2013-11-21 10 Group 2
8 2013-01-18 9 Group 1
7 2013-01-18 6 Group 1
9 2013-10-07 6 Group 2
6 2013-12-11 1 Group 1
14 2013-12-11 6 Group 2
16 2013-01-18 9 Group 2
3 2013-11-21 5 Group 1
2 2013-10-14 8 Group 1
15 2013-01-18 2 Group 2
12 2013-11-28 5 Group 2
10 2013-10-14 5 Group 2
What I am trying to do here is find the monthly average and median for both of my factors, then order each group by month in a new data frame. So the new data frame would have the median and average for months 10, 11, 12, 1 for Group 1 bundled together, and the next 4 rows would have the median and average for months 10, 11, 12, 1 for Group 2 bundled together as well. I am open to packages. Thanks!
Here is a data.table solution. The question seems to be looking for both mean and median. See if this suits your need.
library(zoo); library(data.table)
setDT(df)[, list(Mean = mean(Values),
Median = median(Values)),
by = list(Factor, as.yearmon(Dates.New))][order(Factor, as.yearmon)]
# Factor as.yearmon Mean Median
# 1: Group 1 Jan 2013 7.5 7.5
# 2: Group 1 Oct 2013 5.5 5.5
# 3: Group 1 Nov 2013 7.0 7.0
# 4: Group 1 Dec 2013 5.5 5.5
# 5: Group 2 Jan 2013 5.5 5.5
# 6: Group 2 Oct 2013 5.5 5.5
# 7: Group 2 Nov 2013 7.5 7.5
# 8: Group 2 Dec 2013 6.5 6.5
Like this?
df$Dates.New <- as.Date(df$Dates.New)
library(zoo) # for as.yearmon(...)
result <- aggregate(Values~as.yearmon(Dates.New)+Factor,df,mean)
names(result)[1] <- "Year.Mon"
result
# Year.Mon Factor Values
# 1 Jan 2013 Group 1 7.5
# 2 Oct 2013 Group 1 5.5
# 3 Nov 2013 Group 1 7.0
# 4 Dec 2013 Group 1 5.5
# 5 Jan 2013 Group 2 5.5
# 6 Oct 2013 Group 2 5.5
# 7 Nov 2013 Group 2 7.5
# 8 Dec 2013 Group 2 6.5
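The aggregate call above returns only the mean. A possible way to get both statistics from a single aggregate call, using only base R, is sketched below. The six input rows are made-up illustrative values rather than the seeded sample from the question, and the month key is a plain "YYYY-MM" string instead of a yearmon object.

```r
# Small made-up data set with the same column layout as the question
df <- data.frame(
  Dates.New = as.Date(c("2013-10-07", "2013-10-14", "2013-11-21",
                        "2013-01-18", "2013-10-07", "2013-11-21")),
  Values = c(3, 8, 5, 9, 6, 10),
  Factor = c("Group 1", "Group 1", "Group 1",
             "Group 1", "Group 2", "Group 2")
)

# "YYYY-MM" strings sort chronologically
df$Year.Mon <- format(df$Dates.New, "%Y-%m")

# FUN returns two named values, so Values comes back as a 2-column matrix
result <- aggregate(Values ~ Year.Mon + Factor, data = df,
                    FUN = function(v) c(Mean = mean(v), Median = median(v)))

# Flatten the matrix column into Values.Mean and Values.Median
result <- do.call(data.frame, result)

# Group rows by factor, then by month
result <- result[order(result$Factor, result$Year.Mon), ]
result
```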
