Manipulating Dates with dplyr - r

I have longitudinal, geocoded address data and the length of time at each geocode. I then have a series of variables (I'm just calling them x here) that give characteristics of each geoid location. Below are just two cases, but I have thousands.
id<-c(1,1,1,7,7,7)
geoid<-c(53,45,45,16,18,42)
start<-c("1/1/2004","10/31/2004","1/1/2005","1/1/2005","6/1/2007","7/2/2007")
end<-c("10/30/2004","12/31/2004","12/31/2007","5/31/2007","7/1/2007","12/31/2007")
x<-c(.5,.7,.7,.3,.4,.6)
dat<-data.frame(id,geoid,x,start,end)
dat$start<-as.Date(dat$start,format='%m/%d/%Y')
dat$end<-as.Date(dat$end,format='%m/%d/%Y')
dat
  id geoid   x      start        end
1  1    53 0.5 2004-01-01 2004-10-30
2  1    45 0.7 2004-10-31 2004-12-31
3  1    45 0.7 2005-01-01 2007-12-31
4  7    16 0.3 2005-01-01 2007-05-31
5  7    18 0.4 2007-06-01 2007-07-01
6  7    42 0.6 2007-07-02 2007-12-31
I need to end up with a single value for each year (2004, 2005, 2006, 2007) and for each case (1, 7) that is weighted by the length of time at each address. So case 1 moves from geoid 53 to 45 in 2004 and case 7 moves from geoid 16 to 18 to 42 in 2007. So I calculate the percent of the year at each geoid (and eventually I will multiply that by x and take the mean for each year to get a weighted average). Cases staying put for a whole year will get a weight of 1.
#calculate the percentage of year at each address for id 1
(as.Date("10/31/2004",format='%m/%d/%Y')-as.Date("1/1/2004",format='%m/%d/%Y'))/365.25
Time difference of 0.8323066
(as.Date("12/31/2004",format='%m/%d/%Y')-as.Date("10/31/2004",format='%m/%d/%Y'))/365.25
Time difference of 0.1670089
#calculate the percentage of year at each address for id 7
(as.Date("05/31/2007",format='%m/%d/%Y')-as.Date("1/1/2007",format='%m/%d/%Y'))/365.25
Time difference of 0.4106776
(as.Date("07/01/2007",format='%m/%d/%Y')-as.Date("06/01/2007",format='%m/%d/%Y'))/365.25
Time difference of 0.08213552
(as.Date("12/31/2007",format='%m/%d/%Y')-as.Date("07/02/2007",format='%m/%d/%Y'))/365.25
Time difference of 0.4982888
I can do this by brute force by looking at each year individually and calculating the percent of the year spent at each address. Then I would multiply each weight by the x values and take the mean for that year - but that is not feasible with thousands of cases. Any ideas on how to do this more efficiently would be much appreciated. It seems like it might be doable with dplyr's slice, but I'm stalled out at the moment. The key is separating out each year.

As eipi10 mentioned, some of your data spans more than a year. It also looks inconsistent with the data you used in your time difference calculations, which are all within the same year.
Assuming that your start and end dates would actually be in the same year, you can do something like the following:
library(dplyr)
library(lubridate)

foo <- dat %>%
  mutate(start_year = year(start),
         end_year = year(end),
         same_year = (start_year == end_year),
         year_frac = as.numeric(end - start)/365.25,
         wtd_x = year_frac * x)
This gives you:
id geoid x start end start_year end_year same_year year_frac wtd_x
1 1 53 0.5 2004-01-01 2004-10-31 2004 2004 TRUE 0.83230664 0.41615332
2 1 45 0.7 2004-10-31 2004-12-31 2004 2004 TRUE 0.16700890 0.11690623
3 1 45 0.7 2005-01-01 2007-12-31 2005 2007 FALSE 2.99520876 2.09664613
4 7 16 0.3 2007-01-01 2007-05-31 2007 2007 TRUE 0.41067762 0.12320329
5 7 18 0.4 2007-06-01 2007-07-01 2007 2007 TRUE 0.08213552 0.03285421
6 7 42 0.6 2007-07-02 2007-12-31 2007 2007 TRUE 0.49828884 0.29897331
You can then group and summarise the data using:
bar <- foo %>%
  group_by(start_year, id) %>%
  summarise(sum(wtd_x))
to give you the answer:
  start_year    id sum(wtd_x)
       (dbl) (dbl)      (dbl)
1       2004     1  0.5330595
2       2005     1  2.0966461
3       2007     7  0.4550308
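A possible follow-up for the rows that do cross calendar years (like row 3 above): split each row into one row per year it touches, clip the dates to that year, and then weight x by the fraction of the year covered. This is only a sketch, not part of the original answer; it uses an inclusive day count (end - start + 1), and dividing by sum(weight) in the summarise step normalises whatever day-count convention is used.
library(dplyr)
library(tidyr)
library(lubridate)

dat_by_year <- dat %>%
  mutate(year = Map(seq, year(start), year(end))) %>%   # one element per calendar year touched
  unnest(year) %>%
  mutate(yr_start = pmax(start, as.Date(paste0(year, "-01-01"))),
         yr_end   = pmin(end,   as.Date(paste0(year, "-12-31"))),
         weight   = as.numeric(yr_end - yr_start + 1) / 365.25)

# weighted average of x per id and year
dat_by_year %>%
  group_by(id, year) %>%
  summarise(x_weighted = sum(weight * x) / sum(weight))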

Hopefully this will get you started. I wasn't sure how you wanted to deal with cases where the period from start to end spans more than one year or crosses calendar years.
library(dplyr)
dat %>%
  mutate(fractionOfYear = as.numeric(end - start)/365.25)
id geoid x start end fractionOfYear
1 1 53 0.5 2004-01-01 2004-10-30 0.82956879
2 1 45 0.7 2004-10-31 2004-12-31 0.16700890
3 1 45 0.7 2005-01-01 2007-12-31 2.99520876
4 7 16 0.3 2005-01-01 2007-05-31 2.40930869
5 7 18 0.4 2007-06-01 2007-07-01 0.08213552
6 7 42 0.6 2007-07-02 2007-12-31 0.49828884

I was able to find some local help that led us to a simple function. We're still stuck on how to use apply with dates, but overall this handles it.
#made up sample address data
id<-c(1,1,1,7,7,7)
geoid<-c(53,45,45,16,18,42)
start<-c("1/31/2004","10/31/2004","1/1/2005","1/1/2005","6/1/2007","7/2/2007")
end<-c("10/30/2004","12/31/2004","12/31/2007","5/31/2007","7/1/2007","12/31/2007")
dat <- data.frame(id,geoid,start,end)
#format addresses
dat$start<-as.Date(dat$start,format='%m/%d/%Y')
dat$end<-as.Date(dat$end,format='%m/%d/%Y')
#function to create proportion of time at each address
prop_time <- function(drange, year){
  start <- drange[[1]]; end <- drange[[2]]
  # start year and end year
  syear <- as.numeric(format(start, '%Y'))
  eyear <- as.numeric(format(end, '%Y'))
  # only compute a proportion if the spell overlaps the requested year
  if(syear <= year & year <= eyear){
    byear <- as.Date(paste("1/1", year, sep = "/"), format = '%m/%d/%Y')
    lyear <- as.Date(paste("12/31", year, sep = "/"), format = '%m/%d/%Y')
    astart <- max(byear, start)
    aend <- min(lyear, end)
    prop <- as.numeric(aend - astart) / as.numeric(lyear - byear)
  } else prop <- 0  # no overlap with the requested year gives 0
  prop
}
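As a quick sanity check on the sample data above (the first spell covers roughly three quarters of 2004):
prop_time(dat[1, c("start", "end")], 2004)
# about 0.748 for the 1/31/2004 - 10/30/2004 spell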
#a second function to apply prop_time to multiple cases
prop_apply <- function(dat_times, year){
  out <- NULL
  for(i in 1:dim(dat_times)[1]){
    out <- rbind(out, prop_time(dat_times[i, ], year))
  }
  out
}
#create new data frame to populate years
dat <- data.frame(dat, y2004=0, y2005=0, y2006=0, y2007=0)
dat_times <- dat[,c("start", "end")]
#run prop_apply in a loop across cases and selected years
for(j in 2004:2007){
  newdate <- paste("y", j, sep = "")
  dat[, newdate] <- prop_apply(dat_times, j)
}
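On the "apply with dates" point: one way to drop the explicit row loop (a sketch only, giving the same y2004-y2007 columns as the loop above) is to iterate over row indices with vapply, so the start/end columns are never coerced away from Date:
for (j in 2004:2007) {
  dat[, paste0("y", j)] <- vapply(
    seq_len(nrow(dat_times)),
    function(i) prop_time(dat_times[i, ], j),   # each row keeps its Date columns
    numeric(1)
  )
}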

Related

Six-month peak-season running average

I'm trying to implement this:
The recommendation is a peak season ozone AQG level of 60 μg/m3
(the average of daily maximum 8-hour mean ozone concentrations).
The peak season is defined as the six consecutive months of the year
with the highest six-month running-average ozone concentration.
In regions away from the equator, this period will typically be in the
warm season within a single calendar year (northern hemisphere)
or spanning two calendar years (southern hemisphere). Close to
the equator, such clear seasonal patterns may not be obvious, but a
running-average six-month peak season will usually be identifiable
from existing monitoring or modelling data.
I have:
# A tibble: 300 × 2
date value
<dttm> <dbl>
1 1997-01-01 00:00:00 NA
2 1997-02-01 00:00:00 NA
3 1997-03-01 00:00:00 NA
4 1997-04-01 00:00:00 30.2
5 1997-05-01 00:00:00 20.9
6 1997-06-01 00:00:00 10.1
7 1997-07-01 00:00:00 9.40
8 1997-08-01 00:00:00 22.4
9 1997-09-01 00:00:00 26.2
10 1997-10-01 00:00:00 32.9
# … with 290 more rows
Every year is complete (with or without NA). I found the peaks with findpeaks from the pracma package, and get:
peaks = findpeaks(mda8_omit$value, minpeakdistance = 6,
                  minpeakheight = mean(mda8_omit$value))
How do I optimize to get the best six months by peak? The northern hemisphere is easier because the peak falls within a single year (summer), but in the southern hemisphere it is split across two years, and peaks may shift depending on latitude. Any ideas on how to continue?
Assuming that:
(1) we only use windows with 6 consecutive months of data,
(2) the year that a window falls in is determined by the last month of the window,
(3) we compare all such windows, at most 12, within each calendar year,
calculate the rolling mean and then, grouping by year, take the row with the largest rolling mean within that year. That row is the last month of the 6-month window. The input is shown reproducibly in the Note at the end.
library(dplyr)
library(zoo)
DF %>%
  mutate(date = as.yearmon(date),
         peakmean = rollapplyr(value, 6, mean, fill = NA)) %>%
  group_by(year = as.integer(date)) %>%
  slice_max(peakmean) %>%
  ungroup %>%
  select(-year)
## # A tibble: 1 × 3
## date value peakmean
## <yearmon> <dbl> <dbl>
## 1 Oct 1997 32.9 20.3
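A small add-on (not part of the answer above): since yearmon values step by 1/12 per month, subtracting 5/12 from the window's last month recovers the first month of each six-month window.
DF %>%
  mutate(date = as.yearmon(date),
         peakmean = rollapplyr(value, 6, mean, fill = NA)) %>%
  group_by(year = as.integer(date)) %>%
  slice_max(peakmean) %>%
  ungroup() %>%
  transmute(window_start = date - 5/12, window_end = date, peakmean)
# window_start = May 1997, window_end = Oct 1997 for the sample data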
Note
Lines <- "date value
1 1997-01-01T00:00:00 NA
2 1997-02-01T00:00:00 NA
3 1997-03-01T00:00:00 NA
4 1997-04-01T00:00:00 30.2
5 1997-05-01T00:00:00 20.9
6 1997-06-01T00:00:00 10.1
7 1997-07-01T00:00:00 9.40
8 1997-08-01T00:00:00 22.4
9 1997-09-01T00:00:00 26.2
10 1997-10-01T00:00:00 32.9"
DF <- read.table(text = Lines)

R Expand time series data based on start and end point

I think I have a pretty simple request. I have the following dataframe, where "place" is a unique identifier, while start_date and end_date may overlap. The values are unique for each ID "place".
place start_date end_date value
1 2007-09-01 2010-10-12 0.5
2 2013-09-27 2015-10-11 0.7
...
What I need is to create a year-based variable by expanding the time series so that each year an observation covers (each first of January, e.g. 2011-01-01) starts a new row for that particular "place" and "value". I mean something like this:
place year value
1 2007 0.5
1 2008 0.5
1 2009 0.5
1 2010 0.5
2 2013 0.7
2 2014 0.7
2 2015 0.7
...
There are some cases with overlap (i.e. "place" = 1 & "year" = 2007) across two separate observations, where one observation ends in a given year and the other observation continues from that year. In that case I would prefer the "value" from the observation that ends in that specific year. So if one observation for place = 1 ends in March 2007 and another observation for place = 1 starts in April 2007, the year = 2007 value for place = 1 should take the earlier, "ending" value, if that makes sense.
I've only gotten this far:
library(data.table)
data <- data.table(dat)
data[, `:=`(start_date = as.Date(start_date), end_date = as.Date(end_date))]
data[, num_mons := length(seq(from = start_date, to = end_date, by = 'year')), by = 1:nrow(data)]
I guess writing a loop makes the most sense?
Thank you for your help and advice.
A tidyverse solution could look like this:
library(dplyr)
library(stringr)
library(purrr)
library(tidyr)
data <- tibble(place = c(1, 2),
               start_date = c('2007-09-01', '2013-09-27'),
               end_date = c('2010-10-12', '2015-10-11'),
               value = c(0.5, 0.7))

data %>%
  mutate(year = map2(start_date,
                     end_date,
                     ~ as.character(str_extract(.x, '\\d{4}'):str_extract(.y, '\\d{4}')))) %>%
  separate_rows(year) %>%
  filter(!year %in% c('c', '')) %>%
  select(place, year, value)
# place year value
# <dbl> <chr> <dbl>
# 1 1 2007 0.5
# 2 1 2008 0.5
# 3 1 2009 0.5
# 4 1 2010 0.5
# 5 2 2013 0.7
# 6 2 2014 0.7
# 7 2 2015 0.7
I'm having problems understanding the third paragraph of your question ("There are ..."). It seems to me to be a separate question. If that is the case, please consider moving the question to a separate post here on SO. If it is not a separate question, please reformulate the paragraph.
You could do the following:
library(lubridate)
library(tidyverse)
df %>%
  group_by(place) %>%
  mutate(year = list(seq(year(ymd(start_date)), year(ymd(end_date))))) %>%
  unnest(year) %>%
  select(place, year, value)
# A tibble: 7 x 3
# Groups: place [2]
place year value
<int> <int> <dbl>
1 1 2007 0.5
2 1 2008 0.5
3 1 2009 0.5
4 1 2010 0.5
5 2 2013 0.7
6 2 2014 0.7
7 2 2015 0.7
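Regarding the overlap rule in the third paragraph of the question (keep the value from the observation that ends in that year): a sketch only, assuming a data frame df shaped like the one in the question but actually containing overlapping spells for the same place, and building on the expansion idea above.
library(dplyr)
library(tidyr)
library(purrr)
library(lubridate)

df %>%
  mutate(year = map2(year(ymd(start_date)), year(ymd(end_date)), seq)) %>%
  unnest(year) %>%
  group_by(place, year) %>%
  slice_min(ymd(end_date), n = 1) %>%   # of two spells touching a year, keep the one that ends in it
  ungroup() %>%
  select(place, year, value)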

Aggregate a column by defined weeks in R [closed]

I have the following table and I need to aggregate columns 4 and 5 (acres_purchase and Date_Diff) based on the weeks defined below for a given month.
For example, for any given month my weekly definition for the purchase_Date column is as follows:
wk1: 1-6 days
wk2: 7-12 days
wk3: 13-18 days
wk4: 19-24 days
wk5: 25-31 days
Year County purchase_Date acres_purchase Date_Diff
2010 Cache 9/28/2009 30.5 1
2010 Cache 10/1/2009 5.0 4
2010 Cache 10/3/2009 10.2 3
2010 Cache 10/5/2009 20 3
2010 Cache 10/7/2009 15 5
2010 Cache 10/13/2009 5 1
2010 Cache 10/14/2009 6 2
2010 Cache 10/19/2009 25 7
2010 Cache 10/25/2009 12 3
2010 Cache 10/30/2009 2 1
Output:
Year County purchase_Date Week purchase_by_date Date_Diff
2010 Cache 9/28/2009 Sep-wk5 30.5 1
2010 Cache 10/1/2009 Oct-wk1 35.2 10
2010 Cache 10/7/2009 Oct-wk2 15 5
2010 Cache 10/13/2009 Oct-wk3 11 3
2010 Cache 10/19/2009 Oct-wk4 25 7
2010 Cache 10/25/2009 Oct-wk5 14 4
Is there a way that I can achieve the "Output" table in R?
Any help is appreciated.
First convert purchase_Date to a date class, then extract purchase_Day:
df1$purchase_Date <- as.Date(df1$purchase_Date, format= "%m/%d/%Y")
df1$purchase_Day <- as.numeric(format(df1$purchase_Date, "%d"))
Define a helper function to assign each range of days to the correct week.
weekGroup <- function(x){
  if (x <= 6) {
    week <- "wk1"
  } else if (x <= 12) {
    week <- "wk2"
  } else if (x <= 18) {
    week <- "wk3"
  } else if (x <= 24) {
    week <- "wk4"
  } else {
    week <- "wk5"
  }
  return(week)
}
Pass each day to our helper function:
df1$week <- sapply(df1$purchase_Day, weekGroup)
Pull the month into a separate column, and convert to numeric
df1$month <- as.numeric(format(df1$purchase_Date, "%m"))
month.abb is a built-in character vector of month abbreviations. Use the numeric month to index into it:
df1$monthAbb <- sapply(df1$month, function(x) month.abb[x])
Combine week and monthAbb
df1$monthWeek <- paste(df1$monthAbb,df1$week, sep="-")
And #cmaher basically provided this already, but for completeness, the final summary:
require(dplyr)
df1 %>%
  group_by(Year, County, monthWeek) %>%
  summarise(purchaseDate = min(purchase_Date),
            acres = sum(acres_purchase),
            date_diff = sum(Date_Diff))
Year County monthWeek purchaseDate acres date_diff
<int> <fctr> <chr> <date> <dbl> <int>
1 2010 Cache Oct-wk1 2009-10-01 35.2 10
2 2010 Cache Oct-wk2 2009-10-07 15.0 5
3 2010 Cache Oct-wk3 2009-10-13 11.0 3
4 2010 Cache Oct-wk4 2009-10-19 25.0 7
5 2010 Cache Oct-wk5 2009-10-25 14.0 4
6 2010 Cache Sep-wk5 2009-09-28 30.5 1
Assuming your purchase_Date variable is of class Date, you can use lubridate::day() and base::findInterval to segment your dates:
df$Week <- findInterval(lubridate::day(df$purchase_Date), c(7, 13, 19, 25, 32)) + 1
df$Week <- as.factor(paste(lubridate::month(df$purchase_Date), df$Week, sep = "-"))
# purchase_Date Week
# 2017-10-01 10-1
# 2017-10-02 10-1
# 2017-10-03 10-1
# ...
# 2017-10-29 10-5
# 2017-10-30 10-5
# 2017-10-31 10-5
Then, one way to achieve your target output is with dplyr like so:
df %>%
  group_by(Year, County, Week) %>%
  summarize(
    purchase_Date = min(purchase_Date),
    purchase_by_date = sum(acres_purchase),
    Date_Diff = sum(Date_Diff))
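If you also want labels in the Sep-wk5 / Oct-wk1 style of the desired output rather than 10-1, the built-in month.abb vector can replace the numeric month (a small tweak to the code above, assuming df and the week numbering already shown):
df$Week <- findInterval(lubridate::day(df$purchase_Date), c(7, 13, 19, 25, 32)) + 1
df$Week <- factor(paste0(month.abb[lubridate::month(df$purchase_Date)], "-wk", df$Week))
# e.g. 2009-09-28 -> "Sep-wk5", 2009-10-01 -> "Oct-wk1"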

How to re-order data in R, and creating a new variable for the data?

I have been working with the CDC FluView dataset, retrieved by this code:
library(cdcfluview)
library(ggplot2)
usflu <- get_flu_data("national", "ilinet", years=1998:2015)
What I am trying to do is create a new week variable, call it "week_new", so that the WEEK variable from this dataset is reordered. I want to reorder it so that the first week is equal to week number 30 in each year. For example, in 1998, instead of week 1 corresponding to the first week of that year, I would like week 30 to correspond to the first week of that year, with every subsequent year following the same scale. I am also trying to create another new variable called "season", which simply puts each week into its corresponding flu season, say "1998-1999" for week 30 of 1998 through 1999, and so on.
I believe this involves a for loop and conditional statements, but I am not familiar with how to use these in R. I am new to programming and am learning Java and R at the same time, and have only worked with loops in Java so far.
Here is what I have tried so far; I think it's supposed to be something like this:
wk_num <- 1
for(i in 1:nrow(usflu)){
  if(week == 31){
    wk_num <- 1
    wk_new[i] <- wk_num
    wk_num <- wk_num + 1
  }
  if(week < 53){
    season[i] <- paste(Yr[i], '-', Yr[i] + 1)
  }
  else{
  }
}
Any help is greatly appreciated and hopefully what I am asking makes sense. I am hoping to understand re-ordering for the future as I believe it will be an important tool for me to have at my disposal for coding in R.
Here's one way to accomplish this with the packages dplyr and tidyr:
library(dplyr)
library(tidyr)
usflu_df <- tbl_df(usflu)
usflu_df %>%
  complete(YEAR, WEEK) %>%
  filter(!(YEAR == 1998 & WEEK < 30)) %>%
  mutate(season = cumsum(WEEK == 30),
         season_nm = paste(1997 + season, 1998 + season, sep = "-")) %>%
  group_by(season) %>%
  mutate(new_wk = seq_along(season)) %>%
  select(YEAR, WEEK, new_wk, season, season_nm)
# YEAR WEEK new_wk season season_nm
# (int) (int) (int) (int) (chr)
# 1 1998 30 1 1 1998-1999
# 2 1998 31 2 1 1998-1999
# 3 1998 32 3 1 1998-1999
# 4 1998 33 4 1 1998-1999
# 5 1998 34 5 1 1998-1999
# 6 1998 35 6 1 1998-1999
# 7 1998 36 7 1 1998-1999
# 8 1998 37 8 1 1998-1999
# 9 1998 38 9 1 1998-1999
# 10 1998 39 10 1 1998-1999
Talking through this...
First, use tidyr::complete to turn implicit missing values into explicit missing values -- the original data pulled back did not have all of the weeks for 1998. Next, filter out the irrelevant records from 1998, that is, anything in 1998 before week 30, to make our lives easier. We then create two new variables, season and season_nm, via cumsum and a simple paste. The season counter simply increments any time it sees WEEK == 30 -- useful because not every year has the same number of weeks (some have 53). We then group_by season so that we can seq_along season to create the new_wk variable.

(In)correct use of a linear time trend variable, and most efficient fix?

I have 3133 rows representing payments made on some of the 5296 days between 7/1/2000 and 12/31/2014; that is, the "Date" feature is non-continuous:
> head(d_exp_0014)
Year Month Day Amount Count myDate
1 2000 7 6 792078.6 9 2000-07-06
2 2000 7 7 140065.5 9 2000-07-07
3 2000 7 11 190553.2 9 2000-07-11
4 2000 7 12 119208.6 9 2000-07-12
5 2000 7 16 1068156.3 9 2000-07-16
6 2000 7 17 0.0 9 2000-07-17
I would like to fit a linear time trend variable,
t <- 1:3133
to a linear model explaining the variation in the Amount of the expenditure.
fit_t <- lm(Amount ~ t + Count, d_exp_0014)
However, this is obviously wrong, as a one-unit step in t corresponds to a different number of days from one row to the next:
> head(exp)
Year Month Day Amount Count Date t
1 2000 7 6 792078.6 9 2000-07-06 1
2 2000 7 7 140065.5 9 2000-07-07 2
3 2000 7 11 190553.2 9 2000-07-11 3
4 2000 7 12 119208.6 9 2000-07-12 4
5 2000 7 16 1068156.3 9 2000-07-16 5
6 2000 7 17 0.0 9 2000-07-17 6
To me that is the exact opposite of a linear trend.
What is the most efficient way to get this data.frame merged to a continuous date-index? Will a date vector like
CTS_date_V <- data.frame(Date = seq(as.Date("2000/07/01"), as.Date("2014/12/31"), by = "days"))
yield different results?
I'm open to any packages (using fpp, forecast, timeSeries, xts, ts, as of right now); just looking for a good answer to deploy in functional form, since these payments are going to be updated every week and I'd like to automate the append to this data.frame.
I think some kind of transformation to a regular (continuous) time series is a good idea.
You can use xts to transform the time series data (it is handy because it can be used in other packages like a regular ts).
Filling the gaps
library(xts)

# convert myDate to POSIXct if necessary
# create xts from data frame x
ts1 <- xts(data.frame(a = x$Amount, c = x$Count), x$myDate)
ts1
# create an empty (zero-column) xts with one index entry per day
ts_empty <- xts(, seq(from = start(ts1), to = end(ts1), by = "DSTday"))
# merge the empty series into the data and fill the gaps with 0
ts2 <- merge(ts1, ts_empty, fill = 0)
# or interpolate, for example:
ts2 <- merge(ts1, ts_empty, fill = NA)
ts2 <- na.locf(ts2)
# zoo/xts-ready fill functions are:
# na.locf - carry the previous value forward
# na.approx - linear interpolation
# na.spline - cubic spline interpolation
Deduplicate dates
In your sample there is no sign of duplicated dates, but based on a new question it is very likely. I think you want to aggregate values with the sum function:
ts1 <- period.apply( ts1, endpoints(ts1,'days'), sum)
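Once the gaps are filled, the series is daily and regular, so the time trend from the original question can simply be the row index. A minimal sketch (not part of the answer above), assuming ts2 from the gap-filling step and the column names a (Amount) and c (Count) used there:
# sketch: back to a one-row-per-day data frame, so t really does advance by one day per row
d_reg <- data.frame(myDate = index(ts2), coredata(ts2))
d_reg$t <- seq_len(nrow(d_reg))       # continuous daily index

fit_t <- lm(a ~ t + c, data = d_reg)  # a = Amount, c = Count, as named above
summary(fit_t)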
