R data.table merge drops rows (December only)

Solved with the help of @Uwe Block.
A data.table merge of a monthly data set onto a set of daily observations drops the December observations: the month index in one data set appears to be shifted back by one. What's a good way to do this merge so that it works as expected?
Following @Harry Daniels, using merge(monthly, daily, by=c("year","month"), all=TRUE) instead of daily[monthly, on=c("year","month"), all=TRUE] retains all daily observations correctly, but the monthly data are still shifted so that January becomes 0.
Problem: the month and year columns on the monthly data set were generated with floating-point arithmetic, so the months were not exactly integer values. For example, 1 was actually 0.999999999999091, so the merge took the floor internally and offset it.
Example: `monthly[,month:=100*(Date%%1)]` where the date was stored as numeric 2016.01, 2016.02,...,2016.12.
See the following:
> monthly
year month CPI
1: 2016 1 236.916
2: 2016 2 237.111
3: 2016 3 238.132
4: 2016 4 239.261
5: 2016 5 240.229
6: 2016 6 241.018
7: 2016 7 240.628
8: 2016 8 240.849
9: 2016 9 241.428
10: 2016 10 241.729
11: 2016 11 241.353
12: 2016 12 241.432
> daily
date year month close
1: 2016-01-04 2016 1 2012.66
2: 2016-01-05 2016 1 2016.71
3: 2016-01-06 2016 1 1990.26
4: 2016-01-07 2016 1 1943.09
5: 2016-01-08 2016 1 1922.03
---
248: 2016-12-23 2016 12 2263.79
249: 2016-12-27 2016 12 2268.88
250: 2016-12-28 2016 12 2249.92
251: 2016-12-29 2016 12 2249.26
252: 2016-12-30 2016 12 2238.83
> daily[monthly, on=c("year","month")]
date year month close CPI
1: <NA> 2016 0 NA 236.916
2: 2016-01-04 2016 1 2012.66 237.111
3: 2016-01-05 2016 1 2016.71 237.111
4: 2016-01-06 2016 1 1990.26 237.111
5: 2016-01-07 2016 1 1943.09 237.111
---
228: 2016-11-23 2016 11 2204.72 241.432
229: 2016-11-25 2016 11 2213.35 241.432
230: 2016-11-28 2016 11 2201.72 241.432
231: 2016-11-29 2016 11 2204.66 241.432
232: 2016-11-30 2016 11 2198.81 241.432
> merge(monthly, daily, by=c("year","month"), all=TRUE)
year month CPI close
1: 2016 0 236.916 NA
2: 2016 1 237.111 2012.66
3: 2016 1 237.111 2016.71
4: 2016 1 237.111 1990.26
5: 2016 1 237.111 1943.09
---
249: 2016 12 NA 2263.79
250: 2016 12 NA 2268.88
251: 2016 12 NA 2249.92
252: 2016 12 NA 2249.26
253: 2016 12 NA 2238.83

This should suffice:
merge(monthly, daily , by = 'month', all = TRUE )
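The rounding problem described in the question can be reproduced in a few lines. The sketch below is only an illustration under assumptions: the three-row tables are made-up stand-ins for the question's data, and base R merge is used in place of a data.table join. The point is that rounding before converting to integer yields exact join keys.

```r
# Dates stored as numeric year.month, as in the question (2016.01 = January 2016).
Date <- c(2016.01, 2016.02, 2016.12)

# Naive extraction: the fractional part is not exact in floating point,
# so the result can fall just below the intended integer.
month_raw <- 100 * (Date %% 1)
print(month_raw, digits = 15)

# Rounding before converting to integer gives exact join keys.
year  <- as.integer(floor(Date))
month <- as.integer(round(100 * (Date %% 1)))

# Hypothetical miniature versions of the two tables.
monthly <- data.frame(year = year, month = month,
                      CPI = c(236.916, 237.111, 241.432))
daily <- data.frame(year = 2016L, month = c(1L, 1L, 12L),
                    close = c(2012.66, 2016.71, 2238.83))

# Every daily row now finds its monthly CPI, including December.
merged <- merge(daily, monthly, by = c("year", "month"), all.x = TRUE)
print(merged)
```

With integer keys the same columns should also be usable in daily[monthly, on=c("year","month")].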

Related

How to create a new column using looping and rbind in r?

I have data similar to this. I would like to create 3 columns (date1, date2, date3) using a loop and rbind, because I am required to do it with that method only.
(All I was told is to make a loop, subset the data, sort it, make a new data frame, and then rbind the pieces to build the new columns.)
year month day id
2011 1 5 3101
2011 1 14 3101
2011 2 3 3101
2011 2 4 3101
2012 1 27 3153
2012 2 20 3153
2012 2 22 3153
2012 3 1 3153
2013 1 31 3103
2013 2 1 3103
2013 2 4 3103
2013 3 4 3103
2013 3 6 3103
The result I expect is:
date1: number of days from 2011, January 1st, start again from 1 in a new year.
date2: number of days of an id working in a year, start again from 1 in a new year.
date3: number of days open within a year, start again from 1 in a new year.
(all of the dates are in ascending order)
year month day id date1 date2 date3
2011 1 5 3101 5 1 1
2011 1 14 3101 14 2 2
2011 2 3 3101 34 3 3
2011 2 4 3101 35 4 4
2012 1 27 3153 27 1 1
2012 2 20 3153 51 2 2
2012 2 22 3153 53 3 3
2012 3 1 3153 60 4 4
2013 1 31 3103 31 1 1
2013 2 1 3103 32 2 2
2013 2 4 3103 35 3 3
2013 3 4 3103 94 4 4
2013 3 6 3103 96 5 5
Please help! Thank you.
You can do it without an unnecessary for loop and subsetting. Here is the answer:
df <- read.table(text =" year month day id
2011 1 5 3101
2011 1 14 3101
2011 2 3 3101
2011 2 4 3101
2012 1 27 3153
2012 2 20 3153
2012 2 22 3153
2012 3 1 3153
2013 1 31 3103
2013 2 1 3103
2013 2 4 3103
2013 3 4 3103
2013 3 6 3103",header = T)
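library(lubridate)
# date1: day of the year, parsed from month-day-year
df$date1 <- yday(mdy(paste0(df$month,"-",df$day,"-",df$year)))
# date2: running count of rows within each id
df$date2 <- ave(df$year, df$id, FUN = seq_along)
# date3: running count of rows within each year
df$date3 <- ave(df$year, df$year, FUN = seq_along)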
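If the loop, subset, sort and rbind structure from the question is mandatory, it could look roughly like the sketch below. This is only one way to arrange it, using base R and a shortened, made-up version of the data. The names pieces and piece are illustrative.

```r
# Shortened example data with the same columns as in the question.
df <- data.frame(
  year  = c(2011, 2011, 2012, 2012, 2012),
  month = c(1, 2, 1, 2, 3),
  day   = c(5, 3, 27, 20, 1),
  id    = c(3101, 3101, 3153, 3153, 3153)
)

pieces <- list()
for (yr in unique(df$year)) {
  piece <- df[df$year == yr, ]                     # subset one year
  piece <- piece[order(piece$month, piece$day), ]  # sort within the year
  d <- as.Date(paste(piece$year, piece$month, piece$day, sep = "-"))
  piece$date1 <- as.integer(format(d, "%j"))       # day of the year
  piece$date2 <- ave(seq_len(nrow(piece)), piece$id, FUN = seq_along)  # count per id
  piece$date3 <- seq_len(nrow(piece))              # count per year
  pieces[[length(pieces) + 1]] <- piece
}

# Stack the per-year pieces back into one data frame.
result <- do.call(rbind, pieces)
print(result)
```

Because date1 comes from the Date class, leap years are handled automatically (2012-03-01 is day 61, not 60).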

How to lump sum the number of days of a data of several year?

I have data similar to this. I would like to compute a cumulative day count (I'm not sure whether "lump sum" is the right term) and store it in a new column "date", so that the new column counts the days across the 3 years of data in ascending order.
year month day
2011 1 5
2011 2 14
2011 8 21
2012 2 24
2012 3 3
2012 4 4
2012 5 6
2013 2 14
2013 5 17
2013 6 24
I wrote this code, but the result was wrong and the code is also too long. It doesn't count February correctly, since February has only 28 days. Are there any shorter ways?
cday <- function(data,syear=2011,smonth=1,sday=1){
year <- data[1]
month <- data[2]
day <- data[3]
cmonth <- c(0,31,28,31,30,31,30,31,31,30,31,30,31)
date <- (year-syear)*365+sum(cmonth[1:month])+day
for(yr in c(syear:year)){
if(yr==year){
if(yr%%4==0&&month>2){date<-date+1}
}else{
if(yr%%4==0){date<-date+1}
}
}
return(date)
}
op10$day.no <- apply(op10[,c("year","month","day")],1,cday)
I expect the result like this:
year month day date
2011 1 5 5
2011 1 14 14
2011 1 21 21
2011 1 24 24
2011 2 3 31
2011 2 4 32
2011 2 6 34
2011 2 14 42
2011 2 17 45
2011 2 24 52
Thank you for helping!!
Use Date classes. Dates and times are complicated, look for tools to do this for you rather than writing your own. Pick whichever of these you want:
df$date = with(df, as.Date(paste(year, month, day, sep = "-")))
df$julian_day = as.integer(format(df$date, "%j"))
df$days_since_2010 = as.integer(df$date - as.Date("2010-12-31"))
df
# year month day date julian_day days_since_2010
# 1 2011 1 5 2011-01-05 5 5
# 2 2011 2 14 2011-02-14 45 45
# 3 2011 8 21 2011-08-21 233 233
# 4 2012 2 24 2012-02-24 55 420
# 5 2012 3 3 2012-03-03 63 428
# 6 2012 4 4 2012-04-04 95 460
# 7 2012 5 6 2012-05-06 127 492
# 8 2013 2 14 2013-02-14 45 776
# 9 2013 5 17 2013-05-17 137 868
# 10 2013 6 24 2013-06-24 175 906
# using this data
df = read.table(text = "year month day
2011 1 5
2011 2 14
2011 8 21
2012 2 24
2012 3 3
2012 4 4
2012 5 6
2013 2 14
2013 5 17
2013 6 24", header = TRUE)
This is all using base R. If you handle dates and times frequently, you may also want to look at the lubridate package.
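As a quick check that the Date class takes care of February and leap years, which was the weak point of the hand-written function in the question, here is a small sketch with a few hand-picked dates (the dates are illustrative, not from the question's data):

```r
# 2012 is a leap year, 2013 is not.
d <- as.Date(c("2012-02-28", "2012-03-01", "2012-12-31", "2013-12-31"))

# Day of the year: 1 March 2012 is day 61 because 29 February exists.
day_of_year <- as.integer(format(d, "%j"))
print(day_of_year)

# Subtracting Dates gives the gap in days, leap day included.
gap <- as.integer(d[2] - d[1])
print(gap)
```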

categorize based on date ranges in R

How do I categorize each row in a large R dataframe (>2 million rows) based on date range definitions in a separate, much smaller R dataframe (12 rows)?
My large dataframe, captures, looks similar to this when called via head(captures) :
id date sex
1 160520 2016-11-22 1
2 1029735 2016-11-12 1
3 1885200 2016-11-05 1
4 2058366 2015-09-26 2
5 2058367 2015-09-26 1
6 2058368 2015-09-26 1
My small dataframe, seasons, looks similar to this in its entirety:
Season Opening.Date Closing.Date
2016 2016-09-24 2017-01-15
2015 2015-09-26 2016-01-10
2014 2014-09-27 2015-01-11
2013 2013-09-28 2014-01-12
2012 2012-09-22 2013-01-13
2011 2011-09-24 2012-01-08
2010 2010-09-25 2011-01-16
2009 2009-09-26 2010-01-17
2008 2008-09-27 2009-01-18
2007 2007-09-22 2008-01-13
2006 2006-09-23 2007-01-14
2005 2005-09-24 2006-01-15
I need to add a 'season' column to my captures dataframe where the value would be determined based on if and where captures$date falls in the ranges defined in seasons.
Here is a long-hand solution I came up with that isn't working for me because my dataframe is so large.
#add packages
library(dplyr)
library(lubridate)
#make blank column
captures$season=NA
for (i in 1:length(seasons$Season)){
for (j in 1:length(captures$id)) {
captures$season[j]=ifelse(between(captures$date[j],ymd(seasons$Opening.Date[i]),ymd(seasons$Closing.Date[i])),seasons$Season[i],captures$season[j])
}
}
Again, this doesn't work for me as R crashes every time. I also realize this doesn't take advantage of vectorization in R. Any help here is appreciated!
Here's using non equi joins from data.table:
require(data.table) # v1.10.4+
setDT(captures) # convert data.frames to data.tables
setDT(seasons)
ans <- seasons[captures, Season,
on=.(Opening.Date<=date, Closing.Date>=date),
mult="first"]
# [1] 2016 2016 2016 2015 2015 2015
captures[, season := ans]
For each row in captures, the index of the first matching row (mult="first") in seasons is found from the conditions given to the on argument. The value of Season at those indices is then returned and saved as ans, which is added as a new column to captures by reference.
I've shown it in two steps for sake of understanding.
You can see the first matching indices by using which=TRUE instead:
seasons[captures,
on=.(Opening.Date<=date, Closing.Date>=date),
mult="first",
which=TRUE]
# [1] 1 1 1 2 2 2
It would be great if a join could be done efficiently on a range of values instead of on equality. Unfortunately, I don't know whether a general solution exists. For the time being, I suggest using a single for loop.
Vectorization pays off most along the longest dimension. That is, if we loop over one data.frame and vectorize over the other, it makes more sense to vectorize over the longer one and loop over the shorter one. With this in mind, we'll loop over the rows of seasons and vectorize over the 2M rows of data.
Your data:
txt <- "Season Opening.Date Closing.Date
2016 2016-09-24 2017-01-15
2015 2015-09-26 2016-01-10
2014 2014-09-27 2015-01-11
2013 2013-09-28 2014-01-12
2012 2012-09-22 2013-01-13
2011 2011-09-24 2012-01-08
2010 2010-09-25 2011-01-16
2009 2009-09-26 2010-01-17
2008 2008-09-27 2009-01-18
2007 2007-09-22 2008-01-13
2006 2006-09-23 2007-01-14
2005 2005-09-24 2006-01-15"
seasons <- read.table(text = txt, header = TRUE)
seasons[2:3] <- lapply(seasons[2:3], as.Date)
txt <- " id date sex
1 160520 2016-11-22 1
2 1029735 2016-11-12 1
3 1885200 2016-11-05 1
4 2058366 2015-09-26 2
5 2058367 2015-09-26 1
6 2058368 2015-09-26 1"
dat <- read.table(text = txt, header = TRUE)
dat$date <- as.Date(dat$date)
To start the process, we assume that no row has a season assigned yet:
dat$season <- NA
Loop around each of the seasons' rows:
for (i in seq_len(nrow(seasons))) {
dat$season <- ifelse(is.na(dat$season) &
dat$date >= seasons$Opening.Date[i] &
dat$date < seasons$Closing.Date[i],
seasons$Season[i], dat$season)
}
dat
# id date sex season
# 1 160520 2016-11-22 1 2016
# 2 1029735 2016-11-12 1 2016
# 3 1885200 2016-11-05 1 2016
# 4 2058366 2015-09-26 2 2015
# 5 2058367 2015-09-26 1 2015
# 6 2058368 2015-09-26 1 2015
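If the seasons never overlap, the loop can be avoided entirely with base R's findInterval, which locates each date among the sorted opening dates in one vectorized call. The sketch below assumes non-overlapping seasons and uses a made-up three-row subset of each table. Unlike the loop above, it treats the closing date as inclusive.

```r
# Made-up subsets of the two tables from the question.
seasons <- data.frame(
  Season       = c(2016, 2015, 2014),
  Opening.Date = as.Date(c("2016-09-24", "2015-09-26", "2014-09-27")),
  Closing.Date = as.Date(c("2017-01-15", "2016-01-10", "2015-01-11"))
)
captures <- data.frame(
  id   = c(160520, 2058366, 999),
  date = as.Date(c("2016-11-22", "2015-09-26", "2016-06-01"))
)

# findInterval requires the break points in increasing order.
seasons <- seasons[order(seasons$Opening.Date), ]

# Index of the latest season that opened on or before each capture date.
idx <- findInterval(as.numeric(captures$date), as.numeric(seasons$Opening.Date))
idx[idx == 0] <- NA   # date is earlier than the first opening date

# Keep the season only if the date is not past that season's closing date.
in_season <- !is.na(idx) & captures$date <= seasons$Closing.Date[idx]
captures$season <- ifelse(in_season, seasons$Season[idx], NA)
print(captures)
```

The third capture (2016-06-01) falls between two seasons and therefore gets NA.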
You could try sqldf. Note that I had to change the dot in Opening.Date and Closing.Date to an underscore (Opening_Date, Closing_Date).
library(sqldf)
captures$season <- sqldf("select Season from seasons s, captures c
where c.date >= s.Opening_Date and c.date <= s.Closing_Date")
captures
id date sex Season
1 160520 2016-11-22 1 2016
2 1029735 2016-11-12 1 2016
3 1885200 2016-11-05 1 2016
4 2058366 2015-09-26 2 2015
5 2058367 2015-09-26 1 2015
6 2058368 2015-09-26 1 2015
Data:
txt <- "Season Opening_Date Closing_Date
2016 2016-09-24 2017-01-15
2015 2015-09-26 2016-01-10
2014 2014-09-27 2015-01-11
2013 2013-09-28 2014-01-12
2012 2012-09-22 2013-01-13
2011 2011-09-24 2012-01-08
2010 2010-09-25 2011-01-16
2009 2009-09-26 2010-01-17
2008 2008-09-27 2009-01-18
2007 2007-09-22 2008-01-13
2006 2006-09-23 2007-01-14
2005 2005-09-24 2006-01-15"
seasons <- read.table(text = txt, header = TRUE)
seasons[2:3] <- lapply(seasons[2:3], as.Date)
txt <- " id date sex
1 160520 2016-11-22 1
2 1029735 2016-11-12 1
3 1885200 2016-11-05 1
4 2058366 2015-09-26 2
5 2058367 2015-09-26 1
6 2058368 2015-09-26 1"
captures <- read.table(text = txt, header = TRUE)
captures$date <- as.Date(captures$date)

Finding the third Friday of a month and data table

I want to find the third Friday of each month, which is the delivery date of the futures. I used getNthDayOfWeek from the RcppBDT package:
library(data.table)
library(RcppBDT)
data <- setDT(data.frame(mon=c(5:12, 1:12, 1:12, 1:4),
year=c(rep(2011,8), rep(2012,12), rep(2013,12), rep(2014,4))))
data[, third.friday:= getNthDayOfWeek(third, Fri, mon, year)]
However I get this message: Error: expecting a single value. What am I missing?
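Since you did not specify a by clause, := passes the whole mon and year columns to getNthDayOfWeek at once, and the function (presumably) expects a single value for each argument rather than a vector.
This should work (Data here is the data.table from the question):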
Data[
,third.friday := getNthDayOfWeek(third, Fri, mon, year)
,by = "mon,year"]
Data
# mon year third.friday
#1: 5 2011 2011-05-20
#2: 6 2011 2011-06-17
#3: 7 2011 2011-07-15
#4: 8 2011 2011-08-19
#5: 9 2011 2011-09-16
#6: 10 2011 2011-10-21
#7: 11 2011 2011-11-18
#8: 12 2011 2011-12-16
#9: 1 2012 2012-01-20
Or, more generally, in case you have duplicate mon,year tuples in your object:
Data[,Idx := 1:.N][
,third.friday := getNthDayOfWeek(third, Fri, mon, year)
,by = "mon,year,Idx"
][,Idx := NULL][]
# mon year third.friday
#1: 5 2011 2011-05-20
#2: 6 2011 2011-06-17
#3: 7 2011 2011-07-15
#4: 8 2011 2011-08-19
#5: 9 2011 2011-09-16
#6: 10 2011 2011-10-21
#7: 11 2011 2011-11-18
#8: 12 2011 2011-12-16
#9: 1 2012 2012-01-20
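For comparison, the third Friday can also be computed with vectorized base R date arithmetic, which sidesteps the single-value restriction. This is a sketch that does not use RcppBDT. The helper name third_friday is made up for illustration.

```r
# Hypothetical helper: third Friday of each (year, mon) pair, vectorized.
third_friday <- function(year, mon) {
  first <- as.Date(sprintf("%04d-%02d-01", as.integer(year), as.integer(mon)))
  wday <- as.POSIXlt(first)$wday   # 0 = Sunday, ..., 5 = Friday
  offset <- (5 - wday) %% 7        # days from the 1st to the first Friday
  first + offset + 14              # two weeks after the first Friday
}

third_friday(2011, 5)
third_friday(c(2011, 2012), c(6, 1))
```

Because the function accepts whole columns, it could be called inside := without a by clause.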

Correct previous year by id within R

I have data something like this:
df <- data.frame(Id=c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,9,9,9,9),Date=c("2013-04","2013-12","2013-01","2013-12","2013-11",
"2013-12","2012-04","2013-12","2012-08","2014-12","2013-08","2014-12","2013-08","2014-12","2011-01","2013-11","2013-12","2014-01","2014-04"))
To get the correct format:
df$Date <- paste0(df$Date,"-01")
I need to obtain only the years, so that each Id contains 2 consecutive years.
If I do something like this on the existing data:
require(lubridate)
df$Date <- year(as.Date(df$Date)-days(1))
I sometimes get the same year twice for a given Id.
The desired output for the column Date is this:
2012 2013 2012 2013 2012 2013 2012 2013 2013 2014 2013 2014 2013 2014 2011 2013 2014
Please note that the last date for a given Id is always correct, so only the preceding year has to be corrected based on the last date. The dates have to be in a format that can be converted to years only, as shown.
EDIT Here is the case:
Id Date
1 2013-11-01
1 2013-12-01
1 2014-01-01
1 2014-04-01
Now I'm getting this: 2012,2013,2013,2013
I would need: 2012,2013,2013,2014
This is how I would solve this using data.table package (though it looks over complicated to me)
library(data.table)
setDT(df)[, year := year(Date)][,
year := if(.N == 2) (year[2] - 1):year[2] else year,
Id][]
# Id Date year indx
# 1: 1 2013-04-01 2012 2
# 2: 1 2013-12-01 2013 2
# 3: 2 2013-01-01 2012 2
# 4: 2 2013-12-01 2013 2
# 5: 3 2013-11-01 2012 2
# 6: 3 2013-12-01 2013 2
# 7: 4 2012-04-01 2012 2
# 8: 4 2013-12-01 2013 2
# 9: 5 2012-08-01 2013 2
# 10: 5 2014-12-01 2014 2
# 11: 6 2013-08-01 2013 2
# 12: 6 2014-12-01 2014 2
# 13: 7 2013-08-01 2013 2
# 14: 7 2014-12-01 2014 2
# 15: 8 2011-01-01 2011 1
Or all in one step (thanks to @Arun for providing this):
setDT(df)[, year := {tmp = year(Date);
if (.N == 2L) (tmp[2]-1L):tmp[2] else tmp},
Id]
Edit:
Per the OP's new data, we can modify the code by adding an additional index:
setDT(df)[, indx := if(.N > 2) rep(seq_len(.N/2), each = 2) + 1L else .N, Id]
df[, year := {tmp = year(Date); if (.N > 1L) (tmp[2] - 1L):tmp[2] else tmp},
list(Id, indx)][]
# Id Date indx year
# 1: 1 2013-04-01 2 2012
# 2: 1 2013-12-01 2 2013
# 3: 2 2013-01-01 2 2012
# 4: 2 2013-12-01 2 2013
# 5: 3 2013-11-01 2 2012
# 6: 3 2013-12-01 2 2013
# 7: 4 2012-04-01 2 2012
# 8: 4 2013-12-01 2 2013
# 9: 5 2012-08-01 2 2013
# 10: 5 2014-12-01 2 2014
# 11: 6 2013-08-01 2 2013
# 12: 6 2014-12-01 2 2014
# 13: 7 2013-08-01 2 2013
# 14: 7 2014-12-01 2 2014
# 15: 8 2011-01-01 1 2011
# 16: 9 2013-11-01 2 2012
# 17: 9 2013-12-01 2 2013
# 18: 9 2014-01-01 3 2013
# 19: 9 2014-04-01 3 2014
Or another possible solution provided by @akrun:
setDT(df)[, `:=`(year = year(Date), indx = .N, indx2 = as.numeric(gl(.N,2, .N))), Id]
df[indx > 1, year:=(year[2]-1):year[2], list(Id, indx2)][]
Using dplyr with an approach similar to @David Arenburg's:
library(dplyr)
df %>%
group_by(Id) %>%
mutate(year=as.numeric(sub('-.*', '', Date)),
year=replace(year, n()>1, c(year[2]-1, year[2])))
# Id Date year
#1 1 2013-04 2012
#2 1 2013-12 2013
#3 2 2013-01 2012
#4 2 2013-12 2013
#5 3 2013-11 2012
#6 3 2013-12 2013
#7 4 2012-04 2012
#8 4 2013-12 2013
#9 5 2012-08 2013
#10 5 2014-12 2014
#11 6 2013-08 2013
#12 6 2014-12 2014
#13 7 2013-08 2013
#14 7 2014-12 2014
#15 8 2011-01 2011
Or using base R
with(df, ave(as.numeric(sub('-.*', '', Date)), Id,
FUN=function(x) if(length(x)>1)(x[2]-1):x[2] else x))
#[1] 2012 2013 2012 2013 2012 2013 2012 2013 2013 2014 2013 2014 2013 2014 2011
Update
You can try
df$indx <- with(df, ave(Id, Id, FUN=function(x) (seq_along(x)-1)%/%2+1))
with(df, ave(as.numeric(sub('-.*', '', Date)), Id, indx,
FUN=function(x) if(length(x)>1)(x[2]-1):x[2] else x))
#[1] 2012 2013 2012 2013 2012 2013 2012 2013 2013 2014 2013 2014 2013 2014 2011
#[16] 2012 2013 2013 2014
Or
df %>%
group_by(Id) %>%
mutate(year=as.numeric(sub('-.*', '', Date))) %>%
# note: dplyr >= 1.0.0 spells this argument .add=TRUE; add=TRUE is deprecated there
group_by(indx=cumsum(rep(c(TRUE,FALSE), length.out=n())), add=TRUE) %>%
mutate(year=replace(year, n()>1, c(year[2]-1, year[2])))
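The pairing index used in the updated answers above can be checked in isolation. The sketch below takes the four years of the question's EDIT case (2013, 2013, 2014, 2014), groups the rows into consecutive pairs, and rewrites each pair as (second year - 1, second year). The variable names are illustrative.

```r
# Years extracted from the four dates in the question's EDIT example.
years <- c(2013, 2013, 2014, 2014)

# Rows 1-2 form pair 1, rows 3-4 form pair 2, and so on.
indx <- (seq_along(years) - 1) %/% 2 + 1
print(indx)

# Within each pair, keep the last year and set the first to one year earlier.
fixed <- ave(years, indx,
             FUN = function(x) if (length(x) > 1) (x[2] - 1):x[2] else x)
print(fixed)
```

This reproduces the 2012, 2013, 2013, 2014 sequence requested in the question.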
Here's a dplyr solution. You can remove the intermediate fields last_year and year2, but I left them here for clarity:
library(stringr)
library(dplyr)
df %>%
group_by(Id) %>%
mutate(
last_year = last(as.integer(str_sub(Date, 1, 4))),
year2 = row_number() - n(),
year = last_year + year2
)
