I am trying to figure out how to find the closest date in 1 zoo object to a given date in another zoo object (could also use data.frame). Suppose I have:
dates.zoo <- zoo(data.frame(val=seq(1:121)), order.by = seq.Date(as.Date('2018-12-01'), as.Date('2019-03-31'), "days"))
monthly.zoo <- zoo(data.frame(val=c(1,2,4)), order.by = c(as.Date('2018-12-14'), as.Date('2019-1-2'), as.Date('2019-2-3')))
For each date in dates.zoo I would like to align it with the closest previous date in monthly.zoo. (NA if no monthly date is found). So the data.frame/zoo object I am expecting is:
...
2018-12-02 2 NA
...
2018-12-14 14 2018-12-14
2018-12-15 15 2018-12-14
2018-12-16 16 2018-12-14
...
2019-01-01 32 2018-12-14
2019-01-02 33 2019-01-02
2019-01-03 34 2019-01-02
...
NOTE: I would prefer a Base-R solution but others would be interesting to see also
Following through on Henrik's suggestion to use findInterval. We can do:
library(zoo)
interval.idx <- findInterval(index(dates.zoo), index(monthly.zoo))
interval.idx <- ifelse(interval.idx == 0, NA, interval.idx)
dates.zoo$month <- index(monthly.zoo)[interval.idx]
A rolling join using data.table can be used.
See also: https://www.r-bloggers.com/understanding-data-table-rolling-joins/
Also a solution using base-R
data.table solution
library(data.table)
dates.df <- data.table(val=seq(1:121), dates = seq.Date(as.Date('2018-12-01'), as.Date('2019-03-31'), "days"))
monthly.df <- data.table(val=c(1,2,4,5), dates = c(as.Date('2018-12-14'), as.Date('2019-1-2'), as.Date('2019-2-3')))
setkeyv(dates.df,"dates")
setkeyv(monthly.df,"dates")
#monthly.df[,nearest:=(dates)][dates.df,roll = 'nearest'] #closest date
monthly.df[,nearest:=(dates)][dates.df,roll = Inf] #Closest _previous_ date
base R solution
dates.df <- zoo(data.frame(val=seq(1:121)), order.by = seq.Date(as.Date('2018-12-01'), as.Date('2019-03-31'), "days"))
monthly.df <- zoo(data.frame(val=c(1,2,4)), order.by = c(as.Date('2018-12-14'), as.Date('2019-1-2'), as.Date('2019-2-3')))
dates.df <- data.frame(val=dates.df$val,dates=attributes(dates.df)$index)
monthly.df <- data.frame(val=monthly.df$val,dates=attributes(monthly.df)$index)
min_distances <- as.numeric(dates.df$dates)- matrix(rep(as.numeric(monthly.df$dates),nrow(dates.df)),ncol=length(monthly.df$dates),byrow=T)
min_distances <- as.data.frame(t(min_distances))
closest <- sapply(min_distances,function(x)
{
w <- which(x==min(x[x>0]));
ifelse(length(w)==0,NA,w)
})
dates.df$closest_month <- monthly.df$dates[closest]
Results: data.table
> monthly.df[,nearest:=(dates)][dates.df,roll = Inf]
val dates nearest i.val
1: NA 2018-12-01 <NA> 1
2: NA 2018-12-02 <NA> 2
3: NA 2018-12-03 <NA> 3
4: NA 2018-12-04 <NA> 4
5: NA 2018-12-05 <NA> 5
---
118: 4 2019-03-27 2019-02-03 117
119: 4 2019-03-28 2019-02-03 118
120: 4 2019-03-29 2019-02-03 119
121: 4 2019-03-30 2019-02-03 120
122: 4 2019-03-31 2019-02-03 121
Results base R
> dates.df[64:69,]
val dates closest_month
2019-02-02 64 2019-02-02 2019-01-02
2019-02-03 65 2019-02-03 2019-01-02
2019-02-04 66 2019-02-04 2019-02-03
2019-02-05 67 2019-02-05 2019-02-03
2019-02-06 68 2019-02-06 2019-02-03
2019-02-07 69 2019-02-07 2019-02-03
If, for each date in dates.df, you want to get the closest date in monthly.df which is less than the given date, and monthly.df is sorted by date ascending, you can use the method below. It counts the number of rows in monthly.df with index less than the given date, which is equivalent to the index if mothly.df is sorted by date ascending. If there are 0 such rows, the index is changed to NA.
inds <- rowSums(outer(index(dates.df), index(monthly.df), `>`))
inds[inds == 0] <- NA
dates.df_monthmatch <- index(monthly.df)[inds]
dates.df_monthmatch
# [1] NA NA NA NA NA NA
# [7] NA NA NA NA NA NA
# [13] NA NA "2018-12-14" "2018-12-14" "2018-12-14" "2018-12-14"
# [19] "2018-12-14" "2018-12-14" "2018-12-14" "2018-12-14" "2018-12-14" "2018-12-14"
# [25] "2018-12-14" "2018-12-14" "2018-12-14" "2018-12-14" "2018-12-14" "2018-12-14"
# [31] "2018-12-14" "2018-12-14" "2018-12-14" "2019-01-02" "2019-01-02" "2019-01-02"
# [37] "2019-01-02" "2019-01-02" "2019-01-02" "2019-01-02" "2019-01-02" "2019-01-02"
# [43] "2019-01-02" "2019-01-02" "2019-01-02" "2019-01-02" "2019-01-02" "2019-01-02"
# [49] "2019-01-02" "2019-01-02" "2019-01-02" "2019-01-02" "2019-01-02" "2019-01-02"
# [55] "2019-01-02" "2019-01-02" "2019-01-02" "2019-01-02" "2019-01-02" "2019-01-02"
# [61] "2019-01-02" "2019-01-02" "2019-01-02" "2019-01-02" "2019-01-02" "2019-02-03"
# [67] "2019-02-03" "2019-02-03" "2019-02-03" "2019-02-03" "2019-02-03" "2019-02-03"
# [73] "2019-02-03" "2019-02-03" "2019-02-03" "2019-02-03" "2019-02-03" "2019-02-03"
# [79] "2019-02-03" "2019-02-03" "2019-02-03" "2019-02-03" "2019-02-03" "2019-02-03"
# [85] "2019-02-03" "2019-02-03" "2019-02-03" "2019-02-03" "2019-02-03" "2019-02-03"
# [91] "2019-02-03" "2019-02-03" "2019-02-03" "2019-02-03" "2019-02-03" "2019-02-03"
# [97] "2019-02-03" "2019-02-03" "2019-02-03" "2019-02-03" "2019-02-03" "2019-02-03"
# [103] "2019-02-03" "2019-02-03" "2019-02-03" "2019-02-03" "2019-02-03" "2019-02-03"
# [109] "2019-02-03" "2019-02-03" "2019-02-03" "2019-02-03" "2019-02-03" "2019-02-03"
# [115] "2019-02-03" "2019-02-03" "2019-02-03" "2019-02-03" "2019-02-03" "2019-02-03"
# [121] "2019-02-03"
Here is a possibility, although I did have to change the object to a data frame in order to assign the zoo index dates. This code compares the month, then year, and then finally day with criteria that it is less than or equal to the date to be matched against. If there is no date that matches this criteria then an NA is assigned. These comparisons were done with he package 'lubridate' checking for the individual date elements, and then which to logically index the best match.
library(zoo)
library(lubridate)
dates.df <- zoo(data.frame(val=seq(1:121)), order.by = seq.Date(as.Date('2018-12-01'), as.Date('2019-03-31'), "days"))
monthly.df <- zoo(data.frame(val=c(1,2,4)), order.by = c(as.Date('2018-12-14'), as.Date('2019-1-2'), as.Date('2019-2-3')))
month_m<-month(monthly.df)
month_d<-month(dates.df)
year_m<-year(monthly.df)
year_d<-year(dates.df)
day_m<-day(monthly.df)
day_d<-day(dates.df)
index<-list()
Index<-list()
for( i in 1:length(monthly.df)){
index[[i]]<-which(month_m[i] == month_d & year_m[i] == year_d
& day_d <= day_m[i])
test<-unlist(index[[i]])
#Assigns NA if no suitable match is found
if(length(test)==0){
print("NA")
Index[[i]]=NA
}else {
Index[[i]]<-tail(test, n=1)
}
}
Test<-unlist(Index)
monthly.df_Fin<-as.data.frame(monthly.df)
dates.df_Fin<-as.data.frame(dates.df)
monthly.df_Fin$match<-as.character(row.names(dates.df_Fin)[Test])
monthly.df_Fin$value<-dates.df_Fin[Test,]
> monthly.df_Fin
val match value
2018-12-14 1 2018-12-14 14
2019-01-02 2 2019-01-02 33
2019-02-03 4 2019-02-03 65
Say we changed a value outside of the critera range:
monthly.df <- zoo(data.frame(val=c(1,2,4)), order.by = c(as.Date('2018-12-
14'), as.Date('2019-1-2'), as.Date('2017-2-3')))
....
#Result
> monthly.df_Fin
val match value
2017-02-03 4 <NA> NA
2018-12-14 1 2018-12-14 14
2019-01-02 2 2019-01-02 33
Related
I have a data frame which looks like this:
Subscription MonthlyPayment FirstPaymentDate NumberofPayments
<chr> <dbl> <date> <int>
1 Netflix 12.99 2021-05-24 21
2 Spotify 9.99 2021-08-17 7
3 PureGym 19.99 2022-07-04 9
4 DisneyPlus 7.99 2020-10-26 11
5 AmazonPrime 34.99 2020-08-11 73
6 Youtube 12.99 2020-09-27 35
I want to find out future payment dates for each subscription service. For example Netflix has 21 monthly payments, so I want to list out all the monthly payment days from the first payment date. How would I do this for each subscription service, using dplyr?
You can use dplyr and tidyr; I create a list of sequential payments (rowwise) and then unnest that list
library(dplyr)
library(tidyr)
df %>%
rowwise() %>%
mutate(Payments = list(seq(FirstPaymentDate, by="month", length.out=NumberofPayments))) %>%
unnest(Payments)
Output:
# A tibble: 156 × 5
Subscription MonthlyPayment FirstPaymentDate NumberofPayments Payments
<chr> <dbl> <date> <int> <date>
1 Netflix 13.0 2021-05-24 21 2021-05-24
2 Netflix 13.0 2021-05-24 21 2021-06-24
3 Netflix 13.0 2021-05-24 21 2021-07-24
4 Netflix 13.0 2021-05-24 21 2021-08-24
5 Netflix 13.0 2021-05-24 21 2021-09-24
6 Netflix 13.0 2021-05-24 21 2021-10-24
7 Netflix 13.0 2021-05-24 21 2021-11-24
8 Netflix 13.0 2021-05-24 21 2021-12-24
9 Netflix 13.0 2021-05-24 21 2022-01-24
10 Netflix 13.0 2021-05-24 21 2022-02-24
# … with 146 more rows
You can add the months based on NumberofPayments directly to FirstPaymentDate. This approach does not require dplyr.
library(lubridate)
library(purrr)
df <- data.frame(sub = c("Netflix", "Spotify", "PureGym", "DisneyPlus", "AmazonPrime", "Youtube"),
mo_pay = c(12.99, 9.99, 19.99, 7.99, 34.99, 12.99),
dt_fpay = as.Date(c("2021-05-24", "2021-08-17", "2022-07-04", "2020-10-26", "2020-08-11", "2020-09-27")),
n_pay = c(21, 7, 9, 11, 73, 35))
pay_dt <- map(seq(nrow(df)),
function(x) df$dt_fpay[x] %m+% months(seq(df$n_pay[x])))
names(pay_dt) <- df$sub
pay_dt
output:
> pay_dt
$Netflix
[1] "2021-06-24" "2021-07-24" "2021-08-24" "2021-09-24" "2021-10-24" "2021-11-24"
[7] "2021-12-24" "2022-01-24" "2022-02-24" "2022-03-24" "2022-04-24" "2022-05-24"
[13] "2022-06-24" "2022-07-24" "2022-08-24" "2022-09-24" "2022-10-24" "2022-11-24"
[19] "2022-12-24" "2023-01-24" "2023-02-24"
$Spotify
[1] "2021-09-17" "2021-10-17" "2021-11-17" "2021-12-17" "2022-01-17" "2022-02-17"
[7] "2022-03-17"
$PureGym
[1] "2022-08-04" "2022-09-04" "2022-10-04" "2022-11-04" "2022-12-04" "2023-01-04"
[7] "2023-02-04" "2023-03-04" "2023-04-04"
$DisneyPlus
[1] "2020-11-26" "2020-12-26" "2021-01-26" "2021-02-26" "2021-03-26" "2021-04-26"
[7] "2021-05-26" "2021-06-26" "2021-07-26" "2021-08-26" "2021-09-26"
$AmazonPrime
[1] "2020-09-11" "2020-10-11" "2020-11-11" "2020-12-11" "2021-01-11" "2021-02-11"
[7] "2021-03-11" "2021-04-11" "2021-05-11" "2021-06-11" "2021-07-11" "2021-08-11"
[13] "2021-09-11" "2021-10-11" "2021-11-11" "2021-12-11" "2022-01-11" "2022-02-11"
[19] "2022-03-11" "2022-04-11" "2022-05-11" "2022-06-11" "2022-07-11" "2022-08-11"
[25] "2022-09-11" "2022-10-11" "2022-11-11" "2022-12-11" "2023-01-11" "2023-02-11"
[31] "2023-03-11" "2023-04-11" "2023-05-11" "2023-06-11" "2023-07-11" "2023-08-11"
[37] "2023-09-11" "2023-10-11" "2023-11-11" "2023-12-11" "2024-01-11" "2024-02-11"
[43] "2024-03-11" "2024-04-11" "2024-05-11" "2024-06-11" "2024-07-11" "2024-08-11"
[49] "2024-09-11" "2024-10-11" "2024-11-11" "2024-12-11" "2025-01-11" "2025-02-11"
[55] "2025-03-11" "2025-04-11" "2025-05-11" "2025-06-11" "2025-07-11" "2025-08-11"
[61] "2025-09-11" "2025-10-11" "2025-11-11" "2025-12-11" "2026-01-11" "2026-02-11"
[67] "2026-03-11" "2026-04-11" "2026-05-11" "2026-06-11" "2026-07-11" "2026-08-11"
[73] "2026-09-11"
$Youtube
[1] "2020-10-27" "2020-11-27" "2020-12-27" "2021-01-27" "2021-02-27" "2021-03-27"
[7] "2021-04-27" "2021-05-27" "2021-06-27" "2021-07-27" "2021-08-27" "2021-09-27"
[13] "2021-10-27" "2021-11-27" "2021-12-27" "2022-01-27" "2022-02-27" "2022-03-27"
[19] "2022-04-27" "2022-05-27" "2022-06-27" "2022-07-27" "2022-08-27" "2022-09-27"
[25] "2022-10-27" "2022-11-27" "2022-12-27" "2023-01-27" "2023-02-27" "2023-03-27"
[31] "2023-04-27" "2023-05-27" "2023-06-27" "2023-07-27" "2023-08-27"
I would like an object that gives me a date range for every month (or quarter) from 1990-01-01 to 2021-12-31, separated by a colon. So for example in the monthly case, the first object would be 1990-01-01:1990-01-31, the second object would be 1990-02-01:1990-02-31, and so on.
The issue I am having trouble with is making sure that the date range is exclusive, i.e., that no date gets repeated.
start_date1 <- as.Date("1990-01-01", "%Y-%m-%d")
end_date1 <- as.Date("2021-12-01", "%Y-%m-%d")
first_date <- format(seq(start_date1,end_date1,by="month"),"%Y-%m-%d")
start_date2 <- as.Date("1990-02-01", "%Y-%m-%d")
end_date2 <- as.Date("2022-01-01", "%Y-%m-%d")
second_date <- format(seq(start_date2,end_date2,by="month"),"%Y-%m-%d")
date<-paste0(first_date, ":")
finaldate<-paste0(date, second_date)
This code works, except that the first date in each month gets repeated "1990-01-01:1990-02-01" "1990-02-01:1990-03-01", and that the last date is "2021-12-01:2022-01-01" (including Jan 1, 2022 rather than stopping at Dec 31, 2021.
If I go by 30 days instead, it doesn't work as well because not every month has 30 days.
What's the best way to get an exclusive date range?
You could do:
dates <- seq(as.Date("1990-01-01"), as.Date("2022-01-01"), by = "month")
dates <- paste(head(dates, -1), tail(dates-1, - 1), sep = ":")
resulting in:
dates
#> [1] "1990-01-01:1990-01-31" "1990-02-01:1990-02-28" "1990-03-01:1990-03-31"
#> [4] "1990-04-01:1990-04-30" "1990-05-01:1990-05-31" "1990-06-01:1990-06-30"
#> [7] "1990-07-01:1990-07-31" "1990-08-01:1990-08-31" "1990-09-01:1990-09-30"
#> [10] "1990-10-01:1990-10-31" "1990-11-01:1990-11-30" "1990-12-01:1990-12-31"
#> [13] "1991-01-01:1991-01-31" "1991-02-01:1991-02-28" "1991-03-01:1991-03-31"
#> [16] "1991-04-01:1991-04-30" "1991-05-01:1991-05-31" "1991-06-01:1991-06-30"
#> [19] "1991-07-01:1991-07-31" "1991-08-01:1991-08-31" "1991-09-01:1991-09-30"
#> [22] "1991-10-01:1991-10-31" "1991-11-01:1991-11-30" "1991-12-01:1991-12-31"
#> [25] "1992-01-01:1992-01-31" "1992-02-01:1992-02-29" "1992-03-01:1992-03-31"
#> [28] "1992-04-01:1992-04-30" "1992-05-01:1992-05-31" "1992-06-01:1992-06-30"
#> [31] "1992-07-01:1992-07-31" "1992-08-01:1992-08-31" "1992-09-01:1992-09-30"
#> [34] "1992-10-01:1992-10-31" "1992-11-01:1992-11-30" "1992-12-01:1992-12-31"
#> [37] "1993-01-01:1993-01-31" "1993-02-01:1993-02-28" "1993-03-01:1993-03-31"
#> [40] "1993-04-01:1993-04-30" "1993-05-01:1993-05-31" "1993-06-01:1993-06-30"
#> [43] "1993-07-01:1993-07-31" "1993-08-01:1993-08-31" "1993-09-01:1993-09-30"
#> [46] "1993-10-01:1993-10-31" "1993-11-01:1993-11-30" "1993-12-01:1993-12-31"
#> [49] "1994-01-01:1994-01-31" "1994-02-01:1994-02-28" "1994-03-01:1994-03-31"
#> [52] "1994-04-01:1994-04-30" "1994-05-01:1994-05-31" "1994-06-01:1994-06-30"
#> [55] "1994-07-01:1994-07-31" "1994-08-01:1994-08-31" "1994-09-01:1994-09-30"
#> [58] "1994-10-01:1994-10-31" "1994-11-01:1994-11-30" "1994-12-01:1994-12-31"
#> [61] "1995-01-01:1995-01-31" "1995-02-01:1995-02-28" "1995-03-01:1995-03-31"
#> [64] "1995-04-01:1995-04-30" "1995-05-01:1995-05-31" "1995-06-01:1995-06-30"
#> [67] "1995-07-01:1995-07-31" "1995-08-01:1995-08-31" "1995-09-01:1995-09-30"
#> [70] "1995-10-01:1995-10-31" "1995-11-01:1995-11-30" "1995-12-01:1995-12-31"
#> [73] "1996-01-01:1996-01-31" "1996-02-01:1996-02-29" "1996-03-01:1996-03-31"
#> [76] "1996-04-01:1996-04-30" "1996-05-01:1996-05-31" "1996-06-01:1996-06-30"
#> [79] "1996-07-01:1996-07-31" "1996-08-01:1996-08-31" "1996-09-01:1996-09-30"
#> [82] "1996-10-01:1996-10-31" "1996-11-01:1996-11-30" "1996-12-01:1996-12-31"
#> [85] "1997-01-01:1997-01-31" "1997-02-01:1997-02-28" "1997-03-01:1997-03-31"
#> [88] "1997-04-01:1997-04-30" "1997-05-01:1997-05-31" "1997-06-01:1997-06-30"
#> [91] "1997-07-01:1997-07-31" "1997-08-01:1997-08-31" "1997-09-01:1997-09-30"
#> [94] "1997-10-01:1997-10-31" "1997-11-01:1997-11-30" "1997-12-01:1997-12-31"
#> [97] "1998-01-01:1998-01-31" "1998-02-01:1998-02-28" "1998-03-01:1998-03-31"
#> [100] "1998-04-01:1998-04-30" "1998-05-01:1998-05-31" "1998-06-01:1998-06-30"
#> [103] "1998-07-01:1998-07-31" "1998-08-01:1998-08-31" "1998-09-01:1998-09-30"
#> [106] "1998-10-01:1998-10-31" "1998-11-01:1998-11-30" "1998-12-01:1998-12-31"
#> [109] "1999-01-01:1999-01-31" "1999-02-01:1999-02-28" "1999-03-01:1999-03-31"
#> [112] "1999-04-01:1999-04-30" "1999-05-01:1999-05-31" "1999-06-01:1999-06-30"
#> [115] "1999-07-01:1999-07-31" "1999-08-01:1999-08-31" "1999-09-01:1999-09-30"
#> [118] "1999-10-01:1999-10-31" "1999-11-01:1999-11-30" "1999-12-01:1999-12-31"
#> [121] "2000-01-01:2000-01-31" "2000-02-01:2000-02-29" "2000-03-01:2000-03-31"
#> [124] "2000-04-01:2000-04-30" "2000-05-01:2000-05-31" "2000-06-01:2000-06-30"
#> [127] "2000-07-01:2000-07-31" "2000-08-01:2000-08-31" "2000-09-01:2000-09-30"
#> [130] "2000-10-01:2000-10-31" "2000-11-01:2000-11-30" "2000-12-01:2000-12-31"
#> [133] "2001-01-01:2001-01-31" "2001-02-01:2001-02-28" "2001-03-01:2001-03-31"
#> [136] "2001-04-01:2001-04-30" "2001-05-01:2001-05-31" "2001-06-01:2001-06-30"
#> [139] "2001-07-01:2001-07-31" "2001-08-01:2001-08-31" "2001-09-01:2001-09-30"
#> [142] "2001-10-01:2001-10-31" "2001-11-01:2001-11-30" "2001-12-01:2001-12-31"
#> [145] "2002-01-01:2002-01-31" "2002-02-01:2002-02-28" "2002-03-01:2002-03-31"
#> [148] "2002-04-01:2002-04-30" "2002-05-01:2002-05-31" "2002-06-01:2002-06-30"
#> [151] "2002-07-01:2002-07-31" "2002-08-01:2002-08-31" "2002-09-01:2002-09-30"
#> [154] "2002-10-01:2002-10-31" "2002-11-01:2002-11-30" "2002-12-01:2002-12-31"
#> [157] "2003-01-01:2003-01-31" "2003-02-01:2003-02-28" "2003-03-01:2003-03-31"
#> [160] "2003-04-01:2003-04-30" "2003-05-01:2003-05-31" "2003-06-01:2003-06-30"
#> [163] "2003-07-01:2003-07-31" "2003-08-01:2003-08-31" "2003-09-01:2003-09-30"
#> [166] "2003-10-01:2003-10-31" "2003-11-01:2003-11-30" "2003-12-01:2003-12-31"
#> [169] "2004-01-01:2004-01-31" "2004-02-01:2004-02-29" "2004-03-01:2004-03-31"
#> [172] "2004-04-01:2004-04-30" "2004-05-01:2004-05-31" "2004-06-01:2004-06-30"
#> [175] "2004-07-01:2004-07-31" "2004-08-01:2004-08-31" "2004-09-01:2004-09-30"
#> [178] "2004-10-01:2004-10-31" "2004-11-01:2004-11-30" "2004-12-01:2004-12-31"
#> [181] "2005-01-01:2005-01-31" "2005-02-01:2005-02-28" "2005-03-01:2005-03-31"
#> [184] "2005-04-01:2005-04-30" "2005-05-01:2005-05-31" "2005-06-01:2005-06-30"
#> [187] "2005-07-01:2005-07-31" "2005-08-01:2005-08-31" "2005-09-01:2005-09-30"
#> [190] "2005-10-01:2005-10-31" "2005-11-01:2005-11-30" "2005-12-01:2005-12-31"
#> [193] "2006-01-01:2006-01-31" "2006-02-01:2006-02-28" "2006-03-01:2006-03-31"
#> [196] "2006-04-01:2006-04-30" "2006-05-01:2006-05-31" "2006-06-01:2006-06-30"
#> [199] "2006-07-01:2006-07-31" "2006-08-01:2006-08-31" "2006-09-01:2006-09-30"
#> [202] "2006-10-01:2006-10-31" "2006-11-01:2006-11-30" "2006-12-01:2006-12-31"
#> [205] "2007-01-01:2007-01-31" "2007-02-01:2007-02-28" "2007-03-01:2007-03-31"
#> [208] "2007-04-01:2007-04-30" "2007-05-01:2007-05-31" "2007-06-01:2007-06-30"
#> [211] "2007-07-01:2007-07-31" "2007-08-01:2007-08-31" "2007-09-01:2007-09-30"
#> [214] "2007-10-01:2007-10-31" "2007-11-01:2007-11-30" "2007-12-01:2007-12-31"
#> [217] "2008-01-01:2008-01-31" "2008-02-01:2008-02-29" "2008-03-01:2008-03-31"
#> [220] "2008-04-01:2008-04-30" "2008-05-01:2008-05-31" "2008-06-01:2008-06-30"
#> [223] "2008-07-01:2008-07-31" "2008-08-01:2008-08-31" "2008-09-01:2008-09-30"
#> [226] "2008-10-01:2008-10-31" "2008-11-01:2008-11-30" "2008-12-01:2008-12-31"
#> [229] "2009-01-01:2009-01-31" "2009-02-01:2009-02-28" "2009-03-01:2009-03-31"
#> [232] "2009-04-01:2009-04-30" "2009-05-01:2009-05-31" "2009-06-01:2009-06-30"
#> [235] "2009-07-01:2009-07-31" "2009-08-01:2009-08-31" "2009-09-01:2009-09-30"
#> [238] "2009-10-01:2009-10-31" "2009-11-01:2009-11-30" "2009-12-01:2009-12-31"
#> [241] "2010-01-01:2010-01-31" "2010-02-01:2010-02-28" "2010-03-01:2010-03-31"
#> [244] "2010-04-01:2010-04-30" "2010-05-01:2010-05-31" "2010-06-01:2010-06-30"
#> [247] "2010-07-01:2010-07-31" "2010-08-01:2010-08-31" "2010-09-01:2010-09-30"
#> [250] "2010-10-01:2010-10-31" "2010-11-01:2010-11-30" "2010-12-01:2010-12-31"
#> [253] "2011-01-01:2011-01-31" "2011-02-01:2011-02-28" "2011-03-01:2011-03-31"
#> [256] "2011-04-01:2011-04-30" "2011-05-01:2011-05-31" "2011-06-01:2011-06-30"
#> [259] "2011-07-01:2011-07-31" "2011-08-01:2011-08-31" "2011-09-01:2011-09-30"
#> [262] "2011-10-01:2011-10-31" "2011-11-01:2011-11-30" "2011-12-01:2011-12-31"
#> [265] "2012-01-01:2012-01-31" "2012-02-01:2012-02-29" "2012-03-01:2012-03-31"
#> [268] "2012-04-01:2012-04-30" "2012-05-01:2012-05-31" "2012-06-01:2012-06-30"
#> [271] "2012-07-01:2012-07-31" "2012-08-01:2012-08-31" "2012-09-01:2012-09-30"
#> [274] "2012-10-01:2012-10-31" "2012-11-01:2012-11-30" "2012-12-01:2012-12-31"
#> [277] "2013-01-01:2013-01-31" "2013-02-01:2013-02-28" "2013-03-01:2013-03-31"
#> [280] "2013-04-01:2013-04-30" "2013-05-01:2013-05-31" "2013-06-01:2013-06-30"
#> [283] "2013-07-01:2013-07-31" "2013-08-01:2013-08-31" "2013-09-01:2013-09-30"
#> [286] "2013-10-01:2013-10-31" "2013-11-01:2013-11-30" "2013-12-01:2013-12-31"
#> [289] "2014-01-01:2014-01-31" "2014-02-01:2014-02-28" "2014-03-01:2014-03-31"
#> [292] "2014-04-01:2014-04-30" "2014-05-01:2014-05-31" "2014-06-01:2014-06-30"
#> [295] "2014-07-01:2014-07-31" "2014-08-01:2014-08-31" "2014-09-01:2014-09-30"
#> [298] "2014-10-01:2014-10-31" "2014-11-01:2014-11-30" "2014-12-01:2014-12-31"
#> [301] "2015-01-01:2015-01-31" "2015-02-01:2015-02-28" "2015-03-01:2015-03-31"
#> [304] "2015-04-01:2015-04-30" "2015-05-01:2015-05-31" "2015-06-01:2015-06-30"
#> [307] "2015-07-01:2015-07-31" "2015-08-01:2015-08-31" "2015-09-01:2015-09-30"
#> [310] "2015-10-01:2015-10-31" "2015-11-01:2015-11-30" "2015-12-01:2015-12-31"
#> [313] "2016-01-01:2016-01-31" "2016-02-01:2016-02-29" "2016-03-01:2016-03-31"
#> [316] "2016-04-01:2016-04-30" "2016-05-01:2016-05-31" "2016-06-01:2016-06-30"
#> [319] "2016-07-01:2016-07-31" "2016-08-01:2016-08-31" "2016-09-01:2016-09-30"
#> [322] "2016-10-01:2016-10-31" "2016-11-01:2016-11-30" "2016-12-01:2016-12-31"
#> [325] "2017-01-01:2017-01-31" "2017-02-01:2017-02-28" "2017-03-01:2017-03-31"
#> [328] "2017-04-01:2017-04-30" "2017-05-01:2017-05-31" "2017-06-01:2017-06-30"
#> [331] "2017-07-01:2017-07-31" "2017-08-01:2017-08-31" "2017-09-01:2017-09-30"
#> [334] "2017-10-01:2017-10-31" "2017-11-01:2017-11-30" "2017-12-01:2017-12-31"
#> [337] "2018-01-01:2018-01-31" "2018-02-01:2018-02-28" "2018-03-01:2018-03-31"
#> [340] "2018-04-01:2018-04-30" "2018-05-01:2018-05-31" "2018-06-01:2018-06-30"
#> [343] "2018-07-01:2018-07-31" "2018-08-01:2018-08-31" "2018-09-01:2018-09-30"
#> [346] "2018-10-01:2018-10-31" "2018-11-01:2018-11-30" "2018-12-01:2018-12-31"
#> [349] "2019-01-01:2019-01-31" "2019-02-01:2019-02-28" "2019-03-01:2019-03-31"
#> [352] "2019-04-01:2019-04-30" "2019-05-01:2019-05-31" "2019-06-01:2019-06-30"
#> [355] "2019-07-01:2019-07-31" "2019-08-01:2019-08-31" "2019-09-01:2019-09-30"
#> [358] "2019-10-01:2019-10-31" "2019-11-01:2019-11-30" "2019-12-01:2019-12-31"
#> [361] "2020-01-01:2020-01-31" "2020-02-01:2020-02-29" "2020-03-01:2020-03-31"
#> [364] "2020-04-01:2020-04-30" "2020-05-01:2020-05-31" "2020-06-01:2020-06-30"
#> [367] "2020-07-01:2020-07-31" "2020-08-01:2020-08-31" "2020-09-01:2020-09-30"
#> [370] "2020-10-01:2020-10-31" "2020-11-01:2020-11-30" "2020-12-01:2020-12-31"
#> [373] "2021-01-01:2021-01-31" "2021-02-01:2021-02-28" "2021-03-01:2021-03-31"
#> [376] "2021-04-01:2021-04-30" "2021-05-01:2021-05-31" "2021-06-01:2021-06-30"
#> [379] "2021-07-01:2021-07-31" "2021-08-01:2021-08-31" "2021-09-01:2021-09-30"
#> [382] "2021-10-01:2021-10-31" "2021-11-01:2021-11-30" "2021-12-01:2021-12-31"
Created on 2022-03-19 by the reprex package (v2.0.1)
I used lubridate for the simplicity of its ymd() function.
require(lubridate)
You start with creating a vector of first days of the month:
start <- seq(ymd("1990-01-01"), ymd("2021-12-01"), by = "month")
Then you create another vector subtracting 1 day to obtain the last day of each month:
b <- start - 1
You remove the first element of that vector
end <- b[-1]
You join them all
paste0(start, ":", end)
There's an easily (manually) fixable issue: the very last interval is incorrect.
1) yearmon/yearqtr Create a monthly sequence using yearmon class and then convert that to the start and end dates. Similarly for quarters and yearqtr class. Internally both represent dates by year and fraction of year so use 1/12 and 1/4 in by=. Also note that using as.Date gives the date at the start of the month or quarter and the same but with the frac=1 argument gives the end.
library(zoo)
# input
st <- as.Date("1990-01-01")
en <- as.Date("2021-12-01")
# by month
mon <- seq(as.yearmon(st), as.yearmon(en), 1/12)
paste(as.Date(mon), as.Date(mon, frac = 1), sep = ":")
# by quarter
qtr <- seq(as.yearqtr(st), as.yearqtr(en), 1/4)
paste(as.Date(qtr), as.Date(qtr, frac = 1), sep = ":")
There is some question of what the end date should be. The above give an end date on the last interval of 2021-12-31 but if the end date should be 2021-12-01 so that no interval extends past en then replace the two paste lines with these respectively.
paste(as.Date(mon), pmin(as.Date(mon, frac = 1), en), sep = ":")
paste(as.Date(qtr), pmin(as.Date(qtr, frac = 1), en), sep = ":")
2) Base R A base R alternative is to use the expressions involving cut shown below to get the end of period. (1) seems less tricky but this might be useful if using only base R were desired. A similar approach with pmin as in (1) could be used if we want to ensure that no range extends beyond en.
This and the remaining solutions, but not (1), assume that st is the first of the month; however, that could readily be handled if needed.
mon <- seq(st, en, by = "month")
paste(mon, as.Date(cut(mon + 31, "month")) - 1, sep = ":")
qtr <- seq(st, en, by = "quarter")
paste(as.Date(qtr), as.Date(cut(qtr + 93, "month")) - 1, sep = ":")
3) lubridate Using various functions from this package we can write the following. A similar approach using pmin as in (1) could be used if the ranges may not extend beyond en.
library(lubridate)
mon <- seq(st, en, by = "month")
paste(mon, mon + month(1) - 1, sep = ":")
qtr <- seq(st, en, by = "quarter")
paste(qtr, qtr + quarter(1) - 1, sep = ":")
4) IDate We can use IDate class from data.table in which case we can make use of cut.IDate which returns another IDate object rather than a character string (as in base R).
st <- as.IDate("1990-01-01")
en <- as.IDate("2021-12-01")
mon <- seq(st, en, by = "month")
paste(mon, cut(mon + 31, "month") - 1, sep = ":")
qtr <- seq(st, en, by = "quarter")
paste(qtr, cut(qtr + 93, "month") - 1, sep = ":")
Just now i start learning the sparklyr package using the reference sparklyr
i did what was written in the document.
when using the following code
delay <- flights_tbl %>%
group_by(tailnum) %>%
summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
filter(count > 20, dist < 2000, !is.na(delay)) %>%
collect
Warning messages:
1: Missing values are always removed in SQL.
Use `AVG(x, na.rm = TRUE)` to silence this warning
2: Missing values are always removed in SQL.
Use `AVG(x, na.rm = TRUE)` to silence this warning
> delay
# A tibble: 2,961 x 4
tailnum count dist delay
<chr> <dbl> <dbl> <dbl>
1 N14228 111 1547 3.71
2 N24211 130 1330 7.70
3 N668DN 49.0 1028 2.62
4 N39463 107 1588 2.16
5 N516JB 288 1249 12.0
6 N829AS 230 228 17.2
7 N3ALAA 63.0 1078 3.59
8 N793JB 283 1529 4.72
9 N657JB 285 1286 5.03
10 N53441 102 1661 0.941
# ... with 2,951 more rows
In the similar way i want apply the same operations on nycflights13::flights dataset using dplyr package
nycflights13::flights %>%
group_by(tailnum) %>%
summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
filter(count > 20, dist < 2000, !is.na(delay))
# A tibble: 1,319 x 4
tailnum count dist delay
<chr> <int> <dbl> <dbl>
1 N102UW 48 536 2.94
2 N103US 46 535 - 6.93
3 N105UW 45 525 - 0.267
4 N107US 41 529 - 5.73
5 N108UW 60 534 - 1.25
6 N109UW 48 536 - 2.52
7 N110UW 40 535 2.80
8 N111US 30 536 - 0.467
9 N11206 111 1414 12.7
10 N112US 38 535 - 0.947
# ... with 1,309 more rows
My problem is why i am getting the different results ?
As mention in the documentation dplyr is the complete backend operations
for sparklyr.
> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2 dplyr_0.7.4 sparklyr_0.7.0
loaded via a namespace (and not attached):
[1] DBI_0.7 readr_1.1.1 withr_2.1.1
[4] nycflights13_0.2.2 rprojroot_1.3-2 lattice_0.20-35
[7] foreign_0.8-69 pkgconfig_2.0.1 config_0.2
[10] utf8_1.1.3 compiler_3.4.0 stringr_1.3.0
[13] parallel_3.4.0 xtable_1.8-2 Rcpp_0.12.15
[16] cli_1.0.0 shiny_1.0.5 plyr_1.8.4
[19] httr_1.3.1 tools_3.4.0 openssl_1.0
[22] nlme_3.1-131.1 broom_0.4.3 R6_2.2.2
[25] dbplyr_1.2.1 bindr_0.1 purrr_0.2.4
[28] assertthat_0.2.0 curl_3.1 digest_0.6.15
[31] mime_0.5 stringi_1.1.6 rstudioapi_0.7
[34] reshape2_1.4.3 hms_0.4.1 backports_1.1.2
[37] htmltools_0.3.6 grid_3.4.0 glue_1.2.0
[40] httpuv_1.3.5 rlang_0.2.0 psych_1.7.8
[43] magrittr_1.5 rappdirs_0.3.1 lazyeval_0.2.1
[46] yaml_2.1.16 crayon_1.3.4 tidyr_0.8.0
[49] pillar_1.1.0 base64enc_0.1-3 mnormt_1.5-5
[52] jsonlite_1.5 tibble_1.4.2 Lahman_6.0-0
The key difference is that in the non-sparklyr, we are not using na.rm = TRUE in mean, therefore, those elements having NA in 'distance' or 'arr_delay' will become NA when we take the mean but in sparklyr the NA values are already removed so the argument is not needed
We can check the NA elements in 'distance' and 'arr_delay'
nycflights13::flights %>%
summarise_at(vars(distance, arr_delay), funs(sum(is.na(.))))
# A tibble: 1 x 2
# distance arr_delay
# <int> <int>
#1 0 9430 #### number of NAs
So, if we correct for that, then the output will be the same
res <- nycflights13::flights %>%
group_by(tailnum) %>%
summarise(count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)) %>%
filter(count > 20, dist < 2000, !is.na(delay)) %>%
arrange(tailnum)
res
# A tibble: 2,961 x 4
# tailnum count dist delay
# <chr> <int> <dbl> <dbl>
# 1 N0EGMQ 371 676 9.98
# 2 N10156 153 758 12.7
# 3 N102UW 48 536 2.94
# 4 N103US 46 535 - 6.93
# 5 N104UW 47 535 1.80
# 6 N10575 289 520 20.7
# 7 N105UW 45 525 - 0.267
# 8 N107US 41 529 - 5.73
# 9 N108UW 60 534 - 1.25
#10 N109UW 48 536 - 2.52
# ... with 2,951 more rows
Using sparklyr
library(sparklyr)
library(dplyr)
library(nycflights13)
sc <- spark_connect(master = "local")
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
delay <- flights_tbl %>%
group_by(tailnum) %>%
summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
filter(count > 20, dist < 2000, !is.na(delay)) %>%
arrange(tailnum) %>%
collect
delay
# A tibble: 2,961 x 4
# tailnum count dist delay
# <chr> <dbl> <dbl> <dbl>
# 1 N0EGMQ 371 676 9.98
# 2 N10156 153 758 12.7
# 3 N102UW 48.0 536 2.94
# 4 N103US 46.0 535 - 6.93
# 5 N104UW 47.0 535 1.80
# 6 N10575 289 520 20.7
# 7 N105UW 45.0 525 - 0.267
# 8 N107US 41.0 529 - 5.73
# 9 N108UW 60.0 534 - 1.25
#10 N109UW 48.0 536 - 2.52
# ... with 2,951 more rows
I have a set of observation pairs that I want to label with the intervals between their times. (In the real dataset, these observation pairs represent entry and exit microphone calibrations.)
# R version 3.2.3
library(lubridate) ## Version 1.5.6
library(dplyr) ## Version 0.5.0
data <- data.frame(
group = c(1,1,2,2,3,3),
type = rep(c("start", "end"), 3),
time = ymd_hms("2016-06-01 01:00:00") + c(0,1,3,6,12,18),
someAttribute = runif(6)
)
data
## group type time someAttribute
## 1 1 start 2016-06-01 01:00:00 0.2540128
## 2 1 end 2016-06-01 01:00:01 0.6845078
## 3 2 start 2016-06-01 01:00:03 0.3576477
## 4 2 end 2016-06-01 01:00:06 0.1223582
## 5 3 start 2016-06-01 01:00:12 0.2715063
## 6 3 end 2016-06-01 01:00:18 0.6392607
I include a dummy someAttribute in this example to emphasize that a simple solution like tidyr::spread() would make a mess of the attributes that belong to each row in data.
I have a function that makes the intervals, and I apply it by group with dplyr:
makeTwoIntervals <- function(twoDatetimes) {
return(rep(interval(twoDatetimes[1], twoDatetimes[2]), 2))
}
data2 <- data %>% group_by(group) %>% mutate(intervals = makeTwoIntervals(time))
data2$intervals
## [1] 2016-06-01 01:00:00 UTC--2016-06-01 01:00:01 UTC
## [2] 2016-06-01 01:00:00 UTC--2016-06-01 01:00:01 UTC
## [3] 2016-06-01 01:00:00 UTC--2016-06-01 01:00:03 UTC
## [4] 2016-06-01 01:00:00 UTC--2016-06-01 01:00:03 UTC
## [5] 2016-06-01 01:00:00 UTC--2016-06-01 01:00:06 UTC
## [6] 2016-06-01 01:00:00 UTC--2016-06-01 01:00:06 UTC
These values are not what I expected to get. The correct times are passed to my function, and it creates the correct two-element vector of intervals for return, but when this vector is passed back to mutate, something bad happens. Taking a closer look:
str(data2$intervals)
## Formal class 'Interval' [package "lubridate"] with 3 slots
## ..# .Data: num [1:6] 1 1 3 3 6 6
## ..# start: POSIXct[1:2], format: "2016-06-01 01:00:00" "2016-06-01 01:00:00"
## ..# tzone: chr "UTC"
It's not clear to me what went wrong here. These are the results I wanted to see:
## Desired result of data2$intervals:
## [1] 2016-06-01 01:00:00 UTC--2016-06-01 01:00:01 UTC
## [2] 2016-06-01 01:00:00 UTC--2016-06-01 01:00:01 UTC
## [3] 2016-06-01 01:00:03 UTC--2016-06-01 01:00:06 UTC
## [4] 2016-06-01 01:00:03 UTC--2016-06-01 01:00:06 UTC
## [5] 2016-06-01 01:00:12 UTC--2016-06-01 01:00:18 UTC
## [6] 2016-06-01 01:00:12 UTC--2016-06-01 01:00:18 UTC
Could anyone offer some insight into what went wrong, or how I could achieve the desired result? Am I misusing mutate, or is it just not designed to handle objects like lubridate::Interval?
This is a workaround based on #Arun's data.table workaround (#1777), but in dplyr language:
data2 <- data %>% group_by(group) %>% mutate(ranges = list(range(time)))
data3 <- data2 %>% mutate(intervals = list(interval(ranges[[1]][1], ranges[[1]][2])))
data3$intervals2 <- do.call("c", data3$intervals)
data3$intervals2
## [1] 2016-06-01 01:00:00 UTC--2016-06-01 01:00:01 UTC
## [2] 2016-06-01 01:00:00 UTC--2016-06-01 01:00:01 UTC
## [3] 2016-06-01 01:00:03 UTC--2016-06-01 01:00:06 UTC
## [4] 2016-06-01 01:00:03 UTC--2016-06-01 01:00:06 UTC
## [5] 2016-06-01 01:00:12 UTC--2016-06-01 01:00:18 UTC
## [6] 2016-06-01 01:00:12 UTC--2016-06-01 01:00:18 UTC
Not totally satisfying, but it works. Thanks for the tip, #Arun.
I am trying to run the code in a tutorial by rbresearch titled 'Low Volatility with R', however when trying to run the cbind function the time series seems to get totally misaligned.
Here's the data-preparation section that works perfectly:
require(quantmod)
symbols = c("XLY", "XLP", "XLE", "XLF", "XLV", "XLI", "XLK", "XLB", "XLU")
getSymbols(symbols, index.class=c("POSIXt","POSIXct"), from='2000-01-01')
for(symbol in symbols) {
x<-get(symbol)
x<-to.monthly(x,indexAt='lastof',drop.time=TRUE)
indexFormat(x)<-'%Y-%m-%d'
colnames(x)<-gsub("x",symbol,colnames(x))
assign(symbol,x)
}
for(symbol in symbols) {
x <- get(symbol)
x1 <- ROC(Ad(x), n=1, type="continuous", na.pad=TRUE)
colnames(x1) <- "ROC"
colnames(x1) <- paste("x",colnames(x1), sep =".")
#x2 is the 12 period standard deviation of the 1 month return
x2 <- runSD(x1, n=12)
colnames(x2) <- "RANK"
colnames(x2) <- paste("x",colnames(x2), sep =".")
x <- cbind(x,x2)
colnames(x)<-gsub("x",symbol,colnames(x))
assign(symbol,x)
}
rank.factors <- cbind(XLB$XLB.RANK, XLE$XLE.RANK, XLF$XLF.RANK, XLI$XLI.RANK,
XLK$XLK.RANK, XLP$XLP.RANK, XLU$XLU.RANK, XLV$XLV.RANK, XLY$XLY.RANK)
r <- as.xts(t(apply(rank.factors, 1, rank)))
for (symbol in symbols){
x <- get(symbol)
x <- x[,1:6]
assign(symbol,x)
}
To illustrate that the XLE ETF data data is aligned with the XLE Ranked data:
> head(XLE)
XLE.Open XLE.High XLE.Low XLE.Close XLE.Vodalume XLE.Adjusted
2000-01-31 27.31 29.47 25.87 27.31 5903600 22.46
2000-02-29 27.31 27.61 24.62 26.16 4213000 21.51
2000-03-31 26.02 30.22 25.94 29.31 8607600 24.10
2000-04-30 29.50 30.16 27.52 28.87 5818900 23.74
2000-05-31 29.19 32.31 29.00 32.27 5148800 26.54
2000-06-30 32.16 32.50 30.09 30.34 4563100 25.07
> nrow(XLE)
[1] 163
> head(r$XLE.RANK)
XLE.RANK
2000-01-31 2
2000-02-29 2
2000-03-31 2
2000-04-30 2
2000-05-31 2
2000-06-30 2
nrow(r$XLE.RANK)
[1] 163
However after running the following cbind function, the xts object becomes totally misaligned:
> XLE <- cbind(XLE, r$XLE.RANK)
> head(XLE)
XLE.Open XLE.High XLE.Low XLE.Close XLE.Volume XLE.Adjusted XLE.RANK
2000-01-31 27.31 29.47 25.87 27.31 5903600 22.46 NA
2000-01-31 NA NA NA NA NA NA 2
2000-02-29 27.31 27.61 24.62 26.16 4213000 21.51 NA
2000-02-29 NA NA NA NA NA NA 2
2000-03-31 26.02 30.22 25.94 29.31 8607600 24.10 NA
2000-03-31 NA NA NA NA NA NA 2
> nrow(XLE)
[1] 326
Since running pre-existing code seldom works for me, I suspect there's something wrong with my R console, so here's my session information:
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_Canada.1252 LC_CTYPE=English_Canada.1252 LC_MONETARY=English_Canada.1252 LC_NUMERIC=C
[5] LC_TIME=English_Canada.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] timeSeries_3010.97 timeDate_3010.98 quantstrat_0.7.8 foreach_1.4.1
[5] blotter_0.8.14 PerformanceAnalytics_1.1.0 FinancialInstrument_1.1.9 quantmod_0.4-0
[9] Defaults_1.1-1 TTR_0.22-0 xts_0.9-5 zoo_1.7-10
loaded via a namespace (and not attached):
[1] codetools_0.2-8 grid_3.0.1 iterators_1.0.6 lattice_0.20-15 tools_3.0.1
I'm completely unsure how to properly align the xts object without the NA and would greatly appreciate any help.
I don't see how anyone was not able to replicate your issue. The problem is this line:
r <- as.xts(t(apply(rank.factors, 1, rank)))
The first for loop converts the data to monthly and drops the time component of the index, which converts the index to a Date. This means that rank.factors has a Date index. But as.xts creates a POSIXct index by default, so r will have a POSIXct index.
cbind(XLE, r$XLE.RANK) is merging an xts object with a Date index with an xts object with a POSIXct index. The conversion from POSIXct to Date can be problematic if you're not very careful with timezone settings.
If you don't need the time component, it's best to avoid POSIXct and just use Date. Therefore, everything should work if you set dateFormat="Date" in your as.xts call.
R> r <- as.xts(t(apply(rank.factors, 1, rank)), dateFormat="Date")
R> XLE <- cbind(XLE, r$XLE.RANK)
R> head(XLE)
XLE.Open XLE.High XLE.Low XLE.Close XLE.Volume XLE.Adjusted XLE.RANK
2000-01-31 27.31 29.47 25.87 27.31 5903600 22.46 2
2000-02-29 27.31 27.61 24.62 26.16 4213000 21.51 2
2000-03-31 26.02 30.22 25.94 29.31 8607600 24.10 2
2000-04-30 29.50 30.16 27.52 28.87 5818900 23.74 2
2000-05-31 29.19 32.31 29.00 32.27 5148800 26.54 2
2000-06-30 32.16 32.50 30.09 30.34 4563100 25.07 2