Roll data.table both forwards and backwards simultaneously

This is related to the previous question on SO: roll data.table with rollends
Given the data...
library(data.table)
dt1 = data.table(Date=seq(from=as.Date("2013-01-03"),
                          to=as.Date("2013-06-27"), by="1 week"),
                 key="Date")[, ind:=.I]
dt2 = data.table(Date=seq(from=as.Date("2013-01-01"),
                          to=as.Date("2013-06-30"), by="1 day"),
                 key="Date")
I am trying to carry the weekly data points one day forward and backward...
dt1[dt2, roll=1][dt2, roll=-1]
...but only the first roll join (forward) seems to work and roll=-1 is ignored:
Date ind
1: 2013-01-01 NA
2: 2013-01-02 NA
3: 2013-01-03 1
4: 2013-01-04 1
5: 2013-01-05 NA
---
177: 2013-06-26 NA
178: 2013-06-27 26
179: 2013-06-28 26
180: 2013-06-29 NA
181: 2013-06-30 NA
The same effect when I reverse the order:
dt1[dt2, roll=-1][dt2, roll=1]
Date ind
1: 2013-01-01 NA
2: 2013-01-02 1
3: 2013-01-03 1
4: 2013-01-04 NA
5: 2013-01-05 NA
---
177: 2013-06-26 26
178: 2013-06-27 26
179: 2013-06-28 NA
180: 2013-06-29 NA
181: 2013-06-30 NA
I would like to achieve:
Date ind
1: 2013-01-01 NA
2: 2013-01-02 1
3: 2013-01-03 1
4: 2013-01-04 1
5: 2013-01-05 NA
---
177: 2013-06-26 26
178: 2013-06-27 26
179: 2013-06-28 26
180: 2013-06-29 NA
181: 2013-06-30 NA
EDIT:
I am using the fresh data.table version 1.8.11; see session details:
sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.8.11
loaded via a namespace (and not attached):
[1] plyr_1.8 reshape2_1.2.2 stringr_0.6.2 tools_3.0.0
Thanks

There is nothing left to roll after the first join: its result already contains one row for every row of dt2 (some with NA ind), so the second rolling join matches every row exactly and has no gaps to fill. So just do the two rolls separately and combine them, e.g.:
dt1[dt2, roll = 1][, ind := ifelse(is.na(ind), dt1[dt2, roll = -1]$ind, ind)]
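If you are on a data.table recent enough to have fcoalesce() (1.12.0 or later; the question uses 1.8.11, where the ifelse() form above is the way to go), a sketch of the same idea reads a little more cleanly:
fwd <- dt1[dt2, roll = 1]
bwd <- dt1[dt2, roll = -1]
# both joins return one row per row of dt2 in the same key order,
# so the backward roll can fill the gaps the forward roll left as NA
fwd[, ind := fcoalesce(ind, bwd$ind)]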

Is there a more concise method to filter columns containing a number greater than or less than, in dplyr?

Here is a sample of my tibble
protein patient value
<chr> <chr> <dbl>
1 BOD1L2 RF0064_Case-9-d- 10.4
2 PPFIA2 RF0064_Case-20-d- 7.83
3 STAT4 RF0064_Case-11-d- 11.0
4 TOM1L2 RF0064_Case-29-d- 13.0
5 SH2D2A RF0064_Case-2-d- 8.28
6 TIGD4 RF0064_Case-49-d- 9.71
In the "patient" column, the "d" in "Case-x-d" represents a number of days. What I would like to do is create a new column stating whether the string in the "patient" column contains a value of less than 14 days.
I have managed to do this using the following command:
under14 <- "-1d|-2d|-3d|-4d|-5d|-6d|-7d|-8d|-9d|-10d|-11d|-12d|-13d|-14d"
data <- data %>%
  mutate(case = ifelse(grepl(under14, patient), 'under14days', 'over14days'))
However, this seems extremely clunky and actually took way too long to type. I will have to change my search term many times, so I would like a quicker way to do this. Perhaps some kind of regex is the best option, but I don't really know where to start.
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.5
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_NZ.UTF-8/en_NZ.UTF-8/en_NZ.UTF-8/C/en_NZ.UTF-8/en_NZ.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] readxl_1.1.0 Rmisc_1.5 plyr_1.8.4 lattice_0.20-35 forcats_0.3.0 stringr_1.3.1 dplyr_0.7.5 purrr_0.2.5
[9] readr_1.1.1 tidyr_0.8.1 tibble_1.4.2 ggplot2_2.2.1 tidyverse_1.2.1
loaded via a namespace (and not attached):
[1] Rcpp_0.12.17 cellranger_1.1.0 pillar_1.2.3 compiler_3.5.0 bindr_0.1.1 tools_3.5.0 lubridate_1.7.4
[8] jsonlite_1.5 nlme_3.1-137 gtable_0.2.0 pkgconfig_2.0.1 rlang_0.2.1 psych_1.8.4 cli_1.0.0
[15] rstudioapi_0.7 yaml_2.1.19 parallel_3.5.0 haven_1.1.1 bindrcpp_0.2.2 xml2_1.2.0 httr_1.3.1
[22] hms_0.4.2 grid_3.5.0 tidyselect_0.2.4 glue_1.2.0 R6_2.2.2 foreign_0.8-70 modelr_0.1.2
[29] reshape2_1.4.3 magrittr_1.5 scales_0.5.0 rvest_0.3.2 assertthat_0.2.0 mnormt_1.5-5 colorspace_1.3-2
[36] utf8_1.1.4 stringi_1.2.3 lazyeval_0.2.1 munsell_0.5.0 broom_0.4.4 crayon_1.3.4
One possibility is to use tidyr::separate
library(tidyverse)
df %>%
  separate(patient, into = c("ID1", "Days", "ID2"), sep = "-", extra = "merge", remove = F) %>%
  mutate(case = ifelse(as.numeric(Days) <= 14, "under14days", "over14days")) %>%
  select(-ID1, -ID2)
# protein patient Days value case
#1 BOD1L2 RF0064_Case-9-d- 9 10.40 under14days
#2 PPFIA2 RF0064_Case-20-d- 20 7.83 over14days
#3 STAT4 RF0064_Case-11-d- 11 11.00 under14days
#4 TOM1L2 RF0064_Case-29-d- 29 13.00 over14days
#5 SH2D2A RF0064_Case-2-d- 2 8.28 under14days
#6 TIGD4 RF0064_Case-49-d- 49 9.71 over14days
Sample data
df <-read.table(text =
" protein patient value
1 BOD1L2 RF0064_Case-9-d- 10.4
2 PPFIA2 RF0064_Case-20-d- 7.83
3 STAT4 RF0064_Case-11-d- 11.0
4 TOM1L2 RF0064_Case-29-d- 13.0
5 SH2D2A RF0064_Case-2-d- 8.28
6 TIGD4 RF0064_Case-49-d- 9.71 ", header = T, row.names = 1)
Since the format of patient is clearly defined, a possible base-R solution is to use gsub to extract the days and compare them against the cutoff:
df$case <- ifelse(as.integer(gsub("RF0064_Case-(\\d+)-d-","\\1", df$patient)) <= 14,
"under14days", "over14days")
In exactly the same way, the OP can modify the code used in mutate:
library(dplyr)
df <- df %>%
  mutate(case = ifelse(as.integer(gsub("RF0064_Case-(\\d+)-d-", "\\1", patient)) <= 14,
                       "under14days", "over14days"))
df
# protein patient value case
# 1 BOD1L2 RF0064_Case-9-d- 10.40 under14days
# 2 PPFIA2 RF0064_Case-20-d- 7.83 over14days
# 3 STAT4 RF0064_Case-11-d- 11.00 under14days
# 4 TOM1L2 RF0064_Case-29-d- 13.00 over14days
# 5 SH2D2A RF0064_Case-2-d- 8.28 under14days
# 6 TIGD4 RF0064_Case-49-d- 9.71 over14days
Data:
df <- read.table(text =
"protein patient value
1 BOD1L2 RF0064_Case-9-d- 10.4
2 PPFIA2 RF0064_Case-20-d- 7.83
3 STAT4 RF0064_Case-11-d- 11.0
4 TOM1L2 RF0064_Case-29-d- 13.0
5 SH2D2A RF0064_Case-2-d- 8.28
6 TIGD4 RF0064_Case-49-d- 9.71",
header = TRUE, stringsAsFactors = FALSE)
We can also extract the number directly with a regex: "(?<=-)" is a lookbehind, which matches the position immediately after a "-".
library(tidyverse)
dat2 <- dat %>%
  mutate(Day = as.numeric(str_extract(patient, pattern = "(?<=-)[0-9]*"))) %>%
  mutate(case = ifelse(Day <= 14, 'under14days', 'over14days'))
dat2
# protein patient value Day case
# 1 BOD1L2 RF0064_Case-9-d- 10.40 9 under14days
# 2 PPFIA2 RF0064_Case-20-d- 7.83 20 over14days
# 3 STAT4 RF0064_Case-11-d- 11.00 11 under14days
# 4 TOM1L2 RF0064_Case-29-d- 13.00 29 over14days
# 5 SH2D2A RF0064_Case-2-d- 8.28 2 under14days
# 6 TIGD4 RF0064_Case-49-d- 9.71 49 over14days
DATA
dat <- read.table(text = " protein patient value
1 BOD1L2 'RF0064_Case-9-d-' 10.4
2 PPFIA2 'RF0064_Case-20-d-' 7.83
3 STAT4 'RF0064_Case-11-d-' 11.0
4 TOM1L2 'RF0064_Case-29-d-' 13.0
5 SH2D2A 'RF0064_Case-2-d-' 8.28
6 TIGD4 'RF0064_Case-49-d-' 9.71",
header = TRUE, stringsAsFactors = FALSE)

R Creating a time-varying survival dataset from event data

I want to create a survival dataset featuring multiple-record ids. The existing event data consists of one observation per row, with the date formatted as dd/mm/yy. The idea is to count the number of consecutive months with at least one event per month (there are multiple years, so this has to be accounted for somehow). In other words, I want to create episodes that capture such monthly streaks, including periods of inactivity. To give an example, the code should transform something like this:
df1
id event.date
group1 01/01/16
group1 05/02/16
group1 07/03/16
group1 10/06/16
group1 12/09/16
to this:
df2
id t0 t1 ep.no ep.t ep.type
group1 1 3 1 3 1
group1 4 5 2 2 0
group1 6 6 3 1 1
group1 7 8 4 2 0
group1 9 9 5 1 1
group1 10 ... ... ... ...
where t0 and t1 are the start and end months, ep.no is the episode counter for the particular id, ep.t is the length of that particular episode, and ep.type indicates the type of episode (active/inactive). In the example above, there is an initial three-months of activity, then a two-month break, followed by a single-month episode of relapse etc.
I am mostly concerned about the transformation that produces t0 and t1 from df1 to df2, as the other variables in df2 can be constructed from them afterwards (e.g. ep.no is a counter, ep.t is simple arithmetic, and ep.type always starts at 1 and alternates). Given the complexity of the problem (at least for me), I understand the need to provide the actual data, but I am not sure if that is allowed? I will see what I can do if a mod chimes in.
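That follow-up arithmetic might look like this (a sketch, assuming a df2 that already holds id, t0, and t1 as in the target table above):
library(dplyr)
df2 %>%
  group_by(id) %>%
  mutate(ep.no   = row_number(),            # episode counter within each id
         ep.t    = t1 - t0 + 1,             # episode length in months
         ep.type = as.integer(ep.no %% 2))  # 1, 0, 1, 0, ... starting active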
I think this does what you want. The trick is identifying the sequence of observations that need to be treated together, and using dplyr::lag with cumsum is the way to go.
# Convert to date objects, summarize by month, insert missing months
library(tidyverse)
library(lubridate)
# added rows of data to demonstrate that it works with
# > id and > 1 event per month and rolls across year end
df1 <- read_table("id event.date
group1 01/01/16
group1 02/01/16
group1 05/02/16
group1 07/03/16
group1 10/06/16
group1 12/09/16
group1 01/02/17
group2 01/01/16
group2 05/02/16
group2 07/03/16",col_types="cc")
# need to get rid of extra whitespace, but automatically converts to date
# summarize by month to count events per month
df1.1 <- mutate(df1, event.date=dmy(event.date),
                yr=year(event.date),
                mon=month(event.date))
# get down to one row per event and complete data
df2 <- group_by(df1.1, id, yr, mon) %>%
  summarize(events=n()) %>%
  complete(id, yr, mon=1:12, fill=list(events=0)) %>%
  group_by(id) %>%
  mutate(event = as.numeric(events > 0),
         is_start = lag(event, default=-1) != event,
         episode = cumsum(is_start),
         episode.date = ymd(paste(yr, mon, 1, sep="-"))) %>%
  group_by(id, episode) %>%
  summarize(t0 = first(episode.date),
            t1 = last(episode.date) %m+% months(1),
            ep.length = as.numeric((last(episode.date) %m+% months(1)) - first(episode.date)),
            ep.type = first(event))
Gives
Source: local data frame [10 x 6]
Groups: id [?]
id episode t0 t1 ep.length ep.type
<chr> <int> <dttm> <dttm> <dbl> <dbl>
1 group1 1 2016-01-01 2016-04-01 91 1
2 group1 2 2016-04-01 2016-06-01 61 0
3 group1 3 2016-06-01 2016-07-01 30 1
4 group1 4 2016-07-01 2016-09-01 62 0
5 group1 5 2016-09-01 2016-10-01 30 1
6 group1 6 2016-10-01 2017-02-01 123 0
7 group1 7 2017-02-01 2017-03-01 28 1
8 group1 8 2017-03-01 2018-01-01 306 0
9 group2 1 2016-01-01 2016-04-01 91 1
10 group2 2 2016-04-01 2017-01-01 275 0
Note that using complete() with mon=1:12 will always make the last episode stretch to the end of that year; the fix is to insert a filter() on yr and mon after complete().
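A sketch of that filter, reusing the names from the pipeline above (last.obs is introduced here just for illustration):
last.obs <- max(df1.1$event.date)  # last event anywhere in the data
df2 <- group_by(df1.1, id, yr, mon) %>%
  summarize(events = n()) %>%
  complete(id, yr, mon = 1:12, fill = list(events = 0)) %>%
  filter(ymd(paste(yr, mon, 1, sep = "-")) <= last.obs)
# ...then continue with the mutate()/summarize() steps exactly as above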
The advantage of keeping t0 and t1 as Date-time objects is that they work correctly across year boundaries, which using month numbers won't.
Session information:
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets
[6] methods base
other attached packages:
[1] lubridate_1.3.3 dplyr_0.5.0 purrr_0.2.2
[4] readr_0.2.2 tidyr_0.6.0 tibble_1.2
[7] ggplot2_2.2.0 tidyverse_1.0.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.8 knitr_1.15.1 magrittr_1.5
[4] munsell_0.4.2 colorspace_1.2-6 R6_2.1.3
[7] stringr_1.1.0 highr_0.6 plyr_1.8.4
[10] tools_3.3.2 grid_3.3.2 gtable_0.2.0
[13] DBI_0.5 lazyeval_0.2.0 assertthat_0.1
[16] digest_0.6.10 memoise_1.0.0 evaluate_0.10
[19] stringi_1.1.2 scales_0.4.1

dplyr v0.3 left_join error

I'm not able to left_join with dplyr 0.3 when using the by argument.
First, I installed v0.3 following Hadley's suggestion on github
if (packageVersion("devtools") < 1.6) {
  install.packages("devtools")
}
devtools::install_github("hadley/lazyeval")
devtools::install_github("hadley/dplyr")
sessionInfo()
# R version 3.1.1 (2014-07-10)
#Platform: x86_64-w64-mingw32/x64 (64-bit)
#
#locale:
#[1] LC_COLLATE=Portuguese_Portugal.1252 LC_CTYPE=Portuguese_Portugal.1252
#[3] LC_MONETARY=Portuguese_Portugal.1252 LC_NUMERIC=C
#[5] LC_TIME=Portuguese_Portugal.1252
#
#attached base packages:
#[1] stats graphics grDevices utils datasets methods base
#
#other attached packages:
#[1] dplyr_0.3
#
#loaded via a namespace (and not attached):
#[1] assertthat_0.1 DBI_0.3.1 magrittr_1.0.1 parallel_3.1.1 Rcpp_0.11.2 tools_3.1.1
Then taking some data
df1<-as.tbl(data.frame('var1'=LETTERS[1:10],'value1'=sample(1:100,10), stringsAsFactors = F))
df2<-as.tbl(data.frame('var2'=LETTERS[1:10],'value2'=sample(1:100,10), stringsAsFactors = F))
library(dplyr)
And finally trying to left_join
left_join(df1, df2, by = c('var1' = 'var2'))
# Error: cannot join on column 'var1'
But it works with
df2$var1 <- df2$var2
left_join(df1, df2, by = c('var1' = 'var2'))
Source: local data frame [10 x 4]
var1 value1 var2 value2
1 A 37 A 48
2 B 90 B 18
3 C 13 C 36
4 D 94 D 75
5 E 14 E 12
6 F 95 F 52
7 G 60 G 55
8 H 69 H 72
9 I 25 I 49
10 J 47 J 10
My questions
Is dplyr ignoring the by argument in the second example?
Can this be a bug?
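As a workaround sketch while the named-by form misbehaves: rename the key column in df2 first (rename() takes new = old in dplyr) and join on a single shared name, which avoids duplicating the column by hand:
df2_renamed <- rename(df2, var1 = var2)
left_join(df1, df2_renamed, by = 'var1')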

cbind function creates misaligned xts object

I am trying to run the code from a tutorial by rbresearch titled 'Low Volatility with R'; however, when I run the cbind function, the time series seems to get totally misaligned.
Here's the data-preparation section that works perfectly:
require(quantmod)
symbols = c("XLY", "XLP", "XLE", "XLF", "XLV", "XLI", "XLK", "XLB", "XLU")
getSymbols(symbols, index.class=c("POSIXt","POSIXct"), from='2000-01-01')
for(symbol in symbols) {
  x <- get(symbol)
  x <- to.monthly(x, indexAt='lastof', drop.time=TRUE)
  indexFormat(x) <- '%Y-%m-%d'
  colnames(x) <- gsub("x", symbol, colnames(x))
  assign(symbol, x)
}
for(symbol in symbols) {
  x <- get(symbol)
  x1 <- ROC(Ad(x), n=1, type="continuous", na.pad=TRUE)
  colnames(x1) <- "ROC"
  colnames(x1) <- paste("x", colnames(x1), sep=".")
  # x2 is the 12-period standard deviation of the 1-month return
  x2 <- runSD(x1, n=12)
  colnames(x2) <- "RANK"
  colnames(x2) <- paste("x", colnames(x2), sep=".")
  x <- cbind(x, x2)
  colnames(x) <- gsub("x", symbol, colnames(x))
  assign(symbol, x)
}
rank.factors <- cbind(XLB$XLB.RANK, XLE$XLE.RANK, XLF$XLF.RANK, XLI$XLI.RANK,
XLK$XLK.RANK, XLP$XLP.RANK, XLU$XLU.RANK, XLV$XLV.RANK, XLY$XLY.RANK)
r <- as.xts(t(apply(rank.factors, 1, rank)))
for (symbol in symbols){
  x <- get(symbol)
  x <- x[,1:6]
  assign(symbol, x)
}
To illustrate that the XLE ETF data is aligned with the XLE rank data:
> head(XLE)
XLE.Open XLE.High XLE.Low XLE.Close XLE.Volume XLE.Adjusted
2000-01-31 27.31 29.47 25.87 27.31 5903600 22.46
2000-02-29 27.31 27.61 24.62 26.16 4213000 21.51
2000-03-31 26.02 30.22 25.94 29.31 8607600 24.10
2000-04-30 29.50 30.16 27.52 28.87 5818900 23.74
2000-05-31 29.19 32.31 29.00 32.27 5148800 26.54
2000-06-30 32.16 32.50 30.09 30.34 4563100 25.07
> nrow(XLE)
[1] 163
> head(r$XLE.RANK)
XLE.RANK
2000-01-31 2
2000-02-29 2
2000-03-31 2
2000-04-30 2
2000-05-31 2
2000-06-30 2
nrow(r$XLE.RANK)
[1] 163
However after running the following cbind function, the xts object becomes totally misaligned:
> XLE <- cbind(XLE, r$XLE.RANK)
> head(XLE)
XLE.Open XLE.High XLE.Low XLE.Close XLE.Volume XLE.Adjusted XLE.RANK
2000-01-31 27.31 29.47 25.87 27.31 5903600 22.46 NA
2000-01-31 NA NA NA NA NA NA 2
2000-02-29 27.31 27.61 24.62 26.16 4213000 21.51 NA
2000-02-29 NA NA NA NA NA NA 2
2000-03-31 26.02 30.22 25.94 29.31 8607600 24.10 NA
2000-03-31 NA NA NA NA NA NA 2
> nrow(XLE)
[1] 326
Since running pre-existing code seldom works for me, I suspect there's something wrong with my R console, so here's my session information:
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_Canada.1252 LC_CTYPE=English_Canada.1252 LC_MONETARY=English_Canada.1252 LC_NUMERIC=C
[5] LC_TIME=English_Canada.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] timeSeries_3010.97 timeDate_3010.98 quantstrat_0.7.8 foreach_1.4.1
[5] blotter_0.8.14 PerformanceAnalytics_1.1.0 FinancialInstrument_1.1.9 quantmod_0.4-0
[9] Defaults_1.1-1 TTR_0.22-0 xts_0.9-5 zoo_1.7-10
loaded via a namespace (and not attached):
[1] codetools_0.2-8 grid_3.0.1 iterators_1.0.6 lattice_0.20-15 tools_3.0.1
I'm completely unsure how to properly align the xts object without the NA rows and would greatly appreciate any help.
I don't see how anyone was unable to replicate your issue. The problem is this line:
r <- as.xts(t(apply(rank.factors, 1, rank)))
The first for loop converts the data to monthly and drops the time component of the index, which converts the index to a Date. This means that rank.factors has a Date index. But as.xts creates a POSIXct index by default, so r will have a POSIXct index.
cbind(XLE, r$XLE.RANK) is therefore merging an xts object that has a Date index with one that has a POSIXct index. The conversion from POSIXct to Date can be problematic if you're not very careful with timezone settings.
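A quick sketch to confirm the mismatch, using the objects built above:
class(index(rank.factors))
# [1] "Date"
class(index(as.xts(t(apply(rank.factors, 1, rank)))))
# [1] "POSIXct" "POSIXt"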
If you don't need the time component, it's best to avoid POSIXct and just use Date. Therefore, everything should work if you set dateFormat="Date" in your as.xts call.
R> r <- as.xts(t(apply(rank.factors, 1, rank)), dateFormat="Date")
R> XLE <- cbind(XLE, r$XLE.RANK)
R> head(XLE)
XLE.Open XLE.High XLE.Low XLE.Close XLE.Volume XLE.Adjusted XLE.RANK
2000-01-31 27.31 29.47 25.87 27.31 5903600 22.46 2
2000-02-29 27.31 27.61 24.62 26.16 4213000 21.51 2
2000-03-31 26.02 30.22 25.94 29.31 8607600 24.10 2
2000-04-30 29.50 30.16 27.52 28.87 5818900 23.74 2
2000-05-31 29.19 32.31 29.00 32.27 5148800 26.54 2
2000-06-30 32.16 32.50 30.09 30.34 4563100 25.07 2

Creating an xts object results in altered timestamps

Suppose I have:
R> str(data)
'data.frame': 4 obs. of 2 variables:
$ datetime: Factor w/ 4 levels "2011-01-05 09:30:00.001",..: 1 2 3 4
$ price : num 18.3 18.3 18.3 18.3
R> data
datetime price
1 2011-01-05 09:30:00.001 18.31
2 2011-01-05 09:30:00.321 18.33
3 2011-01-05 09:30:01.511 18.33
4 2011-01-05 09:30:02.192 18.34
When I try to load this into an xts object the timestamps are subtly altered:
R> x <- xts(data[-1], as.POSIXct(strptime(data$datetime, '%Y-%m-%d %H:%M:%OS')))
R> str(x)
An ‘xts’ object from 2011-01-05 09:30:00.000 to 2011-01-05 09:30:02.191 containing:
Data: num [1:4, 1] 18.3 18.3 18.3 18.3
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr "price"
Indexed by objects of class: [POSIXct,POSIXt] TZ:
xts Attributes:
NULL
R> x
price
2011-01-05 09:30:00.000 18.31
2011-01-05 09:30:00.321 18.33
2011-01-05 09:30:01.510 18.33
2011-01-05 09:30:02.191 18.34
You'll notice that the timestamps have been altered. The first entry now occurs at 09:30:00.000 instead of what the original data said, 09:30:00.001. The third and fourth rows are also incorrect.
What's causing this? Am I doing something fundamentally wrong? I've tried various incantations to get the data into an xts object and they all seem to exhibit this behavior.
EDIT: Add sessionInfo()
R> sessionInfo()
R version 2.13.1 (2011-07-08)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=C LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] xts_0.8-2 zoo_1.7-4
loaded via a namespace (and not attached):
[1] grid_2.13.1 lattice_0.19-30 tools_2.13.1
EDIT 2: If I modify my source data to be microsecond precision as follows:
datetime,price
2011-01-05 09:30:00.001000,18.31
2011-01-05 09:30:00.321000,18.33
2011-01-05 09:30:01.511000,18.33
2011-01-05 09:30:02.192000,18.34
And then load it so I have:
R> test
datetime price
1 2011-01-05 09:30:00.001000 18.31
2 2011-01-05 09:30:00.321000 18.33
3 2011-01-05 09:30:01.511000 18.33
4 2011-01-05 09:30:02.192000 18.34
And then, finally, convert it into an xts object and set the index format:
R> x <- xts(test[,-1], as.POSIXct(strptime(test$datetime, '%Y-%m-%d %H:%M:%OS')))
R> indexFormat(x) <- '%Y-%m-%d %H:%M:%OS6'
R> x
[,1]
2011-01-05 09:30:00.000999 18.31
2011-01-05 09:30:00.321000 18.33
2011-01-05 09:30:01.510999 18.33
2011-01-05 09:30:02.191999 18.34
You can see the effect as well. I was hoping that adding the extra precision would help, but unfortunately it does not.
EDIT 3: Please see DWin's answer for an end-to-end test case that reproduces this behavior.
EDIT 4: The behavior does not appear to be millisecond-specific. The following shows the same altered result with microsecond-resolution timestamps. If I change my input data to:
R> data
datetime price
1 2011-01-05 09:30:00.001001 18.31
2 2011-01-05 09:30:00.321001 18.33
3 2011-01-05 09:30:01.511001 18.33
4 2011-01-05 09:30:02.192005 18.34
And then create an xts object:
R> x <- xts(data[-1],
as.POSIXct(strptime(as.character(data$datetime), '%Y-%m-%d %H:%M:%OS')))
R> indexFormat(x) <- '%Y-%m-%d %H:%M:%OS6'
R> x
price
2011-01-05 09:30:00.001000 18.31
2011-01-05 09:30:00.321001 18.33
2011-01-05 09:30:01.511001 18.33
2011-01-05 09:30:02.192004 18.34
EDIT 5: It would appear to be a floating point precision issue. Observe:
R> t <- as.POSIXct("2011-01-05 09:30:00.001001")
R> t
[1] "2011-01-05 09:30:00.001 CST"
R> as.numeric(t)
[1] 1294241400.0010008812
This exhibits the error behavior, and is consistent with the example in EDIT 4. However, using an example that didn't show the error:
R> t <- as.POSIXct("2011-01-05 09:30:01.511001")
R> t
[1] "2011-01-05 09:30:01.511001 CST"
R> as.numeric(t)
[1] 1294241401.5110011101
It seems as if xts or some underlying component is rounding down rather than to the nearest?
You have your times in a factor:
R> str(data)
'data.frame': 4 obs. of 2 variables:
$ datetime: Factor w/ 4 levels "2011-01-05 09:30:00.001",..: 1 2 3 4
[...]
That is not the best place to start. You need to convert to character. Hence instead of
x <- xts(data[-1], as.POSIXct(strptime(data$datetime, '%Y-%m-%d %H:%M:%OS')))
I would suggest
x <- xts(data[-1],
order.by=as.POSIXct(strptime(as.character(data$datetime),
'%Y-%m-%d %H:%M:%OS')))
In my experience, the as.character() around a factor is critical. Factors are powerful for modeling; they are, however, a bit of a nuisance when you get them accidentally from reading data. Use stringsAsFactors=FALSE to your advantage and avoid them on data import.
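For example, at import time (file name hypothetical):
data <- read.csv("ticks.csv", stringsAsFactors = FALSE)
str(data$datetime)  # chr, not Factor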
Edit: So this appears to point to the strptime/strftime implementations. To make matters more interesting, R takes some of these from the operating system and reimplements some in src/main/datetime.c.
Also, pay attention to the smallest epsilon you can add to a time variable and still have R consider the two times identical. On my 64-bit Linux system, this happens at 10^-7:
R> now <- Sys.time()
R> sapply(seq(1, 8), FUN=function(x) identical(now, now+1/10^x))
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
R>
It seems the problem is only in printing. Using the OP's original data:
ind <- as.POSIXct(strptime(data$datetime, '%Y-%m-%d %H:%M:%OS'))
as.numeric(ind)*1e6 # as expected
# [1] 1294241400001000 1294241400321000 1294241401511000 1294241402192000
ind # wrong
# [1] "2011-01-05 09:30:00.000 CST" "2011-01-05 09:30:00.321 CST"
# [3] "2011-01-05 09:30:01.510 CST" "2011-01-05 09:30:02.191 CST"
x <- xts(data[-1], ind)
x # wrong
# price
# 2011-01-05 09:30:00.000 18.31
# 2011-01-05 09:30:00.321 18.33
# 2011-01-05 09:30:01.510 18.33
# 2011-01-05 09:30:02.191 18.34
as.numeric(index(x))*1e6 # but the underlying index values are as expected
# [1] 1294241400001000 1294241400321000 1294241401511000 1294241402192000
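If the truncated display itself is the annoyance, one workaround sketch (not part of the answer above) is to nudge the index by half a microsecond before formatting, since the formatter truncates rather than rounds:
options(digits.secs = 6)
sprintf("%.6f", as.numeric(index(x)))           # full precision is still there
format(index(x) + 5e-7, "%Y-%m-%d %H:%M:%OS6")  # nudge past the truncation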
I post this just so people who want to explore it can have a reproducible example showing that it happens on more than just the OP's system. Applying as.character to the factor does not keep it from occurring.
dat <- read.table(textConnection(" datetime\tprice
1\t2011-01-05 09:30:00.001\t18.31
2\t2011-01-05 09:30:00.321\t18.33
3\t2011-01-05 09:30:01.511\t18.33
4\t2011-01-05 09:30:02.192\t18.34"), header =TRUE, sep="\t")
as.character(dat$datetime)
#[1] "2011-01-05 09:30:00.001" "2011-01-05 09:30:00.321" "2011-01-05 09:30:01.511"
#[4] "2011-01-05 09:30:02.192"
strptime(as.character(dat$datetime), '%Y-%m-%d %H:%M:%OS')
#[1] "2011-01-05 09:30:00" "2011-01-05 09:30:00" "2011-01-05 09:30:01"
#[4] "2011-01-05 09:30:02"
as.POSIXct(strptime(as.character(dat$datetime),
'%Y-%m-%d %H:%M:%OS'))
#[1] "2011-01-05 09:30:00 EST" "2011-01-05 09:30:00 EST" "2011-01-05 09:30:01 EST"
#[4] "2011-01-05 09:30:02 EST"
x <- xts(dat[-1],
order.by=as.POSIXct(strptime(as.character(dat$datetime),
'%Y-%m-%d %H:%M:%OS')))
x
# price
# 2011-01-05 09:30:00 18.31
# 2011-01-05 09:30:00 18.33
# 2011-01-05 09:30:01 18.33
# 2011-01-05 09:30:02 18.34
indexFormat(x) <- '%Y-%m-%d %H:%M:%OS6'
x
price
2011-01-05 09:30:00.000999 18.31
2011-01-05 09:30:00.321000 18.33
2011-01-05 09:30:01.510999 18.33
2011-01-05 09:30:02.191999 18.34
sessionInfo()
R version 2.13.1 RC (2011-07-03 r56263)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] grid splines stats graphics grDevices utils datasets methods
[9] base
other attached packages:
[1] xts_0.8-2 zoo_1.7-4 sculpt3d_0.2-2 RGtk2_2.20.12
[5] rgl_0.92.798 survey_3.24 hexbin_1.26.0 spam_0.23-0
[9] xtable_1.5-6 polspline_1.1.5 Ryacas_0.2-10 XML_3.4-0
[13] rms_3.3-1 Hmisc_3.8-3 survival_2.36-9 sos_1.3-0
[17] brew_1.0-6 lattice_0.19-30
loaded via a namespace (and not attached):
[1] cluster_1.14.0 tools_2.13.1
