Finding start and end time index conditionally for xts dataset - r

I am trying to extract the start and end time index for each label and store them separately.
EDIT
As suggested in the comments, I prepared an example dataset:
data <- rnorm(11)
dates1 <- as.POSIXct("2019-03-18 10:30:00", tz = "CET") + 0:6*60
dates2 <- as.POSIXct("2019-03-19 08:30:00", tz = "CET") + 0:3*60
dates <- append(dates1, dates2)
R <- xts(x = data, order.by = dates)
colnames(R) <- "R"
R$Label[1:7] <- 1
R$Label[8:11] <- 2
Output:
R Label
2019-03-18 10:30:00 1.193363635 1
2019-03-18 10:31:00 -0.558021057 1
2019-03-18 10:32:00 0.670440862 1
2019-03-18 10:33:00 0.073794492 1
2019-03-18 10:34:00 -0.416108940 1
2019-03-18 10:35:00 -0.596981420 1
2019-03-18 10:36:00 0.002006772 1
2019-03-19 08:30:00 -1.245200719 2
2019-03-19 08:31:00 0.417944923 2
2019-03-19 08:32:00 1.699169683 2
2019-03-19 08:33:00 0.861448103 2
The class of R is xts, zoo.
Now I would like to store the start and end time index for labels 1 and 2 separately. I have many more data sets with more labels, so it needs to be automated. I would really appreciate it if you could help. Thank you.

Using the data you have posted:
library(xts)
library(dplyr)
library(tibble)
set.seed(42)
data <- rnorm(11)
dates1 <- as.POSIXct("2019-03-18 10:30:00", tz = "CET") + 0:6*60
dates2 <- as.POSIXct("2019-03-19 08:30:00", tz = "CET") + 0:3*60
dates <- append(dates1, dates2)
R <- xts(x = data, order.by = dates)
colnames(R) <- "R"
R$Label <- 1 # note I have removed the indexing here
R$Label[8:11] <- 2
R %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  group_by(Label) %>%
  summarise(min = min(rowname), max = max(rowname))
# A tibble: 2 x 3
Label min max
<dbl> <chr> <chr>
1 1 2019-03-18 09:30:00 2019-03-18 09:36:00
2 2 2019-03-19 07:30:00 2019-03-19 07:33:00

If we split it into components and then use start and end on each component we can get the start and end times of each group.
s <- split(R, R$Label)
do.call("c", lapply(s, start)) # start of each group
do.call("c", lapply(s, end)) # end of each group
If you want the row numbers instead, do the same thing but change the index to 1, 2, 3, ...
R2 <- zoo(coredata(R))
s <- split(R2, R2$Label)
do.call("c", lapply(s, start)) # start of each group
do.call("c", lapply(s, end)) # end of each group
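If the goal is to keep these per-label ranges in one object, here is a small hedged sketch (assuming split() names the list elements by label value, as above) that collects them into a data frame:
s <- split(R, R$Label)
ranges <- data.frame(
  Label = names(s),                        # label value of each group
  start = do.call("c", lapply(s, start)),  # first timestamp per label
  end   = do.call("c", lapply(s, end))     # last timestamp per label
)
ranges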

Related

How to conditionally check and replace data in xts object?

Here is a reproducible data set. The problem is to find 1 or 2 consecutive non-NA values in between a series of NAs and set them to NA. If there are more than 2, nothing needs to be done.
set.seed(55)
data <- rnorm(10)
dates <- as.POSIXct("2019-03-18 10:30:00", tz = "CET") + 0:9*60
R <- xts(x = data, order.by = dates)
colnames(R) <- "R-factor"
R[c(1, 3, 6, 10)] <- NA
R
Output:
R-factor
2019-03-18 10:30:00 NA
2019-03-18 10:31:00 -1.812376850
2019-03-18 10:32:00 NA
2019-03-18 10:33:00 -1.119221005
2019-03-18 10:34:00 0.001908206
2019-03-18 10:35:00 NA
2019-03-18 10:36:00 -0.505343855
2019-03-18 10:37:00 -0.099234393
2019-03-18 10:38:00 0.305353199
2019-03-18 10:39:00 NA
Expected result:
R-factor
2019-03-18 10:30:00 NA
2019-03-18 10:31:00 NA
2019-03-18 10:32:00 NA
2019-03-18 10:33:00 NA
2019-03-18 10:34:00 NA
2019-03-18 10:35:00 NA
2019-03-18 10:36:00 -0.505343855
2019-03-18 10:37:00 -0.099234393
2019-03-18 10:38:00 0.305353199
2019-03-18 10:39:00 NA
I have written a function with a for-loop which works fine for a small dataset, but it is extremely slow. The original data consists of 100,000+ data points and the function had not finished after more than 10 minutes.
Can anyone kindly help me avoid the loop to make it faster?
Create a function Fillin which returns NA if the length is less than or equal to 3 (or 2 if the first element is not NA so that we can handle the first group even if it does not start with an NA) and returns its argument otherwise. Use cumsum to group the runs and apply Fillin to each group.
Fillin <- function(x) if (length(x) <= 3 - !is.na(x[1])) NA else x
Rc <- coredata(R)
R[] <- ave(Rc, cumsum(is.na(Rc)), FUN = Fillin)
giving:
> R
R-factor
2019-03-18 10:30:00 NA
2019-03-18 10:31:00 NA
2019-03-18 10:32:00 NA
2019-03-18 10:33:00 NA
2019-03-18 10:34:00 NA
2019-03-18 10:35:00 NA
2019-03-18 10:36:00 -0.50534386
2019-03-18 10:37:00 -0.09923439
2019-03-18 10:38:00 0.30535320
2019-03-18 10:39:00 NA
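To see concretely why cumsum(is.na(...)) forms the right groups, here is a small hedged illustration on the question's data (computed on the original values, before the replacement above overwrites R):
set.seed(55)
Rc0 <- rnorm(10)
Rc0[c(1, 3, 6, 10)] <- NA   # same NA pattern as the question
cumsum(is.na(Rc0))
# [1] 1 1 2 2 2 3 3 3 3 4
# each NA opens a new group; groups 1 (length 2), 2 (length 3) and 4
# (length 1) are short enough for Fillin to blank, group 3 (length 4) is kept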
Performance
This solution runs at about the same speed as the one using rle.
library(microbenchmark)
microbenchmark(
  Fill = {
    Fillin <- function(x) if (length(x) <= 3 - !is.na(x[1])) NA else x
    Rc <- coredata(R)
    R[] <- ave(Rc, cumsum(is.na(Rc)), FUN = Fillin)
  },
  RLrep = {
    rleR <- rle(c(is.na(R[, 1])))
    is.na(R) <- with(rleR, rep(lengths < 3, lengths))
  }
)
giving:
Unit: microseconds
expr min lq mean median uq max neval cld
Fill 490.9 509.5 626.550 527.7 596.45 3411.1 100 a
RLrep 523.5 540.8 604.061 550.8 592.00 1244.4 100 a
Maybe try this, based on "Distance from the closest non NA value in a dataframe":
library(tidyverse)
set.seed(55)
x <- 100000
data <- rnorm(x)
dates <- as.POSIXct("2019-03-18 10:30:00", tz = "CET") + (seq_len(x))*60
time_table1 <- tibble(time = dates,data = data)
time_table <- time_table1 %>%
  mutate(random = rnorm(x),
         new = if_else(random > data, NA_real_, data)) %>%
  select(-data, -random) %>%
  rename(data = new)
lengths_na <- time_table$data %>% is.na %>% rle %>% pluck('lengths')
the_operation <- . %>%
  mutate(lengths_na = lengths_na %>% seq_along %>% rep(lengths_na)) %>%
  group_by(lengths_na) %>%
  add_tally() %>%
  ungroup() %>%
  mutate(replace_sequence = if_else(condition = n < 3, true = NA_real_, false = data))
microbenchmark::microbenchmark(time_table %>% the_operation)
The results are quite good
Unit: milliseconds
expr min lq mean median uq max neval
time_table %>% the_operation 141.9009 176.2988 203.3744 190.183 214.1691 412.3161 100
Maybe this is simpler to read
library(tidyverse)
set.seed(55)
# Create the data
x <- 100
data <- rnorm(x)
dates <- as.POSIXct("2019-03-18 10:30:00", tz = "CET") + (seq_len(x))*60
time_table1 <- tibble(time = dates,data = data)
# Fake some na's
time_table <- time_table1 %>%
  mutate(random = rnorm(x),
         new = if_else(random > data, NA_real_, data)) %>%
  select(-data, -random) %>%
  rename(data = new)
# The rle function counts runs of identical values in a vector.
# We create a TRUE/FALSE vector with is.na, which means we can count
# the length of each sequence with or without NA's
lengths_na <- time_table$data %>% is.na %>% rle %>% pluck('lengths')
# This operation can be done outside of the data frame
new_col <- lengths_na %>%
  seq_along %>%     # one id per run
  rep(lengths_na)   # repeat each id for the length of its run
result <- time_table %>%
  mutate(new_col = new_col) %>%
  group_by(new_col) %>%  # operate on each run; see the tidyverse docs on group_by
  add_tally() %>%        # count how many rows there are in each group
  ungroup() %>%          # not strictly needed, but good manners
  mutate(replace_sequence = if_else(condition = n < 3, true = NA_real_, false = data))
I guess there are more elegant solutions around, but this cuts the time in half:
R_df=as.data.frame(R)
R_df$shift_1=c(R_df$`R-factor`[-1],NA) #shift value one up
R_df$shift_2=c(NA,R_df$`R-factor`[-nrow(R_df)]) #shift value one down
# create new filtered variable
R_df$`R-factor_new` = ifelse(is.na(R_df$`R-factor`), NA,
                      ifelse((!is.na(R_df$shift_1)) | (!is.na(R_df$shift_2)),
                             R_df$`R-factor`, NA))
> test replications elapsed relative user.self sys.self user.child sys.child
> 2 ifelseapproach 1000 0.83 1.000 0.65 0.19 NA NA
> 1 original 1000 1.81 2.181 1.76 0.01 NA NA
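One hedged follow-up (not part of the original answer): if you need the filtered series back as an xts object, the rownames of the converted data frame still carry the timestamps, so it can be rebuilt along these lines:
library(xts)
# rebuild an xts object from the filtered column; the rownames of R_df hold
# the original timestamps (assuming the CET index from the question)
R_new <- xts(R_df$`R-factor_new`,
             order.by = as.POSIXct(rownames(R_df), tz = "CET"))
colnames(R_new) <- "R-factor"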
This is probably faster than most of the other solutions offered. The rep function is essentially the inverse of the rle function: it takes two vector arguments and expands each value of the first by the corresponding length in the second, which allows a test based on the length of each run followed by replacement with is.na<-. Note that there are actually two different functions here: is.na(x), which returns a logical vector of length(x), and is.na(x) <-, which assigns NA to the items of x selected by the logical vector on its right-hand side:
rleR <- rle(c(is.na(R[,1]))) #get the position and lengths of nonNA's and NA's
is.na(R) <- with(rleR, rep(lengths < 3 , lengths ) ) #set NAs
#--------------
> R
R-factor
2019-03-18 10:30:00 NA
2019-03-18 10:31:00 NA
2019-03-18 10:32:00 NA
2019-03-18 10:33:00 NA
2019-03-18 10:34:00 NA
2019-03-18 10:35:00 NA
2019-03-18 10:36:00 -0.50534386
2019-03-18 10:37:00 -0.09923439
2019-03-18 10:38:00 0.30535320
2019-03-18 10:39:00 NA
Warning message:
timezone of object (CET) is different than current timezone ().
microbenchmark(
Fill = {Fillin <- function(x) if (length(x) <= 3 - !is.na(x[1])) NA else x
ave(R, cumsum(is.na(R)), FUN = Fillin)},
RLrep = {rleR <- rle(c(is.na(R[,1])))
is.na(R) <- with(rleR, rep(lengths < 3 , lengths ) )})
#----------------------
Unit: microseconds
expr min lq mean median uq max neval cld
Fill 1668.788 1784.6275 1942.5261 1844.5825 2005.0960 4911.762 100 b
RLrep 102.174 113.9565 144.3477 131.4735 156.6715 368.665 100 a
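As a small hedged illustration of how rep() undoes rle() here, using the same NA pattern as the question (values abbreviated):
x <- c(NA, -1.81, NA, -1.12, 0.001, NA, -0.51, -0.10, 0.31, NA)
r <- rle(is.na(x))
r$lengths                      # 1 1 1 2 1 3 1: lengths of the NA / non-NA runs
r$lengths < 3                  # TRUE for every run shorter than 3
rep(r$lengths < 3, r$lengths)  # expanded back to one flag per element
# this last vector is exactly the logical mask handed to is.na(R) <-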

Conditionally returning a vector of some row values based on another column's row values

Here is a reproducible data set:
set.seed(55);
data <- rnorm(12);
dates <- as.POSIXct("2019-03-18 10:30:00", tz = "CET") + 0:(length(data)-1)*60;
R <- xts(x = data, order.by = dates) %>%
sample(size = 10) %>%
fortify.zoo()
colnames(R) <- c("Time", "Rf");
R$lab <- "A"
R$lab[c(5, length(R$lab))] <- "BB"
R$diff <- c(NA, diff(R$Rf))
Output looks like:
> R
Time Rf lab diff
1 2019-03-18 10:30:00 0.120139084 A NA
2 2019-03-18 10:32:00 0.151582984 A 0.0314439
3 2019-03-18 10:33:00 -1.119221005 A -1.2708040
4 2019-03-18 10:34:00 0.001908206 A 1.1211292
5 2019-03-18 10:36:00 -0.505343855 BB -0.5072521
6 2019-03-18 10:37:00 -0.099234393 A 0.4061095
7 2019-03-18 10:38:00 0.305353199 A 0.4045876
8 2019-03-18 10:39:00 0.198409703 A -0.1069435
9 2019-03-18 10:40:00 -0.048910950 A -0.2473207
10 2019-03-18 10:41:00 -0.843233767 BB -0.7943228
I am trying to return the values of the "diff" column as a vector for rows where the "lab" column is "A". But the condition is that for "BB" not only is the corresponding diff value dropped, the values in the rows immediately above and below are also skipped.
For the above example, the following output is expected:
> res
[1] 0.0314439 -1.2708040 0.4045876 -0.1069435
Can you kindly help? Thanks
You can try
inds <- which(R$lab == "BB")
R$diff[-unique(c(inds - 1, inds, inds + 1))]
Or, as @Rui Barradas mentioned:
R$diff[-sapply(inds, `+`, -1:1)]
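On the posted data this still keeps row 1, whose diff is NA (the first value produced by diff()); if the expected vector should contain only the four non-NA values shown above, a hedged variant drops the NAs at the end:
inds <- which(R$lab == "BB")
res <- R$diff[-unique(c(inds - 1, inds, inds + 1))]
res <- res[!is.na(res)]   # drop the leading NA that diff() introduces
res
# [1]  0.0314439 -1.2708040  0.4045876 -0.1069435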

Build datetime column in R

I have 2 columns.
One is a date:
2011-04-13
2013-07-29
2010-11-23
The other is a time (hour of day):
3
22
15
I want to make a new column that contains the date-time.
It should look like this:
2011-04-13 3:00:00
2013-07-29 22:00:00
2010-11-23 15:00:00
I managed to combine them as a string, but when I convert it to a datetime I get only the date; the time disappears.
Any idea how to get date and time in one column?
My script:
data <- read.csv("d:\\__r\\hour.csv")
data$date <- as.POSIXct(paste(data$dteday , paste(data$hr, ":00:00", sep=""), sep=" "))
As an example, you can use the ymd_hm function from lubridate:
a <- c("2014-09-08", "2014-09-08", "2014-09-08")
b <- c(3, 4, 5)
library(lubridate)
library(tidyverse)
tibble(a, b) %>%
  mutate(time = paste0(a, " ", b, "-0"),
         time = ymd_hm(time))
output would be:
# A tibble: 3 x 3
a b time
<chr> <dbl> <dttm>
1 2014-09-08 3 2014-09-08 03:00:00
2 2014-09-08 4 2014-09-08 04:00:00
3 2014-09-08 5 2014-09-08 05:00:00
I found that this fixed the problem:
data$date <- as.POSIXct(strptime(paste(data$dteday , paste(data$hr, ":00:00", sep=""), sep=" "), "%Y-%m-%d %H:%M:%S"))
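For reference, the strptime() call is not strictly needed, since as.POSIXct() also accepts a format string directly; a hedged equivalent of the same fix:
# paste date and hour, then parse; %H accepts the unpadded hour values
data$date <- as.POSIXct(paste(data$dteday, data$hr),
                        format = "%Y-%m-%d %H")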

R Forecast Package TS Object with hourly data and setting start

I was trying to see if it is possible to set the start and end parameters of the ts() function in the forecast R package. The reason for this is to then use window() to subset a train and test set by date.
The time frame is from 2015-01-01 00:00:00 to 2017-12-31 23:00:00.
index esti
2015-01-01 00:00:00 1
2015-01-01 01:00:00 2
2015-01-01 02:00:00 3
2015-01-01 03:00:00 2
2015-01-01 04:00:00 5
2015-01-01 05:00:00 2
...
2017-12-31 18:00:00 0
2017-12-31 19:00:00 1
2017-12-31 20:00:00 0
2017-12-31 21:00:00 2
2017-12-31 22:00:00 0
2017-12-31 23:00:00 4
I used the following syntax to create the time series object:
tmp <- ts(dat, start = c(2015,1), frequency=24)
The returned object is this:
Time Series:
Start = c(2015, 1)
End = c(2015, 6)
Frequency = 24
It looks as if the ts object isn't correct here...
As far as I understand, the ts object does not work well with hourly input. It is recommended that you work with the xts or zoo package instead. See this SO post.
Try the following:
## Creating an entire hourly dataframe similar to the example dat
x <- lubridate::parse_date_time(
  c("2015-01-01 00:00:00", "2017-12-31 23:00:00"),
  orders = "ymdHMS"
)
y <- seq(x[1], x[2], by = "hour")
dat <- data.frame(
  index = y,
  esti = sample(seq(0, 10), size = length(y), replace = TRUE)
)
## xts package
library(xts)
tmp <- xts(dat, order.by = dat$index)
## Example window-ing
window(tmp, end = y[100])
Let me know if this does not work out.
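Since the original goal was a train/test split by date, here is a hedged sketch of how window() could be used on that xts object (the cutoff timestamp is just for illustration):
## split at an arbitrary cutoff; everything up to the cutoff is training data
cutoff <- as.POSIXct("2016-12-31 23:00:00", tz = "UTC")
train  <- window(tmp, end = cutoff)
test   <- window(tmp, start = cutoff + 3600)   # one hour after the cutoff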

R data.table add column as function of another data.table

I have one data table which contains just a sequence of times. I have another data table containing two columns: start_time and end_time. I want to take the first data table and add a column where the value is the count of all of the rows in the second data table where the time from the first data table fits within the start and end time. Here is my code
start_date <- as.POSIXct(x = "2017-01-31 17:00:00", format = "%Y-%m-%d %H:%M:%S")
end_date <- as.POSIXct(x = "2017-02-01 09:00:00", format = "%Y-%m-%d %H:%M:%S")
all_dates <- as.data.table(seq(start_date, end_date, "min"))
colnames(all_dates) <- c("Bin")
start_times <- sample(seq(start_date,end_date,"min"), 100)
offsets <- sample(seq(60,7200,60), 100)
end_times <- start_times + offsets
input_data <- data.table(start_times, end_times)
Here is what I want to do, but this is wrong and gives an error. What's the right way to write this?
all_dates[, BinCount := input_data[start_times < Bin & end_times > Bin, .N] ]
In the end I should get something like:
Bin BinCount
2017-01-31 17:00:00 1
2017-01-31 17:01:00 5
...
The problem can be solved very easily using sqldf, as it provides an easy way to join tables with range checking. Hence one solution could be:
The data from the OP:
library(data.table)
start_date <- as.POSIXct(x = "2017-01-31 17:00:00", format = "%Y-%m-%d %H:%M:%S")
end_date <- as.POSIXct(x = "2017-02-01 09:00:00", format = "%Y-%m-%d %H:%M:%S")
all_dates <- as.data.table(seq(start_date, end_date, "min"))
colnames(all_dates) <- c("Bin")
start_times <- sample(seq(start_date,end_date,"min"), 100)
offsets <- sample(seq(60,7200,60), 100)
end_times <- start_times + offsets
input_data <- data.table(start_times, end_times)
library(sqldf)
result <- sqldf("SELECT all_dates.bin, count() as BinCount
FROM all_dates, input_data
WHERE all_dates.bin > input_data.start_times AND
all_dates.bin < input_data.end_times
GROUP BY bin" )
result
Bin BinCount
1 2017-01-31 17:01:00 1
2 2017-01-31 17:02:00 1
3 2017-01-31 17:03:00 1
4 2017-01-31 17:04:00 1
5 2017-01-31 17:05:00 1
6 2017-01-31 17:06:00 1
...........
...........
497 2017-02-01 01:17:00 6
498 2017-02-01 01:18:00 5
499 2017-02-01 01:19:00 5
500 2017-02-01 01:20:00 4
[ reached getOption("max.print") -- omitted 460 rows ]
In data.table you're after a range join.
library(data.table)
start_date <- as.POSIXct(x = "2017-01-31 17:00:00", format = "%Y-%m-%d %H:%M:%S")
end_date <- as.POSIXct(x = "2017-02-01 09:00:00", format = "%Y-%m-%d %H:%M:%S")
all_dates <- as.data.table(seq(start_date, end_date, "min"))
colnames(all_dates) <- c("Bin")
set.seed(123)
start_times <- sample(seq(start_date,end_date,"min"), 100)
offsets <- sample(seq(60,7200,60), 100)
end_times <- start_times + offsets
input_data <- data.table(start_times, end_times)
## doing the range-join and calculating the number of items per bin in one chained step
input_data[
  all_dates
  , on = .(start_times < Bin, end_times > Bin)
  , nomatch = 0
  , allow.cartesian = TRUE
][, .N, by = start_times]
# start_times N
# 1: 2017-01-31 17:01:00 1
# 2: 2017-01-31 17:02:00 1
# 3: 2017-01-31 17:03:00 1
# 4: 2017-01-31 17:04:00 1
# 5: 2017-01-31 17:05:00 1
# ---
# 956: 2017-02-01 08:56:00 6
# 957: 2017-02-01 08:57:00 4
# 958: 2017-02-01 08:58:00 4
# 959: 2017-02-01 08:59:00 5
# 960: 2017-02-01 09:00:00 5
Note:
I've put the all_dates object on the right-hand side of the join, so the result contains the names of the input_data columns even though they are your Bins (see this issue for the discussion on this topic); a renaming sketch follows these notes
I've used set.seed(), as you're taking samples
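Following up on the first note, a hedged addition (assuming the chained result is stored in an object, here called res): data.table::setnames() can give the bin column its original name back.
res <- input_data[
  all_dates
  , on = .(start_times < Bin, end_times > Bin)
  , nomatch = 0
  , allow.cartesian = TRUE
][, .N, by = start_times]
# after the non-equi join the start_times column actually holds the Bin
# values, so rename it by reference for clarity
setnames(res, "start_times", "Bin")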
This wasn't requested, but here is a compact alternative solution using the tidyverse. It uses lubridate's parsers, interval(), and %within%, as well as purrr::map_int() to generate the desired bin counts.
library(tidyverse)
library(lubridate)
start_date <- ymd_hms(x = "2017-01-31 17:00:00") # lubridate parsers
end_date <- ymd_hms(x = "2017-02-01 09:00:00")
all_dates <- tibble(seq(start_date, end_date, "min")) # tibble swap for data.table
colnames(all_dates) <- c("Bin")
start_times <- sample(seq(start_date,end_date,"min"), 100)
offsets <- sample(seq(60,7200,60), 100)
end_times <- start_times + offsets
input_data <- tibble(
  start_times,
  end_times,
  intvl = interval(start_times, end_times)  # Add interval column
)
all_dates %>%  # Checks date in Bin and counts intervals it lies within
  mutate(BinCount = map_int(.$Bin, ~ sum(. %within% input_data$intvl)))
# A tibble: 961 x 2
Bin BinCount
<dttm> <int>
1 2017-01-31 17:00:00 0
2 2017-01-31 17:01:00 0
3 2017-01-31 17:02:00 0
4 2017-01-31 17:03:00 0
5 2017-01-31 17:04:00 0
6 2017-01-31 17:05:00 0
7 2017-01-31 17:06:00 0
8 2017-01-31 17:07:00 1
9 2017-01-31 17:08:00 1
10 2017-01-31 17:09:00 1
# ... with 951 more rows
