data.table based function is altering the previous table - r

I have time series data with two columns: 1) a POSIXct date-time column at 30-minute intervals and 2) a value for each interval, as shown below:
read_date_time int_val
2013-01-15 15:00:00 2.3
2013-01-15 15:30:00 2.4
I've written a function that pivots the data.table so that each row represents one day and there are 48 columns, one for each 30-minute interval.
read_date 00:00 00:30 01:00 01:30 ...
2013-01-15 1.3 1.4 1.2 1.5 ...
The function creates two new columns (pure_date and interval), which are used as IDs by the reshape function. However, I'm finding that the new columns are also added to the original table, and the original read_date_time column is removed.
int_val pure_date interval
6.829986e-05 2013-08-31 00:00:00
6.887250e-05 2013-08-31 00:30:00
This causes numerous problems downstream, as the original data set is reused in other functions. I'm aware that I could probably bypass some of these problems using data.frame operations instead. However, I'm handling very large quantities of data and efficiency is key, so I really need a data.table solution.
What am I doing wrong?
Code for replication....
require(data.table)
require(reshape)
require(stringr)
# Create time_array for example
set.seed(1L) ## for reproducibility
dt_format = "%Y-%m-%d %H:%M"
time_seq <- seq.POSIXt(
as.POSIXct("2012-01-01 00:00:00", format=dt_format),
as.POSIXct("2013-12-31 00:00:00", format=dt_format),
by = "30 mins")
values <- runif(NROW(time_seq),0,1)
combined_data_set <- data.table(read_date_time = time_seq, int_val = values)
> head(combined_data_set) # Format wanted
# Define Pivoting Function
pivot_data <- function(A) {
con_data <- A
con_data[,pure_date := as.Date(read_date_time)]
con_data[,interval := str_sub(as.character(read_date_time),-8,-1)]
con_data[,read_date_time := NULL]
con_data <- data.table(read_date = as.character(con_data$pure_date),
interval = con_data$interval,
int_val = con_data$int_val)
pivoted <- recast(con_data, read_date ~ interval,
id.var = c("read_date","interval"))
return(pivoted)
}
# Apply to data set
pivoted_output <- pivot_data(combined_data_set)
# Original data has been altered, what's happened!!!!!
> head(combined_data_set)

pivot_data <- function(A) {
con_data <- copy(A)
con_data[,pure_date := as.Date(read_date_time)]
con_data[,interval := str_sub(as.character(read_date_time),-8,-1)]
con_data[,read_date_time := NULL]
con_data <- data.table(read_date = as.character(con_data$pure_date),
interval = con_data$interval,
int_val = con_data$int_val)
pivoted <- recast(con_data, read_date ~ interval,
id.var = c("read_date","interval"))
return(pivoted)
}
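As mentioned, just a schoolboy error: con_data <- A does not copy a data.table, it only creates a second name for the same object, so every := inside the function also modifies the original table. Using copy(A) does the job.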
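The reference semantics behind this can be seen in a small sketch. The names orig, alias and safe are made up for illustration, and the pivot uses data.table's own dcast() instead of reshape::recast() from the question, so treat it as one possible variant rather than the original function.

```r
library(data.table)

# Plain assignment does not copy a data.table: both names refer to one object.
orig <- data.table(x = 1:3)
alias <- orig
alias[, y := x * 2L]          # := works in place, so orig gains column y too
print(names(orig))

# copy() makes an independent table, so later := calls leave orig alone.
safe <- copy(orig)
safe[, z := 0L]
print(names(orig))

# Hypothetical variant of pivot_data() that works on a copy and uses dcast().
pivot_data <- function(A) {
  con_data <- copy(A)
  con_data[, read_date := as.character(as.Date(read_date_time))]
  con_data[, interval := format(read_date_time, "%H:%M")]
  dcast(con_data, read_date ~ interval, value.var = "int_val")
}

# Two days of made-up half-hourly data (assumption: UTC timestamps).
time_seq <- seq(as.POSIXct("2013-01-15 00:00", tz = "UTC"),
                as.POSIXct("2013-01-16 23:30", tz = "UTC"),
                by = "30 min")
combined_data_set <- data.table(read_date_time = time_seq,
                                int_val = seq_along(time_seq) / 10)
pivoted_output <- pivot_data(combined_data_set)
```

After the call, combined_data_set still has only its two original columns, because all := operations ran on the copy.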

Related

Calculate how long each row's time period overlaps with other rows

Each row records the start time and end time of a time period. To simplify, we assume all periods fall on the same day, so we don't need to deal with more difficult cases.
I need to calculate how long each row's time period overlaps with other rows, and the number of rows it overlaps with. Suppose I have the data set below. I want to compute column D (num_overlap) and column E (time_overlap).
I have no real sample data, so I use this fake data set as an example.
dat <- data.frame(id=1:4,
Start = c("02:50","02:55","03:15","03:25"),
End = c("03:10", "03:05", "03:20", "03:30"),
num_overlap = c(1,1,0,0),
time_overlap = c(10,10,0,0))
dat
Here's a solution using the foverlaps method within the data.table package:
Packages:
library(data.table)
library(chron) # To convert the times provided into a `times` format
Your data:
dat <- data.frame(id=1:4,
Start = c("02:50","02:55","03:15","03:25"),
End = c("03:10", "03:05", "03:20", "03:30"))
I assume your times represent hours and minutes, so we need to append the seconds (":00") and convert them to a times object:
dat$Start = times(paste0(dat$Start, ":00"))
dat$End = times(paste0(dat$End,":00"))
We will use the data.table package so we coerce the data frame to a data.table object
setDT(dat)
Set the Start and End time columns as keys for merging and then call the foverlaps function
setkey(dat, "Start", "End")
# Merge onto itself to find overlaps
overlaps = foverlaps(dat, dat, type = "any")
(See more on foverlaps here at https://www.rdocumentation.org/packages/data.table/versions/1.14.2/topics/foverlaps)
foverlaps keeps the cases where a row matches itself (id == i.id), so we remove them:
overlaps = overlaps[id != i.id]
Next, compute the amount of overlap, for each overlap that occurred
overlaps[, time_overlap := pmin(i.End, End) - pmax(i.Start, Start)]
Convert to minutes per https://stackoverflow.com/a/29067748/3674399
overlaps[, time_overlap := 60 * 24 * as.numeric(time_overlap)]
Keep only id and time_overlap, and summarize by id
overlaps = overlaps[, list(id, time_overlap)]
overlaps = overlaps[, list(num_overlap = .N, time_overlap = sum(time_overlap)), by = "id"]
Finally, merge with original dataset now, and fill any missing values
result = merge(dat, overlaps, by = "id", all.x = TRUE)
result[is.na(num_overlap), num_overlap := 0]
result[is.na(time_overlap), time_overlap := 0]
The result is as follows:
id Start End num_overlap time_overlap
1: 1 02:50:00 03:10:00 1 10
2: 2 02:55:00 03:05:00 1 10
3: 3 03:15:00 03:20:00 0 0
4: 4 03:25:00 03:30:00 0 0
If you are new to the data.table syntax, please see for example, this intro here: https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html
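For comparison, here is a minimal sketch of the same steps without the chron dependency, under the assumption that all times fall on one day and can be stored as integer minutes since midnight. The helper to_min() and the names ov and per_id are made up for this example. Note that foverlaps treats intervals as closed, so two periods that merely touch at an endpoint would be counted as overlapping with a duration of 0.

```r
library(data.table)

# Hypothetical helper: "HH:MM" -> integer minutes since midnight.
to_min <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) as.integer(p[1]) * 60L + as.integer(p[2]), integer(1))
}

dat <- data.table(id = 1:4,
                  Start = to_min(c("02:50", "02:55", "03:15", "03:25")),
                  End   = to_min(c("03:10", "03:05", "03:20", "03:30")))
setkey(dat, Start, End)

# Self-join on overlapping intervals, then drop each row's match with itself.
ov <- foverlaps(dat, dat, type = "any")[id != i.id]

# Length of the shared part of each overlapping pair, in minutes.
ov[, minutes := pmin(End, i.End) - pmax(Start, i.Start)]

# Count and total overlap per id, then attach to the full data set.
per_id <- ov[, .(num_overlap = .N, time_overlap = sum(minutes)), by = id]
result <- merge(dat, per_id, by = "id", all.x = TRUE)
result[is.na(num_overlap), num_overlap := 0L]
result[is.na(time_overlap), time_overlap := 0L]
print(result)
```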

Compare date intervals within the same data frame

I have searched around and found similar questions, but I can't make the answers work for my data.
I have a data frame with start and end dates, as well as several other factors. Ideally, the start date of a row should be later than the end date of any previous row, but the data has duplicated starts or ends, and sometimes the date intervals overlap.
I tried to make a reproducible example:
df = data.frame(start=c("2018/04/15 9:00:00","2018/04/15 9:00:00","2018/04/16 10:20:00","2018/04/16 15:30:00",
"2018/04/17 12:40:00","2018/04/17 18:50:00"),
end=c("2018/04/16 8:00:00","2018/04/16 7:10:00","2018/04/17 18:20:00","2018/04/16 16:30:00",
"2018/04/17 16:40:00","2018/04/17 19:50:00"),
value=c(10,15,11,13,14,12))
I was able to remove the duplicated end or start dates, but I can't remove the overlapping intervals. I want to create a loop that "cleans" the intervals contained within any larger interval, so that the result looks like this:
result = df[c(1,3,6),]
I thought I could make a loop that would "clean" both duplicates and overlapping intervals, but I can't make it work.
Any suggestions?
The data.table package is suited for this kind of problem using the overlapping join function foverlaps (inspired by findOverlaps function from the Bioconductor package IRanges) and then an anti-join (data.table syntax is B[!A, on]) to remove those inner intervals.
library(data.table)
cols <- c("start", "end")
setDT(df)
df[, (cols) := lapply(.SD, function(x) as.POSIXct(x, format="%Y/%m/%d %H:%M:%S")), .SDcols=cols]
setkeyv(df, cols)
anti <- foverlaps(df, df, type="within")[start!=i.start | end!=i.end | value!=i.value]
df[!anti, on=.(start=i.start, end=i.end, value=i.value)]
# start end value
# 1: 2018-04-15 09:00:00 2018-04-16 08:00:00 10
# 2: 2018-04-16 10:20:00 2018-04-17 18:20:00 11
# 3: 2018-04-17 18:50:00 2018-04-17 19:50:00 12
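The containment rule that the join applies can also be checked in base R. The sketch below is an illustrative O(n^2) comparison, not the foverlaps approach itself; the variable names (s, e, inside, same, keep) are made up, and UTC is an assumed time zone.

```r
df <- data.frame(
  start = c("2018/04/15 9:00:00", "2018/04/15 9:00:00", "2018/04/16 10:20:00",
            "2018/04/16 15:30:00", "2018/04/17 12:40:00", "2018/04/17 18:50:00"),
  end   = c("2018/04/16 8:00:00", "2018/04/16 7:10:00", "2018/04/17 18:20:00",
            "2018/04/16 16:30:00", "2018/04/17 16:40:00", "2018/04/17 19:50:00"),
  value = c(10, 15, 11, 13, 14, 12))

# Convert to seconds since the epoch so intervals can be compared numerically.
s <- as.numeric(as.POSIXct(df$start, format = "%Y/%m/%d %H:%M:%S", tz = "UTC"))
e <- as.numeric(as.POSIXct(df$end,   format = "%Y/%m/%d %H:%M:%S", tz = "UTC"))

# inside[i, j] is TRUE when interval i lies within interval j.
inside <- outer(s, s, ">=") & outer(e, e, "<=")
diag(inside) <- FALSE

# Exact duplicates contain each other; keep the first of each identical pair.
same <- outer(s, s, "==") & outer(e, e, "==")
inside[same & upper.tri(same)] <- FALSE

# Keep the rows that are not contained in any other interval.
keep <- rowSums(inside) == 0
print(df[keep, ])
```

This builds n-by-n matrices, so it is only practical for small data; the keyed join scales much better.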
An alternative approach is to use %within% from the lubridate package:
library(lubridate)
# transform characters to dates
start_time <- as_datetime(df[ , "start"], tz = "UTC")
end_time <- as_datetime(df[ , "end"], tz = "UTC")
# construct intervals
start_end_intrvls <- interval(start_time, end_time)
# find indices of the non-within intervals
not_within <- !(sapply(FUN = function(i) any(start_end_intrvls[i] %within% start_end_intrvls[-i]),
X = seq(along.with = df[ , "start"])))
df[not_within, ]
# start end value
# 1 2018/04/15 9:00:00 2018/04/16 8:00:00 10
# 3 2018/04/16 10:20:00 2018/04/17 18:20:00 11
# 6 2018/04/17 18:50:00 2018/04/17 19:50:00 12
Update
The as_datetime() function throws an error when applied to a tibble:
as_datetime(tibble("2018/04/15 9:00:00"), tz = "UTC")
Error in as.POSIXct.default(x) :
do not know how to convert 'x' to class “POSIXct”
The solution above can be modified to resolve this issue by replacing as_datetime() with as.POSIXlt():
df_tibble <- tibble(start=c("2018/04/15 9:00:00","2018/04/15 9:00:00","2018/04/16 10:20:00",
"2018/04/16 15:30:00", "2018/04/17 12:40:00","2018/04/17 18:50:00"),
end=c("2018/04/16 8:00:00","2018/04/16 7:10:00","2018/04/17 18:20:00","2018/04/16 16:30:00",
"2018/04/17 16:40:00","2018/04/17 19:50:00"), value=c(10,15,11,13,14,12))
start_time_lst <- lapply(FUN = function(i) as.POSIXlt(as.character(df_tibble[i , "start"]),
tz = "UTC"),
X = seq(along.with = unlist(df_tibble[ , "start"])))
end_time_lst <- lapply(FUN = function(i) as.POSIXlt(as.character(df_tibble[ i, "end"]),
tz = "UTC"),
X = seq(along.with = unlist(df_tibble[ , "end"])))
start_end_intrvls <- lapply(function(i) interval(start_time_lst[[i]] , end_time_lst[[i]]),
X = seq(along.with = unlist(df_tibble[ , "start"])))
not_within <- sapply(function(i) !(any(unlist(Map(`%within%`,
start_end_intrvls[[i]], start_end_intrvls[-i])))),
X = seq(along.with = unlist(df_tibble[ , "start"])))

Take a daily rolling mean of a seven day window for 30 minute sampled data

I would like to take the mean over a 7-day rolling window, moved in 1-day increments, of data that is collected at 30-minute intervals.
I have tried using data.table with a conditional statement in by, with no success. Any guidance would be greatly appreciated.
# packages
library(data.table)
library(lubridate)
# Set the seed for reproducible sampling
set.seed(42)
# Create some Data
start = ymd_hms("2014-01-01 00:00:00")
end = ymd_hms("2014-12-31 23:59:59")
# Create data with 30 minute intervals.
dat <- data.table(timestamp = seq(start, end, by = "30 min"),
sample1 = sample(1:20, 17520, replace = TRUE))
# Create date variable for merging datasets.
dat[, date := as.Date(timestamp)]
# Create data for a 7-day moving window with one-day increments.
dat2 <- data.table(start = seq(start, end, by = "1 day"),
end = seq(start + days(7), end + days(7), by = "1 day"))
# Create date variable for merging datasets.
dat2[, date := as.Date(start)]
# merge datasets.
dat <- merge(dat, dat2, by="date")
# Tried
dat[, .(sample.mean = mean(sample1)), by = .(timestamp >= start & timestamp < end)]
# timestamp sample.mean
# 1: TRUE 10.46638
dat[, .(sample.mean = mean(sample1)), by = .(timestamp %in% c(start:end))]
# timestamp sample.mean
# 1: TRUE 10.40059
# 2: FALSE 10.46767
# Warning messages:
# 1: In start:end :
# numerical expression has 17520 elements: only the first used
# 2: In start:end :
# numerical expression has 17520 elements: only the first used
dat[, .(sample.mean = mean(sample1)), by = .(timestamp %between% c(start, end))]
# timestamp sample.mean
# 1: TRUE 19.00000
# 2: FALSE 10.46589
I'm not 100% sure I understand your exact parameters, but here's the basic approach:
setkey(dat, date)
# pull the current day plus the 7 previous days (8 dates in total; use 6L for a 7-day window)
dat[ , dat[.(seq(.BY$date - 7L,
.BY$date, by = "day")),
#nomatch = 0L will exclude any requested dates outside the interval
mean(sample1), nomatch = 0L], by = date]
# date V1
# 1: 2014-01-01 12.31250
# 2: 2014-01-02 10.94792
# 3: 2014-01-03 11.27083
# 4: 2014-01-04 11.10417
# 5: 2014-01-05 10.79167
# ---
# 361: 2014-12-27 10.50260
# 362: 2014-12-28 10.52344
# 363: 2014-12-29 10.05990
# 364: 2014-12-30 10.03906
# 365: 2014-12-31 10.38542
Some possible tinkers:
Change 7L to whatever window you'd like; use positive if you want forward-looking averages
If you want to go by timestamp, you'll have to adjust the 7L to match whatever units (seconds/minutes/hours/etc)
The extreme points of the interval are not technically correct, since the window there is shorter than requested; omit nomatch = 0L and these points will be returned as NA
Use .(var = mean(sample1)) to name the output column var.
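When adjusting the window, it helps to count the dates a given offset actually produces. A quick check in base R (the date is arbitrary and the names d, n_with_7 and n_with_6 are made up):

```r
d <- as.Date("2014-01-10")

# Both endpoints are included, so an offset of 7 yields 8 dates.
n_with_7 <- length(seq(d - 7L, d, by = "day"))

# An offset of 6 yields a 7-day window ending on d.
n_with_6 <- length(seq(d - 6L, d, by = "day"))

print(c(n_with_7, n_with_6))
```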
Here's one approach:
library(zoo)
daymeans = dat[, mean(sample1), by=date][, rmean := rollmean(V1, 7, fill=NA)]
dat[daymeans, rmean := i.rmean, on="date"]
This assumes that your data is already sorted by date; if not, use keyby=date instead of by=date. If you don't want to juggle intermediate objects, there is a one-liner:
# Michael Chirico's suggestion from the comments
dat[dat[, mean(sample1), by=date][, rmean := rollmean(V1, 7, fill=NA)], rmean := i.rmean, on = "date"]
You may need to tweak the arguments to rollmean to fit your particular definition of the window. @eddi suggested that runmean from the caTools package is typically faster than zoo's rollmean, so it is probably also worth a look.
Crude benchmark with the OP's example data:
dat2 = copy(dat)
# Michael's answer
system.time({
setkey(dat, date)
dat[ , dat[.(seq(.BY$date - 7L,
.BY$date, by = "day")),
mean(sample1), nomatch = 0L], by = date]
})
user system elapsed
0.33 0.00 0.35
# this answer
system.time({
daymeans = dat2[, mean(sample1), by=date][, rmean := rollmean(V1, 7, fill=NA)]
dat2[daymeans, rmean := i.rmean, on="date"]
})
user system elapsed
0 0 0
Why it's faster: here, we're computing 365 means of 48 numbers and then a rolling mean over a vector of length 365, which is less computationally costly than making 365 merges to find 48*7 numbers each and then taking the mean of those.
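The two-stage idea (daily means first, then a rolling mean over the days) can also be sketched without zoo, using data.table's frollmean(). This is a swapped-in function, not the one used in the answers above, and the 14-day data set is made up for the example. Note that frollmean() defaults to a right-aligned window, whereas zoo's rollmean() centres it by default.

```r
library(data.table)

set.seed(42)
# Two weeks of made-up half-hourly samples (assumption: UTC timestamps).
ts <- seq(as.POSIXct("2014-01-01 00:00:00", tz = "UTC"),
          as.POSIXct("2014-01-14 23:30:00", tz = "UTC"),
          by = "30 min")
dat <- data.table(timestamp = ts,
                  sample1 = sample(1:20, length(ts), replace = TRUE))
dat[, date := as.Date(timestamp)]

# Stage 1: one mean per day (keyby also sorts by date).
daymeans <- dat[, .(daily = mean(sample1)), keyby = date]

# Stage 2: 7-day rolling mean of the daily means, window ending on each date.
daymeans[, rmean := frollmean(daily, 7, align = "right")]

# Attach the rolling mean to every half-hourly row of that date.
dat[daymeans, rmean := i.rmean, on = "date"]
print(daymeans)
```

Because every day here has the same number of observations (48), the rolling mean of daily means equals the mean of all samples in the window; with uneven days the two would differ.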

Convert continuous time-series data into daily-hourly representation using R

I have time-series data in xts representation as
library(xts)
xtime <-timeBasedSeq('2015-01-01/2015-01-30 23')
df <- xts(rnorm(length(xtime),30,4),xtime)
Now I want to calculate the correlation between different days, and hence I want to represent df in matrix form, with one row per day and one column per hour.
To achieve this I used
p_mat= split(df,f="days",drop=FALSE,k=1)
Using this I get a list of days, but I am not able to arrange this list in matrix form. Also I used
p_mat<- df[.indexday(df) %in% c(1:30) & .indexhour(df) %in% c(1:24)]
With this I do not get any output.
Also I tried to use rollapply(), but was not able to arrange it properly.
Could I get help forming the matrix using xts/zoo objects?
Maybe you could use something like this:
#convert to a data.frame with an hour column and a day column
df2 <- data.frame(value = df,
hour = format(index(df), '%H:%M:%S'),
day = format(index(df), '%Y:%m:%d'),
stringsAsFactors=FALSE)
#then use xtabs which outputs a matrix in the format you need
tab <- xtabs(value ~ day + hour, df2)
Output:
hour
day 00:00:00 01:00:00 02:00:00 03:00:00 04:00:00 05:00:00 06:00:00 07:00:00 08:00:00 09:00:00 10:00:00 11:00:00 12:00:00
2015:01:01 28.15342 35.72913 27.39721 29.17048 28.42877 28.72003 28.88355 31.97675 29.29068 27.97617 35.37216 29.14168 29.28177
2015:01:02 23.85420 28.79610 27.88688 27.39162 29.77241 22.34256 34.70633 23.34011 28.14588 25.53632 26.99672 38.34867 30.06958
2015:01:03 37.47716 31.70040 29.04541 34.23393 33.54569 27.52303 38.82441 28.97989 24.30202 29.42240 30.83015 39.23191 30.42321
2015:01:04 24.13100 32.08409 29.36498 35.85835 26.93567 28.27915 26.29556 29.29158 31.60805 27.07301 33.32149 25.16767 25.80806
2015:01:05 32.16531 29.94640 32.04043 29.34250 31.68278 28.39901 24.51917 33.95135 36.07898 28.76504 24.98684 32.56897 29.82116
2015:01:06 18.44432 27.43807 32.28203 29.76111 29.60729 32.24328 25.25417 34.38711 29.97862 32.82924 34.13643 30.89392 26.48517
2015:01:07 34.58491 20.38762 32.29096 31.49890 28.29893 33.80405 28.44305 28.86268 33.42964 36.87851 31.08022 28.31126 25.24355
2015:01:08 33.67921 31.59252 28.36989 35.29703 27.19507 27.67754 25.99571 27.32729 33.78074 31.73481 34.02064 28.43953 31.50548
2015:01:09 28.46547 36.61658 36.04885 30.33186 32.26888 25.90181 31.29203 34.17445 30.39631 28.18345 27.37687 29.85631 34.27665
2015:01:10 30.68196 26.54386 32.71692 28.69160 23.72367 28.53020 35.45774 28.66287 32.93100 33.78634 30.01759 28.59071 27.88122
2015:01:11 32.70907 31.51985 29.22881 36.31157 32.38494 25.30569 29.37743 22.32436 29.21896 19.63069 35.25601 27.45783 28.28008
2015:01:12 29.96676 30.51542 29.41650 29.34436 37.05421 33.05035 34.44572 26.30717 30.65737 34.61930 29.77391 21.48256 31.37938
2015:01:13 33.46089 34.29776 37.58262 27.58801 28.43653 28.33511 28.49737 28.53348 28.81729 35.76728 27.20985 28.44733 32.61015
2015:01:14 22.96213 32.27889 36.44939 23.45088 26.88173 27.43529 27.27547 21.86686 32.00385 23.87281 29.90001 32.37194 29.20722
2015:01:15 28.30359 30.94721 20.62911 33.84679 27.58230 26.98849 23.77755 24.18443 30.22533 32.03748 21.60847 25.98255 32.14309
2015:01:16 23.52449 29.56138 31.76356 35.40398 24.72556 31.45754 30.93400 34.77582 29.88836 28.57080 25.41274 27.93032 28.55150
2015:01:17 25.56436 31.23027 25.57242 31.39061 26.50694 30.30921 28.81253 25.26703 30.04517 33.96640 36.37587 24.50915 29.00156
...and so on
Here's one way to do it using a helper function that will account for days that do not have 24 observations.
library(xts)
xtime <- timeBasedSeq('2015-01-01/2015-01-30 23')
set.seed(21)
df <- xts(rnorm(length(xtime),30,4), xtime)
tHourly <- function(x) {
# initialize result matrix for all 24 hours
dnames <- list(format(index(x[1]), "%Y-%m-%d"),
paste0("H", 0:23))
res <- matrix(NA, 1, 24, dimnames = dnames)
# transpose day's rows and set colnames
tx <- t(x)
colnames(tx) <- paste0("H", .indexhour(x))
# update result object and return
res[,colnames(tx)] <- tx
res
}
# split on days, apply tHourly to each day, rbind results
p_mat <- split(df, f="days", drop=FALSE, k=1)
p_list <- lapply(p_mat, tHourly)
p_hmat <- do.call(rbind, p_list)
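Once the data is in a day-by-hour matrix, the correlation between days that the question asks about is a call to cor() on the transposed matrix. The sketch below builds an equivalent matrix in base R with tapply(), so it runs without xts; the names times, vals, m and day_cor are made up, and UTC is an assumed time zone.

```r
set.seed(21)
# Hourly timestamps for 30 days, mirroring the example above.
times <- seq(as.POSIXct("2015-01-01 00:00", tz = "UTC"),
             as.POSIXct("2015-01-30 23:00", tz = "UTC"),
             by = "hour")
vals <- rnorm(length(times), 30, 4)

day  <- format(times, "%Y-%m-%d")
hour <- factor(paste0("H", as.integer(format(times, "%H"))),
               levels = paste0("H", 0:23))

# One row per day, one column per hour; a missing hour would appear as NA.
m <- tapply(vals, list(day, hour), mean)

# cor() correlates columns, so transpose to correlate days with each other.
day_cor <- cor(t(m), use = "pairwise.complete.obs")
print(dim(day_cor))
```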

Find range of values in each unique day

I have the following example:
Date1 <- seq(from = as.POSIXct("2010-05-01 02:00"),
to = as.POSIXct("2010-10-10 22:00"), by = 3600)
Dat <- data.frame(DateTime = Date1,
t = rnorm(length(Date1)))
I would like to find the range of values in a given day (i.e. maximum - minimum).
First, I've defined additional columns which define the unique days in terms of the date and in terms of the day of year (doy).
Dat$date <- format(Dat$DateTime, format = "%Y-%m-%d") # find the unique days
Dat$doy <- as.numeric(format(Dat$DateTime, format="%j")) # find the unique days
To then find the range I tried
by(Dat$t, Dat$doy, function(x) range(x))
but this returns the range as two values (minimum and maximum), not a single value. So my question is: how do I calculate the range for each day and return the results in a data.frame of the form
new_data <- data.frame(date = unique(Dat$date),
range = ...)
Can anyone suggest a method for doing this?
I tend to use tapply for this kind of thing. ave is also useful sometimes. Here:
> dr = tapply(Dat$t,Dat$doy,function(x){diff(range(x))})
Always check tricksy stuff:
> dr[1]
121
3.084317
> diff(range(Dat$t[Dat$doy==121]))
[1] 3.084317
Use the names attribute to get the day-of-year and the values to make a data frame:
> new_data = data.frame(date=names(dr),range=dr)
> head(new_data)
date range
121 121 3.084317
122 122 4.204053
Did you want to convert the number day-of-year back to a date object?
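If dates rather than day-of-year numbers are wanted, one option is to group by the date string from the question and convert the names back with as.Date(). A minimal sketch under that assumption (the seed and the UTC time zone are arbitrary choices for the example):

```r
set.seed(1)
Date1 <- seq(from = as.POSIXct("2010-05-01 02:00", tz = "UTC"),
             to = as.POSIXct("2010-10-10 22:00", tz = "UTC"), by = 3600)
Dat <- data.frame(DateTime = Date1, t = rnorm(length(Date1)))
Dat$date <- format(Dat$DateTime, format = "%Y-%m-%d")

# Range (max - min) per calendar day; names(dr) are the date strings.
dr <- tapply(Dat$t, Dat$date, function(x) diff(range(x)))

# as.vector() drops the names so they are not reused as row names.
new_data <- data.frame(date = as.Date(names(dr)), range = as.vector(dr))
print(head(new_data))
```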
# Use the data.table package
require(data.table)
# Set seed so data is reproducible
set.seed(42)
# Create data.table
Date1 <- seq(from = as.POSIXct("2010-05-01 02:00"), to = as.POSIXct("2010-10-10 22:00"), by = 3600)
DT <- data.table(date = as.IDate(Date1), t = rnorm(length(Date1)))
# Set key on data.table so that it is sorted by date
setkey(DT, "date")
# Make a new data.table with the required information (can be used as a data.frame)
new_data <- DT[, diff(range(t)), by = date]
# date V1
# 1: 2010-05-01 4.943101
# 2: 2010-05-02 4.309401
# 3: 2010-05-03 4.568818
# 4: 2010-05-04 2.707036
# 5: 2010-05-05 4.362990
# ---
# 159: 2010-10-06 2.659115
# 160: 2010-10-07 5.820803
# 161: 2010-10-08 4.516654
# 162: 2010-10-09 4.010017
# 163: 2010-10-10 3.311408
