Optimizing search in a time series data frame (R)

I have a data frame of 50 columns by 2.5 million rows in R, representing a time series. The time column is of class POSIXct. For analysis, I repeatedly need to find the state of the system for a given class at a particular time.
My current approach is the following (simplified and reproducible):
set.seed(1)
N <- 10000
.time <- sort(sample(1:(100*N),N))
class(.time) <- c("POSIXct", "POSIXt")
df <- data.frame(
  time = .time,
  distance1 = sort(sample(1:(100*N), N)),
  distance2 = sort(sample(1:(100*N), N)),
  letter = sample(letters, N, replace = TRUE)
)
# state search function
time.state <- function(df, searchtime, searchclass){
  # find all rows between the searchtime and a while (here 10k seconds)
  # before that
  rows <- which(findInterval(df$time, c(searchtime - 10000, searchtime)) == 1)
  # find the latest state of the given class within the search interval
  return(rev(rows)[match(TRUE, rev(df[rows, "letter"] == searchclass))])
}
# evaluate the function to retrieve the latest known state of the system
# at time 500,000.
df[time.state(df,500000,"a"),]
However, the call to which is very costly. Alternatively, I could first filter by class and then find the time, but that doesn't change the evaluation time much. According to Rprof, which and == account for the majority of the time.
Is there a more efficient solution? The time points are sorted in weakly increasing order.

Because which, == and [ are all linear in the size of the data frame, the solution is to generate subset data frames for bulk operations, as follows:
# function that applies time.state to a series of time/class combinations
time.states <- function(df, times, classes, day.length = 24){
  result <- vector("list", length(times))
  day.end <- 0
  for(i in 1:length(times)){
    if(times[i] > day.end){
      # create a subset interval from 1h before to day.length hours after
      day.begin <- times[i] - 60*60
      day.end <- times[i] + day.length*60*60
      df.subset <- df[findInterval(df$time, c(day.begin, day.end)) == 1, ]
    }
    # save the resulting row from the subset data frame
    result[[i]] <- df.subset[time.state(df.subset, times[i], classes[i]), ]
  }
  return(do.call("rbind", result))
}
With dT = diff(range(df$time)) and dT/day.length large, this reduces the evaluation time by roughly a factor of dT/(day.length+1).
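For completeness, a minimal usage sketch on the example data built above (searchtimes and searchclasses are made-up query vectors; the subset reuse in time.states assumes the query times are processed in increasing order):
# hypothetical batch of queries against the example df
searchtimes <- sort(sample(1:(100*N), 50))
searchclasses <- sample(letters, 50, replace = TRUE)
states <- time.states(df, searchtimes, searchclasses)
head(states)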

Related

Vectorizing a column-by-column comparison to separate values

I'm working with data gathered from multi-channel electrode systems, and am trying to make this run faster than it currently does, but I can't find any good way of doing it without loops.
The gist of it is: I have modified averages for each column (each column is a channel), and need to compare each value in a column to the average for that column. If the value is above the adjusted mean, then I need to put that value in another data frame so it can be easily read.
Here is some sample code for the problematic bit:
readout <- data.frame(dimnames <- c("Values"))
# need to clear the dataframe in order to run it multiple times without errors
# timeFrame is just a subsection of the original data, 60 channels with upwards of a few million rows
readout <- readout[0,]
for (i in 1:ncol(timeFrame)){
  for (g in 1:nrow(timeFrame)){
    if (timeFrame[g,i] >= posCompValues[i,1])
      append(spikes, timeFrame[g,i])
  }
}
The data ranges from 500 thousand to upwards of 130 million readings, so if anyone could point me in the right direction I'd appreciate it.
Something like this should work:
Return values of x greater than y:
cmpfun <- function(x,y) return(x[x>y])
For each element (column) of timeFrame, compare with the corresponding value of the first column of posCompValues
vals1 <- Map(cmpfun,timeFrame,posCompValues[,1])
Collapse the list into a single vector:
spikes <- unlist(vals1)
If you want to save both the value and the corresponding column, it may be worth unpacking this a bit into a for loop:
resList <- list()
for (i in seq(ncol(timeFrame))) {
  tt <- timeFrame[,i]
  spikes <- tt[tt > posCompValues[i,1]]
  if (length(spikes) > 0) {
    resList[[i]] <- data.frame(value = spikes, orig_col = i)
  }
}
res <- do.call(rbind, resList)
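As a self-contained sketch (timeFrame and posCompValues below are small made-up stand-ins for the real objects, just to show the shape of the computation):
set.seed(1)
timeFrame <- as.data.frame(matrix(rnorm(60), ncol = 6))            # 6 "channels", 10 readings each
posCompValues <- data.frame(adj_mean = colMeans(timeFrame) + 0.5)  # adjusted mean per channel
cmpfun <- function(x, y) return(x[x > y])
spikes <- unlist(Map(cmpfun, timeFrame, posCompValues[, 1]))
length(spikes)  # number of readings above their channel's threshold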

storing lubridate intervals in a dataframe (R)

I want to create a dataframe of 15-minute intervals over 24 hours, starting with a certain interval on several dates. I use a loop for that, but instead of the actual intervals it stores the number of seconds, which is not useful in my case. Is there any way to avoid this? I need the intervals to look at how often timed events happen within them. I found one similar question, but the answer concentrated on using lapply instead of apply, which isn't applicable here.
So here is a basic example:
library(lubridate)
begin <- as.POSIXct(rbind("2016-03-31 09:00:00","2016-04-12 09:00:00"))
end <- as.POSIXct(rbind("2016-03-31 09:15:00","2016-04-12 09:15:00"))
int <- as.interval(begin, end)
aufl <- duration(15, "mins")
Intervall <- data.frame()
for (j in 1:length(int)){
  for (i in 1:96){
    Intervall[j,i] <- int_shift(int[j], aufl*(i-1))
  }
}
Intervall
I created an answer, I hope this is what you are looking for. If not, please comment:
library(lubridate)
begin <- as.POSIXct(rbind("2016-03-31 09:00:00","2016-04-12 09:00:00"))
# copy begin time for loop
begin_new <- begin
# create duration object
aufl <- duration(15, "mins")
# count times for loop
times <- 24*60/15
# create dataframe with begin time
Intervall <- data.frame(begin,stringsAsFactors = FALSE)
for (i in 1:times){
  cat("test", i, "\n")
  # save old time for interval calculation
  begin_start <- begin_new
  # add 15 minutes to the previous time
  begin_new <- begin_new + aufl
  cat(begin_new, "\n")
  # create an interval object between the two times
  new_dur <- interval(begin_start, begin_new)
  # bind to original dataframe
  Intervall <- cbind(Intervall, new_dur)
}
# Add column names
vec_names <- paste0("v",c(1:(times+1)))
colnames(Intervall) <- vec_names
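The stated goal is to count how often timed events fall inside these 15-minute slots; a hedged sketch of that step with lubridate's %within% (events and iv are constructed here purely for illustration):
events <- as.POSIXct(c("2016-03-31 09:05:00", "2016-03-31 09:20:00", "2016-04-12 09:10:00"))
iv <- interval(as.POSIXct("2016-03-31 09:00:00"), as.POSIXct("2016-03-31 09:15:00"))
sum(events %within% iv)  # 1 event falls into this slot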

eliminate loop from rowwise subsetting of data

I have two data sets, TEST and TRAIN. TEST is a subset of TRAIN. Using the columns "prod" and "clnt" I need to find all rows in TRAIN that correspond to each row of TEST (it is a one-to-many correspondence). Then I make a temporal analysis of the respective values of the column "order" (the first column, "week", is the time).
So I take a row of TEST and compare it with all rows of TRAIN, checking which of them contain the same combination of "prod" and "clnt", and record the respective values of "order" in TS. Usually I have zero to about ten values in TS per row of TEST. Then I do some calculations on TS (in this artificial case just mean(TS)) and record the result, together with the "id" of the TEST row, in a data set Subm.
The algorithm works, but because I have millions of rows in TRAIN and TEST, I need it as fast as possible, and especially to get rid of the loop, which is the slowest part. Probably I also messed up the data.frame declaration/usage, but I am not sure.
set.seed(42)
NumObsTrain=100000 # this can be as much as 70 000 000
NumObsTest=10000 # this can be as much as 6 000 000
#create the TRAIN data set
train1=floor(runif(NumObsTrain, min=0, max=NumObsTrain+1))
train1=matrix(train1,ncol = 2)
train=cbind(8,train1) #week
train=rbind(train,cbind(9,train1)) #week
train=cbind(train,runif(NumObsTrain,min=1,max=10)) #order
train=cbind(c(1:nrow(train)),train)# id number of each row
colnames(train)=c("id","week","prod","clnt","order")
train=as.data.frame(train)
train=train[sample(nrow(train)),] # reflush the rows of train
# Create the TEST dataset
test=train[1:NumObsTest,]
test[,"week"][1:{NumObsTest/2}]=10
test[,"week"][{(NumObsTest/2)+1}:NumObsTest]=11
TS=numeric(length = 10)
id=c(1:NumObsTest*2)
order=c(1:NumObsTest*2)
Subm=data.frame(id,order)
ptm <- proc.time()
# This is the loop
for (i in 1:NumObsTest){
  Subm$id[i] <- test$id[i]
  TS <- train$order[train$clnt == test$clnt[i] & train$prod == test$prod[i]]
  Subm$order[i] <- mean(TS)
}
proc.time() - ptm
The following will create a data.frame with all (prod, clnt) and order combinations, then group them by prod and clnt, and then take the mean of the order within each group. The final result is missing the id, and for some reason your final data.frame contains more rows than I would expect, which I cannot explain; the order results, however, are correct.
newtrain <- train[, 3:5]
newtest <- test[, c(1, 3:4)]
x <- dplyr::inner_join(newtest, newtrain)
y <- dplyr::group_by(x, prod, clnt)
z <- dplyr::summarise(y, mean(order))
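To also keep the test id (so the result lines up with Subm), a hedged sketch building on the join result x from above: since each test id has a single (prod, clnt) pair, grouping by id alone is enough.
z_id <- dplyr::summarise(dplyr::group_by(x, id), order = mean(order))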

Efficient way of splitting, applying a function and returning a data.frame with variable vector length

I am currently trying to use plyr + reshape2 to process my data, but it is taking a lot of time.
I have a dataframe (df) with 3 columns: network, user_id and date.
My goal is:
To split df at two levels (network and user_id);
apply a function (get_interval) to each split;
bind the results into another dataframe (df2).
get_interval returns a vector of the same length as the number of rows of its input.
Thus, df2 has the same number of rows as df, but with the results computed by get_interval.
The problem is that I cannot use ddply directly, since it only handles vectors of equal length and the results of the function have varying lengths.
I came up with this solution:
aux <- melt(dlply(df,.(network,user_id), get_interval))
df2 <- cbind(interval=aux$value,colsplit(aux$L1,"\\.",names=c("network","user_id")))
But it is very inefficient, and since df is quite big I waste hours every time I have to run it.
Is there a way of doing this more efficiently?
EDIT
The basic operation of get_interval is as follows:
get_interval <- function(df){
  if(nrow(df) < 2)
    return(NA)
  x <- c(NA, df$date[-1] - df$date[-nrow(df)])
  return(x) ## ceiling won't work because some intervals are 0.
}
It is possible to generate this data artificially with:
n <- 1000000
ref_time <- as.POSIXct("2013-12-17 00:00:00")
interval_range <- 86400*10 # 10 days
df <- data.frame(user_id = floor(runif(n, 1, n/10)),
                 network = gl(2, n, labels = c("anet", "unet")),
                 date = as.POSIXct(ref_time - runif(n, 0, interval_range)))
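Not a verified answer, but one possible direction (a sketch, assuming df has the columns network, user_id and date described above) is to compute the lagged differences per group directly with dplyr, which avoids the dlply/melt/colsplit round trip:
library(dplyr)
df2 <- df %>%
  group_by(network, user_id) %>%
  mutate(interval = c(NA, diff(as.numeric(date)))) %>%  # differences in seconds, NA for the first row of each group
  ungroup()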

R: Split-Apply-Combine... Apply Functions via Aggregate to Row-Bound Data Frames Subset by Class

Update: My NOAA GHCN-Daily weather station data functions have since been cleaned and merged into the rnoaa package, available on CRAN or here: https://github.com/ropensci/rnoaa
I'm designing an R function to calculate statistics across a data set comprised of multiple data frames. In short, I want to pull data frames by class based on a reference data frame containing the names. I then want to apply statistical functions to values for the metrics listed for each given day. In effect, I want to call and then overlay a list of data frames to calculate functions on a vector of values for every unique date and metric where values are not NA.
The data frames are iteratively read into the workspace from file based on a class variable, using the 'by' function. After importing the files for a given class, I want to rbind() the data frames for that class and each user-defined metric within a range of years. I then want to apply a set of user-provided statistical functions to each metric within a class for a given year, month, and day (i.e., the mean [function] low temperature [metric] on July 1st, 1990 [date] reported across all locations [data frames] within a given region [class]).
I want the end result to be new data frames containing values for every date within a region and a year range, for each metric and statistical function applied. I am very close to having this result using the aggregate() function, but I am having trouble getting reasonable results out of it: it currently outputs NA and NaN for most functions other than the mean temperature. Any advice would be much appreciated! Here is my code thus far:
# Example parameters
w <- c("mean","sd","scale") # Statistical functions to apply
x <- "C:/Data/" # Folder location of CSV files
y <- c("MaxTemp","AvgTemp","MinTemp") # Metrics to subset the data
z <- c(1970:2000) # Year range to subset the data
CSVstnClass <- data.frame(CSVstations, CSVclasses)
by(CSVstnClass, CSVstnClass[,2], function(a){ # Station list by class
  suppressWarnings(assign(paste(a[,2]), paste(a[,1]), envir = .GlobalEnv))
  apply(a, 1, function(b){ # Data frame list, row-wise
    classData <- data.frame()
    sapply(y, function(d){ # Element list
      CSV_DF <- read.csv(paste(x, b[2], "/", b[1], ".csv", sep = "")) # Read in CSV files as data frames
      CSV_DF1 <- CSV_DF[!is.na("Value")] # note: this tests the string "Value", not the column, so nothing is actually filtered here
      CSV_DF2 <- CSV_DF1[which(CSV_DF1$Year %in% z & CSV_DF1$Element == d),]
      assign(paste(b[2], "_", d, sep = ""), CSV_DF2, envir = .GlobalEnv)
      if(nrow(CSV_DF2) > 0){ # Remove empty data frames
        classData <<- rbind(classData, CSV_DF2) # Bind all data frames by row for a class and element
        assign(paste(b[2], "_", d, "_bound", sep = ""), classData, envir = .GlobalEnv)
        sapply(w, function(g){ # Function list
          # Aggregate results of bound data frame for each unique date
          dataFunc <- aggregate(Value ~ Year + Month + Day + Element, data = classData, FUN = g, na.action = na.pass)
          assign(paste(b[2], "_", d, "_", g, sep = ""), dataFunc, envir = .GlobalEnv)
        })
      }
    })
  })
})
I think I am pretty close, but I am not sure if rbind() is performing properly, nor why the aggregate() function is outputting NA and NaN for so many metrics. I was concerned that the data frames were not being bound together or that missing values were not being handled well by some of the statistical functions. Thank you in advance for any advice you can offer.
Cheers,
Adam
You've tackled this problem in a way that makes it very hard to debug. I'd recommend switching things around so you can more easily check each step. (Using informative variable names also helps!) The code is unlikely to work as is, but it should be much easier to work iteratively, checking that each step has succeeded before continuing to the next.
paths <- dir("C:/Data/", pattern = "\\.csv$", full.names = TRUE)
# Read in CSV files as data frames
raw <- lapply(paths, read.csv, stringsAsFactors = FALSE)
# Extract needed rows
filter_metrics <- c("MaxTemp", "AvgTemp", "MinTemp")
filter_years <- 1970:2000
filtered <- lapply(raw, subset,
!is.na(Value) & Year %in% filter_years & Element %in% filter_metrics)
# Drop any empty data frames
rows <- vapply(filtered, nrow, integer(1))
filtered <- filtered[rows > 0]
# Compute aggregates
my_aggregate <- function(df, fun) {
  aggregate(Value ~ Year + Month + Day + Element, data = df, FUN = fun,
            na.action = na.pass)
}
means <- lapply(filtered, my_aggregate, mean)
sds <- lapply(filtered, my_aggregate, sd)
scales <- lapply(filtered, my_aggregate, scale)
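If, as in the original goal, the per-class files should be stacked before aggregating (so each statistic runs across all locations for a date), a hedged continuation of the sketch above:
# stack the filtered per-file data frames, then aggregate across them
combined <- do.call(rbind, filtered)
combined_means <- my_aggregate(combined, mean)
combined_sds <- my_aggregate(combined, sd)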
