Interval search on a data frame - r

I have a data frame of timestamps and values; the timestamps are seconds since the epoch. Here is what the top few elements of that data frame look like:
val = seq(1,19)
ts = seq(1342980888,1342982000,by=60)
x = data.frame(ts = ts,val = val)
head(x)
ts val
1 1342980888 1
2 1342980948 2
3 1342981008 3
4 1342981068 4
5 1342981128 5
6 1342981188 6
I would like some kind of interval search function which takes a timestamp as input, say 1342980889 (the ts in the first row, +1), and returns 1,2 (the row numbers) as output. Basically, I want to find the two rows whose timestamps bracket the input timestamp 1342980889. While this is relatively easy to do using "which", I suspect "which" does a vector scan, and as the real data frame is quite large I want to do it using a binary search. Thanks much in advance.

You should use the findInterval function, which exploits the sorted order of x$ts (internally it uses a binary search rather than a full scan, so it addresses your concern about "which"). It will give you the index of the row where x$ts is immediately smaller than the value you are looking for (and you just have to add one to get the other index):
findInterval(1342980889, x$ts)
# [1] 1
Also note that the function is vectorized, i.e., the first argument can be a vector of values to look for:
findInterval(c(1342980889, 1342981483), x$ts)
# [1] 1 10
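If you want both bracketing row numbers in one call, here is a minimal sketch built on the same findInterval call (the bracket helper name is mine; it does no bounds checking, so lower is 0 when the input precedes the first timestamp and upper can run past nrow(x) at the other end):
bracket <- function(tstamp, ts) {
  i <- findInterval(tstamp, ts)  # index of the last ts <= tstamp
  cbind(lower = i, upper = i + 1)
}
bracket(1342980889, x$ts)
#      lower upper
# [1,]     1     2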

Related

R: locate element previous in vector within for loop and report in new column

I've looked through many older posts but nothing really hits the answer I need. In short: I have a data frame that contains observation data and the time of observation in days.
My goal is to add a column for weeks. I have already subsetted the data so that I only have the time vector at intervals of 7 (t == 7, 14, 21, etc.). I just need a for loop that creates a new vector of "weeks" that I can then cbind to my data. I'd prefer it to be a character string so I can use it more easily in ggplot's geom_histogram, but that isn't as important as just creating the new vector successfully.
The tricky part of the data is that there is not an equal number of observations per time: t == 28 has maybe 5x as many observations as t == 7, etc.
I want code that evaluates each t and checks whether it is greater than the previous element in the t vector. If it isn't, populate the week vector with the last value used; if it is, increase the value by 1.
I know this is bad from a, like, computer science/R perspective in a lot of ways, but any help would be useful:
#fake data (in reality this is a huge data set with many observations at intervals of 1 for t)
L = rnorm(50, mean=10, sd=2)
t = c((rep.int(7,3)), (rep.int(14,6)), rep.int(21,8), rep.int(28,12), (rep.int(31, 5)), (rep.int(36,16)))
fake = cbind(L,t)
#create df that has only the observations that are at weekly time points
dayofweek = seq(7,120,7)
df = subset(fake, t %in% dayofweek)
#create week vector, seeded with the first week
week = c(1)
#for loop with if-else statement nested to populate the week vector
#(tt is the time column of the weekly subset)
tt = df[, "t"]
for (i in 2:length(tt)) {
  if (tt[i] == tt[i-1]) {
    week[i] = week[i-1]
  } else if (tt[i] > tt[i-1]) {
    week[i] = week[i-1] + 1
  }
}
Thanks!!
I'm not sure I can follow what you want to do. If you want to determine which week the data fall within, why not:
set.seed(1)
L = rnorm(50, mean=10, sd=2)
...
fake <- data.frame(L=L, t=t)
fake$week <- floor(fake$t/7) + 1 # drop the +1 if you want t==7 to be week 1
head(fake)
# L t week
# 1 8.747092 7 2
# 2 10.367287 7 2
# 3 8.328743 7 2
# 4 13.190562 14 3
# 5 10.659016 14 3
# 6 8.359063 14 3
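If you want the week as a character string for ggplot2's geom_histogram, as the question mentions, one follow-on sketch (the week_chr column name is mine):
fake$week_chr <- paste0("week ", fake$week)
head(fake$week_chr)
# [1] "week 2" "week 2" "week 2" "week 3" "week 3" "week 3"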

group data by tolerance via index list

I don't know how to explain this briefly, but I'll try my best.
I have the following example data:
Data<-data.frame(A=c(1,2,3,5,8,9,10),B=c(5.3,9.2,5,8,10,9.5,4),C=c(1:7))
and an index
Ind<-data.frame(I=c(5,6,2,4,1,3,7))
The values in Ind correspond to the C column in Data. Now I want to start with the first Ind value and find the corresponding row in the Data data.frame (column C). From that row I want to go up and down and find values in column A that are within a tolerance range of 1. I want to write those values into a result data frame, add a group id column, and delete them from the data frame Data (where I found them). Then I start with the next entry in the index data frame Ind, and so on until the data frame Data is empty. I know how to match my Ind with column C of my Data, and how to do the writing, deleting and other bookkeeping in a for loop, but I don't know the main point, which is my question here:
when I have found my row in Data, how can I look up fitting values of column A within the tolerance range above and below that entry to get my group id?
what I want to get is this result:
A B C Group
1 5.3 1 2
2 9.2 2 2
3 5 3 2
5 8 4 3
8 10 5 1
9 9.5 6 1
10 4 7 4
Maybe somebody could help me with the critical point in my question, or even show how to solve this issue in a fast way.
Many thanks!
Generally: avoid deleting or growing a data frame row by row inside a loop. R's memory management means that every time you add or delete a row, another copy of the data frame is made. Garbage collection will eventually discard the "old" copies of the data frame, but garbage can quickly accumulate and reduce performance. Instead, add a logical column to the Data data frame, and set "extracted" rows to TRUE. So like this:
Data$extracted <- rep(FALSE,nrow(Data))
As for your problem: I get a different set of grouping numbers, but the groups are identical.
There might be a more elegant way to do this, but this will get it done.
# store results in a separate list
res <- list()
group.counter <- 1
# specify the tolerance here.
mytol <- 1
# loop until they're all done.
for (idx in Ind$I) {
  # skip this iteration if idx is NA.
  if (is.na(idx)) {
    next
  }
  # the next line only works for integer compare.
  # also not covered: what if multiple values of C match idx?
  # do we loop over each corresponding value of A,
  # i.e. loop over each value of 'target'?
  target <- Data$A[Data$C == idx]
  # dat.rows is a logical vector which shows the rows where
  # "A" meets the tolerance requirement, using the magic of
  # vectorized logical compare.
  dat.rows <-
    ((Data$A - target) >= -mytol) &
    ((Data$A - target) <=  mytol) &
    (!Data$extracted)
  # if dat.rows is all FALSE, then nothing met the criteria;
  # skip the rest of the loop.
  if (!any(dat.rows)) {
    next
  }
  # copy the rows to the result list.
  res[[length(res) + 1]] <- data.frame(
    A = Data[dat.rows, "A"],
    B = Data[dat.rows, "B"],
    C = Data[dat.rows, "C"],
    Group = group.counter  # recycled to match the length of A, B, C.
  )
  # flag the extraction.
  Data$extracted[dat.rows] <- TRUE
  # increment the group counter.
  group.counter <- group.counter + 1
}
# now make a data.frame from the results.
# this is the last step in how we avoid
# "growing" a data.frame inside a loop.
resData <- do.call(rbind, res)
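For reference, with the example Data and Ind above, resData should come out like this (the group numbers differ from the question's expected output, but the groupings themselves match):
resData
#    A    B C Group
# 1  8 10.0 5     1
# 2  9  9.5 6     1
# 3 10  4.0 7     2
# 4  1  5.3 1     3
# 5  2  9.2 2     3
# 6  3  5.0 3     3
# 7  5  8.0 4     4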

Match each row in a table to a row in another table based on the difference between row timestamps

I have two unevenly-spaced time series that each measure separate attributes of the same system. The two series' data points are not sampled at the same times, and the series are not the same length. I would like to match each row from series A to the row of B that is closest to it in time. What I have in mind is to add a column to A that contains indexes to the closest row in B. Both series have a time column measured in Unix time (e.g. 1459719755).
for example, given two datasets
a time
2 1459719755
4 1459719772
3 1459719773
b time
45 1459719756
2 1459719763
13 1459719766
22 1459719774
The first dataset should be updated to
a time index
2 1459719755 1
4 1459719772 4
3 1459719773 4
since B[1,]$time has the closest value to A[1,]$time, B[4,]$time has the closest value to A[2,]$time and A[3,]$time.
Is there any convenient way to do this?
Try something like this:
(1+ecdf(bdat$time)(adat$time)*nrow(bdat))
[1] 1 4 4
Why should this work? The ecdf function returns another function that has a value from 0 to 1. It returns the "position" in the "probability range" [0,1] of a new value in a distribution of values defined by the first argument to ecdf. The expression is really just rescaling that function's result to the range [1, nrow(bdat)]. (I think it's flipping elegant.)
Another approach would be to use approxfun on the sorted values of bdat$time, which would then get you interpolated values. These might need to be rounded; using them as indices would instead truncate to integer.
apf <- approxfun( x=sort(bdat$time), y=seq(length( bdat$time)) ,rule=2)
apf( adat$time)
#[1] 1.000 3.750 3.875
round( apf( adat$time))
#[1] 1 4 4
In both cases you are predicting a sorted value from its "order statistic". In the second case you should check that ties are handled in the manner you desire.
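Since the question asks for the closest row in either direction, here is a sketch that combines findInterval (see the first question on this page) with an explicit two-candidate comparison; nearest_idx is my name for it, bdat$time must be sorted ascending, and ties go to the lower index:
nearest_idx <- function(a, b) {
  lo <- findInterval(a, b)       # index of the last b <= a (0 if none)
  lo <- pmax(lo, 1)              # clamp to a valid index
  hi <- pmin(lo + 1, length(b))  # the candidate just above
  ifelse(abs(b[hi] - a) < abs(b[lo] - a), hi, lo)
}
nearest_idx(adat$time, bdat$time)
# [1] 1 4 4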

How to select specific elements and find their index in a data.frame?

I would like to select specific elements of a data.list after processing it.
To explain the process parameters, I describe my problem in a reproducible example.
In the example code below, I have three datasets in data.list, each with 5 columns.
Each dataset repeats itself three times (the Mx, My and Mz values), and each is assigned a unique number called set_nbr which identifies these datasets.
#to create reproducible data (this part creates three sets of data, each one repeating the Mx, My and Mz values 3 times, along with set_nbr)
set.seed(1)
data.list <- lapply(1:3, function(x) {
nrep <- 3
time <- rep(seq(90,54000,length.out=600),times=nrep)
Mx <- c(replicate(nrep,sort(runif(600,-0.014,0.012),decreasing=TRUE)))
My <- c(replicate(nrep,sort(runif(600,-0.02,0.02),decreasing=TRUE)))
Mz <- c(replicate(nrep,sort(runif(600,-1,1),decreasing=TRUE)))
df <- data.frame(time,Mx,My,Mz,set_nbr=x)
})
After applying some function, I have output like this:
result
time Mz set_nbr
1 27810 -1.917835e-03 1
2 28980 -1.344288e-03 1
3 28350 -3.426615e-05 1
4 27900 -9.934413e-04 1
5 25560 -1.016492e-02 2
6 27360 -4.790767e-03 2
7 28080 -7.062256e-04 2
8 26550 -1.171716e-04 2
9 26820 -2.495893e-03 3
10 26550 -7.397865e-03 3
11 26550 -2.574022e-03 3
12 27990 -1.575412e-02 3
My questions start from here.
1) How do I get the min, middle and max values of the time column for each set_nbr?
2) How do I use the evaluated set_nbr and Mz values inside data.list?
In short:
After deciding on the min, middle and max values of the time column and the corresponding Mz values for each set_nbr in result, I want to go back to the original data.list and extract the Mx, My, Mz columns according to those set_nbr and Mz values. Since each set_nbr actually corresponds to 600 rows, I would like to extract each identified set_nbr family from data.list.
We use time as a "factor" to select set_nbr. Here "factor" means a selection criterion, not a factor in the R sense.
In addition, as you will see, four rows exist for each set_nbr, but they indeed address different datasets in data.list.
I'm a big advocate of using lists of data frames when appropriate, but in this case it doesn't look like there's any reason to keep them separated as different list items. Let's combine them into a single data frame.
library(dplyr)
dat = bind_rows(data.list)
Then getting your summary stats is easy:
dat %>% group_by(set_nbr) %>%
summarize(min_time = min(time),
max_time = max(time),
middle_time = median(time))
# Source: local data frame [3 x 4]
#
# set_nbr min_time max_time middle_time
# 1 1 90 54000 27045
# 2 2 90 54000 27045
# 3 3 90 54000 27045
In your sample data, time is defined the same way each time, so of course the min, median, and max are all the same.
I'd suggest, in the new question you ask about plotting, starting with the combined data frame dat.
As to your second question:
2) How do I select evaluated set_nbr values inside data.list?
To select a single item from a list, use double brackets:
data.list[[2]]
However, with the combined data, it's just a normal column of a normal data frame so any of these will work:
dat[dat$set_nbr == 2, ]
subset(dat, set_nbr == 2)
filter(dat, set_nbr == 2)
To your clarification in comments, if you want the Mx and My values for the time and set_nbr in the results object, using my combined dat above, simply do a join: left_join(results, dat).
This should work, but I'm a little confused because in your simulated data time is numeric, but in your new text you say "we use time as a factor". If you've converted time to a factor object, this will only work if it has the same levels in each of the data frames in your data list. If not, I would recommend keeping time as numeric.
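A minimal sketch of that join, assuming the result object printed in the question and the combined dat from above; joining on time and set_nbr alone side-steps floating-point matching on Mz (my judgment call, not part of the original answer):
library(dplyr)
matched <- left_join(result, dat, by = c("time", "set_nbr"))
# each row of result matches the three replicates in dat that share
# its time value, so matched has 3 rows per row of result
nrow(matched)
# [1] 36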

Calculate sum of counts per min from data frame in R

I've been trying to figure this out for a while, but haven't been able to do so. I found a lot of similar questions which didn't help at all.
I have around 43000 records in data frame in R. The date column is in the format "2011-11-15 02:00:01", and the other column is the count. The structure of the data frame:
str(results)
'data.frame': 43070 obs. of 2 variables:
$ dates: Factor w/ 43070 levels "2011-11-15 02:00:01",..: 1 2 3 4 5 6 7 8 9 10 ...
$ count: num 1 2 1 1 1 1 2 3 1 2 ...
How can I get the total count per min?
And I also want to convert the results data frame into json. I used the rjson package, which converted the entire data frame into a single json element. When I inserted it into mongodb, there was only one _id for all 43000 records. What did I do wrong?
You can use the xts package to get the counts/minute quite easily.
install.packages("xts")
require("xts")
results_xts <- xts(results$count, order.by = as.POSIXlt(results$dates))
This converts your dataframe to an xts object. There are a bunch of functions (apply.daily, apply.yearly, etc.) in xts that apply functions over different time frames, but there isn't one for minutes. Fortunately the code for those functions is super simple, so just run:
ep <- endpoints(results_xts, "minutes")
period.apply(results_xts, ep, FUN = sum)
Sorry, I don't know the answer to your other question.
Asterisk here: this is untested, but here is my solution for getting the counts per minute. Maybe someone will chime in on the json part; I'm not familiar with that.
Here's my example time series and count:
now <- Sys.time()
tseq <- seq(now, length.out = 130, by = "sec")
count <- rep(1, 130)
We find the indices where our minutes switch via the following:
mins<-c(0,diff(floor(cumsum(c(0,diff(tseq)))/60)))
indxs<-which(mins%in%1)
Let me break that down (as there are many things nested in there).
First we diff over the time sequence, then add a 0 on the front because we lose an observation with diff.
Second, take the cumulative sum of the diff-ed vector, giving us the elapsed seconds in each spot (this could probably also be done with a simple format call over the vector of times).
Third, divide that vector, now the seconds in each spot, by 60 so we get a value in each spot corresponding to the minutes.
Fourth, floor it so we get integers.
Fifth, diff that vector so we get 0's except for 1's where the minute switches.
Sixth, add a 0 to that vector since we lose an observation with the diff.
Finally, get the indices of the 1's with the which call.
Then we find the starts and ends of our minutes:
startpoints<-indxs
endpoints<-c(indxs[2:length(indxs)], length(mins))
Then we simply sum over the corresponding subsets:
mapply(function(start, end) sum(count[start:end]), start=startpoints, end=endpoints)
#[1] 61 10
We get 61 for the first point because the first subset includes both of its boundary observations (the minute switch at each end), i.e. 61 seconds rather than 60.
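For comparison, a minimal base-R sketch of per-minute totals on the question's results data frame, assuming the dates factor parses with as.POSIXct (the minute helper column is my addition):
results$minute <- format(as.POSIXct(as.character(results$dates)), "%Y-%m-%d %H:%M")
per_min <- aggregate(count ~ minute, data = results, FUN = sum)
On the json part of the question, a hedged pointer: rjson::toJSON serializes a data frame column-wise, as a single object of arrays, which is why mongodb saw one _id for everything. jsonlite::toJSON(per_min) instead emits a json array with one object per row, which mongoimport --jsonArray will insert as separate documents.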
