How to sample a time series dataset at minute intervals - r

I have time series sensor data recorded at 30-second intervals, as follows:
Head:
temperature humidity light voltage time
1: 19.8071 37.61155 137.5400 2.69124 0
2: 19.7336 37.09330 71.7600 2.69964 30
3: 19.6160 37.57370 97.5200 2.69964 30
4: 19.7728 37.16200 143.5200 2.71196 60
5: 20.2040 36.88710 50.6000 2.69964 60
6: 19.0476 40.09450 110.4724 2.80151 90
It is a very large dataset with more than 2 billion records. I need to sample the data at 5-minute intervals to reduce its size.

We can try the modulo operator %%. As the initial dataset is a data.table, we can use data.table methods for efficiency:
DT[!time %% 300]
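For concreteness, here is a tiny runnable sketch (with made-up data, not the poster's 2-billion-row table) showing why the expression works: time %% 300 is 0 exactly on the 5-minute marks, and !0 is TRUE, so only those rows survive the filter.
library(data.table)
# Made-up 30-second sensor data for illustration
DT <- data.table(time = seq(0, 900, by = 30), voltage = rnorm(31))
# !(time %% 300) keeps only rows where time is an exact multiple of 300 s
DT[!time %% 300]
If averaging each 5-minute bucket is preferred to picking single rows, DT[, lapply(.SD, mean), by = .(bucket = time %/% 300)] is a common data.table idiom.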


Need a formula to calculate the average task completion time in days

A client gives 100 tasks to an employee.
The employee completes 50 tasks in 1 day,
20 tasks in 2 days,
15 tasks in 3 days,
4 tasks in 4 days,
5 tasks in 6 days,
6 tasks in 10 days.
Now I want to know, on average, how many days the employee takes to complete 1 task.
I need a formula for this query.
Assuming tasks are not completed in parallel (i.e. days are mutually exclusive with respect to completing/working on tasks), average days per task = 0.26:
=SUM(B2:B7)/SUM(A2:A7)
The solution could end there; however, here are a few checks/alternative approaches that confirm the result of the function above.
checks
check 1
The same value can be derived using a weighted-average calculation:
=SUM((B2:B7/A2:A7)*A2:A7)/SUM(A2:A7)
check 2
Intuitively, if each task takes ~0.26 days to complete and there are 100 tasks, then the total duration is ~26 days; summing column B gives just that:
=SUM(B2:B7)
check 3
If still unconvinced, you can calculate the average days per task for each category/type (i.e. for those that take 1, 2, 3, ..., 10 days to complete):
=B2:B7/A2:A7
Then expand these out using SEQUENCE (or another method):
=SEQUENCE(1,A2,G2,0)
Again, this yields 0.26, which confirms the result of the simple direct ratio.
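If you want to verify the arithmetic outside Excel, a quick sketch in R (numbers transcribed from the question) reproduces both the direct ratio and the weighted-average check:
tasks <- c(50, 20, 15, 4, 5, 6)
days  <- c(1, 2, 3, 4, 6, 10)
sum(days) / sum(tasks)              # 0.26, the direct ratio
weighted.mean(days / tasks, tasks)  # 0.26, the weighted-average check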
Ta

R: calculate closing time for chamber N2O flux measurements

I performed static N2O chamber measurements that I would now like to analyse using the gasfluxes package (https://cran.r-project.org/web/packages/gasfluxes/gasfluxes.pdf).
I measured different samples (POTS) over 10-minute intervals. Each sample was measured twice a day (SESSION: AM, PM) for 9 days. The N2O analyzer saved a concentration reading every second.
My data now looks like this:
DATE POT SESSION TIME Concentration
1: 2017-10-18T00:00:00Z O11 AM 10:16:00.746 0.3512232
2: 2017-10-18T00:00:00Z O11 AM 10:16:01.382 0.3498687
3: 2017-10-18T00:00:00Z O11 AM 10:16:02.124 0.3482681
4: 2017-10-18T00:00:00Z O11 AM 10:16:03.216 0.3459306
5: 2017-10-18T00:00:00Z O11 AM 10:16:04.009 0.3459124
6: 2017-10-18T00:00:00Z O11 AM 10:16:04.326 0.3456660
To use the package, I need to calculate closing times from the exact time (TIME) data points. The time column should look like this (table taken from the package PDF linked above):
serie V A time C
1: ID1 0.522625 1 0.0000000 0.3317823
2: ID1 0.522625 1 0.3333333 0.3304053
3: ID1 0.522625 1 0.6666667 0.3394311
4: ID1 0.522625 1 1.0000000 0.4469102
5: ID2 0.523625 1 0.0000000 0.4572708
How can I calculate this for each individual 10-minute measurement period for each pot? Basically, it should list the increasing number of seconds, as my machine measured concentration every second.
My idea is to group by "POT", "DATE" and "SESSION", which creates a unique identifier for one complete chamber measurement, and then loop over the groups.
I also learned that I should use "lubridate" since I'm working with times (https://data.library.virginia.edu/working-with-dates-and-time-in-r-using-the-lubridate-package/), but I still don't know how to calculate time durations for my case. Do I need to write a loop?
I tried something like this, but I always get error messages (see my former question, "R: Calculate measurement time-points for separate samples"):
df.HMR %>% group_by(DATE, Series, Session) %>%
mutate(dt=as.POSIXct(df.HMR$TIME,format="%H:%M:%S"), time_diff = dt-lag(dt))
Error message: Column dt must be length 838 (the group size) or one, not 379698
Can anyone help me or suggest another approach?
Any help is very welcome.
Many thanks!
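Not a definitive answer, but a sketch of one likely fix: the error arises because df.HMR$TIME refers to the whole column (379698 rows) rather than the current group, so referencing TIME bare inside mutate() should resolve it. The column names and the %H:%M:%OS format (for the fractional seconds shown) are assumptions taken from the question.
library(dplyr)
df.HMR %>%
  group_by(POT, DATE, SESSION) %>%
  mutate(dt   = as.POSIXct(TIME, format = "%H:%M:%OS"),
         time = as.numeric(dt - first(dt), units = "secs")) %>%  # seconds since chamber closing
  ungroup()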

computing and formatting averages and squares of time intervals

I have a model which predicts the duration of certain events, and measurements of the durations of those events. I want to compute the difference between Predicted and Measured, the mean difference, and the RMSE. I'm able to do it, but the formatting is really awkward and not what I expected:
database <- data.frame(Predicted = c(strptime(c("4:00", "3:35", "3:38"), format = "%H:%M")),
                       Measured = c(strptime(c("3:39", "3:40", "3:53"), format = "%H:%M")))
> database
            Predicted            Measured
1 2016-11-28 04:00:00 2016-11-28 03:39:00
2 2016-11-28 03:35:00 2016-11-28 03:40:00
3 2016-11-28 03:38:00 2016-11-28 03:53:00
This is the first weirdness: why does R show me a date as well as a time, even though I clearly specified a time-only format (%H:%M) and there was no date in my data to start with? It gets weirder:
database$Error <- with(database, Predicted-Measured)
database$Mean_Error <- with(database, mean(Predicted-Measured))
database$RMSE <- with(database, sqrt(mean(as.numeric(Predicted-Measured)^2)))
> database
Predicted Measured Error Mean_Error RMSE
1 2016-11-28 04:00:00 2016-11-28 03:39:00 21 mins 0.3333333 15.17674
2 2016-11-28 03:35:00 2016-11-28 03:40:00 -5 mins 0.3333333 15.17674
3 2016-11-28 03:38:00 2016-11-28 03:53:00 -15 mins 0.3333333 15.17674
Why is the variable Error expressed in minutes? For Error it's not a bad choice, but it becomes quite hard to read for Mean_Error. For RMSE it's even worse, though this could be due to the as.numeric function: if I remove it, R complains that '^' is not defined for "difftime" objects. My questions are:
Is it possible to show the first two columns (Predicted and Measured) in %H:%M format?
For the other three columns (Error, Mean_Error and RMSE), I would like to compare a %M:%S format with a seconds-only format and choose between the two. Is that possible?
EDIT: just to be clearer, my goal is to insert observations of time intervals into a data frame and compute a vector of time-interval differences, then compute some statistics for that vector: mean, RMSE, etc. I know I could just enter the time observations in seconds, but that doesn't look very good: it's difficult to tell that 13200 seconds is 3 hours and 40 minutes. So I would like to store the time intervals in %H:%M format, yet still be able to manipulate them algebraically and show the results in a format of my choosing. Is that possible?
We can use difftime to specify the units for the difference in time. The output of difftime is an object of class difftime, and when that object is coerced to numeric with as.numeric, we can change its units (see the examples in ?difftime). As for the unwanted dates: strptime() always returns a full date-time, filling in the current date when none is supplied, which is why they appeared.
## Note we don't convert to date-time because we just want %H:%M
database <- data.frame(Predicted = c("4:00", "3:35", "3:38"),
                       Measured = c("3:39", "3:40", "3:53"))
## We now convert to date-time and use difftime to compute the difference in minutes
database$Error <- with(database, difftime(strptime(Predicted, format = "%H:%M"),
                                          strptime(Measured, format = "%H:%M"),
                                          units = "mins"))
## Use as.numeric to change units to seconds
database$Mean_Error <- with(database, mean(as.numeric(Error, units = "secs")))
database$RMSE <- with(database, sqrt(mean(as.numeric(Error, units = "secs")^2)))
## Predicted Measured Error Mean_Error RMSE
##1 4:00 3:39 21 mins 20 910.6042
##2 3:35 3:40 -5 mins 20 910.6042
##3 3:38 3:53 -15 mins 20 910.6042
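To display the error columns as minutes:seconds (the second part of the question), one option (a sketch, not part of the original answer) is a small formatting helper:
# Hypothetical helper: render a number of seconds as (signed) MM:SS
fmt_ms <- function(secs) {
  sgn  <- ifelse(secs < 0, "-", "")
  secs <- abs(round(secs))
  sprintf("%s%02d:%02d", sgn, secs %/% 60, secs %% 60)
}
fmt_ms(as.numeric(database$Error, units = "secs"))  # "21:00" "-05:00" "-15:00"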

create an index for aggregating daily data to match periodic data

I have daily measurements prec.d and periodic measurements prec.p. The periodic measurements (3-12 days apart) are roughly the sum of the daily measurements between their start and end dates, and I need to compare prec in the two data frames. So far I have manually created an index week that represents the time span of each periodic measurement, but it would be great to create week in a reproducible fashion.
data.frame prec.d
day week prec
6/20/2013 1 0
6/21/2013 1 0
6/22/2013 1 0
6/23/2013 1 0
6/24/2013 1 41.402
6/25/2013 1 2.794
6/26/2013 1 6.096
6/27/2013 2 0.508
6/28/2013 2 0
6/29/2013 2 0
6/30/2013 2 2.54
7/1/2013 2 18.034
7/2/2013 2 4.064
And data.frame prec.p
start end week prec1 prec2 prec3
6/20/2013 6/26/2013 1 50.28 31.78042615 42.76461716
6/27/2013 7/2/2013 2 25.1 15.70964247 20.49507586
I would like to create the week field automatically, spanning from start to end in prec.p. Then I can aggregate by week to make prec in the two data frames comparable.
Introduce a YYYYWW field in both data frames, where WW stands for the week number; that will give you a common index. For example:
x <- as.Date(runif(100) * 100, origin = "1970-01-01")
yyyyww <- strftime(x, format = "%Y%U")
yyyyww
Or take a look at the quantmod package; if I remember correctly, it has functions for time-frame conversion.
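Since the periods in the question are irregular (3-12 days) rather than calendar weeks, a findInterval()-based sketch may fit better. It assumes the column names from the question and that the dates parse with format %m/%d/%Y:
prec.d$day   <- as.Date(prec.d$day,   format = "%m/%d/%Y")
prec.p$start <- as.Date(prec.p$start, format = "%m/%d/%Y")
# Each daily date is assigned the index of the periodic interval it falls in
prec.d$week  <- findInterval(prec.d$day, prec.p$start)
prec.p$week  <- seq_len(nrow(prec.p))
# Then the daily data can be summed per period for comparison
aggregate(prec ~ week, data = prec.d, sum)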

How to optimize for loops in extremely large dataframe

I have a dataframe x with 5.9 million rows and three columns: idnumber (integer), compdate (integer) and judge (character), representing individual cases completed in an administrative court. The data were imported from a Stata dataset, and the date field came in as an integer, which is fine for my purposes. I want to create a caseload variable by counting the number of cases completed by the same judge within the 30-day window ending on the completion date of the case at issue.
Here are the first 34 rows of data:
idnumber compdate judge
1 9615 JVC
2 15316 BAN
3 15887 WLA
4 11968 WFN
5 15001 CLR
6 13914 IEB
7 14760 HSD
8 11063 RJD
9 10948 PPL
10 16502 BAN
11 15391 WCP
12 14587 LRD
13 10672 RTG
14 11864 JCW
15 15071 GMR
16 15082 PAM
17 11697 DLK
18 10660 ADP
19 13284 ECC
20 13052 JWR
21 15987 MAK
22 10105 HEA
23 14298 CLR
24 18154 MMT
25 10392 HEA
26 10157 ERH
27 9188 RBR
28 12173 JCW
29 10234 PAR
30 10437 ADP
31 11347 RDW
32 14032 JTZ
33 11876 AMC
34 11470 AMC
Here's what I came up with. For each record, I take a subset of the data for that particular judge, subset again to the cases decided in the 30-day window, and then assign the number of rows in that subset to the caseload variable for the subject case:
for (i in 1:length(x$idnumber)) {
  e <- x$compdate[i]
  f <- e - 29
  a <- x[x$judge == x$judge[i] & !is.na(x$compdate), ]
  b <- a[a$compdate <= e & a$compdate >= f, ]
  x$caseload[i] <- length(b$idnumber)
}
It works, but it takes extremely long to complete. How can I optimize this or do it more simply? Sorry, I'm very new to R and to programming -- I'm a law professor trying to analyze court data. Your help is appreciated. Thanks.
Ken
You don't have to loop through every row. You can do operations on the entire column at once. First, create some data:
# Create some data.
n <- 6e6  # cases
judges <- apply(combn(LETTERS, 3), 2, paste0, collapse = '')  # about 2600 judges
set.seed(1)
x <- data.frame(idnumber = 1:n,
                judge    = sample(judges, n, replace = TRUE),
                compdate = Sys.Date() + round(runif(n, 1, 120)))
Now, you can make a rolling window function, and run it on each judge.
# Sort
x <- x[order(x$judge, x$compdate), ]
# Create a little rolling window function.
rolling.window <- function(y, window = 30) seq_along(y) - findInterval(y - window, y)
# Run the little function on each judge.
x$workload <- unlist(by(x$compdate, x$judge, rolling.window))
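To see what the helper returns, here is a tiny worked example with invented dates: findInterval(y - 30, y) counts how many dates fall more than 30 days before each date, so subtracting it from the date's position gives the count of cases inside the window.
y <- c(1, 5, 40, 41, 70)                 # sorted completion dates for one judge
seq_along(y) - findInterval(y - 30, y)   # 1 2 1 2 2 cases in each 30-day window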
I don't have much experience with rolling calculations, but two suggestions:
Calculate this per day, not per case (since the count will be the same for all cases completed on the same day).
Calculate a cumulative sum of the number of cases, then take the difference between the current value of this sum and its value 31 days ago (or min{daysAgo : daysAgo > 30}, since cases are not resolved every day).
It's probably fastest to use a data.table. This is my attempt, using @nograpes' simulated data. Comments start with #.
require(data.table)
DT <- data.table(x)
DT[, compdate := as.integer(compdate)]
setkey(DT, judge, compdate)
# count cases for each day
ldt <- DT[, .N, by = 'judge,compdate']
# cumulative sum of counts
ldt[, nrun := cumsum(N), by = judge]
# see how far to look back
ldt[, lookbk := sapply(1:.N, function(i) {
  z <- compdate[i] - compdate[i:1]
  older <- which(z > 30)
  if (length(older)) min(older) - 1L else NA_integer_
}), by = judge]
# compute cumsum(today) - cumsum(more than 30 days ago)
ldt[, wload := sapply(1:.N, function(i)
  nrun[i] - ifelse(is.na(lookbk[i]), 0, nrun[i - lookbk[i]])
)]
On my laptop, this takes under a minute. Run this command to see the output for one judge:
print(ldt['XYZ'],nrow=120)
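The counts in ldt are per judge-day; if you want them back on the original case-level table (an extra step, not in the original answer), a data.table update join does it:
# Join the per-day workload back onto the case-level table as caseload
DT[ldt, caseload := i.wload, on = c("judge", "compdate")]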
