Set values to NA in a data frame if Date is outside a given interval

I have two data frames, df1 and df2.
df1 contains values for different products X1, X2, and so on at different times. df2 contains the true start and end dates for some of the products. I want to replace the values that fall outside the date intervals given in df2 with NA, as shown in the final table df3.
Create df1 and df2:
df1 <- data.frame(matrix(NA, 10, 6))
df1[, 1] <- seq(as.Date("2012-01-01"), as.Date("2012-10-01"), by = "1 month")
df1[, 2] <- 1:10; df1[, 3] <- 12:21; df1[, 4] <- 0.5:9.5; df1[, 5] <- 5:14; df1[, 6] <- 10:19
colnames(df1) <- c("Date", "X1", "X2", "X3", "X4", "X5")
df2 <- data.frame(matrix(c("X1", "X2", "X4",
                           "2012-02-01", "2012-04-01", "2012-06-01",
                           "2012-09-01", "2012-06-01", "2012-10-01"), 3, 3))
colnames(df2) <- c("Name", "Start", "End")
Output:
> df1
Date X1 X2 X3 X4 X5
1 2012-01-01 1 12 0.5 5 10
2 2012-02-01 2 13 1.5 6 11
3 2012-03-01 3 14 2.5 7 12
4 2012-04-01 4 15 3.5 8 13
5 2012-05-01 5 16 4.5 9 14
6 2012-06-01 6 17 5.5 10 15
7 2012-07-01 7 18 6.5 11 16
8 2012-08-01 8 19 7.5 12 17
9 2012-09-01 9 20 8.5 13 18
10 2012-10-01 10 21 9.5 14 19
> df2
Name Start End
1 X1 2012-02-01 2012-09-01
2 X2 2012-04-01 2012-06-01
3 X4 2012-06-01 2012-10-01
Final output should look like this:
df3
Date X1 X2 X3 X4 X5
1 2012-01-01 NA NA 0.5 NA 10
2 2012-02-01 2 NA 1.5 NA 11
3 2012-03-01 3 NA 2.5 NA 12
4 2012-04-01 4 15 3.5 NA 13
5 2012-05-01 5 16 4.5 NA 14
6 2012-06-01 6 17 5.5 10 15
7 2012-07-01 7 NA 6.5 11 16
8 2012-08-01 8 NA 7.5 12 17
9 2012-09-01 9 NA 8.5 13 18
10 2012-10-01 NA NA 9.5 14 19

Using dplyr and tidyr...
library(tidyr)
library(dplyr)
df3 <- df1 %>%
  gather(key = Name, value = value, -Date) %>%     # convert to long form
  left_join(df2) %>%                               # merge in date limits
  mutate(ind = as.Date(Date) >= as.Date(Start) &
               as.Date(Date) <= as.Date(End)) %>%  # check validity
  mutate(value = replace(value, !ind, NA)) %>%     # replace invalid with NA
  select(Date, Name, value) %>%                    # drop helper variables
  spread(key = Name, value = value)                # convert back to wide form
df3
Date X1 X2 X3 X4 X5
1 2012-01-01 NA NA 0.5 NA 10
2 2012-02-01 2 NA 1.5 NA 11
3 2012-03-01 3 NA 2.5 NA 12
4 2012-04-01 4 15 3.5 NA 13
5 2012-05-01 5 16 4.5 NA 14
6 2012-06-01 6 17 5.5 10 15
7 2012-07-01 7 NA 6.5 11 16
8 2012-08-01 8 NA 7.5 12 17
9 2012-09-01 9 NA 8.5 13 18
10 2012-10-01 NA NA 9.5 14 19
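For what it's worth, gather() and spread() have since been superseded in tidyr; assuming tidyr >= 1.0 is available, the same pipeline can be sketched with pivot_longer()/pivot_wider(). The replace() trick still quietly skips products missing from df2, because NA logical subscripts are ignored when the replacement has length one:
library(dplyr)
library(tidyr)
df3 <- df1 %>%
  pivot_longer(-Date, names_to = "Name") %>%
  left_join(df2, by = "Name") %>%
  mutate(value = replace(value,
                         !(Date >= as.Date(Start) & Date <= as.Date(End)),
                         NA)) %>%
  select(Date, Name, value) %>%
  pivot_wider(names_from = Name, values_from = value)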

I am sure there is a more elegant way, but you could create a matrix of indices that flags, for each product, whether each date lies within that product's interval. Since the values are numeric, you can then multiply your data frame by a matrix that is 1 where the flag is TRUE and NA where it is FALSE. Note this assumes every product column appears in df2; the example below uses the three-product data (X1, X2, X3) from the data.table answer further down.
Example:
library(dplyr)
## Convert your dates to Date-objects:
df2 <- df2 %>% dplyr::mutate(Start = as.Date(Start), End = as.Date(End))
## Create a matrix of indices (TRUE if the date is inside the interval):
indMx <- lapply(names(df1)[-1], function(product) {
  (df1$Date >= df2$Start[df2$Name == product]) &
    (df1$Date <= df2$End[df2$Name == product])
}) %>% do.call('cbind', .)
## Multiply with NA^(!indMx), which gives you 1 in place of TRUE (inside
## the interval) and NA in place of FALSE (outside):
df1[,-1] <- df1[,-1] * NA^(!indMx)
df1
# Date X1 X2 X3
# 1 2012-01-01 NA NA NA
# 2 2012-02-01 2 NA NA
# 3 2012-03-01 3 NA NA
# 4 2012-04-01 4 15 NA
# 5 2012-05-01 5 16 NA
# 6 2012-06-01 6 17 5.5
# 7 2012-07-01 7 NA 6.5
# 8 2012-08-01 8 NA 7.5
# 9 2012-09-01 9 NA 8.5
# 10 2012-10-01 NA NA 9.5
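The trick relies on R evaluating NA^0 as 1 and NA^1 as NA, so the negated logical matrix becomes exactly the multiplier we need:
NA^c(FALSE, TRUE)
# [1]  1 NA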

Here is one solution with data.table. There might be a more elegant method using non-equi joins.
for (i in seq_len(nrow(df2))) df1[!(Date %between% df2[i, .(Start, End)]), df2[i, Name] := NA]
Here, you run through each row of df2, subset df1 based on dates outside of the start and end dates in the current row of df2, and then assign NA to the variable given in df2.
This returns
df1
Date X1 X2 X3
1: 2012-01-01 NA NA NA
2: 2012-02-01 2 NA NA
3: 2012-03-01 3 NA NA
4: 2012-04-01 4 15 NA
5: 2012-05-01 5 16 NA
6: 2012-06-01 6 17 5.5
7: 2012-07-01 7 NA 6.5
8: 2012-08-01 8 NA 7.5
9: 2012-09-01 9 NA 8.5
10: 2012-10-01 NA NA 9.5
update
If the data is constructed as in the updated original post, first run the lines below to convert the data.frames to data.tables and the Name variable in df2 to a character vector (it starts out as a factor). The code above will then work for the new dataset.
# convert data.frames to data.tables
setDT(df1)
setDT(df2)
# convert factor to character
df2[, Name := as.character(Name)]
data
library(data.table)
# read in data
df1 <- fread("Date X1 X2 X3
2012-01-01 1 12 0.5
2012-02-01 2 13 1.5
2012-03-01 3 14 2.5
2012-04-01 4 15 3.5
2012-05-01 5 16 4.5
2012-06-01 6 17 5.5
2012-07-01 7 18 6.5
2012-08-01 8 19 7.5
2012-09-01 9 20 8.5
2012-10-01 10 21 9.5")
df2 <- fread(" Name Start End
X1 2012-02-01 2012-09-01
X2 2012-04-01 2012-06-01
X3 2012-06-01 2012-10-01")
# convert to date type
df1[, Date := as.Date(Date)]
df2[, c("Start", "End") := .(as.Date(Start), as.Date(End))]

Related

Find row of the next instance of the value in R

I have two columns Time and Event. There are two events A and B. Once an event A takes place, I want to find when the next event B occurs. Column Time_EventB is the desired output.
This is the data frame:
df <- data.frame(Event = sample(c("A", "B", ""), 20, replace = TRUE), Time = paste("t", seq(1,20)))
What is the code in R for finding the next instance of a value (B in this case)?
What is the code for once the instance of B is found, return the value of the corresponding Time Column?
The code should be something like this:
data$Time_EventB <- ifelse(data$Event == "A", <Code for returning time of next instance of B>, "")
In Excel this can be done using VLOOKUP.
Here's a simple solution:
set.seed(1)
df <- data.frame(Event = sample(c("A", "B", ""), size = 20, replace = TRUE), time = 1:20)
as <- which(df$Event == "A")
bs <- which(df$Event == "B")
next_b <- sapply(as, function(a) {
  diff <- bs - a
  if (all(diff < 0)) return(NA)
  bs[min(diff[diff > 0]) == diff]
})
df$next_b <- NA
df$next_b[as] <- df$time[next_b]
df$next_b <- NA
df$next_b[as] <- df$time[next_b]
> df
Event time next_b
1 A 1 2
2 B 2 NA
3 B 3 NA
4 4 NA
5 A 5 8
6 6 NA
7 7 NA
8 B 8 NA
9 B 9 NA
10 A 10 14
11 A 11 14
12 A 12 14
13 13 NA
14 B 14 NA
15 15 NA
16 B 16 NA
17 17 NA
18 18 NA
19 B 19 NA
20 20 NA
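A compact base-R variant of the same idea (a sketch; it relies on bs being sorted, which which() guarantees): findInterval() counts how many B positions are at or before each A position, so the next B is one index past that count.
next_b <- bs[findInterval(as, bs) + 1]  # NA when no later B exists
df$next_b <- NA
df$next_b[as] <- df$time[next_b]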
Here's an attempt using a "rolling join" from the data.table package:
library(data.table)
setDT(df)
df[Event=="B", .(time, nextb=time)][df, on="time", roll=-Inf][Event != "A", nextb := NA][]
# time nextb Event
# 1: 1 2 A
# 2: 2 NA B
# 3: 3 NA B
# 4: 4 NA
# 5: 5 8 A
# 6: 6 NA
# 7: 7 NA
# 8: 8 NA B
# 9: 9 NA B
#10: 10 14 A
#11: 11 14 A
#12: 12 14 A
#13: 13 NA
#14: 14 NA B
#15: 15 NA
#16: 16 NA B
#17: 17 NA
#18: 18 NA
#19: 19 NA B
#20: 20 NA
Using the data borrowed from @thc.
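For readers unpacking the chained expression, here is the same rolling join written out with named intermediates (same logic as the one-liner above):
library(data.table)
setDT(df)
# the B rows, keyed by time, carrying their own time along as "nextb"
b_rows <- df[Event == "B", .(time, nextb = time)]
# roll = -Inf rolls the next observation backwards (NOCB), so every row
# picks up the time of the first B at or after it
joined <- b_rows[df, on = "time", roll = -Inf]
# keep the rolled value only on A rows
joined[Event != "A", nextb := NA][]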

Dividing the data in multiple columns to 8 values logically in R

I have the following data. Each column from X1.07m to X11.82m represents a depth, and the values correspond to temperature. I want to reduce the dataset to 8 distinct water depths using an averaging method. For example, row 1 of my data has values from column X1.07m through X2.82m (X2.82m because all values beyond that point are NA). I would like to create a separate data frame with the datetime and 8 columns (layer1, layer2, layer3, layer4, layer5, layer6, layer7, layer8). The layer1 value should start from 1.07m, and layer8 should correspond to the deepest non-missing value.
Data: The dput of data can be found on https://dl.dropboxusercontent.com/u/9267938/rcode.R
> head(data.frame(mytest))
datetime Year Month Day Hour Minute Second X1.07m X1.32m X1.57m X1.82m X2.07m X2.32m X2.57m X2.82m X3.07m
1 2014-08-03 12:40:00 2014 8 3 12 40 0 -0.079553637 -0.018856349 -0.022559778 -0.0278269427 -0.019816260 -0.01304108 -0.003394041 -0.010720688 NA
2 2014-08-03 12:50:00 2014 8 3 12 50 0 -0.001409806 0.006434559 0.013885671 0.0033940409 0.009665614 0.01176982 0.011130125 0.019991707 0.02997477
3 2014-08-03 13:00:00 2014 8 3 13 0 0 -0.006942835 -0.011130125 0.010715907 -0.0058745801 -0.005716650 0.01534520 0.030355206 0.024851408 0.04862646
4 2014-08-03 13:10:00 2014 8 3 13 10 0 -0.020586547 0.002935416 -0.016304143 -0.0001326389 -0.003896694 0.00361282 0.004723244 0.013947785 0.03787721
5 2014-08-03 13:20:00 2014 8 3 13 20 0 -0.028394300 -0.023132719 -0.001721911 -0.0139650391 -0.038460075 0.01749898 0.008466864 0.003630492 0.01442467
6 2014-08-03 13:30:00 2014 8 3 13 30 0 -0.034646511 -0.006791177 0.004064423 -0.0038792422 -0.015942808 -0.02029747 -0.014287663 0.007956902 0.01786172
X3.32m X3.57m X3.82m X4.07m X4.32m X4.57m X4.82m X5.07m X5.32m X5.57m X5.82m X6.07m X6.32m X6.57m X6.82m
1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2 0.05094966 0.04699597 0.032100892 0.02650842 0.045689389 0.0169759192 -0.006879327 -0.0187681077 -0.030404344 -0.04405705 -0.04501967 NA NA NA NA
3 0.04500833 0.01713256 0.006450535 0.02870071 0.019079580 0.0009741734 -0.024666588 -0.0409943643 -0.030201313 -0.03873463 -0.02893064 NA NA NA NA
4 0.03971244 0.05723497 0.039496306 0.03799276 0.012742073 0.0024111385 -0.023706420 -0.0188563490 -0.033791404 -0.04162619 -0.02979164 -0.045051204 NA NA NA
5 0.03269076 0.05125416 0.054766084 0.03625076 0.005988487 0.0020217180 -0.007510352 -0.0069913419 -0.006656083 -0.01630414 -0.01403812 -0.001580609 NA NA NA
6 0.01913708 0.03932811 0.048955209 0.04764632 0.037480601 0.0205218532 0.004171715 0.0009371753 -0.002468609 -0.04511612 -0.01263816 0.035861544 NA NA NA
X7.07m X7.32m X7.57m X7.82m X8.07m X8.32m X8.57m X8.82m X9.07m X9.32m X9.57m X9.82m X10.07m X10.32m X10.57m X10.82m X11.07m X11.32m X11.57m X11.82m
1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
6 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
A row may have 20, 22, or 25 data points, so the function should account for that and reduce each row to 8 values.
The Rcode.R file linked above contains the dput of mytest; it was too big to post here, so I posted an external link.
Info added
Each row has a different number of data points. The goal is to convert them into 8 columns of data using averaging or linear interpolation.
Taking the question as a desire to collapse the values to means of eight equally spaced depths, dplyr and tidyr take us where we need to go:
library(dplyr)
library(tidyr)
mytest %>%
  # melt to long form
  gather(depth, value, -datetime:-Second, na.rm = TRUE) %>%
  # clean depth to number
  mutate(depth = extract_numeric(depth)) %>%
  # group so cut breaks are computed per datetime
  group_by(datetime) %>%
  # group to keep columns; cut depth into 8 levels per group
  group_by(datetime, levels = cut(depth, 8, paste0('level', 1:8))) %>%
  # collapse groups by taking the mean
  summarise(value = mean(value)) %>%
  # re-spread new levels to wide form
  spread(levels, value) %>%
  # re-add other time columns dropped by summarise
  inner_join(mytest %>% select(datetime:Second), .)
# Source: local data frame [20 x 15]
#
# datetime Year Month Day Hour Minute Second level1 level2
# (time) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
# 1 2014-08-03 12:40:00 2014 8 3 12 40 0 -0.079553637 -0.0188563490
# 2 2014-08-03 12:50:00 2014 8 3 12 50 0 0.006303474 0.0065298277
# 3 2014-08-03 13:00:00 2014 8 3 13 0 0 -0.002452351 -0.0057956151
# 4 2014-08-03 13:10:00 2014 8 3 13 10 0 -0.011318424 -0.0001388374
# 5 2014-08-03 13:20:00 2014 8 3 13 20 0 -0.017749644 -0.0116420430
# 6 2014-08-03 13:30:00 2014 8 3 13 30 0 -0.012457755 -0.0133731725
# 7 2014-08-03 13:40:00 2014 8 3 13 40 0 -0.020440875 -0.0253538846
# 8 2014-08-03 13:50:00 2014 8 3 13 50 0 -0.058681338 -0.0177194127
# 9 2014-08-03 14:00:00 2014 8 3 14 0 0 -0.037929680 -0.0211918383
# 10 2014-08-03 14:10:00 2014 8 3 14 10 0 -0.027045726 -0.0147261076
# 11 2014-08-03 14:20:00 2014 8 3 14 20 0 -0.048997399 -0.0290804019
# 12 2014-08-03 14:30:00 2014 8 3 14 30 0 -0.059110466 -0.0370898043
# 13 2014-08-03 14:40:00 2014 8 3 14 40 0 -0.067156867 -0.0138750287
# 14 2014-08-03 14:50:00 2014 8 3 14 50 0 -0.049762164 -0.0280648246
# 15 2014-08-03 15:00:00 2014 8 3 15 0 0 -0.028033559 -0.0245379952
# 16 2014-08-03 15:10:00 2014 8 3 15 10 0 -0.044087211 -0.0107995239
# 17 2014-08-03 15:20:00 2014 8 3 15 20 0 -0.028761973 -0.0113161242
# 18 2014-08-03 15:30:00 2014 8 3 15 30 0 -0.013476051 -0.0142316424
# 19 2014-08-03 15:40:00 2014 8 3 15 40 0 -0.012799297 -0.0135366710
# 20 2014-08-03 15:50:00 2014 8 3 15 50 0 -0.012238548 -0.0180806876
# Variables not shown: level3 (dbl), level4 (dbl), level5 (dbl), level6 (dbl), level7 (dbl),
# level8 (dbl)
Note that you should check that these data make sense in context; you've lost your depth data by scaling them.
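One caveat if you are on a current tidyr: extract_numeric() has since been deprecated, and readr::parse_number() is the usual drop-in replacement for that step, e.g.:
readr::parse_number("X1.07m")
# [1] 1.07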

Grouped moving average in r

I'm trying to calculate a moving average in R over a particular field, but I need this moving average to be grouped by two or more other fields. The purpose of this new average is predictive analysis, so I need it to be trailing as well.
Any variable that does not have enough values to be averaged (such as student J) would ideally give either NA or its original Score value.
I've been trying rollapply and data.table and am having no luck!
I've provided the table of data and two moving averages (AVG2 with k=2 and AVG3 with k=3) to show exactly what I'm after. The moving average is on Score and the variables to group over are school, Student and area. Please help!
no school Student area Score AVG2 AVG3
1 I S A 5 NA NA
2 B S A 2 NA NA
3 B S A 7 NA NA
4 B O A 3 NA NA
5 B O B 9 NA NA
6 I O A 6 NA NA
7 I O B 3 NA NA
8 I S A 7 NA NA
9 I O A 1 NA NA
10 B S A 7 4.5 NA
11 I S A 3 NA NA
12 I O A 8 3.5 NA
13 B S A 3 7 5.33
14 I O A 4 4.5 5
15 B O A 1 NA NA
16 I S A 9 5 5
17 B S A 4 5 5.67
18 B O A 6 2 NA
19 I S A 3 6 6.33
20 I O B 8 NA NA
21 B S A 3 3.5 4.67
22 I O A 4 6 4.33
23 B O A 1 3.5 3.33
24 I S A 9 6 5
25 B S A 4 3.5 3.33
26 B O A 6 3.5 2.67
27 I J A 6 NA NA
Here is the code to recreate the initial table in R:
school <- c('I','B','B','B','B','I','I','I','I','B','I','I','B','I','B','I','B','B','I','I','B','I','B','I','B','B','I')
Student <- c('S','S','S','O','O','O','O','S','O','S','S','O','S','O','O','S','S','O','S','O','S','O','O','S','S','O','J')
area <- c('A','A','A','A','B','A','B','A','A','A','A','A','A','A','A','A','A','A','A','B','A','A','A','A','A','A','A')
Score <- c(5,2,7,3,9,6,3,7,1,7,3,8,3,4,1,9,4,6,3,8,3,4,1,9,4,6,6)
df <- data.frame(school, Student, area, Score)
You can try solving the problem using dplyr and TTR, but for student J from school I it is not possible to calculate a moving average, as there's only one measurement.
AVG2 calculated with stats::filter gives the result you wanted, but I also added AVG2b calculated with TTR::SMA to show a simple moving average where the current measurement is also taken into account.
library(dplyr)
library(TTR)
df <- data.frame(school, Student, Score)
df$AVG2 <- NA
df$AVG2b <- NA
df[!(df$school == "I" & df$Student == "J"), ] <- df[!(df$school == "I" & df$Student == "J"), ] %>%
  group_by(school, Student) %>%
  mutate(AVG2 = stats::filter(Score, c(0, 0.5, 0.5), sides = 1), AVG2b = SMA(Score, n = 2))
> df
school Student Score AVG2 AVG2b
1 I S 5 NA NA
2 B S 2 NA NA
3 B S 7 NA 4.5
4 B O 3 NA NA
5 B O 9 NA 6.0
6 I O 6 NA NA
7 I O 3 NA 4.5
8 I S 7 NA 6.0
9 I O 1 4.5 2.0
10 B S 7 4.5 7.0
...
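To see why the coefficient vector c(0, 0.5, 0.5) with sides = 1 yields a trailing average that excludes the current value, here is a quick check on a toy vector (the current observation gets weight 0, each of the two previous observations gets 0.5):
stats::filter(1:5, c(0, 0.5, 0.5), sides = 1)
# [1]  NA  NA 1.5 2.5 3.5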
Here is a rollapply solution. Note that it appears that you want the average of the prior two or three rows in the same group, i.e. excluding the data on the current row.
library(zoo)
roll <- function(x, n) {
  if (length(x) <= n) NA  # too few values in the group; the NA is recycled by ave
  else rollapply(x, list(-seq(n)), mean, fill = NA)  # mean of the n prior values
}
# DF is the data frame constructed in the question
transform(DF,
          AVG2 = ave(Score, school, Student, FUN = function(x) roll(x, 2)),
          AVG3 = ave(Score, school, Student, FUN = function(x) roll(x, 3)))
giving:
school Student Score AVG2 AVG3
1 I S 5 NA NA
2 B S 2 NA NA
3 B S 7 NA NA
4 B O 3 NA NA
5 B O 9 NA NA
6 I O 6 NA NA
7 I O 3 NA NA
8 I S 7 NA NA
9 I O 1 4.5 NA
10 B S 7 4.5 NA
11 I S 3 6.0 NA
12 I O 8 2.0 3.333333
13 B S 3 7.0 5.333333
14 I O 4 4.5 4.000000
15 B O 1 6.0 NA
16 I S 9 5.0 5.000000
17 B S 4 5.0 5.666667
18 B O 6 5.0 4.333333
19 I S 3 6.0 6.333333
20 I O 8 6.0 4.333333
21 B S 3 3.5 4.666667
22 I O 4 6.0 6.666667
23 B O 1 3.5 5.333333
24 I S 9 6.0 5.000000
25 B S 4 3.5 3.333333
26 B O 6 3.5 2.666667
27 I J 6 NA NA
Update: Fixed roll.
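The list(-seq(n)) width deserves a note: when rollapply()'s width is a list, it gives explicit offsets, so list(-(1:2)) means "apply mean to the elements at offsets -1 and -2", i.e. the two prior values, matching the stats::filter check above:
library(zoo)
rollapply(1:5, list(-(1:2)), mean, fill = NA)
# [1]  NA  NA 1.5 2.5 3.5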
Here is the AVG2 calculation with data.table, which is faster than the other approaches:
library(data.table)
dt <- data.table(df)
setkey(dt, school, Student, area)
dt[, c("start", "len") := .(ifelse(.I + 1 > .I[.N], 0, .I + 1),
                            pmax(pmin(1, .I[.N] - .I - 1), 0)),
   by = .(school, Student, area)][
  , AVG2 := mean(dt$Score[start:(start + len)]), by = 1:nrow(dt)]
dt$AVG2[dt$len == 0] <- NA
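If a recent data.table is available (frollmean() arrived in 1.12.0), a shorter trailing mean is possible; a sketch, shifting the rolling mean by one row so the current value is excluded:
library(data.table)
dt <- data.table(df)
dt[, AVG2 := shift(frollmean(Score, 2)), by = .(school, Student, area)]
dt[, AVG3 := shift(frollmean(Score, 3)), by = .(school, Student, area)]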

Daily averages of all data frame variables including NA values with aggregate function

I want to calculate daily means of all variables in my data frame, which includes NA values. All my data have a value every 30 min, so I'm very interested in using the timestamp with the aggregate function to obtain daily, weekly, monthly... aggregated data.
My data frame is 37795 rows x 54 variables. I've tried two ways to do this. The first option does not give me sensible daily means (the values are far too high); the second option gives me almost all NA values. I don't know what to do.
I write my data frame head and code below.
head(data)
timestamp day month year.x hour minute doy.x rn_1_1_1 ppfd_1_1_1
1 2013-07-06 00:00:00 6 7 2013 0 0 187.000 -84.37381 0.754
2 2013-07-06 00:30:00 6 7 2013 0 30 187.020 -84.07990 0.808
3 2013-07-06 01:00:00 6 7 2013 1 0 187.041 -82.19991 0.808
4 2013-07-06 01:30:00 6 7 2013 1 30 187.062 -81.12341 0.831
5 2013-07-06 02:00:00 6 7 2013 2 0 187.083 -79.57474 0.708
6 2013-07-06 02:30:00 6 7 2013 2 30 187.104 -77.72460 0.639
ppfdr_1_1_1 p_rain_1_1_1 swc_1_1_1 swc_2_1_1 swc_3_1_1 air_pressure air_pressure.1
1 0.624 0 0.07230304 0.09577876 0.134602791 101212.4165 1012.124165
2 0.587 0 0.07233134 0.09569421 0.134479816 101181.8094 1011.818094
3 0.713 0 0.07242914 0.09566160 0.134203719 101166.0948 1011.660948
4 0.72 0 0.07252077 0.09563419 0.134149141 101144.6151 1011.446151
5 0.564 0 0.07261925 0.09560297 0.134095791 101144.8662 1011.448662
6 0.706 0 0.07271843 0.09557789 0.134037119 101144.5084 1011.445084
u_rot v_rot w_rot wind_speed u. h_scr_qc01_man
1 5.546047919 1.42E-14 4.76E-16 5.546047919 0.426515403 -28.07603618
2 5.122724997 6.94E-15 -8.00E-16 5.122724997 0.408213459 -34.39110979
3 5.248639421 4.56E-15 7.28E-17 5.248639421 0.393959075 -33.29033501
4 4.845257286 2.81E-14 -1.33E-17 4.845257286 0.365475898 -32.62427147
5 4.486426895 1.39E-14 -4.43E-16 4.486426895 0.335905384 -33.80219189
6 4.109603841 7.08E-15 -9.76E-16 4.109603841 0.312610588 -35.77289349
fco2_scr_qc01_man le_scr_qc01_man fco2_scr_qc0 fco2_scr_qc0_man date year.y time
1 -0.306504951 NA NA NA 06-jul-13 2013 0:00
2 -0.206266524 NA -0.206266524 -0.206266524 06-jul-13 2013 0:30
3 -0.268508139 NA -0.268508139 -0.268508139 06-jul-13 2013 1:00
4 -0.203804516 0.426531598 -0.203804516 -0.203804516 06-jul-13 2013 1:30
5 -0.217438742 -0.358248118 -0.217438742 -0.217438742 06-jul-13 2013 2:00
6 -0.193778528 2.571063044 -0.193778528 -0.193778528 06-jul-13 2013 2:30
doy_ent doy.y doy_cum doy_cum_ent mes nrecord bat panel_temp vwc_0.1
1 187 187.0000 187.0000 187 7 24 12.57 22.93 0.06284828
2 187 187.0208 187.0208 187 7 25 12.56 22.85 0.06267169
3 187 187.0417 187.0417 187 7 26 12.55 22.58 0.06261738
4 187 187.0625 187.0625 187 7 27 12.54 22.3 0.06247716
5 187 187.0833 187.0833 187 7 28 12.53 22.01 0.06249525
6 187 187.1042 187.1042 187 7 29 12.52 21.82 0.06236862
vwc_0.5 vwc_1.5 temp_0.1 temp_0.5 temp_1.5 tempsd_0.1 tempsd_0.5 tempsd_1.5
1 0.07569027 0.1007845 30.9 28.96 25.14 0.372 0.961 0.767
2 0.07569027 0.1007743 30.8 28.85 24.99 0.181 1.361 1.087
3 0.07568554 0.1008558 30.53 28.8 25.03 0.98 1.476 0.351
4 0.07559577 0.1008507 30.52 29.09 25.11 0.186 0.229 0.556
5 0.07559577 0.1007743 30.11 29.09 24.87 1.331 0.191 0.954
6 0.07556271 0.1007285 30.15 29.33 25.04 1.447 1.078 0.2
pair pair_avg CO2_0.1 CO2_0.5 CO2_1.5 DCO2_0.1 DCO2_0.5
1 101.2124 101.2118 1161.592832 3275.1134 4888.231603 -24.67422109 34.88538221
2 101.1818 101.2131 1168.144925 3338.24016 4941.418642 6.55209301 63.12675931
3 101.1661 101.2090 1201.049131 3435.235974 5012.525851 32.90420541 96.9958144
4 101.1446 101.2007 1268.613941 3556.723878 5092.96558 67.56481067 121.4879035
5 101.1449 101.1906 1364.315214 3680.188043 5164.795759 95.7012722 123.464165
6 101.1445 101.1805 1472.975286 3808.988677 5236.40855 108.6600723 128.8006346
DCO2_1.5
1 31.30293041
2 53.18703947
3 71.10720845
4 80.43972916
5 71.83017884
6 71.61279156
## Daily avg - OPTION 1
data$timestamp <- as.POSIXct(data$timestamp, format = "%d/%m/%Y %H:%M", tz = "GMT")
dates <- format(data$timestamp, "%Y/%m/%d", tz = "GMT")
datadates <- cbind(data, dates)
dailydata_avg <- aggregate(. ~ dates, datadates, FUN = mean, na.rm = TRUE, na.action = "na.pass")
head(dailydata_avg)
dates timestamp day month year.x hour minute doy.x rn_1_1_1 ppfd_1_1_1
1 2013/07/06 1373111100 6 7 2013 11.5 15 187.489 159.7788 3580.562
2 2013/07/07 1373197500 7 7 2013 11.5 15 188.489 154.0925 3506.688
3 2013/07/08 1373283900 8 7 2013 11.5 15 189.489 152.5259 3460.667
4 2013/07/09 1373370300 9 7 2013 11.5 15 190.489 131.1619 2965.250
5 2013/07/10 1373456700 10 7 2013 11.5 15 191.489 136.7853 3171.958
6 2013/07/11 1373543100 11 7 2013 11.5 15 192.489 145.2757 3282.167
ppfdr_1_1_1 p_rain_1_1_1 swc_1_1_1 swc_2_1_1 swc_3_1_1 air_pressure air_pressure.1
1 2552.396 1.0000 0.07095847 0.09606378 18341.81 25940.167 25940.167
2 2532.542 1.0000 0.06994341 0.09502167 18065.98 24891.000 24891.000
3 2523.562 1.0000 0.06860553 0.09379282 17777.02 23107.271 23107.271
4 2336.000 1.0000 0.06717054 0.09268716 17526.50 19309.500 19309.500
5 2607.229 1.0625 0.06620048 0.09166904 17275.56 8385.646 8385.646
6 2484.521 1.0000 0.06562964 0.09083684 17028.94 3535.438 3535.438
u_rot v_rot w_rot wind_speed u. h_scr_qc01_man fco2_scr_qc01_man
1 32167.83 2215.875 2041.354 32167.83 28531.44 18197.75 15365.65
2 30878.27 1911.312 1939.917 30878.27 26929.62 17605.52 14955.56
3 26052.96 2261.417 2116.458 26052.96 23305.83 19167.98 18399.33
4 17284.04 1987.438 2139.083 17284.04 17704.35 20349.92 18137.65
5 12028.06 2053.812 1960.417 12028.06 15670.00 21997.83 21120.19
6 15607.50 1997.417 1907.646 15607.50 15384.56 18000.94 18810.62
le_scr_qc01_man fco2_scr_qc0 fco2_scr_qc0_man date year.y time doy_ent doy.y
1 17409.67 13032.10 13027.90 137 2013 44.5 187 187.4896
2 15524.38 12077.17 12072.92 163 2013 44.5 188 188.4896
3 16407.71 14775.94 14770.56 189 2013 44.5 189 189.4896
4 16788.04 15024.79 15019.02 215 2013 44.5 190 190.4896
5 17955.58 17737.25 17730.75 241 2013 44.5 191 191.4896
6 14610.02 16605.48 16599.33 267 2013 44.5 192 192.4896
doy_cum doy_cum_ent mes nrecord bat panel_temp vwc_0.1 vwc_0.5 vwc_1.5
1 187.4896 187.5 7 28966.375 111.5208 1836.250 4638.833 4594.396 37.35417
2 188.4896 188.5 7 20801.417 111.7292 1900.812 4656.875 4392.979 26.68750
3 189.4896 189.5 7 4394.500 110.6042 1934.792 4675.604 4238.229 65.20833
4 190.4896 190.5 7 9467.708 104.0000 2090.896 4776.521 4178.729 54.12500
5 191.4896 191.5 7 14796.375 109.7500 2145.875 4907.292 4161.312 108.39583
6 192.4896 192.5 7 20127.958 109.3125 1934.375 4876.021 4123.458 143.10417
temp_0.1 temp_0.5 temp_1.5 tempsd_0.1 tempsd_0.5 tempsd_1.5 pair pair_avg CO2_0.1
1 2018.438 1565.812 797.8750 470.8125 474.3958 508.8333 101.1268 101.1323 10400.27
2 1998.438 1574.000 783.1875 478.3333 460.4583 566.0208 101.0764 101.0789 11292.75
3 1994.833 1568.104 780.2083 463.8125 453.1667 488.5625 100.9967 101.0036 13288.25
4 2042.625 1564.875 780.1667 465.0000 599.2708 437.6042 100.8520 100.8665 16156.60
5 2114.708 1576.729 780.5000 471.5833 406.5417 484.6875 100.4828 100.5169 18656.50
6 2124.604 1591.125 781.8125 516.7500 530.3333 510.7500 100.3025 100.2947 14586.60
CO2_0.5 CO2_1.5 DCO2_0.1 DCO2_0.5 DCO2_1.5
1 26360.38 34371.31 19795.81 20637.94 27123.92
2 26939.60 34558.17 18838.38 20464.56 20452.58
3 27603.06 34608.31 17413.15 19998.02 22754.85
4 28572.69 34678.38 19294.62 21894.92 18379.62
5 28983.29 34644.15 20251.17 20409.58 22077.40
6 28236.12 34736.67 17031.02 18852.04 19684.69
## Daily avg - OPTION 2
data$timestamp <- as.POSIXct(data$timestamp, format = "%d/%m/%Y %H:%M", tz = "GMT")
datatime <- data$timestamp
dailydata_avg <- aggregate(data,
                           by = list(DATES = format(datatime, "%Y%m%d")),
                           FUN = mean, na.rm = TRUE)
I obtain these warnings in the console:
1: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
2: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
3: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
head(dailydata_avg)
DATES timestamp day month year.x hour minute doy.x rn_1_1_1 ppfd_1_1_1
1 20130706 2013-07-06 13:45:00 6 7 2013 11.5 15 187.489 159.7788 NA
2 20130707 2013-07-07 13:45:00 7 7 2013 11.5 15 188.489 154.0925 NA
3 20130708 2013-07-08 13:45:00 8 7 2013 11.5 15 189.489 152.5259 NA
4 20130709 2013-07-09 13:45:00 9 7 2013 11.5 15 190.489 131.1619 NA
5 20130710 2013-07-10 13:45:00 10 7 2013 11.5 15 191.489 136.7853 NA
6 20130711 2013-07-11 13:45:00 11 7 2013 11.5 15 192.489 145.2757 NA
ppfdr_1_1_1 p_rain_1_1_1 swc_1_1_1 swc_2_1_1 swc_3_1_1 air_pressure air_pressure.1
1 NA NA 0.07095847 0.09606378 NA NA NA
2 NA NA 0.06994341 0.09502167 NA NA NA
3 NA NA 0.06860553 0.09379282 NA NA NA
4 NA NA 0.06717054 0.09268716 NA NA NA
5 NA NA 0.06620048 0.09166904 NA NA NA
6 NA NA 0.06562964 0.09083684 NA NA NA
u_rot v_rot w_rot wind_speed u. h_scr_qc01_man fco2_scr_qc01_man le_scr_qc01_man
1 NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA NA
6 NA NA NA NA NA NA NA NA
fco2_scr_qc0 fco2_scr_qc0_man date year.y time doy_ent doy.y doy_cum doy_cum_ent
1 NA NA NA 2013 NA 187 187.4896 187.4896 187.5
2 NA NA NA 2013 NA 188 188.4896 188.4896 188.5
3 NA NA NA 2013 NA 189 189.4896 189.4896 189.5
4 NA NA NA 2013 NA 190 190.4896 190.4896 190.5
5 NA NA NA 2013 NA 191 191.4896 191.4896 191.5
6 NA NA NA 2013 NA 192 192.4896 192.4896 192.5
mes nrecord bat panel_temp vwc_0.1 vwc_0.5 vwc_1.5 temp_0.1 temp_0.5 temp_1.5
1 7 NA NA NA NA NA NA NA NA NA
2 7 NA NA NA NA NA NA NA NA NA
3 7 NA NA NA NA NA NA NA NA NA
4 7 NA NA NA NA NA NA NA NA NA
5 7 NA NA NA NA NA NA NA NA NA
6 7 NA NA NA NA NA NA NA NA NA
tempsd_0.1 tempsd_0.5 tempsd_1.5 pair pair_avg CO2_0.1 CO2_0.5 CO2_1.5 DCO2_0.1
1 NA NA NA 101.1268 101.1323 NA NA NA NA
2 NA NA NA 101.0764 101.0789 NA NA NA NA
3 NA NA NA 100.9967 101.0036 NA NA NA NA
4 NA NA NA 100.8520 100.8665 NA NA NA NA
5 NA NA NA 100.4828 100.5169 NA NA NA NA
6 NA NA NA 100.3025 100.2947 NA NA NA NA
DCO2_0.5 DCO2_1.5
1 NA NA
2 NA NA
3 NA NA
4 NA NA
5 NA NA
6 NA NA
How could I do it?
Thanks!!
I didn't use the aggregate function; I used tapply instead.
This is the code I came up with; it deals with NAs:
# create a sequence of DateTime with half-hourly data
DateTime <- seq.POSIXt(from = as.POSIXct("2015-05-01 00:00:00", tz = "Etc/GMT+12"),
to = as.POSIXct("2015-05-30 23:59:00", tz = "Etc/GMT+12"), by = 1800)
# create some dummy data of the same length as DateTime vector
aa <- runif(1440, 5.0, 7.5)
bb <- NA
df <- data.frame(DateTime, aa, bb)
# replace a cell with NA in the "a" column
df[19,2] <- NA # dataframe = df, row = 19, column = 2
# create DateHour column to use later
df$DateHour <- paste(format(df$DateTime, "%Y/%m/%d"), format(df$DateTime, "%H"), sep = " ")
View(df)
# Hourly means
# Calculate hourly mean values
aa.HourlyMean <- tapply(df$aa, df$DateHour, mean, na.rm = TRUE)
# convert the vector to dataframe
aa.HourlyMean <- data.frame(aa.HourlyMean)
# Extract the DateHour column from the "aa" dataframe
aa.HourlyMean$DateHour <- row.names(aa.HourlyMean);
# Delete rownames of "aa" dataframe
row.names(aa.HourlyMean) <- NULL
# Create a tidy DateTime column
aa.HourlyMean$DateTime <- as.POSIXct(aa.HourlyMean$DateHour, "%Y/%m/%d %H", tz = "Etc/GMT+12")
# change to a tidy dataframe
aa.HourlyMean <- aa.HourlyMean[,c(3,2,1)]
# You can delete any column (for example, DateHour) by
# aa.HourlyMean$Date <- NULL
# You can rename a column with "plyr" package by
# rename(aa.HourlyMean)[3] <- "NewColumnName"
# View the hourly mean of the "aa" dataframe
View(aa.HourlyMean)
# You can do the same with the "bb" vector
bb.HourlyMean <- tapply(df$bb, df$DateHour, mean, na.rm = TRUE)
bb.HourlyMean <- data.frame(bb.HourlyMean)
# View the hourly mean of the "bb" vector
View(bb.HourlyMean)
# /Hourly means
You can then combine the aa.HourlyMean and bb.HourlyMean results into one data frame.
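For example (a sketch; merge() can join one data frame's column to another's row names):
hourly <- merge(aa.HourlyMean, bb.HourlyMean, by.x = "DateHour", by.y = "row.names")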
# Daily means
df$Date <- format(df$DateTime, "%Y/%m/%d")
aa.DailyMean <- tapply(df$aa, df$Date, mean, na.rm = TRUE)
aa.DailyMean <- data.frame(aa.DailyMean)
aa.DailyMean$Date <- row.names(aa.DailyMean); row.names(aa.DailyMean) <- NULL
aa.DailyMean <- aa.DailyMean[,c(2,1)]
View(aa.DailyMean)
# /Daily means
# Weekly means
df$YearWeek <- paste(format(df$DateTime, "%Y"), strftime(df$DateTime, format = "%W"), sep = " ")
aa.WeeklyMean <- tapply(df$aa, df$YearWeek, mean, na.rm = TRUE)
aa.WeeklyMean <- data.frame(aa.WeeklyMean)
aa.WeeklyMean$YearWeek <- row.names(aa.WeeklyMean); row.names(aa.WeeklyMean) <- NULL
aa.WeeklyMean <- aa.WeeklyMean[,c(2,1)]
View(aa.WeeklyMean)
# /Weekly means
I created the mean values for hourly, daily and weekly observations, but you get the idea of how to create the monthly, yearly, etc. ones.
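A footnote on the original option 2: the warnings come from mean() hitting non-numeric columns, which return NA, and the many NA results suggest several columns were read in as character or factor. A sketch of a fix that stays with aggregate (untested against the full dataset): convert those columns to numeric first, then average only the numeric columns by calendar day.
num_cols <- sapply(data, is.numeric)
dailydata_avg <- aggregate(data[num_cols],
                           by = list(DATES = as.Date(data$timestamp, tz = "GMT")),
                           FUN = mean, na.rm = TRUE)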

R - Calculate Time Elapsed Since Last Event with Multiple Event Types

I have a dataframe that contains the dates of multiple types of events.
df <- data.frame(date = as.Date(c("06/07/2000", "15/09/2000", "15/10/2000",
                                  "03/01/2001", "17/03/2001", "23/04/2001",
                                  "26/05/2001", "01/06/2001", "30/06/2001",
                                  "02/07/2001", "15/07/2001", "21/12/2001"),
                                "%d/%m/%Y"),
                 event_type = c(0, 4, 1, 2, 4, 1, 0, 2, 3, 3, 4, 3))
date event_type
---------------- ----------
1 2000-07-06 0
2 2000-09-15 4
3 2000-10-15 1
4 2001-01-03 2
5 2001-03-17 4
6 2001-04-23 1
7 2001-05-26 0
8 2001-06-01 2
9 2001-06-30 3
10 2001-07-02 3
11 2001-07-15 4
12 2001-12-21 3
I am trying to calculate the days elapsed since the last event of the same type, so the output looks like this:
date event_type days_since_last_event
---------------- ---------- ---------------------
1 2000-07-06 0 NA
2 2000-09-15 4 NA
3 2000-10-15 1 NA
4 2001-01-03 2 NA
5 2001-03-17 4 183
6 2001-04-23 1 190
7 2001-05-26 0 324
8 2001-06-01 2 149
9 2001-06-30 3 NA
10 2001-07-02 3 2
11 2001-07-15 4 120
12 2001-12-21 3 172
I have benefited from the answers in these two previous posts but have not been able to address my specific problem in R: multiple event types.
Calculate elapsed time since last event
Calculate days since last event in R
Below is as far as I have gotten. I have not been able to leverage the last event index to calculate the last event date.
df <- cbind(df, as.vector(data.frame(count = ave(df$event_type == df$event_type,
                                                 df$event_type, FUN = cumsum))))
df <- rename(df, c("count" = "last_event_index"))  # rename() from plyr
date event_type last_event_index
--------------- ------------- ----------------
1 2000-07-06 0 1
2 2000-09-15 4 1
3 2000-10-15 1 1
4 2001-01-03 2 1
5 2001-03-17 4 2
6 2001-04-23 1 2
7 2001-05-26 0 2
8 2001-06-01 2 2
9 2001-06-30 3 1
10 2001-07-02 3 2
11 2001-07-15 4 3
12 2001-12-21 3 3
We can use diff to get the difference between adjacent 'date' values after grouping by 'event_type'. Here I use the data.table approach: convert the 'data.frame' to a 'data.table' with setDT(df), then, grouped by 'event_type', take the diff of 'date'.
library(data.table)
setDT(df)[, days_since_last_event := c(NA, diff(date)), by = event_type]
df
# date event_type days_since_last_event
# 1: 2000-07-06 0 NA
# 2: 2000-09-15 4 NA
# 3: 2000-10-15 1 NA
# 4: 2001-01-03 2 NA
# 5: 2001-03-17 4 183
# 6: 2001-04-23 1 190
# 7: 2001-05-26 0 324
# 8: 2001-06-01 2 149
# 9: 2001-06-30 3 NA
#10: 2001-07-02 3 2
#11: 2001-07-15 4 120
#12: 2001-12-21 3 172
Or, as @Frank mentioned in the comments, we can use shift (available from version 1.9.5 onwards) to get the lag (by default, type = "lag") of 'date' and subtract it from 'date'.
setDT(df)[, days_since_last_event := as.numeric(date - shift(date, type = "lag")),
          by = event_type]
The base R version of this is to use split/lapply/rbind to generate the new column.
> do.call(rbind,
lapply(
split(df, df$event_type),
function(d) {
d$dsle <- c(NA, diff(d$date)); d
}
)
)
date event_type dsle
0.1 2000-07-06 0 NA
0.7 2001-05-26 0 324
1.3 2000-10-15 1 NA
1.6 2001-04-23 1 190
2.4 2001-01-03 2 NA
2.8 2001-06-01 2 149
3.9 2001-06-30 3 NA
3.10 2001-07-02 3 2
3.12 2001-12-21 3 172
4.2 2000-09-15 4 NA
4.5 2001-03-17 4 183
4.11 2001-07-15 4 120
Note that this returns the data in a different order than provided; you can re-sort by date or save the original indices if you want to preserve that order.
Above, @akrun has posted the data.table approach; the parallel dplyr approach is straightforward as well:
library(dplyr)
df %>% group_by(event_type) %>% mutate(days_since_last_event = date - lag(date, 1))
Source: local data frame [12 x 3]
Groups: event_type [5]
date event_type days_since_last_event
(date) (dbl) (dfft)
1 2000-07-06 0 NA days
2 2000-09-15 4 NA days
3 2000-10-15 1 NA days
4 2001-01-03 2 NA days
5 2001-03-17 4 183 days
6 2001-04-23 1 190 days
7 2001-05-26 0 324 days
8 2001-06-01 2 149 days
9 2001-06-30 3 NA days
10 2001-07-02 3 2 days
11 2001-07-15 4 120 days
12 2001-12-21 3 172 days
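As the printed column shows, date - lag(date) yields a difftime in days; if a plain numeric column is preferred (as in the data.table answer above), wrap the difference in as.numeric():
df %>%
  group_by(event_type) %>%
  mutate(days_since_last_event = as.numeric(date - lag(date)))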
