Currently, I have a list of data frames (groupA) like the following example:
$AAAAAA
time timegap
1 06:00:00 0
2 07:00:00 60
3 08:00:00 40
4 09:00:00 0
5 10:00:00 30
$BBBBBBB
time timegap
1 06:00:00 0
2 07:00:00 60
3 08:00:00 40
4 09:00:00 0
5 10:00:00 30
I am trying to create a function that generates a dummy variable when timegap is greater than a certain number. The challenge is that this number should be different when the time is in the range 07:00:00 to 09:00:00.
What I did was the following:
dummytime <- function(x){
  if(x$time > times("07:00:00") & x$time < times("09:00:00")){
    d <- c(1200)
  } else {
    d <- c(600)
  }
  dummytime <- as.numeric(x$timegap >= d)
  as.data.frame(dummytime)
}
dumtime <- lapply(groupm2, dummytime)
However, I got an error like this:
Error in if (as.logical(x$time > times("07:00:00") & x$time < times("09:00:00")))
{ : missing value where TRUE/FALSE needed
Any suggestion? Thanks for help in advance.
Here is one way. Since you used the chron package to convert character to times, I have done the same, then created a list and used lapply. (Your if() fails because if() needs a single non-missing TRUE/FALSE value, while x$time > ... is a whole vector; it is likely NA here because time was not yet converted to a times object. The vectorized alternative is ifelse().)
library(chron)
# time to Class 'times'
df1$time <- chron(times = df1$time)
df2$time <- chron(times = df2$time)
# Create a list
ana <- list(df1 = df1, df2 = df2)
#$df1
# time timegap
#1 06:00:00 0
#2 07:00:00 60
#3 08:00:00 40
#4 09:00:00 0
#5 10:00:00 30
lapply(ana, function(x){
  x$test <- ifelse(x$time >= "07:00:00" & x$time <= "09:00:00",
                   1200, 600)
  x
})
#$df1
# time timegap test
#1 06:00:00 0 600
#2 07:00:00 60 1200
#3 08:00:00 40 1200
#4 09:00:00 0 1200
#5 10:00:00 30 600
#$df2
# time timegap test
#1 06:00:00 0 600
#2 07:00:00 60 1200
#3 08:00:00 40 1200
#4 09:00:00 0 1200
#5 10:00:00 30 600
Or
lapply(ana, transform,
test = ifelse(time >= "07:00:00" & time <= "09:00:00", 1200, 600))
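If you want the 0/1 dummy itself rather than just the threshold, here is a minimal sketch building on the ana list above (it keeps the dummytime name and the 600/1200 thresholds from the question):
library(chron)
dumtime <- lapply(ana, function(x){
  # pick the per-row threshold, then flag rows whose timegap reaches it
  d <- ifelse(x$time >= times("07:00:00") & x$time <= times("09:00:00"),
              1200, 600)
  x$dummytime <- as.numeric(x$timegap >= d)
  x
})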
I have a data frame and a vector that I want to compare against a column of the data frame, to assign groups based on the values that meet the condition. The problem is that these values are dynamic, so I need code that takes into account the different lengths this vector can take.
This is a minimal reproducible example of my data frame:
value <- c(rnorm(39, 5, 2))
Date <- seq(as.POSIXct('2021-01-18'), as.POSIXct('2021-10-15'), by = "7 days")
df <- data.frame(Date, value)
This is the vector I have to compare with the Date column of the data frame:
dates_tour <- as.POSIXct(c('2021-01-18', '2021-05-18', '2021-08-18', '2021-10-15'))
This creates the desired output:
df <- df %>% mutate(tour = case_when(Date >= dates_tour[1] & Date <= dates_tour[2] ~ 1,
                                     Date > dates_tour[2] & Date <= dates_tour[3] ~ 2,
                                     Date > dates_tour[3] & Date <= dates_tour[4] ~ 3))
However, I don't want to do it like that, since this project needs to be updated frequently and dates_tour changes in length. So I would like to take that into account when creating the tour variable.
I tried to do it like this, but it doesn't work:
for (i in 1:length(dates_tour)) {
  df <- df %>% mutate(tour = case_when(Date >= dates_tour[i] & Date <= dates_tour[i+1] ~ i))
}
You can use cut to bin a vector based on break points:
df %>%
  mutate(
    tour = cut(Date, breaks = dates_tour, labels = seq_along(dates_tour[-1]))
  )
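For completeness, a base-R sketch of the same idea with findInterval also scales with the length of dates_tour (boundary handling at the exact break points may differ slightly from the case_when version):
# index of the interval each Date falls into; rightmost.closed = TRUE keeps
# dates equal to the last break inside the final interval
df$tour <- findInterval(df$Date, dates_tour, rightmost.closed = TRUE)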
We may drop the last and first elements of dates_tour to build a tibble of start/end pairs, and then loop over the rows of that tibble:
library(dplyr)
library(purrr)
keydat <- tibble(start = dates_tour[-length(dates_tour)],
                 end   = dates_tour[-1])
df$tour <- imap(seq_len(nrow(keydat)),
                ~ case_when(df$Date >= keydat$start[.x] &
                            df$Date <= keydat$end[.x] ~ .y)) %>%
  invoke(coalesce, .)
Output:
> df
Date value tour
1 2021-01-18 00:00:00 7.874620 1
2 2021-01-25 00:00:00 9.704973 1
3 2021-02-01 00:00:00 5.898070 1
4 2021-02-08 00:00:00 3.287319 1
5 2021-02-15 00:00:00 5.488132 1
6 2021-02-22 00:00:00 4.425636 1
7 2021-03-01 00:00:00 6.244084 1
8 2021-03-08 00:00:00 5.528364 1
9 2021-03-15 01:00:00 7.954929 1
10 2021-03-22 01:00:00 4.691995 1
11 2021-03-29 01:00:00 5.943415 1
12 2021-04-05 01:00:00 5.316373 1
13 2021-04-12 01:00:00 5.182952 1
14 2021-04-19 01:00:00 3.330700 1
15 2021-04-26 01:00:00 7.461089 1
16 2021-05-03 01:00:00 4.338873 1
17 2021-05-10 01:00:00 5.768665 1
18 2021-05-17 01:00:00 3.574488 1
19 2021-05-24 01:00:00 5.106042 2
20 2021-05-31 01:00:00 2.828844 2
21 2021-06-07 01:00:00 4.616084 2
22 2021-06-14 01:00:00 7.234506 2
23 2021-06-21 01:00:00 4.760413 2
24 2021-06-28 01:00:00 7.020543 2
25 2021-07-05 01:00:00 7.403235 2
26 2021-07-12 01:00:00 6.368435 2
27 2021-07-19 01:00:00 3.527764 2
28 2021-07-26 01:00:00 5.254025 2
29 2021-08-02 01:00:00 5.676425 2
30 2021-08-09 01:00:00 3.783304 2
31 2021-08-16 01:00:00 6.310292 2
32 2021-08-23 01:00:00 2.938218 3
33 2021-08-30 01:00:00 5.101852 3
34 2021-09-06 01:00:00 3.765659 3
35 2021-09-13 01:00:00 5.489846 3
36 2021-09-20 01:00:00 4.174276 3
37 2021-09-27 01:00:00 7.348895 3
38 2021-10-04 01:00:00 5.103772 3
39 2021-10-11 01:00:00 4.941248 3
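If your purrr version flags invoke() as superseded, reduce() performs the same coalescing; a sketch using the same keydat as above:
library(dplyr)
library(purrr)
# each list element marks one tour's rows; reduce(coalesce) merges them
df$tour <- imap(seq_len(nrow(keydat)),
                ~ case_when(df$Date >= keydat$start[.x] &
                            df$Date <= keydat$end[.x] ~ .y)) %>%
  reduce(coalesce)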
The dataframe df1 summarizes detections of different individuals (ID) through time (Datetime). As a short example:
library(lubridate)
df1 <- data.frame(ID = c(1,2,1,2,1,2,1,2,1,2),
                  Datetime = ymd_hms(c("2016-08-21 00:00:00", "2016-08-24 08:00:00",
                                       "2016-08-23 12:00:00", "2016-08-29 03:00:00",
                                       "2016-08-27 23:00:00", "2016-09-02 02:00:00",
                                       "2016-09-01 12:00:00", "2016-09-09 04:00:00",
                                       "2016-09-01 12:00:00", "2016-09-10 12:00:00")))
> df1
ID Datetime
1 1 2016-08-21 00:00:00
2 2 2016-08-24 08:00:00
3 1 2016-08-23 12:00:00
4 2 2016-08-29 03:00:00
5 1 2016-08-27 23:00:00
6 2 2016-09-02 02:00:00
7 1 2016-09-01 12:00:00
8 2 2016-09-09 04:00:00
9 1 2016-09-01 12:00:00
10 2 2016-09-10 12:00:00
I want to calculate, for each row, the number of hours (Hours_since_beginning) since the first time that individual was detected.
I would expect something like this (it may contain some mistakes, since I did the calculations by hand):
> df1
ID Datetime Hours_since_beginning
1 1 2016-08-21 00:00:00 0
2 2 2016-08-24 08:00:00 0
3 1 2016-08-23 12:00:00 60 # Number of hours between "2016-08-21 00:00:00" (first time Ind 1 was detected) and "2016-08-23 12:00:00"
4 2 2016-08-29 03:00:00 115
5 1 2016-08-27 23:00:00 167 # Number of hours between "2016-08-21 00:00:00" (first time Ind 1 was detected) and "2016-08-27 23:00:00"
6 2 2016-09-02 02:00:00 210
7 1 2016-09-01 12:00:00 276
8 2 2016-09-09 04:00:00 380
9 1 2016-09-01 12:00:00 276
10 2 2016-09-10 12:00:00 412
Does anyone know how to do it?
Thanks in advance!
You can do this:
library(tidyverse)
# first get the min datetime by ID
min_datetime_id <- df1 %>%
  group_by(ID) %>%
  summarise(min_datetime = min(Datetime))
# join with df1 and compute the time difference
df1 <- df1 %>%
  left_join(min_datetime_id) %>%
  mutate(Hours_since_beginning = as.numeric(difftime(Datetime, min_datetime,
                                                     units = "hours")))
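A grouped mutate gives the same result without the explicit join; a sketch on the same df1:
library(dplyr)
df1 <- df1 %>%
  group_by(ID) %>%
  # difference to each ID's earliest detection, in hours
  mutate(Hours_since_beginning = as.numeric(difftime(Datetime, min(Datetime),
                                                     units = "hours"))) %>%
  ungroup()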
I have data:
library(data.table)
dt <- data.table(time = as.POSIXct(c("2018-01-01 01:01:00", "2018-01-01 01:05:00",
                                     "2018-01-01 01:01:00")),
                 y = c(1, 10, 9))
> dt
time y
1: 2018-01-01 01:01:00 1
2: 2018-01-01 01:05:00 10
3: 2018-01-01 01:01:00 9
and I would like to aggregate by time. Usually, I would do
dt[, list(sum = sum(y), count = .N), by = "time"]
time sum count
1: 2018-01-01 01:01:00 10 2
2: 2018-01-01 01:05:00 10 1
but this time, I would also like to get zero values for the minutes in between, i.e.,
time sum count
1: 2018-01-01 01:01:00 10 2
2: 2018-01-01 01:02:00 0 0
3: 2018-01-01 01:03:00 0 0
4: 2018-01-01 01:04:00 0 0
5: 2018-01-01 01:05:00 10 1
Could this be done, for example, using an external vector
times <- seq(from=min(dt$time),to=max(dt$time),by="mins")
that can be fed to the data.table function as a grouping variable?
You would typically do this with a join (either before or after the aggregation). For example:
dt <- dt[J(times), on = "time"]
dt[, list(sum = sum(y, na.rm = TRUE), count = sum(!is.na(y))), by = time]
# time sum count
#1: 2018-01-01 01:01:00 10 2
#2: 2018-01-01 01:02:00 0 0
#3: 2018-01-01 01:03:00 0 0
#4: 2018-01-01 01:04:00 0 0
#5: 2018-01-01 01:05:00 10 1
Or in a "piped" version:
dt[J(times), on = "time"][
  , .(sum = sum(y, na.rm = TRUE), count = sum(!is.na(y))),
  by = time]
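The join and the aggregation can also be fused into one step with by = .EACHI, which evaluates j once per row of i; a sketch assuming the same times vector (non-matching minutes again become zeros via na.rm and !is.na):
dt[J(times), on = "time",
   .(sum = sum(y, na.rm = TRUE), count = sum(!is.na(y))),
   by = .EACHI]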
I want to speed up a calculation that I need to do on a specific part of a data frame. This is example data:
days <- c("01.01.2018","01.01.2018","01.01.2018",
"02.01.2018","02.01.2018","02.01.2018",
"03.01.2018","03.01.2018","03.01.2018")
time <- c("00:00:00","01:00:00","02:00:00",
"00:00:00","01:00:00","02:00:00",
"00:00:00","01:00:00","02:00:00")
a <- c(1,2,3,
1,2,3,
1,2,3)
b <- c(1,2,3,
5,6,7,
10,11,12)
results <- NA
df1 <- data.frame(days,time,a,results)
df2 <- data.frame(days,time,b)
I need to add the value of df2$b at 00:00:00 of each day to all of that day's values of df1$a, and store the sums in results.
Right now I'm doing it like this:
ndays <- unique(df1$days)
for(i in 1:length(ndays)) {
  factor <- df2[(df2$days == ndays[i] & df2$time == "00:00:00"),]$b
  df1[df1$days == ndays[i],]$results <- df1[df1$days == ndays[i],]$a + factor
}
The problem is that I've got huge data frames with lots of days, and cycling through them one by one is slow. Is there a faster way to do this?
Edit: this is the filled results column after the loop:
df1
days time a results
1 01.01.2018 00:00:00 1 2 # results = a + df2$b at 01.01.2018 00:00:00
2 01.01.2018 01:00:00 2 3 # results = a + df2$b at 01.01.2018 00:00:00
3 01.01.2018 02:00:00 3 4 # results = a + df2$b at 01.01.2018 00:00:00
4 02.01.2018 00:00:00 1 6 # results = a + df2$b at 02.01.2018 00:00:00
5 02.01.2018 01:00:00 2 7 # results = a + df2$b at 02.01.2018 00:00:00
6 02.01.2018 02:00:00 3 8 # results = a + df2$b at 02.01.2018 00:00:00
7 03.01.2018 00:00:00 1 11 # results = a + df2$b at 03.01.2018 00:00:00
8 03.01.2018 01:00:00 2 12 # results = a + df2$b at 03.01.2018 00:00:00
9 03.01.2018 02:00:00 3 13 # results = a + df2$b at 03.01.2018 00:00:00
You can do this with a merge instead of a for loop, which will be much faster. In the answer below I'm also using data.table, a fast extension of data.frame that is very useful when working with large tables.
# install.packages("data.table") # Uncomment if necessary
library(data.table)
df1 <- data.frame(days,time,a) # You don't need to create the result column yet
df2 <- data.frame(days,time,b)
df1 <- data.table(df1)
df2 <- data.table(df2)
# Merge the two tables on the days column
df3 <- merge(df1, df2[time=="00:00:00"], by="days")
# This is your result
answer <- df3[, .(days, time=time.x, a, results=a+b)]
Output:
> answer
days time a results
1: 01.01.2018 00:00:00 1 2
2: 01.01.2018 01:00:00 2 3
3: 01.01.2018 02:00:00 3 4
4: 02.01.2018 00:00:00 1 6
5: 02.01.2018 01:00:00 2 7
6: 02.01.2018 02:00:00 3 8
7: 03.01.2018 00:00:00 1 11
8: 03.01.2018 01:00:00 2 12
9: 03.01.2018 02:00:00 3 13
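A more data.table-idiomatic variant does the same lookup as an update join, avoiding the intermediate df3; a sketch on the same data:
library(data.table)
dt1 <- data.table(days, time, a)
dt2 <- data.table(days, time, b)
# update join: pull b from dt2's midnight rows into dt1, matched on days
dt1[dt2[time == "00:00:00"], results := a + i.b, on = "days"]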
One solution using dplyr could be as below. The approach is to:
1) First filter out all times other than 00:00:00 from df2.
2) Then inner_join df1 and df2 on days. This attaches the value of b from df2 to every matching day in the merged data frame. Finally, add a and b to get the result.
df1 <- data.frame(days,time,a,results, stringsAsFactors = FALSE)
df2 <- data.frame(days,time,b, stringsAsFactors = FALSE)
library(dplyr)
df2 %>%
  filter(time == "00:00:00") %>%
  inner_join(df1, by = "days") %>%
  mutate(time = time.y, results = a + b) %>%
  select(days, time, a, b, results)
#Result:
days time a b results
1 01.01.2018 00:00:00 1 1 2
2 01.01.2018 01:00:00 2 1 3
3 01.01.2018 02:00:00 3 1 4
4 02.01.2018 00:00:00 1 5 6
5 02.01.2018 01:00:00 2 5 7
6 02.01.2018 02:00:00 3 5 8
7 03.01.2018 00:00:00 1 10 11
8 03.01.2018 01:00:00 2 10 12
9 03.01.2018 02:00:00 3 10 13
transform(merge(df1, aggregate(b ~ days, df2, function(x) x[1])), results = a + b)
days time a results b
1 01.01.2018 00:00:00 1 2 1
2 01.01.2018 01:00:00 2 3 1
3 01.01.2018 02:00:00 3 4 1
4 02.01.2018 00:00:00 1 6 5
5 02.01.2018 01:00:00 2 7 5
6 02.01.2018 02:00:00 3 8 5
7 03.01.2018 00:00:00 1 11 10
8 03.01.2018 01:00:00 2 12 10
9 03.01.2018 02:00:00 3 13 10
One thing to note: this assumes that time in df2 is sorted chronologically, so that the first b value for any given day is the one at 00:00:00.
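To avoid that ordering assumption entirely, a small match-based sketch looks b up explicitly from the midnight rows:
# for each df1 day, find its position among df2's 00:00:00 rows
midnight <- df2[df2$time == "00:00:00", ]
df1$results <- df1$a + midnight$b[match(df1$days, midnight$days)]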
I have some weather forecast data, which records the forecast amount of rainfall for every hour. I would like to compare this to observation data, which has the observed amount of rainfall for every 6 hours. So, I need to aggregate the forecast data to 6-hourly data.
Here is an overview of my data:
DateUtc StationID FcstDay PrecipQuantity_hSum
1 2014-01-01 12:00:00 54745 0 0
2 2014-01-01 13:00:00 54745 0 0
3 2014-01-01 14:00:00 54745 0 0
4 2014-01-01 15:00:00 54745 0 0
5 2014-01-01 16:00:00 54745 0 0
6 2014-01-01 17:00:00 54745 0 0
7 2014-01-01 18:00:00 54745 0 0
8 2014-01-01 19:00:00 54745 0 0
9 2014-01-01 20:00:00 54745 0 0
10 2014-01-01 21:00:00 54745 0 0
11 2014-01-01 22:00:00 54745 0 0
12 2014-01-01 23:00:00 54745 0 0
13 2014-01-02 00:00:00 54745 1 0
14 2014-01-02 01:00:00 54745 1 0
15 2014-01-02 02:00:00 54745 1 0
16 2014-01-02 03:00:00 54745 1 0
17 2014-01-02 04:00:00 54745 1 0
18 2014-01-02 05:00:00 54745 1 0
19 2014-01-02 06:00:00 54745 1 0
20 2014-01-02 07:00:00 54745 1 0
... <NA> <NA> ... ...
13802582 2014-11-20 08:00:00 55005 7 0
13802583 2014-11-20 09:00:00 55005 7 0
13802584 2014-11-20 10:00:00 55005 7 0
13802585 2014-11-20 11:00:00 55005 7 0
13802586 2014-11-20 12:00:00 55005 7 0
To aggregate correctly, it is important to split by StationID (the weather station) and FcstDay (the number of days between the date the forecast was issued and the date being forecast) before aggregating.
I have used the xts package to do the aggregating, and it works as expected if I manually subset the data first, e.g.:
z <- fcst[which(fcst$StationID=="54745" & fcst$FcstDay==1),]
z.xts <- xts(z$PrecipQuantity_hSum, z$DateUtc)
ends <- endpoints(z.xts, "hours", 6)
precip6 <- as.data.frame(period.apply(z.xts, ends, sum))
I need to automate the subsetting. I have tried wrapping the xts functions in various split-apply functions, but I always get the same error:
Error in xts(z$PrecipQuantity_hSum, z$DateUtc) :
NROW(x) must match length(order.by)
This is the latest version of my code:
df <- data.frame()
d_ply(
  .data = fcst,
  .variables = c("FcstDay", "StationID"),
  .fun = function(z){
    z.xts <- xts(z$PrecipQuantity_hSum, z$DateUtc)
    ends <- endpoints(z.xts, "hours", 6)
    precip6 <- as.data.frame(period.apply(z.xts, ends, sum))
    precip6$DateUtc <- rownames(precip6)
    rownames(precip6) <- NULL
    df <- rbind.fill(df, precip6)
  })
I've also tried nested for loops. Can anybody give any guidance on what's wrong? I've included the code for a reproducible example set below. Thanks in advance.
DateUtc <- rep(seq(from = ISOdatetime(2014,1,1,0,0,0),
                   to = ISOdatetime(2014,12,30,0,0,0),
                   by = (60*60)), times = 9)
StationID <- rep(c("50060","50061","50062"), each = 3*8713)
FcstDay <- rep(c(1,2,3), each = 8713, times = 3)
PrecipQuantity_hSum <- rgamma(78417, shape = 1, rate = 20)
fcst <- data.frame(DateUtc, StationID, FcstDay, PrecipQuantity_hSum)
I think the error David Robinson is getting arises because the code refers to PrecipQuantity_6hSum while the example data only contains PrecipQuantity_hSum, so xts() receives a NULL vector and NROW(x) cannot match length(order.by). Once the names are made consistent, your ddply code works for me.
Does this work for you?
library(plyr)
library(xts)
df <- ddply(
  .data = fcst,
  .variables = c("FcstDay", "StationID"),
  .fun = function(z){
    z.xts <- xts(z$PrecipQuantity_hSum, z$DateUtc)
    ends <- endpoints(z.xts, "hours", 6)
    precip6 <- as.data.frame(period.apply(z.xts, ends, sum))
    precip6$DateUtc <- rownames(precip6)
    rownames(precip6) <- NULL
    return(precip6)
  })
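If you would rather avoid xts altogether, one possible alternative (a sketch, not from the original answer) buckets DateUtc into 6-hour bins with cut() and sums within each StationID/FcstDay/bin group; note that cut() anchors the bin edges itself, so they may not line up exactly with endpoints():
library(dplyr)
precip6 <- fcst %>%
  mutate(bin = cut(DateUtc, breaks = "6 hours")) %>%
  group_by(StationID, FcstDay, bin) %>%
  summarise(PrecipQuantity_6hSum = sum(PrecipQuantity_hSum), .groups = "drop")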