I am working with a dataset containing observations of animals during given time periods, together with data on how much they ate during each period and some metadata about each animal, such as its ID number.
Here is a subset of the dataset:
start end id uptake
1: 2017-01-29 10:16:00 2017-01-29 12:25:00 14 2.04
2: 2017-01-29 10:16:00 2017-01-29 12:25:00 21 1.53
3: 2017-01-29 10:16:00 2017-01-29 12:25:00 12 0.17
4: 2017-01-29 10:16:00 2017-01-29 12:25:00 20 1.19
5: 2017-01-29 10:16:00 2017-01-29 12:25:00 19 0.85
6: 2017-01-31 09:48:00 2017-01-31 11:59:00 21 5.27
7: 2017-01-31 09:48:00 2017-01-31 11:59:00 34 1.87
8: 2017-01-31 11:50:00 2017-01-31 14:59:00 21 1.00
9: 2017-01-31 11:50:00 2017-01-31 14:59:00 34 0.14
10: 2017-01-31 11:50:00 2017-01-31 14:59:00 20 1.00
11: 2017-01-31 11:50:00 2017-01-31 14:59:00 19 0.43
12: 2017-01-31 11:50:00 2017-01-31 14:59:00 14 3.43
13: 2017-01-31 15:15:00 2017-01-31 16:21:00 12 1.00
14: 2017-01-31 15:15:00 2017-01-31 16:21:00 20 0.72
15: 2017-01-31 15:15:00 2017-01-31 16:21:00 14 0.86
16: 2017-01-31 15:15:00 2017-01-31 16:21:00 21 0.43
17: 2017-01-31 15:15:00 2017-01-31 16:21:00 19 0.57
18: 2017-02-01 09:55:00 2017-02-01 11:47:00 34 1.62
19: 2017-02-01 09:55:00 2017-02-01 11:47:00 21 3.06
20: 2017-02-01 12:03:00 2017-02-01 15:02:00 19 1.29
21: 2017-02-01 12:03:00 2017-02-01 15:02:00 14 3.86
Normally there is a maximum of one row per individual per day, as there was only one observation period. However, on some days there were multiple observation periods, so some ids have more than one row on those days. For these days I would like to collapse the multiple rows per individual, keeping the earliest start and latest end timestamps of that day's observation periods, summing the uptake values, and keeping the id value the same.
I am looking for something like this:
X start end id uptake
1 1 2017-01-29 10:16 2017-01-29 12:25 14 2.04
2 2 2017-01-29 10:16 2017-01-29 12:25 21 1.53
3 3 2017-01-29 10:16 2017-01-29 12:25 12 0.17
4 4 2017-01-29 10:16 2017-01-29 12:25 20 1.19
5 5 2017-01-29 10:16 2017-01-29 12:25 19 0.85
6 6 2017-01-31 09:48 2017-01-31 16:21 21 6.70
7 7 2017-01-31 09:48 2017-01-31 16:21 34 2.01
8 10 2017-01-31 11:50 2017-01-31 16:21 20 1.72
9 11 2017-01-31 11:50 2017-01-31 16:21 19 1.00
10 12 2017-01-31 11:50 2017-01-31 16:21 14 4.29
11 13 2017-01-31 15:15 2017-01-31 16:21 12 1.00
12 18 2017-02-01 09:55 2017-02-01 15:02 34 1.62
13 19 2017-02-01 09:55 2017-02-01 15:02 21 3.06
14 20 2017-02-01 12:03 2017-02-01 15:02 19 1.29
15 21 2017-02-01 12:03 2017-02-01 15:02 14 3.86
Within dplyr this is a task for group_by() and summarise():
library(dplyr)
library(lubridate)
df |>
  group_by(id, date = lubridate::date(start)) |>
  summarise(start = min(start),
            end = max(end),
            uptake = sum(uptake)) |>
  ungroup() |>
  arrange(start)
Output:
# A tibble: 15 × 5
id date start end uptake
<dbl> <date> <dttm> <dttm> <dbl>
1 12 2017-01-29 2017-01-29 10:16:00 2017-01-29 12:25:00 0.17
2 14 2017-01-29 2017-01-29 10:16:00 2017-01-29 12:25:00 2.04
3 19 2017-01-29 2017-01-29 10:16:00 2017-01-29 12:25:00 0.85
4 20 2017-01-29 2017-01-29 10:16:00 2017-01-29 12:25:00 1.19
5 21 2017-01-29 2017-01-29 10:16:00 2017-01-29 12:25:00 1.53
6 21 2017-01-31 2017-01-31 09:48:00 2017-01-31 16:21:00 6.7
7 34 2017-01-31 2017-01-31 09:48:00 2017-01-31 14:59:00 2.01
8 14 2017-01-31 2017-01-31 11:50:00 2017-01-31 16:21:00 4.29
9 19 2017-01-31 2017-01-31 11:50:00 2017-01-31 16:21:00 1
10 20 2017-01-31 2017-01-31 11:50:00 2017-01-31 16:21:00 1.72
11 12 2017-01-31 2017-01-31 15:15:00 2017-01-31 16:21:00 1
12 21 2017-02-01 2017-02-01 09:55:00 2017-02-01 11:47:00 3.06
13 34 2017-02-01 2017-02-01 09:55:00 2017-02-01 11:47:00 1.62
14 14 2017-02-01 2017-02-01 12:03:00 2017-02-01 15:02:00 3.86
15 19 2017-02-01 2017-02-01 12:03:00 2017-02-01 15:02:00 1.29
Data (please include data using dput next time):
library(readr)
library(dplyr)
df <- read_delim("idx,start,end,id,uptake
1:,2017-01-29 10:16:00,2017-01-29 12:25:00,14,2.04
2:,2017-01-29 10:16:00,2017-01-29 12:25:00,21,1.53
3:,2017-01-29 10:16:00,2017-01-29 12:25:00,12,0.17
4:,2017-01-29 10:16:00,2017-01-29 12:25:00,20,1.19
5:,2017-01-29 10:16:00,2017-01-29 12:25:00,19,0.85
6:,2017-01-31 09:48:00,2017-01-31 11:59:00,21,5.27
7:,2017-01-31 09:48:00,2017-01-31 11:59:00,34,1.87
8:,2017-01-31 11:50:00,2017-01-31 14:59:00,21,1.00
9:,2017-01-31 11:50:00,2017-01-31 14:59:00,34,0.14
10:,2017-01-31 11:50:00,2017-01-31 14:59:00,20,1.00
11:,2017-01-31 11:50:00,2017-01-31 14:59:00,19,0.43
12:,2017-01-31 11:50:00,2017-01-31 14:59:00,14,3.43
13:,2017-01-31 15:15:00,2017-01-31 16:21:00,12,1.00
14:,2017-01-31 15:15:00,2017-01-31 16:21:00,20,0.72
15:,2017-01-31 15:15:00,2017-01-31 16:21:00,14,0.86
16:,2017-01-31 15:15:00,2017-01-31 16:21:00,21,0.43
17:,2017-01-31 15:15:00,2017-01-31 16:21:00,19,0.57
18:,2017-02-01 09:55:00,2017-02-01 11:47:00,34,1.62
19:,2017-02-01 09:55:00,2017-02-01 11:47:00,21,3.06
20:,2017-02-01 12:03:00,2017-02-01 15:02:00,19,1.29
21:,2017-02-01 12:03:00,2017-02-01 15:02:00,14,3.86") |> select(-idx)
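For comparison, the same collapse can be written in data.table; a minimal sketch, assuming the df built above:
library(data.table)
# one row per id and day: earliest start, latest end, summed uptake
setDT(df)[, .(start = min(start), end = max(end), uptake = sum(uptake)),
          by = .(id, date = as.Date(start))][order(start)]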
I have hourly data of CO2 values and I would like to know the CO2 concentration during the night (e.g. 9pm-7am). A reproducible example:
library(tidyverse); library(lubridate)
times <- seq(ymd_hms("2020-01-01 08:00:00"),
             ymd_hms("2020-01-04 08:00:00"), by = "1 hours")
values <- runif(length(times), 1, 15)
df <- tibble(times, values)
How do I get mean nighttime values (e.g. between 9pm and 7am)? Of course I can filter like this:
df <- df %>%
  filter(!hour(times) %in% 8:20)
And then assign an ID to each observation during the night:
df$ID <- rep(LETTERS[1:round(nrow(df)/11)],
             times = 1, each = 11)
And finally group and summarise:
df_grouped <- df %>%
  group_by(ID) %>%
  summarise(value_mean = mean(values))
But I am sure this is not a good way. How can I do this better, especially the part where we assign IDs to the nighttime values?
You can use data.table::frollmean to get the means over a certain time window. In your case you want the means of the last 10 hours, so we set the n argument of the function to 10:
> df$means <- data.table::frollmean(df$values, 10)
> head(df, 20)
# A tibble: 20 x 3
times values means
<dttm> <dbl> <dbl>
1 2020-01-01 08:00:00 4.15 NA
2 2020-01-01 09:00:00 6.24 NA
3 2020-01-01 10:00:00 5.17 NA
4 2020-01-01 11:00:00 9.20 NA
5 2020-01-01 12:00:00 12.3 NA
6 2020-01-01 13:00:00 2.93 NA
7 2020-01-01 14:00:00 9.12 NA
8 2020-01-01 15:00:00 9.72 NA
9 2020-01-01 16:00:00 12.0 NA
10 2020-01-01 17:00:00 13.4 8.41
11 2020-01-01 18:00:00 10.2 9.01
12 2020-01-01 19:00:00 1.97 8.59
13 2020-01-01 20:00:00 11.9 9.26
14 2020-01-01 21:00:00 8.84 9.23
15 2020-01-01 22:00:00 10.1 9.01
16 2020-01-01 23:00:00 3.76 9.09
17 2020-01-02 00:00:00 9.98 9.18
18 2020-01-02 01:00:00 5.56 8.76
19 2020-01-02 02:00:00 5.22 8.09
20 2020-01-02 03:00:00 6.36 7.39
Each row of the means column is the mean of that row's values entry together with the previous 9 rows of the values column. Of course the first 9 rows will be NA.
You might also take a look at the tsibble package, which is built for manipulating time series.
You can parametrize the window length from the times you want, but your data need to be evenly spaced to use this solution:
n <- diff(which(grepl('20:00:00|08:00:00', df$times))) + 1
n <- unique(n)
df$means <- data.table::frollmean(df$values, n)
> head(df, 20)
# A tibble: 20 x 3
times values means
<dttm> <dbl> <dbl>
1 2020-01-01 08:00:00 11.4 NA
2 2020-01-01 09:00:00 7.03 NA
3 2020-01-01 10:00:00 7.15 NA
4 2020-01-01 11:00:00 6.91 NA
5 2020-01-01 12:00:00 8.18 NA
6 2020-01-01 13:00:00 4.70 NA
7 2020-01-01 14:00:00 13.8 NA
8 2020-01-01 15:00:00 5.16 NA
9 2020-01-01 16:00:00 12.3 NA
10 2020-01-01 17:00:00 3.81 NA
11 2020-01-01 18:00:00 3.09 NA
12 2020-01-01 19:00:00 9.89 NA
13 2020-01-01 20:00:00 1.24 7.28
14 2020-01-01 21:00:00 8.07 7.02
15 2020-01-01 22:00:00 5.59 6.91
16 2020-01-01 23:00:00 5.77 6.81
17 2020-01-02 00:00:00 10.7 7.10
18 2020-01-02 01:00:00 3.44 6.73
19 2020-01-02 02:00:00 10.3 7.16
20 2020-01-02 03:00:00 4.61 6.45
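Note that frollmean yields a rolling mean, not one value per night. If the goal is a single mean per night, a cleaner alternative to the manual ID assignment is to shift each timestamp so a whole night (9pm-7am) lands on one calendar date; a minimal sketch, assuming the df from the question:
library(dplyr)
library(lubridate)
night_means <- df %>%
  filter(hour(times) >= 21 | hour(times) <= 7) %>%  # keep 9pm through 7am
  mutate(night = as_date(times - hours(8))) %>%     # an 8-hour shift maps each night to one date
  group_by(night) %>%
  summarise(value_mean = mean(values))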
I have a data frame df1 that summarises the depth of different fishes over time at different places.
On the other hand, I have df2, which summarises the intensity of the currents over time (EVERY THREE HOURS) from the surface down to 39 meters depth, at intervals of 8 meters (m0-7, m8-15, m16-23, m24-31 and m32-39), in a specific place. As an example:
df1 <- data.frame(
  Datetime = c("2016-08-01 15:34:07", "2016-08-01 16:25:16", "2016-08-01 17:29:16",
               "2016-08-01 18:33:16", "2016-08-01 20:54:16", "2016-08-01 22:48:16"),
  Site = c("BD", "HG", "BD", "BD", "BD", "BD"),
  Ind = c(16, 17, 19, 16, 17, 16),
  Depth = c(5.3, 24, 36.4, 42, NA, 22.1))
df1$Datetime <- as.POSIXct(df1$Datetime, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
> df1
Datetime Site Ind Depth
1 2016-08-01 15:34:07 BD 16 5.3
2 2016-08-01 16:25:16 HG 17 24.0
3 2016-08-01 17:29:16 BD 19 36.4
4 2016-08-01 18:33:16 BD 16 42.0
5 2016-08-01 20:54:16 BD 17 NA
6 2016-08-01 22:48:16 BD 16 22.1
df2 <- data.frame(
  Datetime = c("2016-08-01 12:00:00", "2016-08-01 15:00:00", "2016-08-01 18:00:00",
               "2016-08-01 21:00:00", "2016-08-02 00:00:00"),
  Site = c("BD", "BD", "BD", "BD", "BD"),
  var1 = c(2.75, 4, 6.75, 2.25, 4.3),
  var2 = c(3, 4, 4.75, 3, 2.1),
  var3 = c(2.75, 4, 5.75, 2.25, 1.4),
  var4 = c(3.25, 3, 6.5, 2.75, 3.4),
  var5 = c(3, 4, 4.75, 3, 1.7))
df2$Datetime <- as.POSIXct(df2$Datetime, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
colnames(df2) <- c("Datetime", "Site", "m0-7", "m8-15", "m16-23", "m24-31", "m32-39")
> df2
Datetime Site m0-7 m8-15 m16-23 m24-31 m32-39
1 2016-08-01 12:00:00 BD 2.75 3.00 2.75 3.25 3.00
2 2016-08-01 15:00:00 BD 4.00 4.00 4.00 3.00 4.00
3 2016-08-01 18:00:00 BD 6.75 4.75 5.75 6.50 4.75
4 2016-08-01 21:00:00 BD 2.25 3.00 2.25 2.75 3.00
5 2016-08-02 00:00:00 BD 4.30 2.10 1.40 3.40 1.70
I want to create a variable in df1 that reflects the mean current for the depth layers in which the fish WASN'T. For instance, if a fish is at 20 meters depth, which corresponds to the layer m16-23, I want the mean current across the layers m0-7, m8-15, m24-31 and m32-39.
Note 1: if my fish went to a depth greater than 39 meters, I treat it as if it were in the deepest layer (m32-39). Row 4 of df1 is an example of this.
Note 2: since the current records are every three hours, each timestamp in df2$Datetime represents the window from an hour and a half before to an hour and a half after. That is, the current intensity given in df2 at 21:00:00 reflects the currents between 19:30:00 and 22:30:00, and likewise for the other timestamps.
I would expect this:
> df1
Datetime Site Ind Depth current.Mean
1 2016-08-01 15:34:07 BD 16 5.3 3.75
2 2016-08-01 16:25:16 HG 17 24.0 NA
3 2016-08-01 17:29:16 BD 19 36.4 5.94
4 2016-08-01 18:33:16 BD 16 42.0 5.94
5 2016-08-01 20:54:16 BD 17 NA NA
6 2016-08-01 22:48:16 BD 16 22.1 2.87
Does anyone know how to do it?
I'd approach this in two steps:
1. Make a lookup table with avg_speed_elsewhere for each Datetime, Site, and Depth in df2.
2. Join it to df1.
Here's a lookup table:
library(tidyverse)
df2_long <- df2 %>%
  gather(depth_rng, speed, `m0-7`:`m32-39`) %>%
  separate(depth_rng, c("min_depth", "max_depth")) %>%
  mutate_at(vars(matches("depth")), parse_number) %>%
  # EDIT -- added to make the deep category cover >39 too
  mutate(max_depth = if_else(max_depth == 39, 10000, max_depth)) %>%
  group_by(Datetime, Site) %>%
  # avg_speed_elsewhere is the sum of all speeds, minus this speed, divided by 4
  mutate(avg_speed_elsewhere = (sum(speed) - speed) / 4)
> df2_long
# A tibble: 25 x 6
# Groups: Datetime, Site [5]
Datetime Site min_depth max_depth speed avg_speed_elsewhere
<dttm> <fct> <dbl> <dbl> <dbl> <dbl>
1 2016-08-18 12:00:00 BD 0 7 2.75 3
2 2016-08-18 15:00:00 BD 0 7 4 3.75
3 2016-08-18 18:00:00 BD 0 7 6.75 5.44
4 2016-08-18 21:00:00 BD 0 7 2.25 2.75
5 2016-08-19 00:00:00 BD 0 7 4.3 2.15
6 2016-08-18 12:00:00 BD 8 15 3 2.94
7 2016-08-18 15:00:00 BD 8 15 4 3.75
8 2016-08-18 18:00:00 BD 8 15 4.75 5.94
9 2016-08-18 21:00:00 BD 8 15 3 2.56
10 2016-08-19 00:00:00 BD 8 15 2.1 2.7
# ... with 15 more rows
I expect this will work, but your provided data doesn't overlap so I'm not sure:
df1 %>%
  # EDIT - replaced floor_date with round_date
  mutate(Datetime_3hr = lubridate::round_date(Datetime, "3 hour")) %>%
  left_join(df2_long, by = c("Site", "Datetime_3hr" = "Datetime")) %>%
  filter(Depth >= min_depth & Depth < max_depth + 1 | is.na(Depth))
This question poses some interesting challenges:
The OP is asking for a "partial anti-join", i.e., the OP wants to aggregate current data in df2 where Datetime and Site are matching but the depth layer does not.
The current data df2 are given in a look-up table where each value is associated with a depth range (depth layer) and a time range of 3 hours. So, the measured Depth and Datetime in df1 need to be mapped onto the respective ranges.
I tried different approaches but ended up with the one below, which makes no assumptions about the aggregation function, so mean() can be called directly.
library(data.table)
library(magrittr)
# reshape df2 from wide to long format
currents <- melt(setDT(df2), id.vars = c("Datetime", "Site"),
                 variable.name = "layer", value.name = "current")
# create columns to join on
labels <- names(df2) %>% stringr::str_subset("^m")
breaks <- c(seq(0, 32, 8), Inf)
setDT(df1)[, layer := cut(Depth, breaks = breaks, labels = labels)]
df1[, current.dt := df2[df1, on = .(Site, Datetime),
                        roll = "nearest", x.Datetime]]
# "partial anti-join" to compute mean of other layers
currents_other_layers <-
  currents[df1, on = .(Site, Datetime = current.dt)][
    layer != i.layer, mean(current), by = .(i.Datetime, Site)]
# append result column
df1[currents_other_layers, on = .(Site, Datetime = i.Datetime), current.mean := i.V1]
df1
Datetime Site Ind Depth layer current.dt current.mean
1: 2016-08-01 15:34:07 BD 16 5.3 m0-7 2016-08-01 15:00:00 3.7500
2: 2016-08-01 16:25:16 HG 17 24.0 m16-23 <NA> NA
3: 2016-08-01 17:29:16 BD 19 36.4 m32-39 2016-08-01 18:00:00 5.9375
4: 2016-08-01 18:33:16 BD 16 42.0 m32-39 2016-08-01 18:00:00 5.9375
5: 2016-08-01 20:54:16 BD 17 NA <NA> 2016-08-01 21:00:00 NA
6: 2016-08-01 22:48:16 BD 16 22.1 m16-23 2016-08-02 00:00:00 2.8750
This reproduces OP's expected result.
Explanation
df2 is reshaped from wide to long format. This allows for joining / anti-joining on the layer column.
currents
Datetime Site layer current
1: 2016-08-01 12:00:00 BD m0-7 2.75
2: 2016-08-01 15:00:00 BD m0-7 4.00
3: 2016-08-01 18:00:00 BD m0-7 6.75
4: 2016-08-01 21:00:00 BD m0-7 2.25
5: 2016-08-02 00:00:00 BD m0-7 4.30
6: 2016-08-01 12:00:00 BD m8-15 3.00
7: 2016-08-01 15:00:00 BD m8-15 4.00
8: 2016-08-01 18:00:00 BD m8-15 4.75
9: 2016-08-01 21:00:00 BD m8-15 3.00
10: 2016-08-02 00:00:00 BD m8-15 2.10
11: 2016-08-01 12:00:00 BD m16-23 2.75
12: 2016-08-01 15:00:00 BD m16-23 4.00
13: 2016-08-01 18:00:00 BD m16-23 5.75
14: 2016-08-01 21:00:00 BD m16-23 2.25
15: 2016-08-02 00:00:00 BD m16-23 1.40
16: 2016-08-01 12:00:00 BD m24-31 3.25
17: 2016-08-01 15:00:00 BD m24-31 3.00
18: 2016-08-01 18:00:00 BD m24-31 6.50
19: 2016-08-01 21:00:00 BD m24-31 2.75
20: 2016-08-02 00:00:00 BD m24-31 3.40
21: 2016-08-01 12:00:00 BD m32-39 3.00
22: 2016-08-01 15:00:00 BD m32-39 4.00
23: 2016-08-01 18:00:00 BD m32-39 4.75
24: 2016-08-01 21:00:00 BD m32-39 3.00
25: 2016-08-02 00:00:00 BD m32-39 1.70
Datetime Site layer current
Now, df1 has to be amended to include columns which correspond to layer and Datetime in currents.
For Depth, the cut() function is used. The last layer level m32-39 is extended to Inf so that all depths greater than 32 m are included in this level, as requested by the OP.
For Datetime, a rolling join to the nearest Datetime in df2 is used. This is possible because df2$Datetime denotes the mid-point of the 3 hour time range.
After df1 has been prepared, we can do the "partial anti-join". Unfortunately, data.table's non-equi joins do not accept the != operator. So, we cannot write
currents[df1, on = .(Datetime = current.dt, Site, layer != layer)]
directly but have to use a work-around where we first pick the rows where we expect matches and then do an anti-join:
currents[df1, on = .(Datetime = current.dt, Site)][
  !df1, on = .(Datetime = current.dt, Site, layer)]
Datetime Site layer current i.Datetime Ind Depth i.layer
1: 2016-08-01 15:00:00 BD m8-15 4.00 2016-08-01 15:34:07 16 5.3 m0-7
2: 2016-08-01 15:00:00 BD m16-23 4.00 2016-08-01 15:34:07 16 5.3 m0-7
3: 2016-08-01 15:00:00 BD m24-31 3.00 2016-08-01 15:34:07 16 5.3 m0-7
4: 2016-08-01 15:00:00 BD m32-39 4.00 2016-08-01 15:34:07 16 5.3 m0-7
5: 2016-08-01 18:00:00 BD m0-7 6.75 2016-08-01 17:29:16 19 36.4 m32-39
6: 2016-08-01 18:00:00 BD m8-15 4.75 2016-08-01 17:29:16 19 36.4 m32-39
7: 2016-08-01 18:00:00 BD m16-23 5.75 2016-08-01 17:29:16 19 36.4 m32-39
8: 2016-08-01 18:00:00 BD m24-31 6.50 2016-08-01 17:29:16 19 36.4 m32-39
9: 2016-08-01 18:00:00 BD m0-7 6.75 2016-08-01 18:33:16 16 42.0 m32-39
10: 2016-08-01 18:00:00 BD m8-15 4.75 2016-08-01 18:33:16 16 42.0 m32-39
11: 2016-08-01 18:00:00 BD m16-23 5.75 2016-08-01 18:33:16 16 42.0 m32-39
12: 2016-08-01 18:00:00 BD m24-31 6.50 2016-08-01 18:33:16 16 42.0 m32-39
13: 2016-08-01 21:00:00 BD m0-7 2.25 2016-08-01 20:54:16 17 NA <NA>
14: 2016-08-01 21:00:00 BD m8-15 3.00 2016-08-01 20:54:16 17 NA <NA>
15: 2016-08-01 21:00:00 BD m16-23 2.25 2016-08-01 20:54:16 17 NA <NA>
16: 2016-08-01 21:00:00 BD m24-31 2.75 2016-08-01 20:54:16 17 NA <NA>
17: 2016-08-01 21:00:00 BD m32-39 3.00 2016-08-01 20:54:16 17 NA <NA>
18: 2016-08-02 00:00:00 BD m0-7 4.30 2016-08-01 22:48:16 16 22.1 m16-23
19: 2016-08-02 00:00:00 BD m8-15 2.10 2016-08-01 22:48:16 16 22.1 m16-23
20: 2016-08-02 00:00:00 BD m24-31 3.40 2016-08-01 22:48:16 16 22.1 m16-23
21: 2016-08-02 00:00:00 BD m32-39 1.70 2016-08-01 22:48:16 16 22.1 m16-23
22: <NA> HG <NA> NA 2016-08-01 16:25:16 17 24.0 m16-23
Datetime Site layer current i.Datetime Ind Depth i.layer
This can be aggregated as desired by an arbitrary aggregation function (no need to manually add single columns selectively):
currents_other_layers <-
  currents[df1, on = .(Datetime = current.dt, Site)][
    !df1, on = .(Datetime = current.dt, Site, layer)][
      !is.na(Depth), mean(current), by = .(i.Datetime, Site)]
currents_other_layers
i.Datetime Site V1
1: 2016-08-01 15:34:07 BD 3.7500
2: 2016-08-01 17:29:16 BD 5.9375
3: 2016-08-01 18:33:16 BD 5.9375
4: 2016-08-01 22:48:16 BD 2.8750
5: 2016-08-01 16:25:16 HG NA
This result contains the average currents of all layers except the observed one. Note that grouping is by i.Datetime, which refers to df1$Datetime, and by Site. Rows where Depth is missing in df1 are omitted to match the OP's expected result.
A final update join appends the result column to df1.
Using data.table you can do a rolling join between your two tables to associate your depth variables with your current variables even though the times don't match. What the rolling join does is associate one table with the other by the closest time (according to your chosen options). I changed your data a bit so that the days match.
library(data.table)
df1 <- data.frame(
  Datetime = c("2016-08-01 15:34:07", "2016-08-01 16:25:16", "2016-08-01 17:29:16",
               "2016-08-01 18:33:16", "2016-08-01 20:54:16", "2016-08-01 22:48:16"),
  Site = c("BD", "HG", "BD", "BD", "BD", "BD"),
  Ind = c(16, 17, 19, 16, 17, 16),
  Depth = c(5.3, 24, 36.4, 42, NA, 22.1))
df1$Datetime <- as.POSIXct(df1$Datetime, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
df2 <- data.frame(
  Datetime = c("2016-08-01 12:00:00", "2016-08-01 15:00:00", "2016-08-01 18:00:00",
               "2016-08-01 21:00:00", "2016-08-02 00:00:00"),
  Site = c("BD", "BD", "BD", "BD", "BD"),
  var1 = c(2.75, 4, 6.75, 2.25, 4.3),
  var2 = c(3, 4, 4.75, 3, 2.1),
  var3 = c(2.75, 4, 5.75, 2.25, 1.4),
  var4 = c(3.25, 3, 6.5, 2.75, 3.4),
  var5 = c(3, 4, 4.75, 3, 1.7))
df2$Datetime <- as.POSIXct(df2$Datetime, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
colnames(df2) <- c("Datetime", "Site", "m0-7", "m8-15", "m16-23", "m24-31", "m32-39")
setDT(df1)
setDT(df2)
setkey(df1, Site, Datetime)
setkey(df2, Site, Datetime)
df_merge = df2[df1, roll = Inf]
Then I use dplyr's case_when to compute the mean current over the other depth layers:
library(dplyr)
df_merge[, current_elsewhere := case_when(
  is.na(Depth) ~ NA_real_,
  Depth < 7 ~ (`m8-15` + `m16-23` + `m24-31` + `m32-39`)/4,
  Depth < 15 ~ (`m0-7` + `m16-23` + `m24-31` + `m32-39`)/4,
  Depth < 23 ~ (`m0-7` + `m8-15` + `m24-31` + `m32-39`)/4,
  Depth < 31 ~ (`m0-7` + `m8-15` + `m16-23` + `m32-39`)/4,
  TRUE ~ (`m0-7` + `m8-15` + `m16-23` + `m24-31`)/4)]
df_merge
Datetime Site m0-7 m8-15 m16-23 m24-31 m32-39 Ind Depth current_elsewhere
1: 2016-08-01 15:34:07 BD 4.00 4.00 4.00 3.00 4.00 16 5.3 3.7500
2: 2016-08-01 17:29:16 BD 4.00 4.00 4.00 3.00 4.00 19 36.4 3.7500
3: 2016-08-01 18:33:16 BD 6.75 4.75 5.75 6.50 4.75 16 42.0 5.9375
4: 2016-08-01 20:54:16 BD 6.75 4.75 5.75 6.50 4.75 17 NA NA
5: 2016-08-01 22:48:16 BD 2.25 3.00 2.25 2.75 3.00 16 22.1 2.7500
6: 2016-08-01 16:25:16 HG NA NA NA NA NA 17 24.0 NA
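One caveat: roll = Inf matches each row of df1 to the most recent earlier timestamp in df2. Since Note 2 says each current reading covers an hour and a half on either side, a nearest-time match may fit the OP's intent better; a one-line variant (a suggested adjustment, not part of the original answer):
df_merge = df2[df1, roll = "nearest"]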
I have a dataframe that contains hourly weather information. I would like to increase the granularity of the time measurements (5-minute intervals instead of 60-minute intervals), copying the other columns' data into the new rows:
Current Dataframe Structure:
Date Temperature Humidity
2015-01-01 00:00:00 25 0.67
2015-01-01 01:00:00 26 0.69
Target Dataframe Structure:
Date Temperature Humidity
2015-01-01 00:00:00 25 0.67
2015-01-01 00:05:00 25 0.67
2015-01-01 00:10:00 25 0.67
.
.
.
2015-01-01 00:55:00 25 0.67
2015-01-01 01:00:00 26 0.69
2015-01-01 01:05:00 26 0.69
2015-01-01 01:10:00 26 0.69
.
.
.
What I've Tried:
for(i in 1:nrow(df)) {
  five.minutes <- seq(df$date[i], length = 12, by = "5 mins")
  for(j in 1:length(five.minutes)) {
    df$date[i] <- rbind(five.minutes[j])
  }
}
Error I'm getting:
Error in as.POSIXct.numeric(value) : 'origin' must be supplied
One possible solution uses fill from tidyr and right_join from dplyr.
The approach is to create a date/time series running from the min to the max time (+55 minutes) of the dataframe, join the dataframe onto that series (which yields all the desired rows, with NA for Temperature and Humidity), and then use fill to populate the NA values with the previous valid values.
# Data
df <- read.table(text = "Date Temperature Humidity
'2015-01-01 00:00:00' 25 0.67
'2015-01-01 01:00:00' 26 0.69
'2015-01-01 02:00:00' 28 0.69
'2015-01-01 03:00:00' 25 0.69", header = T, stringsAsFactors = F)
df$Date <- as.POSIXct(df$Date, format = "%Y-%m-%d %H:%M:%S")
# Create a dataframe with all possible date/times at intervals of 5 mins
Dates <- data.frame(Date = seq(min(df$Date), max(df$Date)+3540, by = 5*60))
library(dplyr)
library(tidyr)

result <- df %>%
  right_join(Dates, by = "Date") %>%
  fill(Temperature, Humidity)
result
# Date Temperature Humidity
#1 2015-01-01 00:00:00 25 0.67
#2 2015-01-01 00:05:00 25 0.67
#3 2015-01-01 00:10:00 25 0.67
#4 2015-01-01 00:15:00 25 0.67
#5 2015-01-01 00:20:00 25 0.67
#6 2015-01-01 00:25:00 25 0.67
#7 2015-01-01 00:30:00 25 0.67
#8 2015-01-01 00:35:00 25 0.67
#9 2015-01-01 00:40:00 25 0.67
#10 2015-01-01 00:45:00 25 0.67
#11 2015-01-01 00:50:00 25 0.67
#12 2015-01-01 00:55:00 25 0.67
#13 2015-01-01 01:00:00 26 0.69
#14 2015-01-01 01:05:00 26 0.69
#.....
#.....
#44 2015-01-01 03:35:00 25 0.69
#45 2015-01-01 03:40:00 25 0.69
#46 2015-01-01 03:45:00 25 0.69
#47 2015-01-01 03:50:00 25 0.69
#48 2015-01-01 03:55:00 25 0.69
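The join against a pre-built sequence can also be written with tidyr::complete, which generates the grid in one step; a sketch of the same idea, under the same data and loaded packages as above:
result <- df %>%
  complete(Date = seq(min(Date), max(Date) + 3540, by = 5*60)) %>%
  fill(Temperature, Humidity)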
I think this might do it:
library(tibble)
library(lubridate)

df <- tibble(DateTime = c("2015-01-01 00:00:00", "2015-01-01 01:00:00"),
             Temperature = c(25, 26), Humidity = c(.67, .69))
df$DateTime <- ymd_hms(df$DateTime)

DateTime <- as.POSIXct(sapply(1:(nrow(df)-1), function(x)
    seq(from = df$DateTime[x], to = df$DateTime[x+1], by = "5 min")),
  origin = "1970-01-01", tz = "UTC")
Temperature <- c(sapply(1:(nrow(df)-1), function(x) rep(df$Temperature[x], 12)),
                 df$Temperature[nrow(df)])
Humidity <- c(sapply(1:(nrow(df)-1), function(x) rep(df$Humidity[x], 12)),
              df$Humidity[nrow(df)])
tibble(as.character(DateTime), Temperature, Humidity)
<chr> <dbl> <dbl>
1 2015-01-01 00:00:00 25.0 0.670
2 2015-01-01 00:05:00 25.0 0.670
3 2015-01-01 00:10:00 25.0 0.670
4 2015-01-01 00:15:00 25.0 0.670
5 2015-01-01 00:20:00 25.0 0.670
6 2015-01-01 00:25:00 25.0 0.670
7 2015-01-01 00:30:00 25.0 0.670
8 2015-01-01 00:35:00 25.0 0.670
9 2015-01-01 00:40:00 25.0 0.670
10 2015-01-01 00:45:00 25.0 0.670
11 2015-01-01 00:50:00 25.0 0.670
12 2015-01-01 00:55:00 25.0 0.670
13 2015-01-01 01:00:00 26.0 0.690
Link to data:
http://dl.dropbox.com/u/56075871/data.txt
I want to divide each observation by the mean for that hour. Example:
2012-01-02 10:00:00 5.23
2012-01-03 10:00:00 5.28
2012-01-04 10:00:00 5.29
2012-01-05 10:00:00 5.29
2012-01-09 10:00:00 5.28
2012-01-10 10:00:00 5.33
2012-01-11 10:00:00 5.42
2012-01-12 10:00:00 5.55
2012-01-13 10:00:00 5.68
2012-01-16 10:00:00 5.53
The mean of these is 5.388. Next I want to divide each observation by that mean: 5.23/5.388, 5.28/5.388, and so on through 5.53/5.388.
I have hourly timeseries for 10 stocks:
S1.1h S2.1h S3.1h S4.1h S5.1h S6.1h S7.1h S8.1h S9.1h S10.1h
2012-01-02 10:00:00 64.00 110.7 5.23 142.0 20.75 34.12 32.53 311.9 7.82 5.31
2012-01-02 11:00:00 64.00 110.8 5.30 143.2 20.90 34.27 32.81 312.0 7.97 5.34
2012-01-02 12:00:00 64.00 111.1 5.30 142.8 20.90 34.28 32.70 312.4 7.98 5.33
2012-01-02 13:00:00 61.45 114.7 5.30 143.1 21.01 34.35 32.85 313.0 7.96 5.35
2012-01-02 14:00:00 61.45 116.2 5.26 143.7 21.10 34.60 32.99 312.9 7.95 5.36
2012-01-02 15:00:00 63.95 116.2 5.26 143.2 21.26 34.72 33.00 312.6 7.99 5.37
2012-01-02 16:00:00 63.95 117.3 5.25 143.3 21.27 35.08 33.04 312.7 7.99 5.36
2012-01-02 17:00:00 63.95 117.8 5.24 144.7 21.25 35.40 33.10 313.6 7.99 5.40
2012-01-02 18:00:00 63.95 117.9 5.23 145.0 21.20 35.50 33.17 312.5 7.98 5.35
2012-01-03 10:00:00 63.95 115.5 5.28 143.5 21.15 35.31 33.05 311.7 7.94 5.37
...
And I want to divide each observation by its mean for that hour (periodically, by time of day).
I have some code. Code to compute the means:
# 10:00:00, 11:00:00, ..., 18:00:00
times <- paste(seq(10, 18), ":00:00", sep = "")
# means - matrix of means per hour and time series
means <- matrix(ncol = ncol(time_series), nrow = length(times))
for (t in 1:length(times)) {
  # t is time 10 to 18
  for (i in 1:ncol(time_series)) {
    # i is stock 1 to 10
    # hourly mean for each observation in data
    means[t, i] <- mean(time_series[grep(times[t], index(time_series)), i])
  }
}
And my loop to get "things done" (note that out must be initialised first):
out <- NULL
for (t in 1:length(times)) {
  # get all dates with times[t] hour
  hours <- time_series[grep(times[t], index(time_series))]
  ep <- endpoints(hours, "hours")
  out <- rbind(out, period.apply(hours, INDEX = ep, FUN = function(x) {
    x / means[t, ]
  }))
}
I know this is awful, but it works. How can I simplify this code?
Here's one way to do it:
# Split the xts object into chunks by hour
# .indexhour() returns the hourly portion for each timestamp
s <- split(time_series, .indexhour(time_series))
# Use sweep to divide each value of x by colMeans(x) for each group of hours
l <- lapply(s, function(x) sweep(x, 2, colMeans(x), FUN="/"))
# rbind everything back together
r <- do.call(rbind, l)
The scale function can do that. Used with ave you could restrict the calculations to within hours. Post the results of dput on that xts/zoo object and you will get rapid replies.
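A minimal sketch of that suggestion for a single column, assuming x is a one-column xts object (here ave() groups by hour of day and divides by the group mean directly, rather than calling scale()):
library(xts)
# hour-of-day group means, one value per observation
hour_means <- ave(as.numeric(coredata(x)), .indexhour(x), FUN = mean)
scaled <- x / hour_means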