Changing unevenly spaced time data into evenly spaced hourly data in R

I have some weather data that comes in unevenly spaced, and I would like to grab the simple hourly values. I need hourly data so I can join it with a separate data.frame.
Example of the weather data:
> weather_df
# A tibble: 10 × 3
datetime temperature temperature_dewpoint
<dttm> <dbl> <dbl>
1 2011-01-01 00:00:00 4 -1
2 2011-01-01 00:20:00 3 -1
3 2011-01-01 00:40:00 3 -1
4 2011-01-01 01:00:00 2 -1
5 2011-01-01 01:20:00 2 0
6 2011-01-01 01:45:00 2 0
7 2011-01-01 02:05:00 1 -1
8 2011-01-01 02:25:00 2 0
9 2011-01-01 02:45:00 2 -1
10 2011-01-01 03:10:00 2 0
I would like to only have hourly data, but as you can see observations don't always fall on the hour mark. I've tried rounding but then I have multiple observations with the same time.
weather_df$datetime_rounded <- as.POSIXct(round(weather_df$datetime, units = c("hours")))
weather_df
# A tibble: 10 × 4
datetime temperature temperature_dewpoint datetime_rounded
<dttm> <dbl> <dbl> <dttm>
1 2011-01-01 00:00:00 4 -1 2011-01-01 00:00:00
2 2011-01-01 00:20:00 3 -1 2011-01-01 00:00:00
3 2011-01-01 00:40:00 3 -1 2011-01-01 01:00:00
4 2011-01-01 01:00:00 2 -1 2011-01-01 01:00:00
5 2011-01-01 01:20:00 2 0 2011-01-01 01:00:00
6 2011-01-01 01:45:00 2 0 2011-01-01 02:00:00
7 2011-01-01 02:05:00 1 -1 2011-01-01 02:00:00
8 2011-01-01 02:25:00 2 0 2011-01-01 02:00:00
9 2011-01-01 02:45:00 2 -1 2011-01-01 03:00:00
10 2011-01-01 03:10:00 2 0 2011-01-01 03:00:00
I can't easily determine which observation to keep without computing the difference between datetime and datetime_rounded. There must be a more elegant way to do this. Any help would be appreciated!

Here is my non-elegant solution.
I calculated the absolute distance between datetime and datetime_rounded:
weather_df$time_dist <- abs(weather_df$datetime - weather_df$datetime_rounded)
Then I sorted by the distance:
weather_df <- weather_df[order(weather_df$time_dist),]
Then I removed duplicates of the rounded column. Since it's sorted, this keeps the observation closest to the round hour:
weather_df <- weather_df[!duplicated(weather_df$datetime_rounded),]
Then I sorted back by time:
weather_df <- weather_df[order(weather_df$datetime_rounded),]
Surely there has to be a better way to do this. I'm not very familiar yet with working with time series in R.
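For what it's worth, a more compact route might be dplyr plus lubridate: round each timestamp to the hour, then keep the single closest observation per rounded hour. A sketch, assuming dplyr >= 1.0 for slice_min():
library(dplyr)
library(lubridate)

weather_df %>%
  mutate(datetime_rounded = round_date(datetime, unit = "hour")) %>%
  group_by(datetime_rounded) %>%
  # keep the one observation closest to each rounded hour
  slice_min(abs(difftime(datetime, datetime_rounded)), n = 1, with_ties = FALSE) %>%
  ungroup()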

Related

How to estimate percent of time seen depending on the variables "Datetime" and "Number of times seen per hour"

In this post (How to add a variable that estimate the proportion of days someone has been seen since the first time) I asked something with a similar final goal, but here the dataframe is entirely different.
Here, df1 summarises per hour (Datetime) the number of times that a specific animal (ID) has been seen (Times_seen_per_hour) within a particular area of interest. Since we know whether the animal was in this area at each hour, we also created the column Presence, which indicates whether the animal was in the area where we can detect it.
I want to know the proportion of hours in which the animal was detected, relative to the total number of hours we know the animal was in the area.
Here is an example of what I have now:
library(lubridate)

df1 <- data.frame(
  Datetime = ymd_hms(c("2019-05-20 12:00:00", "2019-05-20 12:00:00", "2019-05-20 13:00:00", "2019-05-20 13:00:00",
                       "2019-05-20 14:00:00", "2019-05-20 14:00:00", "2019-05-20 15:00:00", "2019-05-20 15:00:00",
                       "2019-05-20 16:00:00", "2019-05-20 16:00:00", "2019-05-20 17:00:00", "2019-05-20 17:00:00",
                       "2019-05-20 18:00:00", "2019-05-20 18:00:00", "2019-05-20 19:00:00", "2019-05-20 19:00:00")),
  ID = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2),
  Times_seen_per_hour = c(3, 0, 0, 4, 2, 1, 3, 2, 1, 0, 0, 0, 7, 0, 4, 1),
  Presence = c(TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE)
)
df1
Datetime ID Times_seen_per_hour Presence
1 2019-05-20 12:00:00 1 3 TRUE
2 2019-05-20 12:00:00 2 0 FALSE
3 2019-05-20 13:00:00 1 0 TRUE
4 2019-05-20 13:00:00 2 4 TRUE
5 2019-05-20 14:00:00 1 2 TRUE
6 2019-05-20 14:00:00 2 1 TRUE
7 2019-05-20 15:00:00 1 3 TRUE
8 2019-05-20 15:00:00 2 2 TRUE
9 2019-05-20 16:00:00 1 1 TRUE
10 2019-05-20 16:00:00 2 0 FALSE
11 2019-05-20 17:00:00 1 0 TRUE
12 2019-05-20 17:00:00 2 0 FALSE
13 2019-05-20 18:00:00 1 7 TRUE
14 2019-05-20 18:00:00 2 0 TRUE
15 2019-05-20 19:00:00 1 4 TRUE
16 2019-05-20 19:00:00 2 1 TRUE
As mentioned, I need to create a new variable called Prop_hours_seen that indicates the proportion of hours that the animal has been seen relative to the total number of hours we know the animal was there (Presence == TRUE).
I would expect this:
> df1
Datetime ID Times_seen_per_hour Presence Prop_hours_seen
1 2019-05-20 12:00:00 1 3 TRUE 1.00 # We divide number of hours seen between total number of hours it could have been seen, that is 1/1.
2 2019-05-20 12:00:00 2 0 FALSE NA # We don't consider this hour since the animal wasn't in our area of interest.
3 2019-05-20 13:00:00 1 0 TRUE 0.50 # We divide number of hours seen (it was seen 1 hour) between total number of hours it could have been seen (it could have been seen at 12:00:00 and at 13:00:00), that is 1/2=0.5.
4 2019-05-20 13:00:00 2 4 TRUE 1.00
5 2019-05-20 14:00:00 1 2 TRUE 0.66
6 2019-05-20 14:00:00 2 1 TRUE 1.00
7 2019-05-20 15:00:00 1 3 TRUE 0.75
8 2019-05-20 15:00:00 2 2 TRUE 1.00
9 2019-05-20 16:00:00 1 1 TRUE 0.80
10 2019-05-20 16:00:00 2 0 FALSE NA
11 2019-05-20 17:00:00 1 0 TRUE 0.66
12 2019-05-20 17:00:00 2 0 FALSE NA
13 2019-05-20 18:00:00 1 7 TRUE 0.71
14 2019-05-20 18:00:00 2 0 TRUE 0.75
15 2019-05-20 19:00:00 1 4 TRUE 0.75
16 2019-05-20 19:00:00 2 1 TRUE 0.80
I know this is complex to understand. Does anyone know how to do it?
This seems to match your desired output.
Be warned: this assumes every hour has a row for every ID, since hours_passed is computed as 1:length(Datetime).
library(dplyr)

df1 %>%
  arrange(ID, Datetime) %>%
  group_by(ID) %>%
  mutate(hours_passed = 1:length(Datetime),
         hours_seen = cumsum(Times_seen_per_hour > 0),
         cumulative_presence = cumsum(Presence),
         prop_hours_seen = hours_seen / cumulative_presence,
         prop_hours_seen = ifelse(Presence, prop_hours_seen, NA)) %>%
  arrange(Datetime, ID)
Datetime ID Times_seen_per_hour Presence prop_hours_seen
<dttm> <dbl> <dbl> <lgl> <dbl>
1 2019-05-20 12:00:00 1 3 TRUE 1
2 2019-05-20 12:00:00 2 0 FALSE NA
3 2019-05-20 13:00:00 1 0 TRUE 0.5
4 2019-05-20 13:00:00 2 4 TRUE 1
5 2019-05-20 14:00:00 1 2 TRUE 0.667
6 2019-05-20 14:00:00 2 1 TRUE 1
7 2019-05-20 15:00:00 1 3 TRUE 0.75
8 2019-05-20 15:00:00 2 2 TRUE 1
9 2019-05-20 16:00:00 1 1 TRUE 0.8
10 2019-05-20 16:00:00 2 0 FALSE NA
11 2019-05-20 17:00:00 1 0 TRUE 0.667
12 2019-05-20 17:00:00 2 0 FALSE NA
13 2019-05-20 18:00:00 1 7 TRUE 0.714
14 2019-05-20 18:00:00 2 0 TRUE 0.75
15 2019-05-20 19:00:00 1 4 TRUE 0.75
16 2019-05-20 19:00:00 2 1 TRUE 0.8
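If some hours are missing from df1, one way to make the every-hour assumption hold is to fill them in with tidyr::complete before running the code above. A sketch; the fill values (zero sightings, Presence = FALSE) are assumptions about how absent hours should be treated:
library(dplyr)
library(tidyr)

df1_complete <- df1 %>%
  complete(
    # generate every hour between the first and last observation, for every ID
    Datetime = seq(min(Datetime), max(Datetime), by = "hour"),
    ID,
    fill = list(Times_seen_per_hour = 0, Presence = FALSE)
  )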

How to calculate number of hours from a fixed start point that varies among levels of a variable

The dataframe df1 summarizes detections of different individuals (ID) through time (Datetime). As a short example:
library(lubridate)
df1 <- data.frame(
  ID = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2),
  Datetime = ymd_hms(c("2016-08-21 00:00:00", "2016-08-24 08:00:00", "2016-08-23 12:00:00", "2016-08-29 03:00:00",
                       "2016-08-27 23:00:00", "2016-09-02 02:00:00", "2016-09-01 12:00:00", "2016-09-09 04:00:00",
                       "2016-09-01 12:00:00", "2016-09-10 12:00:00"))
)
> df1
ID Datetime
1 1 2016-08-21 00:00:00
2 2 2016-08-24 08:00:00
3 1 2016-08-23 12:00:00
4 2 2016-08-29 03:00:00
5 1 2016-08-27 23:00:00
6 2 2016-09-02 02:00:00
7 1 2016-09-01 12:00:00
8 2 2016-09-09 04:00:00
9 1 2016-09-01 12:00:00
10 2 2016-09-10 12:00:00
I want to calculate for each row, the number of hours (Hours_since_begining) since the first time that the individual was detected.
I would expect something like this (it may contain some mistakes since I did the calculations by hand):
> df1
ID Datetime Hours_since_begining
1 1 2016-08-21 00:00:00 0
2 2 2016-08-24 08:00:00 0
3 1 2016-08-23 12:00:00 60 # Number of hours between "2016-08-21 00:00:00" (first time detected the Ind 1) and "2016-08-23 12:00:00"
4 2 2016-08-29 03:00:00 115
5 1 2016-08-27 23:00:00 167 # Number of hours between "2016-08-21 00:00:00" (first time detected the Ind 1) and "2016-08-27 23:00:00"
6 2 2016-09-02 02:00:00 210
7 1 2016-09-01 12:00:00 276
8 2 2016-09-09 04:00:00 380
9 1 2016-09-01 12:00:00 276
10 2 2016-09-10 12:00:00 412
Does anyone know how to do it?
Thanks in advance!
You can do this:
library(tidyverse)
# First, get the min datetime by ID:
min_datetime_id <- df1 %>%
  group_by(ID) %>%
  summarise(min_datetime = min(Datetime))
# Then join with df1 and compute the time difference:
df1 <- df1 %>%
  left_join(min_datetime_id, by = "ID") %>%
  mutate(Hours_since_beginning = as.numeric(difftime(Datetime, min_datetime, units = "hours")))
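The join can also be avoided with a grouped mutate, since min(Datetime) is evaluated per group. A minimal sketch of the equivalent, assuming the same df1:
library(dplyr)

df1 <- df1 %>%
  group_by(ID) %>%
  mutate(Hours_since_beginning =
           as.numeric(difftime(Datetime, min(Datetime), units = "hours"))) %>%
  ungroup()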

How to generate a sequence of dates and times with a specific start date/time in R

I am looking to generate or complete a column of dates and times. I have a dataframe of four numeric columns and one POSIXct time column that looks like this:
CH_1 CH_2 CH_3 CH_4 date_time
1 -10096 -11940 -9340 -9972 2018-07-24 10:45:01
2 -10088 -11964 -9348 -9960 <NA>
3 -10084 -11940 -9332 -9956 <NA>
4 -10088 -11956 -9340 -9960 <NA>
5 -10084 -11944 -9332 -9976 <NA>
6 -10076 -11940 -9340 -9948 <NA>
7 -10088 -11956 -9352 -9960 <NA>
8 -10084 -11944 -9348 -9980 <NA>
9 -10076 -11964 -9348 -9976 <NA>
10 -10076 -11956 -9348 -9964 <NA>
I would like to sequentially generate dates and times for the date_time column, increasing by 1 second until the dataframe is filled (i.e. the next date/time should be 2018-07-24 10:45:02). This is meant to be reproducible for multiple datasets; the number of rows that need to be filled is not always known, but the start date/time will always be present in the first cell.
I know that the solution is likely within seq.Date (or similar), but the problem I have is that I won't always know the end date/time, which is what most examples I have found require. Any help would be appreciated!
Here's a tidyverse solution, using Zygmunt Zawadzki's example data:
library(lubridate)
library(tidyverse)
df %>% mutate(date_time = date_time[1] + seconds(row_number()-1))
Output:
date_time
1 2018-01-01 00:00:00
2 2018-01-01 00:00:01
3 2018-01-01 00:00:02
4 2018-01-01 00:00:03
5 2018-01-01 00:00:04
6 2018-01-01 00:00:05
7 2018-01-01 00:00:06
8 2018-01-01 00:00:07
9 2018-01-01 00:00:08
10 2018-01-01 00:00:09
11 2018-01-01 00:00:10
Data:
df <- data.frame(date_time = c(as.POSIXct("2018-01-01 00:00:00"), rep(NA,10)))
No need for lubridate, just base R code:
x <- data.frame(date = c(as.POSIXct("2018-01-01 00:00:00"), rep(NA,10)))
startDate <- x[["date"]][1]
x[["date2"]] <- startDate + (seq_len(nrow(x)) - 1)
x
# date date2
# 1 2018-01-01 2018-01-01 00:00:00
# 2 <NA> 2018-01-01 00:00:01
# 3 <NA> 2018-01-01 00:00:02
# 4 <NA> 2018-01-01 00:00:03
# 5 <NA> 2018-01-01 00:00:04
# 6 <NA> 2018-01-01 00:00:05
# 7 <NA> 2018-01-01 00:00:06
# 8 <NA> 2018-01-01 00:00:07
# 9 <NA> 2018-01-01 00:00:08
# 10 <NA> 2018-01-01 00:00:09
# 11 <NA> 2018-01-01 00:00:10
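For completeness, since the question mentions not knowing the end date/time: seq() on a POSIXct start also accepts length.out, which sidesteps that entirely. A minimal sketch reusing x and startDate from above:
# one timestamp per row, 1 second apart, no end time needed
x[["date2"]] <- seq(from = startDate, by = "1 sec", length.out = nrow(x))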

How can I group values by hour and count the cumulative totals in other columns

I have a data frame that is aggregated per minute (where one row represents one minute in YYYY-MM-DD HH:MM:SS format).
I want to group each minute value into their respective hour values/bins.
I have also extracted the hour value from the date field into another column in order to group the data more easily (YYYY-MM-DD HH).
I have looked at several approaches/answers where people recommend using lubridate/dplyr/anytime but no approach seems to have worked completely for me.
My data frame:
> df
date hour available busy
1 2018-03-01 01:00:00 2018-03-01 01:00:00 1 1
2 2018-03-01 01:01:00 2018-03-01 01:00:00 1 1
3 2018-03-01 01:02:00 2018-03-01 01:00:00 1 1
4 2018-03-01 01:03:00 2018-03-01 01:00:00 1 1
5 2018-03-01 01:04:00 2018-03-01 01:00:00 1 1
6 2018-03-01 01:05:00 2018-03-01 01:00:00 1 1
...
7907 2018-03-14 00:54:00 2018-03-14 1 0
7908 2018-03-14 00:55:00 2018-03-14 1 0
7909 2018-03-14 00:56:00 2018-03-14 2 0
7910 2018-03-14 00:57:00 2018-03-14 1 0
7911 2018-03-14 00:58:00 2018-03-14 1 0
7912 2018-03-14 00:59:00 2018-03-14 1 0
I want to group everything by hour for each date (I don't mind if I use the hour column or whether the values are grouped by the HH value in the date column) and list the CUMULATIVE number of available and busy for each hour group.
My desired output df will look like this (note that these are dummy values and not the actual values):
date available busy
1 2018-03-01 01:00:00 1 6
2 2018-03-01 02:00:00 2 11
3 2018-03-01 03:00:00 10 8
...
450 2018-03-14 08:00:00 11 1
451 2018-03-14 09:00:00 24 19
452 2018-03-14 10:00:00 12 4
Here's the dplyr code to do that (note: library(dplyr) is what's needed here, and the original busy = sum(available) was a typo):
library(dplyr)

df2 <- df %>%
  group_by(hour) %>%
  summarize(
    available = sum(available),
    busy = sum(busy)
  ) %>%
  ungroup()
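If the hour column did not already exist, it could be derived from date with lubridate::floor_date before grouping. A sketch under that assumption:
library(dplyr)
library(lubridate)

df2 <- df %>%
  mutate(hour = floor_date(date, unit = "hour")) %>%  # truncate each minute to its hour
  group_by(hour) %>%
  summarize(available = sum(available), busy = sum(busy)) %>%
  ungroup()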

Summations by conditions on another row dealing with time

I am looking to run a cumulative sum at every row for values that occur in two columns before and after that point. In this case I have the volume of two incident types at every given minute over two days. I want to create a column which, for each row, adds all the incidents that occurred before and after, by type. SUMIF from Excel comes to mind, but I'm not sure how to port that over to R:
EDIT: ADDED set.seed and easier numbers
I have the following data set:
library(data.table)

set.seed(42)
master_min <-
  setDT(
    data.frame(master_min = seq(
      from = as.POSIXct("2016-1-1 0:00", tz = "America/New_York"),
      to = as.POSIXct("2016-1-2 23:00", tz = "America/New_York"),
      by = "min"
    ))
  )
incident1 <- round(runif(2821, min = 0, max = 10))
incident2 <- round(runif(2821, min = 0, max = 10))
master_min <- head(cbind(master_min, incident1, incident2), 5)
How do I essentially compute the following logic:
for each row, sum all the incident1 values that occurred before that row's timestamp and all the incident2 values that occurred after that row's timestamp? It would be great to get a data.table solution, if not dplyr, as I am working with a large dataset. Below is a before and after for the data:
BEFORE:
master_min incident1 incident2
1: 2016-01-01 00:00:00 9 6
2: 2016-01-01 00:01:00 9 5
3: 2016-01-01 00:02:00 3 5
4: 2016-01-01 00:03:00 8 6
5: 2016-01-01 00:04:00 6 9
AFTER THE CALCULATION:
master_min incident1 incident2 new_column
1: 2016-01-01 00:00:00 9 6 25
2: 2016-01-01 00:01:00 9 5 29
3: 2016-01-01 00:02:00 3 5 33
4: 2016-01-01 00:03:00 8 6 30
5: 2016-01-01 00:04:00 6 9 29
If I understand correctly:
# Cumsum of incident1, without current row:
master_min$sum1 <- cumsum(master_min$incident1) - master_min$incident1
# Reverse cumsum of incident2, without current row:
master_min$sum2 <- rev(cumsum(rev(master_min$incident2))) - master_min$incident2
# Your new column:
master_min$new_column <- master_min$sum1 + master_min$sum2
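Since the question asks for a data.table solution, the same exclusive sums can be written with := on the existing data.table. A sketch of the equivalent idiom:
library(data.table)
# Cumulative sum of incident1, excluding the current row:
master_min[, sum1 := cumsum(incident1) - incident1]
# Reverse cumulative sum of incident2, excluding the current row:
master_min[, sum2 := rev(cumsum(rev(incident2))) - incident2]
master_min[, new_column := sum1 + sum2]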
Update: the following two lines can do the job:
master_min$sum1 <- cumsum(master_min$incident1)
master_min$sum2 <- sum(master_min$incident2) - cumsum(master_min$incident2)
I reworked the example a bit to show a more comprehensive structure:
library(data.table)
master_min <-
setDT(
data.frame(master_min = seq(
from=as.POSIXct("2016-1-1 0:00", tz="America/New_York"),
to=as.POSIXct("2016-1-1 0:09", tz="America/New_York"),
by="min"
))
)
set.seed(2)
incident1= as.integer(runif(10, min=0, max=10))
incident2= as.integer(runif(10, min=0, max=10))
master_min = cbind(master_min, incident1, incident2)
Now master_min looks like this:
> master_min
master_min incident1 incident2
1: 2016-01-01 00:00:00 1 5
2: 2016-01-01 00:01:00 7 2
3: 2016-01-01 00:02:00 5 7
4: 2016-01-01 00:03:00 1 1
5: 2016-01-01 00:04:00 9 4
6: 2016-01-01 00:05:00 9 8
7: 2016-01-01 00:06:00 1 9
8: 2016-01-01 00:07:00 8 2
9: 2016-01-01 00:08:00 4 4
10: 2016-01-01 00:09:00 5 0
Apply transformations
master_min$sum1 <- cumsum(master_min$incident1)
master_min$sum2 <- sum(master_min$incident2) - cumsum(master_min$incident2)
Results
> master_min
master_min incident1 incident2 sum1 sum2
1: 2016-01-01 00:00:00 1 5 1 37
2: 2016-01-01 00:01:00 7 2 8 35
3: 2016-01-01 00:02:00 5 7 13 28
4: 2016-01-01 00:03:00 1 1 14 27
5: 2016-01-01 00:04:00 9 4 23 23
6: 2016-01-01 00:05:00 9 8 32 15
7: 2016-01-01 00:06:00 1 9 33 6
8: 2016-01-01 00:07:00 8 2 41 4
9: 2016-01-01 00:08:00 4 4 45 0
10: 2016-01-01 00:09:00 5 0 50 0
