Let's say I have a dataframe which contains a time series, as below:
Date value
2000-01-01 00:00:00 4.6
2000-01-01 01:00:00 N/A
2000-01-01 02:00:00 5.3
2000-01-01 03:00:00 6.0
2000-01-01 04:00:00 N/A
2000-01-01 05:00:00 N/A
2000-01-01 06:00:00 N/A
2000-01-01 07:00:00 6.0
I want to find an efficient way to calculate the size of each gap (the number of consecutive N/As) and add it as a new column of my dataframe, to get the following:
Date value gap_size
2000-01-01 00:00:00 4.6 0
2000-01-01 01:00:00 N/A 1
2000-01-01 02:00:00 5.3 0
2000-01-01 03:00:00 6.0 0
2000-01-01 04:00:00 N/A 3
2000-01-01 05:00:00 N/A 3
2000-01-01 06:00:00 N/A 3
2000-01-01 07:00:00 6.0 0
My dataframe in reality has more than 6 million rows, so I am looking for the cheapest approach in terms of computation. Note that my time series is equally spaced over the whole dataset (1-hour intervals).
You could use rle here to generate run lengths. First, convert your value column to logical using is.na, then apply rle, which gives the run lengths of the different values of the input vector. In this case the two categories are TRUE and FALSE, and you are counting how long each runs. You can then replicate the result by the run lengths to get the output you're looking for.
x <- c(1, 2, 4, NA, NA, 6, NA, 19, NA, NA)
res <- rle(is.na(x))                          # run lengths of NA / non-NA stretches
rep(res$values * res$lengths, res$lengths)    # TRUE coerces to 1, FALSE to 0
#> [1] 0 0 0 2 2 0 1 0 2 2
Convert to a data.table with setDT() and:
dt[, gap := {r <- rle(value); rep(r$lengths, r$lengths) * (value == "N/A")}]   # single rle pass instead of two
Date value gap
1: 2000-01-01 00:00:00 4.6 0
2: 2000-01-01 01:00:00 N/A 1
3: 2000-01-01 02:00:00 5.3 0
4: 2000-01-01 03:00:00 6.0 0
5: 2000-01-01 04:00:00 N/A 3
6: 2000-01-01 05:00:00 N/A 3
7: 2000-01-01 06:00:00 N/A 3
8: 2000-01-01 07:00:00 6.0 0
Data:
dt <- structure(list(Date = c("2000-01-01 00:00:00", "2000-01-01 01:00:00",
"2000-01-01 02:00:00", "2000-01-01 03:00:00", "2000-01-01 04:00:00",
"2000-01-01 05:00:00", "2000-01-01 06:00:00", "2000-01-01 07:00:00"
), value = c("4.6", "N/A", "5.3", "6.0", "N/A", "N/A", "N/A",
"6.0")), row.names = c(NA, -8L), class = c("data.table", "data.frame"
))
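Note that the one-liner above relies on value being the literal string "N/A", as in the dt data below it. If your column instead holds real NA values, the same rle idea still applies; a sketch using is.na():
library(data.table)
setDT(dt)
dt[, gap := {
  r <- rle(is.na(value))                 # runs of NA / non-NA stretches
  rep(r$lengths * r$values, r$lengths)   # TRUE runs keep their length, FALSE runs become 0
}]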
The dataframe df1 summarizes detections of different individuals (ID) through time (Datetime). As a short example:
library(lubridate)
df1 <- data.frame(ID = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2),
                  Datetime = ymd_hms(c("2016-08-21 00:00:00", "2016-08-24 08:00:00",
                                       "2016-08-23 12:00:00", "2016-08-29 03:00:00",
                                       "2016-08-27 23:00:00", "2016-09-02 02:00:00",
                                       "2016-09-01 12:00:00", "2016-09-09 04:00:00",
                                       "2016-09-01 12:00:00", "2016-09-10 12:00:00")))
> df1
ID Datetime
1 1 2016-08-21 00:00:00
2 2 2016-08-24 08:00:00
3 1 2016-08-23 12:00:00
4 2 2016-08-29 03:00:00
5 1 2016-08-27 23:00:00
6 2 2016-09-02 02:00:00
7 1 2016-09-01 12:00:00
8 2 2016-09-09 04:00:00
9 1 2016-09-01 12:00:00
10 2 2016-09-10 12:00:00
I want to calculate, for each row, the number of hours (Hours_since_beginning) since the first time that individual was detected.
I would expect something like this (it may contain some mistakes, since I did the calculations by hand):
> df1
ID Datetime Hours_since_beginning
1 1 2016-08-21 00:00:00 0
2 2 2016-08-24 08:00:00 0
3 1 2016-08-23 12:00:00 60 # Number of hours between "2016-08-21 00:00:00" (first time detected the Ind 1) and "2016-08-23 12:00:00"
4 2 2016-08-29 03:00:00 115
5 1 2016-08-27 23:00:00 167 # Number of hours between "2016-08-21 00:00:00" (first time detected the Ind 1) and "2016-08-27 23:00:00"
6 2 2016-09-02 02:00:00 210
7 1 2016-09-01 12:00:00 276
8 2 2016-09-09 04:00:00 380
9 1 2016-09-01 12:00:00 276
10 2 2016-09-10 12:00:00 412
Does anyone know how to do it?
Thanks in advance!
You can do this:
library(tidyverse)
# first get the min datetime by ID
min_datetime_id <- df1 %>% group_by(ID) %>% summarise(min_datetime = min(Datetime))
# join with df1 and compute the time difference in hours
df1 <- df1 %>%
  left_join(min_datetime_id, by = "ID") %>%
  mutate(Hours_since_beginning = as.numeric(difftime(Datetime, min_datetime, units = "hours")))
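For what it's worth, the join can be skipped entirely by computing the per-ID minimum inside mutate; a one-step sketch of the same idea:
library(dplyr)
df1 <- df1 %>%
  group_by(ID) %>%
  mutate(Hours_since_beginning =
           as.numeric(difftime(Datetime, min(Datetime), units = "hours"))) %>%
  ungroup()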
I have a date-time column with non-consecutive date-times (all on the hour), like this:
dat <- data.frame(dt = as.POSIXct(c("2018-01-01 12:00:00",
"2018-01-13 01:00:00",
"2018-02-01 11:00:00")))
# Output:
# dt
#1 2018-01-01 12:00:00
#2 2018-01-13 01:00:00
#3 2018-02-01 11:00:00
I'd like to expand the rows of column dt so that every hour between the minimum and maximum date-times is present, looking like:
# Desired output:
# dt
#1 2018-01-01 12:00:00
#2 2018-01-01 13:00:00
#3 2018-01-01 14:00:00
#4 .
#5 .
And so on. tidyverse-based solutions are preferred.
#DavidArenburg's comment is the way to go for a bare vector; it amounts to something like the following (my reconstruction, since the comment itself isn't quoted here):
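hours <- seq(min(dat$dt), max(dat$dt), by = "hour")   # hourly grid as a plain vector
However, if you want to expand dt inside a data frame with other columns that you would like to keep, you might be interested in tidyr::complete combined with tidyr::full_seq: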
dat <- data.frame(dt = as.POSIXct(c("2018-01-01 12:00:00",
"2018-01-13 01:00:00",
"2018-02-01 11:00:00")))
dat$a <- letters[1:3]
dat
#> dt a
#> 1 2018-01-01 12:00:00 a
#> 2 2018-01-13 01:00:00 b
#> 3 2018-02-01 11:00:00 c
library(tidyr)
res <- complete(dat, dt = full_seq(dt, period = 60 * 60))   # period is in seconds: one hour
print(res, n = 5)
#> # A tibble: 744 x 2
#> dt a
#> <dttm> <chr>
#> 1 2018-01-01 12:00:00 a
#> 2 2018-01-01 13:00:00 <NA>
#> 3 2018-01-01 14:00:00 <NA>
#> 4 2018-01-01 15:00:00 <NA>
#> 5 2018-01-01 16:00:00 <NA>
#> # ... with 739 more rows
Created on 2018-03-12 by the reprex package (v0.2.0).
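If you also want the existing values of a carried down into the padded rows (optional, and only if carrying values forward makes sense for your data), tidyr::fill can follow the complete() call:
res_filled <- fill(res, a)   # carries the last non-missing a forward (.direction = "down" is the default)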
I have a data.table containing time series of hourly observations from different locations (sites). There are gaps (missing hours) in each sequence. I want to fill out the sequence of hours for each site, so that each sequence has a row for every hour (with x set to NA where data are missing).
Example data:
library(data.table)
library(lubridate)
DT <- data.table(site = rep(LETTERS[1:2], each = 3),
date = ymd_h(c("2017080101", "2017080103", "2017080105",
"2017080103", "2017080105", "2017080107")),
x = c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3),
key = c("site", "date"))
DT
# site date x
# 1: A 2017-08-01 01:00:00 1.1
# 2: A 2017-08-01 03:00:00 1.2
# 3: A 2017-08-01 05:00:00 1.3
# 4: B 2017-08-01 03:00:00 2.1
# 5: B 2017-08-01 05:00:00 2.2
# 6: B 2017-08-01 07:00:00 2.3
The desired result DT2 would contain all the hours between the first (minimum) date and the last (maximum) date for each site, with x missing where the new rows are inserted:
# site date x
# 1: A 2017-08-01 01:00:00 1.1
# 2: A 2017-08-01 02:00:00 NA
# 3: A 2017-08-01 03:00:00 1.2
# 4: A 2017-08-01 04:00:00 NA
# 5: A 2017-08-01 05:00:00 1.3
# 6: B 2017-08-01 03:00:00 2.1
# 7: B 2017-08-01 04:00:00 NA
# 8: B 2017-08-01 05:00:00 2.2
# 9: B 2017-08-01 06:00:00 NA
#10: B 2017-08-01 07:00:00 2.3
I have tried joining DT with a date sequence constructed from min(date) and max(date). This is in the right direction, but the date range spans all sites rather than each individual site, the filled-in rows have a missing site, and the sort order (key) is wrong:
DT[.(seq(from = min(date), to = max(date), by = "hour")),
.SD, on="date"]
# site date x
# 1: A 2017-08-01 01:00:00 1.1
# 2: NA 2017-08-01 02:00:00 NA
# 3: A 2017-08-01 03:00:00 1.2
# 4: B 2017-08-01 03:00:00 2.1
# 5: NA 2017-08-01 04:00:00 NA
# 6: A 2017-08-01 05:00:00 1.3
# 7: B 2017-08-01 05:00:00 2.2
# 8: NA 2017-08-01 06:00:00 NA
# 9: B 2017-08-01 07:00:00 2.3
So I naturally tried adding by = site:
DT[.(seq(from = min(date), to = max(date), by = "hour")),
.SD, on="date", by=.(site)]
# site date x
# 1: A 2017-08-01 01:00:00 1.1
# 2: A 2017-08-01 03:00:00 1.2
# 3: A 2017-08-01 05:00:00 1.3
# 4: NA <NA> NA
# 5: B 2017-08-01 03:00:00 2.1
# 6: B 2017-08-01 05:00:00 2.2
# 7: B 2017-08-01 07:00:00 2.3
But this doesn't work either. Can anyone suggest the right data.table formulation to give the desired filled-out DT2 shown above?
library(data.table)
library(lubridate)
setDT(DT)   # ensure DT is a data.table
# per-site hourly scaffold from each site's first to last observation
test <- DT[, .(date = seq(min(date), max(date), by = 'hour')), by = 'site']
# left join the data onto the scaffold, keeping every scaffold row
DT <- merge(test, DT, by = c('site', 'date'), all.x = TRUE)
DT
site date x
1: A 2017-08-01 01:00:00 1.1
2: A 2017-08-01 02:00:00 NA
3: A 2017-08-01 03:00:00 1.2
4: A 2017-08-01 04:00:00 NA
5: A 2017-08-01 05:00:00 1.3
6: B 2017-08-01 03:00:00 2.1
7: B 2017-08-01 04:00:00 NA
8: B 2017-08-01 05:00:00 2.2
9: B 2017-08-01 06:00:00 NA
10: B 2017-08-01 07:00:00 2.3
Thanks to both Frank and Wen for putting me on the right track. I found a compact data.table solution. The result DT2 is also keyed on site and date, as in the input table (which is desirable, although I didn't request it in the OP). This is a reformulation of Wen's solution in data.table syntax, which I assume will be slightly more efficient on large datasets.
DT2 <- DT[setkey(DT[, .(date = seq(from = min(date), to = max(date),
by = "hour")), by = site], site, date), ]
DT2
# site date x
# 1: A 2017-08-01 01:00:00 1.1
# 2: A 2017-08-01 02:00:00 NA
# 3: A 2017-08-01 03:00:00 1.2
# 4: A 2017-08-01 04:00:00 NA
# 5: A 2017-08-01 05:00:00 1.3
# 6: B 2017-08-01 03:00:00 2.1
# 7: B 2017-08-01 04:00:00 NA
# 8: B 2017-08-01 05:00:00 2.2
# 9: B 2017-08-01 06:00:00 NA
#10: B 2017-08-01 07:00:00 2.3
key(DT2)
# [1] "site" "date"
EDIT1: As mentioned by Frank, the on= syntax can also be used. The following DT3 formulation gives the correct answer, but DT3 is not keyed, while the DT2 result is keyed. That means an extra setkey() would be needed if a keyed result is desired.
DT3 <- DT[DT[, .(date = seq(from = min(date), to = max(date),
by = "hour")), by = site], on = c("site", "date"), ]
DT3
# site date x
# 1: A 2017-08-01 01:00:00 1.1
# 2: A 2017-08-01 02:00:00 NA
# 3: A 2017-08-01 03:00:00 1.2
# 4: A 2017-08-01 04:00:00 NA
# 5: A 2017-08-01 05:00:00 1.3
# 6: B 2017-08-01 03:00:00 2.1
# 7: B 2017-08-01 04:00:00 NA
# 8: B 2017-08-01 05:00:00 2.2
# 9: B 2017-08-01 06:00:00 NA
#10: B 2017-08-01 07:00:00 2.3
key(DT3)
# NULL
all.equal(DT2, DT3)
# [1] "Datasets has different keys. 'target': site, date. 'current' has no key."
all.equal(DT2, DT3, check.attributes = FALSE)
# [1] TRUE
Is there a way to write the DT3 expression to give a keyed result, other than expressly using setkey()?
EDIT2: Frank's comment suggests an additional formulation, DT4, using keyby = .EACHI. Here .SD is supplied as j, which is required when by or keyby is used. This gives the correct answer, and the result is keyed, as in the DT2 formulation.
DT4 <- DT[DT[, .(date = seq(from = min(date), to = max(date), by = "hour")),
by = site], on = c("site", "date"), .SD, keyby = .EACHI]
DT4
# site date x
# 1: A 2017-08-01 01:00:00 1.1
# 2: A 2017-08-01 02:00:00 NA
# 3: A 2017-08-01 03:00:00 1.2
# 4: A 2017-08-01 04:00:00 NA
# 5: A 2017-08-01 05:00:00 1.3
# 6: B 2017-08-01 03:00:00 2.1
# 7: B 2017-08-01 04:00:00 NA
# 8: B 2017-08-01 05:00:00 2.2
# 9: B 2017-08-01 06:00:00 NA
#10: B 2017-08-01 07:00:00 2.3
key(DT4)
# [1] "site" "date"
identical(DT2, DT4)
# [1] TRUE
I have some weather data that comes in unevenly spaced, and I would like to reduce it to simple hourly values. I need hourly data so I can join it with a separate data.frame.
Example of the weather data:
> weather_df
A tibble: 10 × 3
datetime temperature temperature_dewpoint
<dttm> <dbl> <dbl>
1 2011-01-01 00:00:00 4 -1
2 2011-01-01 00:20:00 3 -1
3 2011-01-01 00:40:00 3 -1
4 2011-01-01 01:00:00 2 -1
5 2011-01-01 01:20:00 2 0
6 2011-01-01 01:45:00 2 0
7 2011-01-01 02:05:00 1 -1
8 2011-01-01 02:25:00 2 0
9 2011-01-01 02:45:00 2 -1
10 2011-01-01 03:10:00 2 0
I would like to have only hourly data, but as you can see the observations don't always fall on the hour mark. I've tried rounding, but then I have multiple observations with the same time.
weather_df$datetime_rounded <- as.POSIXct(round(weather_df$datetime, units = c("hours")))
weather_df
# A tibble: 10 × 4
datetime temperature temperature_dewpoint datetime_rounded
<dttm> <dbl> <dbl> <dttm>
1 2011-01-01 00:00:00 4 -1 2011-01-01 00:00:00
2 2011-01-01 00:20:00 3 -1 2011-01-01 00:00:00
3 2011-01-01 00:40:00 3 -1 2011-01-01 01:00:00
4 2011-01-01 01:00:00 2 -1 2011-01-01 01:00:00
5 2011-01-01 01:20:00 2 0 2011-01-01 01:00:00
6 2011-01-01 01:45:00 2 0 2011-01-01 02:00:00
7 2011-01-01 02:05:00 1 -1 2011-01-01 02:00:00
8 2011-01-01 02:25:00 2 0 2011-01-01 02:00:00
9 2011-01-01 02:45:00 2 -1 2011-01-01 03:00:00
10 2011-01-01 03:10:00 2 0 2011-01-01 03:00:00
I can't easily determine which observation to keep without computing the difference between datetime and datetime_rounded. There must be a more elegant way to do this. Any help would be appreciated!
Here is my non-elegant solution.
I calculated the absolute distance between datetime and datetime_rounded:
weather_df$time_dist <- abs(weather_df$datetime - weather_df$datetime_rounded)
Then I sorted by the distance:
weather_df <- weather_df[order(weather_df$time_dist),]
Then removed duplicates of the rounded column. Since it's sorted, this keeps the observation closest to the round hour:
weather_df <- weather_df[!duplicated(weather_df$datetime_rounded),]
Then sorted back by time:
weather_df <- weather_df[order(weather_df$datetime_rounded),]
Surely there has to be a better way to do this. I'm not very familiar yet with working with time series in R.
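For reference, the same keep-the-closest-observation logic fits in one dplyr pipeline; a sketch, assuming lubridate's round_date() and dplyr >= 1.0 for slice_min():
library(dplyr)
library(lubridate)
weather_hourly <- weather_df %>%
  mutate(datetime_rounded = round_date(datetime, unit = "hour"),
         time_dist = abs(difftime(datetime, datetime_rounded))) %>%
  group_by(datetime_rounded) %>%
  slice_min(time_dist, n = 1, with_ties = FALSE) %>%   # keep the row nearest each hour
  ungroup() %>%
  arrange(datetime_rounded)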
I have one zoo object with hourly observations, and one with daily observations.
My goal is to merge the two series by the index into one object, where I match daily values with all hourly values of the same date.
To be specific, the first object zX contains hourly observations with no missing values. The second object zY contains a list of certain special dates. These should be added to zX as a dummy on every observation on that day.
library(zoo)
# 3 days of data with hourly resolution
x <- runif(24*3)
indexHour <- as.POSIXct(as.Date("2015-01-01") + seq(0, (24*3-1)/24, 1/24))
zX <- zoo(x, indexHour)
# Only 2 days of data with daily resolution - one date is missing
y <- c(0, 2)
indexDay <- as.POSIXct(c(as.Date("2015-01-01"), as.Date("2015-01-03")))
zY <- zoo(y, indexDay)
Expected output
2015-01-01 00:00:00 0.78671677 0
2015-01-01 01:00:00 0.40625297 0
...
2015-01-01 23:00:00 0.75371677 0
2015-01-02 00:00:00 0.34571677 NA
2015-01-02 01:00:00 0.40625297 NA
...
2015-01-02 23:00:00 0.12671677 NA
2015-01-03 00:00:00 0.54671677 2
2015-01-03 01:00:00 0.40625297 2
...
2015-01-03 23:00:00 0.23671677 2
Try this:
# look up the calendar date of each hourly timestamp among zY's dates;
# match() returns NA for dates absent from zY, so those hours get NA
z <- cbind(zX, zY = coredata(zY)[match(as.Date(time(zX)), as.Date(time(zY)))])
giving:
> head(z, 30)
zX zY
2014-12-31 19:00:00 0.20050507 0
2014-12-31 20:00:00 0.98745944 0
2014-12-31 21:00:00 0.02685118 0
2014-12-31 22:00:00 0.82922065 0
2014-12-31 23:00:00 0.77466073 0
2015-01-01 00:00:00 0.87494486 0
2015-01-01 01:00:00 0.39466493 0
2015-01-01 02:00:00 0.49233047 0
2015-01-01 03:00:00 0.19231866 0
2015-01-01 04:00:00 0.91684281 0
2015-01-01 05:00:00 0.48264758 0
2015-01-01 06:00:00 0.08900482 0
2015-01-01 07:00:00 0.48236308 0
2015-01-01 08:00:00 0.30624266 0
2015-01-01 09:00:00 0.48860905 0
2015-01-01 10:00:00 0.18761759 0
2015-01-01 11:00:00 0.37730202 0
2015-01-01 12:00:00 0.51766405 0
2015-01-01 13:00:00 0.30146257 0
2015-01-01 14:00:00 0.66511275 0
2015-01-01 15:00:00 0.66457355 0
2015-01-01 16:00:00 0.92248105 0
2015-01-01 17:00:00 0.17868851 0
2015-01-01 18:00:00 0.71363131 0
2015-01-01 19:00:00 0.82189523 NA
2015-01-01 20:00:00 0.73392131 NA
2015-01-01 21:00:00 0.95409518 NA
2015-01-01 22:00:00 0.49774272 NA
2015-01-01 23:00:00 0.27700155 NA
2015-01-02 00:00:00 0.85833340 NA
Note: the output starts at 2014-12-31 19:00:00 rather than 2015-01-01 00:00:00 because as.POSIXct() applied to a Date gives midnight UTC, and the index prints in the session's local timezone (UTC-5 here).
Inspired by the join statements in How to join (merge) data frames (inner, outer, left, right)?, the following code produces the desired output:
# plain data frames keyed by calendar date; data.frame() keeps x and y numeric,
# where cbind() on mixed types would coerce everything to character
x <- data.frame(x = coredata(zX), date = as.Date(index(zX)))
y <- data.frame(y = coredata(zY), date = as.Date(index(zY)))
m <- merge(x, y, by = 'date', all.x = TRUE)   # left join on the date column
z <- zoo(m[, c('x', 'y')], index(zX))         # back to a zoo object on the hourly index
View(z)
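As a quick sanity check (using the objects built above), the hourly index should survive the round trip:
identical(index(z), index(zX))
# expected: TRUE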