I have a very large dataset with date and time in a single column, recorded on 15-minute intervals. Unfortunately the software recording the data has some issues, so randomly there are missing 15-minute intervals (usually 1 or 2, but sometimes 3 or 4 in a row). The dataset is reported as follows:
Date_and_time Pressure
2016-07-08 18:00:00 3.542
2016-07-08 18:15:00 5:444
2016-07-08 18:45:00 2:556
2016-07-08 19:00:00 4:567
I am looking for a way to insert a row for each missing time frame. My goal is to stack this data for multiple sites on top of each other, and I need to make sure for graphing purposes that they line up.
If you can perfectly guarantee that all times are aligned on the quarter hour, then you could try this:
library(dplyr)
tibble(Date_and_time = do.call(seq, c(as.list(range(dat$Date_and_time)), by = "15 mins"))) %>%
full_join(dat, by = "Date_and_time")
# # A tibble: 5 x 2
# Date_and_time Pressure
# <dttm> <chr>
# 1 2016-07-08 18:00:00 3.542
# 2 2016-07-08 18:15:00 5:444
# 3 2016-07-08 18:30:00 <NA>
# 4 2016-07-08 18:45:00 2:556
# 5 2016-07-08 19:00:00 4:567
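As an aside (my addition, not part of the answer above), tidyr::complete() can build the grid and fill in one step; a minimal sketch assuming the same dat:
library(tidyr)
# complete() expands Date_and_time to the full 15-minute grid and fills
# Pressure with NA for the inserted rows.
dat %>%
  complete(Date_and_time = seq(min(Date_and_time), max(Date_and_time), by = "15 mins"))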
If you think there is a chance that your times are not perfectly aligned (even a fraction of a second will introduce unnecessary rows), then we can turn this into a problem of "enforce a gap of no more than 15 minutes":
dat %>%
group_by(grp = cumsum(c(FALSE, as.numeric(diff(Date_and_time), units = "mins") > 15))) %>%
summarize(Date_and_time = max(Date_and_time) + 15*60) %>%
bind_rows(dat) %>%
arrange(Date_and_time) %>%
select(-grp)
# # A tibble: 6 x 2
# Date_and_time Pressure
# <dttm> <chr>
# 1 2016-07-08 18:00:00 3.542
# 2 2016-07-08 18:15:00 5:444
# 3 2016-07-08 18:30:00 <NA>
# 4 2016-07-08 18:45:00 2:556
# 5 2016-07-08 19:00:00 4:567
# 6 2016-07-08 19:15:00 <NA>
Notice that the last added row is unnecessary; it can be removed in a simple clean-up step (see the sketch after this list). The premise of this second method is that it creates groups in which consecutive rows are gapped 15 minutes (or less), and then adds one row 15 minutes after the last row of each group. A one-interval gap is therefore filled exactly (larger gaps get a single filler row, which is still enough to break a plotted line), but:
It will always produce a single row at the bottom that may not be needed; and
It does not make any assurance of the gap between the added rows and the rows beneath them. For example, if your third row was instead at "2016-07-08 18:31:00", then the time would sequence through "18:15:00", "18:30:00", then "18:31:00" (with a 1-minute gap).
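For the clean-up, a minimal sketch (my addition, not from the answer above; filled is a stand-in name for the result of the pipeline):
# Drop filler rows that fall after the last real observation.
filled %>%
  filter(!(is.na(Pressure) & Date_and_time > max(dat$Date_and_time)))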
Data
dat <- structure(list(Date_and_time = structure(c(1468015200, 1468016100, 1468017900, 1468018800), class = c("POSIXct", "POSIXt"), tzone = ""), Pressure = c("3.542", "5:444", "2:556", "4:567")), row.names = c(NA, -4L), class = "data.frame")
You could make a sequence that has all potential sampling times and then join your data to that.
library(tidyverse)
ALL_PERIODS <- data.frame(SAMPLE_TIME = seq.POSIXt(from = as.POSIXct("2016-07-08 18:00:00"),
                                                   to = as.POSIXct("2016-07-08 20:00:00"),
                                                   by = "15 min"))
SAMPLE_DATA <- data.frame(Date_and_time = as.POSIXct(c("2016-07-08 18:00:00", "2016-07-08 18:15:00",
                                                       "2016-07-08 18:45:00", "2016-07-08 19:00:00")),
                          pressure = c(3.542, 5.444, 2.556, 4.567))
# POSIXct (rather than POSIXlt) keeps the date-time columns join-friendly
ALL_PERIODS_DATA <- left_join(ALL_PERIODS, SAMPLE_DATA, by = c("SAMPLE_TIME" = "Date_and_time"))
Related
I know a lot of questions have been asked on the same subject, but I have not found an answer to this particular question, despite trying to adapt code from other answers to my problem.
My data frame "v1" has more than 300 thousand lines with the variable "Date" in the following format:
Date
2015-07-27 17:35:00
2015-07-27 17:40:00
2015-07-27 17:45:00
First, I want to check whether all the "Date" values are spaced 5 minutes apart. If not, I would like to track down where the different intervals occur.
Second, I intend to create a new column showing the time interval between consecutive rows, e.g. a "time_int" column containing "00:05:00", "00:05:00"...
Any help will be appreciated. Thank you in advance.
Here is an option to calculate the difference using lag. If you'd like, you could create another column showing hours with units = "hours".
library(tidyverse)
library(lubridate)
df <- data.frame(date = ymd_hms(c("2015-07-27 17:35:00",
"2015-07-27 17:40:00", "2015-07-27 17:49:00", "2015-07-27 19:49:00")))
df %>%
mutate(diff = date - lag(date),
diff_minutes = as.numeric(diff, units = "mins"),
time_int = format(.POSIXct(diff_minutes*60, "UTC"), "%H:%M:%S")) %>%
select(date, diff_minutes, time_int) %>%
# Filter the data for a range of minutes
filter(diff_minutes >= 5 & diff_minutes < 10)
# OUTPUT:
#> date diff_minutes time_int
#> 1 2015-07-27 17:40:00 5 00:05:00
#> 2 2015-07-27 17:49:00 9 00:09:00
Original Data
date
<S3: POSIXct>
2015-07-27 17:35:00
2015-07-27 17:40:00
2015-07-27 17:49:00
2015-07-27 19:49:00
You can use rollapplyr to compute the time difference between consecutive rows, and then use which to find the rows where the difference is not 5 minutes.
dt=read.table(text=text, header=TRUE)
library(lubridate)
library(dplyr)
library(zoo)
dt <- dt %>%
  mutate(Date = ymd_hms(Date)) %>%
  mutate(Dif = rollapplyr(Date, 2, function(x) {
    difftime(x[2], x[1], units = "mins")  # difference between consecutive rows
  }, fill = NA))
dt
Date Dif
1 2015-07-27 17:35:00 NA
2 2015-07-27 17:40:00 5
3 2015-07-27 17:45:00 5
4 2015-07-27 17:49:00 4
dt[which(dt$Dif != as.difftime(5, units="mins")),]
Date Dif
4 2015-07-27 17:49:00 4
Lastly, to format the times in your desired format:
dt %>% mutate(DifString=format(.POSIXct(Dif*60, tz="GMT"), "%H:%M:%S"))
Date Dif DifString
1 2015-07-27 17:35:00 NA <NA>
2 2015-07-27 17:40:00 5 00:05:00
3 2015-07-27 17:45:00 5 00:05:00
4 2015-07-27 17:49:00 4 00:04:00
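For what it's worth, the same differences can be computed without zoo; a hedged sketch using base diff() on the parsed dates:
dt %>% mutate(Dif2 = c(NA, as.numeric(diff(Date), units = "mins")))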
Data
text="Date
'2015-07-27 17:35:00'
'2015-07-27 17:40:00'
'2015-07-27 17:45:00'
'2015-07-27 17:49:00'"
dt=read.table(text=text, header=TRUE)
I have a dataset with a lot of replicated rows, and I want to make a dataset with no replications. Date and time are the main ways of distinguishing between distinct and similar rows, but sometimes the times are a bit off. I want to reduce my dataset so that if two rows are within 1 hour of each other on the same day, the second instance does not show up.
input_date <- c("4/20/2014", "5/15/2002", "3/12/2019", "3/12/2019", "3/12/2019", "3/12/2019")
input_time <- c("4:30", "4:30", "9:00", "9:55", "12:00", "12:00")
input <- data.frame(date = input_date, time = input_time)  # distinct() needs a data.frame, not a matrix
library(dplyr)
#use distinct to remove duplicate values--this removes the final row, but I want it to also remove row 4.
output <- distinct(input, date, time)
Is there any easy way to tell R to get rid of rows with values that are close to each other but not exactly the same?
Here is an approach that rounds times to make groups based on the hour.
Then, use {dplyr} group_by / slice to get the first row of each group.
input_date <- c("4/20/2014", "5/15/2002", "3/12/2019", "3/12/2019", "3/12/2019", "3/12/2019")
input_time <- c("4:30", "4:30", "9:00", "9:55", "12:00", "12:00")
# make a data.frame
input <- data.frame(date =input_date, time = input_time)
# use dplyr for data manipulation of groups
library(dplyr, warn.conflicts = FALSE)
# take the 1st slice index from each group
input %>%
mutate(datetime = as.POSIXct(sprintf("%s %s", date, time),
format = "%m/%d/%Y %H:%M"),
hour = round(datetime, "hours")) %>%
group_by(hour) %>%
slice(1)
#> # A tibble: 5 x 4
#> # Groups: hour [5]
#> date time datetime hour
#> <chr> <chr> <dttm> <dttm>
#> 1 5/15/2002 4:30 2002-05-15 04:30:00 2002-05-15 05:00:00
#> 2 4/20/2014 4:30 2014-04-20 04:30:00 2014-04-20 05:00:00
#> 3 3/12/2019 9:00 2019-03-12 09:00:00 2019-03-12 09:00:00
#> 4 3/12/2019 9:55 2019-03-12 09:55:00 2019-03-12 10:00:00
#> 5 3/12/2019 12:00 2019-03-12 12:00:00 2019-03-12 12:00:00
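Note that rounding can still keep two rows that are less than an hour apart (9:00 and 9:55 above round to different hours, so both survive). If the requirement is strictly "drop any row within 1 hour of the previously kept row", a rolling rule is needed; here is a hedged loop sketch (my addition, assuming rows should be compared in time order):
input2 <- input %>%
  mutate(datetime = as.POSIXct(sprintf("%s %s", date, time),
                               format = "%m/%d/%Y %H:%M")) %>%
  arrange(datetime)
keep <- rep(FALSE, nrow(input2))
last <- 0  # index of the last kept row; 0 means none kept yet
for (i in seq_len(nrow(input2))) {
  if (last == 0 ||
      difftime(input2$datetime[i], input2$datetime[last], units = "hours") >= 1) {
    keep[i] <- TRUE
    last <- i
  }
}
input2[keep, ]  # 9:55 and the duplicate 12:00 are dropped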
I've triangulated information from other SO answers for the code below, but I'm getting stuck on an error message. I've searched SO for similar errors and resolutions but haven't been able to figure it out, so help is appreciated.
For every group ("id"), I want to get the difference between the start times for consecutive rows.
Reproducible data:
require(dplyr)
df <-data.frame(id=as.numeric(c("1","1","1","2","2","2")),
start= c("1/31/17 10:00","1/31/17 10:02","1/31/17 10:45",
"2/10/17 12:00", "2/10/17 12:20","2/11/17 09:40"))
time <- strptime(df$start, format = "%m/%d/%y %H:%M")
df %>%
group_by(id)%>%
mutate(diff = time - lag(time),
diff_mins = as.numeric(diff, units = 'mins'))
Gets me error:
Error in mutate_impl(.data, dots) :
Column diff must be length 3 (the group size) or one, not 6
In addition: Warning message:
In unclass(time1) - unclass(time2) :
longer object length is not a multiple of shorter object length
Do you mean something like this?
There is no need for lag here; a simple diff on the grouped times is sufficient.
df %>%
mutate(start = as.POSIXct(start, format = "%m/%d/%y %H:%M")) %>%
group_by(id) %>%
mutate(diff = c(0, diff(start)))
# # A tibble: 6 x 3
# # Groups: id [2]
# id start diff
# <dbl> <dttm> <dbl>
#1 1. 2017-01-31 10:00:00 0.
#2 1. 2017-01-31 10:02:00 2.
#3 1. 2017-01-31 10:45:00 43.
#4 2. 2017-02-10 12:00:00 0.
#5 2. 2017-02-10 12:20:00 20.
#6 2. 2017-02-11 09:40:00 1280.
You can use lag and difftime (per Hadley):
df %>%
mutate(time = as.POSIXct(start, format = "%m/%d/%y %H:%M")) %>%
group_by(id) %>%
mutate(diff = difftime(time, lag(time)))
# A tibble: 6 x 4
# Groups: id [2]
id start time diff
<dbl> <fct> <dttm> <time>
1 1. 1/31/17 10:00 2017-01-31 10:00:00 <NA>
2 1. 1/31/17 10:02 2017-01-31 10:02:00 2
3 1. 1/31/17 10:45 2017-01-31 10:45:00 43
4 2. 2/10/17 12:00 2017-02-10 12:00:00 <NA>
5 2. 2/10/17 12:20 2017-02-10 12:20:00 20
6 2. 2/11/17 09:40 2017-02-11 09:40:00 1280
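One caveat on both answers: difftime() and diff() choose their units automatically, which can differ between groups; pinning the unit avoids surprises. A hedged variant of the above:
df %>%
  mutate(time = as.POSIXct(start, format = "%m/%d/%y %H:%M")) %>%
  group_by(id) %>%
  mutate(diff_mins = as.numeric(difftime(time, lag(time), units = "mins")))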
I have a large dataset over many years with several variables, but the ones I am interested in are wind speed and dateTime. I want to find the time of the max wind speed for every day in the data set. I have hourly data in POSIXct format, with WS as a numeric with occasional NAs. Below is a short data set that should illustrate my point; the sample dateTime didn't work out to be hourly data, but it provides enough for an example.
dateTime <- seq(as.POSIXct("2011-01-01 00:00:00", tz = "GMT"),
as.POSIXct("2011-01-29 23:00:00", tz = "GMT"),
by = 60*24)
# the dateTime seq above yields 1738 values, so these vectors must match that length
WS <- sample(0:20, 1738, rep = TRUE)
WD <- sample(0:390, 1738, rep = TRUE)
Temp <- sample(0:40, 1738, rep = TRUE)
df <- data.frame(dateTime,WS,WD,Temp)
df$WS[WS>15] <- NA
I have previously tried creating a new column with just a POSIX date (minus time) to allow for day isolation; however, everything I have tried (aggregate, splitting, xts) has only returned a shortened data frame with date and WS. Aggregate was the only one that didn't shorten the frame; however, it gave me 23:00:00 as a constant time, which isn't correct.
I have looked at How to calculate daily means, medians, from weather variables data collected hourly in R?, https://stats.stackexchange.com/questions/7268/how-to-aggregate-by-minute-data-for-a-week-into-hourly-means and others but none have answered this question, or the solutions have not returned an ideal result.
I need to compare the results of this analysis with another data frame, so hence the reason I need the actual time when the max wind speed occurred for each day in the dataset. I have a feeling there is a simple solution, however, this has me frustrated.
A dplyr solution may be:
library(dplyr)
df %>%
mutate(date = as.Date(dateTime)) %>%
left_join(
df %>%
mutate(date = as.Date(dateTime)) %>%
group_by(date) %>%
summarise(max_ws = max(WS, na.rm = TRUE)) %>%
ungroup(),
by = "date"
) %>%
select(-date)
# dateTime WS WD Temp max_ws
# 1 2011-01-01 00:00:00 NA 313 2 15
# 2 2011-01-01 00:24:00 7 376 1 15
# 3 2011-01-01 00:48:00 3 28 28 15
# 4 2011-01-01 01:12:00 15 262 24 15
# 5 2011-01-01 01:36:00 1 149 34 15
# 6 2011-01-01 02:00:00 4 319 33 15
# 7 2011-01-01 02:24:00 15 280 22 15
# 8 2011-01-01 02:48:00 NA 110 23 15
# 9 2011-01-01 03:12:00 12 93 15 15
# 10 2011-01-01 03:36:00 3 5 0 15
Dee asked for: "I want to find the time of the max wind speed for every day in the data set." Other answers have calculated the max(WS) for every day, but not at which hour it occurred.
So I propose the following solution with dplyr:
library(dplyr)
library(lubridate) # for hour()
set.seed(12345)
dateTime <- seq(as.POSIXct("2011-01-01 00:00:00", tz = "GMT"),
as.POSIXct("2011-01-29 23:00:00", tz = "GMT"),
by = 60*24)
WS <- sample(0:20,1738,rep=TRUE)
WD <- sample(0:390,1738,rep=TRUE)
Temp <- sample(0:40,1738,rep=TRUE)
df <- data.frame(dateTime,WS,WD,Temp)
df$WS[WS>15] <- NA
df %>%
group_by(Date = as.Date(dateTime)) %>%
mutate(Hour = hour(dateTime),
Hour_with_max_ws = Hour[which.max(WS)])
I want to point out that if several hours share the same maximal wind speed (in the example below: 15), only the first hour with max(WS) will be shown as the result, even though the wind speed of 15 was reached on that date at hours 0, 3, 4, 21 and 22! So you might need more specific logic; a sketch follows.
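If you need all hours that hit the daily maximum, one hedged possibility (my sketch, not part of the answer above) is to collapse them into a string per day:
df %>%
  group_by(Date = as.Date(dateTime)) %>%
  summarise(max_ws = max(WS, na.rm = TRUE),
            # which() silently drops the NA comparisons
            hours_with_max = paste(hour(dateTime)[which(WS == max_ws)], collapse = ", "))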
For the sake of completeness (and because I like the concise code) here is a "one-liner" using data.table:
library(data.table)
setDT(df)[, max.ws := max(WS, na.rm = TRUE), by = as.IDate(dateTime)][]
dateTime WS WD Temp max.ws
1: 2011-01-01 00:00:00 NA 293 22 15
2: 2011-01-01 00:24:00 15 55 14 15
3: 2011-01-01 00:48:00 NA 186 24 15
4: 2011-01-01 01:12:00 4 300 22 15
5: 2011-01-01 01:36:00 0 120 36 15
---
1734: 2011-01-29 21:12:00 12 249 5 15
1735: 2011-01-29 21:36:00 9 282 21 15
1736: 2011-01-29 22:00:00 12 238 6 15
1737: 2011-01-29 22:24:00 10 127 21 15
1738: 2011-01-29 22:48:00 13 297 0 15
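A hedged extension of the one-liner (my addition) that also records when the daily maximum occurred; it assumes at least one non-NA WS per day:
setDT(df)[, `:=`(max.ws = max(WS, na.rm = TRUE),
                 time.of.max = dateTime[which.max(WS)]),  # which.max() takes the first hit
          by = as.IDate(dateTime)][]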
I have the following dataframes:
AllDays
2012-01-01
2012-01-02
2012-01-03
...
2015-08-18
Leases
StartDate EndDate
2012-01-01 2013-01-01
2012-05-07 2013-05-06
2013-09-05 2013-12-01
What I want to do is, for each date in the allDays dataframe, calculate the number of leases that are in effect. e.g. if there are 4 leases with start date <= 2015-01-01 and end date >= 2015-01-01, then I would like to place a 4 in that dataframe.
I have the following code
for (i in 1:nrow(leases))
{
occupied = seq(leases$StartDate[i],leases$EndDate[i],by="days")
occupied = occupied[occupied < dateOfInt]
matching = match(occupied,allDays$Date)
allDays$Occupancy[matching] = allDays$Occupancy[matching] + 1
}
which works, but as I have about 5000 leases, it takes about 1.1 seconds. Does anyone have a more efficient method that would require less computation time?
Date of interest is just the current date and is used simply to ensure that it doesn't count lease dates in the future.
Using seq is almost surely inefficient--imagine you had a lease in your data that's 10000 years long. seq will take forever and return 10000*365-1 days that don't matter to us. We then have to use match() (or %in%), which makes the same number of unnecessary comparisons.
I'm not sure the following is the best approach (I'm convinced there's a fully vectorized solution) but it gets closer to the heart of the problem.
Data
set.seed(102349)
days<-data.frame(AllDays=seq(as.Date("2012-01-01"),
as.Date("2015-08-18"),"day"))
leases<-data.frame(StartDate=sample(days$AllDays,5000L,T))
leases$EndDate<-leases$StartDate+round(rnorm(5000,mean=365,sd=100))
Approach
Use data.table and sapply:
library(data.table)
setDT(leases); setDT(days)
days[,lease_count:=
sapply(AllDays,function(x)
leases[StartDate<=x&EndDate>=x,.N])][]
AllDays lease_count
1: 2012-01-01 5
2: 2012-01-02 8
3: 2012-01-03 11
4: 2012-01-04 16
5: 2012-01-05 18
---
1322: 2015-08-14 1358
1323: 2015-08-15 1358
1324: 2015-08-16 1360
1325: 2015-08-17 1363
1326: 2015-08-18 1359
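For the "fully vectorized solution" alluded to above, one hedged candidate (my sketch) counts, for each day, the leases that have started minus those that have already ended, via findInterval() on sorted date vectors:
starts <- sort(as.numeric(leases$StartDate))
ends <- sort(as.numeric(leases$EndDate))
d <- as.numeric(days$AllDays)
# active on day d: StartDate <= d minus EndDate < d
days[, lease_count_vec := findInterval(d, starts) - findInterval(d - 1, ends)]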
This is exactly the problem where foverlaps shines: subsetting a data.frame based upon another data.frame (foverlaps seems to be tailored for that purpose).
Based on #MichaelChirico's data.
setkey(days[, AllDays1 := AllDays], AllDays, AllDays1)
setkey(leases, StartDate, EndDate)
foverlaps(leases, days)[, .(lease_count=.N), AllDays]
# user system elapsed
# 0.114 0.018 0.136
# #MichaelChirico's approach
# user system elapsed
# 0.909 0.000 0.907
Here is a brief explanation by @Arun of how foverlaps works, which got me started with data.table.
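Since data.table 1.9.8, a non-equi join is a hedged alternative to foverlaps (my sketch, reusing the same days and leases tables; note a day with zero active leases would be counted as 1 here unless nomatch is handled):
days[, lease_count_nej :=
       leases[days, on = .(StartDate <= AllDays, EndDate >= AllDays),
              .N, by = .EACHI]$N]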
Without your data, I can't test whether or not this is faster, but it gets the job done with less code:
for (i in 1:nrow(AllDays)) AllDays$tally[i] = sum(AllDays$AllDays[i] >= Leases$Start.Date & AllDays$AllDays[i] <= Leases$End.Date)
I used the following to test it; note that the relevant columns in both data frames are formatted as dates:
AllDays = data.frame(AllDays = seq(from=as.Date("2012-01-01"), to=as.Date("2015-08-18"), by=1))
Leases = data.frame(Start.Date = as.Date(c("2013-01-01", "2012-08-20", "2014-06-01")), End.Date = as.Date(c("2013-12-31", "2014-12-31", "2015-05-31")))
An alternative approach, but I'm not sure it's faster.
library(lubridate)
library(dplyr)
AllDays = data.frame(dates = c("2012-02-01","2012-03-02","2012-04-03"))
Lease = data.frame(start = c("2012-01-03","2012-03-01","2012-04-02"),
end = c("2012-02-05","2012-04-15","2012-07-11"))
# transform to dates
AllDays$dates = ymd(AllDays$dates)
Lease$start = ymd(Lease$start)
Lease$end = ymd(Lease$end)
# create the range id
Lease$id = 1:nrow(Lease)
AllDays
# dates
# 1 2012-02-01
# 2 2012-03-02
# 3 2012-04-03
Lease
# start end id
# 1 2012-01-03 2012-02-05 1
# 2 2012-03-01 2012-04-15 2
# 3 2012-04-02 2012-07-11 3
expand.grid(dates = AllDays$dates, id = Lease$id) %>% # create combinations of dates and ranges
inner_join(Lease, by="id") %>% # join information
rowwise %>%
do(data.frame(dates=.$dates,
flag = ifelse(.$dates %in% seq(.$start,.$end,by="1 day"),1,0))) %>% # create ranges and check if the date is in there
ungroup %>%
group_by(dates) %>%
summarise(N=sum(flag))
# dates N
# 1 2012-02-01 1
# 2 2012-03-02 1
# 3 2012-04-03 2
Try the lubridate package. Create an interval for each lease. Then count the lease intervals in which each date falls.
# make some data
AllDays <- data.frame("Days" = seq.Date(as.Date("2012-01-01"), as.Date("2012-02-01"), by = 1))
Leases <- data.frame("StartDate" = as.Date(c("2012-01-01", "2012-01-08")),
"EndDate" = as.Date(c("2012-01-10", "2012-01-21")))
library(lubridate)
x <- interval(Leases$StartDate, Leases$EndDate, tzone = "UTC") # interval() replaces the deprecated new_interval()
AllDays$NumberInEffect <- sapply(AllDays$Days, function(a){sum(a %within% x)})
The Output
head(AllDays)
Days NumberInEffect
1 2012-01-01 1
2 2012-01-02 1
3 2012-01-03 1
4 2012-01-04 1
5 2012-01-05 1
6 2012-01-06 1
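For completeness, the same count without lubridate, as a base-R sketch (my addition):
AllDays$NumberInEffect2 <- sapply(AllDays$Days, function(a)
  sum(Leases$StartDate <= a & Leases$EndDate >= a))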