Remove duplicate events within delta time - r

Given the data frame below:
class timestamp
1 A 2019-02-14 15:00:29
2 A 2019-01-27 17:59:53
3 A 2019-01-27 18:00:00
4 B 2019-02-02 18:00:00
5 C 2019-03-08 16:00:37
Observations 2 and 3 point to the same event. How do I remove rows belonging to the same class if another timestamp within 2 minutes already exists?
Desired output:
class timestamp
1 A 2019-02-14 15:00:00
2 A 2019-01-27 18:00:00
3 B 2019-02-02 18:00:00
4 C 2019-03-08 16:00:00
round( ,c("mins")) can be used to drop the seconds component, but if two timestamps of the same event are too far apart they are rounded to different minutes, so the duplicates still end up with different timestamps.

EDIT
I think I over-complicated the problem in my first attempt. What should work for your case is rounding the timestamps to 2-minute intervals, which we can do using round_date from lubridate.
library(lubridate)
library(dplyr)
df %>%
mutate(timestamp = round_date(as.POSIXct(timestamp), unit = "2 minutes")) %>%
group_by(class) %>%
filter(!duplicated(timestamp))
# class timestamp
# <chr> <dttm>
#1 A 2019-02-14 15:00:00
#2 A 2019-01-27 18:00:00
#3 B 2019-02-02 18:00:00
#4 C 2019-03-08 16:00:00
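One caveat with fixed 2-minute bins, sketched below in base R: two timestamps only a couple of seconds apart can still straddle a bin boundary and round to different values. The two timestamps are made up for illustration, and the rounding is emulated as "nearest multiple of 120 seconds since the epoch", which is an assumption about how the bins are aligned rather than a statement about lubridate's internals.

```r
# Two hypothetical timestamps, 2 seconds apart, on either side of 18:01:00
t <- as.POSIXct(c("2019-01-27 18:00:59", "2019-01-27 18:01:01"), tz = "UTC")

# Emulate rounding to the nearest 2-minute mark
# (assumed alignment: multiples of 120 seconds since 1970-01-01 UTC)
binned <- as.POSIXct(round(as.numeric(t) / 120) * 120,
                     origin = "1970-01-01", tz = "UTC")

gap_seconds <- as.numeric(diff(t), units = "secs")
same_bin <- binned[1] == binned[2]

print(format(binned, "%H:%M:%S"))  # "18:00:00" "18:02:00"
print(gap_seconds)                 # 2
print(same_bin)                    # FALSE
```

If this matters for your data, comparing each timestamp with its neighbour instead of binning avoids the boundary effect.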
Original Attempt
We can first convert the timestamp to a POSIXct object, then arrange rows by class and timestamp, use cut to divide them into 2-minute intervals, and then remove duplicates.
library(dplyr)
df %>%
mutate(timestamp = as.POSIXct(timestamp)) %>%
arrange(class, timestamp) %>%
group_by(class) %>%
filter(!duplicated(as.numeric(cut(timestamp, breaks = "2 mins")), fromLast = TRUE))
# class timestamp
# <chr> <dttm>
#1 A 2019-01-27 18:00:00
#2 A 2019-02-14 15:00:29
#3 B 2019-02-02 18:00:00
#4 C 2019-03-08 16:00:37
Here, I haven't changed or rounded the timestamp column, but it would be simple to round it by using cut inside mutate. Also, if you want to keep the first entry, such as 2019-01-27 17:59:53, remove the fromLast = TRUE argument.
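A different way to read "within 2 minutes" is to compare each row with the previous row of the same class instead of assigning rows to bins. The following is a minimal base R sketch of that idea, not part of either attempt above; the 120-second threshold and the variable names gap and deduped are assumptions made for illustration.

```r
df <- data.frame(
  class = c("A", "A", "A", "B", "C"),
  timestamp = as.POSIXct(c("2019-02-14 15:00:29", "2019-01-27 17:59:53",
                           "2019-01-27 18:00:00", "2019-02-02 18:00:00",
                           "2019-03-08 16:00:37"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Sort so that consecutive rows within a class are neighbours in time
df <- df[order(df$class, df$timestamp), ]

# Seconds since the previous row of the same class (Inf for the first row)
gap <- ave(as.numeric(df$timestamp), df$class,
           FUN = function(x) c(Inf, diff(x)))

# Keep a row only if it is more than 2 minutes after its predecessor
deduped <- df[gap > 120, ]
print(deduped)
```

Note that this keeps the earliest row of each cluster and leaves the timestamps unrounded. A long chain of events, each within 2 minutes of the previous one, collapses into a single row, which may or may not be what you want.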

Related

Time interval between several dates (days and hours)

I know a lot of questions have been asked on the same subject but I have not found an answer to this particular question, despite trying to adapt other codes to my problem.
My data frame "v1" has more than 300 thousand lines with the variable "Date" in the following format:
Date
2015-07-27 17:35:00
2015-07-27 17:40:00
2015-07-27 17:45:00
First, I want to know whether all the "Date" values are exactly 5 minutes apart. If not, I would like to find where the intervals differ.
Second, I intend to create a new column showing the interval between consecutive timestamps. For example, "time_int" containing "00:05:00", "00:05:00"...
Any help will be appreciated. Thank you in advance.
Here is an option to calculate the difference using lag. If you'd like, you could create another column showing hours with units = "hours".
library(tidyverse)
library(lubridate)
df <- data.frame(date = ymd_hms(c("2015-07-27 17:35:00",
"2015-07-27 17:40:00", "2015-07-27 17:49:00", "2015-07-27 19:49:00")))
df %>%
mutate(diff = date - lag(date),
diff_minutes = as.numeric(diff, units = "mins"),
time_int = format(.POSIXct(diff_minutes*60, "UTC"), "%H:%M:%S")) %>%
select(date, diff_minutes, time_int) %>%
# Filter the data for a range of minutes
filter(diff_minutes >= 5 & diff_minutes < 10)
# OUTPUT:
#> date diff_minutes time_int
#> 1 2015-07-27 17:40:00 5 00:05:00
#> 2 2015-07-27 17:49:00 9 00:09:00
Created on 2021-03-09 by the reprex package (v0.3.0)
Original Data
date
<S3: POSIXct>
2015-07-27 17:35:00
2015-07-27 17:40:00
2015-07-27 17:49:00
2015-07-27 19:49:00
You can use rollapplyr to find the time difference between two consecutive rows, and then use which to find the rows where the time difference is not 5 minutes.
dt=read.table(text=text, header=TRUE)
library(lubridate)
library(dplyr)
library(zoo)
# parse Date, then take the difference between each row and the previous one
dt=mutate(dt, Date=ymd_hms(Date)) %>%
mutate(Dif=rollapplyr(Date, 2, function(x) {
return(difftime(x[2], x[1]))
}, fill=NA))
dt
Date Dif
1 2015-07-27 17:35:00 NA
2 2015-07-27 17:40:00 5
3 2015-07-27 17:45:00 5
4 2015-07-27 17:49:00 4
dt[which(dt$Dif != as.difftime(5, units="mins")),]
Date Dif
4 2015-07-27 17:49:00 4
Lastly, to format the times in your desired format:
dt %>% mutate(DifString=format(.POSIXct(Dif*60, tz="GMT"), "%H:%M:%S"))
Date Dif DifString
1 2015-07-27 17:35:00 NA <NA>
2 2015-07-27 17:40:00 5 00:05:00
3 2015-07-27 17:45:00 5 00:05:00
4 2015-07-27 17:49:00 4 00:04:00
Data
text="Date
'2015-07-27 17:35:00'
'2015-07-27 17:40:00'
'2015-07-27 17:45:00'
'2015-07-27 17:49:00'"
dt=read.table(text=text, header=TRUE)
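For comparison, the same two steps (finding intervals that are not 5 minutes, and formatting each interval as HH:MM:SS) can be sketched in base R with diff. This is only an illustration: it assumes timestamps with whole seconds, and the names secs, irregular and time_int are chosen here for the example.

```r
v1 <- data.frame(Date = as.POSIXct(c("2015-07-27 17:35:00", "2015-07-27 17:40:00",
                                     "2015-07-27 17:45:00", "2015-07-27 17:49:00"),
                                   tz = "UTC"))

# Seconds between each row and the previous one (one value fewer than rows)
secs <- diff(as.numeric(v1$Date))

# Row numbers whose interval to the previous row is not 5 minutes
irregular <- which(secs != 300) + 1L

# Format the intervals as HH:MM:SS; the first row has no previous row, so NA
fmt <- sprintf("%02d:%02d:%02d",
               as.integer(secs %/% 3600),
               as.integer((secs %% 3600) %/% 60),
               as.integer(secs %% 60))
v1$time_int <- c(NA, fmt)

print(irregular)
print(v1)
```

Building the string from the number of seconds, instead of formatting it as a clock time, means that an interval of 24 hours or more does not wrap around to 00:00:00.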

Is there a way to select rows based on a loose distinct?

I have a dataset with a lot of replicated rows, and I want to make a dataset with no replications. Date and time are the main ways of distinguishing between distinct and similar rows, but sometimes the times are a bit off. I want to reduce my dataset so that if 2 rows are within 1 hour of each other on the same day the second instance does not show up.
input_date<-c("4/20/2014", "5/15/2002", "3/12/2019", "3/12/2019", "3/12/2019", "3/12/2019")
input_time<-c("4:30", "4:30", "9:00", "9:55", "12:00", "12:00")
library(dplyr)
# distinct() needs a data frame, not the character matrix that cbind() returns
input<-data.frame(date=input_date, time=input_time)
#use distinct to remove duplicate values--this removes final row, but I want it to also remove row 4.
output<-distinct(input, date, time)
Is there any easy way to tell R to get rid of rows with values that are close to each other but not exactly the same?
Here is an approach that rounds times to make groups based on the hour.
Then, use {dplyr} group_by / slice to get the first row of each group.
input_date <- c("4/20/2014", "5/15/2002", "3/12/2019", "3/12/2019", "3/12/2019", "3/12/2019")
input_time <- c("4:30", "4:30", "9:00", "9:55", "12:00", "12:00")
# make a data.frame
input <- data.frame(date =input_date, time = input_time)
# use dplyr for data manipulation of groups
library(dplyr, warn.conflicts = FALSE)
# take the 1st slice index from each group
input %>%
mutate(datetime = as.POSIXct(sprintf("%s %s", date, time),
format = "%m/%d/%Y %H:%M"),
hour = round(datetime, "hours")) %>%
group_by(hour) %>%
slice(1)
#> # A tibble: 5 x 4
#> # Groups: hour [5]
#> date time datetime hour
#> <chr> <chr> <dttm> <dttm>
#> 1 5/15/2002 4:30 2002-05-15 04:30:00 2002-05-15 05:00:00
#> 2 4/20/2014 4:30 2014-04-20 04:30:00 2014-04-20 05:00:00
#> 3 3/12/2019 9:00 2019-03-12 09:00:00 2019-03-12 09:00:00
#> 4 3/12/2019 9:55 2019-03-12 09:55:00 2019-03-12 10:00:00
#> 5 3/12/2019 12:00 2019-03-12 12:00:00 2019-03-12 12:00:00
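In the output above the 9:55 row is still present, because 9:00 and 9:55 round to different hours. If the goal is to drop any row that falls less than one hour after the last kept row on the same day, a small loop can do that. The sketch below is base R; keep_spaced is a hypothetical helper written for this example, and treating rows exactly one hour apart as distinct is an assumption.

```r
input <- data.frame(
  date = c("4/20/2014", "5/15/2002", "3/12/2019", "3/12/2019", "3/12/2019", "3/12/2019"),
  time = c("4:30", "4:30", "9:00", "9:55", "12:00", "12:00"),
  stringsAsFactors = FALSE
)
input$datetime <- as.POSIXct(paste(input$date, input$time),
                             format = "%m/%d/%Y %H:%M", tz = "UTC")
input <- input[order(input$datetime), ]

# Hypothetical helper: flag rows at least `min_gap` seconds after the last kept row.
# Returns 1/0 because ave() writes the result back into a numeric vector.
keep_spaced <- function(secs, min_gap = 3600) {
  keep <- logical(length(secs))
  last_kept <- -Inf
  for (i in seq_along(secs)) {
    if (secs[i] - last_kept >= min_gap) {
      keep[i] <- TRUE
      last_kept <- secs[i]
    }
  }
  as.numeric(keep)
}

# Apply the helper separately within each date
keep <- ave(as.numeric(input$datetime), input$date, FUN = keep_spaced) == 1
output <- input[keep, c("date", "time")]
print(output)
```

On the example data this keeps the two 4:30 rows (different days), 9:00 and the first 12:00, and drops 9:55 and the repeated 12:00.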

Start and end dates of time periods defined by a column in a data frame

I have a database of hourly data organized in rows and would like to reshape it in such a way as to obtain the start and end times of the periods when the data meet a certain criterion.
Consider the following example: one column holds the sequential hourly times, and the second column holds the dummy variable data.
Yrs= data.frame(Date=seq(as.POSIXct("2019-02-04 01:00:00",tz="UTC"), as.POSIXct("2019-02-04 23:00:00",tz="UTC"), by="hour"))
Yrs$Var=c(1:12,1:11)
I would like to obtain the start and end dates of the period in which the Variable was between say 3 and 7.
Expected result:
StartDate EndDate
2019-02-04 03:00:00 2019-02-04 07:00:00
2019-02-04 15:00:00 2019-02-04 19:00:00
I figure I can create a new column indicating the rows where the criterion is met, but I do not know how to get the start and end of those consecutive periods:
Yrs$Period= ifelse(Yrs$Var >= 3 & Yrs$Var <=7, 1, 0)
I found the reverse of this problem in "Given start date and end date, reshape/expand data for each day between (each day on a row)",
but I am struggling to figure this out. Any help will be greatly appreciated.
Maybe something like:
library(data.table)
setDT(Yrs)[, .(StartDate=Date[Var==3L], EndDate=Date[Var==7L]),
by=.(c(0L, cumsum(diff(Var) < 1L)))][, -1L]
output:
StartDate EndDate
1: 2019-02-04 03:00:00 2019-02-04 07:00:00
2: 2019-02-04 15:00:00 2019-02-04 19:00:00
Why not filter and spread ?
library(dplyr)
library(tidyr) # spread() comes from tidyr
Yrs %>%
filter(Var == 3 | Var == 7) %>%
group_by(Var) %>%
mutate(ind = row_number()) %>%
ungroup() %>%
spread(Var, Date) %>%
select(-ind) %>%
rename(Start_Date = `3`, End_Date = `7`)
# Start_Date End_Date
# <dttm> <dttm>
#1 2019-02-04 03:00:00 2019-02-04 07:00:00
#2 2019-02-04 15:00:00 2019-02-04 19:00:00
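Both answers above rely on Var taking the exact values 3 and 7 at the edges of each period. When that cannot be guaranteed, one possible alternative is to run-length encode the in-range indicator from the question and read off where each run of TRUE starts and ends. This is a base R sketch; the names in_range, runs and periods are chosen for this example.

```r
Yrs <- data.frame(Date = seq(as.POSIXct("2019-02-04 01:00:00", tz = "UTC"),
                             as.POSIXct("2019-02-04 23:00:00", tz = "UTC"),
                             by = "hour"))
Yrs$Var <- c(1:12, 1:11)

# TRUE where the criterion is met
in_range <- Yrs$Var >= 3 & Yrs$Var <= 7

# Consecutive runs of TRUE/FALSE and the row index where each run ends/starts
runs <- rle(in_range)
ends <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1

# Keep only the runs where the criterion holds
periods <- data.frame(StartDate = Yrs$Date[starts[runs$values]],
                      EndDate = Yrs$Date[ends[runs$values]])
print(periods)
```

On the example data this gives the two periods from the expected result, 03:00 to 07:00 and 15:00 to 19:00.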

R Difference in time between rows

I've triangulated information from other SO answers for the below code, but getting stuck with an error message. Searched SO for similar errors and resolutions but haven't been able to figure it out, so help is appreciated.
For every group ("id"), I want to get the difference between the start times for consecutive rows.
Reproducible data:
require(dplyr)
df <-data.frame(id=as.numeric(c("1","1","1","2","2","2")),
start= c("1/31/17 10:00","1/31/17 10:02","1/31/17 10:45",
"2/10/17 12:00", "2/10/17 12:20","2/11/17 09:40"))
time <- strptime(df$start, format = "%m/%d/%y %H:%M")
df %>%
group_by(id)%>%
mutate(diff = time - lag(time),
diff_mins = as.numeric(diff, units = 'mins'))
Gets me error:
Error in mutate_impl(.data, dots) :
Column diff must be length 3 (the group size) or one, not 6
In addition: Warning message:
In unclass(time1) - unclass(time2) :
longer object length is not a multiple of shorter object length
Do you mean something like this?
There is no need for lag here, a simple diff on the grouped times is sufficient.
df %>%
mutate(start = as.POSIXct(start, format = "%m/%d/%y %H:%M")) %>%
group_by(id) %>%
mutate(diff = c(0, diff(start)))
## A tibble: 6 x 3
## Groups: id [2]
# id start diff
# <dbl> <dttm> <dbl>
#1 1. 2017-01-31 10:00:00 0.
#2 1. 2017-01-31 10:02:00 2.
#3 1. 2017-01-31 10:45:00 43.
#4 2. 2017-02-10 12:00:00 0.
#5 2. 2017-02-10 12:20:00 20.
#6 2. 2017-02-11 09:40:00 1280.
You can use lag and difftime (per Hadley):
df %>%
mutate(time = as.POSIXct(start, format = "%m/%d/%y %H:%M")) %>%
group_by(id) %>%
mutate(diff = difftime(time, lag(time)))
# A tibble: 6 x 4
# Groups: id [2]
id start time diff
<dbl> <fct> <dttm> <time>
1 1. 1/31/17 10:00 2017-01-31 10:00:00 <NA>
2 1. 1/31/17 10:02 2017-01-31 10:02:00 2
3 1. 1/31/17 10:45 2017-01-31 10:45:00 43
4 2. 2/10/17 12:00 2017-02-10 12:00:00 <NA>
5 2. 2/10/17 12:20 2017-02-10 12:20:00 20
6 2. 2/11/17 09:40 2017-02-11 09:40:00 1280
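As for the error in the question, it appears to arise because time is a standalone vector of length 6 rather than a column of df, so inside each group of 3 rows the expression time - lag(time) still yields 6 values. Converting the column itself, as both answers do, avoids this. For reference, here is a base R sketch of the same per-group difference that works on the number of seconds, so the result is always in minutes regardless of which units diff would pick automatically; the column name diff_mins follows the question.

```r
df <- data.frame(id = c(1, 1, 1, 2, 2, 2),
                 start = c("1/31/17 10:00", "1/31/17 10:02", "1/31/17 10:45",
                           "2/10/17 12:00", "2/10/17 12:20", "2/11/17 09:40"),
                 stringsAsFactors = FALSE)

# Convert the column itself so it stays aligned with the rows of df
df$start <- as.POSIXct(df$start, format = "%m/%d/%y %H:%M", tz = "UTC")

# Difference to the previous row within each id, in minutes (NA for the first row)
df$diff_mins <- ave(as.numeric(df$start), df$id,
                    FUN = function(x) c(NA, diff(x)) / 60)
print(df)
```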

R: extract hour from variable format timestamp

My dataframe has timestamp with and without seconds, and a random use of 0 in front of months and hours, i.e. 01 or 1
library(tidyverse)
df <- data_frame(cust=c('A','A','B','B'), timestamp=c('5/31/2016 1:03:12', '05/25/2016 01:06',
'6/16/2016 01:03', '12/30/2015 23:04:25'))
cust timestamp
A 5/31/2016 1:03:12
A 05/25/2016 01:06
B 6/16/2016 01:03
B 12/30/2015 23:04:25
How to extract hours into a separate column? The desired output:
cust timestamp hours
A 5/31/2016 1:03:12 1
A 05/25/2016 01:06 1
B 6/16/2016 9:03 9
B 12/30/2015 23:04:25 23
I prefer the answer with tidyverse and mutate, but my attempt fails to extract hours correctly:
df %>% mutate(hours=strptime(timestamp, '%H') %>% as.character() )
# A tibble: 4 × 3
cust timestamp hours
<chr> <chr> <chr>
1 A 5/31/2016 1:03:12 2016-10-31 05:00:00
2 A 05/25/2016 01:06 2016-10-31 05:00:00
3 B 6/16/2016 01:03 2016-10-31 06:00:00
4 B 12/30/2015 23:04:25 2016-10-31 12:00:00
Try this:
library(lubridate)
df <- data.frame(cust=c('A','A','B','B'), timestamp=c('5/31/2016 1:03:12', '05/25/2016 01:06',
'6/16/2016 09:03', '12/30/2015 23:04:25'))
df %>% mutate(hours=hour(strptime(timestamp, '%m/%d/%Y %H:%M')) %>% as.character() )
cust timestamp hours
1 A 5/31/2016 1:03:12 1
2 A 05/25/2016 01:06 1
3 B 6/16/2016 09:03 9
4 B 12/30/2015 23:04:25 23
Here is a solution that appends 00 for the seconds when they are missing, then converts to a date using lubridate and extracts the hours using format. Note, if you don't want the 00:00 at the end of the hours, you can just eliminate them from the output format in format:
df %>%
mutate(
cleanTime = ifelse(grepl(":[0-9][0-9]:", timestamp)
, timestamp
, paste0(timestamp, ":00")) %>% mdy_hms
, hour = format(cleanTime, "%H:00:00")
)
returns:
cust timestamp cleanTime hour
<chr> <chr> <dttm> <chr>
1 A 5/31/2016 1:03:12 2016-05-31 01:03:12 01:00:00
2 A 05/25/2016 01:06 2016-05-25 01:06:00 01:00:00
3 B 6/16/2016 01:03 2016-06-16 01:03:00 01:00:00
4 B 12/30/2015 23:04:25 2015-12-30 23:04:25 23:00:00
Your timestamp is a character string, so you need to parse it as a date-time (with as.POSIXct, for example) before you can extract components such as the hour.
You may have to go through some string manipulation to get consistently formatted data before converting it. Prepend a zero to months with a single digit and append :00 to times with missing seconds, using strsplit() and other regex functions. Afterwards, call as.POSIXct(df$timestamp, format = '%m/%d/%Y %H:%M:%S'); then you will be able to use format() with '%H' to extract the hours.
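As a point of comparison, here is a minimal base R sketch that gets the hours from the question's original timestamps without cleaning the strings first. It relies on strptime accepting single-digit months and hours and ignoring trailing characters (here the optional seconds) once the format has been matched; the variable names parsed and hours are chosen for this example.

```r
timestamp <- c("5/31/2016 1:03:12", "05/25/2016 01:06",
               "6/16/2016 01:03", "12/30/2015 23:04:25")

# Parse up to the minutes; any ":SS" left over at the end is ignored
parsed <- strptime(timestamp, format = "%m/%d/%Y %H:%M", tz = "UTC")

# POSIXlt objects expose the hour as a component
hours <- parsed$hour
print(hours)
```

Inside a dplyr pipeline the same expression can be wrapped in mutate, for example mutate(hours = strptime(timestamp, "%m/%d/%Y %H:%M", tz = "UTC")$hour).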

Resources