I have a dataset with many replicated rows, and I want to make a dataset with no replications. Date and time are the main ways of distinguishing between distinct and similar rows, but sometimes the times are slightly off. I want to reduce my dataset so that if two rows fall within 1 hour of each other on the same day, the second instance does not show up.
library(dplyr)

input_date <- c("4/20/2014", "5/15/2002", "3/12/2019", "3/12/2019", "3/12/2019", "3/12/2019")
input_time <- c("4:30", "4:30", "9:00", "9:55", "12:00", "12:00")
input <- data.frame(date = input_date, time = input_time)
# use distinct() to remove exact duplicates -- this removes the final row,
# but I want it to also remove row 4
output <- distinct(input, date, time)
Is there any easy way to tell R to get rid of rows with values that are close to each other but not exactly the same?
Here is an approach that rounds each time to the nearest hour to form groups, then uses {dplyr} group_by()/slice() to keep the first row of each group.
input_date <- c("4/20/2014", "5/15/2002", "3/12/2019", "3/12/2019", "3/12/2019", "3/12/2019")
input_time <- c("4:30", "4:30", "9:00", "9:55", "12:00", "12:00")
# make a data.frame
input <- data.frame(date = input_date, time = input_time)
# use dplyr for data manipulation of groups
library(dplyr, warn.conflicts = FALSE)
# take the 1st slice index from each group
input %>%
  mutate(datetime = as.POSIXct(sprintf("%s %s", date, time),
                               format = "%m/%d/%Y %H:%M"),
         # round() on POSIXct returns POSIXlt; convert back to POSIXct for dplyr
         hour = as.POSIXct(round(datetime, "hours"))) %>%
  group_by(hour) %>%
  slice(1)
#> # A tibble: 5 x 4
#> # Groups:   hour [5]
#>   date      time  datetime            hour
#>   <chr>     <chr> <dttm>              <dttm>
#> 1 5/15/2002 4:30  2002-05-15 04:30:00 2002-05-15 05:00:00
#> 2 4/20/2014 4:30  2014-04-20 04:30:00 2014-04-20 05:00:00
#> 3 3/12/2019 9:00  2019-03-12 09:00:00 2019-03-12 09:00:00
#> 4 3/12/2019 9:55  2019-03-12 09:55:00 2019-03-12 10:00:00
#> 5 3/12/2019 12:00 2019-03-12 12:00:00 2019-03-12 12:00:00
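One caveat: rounding to the nearest hour is not quite the same as "within 1 hour of each other"; for example, 9:25 and 9:35 are only 10 minutes apart but round to 9:00 and 10:00. If you need the literal gap rule, here is a sketch based on consecutive gaps, using the same input data frame as above (note that groups can chain, so a long run of rows each under an hour apart collapses to a single row):

input %>%
  mutate(datetime = as.POSIXct(sprintf("%s %s", date, time),
                               format = "%m/%d/%Y %H:%M")) %>%
  arrange(datetime) %>%
  # start a new group whenever the gap to the previous row exceeds 1 hour
  mutate(grp = cumsum(c(TRUE, as.numeric(diff(datetime), units = "secs") > 3600))) %>%
  group_by(grp) %>%
  slice(1) %>%
  ungroup() %>%
  select(date, time)

On the example data this keeps the two 4:30 rows and the 9:00 and 12:00 rows, dropping 9:55 (within an hour of 9:00) and the duplicated 12:00.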
Related
I have a very large dataset with date and time in a single column, on 15-minute intervals corresponding to the data. Unfortunately the software recording the data has some issues, so at random there are missing 15-minute intervals (usually 1 or 2, but sometimes 3 or 4 in a row). The dataset is reported as follows:
Date_and_time Pressure
2016-07-08 18:00:00 3.542
2016-07-08 18:15:00 5:444
2016-07-08 18:45:00 2:556
2016-07-08 19:00:00 4:567
I am looking for a way to insert a row for each missing time frame. My goal is to stack this data for multiple sites on top of each other, and I need to make sure, for graphing purposes, that they line up.
If you can perfectly guarantee that all times are aligned on the quarter hour, then you could try this:
library(dplyr)

tibble(Date_and_time = do.call(seq, c(as.list(range(dat$Date_and_time)), by = "15 mins"))) %>%
  full_join(dat, by = "Date_and_time")
# # A tibble: 5 x 2
# Date_and_time Pressure
# <dttm> <chr>
# 1 2016-07-08 18:00:00 3.542
# 2 2016-07-08 18:15:00 5:444
# 3 2016-07-08 18:30:00 <NA>
# 4 2016-07-08 18:45:00 2:556
# 5 2016-07-08 19:00:00 4:567
If you think there is a chance that your times are not perfectly aligned (even a fraction of a second will introduce unnecessary rows), then we can turn this into a problem of "enforce a gap of no more than 15 minutes":
dat %>%
  group_by(grp = cumsum(c(FALSE, as.numeric(diff(Date_and_time), units = "mins") > 15))) %>%
  summarize(Date_and_time = max(Date_and_time) + 15*60) %>%
  bind_rows(dat) %>%
  arrange(Date_and_time) %>%
  select(-grp)
# # A tibble: 6 x 2
# Date_and_time Pressure
# <dttm> <chr>
# 1 2016-07-08 18:00:00 3.542
# 2 2016-07-08 18:15:00 5:444
# 3 2016-07-08 18:30:00 <NA>
# 4 2016-07-08 18:45:00 2:556
# 5 2016-07-08 19:00:00 4:567
# 6 2016-07-08 19:15:00 <NA>
Notice that the last added row is unnecessary; it can be removed in a simple clean-up step (see the sketch after this list). The premise of this second method is that it creates groups in which consecutive rows are gapped 15 minutes or less, and then adds a row 15 minutes after the last row of each group. This ensures that there is no gap greater than 15 minutes, but:
It will always produce a single row at the bottom that may not be needed; and
It makes no guarantee about the gap between the added rows and the rows beneath them. For example, if your third row were instead at "2016-07-08 18:31:00", then the times would sequence through "18:15:00", "18:30:00", then "18:31:00" (with a 1-minute gap).
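A minimal clean-up sketch for the first point, assuming the result of the pipeline above has been stored in res:

# drop any added rows that fall after the last observed time in `dat`
res %>%
  filter(Date_and_time <= max(dat$Date_and_time))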
Data
dat <- structure(list(Date_and_time = structure(c(1468015200, 1468016100, 1468017900, 1468018800), class = c("POSIXct", "POSIXt"), tzone = ""), Pressure = c("3.542", "5:444", "2:556", "4:567")), row.names = c(NA, -4L), class = "data.frame")
You could make a sequence that has all potential sampling times and then join your data to that.
library(tidyverse)

# POSIXct is simpler to store in data frames and join on than POSIXlt
ALL_PERIODS <- data.frame(SAMPLE_TIME = seq.POSIXt(from = as.POSIXct("2016-07-08 18:00:00"),
                                                   to = as.POSIXct("2016-07-08 20:00:00"),
                                                   by = "15 min"))
SAMPLE_DATA <- data.frame(Date_and_time = as.POSIXct(c("2016-07-08 18:00:00", "2016-07-08 18:15:00",
                                                       "2016-07-08 18:45:00", "2016-07-08 19:00:00")),
                          pressure = c(3.542, 5.444, 2.556, 4.567))
ALL_PERIODS_DATA <- left_join(ALL_PERIODS, SAMPLE_DATA, by = c("SAMPLE_TIME" = "Date_and_time"))
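For reference, the join should produce something like this, with NA marking the periods that had no sample:

head(ALL_PERIODS_DATA)
#           SAMPLE_TIME pressure
# 1 2016-07-08 18:00:00    3.542
# 2 2016-07-08 18:15:00    5.444
# 3 2016-07-08 18:30:00       NA
# 4 2016-07-08 18:45:00    2.556
# 5 2016-07-08 19:00:00    4.567
# 6 2016-07-08 19:15:00       NA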
Given the dataframe below
class timestamp
1 A 2019-02-14 15:00:29
2 A 2019-01-27 17:59:53
3 A 2019-01-27 18:00:00
4 B 2019-02-02 18:00:00
5 C 2019-03-08 16:00:37
observation 2 and 3 point to the same event. How do I remove rows belonging to the same class if another timestamp within 2 minutes already exists?
Desired output:
class timestamp
1 A 2019-02-14 15:00:00
2 A 2019-01-27 18:00:00
3 B 2019-02-02 18:00:00
4 C 2019-03-08 16:00:00
round( ,c("mins")) can be used to get rid of the second component but if the timestamps are to far off some test samples will be rounded to the wrong minute leaving still different timestamps
EDIT
I think I over-complicated the problem in my first attempt. What should work for your case is to round times to 2-minute intervals, which we can do using round_date from lubridate.
library(lubridate)
library(dplyr)
df %>%
  mutate(timestamp = round_date(as.POSIXct(timestamp), unit = "2 minutes")) %>%
  group_by(class) %>%
  filter(!duplicated(timestamp))
# class timestamp
# <chr> <dttm>
#1 A 2019-02-14 15:00:00
#2 A 2019-01-27 18:00:00
#3 B 2019-02-02 18:00:00
#4 C 2019-03-08 16:00:00
Original Attempt
We can first convert timestamp to a POSIXct object, then arrange rows by class and timestamp, use cut to divide them into "2 mins" intervals, and then remove duplicates.
library(dplyr)
df %>%
  mutate(timestamp = as.POSIXct(timestamp)) %>%
  arrange(class, timestamp) %>%
  group_by(class) %>%
  filter(!duplicated(as.numeric(cut(timestamp, breaks = "2 mins")), fromLast = TRUE))
# class timestamp
# <chr> <dttm>
#1 A 2019-01-27 18:00:00
#2 A 2019-02-14 15:00:29
#3 B 2019-02-02 18:00:00
#4 C 2019-03-08 16:00:37
Here I haven't changed or rounded the timestamp column, but it would be simple to round it by using cut inside mutate, as sketched below. Also, if you want to keep the first entry (2019-01-27 17:59:53) instead, remove the fromLast = TRUE argument.
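A sketch of that rounding, assuming df as printed in the question with timestamp as character. Note that cut() floors each timestamp to the start of its 2-minute interval (anchored at the earliest timestamp) rather than rounding to the nearest one:

df %>%
  mutate(timestamp = as.POSIXct(timestamp),
         # cut() returns a factor of interval-start labels; convert back
         timestamp = as.POSIXct(as.character(cut(timestamp, breaks = "2 mins")))) %>%
  arrange(class, timestamp) %>%
  group_by(class) %>%
  filter(!duplicated(timestamp))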
I used the following R code to create a POSIXct date-time field from separate date and time fields, both in character format, using lubridate and dplyr.
library(dplyr)
library(lubridate)
c_cycle_work <- tibble(
StartDate = c("1/28/2011", "2/26/2011", "4/2/2011", "4/11/2011"),
StartTime = c("10:58", "6:02", "6:00", "9:47")
)
c_cycle_work %>%
  mutate(start_dt = paste0(StartDate, StartTime, sep = " ", collapse = NULL)) %>%
  mutate(start_dt = mdy_hms(start_dt))
# 1 1/28/2011 10:58 2020-01-28 11:10:58
# 2 2/26/2011 6:02 2020-02-26 11:06:02
# 3 4/2/2011 6:00 2020-04-02 11:06:00
# 4 4/11/2011 9:47 2020-04-11 11:09:47
The start_dt field I created comes out in y-m-d order even though I used mdy_hms to match the data. Also, all years have been changed to 2020.
Went over this several times, used paste vs. paste0, etc. but still stumped.
Your problem is the paste0(), which doesn't have a sep= argument, so sep = " " is pasted on as just another string. When you paste the date and time you get "1/28/201110:58 ", and mdy_hms() mis-splits that into 1/28/20 11:10:58 (though it seemed to work differently with my version, lubridate_1.6.0). Also, you were using an "hms" parser but your times don't have seconds, so use mdy_hm(). This should work with your data:
c_cycle_work %>%
  mutate(start_dt = paste(StartDate, StartTime, sep = " ")) %>%
  mutate(start_dt = mdy_hm(start_dt))
# StartDate StartTime start_dt
# <chr> <chr> <dttm>
# 1 1/28/2011 10:58 2011-01-28 10:58:00
# 2 2/26/2011 6:02 2011-02-26 06:02:00
# 3 4/2/2011 6:00 2011-04-02 06:00:00
# 4 4/11/2011 9:47 2011-04-11 09:47:00
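To see the paste0() behavior in isolation (sep is swallowed into ... and pasted on as an extra string):

paste0("1/28/2011", "10:58", sep = " ")
# [1] "1/28/201110:58 "
paste("1/28/2011", "10:58", sep = " ")
# [1] "1/28/2011 10:58"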
I've triangulated information from other SO answers for the code below, but I'm getting stuck with an error message. I've searched SO for similar errors and resolutions but haven't been able to figure it out, so help is appreciated.
For every group ("id"), I want to get the difference between the start times for consecutive rows.
Reproducible data:
require(dplyr)
df <- data.frame(id = as.numeric(c("1", "1", "1", "2", "2", "2")),
                 start = c("1/31/17 10:00", "1/31/17 10:02", "1/31/17 10:45",
                           "2/10/17 12:00", "2/10/17 12:20", "2/11/17 09:40"))
time <- strptime(df$start, format = "%m/%d/%y %H:%M")
df %>%
  group_by(id) %>%
  mutate(diff = time - lag(time),
         diff_mins = as.numeric(diff, units = 'mins'))
Gets me error:
Error in mutate_impl(.data, dots) :
Column diff must be length 3 (the group size) or one, not 6
In addition: Warning message:
In unclass(time1) - unclass(time2) :
longer object length is not a multiple of shorter object length
Do you mean something like this? Your error occurs because time is a separate vector of length 6, while each group has only 3 rows, so the lengths no longer match inside the groups; build the datetime column inside the data frame instead. There is also no need for lag here; a simple diff on the grouped times is sufficient.
df %>%
  mutate(start = as.POSIXct(start, format = "%m/%d/%y %H:%M")) %>%
  group_by(id) %>%
  mutate(diff = c(0, diff(start)))
## A tibble: 6 x 3
## Groups: id [2]
# id start diff
# <dbl> <dttm> <dbl>
#1 1. 2017-01-31 10:00:00 0.
#2 1. 2017-01-31 10:02:00 2.
#3 1. 2017-01-31 10:45:00 43.
#4 2. 2017-02-10 12:00:00 0.
#5 2. 2017-02-10 12:20:00 20.
#6 2. 2017-02-11 09:40:00 1280.
You can use lag and difftime (per Hadley):
df %>%
  mutate(time = as.POSIXct(start, format = "%m/%d/%y %H:%M")) %>%
  group_by(id) %>%
  mutate(diff = difftime(time, lag(time)))
# A tibble: 6 x 4
# Groups: id [2]
id start time diff
<dbl> <fct> <dttm> <time>
1 1. 1/31/17 10:00 2017-01-31 10:00:00 <NA>
2 1. 1/31/17 10:02 2017-01-31 10:02:00 2
3 1. 1/31/17 10:45 2017-01-31 10:45:00 43
4 2. 2/10/17 12:00 2017-02-10 12:00:00 <NA>
5 2. 2/10/17 12:20 2017-02-10 12:20:00 20
6 2. 2/11/17 09:40 2017-02-11 09:40:00 1280
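One caveat worth noting: without an explicit units argument, diff() and difftime() choose their units automatically, which can vary with the size of the gap. A sketch pinning the result to minutes:

df %>%
  mutate(time = as.POSIXct(start, format = "%m/%d/%y %H:%M")) %>%
  group_by(id) %>%
  # units = "mins" keeps the column comparable across groups
  mutate(diff_mins = as.numeric(difftime(time, lag(time), units = "mins")))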
Suppose I have a daily rain data.frame like this:
df.meteoro = data.frame(Dates = seq(as.Date("2017/1/19"), as.Date("2018/1/18"), "days"),
                        rain = rnorm(length(seq(as.Date("2017/1/19"), as.Date("2018/1/18"), "days"))))
I'm trying to sum the accumulated rain over 14-day intervals with this code:
library(tidyverse)
library(lubridate)
df.rain <- df.meteoro %>%
  mutate(TwoWeeks = round_date(Dates, "14 days")) %>%
  group_by(TwoWeeks) %>%
  summarise(sum_rain = sum(rain))
The problem is that it isn't starting on 2017-01-19 but on 2017-01-15 and I was expecting my output dates to be:
"2017-02-02" "2017-02-16" "2017-03-02" "2017-03-16" "2017-03-30" "2017-04-13"
"2017-04-27" "2017-05-11" "2017-05-25" "2017-06-08" "2017-06-22" "2017-07-06" "2017-07-20"
"2017-08-03" "2017-08-17" "2017-08-31" "2017-09-14" "2017-09-28" "2017-10-12" "2017-10-26"
"2017-11-09" "2017-11-23" "2017-12-07" "2017-12-21" "2018-01-04" "2018-01-18"
TL;DR: I have a year-long daily rain data.frame and want to sum the accumulated rain for the dates above.
Please help.
Using round_date in the way you have shown will not give you 14-day periods as you might expect. I have taken a different approach in this solution: generate a sequence of dates between your first and last dates, group these into 14-day periods, and then join the dates to your observations.
startdate = min(df.meteoro$Dates)
enddate = max(df.meteoro$Dates)
dateseq = data.frame(Dates = seq.Date(startdate, enddate, by = 1)) %>%
  mutate(group = as.numeric(Dates - startdate) %/% 14) %>%
  group_by(group) %>%
  mutate(starts = min(Dates))
df.rain <- df.meteoro %>%
  right_join(dateseq) %>%
  group_by(starts) %>%
  summarise(sum_rain = sum(rain))
head(df.rain)
> head(df.rain)
# A tibble: 6 x 2
starts sum_rain
<date> <dbl>
1 2017-01-19 6.09
2 2017-02-02 5.55
3 2017-02-16 -3.40
4 2017-03-02 2.55
5 2017-03-16 -0.12
6 2017-03-30 8.95
Using a right join to the date sequence ensures that if missing observation days spanned a complete time period, you'd still get that period listed in the result (though in your case you have a complete year of dates anyway).
round_date rounds to the nearest multiple of unit (here, 14 days) since some epoch (probably the Unix epoch of 1970-01-01 00:00:00), which doesn't line up with your purpose.
To get what you want, you can do the following:
df.rain = df.meteoro %>%
  mutate(days_since_start = as.numeric(Dates - as.Date("2017/1/18")),
         TwoWeeks = as.Date("2017/1/18") + 14*ceiling(days_since_start/14)) %>%
  group_by(TwoWeeks) %>%
  summarise(sum_rain = sum(rain))
This computes days_since_start as the days since 2017/1/18 and then manually rounds to the next multiple of two weeks.
Assuming you want to round to the closest date among the ones you have specified, I guess the following will work:
library(lubridate) # for ymd()
library(plyr)      # for ddply()

targetDates <- seq(ymd("2017-02-02"), ymd("2018-01-18"), by = '14 days')
# assign each observation the nearest target date (distance measured in days)
df.meteoro$Dates <- targetDates[sapply(df.meteoro$Dates, function(x)
  which.min(abs(as.numeric(difftime(targetDates, x, units = "days")))))]
sum_rain <- ddply(df.meteoro, .(Dates), summarize, sum_rain = sum(rain, na.rm = TRUE))
As you can see, not all dates have the same number of observations. The date "2017-02-02", for instance, gets all the records from "2017-01-19" through "2017-02-09", which is 22 records. From "2017-02-10" on, dates are rounded to "2017-02-16", and so on.
This may be a cheat, but assuming each row/observation is a separate day, why not just group by every 14 rows and sum?
# Assign interval groups, each 14 rows
df.meteoro$my_group <- rep(1:100, each = 14, length.out = nrow(df.meteoro))
# Grab Interval Names
my_interval_names <- df.meteoro %>%
  select(-rain) %>%
  group_by(my_group) %>%
  slice(1)
# Summarise
df.meteoro %>%
  group_by(my_group) %>%
  summarise(rain = sum(rain)) %>%
  left_join(., my_interval_names)
#> Joining, by = "my_group"
#> # A tibble: 27 x 3
#> my_group rain Dates
#> <int> <dbl> <date>
#> 1 1 3.86 2017-01-19
#> 2 2 -0.581 2017-02-02
#> 3 3 -0.876 2017-02-16
#> 4 4 1.80 2017-03-02
#> 5 5 3.79 2017-03-16
#> 6 6 -3.50 2017-03-30
#> 7 7 5.31 2017-04-13
#> 8 8 2.57 2017-04-27
#> 9 9 -1.33 2017-05-11
#> 10 10 5.41 2017-05-25
#> # ... with 17 more rows
Created on 2018-03-01 by the reprex package (v0.2.0).
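The same row-based grouping can also be done without pre-allocating a group vector, using row_number() (a sketch; it keeps the assumption that every day has exactly one row, in date order):

library(dplyr) # already loaded via tidyverse above

df.meteoro %>%
  # integer-divide the 0-based row index to form blocks of 14 rows
  group_by(my_group = (row_number() - 1) %/% 14) %>%
  summarise(Dates = first(Dates), rain = sum(rain))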