Leave only unique rows in 3 columns out of 4 [duplicate] - r

I have a dataframe:
Date ID Type Value
2020-08-04 03:00:00 1 active 14
2020-08-04 03:00:00 1 active 15
2020-08-04 03:00:00 2 active 16
2020-08-04 03:00:00 2 passive 17
I want to remove rows that have the same values in the columns Date, ID, and Type. The desired result is:
Date ID Type Value
2020-08-04 03:00:00 1 active 14
2020-08-04 03:00:00 2 active 16
2020-08-04 03:00:00 2 passive 17
As you can see, the second row has disappeared. How can I do that?

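I would suggest building a combined key from the three columns with paste() and then dropping repeated keys with !duplicated(). With fromLast = FALSE the first row of each Date/ID/Type combination is kept; fromLast = TRUE would keep the last one instead:
#Code
# keep the first row of each Date/ID/Type combination
mdf[!duplicated(paste(mdf$Date, mdf$ID, mdf$Type), fromLast = FALSE), ]
Output:
Date ID Type Value
1 04/08/2020 3:00:00 1 active 14
3 04/08/2020 3:00:00 2 active 16
4 04/08/2020 3:00:00 2 passive 17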
Some data used:
#Data
mdf <- structure(list(Date = c("04/08/2020 3:00:00", "04/08/2020 3:00:00",
"04/08/2020 3:00:00", "04/08/2020 3:00:00"), ID = c(1L, 1L, 2L,
2L), Type = c("active", "active", "active", "passive"), Value = 14:17), row.names = c(NA,
-4L), class = "data.frame")
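As a variation on the same idea, duplicated() also accepts a data frame, so the key columns can be passed directly without pasting them together. This is only a minimal sketch on the sample data above; the name key_cols is introduced here for illustration.

```r
# Sample data from the answer above
mdf <- data.frame(
  Date  = rep("04/08/2020 3:00:00", 4),
  ID    = c(1L, 1L, 2L, 2L),
  Type  = c("active", "active", "active", "passive"),
  Value = 14:17
)

# Columns that define a duplicate (hypothetical helper name)
key_cols <- c("Date", "ID", "Type")

# duplicated() on a data frame flags rows whose key columns repeat an earlier row
res <- mdf[!duplicated(mdf[key_cols]), ]
print(res)
```

This avoids any ambiguity that could arise if pasted keys from different rows happened to produce the same string.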

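If your goal is to keep the minimum Value within each Date, ID, Type group, you can use this dplyr solution:
library(dplyr)
mdf %>%
group_by(Date, ID, Type) %>%
mutate(Value = min(Value)) %>% # overwrite Value with the group minimum
unique() # then drop the rows that are now identical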
Which gives us:
Date ID Type Value
<chr> <int> <chr> <int>
1 04/08/2020 3:00:00 1 active 14
2 04/08/2020 3:00:00 2 active 16
3 04/08/2020 3:00:00 2 passive 17
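If you would rather not load dplyr, the same "minimum per group" result can probably be obtained in base R with aggregate(). This is a sketch under the assumption that only the minimum Value per Date/ID/Type group is needed; the row order of the result may differ from the dplyr output.

```r
# Sample data from the first answer
mdf <- data.frame(
  Date  = rep("04/08/2020 3:00:00", 4),
  ID    = c(1L, 1L, 2L, 2L),
  Type  = c("active", "active", "active", "passive"),
  Value = 14:17
)

# One row per Date/ID/Type group, holding the smallest Value of that group
res <- aggregate(Value ~ Date + ID + Type, data = mdf, FUN = min)
print(res)
```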

Using dplyr
tble = read.table(text='
S.no Date ID Type Value
1 2020-08-04 03:00:00 1 active 14
2 2020-08-04 03:00:00 1 active 15
3 2020-08-04 03:00:00 2 active 16
4 2020-08-04 03:00:00 2 passive 17')
library(dplyr)
tble %>% distinct(Date, ID, Type, .keep_all=TRUE)
#> S.no Date ID Type Value
#> 1 2020-08-04 03:00:00 1 active 14
#> 3 2020-08-04 03:00:00 2 active 16
#> 4 2020-08-04 03:00:00 2 passive 17
Created on 2020-09-04 by the reprex package (v0.3.0)

Related

Sum time across different continuous time events across date and time combinations in R

I am having trouble figuring out how to account for and sum continuous time observations across multiple dates and time events in my dataset. A similar question is found here, but it only accounts for one instance of a continuous time event. I have a dataset with multiple date and time combinations. Here is an example from that dataset, which I am manipulating in R:
date.1 <- c("2021-07-21", "2021-07-21", "2021-07-21", "2021-07-29", "2021-07-29", "2021-07-30", "2021-08-01","2021-08-01","2021-08-01")
time.1 <- c("15:57:59", "15:58:00", "15:58:01", "15:46:10", "15:46:13", "18:12:10", "18:12:10","18:12:11","18:12:13")
df <- data.frame(date.1, time.1)
df
date.1 time.1
1 2021-07-21 15:57:59
2 2021-07-21 15:58:00
3 2021-07-21 15:58:01
4 2021-07-29 15:46:10
5 2021-07-29 15:46:13
6 2021-07-30 18:12:10
7 2021-08-01 18:12:10
8 2021-08-01 18:12:11
9 2021-08-01 18:12:13
I tried the following script from the linked question:
df$missingflag <- c(1, diff(as.POSIXct(df$time.1, format="%H:%M:%S", tz="UTC"))) > 1
df
date.1 time.1 missingflag
1 2021-07-21 15:57:59 FALSE
2 2021-07-21 15:58:00 TRUE
3 2021-07-21 15:58:01 FALSE
4 2021-07-29 15:46:10 FALSE
5 2021-07-29 15:46:13 TRUE
6 2021-07-30 18:12:10 TRUE
7 2021-08-01 18:12:10 FALSE
8 2021-08-01 18:12:11 FALSE
9 2021-08-01 18:12:13 TRUE
But it did not work as anticipated and did not get me closer to my answer. It would only have been an intermediate step and probably wouldn't answer my question anyway.
The goal is to account for all the continuous time observations and put them into a new table like this:
date.1 time.1 secs
1 2021-07-21 15:57:59 3
4 2021-07-29 15:46:10 1
5 2021-07-29 15:46:13 1
6 2021-07-30 18:12:10 1
7 2021-08-01 18:12:10 2
9 2021-08-01 18:12:13 1
As you can see, the start time of each continuous run of observations is recorded, together with the total number of seconds (secs) observed in that run. The script needs to account for date.1, as there are multiple dates in the dataset.
Thank you in advance.
You can create a datetime object by combining the date and time columns, take the difference between consecutive values, and start a new group whenever the gap exceeds 1 second, so that observations 1 second apart fall into the same group. For each group, count the number of rows and keep the first datetime value.
library(dplyr)
library(tidyr)
df %>%
unite(datetime, date.1, time.1, sep = ' ') %>%
mutate(datetime = lubridate::ymd_hms(datetime)) %>%
group_by(grp = cumsum(difftime(datetime,
lag(datetime, default = first(datetime)), units = 'secs') > 1)) %>%
summarise(datetime = first(datetime),
secs = n(), .groups = 'drop') %>%
select(-grp)
# datetime secs
# <dttm> <int>
#1 2021-07-21 15:57:59 3
#2 2021-07-29 15:46:10 1
#3 2021-07-29 15:46:13 1
#4 2021-07-30 18:12:10 1
#5 2021-08-01 18:12:10 2
#6 2021-08-01 18:12:13 1
I have kept datetime as a single combined column here, but if needed you can split it back into two columns using
%>% separate(datetime, c('date', 'time'), sep = ' ')
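For reference, the same gap-and-cumsum grouping can be sketched in base R. This is only an illustration on the question's sample data; it assumes the rows are already sorted by time and parses the timestamps as UTC, and the names dt, grp and res are introduced here.

```r
date.1 <- c("2021-07-21", "2021-07-21", "2021-07-21", "2021-07-29", "2021-07-29",
            "2021-07-30", "2021-08-01", "2021-08-01", "2021-08-01")
time.1 <- c("15:57:59", "15:58:00", "15:58:01", "15:46:10", "15:46:13",
            "18:12:10", "18:12:10", "18:12:11", "18:12:13")
df <- data.frame(date.1, time.1)

# Combine date and time into one timestamp (UTC is an assumption)
dt <- as.POSIXct(paste(df$date.1, df$time.1), tz = "UTC")

# Seconds since the previous row; a gap above 1 second starts a new run
grp <- cumsum(c(0, diff(as.numeric(dt))) > 1)

# First row of each run plus the number of rows (seconds) in that run
first_row <- !duplicated(grp)
res <- data.frame(
  date.1 = df$date.1[first_row],
  time.1 = df$time.1[first_row],
  secs   = rle(grp)$lengths
)
print(res)
```

Because grp never decreases, rle() returns one length per run in the same order as the first rows.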

Split a row into two when a date range spans a change in calendar year

I am trying to figure out how to add a row when a date range spans a calendar year. Below is a minimal reprex:
I have a data frame like this:
have <- data.frame(
from = c(as.Date('2018-12-15'), as.Date('2019-12-20'), as.Date('2019-05-13')),
to = c(as.Date('2019-06-20'), as.Date('2020-01-25'), as.Date('2019-09-10'))
)
have
#> from to
#> 1 2018-12-15 2019-06-20
#> 2 2019-12-20 2020-01-25
#> 3 2019-05-13 2019-09-10
I want a data.frame in which a row is split into two rows when from and to span a change of calendar year.
want <- data.frame(
from = c(as.Date('2018-12-15'), as.Date('2019-01-01'), as.Date('2019-12-20'), as.Date('2020-01-01'), as.Date('2019-05-13')),
to = c(as.Date('2018-12-31'), as.Date('2019-06-20'), as.Date('2019-12-31'), as.Date('2020-01-25'), as.Date('2019-09-10'))
)
want
#> from to
#> 1 2018-12-15 2018-12-31
#> 2 2019-01-01 2019-06-20
#> 3 2019-12-20 2019-12-31
#> 4 2020-01-01 2020-01-25
#> 5 2019-05-13 2019-09-10
I want to do this because, for a particular row, I want to know how many days fall in each year.
want$time_diff_by_year <- difftime(want$to, want$from)
Created on 2020-05-15 by the reprex package (v0.3.0)
Any base R, tidyverse solutions would be much appreciated.
You can determine the additional years needed for your date intervals with map2, then unnest to create additional rows for each year.
Then, you can identify date intervals of intersections between partial years and a full calendar year. This will keep the partial years starting Jan 1 or ending Dec 31 for a given year.
library(tidyverse)
library(lubridate)
have %>%
mutate(date_int = interval(from, to),
year = map2(year(from), year(to), seq)) %>%
unnest(year) %>%
mutate(year_int = interval(as.Date(paste0(year, '-01-01')), as.Date(paste0(year, '-12-31'))),
year_sect = intersect(date_int, year_int),
from_new = as.Date(int_start(year_sect)),
to_new = as.Date(int_end(year_sect))) %>%
select(from_new, to_new)
Output
# A tibble: 5 x 2
from_new to_new
<date> <date>
1 2018-12-15 2018-12-31
2 2019-01-01 2019-06-20
3 2019-12-20 2019-12-31
4 2020-01-01 2020-01-25
5 2019-05-13 2019-09-10
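Since the question also asks about base R, here is a rough sketch of the same idea without lubridate: list every calendar year a range touches, take Jan 1 and Dec 31 of each, then overwrite the first start and the last end with the original dates. The helper name split_by_year is made up for this example.

```r
have <- data.frame(
  from = as.Date(c("2018-12-15", "2019-12-20", "2019-05-13")),
  to   = as.Date(c("2019-06-20", "2020-01-25", "2019-09-10"))
)

# Hypothetical helper: split one from/to pair at calendar-year boundaries
split_by_year <- function(from, to) {
  yrs    <- seq(as.integer(format(from, "%Y")), as.integer(format(to, "%Y")))
  starts <- as.Date(paste0(yrs, "-01-01"))
  ends   <- as.Date(paste0(yrs, "-12-31"))
  starts[1] <- from          # first piece starts at the original start
  ends[length(ends)] <- to   # last piece ends at the original end
  data.frame(from = starts, to = ends)
}

# Apply row by row (indexing keeps the Date class) and stack the pieces
res <- do.call(rbind, lapply(seq_len(nrow(have)), function(i) {
  split_by_year(have$from[i], have$to[i])
}))
print(res)
```

The sketch assumes from is never later than to; ranges spanning more than two years would simply produce one row per year.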

Dividing data based on custom date range

I have a time series which spans multiple years and want to divide it into four categories based on date (i.e., 15 April - 10 May, 11 May - 10 July, and so on). My first thought was to use lubridate to define each time period with interval() and then use %within% to determine whether an event occurs within it or not.
df
id datetime
1 HAR10 2019-06-26 04:35:06
2 HAR05 2019-08-05 19:15:00
3 HAR07 2018-07-26 01:01:00
4 HAR07 2018-07-24 23:36:00
5 HAR05 2019-08-27 18:59:43
6 HAR05 2019-07-12 03:33:00
7 HAR07 2018-08-09 16:21:00
8 HAR07 2019-05-01 00:04:28
9 HAR04 2019-07-01 05:25:00
10 HAR07 2018-07-18 15:17:00
perA <- interval(ymd(20190511), ymd(20190710))
df$datetime %within% perA
I immediately ran into a problem with year, since I want to get all events from, say, April - May, regardless of what year they occurred, but interval is year-specific so the interval defined above works for my 2019 data but not my 2018 data. I could define a new set of intervals for each year, but that seems very messy.
Another problem is that a vector of TRUE and FALSE, which %within% returns, is not what I need. I need to assign each event to a category based on which time range it falls within.
My second thought was to use filter(), but I don't think that solves either of my problems. Any help is appreciated!
You can easily extract the month, day or even hour and set every date to the same year. I made up some groups. This is a dplyr solution, but you should be able to convert it to base R easily if you prefer.
library(dplyr)
library(lubridate)
df %>%
mutate(noyeardate = as.Date(paste(2000, month(datetime), day(datetime), sep = "-")),
interval = case_when(noyeardate %within% interval(ymd(20000101), ymd(20000331)) ~ "Group 1",
noyeardate %within% interval(ymd(20000401), ymd(20000630)) ~ "Group 2",
noyeardate %within% interval(ymd(20000701), ymd(20000930)) ~ "Group 3",
noyeardate %within% interval(ymd(20001001), ymd(20001231)) ~ "Group 4"))
id datetime noyeardate interval
1 HAR10 2018-07-18 15:17:00 2000-07-18 Group 3
2 HAR05 2018-07-24 23:36:00 2000-07-24 Group 3
3 HAR07 2018-07-26 01:01:00 2000-07-26 Group 3
4 HAR07 2018-08-09 16:21:00 2000-08-09 Group 3
5 HAR05 2019-05-01 00:04:28 2000-05-01 Group 2
6 HAR05 2019-06-26 04:35:06 2000-06-26 Group 2
7 HAR07 2019-07-01 05:25:00 2000-07-01 Group 3
8 HAR07 2019-07-12 03:33:00 2000-07-12 Group 3
9 HAR04 2019-08-05 19:15:00 2000-08-05 Group 3
10 HAR07 2019-08-27 18:59:43 2000-08-27 Group 3
Data:
df <- data.frame(id = c("HAR10", "HAR05", "HAR07", "HAR07", "HAR05", "HAR05", "HAR07", "HAR07", "HAR04", "HAR07"),
datetime = as.POSIXct(c("2018-07-18 15:17:00", "2018-07-24 23:36:00",
"2018-07-26 01:01:00", "2018-08-09 16:21:00", "2019-05-01 00:04:28",
"2019-06-26 04:35:06", "2019-07-01 05:25:00", "2019-07-12 03:33:00",
"2019-08-05 19:15:00", "2019-08-27 18:59:43")))
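A year-independent alternative in base R is to turn each date into a month-day number (for example 15 April becomes 415) and look it up against the period start days with findInterval(). The sketch below uses the two periods named in the question; the third period label and the "before 15 Apr" bucket are assumptions, since the question does not list the remaining ranges, and UTC is assumed for parsing.

```r
df <- data.frame(
  id = c("HAR10", "HAR05", "HAR05"),
  datetime = as.POSIXct(c("2018-07-18 15:17:00", "2019-05-01 00:04:28",
                          "2019-06-26 04:35:06"), tz = "UTC")
)

# Month-day as an integer, e.g. 2019-05-01 -> 501, so the year is ignored
md <- as.integer(format(df$datetime, "%m%d"))

# Period start days: 15 April, 11 May, 11 July (the last one is assumed)
starts <- c(415L, 511L, 711L)
labels <- c("before 15 Apr", "15 Apr - 10 May", "11 May - 10 Jul", "11 Jul onward")

# findInterval() returns 0 before the first start, hence the + 1
df$period <- labels[findInterval(md, starts) + 1L]
print(df)
```

Adding more periods only requires extending starts and labels, which keeps the category definitions in one place.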

Creating a Survival Analysis dataset

I have a table composed of three columns: ID, Opening Date and Cancelation Date.
What I want to do is create 36 observations per client (one per month for 3 years) as a dummy variable. Basically, I want all the monthly observations before the cancelation date to have a 1 and the others a 0. If the cancelation date is null, all of the values should be 1.
This process should be repeated for every ID.
The desired output would be a table with five columns: ID, Opening Date, Cancelation Date, Month (from 1 to 36, starting on opening date) and Status (1 or 0).
I've tried everything but haven't managed to solve this problem, including using seq() to create and order the dates, seq(table$Opening, by = "month", length.out = 36), and many other ways.
We can use complete from tidyr to create a sequence of dates one month apart for each ID, use row_number within each group as the Month counter, and create Status based on Cancellation_Date.
library(dplyr)
library(tidyr)
df %>%
mutate_at(vars(ends_with("Date")), as.Date, "%d/%m/%y") %>%
mutate(Date = Opening_Date) %>%
group_by(ID) %>%
complete(Date = seq(Date,by = "1 month", length.out = 36)) %>%
mutate(Month = row_number()) %>%
fill(Opening_Date, Cancellation_Date) %>%
mutate(Status = +(is.na(Cancellation_Date) | Date <= Cancellation_Date)) # a missing cancellation date counts as still active
# ID Date Opening_Date Cancellation_Date Month Status
# <dbl> <date> <date> <date> <int> <int>
# 1 336 2017-01-01 2017-01-01 2018-06-01 1 1
# 2 336 2017-02-01 2017-01-01 2018-06-01 2 1
# 3 336 2017-03-01 2017-01-01 2018-06-01 3 1
# 4 336 2017-04-01 2017-01-01 2018-06-01 4 1
# 5 336 2017-05-01 2017-01-01 2018-06-01 5 1
# 6 336 2017-06-01 2017-01-01 2018-06-01 6 1
# 7 336 2017-07-01 2017-01-01 2018-06-01 7 1
# 8 336 2017-08-01 2017-01-01 2018-06-01 8 1
# 9 336 2017-09-01 2017-01-01 2018-06-01 9 1
#10 336 2017-10-01 2017-01-01 2018-06-01 10 1
# … with 26 more rows
In the output, the Date column is the sequence of monthly dates for each ID; it can be removed from the final output if not needed.
data
df <- data.frame(ID = 336, Opening_Date = '1/1/17',Cancellation_Date = '1/6/18')
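The same construction can be sketched in base R by looping over the clients and building 36 monthly rows for each. This is only an illustration: the second client (ID 337, with no cancellation date) is made up to show the "null means always 1" rule, and the dates are assumed to be already parsed as Date.

```r
df <- data.frame(
  ID = c(336, 337),  # 337 is a hypothetical client without a cancellation
  Opening_Date = as.Date(c("2017-01-01", "2017-03-01")),
  Cancellation_Date = as.Date(c("2018-06-01", NA))
)

panel <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
  # 36 monthly dates starting at the opening date
  dates  <- seq(df$Opening_Date[i], by = "month", length.out = 36)
  cancel <- df$Cancellation_Date[i]
  data.frame(
    ID = df$ID[i],
    Opening_Date = df$Opening_Date[i],
    Cancellation_Date = cancel,
    Month = 1:36,
    # 1 up to and including the cancellation month, or always 1 if never cancelled
    Status = as.integer(is.na(cancel) | dates <= cancel)
  )
}))

# Number of active months per client
print(tapply(panel$Status, panel$ID, sum))
```

Whether the cancellation month itself should count as 1 or 0 is a modelling choice; the sketch follows the `<=` used in the answer above.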

Add a new date field based on repeat encounters [duplicate]

I have repeat encounters with animals that have a unique IndIDII and a unique GPSSrl number. Each encounter has a FstCptr date.
dat <- structure(list(IndIDII = c("BHS_115", "BHS_115", "BHS_372", "BHS_372",
"BHS_372", "BHS_372"), GPSSrl = c("035665", "036052", "034818",
"035339", "036030", "036059"), FstCptr = structure(c(1481439600,
1450162800, 1426831200, 1481439600, 1457766000, 1489215600), class = c("POSIXct",
"POSIXt"), tzone = "")), .Names = c("IndIDII", "GPSSrl", "FstCptr"
), class = "data.frame", row.names = c(1L, 2L, 29L, 30L, 31L,
32L))
> dat
IndIDII GPSSrl FstCptr
1 BHS_115 035665 2016-12-11
2 BHS_115 036052 2015-12-15
29 BHS_372 034818 2015-03-20
30 BHS_372 035339 2016-12-11
31 BHS_372 036030 2016-03-12
32 BHS_372 036059 2017-03-11
For each IndID-GPSSrl grouping, I want to create a new field (NextCptr) that documents the date of the next encounter. For the last encounter, the new field would be NA, for example:
dat$NextCptr <- as.Date(c("2015-12-15", NA, "2016-12-11", "2016-03-12", "2017-03-11", NA))
> dat
IndIDII GPSSrl FstCptr NextCptr
1 BHS_115 035665 2016-12-11 2015-12-15
2 BHS_115 036052 2015-12-15 <NA>
29 BHS_372 034818 2015-03-20 2016-12-11
30 BHS_372 035339 2016-12-11 2016-03-12
31 BHS_372 036030 2016-03-12 2017-03-11
32 BHS_372 036059 2017-03-11 <NA>
I would like to work within dplyr and group_by(IndIDII, GPSSrl).
As always, many thanks!
Group by column IndIDII then use lead to shift FstCptr forward by one:
library(dplyr)
dat %>% group_by(IndIDII) %>% mutate(NextCptr = lead(FstCptr))
# A tibble: 6 x 4
# Groups: IndIDII [2]
# IndIDII GPSSrl FstCptr NextCptr
# <chr> <chr> <dttm> <dttm>
#1 BHS_115 035665 2016-12-11 02:00:00 2015-12-15 02:00:00
#2 BHS_115 036052 2015-12-15 02:00:00 NA
#3 BHS_372 034818 2015-03-20 02:00:00 2016-12-11 02:00:00
#4 BHS_372 035339 2016-12-11 02:00:00 2016-03-12 02:00:00
#5 BHS_372 036030 2016-03-12 02:00:00 2017-03-11 02:00:00
#6 BHS_372 036059 2017-03-11 02:00:00 NA
If you need to shift the column in the opposite direction, lag() does that: dat %>% group_by(IndIDII) %>% mutate(NextCptr = lag(FstCptr)).
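For completeness, a grouped "lead" can also be sketched in base R with ave(). The example below uses a simplified copy of the data with FstCptr stored as Date rather than POSIXct (an assumption made to keep the sketch short); lead1 is a helper name introduced here.

```r
dat <- data.frame(
  IndIDII = c("BHS_115", "BHS_115", "BHS_372", "BHS_372", "BHS_372", "BHS_372"),
  FstCptr = as.Date(c("2016-12-11", "2015-12-15", "2015-03-20",
                      "2016-12-11", "2016-03-12", "2017-03-11"))
)

# Hypothetical helper: shift a vector up by one, padding the end with NA
lead1 <- function(x) c(x[-1], NA)

# ave() applies lead1 within each IndIDII group; dates go through numeric and back
dat$NextCptr <- as.Date(
  ave(as.numeric(dat$FstCptr), dat$IndIDII, FUN = lead1),
  origin = "1970-01-01"
)
print(dat)
```

Like lead(), this shifts by row order within each group, so sort the data first if "next" should mean chronologically next.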

Resources