Using dplyr to remove duplicates conditionally - r

I have a dataset in long format that contains both visit and measure dates for each ID. What I want is to remove the duplicate visit dates for each ID conditionally, namely:
IF visit date - measure date does not equal 0, then I want to include the first visit date.
IF visit date - measure date is a draw (a tie), however, then I want to include the latest visit date.
I already wrote part of the code using dplyr. However, I cannot figure out how to code the second part of the condition.
Any help would be very much appreciated.
library(dplyr)
df <- data.frame(ID = c(1, 1),
                 VISIT = c(as.Date("2020-01-01"), as.Date("2020-01-01")),
                 MEASURE = c(as.Date("2020-01-01"), as.Date("2020-01-01")),
                 VALUE = c(5, 10))
df2 <- df %>%
  mutate(DIFF = abs(VISIT - MEASURE)) %>%
  arrange(DIFF) %>%
  # group by both keys at once; a second group_by() call would replace the first
  group_by(ID, VISIT) %>%
  # If DIFF dates is != 0, I want the first value
  # If DIFF dates is a draw, I want the latest value
  slice(1) %>%
  ungroup()

I am not sure exactly what you are trying to achieve, but maybe this helps. I adjusted the example data frame a bit; you may want to edit the one in your question so that it makes sense, because in your example data DIFF is never different from 0.
library(dplyr)
df <- data.frame(
  ID = c(1, 1, 2, 2),
  VISIT = c(
    as.Date("2020-01-01"),
    as.Date("2020-01-01"),
    as.Date("2020-01-01"),
    as.Date("2020-01-02")
  ),
  MEASURE = c(
    as.Date("2020-01-01"),
    as.Date("2020-01-01"),
    as.Date("2020-01-01"),
    as.Date("2020-01-03")
  ),
  VALUE = c(5, 10, 15, 20)
)
df2 <- df %>%
  group_by(ID) %>%
  mutate(
    # absolute difference in days as a plain number (no string parsing needed)
    DIFF = as.numeric(abs(VISIT - MEASURE), units = "days"),
    # your if conditions
    DIFF_filter = case_when(
      DIFF != 0 ~ min(VISIT),
      DIFF == 0 ~ max(VISIT)
    )
  )
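If the goal is to end up with one row per ID and VISIT, the two rules can also be applied directly in a grouped filter(). The following is only a sketch under one reading of the question, and that reading is an assumption: "a draw" means every row in the ID/VISIT group has the same DIFF, in which case the last row is kept; otherwise the first row with the smallest DIFF is kept. The toy data and the name dedup are made up for illustration.

```r
library(dplyr)

# Toy data (hypothetical): ID 1 has a tie in DIFF, ID 2 does not
df <- data.frame(
  ID      = c(1, 1, 2, 2),
  VISIT   = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02")),
  MEASURE = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-05")),
  VALUE   = c(5, 10, 15, 20)
)

dedup <- df %>%
  mutate(DIFF = as.numeric(abs(VISIT - MEASURE), units = "days")) %>%
  group_by(ID, VISIT) %>%
  # tie in DIFF: keep the last row of the group;
  # otherwise: keep the first row with the smallest DIFF
  filter(row_number() == if (n_distinct(DIFF) == 1) n() else which.min(DIFF)) %>%
  ungroup()

dedup
```

If "a draw" should mean something else (for example DIFF == 0), only the condition inside if () needs to change.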

Related

fill() in missing lubridate value from a different column

Below is a fictional reproducible example of pick-up and drop-off times of four taxis.
Taxis 1, 2, and 3 unfortunately have a missing drop-off time. Fortunately, two of these times (for taxis 1 and 3) can be inferred to be at least 1 second before they pick up new customers (these are non-ride-sharing taxis, very corona-proof):
(the below df is - in the real use case - the result of a group_by and summarise of another df)
library(dplyr)
x <- seq(as.POSIXct('2020/01/01'), # Create sequence of dates
         as.POSIXct('2030/01/01'),
         by = "10 mins") %>%
  head(20) %>%
  sort()
taxi_nr <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4)
drop_of <- x[c(TRUE, FALSE)]
pick_up <- x[c(FALSE, TRUE)]
drop_of[2] <- NA
drop_of[5] <- NA
drop_of[7] <- NA
df <- data.frame(taxi_nr, pick_up, drop_of) %>%
  arrange(pick_up)
I wish to fill in the NA of taxis 1 and 3. I have tried the following (fill() comes from tidyr):
df <- df %>%
  tidyr::fill(drop_of, .direction = "up")
However, this takes the drop-off value from the row below instead of the pick-up value from the row below, and it does not take the taxi number into account.
I have also thought about:
df <- df %>%
filter(is.na(drop_of)) %>%
mutate(drop_of, ov[,+1])
This seems to run into problems with the taxi_nr 2 case, as there is no next row ([,+1]) within the group, or so I believe is the issue. I have tried to add safely(), possibly() and quietly(), but that did not help:
df <- df %>%
filter(is.na(drop_of)) %>%
mutate(drop_of, purr::safely(ov[,+1]))
Does anyone have a solution?
PS: once I have the right column for filling in, 1 second also needs to be subtracted from it, and it needs to be in the right lubridate format (d/m/y-h/m/s).
THANKS!
You can try to use a temporary variable for it, although it does not look pretty
df <- df %>%
  # ifelse() drops the POSIXct class, so temp holds seconds since 1970-01-01
  mutate(temp = ifelse(is.na(drop_of), NA, pick_up)) %>%
  group_by(taxi_nr) %>%
  tidyr::fill(temp, .direction = "up") %>%
  ungroup() %>%
  mutate(drop_of = ifelse(is.na(drop_of), temp - 1, drop_of),
         drop_of = as.POSIXct(drop_of, origin = "1970-01-01")) %>%
  select(-temp)
And if you need your data in the format d/m/y-h/m/s, you can use the format() function. Note that format() returns character strings rather than date-times. (I am not sure that what you described is exactly what you need, but at least you should get the idea.)
df <- df %>% mutate(drop_of = format(drop_of, "%d/%m/%Y-%H/%M/%S"))
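A shorter variant of the same idea is to look at the next pick-up within each taxi directly with lead(). This is only a sketch on a small made-up data frame (the values and the name filled are hypothetical); it assumes the rows of each taxi can be ordered by pick_up and that a missing drop-off should become the next pick-up of the same taxi minus one second.

```r
library(dplyr)

# Small made-up example: taxi 2 has a single ride, so nothing can be inferred
df <- data.frame(
  taxi_nr = c(1, 1, 2, 3, 3),
  pick_up = as.POSIXct(c("2020-01-01 00:10:00", "2020-01-01 00:30:00",
                         "2020-01-01 00:50:00", "2020-01-01 01:10:00",
                         "2020-01-01 01:30:00"), tz = "UTC"),
  drop_of = as.POSIXct(c(NA, "2020-01-01 00:40:00",
                         NA, NA, "2020-01-01 01:40:00"), tz = "UTC")
)

filled <- df %>%
  arrange(taxi_nr, pick_up) %>%
  group_by(taxi_nr) %>%
  # next pick-up of the same taxi minus 1 second; stays NA if there is no next ride
  mutate(drop_of = if_else(is.na(drop_of), lead(pick_up) - 1, drop_of)) %>%
  ungroup()

filled
```

Because if_else() keeps the POSIXct class, no conversion back from numbers is needed here.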

create dummy for group-specific maximum value occurence, conditional on overall maximum

Let's assume I have a dataset similar to this:
library(tidyverse)
library(lubridate)
state <- c(rep("Alabama", 10), rep("Arizona", 10), rep("Arkansas", 10))
county <- c(rep("Baldwin", 5), rep("Barbour", 5), rep("Apache", 5), rep("Cochise", 5), rep("Arkansas", 5), rep("Ashley", 5))
date <- rep(seq(ymd('2012-04-06'),ymd('2012-04-10'),by='days'), 6)
stray_dogs <- c(lag(1:3, n = 2, default = 0), floor(runif(7, min=1, max=4)),
lag(1:6, n = 5, default = 0), floor(runif(4, min=1, max=18)),
lag(1:2, n = 1, default = 0), floor(runif(8, min=1, max=4)))
df <- data.frame(state, county, date, stray_dogs) %>%
mutate(stray_dogs_max = max(stray_dogs)) %>%
mutate(most_stray_dogs = case_when(stray_dogs_max == stray_dogs ~ 1,
stray_dogs_max != stray_dogs ~ 0))
I would like to find the date when the highest number of stray dogs was found in each county, via group_by(state, county) or anything similar, and create a dichotomous variable (column) that takes the value 1 for that particular day and 0 for the other days. However, when there were no stray dogs at all in a particular county during the time period, it should mark the day on which most_stray_dogs equals 1; and when there are multiple days with the same number of stray dogs within a county, it should pick the day that is closest to the day with most_stray_dogs == 1.
For the latter bit, my intuition would be to use a helper vector, created with difftime; nevertheless I just can't put together all these at once. How should I create this column?
I think this works. No "right answer" is provided and the data is big enough to make it hard to eyeball, so I'm not positive, but it's methodical so it should at least get you on the right track.
In calculating the date difference I arbitrarily subtracted 0.1 as a tie-breaker between an equal number of days before and after the national max. I then arrange each group to assign the top choice (which is a bit inefficient, but should be quick enough).
df %>%
  arrange(state, county, date) %>%
  group_by(date) %>%
  mutate(national_count = sum(stray_dogs)) %>%
  ungroup() %>%
  mutate(
    is_national_max = national_count == max(national_count)
  ) %>%
  group_by(state, county) %>%
  mutate(
    is_county_max = stray_dogs == max(stray_dogs),
    days_from_national_max = abs(date - date[is_national_max] - 0.1)
  ) %>%
  # county maximum first, then the day closest to the national maximum
  arrange(state, county, desc(is_county_max), days_from_national_max) %>%
  mutate(your_result = as.integer(row_number() == 1)) %>%
  ungroup() %>%
  arrange(state, county, date)
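The ranking step can also be written without re-sorting the data, by asking order() for the position of the best row inside each group. This is a minimal sketch on made-up data; the names toy, dogs, ref_date and flag are hypothetical, and ref_date stands in for the day of the overall maximum.

```r
library(dplyr)

toy <- data.frame(
  county = rep(c("A", "B"), each = 3),
  date   = rep(as.Date("2012-04-06") + 0:2, times = 2),
  dogs   = c(1, 3, 3, 0, 0, 0)   # county A has a tie, county B has no dogs at all
)
ref_date <- as.Date("2012-04-07") # assumed day of the overall maximum

flagged <- toy %>%
  group_by(county) %>%
  mutate(
    dist = abs(as.numeric(date - ref_date)),
    # highest count first, ties broken by the smallest distance to ref_date
    flag = as.integer(row_number() == order(desc(dogs), dist)[1])
  ) %>%
  ungroup()

flagged
```

In county A the two days with 3 dogs tie and the one on ref_date wins; in county B all counts are 0, so the day on ref_date is flagged.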

Finding the first row after which x rows meet some criterium in R

A data wrangling question:
I have a dataframe of hourly animal tracking points with columns for id, time, and whether the animal is on land or in water (0 = water; 1 = land). It looks something like this:
set.seed(13)
n <- 100
dat <- data.frame(id = rep(1:5, each = 10),
datetime=seq(as.POSIXct("2020-12-26 00:00:00"), as.POSIXct("2020-12-30 3:00:00"), by = "hour"),
land = sample(0:1, n, replace = TRUE))
What I need to do is flag the first row after which the animal uses land at least once for 3 straight days. I tried doing something like this:
dat$ymd <- as.Date(dat$datetime) # make column for year-month-day
# add land points within each id group
land.pts <- dat %>%
group_by(id, ymd) %>%
arrange(id, datetime) %>%
drop_na(land) %>%
mutate(all.land = cumsum(land))
#flag days that have any land points
flag <- land.pts %>%
group_by(id, ymd) %>%
arrange(id, datetime) %>%
slice(n()) %>%
mutate(flag = if_else(all.land == 0,0,1))
# Combine flagged dataframe with full dataframe
comb <- left_join(land.pts, flag)
comb[is.na(comb)] <- 1
and then I tried this:
x = comb %>%
group_by(id) %>%
arrange(id, datetime) %>%
mutate(time.land=ifelse(land==0 | is.na(lag(land)) | lag(land)==0 | flag==0,
0,
difftime(datetime, lag(datetime), units="days")))
But I still can't quite wrap my head around what to do to make it so that I can figure out when the animal has been on land at least once for three days straight, and then flag that first point on land. Thanks so much for any help you can provide!
Create a date column from the timestamp. Summarise the data so that only one row is kept for each id and date, showing whether the animal was on land at least once during that day.
Use zoo's rollapply function to mark a day as TRUE if the animal was on land on that day and on the following two days, that is, if it starts a run of three days on land.
library(dplyr)
library(zoo)
dat <- dat %>% mutate(date = as.Date(datetime))
dat %>%
  group_by(id, date) %>%
  summarise(on_land = any(land == 1)) %>%
  mutate(consec_three = rollapply(on_land, 3, all, align = 'left', fill = NA)) %>%
  ungroup() %>%
  # If you want all the rows of the data
  left_join(dat, by = c('id', 'date'))
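To go from "days that start a run" to "the first such day per animal", one more step is needed. The sketch below works on a small made-up daily summary (the names daily, run_start and first_start are hypothetical) and uses lead() instead of zoo; it assumes there is exactly one row per id and calendar day, with no missing days.

```r
library(dplyr)

# Made-up daily summary: one row per id and day
daily <- data.frame(
  id      = rep(1:2, each = 5),
  date    = rep(as.Date("2020-12-26") + 0:4, times = 2),
  on_land = c(FALSE, TRUE, TRUE, TRUE, FALSE,
              TRUE, FALSE, TRUE, TRUE, FALSE)
)

first_run <- daily %>%
  group_by(id) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(
    # on land today and on each of the next two days
    run_start = on_land &
      lead(on_land, 1, default = FALSE) &
      lead(on_land, 2, default = FALSE),
    # only the first day per id that starts such a run
    first_start = run_start & cumsum(run_start) == 1
  ) %>%
  ungroup()

first_run
```

Here id 1 is flagged on 2020-12-27, while id 2 never has three land days in a row and gets no flag. Joining first_run back to the hourly data by id and date, as in the answer above, then lets you pick the first land point of the flagged day.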

Last 3 month lags in r

The data is:
library(dplyr)
library(zoo)
Category <- c(rep("A", 4))
Month <- c(1, 2, 3, 4)
Sales <- c(10, 15, 20, 25)
df <- data.frame(Category, Month, Sales)
df <- df %>% filter(Category == 'A') %>%
  group_by(Month) %>%
  summarise(Sales = sum(Sales, na.rm = TRUE)) %>%
  mutate(lag_1 = dplyr::lag(Sales, 1),
         lag_2 = dplyr::lag(Sales, 2),
         lag_3 = dplyr::lag(Sales, 3),
         lag_3_mean = rollapply(Sales, 3, mean, align = 'right', fill = NA))
Present Output
I want the lag_3_mean to be the mean of last 3 months, not including the present month. For example, in Month 4 lag_3_mean = Average(Sales value in month 3,2,1).
The expected output should be:
Use a width of list(-(1:3)) to get offsets of -1, -2, -3.
rollapplyr(Sales, list(-(1:3)), mean, fill = NA)
Note that this recent question is very similar: "Variable frameshift rolling average for multiple variables".
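As a quick check on the Sales vector from the question, the offset form can be compared with a hand-written loop. The names prev3_zoo and prev3_manual are made up for this sketch.

```r
library(zoo)

Sales <- c(10, 15, 20, 25)

# Offsets -1, -2, -3: the mean of the three previous months, excluding the current one
prev3_zoo <- rollapplyr(Sales, list(-(1:3)), mean, fill = NA)

# The same thing spelled out by hand, for comparison
prev3_manual <- sapply(seq_along(Sales), function(i) {
  if (i > 3) mean(Sales[(i - 3):(i - 1)]) else NA_real_
})

prev3_zoo
prev3_manual
```

For month 4 both give 15, the mean of 10, 15 and 20; months 1 to 3 are NA because fewer than three earlier months exist.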

In R, is it possible to include the same row in multiple groups, or is there other workaround?

I've measured N2O flux from soil at multiple timepoints in the day (not equally spaced). I'm trying to calculate the total N2O flux from soil for a subset of days by finding the area under the curve for the given day. I know how to do this when using only measures from the given day; however, I'd like to include the last measure of the previous day and the first measure of the following day to improve the estimation of the curve.
Here's an example to give a more concrete idea:
library(MESS)
library(lubridate)
library(dplyr)
Generate Reproducible Example
# both ends of the sequence are date-times
datetime <- seq(ymd_hm('2015-04-07 11:20'), ymd_hm('2015-04-13 00:00'), by = 'hours')
dat <- data.frame(datetime, day = day(datetime), Flux = rnorm(n = length(datetime), mean = 400, sd = 20))
useDate <- data.frame(day = c(7:12), DateGood = c("No", "Yes", "Yes", "No", "Yes", "No"))
dat <- left_join(dat, useDate, by = "day")
Some days are "bad" (too many missing measures) and some are "Good" (usable). The goal is to filter all measurements (rows) that occurred on a "Good" day as well as the last measurement from the day before and the first measurement on the next day.
out <- dat %>%
mutate(lagDateGood = lag(DateGood),
leadDateGood = lead(DateGood)) %>%
filter(lagDateGood != "No" | leadDateGood != "No")
Now I need to calculate the area under the curve - this is not correct
out2 <- out %>%
group_by(day) %>%
mutate(hourOfday = hour(datetime) + minute(datetime)/60) %>%
summarize(auc = auc(x = hourOfday, y = Flux, from = 0, to = 24, type = "spline"))
The trouble is that I don't include the measurements on end of previous day and start of following day when calculating AUC. Also, I get an estimate of flux for day 10, which is a "bad" day.
I think the crux of my question has to do with groups. Some measurements need to be in multiple groups (for example the last measurement on day 8 would be used in estimating AUC for day 8 and day 9). Do you have suggestions for how I could form new groups? Or might there be a completely different way to achieve the goal?
For what it's worth, this is what I did. The answer really lies in the question I linked to in the comments. Starting with the dataframe "out" from the question:
# Now I need to calculate the area under the curve for each day
n <- nrow(out)
# rows of one day plus the row just before and the row just after
extract <- function(ix) out[seq(max(1, min(ix) - 1), min(n, max(ix) + 1)), ]
res <- lapply(split(1:n, out$day), extract)
calcTotalFlux <- function(df) {
  if (nrow(df) < 10) { # make sure the day has at least 10 measures
    NA
  } else {
    day_midnight <- floor_date(df$datetime[2], "day")
    df %>%
      # minutes since midnight, so that to = 1440 covers the whole day
      mutate(time = as.numeric(difftime(datetime, day_midnight, units = "mins"))) %>%
      summarize(TotalFlux = auc(x = time, y = Flux, from = 0, to = 1440, type = "spline"))
  }
}
do.call("rbind", lapply(res, calcTotalFlux))
TotalFlux
7 NA
8 585230.2
9 579017.3
10 NA
11 563689.7
12 NA
Here's another way, more in line with the suggestions of @Alex Brown.
# Another way
# last measure of each day, relabelled so it also counts for the next day
last <- out %>%
  group_by(day) %>%
  filter(datetime == max(datetime)) %>%
  ungroup() %>%
  mutate(day = day + 1)
# first measure of each day, relabelled so it also counts for the previous day
first <- out %>%
  group_by(day) %>%
  filter(datetime == min(datetime)) %>%
  ungroup() %>%
  mutate(day = day - 1)
d <- rbind(out, last, first) %>%
  group_by(day) %>%
  arrange(datetime)
n_measures_per_day <- d %>%
  summarize(n = n())
d <- left_join(d, n_measures_per_day, by = "day") %>%
  filter(n > 4)
TotalFluxDF <- d %>%
  mutate(timeAtMidnight = floor_date(datetime[3], "day"),
         # minutes since midnight, so that to = 1440 covers the whole day
         time = as.numeric(difftime(datetime, timeAtMidnight, units = "mins"))) %>%
  summarize(auc = auc(x = time, y = Flux, from = 0, to = 1440, type = "spline"))
TotalFluxDF
Source: local data frame [3 x 2]
day auc
(dbl) (dbl)
1 8 585230.2
2 9 579017.3
3 11 563689.7
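The core trick in the second approach is that a row can belong to several groups if it is simply duplicated with a different group label. The toy sketch below shows only that step, without MESS or lubridate; the data and the names obs, last_of_day, first_of_day and padded are made up for illustration.

```r
library(dplyr)

# Two measurements per day; time is in hours since the start of day 1
obs <- data.frame(
  day  = c(1, 1, 2, 2, 3, 3),
  time = c(6, 18, 30, 42, 54, 66),
  flux = c(1, 2, 3, 4, 5, 6)
)

# Last row of each day, relabelled as belonging to the following day
last_of_day <- obs %>%
  group_by(day) %>%
  filter(time == max(time)) %>%
  ungroup() %>%
  mutate(day = day + 1)

# First row of each day, relabelled as belonging to the previous day
first_of_day <- obs %>%
  group_by(day) %>%
  filter(time == min(time)) %>%
  ungroup() %>%
  mutate(day = day - 1)

padded <- bind_rows(obs, last_of_day, first_of_day) %>%
  filter(day %in% obs$day) %>%   # drop labels for days that were never measured
  arrange(day, time)

rows_per_day <- count(padded, day)
rows_per_day
```

Day 2 now has four rows (its own two plus one neighbour on each side), while days 1 and 3 have three each because they only have a neighbour on one side. Any per-day summary, such as the auc() call above, can then be run on padded grouped by day.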
