fill() in missing lubridate value from a different column - r

Below is a fictional reproducible example of pick-up and drop-off times of four taxis.
Taxis 1, 2, and 3 unfortunately each have a missing drop-off time. Fortunately, two of these times (for taxis 1 and 3) can be inferred to be at least 1 second before they pick up new customers (these are non-ride-sharing taxis, very corona-proof):
(The df below is, in the real use case, the result of a group_by and summarise of another df.)
library(dplyr)
library(tidyr) # provides fill(), used further down
x <- seq(as.POSIXct('2020/01/01'), # Create sequence of dates
as.POSIXct('2030/01/01'),
by = "10 mins") %>%
head(20) %>%
sort()
taxi_nr <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4)
drop_of <- x[c(TRUE, FALSE)]
pick_up <- x[c(FALSE, TRUE)]
drop_of[2] <- NA
drop_of[5] <- NA
drop_of[7] <- NA
df <- data.frame(taxi_nr,pick_up,drop_of) %>%
arrange(pick_up)
I wish to fill in the NAs of taxis 1 and 3. I have tried the following:
df <- df %>%
fill(drop_of, .direction = "up")
However, this takes the drop-off value from the row below instead of the pick-up value from the row below, and it does not take the taxi number into account.
I have also thought about:
df <- df %>%
filter(is.na(drop_of)) %>%
mutate(drop_of, ov[,+1])
This seems to run into problems with the taxi_nr 2 case, as there is no [,+1] within the group, or so I believe. I have tried to add safely(), possibly() and quietly(), but that did not help:
df <- df %>%
filter(is.na(drop_of)) %>%
mutate(drop_of, purr::safely(ov[,+1]))
Does anyone have a solution?
PS: once I have the right column for filling in, 1 second also needs to be subtracted from it, and it needs to be in the right lubridate format (d/m/y-h/m/s).
THANKS!

You can use a temporary variable for it, although it does not look pretty (fill() comes from tidyr):
df <- df %>%
mutate(temp = ifelse(is.na(drop_of), NA, pick_up)) %>%
group_by(taxi_nr) %>%
fill(temp, .direction = "up") %>%
ungroup() %>%
mutate(drop_of = ifelse(is.na(drop_of), temp - 1, drop_of),
drop_of = as.POSIXct(drop_of, origin = "1970-01-01")) %>%
select(-temp)
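As an alternative to the temporary column, the same idea can be sketched with dplyr's lead() inside each taxi group. This is only a minimal sketch on hypothetical toy data (df_demo, with a UTC time zone chosen for reproducibility), not the question's df. It assumes the rows are ordered by pick-up time within each taxi and that a missing drop-off should become the next pick-up of the same taxi minus one second.

```r
library(dplyr)

# Hypothetical toy data: two taxis, drop-off missing in rows 2 and 5
pick_up <- as.POSIXct("2020-01-01 00:10:00", tz = "UTC") + (0:4) * 1200
df_demo <- data.frame(
  taxi_nr = c(1, 1, 1, 2, 2),
  pick_up = pick_up,
  drop_of = pick_up - 600
)
df_demo$drop_of[c(2, 5)] <- NA

filled <- df_demo %>%
  group_by(taxi_nr) %>%
  arrange(pick_up, .by_group = TRUE) %>%
  # lead() looks at the next row of the same taxi; it is NA for the last row
  mutate(drop_of = if_else(is.na(drop_of), lead(pick_up) - 1, drop_of)) %>%
  ungroup()
```

Because if_else() keeps the POSIXct class, no conversion back from numeric is needed. The last ride of a taxi has no following pick-up, so its drop-off stays NA, which matches the taxi 2 case in the question.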
And if you need your data in the format d/m/y-h/m/s, you can use format(), which turns the column into character strings (I am not sure this is exactly what you described, but you should get the idea):
df <- df %>% mutate(drop_of = format(drop_of, "%d/%m/%Y-%H/%M/%S"))

Related

R: Finding indirect links between colleagues. Code works with string ids, but not numeric ids

I am trying to extract the set of indirect colleagues of doctors. I call colleagues doctors who work together in the same hospital. An indirect colleague is a doctor who works with the colleague of a doctor in another hospital. In the example below, doctor "a" works with doctor "b" in hospital 1, who in turn works with doctor "c" in hospital 2. Therefore "c" is an indirect colleague of "a".
The code below works well when the physician ids are strings (df0) or low numeric values (df1), but not when they are high numeric values (df2). I would like to fix the code so that it works with high numeric values (while keeping the original physician ids).
library(tidyverse) # tribble(), unnest(), chop(), deframe(), map()
df0 <- tribble(
~hospital, ~doctors,
1, c("a", "b"),
2, c("b", "c"),
3, c("a", "d"),
) %>%
unnest(doctors)
# Below, I replaced doctor id with numeric values
df1 <- tribble(
~hospital, ~doctors,
1, c(1, 2),
2, c(2, 3),
3, c(1, 4),
) %>%
unnest(doctors)
# Now I added +5 to each physician id
df2 <- tribble(
~hospital, ~doctors,
1, c(6, 7),
2, c(7, 8),
3, c(6, 9)
) %>%
unnest(doctors)
df <- df2 # The code only works with df0 and df1, not with df2
colleagues <- full_join(df, df, by = c("hospital")) %>%
rename(doctor = doctors.x, colleagues = doctors.y) %>%
filter(doctor != colleagues) %>%
distinct(doctor, colleagues) %>%
chop(colleagues) %>%
deframe()
colleagues %>%
enframe(name = "ego",
value = "alter") %>%
unnest(alter) %>%
mutate(ego_colleagues = map(ego, ~ colleagues[[.x]]),
alter_colleagues = map(alter, ~ colleagues[[.x]]),
alter_colleague_only = map2(alter_colleagues, ego_colleagues, ~ .x[!(.x %in% .y)])) %>%
unnest(alter_colleague_only) %>%
filter(ego != alter_colleague_only) %>%
select(ego, alter, alter_colleague_only)
The issue is in your map calls. Using df2, in calls like map(ego, ~ colleagues[[.x]]), colleagues[[.x]] indexes by position rather than by name whenever .x is numeric. With character ids, the lookup is by name. With the numeric ids 1, 2, 3, 4 it works only by luck, because the positions happen to coincide with the names. But when you have a list of 4 and you call colleagues[[6]], you get a "subscript out of bounds" error. If that's not totally clear, print these:
colleagues[[1]] vs. colleagues[[6]] vs. colleagues$`6` .
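The difference can be reproduced in isolation with a small named list. This is only a sketch: colleagues_demo is a hypothetical stand-in with the same shape as the deframe() result for df2, written out by hand.

```r
# Hypothetical stand-in for the named list built from df2
colleagues_demo <- list(`6` = c(7, 9), `7` = c(6, 8), `8` = 7, `9` = 6)

# A numeric index is a position: this is the first element, whatever its name
by_position <- colleagues_demo[[1]]

# A character index is a name: this is the element called "6"
by_name <- colleagues_demo[["6"]]

# Position 6 does not exist in a list of length 4, so this raises an error;
# tryCatch() captures the message instead of stopping the script
out_of_bounds <- tryCatch(colleagues_demo[[6]],
                          error = function(e) conditionMessage(e))
```

Here by_position and by_name happen to be the same element, while out_of_bounds holds the error message from the failed positional lookup.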
A quick fix would be to wrap the first part of those map statements in as.character like this:
colleagues %>%
enframe(name = "ego",
value = "alter") %>%
unnest(alter) %>%
mutate(ego_colleagues = map(as.character(ego), ~ colleagues[[.x]]),
alter_colleagues = map(as.character(alter), ~ colleagues[[.x]]),
# the list columns go into map2() unchanged: as.character() on a list would deparse each element into a string such as "c(6, 8)"
alter_colleague_only = map2(alter_colleagues, ego_colleagues, ~ .x[!(.x %in% .y)])) %>%
unnest(alter_colleague_only) %>%
filter(ego != alter_colleague_only) %>%
select(ego, alter, alter_colleague_only)
UPDATE:
Depending on your setup, you could try using the furrr package with future_map and future_map2, but at least in this minimal example that was a much slower approach. I don't know if that holds true on your real data.
Here's another option. While ugly because it has a lot of intermediate objects, it may be helpful. It uses matrices and leverages the fact that you have these reciprocal relationships (if I'm interpreting correctly). I benchmarked it and it takes half as long.
t1 <- colleagues %>%
enframe(name = "ego",
value = "alter") %>%
unnest(alter) %>%
filter(!duplicated(paste0(pmax(ego, alter), pmin(ego, alter)))) %>%
as.matrix()
t2 <- t1 %>%
rbind(t1[1:nrow(t1),c(2,1)])
alter_colleague_only <- t2[match(t2[,2], t2[,1]), "alter"]
t3 <- cbind(t2, alter_colleague_only)
t4 <- t3[which(t2[,1] != t3[,3]),]
t5 <- t4[,c(3,2,1)]
t6 <- rbind(t4, t5) %>%
as_tibble() %>%
arrange(ego)

Finding the first row after which x rows meet some criterium in R

A data wrangling question:
I have a dataframe of hourly animal tracking points with columns for id, time, and whether the animal is on land or in water (0 = water; 1 = land). It looks something like this:
set.seed(13)
n <- 100
dat <- data.frame(id = rep(1:5, each = 10),
datetime=seq(as.POSIXct("2020-12-26 00:00:00"), as.POSIXct("2020-12-30 3:00:00"), by = "hour"),
land = sample(0:1, n, replace = TRUE))
What I need to do is flag the first row after which the animal uses land at least once for 3 straight days. I tried doing something like this:
dat$ymd <- ymd(dat$datetime[1]) # make column for year-month-day
# add land points within each id group
land.pts <- dat %>%
group_by(id, ymd) %>%
arrange(id, datetime) %>%
drop_na(land) %>%
mutate(all.land = cumsum(land))
#flag days that have any land points
flag <- land.pts %>%
group_by(id, ymd) %>%
arrange(id, datetime) %>%
slice(n()) %>%
mutate(flag = if_else(all.land == 0,0,1))
# Combine flagged dataframe with full dataframe
comb <- left_join(land.pts, flag)
comb[is.na(comb)] <- 1
and then I tried this:
x = comb %>%
group_by(id) %>%
arrange(id, datetime) %>%
mutate(time.land=ifelse(land==0 | is.na(lag(land)) | lag(land)==0 | flag==0,
0,
difftime(datetime, lag(datetime), units="days")))
But I still can't quite wrap my head around what to do to make it so that I can figure out when the animal has been on land at least once for three days straight, and then flag that first point on land. Thanks so much for any help you can provide!
Create a date column from the timestamp. Summarise the data so that there is only 1 row per id and date, showing whether the animal was on land at least once during that day.
Use zoo's rollapply() to mark a day as TRUE when the animal was on land on that day and on the two days after it (a left-aligned window of width 3).
library(dplyr)
library(zoo)
dat <- dat %>% mutate(date = as.Date(datetime))
dat %>%
group_by(id, date) %>%
summarise(on_land = any(land == 1)) %>%
mutate(consec_three = rollapply(on_land, 3,all, align = 'left', fill = NA)) %>%
ungroup %>%
#If you want all the rows of the data
left_join(dat, by = c('id', 'date'))
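To single out only the first day that starts such a run, one option is sketched below. It uses dplyr's lead() instead of rollapply(), so it is a swapped-in technique rather than the answer's method. The daily data frame is hypothetical toy data with one row per id and day, and the sketch assumes the dates are consecutive with no missing days.

```r
library(dplyr)

# Hypothetical daily summary: one row per id and date
daily <- data.frame(
  id = rep(1, 6),
  date = as.Date("2020-12-26") + 0:5,
  on_land = c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE)
)

flagged <- daily %>%
  group_by(id) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(
    # TRUE when this day and the following two days all have land use;
    # days too close to the end of the series count as FALSE here
    consec_three = on_land &
      lead(on_land, 1, default = FALSE) &
      lead(on_land, 2, default = FALSE),
    # keep only the first such day per id
    first_start = consec_three & cumsum(consec_three) == 1
  ) %>%
  ungroup()
```

Unlike rollapply(..., fill = NA), this version returns FALSE rather than NA for the last two days of each id. Joining flagged back to the hourly data by id and date, and then taking the first land == 1 row of the flagged day, would give the first land point itself.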

Select top n columns (based on an aggregation)

I have a data set with hundreds of columns and want to keep the top 20 columns with the highest average (or another aggregation such as sum or SD).
How can I do this efficiently?
One way I can think of is to create a vector of the averages of all columns, sort it in descending order, keep the top n values, and then use it to subset my data set.
I am looking for a more elegant way, ideally something that can also be part of a dplyr pipe (%>%) flow.
The code below creates a dummy data set; I would also appreciate suggestions for more elegant ways to create one.
#initialize data set
set.seed(101)
df <- as.data.frame(matrix(round(runif(25,2,5),0), nrow = 5, ncol = 5))
# add more columns
for (i in 1:5){
set.seed (101)
df_stage <-
as.data.frame(matrix(
round(runif(25,5*i , 10*i), 0), nrow = 5, ncol = 5
))
colnames(df_stage) <- paste("v",(10*i):(10*i+4))
df <- cbind(df, df_stage)
}
Another tidyverse approach with a bit of reshaping:
library(tidyverse)
n = 3
df %>%
summarise_all(mean) %>%
gather() %>%
top_n(n, value) %>%
pull(key) %>%
df[.]
We can do this with
library(dplyr)
n <- 3
df %>%
summarise_all(mean) %>%
unlist %>%
order(., decreasing = TRUE) %>%
head(n) %>%
df[.]
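The same idea can also be sketched with across() and select(), which newer dplyr versions offer in place of summarise_all(). This is only an illustration on hypothetical toy data (df_demo and n_keep are made-up names), assuming all columns are numeric.

```r
library(dplyr)

# Hypothetical toy data with four numeric columns
df_demo <- data.frame(
  a = c(1, 2, 3),
  b = c(10, 20, 30),
  c = c(5, 5, 5),
  d = c(0, 0, 1)
)
n_keep <- 2

# Names of the n_keep columns with the highest mean
top_cols <- df_demo %>%
  summarise(across(everything(), mean)) %>%
  unlist() %>%
  sort(decreasing = TRUE) %>%
  head(n_keep) %>%
  names()

# Subset the original data by name
top_df <- df_demo %>% select(all_of(top_cols))
```

Swapping mean for sum or sd changes the aggregation. Selecting by name rather than by position keeps the result correct even if the column order changes between the two steps.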

Trying to add a calculation involving lags into a new dataframe column

I'm trying to do something extremely simple, but I can't work out how to do it.
Basically this:
Where A and B are columns in a data frame.
If I use:
df$B <- lag(df$B,1) + df$A
It obviously results in NA because there is no lag of B before row 1.
We could use accumulate
library(tidyverse)
df %>%
mutate(B = accumulate(A, `+`))
Or it could be just cumsum
df %>%
mutate(B = cumsum(A))
data
df <- data.frame(A= c(10, 9, 3, 1, 7))
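The formula from the question is not shown above, so the following sketch assumes the intended recurrence is B[i] = B[i - 1] + A[i], with B[1] = A[1]. Under that assumption, an explicit loop gives the same result as cumsum(), which is why no lag is needed.

```r
A <- c(10, 9, 3, 1, 7)

# Explicit version of the assumed recurrence B[i] = B[i - 1] + A[i]
B_loop <- numeric(length(A))
B_loop[1] <- A[1]
for (i in seq_along(A)[-1]) {
  B_loop[i] <- B_loop[i - 1] + A[i]
}

# Vectorised equivalent
B_cumsum <- cumsum(A)
```

Both vectors hold the running total of A. If the real recurrence has a different starting value or an extra term, cumsum() alone would not be enough and accumulate() with a custom function would be the more flexible choice.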

dplyr having trouble redefining type with group_by()

I have the following problem:
When using dplyr to mutate a numeric column after group_by(), the mutate command fails if a group contains only one value and that value is NaN.
Thus, if the grouped column contains a numeric, it is correctly classified as dbl, but as soon as a group contains only a NaN, it fails because dplyr defines that group as lgl while all the other groups are dbl.
My first (and more general question) is:
Is there a way to tell dplyr, when using group_by(), to always define a column in a certain way?
Secondly, can someone help me with a hack for the problem explained in the MWE below:
# ERROR: This will provide the column defining error mentioned:
df <- data_frame(a = c(rep(LETTERS[1:2],4),"C"),g = c(rep(LETTERS[5:7],3)), x = c(7, 8,3, 5, 9, 2, 4, 7,8)) %>% tbl_df()
df <- df %>% group_by(a) %>% mutate_each(funs(sd(., na.rm=TRUE)),x)
df <- df %>% mutate(Winsorise = ifelse(x>2,2,x))
# NO ERROR (as no groups have single entry with NaN):
df2 <- data_frame(a = c(rep(LETTERS[1:2],4),"C"),g = c(rep(LETTERS[5:7],3)), x = c(7, 8,3, 5, 9, 2, 4, 7,8)) %>% tbl_df()
df2 <- df2 %>% group_by(a) %>% mutate_each(funs(sd(., na.rm=TRUE)),x)
# Update the Group for the row with an NA - Works
df2[9,1] <- "A"
df2 <- df2 %>% mutate(Winsorise = ifelse(x>3,3,x))
# REASON FOR ERROR: What happens for groups with one member = NaN, although we want the winsorise column to be dbl not lgl:
df3 <- data_frame(g = "A",x = NaN)
df3 <- df3 %>% mutate(Winsorise = ifelse(x>3,3,x))
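The reason is, as you rightly pointed out in df3, that the mutate result is cast as a logical when the source column is NaN/NA. More precisely, ifelse() builds its result from the logical test vector, so when every test value is NA the result stays logical.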
To circumvent this, cast your answer as numeric:
df <- df %>% mutate(Winsorise = as.numeric(ifelse(x>2,2,x)))
Perhaps @hadley could shed some light on why the mutate result is cast as lgl?
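The type change can be seen outside dplyr with base R alone. The sketch below uses a single NaN, as in df3, and also shows pmin() as one possible way to cap values without ifelse(). The pmin() variant is an alternative added here for illustration and is not part of the answer above.

```r
x <- NaN

# The test x > 3 is NA, so ifelse() never fills in a numeric value
raw_result <- ifelse(x > 3, 3, x)
type_raw <- typeof(raw_result)      # logical

# Wrapping the result restores the numeric type
fixed <- as.numeric(ifelse(x > 3, 3, x))
type_fixed <- typeof(fixed)         # double

# pmin() caps at 3 and stays numeric, even with missing values present
capped <- pmin(c(1, 5, NaN), 3)
type_capped <- typeof(capped)       # double
```

In a grouped mutate, a group whose result is logical while the others are double is what triggers the error, so any variant that always returns a double avoids it.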
