More efficient match with data.table - r

I have a massive dataset with information on hospitalizations: it includes the hospitalization id, physician id, and hospital id, as well as admission/discharge dates. Given that one hospitalization may involve multiple physicians, each row in the data is identified at the hospitalization id - physician id level. A physician may work in multiple hospitals. There are 92M hospitalizations, 150k physicians and 6k hospitals in my data.
I have another dataset with information on physicians' specialties. A physician may have multiple specialties.
I want to find, for each hospitalization-physician pair, the ids of all other hospitalizations in the same hospital that were concluded in the 30 days prior to the start of that given hospitalization and were performed exclusively by other physicians in the same specialty.
Consider the simple example below. The sample variable added to df flags the hospitalization ids that have at least one other hospitalization linked to them according to the criteria explained above.
library(data.table)
library(dplyr)

df <- data.frame(hospitalization_id = c(1, 2, 3,
                                        1, 2, 3,
                                        4, 5,
                                        6, 7, 8),
                 hospital_id = c("A", "A", "A",
                                 "A", "A", "A",
                                 "A", "A",
                                 "B", "B", "B"),
                 physician_id = c(1, 1, 1,
                                  2, 2, 2,
                                  3, 3,
                                  2, 2, 2),
                 date_start = as.Date(c("2000-01-01", "2000-01-12", "2000-01-20",
                                        "2000-01-01", "2000-01-12", "2000-01-20",
                                        "2000-01-12", "2000-01-20",
                                        "2000-02-10", "2000-02-11", "2000-02-12")),
                 date_end = as.Date(c("2000-01-03", "2000-01-18", "2000-01-22",
                                      "2000-01-03", "2000-01-18", "2000-01-22",
                                      "2000-01-18", "2000-01-22",
                                      "2000-02-11", "2000-02-14", "2000-02-17")))
df <- df %>%
  mutate(sample = c(0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0))
physician_spec <- data.frame(physician_id = c(1, 2, 2, 3),
                             specialty_code = c(100, 100, 200, 200))
With the help of Stack Overflow fellows (link to original post: Find set of rows in row-specific range with restriction at different levels), I now have the following code, which works perfectly fine. The issue is that it takes forever to run: in the past 3 days it has gone through only 300 of the 6k hospitals in the data.
setDT(df)
setDT(physician_spec)

peers_in_spec <- function(p) {
  physician_spec[
    physician_id != p &
      specialty_code %in% physician_spec[physician_id == p, specialty_code],
    physician_id]
}

f <- function(p, st) {
  peers <- peers_in_spec(p)
  exclude_hosps <- df_hospital[physician_id == p, unique(hospitalization_id)]
  unique(df_hospital[
    physician_id %in% peers &
      (st - date_end) >= 1 & (st - date_end) <= 30 &
      !hospitalization_id %in% exclude_hosps
  ]$hospitalization_id)
}
for (h in unique(df$hospital_id)) {
  print(paste0("Hospital id: ", h))
  df_hospital <- df[hospital_id == h]
  tryCatch({
    output <- df_hospital[sample == 1,
                          .(peer_hospid = f(physician_id, date_start)),
                          .(physician_id, hospitalization_id)]
    print(output)
  }, error = function(e) {cat("ERROR :", conditionMessage(e), "\n")})
}
I was wondering if there is a way to make the code more efficient: either by further subsetting the original data before applying the f() function (for instance, by looping over each day-hospital combination and subsetting the data to the period prior to that day before applying f), or by adjusting the code in some other way.

Here is one alternative approach:
# load libraries
library(data.table)
library(magrittr)
# set as data.table
setDT(physician_spec)
setDT(df)
# Create a physician match table: for each physician, which other physicians share a specialty?
phys_match <- physician_spec[physician_spec, on = "specialty_code"] %>%
  .[physician_id != i.physician_id, .(mds = list(i.physician_id)), physician_id]

# Merge df on itself, using a join on hospital_id and a non-equi join on the start date
k <- df[df[, .(hospital_id, h_id = hospitalization_id, date_end, e = date_end + 30, other_md = physician_id)],
        on = .(hospital_id, date_start > date_end, date_start < e), nomatch = 0] %>%
  .[, .(hospitalization_id, h_id, hospital_id, physician_id, other_md)] %>%
  .[phys_match, on = "physician_id"]

# add specialty match boolean, and keep if this match is true
k[, spec_match := other_md %in% mds[[1]], 1:nrow(k)]

# helper function checks: if the physician_id value (which is constant, so use p[1])
# is in o, then we return FALSE; otherwise we check whether, among the rows where the
# specialty matches, there is a set difference of length > 0
f <- function(p, o, m) {
  fifelse((p[1] %in% o), F, length(setdiff(o[m], p[m])) > 0)
}
k[, f(physician_id, other_md, spec_match), .(hospitalization_id, h_id)][V1 == TRUE][, V1 := NULL][]
Output:
hospitalization_id h_id
1: 3 4
2: 4 1
3: 5 1
4: 5 2
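At the scale described in the question (92M rows), the single big self-join may be memory-hungry. Since matches never cross hospitals, one possible refinement, offered only as an untested sketch, is to run the same pipeline one hospital at a time and bind the pieces together. Note that match_one_hospital below is a hypothetical wrapper around the k/f steps above, not a function from the answer:
# Sketch only: assumes the join + f() steps above are wrapped in a helper
# match_one_hospital(d) that takes a single-hospital subset of df and returns
# the matched (hospitalization_id, h_id) pairs for that hospital.
per_hospital <- lapply(
  split(df, by = "hospital_id"),
  match_one_hospital
)
output <- rbindlist(per_hospital, use.names = TRUE)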

Related

Path along linked values taking the lowest value each time

I have a data.table with two columns "From" and "To" as follows:
data.table(From = c(1,1,1,1,2,2,2,2,3,3,3,4,4,5),
To = c(3,4,5,6,3,4,5,6,4,5,6,5,6,6))
The data.table will always be sorted as shown in the example above, with "From" and "To" values increasing from smallest to largest.
I need to find a 'path' starting from the first 'From' (which will always be '1'), through to the last 'To' value, subject to always choosing the lowest 'To' value.
In the above example, I would have 1 --> 3, then 3 --> 4, then 4 --> 5, then finally 5 --> 6.
I then want to return in a vector 1, 3, 4, 5, and 6, representing the linked values.
The only way that I can think of doing it is using a while or for loop, looping through each group of 'From' values and iteratively choosing the smallest. That seems inefficient though, and will probably be very slow on my actual data set, which is over 100,000 rows long.
Are there any data.table-like solutions?
I also thought that maybe igraph would have a method for this, but I must admit that I currently have pretty much zero knowledge of that package.
Any help would be greatly appreciated.
Thanks,
Phil
EDIT:
Thanks for all the responses so far.
Sorry, my example/explanation wasn't a great one, as I didn't explain that the 'From'/'To' pairs don't need to go all the way through to the end value of the 'To' column.
Using the example from the comments below:
dt <- data.table(From = c(1, 1, 1, 1, 2, 2, 2, 2, 4, 4, 5),
To = c(3, 4, 5, 6, 3, 4, 5, 6, 5, 6, 6))
The output would simply be a vector of c(1, 3), as it will start at 1, choose the lowest value which is 3, and then because there are no 'From' values of '3', it wouldn't continue any further.
Another example:
dt <- data.table(From = c(1,1,1,2,2,3,3,4,4),
To = c(2,3,4,5,6,4,7,8,9))
The intended output here is a vector c(1,2,5); following the path 1 --> 2, then 2 --> 5, at which point it stops as there is no '5' value in the "From" column.
Hopefully, that makes sense, and apologies for the lack of clarity in the original question.
Thanks,
Phil
You can try the code below
library(dplyr)
library(igraph)
library(purrr)

dt %>%
group_by(From) %>%
slice_min(To) %>%
graph_from_data_frame() %>%
ego(
order = sum((m <- membership(components(.))) == m[names(m) == "1"]),
nodes = "1",
mode = "out"
) %>%
pluck(1) %>%
names() %>%
as.numeric()
or simpler with subcomponent (as #clp did)
dt %>%
group_by(From) %>%
slice_min(To) %>%
graph_from_data_frame() %>%
subcomponent(v = "1", mode = "out") %>%
names() %>%
as.integer()
which gives
For the first updated data set:
[1] 1 3
For the second updated data set:
[1] 1 2 5
Assuming an ordered From and To list this may work.
It first groups by From, compresses by To, then excludes non-matching From-To values using shift.
If jumps are missing (e.g. a To value of 3 but no From value of 3), it prints NULL.
dt[, .(frst = first(To)), From][
, if(all((frst %in% From)[1:(.N - 1)])){
c(1, unique(frst[From == shift(frst, type = "lag", fill = T)]))}]
[1] 1 3 4 5 6
Using igraph and subcomponent().
After ThomasisCoding's comment, I realized that graph_from_data_frame creates a graph by name.
This is a waste of memory (and time) if the graph is large (1E6).
Note also that graph_from_edgelist(as.matrix(...)) is much faster.
library(igraph)

dt2 <- setNames(aggregate(dt$To, list(dt$From), "min"), c("From", "To"))
g <- graph_from_edgelist(as.matrix(dt2), directed = TRUE)
as.numeric(as_ids(subcomponent(g, 1, mode = "out")))
First attempt.
dt2 <- setNames(aggregate(dt$To, list(dt$From), "min"), c("From", "To") )
g <- graph_from_data_frame(dt2, directed=TRUE)
as.numeric(as_ids(subcomponent(g, 1, mode="out")))
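The speed difference between the two constructors is easy to check on one's own data; here is a small sketch using the microbenchmark package (an extra dependency, and no timings are claimed here):
library(microbenchmark)
library(igraph)

# Sketch: time the two ways of building the graph from the aggregated edge list dt2
microbenchmark(
  edgelist   = graph_from_edgelist(as.matrix(dt2), directed = TRUE),
  data_frame = graph_from_data_frame(dt2, directed = TRUE),
  times = 10
)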
I can't seem to get the other answers to work with certain tables. E.g.,
library(data.table)
library(igraph)
library(purrr)
dt <- data.table(
From = c(1, 1, 1, 1, 2, 2, 4, 5),
To = c(3, 4, 5, 6, 4, 6, 6, 6)
)
fPath1 <- function(dt) {
  setorder(dt, From, To)[, wt := fifelse(rleid(To) == 1, 1, Inf), From] %>%
    graph_from_data_frame() %>%
    set_edge_attr(name = "weight", value = dt[, wt]) %>%
    shortest_paths(min(dt[, From]), max(dt[, To])) %>%
    pluck(1) %>%
    unlist(use.names = FALSE)
}
fPath2 <- function(dt) {
  dt[, .SD[which.min(To)], From] %>%
    graph_from_data_frame() %>%
    shortest_paths(min(dt[, From]), max(dt[, To])) %>%
    pluck(1) %>%
    unlist(use.names = FALSE)
}
fPath3 <- function(dt) {
  dt[, .(frst = first(To)), From][
    , if (all((frst %in% From)[1:(.N - 1)])) {
        c(1, unique(frst[From == shift(frst, type = "lag", fill = T)]))
      }]
}
fPath1(dt)
#> [1] 1 6
fPath2(dt)
#> Warning in shortest_paths(., min(dt[, From]), max(dt[, To])): At core/paths/
#> unweighted.c:368 : Couldn't reach some vertices.
#> integer(0)
fPath3(dt)
#> NULL
This igraph solution seems to work based on a little more extensive testing:
fPath4 <- function(dt) {
  g <- graph_from_data_frame(dt)
  E(g)$weight <- (dt$To - dt$From)^2
  as.integer(V(g)[shortest_paths(g, V(g)[1], V(g)[name == dt$To[nrow(dt)]])$vpath[[1]]]$name)
}
fPath4(dt)
#> [1] 1 4 6
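As a quick sanity check against the first example data from the original question (renamed dt_q here so the dt above is not overwritten), this should recover the expected path:
dt_q <- data.table(From = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5),
                   To   = c(3, 4, 5, 6, 3, 4, 5, 6, 4, 5, 6, 5, 6, 6))
fPath4(dt_q)
#> [1] 1 3 4 5 6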
A sequential solution is feasible.
Copying one million dataframe lines took 8 seconds on my system.
n <- 1E6
df1 <- data.frame(from = sample(n), to = sample(n))
path <- c()
system.time(
  for (i in seq(nrow(df1))) {
    path[length(path) + 1] <- df1[i, "to"]  # avoid copying.
  }
)
mean(path)
length(path)
Output.
[1] 500000.5
[1] 1000000
Updated after Phil's last edit.
The first step is to simplify the input (df).
## Select min(To) by From.
if (nrow(df) > 0) {
  df2 <- setNames(aggregate(df$To, list(df$From), "min"), c("From", "To"))
} else {
  df2 <- df
}
Set path to first start node and
subsequently append end nodes.
## Let tt be the maximal outgoing node reached so far.
path <- df2[1, 1]
tt <- df2[1, 1]
for (i in seq_len(nrow(df2))) {
  if (df2[i, 1] < tt) {
    next
  } else if (df2[i, 1] == tt) {
    tt <- df2[i, 2]
    path[length(path) + 1] <- df2[i, 2]
  } else {
    break
  }
}
head(path)
Output:
[1] 1 3 4 5 6   (df as in the first example)
[1] 1 2 5       (df as in the second example)
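For convenience, the two steps above can be wrapped into a single helper; this is just a sketch, and follow_path is a name introduced here rather than one from the answer:
follow_path <- function(df) {
  if (nrow(df) == 0) return(numeric(0))
  ## Select min(To) by From.
  df2 <- setNames(aggregate(df$To, list(df$From), "min"), c("From", "To"))
  ## Walk forward, always taking the smallest To for the current node.
  path <- df2[1, 1]
  tt <- df2[1, 1]
  for (i in seq_len(nrow(df2))) {
    if (df2[i, 1] < tt) {
      next
    } else if (df2[i, 1] == tt) {
      tt <- df2[i, 2]
      path[length(path) + 1] <- df2[i, 2]
    } else {
      break
    }
  }
  path
}

follow_path(dt)  # c(1, 3) and c(1, 2, 5) for the two edited examples, respectively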

Automatically create data frames based on factor levels of a column

I have some fake case data with a manager id, type, and location. I'd like to automatically create data frames with the average number of cases a manager has at a given location.
# create fake data
manager_id <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3)
type <- c("A", "A", "B", "B", "B", "A", "A", "A", "C", "A", "B", "B", "C", "C", "C")
location <- c("Beach", "Beach", "Beach", "Beach", "Beach", "City", "City", "City", "Farm", "Farm", "Farm", "Farm", "Farm", "Farm", "City")
manager_id <- data.frame(manager_id)
type <- data.frame(type)
location <- data.frame(location)
df <- cbind(manager_id, type, location)
After creating fake data, I created a function that finds this average. The function works.
library(dplyr)
library(purrr)

avgs_function <- function(dat){
  dat1 <- dat %>% group_by(manager_id) %>% summarise(total = n())
  total <- mean(dat1$total)
  total <- round(total, 0)
  total
}
I then loop through each location, create data frames using avgs_function, and store them in a list. Then I call the data frames into my global environment. Something is going wrong here that I can't figure out. The weird thing is that it was working fine yesterday.
df_list <- unique(df$location) %>%
set_names() %>%
map(~avgs_function(df))
names(df_list) <- paste0(names(df_list), "_avg")
list2env(df_list, envir = .GlobalEnv)
Right now, the code is giving these values:
Beach_avg = 5
City_avg = 5
Farm_avg = 5
I would like:
Beach_avg = 5
City_avg = 2
Farm_avg = 3
I believe the issue is happening with the purrr package. Any help would be greatly appreciated!
I don't think you need purrr at all (just dplyr): this gets your desired output
library(dplyr)

result <- (df
  %>% count(manager_id, location)
  %>% group_by(location)
  %>% summarise(across(n, mean))
)
(although without the _avg added to the location names: you could add mutate(across(location, paste0, "_avg")) (or something with glue) if you wanted).
This also doesn't create the separate variables you wanted. You can obviously add more stuff, e.g. with(result, setNames(list(n), location)) %>% list2env(), but in general workflows that populate your global workspace with a bunch of differently named variables are a bad idea: collections like this can usually be handled better by keeping them inside a list/data frame/tibble ...
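For what it's worth, if you want to keep the avgs_function/purrr workflow, the likely bug in the original pipeline is that map() always receives the full df rather than the per-location subset. A minimal sketch of that fix (assuming the intent was one average per location) could be:
library(dplyr)
library(purrr)

# Sketch: pass each location's subset of df into avgs_function,
# instead of passing the whole df every time.
df_list <- unique(df$location) %>%
  set_names() %>%
  map(~ avgs_function(filter(df, location == .x)))

names(df_list) <- paste0(names(df_list), "_avg")
# After list2env(df_list, envir = .GlobalEnv):
# Beach_avg = 5, City_avg = 2, Farm_avg = 3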

How to determine number of new unique values being added to existing group

I'm not even sure how best to phrase this question, which may be why I'm having so much difficulty finding an answer. Here's an example:
df <- data.frame(
user = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
period = c(1, 2, 3, 3, 1, 2, 3, 3, 1, 2, 3, 3),
product = c("a", "b", "b", "c", "a", "a", "c", "d", "a", "b", "b", "a")
)
user period product
1 1 1 a
2 1 2 b
3 1 3 b
4 1 3 c
5 2 1 a
6 2 2 a
7 2 3 c
8 2 3 d
9 3 1 a
10 3 2 b
11 3 3 b
12 3 3 a
First question: I have a group of three users who are receiving products over three different time periods. I want to know, for the latest time period (the third one), how many users received new products that they haven't received before.
Notably in some of the time periods, users may receive more than one product. In this example, the only time they've received more than one product is in the third period.
The answer to this first question should be that two users have received something new (users 1 and 2).
Second question: How would you then get a count of the number of new products received by each user?
The answer should be: user 1 received 1 new product, user 2 received 2 new products, and user 3 received 0 new products.
Create dummy variables for each product so that you can calculate the statistics faster in the following steps.
You can order the data table based on the periods and split it depending on the period you specify (in your case, period 3).
Then, for each user, check whether they received a new product that they had not received before, using the colSums function.
Finally, run a quick for loop over each user.
The result table shows who the users are (user), whether they received a new product (is_new_product), if so, how many new products they received (how_many), and which new products they received (what).
If you sum the is_new_product column, you get the answer to Question 1, which is 2.
If you look at the how_many column, you get the answer to Question 2: user 1 received 1 new product and user 2 received 2 new products.
Additionally, you can see the product information in the what column.
library(fastDummies)
df <- data.frame(
user = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
period = c(1, 2, 3, 3, 1, 2, 3, 3, 1, 2, 3, 3),
product = c("a", "b", "b", "c", "a", "a", "c", "d", "a", "b", "b", "a")
)
### Make a dummy variable to make calculations easier
df$product <- as.factor(df$product)
df <- fastDummies::dummy_cols(df, select_columns = c("product"),remove_selected_columns = T)
## Order the data based on periods
df <- df[order(df$period,decreasing = F),]
## Create some empty vectors for the loop
user <- c()
is_new_product <- c()
how_many <- c()
what <- c()
for (n_user in 1:3) {
  user[n_user] <- n_user
  res <- colSums(df[df$period != 3 & df$user == n_user, 3:ncol(df)]) -
         colSums(df[df$period == 3 & df$user == n_user, 3:ncol(df)])
  is_new_product[n_user] <- as.numeric(any(res < 0))
  how_many[n_user] <- sum(res == -1)
  what[n_user] <- paste0(names(which(res == -1)), collapse = " | ")
}
result <- data.frame(user, is_new_product, how_many, what)
result
Results
> result
user is_new_product how_many what
1 1 1 1 product_c
2 2 1 2 product_c | product_d
3 3 0 0
This does what you want
library(dplyr)

first <- df %>% group_by(user, product) %>% slice_min(period) %>% rename(first=period)
last <- df %>% group_by(user, product) %>% slice_max(period) %>% rename(last=period)
df <- full_join(first, last) %>% mutate(new_product = if_else(first ==3 & last==3,1,0 ) ) %>%
group_by(user) %>% summarise(new_products = sum(new_product))
Edit: Adjusting the last bit to give you an extra column with the products received for the first time in period 3.
df <- full_join(first, last) %>%
mutate(new_product = if_else(first ==3 & last==3,1,0 ) ) %>%
mutate(product = if_else(new_product==1,product,"" )) %>%
group_by(user) %>%
summarise(new_products = sum(new_product), products = paste(product, collapse = ""))
Here is a base R approach using an array:
tbl <- xtabs(~product+period+user, df)
results <- apply(tbl, 3, function(x) rowSums(x) == x[, 3] & x[, 3] > 0)
cat("Number of Products First Received in Period 3 by User (Column)\n") ; colSums(results)
# Number of Products First Received in Period 3 by User (Column)
# 1 2 3
# 1 2 0
cat("Number of Products First Received in Period 3 by Product (Column)\n") ; rowSums(results)
# Number of Products First Received in Period 3 by Product (Column)
# a b c d
# 0 0 2 1

Apply own function across vector

I am using a data frame as a lookup table. Given the name of a financial institution (FI), return its ID.
I have a function that can do that for a single (FI) at a time. I wish to use this function with a vector of a few dozen FIs.
So feeding companies into the function should return 12 and 14.
Is this a job for the Tidyverse map function?
library("tidyverse")
groups_df <- tibble(
fi = c("a", "b", "c"),
id = c(10, 12, 14)
)
companies <- c("b", "c")
group_id_lookup <- function(new_comp, groups_df){
  group_id_idx <- which(groups_df$fi == new_comp)
  group_id <- pull(groups_df[group_id_idx, 2])
  print(group_id)
  # Do more things
}
group_id_lookup("b", groups_df) # OK
group_id_lookup("c", groups_df) # OK
# Wrong
companies %>%
map(group_id_lookup)
A join or match option should be a better approach here.
groups_df$id[match(companies, groups_df$fi)]
#[1] 12 14
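That said, if you do want to stick with map(), the original call fails because groups_df is never passed to the function. A minimal sketch (using the OP's own function, with purrr forwarding the extra argument) could be:
library(purrr)

# Sketch: map_dbl() forwards extra arguments to .f,
# so groups_df just needs to be supplied explicitly.
map_dbl(companies, group_id_lookup, groups_df = groups_df)
# returns c(12, 14); the function also prints each id as a side effect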

Lagged difference between rows in R: a different take

My question is similar to a few that have been asked before, but I hope different enough to warrant a separate question.
See here, and here. I'll pull some of the same example data as these questions. For context to my question- I am looking to see how my observed catch-rate (sea creatures) changed over multiple days of sampling the same area.
I want to calculate the difference between the first sample day at a given site (first letter in the data below) and the subsequent sample days (the following rows of the same letter).
#Example data
df <- data.frame(
id = c("A", "A", "A", "A", "B", "B", "B"),
num = c(1, 8, 6, 3, 7, 7 , 9),
What_I_Want = c(NA, 7, 5, 2, NA, 0, 2))
The first solution that I found calculates a lagged difference between each row. I also wanted this calculation- so it was helpful to find:
# Calculate lagged differences
library(dplyr)

df_new <- df %>%
  # group by condition
  group_by(id) %>%
  # find difference
  mutate(diff = num - lag(num))
Here the difference is between A.1 and A.2; then A.2 and A.3 etc...
What I would like to do now is calculate the difference with respect to the first value of each group. So for letter A, I would like to calculate 8 - 1, then 6 - 1, and finally 3 - 1 (matching the What_I_Want column above). Any suggestions?
One clunky solution (linked above) is to create two (or more) columns, one for each lag distance, and somehow merge the results that I want:
df_clunky <- df %>%
  group_by(id) %>%
  mutate(
    deltaLag1 = num - lag(num, 1),
    deltaLag2 = num - lag(num, 2))
Here is a base R method with replace and ave
ave(df$num , df$id, FUN=function(x) replace(x - x[1], 1, NA))
[1] NA 7 5 2 NA 0 2
ave applies the function to num within each id. Inside it, x - x[1] takes the difference between the vector and its first element, and replace then sets the first element to NA.
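If you prefer to stay in the dplyr pipeline used earlier in the question, an equivalent grouped mutate (just a sketch of the same idea, with diff_first as an illustrative column name) could be:
library(dplyr)

# Sketch: difference from the first value in each group,
# with the first row of each group set to NA to match What_I_Want.
df %>%
  group_by(id) %>%
  mutate(diff_first = replace(num - first(num), 1, NA)) %>%
  ungroup()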
