Automatically create data frames based on factor levels of a column

I have some fake case data with a manager id, type, and location. I'd like to automatically create data frames with the average number of cases a manager has at a given location.
# create fake data
manager_id <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3)
type <- c("A", "A", "B", "B", "B", "A", "A", "A", "C", "A", "B", "B", "C", "C", "C")
location <- c("Beach", "Beach", "Beach", "Beach", "Beach", "City", "City", "City", "Farm", "Farm", "Farm", "Farm", "Farm", "Farm", "City")
manager_id <- data.frame(manager_id)
type <- data.frame(type)
location <- data.frame(location)
df <- cbind(manager_id, type, location)
After creating fake data, I created a function that finds this average. The function works.
library(dplyr)

avgs_function <- function(dat){
  dat1 <- dat %>% group_by(manager_id) %>% summarise(total = n())
  total <- mean(dat1$total)
  total <- round(total, 0)
  total
}
I then loop through each location, create data frames using the avgs_function, and store them in a list. Then I call the data frames into my global environment. Something is going wrong here that I can't figure out. The weird thing is that it was working fine yesterday.
library(purrr)

df_list <- unique(df$location) %>%
  set_names() %>%
  map(~avgs_function(df))
names(df_list) <- paste0(names(df_list), "_avg")
list2env(df_list, envir = .GlobalEnv)
Right now, the code is giving these values:
Beach_avg = 5
City_avg = 5
Farm_avg = 5
I would like:
Beach_avg = 5
City_avg = 2
Farm_avg = 3
I believe the issue is happening with the purrr package. Any help would be greatly appreciated!

I don't think you need purrr at all (just dplyr); this gets your desired output:
result <- (df
  %>% count(manager_id, location)
  %>% group_by(location)
  %>% summarise(across(n, mean))
)
(although without the _avg added to the location names: you could add mutate(across(location, ~ paste0(.x, "_avg"))) (or something with glue) if you wanted)
This also doesn't create the separate variables you wanted (although you can obviously add more steps, e.g. with(result, setNames(as.list(n), location)) %>% list2env(envir = .GlobalEnv)). In general, though, workflows that populate your global workspace with a bunch of separately named variables are a bad idea: collections like this are usually better handled by keeping them inside a list/data frame/tibble ...
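For what it's worth, the original purrr pipeline returns 5 for every location because map(~avgs_function(df)) never uses .x: each iteration passes the full df, so every element of df_list is the overall mean across all managers (5). A minimal fix that keeps avgs_function as-is would be to filter inside the mapped formula:
library(dplyr)
library(purrr)

# .x is the current location, so subset df to it before averaging
df_list <- unique(df$location) %>%
  set_names() %>%
  map(~ avgs_function(filter(df, location == .x)))
names(df_list) <- paste0(names(df_list), "_avg")
This produces Beach_avg = 5, City_avg = 2, Farm_avg = 3 as desired.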

Related

How do you find the date when one column stopped equaling another column in R?

I have the following variables in a dataset:
UserID | Date | Workplace | First_Workplace
Users are assigned to a workplace (Remote or Office). I'm trying to figure out when each user returned to the office and Workplace no longer equaled First_Workplace. I need to add this data to a new column named Date_Changed.
Pseudocode would be something like the below; I just can't figure it out in R:
data %>%
  mutate(Date_Changed = Date WHEN Workplace != First_Workplace FOR EACH UserID)
We may use which on the logical values to get the indices and select the first one with [1]:
library(dplyr)
data %>%
  group_by(UserID) %>%
  mutate(Date_changed = Date[which(Workplace != first(Workplace))[1]]) %>%
  ungroup
NOTE: If there is no change, it returns NA
data
data <- data.frame(UserID = rep(1:3, each = 3),
                   Workplace = c("A", "A", "B", "A", "A", "A", "C", "C", "C"),
                   Date = Sys.Date() + 0:8)
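With this sample data, UserID 1 gets Date_changed = Sys.Date() + 2 (its third row is the first where Workplace differs from the first value), while UserIDs 2 and 3 never change workplace and so get NA.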

More efficient match with data.table

I have a massive dataset with information on hospitalizations: it includes id of hospitalization, id of physicians, id of hospital, as well as admission/discharge dates. Given that one hospitalization may involve multiple physicians, each row in the data is identified at the hospitalization id - physician id level. A physician may work in multiple hospitals. There are 92M hospitalizations, 150k physicians and 6k hospitals in my data.
I have another dataset with information on physicians' specialties. A physician may have multiple specialties.
I want to find, for each hospitalization-physician ids, the ids of all other hospitalizations in the same hospital that were concluded in the 30 days prior to the start of that given hospitalization and performed exclusively by other physicians in the same specialty.
Consider the simple example below. The sample variable added to df flags the hospitalization ids which will have at least 1 other hospitalization linked to it according to the criteria explained above.
library(dplyr)

df <- data.frame(hospitalization_id = c(1, 2, 3,
                                        1, 2, 3,
                                        4, 5,
                                        6, 7, 8),
                 hospital_id = c("A", "A", "A",
                                 "A", "A", "A",
                                 "A", "A",
                                 "B", "B", "B"),
                 physician_id = c(1, 1, 1,
                                  2, 2, 2,
                                  3, 3,
                                  2, 2, 2),
                 date_start = as.Date(c("2000-01-01", "2000-01-12", "2000-01-20",
                                        "2000-01-01", "2000-01-12", "2000-01-20",
                                        "2000-01-12", "2000-01-20",
                                        "2000-02-10", "2000-02-11", "2000-02-12")),
                 date_end = as.Date(c("2000-01-03", "2000-01-18", "2000-01-22",
                                      "2000-01-03", "2000-01-18", "2000-01-22",
                                      "2000-01-18", "2000-01-22",
                                      "2000-02-11", "2000-02-14", "2000-02-17")))
df <- df %>%
  mutate(sample = c(0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0))
physician_spec <- data.frame(physician_id = c(1, 2, 2, 3),
                             specialty_code = c(100, 100, 200, 200))
With the help of Stack Overflow fellows (link to original post: Find set of rows in row-specific range with restriction at different levels), I now have the following code that works perfectly fine. The issue is that the code takes forever to run: in the past 3 days it has gone through only 300 of the 6k hospitals in the data.
library(data.table)

setDT(df)
setDT(physician_spec)

peers_in_spec <- function(p) {
  physician_spec[
    physician_id != p &
      specialty_code %in% physician_spec[physician_id == p, specialty_code],
    physician_id]
}
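For example, with the sample data above, peers_in_spec(2) returns physicians 1 and 3, since they share specialty codes 100 and 200, respectively, with physician 2.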
f <- function(p, st) {
  peers <- peers_in_spec(p)
  exclude_hosps <- df_hospital[physician_id == p, unique(hospitalization_id)]
  unique(df_hospital[
    physician_id %in% peers &
      (st - date_end) >= 1 & (st - date_end) <= 30 &
      !hospitalization_id %in% exclude_hosps
  ]$hospitalization_id)
}
for (h in unique(df$hospital_id)) {
  print(paste0("Hospital id: ", h))
  df_hospital <- df[hospital_id == h]
  tryCatch({
    output <- df_hospital[sample == 1,
                          .(peer_hospid = f(physician_id, date_start)),
                          .(physician_id, hospitalization_id)]
    print(output)
  }, error = function(e) { cat("ERROR :", conditionMessage(e), "\n") })
}
I was wondering if there is a way to make the code more efficient: either by further subsetting the original data before applying the f() function (for instance, by looping over each hospital-day and subsetting the data for the period prior to that day before applying f), or by adjusting the code in some other way.
Here is one alternative approach:
# load libraries
library(data.table)
library(magrittr)

# set as data.table
setDT(physician_spec)
setDT(df)

# Create a physician match table: for each physician, which other physicians
# are matched by specialty?
phys_match <- physician_spec[physician_spec, on = "specialty_code"] %>%
  .[physician_id != i.physician_id, .(mds = list(i.physician_id)), physician_id]

# Merge df on itself, using a join on hospital_id and a non-equi join on the
# start date falling within 30 days after the other hospitalization's end date
k <- df[df[, .(hospital_id, h_id = hospitalization_id, date_end, e = date_end + 30,
               other_md = physician_id)],
        on = .(hospital_id, date_start > date_end, date_start < e), nomatch = 0] %>%
  .[, .(hospitalization_id, h_id, hospital_id, physician_id, other_md)] %>%
  .[phys_match, on = "physician_id"]

# add specialty match boolean, and keep rows where this match is true
k[, spec_match := other_md %in% mds[[1]], 1:nrow(k)]

# helper function checks: if the physician_id value (which is constant, so use p[1])
# is in o, then we return FALSE; otherwise we check whether, among the rows where
# the specialty matches, there is a set difference of length > 0
f <- function(p, o, m) {
  fifelse((p[1] %in% o), FALSE, length(setdiff(o[m], p[m])) > 0)
}

k[, f(physician_id, other_md, spec_match), .(hospitalization_id, h_id)][V1 == TRUE][, V1 := NULL][]
Output:
   hospitalization_id h_id
1:                  3    4
2:                  4    1
3:                  5    1
4:                  5    2
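The main efficiency gain is that the per-hospital for loop and the row-by-row f(physician_id, date_start) calls are replaced by a single non-equi self-join on hospital_id and the 30-day window, which data.table resolves via binary search on sorted keys instead of repeatedly scanning subsets of the data.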

Convert character column to factor preserving column label

I have a dataframe that I read from an XLSX file. Every column name looks like this: CODE___DESCRIPTION, so for example A1___Some funky column here. It is easier to use the codes as colnames, but I want to use the descriptions when needed, so they must be stored in the dataframe. This is why I am using the sjlabelled package later on.
Make yourself some random data and save it as some_data.xlsx.
library(dplyr)   #to play with tibbles
library(stringi) #to play with strings
library(writexl) #name speaks for itself

tibble(col1 = sample(c("a", "b", "c", NA, "N/A"), 50, replace = TRUE),
       col2 = sample(c("d", "e", "f", NA, "N/A"), 50, replace = TRUE),
       col3 = sample(c("g", "h", "i", NA, "N/A"), 50, replace = TRUE),
       col4 = sample(c("j", "k", "l", NA, "N/A"), 50, replace = TRUE)) %>%
  setNames(stri_c("A", 1:4, "___", stri_rand_strings(4, 10))) %>%
  write_xlsx(path = "some_data.xlsx", col_names = TRUE, format_headers = FALSE)
I've created simple function to prepare my data the way I want it.
library(sjlabelled) #to play with labelled data

label_it <- function(data = NULL, split = "___"){
  #This basically makes a two-column matrix (codes and descriptions respectively)
  k.n <- data %>%
    names() %>%
    stri_split_fixed(pattern = split, simplify = TRUE)
  data %>%
    set_label(k.n[, 2]) %>% #set description as each column's label
    setNames(k.n[, 1])      #set code as each column's name
}
First I read the data from XLSX file. Then I label it.
library(readxl) #name speaks for itself again
data <- read_xlsx("some_data.xlsx", na = c("", "N/A")) %>%
  label_it()
Now each column of my dataframe is a character vector (in fact a structure) with two attributes:
label, which holds the description part
names, which holds the original column name (CODE___DESCRIPTION style) and is not to be mistaken for the output of names(data), which would be the codes part
Let's say I would like to change first and third column to factor.
To do this I have tried two things:
data[,1] <- factor(data[,1], levels = c("c", "a", "b"))
data[,3] <- factor(data[,3], levels = c("h", "g", "i"))
this changes all values in those two columns to NA_integer_ (with a tibble, data[,1] returns a one-column tibble rather than a vector, so factor() cannot match any of the levels).
data <- data %>%
  mutate(A1 = factor(A1, levels = c("c", "a", "b")),
         A3 = factor(A3, levels = c("h", "g", "i")))
this changes the character vectors to factors as intended, but it drops both column attributes (label and names), which I need preserved.
I also tried quite a lot of functions from sjlabelled, labelled and haven packages. Nothing worked as I intended. Finally, I have found a solution, but it isn't perfect and I would love to find an easier way of doing this.
The solution is to lose those attributes but then regain them (in fact, 'copy' them back).
data <- data %>%
  mutate(A1 = factor(A1, levels = c("c", "a", "b")),
         A3 = factor(A3, levels = c("h", "g", "i"))) %>%
  copy_labels(data)
copy_labels is a function from the sjlabelled package which is used when labels are lost due to e.g. data subsetting, as in this example.
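An alternative is a small helper that converts to factor and re-attaches the label itself, avoiding the second pass over the data that copy_labels makes. A minimal sketch (to_factor_keep_label is a made-up name, and this assumes only the label attribute needs rescuing, since mutate() reassigns the column under its existing name anyway):
library(dplyr)

to_factor_keep_label <- function(x, levels) {
  out <- factor(x, levels = levels)
  # carry the "label" attribute from the character column over to the new factor
  attr(out, "label") <- attr(x, "label", exact = TRUE)
  out
}

data <- data %>%
  mutate(A1 = to_factor_keep_label(A1, c("c", "a", "b")),
         A3 = to_factor_keep_label(A3, c("h", "g", "i")))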
P.S.
I would love to add r-sjlabelled and r-labelled tags because those packages are considered in this problem but am under 1500 reputation required to do this.

group by sequence of events and get summary statistics for each sequence

I have a data.frame with log of sequences of events. Here, sequence 1 is composed of event A, then B, then C, each starting at a specific timestamp (in seconds).
df <- data.frame(id = runif(10, 1e6, 1e7),
                 sequence = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4),
                 event = c("A", "B", "C", "B", "C", "A", "B", "C", "B", "C"),
                 starts_at = c(20, 22, 24, 20, 30, 20, 21, 23, 20, 40))
What I want is to group my data.frame by type of sequence (there are dozens of types, of length 2 to 6), such as A->B->C or B->C, and then get some results on those types. Desired output would be:
####   sequence_type number.appearances mean.delay.between.events
#### 1           ABC                  2                   1.5 / 2
#### 2            BC                  2                        15
The last column "mean delay" would be a string composed of the mean time differences between successive events in a sequence: in the ABC sequence, there are on average 1.5 seconds between A and B, and 2 between B and C.
I also thought of "spreading" each mean difference into a new column diff.1, diff.2..., but that seems complicated since sequences have different lengths. Though I'm open to different ways of presenting this information.
So far I've come up with:
library(dplyr)
df %>%
  group_by(sequence) %>%
  arrange(starts_at) %>%
  summarise(sequence_type = paste0(event, collapse = "")) %>%
  group_by(sequence_type) %>%
  tally
I didn't find how to achieve the second part. Thanks for the help...
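(For reference, one way to finish that second part with dplyr alone is sketched below; it assumes delays can be keyed by their position within the sequence type, which holds when all sequences of a type contain the same events in the same order.)
library(dplyr)

df %>%
  group_by(sequence) %>%
  arrange(starts_at, .by_group = TRUE) %>%
  mutate(sequence_type = paste0(event, collapse = ""),
         delay = starts_at - lag(starts_at),  # gap to the previous event
         step = row_number() - 1) %>%         # which transition the gap belongs to
  ungroup() %>%
  filter(step > 0) %>%
  group_by(sequence_type, step) %>%
  summarise(mean_delay = mean(delay), n_seq = n_distinct(sequence), .groups = "drop") %>%
  group_by(sequence_type) %>%
  summarise(number.appearances = first(n_seq),
            mean.delay.between.events = paste(mean_delay, collapse = " / "))
With the sample df this returns ABC / 2 / "1.5 / 2" and BC / 2 / "15", matching the desired output.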
This might not be the elegant solution you would get with dplyr, but I think it is general enough that it would work with your real data.
First you just need to get the corresponding sequence type for each row of your data, that is ayuda_seq:
library(zoo)
df <- data.frame(id = runif(14, 1e6, 1e7),
                 sequence = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5, 5),
                 event = c("A", "B", "C", "B", "C", "A", "B", "C", "B", "C", "A", "B", "C", "D"),
                 starts_at = c(20, 22, 24, 20, 30, 20, 21, 23, 20, 40, 20, 22, 21, 15))

ayuda_seq <- sapply(df$sequence, function(x) paste0(df[df$sequence == x, 3], collapse = ""))
and then you just loop through the unique sequence types and generate the sub-sequences of 2 elements each:
vec_means <- NULL
for (x in unique(ayuda_seq)) {
  data_temp <- df[ayuda_seq == x, ]
  diff_temp <- diff(data_temp$starts_at)
  temp_sub <- apply(rollapply(data_temp[, 3], FUN = paste0, width = 2), 1, paste0, collapse = "")
  mean_temp <- aggregate(diff_temp, by = list(temp_sub), mean)
  if (all(!duplicated(temp_sub))) {
    averages <- paste0(mean_temp[, 2], collapse = " / ")
  } else {
    averages <- paste0(mean_temp[match(temp_sub[duplicated(temp_sub)], mean_temp[, 1]), 2],
                       collapse = " / ")
  }
  vec_means <- c(vec_means, averages)
}
df_res <- data.frame(sequence_type = unique(ayuda_seq),
                     # index the table by name so counts line up with unique()'s order
                     number.appearances = as.numeric(table(ayuda_seq)[unique(ayuda_seq)] /
                                                       nchar(unique(ayuda_seq))),
                     mean.delay.between.events = vec_means)
The variable temp_sub holds the different 2-element combinations within the original string you are looping over. In the case of "ABC" there is a possible combination "CA" (spanning the boundary between two sequences) which is not taken into consideration because it is unique.
Not pretty, but it works
tmp <- df %>%
  group_by(sequence) %>%
  dplyr::arrange(sequence, starts_at) %>%
  dplyr::mutate(seq_row_num = dplyr::row_number(),
                lead_starts_at = dplyr::lead(starts_at, n = 1)) %>%
  base::as.data.frame()
tmp <- tmp %>%
  dplyr::group_by(sequence) %>%
  mutate(max_seq_len = max(seq_row_num)) %>%
  base::as.data.frame()
tmp$seq_len_id <- paste0(tmp$sequence, tmp$max_seq_len)
tmp$next_seq_val <- tmp$seq_row_num + 1
tmp$next_seq_val <- base::ifelse(tmp$next_seq_val >= tmp$max_seq_len, tmp$max_seq_len, tmp$next_seq_val)
tmp_seq_labels <- stats::aggregate(tmp$event, list(tmp$seq_len_id), paste, collapse = '')
tmp <- base::merge(tmp, tmp_seq_labels, by.x = "seq_len_id", by.y = "Group.1")
colnames(tmp)[which(colnames(tmp) == "x")] <- "seq_group"
tmp$within_group_step <- "ZZ"
tmp$within_group_step <- base::ifelse(tmp$seq_row_num != tmp$max_seq_len,
                                      substr(tmp$seq_group, start = tmp$seq_row_num, stop = tmp$next_seq_val),
                                      tmp$within_group_step)
tmp$within_step_by_group_id <- paste0(tmp$seq_group, tmp$within_group_step)
tmp$time_diff <- 0
tmp$time_diff <- base::ifelse(!is.na(tmp$lead_starts_at), tmp$lead_starts_at - tmp$starts_at, tmp$time_diff)
res <- stats::aggregate(time_diff ~ within_step_by_group_id + seq_group + within_group_step,
                        data = tmp, FUN = mean)
drops <- grep(pattern = "ZZ", x = res$within_step_by_group_id)
if (length(drops) >= 1) {
  res <- res[-drops, ]
}
colnames(res) <- c("Full_Group_Pattern", "Group_Pattern", "Sub_Group_Pattern", "Mean_Time_Difference")
res <- res %>%
  dplyr::group_by(Group_Pattern) %>%
  dplyr::mutate(Number_of_Appearances = n()) %>%
  base::as.data.frame()
Here is the result: [output table posted as an image in the original answer]

Speed up date lookup in R

I have a function that takes a vector of dates and matches it with a subset list of dates (based on certain attributes). For example, say my raw data looks like this:
key_1 <- c("A", "A", "B", "B")
date_1 <- as.Date(c("2012-03-31", "2011-01-31", "2011-08-07", "2014-04-09"))
And my lookup data looks like this:
lookup <- data.frame(stringsAsFactors = FALSE,
                     key_2 = c("A", "A", "A", "A", "B", "B", "B", "B"),
                     date_2 = as.Date(c(
                       "2010-05-12", "2011-05-12", "2012-05-12", "2013-05-12",
                       "2010-12-01", "2011-12-01", "2012-12-01", "2013-12-01"
                     )))
I'm essentially looking for the largest date_2 that date_1 is larger than; that matched date_2 becomes date_3. Basically, it's this in Excel:
date_3 = VLOOKUP(date_1[1], date_2[1:4], 1, TRUE)
Which would produce this:
date_3 <- c("2011-05-12", "2010-05-12", "2010-12-01", "2013-12-01")
My current function (below) works great, but my raw data is 220k rows, so it takes roughly 12 minutes to run. While that isn't the worst thing in the world, I was hoping there might be a faster way to run it.
my_fun <- function(key_1, date_1) {
  indices <- sapply(unique(lookup$key_2), function(x) which(lookup$key_2 == x))
  periods <- lookup$date_2[indices[, key_1]]
  index <- findInterval(x = date_1, vec = periods) %>% as.numeric()
  periods %>% magrittr::extract(index)
}

date_3 <- mapply(my_fun, key_1, date_1, USE.NAMES = FALSE) %>%
  as.Date(origin = "1970-01-01")
Thanks.
Update: I've tried utilizing both of the answers here but couldn't get them to work.
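One direction that may be worth trying (a sketch, not benchmarked against the real 220k rows): a data.table rolling join performs this lookup in a single vectorized step instead of one mapply call per row. roll = Inf matches each date_1 to the most recent date_2 on or before it, which mirrors VLOOKUP(..., TRUE); note it uses <= rather than a strict <.
library(data.table)

raw <- data.table(key = key_1, date = date_1)
lk  <- data.table(key = lookup$key_2, date = lookup$date_2, date_3 = lookup$date_2)

# for each (key, date) in raw, roll back to the latest lookup date on or before it
date_3 <- lk[raw, on = .(key, date), roll = Inf]$date_3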
