How to optimise an R function with 2 inputs within a loop - r

I am new to R, and I am surprised at how long it takes to run what I believe to be rather simple lines of code, which leads me to believe I am missing something obvious. I have searched the internet and tried a few different iterations of the function, but nothing has improved the efficiency (measured in time).
The Extract data is a data frame with 18.5m rows and 11 variables. I am trying to establish two things: first, what percentage of patients stay in hospital for longer than 7 days as a proportion of all patients, and second, 21-day stays as a proportion of 7-day stays.
LOS_prob_providerage <- function(x, y) {
  Var1 = which(Extract$LOS >= 0 & Extract$ProviderCode == x & Extract$age_group == y)
  Var2 = which(Extract$LOS >= 7 & Extract$ProviderCode == x & Extract$age_group == y)
  return(list(Strand = (sum(Extract$LOS[Var1] >= 7) / length(Var1)) * 100,
              ELOS = (sum(Extract$LOS[Var2] >= 21) / length(Var2)) * 100))
}
When I call this function I give it a list of hospitals as the x variable and a single age group from a list as the y variable (I can't seem to get it to take both as lists and output all hospitals for all age groups), using the following code:
Providerage_prob_strand = mapply(LOS_prob_providerage, Provider_unique, agelabels[1], SIMPLIFY = FALSE)
I then create a data frame from the two lists that the function outputs, using the code below:
National = data.frame(matrix(unlist(Providerage_prob_strand), ncol = 2, byrow = T),
                      row.names = Provider_unique)
colnames(National) <- c("Stranded_010","ELOS_010")
I subsequently re-run the last portion of code for all 11 elements in my age group list and append the results to the National data frame.
Question 1: Is there a less computationally intensive way to code my loop in R, or is the loop just taking that long because of the way R stores everything in memory?
Question 2: Is there any way to give R two lists, for both the x and y variables, using mapply/sapply, and have it output the Strand and ELOS results across all hospitals/age groups?

I would use the data.table package for this.
Some dummy data to demonstrate (it is usually good practice for the asker to provide this):
set.seed(123)
df1 = data.frame(
  provider = sample(LETTERS[1:4], 1000, T),
  los = round(runif(1000, 0, 40)),
  age_group = sample(1:4, 1000, T))
Now we turn this into a data.table:
library(data.table)
setDT(df1)
and we can extract the values you want like this:
providerlist = c('A','B')
age_list = c(1,2)
df1[provider %in% providerlist & age_group %in% age_list,
    .(los_greater_than7 = 100*sum(los > 7)/.N),
    keyby = .(provider, age_group)]
#    provider age_group los_greater_than7
# 1:        A         1          92.40506
# 2:        A         2          81.81818
# 3:        B         1          77.27273
# 4:        B         2          87.50000
df1[provider %in% providerlist & age_group %in% age_list & los > 7,
    .(los_greater_than20 = 100*sum(los > 20)/.N),
    by = .(provider, age_group)]
#    provider age_group los_greater_than20
# 1:        A         1           56.16438
# 2:        A         2           66.66667
# 3:        B         1           56.86275
# 4:        B         2           58.92857
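Both measures can also be computed in one grouped call; grouping on both columns produces every provider/age-group combination at once, which is essentially what Question 2 asks for. A sketch on the same df1 (the second metric is computed among the los > 7 rows within each group):
df1[provider %in% providerlist & age_group %in% age_list,
    .(los_greater_than7 = 100*sum(los > 7)/.N,
      los_greater_than20 = 100*sum(los[los > 7] > 20)/sum(los > 7)),
    keyby = .(provider, age_group)]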

Related

How to return the range of values shared between two data frames in R?

I have several data frames that have the same column names: an ID, a from and a to giving the start and end of a range, and a group label.
What I want is to find which from and to values of one data frame are included in the range of the other one. I leave an example picture to illustrate what I want to achieve (no graph is needed for the moment).
I thought I could accomplish this using between() from the dplyr package, but no. It could be accomplished like this: if between() returns TRUE, then return the maximum of the from values and the minimum of the to values across the two data frames.
I leave example data frames and the results I want to obtain below.
a <- data.frame(ID = c(1,1,1,2,2,2,3,3,3),
                from = c(1,500,1000,1,500,1000,1,500,1000),
                to = c(400,900,1400,400,900,1400,400,900,1400),
                group = rep("a",9))
b <- data.frame(ID = c(1,1,1,2,2,2,3,3,3),
                from = c(300,1200,1900,1400,2800,3700,1300,2500,3500),
                to = c(500,1500,2000,2500,3000,3900,1400,2800,3900),
                group = rep("b",9))
results <- data.frame(ID = c(1,1,1,2,3),
                      from = c(300,500,1200,1400,1300),
                      to = c(400,500,1400,1400,1400),
                      group = rep("a, b",5))
I tried using this function, which returns the values when there is a match, but it doesn't return the range shared between them:
f <- function(vec, id) {
  if(length(.x <- which(vec >= a$from & vec <= a$to & id == a$ID))) .x else NA
}
b$fromA <- a$from[mapply(f, b$from, b$ID)]
b$toA <- a$to[mapply(f, b$to, b$ID)]
We can play with the idea that the starting and ending points are in different columns and that the ranges for the same group (a and b) do not overlap. This is my solution. I have called your mutated 'from' and 'to' 'point_1' and 'point_2' for clarity.
You can bind the two data frames by rows and compare the from column with the previous row's end, lag(to), to see whether the current range starts before the previous one ends. You also compare the previous lag(to) with the current to column to see whether the previous range extends past the current one.
Importantly, these operations do not distinguish whether the two rows being compared come from the same group (a or b). Therefore, filtering out the NAs in point_1 (the new mutated 'from' column) removes the wrongly mutated values.
Also, note that I assume a range in 'a' cannot overlap two rows in 'b', for example. That doesn't happen in your 'results' table, but you should check it in your data frames.
library(dplyr)

res = rbind(a, b) %>%     # bind by rows
  arrange(ID, from) %>%   # arrange by ID and starting point (from)
  group_by(ID) %>%        # perform the following operations grouped by ID
  # Here is the trick. If the ranges for the same ID and group (i.e. 1,a) do
  # not overlap, when you mutate the following cols the result will be NA for
  # point_1.
  mutate(point_1 = ifelse(from <= lag(to), from, NA),
         point_2 = ifelse(lag(to) >= to, to, lag(to)),
         groups = paste(lag(group), group, sep = ',')) %>%
  filter(!is.na(point_1)) %>%             # remove NAs in point_1
  select(ID, point_1, point_2, groups)    # get the result data frame
If you play a bit with the code, omitting the filter() and select() steps, you will see how it works.
> res
# A tibble: 5 x 4
# Groups:   ID [3]
     ID point_1 point_2 groups
  <dbl>   <dbl>   <dbl> <chr>
1     1     300     400 a,b
2     1     500     500 b,a
3     1    1200    1400 a,b
4     2    1400    1400 a,b
5     3    1300    1400 a,b
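An alternative sketch, closer to the between()/max/min idea in the question: join the two data frames on ID (assuming dplyr is loaded as above), keep the pairs of ranges that overlap, and take the intersection of each pair.
inner_join(a, b, by = "ID", suffix = c(".a", ".b")) %>%  # all range pairs per ID
  filter(from.a <= to.b, from.b <= to.a) %>%             # keep overlapping pairs
  transmute(ID,
            from = pmax(from.a, from.b),                 # later of the two starts
            to = pmin(to.a, to.b),                       # earlier of the two ends
            group = paste(group.a, group.b, sep = ", "))
On the example data this reproduces the results data frame from the question.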

How do I filter data in data frame and change column's cell values based on it using a loop?

Currently working with a larger data frame with various participant IDs that looks like this:
#ASC_new Data Frame
           Pcp Choice Target ASC       Product choice_consis
2393 zwyn27soc      B      A   1     USB drive             0
2394 zwyn27soc      B      A   1           job             0
2395 zwyn27soc      B      B   1     USB drive             0
2397 zwyn27soc      B      A   1       printer             0
2399 zwyn27soc      B      B   1 walking shoes             0
2400 zwyn27soc      B      A   1       printer             0
I would like to loop through each participant (Pcp) and look at their choices in the "Choice" column. For example, for both rows with the product "USB drive", the participant chose "B". Therefore, under choice_consis, I want a 1 to replace the 0 because the choices are consistent. However, my for loop over the participants and product names isn't working:
#Examples/snippets of my values
pcp_list <- list("ybg606k3l", "yk83d2asc", "yl55v0zhm", "zwyn27soc")
product_list <- list("USB drive", "printer", "walking shoes", "job")
#for loop that isn't working
for (i in pcp_list){ #iterating through participant codes
  for (j in product_list){ #iterating through product names
    comparison <- filter(ASC_new, Pcp == i & Product == j) #filtering participant data and products into a new data frame
    choice_1 <- ASC_new$Choice[1] #creating labels for choices 1 and 2
    choice_2 <- ASC_new$Choice[2]
    if (isTRUE(choice_1 == choice_2)){ #comparing choice 1 and choice 2 and adding a value of 1 to the choice_consis column if they are equal
      ASC_new$choice_consis[1] <- 1
      ASC_new$choice_consis[2] <- 1
    }
  }
}
In the end I would like a data frame where each participant's choice_consis is labeled with a 1 or 0, expressing whether they chose the same item (A, B, D) both times each product appeared.
This is something that's pretty natural to do using dplyr, if you don't care about collapsing across different choices. I'll illustrate on a toy data frame:
IDs <- 1:2
choices <- c('A', 'B')
products <- c('USB', 'Printer')
df <- data.frame(Pcp = rep(IDs, each = 4),
                 Choice = c(rep(choices, each = 2),
                            rep(choices, each = 2)),
                 Product = c(rep(products, times = 2),
                             rep(products, each = 2)))

library(dplyr)  # needed for the %>% pipe
df %>%
  dplyr::group_by(Pcp, Product) %>%
  dplyr::summarize(choice_consis = as.numeric(length(unique(Choice)) == 1))
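For this toy data frame the summary should come out as follows (Pcp 1 switched choices within each product, Pcp 2 did not; tibble formatting omitted):
#     Pcp Product choice_consis
# 1     1 Printer             0
# 2     1 USB                 0
# 3     2 Printer             1
# 4     2 USB                 1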
This does (in essence) the same thing you're trying to do with your for loop: look at each combination of participants and products (that's what the group_by does) and then analyze that combination (that's what the summarize does). It's a little more succinct and readable than a double for loop. I'd check out Chapter 5 of Hadley's book on R for Data Science to learn more about these sorts of things.
As far as what's wrong with your for loop, the issue is that even though you create your comparison data frame, all the subsequent operations are on ASC_new. So if you wanted to use a for loop and maintain the structure of your original data, you could do something like:
for (i in pcp_list) {
  for (j in product_list) {
    compare <- (ASC_new$Pcp == i) & (ASC_new$Product == j)
    choices <- ASC_new$Choice[compare]
    if (length(unique(choices)) == 1) {
      ASC_new$choice_consis[compare] <- 1
    }
  }
}
Creating a new data frame as you did makes it a little harder to substitute values in the original (because we don't know "where" the filtered data frame came from), so I just get the indices of the original data frame corresponding to the participant-product combination. Note also that I eliminated the hard-coding of the fact that there are only two choices, as well as the isTRUE within the if statement (== will evaluate to TRUE or FALSE, as desired).
Hope this helps!
You can count the unique values of Choice for each Pcp and Product combination and assign 1 if the count is 1, and 0 otherwise. Unlike the summarize approach above, these versions keep every row of the original data frame.
This can be done in base R:
df$choice_consis <- +(with(df, ave(Choice, Pcp, Product,
                                   FUN = function(x) length(unique(x)))) == 1)
dplyr:
library(dplyr)
df %>%
  group_by(Pcp, Product) %>%
  mutate(choice_consis = +(n_distinct(Choice) == 1))
and data.table:
library(data.table)
setDT(df)[, choice_consis := as.integer(uniqueN(Choice) == 1), .(Pcp, Product)]
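As a quick check against the toy df built earlier (values worked out by hand: Pcp 1 switched choices within each product, Pcp 2 did not), the data.table line fills the column in place, so afterwards:
df$choice_consis
# [1] 0 0 0 0 1 1 1 1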

Not all values are stored in a loop

I want to store values in "yy", but my code below stores only one row (the last value). Please see the output below. Can somebody help me store all the values in "yy"?
Thanks in advance. I am a beginner in R.
arrPol <- as.matrix(unique(TN_97_Lau_Cot[,6]))
arrYear <- as.matrix(unique(TN_97_Lau_Cot[,1]))
for (ij in length(arrPol)){
  for (ik in length(arrYear)) {
    newPolicy <- subset(TN_97_Lau_Cot, POLICY == as.character(arrPol[ij]) & as.numeric(arrYear[ik]))
    yy <- newPolicy[which.min(newPolicy$min_dist),]
  }
}
Output:
   YEAR DIVISION STATE COUNTY CROP POLICY STATE_ABB LRPP min_dist
1: 2016        8    41     97   21 699609        TN    0      2.6
Here is an image of the "TN_97_Lau_Cot" matrix.
No loops required. There could be an easier way to do it, but two set-based steps are better than two loops. These are the two ways I would try to do it:
base
# Perform an aggregate and merge it to your data.frame.
TN_97_Lau_Cot_Agg <- merge(
  x = TN_97_Lau_Cot,
  y = aggregate(min_dist ~ YEAR + POLICY, data = TN_97_Lau_Cot, min),
  by = c("YEAR","POLICY"),
  all.x = TRUE
)
# Subset the values that you want.
TN_97_Lau_Cot_Final <- unique(subset(TN_97_Lau_Cot_Agg, min_dist.x == min_dist.y))
data.table
library(data.table)
# Convert your data.frame to a data.table.
TN_97_Lau_Cot <- data.table(TN_97_Lau_Cot)
# Perform a "window" function that calculates the min value for each year without reducing the rows.
TN_97_Lau_Cot[, minDistAggregate:=min(min_dist), by = c("YEAR","POLICY")]
# Find the policy numbers that match the minimum distance for that year.
TN_97_Lau_Cot_Final <- unique(TN_97_Lau_Cot[min_dist==minDistAggregate, -10, with=FALSE])
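If the positional column drop (-10, with = FALSE) feels fragile, a sketch of the same step that removes the helper column by name instead:
# Keep the matching rows, then delete the helper column by reference.
TN_97_Lau_Cot_Final <- unique(
  TN_97_Lau_Cot[min_dist == minDistAggregate][, minDistAggregate := NULL]
)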

Converting relative observations into numerical values

This is my first project in R, after having just learned Java.
I have a (large) data set that I have imported from a CSV file into a data frame.
I have identified the two relevant columns for this question: the first has the name of the patient, and the second asks the patient the level of swelling.
The level of swelling is relative, i.e. better, worse or about the same.
Not all patients have the same number of observations.
I am having difficulty converting these relative values into numerical values that can be used as part of a greater analysis.
Below is pseudocode for what I think could be an appropriate solution:
for row in 'patientname'
    patientcounter = dtfr1[row, 'patientname']
    if dtfr1[row, 'patientname'] == patientcounter
        if dtfr1[row, 'Does.you.swelling.seem.better.or.worse'] == 'better'
            conditioncounter--
            dtfr1[row, 'Does.you.swelling.seem.better.or.worse'] = conditioncounter
        elseif dtfr1[row, 'Does.you.swelling.seem.better.or.worse'] == 'worse'
            conditioncounter++
            dtfr1[row, 'Does.you.swelling.seem.better.or.worse'] = conditioncounter
        else
            dtfr1[row, 'Does.you.swelling.seem.better.or.worse'] = conditioncounter
    if dtfr1[row, 'patientname'] != patientcounter
        patientcounter = dtfr1[row, 'patientname']
What would your advice be for a good solution to this problem? Thanks!
If I'm understanding correctly, you want the difference between the counts of "worse" and "better", by patient? If so, something like this would work:
# Simulated data
dtfr1 <- data.frame(patient = sample(letters[1:3], 100, replace=TRUE),
                    condition = sample(c("better", "worse"), 100, replace=TRUE))
head(dtfr1)
#   patient condition
# 1       a     worse
# 2       b    better
# 3       b     worse
# 4       a    better
# 5       c     worse
# 6       a    better
better_count <- tapply(dtfr1$condition, dtfr1$patient, function(x) sum(x == "better"))
worse_count <- tapply(dtfr1$condition, dtfr1$patient, function(x) sum(x == "worse"))
worse_count - better_count
#  a  b  c
#  5  0 -1
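If you want the running counter from your pseudocode instead (one value per observation rather than one per patient), a sketch using the same dtfr1: score each row +1 for "worse" and -1 for "better", then accumulate within patient.
# cumulative score within each patient, in row order
dtfr1$score <- ave(ifelse(dtfr1$condition == "worse", 1, -1),
                   dtfr1$patient, FUN = cumsum)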

Aggregate contiguous rows

I have a data.frame in R with a series of variables: userID (numeric), var1 (factor), var2 (factor), time (character), action (character).
The first 3 columns are the same for each user session, which can span many rows; however, time and action change from row to row.
I have tried aggregate to combine an entire session (contiguous rows) into a single row:
dat <- aggregate(cbind(time, action) ~ userID + var1 + var2,
                 data = log, FUN = paste, collapse = "|")
That would solve my problem if each user appeared in the file only once. However, that is not the case, and the line above aggregates all of a user's sessions into a single row.
How can I avoid that? How do I aggregate contiguous rows (sessions) only?
Like what MrFlick said, except create the user.session variable with:
rle <- with(log, rle(as.character(interaction(user, var1, var2))))
log$user.session <- rep(seq_along(rle$lengths), rle$lengths)
It just seems easier to understand for us mere mortals...
I would create a user.session ID such that each contiguous sequence of rows for a given user/var1/var2 combination is assigned a unique ID. First, a sample data set:
times <- c(5, 3, 1, 2, 7, 3)  # run lengths for each session (chosen to match the output below)
log <- data.frame(
  user = rep(c(1,2,3,1,3,4), times),
  var1 = factor(rep(letters[c(1,2,3,1,3,4)+7], times)),
  var2 = factor(rep(letters[c(1,2,3,1,3,4)+13], times)),
  time = "a",
  action = "b",
  stringsAsFactors = F
)
Now we add the user.session id
log$user.session <- with(log,
  ave(seq_len(nrow(log)), user, var1, var2, FUN = function(x) {
    cumsum(c(0, diff(x)) != 1)
  })
)
And now you can do
dat <- aggregate(cbind(time, action) ~ user + var1 + var2 + user.session,
                 data = log, FUN = paste, collapse = "|")
which gives
  user var1 var2 user.session          time        action
1    1    h    n            1     a|a|a|a|a     b|b|b|b|b
2    2    i    o            1         a|a|a         b|b|b
3    3    j    p            1             a             b
4    4    k    q            1         a|a|a         b|b|b
5    1    h    n            2           a|a           b|b
6    3    j    p            2 a|a|a|a|a|a|a b|b|b|b|b|b|b
So in bed I had the same realization, which is to add a user.session column. I like both of your solutions, but the second one is easier to understand, which is why I chose it. In any case, here is a third possibility:
log$user.session <- cumsum(c(TRUE, diff(log$userID)!=0))
Then, including this column in the aggregate function does the trick as well.
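Mirroring the earlier aggregate call (note that this uses the question's userID column, while the sample data above names it user), that would be:
dat <- aggregate(cbind(time, action) ~ userID + var1 + var2 + user.session,
                 data = log, FUN = paste, collapse = "|")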
Cheers.
