I am working with a for loop. Its goal is simply to test a condition and code the data appropriately. The loop has to iterate over 503,288 unique values and includes three if statements. Is there a way to speed up the for loop?
The code is as follows:
count <- 0
for (i in unique(Data$ID)) { # 503288 unique IDs
  # Subset the relevant data
  Loop_Before <- subset(Primary_Before, ID == i)
  Loop_After  <- subset(Primary_After, ID == i)
  if (nrow(Loop_Before) >= 1 & nrow(Loop_After) >= 1) {
    Data$Status[Data$ID == i] <- "Both Elections"
  }
  if (nrow(Loop_Before) >= 1 & nrow(Loop_After) == 0) {
    Data$Status[Data$ID == i] <- "Only Primary Election"
  }
  if (nrow(Loop_Before) >= 0 & nrow(Loop_After) == 1) {
    Data$Status[Data$ID == i] <- "Only General Election"
  }
  message(count <- count + 1)
}
table(Data$Status)
Thank you for your help!
Avoid the for loop entirely. I don't know your dataset, but the following should be 10 or even 100 times faster:
library(tidyverse) # load some packages that will help
# let's create some sample data
Data <- data.frame(ID = c(1,1,1,1,2,2,2,3,3))
Primary_before <- data.frame(ID = c(0,1,2,2,3,3,3))
Primary_after <- data.frame(ID = c(1,3))
# now for every ID we count the nr of rows in both dataframes
summarised_before <- Primary_before %>%
  group_by(ID) %>%
  summarise(nrRows = n())
     ID nrRows
  <dbl>  <int>
1     0      1
2     1      1
3     2      2
4     3      3
summarised_after <- Primary_after %>%
  group_by(ID) %>%
  summarise(nrRows = n())
     ID nrRows
  <dbl>  <int>
1     1      1
2     3      1
# now we join them together
summarised_both <- summarised_after %>%
  full_join(summarised_before, by = "ID", suffix = c("_after", "_before"))
# and now we do the final calculation
summarised_both %>%
  # IDs missing from either table come out of the full_join as NA; treat them as 0 rows
  mutate(nrRows_after  = replace_na(nrRows_after, 0),
         nrRows_before = replace_na(nrRows_before, 0)) %>%
  mutate(Status = case_when(
    nrRows_before >= 1 & nrRows_after >= 1 ~ "Both elections",
    nrRows_before >= 1 & nrRows_after == 0 ~ "Only primary election",
    nrRows_before >= 0 & nrRows_after == 1 ~ "Only general election")) %>%
  filter(ID %in% Data$ID)
I saved the intermediate results, but you could also do it in one go, like this:
Primary_before %>%
  group_by(ID) %>%
  summarise(nrRows = n()) %>%
  full_join(Primary_after %>%
              group_by(ID) %>%
              summarise(nrRows = n()),
            by = "ID",
            suffix = c("_before", "_after")) %>% # the left table is "before" here
  mutate(nrRows_after  = replace_na(nrRows_after, 0),
         nrRows_before = replace_na(nrRows_before, 0)) %>%
  mutate(Status = case_when(
    nrRows_before >= 1 & nrRows_after >= 1 ~ "Both elections",
    nrRows_before >= 1 & nrRows_after == 0 ~ "Only primary election",
    nrRows_before >= 0 & nrRows_after == 1 ~ "Only general election")) %>%
  filter(ID %in% Data$ID)
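For completeness, membership tests with %in% are also fully vectorized in base R, so the whole classification can be done without joins. A minimal sketch, assuming the intent is simply whether each ID appears in each table (using the question's Data, Primary_Before and Primary_After):
# Vectorized membership tests over all of Data$ID at once: no loop, no join
in_before <- Data$ID %in% Primary_Before$ID
in_after  <- Data$ID %in% Primary_After$ID
Data$Status[in_before & in_after]  <- "Both Elections"
Data$Status[in_before & !in_after] <- "Only Primary Election"
Data$Status[!in_before & in_after] <- "Only General Election"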
I have the following decision rules:
RELIABILITY LEVEL DESCRIPTION
LEVEL I Multiple regression
LEVEL II Multiple regression + mechanisms specified (all interest variables)
LEVEL III Multiple regression + mechanisms specified (all interest + control vars)
The first three columns are the data from which the fourth column should be reproduced using dplyr.
The reliability level should be the same for the whole table (model); I want to code it using dplyr.
Here is my try so far. As you can see, I can't get it to be the same for the whole model:
library(tidyverse)
library(readxl)
library(effectsize)
df <- read_excel("https://github.com/timverlaan/relia/blob/59d2cbc5d7830c41542c5f65449d5f324d6013ad/relia.xlsx")
df1 <- df %>%
  group_by(study, table, function_var) %>%
  mutate(count_vars = n()) %>%
  ungroup() %>%
  group_by(study, table, function_var, mechanism_described) %>%
  mutate(count_int = case_when(
    function_var == 'interest' & mechanism_described == 'yes' ~ n())) %>%
  mutate(count_con = case_when(
    function_var == 'control' & mechanism_described == 'yes' ~ n())) %>%
  mutate(reliable_int = case_when(
    function_var == 'interest' & count_vars / count_int == 1 ~ 1)) %>%
  mutate(reliable_con = case_when(
    function_var == 'control' & count_vars / count_con == 1 ~ 1)) %>%
  # group_by(study, source) %>%
  mutate(reliable = case_when(
    reliable_int != 1 ~ 1,
    reliable_int == 1 ~ 2,
    reliable_int + reliable_con == 2 ~ 3))
  # ungroup()
The code I settled on is:
library(tidyverse)
library(readxl)
df <- read_excel("C:/Users/relia.xlxs")
df <- df %>% select(-reliability_score)
test <- df %>%
  group_by(study, model, function_var) %>%
  summarise(count_yes = sum(mechanism_described == "yes"),
            n = n(),
            frac = count_yes / n) %>%
  mutate(frac_control  = frac[function_var == "control"],
         frac_interest = frac[function_var == "interest"]) %>%
  mutate(reliability = case_when(
    frac_control == 1 & frac_interest != 1 ~ -99,
    frac_control != 1 & frac_interest != 1 ~ 2,
    frac_interest == 1 & frac_control != 1 ~ 3,
    frac_interest == 1 & frac_control == 1 ~ 4)) %>%
  group_by(study, model) %>%
  summarise(reliability = mean(reliability))
df_reliability <- left_join(df, test)
View(df_reliability)
However, I would prefer to do this all within one dplyr pipe. If anyone has a solution I would love to hear it...
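For what it's worth, one way to fold everything into a single pipe is to nest the per-model summary inside the left_join() call. A sketch, assuming df as freshly read from the Excel file (df is still referenced twice, but it is one statement):
df_reliability <- df %>%
  select(-reliability_score) %>%
  left_join(df %>%
              group_by(study, model, function_var) %>%
              summarise(count_yes = sum(mechanism_described == "yes"),
                        n = n(),
                        frac = count_yes / n) %>%
              mutate(frac_control  = frac[function_var == "control"],
                     frac_interest = frac[function_var == "interest"]) %>%
              mutate(reliability = case_when(
                frac_control == 1 & frac_interest != 1 ~ -99,
                frac_control != 1 & frac_interest != 1 ~ 2,
                frac_interest == 1 & frac_control != 1 ~ 3,
                frac_interest == 1 & frac_control == 1 ~ 4)) %>%
              # one reliability value per study/model
              group_by(study, model) %>%
              summarise(reliability = mean(reliability)),
            by = c("study", "model"))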
The function works fine on a data frame containing 1,000 to 20,000 cases, but anything larger takes hours (5+ hours), and right now I have a data frame that is 57,635,985 observations long.
Suppose a df like this:
d<-structure(list(ReviewType= c("Review","Review","Review","Correction","Correction","Review","Review","Review","Review","Review","Correction","Correction","Deficiency","Correction","Correction",
"Correction", "Deficiency", "Deficiency", "Correction","Correction","Deficiency","Correction"),
Submissiondate= c("2020-08-29 04:32:00","2020-08-28 04:31:00","2020-08-26 04:31:00","2020-08-25 04:31:00","2020-08-24 04:31:00","2020-08-23 04:31:00","2020-08-22 04:31:00","2020-08-21 04:31:00","2020-08-20 04:31:00","2020-08-19 04:31:00",
"2020-09-27 04:31:00","2020-09-27 03:52:59","2020-09-28 17:30:00","2020-09-29 14:01:00",
"2020-09-05 03:00:00","2020-09-05 03:51:00", "2020-09-03 23:59:49",
"2020-09-02 00:03:54","2020-09-01 00:04:48","2020-10-01 04:31:00","2020-10-11 04:31:00","2020-10-21 04:31:00"),
CaseNo= c("124","123","125","121","121","125","123","123","123","123","123","123","123","125","123","123","123","124","123","127","127","127")), class = "data.frame", row.names = c(NA, -22L))
library(dplyr)
library(lubridate) # for wday()
d <- d %>% arrange(CaseNo, Submissiondate)
The code below tries to do the following: for each case number, as long as the status stayed at Correction at a week's end, the case is counted in the stats for all subsequent weeks, until the status changes to anything else:
d <- d %>% mutate(Submissiondate = as.Date(Submissiondate),
                  weekday = wday(Submissiondate),
                  week.end = Submissiondate + 7 - weekday)
#1 End state for each case and week
EndStates <- d %>%
  group_by(CaseNo, week.end) %>%
  summarize(WeekEndState = last(ReviewType)) %>%
  ungroup()
#2 Get unique values of Case/Week.end/ReviewType
chk <- d %>%
  select(CaseNo, week.end, ReviewType) %>%
  distinct()
#3 Add the EndStates and count if the week had a correction AND
# the week ended as a Correction
chk <- left_join(chk, EndStates, by = c("CaseNo", "week.end"))
cor_df <- as.data.frame(matrix(ncol = length(unique(chk$CaseNo)) + 1,
                               nrow = length(unique(chk$week.end))))
names(cor_df) <- c("week.end", unique(chk$CaseNo))
cor_df$week.end <- unique(chk$week.end)
for (i in 1:nrow(cor_df)) {
  for (j in 2:ncol(cor_df)) {
    this_CaseNo_idx <- chk$CaseNo == strtoi(names(cor_df)[j])
    idx <- this_CaseNo_idx &
      chk$week.end == max(chk$week.end[chk$week.end <= cor_df$week.end[i] &
                                         this_CaseNo_idx])
    if (sum(idx) < 1) {
      cor_df[i, j] <- 0
    } else {
      cor_df[i, j] <- max(ifelse(chk$WeekEndState[idx] == "Correction" &
                                   chk$ReviewType[idx] == "Correction",
                                 1, 0))
    }
  }
}
cor_df$asw <- rowSums(cor_df[,2:ncol(cor_df)])
cor_df <- cor_df[order(cor_df$week.end),]
data.frame(week.end=cor_df$week.end, cor_df$asw)
Expected output:
week.end cor_df.asw
1 2020-08-22 0
2 2020-08-29 1
3 2020-09-05 2
4 2020-10-03 3
5 2020-10-17 2
6 2020-10-24 3
Any guidance appreciated.
I can get you there starting with the EndStates data frame. I'm not sure whether it will be much faster. Because dplyr operates on whole columns at once (rather than sequentially down a column), I still needed a while() loop to fill in the missing weeks. Perhaps a better dplyr person will provide an alternative.
library(dplyr)
library(tidyr)
cor_df2 <- EndStates %>%
  mutate(count = as.numeric(WeekEndState == "Correction")) %>%
  select(-WeekEndState) %>%
  pivot_wider(id_cols = "week.end", names_from = "CaseNo", values_from = "count") %>%
  arrange(week.end) %>%
  mutate(across(-week.end,
                function(x) case_when(is.na(x) & week.end == min(week.end) ~ 0,
                                      TRUE ~ x)))
while (any(is.na(cor_df2))) {
  cor_df2 <- cor_df2 %>%
    mutate(across(-week.end,
                  function(x) case_when(is.na(x) ~ lag(x), TRUE ~ x)))
}
cor_df2 <- cor_df2 %>%
  mutate(asw = rowSums(.[-1])) %>%
  select(week.end, asw)
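As an aside, the while() loop is doing a last-observation-carried-forward fill, which tidyr::fill() can do in a single pass. A possible replacement (a sketch, assuming cor_df2 as built above, before the rowSums step):
# Carry the last non-NA value down each case column in one pass
cor_df2 <- cor_df2 %>% fill(-week.end, .direction = "down")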
I have a data frame:
temp = as.data.frame(matrix(c(1,2,2,-3,1,1,2,3,-2,0,2,1,-5,1,1,2,1,3,0,0,3,2,4,-1,1,3,2,2,0,1,3,2,4,3,0), ncol=5,byrow = TRUE))
colnames(temp) = c("ID","srch","utility","reutility","code")
I need to group by the "ID" column. For any "ID" value, the "srch" column stays the same. For each group, if srch > 1 I need min(utility where code == 1) - max(reutility where code == 0); otherwise (i.e. if srch <= 1) I need it to be 0.
This is the output I need:
temp = as.data.frame(matrix(c(1,4,2,0,3,-1), ncol=2,byrow = TRUE))
colnames(temp)=c("ID","Val")
Any code using dplyr is great but others are welcome too.
Assuming the output for ID = 2 is 0, we can do:
library(dplyr)
temp %>%
  group_by(ID) %>%
  summarise(Val = if (first(srch) > 1)
                    min(utility[code == 1]) - max(reutility[code == 0])
                  else 0)
# A tibble: 3 x 2
# ID Val
# <dbl> <dbl>
#1 1 4
#2 2 0
#3 3 -1
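Since non-dplyr code is welcome too, here is a base-R sketch of the same logic using by() (assuming temp as defined above):
# Split temp by ID, apply the rule per group, then bind the rows back together
res <- do.call(rbind, by(temp, temp$ID, function(g) {
  data.frame(ID  = g$ID[1],
             Val = if (g$srch[1] > 1)
                     min(g$utility[g$code == 1]) - max(g$reutility[g$code == 0])
                   else 0)
}))
res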
I have a df:
a<-c(5,1,5,3,5,3,5,1)
b<-c(1,5,1,5,1,5,3,5)
df<-as.data.frame(rbind(a,b))
names(df)<-c('pre1','post1','pre2','post2','pre3','post3','pre4','post4')
And I have two groups of samples within the columns, e.g. 'pre' and 'post':
pre<-seq(1,8,by=2)
post<-seq(2,8,by=2)
I would like to apply a conditional that keeps a row if 100% of the pre and at least 50% of the post columns pass, OR at least 50% of the pre and 100% of the post columns pass, e.g.
if 100% of 'pre' are 3 or over AND at least 50% of 'post' are 3 or over, keep the row,
OR
if at least 50% of 'pre' are 3 or over AND 100% of 'post' are 3 or over, keep the row.
So in the example df only row 'a' would stay.
I have:
test<- ((df[apply(df[pre],1,function(x) sum(x>=3)/length(x)),] &
df[apply(df[post],1,function(x) sum(x>3)/length(x))>=0.5,]) |
(df[apply(df[pre],1,function(x) sum(x>3)/length(x))>=0.5,] &
df[apply(df[post],1,function(x) sum(x>3)/length(x)),]))
But I get a vector of 'TRUEs' which isn't what I want.
We can create a logical vector for subsetting using rowSums:
df[(rowSums(df[pre] >= 3)/length(pre) == 1) &
(rowSums(df[post] >= 3)/length(post) >= 0.5) |
(rowSums(df[post] >= 3)/length(post) == 1) &
(rowSums(df[pre] >= 3)/length(pre) >= 0.5), ]
# pre1 post1 pre2 post2 pre3 post3 pre4 post4
#a 5 1 5 3 5 3 5 1
Using apply, we can do:
df[apply(df[pre] >= 3, 1, all) & apply(df[post] >= 3, 1, sum)/length(post) >= 0.5 |
apply(df[post] >= 3, 1, all) & apply(df[pre] >= 3, 1, sum)/length(pre) >= 0.5, ]
Here's a much less concise tidyverse solution that could probably be shortened substantially.
library(tidyverse)
pass_val = 3
df %>%
rownames_to_column() %>%
gather(col, val, -rowname) %>%
separate("col", c("type", "num"), sep = -1) %>%
count(rowname, type, pass = val >= pass_val) %>%
spread(pass, n, fill = 0) %>%
transmute(rowname, type, pass_pct = `TRUE`/(`TRUE` + `FALSE`)) %>%
spread(type, pass_pct) %>%
filter(post == 1 & pre >= 0.5 | post >= 0.5 & pre == 1)
Here is one option with tidyverse
library(tidyverse)
library(rap)
crossing(val = c(0.5, 1), cols = c("pre", "post")) %>%
rap(x = ~ df %>%
select(matches(cols)) %>%
{rowMeans(. >=3) >= val}) %>%
group_by(val) %>%
transmute(ind = reduce(x, `&`)) %>%
filter(any(ind)) %>%
pull(ind) %>%
filter(df, .)
# pre1 post1 pre2 post2 pre3 post3 pre4 post4
#1 5 1 5 3 5 3 5 1
Here's a base R solution that splits by row name, checks the conditions using sapply, and uses the output as a logical index on df:
df[sapply(split(df, rownames(df)), function(x) {
(sum(x[pre] > 2)/ncol(x[pre]) >= .5) & (sum(x[post] > 2)/ncol(x[post]) == 1) ||
(sum(x[pre] > 2)/ncol(x[pre]) == 1) & (sum(x[post] > 2)/ncol(x[post]) >= .5)
}),]
#### OUTPUT ####
pre1 post1 pre2 post2 pre3 post3 pre4 post4
a 5 1 5 3 5 3 5 1
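With dplyr 1.1.0 or later, pick() allows the same row-wise pass-rate check inside a single filter() call; a sketch under that version assumption:
library(dplyr)
# rowMeans() of a logical matrix gives the per-row pass rate for each column group;
# pick() (dplyr >= 1.1.0) selects the pre/post columns inside filter()
df %>%
  filter((rowMeans(pick(starts_with("pre")) >= 3) == 1 &
            rowMeans(pick(starts_with("post")) >= 3) >= 0.5) |
           (rowMeans(pick(starts_with("pre")) >= 3) >= 0.5 &
              rowMeans(pick(starts_with("post")) >= 3) == 1))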
I've got several sequential comparative evaluations to conduct with two variables in R in order to check for concordance.
In this example, say I have a boolean ANES_6 and a numeric ANES. The boolean is 1 if the patient had anesthesia for more than 6 hours and 0 otherwise. The numeric value is the time the patient was under anesthesia.
I'm looking to write a function which can replace multiple copy-pastes of the following:
data %>% select(ANES_6, ANES) %>%
  filter(ANES_6 == 1 & ANES < 6) %>%
  tally()
data %>% select(ANES_6, ANES) %>%
  filter(ANES_6 == 0 & ANES >= 6) %>%
  tally()
data %>% select(ANES_6, ANES) %>%
  filter(ANES_6 == 1 & ANES >= 6) %>%
  tally()
data %>% select(ANES_6, ANES) %>%
  filter(ANES_6 == 0 & ANES < 6) %>%
  tally()
I could create the following function (non-exhaustive of all cases shown above):
my_func <- function(x, y) {
  if (x == "gt" & y == 1) {
    data %>% select(ANES_6, ANES) %>%
      filter(ANES >= 6 & ANES_6 == 1) %>%
      tally()
  } else if (x == "lt" & y == 0) {
    data %>% select(ANES_6, ANES) %>%
      filter(ANES < 6 & ANES_6 != 1) %>%
      tally()
  }
}
which takes x and y as input, with values for x being c('lt', 'gt') and y being c(0, 1), in order to evaluate all possible conditions. However, this would entail writing more code, not less.
Is there a way to input logical comparisons in the function such that the following works:
my_func <- function(x, y) {
  data %>% select(ANES_6, ANES) %>%
    filter(ANES x 6 & ANES_6 == y)
}
with x replaced by >=, <, etc., in the input of the function. Currently this does not work; are there any workarounds?
Try grouping. The question should normally include reproducible test data but I have provided it this time.
library(dplyr)
data <- data.frame(ANES_6 = c(0, 0, 1, 1), ANES = 5:6) # test data
data %>%
  group_by(ANES_6, ANES >= 6) %>%
  tally %>%
  ungroup
giving:
# A tibble: 4 x 3
ANES_6 `ANES >= 6` n
<dbl> <lgl> <int>
1 0. FALSE 1
2 0. TRUE 1
3 1. FALSE 1
4 1. TRUE 1
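As for the workaround asked about: comparison operators are ordinary functions in R, so they can be passed in as arguments. A sketch, assuming a data frame data with columns ANES and ANES_6 as above:
library(dplyr)
# op is any binary comparison function, e.g. `>=` or `<`
my_func <- function(op, y) {
  data %>%
    filter(op(ANES, 6) & ANES_6 == y) %>%
    tally()
}
my_func(`>=`, 1) # rows with ANES >= 6 and ANES_6 == 1
my_func(`<`, 0)  # rows with ANES < 6 and ANES_6 == 0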