I have several sequential comparisons to run on two variables in R in order to check that they agree.
In this example, say I have a boolean ANES_6 and a numeric ANES. The boolean is 1 if the patient was under anesthesia for more than 6 hours and 0 otherwise. The numeric value is the time the patient was under anesthesia.
I'm looking to write a function that can replace multiple copy-pastes of the following:
data %>% select(ANES_6, ANES) %>%
  filter(ANES_6 == 1 & ANES < 6) %>%
  tally()
data %>% select(ANES_6, ANES) %>%
  filter(ANES_6 == 0 & ANES >= 6) %>%
  tally()
data %>% select(ANES_6, ANES) %>%
  filter(ANES_6 == 1 & ANES >= 6) %>%
  tally()
data %>% select(ANES_6, ANES) %>%
  filter(ANES_6 == 0 & ANES < 6) %>%
  tally()
I could create the following function (non-exhaustive of all cases shown above):
my_func <- function(x, y) {
  if (x == "gt" && y == 1) {
    data %>% select(ANES_6, ANES) %>%
      filter(ANES >= 6 & ANES_6 == 1) %>%
      tally()
  } else if (x == "lt" && y == 0) {
    data %>% select(ANES_6, ANES) %>%
      filter(ANES < 6 & ANES_6 != 1) %>%
      tally()
  }
}
which takes x and y as input, with values for x being c('lt', 'gt') and y being c(0, 1), in order to evaluate all possible conditions. However, this would entail writing more code, not less.
Is there a way to input logical comparisons in the function such that the following works:
my_func <- function(x, y) {
data %>% select(ANES_6, ANES) %>%
filter(ANES x 6 & ANES_6 == y)
}
with x replaced by >=, <, etc., in the input of the function? Currently this does not work. Are there any workarounds?
Try grouping. The question should normally include reproducible test data but I have provided it this time.
library(dplyr)
data <- data.frame(ANES_6 = c(0, 0, 1, 1), ANES = 5:6) # test data
data %>%
group_by(ANES_6, ANES >= 6) %>%
tally %>%
ungroup
giving:
# A tibble: 4 x 3
ANES_6 `ANES >= 6` n
<dbl> <lgl> <int>
1 0. FALSE 1
2 0. TRUE 1
3 1. FALSE 1
4 1. TRUE 1
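To answer the original question more directly: comparison operators in R are ordinary functions, so one can be passed in by name and looked up with match.fun(). Below is a minimal sketch, assuming the same test data as above. The function name count_cases and its cutoff argument are made up for illustration.

```r
library(dplyr)

# Test data (ANES written out in full rather than recycled)
data <- data.frame(ANES_6 = c(0, 0, 1, 1), ANES = c(5, 6, 5, 6))

# Hypothetical helper: `op` is a comparison operator given as a string
count_cases <- function(data, op, flag, cutoff = 6) {
  cmp <- match.fun(op)  # e.g. ">=" is looked up as the function `>=`
  data %>%
    filter(cmp(ANES, cutoff) & ANES_6 == flag) %>%
    tally()
}

count_cases(data, "<", 1)   # flagged as long, but under 6 hours
count_cases(data, ">=", 0)  # flagged as short, but 6 hours or more
```

Passing the operator as a string keeps the call sites short; passing the function itself (for example `>=` in backticks) should work just as well, since match.fun() accepts both.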
I have the following decision rules:
RELIABILITY LEVEL   DESCRIPTION
LEVEL I             Multiple regression
LEVEL II            Multiple regression + mechanisms specified (all interest variables)
LEVEL III           Multiple regression + mechanisms specified (all interest + control vars)
The first three columns are the data upon which the 4th column should be reproduced using dplyr.
The reliability level should be the same for the whole table (model)... I want to code it using dplyr.
Here is my try so far... As you can see, I can't get it to be the same for the whole model
library(tidyverse)
library(readxl)
library(effectsize)
df <- read_excel("https://github.com/timverlaan/relia/blob/59d2cbc5d7830c41542c5f65449d5f324d6013ad/relia.xlsx")
df1 <- df %>%
group_by(study, table, function_var) %>%
mutate(count_vars = n()) %>%
ungroup %>%
group_by(study, table, function_var, mechanism_described) %>%
mutate(count_int = case_when(
function_var == 'interest' & mechanism_described == 'yes' ~ n()
)) %>%
mutate(count_con = case_when(
function_var == 'control' & mechanism_described == 'yes' ~ n()
)) %>%
mutate(reliable_int = case_when(
function_var == 'interest' & count_vars/count_int == 1 ~ 1)) %>%
mutate(reliable_con = case_when(
function_var == 'control' & count_vars/count_con == 1 ~ 1)) %>%
# group_by(study, source) %>%
mutate(reliable = case_when(
reliable_int != 1 ~ 1,
reliable_int == 1 ~ 2,
reliable_int + reliable_con == 2 ~ 3)) %>%
ungroup()
The code settled on is:
library(tidyverse)
library(readxl)
df <- read_excel("C:/Users/relia.xlsx")
df <- df %>% select(-reliability_score)
# Fraction of variables with a described mechanism, per study/model/function_var
test <- df %>%
  group_by(study, model, function_var) %>%
  summarise(count_yes = sum(mechanism_described == "yes"), n = n(), frac = count_yes / n) %>%
  mutate(frac_control = frac[function_var == "control"],
         frac_interest = frac[function_var == "interest"]) %>%
  mutate(reliability = case_when(
    frac_control == 1 & frac_interest != 1 ~ -99,
    frac_control != 1 & frac_interest != 1 ~ 2,
    frac_interest == 1 & frac_control != 1 ~ 3,
    frac_interest == 1 & frac_control == 1 ~ 4)) %>%
  group_by(study, model) %>%
  summarise(reliability = mean(reliability))
# Attach the per-model score back to the original rows
df_reliability <- left_join(df, test)
View(df_reliability)
However, I would prefer to do this all within one dplyr pipe. If anyone has a solution I would love to hear it...
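One way this might be collapsed into a single pipe is to compute the per-model flags inside a grouped mutate(), so that no separate summary table or join is needed. This is only a sketch: the toy data frame below is invented, the column names are taken from the code above, and the 1/2/3 coding follows Levels I to III of the decision rules rather than the scores used in the code above.

```r
library(dplyr)

# Invented toy data; column names follow the code above
df <- data.frame(
  study = "s1",
  model = c("m1", "m1", "m1", "m2", "m2", "m3", "m3"),
  function_var = c("interest", "interest", "control",
                   "interest", "control", "interest", "control"),
  mechanism_described = c("yes", "yes", "no", "yes", "yes", "no", "yes")
)

df_reliability <- df %>%
  group_by(study, model) %>%
  mutate(
    # TRUE when every variable of that type has a described mechanism.
    # Note: all() on an empty set is TRUE, so a model with no control
    # variables counts as "all controls described".
    all_interest = all(mechanism_described[function_var == "interest"] == "yes"),
    all_control  = all(mechanism_described[function_var == "control"] == "yes"),
    reliability = case_when(
      all_interest & all_control ~ 3,  # Level III
      all_interest               ~ 2,  # Level II
      TRUE                       ~ 1   # Level I
    )
  ) %>%
  ungroup()

df_reliability
```

Because the flags are computed per (study, model) group, every row of a model receives the same reliability value.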
I plan to filter data on multiple columns with multiple conditions in one line to reduce the time it takes to run the code. This is the sample data that I used to test my code. Basically, I want to remove any rows containing 0, 1, 2, or NA.
test <- data.frame(A = c(1,0,2,3,4,0,5,6,0,7,0,8,0,9,NA),
B = c(0,1,0,2,3,4,0,5,0,7,8,0,NA,9,0),
C = c(1,2,3,0,0,4,5,6,0,7,0,8,NA,0,9))
I used the following code to clean my data. Although it does the job, the code is very tedious and takes me quite a while when I run it with a large database.
test %>% filter(!is.na(A)) %>%
filter(!is.na(B)) %>%
filter(!is.na(C)) %>%
filter(A != 0) %>%
filter(A != 1) %>%
filter(A != 2) %>%
filter(B != 0) %>%
filter(B != 1) %>%
filter(B != 2) %>%
filter(C != 0) %>%
filter(C != 1) %>%
filter(C != 2)
A B C
1 6 5 6
2 7 7 7
I tried to shorten the code using filter, filter_at, and any_vars, but it did not work. Below are my attempts to deal with this problem (none of them worked, because they did not delete the rows containing 0, 1, 2, or NA).
df_total <- test %>%
filter_at(vars(A, B, C), any_vars(!is.na(.))) %>%
filter_at(vars(A, B, C), any_vars(. != 2)) %>%
filter_at(vars(A, B, C), any_vars(. != 1)) %>%
filter_at(vars(A, B, C), any_vars(. != 0))
df_total <- test %>%
filter_at(vars(A, B, C), any_vars(!is.na(.) | . != 2 | . != 1 | . != 0))
df_total <- test %>%
filter(!is.na(A) | A!= 2 | A!= 1 | A!= 0) %>%
filter(!is.na(B) | B!= 2 | B!= 1 | B!= 0) %>%
filter(!is.na(C) | C!= 2 | C!= 1 | C!= 0)
I cannot figure out what I did incorrectly here. I went back and forth between the documentation and R to solve this problem, without success. Could you please tell me what is wrong in my code? How can I write a filter for multiple columns with multiple conditions in just one line? The point of one line is to speed up the running time in R. Any advice, suggestions, or resources would be appreciated! Thank you.
Another possible solution:
library(dplyr)
test %>%
filter(complete.cases(.) & if_all(everything(), ~ !(.x %in% 0:2)))
#> A B C
#> 1 6 5 6
#> 2 7 7 7
test %>%
filter(if_all(c(A, B, C), function(x) !is.na(x) & !x %in% c(0, 1, 2)))
# A B C
# 6 5 6
# 7 7 7
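As for why the attempts in the question keep too many rows: any_vars() and | keep a row as soon as one column (or one condition) passes, whereas the goal needs every column to pass. In addition, a test such as x != 2 | x != 1 is TRUE for every non-missing value, because no value equals both. The sketch below contrasts if_any() with if_all() on the sample data; the names bad, kept_any and kept_all are only illustrative.

```r
library(dplyr)

test <- data.frame(A = c(1,0,2,3,4,0,5,6,0,7,0,8,0,9,NA),
                   B = c(0,1,0,2,3,4,0,5,0,7,8,0,NA,9,0),
                   C = c(1,2,3,0,0,4,5,6,0,7,0,8,NA,0,9))

bad <- c(0, 1, 2)  # values to exclude (NA is handled separately)

# "any" semantics: a row survives if at least one column is clean
kept_any <- test %>%
  filter(if_any(c(A, B, C), ~ !is.na(.x) & !(.x %in% bad)))

# "all" semantics: a row survives only if every column is clean
kept_all <- test %>%
  filter(if_all(c(A, B, C), ~ !is.na(.x) & !(.x %in% bad)))

nrow(kept_any)  # more rows than wanted
kept_all        # the two rows from the expected output
```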
I have data in the following format:
DATE GROUP EVENT ELIGIBLE
2021-3-9 A 1 1
2021-3-1 A 0 0
2021-3-1 B 0 1
2021-2-20 B 1 1
I would like to group the data by the GROUP column and then add three new columns that calculate, by group, the sum of (EVENT / ELIGIBLE) for the following time frames: the last 3 months, 3 months back to 6 months back, and the last year.
I have calculated the overall percentage without separate timeframes by doing the following:
grouped <- data %>%
filter(ELIGIBLE == 1 ) %>%
group_by(GROUP) %>%
mutate(count_Eligible = sum(ELIGIBLE == 1 )) %>%
mutate(count_events = sum(EVENT == 1 )) %>%
mutate(Percentage = round(100*count_events/count_Eligible,2))
I am wondering what the cleanest way would be to add the three different percentages within the timeframes. So far I have pulled the dates to do the filtering with the following code:
library(lubridate)  # months(3) etc. below are lubridate periods
today <- Sys.Date()
three_month_lookback <- today - months(3)
six_month_lookback <- today - months(6)
one_year_lookback <- today - months(12)
We can create a function to do the calculation:
library(dplyr)
library(purrr)
library(lubridate)  # for today()
f1 <- function(data) {
  data %>%
    filter(ELIGIBLE == 1) %>%
    group_by(GROUP) %>%
    # summarise() returns one row per GROUP, which keeps the join by GROUP below one-to-one
    summarise(count_Eligible = sum(ELIGIBLE == 1),
              count_events = sum(EVENT == 1),
              Percentage = round(100 * count_events / count_Eligible, 2))
}
Then, loop over the 'lookback' periods, subset the data based on the 'DATE' column and apply the function
map2_dfr(list(three_month_lookback, six_month_lookback,
one_year_lookback), list(today(), three_month_lookback, today()),
~ data %>%
mutate(DATE = as.Date(DATE)) %>%
filter(DATE >= .x, DATE <= .y) %>%
f1(.), .id = 'grp'
)
If we need to combine by columns
map2(list(three_month_lookback, six_month_lookback,
one_year_lookback), list(today(), three_month_lookback, today()),
~ data %>%
mutate(DATE = as.Date(DATE)) %>%
filter(DATE >= .x, DATE <= .y) %>%
f1(.)
) %>%
reduce(full_join, by = "GROUP")
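As an alternative sketch, the three percentages can also be computed in a single summarise() by restricting each sum to a date window, which avoids the joins. The code below assumes the sample data from the question and a fixed reference date so the result is reproducible. The helper names back() and pct() and the output column names are made up for illustration.

```r
library(dplyr)

data <- data.frame(
  DATE = as.Date(c("2021-03-09", "2021-03-01", "2021-03-01", "2021-02-20")),
  GROUP = c("A", "A", "B", "B"),
  EVENT = c(1, 0, 0, 1),
  ELIGIBLE = c(1, 0, 1, 1)
)

ref <- as.Date("2021-03-10")  # fixed stand-in for Sys.Date()

# Hypothetical helper: the date n calendar months before `ref` (base R only)
back <- function(n) seq(ref, by = paste0("-", n, " months"), length.out = 2)[2]
m3 <- back(3); m6 <- back(6); m12 <- back(12)

# Hypothetical helper: percentage of events among the rows inside a window.
# An empty window gives 0/0, i.e. NaN.
pct <- function(event, in_window) {
  round(100 * sum(event[in_window] == 1) / sum(in_window), 2)
}

res <- data %>%
  filter(ELIGIBLE == 1) %>%
  group_by(GROUP) %>%
  summarise(
    pct_last_3m = pct(EVENT, DATE >= m3 & DATE <= ref),
    pct_3_to_6m = pct(EVENT, DATE >= m6 & DATE < m3),
    pct_last_1y = pct(EVENT, DATE >= m12 & DATE <= ref),
    .groups = "drop"
  )

res
```

With the four sample rows, nothing falls in the 3-to-6-month window, so that column comes out as NaN; whether to report NaN, NA or 0 there is a choice left to the reader.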
The function works fine on a df containing 1,000 to 20,000 cases, but anything larger than that takes hours (5 hours plus), and right now I have a df that is 57,635,985 observations long.
Suppose a df like this:
d<-structure(list(ReviewType= c("Review","Review","Review","Correction","Correction","Review","Review","Review","Review","Review","Correction","Correction","Deficiency","Correction","Correction",
"Correction", "Deficiency", "Deficiency", "Correction","Correction","Deficiency","Correction"),
Submissiondate= c("2020-08-29 04:32:00","2020-08-28 04:31:00","2020-08-26 04:31:00","2020-08-25 04:31:00","2020-08-24 04:31:00","2020-08-23 04:31:00","2020-08-22 04:31:00","2020-08-21 04:31:00","2020-08-20 04:31:00","2020-08-19 04:31:00",
"2020-09-27 04:31:00","2020-09-27 03:52:59","2020-09-28 17:30:00","2020-09-29 14:01:00",
"2020-09-05 03:00:00","2020-09-05 03:51:00", "2020-09-03 23:59:49",
"2020-09-02 00:03:54","2020-09-01 00:04:48","2020-10-01 04:31:00","2020-10-11 04:31:00","2020-10-21 04:31:00"),
CaseNo= c("124","123","125","121","121","125","123","123","123","123","123","123","123","125","123","123","123","124","123","127","127","127")), class = "data.frame", row.names = c(NA, -22L))
library(dplyr)
library(lubridate)  # provides wday(), used below
d <- d %>% arrange(CaseNo, Submissiondate)
The code below checks, for each case number, whether the status was Correction at the end of each week. If so, the case is counted in the stats for all subsequent weeks until the status changes to anything else:
d <- d %>% mutate(Submissiondate = as.Date(Submissiondate),
weekday = wday(Submissiondate),
week.end = Submissiondate + 7 - weekday)
#1 End state for each case and week
EndStates <- d %>%
group_by(CaseNo, week.end) %>%
summarize(WeekEndState = last(ReviewType)) %>% ungroup()
#2 Get unique values of Case/Week.end/ReviewType
chk <- d %>%
select(CaseNo, week.end, ReviewType) %>%
distinct()
#3 Add the EndStates and count if the week had a correction AND
# the week ended as a Correction
chk <- left_join(chk, EndStates, by = c("CaseNo", "week.end"))
cor_df <- as.data.frame(matrix(ncol=length(unique(chk$CaseNo))+1,
nrow=length(unique(chk$week.end))))
names(cor_df) <- c("week.end", unique(chk$CaseNo))
cor_df$week.end <- unique(chk$week.end)
for(i in 1:nrow(cor_df)){
for(j in 2:ncol(cor_df)){
this_CaseNo_idx = chk$CaseNo == strtoi(names(cor_df)[j])
idx = this_CaseNo_idx &
chk$week.end == max(chk$week.end[chk$week.end <= cor_df$week.end[i] &
this_CaseNo_idx])
if (sum(idx) < 1){
cor_df[i, j] = 0
}
else{
cor_df[i, j] = max(ifelse(chk$WeekEndState[idx] == "Correction" &
chk$ReviewType[idx] == "Correction",
1, 0))
}
}
}
cor_df$asw <- rowSums(cor_df[,2:ncol(cor_df)])
cor_df <- cor_df[order(cor_df$week.end),]
data.frame(week.end=cor_df$week.end, cor_df$asw)
Expected output:
week.end cor_df.asw
1 2020-08-22 0
2 2020-08-29 1
3 2020-09-05 2
4 2020-10-03 3
5 2020-10-17 2
6 2020-10-24 3
Any guidance appreciated.
I can get you there starting with the EndStates data frame. I'm not sure whether it will be much faster. Because dplyr operates on a whole column at once (rather than sequentially down the column), I still needed a while() loop to fill in the missing weeks. Perhaps a better dplyr person will provide an alternative.
library(dplyr)
library(tidyr)
cor_df2 <- EndStates %>%
mutate(count = as.numeric(WeekEndState == "Correction")) %>%
select(-WeekEndState) %>%
pivot_wider(id_cols="week.end", names_from="CaseNo", values_from="count") %>%
arrange(week.end) %>%
mutate(across(-week.end, function(x)case_when(is.na(x) & week.end == min(week.end) ~ 0, TRUE ~ x)))
while(any(is.na(cor_df2))){
cor_df2 <- cor_df2 %>% mutate(across(-week.end, function(x)case_when(is.na(x)~lag(x), TRUE ~ x)))
}
cor_df2 <- cor_df2 %>%
mutate(asw = rowSums(.[-1])) %>%
select(week.end, asw)
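One possible alternative to the while() loop is tidyr::fill(), which carries the last non-missing value down each column in one step. The sketch below uses a small hand-made EndStates table (invented values, same column names as above) rather than the question's data, so it illustrates the idea only.

```r
library(dplyr)
library(tidyr)

# Invented stand-in for EndStates: one row per case and week
EndStates <- tibble(
  CaseNo = c("1", "1", "2", "2", "3"),
  week.end = as.Date(c("2020-08-22", "2020-08-29",
                       "2020-08-29", "2020-09-05", "2020-09-05")),
  WeekEndState = c("Review", "Correction", "Correction", "Review", "Correction")
)

cor_df2 <- EndStates %>%
  mutate(count = as.numeric(WeekEndState == "Correction")) %>%
  select(-WeekEndState) %>%
  pivot_wider(names_from = CaseNo, values_from = count) %>%
  arrange(week.end) %>%
  # carry each case's last known state forward into weeks without a record
  fill(-week.end, .direction = "down") %>%
  # weeks before a case first appears are still NA: count them as 0
  mutate(across(-week.end, ~ replace_na(.x, 0))) %>%
  mutate(asw = rowSums(across(-week.end))) %>%
  select(week.end, asw)

cor_df2
```

Filling down first and zeroing the remaining leading NAs afterwards should give the same result as seeding the first week with 0 and lagging repeatedly, without looping.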
I am working with a for loop. The goal of the for loop is to simply test a condition and code the data appropriately. The loop has to iterate over 503,288 unique values and includes three if statements. Is there a way to speed up the for loop?
The code is as follows:
count<- 0
for(i in unique(Data$ID)){ #503288
#Subset Relevant Data
Loop_Before<- subset(Primary_Before, ID == i); Loop_After <- subset(Primary_After, ID == i)
if(nrow(Loop_Before) >= 1 & nrow(Loop_After) >= 1){
Data$Status[Data$ID == i] <- "Both Elections"
}
if(nrow(Loop_Before) >= 1 & nrow(Loop_After) == 0){
Data$Status[Data$ID == i] <- "Only Primary Election"
}
if(nrow(Loop_Before) >= 0 & nrow(Loop_After) == 1){
Data$Status[Data$ID == i] <- "Only General Election"
}
message(count<- count +1)
}
table(Data$Status)
Thank you for your help!
Avoid the for-loop entirely. I don't know your dataset, but the following should be considerably faster, possibly by one or two orders of magnitude:
library(tidyverse) # load some packages that will help
# let's create some sample data
Data <- data.frame(ID = c(1,1,1,1,2,2,2,3,3))
Primary_before <- data.frame(ID = c(0,1,2,2,3,3,3))
Primary_after <- data.frame(ID = c(1,3))
# now for every ID we count the nr of rows in both dataframes
summarised_before <- Primary_before %>%
group_by(ID) %>%
summarise(nrRows = n())
#      ID nrRows
#   <dbl>  <int>
# 1     0      1
# 2     1      1
# 3     2      2
# 4     3      3
summarised_after <- Primary_after %>%
group_by(ID) %>%
summarise(nrRows = n())
#      ID nrRows
#   <dbl>  <int>
# 1     1      1
# 2     3      1
# now we join them together
summarised_both <- summarised_after %>%
full_join(summarised_before, by = "ID", suffix = c("_after", "_before"))
# and now we do the final calculation
summarised_both %>%
# IDs missing from one side come out of the full join as NA: treat them as 0 rows
mutate(nrRows_before = replace_na(nrRows_before, 0),
nrRows_after = replace_na(nrRows_after, 0)) %>%
mutate(Status = case_when(nrRows_before >= 1 & nrRows_after >= 1 ~ "Both elections"
, nrRows_before >= 1 & nrRows_after == 0 ~ "Only primary election"
, nrRows_before == 0 & nrRows_after >= 1 ~ "Only general election")) %>%
filter(ID %in% Data$ID)
I saved the intermediate results, but you could also do it in one go, like this:
Primary_before %>%
group_by(ID) %>%
summarise(nrRows = n()) %>%
full_join(Primary_after %>%
group_by(ID) %>%
summarise(nrRows = n())
, by = "ID"
, suffix = c("_before", "_after")) %>% # left table is "before", right table is "after"
mutate(nrRows_before = replace_na(nrRows_before, 0),
nrRows_after = replace_na(nrRows_after, 0)) %>%
mutate(Status = case_when(nrRows_before >= 1 & nrRows_after >= 1 ~ "Both elections"
, nrRows_before >= 1 & nrRows_after == 0 ~ "Only primary election"
, nrRows_before == 0 & nrRows_after >= 1 ~ "Only general election")) %>%
filter(ID %in% Data$ID)
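If only presence or absence in each table matters (as the three conditions in the question suggest), the row counts may not be needed at all: a vectorised %in% lookup writes the status straight into Data. The sketch below reuses the sample data above with the question's table names. It assumes the third branch is meant as "no rows before, at least one row after", which is a reading of the intent rather than of the posted code.

```r
library(dplyr)

Data <- data.frame(ID = c(1, 1, 1, 1, 2, 2, 2, 3, 3))
Primary_Before <- data.frame(ID = c(0, 1, 2, 2, 3, 3, 3))
Primary_After <- data.frame(ID = c(1, 3))

# One logical per row of Data: does this ID occur in each table?
in_before <- Data$ID %in% Primary_Before$ID
in_after  <- Data$ID %in% Primary_After$ID

Data$Status <- case_when(
  in_before & in_after  ~ "Both Elections",
  in_before & !in_after ~ "Only Primary Election",
  !in_before & in_after ~ "Only General Election"
  # IDs found in neither table stay NA
)

table(Data$Status)
```

Each %in% call is a single hashed lookup over the whole column, so this does no per-ID subsetting.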