I am working with a large health insurance dataset and I am interested in participants with certain claims codes. One of my inclusion criteria is that the participant has to be insured for one year before and one year after the claim date. E.g., if they were injured 9/27/2017, they need insurance from 9/27/2016 to 9/27/2018.
I have tried doing a simple rowSums and using apply, but both fail with the same warning: "in from:to : numerical expression has # elements: only the first used". Right now, I have the range saved as variables in the dataframe. I think I understand why I am having the issue: the : operator expects a single number and is receiving a vector. How can I get it to conditionally select columns to sum? I will include my code below.
In my example, I am just trying to count the number of months a participant is insured for 6 months before and after their accident. The ins_#_# variables are a simple yes/no (coded 0/1) for whether or not a participant was insured that month. Any guidance is appreciated!
library(tidyverse)
set.seed(1)
df <- data.frame(id = seq(1, 100),
                 injury_date = sample(seq(as.Date('2017/01/01'), as.Date('2017/12/31'), by = "day"), 100),
                 ins_07_16 = sample(c(0, 1), 100, replace = TRUE),
                 ins_08_16 = sample(c(0, 1), 100, replace = TRUE),
                 ins_09_16 = sample(c(0, 1), 100, replace = TRUE),
                 ins_10_16 = sample(c(0, 1), 100, replace = TRUE),
                 ins_11_16 = sample(c(0, 1), 100, replace = TRUE),
                 ins_12_16 = sample(c(0, 1), 100, replace = TRUE),
                 ins_01_17 = sample(c(0, 1), 100, replace = TRUE),
                 ins_02_17 = sample(c(0, 1), 100, replace = TRUE),
                 ins_03_17 = sample(c(0, 1), 100, replace = TRUE),
                 ins_04_17 = sample(c(0, 1), 100, replace = TRUE),
                 ins_05_17 = sample(c(0, 1), 100, replace = TRUE),
                 ins_06_17 = sample(c(0, 1), 100, replace = TRUE),
                 ins_07_17 = sample(c(0, 1), 100, replace = TRUE),
                 ins_08_17 = sample(c(0, 1), 100, replace = TRUE),
                 ins_09_17 = sample(c(0, 1), 100, replace = TRUE),
                 ins_10_17 = sample(c(0, 1), 100, replace = TRUE),
                 ins_11_17 = sample(c(0, 1), 100, replace = TRUE),
                 ins_12_17 = sample(c(0, 1), 100, replace = TRUE),
                 ins_01_18 = sample(c(0, 1), 100, replace = TRUE),
                 ins_02_18 = sample(c(0, 1), 100, replace = TRUE),
                 ins_03_18 = sample(c(0, 1), 100, replace = TRUE),
                 ins_04_18 = sample(c(0, 1), 100, replace = TRUE),
                 ins_05_18 = sample(c(0, 1), 100, replace = TRUE),
                 ins_06_18 = sample(c(0, 1), 100, replace = TRUE))
df <- df %>%
  mutate(month = as.numeric(format(as.Date(injury_date), "%m")), # pulling month of injury
         low_mo = month + 2,   # column index of the month 6 months before the injury
         high_mo = month + 14) # column index of the month 6 months after the injury
df$insured <- rowSums(df[df$low_mo:df$high_mo]) #only uses first element
df$insured <- apply(df[df$low_mo:df$high_mo], 1, sum) #only uses first element
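To see the problem in isolation: the : operator is not vectorised, so with vector endpoints it silently uses only the first element of each, which is exactly the warning above:
c(3, 5):c(7, 9) # only 3 and 7 are used
#[1] 3 4 5 6 7
#Warning: numerical expression has 2 elements: only the first used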
Edit:
Although I did not specify that I wanted a fast solution, I am working with a lot of data, so I tested which of @akrun's solutions was the fastest. I changed the dataframe so it had 1e5 (100,000) rows. The results are below in case anyone is curious.
library(microbenchmark)
microbenchmark(o1 <- sapply(seq_len(nrow(df)), function(i) sum(df[i, df$low_mo[i]:df$high_mo[i]])),
o2 <- {colInd <- Map(`:`, df$low_mo, df$high_mo);
rowInd <- rep(seq_len(nrow(df)), lengths(colInd));
as.vector(tapply(df[-(1:2)][cbind(rowInd, unlist(colInd)-2)],
rowInd, FUN = sum))},
o3 <- {colInd1 <- Map(function(x, y) which(!seq_along(df) %in% x:y), df$low_mo, df$high_mo);
rowInd1 <- rep(seq_len(nrow(df)), lengths(colInd1));
rowSums(replace(df, cbind(rowInd1, unlist(colInd1)), NA)[-(1:2)], na.rm = TRUE)},
times = 5)
Unit: milliseconds
expr min lq mean median uq max neval
o1 20408.5072 20757.0285 20903.9386 20986.2275 21069.3163 21298.6137 5
o2 433.5463 436.3066 448.6448 455.6551 456.8836 460.8325 5
o3 470.6834 482.4449 492.9594 485.6210 504.1353 521.9122 5
> identical(o1, o2)
[1] TRUE
> identical(o2, o3)
[1] TRUE
There are a couple of ways to do this. One option is to loop through the sequence of rows, subset the dataset by the row index and by the columns generated from the sequence of 'low_mo' to 'high_mo' for that row, and get the sum:
o1 <- sapply(seq_len(nrow(df)), function(i) sum(df[i, df$low_mo[i]:df$high_mo[i]]))
Another option is to extract the elements based on a row/column index matrix and then do a group-by sum:
colInd <- Map(`:`, df$low_mo, df$high_mo)         # per-row sequence of column indices
rowInd <- rep(seq_len(nrow(df)), lengths(colInd)) # matching row index for each column index
o2 <- as.vector(tapply(df[-(1:2)][cbind(rowInd, unlist(colInd) - 2)], # -2 because 'id' and 'injury_date' are dropped
                       rowInd, FUN = sum))
identical(o1, o2)
#[1] TRUE
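The cbind(rowInd, unlist(colInd) - 2) trick works because subsetting with a two-column matrix of (row, column) pairs extracts one element per pair. A minimal illustration:
m <- matrix(1:9, nrow = 3)
m[cbind(c(1, 2), c(2, 3))] # extracts m[1, 2] and m[2, 3]
#[1] 4 8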
Another approach is to set the column values that are not in each row's sequence to NA and then use rowSums with na.rm = TRUE:
colInd1 <- Map(function(x, y) which(!seq_along(df) %in% x:y), df$low_mo, df$high_mo) # columns outside each row's window
rowInd1 <- rep(seq_len(nrow(df)), lengths(colInd1))
o3 <- rowSums(replace(df, cbind(rowInd1, unlist(colInd1)),
NA)[-(1:2)], na.rm = TRUE)
identical(o1, o3)
#[1] TRUE
I have tried the following function, but it labels everything "No" even when I change the quantile value.
NOTE: I have 3 independent datasets to which I want to apply the function.
outlier <- function(x1, x2){
  q1 <- quantile(x1, .75, na.rm = TRUE)
  if(x1 > q1){
    x2 <- "Yes"
  } else {
    x2 <- "No"
  }
}
I have tried x2 <- ifelse(x1 > q1, "Yes", "No") inside the function, but it still doesn't work.
You can use an ifelse statement (ifelse is vectorised, whereas if/else only looks at a single value) and create a new column using mutate.
library(dplyr)
set.seed(1)
df <- tibble(x1 = sample(c(1:10), size = 10, replace = T))
df %>%
mutate(x2 = ifelse(quantile(x1, 0.75, na.rm = T) < x1, "Yes", "No"))
If you want a function
library(dplyr)
set.seed(1)
df <- tibble(x1 = sample(c(1:10), size = 10, replace = T),
x2 = sample(c(1:10), size = 10, replace = T),
x3 = sample(c(1:10), size = 10, replace = T),
x4 = sample(c(1:10), size = 10, replace = T))
outlier <- function(dataframe, quant = 0.75, col = c("x1", "x2")){
  dataframe %>%
    mutate(across(all_of(col), ~ ifelse(.x > quantile(.x, quant, na.rm = TRUE), 'Yes', 'No'),
                  .names = '{.col}_yes'))
}
outlier(dataframe = df, quant = 0.25)
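Since the question mentions three independent datasets, the same function can be applied to each with lapply, assuming each of them has the columns named in col; df1, df2 and df3 here are hypothetical stand-ins for your actual data frames:
# apply the outlier labelling to each dataset (df1/df2/df3 are placeholders)
results <- lapply(list(df1, df2, df3), outlier, quant = 0.75)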
I have a dataframe with 40 rows and ~40000 columns. The 40 rows are split into group "A" and group "B" (20 each). For each column, I would like to apply a statistical test (wilcox.test()) comparing the two groups. I started using a for loop to run through the 40000 columns but it was very slow.
Minimal Reproducible Example (MRE):
library(tidyverse)
set.seed(123)
metrics <- paste("metric_", 1:40000, sep = "")
patient_IDs <- paste("patientID_", 1:40, sep = "")
m <- matrix(sample(1:20, 1600000, replace = TRUE), ncol = 40000, nrow = 40,
dimnames=list(patient_IDs, metrics))
test_data <- as.data.frame(m)
test_data$group <- c(rep("A", 20), rep("B", 20))
# Collate list of metrics to analyse ("check") for significance
list_to_check <- colnames(test_data)[1:40000]
Original 'loop' method (this is what I want to vectorise):
# Create a variable to store the results
results_A_vs_B <- c()
# Loop through the "list_to_check" and,
# for each 'value', compare group A with group B
# and load the results into the "results_A_vs_B" variable
for (i in list_to_check) {
outcome <- wilcox.test(test_data[test_data$group == "A", ][[i]],
test_data[test_data$group == "B", ][[i]],
exact = FALSE)
if (!is.nan(outcome$p.value) && outcome$p.value <= 0.05) {
results_A_vs_B[i] <- paste(outcome$p.value, sep = "\t")
}
}
# Format the results into a dataframe
summarised_results_A_vs_B <- as.data.frame(results_A_vs_B) %>%
rownames_to_column(var = "A vs B") %>%
rename("Wilcox Test P-value" = "results_A_vs_B")
Benchmarking the answers so far:
# Ronak Shah's "Map" approach
Map_func <- function(dataset, list_to_check) {
tmp <- split(dataset[list_to_check], dataset$group)
stack(Map(function(x, y) wilcox.test(x, y, exact = FALSE)$p.value, tmp[[1]], tmp[[2]]))
}
# @Onyambu's data.table method
dt_func <- function(dataset, list_to_check) {
  melt(setDT(dataset), measure.vars = list_to_check)[
    , dcast(.SD, rowid(group) + variable ~ group)][
    , wilcox.test(A, B, exact = FALSE)$p.value, variable]
}
# @Park's dplyr method (with some minor tweaks)
dplyr_func <- function(dataset, list_to_check){
dataset %>%
summarise(across(all_of(list_to_check),
~ wilcox.test(.x ~ group, exact = FALSE)$p.value)) %>%
pivot_longer(cols = everything(),
names_to = "Metrics",
values_to = "Wilcox Test P-value")
}
library(microbenchmark)
res_map <- microbenchmark(Map_func(test_data, list_to_check), times = 10)
res_dplyr <- microbenchmark(dplyr_func(test_data, list_to_check), times = 2)
library(data.table)
res_dt <- microbenchmark(dt_func(test_data, list_to_check), times = 10)
autoplot(rbind(res_map, res_dt, res_dplyr))
# Excluding dplyr
autoplot(rbind(res_map, res_dt))
Running the code on a server took a couple of seconds longer, but the difference between Map and data.table was more pronounced (laptop = 4 cores, server = 8 cores):
autoplot(rbind(res_map, res_dt))
Here is another option -
Map_approach <- function(dataset, list_to_check) {
tmp <- split(dataset[list_to_check], dataset$group)
stack(Map(function(x, y) wilcox.test(x, y)$p.value, tmp[[1]], tmp[[2]]))
}
Map_approach(data_subset, list_to_check)
# values ind
#1 5.359791e-05 value_1
#2 5.499685e-08 value_2
#3 1.503951e-06 value_3
#4 6.179352e-08 value_4
#5 5.885650e-08 value_5
Testing it on a larger sample, Map is slightly faster than the for loop.
n <- 1e6
data_subset <- data.frame(patient_ID = 1:n,
group = c(rep("A", n/2),
rep("B", n/2)),
value_1 = c(sample(1:10, n/2, replace = TRUE),
sample(5:15, n/2, replace = TRUE)),
value_2 = c(sample(1:5, n/2, replace = TRUE),
sample(15:(n/2), n/2, replace = TRUE)),
value_3 = c(sample(1:12, n/2, replace = TRUE),
sample(8:17, n/2, replace = TRUE)),
value_4 = c(sample(5:10, n/2, replace = TRUE),
sample(15:25, n/2, replace = TRUE)),
value_5 = c(sample(20:40, n/2, replace = TRUE),
sample(10:15, n/2, replace = TRUE)))
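Note that wilcox_loop is not defined in this post; it presumably wraps the question's for loop in a function, along these lines (a sketch, not the exact code):
wilcox_loop <- function(dataset, list_to_check) {
  out <- c()
  for (i in list_to_check) {
    out[i] <- wilcox.test(dataset[dataset$group == "A", ][[i]],
                          dataset[dataset$group == "B", ][[i]],
                          exact = FALSE)$p.value
  }
  out
}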
microbenchmark::microbenchmark(loop = wilcox_loop(data_subset, list_to_check),
Map = Map_approach(data_subset, list_to_check))
#Unit: seconds
# expr min lq mean median uq max neval cld
# loop 5.254573 5.631162 5.788624 5.734480 5.920424 6.756319 100 b
# Map 4.710790 5.084783 5.201711 5.160722 5.309048 5.721540 100 a
Could you try this code? It's slightly faster on my computer.
wilcox_loop2 <- function(data_subset, list_to_check){
A = data_subset[data_subset$group == "A",]
B = data_subset[data_subset$group == "B",]
outcome <- sapply(list_to_check, function(x) wilcox.test(A[[x]],
B[[x]],
exact = FALSE)$p.value)
as.data.frame(outcome) %>%
rownames_to_column(var = "A vs B") %>%
rename("Wilcox Test P-value" = "outcome")
}
I'm not sure it's OK to split the data into A and B...
My timings are as follows:
microbenchmark::microbenchmark(origin = wilcox_loop(data_subset, list_to_check),
test = wilcox_loop2(data_subset, list_to_check))
Unit: milliseconds
expr min lq mean median uq max neval cld
origin 4.815601 5.006951 6.490757 5.385502 6.790752 21.5876 100 b
test 3.817801 4.116151 5.146963 4.330500 4.870651 15.8271 100 a
I need to label the values of a lot of variables with sjlabelled::set_labels. Here is a reproducible example and what already works:
library(data.table)
library(sjlabelled)
lookup <- data.table(id = paste0("q", 1:5),
answers = paste(paste0("atext", 1:5), paste0("btext", 1:5)
, paste0("ctext", 1:5), sep = ";"))
data <- data.table(q1 = sample(1:3, 10, replace = TRUE),
q2 = sample(1:3, 10, replace = TRUE),
q3 = sample(1:3, 10, replace = TRUE),
q4 = sample(1:3, 10, replace = TRUE),
q5 = sample(1:3, 10, replace = TRUE))
data$q1 <- set_labels(data$q1, labels = unlist(strsplit(lookup[id == "q1", answers], split = ";")))
get_labels(data$q1)
So the labels for the different answers (= values) are separated by a semicolon. I am able to make it work if I call the variables by id, as you can see in the example code, but I am struggling when I want to "loop" through all the variables.
The goal is to be able to export the data.table (or dataframe) as an SPSS file. If it works with other packages, I would also be happy.
Match the column names of data with id, split the answers on ; and pass the labels as a list.
library(sjlabelled)
data <- set_labels(data, labels = strsplit(lookup$answers[match(names(data), lookup$id)], ';'))
get_labels(data)
#$q1
#[1] "atext1" "btext1" "ctext1"
#$q2
#[1] "atext2" "btext2" "ctext2"
#$q3
#[1] "atext3" "btext3" "ctext3"
#$q4
#[1] "atext4" "btext4" "ctext4"
#$q5
#[1] "atext5" "btext5" "ctext5"
I have a huge dataset with several groups (factors with between 2 and 6 levels) and dichotomous variables (0, 1).
example data
DF <- data.frame(
group1 = sample(x = c("A","B","C","D"), size = 100, replace = T),
group2 = sample(x = c("red","blue","green"), size = 100, replace = T),
group3 = sample(x = c("tiny","small","big","huge"), size = 100, replace = T),
var1 = sample(x = 0:1, size = 100, replace = T),
var2 = sample(x = 0:1, size = 100, replace = T),
var3 = sample(x = 0:1, size = 100, replace = T),
var4 = sample(x = 0:1, size = 100, replace = T),
var5 = sample(x = 0:1, size = 100, replace = T))
I want to do a chi-squared test for every group, across all the variables.
library(tidyverse)
library(rstatix)
chisq_test(DF$group1, DF$var1)
chisq_test(DF$group1, DF$var2)
chisq_test(DF$group1, DF$var3)
...
etc
I managed to make it work by using two nested for loops, but I'm sure there is a better solution
groups <- c("group1","group2","group3")
vars <- c("var1","var2","var3","var4","var5")
results <- data.frame()
for(i in groups){
for(j in vars){
test <- chisq_test(DF[,i], DF[,j])
test <- mutate(test, group=i, var=j)
results <- rbind(results, test)
}
}
results
I think I need some kind of apply function, but I can't figure it out.
Here is one way to do it with apply. I am sure there is an even more elegant way to do it with dplyr. (Note that here I extract the p.value of the test, but you can extract something else or the whole test result if you prefer).
res <- apply(DF[, 1:3], 2, function(x) {  # loop over the three group columns
  apply(DF[, 4:8], 2,                     # loop over var1 to var5
        function(y) {chisq.test(x, y)$p.value})
})
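res is a matrix with one row per var and one column per group. If you prefer the results in long format, the matrix can be reshaped with base R:
# reshape the p-value matrix into a long data frame (var, group, p.value)
res_long <- as.data.frame(as.table(res))
names(res_long) <- c("var", "group", "p.value")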
Here's a quick and easy dplyr solution that transforms the data into long format keyed by group and var, then runs the chi-squared test on each combination of group and var.
DF %>%
pivot_longer(starts_with("group"), names_to = "group", values_to = "group_val") %>%
pivot_longer(starts_with("var"), names_to = "var", values_to = "var_val") %>%
group_by(group, var) %>%
summarise(chisq_test(group_val, var_val)) %>%
ungroup()
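(This relies on summarise() being able to return a one-row data frame per group, which is what rstatix::chisq_test() produces here; that requires dplyr >= 1.0.0.)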
Sample data:
var1 <- matrix(sample(c(NA, 1:3), 10, replace = TRUE), 10,1)
var2 <- matrix(sample(c(NA, 1:3), 10, replace = TRUE), 10,1)
var3 <- matrix(sample(c(NA, 1:3), 10, replace = TRUE), 10,1)
var4 <- matrix(sample(c(NA, 1:3), 10, replace = TRUE), 10,1)
var5 <- matrix(sample(c(NA, 1:3), 10, replace = TRUE), 10,1)
NewDataframe <- as.data.frame(cbind(var1,var2,var3,var4,var5))
names(NewDataframe) <- c("var1","var2","var3","var4","var5")
NewDataframe[is.na(NewDataframe)] <- ""
Vectors of coefficients:
par1 <- data.frame(var1=2,var3=5,var4=3)
par2 <- data.frame(var2=4,var5=7)
Pre-multiply each row of NewDataframe by the matching par variables (var1, etc.). Rows where par1 or par2 does not apply are left blank. How should I approach this? Thanks.
I have simplified some of your code to use predict() as a much handier alternative to doing the matrix multiplication yourself.
dataframe <- data.frame(y = rbinom(100, 2, 0.4),
                        var1 = rnorm(100, 2, 2),
                        var2 = rnorm(100, 3, 4),
                        var3 = rnorm(100, 4, 5),
                        var4 = rnorm(100, 5, 6),
                        var5 = rnorm(100, 30, 3))
model1 <- lm(y~var1+var3, data=dataframe)
model2 <- lm(y~var2+var4+var5, data=dataframe)
var1 <- matrix(sample(c(NA, 1:3), 100, replace = TRUE), 100,1)
var2 <- matrix(sample(c(NA, 1:3), 100, replace = TRUE), 100,1)
var3 <- matrix(sample(c(NA, 1:3), 100, replace = TRUE), 100,1)
var4 <- matrix(sample(c(NA, 1:3), 100, replace = TRUE), 100,1)
var5 <- matrix(sample(c(NA, 1:3), 100, replace = TRUE), 100,1)
NewDataframe <- as.data.frame(cbind(var1,var2,var3,var4,var5))
names(NewDataframe) <- c("var1","var2","var3","var4","var5")
Use complete.cases() to identify rows that have no NAs in the relevant columns and so would produce a viable estimate:
m1.ids <- with(NewDataframe, complete.cases(var1, var3))
Make two vectors of predictions: one using model1 for the rows that have no NAs in the relevant columns, and another using model2 for all the rest:
y.hat1 <- predict(model1, newdata=NewDataframe[m1.ids, ])
y.hat2 <- predict(model2, newdata=NewDataframe[!m1.ids, ])
Use the index to match the estimates to their respective rows.
NewDataframe <- rbind(data.frame(NewDataframe[m1.ids,], y.hat=y.hat1),
data.frame(NewDataframe[!m1.ids,], y.hat=y.hat2))
Alternatively, you can generate a full vector of estimates with each model and use ifelse() to choose values from the second wherever the first is NA. That can look cleaner if your data is not big, but it produces redundant estimates.
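A sketch of that alternative (predict.lm returns NA for newdata rows with missing predictors, since its na.action defaults to na.pass):
# predict with both models over all rows, then pick per row
y.hat.m1 <- predict(model1, newdata = NewDataframe) # NA where var1/var3 are NA
y.hat.m2 <- predict(model2, newdata = NewDataframe)
NewDataframe$y.hat <- ifelse(is.na(y.hat.m1), y.hat.m2, y.hat.m1)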