A dynamic lm() function in R

Let's assume I have this dataframe:
N <- 50
df <- data.frame(
LA1 = sample(1:10, size = N, replace = TRUE),
LA2 = sample(1:10, size = N, replace = TRUE),
LA3 = sample(1:10, size = N, replace = TRUE),
LA4 = sample(1:10, size = N, replace = TRUE),
LA5 = sample(1:10, size = N, replace = TRUE),
LA6 = sample(1:10, size = N, replace = TRUE),
LA7 = sample(1:10, size = N, replace = TRUE),
LA8 = sample(1:10, size = N, replace = TRUE),
LAY = sample(1:10, size = N, replace = TRUE),
UF1 = sample(1:10, size = N, replace = TRUE),
UF2 = sample(1:10, size = N, replace = TRUE),
UF3 = sample(1:10, size = N, replace = TRUE),
UF4 = sample(1:10, size = N, replace = TRUE),
UF5 = sample(1:10, size = N, replace = TRUE),
UF6 = sample(1:10, size = N, replace = TRUE),
UFY = sample(1:10, size = N, replace = TRUE),
EK1 = sample(1:10, size = N, replace = TRUE),
EK2 = sample(1:10, size = N, replace = TRUE),
EK3 = sample(1:10, size = N, replace = TRUE),
EK4 = sample(1:10, size = N, replace = TRUE),
EK5 = sample(1:10, size = N, replace = TRUE),
EK6 = sample(1:10, size = N, replace = TRUE),
EK7 = sample(1:10, size = N, replace = TRUE),
EK8 = sample(1:10, size = N, replace = TRUE),
EK9 = sample(1:10, size = N, replace = TRUE),
EK10 = sample(1:10, size = N, replace = TRUE),
EK11 = sample(1:10, size = N, replace = TRUE),
EK12 = sample(1:10, size = N, replace = TRUE),
EKY = sample(1:10, size = N, replace = TRUE),
Z1 = sample(1:10, size = N, replace = TRUE),
Z2 = sample(1:10, size = N, replace = TRUE),
Z3 = sample(1:10, size = N, replace = TRUE)
)
I want to fit these models:
m1=lm(formula = LAY ~ LA1+LA2+LA3+LA4+LA5+LA6+LA7+LA8, data = df)
m11=step(m1,direction="both")
m2=lm(formula = UFY ~ UF1+UF2+UF3+UF4+UF5+UF6,data = df)
m22=step(m2,direction="both")
m3=lm(formula = EKY ~ EK1+EK2+EK3+EK4+EK5+EK6+EK7+EK8+EK9+EK10+EK11+EK12, data = df)
m33=step(m3,direction="both")
m8=lm(formula = Z1 ~ LAY+UFY+EKY, data = df)
m88=step(m8,direction="both")
m9=lm(formula = Z2 ~ LAY+UFY+EKY, data = df)
m99=step(m9,direction="both")
m10=lm(formula = Z3 ~ LAY+UFY+EKY, data = df)
m100=step(m10,direction="both")
As you can see, if the dimensionality of the dataset grows (more LA, UF, or EK independent variables), I have to modify the model inputs manually. So I'm looking for a way to do the following:
Given any number of independent variables (5, 10, 30, or more) in a category (LA, UF, or EK), the model input should adjust automatically.
Although I have found alternative formula syntax for specifying the models (like X*Z = [(X+Z)^3]), I can't find a way to make this computation more dynamic.
Considerations:
The number of independent variables (LA, UF, EK) can change.
The number of dependent variables (LAY, UFY, EKY) never changes.
Only the coefficient vector is extracted from the output of these models (in case that matters).
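One possible approach, sketched under assumptions rather than offered as the definitive answer: select the predictor columns by name pattern and build each formula with reformulate(). The helper name fit_stepwise and the naming convention (predictors are the category prefix followed by digits, the response is the prefix followed by Y) are assumptions based on the example data.

```r
set.seed(42)
N <- 50

# Rebuild a data frame with the same column names as in the question
cols <- c(paste0("LA", 1:8), "LAY", paste0("UF", 1:6), "UFY",
          paste0("EK", 1:12), "EKY", paste0("Z", 1:3))
df <- as.data.frame(setNames(
  replicate(length(cols), sample(1:10, N, replace = TRUE), simplify = FALSE),
  cols
))

# Hypothetical helper: fit the full model, then run stepwise selection.
# reformulate() is called here so the formula's environment can see `data`,
# which step() needs when it refits the model.
fit_stepwise <- function(response, predictors, data) {
  full_formula <- reformulate(predictors, response = response)
  full_model <- lm(full_formula, data = data)
  step(full_model, direction = "both", trace = 0)
}

categories <- c(LA = "LA", UF = "UF", EK = "EK")

# Stage 1: one model per category; predictors are found by pattern,
# so adding LA9, UF7, ... to the data needs no change in the code
stage1 <- lapply(categories, function(p) {
  predictors <- grep(paste0("^", p, "[0-9]+$"), names(df), value = TRUE)
  fit_stepwise(paste0(p, "Y"), predictors, df)
})

# Stage 2: every Z column regressed on the category responses
z_vars <- grep("^Z[0-9]+$", names(df), value = TRUE)
y_vars <- paste0(categories, "Y")
stage2 <- lapply(setNames(z_vars, z_vars), fit_stepwise,
                 predictors = y_vars, data = df)

# Coefficient vectors of all selected models
coefs <- lapply(c(stage1, stage2), coef)
```

The regular expression is anchored on both ends so that LAY is not picked up as a predictor of its own category.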

Related

Loop inside a function, how to store function output to an existing dataframe

My goal is to run linear regressions with my defined equation, and then store the model residuals to my original dataset.
library(tidyverse)
library(stringr)
set.seed(5)
df <- data.frame(
id = c(1:100),
age = sample(20:80, 100, replace = TRUE),
sex = sample(c("M", "F"), 100, replace = TRUE, prob = c(0.7, 0.3)),
type = sample(letters[1:4], 100, replace = TRUE),
bmi = sample(15:35, 100, replace = TRUE),
sbp = sample(75:160, 100, replace = TRUE),
cat_outcome1 = sample(c(0L, 1L), 100, replace = TRUE, prob = c(0.68, 0.32)),
cat_outcome2 = sample(c(0L, 1L), 100, replace = TRUE, prob = c(0.65, 0.35)),
cat_outcome3 = sample(c(0L, 1L), 100, replace = TRUE, prob = c(0.60, 0.40)),
cat_outcome4 = sample(c(0L, 1L), 100, replace = TRUE, prob = c(0.45, 0.55)),
dog_outcome1 = sample(c(0L, 1L), 100, replace = TRUE, prob = c(0.68, 0.32)),
dog_outcome2 = sample(c(0L, 1L), 100, replace = TRUE, prob = c(0.65, 0.35)),
dog_outcome3 = sample(c(0L, 1L), 100, replace = TRUE, prob = c(0.60, 0.40)),
dog_outcome4 = sample(c(0L, 1L), 100, replace = TRUE, prob = c(0.45, 0.55))
)
outcome = colnames(df)[str_detect(colnames(df), "outcome")]
test_function = function(vars_dep, vars_indep, input_data){
for (z in vars_dep) {
formula = as.formula(paste0(z, " ~ ", vars_indep))
model = lm(formula, data = input_data, na.action = na.exclude)
# Take the residual from each model, create a new col with the suffix '.res'
input_data[, paste0(z, ".res")] = residuals(model)
}
}
As shown above, I would like to save the residuals with a suffix that depends on which y is used in the model, and finally store these residuals as columns in my original dataframe df. I expected to see cat_outcome1.res, cat_outcome2.res, and so on as new columns, but they were not saved in df. Any suggestions are greatly appreciated!
The function modifies a local copy of input_data and never returns it, so df is left unchanged. Return the modified data frame and assign the result back to df when you call it. This function gives you what you want:
test_function <- function(vars_dep, vars_indep, input_data){
for (z in vars_dep) {
formula = as.formula(paste0(z, " ~ ", vars_indep))
model = lm(formula, data = input_data, na.action = na.exclude)
# Take the residual from each model, create a new col with the suffix '.res'
input_data[[paste0(z, ".res")]] <- residuals(model)
}
return(input_data)
}
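A minimal usage sketch of the same idea, assuming a cut-down version of the example data and a hypothetical function name add_residuals. Collapsing the predictors with " + " is an addition so that a character vector of several predictors also works; the key step is assigning the returned data frame back to df.

```r
set.seed(5)
df <- data.frame(
  age = sample(20:80, 100, replace = TRUE),
  bmi = sample(15:35, 100, replace = TRUE),
  cat_outcome1 = sample(c(0L, 1L), 100, replace = TRUE),
  dog_outcome1 = sample(c(0L, 1L), 100, replace = TRUE)
)
outcome <- grep("outcome", names(df), value = TRUE)

# Hypothetical helper: returns a copy of input_data with one
# '<outcome>.res' column of residuals per dependent variable
add_residuals <- function(vars_dep, vars_indep, input_data) {
  rhs <- paste(vars_indep, collapse = " + ")
  for (z in vars_dep) {
    model <- lm(as.formula(paste(z, "~", rhs)),
                data = input_data, na.action = na.exclude)
    input_data[[paste0(z, ".res")]] <- residuals(model)
  }
  input_data
}

# R passes arguments by value, so the result must be assigned back
df <- add_residuals(outcome, c("age", "bmi"), df)
```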

Tetrachoric correlations using hetcor different when variables are stored as factor vs numeric

I'm trying to run a factor analysis with a series of binary variables. I'm using the hetcor function to obtain tetrachoric correlations. The tetrachoric correlations are much smaller when the variables are stored as numeric as opposed to factor. Does anyone know why this would be, or if I should be concerned? I ran a tetrachoric correlation matrix in Stata, and the matrix matches the one produced in R with factors.
The data below do not produce the same size of differences in the correlations that I'm seeing in my data, and I haven't been able to figure out how to share my data on stackoverflow (in spite of a lot of searching), but this is the code I'm using to produce the correlations and what the data look like (15 binary variables, n=6157).
library(tibble)   # tibble()
library(polycor)  # hetcor()
library(psych)    # cor.plot()
set.seed(5)
df <- tibble(x1 = sample(c(0,1), replace = TRUE, size = 6157),
x2 = sample(c(0,1), replace = TRUE, size = 6157),
x3 = sample(c(0,1), replace = TRUE, size = 6157),
x4 = sample(c(0,1), replace = TRUE, size = 6157),
x5 = sample(c(0,1), replace = TRUE, size = 6157),
x6 = sample(c(0,1), replace = TRUE, size = 6157),
x7 = sample(c(0,1), replace = TRUE, size = 6157),
x8 = sample(c(0,1), replace = TRUE, size = 6157),
x9 = sample(c(0,1), replace = TRUE, size = 6157),
x10 = sample(c(0,1), replace = TRUE, size = 6157),
x11 = sample(c(0,1), replace = TRUE, size = 6157),
x12 = sample(c(0,1), replace = TRUE, size = 6157),
x13 = sample(c(0,1), replace = TRUE, size = 6157),
x14 = sample(c(0,1), replace = TRUE, size = 6157),
x15 = sample(c(0,1), replace = TRUE, size = 6157))
# Tetrachoric correlations with numeric variables
tetrachoric_num <- hetcor(df)$cor
cor.plot(tetrachoric_num, numbers=T, upper=FALSE, main = "Tetrachoric Correlations with numeric variables", show.legend = TRUE)
# convert all columns to factors
df[sapply(df, is.numeric)] <- lapply(df[sapply(df, is.numeric)], as.factor)
# Tetrachoric correlations with factor variables
tetrachoric_factor <- hetcor(df)$cor
cor.plot(tetrachoric_factor, numbers=T, upper=FALSE, main = "Tetrachoric Correlations with factor variables", show.legend = TRUE)
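A likely explanation, assuming hetcor here is polycor::hetcor: it chooses the correlation type from the column classes, computing ordinary Pearson correlations between numeric columns and polychoric (for binary data, tetrachoric) correlations between factors. A Pearson correlation of two 0/1 variables is the phi coefficient, which is smaller in magnitude than the correlation of the underlying latent normal variables. The base R sketch below illustrates that attenuation on simulated data; the latent correlation of 0.6 and the median split are assumptions made for the illustration, and the back-transformation used at the end only holds for that median-split case.

```r
set.seed(5)
n <- 6157
rho <- 0.6  # assumed latent correlation, for illustration only

# Two correlated standard normal variables
z1 <- rnorm(n)
z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)

# Dichotomize at the population median (0)
x1 <- as.numeric(z1 > 0)
x2 <- as.numeric(z2 > 0)

latent_r <- cor(z1, z2)  # close to rho
phi <- cor(x1, x2)       # Pearson on 0/1 data: noticeably smaller

# For a median split of a bivariate normal, rho = sin(pi * phi / 2),
# which is roughly what a tetrachoric estimate recovers
recovered <- sin(pi * phi / 2)

round(c(latent = latent_r, pearson_binary = phi, recovered = recovered), 3)
```

If this is what is happening, the factor-based matrix (the one that matches Stata) is the tetrachoric one, and the numeric-based matrix is a Pearson matrix.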

R Corrplot Text Spacing

Is there any way to remove the spaces between dashes? For example, removing the white space between "IL - 1alpha"? I'm not sure if there's any argument I could add in order to do this. Thanks in advance!
Here is a random dataset I created along with the code I use:
dat <- data.frame("Eotaxin" = sample(100, size = 66, replace = TRUE), "GRO-alpha" = sample(100, size = 66, replace = TRUE),
"IL-1alpha" = sample(100, size = 66, replace = TRUE), "IL-ra" = sample(100, size = 66, replace = TRUE),
"IL-8" = sample(100, size = 66, replace = TRUE), "IP-10" = sample(100, size = 66, replace = TRUE),
"MIP-1beta" = sample(100, size = 66, replace = TRUE),
"SDF-1alpha" = sample(100, size = 66, replace = TRUE))
library('corrplot')
matrix_cor <- cor(dat)
# The same labels are used for the rows and the columns
labels <- c("Eotaxin",
":GRO*alpha",
":IL-1*alpha",
"IL-ra",
"IL-8",
"IP-10",
":MIP-1*beta",
":SDF-1*alpha")
dimnames(matrix_cor) <- list(labels, labels)
corrplot(matrix_cor, type = "upper",tl.col="black", tl.cex = 1)
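A possible fix, offered as a sketch: the labels that start with ":" are parsed as plotmath expressions, and in an expression such as IL-1*alpha the hyphen is a binary minus operator, which is drawn with space around it. Wrapping the literal part in quotes makes it a string that is drawn as typed. The helper name plotmath_label is hypothetical, and the example only inspects the parsed expressions in base R rather than drawing the plot.

```r
# Hypothetical helper: quote the literal text, then append a Greek symbol
plotmath_label <- function(text, greek) {
  paste0(':"', text, '"*', greek)
}

old_label <- ":IL-1*alpha"
new_label <- plotmath_label("IL-1", "alpha")  # gives :"IL-1"*alpha

# Strip the ':' prefix and look at how R parses each label
old_expr <- parse(text = sub("^:", "", old_label))[[1]]
new_expr <- parse(text = sub("^:", "", new_label))[[1]]

old_expr[[1]]  # `-`: the hyphen is treated as subtraction
new_expr[[2]]  # "IL-1": a plain string, so the hyphen is literal text
```

The new label can then be used in place of ":IL-1*alpha" in the row and column names of matrix_cor, and likewise for the other hyphenated names.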

Arguments must have same length when using tapply

data.frame(q1 = sample(c(1, 5), 200, replace = T, prob = c(1/2, 1/2)),
gender = sample(c("M", "F"), 200, replace = T, prob = c(2/3, 1/3))
) %>% tapply(.$q1,list(.$gender),FUN=sum)
I just want to use tapply to sum by gender, but I get the error below:
Error in tapply(., .$q1, list(.$gender), FUN = sum) :
arguments must have same length
Where's the problem?
For the sum example, you can use data.table syntax:
library(data.table)
df <- data.frame(q1 = sample(c(1, 5), 200, replace = T, prob = c(1/2, 1/2)),
gender = sample(c("M", "F"), 200, replace = T, prob = c(2/3, 1/3)))
as.data.table(df)[, sum(q1), by = gender]
This will also work with a function that has multiple return values, unlike my previous example with summarize:
as.data.table(df)[, shapiro.test(q1), by = gender]
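As for where the problem is: the error message shows the call that was actually made, tapply(., .$q1, list(.$gender), FUN = sum). The pipe still inserts the whole data frame as the first argument, so X is a data frame with 2 columns while the index has 200 elements. A base R sketch that avoids the pipe is shown below; with the magrittr pipe, wrapping the right-hand side in braces, as in { tapply(.$q1, .$gender, sum) }, should also stop the data frame from being inserted as the first argument.

```r
set.seed(1)
df <- data.frame(
  q1 = sample(c(1, 5), 200, replace = TRUE, prob = c(1/2, 1/2)),
  gender = sample(c("M", "F"), 200, replace = TRUE, prob = c(2/3, 1/3))
)

# X and INDEX are both columns of length 200
sums <- with(df, tapply(q1, gender, sum))

# Equivalent call without with()
sums2 <- tapply(df$q1, df$gender, sum)

sums
```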

Data.table - subsetting within groups during group by is slow

I'm trying to produce several aggregate statistics, and some of them need to be produced on a subset of each group. The data.table is quite large, 10 million rows, but using by without column subsetting is blazing fast (less than a second). Adding just one additional column which needs to be calculated on a subset of each group increases the running time by a factor of 12.
Is there a faster way to do this? Below is my full code.
library(data.table)
library(microbenchmark)
N = 10^7
DT = data.table(id1 = sample(1:400, size = N, replace = TRUE),
id2 = sample(1:100, size = N, replace = TRUE),
id3 = sample(1:50, size = N, replace = TRUE),
filter_var = sample(1:10, size = N, replace = TRUE),
x1 = sample(1:1000, size = N, replace = TRUE),
x2 = sample(1:1000, size = N, replace = TRUE),
x3 = sample(1:1000, size = N, replace = TRUE),
x4 = sample(1:1000, size = N, replace = TRUE),
x5 = sample(1:1000, size = N, replace = TRUE) )
setkey(DT, id1,id2,id3)
microbenchmark(
DT[, .(
sum_x1 = sum(x1),
sum_x2 = sum(x2),
sum_x3 = sum(x3),
sum_x4 = sum(x4),
sum_x5 = sum(x5),
avg_x1 = mean(x1),
avg_x2 = mean(x2),
avg_x3 = mean(x3),
avg_x4 = mean(x4),
avg_x5 = mean(x5)
) , by = c('id1','id2','id3')] , unit = 's', times = 10L)
min lq mean median uq max neval
0.942013 0.9566891 1.004134 0.9884895 1.031334 1.165144 10
microbenchmark( DT[, .(
sum_x1 = sum(x1),
sum_x2 = sum(x2),
sum_x3 = sum(x3),
sum_x4 = sum(x4),
sum_x5 = sum(x5),
avg_x1 = mean(x1),
avg_x2 = mean(x2),
avg_x3 = mean(x3),
avg_x4 = mean(x4),
avg_x5 = mean(x5),
sum_x1_F1 = sum(x1[filter_var < 5]) #this line slows everything down
) , by = c('id1','id2','id3')] , unit = 's', times = 10L)
min lq mean median uq max neval
12.24046 12.4123 12.83447 12.72026 13.49059 13.61248 10
GForce makes grouped operations run faster and will work on expressions like list(x = funx(X), y = funy(Y), ...) where X and Y are column names and funx and funy belong to the set of optimized functions.
For a full description of what works, see ?GForce.
To test if an expression works, read the messages from DT[, expr, by=, verbose=TRUE].
In the OP's case, we have sum_x1_F1 = sum(x1[filter_var < 5]) which is not covered by GForce even though sum(v) is. In this special case, we can make a var v = x1*condition and sum that:
DT[, v := x1*(filter_var < 5)]
system.time( DT[, .(
sum_x1 = sum(x1),
sum_x2 = sum(x2),
sum_x3 = sum(x3),
sum_x4 = sum(x4),
sum_x5 = sum(x5),
avg_x1 = mean(x1),
avg_x2 = mean(x2),
avg_x3 = mean(x3),
avg_x4 = mean(x4),
avg_x5 = mean(x5),
sum_x1_F1 = sum(v)
) , by = c('id1','id2','id3')])
# user system elapsed
# 0.63 0.19 0.81
For comparison, timing the OP's code on my computer:
system.time( DT[, .(
sum_x1 = sum(x1),
sum_x2 = sum(x2),
sum_x3 = sum(x3),
sum_x4 = sum(x4),
sum_x5 = sum(x5),
avg_x1 = mean(x1),
avg_x2 = mean(x2),
avg_x3 = mean(x3),
avg_x4 = mean(x4),
avg_x5 = mean(x5),
sum_x1_F1 = sum(x1[filter_var < 5]) #this line slows everything down
) , by = c('id1','id2','id3')])
# user system elapsed
# 9.00 0.02 9.06
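The trick works because multiplying by the logical condition turns the excluded rows into zeros, which do not change a sum. A small base R check of that equivalence, on made-up data with a single grouping column, might look like the following; it assumes there are no missing values in x1 or filter_var, and it does not carry over to statistics such as mean, where the zeros would count.

```r
set.seed(1)
n <- 1000
d <- data.frame(
  id = sample(1:5, n, replace = TRUE),
  filter_var = sample(1:10, n, replace = TRUE),
  x1 = sample(1:1000, n, replace = TRUE)
)

# Original form: subset inside each group, then sum
subset_sum <- sapply(split(d, d$id), function(g) sum(g$x1[g$filter_var < 5]))

# Rewritten form: zero out the excluded rows once, then do a plain grouped sum
d$v <- d$x1 * (d$filter_var < 5)
masked_sum <- tapply(d$v, d$id, sum)

all(subset_sum == masked_sum)
```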
