substitute() results in an error with cor.test - r

I am trying to write a loop that returns linear regression and correlation parameters. When I pass a substitute() expression to cor.test, I get an unexpected error:
data(iris)
i <- 1
vars <- list(par = as.name(colnames(iris)[1]), expl = as.name(colnames(iris)[2:4][i]))
lm(substitute(par ~ expl, vars), data = iris) # works
lm(Sepal.Length ~ Sepal.Width, data = iris) # works. Result equal to the statement above
cor.test(~Sepal.Length + Sepal.Width, data = iris) # works
cor.test(substitute(~par + expl, vars), data = iris) # does not work
## Error in cor.test.default(substitute(~par + expl, vars), data = iris) :
## argument "y" is missing, with no default
To my understanding, the cor.test statement built with substitute should be the same as the manually typed one.
What is the reason for the error? How can I write a substitute statement for cor.test that works?

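The error arises because the manually typed version is a formula object, while the substitute() version is an unevaluated call (a language object) without the "formula" class. cor.test is an S3 generic, so the call is dispatched to cor.test.default instead of cor.test.formula, which is why it complains about a missing y: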
str(substitute(~par + expl, vars))
# language ~Sepal.Length + Sepal.Width
str(~Sepal.Length + Sepal.Width)
# Class 'formula' language ~Sepal.Length + Sepal.Width
# ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
If you use as.formula on the second version it works:
cor.test(as.formula(substitute(~par + expl, vars)), data = iris)
# Pearson's product-moment correlation
#
# data: Sepal.Length and Sepal.Width
# t = -1.4403, df = 148, p-value = 0.1519
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
# -0.27269325 0.04351158
# sample estimates:
# cor
# -0.1175698
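Since the original goal was a loop, here is a minimal sketch of how the as.formula() fix could be applied to each explanatory column in turn. The names response, explanatory and results are introduced only for this illustration and are not from the question.

```r
# Sketch: run cor.test for Sepal.Length against each of columns 2 to 4.
response <- "Sepal.Length"
explanatory <- colnames(iris)[2:4]

results <- lapply(explanatory, function(v) {
  vars <- list(par = as.name(response), expl = as.name(v))
  f_call <- substitute(~ par + expl, vars)  # an unevaluated call, class "call"
  f <- as.formula(f_call)                   # now carries class "formula"
  cor.test(f, data = iris)                  # dispatches to cor.test.formula
})
names(results) <- explanatory

# Correlation estimate for each explanatory variable
sapply(results, function(x) unname(x$estimate))
```

Checking class(f_call) and class(f) inside the function is a quick way to confirm which method cor.test will pick.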

Related

Is there a way to change default formula in mlr3?

I'm studying the {mlr3} package and I would like to compare two models:
Y ~ .
Y ~ .^2
I did not find a way in the documentation to specify the formula when creating the task; the default is ~ .
I can find a hacky way with model.matrix and set up another task with these new columns, but I might be missing something.
You are probably looking for PipeOpModelMatrix from mlr3pipelines.
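For instance:
library(mlr3)
library(mlr3learners) # provides the regr.lm learner
library(mlr3pipelines)
task <- as_task_regr(iris[1:4], target = "Sepal.Length")
learner <- lrn("regr.lm")
pop <- po("modelmatrix", formula = ~ . ^ 2)
# chain the model-matrix step into the learner and wrap it as a learner
gr <- GraphLearner$new(pop %>>% learner)
gr$train(task)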
gr$model$modelmatrix$outtasklayout
#output
id type
1: (Intercept) numeric
2: Petal.Length numeric
3: Petal.Length:Petal.Width numeric
4: Petal.Length:Sepal.Width numeric
5: Petal.Width numeric
6: Petal.Width:Sepal.Width numeric
7: Sepal.Width numeric
gr$model$regr.lm$model
#output
Call:
stats::lm(formula = task$formula(), data = task$data())
Coefficients:
(Intercept) `(Intercept)` Petal.Length Petal.Width Sepal.Width
1.39837 NA 1.15756 -1.66219 0.84812
`Petal.Length:Petal.Width` `Petal.Length:Sepal.Width` `Petal.Width:Sepal.Width`
0.06695 -0.16772 0.27626
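To see what the ~ . ^ 2 formula expands to without any mlr3 machinery, here is a small base R sketch using model.matrix() and lm() directly. It is only an illustration of the formula, not of what mlr3pipelines does internally. It also suggests why the output above reports NA for the extra (Intercept) term: the model matrix already holds a constant column, and lm then adds an intercept of its own, so the two are most likely collinear.

```r
# Predictors only (the target Sepal.Length is column 1 of iris)
X <- iris[2:4]

# All main effects plus all pairwise interactions, with an intercept column
mm <- model.matrix(~ .^2, data = X)
colnames(mm)

# Same expansion without the intercept column
mm0 <- model.matrix(~ 0 + .^2, data = X)
colnames(mm0)

# The two models from the question, fitted with plain lm for comparison
fit_main <- lm(Sepal.Length ~ ., data = iris[1:4])
fit_int <- lm(Sepal.Length ~ .^2, data = iris[1:4])
AIC(fit_main, fit_int)
```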

Residuals from R regression of a data subset

I would like to get regression residuals, but only for a subset of the data.
My R code:
reg = lm(Y ~ X1+X2+.....+Xn,data=fic)
step_reg = step(reg, direction= "both")
summary(step_reg)
fic is a dataframe with n columns called X1, X2, ...Xn.
To get all residuals: step_reg$residuals
But I would like to get residuals only for rows that meet a criterion, for example X1 = 'xxxx'.
What could be the solution, please?
You can use the data you used for the regression to subset the residuals like:
reg <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data=iris)
step_reg <- step(reg, direction= "both")
step_reg$residuals[iris$Species=="setosa"]
In case there are missing values:
x <- iris
x[1,2] <- NA
reg <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data=x)
# residuals are named by row, so look up Species by those names to skip the dropped row
reg$residuals[x[names(reg$residuals), "Species"] == "setosa"]
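Another option for the missing-value case, sketched here as an alternative rather than as part of the answer above, is to fit with na.action = na.exclude. The residuals() extractor then pads the dropped rows with NA, so the result has one entry per row of the data and a plain logical index lines up again. Note that this applies to residuals(reg), not to reg$residuals.

```r
x <- datasets::iris
x[1, 2] <- NA  # introduce a missing Sepal.Width in row 1

reg <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
          data = x, na.action = na.exclude)

res <- residuals(reg)          # padded with NA for the excluded row
length(res) == nrow(x)

# Logical indexing on the original data now aligns row by row
setosa_res <- res[x$Species == "setosa"]
head(setosa_res)
```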

Loop for multiple linear regression

Hi, I'm starting to use R and am stuck on analyzing my data. I have a dataframe with 80 columns. Column 1 is the dependent variable and columns 2 to 80 are the independent variables. I want to perform 78 multiple linear regressions, keeping the first independent variable (column 2) fixed in the model, and create a list where I can save all the regressions so that I can later compare the models using AIC scores. How can I do it?
Here is my loop
data.frame
for(i in 2:80)
{
Regressions <- lm(data.frame$column1 ~ data.frame$column2 + data.frame [,i])
}
Using the iris dataset as an example you can do:
lapply(seq_along(iris)[-c(1:2)], function(x) lm(data = iris[,c(1:2, x)]))
[[1]]
Call:
lm(data = iris[, c(1:2, x)])
Coefficients:
(Intercept) Sepal.Width Petal.Length
2.2491 0.5955 0.4719
[[2]]
Call:
lm(data = iris[, c(1:2, x)])
Coefficients:
(Intercept) Sepal.Width Petal.Width
3.4573 0.3991 0.9721
[[3]]
Call:
lm(data = iris[, c(1:2, x)])
Coefficients:
(Intercept) Sepal.Width Speciesversicolor Speciesvirginica
2.2514 0.8036 1.4587 1.9468
This works because when you pass a dataframe to lm() without a formula it applies the function DF2formula() under the hood which treats the first column as the response and all other columns as predictors.
With the for loop we can initialize a list to store the output
nm1 <- names(df1)[3:80] # columns 3 to 80; column2 stays fixed in every model
Regressions <- vector('list', length(nm1))
for(i in seq_along(Regressions)) {
Regressions[[i]] <- lm(reformulate(c("column2", nm1[i]), "column1"), data = df1)
}
Or use paste instead of reformulate
for(i in seq_along(Regressions)) {
Regressions[[i]] <- lm(as.formula(paste0("column1 ~ column2 + ",
nm1[i])), data = df1)
}
Using a reproducible example
nm2 <- names(iris)[3:5]
Regressions2 <- vector('list', length(nm2))
for(i in seq_along(Regressions2)) {
Regressions2[[i]] <- lm(reformulate(c("Sepal.Width", nm2[i]), "Sepal.Length"), data = iris)
}
Regressions2[[1]]
#Call:
#lm(formula = reformulate(c("Sepal.Width", nm2[i]), "Sepal.Length"),
# data = iris)
#Coefficients:
# (Intercept) Sepal.Width Petal.Length
# 2.2491 0.5955 0.4719
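To finish the job the question asks for, the stored fits can be compared by AIC. The following is a minimal sketch on iris; the names fits and aic are introduced only for this example.

```r
# Fit Sepal.Length ~ Sepal.Width + <one more column> for columns 3 to 5
nm2 <- names(iris)[3:5]
fits <- lapply(nm2, function(v) {
  lm(reformulate(c("Sepal.Width", v), response = "Sepal.Length"), data = iris)
})
names(fits) <- nm2  # label each model by the variable that was added

# One AIC value per model, sorted from lowest to highest
aic <- sapply(fits, AIC)
sort(aic)

# Name of the added variable giving the lowest AIC
names(which.min(aic))
```

Naming the list elements makes it easy to pull out a specific model later, for example fits[["Petal.Width"]].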

Comparing percent change of model coefficients

I am working through step 3 of purposeful model-building from Hosmer-Lemeshow, which suggests comparing the percent change in coefficients between a full model [Iris.mod1] and a reduced model [Iris.mod2]. I would like to automate this step if possible.
Right now I have the following code:
#Make species a binomial DV
iris = subset(iris, iris$Species != 'virginica')
iris$Species = as.numeric(ifelse(iris$Species == 'setosa', 1, 0))
#Build models
Iris.mod1 = glm(Species~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width,
data = iris, family = binomial())
Iris.mod2 = glm(Species~Sepal.Length+Petal.Length, data = iris, family =
binomial())
The dataset I am actually using has about 93 variables and 1.7 million rows. But I am using the iris data just for this example.
#Try to see if any coefficients changed by > 20%
paste(names(which((summary(Iris.mod1)$coefficients[2:
(nrow(summary(Iris.mod1)$coefficients)),1] -
(summary(Iris.mod2)$coefficients[2:
(nrow(summary(Iris.mod2)$coefficients)),1]/
(summary(Iris.mod1)$coefficients[2:nrow(summary(Iris.mod1)$coefficients)),1]
> 0.2 == TRUE)))))
However, this code is full of errors and I am lost in a sea of parentheses.
Is there an efficient way to determine which variables' coefficients changed by more than 20%?
Thank you in advance.
The broom package is really nice for making data frames of model coefficients and terms. We can use that to get things in a workable format:
library(broom)
m_list = list(m1 = Iris.mod1, m2 = Iris.mod2)
t_list = lapply(m_list, tidy)
library(dplyr)
library(tidyr)
bind_rows(t_list, .id = "mod") %>%
select(term, estimate, mod) %>%
spread(key = mod, value = estimate) %>%
mutate(p_change = (m2 - m1) / m1 * 100,
p_change_gt_20 = abs(p_change) > 20) # abs() so decreases beyond 20% are flagged too
# term m1 m2 p_change p_change_gt_20
# 1 (Intercept) -6.556265 -65.84266 904.2709 TRUE
# 2 Petal.Length -19.053588 -49.04616 157.4117 TRUE
# 3 Petal.Width -25.032928 NA NA NA
# 4 Sepal.Length 9.878866 37.56141 280.2199 TRUE
# 5 Sepal.Width 7.417640 NA NA NA
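If you would rather avoid extra packages, the same comparison can be sketched in base R with coef(). The helper coef_pct_change below is hypothetical (it is not from broom or any other package), and the example uses lm on the untouched iris data only to keep it free of the separation warnings the binomial fits can produce; the helper relies on nothing but coef(), so it should accept glm fits as well.

```r
# Hypothetical helper: percent change of shared coefficients, full -> reduced
coef_pct_change <- function(full, reduced, threshold = 20) {
  b_full <- coef(full)
  b_red <- coef(reduced)
  # only terms present in both models, leaving out the intercept
  shared <- setdiff(intersect(names(b_full), names(b_red)), "(Intercept)")
  pct <- (b_red[shared] - b_full[shared]) / b_full[shared] * 100
  data.frame(term = shared,
             full = unname(b_full[shared]),
             reduced = unname(b_red[shared]),
             pct_change = unname(pct),
             flagged = unname(abs(pct) > threshold),
             stringsAsFactors = FALSE)
}

d <- datasets::iris
full <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = d)
reduced <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = d)

out <- coef_pct_change(full, reduced)
out
```

Terms that exist only in the full model are dropped instead of showing up as NA rows.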

Using neural networks neuralnet in R to predict factor values

I am using the neuralnet package with several inputs to predict an output.
Originally my output was a factor variable, and I got the error:
Error in neurons[[i]] %*% weights[[i]] :
requires numeric/complex matrix/vector arguments
When I converted the output to a numeric variable, the error disappeared. Is there a way to fit a neural network with a factor output?
I adapted code that I found at this site, which uses the iris dataset with the neuralnet package to predict iris species from the morphological data.
Without a reproducible example, I'm not sure if this applies to your case. The key here was to convert each level of the factor response into its own binary variable. Prediction is a bit different from other models in R: you choose the factor level with the highest score.
Example code:
library(neuralnet)
# Make training and validation data
set.seed(1)
train <- sample(nrow(iris), nrow(iris)*0.5)
valid <- seq(nrow(iris))[-train]
iristrain <- iris[train,]
irisvalid <- iris[valid,]
# Binarize the categorical output
iristrain <- cbind(iristrain, iristrain$Species == 'setosa')
iristrain <- cbind(iristrain, iristrain$Species == 'versicolor')
iristrain <- cbind(iristrain, iristrain$Species == 'virginica')
names(iristrain)[6:8] <- c('setosa', 'versicolor', 'virginica')
# Fit model
nn <- neuralnet(
setosa+versicolor+virginica ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
data=iristrain,
hidden=c(3)
)
plot(nn)
# Predict
comp <- compute(nn, irisvalid[-5])
pred.weights <- comp$net.result
idx <- apply(pred.weights, 1, which.max)
pred <- c('setosa', 'versicolor', 'virginica')[idx]
table(pred, irisvalid$Species)
#pred setosa versicolor virginica
# setosa 23 0 0
# versicolor 1 21 7
# virginica 0 1 22
This might raise warnings:
nn <- neuralnet(
setosa+versicolor+virginica ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
data=iristrain,
hidden=c(3)
)
So replace it with:
nn <- neuralnet(
setosa+versicolor+virginica ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
data=iristrain, hidden = 3,lifesign = "full")
If this does not work:
comp <- compute(nn, irisvalid[-5])
then use
comp <- neuralnet::compute(nn, irisvalid[,1:4])
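The binarize and decode steps can also be tried on their own in base R, without fitting a network. This is only a sketch: model.matrix() builds the indicator columns in one call, and the scores matrix is a made-up stand-in for the network output (comp$net.result above), not real predictions.

```r
d <- datasets::iris

# One indicator column per factor level (the "- 1" drops the intercept)
onehot <- model.matrix(~ Species - 1, data = d)
colnames(onehot) <- levels(d$Species)
head(onehot, 3)

# Stand-in for network output: one score per level and row (assumed values)
scores <- onehot * 0.9 + 0.05

# Decode: pick the level with the highest score in each row
idx <- unname(apply(scores, 1, which.max))
pred <- colnames(scores)[idx]
table(pred, d$Species)
```

The indicator columns can be attached to the training data with cbind() in the same way as in the example above.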
