Consider, for example, the "iris" data frame that ships with the base installation of R:
names(iris)
# [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
levels(iris$Species)
# [1] "setosa" "versicolor" "virginica"
Now I construct three models without attaching iris:
t1=lm(iris$Sepal.Length ~ iris$Sepal.Width + iris$Petal.Length , data=iris)
t2=lm(iris$Sepal.Length ~ iris$Sepal.Width + iris$Petal.Length , data=iris[iris$Species=="setosa",])
t3=lm(iris$Sepal.Length ~ iris$Sepal.Width + iris$Petal.Length , data=iris , subset = (iris$Species=="setosa"))
Now I think t2 = t3 <> t1, but R says t1 = t2 <> t3. Why am I wrong?
Now I construct the same models again, this time after attaching iris:
attach(iris)
t1=lm(Sepal.Length ~ Sepal.Width + Petal.Length , data=iris)
t2=lm(Sepal.Length ~ Sepal.Width + Petal.Length , data=iris[iris$Species=="setosa",])
t3=lm(Sepal.Length ~ Sepal.Width + Petal.Length , data=iris , subset = (iris$Species=="setosa"))
Now R and I both think t2 = t3 <> t1. But I am still confused by the effect of attaching on the models: I thought the first set of models would be equivalent to the second set, but R says no. Thanks.
It's a scoping issue. If you do:
t1=lm(iris$Sepal.Length ~ iris$Sepal.Width + iris$Petal.Length , data=iris)
t2=lm(Sepal.Length ~ Sepal.Width + Petal.Length , data=iris[iris$Species=="setosa",])
t3=lm(iris$Sepal.Length ~ iris$Sepal.Width + iris$Petal.Length , data=iris , subset = (iris$Species=="setosa"))
You get the desired result.
coef(t1) == coef(t2)
(Intercept) iris$Sepal.Width iris$Petal.Length
FALSE FALSE FALSE
coef(t2) == coef(t3)
(Intercept) Sepal.Width Petal.Length
TRUE TRUE TRUE
When you write iris$Sepal.Length, R already knows where to look for that value, so the data argument is effectively ignored; the subset argument, however, is still applied to the rows of the model frame, which is why t3 above matches t2. As mentioned in the comments, there is no need to use foo$bar when data = foo is supplied, and this situation is a good example of why not to do so.
Two methods for fitting a linear model on a subset:
Creating the subset manually
setosa <- subset(iris, subset = Species == "setosa")
t1 <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data=setosa)
Using the subset argument in lm()
t2 <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data=iris, subset = Species == "setosa")
t1 and t2 are equivalent. However, if you use iris$ in the lm() call, R ignores what is passed to data, since you are handing the function complete vectors rather than column names to look up in the data frame. This is an incorrect way to use lm().
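For instance, a quick check (using the built-in iris data) confirms that the two approaches give identical fits:

```r
# Fit the same setosa-only model both ways and compare the coefficients
setosa <- subset(iris, subset = Species == "setosa")
m1 <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = setosa)
m2 <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris,
         subset = Species == "setosa")
all.equal(coef(m1), coef(m2))  # TRUE
```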
Related
I want to calculate an "auto" VIF for each variable, one versus the others. For example, in the iris dataset I want Sepal.Width to act as the target variable and the others as explanatory variables.
First I remove the Species column, so only the numeric variables remain. Then I want to loop over each variable and test it against the others. Finally, I want the VIF results to be stored in a list.
This is what I have tried:
library(car)
library(dplyr)
library(tidyr)
iris_clean <- iris %>%
select(-Species)
col_names <- colnames(iris)
i <- 1
for(col in col_names) {
  regr <- lm(col ~ ., data=iris_clean)
  list[i] <- vif(regr)
  i <- i+1
}
For some reason I get an error:
Error in model.frame.default(formula = col ~ ., data = iris_clean, drop.unused.levels = TRUE) :
variable lengths differ (found for 'Sepal.Length')
Which I don't understand, because the variables all have the same length. Any help will be greatly appreciated.
You can try this -
library(car)
library(dplyr)
library(tidyr)
iris_clean <- iris %>% select(-Species)
col_names <- colnames(iris_clean)
result <- vector('list', length(col_names))
for(i in seq_along(col_names)) {
  regr <- lm(paste0(col_names[i], ' ~ .'), data = iris_clean)
  result[[i]] <- vif(regr)
}
result
#[[1]]
# Sepal.Width Petal.Length Petal.Width
# 1.270815 15.097572 14.234335
#[[2]]
#Sepal.Length Petal.Length Petal.Width
# 4.278282 19.426391 14.089441
#[[3]]
#Sepal.Length Sepal.Width Petal.Width
# 3.415733 1.305515 3.889961
#[[4]]
#Sepal.Length Sepal.Width Petal.Length
# 6.256954 1.839639 7.557780
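The same loop can also be written without an explicit counter using lapply. A base-R sketch (so it runs without car; each VIF is computed by hand as 1/(1 - R²), which is what vif() reports for a linear model):

```r
iris_clean <- iris[, setdiff(names(iris), "Species")]
vars <- colnames(iris_clean)

result <- lapply(vars, function(nm) {
  preds <- setdiff(vars, nm)  # predictors of the model with response nm
  # VIF of predictor p = 1 / (1 - R^2) of p regressed on the other predictors
  sapply(preds, function(p) {
    r2 <- summary(lm(reformulate(setdiff(preds, p), response = p),
                     data = iris_clean))$r.squared
    1 / (1 - r2)
  })
})
names(result) <- vars
result$Sepal.Width
```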
I am relatively new to R. I want to create a function whose arguments become variables (objects) in a glm call made inside the function.
I have a dataframe containing multiple variables (columns). I wish to run multiple logistic regressions using glm() using the same predictors (terms) with only one changing, hence the need for the function. I would like to be able to specify the name of this predictor and of the object created by glm().
For example:
myfunc <-function (myvar, mymodel) {
mymodel <- glm (var1 ~ var2 + var3 + var4 + myvar, data = myframe, family = "binomial")
}
I would like the arguments of my function to let me run the same analysis multiple times, replacing one variable each time and storing the results in different objects. For example,
myfunc(var_A, model_A)
should be equivalent to
model_A<- glm (var1 ~ var2 + var3 + var4 + var_A, data = myframe, family = "binomial")
and
myfunc(var_B, model_B)
should be equivalent to
model_B<- glm (var1 ~ var2 + var3 + var4 + var_B, data = myframe, family = "binomial")
And so on.
I cannot figure out how to write my function so that the names passed as arguments become objects within the function.
Another possible solution is:
data(iris)
iris <- iris[1:100,]
myfunc <- function (myvar, mymodel) {
  formula <- reformulate(termlabels = c('Sepal.Length', myvar), response = 'Species')
  model <- glm(formula, data = iris, family = "binomial")
  assign(mymodel, model, pos = globalenv())
}
myfunc("Sepal.Width","model_a")
Here the inputs of the function must be character strings. With your example, the function would look like this:
myfunc <- function (myvar, mymodel) {
  formula <- reformulate(termlabels = c("var2", "var3", "var4", myvar), response = 'var1')
  model <- glm(formula, data = myframe, family = "binomial")
  assign(mymodel, model, pos = globalenv())
}
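As a side note, assign()-ing into the global environment from inside a function is usually discouraged in R. A common alternative, sketched here on the same iris subset (myfunc2 is a hypothetical name), is to return the model and let the caller choose the object name:

```r
iris2 <- iris[1:100, ]  # two species, as in the example above

myfunc2 <- function(myvar) {
  f <- reformulate(termlabels = c("Sepal.Length", myvar), response = "Species")
  glm(f, data = iris2, family = "binomial")
}

model_a <- myfunc2("Sepal.Width")  # the caller names the result
```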
You can try using the lapply function
For example, see the below code
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
I will try to generate a list of linear models with Sepal.Length as the dependent variable, Sepal.Width as the common independent variable across models, and one additional variable each time from the set ("Petal.Length", "Petal.Width", "Species").
changing.variables <- c("Petal.Length", "Petal.Width", "Species")
df <- iris[changing.variables]
list.lm <- lapply(df, FUN = function(x) lm(Sepal.Length~Sepal.Width+x, data = iris))
list.lm
Output
> list.lm
$Petal.Length
Call:
lm(formula = Sepal.Length ~ Sepal.Width + x, data = iris)
Coefficients:
(Intercept) Sepal.Width x
2.2491 0.5955 0.4719
$Petal.Width
Call:
lm(formula = Sepal.Length ~ Sepal.Width + x, data = iris)
Coefficients:
(Intercept) Sepal.Width x
3.4573 0.3991 0.9721
$Species
Call:
lm(formula = Sepal.Length ~ Sepal.Width + x, data = iris)
Coefficients:
(Intercept) Sepal.Width xversicolor xvirginica
2.2514 0.8036 1.4587 1.9468
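If the generic x in the printed calls is a problem, the same list can be built from the variable names with reformulate, so each model's formula and coefficient names show the real predictor:

```r
changing.variables <- c("Petal.Length", "Petal.Width", "Species")

# Named input vector -> named output list; each model gets its own formula
list.lm2 <- lapply(setNames(changing.variables, changing.variables),
                   function(v) lm(reformulate(c("Sepal.Width", v), "Sepal.Length"),
                                  data = iris))
names(coef(list.lm2$Petal.Length))
# "(Intercept)" "Sepal.Width" "Petal.Length"
```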
I would like to get the regression residuals, but only for a subset of the rows in my data:
My R code:
reg = lm(Y ~ X1+X2+.....+Xn,data=fic)
step_reg = step(reg, direction= "both")
summary(step_reg)
fic is a dataframe with n columns called X1, X2, ...Xn.
To get all residuals: step_reg$residuals
But I would like to get the residuals only for rows that satisfy a criterion, for example X1 == 'xxxx'.
What could be the solution, please?
You can use the data you used for the regression to subset the residuals like:
reg <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data=iris)
step_reg <- step(reg, direction= "both")
step_reg$residuals[iris$Species=="setosa"]
In case there are missing values:
x <- iris
x[1,2] <- NA
reg <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data=x)
reg$residuals[iris[names(reg$residuals), "Species"] == "setosa"]
Hi, I'm starting to use R and am stuck on analyzing my data. I have a dataframe with 80 columns: column 1 is the dependent variable, and columns 2 to 80 are the independent variables. I want to perform 78 multiple linear regressions, keeping the first independent variable (column 2) in every model, and save all the regressions in a list so I can later compare the models using AIC scores. How can I do it?
Here is my loop
data.frame
for(i in 2:80)
{
  Regressions <- lm(data.frame$column1 ~ data.frame$column2 + data.frame[,i])
}
Using the iris dataset as an example you can do:
lapply(seq_along(iris)[-c(1:2)], function(x) lm(data = iris[,c(1:2, x)]))
[[1]]
Call:
lm(data = iris[, c(1:2, x)])
Coefficients:
(Intercept) Sepal.Width Petal.Length
2.2491 0.5955 0.4719
[[2]]
Call:
lm(data = iris[, c(1:2, x)])
Coefficients:
(Intercept) Sepal.Width Petal.Width
3.4573 0.3991 0.9721
[[3]]
Call:
lm(data = iris[, c(1:2, x)])
Coefficients:
(Intercept) Sepal.Width Speciesversicolor Speciesvirginica
2.2514 0.8036 1.4587 1.9468
This works because when you pass a dataframe to lm() without a formula it applies the function DF2formula() under the hood which treats the first column as the response and all other columns as predictors.
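That conversion can be inspected directly (DF2formula() has been exported from stats since R 3.6):

```r
# The first column becomes the response, the rest become predictors
f <- stats::DF2formula(iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length")])
deparse(f)
# "Sepal.Length ~ Sepal.Width + Petal.Length"
```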
With a for loop, we can initialize a list to store the output:
nm1 <- names(df1)[2:80]
Regressions <- vector('list', length(nm1))
for(i in seq_along(Regressions)) {
  Regressions[[i]] <- lm(reformulate(c("column2", nm1[i]), "column1"), data = df1)
}
Or use paste instead of reformulate
for(i in seq_along(Regressions)) {
  Regressions[[i]] <- lm(as.formula(paste0("column1 ~ column2 + ", nm1[i])), data = df1)
}
Using a reproducible example
nm2 <- names(iris)[3:5]
Regressions2 <- vector('list', length(nm2))
for(i in seq_along(Regressions2)) {
  Regressions2[[i]] <- lm(reformulate(c("Sepal.Width", nm2[i]), "Sepal.Length"), data = iris)
}
Regressions2[[1]]
#Call:
#lm(formula = reformulate(c("Sepal.Width", nm2[i]), "Sepal.Length"),
# data = iris)
#Coefficients:
# (Intercept) Sepal.Width Petal.Length
# 2.2491 0.5955 0.4719
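With the models collected in a list, the AIC comparison asked about in the question is then a single sapply call over the list:

```r
nm2 <- names(iris)[3:5]
Regressions2 <- lapply(nm2, function(v)
  lm(reformulate(c("Sepal.Width", v), "Sepal.Length"), data = iris))
names(Regressions2) <- nm2

aic <- sapply(Regressions2, AIC)
aic[which.min(aic)]  # best-scoring model (lowest AIC)
```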
I am using the neuralnet package, with several inputs to predict an output.
Originally, my output was a factor variable, and I saw this error:
Error in neurons[[i]] %*% weights[[i]] :
requires numeric/complex matrix/vector arguments
When I converted the output to a numeric variable, the error disappeared. Is there a way to fit a neural network with a factor output?
I adapted code that I found at this site, which uses the iris dataset with the neuralnet package to predict iris species from the morphological data.
Without a reproducible example, I'm not sure whether this applies to your case. The key here is to convert each level of the factor response into its own binary variable. Prediction is also a bit different from other models in R: you choose the factor level with the highest score.
Example code:
library(neuralnet)
# Make training and validation data
set.seed(1)
train <- sample(nrow(iris), nrow(iris)*0.5)
valid <- seq(nrow(iris))[-train]
iristrain <- iris[train,]
irisvalid <- iris[valid,]
# Binarize the categorical output
iristrain <- cbind(iristrain, iristrain$Species == 'setosa')
iristrain <- cbind(iristrain, iristrain$Species == 'versicolor')
iristrain <- cbind(iristrain, iristrain$Species == 'virginica')
names(iristrain)[6:8] <- c('setosa', 'versicolor', 'virginica')
# Fit model
nn <- neuralnet(
  setosa + versicolor + virginica ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
  data = iristrain,
  hidden = c(3)
)
plot(nn)
# Predict
comp <- compute(nn, irisvalid[-5])
pred.weights <- comp$net.result
idx <- apply(pred.weights, 1, which.max)
pred <- c('setosa', 'versicolor', 'virginica')[idx]
table(pred, irisvalid$Species)
#pred setosa versicolor virginica
# setosa 23 0 0
# versicolor 1 21 7
# virginica 0 1 22
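From predictions like these, the overall validation accuracy is just the proportion of matches. A small sketch with hypothetical pred and truth vectors standing in for pred and irisvalid$Species above:

```r
# Hypothetical labels for illustration only
pred  <- c("setosa", "setosa", "versicolor", "virginica", "versicolor")
truth <- c("setosa", "setosa", "versicolor", "virginica", "virginica")

mean(pred == truth)  # overall accuracy: 0.8
```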
This might raise warnings:
nn <- neuralnet(
  setosa + versicolor + virginica ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
  data = iristrain,
  hidden = c(3)
)
So replace it with:
nn <- neuralnet(
  setosa + versicolor + virginica ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
  data = iristrain, hidden = 3, lifesign = "full")
If this does not work:
comp <- compute(nn, irisvalid[-5])
then use
comp <- neuralnet::compute(nn, irisvalid[,1:4])