Is there a way to change the default formula in mlr3?

I'm studying the {mlr3} package and I would like to compare 2 models:
Y ~ .
Y ~ .^2
I did not find a way in the documentation to specify the formula when creating the task; the default is ~ ..
I can find a hacky way with model.matrix and set up another task with these new columns, but I might be missing something.
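For reference, the workaround I have in mind looks roughly like this (a sketch on iris with Sepal.Length as the target; the column handling is just illustrative):
library(mlr3)
# build the expanded design matrix by hand, then create a new task from it
dat <- iris[1:4]
X <- model.matrix(~ . ^ 2, data = dat[setdiff(names(dat), "Sepal.Length")])
dat2 <- data.frame(Sepal.Length = dat$Sepal.Length, X[, -1])  # drop the intercept column
task2 <- as_task_regr(dat2, target = "Sepal.Length")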

You are probably looking for the PipeOpModelMatrix pipe operator from mlr3pipelines.
For instance:
library(mlr3)
library(mlr3pipelines)

task <- as_task_regr(iris[1:4], target = "Sepal.Length")
lrn <- lrn("regr.lm")
pop <- po("modelmatrix", formula = ~ . ^ 2)

# chain the model-matrix step and the learner into a single graph learner
gr <- GraphLearner$new(pop %>>% lrn)
gr$train(task)

# features the learner sees after the model-matrix expansion
gr$model$modelmatrix$outtasklayout
# output
                          id    type
1:              (Intercept) numeric
2:             Petal.Length numeric
3: Petal.Length:Petal.Width numeric
4: Petal.Length:Sepal.Width numeric
5:              Petal.Width numeric
6:  Petal.Width:Sepal.Width numeric
7:              Sepal.Width numeric
gr$model$regr.lm$model
# output
Call:
stats::lm(formula = task$formula(), data = task$data())

Coefficients:
               (Intercept)              `(Intercept)`               Petal.Length                Petal.Width                Sepal.Width
                   1.39837                         NA                    1.15756                   -1.66219                    0.84812
`Petal.Length:Petal.Width` `Petal.Length:Sepal.Width`  `Petal.Width:Sepal.Width`
                   0.06695                   -0.16772                    0.27626
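To actually compare the two formulas (Y ~ . versus Y ~ .^2), you can then benchmark both learners. A rough sketch, where the resampling scheme and measure are arbitrary choices:
library(mlr3)
library(mlr3pipelines)
task <- as_task_regr(iris[1:4], target = "Sepal.Length")
# Y ~ . : plain linear model on the original features
lrn_plain <- lrn("regr.lm")
# Y ~ .^2 : expand to all two-way interactions first, then fit the linear model
lrn_inter <- as_learner(po("modelmatrix", formula = ~ . ^ 2) %>>% lrn("regr.lm"))
design <- benchmark_grid(task, list(lrn_plain, lrn_inter), rsmp("cv", folds = 5))
bmr <- benchmark(design)
bmr$aggregate(msr("regr.rmse"))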


How to create a function in R so that the names of the arguments become objects within my function?

I am relatively new to R. I want to create a function in which arguments will be variables (objects) in the function glm called within my function.
I have a dataframe containing multiple variables (columns). I wish to run multiple logistic regressions with glm(), using the same predictors (terms) with only one of them changing each time, hence the need for the function. I would like to be able to specify the name of this predictor and of the object created by glm().
For example:
myfunc <- function(myvar, mymodel) {
  mymodel <- glm(var1 ~ var2 + var3 + var4 + myvar, data = myframe, family = "binomial")
}
I would like the arguments of my function to allow me to run the same analysis multiple times, replacing one variable and obtaining the results in different objects. For example,
myfunc(var_A, model_A)
should be equivalent to
model_A<- glm (var1 ~ var2 + var3 + var4 + var_A, data = myframe, family = "binomial")
and
myfunc(var_B, model_B)
should be equivalent to
model_B<- glm (var1 ~ var2 + var3 + var4 + var_B, data = myframe, family = "binomial")
And so on.
I cannot find how to write my function so that the names of the arguments become objects within my function.
A possible solution is:
data(iris)
iris <- iris[1:100, ]  # keep only two species so the response is binary
myfunc <- function(myvar, mymodel) {
  formula <- reformulate(termlabels = c("Sepal.Length", myvar), response = "Species")
  model <- glm(formula, data = iris, family = "binomial")
  assign(mymodel, model, pos = globalenv())
}
myfunc("Sepal.Width","model_a")
Here the inputs of the function must be character strings. With your example the function should look like this:
myfunc <- function(myvar, mymodel) {
  formula <- reformulate(termlabels = c("var2", "var3", "var4", myvar), response = "var1")
  model <- glm(formula, data = myframe, family = "binomial")
  assign(mymodel, model, pos = globalenv())
}
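With this version the calls from your example become (note the quoted arguments; myframe and the var* columns are assumed to already exist in your workspace):
myfunc("var_A", "model_A")
myfunc("var_B", "model_B")
summary(model_A)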
You can try using the lapply function. For example, see the code below:
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
I will generate a list of linear models with Sepal.Length as the dependent variable, Sepal.Width as a common independent variable across models, and one more variable each time from the set ("Petal.Length", "Petal.Width", "Species").
changing.variables <- c("Petal.Length", "Petal.Width", "Species")
df <- iris[changing.variables]
list.lm <- lapply(df, FUN = function(x) lm(Sepal.Length~Sepal.Width+x, data = iris))
list.lm
Output:
> list.lm
$Petal.Length
Call:
lm(formula = Sepal.Length ~ Sepal.Width + x, data = iris)
Coefficients:
(Intercept)  Sepal.Width            x
     2.2491       0.5955       0.4719

$Petal.Width
Call:
lm(formula = Sepal.Length ~ Sepal.Width + x, data = iris)
Coefficients:
(Intercept)  Sepal.Width            x
     3.4573       0.3991       0.9721

$Species
Call:
lm(formula = Sepal.Length ~ Sepal.Width + x, data = iris)
Coefficients:
(Intercept)  Sepal.Width  xversicolor   xvirginica
     2.2514       0.8036       1.4587       1.9468
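If the x in the coefficient names bothers you, a variant of the same idea builds each formula with reformulate so the output shows the actual variable name (a sketch):
changing.variables <- c("Petal.Length", "Petal.Width", "Species")
list.lm <- lapply(changing.variables, function(v) {
  f <- reformulate(c("Sepal.Width", v), response = "Sepal.Length")
  lm(f, data = iris)
})
names(list.lm) <- changing.variables
# p-values of the coefficients for each model
lapply(list.lm, function(m) coef(summary(m))[, "Pr(>|t|)"])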

Substitute results in an error with cor.test

I am trying to write a loop which returns linear regression and correlation parameters. When trying to pass a substitute call to the cor.test function, I encounter an unexpected error:
data(iris)
i <- 1
vars <- list(par = as.name(colnames(iris)[1]), expl = as.name(colnames(iris)[2:4][i]))
lm(substitute(par ~ expl, vars), data = iris) # works
lm(Sepal.Length ~ Sepal.Width, data = iris) # works. Result equal to the statement above
cor.test(~Sepal.Length + Sepal.Width, data = iris) # works
cor.test(substitute(~par + expl, vars), data = iris) # does not work
## Error in cor.test.default(substitute(~par + expl, vars), data = iris) :
## argument "y" is missing, with no default
To my understanding, the cor.test statement should be the same as the manually inputted one.
What is the reason for the error? How can I write a substitute statement for cor.test that works?
The error stems from the fact that the manually typed version is of formula type, while the substitute version is an unevaluated language object:
str(substitute(~par + expl, vars))
# language ~Sepal.Length + Sepal.Width
str(~Sepal.Length + Sepal.Width)
# Class 'formula' language ~Sepal.Length + Sepal.Width
# ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
If you wrap the substitute call in as.formula, it works:
cor.test(as.formula(substitute(~par + expl, vars)), data = iris)
# Pearson's product-moment correlation
#
# data: Sepal.Length and Sepal.Width
# t = -1.4403, df = 148, p-value = 0.1519
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
# -0.27269325 0.04351158
# sample estimates:
# cor
# -0.1175698
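For the loop over several explanatory variables mentioned in the question, an alternative sketch builds each formula with reformulate instead of substitute:
data(iris)
expl <- colnames(iris)[2:4]
results <- lapply(expl, function(v) {
  f <- reformulate(c("Sepal.Length", v))  # builds ~ Sepal.Length + <v>
  cor.test(f, data = iris)
})
names(results) <- expl
sapply(results, function(r) r$p.value)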

Retrieving p-values from the Anderson-Darling test (ad.test) in R from an aggregate call

So I have the following:
ad_test_by_group = aggregate(data, by = group, FUN = "ad.test")
and this outputs the test statistic by group. However, I would like the p-values associated with each group instead.
Is there a way to do this?
As suggested by @ed_sans, you may try:
library(nortest)
data(iris)
aggregate(. ~ Species, data = iris[,c(1,5)], function(x) ad.test(x)$p.value)
     Species Sepal.Length
1     setosa    0.3352439
2 versicolor    0.4332962
3  virginica    0.1475325
The second column contains the p-values. You can check against a single group:
x <- iris[iris$Species == "setosa", 1]
ad.test(x)$p.value
[1] 0.3352439
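If you also want the test statistic next to the p-value for each group, one way is split() plus lapply (a sketch):
library(nortest)
res <- lapply(split(iris$Sepal.Length, iris$Species), function(x) {
  tst <- ad.test(x)
  c(statistic = unname(tst$statistic), p.value = tst$p.value)
})
do.call(rbind, res)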

Fitting a linear regression model by group gives NaN p-values

I am using plyr::ddply to run a regression model
model <- rating ~ A + B + C + D + E + F
by the factor resp.id. I can create a data frame of the betas by each factor with:
indiv.betas <- ddply(data.coded, "resp.id",
                     function(df) coef(lm(model, data = df)))
I am now trying to extract the p-values for the variables by the factor using:
indiv.pvalues <- ddply(data.coded, "resp.id",
                       function(df) coef(summary(lm(model, data = df)))[, "Pr(>|t|)"])
Unfortunately, it just gives me a data frame with NaN.
However, if I run a single model across the entire data set, I can successfully extract its p-values as a data frame with:
pvalue <- as.data.frame(coef(summary(lm(model, data=df)))[, "Pr(>|t|)"])
How can I create a data frame of the p-values by the factor?
Thanks.
When you fit a single model
rating ~ A + B + C + D + E + F
you get a meaningful, non-NA result, whereas when you fit the same model for each subset / factor level of resp.id, you get NaN results. I am 100% sure that for some factor levels you don't have enough data to fit the above model. It would be a good idea to first check how many observations there are in each group. You can use:
N <- with(data.coded, tapply(rating, resp.id, FUN = length))
Your model has 7 coefficients (1 for the intercept and 1 each for A, B, ..., F), and you also need at least one residual degree of freedom to get standard errors. So which(N < 8) will tell you which factor levels are producing NaN.
In this part, I will show that I am not able to reproduce your problem with the iris dataset.
library(plyr)
model <- Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width
ddply(iris, "Species", function(df) coefficients(lm(model, data=df)))
#     Species (Intercept) Sepal.Width Petal.Length Petal.Width
#1     setosa    2.351890   0.6548350    0.2375602   0.2521257
#2 versicolor    1.895540   0.3868576    0.9083370  -0.6792238
#3  virginica    0.699883   0.3303370    0.9455356  -0.1697527
ddply(iris, "Species", function(df) coef(summary(lm(model, data=df)))[, 4])
#     Species  (Intercept)  Sepal.Width Petal.Length Petal.Width
#1     setosa 3.034183e-07 6.834434e-09 2.593594e-01    0.470987
#2 versicolor 5.112246e-04 6.488965e-02 1.666695e-06    0.125599
#3  virginica 1.961563e-01 6.439972e-02 1.074269e-13    0.395875
In this part, I will show why NaN can appear when there are more coefficients than data points.
set.seed(0);
x1 <- rnorm(3); x2 <- rnorm(3); x3 <- rnorm(3)
y <- rnorm(3)
fit <- lm(y ~ x1 + x2 + x3) ## 3 data, 4 coefficients
coef(summary(fit))
#             Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.4217653        NaN     NaN      NaN
#x1          0.4124869        NaN     NaN      NaN
#x2          1.1489330        NaN     NaN      NaN
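Once you know which levels are the culprits, a simple guard is to skip those groups inside the ddply call. A sketch, reusing data.coded, resp.id and model from the question:
library(plyr)
indiv.pvalues <- ddply(data.coded, "resp.id", function(df) {
  # need more rows than the 7 coefficients, otherwise standard errors are NaN
  if (nrow(df) < 8) return(NULL)
  coef(summary(lm(model, data = df)))[, "Pr(>|t|)"]
})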

Using neural networks (neuralnet) in R to predict factor values

I am using the neuralnet package with several inputs to predict an output.
Originally, my output is a factor variable, and I saw the error:
Error in neurons[[i]] %*% weights[[i]] :
requires numeric/complex matrix/vector arguments
When I converted the output to a numeric variable, the error disappeared. Is there a way to fit a neural network with a factor output?
I adapted code that I found at this site, which uses the iris dataset with the neuralnet package to predict iris species from the morphological data.
Without a reproducible example, I'm not sure whether this applies to your case. The key here is to convert each level of the factor response into its own binary variable. Prediction works a bit differently than with other models in R: you choose the factor level with the highest score.
Example code:
library(neuralnet)
# Make training and validation data
set.seed(1)
train <- sample(nrow(iris), nrow(iris)*0.5)
valid <- seq(nrow(iris))[-train]
iristrain <- iris[train,]
irisvalid <- iris[valid,]
# Binarize the categorical output
iristrain <- cbind(iristrain, iristrain$Species == 'setosa')
iristrain <- cbind(iristrain, iristrain$Species == 'versicolor')
iristrain <- cbind(iristrain, iristrain$Species == 'virginica')
names(iristrain)[6:8] <- c('setosa', 'versicolor', 'virginica')
# Fit model
nn <- neuralnet(
  setosa + versicolor + virginica ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
  data = iristrain,
  hidden = c(3)
)
plot(nn)
# Predict
comp <- compute(nn, irisvalid[-5])
pred.weights <- comp$net.result
idx <- apply(pred.weights, 1, which.max)
pred <- c('setosa', 'versicolor', 'virginica')[idx]
table(pred, irisvalid$Species)
#pred         setosa versicolor virginica
#  setosa         23          0         0
#  versicolor      1         21         7
#  virginica       0          1        22
This might raise warnings:
nn <- neuralnet(
  setosa + versicolor + virginica ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
  data = iristrain,
  hidden = c(3)
)
So replace it with:
nn <- neuralnet(
  setosa + versicolor + virginica ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
  data = iristrain, hidden = 3, lifesign = "full"
)
If this does not work:
comp <- compute(nn, irisvalid[-5])
then use
comp <- neuralnet::compute(nn, irisvalid[,1:4])
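As a quick sanity check, the confusion table can be summarised into an overall accuracy on the validation set:
# proportion of correctly classified validation rows
mean(pred == as.character(irisvalid$Species))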
