When running a decision tree I use:
mod1 <- C5.0(Species ~ ., data = iris)
If I want to pass in a data frame and specify the target feature name in the formula (something other than "Species"), how would I do this?
For example,
mod1 <- C5.0(other_data[,target_column] ~ ., data = other_data)
which obviously doesn't work.
1) Paste together the formula:
fun <- function(resp, data) C5.0(as.formula(paste(resp, "~ .")), data = data)
# test
library(C50)
fun("Species", iris)
giving:
Call:
C5.0.formula(formula = as.formula(paste(resp, "~ .")), data = data)
Classification Tree
Number of samples: 150
Number of predictors: 4
Tree size: 4
Non-standard options: attempt to group attributes
2) Or use this variation, which gives a nicer rendition of the call on the line after Call: in the output:
fun <- function(resp, data)
do.call(C5.0, list(as.formula(paste(resp, "~ .")), data = substitute(data)))
fun("Species", iris)
giving:
Call:
C5.0.formula(formula = Species ~ ., data = iris)
Classification Tree
Number of samples: 150
Number of predictors: 4
Tree size: 4
Here is a second test of this version of fun using the builtin data frame CO2:
fun("Plant", CO2)
giving:
Call:
C5.0.formula(formula = Plant ~ ., data = CO2)
Classification Tree
Number of samples: 84
Number of predictors: 4
Tree size: 7
Non-standard options: attempt to group attributes
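For readers without C50 installed, the same do.call() idea can be sketched with base R's lm(). This is only an illustration under stated assumptions: fit_by_name is a hypothetical helper, and the function is passed to do.call() by name ("lm") so that the stored call stays readable.

```r
# Hypothetical helper: build the formula from a string and keep a readable call
fit_by_name <- function(resp, data) {
  do.call("lm", list(as.formula(paste(resp, "~ .")), data = substitute(data)))
}

fit <- fit_by_name("mpg", mtcars)
fit$call  # the data argument is recorded as the symbol mtcars, not the whole data frame
```

One caveat, as far as I can tell: the data symbol is evaluated from inside the helper, so this sketch assumes the data frame is visible from there (for example a global or built-in data set).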
The following allows for passing in arbitrary data and a target feature to the C50 method:
boosted_trees <- function(data_train, target_feature, iter_choice) {
  # match the column name exactly; grep() would also match partial names
  target_index <- which(colnames(data_train) == target_feature)
  model_boosted <- C5.0(x = data_train[, -target_index], y = data_train[[target_feature]], trials = iter_choice)
  # store the actual data in the call so that the model can be plotted
  model_boosted$call$x <- data_train[, -target_index]
  model_boosted$call$y <- data_train[[target_feature]]
  return(model_boosted)
}
The trick is to rename the terms in the method call after building the model so that it can be plotted.
An alternative that may be preferable is to overwrite the symbol within the parse tree after creating the formula:
x <- Species~.;
x;
## Species ~ .
x[[2L]] <- as.symbol('Blah');
x;
## Blah ~ .
The above works because formulas are encoded as normal parse trees, with a top-level node that consists of a call (typeof 'language', mode 'call') of the `~`() function, and classed as 'formula':
(function(x) c(typeof(x),mode(x),class(x)))(.~.);
## [1] "language" "call" "formula"
All parse trees can be read and written as a recursive list structure. Here I'll demonstrate that using a nice little recursive function I originally wrote for this answer:
unwrap <- function(x) if (typeof(x) == 'language') lapply(as.list(x),unwrap) else x;
unwrap(Species~.);
## [[1]]
## `~`
##
## [[2]]
## Species
##
## [[3]]
## .
##
In other words, parse trees represent function calls with the function symbol as the first list component, and then all function arguments as the subsequent list components. The special case of a normal formula captures the LHS as the first function argument and the RHS as the second. Hence x[[2L]] represents the LHS symbol of your formula, which we can overwrite directly with a normal assignment to your preferred symbol.
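As a rough sketch of how this could be used with a column name held in a string, the helper below overwrites the LHS of a template formula. The name set_response is made up for this illustration, and lm() on iris merely stands in for whatever modelling function you actually call.

```r
# Hypothetical helper: replace the LHS of a two-sided formula with a new symbol
set_response <- function(formula, response) {
  formula[[2L]] <- as.symbol(response)
  formula
}

f <- set_response(y ~ ., "Sepal.Length")
f

# The modified formula can be passed on like any other formula
fit <- lm(f, data = iris)
names(coef(fit))
```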
The logic is similar to the content-based recommender,
content  undesirable  desirable  user_1  ...  user_10
1        3.00         2.77       0.11    ...  NA
...
5000     2.50         2.11       NA      ...  0.12
I need to fit a model with undesirable and desirable as the independent variables and each user as the dependent variable, so I need to fit the model 10 times and predict each user's NA values.
This is the code I hard-coded, but I wonder how to do it with a for loop. I searched for several methods, but they do not work for me.
The data is called 'test'.
Hard-coded version:
#fit model
fit_1 = lm(user_1 ~ undesirable + desirable, data = test)
...
fit_10 = lm(user_10 ~ undesirable + desirable, data = test)
#prediction
u_1_na = test[is.na(test$user_1), c('user_1', 'undesirable', 'desirable')]
result1 = predict(fit_1, newdata = u_1_na)
which(result1 == max(result1))
max(result1)
...
u_10_na = test[is.na(test$user_10), c('user_10', 'undesirable', 'desirable')]
result10 = predict(fit_10, newdata = u_10_na)
which(result10 == max(result10))
max(result10)
#write to csv file
Write each user's maximum predicted value to a csv file.
This is what I have tried so far (for loop):
mod_summaries <- list()
for(i in 1:10) {
predictors_i <- colnames(data)[1:10]
mod_summaries[[i - 1]] <- summary(
lm(predictors_i ~ ., test[ , c("undesirable", 'desirable')]))
}
An apply method:
mod_summaries_lapply <-
lapply(
colnames(mtcars),
FUN = function(x)
summary(lm(reformulate(".", response = x), data = mtcars))
)
A for-loop method that fits a linear model for each column. The key is the reformulate() function, which builds the formula from strings. In the question, the response is placed into the formula as a character string, which results in the error "invalid term in model formula". The string has to be converted into a formula first, for example with reformulate() or as.formula(paste(...)). This example uses the mtcars dataset.
mod_summaries <- list()
for(i in 1:11) {
predictors_i <- colnames(mtcars)[i]
mod_summaries[[i]] <- summary(lm(reformulate(".", response = predictors_i), data=mtcars))
#summary(lm(reformulate(". -1", response = predictors_i), data=mtcars)) # -1 to exclude intercept
#summary(lm(as.formula(paste(predictors_i, "~ .")), data=mtcars)) # a "paste as formula" method
}
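Applied to the layout in the question, the same idea might look like the sketch below. The data are simulated here because the original 'test' data frame is not available, so the column values, the number of users (3 instead of 10) and the result layout are all assumptions.

```r
set.seed(1)

# Simulated stand-in for the 'test' data frame from the question
test <- data.frame(
  content = 1:50,
  undesirable = runif(50, 0, 5),
  desirable = runif(50, 0, 5)
)
for (u in paste0("user_", 1:3)) {
  ratings <- runif(50)
  ratings[sample(50, 10)] <- NA  # items this user has not rated
  test[[u]] <- ratings
}

users <- grep("^user_", names(test), value = TRUE)

best <- lapply(users, function(u) {
  # build e.g. user_1 ~ undesirable + desirable from strings
  fit <- lm(reformulate(c("undesirable", "desirable"), response = u), data = test)
  unrated <- test[is.na(test[[u]]), ]
  pred <- predict(fit, newdata = unrated)
  data.frame(user = u,
             content = unrated$content[which.max(pred)],
             prediction = max(pred))
})

result <- do.call(rbind, best)
result
# write.csv(result, "predictions.csv", row.names = FALSE)  # file name is an assumption
```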
You could use the function as.formula together with the paste function to create your formula. Here is an example:
formula_lm <- as.formula(
paste(response_var,
paste(expl_var, collapse = " + "),
sep = " ~ "))
This assumes that you have more than one explanatory variable (joined with + in the inner paste). If you only have one, omit the second paste.
With the created formula, you can use the lm function like this:
lm(formula_lm, data)
Edit: the vector expl_var would in your case include the undesirable and desirable variable.
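To make this concrete, here is a small sketch with mtcars standing in for your data; the particular values of response_var and expl_var are only illustrative. It also shows that base R's reformulate() builds the same formula.

```r
response_var <- "mpg"
expl_var <- c("wt", "hp")

# paste + as.formula
f_paste <- as.formula(
  paste(response_var,
        paste(expl_var, collapse = " + "),
        sep = " ~ "))

# the same formula via reformulate()
f_reform <- reformulate(expl_var, response = response_var)

fit_paste <- lm(f_paste, data = mtcars)
fit_reform <- lm(f_reform, data = mtcars)
coef(fit_paste)
```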
Avoid the loop. Make your data tidy. Something like:
library(tidyverse)
test %>%
  select(-content) %>%
  pivot_longer(
    starts_with("user"),
    names_to = "user",
    values_to = "value"
  ) %>%
  group_by(user) %>%
  group_map(
    function(.x, .y) {
      # .x holds one user's rows without the grouping column, so the response is `value`
      summary(lm(value ~ ., data = .x))
    }
  )
Untested code since your example is not reproducible.
Let's say I fit a model using partykit::mob(). Afterward, I would like to generate a side-by-side table with all the nodes (including the model fitted using the whole sample). Here I attempted to do it using stargazer(), but other ways are more than welcome.
Below is an example and my attempts to get the table.
library("partykit")
require("mlbench")
## Pima Indians diabetes data
data("PimaIndiansDiabetes", package = "mlbench")
## a simple basic fitting function (of type 1) for a logistic regression
logit <- function(y, x, start = NULL, weights = NULL, offset = NULL, ...) {
glm(y ~ 0 + x, family = binomial, start = start, ...)
}
## set up a logistic regression tree
pid_tree <- mob(diabetes ~ glucose | pregnant + pressure + triceps + insulin +
mass + pedigree + age, data = PimaIndiansDiabetes, fit = logit)
pid_tree
# Model-based recursive partitioning (logit)
#
# Model formula:
# diabetes ~ glucose | pregnant + pressure + triceps + insulin +
# mass + pedigree + age
#
# Fitted party:
# [1] root
# | [2] mass <= 26.3: n = 167
# | x(Intercept) xglucose
# | -9.95150963 0.05870786
# | [3] mass > 26.3
# | | [4] age <= 30: n = 304
# | | x(Intercept) xglucose
# | | -6.70558554 0.04683748
# | | [5] age > 30: n = 297
# | | x(Intercept) xglucose
# | | -2.77095386 0.02353582
#
# Number of inner nodes: 2
# Number of terminal nodes: 3
# Number of parameters per node: 2
# Objective function: 355.4578
1.- Extract summary(pid_tree, node = x) + stargazer().
## I want to replicate this table by extracting the nodes from the partykit object.
library(stargazer)
m.glm<- glm(diabetes ~ glucose, family = binomial,data = PimaIndiansDiabetes)
typeof(m.glm)
## [1] "list"
class(m.glm)
## [1] "glm" "lm"
stargazer(m.glm)
## omitted output.
## Extracting summary from each node
summ_full_data <- summary(pid_tree, node = 1)
summ_node_2 <- summary(pid_tree, node = 2)
summ_node_4 <- summary(pid_tree, node = 4)
summ_node_5 <- summary(pid_tree, node = 5)
## trying to create stargazer table with coefficients
stargazer(m.glm,
summ_node_2,
summ_node_4,
summ_node_5,title="MOB Results")
##Error: $ operator is invalid for atomic vectors
2.- Extract pid_tree[x] + stargazer().
## Second Attempt (extracting modelparty objects instead)
node_2 <- pid_tree[2]
node_4 <- pid_tree[4]
node_5 <- pid_tree[5]
class(node_5)
##[1] "modelparty" "party"
stargazer(m.glm,
node_2,
node_4,
node_5,title="MOB Results")
# % Error: Unrecognized object type.
# % Error: Unrecognized object type.
# % Error: Unrecognized object type.
3.- Not really elegant, I know: Force class to emulate the glm object.
## Force class of object to emulate glm one
class(m.glm)
class(summ_node_2) <- c("glm", "lm")
stargazer(summ_node_2)
##Error in if (p > 0) { : argument is of length zero
A rather pragmatic solution would be to just re-fit the models using the rules found by partykit::mob() and then use stargazer() on them, but I am surely missing something here. Thanks in advance.
It's best to extract (or refit) the list of model objects per node and then apply the table package of choice. Personally, I don't like stargazer much and much rather use modelsummary instead or sometimes the good old memisc.
If the tree contains the model $objects in the $info (as for pid_tree) you can use nodeapply() for all nodeids() to extract these:
pid_models <- nodeapply(pid_tree, ids = nodeids(pid_tree), FUN = function(x) x$info$object)
If you just want to extract the fitted models for the terminal nodes (leaves) of the tree, then you can do so by setting ids = nodeids(pid_tree, terminal = TRUE).
Alternatively, especially when the model objects are not stored, you can easily refit them via:
pid_models <- refit.modelparty(pid_tree)
Here, you could also include node = nodeids(pid_tree, terminal = TRUE) to only refit the terminal node models.
In all cases you can subsequently use
msummary(pid_models)
to produce the model summary table. It supports a variety of output formats, and of course you can tweak the list further to change the results, e.g., by changing the names of its elements.
My bad, it was a small difference that makes it work. Here is a solution; I am not sure it is the best way, but it does the job.
library(stargazer)
obj_node_full_sample<- pid_tree[1]$node$info$object
obj_node_2<- pid_tree[2]$node$info$object
obj_node_4<- pid_tree[4]$node$info$object
obj_node_5<- pid_tree[5]$node$info$object
stargazer(obj_node_full_sample,
obj_node_2,
obj_node_4,
obj_node_5,title="Results", align=TRUE)
I am trying to create a function that allows me to pass outcome and predictor variable names as strings into the lm() regression function. I have actually asked this before here, but I learned a new technique here and would like to try and apply the same idea in this new format.
Here is the process
library(tidyverse)
# toy data
df <- tibble(f1 = factor(rep(letters[1:3],5)),
c1 = rnorm(15),
out1 = rnorm(15))
# pass the relevant inputs into new objects like in a function
d <- df
outcome <- "out1"
predictors <- c("f1", "c1")
# now create the model formula to be entered into the model
form <- as.formula(
paste(outcome,
paste(predictors, collapse = " + "),
sep = " ~ "))
# now pass the formula into the model
model <- eval(bquote( lm(.(form),
data = d) ))
model
# Call:
# lm(formula = out1 ~ f1 + c1, data = d)
#
# Coefficients:
# (Intercept) f1b f1c c1
# 0.16304 -0.01790 -0.32620 -0.07239
So this all works nicely, an adaptable way of passing variables into lm(). But what if we want to apply special contrast coding to the factorial variable? I tried
model <- eval(bquote( lm(.(form),
data = d,
contrasts = list(predictors[1] = contr.treatment(3)) %>% setNames(predictors[1])) ))
But got this error
Error: unexpected '=' in:
" data = d,
contrasts = list(predictors[1] ="
Any help much appreciated.
Reducing this to the command generating the error:
list(predictors[1] = contr.treatment(3))
Results in:
Error: unexpected '=' in "list(predictors[1] ="
list() fails here because an argument name in a call has to be a literal name (or string); R's parser does not accept an expression such as predictors[1] on the left-hand side of =.
Your approach of using setNames() works, but needs to be wrapped around the list construction step itself.
setNames(list(contr.treatment(3)), predictors[1])
Output is a named list containing a contrast matrix:
$f1
2 3
1 0 0
2 1 0
3 0 1
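Putting the pieces together, a minimal sketch of the whole model call could look like this. It uses contr.sum(3) rather than contr.treatment(3) only so that the effect of the custom coding is visible in the coefficient names; the toy data and the use of reformulate() instead of eval(bquote()) are choices made for this illustration.

```r
set.seed(42)

# toy data, similar in shape to the question's
d <- data.frame(f1 = factor(rep(letters[1:3], 5)),
                c1 = rnorm(15),
                out1 = rnorm(15))

outcome <- "out1"
predictors <- c("f1", "c1")

form <- reformulate(predictors, response = outcome)

# name the list element from a variable by wrapping the list in setNames()
contr <- setNames(list(contr.sum(3)), predictors[1])

model <- lm(form, data = d, contrasts = contr)
names(coef(model))
```

Calling lm(form, ...) directly stores formula = form in the model's call; wrapping the call in eval(bquote()) as in the question should keep the expanded formula there instead.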
I have a function in R which includes multiple other functions, including a custom one. I then use lapply to run the combined function across multiple variables. However, when the output is produced it is in the order of
function1: variable a, variable b, variable c
function2: variable a, variable b, variable c
When what I would like is for it to be the other way around:
variable a: function 1, function 2...
variable b: function 1, function 2...
I have recreated an example below using the mtcars dataset, with number of cylinders as a predictor variable, and vs and am as outcome variables.
library(datasets)
library(tidyverse)
library(skimr)
library(car)
data(mtcars)
mtcars_binary <- mtcars %>%
dplyr::select(cyl, vs, am)
# logistic regression function
logistic.regression <- function(logmodel) {
dev <- logmodel$deviance
null.dev <- logmodel$null.deviance
modelN <- length(logmodel$fitted.values)
R.lemeshow <- 1 - dev / null.dev
R.coxsnell <- 1 - exp ( -(null.dev - dev) / modelN)
R.nagelkerke <- R.coxsnell / ( 1 - ( exp (-(null.dev / modelN))))
cat("Logistic Regression\n")
cat("Hosmer and Lemeshow R^2 ", round(R.lemeshow, 3), "\n")
cat("Cox and Snell R^2 ", round(R.coxsnell, 3), "\n")
cat("Nagelkerke R^2" , round(R.nagelkerke, 3), "\n")
}
# all logistic regression results
log_regression_tests1 <- function(df_vars, df_data) {
glm_summary <- glm(df_data[,df_vars] ~ df_data[,1], data = df_data, family = binomial, na.action = "na.omit")
glm_print <- print(glm_summary)
log_results <- logistic.regression(glm_summary)
blr_coefficients <- exp(glm_summary$coefficients)
blr_confint <- exp(confint(glm_summary))
list(glm_summary = glm_summary, glm_print = glm_print, log_results = log_results, blr_coefficients = blr_coefficients, blr_confint = blr_confint)
}
log_regression_results1 <- sapply(colnames(mtcars_binary[,2:3]), log_regression_tests1, mtcars_binary, simplify = FALSE)
log_regression_results1
When I do this, the output is being produced as:
glm_summary: vs, am
log_results: vs, am
etc. etc.
When what I would like for the output to be ordered is:
vs: all function outputs
am: all function outputs
In addition, when I run this line of code, log_regression_results1 <- sapply(colnames(mtcars_binary[,2:3]), log_regression_tests1, mtcars_binary, simplify = FALSE) I get only the results of the logistic regression function, but when I print the overall results log_regression_results1 I get the remaining output, could anyone explain why?
Finally, the glm_summary function is not producing all of the output which it should. When I run the functions independently on a single variable, like so
glm_vs <- glm(vs ~ cyl, data = mtcars_binary, family = binomial, na.action = "na.omit")
summary(glm_vs)
logistic.regression(glm_vs)
exp(glm_vs$coefficients)
exp(confint(glm_vs))
it also produces the standard error, z value, and p value for summary(glm_vs), which it does not do when embedded in the function, even though I have glm_print <- print(glm_summary) included. Is there a way to get the output of the full summary function within the log_regression_tests1 function?
When I run your code up to log_regression_results1, I get exactly what you ask for:
summary(log_regression_results1)
Length Class Mode
vs 5 -none- list
am 5 -none- list
Maybe you meant to ask the other way round?
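On the last part of the question (the missing standard errors, z values and p values), one possible approach is to store summary() of the fit rather than the result of print(). The sketch below is only an assumption about what is wanted: fit_outcome is a hypothetical helper, it builds the formula with reformulate() instead of indexing df_data inside the formula, and it uses mtcars directly.

```r
# Hypothetical helper: fit one logistic regression and keep the full summary
fit_outcome <- function(outcome, predictor, data) {
  fit <- glm(reformulate(predictor, response = outcome),
             data = data, family = binomial)
  list(summary = summary(fit),          # includes Std. Error, z value and p value
       odds_ratios = exp(coef(fit)))
}

outcomes <- c("vs", "am")
results <- sapply(outcomes, fit_outcome, predictor = "cyl", data = mtcars,
                  simplify = FALSE)

names(results)                      # one list element per outcome variable
results$vs$summary$coefficients     # full coefficient table for vs
```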
I have a linear model with lots of explanatory (independent) variables
model <- lm(y ~ x1 + x2 + x3 + ... + x100)
some of which are linearly dependent on each other (multicollinearity).
I want the machine to search for the name of the explanatory variable which has the highest VIF (x2 for example), delete it from the formula and then rerun the lm function with the new formula
model <- lm(y ~ x1 + x3 + ... + x100)
I already learned how to retrieve the name of the explanatory variable which has the highest VIF:
max_vif <- function(x) {
vifac <- data.frame(vif(x))
nameofmax <- rownames(which(vifac == max(vifac), arr.ind = TRUE))
return(nameofmax)
}
But I still don't understand how to find that explanatory variable, delete it from the formula and run the function again.
We can use the update function and paste in the column that needs to be removed. We first can fit a model, and then use update to change that model's formula. The model formula can be expressed as a character string, which allows you to concatenate the general formula .~. and whatever variable(s) you'd like removed (using the minus sign -).
Here is an example:
fit1 <- lm(wt ~ mpg + cyl + am, data = mtcars)
coef(fit1)
# (Intercept) mpg cyl am
# 4.83597190 -0.09470611 0.08015745 -0.52182463
rm_var <- "am"
fit2 <- update(fit1, paste0(".~. - ", rm_var))
coef(fit2)
# (Intercept) mpg cyl
# 5.07595833 -0.11908115 0.08625557
Using max_vif we can wrap this into a function:
rm_max_vif <- function(x){
# find variable(s) needing to be removed
rm_var <- max_vif(x)
# concatenate with "-" to remove variable(s) from formula
rm_var <- paste(paste0("-", rm_var), collapse = " ")
# update model
update(x, paste0(".~.", rm_var))
}
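To repeat this until every VIF is below a threshold, the update() step can be put in a loop. The sketch below avoids the car dependency by using a simple stand-in, simple_vif, which is a made-up helper that assumes all predictors are numeric main effects (no factors or interactions); drop_high_vif and the threshold of 5 are likewise just illustrative.

```r
# Stand-in for car::vif(): diagonal of the inverse predictor correlation matrix.
# Assumes numeric predictors only.
simple_vif <- function(model) {
  X <- model.matrix(model)[, -1, drop = FALSE]  # drop the intercept column
  setNames(diag(solve(cor(X))), colnames(X))
}

# Hypothetical helper: drop the predictor with the highest VIF until all are below threshold
drop_high_vif <- function(model, threshold = 5) {
  repeat {
    v <- simple_vif(model)
    if (length(v) < 2 || max(v) < threshold) break
    worst <- names(v)[which.max(v)]
    model <- update(model, paste(". ~ . -", worst))
  }
  model
}

fit <- lm(mpg ~ cyl + disp + hp + wt + drat, data = mtcars)
reduced <- drop_high_vif(fit)
formula(reduced)
simple_vif(reduced)
```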
Problem solved!
I created a list containing all variables for the lm model:
Price <- list(y,x1,...,x100)
Then I used a different way of setting up the lm model:
model <- lm(y ~ ., data = Price)
So we can just delete the variable with the highest VIF from the Price list.
With the function I already came up with, the code will be:
Price <- list(y,x1,x2,...,x100)
model <- lm(y ~ ., data = Price)
max_vif <- function(x) { # Function for finding name of variable with the highest VIF
vifac <- data.frame(vif(x))
nameofmax <- rownames(which(vifac == max(vifac), arr.ind = TRUE))
return(nameofmax)
}
m <- max_vif(model)
n <- max(data.frame(vif(model)))
while(n >= 5) { # Loop that deletes the variable with the highest VIF from the `Price` list one at a time, until no VIF is equal to or higher than 5
  Price[[m]] <- NULL # requires the elements of `Price` to be named
  model <- lm(y ~ ., data = Price) # refit so that the next VIFs come from the reduced model
  m <- max_vif(model)
  n <- max(data.frame(vif(model)))
}