Pass a vector of variables into lm() formula - r

I was trying to automate a piece of my code so that programming become less tedious.
Basically I was trying to do a stepwise selection of variables using fastbw() in the rms package. I would like to pass the list of variables selected by fastbw() into a formula as y ~ x1+x2+x3, "x1" "x2" "x3" being the list of variables selected by fastbw()
Here is the code I tried and did not work
olsOAW0.r060 <- ols(roll_pct~byoy+trans_YoY+change18m,
subset= helper=="POPNOAW0_r060",
na.action = na.exclude,
data = modelready)
OAW0 <- fastbw(olsOAW0.r060, rule="p", type="residual", sls= 0.05)
vec <- as.vector(OAW0$names.kept, mode="any")
b <- paste(vec, sep ="+") ##I even tried b <- paste(OAW0$names.kept, sep="+")
bestp.OAW0.r060 <- lm(roll_pct ~ b ,
data = modelready,
subset = helper =="POPNOAW0_r060",
na.action = na.exclude)
I am new to R and still haven't trailed the steep learning curve, so apologize for obvious programming blunders.

You're almost there. You just have to paste the entire formula together, something like this:
paste("roll_pct ~ ",b,sep = "")
coerce it to an actual formula using as.formula and then pass that to lm. Technically, I think lm may coerce a character string itself, but coercing it yourself is generally safer. (Some functions that expect formulas won't do the coercion for you, others will.)

You would actually need to use collapse instead of seb when defining b.
b <- paste(OAW0$names.kept, collapse="+")
Then you can put it in joran answer
paste("roll_pct ~ ",b,sep = "")
or just use:
paste("roll_pct ~ ",paste(OAW0$names.kept, collapse="+"),sep = "")

I ran into similar issue today, if you want to make it even more generic where you don't even have to have fixed class name, you can use
frmla <- as.formula(paste(colnames(modelready)[1], paste(colnames(modelready)[2:ncol(modelready)], sep = "",
collapse = " + "), sep = " ~ "))
This assumes that you have class variable or the dependent variable in the first column but indexing can be easily switched to last column as:
frmla <- as.formula(paste(colnames(modelready)[ncol(modelready)], paste(colnames(modelready)[1:(ncol(modelready)-1)], sep = "",
collapse = " + "), sep = " ~ "))
Then continue with lm using:
bestp.OAW0.r060 <- lm(frmla , data = modelready, ... )

If you're looking for something less verbose:
fm <- as.formula( paste( colnames(df)[i], ".", sep=" ~ "))
# i is the index of the outcome column
Here it is in a function:
getFormula<-function(target, df) {
i <- grep(target,colnames(df))
as.formula(paste(colnames(df)[i],
".",
sep = " ~ "))
}
fm <- getFormula("myOutcomeColumnName", myDataFrame)
rp <- rpart(fm, data = myDataFrame) # Use the formula to build a model

One trick that I use in similar situations is to subset your data and simply use e.g. lm(dep_var ~ ., data = your_data).
Example
data(mtcars)
ind_vars <- c("mpg", "cyl")
dep_var <- "hp"
temp_subset <- dplyr::select(mtcars, dep_var, ind_vars)
lm(hp ~., data = temp_subset)

just to simplify and collect above answers, based on a function
my_formula<- function(colPosition, trainSet){
dep_part<- paste(colnames(trainSet)[colPosition],"~",sep=" ")
ind_part<- paste(colnames(trainSet)[-colPosition],collapse=" + ")
dt_formula<- as.formula(paste(dep_part,ind_part,sep=" "))
return(dt_formula)
}
To use it:
my_formula( dependent_var_position, myTrainSet)

Related

Combine names of variables (in a formula) retrieved by grep in R

I'd like to use the output of grep directly in a formula.
In other words, I use grep to retrieve the variables I want to select and store them in a vector.
The cool thing would be to be able to use this vector in a formula.
As to say
var.to.retrieve <- grep(pattern="V", x=data)
lm(var.dep~var.to.retrieve)
but this doesn't work...
I've tried the solution paste(var.to.retrieve, collapse="+") but this doesn't work either.
EDIT
The solution could be
formula <- as.formula(paste(var.dep, paste(var.to.retrieve, collapse="+"), sep="~"))
but I cannot imagine there is no more elegant way to do it
reformulate(var.to.retrieve, response = var.dep) is basically this.
var.dep <- "y"
var.to.retrieve <- LETTERS[1:10]
r1 <- reformulate(var.to.retrieve, response = var.dep)
r2 <- as.formula(
paste(var.dep,
paste(var.to.retrieve, collapse = "+"),
sep = "~")
)
identical(r1,r2) ## TRUE
var_to_retrieve <- colnames(data)[grep(pattern = "V", x = colnames(data))]
lm(formula(paste(var.dep, paste(var_to_retrieve, collapse = "+"), sep = "~")),
data = data)

Function for looping over common variable names with different suffixes in R

I have some code which I'm looking to replicate many times, each for a different country as the suffix.
Assuming 3 countries as a simple example:
country_list <- c('ALB', 'ARE', 'ARG')
I'm trying to create a series of variables called a_m5_ALB, a_m5_ARE, a_m5_ARG etc which have various functions e.g. addcol or round_df applied to reg_math_ALB, reg_math_ARE, reg_math_ARG etc
for (i in country_list) {
paste("a_m5", i , sep = "_") <- addcol(paste("reg_math", i , sep = "_"))
}
for (i in country_list) {
paste("a_m5", i , sep = "_") <- round_df(paste("reg_math", i , sep = "_"))
}
where addcol and round_df are defined as:
addcol = function(y){
dat1 = mutate(y, p.value = ((1 - pt(q = abs(reg.t.value), df = dof))*2))
return(dat1)
}
round_df <- function(x, digits) {
numeric_columns <- sapply(x, mode) == 'numeric'
x[numeric_columns] <- round(x[numeric_columns], digits)
x
}
The loop errors when any of the functions are added in brackets before the paste variable part but it works if doing it manually e.g.
a_m5_ALB <- addcol(reg_math_ALB)
Please could you help? I think it's the application of the function in a loop which i'm getting wrong.
Errors:
Error in UseMethod("mutate_") :
no applicable method for 'mutate_' applied to an object of class "character"
Error in round(x[numeric_columns], digits) :
non-numeric argument to mathematical function
Thank you
From your examples, you're really in a case where everything should be in a single dataframe. Here, keeping separate variables for each country is not the right tool for the job. Say you have your per-country dataframes saved as csv, you can rewrite everything as:
library(tidyverse)
country_list <- c('ALB', 'ARE', 'ARG')
read_data <- function(ctry){
read_csv(paste0("/path/to/file/", "reg_math_", ctry)) %>%
add_column(country = ctry)
}
total_df <- map_dfr(country_list, read_data)
total_df %>%
mutate(p.value = (1 - pt(q = abs(reg.t.value), df = dof))*2) %>%
mutate(across(where(is.numeric), round, digits = digits))
And it gives you immediate access to all other dplyr functions that are great for this kind of manipulation.

extract weights from a RWeka SMOreg model

I am using the awesome RWeka package in order to fit a SMOreg model as implemented in Weka. While everything is working fine, I have some problem extracting the weights from the fitted model.
As all Weka classifier object, my model has a nice print method that shows me all the features and their relative weights. However, I am not able to extract this weights in any way.
You can see for yourself by running the following code:
library(RWeka)
data("mtcars")
SMOreg_classifier <- make_Weka_classifier("weka/classifiers/functions/SMOreg")
model_SMOreg <- SMOreg_classifier(mpg ~ ., data = mtcars)
Now, if you simply call the model
model_SMOreg
you'll see that it prints all the features used in the model with their relative weight. I would like to access those weights as a vector or, even better, as a 2-columns table with one column containing the names of the features and the other containing the weights.
I am working on a Windows 7 x64 system, using RStudio Version 1.0.153, R 3.4.2 Short Summer and RWeka 0.4-35.
Does someone know how to do this ?
I think you cannot get this in numeric format.
attr(model_SMOreg, "meta")$class # "Weka_classifier"
getAnywhere("print.Weka_classifier")
Result:
A single object matching ‘print.Weka_classifier’ was found
It was found in the following places
registered S3 method for print from namespace RWeka
namespace:RWeka
with value
function (x, ...)
{
writeLines(.jcall(x$classifier, "S", "toString"))
invisible(x)
}
<bytecode: 0x8328630>
<environment: namespace:RWeka>
So we see: print.Weka_classifier() makes a .writeLines() call which in turn makes a rJava::.jcall call, which returns a string.
Thus, I think you need to parse the weights yourself, perhaps by calling the capture.output() method.
Based on the suggestion of #knb I have wrote a function to extract the weights from a SMOreg model and return a tibble with one column for the features name and one for the features weight, with the row arranged following the absolute value of the weight.
Note that this function only works for the SMOreg classifier, as the output of other classifiers is slightly different in terms of layout. However, I think the function can be easily adapted for other classifiers.
library(stringr)
library(tidyverse)
extract_weights_from_SMOreg <- function(model) {
oldw <- getOption("warn")
options(warn = -1)
raw_output <- capture.output(model)
trimmed_output <- raw_output[-c(1:3,(length(raw_output) - 4): length(raw_output))]
df <- data_frame(features_name = vector(length = length(trimmed_output) + 1, "character"),
features_weight = vector(length = length(trimmed_output) + 1, "numeric"))
for (line in 1:length(trimmed_output)) {
string_as_vector <- trimmed_output[line] %>%
str_split(string = ., pattern = " ") %>%
unlist(.)
numeric_element <- trimmed_output[line] %>%
str_split(string = ., pattern = " ") %>%
unlist(.) %>%
as.numeric(.)
position_mul <- string_as_vector[is.na(numeric_element)] %>%
str_detect(string = ., pattern = "[*]") %>%
which(.)
numeric_element <- numeric_element %>%
`[`(., c(1:position_mul))
text_element <- string_as_vector[is.na(numeric_element)]
there_is_plus <- string_as_vector[is.na(numeric_element)] %>%
str_detect(string = ., pattern = "[+]") %>%
sum(.)
if (there_is_plus) { sign_is <- "+"} else { sign_is <- "-"}
feature_weight <- numeric_element[!is.na(numeric_element)]
if (sign_is == "-") {df[line, "features_weight"] <- feature_weight * -1} else {df[line, "features_weight"] <- numeric_element[!(is.na(numeric_element))]}
df[line, "features_name"] <- paste(text_element[(position_mul + 1): length(text_element)], collapse = " ")
}
intercept_line <- raw_output[length(raw_output) - 4]
there_is_plus_intercept <- intercept_line %>%
str_detect(string = ., pattern = "[+]") %>%
sum(.)
if (there_is_plus_intercept) { intercept_sign_is <- "+"} else { intercept_sign_is <- "-"}
numeric_intercept <- intercept_line %>%
str_split(string = ., pattern = " ") %>%
unlist(.) %>%
as.numeric(.) %>%
`[`(., length(.))
df[nrow(df), "features_name"] <- "intercept"
if (intercept_sign_is == "-") {df[nrow(df), "features_weight"] <- numeric_intercept * -1} else {df[nrow(df), "features_weight"] <- numeric_intercept}
options(warn = oldw)
df <- df %>%
arrange(desc(abs(features_weight)))
return(df)
}
Here an example for one model
library(RWeka)
data("mtcars")
SMOreg_classifier <- make_Weka_classifier("weka/classifiers/functions/SMOreg")
mpg_model_weights <- extract_weights_from_SMOreg(SMOreg_classifier(data = mtcars, mpg ~ .))
mpg_model_weights

Function to create new binary variables within existing dataframe?

This question is related to a previous topic:
How to use custom function to create new binary variables within existing dataframe?
I would like to use a similar function but be able to use a vector to specify ICD9 diagnosis variables within the dataframe to search for (e.g., "diag_1", "diag_2","diag_1", etc )
I tried
y<-c("diag_1","diag_2","diag_1")
diagnosis_func(patient_db, y, "2851", "Anemia")
but I get the following error:
Error in `[[<-`(`*tmp*`, i, value = value) :
recursive indexing failed at level 2
Below is the working function by Benjamin from the referenced post. However, it works only from 1 diagnosis variable at a time. Ultimately I need to create a new binary variable that indicates if a patient has a specific diagnosis by querying the 25 diagnosis variables of the dataframe.
*targetcolumn is the icd9 diagnosis variables "diag_1"..."diag_20" is the one I would like to input as vector
diagnosis_func <- function(data, target_col, icd, new_col){
pattern <- sprintf("^(%s)",
paste0(icd, collapse = "|"))
data[[new_col]] <- grepl(pattern = pattern,
x = data[[target_col]]) + 0L
data
}
diagnosis_func(patient_db, "diag_1", "2851", "Anemia")
This non-function version works for multiple diagnosis. However I have not figured out how to use it in a function version as above.
pattern = paste("^(", paste0("2851", collapse = "|"), ")", sep = "")
df$anemia<-ifelse(rowSums(sapply(df[c("diag_1","diag_2","diag_3")], grepl, pattern = pattern)) != 0,"1","0")
Any help or guidance on how to get this function to work would be greatly appreciated.
Best,
Albit
Try this modified version of Benjamin's function:
diagnosis_func <- function(data, target_col, icd, new_col){
pattern <- sprintf("^(%s)",
paste0(icd, collapse = "|"))
new <- apply(data[target_col], 2, function(x) grepl(pattern=pattern, x)) + 0L
data[[new_col]] <- ifelse(rowSums(new)>0, 1,0)
data
}

Creating formula using very long strings in R

I'm in a situation where I have a vector full of column names for a really large data frame.
Let's assume: x = c("Name", "address", "Gender", ......, "class" ) [approximatively 100 variables]
Now, I would like to create a formula which I'll eventually use to create a HoeffdingTree.
I'm creating formula using:
myformula <- as.formula(paste("class ~ ", paste(x, collapse= "+")))
This throws up the following error:
Error in parse(text = x) : :1:360: unexpected 'else'
1:e+spread+prayforsonni+just+want+amp+argue+blxcknicotine+mood+now+right+actually+herapatra+must+simply+suck+there+always+cookies+ever+everything+getting+nice+nigga+they+times+abu+all+alliepickl
The paste part in the above statement works fine but passing it as an argument to as.formula is throwing all kinds of weird problems.
The problem is that you have R keywords as column names. else is a keyword so you can't use it as a regular name.
A simplified example:
s <- c("x", "else", "z")
f <- paste("y~", paste(s, collapse="+"))
formula(f)
# Error in parse(text = x) : <text>:1:10: unexpected '+'
# 1: y~ x+else+
# ^
The solution is to wrap your words in backticks "`" so that R will treat them as non-syntactic variable names.
f <- paste("y~", paste(sprintf("`%s`", s), collapse="+"))
formula(f)
# y ~ x + `else` + z
You can reduce your data-set first
dat_small <- dat[,c("class",x)]
and then use
myformula <- as.formula("class ~ .")
The . means using all other (all but class) column.
You may try reformulate
reformulate(setdiff(x, 'class'), response='class')
#class ~ Name + address + Gender
where 'x' is
x <- c("Name", "address", "Gender", 'class')
If R keywords are in the 'x', you can do
reformulate('.', response='class')
#class ~ .

Resources