Using column numbers not names in lm() - r

Instead of something like lm(bp~height+age, data=mydata) I would like to specify the columns by number, not name.
I tried lm(mydata[[1]]~mydata[[2]]+mydata[[3]]) but the problem with this is that, in the fitted model, the coefficients are named mydata[[2]], mydata[[3]] etc, whereas I would like them to have the real column names.
Perhaps this is a case of not being able to have your cake and eat it, but if the experts could advise whether this is possible I would be grateful.

lm(
as.formula(paste(colnames(mydata)[1], "~",
paste(colnames(mydata)[c(2, 3)], collapse = "+"),
sep = ""
)),
data=mydata
)
Instead of c(2, 3) you can use as many indices as you want (no need for a for loop).
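A possible alternative, assuming base R's reformulate() is acceptable here: it builds the same formula from a character vector of term names, so the coefficients are labelled with the real column names. The data frame below is made up purely for illustration.

```r
# Hypothetical data; the column names mirror the question
mydata <- data.frame(
  bp     = c(120, 135, 128, 142, 150, 131),
  height = c(170, 165, 180, 175, 160, 172),
  age    = c(34, 51, 42, 60, 66, 45)
)

# Pick the response and predictors by column number, then build bp ~ height + age
fml <- reformulate(names(mydata)[c(2, 3)], response = names(mydata)[1])
fit <- lm(fml, data = mydata)

names(coef(fit))
# "(Intercept)" "height" "age"
```

Because the formula is built from names, changing c(2, 3) to any other index vector is the only edit needed.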

lm(mydata[,1] ~ ., mydata[-1])
The trick, which I found in a course on R, is to remove the response column from the data; otherwise you get the warning "essentially perfect fit: summary may be unreliable". I do not know why it works; it does not seem to follow from the documentation. Normally, we keep the response column in.
And a simplified version of the earlier answer by Tomas:
lm(
as.formula(paste(colnames(mydata)[1], "~ .")),
data=mydata
)
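One likely explanation for why dropping the response column matters, offered as a sketch rather than a definitive account: the "." expands to every column of the data that is not named in the formula, and mydata[,1] is an expression rather than a column name, so the response column is still counted as a predictor unless it is removed. The data frame is hypothetical.

```r
# Hypothetical data; the column names mirror the question
mydata <- data.frame(
  bp     = c(120, 135, 128, 142, 150, 131),
  height = c(170, 165, 180, 175, 160, 172),
  age    = c(34, 51, 42, 60, 66, 45)
)

# Response column left in the data: "." picks up bp as a predictor of itself
fit_all <- lm(mydata[, 1] ~ ., data = mydata)
names(coef(fit_all))

# Response column dropped: only height and age are left for "." to expand to
fit_drop <- lm(mydata[, 1] ~ ., data = mydata[-1])
names(coef(fit_drop))
```

Comparing the two sets of coefficient names shows bp appearing on the right-hand side only in the first fit.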

Related

Fit model on a subset of columns in dataframe in R

I'm trying to use lm() and matchit() on a subset of covariates. I have generated an arbitrary number of columns with prefix "covar", i.e. "covar.1", "covar.2", etc. I'd like to do something like
lm(group ~ covars, data=df)
where covars is a vector of strings c("covar.1", "covar.2", ...).
I tried several things like
cols <- colnames(df)
covars <- cols[grep("covar", colnames(df))]
m.out <- matchit(group ~ covars, data=df, method="nearest", distance="logit", caliper=.20)
but got variable lengths differ (found for 'covars').
Defining a new dataframe with only covars and group can work, but that defeats my purpose in using matchit because I want the matched data to have other columns too, not just the covars I picked to be matched on.
This seems to be an easy task but somehow I can't figure it out after some googling. I'm not sure what an R formula expects there as a subset of columns. Any help is appreciated.
You might want to use as.formula.
Try doing this:
Replace group ~ covars
with as.formula(paste('group', '~', paste(covars, collapse="+")))
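A minimal sketch of that replacement, using lm() in place of matchit() so that it runs with base R only. The data frame and its values are invented for illustration; the point is that the formula needs the covariate names spelled out, whereas group ~ covars refers to the character vector covars itself, whose length differs from the number of rows.

```r
# Hypothetical data: a treatment column, two covariates, and an unrelated column
df <- data.frame(
  group   = c(0, 1, 0, 1, 1, 0),
  covar.1 = c(2.1, 3.4, 1.9, 4.0, 3.7, 2.5),
  covar.2 = c(10, 12, 9, 15, 14, 11),
  other   = letters[1:6]
)

# value = TRUE returns the matching names; "^covar" anchors the match at the start
covars <- grep("^covar", colnames(df), value = TRUE)

# Build group ~ covar.1 + covar.2 from the names
fml <- as.formula(paste("group ~", paste(covars, collapse = " + ")))

# The full data frame is passed, so the other columns stay available
fit <- lm(fml, data = df)
names(coef(fit))
```

The same fml object can be passed as the formula argument of any model-fitting function that accepts one.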
I mentioned this in your other question, but the cobalt package has a function specifically for this, which is f.build(). The first argument to f.build() is a string containing the name of the treatment variable (or left hand side of the formula), and the second argument is a string vector containing the names of the variables to be on the right hand side of the formula (i.e., the covariates). The second argument can also be a data.frame containing the covariates; f.build() simply extracts the names. It then performs the operation described in the chosen answer, but adds a few other features that make it a little more general and robust to errors.
The cobalt documentation has a section on f.build() that uses glm() and matchit() as examples.
After running matchit(), you can assess balance on the covariates using the bal.tab() function in cobalt, which is compatible with MatchIt:
bal.tab(m.out, un = TRUE)
The documentation for cobalt explains its use with MatchIt in detail.

Machine-learning issue in r: Predict function returns error (object 'active' not found) [duplicate]

I cannot understand what is going wrong here.
data.train <- read.table("Assign2.WineComplete.csv",sep=",",header=T)
# Building decision tree
Train <- data.frame(residual.sugar=data.train$residual.sugar,
total.sulfur.dioxide=data.train$total.sulfur.dioxide,
alcohol=data.train$alcohol,
quality=data.train$quality)
Pre <- as.formula("pre ~ quality")
fit <- rpart(Pre, method="class",data=Train)
I am getting the following error :
Error in eval(expr, envir, enclos) : object 'pre' not found
Don't know why @Janos deleted his answer, but it's correct: your data frame Train doesn't have a column named pre. When you pass a formula and a data frame to a model-fitting function, the names in the formula have to refer to columns in the data frame. Your Train has columns called residual.sugar, total.sulfur.dioxide, alcohol and quality. You need to change either your formula or your data frame so they're consistent with each other.
And just to clarify: Pre is an object containing a formula. That formula contains a reference to the variable pre. It's the latter that has to be consistent with the data frame.
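As a rough illustration of that consistency check, the variables a formula refers to can be compared with the columns of the data frame before fitting. The values below are made up; only the column names come from the question.

```r
# Hypothetical stand-in for the Train data frame from the question
Train <- data.frame(
  residual.sugar       = c(1.9, 2.6, 2.3),
  total.sulfur.dioxide = c(34, 67, 54),
  alcohol              = c(9.4, 9.8, 10.1),
  quality              = c(5, 5, 6)
)

Pre <- as.formula("pre ~ quality")

# Names used in the formula that are not columns of the data frame
missing_vars <- setdiff(all.vars(Pre), names(Train))
missing_vars
# "pre"
```

An empty result would mean every name in the formula can be found in the data frame.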
This can happen if you don't attach your dataset.
I think I got what I was looking for..
data.train <- read.table("Assign2.WineComplete.csv",sep=",",header=T)
fit <- rpart(quality ~ ., method="class",data=data.train)
plot(fit)
text(fit, use.n=TRUE)
summary(fit)
I used
colnames(train) = paste("A", colnames(train))
and ran into the same problem as you.
I finally figured out that randomForest is stricter than rpart: it cannot handle column names containing spaces, commas or other special punctuation.
By default, paste() joins its arguments with a space, so each column name ends up as "A", a space, and then the original name.
To avoid the space, use this instead:
colnames(train) = paste("A", colnames(train), sep = "")
This prepends the string without a space.
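A small sketch of the difference, using a made-up data frame. paste0() is shorthand for paste(..., sep = ""), and make.names() is one base R way to repair names that already contain spaces.

```r
# Hypothetical data frame with two columns
train <- data.frame(x = 1:3, y = 4:6)

spaced <- paste("A", colnames(train))    # default separator is a space
joined <- paste0("A", colnames(train))   # no separator

spaced
# "A x" "A y"
joined
# "Ax" "Ay"

# make.names() replaces characters that are invalid in R names with "."
repaired <- make.names(spaced)
repaired
# "A.x" "A.y"
```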

Loop through column names in Fixed Effects Regression

I am trying to code a fixed effects regression, but I have MANY dummy variables. Basically, I have 184 variables on the RHS of my equation. Instead of writing this out, I am trying to create a loop that will pass through each column (I have named each column with a number).
This is the code I have so far, but the paste is not working. I may be totally off base using paste, but I wasn't sure how else to approach this. I am getting an error (see below).
FE.model <- plm(avg.kw ~ 0 + (for (i in 41:87) {
paste("hour.dummy",i,sep="") + paste("dummy.CDH",i,sep="")
+ paste("dummy.MA",i,sep="") + paste("DR.variable",i,sep="")
}),
data = data.reg,
index=c('Site.ID','date.hour'),
model='within',
effect='individual')
summary(FE.model)
As an example for the column names, when i=41 the names should be "hour.dummy41" "dummy.CDH41", etc.
I'm getting the following error:
Error in paste("hour.dummy", i, sep = "") + paste("dummy.CDH", i, sep = "") : non-numeric argument to binary operator
So I'm not sure if it's the paste function that is not appropriate here, or if it's the loop. I can't seem to find a way to loop through column names easily in R.
Any help is much appreciated!
Ignoring worries about fitting a model with so many terms for the moment, you probably want to generate a string, and then cast it as a formula:
# create a data.frame whose rows hold the parts of each variable name, then paste them together
rhs <- do.call(paste, c(as.list(expand.grid(c("hour.dummy","dummy.CDH"), 41:87)), sep="", collapse=" + "))
fml <- as.formula(sprintf("avg.kw ~ %s", rhs))
FE.model <- plm(fml, ...
I've only put two of the dummy prefixes in the second line, but you should get the idea.
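For all four prefixes from the question, one way to sketch the same idea in base R is with outer() and reformulate(). This only builds the formula; the plm() call itself is left out because it needs the asker's data. The intercept = FALSE argument is an assumption standing in for the "0 +" in the original call.

```r
# The four name prefixes from the question and the index range 41:87
prefixes <- c("hour.dummy", "dummy.CDH", "dummy.MA", "DR.variable")

# outer() pastes every prefix onto every index, e.g. "hour.dummy41"
rhs_terms <- as.vector(outer(prefixes, 41:87, paste0))

# Build avg.kw ~ hour.dummy41 + dummy.CDH41 + ... with no intercept
fml <- reformulate(rhs_terms, response = "avg.kw", intercept = FALSE)

head(rhs_terms, 4)
# "hour.dummy41" "dummy.CDH41" "dummy.MA41" "DR.variable41"
```

The resulting fml can then be passed as the first argument of the model-fitting call in place of the hand-written right-hand side.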

R - remove as.numeric from string

I sometimes vectorise the variables I used in a model and do other stuff with it (e.g. descriptives etc...). The problem is that sometimes I use "as.numeric(var)" or "as.factor(var)", or center "I(var-15)". I then need the name of the original variables.
The problem is that I can't simply gsub(lmfit$model,"as.factor(","") because I get an error, and I want to avoid deleting variables that contain I etc., so I need to delete I(* - any number) and as.factor(*), where * is the variable name that I want to remain untouched.
Let's say I have a vector of coefficients from a model:
outcome <- c(1:9)
INDEX <- c(18,17,15,20,10,20,25,13,12)
BODYFAT <- c(18,18,15,20,20,20,15,20,15)
lmfit <- glm(outcome ~ as.factor(BODYFAT) + I(INDEX-15), family = gaussian())
names(lmfit$model)
How would you work on names(lmfit$model) to get the original variable names back (i.e. BODYFAT and INDEX)?
I've started creating some clunky code to remove all the centering numbers (assuming 1 to 500 should be enough in most cases)
b<-paste(paste0("- ",1:500,"|",collapse=""),"-501",collapse="")
library(stringr)
str_replace_all(names(lmfit$model),b, " ")
But I'm having real problems with the removing I() and as.factor(). Any suggestions?
Many thanks in advance
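One approach that may avoid string manipulation altogether, sketched here with the data from the question: base R's all.vars() returns the variable names used in a formula, ignoring wrappers such as as.factor() and I(). Whether this covers every transformation used in practice is an assumption that would need checking.

```r
outcome <- c(1:9)
INDEX   <- c(18, 17, 15, 20, 10, 20, 25, 13, 12)
BODYFAT <- c(18, 18, 15, 20, 20, 20, 15, 20, 15)
lmfit <- glm(outcome ~ as.factor(BODYFAT) + I(INDEX - 15), family = gaussian())

# All variable names in the model formula, response first
vars <- all.vars(formula(lmfit))
vars
# "outcome" "BODYFAT" "INDEX"

# Drop the response to keep only the predictors
predictors <- vars[-1]
predictors
# "BODYFAT" "INDEX"
```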

What does the period mean when used with ~ (in a formula)?

From the FSelector manual:
data(iris)
subset <- cfs(Species~., iris)
f <- as.simple.formula(subset, "Species")
print(f)
Specifically, I mean the one in "Species~.".
Now, it's awfully tough to Google how a bit of punctuation is used (for me anyway) and I couldn't find anything. This code is unclear to me.
I think you're referring to the period contained in Species~., in which case this is just the standard R formulation of referring to 'all other variables' in the data frame, rather than typing them out one by one, as in Species ~ Variable1 + Variable2 etc.
From the help files of ?formula:
There are two special interpretations of . in a formula. The usual one
is in the context of a data argument of model fitting functions and
means ‘all columns not otherwise in the formula’: see terms.formula.
In the context of update.formula, only, it means ‘what was previously
in this part of the formula’.
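Both interpretations quoted above can be seen in a short sketch using the built-in iris data; the formula y ~ x in the second part is a made-up example.

```r
# 1. With a data argument, "." expands to all columns not otherwise in the formula
expanded <- terms(Species ~ ., data = iris)
attr(expanded, "term.labels")
# "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"

# 2. In update(), "." means "what was previously in this part of the formula"
f_old <- y ~ x
f_new <- update(f_old, . ~ . + z)
f_new
# y ~ x + z
```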
