R: Use string containing variable names in regression

I first use grep to obtain all variable names that begin with the prefix "h_". I then collapse that array into a single string, separated with plus signs. Is there a way to subsequently use this string in a linear regression?
For example:
holiday_array <- grep("h_", names(df), value=TRUE)
holiday_string = paste(holiday_array, collapse=' + ' )
r_3 <- lm(log(assaults) ~ year + month + holiday_string, data = df)
I get the straightforward error: variable lengths differ (found for 'holiday_string')
I can do it like this, for example:
holiday_formula <- as.formula(paste('log(assaults) ~ attend_v + year+ month + ', paste("", holiday_vars, collapse='+')))
r_3 <- lm(holiday_formula, data = df)
But I don't want to have to type a separate formula construction for each new set of controls. I want to be able to add the "string" inside the lm function. Is this possible?
The above is problematic, because let's say I want to then add another set of control variables to the formula contained in holiday_formula, so something like this:
weather_vars <- grep("w_", names(df), value=TRUE)
weather_formula <- as.formula(paste(holiday_formula, paste("+", weather_vars, collapse='+')))
Not sure how you would do the above.

I don't know a simple method for construction of a formula argument different than the one you are rejecting (although I considered and rejected using update.formula since it would also have required using as.formula), but this is an alternate method for achieving the same goal. It uses the "."-expansion feature of R-formulas and relies on the ability of the [-function to accept character argument for column selection:
r_3 <- lm(log(assaults) ~ attend_v + year + month + . ,
data = df[ , c('assaults', 'attend_v', 'year', 'month', holiday_vars)] )
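For stacking several sets of controls without retyping a formula each time, one option is base R's reformulate(), which builds a formula from a character vector of term labels. The sketch below is only an illustration: the toy data frame, its column names, and the helper make_formula() are all made up here and are not part of the original question.

```r
# Toy data; every column name here is invented for illustration.
set.seed(42)
df <- data.frame(
  assaults = rpois(60, lambda = 20) + 1,  # +1 keeps log() finite
  year     = rep(2001:2005, each = 12),
  month    = rep(1:12, times = 5),
  h_xmas   = rbinom(60, 1, 0.1),
  h_july4  = rbinom(60, 1, 0.1),
  w_rain   = rnorm(60)
)

holiday_vars <- grep("^h_", names(df), value = TRUE)
weather_vars <- grep("^w_", names(df), value = TRUE)

# Hypothetical helper: fixed response and base controls, plus any number
# of extra character vectors of column names.
make_formula <- function(...) {
  reformulate(c("year", "month", ...), response = quote(log(assaults)))
}

r_3 <- lm(make_formula(holiday_vars), data = df)
r_4 <- lm(make_formula(holiday_vars, weather_vars), data = df)
```

This still constructs a formula object, but the construction lives in one place, so adding another set of controls is just one more argument.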

Related

The way R handles subsetting

I'm having some trouble understanding how R handles subsetting internally and this is causing me some issues while trying to build some functions. Take the following code:
f <- function(directory, variable, number_seq) {
## Create an empty data frame
new_frame <- data.frame()
## Add every data frame in the directory whose name is in the number_seq to new_frame
## the file variable specifies the path to the file
for (i in number_seq){
file <- paste("~/", directory, "/",sprintf("%03d", i), ".csv", sep = "")
x <- read.csv(file)
new_frame <- rbind.data.frame(new_frame, x)
}
## calculate and return the mean
mean(new_frame[, variable], na.rm = TRUE)
}
While calculating the mean, I first tried to subset using the $ sign (new_frame$variable) and the subset function (subset(new_frame, select = variable)), but it would only return a None value. It only worked when I used new_frame[, variable].
Can anyone explain why the other subsetting didn't work? It took me a really long time to figure it out and even though I managed to make it work I still don't know why it didn't work in the other ways and I really wanna look inside the black box so I won't have the same issues in the future.
Thanks for the help.
This behavior has to do with the fact that you are subsetting inside a function.
Both new_frame$variable and subset(new_frame, select = variable) look for a column in the data frame with the name variable.
On the other hand, new_frame[, variable] uses the variable argument of f(directory, variable, number_seq) to select the column.
The dollar sign ($) can only be used with literal column names. That avoids confusion with
dd<-data.frame(
id=1:4,
var=rnorm(4),
value=runif(4)
)
var <- "value"
dd$var
In this case, if $ accepted both variables and column names, which would you expect: the dd$var column or the dd$value column (because var == "value")? That's why the dd[, var] way is different: it only takes character vectors, not expressions referring to column names. You will get dd$value with dd[, var].
I'm not quite sure why you got None with subset(); I was unable to replicate that problem.
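To make the difference concrete, here is a small sketch with fixed numbers instead of random draws. The helper col_mean() is hypothetical and only mirrors the last line of the function in the question.

```r
dd <- data.frame(
  id    = 1:4,
  var   = c(0.1, 0.2, 0.3, 0.4),
  value = c(10, 20, 30, 40)
)
var <- "value"

by_dollar  <- dd$var     # the column literally named "var"
by_bracket <- dd[[var]]  # the column named by the string stored in var

# Hypothetical helper: the column name arrives as a string argument,
# so it has to be looked up with [[ ]] (or [, ]) rather than $.
col_mean <- function(data, variable) {
  mean(data[[variable]], na.rm = TRUE)
}

result  <- col_mean(dd, "value")  # 25
missing <- dd$variable            # NULL: no column is called "variable"
```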

Loop through column names in Fixed Effects Regression

I am trying to code a fixed effects regression, but I have MANY dummy variables. Basically, I have 184 variables on the RHS of my equation. Instead of writing this out, I am trying to create a loop that will pass through each column (I have named each column with a number).
This is the code I have so far, but the paste is not working. I may be totally off base using paste, but I wasn't sure how else to approach this. However, I am getting an error (see below).
FE.model <- plm(avg.kw ~ 0 + (for (i in 41:87) {
paste("hour.dummy",i,sep="") + paste("dummy.CDH",i,sep="")
+ paste("dummy.MA",i,sep="") + paste("DR.variable",i,sep="")
}),
data = data.reg,
index=c('Site.ID','date.hour'),
model='within',
effect='individual')
summary(FE.model)
As an example for the column names, when i=41 the names should be "hour.dummy41" "dummy.CDH41", etc.
I'm getting the following error:
Error in paste("hour.dummy", i, sep = "") + paste("dummy.CDH", i, sep = "") : non-numeric argument to binary operator
So I'm not sure if it's the paste function that is not appropriate here, or if it's the loop. I can't seem to find a way to loop through column names easily in R.
Any help is much appreciated!
Ignoring worries about fitting a model with so many terms for the moment, you probably want to generate a string, and then cast it as a formula:
# create a data.frame where rows are the parts of the variable names, then collapse it
# sep="" so the names come out as "hour.dummy41", matching the column names in the question
rhs <- do.call(paste, c(as.list(expand.grid(c("hour.dummy","dummy.CDH"), 41:87)), sep="", collapse=" + "))
fml <- as.formula(sprintf("avg.kw ~ %s", rhs))
FE.model <- plm(fml, ...)
I've only put in two of the 'dummy's in the second line- but you should get the idea
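As a sketch of the same idea with all four prefixes from the question, the term names can be generated with outer() and handed to reformulate(). Only the formula is built here; fitting it with plm is left out, and the prefixes and the 41:87 range are taken from the question as written.

```r
prefixes <- c("hour.dummy", "dummy.CDH", "dummy.MA", "DR.variable")
idx <- 41:87

# Every prefix/number combination, e.g. "hour.dummy41", "dummy.CDH41", ...
rhs_terms <- as.vector(outer(prefixes, idx, paste0))

# intercept = FALSE plays the role of the "0 +" in the original call.
fml <- reformulate(rhs_terms, response = "avg.kw", intercept = FALSE)
```

The resulting fml can then be passed as the first argument of the model call in place of the hand-written formula.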

How to run bigglm function for large number of variables

In ffbase (http://cran.r-project.org/web/packages/ffbase/ffbase.pdf) there is the bigglm function:
bigglm.ffdf(formula, data, family = gaussian(), ...,
where formula is something like Y~X, assuming Y and X correspond to the colnames of ffdf object called data.
What if I have 200 columns in data that I want to put on the RHS of the equation? Clearly I can't type Y~X1+X2+....+X200.
How do I run Y~X1+X2+....+X200 without typing out all 200 variables on the RHS?
The . symbol is the usual way to do this, though I'm not sure whether it works with ffbase. For example:
m <- lm(y ~ ., df)
will describe y by all other columns in df.
As described by Chris, this appears to be a bug in biglm, and can be worked around by using:
m <- bigglm(terms(y ~ ., data=df), data=df)
But this should be reported as a bug to the author of biglm.
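What terms() does in that workaround can be seen on a small ordinary data frame: it expands the dot into explicit column names before the model function sees the formula. The sketch below uses lm rather than bigglm, and the data frame is made up, so it only illustrates the expansion and not the biglm behaviour itself.

```r
set.seed(1)
df <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10))

tt <- terms(y ~ ., data = df)
labels <- attr(tt, "term.labels")  # the dot expanded to "x1" "x2" "x3"
expanded <- formula(tt)            # a formula with the dot written out

# Both forms describe the same model.
m_dot      <- lm(y ~ ., data = df)
m_expanded <- lm(expanded, data = df)
```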
If Sam's answer doesn't work, you can build up a character string representing the formula and then cast it as a formula:
formula <- as.formula(paste('Y', paste(paste('',
paste('X', 1:200, sep = ''), sep = '', collapse = ' + ')), sep = ' ~ '))
The inner paste creates X1 to X200. The next paste collapses the resulting vector into a single string with the elements of the first paste put together with +'s. The last paste adds on the Y ~. Finally, I change it from a string to a formula.
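The same steps, unrolled into separate assignments and shrunk to three columns so each intermediate value is easy to inspect (the intermediate names are chosen here for illustration):

```r
n <- 3
x_names <- paste("X", 1:n, sep = "")            # "X1" "X2" "X3"
rhs <- paste(x_names, collapse = " + ")         # "X1 + X2 + X3"
formula_string <- paste("Y", rhs, sep = " ~ ")  # "Y ~ X1 + X2 + X3"
f <- as.formula(formula_string)
```

Setting n to 200 gives the formula from the question.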

data.frame columns as tabular row labels in R

I am trying to make a table in R/Sweave using the tabular command. I want the row labels to be the headings of my data frame consca. (Each column is a question, and each row is a student's responses to each question.) The command I am using is this:
latex(tabular(Heading('Questions')*(paste(labels(consca)[[2]],collapse='+')) ~ (n=1) + (mn +
sdev),data=consca))
Which throws this error:
Error in term2table(rows[[i]], cols[[j]], data, n) :
Argument paste(labels(consca)[[2]], collapse = "+") is not length 298
The paste argument works...
paste(labels(consca)[[2]],collapse='+')
[1] "Q02+Q03+Q06+Q17+Q19+Q25+Q31+Q33+Q36+Q39+Q45+Q50"
And produces the output I desire:
latex(tabular(Heading('Questions')*(Q02+Q03+Q06+Q17+Q19+Q25+Q31+Q33+Q36+Q39+Q45+Q50) ~ (n=1) +
(mn + sdev),data=consca))
However, I want to do this with multiple scales (i.e. I want to change consca to other objects and I want to eliminate the copy/paste step.)
I have fiddled with eval and as.symbol, but to no avail. Perhaps I am not using them in the right way.
OK, and for those of you who will want a minimal reproducible example, here goes:
require(tables)
a <- rnorm(10)
b <- rnorm(10,2)
c <- rnorm(10,100)
x <- data.frame(a,b,c)
# This works:
tabular(a+b+c ~ (mean + sd), x)
# This fails:
tabular(paste(labels(x)[[2]],collapse='+') ~ (mean+sd),x)
# Even though:
paste(labels(x)[[2]],collapse='+')
[1] "a+b+c"
I found a workaround using the describe function in the psych package. (Ultimately, I wanted more than the mean and sd, and the describe function automatically calculates them.) This creates a data.frame, which is trivial to turn into a \LaTeX table. 😃

Anova in R: Dataframe selection

I just ran into a problem when using a variable in the anova term. Normally I would use "AGE" directly in the term, but I run it all in a loop, so myvar will change.
myvar=as.name("AGE")
x=summary( aov (dat ~ contrasts*myvar + Error(ID/(contrasts)), data =set))
The columns are names(set) = "contrasts" "AGE" "ID" "dat"
It's like when I want to select a column: set$myvar does not work, but set$AGE does.
Is there any code for this?
You need to create a string representation of the model formula, then convert it using as.formula.
myvar <- "AGE"
f <- as.formula(paste("dat ~", myvar))
aov(f)
As Richie wrote, pasting seems like the simplest solution. Here's a more complete example:
myvar <- "AGE"
f <- as.formula(paste("dat ~ contrasts *", myvar, "+ Error(ID/contrasts)"))
x <- summary( aov(f, data=set) )
...and instead of set$myvar you would write
set[[myvar]]
A more advanced answer is that a formula is actually a call to the "~" operator. You can modify the call directly, which would be slightly more efficient inside the loop:
> f <- dat ~ contrasts * PLACEHOLDER + Error(ID/contrasts) # outside loop
> f[[3]][[2]][[3]] <- as.name(myvar) # inside loop
> f # see what it looks like...
dat ~ contrasts * AGE + Error(ID/contrasts)
The magic [[3]][[2]][[3]] specifies the part of the formula you want to replace. The formula actually looks something like this (a parse tree):
`~`(dat, `+`(`*`(contrasts, PLACEHOLDER), Error(`/`(ID, contrasts))))
Play around with indexing the formula and you'll understand:
> f[[3]]
contrasts * AGE + Error(ID/contrasts)
> f[[3]][[2]]
contrasts * AGE
UPDATE: What are the benefits of this? Well, it is more robust - especially if you don't control the data's column names. If myvar <- "AGE GROUP" the current paste solution doesn't work. And if myvar <- "file.create('~/OWNED')", you have a serious security risk...
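A small sketch of the placeholder approach wrapped in a function, including the column name with a space that breaks the paste version. The helper make_aov_formula() is a made-up name, and only the formula is built, since no data accompanies the question.

```r
# Template built once; PLACEHOLDER marks the slot to be filled.
template <- dat ~ contrasts * PLACEHOLDER + Error(ID/contrasts)

# Hypothetical helper: returns a copy of the template with the
# placeholder replaced by the given column name.
make_aov_formula <- function(varname) {
  f <- template
  f[[3]][[2]][[3]] <- as.name(varname)
  f
}

f_age <- make_aov_formula("AGE")
f_odd <- make_aov_formula("AGE GROUP")  # non-syntactic name, still one symbol
vars_odd <- all.vars(f_odd)
```

Because the name is inserted as a single symbol rather than parsed as code, a string such as "AGE GROUP" stays one variable and is never evaluated as an expression.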

Resources