How to run the bigglm function for a large number of variables in R

In ffbase (http://cran.r-project.org/web/packages/ffbase/ffbase.pdf) there is the bigglm function:
bigglm.ffdf(formula, data, family = gaussian(), ...,
where formula is something like Y~X, assuming Y and X correspond to the colnames of ffdf object called data.
What if I have 200 columns in data that I want to put on the RHS of the equation? Clearly I can't type Y~X1+X2+....+X200.
How do I run Y~X1+X2+....+X200 without typing out all 200 variables on the RHS?

The . symbol is the usual shorthand for this, though I'm not sure whether it works with ffbase. For example:
m <- lm(y ~ ., df)
will describe y by all other columns in df.
As described by Chris, this appears to be a bug in biglm, and can be worked around by using:
m <- bigglm(terms(y ~ ., data=df), data=df)
But this should be reported as a bug to the author of biglm.
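To see what the terms() workaround above hands to the fitting function, here is a small base-R sketch. The toy data frame and its column names are made up for illustration, and it does not call bigglm itself; it only shows that terms() expands the dot into explicit column names.

```r
# Toy data frame; column names are illustrative only.
df <- data.frame(y = c(1.2, 2.3, 3.1, 4.8),
                 x1 = c(1, 2, 3, 4),
                 x2 = c(0, 1, 0, 1),
                 x3 = c(5, 3, 6, 2))

# terms() with a data argument replaces "." by every column not on the LHS.
tt <- terms(y ~ ., data = df)
term_labels <- attr(tt, "term.labels")
print(term_labels)

# The expanded formula can be recovered from the terms object.
expanded <- formula(tt)
print(expanded)
```

Since the dot is already expanded, a function that cannot resolve "." on its own presumably never has to.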

If Sam's answer doesn't work, you can build up a character string representing the formula and then cast it as a formula:
formula <- as.formula(paste('Y', paste(paste('',
paste('X', 1:200, sep = ''), sep = '', collapse = ' + ')), sep = ' ~ '))
The inner paste creates X1 to X200. The next paste collapses the resulting vector into a single string with the elements of the first paste put together with +'s. The last paste adds on the Y ~. Finally, I change it from a string to a formula.
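As an alternative sketch, base R's reformulate() builds the same kind of formula from a character vector of term labels, which avoids the nested paste calls. The variable names rhs_vars and f are just placeholders here.

```r
# Column names X1 ... X200, as in the question.
rhs_vars <- paste0("X", 1:200)

# reformulate() joins the term labels with "+" and attaches the response.
f <- reformulate(rhs_vars, response = "Y")

# One response plus 200 predictors.
n_vars <- length(all.vars(f))
print(n_vars)
```

The result is an ordinary formula object, so it can be passed wherever Y ~ X1 + ... + X200 would have been typed by hand.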

Related

GLM insert vector using the string of the vector, rather than the vector itself

I am creating a function that will run all my analysis, as a way of streamlining 600 lines of code to avoid errors. I am having trouble with the GLM line. In some analyses the data frame is filtered; however, when I pass the vectors DV and ME, they come from a data frame that already exists (my_data$Hire, my_data$Con). I need to pass these values in such a way that R will recognise them as the DV and ME from the newly created data frame, GM.
calculatemodels <- function(type, DV, ME, ME2, MD, df, test, MA_data, model, number, samplename){
if (type == "A1"){
GM <- my_data %>% filter(Gender == 1)
m <- glm(DV ~ ME, data=GM) # problem here, need to pass DV and ME vectors correctly.
MA_data <- CollectDEffect(test, m, MA_data, namestring(ME), model, number, samplename)
}
return(MA_data)
}
MA_data <- calculatemodels("A1", my_data$Hire, my_data$Con, , , my_data,
"", MA_data, "", "1", "full")
I tried using get and paste, however it does not work. In a nutshell, I need to pass the name of the DV and ME and have the function recognize that these are the vectors for the model, not pass vectors that are already attached to a dataframe, i.e., my_data$Hire
As far as I understand, you are trying to pass ingredients of the model as vectors and run the glm model on them.
There are a couple of issues here. First, my_data is not an argument of the function; I guess it exists in your global environment. Second, you need to build a formula (as.formula) from the column names and use it as the input to the model:
calculatemodels <- function(type, DV, ME, ME2, MD,
df, test, MA_data,
model, number,samplename){
if (type == "A1"){
GM <- df %>% filter(Gender == 1) # use the df argument rather than the global my_data
glm_model <- as.formula(paste0(DV, "~", ME))
m <- glm(glm_model, data=GM) # DV and ME are column names passed as strings
MA_data <- CollectDEffect(test, m, MA_data, namestring(ME), model, number, samplename)
}
return(MA_data)
}
You need to use column names (Hire and Con) as input values to the function:
MA_data <- calculatemodels("A1", "Hire", "Con", , , my_data,
"", MA_data, "", "1", "full")
Hope this resolves the issue.
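For anyone who wants to try the idea in isolation, here is a stripped-down sketch. fit_by_name() is a hypothetical helper (not part of the original function), the toy data are invented, and it uses base subset() and reformulate() instead of dplyr and paste0.

```r
# Hypothetical helper: takes column NAMES as strings and a data frame.
fit_by_name <- function(dv, me, data) {
  stopifnot(is.character(dv), is.character(me),
            all(c(dv, me) %in% names(data)))
  # Build dv ~ me from the strings, then fit on the data frame that was passed in.
  glm(reformulate(me, response = dv), data = data)
}

# Invented example data with the column names used in the question.
my_data <- data.frame(
  Hire   = c(1, 0, 1, 1, 0, 1, 0, 1),
  Con    = c(3.1, 2.4, 4.0, 3.6, 1.9, 4.2, 2.2, 3.8),
  Gender = c(1, 1, 2, 1, 2, 1, 1, 2)
)

# Filter first, then fit: the names are resolved inside the filtered data.
GM <- subset(my_data, Gender == 1)
m <- fit_by_name("Hire", "Con", GM)
coef_names <- names(coef(m))
print(coef_names)
```

Because the names are looked up in whatever data frame is supplied, the same call works on any filtered subset without referring to my_data$Hire directly.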

R: Use string containing variable names in regression

I first use grep to obtain all variable names that begin with the prefix "h_". I then collapse that array into a single string, separated with plus signs. Is there a way to subsequently use this string in a linear regression?
For example:
holiday_vars <- grep("^h_", names(df), value=TRUE)
holiday_string <- paste(holiday_vars, collapse=' + ')
r_3 <- lm(log(assaults) ~ year + month + holiday_string, data = df)
I get the straightforward error variable lengths differ (found for 'holiday_string')
I can do it like this, for example:
holiday_formula <- as.formula(paste('log(assaults) ~ attend_v + year+ month + ', paste("", holiday_vars, collapse='+')))
r_3 <- lm(holiday_formula, data = df)
But I don't want to have to type a separate formula construction for each new set of controls. I want to be able to add the "string" inside the lm function. Is this possible?
The above is problematic, because let's say I want to then add another set of control variables to the formula contained in holiday_formula, so something like this:
weather_vars <- grep("w_", names(df), value=TRUE)
weather_formula <- as.formula(paste(holiday_formula, paste("+", weather_vars, collapse='+')))
Not sure how you would do the above.
I don't know a simple method for construction of a formula argument different than the one you are rejecting (although I considered and rejected using update.formula since it would also have required using as.formula), but this is an alternate method for achieving the same goal. It uses the "."-expansion feature of R-formulas and relies on the ability of the [-function to accept character argument for column selection:
r_3 <- lm(log(assaults) ~ attend_v + year+ month + . ,
data = df[ , c('assaults', 'attend_v', 'year', 'month', holiday_vars)] )
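On the follow-up about adding a second set of controls to an existing formula, one possible sketch uses update(), which accepts the new part as a string. It does still go through a formula conversion internally, so it may not satisfy the "no as.formula" wish. The data below are simulated and the column names are invented to match the h_ and w_ prefixes.

```r
set.seed(1)
n <- 20
# Simulated data; only the naming pattern matters here.
df <- data.frame(
  assaults = rpois(n, 10) + 1,
  year     = rep(2001:2004, each = 5),
  h_xmas   = rbinom(n, 1, 0.3),
  h_july4  = rbinom(n, 1, 0.3),
  w_rain   = runif(n),
  w_temp   = rnorm(n, 15, 5)
)

holiday_vars <- grep("^h_", names(df), value = TRUE)
weather_vars <- grep("^w_", names(df), value = TRUE)

# First formula: response plus the holiday controls.
holiday_formula <- reformulate(c("year", holiday_vars), response = "log(assaults)")

# Extend it: "." on the right-hand side stands for the terms already present.
weather_formula <- update(holiday_formula,
                          paste("~ . +", paste(weather_vars, collapse = " + ")))

r_4 <- lm(weather_formula, data = df)
print(weather_formula)
```

Each further block of controls can be appended the same way, so only the grep pattern changes between sets.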

Loop through column names in Fixed Effects Regression

I am trying to code a fixed effects regression, but I have MANY dummy variables. Basically, I have 184 variables on the RHS of my equation. Instead of writing this out, I am trying to create a loop that will pass through each column (I have named each column with a number).
This is the code I have so far, but the paste is not working. I may be totally off base using paste, but I wasn't sure how else to approach this. I am getting an error (see below).
FE.model <- plm(avg.kw ~ 0 + (for (i in 41:87) {
paste("hour.dummy",i,sep="") + paste("dummy.CDH",i,sep="")
+ paste("dummy.MA",i,sep="") + paste("DR.variable",i,sep="")
}),
data = data.reg,
index=c('Site.ID','date.hour'),
model='within',
effect='individual')
summary(FE.model)
As an example for the column names, when i=41 the names should be "hour.dummy41" "dummy.CDH41", etc.
I'm getting the following error:
Error in paste("hour.dummy", i, sep = "") + paste("dummy.CDH", i, sep = "") : non-numeric argument to binary operator
So I'm not sure if it's the paste function that is not appropriate here, or if it's the loop. I can't seem to find a way to loop through column names easily in R.
Any help is much appreciated!
Ignoring worries about fitting a model with so many terms for the moment, you probably want to generate a string, and then cast it as a formula:
#create a data.frame where rows are the parts of the variable names, then collapse it
rhs <- do.call(paste, c(as.list(expand.grid(c("hour.dummy","dummy.CDH"), 41:87)), sep="", collapse=" + "))
fml <- as.formula(sprintf("avg.kw ~ %s", rhs))
FE.model <- plm(fml, ...
I've only put in two of the dummies in the second line, but you should get the idea.
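A variant of the same idea with all four name stems from the question, sketched in base R only. It stops at the formula and does not call plm, so nothing here depends on the plm package; stems, rhs_vars and fml are placeholder names.

```r
# The four stems from the question and the index range 41:87.
stems <- c("hour.dummy", "dummy.CDH", "dummy.MA", "DR.variable")

# outer() pastes every stem with every index; as.vector() flattens the matrix.
rhs_vars <- as.vector(outer(stems, 41:87, paste0))

# intercept = FALSE plays the role of the "0 +" in the original attempt.
fml <- reformulate(rhs_vars, response = "avg.kw", intercept = FALSE)

print(head(rhs_vars, 4))
print(length(rhs_vars))
```

The resulting fml could then be passed as the first argument to plm in place of the hand-written formula, with the remaining arguments unchanged.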

data.frame columns as tabular row labels in R

I am trying to make a table in R/Sweave using the tabular command. I want the row labels to be the headings of my data frame consca. (Each column is a question, and each row is a student's responses to each question.) The command I am using is this:
latex(tabular(Heading('Questions')*(paste(labels(consca)[[2]],collapse='+')) ~ (n=1) + (mn +
sdev),data=consca))
Which throws this error:
Error in term2table(rows[[i]], cols[[j]], data, n) :
Argument paste(labels(consca)[[2]], collapse = "+") is not length 298
The paste argument works...
paste(labels(consca)[[2]],collapse='+')
[1] "Q02+Q03+Q06+Q17+Q19+Q25+Q31+Q33+Q36+Q39+Q45+Q50"
And produces the output I desire:
latex(tabular(Heading('Questions')*(Q02+Q03+Q06+Q17+Q19+Q25+Q31+Q33+Q36+Q39+Q45+Q50) ~ (n=1) +
(mn + sdev),data=consca))
However, I want to do this with multiple scales (i.e. I want to change consca to other objects and I want to eliminate the copy/paste step.)
I have fiddled with eval and as.symbol, but to no avail. Perhaps I am not using them in the right way.
OK, and for those of you who will want a minimal reproducible example, here goes:
require(tables)
a <- rnorm(10)
b <- rnorm(10,2)
c <- rnorm(10,100)
x <- data.frame(a,b,c)
# This works:
tabular(a+b+c ~ (mean + sd), x)
# This fails:
tabular(paste(labels(x)[[2]],collapse='+') ~ (mean+sd),x)
# Even though:
paste(labels(x)[[2]],collapse='+')
[1] "a+b+c"
I found a workaround using the describe function in the psych package. (Ultimately, I wanted more than the mean and sd, and the describe function automatically calculates them.) This creates a data.frame, which is trivial to turn into a \LaTeX table. 😃

Using column numbers not names in lm()

Instead of something like lm(bp~height+age, data=mydata) I would like to specify the columns by number, not name.
I tried lm(mydata[[1]]~mydata[[2]]+mydata[[3]]) but the problem with this is that, in the fitted model, the coefficients are named mydata[[2]], mydata[[3]] etc, whereas I would like them to have the real column names.
Perhaps this is a case of not having your cake and eating it, but if the experts could advise whether this is possible I would be grateful
lm(
as.formula(paste(colnames(mydata)[1], "~",
paste(colnames(mydata)[c(2, 3)], collapse = "+"),
sep = ""
)),
data=mydata
)
Instead of c(2, 3) you can use as many indices as you want (no need for a for loop).
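The same thing can be sketched a little more compactly with reformulate(). The data frame below is invented, and resp_idx and pred_idx are placeholder names for the column numbers.

```r
# Invented data with the column names from the question.
mydata <- data.frame(
  bp     = c(120, 135, 128, 142, 150, 138),
  height = c(170, 165, 180, 175, 160, 172),
  age    = c(34, 51, 42, 60, 66, 55)
)

resp_idx <- 1        # response column number
pred_idx <- c(2, 3)  # predictor column numbers

# Translate the numbers to names, then build the formula from the names.
f <- reformulate(names(mydata)[pred_idx], response = names(mydata)[resp_idx])
fit <- lm(f, data = mydata)

coef_names <- names(coef(fit))
print(coef_names)
```

The coefficients carry the real column names because the formula is built from names, not from mydata[[i]] expressions.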
lm(mydata[,1] ~ ., mydata[-1])
The trick, which I found in a course on R, is to remove the response column; otherwise you get the warning "essentially perfect fit: summary may be unreliable". I do not know why it works; it does not follow from the documentation. Normally, we keep the response column in.
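A likely explanation, sketched below on invented data: the dot expands to every column of the data argument that is not on the left-hand side, and mydata[,1] is an expression rather than a column name, so nothing gets excluded unless the response column is dropped by hand.

```r
# Invented data with the column names from the question.
mydata <- data.frame(
  bp     = c(120, 135, 128, 142, 150, 138),
  height = c(170, 165, 180, 175, 160, 172),
  age    = c(34, 51, 42, 60, 66, 55)
)

# Response column left in the data: "." picks it up as a predictor too.
with_resp <- lm(mydata[, 1] ~ ., data = mydata)
names_with <- names(coef(with_resp))

# Response column dropped: only the real predictors remain.
without_resp <- lm(mydata[, 1] ~ ., data = mydata[-1])
names_without <- names(coef(without_resp))

print(names_with)
print(names_without)
```

With a named response such as bp ~ ., R can match the name against the columns and leave it out, which is why the response column normally stays in.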
And a simplified version of the earlier answer by Tomas:
lm(
as.formula(paste(colnames(mydata)[1], "~ .")),
data=mydata
)
