I am trying to run multiple linear regressions on large data sets, and biglm basically works fine. Now I am looking for a convenient way to create my formula automatically from a vector containing my dependent variables and a string containing the rest of the formula. The two strings pasted together are my formula.
This works fine for lm() but leads to an error using biglm().
Reproducible example:
library(biglm)
data<-data.frame(av=c(1,2,3,4,5,6,5,4,5,5),
uv1=c(1,2,5,5,4,56,3,4,5,6),
uv2=c(4,5,8,3,2,7,6,2,4,6),
weight=c(1.2,1,1,1,1,1,1,1,0,0))
dependent<-c('av')
independent<-'~ uv1 + uv2 -1'
formula<-paste(dependent[1],independent)
#this works fine
lm_standard<-lm(formula,data=data,weights=weight)
#and this works fine
lm_big1<-biglm(av~uv1+uv2-1,data=data,weights=~weight)
#and here comes the error
lm_big<-biglm(formula,data=data,weights=~weight)
Error: $ operator is invalid for atomic vectors
I don't use as.formula() because I don't know how to add the -1 to the as.formula() object, and my workaround leads to the error message above. Is it possible to a) use as.formula() with a missing intercept, or b) paste the formula in a way that biglm() can understand?
lm automatically coerces suitable objects to a formula object, whilst biglm does not. Just do it yourself....
lm_big<-biglm( as.formula( formula ) ,data=data,weights=~weight)
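To address the worry about the -1: a small base-R sketch (no biglm needed) suggesting that as.formula() keeps the "no intercept" term when it parses the pasted string. The names f1 and f2 are only for illustration, and reformulate() is shown as an alternative way to build the same formula, not as something the answer above relies on.

```r
dependent <- c("av")
independent <- "~ uv1 + uv2 - 1"

# as.formula() parses the whole string, so the "- 1" survives
f1 <- as.formula(paste(dependent[1], independent))

# reformulate() builds an equivalent formula from term labels
f2 <- reformulate(c("uv1", "uv2"), response = dependent[1], intercept = FALSE)

# the "intercept" attribute of the terms object is 0 when the intercept is dropped
attr(terms(f1), "intercept")
attr(terms(f2), "intercept")
```

Either f1 or f2 should then be usable wherever a formula object is expected.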
I'm running some Surv() functions, and one thing I do not like, or understand, is why this function does not take a "data=" argument. This is annoying because I want to perform the same Surv() function on the same data frame but filtered by different criteria each time.
So for example, my data frame is called "ikt" and I want to filter by "donor_type2=='LD'" and also use a strata variable "plan2". I tried the following but it didn't work:
library(survival)
library(dplyr)
ikt<-data.frame(organ_yrs=(seq(1,20)),
organ_status=rep(c(0,0,1,1),each=5),
plan2=rep(c('A','B','A','B'),each=5),
donor_type2=rep(c('LD','DD'),each=10) )
organ_surv_func<-function(data,criteria,strata) {
data2<-filter(data,criteria)
Surv(data2$organ_yrs,data2$organ_status)~data2$strata
}
organ_surv_func(ikt,donor_type2=='LD',plan2)
Error in filter_impl(.data, quo) : object 'donor_type2' not found
I'm coming from a SAS background so that's probably why I'm thinking this should work and it doesn't...
I looked up something about sapply(), but I don't think that works when the function doesn't have the data= option.
Also, the reason I need the Surv() object and not just survfit(Surv()) (which would let me use data=) is that I'm also using survdiff() for log-rank tests, which takes the Surv() object as its main argument:
lr<-function (surv) {
round(1-pchisq(survdiff(surv)$chisq,length(survfit(surv)$strata)-1),3)
}
Thanks for any help you can provide.
I'm writing this "answer" to caution you against proceeding down the path you seem to be following. The Surv function is really intended to be used as the LHS of a formula defined within one of the survival package functions. You should avoid using constructions like:
Surv(data2$organ_yrs,data2$organ_status)~data2$strata
For one thing it's needlessly verbose, but more importantly, it will prevent the use of predict when it comes time to match up names to formals. The survdiff and the other survival functions all have both a "data" argument as well as a "subset" argument. The subset function should allow you to avoid using filter.
organ_surv_func<-function(data, covar) {
form = as.formula(substitute( Surv(organ_yrs, organ_status) ~ covar, list(covar=covar) ) )
survdiff(form, data=data)
}
# although I think running survdiff in a for-loop might be easier,
# as it would involve fewer tricky language constructs
organ_surv_func( subset(ikt, (donor_type2=='LD')), covar=quote(plan2))
If you assign the output of survdiff to a named variable, you will be able to more economically access chisq and the number of groups:
myfit <- organ_surv_func( subset(ikt, (donor_type2=='LD')), covar=quote(plan2))
my.lr.test<-function (myfit) {
# a survdiff object stores the per-group counts in $n, so length(myfit$n) is the number of groups
round(1-pchisq(myfit$chisq, length(myfit$n)-1), 3)
}
my.lr.test(myfit) # not going to be useful with that dataset.
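As a rough sketch of the data= and subset= route mentioned above, here is survdiff() run on the lung data that ships with the survival package. The dataset, the grouping variable and the age cutoff are stand-ins chosen only for illustration; they are not part of the question.

```r
library(survival)

# lung ships with survival; the age cutoff is arbitrary and only for illustration
fit <- survdiff(Surv(time, status) ~ sex, data = lung, subset = age > 60)

# log-rank p-value: chi-square statistic on (number of groups - 1) degrees of freedom
p_value <- 1 - pchisq(fit$chisq, df = length(fit$n) - 1)
round(p_value, 3)
```

Changing the subset= expression gives a different filter each time without building a new data frame or calling filter().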
I am trying to apply a Box-Cox transformation to a variable (i.e. sqrt.CR) with lambda values from -2 to 2. Running the R code below gives an "invalid for atomic vectors" error. Checking earlier posts, I saw a few suggestions to convert the matrix into a data frame, but the error still shows up. Does anyone know how to fix this error?
R code.
Matrix to data frame conversion
drivers.data<-as.data.frame(drivers)
Box-Cox transform.
drivers$box_CR<-boxcox(drivers.data$sqrt.CR,lambda=seq(-2,2))
The input to boxcox must be the output of a lm or aov call, not a vector of numbers as yours appears to be. See ?boxcox.
boxcox(object, ...)
Arguments:
object: a formula or fitted model object. Currently only ‘lm’ and
‘aov’ objects are handled.
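A minimal sketch of what that looks like in practice, using MASS::boxcox() on a fitted lm. The data frame d and its columns are made up for illustration (the original drivers data is not available), and picking the lambda with the highest profile log-likelihood is one common choice, not the only one.

```r
library(MASS)

set.seed(1)
# hypothetical data: a strictly positive response, as Box-Cox requires
d <- data.frame(x = 1:50)
d$y <- (2 + 0.5 * d$x + rnorm(50, sd = 0.5))^2

# boxcox() wants a model (or formula), not a bare vector
fit <- lm(y ~ x, data = d, y = TRUE)
bc <- boxcox(fit, lambda = seq(-2, 2, 0.1), plotit = FALSE)

# lambda with the highest profile log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]

# apply the transformation by hand; boxcox() itself does not return transformed data
d$y_bc <- if (abs(lambda_hat) < 1e-8) log(d$y) else (d$y^lambda_hat - 1) / lambda_hat
```

To transform a single variable with no predictors, an intercept-only model such as lm(sqrt.CR ~ 1, data = drivers.data) could be passed instead.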
It could be because of a package conflict: in MASS, boxcox requires a fitted model object (lm or aov), whereas in bestNormalize it requires a numeric vector.
Try
bestNormalize::boxcox(drivers.data$sqrt.CR)
I'm trying to call a column name for the e1071 svm function.
The working code looks like:
model = svm(Air_Flow~., data = trainset)
But in an effort to make it more automated I changed it to:
coi=44
model = svm(colnames(data)[coi]~., data = trainset)
where
This didn't work due (I think) to the quote marks, so I tried:
get(colnames(data)[coi])
cat(...)
print(...,quote = F)
as.name(...)
parse(...)
Only get() sort of worked, but then when I tried to predict other values using model it didn't. Any suggestions on what may get this working?
Thanks
Formulas are not strings that you can just "paste" variables into. Nor are variable names the same as strings. You need to be careful about how you build expressions to make sure you are using the correct type. Formulas are really un-evaluated calls that hold names/symbols as parameters.
You might consider using bquote() to build your formula expression, and be sure to convert the character version of the variable name to a proper variable name with as.name()
coi=44
model = svm(eval(bquote(.(as.name(colnames(data)[coi])) ~ .)), data = trainset)
Yes, this is a bit ugly. That's why functions that allow formulas often also have an alternative interface that's easier to program against. svm() also allows you to pass in x and y parameters for the predictors and the response. You might do
model = svm(x = trainset[,-coi], y = trainset[,coi])
which is nicer because you can subset columns from your dataset with both string and numeric indexes.
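For anyone who wants to try the formula-building part without e1071 installed, here is a small sketch that uses lm() and the built-in mtcars data as stand-ins for svm() and trainset. The column index and the names target, f_paste and f_bquote are purely illustrative.

```r
coi <- 1                          # hypothetical column of interest
target <- colnames(mtcars)[coi]   # a character string, not yet a symbol

# Option 1: paste a string, then convert it to a real formula
f_paste <- as.formula(paste(target, "~ ."))

# Option 2: turn the string into a name, splice it in with bquote(),
# then evaluate the resulting call so it becomes a formula object
f_bquote <- eval(bquote(.(as.name(target)) ~ .))

fit_paste  <- lm(f_paste,  data = mtcars)
fit_bquote <- lm(f_bquote, data = mtcars)

# both routes should describe the same model
all.equal(coef(fit_paste), coef(fit_bquote))
```

Because the response is referenced by name inside a proper formula, predict() on new data should work in the usual way, which is where the get() approach tends to break down.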
After running something like:
mod.1<-lm(z~x+y)
I know I can do summary(mod.1) and see the $R^2$ value. I'm wondering how I might grab it from mod.1, sort of like grabbing the coefficients with mod.1$coefficients.
mod.1 = lm(c(1,2,3)~ c(1,2.3,3.4))
summary(mod.1)$r.squared
R-squared is actually not an element of the lm object itself, but of summary(mod.1). That is, if you type str(summary(mod.1)) you will see that the summary is itself a list (with a special print method) and that one of those list items is R-squared.
However, for programmatic use it's inefficient to calculate the entire summary just to extract one element. Rolling your own extractor function would lead to faster code in general, especially if you call lm with the argument y = TRUE. Then R-squared would just be 1 - sum(mod.1$residuals^2)/sum((mod.1$y - mean(mod.1$y))^2).
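A quick check of that formula against summary(), sketched on the built-in mtcars data (the model itself is arbitrary). Note that this version of R-squared assumes the model includes an intercept; summary.lm() uses a different total sum of squares when it does not.

```r
# y = TRUE keeps the response vector in the fitted object
fit <- lm(mpg ~ wt + hp, data = mtcars, y = TRUE)

# hand-rolled R-squared: 1 - residual SS / total SS
r2_manual <- 1 - sum(fit$residuals^2) / sum((fit$y - mean(fit$y))^2)

# the value computed by summary()
r2_summary <- summary(fit)$r.squared

all.equal(r2_manual, r2_summary)
```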
I have this function
ANN<-function (x,y){
DV<-rep(c(0:1),5)
X1<-c(1:10)
X2<-c(2:11)
ANN<-neuralnet(x~y,hidden=10,algorithm='rprop+')
return(ANN)
}
I need the function run like
formula=X1+X2
ANN(DV,formula)
and get the result of the function. So the problem is how to tell the function to use the objects that are created while the function runs. I need to run more combinations of x and y through lapply, so I need it this way. Any advice on how to achieve this? Thanks
I've edited my answer, this still works for me. Does it work for you? Can you be specific about what sort of errors you are getting?
New response:
ANN<-function (y){
X1<-c(1:10)
DV<-rep(c(0:1),5)
X2<-c(2:11)
dat <- data.frame(X1,X2)
ANN<-neuralnet(DV ~y,hidden=10,algorithm='rprop+',data=dat)
return(ANN)
}
formula<-X1+X2
ANN(formula)
If you want to specify the two parts of the formula separately, you should still pass them as formulas.
library(neuralnet)
ANN<-function (x,y){
DV<-rep(c(0:1),5)
X1<-c(1:10)
X2<-c(2:11)
formula<-update(x,y)
ANN<-neuralnet(formula,data=data.frame(DV,X1,X2),
hidden=10,algorithm='rprop+')
return(ANN)
}
ANN(DV~., ~X1+X2)
And assuming you're using neuralnet() from the neuralnet library, it seems the data= is required so you'll need to pass in a data.frame with those columns.
Formulas are special because they are not evaluated unless explicitly requested. This is different from a bare symbol, which is evaluated to something in the current frame as soon as you use it. This means there's a big difference between DV (a "name") and DV~. (a formula). The latter is safer for passing around to functions and evaluating in a different context. Things get much trickier with symbols/names.
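To illustrate the update() idea and the lapply use case from the question without needing neuralnet, here is a sketch in which lm() is swapped in for neuralnet(). The function name fit_model, the made-up X2 values and the list of right-hand sides are all assumptions for the example.

```r
# two formula pieces combined: the left side comes from the first, the right side from the second
f <- update(DV ~ ., ~ X1 + X2)
deparse(f)

fit_model <- function(lhs, rhs) {
  # the data lives inside the function, as in the question
  dat <- data.frame(DV = rep(0:1, 5),
                    X1 = 1:10,
                    X2 = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3))
  # lm() stands in for neuralnet() here; both take a formula and data=
  lm(update(lhs, rhs), data = dat)
}

# run several predictor combinations through lapply
rhs_list <- list(~ X1, ~ X2, ~ X1 + X2)
fits <- lapply(rhs_list, function(rhs) fit_model(DV ~ ., rhs))

# number of estimated coefficients per model (intercept included)
n_coef <- sapply(fits, function(m) length(coef(m)))
n_coef
```

Because the pieces are passed as formulas, nothing is looked up until lm() evaluates them against dat, so the variables only need to exist inside the function.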