R calling string for lookup in a function - r

I'm trying to call a column name for the e1071 svm function.
The working code looks like:
model = svm(Air_Flow~., data = trainset)
But in an effort to make it more automated I changed it to:
coi=44
model = svm(colnames(data)[coi]~., data = trainset)
where
This didn't work due (I think) to the quote marks, so I tried:
get(colnames(data)[coi])
cat(...)
print(...,quote = F)
as.name(...)
parse(...)
Only get() sort of worked, but then when I tried to predict other values using model it didn't. Any suggestions on what may get this working?
Thanks

Formulas are not strings that you can just "paste" variables into. Nor are variable names the same as strings. You need to be careful about how you build expressions to make sure you are using the correct type. Formulas are really un-evaluated calls that hold names/symbols as parameters.
You might consider using bquote() to build your formula expression, and be sure to convert the character version of the variable name to a proper variable name with as.name()
coi=44
model = svm(bquote(.(as.name(colnames(data)[coi])~.), data = trainset)
Yes, this is a bit ugly. That's why often functions that allow formulas also have an alternative interface that's easier to program against. svm() also allows you to pass in an x and y parameter for the response and predictors. You might do
model = svm(trainset[,col], trainset[,-col])
which is nicer because you can subset columns from your dataset with both string and numeric indexes

Related

Using Surv function repeatedly with different data frames

I'm running some Surv() functions, and one thing I do not like, or understand, is why this function does not take a "data=" argument. This is annoying because I want to perform the same Surv() function on the same data frame but filtered by different criteria each time.
So for example, my data frame is called "ikt" and I want to filter by "donor_type2=='LD'" and also use a strata variable "plan 2". I tried the following but it didn't work:
library(survival)
library(dplyr)
ikt<-data.frame(organ_yrs=(seq(1,20)),
organ_status=rep(c(0,0,1,1),each=5),
plan2=rep(c('A','B','A','B'),each=5),
donor_type2=rep(c('LD','DD'),each=10) )
organ_surv_func<-function(data,criteria,strata) {
data2<-filter(data,criteria)
Surv(data2$organ_yrs,data2$organ_status)~data2$strata
}
organ_surv_func(ikt,donor_type2=='LD',plan2)
Error in filter_impl(.data, quo) : object 'donor_type2' not found
I'm coming from a SAS background so that's probably why I'm thinking this should work and it doesn't...
I looked up something about sapply(), but I don't think that works when the function doesn't have the data= option.
Also the reason I need the Surv() object and not just survfit(Surv()) (which would let me use data=) is because I'm also using survdiff() for log-rank tests, which takes in the Surv() object as it's main argument:
lr<-function (surv) {
round(1-pchisq(survdiff(surv)$chisq,length(survfit(surv)$strata)-1),3)
}
Thanks for any help you can provide.
I'm writing this "answer" to caution you against proceeding down the path you seem to be following. The Surv function is really intended to be used as the LHS of a formula defined within one of the survival package functions. You should avoid using constructions like:
Surv(data2$organ_yrs,data2$organ_status)~data2$strata
For one thing it's needlessly verbose, but more importantly, it will prevent the use of predict when it comes time to match up names to formals. The survdiff and the other survival functions all have both a "data" argument as well as a "subset" argument. The subset function should allow you to avoid using filter.
organ_surv_func<-function(data, covar) {
form = as.formula(substitute( Surv(organ_yrs, organ_status) ~ covar, list(covar=covar) ) )
survdiff(form, data=data)
}
# although I think running surdiff in a for-loop might be easier,
# as it would involve fewer tricky language constructs
organ_surv_func( subset(ikt, (donor_type2=='LD')), covar=quote(plan2))
If you assign the output of survfit to a named variable, you will be able to more economically access chisq and strata:
myfit <- organ_surv_func( subset(ikt, (donor_type2=='LD')), covar=quote(plan2))
my.lr.test<-function (myfit) {
round(1-pchisq(myfit$chisq, length(myfit$strata)-1), 3)
}
my.lr.test(myfit) # not going to be useful with that dataset.

not error, but not results either in R

I am trying to make a function in R that calculates the mean of nitrate, sulfate and ID. My original dataframe have 4 columns (date,nitrate, sulfulfate,ID). So I designed the next code
prueba<-read.csv("C:/Users/User/Desktop/coursera/001.csv",header=T)
columnmean<-function(y, removeNA=TRUE){ #y will be a matrix
whichnumeric<-sapply(y, is.numeric)#which columns are numeric
onlynumeric<-y[ , whichnumeric] #selecting just the numeric columns
nc<-ncol(onlynumeric) #lenght of onlynumeric
means<-numeric(nc)#empty vector for the means
for(i in 1:nc){
means[i]<-mean(onlynumeric[,i], na.rm = TRUE)
}
}
columnmean(prueba)
When I run my data without using the function(), but I use row by row with my data it will give me the mean values. Nevertheless if I try to use the function so it will make all the steps by itself, it wont mark me error but it also won't compute any value, as in my environment the dataframe 'prueba' and the columnmean function
what am I doing wrong?
A reproducible example would be nice (although not absolutely necessary in this case).
You need a final line return(means) at the end of your function. (Some old-school R users maintain that means alone is OK - R automatically returns the value of the last expression evaluated within the function whether return() is specified or not - but I feel that using return() explicitly is better practice.)
colMeans(y[sapply(y, is.numeric)], na.rm=TRUE)
is a slightly more compact way to achieve your goal (although there's nothing wrong with being a little more verbose if it makes your code easier for you to read and understand).
The result of an R function is the value of the last expression. Your last expression is:
for(i in 1:nc){
means[i]<-mean(onlynumeric[,i], na.rm = TRUE)
}
It may seem strange that the value of that expression is NULL, but that's the way it is with for-loops in R. The means vector does get changed sequentially, which means that BenBolker's advice to use return(.) is correct (as his advice almost always is.) . For-loops in R are a notable exception to the functional programming paradigm. They provide a mechanism for looping (as do the various *apply functions) but the commands inside the loop exert their effects in the calling environment via side effects (unlike the apply functions).

R: parameter in update function

Here is a snippet of R script doing beta regression on data "GasolineYield":
library("betareg")
data("GasolineYield", package = "betareg")
gy_logit <- betareg(yield ~ batch + temp, data = GasolineYield)
gy_logit4 <- update(gy_logit, subset = -4)
The 4th line magically deletes the 4th observation and update the fit automatically, but I don't quite understand the why this parameter works in the update function here, because I tried to look up the documentation by ?update, but couldn't find there's such parameter.
I'm curious about how to find right documentation in this case, because maybe I want to add some new observation instead of removing it. Any help?
subset in betareg works the same as subset in lm, therefore you can read lm documentation.
From the help file you can find:
subset an optional vector specifying a subset of observations to be used in the fitting process.
Hence by setting select=-4 you are lefting out the fourth row in the estimation.
update() contains the ... parameter, which means any parameters that are not matched in your call to update() are passed on to the function that does the estimation. In this case, that is betareg(), which does have the subset argument.
This type of thing is very common in R. Many higher-level function that call other user-visible functions will have the three dot parameter and pass any unmatched parameters on, so you have to search all the user-visible functions that get called in order to know all possible options.
You can check out the help file for the top level function (update() in this case) to get an idea of which functions get the leftover parameters.

Use the multiple variables in function in r

I have this function
ANN<-function (x,y){
DV<-rep(c(0:1),5)
X1<-c(1:10)
X2<-c(2:11)
ANN<-neuralnet(x~y,hidden=10,algorithm='rprop+')
return(ANN)
}
I need the function run like
formula=X1+X2
ANN(DV,formula)
and get result of the function. So the problem is to say the function USE the object which was created during the run of function. I need to run trough lapply more combinations of x,y, so I need it this way. Any advices how to achieve it? Thanks
I've edited my answer, this still works for me. Does it work for you? Can you be specific about what sort of errors you are getting?
New response:
ANN<-function (y){
X1<-c(1:10)
DV<-rep(c(0:1),5)
X2<-c(2:11)
dat <- data.frame(X1,X2)
ANN<-neuralnet(DV ~y,hidden=10,algorithm='rprop+',data=dat)
return(ANN)
}
formula<-X1+X2
ANN(formula)
If you want so specify the two parts of the formula separately, you should still pass them as formulas.
library(neuralnet)
ANN<-function (x,y){
DV<-rep(c(0:1),5)
X1<-c(1:10)
X2<-c(2:11)
formula<-update(x,y)
ANN<-neuralnet(formula,data=data.frame(DV,X1,X2),
hidden=10,algorithm='rprop+')
return(ANN)
}
ANN(DV~., ~X1+X2)
And assuming you're using neuralnet() from the neuralnet library, it seems the data= is required so you'll need to pass in a data.frame with those columns.
Formulas as special because they are not evaluated unless explicitly requested to do so. This is different than just using a symbol, where as soon as you use it is evaluated to something in the proper frame. This means there's a big difference between DV (a "name") and DV~. (a formula). The latter is safer for passing around to functions and evaluating in a different context. Things get much trickier with symbols/names.

Taylor diagram from existing Correlation and Standard Dev values

Is it possible to create a Taylor diagram from already calculated correlation and standard deviation values?
I am doing model evaluation, and I have already the correlation and standard deviations values.I understand that there is already a package plotrix where by giving the observation and the modeled values, the diagram is created. However for the type of work that I am doing, it is easier to start by giving already the correlation and standard deviation values.
Is there any way I can do this in R?
There's no reason it shouldn't be possible, but the authors didn't seem to allow for that when they wrote the function. The function is a bit long and complex, but the part that does the calculation is at the top. It is possible to swap out that code and replace it to allow for the passing of summary statistics. Now, keep in mind what i'm about to do is a hack and i've only tested it with versions 3.5-5 of plotrix. Other version may not work.
Here will will create a new function taylor.diagram2 that takes all the code from taylor.diagram but adds in an extra if statement to check for a list of summarized data as the first argument
taylor.diagram2<-taylor.diagram
bl<-as.list(body(taylor.diagram))
cond<-list(
as.name("if"),
quote(is.list(ref) & missing(model)), #condition
quote({R<-ref$R; sd.r<-ref$sd.r; sd.f<-ref$sd.f}), #if true
as.call(c(as.symbol("{"), bl[3:8]))) #else
bl<-c(bl[1:2], as.call(cond), bl[9:length(bl)]) #splice in new code
body(taylor.diagram2)<-as.call(bl) #update function
Now we can test the function. First, we'll do things the standard way
#test data
aref<-rnorm(30,sd=2)
amodel1<-aref+rnorm(30)/2
#standard behavior function
taylor.diagram2(aref,amodel1, main="Standard Behavior"))
#summarized data
xx<-list(
R=cor(aref, amodel1, use = "pairwise"),
sd.r=sd(aref),
sd.f=sd(amodel1)
)
#modified behavior
taylor.diagram2(xx, main="Modified Behavior")
So the new taylor.diagram2 function can do both. If you pass it two vectors, it will do the standard behavior. If you pass it a list with the names R, sd.r, and sd.f, then it will do the same plot but with the values you passed in. Also, the model parameter must be empty for the modified version to work. That means if you want to set any additional parameter, you must use named parameters rather than positional arguments.

Resources