R if statement in a function with a formula argument - r

testing<-function(formula=NULL,data=NULL){
if(with(data,formula)==T){
print('YESSSS')
}
}
A<-matrix(1:16,4,4)
colnames(A)<-c('x','y','z','gg')
A<-as.data.frame(A)
testing(data=A,formula=(2*x+y==Z))
Error in eval(expr, envir, enclos) : object 'x' not found
##or I can put formula=(x=1)
##the reason I use a formula is that my dataset has different locations and I want
##to 'subset' my data into different sets
This is the main flow of my code. I have done some searching, and it seems that either no one has asked this kind of question or it is not possible to pass a formula into an if statement. Thank you in advance.

If you just want a subset of your data.frame, create a character object representing the formula like this:
formula="2*x+y==z"
testing<-function(data,formula){with(data = data,expr = eval(parse(text = formula)))}
subset(A,testing(A,formula=formula))
#x y z gg
#2 2 6 10 14
You can change the formula as per your need.

If we need to evaluate it, one option is eval(parse(text = ...)):
testing<-function(formula=NULL,data=NULL){
data <- deparse(substitute(data))
if(any(eval(parse(text=paste("with(", data, ",",
deparse(substitute(formula)), ")")))))
print("YESSS")
}
testing(data=A,formula=(2*x+y==z))
#[1] "YESSS"
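The same effect can usually be had without building and re-parsing a string. The following is only a sketch of that idea, not the answerer's code: it captures the condition with substitute() and evaluates it with the data frame as the environment. The argument name condition and the returned logical vector are choices made for this illustration.

```r
# Rebuild the example data frame from the question.
A <- as.data.frame(matrix(1:16, 4, 4,
                          dimnames = list(NULL, c("x", "y", "z", "gg"))))

# Hypothetical variant of testing(): 'condition' is captured unevaluated.
testing <- function(condition, data) {
  cond_expr <- substitute(condition)             # the expression as typed by the caller
  rows <- eval(cond_expr, data, parent.frame())  # look up x, y, z in 'data' first
  if (any(rows)) print("YESSS")                  # any() reduces to a single TRUE/FALSE
  invisible(rows)
}

rows <- testing(2 * x + y == z, data = A)
A[rows, ]   # the rows that satisfy the condition
```

Returning the logical vector invisibly means the same function can drive both the if() check and a later subset.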

When you call a function in R, an argument is evaluated in the environment of the caller, not inside the function you are calling. (For ordinary R functions this happens lazily, the first time the argument is actually used.)
For example, in prod(2+2, 3) the 2+2 is computed as 4 from what the caller can see; prod() only ever receives the value.
Thus, in your code, as soon as with() touches formula, R tries to compute (2*x+y==Z) in the environment you called testing() from. It fails because there is no x object outside of the data frame, so the comparison never gets to see the columns of data.
To use your function correctly you should make it clear to R that it is not supposed to calculate (2*x+y==Z) itself. Instead it should pass the expression along as is and evaluate it later inside the data. You could do that using the functions expression() and eval().
testing<-function(formula=NULL,data=NULL){
if(with(data,eval(formula))==T){
print('YESSSS')
}
}
A<-matrix(1:16,4,4)
colnames(A)<-c('x','y','z','gg')
A<-as.data.frame(A)
testing(data=A,formula=expression(2*x+y==Z))
However, you will notice that there are other problems with your code.
First, Z is different from z. Notice that in colnames you use z but in the formula you use Z.
Second, if() only works when there is a single TRUE or FALSE value. In your case, you will have one value for each row in A. When this happens, older versions of R check only the first row and give a warning, and R 4.2.0 and later stop with an error. Wrap the condition in any() or all() to reduce it to a single value.
If your purpose is subsetting, it is much easier to do:
A.subset <- subset(A, 2*A$x+A$y == A$z)
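Putting those points together, a working version of the expression() approach might look like the sketch below. It assumes the lowercase column name z and uses any() so that if() receives a single value; the variable name matches and the invisible return value are additions for illustration.

```r
A <- as.data.frame(matrix(1:16, 4, 4,
                          dimnames = list(NULL, c("x", "y", "z", "gg"))))

testing <- function(formula, data) {
  # 'formula' is expected to be an expression() object supplied by the caller;
  # its first element is evaluated with the columns of 'data' in scope.
  matches <- eval(formula[[1]], data)
  if (any(matches)) print("YESSSS")
  invisible(matches)
}

matches <- testing(expression(2 * x + y == z), data = A)
which(matches)   # row numbers where the condition holds
```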

After a discussion with my colleague,
here is a kind of solution
testing<-function(cx,cy,px,py,z,data=NULL){
list<-NULL
for(m in 1:nrow(data)){
if(cx*data$x[m]^px+cy*data$y[m]^py+data$z[m]==0){
print(m)}
}
}
But this can deal with polynomials only and needs a lot of arguments in the function. I am thinking of a way to reduce it to a general equation, or maybe this is already the easiest approach.
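For what it is worth, the loop can be replaced by one vectorised comparison. This is a sketch under the assumption that the intended equation is cx*x^px + cy*y^py + z == 0 evaluated row by row; the function name matching_rows and the coefficient values are made up for the example.

```r
A <- as.data.frame(matrix(1:16, 4, 4,
                          dimnames = list(NULL, c("x", "y", "z", "gg"))))

# Hypothetical helper: row numbers where the polynomial equation holds.
matching_rows <- function(cx, cy, px, py, data) {
  which(cx * data$x^px + cy * data$y^py + data$z == 0)
}

# -2*x - y + z == 0 is the same condition as 2*x + y == z.
hits <- matching_rows(cx = -2, cy = -1, px = 1, py = 1, data = A)
```

Exact comparison with == is fine for these integer columns; with real-valued data a tolerance would be safer.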

Related

Using variable from another script (inter-file function closure)

I have script main.R, where I create inv_cov_mat variable. I later load metrics.R and use it to calculate function value (I use it as kind of inter-script function closure). I get error "object 'inv_cov_mat' not found". My code:
main.R:
knn <- function(...)
{
# some code
source("./source/metrics.R")
if (metric == "mahalanobis")
inv_cov_mat <- solve(cov(training_set))
# other code
# calculate distance in given metric between current vector and every row vector from training set matrix
distances <- apply(training_set, 1, metric, vec2=curr_vec) # error
metrics.R:
mahalanobis <- function(vec1, vec2)
{
diff <- vec1 - vec2
sqrt(t(diff) %*% inv_cov_mat %*% diff)
}
I've found a simple, even if not elegant, answer: use inv_cov_mat as a global variable instead of creating it inside the knn function. Then other scripts can see it.
It's not entirely clear what you want, but if I understand you correctly---you have a character string identifying the metric you want to use, and a function with the same name. So you should be able to use get to retrieve the function based on the name.
metric = "mahalanobis"
metric.fun = get(metric)
distances <- apply(training_set, 1, metric.fun, vec2=curr_vec)
That said, there are probably better ways to organize your code that would avoid this problem entirely, e.g. create a named list of functions for accessing metrics.
EDIT regarding the issue of inv_cov_mat, either pass it as an argument to your metric function or use get inside that function to access variables from the parent environment using the envir argument. Passing the variable as an argument to your metric function is definitely the better and cleaner approach.
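As a rough sketch of the "pass it as an argument" option: the function is renamed mahalanobis_dist here so it does not mask stats::mahalanobis, and the training data is random, purely for illustration.

```r
# Distance function that receives the inverse covariance matrix explicitly.
mahalanobis_dist <- function(vec1, vec2, inv_cov_mat) {
  diff <- vec1 - vec2
  sqrt(drop(t(diff) %*% inv_cov_mat %*% diff))
}

set.seed(1)
training_set <- matrix(rnorm(40), nrow = 10, ncol = 4)  # made-up data
curr_vec <- training_set[1, ]
inv_cov_mat <- solve(cov(training_set))

# Extra named arguments to apply() are forwarded to the distance function.
distances <- apply(training_set, 1, mahalanobis_dist,
                   vec2 = curr_vec, inv_cov_mat = inv_cov_mat)
```

Because the matrix travels with the call, the function no longer depends on where it was defined or sourced.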

Writing/applying "subtract the mean"-function to standardize regression parameters

I was trying to write and apply a seemingly easy function that would standardize my continuous regression parameters/ predictors. The reason is that I want to deal with multicollinearity.
So instead of writing x-mean(x,na.rm=T) each time, I'm looking for something handier that does the job for me, not least because I wanted to practice writing functions in R. ;)
So here is what I tried:
fun <- function(data.frame, x){
data.frame$x - mean(data.frame$x, na.rm=T)
}
Apparently this is not too wrong. At least it doesn't return an error message.
However, applying fun to, say, the built-in mtcars dataset and, say, the variable disp yields this error message:
#Loading the data:
data("mtcars")
fun(mtcars,x=disp) #I tried several ways, e.g. w and w/o "mtcars" in front
Warning message:
In mean.default(mtcars$x, na.rm = T) :
argument is not numeric or logical: returning NA
My guess is that it is about how I applied the function, because when I do manually what the function is supposed to do, it works perfectly.
Also, I was looking for similar questions on writing and applying such a function (also beyond the Stack Exchange universe), but I didn't find anything helpful.
Hope I didn't make a blunder due to my novice R-skills.
There is already a function in R which does what you want to do: scale().
You can just write scale(mtcars$hp, center = TRUE, scale = FALSE) which then subtracts the mean of the vector from the vector itself.
In combination with apply this is powerful. You can, for example, center every column of your data frame by writing:
apply(dataframe, MARGIN = 2, FUN = scale, center = TRUE, scale = FALSE)
Before you do that you have to make sure that this is a valid function for your column. You cannot scale factors or characters, for example.
With regard to your question: your function would have to look like this:
fun <- function(data.frame, x){
data.frame[[x]] - mean(data.frame[[x]], na.rm=T)
}
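and then when calling the function you would have to write fun(mtcars, "hp"), giving the variable name in quotation marks. This is because the $ operator takes the name after it literally: data.frame$x looks for a column that is actually called x, not for the column named by the argument x. The [[ operator, by contrast, accepts a character string.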
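A quick check that the two routes agree, as a sketch using the disp column from the question (center_column is just an illustrative name for the corrected function):

```r
# Column is selected by a character string via [[ ]].
center_column <- function(df, column) {
  df[[column]] - mean(df[[column]], na.rm = TRUE)
}

centered  <- center_column(mtcars, "disp")
via_scale <- as.vector(scale(mtcars$disp, center = TRUE, scale = FALSE))

# Both approaches should give the same centred values.
same <- isTRUE(all.equal(centered, via_scale))
```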

How can I keep the entire lm objects returned from j argument evaluation in data.tables?

Suppose a data.table:
z = data.table(k=1:10, h=1:100, i=1:100)
setkey(z, k)
I want to estimate for each key k, lm(h~i).
My first thought was just to try:
result = z[,lm(h~i),by=key(z)]
But this returns an error, reminding me, 'All items in j=list(...) should be atomic vectors or lists.'
Next, following the error's suggestion:
result = z[,list(lmcol=lm(h~i)),by=key(z)]
But,
result[1,class(lmcol[[1]])]
returns 'numeric' instead of 'lm'!
What is the correct procedure for recovering the entire lm object from the second code block?
Thanks!
Wrap everything in a list
result <- z[, list(lmcol = list(lm(h~i))), by = key(z)]
However, be warned that update etc. don't work well with this approach; see "Why is using update on a lm inside a grouped data.table losing its model data?" for a description of this problem.
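The extra list() matters because an lm object is itself a list, so without the wrapper its components get treated as separate items. A small base R sketch of that point, using mtcars purely as example data:

```r
fit <- lm(mpg ~ wt, data = mtcars)

is_plain_list <- is.list(fit)       # an lm object is a list underneath
first_piece   <- class(fit[[1]])    # its first component is the coefficient vector

wrapped       <- list(fit)          # one element holding the whole model
wrapped_class <- class(wrapped[[1]])
```

Here first_piece is "numeric", which matches what the question observed, while wrapped_class is "lm".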

lapply fail, but function works fine for each individual input arguments

Many thanks in advance for any advice or hints.
I'm working with data frames. The simplified coding is as follows:
`
f<-function(name){
x<-tapply(name$a,list(name$b,name$c),sum)
1) y<-dataset[[deparse(substitute(name))]]
#where dataset is an already existed list object with names the same as the
#function argument. I would like to avoid inputting two arguments.
z<-vector("list",n) #where n is also defined already
2) for (i in 1:n){z[[i]]<-x[y[[i]],i]}
...
}
lapply(list_names,f)
`
The warning message is:
In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
and the output is incorrect. I tried debugging and found the conflict may lie in line 1) and 2). However, when I try f(name) it is perfectly fine and the output is correct. I guess the problem is in lapply and I searched for a while but could not get to the point. Any ideas? Many thanks!
The structure of the data
Thanks Joran. Checking again I found the problem might not lie in what I had described. I produce the full code as follows and you can copy-paste to see the error.
n<-4
name1<-data.frame(a=rep(0.1,20),b=rep(1:10,each=2),c=rep(1:n,each=5),
d=rep(c("a1","a2","a3","a4","a5","a6","a7","a8","a9","a91"),each=2))
name2<-data.frame(a=rep(0.2,20),b=rep(1:10,each=2),c=rep(1:n,each=5),
d=rep(c("a1","a2","a3","a4","a5","a6","a7","a8","a9","a91"),each=2))
name3<-data.frame(a=rep(0.3,20),b=rep(1:10,each=2),c=rep(1:n,each=5),
d=rep(c("a1","a2","a3","a4","a5","a6","a7","a8","a9","a91"),each=2))
#d is the name for the observations. d corresponds to b.
dataset<-vector("list",3)
names(dataset)<-c("name1","name2","name3")
dataset[[1]]<-list(c(1,2),c(1,2,3,4),c(1,2,3,4,5,10),c(4,5,8))
dataset[[2]]<-list(c(1,2,3,5),c(1,2),c(1,2,10),c(2,3,4,5,8,10))
dataset[[3]]<-list(c(3,5,8,10),c(1,2,5,7),c(1,2,3,4,5),c(2,3,4,6,9))
f<-function(name){
x<-tapply(name$a,list(name$b,name$c),sum)
rownames(x)<-sort(unique(name$d)) #the row names for
y<-dataset[[deparse(substitute(name))]]
z<-vector("list",n)
for (i in 1:n){
z[[i]]<-x[y[[i]],i]}
nn<-length(unique(unlist(sapply(z,names)))) # the number of names appeared
names_<-sort(unique(unlist(sapply(z,names)))) # the names appeared add to the matrix
# below
m<-matrix(,nrow=nn,ncol=n);rownames(m)<-names_
index<-vector("list",n)
for (i in 1:n){
index[[i]]<-match(names(z[[i]]),names_)
m[index[[i]],i]<-z[[i]]
}
return(m)
}
list_names<-vector("list",3)
list_names[[1]]<-name1;list_names[[2]]<-name2;list_names[[3]]<-name3
names(list_names)<-c("name1","name2","name3")
lapply(list_names,f)
f(name1)
The lapply(list_names,f) call fails, but f(name1) produces exactly the matrix I want. Thanks again.
Why it doesn't work
The issue is the calling stack doesn't look the same in both cases. In lapply, it looks like
[[1]]
lapply(list_names, f) # lapply(X = list_names, FUN = f)
[[2]]
FUN(X[[1L]], ...)
In the expression being evaluated, f is called FUN and its argument name is called X[[1L]].
When you call f directly, the stack is simply
[[1]]
f(name1) # f(name = name1)
Usually this doesn't matter, but with substitute it does, because substitute looks at the expression that was supplied for the function argument, not at its value. When you get to
y<-dataset[[deparse(substitute(name))]]
inside lapply it's looking for the element in dataset named X[[1L]], and there isn't one, so y is bound to NULL.
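The difference is easy to reproduce in isolation. In this sketch, show_arg is a throwaway function that only reports what deparse(substitute()) sees; the exact text produced under lapply depends on the R version, so only its general shape is described in the comments.

```r
show_arg <- function(name) deparse(substitute(name))

name1 <- data.frame(a = 1)

direct     <- show_arg(name1)                            # the symbol as typed: "name1"
via_lapply <- lapply(list(name1 = name1), show_arg)[[1]] # lapply's internal X[[...]] expression
```

Since via_lapply is not a name in dataset, indexing dataset with it returns NULL.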
A way to get it to work
The simplest way to deal with this is probably to just have f operate on character strings and pass names(list_names) to lapply. This can be accomplished fairly easily by changing the beginning of f to
f<-function(name){
passed.name <- name
name <- list_names[[name]]
x<-tapply(name$a,list(name$b,name$c),sum)
rownames(x)<-sort(unique(name$d)) #the row names for
y<-dataset[[passed.name]]
# the rest of f...
and changing lapply(list_names, f) to lapply(names(list_names),f). This should give you what you want with nearly minimal modification, but you might also consider renaming some of your variables so the word name isn't used for so many different things: the function names, the argument of f, and all the various variables containing name.

R curve() on expression involving vector

I'd like to plot a function of x, where x is applied to a vector. Anyway, easiest to give a trivial example:
var <- c(1,2,3)
curve(mean(var)+x)
curve(mean(var+x))
While the first one works, the second one gives errors:
'expr' did not evaluate to an object of length 'n' and
In var + x : longer object length is not a multiple of shorter object length
Basically I want to find the minimum of such a function: e.g.
optimize(function(x) mean(var+x), interval=c(0,1))
And then be able to visualise the result. While the optimize function works, I can't figure out how to get curve() to work as well. Thanks!
The function needs to be vectorized. That means that if it is given a vector, it has to return a vector of the same length. If you pass any vector to mean, the result is always a vector of length 1. Thus, mean is not vectorized. You can use Vectorize:
f <- Vectorize(function(x) mean(var+x))
curve(f,from=0, to=10)
This can be done in the general case using sapply:
curve(sapply(x, function(e) mean(var + e)))
In the specific example you give, mean(var) + x, is of course arithmetically equivalent to what you're looking for. Similar shortcuts might exist for whatever more complicated function you're working with.
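To see the length requirement concretely without plotting anything, here is a small sketch comparing the plain and the vectorised function on a three-point grid (the grid values are arbitrary):

```r
var <- c(1, 2, 3)

plain      <- function(x) mean(var + x)   # collapses any input to one number
vectorised <- Vectorize(plain)            # applies 'plain' to each element in turn

grid <- c(0, 0.5, 1)
plain_out      <- plain(grid)       # length 1, which is what curve() complains about
vectorised_out <- vectorised(grid)  # one value per grid point
```

Because vectorised() returns one value per input, curve(vectorised, from = 0, to = 1) has what it needs.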

Resources