Creating models on subsets with data.table inside a function

Using data.table, I am trying to write a function that takes a data table, a formula object, and a string as arguments, and creates and stores multiple model objects.
myData <- data.table(c("A","A","A","B","B","B"),c(1,2,1,4,5,5),c(1,1,2,5,6,4))
## This works.
ModelsbyV1 <- myData[, list(model = list(lm(V2 ~ V3))), by = V1]
##This does not.
SectRegress <- function (df,eq,sectors) {
Output <- df[,list(model=list(lm(eq))),
by=sectors]
return(Output)
}
Test <- SectRegress(myData,formula(V2~V3),sectors="V1")
##Error in eval(expr, envir, enclos) : object 'X' not found
I have tried attaching the df inside the function, but that removes the ability to group by type. The colnames(df) inside the function include "X". I'm stumped.

You have to pass .SD as the data argument, because lm cannot "see" V2 and V3 otherwise:
SectRegress <- function (df,eq,sectors) {
Output <- df[, list(model=list(lm(eq, .SD))), by=sectors]
return(Output)
}
Test <- SectRegress(myData,formula(V2~V3),sectors="V1")
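
To see why supplying the data matters, here is a minimal base R sketch of the same per-group fit, assuming the pieces returned by split() stand in for .SD. The helper name fit_by_group and the column names are illustrative, not part of the original answer. lm looks up the formula's variables in its data argument first and only then in the formula's environment, so a formula created outside the function cannot find the columns unless the data is passed explicitly.

```r
# Base R sketch: fit one model per group by passing each subset as `data`.
# fit_by_group is a hypothetical helper name used only for illustration.
fit_by_group <- function(df, eq, group) {
  pieces <- split(df, df[[group]])                      # one data frame per group
  lapply(pieces, function(piece) lm(eq, data = piece))  # `piece` plays the role of .SD
}

myData <- data.frame(V1 = c("A", "A", "A", "B", "B", "B"),
                     V2 = c(1, 2, 1, 4, 5, 5),
                     V3 = c(1, 1, 2, 5, 6, 4))

models <- fit_by_group(myData, V2 ~ V3, "V1")
sapply(models, coef)  # one column of coefficients per group
```

The result is a named list of lm objects, one per level of the grouping column.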

Related

How can I create a function using variables in a data frame?

I'm sure the question is a bit dumb (sorry). I'm trying to create a function that uses different variables I have stored in a data frame. The function looks like this:
mlr_turb <- function(Cond_in, Flow_in, pH_in, pH_out, Turb_in, nm250_i, nm400_i, nm250_o, nm400_o){
Coag = (+0.032690 + 0.090289*Cond_in + 0.003229*Flow_in - 0.021980*pH_in - 0.037486*pH_out
+0.016031*Turb_in -0.026006*nm250_i +0.093138*nm400_o - 0.397858*nm250_o - 0.109392*nm400_o)/0.167304
return(Coag)
}
m4_turb <- mlr_turb(dataset)
The problem appears when I try to run my function on a data frame (with the same variable names). It doesn't detect my variables and shows this message:
Error in mlr_turb(dataset) :
argument "Flow_in" is missing, with no default
But the column is actually there, as are all the other variables.
I think I have misplaced or omitted something in the function that would let it take the variables from the dataset. I have searched a lot but have not found an answer.
No dumb questions!
I think you're looking for do.call. This function allows you to unpack values into a function as arguments. Here's a really simple example.
# a simple function that takes x, y and z as arguments
myFun <- function(x, y, z){
result <- (x + y)/z
return(result)
}
# a simple data frame with columns x, y and z
myData <- data.frame(x=1:5,
y=(1:5)*pi,
z=(11:15))
# unpack the values into the function using do.call
do.call('myFun', myData)
Output:
[1] 0.3765084 0.6902654 0.9557522 1.1833122 1.3805309
You have run into a standard R problem: standard evaluation (SE) versus non-standard evaluation (NSE). If you want more background, you can have a look at this blog post I wrote.
I think the most convenient way to write a function that uses variables is to take the variable names as arguments of the function.
Let's revisit @Muon's example.
# a simple function that takes x, y and z as arguments
myFun <- function(x, y, z){
result <- (x + y)/z
return(result)
}
The question is where R should find the values behind the names x, y and z. For a function defined at top level, R first looks in the function's own environment (here x, y and z are defined as parameters), then in the global environment, and then in the attached packages.
In myFun, R expects vectors. If you pass a column name instead, you will get an error. What if you want to pass a column name? You must tell R that the name should be looked up in the scope of a data frame. For instance, you can do something like this:
myFun <- function(df, col1 = "x", col2 = "y", col3 = "z"){
result <- (df[,col1] + df[,col2])/df[,col3]
return(result)
}
You can go much further in this direction with the data.table package. If you start writing functions that need to use variables from a data frame, I recommend having a look at this package.
I like Muon's answer, but I couldn't get it to work when the data.frame has columns that are not arguments of the function. Using with() is a simple way to make this work as well:
#Code from Muon:
# a simple function that takes x, y and z as arguments
myFun <- function(x, y, z){
result <- (x + y)/z
return(result)
}
# a simple data frame with columns x, y and z
myData <- data.frame(x=1:5,
y=(1:5)*pi,
z=(11:15),
a=6:10) #adding a var not used in myFun
# unpack the values into the function using do.call
do.call('myFun', myData)
#generates an error for the unused "a" column
#using with() function:
with(myData, myFun(x, y, z))
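
Another option, sketched here as an assumption rather than something from the answers above, is to keep do.call but pass only the columns that the function actually declares as arguments. formals() returns a function's argument list, so its names can be used to select the matching columns.

```r
# Sketch: restrict the data frame to the function's own arguments before do.call.
myFun <- function(x, y, z) {
  (x + y) / z
}

myData <- data.frame(x = 1:5,
                     y = (1:5) * pi,
                     z = 11:15,
                     a = 6:10)  # extra column not used by myFun

needed <- names(formals(myFun))            # "x" "y" "z"
result <- do.call(myFun, myData[needed])   # column "a" is never passed
result
```

This relies on the column names matching the argument names exactly; a missing column would still produce an error.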

R: rBind from Matrix package does not work for sparse matrices

I have the following code:
concept_vectors <- foreach(j = 1:2, .combine=rBind, .packages="Matrix") %do% {
Matrix::colMeans(sparseX[1:10,],sparseResult=TRUE)
}
which results in the following error message:
Error in { : no method for coercing this S4 class to a vector
However, if I either remove the sparseResult=TRUE option or do not use colMeans at all, the code works, even though without colMeans sparseX is still an S4 object.
If I replace rBind with rbind2 directly, then I still see the following error:
error calling combine function:
<simpleError in .__H__.rbind(deparse.level = 0, x, y): no method for coercing this S4 class to a vector>
Do you know any workaround for this?
The problem is that colMeans returns a sparseVector, not a sparseMatrix, so rBind cannot combine several sparseVector objects into a sparseMatrix.
As mentioned at https://stackoverflow.com/a/8979207/1075993, the solution is to write a function that combines multiple sparseVector objects into a sparseMatrix:
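library(Matrix)

# Combine a list of equal-length sparseVector objects into one sparseMatrix,
# with one input vector per row.
sameSizeVectorList2Matrix <- function(vectorList){
  sm_i <- NULL  # row indices
  sm_j <- NULL  # column indices
  sm_x <- NULL  # non-zero values
  for (k in seq_along(vectorList)) {
    sm_i <- c(sm_i, rep(k, length(vectorList[[k]]@i)))
    sm_j <- c(sm_j, vectorList[[k]]@i)
    sm_x <- c(sm_x, vectorList[[k]]@x)
  }
  return(sparseMatrix(i=sm_i, j=sm_j, x=sm_x, dims=c(length(vectorList), vectorList[[1]]@length)))
}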
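
A minimal usage sketch follows, assuming the Matrix package is installed. The two input vectors are made up for illustration; the function is repeated so the snippet runs on its own. Note that S4 slots are accessed with @.

```r
library(Matrix)

# Same helper as above, repeated so this snippet is self-contained.
sameSizeVectorList2Matrix <- function(vectorList) {
  sm_i <- NULL
  sm_j <- NULL
  sm_x <- NULL
  for (k in seq_along(vectorList)) {
    sm_i <- c(sm_i, rep(k, length(vectorList[[k]]@i)))
    sm_j <- c(sm_j, vectorList[[k]]@i)
    sm_x <- c(sm_x, vectorList[[k]]@x)
  }
  sparseMatrix(i = sm_i, j = sm_j, x = sm_x,
               dims = c(length(vectorList), vectorList[[1]]@length))
}

# Two illustrative sparse vectors of length 5.
v1 <- sparseVector(x = c(1, 2), i = c(1, 4), length = 5)
v2 <- sparseVector(x = 3, i = 2, length = 5)

m <- sameSizeVectorList2Matrix(list(v1, v2))
dim(m)  # one row per input vector, one column per vector element
```

In the foreach setting from the question, one way to use this would be to let the loop return a plain list of sparseVector objects and call the helper once on that list, instead of combining with rBind.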

Scoping assignment and local, bound and global variable in R

I am new to R and trying to figure out the behavior of local, bound and global variables. I am confused by the following problem. If I write the function in the following way, what are the local, bound and global variables of function f?
f <- function(a ="") {
return_a <- function() a
set_a <- function(x)
a <<- x
list(return_a,set_a)
}
return_a is a function. set_a is a function. They are both function objects (with associated environments), but using the word "variable" to describe them seems prone to confusion. If you call f, you get a list of two functions. The elements of a list do not necessarily have names, so p$set_a("Carl") throws an error because there is no p[['set_a']].
> p <- f("Justin"); p$set_a("Carl")
Error: attempt to apply non-function
But p[[2]] now returns a function and you need to call it:
> p[[2]]
function(x)
a <<- x
<environment: 0x3664f6a28>
> p[[2]]("Carl")
That did change the value of the symbol a in the environment of p[[1]]:
> p[[1]]()
[1] "Carl"
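
If access by name is wanted, one option is to name the list elements when building the list. This is a small variation on the question's f, sketched here for illustration; the only change is the names in the final list() call.

```r
# Variation on f: name the list elements so they can be reached with `$`.
f <- function(a = "") {
  return_a <- function() a
  set_a <- function(x) a <<- x   # assigns to `a` in f's environment, shared by both closures
  list(return_a = return_a, set_a = set_a)
}

p <- f("Justin")
p$set_a("Carl")
p$return_a()
```

Both closures share the environment created by the call to f, which is why the value set through set_a is visible through return_a.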

Difficulties using `with` inside a function

I am trying to understand how to pass a data frame to an R function. I found an answer on Stack Overflow ("Pass a data.frame column name to a function") that provides the following demonstration / solution:
df <- data.frame(A=1:10, B=2:11, C=3:12)
fun1 <- function(x, column){
max(x[,column])
}
fun1(df, "B")
fun1(df, c("B","A"))
This makes sense to me, but I don't quite understand the rules for referring to data frames within a function. Take the following example:
data(iris)
x.test <- function(df, x){
out <- with(df, mean(x))
return(out)
}
x.test(iris, "Sepal.Length")
The output of this is NA, with a warning message. But, if I do the same procedure without the function it seems to work just fine.
with(iris, mean(Sepal.Length))
I'm obviously missing something here -- any help would be greatly appreciated.
Thanks!
You have already been given the correct advice (use "[" or "[[" rather than with inside functions), but it may also help to consider why the problem occurred. Inside with, you asked mean to return the mean of a character vector, so the result was NA. When you used with interactively, there were no quotes around the column name; had there been, you would have gotten the same result:
> with(iris, mean('Sepal.Length'))
[1] NA
Warning message:
In mean.default("Sepal.Length") :
argument is not numeric or logical: returning NA
Had you used get to turn the character string into the object it names, you would actually have succeeded, although with is still generally not recommended for programming use:
x.test <- function(df, x){
out <- with(df, mean( get(x)) ) # get() looks up the object named by x, here a column of df
return(out)
}
x.test(iris, "Sepal.Length")
#[1] 5.843333
See the Details section of the ?with page for warnings about its use in functions.
This will work:
data(iris)
x.test <- function(df, x){
out <- mean(df[, x])
return(out)
}
x.test(iris, "Sepal.Length")
Your code is trying to take mean("Sepal.Length") which is clearly not what you want.

Anonymous function in R

Using a dataset w, which includes a numeric column PY, I can do:
nrow(subset(w, PY==50))
and get the correct answer. If, however, I try to create a function:
fxn <- function(dataset, fac, lev){nrow(subset(dataset, fac==lev))}
and run
fxn(w, PY, 50)
I get the following error:
Error in eval(expr, envir, enclos) : object 'PY' not found
What am I doing wrong? Thanks.
From the documentation of subset:
Warning
This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.
This rather obscure warning was very well explained here: Why is `[` better than `subset`?
The bottom line is that subset should only be used interactively, and in particular not inside a wrapper like the one you are trying to write. Use [ instead, and pass the column name as a string, e.g. fxn(w, "PY", 50):
fxn <- function(dataset, fac, lev) nrow(dataset[dataset[fac] == lev, , drop = FALSE])
or rather simply:
fxn <- function(dataset, fac, lev) sum(dataset[fac] == lev)
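
As a quick check of this approach, here is a sketch on the built-in mtcars data, which stands in for the asker's dataset w. The name count_matches is hypothetical, and the sketch uses [[ to pull the column out as a plain vector, a small variation on the one-liner above.

```r
# Sketch: count rows where a column, given by name as a string, equals a value.
# count_matches is an illustrative name; mtcars stands in for the dataset `w`.
count_matches <- function(dataset, fac, lev) {
  sum(dataset[[fac]] == lev)   # [[ returns the column as a vector
}

n_four <- count_matches(mtcars, "cyl", 4)
n_four

# Interactive equivalent for comparison:
nrow(subset(mtcars, cyl == 4))
```

Because the column name is an ordinary string, it can also come from a variable or a loop over several columns.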
