How does mutate_all(.funs=~./sum(x)) work? - r

I used this code to calculate relative abundance (cell/total of column) of a table I had. I don't understand how the . and ~ functions work.

The construct ~./sum(x) is technically a special type of R object called a formula:
class(~./sum(x))
#> [1] "formula"
However, in tidyverse functions such as mutate_all, this formula is taken and converted into a lambda function, which is an anonymous function (i.e. a function that isn't named and is written in place as a parameter passed in a call to another function).
Internally, the formula is converted into a function with rlang::as_function. Suppose we wanted to write a function that just adds two to a variable. In base R we might write
add_two <- function(var){
return(var + 2)
}
add_two(5)
#> [1] 7
In the tidyverse, we can use a formula as shorthand for this function, where the . becomes a shorthand for "the variable that was passed as a first argument to the function":
add_two <- rlang::as_function(~ . + 2)
add_two(5)
#> [1] 7
In functions such as mutate_all, the formula will automatically be passed through rlang::as_function, so if we wanted to add two to each column in our data frame, instead of writing:
mutate_all(.funs= function(var) {return(var + 2);})
we could write
mutate_all(.funs=~.+2)
In your case, the formula ~./sum(x) is effectively transformed into
function(var) {
return(var / sum(x))
}
where x has to exist either as a column in your data frame or a variable in the calling environment.
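To see the conversion in isolation, here is a minimal sketch that calls rlang::as_function directly, outside of mutate_all. The vectors x and counts are made up for illustration; the sketch only shows how . and a free variable such as x are resolved when the lambda is called on its own.

```r
library(rlang)

# Hypothetical data: `x` lives in the calling environment, not in a data frame
x <- c(10, 30)
counts <- c(1, 2, 3, 4)

# `.` is the first argument; `x` is looked up where the formula was created
div_by_x <- as_function(~ . / sum(x))
div_by_x(counts)     # counts / 40

# Using `.` on both sides divides the input by its own total
div_by_self <- as_function(~ . / sum(.))
div_by_self(counts)  # counts / 10
```

If the goal is "each value divided by its own column total", the second form is the one that does not depend on an outside x.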
The reasons for having it this way are that it saves typing and shortens lines of code. Inserting a function within a call to another function often leads to messy and poorly formatted code. This shorthand method helps to prevent that.
You can read more about anonymous functions and how they are used in the tidyverse here

Suppose we have this dataset:
dataset <- data.frame(a = c(1,2,3,4),
b = c(2,3,4,5),
c = c(3,4,5,6))
And you want to divide every vector by its total (i.e. for vector a: 1/10, 2/10, 3/10, 4/10). To avoid writing this out for each variable, you can use mutate_all with a lambda in .funs, which says: make a function that divides all values in each vector (represented by the dot) by the sum of all values in that vector.
dataset %>% mutate_all(.funs = ~./sum(.))
Hope this helps.
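As a quick check of the call above, the following sketch runs it on the same dataset and compares it with the across() spelling, which I am assuming is available (dplyr 1.0.0 or later). It is only meant to confirm that every column ends up summing to 1.

```r
library(dplyr)

dataset <- data.frame(a = c(1, 2, 3, 4),
                      b = c(2, 3, 4, 5),
                      c = c(3, 4, 5, 6))

# Scoped-verb form used in the answer
rel_all <- dataset %>% mutate_all(.funs = ~ . / sum(.))

# Equivalent form with across(); `.x` plays the same role as `.`
rel_across <- dataset %>% mutate(across(everything(), ~ .x / sum(.x)))

# Every column should now sum to 1
colSums(rel_all)
```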

mutate_all applies the function in .funs to every column. Each value (.) is divided by sum(x), which gives you the "relative abundance": the fraction of the total, where the total is sum(x). You can think of ~ as "a function of". So you are saying each cell in the data frame is a function of itself divided by the overall sum.

Related

R Tidymodels: error columns don't exist when using function argument to specify column

I'm trying to write a function to use the R tidymodels function initial_split with an argument that would let me change the strata to a different variable each time I call the function.
Using initial_split regularly like this works perfectly:
split_glab=initial_split(data,prop=0.7,strata=sp_glabrata)
Then I converted it to a function and plugged in my species parameter:
split_data=function(df,species){
initial_split(df,prop=0.7,strata=species)
}
split_data(data,species=sp_glabrata)
And get the following error:
Error: Can't subset columns that don't exist.
x Column `species` doesn't exist.
Of course, this column doesn't exist in my data since it's just an argument in my function --the column I'm trying to reference is called sp_glabrata. I can't figure out how to get my function to reference the column instead of the parameter. I don't want to just type the column name since I have to apply many similar functions to several columns and it would take forever.
Any guidance would be appreciated!
As it is a tidyverse-style package, you can use the curly-curly operator ({{}}) to evaluate the unquoted argument as a column name:
library(tidymodels)
split_data <- function(df, species){
initial_split(df, prop=0.7, strata={{species}})
}
Testing:
split_data(iris, species = Species)
#<Analysis/Assess/Total>
#<105/45/150>
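The same pattern works in any function that forwards a column to a tidy-evaluation verb. Here is a small sketch with dplyr instead of initial_split; count_by is a hypothetical helper written only for this illustration.

```r
library(dplyr)

# Hypothetical helper: `group` is an unquoted column name supplied by the caller
count_by <- function(df, group) {
  df %>%
    group_by({{ group }}) %>%
    summarise(n = n(), .groups = "drop")
}

count_by(iris, Species)  # one row per species, 50 observations each
```

Without the {{ }}, group_by would look for a column literally named group, which is the same failure as the "Column `species` doesn't exist" error above.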

Calculating multiple ROC curves in R using a for loop and pROC package. What variable to use in the predictor field?

I am using the pROC package and I want to calculate multiple ROC curve plots using a for loop.
My variables are specific column names that are included as string in a vector and I want pROC to read sequentially that vector and use the strings in the field "predictor" that seems to accept text/characters.
However, I cannot parse correctly the variable, as I am getting the error:
'predictor' argument should be the name of the column, optionally quoted.
here is an example code with aSAH dataset:
ROCvector<- c("s100b","ndka")
for (i in seq_along(ROCvector)){
a<-ROCvector[i]
pROC_obj <- roc(data=aSAH, outcome, as.character(a))
#code for output/print#
}
I have tried to call just "a" and using the functions print() or get() without any results.
Writing manually the variable (with or without quoting) works, of course.
Is there something I am missing about the type of variable I should use in the predictor field?
By passing data=aSAH as first argument, you are triggering the non-standard evaluation (NSE) of arguments, dplyr-style. Therefore you cannot simply pass the column name in a variable. Note the inconsistency with outcome that you pass unquoted and looks like a variable (but isn't)? Fortunately, functions with NSE in dplyr come with an equivalent function with standard evaluation, whose name ends with _. The pROC package follows this convention. You should usually use those if you are programming with column names.
Long story short, you should use the roc_ function instead, which accepts characters as column names (don't forget to quote "outcome"):
pROC_obj <- roc_(data=aSAH, "outcome", as.character(a))
A slightly more idiomatic version of your code would be:
for (predictor in ROCvector) {
pROC_obj <- roc_(data=aSAH, "outcome", predictor)
}
roc can accept formula, so we can use paste0 and as.formula to create one. i.e.
library(pROC)
ROCvector<- c("s100b","ndka")
for (i in seq_along(ROCvector)){
a<-ROCvector[i]
pROC_obj <- roc(as.formula(paste0("outcome~",a)), data=aSAH)
print(pROC_obj)
#code for output/print#
}
To get the original call (i.e. without paste0), which you can use later for downstream calculations, use eval and bquote:
pROC_obj <- eval(bquote(roc(.(as.formula(paste0("outcome~",a))), data=aSAH)))
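As an alternative to paste0 plus as.formula, base R's reformulate() builds the same kind of formula from strings. The sketch below only constructs and prints the formulas; it does not call pROC, so it runs without the package.

```r
predictors <- c("s100b", "ndka")

for (p in predictors) {
  # reformulate() builds `outcome ~ <predictor>` from character strings
  f <- reformulate(p, response = "outcome")
  print(deparse(f))
  # f could then be passed on, e.g. roc(f, data = aSAH) as in the answer above
}
```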

peculiar syntax for function within()

I came across this fantastic function called
within {base}
I use it more often now than the much hyped
mutate {dplyr}
My question is: why does within() have such a peculiar format, with the assignment operator <- used instead of the usual = for arguments? How is it different from mutate, other than what is given in this fantastic article I found? I am interested to know the underlying mechanism.
Article of Bob Munchen - 2013
The function within takes an expression as second argument. That expression is essentially a codeblock, best contained within curly brackets {}.
In this codeblock, you can assign new variables, change values and the likes. The variables can be used in the codeblock as objects.
mutate on the other hand takes a set of arguments for the mutation. These arguments have to be named after the variable that should be created, and get the value for that variable as the value.
So :
mutate(iris, ratio = Sepal.Length/Petal.Length)
# and
within(iris, {ratio = Sepal.Length/Petal.Length})
give the same result. The problem starts when you remove the curly brackets:
> within(iris, ratio = Sepal.Length/Petal.Length)
Error in eval(substitute(expr), e) : argument is missing, with no default
The curly brackets enclosed an expression (piece of code), and hence within() worked correctly. If you don't use the {}, then R semantics reads that last command as "call the function within with iris as first argument and a second argument called ratio set to Sepal.Length/Petal.Length". And as the function within() doesn't have an argument ratio, that one is ignored. Instead, within looks for the expression that should be the second argument. But it can't find that one, so that explains the error.
So there's little peculiar about it. Both functions just have different arguments. All the rest is pretty much how R deals with arguments.
Args of within are not assigned with <- but with the usual =.
Let's see the first example in your link:
mydata.new <- within(mydata, {
x2 <- x ^ 2
x3 <- x2 + 100
} )
Here,
{
x2 <- x ^ 2
x3 <- x2 + 100
}
is just an argument of the function (an R expression). Neither x2 nor x3 is an argument to within. The function could have been called this way instead to make it clearer:
mydata.new <- within(data = mydata, expr = {
x2 <- x ^ 2
x3 <- x2 + 100
})
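A small sketch makes the point concrete: because the braces hold ordinary R code, both <- and = act as plain assignment inside them. The data frame mydata here is a made-up three-row example.

```r
mydata <- data.frame(x = 1:3)  # hypothetical data

# The braces hold ordinary R code; `<-` and `=` both work as assignment inside
res_arrow <- within(mydata, {
  x2 <- x ^ 2
  x3 <- x2 + 100   # can use x2, which was created one line earlier
})
res_equal <- within(mydata, {
  x2 = x ^ 2
  x3 = x2 + 100
})

identical(res_arrow, res_equal)  # TRUE
```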

R: passing by parameter to function and using apply instead of nested loop and recursive indexing failed

I have two lists of lists, humanSplit and ratSplit. humanSplit has elements of the form:
> humanSplit[1]
$Fetal_Brain_408_AGTCAA_L001_R1_report.txt
humanGene humanReplicate alignment RNAtype
66 DGKI Fetal_Brain_408_AGTCAA_L001_R1_report.txt 6 reg
68 ARFGEF2 Fetal_Brain_408_AGTCAA_L001_R1_report.txt 5 reg
If you type humanSplit[[1]], it gives the data without name $Fetal_Brain_408_AGTCAA_L001_R1_report.txt
ratSplit is essentially similar to humanSplit, with a difference in column order. I want to apply Fisher's test to every possible pairing of replicates from humanSplit and ratSplit. I defined the following empty vectors, which I will use to store the information from my Fisher's tests:
humanReplicate <- vector(mode = 'character', length = 0)
ratReplicate <- vector(mode = 'character', length = 0)
pvalue <- vector(mode = 'numeric', length = 0)
For Fisher's test between two replicates of humanSplit and ratSplit, I define the following function. In the function I use geneList, which is a data.frame made by reading a file and has the form:
> head(geneList)
human rat
1 5S_rRNA 5S_rRNA
2 5S_rRNA 5S_rRNA
Now here is the main function, where I use a function getGenetype which I already defined in other part of the code. Also x and y are integers :
fishertest <-function(x,y) {
ratReplicateName <- names(ratSplit[x])
humanReplicateName <- names(humanSplit[y])
## merging above two based on the one-to-one gene mapping as in geneList
## defined above.
mergedHumanData <-merge(geneList,humanSplit[[y]], by.x = "human", by.y = "humanGene")
mergedRatData <- merge(geneList, ratSplit[[x]], by.x = "rat", by.y = "ratGene")
## [here i do other manipulation with using already defined function
## getGenetype that is defined outside of this function and make things
## necessary to define following contingency table]
contingencyTable <- matrix(c(HnRn,HnRy,HyRn,HyRy), nrow = 2)
fisherTest <- fisher.test(contingencyTable)
humanReplicate <- c(humanReplicate,humanReplicateName )
ratReplicate <- c(ratReplicate,ratReplicateName )
pvalue <- c(pvalue , fisherTest$p)
}
After doing all this, I make the matrix eg to use in apply. Here I am basically trying to do something similar to a double for loop and then apply the Fisher test:
eg <- expand.grid(i = 1:length(ratSplit),j = 1:length(humanSplit))
junk = apply(eg, 1, fishertest(eg$i,eg$j))
Now the problem is, when I try to run, it gives the following error when it tries to use function fishertest in apply
Error in humanSplit[[y]] : recursive indexing failed at level 3
Rstudio points out problem in following line:
mergedHumanData <-merge(geneList,humanSplit[[y]], by.x = "human", by.y = "humanGene")
Ultimately, I want to do the following:
result <- data.frame(humanReplicate,ratReplicate, pvalue ,alternative, Conf.int1, Conf.int2, oddratio)
I am struggling with these questions:
In defining fishertest function, how should I pass ratSplit and humanSplit and already defined function getGenetype?
And how I should use apply here?
Any help would be much appreciated.
Up front: read ?apply. Additionally, the first three hits on google when searching for "R apply tutorial" are helpful snippets: one, two, and three.
Errors in fishertest()
The error message itself has nothing to do with apply. The reason it got as far as it did is that the arguments you provided actually resolved. Try evaluating eg$i by itself, and you'll see that it returns a vector: the corresponding column in the eg data.frame. You are passing this whole vector as the index argument x (and likewise eg$j as y). The primary reason your function erred out is that double-bracket indexing ([[) expects a single index; given a vector of length greater than 1, it attempts recursive indexing into nested lists. This is a great example of where production/deployed functions would need type-checking to ensure that each argument is a numeric of length 1; often not required for quick code, but it would have caught this mistake. Had it not been for the [[ limit, your function may have returned incorrect results. (I've been bitten by that many times!)
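The behaviour of [[ with a vector index can be reproduced on toy lists. This is a sketch with made-up data (nested and flat), not the asker's humanSplit; it only shows why a multi-element index leads to a "recursive indexing" style failure.

```r
nested <- list(first = list(inner = c(10, 20, 30)))

# A length-2 index walks down the nesting: nested[[1]][[1]]
nested[[c(1, 1)]]

flat <- list("a", "b", "c")

# Intended "elements 1 to 3", but `[[` tries flat[[1]][[2]][[3]] and fails
result <- try(flat[[1:3]], silent = TRUE)
inherits(result, "try-error")  # TRUE

# Single-bracket indexing is the vectorised form and returns a sub-list
flat[1:3]
```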
BTW: your code is also incorrect in its scoped access to pvalue, et al. If you make your function return just the numbers you need and then aggregate them outside of the function, your life will simplify. (pvalue <- c(pvalue, ...) will find the pvalue assigned outside the function, but will not update it as you want.) You are defeating one purpose of writing this as a function. When thinking about writing this function, try to answer only this question: "how do I compare a single rat record with a single human record?" Only after that works correctly and simply, without having to overwrite variables in the parent environment, should you try to answer the question "how do I apply this function to all pairs and aggregate the results?" Try very hard to have your function not change anything outside of its own environment.
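The scoping point can be checked with a few lines. In this sketch, append_local and compute_p are hypothetical functions, and compute_p is just a stand-in for whatever produces a single p-value.

```r
pvalue <- numeric(0)

# Appending inside the function only changes a local copy of `pvalue`
append_local <- function(p) {
  pvalue <- c(pvalue, p)
  invisible(pvalue)
}
append_local(0.03)
length(pvalue)  # still 0: the outer vector was not modified

# Returning the value and aggregating outside keeps the function side-effect free
compute_p <- function(i) i / 100   # hypothetical stand-in for a real test
pvalue <- vapply(1:3, compute_p, numeric(1))
pvalue  # 0.01 0.02 0.03
```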
Errors in apply()
Had your function worked properly despite these errors, you would have received the following error from apply:
apply(eg, 1, fishertest(eg$i, eg$j))
## Error in match.fun(FUN) :
## 'fishertest(eg$i, eg$j)' is not a function, character or symbol
When you call apply in this sense, it is parsing the third argument and, in this example, evaluates it. Since it is simply a call to fishertest(eg$i, eg$j), which is intended to return a data.frame row (inferred from your previous question), it resolves to such, and apply then sees something akin to:
apply(eg, 1, data.frame(...))
Now you can see that apply is being handed a data.frame and not a function.
The third argument (FUN) needs to be a function itself that takes as its first argument a vector containing the elements of the row (1) or column (2) of the matrix/data.frame. As an example, consider the following contrived example:
eg <- data.frame(aa = 1:5, bb = 11:15)
apply(eg, 1, mean)
## [1] 6 7 8 9 10
# similar to your use, will not work; this error comes from mean not getting
# any arguments, your error above is because fishertest(...) returns a value,
# not a function
apply(eg, 1, mean())
## Error in mean.default() : argument "x" is missing, with no default
Realize that mean is a function itself, not the return value from a function (there is more to it, but this definition works). Because we're iterating over the rows of eg (because of the 1), the first iteration takes the first row and calls mean(c(1, 11)), which returns 6. The equivalent of your code here, mean()(c(1, 11)), will fail for a couple of reasons: (1) mean requires an argument and is not getting one, and (2) regardless, it does not return a function itself (returning functions belongs to a "functional programming" paradigm, easy in R but uncommon for most programmers).
In the example here, mean will accept a single argument which is typically a vector of numerics. In your case, your function fishertest requires two arguments (templated by my previous answer to your question), which does not work. You have two options here:
Change your fishertest function to accept a single vector as an argument and parse the index numbers from it. Both of the following versions do this:
fishertest <- function(v) {
x <- v[1]
y <- v[2]
ratReplicateName <- names(ratSplit[x])
## ...
}
or
fishertest <- function(x, y) {
if (missing(y)) {
y <- x[2]
x <- x[1]
}
ratReplicateName <- names(ratSplit[x])
## ...
}
The second version allows you to continue using the manual form of fishertest(1, 57) while also allowing you to do apply(eg, 1, fishertest) verbatim. Very readable, IMHO. (Better error checking and reporting can be used here, I'm just providing a MWE.)
Write an anonymous function to take the vector and split it up appropriately. This anonymous function could look something like function(ii) fishertest(ii[1], ii[2]). This is typically how it is done for functions that either do not transform as easily as in #1 above, or for functions you cannot or do not want to modify. You can either assign this intermediary function to a variable (which makes it no longer anonymous, figure that) and pass that intermediary to apply, or just pass it directly to apply, à la:
.func <- function(ii) fishertest(ii[1], ii[2])
apply(eg, 1, .func)
## equivalently
apply(eg, 1, function(ii) fishertest(ii[1], ii[2]))
There are two reasons why many people opt to name the function: (1) if the function is used multiple times, better to define once and reuse; (2) it makes the apply line easier to read than if it contained a complex multi-line function definition.
As a side note, there are some gotchas with using apply and family that, if you don't understand them, will be confusing. Not the least of which is that when your function returns vectors, the matrix returned from apply will need to be transposed (with t()), after which you'll still need to rbind or otherwise aggregate.
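One way to sidestep both the vector unpacking and the transposition is to use Map() instead of apply and bind the rows afterwards. This is a sketch of that alternative, not the approach described above; compare_pair is a hypothetical stand-in for fishertest that returns one data.frame row per pair.

```r
# Hypothetical stand-in for fishertest(): returns one data.frame row per pair
compare_pair <- function(x, y) {
  data.frame(rat = x, human = y, score = x * y)
}

eg <- expand.grid(i = 1:2, j = 1:3)

# Map() passes eg$i and eg$j element by element, so no vector unpacking is needed
rows <- Map(compare_pair, eg$i, eg$j)
result <- do.call(rbind, rows)

nrow(result)  # 6, one row per (i, j) pair
```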
This is one area where using ddply may provide a more readable solution. There are several tutorials showing it off. For a quick intro, read this; for a more in depth discussion on the bigger picture in which ddply plays a part, read Hadley's Split, Apply, Combine Strategy for Data Analysis paper from JSS.

Accessing Arbitrary Columns from an R Data Frame using with()

Suppose that I have a data frame with a column whose name is stored in a variable. Accessing this column using the variable is easy using bracket notation:
df <- data.frame(A = rep(1, 10), B = rep(2, 10))
column.name <- 'B'
df[,column.name]
But it is not obvious how to access an arbitrary column using a call to with(). The naive approach
with(df, column.name)
effectively evaluates column.name in the caller's environment. How can I delay evaluation sufficiently that with() will provide the same results that brackets give?
You can use get:
with(df, get(column.name))
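A short check, using the data frame from the question, that the get() form returns the same vector as bracket indexing:

```r
df <- data.frame(A = rep(1, 10), B = rep(2, 10))
column.name <- "B"

via_with    <- with(df, get(column.name))
via_bracket <- df[[column.name]]

identical(via_with, via_bracket)  # TRUE
```

One caveat: if df has no column with that name, get() keeps searching the enclosing environments, so it may silently return an unrelated object with the same name.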
You use 'with' to create a localized and temporary namespace inside which you evaluate some expression. In your code above, you haven't passed in an expression.
For instance:
data(iris) # this data is in your R installation, just call 'data' and pass it in
Ordinarily you have to refer to variable names within a data frame like this:
tx = tapply(iris$Sepal.Length, list(iris$Species), mean)
Unless you do this:
attach(iris)
The problem with using 'attach' is the likelihood of namespace clashes, so you've got to remember to call 'detach'
It's much cleaner to use 'with':
tx = with( iris, tapply(Sepal.Length, list(Species), mean) )
So, the call signature (informally) is: with(data, expression)
