I'm interested in counting the number of arguments passed to a function. length can't be used for that purpose:
>> length(2,2,2,2,2)
Error in length(2, 2, 2, 2, 2) :
5 arguments passed to 'length' which requires 1
This is expected, as length takes a single argument, so:
length(c(2,2,2,2,2))
would produce the desired result - 5.
Solution
I want to call my function like this: myFunction(arg1, arg2, arg3). This can be done with an ellipsis:
myCount <- function(...) {length(list(...))}
myCount would produce the desired result:
>> myCount(2,2,2,2,2)
[1] 5
Problem
This is awfully inefficient. I'm calling this function on a substantial number of arguments, and creating a list just to count the objects is wasteful. What's a better way of returning the number of arguments passed to a function?
How about
myCount <- function(...) {length(match.call())-1}
This just inspects the passed call (and subtracts 1 for the function name itself).
nargs() returns the number of arguments supplied to the function it is called from:
myCount <- function(...) {
nargs()
}
> myCount(2,2,2,2,2)
[1] 5
Reference https://stat.ethz.ch/R-manual/R-devel/library/base/html/nargs.html
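For comparison, here is a small sketch that puts the approaches from this thread side by side. The helper names are made up for illustration, and the ...length() variant assumes R 3.5.0 or later, where that base function was introduced. The last call suggests that nargs() counts the arguments without evaluating them.

```r
# Hypothetical helper names; each one counts the arguments it receives.
count_list  <- function(...) length(list(...))         # builds a list first
count_call  <- function(...) length(match.call()) - 1  # inspects the call
count_nargs <- function(...) nargs()                   # counts supplied arguments
count_dots  <- function(...) ...length()               # assumes R >= 3.5.0

counts <- c(
  list  = count_list(2, 2, 2, 2, 2),
  call  = count_call(2, 2, 2, 2, 2),
  nargs = count_nargs(2, 2, 2, 2, 2),
  dots  = count_dots(2, 2, 2, 2, 2)
)
print(counts)

# nargs() does not force the arguments, so this returns 1 instead of failing.
lazy_count <- count_nargs(stop("never evaluated"))
```

Only the list(...) version has to evaluate every argument before it can count them.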
Here is a somewhat elegant way using length() with the purrr::lift_*() family of functions.
The original call passes multiple arguments to length(), which does not work because length() takes a single vector or list as input.
So what we need is to convert a function that takes a vector/list into one that takes ... (dots). The purrr::lift_*() family provides a series of functions that do so.
One option is converting from vector to dots:
> lift_vd(length)(2, 2, 2, 2, 2)
[1] 5
Another option is converting from list to dots:
> lift_ld(length)(2, 2, 2, 2, 2)
[1] 5
Both options work; all you need is to apply one of the purrr::lift_*() functions to length() before passing the spliced arguments to it.
I currently am doing an operation similar to below:
v<-c("my","pig","is","big","with","a","name")
s<-c("m","g")
for(i in c(1:length(s))){
print(grep(v,pattern=s[i]))
}
Which prints
[1] 1 7
[1] 2 4
I would like to instead vectorize this operation where the return values are stored in a vector. I tried
mynewvector<-lapply(v,grep,pattern=s,x=v)
but the problem is that I don't know how to get lapply to iterate over the elements passed as arguments (e.g. iterating over s). I saw this answer, but I don't think mapply works here because I am trying to hold one argument constant (x=v) and iterate over the other argument (pattern=s).
How would I do this?
Following up on d.b.'s response, the clearest solution is
lapply(s, function(a) grep(pattern = a, x = v))
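As a rough sketch of the same idea, the results can be named by pattern, and mapply() can in fact hold one argument constant through MoreArgs. The variable names hits and hits2 are only for illustration.

```r
v <- c("my", "pig", "is", "big", "with", "a", "name")
s <- c("m", "g")

# Iterate over the patterns in s while x = v stays fixed; name results by pattern.
hits <- setNames(lapply(s, function(a) grep(pattern = a, x = v)), s)

# Same idea with mapply(): MoreArgs carries the argument that is held constant.
hits2 <- mapply(grep, pattern = s, MoreArgs = list(x = v), SIMPLIFY = FALSE)

print(hits)
```

Both versions return a list with one integer vector of matching positions per pattern.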
There are two examples of function Reduce() in Hadley Wickham's book Advanced R. Both work well.
Reduce(`+`, 1:3) # -> ((1 + 2) + 3)
Reduce(sum, 1:3) # -> sum(sum(1, 2), 3)
However, when using mean in Reduce(), it does not follow the same pattern. The outcome is always the first element of the list.
> Reduce(mean, 1:3)
[1] 1
> Reduce(mean, 4:2)
[1] 4
The two functions sum() and mean() are very similar. Why does one work fine with Reduce() while the other does not? How do I know whether a function behaves normally in Reduce() before it gives an incorrect result?
This has to do with the fact that, unlike sum or +, mean expects a single argument (i.e., a vector of values), and as such cannot be applied in the manner in which Reduce operates, namely:
Reduce uses a binary function to successively combine the elements of
a given vector and a possibly given initial value.
Take note of the signature of mean:
mean(x, ...)
When you pass multiple values to it, the function will match x to the first value and ignore the rest. For example, when you call Reduce(mean, 1:3), this is more or less what is going on:
mean(1, 2)
#[1] 1
mean(mean(1, 2), 3)
#[1] 1
Compare this with the behavior of sum, which accepts a variable number of values:
sum(1, 2)
#[1] 3
sum(sum(1, 2), 3)
#[1] 6
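One way to check how a function behaves inside Reduce() is to run it on a tiny input and compare against a known answer. The sketch below does that for mean; the variable names are illustrative. It also shows that wrapping mean as a genuine two-argument function still does not give the overall mean, because averaging pairwise is not associative.

```r
# mean() used directly: the extra positional value is not averaged in.
direct <- Reduce(mean, 1:3)

# Wrapping mean() as a true binary function makes Reduce() combine both values ...
pairwise <- Reduce(function(a, b) mean(c(a, b)), 1:3)  # mean(c(mean(c(1, 2)), 3))

# ... but the result still differs from the mean of the whole vector.
overall <- mean(1:3)

# accumulate = TRUE exposes the intermediate steps, which helps when checking
# whether a function behaves as a binary combiner.
steps <- Reduce(`+`, 1:3, accumulate = TRUE)

print(c(direct = direct, pairwise = pairwise, overall = overall))
print(steps)
```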
I'm new to programming and I wrote some code that finds spam words in the first email, but I would like to write a for loop that does this for all of the emails. Any help would be appreciated. Thank you.
words = grepl("viagra", spamdata[[ 1 ]]$header[ "Subject"])
I presume that you want to loop over the elements of spamdata and build up an indicator whether the string "viagra" is found in the subject lines of your emails.
Let's set up some dummy data for illustration purposes:
subjects <- c("Buy my viagra", "Buy my Sildenafil citrate",
"UK Lottery Win!!!!!")
names(subjects) <- rep("Subject", 3)
spamdata <- list(list(Header = subjects[1]), list(Header = subjects[2]),
list(Header = subjects[3]))
Next we create a vector words to hold the result of each iteration of the loop. You do not want to be growing words or any other object at each iteration - that will force copying and will slow your loop down. Instead allocate storage before you begin - here using the length of the list over which we want to loop:
words <- logical(length = length(spamdata))
You can set up a loop like so:
## seq_along() creates a sequence of 1:length(spamdata)
for(i in seq_along(spamdata)) {
words[ i ] <- grepl("viagra", spamdata[[ i ]]$Header["Subject"])
}
We can then look at words:
> words
[1] TRUE FALSE FALSE
Which matches what we know from the made up subjects.
Notice how we used i as a place holder for 1, 2, and 3 - at each iteration of the loop, i takes on the next value in the sequence 1,2,3 so we can i) access the ith component of spamdata to get the next subject line, and ii) access the ith element of words to store the result of the grepl() call.
Note that instead of an explicit loop we could also use the sapply() or lapply() functions, which create the loop for you but might need a bit of work to write a custom function. Instead of using grepl() directly, we can write a wrapper:
foo <- function(x) {
grepl("viagra", x$Header["Subject"])
}
In the above function we use x instead of the list name spamdata because when lapply() and sapply() loop over the spamdata list, the individual components (referenced by spamdata[[i]] in the for() loop) get passed to our function as argument x so we only need to refer to x in the grepl() call.
This is how we could use our wrapper function foo() in lapply() or sapply(), first lapply():
> lapply(spamdata, foo)
[[1]]
[1] TRUE
[[2]]
[1] FALSE
[[3]]
[1] FALSE
sapply() will simplify the returned object where possible, as follows:
> sapply(spamdata, foo)
[1] TRUE FALSE FALSE
Other than that, they work similarly.
Note we can make our wrapper function foo() more useful by allowing it to take an argument defining the spam word you wish to search for:
foo <- function(x, string) {
grepl(string, x$Header["Subject"])
}
We can pass extra arguments to our functions with lapply() and sapply() like this:
> sapply(spamdata, foo, string = "viagra")
[1] TRUE FALSE FALSE
> sapply(spamdata, foo, string = "Lottery")
[1] FALSE FALSE TRUE
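A further variation, sketched here under the same dummy data, is vapply(), which works like sapply() but also checks that every result has the declared type and length. The names flags, all_subjects and flags2 are only for illustration.

```r
subjects <- c("Buy my viagra", "Buy my Sildenafil citrate",
              "UK Lottery Win!!!!!")
names(subjects) <- rep("Subject", 3)
spamdata <- list(list(Header = subjects[1]), list(Header = subjects[2]),
                 list(Header = subjects[3]))

foo <- function(x, string) {
  grepl(string, x$Header["Subject"])
}

# vapply() fails loudly if foo() ever returns something other than one logical.
flags <- vapply(spamdata, foo, logical(1), string = "Lottery")

# Alternative: pull out all subject lines first, then call grepl() once,
# since grepl() is already vectorised over its second argument.
all_subjects <- vapply(spamdata, function(x) x$Header[["Subject"]], character(1))
flags2 <- grepl("Lottery", all_subjects)

print(flags)
```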
Which you will find most useful (for() loop or the lapply(), sapply() versions) will depend on your programming background and which you find most familiar. Sometimes for() is easier and simpler to use, but perhaps more verbose (which isn't always a bad thing!), whilst lapply() and sapply() are quite succinct and useful where you don't need to jump through hoops to create a workable wrapper function.
In R, a for loop takes this form, where variable is the name of your iteration variable, and sequence is a vector or list of values:
for (variable in sequence) expression
The expression can be a single R command - or several lines of commands wrapped in curly brackets:
for (variable in sequence) {
expression
expression
expression
}
In this case it would be for (i in seq_along(spamdata)) { do whatever you want to do with spamdata[[i]] }
Also
Basic loop theory
The basic structure for loop commands is: for(i in 1:n){stuff to do}, where n is the number of times the loop will execute.
listname[[1]] refers to the first element in the list “listname.”
In a for loop, listname[[i]] refers to the variable corresponding to the ith iteration of the for loop.
The code for(i in 1:length(yesnovars)) tells the loop to execute only once for each variable in the list.
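One caveat with the 1:length(...) idiom is the empty case. The sketch below uses a made-up empty list to show the difference from seq_along(), which is why seq_along() is often preferred for loops over lists.

```r
empty <- list()

# 1:length(empty) is 1:0, i.e. c(1, 0), so the body runs twice.
n_colon <- 0
for (i in 1:length(empty)) n_colon <- n_colon + 1

# seq_along(empty) is an empty sequence, so the body never runs.
n_along <- 0
for (i in seq_along(empty)) n_along <- n_along + 1

print(c(colon = n_colon, along = n_along))
```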
Answer taken from the following sources:
Loops in R
Programming in R
I have two lists of lists, humanSplit and ratSplit. humanSplit has elements of the form:
> humanSplit[1]
$Fetal_Brain_408_AGTCAA_L001_R1_report.txt
humanGene humanReplicate alignment RNAtype
66 DGKI Fetal_Brain_408_AGTCAA_L001_R1_report.txt 6 reg
68 ARFGEF2 Fetal_Brain_408_AGTCAA_L001_R1_report.txt 5 reg
If you type humanSplit[[1]], it gives the same data without the name $Fetal_Brain_408_AGTCAA_L001_R1_report.txt
ratSplit is essentially similar to humanSplit, differing only in column order. I want to apply Fisher's test to every possible pairing of replicates from humanSplit and ratSplit. I defined the following empty vectors, which I will use to store the results of my Fisher's tests:
humanReplicate <- vector(mode = 'character', length = 0)
ratReplicate <- vector(mode = 'character', length = 0)
pvalue <- vector(mode = 'numeric', length = 0)
For Fisher's test between two replicates of humanSplit and ratSplit, I define the following function. In the function I use geneList, which is a data.frame made by reading a file and has the form:
> head(geneList)
human rat
1 5S_rRNA 5S_rRNA
2 5S_rRNA 5S_rRNA
Now here is the main function, where I use a function getGenetype which I already defined in another part of the code. Also, x and y are integers:
fishertest <-function(x,y) {
ratReplicateName <- names(ratSplit[x])
humanReplicateName <- names(humanSplit[y])
## merging above two based on the one-to-one gene mapping as in geneList
## defined above.
mergedHumanData <-merge(geneList,humanSplit[[y]], by.x = "human", by.y = "humanGene")
mergedRatData <- merge(geneList, ratSplit[[x]], by.x = "rat", by.y = "ratGene")
## [here i do other manipulation with using already defined function
## getGenetype that is defined outside of this function and make things
## necessary to define following contingency table]
contingencyTable <- matrix(c(HnRn,HnRy,HyRn,HyRy), nrow = 2)
fisherTest <- fisher.test(contingencyTable)
humanReplicate <- c(humanReplicate,humanReplicateName )
ratReplicate <- c(ratReplicate,ratReplicateName )
pvalue <- c(pvalue , fisherTest$p)
}
After doing all this, I make the matrix eg to use in apply. Here I am basically trying to do something similar to a double for loop and then run the Fisher test on each pair:
eg <- expand.grid(i = 1:length(ratSplit),j = 1:length(humanSplit))
junk = apply(eg, 1, fishertest(eg$i,eg$j))
Now the problem is that, when I try to run it, I get the following error when it tries to use the function fishertest in apply:
Error in humanSplit[[y]] : recursive indexing failed at level 3
RStudio points to a problem in the following line:
mergedHumanData <-merge(geneList,humanSplit[[y]], by.x = "human", by.y = "humanGene")
Ultimately, I want to do the following:
result <- data.frame(humanReplicate,ratReplicate, pvalue ,alternative, Conf.int1, Conf.int2, oddratio)
I am struggling with these questions:
In defining fishertest function, how should I pass ratSplit and humanSplit and already defined function getGenetype?
And how I should use apply here?
Any help would be much appreciated.
Up front: read ?apply. Additionally, the first three hits on google when searching for "R apply tutorial" are helpful snippets: one, two, and three.
Errors in fishertest()
The error message itself has nothing to do with apply. It got as far as it did because the arguments you provided actually resolved. Try evaluating eg$i by itself, and you'll see that it returns a vector: the corresponding column of the eg data.frame. You are passing this whole vector as the index argument x (and eg$j as y). The primary reason your function errored out is that double-bracket indexing ([[) only works with a single index, not with vectors of length greater than 1; given a longer vector, [[ attempts recursive indexing into nested lists, hence the message. This is a great example of where production/deployed functions would need type-checking to ensure that each argument is a numeric of length 1; often not required for quick code, but it would have caught this mistake. Had it not been for the [[ limit, your function may have returned incorrect results. (I've been bitten by that many times!)
BTW: your code is also incorrect in its scoped access to pvalue, et al. If you make your function return just the numbers you need and then aggregate them outside of the function, your life will simplify. (pvalue <- c(pvalue, ...) will find the pvalue assigned outside the function but will not update it as you want.) You are defeating one purpose of writing this as a function. When thinking about writing this function, try to answer only this question: "how do I compare a single rat record with a single human record?" Only after that works correctly and simply, without having to overwrite variables in the parent environment, should you try to answer the question "how do I apply this function to all pairs and aggregate it?" Try very hard to have your function not change anything outside of its own environment.
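To make the "return one result, aggregate outside" idea concrete, here is a minimal sketch. Everything in it is a stand-in: rat_toy and human_toy are hypothetical replacements for ratSplit and humanSplit (logical vectors over the same eight genes), and compare_pair is a made-up simplification of fishertest that skips the merging and getGenetype steps.

```r
# Hypothetical toy data standing in for ratSplit and humanSplit.
rat_toy <- list(
  r1 = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE),
  r2 = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
)
human_toy <- list(
  h1 = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE),
  h2 = c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE),
  h3 = c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE)
)

# Compares ONE rat replicate with ONE human replicate and returns one row.
compare_pair <- function(x, y) {
  stopifnot(length(x) == 1, length(y) == 1)  # guard against passing vectors
  tab <- table(factor(rat_toy[[x]],   levels = c(FALSE, TRUE)),
               factor(human_toy[[y]], levels = c(FALSE, TRUE)))
  ft <- fisher.test(tab)
  data.frame(ratReplicate   = names(rat_toy)[x],
             humanReplicate = names(human_toy)[y],
             pvalue         = ft$p.value)
}

# Aggregation happens outside the function, over all index pairs.
eg <- expand.grid(i = seq_along(rat_toy), j = seq_along(human_toy))
result <- do.call(rbind, Map(compare_pair, eg$i, eg$j))
print(result)
```

Nothing outside compare_pair is modified by calling it, so each pair can be tested on its own before running the whole grid.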
Errors in apply()
Had your function worked properly despite these errors, you would have received the following error from apply:
apply(eg, 1, fishertest(eg$i, eg$j))
## Error in match.fun(FUN) :
## 'fishertest(eg$i, eg$j)' is not a function, character or symbol
When you call apply in this way, it is parsing the third argument and, in this example, evaluates it. Since it is simply a call to fishertest(eg$i, eg$j), which is intended to return a data.frame row (inferred from your previous question), it resolves to such, and apply then sees something akin to:
apply(eg, 1, data.frame(...))
Now you can see that apply is being handed a data.frame and not a function.
The third argument (FUN) needs to be a function itself that takes as its first argument a vector containing the elements of the row (1) or column (2) of the matrix/data.frame. As an example, consider the following contrived example:
eg <- data.frame(aa = 1:5, bb = 11:15)
apply(eg, 1, mean)
## [1] 6 7 8 9 10
# similar to your use, will not work; this error comes from mean not getting
# any arguments, your error above is because the call returned a non-function
apply(eg, 1, mean())
## Error in mean.default() : argument "x" is missing, with no default
Realize that mean is a function itself, not the return value from a function (there is more to it, but this definition works). Because we're iterating over the rows of eg (because of the 1), the first iteration takes the first row and calls mean(c(1, 11)), which returns 6. The equivalent of your code here is mean()(c(1, 11)), which will fail for a couple of reasons: (1) mean requires an argument and is not getting one, and (2) regardless, it does not return a function itself (functions returning functions are part of the "functional programming" paradigm, easy in R but uncommon for most programmers).
In the example here, mean will accept a single argument which is typically a vector of numerics. In your case, your function fishertest requires two arguments (templated by my previous answer to your question), which does not work. You have two options here:
1. Change your fishertest function to accept a single vector as an argument and parse the index numbers from it. Both of the following versions do this:
fishertest <- function(v) {
x <- v[1]
y <- v[2]
ratReplicateName <- names(ratSplit[x])
## ...
}
or
fishertest <- function(x, y) {
if (missing(y)) {
y <- x[2]
x <- x[1]
}
ratReplicateName <- names(ratSplit[x])
## ...
}
The second version allows you to continue using the manual form of fishertest(1, 57) while also allowing you to do apply(eg, 1, fishertest) verbatim. Very readable, IMHO. (Better error checking and reporting can be used here, I'm just providing a MWE.)
2. Write an anonymous function to take the vector and split it up appropriately. This anonymous function could look something like function(ii) fishertest(ii[1], ii[2]). This is typically how it is done for functions that either do not transform as easily as in #1 above, or for functions you cannot or do not want to modify. You can either assign this intermediary function to a variable (which makes it no longer anonymous, figure that) and pass that intermediary to apply, or just pass it directly to apply, à la:
.func <- function(ii) fishertest(ii[1], ii[2])
apply(eg, 1, .func)
## equivalently
apply(eg, 1, function(ii) fishertest(ii[1], ii[2]))
There are two reasons why many people opt to name the function: (1) if the function is used multiple times, better to define once and reuse; (2) it makes the apply line easier to read than if it contained a complex multi-line function definition.
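Both approaches can be tried out on a toy function before touching fishertest. In this sketch, pair_sum is a made-up stand-in that simply adds its two indices; the missing(y) branch mirrors the single-vector form, and the anonymous wrapper mirrors the second form.

```r
# pair_sum() is a hypothetical stand-in for fishertest().
pair_sum <- function(x, y) {
  if (missing(y)) {   # called with a single vector, as apply() does
    y <- x[2]
    x <- x[1]
  }
  unname(x + y)
}

eg <- expand.grid(i = 1:2, j = 1:3)

by_row_1 <- apply(eg, 1, pair_sum)                             # single-vector form
by_row_2 <- apply(eg, 1, function(ii) pair_sum(ii[1], ii[2]))  # wrapper form
manual   <- pair_sum(1, 57)                                    # two-argument call still works

print(by_row_1)
```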
As a side note, there are some gotchas with using apply and family that, if you don't understand them, will be confusing. Not the least of which is that when your function returns vectors, the matrix returned from apply will need to be transposed (with t()), after which you'll still need to rbind or otherwise aggregate.
This is one area where using ddply may provide a more readable solution. There are several tutorials showing it off. For a quick intro, read this; for a more in depth discussion on the bigger picture in which ddply plays a part, read Hadley's Split, Apply, Combine Strategy for Data Analysis paper from JSS.
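The transposition gotcha mentioned above is easy to reproduce on a small, contrived data.frame. In this sketch the function returns two values per row, and the names per_row and tidy are only illustrative.

```r
eg <- data.frame(aa = 1:3, bb = 11:13)

# Each row yields a vector of length 2, so apply() returns one COLUMN per row.
per_row <- apply(eg, 1, function(r) c(total = sum(r), diff = unname(r[2] - r[1])))
print(dim(per_row))

# Transposing gives the more familiar one-row-per-input-row layout.
tidy <- t(per_row)
print(tidy)
```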
The following function is used to multiply a sequence 1:x by y
f1<-function(x,y){return (lapply(1:x, function(a,b) b*a, b=y))}
It looks like a represents the element in the sequence 1:x, but I do not understand this parameter-passing mechanism. Other OO languages, like Java or C++, have call by reference or call by value.
Short answer: R is call by value. Long answer: it can do both.
Call By Value, Lazy Evaluation, and Scoping
You'll want to read through the R language definition for more details.
R mostly uses call by value, but this is complicated by its lazy evaluation.
So you can have a function:
f <- function(x, y) {
x * 3
}
If you pass two big matrices as x and y, only x will ever be evaluated inside f, because y is never used.
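Both properties can be observed directly. In this sketch the unused argument is an expression that would raise an error if it were ever evaluated, and g is a made-up function that modifies its argument to show that the caller's object is left alone.

```r
f <- function(x, y) {
  x * 3
}

# y is never touched inside f, so its expression is never evaluated.
lazy_result <- f(2, stop("y was evaluated"))

# Arguments behave like values: changing x inside g leaves the caller's object alone.
g <- function(x) {
  x[1] <- 100
  x
}
original <- c(1, 2, 3)
modified <- g(original)

print(lazy_result)
print(original)
print(modified)
```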
But you can also access variables in parent environments of f:
y <- 5
f <- function(x) {
x * y
}
f(3) # 15
Or even:
y <- 5
f <- function() {
x <- 3
g <- function() {
x * y
}
}
f() # returns function g()
f()() # returns 15
Call By Reference
There are two ways for doing call by reference in R that I know of.
One is by using Reference Classes, one of the three object oriented paradigms of R (see also: Advanced R programming: Object Oriented Field Guide)
The other is to use the bigmemory package and its big.matrix objects (see The bigmemory project). This allows you to create matrices in memory (underlying data is stored in C), returning a pointer to the R session. This allows you to do fun things like accessing the same matrix from multiple R sessions.
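For a feel of reference semantics without either of those tools, plain environments in base R can serve as a minimal illustration, since an environment is not copied when it is passed to a function. This is a different, simpler technique than Reference Classes or bigmemory, and the names counter and increment are made up.

```r
# An environment is passed by reference: the function sees the same object.
increment <- function(counter) {
  counter$n <- counter$n + 1
  invisible(NULL)
}

counter <- new.env()
counter$n <- 0

increment(counter)
increment(counter)

print(counter$n)
```

The caller's counter changes even though increment never returns it.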
To multiply a vector x by a constant y just do
x * y
The *apply functions (apply with some prefix) all work in a similar way: you map a function over every element of your vector, list, matrix, and so on:
x = 1:10
x.squared = sapply(x, function(elem)elem * elem)
print(x.squared)
[1] 1 4 9 16 25 36 49 64 81 100
It gets better with matrices and data frames because you can now apply a function over all rows or columns, and collect the output. Like this:
m = matrix(1:9, ncol = 3)
# The 1 below means apply over rows, 2 would mean apply over cols
row.sums = apply(m, 1, function(some.row) sum(some.row))
print(row.sums)
[1] 12 15 18
If you're looking for a simple way to multiply a sequence by a constant, definitely use @Fernando's answer or something similar. I'm assuming you're just trying to determine how parameters are being passed in this code.
lapply calls its second argument (in your case function(a, b) b*a) with each of the values of its first argument 1, 2, ..., x. Those values will be passed as the first parameter to the second argument (so, in your case, they will be argument a).
Any additional parameters to lapply after the first two, in your case b=y, are passed to the function by name. So if you called your inner function fxn, then your invocation of lapply is making calls like fxn(1, b=4), fxn(2, b=4), .... The parameters are passed by value.
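The expansion described above can be checked by writing the calls out by hand. This sketch names the inner function fxn, as in the explanation, and assumes y is 4.

```r
fxn <- function(a, b) b * a
y <- 4

# What lapply() does: each element becomes the first argument, b is passed by name.
via_lapply <- lapply(1:3, fxn, b = y)

# The same calls written out manually.
by_hand <- list(fxn(1L, b = 4), fxn(2L, b = 4), fxn(3L, b = 4))

print(identical(via_lapply, by_hand))
```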
You should read the help for lapply to understand how it works. Read this excellent answer for a good explanation of the different functions in the *apply family.
From the help of lapply:
lapply(X, FUN, ...)
Here FUN is applied to each element of X, and ... refers to:
... optional arguments to FUN.
Since FUN has an optional argument b, we replace the ... by b = y.
You can see it as syntactic sugar that emphasizes the fact that argument b is optional compared to argument a. If the two arguments are symmetric, it may be better to use mapply.
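As a small sketch of the symmetric case, mapply() walks over both inputs in parallel, so neither argument is singled out as the "extra" one. The input vectors here are made up.

```r
# a and b are taken pairwise: (1, 4), (2, 5), (3, 6)
products <- mapply(function(a, b) a * b, 1:3, 4:6)
print(products)
```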