There are two examples of function Reduce() in Hadley Wickham's book Advanced R. Both work well.
Reduce(`+`, 1:3) # -> ((1 + 2) + 3)
Reduce(sum, 1:3) # -> sum(sum(1, 2), 3)
However, when using mean in Reduce(), it does not follow the same pattern. The outcome is always the first element of the list.
> Reduce(mean, 1:3)
[1] 1
> Reduce(mean, 4:2)
[1] 4
The two functions sum() and mean() are very similar. Why does one work fine with Reduce() while the other does not? How can I tell whether a function behaves normally in Reduce() before it gives an incorrect result?
This has to do with the fact that, unlike sum or +, mean expects a single argument (i.e., a vector of values), and as such cannot be applied in the manner that Reduce operates, namely:
Reduce uses a binary function to successively combine the elements of
a given vector and a possibly given initial value.
Take note of the signature of mean:
mean(x, ...)
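When you pass multiple values to it, the function matches x to the first value and does not average the rest: they are passed on through ... to the method, where mean.default takes the second positional value as trim. For example, when you call Reduce(mean, 1:3), this is more or less what is going on: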
mean(1, 2)
#[1] 1
mean(mean(1, 2), 3)
#[1] 1
Compare this with the behavior of sum, which accepts a variable number of values:
sum(1, 2)
#[1] 3
sum(sum(1, 2), 3)
#[1] 6
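One way to tell in advance whether a function suits Reduce() is to look at the intermediate results. The following is a minimal sketch in base R; pair_mean is a hypothetical helper introduced only for illustration, and it computes a nested pairwise mean, which is not the same thing as the overall mean.

```r
# Show every intermediate value that Reduce() produces.
steps <- Reduce(mean, 1:3, accumulate = TRUE)
print(steps)  # every step is still the first element

# A genuinely binary function combines both of its arguments.
# pair_mean is a hypothetical helper used only for illustration.
pair_mean <- function(a, b) mean(c(a, b))
nested <- Reduce(pair_mean, 1:3)  # pair_mean(pair_mean(1, 2), 3)
print(nested)

# The nested pairwise mean is not the overall mean of the vector.
overall <- mean(1:3)
print(overall)
```

If the accumulated steps never change, the function is probably ignoring its second argument.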
I'm trying to get a hold on how the apply function works. Here is what I tried:
df = data.frame(x=c(1,2,3,4,5), x2=c(1,2,3,4,5))
apply(df$x2, 2, function(x) x*2) #doesn't work
apply(df["x2"], 2, function(x) x*2) #works
apply(df[,2], 2, function(x) (x*2)) #doesn't work
apply(df[2], 2, function(x) x*2) #works (surprisingly)
apply(df[2,], 1, function(x) x*2) #works, but gives me vertical vector
apply(df[2,], 2, function(x) x*2) #works; this gives me the output I expected in line above
Questions (as indicated by comments):
Why doesn't line 2 work although line 3 does?
Why can I use [2,] to refer to row 2 (line 6), but cannot use [,2] to refer to column 2 (line 4), but have to use [2] (line 5) instead?
In line 6 I expected to get what I got from line 7: row 2 (with double values) in a row. Why didn't I get this from line 6, I indicated row with MARGIN=2?
apply needs to be used on something whose dim has positive length. For simplicity, think of an object that has rows and columns.
That's why MARGIN can be 1 or 2, standing for row-wise and column-wise operation.
Check your input values like this:
dim(df["x2"])
dim(df[,2]) #this is null, so it does not work
df[,2] gives you a vector, the same as df$x2. A vector does not have rows and columns, so it does not work with apply.
In order to understand what you are doing wrong:
Type ?"[" into your console and read everything. Also play around, which you are already doing!
Have a closer look at the drop argument.
Lastly, with df[2,] you're subsetting a single row. It's still a data frame.
Check dim(df[2,])
apply(df[2,], 1, function(x) x*2) #works, but gives me vertical vector
apply(df[2,], 2, function(x) x*2) #works; this gives me the output I expected in line above
The reason you don't get the same output is the WHOLE reason why apply exists. Please read ?apply to understand.
When you have questions after reading the two mentioned resources, feel free to ask more.
Here is a little example:
m <- matrix(1:9,nrow=3)
m
apply(m,1,max) #row-wise max value
apply(m,2,max) #col-wise max value
The problem is subsetting:
First:
df$x2 and df[, 2] are different from df["x2"] and df[2], as the former return a numeric vector, the latter return a data.frame.
Second:
df[2, ] returns the second row of your data.frame. If you use MARGIN = 1 you go through the rows, each row is represented as a (named) vector of length equal to the number of columns in your data.frame.
If you use MARGIN = 2 you go through the columns, again, each column is represented as a (named) vector of length equal to the number of rows in your data.frame.
Why doesn't line 2 work although line 3 does?
df$x2 is a vector i.e. c(1,2,3,4,5) whereas df["x2"] is a data frame with just one column. The vector has no second dimension to apply over. See ?'[' in R for details of how subsetting works; this isn't really related to the apply function.
Why can I use [2,] to refer to row 2 (line 6), but cannot use [,2] to refer to column 2 (line 4), but have to use [2] (line 5) instead?
Again, see the subsetting help page, but df[,2,drop=FALSE] is probably what you need.
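To make the effect of drop concrete, here is a small sketch that rebuilds the data frame from the question; the variable names dropped, kept and doubled are made up for this example.

```r
df <- data.frame(x = c(1, 2, 3, 4, 5), x2 = c(1, 2, 3, 4, 5))

# Default subsetting drops the single column to a plain vector.
dropped <- df[, 2]
print(dim(dropped))  # NULL, so apply() would fail on it

# drop = FALSE keeps the result as a one-column data frame.
kept <- df[, 2, drop = FALSE]
print(dim(kept))

doubled <- apply(kept, 2, function(x) x * 2)
print(doubled)
```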
In line 6 I expected to get what I got from line 7: row 2 (with double values) in a row. Why didn't I get this from line 6, I indicated row with MARGIN=2?
The value section of ?apply explains the dimensions that you can expect as output from a call to apply:
If each call to FUN returns a vector of length n, then apply returns an array of dimension c(n, dim(X)[MARGIN]) if n > 1. If n equals 1, apply returns a vector if MARGIN has length 1 and an array of dimension dim(X)[MARGIN] otherwise.
In this case we see that:
> dim(df[2,])
# [1] 1 2
and so:
apply(df[2,], 1, function(x) x*2)
has n=2 and dim(df[2,])[1]=1, so you should expect an output with dimensions c(2,1).
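The dimension rule can be checked directly. This is a sketch using the data frame from the question; row2, by_row, by_col and as_row are names chosen only for this example.

```r
df <- data.frame(x = c(1, 2, 3, 4, 5), x2 = c(1, 2, 3, 4, 5))
row2 <- df[2, ]  # still a data frame, with dimensions 1 x 2

# MARGIN = 1: FUN sees one row (length 2), so results are stacked as columns.
by_row <- apply(row2, 1, function(x) x * 2)
print(dim(by_row))

# MARGIN = 2: FUN sees each column (length 1), so a plain vector comes back.
by_col <- apply(row2, 2, function(x) x * 2)
print(dim(by_col))  # NULL

# Transposing the first result gives the row layout.
as_row <- t(by_row)
print(as_row)
```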
You should look at the type and dimensions of each expression:
> typeof(df$x2)
[1] "double"
> dim(df$x2)
NULL
> typeof(df["x2"])
[1] "list"
> dim(df["x2"])
[1] 5 1
> typeof(df[, 2])
[1] "double"
> dim(df[, 2])
NULL
> typeof(df[2])
[1] "list"
> dim(df[2])
[1] 5 1
> typeof(df[2, ])
[1] "list"
> dim(df[2,])
[1] 1 2
Line 2 does not work because you are trying to apply a function to a variable whose dimension is NULL (dim(X) must have positive length). The rest is similar. You must pay attention to the type of the expression passed to apply. I recommend simply printing the values to check whether they are suitable for the apply function.
In R's documentation for apply, it says:
FUN: the function to be applied: see ‘Details’. In the case of functions like +, %*%, etc., the function name must be backquoted or quoted.
I don't understand the latter half of the sentence.
When I do
matrix1 = matrix(rnorm(3*4), 3, 4)
apply(matrix1, 1, "+")
I get the transpose of the matrix
And when I do
apply(matrix1, 1, "%*%")
I get an error.
I'm trying to get the row-wise sum and product of this matrix.
Also, if that's not what the documentation is talking about, what are + and %*% supposed to do when supplied as the FUN argument of apply?
matrix1 = matrix(rnorm(3*4), 3, 4)
apply(matrix1, 1, "+")
Does something like the transpose because it supplies rows of the matrix1 object one by one and returns the values of each operation as columns. If you had on the other hand specified:
apply(matrix1, 2, "+")
There would not have been the appearance of transposition because apply always returns its values as a column-major result.
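The apparent transposition is easy to see with a fixed matrix instead of random values. This is a sketch, with m standing in for matrix1 so the result is reproducible.

```r
m <- matrix(1:12, nrow = 3, ncol = 4)

# With one argument, "+" is unary plus and returns each row unchanged.
by_rows <- apply(m, 1, "+")  # each row becomes a column of the result
by_cols <- apply(m, 2, "+")  # each column stays a column

print(dim(by_rows))
print(dim(by_cols))
```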
In the second instance, you didn't give a second argument to the "%*%" operator. The "+" operator can be either unary or binary but the "%*%" operator is always binary. (It doesn't really make a lot of sense to use "%*%" with apply and a single dimension since "%*%" is really designed as a standalone operator.) If you want the row-wise sum then just use:
rowSums(matrix1)
But you could have used the slower:
apply(matrix1, 1, sum)
For product use:
apply(matrix1, 1, prod)
Neither + nor %*% is designed to collapse its arguments into a single value, in contrast to sum and prod, which are so designed.
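As a quick check of the row-wise sum and product, here is a sketch on a fixed matrix; m and the result names are chosen for this example only.

```r
m <- matrix(1:12, nrow = 3, ncol = 4)

row_totals   <- rowSums(m)         # vectorised row-wise sum
row_totals2  <- apply(m, 1, sum)   # same values via apply()
row_products <- apply(m, 1, prod)  # row-wise product

print(row_totals)
print(row_products)
```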
Reply to comment. The %*% operator performs the "matrix multiply" operation. The elements of the i-th row of the first argument are multiplied by the corresponding elements of the j-th column of the second argument and summed to deliver the (i, j) element of a new matrix. Many mathematical operations with statistical or physical meaning which would otherwise require a double for-loop can be accomplished by matrix-multiply. Let's imagine your matrix was a bunch of data values and you wanted to come up with a prediction for each column based on a model with three coefficients (one per row) equal to, say, c(5,6,7):
c(5,6,7) %*% matrix1
# [,1] [,2] [,3] [,4]
#[1,] 2.047344 10.02339 1.73618 0.7223964
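Because the output above depends on random data, a reproducible sketch may help. Here m is a fixed 3 x 4 matrix standing in for matrix1, and coefs holds the same illustrative coefficients.

```r
m <- matrix(1:12, nrow = 3, ncol = 4)
coefs <- c(5, 6, 7)  # illustrative coefficients, one per row of m

# The vector is treated as a 1 x 3 row vector, so the result is 1 x 4.
pred <- coefs %*% m
print(pred)

# The same numbers without %*%: weight each row, then sum down each column.
manual <- colSums(coefs * m)
print(manual)
```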
OK, I have a little problem which I believe I can solve with which and grepl (alternatives are welcome), but I am getting lost:
my_query<- c('g1', 'g2', 'g3')
my_data<- c('string2','string4','string5','string6')
I would like to return the index in my_query that matches in my_data. In the example above, only 'g2' is in my_data, so the result in the example would be 2.
It seems to me that there is no easy way to do this without a loop. For each element in my_query, we can use either of the below functions to get TRUE or FALSE:
f1 <- function (pattern, x) length(grep(pattern, x)) > 0L
f2 <- function (pattern, x) any(grepl(pattern, x))
For example,
f1(my_query[1], my_data)
# [1] FALSE
f2(my_query[1], my_data)
# [1] FALSE
Then we use an *apply loop to apply, say, f2 to all elements of my_query:
which(unlist(lapply(my_query, f2, x = my_data)))
# [1] 2
Thanks, that seems to work. To be honest, I preferred your original one-line version. I am not sure why you edited it to create another function that is then called with *apply. Is there any advantage compared to which(lengths(lapply(my_query, grep, my_data)) > 0L)?
Well, I am not entirely sure. When I read ?lengths:
One advantage of ‘lengths(x)’ is its use as a more efficient
version of ‘sapply(x, length)’ and similar ‘*apply’ calls to
‘length’.
I don't know how much more efficient lengths is compared with sapply. Anyway, if it is still a loop, then my original suggestion which(lengths(lapply(my_query, grep, my_data)) > 0L) performs 2 loops. My edit essentially combines the two loops into one, hopefully to get some boost (if not too tiny).
You can still arrange my new edit into a single line:
which(unlist(lapply(my_query, function (pattern, x) any(grepl(pattern, x)), x = my_data)))
or
which(unlist(lapply(my_query, function (pattern) any(grepl(pattern, my_data)))))
Expanding on a comment posted initially by @Gregor you could try:
which(colSums(sapply(my_query, grepl, my_data)) > 0)
#g2
# 2
The function colSums is vectorized and represents no problem in terms of performance. The sapply() loop seems inevitable here, since we need to check each element within the query vector. The result of the loop is a logical matrix, with each column representing an element of my_query and each row an element of my_data. By wrapping this matrix into which(colSums(..) > 0) we obtain the index numbers of all columns that contain at least one TRUE, i.e., a match with an entry of my_data.
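A further variant, not from the answers above, uses vapply() so that each query is guaranteed to yield exactly one logical value. This is only a sketch; fixed = TRUE is an assumption that the queries are literal substrings rather than regular expressions.

```r
my_query <- c("g1", "g2", "g3")
my_data  <- c("string2", "string4", "string5", "string6")

# vapply() guarantees one logical value per query element.
# fixed = TRUE treats each query as a literal substring rather than a regex.
has_match <- vapply(
  my_query,
  function(pattern) any(grepl(pattern, my_data, fixed = TRUE)),
  logical(1)
)
idx <- which(has_match)
print(idx)
```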
The following function is used to multiply a sequence 1:x by y
f1<-function(x,y){return (lapply(1:x, function(a,b) b*a, b=y))}
Looks like a is used to represent the element in the sequence 1:x, but I do not know how to understand this parameter passing mechanism. Other OO languages, like Java or C++, have call by reference or call by value.
Short answer: R is call by value. Long answer: it can do both.
Call By Value, Lazy Evaluation, and Scoping
You'll want to read through: the R language definition for more details.
R mostly uses call by value but this is complicated by its lazy evaluation:
So you can have a function:
f <- function(x, y) {
  x * 3
}
If you pass two big matrices to x and y, only x is ever evaluated inside f: because y is never used, its promise is never forced.
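Both points can be checked with a short sketch. The stop() call is there only to prove that y is never evaluated, and set_first is a hypothetical helper showing that modifying an argument leaves the caller's object untouched.

```r
f <- function(x, y) {
  x * 3
}

# y is never used inside f, so the stop() call is never evaluated.
lazy_result <- f(2, stop("y was evaluated"))
print(lazy_result)

# Value semantics: changing an argument inside a function
# does not change the caller's object.
set_first <- function(v) {  # set_first is a hypothetical helper
  v[1] <- 100
  v
}
original <- c(1, 2, 3)
changed <- set_first(original)
print(original)
print(changed)
```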
But you can also access variables in parent environments of f:
y <- 5
f <- function(x) {
  x * y
}
f(3) # 15
Or even:
y <- 5
f <- function() {
  x <- 3
  g <- function() {
    x * y
  }
}
f() # returns function g()
f()() # returns 15
Call By Reference
There are two ways for doing call by reference in R that I know of.
One is by using Reference Classes, one of the three object oriented paradigms of R (see also: Advanced R programming: Object Oriented Field Guide)
The other is to use the bigmemory package and its big.matrix objects (see The bigmemory project). This allows you to create matrices in memory (the underlying data is stored in C), returning a pointer to the R session. This allows you to do fun things like accessing the same matrix from multiple R sessions.
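For the Reference Classes route, a minimal sketch might look like this. Counter and bump are hypothetical names used only to show that the function changes the caller's object.

```r
library(methods)  # setRefClass() lives in the methods package

# Counter is a hypothetical class used only for illustration.
Counter <- setRefClass(
  "Counter",
  fields = list(count = "numeric"),
  methods = list(
    increment = function() {
      count <<- count + 1
      invisible(.self)
    }
  )
)

# bump() modifies the object it is given; no copy is made.
bump <- function(counter) {
  counter$increment()
}

shared <- Counter$new(count = 0)
bump(shared)
bump(shared)
print(shared$count)
```

With an ordinary numeric vector the same pattern would leave the caller's value unchanged.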
To multiply a vector x by a constant y just do
x * y
The (some prefix)apply functions work very similarly to each other: you want to map a function to every element of your vector, list, matrix and so on:
x = 1:10
x.squared = sapply(x, function(elem)elem * elem)
print(x.squared)
[1] 1 4 9 16 25 36 49 64 81 100
It gets better with matrices and data frames because you can now apply a function over all rows or columns, and collect the output. Like this:
m = matrix(1:9, ncol = 3)
# The 1 below means apply over rows, 2 would mean apply over cols
row.sums = apply(m, 1, function(some.row) sum(some.row))
print(row.sums)
[1] 12 15 18
If you're looking for a simple way to multiply a sequence by a constant, definitely use @Fernando's answer or something similar. I'm assuming you're just trying to determine how parameters are being passed in this code.
lapply calls its second argument (in your case function(a, b) b*a) with each of the values of its first argument 1, 2, ..., x. Those values will be passed as the first parameter to the second argument (so, in your case, they will be argument a).
Any additional parameters to lapply after the first two, in your case b=y, are passed to the function by name. So if you called your inner function fxn, then your invocation of lapply is making calls like fxn(1, b=4), fxn(2, b=4), .... The parameters are passed by value.
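The equivalence described here can be verified directly. This is a sketch in which fxn is the illustrative name for the inner function and 3 and 4 are arbitrary example arguments.

```r
f1 <- function(x, y) {
  lapply(1:x, function(a, b) b * a, b = y)
}

# fxn is the illustrative name for the inner function.
fxn <- function(a, b) b * a

via_lapply <- f1(3, 4)
by_hand <- list(fxn(1, b = 4), fxn(2, b = 4), fxn(3, b = 4))

print(unlist(via_lapply))
```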
You should read the help of lapply to understand how it works. Read this excellent answer to get a good explanation of the different xxpply family functions.
From the help of lapply:
lapply(X, FUN, ...)
Here FUN is applied to each element of X and ... refers to:
... optional arguments to FUN.
Since FUN has an optional argument b, we replace the ... with b=y.
You can see it as syntactic sugar that emphasizes the fact that argument b is optional compared to argument a. If the two arguments are symmetric, it may be better to use mapply.
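For the symmetric case, a sketch with mapply could look like the following; a_values and b_values are made-up inputs of equal length.

```r
a_values <- 1:3
b_values <- 4:6

# mapply() walks both vectors in parallel: 1*4, 2*5, 3*6.
products <- mapply(function(a, b) a * b, a_values, b_values)
print(products)
```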
Take the following example:
boltzmann <- function(x, t=0.1) { exp(x/t) / sum(exp(x/t)) }
z=rnorm(10,mean=1,sd=0.5)
t=0.1 # same value as the default inside boltzmann
exp(z[1]/t)/sum(exp(z/t))
[1] 0.0006599707
boltzmann(z)[1]
[1] 0.0006599707
It appears that exp in the boltzmann function operates over elements and vectors and knows when to do the right thing. Is the sum "unrolling" the input vector and applying the expression on the values? Can someone explain how this works in R?
Edit: Thank you for all of the comments, clarification, and patience with an R n00b. In summary, the reason this works was not immediately obvious to me coming from other languages. Take Python for example. You would first compute the sum and then compute the value for each element in the vector.
denom = sum([exp(v / t) for v in x])
vals = [exp(v / t) / denom for v in x]
Whereas in R the sum(exp(x/t)) can be computed inline.
This is explained in An Introduction to R, Section 2.2: Vector arithmetic.
Vectors can be used in arithmetic expressions, in which case the
operations are performed element by element. Vectors occurring in the
same expression need not all be of the same length. If they are not,
the value of the expression is a vector with the same length as the
longest vector which occurs in the expression. Shorter vectors in the
expression are recycled as often as need be (perhaps fractionally)
until they match the length of the longest vector. In particular a
constant is simply repeated. So with the above assignments the command
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
y <- c(x, 0, x)
v <- 2*x + y + 1
generates a new vector v of length 11 constructed by adding together,
element by element, 2*x repeated 2.2 times, y repeated just once, and
1 repeated 11 times.
This might be clearer if you evaluated the numerator and the denominator separately:
x = rnorm(10,mean=1,sd=0.5)
t = .1
exp(x/t)
# [1] 1.845179e+05 6.679273e+03 4.379369e+06 1.852623e+06 9.960374e+02
# [6] 1.359676e+09 6.154045e+03 1.777027e+01 1.070003e+04 6.217397e+04
sum(exp(x/t))
# [1] 2984044296
Since the numerator is a vector of length 10, and the denominator is a vector of length 1, the division returns a vector of length 10.
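The same split can be reproduced with fixed inputs so the lengths are easy to check. This is a sketch; x holds arbitrary example values, and temp is used instead of t to avoid masking base::t().

```r
boltzmann <- function(x, t = 0.1) {
  exp(x / t) / sum(exp(x / t))
}

x <- c(0.1, 0.2, 0.3)  # fixed values so the result is reproducible
temp <- 0.1            # avoids masking base::t()

numerator   <- exp(x / temp)       # length 3
denominator <- sum(exp(x / temp))  # length 1, recycled in the division

weights <- numerator / denominator
print(weights)
```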
Since you're interested in comparing this to Python, imagine the two following rules were added to Python (incidentally, these are similar to the usage of arrays in numpy):
If you divide a list by a number, it will divide all items in the list by the number:
[2, 4, 6, 8] / 2
# [1, 2, 3, 4]
The function exp in Python is "vectorized", which means that when it is applied to a list it will apply to each item in the list. However, sum still works the way you expect it to.
exp([1, 2, 3]) => [exp(1), exp(2), exp(3)]
In that case, imagine how this code would be evaluated in Python:
t = .1
x = [1, 2, 3, 4]
exp(x/t) / sum(exp(x/t))
It would follow the following simplifications, using those two simple rules:
exp([v / t for v in x]) / sum(exp([v / t for v in x]))
[exp(v / t) for v in x] / sum([exp(v / t) for v in x])
Now do you see how it knows the difference?
Vectorisation has several slightly different meanings in R.
It can mean accepting a vector input, transforming each element, and returning a vector (like exp does).
It can also mean accepting a vector input and calculating some summary statistic, then returning a scalar value (like mean does).
sum conforms to the second behaviour, but also has a third vectorisation behaviour, where it will create a summary statistic across inputs. Try sum(1, 2:3, 4:6), for example.
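The three behaviours side by side, as a small sketch with example values chosen only for illustration:

```r
elementwise <- exp(c(0, 1, 2))   # one output per input element
summary_val <- mean(c(1, 2, 3))  # one output for the whole vector
across_args <- sum(1, 2:3, 4:6)  # pools every argument: 1 + 5 + 15

print(length(elementwise))
print(summary_val)
print(across_args)
```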