Confused by a vapply function using grepl internally (Part of datacamp course) - r

hits <- vapply(titles,
FUN = grepl,
FUN.VALUE = logical(length(pass_names)),
pass_names)
titles is a vector with titles such as "mr", pass_names is a list of names.
2 questions.
I don't understand the resulting matrix hits
I don't understand why the last line is pass_names, nor how I am supposed to know about these 4 arguments. ?vapply specifies X, FUN and FUN.VALUE, but I cannot figure out how I am supposed to know that pass_names needs to be listed there.
I have looked online and could not find an answer, so I hope this will help others too. Thank you in advance for your answers, yes I am a beginner.
Extra info: This question uses the titanic package in R, pass_names is just titanic$Name, titles is just paste(",", c("Mr\\.", "Master", "Don", "Rev", "Dr\\.", "Major", "Sir", "Col", "Capt", "Jonkheer"))

You're right to be a bit confused.
The vapply code chunk in your question is equivalent to:
hits <- vapply(titles,
FUN = function(x) grepl(x, pass_names),
FUN.VALUE = logical(length(pass_names)))
vapply takes a ... argument, which collects any additional arguments and forwards them to FUN. If those arguments are not named (see @Roland's comment), the n-th argument in the ... position is passed to the (n+1)-th argument of FUN, because the first argument of FUN receives each element of X (titles in this case).
The resulting matrix has the same number of rows as the number of rows in titanic and has 10 columns, the length of titles. The [i, j]-th entry is TRUE if the i-th pass_names matches the j-th regular expression in titles, FALSE if it doesn't.
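To make the argument forwarding concrete, here is a minimal sketch on toy data. The three passenger names and the shortened titles vector are made up for illustration and are not taken from the titanic package.

```r
# Toy stand-ins for the real data (assumed values, for illustration only)
titles <- c("Mr\\.", "Miss")
pass_names <- c("Braund, Mr. Owen", "Heikkinen, Miss. Laina", "Allen, Mr. William")

# pass_names travels through `...` and lands in grepl's second argument, x
hits <- vapply(titles,
               FUN = grepl,
               FUN.VALUE = logical(length(pass_names)),
               pass_names)

# The same call with the forwarding spelled out in an anonymous function
hits_anon <- vapply(titles,
                    FUN = function(p) grepl(pattern = p, x = pass_names),
                    FUN.VALUE = logical(length(pass_names)))

identical(hits, hits_anon)  # both forms build the same matrix
dim(hits)                   # one row per name, one column per title
```

Each column of hits answers "which names match this title?", so hits[, 1] flags the first and third names here.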

Essentially, you are passing two vectors to vapply, which is equivalent to running two nested for loops. Each pairing is then passed to the required arguments of grepl: grepl(pattern, x).
Specifically, on the first iteration of vapply, the first item in titles is compared with every item of pass_names. On the second iteration, the second item in titles is compared with all items of pass_names again, and so on until the first vector, titles, is exhausted.
To illustrate, you can equivalently build a hits2 matrix using nested for loops, rendering exactly as your vapply output, hits:
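# pass_names is the vector of passenger names from the question
hits2 <- matrix(NA, nrow = length(pass_names), ncol = length(titles))
colnames(hits2) <- titles
for (i in seq_along(pass_names)) {
  for (j in seq_along(titles)) {
    # does the j-th title pattern match the i-th name?
    hits2[i, j] <- grepl(pattern = titles[j], x = pass_names[i])
  }
}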
all.equal(hits, hits2)
# [1] TRUE
Alternatively, you can run the same call with sapply, which does not require the FUN.VALUE argument; sapply is a wrapper around lapply, and vapply behaves like sapply with a pre-specified return type. However, vapply is generally preferred because you proactively assert the type and length of each result, whereas sapply's output shape depends on what the function happens to return. For instance, with vapply you could produce an integer matrix with: FUN.VALUE = integer(length(pass_names)).
hits3 <- sapply(titles, FUN = grepl, pass_names)
all.equal(hits, hits3)
# [1] TRUE
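The difference between sapply and vapply can be sketched on toy data. The names and titles below are assumed values for illustration; the behaviours shown (promotion from logical to integer, an error on a type mismatch, and sapply falling back to a list on empty input) follow ?vapply and ?sapply as I understand them.

```r
# Toy stand-ins for the real data (assumed values, for illustration only)
titles <- c("Mr\\.", "Miss")
pass_names <- c("Braund, Mr. Owen", "Heikkinen, Miss. Laina", "Allen, Mr. William")

# Logical results may be promoted to integer when FUN.VALUE asks for it
int_hits <- vapply(titles, grepl, integer(length(pass_names)), pass_names)

# Asking for character output fails, because grepl returns logical values
mismatch <- tryCatch(
  vapply(titles, grepl, character(length(pass_names)), pass_names),
  error = function(e) e
)

# With empty input, sapply has nothing to simplify and hands back a list,
# while vapply still returns a (zero-length) logical result
empty_sapply <- sapply(character(0), grepl, pass_names)
empty_vapply <- vapply(character(0), grepl, logical(length(pass_names)), pass_names)

is.integer(int_hits)
inherits(mismatch, "error")
class(empty_sapply)
class(empty_vapply)
```

The empty-input case is the usual argument for vapply in functions and packages: the return type does not change with the data.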
All in all, the apply family offers a more concise way to run iterations: each function returns a data structure directly, instead of requiring you to initialize a vector or matrix and fill it in with for or while loops.
For further reading, consider this interesting SO post: Is the “*apply” family really not vectorized?

Related

what does the small x means in lapply

I have the variables:
trims<- c(0,0.1,0.2,0.5)
x<-rcauchy(100)
and the following operation:
lapply(trims, mean, x=x)
What does the small x refer to in this case? The documentation for lapply does not explain it well either. I do know that lapply takes a function and applies it to each element of the list, which I believe is trims in this case. How does x come in then?
If we use anonymous function, it will be clear.
res <- lapply(trims, function(y) mean(x, trim=y))
res1 <- lapply(trims, mean, x=x)
identical(res, res1)
#[1] TRUE
lapply loops through each element of trims. mean (through mean.default) has x as its first argument and trim as its second; since the first argument is already supplied by name with x = x, i.e. the object created with rcauchy, each element of trims is naturally matched to the second argument, trim.
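A deterministic version of the same idea, as a small sketch: the vector x below is an assumed fixed sample used instead of rcauchy(100) so that the results are easy to check by hand.

```r
# Fixed sample instead of random Cauchy draws (assumed values, for illustration)
x <- c(1, 2, 3, 4, 100)
trims <- c(0, 0.2)

# x = x is matched by name, so each element of trims fills the next free slot, trim
res <- lapply(trims, mean, x = x)

# The same calls written out by hand
manual <- list(mean(x, trim = 0), mean(x, trim = 0.2))

res[[1]]  # plain mean: (1 + 2 + 3 + 4 + 100) / 5 = 22
res[[2]]  # trimmed mean: drops 1 and 100, leaving mean(c(2, 3, 4)) = 3
```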

Iterating over the argument of a function (grep) passed to lapply

I currently am doing an operation similar to below:
v<-c("my","pig","is","big","with","a","name")
s<-c("m","g")
for(i in c(1:length(s))){
print(grep(v,pattern=s[i]))
}
Which prints
[1] 1 7
[1] 2 4
I would like to instead vectorize this operation where the return values are stored in a vector. I tried
mynewvector<-lapply(v,grep,pattern=s,x=v)
but the problem is that I don't know how to get lapply iterate over the elements passed as arguments (e.g. iterating over s). I saw this answer, but I don't think mapply works here because I am trying to hold one argument constant (x=v) and iterate over the other argument (pattern=s)
How would I do this?
Following up on d.b.'s response, the clearest solution is
lapply(s, function(a) grep(pattern = a, x = v))
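As a hedged aside, the fixed argument can also be supplied by name and the anonymous function dropped; the variable names by_anon, by_name and labelled below are introduced only for this sketch.

```r
v <- c("my", "pig", "is", "big", "with", "a", "name")
s <- c("m", "g")

# Anonymous-function form from the answer above
by_anon <- lapply(s, function(a) grep(pattern = a, x = v))

# Same result: x is fixed by name, so each element of s becomes pattern
by_name <- lapply(s, grep, x = v)

# Optional: label each result with the pattern that produced it
labelled <- setNames(by_name, s)

identical(by_anon, by_name)
labelled$m  # positions of "my" and "name"
labelled$g  # positions of "pig" and "big"
```

The result stays a list because each pattern can match a different number of elements.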

getting lost in Using which() and regex in R

OK, I have a little problem which I believe I can solve with which and grepl (alternatives are welcome), but I am getting lost:
my_query<- c('g1', 'g2', 'g3')
my_data<- c('string2','string4','string5','string6')
I would like to return the index in my_query matching in my_data. In the example above, only 'g2' is in my_data, so the result in the example would be 2.
It seems to me that there is no easy way to do this without a loop. For each element in my_query, we can use either of the below functions to get TRUE or FALSE:
f1 <- function (pattern, x) length(grep(pattern, x)) > 0L
f2 <- function (pattern, x) any(grepl(pattern, x))
For example,
f1(my_query[1], my_data)
# [1] FALSE
f2(my_query[1], my_data)
# [1] FALSE
Then we use an *apply loop to apply, say, f2 to all elements of my_query:
which(unlist(lapply(my_query, f2, x = my_data)))
# [1] 2
Thanks, that seems to work. To be honest, I preferred your original one-line version. I am not sure why you edited it to create another function that is then called with *apply. Is there any advantage compared to which(lengths(lapply(my_query, grep, my_data)) > 0L)?
Well, I am not entirely sure. When I read ?lengths:
One advantage of ‘lengths(x)’ is its use as a more efficient
version of ‘sapply(x, length)’ and similar ‘*apply’ calls to
‘length’.
I don't know how much more efficient lengths is compared with sapply. Anyway, if it is still a loop, then my original suggestion which(lengths(lapply(my_query, grep, my_data)) > 0L) performs 2 loops. My edit essentially combines the two loops into one, hopefully for a small speed-up (if it is measurable at all).
You can still arrange my new edit into a single line:
which(unlist(lapply(my_query, function (pattern, x) any(grepl(pattern, x)), x = my_data)))
or
which(unlist(lapply(my_query, function (pattern) any(grepl(pattern, my_data)))))
Expanding on a comment posted initially by @Gregor, you could try:
which(colSums(sapply(my_query, grepl, my_data)) > 0)
#g2
# 2
The function colSums is vectorized and represents no problem in terms of performance. The sapply() loop seems inevitable here, since we need to check each element within the query vector. The result of the loop is a logical matrix, with each column representing an element of my_query and each row an element of my_data. By wrapping this matrix into which(colSums(..) > 0) we obtain the index numbers of all columns that contain at least one TRUE, i.e., a match with an entry of my_data.
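To see the intermediate logical matrix described above, here is a short sketch using the data from the question. The names m, idx and idx_any are introduced only for this example, and the vapply variant is an alternative I am adding, not part of the original answer.

```r
my_query <- c("g1", "g2", "g3")
my_data  <- c("string2", "string4", "string5", "string6")

# Rows correspond to my_data, columns to my_query
m <- sapply(my_query, grepl, my_data)
m

# Columns with at least one TRUE are the queries that matched something
idx <- which(colSums(m) > 0)

# Variant that never builds the matrix: one TRUE/FALSE per query
idx_any <- which(vapply(my_query, function(p) any(grepl(p, my_data)), logical(1)))

idx
idx_any
```

Only the g2 column contains a TRUE (from "string2"), so both approaches point at position 2.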

In R, evaluate expressions within vector of strings

I wish to evaluate a vector of strings containing arithmetic expressions -- "1+2", "5*6", etc.
I know that I can parse a single string into an expression and then evaluate it as in eval(parse(text="1+2")).
However, I would prefer to evaluate the vector without using a for loop.
foo <- c("1+2","3+4","5*6","7/8") # I want to evaluate this and return c(3,7,30,0.875)
eval(parse(text=foo[1])) # correctly returns 3, so how do I vectorize the evaluation?
eval(sapply(foo, function(x) parse(text=x))) # wrong! evaluates only last element
Just apply the whole function.
sapply(foo, function(x) eval(parse(text=x)))
Just to show that you can also do this with a for loop:
result <- numeric(length(foo))
foo <- parse(text=foo)
for(i in seq_along(foo))
result[i] <- eval(foo[[i]])
I'm not a fan of using the *apply functions for their own sake, but in this case, sapply really does lead to simpler, clearer code.
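If a plain numeric vector without names is wanted, a vapply variant is one option. This is a sketch of an alternative, not the answerers' code, and it assumes every string evaluates to a single number. As with any eval(parse()) approach, it should only be run on strings you trust.

```r
foo <- c("1+2", "3+4", "5*6", "7/8")

# numeric(1) asserts that each expression yields exactly one number;
# USE.NAMES = FALSE drops the expression strings as element names
res <- vapply(foo,
              function(s) eval(parse(text = s)),
              numeric(1),
              USE.NAMES = FALSE)

res
```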

vectorize a bidimensional function in R

I have some true and predicted labels
truth <- factor(c("+","+","-","+","+","-","-","-","-","-"))
pred <- factor(c("+","+","-","-","+","+","-","-","+","-"))
and I would like to build the confusion matrix.
I have a function that works on unary elements
f <- function(x,y){ sum(y==pred[truth == x])}
however, when I apply it to the outer product, to build the matrix, R seems unhappy.
outer(levels(truth), levels(truth), f)
Error in outer(levels(x), levels(x), f) :
dims [product 4] do not match the length of object [1]
What is the recommended strategy for this in R ?
I can always go through higher order stuff, but that seems clumsy.
I sometimes fail to understand where outer goes wrong, too. For this task I would have used the table function:
> table(truth,pred) # arguably a lot less clumsy than your effort.
     pred
truth - +
    - 4 2
    + 1 3
In this case, you are testing whether a multivalued vector is "==" to a scalar.
outer assumes that the function passed to FUN can take vector arguments and work properly with them. If m and n are the lengths of the two vectors passed to outer, it first creates two vectors of length m*n such that every combination of inputs occurs, and passes these as the two arguments to FUN. In return, outer expects FUN to give back another vector of length m*n.
The function described in your example doesn't really do this. In fact, it doesn't handle vectors correctly at all.
One way is to define another function that can handle vector inputs properly; alternatively, if your program actually requires only simple matching, you could use table() as in @DWin's answer.
If you're redefining your function, outer is expecting a function that will be run for inputs:
f(c("+","+","-","-"), c("+","-","+","-"))
and per your example, ought to return,
c(3,1,2,4)
There is also the small matter of decoding the actual meaning of the error:
Again, if m and n are the lengths of the two vectors passed to outer, FUN is expected to return a vector of length m*n, which outer then reshapes using (basically)
dim(output) = c(m,n)
This is the line that gives an error, because outer is trying to shape the output into a 2x2 matrix (total 2*2 = 4 items) while the function f, assuming no vectorization, has given only 1 output. Hence,
Error in outer(levels(x), levels(x), f) :
dims [product 4] do not match the length of object [1]
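One way to give outer the vectorized function it expects, without rewriting f, is to wrap it in Vectorize(). This is a sketch of that option rather than something proposed in the answers above; the names lv and cm are introduced only for this example.

```r
truth <- factor(c("+","+","-","+","+","-","-","-","-","-"))
pred  <- factor(c("+","+","-","-","+","+","-","-","+","-"))

# Scalar function from the question: among cases whose true label is x,
# count how many were predicted as y
f <- function(x, y) sum(y == pred[truth == x])

lv <- levels(truth)

# Vectorize() wraps f in mapply, so it returns one count per (x, y) pair,
# which is the length-m*n vector that outer needs
cm <- outer(lv, lv, Vectorize(f))
dimnames(cm) <- list(truth = lv, pred = lv)

cm
all(cm == table(truth, pred))  # agrees with the table() approach
```

For a confusion matrix, table(truth, pred) remains the simpler choice; the wrapper is mainly useful when the scalar function cannot easily be rewritten.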
