I wish to evaluate a vector of strings containing arithmetic expressions -- "1+2", "5*6", etc.
I know that I can parse a single string into an expression and then evaluate it as in eval(parse(text="1+2")).
However, I would prefer to evaluate the vector without using a for loop.
foo <- c("1+2","3+4","5*6","7/8") # I want to evaluate this and return c(3,7,30,0.875)
eval(parse(text=foo[1])) # correctly returns 3, so how do I vectorize the evaluation?
eval(sapply(foo, function(x) parse(text=x))) # wrong! evaluates only last element
Just apply the whole function.
sapply(foo, function(x) eval(parse(text=x)))
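As a small sketch of why the wrapped version works, the snippet below compares evaluating the whole parsed vector at once with evaluating element by element. Using vapply() with numeric(1) and USE.NAMES = FALSE is my own choice here, not part of the original answer; it assumes every string evaluates to a single number.

```r
foo <- c("1+2", "3+4", "5*6", "7/8")

# parse() on the whole vector yields one expression object with four elements;
# eval() runs all of them but returns only the value of the last one.
last_only <- eval(parse(text = foo))

# Evaluating one string at a time keeps every result.
# numeric(1) asserts that each string evaluates to a single number.
res <- vapply(foo, function(x) eval(parse(text = x)),
              FUN.VALUE = numeric(1), USE.NAMES = FALSE)
res
```

Keep in mind that eval(parse()) runs whatever code the strings contain, so it is only appropriate for input you trust.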
Just to show that you can also do this with a for loop:
result <- numeric(length(foo))
foo <- parse(text=foo)
for(i in seq_along(foo))
result[i] <- eval(foo[[i]])
I'm not a fan of using the *apply functions for their own sake, but in this case, sapply really does lead to simpler, clearer code.
hits <- vapply(titles,
FUN = grepl,
FUN.VALUE = logical(length(pass_names)),
pass_names)
titles is a vector with titles such as "mr", pass_names is a list of names.
2 questions.
I don't understand the resulting matrix hits
I don't understand why the last line is pass_names, nor how I am supposed to know about these 4 arguments. ?vapply specifies X, FUN and FUN.VALUE, but I cannot figure out how I am supposed to know that pass_names needs to be listed there.
I have looked online and could not find an answer, so I hope this will help others too. Thank you in advance for your answers, yes I am a beginner.
Extra info: This question uses the titanic package in R, pass_names is just titanic$Name, titles is just paste(",", c("Mr\\.", "Master", "Don", "Rev", "Dr\\.", "Major", "Sir", "Col", "Capt", "Jonkheer"))
You're right to be a bit confused.
The vapply code chunk in your question is equivalent to:
hits <- vapply(titles,
FUN = function(x) grepl(x, pass_names),
FUN.VALUE = logical(length(pass_names)))
vapply takes a ... argument which takes as many arguments as are provided. If the arguments are not named (see @Roland's comment), the n-th argument in the ... position is passed to the n+1-th argument of FUN (the first argument to FUN is X, i.e. titles in this case).
The resulting matrix has the same number of rows as the number of rows in titanic and has 10 columns, the length of titles. The [i, j]-th entry is TRUE if the i-th pass_names matches the j-th regular expression in titles, FALSE if it doesn't.
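To make the argument forwarding and the shape of the result concrete, here is a minimal sketch on toy data. The two patterns and three names below are made-up stand-ins for titles and pass_names, not the real titanic data.

```r
# Hypothetical stand-ins for the objects in the question
titles <- c("Mr\\.", "Miss")
pass_names <- c("Braund, Mr. Owen", "Heikkinen, Miss. Laina", "Allen, Mr. William")

# pass_names travels through `...` and becomes the second argument of grepl()
hits_a <- vapply(titles, FUN = grepl,
                 FUN.VALUE = logical(length(pass_names)),
                 pass_names)

# The same call with the forwarding written out explicitly
hits_b <- vapply(titles, function(x) grepl(x, pass_names),
                 FUN.VALUE = logical(length(pass_names)))

dim(hits_a)   # one row per name, one column per pattern
hits_a
```

Each column holds the grepl() result for one pattern, which is why FUN.VALUE has to be a logical vector as long as pass_names.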
Essentially you are passing two vectors in your vapply which is equivalent to two nested for loops. Each pairing is then passed into the required arguments of grepl: grepl(pattern, x).
Specifically, on the first iteration of vapply, the first item in titles is compared with every item of pass_names. On the second iteration, the second item in titles is compared with all items of pass_names, and so on until the first vector, titles, is exhausted.
To illustrate, you can equivalently build a hits2 matrix using nested for loops, rendering exactly as your vapply output, hits:
hits2 <- matrix(NA, nrow = length(pass_names), ncol = length(titles))
colnames(hits2) <- titles
for (i in seq_along(pass_names)) {
  for (j in seq_along(titles)) {
    hits2[i, j] <- grepl(pattern = titles[j], x = pass_names[i])
  }
}
all.equal(hits, hits2)
# [1] TRUE
Alternatively, you can run exactly the same thing with sapply, without the FUN.VALUE argument, since sapply and vapply both iterate over X the same way lapply does. However, vapply is generally preferred because you assert the type and length of each result up front, whereas sapply picks the output shape depending on what the function returns. For instance, with vapply you could get an integer matrix by using FUN.VALUE = integer(length(pass_names)).
hits3 <- sapply(titles, FUN = grepl, pass_names)
all.equal(hits, hits3)
# [1] TRUE
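One place where the difference between the two shows up is empty input. The sketch below is my own illustration (nchar() is just a convenient function to apply): sapply() has nothing to simplify and falls back to a list, while vapply() still returns the type promised by FUN.VALUE.

```r
empty <- character(0)

# sapply() cannot simplify zero results, so it hands back an empty list
s <- sapply(empty, nchar)

# vapply() returns a zero-length vector of the declared type
v <- vapply(empty, nchar, FUN.VALUE = integer(1))

is.list(s)
is.integer(v)
```

Code downstream of vapply() can therefore rely on the result type even in this edge case.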
All in all, the apply family functions are more concise, compact ways to run iterations, and they return a data structure directly instead of requiring you to initialize and fill a vector or matrix with for or while loops.
For further reading, consider this interesting SO post: Is the “*apply” family really not vectorized?
Consider a function f(x,y), where x is a vector (1xn) and y a matrix (nxm), returning a numeric scalar.
Now, I have a matrix A and a three-dimensional array B and would like to apply f across the first dimension of A and B.
Specifically, I would like f to be evaluated at x=A[1,] y=B[1,,], followed by x=A[2,] y=B[2,,] and so on, returning a vector of numeric scalars.
Is there a way to use any function of the "apply" family to solve this problem, thus avoiding a loop?
You can do:
sapply(1:nrow(A), function(i) f(A[i,], B[i,,]))
This is loop hiding because the looping is done inside of sapply(). I suppose in this case it is better to use an explicit loop:
result <- numeric(nrow(A))
for (i in 1:nrow(A)) result[i] <- f(A[i,], B[i,,])
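For a runnable check of both variants, here is a sketch with a made-up f and small made-up inputs; the question does not say what f computes, so the function below is purely hypothetical. I use seq_len(nrow(A)) instead of 1:nrow(A) so that a zero-row A produces no iterations.

```r
# Hypothetical f: multiplies the vector by the matrix and sums the result
f <- function(x, y) sum(x %*% y)

A <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)   # 2 x 3
B <- array(1, dim = c(2, 3, 4))              # 2 x 3 x 4, all ones

# apply-style version: row i of A is paired with slice i of B
res_apply <- vapply(seq_len(nrow(A)),
                    function(i) f(A[i, ], B[i, , ]),
                    FUN.VALUE = numeric(1))

# explicit loop with pre-allocated storage
res_loop <- numeric(nrow(A))
for (i in seq_len(nrow(A))) res_loop[i] <- f(A[i, ], B[i, , ])

res_apply
```

Both versions iterate over the row index, so the choice between them is mostly about readability.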
OK, I have a little problem which I believe I can solve with which and grepl (alternatives are welcome), but I am getting lost:
my_query<- c('g1', 'g2', 'g3')
my_data<- c('string2','string4','string5','string6')
I would like to return the indices of the elements of my_query that match something in my_data. In the example above, only 'g2' occurs in my_data, so the result would be 2.
It seems to me that there is no easy way to do this without a loop. For each element in my_query, we can use either of the below functions to get TRUE or FALSE:
f1 <- function (pattern, x) length(grep(pattern, x)) > 0L
f2 <- function (pattern, x) any(grepl(pattern, x))
For example,
f1(my_query[1], my_data)
# [1] FALSE
f2(my_query[1], my_data)
# [1] FALSE
Then we use an *apply loop to apply, say, f2 to all elements of my_query:
which(unlist(lapply(my_query, f2, x = my_data)))
# [1] 2
Thanks, that seems to work. To be honest, I preferred your original one-line version. I am not sure why you edited it to create another function that is then called with *apply. Is there any advantage compared to which(lengths(lapply(my_query, grep, my_data)) > 0L)?
Well, I am not entirely sure. When I read ?lengths:
One advantage of ‘lengths(x)’ is its use as a more efficient
version of ‘sapply(x, length)’ and similar ‘*apply’ calls to
‘length’.
I don't know how much more efficient lengths is compared with sapply. Anyway, if it is still a loop, then my original suggestion which(lengths(lapply(my_query, grep, my_data)) > 0L) performs two loops. My edit essentially combines the two loops into one, hopefully for a small speed boost (if it is not too tiny to matter).
You can still arrange my new edit into a single line:
which(unlist(lapply(my_query, function (pattern, x) any(grepl(pattern, x)), x = my_data)))
or
which(unlist(lapply(my_query, function (pattern) any(grepl(pattern, my_data)))))
Expanding on a comment posted initially by @Gregor, you could try:
which(colSums(sapply(my_query, grepl, my_data)) > 0)
#g2
# 2
The function colSums is vectorized and represents no problem in terms of performance. The sapply() loop seems inevitable here, since we need to check each element within the query vector. The result of the loop is a logical matrix, with each column representing an element of my_query and each row an element of my_data. By wrapping this matrix into which(colSums(..) > 0) we obtain the index numbers of all columns that contain at least one TRUE, i.e., a match with an entry of my_data.
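A type-checked variant of the same idea can be sketched with vapply(). The fixed = TRUE argument is my assumption that the queries are literal substrings rather than regular expressions; drop it if they are patterns.

```r
my_query <- c("g1", "g2", "g3")
my_data  <- c("string2", "string4", "string5", "string6")

# One TRUE/FALSE per query: does it occur anywhere in my_data?
has_match <- vapply(my_query,
                    function(p) any(grepl(p, my_data, fixed = TRUE)),
                    FUN.VALUE = logical(1))

# Positions in my_query with at least one match (names come from my_query)
idx <- which(has_match)
idx
```

Because any() collapses each query to a single logical, no intermediate matrix is built, which may matter when my_data is large.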
I'm new to programming and I wrote a code that finds spam words for the first email but I would like to write a for loop that would do this for all of the emails. Any help would be appreciated. Thank you.
words = grepl("viagra", spamdata[[ 1 ]]$header[ "Subject"])
I presume that you want to loop over the elements of spamdata and build up an indicator whether the string "viagra" is found in the subject lines of your emails.
Let's set up some dummy data for illustration purposes:
subjects <- c("Buy my viagra", "Buy my Sildenafil citrate",
"UK Lottery Win!!!!!")
names(subjects) <- rep("Subject", 3)
spamdata <- list(list(Header = subjects[1]), list(Header = subjects[2]),
list(Header = subjects[3]))
Next we create a vector words to hold the result of each iteration of the loop. You do not want to be growing words or any other object at each iteration - that will force copying and will slow your loop down. Instead allocate storage before you begin - here using the length of the list over which we want to loop:
words <- logical(length = length(spamdata))
You can set up a loop as so
## seq_along() creates a sequence of 1:length(spamdata)
for(i in seq_along(spamdata)) {
words[ i ] <- grepl("viagra", spamdata[[ i ]]$Header["Subject"])
}
We can then look at words:
> words
[1] TRUE FALSE FALSE
Which matches what we know from the made up subjects.
Notice how we used i as a place holder for 1, 2, and 3 - at each iteration of the loop, i takes on the next value in the sequence 1,2,3 so we can i) access the ith component of spamdata to get the next subject line, and ii) access the ith element of words to store the result of the grepl() call.
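The comment in the loop describes seq_along() as 1:length(spamdata), which holds for any non-empty list. As a small sketch of where the two differ, consider an empty list (the object names below are mine):

```r
empty <- list()

# 1:length(x) counts down from 1 to 0 when x is empty
bad_index  <- 1:length(empty)

# seq_along(x) gives a zero-length sequence, so the loop body never runs
good_index <- seq_along(empty)

n_iter <- 0
for (i in good_index) n_iter <- n_iter + 1
n_iter
```

This is the main reason seq_along() is usually the safer way to build the index for a loop like the one above.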
Note that instead of an explicit loop we could also use the sapply() or lapply() functions, which create the loop for you but might need a bit of work to write a custom function. Instead of using grepl() directly, we can write a wrapper:
foo <- function(x) {
grepl("viagra", x$Header["Subject"])
}
In the above function we use x instead of the list name spamdata because when lapply() and sapply() loop over the spamdata list, the individual components (referenced by spamdata[[i]] in the for() loop) get passed to our function as argument x so we only need to refer to x in the grepl() call.
This is how we could use our wrapper function foo() in lapply() or sapply(), first lapply():
> lapply(spamdata, foo)
[[1]]
[1] TRUE
[[2]]
[1] FALSE
[[3]]
[1] FALSE
sapply() will simplify the returned object where possible, as follows:
> sapply(spamdata, foo)
[1] TRUE FALSE FALSE
Other than that, they work similarly.
Note we can make our wrapper function foo() more useful by allowing it to take an argument defining the spam word you wish to search for:
foo <- function(x, string) {
grepl(string, x$Header["Subject"])
}
We can pass extra arguments to our functions with lapply() and sapply() like this:
> sapply(spamdata, foo, string = "viagra")
[1] TRUE FALSE FALSE
> sapply(spamdata, foo, string = "Lottery")
[1] FALSE FALSE TRUE
Which you will find most useful (for() loop or the lapply(), sapply() versions) will depend on your programming background and which you find most familiar. Sometimes for() is easier and simpler to use, but perhaps more verbose (which isn't always a bad thing!), whilst lapply() and sapply() are quite succinct and useful where you don't need to jump through hoops to create a workable wrapper function.
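A third option, sketched here as my own addition rather than part of the answer above, is vapply(), which works like sapply() but checks that every result is a single logical. The helper name has_word and the ignore.case = TRUE setting are assumptions for illustration; the dummy data mirrors the structure used above.

```r
subjects <- c("Buy my viagra", "Buy my Sildenafil citrate", "UK Lottery Win!!!!!")

# Same structure as the dummy spamdata: a list of emails, each with a Header
spamdata <- lapply(subjects, function(s) list(Header = c(Subject = s)))

# Hypothetical wrapper; ignore.case = TRUE so "lottery" also finds "Lottery"
has_word <- function(x, string) {
  grepl(string, x$Header["Subject"], ignore.case = TRUE)
}

# logical(1) asserts one TRUE/FALSE per email; string is forwarded to has_word()
flags_viagra  <- vapply(spamdata, has_word, logical(1), string = "viagra")
flags_lottery <- vapply(spamdata, has_word, logical(1), string = "lottery")
flags_viagra
```

If the wrapper ever returned something other than a single logical, vapply() would stop with an error instead of silently changing the shape of the result.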
In R, a loop takes this form, where variable is the name of your iteration variable, and sequence is a vector or list of values:
for (variable in sequence) expression
The expression can be a single R command - or several lines of commands wrapped in curly brackets:
for (variable in sequence) {
expression
expression
expression
}
In this case it would be something like for (i in seq_along(spamdata)) { do whatever you want to do with spamdata[[i]] }
Also
Basic loop theory
The basic structure for loop commands is: for(i in 1:n){stuff to do}, where n is the number of times the loop will execute.
listname[[1]] refers to the first element in the list “listname.”
In a for loop, listname[[i]] refers to the variable corresponding to the ith iteration of the for loop.
The code for(i in 1:length(yesnovars)) tells the loop to execute only once for each variable in the list.
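The points above can be sketched as follows. The contents of yesnovars are invented for illustration, since the source only mentions the name, and I use seq_along() in place of 1:length() so the loop also behaves for an empty list.

```r
# Hypothetical list of yes/no variables
yesnovars <- list(smoker   = c("yes", "no", "yes"),
                  diabetic = c("no", "no", "yes"))

# Pre-allocate one slot per variable
n_yes <- integer(length(yesnovars))
names(n_yes) <- names(yesnovars)

# yesnovars[[i]] is the variable handled on the i-th iteration
for (i in seq_along(yesnovars)) {
  n_yes[i] <- sum(yesnovars[[i]] == "yes")
}
n_yes
```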
Answer taken from the following sources:
Loops in R
Programming in R