how to start a for loop in R programming - r

I'm new to programming and I wrote a code that finds spam words for the first email but I would like to write a for loop that would do this for all of the emails. Any help would be appreciated. Thank you.
words = grepl("viagra", spamdata[[ 1 ]]$header[ "Subject"])

I presume that you want to loop over the elements of spamdata and build up an indicator whether the string "viagra" is found in the subject lines of your emails.
Let's set up some dummy data for illustration purposes:
subjects <- c("Buy my viagra", "Buy my Sildenafil citrate",
"UK Lottery Win!!!!!")
names(subjects) <- rep("Subject", 3)
spamdata <- list(list(Header = subjects[1]), list(Header = subjects[2]),
list(Header = subjects[3]))
Next we create a vector words to hold the result of each iteration of the loop. You do not want to be growing words or any other object at each iteration - that will force copying and will slow your loop down. Instead allocate storage before you begin - here using the length of the list over which we want to loop:
words <- logical(length = length(spamdata))
You can set up the loop like so:
## seq_along() creates a sequence of 1:length(spamdata)
for(i in seq_along(spamdata)) {
words[ i ] <- grepl("viagra", spamdata[[ i ]]$Header["Subject"])
}
We can then look at words:
> words
[1] TRUE FALSE FALSE
This matches what we know from the made-up subjects.
Notice how we used i as a place holder for 1, 2, and 3 - at each iteration of the loop, i takes on the next value in the sequence 1,2,3 so we can i) access the ith component of spamdata to get the next subject line, and ii) access the ith element of words to store the result of the grepl() call.
Note that instead of an explicit for() loop we could also use the sapply() or lapply() functions, which create the loop for you but might need a bit of work to write a custom function. Instead of using grepl() directly, we can write a wrapper:
foo <- function(x) {
grepl("viagra", x$Header["Subject"])
}
In the above function we use x instead of the list name spamdata because when lapply() and sapply() loop over the spamdata list, the individual components (referenced by spamdata[[i]] in the for() loop) get passed to our function as argument x so we only need to refer to x in the grepl() call.
This is how we could use our wrapper function foo() in lapply() or sapply(), first lapply():
> lapply(spamdata, foo)
[[1]]
[1] TRUE
[[2]]
[1] FALSE
[[3]]
[1] FALSE
sapply() will simplify the returned object where possible, as follows:
> sapply(spamdata, foo)
[1] TRUE FALSE FALSE
Other than that, they work similarly.
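A related option, not mentioned above, is vapply(), which works like sapply() but makes you declare the type and length of each result. The sketch below rebuilds the dummy data and assumes the same foo() wrapper; the names flags, empty_s and empty_v are only for illustration.

```r
# Dummy data with the same shape as above
subjects <- c("Buy my viagra", "Buy my Sildenafil citrate", "UK Lottery Win!!!!!")
spamdata <- lapply(subjects, function(s) list(Header = c(Subject = s)))

foo <- function(x) grepl("viagra", x$Header["Subject"])

# vapply() checks that every call to foo() returns exactly one logical value
flags <- vapply(spamdata, foo, logical(1))

# On empty input sapply() falls back to an empty list,
# whereas vapply() still returns an empty logical vector
empty_s <- sapply(list(), foo)
empty_v <- vapply(list(), foo, logical(1))
```

This matters mostly when the result feeds into later code that expects a logical vector even when there are no emails to check.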
Note we can make our wrapper function foo() more useful by allowing it to take an argument defining the spam word you wish to search for:
foo <- function(x, string) {
grepl(string, x$Header["Subject"])
}
We can pass extra arguments to our functions with lapply() and sapply() like this:
> sapply(spamdata, foo, string = "viagra")
[1] TRUE FALSE FALSE
> sapply(spamdata, foo, string = "Lottery")
[1] FALSE FALSE TRUE
Which you will find most useful (for() loop or the lapply(), sapply() versions) will depend on your programming background and which you find most familiar. Sometimes for() is easier and simpler to use, but perhaps more verbose (which isn't always a bad thing!), whilst lapply() and sapply() are quite succinct and useful where you don't need to jump through hoops to create a workable wrapper function.
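Since grepl() is itself vectorized over the strings it searches, another possibility is to pull the subject lines out once and search them in a single call. This is only a sketch under the assumption that every email has a Header element containing a "Subject" entry; subject_lines, spam_words and pattern are made-up names.

```r
# Dummy data with the same shape as above
subjects <- c("Buy my viagra", "Buy my Sildenafil citrate", "UK Lottery Win!!!!!")
spamdata <- lapply(subjects, function(s) list(Header = c(Subject = s)))

# Extract one subject line per email into a plain character vector
subject_lines <- vapply(spamdata, function(x) x$Header[["Subject"]], character(1))

# One vectorized call replaces the loop
is_spam <- grepl("viagra", subject_lines, ignore.case = TRUE)

# Several spam words can be combined into one regular expression with "|"
spam_words <- c("viagra", "lottery")
pattern <- paste(spam_words, collapse = "|")
any_spam_word <- grepl(pattern, subject_lines, ignore.case = TRUE)
```

Note that ignore.case = TRUE is an addition here; the original examples match case-sensitively.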

In R, a for loop takes this form, where variable is the name of your iteration variable, and sequence is a vector or list of values:
for (variable in sequence) expression
The expression can be a single R command - or several lines of commands wrapped in curly brackets:
for (variable in sequence) {
expression
expression
expression
}
In this case it would be for (i in seq_along(spamdata)) { do whatever you want to do }
Also
Basic loop theory
The basic structure for loop commands is: for(i in 1:n){stuff to do}, where n is the number of times the loop will execute.
listname[[1]] refers to the first element in the list “listname.”
In a for loop, listname[[i]] refers to the variable corresponding to the ith iteration of the for loop.
The code for(i in 1:length(yesnovars)) tells the loop to execute only once for each variable in the list.
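One caveat with the 1:length(x) idiom is worth a small sketch: when the list is empty, 1:0 counts down from 1 to 0, so the loop body runs twice instead of not at all. seq_along() avoids this. The counter names below are only for illustration.

```r
empty <- list()

# 1:length(empty) is 1:0, i.e. c(1, 0), so the body runs twice
count_colon <- 0
for (i in 1:length(empty)) count_colon <- count_colon + 1

# seq_along(empty) is integer(0), so the body never runs
count_seq <- 0
for (i in seq_along(empty)) count_seq <- count_seq + 1
```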
Answer taken from the following
sources:
Loops in R
Programming in R

Related

Why are functions only sometimes first class variables in R? Using them from built-in data structures causes "Error: attempt to apply non-function"

I am trying to understand how first class functions work in R. I had understood that functions were first class in R, but was sorely disappointed when applying that understanding. When a function is saved to a list, whether that be as an ordinary list, a vector or a dictionary style list or vector, it is no longer callable, leading to the following error:
Error: attempt to apply non-function
e.g.
print_func <- function() {
print('hi')
}
print_func()
[1] "hi"
my_list = list(print_func)
my_list[0]()
Error: attempt to apply non-function
my_vector = c(print_func)
my_vector[0]()
Error: attempt to apply non-function
my_map <- c("a" = print_func)
my_map["a"]()
Error: attempt to apply non-function
So why is this? Does R not actually treat functions as first class members in all cases, or is there another reason why this occurs?
I see that R vectors also do unexpected things (for me - perhaps not for experienced R users) to nested arrays:
nested_vector <- c("a" = c("b" = 1))
nested_vector["a"]
<NA>
NA
nested_vector["a.b"]
a.b
1
Here it makes sense to me that "a.b" might reference the sub-member of the key "b" under the key "a". But apparently that logic goes out the window when trying to call the upper level key "a".
R is 1-based, so you refer to the first element of a vector using index 1 (and not 0 as in Python).
There are two approaches to accessing list elements:
accessing list elements while keeping a list (return a list containing the desired elements)
pulling an element out of a list
In the first case, the subsetting is done using a single pair of brackets ([]) and you always get a list back. This differs from Python, where indexing a single element returns the element itself and only a slice returns a list: with lst = [fun1, fun2], lst[0] returns fun1 (not a one-element list as in R), while lst[0:2] returns a list.
In the second approach, the subsetting is done using a double pair of brackets ([[]]). You pull the element completely out of the list, much like indexing a single element of a list in Python.
print_func <- function() {
print('hi')
}
print_func()
my_list = list(print_func)
mode(my_list[1]) # return a list (not a function); so, it's not callable
[1] "list"
mode(my_list[[1]]) # return a function; so, it's callable
[1] "function"
my_list[1]() # error
my_list[[1]]() # works
[1] "hi"
#
my_vector = c(print_func)
mode(my_vector) # a list, so, not callable
[1] "list"
my_vector[1]() # error because it returns a list and not a function
my_vector[[1]]() # works
[1] "hi"
When subsetting with names, the same logic of single and double pair of brackets applies
my_map <- c("a" = print_func)
mode(my_map) # list, so, not callable
[1] "list"
my_map["a"]() # error
my_map[["a"]]() # works
[1] "hi"
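As for the nested_vector part of the question, a likely explanation is that c() does not nest: it flattens its arguments into one vector and joins the names with a dot, so "a.b" is the only name that exists. If real nesting is wanted, a list of lists is the usual tool. A minimal sketch (nested_list is a made-up name):

```r
# c() flattens, so there is a single element whose name is "a.b"
nested_vector <- c("a" = c("b" = 1))
flat_names <- names(nested_vector)

# A list can genuinely contain another list
nested_list <- list(a = list(b = 1))
inner <- nested_list[["a"]][["b"]]   # same as nested_list$a$b
```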
Limey pointed out my 2 issues in the comments. I was using a 0-index and I was using single brackets. If I use a 1-index and double brackets it works, and functions are treated as first class variables.
My issue is resolved, and hopefully I won't make that same mistake again.
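For completeness, here is a short sketch of the dictionary-of-functions pattern working with 1-based indexing and double brackets. The functions return their string rather than printing it so the results are easy to inspect; handlers and shout_func are hypothetical names.

```r
print_func <- function() "hi"
shout_func <- function() "HI"

# A named list used as a lookup table of functions
handlers <- list(a = print_func, b = shout_func)

first <- handlers[["a"]]()   # double brackets extract the function itself
second <- handlers$b()       # $ also extracts a single element

# Functions in a list can be passed around and called like any other value
results <- vapply(handlers, function(f) f(), character(1))

# Single brackets still give back a list, which is why calling it fails
single_is_function <- is.function(handlers["a"])
```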

Iterating over the argument of a function (grep) passed to lapply

I currently am doing an operation similar to below:
v<-c("my","pig","is","big","with","a","name")
s<-c("m","g")
for(i in c(1:length(s))){
print(grep(v,pattern=s[i]))
}
Which prints
[1] 1 7
[1] 2 4
I would like to instead vectorize this operation where the return values are stored in a vector. I tried
mynewvector<-lapply(v,grep,pattern=s,x=v)
but the problem is that I don't know how to get lapply iterate over the elements passed as arguments (e.g. iterating over s). I saw this answer, but I don't think mapply works here because I am trying to hold one argument constant (x=v) and iterate over the other argument (pattern=s)
How would I do this?
Following up on d.b.'s response, the clearest solution is
lapply(s, function(a) grep(pattern = a, x = v))
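A self-contained version of that call, with the results named by pattern so they are easier to look up, might look like the sketch below. It also shows that mapply() can hold one argument constant through MoreArgs, which is one way around the concern raised in the question. The names matches and via_mapply are only for illustration.

```r
v <- c("my", "pig", "is", "big", "with", "a", "name")
s <- c("m", "g")

# Iterate over the patterns while x = v stays fixed inside the anonymous function
matches <- lapply(s, function(a) grep(pattern = a, x = v))
names(matches) <- s

# Equivalent with mapply(): arguments in MoreArgs are not iterated over
via_mapply <- mapply(grep, pattern = s, MoreArgs = list(x = v), SIMPLIFY = FALSE)
```

Here matches$m holds the positions matched by "m" and matches$g those matched by "g", the same values the original for loop printed.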

getting lost in Using which() and regex in R

OK, I have a little problem which I believe I can solve with which and grepl (alternatives are welcome), but I am getting lost:
my_query<- c('g1', 'g2', 'g3')
my_data<- c('string2','string4','string5','string6')
I would like to return the index in my_query matching in my_data. In the example above, only 'g2' is in my_data, so the result in the example would be 2.
It seems to me that there is no easy way to do this without a loop. For each element in my_query, we can use either of the below functions to get TRUE or FALSE:
f1 <- function (pattern, x) length(grep(pattern, x)) > 0L
f2 <- function (pattern, x) any(grepl(pattern, x))
For example,
f1(my_query[1], my_data)
# [1] FALSE
f2(my_query[1], my_data)
# [1] FALSE
Then, we use *apply loop to apply, say f2 to all elements of my_query:
which(unlist(lapply(my_query, f2, x = my_data)))
# [1] 2
Thanks, that seems to work. To be honest, I preferred your original one-line version. I am not sure why you edited it to create another function that is then called with *apply. Is there any advantage compared to which(lengths(lapply(my_query, grep, my_data)) > 0L)?
Well, I am not entirely sure. When I read ?lengths:
One advantage of ‘lengths(x)’ is its use as a more efficient
version of ‘sapply(x, length)’ and similar ‘*apply’ calls to
‘length’.
I don't know how much more efficient lengths is compared with sapply. Anyway, if it is still a loop, then my original suggestion which(lengths(lapply(my_query, grep, my_data)) > 0L) is performing 2 loops. My edit essentially combines the two loops into one, hopefully to get some boost (even if a tiny one).
You can still arrange my new edit into a single line:
which(unlist(lapply(my_query, function (pattern, x) any(grepl(pattern, x)), x = my_data)))
or
which(unlist(lapply(my_query, function (pattern) any(grepl(pattern, my_data)))))
Expanding on a comment posted initially by @Gregor, you could try:
which(colSums(sapply(my_query, grepl, my_data)) > 0)
#g2
# 2
The function colSums is vectorized and represents no problem in terms of performance. The sapply() loop seems inevitable here, since we need to check each element within the query vector. The result of the loop is a logical matrix, with each column representing an element of my_query and each row an element of my_data. By wrapping this matrix into which(colSums(..) > 0) we obtain the index numbers of all columns that contain at least one TRUE, i.e., a match with an entry of my_data.
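To make the intermediate matrix visible, the steps can be run one at a time. This sketch uses the data from the question; hits is a made-up name.

```r
my_query <- c("g1", "g2", "g3")
my_data <- c("string2", "string4", "string5", "string6")

# One column per query pattern, one row per element of my_data
hits <- sapply(my_query, grepl, my_data)
hits_dim <- dim(hits)

# Number of matching data entries per query, then the queries with at least one
per_query <- colSums(hits)
matched <- which(per_query > 0)
```

If the queries might contain regular expression metacharacters, passing fixed = TRUE through sapply() to grepl() should treat them as literal strings instead.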

if-else vs ifelse with lists

Why do the if-else construct and the function ifelse() behave differently?
mylist <- list(list(a=1, b=2), list(x=10, y=20))
l1 <- ifelse(sum(sapply(mylist, class) != "list")==0, mylist, list(mylist))
l2 <-
if(sum(sapply(mylist, class) != "list") == 0){ # T: all list elements are lists
mylist
} else {
list(mylist)
}
all.equal(l1,l2)
# [1] "Length mismatch: comparison on first 1 components"
From the ifelse documentation:
‘ifelse’ returns a value with the same shape as ‘test’ which is
filled with elements selected from either ‘yes’ or ‘no’ depending
on whether the element of ‘test’ is ‘TRUE’ or ‘FALSE’.
Your test has length one, so the output is truncated to length 1.
You can also see this illustrated with a more simple example:
ifelse(TRUE, c(1, 3), 7)
# [1] 1
if (cond) { yes } else { no } is a control structure. It was designed to effect programming forks rather than to process a sequence. I think many people come from SPSS or SAS, whose authors chose "IF" to implement conditional assignment within their DATA or TRANSFORM functions, and so they expect R to behave the same. SAS and SPSS both have implicit FOR-loops in their data steps, whereas R came from a programming tradition. R's implicit for-loops are built in to the many vectorized functions (including ifelse). The lapply/sapply functions are the more R-savvy way to implement most sequential processing, although they don't succeed at doing lagged variable access, especially if there are any randomizing features whose "effects" get cumulatively handled.
ifelse takes an expression that builds a vector of logical values as its first argument. The second and third arguments need to be vectors of equal length and either the first of them or the second gets chosen. This is similar to the SPSS/SAS IF commands which have an implicit by-row mode of operation.
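A small sketch of the contrast, with made-up variable names: ifelse() answers element by element and its result has the length of the test, while if/else picks one whole branch.

```r
x <- c(-2, 0, 3)

# Vectorized: one answer per element of the test
signs <- ifelse(x > 0, "pos", "non-pos")

# Control flow: a single condition selects one entire branch
size <- if (length(x) > 2) "long" else "short"

# With a length-one test, ifelse() keeps only the first element of 'yes',
# whereas if/else returns the whole vector
from_ifelse <- ifelse(TRUE, c(1, 3), 7)
from_if <- if (TRUE) c(1, 3) else 7
```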
For some reason this is marked as a duplicate of
Why does ifelse() return single-value output?
So a work around for that question is:
a=3
yo <- ifelse(a==1, 1, list(c(1,2)))
yo[[1]]
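The workaround relies on wrapping the longer value in a list so that it counts as a single element. When the test is a single TRUE or FALSE anyway, plain if/else may be the simpler route. A sketch with made-up names:

```r
a <- 3

# ifelse() returns a list of length one here; [[1]] unwraps the vector
yo_list <- ifelse(a == 1, 1, list(c(1, 2)))
unwrapped <- yo_list[[1]]

# if/else returns the chosen branch as-is, with no wrapping needed
yo <- if (a == 1) 1 else c(1, 2)
```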
