R - Apply Family (sapply vs. vapply) [duplicate]

The documentation says
vapply is similar to sapply, but has a pre-specified type of return value, so it can be safer [...] to use.
Could you please elaborate as to why it is generally safer, maybe providing examples?
P.S.: I know the answer and I already tend to avoid sapply. I just wish there was a nice answer here on SO so I can point my coworkers to it. Please, no "read the manual" answer.

As has already been noted, vapply does two things:
Slight speed improvement
Improves consistency by providing limited return type checks.
The second point is the greater advantage, as it helps catch errors before they happen and leads to more robust code. This return value checking could be done separately by using sapply followed by stopifnot to make sure that the return values are consistent with what you expected, but vapply is a little easier (if more limited, since custom error checking code could check for values within bounds, etc.).
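The sapply-plus-stopifnot alternative mentioned above could look roughly like the sketch below. The helper first_char and the vector words are made up for illustration and are not part of the original example.

```r
# Hypothetical helper and data, for illustration only.
first_char <- function(x) substr(x, 1, 1)
words <- c("apple", "banana", "cherry")

# Option 1: sapply, then check the result by hand.
res_s <- sapply(words, first_char, USE.NAMES = FALSE)
stopifnot(is.character(res_s), length(res_s) == length(words))
# A custom check that vapply cannot express: values must be lowercase letters.
stopifnot(all(res_s %in% letters))

# Option 2: vapply checks the type and length of every return value for us.
res_v <- vapply(words, first_char, character(1), USE.NAMES = FALSE)
```

Both options give the same vector here; the difference only shows up when a return value has the wrong type or length.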
Here's an example of vapply ensuring your result is as expected. This parallels something I was just working on while PDF scraping, where findD would use a regex to match a pattern in raw text data (e.g. I'd have a list that was split by entity, and a regex to match addresses within each entity. Occasionally the PDF had been converted out-of-order and there would be two addresses for an entity, which caused badness).
> input1 <- list( letters[1:5], letters[3:12], letters[c(5,2,4,7,1)] )
> input2 <- list( letters[1:5], letters[3:12], letters[c(2,5,4,7,15,4)] )
> findD <- function(x) x[x=="d"]
> sapply(input1, findD )
[1] "d" "d" "d"
> sapply(input2, findD )
[[1]]
[1] "d"
[[2]]
[1] "d"
[[3]]
[1] "d" "d"
> vapply(input1, findD, "" )
[1] "d" "d" "d"
> vapply(input2, findD, "" )
Error in vapply(input2, findD, "") : values must be length 1,
but FUN(X[[3]]) result is length 2
Because there are two d's in the third element of input2, vapply produces an error. sapply, by contrast, silently changes the class of the output from a character vector to a list, which could break code downstream.
As I tell my students, part of becoming a programmer is changing your mindset from "errors are annoying" to "errors are my friend."
Zero length inputs
One related point is that if the input length is zero, sapply will always return an empty list, regardless of the input type. Compare:
sapply(1:5, identity)
## [1] 1 2 3 4 5
sapply(integer(), identity)
## list()
vapply(1:5, identity, integer(1))
## [1] 1 2 3 4 5
vapply(integer(), identity, integer(1))
## integer(0)
With vapply, you are guaranteed to have a particular type of output, so you don't need to write extra checks for zero length inputs.
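To see why this matters downstream, here is a small sketch. The function double_it and the empty vector ids are hypothetical names chosen for this example.

```r
# Hypothetical function and an empty input, for illustration only.
double_it <- function(i) i * 2
ids <- integer()

# vapply returns numeric(0), which sum() handles fine.
total_v <- sum(vapply(ids, double_it, numeric(1)))

# sapply returns list(), which sum() rejects; catch the error and record NA.
total_s <- tryCatch(sum(sapply(ids, double_it)), error = function(e) NA_real_)
```

With vapply the empty case needs no special handling, while the sapply version needs either a guard on the input length or an error handler.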
Benchmarks
vapply can be a bit faster because it already knows what format it should be expecting the results in.
input1.long <- rep(input1,10000)
library(microbenchmark)
m <- microbenchmark(
sapply(input1.long, findD ),
vapply(input1.long, findD, "" )
)
library(ggplot2)
library(taRifx) # autoplot.microbenchmark is moving to the microbenchmark package in the next release so this should be unnecessary soon
autoplot(m)

The extra keystrokes involved with vapply could save you time debugging confusing results later. If the function you're calling can return different data types, vapply should certainly be used.
One example that comes to mind would be sqlQuery in the RODBC package. If there's an error executing a query, this function returns a character vector with the message. So, for example, say you're trying to iterate over a vector of table names tnames and select the max value from the numeric column 'NumCol' in each table with:
sapply(tnames,
function(tname) sqlQuery(cnxn, paste("SELECT MAX(NumCol) FROM", tname))[[1]])
If all the table names are valid, this would result in a numeric vector. But if one of the table names happens to change in the database and the query fails, the results are going to be coerced into mode character. Using vapply with FUN.VALUE=numeric(1), however, will stop the error here and prevent it from popping up somewhere down the line---or worse, not at all.
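The same failure mode can be reproduced without a database. In the sketch below, fake_max is a made-up stand-in that returns a number on success and an error message on failure; it is not the RODBC API, and the table names are invented.

```r
# Hypothetical stand-in for a query function; NOT part of RODBC.
fake_max <- function(tname) {
  known <- c(sales = 10.5, stock = 3)
  if (tname %in% names(known)) known[[tname]] else paste("table not found:", tname)
}
tnames <- c("sales", "stock", "orders")  # "orders" does not exist

# sapply silently coerces everything to character.
res_s <- sapply(tnames, fake_max)

# vapply refuses the character result; capture the error message to inspect it.
res_v <- tryCatch(
  vapply(tnames, fake_max, numeric(1)),
  error = function(e) conditionMessage(e)
)
```

Here res_s is a character vector even for the tables that worked, while the vapply call stops at the third element.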

If you always want your result to be of one particular type, e.g. a logical vector, vapply makes sure this happens, but sapply does not necessarily do so.
a<-vapply(NULL, is.factor, FUN.VALUE=logical(1))
b<-sapply(NULL, is.factor)
is.logical(a)
## [1] TRUE
is.logical(b)
## [1] FALSE

Related

How to get elements of a vector of strings without the symbol [ ]

How to get elements of a vector of strings without the symbol [1]?
v <- c("a","b","c")
for (i in seq_along(v)) {
print(v[i])
}
Instead of getting "a", "b", "c" I obtained:
[1] "a"
[1] "b"
[1] "c"
But when I use as.symbol
for (i in seq_along(v)) {
print(as.symbol(v[i]))
}
I obtain:
a
b
c
without any [ ]

print is a function with a misleading name. A more accurate name would be show_value_in_interactive_console (but that’s a mouthful). Its purpose is really only to display values in the interactive R console. It is not suitable for other uses.
When you actually want to display values to a user, or save them to a file, you do not want to use print. Instead, you want to use
For displaying values to a user: message or warning (or stop)
For persisting the text representation of values to a file or otherwise exposing them to the system: writeLines, cat
All of the above are usually used in combination with format, sprintf, as.character and toString, which perform the actual conversion of the value to text.
Oh, and as.symbol is completely unrelated to the above and shouldn’t be used here. It happens to work for your purpose purely by accident.
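A minimal sketch of those alternatives, applied to the vector from the question (the variable labelled is introduced here only for illustration):

```r
v <- c("a", "b", "c")

# Write each element to standard output, one per line, without the [1] prefix.
writeLines(v)

# cat() gives finer control; here each element is followed by a newline.
for (x in v) cat(x, "\n", sep = "")

# Build the text first with sprintf(), then show it as a diagnostic message.
labelled <- sprintf("%d: %s", seq_along(v), v)
message(toString(labelled))
```

writeLines and cat write to standard output, while message writes to standard error, which is why message suits diagnostics rather than results.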

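You can print using as.symbol, and if you need the index of each element, just print i before it:
for (i in seq_along(v)) {
  print(i)
  print(as.symbol(v[i]))
}
The output will look like this (note that print(i) still shows the [1] prefix, and that indices start at 1):
[1] 1
a
[1] 2
b
[1] 3
c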

Why combine produces a different behavior from readLines() function

I am learning R and so far I am not having any trouble in catching up besides the following problem that I am hopeful someone out there will help me to understand.
If I create a character vector in the following way test1 <- c("a", "b", "c")
I get one vector of type character and I can access to each member of the vector through an indexer test1[n].
That makes sense and does what I understand it should do.
However if I do test2 <- readLines("file1.txt") where file1.txt contains one line (several random words space separated) I get one vector of class character (same as the first case) and I can't use an indexer (unless there's a way and I don't know about it yet).
Questions:
Why both are char type based but they are stored differently
How one could tell them apart without knowing how they have been created
Besides using a strsplit() is there a way to break it down like c() does at loading time from a file?
Any help to understand the insides of this language is wildly appreciated!

Why both are char type based but they are stored differently
Both are stored in exactly the same way. R has no specific type to represent a single character, and as a consequence strings are not collections of characters that you can index into.
In the first case you have simply a character vector of length 3 where each element has size 1
test1 <- c("a", "b", "c")
typeof(test1)
# [1] "character"
length(test1)
# [1] 3
nchar(test1)
# [1] 1 1 1
and in the second case a character vector of length equal to number of lines in an input file and each element has size equal to length of string:
writeLines("foobar", con="file1.txt")
test2 <- readLines("file1.txt")
typeof(test2)
# [1] "character"
length(test2)
# [1] 1
nchar(test2)
# [1] 6
Besides using a strsplit() is there a way to break it down like c() does at loading time from a file?
If you have fixed size elements you can try readBin, but generally speaking strsplit is the way to go:
f <- "file1.txt"
# read the file byte by byte, then convert each byte to a one-character string
bytes <- readBin(f, what = "raw", size = 1, n = file.info(f)$size)
sapply(bytes, rawToChar)
# [1] "f" "o" "o" "b" "a" "r" "\n"
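For the strsplit route, and for splitting on whitespace while reading, a sketch along these lines may help. The temporary file and its contents are invented for the example.

```r
# Write a one-line sample file to a temporary location (hypothetical content).
f <- tempfile(fileext = ".txt")
writeLines("several random words", con = f)

line <- readLines(f)                             # one element: the whole line
words <- strsplit(line, " ", fixed = TRUE)[[1]]  # split on single spaces
chars <- strsplit("foobar", "")[[1]]             # split into single characters

# scan() splits on whitespace while reading, so no strsplit() is needed.
tokens <- scan(f, what = character(), quiet = TRUE)
```

strsplit returns a list with one entry per input string, hence the [[1]] to pull out the vector for a single line.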

Why does my output have elements before the NULL, and how can I refactor a for loop with "list" in R to use something more elegant

I have heard that for loops are bad R style. I have a for loop that uses "list" to assemble the output, and even if the control flow of "for" is tolerable, the output is not very readable.
Here is my original code:
longlist<-c("apple","orange","red snapper")
sublist<-c("ap","er")
altgrepper1<-function(substringList, longstringList){
#go through the list of short strings
#check to see whether any of the short strings is a substring
#of any of the long strings
output2<-NULL
for (i in 1:length(substringList)){
output1<-grep(substringList[i],longstringList)
output2<-list(output2,output1)
}
output2
}
Here is the output:
[[1]]
[[1]][[1]]
NULL
[[1]][[2]]
[1] 1 3
[[2]]
[1] 3
I am very surprised that the output has anything preceding the NULL, so clearly I don't understand how list works. This output makes sense to me on a toy example, but I'm sure I would rapidly get confused if I tried to understand output from a real data set.
So perhaps the highest priority is to return a better-structured sort of output - perhaps a list, perhaps an array. But to do that, it might be better to eliminate the for loop and use lapply or some other R syntax.
I'm sorry for the newbie question, but I really don't see why my output looks this way.

As mentioned in another answer, the problem is in output2<-list(output2,output1) within the loop. Your output actually has a list of a list.
for loops are not necessarily "bad R style". I think it depends on personal preference and (for some people) on R coding standards.
That being said, there is a strong case to be made about the apply functions, namely apply, sapply, and lapply. If I were you, I would read ?apply, run all the examples, and then every time you write a for loop, ask yourself if it can be done with one of these apply functions.
Here's a nice way to look at a comparison of your objects sublist and longlist
> sapply(sublist, function(x) grep(x, longlist, value = TRUE))
$ap
[1] "apple" "red snapper"
$er
[1] "red snapper"
lapply always returns output as a list, sapply does not. But in this case, sapply makes the result easier to understand. Compare to
> lapply(sublist, function(x) grep(x, longlist, value = TRUE))
[[1]]
[1] "apple" "red snapper"
[[2]]
[1] "red snapper"
Notice that lapply does not automatically attach names to the result, making the result from sapply easier to comprehend here.
Once I learned about the apply family of functions, two things happened. One, I never used another for loop in R. And two, R programming became really exciting for me. To each their own though.

The first item of output2 is NULL. This is what you assign before the for loop.
Within the for loop, your line output2<-list(output2,output1) wraps the previous output2 and the output of grep (stored in output1) in a new list, which adds one level of nesting on every iteration instead of appending. NULL is a special type of object in R; read ?is.null for more details. Assigning it to another object is similar to, though not the same as, assigning a missing value.
Try this:
output2 <-NULL
output2 <-list(output2,2)
output2
and you will get an idea of why your output looks the way it does.
Also, please have a look at %in%.
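One way to keep the loop but avoid the nesting is to pre-allocate the list and assign by position. The function name altgrepper2 is made up for this sketch; it is not from the question.

```r
longlist <- c("apple", "orange", "red snapper")
sublist <- c("ap", "er")

# Hypothetical rewrite of the loop: one flat list entry per substring.
altgrepper2 <- function(substringList, longstringList) {
  output <- vector("list", length(substringList))  # pre-allocated, flat list
  for (i in seq_along(substringList)) {
    # [[i]] fills slot i instead of wrapping the old list in a new one
    output[[i]] <- grep(substringList[i], longstringList)
  }
  names(output) <- substringList
  output
}

result <- altgrepper2(sublist, longlist)
```

seq_along is used instead of 1:length(...) so that an empty substringList produces zero iterations rather than looping over 1 and 0.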

Unexpected behavior using -which() in R when the search term is not found

I have been using the R which function to remove rows from a data frame. I recently discovered that if the search term is NOT in the data.frame, the result is an empty character vector.
# 1: returns A-Q, S-Z (as expected)
LETTERS[-which(LETTERS == "R")]
# 2: returns "character(0)" (not what I would expect)
LETTERS[-which(LETTERS == "1")]
# 3: returns A-Z (expected)
LETTERS[which(LETTERS != "1")]
# 4: returns A-Q, S-Z (expected)
LETTERS[which(LETTERS != "R")]
Is the second example the expected behavior for -which() when the search term is not found? I have already switched my code to use the syntax in example 4, which seems safer, but I am just curious.

That is a well-known pitfall. When nothing matches the logical test, which returns integer(0); negating that still gives integer(0), so "[" returns nothing instead of everything, as you would expect. You can use:
LETTERS[ ! LETTERS == "1" ]
LETTERS[ ! LETTERS %in% "1" ]
There is another gotcha to be aware of, and it is the one that makes me choose to use which(). With logical indexing, an NA value inside "[" returns a row (filled with NAs). I generally do not want that, so I use DFRM[ which(logical), ], although this seems to bother some people who say it is not needed. I just think they are working with small datasets and infrequently encounter the annoyance of seeing tens of thousands of NA-induced useless lines of output on their console. I never use the negated which version though.
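The NA behaviour is easy to see on a small vector. The names x, keep, by_logical and by_which below are chosen only for this sketch.

```r
x <- c("a", NA, "b")

keep <- x != "a"             # FALSE NA TRUE: comparing with NA gives NA
by_logical <- x[keep]        # the NA in the index produces an NA element
by_which <- x[which(keep)]   # which() drops NA, so only real matches remain
```

On a data frame the same NA in a logical row index yields a whole row of NAs, which is the noise the answer describes.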

Because of this:
which(LETTERS == '-1')
## integer(0)
and this:
(1:2)[integer(0)]
## integer(0)
Instead of #4, use this:
LETTERS[LETTERS != "R"]

In example 2, which returns integer(0) (a zero-length integer vector) because no values are TRUE. A negated zero-length vector (-integer(0)) is still a zero-length vector. So you're essentially indexing LETTERS with an empty vector, i.e. asking for none of its elements, and you get character(0) back.
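If you do want to remove elements by position, a guard on the index length avoids the pitfall. The helper drop_matches is hypothetical and only illustrates the idea.

```r
# Negating an empty index changes nothing: it is still empty.
still_empty <- -integer(0)

# Hypothetical helper: drop matching elements, but only if there are any.
drop_matches <- function(x, value) {
  idx <- which(x == value)
  if (length(idx) > 0) x[-idx] else x
}

no_match <- drop_matches(LETTERS, "1")   # nothing to drop, all 26 letters kept
one_match <- drop_matches(LETTERS, "R")  # "R" removed
```

In most cases plain logical indexing such as LETTERS[LETTERS != "R"] is simpler; the guard is only needed when you must work with positions.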
