Read specific elements of a list with R - r

I have the following list:
v1<-c('hello', 'bye')
v2<-c(1,2,3)
v3<-c(5,6,5,5,5,5)
l<-list(v1, v2, v3)
I want to read the second element of each member of the list. The result should be:
'bye' 2 6
I did it using 'sapply' with the instruction:
sapply(1:3, function(i){l[[i]][2]})
and it works. But I would like to do it with a simpler instruction. I tried
l[[1:3]][2]
But it doesn't work. What is the simplest way to obtain the second element of each member of my list?
Thank you!

I would suggest
sapply(l,`[[`,2)
#[1] "bye" "2" "6"
EDIT:
Note that the result of this operation is a vector, so all elements had to be coerced to the same type (in this case, character). If you'd like to keep the types of the result components, you should use
lapply(l,`[[`,2)
which returns a list, and the elements of a list in R can be of different types. (Thanks to Richard for bringing attention to this aspect!)
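To make the difference concrete, here is a small sketch using the list from the question. The names as_vector and as_list are only illustrative and do not appear in the original answer.

```r
l <- list(c("hello", "bye"), c(1, 2, 3), c(5, 6, 5, 5, 5, 5))

# sapply simplifies to one atomic vector, so the numbers are coerced to strings
as_vector <- sapply(l, `[[`, 2)
class(as_vector)

# lapply returns a list, so each element keeps its original type
as_list <- lapply(l, `[[`, 2)
vapply(as_list, class, character(1))
```

If downstream code needs the numbers as numbers, the list form is the safer starting point.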

If you want to vectorize this (avoid *apply loops), you could use stringi::stri_list2matrix (though you will lose your classes, because the result is a character matrix)
library(stringi)
stri_list2matrix(l, byrow = TRUE)[, 2]
## [1] "bye" "2" "6"


R - Apply Family (sapply vs. vapply) [duplicate]

The documentation says
vapply is similar to sapply, but has a pre-specified type of return value, so it can be safer [...] to use.
Could you please elaborate as to why it is generally safer, maybe providing examples?
P.S.: I know the answer and I already tend to avoid sapply. I just wish there was a nice answer here on SO so I can point my coworkers to it. Please, no "read the manual" answer.
As has already been noted, vapply does two things:
It offers a slight speed improvement.
It improves consistency by providing limited return type checks.
The second point is the greater advantage, as it helps catch errors before they happen and leads to more robust code. This return value checking could be done separately by using sapply followed by stopifnot to make sure that the return values are consistent with what you expected, but vapply is a little easier (if more limited, since custom error checking code could check for values within bounds, etc.).
Here's an example of vapply ensuring your result is as expected. This parallels something I was just working on while PDF scraping, where findD would use a regex to match a pattern in raw text data (e.g. I'd have a list that was split by entity, and a regex to match addresses within each entity. Occasionally the PDF had been converted out-of-order and there would be two addresses for an entity, which caused badness).
> input1 <- list( letters[1:5], letters[3:12], letters[c(5,2,4,7,1)] )
> input2 <- list( letters[1:5], letters[3:12], letters[c(2,5,4,7,15,4)] )
> findD <- function(x) x[x=="d"]
> sapply(input1, findD )
[1] "d" "d" "d"
> sapply(input2, findD )
[[1]]
[1] "d"
[[2]]
[1] "d"
[[3]]
[1] "d" "d"
> vapply(input1, findD, "" )
[1] "d" "d" "d"
> vapply(input2, findD, "" )
Error in vapply(input2, findD, "") : values must be length 1,
but FUN(X[[3]]) result is length 2
Because there are two d's in the third element of input2, vapply produces an error. But sapply changes the class of the output from a character vector to a list, which could break code downstream.
As I tell my students, part of becoming a programmer is changing your mindset from "errors are annoying" to "errors are my friend."
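For comparison, the sapply-plus-stopifnot alternative mentioned earlier could be sketched as follows. The wrapper checked_sapply is a hypothetical helper written for this illustration, not a base R function, and the checks it performs are just one possible choice.

```r
input1 <- list(letters[1:5], letters[3:12], letters[c(5, 2, 4, 7, 1)])
input2 <- list(letters[1:5], letters[3:12], letters[c(2, 5, 4, 7, 15, 4)])
findD <- function(x) x[x == "d"]

# Hypothetical helper: run sapply, then verify the shape of the result by hand
checked_sapply <- function(X, FUN) {
  res <- sapply(X, FUN)
  # Expect a character vector with exactly one value per input element
  stopifnot(is.character(res), length(res) == length(X))
  res
}

ok_result <- checked_sapply(input1, findD)

# input2 makes sapply return a list, so the manual check raises an error
bad_result <- tryCatch(checked_sapply(input2, findD), error = function(e) e)
```

This gives roughly the same protection as vapply, at the cost of writing and maintaining the checks yourself.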
Zero length inputs
One related point is that if the input length is zero, sapply will always return an empty list, regardless of the input type. Compare:
sapply(1:5, identity)
## [1] 1 2 3 4 5
sapply(integer(), identity)
## list()
vapply(1:5, identity, integer(1))
## [1] 1 2 3 4 5
vapply(integer(), identity, integer(1))
## integer(0)
With vapply, you are guaranteed to have a particular type of output, so you don't need to write extra checks for zero length inputs.
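As a rough illustration of why this matters, consider a function that has to cope with empty input. The helper total_chars is hypothetical and exists only for this sketch.

```r
# Hypothetical helper: total number of characters across a character vector
total_chars <- function(x) sum(vapply(x, nchar, integer(1)))

total_chars(c("ab", "cde"))
total_chars(character(0))   # vapply yields an empty integer vector, so sum() still works

# The sapply equivalent produces a list for empty input
typeof(sapply(character(0), nchar))
```

Because vapply always returns an integer vector here, no special case for empty input is needed.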
Benchmarks
vapply can be a bit faster because it already knows what format it should be expecting the results in.
input1.long <- rep(input1,10000)
library(microbenchmark)
m <- microbenchmark(
sapply(input1.long, findD ),
vapply(input1.long, findD, "" )
)
library(ggplot2)
library(taRifx) # autoplot.microbenchmark is moving to the microbenchmark package in the next release so this should be unnecessary soon
autoplot(m)
The extra key strokes involved with vapply could save you time debugging confusing results later. If the function you're calling can return different datatypes, vapply should certainly be used.
One example that comes to mind would be sqlQuery in the RODBC package. If there's an error executing a query, this function returns a character vector with the message. So, for example, say you're trying to iterate over a vector of table names tnames and select the max value from the numeric column 'NumCol' in each table with:
sapply(tnames,
function(tname) sqlQuery(cnxn, paste("SELECT MAX(NumCol) FROM", tname))[[1]])
If all the table names are valid, this would result in a numeric vector. But if one of the table names happens to change in the database and the query fails, the results are going to be coerced into mode character. Using vapply with FUN.VALUE=numeric(1), however, will stop the error here and prevent it from popping up somewhere down the line---or worse, not at all.
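The same failure mode can be reproduced without a database. In the sketch below, fake_query is a made-up stand-in for sqlQuery that returns a number on success and a character message on failure; the table names are invented as well.

```r
# Hypothetical stand-in for a query function: numeric on success,
# character error message on failure
fake_query <- function(tname) {
  if (tname == "missing_table") "ERROR: table not found" else 42
}

tnames <- c("t1", "missing_table")

# sapply silently coerces everything to character
coerced <- sapply(tnames, fake_query)

# vapply refuses the character result and raises an error instead
strict <- tryCatch(vapply(tnames, fake_query, numeric(1)), error = function(e) e)
```

With sapply the problem only shows up later, when something tries to do arithmetic on the result. With vapply it surfaces at the call itself.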
If you always want your result to be of a particular type, e.g. a logical vector, vapply makes sure this happens, but sapply does not necessarily do so.
a<-vapply(NULL, is.factor, FUN.VALUE=logical(1))
b<-sapply(NULL, is.factor)
is.logical(a)
## [1] TRUE
is.logical(b)
## [1] FALSE

"Named tuples" in r

If you load the pracma package into the R console and type
gammainc(2,2)
you get
lowinc uppinc reginc
0.5939942 0.4060058 0.5939942
This looks like some kind of a named tuple or something.
But I can't work out how to extract the number below lowinc, namely 0.5939942. The code (gammainc(2,2))[1] doesn't work; we just get
lowinc
0.5939942
which isn't a number.
How is this done?
As can be checked with str(gammainc(2,2)[1]) and class(gammainc(2,2)[1]), the output mentioned in the OP is in fact a number. It is just a named number. The names used as attributes of the vector are supposed to make the output easier to understand.
The function unname() can be used to obtain the numerical vector without names:
unname(gammainc(2,2))
#[1] 0.5939942 0.4060058 0.5939942
To select the first entry, one can use:
unname(gammainc(2,2))[1]
#[1] 0.5939942
In this specific case, a clearer version of the same might be:
unname(gammainc(2,2)["lowinc"])
Double brackets will strip the names:
gammainc(2,2)[[1]]
gammainc(2,2)[["lowinc"]]
I don't claim it to be intuitive, or obvious, but it is mentioned in the manual:
For vectors and matrices the [[ forms are rarely used, although they
have some slight semantic differences from the [ form (e.g. it drops
any names or dimnames attribute, and that partial matching is used for
character indices).
The partial matching can be employed like this
gammainc(2, 2)[["low", exact=FALSE]]
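The same behaviour can be tried without installing pracma. The sketch below builds a named numeric vector by hand, with the values copied from the output shown in the question; the variable name x is arbitrary.

```r
# A hand-built named vector with the values from the question
x <- c(lowinc = 0.5939942, uppinc = 0.4060058, reginc = 0.5939942)

is.numeric(x[1])       # single brackets keep the name, but it is still a number
names(x[1])

x[["lowinc"]]          # double brackets drop the name
names(x[["lowinc"]])

unname(x["lowinc"])    # unname() drops it explicitly
```

Either form can be used directly in arithmetic; the name only affects how the value is printed.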
In R, vectors may have a names attribute. This is an example:
vector <- c(1, 2, 3)
names(vector) <- c("first", "second", "third")
If you display vector, you get output in the same form as above:
> vector
first second third
1 2 3
To check what type of output a function returns, you can use:
class(your_function())
I hope this helps.

Why does my output have elements before the NULL, and how can I refactor a for loop with "list" in R to use something more elegant

I have heard that for loops are bad R style. I have a for loop that uses "list" to assemble the output, and even if the control flow of "for" is tolerable, the output is not very readable.
Here is my original code:
longlist<-c("apple","orange","red snapper")
sublist<-c("ap","er")
altgrepper1<-function(substringList, longstringList){
#go through the list of short strings
#check to see whether any of the short strings is a substring
#of any of the long strings
output2<-NULL
for (i in 1:length(substringList)){
output1<-grep(substringList[i],longstringList)
output2<-list(output2,output1)
}
output2
}
Here is the output:
[[1]]
[[1]][[1]]
NULL
[[1]][[2]]
[1] 1 3
[[2]]
[1] 3
I am very surprised that the output has anything preceding the NULL, so clearly I don't understand how list works. This output makes sense to me on a toy example, but I'm sure I would rapidly get confused if I tried to understand output from a real data set.
So perhaps the highest priority is to return a better-structured sort of output - perhaps a list, perhaps an array. But to do that, it might be better to eliminate the for loop and use lapply or some other R syntax.
I'm sorry for the newbie question, but I really don't see why my output looks this way.
As mentioned in another answer, the problem is in output2<-list(output2,output1) within the loop. Your output is actually a list nested inside a list.
for loops are not necessarily "bad R style". I think it depends on personal preference and on the R coding standards of some people.
That being said, there is a strong case to be made about the apply functions, namely apply, sapply, and lapply. If I were you, I would read ?apply, run all the examples, and then every time you write a for loop, ask yourself if it can be done with one of these apply functions.
Here's a nice way to look at a comparison of your objects sublist and longlist
> sapply(sublist, function(x) grep(x, longlist, value = TRUE))
$ap
[1] "apple" "red snapper"
$er
[1] "red snapper"
lapply always returns output as a list, sapply does not. But in this case, sapply makes the result easier to understand. Compare to
> lapply(sublist, function(x) grep(x, longlist, value = TRUE))
[[1]]
[1] "apple" "red snapper"
[[2]]
[1] "red snapper"
Notice that lapply does not automatically attach names to the result, making the result from sapply easier to comprehend here.
Once I learned about the apply family of functions, two things happened. One, I never used another for loop in R. And two, R programming became really exciting for me. To each their own though.
The first item of output2 is NULL. This is what you assigned before the for loop.
Within the for loop, the line output2<-list(output2,output1) does not append the output of grep (stored in output1) to output2. It builds a new two-element list whose first element is the old output2, so the result gets nested one level deeper on every iteration. NULL is a special type of object in R. Read ?is.null for more details. Assigning it to another object is similar to, though not the same as, assigning a missing value.
Try this:
output2 <-NULL
output2 <-list(output2,2)
output2
and you will get an idea of why your output looks the way it does.
Also, please have a look at %in%.
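If you prefer to keep the for loop, one possible fix is to preallocate a list and fill it by position instead of wrapping the old result in list() each time. This is only a sketch; the name altgrepper2 is made up for the illustration.

```r
longlist <- c("apple", "orange", "red snapper")
sublist <- c("ap", "er")

altgrepper2 <- function(substringList, longstringList) {
  # Preallocate one slot per substring and name the slots for readability
  output <- vector("list", length(substringList))
  names(output) <- substringList
  for (i in seq_along(substringList)) {
    # Assign into slot i rather than nesting the previous result
    output[[i]] <- grep(substringList[i], longstringList)
  }
  output
}

result <- altgrepper2(sublist, longlist)
```

The result is a flat list with one element per substring, each holding the indices of the long strings that matched.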

Splitting strings in R and extracting information from lists

I have the following row names in my data:
column_01.1
column_01.2
column_01.3
column_02.1
column_02.2
I can split these rownames with the following command:
strsplit(rownames(my_data),split= "\\.")
and get the list:
[[1]]
[1] "column_01" "1"
[[2]]
[1] "column_01" "2"
[[3]]
[1] "column_01" "3"
...
But I want only the first part, as characters, and want to discard the second part completely, like this:
column_01
column_01
column_01
column_02
column_02
I have run out of tricks to extract only this part of the information. I've tried some options with unlist() and as.data.frame(), but no luck. Or is there an easier way to split the strings? I do not want to use as.character(substring(rownames(my_data),1,9)) as the location of the "." can change (while it would work for this example).
You can map [ to get the first elements:
sapply(strsplit(rownames(my_data),split= "\\."),'[',1)
...or (better) use regular expressions:
gsub('\\..*$','',rownames(my_data))
(translation: find all matches of (dot-character, something, end-of-string) and replace with empty string)
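Both approaches can be checked on a plain character vector. In this sketch row_names is a made-up stand-in for rownames(my_data), and the last entry is added to show what happens when a name contains more than one dot.

```r
# Hypothetical stand-in for rownames(my_data)
row_names <- c("column_01.1", "column_01.2", "column_02.1", "col.3.extra")

# Regex approach: drop everything from the first dot to the end of the string
by_regex <- sub("\\..*$", "", row_names)

# Split approach: split on a literal dot and keep the first piece
by_split <- vapply(strsplit(row_names, ".", fixed = TRUE), `[`, character(1), 1)
```

Both return the part before the first dot, wherever that dot occurs. Here vapply is used in place of sapply so that the result is guaranteed to be a character vector.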
Since I like the stringr package, I thought I'd throw this out there:
library(stringr)
str_replace(rownames(my_data), "(^column_.+)\\.\\d+", "\\1")
(I'm not great with regex, so the ^ might be better outside the parentheses)

