I am learning R, and so far I have had no trouble keeping up apart from the following problem, which I hope someone can help me understand.
If I create a character vector in the following way: test1 <- c("a", "b", "c")
I get one vector of type character, and I can access each element of the vector by index with test1[n].
That makes sense and does what I understand it should do.
However, if I do test2 <- readLines("file1.txt"), where file1.txt contains one line (several random words separated by spaces), I get one vector of class character (same as the first case), but I can't use an indexer to get at the individual pieces (unless there's a way I don't know about yet).
Questions:
Why are both of type character but stored differently?
How can one tell them apart without knowing how they were created?
Besides using strsplit(), is there a way to break the line apart at load time from a file, the way c() does?
Any help to understand the insides of this language is wildly appreciated!
Why are both of type character but stored differently?
Both are stored in exactly the same way. R has no dedicated type for a single character, so a string is not a collection of characters that can be indexed.
In the first case you simply have a character vector of length 3 in which each element is a string of length 1:
test1 <- c("a", "b", "c")
typeof(test1)
# [1] "character"
length(test1)
# [1] 3
nchar(test1)
# [1] 1 1 1
and in the second case you have a character vector whose length equals the number of lines in the input file, with each element's size equal to the length of that line's string:
writeLines("foobar", con="file1.txt")
test2 <- readLines("file1.txt")
typeof(test2)
# [1] "character"
length(test2)
# [1] 1
nchar(test2)
# [1] 6
Besides using strsplit(), is there a way to break the line apart at load time from a file, the way c() does?
If you have fixed-size elements you can try readBin, but generally speaking strsplit is the way to go:
f <- "file1.txt"
library(magrittr)  # provides the %>% pipe used below
readBin(f, what = 'raw', size = 1, n = file.info(f)$size) %>% sapply(rawToChar)
# [1] "f" "o" "o" "b" "a" "r" "\n"
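For the case described in the question (one line of space-separated words rather than single characters), base R's scan() may be a closer fit, since it splits on whitespace while reading. The following is only a minimal sketch: the temporary file and its contents are made up here to stand in for file1.txt.

```r
# Hypothetical sample file: one line of space-separated words.
path <- tempfile(fileext = ".txt")
writeLines("several random words", con = path)

# readLines() returns one element per line, so the whole line is element 1.
lines <- readLines(path)
lines[1]

# scan() with what = character() splits on whitespace at load time.
words <- scan(path, what = character(), quiet = TRUE)
words[2]

# strsplit() produces the same tokens from the already-loaded line.
same_words <- strsplit(lines, " ", fixed = TRUE)[[1]]
identical(words, same_words)
```

This also touches on the "how to tell them apart" question: length() and nchar() are enough to distinguish a vector of many short strings from a vector holding one long string.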
Related
I have a column inside a data.frame, and I need to get only the last two characters of each value in it. How do I do this?
Note: I tried to do it with str_sub, but with it I can only define the start and end positions, and my strings vary in length. My attempt below does not solve the problem:
base$estado <- str_sub(psd_base$itbc_name, start = 2)
You can use the functions substr() and nchar() to select the last character of each string. Both are vectorized, so you can write:
names = c("Alpha","Bip","Charlemagne","Haggs","O")
substr(names,nchar(names),nchar(names))
which gives the output:
[1] "a" "p" "e" "s" "O"
Since I do not have a reproducible example of your data, this example has to suffice. I think you get the idea.
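Since the question asks for the last two characters, the same idea can be stretched by moving the start position back by one. This is a sketch on the example vector above, not on the asker's data; the variable names x and last_two are placeholders.

```r
x <- c("Alpha", "Bip", "Charlemagne", "Haggs", "O")

# Start one position before the end; substr() treats a start below 1 as 1,
# so one-character strings are returned unchanged.
last_two <- substr(x, nchar(x) - 1, nchar(x))
last_two
# [1] "ha" "ip" "ne" "gs" "O"
```

With stringr loaded, str_sub() also accepts negative positions counted from the end of the string, so str_sub(x, start = -2) should give the same result without calling nchar().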
How to get elements of a vector of strings without the symbol [1]?
v <- c("a","b","c")
for (i in seq_along(v)) {
print(v[i])
}
Instead of getting "a", "b", "c" I obtained:
[1] "a"
[1] "b"
[1] "c"
But when I use as.symbol
for (i in seq_along(v)) {
print(as.symbol(v[i]))
}
I obtain:
a
b
c
without any [ ]
print is a function with a misleading name. A more accurate name would be show_value_in_interactive_console (but that’s a handful). Its purpose is really only for displaying values in the interactive R console. It is not suitable for other use.
When you actually want to display values to a user, or save them to a file, you do not want to use print. Instead, you want to use:
For displaying values to a user: message or warning (or stop).
For persisting the text representation of values to a file or otherwise exposing them to the system: writeLines or cat.
All of the above are usually used in combination with format, sprintf, as.character and toString, which perform the actual conversion of the value to text.
Oh, and as.symbol is completely unrelated to the above and shouldn’t be used here. It happens to work for your purpose purely by accident.
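As a rough illustration of those alternatives applied to the vector from the question (a sketch only; the variable name joined is made up here):

```r
v <- c("a", "b", "c")

# writeLines() writes each element on its own line, without quotes or [1].
writeLines(v)

# cat() does the same but leaves the separator up to you.
cat(v, sep = "\n")

# sprintf() builds the text; cat() only emits it.
cat(sprintf("%d: %s\n", seq_along(v), v), sep = "")

# toString() collapses the vector into one comma-separated string,
# and message() shows it to the user on stderr.
joined <- toString(v)
message("Values: ", joined)
```

Both writeLines and cat also accept a file or connection argument, which is how the same text can be saved rather than displayed.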
You can print using as.symbol, and if you also need the index of each element, just print i first:
for (i in seq_along(v)) {
print(i)
print(as.symbol(v[i]))
}
The output will look like this (R indices start at 1, and print(i) still shows the [1] prefix):
[1] 1
a
[1] 2
b
[1] 3
c
The documentation says
vapply is similar to sapply, but has a pre-specified type of return value, so it can be safer [...] to use.
Could you please elaborate as to why it is generally safer, maybe providing examples?
P.S.: I know the answer and I already tend to avoid sapply. I just wish there was a nice answer here on SO so I can point my coworkers to it. Please, no "read the manual" answer.
As has already been noted, vapply does two things:
It offers a slight speed improvement.
It improves consistency by providing limited return-type checks.
The second point is the greater advantage, as it helps catch errors before they happen and leads to more robust code. This return value checking could be done separately by using sapply followed by stopifnot to make sure that the return values are consistent with what you expected, but vapply is a little easier (if more limited, since custom error checking code could check for values within bounds, etc.).
Here's an example of vapply ensuring your result is as expected. This parallels something I was just working on while PDF scraping, where findD would use a regex to match a pattern in raw text data (e.g. I'd have a list that was split by entity, and a regex to match addresses within each entity. Occasionally the PDF had been converted out-of-order and there would be two addresses for an entity, which caused badness).
> input1 <- list( letters[1:5], letters[3:12], letters[c(5,2,4,7,1)] )
> input2 <- list( letters[1:5], letters[3:12], letters[c(2,5,4,7,15,4)] )
> findD <- function(x) x[x=="d"]
> sapply(input1, findD )
[1] "d" "d" "d"
> sapply(input2, findD )
[[1]]
[1] "d"
[[2]]
[1] "d"
[[3]]
[1] "d" "d"
> vapply(input1, findD, "" )
[1] "d" "d" "d"
> vapply(input2, findD, "" )
Error in vapply(input2, findD, "") : values must be length 1,
but FUN(X[[3]]) result is length 2
Because there are two d's in the third element of input2, vapply produces an error. But sapply changes the class of the output from a character vector to a list, which could break code downstream.
As I tell my students, part of becoming a programmer is changing your mindset from "errors are annoying" to "errors are my friend."
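The manual alternative mentioned earlier, sapply followed by stopifnot, might look roughly like this. It is a sketch that reuses findD and input1 from the example; the last check is an invented illustration of a custom condition that vapply cannot express.

```r
findD <- function(x) x[x == "d"]
input1 <- list(letters[1:5], letters[3:12], letters[c(5, 2, 4, 7, 1)])

res <- sapply(input1, findD)

# Manual equivalent of vapply's check: a character vector with exactly
# one value per input element. This fails if sapply fell back to a list.
stopifnot(is.character(res), length(res) == length(input1))

# A custom check on the values themselves (illustrative only).
stopifnot(all(res %in% letters))
```

With input2 in place of input1, the first stopifnot would fail because sapply returns a list, which is the same situation vapply reports on its own.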
Zero length inputs
One related point is that if the input length is zero, sapply will always return an empty list, regardless of the input type. Compare:
sapply(1:5, identity)
## [1] 1 2 3 4 5
sapply(integer(), identity)
## list()
vapply(1:5, identity, integer(1))
## [1] 1 2 3 4 5
vapply(integer(), identity, integer(1))
## integer(0)
With vapply, you are guaranteed to have a particular type of output, so you don't need to write extra checks for zero length inputs.
Benchmarks
vapply can be a bit faster because it already knows what format it should be expecting the results in.
input1.long <- rep(input1,10000)
library(microbenchmark)
m <- microbenchmark(
sapply(input1.long, findD ),
vapply(input1.long, findD, "" )
)
library(ggplot2)
library(taRifx) # only needed with old microbenchmark releases; current versions of microbenchmark provide autoplot.microbenchmark themselves
autoplot(m)
The extra keystrokes involved with vapply could save you time debugging confusing results later. If the function you're calling can return different data types, vapply should certainly be used.
One example that comes to mind would be sqlQuery in the RODBC package. If there's an error executing a query, this function returns a character vector with the message. So, for example, say you're trying to iterate over a vector of table names tnames and select the max value from the numeric column 'NumCol' in each table with:
sapply(tnames,
function(tname) sqlQuery(cnxn, paste("SELECT MAX(NumCol) FROM", tname))[[1]])
If all the table names are valid, this would result in a numeric vector. But if one of the table names happens to change in the database and the query fails, the results are going to be coerced into mode character. Using vapply with FUN.VALUE=numeric(1), however, will stop the error here and prevent it from popping up somewhere down the line---or worse, not at all.
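The same failure mode can be reproduced without a database. In the sketch below, fake_query is a made-up stand-in for sqlQuery (it is not part of RODBC) that returns a number on success and an error message on failure; the table names are invented too.

```r
# Hypothetical stand-in for sqlQuery(): numeric on success, character on failure.
fake_query <- function(tname) {
  if (tname == "missing_table") "ERROR: table not found" else 42
}

tnames <- c("t1", "missing_table")

# sapply() silently coerces everything to character.
res_sapply <- sapply(tnames, fake_query)
typeof(res_sapply)

# vapply() refuses the character result; the error is caught here only
# so that the script keeps running.
res_vapply <- tryCatch(
  vapply(tnames, fake_query, FUN.VALUE = numeric(1)),
  error = function(e) NULL
)
is.null(res_vapply)
```

In real code you would usually let the vapply error propagate, or handle it where the query is issued, rather than swallow it as this sketch does.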
If you always want your result to be of a particular type, e.g. a logical vector, vapply makes sure this happens, but sapply does not necessarily do so:
a <- vapply(NULL, is.factor, FUN.VALUE = logical(1))
b <- sapply(NULL, is.factor)
is.logical(a)
# [1] TRUE
is.logical(b)
# [1] FALSE
What is an efficient and simple way in R to do the following?
read in two-column data from a file
use this information to build some kind of translation dictionary, like a Python dict
apply the translation to the content of a vector in order to obtain the translated vector, possibly for several vectors but using the same correspondence information
I thought that the hash package would help me do that, but I'm not sure I am performing step 3 correctly.
Say my initial vector is my_vect and my hash is my_dict.
I tried the following:
values(my_dict, keys=my_vect)
The following observations make me doubt that I'm doing it properly:
The operation seems slow (more than one second on a powerful desktop computer with a vector of 582 entries and a hash of 46665 entries).
The result doesn't look homogeneous with my_vect: while my_vect appears "indexed by numbers" (integers between square brackets appear beside the values when the data is displayed in the interactive console), the result of calling values as above still looks somewhat like a dictionary, with each translated value displayed beneath its original value (i.e. the hash key). I just want the values.
Edit:
If I understand correctly, R has some way of using "names" instead of numerical indices for vectors, and what I obtain using the values function is such a vector with names. It seems to work for what I wanted to do, although I imagine it takes more memory than necessary.
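For reference, the three steps can also be sketched in base R with a named character vector and no extra package. The file contents and the column names key and value below are invented for the example.

```r
# Step 1: read two-column data (a hypothetical tab-separated file).
path <- tempfile(fileext = ".tsv")
writeLines(c("a\tA", "b\tB", "c\tC", "d\tD"), con = path)
pairs <- read.table(path, sep = "\t", col.names = c("key", "value"),
                    colClasses = "character")

# Step 2: a named vector acts as the dictionary (names are the keys).
my_dict <- setNames(pairs$value, pairs$key)

# Step 3: index by name, then drop the names to keep only the values.
my_vect <- c("b", "c", "c")
translated <- unname(my_dict[my_vect])

# match() gives the same result without building a named vector.
translated_match <- pairs$value[match(my_vect, pairs$key)]
```

Keys that are missing from the dictionary come back as NA in both variants, so it may be worth checking anyNA(translated) afterwards.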
I tried the hash and hashmap libraries, and the second seemed more efficient.
A small usage example:
> library(hashmap)
> keys = c("a", "b", "c", "d")
> values = c("A", "B", "C", "D")
> my_dict <- hashmap(keys, values)
> my_vect <- c("b", "c", "c")
> translated <- my_dict$find(my_vect)
> translated
[1] "B" "C" "C"
To build the dictionary from a table obtained using read.table, the option stringsAsFactors = FALSE of read.table has to be used (it is the default from R 4.0.0 onward), otherwise weird things happen (see discussion in the comments of https://stackoverflow.com/a/38838271/1878788).
Did you try the str_replace_all function from the stringr package?
Let's say you have a dictionary data frame dict with columns original and replacement. The following code replaces all instances of original with replacement in the vector.
library(stringr)
translations <- setNames(dict$replacement, dict$original)
new_vect <- str_replace_all(vect, fixed(translations))
I'm not sure if it implements hashing, but the underlying expression is in C code from the stringi package, so it should be fast.
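The only case where that won't work as is, is when some of the strings in original contain other strings in original. In that case you'll need to add the regular-expression start-of-string (^) and end-of-string ($) anchors to the original strings you want to replace, and pass the patterns without fixed() so that the anchors are interpreted as regex rather than literal characters.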
translations <- setNames(dict$replacement, paste0("^", dict$original, "$"))