Why does sapply() return a list with attributes when used on characters?

There is a strange behaviour of sapply() when used on a vector of characters:
y <- c("Hello", "bob", "daN")
z <- sapply(y, function(x) {toupper(x)})
z
# Hello bob daN
# "HELLO" "BOB" "DAN"
str(z)
# Named chr [1:3] "HELLO" "BOB" "DAN"
# - attr(*, "names")= chr [1:3] "Hello" "bob" "daN"
Why does sapply() return a vector with the old values as attributes? I don't want them, I don't need them, and I haven't seen this behaviour with, e.g., numeric vectors.

By default (USE.NAMES = TRUE), sapply() uses the elements of an unnamed character vector as the names of the result.
Pass USE.NAMES = FALSE in the call to get the result without the names:
sapply(y, toupper, USE.NAMES = FALSE)
# [1] "HELLO" "BOB" "DAN"
This is explained in help(sapply):
USE.NAMES - logical; if TRUE and if X is character, use X as names for the result unless it had names already. Since this argument follows ... its name cannot be abbreviated.
Note that when you are applying a single function, there is no need to wrap it in an anonymous function (doing so is also slightly less efficient); the call above passes toupper directly.
Also note that sapply() is not necessary here, as toupper() is vectorized:
toupper(y)
# [1] "HELLO" "BOB" "DAN"
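To see the difference the question describes, a small sketch (the squaring function is only an illustrative stand-in) compares character and numeric input, and shows unname() as another way to drop names after the fact:

```r
y <- c("Hello", "bob", "daN")

# Character input without names: the input values become the names.
with_names <- sapply(y, toupper)
names(with_names)

# Numeric input: there is nothing for USE.NAMES to use, so no names are added.
squares <- sapply(1:3, function(i) i^2)
names(squares)

# unname() strips the names from an existing result.
plain <- unname(with_names)
```

USE.NAMES = FALSE avoids creating the names in the first place; unname() is handy when the named result already exists.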

Related

Why do identical dataframes become different when changing rownames to the same

I've come across a strange behavior when playing with some dataframes: when I create two identical dataframes a,b, then swap their rownames around, they don't come out as identical:
rm(list=ls())
a <- data.frame(a=c(1,2,3),b=c(2,3,4))
b <- a
identical(a,b)
#TRUE
identical(rownames(a),rownames(b))
#TRUE
rownames(b) <- rownames(a)
identical(a,b)
#FALSE
Can anyone reproduce/explain why?
This is admittedly a bit confusing. Starting with ?data.frame we see that:
If row.names was supplied as NULL or no suitable component was found
the row names are the integer sequence starting at one (and such row
names are considered to be ‘automatic’, and not preserved by
as.matrix).
So initially a and b each have an attribute called row.names that is an integer vector:
> str(attributes(a))
List of 3
$ names : chr [1:2] "a" "b"
$ row.names: int [1:3] 1 2 3
$ class : chr "data.frame"
But rownames() returns a character vector (as does dimnames(), which it calls under the hood and which actually returns a list of character vectors). So after reassigning the row names you end up with:
> str(attributes(b))
List of 3
$ names : chr [1:2] "a" "b"
$ row.names: chr [1:3] "1" "2" "3"
$ class : chr "data.frame"
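A short sketch of the consequence: identical() compares attributes as well as values, so the integer and character row.names no longer match. Resetting the row names with NULL (one possible fix, not necessarily the only one) restores the automatic integer form:

```r
a <- data.frame(a = c(1, 2, 3), b = c(2, 3, 4))
b <- a

# Round-tripping through rownames() stores the row names as character.
rownames(b) <- rownames(a)
same_after_assign <- identical(a, b)

# The stored attribute types differ even though rownames() prints the same.
type_a <- typeof(attr(a, "row.names"))
type_b <- typeof(attr(b, "row.names"))

# Assigning NULL resets b to automatic row names.
rownames(b) <- NULL
same_after_reset <- identical(a, b)
```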

using paste with a list

I'm trying to understand the behavior of strsplit and paste, which are inverse functions. However, when I strsplit a vector, a list is returned, like so:
> strsplit(c("on,e","tw,o","thre,e","fou,r"),",")
[[1]]
[1] "on" "e"
[[2]]
[1] "tw" "o"
[[3]]
[1] "thre" "e"
[[4]]
[1] "fou" "r"
I tried using lapply to cat the elements of the list back together, but it doesn't work:
> lapply(strsplit(c("on,e","tw,o","thre,e","fou,r"),","),cat)
on etw othre efou r[[1]]
NULL
[[2]]
NULL
[[3]]
NULL
[[4]]
NULL
The same formula with paste instead of cat actually does nothing at all! Why am I getting these results? and how can I get the result I want, which is the original vector back again?
(Obviously, in my actual code I'm trying to do more with the strsplit and cat than just return the original vector, but I think a solution to this problem will work for mine. Thanks!)
While cat does concatenate its arguments and print them to the console, it does not work the way paste does: it returns NULL invisibly rather than a character vector, which is why lapply gives you a list of NULLs. Its return value is explained in help("cat").
The collapse argument in paste is effectively the opposite of the split argument in strsplit, and you can use sapply to return the simplified pasted vector.
x <- c("on,e","tw,o","thre,e","fou,r")
( y <- sapply(strsplit(x, ","), paste, collapse = ",") )
# [1] "on,e" "tw,o" "thre,e" "fou,r"
( z <- vapply(strsplit(x, ","), paste, character(1L), collapse = ",") )
# [1] "on,e" "tw,o" "thre,e" "fou,r"
identical(x, y)
# [1] TRUE
identical(x, z)
# [1] TRUE
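Note that for cases like this, vapply is generally safer and can be slightly faster than sapply, because the type and length of each result are declared up front. Adding fixed = TRUE in strsplit should improve efficiency as well, since the separator is then matched literally instead of as a regular expression.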
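As for why swapping paste in for cat appears to do nothing: without collapse, paste() works element-wise and returns a vector as long as its input, so each list element comes back unchanged. A minimal illustration:

```r
parts <- strsplit("on,e", ",", fixed = TRUE)[[1]]

# Without collapse, paste() returns one string per input element.
unchanged <- paste(parts)

# With collapse, the elements are joined into a single string.
joined <- paste(parts, collapse = ",")
```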

splitting a character vector at specified intervals in R

I have some sentences in specific format and I need to split them at regular intervals.
The sentences look like this
"abxyzpqrst34245"
"mndeflmnop6346781"
I want to split each of these sentences after the following characters: c(2,5,10), so that the output will be:
[1] c("ab", "xyz", "pqrst", "34245")
[2] c("mn", "def", "lmnop", "6346781")
NOTE: The numeric part after the 3rd split is of variable length, whereas the previous pieces are of fixed length.
I tried to use cut, but it only works with integer vectors.
I looked at split, but I'm not sure if it works without factors.
So, I finally went with substr to divide each of the sentences separately like this:
substr("abxyzpqrst34245", 1,2)
[1] "ab"
substr("abxyzpqrst34245", 3,5)
[1] "xyz"
substr("abxyzpqrst34245", 6,10)
[1] "pqrst"
substr("abxyzpqrst34245", 11,10000)
[1] "34245"
I'm using this long process to split these strings. Is there any easier way to achieve this splitting?
You're looking for (the often overlooked) substring:
x <- "abxyzpqrst34245"
substring(x,c(1,3,6,11),c(2,5,10,nchar(x)))
[1] "ab" "xyz" "pqrst" "34245"
which is handy because it is fully vectorized. If you want to do this over multiple strings in turn, you might do something like this:
x <- c("abxyzpqrst34245","mndeflmnop6346781")
lapply(x, function(y) substring(y, first = c(1,3,6,11), last = c(2,5,10,nchar(y))))
[[1]]
[1] "ab" "xyz" "pqrst" "34245"
[[2]]
[1] "mn" "def" "lmnop" "6346781"
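If the cut points vary, the start and end positions can be derived from them rather than typed out. A minimal sketch, where split_at() and its ends argument are hypothetical names introduced here for illustration:

```r
# split_at() is a hypothetical helper: 'ends' holds the position of the
# last character of every piece except the final one.
split_at <- function(x, ends) {
  lapply(x, function(s) {
    substring(s, first = c(1, ends + 1), last = c(ends, nchar(s)))
  })
}

x <- c("abxyzpqrst34245", "mndeflmnop6346781")
pieces <- split_at(x, c(2, 5, 10))
```

Using nchar(s) as the final end position lets the last piece take whatever length remains, which matches the variable-length tail in the question.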
If you have a vector of strings to be split, you might also find read.fwf() handy. Use it like so:
x <- c("abxyzpqrst34245", "mndeflmnop6346781")
df <- read.fwf(file = textConnection(x),
               widths = c(2,3,5,10000),
               colClasses = "character")
df
# V1 V2 V3 V4
# 1 ab xyz pqrst 34245
# 2 mn def lmnop 6346781
str(df)
# 'data.frame': 2 obs. of 4 variables:
# $ V1: chr "ab" "mn"
# $ V2: chr "xyz" "def"
# $ V3: chr "pqrst" "lmnop"
# $ V4: chr "34245" "6346781"

Sapply different than individual application of function

When applied individually to each element of the vector, my function gives a different result than using sapply. It's driving me nuts!
The object I'm working with is this (simplified) list of the arguments another function was called with:
f <- as.list(match.call()[-1])
> f
$ampm
c(1, 4)
To replicate this you can run the following:
foo <- function(ampm) {as.list(match.call()[-1])}
f <- foo(ampm = c(1,4))
Here is my function. It just strips the 'c(...)' from a string.
stripConcat <- function(string) {
  sub(')','',sub('c(','',string,fixed=TRUE),fixed=TRUE)
}
When applied alone it works as so, which is what I want:
> stripConcat(f)
[1] "1, 4"
But when used with sapply, it gives something totally different, which I do NOT want:
> sapply(f, stripConcat)
ampm
[1,] "c"
[2,] "1"
[3,] "4"
Lapply doesn't work either:
> lapply(f, stripConcat)
$ampm
[1] "c" "1" "4"
And neither do any of the other apply functions. This is driving me nuts--I thought lapply and sapply were supposed to be identical to repeated applications to the elements of the list or vector!
The discrepancy you are seeing, I believe, is simply due to how as.character coerces elements of a list.
x2 <- list(1:3, quote(c(1, 5)))
as.character(x2)
[1] "1:3" "c(1, 5)"
lapply(x2, as.character)
[[1]]
[1] "1" "2" "3"
[[2]]
[1] "c" "1" "5"
f is not a call, but a list whose first element is a call.
is(f)
[1] "list" "vector"
as.character(f)
[1] "c(1, 4)"
is(f[[1]])
[1] "call" "language"
as.character(f[[1]])
[1] "c" "1" "4"
sub attempts to coerce anything that is not a character vector into one.
When you pass sub a list, it calls as.character on the list.
When you pass it a call, it calls as.character on that call.
It looks like your stripConcat function would prefer a list as input.
In that case, I would recommend the following version of the function:
stripConcat <- function(string) {
  if (!is.list(string))
    string <- list(string)
  sub(')','',sub('c(','',string,fixed=TRUE),fixed=TRUE)
}
Note, however, that string is a misnomer for the argument, since it doesn't appear that you are ever planning to pass stripConcat a string (not that this causes a problem, of course).
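Another option, if the goal is the text of each argument, is to convert each call to a string explicitly with deparse() before stripping. This is a sketch of an alternative rather than the approach recommended above, and it assumes each argument deparses to a single line:

```r
foo <- function(ampm) as.list(match.call()[-1])
f <- foo(ampm = c(1, 4))

# Original stripConcat from the question: removes the first "c(" and ")".
stripConcat <- function(string) {
  sub(")", "", sub("c(", "", string, fixed = TRUE), fixed = TRUE)
}

# deparse() turns the call c(1, 4) into the string "c(1, 4)",
# so stripConcat receives a real character value for every element.
stripped <- sapply(f, function(arg) stripConcat(deparse(arg)))
```

Because the coercion is explicit, the result no longer depends on whether sub receives a list or a call.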
