I am using apply to generate strings from a data frame.
For example:
df2 <- data.frame(a=c(1:3), b=c(9:11))
apply(df2, 1, function(row) paste0("hello", row['a']))
apply(df2, 1, function(row) paste0("hello", row['b']))
works as I would expect and generates
[1] "hello1" "hello2" "hello3"
[1] "hello9" "hello10" "hello11"
However, if I have
df <- data.frame(a=c(1:3), b=c(9:11), c=c("1", "2", "3"))
apply(df, 1, function(row) paste0("hello", row['a']))
apply(df, 1, function(row) paste0("hello", row['b']))
the output is
[1] "hello1" "hello2" "hello3"
[1] "hello 9" "hello10" "hello11"
Can anyone please explain why I get a padding space that makes all the strings the same length in the second case? I can work around the problem using gsub, but I would like to better understand why this happens.
You don't need the apply function:
paste0("hello", df[["a"]])
[1] "hello1" "hello2" "hello3"
paste0("hello", df[["b"]])
[1] "hello9" "hello10" "hello11"
This happens because apply converts your data.frame to a matrix. See what happens when you coerce df to a matrix:
as.matrix(df)
a b c
[1,] "1" " 9" "1"
[2,] "2" "10" "2"
[3,] "3" "11" "3"
Notice that it coerced to a character matrix and it included the extra space on the " 9".
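If apply is still wanted for other reasons, the padding can be sidestepped. The following is only a minimal sketch, assuming that stripping surrounding whitespace is acceptable for your data; the names res_trim and res_idx are illustrative and not part of the original question.

```r
df <- data.frame(a = 1:3, b = 9:11, c = c("1", "2", "3"))

# Option 1: keep apply, but trim the padded value inside the function
res_trim <- apply(df, 1, function(row) paste0("hello", trimws(row["b"])))

# Option 2: loop over row indices so the data.frame is never coerced to a matrix
res_idx <- vapply(seq_len(nrow(df)),
                  function(i) paste0("hello", df$b[i]),
                  character(1))
```

Both variants should give "hello9" "hello10" "hello11". Option 2 also keeps each column's original type inside the function, which may matter if you need to do arithmetic on the values.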
I have the following column from a dataframe
df <- data.frame(
crime = as.character(c(115400, 171200, 91124, 263899, 67601, 51322)),
stringsAsFactors=FALSE
)
I am using a loop to extract the first two digits (or only the first one, depending on a condition), as shown below:
for (i in df$crime) {
  if (nchar(i) == 6) {
    print(substring(i, 1, 2))
  } else {
    print(substring(i, 1, 1))
  }
}
When I run this loop I get the following output, which is what I want:
[1] "11"
[1] "17"
[1] "9"
[1] "26"
[1] "6"
[1] "5"
However, I want this to be saved as a standalone vector. How do I do that?
Here is a base R solution with ifelse + substring:
res <- with(df, substring(crime,1,ifelse(nchar(crime) == 6, 2, 1)))
such that
> res
[1] "11" "17" "9" "26" "6" "5"
substr/substring are vectorized, so we can use ifelse:
v1 <- with(df, ifelse(nchar(crime) == 6, substr(crime, 1, 2), substr(crime, 1, 1)))
v1
#[1] "11" "17" "9" "26" "6" "5"
In the OP's for loop, a vector can be initialized to store the output of each iteration:
v1 <- character(nrow(df))
for (i in seq_along(df$crime)) {
  if (nchar(df$crime[i]) == 6) {
    v1[i] <- substring(df$crime[i], 1, 2)
  } else {
    v1[i] <- substring(df$crime[i], 1, 1)
  }
}
Using regex:
output <- with(df, ifelse(nchar(crime) == 6, sub("(..).*", "\\1", crime),
sub("(.).*", "\\1", crime)))
output
#[1] "11" "17" "9" "26" "6" "5"
It becomes a little simpler with str_extract from stringr:
with(df, ifelse(nchar(crime) == 6, stringr::str_extract(crime, ".."),
stringr::str_extract(crime, ".")))
I can imagine some situations where keeping the extracted codes within the original data frame is useful.
I'll use the data.table package as it's fast, which may be handy if your data is big.
library(data.table)
# convert your data.frame to data.table
setDT(df)
# filter the rows where crime length is 6,
# and assign the first two characters of
# it into a new variable "extracted".
# some rows now have NAs in the new
# field. The last [] prints it to screen.
df[nchar(crime) == 6, extracted := substring(crime, 1, 2)][]
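The data.table call above leaves NA in the rows whose code is not six characters long. If you would rather fill every row, a base R sketch along the lines of the earlier substring answer could look like this. The column name extracted is carried over from the example above, and the crime values are written as string literals here purely for illustration.

```r
df <- data.frame(
  crime = c("115400", "171200", "91124", "263899", "67601", "51322"),
  stringsAsFactors = FALSE
)

# substring() is vectorized over its 'last' argument, so the number of
# characters to keep can be chosen per row with ifelse()
df$extracted <- substring(df$crime, 1, ifelse(nchar(df$crime) == 6, 2, 1))
```

This keeps the codes next to the original values, so later filtering or grouping can use df$extracted directly.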
I have an integer vector
a <- (0:3)
And I would like to convert it to a character string that looks like this
b <- "(0:3)"
I have tried
as.character(a)
[1] "0" "1" "2" "3"
and
toString(a)
[1] "0, 1, 2, 3"
But neither does exactly what I need.
Can anyone help me get from a to b?
Many thanks in advance!
paste0("(", min(a), ":", max(a), ")")
[1] "(0:3)"
Or more concisely with sprintf():
sprintf("(%d:%d)", min(a), max(a))
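Note that the min/max approach only describes the vector faithfully when it really is a run of consecutive integers. A small wrapper can make that assumption explicit. The helper name range_label is hypothetical, not from any package, and the consecutiveness check is an added safeguard rather than something the question asked for.

```r
# Hypothetical helper: format a consecutive integer run as "(min:max)"
range_label <- function(x) {
  stopifnot(is.numeric(x), length(x) >= 1)
  # Refuse inputs such as c(0, 2, 3), where "(0:3)" would be misleading
  if (!all(diff(x) == 1)) {
    stop("x is not a sequence of consecutive increasing integers")
  }
  sprintf("(%d:%d)", as.integer(min(x)), as.integer(max(x)))
}

b <- range_label(0:3)
```

Here b should be the string "(0:3)", while range_label(c(0, 2, 3)) stops with an error instead of silently returning a label that does not match the data.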
One option is to deparse and paste the brackets:
as.character(glue::glue('({deparse(a)})'))
#[1] "(0:3)"
Another option would be to store it as a quosure and then convert it to character:
library(rlang)
a <- quo((0:3))
quo_name(a)
#[1] "(0:3)"
It can be evaluated with eval_tidy:
eval_tidy(a)
#[1] 0 1 2 3
I'm wondering if there is any way to remove blanks from within the elements of a list.
I have searched and found many Q&As about removing
a whole element from a list, but none about removing
specific components inside an element.
To be specific, the list I'm working with looks like this:
[[1]]
[1] "1" "" "" "2" "" "" "3"
[[2]]
[1] "weak"
[[3]]
[1] "22" "33"
[[4]]
[1] "44" "34p" "45"
In the output above you can see the " " entries, which should be removed.
I've tried different commands like
text.words.bl <- text.words.ll[-which(text.words.ll==" ")]
text.words.bl <- text.words.ll[!sapply(text.words.ll, is.null)]
etc., but the " " entries in [[1]] of the list still remain.
Is it impossible to apply commands to the individual pieces inside each element of the list?
(e.g. 1, 2, weak, 22, 33... respectively)
I've used the lapply function to run specific commands on each element,
and those lapply commands all seemed to work.
JY
Use %in%, but negate it with !:
## Sample data:
L <- list(c(1, 2, "", "", 4), c(1, "", "", 2), c("", "", 3))
L
# [[1]]
# [1] "1" "2" "" "" "4"
#
# [[2]]
# [1] "1" "" "" "2"
#
# [[3]]
# [1] "" "" "3"
The replacement:
lapply(L, function(x) x[!x %in% ""])
# [[1]]
# [1] "1" "2" "4"
#
# [[2]]
# [1] "1" "2"
#
# [[3]]
# [1] "3"
Obviously, assign the output to "L" if you want to overwrite the original dataset:
L[] <- lapply(L, function(x) x[!x %in% ""])
Another way would be to use nchar(). I borrowed L from @Ananda Mahto.
lapply(L, function(x) x[nchar(x) >= 1])
#[[1]]
#[1] "1" "2" "4"
#
#[[2]]
#[1] "1" "2"
#
#[[3]]
#[1] "3"
When applied to the whole list, my function gives a different result than when it is applied to each element with sapply. It's driving me nuts!
The object I'm using is this (simplified) list of the arguments another function was called with:
f <- as.list(match.call()[-1])
> f
$ampm
c(1, 4)
To replicate this you can run the following:
foo <- function(ampm) {as.list(match.call()[-1])}
f <- foo(ampm = c(1,4))
Here is my function. It just strips the 'c(...)' from a string.
stripConcat <- function(string) {
sub(')','',sub('c(','',string,fixed=TRUE),fixed=TRUE)
}
When applied on its own it works like this, which is what I want:
> stripConcat(f)
[1] "1, 4"
But when used with sapply, it gives something totally different, which I do NOT want:
> sapply(f, stripConcat)
ampm
[1,] "c"
[2,] "1"
[3,] "4"
lapply doesn't work either:
> lapply(f, stripConcat)
$ampm
[1] "c" "1" "4"
And neither do any of the other apply functions. This is driving me nuts--I thought lapply and sapply were supposed to be identical to repeated applications to the elements of the list or vector!
The discrepancy you are seeing, I believe, is simply due to how as.character coerces elements of a list.
x2 <- list(1:3, quote(c(1, 5)))
as.character(x2)
[1] "1:3" "c(1, 5)"
lapply(x2, as.character)
[[1]]
[1] "1" "2" "3"
[[2]]
[1] "c" "1" "5"
f is not a call, but a list whose first element is a call.
is(f)
[1] "list" "vector"
as.character(f)
[1] "c(1, 4)"
> is(f[[1]])
[1] "call" "language"
> as.character(f[[1]])
[1] "c" "1" "4"
sub attempts to coerce anything that is not a character vector into one.
When you pass sub a list, it calls as.character on the list.
When you pass it a call, it calls as.character on that call.
It looks like for your stripConcat function, you would prefer a list as input.
In that case, I would recommend the following for that function:
stripConcat <- function(string) {
if (!is.list(string))
string <- list(string)
sub(')','',sub('c(','',string,fixed=TRUE),fixed=TRUE)
}
Note, however, that string is a misnomer, since it doesn't appear that you are ever planning to pass stripConcat a string. (not that this is an issue, of course)
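An alternative that makes the coercion explicit is to deparse each call into a single string first, so that the whole-list and per-element versions see the same input. This is only a sketch of that idea, not the approach proposed above; args_chr is an illustrative name, and collapsing with paste() assumes you are happy to join the multi-line output that deparse() can produce for long calls.

```r
foo <- function(ampm) as.list(match.call()[-1])
f <- foo(ampm = c(1, 4))

stripConcat <- function(string) {
  sub(")", "", sub("c(", "", string, fixed = TRUE), fixed = TRUE)
}

# Turn every captured call into exactly one string, e.g. "c(1, 4)"
args_chr <- vapply(f, function(e) paste(deparse(e), collapse = ""), character(1))

whole   <- stripConcat(args_chr)          # applied to the whole vector
by_elem <- sapply(args_chr, stripConcat)  # applied element by element
```

Because args_chr is already a character vector, sub() no longer has to choose between coercing a list and coercing a call, and both results should contain "1, 4" for ampm.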
If you use apply over rows on a data.frame with character and numeric columns, apply uses as.matrix internally to convert the data.frame to a character matrix. But if the numeric column contains numbers with different numbers of digits, as.matrix adds leading spaces so that every entry is as wide as the longest number.
An example:
df <- data.frame(id1=c(rep("a",3)),id2=c(100,90,8), stringsAsFactors = FALSE)
df
## id1 id2
## 1 a 100
## 2 a 90
## 3 a 8
as.matrix(df)
## id1 id2
## [1,] "a" "100"
## [2,] "a" " 90"
## [3,] "a" " 8"
I would have expected the result to be:
id1 id2
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"
Why the extra spaces?
They can create unexpected results when using apply on a data.frame:
myfunc <- function(row){
paste(row[1], row[2], sep = "")
}
> apply(df, 1, myfunc)
[1] "a100" "a 90" "a 8"
>
Looping, by contrast, gives the expected result:
> for (i in 1:nrow(df)){
print(myfunc(df[i,]))
}
[1] "a100"
[1] "a90"
[1] "a8"
and
> paste(df[,1], df[,2], sep = "")
[1] "a100" "a90" "a8"
Are there any situations where the extra spaces that are added with as.matrix are useful?
This is because of the way the as.matrix.data.frame method converts columns when the data frame contains non-numeric data. There is a simple workaround, shown below.
Details
?as.matrix notes that conversion is done via format(), and it is here that the additional spaces are added. Specifically, ?as.matrix has this in the Details section:
‘as.matrix’ is a generic function. The method for data frames
will return a character matrix if there is only atomic columns and
any non-(numeric/logical/complex) column, applying ‘as.vector’ to
factors and ‘format’ to other non-character columns. Otherwise,
the usual coercion hierarchy (logical < integer < double <
complex) will be used, e.g., all-logical data frames will be
coerced to a logical matrix, mixed logical-integer will give a
integer matrix, etc.
?format also notes that
Character strings are padded with blanks to the display width of the widest.
Consider this example which illustrates the behaviour
> format(df[,2])
[1] "100" " 90" " 8"
> nchar(format(df[,2]))
[1] 3 3 3
format doesn't have to work this way, as it has a trim argument:
trim: logical; if ‘FALSE’, logical, numeric and complex values are
right-justified to a common width: if ‘TRUE’ the leading
blanks for justification are suppressed.
e.g.
> format(df[,2], trim = TRUE)
[1] "100" "90" "8"
but there is no way to pass this argument along to the as.matrix.data.frame method.
Workaround
A way to work around this is to apply format() yourself, manually, via sapply. There you can pass in trim = TRUE
> sapply(df, format, trim = TRUE)
id1 id2
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"
or, using vapply we can state what we expect to be returned (here character vectors of length 3 [nrow(df)]):
> vapply(df, format, FUN.VALUE = character(nrow(df)), trim = TRUE)
id1 id2
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"
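To tie this back to the original problem, the trimmed matrix can then be handed to apply in place of the data frame. The sketch below reuses df and myfunc from the question; the intermediate name m is illustrative.

```r
df <- data.frame(id1 = rep("a", 3), id2 = c(100, 90, 8), stringsAsFactors = FALSE)

myfunc <- function(row) {
  paste(row[1], row[2], sep = "")
}

# Build the character matrix ourselves, with trim = TRUE, before calling apply
m <- vapply(df, format, FUN.VALUE = character(nrow(df)), trim = TRUE)
res <- apply(m, 1, myfunc)
```

Since m is already a character matrix, apply has no data frame to convert, so res should be "a100" "a90" "a8" without the padding.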
It does seem a little strange. In the manual (?as.matrix) it explains that format is called for the conversion to a character matrix:
The method for data frames will return a character matrix if there is
only atomic columns and any non-(numeric/logical/complex) column,
applying as.vector to factors and format to other non-character
columns.
And you can see that if you call format directly, it does what as.matrix does:
format(df$id2)
[1] "100" " 90" " 8"
What you need to do is pass the trim argument:
format(df$id2,trim=TRUE)
[1] "100" "90" "8"
But, unfortunately, the as.matrix.data.frame function doesn't allow you to do that.
else if (non.numeric) {
    for (j in pseq) {
        if (is.character(X[[j]]))
            next
        xj <- X[[j]]
        miss <- is.na(xj)
        xj <- if (length(levels(xj)))
            as.vector(xj)
        else format(xj) # This could have ... as an argument
        # else format(xj,...)
        is.na(xj) <- miss
        X[[j]] <- xj
    }
}
So, you could modify as.matrix.data.frame. I think it would be a nice feature addition, however, to include this in base.
But a quick solution would be simply:
as.matrix(data.frame(lapply(df,as.character)))
id1 id2
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"
# As mentioned in the comments, this also works:
sapply(df,as.character)
as.matrix calls format internally:
> format(df$id2)
[1] "100" " 90" " 8"
That's where the extra spaces come from. format has an extra argument trim to remove those:
> format(df$id2, trim = TRUE)
[1] "100" "90" "8"
However you cannot supply this argument to as.matrix.
The reason for this behaviour is already explained in previous answers, but I'd like to offer another way of circumventing this:
df <- data.frame(id1=c(rep("a",3)),id2=c(100,90,8), stringsAsFactors = FALSE)
do.call(cbind,df)
id1 id2
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"
Note that if using stringsAsFactors = TRUE, this doesn't work as factor levels are converted to numbers.
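If the goal is only to run a function over the rows, another way around the conversion is to iterate over the columns in parallel with mapply, so no matrix is built at all. This is a sketch under the assumption that the function can take the columns as separate arguments; res is an illustrative name.

```r
df <- data.frame(id1 = rep("a", 3), id2 = c(100, 90, 8), stringsAsFactors = FALSE)

# Each call receives one value from id1 and one from id2, in their original types.
# USE.NAMES = FALSE stops mapply from naming the result after the id1 values.
res <- mapply(function(id1, id2) paste0(id1, id2),
              df$id1, df$id2,
              USE.NAMES = FALSE)
```

The numeric values are converted to text one at a time by paste0, so no common width is imposed and res should be "a100" "a90" "a8".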
Just another solution: trimWhiteSpace(x) (from the limma R package) also does the job if you don't mind downloading the package.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("limma")
library(limma)
df <- data.frame(id1=c(rep("a",3)),id2=c(100,90,8), stringsAsFactors = FALSE)
as.matrix(df)
id1 id2
[1,] "a" "100"
[2,] "a" " 90"
[3,] "a" " 8"
trimWhiteSpace(as.matrix(df))
id1 id2
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"