using paste with a list - r

I'm trying to understand the behavior of strsplit and paste, which are inverse functions. However, when I strsplit a vector, a list is returned, like so:
> strsplit(c("on,e","tw,o","thre,e","fou,r"),",")
[[1]]
[1] "on" "e"
[[2]]
[1] "tw" "o"
[[3]]
[1] "thre" "e"
[[4]]
[1] "fou" "r"
I tried using lapply to cat the elements of the list back together, but it doesn't work:
> lapply(strsplit(c("on,e","tw,o","thre,e","fou,r"),","),cat)
on etw othre efou r[[1]]
NULL
[[2]]
NULL
[[3]]
NULL
[[4]]
NULL
The same formula with paste instead of cat actually does nothing at all! Why am I getting these results, and how can I get the result I want, which is the original vector back again?
(Obviously, in my actual code I'm trying to do more with the strsplit and cat than just return the original vector, but I think a solution to this problem will work for mine. Thanks!)

While cat does concatenate and print to the console, it does not work the way paste does: it returns NULL invisibly, which is why lapply hands back a list of NULLs. Its behavior is best explained in help("cat").
The collapse argument in paste is effectively the opposite of the split argument in strsplit. And you can use sapply to return the simplified pasted vector.
x <- c("on,e","tw,o","thre,e","fou,r")
( y <- sapply(strsplit(x, ","), paste, collapse = ",") )
# [1] "on,e" "tw,o" "thre,e" "fou,r"
( z <- vapply(strsplit(x, ","), paste, character(1L), collapse = ",") )
# [1] "on,e" "tw,o" "thre,e" "fou,r"
identical(x, y)
# [1] TRUE
identical(x, z)
# [1] TRUE
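Note that for cases like this, vapply is safer than sapply because it checks that every result is a length-one character vector, and it can be slightly faster. Adding fixed = TRUE in strsplit should increase efficiency as well, since the separator is then matched literally rather than as a regular expression.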
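To see why the cat and paste attempts in the question behave as they do, here is a minimal sketch (the variable names parts, no_collapse, collapsed and cat_value are only for illustration, not from the original code). Without collapse, paste works element by element and returns a vector of the same length, so it looks as if nothing happened. cat prints its arguments and returns NULL.

```r
parts <- c("on", "e")

# Without collapse, paste() works element-wise and returns a vector
# of the same length, so the pieces are not joined.
no_collapse <- paste(parts)

# With collapse, the elements are joined into a single string.
collapsed <- paste(parts, collapse = ",")

# cat() prints to the console and returns NULL (invisibly), which is
# why lapply(..., cat) produced a list of NULLs.
cat_value <- cat(parts, "\n")

print(no_collapse)
print(collapsed)
print(is.null(cat_value))
```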

Related

Split a string on alternating index

I have a string similar to "HLeelmloon" which is two words interweaved together. How can I separate this into two separate words, splitting on alternating letters?
I can use strsplit() and a for loop to allocate alternating letters to two new vectors and then join the list but this seems very long winded:
string <- "HLeelmloon"
split<-el(strsplit(string,''))
> split
[1] "H" "L" "e" "e" "l" "m" "l" "o" "o" "n"
word1<-c()
word2<-c()
for(i in 1:length(split)){
if(i %% 2 == 1){
word1<-append(word1, split[i])
} else {
word2<-append(word2, split[i])
}
}
word1 = paste0(word1, collapse = '')
word2 = paste0(word2, collapse = '')
> word1
[1] "Hello"
> word2
[1] "Lemon"
My issue is it's not very elegant, and it doesn't scale well if I want to split the string into N different words. Is there a better way to do this?
You could use gsub to capture alternating characters into the same group:
gsub("(.)(.)?", "\\1", string)
#[1] "Hello"
gsub("(.)(.)?", "\\2", string)
#[1] "Lemon"
You can do it by using TRUE and FALSE for indexing, i.e.
v1 = strsplit(string, '')[[1]]
paste(v1[c(TRUE, FALSE)], collapse = '')
#[1] "Hello"
paste(v1[c(FALSE, TRUE)], collapse = '')
#[1] "Lemon"
Considering your question is how to split into more than two words, you should use the split function. Using your example data can be a bit confusing because you chose to name one variable 'split'. In the following block, the first 'split' is the function, the second one your split variable.
number_of_words <- 2
lapply(split(split,1:number_of_words),paste0,collapse='')
$`1`
[1] "Hello"
$`2`
[1] "Lemon"
number_of_words <- 3
lapply(split(split,1:number_of_words),paste0,collapse='')
$`1`
[1] "Heln"
$`2`
[1] "Llo"
$`3`
[1] "emo"
To avoid confusion, here's the same code without the variable named split:
number_of_words <- 2
lapply(split(el(strsplit(string,'')),1:number_of_words),paste0,collapse='')
$`1`
[1] "Hello"
$`2`
[1] "Lemon"
Try this code:
> paste0(split[seq(1,nchar(string),by = 2)],collapse="")
[1] "Hello"
> paste0(split[seq(2,nchar(string),by = 2)],collapse="")
[1] "Lemon"
It pastes together the characters at the odd positions and then the characters at the even positions of string.
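Another way, using your split variable, which works for any number of words N as long as the number of characters is a multiple of N (otherwise matrix warns and recycles characters):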
N <- 2
apply(matrix(split,N),1,paste,collapse="")
# [1] "Hello" "Lemon"
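Pulling the ideas from the answers above together, here is a rough sketch of a helper for N words. The name deinterleave and its arguments are made up for this example. It assumes the words are interleaved one character at a time, and it uses rep_len so that the grouping vector always matches the number of characters, which avoids the recycling warning when the length is not a multiple of N.

```r
# Hypothetical helper: split a string into n words by dealing out its
# characters in turn (character 1 to word 1, character 2 to word 2, ...).
deinterleave <- function(s, n) {
  chars <- strsplit(s, "", fixed = TRUE)[[1]]
  # Group labels 1..n repeated to exactly the number of characters.
  groups <- rep_len(seq_len(n), length(chars))
  # Paste each group back together; vapply guarantees one string per group.
  unname(vapply(split(chars, groups), paste, character(1L), collapse = ""))
}

two_words <- deinterleave("HLeelmloon", 2)
three_words <- deinterleave("HLeelmloon", 3)
print(two_words)
print(three_words)
```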

Accessing element of a split string in R

If I have a string,
x <- "Hello World"
How can I access the second word, "World", using string split, after
x <- strsplit(x, " ")
x[[2]] does not do anything.
As mentioned in the comments, it's important to realise that strsplit returns a list object. Since your example is only splitting a single item (a vector of length 1) your list is length 1. I'll explain with a slightly different example, inputting a vector of length 3 (3 text items to split):
input <- c( "Hello world", "Hi there", "Back at ya" )
x <- strsplit( input, " " )
> x
[[1]]
[1] "Hello" "world"
[[2]]
[1] "Hi" "there"
[[3]]
[1] "Back" "at" "ya"
Notice that the returned list has 3 elements, one for each element of the input vector. Each of those list elements is split as per the strsplit call. So we can recall any of these list elements using [[ (this is what your x[[2]] call was doing, but you only had one list element, which is why you couldn't get anything in return):
> x[[1]]
[1] "Hello" "world"
> x[[3]]
[1] "Back" "at" "ya"
Now we can get the second part of any of those list elements by appending a [ call:
> x[[1]][2]
[1] "world"
> x[[3]][2]
[1] "at"
This will return the second item from each list element (note that the "Back at ya" input has returned "at" in this case). You can do this for all items at once using something from the apply family. sapply will return a vector, which will probably be good in this case:
> sapply( x, "[", 2 )
[1] "world" "there" "at"
The last value in the input here (2) is passed to the [ operator, meaning the operation x[2] is applied to every list element.
If instead of the second item, you'd like the last item of each list element, we can use tail within the sapply call instead of [:
> sapply( x, tail, 1 )
[1] "world" "there" "ya"
This time, we've applied tail( x, 1 ) to every list element, giving us the last item.
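One detail worth knowing about the [ approach: if you ask for a position that some elements don't have, you get NA for those elements rather than an error. A small sketch, using vapply instead of sapply so the return type is checked (the names words, third and last are just for illustration):

```r
input <- c("Hello world", "Hi there", "Back at ya")
words <- strsplit(input, " ", fixed = TRUE)

# Only the third input has a third word; the others give NA.
third <- vapply(words, `[`, character(1L), 3)

# tail(w, 1) always finds the last word, whatever the length.
last <- vapply(words, function(w) tail(w, 1), character(1L))

print(third)
print(last)
```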
As a preference, my favourite way to apply actions like these is with the magrittr pipe, for the second word like so:
x <- input %>%
strsplit( " " ) %>%
sapply( "[", 2 )
> x
[1] "world" "there" "at"
Or for the last word:
x <- input %>%
strsplit( " " ) %>%
sapply( tail, 1 )
> x
[1] "world" "there" "ya"
Another approach that might be a little easier to read and apply to a data frame within a pipeline (though it takes more lines) would be to wrap it in your own function and apply that.
library(tidyverse)
df <- data.frame(
greetings = c( "Hello world", "Hi there", "Back at ya" )
)
split_params = function (x, sep, n) {
# Splits string into list of substrings separated by 'sep'.
# Returns nth substring.
x = strsplit(x, sep)[[1]][n]
return(x)
}
df = df %>%
mutate(
second_word = sapply(
X = greetings,
FUN = split_params,
# Arguments for split_params.
sep = ' ',
n = 2
)
)
df
### (Output in RStudio Notebook)
greetings second_word
<chr> <chr>
Hello world world
Hi there there
Back at ya at
3 rows
###
With stringr 1.5.0, you can use str_split_i to access the ith element of a split string:
library(stringr)
x <- "Hello World"
str_split_i(x, " ", i = 2)
#[1] "World"
It is vectorized:
x <- c("Hello world", "Hi there", "Back at ya")
str_split_i(x, " ", 2)
#[1] "world" "there" "at"
x=strsplit("a;b;c;d",";")
x
[[1]]
[1] "a" "b" "c" "d"
x=as.character(x[[1]])
x
[1] "a" "b" "c" "d"
x=strsplit(x," ")
x
[[1]]
[1] "a"
[[2]]
[1] "b"
[[3]]
[1] "c"
[[4]]
[1] "d"

R: trim consecutive trailing and leading special characters from set of strings

I have a list of character vectors, all equal lengths. Example data:
> a = list('**aaa', 'bb*bb', 'cccc*')
> a = sapply(a, strsplit, '')
> a
[[1]]
[1] "*" "*" "a" "a" "a"
[[2]]
[1] "b" "b" "*" "b" "b"
[[3]]
[1] "c" "c" "c" "c" "*"
I would like to identify the indices of all leading and trailing consecutive occurrences of the character *. Then I would like to remove these indices from all three vectors in the list. By trailing and leading consecutive characters I mean e.g. either only a single occurrence as in the third one (cccc*) or multiple consecutive ones as in the first one (**aaa).
After the removal, all three character vectors should still have the same length.
So the first two and the last character should be removed from all three vectors.
[[1]]
[1] "a" "a"
[[2]]
[1] "*" "b"
[[3]]
[1] "c" "c"
Note that the second vector of the desired result will still have a leading *, which, however became the first character after the operation, so it should be in.
I tried using which to identify the indices (sapply(a, function(x)which(x=='*'))) but this would still require some code to detect the trailing ones.
Any ideas for a simple solution?
I would replace the leading and trailing stars with NA:
aa <- lapply(setNames(a,seq_along(a)), function(x) {
star = x=="*"
toNA = cumsum(!star) == 0 | rev(cumsum(rev(!star))) == 0
replace(x, toNA, NA)
})
Store in a data.frame:
DF <- do.call(data.frame, c(aa, list(stringsAsFactors=FALSE)) )
Omit all rows with NA:
res <- na.omit(DF)
# X1 X2 X3
# 3 a * c
# 4 a b c
If you hate data.frames and want your list back: lapply(res,I) or c(unclass(res)), which gives
$X1
[1] "a" "a"
$X2
[1] "*" "b"
$X3
[1] "c" "c"
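If you would rather skip the data.frame step, the same idea can be sketched with logical vectors only. This is a variation on the code above, not a different algorithm: edge_star and drop are names invented here, and it assumes all vectors have equal length, as stated in the question.

```r
a <- strsplit(c("**aaa", "bb*bb", "cccc*"), "", fixed = TRUE)

# TRUE for positions that belong to a leading or trailing run of "*".
edge_star <- function(x) {
  star <- x == "*"
  cumsum(!star) == 0 | rev(cumsum(rev(!star))) == 0
}

# A position is dropped everywhere if it is an edge star in any vector.
drop <- Reduce(`|`, lapply(a, edge_star))

trimmed <- lapply(a, function(x) x[!drop])
print(trimmed)
```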
First off, like Richard Scriven asked in his comment to your question, your output is not the same as the thing you asked for. You ask for removal of leading and trailing characters, but your given ideal output is just the 3rd and 4th element of the character lists.
This would be easily achievable by something like
a <- list('**aaa', 'bb*bb', 'cccc*')
alist = sapply(a, strsplit, '')
lapply(alist, function(x) x[3:4])
Now for an answer as you asked it:
IMHO, sapply() isn't necessary here.
You need a function of the grep family to operate directly on your characters, which all share a help page in R opened by ?grep.
I would propose gsub() and a bit of Regular Expressions for your problem:
a <- list('**aaa', 'bb*bb', 'cccc*')
b <- gsub(pattern = "^(\\*)*", x = a, replacement = "")
c <- gsub(pattern = "(\\*)*$", x = b, replacement = "")
> c
[1] "aaa" "bb*bb" "cccc"
This should be doable in a single regex, but I think you would then need a backreference for the part in between, and I didn't get that to work.
If you are familiar with the magrittr package and its excellent pipe operator, you can do this more elegantly:
library(magrittr)
gsub(pattern = "^(\\*)*", x = a, replacement = "") %>%
gsub(pattern = "(\\*)*$", x = ., replacement = "")
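For what it's worth, the two gsub calls can be combined into one by using alternation instead of a backreference. A minimal sketch (the variable names are illustrative). Like the two-step version, it trims each string independently, so the results can end up with different lengths.

```r
a <- c("**aaa", "bb*bb", "cccc*")

# "^\\*+" matches a run of stars at the start, "\\*+$" a run at the end;
# the | lets gsub remove whichever of the two it finds.
trimmed <- gsub("^\\*+|\\*+$", "", a)
print(trimmed)
```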

Sapply different than individual application of function

When applied individually to each element of the vector, my function gives a different result than using sapply. It's driving me nuts!
Item I'm using: this (simplified) list of arguments another function was called with:
f <- as.list(match.call()[-1])
> f
$ampm
c(1, 4)
To replicate this you can run the following:
foo <- function(ampm) {as.list(match.call()[-1])}
f <- foo(ampm = c(1,4))
Here is my function. It just strips the 'c(...)' from a string.
stripConcat <- function(string) {
sub(')','',sub('c(','',string,fixed=TRUE),fixed=TRUE)
}
When applied alone it works as so, which is what I want:
> stripConcat(f)
[1] "1, 4"
But when used with sapply, it gives something totally different, which I do NOT want:
> sapply(f, stripConcat)
ampm
[1,] "c"
[2,] "1"
[3,] "4"
Lapply doesn't work either:
> lapply(f, stripConcat)
$ampm
[1] "c" "1" "4"
And neither do any of the other apply functions. This is driving me nuts--I thought lapply and sapply were supposed to be identical to repeated applications to the elements of the list or vector!
The discrepancy you are seeing, I believe, is simply due to how as.character coerces elements of a list.
x2 <- list(1:3, quote(c(1, 5)))
as.character(x2)
[1] "1:3" "c(1, 5)"
lapply(x2, as.character)
[[1]]
[1] "1" "2" "3"
[[2]]
[1] "c" "1" "5"
f is not a call, but a list whose first element is a call.
is(f)
[1] "list" "vector"
as.character(f)
[1] "c(1, 4)"
> is(f[[1]])
[1] "call" "language"
> as.character(f[[1]])
[1] "c" "1" "4"
sub attempts to coerce anything that is not a character into a character.
When you pass sub a list, it calls as.character on the list.
When you pass it a call, it calls as.character on that call.
It looks like for your stripConcat function, you would prefer a list as input.
In that case, I would recommend the following for that function:
stripConcat <- function(string) {
if (!is.list(string))
string <- list(string)
sub(')','',sub('c(','',string,fixed=TRUE),fixed=TRUE)
}
Note, however, that string is a misnomer, since it doesn't appear that you are ever planning to pass stripConcat a string. (not that this is an issue, of course)
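If the goal is to make sapply/vapply over f give the same text as the direct call, one option (a different technique from the wrapper above, shown only as a sketch) is to turn each call into a string with deparse before stripping. strip_concat is the question's function under a new name. Note that deparse can return more than one string for very long expressions, which this sketch does not handle.

```r
foo <- function(ampm) as.list(match.call()[-1])
f <- foo(ampm = c(1, 4))

# Same logic as stripConcat in the question.
strip_concat <- function(string) {
  sub(")", "", sub("c(", "", string, fixed = TRUE), fixed = TRUE)
}

# deparse() turns the call c(1, 4) into the single string "c(1, 4)",
# so sub() no longer sees the call split into "c", "1", "4".
stripped <- vapply(f, function(arg) strip_concat(deparse(arg)), character(1L))
print(stripped)
```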

In R, how can a string be split without using a separator

I am trying the split method, and I want to get the second element of a string containing only 2 elements. The length of the string is 2.
examples :
string= "AC"
The result should be a split after the first letter ("A"), so that I get:
res= [,1] [,2]
[1,] "A" "C"
I tried it with split, but I have no idea how to split after the first element.
strsplit() will do what you want (if I understand your question). You need to split on "" to split the string into its elements. Here is an example showing how to do what you want on a vector of strings:
strs <- rep("AC", 3) ## your string repeated 3 times
Next, split each of the three strings:
sstrs <- strsplit(strs, "")
which produces
> sstrs
[[1]]
[1] "A" "C"
[[2]]
[1] "A" "C"
[[3]]
[1] "A" "C"
This is a list, so we can process it with lapply() or sapply(). We need to subset each element of sstrs to select out the second element. For this we apply the [ function:
sapply(sstrs, `[`, 2)
which produces:
> sapply(sstrs, `[`, 2)
[1] "C" "C" "C"
If all you have is one string, then
strsplit("AC", "")[[1]][2]
which gives:
> strsplit("AC", "")[[1]][2]
[1] "C"
split isn't used for this kind of string manipulation. What you're looking for is strsplit, which in your case would be used something like this:
strsplit(string,"",fixed = TRUE)
You may not need fixed = TRUE, but it's a habit of mine as I tend to avoid regular expressions. You seem to indicate that you want the result to be something like a matrix. strsplit will return a list, so you'll want something like this:
strsplit(string,"",fixed = TRUE)[[1]]
and then pass the result to matrix.
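For several strings at once, one way to get the matrix layout from the question is to bind the split pieces row-wise. This is only a sketch: the input strs is made up, and it assumes every string has the same number of characters (otherwise rbind would recycle and warn).

```r
strs <- c("AC", "GT", "TA")

# One row per string, one column per character.
m <- do.call(rbind, strsplit(strs, "", fixed = TRUE))

# The second character of every string is the second column.
second <- m[, 2]

print(m)
print(second)
```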
If you are sure that it is always a two-character string (check with all(nchar(x)==2)) and you want only the second character, then you could use sub or substr:
x <- c("ab", "12")
sub(".", "", x)
# [1] "b" "2"
substr(x, 2, 2)
# [1] "b" "2"

Resources