R: trim consecutive trailing and leading special characters from set of strings

I have a list of character vectors, all equal lengths. Example data:
> a = list('**aaa', 'bb*bb', 'cccc*')
> a = sapply(a, strsplit, '')
> a
[[1]]
[1] "*" "*" "a" "a" "a"
[[2]]
[1] "b" "b" "*" "b" "b"
[[3]]
[1] "c" "c" "c" "c" "*"
I would like to identify the indices of all leading and trailing consecutive occurrences of the character *. Then I would like to remove these indices from all three vectors in the list. By trailing and leading consecutive characters I mean e.g. either only a single occurrence as in the third one (cccc*) or multiple consecutive ones as in the first one (**aaa).
After the removal, all three character vectors should still have the same length.
So the first two and the last character should be removed from all three vectors.
[[1]]
[1] "a" "a"
[[2]]
[1] "*" "b"
[[3]]
[1] "c" "c"
Note that the second vector of the desired result still has a leading *. It only became the first character after the operation, so it should stay in.
I tried using which to identify the indices (sapply(a, function(x)which(x=='*'))) but this would still require some code to detect the trailing ones.
Any ideas for a simple solution?

I would replace the leading and trailing stars with NA:
aa <- lapply(setNames(a,seq_along(a)), function(x) {
star = x=="*"
toNA = cumsum(!star) == 0 | rev(cumsum(rev(!star))) == 0
replace(x, toNA, NA)
})
Store in a data.frame:
DF <- do.call(data.frame, c(aa, list(stringsAsFactors=FALSE)) )
Omit all rows with NA:
res <- na.omit(DF)
# X1 X2 X3
# 3 a * c
# 4 a b c
If you hate data.frames and want your list back: lapply(res,I) or c(unclass(res)), which gives
$X1
[1] "a" "a"
$X2
[1] "*" "b"
$X3
[1] "c" "c"
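If the data.frame round trip feels heavy, the same idea can be sketched directly on the list. This is only an illustration under the question's assumption that all vectors have equal length; the helper name edge_star is made up here. It marks the leading and trailing stars in each vector, combines the marks across vectors with a logical OR, and drops those positions everywhere:

```r
# Example data from the question, split into single characters
a <- strsplit(c("**aaa", "bb*bb", "cccc*"), "")

# TRUE at positions that belong to a leading or trailing run of "*"
# (edge_star is a hypothetical helper name, not a base R function)
edge_star <- function(x) {
  star <- x == "*"
  cumsum(!star) == 0 | rev(cumsum(rev(!star))) == 0
}

# A position is dropped if it is an edge star in ANY of the vectors
drop <- Reduce(`|`, lapply(a, edge_star))

# Remove the same positions from every vector
res <- lapply(a, function(x) x[!drop])
res
```

Because one shared logical mask is applied to every vector, the results keep equal lengths.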

First off, as Richard Scriven asked in his comment on your question, your output is not the same as what you asked for. You ask for the removal of leading and trailing characters, but your desired output is just the 3rd and 4th elements of each character vector.
This would be easily achievable by something like
a <- list('**aaa', 'bb*bb', 'cccc*')
alist = sapply(a, strsplit, '')
lapply(alist, function(x) x[3:4])
Now for an answer as you asked it:
IMHO, sapply() isn't necessary here.
You need a function of the grep family to operate directly on your characters, which all share a help page in R opened by ?grep.
I would propose gsub() and a bit of Regular Expressions for your problem:
a <- list('**aaa', 'bb*bb', 'cccc*')
b <- gsub(pattern = "^(\\*)*", x = a, replacement = "")
res <- gsub(pattern = "(\\*)*$", x = b, replacement = "")
> res
[1] "aaa" "bb*bb" "cccc"
This is doable in one regex, but then you need a backreference for the stuff in between, I think, and I didn't get this to work.
If you are familiar with the magrittr package and its excellent pipe operator, you can do this more elegantly:
library(magrittr)
gsub(pattern = "^(\\*)*", x = a, replacement = "") %>%
gsub(pattern = "(\\*)*$", x = ., replacement = "")
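As for the single-regex version mentioned above, one possible sketch avoids backreferences by using alternation: the pattern matches either a run of stars at the start or a run of stars at the end, and gsub() removes both. The variable name trimmed is just for illustration.

```r
a <- list("**aaa", "bb*bb", "cccc*")

# "^\\*+" matches one or more leading stars,
# "\\*+$" matches one or more trailing stars;
# "|" lets gsub() remove whichever of the two it finds
trimmed <- gsub("^\\*+|\\*+$", "", unlist(a))
trimmed
# [1] "aaa"   "bb*bb" "cccc"
```

Stars in the middle of a string, as in "bb*bb", are left alone because neither alternative is anchored there.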

Related

Searching for a way to replace elements of character vectors in a list

I have a list of character vectors of different lengths, containing identifiers (e.g. "011" or "12"), numbers indicating an amount of money ("112.3" or "490.5"), years ("2011" or "2020"), empty elements ("") and elements containing only a dot ("."). I want to get rid of the elements of the character vectors that contain only a dot or are empty. The leading zeros of the identifiers are important, so I cannot change the type to numeric.
This original data
l <- list(c("2015","2016"),c(""),c("."), c("0","2418.9","292.4"),c("2",".",".","2394.6"),c("011","","934.0","1200.7"))
should look like this:
l_final <- list(c("2015","2016"),c("0","2418.9","292.4"),c("2","2394.6"),c("011","934.0","1200.7"))
My idea is to create a list with TRUE/FALSE indicating for each vector which elements to keep, but right now I'm really stuck as the following approach does not work (it returns integers that are zero) :
test <- lapply(list, function(i) {unlist(lapply(list[i], function(b) which(b==".")))})
Regarding the expression for ".", I already tried other regular expressions like "\." and "[.]".

We could loop over the list, keep only the elements that are not "." or "", and then use Filter to drop the list elements that are left empty:
Filter(length, lapply(l, function(x) x[! x %in% c(".", "")]))
-output
[[1]]
[1] "2015" "2016"
[[2]]
[1] "0" "2418.9" "292.4"
[[3]]
[1] "2" "2394.6"
[[4]]
[1] "011" "934.0" "1200.7"

A two-step solution doubling down on lapply, if needed/as an alternative:
# data
l <- list(c("2015","2016"),c(""),c("."), c("0","2418.9","292.4"),c("2",".",".","2394.6"))
# remove those with "." or ""
l2 <- lapply(l, function(x) {x[!(x %in% c(".", ""))]})
# remove empty list positions
l2[lengths(l2) > 0]
#[[1]]
#[1] "2015" "2016"
#[[2]]
#[1] "0" "2418.9" "292.4"
#[[3]]
#[1] "2" "2394.6"
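The question's own idea, a list of TRUE/FALSE vectors saying which elements to keep, can also be written out explicitly. The following is a rough sketch of that approach; the names keep, filtered and l_final are used here only for illustration.

```r
l <- list(c("2015", "2016"), c(""), c("."),
          c("0", "2418.9", "292.4"),
          c("2", ".", ".", "2394.6"),
          c("011", "", "934.0", "1200.7"))

# One logical vector per list element: TRUE means "keep this entry"
keep <- lapply(l, function(x) !(x %in% c(".", "")))

# Apply each mask to its own vector
filtered <- Map(`[`, l, keep)

# Drop the vectors that ended up empty
l_final <- filtered[lengths(filtered) > 0]
l_final
```

The elements stay character vectors throughout, so leading zeros such as in "011" are preserved.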

Why is this grep exclusion failed to work in R?

I am trying to exclude certain characters when using grep in R, but I cannot get the result that I expect.
Here is the code:
x <- c("a", "ab", "b", "abc")
grep("[^b]", x, value=T)
> [1] "a" "ab" "abc"
I want to grab anything in vector x that does not contain b. It should not return "ab" or "abc".
Ultimately I want to pick up any element that contains "a" but not "b".
This is the result that I would expect:
grep("a[^b]", x, value=T)
> [1] "a"
How can I do that?

Try this:
grep("^[^b]*a[^b]*$", x, value=TRUE)
# [1] "a"
It looks for the start of the string, then allows any number of characters that are not "b", then an "a", then any number of characters that are not "b" again and then the end of the string is reached.

We can use the invert argument of grep, which returns the values that do not match. So here it returns those values which do not have "b" in them.
grep("b", x, value = TRUE, invert = TRUE)
#[1] "a"
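Note that inverting the match only tests for the absence of "b". If the vector could also hold strings with neither letter, the stated goal (contains "a" but not "b") can be sketched by combining two grepl() calls. The extra elements "c" and "ca" are added here purely for illustration.

```r
x <- c("a", "ab", "b", "abc", "c", "ca")

has_a <- grepl("a", x, fixed = TRUE)  # contains "a"
has_b <- grepl("b", x, fixed = TRUE)  # contains "b"

# Keep elements that contain "a" and do not contain "b"
result <- x[has_a & !has_b]
result
# [1] "a"  "ca"
```

With this input, the inverted grep alone would also return "c", which contains no "a".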

I got the result you are looking for with this regular expression in grep:
grep("^[^b]*$", x, value=TRUE)
[1] "a"

R Removing an element of a list element

This is more a general question on the behavior of lists in R, but the specific problem is:
I have a list of groups of words which I'm trying to manually remove specific words for - where no word is mentioned twice.
Currently, I'm using this method
l = strsplit(c("a b", "c d"), " ")
> l
[[1]]
[1] "a" "b"
[[2]]
[1] "c" "d"
# remove the value "d"
l = lapply(l, function(x) { x[x != "d"] })
> l
[[1]]
[1] "a" "b"
[[2]]
[1] "c"
Is there any sort of built in list indexing method that would be preferable to use? I feel like I should just be able to parse the list without using lapply. If not, is it possible that someone could explain why this is the case?
Thanks

You need to go through each element of the list and check whether the vector contains d in order to filter/remove it.
One reason is that a list can contain various types of data (functions, data.frames, numerics, characters, booleans, other lists, classes), so there can't be vectorized operations (which are, as the name suggests, for vectors).
What you are doing is filtering on the front end, i.e. once you already have the list. It could be preferable to filter on the back end, i.e. on the vector before the list is created:
l = strsplit(gsub('d','',c("a b", "c d")), " ")
#[[1]]
#[1] "a" "b"
#[[2]]
#[1] "c"
Some alternative solution for a front end filtering:
lapply(l, grep, pattern='[^d]', value=T)
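Both regex variants above work on characters rather than whole words, so a longer word that merely contains the letter d would be affected differently than an exact match. Since the question states that no word occurs twice, a whole-word alternative can be sketched with setdiff(), which would otherwise also drop duplicates. The third input string is made up to show the difference.

```r
# "dad" contains the letter d but is not the word "d"
l <- strsplit(c("a b", "c d", "dad d"), " ")

# Remove only elements that are exactly equal to "d"
cleaned <- lapply(l, setdiff, "d")
cleaned
```

Here "dad" survives while the standalone "d" is removed.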

using paste with a list

I'm trying to understand the behavior of strsplit and paste, which are inverse functions. However, when I strsplit a vector, a list is returned, like so:
> strsplit(c("on,e","tw,o","thre,e","fou,r"),",")
[[1]]
[1] "on" "e"
[[2]]
[1] "tw" "o"
[[3]]
[1] "thre" "e"
[[4]]
[1] "fou" "r"
I tried using lapply to cat the elements of the list back together, but it doesn't work:
> lapply(strsplit(c("on,e","tw,o","thre,e","fou,r"),","),cat)
on etw othre efou r[[1]]
NULL
[[2]]
NULL
[[3]]
NULL
[[4]]
NULL
The same formula with paste instead of cat actually does nothing at all! Why am I getting these results? and how can I get the result I want, which is the original vector back again?
(Obviously, in my actual code I'm trying to do more with the strsplit and cat than just return the original vector, but I think a solution to this problem will work for mine. Thanks!)

While yes, cat will concatenate and print to the console, it does not actually function in the same way paste does. Its result is best explained in help("cat").
The collapse argument in paste is effectively the opposite of the split argument in strsplit. And you can use sapply to return the simplified pasted vector.
x <- c("on,e","tw,o","thre,e","fou,r")
( y <- sapply(strsplit(x, ","), paste, collapse = ",") )
# [1] "on,e" "tw,o" "thre,e" "fou,r"
( z <- vapply(strsplit(x, ","), paste, character(1L), collapse = ",") )
# [1] "on,e" "tw,o" "thre,e" "fou,r"
identical(x, y)
# [1] TRUE
identical(x, z)
# [1] TRUE
Note that for cases like this, vapply will be more efficient than sapply. And adding fixed = TRUE in strsplit should increase efficiency as well.
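As for why paste on its own appeared to do nothing: without collapse, paste() works element-wise and hands back a vector of the same length, so each split piece is returned unchanged. A small sketch of the difference, using one of the strings from the question:

```r
parts <- strsplit("on,e", ",", fixed = TRUE)[[1]]

# Element-wise: nothing to combine with, so the pieces come back as they were
unchanged <- paste(parts)

# collapse joins all pieces into a single string
joined <- paste(parts, collapse = ",")

unchanged
# [1] "on" "e"
joined
# [1] "on,e"
```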

In R, how can a string be split without using a seperator

I am trying the split method, and I want to get the second element of a string containing only 2 elements. The size of the string is 2.
Example:
string= "AC"
The result should be a split after the first letter ("A"), so that I get:
res= [,1] [,2]
[1,] "A" "C"
I tried it with split, but I have no idea how to split after the first element.

strsplit() will do what you want (if I understand your question). You need to split on "" to split the string into its elements. Here is an example showing how to do what you want on a vector of strings:
strs <- rep("AC", 3) ## your string repeated 3 times
next, split each of the three strings
sstrs <- strsplit(strs, "")
which produces
> sstrs
[[1]]
[1] "A" "C"
[[2]]
[1] "A" "C"
[[3]]
[1] "A" "C"
This is a list, so we can process it with lapply() or sapply(). We need to subset each element of sstrs to select the second element. For this we apply the [ function:
sapply(sstrs, `[`, 2)
which produces:
> sapply(sstrs, `[`, 2)
[1] "C" "C" "C"
If all you have is one string, then
strsplit("AC", "")[[1]][2]
which gives:
> strsplit("AC", "")[[1]][2]
[1] "C"

split isn't used for this kind of string manipulation. What you're looking for is strsplit, which in your case would be used something like this:
strsplit(string,"",fixed = TRUE)
You may not need fixed = TRUE, but it's a habit of mine as I tend to avoid regular expressions. You seem to indicate that you want the result to be something like a matrix. strsplit will return a list, so you'll want something like this:
strsplit(string,"",fixed = TRUE)[[1]]
and then pass the result to matrix.
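That last step might look like the following sketch, assuming the one-row layout shown in the question is what is wanted. The names chars and res are chosen here for illustration.

```r
string <- "AC"

# Split into single characters and take the vector out of the list
chars <- strsplit(string, "", fixed = TRUE)[[1]]

# One row, one column per character
res <- matrix(chars, nrow = 1)
res
#      [,1] [,2]
# [1,] "A"  "C"

res[1, 2]  # the second character
```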

If you are sure that it's always a two-character string (check it with all(nchar(x)==2)) and you want only the second character, then you could use sub or substr:
x <- c("ab", "12")
sub(".", "", x)
# [1] "b" "2"
substr(x, 2, 2)
# [1] "b" "2"
