Why does this grep exclusion fail to work in R?

I am trying to exclude certain characters when using grep in R, but I cannot get the result that I expect.
Here is the code:
x <- c("a", "ab", "b", "abc")
grep("[^b]", x, value=T)
> [1] "a" "ab" "abc"
I want to grab anything in vector x that does not contain b. It should not return "ab" or "abc".
Ultimately I want to pick up any element that contains "a" but not "b".
This is the result that I would expect:
grep("a[^b]", x, value=T)
> [1] "a"
How can I do that?

Try this:
grep("^[^b]*a[^b]*$", x, value=TRUE)
# [1] "a"
It looks for the start of the string, then allows any number of characters that are not "b", then an "a", then any number of characters that are not "b" again and then the end of the string is reached.
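As a quick check of that pattern, here is a small sketch on a slightly larger vector. The extra elements ("ca", "cab", "c") are made up for illustration and are not part of the question.

```r
# y extends the question's vector with hypothetical extra elements
y <- c("a", "ab", "b", "abc", "ca", "cab", "c")

# Keep strings that contain an "a" and no "b" anywhere
res <- grep("^[^b]*a[^b]*$", y, value = TRUE)
res
# [1] "a"  "ca"
```

"c" is dropped because the pattern requires at least one "a", and "cab" is dropped because a "b" appears after the "a".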

We can use the invert argument of grep, which returns the values that do not match. Here it returns the elements that do not have "b" in them.
grep("b", x, value = TRUE, invert = TRUE)
#[1] "a"
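Note that invert = TRUE only excludes "b"; it does not by itself require an "a". If both conditions matter, one possible sketch is to combine two grepl calls. The extra element "c" below is hypothetical, added only to show the difference.

```r
# "c" is a hypothetical extra element, not in the original question
x <- c("a", "ab", "b", "abc", "c")

# Only excludes elements containing "b"
no_b <- grep("b", x, value = TRUE, invert = TRUE)
no_b
# [1] "a" "c"

# Requires "a" and excludes "b"
a_not_b <- x[grepl("a", x, fixed = TRUE) & !grepl("b", x, fixed = TRUE)]
a_not_b
# [1] "a"
```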

I got the result you are looking for using this regular expression in grep:
grep("^[^b]*$", x, value=TRUE)
[1] "a"

Related

How to extract the unique letters from a word of consecutive letters?

For example, there is a character string x = "AAATTTGGAA".
What I want to achieve is to split x into runs of consecutive letters: "AAA", "TTT", "GG", "AA".
The unique letter of each chunk is then "A", "T", "G", "A", so the expected output is ATGA.
How should I get this?
Here is a useful regex trick approach:
x <- "AAATTTGGAA"
out <- strsplit(x, "(?<=(.))(?!\\1)", perl=TRUE)[[1]]
out
[1] "AAA" "TTT" "GG" "AA"
The regex pattern used here says to split at any boundary where the preceding and following characters are different.
(?<=(.)) lookbehind and also capture preceding character in \1
(?!\\1) then lookahead and assert that following character is different
You can split the string into individual characters, then use rle to find the consecutive runs and take one value per run.
x <- "AAATTTGGAA"
vec <- unlist(strsplit(x, ''))
rle(vec)$values
#[1] "A" "T" "G" "A"
paste0(rle(vec)$values, collapse = '')
#[1] "ATGA"
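If the chunks themselves are also needed, the same rle result carries the run lengths. A minimal sketch, assuming R 3.3.0 or later for strrep:

```r
x <- "AAATTTGGAA"
r <- rle(strsplit(x, "")[[1]])

# Rebuild each run by repeating its letter by the run length
chunks <- strrep(r$values, r$lengths)
chunks
# [1] "AAA" "TTT" "GG"  "AA"
```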
We can use regmatches with the pattern (.)\\1+ like below
> regmatches(x,gregexpr("(.)\\1+",x))[[1]]
[1] "AAA" "TTT" "GG" "AA"
or if you need the unique letters only
> gsub("(.)\\1+", "\\1", x)
[1] "ATGA"
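One caveat with (.)\\1+ in regmatches: it needs at least two identical letters, so a run of length one is skipped. A sketch with a hypothetical input that contains single-letter runs, using (.)\\1* instead (perl = TRUE is used here as an assumption to keep the backreference behavior predictable):

```r
# y is a hypothetical input with single-letter runs "T" and "A"
y <- "AAATGGA"

plus <- regmatches(y, gregexpr("(.)\\1+", y, perl = TRUE))[[1]]
plus
# [1] "AAA" "GG"

star <- regmatches(y, gregexpr("(.)\\1*", y, perl = TRUE))[[1]]
star
# [1] "AAA" "T"   "GG"  "A"
```

The gsub version is not affected, because single letters are simply left unchanged.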

Is it possible to remove variables with a certain pattern from a datatable or list?

For example if I have a list which contains: "a", "ab", "b", "c", "ad" as variables.
Is it possible to remove all variables which contain an "a", without writing every single variable down?
I think grep or grepl could help
> grep("a",v,value = TRUE, invert = TRUE)
[1] "b" "c"
or
> v[!grepl("a",v)]
[1] "b" "c"
Data
v <- c("a","ab","b","c","ad")
“variables” are conventionally called “names” in R.
So if you want to remove them from a list-like structure, you can manipulate its names, and then subset the list with the resulting vector of names.
x = x[grep('a', names(x), value = TRUE, invert = TRUE)]
Or, using grepl instead:
x = x[! grepl('a', names(x))]
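For a concrete picture, here is a small sketch with a named list and a data frame. Both objects (lst and df) are hypothetical and only mirror the names from the question.

```r
# Hypothetical named list with the names from the question
lst <- list(a = 1, ab = 2, b = 3, c = 4, ad = 5)
kept <- lst[!grepl("a", names(lst), fixed = TRUE)]
names(kept)
# [1] "b" "c"

# The same idea for data frame columns; drop = FALSE keeps a data frame
df <- data.frame(a = 1:2, ab = 3:4, b = 5:6)
df_kept <- df[, !grepl("a", names(df), fixed = TRUE), drop = FALSE]
names(df_kept)
# [1] "b"
```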
An option with str_subset
library(stringr)
str_subset(v, "a", negate = TRUE)
#[1] "b" "c"
data
v <- c("a","ab","b","c","ad")

Regex Matching Negative values

I'm trying to create some simple and easy to write content-clusters with multiple regexes.
Imagine a list of strings: c("a","b","ac")
The groups I need to define are "All: a's" and "All: b's". So the values "a" and "ac" are "A" and "b" is "B".
myDF$contentGroup <- sub(".*a.*", "A", myDF$stringList)
However, this results in a column "contentGroup" in my data frame that contains the original value of "stringList" wherever no match occurred. So if I run the same line of code with "B", it will overwrite the "A"s.
myDF$contentGroup <- sub(".*b.*", "B", myDF$stringList)
I just can't figure out how to do simple clustering in a single line of code, making it as simple as possible.
You can use grep to match 'a' and 'b' and replace the matching elements as follows (here x is the vector of strings, c("a","b","ac")):
x[grep('a', x, fixed = TRUE)] <- 'A'
x[grep('b', x, fixed = TRUE)] <- 'B'
x
#[1] "A" "B" "A"
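If a single expression that writes into a separate column is preferred, nested ifelse with grepl is one option. This is only a sketch: myDF is rebuilt here from the question's description, and NA for unmatched strings is an assumption.

```r
# myDF reconstructed from the question's example strings
myDF <- data.frame(stringList = c("a", "b", "ac"), stringsAsFactors = FALSE)

# The first matching rule wins; unmatched strings become NA (an assumption)
myDF$contentGroup <- ifelse(
  grepl("a", myDF$stringList, fixed = TRUE), "A",
  ifelse(grepl("b", myDF$stringList, fixed = TRUE), "B", NA_character_)
)
myDF$contentGroup
# [1] "A" "B" "A"
```

Because the result goes into contentGroup, the original stringList column is never overwritten.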

Subsetting on all but empty grep returns empty vector

Suppose I have some character vector, which I'd like to subset to elements that don't match some regular expression. I might use the - operator to remove the subset that grep matches:
> vec <- letters[1:5]
> vec
[1] "a" "b" "c" "d" "e"
> vec[-grep("d", vec)]
[1] "a" "b" "c" "e"
I'm given back everything except the entries that matched "d". But if I search for a regular expression that isn't found, instead of getting everything back as I would expect, I get nothing back:
> vec[-grep("z", vec)]
character(0)
Why does this happen?
It's because grep returns an integer vector, and when there's no match, it returns integer(0).
> grep("d", vec)
[1] 4
> grep("z", vec)
integer(0)
and since the - operator works elementwise and integer(0) has no elements, the negation doesn't change the integer vector:
> -integer(0)
integer(0)
so vec[-grep("z", vec)] evaluates to vec[-integer(0)] which in turn evaluates to vec[integer(0)], which is character(0).
You will get the behavior you expect with invert = TRUE:
> vec[grep("d", vec, invert = TRUE)]
[1] "a" "b" "c" "e"
> vec[grep("z", vec, invert = TRUE)]
[1] "a" "b" "c" "d" "e"
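Another way to avoid the empty-index problem is to subset with a logical vector from grepl, which always has the same length as the input. A minimal sketch:

```r
vec <- letters[1:5]

# A logical index has one entry per element, even when nothing matches
with_match <- vec[!grepl("d", vec)]
with_match
# [1] "a" "b" "c" "e"

no_match <- vec[!grepl("z", vec)]
no_match
# [1] "a" "b" "c" "d" "e"
```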

In R, how can a string be split without using a separator?

I am trying the split method, and I want to get the second element of a string that contains only 2 elements. The length of the string is 2.
Example:
string = "AC"
The result should be a split after the first letter ("A"), so that I get:
res =
     [,1] [,2]
[1,] "A"  "C"
I tried it with split, but I have no idea how to split after the first element.
strsplit() will do what you want (if I understand your question). You need to split on "" to split the string into its elements. Here is an example showing how to do what you want on a vector of strings:
strs <- rep("AC", 3) ## your string repeated 3 times
Next, split each of the three strings:
sstrs <- strsplit(strs, "")
which produces
> sstrs
[[1]]
[1] "A" "C"
[[2]]
[1] "A" "C"
[[3]]
[1] "A" "C"
This is a list, so we can process it with lapply() or sapply(). We need to subset each element of sstrs to select the second element. For this we apply the [ function:
sapply(sstrs, `[`, 2)
which produces:
> sapply(sstrs, `[`, 2)
[1] "C" "C" "C"
If all you have is one string, then
strsplit("AC", "")[[1]][2]
which gives:
> strsplit("AC", "")[[1]][2]
[1] "C"
split isn't used for this kind of string manipulation. What you're looking for is strsplit, which in your case would be used something like this:
strsplit(string,"",fixed = TRUE)
You may not need fixed = TRUE, but it's a habit of mine as I tend to avoid regular expressions. You seem to indicate that you want the result to be something like a matrix. strsplit will return a list, so you'll want something like this:
strsplit(string,"",fixed = TRUE)[[1]]
and then pass the result to matrix.
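A minimal sketch of that last step, assuming a one-row matrix is the shape the question is after:

```r
string <- "AC"

# Split into single characters, then lay them out as one row
res <- matrix(strsplit(string, "", fixed = TRUE)[[1]], nrow = 1)
res
#      [,1] [,2]
# [1,] "A"  "C"

# The second element
res[1, 2]
# [1] "C"
```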
If you are sure that it is always a two-character string (check it with all(nchar(x)==2)) and you want only the second character, then you could use sub or substr:
x <- c("ab", "12")
sub(".", "", x)
# [1] "b" "2"
substr(x, 2, 2)
# [1] "b" "2"
