R - test if first occurrence of string1 is followed by string2 - r

I have an R string, with the format
s = `"[some letters and numbers]_[a number]_[more numbers, letters, punctuation, etc, anything]"`
I simply want a way of checking if s contains "_2" in the first position. In other words, after the first _ symbol, is the single number a "2"? How do I do this in R?
I'm assuming I need some complicated regex expression?
Examples:
39820432_2_349802j_32hfh = TRUE
43lda821_9_428fj_2f = FALSE (notice there is a _2 there, but not in the right spot)

> grepl("^[^_]+_1",s)
[1] FALSE
> grepl("^[^_]+_2",s)
[1] TRUE
Basically, match everything at the beginning except _, and then the _2.
+1 to @Ananda_Mahto for suggesting grepl instead of grep.
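As a quick check, the pattern can be run against both examples from the question at once, since grepl is vectorised. The stricter variant below is an assumption on my part: it reads "the single number" as meaning that the 2 must be the whole field, so that something like _25_ is rejected. The third test string is made up for illustration.

```r
# The two examples from the question, plus a hypothetical "_25" case
s <- c("39820432_2_349802j_32hfh", "43lda821_9_428fj_2f", "abc_25_x")

# Original pattern: the first underscore is immediately followed by a 2
loose <- grepl("^[^_]+_2", s)

# Stricter sketch: the 2 must be followed by another underscore or the end
strict <- grepl("^[^_]+_2(_|$)", s)

print(loose)   # TRUE FALSE TRUE
print(strict)  # TRUE FALSE FALSE
```

If multi-digit fields such as _25_ should also count as a match, the original pattern is the one to keep.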

I think it's worth answering the generic question "R - test if string contains string" here.
For that, use the grep function.
# example:
> if(length(grep("ab","aacd"))>0) print("found") else print("Not found")
[1] "Not found"
> if(length(grep("ab","abcd"))>0) print("found") else print("Not found")
[1] "found"
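Since grepl already returns a logical value, the length(grep(...)) > 0 wrapper can usually be dropped. A minimal sketch, assuming the search term should be treated as a literal string rather than a regular expression (hence fixed = TRUE); contains_ab is a hypothetical helper name:

```r
# Hypothetical helper: grepl returns TRUE/FALSE directly,
# and fixed = TRUE matches the text literally instead of as a regex
contains_ab <- function(x) grepl("ab", x, fixed = TRUE)

if (contains_ab("aacd")) print("found") else print("Not found")
if (contains_ab("abcd")) print("found") else print("Not found")
```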

Related

Check an expression including a dot [duplicate]

Let's say I have a string "Hello." I want to see if this string contains a period:
text <- "Hello."
results <- grepl(".", text)
This returns results as TRUE, but it would return that as well if text is "Hello" without the period.
I'm confused, I can't find anything about this in the documentation and it only does this for the period.
Any ideas?
See the differences with these examples
> grepl("\\.", "Hello.")
[1] TRUE
> grepl("\\.", "Hello")
[1] FALSE
The . matches any character, as pointed out by SimonO101. If you want to look for a literal ., you have to escape it with \\. (the backslash itself is doubled because it sits inside an R string).
R documentation is extensive on regular expressions, you can also take a look at this link to understand the use of the dot.
I use Jilber's approach usually but here are two other ways:
> grepl("[.]", "Hello.")
[1] TRUE
> grepl("[.]", "Hello")
[1] FALSE
> grepl(".", "Hello.", fixed = TRUE)
[1] TRUE
> grepl(".", "Hello", fixed = TRUE)
[1] FALSE

replace characters after occurrence of a specific character in R

I have a list of characters like this:-
a <- c("NM020506_1","NM_020519_1","NM00_1030297.2")
I am trying to get an output like this using base R.
NM020506, NM, NM00
i.e., ignore everything after the first "_".
I tried something like this, but clearly it is not correct.
a
[1] "NM020506_1" "NM_020519_1" "NM00_1030297.2"
> substr(a,1,unlist(gregexpr(pattern ='_',a))-1)
[1] "NM020506" "NM" "NM00_1030"
>
You can use sub function, whereby you substitute everything after _ with empty.
a <- c("NM020506_1","NM_020519_1","NM00_1030297.2")
sub("_.*","",a)
[1] "NM020506" "NM" "NM00"
There is no need to use gregexpr: it returns the positions of all matches, and you only need the first _. You can use regexpr instead, which returns only the first match:
substr(a,1,regexpr(pattern ='_',a)-1)
[1] "NM020506" "NM" "NM00"
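To see the difference between the two functions, and one edge case worth knowing about, here is a small sketch. The string without an underscore is my own addition for illustration.

```r
a <- c("NM020506_1", "NM_020519_1", "NM00_1030297.2")

# regexpr: position of the first match only, one number per string
first_pos <- as.integer(regexpr("_", a, fixed = TRUE))
print(first_pos)     # 9 3 5

# gregexpr: positions of all matches, one list element per string
all_pos <- lapply(gregexpr("_", a, fixed = TRUE), as.integer)
print(all_pos[[2]])  # 3 10

# Edge case (hypothetical input): a string without any underscore.
# regexpr returns -1 here, so the substr approach yields an empty string,
# while sub leaves the input unchanged.
b <- "NM123"
print(sub("_.*", "", b))
print(substr(b, 1, regexpr("_", b) - 1))
```

If some inputs may lack the separator, the sub version is probably the more forgiving choice.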
You can use strsplit as:
#data
a <- c("NM020506_1","NM_020519_1","NM00_1030297.2")
sapply(strsplit(a,"_"),function(x)x[1])
#[1] "NM020506" "NM" "NM00"

r grepl to distinguish between no and not

I am dealing with two strings like this below
x1 <- "Unknown, because not discussed"
x2 <- "Not at goal, no."
How do I use the grepl function to distinguish between these two strings?
When I use grepl("no", x1), it returns TRUE, which is not correct. This is picking up the no in not or Unknown. How do I use a string parsing function to detect strings that contain the word no explicitly? Any advice is much appreciated.
You can use word boundary \\b to distinguish them. \\bno\\b will match no only without preceding and following word characters:
grepl("\\bno\\b", x1)
# [1] FALSE
grepl("\\bno\\b", x2)
# [1] TRUE
I can think of a couple of options for matching "no" but not "not":
Using the \b "word boundary" pattern:
> x = c("Unknown, because not discussed", "Not at goal, no.")
> grepl("\\bno\\b", x)
[1] FALSE TRUE
Using [^t] to exclude "not":
> grepl("\\bno[^t]", x)
[1] FALSE TRUE
For matching the word "no" by itself the word boundary option "\\bno\\b" is probably best.
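To illustrate why the word boundary version is the safer choice, the sketch below runs both patterns on a few made-up strings. The [^t] pattern needs some character after "no", so it misses a "no" at the very end of a string, and it accepts words such as "north". Adding ignore.case = TRUE is an assumption about what is wanted if "No" should count as well.

```r
# Made-up test strings, not from the question
x <- c("no", "Answer: no", "go north", "No.")

no_not_t    <- grepl("\\bno[^t]", x)
no_boundary <- grepl("\\bno\\b", x)
no_any_case <- grepl("\\bno\\b", x, ignore.case = TRUE)

print(no_not_t)     # FALSE FALSE TRUE FALSE
print(no_boundary)  # TRUE TRUE FALSE FALSE
print(no_any_case)  # TRUE TRUE FALSE TRUE
```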

Remove fields with special characters

I'm trying to remove all fields that have special characters (#?.* etc) in their text.
I think I should be using
Filter(function(x) {grepl('|[^[:punct:]]).*?', x)} == FALSE, data$V1)
where data$V1 contains my data. However, it seems like
grepl('|[^[:punct:]]).*?', x)
fails with trivial examples like
grepl('|[^[:punct:]]).*?', 'M')
which outputs TRUE even though M has no special characters. How should I be using grepl to remove fields with special characters from a column of data?
To search for "special characters", you can search for the negation of alphanumeric characters as such:
grepl('[^[:alnum:]_]+', c('m','m#','M9*'))
# [1] FALSE TRUE TRUE
or use the symbol \W
grepl('\\W+', c('m','m#','M9*'))
# [1] FALSE TRUE TRUE
\W is explained in the regular expression help:
"The symbol \w matches a ‘word’ character (a synonym for [[:alnum:]_], an extension) and \W is its negation ([^[:alnum:]_])."
Starting a regular expression with a | makes it useless, since the empty alternative before the pipe will match anything.
See this JS example:
console.log('With the starting pipe => ' + /|([\W]).*?/.test('M'));
console.log('Without the starting pipe => ' + /([\W]).*?/.test('M'));
Simply put the characters you want to exclude inside [...], pass that as the pattern argument to grepl, then negate the result.
data$V1[!grepl("[#?.*]", data$V1)]
For example,
> x <- c("M", "3#3", "8.*x")
> x[!grepl("[#?.*]", x)]
[1] "M"
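Which characters count as "special" depends on the character class chosen. As a rough sketch, the comparison below adds a made-up value containing an underscore: [[:punct:]] treats the underscore as punctuation, whereas \W does not, because the underscore is a word character.

```r
# "a_b" is a hypothetical extra value added for illustration
x <- c("M", "3#3", "8.*x", "a_b")

keep_punct <- x[!grepl("[[:punct:]]", x)]  # drops anything containing punctuation
keep_word  <- x[!grepl("\\W", x)]          # drops anything containing a non-word character

print(keep_punct)  # "M"
print(keep_word)   # "M" "a_b"
```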

Grep in R using OR and NOT

I have the following vector in R and I would like to find all the strings containing A's and B's but not the number 2.
vec1<-c("A_cont_1", "A_cont_12", "B_treat_8", "AB_cont_22", "cont_21_Aa")
The following does not work:
grep("A|B|!2", vec1)
It gives me back all the strings:
[1] 1 2 3 4 5
The same is true for this example:
grep("A|B|-2", vec1)
What would be the correct syntax?
You can do this with a fairly simple regular expression:
grep("^[^2]*[AB][^2]*$", vec1)
In words, it means:
^ match the start of the string
[^2]* match anything except "2", zero or more times
[AB] match "A" or "B"
[^2]* match anything except "2", zero or more times
$ match the end of the string
I would use two grep calls:
intersect(grep("A|B",vec1),grep("2",vec1,invert=TRUE))
#[1] 1 3
OP, your attempt is pretty close, try this:
grep('^(A|B|[^2])*$', vec1)
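One caveat with this pattern, as far as I can tell: it only rules out strings containing a 2 and does not require an A or B to be present. It gives the expected answer for vec1 because every element there happens to contain an A or B. The vector below is made up to show the difference from the ^[^2]*[AB][^2]*$ pattern.

```r
# Hypothetical vector: the second element has neither A nor B
v <- c("A_cont_1", "cont_1", "A_cont_12")

no_two_only <- grepl("^(A|B|[^2])*$", v)     # accepts "cont_1" as well
needs_ab    <- grepl("^[^2]*[AB][^2]*$", v)  # requires an A or B

print(no_two_only)  # TRUE TRUE FALSE
print(needs_ab)     # TRUE FALSE FALSE
```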
grep generally does not work very well for doing a positive and a negative search in one invocation. You might be able to make it work with a complex regular expression, but you might be better off just doing:
grep '[AB]' somefile.txt | grep -v '2'
The R equivalent of that would be (note that the returned indices refer to the filtered vector, not to vec1):
grep("2", grep("A|B", vec1, value = TRUE), invert = TRUE)
I extended the answer provided by @eddi. I have tested it in R and it works for me. I changed the last element in your example, since all of the original elements contained A or B.
# Create the vector from the OP with one change
vec1<-c("A_cont_1", "A_cont_12", "B_treat_8", "AB_cont_22", "cont_21_dd")
I then ran the following code. It will tell you which results you should expect from each section of grep.
First, tell me which elements contain A or B
> grepl("A|B", vec1)
[1] TRUE TRUE TRUE TRUE FALSE
Now tell me which elements contain a "2"
> grepl("2", vec1)
[1] FALSE TRUE FALSE TRUE TRUE
The indices we want are 1 and 3 (A or B, but no "2"), so the match on "2" has to be inverted
> grep("2", grep("A|B", vec1, value = TRUE), invert = TRUE)
[1] 1 3
Done!
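Another way to combine the two conditions, sketched here with the vector from the question, is to build two logical vectors with grepl and combine them with & and !. Unlike a nested grep call, the resulting indices always refer to positions in vec1 itself. The lookahead variant assumes perl = TRUE so that the PCRE engine is used.

```r
vec1 <- c("A_cont_1", "A_cont_12", "B_treat_8", "AB_cont_22", "cont_21_Aa")

# Contains A or B, and does not contain 2
keep <- grepl("A|B", vec1) & !grepl("2", vec1)
print(which(keep))  # 1 3
print(vec1[keep])   # "A_cont_1" "B_treat_8"

# Same idea in a single pattern using a negative lookahead (needs perl = TRUE)
idx <- grep("^(?!.*2).*[AB]", vec1, perl = TRUE)
print(idx)          # 1 3
```

The two-step logical version is longer, but each condition can be read and tested separately.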

Resources