I have an R string, with the format
s = `"[some letters and numbers]_[a number]_[more numbers, letters, punctuation, etc, anything]"`
I simply want a way of checking if s contains "_2" in the first position. In other words, after the first _ symbol, is the single number a "2"? How do I do this in R?
I'm assuming I need some complicated regex expression?
Examples:
39820432_2_349802j_32hfh = TRUE
43lda821_9_428fj_2f = FALSE (notice there is a _2 there, but not in the right spot)
> grepl("^[^_]+_1",s)
[1] FALSE
> grepl("^[^_]+_2",s)
[1] TRUE
Basically, match everything at the beginning of the string that is not an _, followed by _2.
+1 to @Ananda_Mahto for suggesting grepl instead of grep.
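To check the pattern against both examples from the question at once, here is a small sketch. The third string, "abc_23_x", and the stricter pattern are my own additions for illustration: the original pattern also accepts a number that merely starts with 2, so if the field must be exactly "2" you may want to require a following _ or the end of the string.

```r
# The two strings from the question plus a made-up edge case ("abc_23_x")
s <- c("39820432_2_349802j_32hfh", "43lda821_9_428fj_2f", "abc_23_x")

# Pattern from the answer: first field, then "_2"
loose <- grepl("^[^_]+_2", s)

# Stricter variant (an assumption about what is wanted): "_2" must be the whole field
strict <- grepl("^[^_]+_2(_|$)", s)

print(loose)   # TRUE FALSE TRUE
print(strict)  # TRUE FALSE FALSE
```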
I think it's worth answering the generic question "R - test if string contains string" here.
For that, use the grep function.
# example:
> if(length(grep("ab","aacd"))>0) print("found") else print("Not found")
[1] "Not found"
> if(length(grep("ab","abcd"))>0) print("found") else print("Not found")
[1] "found"
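If the thing you are looking for is a plain substring rather than a pattern, grepl with fixed = TRUE may read a little more directly than checking the length of grep's result. A minimal sketch, with the vector x made up for illustration:

```r
# Plain substring test; fixed = TRUE turns off regex interpretation
x <- c("aacd", "abcd")
has_ab <- grepl("ab", x, fixed = TRUE)
print(has_ab)  # FALSE TRUE

# Same found / not found logic as above, for a single string
msg <- if (grepl("ab", "abcd", fixed = TRUE)) "found" else "Not found"
print(msg)     # "found"
```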
Related
Let's say I have a string "Hello." I want to see if this string contains a period:
text <- "Hello."
results <- grepl(".", text)
This returns results as TRUE, but it would return that as well if text is "Hello" without the period.
I'm confused, I can't find anything about this in the documentation and it only does this for the period.
Any ideas?
See the differences with these examples
> grepl("\\.", "Hello.")
[1] TRUE
> grepl("\\.", "Hello")
[1] FALSE
As pointed out by SimonO101, the . is a regex metacharacter that matches any single character. If you want to look for a literal . you have to escape it as \\. (the backslash is doubled because it sits inside an R string).
The R documentation is extensive on regular expressions (see ?regex); you can also take a look at this link to understand the use of the dot.
I use Jilber's approach usually but here are two other ways:
> grepl("[.]", "Hello.")
[1] TRUE
> grepl("[.]", "Hello")
[1] FALSE
> grepl(".", "Hello.", fixed = TRUE)
[1] TRUE
> grepl(".", "Hello", fixed = TRUE)
[1] FALSE
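Putting the variants side by side on one vector may make the difference easier to see. This is only a sketch; the empty string is my own addition, to show that an unescaped "." needs at least one character to match.

```r
x <- c("Hello.", "Hello", "")

as_regex <- grepl(".", x)                # "." as a regex: any single character
escaped  <- grepl("\\.", x)              # escaped: a literal period
bracket  <- grepl("[.]", x)              # character class: a literal period
literal  <- grepl(".", x, fixed = TRUE)  # no regex at all

print(as_regex)  # TRUE TRUE FALSE
print(escaped)   # TRUE FALSE FALSE
print(bracket)   # TRUE FALSE FALSE
print(literal)   # TRUE FALSE FALSE
```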
I have a list of characters like this:-
a <- c("NM020506_1","NM_020519_1","NM00_1030297.2")
I am trying to get an output like this using base R.
NM020506, NM, NM00
i.e., ignore everything from the first "_" onward.
I tried something like this. But clearly it is not correct.
a
[1] "NM020506_1" "NM_020519_1" "NM00_1030297.2"
> substr(a,1,unlist(gregexpr(pattern ='_',a))-1)
[1] "NM020506" "NM" "NM00_1030"
You can use the sub function to replace everything from the first _ onward with an empty string.
a <- c("NM020506_1","NM_020519_1","NM00_1030297.2")
sub("_.*","",a)
[1] "NM020506" "NM" "NM00"
There is no need to use gregexpr: it is global, i.e. it returns the positions of all matches, whereas you only need the first _. Use regexpr instead, which returns only the first match:
substr(a,1,regexpr(pattern ='_',a)-1)
[1] "NM020506" "NM" "NM00"
You can use strsplit as:
#data
a <- c("NM020506_1","NM_020519_1","NM00_1030297.2")
sapply(strsplit(a,"_"),function(x)x[1])
#[1] "NM020506" "NM" "NM00"
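One thing worth checking before picking among these three approaches is what happens when a string has no "_" at all. A small sketch, where "NOUNDERSCORE" is a made-up input: sub and strsplit leave it unchanged, while the regexpr/substr route returns an empty string because regexpr reports -1 for no match.

```r
# "NOUNDERSCORE" is a hypothetical input without any "_"
a2 <- c("NM020506_1", "NM_020519_1", "NOUNDERSCORE")

via_sub      <- sub("_.*", "", a2)
via_regexpr  <- substr(a2, 1, regexpr("_", a2) - 1)
via_strsplit <- vapply(strsplit(a2, "_", fixed = TRUE),
                       function(x) x[1], character(1))

print(via_sub)       # "NM020506" "NM" "NOUNDERSCORE"
print(via_regexpr)   # "NM020506" "NM" ""
print(via_strsplit)  # "NM020506" "NM" "NOUNDERSCORE"
```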
I am dealing with two strings like this below
x1 <- "Unknown, because not discussed"
x2 <- "Not at goal, no."
How do I use the grepl function to distinguish between these two strings?
When I use grepl("no", x1), it returns TRUE, which is not what I want: it is picking up the "no" in "not" or "Unknown". How do I detect only strings that contain the word "no" explicitly? Any advice is much appreciated.
You can use the word boundary \\b to distinguish them. \\bno\\b matches "no" only when it is not directly preceded or followed by a word character:
grepl("\\bno\\b", x1)
# [1] FALSE
grepl("\\bno\\b", x2)
# [1] TRUE
I can think of a couple of options for matching "no" but not "not":
Using the \b "word boundary" pattern:
> x = c("Unknown, because not discussed", "Not at goal, no.")
> grepl("\\bno\\b", x)
[1] FALSE TRUE
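Using [^t] to exclude "not" (note that [^t] must match some character, so this misses a "no" at the very end of a string and still matches words such as "none"):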
> grepl("\\bno[^t]", x)
[1] FALSE TRUE
For matching the word "no" by itself the word boundary option "\\bno\\b" is probably best.
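To see where the two patterns differ, here is a sketch on a few made-up strings (none of them come from the question). It also shows ignore.case = TRUE, in case a capitalised "No" should count; whether it should is an assumption about the data.

```r
# Hypothetical test strings chosen to separate the two patterns
y <- c("I said no", "none of it", "not now", "No.")

boundary    <- grepl("\\bno\\b", y)                      # whole word "no"
no_t        <- grepl("\\bno[^t]", y)                     # "no" + any character except t
boundary_ci <- grepl("\\bno\\b", y, ignore.case = TRUE)  # whole word, any case

print(boundary)     # TRUE FALSE FALSE FALSE
print(no_t)         # FALSE TRUE TRUE FALSE
print(boundary_ci)  # TRUE FALSE FALSE TRUE
```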
I'm trying to remove all fields that have special characters (#?.* etc) in their text.
I think I should be using
Filter(function(x) {grepl('|[^[:punct:]]).*?', x)} == FALSE, data$V1)
where data$V1 contains my data. However, it seems like
grepl('|[^[:punct:]]).*?', x)
fails with trivial examples like
grepl('|[^[:punct:]]).*?', 'M')
which outputs TRUE even though M has no special characters. How should I be using grepl to remove fields with special characters from a column of data?
To search for "special characters", you can search for the negation of alphanumeric characters as such:
grepl('[^[:alnum:]_]+', c('m','m#','M9*'))
# [1] FALSE TRUE TRUE
or use the symbol \W
grepl('\\W+', c('m','m#','M9*'))
# [1] FALSE TRUE TRUE
\W is explained in the regular expression help:
"The symbol \w matches a ‘word’ character (a synonym for [[:alnum:]_], an extension) and \W is its negation ([^[:alnum:]_])."
Starting a regular expression with a | makes it useless: the alternative before the | is empty, and an empty pattern matches any string.
See this JS example:
console.log('With the starting pipe => ' + /|([\W]).*?/.test('M'));
console.log('Without the starting pipe => ' + /([\W]).*?/.test('M'));
Simply put the characters you want to exclude inside [...], pass that as the pattern argument to grepl, then negate the result.
data$V1[!grepl("[#?.*]", data$V1)]
For example,
> x <- c("M", "3#3", "8.*x")
> x[!grepl("[#?.*]", x)]
[1] "M"
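Which pattern to use depends on what counts as "special", so here is a sketch comparing three options on one vector. The extra elements "a_b" and "abc123" are made up for illustration. Inside [...] the characters ., * and ? are literal, and note that [[:punct:]] treats the underscore as punctuation.

```r
x <- c("M", "3#3", "8.*x", "a_b", "abc123")

no_listed <- x[!grepl("[#?.*]", x)]         # drop only the listed characters
no_punct  <- x[!grepl("[[:punct:]]", x)]    # drop any punctuation, "_" included
only_word <- x[!grepl("[^[:alnum:]_]", x)]  # keep letters, digits and "_" only

print(no_listed)  # "M" "a_b" "abc123"
print(no_punct)   # "M" "abc123"
print(only_word)  # "M" "a_b" "abc123"
```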
I have the following vector in R and I would like to find all the strings containing A's and B's but not the number 2.
vec1<-c("A_cont_1", "A_cont_12", "B_treat_8", "AB_cont_22", "cont_21_Aa")
The following does not work:
grep("A|B|!2", vec1)
It gives me back all the strings:
[1] 1 2 3 4 5
The same is true for this example:
grep("A|B|-2", vec1)
What would be the correct syntax?
You can do this with a fairly simple regular expression:
grep("^[^2]*[AB][^2]*$", vec1)
In words, it means:
^ match the start of the string
[^2]* match anything except "2", zero or more times
[AB] match "A" or "B"
[^2]* match anything except "2", zero or more times
$ match the end of the string
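The breakdown above can be checked on a few strings. This is only a sketch, and the test vector is made up: it includes one string with no A or B and one that starts with a 2, to show that the pattern rejects both.

```r
# Hypothetical inputs: with/without A or B, with/without a "2"
test <- c("A_cont_1", "A_cont_12", "cont_1", "2B", "B")

hits <- grepl("^[^2]*[AB][^2]*$", test)
print(hits)  # TRUE FALSE FALSE FALSE TRUE
```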
I would use two grep calls:
intersect(grep("A|B",vec1),grep("2",vec1,invert=TRUE))
#[1] 1 3
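The same two-step idea can also be written with grepl and logical operators, which some people find easier to read. A minimal sketch using the vector from the question; the name keep is just for illustration.

```r
vec1 <- c("A_cont_1", "A_cont_12", "B_treat_8", "AB_cont_22", "cont_21_Aa")

# TRUE where the string has an A or B and does not contain a "2"
keep <- grepl("[AB]", vec1) & !grepl("2", vec1, fixed = TRUE)

print(which(keep))  # 1 3
print(vec1[keep])   # "A_cont_1" "B_treat_8"
```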
OP, your attempt is pretty close, try this:
grep('^(A|B|[^2])*$', vec1)
# note: this only rejects strings containing "2"; a string with no A or B at all would also match
grep generally does not work very well for doing a positive and a negative search in one invocation. You might be able to make it work with a complex regular expression, but you might be better off just doing:
grep '[AB]' somefile.txt | grep -v '2'
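The R equivalent of that would be:
grep("2", grep("A|B", vec1, value = TRUE), invert = TRUE)
# note: the indices returned refer to the filtered vector, not to vec1; add value = TRUE to get the strings themselves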
I extended the answer provided by @eddi. I have tested it in R and it works for me. I changed the last element in your example, since otherwise every element contains an A or B.
# Create the vector from the OP with one change
vec1<-c("A_cont_1", "A_cont_12", "B_treat_8", "AB_cont_22", "cont_21_dd")
I then ran the following code. It shows the result you should expect from each part of the nested grep call.
First, tell me which elements contain A or B
> grepl("A|B", vec1)
[1] TRUE TRUE TRUE TRUE FALSE
Now tell me which elements contain a "2"
> grepl("2", vec1)
[1] FALSE TRUE FALSE TRUE TRUE
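The indices we want are 1 and 3 (elements with an A or B and no "2"), so the inner result has to be searched with invert = TRUE:
> grep("2", grep("A|B", vec1, value = TRUE), invert = TRUE)
[1] 1 3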
Done!
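One caveat with nesting grep calls: the outer call returns positions within the already filtered vector, which only line up with the original vector when nothing earlier was dropped. A sketch with a reordered, made-up vector v shows the difference; combining grepl results keeps the original positions.

```r
# Hypothetical vector where the element without A or B comes first
v <- c("cont_21_dd", "A_cont_1", "A_cont_12", "B_treat_8")

# Positions within the filtered vector c("A_cont_1", "A_cont_12", "B_treat_8")
nested <- grep("2", grep("A|B", v, value = TRUE), invert = TRUE)

# Positions within v itself
direct <- which(grepl("A|B", v) & !grepl("2", v))

print(nested)  # 1 3
print(direct)  # 2 4
```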