Searching for an exact String in another String - r

I'm dealing with a very simple question and that is searching for a string inside of another string. Consider the example below:
bigStringList <- c("SO1.A", "SO12.A", "SO15.A")
strToSearch <- "SO1."
bigStringList[grepl(strToSearch, bigStringList)]
I'm looking for something that when I search for "SO1.", it only returns "SO1.A".
I saw many related questions on SO but most of the answers include grepl() which does not work in my case.
Thanks very much for your help in advance.

When you want the pattern treated as a literal string rather than as a regular expression, you can set fixed=TRUE:
grep("SO1.", bigStringList, fixed=TRUE, value=TRUE)
# [1] "SO1.A"
Otherwise, as Frank notes, you'll need to escape the period (so that it'll be interpreted as an actual . rather than as a symbol meaning "any single character"):
grep("SO1\\.", bigStringList, value=TRUE)
# [1] "SO1.A"
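To see the difference side by side, here is a small sketch using the vectors from the question. The startsWith() call is an addition that does not appear in the answer above; it assumes the search string is only wanted at the start of each element, which may not hold for your data.

```r
bigStringList <- c("SO1.A", "SO12.A", "SO15.A")
strToSearch <- "SO1."

# As a regular expression, "." matches any single character,
# so all three elements match.
as_regex <- bigStringList[grepl(strToSearch, bigStringList)]

# fixed = TRUE treats the pattern as a literal string.
as_literal <- bigStringList[grepl(strToSearch, bigStringList, fixed = TRUE)]

# Assumption: the match is only wanted at the start of the string.
as_prefix <- bigStringList[startsWith(bigStringList, strToSearch)]

print(as_regex)
print(as_literal)
print(as_prefix)
```

With fixed = TRUE there is nothing to escape, which makes it the safer choice when the search string comes from user input or another variable.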

Related

String Matching in R - Problem with pattern

I have a small Problem. I want to extract a special pattern like this:
v-97bcer
or b-chyfvg or ghd6db
I tried this:
identifier_1 <- "([:alnum:]{6})" # for things like this ghd6db
identifier_2 <- "([:lower:]{1})[- ][:alnum:]{6})" # for things like this v-97bcer or b-chyfvg
The problem is that the first "identifier" works reasonably well, but it also extracts other things, for example names. In an example like GHD6D8 the digits have no fixed position and can occur anywhere. I only know that the length is 6.
The second problem is that, for example, V-97bcer can also occur as v97bcer, but I need the format v-97bcer. Here too the digits can occur at any position.
I would appreciate it if somebody could help or point me to a good source for understanding how to do this. I don't have much experience with string matching. Thank you.
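This should work:
x <- c("v-97bcer", "b-chyfvg", "ghd6db", "v97bcer")
grep("^([a-z]-)?[a-z0-9]{6}$", x)
Note that in order to fix the length of the string I anchor the pattern with ^ and $. The optional prefix is written as [a-z]- so that only a literal hyphen is accepted after the leading letter; an unescaped . there would accept any character.
This pattern matches v-97bcer and b-chyfvg and ghd6db but not v97bcer.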
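grep() on its own returns the positions of the matching elements. As a rough sketch of how the values themselves could be recovered, the block below filters the vector with grepl() and also pulls identifiers out of a longer piece of text with regmatches(). It assumes the separator is always a literal hyphen and that identifiers are lowercase; the sentence in txt is made up for illustration.

```r
x <- c("v-97bcer", "b-chyfvg", "ghd6db", "v97bcer")

# Whole-string check: optional "letter-" prefix, then exactly 6 characters.
whole <- "^([a-z]-)?[a-z0-9]{6}$"
valid_ids <- x[grepl(whole, x)]

# Hypothetical free text containing the same identifiers.
txt <- "ids: v-97bcer, b-chyfvg and ghd6db; skip v97bcer"

# Word boundaries (\\b) replace ^ and $ when the identifier is embedded.
embedded <- "\\b([a-z]-)?[a-z0-9]{6}\\b"
found <- regmatches(txt, gregexpr(embedded, txt, perl = TRUE))[[1]]

print(valid_ids)
print(found)
```

Note that the embedded pattern would also pick up any ordinary six-letter lowercase word, so real text may need an extra filter, for example requiring at least one digit.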

How do you remove an isolated number from a string in R?

This is a silly question, but I can't seem to find a solution in R online. I am trying to remove an isolated number from a long string. For example, I would like to remove the number 27198 from the sentence below.
x <- "hello3 my name 27198 is 5joey"
I tried the following:
gsub("[0-9]","",x)
Which results in:
"hello my name is joey"
But I want:
"hello3 my name is 5joey"
This seems really simple, but I am not well versed with regular expressions. Thanks for your help!
We can specify a word boundary (\\b) on both sides of one or more digits ([0-9]+), so that only digit runs that form a whole word are removed:
gsub("\\b[0-9]+\\b", "", x)
#[1] "hello3 my name  is 5joey"
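Deleting only the digits leaves behind both of the spaces that surrounded the number, so the result contains a double space. One possible way around that, sketched below, is to let the pattern also consume any whitespace that follows the number. The perl = TRUE argument is a choice made here, not part of the original answer.

```r
x <- "hello3 my name 27198 is 5joey"

# Original pattern: removes the isolated number but keeps both spaces.
with_gap <- gsub("\\b[0-9]+\\b", "", x, perl = TRUE)

# Variant: also remove the whitespace that follows the number.
no_gap <- gsub("\\b[0-9]+\\b\\s*", "", x, perl = TRUE)

print(with_gap)
print(no_gap)
```

If the isolated number can appear at the very end of the string, a space will remain before it; wrapping the result in trimws() is one way to handle that case.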

Extract numerical value before a string in R

I have been mucking around with regex strings and strsplit but can't figure out how to solve my problem.
I have a collection of html documents that will always contain the phrase "people own these". I want to extract the number immediately preceding this phrase. i.e. '732,234 people own these' - I'm hoping to capture the number 732,234 (including the comma, though I don't care if it's removed).
The number and phrase are always surrounded by a . I tried using Xpath but that seemed even harder than a regex expression. Any help or advice is greatly appreciated!
example string: >742,811 people own these<
-> 742,811
Could you please try the following:
val <- "742,811 people own these"
gsub(' [a-zA-Z]+',"",val)
The output will be as follows:
[1] "742,811"
Explanation: gsub() performs a global substitution. Here it replaces every occurrence of a space followed by one or more lowercase or uppercase letters with an empty string in val, which leaves only the number.
Try using str_extract_all from the stringr package (here data is the character vector holding your text):
library(stringr)
str_extract_all(data, "\\d{1,3}(?:,\\d{3})*(?:\\.\\d+)?(?= people own these)")
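If you would rather not depend on stringr, a similar lookahead can be used in base R with regexpr() and regmatches(). This is only a sketch: the html vector and its <b> tags are invented sample input, and the simplified pattern accepts any run of digits and commas in front of the phrase.

```r
# Hypothetical sample input; the real documents will look different.
html <- c("<b>742,811 people own these</b>",
          "<b>95 people own these</b>",
          "<b>no count here</b>")

# Digits and commas immediately followed by the fixed phrase (lookahead).
pattern <- "[0-9][0-9,]*(?= people own these)"
m <- regexpr(pattern, html, perl = TRUE)

# regmatches() returns only the hits, so fill them into an NA vector.
counts_chr <- rep(NA_character_, length(html))
counts_chr[m > 0] <- regmatches(html, m)

# Drop the thousands separators to get numeric values.
counts_num <- as.numeric(gsub(",", "", counts_chr))

print(counts_chr)
print(counts_num)
```

Keeping the result the same length as the input, with NA for documents that lack the phrase, makes it easier to line the counts up with the files they came from.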

R gsub regular expression syntax error

If I have a string such as 2017-01-12T19:00:00.000+000 and I want 2017-01-12, that is, to delete everything from the "T" onward, how do I proceed? Would this not work?
gsub("$.*T"," ","2017-01-12T19:00:00.000+000")
I am referring to: http://www.endmemo.com/program/R/gsub.php
Thank you!
One approach is to match and capture the date portion of your string using gsub() and then replace the entire string with what was captured.
gsub("(\\d{4}-\\d{2}-\\d{2}).*","\\1","2017-01-12T19:00:00.000+000")
[1] "2017-01-12"
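Your original approach, corrected ($ matches the end of the string, so "$.*T" can never match; the pattern should start at the T instead):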
gsub("T.*","","2017-01-12T19:00:00.000+000")
[1] "2017-01-12"
As others have said, if the need for this format exceeds the scope of this particular timestamp string, then you should consider using a date API instead.
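As a minimal sketch of the date-API route in base R, as.Date() can parse the leading date directly. It relies on the documented strptime() behaviour that trailing characters after the part matched by the format are ignored; whether a Date object or a plain string is the better result depends on what you do with it next.

```r
ts <- "2017-01-12T19:00:00.000+000"

# Parse only the date part; the time and offset after it are ignored.
d <- as.Date(ts, format = "%Y-%m-%d")

# A Date supports arithmetic, unlike the string returned by gsub().
next_day <- d + 1

print(format(d))
print(format(next_day))
```

Unlike the regex versions, this returns NA for input that does not start with a valid date, which can help catch malformed timestamps early.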

Finding number of occurrences of a word in a file using R functions

I am using the following code to find the number of occurrences of the word "memory" in a file, and I am getting the wrong result. Can you please help me understand what I am missing?
NOTE1: The question is looking for exact occurrences of the word "memory"!
NOTE2: I have realized that they are looking for exactly "memory", and even something like "memory," is not accepted! I guess that is what caused the confusion. I tried it for the word "action" and the correct answer is 7! You can try it as well.
#names=scan("hamlet.txt", what=character())
names <- scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character())
Read 28230 items
> length(grep("memory",names))
[1] 9
Here's the file
The problem is really Shakespeare's use of punctuation. There are a lot of apostrophes (') in the text. When the R function scan encounters an apostrophe it assumes it is the start of a quoted string and reads all characters up until the next apostrophe into a single entry of your names array. One of these long entries happens to include two instances of the word "memory" and so reduces the total number of matches by one.
You can fix the problem by telling scan to regard all quotation marks as normal characters and not treat them specially:
names <- scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character(), quote=NULL )
Be careful when using the R implementation of grep. It does not behave in exactly the same way as the usual GNU/Linux program. In particular, the way you have used it here WILL find the number of matching words and not just the total number of matching lines as some people have suggested.
As pointed out by @andrew, my previous answer would give wrong results if a word is repeated on the same line. Based on other answers/comments, this one seems OK:
names = scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character(), quote=NULL )
idxs = grep("memory", names, ignore.case = TRUE)
length(idxs)
# [1] 10
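NOTE2 in the question says that only the bare token "memory" should count, whereas grep() also matches tokens that merely contain the pattern, such as "memory," or "memory's". The sketch below contrasts the two ways of counting on a small made-up vector, not on the Hamlet file, so the numbers say nothing about the real text.

```r
# Hypothetical tokens, as scan() might return them.
words <- c("memory", "memory,", "Memory", "memory's", "remember", "memory")

# Substring match: counts every token that contains "memory".
substring_hits <- length(grep("memory", words))

# Same, ignoring case.
substring_hits_ic <- length(grep("memory", words, ignore.case = TRUE))

# Exact match: the token must be "memory" and nothing else.
exact_hits <- sum(words == "memory")

print(substring_hits)
print(substring_hits_ic)
print(exact_hits)
```

Which count is "correct" depends on how the exercise defines a word, so it may be worth comparing both against a known case such as the "action" example mentioned in the question.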
