How to grep a string ending in a specific punctuation mark - r

I'm trying to grep strings that end in a dash in R, but having trouble. I've worked out how to grep strings ending in any punctuation mark, maybe not the best way but this worked:
grep("\\#[[:print:]]+[[:punct:]]$",c)
Can't for the life of me work out how to grep strings that end specifically in a dash
for example these strings:
- # (piano) - not this.
- # hello hello - not this either.
I'd like to sub all the stuff between the dashes (and including the dashes) with nothing "" and leave the text to the right of the second dash, which end in full stops. So, I would like the output to be (for example, based on the example above):
not this.
and
not this either.
Any help would be appreciated.
Thank you!
Maro
UPDATE:
Hi again everyone,
I'm just updating my original question again:
So what I had in my original data was these three examples (I tried to simplify in my original post above, but I think it might be helpful for you all to see what I was actually dealing with):
- # (Piano) - no, and neither can you.
- # (Piano) - uh-huh.
- # Many dreams ago - Try it again.
(numbers 1-3 are for the purposes of making things clearer, they are not part of the strings)
I was trying to find a way to delete all the stuff between and including the two dashes, and leave all the stuff after the second dash, so I wanted my output to be:
no, and neither can you.
uh-huh.
Try it again.
I ended up using this:
gsub(("-[[:blank:]]#[[:blank:]]\\(?[A-Z][a-z]*\\)?[[:blank:]]-", "", c)
which helped me get 1. and 2. in one go. But this didn't help with 3 - I thought by including the question mark after the open and close bracket (which I thought meant 'optional') this would help me get all three targets, but for some reason it didn't. To then get 3, I just ended up targeting that specific string i.e. - # Many dreams ago -, by using:
gsub(("- # Many dreams ago -"), "", c)
I'm new to this, so not the best solution I'm sure.
In my original post (this has been edited a couple of times) I included square brackets around the three strings, which explains some of the answers I originally received from members of the community. Apologies for the confusion!
Thanks everyone - if there's anything that doesn't make sense, please let me know, and I'll try to clarify.
Maro

If you want to stay in between the square brackets you can start the match at #, then use a negated character class [^][]* matching optional chars other than an opening or closing square bracket, and match the last -
Replace the match with an empty string.
c <- "[- # (piano) - not this.]"
sub("#[^][]*-", "", c)
Output
[1] "[- not this.]"
For a more specific match of that string format, you can match the whole line including the square brackets, the # and the string ending on a full stop, and capture what you want to keep.
In the replacement use the capture group value.
c <- c("[- # (piano) - not this.]", "[- # hello hello - not this either.]")
sub("\\[[^][#]*#[^][]*-\\s*([^][]*\\.)]", "\\1", c)
Output
[1] "not this." "not this either."

Related

Split string in R,better to use regular expressions

I'd like to split a text string in R
for example:"cell_70001.ERP123.138_D11_62.5Y_45880"
But,I want ERP123.138_D11_62.5Y_45880 finally.
That is to say, cut the place where the first punctuation starts, get the part after it,
I really don’t understand regular expressions, but I’m very anxious. I hope someone can help me. Thank you.
Since your aim is to get where the first punctuation starts, considering _ is a word, we could do:
sub(".*?\\W","", "cell_70001.ERP123.138_D11_62.5Y_45880")
[1] "ERP123.138_D11_62.5Y_45880"
This is to say, delete everything until the first non-word character.
You could also do it as:
sub("\\w+\\W","", "cell_70001.ERP123.138_D11_62.5Y_45880")
[1] "ERP123.138_D11_62.5Y_45880"
Which means delete every word until the first non-word. Which is then also deleted

String Matching in R - Problem with pattern

I have a small Problem. I want to extract a special pattern like this:
v-97bcer
or b-chyfvg or ghd6db
I tried this:
identifier_1 <- "([:alnum:]{6})" # for things like this ghd6db
identifier_2 <- "([:lower:]{1})[- ][:alnum:]{6})" # for things like this v-97bcer or b-chyfvg
The problem is that the first "identifier" works well ok, but extracts for example names as well. In GHD6D8 this example the numbers have no fixed place and can occur everywhere. I do just now that the length is 6.
And the second problem is that for example V-97bcer can occur like v97bcer but I need this format v-97bcer. Here too the numbers are randomly.
If somebody could help or give me a good source for better understanding how to do this. I have not much exp in string matching. Thank you
this should work:
x <- c("v-97bcer", "b-chyfvg", "ghd6db", "v97bcer")
grep("^([a-z].)?[a-z0-9]{6}$", x)
Note that in order to fix the length of the string I provide ^ and $ to the string.
This pattern matches v-97bcer and b-chyfvg and ghd6db but not v97bcer.

Removing part of strings within a column

I have a column within a data frame with a series of identifiers in, a letter and 8 numbers, i.e. B15006788.
Is there a way to remove all instances of B15.... to make them empty cells (there’s thousands of variations of numbers within each category) but keep B16.... etc?
I know if there was just one thing I wanted to remove, like the B15, I could do;
sub(“B15”, ””, df$col)
But I’m not sure on the how to remove a set number of characters/numbers (or even all subsequent characters after B15).
Thanks in advance :)
Welcome to SO! This is a case of regex. You can use base R as I show here or look into the stringR package for handy tools that are easier to understand. You can also look for regex rules to help define what you want to look for. For what you ask you can use the following code example to help:
testStrings <- c("KEEPB15", "KEEPB15A", "KEEPB15ABCDE")
gsub("B15.{2}", "", testStrings)
gsub is the base R function to replace a pattern with something else in one or a series of inputs. To test our regex I created the testStrings vector for different examples.
Breaking down the regex code, "B15" is the pattern you're specifically looking for. The "." means any character and the "{2}" is saying what range of any character we want to grab after "B15". You can change it as you need. If you want to remove everything after "B15". replace the pattern with "B15.". the "" means everything till the end.
edit: If you want to specify that "B15" must be at the start of the string, you can add "^" to the start of the pattern as so: "^B15.{2}"
https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf has a info on different regex's you can make to be more particular.

Matching emails format using R

I was having an intro class at datacamp.com and ran into a problem.
Goal: find right emails using grep. "Right emails" defined by having an "#", end with ".edu").
Emails vector:
emails <- c("john.doe#ivyleague.edu", "education#world.gov", "dalai.lama#peace.org",
"invalid.edu", "quant#bigdatacollege.edu", "cookie.monster#sesame.tv")
I was thinking of
grep("#*\\.edu$",emails)
and it gave me
[1] 1 4 5
because I thought "*" matches "multiple characters". Later I found that it doesn't work like that.
Turned out the right code is
grep("#.*\\.edu$",emails)
I googled some documentation and only have a vague sense of how to get the correct answer. Can someone explain how exactly R match the right emails? Thanks a bunch!!
You've already been advised the using the asterisk quantifier wasn't giving you the specificity you needed, so use the "+" quantifier, which forces at least one such match. I decided to make the problem more complex by adding some where there were duplicated at-signs:
emails <- c("john.doe##ivyleague.edu", "education##world.gov", "dalai.lama#peace.org",
"invalid.edu", "quant#bigdatacollege.edu", "cookie.monster#sesame.tv")
grep( "^[^#]+#[^#]+\\.edu$", emails)
#[1] 5
That uses the regex character-class structure where items inside flankking square-brackets are taken as literals except when there is an initial up-caret ("^"), in which case it is the negation of the character class, i.e. in this case any character except "#". This will also exclude situations where the at-sign is the first character. Thanks to KonradRudolph who pointed out that adding "^" as the first character in the pattern (which signifies the point just before the first character of a potential match) would prevent allowing Items with an initial "##" from being matched.

Finding number of occurrences of a word in a file using R functions

I am using the following code for finding number of occurrences of a word memory in a file and I am getting the wrong result. Can you please help me to know what I am missing?
NOTE1: The question is looking for exact occurrence of word "memory"!
NOTE2: What I have realized they are exactly looking for "memory" and even something like "memory," is not accepted! That was the part which has brought up the confusion I guess. I tried it for word "action" and the correct answer is 7! You can try as well.
#names=scan("hamlet.txt", what=character())
names <- scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character())
Read 28230 items
> length(grep("memory",names))
[1] 9
Here's the file
The problem is really Shakespeare's use of punctuation. There are a lot of apostrophes (') in the text. When the R function scan encounters an apostrophe it assumes it is the start of a quoted string and reads all characters up until the next apostrophe into a single entry of your names array. One of these long entries happens to include two instances of the word "memory" and so reduces the total number of matches by one.
You can fix the problem by telling scan to regard all quotation marks as normal characters and not treat them specially:
names <- scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character(), quote=NULL )
Be careful when using the R implementation of grep. It does not behave in exactly the same way as the usual GNU/Linux program. In particular, the way you have used it here WILL find the number of matching words and not just the total number of matching lines as some people have suggested.
As pointed by #andrew, my previous answer would give wrong results if a word repeats on the same line. Based on other answers/comments, this one seems ok:
names = scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character(), quote=NULL )
idxs = grep("memory", names, ignore.case = TRUE)
length(idxs)
# [1] 10

Resources