As a novice at regular expressions, I came across an R script that strips whitespace from a string (say, a line) using gsub().
Here is the gsub() call, with what is (in my opinion) a complex pattern to match:
gsub("(^ +)|( +$)", "", line)
Can anyone explain to me what this expression means? Thoroughly!
An example would make this much easier.
Please also provide some links where I can learn some real stuff about regex, because I found no good sources when I looked.
Thanks for your consideration.
The regex just trims the leading and trailing spaces from the string. Using the base R function trimws would be clearer, I think.
(^ +)|( +$)
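^ matches the start of the string.
 + (a space followed by a plus) matches one or more spaces.
$ matches the end of the string.
| is alternation: match either the pattern on its left or the pattern on its right.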
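As a quick illustration of the pattern, here is a minimal sketch. It is written in Python rather than R only so it can be run and checked on its own; Python's re module reads this particular pattern the same way gsub() does. The sample strings are made up.

```python
import re

# Same pattern as in the gsub() call: spaces at the start OR spaces at the end.
pattern = r"(^ +)|( +$)"

line = "   keep  the   inner   spaces   "  # made-up sample input
trimmed = re.sub(pattern, "", line)

# Only the leading and trailing runs of spaces are removed.
print(repr(trimmed))

# The pattern names the space character only, so a leading tab survives.
tabbed = re.sub(pattern, "", "\tindented  ")
print(repr(tabbed))
```

The second call is the practical difference from a general whitespace trim: the pattern only knows about the space character.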
I'm trying to grep strings that end in a dash in R, but having trouble. I've worked out how to grep strings ending in any punctuation mark, maybe not the best way but this worked:
grep("\\#[[:print:]]+[[:punct:]]$",c)
Can't for the life of me work out how to grep strings that end specifically in a dash
for example these strings:
- # (piano) - not this.
- # hello hello - not this either.
I'd like to replace everything between the dashes (including the dashes themselves) with nothing "" and keep the text to the right of the second dash, which ends in a full stop. So, I would like the output to be (based on the example above):
not this.
and
not this either.
Any help would be appreciated.
Thank you!
Maro
UPDATE:
Hi again everyone,
I'm just updating my original question again:
So what I had in my original data was these three examples (I tried to simplify in my original post above, but I think it might be helpful for you all to see what I was actually dealing with):
1. - # (Piano) - no, and neither can you.
2. - # (Piano) - uh-huh.
3. - # Many dreams ago - Try it again.
(numbers 1-3 are for the purposes of making things clearer, they are not part of the strings)
I was trying to find a way to delete all the stuff between and including the two dashes, and leave all the stuff after the second dash, so I wanted my output to be:
no, and neither can you.
uh-huh.
Try it again.
I ended up using this:
gsub("-[[:blank:]]#[[:blank:]]\\(?[A-Z][a-z]*\\)?[[:blank:]]-", "", c)
which helped me get 1. and 2. in one go. But this didn't help with 3 - I thought by including the question mark after the open and close bracket (which I thought meant 'optional') this would help me get all three targets, but for some reason it didn't. To then get 3, I just ended up targeting that specific string i.e. - # Many dreams ago -, by using:
gsub(("- # Many dreams ago -"), "", c)
I'm new to this, so not the best solution I'm sure.
In my original post (this has been edited a couple of times) I included square brackets around the three strings, which explains some of the answers I originally received from members of the community. Apologies for the confusion!
Thanks everyone - if there's anything that doesn't make sense, please let me know, and I'll try to clarify.
Maro
If you want to stay in between the square brackets you can start the match at #, then use a negated character class [^][]* matching optional chars other than an opening or closing square bracket, and match the last -
Replace the match with an empty string.
c <- "[- # (piano) - not this.]"
sub("#[^][]*-", "", c)
Output
[1] "[- not this.]"
For a more specific match of that string format, you can match the whole line including the square brackets, the # and the string ending on a full stop, and capture what you want to keep.
In the replacement use the capture group value.
c <- c("[- # (piano) - not this.]", "[- # hello hello - not this either.]")
sub("\\[[^][#]*#[^][]*-\\s*([^][]*\\.)]", "\\1", c)
Output
[1] "not this." "not this either."
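For the three strings in the update, which have no square brackets, the pattern in the question misses the third one because \\(?[A-Z][a-z]*\\)? matches a single capitalised word only: the ? makes the parentheses optional, but "Many dreams ago" is several words separated by spaces. A minimal sketch of one alternative follows, assuming the text between the two dashes never contains a dash itself. It is in Python so it can be run standalone; in R the same pattern would go to sub() with the backslashes doubled.

```python
import re

# The three strings from the update.
lines = [
    "- # (Piano) - no, and neither can you.",
    "- # (Piano) - uh-huh.",
    "- # Many dreams ago - Try it again.",
]

# ^-      the first dash at the start of the string
# \s*#    optional whitespace, then the hash
# [^-]*   anything up to (not including) the next dash
# -\s*    the second dash and the whitespace after it
prefix = re.compile(r"^-\s*#[^-]*-\s*")

cleaned = [prefix.sub("", s, count=1) for s in lines]
print(cleaned)
```

Because the pattern is anchored with ^ and stops at the first dash after the #, the hyphen in uh-huh is left alone.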
I have been mucking around with regex strings and strsplit but can't figure out how to solve my problem.
I have a collection of html documents that will always contain the phrase "people own these". I want to extract the number immediately preceding this phrase. i.e. '732,234 people own these' - I'm hoping to capture the number 732,234 (including the comma, though I don't care if it's removed).
The number and phrase are always enclosed in an HTML tag. I tried using XPath, but that seemed even harder than a regex. Any help or advice is greatly appreciated!
example string: >742,811 people own these<
-> 742,811
Please try the following.
val <- "742,811 people own these"
gsub(' [a-zA-Z]+',"",val)
Output will be as follows.
[1] "742,811"
Explanation: this uses R's gsub (global substitution) function. The pattern matches a space followed by one or more letters (lowercase or uppercase), and every such match in val is replaced with an empty string, which leaves only the number.
Try using str_extract_all from the stringr library:
str_extract_all(data, "\\d{1,3}(?:,\\d{3})*(?:\\.\\d+)?(?= people own these)")
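To make the lookahead approach concrete, here is a small sketch of the same pattern in Python (chosen so the example runs on its own; the pattern itself is unchanged from the str_extract_all call). The HTML fragment is invented for illustration.

```python
import re

# Invented HTML fragment containing the target phrase.
html = "<div><span>742,811 people own these</span> and 12 other things</div>"

# A number with optional thousands groups and optional decimals,
# accepted only when " people own these" follows (lookahead, not consumed).
number = re.compile(r"\d{1,3}(?:,\d{3})*(?:\.\d+)?(?= people own these)")

found = number.findall(html)
print(found)

# Dropping the comma afterwards is a plain string replacement.
as_int = int(found[0].replace(",", ""))
print(as_int)
```

Unlike the substitution approach, this does not depend on the input containing nothing but the number and the phrase: the surrounding markup and the unrelated "12" are simply not matched.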
I'd like to split the string into the following
S <- "No. Ok (whatever). If you must. Please try to be careful (shakes head)."
[1] No.
[2] Ok (whatever). If you must.
[3] Please try to be careful (shakes head).
The pattern is the first . before each (...).
I'm familiar with (?<=...) (i.e. positive lookbehind) but this doesn't seem to work with non-fixed length patterns. I'd like to know if I'm wrong about positive lookbehind or if there's some regex magic to do this. Thanks!
Note that I don't know much about Ruby, but there should be something like a split method that takes a regex pattern as a delimiter and splits the string accordingly.
Use this regex:
(?<=\.) (?=[^.]+?\(.+?\))
This looks for a space character. Behind the space, there must be a dot: (?<=\.). After the space, the lookahead (?=...) requires a run of characters that are not dots, [^.]+?, followed by a pair of brackets with something inside, \(.+?\).
Try it online: https://regex101.com/r/8PcbFJ/1
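A minimal sketch of the split, in Python because its re.split accepts this pattern unchanged (in R, lookarounds need perl = TRUE). The input is the string from the question.

```python
import re

S = "No. Ok (whatever). If you must. Please try to be careful (shakes head)."

# Split on a space that has a dot behind it and, ahead of it,
# dot-free text leading up to a parenthesised group.
splitter = re.compile(r"(?<=\.) (?=[^.]+?\(.+?\))")

parts = splitter.split(S)
for p in parts:
    print(p)
```

The lookbehind here is a single character, so the fixed-length restriction never comes into play; the variable-length part of the condition sits in the lookahead, where it is allowed.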
I have been using strapplyc in R to select different portions of a string that match one particular set of criteria. These have worked successfully until I found a portion of the string where the required portion could be defined one of two ways.
Here is an example of the string which is liberally sprinkled with \t:
\t\t\tsome words here\t\t\tDefect: some more words here Action: more words
I can write the strapply statement to capture the text between Defect: and the start of Action:
strapplyc(record[i], "Defect:(.*?)Action")
This works and selects the chosen text between Defect: and Action. In some cases there is no action section to the string and I've used the following code to capture these cases.
strapplyc(record[i], "Defect:(.*?)$")
What I have been trying to do is capture the text that either ends with Action, or with the end of the string (using $).
This is the bit that keeps failing. It returns nothing for either option. Here is my failing code:
strapplyc(record[i], "Defect:(.*?)Action|$")
Any idea where I'm going wrong, or a better solution would be much appreciated.
If you are up for a more efficient solution, you could drop the .*? matching and unroll your pattern like:
Defect:((?:[^A]+|A(?!ction))*)
This matches Defect: followed by any number of characters that are either not an A, or are an A not followed by ction. This avoids the character-by-character expansion that lazy dot matching requires. It works in both cases, since it stops matching when it hits Action or the end of your string.
As suggested by Wiktor, you can also use
Defect:([^A]*(?:A(?!ction)[^A]*)*)
Which is a little bit faster when there are many As in the string.
You might want to consider using A(?!ction:) or A(?!ction\s*:), to avoid false early matches.
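The unrolled pattern and the stricter lookahead can be compared on a few inputs. This is a sketch in Python (the patterns are unchanged); the third sample string is invented to trigger the early-match case.

```python
import re

record = "\t\t\tsome words here\t\t\tDefect: some more words here Action: more words"
no_action = "\t\t\tsome words here\t\t\tDefect: some more words here"
tricky = "Defect: Actionable item Action: fix"  # invented sample

unrolled = re.compile(r"Defect:([^A]*(?:A(?!ction)[^A]*)*)")
strict = re.compile(r"Defect:([^A]*(?:A(?!ction:)[^A]*)*)")

# Stops at "Action" when present, otherwise runs to the end of the string.
cap_action = unrolled.search(record).group(1)
cap_end = unrolled.search(no_action).group(1)

# "Actionable" ends the loose capture early; the strict form waits for "Action:".
cap_loose = unrolled.search(tricky).group(1)
cap_strict = strict.search(tricky).group(1)

print(repr(cap_action), repr(cap_end), repr(cap_loose), repr(cap_strict))
```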
The alternation operator | is the regex operator with the lowest precedence. That means the regex Defect:(.*?)Action|$ is parsed as two alternatives, Defect:(.*?)Action and $. Since the empty string at the end of the input is a valid match for $, your regex can match that empty string with nothing captured.
To solve that, you should combine the regexes Defect:(.*?)Action and Defect:(.*?)$ with an OR:
Defect:(.*?)Action|Defect:(.*?)$
Or you can enclose Action|$ in a group as Sebastian Proske said in the comments:
Defect:(.*?)(?:Action|$)
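The difference in grouping can be checked directly. This is a sketch using Python's re module rather than strapplyc, so the way matches are reported differs, but the alternation is parsed the same way.

```python
import re

no_action = "\t\t\tsome words here\t\t\tDefect: some more words here"
with_action = no_action + " Action: more words"

ungrouped = re.compile(r"Defect:(.*?)Action|$")
grouped = re.compile(r"Defect:(.*?)(?:Action|$)")

# Without "Action", only the bare $ alternative can match: an empty
# match at the end of the string, with nothing captured.
m = ungrouped.search(no_action)
print(repr(m.group(0)), m.group(1))

# With the alternation confined to a group, both cases capture the text.
cap_end = grouped.search(no_action).group(1)
cap_action = grouped.search(with_action).group(1)
print(repr(cap_end), repr(cap_action))
```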
I am trying to combine 2 regular expressions into 1 with the OR operator: |
I have one that checks for match of a letter followed by 8 digits:
Regex.IsMatch(s, "^[A-Z]\d{8}$")
I have another that checks for simply 9 digits:
Regex.IsMatch(s, "^\d{9}$")
Now, Instead of doing:
If Not Regex.IsMatch(s, "^[A-Z]\d{8}$") AndAlso
Not Regex.IsMatch(s, "^\d{9}$") Then
...
End If
I thought I could simply do:
If Not Regex.IsMatch(s, "^[A-Z]\d{8}|\d{9}$") Then
...
End If
Apparently I am not combining the two correctly and apparently I am horrible at regular expressions. Any help would be much appreciated.
And for those wondering, I did take a glance at How to combine 2 conditions and more in regex and I am still scratching my head.
The | operator has the lowest precedence, so in your original regex it splits the whole pattern into two alternatives: ^[A-Z]\d{8} and \d{9}$. You should combine the two regexes with grouping parentheses to make the intended scope explicit. As in:
"^(([A-Z]\d{8})|(\d{9}))$"
How about using ^[A-Z0-9]\d{8}$ ?
I think you want to group the conditions:
Regex.IsMatch(s, "^(([A-Z]\d{8})|(\d{9}))$")
The ^ and $ represent the beginning and end of the line, so you don't want them considered in the or condition. The parens allow you to be explicit about "everything in this paren" or "anything in this other paren"
@MikeC's offering seems the best:
^[A-Z0-9]\d{8}$
...but as to why your expression is not working the way you might expect, you have to understand that the | "or" or "alternation" operator has the lowest precedence of the regex operators: it binds more loosely than concatenation, and only a grouping construct limits its scope. If you use your example:
^[A-Z]\d{8}|\d{9}$
...you're basically saying "match beginning of string, capital letter, then 8 digits OR match 9 digits then end of string" -- if, instead you mean "match beginning of string, then a capital letter followed by 8 digits then the end of string OR the beginning of the string followed by 9 digits, then the end of string", then you want one of these:
^([A-Z]\d{8}|\d{9})$
^[A-Z]\d{8}$|^\d{9}$
Hope this is helpful for your understanding
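The two readings can be told apart with inputs that carry extra characters on one side. This is a sketch in Python rather than VB.NET; re.search plays the role of Regex.IsMatch here, and the test strings are made up.

```python
import re

ungrouped = re.compile(r"^[A-Z]\d{8}|\d{9}$")
grouped = re.compile(r"^(?:[A-Z]\d{8}|\d{9})$")

samples = ["A12345678", "123456789", "A12345678xyz", "xyz123456789"]

# For each sample: (matches ungrouped pattern, matches grouped pattern)
results = {
    s: (bool(ungrouped.search(s)), bool(grouped.search(s)))
    for s in samples
}
for s, (loose, strict) in results.items():
    print(f"{s!r}: ungrouped={loose}, grouped={strict}")
```

The last two samples are the interesting ones: the ungrouped pattern accepts trailing junk after the letter form and leading junk before the nine-digit form, because each anchor belongs to only one alternative.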
I find the OR operator a bit weird sometimes as well. What I do is use groups to mark which sections I want to match, so your regex would become something like this: ^(([A-Z]\d{8})|(\d{9}))$