Split a string in a flexible manner with a regular expression

Split a string in a flexible manner with a regular expression - r

Context: I need to split strings that are too long and that are used as column headers in an html table. Those strings are variable names, so they don't have any spaces in them.
If I let the css max-width property do the job, the string is split at a fixed place, not making use of the dots or _'s in the string.
For example, suppose I have this string:
this.is.a.long.string.indeed.yeah.well.you.know
Using the dots as separators, I can split it in many, many different ways. But I pose these guiding principles:
All substrings must be 12 characters or less
Separators [._] should be at the end, not at the beginning of a substring
The number of substrings must be minimal
If several solutions exist, the one having the most similar substring lengths is to be preferred.
I could do this programmatically with R, but I'm turning to regex wizards to see whether this is possible using solely regular expressions.
What I have so far:
Regex: .{1,12}(_|\b|\Z)
Results: this.is.a. | long.string. | indeed.yeah. | well.you. | know
It works well, except when there is a long sequence of letters without any separators. Please see this example on regex101.com.
Ideally, separators would be used whenever possible, and a fallback split would occur when there is a sequence longer than 12 characters without a separator.

You were so close, you just need to present it with another alternative for cases where no separator is found:
.{1,12}(_|\b|\Z)|.{1,12}
Check it out: https://regex101.com/r/XrJuYj/2/
Edit: to ensure the split portion contains a non-separating character, you can use the following:
(?=.{1,12}(.*))(?=.*?[^\W_].*?[\W_].*?\1).{1,12}(?<=_|\b|\Z)|.{1,12}
See it at: https://regex101.com/r/XrJuYj/3

Related

How do I terminate my pattern at a line break?

I have a long character that comes from a pdf that I want to process.
I have recurring instances of Table X. Name of the table, that in my character are always followed by a \r\n
However, when I try to extract all the tables in a list, using List_Tables <-str_extract_all(Plain_Text, "Table\\s+\\d+\\.\\s+(([A-z]|\\s))+\\r\\n"), I do have often another line that is still in my extraction, e.g.
> List_Tables
[[1]]
[1] "Table 1. Real GDP\r\n Percentage changes\r\n"
[2] "Table 2. Nominal GDP\r\n Percentage changes\r\n"
What have I missed in my code ?

\s matches all whitespace, including line breaks! When combined with the greedy quantifier +, this means that (([A-z]|\\s))+ matches, in your first example,
Real GDP\r\n […] Percentage changes\r\n
The easiest way to fix this is to use a non-greedy quantifier: i.e. +? instead of +.
Just for completeness’ sake I’ll mention that there are alternatives, but they get more complicated. For instance, you could use negative assertions to include an “if” test to match whitespace which isn’t a line break character; or you could use the character class [ \t] instead of \s, which is more restrictive but also more explicit and probably closer to what you want.

R: Regex for identifying numbers within HTML chunk

this is my first entry on stack overflow, so please be indulgent if my post might have some lack in terms of quality.
I want to learn some webscraping with R and started with a simple example --> Extracting a table from a Wikipedia site.
I managed to download the specific page and identified the HTML sections I am interested in:
<td style="text-align:right">511.000.000\n</td>
Now I want to extract the number in the data from the table by using regex. So i created a regex, which should match the structure of the number from my point of view:
pattern<-"\\d*\\.\\d*\\.\\d*\\.\\d*\\."
I also tried other variations but none of them found the number within the HTML code. I wanted to keep the pattern open as the numbers might be hundreds, thousand, millions, billions.
My questions: The number is within the HTML code, might it be
necessary to include some code for the non-number code (which should
not be extracted...)
What would be the correct version for the
pattern to identify the number correctly?
Thank you very much for your support!!

So many stars implies a lot of backtracking.
One point further, using \\d* would match more than 3 digits in any group and would also match a group with no digit.
Assuming your numbers are always integers, formatted using a . as thousand separator, you could use the following: \\d{1,3}(?:\\.\\d{3})* (note the usage of non-capturing group construct (?:...) - implying the use of perl = TRUE in arguments, as mentioned in Regular Expressions as used in R).

Look closely at your regex. You are assuming that the number will have 4 periods (\\.) in it, but in your own example there are only two periods. It's not going to match because while the asterisk marks \\d as optional (zero or more), the periods are not marked as optional. If you add a ? modifier after the 3rd and 4th period, you may find that your pattern starts matching.

Regex in R match specified words when they all (two or more) occur in whatever order within certain distance in particular line

I have a double challenge.
First, I want to match lines that contain two (or eventually more) specified words within certain distance in whatever order.
Using lookaround I manage to select lines matching two or more words, regardless of the order within they occur. I can also easily add more words to be found in the same line, so it this can also be applied without much effort when more word must occur in order to be selected. The disadvantage is that can't detail the maximal distance between them.
^(?=.*\john)(?=.*\jack).*$
By using the pipe operator I can detail both orders in which the terms may occur as well as the accepted distance between them, but when more words should be matched the code becomes lengthy and errorsensitive.
jack.{0,100}john|john.{0,100}jack
Is there a way to combine the respective advantages of both approaches in one regular expression?
Second, ideally I would like that only 'jack' and 'john' (and are selected in the line but not the whole line.
Is there a possibility to do this all at once?

For this case, you have to use the second approach. But it can't be possible with regex alone.. You have to ask for language tools help like paste in-order to build a regex (given in the second format).
In python, I would do like below to create a long regex.
>>> def create_reg(lis):
out = []
for i in lis:
out.append(''.join(i) + '|' + ''.join([i[2],i[1], i[0]]))
return '(?:' + '|'.join(out) + ')'
>>> lst = [('john', '{0,100}', 'jack'), ('foo', '{0,100}', 'bar')]
>>> create_reg(lst)
'(?:john{0,100}jack|jack{0,100}john|foo{0,100}bar|bar{0,100}foo)'
>>>

Convert string to variable name in R

I have spend hours to look for a proper solutions but I found nothing on Internet. There is my question. In R, I have a specific list of characters containings my desired variable names ("2011_Q4", "2012_Q1", ...). When I try to assign a dataset to each of this name with a loop, it does work but the output it's strange. Indeed, I have
> View(`2011_Q4`)
instead of
> View(2011_Q4)
And I don't know how to remove this apostrophe. It's very annoying since I have to type this ` in order to call the variable.
Somebody can help me? I would appreciate his help.
Thanks a lot and best regards

Firstly, it's a backtick (`), not an apostrophe ('). In R, backticks occasionally denote variable names; apostrophes work as single quotes for denoting strings.
The issue you're having is that your variables start with a number, which is not allowed in R. Since you somehow made it happen anyway, you need to use backticks to tell R not to interpret 2011_Q4 as a number, but as a variable.
From ?Quotes:
Names and Identifiers
Identifiers consist of a sequence of letters, digits, the period (.)
and the underscore. They must not start with a digit nor underscore,
nor with a period followed by a digit. Reserved words are not valid
identifiers.
The definition of a letter depends on the current locale, but only
ASCII digits are considered to be digits.
Such identifiers are also known as syntactic names and may be used
directly in R code. Almost always, other names can be used provided
they are quoted. The preferred quote is the backtick (`), and deparse
will normally use it, but under many circumstances single or double
quotes can be used (as a character constant will often be converted to
a name). One place where backticks may be essential is to delimit
variable names in formulae: see formula.
The best solution to your issue is simply to change your variable names to something that starts with a character, e.g. Y2011_Q4.

Regular expression for x number of digits and only one hyphen?

I made the following regex:
(\d{5}|\d-\d{4}|\d{2}-\d{3}|\d{3}-\d{2}|\d{4}-\d)
And it seems to work. That is, it will match a 5 digit number or a 5 digit number with only 1 hyphen in it, but the hyphen can not be the lead or the end.
I would like a similar regex, but for a 25 digit number. If I use the same tactic as above, the regex will be very long.
Can anyone suggest a simpler regex?
Additional Notes:
I'm putting this regex into an XML file which is to be consumed by an ASP.NET application. I don't have access to the .net backend code. But I suspect they would do something liek this:
Match match = Regex.Match("Something goes here", "my regex", RegexOptions.None);

You need to use a lookahead:
^(?:\d{25}|(?=\d+-\d+$)[\d\-]{26})$
Explanation:
Either it's \d{25} from start to end, 25 digits.
Or: it is 26 characters of [\d\-] (digits or hyphen) AND it matched \d+-\d+ - meaning it has exactly one hyphen in the middle.
Working example with test cases

You could use this regex:
^[0-9](?:(?=[0-9]*-[0-9]*$)[0-9-]{24}|[0-9]{23})[0-9]$
The lookahead makes sure there's only 1 dash and the character class makes sure there are 23 numbers between the first and the last. Might be made shorter though I think.
EDIT: The a 'bit' shorter xP
^(?:[0-9]{25}|(?=[^-]+-[^-]+$)[0-9-]{26})$
A bit similar to Kobi's though, I admit.

If you aren't fussy about the length at all (i.e. you only want a string of digits with an optional hyphen) you could use:
([\d]+-[\d]+){1}|\d
(You may want to add line/word boundaries to this, depending on your circumstances)
If you need to have a specific length of match, this pattern doesn't really work. Kobi's answer is probably a better fit for you.

I think the fastest way is to do a simple match then add up the length of the capture buffers, why attempt math in a regex, makes no sence.
^(\d+)-?(\d+)$

This will match 25 digits and exactly one hyphen in the middle:
^(?=(-*\d){25})\d.{24}\d$

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Split a string in a flexible manner with a regular expression - r

Related

How do I terminate my pattern at a line break?

R: Regex for identifying numbers within HTML chunk

Regex in R match specified words when they all (two or more) occur in whatever order within certain distance in particular line

Convert string to variable name in R

Regular expression for x number of digits and only one hyphen?

Categories

Resources