Extract strings with exactly only two dots - unix

I am looking to extract strings that have exactly 2 dots like below.
a.b.c
$$abc.$$def.123
The relevance is only to the dots.
So far I have tried
grep "\\.{2}" file_name.txt
But this does not give me the result I want. Could you please help me?

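I think this is just a regular expression issue. Your \.{2} matches two consecutive dots, and only when grep reads it as an extended regular expression (grep -E); in grep's default basic syntax the braces are literal characters. What you probably want is something like:
^[^.]*\.[^.]*\.[^.]*$
Which is "start of string, zero or more not-dots, a dot, zero or more not-dots, a dot, zero or more not-dots, end of string". Inside a bracket expression the dot needs no escaping; in POSIX grep a backslash there is taken literally, so [^\.] would also exclude backslashes.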
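For anyone who wants to try the anchored pattern outside grep, here is a minimal sketch in Python, assuming the standard re module and one string per line with no trailing newline. The function name and sample data are made up for illustration.

```python
import re

# Anchored pattern: not-dots, a dot, not-dots, a dot, not-dots.
TWO_DOTS = re.compile(r"^[^.]*\.[^.]*\.[^.]*$")


def lines_with_two_dots(lines):
    """Return only the lines that contain exactly two dots."""
    return [line for line in lines if TWO_DOTS.match(line)]


if __name__ == "__main__":
    # Hypothetical sample input; the first two entries come from the question.
    sample = ["a.b.c", "$$abc.$$def.123", "a.b", "a.b.c.d", "no dots"]
    print(lines_with_two_dots(sample))
```

Lines with one dot or with three or more dots are rejected because the pattern has to span the whole line.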

Related

Does anyone know where I can find all the symbols that denote a letter, a number, the end of the string, or the beginning of the string in R?

I need to split the string, and this time the delimiter is "(", but I need to specify that the next character is a digit, because I am trying to separate the title from the year in one column. Even better would be to indicate that the "(" is followed by 4 digits. In general, my question is where I can find all the symbols that denote different kinds of characters or groups of characters, to make it easier to separate text into two columns. Thanks in advance.
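As a rough illustration of the pattern described above, here is a sketch in Python rather than R, assuming the standard re module; the titles are invented sample data. The building blocks (\d for a digit, {4} for a count, ^ and $ for start and end, a backslash to escape the parenthesis) are also available in R, which documents its regular expression syntax on the ?regex help page.

```python
import re

# "(" is a metacharacter, so it is escaped; \d{4} requires exactly four digits.
TITLE_YEAR = re.compile(r"^(.*?)\s*\((\d{4})\)\s*$")


def split_title_year(text):
    """Split 'Title (1995)' into ('Title', '1995'); year is None if absent."""
    m = TITLE_YEAR.match(text)
    if m is None:
        return text, None
    return m.group(1), m.group(2)


if __name__ == "__main__":
    print(split_title_year("Toy Story (1995)"))
    print(split_title_year("Heat"))
```

Because the year group is anchored to the end of the string, an earlier parenthesis in the title is left alone.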

How do I terminate my pattern at a line break?

I have a long character string that comes from a PDF and that I want to process.
I have recurring instances of "Table X. Name of the table", which in my string are always followed by \r\n
However, when I try to extract all the tables into a list, using List_Tables <- str_extract_all(Plain_Text, "Table\\s+\\d+\\.\\s+(([A-z]|\\s))+\\r\\n"), I often get an extra line in the extraction, e.g.
> List_Tables
[[1]]
[1] "Table 1. Real GDP\r\n Percentage changes\r\n"
[2] "Table 2. Nominal GDP\r\n Percentage changes\r\n"
What have I missed in my code?
\s matches all whitespace, including line breaks! When combined with the greedy quantifier +, this means that (([A-z]|\\s))+ matches, in your first example,
Real GDP\r\n […] Percentage changes\r\n
The easiest way to fix this is to use a non-greedy quantifier: i.e. +? instead of +.
Just for completeness’ sake I’ll mention that there are alternatives, but they get more complicated. For instance, you could use negative assertions to include an “if” test to match whitespace which isn’t a line break character; or you could use the character class [ \t] instead of \s, which is more restrictive but also more explicit and probably closer to what you want. Note also that [A-z] covers more than letters: the range includes the ASCII punctuation characters between Z and a, so [A-Za-z] is the safer spelling.
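The difference between the greedy, non-greedy and explicit-class variants can be seen in a small sketch. It uses Python's re module instead of stringr, and the input is a hypothetical reconstruction of the two table captions from the question; [A-Za-z] is swapped in for [A-z].

```python
import re

# Hypothetical reconstruction of the text from the question.
text = (
    "Table 1. Real GDP\r\n Percentage changes\r\n"
    "Table 2. Nominal GDP\r\n Percentage changes\r\n"
)

# Greedy: \s also matches \r and \n, so the match runs on to the next line.
greedy = re.findall(r"Table\s+\d+\.\s+(?:[A-Za-z]|\s)+\r\n", text)

# Non-greedy: +? stops at the first \r\n it can reach.
lazy = re.findall(r"Table\s+\d+\.\s+(?:[A-Za-z]|\s)+?\r\n", text)

# Explicit class: only letters, spaces and tabs, so line breaks never match.
explicit = re.findall(r"Table\s+\d+\.\s+[A-Za-z \t]+\r\n", text)

if __name__ == "__main__":
    print(greedy)
    print(lazy)
    print(explicit)
```

The non-greedy and explicit versions both return just the caption line, while the greedy one also swallows the "Percentage changes" line.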

Split a string in a flexible manner with a regular expression

Context: I need to split strings that are too long and that are used as column headers in an HTML table. Those strings are variable names, so they don't have any spaces in them.
If I let the CSS max-width property do the job, the string is split at a fixed place, without making use of the dots or underscores in the string.
For example, suppose I have this string:
this.is.a.long.string.indeed.yeah.well.you.know
Using the dots as separators, I can split it in many, many different ways. But I pose these guiding principles:
All substrings must be 12 characters or less
Separators [._] should be at the end, not at the beginning of a substring
The number of substrings must be minimal
If several solutions exist, the one having the most similar substring lengths is to be preferred.
I could do this programmatically with R, but I'm turning to regex wizards to see whether this is possible using solely regular expressions.
What I have so far:
Regex: .{1,12}(_|\b|\Z)
Results: this.is.a. | long.string. | indeed.yeah. | well.you. | know
It works well, except when there is a long sequence of letters without any separators. Please see this example on regex101.com.
Ideally, separators would be used whenever possible, and a fallback split would occur when there is a sequence longer than 12 characters without a separator.
You were so close; you just need to add another alternative for cases where no separator is found:
.{1,12}(_|\b|\Z)|.{1,12}
Check it out: https://regex101.com/r/XrJuYj/2/
Edit: to ensure the split portion contains a non-separating character, you can use the following:
(?=.{1,12}(.*))(?=.*?[^\W_].*?[\W_].*?\1).{1,12}(?<=_|\b|\Z)|.{1,12}
See it at: https://regex101.com/r/XrJuYj/3
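Here is a small sketch of the first pattern (separator-aware match plus the plain .{1,12} fallback), assuming Python's re module instead of regex101 or R. The group is written as non-capturing so that findall returns whole chunks; the function name is made up, and the second sample string is hypothetical.

```python
import re

# First alternative prefers to end at "_", a word boundary or end of string;
# the second alternative is the fallback when no separator is within reach.
SPLIT = re.compile(r".{1,12}(?:_|\b|\Z)|.{1,12}")


def split_header(name):
    """Split a long variable name into chunks for a table header."""
    return SPLIT.findall(name)


if __name__ == "__main__":
    print(split_header("this.is.a.long.string.indeed.yeah.well.you.know"))
    # A long run without separators triggers the fallback branch.
    print(split_header("abcdefghijklmnopqrstuvwxyz.short"))
```

Joining the chunks gives back the original string, since every character is consumed by one of the two alternatives.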

Regular expression condition

My textbox should only allow input where the first character is a letter or a digit and the remaining characters are all digits.
e.g. #999999999, where # represents a letter (a-z, A-Z) or a digit (0-9).
Please help me.
I think you need something like this.
If you need zero or more digits after the first character:
[a-zA-Z0-9][0-9]*
If you need one or more digits after the first character:
[a-zA-Z0-9][0-9]+
If the whole input must match, anchor the pattern with ^ at the start and $ at the end.
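To check a whole textbox value against either variant, something like the following sketch could be used. It assumes Python's re module, and fullmatch takes the place of explicit ^ and $ anchors; the function and parameter names are invented for illustration.

```python
import re

ZERO_OR_MORE = re.compile(r"[a-zA-Z0-9][0-9]*")
ONE_OR_MORE = re.compile(r"[a-zA-Z0-9][0-9]+")


def is_valid(text, require_digits=False):
    """True if text is one letter or digit followed only by digits."""
    pattern = ONE_OR_MORE if require_digits else ZERO_OR_MORE
    # fullmatch requires the pattern to cover the entire string.
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    for value in ["a999999999", "9999", "ab99", "a"]:
        print(value, is_valid(value), is_valid(value, require_digits=True))
```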

Regular expression for x number of digits and only one hyphen?

I made the following regex:
(\d{5}|\d-\d{4}|\d{2}-\d{3}|\d{3}-\d{2}|\d{4}-\d)
And it seems to work. That is, it will match a 5-digit number, or a 5-digit number with exactly 1 hyphen in it, where the hyphen cannot be the first or last character.
I would like a similar regex, but for a 25 digit number. If I use the same tactic as above, the regex will be very long.
Can anyone suggest a simpler regex?
Additional Notes:
I'm putting this regex into an XML file which is to be consumed by an ASP.NET application. I don't have access to the .NET backend code, but I suspect they do something like this:
Match match = Regex.Match("Something goes here", "my regex", RegexOptions.None);
You need to use a lookahead:
^(?:\d{25}|(?=\d+-\d+$)[\d\-]{26})$
Explanation:
Either it's \d{25} from start to end, 25 digits.
Or: it is 26 characters of [\d\-] (digits or hyphen) AND it matched \d+-\d+ - meaning it has exactly one hyphen in the middle.
Working example with test cases
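A quick way to sanity-check the lookahead pattern is a sketch like the one below. It uses Python's re module rather than .NET, with re.ASCII so that \d only means 0-9; the helper name and test strings are made up.

```python
import re

# Either 25 digits, or 26 characters of digits/hyphen that also look like
# digits-hyphen-digits (exactly one hyphen, not first or last).
PATTERN = re.compile(r"^(?:\d{25}|(?=\d+-\d+$)[\d\-]{26})$", re.ASCII)


def accepts(text):
    """True if text is 25 digits with at most one inner hyphen."""
    return PATTERN.fullmatch(text) is not None


if __name__ == "__main__":
    print(accepts("1" * 25))                   # plain 25 digits
    print(accepts("1" * 12 + "-" + "1" * 13))  # one hyphen inside
    print(accepts("-" + "1" * 25))             # leading hyphen
    print(accepts("1" * 12 + "-" + "1" * 6 + "-" + "1" * 6))  # two hyphens
```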
You could use this regex:
^[0-9](?:(?=[0-9]*-[0-9]*$)[0-9-]{24}|[0-9]{23})[0-9]$
The lookahead makes sure there's only 1 dash, and the character class makes sure there are 23 digits between the first and the last. It could probably be made shorter, though.
EDIT: A 'bit' shorter:
^(?:[0-9]{25}|(?=[^-]+-[^-]+$)[0-9-]{26})$
A bit similar to Kobi's though, I admit.
If you aren't fussy about the length at all (i.e. you only want a string of digits with an optional hyphen) you could use:
\d+(-\d+)?
(You may want to add line/word boundaries to this, depending on your circumstances)
If you need to have a specific length of match, this pattern doesn't really work. Kobi's answer is probably a better fit for you.
I think the fastest way is to do a simple match and then add up the lengths of the captured groups. Why attempt arithmetic in a regex? It makes no sense.
^(\d+)-?(\d+)$
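The match-then-count idea could look roughly like this. It is a sketch in Python rather than .NET, assuming the standard re module; the function name and the digits parameter are invented for illustration.

```python
import re

# Two digit runs with an optional hyphen between them.
SIMPLE = re.compile(r"^(\d+)-?(\d+)$", re.ASCII)


def has_n_digits(text, digits=25):
    """True if text is `digits` digits with at most one inner hyphen."""
    m = SIMPLE.fullmatch(text)
    if m is None:
        return False
    # The length check lives in ordinary code, not in the pattern.
    return len(m.group(1)) + len(m.group(2)) == digits


if __name__ == "__main__":
    print(has_n_digits("1" * 25))
    print(has_n_digits("1" * 12 + "-" + "1" * 13))
    print(has_n_digits("1" * 24))
```

Changing the required length is then a matter of passing a different digits value, with no change to the pattern.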
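This matches 25 digits with exactly one hyphen somewhere in the middle (the hyphen is required here). The $ inside the lookahead forces the 25 digits and any hyphens to span the whole string, so a plain 26-digit string is rejected:
^(?=(-*\d){25}$)\d.{24}\d$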
