Regular Expression in R - Spaces before and after the text

I have a stats file that has lines that are like this:
"system.l2.compressor.compression_size::1 0 # Number of blocks that compressed to fit in 1 bits"
0 is the value that I care about in this case. The spaces between the actual statistic and whatever is before and after it are not the same each time.
My code to extract the stat looks something like this:
if (grepl("system.l2.compressor.compression_size::1", line))
{
matches <- regmatches(line, gregexpr("[[:digit:]]+\\.*[[:digit:]]", line))
compression_size_1 = as.numeric(unlist(matches))[1]
}
The reason I have this regular expression
[[:digit:]]+\\.*[[:digit:]]
is that in other cases the statistic is a decimal number. I don't expect decimals in lines like the example I posted, but it would be nice to have a "fail safe" regex that captures such a case too.
In this case I get "2." "1" "0" "1" as answers. How can I restrict it so that I can get only the true stat as the answer?
I tried using something like this
"[:space:][[:digit:]]+\\.*[[:digit:]][:space:]"
or other variations, but either I get back NA, or the same numbers but with spaces surrounding them.

Here are a couple of base R possibilities, depending on how your data is set up. In the future, it is helpful to provide a reproducible example. Definitely provide one if these don't work. If the pattern works, it will probably be faster to adapt it to a stringr or stringi function. Good luck!!
# The digits after the space after the anything not a space following "::"
gsub(".*::\\S+\\s+(\\d+).*", "\\1", strings)
[1] "58740" "58731" "70576"
# Getting the digit(s) following a space and preceding a space and pound sign
gsub(".*\\s+(\\d+)\\s+#.*", "\\1", strings)
[1] "58740" "58731" "70576"
# Combining the two (this is the most restrictive)
gsub(".*::\\S+\\s+(\\d+)\\s+#.*", "\\1", strings)
[1] "58740" "58731" "70576"
# Extracting the first digits surrounded by spaces (least restrictive)
gsub(".*?\\s+(\\d+)\\s+.*", "\\1", strings)
[1] "58740" "58731" "70576"
# Or, using stringr for the last pattern:
as.numeric(stringr::str_extract(strings, "\\s+\\d+\\s+"))
[1] 58740 58731 70576
EDIT: Explanation for the second one:
gsub(".*\\s+(\\d+)\\s+#.*", "\\1", strings)
.* - .=any character except \n; *=any number of times
\\s+ - \\s =whitespace; +=at least one instance (of the whitespace)
(\\d+) - ()=capture group, which you can reference later by its position among the capture groups (i.e., "\\1" returns what the first group matched); \\d=digit; +=at least one instance (of a digit)
\\s+# - \\s =whitespace; +=at least one instance (of the whitespace); # a literal pound sign
.* - .=any character except \n; *=any number of times
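To cover the "fail safe" decimal case from the question, the same idea can be sketched as a small helper. This is only an illustration: the function name extract_stat, the vector stat_lines, and the second (made-up) stats line are assumptions, and the pattern assumes each line has the shape "name value # comment".

```r
# Hypothetical helper: return the numeric value that sits between the
# statistic name and the "#" comment, or NA when a line has another shape.
extract_stat <- function(lines) {
  # name, whitespace, integer or decimal value, whitespace, "#", comment
  pattern <- "^\\S+\\s+(-?[0-9]+(?:\\.[0-9]+)?)\\s+#.*$"
  hit <- grepl(pattern, lines, perl = TRUE)
  out <- rep(NA_real_, length(lines))
  out[hit] <- as.numeric(sub(pattern, "\\1", lines[hit], perl = TRUE))
  out
}

stat_lines <- c(
  "system.l2.compressor.compression_size::1     0    # Number of blocks that compressed to fit in 1 bits",
  "system.l2.example_ratio   0.125   # made-up line with a decimal value"
)
extract_stat(stat_lines)
```

Because the value is anchored between the name and the "#", digits inside the name (l2, ::1) or the comment (1 bits) are never picked up.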
Data:
strings <- c("system.l2.compressor.compression_size::256 58740 # Number of blocks that compressed to fit in 256 bits",
"system.l2.compressor.encoding::Base*.8_1 58731 # Number of data entries that match encoding Base8_1",
"system.l2.overall_hits::.cpu.data 70576 # number of overall hits")

Related

Can quantifiers be used in regex replacement in R?

My objective is to replace a string with a symbol repeated as many times as the string has characters, in the same way that one can convert letters to capitals with \\U\\1. If my pattern were "...(*)...", my replacement for what is captured by (*) would be something like x\\q1 or {\\q1}x, so I would get as many x as there are characters captured by *.
Is this possible?
I am thinking mainly of sub and gsub, but you can answer with other libraries like stringi, stringr, etc.
You can use perl = TRUE or perl = FALSE and any other options at your convenience.
I assume the answer may be negative, since the options seem quite limited (?gsub):
a replacement for matched pattern in sub and gsub. Coerced to character if possible. For fixed = FALSE this can include backreferences "\1" to "\9" to parenthesized subexpressions of pattern. For perl = TRUE only, it can also contain "\U" or "\L" to convert the rest of the replacement to upper or lower case and "\E" to end case conversion. If a character vector of length 2 or more is supplied, the first element is used with a warning. If NA, all elements in the result corresponding to matches will be set to NA.
Main quantifiers are (?base::regex):
?
The preceding item is optional and will be matched at most once.
*
The preceding item will be matched zero or more times.
+
The preceding item will be matched one or more times.
{n}
The preceding item is matched exactly n times.
{n,}
The preceding item is matched n or more times.
{n,m}
The preceding item is matched at least n times, but not more than m times.
OK, but there seems to be an option (which is not in PCRE; I'm not sure whether it is in Perl or elsewhere), (*), which captures the number of characters the star quantifier is able to match (I found it at https://www.rexegg.com/regex-quantifier-capture.html), so that \q1 (same reference) could then be used to refer to the first captured quantifier (and \q2, etc.). I also read that (*) is equivalent to {0,}, but I'm not sure whether that really holds for what I'm interested in.
EDIT UPDATE:
Since commenters asked, I am updating my question with a specific example provided by this interesting question. I modified the example a bit. Let's say we have a <- "I hate extra spaces elephant". We are interested in keeping a single space between words and the first 5 characters of each word (up to here, as in the original question), but then a dot for each remaining character (I'm not sure whether this is what the original question expected, but it doesn't matter), so the resulting string would be "I hate extra space. eleph..." (one . for the last s in spaces and 3 dots for the 3 letters ant at the end of elephant). So I started by keeping the first 5 characters with
gsub("(?<!\\S)(\\S{5})\\S*", "\\1", a, perl = TRUE)
[1] "I hate extra space eleph"
How should I replace the exact number of characters in \\S* by dots or any other symbol?
Quantifiers cannot be used in the replacement pattern, and neither can the information about how many chars they matched.
What you need is a \G-based PCRE pattern to find consecutive matches after a specific place in the string:
a <- "I hate extra spaces elephant"
gsub("(?:\\G(?!^)|(?<!\\S)\\S{5})\\K\\S", ".", a, perl = TRUE)
Details
(?:\G(?!^)|(?<!\S)\S{5}) - the end of the previous successful match or five non-whitespace chars not preceded with a non-whitespace char
\K - a match reset operator discarding text matched so far
\S - any non-whitespace char.
gsubfn is like gsub except that the replacement can be a function which takes the match as input and returns the replacement. The function can optionally be expressed as a formula, as we do here, replacing each string of word characters with the output of the function applied to that string. No complex regular expressions are needed.
library(gsubfn)
gsubfn("\\w+", ~ paste0(substr(x, 1, 5), strrep(".", max(0, nchar(x) - 5))), a)
## [1] "I hate extra space. eleph..."
or almost the same, except that the function is slightly different:
gsubfn("\\w+", ~ paste0(substr(x, 1, 5), substring(gsub(".", ".", x), 6)), a)
## [1] "I hate extra space. eleph..."
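If adding a package is not an option, the same per-word idea can be sketched in base R with regmatches<-. This is an alternative to the answers above, not their method; the variable names dotted, m and words are made up for the illustration.

```r
a <- "I hate extra spaces elephant"

dotted <- a                              # work on a copy, keep `a` intact
m <- gregexpr("\\S+", dotted)            # positions of every run of non-whitespace
words <- regmatches(dotted, m)[[1]]      # the words themselves

# Keep the first 5 characters, then one dot per remaining character.
replacement <- paste0(substr(words, 1, 5),
                      strrep(".", pmax(0, nchar(words) - 5)))

regmatches(dotted, m) <- list(replacement)
dotted
## [1] "I hate extra space. eleph..."
```

Only the words are rewritten here; whitespace between them is left exactly as it was, so collapsing repeated spaces would need a separate step.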

Extracting word with co-occurring alphabets in R

I want to extract certain words from a bigger word-list. One example of a desired extracted word-list is: extract all the words that contain /s/ followed by /r/. This should give me words such as sər'ka:rəh, e:k'sa:r, səmʋitərəɳ, and so:'ha:rd from the bigger word-list.
Consider the data (IPA transcription) to be the one given below:
sər'ka:rəh
sə'lᴔ:nija:
hã:ki:
pu:'dʒa:ẽ:
e:k'sa:r
mritko:
dʒʱã:sa:
pə'hũtʃ'ne:'ʋa:le:
kərəpʈ
tʃinhirit
tʃʰəʈʈʰi:
dʱũdʱ'la:pən
səmʋitərəɳ
so:'ha:rd
məl'ʈi:spe:'ʃijliʈi:
la:'pər'ʋa:i:
upləbɡʱ
Thanks much!
Here's an answer to the issue described in the first paragraph of your post. (To my mind, the examples in the second paragraph are inconsistent with the issue described in the first para, so I'll take the liberty of ignoring them here).
You say you want to "extract all the words that contain p followed by t". The word 'extract' implies that there are other characters in the same string than those you want to match and extract. The verb 'contain' implies that the words you want to extract need not necessarily have p in word-initial position. Based on these premises, here's some mock data and a solution to the task:
Data:
x <- c("pastry is to the pastor's appetite what pot is to the pupil's")
Solution:
library(stringr)
unlist(str_extract_all(x, "\\b\\w*(?<=p)\\w*t\\w*\\b"))
This uses word boundaries \\b to extract the target words from the surrounding context; it further uses positive lookbehind (?<=...) to assert the condition that for there to be a matching t there needs to be a p character occurring prior to the match.
The regex in more detail:
\\b: the opening word boundary
\\w*: zero or more alphanumeric chars (or an underscore)
(?<=p): positive lookbehind: "if and only if you see a p char on the left..."
\\w*: zero or more alphanumeric chars (or an underscore)
t: the literal character t
\\w*: zero or more alphanumeric chars (or an underscore)
\\b: the closing word boundary
Result:
[1] "pastry" "pastor" "appetite" "pot"
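For comparison, here is a minimal base R sketch that states the same condition without a lookbehind, by putting the literal p directly in the pattern. It is an alternative formulation, not the stringr solution above; the variable name words is made up.

```r
x <- "pastry is to the pastor's appetite what pot is to the pupil's"

# word boundary, any word chars, a p, any word chars, a t, any word chars
words <- regmatches(x, gregexpr("\\b\\w*p\\w*t\\w*\\b", x, perl = TRUE))[[1]]
words
## [1] "pastry"   "pastor"   "appetite" "pot"
```

On this mock data both patterns pick out the same four words, since each requires a p somewhere before a t inside one word.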
EDIT 1:
Now that the question has been updated, a more definitive answer is possible.
Data:
x <- c("sər'ka:rəh","sə'lᴔ:nija:","hã:ki:","pu:'dʒa:ẽ:","e:k'sa:r",
"mritko:","dʒʱã:sa:","pə'hũtʃ'ne:'ʋa:le:","kərəpʈ","tʃinhirit",
"tʃʰəʈʈʰi:","dʱũdʱ'la:pən","səmʋitərəɳ","so:'ha:rd",
"məl'ʈi:spe:'ʃijliʈi:", "la:'pər'ʋa:i:","upləbɡʱ")
If you want to match (rather than extract) words that "contain /s/ or rather /s/ followed by /r/", you can use grep in various ways. Here are two ways:
grep("s.*r", x, value = T)
or:
grep("(?<=s).*r", x, value = T, perl = T) # with lookbehind
The result is the same in either case:
[1] "sər'ka:rəh" "e:k'sa:r" "səmʋitərəɳ" "so:'ha:rd"
EDIT 2:
If the aim is to match words that "contain /s/ or /p/ followed by /r/ or /t/", you can use the metacharacter | to allow for alternatives:
grep("s.*r|s.*t|p.*r|p.*t", x, value = T)
# or, more succinctly:
grep("(s|p).*(r|t)", x, value = T)
[1] "sər'ka:rəh" "e:k'sa:r" "pə'hũtʃ'ne:'ʋa:le:" "səmʋitərəɳ" "so:'ha:rd"
[6] "la:'pər'ʋa:i:"
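When every alternative is a single character, a bracket expression says the same thing as the alternation. The sketch below uses a made-up ASCII vector (demo_words) rather than the IPA data, purely to keep the illustration easy to check by eye.

```r
# Made-up example words, not from the question's data.
demo_words <- c("sugar", "planet", "tulip", "rust", "mango")

with_groups  <- grep("(s|p).*(r|t)", demo_words, value = TRUE)
with_classes <- grep("[sp].*[rt]",   demo_words, value = TRUE)

with_classes
## [1] "sugar"  "planet" "rust"
identical(with_groups, with_classes)
## [1] TRUE
```

"tulip" is dropped because its p is not followed by an r or a t, which is the ordering the pattern enforces.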
You can use the grep function. Assuming your word list is stored in a vector called list:
grep("p[a-z]+t", list, value=TRUE)

R regular expression match/omit several repeats

I'm using back references to get rid of accidental repeats in vectors of variable names. The names in the first case I encountered have repeat patterns like this
x <- c("gender_gender-1", "county_county-2", "country_country-1997",
"country_country-1993")
The repeats were always separated by an underscore, and there was only one repeat to eliminate. They always start at the beginning of the text. After checking the Regular Expressions Cookbook, 2nd ed., I arrived at an answer that works:
> gsub("^(.*?)_\\1", "\\1", x)
[1] "gender-1" "county-2" "country-1997" "country-1993"
I was worried that the future cases might have dash or space as separator, so I wanted to generalize the matching a bit. I got that worked out as well.
> x <- c("gender_gender-1", "county-county-2", "country country-1997",
+ "country,country-1993")
> gsub("^(.*?)[,_ -]\\1", "\\1", x)
[1] "gender-1" "county-2" "country-1997" "country-1993"
So far, total victory.
Now, what is the correct fix if there are three repeats in some cases? In this one, I want "county-county-county" to become just one "county".
> x <- c("gender_gender-1", "county-county-county-2")
> gsub("^(.*?)[,_ -]\\1", "\\1", x)
[1] "gender-1" "county-county-2"
I am willing to replace all of the separators by "_" if that makes it easier to get rid of the repeat words.
You may quantify the [,_ -]\1 part:
gsub("^(.*?)(?:[,_\\s-]\\1)+", "\\1", x, perl=TRUE)
Note that I also replaced the space with \s to match any whitespace (this requires perl=TRUE). You may also match any whitespace with the POSIX class [:space:] inside the bracket expression, in which case you do not need perl=TRUE, i.e. gsub("^(.*?)(?:[,_[:space:]-]\\1)+", "\\1", x).
Details:
^ - matches the start of a string
(.*?) - any 0+ chars as few as possible up to the first...
(?:
[,_\\s-] - ,, _, whitespace or -
\\1 - same value as captured in Group 1
)+ - 1 or more times.
If you only want to match the repeated part 1 or 2 times, replace + with the {1,2} limiting quantifier:
gsub("^(.*?)(?:[,_\\s-]\\1){1,2}", "\\1", x, perl=TRUE)
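As a quick check, the quantified pattern can be wrapped in a small function and run over a vector that mixes separators and repeat counts. The function name dedupe_prefix and the combined test vector are assumptions made for this sketch.

```r
# Hypothetical wrapper around the quantified pattern shown above.
dedupe_prefix <- function(names) {
  gsub("^(.*?)(?:[,_[:space:]-]\\1)+", "\\1", names, perl = TRUE)
}

x <- c("gender_gender-1", "county-county-county-2",
       "country country-1997", "country,country-1993")
dedupe_prefix(x)
## [1] "gender-1"     "county-2"     "country-1997" "country-1993"
```

Names without a repeated prefix do not match the pattern at all, so gsub returns them unchanged.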

regular expression: remove consecutive repeated characters at least 2 times as well as those after it in a string in R

I have a vector with different strings like this:
s <- c("mir123mm8", "qwe98wwww98", "123m3tqppppw23!")
and
> s
[1] "mir123mm8" "qwe98wwww98" "123m3tqppppw23!"
I would like to have the answer like this:
> c("mir123", "qwe98", "123m3tq")
[1] "mir123" "qwe98" "123m3tq"
That means that if a string has a character repeated at least 2 times consecutively, then the repeated characters and everything after them should be removed.
What is the best way to do this using a regular expression in R?
You can use a backreference in the pattern to match repeated characters:
sub("(.*?)(.)\\2.*", "\\1", s)
# [1] "mir123" "qwe98" "123m3tq"
The pattern matches when the second capture group, which is a single character, is repeated directly after itself. The first capture group is made non-greedy with ? so that it stops at the earliest such repeat, and whenever the pattern matches, only the first capture group is returned.
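A short sketch of how the same pattern behaves on a string that has no repeated character. The function name truncate_at_repeat and the extra element "abc123" are made up for the illustration; the anchors and perl = TRUE are additions that do not change the idea.

```r
# Hypothetical wrapper: cut each string at its first doubled character.
truncate_at_repeat <- function(strings) {
  sub("^(.*?)(.)\\2.*$", "\\1", strings, perl = TRUE)
}

s <- c("mir123mm8", "qwe98wwww98", "123m3tqppppw23!", "abc123")
truncate_at_repeat(s)
## [1] "mir123"  "qwe98"   "123m3tq" "abc123"
```

Since sub only rewrites strings that match, "abc123" comes back untouched.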

get character before second underscore [duplicate]

What regular expression can retrieve (e.g. with sub()) the characters before the second period? Given a character vector like:
v <- c("m_s.E1.m_x.R1PE1", "m_xs.P1.m_s.R2E12")
I would like to have returned this:
[1] "m_s.E1" "m_xs.P1"
> sub( "(^[^.]+[.][^.]+)(.+$)", "\\1", v)
[1] "m_s.E1" "m_xs.P1"
Now to explain it: The symbols inside the first and third paired "[ ]" match any character except a period ("character classes"), and the "+"'s that follow them let that be an arbitrary number of such characters. The [.] therefore matches only the first period, and the second period terminates the match. Parenthesis pairs let you specify partial sections of the matched characters, and there are two sections. The second section is any character (the period symbol) repeated an arbitrary number of times until the end of the string, $. The "\\1" specifies only the first partial match as the returned value.
The ^ operator means different things inside and outside the square-brackets. Outside it refers to the length-zero beginning of the string. Inside at the beginning of a character class specification, it is the negation operation.
This is a good use case for "character classes" which are described in the help page found by typing:
?regex
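For readers who prefer to avoid the regex altogether, a base R sketch that splits on the literal period and glues the first two pieces back together gives the same result on this data. The function name before_second_period is made up; this is an alternative, not the approach explained above.

```r
# Hypothetical helper: keep everything before the second period.
before_second_period <- function(strings) {
  pieces <- strsplit(strings, ".", fixed = TRUE)   # split on a literal "."
  vapply(pieces, function(p) paste(head(p, 2), collapse = "."), character(1))
}

v <- c("m_s.E1.m_x.R1PE1", "m_xs.P1.m_s.R2E12")
before_second_period(v)
## [1] "m_s.E1"  "m_xs.P1"
```

Using head(p, 2) rather than p[1:2] means a string with fewer than two periods is returned as it is instead of picking up an NA.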
Not regex, but the qdap package has the beg2char function (beginning of string to the nth character) to handle this:
library(qdap)
beg2char(v, ".", 2)
## [1] "m_s.E1" "m_xs.P1"
