Extracting word with co-occurring alphabets in R

Extracting word with co-occurring alphabets in R - r

I wanted to extract certain words from a bigger word-list. One example of a desired extracted word-list is: extract all the words that contain /s/ followed by /r/. So this should give me words such as sər'ka:rəh, e:k'sa:r, səmʋitərəɳ, and so:'ha:rd. from the bigger word-list.
Consider the data (IPA transcription) to be the one given below:
sər'ka:rəh
sə'lᴔ:nija:
hã:ki:
pu:'dʒa:ẽ:
e:k'sa:r
mritko:
dʒʱã:sa:
pə'hũtʃ'ne:'ʋa:le:
kərəpʈ
tʃinhirit
tʃʰəʈʈʰi:
dʱũdʱ'la:pən
səmʋitərəɳ
so:'ha:rd
məl'ʈi:spe:'ʃijliʈi:
la:'pər'ʋa:i:
upləbɡʱ
Thanks much!

Here's an answer to the issue described in the first paragraph of your post. (To my mind, the examples in the second paragraph are inconsistent with the issue described in the first para, so I'll take the liberty of ignoring them here).
You say you want to "extract all the words that contain p followed by t". The word 'extract' implies that there are other characters in the same string than those you want to match and extract. The verb 'contain' implies that the words you want to extract need not necessarily have p in word-initial position. Based on these premises, here's some mock data and a solution to the task:
Data:
x <- c("pastry is to the pastor's appetite what pot is to the pupil's")
Solution:
libary(stringr)
unlist(str_extract_all(x, "\\b\\w*(?<=p)\\w*t\\w*\\b"))
This uses word boundaries \\b to extract the target words from the surrounding context; it further uses positive lookbehind (?<=...) to assert the condition that for there to be a matching t there needs to be a p character occurring prior to the match.
The regex in more detail:
\\b: the opening word boundary
\\w*: zero or more alphanumeric chars (or an underscore)
(?<=p): positive lookbehind: "if and only if you see a p char on
the left..."
\\w*: zero or more alphanumeric chars (or an underscore)
t: the literal character t
\\w*: zero or more alphanumeric chars (or an underscore)
\\b: the closing word boundary
Result:
[1] "pastry" "pastor" "appetite" "pot"
EDIT 1:
Now that the question has been updated, a more definitive answer is possible.
Data:
x <- c("sər'ka:rəh","sə'lᴔ:nija:","hã:ki:","pu:'dʒa:ẽ:","e:k'sa:r",
"mritko:","dʒʱã:sa:","pə'hũtʃ'ne:'ʋa:le:","kərəpʈ","tʃinhirit",
"tʃʰəʈʈʰi:","dʱũdʱ'la:pən","səmʋitərəɳ","so:'ha:rd",
"məl'ʈi:spe:'ʃijliʈi:", "la:'pər'ʋa:i:","upləbɡʱ")
If you want to match (rather than extract) words that "contain /s/ followed by /r/", you can use grepin various ways. Here are two ways:
grep("s.*r", x, value = T)
or:
grep("(?<=s).*r", x, value = T, perl = T) # with lookbehind
The result is the same in either case:
[1] "sər'ka:rəh" "e:k'sa:r" "səmʋitərəɳ" "so:'ha:rd"
EDIT 2:
If the aim is to match words that "contain /s/ or /p/ followed by /r/ or /t/", you can use the metacharacter | to allow for alternatives:
grep("s.*r|s.*t|p.*r|p.*t", x, value = T)
# or, more succinctly:
grep("(s|p).*(r|t)", x, value = T)
[1] "sər'ka:rəh" "e:k'sa:r" "pə'hũtʃ'ne:'ʋa:le:" "səmʋitərəɳ" "so:'ha:rd"
[6] "la:'pər'ʋa:i:"

You can use grep function. Assuming your list is called list:
grep("p[a-z]+t", list, value=TRUE)

Related

Can quantifiers be used in regex replacement in R?

My objective would be replacing a string by a symbol repeated as many characters as have the string, in a way as one can replace letters to capital letters with \\U\\1, if my pattern was "...(*)..." my replacement for what is captured by (*) would be something like x\\q1 or {\\q1}x so I would get so many x as characters captured by *.
Is this possible?
I am thinking mainly in sub,gsub but you can answer with other libraris like stringi,stringr, etc.
You can use perl = TRUE or perl = FALSE and any other options with convenience.
I assume the answer can be negative, since seems to be quite limited options (?gsub):
a replacement for matched pattern in sub and gsub. Coerced to character if possible. For fixed = FALSE this can include backreferences "\1" to "\9" to parenthesized subexpressions of pattern. For perl = TRUE only, it can also contain "\U" or "\L" to convert the rest of the replacement to upper or lower case and "\E" to end case conversion. If a character vector of length 2 or more is supplied, the first element is used with a warning. If NA, all elements in the result corresponding to matches will be set to NA.
Main quantifiers are (?base::regex):
?
The preceding item is optional and will be matched at most once.
*
The preceding item will be matched zero or more times.
+
The preceding item will be matched one or more times.
{n}
The preceding item is matched exactly n times.
{n,}
The preceding item is matched n or more times.
{n,m}
The preceding item is matched at least n times, but not more than m times.
Ok, but it seems to be an option (which is not in PCRE, not sure if in PERL or where...) (*) which captures the number of characters the star quantifier is able to match (I found it at https://www.rexegg.com/regex-quantifier-capture.html) so then it could be used \q1 (same reference) to refer to the first captured quantifier (and \q2, etc.). I also read that (*) is equivalent to {0,} but I'm not sure if this is really the fact for what I'm interested in.
EDIT UPDATE:
Since asked by commenters I update my question with an specific example provide by this interesting question. I modify a bit the example. Let's say we have a <- "I hate extra spaces elephant" so we are interested in keeping the a unique space between words, the 5 first characters of each word (till here as the original question) but then a dot for each other character (not sure if this is what is expected in the original question but doesn't matter) so the resulting string would be "I hate extra space. eleph..." (one . for the last s in spaces and 3 dots for the 3 letters ant in the end of elephant). So I started by keeping the 5 first characters with
gsub("(?<!\\S)(\\S{5})\\S*", "\\1", a, perl = TRUE)
[1] "I hate extra space eleph"
How should I replace the exact number of characters in \\S* by dots or any other symbol?

Quantifiers cannot be used in the replacement pattern, nor the information how many chars they match.
What you need is a \G base PCRE pattern to find consecutive matches after a specific place in the string:
a <- "I hate extra spaces elephant"
gsub("(?:\\G(?!^)|(?<!\\S)\\S{5})\\K\\S", ".", a, perl = TRUE)
See the R demo and the regex demo.
Details
(?:\G(?!^)|(?<!\S)\S{5}) - the end of the previous successful match or five non-whitespace chars not preceded with a non-whitespace char
\K - a match reset operator discarding text matched so far
\S - any non-whitespace char.

gsubfn is like gsub except the replacement string can be a function which inputs the match and outputs the replacement. The function can optionally be expressed a formula as we do here replacing each string of word characters with the output of the function replacing that string. No complex regular expressions are needed.
library(gsubfn)
gsubfn("\\w+", ~ paste0(substr(x, 1, 5), strrep(".", max(0, nchar(x) - 5))), a)
## [1] "I hate extra space. eleph..."
or almost the same except function is slightly different:
gsubfn("\\w+", ~ paste0(substr(x, 1, 5), substring(gsub(".", ".", x), 6)), a)
## [1] "I hate extra space. eleph..."

Regex - Best way to match all values between two two digit numbers?

Let's say I want a Regex expression that will only match numbers between 18 and 31. What is the right way to do this?
I have a set of strings that look like this:
"quiz.18.player.total_score"
"quiz.19.player.total_score"
"quiz.20.player.total_score"
"quiz.21.player.total_score"
I am trying to match only the strings that contain the numbers 18-31, and am currently trying something like this
(quiz.)[1-3]{1}[1-9]{1}.player.total_score
This obviously won't work because it will actually match all numbers between 11-39. What is the right way to do this?

Regex: 1[89]|2\d|3[01]
For matching add additional text and escape the dots:
quiz\.(?:1[89]|2\d|3[01])\.player\.total_score
Details:
(?:) non-capturing group
[] match a single character present in the list
| or
\d matches a digit (equal to [0-9])
\. dot
. matches any character

!) If s is the character vector read the fields into a data frame picking off the second field and check whether it is in the desired range. Put the result in logical vector ok and get those elements from s. This uses no regular expressions and only base R.
digits <- read.table(text = s, sep = ".")$V2
s[digits %in% 18:31]
2) Another approach based on the pattern "\\D" matching any non-digit is to remove all such characters and then check if what is left is in the desired range:
digits <- gsub("\\D", "", s)
s[digits %in% 18:31]
2a) In the development version of R (to be 3.6.0) we could alternately use the new whitespace argument of trimws like this:
digits <- trimws(s, whitespace = "\\D")
s[digits %in% 18:31]
3) Another alternative is to simply construct the boundary strings and compare s to them. This will work only if all the number parts in s are exactly the same number of digits (which for the sample shown in the question is the case).
ok <- s >= "quiz.18.player.total_score" & s <= "quiz.31.player.total_score"
s[ok]

This is done using character ranges and alternations. For your range
3[10]|[2][0-9]|1[8-9]
Demo

Extract first letter in each word in R

I had a data.frame with some categorical variables. Let's suppose sentences is one of these variables:
sentences <- c("Direito à participação e ao controle social",
"Direito a ser ouvido pelo governo e representantes",
"Direito aos serviços públicos",
"Direito de acesso à informação")
For each value, I would like to extract just the first letter of each word, ignoring if the word has 4 letters or less (e, de, à, a, aos, ser, pelo), My goal is create acronym variables. I expect the following result:
[1] "DPCS", "DOGR", "DSP", "DAI
I tried to make a pattern subset using stringr with a regex pattern founded here:
library(stringr)
pattern <- "^(\b[A-Z]\w*\s*)+$"
str_subset(str_to_upper(sentences), pattern)
But I got an error when creating the pattern object:
Error: '\w' is an escape sequence not recognized in the string beginning with ""^(\b[A-Z]\w"
What am I doing wrong?
Thanks in advance for any help.

You can use gsub to delete all the unwanted characters and remain with the ones you want. From the expected output, it seems you are still using characters from words tht are 3 characters long:
gsub('\\b(\\pL)\\pL{2,}|.','\\U\\1',sentences,perl = TRUE)
[1] "DPCS" "DSOPGR" "DASP" "DAI"
But if we were to ignore the words you indicated then it would be:
gsub('\\b(\\pL)\\pL{4,}|.','\\U\\1',sentences,perl = TRUE)
[1] "DPCS" "DOGR" "DSP" "DAI"

#Onyambu's answer is great, though as a regular expression beginner, it does take me a long time to try to understand it so that I can make modifications to suit my own needs.
Here is my understanding to gsub('\\b(\\pL)\\pL{4,}|.','\\U\\1',sentences,perl = TRUE).
Post in the hope of being helpful to others.
Background information:
\\b: boundary of word
\\pL matches any kind of letter from any language
{4,} is an occurrence indicator
{m}: The preceding item is matched exactly m times.
{m,}: The preceding item is matched m or more times, i.e., m+
{m,n}: The preceding item is matched at least m times, but not more than n times.
| is OR logic operator
. represents any one character except newline.
\\U\\1 in the replacement text is to reinsert text captured by the pattern as well as capitalize the texts. Note that parentheses () create a numbered capturing group in the pattern.
With all the background knowledge, the interpretation of the command is
replace words matching \\b(\\pL)\\pL{4,} with the first letter
replace any character not matching the above pattern with "" as nothing is captured for this group
Here are two great places I learned all these backgrounds.
https://www.regular-expressions.info/rlanguage.html
https://www3.ntu.edu.sg/home/ehchua/programming/howto/Regexe.html

You can use this pattern: (?<=^| )\S(?=\pL{4,})
I used a positive lookbehind to make sure the matches are preceded by either a space or the beginning of the line. Then I match one character, only if it is followed by 4 or more letters, hence the positive lookahead.
I suggest you don't use \w for non-English languages, because it won't match any characters with accents. Instead, \pL matches any letter from any language.
Once you have your matches, you can just concatenate them to create your strings (dpcs, dogr, etc...)
Here's a demo

R Regex to identify and replace characters between multiple dots

I have the following codes
"ABC.A.SVN.10.10.390.10.UDGGL"
"XYZ.Z.SVN.11.12.111.99.ASDDL"
and I need to replace the characters that exist between the 2nd and the 3rd dot. In this case it is SVN but it may well be any combination of between A and ZZZ, so really the only way to make this work is by using the dots.
The required outcome would be:
"ABC.A..10.10.390.10.UDGGL"
"XYZ.Z..11.12.111.99.ASDDL"
I tried variants of grep("^.+(\\.\\).$", "ABC.A.SVN.10.10.390.10.UDGGL") but I get an error.
Some examples of what I have tried with no success :
Link 1
Link 2
EDIT
I tried #Onyambu 's first method and I ran into a variant which I had not accounted for: "ABC.A.AB11.1.12.112.1123.UDGGL". In the replacement part, I also have numeric values. The desired outcome is "ABC.A..1.12.112.1123.UDGGL" and I get it using sub("\\.\\w+.\\B.",".",x) per the second part of his answer!

See code in use here
x <- c("ABC.A.SVN.10.10.390.10.UDGGL", "XYZ.Z.SVN.11.12.111.99.ASDDL")
sub("^(?:[^.]*\\.){2}\\K[^.]*", "", x, perl=T)
^ Assert position at the start of the line
(?:[^.]*\.){2} Match the following exactly twice
[^.]*\. Match any character except . any number of times, followed by .
\K Resets the starting point of the pattern. Any previously consumed characters are no longer included in the final match
[^.]* Match any character except . any number of times
Results in [1] "ABC.A..10.10.390.10.UDGGL" "XYZ.Z..11.12.111.99.ASDDL"

x= "ABC.A.SVN.10.10.390.10.UDGGL" "XYZ.Z.SVN.11.12.111.99.ASDDL"
sub("([A-Z]+)(\\.\\d+)","\\2",x)
[1] "ABC.A..10.10.390.10.UDGGL" "XYZ.Z..11.12.111.99.ASDDL"
([A-Z]+) Capture any word that has the characters A-Z
(\\.\\d+) The captured word above, must be followed with a dot ie\\..This dot is then followed by numbers ie \\d+. This completes the capture.
so far the captured part of the string "ABC.A.SVN.10.10.390.10.UDGGL" is SVN.10 since this is the part that matches the regular expression. But this part was captured as SVN and .10. we do a backreference ie replace the whole SVN.10 with the 2nd part .10
Another logic that will work:
sub("\\.\\w+.\\B.",".",x)
[1] "ABC.A..10.10.390.10.UDGGL" "XYZ.Z..11.12.111.99.ASDDL"

Not exactly regex but here is one more approach
#DATA
S = c("ABC.A.SVN.10.10.390.10.UDGGL", "XYZ.Z.SVN.11.12.111.99.ASDDL")
sapply(X = S,
FUN = function(str){
ind = unlist(gregexpr("\\.", str))[2:3]
paste(c(substring(str, 1, ind[1]),
"SUBSTITUTION",
substring(str, ind[2], )), collapse = "")
},
USE.NAMES = FALSE)
#[1] "ABC.A.SUBSTITUTION.10.10.390.10.UDGGL" "XYZ.Z.SUBSTITUTION.11.12.111.99.ASDDL"

get character before second underscore [duplicate]

What regular expression can retrieve (e.g. with sup()) the characters before the second period. Given a character vector like:
v <- c("m_s.E1.m_x.R1PE1", "m_xs.P1.m_s.R2E12")
I would like to have returned this:
[1] "m_s.E1" "m_xs.P1"

> sub( "(^[^.]+[.][^.]+)(.+$)", "\\1", v)
[1] "m_s.E1" "m_xs.P1"
Now to explain it: The symbols inside the first and third paired "[ ]" match any character except a period ("character classes"), and the "+"'s that follow them let that be an arbitrary number of such characters. The [.] therefore is only matching the first period, and the second period will terminate the match. Parentheses-pairs allow you to specific partial sections of matched characters and there are two sections. The second section is any character (the period symbol) repeated an arbitrary number of times until the end of the string, $. The "\\1" specifies only the first partial match as the returned value.
The ^ operator means different things inside and outside the square-brackets. Outside it refers to the length-zero beginning of the string. Inside at the beginning of a character class specification, it is the negation operation.
This is a good use case for "character classes" which are described in the help page found by typing:
?regex

Not regex but the qdap package has the beg2char (beginning of string 2 n character) to handle this:
library(qdap)
beg2char(v, ".", 2)
## [1] "m_s.E1" "m_xs.P1"

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Extracting word with co-occurring alphabets in R - r

You can use grep function. Assuming your list is called list: grep("p[a-z]+t", list, value=TRUE)

Related

Can quantifiers be used in regex replacement in R?

Regex - Best way to match all values between two two digit numbers?

Extract first letter in each word in R

R Regex to identify and replace characters between multiple dots

get character before second underscore [duplicate]

Categories

Resources