Extract disallowed characters - r

I have transcriptions with erroneous encodings, that is, characters that occur but should not occur.
In this toy data, the only allowed characters are this class:
"[)(/][A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]"
df <- data.frame(
Utterance = c("~°maybe you (.) >should ¥just¥<",
"SOME text |<-- pipe¿ and€", # <--: | and €
"blah%", # <--: %
"text ^more text", # <--: ^
"£norm(hh)a::l£mal, (1.22)"))
What I need to do is:
detect Utterances that contain any wrong encodings
extract the wrong characters
I'm doing OK as far as detection is concerned but the extraction fails miserably:
library(stringr)
library(dplyr)
df %>%
filter(!str_detect(Utterance, "[)(/][A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]")) %>%
mutate(WrongChar = str_extract_all(Utterance, "[^)(/][A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]"))
Utterance WrongChar
1 SOME text |<-- pipe¿ and€ SO, ME, t, ex, |<, --, p, ip, e¿, a, nd
2 blah% bl, ah
3 text ^more text te, xt, ^m, or, t, ex
How can the extraction be improved to obtain this expected result:
Utterance WrongChar
1 SOME text |<-- pipe¿ and€ |, €
2 blah% %
3 text ^more text ^

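You need to
Ensure the [ and ] are escaped inside the character class. Unescaped, the ] in [)(/] closes the class early, so the pattern is really two consecutive classes and matches two-character sequences.
Add a whitespace pattern (\\s) to the class in both regex checks. Spaces are not in your allowed set, so without it every space counts as a wrong character.
So you need to use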
df %>%
filter(str_detect(Utterance, "[^\\s)(/\\]\\[A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]")) %>%
mutate(WrongChar = str_extract_all(Utterance, "[^\\s)(/\\]\\[A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]"))
Output:
Utterance WrongChar
1 SOME text |<-- pipe¿ and€ |, €
2 blah% %
3 text ^more text ^
Note that I used positive logic in filter(str_detect(Utterance, "[^\\s)(/\\]\\[A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]")), so we get all items that contain at least one char other than an allowed one.
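The same idea can be sketched in base R without stringr or dplyr. This is only an illustration: the allowed set is cut down to ASCII characters and the toy strings are shortened, both of which are my assumptions rather than part of the original data.

```r
# Base-R sketch; the allowed set is reduced to ASCII (an assumption for brevity)
x <- c("maybe you (.) >should<",
       "SOME text |<-- pipe",
       "blah%",
       "text ^more text")

# One negated class: anything that is NOT whitespace and NOT an allowed character
bad <- "[^\\s)(/\\]\\[A-Za-z0-9!.,:?~<>=_-]"

has_bad <- grepl(bad, x, perl = TRUE)                    # detection
wrong   <- regmatches(x, gregexpr(bad, x, perl = TRUE))  # extraction, one vector per string

# Keep only the offending utterances and collapse their wrong characters
result <- data.frame(Utterance = x[has_bad],
                     WrongChar = sapply(wrong[has_bad], paste, collapse = ", "))
print(result)
```

Using one pattern for both steps keeps detection and extraction consistent: a string is flagged exactly when the extraction returns at least one character.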

Related

How to remove Unicode representations of Emojis in strings using regexp in R?

I am working with data from the Twitter API and wherever users had included Emojis in their name field, they have been translated to Unicode string representations in my dataframe. The structure of my data is somewhat like this:
user_profiles <- as.data.frame(c("Susanne Bold", "Julian K. Peard <U+0001F41C>",
"<U+0001F30A> Alexander K Miller <U+0001F30A>", "John Mason"))
colnames(user_profiles) <- "name"
which looks like this:
name
1 Susanne Bold
2 Julian K. Peard <U+0001F41C>
3 <U+0001F30A> Alexander K Miller <U+0001F30A>
4 John Mason
I am now trying to isolate the actual name into a new column using regexp:
user_profiles <- user_profiles %>%
mutate(clean_name = str_remove_all(name, "\\<U\\+[[:alnum:]]\\>[ ]?"))
But this expression (1) seems rather complicated and (2) doesn't identify the pattern. I have tried multiple variations of the regexp already. Weirdly enough, grepl is able to detect the pattern with this version (which str_remove_all doesn't accept, since it is missing a closing bracket):
grepl("\\<U\\+[[:alnum:]\\>[ ]?", user_profiles$name)
[1] FALSE TRUE TRUE FALSE
# note that the second bracket around alnum is left opened
Can somebody explain this or offer an easier solution?
Thanks a lot!
The first str_remove_all does not work because you missed the + quantifier after the alphanumeric pattern. Also, note that after <U+, only hex chars are used, so instead of [:alnum:], you can use a more precise [:xdigit:] POSIX character class.
You can use
user_profiles <- user_profiles %>%
mutate(clean_name = str_remove_all(name, "<U\\+[[:xdigit:]]+>\\s*"))
Do not escape < and >: unescaped, they are plain literal characters, whereas in the TRE regex flavor (used by base R regex functions without perl=TRUE), \< and \> are word boundaries.
Pattern details
<U - <U string
\+ - a literal +
[[:xdigit:]]+ - one or more hex chars
> - a > char
\s* - zero or more whitespaces.
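For anyone without stringr, here is a base-R sketch of the same pattern using gsub. The trimws() call is my addition (not part of the answer above); it removes the space that is left behind when the code point sits at the end of a name.

```r
# Base-R sketch of the same removal; trimws() is an extra step added here
name <- c("Susanne Bold",
          "Julian K. Peard <U+0001F41C>",
          "<U+0001F30A> Alexander K Miller <U+0001F30A>",
          "John Mason")

# Remove every <U+HEX> token plus any whitespace that follows it
stripped <- gsub("<U\\+[[:xdigit:]]+>\\s*", "", name)

# Drop leading/trailing whitespace left over from tokens at the end of a name
clean_name <- trimws(stripped)
print(clean_name)
```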
Why does the grepl regex work? This is interesting, because you omitted the ] closing bracket expression boundary char, and "spoilt" the regex to match like this:
\<U\+ - a word boundary (in TRE, \< matches a left-hand word boundary) and then U+ string
[[:alnum:]\>[ ]? - this is an optional bracket expression that matches one or zero chars from the set:
[:alnum:] - any alphanumeric char
\ - a backslash (yes, because in TRE regex flavor, regex escape sequences are treated literally)
> - a > char
[ - a [ char
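(space) - a space.
So, it matches U+0 in <U+0001F41C>, for example. The leading < is not part of the match, because \< only asserts a position and consumes no character.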
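The word-boundary behaviour can be checked directly with a small sketch. The variable names are made up for the illustration; the three calls only differ in whether < is escaped and which engine is used.

```r
x <- "<U+0001F41C>"

# TRE (default engine): \< asserts the start of a word, so < is not matched
with_escape <- regmatches(x, regexpr("\\<U\\+", x))

# Unescaped < is a plain literal character
without_escape <- regmatches(x, regexpr("<U\\+", x))

# PCRE (perl = TRUE): an escaped < is just a literal <
with_pcre <- regmatches(x, regexpr("\\<U\\+", x, perl = TRUE))

print(c(with_escape, without_escape, with_pcre))
```

The first result lacks the <, which is why the escaped pattern behaves differently in base R functions than one might expect.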
Here is an alternative way how we could do it:
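library(dplyr)
library(tidyr)
library(stringr)  # for str_detect() and fixed()
user_profiles %>%
  separate_rows(name, sep = '<|>') %>%         # split on literal < and >
  filter(!str_detect(name, fixed('U+'))) %>%   # fixed(): as a regex, 'U+' would mean "one or more U"
  mutate(name = na_if(name, "")) %>%
  na.omit()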
name
<chr>
1 "Susanne Bold"
2 "Julian K. Peard "
3 " Alexander K Miller "
4 "John Mason"
We can add one or more (+) for the [[:alnum:]]
library(dplyr)
library(stringr)
user_profiles <- user_profiles %>%
mutate(clean_name = str_remove_all(name, "\\s*\\<U\\+[[:alnum:]]+\\>\\s*"))
-output
user_profiles
name clean_name
1 Susanne Bold Susanne Bold
2 Julian K. Peard <U+0001F41C> Julian K. Peard
3 <U+0001F30A> Alexander K Miller <U+0001F30A> Alexander K Miller
4 John Mason John Mason

extract part of word into a field from a long string using R

I have a single long string variable with 3 observations. I am trying to create a field prob that holds a specific substring extracted from each long string. The data, code and warning message are below.
Data aa:
"The probability of being a carrier is 0.0002422359 "
" an BRCA1 carrier 0.0001061067 "
" an BRCA2 carrier 0.00013612 "
aa$prob <- ifelse(grepl("The probability of being a carrier is", xx)==TRUE, word(aa, 8, 8), ifelse(grepl("BRCA", xx)==TRUE, word(aa, 5, 5), NA))
Warning message: In aa$prob <- ifelse(grepl("The probability of being a carrier is", : Coercing LHS to a list
Here is my previous answer, updated to reflect a data.frame.
library(dplyr)
aa <- data.frame(aa = c("...", "...", "The probability of being a carrier is 0.0002422359 ", " an BRCA1 carrier 0.0001061067 ", " an BRCA2 carrier 0.00013612 ", "..."))
aa %>%
mutate(prob = as.numeric(if_else(grepl("(probability|BRCA[12] carrier)", aa),
gsub("^.*?\\b([0-9]+\\.?[0-9]*)\\s*$", "\\1", aa), NA_character_)))
# aa prob
# 1 ... NA
# 2 ... NA
# 3 The probability of being a carrier is 0.0002422359 0.0002422359
# 4 an BRCA1 carrier 0.0001061067 0.0001061067
# 5 an BRCA2 carrier 0.00013612 0.0001361200
# 6 ... NA
Regex walk-through:
^ and $ are the beginning and end of the string, respectively; \\b is a word boundary; none of these "consume" any characters, they just mark beginnings and endings
. means any one character
? means "zero or one", aka optional; * means "zero or more"; + means "one or more"; all refer to the previous character/class/group (a ? placed directly after * or +, as in .*?, instead makes that quantifier lazy, so it matches as little as possible)
\\s is whitespace, including spaces and tabs
[0-9] is a class, meaning any character between 0 and 9; similarly, [a-z] is all lowercase letters, [a-zA-Z] are all letters, [0-9A-F] are hexadecimal digits, etc
(...) is a saved group; it's not uncommon in a group to use | as an "or"; saved groups are numbered and can be recalled in the replacement= part of gsub, so \\1 recalls the first group from the pattern
So grouped and summarized:
  "^.*?\\b([0-9]+\\.?[0-9]*)\\s*$"
1         ^^^^^^^^^^^^^^^^^^
2      ^^^
3   ^^^
4                           ^^^^
1. This is the "number" part, which allows for one or more digits, an optional decimal point, and zero or more digits. This is saved in group "1".
2. The word boundary guarantees that we include leading digits (it's possible, depending on a few things, for "12.345" to be parsed as "2.345" without this).
3. Anything before the number-like string.
4. Some or no whitespace after the number.
Grouped logically, in an organized way
Regex isn't unique to R; it's a pattern-matching language that R (and most other programming languages) supports in one way or another.
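For reference, here is a base-R sketch of the same extraction without dplyr. It reuses the pattern from the answer; the logical-index assignment and perl = TRUE are my choices, not something the answer requires.

```r
aa <- c("...",
        "The probability of being a carrier is 0.0002422359 ",
        " an BRCA1 carrier 0.0001061067 ")

# Rows that should carry a probability
is_hit <- grepl("probability|BRCA[12] carrier", aa)

# Start with NA everywhere, then fill in only the matching rows
prob <- rep(NA_real_, length(aa))
prob[is_hit] <- as.numeric(
  sub("^.*?\\b([0-9]+\\.?[0-9]*)\\s*$", "\\1", aa[is_hit], perl = TRUE)
)
print(prob)
```

Because the pattern is anchored with ^ and $, the whole string is replaced by group 1, so only the trailing number is left for as.numeric().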

How to get word which contains number in R with regular expression?

I have a data frame in which some of the lines contain the word 'Comments(0)'. I want to get all the occurrences where there is 'Comments' and inside bracket any number.
I am using below code which is giving me integer(0) as output.
text <- c("Because I could not Comments(2) stop for Death - Comments(1)",
"He kindly stopped for me -",
"The Carriage held but just Ourselves - Comments(5)",
"and Immortality")
grep(pattern = "Comments([:digit:])", text)
you can use:
grep(pattern = "Comments\\(\\d\\)", text)
#output
[1] 1 3
or
grep(pattern = "Comments\\([[:digit:]]\\)", text)
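The ( and ) need to be escaped since they are regex special characters. [:digit:] is only valid inside a bracket expression, hence [[:digit:]].
\\d - a single digit (use \\d+ if the number can have more than one digit)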
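If the matched text itself is needed rather than the line indices, regmatches with gregexpr can return it. In this sketch the third line uses Comments(15) instead of Comments(5); that value is hypothetical and only there to show the multi-digit case.

```r
text <- c("Because I could not Comments(2) stop for Death - Comments(1)",
          "He kindly stopped for me -",
          "The Carriage held but just Ourselves - Comments(15)",  # hypothetical multi-digit count
          "and Immortality")

pattern <- "Comments\\([0-9]+\\)"   # escaped parentheses, one or more digits

hits  <- grep(pattern, text)                         # indices of matching lines
found <- regmatches(text, gregexpr(pattern, text))   # all occurrences per line
print(hits)
print(unlist(found))
```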

Regular expression not working in R but works on website. Text mining

I have a regex which works on the regular expression website but doesn't work when I copy it in R. Below is the code to recreate my data frame:
text <- data.frame(page = c(1,1,2,3), sen = c(1,2,1,1),
text = c("Dear Mr case 1",
"the value of my property is £500,000.00 and it was built in 1980",
"The protected percentage is 0% for 2 years",
"The interest rate is fixed for 2 years at 4.8%"))
regex working on website: https://regex101.com/r/OcVN5r/2
Below is the R code I have tried so far; neither call works.
library(stringr)
patt = "dear\\s+(mr|mrs|miss|ms)\\b[^£]+(£[\\d,.]+)(?:\\D|\\d(?![\\d.]*%))+([\\d.]+%)(?:\\D|\\d(?![\\d.]*%))+([\\d.]+%)"
str_extract(text, patt)
grepl(pattern = patt, x = text)
I'm getting an error saying the regex is wrong but it works on the website. Not sure how to get it to work in r.
Basically I am trying to extract pieces of information from the text. Below are the details:
From the above dataframe, I need to extract the following:
1: Gender of the person. In this case it would be Male (looking at Mr)
2: The number that represents the property value. in this case would be £500,000.00.
3: The protected percentage value, which in our case would be 0%.
4: The interest rate value and in our case it is 4.8%.
I think you can do this with the regexpr function.
For example:
text <- "Dear Mr case 1, the value of my property is £500,000.00 and it was built in 1980, The protected percentage is 13% for 2 years, The interest rate is fixed for 2 years at 4.8%"
grps <- regexpr(pattern = patt, text = text, perl = TRUE, ignore.case = TRUE)
start_idx <- attr(grps, "capture.start")
# capture.length counts characters, so the last index is start + length - 1
end_idx <- start_idx + attr(grps, "capture.length") - 1
substring(text = text, first = start_idx, last = end_idx)
This matches: [1] "Mr" "£500,000.00" "13%" "4.8%"
From the manual:
regexpr returns an integer vector of the same length as text giving the starting position of the first match or -1 if there is
none, with attribute "match.length", an integer vector giving the
length of the matched text (or -1 for no match). The match positions
and lengths are in characters unless useBytes = TRUE is used, when
they are in bytes (as they are for an ASCII-only matching: in either
case an attribute useBytes with value TRUE is set on the result). If
named capture is used there are further attributes "capture.start",
"capture.length" and "capture.names".
gregexpr returns a list of the same length as text each element of
which is of the same form as the return value for regexpr, except that
the starting positions of every (disjoint) match are given.
In your case, I think you need to paste the lines together first, because the pattern spans several rows of the data frame:
full_line <- paste(text[, "text"], collapse = " ")
Then apply regexpr to full_line.
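A smaller sketch of the paste-then-regexpr idea, using only two of the sentences and a much simpler pattern than patt. The pattern and variable names here are made up for the illustration.

```r
sentences <- c("The protected percentage is 0% for 2 years",
               "The interest rate is fixed for 2 years at 4.8%")

# Join the rows so that one pattern can span both sentences
full_line <- paste(sentences, collapse = " ")

# Two capture groups: the protected percentage and the interest rate
m <- regexpr("percentage is ([\\d.]+%).*?at ([\\d.]+%)", full_line, perl = TRUE)

starts <- attr(m, "capture.start")
lens   <- attr(m, "capture.length")

# The last index is start + length - 1, otherwise one extra character is included
values <- substring(full_line, starts, starts + lens - 1)
print(values)
```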
I think the issue is that your regex doesn't offer alternatives ("OR" matches). See below, based on your bullet list:
library(stringi)
library(magrittr)  # provides %>%
rgx <- "(?<=dear\\s?)(m(r(s)?|s|iss))|\\p{S}([0-9]\\S+)|([0-9]+)((\\.[0-9]{1,})?)\\%"
stri_extract_all_regex(
  text$text, rgx, opts_regex = stri_opts_regex(case_insensitive = TRUE)
) %>% unlist()
Which gives
[1] "Mr" "£500,000.00" "0%" "4.8%"
The pattern says:
"(?<=dear\\s?)(m(r(s)?|s|iss))" = find a match where the word dear appears before a mr, ms, mrs or miss... but don't capture the dear or the leading space
| = OR
"\\p{S}([0-9]\\S+)" = find a match where a symbol (see ?stringi-search-charclass) is followed by a digit and then by any characters up to the next whitespace; the match must start with a symbol
| = OR
"([0-9]+)((\\.[0-9]{1,})?)\\%" = find a match where a number occurs one or more times, that may have a decimal with numbers after it, but will end in a percent sign

Substitute word with same word without initial # in R

I am trying to do a dataframe string substitution in R. I need to find all the words preceded by '#' (without space, e.g. #word) and change the '#' to '!' (e.g. from #word to !word). At the same time, the other instances of '#' (e.g. # or ## or #[#]) should be left intact. For example, this is my original dataframe (to change: #def, #jkl, #stu):
> df = data.frame(number = 1:4, text = c('abc #def ghi', '#jkl # mno', '#[#] pqr #stu', 'vwx ### yz'))
> df
number text
1 1 abc #def ghi
2 2 #jkl # mno
3 3 #[#] pqr #stu
4 4 vwx ### yz
And this is what I need it to look like:
> df_result = data.frame(number = 1:4, text = c('abc !def ghi', '!jkl # mno', '#[#] pqr !stu', 'vwx ### yz'))
> df_result
number text
1 1 abc !def ghi
2 2 !jkl # mno
3 3 #[#] pqr !stu
4 4 vwx ### yz
I have tried with
> gsub('#.+[a-z] ', '!', df$text)
[1] "abc !ghi" "!# mno" "!#stu" "vwx ### yz"
But the result is not the desired one. Any help is much appreciated.
Thank you.
How about
gsub("(^| )#(\\w)", "\\1!\\2", df$text)
# [1] "abc !def ghi" "!jkl # mno" "#[#] pqr !stu" "vwx ### yz"
This matches a # symbol at the beginning of a string, or after a space. Then, we capture the word character after the # symbol, and replace # with !.
Explanation courtesy of regex101.com:
(^| ) is the 1st Capturing Group; ^ asserts position at start of the string; | denotes "or"; blank space matches the space character literally
# matches the character # literally (case sensitive)
(\\w) is the 2nd Capturing Group, it denotes a word character
The replacement string \\1!\\2 replaces the regular expression match with the first capturing group (\\1), followed by !, followed by the second capturing group (\\2).
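To see what the two capturing groups hold before the replacement is applied, regexec can be used. This is just an inspection sketch on one of the example strings.

```r
x <- "abc #def ghi"

# regexec returns the full match followed by each capturing group
parts <- regmatches(x, regexec("(^| )#(\\w)", x))[[1]]
print(parts)   # full match, group 1 (the space), group 2 (the first word character)

# sub() puts group 1 and group 2 back around the new "!"
replaced <- sub("(^| )#(\\w)", "\\1!\\2", x)
print(replaced)
```

Both groups have to be written back in the replacement because they are part of the match; leaving them out would delete the space and the first letter of the word.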
You can use a positive lookahead (?=...)
gsub("#(?=[A-Za-z])", "!", df$text, perl = TRUE)
[1] "abc !def ghi" "!jkl # mno" "#[#] pqr !stu" "vwx ### yz"
From the "Regular Expressions as used in R" documentation page:
Patterns (?=...) and (?!...) are zero-width positive and negative lookahead assertions: they match if an attempt to match the ... forward from the current position would succeed (or not), but use up no characters in the string being processed.
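The two answers agree on the example data but not on every input. The sketch below compares them on two extra strings that I made up ("a#b" and "##tag"), and adds a third variant combining a lookbehind with a lookahead; that variant is my own assumption about what "preceded by a space or the start" could look like with lookarounds.

```r
txt  <- c("abc #def ghi", "#jkl # mno", "#[#] pqr #stu", "vwx ### yz")
edge <- c("a#b", "##tag")   # hypothetical edge cases, not in the question

# Capturing-group version: # must be at the start or after a space
by_group <- gsub("(^| )#(\\w)", "\\1!\\2", edge)

# Lookahead-only version: any # directly followed by a letter
by_lookahead <- gsub("#(?=[A-Za-z])", "!", edge, perl = TRUE)

# Lookarounds on both sides: not preceded by a non-space, followed by a word character
both_sides <- gsub("(?<![^ ])#(?=\\w)", "!", txt, perl = TRUE)

print(by_group)
print(by_lookahead)
print(both_sides)
```

Which behaviour is right depends on whether a # in the middle of a word or after another # should count as a tag marker.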
