Convert special letters to English letters in R

Is there a way to convert special letters in a text to English letters in R? For example:
Æ -> AE
Ø -> O
Å -> A
Edit: The reason I need this conversion is that R can't see that these two words are the same:
stringdist('oversættelse', 'oversaettelse')
[1] 2
grepl('oversættelse', 'oversaettelse')
[1] FALSE
Some people tend to write using only English characters and others don't, so in order to compare texts I need to have them in the same format.

I recently had a very similar problem and was pointed to the question Unicode normalization (form C) in R : convert all characters with accents into their one-unicode-character form?
Basically, the gist is that for many of these special characters there exists more than one Unicode representation, which will mess with text comparisons. The suggested solution is to use the stringi package function stri_trans_nfc; the package also has a function stri_trans_general that supports transliteration, which might be exactly what you need.
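For example, a minimal sketch using stri_trans_general with the ICU transliterator ID "Latin-ASCII" (my assumption for the appropriate transform; it should map Æ -> AE, Ø -> O, Å -> A):
library(stringi)
# transliterate accented/special Latin letters to plain ASCII
stri_trans_general('oversættelse', 'Latin-ASCII')
[1] "oversaettelse"
After this step, the two spellings from the question compare as equal.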

You can use chartr
x <- "ØxxÅxx"
chartr("ØÅ", "OA", x)
[1] "OxxAxx"
And/or gsub
y <- "Æabc"
gsub("Æ", "AE", y)
[1] "AEabc"

Related

Replace Strings in R with regular expression in it dynamically [duplicate]

I want to build a regex pattern by substituting in some strings to search for, so those strings need to be escaped before I can put them in the regex; that way, if a searched-for string contains regex metacharacters, the pattern still works.
Some languages have functions that will do this for you (e.g. Python's re.escape: https://stackoverflow.com/a/10013356/1900520). Does R have such a function?
For example (made up function):
x = "foo[bar]"
y = escape(x) # y should now be "foo\\[bar\\]"
I've written an R version of Perl's quotemeta function:
library(stringr)
quotemeta <- function(string) {
  str_replace_all(string, "(\\W)", "\\\\\\1")
}
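For example, applied to the string from the question:
quotemeta("foo[bar]")
[1] "foo\\[bar\\]"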
I always use the perl flavor of regexps, so this works for me. I don't know whether it works for the "normal" regexps in R.
Edit: I found the source explaining why this works. It's in the Quoting Metacharacters section of the perlre manpage:
This was once used in a common idiom to disable or quote the special meanings of regular expression metacharacters in a string that you want to use for a pattern. Simply quote all non-"word" characters:
$pattern =~ s/(\W)/\\$1/g;
As you can see, the R code above is a direct translation of this same substitution (after a trip through backslash hell). The manpage also says (emphasis mine):
Unlike some other regular expression languages, there are no backslashed symbols that aren't alphanumeric.
which reinforces my point that this solution is only guaranteed for PCRE.
Apparently there is a function called escapeRegex in the Hmisc package. The function itself has the following definition for an input value of 'string':
gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", string)
My previous answer:
I'm not sure if there is a built-in function, but you could make one to do what you want. This basically just creates a vector of the values you want to replace and a vector of the replacements, then loops through them, making the necessary replacements.
re.escape <- function(strings){
  vals <- c("\\\\", "\\[", "\\]", "\\(", "\\)",
            "\\{", "\\}", "\\^", "\\$", "\\*",
            "\\+", "\\?", "\\.", "\\|")
  replace.vals <- paste0("\\\\", vals)
  for(i in seq_along(vals)){
    strings <- gsub(vals[i], replace.vals[i], strings)
  }
  strings
}
Some output
> test.strings <- c("What the $^&(){}.*|?", "foo[bar]")
> re.escape(test.strings)
[1] "What the \\$\\^&\\(\\)\\{\\}\\.\\*\\|\\?"
[2] "foo\\[bar\\]"
An easier way than @ryanthompson's function is to simply prepend \\Q and append \\E to your string. See the help file ?base::regex.
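A minimal sketch (the \\Q...\\E quoting is documented for perl = TRUE patterns, so I pass that flag explicitly):
x <- "foo[bar]"
grepl(paste0("\\Q", x, "\\E"), "see foo[bar] here", perl = TRUE)
[1] TRUE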
Use the rex package
These days, I write all my regular expressions using rex. For your specific example, rex does exactly what you want:
library(rex)
library(assertthat)
x = "foo[bar]"
y = rex(x)
assert_that(y == "foo\\[bar\\]")
But of course, rex does a lot more than that. The question mentions building a regex, and that's exactly what rex is designed for. For example, suppose we wanted to match the exact string in x, with nothing before or after:
x = "foo[bar]"
y = rex(start, x, end)
Now y is ^foo\[bar\]$ and will only match the exact string contained in x.
According to ?regex:
The symbol \w matches a ‘word’ character (a synonym for [[:alnum:]_], an extension) and \W is its negation ([^[:alnum:]_]).
Therefore, using a capture group, (\\W), we can detect occurrences of non-word characters and escape them with the \\1 backreference syntax:
> gsub("(\\W)", "\\\\\\1", "[](){}.|^+$*?\\These are words")
[1] "\\[\\]\\(\\)\\{\\}\\.\\|\\^\\+\\$\\*\\?\\\\These\\ are\\ words"
Or similarly, you can use "([^[:alnum:]_])" in place of "(\\W)".

How to extract only valid equations from a string

Extract all the valid equations in the following text.
I have tried a few regex expressions but none seem to work.
Hoping to use sub or gsub functions in R.
myText <- 'equation1: 2+3=5, equation2 is: 2*3=6, do not extract 2w3=6'
Expected result: 2+3=5 2*3=6
Here is a base R approach. We can use gregexpr() to find multiple matches of equations in the input string (note perl = TRUE, since the pattern uses a non-capturing group, and the hyphen is placed first in the operator class so it is literal):
x <- c("equation1: 2+3=5, equation2 is: 2*3=6, do not extract 2w3=6")
m <- gregexpr("\\b\\w+(?:[-+*]\\w+)+=\\w+\\b", x, perl = TRUE)
regmatches(x, m)
regmatches(x, m)
[[1]]
[1] "2+3=5" "2*3=6"
Here is an explanation of the regex:
\\b\\w+ matches an initial term
(?:[-+*]\\w+)+ matches one or more arithmetic operators, each followed by another term
=\\w+\\b matches the equals sign, followed by the result
For the examples you've posted, the regex (\d+[+\-*\/]\d+=\d+) should extract the equations and not the rest of the text. Note that this regex only handles numbers and the basic arithmetic operators, not variables or variable names. It may need to be adapted for R.
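Adapted for R, a quick sketch using the same gregexpr/regmatches idiom as the first answer:
m <- gregexpr("\\d+[+*/-]\\d+=\\d+", myText)
regmatches(myText, m)
[[1]]
[1] "2+3=5" "2*3=6"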

Filtering out entries in a column that contain UTF-8 arabic characters in R

I have a data set called event_table that has a column titled "c.Comments" which contains strings mostly in English, but some comment entries contain Arabic. I want to filter out rows in which the comments entry contains Arabic characters.
I read the data into R from an xlsx file, and the Arabic characters show as UTF-8 escapes like "<U+4903><U+483d>", etc.
I've tried using regular expressions to achieve what I want, but the strings I'm trying to match refuse to be filtered out. For example, filtering on the literal "<U+" prefix:
event_table <- event_table %>%
  filter(!grepl("<U+", c.Comments, fixed = TRUE))
event_table <- event_table %>%
  filter(!grepl("<U\\+", c.Comments))
"\x", "\d\d\d\d", and all sorts of other combinations have done nothing for me either.
I'm starting to suspect that my method of filtering may be the issue rather than the regular expression, so any suggestions would be greatly appreciated.
Arabic chars can be detected with grep/grepl using a PCRE regex like \p{Arabic}:
> df <- data.frame(x=c("123", "abc", "ﺏ"))
> df
x
1 123
2 abc
3 <U+FE8F>
> grepl("\\p{Arabic}", df$x, perl=TRUE)
[1] FALSE FALSE TRUE
In your case, the code will look like:
event_table <- event_table %>%
  filter(!grepl("\\p{Arabic}", c.Comments, perl = TRUE))
Look at the ?Syntax help page. The glyph associated with a given code point may vary with the assumed codepage. On my machine the R character would be created with the string "\u4903", but it prints as a Chinese glyph. The regex engines in R are documented in the ?regex help page (which you should refer to now); the Perl-compatible one, PCRE, is used with perl = TRUE.
The pattern in this grepl expression matches printing (alphanumeric or punctuation) characters, and returns FALSE for this non-ASCII character:
grepl("[[:alnum:]]|[[:punct:]]", "\u4903")
[1] FALSE
And I don't think you should be negating that grepl result:
dplyr::filter(data.frame("\u4903"), grepl("[[:alnum:]]|[[:punct:]]", "\u4903"))
[1] X.䤃.
<0 rows> (or 0-length row.names)
dplyr::filter(data.frame("\u4903"), !grepl("[[:alnum:]]|[[:punct:]]", "\u4903"))
X.䤃.
1 䤃

How to subset words with a certain number of vowels in RStudio?

I'm trying to subset a list of words having 5 or more vowels, using the str_subset function in RStudio, but I can't figure it out.
Any suggestions?
Since you are evidently using stringr, the function str_count will give you what you are after. Assuming your "list of words" means a character vector of single words, the following should do the trick.
testStrings <- c("Brillig", "slithey", "TOVES",
                 "Abominable", "EQUATION", "Multiplication", "aaagh")
VowelCount <- str_count(testStrings, pattern = "[AEIOUaeiou]")
OutputStrings <- testStrings[VowelCount >= 5]
The part in square brackets is a regular expression which matches any capital or lower case vowel in English. Of course other languages have different sets of vowels which you may need to take into account.
If you want to do the same in base R, the following one-liner should do it:
OutputStrings <- grep("([AEIOUaeiou].*){5,}", testStrings, value = TRUE)
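And since the question asks about str_subset specifically, the same pattern can be passed to it directly (a sketch, under the same English-vowel assumption as above):
library(stringr)
str_subset(testStrings, "([AEIOUaeiou].*){5,}")
[1] "Abominable"     "EQUATION"       "Multiplication"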

How can I match emoji with an R regex?

I want to determine which elements of my vector contain emoji:
x = c('😂', 'no', '🍹', '😀', 'no', '😛', '䨺', '감사')
x
# [1] "\U0001f602" "no" "\U0001f379" "\U0001f600" "no" "\U0001f61b" "䨺" "감사"
Related posts only cover other languages, and because they mostly refer to specialized libraries, I couldn't figure out a way to translate them to R:
What is the regex to extract all the emojis from a string?
How do I remove emoji from string
replace emoji unicode symbol using regexp in javascript
Regular expression matching emoji in Mac OS X / iOS
remove unicode emoji using re in python
The second looked very promising, but alas (and supplying perl = TRUE does not fix it):
x[grepl('[\u{1F600}-\u{1F6FF}]', x)]
Error: invalid \u{xxxx} sequence (line 1)
Similar issues arise with the other questions' approaches. How can we match emoji in R?
I am converting the encoding to UTF-8 so that I can compare each element's UTF-8 value against the UTF-8 values of all the emoji in the remoji library. I am using the stringr library to find the positions of emoji in the vector; you are free to use grep or any other function.
1st Method:
library(stringr)
xvect = c('😂', 'no', '🍹', '😀', 'no', '😛')
Encoding(xvect) <- "UTF-8"
which(str_detect(xvect, "[^[:ascii:]]"))
# [1] 1 3 4 6
Here elements 1, 3, 4 and 6 are the emoji characters.
Edited:
2nd Method:
Install the remoji package from GitHub using devtools with the commands below. Since we have already converted the vector to UTF-8, we can now compare its values with the UTF-8 values of all the emoji present in the emoji library. Use trimws to remove the whitespace:
install.packages("devtools")
devtools::install_github("richfitz/remoji")
library(remoji)
emj <- emoji(list_emoji(), TRUE)
xvect %in% trimws(emj)
Output:
which(xvect %in% trimws(emj))
# [1] 1 3 4 6
Neither of the above methods is foolproof: the first assumes that there are no non-ASCII characters other than emoji in the vector, and the second relies on the completeness of the remoji library's emoji list. In cases where a certain emoji is not present in the library, the last command may yield FALSE instead of TRUE.
Final Edit:
As per the discussion between the OP (@MichaelChirico) and @SymbolixAU (thanks to both of them), the problem was a small typo: the escape needs a capital U. The new regex is xvect[grepl('[\U{1F300}-\U{1F6FF}]', xvect)]. The range in the character class runs from 1F300 to 1F6FF. One can of course change this range in cases where an emoji lies outside it; this may not be the complete list, and over time these ranges may keep growing and changing.
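Putting the corrected pattern to work on the question's vector (a sketch; assumes a UTF-8 locale and R's \U{...} escape support):
x <- c('😂', 'no', '🍹', '😀', 'no', '😛', '䨺', '감사')
x[grepl('[\U{1F300}-\U{1F6FF}]', x)]
[1] "\U0001f602" "\U0001f379" "\U0001f600" "\U0001f61b"
Note that 䨺 and 감사 are correctly excluded: they are non-ASCII, but they are not emoji.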
