stringi::stri_unescape_unicode() is not able to render Unicode characters in some ranges

Table of contents
The context
The problem
The question
The context
In the context of R, I'm aware that stringi::stri_unescape_unicode() can be used to convert a Unicode code point to its corresponding character.
For example, the Unicode code points for á (LATIN SMALL LETTER A WITH ACUTE) and 好 are U+00E1 and U+597D, respectively. This means that I can insert those characters by executing the following.
library(stringi)
stringi::stri_unescape_unicode("\\u00E1")
stringi::stri_unescape_unicode("\\u597D")
[1] "á"
[1] "好"
I'm also aware that characters in the following ranges are for private use. The following quote was retrieved from this glossary (archive) on https://unicode.org.
Private-Use Code Point. Code points in the ranges U+E000..U+F8FF, U+F0000..U+FFFFD, and U+100000..U+10FFFD. (See definition D49 in Section 3.5, Properties.) These code points are designated in the Unicode Standard for private use.
As you can read in the quote, there are three ranges. The following lists those characters that are the limits of those ranges.
First range:  (U+E000)
First range:  (U+F8FF)
Second range: 󰀀 (U+F0000)
Second range: 󿿽 (U+FFFFD)
Third range: 􀀀 (U+100000)
Third range: 􏿽 (U+10FFFD)
The problem
When I try to print the characters in the list above that belong to the first range (i.e.  (U+E000) and  (U+F8FF)), there's no problem.
stringi::stri_unescape_unicode("\\ue000")
stringi::stri_unescape_unicode("\\uf8ff")
[1] ""
[1] ""
However, when I try to print the characters shown in the list above that belong to the second range (i.e. 󰀀 (U+F0000) and 󿿽 (U+FFFFD)), R doesn't return those characters.
stringi::stri_unescape_unicode("\\uf0000")
stringi::stri_unescape_unicode("\\uffffd")
[1] "0"
[1] "\uffffd"
Similarly, the following doesn't print the characters shown in the list above that belong to the third range (i.e. 􀀀 (U+100000) and 􏿽 (U+10FFFD)).
stringi::stri_unescape_unicode("\\u100000")
stringi::stri_unescape_unicode("\\u10fffd")
[1] "က00"
[1] "ჿfd"
The question
Why isn't stringi::stri_unescape_unicode() able to display characters that belong to the ranges U+F0000..U+FFFFD or U+100000..U+10FFFD?
Is there any function in R that is able to return those characters?
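A likely explanation is that the \u escape consumes exactly four hex digits, so "\\uf0000" is parsed as U+F000 followed by a literal "0" (and "\\u100000" as U+1000 followed by "00"). For code points above U+FFFF, the eight-digit \U escape should work (assuming ICU's escape syntax here, where \U takes exactly eight hex digits), or the character can be built directly from its code point with base R's intToUtf8(). A minimal sketch:
library(stringi)
# \U takes exactly eight hex digits, so pad the code point accordingly
stringi::stri_unescape_unicode("\\U000F0000")   # U+F0000
stringi::stri_unescape_unicode("\\U0010FFFD")   # U+10FFFD
# Base R alternative: convert the code point number directly
intToUtf8(0xF0000)
intToUtf8(0x10FFFD)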

Related

How to generate all possible unicode characters?

If we type letters we get all lowercase letters of the English alphabet. However, there are many more possible characters like ä, é and so on. And there are symbols like $ or (, too. I found this table of unicode characters, which is exactly what I need. Of course, I do not want to copy and paste hundreds of possible unicode characters into one vector.
What I've tried so far: The table gives the decimals for (some of) the unicode characters. For example, see the following small table:
Glyph Decimal Unicode Usage in R
! 33 U+0021 "\U0021"
So if we type "\U0021" we get a !. Further, paste0("U", format(as.hexmode(33), width= 4, flag="0")) returns "U0021", which is quite close to what I need, but adding \ results in an error:
paste0("\U", format(as.hexmode(33), width= 4, flag="0"))
Error: '\U' used without hex digits in character string starting ""\U"
I am stuck. And I am afraid that even if I figure out how to transform numbers to characters using as.hexmode(), there is still the problem that decimals are not listed for all Unicode characters (see the table; the decimals end with 591).
Any idea how to generate a vector with all the unicode characters listed in the table linked?
(The question started with a real world problem but now I am mostly simply eager to know how to do this.)
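As an aside, a minimal sketch of one way around the paste0() error above: \U escapes are resolved by the R parser, so they cannot be assembled from strings at run time, but base R's intToUtf8() maps code point numbers straight to characters (the 33:591 range below simply mirrors the decimals in the linked table):
# "!" without building an escape string at run time
intToUtf8(33)
intToUtf8(0x0021)   # same character, code point written in hex
# A character vector covering a whole range of code points
intToUtf8(33:591, multiple = TRUE)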
There may be easier ways to do this, but here goes. The Unicode package contains everything you need.
First we can get a list of unicode scripts and the block ranges:
library(Unicode)
uranges <- u_scripts()
Check what we've got:
head(uranges, 3)
$Adlam
[1] U+1E900..U+1E943 U+1E944..U+1E94A U+1E94B U+1E950..U+1E959 U+1E95E..U+1E95F
$Ahom
[1] U+11700..U+1171A U+1171D..U+1171F U+11720..U+11721 U+11722..U+11725 U+11726 U+11727..U+1172B U+11730..U+11739 U+1173A..U+1173B U+1173C..U+1173E U+1173F
[11] U+11740..U+11746
$Anatolian_Hieroglyphs
[1] U+14400..U+14646
Next we can convert the ranges into their sequences.
expand_uranges <- lapply(uranges, as.u_char_seq)
To get a single vector of all characters we can unlist it. This won't be easy to work with so really it would be better to keep them as a list:
all_unicode_chars <- unlist(expand_uranges)
# The Wikipedia page linked states there are 144,697 characters
length(all_unicode_chars)
[1] 144762
So that seems to be all of them, and the page needs updating. They are stored as integers, so to print them (assuming the glyph is supported) we can do, for example, the Japanese katakana:
intToUtf8(expand_uranges$Katakana[[1]])
[1] "ァアィイゥウェエォオカガキギクグケゲコゴサザシジスズセゼソゾタダチヂッツヅテデトドナニヌネノハバパヒビピフブプヘベペホボポマミムメモャヤュユョヨラリルレロヮワヰヱヲンヴヵヶヷヸヹヺ"

Regular expression for extracting UK postcode from an address is not ordered

I'm trying to extract UK postcodes from address strings in R, using the regular expression provided by the UK government here.
Here is my function:
address_to_postcode <- function(addresses) {
# 1. Convert addresses to upper case
addresses = toupper(addresses)
# 2. Regular expression for UK postcodes:
pcd_regex = "[Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) {0,1}[0-9][A-Za-z]{2})"
# 3. Check if a postcode is present in each address or not (return TRUE if present, else FALSE)
present <- grepl(pcd_regex, addresses)
# 4. Extract postcodes matching the regular expression for a valid UK postcode
postcodes <- regmatches(addresses, regexpr(pcd_regex, addresses))
# 5. Return NA where an address does not contain a (valid format) UK postcode
postcodes_out <- list()
postcodes_out[present] <- postcodes
postcodes_out[!present] <- NA
# 6. Return the results in a vector (should be same length as input vector)
return(do.call(c, postcodes_out))
}
According to the guidance document, the logic this regular expression looks for is as follows:
"GIR 0AA" OR One letter followed by either one or two numbers OR One letter followed by a second letter that must be one of
ABCDEFGHJ KLMNOPQRSTUVWXY (i.e..not I) and then followed by either one
or two numbers OR One letter followed by one number and then another
letter OR A two part post code where the first part must be One letter
followed by a second letter that must be one of ABCDEFGH
JKLMNOPQRSTUVWXY (i.e..not I) and then followed by one number and
optionally a further letter after that AND The second part (separated
by a space from the first part) must be One number followed by two
letters. A combination of upper and lower case characters is allowed.
Note: the length is determined by the regular expression and is
between 2 and 8 characters.
My problem is that this logic is not completely preserved when using the regular expression without the ^ and $ anchors (as I have to do in this scenario because the postcode could be anywhere within the address strings); what I'm struggling with is how to preserve the order and number of characters for each segment in a partial (as opposed to complete) string match.
Consider the following example:
> address_to_postcode("1A noplace road, random city, NR1 2PK, UK")
[1] "NR1 2PK"
According to the logic in the guideline, the second letter in the postcode cannot be 'z' (and there are some other exclusions too); however, look what happens when I add a 'z':
> address_to_postcode("1A noplace road, random city, NZ1 2PK, UK")
[1] "Z1 2PK"
... whereas in this case I would expect the output to be NA.
Adding the anchors (for a different usage case) doesn't seem to help as the 'z' is still accepted even though it is in the wrong place:
> grepl("^[Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) {0,1}[0-9][A-Za-z]{2})$", "NZ1 2PK")
[1] TRUE
Two questions:
1. Have I misunderstood the logic of the regular expression?
2. If not, how can I correct it (i.e. why aren't the specified letter and character ranges exclusive to their position within the regular expression)?
Edit
Since posting this answer, I dug deeper into the UK government's regex and found even more problems. I posted another answer here that describes all the issues and provides alternatives to their poorly formatted regex.
Note
Please note that I'm posting the raw regex here. You'll need to escape certain characters (like backslashes \) when porting to R.
Issues
You have many issues here, all of which are caused by whoever created the document you're retrieving your regex from or the coder that created it.
1. The space character
My guess is that when you copied the regular expression from the link you provided, the space character was converted into a newline character and you removed it (that's exactly what I did at first). You need, instead, to change it back to a space character.
^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})$
here ^
2. Boundaries
You need to remove the anchors ^ and $ as these indicate start and end of line. Instead, wrap your regex in (?:) and place a \b (word boundary) on either end as the following shows. In fact, the regex in the documentation is incorrect (see Side note for more information) as it will fail to anchor the pattern properly.
See regex in use here
\b(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2}))\b
^^^^^ ^^^
3. Character class oversight
There's a missing - in the character class, as pointed out by @deadcrab in his answer here.
\b(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2}))\b
^
4. They made the wrong character class optional!
In the documentation it clearly states:
A two part post code where the first part must be:
One letter followed by a second letter that must be one of ABCDEFGHJKLMNOPQRSTUVWXY (i.e. not I) and then followed by one number and optionally a further letter after that
They made the wrong character class optional!
\b(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2}))\b
^^^^^^
it should be this one ^^^^^^^^
5. The whole thing is just awful...
There are so many things wrong with this regex that I just decided to rewrite it. It can very easily be simplified to perform a fraction of the steps it currently takes to match text.
\b(?:[A-Za-z][A-HJ-Ya-hj-y]?[0-9][0-9A-Za-z]? [0-9][A-Za-z]{2}|[Gg][Ii][Rr] 0[Aa]{2})\b
Answer
As mentioned in the comments below my answer, some postcodes are missing the space character. For missing spaces in the postcodes (e.g. NR12PK), simply add a ? after the spaces as shown in the regex below:
\b(?:[A-Za-z][A-HJ-Ya-hj-y]?[0-9][0-9A-Za-z]? ?[0-9][A-Za-z]{2}|[Gg][Ii][Rr] ?0[Aa]{2})\b
^^ ^^
You may also shorten the regex above to the following and use the case-insensitive flag (ignore.case = TRUE or ignore_case = TRUE in R, depending on the function used):
\b(?:[A-Z][A-HJ-Y]?[0-9][0-9A-Z]? ?[0-9][A-Z]{2}|GIR ?0A{2})\b
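For illustration, a quick sketch of how this shortened pattern might be applied in R to the address from the question (using perl = TRUE so that \b and the non-capturing group behave as intended; backslashes must be doubled inside an R string):
addr <- "1A noplace road, random city, NR1 2PK, UK"
pc_regex <- "\\b(?:[A-Z][A-HJ-Y]?[0-9][0-9A-Z]? ?[0-9][A-Z]{2}|GIR ?0A{2})\\b"
# Extract the postcode
regmatches(addr, regexpr(pc_regex, addr, ignore.case = TRUE, perl = TRUE))
# Expected: "NR1 2PK"
# The invalid second letter is now rejected
bad <- "1A noplace road, random city, NZ1 2PK, UK"
regmatches(bad, regexpr(pc_regex, bad, ignore.case = TRUE, perl = TRUE))
# Expected: character(0)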
Note
Please note that regular expressions only validate the possible format(s) of a string and cannot actually identify whether or not a postcode legitimately exists. For this, you should use an API. There are also some edge-cases where this regex will not properly match valid postcodes. For a list of these postcodes, please see this Wikipedia article.
The regex below additionally matches the following (make it case-insensitive to match lowercase variants as well):
British Overseas Territories
British Forces Post Office
Although these were recently changed to align with the British postcode system (BF followed by a number, starting with BF1), they're still considered optional alternative postcodes
Special cases outlined in that article (as well as SAN TA1 - a valid postcode for Santa!)
See this regex in use here.
\b(?:(?:[A-Z][A-HJ-Y]?[0-9][0-9A-Z]?|ASCN|STHL|TDCU|BBND|[BFS]IQ{2}|GX11|PCRN|TKCA) ?[0-9][A-Z]{2}|GIR ?0A{2}|SAN ?TA1|AI-?[0-9]{4}|BFPO[ -]?[0-9]{2,3}|MSR[ -]?1(?:1[12]|[23][135])0|VG[ -]?11[1-6]0|[A-Z]{2} ? [0-9]{2}|KY[1-3][ -]?[0-2][0-9]{3})\b
I would also recommend anyone implementing this answer to read this StackOverflow question titled UK Postcode Regex (Comprehensive).
Side note
The documentation you linked to (Bulk Data Transfer: Additional Validation for CAS Upload - Section 3. UK Postcode Regular Expression) actually has an improperly written regular expression.
As mentioned in the Issues section, they should have:
Wrapped the entire expression in (?:) and placed the anchors around the non-capturing group. Their regular expression, as it stands, will fail for some cases, as seen here.
The regular expression is also missing - in one of the character classes
It also made the wrong character class optional.
Here is my regular expression (in Python):
import re

txt = "0288, Bishopsgate, London Borough of Tower Hamlets, London, Greater London, England, EC2M 4QP, United Kingdom"
matches = re.findall(r'[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}', txt)

Using grep() with Unicode characters in R

(strap in!)
Hi, I'm running into issues involving Unicode encoding in R.
Basically, I'm importing data sets that contain Unicode (UTF-8) characters, and then running grep() searches to match values. For example, say I have:
bigData <- c("foo","αβγ","bar","αβγγ (abgg)", ...)
smallData <- c("αβγ","foo", ...)
What I'm trying to do is take the entries in smallData and match them to entries in bigData. (The actual sets are matrixes with columns of values, so what I'm trying to do is find the indexes of the matches, so I can tell what row to add the values to.) I've been using
matches <- grepl(smallData[i], bigData, fixed=T)
which usually results in a vector of matches. For i=2, it would return 1, since "foo" is element 1 of bigData. This is peachy and all is well. But RStudio seems to not be dealing with unicode characters properly. When I import the sets and view them, they use the character IDs.
dataset <- read_csv("[file].csv", col_names = FALSE, locale = locale())
Using View(dataset) shows "aß<U+03B3>" instead of "αβγ." The same goes for
dataset[1]
A tibble: 1x1 <chr>
[1] aß<U+03B3>
print(dataset[1])
A tibble: 1x1 <chr>
[1] aß<U+03B3>
However, and this is why I'm stuck rather than just adjusting the encoding:
paste(dataset[1])
[1] "αβγ"
Encoding(toString(dataset[1]))
[1] "UTF-8"
So it appears that R is recognizing in certain contexts that it should display Unicode characters, while in others it just sticks to--ASCII? I'm not entirely sure, but certainly a more limited set.
In any case, regardless of how it displays, what I want to do is be able to get
grep("αβγ", bigData)
[1] 2 4
However, none of the following work:
grep("αβ", bigData) #(Searching the two letters that do appear to convert)
grep("<U+03B3>",bigData,fixed=T) #(Searching the code ID itself)
grep("αβ", toString(bigData)) #(converts the whole thing to one string)
grep("\\β", bigData) #(only mentioning because it matches, bizarrely, to ß)
The only solution I've found is:
grep("\u03B3", bigData)
[1] 2 4
Which is not ideal for a couple reasons, most jarringly that it doesn't look like it's possible to just take every <U+####> and replace it with \u####, since not every Unicode character is converted to the <U+####> format, but none of them can be searched. (i.e., α and ß didn't turn into their unicode keys, but they're also not searchable by themselves. So I'd have to turn them into their keys, then alter their keys to a form that grep() can use, then search.)
That means I can't just regex the keys into a searchable format; and even if I could, I have a lot of entries including characters that would need to be escaped (e.g., "(" or ")"), so having to remove the fixed=T term would be its own headache involving nested escapes.
Anyway...I realize that a significant part of the problem is that my set apparently involves every sort of character under the sun, and it seems I have thoroughly entrapped myself in a net of regular expressions.
Is there any way of forcing a search with (arbitrary) unicode characters? Or do I have to find a way of using regular expressions to escape every ( and α in my data set? (coordinate to that second question: is there a method to convert a unicode character to its key? I can't seem to find anything that does that specific function.)
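A minimal sketch of the pieces that usually come into play here; stri_escape_unicode() and enc2utf8() are existing stringi and base R functions, but whether they resolve this particular data set is an assumption:
library(stringi)
x <- "αβγ"
# Character -> escape "key" (the inverse of stri_unescape_unicode)
stringi::stri_escape_unicode(x)
# Character -> <U+....> style identifiers, via the integer code points
sprintf("<U+%04X>", utf8ToInt(x))
# Normalising both sides to UTF-8 before matching often resolves the
# "displays fine but grep() can't find it" symptom (bigData is the
# vector from the question)
grep(enc2utf8("αβγ"), enc2utf8(bigData), fixed = TRUE)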

Subsetting different length strings by spaces in R

In R, I currently have a long vector of dates and times saved as strings. Depending on the given date, a string can be 16, 17, or 18 characters long, so I cannot just take the first 8 or 10 characters, since that would not work for every date. But since there is a space between the date and time values, I am wondering: how can I subset this string so that I only get the characters before the space?
Just to show how the string looks like now, here are a couple of examples:
"4/18/1950 0:00:00"
"6/8/1951 0:00:00"
"11/15/1951 0:00:00"
I'm not sure if you are familiar with regular expressions; if not, you should learn them, as they are extremely useful:
tutorial
As akrun pointed out you can use the "sub" command to remove the space and everything after it like this:
sub(" .*","",stringVar)
First argument is the regular expression code which matches the space and everything that follows.
Second argument is what you want to replace the match with, in this case nothing
Third argument is the input string
Alternatively, you can split the string at the space using "strsplit" and select the first piece (note that strsplit() returns a list):
strsplit(stringVar, " ")[[1]][1]

Replacing all occurrences of a pattern in a string

I'm used to running R with numbers and matrices; when it comes to playing with strings and characters I am lost. I want to analyze some data where the time is read into R as follows:
>my.time.char[1]
[1] "\"2011-10-05 15:55:00\""
I want to end up with a string containing only:
"2011-10-05 15:55:00"
Using the function sub() (which I barely understand...), I got the following result:
> sub("(\")","",my.time.char[1])
[1] "2011-10-05 15:55:00\""
This is closer to the format I am looking for, but I still need to get rid of the last two characters (\").
The second line from ?sub explains:
sub and gsub perform replacement of the first and all matches respectively.
which should tell you to use gsub instead.
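For example, a short sketch with the value from the question:
my.time.char <- c("\"2011-10-05 15:55:00\"")
# gsub() replaces every match, so both quote characters are removed
gsub("\"", "", my.time.char[1])
# Expected: "2011-10-05 15:55:00"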

Resources