I have a web response being returned in raw format which I'm unable to properly encode. It contains the following values:
ef bc 86
The character is meant to be a Fullwidth Ampersand (to illustrate below):
> as.character("\uFF06")
[1] "＆"
> charToRaw("\uFF06")
[1] ef bc 86
However, no matter what I've tried, it gets converted to "ï¼†". To illustrate:
> rawToChar(charToRaw("\uFF06"))
[1] "ï¼†"
Because of the equivalence of the raw values, I don't think there's anything I can do in my web call to influence the problem I'm having (happy to be corrected). I believe I need to work out how to properly do the character encoding.
I also took an extreme approach of trying all other encodings as follows, but none converted to the fullwidth ampersand:
> x_raw <- charToRaw("\uFF06")
> x_raw
[1] ef bc 86
> sapply(
+   stringi::stri_enc_list()
+   ,function(encoding) stringi::stri_encode(str = x_raw, encoding)
+ ) |> # R's native pipe (R >= 4.1)
+   tibble::enframe(name = "encoding")
# A tibble: 1,203 x 2
   encoding value
   <chr>    <chr>
 1 037      "Õ¯f"
 2 273      "Õ¯f"
 3 277      "Õ¯f"
 4 278      "Õ¯f"
 5 280      "Õ¯f"
 6 284      "Õ¯f"
 7 285      "Õ~f"
 8 297      "Õ¯f"
 9 420      "\u001a\u001af"
10 424      "\u001a\u001af"
# ... with 1,193 more rows
My workaround at the moment is to replace the strings after the encoding, but this character is just one example of many, and hard-coding every instance doesn't seem practical.
> rawToChar(x_raw)
[1] "ï¼†"
> stringr::str_replace_all(rawToChar(x_raw), c("ï¼†" = "\uFF06"))
[1] "＆"
The substitution workaround is further complicated by characters like the HYPHEN (not HYPHEN-MINUS), which somehow get converted such that the last two raw values end up as a string with what appear to be escaped values:
> as.character("\u2010") # HYPHEN
[1] "‐"
> as.character("\u2010") |> charToRaw() # As raw
[1] e2 80 90
> as.character("\u2010") |> charToRaw() |> rawToChar() # Converted back to string
[1] "â€\u0090"
> charToRaw("â\200\220") # string with equivalent raw
[1] e2 80 90
Any help appreciated.
I'm not totally clear on exactly what you are trying to do, but the problem with getting back your original character is that R cannot determine the encoding automatically from the raw bytes. I assume you are on Windows. If you do
val <- rawToChar(charToRaw("\uFF06"))
val
# [1] "ï¼†"
Encoding(val)
# [1] "unknown"
Encoding(val) <- "UTF-8"
val
# [1] "＆"
Just make sure to set the encoding properly.
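A minimal, self-contained sketch of that fix (assuming the response bytes arrive as a raw vector): decode the bytes, then declare, rather than convert, the encoding.

```r
# The three bytes from the response (UTF-8 for the fullwidth ampersand).
raw_bytes <- as.raw(c(0xef, 0xbc, 0x86))

val <- rawToChar(raw_bytes)  # encoding is "unknown" at this point
Encoding(val) <- "UTF-8"     # declare what the bytes already are

identical(val, "\uFF06")     # TRUE
```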
I was doing some file merging, and two files wouldn't merge, despite having a key column that matches (I actually generated one key column by copy-pasting from the other). It's the damnedest thing, and I worry that I'm either going crazy or missing something fundamental. As an example (and I cannot figure out how to make it reproducible, as when I copy and paste these strings into new objects, they compare just fine), here's my current console:
> q
[1] "1931 80th Anniversary"
> z
[1] "1931 80th Anniversary"
> q == z
[1] FALSE
I str-ed both, just in case I missed something, and...
> str(q)
chr "1931 80th Anniversary"
> str(z)
chr "1931 80th Anniversary"
What could be going on here?
This was a great puzzler. To diagnose the problem, charToRaw() was the answer.
> charToRaw(q)
[1] 31 39 33 31 c2 a0 38 30 74 68 c2 a0 41 6e 6e 69 76 65
[19] 72 73 61 72 79
> charToRaw(z)
[1] 31 39 33 31 20 38 30 74 68 20 41 6e 6e 69 76 65 72 73
[19] 61 72 79
Oh! Different! It seems to lie in the encoding, which I never would have guessed, given that both were loaded from plain ole' CSVs:
> Encoding(q)
[1] "UTF-8"
> Encoding(z)
[1] "unknown"
In the end, I used iconv() on q to make it work
> iconv(q, from = 'UTF-8', to = 'ASCII//TRANSLIT') == z
[1] TRUE
This has been a weird journey, and I hope this helps someone else who is as baffled as I was - and they learn a few new functions along the way.
It looks like you have non-breaking spaces in your string, which isn't really an encoding issue. This happens to me all the time because alt + space inserts a non-breaking space on a Mac, and I use alt on my German keyboard for all sorts of special characters, too. My pinkies are my slowest fingers and they don't always release alt fast enough when I transition from some special character to a space. I discovered this problem writing bash scripts, where <command> | <command> is common and | is alt + 7.
I think stringr::str_replace_all(q, "\\s", " ") should fix your current issue. Alternatively, you can try targeting specific non-printables, e.g. in your situation stringr::str_replace_all(q, "\uA0", " "). To expose the offending characters you can use stringi::stri_escape_unicode(q), which would return "1931\\u00a080th\\u00a0Anniversary". You can then just copy and paste to get the same results as above: stringr::str_replace_all(q, "\u00a0", " ")
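A reproducible sketch of the mismatch, building q with explicit non-breaking spaces (\u00a0) so the comparison can be run anywhere; base gsub() works as well as stringr here.

```r
q <- "1931\u00a080th\u00a0Anniversary"  # non-breaking spaces between words
z <- "1931 80th Anniversary"            # ordinary spaces

q == z             # FALSE: the strings print identically but differ in bytes
utf8ToInt(q)[5]    # 160, the code point of the non-breaking space

gsub("\u00a0", " ", q) == z  # TRUE after normalizing
```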
I am reading an R object with readRDS(). It should have two columns, a year and a character string. For most rows the character string is OK, but some have a strange white blob, others seem to have a character vector with escaped special characters, and some have special characters like â.
I think it's an encoding issue with the original data (which is not mine), but I am unsure what the blobs are or what causes the character vectors / escaping. I realise it's probably the original data, but I am trying to understand a little more of what I am seeing so I can investigate.
I'm using macOS 10.14.6.
Any ideas welcome.
The original data is here and I used the following to pull out some of the rows with strange characters.
library(dplyr)
library(stringr)

data <- readRDS("all_speech.rds") %>%
  select(year, speech) %>%
  filter(str_detect(speech, "â"))

str(data)
'data.frame': 2286324 obs. of  2 variables:
 $ year  : num  1979 1979 1979 1979 1979 ...
 $ speech: chr  "Mr. Speaker ...
Added
sample <- data %>% mutate(speech = substr(speech, 1, 200))
dput(head(sample))
structure(list(year = c(1982, 1982, 1982, 1984, 1986, 1986),
speech = c("With this it will be convenient to take amendment No. 112, in title, line 10, leave out 'section 163 1) ofâ€™.\n",
"I am not so much surprised as astonished by the amendment. It would create tremendous problems. Police officers have a vital role in visiting places of entertainment—without a warrant—particularly in ",
"I note the hon. Gentleman's desire to retire there.\nMy right hon. Friend mentioned that we are setting up a pilot scheme with three experimental homes. They will be in adapted, domestic-style, buildin",
"The British forces in the Lebanon had their headquarters at Haddâsse. From that position they would have been totally unable to help British nationals in west Beirut. They are better able to help, thr",
"We know that soon more cars will be manufactured in the United Kingdom, as the hon. Member for Edinburgh, Central Mr. Fletcher) wishes.\nhirdly, the decision will have a domino effect—that American phr",
"I beg to move,\nThat leave be given to bring in a Bill to make illegal the display of pictures of naked or partially naked women in sexually provocative poses in newspapers.\nThis is a simple but import"
)), row.names = c(NA, 6L), class = "data.frame")
You've got a difficult problem ahead of you. The sample you show has inconsistent encodings, so fixups will be hard to do.
The first entry in sample$speech displays like this on my Mac:
> sample$speech[1]
[1] "With this it will be convenient to take amendment No. 112, in title,
line 10, leave out 'section 163 1) ofâ€™.\n"
This looks okay up to the end, where the â€™ characters look like the UTF-8 encoding of a directional quote "’" interpreted in the WINDOWS-1252 encoding. I can fix that with this code:
> iconv(sample$speech[1], from="utf-8", to="WINDOWS-1252")
[1] "With this it will be convenient to take amendment No. 112, in title,
line 10, leave out 'section 163 1) of’.\n"
However, this messes up the second entry, because it has em-dashes correctly encoded, so the translation converts them to hex 97 characters, which are not legal in the native UTF-8 encoding on the Mac:
> sample$speech[2]
[1] "I am not so much surprised as astonished by the amendment. It would
create tremendous problems. Police officers have a vital role in visiting
places of entertainment—without a warrant—particularly in "
> iconv(sample$speech[2], from="utf-8", to="WINDOWS-1252")
[1] "I am not so much surprised as astonished by the amendment. It would
create tremendous problems. Police officers have a vital role in visiting
places of entertainment\x97without a warrant\x97particularly in "
There are functions in various packages to guess encodings and fix them, e.g. rvest::repair_encoding and stringi::stri_enc_detect, but I couldn't get them to work on your data. I wrote one myself, based on these ideas: use utf8ToInt to convert each string to its Unicode code points, then look for strings that contain multiple high values in a sequence. sample$speech[1] looks like this:
> utf8ToInt(sample$speech[1])
[1] 87 105 116 104 32 116 104 105 115 32 105 116 32 119 105 108 108
[18] 32 98 101 32 99 111 110 118 101 110 105 101 110 116 32 116 111
[35] 32 116 97 107 101 32 97 109 101 110 100 109 101 110 116 32 78
[52] 111 46 32 49 49 50 44 32 105 110 32 116 105 116 108 101 44
[69] 32 108 105 110 101 32 49 48 44 32 108 101 97 118 101 32 111
[86] 117 116 32 39 115 101 99 116 105 111 110 32 49 54 51 32 49
[103] 41 32 111 102 226 8364 8482 46 10
and that sequence near the end, 226 8364 8482, is typical of a misinterpreted UTF-8 character. (The Wikipedia page on UTF-8 describes the encoding in detail. Two-byte chars start with 192 to 223, three-byte chars start with 224 to 239, and four-byte chars start with 240 to 247. Chars after the first are all in the range 128 to 191. The tricky part is figuring out how these high-order chars will be displayed, because that depends on the wrongly assumed encoding.) Here's a quick and dirty function that tries every encoding known to iconv() and reports on what it does:
fixEncoding <- function(s, guess = iconvlist()) {
  firstbytes <- list(as.raw(192:223),
                     as.raw(224:239),
                     as.raw(240:247))
  nextbytes <- as.raw(128:191)
  for (i in seq_along(s)) {
    str <- utf8ToInt(s[i])
    if (any(str > 127)) {
      fixes <- c()
      encs <- c()
      for (g in guess) {
        high <- which(str > 127)
        # Code points that the UTF-8 lead/continuation bytes decode to under g
        firsts <- lapply(firstbytes,
                         function(b) utf8ToInt(iconv(rawToChar(b), from = g, to = "UTF-8", sub = "")))
        nexts <- utf8ToInt(iconv(rawToChar(nextbytes), from = g, to = "UTF-8", sub = ""))
        for (try in 1:3) {
          starts <- high[str[high] %in% firsts[[try]]]
          starts <- starts[starts <= length(str) - try]
          for (hit in starts) {
            if (str[hit + 1] %in% nexts &&
                (try < 2 || str[hit + 2] %in% nexts) &&
                (try < 3 || str[hit + 3] %in% nexts))
              high <- setdiff(high, c(hit, hit + 1,
                                      if (try > 1) hit + 2,
                                      if (try > 2) hit + 3))
          }
        }
        if (!length(high)) {
          fixes <- c(fixes, iconv(s[i], from = "UTF-8", to = g, mark = FALSE))
          encs <- c(encs, g)
        }
      }
      if (length(fixes)) {
        if (length(unique(fixes)) == 1) {
          s[i] <- fixes[1]
          message("Fixed s[", i, "] using one of ", paste(encs, collapse = ","))
        } else {
          warning("s[", i, "] has multiple possible fixes.")
          message("It could be")
          uniq <- unique(fixes)
          for (u in seq_along(uniq))
            message(paste(encs[fixes == uniq[u]], collapse = ","))
          message("Not fixed!")
        }
      }
    }
  }
  s
}
When I try it on your sample, I see this:
> fixed <- fixEncoding(sample$speech)
Fixed s[1] using one of CP1250,CP1252,CP1254,CP1256,CP1258,MS-ANSI,MS-ARAB,MS-EE,MS-TURK,WINDOWS-1250,WINDOWS-1252,WINDOWS-1254,WINDOWS-1256,WINDOWS-1258
You can make it less verbose by calling it as
fixed <- suppressMessages(fixEncoding(sample$speech))
The other issue you had in your original post was that some strings were being displayed as single characters. I think that's an RStudio bug. If I put too many characters in a single element in a dataframe, the RStudio viewer can't display it. For me the limit is around 10240 chars. This dataframe won't display properly:
d <- data.frame(x = paste(rep("a", 10241), collapse=""))
but any smaller number works. This isn't an R issue; R can display that dataframe in the console with no problem. It's only View(d) that is bad, and only in RStudio.
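The repair idiom the function above searches for can be shown on a single string (a sketch with a hand-built mojibake value, not data from the file): re-encode the misinterpreted characters back to their Windows-1252 bytes, then declare those bytes to be UTF-8.

```r
# "ofâ€™." : the UTF-8 bytes of ’ (e2 80 99) mis-read as Windows-1252
bad <- "of\u00e2\u20ac\u2122."

bytes <- iconv(bad, from = "UTF-8", to = "WINDOWS-1252", mark = FALSE)
Encoding(bytes) <- "UTF-8"  # the recovered bytes are valid UTF-8
bytes
# [1] "of’."
```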
I am trying to extract from a random text phone numbers in 28 different formats in R. I have read previous posts here on R regex, such as \ being replaced with \\, and running the regex operator with perl=TRUE, so I have solved most of my issues. I need help with some debugging.
I use the following regular expression in R:
medium_regex2 = "(?:\\+?(\\d{1})?-?\\(?(\\d{3})\\)?[\\s-\\.]?)?(\\d{3})[\\s-\\.]?(\\d{4})[\\s-\\.]?"
and run the following code:
medium_phone_extract2 <- function(string){
unlist(regmatches(string,gregexpr(medium_regex2,string, perl=TRUE)))
}
medium_phone_extract2(phonenumbers)
The expression spots 26 out of 28 numbers correctly. The 2 missing number formats are:
"+90-555-4443322"
"+1.517.3002010"
How would you improve the regex so that these 2 formats are also correctly extracted?
edit: the full 28 formats I am trying to extract are:
phonenumbers <- c("05554443322",
"0555 444 3322",
"0555 444 33 22",
"5554443322",
"555 444 3322",
"555 444 33 22",
"905554443322",
"+905554443322",
"+90-555-4443322",
"+1-517-3002010",
"+1-(800)-3002010",
"+1-517-3002010",
"+1.517.3002010",
"000-000-0000",
"000 000 0000",
"000.000.0000",
"(000)000-0000",
"(000)000 0000",
"(000)000.0000",
"(000) 000-0000",
"(000) 000 0000",
"(000) 000.0000",
"000-0000",
"000 0000",
"000.0000",
"0000000",
"0000000000",
"(000)0000000")
howmany_numbers <- length(phonenumbers)
#28
And the 26 I am able to extract with the regex are:
[1] "05554443322" "0555 444 3322" "5554443322" "555 444 3322" "90555444332"
[6] "+90555444332" "0-555-4443322" "+1-517-3002010" "+1-(800)-3002010" "+1-517-3002010"
[11] "517.3002010" "000-000-0000" "000 000 0000" "000.000.0000" "(000)000-0000"
[16] "(000)000 0000" "(000)000.0000" "(000) 000-0000" "(000) 000 0000" "(000) 000.0000"
[21] "000-0000" "000 0000" "000.0000" "0000000" "0000000000"
[26] "(000)0000000"
You may use the following regex:
(?:\+?\d{0,3}-?\(?[\s.-]?\d{3}\)?[\s.-]?)?\d{3}[\s.-]?\d{2}\s?\d{2}
In case you want to only match it when not inside other digits, you may add (?<!\d) / (?!\d) lookarounds that prevent a match if there is a digit on the left or right:
(?<!\d)(?:\+?\d{0,3}-?\(?[\s.-]?\d{3}\)?[\s.-]?)?\d{3}[\s.-]?\d{2}\s?\d{2}(?!\d)
To ensure the usual word boundary on both sides use
(?<!\w)(?:\+?\d{0,3}-?\(?[\s.-]?\d{3}\)?[\s.-]?)?\d{3}[\s.-]?\d{2}\s?\d{2}\b
In R, do not forget to double all backslashes in the string literal:
regex <- "(?<!\\w)(?:\\+?\\d{0,3}-?\\(?[\\s.-]?\\d{3}\\)?[\\s.-]?)?\\d{3}[\\s.-]?\\d{2}\\s?\\d{2}\\b"
Main points:
((\\d{1})?|(\\d{2})?|(\\d{3}))? is better written as \\d{0,3}, a zero-to-three-digits pattern (alternation makes the matching process more resource-consuming than a linear, straightforward pattern)
[\\s.-] is preferred to [\\s\\-\\.] since a hyphen is better placed at the end of the character class (no need to escape it there) and note that . always matches a literal . inside a character class
(\\d{4}|\\d{2}\\s\\d{2}) can and should be re-written as \\d{2}\\s?\\d{2}, matching 2 digits followed by an optional whitespace and then 2 digits.
Not sure you really want to match a whitespace, hyphen or dot at the end of the pattern, so I suggest removing [\\s-\\.]? at the end.
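A quick check of the suggested pattern against the two formats that were failing, plus a few of the others from the question (a sketch using a subset of the phonenumbers vector):

```r
phonenumbers <- c("+90-555-4443322", "+1.517.3002010",  # the two that failed
                  "0555 444 33 22", "(000) 000.0000", "0000000")

regex <- "(?<!\\w)(?:\\+?\\d{0,3}-?\\(?[\\s.-]?\\d{3}\\)?[\\s.-]?)?\\d{3}[\\s.-]?\\d{2}\\s?\\d{2}\\b"

matches <- unlist(regmatches(phonenumbers,
                             gregexpr(regex, phonenumbers, perl = TRUE)))
matches
# [1] "+90-555-4443322" "+1.517.3002010"  "0555 444 33 22"
# [4] "(000) 000.0000"  "0000000"
```

Each input now yields exactly one match covering the full number, including the previously missed "+90-555-4443322" and "+1.517.3002010".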
How can I convert Ａb９８７６５４３２１０ into Ab9876543210? Is there a solution by regular expression?
test <- dput("Ａb９８７６５４３２１０")
Disclaimer: the following works on my machine, but since I can't replicate your fullwidth string based purely on the example provided, this is a best guess based on my version of the problem (pasting the string into a text file, saving it with UTF-8 encoding, and loading it in with the encoding specified as UTF-8).
Step 1. Reading in the text (I added a half width version for comparison):
> test <- readLines("fullwidth.txt", encoding = "UTF-8")
> test
[1] "Ａb９８７６５４３２１０" "Ab9876543210"
Step 2. Verifying that the full & half width versions are not equal:
# using all.equal()
test1 <- test[1]
test2 <- test[2]
> all.equal(test1, test2)
[1] "1 string mismatch"
# compare raw bytes
> charToRaw(test1)
[1] ef bb bf ef bc a1 62 ef bc 99 ef bc 98 ef bc 97 ef bc 96 ef bc 95 ef
[24] bc 94 ef bc 93 ef bc 92 ef bc 91 ef bc 90
> charToRaw(test2)
[1] 41 62 39 38 37 36 35 34 33 32 31 30
For anyone interested, if you paste the raw byte version into a UTF-8 decoder as hexadecimal input, you'll see that except for the letter b (mapped from 62 in the 7th byte), the rest of the characters are formed by 3-byte sequences. In addition, the first 3-byte sequence maps to the ZERO WIDTH NO-BREAK SPACE character (a byte-order mark), so it's not visible when you print the string to the console.
Step 3. Converting from full width to half width using the Nippon package:
library(Nippon)
test1.converted <- zen2han(test1)
> test1.converted
[1] "Ab9876543210"
# If you want to compare against the original test2 string, remove the zero
# width character in front
> all.equal(substring(test1.converted, 2), test2)
[1] TRUE
Here is a base R solution.
Fullwidth ASCII characters are in the range 0xFF01:0xFF5E, each offset from its ASCII counterpart by 0xFEE0 (65248), and can be mapped back like this.
x <- "Ａb９８７６５４３２１０"
iconv(x, to = "UTF-8") |>
  utf8ToInt() |>
  (\(.) ifelse(. >= 0xFF01 & . <= 0xFF5E, . - 0xFEE0, .))() |>
  intToUtf8()
[1] "Ab9876543210"
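The offset trick above can be wrapped as a reusable, vectorised helper (fw2ascii is a name made up for this sketch):

```r
# Map fullwidth ASCII (U+FF01 to U+FF5E) back to ASCII by subtracting 0xFEE0.
fw2ascii <- function(x) {
  vapply(x, function(s) {
    cp <- utf8ToInt(enc2utf8(s))
    intToUtf8(ifelse(cp >= 0xFF01 & cp <= 0xFF5E, cp - 0xFEE0, cp))
  }, character(1), USE.NAMES = FALSE)
}

fw2ascii("\uFF21b\uFF19\uFF18\uFF17\uFF16\uFF15\uFF14\uFF13\uFF12\uFF11\uFF10")
# [1] "Ab9876543210"
```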