Using urltools::url_parse with UTF-8 domains

The function url_parse is very fast and works fine most of the time. But nowadays domain names may contain UTF-8 characters, for example
url <- "www.cordes-tiefkühlprodukte.de"
Now if I apply url_parse to this URL, I get a special character "<fc>" in the domain column:
url_parse(url)
scheme domain port path parameter fragment
1 <NA> www.cordes-tiefk<fc>hlprodukte.de <NA> <NA> <NA> <NA>
My question is: How can I "fix" this entry to UTF-8? I tried iconv and some functions from the stringi package, but with no success.
(I am aware of httr::parse_url, which does not have this problem. So one approach would be to detect the URLs that are not ASCII, and use url_parse on the rest and parse_url on the few special cases. However, this leads to the problem of (efficiently) detecting the non-ASCII URLs.)
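Here is a rough sketch of that two-parser idea (my own suggestion, not tested exhaustively): stri_enc_isascii() from stringi is vectorised and fast, so splitting the URLs into an ASCII group and a non-ASCII group is cheap.
library(stringi)
urls <- c("www.example.com", "www.cordes-tiefkühlprodukte.de")
is_ascii <- stri_enc_isascii(urls)
fast_part    <- urltools::url_parse(urls[is_ascii])       # bulk of the data
special_part <- lapply(urls[!is_ascii], httr::parse_url)  # few special cases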
EDIT: Unfortunately, url1 <- URLencode(enc2utf8(url)) does not help. When I do
robotstxt::paths_allowed(
  url1,
  domain = urltools::suffix_extract(urltools::domain(url1))
)
I get a "could not resolve host" error. However, when I plug in the original URL and the second-level domain by hand, paths_allowed works.
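One workaround sketch for the resolver error (an assumption on my part: it relies on the urltools build shipping the punycode helpers puny_encode()/puny_decode()): convert the internationalised domain to its ASCII "xn--" form, which DNS resolvers understand, before calling paths_allowed.
url_puny <- urltools::puny_encode("www.cordes-tiefkühlprodukte.de")
robotstxt::paths_allowed(url_puny)  # domain defaults to "auto"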
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)
Matrix products: default
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] urltools_1.7.3 fortunes_1.5-4
loaded via a namespace (and not attached):
[1] compiler_3.6.1 Rcpp_1.0.1 triebeard_0.3.0

I could reproduce the issue. I could convert the column domain to UTF-8 by reading it with readr::parse_character and latin1 encoding:
library(urltools)
library(tidyverse)
url <- "www.cordes-tiefkühlprodukte.de"
parts <-
  url_parse(url) %>%
  mutate(domain = parse_character(domain, locale = locale(encoding = "latin1")))
parts
scheme domain port path parameter fragment
1 <NA> www.cordes-tiefkühlprodukte.de <NA> <NA> <NA> <NA>
I guess that the encoding you have to specify (here latin1) depends only on your locale and not on the URL's special characters, but I'm not 100% sure about that.
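An alternative base-R sketch (my addition; it assumes, as above, that the mis-marked bytes really are latin1): declare the encoding on the column and then convert it to UTF-8.
parts <- urltools::url_parse("www.cordes-tiefkühlprodukte.de")
Encoding(parts$domain) <- "latin1"   # mark the bytes as latin1
parts$domain <- enc2utf8(parts$domain)  # convert to UTF-8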

Just for reference, another method that works fine for me is:
library(stringi)
url <- "www.cordes-tiefkühlprodukte.de"
url <- stri_escape_unicode(url)
dat <- urltools::url_parse(url)
for(cn in colnames(dat)) dat[,cn] <- stri_unescape_unicode(dat[,cn])
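For what it's worth, the reason this should work: stri_escape_unicode() turns the ü into the literal ASCII sequence \u00fc before parsing, so url_parse() only ever sees ASCII, and stri_unescape_unicode() restores the original characters in every column afterwards.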

Related

Foreign(hebrew, Chinese) characters: Tidyverse incorrect display in console but correct in View() [duplicate]

For at least some cases, Asian characters are printable if they are contained in a matrix, or a vector, but not in a data.frame. Here is an example
q<-'天'
q # Works
# [1] "天"
matrix(q) # Works
# [,1]
# [1,] "天"
q2 <- data.frame(q, stringsAsFactors = FALSE)
q2 # Does not work
# q
# 1 <U+5929>
q2[1,] # Works again.
# [1] "天"
Clearly, my device is capable of displaying the character, but when it is in a data.frame, it does not work.
Doing some digging, I found that the print.data.frame function runs format on each column. It turns out that if you run format.default directly, the same problem occurs:
format(q)
# "<U+5929>"
Digging into format.default, I find that it is calling the internal format, written in C.
Before I dig any further, I want to know if others can reproduce this behaviour. Is there some configuration of R that would allow me to display these characters within data.frames?
My sessionInfo(), if it helps:
R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_Canada.1252 LC_CTYPE=English_Canada.1252
[3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C
[5] LC_TIME=English_Canada.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.0.1
I hate to answer my own question, but although the comments and answers helped, they weren't quite right. In Windows, it doesn't seem like you can set a generic 'UTF-8' locale. You can, however, set country-specific locales, which will work in this case:
Sys.setlocale("LC_CTYPE", locale="Chinese")
q2 # Works fine
# q
#1 天
But, it does make me wonder why exactly format seems to use the locale; I wonder if there is a way to have it ignore the locale in Windows. I also wonder if there is some generic UTF-8 locale that I don't know about on Windows.
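A small follow-up sketch (my addition): switch LC_CTYPE only temporarily and restore the original locale afterwards.
old <- Sys.getlocale("LC_CTYPE")
Sys.setlocale("LC_CTYPE", locale = "Chinese")
print(q2)                       # now renders 天 instead of <U+5929>
Sys.setlocale("LC_CTYPE", old)  # back to the original locale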
I just blogged about Unicode and R several days ago. I think your R editor is UTF-8, and this gives you the illusion that R on your Windows system handles UTF-8 characters.
The short answer: when you want to process Unicode (here, Chinese), don't use an English-language Windows; use a Chinese-language Windows, or Linux, which uses UTF-8 by default.
Session info in my Ubuntu:
> sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: i686-pc-linux-gnu (32-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

Stringi/stringr pattern behaves differently if the function is sourced in R

I have been using the stringi package for a while now and everything works fine.
I recently wanted to put some regex inside a function and store that function in a separate file. The code works just fine if the function is loaded from the script but when it is sourced I do not get the expected result.
Here is the code to reproduce the issue:
library(stringi)

clean <- function(text){
  stri_replace_all_regex(str = text,
                         pattern = "(?i)[^a-zàâçéèêëîïôûùüÿñæœ0-9,\\.\\?!']",
                         replacement = " ")
}
text <- "A sample text with some french accent é, è, â, û and some special characters |, [, ( that needs to be cleaned."
clean(text) # OK
[1] "A sample text with some french accent é, è, â, û and some special characters , , that needs to be cleaned."
source("clean.r")
clean(text) # KO
[1] "A sample text with some french accent , , , and some special characters , , that needs to be cleaned."
I want to remove everything that is not a letter, an accented letter, or one of the punctuation characters ?, !, ,, and ..
The code works just fine if the function is loaded inside the script directly. If it is sourced then it gives a different result.
I also tried using stringr and I have the same problem. My files are saved in UTF-8 encoding.
I do not understand why this is happening, any help is greatly appreciated.
Thank you.
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252
[3] LC_MONETARY=French_France.1252 LC_NUMERIC=C
[5] LC_TIME=French_France.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] stringi_1.1.5 data.table_1.10.4
loaded via a namespace (and not attached):
[1] compiler_3.4.1 tools_3.4.1 yaml_2.1.14
Try converting the text to ASCII first. This will change the characters, and may allow the same behaviour when you source the function in R.
+1 to Felipe Alvarenga
https://stackoverflow.com/a/45941699/2069472
text <- "Ábcdêãçoàúü"
iconv(text, to = "ASCII//TRANSLIT")
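Another sketch, not from the answers above: tell source() which encoding the file was written in, so the accented characters in the pattern survive being read into a Windows (CP1252) session.
source("clean.r", encoding = "UTF-8")  # read the file as UTF-8
clean(text)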

R - c() unexpectedly converts names of named vectors into UTF-8. Is this a bug?

I've run into strange behavior of c() with R 3.3.2 on Windows with a non-US-English locale. It converts the names of named vectors to UTF-8.
x <- "φ"
names(x) <- "φ"
Encoding(names(x))
#> [1] "unknown"
Encoding(names(c(x)))
#> [1] "UTF-8"
Though this issue is not problematic for most people, it is critical for those who use named vectors as lookup tables (an example is here: http://adv-r.had.co.nz/Subsetting.html#applications). I am also one of those who got stuck with this behavior, in dplyr's select() function.
I'm not quite sure whether this behavior is a bug or by design. Should I submit a bug report to R core?
Here's info about my R environment:
sessionInfo()
#> R version 3.3.2 (2016-10-31)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows >= 8 x64 (build 9200)
#>
#> locale:
#> [1] LC_COLLATE=Japanese_Japan.932 LC_CTYPE=Japanese_Japan.932 LC_MONETARY=Japanese_Japan.932
#> [4] LC_NUMERIC=C LC_TIME=Japanese_Japan.932
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> loaded via a namespace (and not attached):
#> [1] tools_3.3.2
You should still see names(c(x)) == names(x) on your system. The encoding change by c() may be unintentional, but shouldn't affect your code in most scenarios.
On Windows, which doesn't have a UTF-8 locale, your safest bet is to convert all strings to UTF-8 first via enc2utf8(), and then stay in UTF-8. This will also enable safe lookups.
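A minimal sketch of that advice (illustrative values only): both the lookup table's names and the keys go through enc2utf8(), so the re-encoding done by c() no longer affects matching.
lookup <- c("phi")
names(lookup) <- enc2utf8("φ")  # keep the names in UTF-8
key <- enc2utf8("φ")            # and the lookup keys too
lookup[key]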
Language symbols (as used in dplyr's group_by()) are an entirely different issue. For some reason they are always interpreted in the native encoding. (Try as.name(names(c(x))).) However, it's still best to have them in UTF-8, and convert to native just before calling as.name(). This is what dplyr should be doing, we're just not quite there yet.
My recommendation is to use ASCII-only characters for column names when using dplyr on Windows. This requires some discipline if you're relying on tidyr::spread() for non-ASCII column contents. You could also consider switching to a system (OS X or Linux) that works with UTF-8 natively.

Reading a Tamil corpus in R

I have built a basic word-prediction product using R as part of an online course project. I wanted to extend it to predict the next word from Tamil phrases. I used a sample of the Tamil-language corpora from the HC Corpora website, read it into R, and created a tm corpus.
testData <- "திருவண்ணாமலை, கொல்லிமலை, சதுரகிரி என அவன் சித்தர்களை பல
இடங்களில், மலைகளில், குகைகளில், இன்னும் பல ரகசிய இடங்களில்
அவன் சித்தர்களை சந்தித்து பல நம்பமுடியாத சக்திகளைப்
பெற்றுவிட்டான் என்று சொல்லிக் கொள்கிறார்கள்"
library(tm)
library(RWeka)

getUnigrams <- function(x) {NGramTokenizer(x, Weka_control(min = 1, max = 1))}
unigrams <- DocumentTermMatrix(VCorpus(VectorSource(testData)),
                               control = list(tokenize = getUnigrams))
unigramsList <- data.frame(slam::col_sums(unigrams))
head(unigramsList, 3)
> slam..col_sums.unigrams.
அவன் 2
இடங்களில் 2
இன்னும் 1
The actual Tamil words are the row names of this data frame and are displayed properly on the screen. However, when I try to add them as a column next to their respective counts, the resulting data frame does not display the Tamil words correctly in the column unigramsList$word1. It displays them as the Unicode escapes of the underlying Tamil words.
unigramsList$word1 <- rownames(unigramsList) ## Encoding issues arise from here!!!
head(unigramsList, 3)
slam..col_sums.unigrams.
அவன் 2
இடங்களில் 2
இன்னும் 1
word1
அவன் <U+0B85><U+0BB5><U+0BA9><U+0BCD>
இடங்களில் <U+0B87><U+0B9F><U+0B99><U+0BCD><U+0B95><U+0BB3><U+0BBF><U+0BB2><U+0BCD>
இன்னும் <U+0B87><U+0BA9><U+0BCD><U+0BA9><U+0BC1><U+0BAE><U+0BCD>
>
I tried to continue with these Unicode escapes, built 2-, 3- and 4-grams, and used them for my prediction, but all subsequent operations on this column display raw Unicode only. I want to be able to view and predict the words in their native Tamil characters.
My session information is as below:
> sessionInfo()
R version 3.2.5 (2016-04-14)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RWeka_0.4-29 tm_0.6-2 NLP_0.1-9 stringi_1.0-1 stringr_1.0.0
loaded via a namespace (and not attached):
[1] magrittr_1.5 parallel_3.2.5 tools_3.2.5 slam_0.1-37
[5] grid_3.2.5 rJava_0.9-8 RWekajars_3.9.0-1
I managed to hack together a solution and thought I would post it for anyone interested in this topic.
a) Instead of saving the n-grams as CSV files on Windows, I saved them in R's binary format (using the save() and load() functions; see the sketch after this list). I had previously written and re-read the generated n-grams as CSV with the fileEncoding option set to "UTF-8", but that did not help even after deploying to Shiny.
b) Deployed and tested on shinyapps.io, which runs on Linux and was therefore able to display the Tamil characters correctly. Testing locally on Windows was not effective, as the characters were displayed as raw Unicode escapes.
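A minimal sketch of point (a) (the file name is made up): store the n-gram tables in R's binary format so the UTF-8 strings survive the round trip unchanged on Windows.
save(unigramsList, file = "unigrams.RData")  # write once, e.g. during preprocessing
load("unigrams.RData")                       # restore later, e.g. in the Shiny app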
Thanks to Marek Gagolewski, author of stringi, for the suggestions regarding shinyapps.io, which helped me deploy and test on Shiny's Linux platform.
You can check out the product using the below link if you are interested: https://periasamyr.shinyapps.io/predictwordml/
Regards
Peri

How to get proper encoding using browseURL()?

I'm basically trying to browse a URL with Japanese characters in it. This question builds on my first question from yesterday. My code now generates the right URL, and if I just take the URL and paste it into my browser I get the right result, but if I try to automate the process with browseURL() I get a wrong result.
E.g. I am trying to call following URL:
http://www.google.com/trends/trendsReport?hl=en-US&q=VWゴルフ %2B VWポロ %2B VWパサート %2B VWティグアン&date=1%2F2010 68m&cmpt=q&content=1&export=1
if I now use
browseURL("http://www.google.com/trends/trendsReport?hl=en-US&q=VWゴルフ %2B VWポロ %2B VWパサート %2B VWティグアン&date=1%2F2010 68m&cmpt=q&content=1&export=1")
I can see in the browser that it browsed
www.google.com/trends/trendsReport?hl=en-US&q=VW%E3%83%BB%EF%BD%BDS%E3%83%BB%EF%BD%BD%E3%83%BB%EF%BD%BD%E3%83%BB%EF%BD%BDt%20%2B%20VW%E3%83%BB%EF%BD%BD%7C%E3%83%BB%EF%BD%BD%E3%83%BB%EF%BD%BD%20%2B%20VW%E3%83%BB%EF%BD%BDp%E3%83%BB%EF%BD%BDT%E3%83%BB%EF%BD%BD[%E3%83%BB%EF%BD%BDg%20%2B%20VW%E3%83%BB%EF%BD%BDe%E3%83%BB%EF%BD%BDB%E3%83%BB%EF%BD%BDO%E3%83%BB%EF%BD%BDA%E3%83%BB%EF%BD%BD%E3%83%BB%EF%BD%BD&date=1%2F2010%2068m&cmpt=q&content=1&export=1
which seems to be an encoding mistake. I already tried
browseURL(URL, encodeIfNeeded=TRUE)
but that doesn't seem to change a thing, and as far as I understand the function it also shouldn't, because its purpose is to generate those "%XX" escapes, which makes it even more surprising that I get them even when encodeIfNeeded = FALSE.
Any help is highly appreciated!
> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 8 (build 9200)
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=Japanese_Japan.932 LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C LC_TIME=German_Germany.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.2.1
I think this will get around the issue:
library(httr)
library(curl)
gt_url <- "http://www.google.com/trends/trendsReport?hl=en-US&q=VWゴルフ %2B VWポロ %2B VWパサート %2B VWティグアン&date=1%2F2010 68m&cmpt=q&content=1&export=1"
# ensure the %2B's aren't getting in the way then
# ask httr to carve up the url and put it back together
parts <- parse_url(URLdecode(gt_url))
browseURL(build_url(parts))
That gives the properly encoded URL (too long to paste here, but I want to make sure the OP gets to see the whole content).
I also now see why you have to do it this way (both download.file and GET with write_disk don't work due to the javascript redirect).
