I am retrieving online XML data using the XML R package. My issue is that the UTF-8 encoding is lost during the call to xmlToList: for instance, 'é' is replaced by 'é'. This happens during the XML parsing.
Here is a code snippet, with an example where the encoding is lost and another where it is kept (depending on the data source):
library(XML)
library(RCurl)
url = "http://www.bdm.insee.fr/series/sdmx/data/DEFAILLANCES-ENT-FR-ACT/M.AZ+BE.BRUT+CVS-CJO?lastNObservations=2"
res <- getURL(url)
xmlToList(res)
# encoding lost
url2 = "http://www.bdm.insee.fr/series/sdmx/conceptscheme/"
res2 <- getURL(url2)
xmlToList(res2)
# encoding kept
Why does the encoding behaviour differ between the two sources? I tried setting .encoding = "UTF-8" in getURL, and calling enc2utf8(res), but neither makes any difference.
Any help is welcome !
Thanks,
Jérémy
R version 3.2.1 (2015-06-18)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252
[3] LC_MONETARY=French_France.1252 LC_NUMERIC=C
[5] LC_TIME=French_France.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RCurl_1.95-4.7 bitops_1.0-6 XML_3.98-1.3
loaded via a namespace (and not attached):
[1] tools_3.2.1
You are trying to read SDMX documents in R. I would suggest using the rsdmx package, which makes reading SDMX documents easier. The package is available on CRAN, and you can also access the latest version on GitHub.
rsdmx allows you to read SDMX documents from a file or URL, e.g.
require(rsdmx)
sdmx = readSDMX("http://www.bdm.insee.fr/series/sdmx/data/DEFAILLANCES-ENT-FR-ACT/M.AZ+BE.BRUT+CVS-CJO?lastNObservations=2")
as.data.frame(sdmx)
Another approach is to use the web-service interface to the embedded data providers, of which INSEE is one. Try:
sdmx <- readSDMX(providerId = "INSEE", resource = "data",
                 flowRef = "DEFAILLANCES-ENT-FR-ACT",
                 key = "M.AZ+BE.BRUT+CVS-CJO", key.mode = "SDMX",
                 start = 2010, end = 2015)
as.data.frame(sdmx)
AFAIK the package also has issues with character encoding, but I'm currently investigating a solution to make available soon in the package. Calling getURL(file, .encoding = "UTF-8") retrieves the data properly, but the encoding is lost when calling the XML parsing functions.
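In the meantime, one possible workaround (a sketch, untested against the INSEE endpoint) is to force the encoding at parse time with xmlParse before converting, rather than passing the raw character vector straight to xmlToList:

```r
library(XML)
library(RCurl)

# Sketch of a workaround: download as UTF-8, then parse with an explicit
# encoding before converting to a list. Untested against this endpoint.
url <- "http://www.bdm.insee.fr/series/sdmx/data/DEFAILLANCES-ENT-FR-ACT/M.AZ+BE.BRUT+CVS-CJO?lastNObservations=2"
res <- getURL(url, .encoding = "UTF-8")
doc <- xmlParse(res, asText = TRUE, encoding = "UTF-8")
xmlToList(doc)
```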
Note: I also see that you use the lastNObservations parameter. For the moment the web-service interface does not support extra parameters, but this could be added quite easily if you need it.
According to the rfigshare README:
The first time you use an rfigshare function, it will ask you to authenticate online. Just log in and click okay to authenticate rfigshare. R will allow you to cache your login credentials so that you won't be asked to authenticate again (even between R sessions), as long as you are using the same working directory in future.
After installing rfigshare on a fresh machine (without an existing .httr-oauth)
library(devtools)
install_github('ropensci/rfigshare')
library(rfigshare)
id = 3761562
fs_browse(id)
Error in value[[3L]](cond) : Requires authentication.
Are your credentials stored in options?
See fs_auth function for details.
Thus, in spite of what the readme says, I am not asked to authenticate.
Directly calling fs_auth does not work either:
> fs_auth()
Error in init_oauth1.0(self$endpoint, self$app, permission = self$params$permission, :
Bad Request (HTTP 400).
My sessionInfo is as follows:
sessionInfo()
R version 4.0.5 (2021-03-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rfigshare_0.3.7.100
loaded via a namespace (and not attached):
[1] Rcpp_1.0.6 magrittr_2.0.1 tidyselect_1.1.0 munsell_0.5.0
[5] colorspace_2.0-1 R6_2.5.0 rlang_0.4.11 fansi_0.5.0
[9] httr_1.4.2 dplyr_1.0.5 grid_4.0.5 gtable_0.3.0
[13] utf8_1.2.1 DBI_1.1.1 ellipsis_0.3.2 assertthat_0.2.1
[17] yaml_2.2.1 tibble_3.1.2 lifecycle_1.0.0 crayon_1.4.1
[21] RJSONIO_1.3-1.4 purrr_0.3.4 ggplot2_3.3.3 later_1.2.0
[25] vctrs_0.3.8 promises_1.2.0.1 glue_1.4.2 compiler_4.0.5
[29] pillar_1.6.1 generics_0.1.0 scales_1.1.1 XML_3.99-0.6
[33] httpuv_1.6.1 pkgconfig_2.0.3
Does anyone have any tips or workarounds? This definitely worked maybe 6 months ago, when I last tried. I also have an open thread about this issue with figshare support, but their knowledge of the R library seems limited.
(cross-posted from Github)
The master branch of rfigshare seems to be out of sync with what figshare now offers: the master branch uses v1 of the API along with OAuth 1 authentication, whereas figshare has moved on to v2 of the API and now promotes OAuth 2.
While I am unsure whether figshare has shut down v1 of the API and/or disallowed OAuth 1, it seems you might still be able to use the package if you install from the sckott branch and use a personal access token (PAT).
To generate a PAT, navigate to https://figshare.com/account/applications in a web browser. At the bottom of this page, you can generate a PAT. When the token is presented, copy it as you will not be able to view it again (although you can easily generate a new one at any time).
You will want to store this token in your .Renviron file. The usethis package has a nifty edit_r_environ() function to make this a little easier:
usethis::edit_r_environ()
Running the above in R should find your .Renviron file and open it for editing. Store your PAT on a new line.
RFIGSHARE_PAT="the-really-long-pat-you-should-have-on-your-clipboard"
Save and close the file. Make sure to restart your R session for this change to take effect.
You can then test whether the above worked by running:
Sys.getenv("RFIGSHARE_PAT")
to see if your PAT is found.
Then install rfigshare from the sckott branch.
remotes::install_github("https://github.com/ropensci/rfigshare/tree/sckott")
Now you should be able to
library(rfigshare)
fs_browse()
You might also consider leveraging the fact that the current figshare API is OpenAPI compatible and building your own client on the fly from the Swagger specification.
Generate and store a personal access token as I described in my other answer. Then you could do
library(rapiclient)
library(httr)
fs_api <- get_api("https://docs.figshare.com/swagger.json")
header <- c(Authorization = sprintf("token %s", Sys.getenv("RFIGSHARE_PAT")))
fs_api <- list(operations = get_operations(fs_api, header),
               schemas = get_schemas(fs_api))
my_articles <- fs_api$operations$private_articles_list()
content(my_articles)
I think one of the issues is that you're passing an article_id to fs_browse, but that's not what its first argument expects. If you're looking to browse a public set, you can set mine = FALSE and session = NULL, like:
out = fs_details(article_id = 3761562, mine = FALSE, session = NULL)
Figshare support informed me that they have blocked requests made over http://. Switching the requests to https:// seemed to fix some of the issues in rfigshare. In particular, fs_details() and fs_delete() work after switching to https://.
fs_upload() is broken even after switching to https://.
I'm using dbplyr to get data from SQL Server into R, but Chinese, Japanese and other non-Latin characters are appearing as "?". I'm using a Windows machine.
I've read through the following threads:
How does R handle Unicode / UTF-8?
How to use Regex to strip punctuation without tainting UTF-8 or UTF-16 encoded text like chinese?
Fetching UTF-8 text from MySQL in R returns “????”
These provide some useful ideas, but nothing has worked so far. I have tried:
- Setting encoding = 'UTF-8' within the dbConnect function. Characters still show as question marks.
- Setting encoding = 'UTF-16' within the dbConnect function. R returns an error: Error in iconv(x[current], from = enc, to = to, ...)
- Changing the global character encoding to UTF-8 with Sys.setenv(LANG = "UTF-8") and options(encoding = "UTF-8").
- Checking whether the characters display when plotting (which would indicate they are being stored correctly). This wasn't the case.
I was able to get the characters to display correctly by using RJDBC, however this is not compatible with dbplyr, according to this GitHub issue.
Here is my session info:
> sessionInfo()
# R version 3.5.0 (2018-04-23)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows >= 8 x64 (build 9200)
# Matrix products: default
# locale:
# [1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
# [5] LC_TIME=English_United Kingdom.1252
My code looks like this:
con <- dbConnect(odbc(),
                 Driver = "SQL Server",
                 Server = "server name",
                 Database = "database name",
                 user = "my username",
                 password = "my password",
                 encoding = "UTF-8")
Surely odbc/dbplyr can handle these character types on Windows, so what am I missing here?
Any help would be much appreciated!
Check the list of encodings available with iconvlist().
I used encoding = "windows-1252" to be able to work correctly with Nordic characters using ODBC version 1.2.2.
Although I have not used Chinese or Japanese characters, encoding values such as "GB18030", "gb2312" and "GBK" could be used for Chinese Guobiao, for example.
Wikipedia has a helpful page (scroll to the bottom for the list).
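Applied to your connection, that might look like the following sketch (the Driver/Server/Database values are placeholders, and "GB18030" is an assumption for Chinese text; check iconvlist() for what your platform actually supports):

```r
library(DBI)
library(odbc)

# Sketch: pass an explicit client-side encoding that matches how the
# database stores the text. Connection details are placeholders.
con <- dbConnect(odbc::odbc(),
                 Driver   = "SQL Server",
                 Server   = "server name",
                 Database = "database name",
                 encoding = "GB18030")
```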
I am trying to fetch tweets using searchTwitter() and/or userTimeline()
I want to fetch the maximum number of tweets that the twitteR API allows (I believe that limit is around 3,000).
But I'm only getting very few posts in the result (like 83 or 146). I'm sure there are more posts: when I check the timeline of that user (via browser or app) I see more than 3,000 posts.
Below is the message I get.
r_stats <- searchTwitter("#ChangeToMeIs", n=2000)
Warning message:
In doRppAPICall("search/tweets", n, params = params, retryOnRateLimit = retryOnRateLimit, :
2000 tweets were requested but the API can only return 83
Is there anything I am missing?
PS: I've checked all related questions before posting. Before marking this as a duplicate, please help me with the solution.
Actually, you are using the Twitter Search API, which only returns a sample of results, not a comprehensive search.
What you need is the Twitter Streaming API.
Please note that the Twitter search API does not return an exhaustive
list of tweets that match your search criteria, as Twitter only makes
available a sample of recent tweets. For a more comprehensive search,
you will need to use the Twitter streaming API, creating a database of
results and regularly updating them, or use an online service that can
do this for you.
Source: https://colinpriest.com/2015/07/04/tutorial-using-r-and-twitter-to-analyse-consumer-sentiment/
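One way to reach the streaming API from R is the streamR package. A sketch (my_oauth is assumed to be an OAuth credential object you have already created; see streamR's documentation):

```r
library(streamR)

# Collect tweets matching the hashtag for five minutes into a JSON file,
# then parse them into a data frame. my_oauth must be an existing OAuth
# credential object.
filterStream(file.name = "tweets.json", track = "#ChangeToMeIs",
             timeout = 300, oauth = my_oauth)
tweets.df <- parseTweets("tweets.json")
```

Run repeatedly (or with a longer timeout), this builds up the kind of database of results the quote above describes.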
I installed the twitteR library from GitHub; it is quite important that the version comes from GitHub, not CRAN.
Then set up authentication:
setup_twitter_oauth("xxxxxxx", "xxxxx")
and then you can use commands such as the following.
To get tweets from a user's timeline:
ut <- userTimeline('xxxx', n=2000)
ut <- twListToDF(ut)
or to search for specific hashtags:
tweets<-twListToDF(searchTwitter("#f1", n=5000))
It works perfectly for me.
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=Swedish_Sweden.1252 LC_CTYPE=Swedish_Sweden.1252 LC_MONETARY=Swedish_Sweden.1252 LC_NUMERIC=C LC_TIME=Swedish_Sweden.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] twitteR_1.1.9
loaded via a namespace (and not attached):
[1] bit_1.1-12 httr_1.1.0 rjson_0.2.15 plyr_1.8.3 R6_2.1.2 rsconnect_0.4.1.11 DBI_0.3.1 tools_3.2.2
[9] whisker_0.3-2 yaml_2.1.13 Rcpp_0.12.4 bit64_0.9-5 rCharts_0.4.5 RJSONIO_1.3-0 grid_3.2.2 lattice_0.20-33
Since twitteR is going to be deprecated, you should install rtweet instead.
Here is the code:
# Install and load the 'rtweet' package
install.packages("rtweet")
library(rtweet)
# whatever name you assigned to your created app
appname <- "tweet-search-app"
# api key (example below is not a real key)
key <- "9GmBeouvgfdljlBLryeIeqCHEt"
# api secret (example below is not a real key)
secret <- "ugdfdgdgrxOzjhlkhlxgdxllhoiofdtrrdytszghcv"
# create token named "twitter_token"
twitter_token <- create_token(
  app = appname,
  consumer_key = key,
  consumer_secret = secret)
# Retrieve tweets for a particular hashtag
r_stats <- search_tweets("#ChangeToMeIs", n = 2000, token = twitter_token)
I'm getting a new error, which I've never seen before, when connecting from R to a Greenplum PostgreSQL database using RODBC. I've gotten the error using both Emacs/ESS and RStudio, and the RODBC call has worked as-is in the past.
library(RODBC)
gp <- odbcConnect("greenplum", believeNRows = FALSE)
data <- sqlQuery(gp, "select * from mytable")
> data
[1] "22P05 7 ERROR: character 0xc280 of encoding \"UTF8\" has no equivalent in \"WIN1252\";\nError while executing the query"
[2] "[RODBC] ERROR: Could not SQLExecDirect 'select * from mytable'"
EDIT:
Just tried querying another table and did get results. So I guess it's not an RODBC problem but a PostgreSQL table encoding problem.
R version 2.13.0 (2011-04-13)
Platform: i386-pc-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RODBC_1.3-2
First, the issue arises because R is trying to convert to a Windows locale that supports UTF-8. Unfortunately, Brian Ripley has reported numerous times that Windows has no UTF-8 locales. From hours spent searching the web, StackOverflow, Microsoft, etc., I have come to the conclusion that Windows simply won't support UTF-8 locales.
As a result, I'm not sure there's an easy solution to this, if there is any solution at all. The best I can recommend is to wrap some kind of conversion on the server side, to filter the data if you can, or to try a different locale if appropriate (e.g. Chinese, Japanese, Korean).
If you do decide to wrap a converter, unicode.org recommends the ICU toolkit.
0xc280 is a control element (U+0080 in Unicode) that causes trouble pretty often when using SQL and the like. The problem often lies in the conversion chain that invariably occurs when you pass data through different applications that use different encoding schemes. Windows does include UTF-8 support by now, so it's not strictly a Windows problem. I believe the problem arises before R reads the data in.
In fact, in that chain the character 0x80 (U+0080 in Unicode) is mapped to the byte sequence 0xc280 in UTF-8. This is supposed to be a control sequence and cannot be printed. But chances are that the 0x80 is in fact not Unicode but Windows Latin-1 or Latin-2; in that case, the 0x80 represents the euro sign. That might explain how it ends up in your data. Check whether you can find something like that in the data; that would already explain a lot.
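You can reproduce that byte mapping in R itself (a small sketch):

```r
# A single 0x80 byte declared as Latin-1 is the code point U+0080;
# re-encoding it to UTF-8 yields exactly the two bytes c2 80 seen in
# the error message.
x <- rawToChar(as.raw(0x80))
Encoding(x) <- "latin1"
charToRaw(enc2utf8(x))
# [1] c2 80
```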
My guess is that the solution will not lie at the R end of this workchain, but before it. R will try automatic conversion, but this is reported to fail in some cases (also for SQL and Oracle, btw). Check which encoding you're working with in PostgreSQL, and try to use one of the Latin types. There might be other links involved (a PuTTY or similar terminal, for example); I'm pretty sure the encodings there are all ISO 8859-1, which is Latin-1. Somewhere UTF-8 gets thrown in between, and when the 0x80 character gets wrongly mapped to 0xc280, you get trouble.
So check the encodings in your complete workchain, and make sure they all match. If they don't, the automatic conversion done between each step is bound to cause trouble for some characters.
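One hypothetical way to experiment from the R end is to ask the server to transcode results on the fly before running the query (a sketch; 'LATIN1' is an assumption here — use whatever encoding your client locale actually is):

```r
library(RODBC)

gp <- odbcConnect("greenplum", believeNRows = FALSE)
# Ask PostgreSQL/Greenplum to transcode query results into an encoding
# the Windows client locale can represent, before they reach the driver.
sqlQuery(gp, "SET client_encoding TO 'LATIN1';")
data <- sqlQuery(gp, "select * from mytable")
```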
Hope this helps.
I might have posted this response elsewhere, but here goes.
I get a similar error when connecting to a Postgres DB from the MS SQL Management client. Trying to fix the source data is almost impossible in my case.
My Scenario:
Trying to connect to Postgres using MS SQL Linked Objects via an ODBC System DSN, and seeing errors such as: ERROR: character 0xc280 of encoding "UTF8" has no equivalent in "WIN1252".
Select statements on some tables work and others throw this error.
Fix: use an ODBC driver that supports Unicode. I am using the ODBC driver from the PostgreSQL Global Development Group. Go to Configure DSN/Manage DSN and select the Unicode driver.
Good luck.
By default Greenplum uses UTF8 for character encoding. You can check this by logging in to the Greenplum server and launching psql, the console client for Greenplum.
In this console application you can issue the command \l to list all of the databases configured in Greenplum; this also shows the character set for each database.
I think your problem is that your R session does not use a UTF-8 locale (you use a different locale).
But you could use on-the-fly transcoding in the ODBC driver. I'm not sure about all ODBC drivers, but DataDirect drivers support an extra option in the odbc.ini file (usually located in the user's home directory): IANAAppCodePage.
You can find the appropriate code for this parameter at this link:
http://www.iana.org/assignments/character-sets
Here is an example of odbc.ini content:
[ODBC]
Driver=/opt/odbc/lib/S0gplm60.so
IANAAppCodePage=2252
AlternateServers=
ApplicationUsingThreads=1
ConnectionReset=0
ConnectionRetryCount=0
ConnectionRetryDelay=3
Database=mysdb
EnableDescribeParam=1
ExtendedColumnMetadata=0
FailoverGranularity=0
FailoverMode=0
FailoverPreconnect=0
FetchRefCursor=1
FetchTSWTZasTimestamp=0
FetchTWFSasTime=0
HostName=192.168.1.100
InitializationString=
LoadBalanceTimeout=0
LoadBalancing=0
LoginTimeout=15
LogonID=
MaxPoolSize=100
MinPoolSize=0
Password=
Pooling=0
PortNumber=5432
QueryTimeout=0
ReportCodepageConversionErrors=0
TransactionErrorBehavior=1
XMLDescribeType=-10
I solved the problem by moving the R installation directory off drive C. Thanks Joris for the great suggestions! I think the R core team should also treat this as a bug and do something about the file-protection mechanism of Windows XP.
Dear Community:
While using the BIOMOD package in R, I always get the following problem:
Error in xzfile(file, "wb", compression = 9) : cannot open the connection
In addition: Warning message:
In xzfile(file, "wb", compression = 9) :
cannot initialize lzma encoder, error 5
It was said by the author of the package, and also in the help file for save, that the problem should be caused by a lack of permission to write. However, as I am logged in with an administrative account and have access to all operations, I have no idea what the problem is. Can anybody help me out? I really need to run the package now. Thanks in advance!
Sincerely,
Marco
Below is the relevant passage from the help file for save:
The most common reason for failure is lack of write permission in
the current directory. For 'save.image' and for saving at the end
of a session this will be shown by messages like
Error in gzfile(file, "wb") : unable to open connection
In addition: Warning message:
In gzfile(file, "wb") :
cannot open compressed file '.RDataTmp',
probable reason 'Permission denied'
The defaults were changed to use compressed saves for 'save' in
2.3.0 and for 'save.image' in 2.4.0. Any recent version of R can
read compressed save files, and a compressed file can be
uncompressed (by 'gzip -d') for use with very old versions of R.
Sorry for the omission of the information. Here is the sessionInfo():
> sessionInfo()
R version 2.12.2 (2011-02-25)
Platform: i386-pc-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=Chinese_People's Republic of China.936
[2] LC_CTYPE=Chinese_People's Republic of China.936
[3] LC_MONETARY=Chinese_People's Republic of China.936
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese_People's Republic of China.936
attached base packages:
[1] splines stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] BIOMOD_1.1-6.8 foreign_0.8-42 gam_1.04
[4] randomForest_4.6-2 mda_0.4-1 class_7.3-3
[7] gbm_1.6-3.1 lattice_0.19-17 MASS_7.3-11
[10] Design_2.3-0 Hmisc_3.8-3 survival_2.36-5
[13] rpart_3.1-48 nnet_7.3-1 ade4_1.4-16
[16] rgdal_0.6-33 dismo_0.5-19 rJava_0.9-0
[19] raster_1.7-47 sp_0.9-78
loaded via a namespace (and not attached):
[1] cluster_1.13.3 grid_2.12.2 tools_2.12.2
Now I have found that the problem comes from the lzma encoder when doing save():
> x<-runif(100)
> save(x, file = "F:/test.gzip", compress='gzip')
> save(x, file = "F:/test.xz", compress='xz')
Error in xzfile(file, "wb", compression = 9) : cannot open the connection
>
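Given that gzip works while xz fails on this machine, one pragmatic workaround until the underlying cause is found is to force gzip compression everywhere a save happens (a sketch):

```r
# Avoid the lzma (xz) encoder entirely by saving with gzip compression,
# which the transcript above shows does work on this system.
x <- runif(100)
save(x, file = "test.RData", compress = "gzip")
# save.image() accepts the same argument for end-of-session saves:
save.image(file = "session.RData", compress = "gzip")
```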
I had a similar issue when trying to project to a new scenario (a table containing columns corresponding to the predictor variables) after having run the modelling procedure using 8 models.
The first table (approx. 250,000 rows) ran fine, and I was able to save the results as a .csv file. However, the second one (approx. 380,000 rows) resulted in the above error message, and some of the files were not written to the project folder.
I have since cut all the tables down to a maximum of 260,000 rows and I no longer receive the error message. It was a bit of a pain doing it in multiple runs, but once I had written the script, I just used find and replace in MS Word to change it for each run.