All,
I'm trying to figure out how to put a .Rda file into Base64 encoding for it to be shipped to and from an API. I am really struggling with how to do this. Here's what I've got, but I think it's way off target:
cuse <- read.table("http://data.princeton.edu/wws509/datasets/cuse.dat", header=TRUE)
lrfit <- glm( cbind(using, notUsing) ~ age + education + wantsMore , family = binomial, data=cuse)
filename <- "C:/test.Rda"
save(lrfit, file=filename)
library("base64enc")
tst <- base64encode(filename)
save(tst, file="C:/encode.Rda")
base64decode(file="C:/encode.Rda", output = "C:/decode.Rda")
When I try to open the decode.Rda file, it throws a magic number error. Like I said, I think I'm way off base here, and any help would be appreciated. Thank you so much.
Here is a correct sequence of steps that should allow for correct encoding/decoding:
# sample data
dd <- iris
fn <- "test.rda"
fnb4 <- "test.rdab64"
# save rda
save(dd, file = fn)
# write base64 encoded version of the rda file
library(base64enc)
txt <- base64encode(fn)
ff <- file(fnb4, "wb")
writeBin(txt, ff)
close(ff)
# decode base64 encoded version back into a regular rda
base64decode(file = fnb4, output = "decode.rda")
(load("decode.rda"))
# [1] "dd"
The problem was your second save(). That was creating another RDA file with the base64 data encoded inside. It was not writing a base64-encoded version of the RDA file to disk.
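For completeness, here is a minimal sketch of that fix applied to the code from the question (the C:/ paths are the asker's; the encode.txt name is just illustrative). The base64 text is written out as-is instead of being wrapped in another save():
library(base64enc)
tst <- base64encode("C:/test.Rda")              # base64 text of the rda file
ff <- file("C:/encode.txt", "wb")
writeBin(charToRaw(tst), ff)                    # write the base64 text itself, not another .Rda
close(ff)
base64decode(file = "C:/encode.txt", output = "C:/decode.Rda")
(load("C:/decode.Rda"))
# [1] "lrfit"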
I have about 1000 csv files which contain Hebrew.
I'm trying to import them into R, but there is a problem reading Hebrew into the program.
When using this, I get around 80% of the files with correct Hebrew, but the other 20% do not work:
data_lst <- lapply(files_to_read, function(i) {
  read.csv(i, encoding = "UTF-8")
})
When using this, I get the other 20% right but the 80% that worked before does not work here:
data_lst <- lapply(files_to_read, function(i) {
  read.csv(i, encoding = 'utf-8-sig')
})
I'm unable to use read_csv from library(readr) and have to stay with the format of read.csv.
Thank you for your help!
It sounds like you have two different file encodings, utf-8 and utf-8-sig. The latter has a Byte Order Mark of 0xef, 0xbb, 0xbf at the start indicating the encoding.
I wrote the iris dataset to csv in both encodings - the only difference is the first line.
UTF-8:
sepal.length,sepal.width,petal.length,petal.width,species
UTF-8-SIG:
ï»¿sepal.length,sepal.width,petal.length,petal.width,species
(The leading "ï»¿" is how the BOM bytes 0xef, 0xbb, 0xbf render when displayed as latin1.)
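At the byte level you can verify this directly; a minimal sketch, where "some_file.csv" is a placeholder path:
first_bytes <- readBin("some_file.csv", what = "raw", n = 3L)
identical(first_bytes, as.raw(c(0xef, 0xbb, 0xbf)))  # TRUE if the file starts with a UTF-8 BOM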
In your case, it sounds like R is not detecting the encodings correctly, but encoding="utf-8" works for some files and encoding="utf-8-sig" works for the others. The natural course of action is to read in the first line of each file and check whether it starts with the BOM:
BOM_pattern <- "^\ufeff"
encodings <- vapply(
  files_to_read,
  \(file) {
    line <- readLines(file, n = 1L, encoding = "UTF-8")
    ifelse(grepl(BOM_pattern, line), "utf-8-sig", "utf-8")
  },
  character(1)
)
This will return a (named) character vector of c("utf-8", "utf-8-sig") as appropriate. You can then supply the encoding to read.csv:
data_lst <- Map(
  \(file, encoding) read.csv(file, encoding = encoding),
  files_to_read,
  encodings
)
This should read in each data frame with the correct encoding and store them in the list data_lst.
I have a large Stata file that I think has some French accented characters that have been saved poorly.
When I import the file with the encoding set to blank, it won't read in. When I set it to latin1 it will read in, but in one variable (and, I'm certain, in others) French accented characters are not rendered properly. I had a similar problem with another Stata file and tried to apply the fix (which actually did not work in that case, but seems on point) here.
To be honest, this seems to be the real problem here somehow. A lot of the garbled characters are "actual" and they match up to what is "expected", but I have no idea how to go back.
Reproducible code is here:
library(haven)
library(here)
library(tidyverse)
library(labelled)
#Download file
temp <- tempfile()
temp2 <- tempfile()
download.file("https://github.com/sjkiss/Occupation_Recode/raw/main/Data/CES-E-2019-online_F1.dta.zip", temp)
unzip(zipfile = temp, exdir = temp2)
ces19web <- read_dta(file.path(temp2, "CES-E-2019-online_F1.dta"), encoding="latin1")
#Try with encoding set to blank, it won't work.
#ces19web <- read_dta(file.path(temp2, "CES-E-2019-online_F1.dta"), encoding="")
unlink(c(temp, temp2))
#### Diagnostic section for accented characters ####
ces19web$cps19_prov_id
#Note value labels are cut-off at accented characters in Quebec.
#I know this occupation has messed up characters
ces19web %>%
  filter(str_detect(pes19_occ_text, "assembleur-m")) %>%
  select(cps19_ResponseId, pes19_occ_text)
#Check the encodings of the occupation titles and store in a variable encoding
ces19web$encoding<-Encoding(ces19web$pes19_occ_text)
#Check encoding of problematic characters
ces19web %>%
  filter(str_detect(pes19_occ_text, "assembleur-m")) %>%
  select(cps19_ResponseId, pes19_occ_text, encoding)
#Write out messy occupation titles
ces19web %>%
  filter(str_detect(pes19_occ_text, "Ã|©")) %>%
  select(cps19_ResponseId, pes19_occ_text, encoding) %>%
  write_csv(file = here("Data/messy.csv"))
#Try to fix
source("https://github.com/sjkiss/Occupation_Recode/raw/main/fix_encodings.R")
#store the messy variables in messy
messy<-ces19web$pes19_occ_text
library(stringi)
#Try to clean with the function fix_encodings
ces19web$pes19_occ_text_cleaned<-stri_replace_all_fixed(messy, names(fixes), fixes, vectorize_all = F)
#Examine
ces19web %>%
  filter(str_detect(pes19_occ_text_cleaned, "Ã|©")) %>%
  select(cps19_ResponseId, pes19_occ_text, pes19_occ_text_cleaned, encoding) %>%
  head()
Your data file is a dta version 113 file (the first byte in the file is 113). That is, it's a Stata 8 file and, more importantly, pre-Stata 14, hence using a custom encoding (Stata >= 14 uses UTF-8).
So using the encoding argument of read_dta seems right. But there are a few problems here, as can be seen with a hex editor.
First, the truncated labels at accented letters (like Québec → Qu) are actually not caused by haven: they are stored truncated in the dta file.
The pes19_occ_text is encoded in UTF-8, as you can check with:
ces19web <- read_dta("CES-E-2019-online_F1.dta", encoding="latin1")
grep("^Producteur", unique(ces19web$pes19_occ_text), value = T)
output: "Producteur télé"
This "é" is characteristic of UTF-8 data (here "é") read as latin1.
However, if you try to import with encoding="UTF-8", read_dta will fail: there might be other non-UTF-8 characters in the file that read_dta can't read as UTF-8. We have to do something after the import.
Here, read_dta is doing something nasty: it imports "Producteur télé" as if it were latin1 data and converts it to UTF-8, so the imported string really contains the UTF-8 characters "Ã" and "©".
To fix this, you have first to convert back to latin1. The string will still be "Producteur télé", but encoded in latin1.
Then, instead of converting, you have simply to force the encoding as UTF-8, without changing the data.
Here is the code:
ces19web <- read_dta("CES-E-2019-online_F1.dta", encoding="latin1")
ces19web$pes19_occ_text <- iconv(ces19web$pes19_occ_text, from = "UTF-8", to = "latin1")
Encoding(ces19web$pes19_occ_text) <- "UTF-8"
grep("^Producteur", unique(ces19web$pes19_occ_text), value = T)
output: "Producteur télé"
You can do the same on other variables with diacritics.
The use of iconv here may be more understandable if we convert to raw with charToRaw, to see the actual bytes. After importing the data, "télé" is the representation of "74 c3 83 c2 a9 6c c3 83 c2 a9" in UTF-8. The first byte 0x74 (in hex) is the letter "t", and 0x6c is the letter "l". In between, we have four bytes, instead of only two for the letter "é" in UTF-8 ("c3 a9", i.e. "é" when read as latin1).
Actually, "c3 83" is "Ã" and "c2 a9" is "©".
Therefore, we have first to convert these characters back to latin1, so that they take one byte each. Then "74 c3 a9 6c c3 a9" is the encoding of "télé", but this time in latin1. That is, the string has the same bytes as "télé" encoded in UTF-8, and we just need to tell R that the encoding is not latin1 but UTF-8 (and this is not a conversion).
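To make that byte-level round trip concrete, here is a small standalone sketch using a literal string in place of the imported data:
x <- "t\u00c3\u00a9l\u00c3\u00a9"   # "tÃ©lÃ©", i.e. what read_dta produced
charToRaw(x)                        # 74 c3 83 c2 a9 6c c3 83 c2 a9
y <- iconv(x, from = "UTF-8", to = "latin1")
charToRaw(y)                        # 74 c3 a9 6c c3 a9
Encoding(y) <- "UTF-8"              # no conversion, just relabel the bytes
y                                   # "télé"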
See also the help pages of Encoding and iconv.
Now a good question may be: how did you end up with such a bad dta file in the first place? It's quite surprising for a Stata 8 file to hold UTF-8 data.
The first idea that comes to mind is a bad use of the saveold command, that allows one to save data in a Stata file for an older version. But according to the reference manual, in Stata 14 saveold can only store files for Stata >=11.
Maybe a third party tool did this, as well as the bad truncation of labels? It might be SAS or SPSS, for instance. I don't know where your data come from, but it's not uncommon for public providers to use SAS for internal work and to publish converted datasets. For instance, datasets from the European Social Survey are provided in SAS, SPSS and Stata format, but if I remember correctly, initially it was only SAS and SPSS, and Stata came later: the Stata files are probably just converted using another tool.
Answer to the comment: how to loop over character variables to do the same? There is a smarter way with dplyr, but here is a simple loop with base R.
ces19web <- read_dta("CES-E-2019-online_F1.dta", encoding = "latin1")
for (n in names(ces19web)) {
  v <- ces19web[[n]]
  if (is.character(v)) {
    v <- iconv(v, from = "UTF-8", to = "latin1")
    Encoding(v) <- "UTF-8"
  }
  ces19web[[n]] <- v
}
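For reference, one possible version of the dplyr approach alluded to above (a sketch, assuming dplyr >= 1.0 for across()):
library(dplyr)
fix_utf8 <- function(v) {
  v <- iconv(v, from = "UTF-8", to = "latin1")
  Encoding(v) <- "UTF-8"
  v
}
ces19web <- ces19web %>%
  mutate(across(where(is.character), fix_utf8))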
It is a rather atypical scenario: I am using the R custom visual in PowerBI to plot a raster, and the only way to pass data is by using a dataframe.
This is what I have done so far:
generate a raster in R
save it to a file using saveRDS
encode the file as base64 and save it as a csv.
Now, using the code below, I manage to read the csv, load it into a dataframe and combine all the rows.
My question is: how do I decode it back to a raster object?
Here is a reproducible example:
# Input load. Please do not change #
`dataset` = read.csv('https://raw.githubusercontent.com/djouallah/keplergl/master/raster.csv', check.names = FALSE, encoding = "UTF-8", blank.lines.skip = FALSE);
# Original Script. Please update your script content here and once completed copy below section back to the original editing window #
library(caTools)
library(readr)
dataset$Value <- as.character(dataset$Value)
dataset <- dataset[order(dataset$Index),]
z <- paste(dataset$Value)
Raster <- base64decode(z,"raw")
It turned out the solution is very easy: saveRDS has an option to save with ascii = TRUE.
saveRDS(background, 'test.rds', ascii = TRUE, compress = FALSE)
Now I just read it back in as a human-readable format (which is easy to load into PowerBI) and it works:
library(raster)  # provides plotRGB()
fil <- 'https://raw.githubusercontent.com/djouallah/keplergl/master/test.rds'
cony <- gzcon(url(fil))
XXX <- readRDS(cony, refhook = NULL)
plotRGB(XXX)
The documentation for gzcon states:
Use a writable rawConnection to compress data into a variable.
Here's my code:
output <- raw(0)
z <- rawConnection(output, "wb")
zz <- gzcon(z, text = TRUE)
writeLines("TEST", zz)
close(zz)
So at this point, I'm not sure how I can retrieve the compressed data.
I would like to retrieve the value using rawConnectionValue but gzcon has messed up z.
Thanks in advance!
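One hedged workaround, if in-memory compression (rather than the gzcon/rawConnection route specifically) is acceptable, is base R's memCompress()/memDecompress(); a minimal sketch:
payload <- charToRaw("TEST\n")
compressed <- memCompress(payload, type = "gzip")   # raw vector holding the compressed data
restored <- memDecompress(compressed, type = "gzip")
identical(rawToChar(restored), "TEST\n")            # TRUE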
I am trying to download and extract a .csv file from a webpage using R.
This question is a duplicate of Using R to download zipped data file, extract, and import data.
I cannot get the solution to work, but it may be due to the web address I am using.
I am trying to download the .csv files from http://data.worldbank.org/country/united-kingdom (under the download data drop down)
Using @Dirk's solution from the link above, I tried
temp <- tempfile()
download.file("http://api.worldbank.org/v2/en/country/gbr?downloadformat=csv",temp)
con <- unz(temp, "gbr_Country_en_csv_v2.csv")
dat <- read.table(con, header=T, skip=2)
unlink(temp)
I got the extended link by looking at the page source code, which I expect is causing the problems, although it works if I paste it into the address bar.
The file downloads with the expected size:
download.file("http://api.worldbank.org/v2/en/country/gbr?downloadformat=csv",temp)
# trying URL 'http://api.worldbank.org/v2/en/country/gbr?downloadformat=csv'
# Content type 'application/zip' length 332358 bytes (324 Kb)
# opened URL
# downloaded 324 Kb
# also tried unzip but get this warning
con <- unzip(temp, "gbr_Country_en_csv_v2.csv")
# Warning message:
# In unzip(temp, "gbr_Country_en_csv_v2.csv") :
# requested file not found in the zip file
But these are the file names when I manually download them.
I'd appreciate some help with where I am going wrong, thanks.
I am using Windows 8, R version 3.1.0
In order to get your data to download and uncompress, you need to set mode="wb"
download.file("...",temp, mode="wb")
unzip(temp, "gbr_Country_en_csv_v2.csv")
dd <- read.table("gbr_Country_en_csv_v2.csv", sep=",",skip=2, header=T)
It looks like the default is "w", which assumes a text file. If it were a plain csv file this would be fine, but since it's compressed it's a binary file, hence the "wb". Without the "wb" part, you can't open the zip at all.
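Putting the pieces together, a sketch using the URL and file name from the question:
temp <- tempfile()
download.file("http://api.worldbank.org/v2/en/country/gbr?downloadformat=csv",
              temp, mode = "wb")
csv_file <- unzip(temp, "gbr_Country_en_csv_v2.csv")  # extracts into the working directory
dd <- read.csv(csv_file, skip = 2, header = TRUE)
unlink(temp)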
Almost everything is OK. In this case you only need to specify that it's a comma-separated file, e.g. using sep="," in read.table:
temp <- tempfile()
download.file("http://api.worldbank.org/v2/en/country/gbr?downloadformat=csv",
temp)
con <- unz(temp, "gbr_Country_en_csv_v2.csv")
dat <- read.table(con, header=T, skip=2, sep=",")
unlink(temp)
With this little change I can import your csv smoothly.
HTH, Luca
The World Bank Development Indicators can be obtained using the WDI package. For example,
library(WDI)
inds <- WDIsearch(field = "indicator")[, 1]
GB <- WDI("GB", indicator = inds)
See the WDIsearch and WDI functions and the reference manual for more info.
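For a single series, a hedged example (NY.GDP.MKTP.CD is the World Bank code for GDP in current US$; the date range is arbitrary):
library(WDI)
gdp_gb <- WDI(country = "GB", indicator = "NY.GDP.MKTP.CD", start = 1990, end = 2013)
head(gdp_gb)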