I am trying to convert from hex to base64, but the conversions I get from functions like base64Encode or base64_enc do not match the conversion I get from this site https://conv.darkbyte.ru/ or this site http://tomeko.net/online_tools/hex_to_base64.php?lang=en
library(RCurl)
library(jsonlite)
hex_number="9d0a5a7d6771dd7fa321a48a820f93627657df
3292548df1389533913a60328300a9cc80d982875a8d08bb7
602c59935cacae88ea635ed8d3cea9ef57b1884cc"
base64_enc(hex_number)
#"OWQwYTVhN2Q2NzcxZGQ3ZmEzMjFhNDhhODIwZjkzNjI3NjU3ZGYKMzI5M
#jU0OGRmMTM4OTUz\nMzkxM2E2MDMyODMwMGE5Y2M4MGQ5ODI4NzVhOGQwO
#GJiNwo2MDJjNTk5MzVjYWNhZTg4ZWE2\nMzVlZDhkM2NlYTllZjU3YjE4ODRjYw=="
base64Encode(hex_number)
#"OWQwYTVhN2Q2NzcxZGQ3ZmEzMjFhNDhhODIwZjkzNjI3NjU3ZGYKMzI5M
#jU0OGRmMTM4OTUzMzkxM2E2MDMyODMwMGE5Y2M4MGQ5ODI4NzVhOGQwOGJiNwo
#2MDJjNTk5MzVjYWNhZTg4ZWE2MzVlZDhkM2NlYTllZjU3YjE4ODRjYw=="
#desired result:
#nQpafWdx3X+jIaSKgg+TYnZX3zKSVI3xOJUzkTpgMoMAqcyA2YKHWo0Iu3YCxZk1ysrojqY17Y086p71exiEzA==
I have also tried converting the hex to text before encoding it, using the code on this page http://blog.entropic-data.com/2017/04/19/short-dealing-with-embedded-nul-in-string-manipulation-with-r/, but I didn't get the result I want.
Your current code base64-encodes the ASCII text of the hex string itself (embedded newlines included), rather than the bytes those hex digits represent. Borrow some code from the wkb package (or just install and use it directly) to convert the hex string into a raw vector before passing it to one of the base64 conversion routines:
hex_number <- "9d0a5a7d6771dd7fa321a48a820f93627657df3292548df1389533913a60328300a9cc80d982875a8d08bb7602c59935cacae88ea635ed8d3cea9ef57b1884cc"
I'm "source-ing" this but you should copy the code locally if you plan on using it as GH could go down or the code could change.
source_url("https://raw.githubusercontent.com/ianmcook/wkb/master/R/hex2raw.R",
sha1 = "4443c72fb3831e002359ad564f1f2a8ec5e45e0c")
openssl::base64_encode(hex2raw(hex_number))
## [1] "nQpafWdx3X+jIaSKgg+TYnZX3zKSVI3xOJUzkTpgMoMAqcyA2YKHWo0Iu3YCxZk1ysrojqY17Y086p71exiEzA=="
Or, if you are willing to have the wkb package as a dependency:
openssl::base64_encode(wkb::hex2raw(hex_number))
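If you'd rather avoid the dependency and the download entirely, the conversion is small enough to do in base R. A minimal sketch (hex_to_raw is just an illustrative name): split the hex string into two-character pairs, parse each pair as a byte, and build a raw vector:
hex_to_raw <- function(hex) {
  hex <- gsub("[^0-9a-fA-F]", "", hex)  # drop whitespace/newlines, keep hex digits
  starts <- seq(1, nchar(hex), by = 2)
  as.raw(strtoi(substring(hex, starts, starts + 1), base = 16L))
}
openssl::base64_encode(hex_to_raw(hex_number))
## [1] "nQpafWdx3X+jIaSKgg+TYnZX3zKSVI3xOJUzkTpgMoMAqcyA2YKHWo0Iu3YCxZk1ysrojqY17Y086p71exiEzA=="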
My object in R contains the following Unicode text, which was extracted from Twitter.
\xe0\xae\xa8\xe0\xae\x9f\xe0\xae\xbf\xe0\xae\x95\xe0\xae\xb0\xe0\xaf\x8d
\xe0\xae\x9a\xe0\xaf\x82\xe0\xae\xb0\xe0\xaf\x8d\xe0\xae\xaf\xe0\xae\xbe
\xe0\xae\x9a\xe0\xaf\x86\xe0\xae\xaf\xe0\xaf\x8d\xe0\xae\xa4
\xe0\xae\x89\xe0\xae\xa4\xe0\xae\xb5\xe0\xae\xbf
\xe0\xae\xae\xe0\xae\xbf\xe0\xae\x95
\xe0\xae\xae\xe0\xaf\x81\xe0\xae\x95\xe0\xaf\x8d\xe0\xae\x95\xe0\xae\xbf\xe0\xae\xaf\xe0\xae\xae\xe0\xae\xbe\xe0\xae\xa9\xe0\xae\xa4\xe0\xaf\x81!'
- \xe0\xae\x9f\xe0\xaf\x86\xe0\xae\xb2\xe0\xaf\x8d\xe0\xae\x9f\xe0\xae\xbe\xe0\xae\xb5\xe0\xae\xbf\xe0\xae\xb2\xe0\xaf\x8d
\xe0\xae\xa8\xe0\xaf\x86\xe0\xae\x95\xe0\xae\xbf\xe0\xae\xb4\xe0\xaf\x8d\xe0\xae\xa8\xe0\xaf\x8d\xe0\xae\xa4
\xe0\xae\x9a\xe0\xaf\x80\xe0\xae\xae\xe0\xae\xbe\xe0\xae\xa9\xe0\xaf\x8d
I need to convert them to human readable strings. If I just put this in a string, e.g.
x <- "\xe0\xae\xa8\xe0\xae\x9f\xe0\xae\xbf\xe0\xae\x95\xe0\xae\xb0\xe0\xaf\x8d \xe0\xae\x9a\xe0\xaf\x82\xe0\xae\xb0\xe0\xaf\x8d\xe0\xae\xaf\xe0\xae\xbe \xe0\xae\x9a\xe0\xaf\x86\xe0\xae\xaf\xe0\xaf\x8d\xe0\xae\xa4 \xe0\xae\x89\xe0\xae\xa4\xe0\xae\xb5\xe0\xae\xbf \xe0\xae\xae\xe0\xae\xbf\xe0\xae\x95 \xe0\xae\xae\xe0\xaf\x81\xe0\xae\x95\xe0\xaf\x8d\xe0\xae\x95\xe0\xae\xbf\xe0\xae\xaf\xe0\xae\xae\xe0\xae\xbe\xe0\xae\xa9\xe0\xae\xa4\xe0\xaf\x81!' - \xe0\xae\x9f\xe0\xaf\x86\xe0\xae\xb2\xe0\xaf\x8d\xe0\xae\x9f\xe0\xae\xbe\xe0\xae\xb5\xe0\xae\xbf\xe0\xae\xb2\xe0\xaf\x8d \xe0\xae\xa8\xe0\xaf\x86\xe0\xae\x95\xe0\xae\xbf\xe0\xae\xb4\xe0\xaf\x8d\xe0\xae\xa8\xe0\xaf\x8d\xe0\xae\xa4 \xe0\xae\x9a\xe0\xaf\x80\xe0\xae\xae\xe0\xae\xbe\xe0\xae\xa9\xe0\xaf\x8d"
it displays as an unreadable mess. How can I get it to display using the actual characters?
When you assign the hex codes like \xe0\xae\xa8\xe0... to a string, R doesn't know how they are intended to be interpreted, so it assumes the encoding for the current locale on your computer. On most modern Unix-based systems these days, that would be UTF-8, so for example on a Mac your string displays as
> x
[1] "நடிகர் சூர்யா செய்த உதவி மிக முக்கியமானது!' - டெல்டாவில் நெகிழ்ந்த சீமான்"
which I assume is the correct display. Google Translate recognizes it as being written in Tamil.
However, on Windows it displays unreadably. On my Windows 10 system, I see
> x
[1] "நடிகர௠சூரà¯à®¯à®¾ செயà¯à®¤ உதவி மிக à®®à¯à®•à¯à®•à®¿à®¯à®®à®¾à®©à®¤à¯!' - டெலà¯à®Ÿ
because it uses the code page corresponding to the Latin1 encoding, which is wrong for that string. To get it to display properly on Windows, you need to tell R that it is encoded in UTF-8 by declaring its encoding:
Encoding(x) <- "UTF-8"
Then it will display properly in Windows as well, which solves your problem.
For others trying to do this, it's important to know that there are only a few values that work this way. You can declare the encoding to be "UTF-8", "latin1", "bytes" or "unknown". "unknown" means the local encoding on the machine, "bytes" means it shouldn't be interpreted as characters at all. If your string has a different encoding, you need to use a different approach: convert to one of the encodings that R knows about.
For example, the string
x <- "\xb4\xde\xd1\xe0\xde\xd5 \xe3\xe2\xe0\xde"
is Russian encoded in ISO 8859-5. On a system where that was the local encoding it would display properly, but on mine it displays using the hex codes. To get it to display properly I need to convert it to UTF-8 using
y <- iconv(x, from="ISO8859-5", to="UTF-8")
Then it will display properly as [1] "Доброе утро". You can see the full list of encodings that iconv() knows about using iconvlist().
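Putting the declare-versus-convert distinction together as a runnable sketch (output shown as on a UTF-8 system):
x <- "\xb4\xde\xd1\xe0\xde\xd5 \xe3\xe2\xe0\xde"  # Russian text as ISO 8859-5 bytes
Encoding(x)                                       # "unknown" -- R hasn't been told
y <- iconv(x, from = "ISO8859-5", to = "UTF-8")   # convert the actual bytes
y                                                 # [1] "Доброе утро"
Encoding(y)                                       # "UTF-8" -- iconv() marks the result
# Note: Encoding(x) <- "UTF-8" would be wrong here. A declaration only changes
# the label, not the bytes, and these bytes are ISO 8859-5.
head(iconvlist())                                 # some of the encodings iconv() knows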
I am using OpenCPU and R to create a web API that takes in some inputs and returns a topoJSON file from a database, as well as some other information. OpenCPU automatically pushes the output through toJSON, which results in JSON output that has quoted JSON in it (i.e., the topoJSON). This is obviously not ideal, especially since it then gets incredibly cluttered with backslash-escaped quotes (\"). I tried using fromJSON to convert it to an R object, which could then be converted back (which is incredibly inefficient), but it returns slightly different syntax and the result is that it doesn't work.
I feel like there should be some way to convert the string to some other type of object that results in toJSON calling a different handler that tells it to just leave it alone, but I can't figure out how to do that.
> s <- '{"type":"Topology","objects":{"map": "0"}}'
> fromJSON(s)
$type
[1] "Topology"
$objects
$objects$map
[1] "0"
> toJSON(fromJSON(s))
{"type":["Topology"],"objects":{"map":["0"]}}
That's just the beginning of the file (I replaced the actual map with "0"), and as you can see, brackets appeared around "Topology" and "0". Alternately, if I just keep it as a string, I end up with this mess:
> toJSON(s)
["{\"type\":\"Topology\",\"objects\":{\"0000595ab81ec4f34__csv\": \"0\"}}"]
Is there any way to fix this so that I just get the verbatim string but without quotes and backticks?
EDIT: Note that because I'm using OpenCPU, the output needs to come from toJSON (so no other function can be used, unfortunately), and I can't do any post-processing.
To me it seems you just want the values rather than vectors. Set auto_unbox = TRUE to turn length-one vectors into scalar values:
toJSON(fromJSON(s), auto_unbox = TRUE)
# {"type":"Topology","objects":{"map":"0"}}
That does print without escaping for me (using jsonlite_1.5). Maybe you are using an older version of jsonlite. You can also get around that by using cat() to print the result. You won't see the slashes when you do that.
cat(toJSON(fromJSON(s), auto_unbox = TRUE))
You can manually unbox the relevant entries:
library(jsonlite)
s <- '{"type":"Topology","objects":{"map": "0"}}'
j <- fromJSON(s)
j$type <- unbox(j$type)
j$objects$map <- unbox(j$objects$map)
toJSON(j)
# {"type":"Topology","objects":{"map":"0"}}
I am trying to analyze bill texts from LegisScan, but am running into problems decoding the text from the API pull response. It turns out LegisScan encodes the full text of all legislation in base64 when pulled through its API, and I am having trouble decoding it.
This downloaded JSON request is an example of the full-text portion of the JSON result that I downloaded through the API. However, the usual methods do not seem to work on it.
What I have tried:
Legiscan does not seem to support R directly, so I used the package LegiscanR. I used LegiscanR's BillText function to get the correct JSON link, then used parseBillText to try to decode the text from the link into UTF-8. However, it throws up a fromJSON error even with the correct API key and document id stated in the link:
Error in fromJSON(content, handler, default.size, depth, allowComments, :
object 'Strict' not found
Using the base64decode (base64enc package) or base64Decode (RCurl package) function to convert the text from base 64 to raw, and then using the rawToChar function to convert it into characters.
My code:
text <- base64decode("https://www.dropbox.com/s/5ozd0a1zsb6y9pi/Legiscan_fulltext.txt?dl=0")
rawToChar(text)
Nul <- text == as.raw(00)
text[Nul] <- as.raw(20)
text2 <- rawToChar(text)
However, trying to use rawToChar alone gives me an "embedded nul in string" error:
Error in rawToChar(test2) :
embedded nul in string: '%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n<>>>\r\nendobj\r\n2 0 obj\r\n<>\r\nendobj\r\n3 0 obj\r\n<>/ExtGState<>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI] >>/MediaBox[ 0 0 612 792] /Contents 4 0 R/Group<>/Tabs/S/StructParents 0>>\r\nendobj\r\n4 0 obj\r\n<>\r\nstream\r\nx\x9c\xb5ZYs\xdb8\022~w\x95\xff\003*O\u0516M\021ཛJ\x95\xe3ę̵\x99\xb1\xa7f\xb7\x92y\xa0$\xca\xe2\x86"\025\036\xf6\xe6\xdfow\003\x94\bR0sh\x93*\x99G\xa3\001|\xdd\xfdu7\xa4\xf9U\xd5d\xebdٰ\xe7\xcf\xe7WM\x93,7銽\x9f\u07d5\xbb\xbf\xe6w\x9fw\xe9\xfc]r\x9f\025I\x93\x95\xc5\xfc\xb6]4\xf8\xe8\x874Y\xa5Ջ\027\xec\xe5\xabk\xf6\xf2\xee\xfcl~\xc3Yl\xc7\
Substituting spaces for these nuls allows rawToChar to run, but the output is gibberish, or in some other encoding that is not the expected English text:
[1] "\x86\xdbi\xb3\xff\xf0\xc3\ak\xa2\x96\xe8\xc5\xca&\xfe\xcf\xf9\xa37tk\\xeco\xac\xbd\xa6/\xcbz\b\xacq\xa9\u07faYm{\033m\xc6\xd7e"
Any other ideas on what else to try? Thanks.
I have been dealing with the same problem in Python, where the following code worked:
import base64

# Decode the base64 payload from the API response
raw = base64.b64decode(bill_text['doc'])

# Write the decoded bytes to disk as a PDF
pdf_result = open(output_file, "wb").write(raw)
I think in your case you are trying to convert the document straight to text, but that may not be so easy: the decoded bytes are a PDF. In Python I handled it by saving the decoded bytes to a PDF file and then parsing that file with functions from the PyPDF2 library.
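For completeness, the same decode-to-PDF approach is possible in R (a sketch; it assumes doc holds the base64 string taken from the API response, and that the decoded bytes are a PDF rather than plain text):
library(base64enc)
raw_bytes <- base64decode(doc)   # decode the base64 payload itself, not a URL to it
writeBin(raw_bytes, "bill.pdf")  # save the decoded bytes as a PDF file
# The text can then be extracted from the saved PDF, e.g. with the pdftools package:
# pdftools::pdf_text("bill.pdf")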
I want to read a SAS data file (sas7bdat format) into R. I've tried using the sas7bdat package, but ended up getting an error.
CODE:
x <- read.sas7bdat("C:\Users\petas\Desktop\airline.sas7bdat")
ERROR:
'\U' used without hex digits in character string starting ""C:\U"
Can someone help me with this? Thanks in advance.
Here's an example using the haven library:
install.packages("haven")
library(haven)
url <- "C:\\Users\\petas\\Desktop\\airline.sas7bdat"
x <- read_sas(url)
If you use Windows, you need to use "\\" instead of "\", or the Unix/Linux-style forward slash "/". Forward slashes are easiest, since the path will then be compatible with any OS. In your case, the error '\U' used without hex digits in character string starting ""C:\U" is due to the use of single backslashes instead of double backslashes.
Hope it helps.
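If you build the path in code rather than pasting it, file.path() sidesteps the separator question entirely (a sketch using the path from the question):
p <- file.path("C:", "Users", "petas", "Desktop", "airline.sas7bdat")
p  # "C:/Users/petas/Desktop/airline.sas7bdat" -- forward slashes, works everywhere
x <- haven::read_sas(p)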
Try using forward slashes:
x <- read.sas7bdat("C:/Users/petas/Desktop/airline.sas7bdat")
I've got an R-function along these lines:
swedish.weekday <- function(date = Sys.Date()) {
require(lubridate)
c("Sön", "Mån", "Tis", "Ons", "Tor", "Fre", "Lör")[wday(date)]
}
This returns the three letter equivalent of Sun, Mon, Tue etc.
Works absolutely fine until I include this function in a package, where during the build the function is transformed into:
swedish.weekday <- function(date = Sys.Date()) {
require(lubridate)
c("Sön", "Mån", "Tis", "Ons", "Tor", "Fre", "Lör")[wday(date)]
}
I've tried setting the encoding options in the project settings to either ISO8859-1 or WINDOWS-1252 but neither works. Using 64 bit R 3.1.2 under Windows 7.
Suspect I'd need to change something in the build config but I'm lost as to what - any help/direction much appreciated!
As per the link posted in the comments above, I solved the issue by simply using Unicode escapes, like so:
day <- c("S\u00F6n", "M\u00E5n", "Tis", "Ons", "Tor", "Fre", "L\u00F6r")[wday(date)]
Edit: While passing these results to an external system (OLAP), I discovered it is also necessary to convert the encoding of the results to ISO ("latin-9"), to ensure they are correct not only on screen but also as far as that system is concerned: day <- iconv(day, "UTF-8", "latin-9")
For ref...
There is a portable way to have arbitrary text in character strings (only) in your R code, which is to supply them in Unicode as \uxxxx escapes. If there are any characters not in the current encoding the parser will encode the character string as UTF-8 and mark it as such. This applies also to character strings in datasets: they can be prepared using \uxxxx escapes or encoded in UTF-8 in a UTF-8 locale, or even converted to UTF-8 via ‘iconv()’. If you do this, make sure you have ‘R (>= 2.10)’ (or later) in the ‘Depends’ field of the DESCRIPTION file.
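A quick illustration of that advice (the exact Encoding() mark can depend on your locale, so treat the output as indicative):
x <- "S\u00F6n"  # "ö" supplied as a Unicode escape
x                # [1] "Sön"
Encoding(x)      # typically "UTF-8" -- the parser marked the string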