I have some text that is base64-encoded and want to decode it in R. The function I'm using is base64decode from the base64enc package. The problem I have is that the result is not human readable. How do I make it work?
E.g., this is what I get from a text string that was encoded from "exampleEncodedText":
base64enc::base64decode("ZXhhbXBsZUVuY29kZWRUZXh0")
[1] 65 78 61 6d 70 6c 65 45 6e 63 6f 64 65 64 54 65 78 74
For reference, I encoded it on https://www.base64decode.org/
?base64decode says that this function decodes a base64-encoded string into binary data. So, using rawToChar gives a human-readable character string:
rawToChar(base64decode("ZXhhbXBsZUVuY29kZWRUZXh0"))
[1] "exampleEncodedText"
I'm working on writing a pure JS Thrift decoder that doesn't depend on Thrift definitions. I have been following this handy guide, which has been my bible for the past few days: https://erikvanoosten.github.io/thrift-missing-specification/
I almost have my parser working, but there is a string type that throws a wrench into the program, and I don't quite understand what it's doing. Here is an excerpt of the hexdump, which I did my best to annotate:
Correctly parsing:
000001a0 0a 32 30 32 31 2d 31 31 2d 32 34 16 02 00 18 07 |.2021-11-24.....|
........................blah blah blah............| | |
Object End-| | |
0x18 & 0xF = 0x8 = Binary-| |
The binary sequence is 0x7 characters long-|
000001b0 53 65 61 74 74 6c 65 18 02 55 53 18 02 55 53 18 |Seattle..US..US.|
S E A T T L E |___| U S |___| U S
Another string, 2 bytes long |------------|
So far so good.
But then I get to this point:
The string I am trying to extract is "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4592.0 Safari/537.36 Edg/94.0.975.1" and it is 134 bytes long.
000001c0 09 54 61 68 6f 65 2c 20 43 41 12 12 00 00 08 c8 |.Tahoe, CA......|
Object ends here-| | |
0x8 & 0xF = 0x8 = Binary -| |
0xc8 bytes long (200)-|
000001d0 01 86 01 4d 6f 7a 69 6c 6c 61 2f 35 2e 30 20 28 |...Mozilla/5.0 (|
| | | M o z i l l a
???? |--|-134, encoded as var-int
000001e0 4d 61 63 69 6e 74 6f 73 68 3b 20 49 6e 74 65 6c |Macintosh; Intel|
As you can see, I have a byte sequence 0x08 0xC8 0x01 0x86 0x01 which contains the length of the string I'm looking for and is followed by the string itself, but has 3 extra bytes whose purpose is unclear.
The 0x01 is especially confusing, as it is neither a type identifier nor seems to have a concrete value.
What am I missing?
Thrift supports pluggable serialization schemes. In tree you have binary, compact and JSON; out of tree anything goes. From the looks of it you are trying to decode the compact protocol, so I'll answer accordingly.
Everything sent and everything returned in a Thrift RPC call is packaged in a struct. Every field in a struct has a 1-byte type and a 2-byte field ID prefix. In the compact protocol, field IDs are, when possible, delta encoded into the type byte, and all ints are compressed down to just the bits needed to store them (plus some flags). Because ints can now take up varying numbers of bytes, we need to know when they end. The compact protocol encodes the int bits in the low 7 bits of each byte and sets the high-order bit to 1 if the next byte continues the int. If the high-order bit is 0, the int is complete. Thus the int 5 (binary 101) would be encoded in one byte as 00000101; compact knows this is the end of the int because the high-order bit is 0.
In your case, the int 134 (binary 10000110) needs 2 bytes to encode because it is more than 7 bits. The first 7 bits (0000110) are stored in byte 1 with the 0x80 bit set to flag "the int continues", giving 0x86. The second and final byte encodes the remaining bit (00000001). So 0x86 is not a length by itself; the "stray" 0x01 is the final bit of the 134.
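To make the rule concrete, here is a minimal var-int decoder sketch (written in R for illustration since the rest of this page uses R; the logic ports directly to JS, and decode_varint is just an illustrative name):

# Decode an unsigned var int: each byte contributes its low 7 bits,
# least-significant group first; the 0x80 bit means "another byte follows".
decode_varint <- function(bytes) {
  value <- 0L
  shift <- 0L
  for (b in as.integer(bytes)) {
    value <- bitwOr(value, bitwShiftL(bitwAnd(b, 0x7FL), shift))
    if (bitwAnd(b, 0x80L) == 0L) break   # high bit clear: int is complete
    shift <- shift + 7L
  }
  value
}
decode_varint(c(0x86, 0x01))  # 134, the string length from the dump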
I'd recommend you use the in tree source to do any needed protocol encoding/decoding. It's already written and tested: https://github.com/apache/thrift/blob/master/lib/nodejs/lib/thrift/compact_protocol.js
The byte sequence reads as follows:
0x08: String type; the high nibble is 0, so no field-id delta is encoded and the elementId follows explicitly
0xC8 0x01: ElementId, encoded as a zigzag var int
0x86 0x01: String length, encoded as a var int
It turns out that if the type byte does not contain bits defining the elementId, the elementId is stored in the following bytes as a var int.
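Putting the two answers together, a sketch of how those five bytes break down (again R for illustration, reusing decode_varint from above, and assuming the compact protocol's zigzag encoding for field ids):

hdr   <- c(0x08, 0xC8, 0x01, 0x86, 0x01)
type  <- bitwAnd(as.integer(hdr[1]), 0x0FL)    # 8 = binary/string
delta <- bitwShiftR(as.integer(hdr[1]), 4L)    # 0 = no delta, field id follows
fid   <- decode_varint(hdr[2:3])               # var int -> 200
fid   <- bitwXor(bitwShiftR(fid, 1L), -bitwAnd(fid, 1L))  # zigzag -> field id 100
len   <- decode_varint(hdr[4:5])               # 134, the string length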
I am using Outlook under Windows 10 as my email client and am trying to use the RDCOMClient library to process some emails. Some of the emails are in Russian and I am having trouble getting the Russian part out in a usable format. Right now, I am just focusing on the subject lines. When I extract a subject and print it out, I just get question marks, except for a few Latin characters. I have tried setting the encoding and using iconv, but with no success. But iconv did provide a useful clue. Based on my reproducible example below, showing the raw characters gives:
iconv(SUBJECT, toRaw=T)
[1] 53 74 61 63 6b 4f 76 65 72 66 6c 6f 77 54 65 73 74 4d 65 73 73 61 67 65 3a
[26] 20 3f 3f 3f 3f 3f 3f 3f 3f 20 3f 3f 3f 3f 3f 3f 3f 3f 3f
All of the 3f's at the end? That is the ASCII code for a question mark. RDCOMClient is actually returning the ??? from Outlook; it is not some encoding issue inside R.
I have looked at many RDCOMClient posts on SO, but do not see anything that deals with this problem.
Is the RDCOMClient<->Outlook connection just broken? Or is there some way around this?
Attempt at Reproducible Example
Since we are talking about accessing email, I don't see how to make a really easy reproducible example, but here is a reproducible way to test this. Of course, you have to have Outlook on Windows for this to make sense.
Send yourself an email with the subject line:
StackOverflowTestMessage: Тестовое сообщение
R code
We need to find the email first. Most of the code does that.
Then we inspect the subject.
## Connect to Outlook
library(RDCOMClient)
OutApp <- COMCreate("Outlook.Application")
outlookNameSpace <- OutApp$GetNameSpace("MAPI")
## Find the Inbox
INBOX <- outlookNameSpace$GetDefaultFolder(6)  ## 6 = olFolderInbox
INBOX$Name() ## Confirm
emails <- INBOX$Items
## Find the relevant email
NumEmail <- emails()$Count()
MessageNumber <- 0
for (i in NumEmail:1) {
  SUBJ <- emails(i)$Subject()
  if (grepl("StackOverflowTestMessage", SUBJ)) {
    MessageNumber <- i
    break
  }
}
## Now try to get the subject line
SUBJECT <- emails(MessageNumber)$Subject()
Encoding(SUBJECT) <- 'UTF-8'
SUBJECT
[1] "StackOverflowTestMessage: ???????? ?????????"
iconv(SUBJECT, toRaw=T)
[[1]]
[1] 53 74 61 63 6b 4f 76 65 72 66 6c 6f 77 54 65 73 74 4d 65 73 73 61 67 65 3a
[26] 20 3f 3f 3f 3f 3f 3f 3f 3f 20 3f 3f 3f 3f 3f 3f 3f 3f 3f
tl;dr "What would the bytes 0x33 0x39 0x0d 0x0a between the end of HTTP headers and the start of HTTP response body refer to?"
I'm using the thoroughly excellent libcurl to make HTTP requests to various 3rd-party endpoints. These endpoints are not under my control and are required to implement a specification. To help debug and develop these endpoints, I have implemented the text output you would see when making a curl request from the command line with the -v flag, using curl.setopt(pycurl.VERBOSE, 1) and curl.setopt(pycurl.DEBUGFUNCTION, debug_function).
This has been working great, but recently I've come across a response which my debug function does not handle the same way curl's debug output does. I'm sure this is due to me not understanding the HTTP spec.
If I make a curl request from the command line with --verbose, I get the following:
# redacted headers
< Via: 1.1 vegur
<
{"code":"InvalidCredentials","message":"Bad credentials"}*
Connection #0 to host redacted left intact
If I make the same request with --trace, the following is returned:
0000: 56 69 61 3a 20 31 2e 31 20 76 65 67 75 72 0d 0a Via: 1.1 vegur..
<= Recv header, 2 bytes (0x2)
0000: 0d 0a ..
<= Recv data, 1 bytes (0x1)
0000: 33 3
<= Recv data, 62 bytes (0x3e)
0000: 39 0d 0a 7b 22 63 6f 64 65 22 3a 22 49 6e 76 61 9..{"code":"Inva
0010: 6c 69 64 43 72 65 64 65 6e 74 69 61 6c 73 22 2c lidCredentials",
0020: 22 6d 65 73 73 61 67 65 22 3a 22 42 61 64 20 63 "message":"Bad c
0030: 72 65 64 65 6e 74 69 61 6c 73 22 7d 0d 0a redentials"}..
<= Recv data, 1 bytes (0x1)
0000: 30 0
<= Recv data, 4 bytes (0x4)
0000: 0d 0a 0d 0a ....
== Info: Connection #0 to host redacted left intact
None of the HTTP client libs I've tested include these bytes in the response body, so I'm guessing they are part of the HTTP spec that I don't know about, but I can't find a reference to them and I don't know how to handle them.
If it's helpful, I think curl builds the output in the first example using https://github.com/curl/curl/blob/master/src/tool_cb_dbg.c, but I'm not really a C/C++ programmer and I haven't been able to reverse engineer the logic.
Does anyone know what these bytes are?
0d 0a are the ASCII control characters carriage return and line feed, respectively. CRLF is used in HTTP to mark the end of a header field (there are some historic exceptions you should not worry about at this point). A double CRLF marks the end of the header section of a message.
The 33 39 you observe is "39" in ASCII. This is a chunk size indicator, read as a hexadecimal number: 0x39 = 57, which is exactly the byte count of the JSON body that follows, and the closing 30 0d 0a 0d 0a is the terminating zero-length chunk. The presence of Transfer-Encoding: chunked in the response headers supports this.
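To make the framing concrete, here is a minimal de-chunking sketch (written in R for illustration, although the asker's code is Python; dechunk is just an illustrative name), assuming the entire well-formed chunked payload is already in one string:

dechunk <- function(txt) {
  out <- ""
  pos <- 1
  repeat {
    # each chunk starts with a hexadecimal size on its own CRLF-terminated line
    crlf <- regexpr("\r\n", substring(txt, pos), fixed = TRUE) + pos - 1
    size <- strtoi(substring(txt, pos, crlf - 1), base = 16L)
    if (size == 0) break                       # "0\r\n\r\n" terminates the body
    out <- paste0(out, substring(txt, crlf + 2, crlf + 1 + size))
    pos <- crlf + 2 + size + 2                 # skip chunk data plus trailing CRLF
  }
  out
}
dechunk("39\r\n{\"code\":\"InvalidCredentials\",\"message\":\"Bad credentials\"}\r\n0\r\n\r\n")
# [1] "{\"code\":\"InvalidCredentials\",\"message\":\"Bad credentials\"}"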
I installed nginx (nginx version: nginx/1.7.9) via MacPorts on my Mac running the latest OS X.
I configured a URI to use SCGI:
location /server {
include /Users/ruipacheco/Projects/Assorted/nginx/conf/scgi_params;
scgi_pass unix:/var/tmp/rpc.sock;
#scgi_pass 127.0.0.1:9000;
}
And when I do a GET request on 127.0.0.1/server, I see the following on my SCGI server:
633:CONTENT_LENGTH0REQUEST_METHODGETREQUEST_URI/serverQUERY_STRINGCONTENT_TYPEDOCUMENT_URI/serverDOCUMENT_ROOT/opt/local/htmlSCGI1SERVER_PROTOCOLHTTP/1.1REMOTE_ADDR127.0.0.1REMOTE_PORT62088SERVER_PORT80SERVER_NAMElocalhostHTTP_HOST127.0.0.1HTTP_CONNECTIONkeep-aliveHTTP_CACHE_CONTROLmax-age=0HTTP_ACCEPTtext/html,application/xhtml+xml,application/xml;q=0.9,image/webp,/;q=0.8HTTP_USER_AGENTMozilla/5.0
(Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like
Gecko) Chrome/40.0.2214.115
Safari/537.36HTTP_DNT1HTTP_ACCEPT_ENCODINGgzip, deflate,
sdchHTTP_ACCEPT_LANGUAGEen-US,en;q=0.8,End of file
The problem is that the length of the netstring, 633, does not match the content. If I understand the netstring spec correctly, 633 should be the number of bytes between the first : and the last ,:
Any string of 8-bit bytes may be encoded as [len]":"[string]",". Here [string] is the string and [len] is a nonempty sequence of ASCII digits giving the length of [string] in decimal. The ASCII digits are <30> for 0, <31> for 1, and so on up through <39> for 9. Extra zeros at the front of [len] are prohibited: [len] begins with <30> exactly when [string] is empty.
For example, the string hello world! is encoded as 31 32 3a 68 65 6c 6c 6f 20 77 6f 72 6c 64 21 2c, i.e., 12:hello world!,.
So, I'm getting the wrong length. How can this be explained?
As far as I can tell, your example response has the correct length.
According to the example here:
http://en.wikipedia.org/wiki/Simple_Common_Gateway_Interface
field values are preceded and followed by a <00> byte (the ASCII character with hex code 00), e.g.:
REQUEST_METHOD <00>GET<00>
Once I added the missing <00>'s back into your response snippet, it quickly got back to 633 bytes, as advertised.
I suppose that somewhere in the process of passing that response to us here, some piece of software stripped the <00>'s, which is totally normal behaviour.
Anyway, the answer seems to be: your nginx is returning a correct length, and the <00>'s are being stripped from your response somewhere along the way.
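A quick way to sanity-check this is to rebuild a few of the header pairs from the dump with their NUL terminators and count bytes; a minimal sketch in R (pair and nul are just illustrative names):

nul <- as.raw(0)
pair <- function(name, value) c(charToRaw(name), nul, charToRaw(value), nul)
## each SCGI header is NAME <00> VALUE <00>; the netstring length counts all of it
block <- c(pair("CONTENT_LENGTH", "0"),
           pair("REQUEST_METHOD", "GET"),
           pair("REQUEST_URI", "/server"),
           pair("SCGI", "1"))
length(block)                    ## the byte count that belongs before the ':'
netstring <- c(charToRaw(paste0(length(block), ":")), block, charToRaw(","))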
Well,
the hexadecimal <31 32 3a 68 65 6c 6c 6f 20 77 6f 72 6c 64 21>
in ASCII is "12:hello world!" (no quotes), and the length is 12 ("hello world!").
And this one from the example, <31 32 3a 68 65 6c 6c 6f 20 77 6f 72 6c 64 21 2c>, is wrong (at least it doesn't match the nginx norm), since the internal length is 13 while the length specified in hex is 12:
the ASCII "12:hello world!," should be "13:hello world!,", in hex <31 33 3a 68 65 6c 6c 6f 20 77 6f 72 6c 64 21 2c>.
This line is the mess:
For example, the string "hello world!" is encoded as <31 32 3a 68 65 6c 6c 6f 20 77 6f 72 6c 64 21 2c>, i.e., "12:hello world!,".
OK) 12:hello world! ---> <31 *32* 3a 68 65 6c 6c 6f 20 77 6f 72 6c 64 21>
KO) 12:hello world!, ---> <31 *32* 3a 68 65 6c 6c 6f 20 77 6f 72 6c 64 21 2c>
OK) 13:hello world!, ---> <31 *33* 3a 68 65 6c 6c 6f 20 77 6f 72 6c 64 21 2c>
The hex between the asterisks is the second digit of the length.
So your reading of the spec is OK; it is the example that is bad.
I'm trying to read in a complicated data file that contains floating point values. Some C code has been supplied that handles this format (Met Office PP file), and it does a lot of bit twiddling and swapping. And it doesn't work: it gets a lot right, like the size of the data, but the numerical values in the returned matrix are nonsensical, with NaNs and values like 1e38 and -1e38 liberally sprinkled around.
However, I have a binary exe ("convsh") that can convert these to netCDF, and the netCDFs look fine - nice swirly maps of wind speed.
What I'm thinking is that the bytes of the PP file are being read in the wrong order. If I could compare the bytes of the floats returned correctly in the netCDF data with the bytes of the floats returned wrongly from the C code, then I might figure out the correct swap.
So is there a plain R function to dump the four (or eight?) bytes of a floating point number? Something like:
> as.bytes(pi)
[1] 23 54 163 73 99 00 12 45 # made up values
Searches for "bytes", "float" and "binary" haven't helped.
It's trivial in C; I could probably have written it in the time it took me to write this...
rdyncall might give you what you're looking for:
library(rdyncall)
as.floatraw(pi)
# [1] db 0f 49 40
# attr(,"class")
# [1] "floatraw"
Or maybe writeBin(pi, raw(8))?
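That writeBin route does work, and since the question is about byte order, note that base R's writeBin can emit either endianness directly:

writeBin(pi, raw(), endian = "little")
# [1] 18 2d 44 54 fb 21 09 40
writeBin(pi, raw(), endian = "big")
# [1] 40 09 21 fb 54 44 2d 18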
Yes, that must exist in the serialization code because R merrily sends stuff across the wire, taking care of endianness too. Did you look at e.g. how Rserve uses it, or how digest passes the character representation to the chosen hash functions?
After a quick glance at digest.R:
R> serialize(pi, connection=NULL, ascii=TRUE)
[1] 41 0a 32 0a 31 33 34 39 31 34 0a 31 33 31 38 34 30 0a
[19] 31 34 0a 31 0a 33 2e 31 34 31 35 39 32 36 35 33 35 38
[37] 39 37 39 33 0a
and
R> serialize(pi, connection=NULL, ascii=FALSE)
[1] 58 0a 00 00 00 02 00 02 0f 02 00 02 03 00 00 00 00 0e
[19] 00 00 00 01 40 09 21 fb 54 44 2d 18
R>
That might get you going.
Come to think of it, this includes header metadata.
The package mcga (machine-coded genetic algorithms) includes functions for double-to-bytes and bytes-to-double conversions. For the bytes of pi, you can use DoubleToBytes:
> DoubleToBytes(pi)
[1] 24 45 68 84 251 33 9 64
For converting bytes to double again, BytesToDouble() can be used instead:
> BytesToDouble(c(24,45,68,84,251,33,9,64))
[1] 3.141593
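The reverse direction also works in base R via readBin, using the same decimal bytes that mcga printed above (little-endian on this machine):

readBin(as.raw(c(24, 45, 68, 84, 251, 33, 9, 64)),
        what = "double", n = 1, endian = "little")
# [1] 3.141593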
Links:
CRAN page of mcga: https://cran.r-project.org/package=mcga