Converting a big string in R to hex - r

I need to convert a big string of 1M booleans to hexadecimal. I can't figure out the right library or syntax. I thought I could use gsub after converting the booleans to characters, but I can't keep the result in the raw data type. Each FALSE needs to become a 0x0 nibble and each TRUE a 0x8 nibble. So, this:
FALSEFALSEFALSETRUEFALSEFALSE
becomes this raw data:
000800
Here's a snippet of code I've been playing with:
library(magrittr)  # the snippet uses %>% so the pipe has to be loaded

sample <- paste0(sample(x = c(T, F), size = 100, replace = T), collapse = "")
best_raw <- as.vector(sample) %>%
  as.character(.) %>%
  paste(., collapse = "") %>%
  gsub(x = ., pattern = "TRUETRUE",   replacement = as.raw(0x88)) %>%
  gsub(x = ., pattern = "FALSEFALSE", replacement = as.raw(0x00)) %>%
  gsub(x = ., pattern = "TRUEFALSE",  replacement = as.raw(0x80)) %>%
  gsub(x = ., pattern = "FALSETRUE",  replacement = as.raw(0x08))
but a few things become obvious. First, I thought I was cleverly using characters and as.raw() by taking two booleans at a time, but gsub() does not respect the pair boundaries going in, so I get a mash of stuff like:
"8088TRUE00TRUE000000880800888088TRUE"
I've seen a few references to bin2hex, but I can't find it on CRAN, and I'm not sure it would do what I want anyway.

If I understand the question correctly, you have a character string with TRUE and FALSE concatenated. You can use gsub directly to replace them with digits. Then you can split the result into individual characters to get a vector rather than a single long string, and apply the conversion you need to that final vector.
smp <- paste0(sample(x = c("TRUE", "FALSE"), size = 100, replace = T), collapse = "")
smp_as_numbers <- gsub("FALSE", "0", gsub("TRUE", "8", smp, fixed = TRUE), fixed = TRUE)
smp_as_vector <- strsplit(smp_as_numbers, "")[[1]]
as.raw(as.integer(smp_as_vector))
# [1] 08 00 00 00 00 08 00 08 08 08 08 08 08 00 08 08 00 08 00 00 00 00 00 08 00 08 08 08
# [29] 08 00 00 08 00 00 00 08 00 00 08 08 00 00 00 00 08 08 08 08 08 08 00 08 00 08 08 00
# [57] 00 08 08 00 08 00 08 08 08 00 08 08 08 08 00 08 08 00 00 00 08 00 08 08 08 00 08 08
# [85] 08 00 08 00 00 08 08 00 00 08 08 00 00 08 00 00
> smp
#[1] #"TRUEFALSEFALSEFALSEFALSETRUEFALSETRUETRUETRUETRUETRUETRUEFALSETRUETRUEFALSETRUEFALSEFALSEFALSEFALSEFALSETRUEFALSETRUETRUETRUETRUEFALSEFALSETRUEFALSEFALSEFALSETRUEFALSEFALSETRUETRUEFALSEFALSEFALSEFALSETRUETRUETRUETRUETRUETRUEFALSETRUEFALSETRUETRUEFALSEFALSETRUETRUEFALSETRUEFALSETRUETRUETRUEFALSETRUETRUETRUETRUEFALSETRUETRUEFALSEFALSEFALSETRUEFALSETRUETRUETRUEFALSETRUETRUETRUEFALSETRUEFALSEFALSETRUETRUEFALSEFALSETRUETRUEFALSEFALSETRUEFALSEFALSE"
If you want to use BMS::hex2bin(), you have to apply it directly to the unsplit string:
smp <- paste0(sample(x=c("TRUE","FALSE"),size=5,replace=T),collapse="")
smp_as_numbers <- gsub("FALSE", "0",gsub("TRUE", "8", smp, fixed=TRUE), fixed = TRUE)
smp_as_numbers
# [1] "08888"
BMS::hex2bin(smp_as_numbers)
# [1] 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0
typeof(BMS::hex2bin(smp_as_numbers))
# [1] "double"


Convert list of raw-vectors in dataframe

I have a list of raw vectors named "output", something like this:
[1] 58 0a 00 00 00 03 00 04 00 03 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00 00 fe
[1] 58 0a 00 00 00 03 00 04 00 03 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 01 03 19 00 00 04 02 00 00 00 01 00 04 00 09 00 00 00 04 6d 65 74 61 00 00 02 13 00 00 00 03 00 00 00 10 00 00 00 01 00
[1] ...
They have different lengths and are of type "raw".
I need a dataframe with one vector in each cell:
ID  vectors
1   58 0a 00 00 00 03 00 04 00 03 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00 00 fe
2   58 0a 00 00 00 03 00 04 00 03 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 01 03 19 00 00 04 02 00 00 00 01 00 04 00 09 00 00 00 04 6d 65 74 61 00 00 02 13 00 00 00 03 00 00 00 10 00 00 00 01 00
I have tried this:
as.data.frame(output)
# Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
#   arguments imply differing number of rows: 27, 3132, 4141, 4267, 3701, 3943, 5200

df <- data.frame(matrix(unlist(output), nrow = length(output)))
# Warning message:
# In matrix(unlist(output), nrow = length(output)) :
#   data length [32954] is not a sub-multiple or multiple of the number of rows [14]
Is there a way to solve my problem?
You have to use I() when creating the data.frame, so that the list of raw vectors is stored as-is in a single list column.
output <- list(raw(2), raw(3))
DF <- data.frame(ID=1:2, vectors = I(output))
str(DF)
#'data.frame': 2 obs. of 2 variables:
# $ ID : int 1 2
# $ vectors:List of 2
# ..$ : raw 00 00
# ..$ : raw 00 00 00
# ..- attr(*, "class")= chr "AsIs"
DF
# ID vectors
#1 1 00, 00
#2 2 00, 00, 00
This can also be done with tibble, which supports list columns natively:
library(tibble)
output <- list(raw(2), raw(3))
tibble(ID = 1:2, vectors = output)
# A tibble: 2 x 2
#      ID vectors
#   <int> <list>
# 1     1 <raw [2]>
# 2     2 <raw [3]>
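For display or export, a plain character column can sit alongside the list column. A minimal sketch, reusing DF and output from the base R example above:

DF$hex <- vapply(output, function(v) paste(as.character(v), collapse = " "),
                 character(1))   # e.g. "00 00" and "00 00 00"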

How to read the MPEG2VideoDescriptor in an MXF file?

Here follows the hex dump of the MPEG2VideoDescriptor:
06 0e 2b 34 02 53 01 01 0d 01 01 01 01 01 51 00
83 00 00 f3 3c 0a 00 10 a3 be 51 b2 00 05 e7 11
bf 82 21 97 f7 a0 14 ed 30 06 00 04 00 00 00 02
30 01 00 08 00 00 ea 60 00 00 03 e9 80 00 00 04
01 c9 c3 80 30 04 00 10 06 0e 2b 34 04 01 01 02
0d 01 03 01 02 04 61 01 32 15 00 01 05 32 0e 00
08 00 00 00 10 00 00 00 09 32 0d 00 10 00 00 00
02 00 00 00 04 00 00 00 1a 00 00 00 00 32 0c 00
01 00 32 08 00 04 00 00 02 d0 32 09 00 04 00 00
05 00 32 02 00 04 00 00 02 d0 32 03 00 04 00 00
05 00 32 01 00 10 06 0e 2b 34 04 01 01 03 04 01
02 02 01 04 03 00 33 02 00 04 00 00 00 02 33 08
00 04 00 00 00 01 33 03 00 01 04 33 01 00 04 00
00 00 08 33 0b 00 01 00 33 07 00 02 00 00 33 04
The first 16 bytes:
06 0e 2b 34 02 53 01 01 0d 01 01 01 01 01 51 00 (UID)
The next 4 bytes are the BER size:
83 00 00 f3 (0xf3 bytes long)
Next 4 bytes:
3c 0a 00 10 (0x3c0a means Instance UUID and 0x0010 is the size)
Then follows the UUID:
a3 be 51 b2 00 05 e7 11 bf 82 21 97 f7 a0 14 ed
Next 4 bytes:
30 06 00 04 (0x3006 means Linked Track ID and 0x0004 is the size)
The next 4 bytes are the Linked Track ID: 00 00 00 02
Next 4 bytes: 30 01 00 08 (0x3001 means Sample Rate and 0x0008 is the size)
The following 8 bytes are actually frame rate numerator and denominator:
0000ea60 == 60000 and 000003e9 == 1001.
Now we have the puzzling part: 80 00 00 04
Can somebody please explain what it means?
The next four bytes are 01 c9 c3 80, and that is almost certainly the bitrate (30000000), but how can I know that for sure?
Edit:
Does 80 00 00 04 mean the following:
0x8000 is a dynamic tag. According to SMPTE 377, tags 0x8000-0xFFFF are dynamically allocated, and 0x0004 is the size (4 bytes). If that's true, how can I tell that the following 4 bytes 01 c9 c3 80 are actually the bitrate? It could be anything, couldn't it?
First you have to understand how local tags work. Local tags 0x8000 and above are user defined. You have to look at the primer pack of the header partition: it translates the local tag to a global UL, which may or may not be vendor specific. Think of the primer pack as a translation table between the 2-byte local tag and the 16-byte UL.
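To make the tag/length/value structure concrete, here is a minimal, hypothetical R sketch of walking a local set body, where each entry is a 2-byte tag, a 2-byte length, and then the value bytes; resolving tags >= 0x8000 against the primer pack is left out:

parse_local_set <- function(bytes) {
  i <- 1L
  entries <- list()
  while (i + 3L <= length(bytes)) {
    tag <- as.integer(bytes[i]) * 256L + as.integer(bytes[i + 1L])       # 2-byte local tag
    len <- as.integer(bytes[i + 2L]) * 256L + as.integer(bytes[i + 3L])  # 2-byte length
    entries[[sprintf("0x%04x", tag)]] <-
      if (len > 0L) bytes[(i + 4L):(i + 3L + len)] else raw(0)
    i <- i + 4L + len
  }
  entries
}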

Data modified on AWS API Gateway Response body

I am trying to return a hexadecimal string as the response from my AWS Lambda function. When it reaches the client, the data seems to be modified.
Data:
47 49 46 38 39 61 01 00 01 00 80 00 00 00 00 00
ff ff ff 21 f9 04 01 00 00 01 00 2c 00 00 00 00
01 00 01 00 00 08 04 00 03 04 04 00 3b
Hexadecimal Escaped Data (Sent Data):
"\x47\x49\x46\x38\x39\x61\x01\x00\x01\x00\x80\x00\x00\x00\x00\x00"
"\xff\xff\xff\x21\xf9\x04\x01\x00\x00\x01\x00\x2c\x00\x00\x00\x00"
"\x01\x00\x01\x00\x00\x08\x04\x00\x03\x04\x04\x00\x3b"
Received Data
47 49 46 38 39 61 01 00 01 00 c2 80 00 00 00 00
00 c3 bf c3 bf c3 bf 21 c3 b9 04 01 00 00 01 00
2c 00 00 00 00 01 00 01 00 00 08 04 00 03 04 04
00 3b
How to fix this?
Last time I checked, this was not very explicit in the docs, but API Gateway is really made for JSON (or similar); support for binary is 'on the roadmap' but clearly doesn't seem to be a priority. It converts everything it sends to UTF-8.
Comparing your original data with the received data byte by byte, you can see it:
47 49 46 38 39 61 01 00 01 00 80 00 00 00 00 00 ff ff ff 21 f9 04 01 00 00 01 00 2c 00 00 00 00 01 00 01 00 00 08 04 00 03 04 04 00 3b
47 49 46 38 39 61 01 00 01 00 c2 80 00 00 00 00 00 c3 bf c3 bf c3 bf 21 c3 b9 04 01 00 00 01 00 2c 00 00 00 00 01 00 01 00 00 08 04 00 03 04 04 00 3b
Everything up to 0x7f is OK because the Unicode code point is encoded as the same single byte (U+0047 -> 47), but from 0x80 upward the problem arises: U+0080 -> c2 80, U+00FF -> c3 bf, and so on.
We had a similar problem recently: binary data was corrupted and bigger when sent through Gateway than with direct access to our backend, because many bytes were replaced by the Unicode 'replacement character', aka U+FFFD, aka 0xEF 0xBF 0xBD.
How to fix it? We just stopped using Gateway, but if you can afford for your payload to be bigger, you can base64 encode it.
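A minimal R sketch of both halves of that diagnosis, assuming the base64enc package for the encoding step; the GIF header bytes are the ones from the question:

gif <- as.raw(c(0x47, 0x49, 0x46, 0x38, 0x39, 0x61, 0x80, 0xff))

# Re-encoding the bytes as UTF-8 expands everything >= 0x80 into two bytes,
# which is exactly the corruption seen above (0x80 -> c2 80, 0xff -> c3 bf):
charToRaw(iconv(rawToChar(gif), from = "latin1", to = "UTF-8"))
# [1] 47 49 46 38 39 61 c2 80 c3 bf

# Base64 keeps the payload ASCII-safe at roughly 4/3 the size:
base64enc::base64encode(gif)
# [1] "R0lGODlhgP8="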

quickly load a subset of rows from data.frame saved with `saveRDS()`

With a large file (1GB) created by saving a large data.frame (or data.table) is it possible to very quickly load a small subset of rows from that file?
(Extra for clarity: I mean something as fast as mmap, i.e. the runtime should be approximately proportional to the amount of memory extracted, but constant in the size of the total dataset. "Skipping data" should have essentially zero cost. This can be very easy, or impossible, or something in between, depending on the serialization format.)
I hope that the R serialization format makes it easy to skip forward to the relevant portions of the file.
Am I right in assuming that this would be impossible with a compressed file, simply because gzip requires to uncompress everything from the beginning?
saveRDS(object, file = "", ascii = FALSE, version = NULL,
        compress = TRUE, refhook = NULL)
But I'm hoping binary (ascii = FALSE) and uncompressed (compress = FALSE) might allow something like this: mmap the file, then quickly skip to the rows and columns of interest?
I'm hoping it has already been done, or there is another format (reasonably space efficient) that allows this and is well-supported in R.
I've used things like gdbm (from Python) and even implemented a custom system in Rcpp for a specific data structure, but I'm not satisfied with any of this.
After posting this, I worked a bit with the ff package (on CRAN) and am very impressed with it (not much support for character vectors, though).
"Am I right in assuming that this would be impossible with a compressed file, simply because gzip requires to uncompress everything from the beginning?"
Indeed. For a short explanation, let's take a dummy run-length method as a starting point: for AAAAVVBABBBC, it would produce something like 4A2VBA3BC. Obviously you can't extract all the As from the file without reading it all, as you can't tell whether there's an A at the end or not.
For the other question, "loading part of a saved file", I can't see a solution off the top of my head. Using write.csv and read.csv (or fwrite and fread from the data.table package) with the skip and nrows parameters could be an alternative; a sketch follows below.
In any case, applying a function to a file already read would mean loading the whole file into memory before filtering, which takes no less time than reading the file and then subsetting from memory.
You may craft something in Rcpp, taking advantage of streams to read the data without loading it all into memory, but reading and parsing each entry before deciding whether to keep it won't give you really better throughput.
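A minimal sketch of the fread route, assuming the data frame has already been exported once to a hypothetical big.csv; note that skip counts physical lines, so the header goes with them and the column names have to be reattached by hand:

library(data.table)

fwrite(df, "big.csv")                                  # one-time export
header <- names(fread("big.csv", nrows = 0))           # grab the column names only
chunk  <- fread("big.csv", skip = 1001, nrows = 50,
                header = FALSE, col.names = header)    # data rows 1001-1050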
saveRDS will save a serialized version of the data, for example:
> myvector <- c("1","2","3")
> serialize(myvector,NULL)
[1] 58 0a 00 00 00 02 00 03 02 03 00 02 03 00 00 00 00 10 00 00 00 03 00 04 00 09 00 00 00 01 31 00 04 00 09 00 00 00 01 32 00 04 00 09 00 00
[47] 00 01 33
It is of course parsable, but that means reading it byte by byte according to the format.
On the other hand, you could write the data as CSV (or use write.table for more complex data) and use an external tool before reading, something along these lines:
z <- tempfile()
write.table(df, z, row.names = FALSE)
# intern = TRUE makes system() return the output lines to R
shortdf <- read.table(text = system(paste("awk 'NR > 5 && NR < 10 { print }'", z),
                                    intern = TRUE))
You'll need a Linux system with awk, which is able to parse millions of lines in a few milliseconds, or a Windows-compiled version of awk, obviously.
The main advantage is that awk can filter each line of data on a regex or other conditions.
Complement for the case of a data.frame: a data.frame is more or less a list of vectors (in the simple case), and this list is saved sequentially. So if we have a data frame like:
> str(ex)
'data.frame': 3 obs. of 2 variables:
$ a: chr "one" "five" "Whatever"
$ b: num 1 2 3
Its serialization is:
> serialize(ex,NULL)
[1] 58 0a 00 00 00 02 00 03 02 03 00 02 03 00 00 00 03 13 00 00 00 02 00 00 00 10 00 00 00 03 00 04 00 09 00 00 00 03 6f 6e 65 00 04 00 09 00
[47] 00 00 04 66 69 76 65 00 04 00 09 00 00 00 08 57 68 61 74 65 76 65 72 00 00 00 0e 00 00 00 03 3f f0 00 00 00 00 00 00 40 00 00 00 00 00 00
[93] 00 40 08 00 00 00 00 00 00 00 00 04 02 00 00 00 01 00 04 00 09 00 00 00 05 6e 61 6d 65 73 00 00 00 10 00 00 00 02 00 04 00 09 00 00 00 01
[139] 61 00 04 00 09 00 00 00 01 62 00 00 04 02 00 00 00 01 00 04 00 09 00 00 00 09 72 6f 77 2e 6e 61 6d 65 73 00 00 00 0d 00 00 00 02 80 00 00
[185] 00 ff ff ff fd 00 00 04 02 00 00 00 01 00 04 00 09 00 00 00 05 63 6c 61 73 73 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 0a 64 61 74 61
[231] 2e 66 72 61 6d 65 00 00 00 fe
Translated to ASCII to give an idea:
X
one five Whatever?ð## names a b row.names
ÿÿÿý class
data.frameþ
We have the file header, then the list header, then each vector composing the list. As we have no clue how much space a character vector will take, we can't skip ahead to arbitrary data; we have to parse each header (the bytes just before the text data give its length). Even worse, to then get the corresponding integers, we have to reach the integer vector header, which can't be located without parsing each character header and summing their lengths.
So in my opinion, crafting something is possible, but it will probably not be much quicker than reading the whole object, and it will be brittle with respect to the save format (R already has 3 formats for saving objects).
For reference, the serialization format is documented in the R Internals manual.
The same view as the serialize output, but in ASCII format (more readable, to see how it is organized):
> write(rawToChar(serialize(ex,NULL,ascii=TRUE)),"")
A
2
197123
131840
787
2
16
3
262153
3
one
262153
4
five
262153
8
Whatever
14
3
1
2
3
1026
1
262153
5
names
16
2
262153
1
a
262153
1
b
1026
1
262153
9
row.names
13
2
NA
-3
1026
1
262153
5
class
16
1
262153
10
data.frame
254

the meaning of bits in rawToBits?

> as.raw(15)
[1] 0f
> rawToBits(as.raw(15))
[1] 01 01 01 01 00 00 00 00
> rawToBits(0f)
Error: unexpected symbol in "rawToBits(0f"
> rawToBits("0f")
Error in rawToBits("0f") : argument 'x' must be a raw vector
> rawToBits("0x0f")
Error in rawToBits("0x0f") : argument 'x' must be a raw vector
I have some questions:
1) Is that 0f raw-type data?
2) Why can't rawToBits(as.raw(15)) give 11110000? Isn't 15 equal to 11110000? (15 = 0f = 1*2^0 + 1*2^1 + 1*2^2 + 1*2^3.)
3) What is the meaning of the 0s in [1] 01 00 00 00 00 00 00 00 when you input rawToBits(as.raw(1))? The manual says I get "a raw vector with entries 0 or 1" -- what is the meaning of entries 0 or 1?
4) Why isn't rawToBits(as.raw(2)) equal to 10 00 00 00 00 00 00 00?
Just typing 0f doesn't give you something of type raw.
> str(as.raw(15))
raw 0f
> str(0f)
Error: unexpected symbol in "str(0f"
> str("0f")
chr "0f"
If you want to know what's going on with the bits, you can try some other values to get a better idea of what is happening:
> rawToBits(as.raw(1))
[1] 01 00 00 00 00 00 00 00
> rawToBits(as.raw(2))
[1] 00 01 00 00 00 00 00 00
> rawToBits(as.raw(4))
[1] 00 00 01 00 00 00 00 00
> rawToBits(as.raw(8))
[1] 00 00 00 01 00 00 00 00
> rawToBits(as.raw(1 + 2 + 4 + 8))
[1] 01 01 01 01 00 00 00 00
> rawToBits(as.raw(15))
[1] 01 01 01 01 00 00 00 00
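The pattern shows that rawToBits() returns the eight bits least-significant first, which is why as.raw(2) puts its single 1 in the second position. A short sketch of reading them in the conventional written order and round-tripping back to a byte:

bits <- rawToBits(as.raw(15))
rev(bits)                      # 00 00 00 00 01 01 01 01, i.e. 0b00001111 written MSB first
packBits(bits, type = "raw")   # back to 0f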
