read_table with minus zero (-0) in R

I am reading a .txt file in R with read_table.
This file contains multiple columns, three of which correspond to 'hh mm ss'. Since 00 00 00 corresponds to midnight, any value before that is negative; for example, -00 30 00 corresponds to 30 minutes to midnight.
Here is some of the .txt data:
-00 53 30
-00 53 05
-00 52 43
-00 52 20
00 02 25
00 02 37
data <- read.table('data.txt',header = FALSE, stringsAsFactors = FALSE)
Output data:
00 53 30
00 53 05
00 52 43
00 52 20
00 02 25
00 02 37
The problem is that I need it to be a "negative zero".
How can I overcome this?
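One common workaround (a sketch, not an answer taken from this thread): read the time columns as character so the sign survives, then compute signed seconds yourself. V1..V3 are read.table's default column names for a header-less file.
data <- read.table("data.txt", header = FALSE, colClasses = "character")
sign <- ifelse(startsWith(data$V1, "-"), -1, 1)       # capture the sign before it is lost
secs <- sign * (abs(as.numeric(data$V1)) * 3600 +     # hours
                as.numeric(data$V2) * 60 +            # minutes
                as.numeric(data$V3))                  # seconds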

Related

unix - how to remove a byte from a recurring position in an EBCDIC file

I have an EBCDIC file and need to remove a byte that recurs every n bytes.
For example:
00 00 00 25 00 0C 25 00 00 00 00 00 0C 25 00 78 25 69 67 4C 25 00 78 90 69 67 4C 25
I need to remove "25" from position 7, 14, 21, 28, the output would be:
00 00 00 25 00 0C 00 00 00 00 00 0C 00 78 25 69 67 4C 00 78 90 69 67 4C
I tried to use cut:
l=`stat -c%s myfile`
for (( i = 1 ; i <= $l ; i += 7 )) ; do
  m=`expr $i + 5`
  cat myfile | cut -c$i-$m >> fileout
done
But the output is not what I expect:
00 00 00 25 00 0C 0A 00 00 00 00 00 0C 0A 00 78 25 69 67 4C 0A 00 78 90 69 67 4C 0A
Can somebody help me?
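A note on the output: cut is line-oriented and each invocation appends a newline, which is likely where the extra 0A bytes come from. One alternative (a sketch in R, assuming the file fits in memory) is to treat the file as raw bytes and drop every 7th one; "myfile" and "fileout" are the names from the question:
raw_in <- readBin("myfile", what = "raw", n = file.info("myfile")$size)
writeBin(raw_in[seq_along(raw_in) %% 7 != 0], "fileout")  # drops positions 7, 14, 21, ...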

Can you please explain Reed Solomon encoding part's Identity matrix?

I am working on an object storage project where I need to understand the Reed Solomon error correction algorithm.
I have gone through this Doc as a starter and also some thesis papers:
1. content.sakai.rutgers.edu
2. theseus.fi
but I can't seem to understand the lower part of the identity matrix (red box) and where it comes from. How is this calculation done?
Can anyone please explain this?
The encoding matrix is a 6 x 4 Vandermonde matrix using the evaluation points {0 1 2 3 4 5}, modified so that the upper 4 x 4 portion of the matrix is the identity matrix. To create it, a 6 x 4 Vandermonde matrix is generated (where matrix[r][c] = pow(r,c)), then multiplied by the inverse of its upper 4 x 4 portion to produce the encoding matrix. This is the equivalent of "systematic encoding" with Reed Solomon's "original view", as mentioned in the Wikipedia article linked above, which is different from Reed Solomon's "BCH view", which links 1. and 2. refer to. Wikipedia's example systematic encoding matrix is a transposed version of the encoding matrix used in the question.
https://en.wikipedia.org/wiki/Vandermonde_matrix
https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction#Systematic_encoding_procedure:_The_message_as_an_initial_sequence_of_values
The code to generate the encoding matrix is near the bottom of this github source file:
https://github.com/Backblaze/JavaReedSolomon/blob/master/src/main/java/com/backblaze/erasure/ReedSolomon.java
Vandermonde        inverse upper      encoding
matrix             part of matrix     matrix

01 00 00 00                           01 00 00 00
01 01 01 01        01 00 00 00        00 01 00 00
01 02 04 08    x   7b 01 8e f4    =   00 00 01 00
01 03 05 0f        00 7a f4 8e        00 00 00 01
01 04 10 40        7a 7a 7a 7a        1b 1c 12 14
01 05 11 55                           1c 1b 14 12
Decoding is performed in two steps. First, any missing rows of data are recovered, then any missing rows of parity are regenerated using the now recovered data.
For 2 missing rows of data, the 2 corresponding rows of the encoding matrix are removed, and the remaining 4 x 4 sub-matrix is inverted and used to multiply the 4 non-missing rows of data and parity, recovering the 2 missing rows of data. If there is only 1 missing row of data, a second data row is treated as if it were missing, so that a 4 x 4 sub-matrix can be inverted. The actual regeneration of data only requires the corresponding rows of the inverted matrix to be used in the matrix multiply.
Once the data is recovered, then any missing parity rows are regenerated from the now recovered data, using the corresponding rows of the encoding matrix.
Based on the data shown, the math is based on finite field GF(2^8) modulo 0x11D. For example, encoding using the last row of the encoding matrix with the last column of the data matrix is (0x1c·0x44)+(0x1b·0x48)+(0x14·0x4c)+(0x12·0x50) = 0x25 (using finite field math).
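If you want to check that arithmetic, here is a minimal sketch of the field multiply in R (gf_mul is a hypothetical helper, not code from the linked library):
gf_mul <- function(a, b) {
  # carry-less "shift and xor" multiply in GF(2^8), reduced modulo 0x11D
  a <- as.integer(a); b <- as.integer(b); p <- 0L
  while (b > 0L) {
    if (bitwAnd(b, 1L) == 1L) p <- bitwXor(p, a)
    a <- bitwShiftL(a, 1L)
    if (bitwAnd(a, 0x100L) != 0L) a <- bitwXor(a, 0x11DL)
    b <- bitwShiftR(b, 1L)
  }
  p
}
# last encoding row times last data column, terms XORed together:
terms <- Map(gf_mul, c(0x1c, 0x1b, 0x14, 0x12), c(0x44, 0x48, 0x4c, 0x50))
sprintf("0x%02x", Reduce(bitwXor, terms))
#> [1] "0x25"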
The question example doesn't make it clear that the 6 x 4 encode matrix can encode a 4 x n matrix, where n is the number of bytes per row. Example where n == 8:
encode             data                          encoded data

01 00 00 00                                      31 32 33 34 35 36 37 38
00 01 00 00        31 32 33 34 35 36 37 38       41 42 43 44 45 46 47 48
00 00 01 00    x   41 42 43 44 45 46 47 48   =   49 4a 4b 4c 4d 4e 4f 50
00 00 00 01        49 4a 4b 4c 4d 4e 4f 50       51 52 53 54 55 56 57 58
1b 1c 12 14        51 52 53 54 55 56 57 58       e8 eb ea ed ec ef ee dc
1c 1b 14 12                                      f5 f6 f7 f0 f1 f2 f3 a1
assume rows 0 and 4 are erasures and deleted from the matrices:
00 01 00 00        41 42 43 44 45 46 47 48
00 00 01 00        49 4a 4b 4c 4d 4e 4f 50
00 00 00 01        51 52 53 54 55 56 57 58
1c 1b 14 12        f5 f6 f7 f0 f1 f2 f3 a1
invert encode sub-matrix:
inverse            encode             identity

46 68 8f a0        00 01 00 00        01 00 00 00
01 00 00 00    x   00 00 01 00    =   00 01 00 00
00 01 00 00        00 00 00 01        00 00 01 00
00 00 01 00        1c 1b 14 12        00 00 00 01
reconstruct data using sub-matrices:
inverse            encoded data                  restored data

46 68 8f a0        41 42 43 44 45 46 47 48       31 32 33 34 35 36 37 38
01 00 00 00    x   49 4a 4b 4c 4d 4e 4f 50   =   41 42 43 44 45 46 47 48
00 01 00 00        51 52 53 54 55 56 57 58       49 4a 4b 4c 4d 4e 4f 50
00 00 01 00        f5 f6 f7 f0 f1 f2 f3 a1       51 52 53 54 55 56 57 58
The actual process only uses the rows of the matrices that correspond to the erased rows that need to be reconstructed.
First data is reconstructed:
sub-inverse        encoded data                  reconstructed data

                   41 42 43 44 45 46 47 48
46 68 8f a0    x   49 4a 4b 4c 4d 4e 4f 50   =   31 32 33 34 35 36 37 38
                   51 52 53 54 55 56 57 58
                   f5 f6 f7 f0 f1 f2 f3 a1
Once data is reconstructed, reconstruct parity:
sub-encode         data                          reconstructed parity

                   31 32 33 34 35 36 37 38
1b 1c 12 14    x   41 42 43 44 45 46 47 48   =   e8 eb ea ed ec ef ee dc
                   49 4a 4b 4c 4d 4e 4f 50
                   51 52 53 54 55 56 57 58
One alternative to this approach is to use BCH view Reed Solomon. For an odd number of parities, such as RS(20,17) (3 parities), encoding needs 2 matrix multiplies and one XOR, and a single erasure can be corrected with just an XOR. For e > 1 erasures, an (e-1) by n matrix multiply is done, followed by an XOR. For an even number of parities there is a tradeoff: if an XOR is used as part of the encode, then an e by n matrix multiply is needed to correct; alternatively, an e by n matrix can be used for the encode, allowing one XOR as part of the correction.
Another alternative is RAID 6, where "syndromes" are appended to the matrix of data but do not form a proper code word. One of the syndrome rows, called P, is just XOR; the other row, called Q, uses successive powers of 2 in GF(2^8). For a 3-parity RAID 6, the third row, called R, uses successive powers of 4 in GF(2^8). Unlike standard BCH view, if a Q or R row is lost, it has to be recalculated (as opposed to being corrected with XOR). By using a diagonal pattern, if 1 of n disks fails, then only 1/n of the Q's and R's need to be regenerated when the disk is replaced.
http://alamos.math.arizona.edu/RTG16/ECC/raid6.pdf
Note that this pdf file's alternate method uses the same finite field as the method above, GF(2^8) mod 0x11D, which may make it easier to compare the methods.
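As a hypothetical illustration of those P/Q syndromes (same field, GF(2^8) mod 0x11D; gf_mul is the shift-and-xor multiply sketched earlier, and the data bytes are made up):
d <- c(0x31, 0x41, 0x49, 0x51)                 # one byte from each of four data disks
P <- Reduce(bitwXor, as.integer(d))            # P: plain XOR across the stripe
Q <- Reduce(function(acc, x) bitwXor(gf_mul(acc, 2L), x),   # Q: sum of 2^i * d[i+1],
            rev(as.integer(d)), 0L)            # computed via Horner's rule in the field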

quickly load a subset of rows from data.frame saved with `saveRDS()`

With a large file (1GB) created by saving a large data.frame (or data.table), is it possible to very quickly load a small subset of rows from that file?
(Extra for clarity: I mean something as fast as mmap, i.e. the runtime should be approximately proportional to the amount of memory extracted, but constant in the size of the total dataset. "Skipping data" should have essentially zero cost. This can be very easy, or impossible, or something in between, depending on the serialization format.)
I hope that the R serialization format makes it easy to skip forward to the relevant portions of the file.
Am I right in assuming that this would be impossible with a compressed file, simply because gzip requires decompressing everything from the beginning?
saveRDS(object, file = "", ascii = FALSE, version = NULL,
compress = TRUE, refhook = NULL)
But I'm hoping binary (ascii = FALSE), uncompressed (compress = FALSE) might allow something like this: use mmap on the file, then quickly skip to the rows and columns of interest?
I'm hoping it has already been done, or there is another format (reasonably space efficient) that allows this and is well-supported in R.
I've used things like gdbm (from Python) and even implemented a custom system in Rcpp for a specific data structure, but I'm not satisfied with any of this.
After posting this, I worked a bit with the package ff (CRAN) and am very impressed with it (not much support for character vectors though).
Am I right in assuming that this would be impossible with a compressed file, simply because gzip requires decompressing everything from the beginning?
Indeed. For a short explanation, let's take some dummy method as a starting point:
given AAAAVVBABBBC, gzip would do something like 4A2VBA3BC.
Obviously you can't extract all the A's from the file without reading it all, as you can't tell whether there's an A at the end or not.
For the other question, "loading part of a saved file", I can't see a solution off the top of my head. Using write.csv and read.csv (or fwrite and fread from the data.table package) with the skip and nrows parameters could be an alternative; see the sketch just below.
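A minimal sketch of that alternative with data.table (big.csv is a placeholder file name):
library(data.table)
fwrite(df, "big.csv")                                             # one-time export of the data.frame
shortdf <- fread("big.csv", skip = 5, nrows = 4, header = FALSE)  # parse only 4 rows, skipping the first 5 lines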
In any case, using any function on a file already read into memory would mean loading the whole file before filtering, which is no faster than reading the file and then subsetting from memory.
You may craft something in Rcpp, taking advantage of streams to read data without loading it all in memory, but reading and parsing each entry before deciding whether to keep it won't give you a real gain in throughput.
saveRDS will save a serialized version of the data, for example:
> myvector <- c("1","2","3")
> serialize(myvector,NULL)
[1] 58 0a 00 00 00 02 00 03 02 03 00 02 03 00 00 00 00 10 00 00 00 03 00 04 00 09 00 00 00 01 31 00 04 00 09 00 00 00 01 32 00 04 00 09 00 00
[47] 00 01 33
It is of course parsable, but that means reading byte by byte according to the format.
On the other hand, you could write as csv (or use write.table for more complex data) and filter with an external tool before reading, something along the lines of:
z <- tempfile()
write.table(df, z, row.names = FALSE)
shortdf <- read.table(text = system(paste("awk 'NR > 5 && NR < 10 { print }'", z), intern = TRUE))
You'll need a Linux system with awk, which is able to parse millions of lines in a few milliseconds, or a Windows-compiled version of awk, obviously.
The main advantage is that awk can filter each line of data on a regex or some other condition.
To complement this for the case of a data.frame: a data.frame is more or less a list of vectors (in the simple case), and this list is saved sequentially. So if we have a data.frame like:
> str(ex)
'data.frame': 3 obs. of 2 variables:
$ a: chr "one" "five" "Whatever"
$ b: num 1 2 3
Its serialization is:
> serialize(ex,NULL)
[1] 58 0a 00 00 00 02 00 03 02 03 00 02 03 00 00 00 03 13 00 00 00 02 00 00 00 10 00 00 00 03 00 04 00 09 00 00 00 03 6f 6e 65 00 04 00 09 00
[47] 00 00 04 66 69 76 65 00 04 00 09 00 00 00 08 57 68 61 74 65 76 65 72 00 00 00 0e 00 00 00 03 3f f0 00 00 00 00 00 00 40 00 00 00 00 00 00
[93] 00 40 08 00 00 00 00 00 00 00 00 04 02 00 00 00 01 00 04 00 09 00 00 00 05 6e 61 6d 65 73 00 00 00 10 00 00 00 02 00 04 00 09 00 00 00 01
[139] 61 00 04 00 09 00 00 00 01 62 00 00 04 02 00 00 00 01 00 04 00 09 00 00 00 09 72 6f 77 2e 6e 61 6d 65 73 00 00 00 0d 00 00 00 02 80 00 00
[185] 00 ff ff ff fd 00 00 04 02 00 00 00 01 00 04 00 09 00 00 00 05 63 6c 61 73 73 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 0a 64 61 74 61
[231] 2e 66 72 61 6d 65 00 00 00 fe
Translated to ASCII, to give an idea:
X
one five Whatever?ð## names a b row.names
ÿÿÿý class
data.frameþ
We have the header of the file, then the header of the list, then each vector composing the list. As we have no clue how much space a character vector will take, we can't skip to arbitrary data; we have to parse each header (the bytes just before the text data give its length). For instance, in the dump above, the bytes 00 00 00 03 just before 6f 6e 65 ("one") encode its length, 3. Even worse, to get to the integer vector, we have to reach its header, whose position can't be determined without parsing every character header and summing their lengths.
So in my opinion, crafting something is possible, but it will probably not be much quicker than reading the whole object, and it will be brittle to the save format (R already has 3 formats for saving objects).
Some reference here
The same serialize output in ASCII format (more readable, to see how it is organized):
> write(rawToChar(serialize(ex,NULL,ascii=TRUE)),"")
A
2
197123
131840
787
2
16
3
262153
3
one
262153
4
five
262153
8
Whatever
14
3
1
2
3
1026
1
262153
5
names
16
2
262153
1
a
262153
1
b
1026
1
262153
9
row.names
13
2
NA
-3
1026
1
262153
5
class
16
1
262153
10
data.frame
254

How to convert variable to RDA format in memory

Currently, I convert a data table into rda format by using save(), like so:
save(data_table,file="data.rda")
But I don't want to write to disk to get the .rda form of the variable.
Instead, I would like the output (the rda bytes) to be written straight into a variable.
In another programming language, it would be something like switching from
FileStream fs = new FileStream("data.rda","w")
Save(data_table,fs)
to
byte[] buffer = new byte[data_table.getBytes()]
Save(data_table,buffer);
Is there a way to do something like this in R?
This is possible with raw connections
rc <- rawConnection(raw(0),"wb")
save(iris,file=rc)
v1 <- rawConnectionValue(rc)
close(rc)
head(v1,80)
#> [1] 52 44 58 32 0a 58 0a 00 00 00 02 00 03 01 01 00 02 03 00 00
#> [21] 00 04 02 00 00 00 01 00 04 00 09 00 00 00 04 69 72 69 73 00
#> [41] 00 03 13 00 00 00 05 00 00 00 0e 00 00 00 96 40 14 66 66 66
#> [61] 66 66 66 40 13 99 99 99 99 99 9a 40 12 cc cc cc cc cc cd 40
Compare to writing to a temporary file and reading back
save(iris,file="temp.rda",compress=FALSE)
v2 = readBin("temp.rda", raw(), file.info("temp.rda")[1,"size"])
identical(v1,v2)
#> [1] TRUE
Note that writing to file by default uses compression and writing to a raw connection by default does not, hence the compress=FALSE argument.
See also serialize, which may be more suitable, depending on your purpose
v3 <- serialize(iris,NULL)
Note that identical(v1,v3) == FALSE. Instead, v3 is identical to what saveRDS would produce in place of save above. The encoding/content is similar:
head(v3,80)
#> [1] 58 0a 00 00 00 02 00 03 01 01 00 02 03 00 00 00 03 13 00 00
#> [21] 00 05 00 00 00 0e 00 00 00 96 40 14 66 66 66 66 66 66 40 13
#> [41] 99 99 99 99 99 9a 40 12 cc cc cc cc cc cd 40 12 66 66 66 66
#> [61] 66 66 40 14 00 00 00 00 00 00 40 15 99 99 99 99 99 9a 40 12
And it is easy to recover the data
identical(unserialize(v3),iris)
#> [1] TRUE
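A quick sketch to check that claim, mirroring the raw connection example above:
rc2 <- rawConnection(raw(0), "wb")
saveRDS(iris, file = rc2)        # saveRDS also accepts a connection
v4 <- rawConnectionValue(rc2)
close(rc2)
identical(v3, v4)
#> [1] TRUE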

Float to hex conversion - reverse engineering

I'm trying to do some reverse engineering on my heating system. Monitoring the CAN bus results in hexadecimal strings, for example:
00 D0 68 D6 86 83 61 8F
61 C0 02 5C 12 B5 02 5C
12 78 04 39 04 03 05 02
05 C4 04 5C 12 5C 12 5C
12 5C 12 D0 68 00 00 00
00 18 08 37 D2 00 00 00
00 00 00 00 00 15 75 F2
F0 01 00 01 00 00 00 1F
I know that for example the temperature value of 22.5°C should be somewhere in there.
So far I have tried to look for the following conversions:
Possibility 1: ASCII to hex
22.5 = 32 32 2E 35
Possibility 2: float to hex conversion
22.5 = 0x 41 b4 00 00
However, none of these resulted in a match.
What would be other possibilities for converting a float to a hex string?
Thx
Note: the given string is just a small part of my CAN sniffer output, so don't look for 22.5 in the string given here. I'm just looking for other possible conversions.
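One more avenue worth trying (a sketch, not a confirmed answer for this particular system): many embedded controllers transmit scaled integers rather than IEEE floats, e.g. 22.5 °C as 225 tenths of a degree. R can generate candidate byte patterns to search for in the dump:
val <- 22.5
writeBin(val, raw(), size = 4, endian = "big")                      # IEEE-754 float, big-endian:    41 b4 00 00
writeBin(val, raw(), size = 4, endian = "little")                   # IEEE-754 float, little-endian: 00 00 b4 41
writeBin(as.integer(val * 10), raw(), size = 2, endian = "little")  # 16-bit tenths, little-endian:  e1 00
writeBin(as.integer(val * 10), raw(), size = 2, endian = "big")     # 16-bit tenths, big-endian:     00 e1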
