I'm trying to understand the algorithm used for compression value = 1 with the Epson ESCP2 print command, "ESC-i". I have a hex dump of a raw print file which looks, in part, like the dump below (note that most of the dump is in little-endian 16-bit words, so the byte order within each pair is swapped).
000006a 1b ( U 05 00 08 08 08 40 0b
units; (page1=08), (vt1=08), (hz1=08), (base2=40 0b=0xb40=2880)
...
00000c0 691b 0112 6802 0101 de00
esc i 12 01 02 68 01 01 00
print color1, compress1, bits1, bytes2, lines2, data...
color1 = 0x12 = 18 = light cyan
compress1 = 1
bits1 (bits/pixel) = 0x2 = 2
bytes2 is ??? = 0x0168 = 360
lines2 is # lines to print = 0x0001 = 1
00000c9 de 1200 9a05 6959
00000d0 5999 a565 5999 6566 5996 9695 655a fd56
00000e0 1f66 9a59 6656 6566 5996 9665 9659 6666
00000f0 6559 9999 9565 6695 9965 a665 6666 6969
0000100 5566 95fe 9919 6596 5996 5696 9666 665a
0000110 5956 6669 0456 1044 0041 4110 0040 8140
0000120 9000 0d00
1b0c 1b40 5228 0008 5200 4d45
FF esc @ esc ( R 00 REMOTE1
The difficulty I'm having is how to decode the data starting at 00000c9, given 2 bits/pixel and the count of 360. My understanding is that this is some form of TIFF or RLE encoding, but I can't decode it in a way that makes sense. The output was produced by the Gutenprint plugin for GIMP.
Any help would be much appreciated.
The byte count is not a count of the bytes in the compressed input stream; it is a count of the bytes after the stream is expanded to its uncompressed form, so when expanded there should be a total of 360 bytes. Each control byte in the input is interpreted as a signed value: if it is non-negative, the following (value + 1) bytes are copied as-is; if it is negative, the single byte that follows is repeated (|value| + 1) times. The 0D at the end is a terminating carriage return for the line as a whole.
The input stream is treated purely as a string of whole bytes, despite the fact that the individual pixel/nozzle controls are only 2 bits each. So it is not really possible to use a repeat count for something like a 3-nozzle sequence; a repeat count must always specify a full-byte, 4-nozzle combination.
The above example then specifies:
0xde00 => repeat 0x00 35 times
0x12 => use the next 19 bytes as is
0xfd66 => repeat 0x66 4 times
0x1f => use the next 32 bytes as is
etc.
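For illustration, here is a minimal decoder sketch in Python of the scheme described above (this is only an illustration, not the Gutenprint implementation; the sample bytes are taken from the start of the dump at 00000c9, with the literal run filled with placeholder zeros):

def decode_escp2_rle(data, expected_len):
    """Expand ESC/P2 compress-mode-1 (PackBits-style RLE) data.

    Control bytes are signed: 0..127 means copy the next (n + 1) bytes
    literally; 0x80..0xFF (negative) means repeat the next byte
    (257 - n) times, i.e. |signed n| + 1 times.
    """
    out = bytearray()
    i = 0
    while i < len(data) and len(out) < expected_len:
        n = data[i]
        i += 1
        if n < 0x80:                        # literal run of n + 1 bytes
            out += data[i:i + n + 1]
            i += n + 1
        else:                               # repeat the next byte 257 - n times
            out += bytes([data[i]]) * (257 - n)
            i += 1
    return bytes(out)

# First control bytes from the dump: 0xde 0x00 (repeat 0x00 35 times),
# then 0x12 followed by 19 literal bytes (placeholder zeros here).
sample = bytes.fromhex("de0012") + bytes(19)
print(len(decode_escp2_rle(sample, 360)))   # 54 bytes expanded so far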
I'm trying to save a correctly formatted JSON file to AWS S3.
I can save a regular data frame to S3 with, e.g.:
library(tidyverse)
library(aws.s3)
s3save(mtcars, bucket = "s3://ourco-emr/", object = "tables/adhoc.db/mtcars/mtcars")
But I need to get mtcars into JSON format, specifically ndjson (newline-delimited JSON).
I am able to create a correctly formatted JSON file with, e.g.:
predictions_file <- file("mtcars.json")
jsonlite::stream_out(mtcars, predictions_file)
This saves a file to my directory called mtcars.json.
However, with the aws.s3 function s3save(), I need to send an object that's in memory, not a file.
Tried:
predictions_file <- file("mtcars.json")
s3write_using(mtcars,
FUN = jsonlite::stream_out,
con = predictions_file,
"s3://ourco-emr/",
object = "tables/adhoc.db/mtcars/mtcars")
Gives:
Error in if (verbose) message("opening ", is(con), " output connection.") :
argument is not interpretable as logical
I tried the same code block but left out the con = predictions_file line; that just gave:
Argument con must be a connection.
If the function jsonlite::stream_out() creates a correctly formatted JSON file, how can I then write that file to S3?
Edit:
The desired JSON output would look like this:
{"mpg":21,"cyl":6,"disp":160,"hp":110,"drat":3,"wt":2,"qsec":16,"vs":0,"am":1,"gear":4,"carb":4,"year":"2020","month":"03","day":"05"}
{"mpg":21,"cyl":6,"disp":160,"hp":110,"drat":3,"wt":2,"qsec":17,"vs":0,"am":1,"gear":4,"carb":4,"year":"2020","month":"03","day":"05"}
{"mpg":22,"cyl":4,"disp":108,"hp":93,"drat":35,"wt":2,"qsec":18,"vs":1,"am":1,"gear":4,"carb":1,"year":"2020","month":"03","day":"05"}
{"mpg":21,"cyl":6,"disp":258,"hp":110,"drat":8,"wt":3,"qsec":19,"vs":1,"am":0,"gear":3,"carb":1,"year":"2020","month":"03","day":"05"}
{"mpg":18,"cyl":8,"disp":360,"hp":175,"drat":3,"wt":3,"qsec":17,"vs":0,"am":0,"gear":3,"carb":2,"year":"2020","month":"03","day":"05"}
When attempting with readChar():
mtcars_string <- readChar("mtcars.json", 1e6)
s3save(mtcars_string, bucket = "s3://ourco-emr/", object = "tables/adhoc.db/mtcars/2020/03/06/mtcars")
If I then download and open the resulting JSON file, it looks like this:
5244 5833 0a58 0a00 0000 0300 0306 0000
0305 0000 0000 0555 5446 2d38 0000 0402
0000 0001 0004 0009 0000 000d 6d74 6361
7273 5f73 7472 696e 6700 0000 1000 0000
0100 0400 0900 0012 347b 226d 7067 223a
3231 2c22 6379 6c22 3a36 2c22 6469 7370
So it looks like a serialized R object (an .Rdata file; note the "RDX3" signature at the start) has been sent to AWS S3, as opposed to JSON.
I had the same problem. I need to write and upload JSON lines (ndjson) to S3 and, as far as I know, only stream_out() from the jsonlite package writes JSON lines.
stream_out() accepts only connection objects as a destination; s3write_using(), however, writes to a temporary file tmp and passes the path to that file as a string to FUN, so stream_out() throws the error:
Argument con must be a connection.
A tentative fix is to modify s3write_using() so that it passes a connection to FUN instead of a file-path string.
trace(s3write_using, edit = TRUE) opens an editor.
Change line 5:
value <- FUN(x, tmp, ...)
To this:
value <- FUN(x, file(tmp), ...)
You can then upload the data using stream_out():
s3write_using(x = data,
FUN = stream_out,
bucket = 'mybucket',
object = 'my/object.json',
opts = list(acl = "private", multipart = FALSE, verbose = T, show_progress = T))
The edit remains in effect for the whole session, or until you call untrace(s3write_using).
One should probably file a request in the cloudyr/aws.s3 GitHub repository, as this seems to be a common use case.
I was using sox to convert a 2-channel, 48000 Hz, 24-bit WAV file (new.wav) to a mono WAV file (post.wav).
Here are the related commands and outputs:
[Farmer#Ubuntu recording]$ soxi new.wav
Input File : 'new.wav'
Channels : 2
Sample Rate : 48000
Precision : 24-bit
Duration : 00:00:01.52 = 72901 samples ~ 113.908 CDDA sectors
File Size : 447k
Bit Rate : 2.35M
Sample Encoding: 24-bit Signed Integer PCM
[Farmer#Ubuntu recording]$ sox new.wav -c 1 post.wav
[Farmer#Ubuntu recording]$ soxi post.wav
Input File : 'post.wav'
Channels : 1
Sample Rate : 48000
Precision : 24-bit
Duration : 00:00:01.52 = 72901 samples ~ 113.908 CDDA sectors
File Size : 219k
Bit Rate : 1.15M
Sample Encoding: 24-bit Signed Integer PCM
It looks fine. But let us check the header of post.wav:
[Farmer#Ubuntu recording]$ xxd post.wav | head -10
00000000: 5249 4646 9856 0300 5741 5645 666d 7420 RIFF.V..WAVEfmt
00000010: 2800 0000 feff 0100 80bb 0000 8032 0200 (............2..
00000020: 0300 1800 1600 1800 0400 0000 0100 0000 ................
00000030: 0000 1000 8000 00aa 0038 9b71 6661 6374 .........8.qfact
00000040: 0400 0000 c51c 0100 6461 7461 4f56 0300 ........dataOV..
This is the standard wav file header structure.
The first line is no problem.
The second line's "2800 0000" shows the size of the "fmt " sub-chunk; it should be 0x00000028 (as this is little endian) = 40 bytes. But there are 54 bytes between the "fmt " sub-chunk and the "data" sub-chunk.
The third line shows that "ExtraParamSize" is 0x0016 = 22 bytes. But actually there are 36 bytes (from the third line's "1600" to the 5th line's "0100"). The 16 bytes before that are the standard fmt fields.
So what are the extra 36 bytes?
OK, I found the answer.
Looking at the second line, we can see that the audio format is "feff"; the actual value is 0xFFFE, so this is not the standard PCM WAV format but an extensible format (WAVE_FORMAT_EXTENSIBLE).
For a detailed introduction to the WAV header, you can refer to this link. The article is well written, and thanks to the author.
So, since this is a non-PCM WAV file, it is no problem that the "fmt " chunk occupies 40 bytes; it is followed by a "fact" chunk and then the "data" chunk, so everything makes sense.
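As a quick sanity check, here is a minimal RIFF chunk walker (a sketch in Python, not part of sox; the expected output in the comments is inferred from the dump above) that shows the same layout:

import struct

def list_wav_chunks(path):
    """Walk the RIFF chunks of a WAV file and print their IDs and sizes."""
    with open(path, "rb") as f:
        riff, size, wave = struct.unpack("<4sI4s", f.read(12))
        assert riff == b"RIFF" and wave == b"WAVE"
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            chunk_id, chunk_size = struct.unpack("<4sI", header)
            body = f.read(chunk_size)
            if chunk_id == b"fmt ":
                fmt_tag = struct.unpack("<H", body[:2])[0]
                print(chunk_id, chunk_size, "format tag = 0x%04X" % fmt_tag)
            else:
                print(chunk_id, chunk_size)
            if chunk_size % 2:          # chunks are word-aligned
                f.read(1)

list_wav_chunks("post.wav")
# Expected for the dump above:
#   b'fmt ' 40 format tag = 0xFFFE   (WAVE_FORMAT_EXTENSIBLE)
#   b'fact' 4                        (72901 samples)
#   b'data' 218703                   (72901 samples * 3 bytes)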
I want to decrypt some XORed content. If you want, you can download the file here.
The file extension is .bin, but the content looks like hex to me and not binary; I'm not sure what kind of content it is.
The content looks like this:
2007 0b54 180a 541d 1318 1a00 541c 0654
0a0c 0606 065a 9854 0caa 2624 3000 0c04
260c 102c b435 fcaa b2ab acbf 32b2 aeb9
34b9 a0a8 a425 b6a9 809c bcb7 a8bb 2e34
eaa7 a835 80aa 8625 b8a7 aebc 2cbb 9e9d
329c bcaf 3493 a080 a625 aab9 329c bcaf
34b1 aab6 aab3 3431 b0a8 bebf b6ad 3634
b0af 849d 329c b225 faab acba b4af 3a93
32aa a0a9 a6b3 b80a 0a
And if it is hex, why is it space-delimited every 4 characters?
I think it can't be base64, because when I try to run the following command I get an error:
a#ubuntu:~/Downloads$ base64 -d enigma.bin>enigma.txt
base64: invalid input
Second, my goal is to find the key, so I tried xortool:
a#ubuntu:~/Downloads$ xortool enigma.bin
The most probable key lengths:
3: 15.1%
6: 19.3%
9: 13.6%
12: 15.3%
15: 9.4%
18: 10.9%
20: 4.4%
24: 5.3%
30: 3.4%
36: 3.4%
Key-length can be 3*n
Most possible char is needed to guess the key!
So I tried the most common characters, like space (0x20) or E T A O I N S H R D L U, but I had no luck. My guess is still that I got the encoding wrong.
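For reference, this is the kind of per-key-position frequency analysis that xortool automates; below is a minimal sketch in Python (the key length of 3 is xortool's top guess, and treating 0x20 as the most common plaintext byte is only an assumption):

from collections import Counter

KEY_LEN = 3      # most probable key length reported by xortool
GUESS = 0x20     # assumed most common plaintext byte (space)

with open("enigma.bin", "rb") as f:
    data = f.read()
# If the file actually contains ASCII hex text, convert it to raw bytes first:
# data = bytes.fromhex(data.decode().replace(" ", "").replace("\n", ""))

key = bytearray()
for pos in range(KEY_LEN):
    column = data[pos::KEY_LEN]                     # every KEY_LEN-th byte
    top_byte, _ = Counter(column).most_common(1)[0]
    key.append(top_byte ^ GUESS)                    # key byte that maps it to the guess

plain = bytes(b ^ key[i % KEY_LEN] for i, b in enumerate(data))
print(key.hex(), plain[:64])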
Below I have some raw data. My goal is to group matching 'column one' values and output the total number of bytes on a single line for each IP address.
Example output:
81.220.49.127 6654
81.226.10.238 328
81.227.128.93 84700
Raw Data:
81.220.49.127 328
81.220.49.127 328
81.220.49.127 329
81.220.49.127 367
81.220.49.127 5302
81.226.10.238 328
81.227.128.93 84700
Can anyone advise me on how to do this?
Using an associative array:
awk '{a[$1]+=$2}END{for (i in a){print i,a[i]}}' infile
Alternative to preserve order:
awk '!($1 in a){b[++cont]=$1}{a[$1]+=$2}END{for (c=1;c<=cont;c++){print b[c],a[b[c]]}}' infile
Another way, where arrays are not needed (note that this relies on the input already being grouped by IP):
awk 'lip != $1 && lip != ""{print lip,sum;sum=0}
{sum+=$NF;lip=$1}
END{print lip,sum}' infile
Result
81.220.49.127 6654
81.226.10.238 328
81.227.128.93 84700