Why is raster filesize so much different than objectsize? - r

I have a 1.2 GB .csv file on my disk. I read it with R's filename = read.csv(path) function and then check the object size via object.size(filename), and it turns out to be 3721 MB. Why is there such a difference?

A CSV file is a plain text file and might look like this:
1,2,3,4
3,2,3,2
3,4,2,1
Each character (i.e. each digit and comma) is one byte. This file is 24 bytes in size (there's an invisible "new line" character at the end of each row).
When read into R, each number is stored as a floating-point number, which takes 8 bytes. The 12 values in the file above would then take 8*12 = 96 bytes.
It can go the other way. If the above file was instead written:
1.0000000000, 2.0000000000, 3.00000000000, 4.000000000
[etc]
then in the CSV each number takes about 12 bytes (each digit, decimal point, comma and zero takes a byte), while each value read into R would still take only 8 bytes as a floating-point number.
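The arithmetic above can be checked with a minimal sketch. It is written in Python rather than R, purely to count bytes; the variable names are illustrative, and the 8-byte figure assumes each value is stored as a double.

```python
from array import array

csv_text = "1,2,3,4\n3,2,3,2\n3,4,2,1\n"

# On disk: one byte per character, including commas and newlines.
text_bytes = len(csv_text.encode("ascii"))

# In memory: parse every field and store it as an 8-byte double.
values = array("d", (float(field)
                     for line in csv_text.splitlines()
                     for field in line.split(",")))
binary_bytes = len(values) * values.itemsize

print(text_bytes, len(values), binary_bytes)  # 24 12 96
```

The ratio between the two sizes depends only on how many characters each value occupies in the text file, which is why it can go either way.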

Related

Simple string encryption - safety of higher ASCII characters

I am trying to create a simple encryption scheme for strings. Each character of the string is given another ASCII value.
It entails writing ASCII characters up to 246 to a simple file on disk.
I want to find out if it is safe to write these special characters to the disk or can it cause untoward results. Thanks for your help.
Edit: I am considering algorithm similar to following:
* Convert each character of string to its integer number (hence 110 for 'n' and 122 for 'z')
* Double that number (get 220 and 244)
* Convert this to character (will get extended ascii codes)
* Save these characters to file.
Is it safe to save these extended ascii characters to disk files using usual text file writing functions?
There is only a limited set of ASCII characters. There are 95 printable characters, such as 'A' and the space character. There are 33 non-printable control characters, such as Line Feed, Carriage Return, NUL and DELETE. So you cannot use 246 ASCII characters, as only 128 are available in total: ASCII is strictly 7 bits, giving you 2^7 = 128 possible values.
Even if you used the ISO 8859 Latin character set or the Windows-1252 character set, you would still have the unprintable control characters to deal with, leaving you with 256 - 33 - 5 = 218 characters, since Windows-1252 still has 5 undefined characters.
What you can do, of course, is save your data as bytes. Each byte has 256 possible values (usually 0 to 255 or -128 to 127). As long as you open files as binary, this poses no problem.
You can of course store as many characters in a file as you want, up to the file system or operating system limit. So I presume you didn't ask that.
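As a rough illustration of the "save your data as bytes" advice, here is a sketch of the doubling scheme from the question, written to a file opened in binary mode. The function names and the file name are made up for this example, and the input is assumed to be 7-bit ASCII so that every doubled value still fits in one byte.

```python
import os
import tempfile

def encode(text: str) -> bytes:
    """Double each character code (the scheme described in the question)."""
    codes = text.encode("ascii")          # raises if a character is not 7-bit ASCII
    return bytes(2 * c for c in codes)    # 2 * 127 = 254 still fits in one byte

def decode(data: bytes) -> str:
    """Halve each byte to recover the original character codes."""
    return bytes(b // 2 for b in data).decode("ascii")

path = os.path.join(tempfile.mkdtemp(), "secret.bin")  # illustrative file name
with open(path, "wb") as f:               # binary mode: bytes are written unchanged
    f.write(encode("nz"))
with open(path, "rb") as f:
    stored = f.read()

print(list(stored), decode(stored))       # [220, 244] nz
```

Because the file is never treated as text, no character set or line-ending translation is applied to the values above 127.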

How to view all special characters

I am having a hard time removing the special characters from a CSV file.
I have done a head -1, so I am comparing only 1 row.
wc filename shows a byte count of 1396.
If I go to the end of the file, the cursor ends at 1394.
In vi I do set list (to check for control characters) and I see a $ (nothing after that), so I now know that is byte 1395.
Can someone please tell me where the 1396th byte is?
I am trying to compare 2 files using diff and it's giving me a lot of trouble.
Please help.
The last 2 bytes of your line are \r\n - this is a Windows line ending. dos2unix converts this into a Unix line ending, which is \n - hence the line is shortened by 1 byte following conversion.
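The one-byte difference can be seen in a small sketch. The row content is made up, and the replace call only mimics what a dos2unix-style conversion does to a line ending.

```python
line_windows = b"a,b,c\r\n"   # made-up row with a Windows (CRLF) line ending
line_unix = line_windows.replace(b"\r\n", b"\n")  # mimic a dos2unix-style conversion

print(len(line_windows), len(line_unix))          # 7 6
print(line_windows[-2:], line_unix[-1:])          # b'\r\n' b'\n'
```

Reading the file in binary mode like this makes the carriage return visible as an ordinary byte.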

Why would a read-in variable consume way more memory than the file's storage size in R

When I tried to read a big file of actual size 672MB into R, the system memory usage exploded from 0.98 GB to 3.6 GB (I'm using a desktop with 4 GB of memory). This means it takes several times the file's size to hold it in memory, and I cannot do any calculation after reading it in for lack of memory. Is that normal?
The code I've used: a=read.table(file.choose(),header=T,colClasses="integer",nrows=16777777,comment.char="",sep="\t")
The file contains 167772XX lines.
I ran gc() before and after reading the file,
but I am not sure what the output means.
Your text file is 672MB. Assuming all your integers are 1 digit, it's perfectly reasonable that your R object is about 2*672MB.
Each character in a text file is 1 byte. R stores integers in 4 bytes (see ?integer). That means your file contains ~336MB of "\t" and ~336MB of integers stored as 1-byte characters.
R reads those 1-byte characters, stores them as 4-byte integers and... 336*4 = 1344MB. The second row and second column of your gc output reads 1345.6, which equals 1344MB + the original 1.6MB.
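The same estimate as a minimal sketch, in Python only for the arithmetic. The function name and parameters are hypothetical, and it assumes every value has the same number of digits and is followed by exactly one separator byte.

```python
def estimate_mb(file_mb: float, digits_per_value: int = 1, bytes_per_int: int = 4) -> float:
    """Rough in-memory size (MB) for a tab-separated text file of integers."""
    bytes_per_field_on_disk = digits_per_value + 1    # digits plus one separator byte
    values_in_millions = file_mb / bytes_per_field_on_disk
    return values_in_millions * bytes_per_int

print(estimate_mb(672))  # 1344.0
```

With longer numbers the on-disk cost per value grows while the in-memory cost stays at 4 bytes, so the ratio shrinks.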

How to read a non-standard DBF memo (BLOB) file from ACT?

I am trying to convert data from Act 2000 to a MySQL database. I have successfully imported the DBF files into individual MySQL tables. However I am having issues with the *.BLB file, which seems to be a non-standard memo file.
The DBF files identify themselves as dBASE III Plus, no memo format. There is a single *.BLB file, which is a memo file that multiple DBFs share for BLOB data.
If you read this document (http://cicorp.com/act/sdk/ACT6-SDK-ChapterA.htm#_Toc483994053),
you can see that the REGARDING column is a 6-character one. The description is: This 6-byte field is supplied by the system and contains a reference to a field in the Binary Large Object (BLOB) Database.
Now upon opening the *.BLB I can see that the block size is 64 bytes. All the blocks of text are NULL padded out to that size.
Where I am stumbling is in converting the values stored in the REGARDING column to block locations in the BLB file. My assumption is that the 6-character field is an offset.
For example, one value for REGARDING is, (ignoring the square brackets): [ ",J$]
In my Googling, I found this: http://ulisse.elettra.trieste.it/services/doc/dbase/DBFstruct.htm#C1.5
It explains that in memo fields (in normal DBF files at least) the space value is ignored (i.e. it's padding out the column).
Therefore if I'm correct (again, square brackets) [",J$] should be the offset in my BLB file. Luckily I've still got access to the original ACT2000 software, so I can compare the full text in the program / MySQL and BLB file.
Using my example value, I know that the DB row with REGARDING value of [ ",J$] corresponds to a 1024 byte offset (or 16 blocks, assuming my guess of a 64 byte sized block).
I've tried reading some Python code for open source projects that read DBF files - but I'm in over my head.
I think what I need to do is unpack the characters to binary, but am not sure.
How can I find the 64-block based spot to read from based on what's found in the DBF files?
EDIT by Jerry Dodge
I've attempted to reverse-engineer the strings in this field to hexadecimal values, and then to an integer value using StrToInt64, but the result still does not match up with the blob file. I've also tried multiplying this integer value by 64 and not multiplying, but the result keeps winding up outside of the size of the blob file, not actually finding any data.
For example, a value of ___/BD (_ = space) translates to $2f4244 hexadecimal, which in turn translates to the integer value of 3097156, but does not correspond with any relevant portion of data in the blob file, even when multiplied or divided by 64.
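The conversion described in this edit can be reproduced with a short sketch. It is in Python rather than the original tooling, and treating the bytes as a big-endian integer is only the interpretation tried above, not a confirmed description of the BLB format.

```python
field = b"   /BD"                     # 6-byte REGARDING value from the example
significant = field.lstrip(b" ")      # drop the space padding -> b"/BD"

as_hex = significant.hex()
as_int = int.from_bytes(significant, byteorder="big")

print(as_hex, as_int)                 # 2f4244 3097156
```

This confirms the hexadecimal and integer values quoted above; it says nothing about how ACT actually maps the field to a block.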
According to the SDK you linked, the following happens as I understand:
There is a TYPE field (right behind REGARDING) that encodes what REGARDING is used for (see the second table of the linked chapter). So I'd assume that if type=6 (meeting not held) the REGARDING is either irrelevant or only contains a meeting ID reference from some other table. On that line of thought I would only expect REGARDING to be a BLB offset if type=101 (or possibly 100). I'd also not abandon the thought that in these relevant cases TYPE might be a concatenation of BLB file index and offset (because there is a mention that each file must not be longer than 30K chars and I really expect to be able to store much more data even in one table).

Delimiting binary sequences

I need to be able to delimit a stream of binary data. I was thinking of using something like the ASCII EOT (End of Transmission) character to do this.
However I'm a bit concerned -- how can I know for sure that the particular binary sequence used for this (0b00000100) won't appear in my own binary sequences, thus giving a false positive on delimitation?
In other words, how is binary delimiting best handled?
EDIT: ...Without using a length header. Sorry guys, should have mentioned this before.
You've got five options:
1. Use a delimiter character that is unlikely to occur. This runs the risk of you guessing incorrectly. I don't recommend this approach.
2. Use a delimiter character and an escape sequence to include the delimiter. You may need to double the escape character, depending upon what makes for easier parsing. (Think of the C \0 to include an ASCII NUL in some content.)
3. Use a delimiter phrase that you can determine does not occur. (Think of the MIME message boundaries.)
4. Prepend a length field of some sort, so you know to read the following N bytes as data. This has the downside of requiring you to know this length before writing the data, which is sometimes difficult or impossible.
5. Use something far more complicated, like ASN.1, to completely describe all your content for you. (I don't know if I'd actually recommend this unless you can make good use of it -- ASN.1 is awkward to use in the best of circumstances, but it does allow completely unambiguous binary data interpretation.)
Usually, you wrap your binary data in a well-known format, for example with a fixed header that describes the subsequent data. If you are trying to find delimiters in an unknown stream of data, you usually need an escape sequence. For example, something like HDLC, where 0x7E is the frame delimiter. Data must be encoded such that if there is a 0x7E inside the data, it is replaced with 0x7D followed by the original byte XORed with 0x20. A 0x7D in the data stream is escaped the same way.
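The HDLC-style escaping described above can be sketched as follows. The function names are made up for this example; the XOR mask 0x20 is the value used in HDLC-style asynchronous framing, and the sketch does not handle malformed input such as a frame ending in a lone escape byte.

```python
FLAG, ESC, MASK = 0x7E, 0x7D, 0x20

def stuff(payload: bytes) -> bytes:
    """Escape FLAG and ESC bytes so FLAG never appears inside the payload."""
    out = bytearray()
    for b in payload:
        if b in (FLAG, ESC):
            out += bytes([ESC, b ^ MASK])   # escape byte, then the masked original
        else:
            out.append(b)
    return bytes(out)

def unstuff(data: bytes) -> bytes:
    """Reverse stuff(): after an ESC, unmask the byte that follows it."""
    out = bytearray()
    it = iter(data)
    for b in it:
        if b == ESC:
            b = next(it) ^ MASK
        out.append(b)
    return bytes(out)

payload = bytes([0x01, 0x7E, 0x02, 0x7D, 0x03])
frame = bytes([FLAG]) + stuff(payload) + bytes([FLAG])  # delimiters only at the ends
print(frame.hex(" "))  # 7e 01 7d 5e 02 7d 5d 03 7e
```

The overhead depends on the data: each escaped byte costs one extra byte, so the worst case doubles the payload size.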
If the binary records can really contain any data, try adding a length before the data instead of a marker after the data. This is sometimes called a prefix length because the length comes before the data.
Otherwise, you'd have to escape the delimiter in the byte stream (and escape the escape sequence).
You can prepend the size of the binary data before it. If you are dealing with streamed data and don't know its size beforehand, you can divide it into chunks and have each chunk begin with a size field.
If you set a maximum size for a chunk, you will end up with all but the last chunk the same length which will simplify random access should you require it.
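A minimal sketch of this chunking idea, using in-memory streams. The function names, the 2-byte little-endian size field, and the zero-length chunk used as an end marker are all assumptions made for this illustration.

```python
import io
import struct

def write_chunked(src, dst, max_chunk=4):
    """Copy src to dst as size-prefixed chunks of at most max_chunk bytes."""
    while True:
        chunk = src.read(max_chunk)
        dst.write(struct.pack("<H", len(chunk)))  # 2-byte little-endian size field
        if not chunk:
            break                                  # zero-length chunk marks the end
        dst.write(chunk)

def read_chunked(src):
    """Reassemble the original data from size-prefixed chunks."""
    parts = []
    while True:
        (size,) = struct.unpack("<H", src.read(2))
        if size == 0:
            break
        parts.append(src.read(size))
    return b"".join(parts)

data = b"\x00\x04hello\n\x7e"          # arbitrary bytes, including "delimiter-like" ones
encoded = io.BytesIO()
write_chunked(io.BytesIO(data), encoded)
encoded.seek(0)
print(read_chunked(encoded) == data)   # True
```

The writer never needs to know the total length in advance; it only needs the length of the chunk it is about to emit.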
As a space-efficient and fixed-overhead alternative to prepending your data with size fields and escaping the delimiter character, the escapeless encoding can be used to trim off that delimiter character, probably together with other characters that should have special meaning, from your data.
@sarnold's answer is excellent, and here I want to share some code to illustrate it.
First, here is a wrong way to do it: using a \n delimiter. Don't do it! The binary data could contain \n, and it would be mixed up with the delimiters:
import os, random
with open('test', 'wb') as f:
    for i in range(100):                     # create 100 binary sequences of random
        length = random.randint(2, 100)      # length (between 2 and 100)
        f.write(os.urandom(length) + b'\n')  # separated with the character b"\n"
with open('test', 'rb') as f:
    for i, l in enumerate(f):
        print(i, l)                          # oops we get 123 sequences! wrong!
...
121 b"L\xb1\xa6\xf3\x05b\xc9\x1f\x17\x94'\n"
122 b'\xa4\xf6\x9f\xa5\xbc\x91\xbf\x15\xdc}\xca\x90\x8a\xb3\x8c\xe2\x07\x96<\xeft\n'
Now the right way to do it (option #4 in sarnold's answer):
import os, random
with open('test', 'wb') as f:
    for i in range(100):
        length = random.randint(2, 100)
        f.write(length.to_bytes(2, byteorder='little'))  # prepend the data with the length of the next data chunk, packed in 2 bytes
        f.write(os.urandom(length))
with open('test', 'rb') as f:
    i = 0
    while True:
        l = f.read(2)  # read the length of the next chunk
        if l == b'':   # end of file
            break
        length = int.from_bytes(l, byteorder='little')
        s = f.read(length)
        print(i, s)
        i += 1
...
98 b"\xfa6\x15CU\x99\xc4\x9f\xbe\x9b\xe6\x1e\x13\x88X\x9a\xb2\xe8\xb7(K'\xf9+X\xc4"
99 b'\xaf\xb4\x98\xe2*HInHp\xd3OxUv\xf7\xa7\x93Qf^\xe1C\x94J)'

Resources