Does R support 8-bit variables?

I am trying to read a large (~700Mb) .csv file into R.
The file contains an array of integers less than 256, with a header row and 2 header columns.
I use:
trainSet <- read.csv(trainFileName)
This eventually barfs with:
Loading Data...
R(2760) malloc: *** mmap(size=151552) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
R(2760) malloc: *** mmap(size=151552) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Error: cannot allocate vector of size 145 Kb
Execution halted
Looking at the memory usage, it conks out at about 3Gb usage on a 6Gb machine with zero page file usage at the time of the crash, so there may be another way to fix it.
If I use:
trainSet <- read.csv(trainFileName, header=TRUE, nrows=100)
classes <- sapply(trainSet, class)
I can see that all the columns are being loaded as "integer" which I think is 32 bits.
Clearly using 3Gb to load a part of a 700Mb .csv file is far from efficient. I wonder if there's a way to tell R to use 8 bit numbers for the columns? This is what I've done in the past in Matlab and it's worked a treat, however, I can't seem to find anywhere a mention of an 8 bit type in R.
Does it exist? And how would I tell read.csv to use it?
Thanks in advance for any help.

The narrow answer is that the add-on package ff allows you to use a more compact representation.
The downside is that the different representation prevents you from passing the data to standard functions.
So you may need to rethink your approach: maybe sub-sampling the data, or getting more RAM.
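To make the idea of a "more compact representation" concrete, here is a minimal sketch in Python rather than R, using only the standard library's array module. It is not how ff works internally; it only shows the storage difference between 8-bit and wider integers. It assumes every value fits in 0..255, and the variable names are illustrative.

```python
from array import array

values = [i % 256 for i in range(1_000_000)]  # stand-in for one CSV column

as_bytes = array("B", values)   # unsigned 8-bit storage, 1 byte per value
as_ints = array("i", values)    # C int storage, typically 4 bytes per value

bytes_compact = as_bytes.itemsize * len(as_bytes)
bytes_wide = as_ints.itemsize * len(as_ints)
print(bytes_compact, bytes_wide)

# Values outside 0..255 are rejected rather than silently wrapped.
try:
    as_bytes.append(256)
except OverflowError as err:
    print("rejected:", err)
```

The trade-off is the same one the answer describes: the compact container saves memory, but code that expects ordinary integers may need a conversion step first.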

Q: Can you tell R to use 8 bit numbers
A: No. (Edit: See Dirk's comments below. He's smarter than I am.)
Q: Will more RAM help?
A: Maybe. Assuming a 64 bit OS and a 64 bit instance of R are the starting point, then "Yes", otherwise "No".
Implicit question A: Will a .csv dataset that is 700 MB be 700 MB when read in by read.csv?
A: Maybe. If it really is all integers, it may be smaller or larger. R takes 4 bytes for each integer, so if most of your integers were in the range of -9 to 10, they might actually "expand" in size when stored as 4 bytes each. At the moment you are only using 1-3 bytes per value, so you would expect about a 50% increase in size. You would want to use colClasses="integer" in the read function. Otherwise they may get stored as factor or as 8-byte "numeric" if there are any data-input glitches.
Implicit question B: If you get the data into the workspace will you be able to work with it?
A: Only maybe. You need at a minimum three times as much memory as your largest objects because of the way R copies on assignment even if it is a copy to its own name.
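The size estimate in the answer to implicit question A can be checked with a little arithmetic. The sketch below is in Python and assumes the values are uniformly distributed over 0..255 and that each value costs one extra byte on disk for its delimiter; both are assumptions, not facts about the asker's file.

```python
def avg_csv_bytes_per_value(values, delimiter_bytes=1):
    """Average bytes each value occupies in a CSV: its digits plus a delimiter."""
    total = sum(len(str(v)) + delimiter_bytes for v in values)
    return total / len(values)

BYTES_PER_R_INTEGER = 4  # R stores "integer" as 32-bit

uniform_0_255 = range(256)  # assumption: every value equally likely
on_disk = avg_csv_bytes_per_value(uniform_0_255)
growth = BYTES_PER_R_INTEGER / on_disk

print(f"CSV: {on_disk:.2f} bytes/value, in memory: {BYTES_PER_R_INTEGER} bytes/value")
print(f"in-memory size / file size: {growth:.2f}")
```

Under these assumptions the in-memory data is only modestly larger than the file (a ratio of about 1.12). Passing delimiter_bytes=0 counts digits only and gives a ratio of about 1.56, which is closer to the "about 50%" figure above. Either way, the 3 Gb observed by the asker is far more than 4-byte integers alone would explain.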

Not trying to be snarky, but the way to fix this is documented in ?read.csv:
These functions can use a surprising amount of memory when reading
large files. There is extensive discussion in the ‘R Data
Import/Export’ manual, supplementing the notes here.
Less memory will be used if ‘colClasses’ is specified as one of
the six atomic vector classes. This can be particularly so when
reading a column that takes many distinct numeric values, as
storing each distinct value as a character string can take up to
14 times as much memory as storing it as an integer.
Using ‘nrows’, even as a mild over-estimate, will help memory
usage.
This example takes a while to run because of I/O, even with my SSD, but there are no memory issues:
R> # In one R session
R> x <- matrix(sample(256,2e8,TRUE),ncol=2)
R> write.csv(x,"700mb.csv",row.names=FALSE)
R> # In a new R session
R> x <- read.csv("700mb.csv", colClasses=c("integer","integer"),
+ header=TRUE, nrows=1e8)
R> gc()
            used  (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    173632   9.3     350000   18.7    350000   18.7
Vcells 100276451 765.1  221142070 1687.2 200277306 1528.0
R> # Max memory used ~1.5Gb
R> print(object.size(x), units="Mb")
762.9 Mb

Related

R: How do I permanently set the amount of memory R will use to the maximum for my machine?

I know that some version of this question has been addressed multiple times in the past, but I think this iteration of this widely shared problem is sufficiently distinct to justify its own response. I would like to permanently set the maximum memory available to R to the largest value that my machine can handle, i.e., not just for a single session. I am running 64-bit R on a Windows 7 machine with 6 GB of RAM.
Currently I am trying to do a conversion of a 10 GB Stata file into a .rds object. On similar smaller objects the compression in the .dta to .rds conversion has been by a factor of four or better, and I (rather surprisingly) have not had any trouble doing dplyr manipulation on objects of 2 to 3 GB (after compression), even when two of them and work product are all in memory at once. This seems to conflict with my previous belief that the amount of physical RAM is the absolute upper limit of what R can handle, as I am fairly certain that between loaded .rds objects and various intermediate work products I have had more than 6 GB of undeleted objects lying about my workspace at one time.
I find conflicting statements about whether the maximum memory size is my actual RAM less OS demands, or my actual RAM, or my actual RAM plus an unknown (to me) amount of virtual RAM (subject to a potentially serious slowdown when you reach into virtual RAM). These file conversions are one-time (per file) jobs and I do not care if they are slow.
Looking at the base R help page on “Memory limits” and the help-pages for memory.size(), it seems that there are multiple distinct limits under Windows, relating to total memory used in a session, available to a single process, allocatable by malloc or contained in a single vector. The individual vectors in my file are only around eight million rows long.
memory.size and memory.limit both report current settings in the neighborhood of 6 GB. I got multiple warning messages saying that I was pressed up against that limit, but the actual error message was something like “cannot allocate vector of length 120 MB”.
So I think there are three distinct questions:
How do I determine the maximum possible memory for each 64-bit R memory setting; and
How many distinct memory settings do I need to make; and
How do I make them permanently, as opposed to for a single session?
Following the advice of @Konrad below, I had this rather puzzling exchange with R/RStudio:
> memory.size()
[1] 424.85
> memory.size(max=TRUE)
[1] 454.94
> memory.size()
[1] 436.89
> memory.size(5000)
[1] 6046
Warning message:
In memory.size(5000) : cannot decrease memory limit: ignored
> memory.size()
[1] 446.27
The first three interactions seem to suggest that there is a hard memory limit on my machine of 455 MB. The second-to-last one, on the other hand, appears to be saying that the memory limit is set at my RAM level, without allowance for the OS, and without using virtual memory. Then the last one goes back to claiming a limit of around 450.
I just tried the recommendation here:
Increasing (or decreasing) the memory available to R processes
but with 6000 MB rather than 500; I'll provide a report.

ffdf object consumes extra RAM (in GB)

I have decided to test the key advantage of ff package - RAM minimal allocation (PC specs: i5, RAM 8Gb, Win7 64 bit, Rstudio).
According to the package description we can manipulate physical objects (files) like virtual ones, as if they were allocated in RAM. Thus, actual RAM usage is reduced greatly (from Gb to kb). The code I used is as follows:
library(ff)
library(ffbase)
setwd("D:/My_package/Personal/R/reading")
x<-cbind(rnorm(1:100000000),rnorm(1:100000000),1:100000000)
system.time(write.csv2(x,"test.csv",row.names=FALSE))
system.time(x <- read.csv2.ffdf(file="test.csv", header=TRUE, first.rows=100000, next.rows=100000000,levels=NULL))
print(object.size(x)/1024/1024)
print(class(x))
The actual file size is 4.5 Gb. Actual RAM usage, as shown by Task Manager, went from 2.92 Gb up to the upper limit (~8 Gb) and then settled at 5.25 Gb.
The object size (by object.size()) is about 12 kb.
My concern is the extra RAM allocation (~2.3 Gb). According to the package description it should have increased by only 12 kb. I don't use any character data.
Maybe I have missed something about the ff package.
Well, I have found a solution to eliminate the use of extra RAM.
First of all, pay attention to the 'first.rows' and 'next.rows' arguments of 'read.table.ffdf' in the ff package.
The first argument ('first.rows') sets the number of rows in the initial chunk and therefore the initial memory allocation. I used the default value (1000 rows).
The extra memory allocation is governed by the second argument ('next.rows'). If you want an ffdf object without extra RAM allocation (in my case, several Gb), choose a number of rows for the subsequent chunks such that the chunk size does not exceed the value of 'getOption("ffbatchbytes")'.
In my case I used 'first.rows=1000' and 'next.rows=1000', and total RAM allocation varied by up to 1 Mb in Task Manager.
Increasing 'next.rows' to 10000 caused RAM to grow by 8-9 Mb.
So these arguments are something to experiment with to find the best trade-off.
Besides, keep in mind that 'next.rows' also affects the time needed to build the ffdf object (measured over several runs):
'first.rows=1000' and 'next.rows=1000' is around 1500 sec. (RAM ~ 1Mb)
'first.rows=1000' and 'next.rows=10000' is around 230 sec. (RAM ~ 9Mb)
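The trade-off between chunk size, memory, and run time is not specific to ff. As a rough illustration, here is a minimal Python sketch of chunked reading with the standard library. The helper read_in_chunks and its chunk_rows parameter are hypothetical names that play the role of 'next.rows'; this is not the ff implementation.

```python
import csv
import io
from array import array
from itertools import islice

def read_in_chunks(handle, chunk_rows):
    """Yield lists of at most chunk_rows parsed rows from an open CSV file."""
    reader = csv.reader(handle)
    next(reader)                      # skip the header row
    while True:
        chunk = list(islice(reader, chunk_rows))
        if not chunk:
            return
        yield chunk

# Tiny in-memory stand-in for a large file on disk.
text = "a,b\n" + "".join(f"{i},{i % 256}\n" for i in range(10))
column_b = array("B")                 # compact long-term storage
n_chunks = 0
for chunk in read_in_chunks(io.StringIO(text), chunk_rows=4):
    # Only one chunk of parsed strings is alive at a time.
    column_b.extend(int(row[1]) for row in chunk)
    n_chunks += 1

print(n_chunks, list(column_b))
```

Only one chunk of parsed text is held at a time, so a smaller chunk_rows lowers peak memory at the cost of more iterations, which mirrors the timings reported above.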

R - Memory allocation besides objects in ls()

I have loaded a fairly large set of data using data.table. I then want to add around 30 columns using instructions of the form:
DT[, x5:=cumsum(y1), by=list(x1, x2)]
DT[, x6:=cummean(y2), by=x1]
At some point I start to get "warnings" like this:
1: In structure(.Call(C_objectSize, x), class = "object_size") :
Reached total allocation of 8072Mb: see help(memory.size)
I check the tracemem(DT) every now and then to assure that no copies are made. The only output I ever get is:
"<0000000005E8E700>"
Also I check ls() to see which objects are in use and object.size() to see how much of my RAM is allocated by the object. The only output of ls() is my data.table and the object size after the first error is 5303.1 Mb.
I am on a Windows 64-bit machine running R in 64-bit and have 8 GB RAM. Of these 8 GB RAM only 80% are in use when I get the warning. Of these R is using 5214.0 Mb (strange since the table is bigger than this).
My question is, if the only RAM R is using is 5303.1 Mb and I still have around 2 Gb of free memory, why do I get the error that R has reached the limit of 8 Gb, and is there anything I can do about it? If not, what are other options? I know I could use bigmemory, but then I would have to rewrite my whole code and would lose the sweet by-reference modifications which data.table offers.
The problem is that the operations require RAM beyond what the object itself takes up. You could verify that Windows is using a page file. If it is, you could try increasing its size. http://windows.microsoft.com/en-us/windows/change-virtual-memory-size
If that fails, you could try to run a live environment of Lubuntu Linux to see if its memory overhead is small enough to allow the operation. http://lubuntu.net/
Ultimately, I suspect you're going to have to use bigmemory or similar.

reading csv in Julia is slow compared to Python

Reading large text / csv files in Julia takes a long time compared to Python. Here are the times to read a file whose size is 486.6 MB and has 153895 rows and 644 columns.
Python 3.3 example
import pandas as pd
import time
start=time.time()
myData=pd.read_csv("C:\\myFile.txt",sep="|",header=None,low_memory=False)
print(time.time()-start)
Output: 19.90
R 3.0.2 example
system.time(myData<-read.delim("C:/myFile.txt",sep="|",header=F,
stringsAsFactors=F,na.strings=""))
Output:
User System Elapsed
181.13 1.07 182.32
Julia 0.2.0 (Julia Studio 0.4.4) example # 1
using DataFrames
timing = @time myData = readtable("C:/myFile.txt",separator='|',header=false)
Output:
elapsed time: 80.35 seconds (10319624244 bytes allocated)
Julia 0.2.0 (Julia Studio 0.4.4) example # 2
timing = @time myData = readdlm("C:/myFile.txt",'|',header=false)
Output:
elapsed time: 65.96 seconds (9087413564 bytes allocated)
Julia is faster than R, but quite slow compared to Python. What can I do differently to speed up reading a large text file?
A separate issue is that the size in memory is 18 x the size of the file on disk in Julia, but only 2.5 x for Python. In Matlab, which I have found to be the most memory efficient for large files, it is 2 x the size of the file on disk. Is there any particular reason for the large size in memory in Julia?
The best answer is probably that I'm not as good a programmer as Wes.
In general, the code in DataFrames is much less well-optimized than the code in Pandas. I'm confident that we can catch up, but it will take some time as there's a lot of basic functionality that we need to implement first. Since there's so much that needs to be built in Julia, I tend to focus on doing things in three parts: (1) build any version, (2) build a correct version, (3) build a fast, correct version. For the work I do, Julia often doesn't offer any versions of essential functionality, so my work gets focused on (1) and (2). As more of the tools I need get built, it'll be easier to focus on performance.
As for memory usage, I think the answer is that we use a set of data structures when parsing tabular data that's much less efficient than those used by Pandas. If I knew the internals of Pandas better, I could list off places where we're less efficient, but for now I'll just speculate that one obvious failing is that we're reading the whole dataset into memory rather than grabbing chunks from disk. This certainly can be avoided and there are issues open for doing so. It's just a matter of time.
On that note, the readtable code is fairly easy to read. The most certain way to get readtable to be faster is to whip out the Julia profiler and start fixing the performance flaws it uncovers.
There is a relatively new julia package called CSV.jl by Jacob Quinn that provides a much faster CSV parser, in many cases on par with pandas: https://github.com/JuliaData/CSV.jl
Note that the "n bytes allocated" output from @time is the total size of all allocated objects, ignoring how many of them might have been freed. This number is often much higher than the final size of live objects in memory. I don't know if this is what your memory size estimate is based on, but I wanted to point this out.
I've found a few things that can partially help this situation.
Using the readdlm() function in Julia seems to work considerably faster (e.g. 3x on a recent trial) than readtable(). Of course, if you want the DataFrame object type, you'll then need to convert to it, which may eat up most or all of the speed improvement.
Specifying dimensions of your file can make a BIG difference, both in speed and in memory allocations. I ran this trial reading in a file that is 258.7 MB on disk:
julia> @time Data = readdlm("MyFile.txt", '\t', Float32, skipstart = 1);
19.072266 seconds (221.60 M allocations: 6.573 GB, 3.34% gc time)
julia> @time Data = readdlm("MyFile.txt", '\t', Float32, skipstart = 1, dims = (File_Lengths[1], 62));
10.309866 seconds (87 allocations: 528.331 MB, 0.03% gc time)
The type specification for your object matters a lot. For instance, if your data has strings in it, then the data of the array that you read in will be of type Any, which is expensive memory wise. If memory is really an issue, you may want to consider preprocessing your data by first converting the strings to integers, doing your computations, and then converting back. Also, if you don't need a ton of precision, using Float32 type instead of Float64 can save a LOT of space. You can specify this when reading the file in, e.g.:
Data = readdlm("file.csv", ',', Float32)
Regarding memory usage, I've found in particular that the PooledDataArray type (from the DataArrays package) can be helpful in cutting down memory usage if your data has a lot of repeated values. The time to convert to this type is relatively large, so this isn't a time saver per se, but at least helps reduce the memory usage somewhat. E.g. when loading a data set with 19 million rows and 36 columns, 8 of which represented categorical variables for statistical analysis, this reduced the memory allocation of the object from 5x its size on disk to 4x its size. If there are even more repeated values, the memory reduction can be even more significant (I've had situations where the PooledDataArray cuts memory allocation in half).
It can also sometimes help to run the gc() (garbage collector) function after loading and formatting data to clear out any unneeded ram allocation, though generally Julia will do this automatically pretty well.
Still though, despite all of this, I'll be looking forward to further developments on Julia to enable faster loading and more efficient memory usage for large data sets.
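The memory saving from a pooled representation, mentioned in the tips above, comes from storing each distinct value once and replacing every entry with a small integer code. Here is a minimal sketch of that idea in Python. It is not the PooledDataArray implementation; the function pool is a hypothetical helper, and it assumes at most 256 distinct values so the codes fit in one byte.

```python
from array import array

def pool(values):
    """Return (levels, codes): unique values plus one small integer per entry."""
    levels = []
    index = {}
    codes = array("B")              # assumption: at most 256 distinct values
    for v in values:
        if v not in index:
            index[v] = len(levels)  # first time seen: assign the next code
            levels.append(v)
        codes.append(index[v])
    return levels, codes

column = ["red", "green", "red", "blue", "green", "red"]
levels, codes = pool(column)
restored = [levels[c] for c in codes]   # decoding is a simple lookup

print(levels, list(codes))
```

The more often values repeat, the larger the saving, which matches the observation that the benefit grows with the share of repeated values. Building the index takes a full pass over the data, which is consistent with the conversion cost noted above.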
Let us first create a file you are talking about to provide reproducibility:
open("myFile.txt", "w") do io
foreach(i -> println(io, join(i+1:i+644, '|')), 1:153895)
end
Now I read this file in in Julia 1.4.2 and CSV.jl 0.7.1.
Single threaded:
julia> @time CSV.File("myFile.txt", delim='|', header=false);
4.747160 seconds (1.55 M allocations: 1.281 GiB, 4.29% gc time)
julia> @time CSV.File("myFile.txt", delim='|', header=false);
2.780213 seconds (13.72 k allocations: 1.206 GiB, 5.80% gc time)
and using e.g. 4 threads:
julia> @time CSV.File("myFile.txt", delim='|', header=false);
4.546945 seconds (6.02 M allocations: 1.499 GiB, 5.05% gc time)
julia> @time CSV.File("myFile.txt", delim='|', header=false);
0.812742 seconds (47.28 k allocations: 1.208 GiB)
In R it is:
> system.time(myData<-read.delim("myFile.txt",sep="|",header=F,
+ stringsAsFactors=F,na.strings=""))
user system elapsed
28.615 0.436 29.048
In Python (Pandas) it is:
>>> import pandas as pd
>>> import time
>>> start=time.time()
>>> myData=pd.read_csv("myFile.txt",sep="|",header=None,low_memory=False)
>>> print(time.time()-start)
25.95710587501526
Now if we test fread from R (which is fast) we get:
> system.time(fread("myFile.txt", sep="|", header=F,
stringsAsFactors=F, na.strings="", nThread=1))
user system elapsed
1.043 0.036 1.082
> system.time(fread("myFile.txt", sep="|", header=F,
stringsAsFactors=F, na.strings="", nThread=4))
user system elapsed
1.361 0.028 0.416
So in this case the summary is:
despite the cost of compilation of CSV.File in Julia when you run it for the first time it is significantly faster than base R or Python
it is comparable in speed to fread in R (in this case slightly slower, but other benchmarks made here show cases where it is faster)
EDIT: Following the request I have added a benchmark for a small file: 10 columns, 100,000 rows Julia vs Pandas.
Data preparation step:
open("myFile.txt", "w") do io
foreach(i -> println(io, join(i+1:i+10, '|')), 1:100_000)
end
CSV.jl, single threaded:
julia> @time CSV.File("myFile.txt", delim='|', header=false);
1.898649 seconds (1.54 M allocations: 93.848 MiB, 1.48% gc time)
julia> @time CSV.File("myFile.txt", delim='|', header=false);
0.029965 seconds (248 allocations: 17.037 MiB)
Pandas:
>>> import pandas as pd
>>> import time
>>> start=time.time()
>>> myData=pd.read_csv("myFile.txt",sep="|",header=None,low_memory=False)
>>> print(time.time()-start)
0.07587623596191406
Conclusions:
the compilation cost is a one-time cost that has to be paid and it is constant (roughly, it does not depend on how big the file you want to read in is)
for small files CSV.jl is faster than Pandas (if we exclude compilation cost)
Now, if you would like to avoid having to pay compilation cost on every fresh Julia session this is doable with https://github.com/JuliaLang/PackageCompiler.jl.
In my experience, if you are doing data science work where you read in, e.g., thousands of CSV files, waiting 2 seconds for compilation is not a problem if it later saves hours. It takes more than 2 seconds to write the code that reads in the files.
Of course, if you write a script that does little work and terminates after it is done, then it is a different use case, as compilation time would make up the majority of the computational cost. In this case using PackageCompiler.jl is the strategy I use.
In my experience, the best way to deal with larger text files is not to load them up into Julia, but rather to stream them. This method has some additional fixed costs, but generally runs extremely quickly. Some pseudo code is this:
function streamdat()
    mycsv = open("/path/to/text.csv", "r")   # <-- opens a path to your text file
    total = 0.0                              # <-- store a running sum here
    while !eof(mycsv)                        # <-- loop through each line of the file
        row = readline(mycsv)
        vector = split(row, "|")             # <-- split each line by |
        total += parse(Float64, vector[1])   # <-- add the first field of the row
    end
    close(mycsv)                             # <-- release the file handle
    return total
end
streamdat()
The code above is just a simple sum, but this logic can be expanded to more complex problems.
using CSV
@time df=CSV.read("C:/Users/hafez/personal/r/tutorial for students/Book2.csv")
I recently tried this in Julia 1.4.2. I got a different result, and at first I didn't understand what Julia was doing. I then posted the same thing in the Julia discussion forums and learned that this code mostly measures compile time. Here you can find a benchmark.

R memory issue with memory.limit()

I am running some simulations on a machine with 16GB memory. First, I met some errors:
Error: cannot allocate vector of size 6000.1 Mb (the number might be not accurate)
Then I tried to allocate more memory to R by using:
memory.limit(1E10)
I chose such a big number because memory.limit would not let me select a value smaller than my system's total memory:
In memory.size(size) : cannot decrease memory limit: ignored
After doing this, I could finish my simulations, but R took around 15GB of memory, which stopped me from doing any post analysis.
I used object.size() to estimate the total memory used by all the generated variables, which came to only around 10GB. I could not figure out where R used the rest of the memory. So my question is: how do I reasonably allocate memory to R without exploding my machine?
Thanks!
R is interpreted, so WYSINAWYG (what you see is not always what you get). As mentioned in the comments, you need more memory than is required to store your objects, because those objects get copied. Also, besides being inefficient, nested for loops may be a bad idea because gc won't run in the innermost loop. If you have any of these, I suggest you try to remove them using vectorised methods, or call gc manually in your loops to force garbage collection, but be warned that this will slow things down somewhat.
The issue of memory required for simple objects can be illustrated by the following example. This code grows a data.frame object. Watch the memory use before and after, and the resulting object size. There is a lot of garbage that is allowed to accumulate before gc is invoked. I think garbage collection is more problematic on Windows than on *nix systems. I am not able to replicate the example at the bottom on Mac OS X, but I can repeatedly on Windows. The loop and more explanations can be found in The R Inferno, page 13.
# Current memory usage in Mb
memory.size()
# [1] 130.61
n = 1000
# Run loop overwriting current objects
my.df <- data.frame(a=character(0), b=numeric(0))
for(i in 1:n) {
this.N <- rpois(1, 10)
my.df <- rbind(my.df, data.frame(a=sample(letters,
this.N, replace=TRUE), b=runif(this.N)))
}
# Current memory usage afterwards (in Mb)
memory.size()
# [1] 136.34
# BUT... Size of my.df
print( object.size( my.df ) , units = "Mb" )
# 0.1 Mb
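The gap between what a grow-by-copy loop needs while running and what the final object occupies can also be seen outside R. The following Python sketch uses the standard tracemalloc module. It is only an analogy: CPython frees most garbage immediately by reference counting, so it will not reproduce R's deferred garbage collection, but the peak still exceeds the final live size because the old and new copies briefly coexist.

```python
import tracemalloc

tracemalloc.start()

rows = []
for i in range(1000):
    # Rebuild the whole list each time, much like rbind() in a loop:
    # the old copy becomes garbage as soon as the new one is bound.
    rows = rows + [(i, float(i))]

current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"live at the end: {current} bytes, peak during the loop: {peak} bytes")
```

Appending in place (rows.append(...)) or collecting the pieces and combining them once avoids the repeated copies, which is the same advice usually given for rbind() in a loop.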
