Uncompress a big .gz file - R

I need to uncompress a transactions.gz file downloaded from Kaggle: approximately 2.86 GB, 350 million rows, 11 columns.
I tried this in RStudio on Windows Vista, 32-bit, with 3 GB of RAM:
transactions <- read.table(gzfile("E:/2014/Proyectos/Kaggle/transactions.gz"))
write.table(transactions, file="E:/2014/Proyectos/Kaggle/transactions.csv")
But I received this error message in the console:
> transactions <- read.table(gzfile("E:/2014/Proyectos/Kaggle/transactions.gz"))
Error: cannot allocate vector of size 64.0 Mb
> write.table(transactions, file="E:/2014/Proyectos/Kaggle/transactions.csv")
Error: cannot allocate vector of size 64.0 Mb
I checked this case, but it didn't work for me: Decompress gz file using R
I would appreciate any suggestions.

This file decompresses to a 22 GB .csv file. You can't process it all at once in R on your 3 GB machine because R needs to read everything into memory. It would be best to process it in an RDBMS like PostgreSQL. If you are intent on using R, you could process it in chunks, reading a manageable number of rows at a time: read a chunk, process it, and then overwrite it with the next chunk (see the sketch below). For this, data.table::fread would be better than the standard read.table.
Oh, and don't decompress in R; just run gunzip from the command line and then process the CSV. If you're on Windows you can use WinZip or 7-Zip.
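If you do go the chunked route, here is a minimal sketch, assuming the archive has already been decompressed to transactions.csv as suggested above; the chunk size and the processing step are placeholders:
library(data.table)
path <- "E:/2014/Proyectos/Kaggle/transactions.csv"
col_names <- names(fread(path, nrows = 0))   # read just the header row
chunk_size <- 1e6                            # rows per chunk, adjust to your RAM
skip <- 1                                    # skip the header line
repeat {
  # only one chunk is held in memory; each fread overwrites the previous one
  chunk <- fread(path, skip = skip, nrows = chunk_size,
                 header = FALSE, col.names = col_names)
  # ... process / aggregate the chunk here ...
  if (nrow(chunk) < chunk_size) break        # last (possibly partial) chunk
  skip <- skip + chunk_size
}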

Related

How to free memory which is used by big.matrix objects of crashed R sessions

I use the bigmemory package to access big matrix objects in parallel, e.g. like this
a <- bigmemory::big.matrix(nrow = 200, ncol = 100, shared = TRUE) # shared = TRUE is the default
However, working with the resulting objects sometimes causes R to crash, which means that the memory used by the matrix objects is not released. The bigmemory manual warns of such a case but presents no solution:
Abruptly closed R (using e.g. task manager) will not have a chance to
finalize the big.matrix objects, which will result in a memory leak, as
the big.matrices will remain in the memory (perhaps under obfuscated names)
with no easy way to reconnect R to them
After a few crashes and restarts of my R process, I get the following error:
No space left on device
Error in CreateSharedMatrix(as.double(nrow),
as.double(ncol), as.character(colnames), :
The shared matrix could not be created
Obviously, my memory is blocked by orphaned big matrices. I tried the command ipcs, which is advertised to list shared memory blocks, but the sizes of the segments listed there are much too small compared to my matrix objects. This also means that ipcrm is of no use here to remove my orphaned objects.
Where does bigmemory store its objects on different operating systems and how do I delete orphaned ones?
Linux
A call to df -h solved the mystery for my operating system (Linux/CentOS).
$ df -h
Filesystem Size Used Avail Use% Mounted on
...
tmpfs 1008G 1008G 0 100% /dev/shm
...
There is a temporary file system in the folder /dev/shm. Files therein exist only in RAM. This file system is used to share data between processes. In this folder were several files with random strings as names, and multiple files with the same prefix, which seem to be related to the same big.matrix object:
$ ls -l /dev/shm
-rw-r--r-- 1 user grp 320000 Apr 26 13:42 gBDEDtvwNegvocUQpYNRMRWP
-rw-r--r-- 1 user grp 8 Apr 26 13:42 gBDEDtvwNegvocUQpYNRMRWP_counter
-rw-r--r-- 1 user grp 32 Apr 26 13:42 sem.gBDEDtvwNegvocUQpYNRMRWP_bigmemory_counter_mutex
Unfortunately, I don't know which matrix belongs to which file, but if you have no R processes running at the time, deleting files with this name pattern should remove the orphaned objects.
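For what it's worth, the cleanup can also be scripted from R. This is a cautious sketch, assuming no R processes are running; the grep pattern is a guess based on the listing above, so inspect the list before deleting anything:
shm_files <- list.files("/dev/shm", full.names = TRUE)
print(shm_files)                              # inspect first
orphans <- shm_files[grepl("bigmemory|^[A-Za-z]{24}", basename(shm_files))]
# unlink(orphans)                             # uncomment to delete the orphans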
Windows
I don't know how other operating systems handle this, so feel free to add that to this community wiki if you know.

Transfer data from a 32-bit session to a 64-bit session in R

I am using R to connect to an enterprise database via ODBC to extract data and do some analysis. The ODBC connection requires some 32-bit .dll files, so I used the 32-bit version of R. However, I need to use 64-bit R for the analysis. I saved the data down in .rds files and tried to pull them back into a 64-bit R session, but I hit an error:
df <- do.call('rbind', lapply(list.files(path = "path", pattern = ".rds"), readRDS))
Error in gzfile(file, "rb") : cannot open the connection
In addition: Warning message:
In gzfile(file, "rb") :
cannot open compressed file 'filename.rds', probable reason 'No such file or directory'
I know I could save the data down to .csv and import it, but there will be a fair amount of formatting required as my data is over 200 columns wide with about every data type represented. I'm wondering if there's a simpler way to get data from a 32 bit session to a 64 bit session without the need for reformatting all the data.
Thanks for your help!
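A minimal sketch of the intended round trip, assuming the .rds files sit in a directory called "path": note that list.files() needs full.names = TRUE, otherwise readRDS() is handed bare file names and cannot find them, which would explain the 'No such file or directory' warning above.
rds_files <- list.files(path = "path", pattern = "\\.rds$", full.names = TRUE)
df <- do.call(rbind, lapply(rds_files, readRDS))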

Cross read parquet files between R and Python

We have generated two parquet files, one in Dask (Python) and another with R/Drill (using the sergeant package). They use different implementations of parquet; see my other parquet question.
We are not able to cross read the files (the python can't read the R file and vice versa).
When reading the Python parquet file in the R environment we receive the following error: system error: IllegalStateException: UTF8 can only annotate binary fields.
When reading the R/Drill parquet file in Dask we get a FileNotFoundError: [Errno 2] No such file or directory: ...\_metadata (which is self-explanatory).
What are the options to cross read parquet files between R and Python?
Any insights would be appreciated.
To read Drill-like parquet data sets with fastparquet/dask, you need to pass a list of the filenames, e.g.:
import glob
import dask.dataframe as dd
files = glob.glob('mydata/*/*.parquet')
df = dd.read_parquet(files)
The error from going in the other direction might be a bug, or (judging from your other question) it may indicate that you used fixed-length strings, which drill/R doesn't support.

(R error) Error: cons memory exhausted (limit reached?)

I am working with big data and I have a 70GB JSON file.
I am using jsonlite library to load in the file into memory.
I have tried an AWS EC2 x1.16xlarge machine (976 GB RAM) to perform this load, but R breaks with the error:
Error: cons memory exhausted (limit reached?)
after loading in 1,116,500 records.
Thinking that I did not have enough RAM, I tried to load the same JSON on a bigger EC2 machine with 1.95 TB of RAM.
The process still broke after loading 1,116,500 records. I am using R version 3.1.1 and I am executing it with the --vanilla option. All other settings are default.
Here is the code:
library(jsonlite)
data <- jsonlite::stream_in(file('one.json'))
Any ideas?
There is a handler argument to stream_in that lets you process the data in batches as it streams in, so you could write the parsed data to a file or filter out the data you don't need.
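For example, a minimal sketch, where the columns to keep ("id" and "amount") and the page size are placeholders: each page of parsed records is handed to the handler, which writes only the needed columns to disk instead of accumulating everything in memory.
library(jsonlite)
out <- file("one_filtered.csv", open = "w")
first <- TRUE
stream_in(file("one.json"), handler = function(df) {
  keep <- df[, c("id", "amount")]              # hypothetical columns
  write.table(keep, out, sep = ",", row.names = FALSE, col.names = first)
  first <<- FALSE                              # write the header only once
}, pagesize = 10000)
close(out)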

R: Cannot allocate memory greater than x MB

I have a main function in R which calls other files to run my program. I call the main file through a bat file (.exe). When I run it line by line it runs without a memory error, but when I call the bat file to run it, it halts and gives me the following error:
Cannot allocate memory greater than 51 MB.
How can I avoid this?
Memory limitations in R such as this are a recurring nightmare for a lot of us.
Very often the problem is a limit imposed by your OS (which can usually be changed from a Bash or PowerShell command line), your architecture (32- vs. 64-bit), or the availability of contiguous free RAM, regardless of the overall available memory.
It's hard to say why something would not cause a memory issue when run line by line, but would hit the memory limit when run as a .bat.
What version of R are you running? Do you have both the 32-bit and 64-bit versions installed? Is the 32-bit version being called by Rscript when you run your .bat file, whereas you run the 64-bit version line by line? You can check the version of R that's being run with R.Version().
You can test this by running memory.limit() in both your R IDE/terminal and in your .bat file (be sure to print or save the result in your .bat run). You might also try setting memory.limit() in your .bat file, as it may just have a smaller default, perhaps due to differences in the Rprofile that's invoked in your IDE or terminal versus the .bat file.
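A quick way to compare the two environments, assuming the .bat file ultimately runs an R script via Rscript: print the architecture and the memory limit in both runs and compare (memory.limit() is Windows-only).
print(R.Version()$arch)       # "x86_64" for 64-bit, "i386" for 32-bit
print(memory.limit())         # current limit in MB (Windows only)
# memory.limit(size = 4000)   # optionally try raising the limit, e.g. to ~4 GB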
If architecture isn't the cause of your memory error, then you have several more troubleshooting steps to try:
Check memory usage in both environments (in R directly and via your .bat process) using this:
sort(sapply(ls(), function(x) object.size(get(x))))
Run the garbage collector explicitly in your scripts; that's the gc() command.
Check all object sizes to make sure there are no unexpected results in your .bat process: sort(sapply(ls(), function(x) format(object.size(get(x)), units = "Mb")))
Try memory profiling:
Rprof(tf <- "rprof.log", memory.profiling = TRUE)
# ... run the code you want to profile here ...
Rprof(NULL)
summaryRprof(tf)
While this is a RAM issue, for good measure you might want to check that the compute power available is both sufficient and not varying between these two ways of running your code: parallel::detectCores()
Examine your performance with Prof. Hadley Wickham's lineprof tool (warning: requires devtools and doesn't work on lines of code that call C code)
References
While I'm pulling these snippets out of my own code, most of them originally came from other, related StackOverflow posts, such as:
Reaching memory allocation in R
R Memory Allocation "Error: cannot allocate vector of size 75.1 Mb"
R memory limit warning vs "unable to allocate..."
How to compute the size of the allocated memory for a general type
R : Any other solution to "cannot allocate vector size n mb" in R?
Yes, you should be using 64-bit R if you can.
See this question, and this from the R docs.
