I was working on something relatively simple: I have three Excel files of ~150MB each, about 240k rows and 145 columns each, and wanted to join them. The thing is, when I open the first file with readxl::read_excel, it suddenly requires 10GB of memory just to read the file, making it impossible for me to open all three (I was barely able to open the first one after several tries and reinstalling readxl), even though once the file is read, the data frame object weighs only 287MB according to object_size().
I'm a bit baffled as to why R needs so much RAM to open my file. Any ideas on what could be happening? Something I might be missing? Any less memory-intensive alternatives?
As extra information, when I opened the file I saw it has filters enabled and some table formatting from Excel.
Thank you very much
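A less memory-intensive alternative that is sometimes suggested for cases like this is to convert the workbook to CSV outside of R and then read the CSV with data.table::fread; a rough sketch, assuming LibreOffice is installed (file1.xlsx is a placeholder name, and the headless converter only exports the active sheet):

# Convert the workbook to CSV with LibreOffice in headless mode
# (placeholder file name; only the active sheet is exported).
system2("libreoffice", args = c("--headless", "--convert-to", "csv", "file1.xlsx"))

# fread is much lighter on memory than parsing the xlsx inside R.
library(data.table)
df1 <- fread("file1.csv")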
I am having an issue where I am trying to get a list of all of the file names in a directory in R. Usually I would use list.files(). However, this particular folder has ~300,000 very small files in it. So when I call list.files(pattern="*.csv", recursive = FALSE), my RStudio session just totally hangs. I waited 24 hours last time and it never sorted itself out... Though oddly enough, my computer doesn't seem to be running out of memory at all.
My question is, is there a way to import a list of file names in a more efficient way? Or is there a way to import a smaller chunk of the file names at a time, e.g. import the first thousand, then the second thousand, then the third thousand, etc.?
I wasn't sure what code samples to include--if there is something that'd be helpful, please let me know :)
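For reference, one rough workaround is to let the operating system enumerate the directory and then work through the names in chunks; a sketch, assuming a Unix-like system (the path is a placeholder):

# Let the shell enumerate the directory instead of list.files().
# Note: pattern in list.files() is a regular expression, so
# "\\.csv$" is the safer spelling of "*.csv".
files <- system2("ls", args = "/path/to/folder", stdout = TRUE)
csvs  <- files[grepl("\\.csv$", files)]

# Work through the names a thousand at a time.
chunks <- split(csvs, ceiling(seq_along(csvs) / 1000))
for (chunk in chunks) {
  # process each chunk of file names here
}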
What my question isn't:
Efficient way to maintain a h2o data frame
H2O running slower than data.table R
Loading data bigger than the memory size in h2o
Hardware/Space:
32 Xeon threads w/ ~256 GB RAM
~65 GB of data to upload. (about 5.6 billion cells)
Problem:
It is taking hours to upload my data into h2o. This isn't any special processing, only "as.h2o(...)".
It takes less than a minute using fread to get the text into memory, and then I make a few row/column transformations (diffs, lags) and try to import.
The total R memory is ~56 GB before trying any sort of as.h2o, so the 128 GB allocated to h2o shouldn't be too crazy, should it?
Question:
What can I do to make this take less than an hour to load into h2o? It should take from a minute to a few minutes, no longer.
What I have tried:
bumping RAM up to 128 GB in h2o.init
using slam, data.table, and options( ...
converting with as.data.frame before as.h2o
writing to a csv file (R's write.csv chokes and takes forever; it is writing a lot of GB though, so I understand)
writing to sqlite3: too many columns for a table, which is weird
checking drive cache/swap to make sure there are enough GB there; perhaps Java is using the cache (still working)
Update:
So it looks like my only option is to make a giant text file and then use "h2o.importFile(...)" for it. I'm up to 15GB written.
Update2:
It is a hideous csv file, at ~22GB (~2.4M rows, ~2300 cols). For what it was worth, it took from 12:53 PM until 2:44 PM to write the csv file. Importing it, once written, was substantially faster.
Think of as.h2o() as a convenience function that does these steps:
converts your R data to a data.frame, if not already one.
saves that data.frame to a temp file on local disk (it will use data.table::fwrite() if available (*), otherwise write.csv())
calls h2o.uploadFile() on that temp file
deletes the temp file
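In code, that is roughly the following sketch (simplified; the real function has more branches and varies by h2o version):

# Rough, simplified sketch of what as.h2o() does under the hood.
as_h2o_sketch <- function(x, destination_frame = "") {
  df  <- as.data.frame(x)               # 1. coerce to data.frame
  tmp <- tempfile(fileext = ".csv")     # 2. write it to a temp csv
  if (requireNamespace("data.table", quietly = TRUE)) {
    data.table::fwrite(df, tmp)
  } else {
    write.csv(df, tmp, row.names = FALSE)
  }
  hf <- h2o::h2o.uploadFile(tmp, destination_frame = destination_frame)  # 3. upload
  unlink(tmp)                           # 4. delete the temp file
  hf
}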
As your updates say, writing huge data files to disk can take a while. But the other pain point here is using h2o.uploadFile() instead of the quicker h2o.importFile(). The deciding factor between the two is visibility:
With h2o.uploadFile() your client has to be able to see the file.
With h2o.importFile() your cluster has to be able to see the file.
When your client is running on the same machine as one of your cluster nodes, your data file is visible to both client and cluster, so always prefer h2o.importFile(). (It does a multi-threaded import.)
Another couple of tips: only bring data into the R session that you actually need there. And remember both R and H2O are column-oriented, so cbind can be quick. If you just need to process 100 of your 2300 columns in R, have them in one csv file, and keep the other 2200 columns in another csv file. Then h2o.cbind() them after loading each into H2O.
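For example (the file names here are made up), something along these lines:

library(h2o)
h2o.init(max_mem_size = "128g")

# Load the two column groups separately; importFile is a multi-threaded,
# cluster-side read. File names are placeholders.
in_r <- h2o.importFile("cols_needed_in_r.csv")    # the ~100 columns you touch in R
rest <- h2o.importFile("remaining_cols.csv")      # the other ~2200 columns
full <- h2o.cbind(in_r, rest)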
*: Use h2o:::as.h2o.data.frame (without parentheses) to see the actual code. For data.table writing you need to first do options(h2o.use.data.table = TRUE); you can also optionally switch it on/off with the h2o.fwrite option.
I noticed this problem when trying to run the following R script.
library(downloader)
download('http://download.cms.gov/nppes/NPPES_Data_Dissemination_Feb_2016.zip',
         dest = 'dataset.zip', mode = 'wb')
npi <- read.csv(unz('dataset.zip', 'npidata_20050523-20160207.csv'),
                as.is = TRUE)
The script kept spinning for some reason so I manually downloaded the data and noticed the compression ratio was 100%.
I am not certain if StackOverflow is the best Exchange for this question, so I am open to moving this question if another Exchange is suggested. The Open Data Exchange might be appropriate, but there isn't very much activity on that site.
My question is this: I work a lot with government-curated data from the Centers for Medicare and Medicaid Services (CMS). The data downloads from this site are in the form of zip files, and occasionally they have zip ratios of 100%. This is clearly impossible, since it would put the uncompressed size at ~800PB. (CMS notes on their site that they estimate the uncompressed size to be ~4GB.) This has affected me on my work computer; I have replicated the problem on a co-worker's computer as well as my own personal computer.
One example can be found here. (Click the link and then click on NPPES Data Dissemination.) There are other examples I've noticed, and I've emailed CMS about this. They respond that the files are large and can't be handled with Excel. I am aware of this, and it isn't really the problem I'm facing.
Does anyone know why this would be happening and how I can fix it?
Per cdetermans' point, how much system memory do you have available for R to do the uncompressing and subsequent loading of the data? Looking at both the image you posted and the link to the actual data, which shows ~560MB compressed, it did not pose a problem on my system (Win 10, 16 GB RAM, Core i7, R v3.2.3) to download, uncompress, and read the uncompressed CSV into a table.
I would recommend, if nothing else works, decoupling your uncompressing and data-loading steps. You might even go as far as invoking (depending on your OS) an R system command to decompress your data, inspecting it manually, and then separately issuing piecewise read.table calls on the dataset.
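Something along these lines, for instance (the chunk size is arbitrary, and the file names match the script above):

# Decompress first (either in R or via a system call), ...
unzip("dataset.zip", exdir = "nppes")   # or: system("unzip dataset.zip -d nppes")
csv <- file.path("nppes", "npidata_20050523-20160207.csv")

# ... then inspect and read piecewise rather than in one gulp.
header <- read.csv(csv, nrows = 1)
chunk1 <- read.csv(csv, skip = 1, nrows = 100000,
                   header = FALSE, col.names = names(header))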
Best of luck
rudycazabon
I am on a 512 GB RAM server. I have an 84 GB CSV (hefty, I know). I am reading only 31 columns of 79; the excluded ones are all floats/decimals.
After comparing many methods, it seems the highest-performance way to do what I want is to fread the file. The file size is 84 GB, but watching top, the process uses 160 GB of memory (RES), even though the eventual data.table is only about 20 GB.
I know fread preallocates memory which is why it's so fast. Just wondering - is this normal and is there a way to curb the memory consumption?
Edit: it seems that even if I just ask fread to read 10,000 rows (of 300 million), it will still preallocate 84 GB of memory.
See R FAQ 7.42. If you want to minimize the resources you use on the server, read the csv using fread once, then save the resulting object using save or saveRDS. Then read that binary file when you need the data.
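A minimal sketch of that workflow (the file name and column selection are placeholders):

library(data.table)

# Pay the fread cost once, keeping only the columns you need (placeholder names) ...
dt <- fread("big.csv", select = c("col1", "col2", "col3"))
saveRDS(dt, "big.rds")      # compact binary copy on disk

# ... and on later runs just load the binary file.
dt <- readRDS("big.rds")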
Or you can use a command line tool like cut, awk, sed, etc., to select only the columns you want and write the output to another file. Then you can use fread on that smaller file.
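For example, with cut (the column numbers and file names are placeholders; the cmd= form needs a reasonably recent data.table):

library(data.table)

# Keep only the first 31 columns with a command line tool, then fread the result ...
system("cut -d',' -f1-31 big.csv > big_subset.csv")
dt <- fread("big_subset.csv")

# ... or skip the intermediate file and let fread read from the command directly.
dt <- fread(cmd = "cut -d',' -f1-31 big.csv")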
Have a look at http://www.r-bloggers.com/efficiency-of-importing-large-csv-files-in-r/ or Reading 40 GB csv file into R using bigmemory.
Maybe the bigmemory library can help you.
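A rough bigmemory sketch (file names are placeholders; note this only works when all the columns share one numeric type):

library(bigmemory)

# Parse the csv once into a file-backed big.matrix; later sessions can
# re-attach the on-disk backing file instead of re-reading the csv.
x <- read.big.matrix("big.csv", header = TRUE, type = "double",
                     backingfile = "big.bin", descriptorfile = "big.desc")

# In a later session:
x <- attach.big.matrix("big.desc")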
I am still suffering every time I deal with Excel files in R.
What is the best way to do the following?
1- Import the Excel file into R as a "whole workbook" and be able to do analysis on any sheet in the workbook? If you think about suggesting XLConnect, please bear in mind the "out of memory" problem with Java. I have files over 30MB, and dealing with the Java memory problem every time consumes more time. (Running with -Xmx does not work for me.)
2- Not miss any data from any Excel sheet? Saving the file as CSV warns that some sheets are "out of range" (the old 65,536-row by 256-column limit). It also cannot deal with some formulas.
3- Not have to import each sheet separately? Importing sheets into SPSS, Stata, or EViews, saving them in that program's format, and then working with the output file in R works fine most of the time. However, this method has two major problems: you have to have the software installed on the machine, and it imports only one sheet at a time. If I have over 30 sheets, it becomes very time-consuming.
This might be an ongoing question that has been answered many, many times; however, each answer solves a part of the problem, not the whole issue. It is like putting out the fire rather than strategically solving the problem.
I am on Mac OS 10.10 with R 3.1.1
I have tried a few packages to open an Excel file, and openxlsx is definitely the best route. It is way faster and more stable than the other ones. The function is openxlsx::read.xlsx. My advice is to use it to read the whole sheet and then play with the data within R, rather than reading parts of the sheet several times. I used it a lot to open large Excel files (8000+ columns by 1000+ rows), and it always worked well. I use the xlsx package to write to Excel, but it had numerous memory issues when reading (that's why I moved to openxlsx).
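For the "whole workbook" part of the question, something like this works with openxlsx (the file name is a placeholder):

library(openxlsx)

# Read every sheet of the workbook into a named list of data frames.
path   <- "workbook.xlsx"
sheets <- getSheetNames(path)
data   <- lapply(sheets, function(s) read.xlsx(path, sheet = s))
names(data) <- sheets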
-Add In
On a side note, if you want to use R with Excel, you sometimes need to execute VBA code from R. I found the procedure quite difficult to achieve. I fully documented the proper way of doing it in a previous Stack Overflow question: Apply VBA from R.
Consider using the xlsx package. It has methods for dealing with Excel files and worksheets. Your question is quite broad, but I think this can be an example:
library(xlsx)
wb <- loadWorkbook('r_test.xlsx')
sheets <- getSheets(wb)
sheet <- sheets[[1]]
df <- readColumns(sheet,
                  startColumn = 1, endColumn = 3,
                  startRow = 1, endRow = 6)
df
##   id name x_value
## 1  1    A      10
## 2  2    B      15
## 3  3    C      20
## 4  4    D      13
## 5  5    E      17
As for the memory issue, I think you should check the ff package:
The ff package provides data structures that are stored on disk but behave (almost) as if they were in RAM by transparently mapping only a section (pagesize) in main memory.
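A minimal ff sketch (the file name and chunk size are placeholders):

library(ff)

# read.csv.ffdf keeps the data on disk and only pages chunks into RAM.
fdf <- read.csv.ffdf(file = "big.csv", header = TRUE,
                     next.rows = 100000)   # rows parsed per chunk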
Another option (but it may be overkill) would be to load the data into a real database and deal with database connections. If you are dealing with really big datasets, a database may be the best approach.
Some options would be:
The RSQLite package
If you can load your data into an SQLite database, you can use this package to connect to that database and handle the data directly. That would "split" the workload between R and the database engine. SQLite is quite easy to use and (almost) "config free", and each SQLite database is stored in a single file.
The RMySQL package
Even better than the above option: MySQL is great for storing large datasets. However, you'll need to install and configure a MySQL server on your computer.
Remember: if you work with R and a database, delegate as much of the heavy workload as possible to the database (e.g. data filtering, aggregation, etc.), and use R only to get the final results.
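As an illustration with RSQLite (the table, column, and file names are made up):

library(DBI)

# One-off: load the data into a single-file SQLite database.
con <- dbConnect(RSQLite::SQLite(), "mydata.sqlite")
dbWriteTable(con, "sales", my_data_frame)   # my_data_frame: your data

# Day to day: push filtering and aggregation to the database,
# and pull only the final result into R.
res <- dbGetQuery(con, "
  SELECT region, SUM(amount) AS total
  FROM sales
  WHERE year = 2015
  GROUP BY region
")
dbDisconnect(con)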