Strategies for reading in CSV files in pieces? - r

I have a moderate-sized file (4GB CSV) on a computer that doesn't have sufficient RAM to read it in (8GB on 64-bit Windows). In the past I would just have loaded it up on a cluster node and read it in, but my new cluster seems to arbitrarily limit processes to 4GB of RAM (despite the hardware having 16GB per machine), so I need a short-term fix.
Is there a way to read in part of a CSV file into R to fit available memory limitations? That way I could read in a third of the file at a time, subset it down to the rows and columns I need, and then read in the next third?
Thanks to commenters for pointing out that I can potentially read in the whole file using some big memory tricks:
Quickly reading very large tables as dataframes in R
I can think of some other workarounds (e.g. open in a good text editor, lop off 2/3 of the observations, then load in R), but I'd rather avoid them if possible.
So reading it in pieces still seems like the best way to go for now.

After reviewing this thread I noticed a conspicuous solution to this problem was not mentioned. Use connections!
1) Open a connection to your file
con = file("file.csv", "r")
2) Read the data in chunks with read.csv
read.csv(con, nrows = CHUNK_SIZE, ...)   # CHUNK_SIZE is a placeholder for the number of rows per chunk
Side note: defining colClasses will greatly speed things up. Make sure to define unwanted columns as NULL.
3) Do whatever you need to do with that chunk
4) Repeat.
5) Close the connection
close(con)
The key to this approach is the connection. If you skip it, things will likely slow down. By opening a connection manually, you essentially open the data set and do not close it until you call close(). This means that as you loop through the data set you never lose your place. Imagine that you have a data set with 1e7 rows and that you want to load a chunk of 1e5 rows at a time. Since we keep the connection open, we get the first 1e5 rows by running read.csv(con, nrows=1e5,...), then get the second chunk by running read.csv(con, nrows=1e5,...) again, and so on.
If we did not use the connection we would get the first chunk the same way, read.csv("file.csv", nrows=1e5,...), but for the next chunk we would need read.csv("file.csv", skip=1e5, nrows=1e5,...). Clearly this is inefficient: we have to find row 1e5+1 all over again, despite having just read row 1e5.
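Putting the steps together, here is a minimal sketch of the loop. The file name, chunk size, and colClasses are hypothetical (colClasses must list one entry per column in your file); the header line is read once up front, and header = FALSE is used for every chunk so data rows are not mistaken for column names.
con <- file("file.csv", "r")
header <- strsplit(readLines(con, n = 1), ",")[[1]]   # read the header line once
repeat {
  chunk <- tryCatch(
    read.csv(con, nrows = 1e5, header = FALSE, col.names = header,
             colClasses = c("character", "numeric", "NULL")),   # hypothetical classes; "NULL" drops a column
    error = function(e) NULL)                         # read.csv errors once the connection is exhausted
  if (is.null(chunk) || nrow(chunk) == 0) break
  # ... subset / aggregate the chunk here ...
}
close(con)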
Finally, data.table::fread is great, but you cannot pass it a connection, so this approach does not work with fread.
I hope this helps someone.
UPDATE
People keep upvoting this post so I thought I would add one more brief thought. The new readr::read_csv, like read.csv, can be passed connections. However, it is advertised as being roughly 10x faster.
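If you go the readr route, note that recent versions also ship a purpose-built chunked reader, read_csv_chunked, which handles the looping for you. A minimal sketch, with a hypothetical file name, chunk size, and per-chunk filter:
library(readr)
keep_big <- DataFrameCallback$new(function(chunk, pos) {
  chunk[chunk$value > 100, ]   # hypothetical filter; the kept rows are row-bound into one data frame
})
result <- read_csv_chunked("file.csv", keep_big, chunk_size = 1e5)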

You could read it into a database using RSQLite, say, and then use an sql statement to get a portion.
If you need only a single portion then read.csv.sql in the sqldf package will read the data into an SQLite database. First, it creates the database for you, and the data does not go through R, so R's limitations (primarily RAM in this scenario) don't apply. Second, after loading the data into the database, sqldf reads the output of the specified SQL statement into R and finally destroys the database. Depending on how fast it works with your data, you might be able to simply repeat the whole process for each portion if you have several.
Only one line of code accomplishes all three steps, so it's a no-brainer to just try it.
DF <- read.csv.sql("myfile.csv", sql=..., ...other args...)
See ?read.csv.sql and ?sqldf and also the sqldf home page.
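For example, a minimal sketch of pulling one portion at a time; the column names and the filter are hypothetical, and inside the sql argument the data is referred to as the table file:
library(sqldf)
DF <- read.csv.sql("myfile.csv",
                   sql = "select col1, col2 from file where col3 = 'A'")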

Related

Copying data from SQL Server to R using ODBC connection

I have successfully set up an R SQL Server ODBC connection by doing:
library(DBI)
library(odbc)
DBI_connection <- dbConnect(odbc(),
                            driver   = "SQL Server",
                            server   = server_name,
                            database = database_name)
Dataset_in_R <- dbFetch(dbSendQuery(DBI_connection,
                                    "SELECT * FROM MyTable_in_SQL"))
3 quick questions:
1-Is there a quicker way to copy data from SQL Server to R? This table has over 44 million rows and it is still running...
2-If I make any changes to this data in R, does it change anything in MyTable_in_SQL? I don't think so, because I have saved it in a global data.frame variable in R, but just checking.
3-How to avoid going through this step every time I open R? Is there a way to save my data.frame in the "background" in R?
1: Is there a quicker way to copy data from SQL Server to R?
This one is rather simple to answer. The odbc package in R does quite a bit under the hood to ensure compatibility between the result fetched from the server and R's data structures. It might be possible to obtain a slight increase in speed by using an alternative package (RODBC is an old package, and it sometimes seems faster). In this case, however, with 44 million rows, I expect the bigger performance boost to come from preparing your SQL statement. The general idea (a sketch follows this list) is to:
Remove any unnecessary columns. Remember, each column has to be downloaded, so if you have 20 columns, removing one may reduce your query execution time by roughly 5% (assuming linear run time).
If you plan on performing aggregation, it will almost always be faster to perform it directly in your query, e.g. if you have a column called Ticker and a column called Volume and you want the average Volume, you can calculate it directly in the query. Similarly for the last row per group, using last_value(colname) over ([partition by [grouping col 1], [grouping col 2] ...] order by [order col 1], [order col 2]) as last_value_colname.
If you choose to do this, it might be beneficial to test your query on a small subset of rows using TOP N or LIMIT N, e.g. select [select statement] from mytable_in_sql order by [order col] limit 100, which would return only the first 100 rows. As Martin Schmelzer commented, this can also be done from R with the dplyr::tbl function, but refining the SQL statement itself is usually faster.
Finally, if your query becomes more complex (which does not seem to be the case here), it might be beneficial to create a view on the table (CREATE VIEW) with the specific select statement and query this view instead. The server will then try to optimize the query, and if your problem is on the server side rather than the local side, this can improve performance.
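A minimal sketch of what a trimmed-down query might look like, reusing the connection from the question; the column names are hypothetical, and TOP is SQL Server's spelling of LIMIT:
library(DBI)
library(odbc)
query <- "
  SELECT TOP 100 Ticker, AVG(Volume) AS avg_volume
  FROM MyTable_in_SQL
  GROUP BY Ticker"
small_test <- dbGetQuery(DBI_connection, query)   # test on a small result before running the full query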
Lastly, one must state the obvious: as noted above, when you query the server you are downloading some (maybe quite a lot of) data. This can be improved by improving your internet connection, either by repositioning your computer or router, connecting directly via a cable, or simply upgrading your plan. For 44 million rows of a single 64-bit double-precision column you have 44 * 10^6 * 8 / 1024^3, roughly 0.33 GiB of uncompressed data; with 10 such columns this goes up to roughly 3.3 GiB. It is simply going to take a while to download all of that, so getting the row count down would be extremely helpful!
As a side note, you might be able to download the table slightly faster directly via SSMS (still slow due to the table size) and then import the file locally. For the fastest speed you likely have to look into SQL Server's bulk import and export functionality.
2: If I make any changes to this data in R does it change anything in my MyTable_in_SQL?
No: R has no internal pointer or connection to the table once it has been loaded. I don't even believe a package exists (in R at least) that opens a stream to the table and could dynamically update it. I know that functionality like this exists in Excel, but even there it has some dangerous side effects and should (in my opinion) only be used in read-only applications, where the user wants to see an (almost) live stream of the data.
3: How to avoid going through this step every time I open R? Is there a way to save my data.frame in the "background" in R?
To avoid this, simply save the table at the end of each session. Whenever you close RStudio it asks whether you want to save your current session; if you click yes, it saves .Rhistory and .RData in the getwd() directory, and these are loaded again the next time you open a session (unless you changed your working directory before closing with setwd(...)). However, I highly suggest you do not do this for larger datasets: it makes your R session take forever to open the next time, and it can create unnecessary copies of your data (for example, if you import into df and make a transformed df2, you suddenly have two copies of a multi-GiB dataset to load every time you open R). Instead, I highly suggest saving the file using arrow::write_parquet(df, file_path), which is a much (and I mean MUCH!!) faster alternative to saving as RDS or CSV files. Parquet files can't be opened as easily in Excel, but they can be opened in R using arrow::read_parquet and in Python using pandas.read_parquet or pyarrow.parquet.read_parquet, while being compressed to a size that is usually 50-80% smaller than the equivalent CSV file.
Note:
If you already saved your R session after loading the file and you experience a very slow startup, I suggest removing the .RData file from your working directory, which is usually the Documents folder (C:/Users/[user]/Documents) on your system.
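A minimal sketch of the parquet round trip described above, assuming the arrow package is installed (the file name is arbitrary):
library(arrow)
write_parquet(Dataset_in_R, "Dataset_in_R.parquet")   # save once after the expensive fetch
# ...in a later session...
Dataset_in_R <- read_parquet("Dataset_in_R.parquet")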
On question 2 you're correct, any changes in R won't change anything in the DB.
About question 3, you can use save.image() or save.image('path/image_name.RData') to save your environment, and recover it later in another session with load('path/image_name.RData').
Maybe with this you don't need a faster way to get data from a DB.

R - Read large file with small memory

My data is organized in a CSV file with millions of lines and several columns. This file is too large to read into memory all at once.
Fortunately, I only want to compute some statistics on it, like the mean of each column for every 100 rows and such. My solution, based on other posts, was to use read.csv2 with the options nrows and skip. This works.
However, I realized that when loading from the end of the file this process is quite slow. As far as I can tell, the reader goes through the file until it has passed all the lines I tell it to skip, and only then starts reading. This, of course, is suboptimal, as it keeps re-reading the initial lines every time.
Is there a solution, like a Python-style parser, where I can read the file line by line, stop when needed, and then continue, while keeping the nice simplicity that comes with read.csv2?
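For reference, the connection trick from the first answer above applies to read.csv2 as well. A minimal sketch that collects a per-chunk column mean; the file name, chunk size, and the assumption that every column is numeric are hypothetical:
con <- file("data.csv", "r")
header <- strsplit(readLines(con, n = 1), ";")[[1]]   # read.csv2 expects semicolon-separated fields
chunk_means <- list()
repeat {
  chunk <- tryCatch(read.csv2(con, nrows = 100, header = FALSE, col.names = header),
                    error = function(e) NULL)         # errors once the connection is exhausted
  if (is.null(chunk) || nrow(chunk) == 0) break
  chunk_means[[length(chunk_means) + 1]] <- colMeans(chunk)   # assumes all columns are numeric
}
close(con)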

Store an increasing matrix into HDD and not in memory

I'm facing a fairly predictable problem while iteratively running the code below, which creates all possible combinations for a given sequence and then stores them in the final.grid variable. The thing is that there is not just one sequence but hundreds of thousands of them, and each one can have plenty of combinations.
for (...) {
  combs <- get.all.combs(sequence)
  final.grid <- rbind(final.grid, combs)
}
Anyway, I tried to run my code on a Windows PC with 4GB of RAM, and after 4 hours (with not even half of the combinations calculated) R returned this error:
Error: cannot allocate vector of size 4.0 Gb
What I thought of as a solution is to write final.grid to a file after each iteration, free the allocated memory, and continue. The truth is that I have no experience with such implementations in R, and I don't know which solution to choose or whether some would work better and more efficiently than others. Keep in mind that my final grid will probably need a few GB.
Somewhere on Stack Exchange I read about the ff package, but there was not enough discussion of the subject (at least I didn't find any), so I preferred to ask here for your opinions.
Thanks
I cannot understand your question very well, because the piece of code you posted is not enough to figure out your problem.
But you can try saving your results as .RData or .nc files, depending on the nature of your data. However, it would be easier to help if you were more explicit about your problem, for instance by showing the code behind the get.all.combs function or the sequence data.
One thing you can try is the memory.limit() function to see if you can allocate enough memory for your work. This may not work if your Windows OS is 32 bit.
If you have large data objects that you don't need for some parts of your program, you can first save them, then remove them with rm(), and load them again when you need them.
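A minimal sketch of that save, remove, reload pattern, with hypothetical object and file names (object.size, mentioned in the edit below, is included to show how to check how much memory an object actually takes):
big_obj <- get.all.combs(sequence)         # hypothetical large intermediate result
print(object.size(big_obj), units = "Mb")  # check how much memory it really uses
save(big_obj, file = "big_obj.RData")      # park it on disk
rm(big_obj); gc()                          # free the memory
# ...later, when it is needed again...
load("big_obj.RData")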
The link below has more info that could be useful to you.
Increasing (or decreasing) the memory available to R processes
EDIT:
You can use the object.size function to see the memory requirements of the objects you have. If they are too big, try loading them only when you need them.
It is possible that one of the functions you use tries to allocate more memory than you have. See if you can find out where exactly the program crashes.

R using waaay more memory than expected

I have an R script being called from a Java program. The purpose of the script is to automatically generate a bunch of graphs in ggplot and then splat them onto a PDF. It has grown somewhat large, with maybe 30 graphs, each of which is called from its own script.
The input is a tab-delimited file of 5-20 MB, but the R session sometimes goes up to 12 GB of RAM usage (on a Mac running OS X 10.6.8, by the way, but this will be run on all platforms).
I have read about how to look at the memory size of objects, and nothing is ever over 25 MB; even if R deep-copied everything for every function and every filter step, it shouldn't get close to this level.
I have also tried gc() to no avail. If I do gcinfo(TRUE) and then gc(), it tells me R is using something like 38 MB of RAM. But Activity Monitor goes up to 12 GB, and things slow down, presumably due to paging to disk.
I tried calling it via a bash script in which I set ulimit -v 800000, but no good.
What else can I do?
In the process of making assignments, R will always make temporary copies, sometimes more than one or even two. Each temporary assignment requires contiguous memory for the full size of the allocated object, so the usual advice is to plan on having at least three times that amount of contiguous memory available. This means you also need to be concerned about how many other non-R programs are competing for system resources, as well as being aware of how your memory is being used by R. You could try restarting your computer, running only R, and seeing if that helps.
An input file of 20 MB might expand quite a bit (8 bytes per double, and perhaps more per character element in your vectors) depending on the structure of the file. The PDF file object will also take quite a bit of space if you are plotting each point within a large file.
My experience is not the same as that of others who have commented: I do issue gc() before memory-intensive operations. You should post code and describe what you mean by "no good". Are you getting errors, or observing the use of virtual memory... or what?
I apologize for not posting a more comprehensive description with code. It was fairly long as was the input. But the responses I got here were still quite helpful. Here is how I mostly fixed my problem.
I had a variable number of columns which, with some outliers, got very numerous. But I didn't need the extreme outliers, so I just excluded them and cut off those extra columns. This alone decreased memory usage greatly. I hadn't looked at virtual memory usage before, but sometimes it was as high as 200 GB. This change brought it down to about 2 GB.
Each graph was created in its own function, so I rearranged the code such that every graph is generated, printed to the PDF, and then removed with rm(graphname).
Further, I had many loops in which I was creating new columns in data frames. Instead of doing this, I just created vectors not attached to data frames for these calculations. This actually had the benefit of greatly simplifying some of the code.
After no longer adding columns to the existing data frames and instead making stand-alone column vectors, memory usage dropped to 400 MB. While this is still more than I would expect it to use, it is well within my restrictions. My users are all in my company, so I have some control over which computers it gets run on.
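A minimal sketch of that generate, print, remove pattern; the plot-building functions and the input data frame are hypothetical stand-ins for the ~30 graph scripts:
library(ggplot2)
pdf("report.pdf")
for (make_plot in list(make_plot_1, make_plot_2)) {   # hypothetical plot-building functions
  p <- make_plot(dat)                                 # build one ggplot object at a time
  print(p)                                            # write it onto the open PDF device
  rm(p); gc()                                         # drop it before building the next one
}
dev.off()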

faster than scan() with Rcpp?

Reading ~5x10^6 numeric values into R from a text file is relatively slow on my machine (a few seconds, and I read several such files), even with scan(..., what="numeric", nmax=5000) or similar tricks. Could it be worthwhile to try an Rcpp wrapper for this sort of task (e.g. Armadillo has a few utilities to read text files)?
Or would I likely be wasting my time for little to no gain in performance because of the expected interface overhead? I'm not sure what's currently limiting the speed (intrinsic machine performance, or something else?). It's a task that I repeat many times a day, typically, and the file format is always the same: 1000 columns, around 5000 rows.
Here's a sample file to play with, if needed.
nr <- 5000
nc <- 1000
m <- matrix(round(rnorm(nr * nc), 3), nrow = nr)
cat(m[1, -1], "\n", file = "test.txt")   # first line is shorter
write.table(m[-1, ], file = "test.txt", append = TRUE,
            row.names = FALSE, col.names = FALSE)
Update: I tried read.csv.sql and also load("test.txt", arma::raw_ascii) using Armadillo, and both were slower than the scan solution.
I highly recommend checking out fread in the latest version of data.table. The version on CRAN (1.8.6) doesn't have fread yet (at the time of this post), so you should be able to get it if you install from the latest source on R-Forge. See here.
Please bear in mind that I'm not an R expert, but maybe the concept applies here too: usually reading binary data is much faster than reading text files. If your source files don't change frequently (e.g. you are running varied versions of your script/program on the same data), try reading them via scan() once and storing them in a binary format (the manual has a chapter on exporting binary files).
From there on you can modify your program to read the binary input.
Regarding Rcpp: scan() & friends are likely to call a native implementation (like fscanf()), so writing your own file-read functions via Rcpp may not provide a huge performance gain. You can still try it, though (and optimize for your particular data).
Hi Baptiste,
Data Input/Output is a huge topic, so big that R comes with its own manual on data input/output.
R's basic functions can be slow because they are so very generic. If you know your format, you can easily write yourself a faster import adapter. If you know your dimensions too, it is even easier, as you need only one memory allocation.
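On the plain R side, knowing the dimensions already buys you something. A minimal sketch, assuming a regular whitespace-separated file with the dimensions from the question (note that the sample file above is not quite regular, since its first line is one value short):
nr <- 5000; nc <- 1000                                     # dimensions known in advance
v <- scan("regular.txt", what = numeric(), n = nr * nc,    # specifying n lets scan preallocate
          quiet = TRUE)
m <- matrix(v, nrow = nr, ncol = nc, byrow = TRUE)         # reshape into the known nr x nc layout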
Edit: As a first approximation, I would write a C++ ten-liner. Open a file, read a line, break it into tokens, assign to a vector<vector< double > > or something like that. Even if you use push_back() on individual vector elements, you should be competitive with scan(), methinks.
I once had a neat little csv reader class in C++ based on code by Brian Kernighan himself. Fairly generic (for csv files), fairly powerful.
You can then squeeze performance as you see fit.
Further edit: This SO question has a number of pointers for the csv reading case, and references to the Kernighan and Plauger book.
Yes, you almost certainly can create something that goes faster than read.csv/scan. However, for high performance file reading there are some existing tricks that already let you go much faster, so anything you do would be competing against those.
As Mathias alluded to, if your files don't change very often, then you can cache them by calling save, then restore them with load. (Make sure to use ascii = FALSE, since reading the binary files will be quicker.)
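A minimal sketch of that caching step (the cache file name is hypothetical; ascii = FALSE is also the default for save):
if (file.exists("cache.rda")) {
  load("cache.rda")                                   # fast binary load on later runs
} else {
  m <- scan("test.txt", what = numeric(), quiet = TRUE)
  save(m, file = "cache.rda", ascii = FALSE)          # cache the parsed data in R's binary format
}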
Secondly, as Gabor mentioned, you can often get a substantial performance boost by reading your file into a database and then from that database into R.
Thirdly, you can use the HadoopStreaming package to use Hadoop's file reading capabilities.
For more thoughts in these techniques, see Quickly reading very large tables as dataframes in R.
