My data is organized in a CSV file with millions of lines and several columns. This file is too large to read into memory all at once.
Fortunately, I only want to compute some statistics on it, such as the mean of each column every 100 rows. My solution, based on other posts, was to use read.csv2 with the options nrows and skip. This works.
However, I realized that loading from the end of the file is quite slow. As far as I can tell, the reader goes through the file until it has passed all the lines I tell it to skip, and only then starts reading. This, of course, is suboptimal, as it re-reads the initial lines every time.
Is there a solution, as with Python's parsers, where we can read the file line by line, stop when needed, and then continue, while keeping the nice reading simplicity that comes from read.csv2?
Related
I have a pretty big (20GB) CSV file, and I need to modify some of its columns.
What is the most optimized way of importing the data table line by line (or perhaps a few thousand lines per read)?
I have tried the solution given below:
What is a good way to read line-by-line in R?
But it seems to be very slow. Is there any library that can read line by line, directly into a table structure, and that has some kind of buffering logic to make the reads faster?
You can use the fast fread() from data.table.
With skip= you set the beginning of the read segment, and with nrows= the number of rows to read.
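For instance, a minimal sketch along those lines, assuming a comma-separated file named "data.csv" with a header row and numeric columns (the file name, offset, and chunk size are placeholders):

library(data.table)

# Read only rows 100,001-100,100 of the data: skip the header line plus the
# first 1e5 data rows, then read 100 rows. Column names are not re-read here,
# so the chunk gets default names (V1, V2, ...).
chunk <- fread("data.csv", skip = 1e5 + 1, nrows = 100, header = FALSE)

colMeans(chunk)  # per-column means for just that block of rows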
I'm trying to import into R a large number of pipe-delimited files that were created in a Windows environment, with CR+LF as the end-of-record (EOL) delimiter. However, they also have CRs scattered about periodically, which results in frequent, inappropriately split lines. Ideally, I want an efficient way to solve this problem from within R, either by finding a way to specify the EOL delimiter when I import or, if necessary, by reading in the text file and excising the CRs before any parsing of lines is done.
The creators of the files comment on this problem and recommend adding "TERMSTR= CRLF" to your SAS code, and I can find lots of discussions of how to do this in other languages as well. For R, however, all I can find is this discussion, here on Stack Overflow:
Possible to change the record delimiter in R?
The sample problem given is a great match for mine. The solution identified works nicely for their specific situation of having a single such file, but for me it would require coding up separate scripts for importing each of the dozens of files, since each has different primary keys that would need to be recognized after the fact to repair the inappropriate import. Alternatively, I could open each file in something like Notepad++ to remove the extra CRs, but again, that seems quite inefficient, and it would have to be repeated by hand every time the initial data set was updated by its producers.
Given how frequently this problem seems to come up, and the existence of hard-coded solutions in other languages, I'm confused as to why there isn't a fix in R and feel like I must be missing something. It seems clear (I think?) that there's no way to do this directly from read.table or even from readLines, but is there perhaps a way to do this using scan that I'm missing?
Thanks for any thoughts!
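For what it's worth, the brute-force version of the second option I have in mind (excising the CRs before parsing) looks roughly like this, assuming the file fits in memory ("data.txt" is a placeholder name):

path <- "data.txt"
txt <- readChar(path, file.info(path)$size, useBytes = TRUE)

txt <- gsub("\r\n", "\n", txt, fixed = TRUE, useBytes = TRUE)  # real EOLs -> LF
txt <- gsub("\r", "", txt, fixed = TRUE, useBytes = TRUE)      # drop stray CRs

dat <- read.table(text = txt, sep = "|", header = TRUE, stringsAsFactors = FALSE)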
I have an Rscript being called from a Java program. The purpose of the script is to automatically generate a bunch of graphs in ggplot and then splat them all onto a PDF. It has grown somewhat large, with maybe 30 graphs, each of which is called from its own script.
The input is a tab-delimited file of 5-20 MB, but the R session sometimes climbs to 12 GB of RAM usage (on a Mac running OS X 10.6.8, by the way, but this will be run on all platforms).
I have read about how to look at the memory size of objects, and nothing is ever over 25 MB; even if R deep-copied everything for every function and every filter step, it shouldn't get close to this level.
I have also tried gc(), to no avail. If I do gcinfo(TRUE) and then gc(), it tells me it is using something like 38 MB of RAM. But Activity Monitor goes up to 12 GB and things slow down, presumably due to paging to disk.
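For reference, the inspection I'm doing looks roughly like this: list the size of every object in the global environment, then force a verbose collection.

sizes <- sapply(ls(), function(x) object.size(get(x, envir = globalenv())))
print(sort(sizes, decreasing = TRUE))  # largest objects first

gcinfo(TRUE)  # make garbage collections verbose
gc()          # force a collection and report current memory use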
I tried calling it via a bash script in which I did ulimit -v 800000 but no good.
What else can I do?
In the process of making assignments, R will always make temporary copies, sometimes more than one or even two. Each temporary assignment requires contiguous memory for the full size of the allocated object. So the usual advice is to plan on having at least three times that much contiguous memory available. This also means you need to be concerned about how many other non-R programs are competing for system resources, as well as being aware of how your memory is being used by R. You should try restarting your computer, running only R, and seeing whether you have success.
An input file of 20 MB might expand quite a bit (8 bytes per double, and perhaps more per character element in your vectors), depending on the structure of the file. The PDF object will also take quite a bit of space if you are plotting each point within a large file.
My experience is not the same as that of the others who have commented. I do issue gc() before memory-intensive operations. You should offer code and describe what you mean by "no good". Are you getting errors, or observing the use of virtual memory... or what?
I apologize for not posting a more comprehensive description with code. It was fairly long as was the input. But the responses I got here were still quite helpful. Here is how I mostly fixed my problem.
I had a variable number of columns which, with some outliers, got very numerous. But I didn't need the extreme outliers, so I just excluded them and cut off those extra columns. This alone decreased the memory usage greatly. I hadn't looked at the virtual memory usage before, but sometimes it was as high as 200 GB, lol. This brought it down to at most 2 GB.
Each graph was created in its own function. So I rearranged the code such that every graph was first generated, then printed to the PDF, then removed with rm(graphname).
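Roughly, the pattern looks like this, where make_graph_1() and friends are hypothetical stand-ins for the per-graph functions and dat is the input data frame:

pdf("report.pdf")

for (make_graph in list(make_graph_1, make_graph_2, make_graph_3)) {
  g <- make_graph(dat)  # build one ggplot object
  print(g)              # render it onto the open PDF device
  rm(g)                 # drop the object before building the next graph
  gc()                  # give R a chance to release the memory
}

dev.off()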
Further, I had many loops in which I was creating new columns in data frames. Instead of doing this, I just created vectors not attached to data frames for these calculations. This actually had the benefit of greatly simplifying some of the code.
After I stopped adding columns to the existing data frames and instead used standalone vectors, memory usage dropped to 400 MB. While this is still more than I would expect it to use, it is well within my restrictions. My users are all in my company, so I have some control over which computers it gets run on.
I have a moderate-sized file (4GB CSV) on a computer that doesn't have sufficient RAM to read it in (8GB on 64-bit Windows). In the past I would just have loaded it up on a cluster node and read it in, but my new cluster seems to arbitrarily limit processes to 4GB of RAM (despite the hardware having 16GB per machine), so I need a short-term fix.
Is there a way to read in part of a CSV file into R to fit available memory limitations? That way I could read in a third of the file at a time, subset it down to the rows and columns I need, and then read in the next third?
Thanks to commenters for pointing out that I can potentially read in the whole file using some big memory tricks:
Quickly reading very large tables as dataframes in R
I can think of some other workarounds (e.g. open in a good text editor, lop off 2/3 of the observations, then load in R), but I'd rather avoid them if possible.
So reading it in pieces still seems like the best way to go for now.
After reviewing this thread I noticed a conspicuous solution to this problem was not mentioned. Use connections!
1) Open a connection to your file
con = file("file.csv", "r")
2) Read in chunks of data with read.csv
read.csv(con, nrows = chunk_size, ...)
Side note: defining colClasses will greatly speed things up. Make sure to define unwanted columns as "NULL" so they are skipped.
3) Do whatever you need to do
4) Repeat.
5) Close the connection
close(con)
The advantage of this approach is the connection. If you omit it, things will likely slow down quite a bit. By opening a connection manually, you essentially open the data set and do not close it until you call close(). This means that as you loop through the data set, you never lose your place. Imagine you have a data set with 1e7 rows and you want to load a chunk of 1e5 rows at a time. Since we opened the connection, we get the first 1e5 rows by running read.csv(con, nrows=1e5,...); to get the second chunk we run read.csv(con, nrows=1e5,...) again, and so on....
If we did not use the connection, we would get the first chunk the same way, read.csv("file.csv", nrows=1e5,...), but for the next chunk we would need read.csv("file.csv", skip=1e5, nrows=1e5,...). Clearly that is inefficient: the reader has to find row 1e5 + 1 all over again, despite the fact that we just finished reading row 1e5.
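Putting the steps together, a minimal sketch of the loop might look like this (the file name, chunk size, and per-chunk computation are placeholders; supplying colClasses as noted above would speed it up further):

con <- file("file.csv", "r")

# Read the header line once so every chunk can reuse the column names.
col_names <- strsplit(readLines(con, n = 1), ",")[[1]]

repeat {
  chunk <- tryCatch(
    read.csv(con, nrows = 1e5, header = FALSE, col.names = col_names),
    error = function(e) NULL  # read.csv errors once the connection is exhausted
  )
  if (is.null(chunk)) break
  # ... do whatever you need with this chunk ...
  if (nrow(chunk) < 1e5) break  # last (possibly partial) chunk
}

close(con)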
Finally, data.table::fread is great, but you cannot pass it a connection, so this approach does not work with it.
I hope this helps someone.
UPDATE
People keep upvoting this post, so I thought I would add one more brief thought. The new readr::read_csv, like read.csv, can be passed connections. However, it is advertised as being roughly 10x faster.
You could read it into a database using RSQLite, say, and then use an SQL statement to get a portion.
If you need only a single portion, then read.csv.sql in the sqldf package will read the data into an SQLite database. First, it creates the database for you, and the data does not go through R, so R's limitations (primarily RAM in this scenario) won't apply. Second, after loading the data into the database, sqldf reads the output of the specified SQL statement into R and finally destroys the database. Depending on how fast it works with your data, you might be able to just repeat the whole process for each portion if you have several.
Only one line of code accomplishes all three steps, so it's a no-brainer to just try it.
DF <- read.csv.sql("myfile.csv", sql=..., ...other args...)
See ?read.csv.sql and ?sqldf and also the sqldf home page.
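For example, a sketch with a placeholder file name and filter column (inside the sql= statement the data is referred to by the table name file):

library(sqldf)

# Only the rows matching the SQL filter ever reach R, so the full file
# does not have to fit in memory.
DF <- read.csv.sql("myfile.csv",
                   sql = "select * from file where some_column > 100")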
I have an XML file of size 31 GB. I need to find the total number of lines in that file. I know the command wc -l will give me that, but it's taking too long to perform this operation. Is there any faster mechanism to find the number of lines in a large file?
31 GB is a really big text file. I bet it would compress down to about 1.5 GB. I would create these files in a compressed format to begin with; then you can stream a decompressed version of the file through wc. This will greatly reduce the amount of I/O and memory used to process the file. gzip can read and write compressed streams.
But I would also make the following comments:
Line numbers are not really that informative for XML as whitespace between elements is ignored (except for mixed content). What do you really want to know about the dataset? I bet counting elements would be more useful.
Make sure your XML file is not unnecessarily redundant; for example, are you repeating the same namespace declarations all over the document?
Perhaps XML is not the best way to represent this document; if it is, try looking into something like Fast Infoset.
If all you need is the line count, wc -l will be as fast as anything else.
The problem is the 31GB text file.
If accuracy isn't an issue, find the average line length and divide the file size by that. That way you can get a really fast approximation (make sure to consider the character encoding used).
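For instance, a rough sketch of that estimate in R, assuming the first few thousand lines are representative and a single-byte encoding (the file name is a placeholder):

path <- "big.xml"

# Average byte length of a sample of lines, +1 for each newline character.
sample_lines <- readLines(path, n = 5000)
avg_line_bytes <- mean(nchar(sample_lines, type = "bytes") + 1)

approx_lines <- file.size(path) / avg_line_bytes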
This falls beyond the point where the code should be refactored to avoid your problem entirely. One way to do this is to place all of the data from the file into a tuple-store database instead. Apache CouchDB and InterSystems Caché are two systems you could use for this, and they will be far better optimized for the kind of data you're dealing with.
If you're really stuck with the XML file, then another option is to count all the lines ahead of time and cache that value. Each time a line is added to or removed from the file, you add or subtract one from the cached count. Also, make sure to use a 64-bit integer, since there may be more than 2^32 lines.
No, not really. wc is going to be pretty well optimized. 31GB is a lot of data, and reading it in to count lines is going to take a while no matter what program you use.
Also, this question isn't really appropriate for Stack Overflow, as it's not about programming at all.
Isn't counting lines pretty uncertain, since in XML a newline is basically just cosmetic? It would probably be better to count the number of occurrences of a specific tag.