I have a pretty big (20GB) CSV file, and I need to modify some of its columns.
What is the most efficient way of importing the data table line by line (or, more likely, a few thousand lines per read)?
I have tried the solution given in "What is a good way to read line-by-line in R?", but it seems to be very slow. Is there a library that can read line by line directly into a table structure, and that also has some kind of buffering logic to make the reads faster?
You can use the fast fread() from data.table.
With skip=, you set where the read segment begins, and with nrows=, the number of rows to read.
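To make the skip=/nrows= idea concrete, here is a minimal sketch. The temp file, the column names and the chunk size are made up for illustration, and it assumes data.table is installed. Note that fread still has to locate the starting line on each call, so this may not be any faster for chunks near the end of a very large file.

```r
library(data.table)

# Hypothetical demo file so the sketch runs on its own
path <- tempfile(fileext = ".csv")
fwrite(data.table(id = 1:10, value = (1:10) * 2), path)

# Read the header once (nrows = 0 returns just the column names)
col_names <- names(fread(path, nrows = 0))

chunk_size <- 4

# skip = 1 + offset skips the header line plus `offset` data rows
read_chunk <- function(offset) {
  fread(path, skip = 1 + offset, nrows = chunk_size,
        header = FALSE, col.names = col_names)
}

first_chunk  <- read_chunk(0)           # data rows 1 to 4
second_chunk <- read_chunk(chunk_size)  # data rows 5 to 8
```

On real data it is probably worth passing colClasses= as well, so that every chunk gets the same column types.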
Related
I've found good tips about fast ways to import files into R, but I'm wondering if it is possible to import only a subset of a given file into a variable.
In my case, I have a file with 16 million rows saved as .rds (and also as .feather, as I was playing with the speed of both formats) and I'd like to import a subset of it (say, a few rows or a few columns) for initial analysis.
Is it possible? The readRDS() does not seem to accept any subsetting, while read_feather() does not seem to allow row selection (although you can specify the columns). Should I consider another data format?
The short answer is 'no'. A nice alternative is the fst file format, which does allow the retrieval of a selection of columns and rows from a large dataset. More info here.
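As a rough sketch of what that looks like with the fst package (the data frame and file path below are invented for the example; read_fst takes columns=, from= and to= for the selection):

```r
library(fst)

# Hypothetical data written to a temp file
path <- tempfile(fileext = ".fst")
df <- data.frame(
  id = 1:1000,
  x  = rnorm(1000),
  y  = letters[(1:1000 %% 26) + 1]
)
write_fst(df, path)

# Read only columns id and x, and only rows 101 to 200
subset_df <- read_fst(path, columns = c("id", "x"), from = 101, to = 200)
```

Only the requested slice is loaded into memory, which is the point of using this format for a first look at a large dataset.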
Using readr::read_csv, you could use the n_max parameter to read only as many rows as you like.
With readRDS, I suppose you could read the whole file, take a sample with dplyr::sample_n, and then erase the full object from memory with rm(object).
If you cannot read the whole file into memory, you could use SQLite or another database, which is the preferred way, or you could try something along the lines of readr::read_delim_chunked, which allows you to read a file in chunks: do something with the chunk you just read (like sample_n), delete the chunk from memory while keeping just the callback's result, and go on like that until the end of the file.
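A small sketch of the chunked approach, assuming readr is installed. The file, the column names and the filter are hypothetical; a deterministic filter stands in for sample_n so the result is reproducible.

```r
library(readr)

# Hypothetical demo file
path <- tempfile(fileext = ".csv")
write_csv(data.frame(id = 1:1000, value = (1:1000) / 10), path)

# Called once per chunk; `pos` is the row position where the chunk starts.
# Whatever it returns is kept, and the rest of the chunk is dropped.
keep_every_100th <- function(chunk, pos) {
  chunk[chunk$id %% 100 == 0, ]
}

# DataFrameCallback row-binds the per-chunk results into one data frame
result <- read_delim_chunked(
  path,
  callback   = DataFrameCallback$new(keep_every_100th),
  delim      = ",",
  chunk_size = 100
)
```

Only one chunk of 100 rows is held in memory at a time, plus the accumulated results.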
My data is organized in a CSV file with millions of lines and several columns. This file is too large to read into memory all at once.
Fortunately, I only want to compute some statistics on it, like the mean of each column over every 100 rows and such. My solution, based on other posts, was to use read.csv2 with the nrows and skip options. This works.
However, I realized that this process is quite slow when loading from near the end of the file. As far as I can tell, the reader goes through the file until it has passed all the lines I told it to skip, and only then starts reading. This is, of course, suboptimal, as it reads over the initial lines again every time.
Is there a solution, like a parser in Python, where we can read the file line by line, stop when needed, and then continue, while keeping the nice reading simplicity that comes with read.csv2?
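One approach that might help here, sketched under assumptions: keep a single file connection open and call read.csv2 on it repeatedly, so each call continues where the previous one stopped instead of skipping from the top. The temp file, the columns a and b, and the chunk size of 100 are made up for illustration.

```r
# Hypothetical demo file in read.csv2 format (semicolon separated)
path <- tempfile(fileext = ".csv")
write.csv2(data.frame(a = 1:250, b = (1:250) * 2), path, row.names = FALSE)

con <- file(path, open = "r")

# Consume the header line once and strip the quotes around the names
header <- gsub('"', "", strsplit(readLines(con, n = 1), ";")[[1]])

chunk_means <- list()
repeat {
  # Each call reads the next 100 rows from the current position
  chunk <- tryCatch(
    read.csv2(con, header = FALSE, nrows = 100, col.names = header),
    error = function(e) NULL  # read.csv2 errors when no lines are left
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  chunk_means[[length(chunk_means) + 1]] <- colMeans(chunk)
  if (nrow(chunk) < 100) break  # short chunk means end of file
}
close(con)

# One row of column means per block of 100 rows
means <- do.call(rbind, chunk_means)
```

The tryCatch here swallows every error, which is fine for a sketch but probably too blunt for production code.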
I have an .xdf file on an HDFS cluster which is around 10 GB, with nearly 70 columns. I want to read it into an R object so that I can perform some transformation and manipulation. I searched Google and came across two functions:
rxReadXdf
rxXdfToDataFrame
Could anyone tell me which function is preferred here, given that I want to read the data and perform the transformation in parallel on each node of the cluster?
Also, if I read and transform the data in chunks, do I have to merge the output of each chunk?
Thanks for your help in advance.
Cheers,
Amit
Note that rxReadXdf and rxXdfToDataFrame have different arguments and do slightly different things:
rxReadXdf has a numRows argument, so use this if you want to read the top 1000 (say) rows of the dataset
rxXdfToDataFrame supports rxTransforms, so use this if you want to manipulate your data in addition to reading it
rxXdfToDataFrame also has the maxRowsByCols argument, which is another way of capping the size of the input
So in your case, you want to use rxXdfToDataFrame since you're transforming the data in addition to reading it. rxReadXdf is a bit faster in the local compute context if you just want to read the data (no transforms). This is probably also true for HDFS, but I haven’t checked this.
However, are you sure that you want to read the data into a data frame? You can use rxDataStep to run (almost) arbitrary R code on an xdf file, while still leaving your data in that format. See the linked documentation page for how to use the transforms arguments.
I'm trying to import into R a large number of pipe-delimited files that were created in a Windows environment, with CR+LF as the end-of-record (EOL) delimiter. However, they also have CRs scattered about periodically, which results in frequent inappropriately split lines. Ideally, I want an efficient way to solve this problem from within R: either by finding a way to specify the EOL delimiter when I import, or, if necessary, by reading in the text file and excising the CRs before any parsing of lines is done.
The creators of the files comment on this problem and recommend adding "TERMSTR= CRLF" into your SAS code, and I can find lots of discussions of how to do this in other languages as well. For R, however, all I can find is this discussion, here on stackoverflow:
Possible to change the record delimiter in R?
The sample problem given is a great match for my problem. The solution identified is nice for their specific situation of having a single file like this, but for me it would require coding up separate scripts for importing each of the dozens of files, since each has different primary keys that would need to be recognized after the fact to repair the inappropriate import. Alternatively, I could open each file in something like Notepad++ to remove the extra CRs, but again, that seems quite inefficient, and it would have to be repeated by hand every time the initial data set was updated by its producers.
Given how frequent a problem this seems to be for people, and the existence of hard-coded solutions in other programming languages, I'm confused as to why there isn't a fix in R and feel like I must be missing something. It seems clear (I think?) that there's no way to do this directly from read.table or even from readLines, but is there a way perhaps to do this using scan, that I'm missing?
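For what it's worth, here is a sketch of the "excise the CRs before parsing" route in base R. The sample content is invented, and this assumes the file fits in memory as a single string (so it would need to be done in pieces for files over about 2 GB).

```r
# Hypothetical sample: pipe-delimited, CRLF record ends,
# plus one stray CR in the middle of a field
path <- tempfile(fileext = ".txt")
writeBin(charToRaw("id|note\r\n1|first\rhalf\r\n2|ok\r\n"), path)

# Read the whole file as one string, without any line splitting
txt <- readChar(path, nchars = file.size(path), useBytes = TRUE)

# Drop every CR that is not followed by LF, then turn CRLF into LF
txt <- gsub("\r(?!\n)", "", txt, perl = TRUE)
txt <- gsub("\r\n", "\n", txt, fixed = TRUE)

# Parse the cleaned text as usual
dat <- read.table(text = txt, sep = "|", header = TRUE,
                  stringsAsFactors = FALSE, quote = "")
```

Because nothing in it depends on the key columns, the same few lines could be wrapped in a function and applied to each of the files.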
Thanks for any thoughts!
I have an XML file of size 31 GB. I need to find the total number of lines in that file. I know the command wc -l will give me the same. However it's taking too long to perform this operation. Is there any faster mechanism to find the number of lines in a large file?
31 gigs is a really big text file. I bet it would compress down to about 1.5 gigs. I would create these files in a compressed format to begin with; then you can stream a decompressed version of the file through wc. This will greatly reduce the amount of I/O and memory used to process this file. gzip can read and write compressed streams.
But I would also make the following comments:
Line numbers are not really that informative for XML as whitespace between elements is ignored (except for mixed content). What do you really want to know about the dataset? I bet counting elements would be more useful.
Make sure your XML file is not unnecessarily redundant. For example, are you repeating the same namespace declarations all over the document?
Perhaps XML is not the best way to represent this document. If it is, try looking into something like Fast Infoset.
If all you need is the line count, wc -l will be as fast as anything else.
The problem is the 31GB text file.
If accuracy isn't an issue, find the average line length and divide the file size by that. That way you can get a really fast approximation. (make sure to consider the character encoding used)
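The arithmetic behind that estimate, sketched in R. The file below is synthetic, and taking the first 1000 lines as the sample is an assumption; on real data the first lines may not be representative.

```r
# Hypothetical file: 20000 lines, written in binary mode so that
# every line ends in a single "\n" byte on any platform
path <- tempfile(fileext = ".txt")
con <- file(path, open = "wb")
writeLines(sprintf('<row id="%05d"/>', 1:20000), con)
close(con)

# Sample the first 1000 lines and measure them in bytes,
# adding 1 byte per line for the newline
sample_lines <- readLines(path, n = 1000)
avg_bytes <- mean(nchar(sample_lines, type = "bytes") + 1)

# Estimated line count = file size / average bytes per line
estimated_lines <- file.size(path) / avg_bytes
```

Measuring in bytes rather than characters is what takes care of the encoding caveat.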
This is past the point where the code should be refactored to avoid your problem entirely. One way to do this is to place all of the data in the file into a tuple store database instead. Apache CouchDB and InterSystems Caché are two systems that you could use for this, and they will be far better optimized for the type of data you're dealing with.
If you're really stuck with the XML file, then another option is to count all the lines ahead of time and cache this value. Each time a line is added to or removed from the file, you can add one to or subtract one from the cached count. Also, make sure to use a 64-bit integer, since there may be more than 2^32 lines.
No, not really. wc is going to be pretty well optimized. 31GB is a lot of data, and reading it in to count lines is going to take a while no matter what program you use.
Also, this question isn't really appropriate for Stack Overflow, as it's not about programming at all.
Isn't counting lines pretty uncertain since in XML newline is basically just a cosmetic thing? It would probably be better to count the number of occurrences of a specific tag.
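A sketch of counting tag occurrences instead of lines, streaming the file in blocks so it never sits in memory all at once. The function name count_tag, the sample XML and the tag name are hypothetical. This is a plain pattern match rather than real XML parsing, so it would also count matches inside comments or CDATA, and it would miss a tag name that is immediately followed by a line break.

```r
# Hypothetical sample file
path <- tempfile(fileext = ".xml")
writeLines(c("<root>",
             "<item>a</item><item>b</item>",
             "<item>c</item>",
             "</root>"), path)

count_tag <- function(path, tag, lines_per_read = 10000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  # Opening tag: "<tag" followed by a space, ">" or "/"
  pattern <- paste0("<", tag, "[ >/]")
  total <- 0  # double, so it will not overflow like a 32-bit integer
  repeat {
    lines <- readLines(con, n = lines_per_read)
    if (length(lines) == 0) break
    hits <- gregexpr(pattern, lines)
    # gregexpr returns -1 for lines with no match, so count positions > 0
    total <- total + sum(vapply(hits, function(h) sum(h > 0), integer(1)))
  }
  total
}

n_items <- count_tag(path, "item")
```

This still has to read the whole file once, so it will not be faster than wc -l; it just answers a more meaningful question.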