Reading a specific line from a huge file *fast* [duplicate]

This question already has an answer here:
Efficiently reading specific lines from large files into R
(1 answer)
Closed 9 years ago.
I have a huge comma-delimited file (1.5 GB) and want to read one particular line from it in R.
I've seen many versions of this question, and all suggest something like
con = file(fileName)
open(con)
scan(con, what=list("character", "character"), skip=1000000, nlines=1, sep="\t", quiet=TRUE)
That works, but it's still extremely slow - we're talking between 20 and 30 seconds to read a single line!
Is there a faster way? Surely there must be a fast way to jump to a particular line...
Thanks a million!

Do you know anything else about the structure of your file?
If every line/row has exactly the same number of bytes, then you can compute the byte offset of the line you want and seek directly to its start.
However, if the number of bytes per line is not exactly the same for every line, then you need to read every character, check whether it is a newline (or carriage return, or both), and count those to find the line you are looking for. This is what the skip argument to scan and friends does.
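For illustration only, a minimal sketch of the fixed-record-length case; the file name, the 1-based line number, and the bytes-per-line value are made-up assumptions:
con <- file("bigfile.csv", open = "rb")        # binary mode so seek() is reliable
rec_bytes <- 120                               # assumed: every line is exactly 120 bytes, newline included
target <- 1000001                              # assumed: the 1-based line number you want
seek(con, where = (target - 1) * rec_bytes)    # jump straight to the start of that line
line <- readLines(con, n = 1)                  # read just that one line
close(con)
fields <- strsplit(line, ",")[[1]]             # split the comma-delimited fields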
There may be other tools that are quicker at this reading and counting; you could have them preprocess your file and return only the line of interest.
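For example, on a Unix-like system with sed available, you can have sed pull out the one line and let R parse only that (file name and line number are hypothetical):
line_no <- 1000001                                                 # hypothetical target line
cmd <- sprintf("sed -n '%dp;%dq' bigfile.csv", line_no, line_no)   # print line N, then quit
row <- read.csv(text = system(cmd, intern = TRUE), header = FALSE)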
If you are going to do this multiple times, then it could speed up the overall process to load the file into a different structure, such as a database, that can access arbitrary rows directly, or to pre-index the lines so that you can seek straight to the appropriate one.
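A rough sketch of the pre-indexing idea, assuming a plain file connection; the file name and chunk size are made up, and growing the offsets vector in a loop is not tuned:
# One pass to record the byte offset at which every line starts.
con <- file("bigfile.csv", open = "rb")
offsets <- 0                                    # line 1 starts at byte 0
repeat {
  chunk <- readBin(con, what = "raw", n = 1e6)  # read about 1 MB at a time
  if (length(chunk) == 0) break
  pos_after <- seek(con)                        # current position = end of this chunk
  nl <- which(chunk == as.raw(10L))             # '\n' positions inside the chunk
  offsets <- c(offsets, pos_after - length(chunk) + nl)
}
close(con)

# Later: jump straight to line k and read only it.
k <- 1000001
con <- file("bigfile.csv", open = "rb")
seek(con, where = offsets[k])
line_k <- readLines(con, n = 1)
close(con)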

Related

How to read a table line by line - using R?

I have a pretty big (20GB) CSV file, and I need to modify some of its columns.
What is the most efficient way of importing the table line by line (or, more likely, a few thousand lines per read)?
I have tried the solution given in
What is a good way to read line-by-line in R?
but it seems to be very slow. Is there a library that can read line by line directly into a table structure, ideally with some kind of buffering to make the reads faster?
You can use the fast fread() from data.table.
With skip= you set the beginning of the segment to read, and with nrows= the number of rows to read.
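A minimal sketch of chunked reading with fread; the file name, chunk size, and the assumption that the file has no header line are made up. Note that skip= by row count still makes fread scan from the top of the file to find the starting row, so later chunks get slower:
library(data.table)

chunk_rows <- 5000
skip_rows <- 0
repeat {
  chunk <- tryCatch(
    fread("big.csv", skip = skip_rows, nrows = chunk_rows, header = FALSE),
    error = function(e) NULL                   # fread errors once skip runs past the last row
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  # ... modify the columns of this chunk, then e.g. fwrite(chunk, "out.csv", append = TRUE)
  skip_rows <- skip_rows + nrow(chunk)
  if (nrow(chunk) < chunk_rows) break          # a short chunk means we just read the tail
}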

R - Read large file with small memory

My data is organized in a CSV file with millions of lines and several columns. The file is too large to read into memory all at once.
Fortunately, I only want to compute some statistics on it, such as the mean of each column over every 100 rows. My solution, based on other posts, was to use read.csv2 with the nrows and skip options. This works.
However, I realized that loading from the end of the file is quite slow. As far as I can tell, the reader goes through the file until it has passed all the lines I told it to skip, and only then starts reading. This is, of course, suboptimal, as it keeps re-reading the initial lines every time.
Is there a solution, like Python's file handling, where we can read the file line by line, stop when needed, and then continue, while keeping the nice simplicity of read.csv2?
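One way to get that stop-and-continue behaviour is to open a connection once and call read.csv2 on it repeatedly; each call continues where the previous one stopped instead of re-skipping from the top. A minimal sketch, with a hypothetical file name and a 100-row chunk size:
con <- file("data.csv", open = "r")
header <- readLines(con, n = 1)                      # consume the header line once
repeat {
  chunk <- tryCatch(
    read.csv2(con, header = FALSE, nrows = 100),
    error = function(e) NULL                         # read.csv2 errors at end of file
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  # compute the per-chunk statistics here, e.g. colMeans(chunk[sapply(chunk, is.numeric)])
  if (nrow(chunk) < 100) break
}
close(con)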

Read only first few lines of CSV (or any text file) [duplicate]

This question already has answers here:
How to read first 1000 lines of .csv file into R? [closed]
(2 answers)
Closed 7 years ago.
I often come across larger datasets whose contents I don't know in advance.
Instead of waiting for them to open in a conventional text editor or in the RStudio editor, for example, I would like to look at just the first few lines.
I don't even have to parse the contents; scanning the first few lines will help me determine which method to use.
Is there a function/package for this?
read.table has an nrows option:
nrows: integer: the maximum number of rows to read in. Negative and
other invalid values are ignored.
so read in a few and see what you've got.
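For instance (file name hypothetical):
first_rows <- read.csv("big.csv", nrows = 5)     # parse only the first five data rows
str(first_rows)                                  # quick look at column names and types
first_lines <- readLines("big.csv", n = 10)      # or just peek at the raw text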
If you have a Unix environment then the command head file.csv will show the first ten lines. There are lots of other useful Unix commands (guess what tail file.csv does) and even if you are on Windows you could benefit from installing Cygwin and learning it!
Here is an answer to your question:
How to read first 1000 lines of .csv file into R?
Basically use nrows in read.csv or read.table...

Load large datasets into data frame [duplicate]

This question already has answers here:
Quickly reading very large tables as dataframes
(12 answers)
Closed 8 years ago.
I have a dataset stored in a text file; it has 997 columns and 45,000 rows. All values are doubles except the row names and column names. I used RStudio with the read.table command to read the data file, but it seemed to be taking hours, so I aborted it.
Even Excel opens it in about 2 minutes.
RStudio seems to lack efficiency in this task; any suggestions on how to make it faster? I don't want to re-read the data file every time.
I plan to load it once and store it in an .RData object, which should make loading the data faster in the future. But the first load doesn't seem to work.
I am not a computer science graduate, so any help will be much appreciated.
I recommend data.table::fread, although you will end up with a data.table. If you'd rather not work with a data.table, you can simply convert it back to a normal data frame (see the conversion sketch below).
library(data.table)
data <- fread("yourpathhere/yourfile")
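A one-line conversion, reusing the data object created above:
df <- as.data.frame(data)   # copies into an ordinary data.frame
setDF(data)                 # or convert in place, without a copy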
As documented in the ?read.table help file, there are three arguments that can dramatically speed up and/or reduce the memory required to import data. First, by telling read.table what kind of data each column contains (colClasses) you avoid the overhead of read.table guessing the type of each column. Second, by telling read.table how many rows the file has (nrows) you avoid allocating more memory than is actually required. Finally, if the file does not contain comments, you can reduce the resources required by telling R not to look for them (comment.char = ""). Using these techniques I was able to read a .csv file with 997 columns and 45,000 rows in under two minutes on a laptop with relatively modest hardware:
tmp <- data.frame(matrix(rnorm(997 * 45000), ncol = 997))
write.csv(tmp, "tmp.csv", row.names = FALSE)
system.time(x <- read.csv("tmp.csv", colClasses = "numeric", comment.char = ""))
#    user  system elapsed
# 115.253   2.574 118.471
I tried reading the file using the default read.csv arguments, but gave up after 30 minutes or so.

Length of an XML file

I have an XML file of size 31 GB. I need to find the total number of lines in that file. I know that wc -l will give me that; however, it takes too long. Is there any faster mechanism to find the number of lines in a large file?
31 GB is a really big text file; I bet it would compress down to about 1.5 GB. I would create these files in a compressed format to begin with; then you can stream a decompressed version of the file through wc. That greatly reduces the amount of I/O and memory used to process the file. gzip can read and write compressed streams.
But I would also make the following comments:
Line numbers are not really that informative for XML, since whitespace between elements is ignored (except in mixed content). What do you really want to know about the dataset? I bet counting elements would be more useful.
Make sure your XML file is not unnecessarily redundant; for example, are you repeating the same namespace declarations all over the document?
Perhaps XML is not the best way to represent this data; if it has to stay XML, look into something like Fast Infoset.
If all you need is the line count, wc -l will be as fast as anything else.
The problem is the 31 GB text file itself.
If accuracy isn't an issue, find the average line length and divide the file size by that. That way you get a really fast approximation (just make sure to account for the character encoding used).
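A rough version of that estimate in R, assuming single-byte characters and a hypothetical file name:
sample_lines <- readLines("big.xml", n = 10000)               # sample the first lines
avg_bytes <- mean(nchar(sample_lines, type = "bytes") + 1)    # +1 for the newline byte
approx_lines <- file.size("big.xml") / avg_bytes              # fast approximation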
This is past the point where the code should be refactored to avoid the problem entirely. One way to do that is to put all of the data into a tuple-store database instead. Apache CouchDB and InterSystems Caché are two systems you could use, and they will be far better optimized for the type of data you're dealing with.
If you're really stuck with the XML file, another option is to count all the lines ahead of time and cache the value. Each time a line is added to or removed from the file, add or subtract one from the cached count. Also, make sure to use a 64-bit integer, since there may be more than 2^32 lines.
No, not really. wc is going to be pretty well optimized. 31GB is a lot of data, and reading it in to count lines is going to take a while no matter what program you use.
Also, this question isn't really appropriate for Stack Overflow, as it's not about programming at all.
Isn't counting lines fairly meaningless, since a newline in XML is basically just cosmetic? It would probably be better to count the number of occurrences of a specific tag.
