Optimizing file reading in R

My R application reads input data from large .txt files. It does not read the entire file in one shot. Users specify gene names (3 or 4 at a time) and, based on the user input, the app goes to the appropriate row and reads the data.
File format: 32,000 rows (one gene per row; the first two columns contain info such as the gene name), 35,000 columns of numerical data (decimal numbers).
I used read.table(filename, skip = 10000) etc. to go to the right row, then read the 35,000 columns of data (sketched below). Then I do this again for the 2nd gene, 3rd gene (up to 4 genes max), and then process the numerical results.
The file-reading operations take about 1.5 to 2.0 minutes. I am experimenting with reading the entire file and then taking the data for the desired genes.
Is there any way to accelerate this? I can rewrite the gene data in another format (one-time processing) if that will accelerate reading operations in the future.
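The per-gene lookup I have now amounts to something like this (a sketch; the file name and the row offset are placeholders):
# Current approach: skip ahead to one gene's row and read only that row.
gene_row <- read.table("genes.txt", skip = 10000, nrows = 1)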

You can use the colClasses argument to read.table to speed things up, if you know the exact format of your files. For 2 character columns and 34,998 (?) numeric columns, you would use
colClasses = c(rep("character",2), rep("numeric",34998))
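Spelled out as a full call, that might look like the following (a sketch; the file name, separator, and header setting are assumptions about your format):
# Hypothetical call: two ID columns plus 34,998 numeric columns, tab-separated.
dat <- read.table("genes.txt", sep = "\t", header = FALSE,
                  colClasses = c(rep("character", 2), rep("numeric", 34998)),
                  nrows = 32000, comment.char = "")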

This would be more efficient if you used a database interface. Several are available, for example via the RODBC package, but a particularly well-integrated-with-R option is the sqldf package, which uses SQLite by default. You would then be able to use the indexing capacity of the database to look up the correct rows and read all the columns in one operation.
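A minimal sketch of the one-time conversion and a later indexed lookup, using RSQLite through DBI (the file, table, and column names are hypothetical, and SQLite has a compile-time column limit, so a table this wide may need to be stored in long format instead):
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "genes.sqlite")

# One-time conversion: load the text file and index the gene-name column.
genes <- read.table("genes.txt", header = FALSE)
names(genes)[1] <- "gene_name"              # assumed ID column
dbWriteTable(con, "genes", genes, overwrite = TRUE)
dbExecute(con, "CREATE INDEX idx_gene ON genes (gene_name)")

# Later, per query: fetch only the requested genes in one operation.
res <- dbGetQuery(con,
  "SELECT * FROM genes WHERE gene_name IN ('GENE1', 'GENE2', 'GENE3')")
dbDisconnect(con)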

Related

How to handle and normalize a dataframe with billions of rows?

I need to analyze a dataframe in R (or bash, or even Python if you have suggestions, but I don't know Python well). The dataframe has approximately 6 billion rows and 8 columns (control1, treaty1, control2, treaty2, control3, treaty3, control4, treaty4).
Since the file is almost 300 GB and 6 billion lines, I cannot open it in R.
I need to read the file line by line and remove any line that contains even a single 0.
How could I do this?
If I also needed to divide each value in a column by a number, and put the result in a new dataframe equal to the starting one, how could I do that?
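One way to avoid loading the whole file is to stream it in chunks, for example with readr's chunked readers. A minimal sketch, assuming a tab-separated file named huge_file.tsv and a hypothetical divisor of 10:
library(readr)
library(dplyr)

divisor <- 10   # hypothetical constant to divide every value by

process_chunk <- function(chunk, pos) {
  out <- chunk %>%
    filter(if_all(everything(), ~ .x != 0)) %>%     # drop rows containing any 0
    mutate(across(everything(), ~ .x / divisor))    # divide each value
  write_tsv(out, "filtered.tsv", append = pos > 1, col_names = pos == 1)
}

read_tsv_chunked("huge_file.tsv",
                 SideEffectChunkCallback$new(process_chunk),
                 chunk_size = 1e6)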

R Code: csv file data incorrectly breaking across lines

I have some CSV data that I'm trying to read in, and lines are breaking across rows in a strange way.
An example of the file (the files are the same but the date varies) is here: http://nemweb.com.au/Reports/Archive/DispatchIS_Reports/PUBLIC_DISPATCHIS_20211118.zip
The CSV is non-rectangular because there are four different types of data included, each with its own header rows. I can't skip a fixed number of lines because the length of each block varies by date.
The data I want is the third dataset (sometimes the second), and it has approximately twice as many headers as the data above it. So I use read.csv() without a header, and ideally it should pull all the data and fill the rows above with NAs.
But for some reason read.csv() decides that there are 28 columns of data (corresponding to the data headers on row 2), which splits the data lower down across three rows: instead of the data headers being on one row, they split across three, and so do all the rows of data below them.
I tried reading it in with the column names explicitly defined, but it still splits the rows strangely.
I can't figure out what's going on; if I open the CSV file in Excel it looks perfectly normal.
If I use readr::read_lines() there are no errant carriage returns or newlines as far as I can tell.
Hoping someone might have some guidance, otherwise I'll have to figure out a kind of nasty read_lines approach.
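For what it's worth, that read_lines fallback might look something like this (a generic sketch assuming the block you want is the one with the most comma-separated fields; the extracted file name is hypothetical):
# Split the ragged CSV into blocks by field count, then parse only the widest block.
lines <- readr::read_lines("PUBLIC_DISPATCHIS_20211118.CSV")
nfields <- count.fields(textConnection(lines), sep = ",", blank.lines.skip = FALSE)

target <- max(nfields, na.rm = TRUE)            # widest records = block of interest
block <- lines[!is.na(nfields) & nfields == target]

dat <- read.csv(text = paste(block, collapse = "\n"), header = TRUE)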

Processing very large files in R

I have a dataset that is 188 million rows with 41 columns. It comes as a massive compressed fixed-width file, and I am currently reading it into R using the vroom package like this:
d <- vroom_fwf('data.dat.gz',
               fwf_positions([41 column positions],
                             [41 column names]))
vroom does a wonderful job here, in the sense that the data are actually read into an R session on a machine with 64 GB of memory. When I run object.size on d it is a whopping 61 GB in size. But when I turn around to do anything with this data, I can't. All I get back is Error: cannot allocate vector of size {x} Gb, because there really isn't any memory left to do much of anything with that data. I have tried base R with [, dplyr::filter, and converting to a data.table via data.table::setDT, each with the same result.
So my question is: what are people's strategies for this type of thing? My main goal is to convert the compressed fixed-width file to Parquet format, but I would like to split it into smaller, more manageable files based on the values in one of the columns, and then write those to Parquet (using arrow::write_parquet).
My idea at this point is to read a subset of columns, keeping the column that I want to split by, write the Parquet files, and then bind/merge the pieces back together. This seems like a more error-prone solution, though, so I thought I would turn here and see what other approaches are available.
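One strategy is to never hold the full table in memory: read the fixed-width file in row chunks and hand each chunk to arrow::write_dataset, which can partition the output by a column as it writes. A rough sketch, where positions, col_names, and group_col are hypothetical stand-ins for the 41 column specs and the splitting column (re-reading a gzipped file with skip forces repeated decompression, so this trades speed for memory):
library(vroom)
library(arrow)

chunk_rows <- 5e6     # rows per pass; tune to the available memory
skip <- 0
part <- 0

repeat {
  d <- vroom_fwf("data.dat.gz",
                 fwf_positions(positions$start, positions$end, col_names),
                 skip = skip, n_max = chunk_rows)
  if (nrow(d) == 0) break
  # Partitioned Parquet output: one directory per value of group_col;
  # a distinct basename per pass keeps earlier chunks from being overwritten.
  write_dataset(d, "parquet_out", format = "parquet",
                partitioning = "group_col",
                basename_template = paste0("chunk-", part, "-{i}.parquet"))
  if (nrow(d) < chunk_rows) break
  skip <- skip + chunk_rows
  part <- part + 1
}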

Querying out of memory 60gb tsv's in R on the first column, which database/method?

I have six large TSV matrices of 60 GB each (uncompressed), containing 20 million rows x 501 columns: the first column is an index/integer column that is basically the row number (so not even necessary), and the other 500 columns are numerical (floats with 4 decimals, e.g. 1.0301). All the TSVs have the same number of rows, and the rows correspond to each other.
I need to extract rows by row number.
I need to extract up to 5,000 contiguous rows or up to 500 non-contiguous rows; so not millions. Ideally there would also be some kind of compression to reduce the 60 GB size, so maybe not SQL? What would be the best way to do this?
One method I tried is to separate them into 100 gzipped files, index them using tabix, and then query them, but this is too slow for my needs (500 random rows took 90 seconds).
I read about the ff package, but I have not found how to index by the first column.
Are there other ways?
Thanks so much.
I would use fread() from the data.table package.
Using the parameters skip and nrows, you can control the line to start reading at (skip) and the number of rows to read (nrows).
If you want to explore a tidyverse approach, I recommend this solution: R: Read in random rows from file using fread or equivalent?
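A sketch of both access patterns with fread (the file name is hypothetical, and the non-contiguous variant assumes awk is available on the PATH):
library(data.table)

# Contiguous block: 5,000 rows starting after row 100,000 (assuming no header row).
block <- fread("matrix1.tsv", skip = 100000, nrows = 5000, header = FALSE)

# Non-contiguous rows: pre-filter with a shell command and let fread parse the result.
wanted <- c(17, 5003, 250000)
awk_expr <- paste0("NR==", wanted, collapse = " || ")
rows <- fread(cmd = sprintf("awk '%s' matrix1.tsv", awk_expr), header = FALSE)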

Load large datasets into data frame [duplicate]

This question already has answers here:
Quickly reading very large tables as dataframes
(12 answers)
Closed 8 years ago.
I have a dataset stored in a text file; it has 997 columns and 45,000 rows. All values are doubles except the row names and column names. I use RStudio with the read.table command to read the data file, but it seemed to take hours, so I aborted it.
Even Excel opens it in about 2 minutes.
RStudio seems to lack efficiency in this task. Any suggestions on how to make it faster? I don't want to re-read the data file every time.
I plan to load it once and store it in an .RData object, which will make loading the data faster in the future. But the first load doesn't seem to work.
I am not a computer science graduate; any help will be much appreciated.
I recommend data.table, although you will end up with a data.table after this. If you would rather not work with a data.table, you can simply convert it back to a normal data frame.
library(data.table)
data <- fread('yourpathhere/yourfile')
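If you do want a plain data frame afterwards (as mentioned above), either of these works:
setDF(data)                       # convert in place, without a copy
# or: data <- as.data.frame(data)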
As documented in the ?read.table help file, there are three arguments that can dramatically speed up and/or reduce the memory required to import data. First, by telling read.table what kind of data each column contains, you avoid the overhead of read.table guessing the type of each column. Second, by telling read.table how many rows the data file has, you avoid allocating more memory than is actually required. Finally, if the file does not contain comments, you can reduce the resources required to import the data by telling R not to look for comments. Using all of these techniques I was able to read a .csv file with 997 columns and 45,000 rows in under two minutes on a laptop with relatively modest hardware:
tmp <- data.frame(matrix(rnorm(997*45000), ncol = 997))
write.csv(tmp, "tmp.csv", row.names = FALSE)
system.time(x <- read.csv("tmp.csv", colClasses="numeric", comment.char = ""))
#    user  system elapsed
# 115.253   2.574 118.471
I tried reading the file using the default read.csv arguments, but gave up after 30 minutes or so.
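For completeness, a variant of the call above that also passes nrows, since the paragraph mentions it (the dimensions are those of the generated example file):
system.time(x <- read.csv("tmp.csv", colClasses = "numeric",
                          nrows = 45000, comment.char = ""))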
