Mysterious problems appending data frames with rbind - r

I am in a dilly of a pickle trying to join several files together into a master file. There are 5 files with the same structure, and I can read each file individually into a data frame with no problems. I even manually set the column classes for the 200+ variables rather than letting R decide, because I believed that was causing the problem. However, appending any two files together causes me to run out of memory.
Warning messages:
1: In rbind(deparse.level, ...) :
Reached total allocation of 4043Mb: see help(memory.size)
So I did some experimenting:
I joined two different chunks of file 1 together. That works.
I joined a chunk of file 2 to a chunk of file 1. That works.
I joined a chunk of file 2 to the original file 1. That works.
Each of these files comes in at a little under 200MB, so I am not sure why I should be running out of memory. If anybody is interested, the data comes from hearstchallenge.com. The competition is long over; we are just using the data for an analysis experiment (and not programming!).
Any suggestions for how to solve this?

I have run into similar problems. The solution is not to use rbind() or cbind() on large data: each call copies its inputs into a brand-new object, so memory use balloons.
To solve your problem using only R, first create a data frame with the dimensions the combined data would have after you put the pieces together. Then use indexed assignments to fill in the large data frame.
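A minimal sketch of that preallocate-and-fill pattern, assuming the five files are already loaded as data frames df1 through df5 with identical columns (the names are made up for illustration):
pieces <- list(df1, df2, df3, df4, df5)
total  <- sum(sapply(pieces, nrow))
# preallocate a data frame with the final dimensions; indexing with NA rows
# keeps each column's type while filling it with NA values
big <- df1[rep(NA_integer_, total), ]
offset <- 0
for (p in pieces) {
  big[(offset + 1):(offset + nrow(p)), ] <- p
  offset <- offset + nrow(p)
}
rownames(big) <- NULL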

Related

R code halts in for loop no error message

I'm working on a segment of R code: a for loop that reads in columns from a collection of .csv files whose paths are compiled in a file-path directory I've made. As it reads each file, it stores four different columns into four different grand tables in my working environment.
I'm trying to make part of the code calculate the monthly average for each month and store it in a new column, so that I can use it to replace missing data points; this involves a couple more for loops for mapping the subset of the aggregate table.
All this ends up being four nested for loops, which handle a great deal of data at once before overwriting it with the next large file.
After incorporating the three nested loops which build the monthly-average vector, the code started halting as soon as it tries to read in the data file, with no error message; it just looks like this:
[screenshot: the halted code, no error shown]
If I add show_col_types = FALSE to the read_csv call it looks like this instead, still halting the code:
[screenshot: the halted code with show_col_types = FALSE]
I can't include the data or much of the code because my company will not allow it, but I would appreciate any input, since there isn't any error message I can google. Thanks!
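For what it's worth, the monthly-average fill described above does not need nested loops. A minimal sketch, assuming a data frame grand with a Date column date and a numeric column value (both names invented for illustration):
month <- format(grand$date, "%Y-%m")   # year-month grouping key
monthly_avg <- ave(grand$value, month, FUN = function(x) mean(x, na.rm = TRUE))
missing <- is.na(grand$value)
grand$value[missing] <- monthly_avg[missing]   # replace missing points with the monthly mean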

Processing very large files in R

I have a dataset that is 188 million rows with 41 columns. It comes as a massive compressed fixed-width file, and I am currently reading it into R using the vroom package like this:
d <- vroom_fwf('data.dat.gz',
               fwf_positions([41 column positions],
                             [41 column names]))
vroom does a wonderful job here in the sense that the data are actually read into an R session on a machine with 64Gb of memory. When I run object.size on d it is a whopping 61Gb in size. But when I turn around to do anything with this data I can't: all I get back is Error: cannot allocate vector of size {x} Gb, because there really isn't any memory left to do much of anything with that data. I have tried base R with [, dplyr::filter, and converting to a data.table via data.table::setDT, each with the same result.
So my question is: what are people's strategies for this type of thing? My main goal is to convert the compressed fixed-width file to parquet format, but I would like to split it into smaller, more manageable files based on the values in one of the columns, and then write those out to parquet (using arrow::write_parquet).
My idea at this point is to read a subset of columns at a time, keeping the column that I want to subset by, write the parquet files, and then bind the columns/merge the pieces back together. This seems like a more error-prone solution though, so I thought I would turn here and see what is available for further conversions.
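One possible pattern, sketched under heavy assumptions: read the file in row chunks via vroom_fwf()'s skip/n_max arguments (which does mean re-scanning the compressed file on every pass) and hand each chunk to arrow::write_dataset(), partitioned on the splitting column. The two-column layout and the name group_col below are invented stand-ins for the real 41-column spec.
library(vroom)
library(arrow)
# stand-in for the real fwf_positions() spec of 41 columns
positions <- fwf_positions(start = c(1, 10), end = c(9, 18),
                           col_names = c("group_col", "value"))
chunk_size <- 5e6
offset <- 0
pass <- 0
repeat {
  chunk <- vroom_fwf("data.dat.gz", col_positions = positions,
                     skip = offset, n_max = chunk_size)
  if (nrow(chunk) == 0) break
  # hive-style partitioning writes parquet_out/group_col=<value>/ subfolders;
  # a per-pass basename keeps later chunks from overwriting earlier ones
  write_dataset(chunk, path = "parquet_out", format = "parquet",
                partitioning = "group_col",
                basename_template = paste0("pass-", pass, "-part-{i}.parquet"))
  offset <- offset + chunk_size
  pass <- pass + 1
}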

R - Read large file with small memory

My data is organized in a csv file with millions of lines and several columns. This file is too large to read into memory all at once.
Fortunately, I only want to compute some statistics on it, like the mean of each column for every 100 rows and such. My solution, based on other posts, was to use read.csv2 with the options nrows and skip. This works.
However, I realized that when loading from the end of the file this process is quite slow. As far as I can tell, the reader goes through the file until it has passed all the lines that I tell it to skip, and only then starts reading. This, of course, is suboptimal, as it re-reads the initial lines every time.
Is there a solution, like Python's file parsers, where we can read the file line by line, stop when needed, and then continue, while keeping the nice simplicity that comes with read.csv2?
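One common approach, as a minimal sketch assuming a semicolon-separated file with an unquoted header and all-numeric columns ("stats.csv" is a made-up path): keep a single connection open, so each read.csv2 call continues from where the previous one stopped instead of re-scanning the file from the top.
con <- file("stats.csv", open = "r")
col_names <- strsplit(readLines(con, n = 1), ";", fixed = TRUE)[[1]]  # consume the header once
repeat {
  chunk <- tryCatch(
    read.csv2(con, nrows = 100, header = FALSE, col.names = col_names),
    error = function(e) NULL    # read.csv2 errors once the connection is exhausted
  )
  if (is.null(chunk)) break
  print(colMeans(chunk))        # the per-100-row statistic of interest
}
close(con)
Because the connection stays open, each read picks up at the current file position, which gives the read-stop-continue behaviour asked about.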

R merge large number of data frames

I have the output from a data submission which is in the form of multiple vector list objects in rda files.
Each list object is in a separate rda file and I have nearly 2000 files.
I want to merge all the objects into a single object in a single rda file in the fastest way (partly because i may need to repeat this several times).
All the rda files are fairly small (~10 MB each, though that is a compressed size), but it all adds up with the number of files.
Memory isn't a huge problem, as I am running it on a server with >700GB of RAM.
My first approach, incrementally loading the files one by one, concatenating each onto the merged list object, and removing the object just appended, went badly because of the time it was going to take (something like 40 days at a best guess).
My revised approach is below, but I am wondering if there is a quicker way to do this, given that I may need to repeat the process:
load("data_1.rda")
load("data_2.rda")
load("data_3.rda") ...
load("data_2000.rda")
my.list <- list()
my.list <- c(my.list, data.1, data.2, data.3, ... , data.2000)
save(my.list, file="my_list.rda")
And just to add to things, I'm getting an error when doing this:
Error: attempt to set index 18446744071562067968/2877912830 in SET_STRING_ELT
It's not a very helpful error message.
All the rdas load as objects into the environment fine, but when I try to concatenate them is when I get the error message. It seems to happen once it gets to a particular point, as it doesn't fail immediately. I wasn't sure if there is some sort of limit on the number of concatenations you can do or whether it is rogue data, but from troubleshooting it appears to be syntax related rather than data related.
I have chunked it up into 5 batches and then do a final concatenation before saving the rda. I have seen other answers for this sort of thing suggesting rbind, mget, and do.call or the list function - would using any of these functions make it faster and achieve the same thing?
Something like this:
my.list <- do.call(rbind, mget(ls(pattern="^data_")))
Thanks
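A minimal sketch of the mget/do.call idea, assuming each data_*.rda file contains exactly one list object and the files sit in the working directory; c() (rather than rbind) keeps the result a single flat list, and doing one concatenation at the end avoids the 2000 incremental copies.
files <- sprintf("data_%d.rda", 1:2000)
loaded <- lapply(files, function(f) {
  env <- new.env()
  nm  <- load(f, envir = env)   # load() returns the name(s) of the object(s) it created
  get(nm[1], envir = env)
})
my.list <- do.call(c, loaded)   # one big concatenation instead of 2000 incremental ones
save(my.list, file = "my_list.rda")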

Load large datasets into data frame [duplicate]

This question already has answers here:
Quickly reading very large tables as dataframes
(12 answers)
Closed 8 years ago.
I have a dataset stored in a text file with 997 columns and 45000 rows. All values are doubles except the row names and column names. I used RStudio with the read.table command to read the data file, but it seemed to be taking hours, so I aborted it.
Even Excel opens it in about 2 minutes.
RStudio seems to lack efficiency for this task. Any suggestions on how to make it faster? I don't want to have to re-read the data file every time.
I plan to load it once and store it in an Rdata object, which should make loading the data faster in the future. But that first load isn't working.
I am not a computer graduate, so any kind of help will be much appreciated.
I recommend data.table, although you will end up with a data.table after this. If you choose not to work with a data.table, you can simply convert it back to a normal data frame.
require(data.table)
data <- fread('yourpathhere/yourfile')
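If you do want a plain data frame afterwards, the conversion is a single call (a brief aside; setDF() converts in place without copying):
data <- as.data.frame(data)   # or data.table::setDF(data) to convert in place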
As documented in the ?read.table help file, there are three arguments that can dramatically speed up and/or reduce the memory required to import data. First, by telling read.table what kind of data each column contains, you avoid the overhead of read.table trying to guess the type of each column. Second, by telling read.table how many rows the data file has, you avoid allocating more memory than is actually required. Finally, if the file does not contain comments, you can reduce the resources required to import the data by telling R not to look for them. Using all of these techniques I was able to read a .csv file with 997 columns and 45000 rows in under two minutes on a laptop with relatively modest hardware:
# simulate a file comparable to the one in the question: 45000 rows x 997 numeric columns
tmp <- data.frame(matrix(rnorm(997 * 45000), ncol = 997))
write.csv(tmp, "tmp.csv", row.names = FALSE)
# declare the column types and disable comment scanning when reading it back
system.time(x <- read.csv("tmp.csv", colClasses = "numeric", comment.char = ""))
#    user  system elapsed
# 115.253   2.574 118.471
I tried reading the file using the default read.csv arguments, but gave up after 30 minutes or so.
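Once the data are in memory, one way to make subsequent loads fast (an aside on the caching idea mentioned in the question) is to save the parsed object and read that back instead of the .csv:
saveRDS(x, "tmp.rds")      # one-time cache of the parsed data frame
x <- readRDS("tmp.rds")    # later sessions reload the binary cache, much faster than re-parsing the text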
