R: Exporting massive files within a loop

I have code that generates several very large data frames in a loop. Each one has around 300 million rows, so I run out of memory before the loop is over. I am trying to export each data frame once it is constructed within the loop and then remove it to free up space in my R environment before I start constructing the next.
The issue is how to export these very large datasets. I tried using fwrite from the data.table package, but when I open the .csv file I get an empty spreadsheet called Book1 instead. I also tried saving it as a .dta file using write.dta from the foreign package, but Stata tells me it is corrupted when I try opening it.

When saving it as .csv with fwrite and opening it with Stata instead, it worked perfectly! The file was fine all along; Excel simply cannot display a file this size, since a worksheet is capped at 1,048,576 rows.
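For reference, a minimal sketch of the loop pattern described above; build_chunk() and n_chunks are hypothetical stand-ins for the actual data-generating code:
library(data.table)
for (i in seq_len(n_chunks)) {
  dt <- build_chunk(i)                      # hypothetical function constructing one large data.table
  fwrite(dt, paste0("chunk_", i, ".csv"))   # fast, multi-threaded CSV writer
  rm(dt)                                    # drop the object from the environment
  gc()                                      # return the freed memory before the next iteration
}
Writing, removing, and garbage-collecting inside the loop keeps only one data frame in memory at a time.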

Related

Optimum way to overwrite an xlsx worksheet

I'm trying to write an Excel worksheet with the XLConnect package. The data I'm using is a data.frame (820 × 132). Once I'm done building the dataset, I use the writeWorksheetToFile function to export it.
If the file does not exist yet and I am creating it from scratch, everything works well.
If I want to overwrite an existing sheet, the function takes approximately a minute to write, and when I then open the Excel file I get an error message saying: "We found a problem with some content in 'my_file.xlsx'. Do you want to try to recover as much as we can?"
I tried other packages for writing to Excel, such as xlsx and openxlsx, but they do not allow overwriting a single sheet without rewriting the entire workbook.
I've checked a few solutions such as this one, but it is not optimal.
I am looking for an efficient way of writing Excel worksheets, with an overwrite option, that is suitable for large datasets.
I'm using the latest versions of R and RStudio.
My Excel version is 1902, 64-bit.
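One pattern that may help here, sketched with openxlsx under the assumption that my_file.xlsx, my_sheet, and my_data stand in for the real names: load the existing workbook, replace only the target sheet, and save the workbook back, which leaves the other sheets untouched.
library(openxlsx)
wb <- loadWorkbook("my_file.xlsx")                              # read the existing workbook into memory
if ("my_sheet" %in% names(wb)) removeWorksheet(wb, "my_sheet")  # drop the stale sheet if present
addWorksheet(wb, "my_sheet")
writeData(wb, "my_sheet", my_data)                              # my_data would be the 820 x 132 data.frame
saveWorkbook(wb, "my_file.xlsx", overwrite = TRUE)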

How to write data into a macro-enabled Excel file (write.xlsx corrupts my document)?

I'm trying to write a table into a macro-enabled Excel file (.xlsm) through R. The write.xlsx (openxlsx) and writeWorksheetToFile (XLConnect) functions don't work.
When I used the openxlsx package, as seen below, the resulting .xlsm files ended up getting corrupted.
Code:
library(XLConnect)
library(openxlsx)
for (i in 1:3) {
  write.xlsx(Input_Files[[i]], Inputs[i], sheetName = "Input_Sheet")
}
# Input_Files[[i]] are the R data.frames which need to be inserted into the .xlsm files
# Inputs[i] are the Excel files the tables should be written into
Corrupted .xlsm file error message after write.xlsx:
Excel cannot open the file 'xxxxx.xlsm' because the file format or file extension is not valid. Verify that the file has not been corrupted and that the file extension matches the format of the file.
After researching this problem extensively, I found that the XLConnect package offers the writeWorksheetToFile function, which does work with .xlsm, although after running it a few times it throws an error saying there is no more free space. It also runs for 20+ minutes for tables with approximately 10,000 lines. I tried adding xlcFreeMemory at the beginning of the for loop, but it doesn't solve the issue.
Code:
library(XLConnect)
library(openxlsx)
for (i in 1:3) {
  xlcFreeMemory()
  writeWorksheetToFile(Inputs[i], Input_Files[[i]], "Input_Sheet")
}
# Input_Files[[i]] are the R data.frames which need to be inserted into the .xlsm files
# Inputs[i] are the Excel files the tables should be written into
Could anyone recommend a way to easily and quickly transfer an R table into an .xlsm file without corrupting it?
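One approach that may avoid the corruption, sketched under the assumption that openxlsx's loadWorkbook preserves the VBA project when it reads an .xlsm file: modify each existing workbook in place instead of creating a new container with write.xlsx.
library(openxlsx)
for (i in 1:3) {
  wb <- loadWorkbook(Inputs[i])                     # keeps the existing content, including macros
  writeData(wb, "Input_Sheet", Input_Files[[i]])    # assumes the sheet "Input_Sheet" already exists
  saveWorkbook(wb, Inputs[i], overwrite = TRUE)
}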

Import .rds file to h2o frame directly

I have a large .rds file saved, and I am trying to import it directly into an H2O frame, because it is not feasible for me to read that file into the R environment and then use the as.h2o function to convert it.
I am looking for a fast and efficient way to deal with this.
My attempts:
I have tried reading the file into R and then converting it into an H2O frame, but it is a very time-consuming process.
I also tried saving the file in .csv format and using h2o.importFile() with parse = TRUE, but due to memory constraints I was not able to save the complete data frame.
Any suggestions for an efficient way to do this would be highly appreciated.
The native read/write functionality in R is not very efficient, so I'd recommend using data.table for that. Both options below make use of data.table in some way.
First, I'd recommend trying the following: Once you install the data.table package, and load the h2o library, set options("h2o.use.data.table"=TRUE). What that will do is make sure that as.h2o() uses data.table underneath for the conversion from an R data.frame to an H2O Frame. Something to note about how as.h2o() works -- it writes the file from R to disk and then reads it back again into H2O using h2o.importFile(), H2O's parallel file-reader.
There is another option, which is effectively the same thing, though your RAM doesn't need to store two copies of the data at once (one in R and one in H2O), so it might be more efficient if you are really strapped for resources.
Save the file as a CSV or a zipped CSV. If you are having issues saving the data frame to disk as a CSV, then you should make sure you're using an efficient file writer like data.table::fwrite(). Once you have the file on disk, read it directly into H2O using h2o.importFile().
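A sketch of both options; big_data.rds and big_data.csv are hypothetical file names:
library(data.table)
library(h2o)
h2o.init()

# Option 1: let as.h2o() use data.table's fast writer under the hood
options("h2o.use.data.table" = TRUE)
df <- readRDS("big_data.rds")
hf <- as.h2o(df)

# Option 2: write the CSV with fwrite, free the R copy, then let H2O parse the file
fwrite(df, "big_data.csv")
rm(df); gc()
hf <- h2o.importFile("big_data.csv")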

Exporting R data.table in .txt or .csv format

Recently I have run into a problem exporting data from R to a more common format such as .csv or .txt.
My dataset is in data.table format and has 149,000 rows × 124 columns. I used the following lines of code to try to export it:
write.table(data_reduced,"directory/data_reduced.txt",sep="\t",row.names=FALSE)
write.csv2(data_reduced,"directory/data_reduced.csv")
The result, in both cases, is that the .txt or .csv files have fewer rows than they should, and the count changes between trials (it ranges from 900 to 1800, more or less). Usually what I get is the first rows and then the very last one.
I have tried converting the data.table to a matrix or a data.frame, but the result I get is more or less the same. I have also tried the write.xlsx function, but I ran into problems with Java (which is a common issue, judging by the SO forum and other web sources).
I have also read about a function called fwrite for exporting very large datasets, but it looks like my RStudio cannot find it, even though I installed the data.table package.
Can anyone give me an explanation/solution for this problem? I've been reading different sources to sort it out but with no success until now.
I use RStudio Version 0.99.473.
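On the fwrite point: fwrite() was only added to data.table in version 1.9.8, so an older installation will not have it. A minimal check-and-use sketch:
# fwrite() requires data.table >= 1.9.8; update the package if it is missing
if (packageVersion("data.table") < "1.9.8") install.packages("data.table")
library(data.table)
fwrite(data_reduced, "directory/data_reduced.csv")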

Importing .csv files with Sys.Date()

I have a .csv dataset that gets dumped every day, which I use to generate a daily list for tracking participants with an R script. I would like to automate this R script; in order to do so, I need to read in the .csv using Sys.Date().
The .csv dataset is named DumpedList_2013-11-27 (the date will always be today's date).
I would like to import this into the script, as I would for an .Rdata file:
load(paste('/srv/Data/Baseline2/baseline2_', Sys.Date(), '.Rdata',sep=''))
What is the equivalent of the command above for reading in .csv files?
I have tried the load and read.csv commands, but I get error messages:
data=read.csv('P:/DirectoryPath/DumpedList_',Sys.Date(),'.csv')
I also attempted to create todaydate = Sys.Date() and then use it to load the data, but got error messages again:
a = load(paste("P:/DirectoryPath/DumpedList_", todaydate, ".csv"))
Any insight?
By default, paste separates its arguments with spaces; use paste0 to join strings together seamlessly:
read.csv(paste0('P:/DirectoryPath/DumpedList_',Sys.Date(),'.csv'))
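For illustration, the constructed path; paste0 converts the Date to its "YYYY-MM-DD" character form automatically:
path <- paste0("P:/DirectoryPath/DumpedList_", Sys.Date(), ".csv")
path
# "P:/DirectoryPath/DumpedList_2013-11-27"  (when run on that date)
data <- read.csv(path)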
