Use readLines in successive chunks in R

I've got a file with 2m+ lines.
To avoid memory overload, I want to read these lines in chunks and then perform further processing with the lines in the chunk.
I read that readLines is the fastest option, but I could not find a way to read chunks with readLines.
raw = readLines(target_file, n = 500)
But what I'd want next is a readLines call for lines 501:1000, e.g.
raw = readLines(target_file, n = 501:1000)
Is there a way to do this in R?

Maybe this helps someone in the future:
The readr package has just what I was looking for: a function to read lines in chunks.
read_lines_chunked reads a file in chunks of lines and then expects a callback to be run on these chunks.
Let f be the function needed for storing a chunk for later use:
f = function(x, pos){
  filename = paste("./chunks/chunk_", pos, ".RData", sep = "")
  save(x, file = filename)
}
Then I can use this in the main wrapper as:
read_lines_chunked(file = target_json,
                   chunk_size = 10000,
                   callback = SideEffectChunkCallback$new(f))
Works.
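For comparison, a minimal base-R sketch of the same idea, assuming the original target_file, a 500-line chunk size and that the ./chunks/ directory exists: readLines on an open connection continues from where the previous call stopped, so repeated calls return successive chunks.
con <- file(target_file, open = "r")
pos <- 0
while (length(chunk <- readLines(con, n = 500)) > 0) {
  pos <- pos + 1
  # process the current chunk, e.g. save it like the readr callback above
  save(chunk, file = paste0("./chunks/chunk_", pos, ".RData"))
}
close(con)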

I don't know how many variables (columns) you have, but data.table::fread is a very fast alternative to what you want:
require(data.table)
raw <- fread(target_file)

Related

Read large CSV in chunks and transform to RDS in R

I have a 20GB CSV file that I want to convert to an RDS file in R. However, the original file is too large to be processed (the computer with 64GB RAM tells me that 80.9GB needs to be allocated which exceeds its memory capacity). Therefore I am wondering, whether and how I can read that CSV in chunks, turn each chunk into a separate RDS file and afterward merge them together? Would that yield the same outcome as if I directly turned that one CSV file into an RDS file?
I am very new to R and could unfortunately not find any answers to my question.
Below is the current code I'm using.
library(Matrix)
library(data.table)
b <- fread('dtm.csv')
b_matx<- as.matrix(b)
dtm_b <- Matrix(b_matx, sparse = TRUE)
saveRDS(dtm_b, "dtm.rds")
See if this works.
It reads one column at a time using fread. By default fread returns a data.table; data.tables use external pointers, which can be a problem here, so we pass the data.table = FALSE argument to get a plain data frame. After reading in a column it immediately writes it back out as an RDS file. After all columns have been written out as RDS files, it reads them back in and writes the final RDS file that combines them. We use the 6-row input in the Note at the end as an example.
If fread with select= still takes up too much memory, use the xsv utility (not an R program) to ensure that only the column of interest is read in; xsv can be downloaded for various platforms. Then use the commented-out line instead of the line following it. (Alternatively, use cut, sed or awk suitably for the same purpose.)
You can also try interspersing the code lines with gc() to trigger garbage collection.
Also try replacing as.data.frame in the last line with setDT (a sketch of that variant follows the Note below).
library(data.table)

File <- "BOD.csv"

# fread wrapper that returns a plain data frame rather than a data.table
freadDF <- function(..., data.table = FALSE) fread(..., data.table = data.table)

# read zero rows just to get the column names
L <- as.list(freadDF(File, nrows = 0))
nms <- names(L)
fmt <- "xsv select %s %s"

# write each column out to its own RDS file, holding only one column in memory at a time
# for(nm in nms) saveRDS(freadDF(cmd = sprintf(fmt, nm, File))[[1]], paste0(nm, ".rds"))
for(nm in nms) saveRDS(freadDF(File, select = nm)[[1]], paste0(nm, ".rds"))

# read the per-column RDS files back in and write out the combined RDS file
for(nm in names(L)) L[[nm]] <- readRDS(paste0(nm, ".rds"))
saveRDS(as.data.frame(L), sub(".csv$", ".rds", File))
Note
write.csv(BOD, "BOD.csv", quote = FALSE, row.names = FALSE)
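As a rough sketch of the setDT variant mentioned above (setDT converts the list of columns to a data.table by reference, so it avoids the extra copy that as.data.frame(L) makes):
# same final step as above, but without the copy made by as.data.frame(L)
for(nm in names(L)) L[[nm]] <- readRDS(paste0(nm, ".rds"))
setDT(L)                               # turns the list into a data.table in place
saveRDS(L, sub(".csv$", ".rds", File))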

Partition a large list into chunks with convenient I/O

I have a large list with a size of approx. 1.3 GB. I'm looking for the fastest solution in R to generate chunks and save them in any convenient format so that:
a) every saved file of the chunk is less than 100MB large
b) the original list can be loaded conveniently and fast into a new R workspace
EDIT II: The reason for doing this is an R-only way to work around the GitHub file size restriction of 100 MB per file. The limitation to R is due to some external non-technical restrictions which I can't comment on.
What is the best solution for this problem?
EDIT I: Since it was mentioned in the comments that some code for the problem helps to create a better question, here is an R example of a list with a size of about 1.3 GB:
li <- list(a = rnorm(10^8),
           b = rnorm(10^7.8))
So, you want to split a file and reload it into a single data frame.
There is a twist: to reduce file size, it would be wise to compress, but then the file size is not entirely deterministic. You may have to tweak a parameter.
The following is a piece of code I have used for a similar task (unrelated to GitHub though).
The split.file function takes 3 arguments: a dataframe, the number of rows to write in each file, and the base filename. For instance, if basename is "myfile", the files will be "myfile00001.rds", "myfile00002.rds", etc.
The function returns the number of files written.
The join.files function takes the base name.
Note:
Play with the rows parameter to find out the correct size to fit in 100 MB. It depends on your data, but for similar datasets a fixed size should do. However, if you are dealing with very different datasets, this approach will likely fail.
When reading, you need roughly twice as much memory as is occupied by your data frame (because a list of the smaller data frames is read first and then rbind-ed together).
The number is written as 5 digits, but you can change that. The goal is to have the names in lexicographic order, so that when the files are concatenated, the rows are in the same order as the original file.
Here are the functions:
split.file <- function(db, rows, basename) {
  n <- nrow(db)
  m <- n %/% rows
  for (k in seq_len(m)) {
    db.sub <- db[seq(1 + (k - 1) * rows, k * rows), , drop = FALSE]
    saveRDS(db.sub, file = sprintf("%s%.5d.rds", basename, k),
            compress = "xz", ascii = FALSE)
  }
  if (m * rows < n) {
    db.sub <- db[seq(1 + m * rows, n), , drop = FALSE]
    saveRDS(db.sub, file = sprintf("%s%.5d.rds", basename, m + 1),
            compress = "xz", ascii = FALSE)
    m <- m + 1
  }
  m
}

join.files <- function(basename) {
  files <- sort(list.files(pattern = sprintf("%s[0-9]{5}\\.rds", basename)))
  do.call("rbind", lapply(files, readRDS))
}
Example:
n <- 1500100
db <- data.frame(x = rnorm(n))
split.file(db, 100000, "myfile")
dbx <- join.files("myfile")
all(dbx$x == db$x)
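To check the 100 MB constraint (and to help tune the rows argument as suggested in the Note), a quick sketch that lists the sizes of the generated chunk files in megabytes:
chunk_files <- list.files(pattern = "myfile[0-9]{5}\\.rds")
file.size(chunk_files) / 1024^2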

R: import millions of small alphanumeric csv files

I have about 300 GB of 15 KB csv files (each with exactly 100 rows) that I need to import, concatenate, manipulate and resave as a single rds.
I've managed to reduce the amount of RAM needed by only importing the columns I need but as soon as I need to do any operations on the columns, I max it out.
What is your strategy for this type of problem?
This is a shot at answering your question.
While this may not be the most effective or efficient solution, it works. The biggest upside is that you don't need to store all the information at once, instead just appending the result to a file.
If this is not fast enough, it is possible to parallelise it to speed things up (a sketch follows the code below).
library(tidyverse)
library(data.table)

# Make some example files
for (file_number in 1:1000) {
  df = data.frame(a = runif(10), b = runif(10))
  write_csv(x = df, path = paste0("example_", file_number, ".csv"))
}

# Get the list of files, change getwd() to your directory
list_of_files <- list.files(path = getwd(), full.names = TRUE)

# Define function to read, manipulate, and save result
read_man_save <- function(filename) {
  # Read file using data.table fread, which is faster than read_csv
  df = fread(file = filename)
  # Do the manipulation here, for example getting only the mean of a
  result = mean(df$a)
  # Append to a file
  write(result, file = "out.csv", append = TRUE)
}

# Use lapply to perform the function over the list of filenames
# The output (which is null) is stored in a junk object
junk <- lapply(list_of_files, read_man_save)

# The resulting "out.csv" now contains 1000 lines of the mean
Feel free to comment if you want any edits to better reflect your use case.
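As a rough sketch of that parallelisation (using the base parallel package; mclapply forks and so only runs in parallel on Linux/macOS, mc.cores = 4 is an arbitrary choice, and the results are collected and written once at the end so that several workers never append to out.csv at the same time):
library(parallel)
library(data.table)

# read one file and return its result instead of appending to a shared file
read_man <- function(filename) {
  df <- fread(file = filename)
  mean(df$a)
}

# process the files on several cores, then write the collected results in one go
results <- mclapply(list_of_files, read_man, mc.cores = 4)
fwrite(data.table(result = unlist(results)), "out.csv")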
You could also use the disk.frame package, which is designed to allow manipulation of data larger than RAM.
You can then manipulate the data like you would in data.table or using dplyr verbs; a rough sketch follows.
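A minimal sketch of the disk.frame route (the outdir path is arbitrary, columns a and b are the ones from the example files above, and it assumes csv_to_disk.frame accepts the vector of file paths; check the package documentation for the exact arguments):
library(disk.frame)
library(dplyr)

setup_disk.frame()   # start the background workers disk.frame uses

# import the CSVs into an on-disk frame, then work on it chunk-wise with dplyr verbs
big_df <- csv_to_disk.frame(list_of_files, outdir = "big.df")
result <- big_df %>%
  filter(a > 0.5) %>%
  select(a, b) %>%
  collect()          # bring only the reduced result back into RAM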

Loop in R to read sequentially numbered filenames and output accordingly numbered files

I'm sure this is very simple, but I'm new to doing my own programming in R and haven't quite gotten the hang of the syntax for looping.
I have code like this:
mydata1 <- read.table("ph001.txt", header=TRUE)
# ... series of formatting and merging steps
write.table(mydata4, "ph001_anno.txt", row.names=FALSE, quote=FALSE, sep="\t")
png("manhattan_ph001.png"); manhattan(mydata4); dev.off()
png("qq_ph001.png"); qq(mydata4$P); dev.off()
The input file ph001.txt is output from a linear regression algorithm, and from that file I need to output ph001_anno.txt, manhattan_ph001.png, and qq_ph001.png. The latter two use the qqman package.
I have a folder that contains ph001 through ph138, and would like a loop function that reads these files individually and creates the corresponding output files for each file. As I said, I'm sure there is an easy way to do this as a loop function, but the part that's tripping me up is modifying the output filenames.
You can use the stringr package to do a lot of the string manipulation you want in order to generate your file names, like so:
library(stringr)

f <- function(i) {
  num <- str_pad(i, 3, pad = "0")
  a <- str_c("ph", num, "_anno.txt")
  m <- str_c("manhattan_ph", num, ".png")
  q <- str_c("qq_ph", num, ".png")
  # Put code to do stuff with these file names here
}
sapply(1:138, f)
In the above block of code, for each number in 1:138 you create the name of three files. You can then use those file names in calls to read.table or ggsave or whatever you want.
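For instance, here is a sketch that plugs the generated names into the steps from the question (the helper name process_one is arbitrary, it assumes the qqman package is attached, and it keeps the question's elided formatting/merging steps as a placeholder, so it only runs once the steps producing mydata4 are filled in):
library(stringr)
library(qqman)

process_one <- function(i) {
  num <- str_pad(i, 3, pad = "0")
  mydata1 <- read.table(str_c("ph", num, ".txt"), header = TRUE)
  # ... series of formatting and merging steps producing mydata4
  write.table(mydata4, str_c("ph", num, "_anno.txt"),
              row.names = FALSE, quote = FALSE, sep = "\t")
  png(str_c("manhattan_ph", num, ".png")); manhattan(mydata4); dev.off()
  png(str_c("qq_ph", num, ".png")); qq(mydata4$P); dev.off()
}
sapply(1:138, process_one)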

Convert R read.csv to a readLines batch?

I have a fitted model that I'd like to apply to score a new dataset stored as a CSV. Unfortunately, the new data set is kind of large, and the predict procedure runs out of memory on it if I do it all at once. So, I'd like to convert the procedure that worked fine for small sets below, into a batch mode that processes 500 lines at a time, then outputs a file for each scored 500.
I understand from this answer (What is a good way to read line-by-line in R?) that I can use readLines for this. So, I'd be converting from:
trainingdata <- as.data.frame(read.csv('in.csv'), stringsAsFactors=F)
fit <- mymodel(Y~., data=trainingdata)
newdata <- as.data.frame(read.csv('newstuff.csv'), stringsAsFactors=F)
preds <- predict(fit,newdata)
write.csv(preds, file=filename)
to something like:
trainingdata <- as.data.frame(read.csv('in.csv'), stringsAsFactors=F)
fit <- mymodel(Y~., data=trainingdata)
con <- file("newstuff.csv", open = "r")
i = 0
while (length(mylines <- readLines(con, n = 500, warn = FALSE)) > 0) {
i = i+1
newdata <- as.data.frame(mylines, stringsAsFactors=F)
preds <- predict(fit,newdata)
write.csv(preds, file=paste(filename,i,'.csv',sep=''))
}
close(con)
However, when I print the mylines object inside the loop, it doesn't get split into columns the way the output of read.csv is: the headers are still a mess, and none of the parsing that read.csv does under the hood to turn the character vector into an n-column object is happening.
Whenever I find myself writing barbaric things like cutting off the first row and wrapping the columns by hand, I generally suspect R has a better way to do things. Any suggestions for how I can get read.csv-like output from a readLines CSV connection?
You can read your data into memory in chunks with read.csv by using the skip and nrows arguments. In pseudo-code:
read_chunk = function(start, n) {
  read.csv(file, skip = start, nrows = n)
}

start_indices = (0:no_chunks) * chunk_size + 1
lapply(start_indices, function(x) {
  dat = read_chunk(x, chunk_size)
  pred = predict(fit, dat)
  write.csv(pred)
})
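A more concrete sketch of the same idea (a hypothetical chunk size of 500, the fitted model fit and newstuff.csv from the question; the column names are read once from the header so that later chunks, which skip it, still get proper names, and count.fields makes one pass over the file just to count the data rows):
chunk_size <- 500
nms <- names(read.csv("newstuff.csv", nrows = 1))               # column names from the header
n_rows <- length(count.fields("newstuff.csv", sep = ",")) - 1   # data rows, excluding the header
n_chunks <- ceiling(n_rows / chunk_size)

for (i in seq_len(n_chunks)) {
  dat <- read.csv("newstuff.csv",
                  skip = 1 + (i - 1) * chunk_size, nrows = chunk_size,
                  header = FALSE, col.names = nms,
                  stringsAsFactors = FALSE)
  preds <- predict(fit, dat)
  write.csv(preds, file = paste0("preds_", i, ".csv"), row.names = FALSE)
}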
Alternatively, you could put the data into an SQLite database and use the RSQLite package to query the data in chunks. See also this answer, or do some digging with [r] large csv on SO.
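And a rough sketch of that SQLite route with DBI and RSQLite (the newstuff.sqlite file name is arbitrary, and it reuses nms, chunk_size and n_chunks from the sketch above):
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "newstuff.sqlite")

# load the CSV into the database once, one chunk at a time
for (i in seq_len(n_chunks)) {
  chunk <- read.csv("newstuff.csv",
                    skip = 1 + (i - 1) * chunk_size, nrows = chunk_size,
                    header = FALSE, col.names = nms)
  dbWriteTable(con, "newstuff", chunk, append = TRUE)
}

# then fetch and score it in 500-row batches
res <- dbSendQuery(con, "SELECT * FROM newstuff")
i <- 0
while (nrow(batch <- dbFetch(res, n = 500)) > 0) {
  i <- i + 1
  write.csv(predict(fit, batch), file = paste0("preds_sqlite_", i, ".csv"), row.names = FALSE)
}
dbClearResult(res)
dbDisconnect(con)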
