Index elements of a large CSV file when reading line by line in R

I'm reading a large CSV file (>15 GB) line by line in R. I'm using
con <- file("datafile.csv", open = "r")
while (length(oneLine <- readLines(con, n = 1, warn = FALSE)) > 0) {
# code to be written
}
In the "code to be written" section, I need to be able to refer to individual elements in each row and save them to an array. The file has no headers if that's important.
Thanks!

You could use read.table with its text argument to parse the oneLine string as if it were a CSV file:
# set your arguments: separator, decimal separator etc...
x <- read.table(text=oneLine, sep=",", dec=".", header=F)
The returned x is a data.frame with one row only that you can easily turn into an array.
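Putting that together with the loop from the question, a minimal sketch (assuming comma-separated fields; collecting every parsed row in a list is just one option, and for a 15 GB file you would normally filter or aggregate inside the loop instead of keeping everything):
con <- file("datafile.csv", open = "r")
rows <- list()
i <- 0L
while (length(oneLine <- readLines(con, n = 1, warn = FALSE)) > 0) {
  x <- read.table(text = oneLine, sep = ",", dec = ".", header = FALSE,
                  stringsAsFactors = FALSE)
  # x[[1]], x[[2]], ... are the individual fields of this row
  i <- i + 1L
  rows[[i]] <- x
}
close(con)
result <- do.call(rbind, rows)  # one data.frame; convert with as.matrix() if an array is needed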

You could do something like this:
CHUNK_SIZE <- 5000
con <- file('datafile.csv', 'rt')
res <- NULL
while (nrow(chunk <- read.csv(con, nrows = CHUNK_SIZE, header = FALSE, stringsAsFactors = FALSE)) > 0) {
  res <- rbind(res, chunk)
  # a short (final) chunk means the file is exhausted; stop before the next read fails
  if (nrow(chunk) < CHUNK_SIZE) break
}
close(con)
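Growing res with rbind on every pass re-copies all rows read so far, which becomes slow on a file this size. A variant of the same idea (not from the original answer) that collects the chunks in a list and binds them once at the end:
CHUNK_SIZE <- 5000
con <- file('datafile.csv', 'rt')
chunks <- list()
i <- 0L
repeat {
  chunk <- read.csv(con, nrows = CHUNK_SIZE, header = FALSE, stringsAsFactors = FALSE)
  i <- i + 1L
  chunks[[i]] <- chunk
  # like the loop above, this assumes the final chunk is shorter than CHUNK_SIZE;
  # wrap the read in tryCatch if the row count could be an exact multiple
  if (nrow(chunk) < CHUNK_SIZE) break
}
close(con)
res <- do.call(rbind, chunks)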

Related

How can I quickly find all the files in a directory that are missing a first row?

I have a folder of files that are in .csv format. They have blank lines in them that are necessary (they indicate the absence of a measure from a LiDAR unit, which is good and needs to stay in). But occasionally the first row is empty, and this throws off the code and the package, and everything aborts.
Right now I have to open each .csv and see if the first line is empty.
I would like to do one of the following, but am at a loss how to:
1) write a code that quickly scans through all of the files in the directory and tells me which ones are missing the first line
2) be able to skip the empty lines that are only at the beginning--which can vary, sometimes more than one line is empty
3) have a code that cycles through all of the .csv files and inserts a dummy first line of numbers so the files all import no problem.
Thanks!
Here's a bit of code that does 1 and 2 above. I'm not sure why you'd want to insert dummy line(s) given the ability to do 1 and 2; it's straightforward to do, but usually it's not a good idea to modify raw data files.
# Create some test files
cat("x,y", "1,2", sep="\n", file = "blank0.csv")
cat("", "x,y", "1,2", sep="\n", file = "blank1.csv")
cat("", "", "x,y", "1,2", sep="\n", file = "blank2.csv")
files <- list.files(pattern = "*.csv", full.names = TRUE)
for (i in seq_along(files)) {
  filedata <- readLines(files[i])
  lines_to_skip <- min(which(filedata != "")) - 1
  cat(i, files[i], lines_to_skip, "\n")
  x <- read.csv(files[i], skip = lines_to_skip)
}
This prints
1 ./blank0.csv 0
2 ./blank1.csv 1
3 ./blank2.csv 2
and reads in each dataset correctly.
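If all you want is the report from item 1 (which files are missing their first line), the same idea can be reduced to a quick scan of the directory; a small sketch, assuming every file has at least one non-blank line:
leading_blanks <- vapply(files, function(f) {
  filedata <- readLines(f)
  min(which(filedata != "")) - 1L
}, integer(1))
files[leading_blanks > 0]  # files whose leading line(s) are empty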
I believe that the two functions that follow can do what you want/need.
First, a function to determine which files have a blank second line.
second_blank <- function(path = ".", pattern = "\\.csv"){
  fls <- list.files(path = path, pattern = pattern)
  second <- sapply(fls, function(f) readLines(f, n = 2)[2])
  which(nchar(gsub(",", "", second)) == 0)
}
Then, a function to read in the files with such lines, one at a time. Note that I assume that the first line is the column header and that at least the second line is blank. There is a dots argument, ..., for you to pass other arguments to read.table, such as stringsAsFactors = FALSE.
skip_blank <- function(file, ...){
  header <- readLines(file, n = 1)
  header <- strsplit(header, ",")[[1]]
  count <- 1L
  while(TRUE){
    txt <- scan(file, what = "character", skip = count, nlines = 1, quiet = TRUE)
    # a truly empty line gives character(0); a line of only commas gives "" after the gsub
    if(length(txt) > 0 && any(nchar(gsub(",", "", txt)) > 0)) break
    count <- count + 1L
  }
  # header = FALSE: line 1 was already captured as the header above
  dat <- read.table(file, skip = count, header = FALSE, sep = ",", dec = ".", fill = TRUE, ...)
  names(dat) <- header
  dat
}
Now, an example usage.
second_blank(pattern = "csv") # a first run as an example usage
inx <- second_blank() # this will be needed later
fl_names <- list.files(pattern = "\\.csv") # get all the CSV files
df_list <- lapply(fl_names[inx], skip_blank) # read the problem ones
names(df_list) <- fl_names[inx] # tidy up the result list
df_list

Stream processing large csv file in R

I need to make a couple of relatively simple changes to a very large csv file (c. 8.5 GB). I initially tried various reader functions: read.csv, readr::read_csv, data.table::fread. However, they all run out of memory.
I'm thinking I need to use a stream-processing approach instead: read a chunk, update it, write it out, repeat. I found this answer, which is along the right lines; however, I don't know how to terminate the loop (I'm relatively new to R).
So I have 2 questions:
What's the right way to make the while loop work?
Is there a better way (for some definition of 'better')? e.g. is there some way to do this using dplyr & pipes?
Current code as follows:
src_fname <- "testdata/model_input.csv"
tgt_fname <- "testdata/model_output.csv"
#Changes needed in file: rebase identifiers, set another col to constant value
rebase_data <- function(data, offset) {
  data$'Unique Member ID' <- data$'Unique Member ID' - offset
  data$'Client Name' <- "TestClient2"
  return(data)
}
CHUNK_SIZE <- 1000
src_conn = file(src_fname, "r")
data <- read.csv(src_conn, nrows = CHUNK_SIZE, check.names=FALSE)
cols <- colnames(data)
offset <- data$'Unique Member ID'[1] - 1
data <- rebase_data(data, offset)
#1st time through, write the headers
tgt_conn = file(tgt_fname, "w")
write.csv(data,tgt_conn, row.names=FALSE)
#loop over remaining data
end = FALSE
while(end == FALSE) {
  data <- read.csv(src_conn, nrows = CHUNK_SIZE, check.names=FALSE, col.names = cols)
  data <- rebase_data(data, offset)
  #write.csv doesn't support col.names=FALSE; so use write.table which does
  write.table(data, tgt_conn, row.names=FALSE, col.names=FALSE, sep=",")
  # ??? How to test for EOF and set end = TRUE if so ???
  # This doesn't work, presumably because nrow() != CHUNK_SIZE on final loop?
  if (nrow(data) < CHUNK_SIZE) {
    end <- TRUE
  }
}
close(src_conn)
close(tgt_conn)
Thanks for any pointers.
Sorry to poke a two-year-old thread, but with readr::read_csv_chunked (attached along with dplyr when you load the tidyverse), you could now also do something like this:
require(tidyverse)
## For non-exploratory code, as #antoine-sac suggested, use:
# require(readr) # for function `read_csv_chunked` and `read_csv`
# require(dplyr) # for the pipe `%>%` thus less parentheses
src_fname = "testdata/model_input.csv"
tgt_fname = "testdata/model_output.csv"
CHUNK_SIZE = 1000
offset = read_csv(src_fname, n_max=1)$comm_code %>% as.numeric() - 1
rebase.chunk = function(df, pos) {
  df$comm_code = df$comm_code %>% as.numeric() - offset
  df$'Client Name' = "TestClient2"
  is.append = ifelse(pos > 1, T, F)
  df %>% write_csv(
    tgt_fname,
    append=is.append
  )
}
read_csv_chunked(
  src_fname,
  callback=SideEffectChunkCallback$new(rebase.chunk),
  chunk_size = CHUNK_SIZE,
  progress = T # optional, show progress bar
)
Here the tricky part is setting is.append based on the parameter pos, which indicates the starting row number of the data frame df within the original file. In readr::write_csv, when append=F the header (column names) is written to the file; otherwise it is not.
Try this out:
library("chunked")
read_chunkwise(src_fname, chunk_size=CHUNK_SIZE) %>%
  rebase_data(offset) %>%
  write_chunkwise(tgt_fname)
You may need to fiddle a bit with the colnames to get exactly what you want.
(Disclaimer: haven't tried the code)
Note that there is no vignette with the package but the standard usage is described on github: https://github.com/edwindj/chunked/
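Since chunked also supports dplyr verbs on its chunk-wise objects, the rebasing step could be written inline with mutate instead of going through rebase_data. A sketch under that assumption (offset taken from the first data row as in the question; as noted above, you may still need to fiddle with the column names, since names containing spaces may come through altered):
library(chunked)
library(dplyr)
offset <- read.csv(src_fname, nrows = 1, check.names = FALSE)$'Unique Member ID' - 1
read_chunkwise(src_fname, chunk_size = CHUNK_SIZE) %>%
  mutate(`Unique Member ID` = `Unique Member ID` - offset,
         `Client Name` = "TestClient2") %>%
  write_chunkwise(tgt_fname)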
OK I found a solution, as follows:
# src_fname <- "testdata/model_input.csv"
# tgt_fname <- "testdata/model_output.csv"
CHUNK_SIZE <- 20000
#Changes needed in file: rebase identifiers, set another col to constant value
rebase_data <- function(data, offset) {
  data$'Unique Member ID' <- data$'Unique Member ID' - offset
  data$'Client Name' <- "TestClient2"
  return(data)
}
#--------------------------------------------------------
# Get the structure first to speed things up
#--------------------------------------------------------
structure <- read.csv(src_fname, nrows = 2, check.names = FALSE)
cols <- colnames(structure)
offset <- structure$'Unique Member ID'[1] - 1
#Open the input & output files for reading & writing
src_conn = file(src_fname, "r")
tgt_conn = file(tgt_fname, "w")
lines_read <- 0
end <- FALSE
read_header <- TRUE
write_header <- TRUE
while(end == FALSE) {
  data <- read.csv(src_conn, nrows = CHUNK_SIZE, check.names=FALSE, col.names = cols, header = read_header)
  if (nrow(data) > 0) {
    lines_read <- lines_read + nrow(data)
    print(paste0("lines read this chunk: ", nrow(data), ", lines read so far: ", lines_read))
    data <- rebase_data(data, offset)
    #write.csv doesn't support col.names=FALSE; so use write.table which does
    write.table(data, tgt_conn, row.names=FALSE, col.names=write_header, sep = ",")
  }
  if (nrow(data) < CHUNK_SIZE) {
    end <- TRUE
  }
  read_header <- FALSE
  write_header <- FALSE
}
close(src_conn)
close(tgt_conn)
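One caveat with this solution (and with the chunked read.csv answers above): if the number of data rows happens to be an exact multiple of CHUNK_SIZE, the final read.csv finds an exhausted connection and stops with "no lines available in input" instead of returning zero rows. A hedged guard is to wrap the read inside the loop in tryCatch, for example:
data <- tryCatch(
  read.csv(src_conn, nrows = CHUNK_SIZE, check.names = FALSE,
           col.names = cols, header = read_header),
  error = function(e) data.frame()  # treat "no lines available in input" as an empty final chunk
)
if (nrow(data) == 0) break  # nothing left to read; exit the while loop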

Read numeric input as a string in R

So, I have this input CSV of the form:
id,No.,V,S,D
1,0100000109,623,233,331
2,0200000109,515,413,314
3,0600000109,611,266,662
I need to read the No. column as it is (i.e., as a character). I know I can use something like this for that:
data <- read.csv("input.csv", colClasses = c("MSISDN" = "character"))
I have code that I'm using to read the CSV file in chunks:
chunk_size <- 2
con <- file("input.csv", open = "r")
data_frame <- read.csv(con, nrows = chunk_size, colClasses = c("MSISDN" = "character"), quote = "", header = TRUE)
header <- names(data_frame)
print(header)
print(data_frame)
if(nrow(data_frame) == chunk_size) {
  repeat {
    data_frame <- read.csv(con, nrows = chunk_size, header = FALSE, quote = "")
    names(data_frame) <- c(header)
    print(header)
    print(data_frame)
    if(nrow(data_frame) < chunk_size) {
      break
    }
  }
}
close(con)
But the issue I'm facing here is that only the first chunk reads the No. column as a character; the rest of the chunks do not.
How can I resolve this?
PS: the original input file has 150+ columns and about 20 million rows.
You can read the data as strings with readLines and split it:
fileName <- "input.csv"
df <- do.call(rbind.data.frame, strsplit(readLines(fileName), ",")[-1]) # drop the header line
colnames(df) <- c("id","No.","V","S","D") # add the column names back
or the direct approach with read.csv:
fileName <- "input.csv"
col <- c("integer","character","integer","integer","integer")
df <- read.csv(file = fileName,
               sep = ",",
               colClasses = col,
               header = TRUE,
               stringsAsFactors = FALSE)
You need to give the column types via colClasses in the read.csv() inside the repeat loop.
You no longer have the header, so you need to define an unnamed vector to specify colClasses.
Let's say there are 150 columns.
myColClasses=rep("numeric",150)
myColClasses[2] <- "character"
repeat {
  data_frame <- read.csv(con, nrows = chunk_size, colClasses = myColClasses, header = FALSE, quote = "")
  ...
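Putting that together with the chunked loop from the question, a minimal sketch (assuming, as in the sample data and the 150-column description, that column 2 is the one to keep as character and the rest are numeric):
myColClasses <- rep("numeric", 150)
myColClasses[2] <- "character"
con <- file("input.csv", open = "r")
data_frame <- read.csv(con, nrows = chunk_size, colClasses = myColClasses,
                       quote = "", header = TRUE)
header <- names(data_frame)
if (nrow(data_frame) == chunk_size) {
  repeat {
    data_frame <- read.csv(con, nrows = chunk_size, colClasses = myColClasses,
                           header = FALSE, quote = "")
    names(data_frame) <- header
    # a short final chunk signals the end of the file
    if (nrow(data_frame) < chunk_size) break
  }
}
close(con)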

Skip empty files when importing text files

I have a folder with about 700 text files that I want to import and add a column to. I've figured out how to do this using the following code:
files = list.files(pattern = "*c.txt")
DF <- NULL
for (f in files) {
  data <- read.table(f, header = F, sep = ",")
  data$species <- strsplit(f, split = "c.txt")  # column value is the file name
  DF <- rbind(DF, data)
}
write.xlsx(DF,"B:/trends.xlsx")
The problem is, there are about 100 files that are empty, so the code stops at the first empty file and I get this error message:
Error in read.table(f, header = F, sep = ",") :
no lines available in input
Is there a way to skip over these empty files?
You can skip empty files by checking that file.size(some_file) > 0:
files <- list.files("~/tmp/tmpdir", pattern = "*.csv")
##
df_list <- lapply(files, function(x) {
  if (!file.size(x) == 0) {
    read.csv(x)
  }
})
##
R> dim(do.call("rbind", df_list))
#[1] 50 2
This skips over the 10 files that are empty, and reads in the other 10 that are not.
Data:
for (i in 1:10) {
  df <- data.frame(x = 1:5, y = 6:10)
  write.csv(df, sprintf("~/tmp/tmpdir/file%i.csv", i), row.names = FALSE)
  ## empty file
  system(sprintf("touch ~/tmp/tmpdir/emptyfile%i.csv", i))
}
For a different approach that introduces explicit error handling, think about a tryCatch to handle anything else bad that might happen in your read.table.
for (f in files) {
  data <- tryCatch({
    if (file.size(f) > 0){
      read.table(f, header = F, sep=",")
    }
  }, error = function(err) {
    # error handler picks up where error was generated
    print(paste("Read.table didn't work!: ", err))
  })
  data$species <- strsplit(f, split = "c.txt")
  DF <- rbind(DF, data)
}

Looping a function over multiple files

I wrote a simple function:
myfunction <- function(fileName, stringsAsFactors=TRUE,
                       check.names=FALSE,
                       skip = 1, ...) {
  Data <- read.delim(fileName, skip = skip,
                     stringsAsFactors=stringsAsFactors,
                     check.names = check.names, ...)
  cb <- list()
  Index <- as.numeric(as.factor(Data[,1]))
  cb <- cbind(Data, Index)
  return(cb)
}
This function takes the first column of the data read from the file (Data), creates an Index from that first column, and then cbinds Data and the created Index.
This function will be applied to files named myfile_00.txt, myfile_01.txt, and so on. For a single file it looks like:
myfunction (fileName = "myfile_00.txt")
myfunction (fileName = "myfile_01.txt")
.......
I have around 1000 files, so I supposed the loop could look like this, from another post:
mytxt <- dir(pattern=".txt")
n <- length(mytxt)
mylist <- vector("list", n)
for(i in 1:n) {
  mylist[[i]] <- read.delim(mytxt[i], header = F, skip = 1)
}
then:
d <- lapply(mylist, myfunction)
Unfortunately it does not work... When using lapply an error occurs:
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
'file' must be a character string or connection
Since I'm new to R, I'm probably making mistakes I'm not able to figure out.
Like #Arun pointed out, you are trying to run your function twice: once on the files and once on the data frames you have created... Instead, your code should look like this:
files <- list.files(pattern = ".txt")
mylist <- lapply(files, myfunction)
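If you then want a single combined data frame rather than a list (assuming all files share the same columns), a one-liner finishes the job:
d <- do.call(rbind, mylist)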
