I have over 300 large CSV files with the same filename, each in a separate sub-directory, that I would like to merge into a single dataset using R. I'm asking for help on how to remove columns I don't need in each CSV file, while merging in a way that breaks the process down into smaller chunks that my memory can more easily handle.
My objective is to create a single CSV file that I can then import into STATA for further analysis using code I have already written and tested on one of these files.
Each of my CSVs is itself rather large (about 80 columns, many of which are unnecessary, and each file has tens to hundreds of thousands of rows), and there are almost 16 million observations in total, or roughly 12GB.
I have written some code which manages to do this successfully for a test case of two CSVs. The challenge is that neither my work nor my personal computers have enough memory to do this for all 300+ files.
The code I have tried is here:
library(here) ##loads the package used to locate files
( allfiles = list.files(path = here("data"), ##creates a list of the files, read as [1], [2], ... [n]
pattern = "candidates.csv", ##[identifies the relevant files]
full.names = TRUE, ##identifies the full file name
recursive = TRUE) ) ##searches in sub-directories
read_fun = function(path) {
test = read.csv(path,
header = TRUE )
test
} ###reads all the files
(test = read.csv(allfiles[1],
header = TRUE ) )###tests that file [1] has been read
library(purrr) ###loads the package that provides map_dfr
library(dplyr) ###loads dplyr, which map_dfr needs to bind the rows
( combined_dat = map_dfr(allfiles, read_fun) )
I expect the result to be a single RDS file, and this works for the test case. Unfortunately, the amount of memory this process requires when looking at 15.5m observations across all my files causes RStudio to crash, and no RDS file is produced.
I am looking for help on how to 1) reduce the load on my memory by stripping out some of the variables in my CSV files I don't need (columns with headers junk1, junk2, etc); and 2) how to merge in a more manageable way that merges my CSV files in sequence, either into a few RDS files to themselves be merged later, or through a loop cumulatively into a single RDS file.
However, I don't know how to proceed with these - I am still new to R, and any help on how to proceed with both 1) and 2) would be much appreciated.
Thanks,
Twelve GB is quite a bit for one object. It's probably not practical to use a single RDS or CSV unless you have far more than 12GB of RAM. You might want to look into using a database, a technology that is made for this kind of thing. I'm sure Stata can also interact with databases. You might also want to read up on how to interact with large CSVs using various strategies and packages.
Creating a large CSV isn't at all difficult. Just remember that you have to work with said giant CSV sometime in the future, which probably will be difficult. To create a large CSV, just process each component CSV individually and then append them to your new CSV. The following reads in each CSV, removes unwanted columns, and then appends the resulting dataframe to a flat file:
library(dplyr)
library(readr)
library(purrr)
load_select_append <- function(path) {
  # Read in CSV. Let every column be of class character, however many
  # columns the file has.
  df <- read_csv(path, col_types = cols(.default = col_character()))
  # Remove variables beginning with "junk"
  df <- select(df, -starts_with("junk"))
  # If file exists, append to it without column names, otherwise create with
  # column names.
  if (file.exists("big_csv.csv")) {
    write_csv(df, "big_csv.csv", col_names = FALSE, append = TRUE)
  } else {
    write_csv(df, "big_csv.csv", col_names = TRUE)
  }
}
# Get the paths to the CSVs.
csv_paths <- list.files(path = "dir_of_csvs",
pattern = "\\.csv.*",
recursive = T,
full.names = T
)
# Apply function to each path.
walk(csv_paths, load_select_append)
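If memory is tight even for a single file, the junk columns can be skipped at read time rather than dropped afterwards. Below is a minimal base R sketch, assuming the unwanted columns all have names starting with "junk" as in the question. The helper name read_without_junk and the demo file are made up for illustration.

```r
# Hypothetical helper: read a CSV while skipping every column whose
# name starts with "junk", so those columns are never loaded.
read_without_junk <- function(path) {
  # Read a single row just to learn the column names.
  header <- names(read.csv(path, nrows = 1, check.names = FALSE))
  # "NULL" tells read.csv to skip a column; the rest are kept as character.
  classes <- ifelse(startsWith(header, "junk"), "NULL", "character")
  read.csv(path, colClasses = classes, check.names = FALSE)
}

# Small demonstration on a temporary file (made-up data).
demo_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, junk1 = letters[1:3], score = c(2.5, 3, 4)),
          demo_path, row.names = FALSE)
demo <- read_without_junk(demo_path)
names(demo)
```

The returned data frame could then be appended to the flat file in the same way as in load_select_append.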
When you're ready to work with your CSV you might want to consider using something like the ff package, which enables interaction with on-disk objects. You are somewhat restricted in what you can do with an ffdf object, so eventually you'll have to work with samples:
library(ff)
df_ff <- read.csv.ffdf(file = "big_csv.csv")
df_samp <- df_ff[sample.int(nrow(df_ff), size = 100000),]
df_samp <- mutate(df_samp, ID = factor(ID))
summary(df_samp)
#### OUTPUT ####
values ID
Min. :-9.861 17267 : 6
1st Qu.: 6.643 19618 : 6
Median :10.032 40258 : 6
Mean :10.031 46804 : 6
3rd Qu.:13.388 51269 : 6
Max. :30.465 52089 : 6
(Other):99964
As far as I know, chunking and on-disk interactions are not possible with RDS or RDA, so you are stuck with flat files (or you go with one of the other options I mentioned above).
Related
I'm having a lot of trouble reading/writing to CSV files. Say I have over 300 CSV's in a folder, each being a matrix of values.
If I wanted to find out a characteristic of each individual CSV file, such as which rows had an exact number of 3's, and write the result to another CSV file for each test, how would I go about iterating this over 300 different CSV files?
For example, say I have this code I am running for each file:
values_04 <- read.csv(file = 'values_04.csv', header=FALSE) # read CSV in as its own DF
values_04$howMany3s <- apply(values_04, 1, function(x) length(which(x==3))) # compute number of 3's
values_04$exactly4 <- apply(values_04[50], 1, function(x) length(which(x==4))) # show 1/0 for each row that has exactly four 3's
values_04 # print new matrix
I am then continuously copy and pasting this code and changing the "4" to a 5, 6, etc and noting the values. This seems wildly inefficient to me but I'm not experienced enough at R to know exactly what my options are. Should I look at adding all 300 CSV files to a single list and somehow looping through them?
Appreciate any help!
Here's one way you can read all the files and process them. Untested code as you haven't given us anything to work on.
# Get a list of CSV files. Use the path argument to point to a folder
# other than the current working directory
files <- list.files(pattern=".+\\.csv")
# For each file, work your magic
# lapply runs the function defined in the second argument on each
# value of the first argument
everything <- lapply(
files,
function(f) {
values <- read.csv(f, header=FALSE)
apply(values, 1, function(x) length(which(x==3)))
}
)
# And returns the results in a list. Each element consists of
# the results from one function call.
# Make sure you can access the elements of the list by filename
names(everything) <- files
# The return value is a list. Access all of it with
everything
# Or a single element with
everything[["values_04.csv"]]
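The question also asks for the result of each file to be written to its own CSV, which the code above does not do. A hedged base R sketch of that step follows. The helper name count_threes, the "_counts.csv" suffix and the demo files are assumptions made for illustration.

```r
# Hypothetical helper: count the 3s in each row of one CSV and write the
# counts to a companion file next to it. The "_counts.csv" suffix is an
# assumed naming convention, not something from the question.
count_threes <- function(f) {
  values <- read.csv(f, header = FALSE)
  counts <- data.frame(row = seq_len(nrow(values)),
                       howMany3s = rowSums(values == 3, na.rm = TRUE))
  out <- sub("\\.csv$", "_counts.csv", f)
  write.csv(counts, out, row.names = FALSE)
  out
}

# Demonstration with two small made-up files in a temporary folder.
demo_dir <- file.path(tempdir(), "threes_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.table(matrix(c(3, 1, 3, 2, 2, 3), nrow = 2),
            file.path(demo_dir, "values_01.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)
write.table(matrix(c(3, 0, 3, 0, 3, 0), nrow = 2),
            file.path(demo_dir, "values_02.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)

# The pattern matches the inputs only, not the generated "_counts" files.
files <- list.files(demo_dir, pattern = "^values_[0-9]+\\.csv$", full.names = TRUE)
outputs <- vapply(files, count_threes, character(1))
read.csv(outputs[[1]])
```

Because each file is read, summarised and written inside the function, only one file is held in memory at a time.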
I have about 300GB of 15KB csv files (each with exactly 100 rows each) that I need to import, concatenate, manipulate and resave as a single rds.
I've managed to reduce the amount of RAM needed by only importing the columns I need but as soon as I need to do any operations on the columns, I max it out.
What is your strategy for this type of problem?
This is a shot at answering your question.
While this may not be the most effective or efficient solution, it works. The biggest upside is that you don't need to store all the information at once; instead you just append each result to a file.
If this is not fast enough, it is possible to use the parallel package to speed it up.
library(tidyverse)
library(data.table)
# Make some example files
for (file_number in 1:1000) {
df = data.frame(a = runif(10), b = runif(10))
write_csv(x = df, file = paste0("example_",file_number,".csv"))
}
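# Get the list of example files; change getwd() and the pattern to suit your directory.
# The pattern keeps out.csv and any other files out of the list.
list_of_files <- list.files(path = getwd(), pattern = "^example_[0-9]+\\.csv$", full.names = TRUE)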
# Define function to read, manipulate, and save result
read_man_save <- function(filename) {
# Read file using data.table fread, which is faster than read_csv
df = fread(file = filename)
# Do the manipulation here, for example getting only the mean of A
result = mean(df$a)
# Append to a file
write(result, file = "out.csv", append = TRUE)
}
# Use lapply to perform the function over the list of filenames
# The output (which is null) is stored in a junk object
junk <- lapply(list_of_files, read_man_save)
# The resulting "out.csv" now contains 1000 lines of the mean
Feel free to comment if you want any edits to better reflect your use case.
You could also use the disk.frame library, it is designed to allow manipulation of data larger than RAM.
You can then manipulate the data like you would in data.table or using dplyr verbs.
I'm currently trying to read a 20GB file. I only need 3 columns of that file.
My problem is, that I'm limited to 16 GB of ram. I tried using readr and processing the data in chunks with the function read_csv_chunked and read_csv with the skip parameter, but those both exceeded my RAM limits.
Even the read_csv(file, ..., skip = 10000000, n_max = 1) call that reads one line uses up all my RAM.
My question now is, how can I read this file? Is there a way to read chunks of the file without using that much ram?
The LaF package can read in ASCII data in chunks. It can be used directly or if you are using dplyr the chunked package uses it providing an interface for use with dplyr.
The readr package has read_csv_chunked and related functions.
The section of this web page entitled The Loop as well as subsequent sections of that page describes how to do chunked reads with base R.
It may be that if you remove all but the first three columns that it will be small enough to just read it in and process in one go.
vroom in the vroom package can read files very quickly and can also read just the columns named in its col_select= argument, which may make the data small enough to read in one go.
fread in the data.table package is a fast reading function that also supports a select= argument which can select only specified columns.
read.csv.sql in the sqldf (also see github page) package can read a file larger than R can handle into a temporary external SQLite database which it creates for you and removes afterwards and reads the result of the SQL statement given into R. If the first three columns are named col1, col2 and col3 then try the code below. See ?read.csv.sql and ?sqldf for the remaining arguments which will depend on your file.
library(sqldf)
DF <- read.csv.sql("myfile", "select col1, col2, col3 from file",
dbname = tempfile(), ...)
read.table and read.csv in the base of R have a colClasses= argument which takes a vector of column classes. If the file has nc columns then use colClasses = rep(c(NA, "NULL"), c(3, nc-3)) to only read the first 3 columns.
Another approach is to pre-process the file using cut, sed or awk (available natively in UNIX and in the Rtools bin directory on Windows) or any of a number of free command line utilities such as csvfix outside of R to remove all but the first three columns and then see if that makes it small enough to read in one go.
Also check out the High Performance Computing task view.
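The colClasses option from the list above can be sketched in base R as follows. The demo file and its column names are made up, and the column count is taken from the header line so that the rest of the file is not read just to count columns.

```r
# Create a small made-up demo file with five columns.
demo_path <- tempfile(fileext = ".csv")
write.csv(data.frame(col1 = 1:4, col2 = letters[1:4],
                     col3 = c(0.5, 1.5, 2.5, 3.5),
                     col4 = 4:1, col5 = LETTERS[1:4]),
          demo_path, row.names = FALSE)

# Count the columns by scanning the header line only.
nc <- length(scan(demo_path, what = "", sep = ",", nlines = 1, quiet = TRUE))

# NA lets read.csv guess the type; "NULL" skips the column entirely.
first_three <- read.csv(demo_path,
                        colClasses = rep(c(NA, "NULL"), c(3, nc - 3)))
str(first_three)
```

Skipped columns are never stored in the result, which is what makes this worth trying before reaching for an external tool.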
We can try something like this, first a small example csv:
X = data.frame(id=1:1e5,matrix(runif(1e6),ncol=10))
write.csv(X,"test.csv",quote=F,row.names=FALSE)
You can use the nrows argument of read.csv. Instead of providing a file name, you provide a connection, and you store the results inside a list, for example:
data = vector("list",200)
con = file("test.csv","r")
data[[1]] = read.csv(con, nrows=1000)
dim(data[[1]])
COLS = colnames(data[[1]])
data[[1]] = data[[1]][,1:3]
head(data[[1]])
id X1 X2 X3
1 1 0.13870273 0.4480100 0.41655108
2 2 0.82249489 0.1227274 0.27173937
3 3 0.78684815 0.9125520 0.08783347
4 4 0.23481987 0.7643155 0.59345660
5 5 0.55759721 0.6009626 0.08112619
6 6 0.04274501 0.7234665 0.60290296
In the above, we read the first chunk, collected the colnames and subsetted. If you carry on reading through the connection, the headers will be missing, and we need to specify that:
for(i in 2:200){
data[[i]] = read.csv(con, nrows=1000,col.names=COLS,header=FALSE)[,1:3]
}
Finally, we combine all of those into a single data.frame:
data = do.call(rbind,data)
all.equal(data[,1:3],X[,1:3])
[1] TRUE
You can see that I specified a much larger list than required. This is to show that if you don't know how long the file is, you can specify something larger and it should still work, since the extra reads simply return no rows. This is a bit simpler than writing a while loop.
So we wrap it into a function, specifying the file, number of rows to read at one go, the number of times, and the column names (or position) to subset:
read_chunkcsv=function(file,rows_to_read,ntimes,col_subset){
  # one list slot per chunk
  data = vector("list",ntimes)
  con = file(file,"r")
  # close the connection when the function exits
  on.exit(close(con))
  data[[1]] = read.csv(con, nrows=rows_to_read)
  COLS = colnames(data[[1]])
  data[[1]] = data[[1]][,col_subset]
  # later chunks have no header line, so reuse the stored column names
  for(i in seq_len(ntimes)[-1]){
    data[[i]] = read.csv(con,
                         nrows=rows_to_read,col.names=COLS,header=FALSE)[,col_subset]
  }
  return(do.call(rbind,data))
}
all.equal(X[,1:3],
read_chunkcsv("test.csv",rows_to_read=10000,ntimes=10,1:3))
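If the number of chunks is not known in advance, a loop that stops at the end of the file is one alternative to guessing ntimes. This is only a sketch of that idea, not part of the function above. The name read_all_chunks and the demo data are made up.

```r
# Hypothetical helper: read a CSV in chunks until the file is exhausted,
# keeping only the columns in `col_subset`.
read_all_chunks <- function(path, rows_per_chunk, col_subset) {
  con <- file(path, "r")
  on.exit(close(con))
  # The first chunk carries the header line.
  first <- read.csv(con, nrows = rows_per_chunk)
  cols <- colnames(first)
  chunks <- list(first[, col_subset, drop = FALSE])
  repeat {
    # Later chunks have no header, so the stored names are supplied.
    # An error or an empty result is treated as "no more data".
    chunk <- tryCatch(
      read.csv(con, nrows = rows_per_chunk, col.names = cols, header = FALSE),
      error = function(e) NULL
    )
    if (is.null(chunk) || nrow(chunk) == 0) break
    chunks[[length(chunks) + 1]] <- chunk[, col_subset, drop = FALSE]
  }
  do.call(rbind, chunks)
}

# Demonstration: 25 rows read 10 at a time, keeping the first two columns.
demo_path <- tempfile(fileext = ".csv")
demo <- data.frame(id = 1:25, x = (1:25) * 2, y = (1:25) * 3)
write.csv(demo, demo_path, row.names = FALSE)
result <- read_all_chunks(demo_path, rows_per_chunk = 10, col_subset = 1:2)
dim(result)
```

One caveat of this design: because tryCatch treats any read error as the end of the data, a malformed line in the middle of the file would silently truncate the result, so it is worth checking the row count afterwards.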
I am trying to identify which types of csv files would not be modified in the future.
There are 540 CSV files in one folder, and only 518 of them are modified. Basically, I wrote code to read and prepare these files to be modified by a Java application, which I run from a terminal on Linux.
This is what terminal shows:
data_3_5.csv
Error in mapmatching or profiling!
No edge matches found for path. Too short? Sequence size 2
directory <- "/path/folder"
directory_jar <- "/path/path.jar"
setwd(directory)
file_names <-list.files(directory)
predict(file_names, model, filename="", fun=predict, ext=NULL,
const=NULL, index=1, na.rm=TRUE)
I think it fails only for files that are too short? Maybe I could apply code that calculates the length of all columns in all CSV files and finds the ones shorter than n?
Welcome, and good job posting some code. You're pretty close, the predict function is used in modelling though, try this on:
directory <- "/path/folder"
directory_jar <- "/path/path.jar"
setwd(directory)
## let's add a little bit of protection to ensure we are only getting csvs
file_names <-list.files(directory, pattern = "\\.csv$", full.names = TRUE)
## ^ ok so the above gives us all the filenames, but we haven't read them in yet...
## so let's create a function that reads the files in and counts how many columns in each.
library(tidyverse)
## if the above fails, run install.packages("tidyverse")
## let's create a function that will open the csv file and read the number of columns for each.
openerFun <- function(x){ ## here x is the input, or the path
openedFile <- read.csv(x, stringsAsFactors = FALSE) ## open the file
numCols <- ncol(openedFile) ## Count columns
tibble(name = x, numCols = numCols) ## output the file with the # columns
}
## and now let's call it with map, or better, map_dfr, because it gives us a nice dataframe!
map_dfr(file_names, openerFun)
Once you have that, you can use it to compare against which files failed... hopefully that will help!
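Since the terminal message hints at files that are too short ("Sequence size 2"), it may be the number of rows rather than columns that matters. A minimal base R sketch that flags short files follows, assuming each file has one header line. The helper name find_short_files, the threshold and the demo contents are made up for illustration.

```r
# Hypothetical helper: report the number of data rows in every CSV in a
# folder and flag files with fewer than `min_rows` rows. The threshold
# is an assumption to tune against the files that actually failed.
find_short_files <- function(directory, min_rows = 3) {
  paths <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  # Count lines without parsing; subtract one for the header line.
  n_rows <- vapply(paths, function(p) length(readLines(p)) - 1L, integer(1))
  data.frame(name = basename(paths),
             rows = unname(n_rows),
             too_short = unname(n_rows) < min_rows)
}

# Demonstration with two made-up files in a temporary folder.
demo_dir <- file.path(tempdir(), "short_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(lat = 1:2, lon = 3:4),
          file.path(demo_dir, "data_3_5.csv"), row.names = FALSE)
write.csv(data.frame(lat = 1:10, lon = 11:20),
          file.path(demo_dir, "data_1_1.csv"), row.names = FALSE)
report <- find_short_files(demo_dir, min_rows = 3)
report[report$too_short, ]
```

Comparing the flagged names against the 22 files that were not modified would show whether length really is the cause.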
I have this huge database from a telescope at the institute where I currently am working, this telescope saves every single day in a file, it takes values for each of the 8 channels it measures every 10 seconds, and every day starts at 00:00 and finishes at 23:59, unless there was a connection error, in which case there are 2 or more files for one single day.
Also, the database has measurement mistakes, missing data, repeated values, etc.
File extensions are .sn1 for days saved in one single file and, .sn1, .sn2, .sn3...... for days saved in multiple files, all the files have the same number of rows and variables, besides that there are 2 formats of databases, one has a sort of a header and it uses the first 5 lines of the file, the other one doesn't have it.
Every month has its own folder containing its days, and these folders are saved under the year they belong to, so for 10 years I'm talking about more than 3000 files, and to be honest I had never worked with .sn1 files before.
I have code to merge 2 or a handful of files into 1, but this time I have thousands of files (which is way more than what I've used before and also the reason why I can't provide a simple example) and I would like to write a program that would merge all of the files into 1 huge database, so I can get a better sample from it.
I have an Excel extension that would list all the file locations in a specific folder, can I use a list like this to put all the files together?
Suggestions were too long for a comment, so I'm posting them as an answer here.
It appears that you are able to read the files into R (at least one at a time) so I'm not getting into that.
Multiple Locations: If you have a list of all the locations, you can search in those locations to give you just the files you need. You mentioned an excel file (let's call it paths.csv - has only one column with the directory locations):
library(data.table)
all_directories <- fread("paths.csv", col.names = "paths")
# Focusing on only .sn1 files to begin with
file_names <- dir(path = all_directories$paths[1], pattern = "\\.sn1$")
# Getting the full path for each file
file_names <- paste(all_directories$paths[1], file_names, sep = "/")
Reading all the files: I created a space-delimited dummy file and gave it the extension ".sn1" - I was able to read it properly with data.table::fread(). If you're able to open the files using notepad or something similar, it should work for you too. Need more information on how the files with different headers can be distinguished from one another - do they follow a naming convention, or have different extensions (appears to be the case). Focusing on the files with 5 rows of headers/other info for now.
read_func <- function(fname){
  dat <- fread(fname, sep = " ", skip = 5)
  dat$file_name <- fname # Add file name as a variable - to use for sorting the big dataset
  dat # Return the data, not just the result of the assignment above
}
# Get all files into a list
data_list <- lapply(file_names, read_func)
# Merge list to get one big dataset
dat <- rbindlist(data_list, use.names = T, fill = T)
Doing all of the above will give you a dataset for all the files that have the extension ".sn1" in the first directory from your list of directories (paths.csv). You can enclose all of this in a function and use lapply over all the different directories to get a list wherein each element is a dataset of all such files.
To include files with ".sn2", ".sn3" ... extensions you can modify the call as below:
ptrns <- paste(sapply(1:5, function(z) paste(".sn",z,sep = "")), collapse = "|")
# ".sn1|.sn2|.sn3|.sn4|.sn5"
dir(all_directories$paths[1], pattern = ptrns)
Here's the simplified version that should work for all file extensions in all directories right away - might take some time if the files are too large etc. You may want to consider doing this in chunks instead.
# Assuming only one column with no header. sep is set to ";" since by default fread may treat spaces
# as separators. You can use any other symbol that is unlikely to be present in the location names
# We need the output to be a vector so we can use `lapply` without any unwanted behaviour
paths_vec <- as.character(fread("paths.csv", sep = ";", select = 1, header = F)$V1)
# Get all file names (incl. location)
file_names <- unlist(lapply(paths_vec, function(z){
ptrns <- paste(sapply(1:5, function(q) paste(".sn",q,sep = "")), collapse = "|")
inter <- dir(z, pattern = ptrns)
return(paste(z,inter, sep = "/"))
}))
# Get all data in a single data.table using read_func previously defined
dat <- rbindlist(lapply(file_names, read_func), use.names = T, fill = T)
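If maintaining the list of folders in Excel becomes a burden, base R can walk the year/month folder tree itself. The sketch below assumes all the data live under a single root folder. The folder layout and file names in the demonstration are made up.

```r
# Build a toy year/month folder tree with a few empty files (made up).
root <- file.path(tempdir(), "telescope_demo")
dir.create(file.path(root, "2015", "01"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "2015", "02"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(root, "2015", "01", c("day01.sn1", "day02.sn1", "day02.sn2")))
file.create(file.path(root, "2015", "02", c("day01.sn1", "notes.txt")))

# recursive = TRUE descends into every sub-folder; the pattern keeps
# only names that end in ".sn" followed by digits.
sn_files <- list.files(root, pattern = "\\.sn[0-9]+$",
                       recursive = TRUE, full.names = TRUE)
basename(sn_files)
```

The resulting vector could be passed to lapply(sn_files, read_func) in place of file_names.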