rbind (concatenate) several big files - r

I am having some problems merging tab-delimited files in R.
My biggest problem is probably scale: we are talking about 165 files of around 180 MB.
I work on an Ubuntu server now, after my local OS X machine was not able to handle the data.
I tried several methods as outlined here: the "lapply" method, the "plyr" method, a for-loop solution, and a newer method using the fread function.
Nevertheless, I have not been able to merge all these files into one data.frame.
require(data.table)
require(bit64)
require(plyr)
setwd("~/Documents/Data/App/feed/")
options(stringsAsFactors = FALSE)
options(scipen = 999)
# List of all the contained files
file_list <- list.files()
#FREAD-Method
dataset <- rbindlist(lapply(file_list, fread, header = FALSE, sep = "\t"))
# LAPPLY-Method
dataset2 <- do.call("rbind", lapply(file_list, function(f) read.table(f, header = TRUE, sep = "\t")))
# LDPLY-Method
dataset3 <- ldply(file_list, read.table, header = FALSE, sep = "\t", .progress = "text", .inform = TRUE)
I guess the for-loop solution is not so fancy; for reference, it looked roughly like the sketch below.
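A minimal sketch of that for-loop variant, assuming the same working directory and tab-separated files as above:
# For-loop sketch: read each file and grow the result.
# Note: growing a data.frame this way copies it on every iteration,
# which gets slow with 165 files; the rbindlist() call above avoids that.
dataset4 <- NULL
for (f in file_list) {
  tmp <- read.table(f, header = FALSE, sep = "\t")
  dataset4 <- rbind(dataset4, tmp)
}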
Now my big question: how can I merge my (big) CSV files into one data.frame? Or is R simply not able to handle such big data?
I would be happy if anybody could help me out, since I have been trying to find a solution, with various parameters, for two days now.

Related

Using a For-loop to create multiple objects with incremental suffixes, then reading in .csv file to each new object (also with incremental suffixes)

I've just started learning R so forgive me for my ignorance! I'm reading in lots of .csv files, each of which corresponds to a different year (2010-2019). I then filter the .csv files down based on a variable within one of the columns (because the datasets are very large). Currently I am using the code below and repeating it for each year:
data_2010 <- data.table::fread("//Project/2010 data/2010 data.csv", select = c("date", "id", "type"))
data_b_2010 <- data_2010[which(data_2010$type=="ABC123")]
rm(data_2010)
What I would like to do is use a for-loop to create a new object data_20xx for each year, and then read in the .csv files (and apply the "type" filter) for each year too.
I think I know how to create the objects in a for-loop, but I am not entirely sure how I would also assign the .csv files and update the filepath string for each year (i.e. from "//Project/2010 data/2010 data.csv" to "//Project/2011 data/2011 data.csv").
Any help would be greatly appreciated!
Next time please provide a reproducible example so we can help you.
I would use data.table, which contains specialized functions to do what you want.
library(data.table)
setwd("Project")
# Find every file anywhere below the working directory
allfiles <- list.files(recursive = TRUE, full.names = TRUE)
allcsv <- allfiles[grepl("\\.csv$", allfiles)]
data_list <- list()
for (i in seq_along(allcsv)) {
  # Print progress as a fraction of files read
  print(round(i / length(allcsv), 2))
  data_list[[i]] <- fread(allcsv[i])
}
# Keep only the rows with the requested type in each table
data_list_filtered <- lapply(data_list, function(x) {
  y <- data.frame(x)
  y[which(y$type == "ABC123"), ]
})
result <- rbindlist(data_list_filtered)
First, list.files lists all the files contained in your working directory by default (recursive = TRUE also descends into subfolders).
Second, read each csv file into the data_list list using the fast and efficient fread function (note the [[i]] indexing, which stores each table as a list element).
Third, do the filtering in a second pass, as requested.
Fourth, use rbindlist from data.table to rbind all of these data.tables.
Finally, if you are not familiar with the data.table syntax, you can run setDF(result) to convert the result back to a data.frame.
I strongly encourage you to learn the data.table syntax, as it is quite powerful and efficient for tabular data manipulation. The package vignettes will get you started.
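If you adopt data.table, a more compact sketch (assuming each file has a type column, as in the question) filters during the read; the second block shows the assign()-based per-year naming the question literally asked for, with the year range and path pattern taken from the question:
# Read and filter in one pass, then stack everything
result <- rbindlist(lapply(allcsv, function(f) fread(f)[type == "ABC123"]))
# If you really want one object per year (data_b_2010, data_b_2011, ...):
for (year in 2010:2019) {
  path <- sprintf("//Project/%d data/%d data.csv", year, year)
  assign(paste0("data_b_", year),
         fread(path, select = c("date", "id", "type"))[type == "ABC123"])
}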

How do I make my function import and concatenate/merge "all" the files in a folder? [duplicate]

Due to... limitations I have been forced to download my data manually, one csv file at a time. Until now this hasn't been an issue: I've saved all of my files in the same folder, so I've been able to use a function to simply merge them (all column names are exactly the same).
I recently had to download far more data than before, however. I am currently trying to import/concatenate 513 csv files at the same time, and it seems my function has hit some kind of limit: not all of the csv files are imported any more, which is of course very disconcerting.
I tried moving the unimported files (together with files that were successfully imported) to another folder, and I could import/concatenate those files just fine. So this doesn't seem to have anything to do with the files themselves, but with the sheer number of them being imported/concatenated at the same time.
Is there a way to import and concatenate "all" files in a folder with no limitations?
The top 4 and bottom 4 lines in each csv file contain metadata and need to be disregarded. Until now I've been using the following loop to import/concatenate my files:
setwd("path")
file_list<-list.files("path")
for (file in file_list){
# if the merged dataset doesn't exist, create it
if (!exists("dataset")){
dataset <- head(read_delim(file, delim=';',na="",skip=4),-4)
}
# if the merged dataset does exist, append to it
if (exists("dataset")){
temp_dataset <-head(read_delim(file, delim=';',na="",skip=4),-4)
dataset<-rbind(dataset, temp_dataset)
rm(temp_dataset)
}
}
In base R, you would use do.call(rbind, list_data). With data.table, you can use data.table::rbindlist, which will be more efficient.
data.table
library(data.table)
setwd("path")
file_list <- list.files("path")
# fread takes sep and na.strings (not readr's delim/na); head(x, -4) drops the 4 metadata rows at the bottom
list_data <- lapply(file_list, function(file) head(fread(file, sep = ';', na.strings = "", skip = 4), -4))
df <- rbindlist(list_data, fill = TRUE, use.names = TRUE)
I added the arguments fill = TRUE and use.names = TRUE to be safe: you lose a little efficiency, but you are sure the columns are bound in the place they should be.
Base R
setwd("path")
file_list<-list.files("path")
list_data <- lapply(file_list, function(file) head(read_delim(file, sep=';',na.strings = "", skip=4),-4))
df <- do.call(rbind, list_data)
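One caveat with the base R route: rbind() for data frames requires every element to have exactly the same columns. If the files can differ, one option (a sketch) is to pad missing columns with NA before binding:
# Union of all column names across the files
all_cols <- Reduce(union, lapply(list_data, names))
list_data_padded <- lapply(list_data, function(d) {
  for (m in setdiff(all_cols, names(d))) d[[m]] <- NA  # add missing columns as NA
  d[all_cols]  # and put the columns in a common order
})
df <- do.call(rbind, list_data_padded)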

Repeating tk_choose.files to import multiple .csv files multiple times

I am using sapply(tk_choose.files) to produce an interactive window where I can choose which .csv files (multiple) to import. I then do some basic data manipulation so that the mean of one particular column can be plotted using ggplot.
So far my code looks something like this:
tfiles <- data.frame(sapply(sapply(tk_choose.files(caption = "Choose T files (hold CTRL to select multiple files)"), read.table, header = TRUE, sep = ","), c))
rfiles <- data.frame(sapply(sapply(tk_choose.files(caption = "Choose R files (hold CTRL to select multiple files)"), read.table, header = TRUE, sep = ","), c))
I have then calculated the mean of a particular column for both tfiles and rfiles so that I can plot 100 - tfiles - rfiles.
While this is working fine for one set of data, I would now like to import more sets of data, preferably also using sapply(tk_choose.files). Essentially I need to get tfiles1/rfiles1, tfiles2/rfiles2, ... and repeat the data-manipulation process after that, so that I can plot multiple sets of data. I have no idea how to do this without copying and pasting my code!
Sorry if this is a stupid question, I am very new to R so I am really stuck; your help is greatly appreciated!
Assuming that the files in the working directory are as follows:
all.files<-list.files(pattern="\\.csv")
all.files
[1] "R01.csv" "R02.csv" "R03.csv" "R04.csv" "T01.csv" "T02.csv" "T03.csv" "T04.csv"
And suppose you wish to call tfiles1 the merged data of T01 and T02, and tfiles2 the merged data of T03 and T04:
library(plyr)  # for ldply
T <- grep("T", all.files, value = TRUE)
T
[1] "T01.csv" "T02.csv" "T03.csv" "T04.csv"
t.list <- list(T[1:2], T[3:4])
all.T <- lapply(t.list, function(x) ldply(x, read.csv))
# this will produce tfiles1 and tfiles2 in your R environment
for (i in 1:length(all.T)) assign(paste0("tfiles", i), all.T[[i]])
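To avoid copying and pasting the whole block for every new set, another option is to wrap the choose-and-read step in a function and call it once per set (a sketch; read_set is just an illustrative helper name):
library(tcltk)
read_set <- function(caption) {
  files <- tk_choose.files(caption = caption)
  plyr::ldply(files, read.csv)  # read each chosen file and stack the results
}
tfiles1 <- read_set("Choose T files, set 1 (hold CTRL to select multiple files)")
rfiles1 <- read_set("Choose R files, set 1 (hold CTRL to select multiple files)")
tfiles2 <- read_set("Choose T files, set 2 (hold CTRL to select multiple files)")
rfiles2 <- read_set("Choose R files, set 2 (hold CTRL to select multiple files)")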

Using rbind() to combine multiple data frames into one larger data.frame within lapply()

I'm using R-Studio 0.99.491 and R version 3.2.3 (2015-12-10). I'm a relative newbie to R, and I'd appreciate some help. I'm doing a project where I'm trying to use the server logs on an old media server to identify which folders/files within the server are still being accessed and which aren't, so that my team knows which files to migrate. Each log is for a 24 hour period, and I have approximately a year's worth of logs, so in theory, I should be able to see all of the access over the past year.
My ideal output is to get a tree structure or plot that will show me the folders on our server that are being used. I've figured out how to read one log (one day) into R as a data.frame and then use the data.tree package in R to turn that into a tree. Now, I want to recursively go through all of the files in the directory, one by one, and add them to that original data.frame, before I create the tree. Here's my current code:
#Create the list of log files in the folder
files <- list.files(pattern = "*.log", full.names = TRUE, recursive = FALSE)
#Create a new data.frame to hold the aggregated log data
uridata <- data.frame()
#My function to go through each file, one by one, and add it to the 'uridata' df, above
lapply(files, function(x){
  uriraw <- read.table(x, skip = 3, header = TRUE, stringsAsFactors = FALSE)
  #print(nrow(uriraw))
  uridata <- rbind(uridata, uriraw)
  #print(nrow(uridata))
})
The problem is that, no matter what I try, the value of uridata inside the lapply loop does not seem to be saved or passed outside of the loop, but is somehow being overwritten each time the loop runs. So instead of getting one big data.frame, I just get the contents of the last uriraw file. (That's why there are those two commented print commands inside the loop; I was testing how many lines were in the data frames each time the loop ran.)
Can anyone clarify what I'm doing wrong? Again, I'd like one big data.frame at the end that combines the contents of each of the (currently seven) log files in the folder.
do.call() is your friend.
big.list.of.data.frames <- lapply(files, function(x){
  read.table(x, skip = 3, header = TRUE, stringsAsFactors = FALSE)
})
or more concisely (but less tinkerable):
big.list.of.data.frames <- lapply(files, read.table,
                                  skip = 3, header = TRUE,
                                  stringsAsFactors = FALSE)
Then:
big.data.frame <- do.call(rbind,big.list.of.data.frames)
This is the recommended way to do things, because "growing" a data frame dynamically in R is painful: it is slow and memory-expensive, since a new frame gets built at each iteration.
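A quick, self-contained way to see that cost for yourself (a sketch; exact timings will vary by machine):
chunks <- replicate(200, data.frame(x = rnorm(1000)), simplify = FALSE)
# Growing: every rbind() copies all the rows accumulated so far
system.time({ grown <- data.frame(); for (ch in chunks) grown <- rbind(grown, ch) })
# Collect-then-combine: one combine at the end
system.time(combined <- do.call(rbind, chunks))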
You can use map_df from the purrr package instead of lapply, to directly get all results combined into one data frame.
library(purrr)
map_df(files, read.table, skip = 3, header = TRUE, stringsAsFactors = FALSE)
Another option is fread from data.table
library(data.table)
rbindlist(lapply(files, fread, skip=3))

Read and rbind multiple csv files

I have a series of csv files (one per annum) with the same column headers and different numbers of rows. Originally I was reading them in and merging them like so:
setwd("N:/Ring data by cruise/Shetland")
LengthHeight2013 <- read.csv("N:/Ring data by cruise/Shetland/R_0113A_S2013_WD.csv",sep=",",header=TRUE)
LengthHeight2012 <- read.csv("N:/Ring data by cruise/Shetland/R_0212A_S2012_WD.csv",sep=",",header=TRUE)
LengthHeight2011 <- read.csv("N:/Ring data by cruise/Shetland/R_0211A_S2011_WOD.csv",sep=",",header=TRUE)
LengthHeight2010 <- read.csv("N:/Ring data by cruise/Shetland/R_0310A_S2010_WOD.csv",sep=",",header=TRUE)
LengthHeight2009 <- read.csv("N:/Ring data by cruise/Shetland/R_0309A_S2009_WOD.csv",sep=",",header=TRUE)
LengthHeight <- merge(LengthHeight2013,LengthHeight2012,all=TRUE)
LengthHeight <- merge(LengthHeight,LengthHeight2011,all=TRUE)
LengthHeight <- merge(LengthHeight,LengthHeight2010,all=TRUE)
LengthHeight <- merge(LengthHeight,LengthHeight2009,all=TRUE)
I would like to know if there is a shorter/tidier way to do this, also considering that each time I run the script I might want to look at a different range of years.
I also found this bit of code by Tony Cookson which looks like it would do what I want; however, the data frame it produces has only the correct headers and no data rows (presumably because merge() performs an inner join by default, keeping only rows that match across every file).
multmerge = function(mypath){
  filenames = list.files(path = mypath, full.names = TRUE)
  datalist = lapply(filenames, function(x){read.csv(file = x, header = TRUE)})
  Reduce(function(x, y) {merge(x, y)}, datalist)  # note: the closing brace was missing in the original snippet
}
mymergeddata = multmerge("C://R//mergeme")
Find the files (list.files), read the files in a loop (lapply), then call (do.call) row bind (rbind) to put all files together by rows.
myMergedData <-
  do.call(rbind,
          lapply(list.files(path = "N:/Ring data by cruise",
                            full.names = TRUE),  # full paths so read.csv can find the files
                 read.csv))
Update: there is now also the vroom package; according to its manual it is much faster than data.table::fread and base read.csv. The syntax looks neat, too:
library(vroom)
files <- list.files(path = "N:/Ring data by cruise", full.names = TRUE)
myMergedData <- vroom(files)
If you're looking for speed, then try this:
require(data.table) ## 1.9.2 or 1.9.3
ans = rbindlist(lapply(filenames, fread))
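To look at a different range of years on each run, as asked, one option (a sketch; it assumes the year appears in each file name, e.g. S2011 as above) is to filter the file list before reading:
years <- 2010:2012  # illustrative range
filenames <- list.files("N:/Ring data by cruise/Shetland", full.names = TRUE)
wanted <- filenames[grepl(paste(years, collapse = "|"), filenames)]
ans <- rbindlist(lapply(wanted, fread))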
You could try it in a tidy way like this:
library(readr)
library(dplyr)
files <- list.files('the path of your files',
                    pattern = ".csv$", recursive = TRUE, full.names = TRUE)
myMergedData <- read_csv(files) %>% bind_rows()
I don't have enough rep to comment, but to answer Rafael Santos: you can use the code here to add params to the lapply in the answer above: Using lapply and read.csv on multiple files (in R)
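For example, a sketch of passing extra parameters through lapply (the sep and dec values here are illustrative):
# Named arguments after the function name are forwarded to every read.csv call
list_data <- lapply(filenames, read.csv, sep = ";", dec = ",", header = TRUE)
myMergedData <- do.call(rbind, list_data)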
