Merge two files and eliminate duplicates efficiently in R

I have two very big files that I have to merge and then eliminate duplicates according to one column. So far I am doing it like this:
myfiles <- list.files(pattern="*.dat")
myfilesContent <- lapply(myfiles, read.delim, header=F, quote="\"",sep=" ",colClasses="character")
data = as.data.frame(data.table::rbindlist(myfilesContent))
data <- data[!duplicated(data$V1,fromLast=TRUE),]
but reading the two files consumes a lot of memory. Is there a better way of doing it?
Many thanks

Try fread instead of read.delim.
Yes, keep using rbindlist.
Use unique(..., by = "V1") on the data.table rather than converting it to a data.frame.
It should be a lot faster and more memory efficient.
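A minimal sketch of that approach, assuming the same space-separated .dat files as in the question (fromLast = TRUE keeps the last occurrence of each V1, mirroring the original duplicated() call):
library(data.table)
myfiles <- list.files(pattern = "\\.dat$")
# fread is faster and more memory-efficient than read.delim
dt <- rbindlist(lapply(myfiles, fread, header = FALSE, sep = " ",
                       quote = "\"", colClasses = "character"))
# deduplicate on the first column without ever converting to data.frame
dt <- unique(dt, by = "V1", fromLast = TRUE)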

Related

How do I bind two disk frames together?

I have two disk.frames, and each is about 20GB worth of files.
It's too big to merge as data tables because the process requires more than the memory I have available. I tried using this code: output <- rbindlist(list(df1, df2))
The wrinkle is that I'd like to also run unique since there might be dups in my data.
Can I use the same code with rbindlist on two disk frames?
Yeah. You just do rbindlist.disk.frame(list(df1, df2))
I need to implement bind_rows at some point too!
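A minimal usage sketch, assuming df1 and df2 are existing disk.frames and that workers have been set up for chunked processing:
library(disk.frame)
setup_disk.frame()   # use multiple workers for chunk-wise operations
output <- rbindlist.disk.frame(list(df1, df2))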

Optimized way of merging ~200k large [many columns but only one row] data frames with differing column names

I have a large number (over 200,000) separate files, each of which contains a single row and many columns (sometimes upwards of several hundred columns). There is one column (id) in common across all of the files. Otherwise, the column names are semi-random, with incomplete overlap between dataframes. Currently, I am using %in% to determine which columns are in common and then merging using those columns. This works perfectly well, although I am confident (maybe hopeful is a better word) that it could be done faster.
For example:
dfone<-data.frame(id="12fgh",fred="1",wilma="2",barney="1")
dftwo<-data.frame(id="36fdl",fred="5",daphne="3")
common<-names(dfone)[names(dfone) %in% names(dftwo)]
merged<-merge(dfone,dftwo,by=common,all=TRUE)
So, since I'm reading a large number of files, here is what I'm doing right now:
fls<-list.files()
first<-fls[1]
merged<-read.csv(first)
for (fl in fls) {
dffl<-read.csv(fl)
common<-names(dffl)[names(dffl) %in% names(merged)]
merged<-merge(dffl,merged,by=common,all=TRUE)
# print(paste(nrow(merged)," rows in dataframe.",sep=""))
# flush.console()
# print(paste("Just did ",fl,".",sep=""))
# flush.console()
}
Obviously, the commented-out section is just a way of keeping track of it as it runs - which it does, although profoundly slowly, and it runs ever more slowly as it assembles the data frame.
(1) I am confident that a loop isn't the right way to do this, but I can't figure out a way to vectorize this
(2) My hope is that there is some way to do the merge that I'm missing that doesn't involve my column name comparison kludge
(3) All of which is to say that this is running way too slowly to be viable
Any thought on how to optimize this mess? Thank you very much in advance.
A much shorter and cleaner approach is to read them all into a list and then merge.
merged <- Reduce(function(x, y) merge(x, y, all = TRUE), lapply(list.files(), read.csv))
It will still be slow though. You could speed it up by replacing read.csv with something faster (e.g., data.table::fread) and possibly by replacing lapply with parallel::mclapply.
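A sketch of that substitution, assuming a Unix-like system for parallel::mclapply (the core count and file pattern are illustrative):
library(data.table)
library(parallel)
files <- list.files(pattern = "\\.csv$")
# read each single-row file in parallel, then merge on the columns the frames share
dts <- mclapply(files, fread, mc.cores = 4)
merged <- Reduce(function(x, y) merge(x, y, all = TRUE), dts)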

read.csv to import more than 2105 columns?

Leaving question for archival purposes only.
(Read.csv did read in all columns, I just did not see them in the preview when opening the data.frame)
Related to this question:
Maximum number of columns that can be read using read.csv
I would like to import a csv file into R that contains about 3200 columns (100 rows). I am used to working with data.frames and read.csv, but my usual approach failed because
data <- read.csv("data.csv", header=TRUE)
only imported the first 2105 columns. It did not display an error message.
How can I read in a csv file with more than 2105 columns into a data frame, without specifying column classes? The file contains different data types (dates, strings, numbers, ...), and speed is not my biggest concern.
I did not manage to apply the solutions in Quickly reading very large tables as dataframes in R to my situation. Tried this, but it does not seem to work without information on column classes:
df <- as.data.frame(scan("data.csv",sep=','))
There are already several questions about reading in large datafiles with millions of rows/columns and how to speed up the process, but my files are much smaller, so I am hoping that there is an easier solution that I overlooked.
Try using data.table.
library(data.table)
data <- fread("data.csv")
(Posted answer on behalf of the OP).
Read.csv did read in all columns, I just did not see them in the preview when opening the data.frame

Reading multiple csv files faster into data.table R

I have 900000 csv files which I want to combine into one big data.table. For this I created a for loop which reads every file one by one and adds it to the data.table. The problem is that it performs too slowly and the amount of time used is expanding exponentially. It would be great if someone could help me make the code run faster. Each one of the csv files has 300 rows and 15 columns.
The code I am using so far:
library(data.table)
setwd("~/My/Folder")
WD="~/My/Folder"
data<-data.table(read.csv(text="X,Field1,PostId,ThreadId,UserId,Timestamp,Upvotes,Downvotes,Flagged,Approved,Deleted,Replies,ReplyTo,Content,Sentiment"))
csv.list<- list.files(WD)
k=1
for (i in csv.list){
temp.data<-read.csv(i)
data<-data.table(rbind(data,temp.data))
if (k %% 100 == 0)
print(k/length(csv.list))
k<-k+1
}
Presuming your files are conventional csv, I'd use data.table::fread since it's faster. If you're on a Linux-like OS, I would use the fact that it allows shell commands. Presuming your input files are the only csv files in the folder, I'd do:
dt <- fread("tail -n-1 -q ~/My/Folder/*.csv")
You'll need to set the column names manually afterwards.
If you wanted to keep things in R, I'd use lapply and rbindlist:
lst <- lapply(csv.list, fread)
dt <- rbindlist(lst)
You could also use plyr::ldply:
dt <- setDT(ldply(csv.list, fread))
This has the advantage that you can use .progress = "text" to get a readout of progress in reading.
All of the above assume that the files all have the same format and have a header row.
Building on Nick Kennedy's answer using plyr::ldply, there is roughly a 50% speed increase from enabling the .parallel option when reading 400 csv files of roughly 30-40 MB each.
Original answer with progress bar
dt <- setDT(ldply(csv.list, fread, .progress="text"))
Enabling .parallel also with a text progress bar
library(plyr)
library(data.table)
library(doSNOW)
cl <- makeCluster(4)
registerDoSNOW(cl)
pb <- txtProgressBar(max=length(csv.list), style=3)
pbu <- function(i) setTxtProgressBar(pb, i)
dt <- setDT(ldply(csv.list, fread, .parallel=TRUE, .paropts=list(.options.snow=list(progress=pbu))))
stopCluster(cl)
As suggested by #Repmat, use rbind.fill. As suggested by #Christian Borck, use fread for faster reads.
require(data.table)
require(plyr)
files <- list.files("dir/name")
df <- rbind.fill(lapply(files, fread, header=TRUE))
Alternatively you could use do.call, but rbind.fill is faster (http://www.r-bloggers.com/the-rbinding-race-for-vs-do-call-vs-rbind-fill/)
df <- do.call(rbind, lapply(files, fread, header=TRUE))
Or you could use the data.table package, as shown above.
You are growing your data.table in a for loop - this is why it takes forever. If you want to keep the for loop as is, first create an empty data frame (before the loop) that has the dimensions you need (rows x columns) and sits in RAM.
Then write into this empty frame on each iteration, as sketched below.
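A sketch of that pre-allocation idea, assuming every file has exactly 300 rows and identically named, identically ordered columns as described in the question (rows_per_file is taken from that description):
csv.list <- list.files(pattern = "\\.csv$")
rows_per_file <- 300
first <- read.csv(csv.list[1])
# pre-allocate the full frame once, with the right column types, instead of growing it
data <- first[rep(NA_integer_, rows_per_file * length(csv.list)), ]
rownames(data) <- NULL
for (k in seq_along(csv.list)) {
  rows <- ((k - 1) * rows_per_file + 1):(k * rows_per_file)
  data[rows, ] <- read.csv(csv.list[k])
}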
Otherwise use rbind.fill from the plyr package - and avoid the loop altogether.
To use rbind.fill:
require(plyr)
data <- rbind.fill(df1, df2, df3, ... , dfN)
To supply the individual data frames, you could/should build the list with an apply function rather than typing them out.
I go with #Repmat as your current solution using rbind() is copying the whole data.table in memory every time it is called (this is why time is growing exponentially). Though another way would be to create an empty csv file with only the headers first and then simply append the data of all your files to this csv-file.
write.table(fread(i), file = "your_final_csv_file", sep = ";",
col.names = FALSE, row.names=FALSE, append=TRUE, quote=FALSE)
This way you don't have to worry about putting the data at the right indexes in your data.table. Also, as a hint: fread() is the data.table file reader, which is much faster than read.csv.
In general, R wouldn't be my first choice for these data munging tasks.
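A minimal sketch of that append approach, assuming all files share the same columns (the output file name and separator are illustrative):
library(data.table)
csv.list <- list.files(pattern = "\\.csv$")
# write the header once, taken from the first file
header <- names(fread(csv.list[1], nrows = 0))
write.table(t(header), file = "your_final_csv_file", sep = ";",
            col.names = FALSE, row.names = FALSE, quote = FALSE)
# append the data of every file below it
for (i in csv.list) {
  write.table(fread(i), file = "your_final_csv_file", sep = ";",
              col.names = FALSE, row.names = FALSE, append = TRUE, quote = FALSE)
}
# read the combined file back in one pass
final <- fread("your_final_csv_file", sep = ";")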
One suggestion would be to merge them first in groups of 10 or so, and then merge those groups, and so on. That has the advantage that if individual merges fail, you don't lose all the work. The way you are doing it now not only leads to exponentially slowing execution, but exposes you to having to start over from the very beginning every time you fail.
This will also decrease the average size of the data frames involved in the rbind calls, since most of them will be appended to small data frames, with only a few large ones at the end. This should eliminate most of the exponentially growing execution time.
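A rough sketch of that grouping idea, assuming a plain row-bind combine and the group size of 10 suggested above (the file pattern is illustrative):
library(data.table)
fls <- list.files(pattern = "\\.csv$")
groups <- split(fls, ceiling(seq_along(fls) / 10))
# combine within each small group first, then combine the far fewer group results
partials <- lapply(groups, function(g) rbindlist(lapply(g, fread)))
combined <- rbindlist(partials)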
I think no matter what you do it is going to be a lot of work.
Some things to consider under the assumption you can trust all the input data and that each record is sure to be unique:
Consider creating the table you are importing into without indexes. As indexes get huge, the time needed to maintain them during imports grows - so it sounds like this may be what is happening. If this is your issue, it would still take a long time to create the indexes later.
Alternately, with the amount of data you are discussing you may want to consider a method of partitioning the data (often done via date ranges). Depending on your database you may then have individually indexed partitions -- easing index efforts.
If your demonstration code doesn't boil down to a database bulk-import utility, then use such a utility.
It may be worth processing files into larger data sets prior to importing them. You could experiment with this by combining 100 files into one larger file before loading, for example, and comparing times.
In the event you can't use partitions (depending on the environment and the experience of the database personnel), you can use a home-brewed method of separating the data into various tables, for example data201401 to data201412. However, you'd have to roll your own utilities to query across boundaries.
While decidedly not a better option, it is something you could do in a pinch - and it would allow you to retire/expire aged records easily without having to adjust the related indexes. It would also let you load pre-processed incoming data by "partition" if desired.

Load large datasets into data frame [duplicate]

This question already has answers here:
Quickly reading very large tables as dataframes
(12 answers)
Closed 8 years ago.
I have a dataset stored in a text file; it has 997 columns and 45000 rows. All values are doubles except the row names and column names. I use RStudio with the read.table command to read the data file, but it seemed to be taking hours, so I aborted it.
Even Excel opens it in about 2 minutes.
RStudio seems to lack efficiency for this task. Any suggestions on how to make it faster? I don't want to re-read the data file every time.
I plan to load it once and store it in an .RData object, which will make loading the data faster in the future. But even the first load isn't working out.
I am not a computer science graduate; any kind of help will be well appreciated.
I recommend data.table's fread, although you will end up with a data.table after this. If you choose not to use the data.table, you can simply convert it back to a normal data frame.
require(data.table)
data=fread('yourpathhere/yourfile')
As documented in the ?read.table help file there are three arguments that can dramatically speed up and/or reduce the memory required to import data. First, by telling read.table what kind of data each column contains you can avoid the overhead associated with read.table trying to guess the type of data in each column. Secondly, by telling read.table how many rows the data file has you can avoid allocating more memory than is actually required. Finally, if the file does not contain comments, you can reduce the resources required to import the data by telling R not to look for comments. Using all of these techniques I was able to read a .csv file with 997 columns and 45000 rows in under two minutes on a laptop with relatively modest hardware:
tmp <- data.frame(matrix(rnorm(997*45000), ncol = 997))
write.csv(tmp, "tmp.csv", row.names = FALSE)
system.time(x <- read.csv("tmp.csv", colClasses="numeric", comment.char = ""))
# user system elapsed
#115.253 2.574 118.471
I tried reading the file using the default read.csv arguments, but gave up after 30 minutes or so.
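Applying the same arguments to the file described in the question, and saving the result once so later sessions can skip the slow import (the file name, the assumption that the row names sit in the first column, and the all-numeric colClasses are inferred from the question; adjust them to the real data):
# colClasses, nrows and comment.char together avoid type guessing,
# over-allocation and comment scanning
x <- read.csv("data.csv", row.names = 1, nrows = 45000, comment.char = "",
              colClasses = c("character", rep("numeric", 996)))
# store once; later sessions can call load("data.RData") instead of re-importing
save(x, file = "data.RData")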
