I have 20 different .csv files and I need to somehow stack the data in R so that I can get an overall picture of the data.
Presently I am copying and pasting the columns in excel to make one big data set.
However, I am sure there is a quicker and more efficient way of doing this in R as this would ultimately take a while.
Also, to make things worse some of the variable names are not the same in each data set.
E.g. VARIABLE1 is written as variable1 in some datasets. How would I rectify this in R, as I understand that R is case sensitive?
Any help would be greatly appreciated. Thanks!
The easiest and fastest way to do this, if you are (or wish to become) familiar with the data.table package, is the following (not tested):
require(data.table)
in_pth <- "path_to_csv_files" # directory where CSV files are located, not the files.
files <- list.files(in_pth, full.names=TRUE, recursive=FALSE, pattern="\\.csv$")
out <- rbindlist(lapply(files, fread))
list.files parameters:
full.names = TRUE will return the full path to your file. Suppose your in_pth <- "c:\\my_csv_folder" and inside this you've two files: 01.csv and 02.csv. Then, full.names=TRUE will return c:\\my_csv_folder\\01.csv and c:\\my_csv_folder\\02.csv (full path).
recursive = FALSE will not search inside directories within your in_pth folder. Assume you have two more csv files in c:\\my_csv_folder\\another_folder. If you want to load those files as well, set recursive=TRUE, which will search all subdirectories below in_pth.
pattern="\\.csv$": This is a regular expression that tells list.files which files to return. If your folder also contains text files (.txt) in addition to csv files, then specifying this pattern loads only the csv files. If your folder has only CSV files, then this is not necessary.
data.table functions:
rbindlist binds by column position by default and keeps the column names of the first data.table in the list. That is, if you have two data.tables dt1 and dt2 with column names x,y and a,b respectively, then rbindlist(list(dt1, dt2)) returns columns named x,y and rbindlist(list(dt2, dt1)) returns columns named a,b. Note that rbindlist takes a single list of data.tables, not the data.tables as separate arguments.
fread usually detects columns, headers, separators, etc. automatically and is extremely fast (although it is still experimental, so you may want to check your output to be sure it is all fine, even if it appears stable).
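Since rbindlist matches columns by position by default, the upper/lower-case mismatch from the question has to be handled separately. The following is a minimal sketch of one way to do that, not a tested solution for the asker's data: it assumes the names differ only in case, and the demo folder, the file contents, and the helper read_upper are made up for illustration.

```r
library(data.table)

# Hypothetical demo data: two small CSV files whose headers differ only in case.
demo_dir <- file.path(tempdir(), "csv_stack_demo")
dir.create(demo_dir, showWarnings = FALSE)
fwrite(data.table(VARIABLE1 = 1:2, VARIABLE2 = c("a", "b")), file.path(demo_dir, "01.csv"))
fwrite(data.table(variable1 = 3:4, Variable2 = c("c", "d")), file.path(demo_dir, "02.csv"))

files <- list.files(demo_dir, full.names = TRUE, pattern = "\\.csv$")

# Read one file and force its column names to upper case before stacking.
read_upper <- function(f) {
  dt <- fread(f)
  setnames(dt, toupper(names(dt)))
  dt
}

# use.names = TRUE matches columns by name; fill = TRUE pads missing columns with NA.
out <- rbindlist(lapply(files, read_upper), use.names = TRUE, fill = TRUE)
```

Using tolower instead of toupper works the same way; the point is only that every file ends up with identical names before the bind.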
@Denis: It is also worth looking into the plyr package for the same task. rbind.fill(...) allows you to combine data.frames by row.
install.packages("plyr")
library(plyr)
help(rbind.fill) gives the following details:
rbinds a list of data frames filling missing columns with NA.
Usage
rbind.fill(...)
Arguments
...
input data frames to row bind together. The first argument can be a list of data frames, in which case all other arguments are ignored.
Details
This is an enhancement to rbind that adds in columns that are not present in all inputs, accepts a list of data frames, and operates substantially faster.
Column names and types in the output will appear in the order in which they were encountered. No checking is performed to ensure that each column is of consistent type in the inputs.
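As a small illustration of the behaviour described in the help text, here is a sketch with two made-up data frames (df1, df2 and their columns are hypothetical) that share only one column:

```r
library(plyr)

# Two hypothetical data frames that share only the "id" column.
df1 <- data.frame(id = 1:2, VARIABLE1 = c(10, 20))
df2 <- data.frame(id = 3:4, other = c("x", "y"))

# Columns missing from one input are filled with NA in the stacked result.
stacked <- rbind.fill(df1, df2)
```

Note that rbind.fill matches names exactly, so VARIABLE1 and variable1 would end up as two separate columns unless the names are harmonised first.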
To my knowledge, there is no built-in cbind.fill; however, there is a user-written function called cbind.fill that allows you to combine data.frames by column. Details here.
There are two solutions: one depending on rbind.fill in the plyr package and another is independent of rbind.fill.
Another way, without using external packages, is to use the cbind() command, which binds by column. So if you have two different tables you can just pass them as arguments to cbind() and their columns will be placed side by side (rbind() is the row-wise counterpart for stacking).
Related
I have the output from a data submission which is in the form of multiple vector list objects in rda files.
Each list object is in a separate rda file and I have nearly 2000 files.
I want to merge all the objects into a single object in a single rda file in the fastest way (partly because I may need to repeat this several times).
All the rda files are fairly small (~10mb, though this is the compressed size), but it all adds up with the number of files.
Memory isn't a huge problem as I am running it on a server with >700GB RAM.
My first approach was to load the files one by one, concatenate each with the merged list object, and remove the object that had just been appended. It went badly because of the time it was going to take (something like 40 days at a best guess).
My revised approach is below, but I am wondering if there is a quicker way to do this, given that I may need to repeat the process:
load("data_1.rda")
load("data_2.rda")
load("data_3.rda") ...
load("data_2000.rda")
my.list <- list()
my.list <- c(my.list, data.1, data.2, data.3, ... , data.2000)
save(my.list, file="my_list.rda")
And just to add to things, I'm getting an error when doing this:
Error: attempt to set index 18446744071562067968/2877912830 in SET_STRING_ELT
It's not a very helpful error message
All the rdas load as objects into the environment fine, but I get the error message when I try to concatenate them. It seems to happen at a particular point, as it doesn't fail immediately. I wasn't sure if there is some sort of limit on the number of concatenations you can do or if it is rogue data, but from troubleshooting it appears to be syntax rather than data related.
I have chunked it up into 5 batches and am then doing a final concatenation before saving the rda. I have seen other answers for this sort of thing suggesting rbind, mget, and do.call or the list function. Would using any of these functions make it faster and achieve the same thing?
Something like this:
my.list <- do.call(rbind, mget(ls(pattern="^data_")))
Thanks
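One possible shape for this, sketched under the assumption that every .rda file holds exactly one list object: load each file into its own environment, collect the objects with lapply, and concatenate once at the end with c rather than rbind (rbind on lists builds a matrix instead of one long list). The demo folder and the helper load_one are made up for illustration, and this has not been timed on 2000 real files.

```r
# Hypothetical demo: three small .rda files, each holding one list object.
demo_dir <- file.path(tempdir(), "rda_merge_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (i in 1:3) {
  obj_name <- paste0("data.", i)
  assign(obj_name, list(i, letters[i]))
  save(list = obj_name, file = file.path(demo_dir, sprintf("data_%d.rda", i)))
}

files <- list.files(demo_dir, pattern = "^data_.*\\.rda$", full.names = TRUE)

# Load one file into a private environment and return the single object it contains.
load_one <- function(f) {
  e <- new.env()
  nm <- load(f, envir = e)
  get(nm[1], envir = e)
}

# Read every file, then concatenate all lists in one step instead of growing my.list.
pieces <- lapply(files, load_one)
my.list <- do.call(c, pieces)

save(my.list, file = file.path(demo_dir, "my_list.rda"))
```

Loading into a private environment also avoids having 2000 separately named objects in the workspace.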
I have 900000 csv files which I want to combine into one big data.table. For this I created a for loop which reads the files one by one and adds them to the data.table. The problem is that it runs too slowly and the time used grows exponentially. It would be great if someone could help me make the code run faster. Each of the csv files has 300 rows and 15 columns.
The code I am using so far:
library(data.table)
setwd("~/My/Folder")
WD="~/My/Folder"
data<-data.table(read.csv(text="X,Field1,PostId,ThreadId,UserId,Timestamp,Upvotes,Downvotes,Flagged,Approved,Deleted,Replies,ReplyTo,Content,Sentiment"))
csv.list<- list.files(WD)
k=1
for (i in csv.list){
temp.data<-read.csv(i)
data<-data.table(rbind(data,temp.data))
if (k %% 100 == 0)
print(k/length(csv.list))
k<-k+1
}
Presuming your files are conventional csv, I'd use data.table::fread since it's faster. If you're on a Linux-like OS, I would use the fact it allows shell commands. Presuming your input files are the only csv files in the folder I'd do:
dt <- fread(cmd = "tail -n +2 -q ~/My/Folder/*.csv", header = FALSE) # tail -n +2 skips each file's header row
You'll need to set the column names manually afterwards.
If you wanted to keep things in R, I'd use lapply and rbindlist:
lst <- lapply(csv.list, fread)
dt <- rbindlist(lst)
You could also use plyr::ldply:
dt <- setDT(ldply(csv.list, fread))
This has the advantage that you can use .progress = "text" to get a readout of progress in reading.
All of the above assume that the files all have the same format and have a header row.
Building on Nick Kennedy's answer using plyr::ldply there is roughly a 50% speed increase by enabling the .parallel option while reading 400 csv files roughly 30-40 MB each.
Original answer with progress bar
dt <- setDT(ldply(csv.list, fread, .progress="text"))
Enabling .parallel also with a text progress bar
library(plyr)
library(data.table)
library(doSNOW)
cl <- makeCluster(4)
registerDoSNOW(cl)
pb <- txtProgressBar(max=length(csv.list), style=3)
pbu <- function(i) setTxtProgressBar(pb, i)
dt <- setDT(ldply(csv.list, fread, .parallel=TRUE, .paropts=list(.options.snow=list(progress=pbu))))
stopCluster(cl)
As suggested by @Repmat, use rbind.fill. As suggested by @Christian Borck, use fread for faster reads.
require(data.table)
require(plyr)
files <- list.files("dir/name", full.names=TRUE) # full paths, so fread can find the files from any working directory
df <- rbind.fill(lapply(files, fread, header=TRUE))
Alternatively you could use do.call, but rbind.fill is faster (http://www.r-bloggers.com/the-rbinding-race-for-vs-do-call-vs-rbind-fill/)
df <- do.call(rbind, lapply(files, fread, header=TRUE))
Or you could use the data.table package, see this
You are growing your data table in a for loop, which is why it takes forever. If you want to keep the for loop as is, first create an empty data frame (before the loop) that has the dimensions you need (rows x columns), so that it is allocated in RAM once.
Then write to this empty frame in each iteration.
Otherwise use rbind.fill from package plyr and avoid the loop altogether.
To use rbind.fill:
require(plyr)
data <- rbind.fill(df1, df2, df3, ... , dfN)
To pass the names of the df's, you could/should use an apply function.
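The pre-allocation idea mentioned earlier in this answer can be sketched as follows. This is an illustration in base R with made-up sizes and a stand-in reader (read_one is hypothetical; a real loop would call read.csv or fread on the i-th file), and it assumes every file has the same number of rows. data.table users may prefer its by-reference set() for the assignment step.

```r
# Hypothetical sizes: 5 files, each with 3 rows and 2 columns.
n_files <- 5
rows_per_file <- 3

# Stand-in for reading file number i.
read_one <- function(i) {
  data.frame(PostId = i * 10 + seq_len(rows_per_file),
             Upvotes = rep(i, rows_per_file))
}

# Allocate the full result once, before the loop.
data <- data.frame(PostId = numeric(n_files * rows_per_file),
                   Upvotes = numeric(n_files * rows_per_file))

for (i in seq_len(n_files)) {
  # Row block that belongs to file i.
  rows <- ((i - 1) * rows_per_file + 1):(i * rows_per_file)
  # Overwrite that block instead of calling rbind on the growing table.
  data[rows, ] <- read_one(i)
}
```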
I agree with @Repmat: your current solution using rbind() copies the whole data.table in memory every time it is called, which is why the running time grows so quickly (each call copies everything read so far). Another way would be to create an empty csv file containing only the headers first and then simply append the data of all your files to this csv file.
write.table(fread(i), file = "your_final_csv_file", sep = ";",
col.names = FALSE, row.names=FALSE, append=TRUE, quote=FALSE)
This way you don't have to worry about putting the data to the right indexes in your data.table. Also as a hint: fread() is the data.table file reader which is much faster than read.csv.
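Put together, the append approach might look like the sketch below. It uses base R's read.csv so that it runs without extra packages (fread could be swapped in), and the demo folder, file names, and columns are made up for illustration.

```r
# Hypothetical demo input: three small CSV files with identical columns.
in_dir <- file.path(tempdir(), "append_demo")
dir.create(in_dir, showWarnings = FALSE)
for (i in 1:3) {
  write.csv(data.frame(PostId = i, Upvotes = i * 2),
            file.path(in_dir, sprintf("part_%d.csv", i)), row.names = FALSE)
}
csv.list <- list.files(in_dir, pattern = "\\.csv$", full.names = TRUE)
out_file <- file.path(tempdir(), "your_final_csv_file.csv")

# Write the header once, taken from the first file (zero data rows).
header <- read.csv(csv.list[1], nrows = 1)[0, ]
write.table(header, file = out_file, sep = ";", col.names = TRUE,
            row.names = FALSE, quote = FALSE)

# Append the rows of every file; only one file is held in memory at a time.
for (i in csv.list) {
  write.table(read.csv(i), file = out_file, sep = ";", col.names = FALSE,
              row.names = FALSE, append = TRUE, quote = FALSE)
}

# Read the combined file back in a single call.
combined <- read.csv(out_file, sep = ";")
```

With quote = FALSE, fields that contain the separator would break the output, so this assumes the text columns never contain a semicolon.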
In general, R wouldn't be my first choice for this kind of data-munging task.
One suggestion would be to merge them first in groups of 10 or so, and then merge those groups, and so on. That has the advantage that if individual merges fail, you don't lose all the work. The way you are doing it now not only leads to exponentially slowing execution, but exposes you to having to start over from the very beginning every time you fail.
This way will also decrease the average size of the data frames involved in the rbind calls, since the majority of them will be being appended to small data frames, and only a few large ones at the end. This should eliminate the majority of the execution time that is growing exponentially.
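A rough sketch of the grouped merge, assuming the files have already been read into a list of data frames (the list frames below is a made-up stand-in for that) and using a group size of 10:

```r
# Hypothetical stand-in for the files: 25 small data frames already read into a list.
frames <- lapply(1:25, function(i) data.frame(file = i, value = i^2))

# Split the list into consecutive groups of at most 10 data frames.
group_id <- ceiling(seq_along(frames) / 10)
groups <- split(frames, group_id)

# First pass: one rbind per group, so most binds involve only small inputs.
partial <- lapply(groups, function(g) do.call(rbind, g))

# Second pass: bind the few group-level results together.
combined <- do.call(rbind, partial)
rownames(combined) <- NULL
```

Each element of partial could also be saved to disk as it is produced, so that a failure in a later group does not throw away the earlier work.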
I think no matter what you do it is going to be a lot of work.
Some things to consider under the assumption you can trust all the input data and that each record is sure to be unique:
Consider creating the table being imported into without indexes. As indexes get huge the time involved to manage them during imports grows -- so it sounds like this may be happening. If this is your issue it would still take a long time to create indexes later.
Alternately, with the amount of data you are discussing you may want to consider a method of partitioning the data (often done via date ranges). Depending on your database you may then have individually indexed partitions -- easing index efforts.
If your demonstration code doesn't resolve down to a database file import utility then use such a utility.
It may be worth processing files into larger data sets prior to importing them. You could experiment with this by combining 100 files into one larger file before loading, for example, and comparing times.
In the event you can't use partitions (depending on the environment and the experience of the database personnel) you can use a home-brewed method of separating data into various tables, for example data201401 to data201412. However, you'd have to roll your own utilities to query across boundaries.
While decidedly not a better option, it is something you could do in a pinch, and it would allow you to retire/expire aged records easily and without having to adjust the related indexes. It would also let you load pre-processed incoming data by "partition" if desired.
I have an .xdf file on an HDFS cluster which is around 10 GB and has nearly 70 columns. I want to read it into an R object so that I can perform some transformation and manipulation. I searched Google and came across two functions:
rxReadXdf
rxXdfToDataFrame
Could anyone tell me which function is preferred, given that I want to read the data and perform the transformation in parallel on each node of the cluster?
Also, if I read and transform the data in chunks, do I have to merge the output of each chunk?
Thanks for your help in advance.
Cheers,
Amit
Note that rxReadXdf and rxXdfToDataFrame have different arguments and do slightly different things:
rxReadXdf has a numRows argument, so use this if you want to read the top 1000 (say) rows of the dataset
rxXdfToDataFrame supports rxTransforms, so use this if you want to manipulate your data in addition to reading it
rxXdfToDataFrame also has the maxRowsByCols argument, which is another way of capping the size of the input
So in your case, you want to use rxXdfToDataFrame since you're transforming the data in addition to reading it. rxReadXdf is a bit faster in the local compute context if you just want to read the data (no transforms). This is probably also true for HDFS, but I haven’t checked this.
However, are you sure that you want to read the data into a data frame? You can use rxDataStep to run (almost) arbitrary R code on an xdf file, while still leaving your data in that format. See the linked documentation page for how to use the transforms arguments.
Can someone please tell me how, in R, I can access numbered data sets with the loop variable?
So, if I have a long list of files, and in each of them I need to find all the places where a particular value occurs in the second column, take the corresponding value from the third column of the same row, and list these all in one file, how might I do this? The files are named by the title of the folder, the date, and the time, in this fashion: "name_0619_0123". There is the same number of files for each day, and they are at the same times every day. Therefore, if there is a command that lets me put a variable (dependent on the loop counter) into the string that I give as the file name, I can access a different file in each loop iteration.
Any and all ideas please
Also, if there is a more appropriate place for me to ask this question, please let me know.
There are probably lots of ways to do this in R:
You can use a command line script (see the R documentation).
i.e.
R CMD BATCH "--args arg1 arg2" foo.R &
Where foo.R is your R script and the args can be the loop variables you are interested in.
Another way to do this is to use regular expressions to parse out information from your file names.
If you provide a more concrete example I'll be able to show you some more specific code.
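In the meantime, here is a generic sketch of both directions: building file names from loop variables with sprintf, and parsing the date and time back out of a name with a regular expression. The day and time values are made up, and the pattern assumes names of exactly the form "name_MMDD_HHMM" given in the question.

```r
# Hypothetical days and times following the "name_MMDD_HHMM" pattern from the question.
days  <- c("0619", "0620")
times <- c("0123", "1323")

# Build every day/time combination, so a loop variable can pick one name per iteration.
grid <- expand.grid(time = times, day = days, stringsAsFactors = FALSE)
file_names <- sprintf("name_%s_%s", grid$day, grid$time)

# Going the other way: pull the date and time out of each name with a regular expression.
parts <- regmatches(file_names, regexec("^(.*)_([0-9]{4})_([0-9]{4})$", file_names))
info <- data.frame(file = file_names,
                   date = vapply(parts, `[`, character(1), 3),
                   time = vapply(parts, `[`, character(1), 4),
                   stringsAsFactors = FALSE)
```

Inside a loop, file_names[i] can then be passed to read.csv or read.table as the file argument.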
Here are some guidelines if you can glob those files you need to process either with a pattern or picking up all of them.
You may generate a list of files with list.files, read them all in one go with lapply and read.csv, and fetch what you need from each file into a small data.frame. Then, using do.call and rbind on your list of data.frames, you can combine everything into a single data.frame without writing an explicit for loop.
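A minimal sketch of that recipe, assuming comma-separated files without a header row; the demo files, the target value, and the helper extract_one are made up for illustration:

```r
# Hypothetical demo files: no header, three columns, named like "name_0619_0123".
demo_dir <- file.path(tempdir(), "loop_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.table(data.frame(1:3, c(5, 7, 5), c(0.1, 0.2, 0.3)),
            file.path(demo_dir, "name_0619_0123"), sep = ",",
            row.names = FALSE, col.names = FALSE)
write.table(data.frame(4:5, c(7, 5), c(0.4, 0.5)),
            file.path(demo_dir, "name_0620_0123"), sep = ",",
            row.names = FALSE, col.names = FALSE)

target <- 5   # the value to look for in the second column (an assumed example)
files <- list.files(demo_dir, pattern = "^name_", full.names = TRUE)

# For one file: keep the third-column values from rows whose second column equals target.
extract_one <- function(f) {
  d <- read.csv(f, header = FALSE)
  hits <- d[[2]] == target
  data.frame(file = rep(basename(f), sum(hits)), value = d[[3]][hits])
}

# Apply to every file and stack the per-file results.
result <- do.call(rbind, lapply(files, extract_one))
```

The combined result can then be written to a single file with write.csv(result, ...).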
I am trying to manage multiple files in R but am having a difficult time of it. I want to take the data in each of these files and manipulate them through a series of steps (all files receiving the same treatment). I think that I am going about it in a very silly manner though. Is there a way to manage many files (each the same as before) without using 900 apply statements? For example, when is it recommended that you merge all the data frames rather than treat each separately? Is there a way to merge more than two, or an uncertain number, as with the way the files are input here? Or is there a better way to handle so many files?
I take files in a standard way:
chosen<-(tk_choose.files(default="", caption="Files:", multi=TRUE, filters=NULL, index=1))
But after that I would like to do several things with the data. As of now I am just applying different things, but it is getting confusing. See:
ytrim<-lapply(chosen, function(x) strtrim(y, width=11))
chRead<-lapply(chosen,read.table,header=TRUE)
tmp<-lapply(inputFiles, function(x) stack(fnctn))
etc, etc. This surely can't be the recommended way to go about it. Is there a better way to handle a multitude of files?
You can write one function with all operations, and apply it to all your files like this:
doSomethingWithFile <- function(filename) {
ytrim <- strtrim(filename, width=11)
chRead<- read.table(filename,header=TRUE)
# Return some result
chRead
}
result<-lapply(chosen, doSomethingWithFile)
You will only need to think about how to return the results, as lapply returns a list with the same length as its input (chosen, in this case). You could also look at one of the apply-style functions of the plyr package for more flexibility.
(BTW: this code is not without errors, but neither is your example... I'll update mine if you give a proper example)