Combining 1200 CSV files in R with different column numbers

I need to combine 1200 CSV files into one, but they have different numbers of columns. Newbie here: after searching through the forums, I've decided that my code should look something like this:
list.files()
filenames <- list.files(path = "~/")
do.call("rbind.fill", lapply(filenames, read.csv, header = TRUE))
When I run this, all I get is: NULL
Any ideas for me to be able to output one large csv file that combines all of these would be appreciated. Thanks.

Your filenames vector is probably empty. Make sure list.files() actually finds files in the folder you specified, and use full.names = TRUE so the returned paths can be read from outside that folder.
Excerpt from rbind.fill documentation:
Arguments
...
input data frames to row bind together. The first argument can be a list of data frames, in which case all other arguments are ignored. Any NULL inputs are silently dropped. If all inputs are NULL, the output is NULL
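Once list.files() actually returns the files, the approach works. Here is a self-contained sketch: the temp-folder sample files are stand-ins for your real data, and the fill step is a small base-R substitute for plyr::rbind.fill, in case you'd rather avoid the dependency:

```r
# Demo: combine CSVs with different columns, padding missing ones with NA.
# Sample files written to a temp folder stand in for the real 1200 files.
dir <- tempfile("csvs"); dir.create(dir)
write.csv(data.frame(a = 1:2, b = 3:4), file.path(dir, "f1.csv"), row.names = FALSE)
write.csv(data.frame(a = 5,   c = "x"), file.path(dir, "f2.csv"), row.names = FALSE)

# full.names = TRUE is the key fix: read.csv needs the directory prefix
filenames <- list.files(path = dir, pattern = "\\.csv$", full.names = TRUE)
dfs <- lapply(filenames, read.csv)

# base-R stand-in for plyr::rbind.fill: add missing columns as NA, then rbind
all_cols <- unique(unlist(lapply(dfs, names)))
filled <- lapply(dfs, function(d) {
  d[setdiff(all_cols, names(d))] <- NA  # create any absent columns
  d[all_cols]                           # put columns in a common order
})
combined <- do.call(rbind, filled)
```

With plyr installed, the last three statements collapse to `combined <- plyr::rbind.fill(dfs)`.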

Related

How to import many xlsx files to R? (each xlsx file has many sheets and I need only one)

I'm new here and don't know how this site works, so apologies for any mistakes.
I have 23 xlsx files, each with many sheets in them.
I have to create a dataset that combines all of those files, but using only one sheet from each. The columns and the sheet names are the same across files.
I have to bind them by rows.
If anyone knows how to do this, I will be very grateful.
file.list <-list.files("D:/Profile/name/Desktop/Viss/foldername",pattern=".xlsx")
df.list <- lapply(file.list, read_excel)
Error: path does not exist:
df <- rbindlist(df.list, idcol = "id")
I don't know where to put the extract of this one sheet and I don't know what to write in idcol="".
I think your approach is correct, but you should request full paths: file.list <- list.files("D:/Profile/name/Desktop/Viss/foldername", pattern=".xlsx", full.names=TRUE)
EDIT: You should use pattern="\\.xlsx" in
list.files("D:/Profile/name/Desktop/Viss/foldername",pattern="\\.xlsx", full.names=TRUE)
EDIT2: You can always see a function's help page by running ? followed by the function name, like ?rbindlist, or in RStudio by pressing F1 on the function name. The idcol parameter only controls whether an index column is added; in your case you probably don't need one, so you can leave it out.
idcol
Generates an index column. Default (NULL) is not to. If idcol=TRUE then the column is auto named .id. Alternatively the column name can be directly provided, e.g., idcol = "id". If input is a named list, ids are generated using them, else using integer vector from 1 to length of input list. See examples.
EDIT3 if you want to specify the sheet name you can use
lapply(file.list, function(x) read_excel(x, sheet="sheetname"))
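Putting the edits together, a sketch of the whole pipeline (this assumes the readxl and data.table packages; "sheetname" and the folder path are placeholders for your actual values, so it won't run as-is):

```r
library(readxl)
library(data.table)

file.list <- list.files("D:/Profile/name/Desktop/Viss/foldername",
                        pattern = "\\.xlsx$", full.names = TRUE)
# read the one sheet you need from every workbook ("sheetname" is a placeholder)
df.list <- lapply(file.list, read_excel, sheet = "sheetname")
df <- rbindlist(df.list)  # or add idcol = "file" to keep track of origins
```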

How to combine .txt files from multiple folders

I want to combine multiple .txt files in R from multiple folders. However, I'm running into trouble when I want to separate the data into different columns. Right now, the files combine but into one single column when there should be four.
I used list.files to find .txt files in the folders in my working directory. Then I used rbind and lapply to combine them with read.delim. (see below)
files = list.files(pattern = "*.txt")
myfiles = do.call(rbind, lapply(files, function(x) read.delim(x, header = FALSE, stringsAsFactors = FALSE)))
The above code combines all of the .txt files, but the first 3 rows of each file are artifacts of the data download (basically just a naming feature) and are not pertinent to the data itself. So once the files are combined, the three lines repeat. I cannot use filter(), as I would have to manually go through the data (many thousands of lines). I would also like to repeat this process in another folder with a similar setup. So I'd like to be able to use the same code.
I think I can resolve the issue by removing the top 3 lines of each .txt file before combining them. Then I can set header = FALSE and just add in headers once the files are combined. But again, there are many hundreds of files, so I do not wish to do this manually. I'm not sure how to do this, though. Any suggestions?
Thank you for any help.
Options, transcribed from the comment:
By itself, read.delim(..., skip = 3) will drop those leading junk rows. It will also drop the header row, so all of your frames will get generic column names, which is not a big problem.
To fix that, you can re-read just the first row of one of the files (the first, say) to recover the column names, with read.delim(..., nrows = 1). If we used nrows = 0 it would read everything, so we need a minimum of 1 to limit the rows read; in the comment I included [0, ], but since all you need is the column names, it doesn't really affect things.
You can do it the first time with something like:
files = list.files(pattern = "*.txt")
myfiles = do.call(rbind, lapply(files, function(x) read.delim(x, skip = 3, header = FALSE, stringsAsFactors = FALSE)))
# ^ note the added skip = 3
colnames(myfiles) <- colnames(read.delim(files[1], header=TRUE, nrows=1))
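A self-contained variant of the same idea, with one assumption made explicit: each file carries three junk lines followed by its own header row, so the data read uses skip = 4 (junk plus header) and the names are recovered with skip = 3. The generated temp files stand in for the real downloads:

```r
# Demo files: 3 junk lines, then a header line, then tab-separated data
dir <- tempfile("txts"); dir.create(dir)
for (f in c("a.txt", "b.txt")) {
  writeLines(c("junk1", "junk2", "junk3", "x\ty", "1\t2", "3\t4"),
             file.path(dir, f))
}
files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)

# skip = 4 covers the 3 junk lines plus the per-file header row
myfiles <- do.call(rbind, lapply(files, read.delim,
                                 skip = 4, header = FALSE))

# recover the real column names from the first file's header
colnames(myfiles) <- colnames(read.delim(files[1], skip = 3,
                                         header = TRUE, nrows = 1))
```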

How to get named list when reading multiple csv files from given folder?

Assume I have several CSV files in a given folder. When I read them, I get an unnamed list where each element has a default numeric index. How can I read them as a named list instead? Here is my code so far.
Regarding reproducible data, any public dataset is fine.
file <- list.files(folder, full.names = TRUE, "\\.csv$")
f.read <- lapply(seq_along(file), function(ele_) {
  res <- as(read.csv(file[ele_]), "data.frame")
  res
})
I am expecting a named list instead of the default unnamed one: after reading the CSV files from the folder, each list element should carry a specific name (for example, derived from its file) rather than a numeric index. Does anyone know an easy way to get this output? Thank you.
We can use setNames with the extracted file names (via basename and tools::file_path_sans_ext):
setNames(lapply(file, read.csv), tools::file_path_sans_ext(basename(file)))
Alternatively, you can pass a named vector into lapply:
f.read <- lapply(setNames(file, tools::file_path_sans_ext(basename(file))), read.csv)
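A runnable sketch of the same answer, where sample CSVs written to a temp folder stand in for your own files:

```r
# Demo: name each list element after its file (minus path and extension)
dir <- tempfile("csvs"); dir.create(dir)
write.csv(data.frame(v = 1), file.path(dir, "alpha.csv"), row.names = FALSE)
write.csv(data.frame(v = 2), file.path(dir, "beta.csv"),  row.names = FALSE)

file   <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
f.read <- setNames(lapply(file, read.csv),
                   tools::file_path_sans_ext(basename(file)))

names(f.read)    # "alpha" "beta"
f.read$alpha$v   # 1
```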

Reading specific column of multiple files in R

I have used the following code to read multiple .csv files in R:
Assembly<-t(read.table("E:\\test\\exp1.csv",sep="|",header=FALSE,col.names=c("a","b","c","d","Assembly","f"))[1:4416,"Assembly",drop=FALSE])
Top1<-t(read.table("E:\\test\\exp2.csv",sep="|",header=FALSE,col.names=c("a","b","c","d","Top1","f"))[1:4416,"Top1",drop=FALSE])
Top3<-t(read.table("E:\\test\\exp3.csv",sep="|",header=FALSE,col.names=c("a","b","c","d","Top3","f"))[1:4416,"Top3",drop=FALSE])
Top11<-t(read.table("E:\\test\\exp4.csv",sep="|",header=FALSE,col.names=c("a","b","c","d","Top11","f"))[1:4416,"Top11",drop=FALSE])
Assembly1<-t(read.table("E:\\test\\exp5.csv",sep="|",header=FALSE,col.names=c("a","b","c","d","Assembly1","f"))[1:4416,"Assembly1",drop=FALSE])
Area<-t(read.table("E:\\test\\exp6.csv",sep="|",header=FALSE,col.names=c("a","b","c","d","Area","f"))[1:4416,"Area",drop=FALSE])
data<-rbind(Assembly,Top1,Top3,Top11,Assembly1,Area)
So the entire data is in the folder "test" in E drive. Is there a simpler way in R to read multiple .csv data with a couple of lines of code or some sort of function call to substitute what has been made above?
(Untested code; no working example available.) Try this: use list.files to generate the file names, then pass colClasses to read.table to throw away the first 4 columns (and, since that vector is recycled, the 6th column as well):
lapply(list.files("E:\\test\\", pattern = "^exp[1-6]", full.names = TRUE),
       read.table, sep = "|", header = FALSE,
       colClasses = c(rep("NULL", 4), "numeric"), nrows = 4416)
If you want this to be returned as a dataframe, then wrap data.frame around it.
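A self-contained demo of the colClasses trick, using small generated pipe-separated files in place of the real exp*.csv (read.table with sep = "|" matches the question's delimiter, and each file becomes one row via t(), as in the question):

```r
# Demo: two pipe-separated files with six columns; keep only column 5
dir <- tempfile("exps"); dir.create(dir)
writeLines(c("a|b|c|d|10|f", "a|b|c|d|20|f"), file.path(dir, "exp1.csv"))
writeLines(c("a|b|c|d|30|f", "a|b|c|d|40|f"), file.path(dir, "exp2.csv"))

cols <- lapply(list.files(dir, pattern = "^exp", full.names = TRUE),
               read.table, sep = "|", header = FALSE,
               # "NULL" skips a column; recycling skips column 6 as well
               colClasses = c(rep("NULL", 4), "numeric"))

data <- do.call(rbind, lapply(cols, t))  # one row per file, as in the question
```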

Merge multiple files with different rows in R

I know that this question has been asked previously, but answers to the previous posts cannot seem to solve my problem.
I have dozens of tab-delimited .txt files. Each file has two columns ("pos", "score"). I would like to compile all of the "score" columns into one file with multiple columns. The number of rows in each file varies and they are irrelevant for the compilation.
If someone could direct me on how to accomplish this, preferably in R, it would be very helpful.
Alternatively, my ultimate goal is to read the median and mean of the "score" column from each file. So if this could be accomplished, with or without compiling the files, it would be even more helpful.
Thanks.
UPDATE:
As appealing as the idea of personal code ninjas is, I understand this will have to remain a fantasy. Sorry for not being explicit.
I have tried lapply and Reduce, e.g.,
> files <- dir(pattern="X.*\\.txt$")
> File_list <- lapply(files,function(score)
+ read.table(score,header=TRUE,row.names=1))
> File_list <- lapply(File_list,function(z) z[c("pos","score")])
> out_file <- Reduce(function(x,y) {merge(x,y,by=c("pos"))},File_list)
which I know doesn't really make sense, considering I have variable row numbers. I have also tried plyr
> files <- list.files()
> out_list <- llply(files,read.table)
As well as cbind and rbind. Usually I get an error message, because the row numbers don't match up or I just get all the "score" data compiled into one column.
The advice on similar posts (e.g., Merging multiple csv files in R, Simultaneously merge multiple data.frames in a list, and Merge multiple files in a list with different number of rows) has not been helpful.
I hope this clears things up.
This problem could be solved in two steps:
Step 1. Read the data from your CSV files into a list of data frames, where files is a vector of file names. If you need extra arguments to read.csv, add them as shown below. See ?lapply for details.
list_of_dataframes <- lapply(files, read.csv, stringsAsFactors = FALSE)
Step 2. Calculate means for each data frame:
means <- sapply(list_of_dataframes, function(df) mean(df$score))
Of course, you can always do it in one step like this:
means <- sapply(files, function(filename) mean(read.csv(filename)$score))
I think you want something like this:
all_data = do.call(rbind, lapply(files, function(f) {
  cbind(read.csv(f), file_name = f)
}))
You can then do whatever "by" type of action you like. Also, don't forget to adjust the various read.csv options to suit your needs.
E.g. once you have the above, you can do the following (and much more):
library(data.table)
dt = data.table(all_data)
dt[, list(mean(score), median(score)), by = file_name]
A small note: you could also use data.table's fread instead of read.table and its derivatives to read the files in, which would be much faster, and, while we're at it, rbindlist instead of do.call(rbind, ...).
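If data.table isn't available, the same by-file summary works in base R with aggregate(). A self-contained sketch with generated sample files standing in for your tab-delimited data:

```r
# Demo: per-file mean and median of "score" with base R only
dir <- tempfile("scores"); dir.create(dir)
write.csv(data.frame(pos = 1:3, score = c(1, 2, 9)),
          file.path(dir, "s1.csv"), row.names = FALSE)
write.csv(data.frame(pos = 1:2, score = c(4, 6)),
          file.path(dir, "s2.csv"), row.names = FALSE)

files <- list.files(dir, full.names = TRUE)
# tag each row with its source file so row counts never need to match
all_data <- do.call(rbind, lapply(files, function(f) {
  cbind(read.csv(f), file_name = basename(f))
}))

# one row per file, with mean and median of score
stats <- aggregate(score ~ file_name, data = all_data,
                   FUN = function(x) c(mean = mean(x), median = median(x)))
```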
