Read several txt files from different directories in R

I have several txt files in different directories. I want to read each file separately in R so that I can apply some analysis to each one later.
The directories are the same except for the last folder, as follows:
c:/Desktop/ATA/1/files.txt
c:/Desktop/ATA/2/files.txt
c:/Desktop/ATA/3/files.txt
...
...
The files in all directories have the same name, and the last folder is numbered from 1 up to the last one.

Create all the filenames to read using sprintf or something similar. Then use read.table or whatever you use to read the text files.
lapply(sprintf("c:/Desktop/ATA/%d/files.txt", 1:10),
       function(x) read.table(x, header = TRUE))
Replace 10 with the number of folders you have.
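If you want to keep track of which folder each data frame came from, one option (a small sketch building on the call above) is to name the resulting list by the folder number:
paths <- sprintf("c:/Desktop/ATA/%d/files.txt", 1:10)
datasets <- lapply(paths, read.table, header = TRUE)
names(datasets) <- basename(dirname(paths))  # names the list elements "1", "2", ...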

Maybe you can try:
list_file <- list.files(path = "c:/Desktop/ATA", recursive = T, pattern = ".txt", full.names = T)
This will return the list of text files contained in your folder. Then, you can create a for loop to open them and apply some functions on each.
for (i in 1:length(list_file)) {
  data = read.table(list_file[i], header = TRUE, sep = "\t")
  # ... function(s) to apply to data ...
}
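If the results are needed after the loop, one variation (a sketch, keeping the same list_file) is to store each data frame in a list instead of overwriting data on every iteration:
results <- vector("list", length(list_file))
for (i in seq_along(list_file)) {
  results[[i]] <- read.table(list_file[i], header = TRUE, sep = "\t")
  # ... apply your function(s) to results[[i]] here ...
}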

First, thanks guys. I mixed your code and modified it a little bit:
common_path = "c:/Desktop/ATA/"
primary_dirs = length(list.files(common_path)) # Gives no. of folders in path
list_file <- sprintf("c:/Desktop/ATA/%d/files.txt", 1:primary_dirs)
for (i in 1:length(list_file)) {
  data = read.table(list_file[i], header = TRUE, sep = "\t")
}
This way, the folders are processed in numeric order (1, 2, 3, ...) rather than the lexicographic order (1, 10, 11, 2, 3, ...) that list.files() would give.
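For illustration, the ordering difference the last sentence refers to can be seen directly (a small sketch; nothing here is specific to the data):
# Character sorting (what list.files() effectively gives) vs. numeric order:
sort(as.character(1:12))
#>  [1] "1"  "10" "11" "12" "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"
sprintf("c:/Desktop/ATA/%d/files.txt", 1:12)[1:3]
#> [1] "c:/Desktop/ATA/1/files.txt" "c:/Desktop/ATA/2/files.txt" "c:/Desktop/ATA/3/files.txt"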

Related

Specifying pathname in map_dfr

The structure of my directory is as follows:
Extant_Data -> Data -> Raw -> course_enrollment
                           -> frpm
I have a few different functions to read in some text files and Excel files respectively.
read_fun = function(path){
  test = read.delim(path, sep = "\t", header = TRUE, fill = TRUE,
                    colClasses = c(rep("character", 23)))
  test
}
read_fun_frpm = function(path){
  test = read_excel(path, sheet = 2, col_names = frpm_names)
}
I feed this into map_dfr so that the function reads in each of the files and rowbinds them.
allfiles = list.files(path = "Extant_Data/Data/Raw/course_enrollment",
                      pattern = "CourseEnrollment.txt",
                      full.names = FALSE,
                      recursive = TRUE)
# Rowbind all the course enrollment data
# !!! BUT I HAVE set the working directory to a subdirectory so that it finds those files
setwd("/Extant_Data/Data/Raw/course_enrollment")
course_combined <- map_dfr(allfiles, read_fun)

allfiles = list.files(path = "Extant_Data/Data/Raw/frpm/post12",
                      pattern = "frpm*",
                      full.names = FALSE,
                      recursive = TRUE)
# Rowbind all the frpm data
# !!! I have to change the directory AGAIN
setwd("Extant_Data/Data/Raw/frpm/post12")
frpm_combined <- map_dfr(allfiles, read_fun_frpm)
As mentioned in the comments, I have to keep changing the working directory so that map_dfr can locate the files. I don't think this is best practice; how might I work around it so I don't have to keep changing the directory? Any suggestions appreciated. Sorry it's hard to provide a reproducible example.
Note: This throws an error.
frpm_combined <- map_dfr(allfiles,read_fun_frpm('Extant_Data/Data/Raw/frpm/post12'))
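One way to avoid the repeated setwd() calls (an untested sketch, reusing the read functions defined above) is to ask list.files() for full paths so that map_dfr() can locate the files from any working directory:
library(purrr)

# full.names = TRUE returns complete paths, so no setwd() is needed
course_files <- list.files(path = "Extant_Data/Data/Raw/course_enrollment",
                           pattern = "CourseEnrollment.txt",
                           full.names = TRUE, recursive = TRUE)
course_combined <- map_dfr(course_files, read_fun)

frpm_files <- list.files(path = "Extant_Data/Data/Raw/frpm/post12",
                         pattern = "^frpm",  # assumed regex; the original "frpm*" was glob-style
                         full.names = TRUE, recursive = TRUE)
frpm_combined <- map_dfr(frpm_files, read_fun_frpm)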

Can I automate an increasing value in a file name in R?

So I have .csv's of nesting data that I need to trim. I wrote a series of functions in R and then spit out the new pretty .csv. The issue is that I need to do this with 59 .csv's and I would like to automate the name.
data1 <- read.csv("Nest001.csv", skip = 3, header = FALSE)
# ... functions functions functions ...
write.csv(edit, file.path(out.path, "Nest001_NEW.csv"), row.names = FALSE)
So...is there any way for me to loop the name from Nest001 to Nest059 so that I don't have to delete and retype the name for every .csv?
EDIT to incorporate Gregor's suggestion:
One option:
filenames_in <- sprintf("Nest%03d.csv", 1:59)
filenames_out <- sub(pattern = "(\\d{3})(\\.)", replacement = "\\1_NEW\\2", filenames_in)
all_files <- matrix(c(filenames_in, filenames_out), ncol = 2)
And then loop through them:
for (i in 1:nrow(all_files)) {
  temp <- read.csv(all_files[[i, 1]], skip = 3, header = FALSE)
  # ... do stuff ...
  write.csv(temp, all_files[[i, 2]], row.names = FALSE)
}
To do this purrr-style, you would create two lists similar to the above, and then write a custom function to read in the file, perform all the functions, and then output it.
e.g.
purrr::walk2(
  .x = filenames_in,
  .y = filenames_out,
  .f = ~ my_function(.x, .y)
)
Consider .x and .y as the i in the for loop; walk2() goes through both vectors simultaneously and performs the function on each pair of items.
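A sketch of what such a custom function could look like (my_function and the trimming step are placeholders, not from the original post):
# hypothetical helper: read one file, trim it, write the result
my_function <- function(file_in, file_out) {
  temp <- read.csv(file_in, skip = 3, header = FALSE)
  # ... trimming functions go here ...
  write.csv(temp, file_out, row.names = FALSE)
}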
More info is available in the purrr documentation.
Your best bet is to put all of these CSVs into one folder, without any other CSVs in that folder. Then, you can write a loop to go over every file in that folder, and read them in.
library(dplyr)
setwd("path to the folder with CSVs goes here")
combinedData = data.frame()
files = list.files()
for (file in files)
{
  temp = read.csv(file)
  combinedData = bind_rows(combinedData, temp)
}
EDIT: if there are other files in the folder that you don't want to read, you can add this line of code to only read in files that contain the word "Nest" in the title:
files = files[grepl("Nest", files)]
I don't remember off the top of my head if that is case sensitive or not (grepl() is case sensitive by default; pass ignore.case = TRUE if you need it not to be).
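An equivalent, more compact version of the same idea (a sketch under the same assumptions) reads all the files into a list and binds them in one call:
library(dplyr)

files <- list.files("path to the folder with CSVs goes here", full.names = TRUE)
files <- files[grepl("Nest", files)]
combinedData <- bind_rows(lapply(files, read.csv))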

Merging files by name and condition in R

I have a series of files in a folder that look like this;
B_1.csv, B_1_2.csv, B_2.csv, B_2_2.csv, B_3.csv, B_4.csv, B_4_1.csv
Basically, I wish to merge any file that contains '_2' with its preceding number (i.e. B_1_2.csv merges with B_1.csv). A further complication is that some files (such as B_3.csv) do not have a second file (_2) and therefore need to be ignored. I cannot think of an easy way to do this. Any help would be greatly appreciated. Many thanks.
Untested, of course, but this should work (or something close to it):
# identify CSV files
files = list.files(pattern = "csv$")
# look for ones that need merging
to_merge = grep("[0-9]_2\\.csv", files, value = TRUE)
# identify what they should be merged to
merge_target_file = sub("_2.csv", ".csv", to_merge, fixed = TRUE)
# make sure those exist
problems = setdiff(merge_target_file, files)
if(length(problems)) stop(problems, " not found, need merge targets.")
# read in the data
merge_target = lapply(merge_target_file, read.csv, stringsAsFactors = FALSE)
to_merge = lapply(to_merge, read.csv, stringsAsFactors = FALSE)
# merge and write
for(i in seq_along(merge_target)) {
write.csv(rbind(merge_target[[i]], to_merge[[i]]), file = merge_target_file[i])
}

Importing multiple files in sparklyr

I'm very new to sparklyr and spark, so please let me know if this is not the "spark" way to do this.
My problem
I have 50+ .txt files of around 300 MB each, all in the same folder, call it x, that I need to import to sparklyr, preferably as one table.
I can read them individually like
spark_read_csv(path=x, sc=sc, name="mydata", delimiter = "|", header=FALSE)
If I were to import them all outside of sparklyr, I would probably create a list with the file names, call it filelist and then import them all into a list with lapply
filelist = list.files(pattern = ".txt")
datalist = lapply(filelist, function(x) read.table(file = x, sep = "|", header = FALSE))
This gives me a list where element k is the k-th .txt file in filelist. So my question is: is there an equivalent way in sparklyr to do this?
What I've tried
I've tried to use lapply() and spark_read_csv, like I did above outside sparklyr, just changing read.table to spark_read_csv and the arguments:
datalist = lapply(filelist, function(x)
  spark_read_csv(path = x, sc = sc, name = "name", delimiter = "|", header = FALSE))
which gives me a list with the same number of elements as .txt files, but every element (.txt file) is identical to the last .txt file in the file list.
> identical(datalist[[1]],datalist[[2]])
[1] TRUE
I obviously want each element to be one of the datasets. My idea is that after this, I can just rbind them together.
Edit:
Found a way. The problem was that the argument "name" in spark_read_csv needs to be updated each time a new file is read, otherwise it will overwrite the previous one. So I did it in a for loop instead of lapply, and in each iteration I change the name. Are there better ways?
datalist <- list()
for(i in 1:length(filelist)){
name <- paste("dataset",i,sep = "_")
datalist[[i]] <- spark_read_csv(path = filelist[i], sc = sc,
name = name, delimiter="|", header=FALSE)
}
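For what it's worth, the same idea also works with lapply by generating the unique name inside the function (a sketch under the same sc and filelist assumptions):
datalist <- lapply(seq_along(filelist), function(i) {
  spark_read_csv(path = filelist[i], sc = sc,
                 name = paste("dataset", i, sep = "_"),
                 delimiter = "|", header = FALSE)
})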
Since you (emphasis mine)
have 50+ .txt files at around 300 mb each, all in the same folder
you can just use a wildcard in the path:
spark_read_csv(
  path = "/path/to/folder/*.txt",
  sc = sc, name = "mydata", delimiter = "|", header = FALSE)
If the directory contains only the data, you can simplify this even further:
spark_read_csv(
  path = "/path/to/folder/",
  sc = sc, name = "mydata", delimiter = "|", header = FALSE)
Native Spark readers also support reading multiple paths at once (Scala code):
spark.read.csv("/some/path", "/other/path")
but as of 0.7.0-9014 it is not properly implemented in sparklyr (current implementation of spark_normalize_path doesn't support vectors of size larger than one).

Looping over all files in the same directory in R

I want to run the following code in R for all the files. Actually, I made a for loop for that, but when I run it, it is applied only to one file, not all of them. BTW, my files do not have a header.
You use [[ to subset something from peaks. However, after reading the file, peaks is a data frame with no further reference to the file name. Thus, you just have to get rid of the [[i]].
for (i in filelist.coverages) {
  peaks <- read.delim(i, sep = '', header = FALSE)
  PeakSizes <- c(PeakSizes, peaks$V3 - peaks$V2)
}
Because the iterator i within read.delim() holds a new file name on each pass, every time R goes through the loop, peaks will contain the content of a new file.
In your code, i refers to a file name. Use indices instead.
And, by the way, don't use setwd; use the full.names = TRUE option in list.files. And preallocate PeakSizes, e.g. as a list with one element per file: PeakSizes <- vector("list", length(filelist.coverages)).
So do:
filelist.coverages <- list.files('K:/prostate_cancer_porto/H3K27me3_ChIPseq/',
                                 pattern = 'island.bed', full.names = TRUE)
## all 97 bed files
PeakSizes <- vector("list", length(filelist.coverages))
for (i in seq_along(filelist.coverages)) {
  peaks <- read.delim(filelist.coverages[i], sep = '', header = FALSE)
  PeakSizes[[i]] <- peaks$V3 - peaks$V2
}
# flatten into one vector of peak sizes
PeakSizes <- unlist(PeakSizes)
Or you could simply use sapply or purrr::map:
sapply(filelist.coverages, function(file) {
  peaks <- read.delim(file, sep = '', header = FALSE)
  peaks$V3 - peaks$V2
})
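Note that if the files contain different numbers of peaks, sapply() falls back to returning a list; wrapping the call in unlist() gives a single vector of all peak sizes, matching the loop above:
PeakSizes <- unlist(sapply(filelist.coverages, function(file) {
  peaks <- read.delim(file, sep = '', header = FALSE)
  peaks$V3 - peaks$V2
}))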
