Applying a task to several files in R

I would like to use a loop in R to process several files, one file at a time. The files all follow exactly the same pattern; only the number in the string "...split1..." increases from file to file, so I have files like "...split1...", "...split2..." ... "...split777...". I want output files following the same logic: "newsplit1.txt", "newsplit2.txt" ... "newsplit777.txt". This is what I currently do for a single file:
all <- read.table("nsamplescluster.split1.adjusted", header=TRUE, sep=";")
all <- all[, -grep("GType", colnames(all))]   # drop the GType columns
write.table(all, "newsplit1.txt", sep=";")
Cheers!

Use a loop and paste0() to build the file names:
for(i in 1:777){
  infile  <- paste0("nsamplescluster.split", i, ".adjusted")
  outfile <- paste0("newsplit", i, ".txt")
  all <- read.table(infile, header=TRUE, sep=";")
  all <- all[, -grep("GType", colnames(all))]
  write.table(all, outfile, sep=";")
}

If the files are all in the same directory, you can also use
filenames <- list.files(your.directory, pattern="nsamplescluster")
This will create a vector with all file names in your.directory with the indicated pattern. You can then use this to loop over your files. For instance,
for(i in filenames){
  all <- read.table(i, header=TRUE, sep=";")
  all <- all[, -grep("GType", colnames(all))]
  write.table(all, sub("nsamplescluster\\.split(\\d+)\\.adjusted", "newsplit\\1.txt", i), sep=";")
}
This comes in handy if the number of files changes.


Best way to import multiple .csv files into separate data frames? lapply()

Suppose we have files file1.csv, file2.csv, ... , and file100.csv in directory C:\R\Data and we want to read them all into separate data frames (e.g. file1, file2, ... , and file100).
The reason for this is that, despite having similar names, the files have different structures, so it is not that useful to have them in a list.
I could use lapply but that returns a single list containing 100 data frames. Instead I want these data frames in the Global Environment.
How do I read multiple files directly into the global environment? Or, alternatively, how do I unpack the contents of a list of data frames into it?
Thank you all for replying.
For completeness, here is my final answer for loading any number of (tab-)delimited files, in this case with 6 columns of data each, where column 1 is character, column 2 is a factor, and the remainder are numeric:
##Read files named xyz1111.csv, xyz2222.csv, etc.
filenames <- list.files(path="../Data/original_data", pattern="xyz+.*csv")

##Create list of data frame names without the ".csv" part
names <- substr(filenames, 1, 7)

##Load all files
for(i in names){
  filepath <- file.path("../Data/original_data", paste(i, ".csv", sep=""))
  assign(i, read.delim(filepath,
                       colClasses=c("character","factor",rep("numeric",4)),
                       sep="\t"))
}
Quick draft, untested:
Use list.files() aka dir() to dynamically generate your list of files.
This returns a vector; just run along the vector in a for loop.
Read the i-th file, then use assign() to place the content into a new variable file_i.
That should do the trick for you.
Use assign with a character variable containing the desired name of your data frame.
for(i in 1:100) {
  oname <- paste("file", i, sep="")
  assign(oname, read.csv(paste(oname, ".csv", sep="")))
}
This answer is intended as a more useful complement to Hadley's answer.
While the OP specifically wanted each file read into their R workspace as a separate object, many other people naively landing on this question may think that that's what they want to do, when in fact they'd be better off reading the files into a single list of data frames.
So for the record, here's how you might do that.
#If the path is different than your working directory
# you'll need to set full.names = TRUE to get the full paths.
my_files <- list.files("path/to/files")

#Further arguments to read.csv can be passed in ...
all_csv <- lapply(my_files, read.csv, ...)

#Set the name of each list element to its respective file name.
# Note full.names = FALSE to get only the file names, not the full path.
names(all_csv) <- gsub(".csv", "",
                       list.files("path/to/files", full.names = FALSE),
                       fixed = TRUE)
Now any of the files can be referred to by all_csv[["filename"]], which really isn't much worse than just having separate filename variables in your workspace, and often it is much more convenient.
Here is a way to unpack a list of data.frames using just lapply:
filenames <- list.files(path="../Data/original_data", pattern="xyz+.*csv")
filelist <- lapply(filenames, read.csv)

#if necessary, assign names to data.frames
names(filelist) <- c("one","two","three")

#note: invisible() keeps lapply from spitting out the data.frames to the console
invisible(lapply(names(filelist), function(x) assign(x, filelist[[x]], envir=.GlobalEnv)))
Reading all the CSV files from a folder and creating objects named after the file names:
setwd("your path to folder where CSVs are")
filenames <- gsub("\\.csv$","", list.files(pattern="\\.csv$"))
for(i in filenames){
assign(i, read.csv(paste(i, ".csv", sep="")))
}
A simple way to access the elements of a list from the global environment is to attach the list. Note that this actually creates a new environment on the search path and copies the elements of your list into it, so you may want to remove the original list after attaching to prevent having two potentially different copies floating around.
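A minimal sketch of that approach, assuming all_csv is the named list of data frames built in the answer above (somefile stands for one hypothetical element name):
attach(all_csv)   # copies each list element into a new environment on the search path
# head(somefile)  # each data frame is now reachable directly by its name
rm(all_csv)       # optional: drop the original list so only one copy is left
# detach("all_csv") removes the attached environment again when you are done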
I want to update the answer given by Joran:
my_files <- list.files(path="set your directory here", full.names=TRUE)
# full.names=TRUE is important here: if the path is different from your
# working directory, read.csv needs the full paths.

#Further arguments to read.csv can be passed in ...
all_csv <- lapply(my_files, read.csv)

#Set the name of each list element to its respective file name.
# Note full.names = FALSE to get only the file names, not the full path.
names(all_csv) <- gsub(".csv", "",
                       list.files("copy and paste your directory here", full.names = FALSE),
                       fixed = TRUE)

#Now you can create a dataset based on each filename
df <- as.data.frame(all_csv$nameofyourfilename)
A simplified version, assuming your csv files are in the working directory:
listcsv <- list.files(pattern = "\\.csv$")        # creates a vector of the csv file names
names <- substr(listcsv, 1, nchar(listcsv) - 4)   # the file names without the ".csv" part
#cycle through the names and assign each relevant dataframe using read.csv
for (k in seq_along(listcsv)){
  assign(names[k], read.csv(listcsv[k]))
}
#copy all the files you want to read into your working directory
a <- dir(pattern="\\.csv$")

#use lapply to remove the ".csv" from the filenames
#(no for loop is needed here; lapply already iterates over a)
list1 <- lapply(a, function(x) gsub("\\.csv$", "", x))

#Final step: read each file and assign it to its name
for(i in list1){
  filepath <- paste(i, ".csv", sep="")
  assign(i, read.csv(filepath))
}
Use list.files() and purrr::map_dfr() to read many csv files:
df <- list.files(data_folder, full.names = TRUE) %>%
  map_dfr(read_csv)
Reproducible example
First write sample csv files to a temporary directory.
It's more complicated than I thought it would be.
library(dplyr)
library(purrr)
library(purrrlyr)
library(readr)

data_folder <- file.path(tempdir(), "iris")
dir.create(data_folder)

iris %>%
  # Keep the Species column in the output:
  # create a new column that will be used as the grouping variable
  mutate(species_group = Species) %>%
  group_by(species_group) %>%
  nest() %>%
  by_row(~write.csv(.$data,
                    file = file.path(data_folder, paste0(.$species_group, ".csv")),
                    row.names = FALSE))
Read these csv files into one data frame.
Note the Species column has to be present in the csv files, otherwise we would lose that information.
iris_csv <- list.files(data_folder, full.names = TRUE) %>%
  map_dfr(read_csv)
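If the grouping column were not stored in the files themselves, one untested variation is to name the vector of paths so that map_dfr records each row's file of origin via its .id argument (source_file is an assumed column name):
files <- list.files(data_folder, full.names = TRUE)
iris_csv <- files %>%
  setNames(basename(files)) %>%          # element names become the .id values
  map_dfr(read_csv, .id = "source_file") # adds a column recording the origin file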

Read files matching subdirectory patterns in R

I've used a lot of posts to get me this far (such as R list files with multiple conditions and How can I read multiple files from multiple directories into R for processing?), but I can't accomplish what I need in R.
I have many .csv files distributed in multiple subdirectories that I want to read in and then save as separate objects corresponding to the basename. The end result will be to rbind each of those files together. Here's a sample dir structure and some of what I've tried:
./DATA/Cat_Animal/animal1.csv
./DATA/Dog_Animal/animal2.csv
./DATA/Dog_Animal/animal3.csv
./DATA/Dog_Animal/animal3.1.csv
#read in all csv files
files <- list.files(path="./DATA", pattern="*.csv", full.names=TRUE, recursive=TRUE)
But this results in all files in all subdirectories. I want to match specific files (animalX.csv) in specific subdirectories matching the pattern (X_Animal), such as this:
files <- dir(path=paste0("./DATA/", pattern="*+_Animal"), recursive=TRUE, full.names=TRUE, pattern="animal+.*csv")
Once I get my list of files, I want to read each of them in and save each to the corresponding file's basename. So the file named animal1.csv would be saved to animal1. I think I need to use the function basename() somewhere in a loop, but I'm not sure how.
Help very much appreciated; I've spent a lot of time trying out various options with little progress.
This question is really two questions; consider splitting them up. On the last part of your question, how to rbind a list full of data.frames together, try:
finalDf = do.call(rbind, result)
You'll likely need to use str_split() from the stringr package to extract the parts of the file path you need. You could also use str_extract() with regular expressions.
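For instance, a rough sketch of the str_extract() route (the regular expression here is an assumption based on the sample tree above):
library(stringr)
files <- list.files(path="./DATA", pattern="animal.*\\.csv$", full.names=TRUE, recursive=TRUE)
species <- str_extract(files, "[A-Za-z]+(?=_Animal)") # pulls "Cat" or "Dog" from each path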
I think I found a work-around for the short term because luckily I only have a few subdirectories currently.
myFiles1 <- list.files(path = "./DATA/Cat_Animal/", pattern="animal+.*csv")
processFile <- function(f) {
  df <- read.csv(file = paste0("./DATA/Cat_Animal/", f))
}
result1 <- lapply(myFiles1, processFile)

#then do it again for the next subdir:
myFiles2 <- list.files(path = "./DATA/Dog_Animal/", pattern="animal+.*csv")
processFile <- function(f) {
  df <- read.csv(file = paste0("./DATA/Dog_Animal/", f))
}
result2 <- lapply(myFiles2, processFile)

finalDf <- do.call(rbind, c(result1, result2))
I know there is a better way but can't figure out the pattern matching for the subdirectories! It's so easy in Unix, for example.
You can simply call list.files() two times.
a <- list.files(path="./DATA", pattern="_Animal", full.names=TRUE, recursive=FALSE)
a
#[1] "./DATA/Cat_Animal" "./DATA/Dog_Animal"
files <- list.files(path=a, pattern="animal", full.names=TRUE)
files
#[1] "./DATA/Cat_Animal/animal1.csv" "./DATA/Dog_Animal/animal2.csv" "./DATA/Dog_Animal/animal3.csv"
#[4] "./DATA/Dog_Animal/animal3.1.csv"
In the first step, please make sure to use full.names = TRUE and recursive = FALSE. You need full.names = TRUE to get the file path, not just the file name; otherwise you might lose the path to animal*.csv in the second step. And recursive = TRUE would return nothing, since Dog_Animal and Cat_Animal are folders, not files.
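On the basename() part of the original question, a minimal untested sketch: name each data frame after its file's basename without the extension, then bind everything together.
files <- list.files(path="./DATA", pattern="^animal.*\\.csv$", full.names=TRUE, recursive=TRUE)
result <- lapply(files, read.csv)
names(result) <- tools::file_path_sans_ext(basename(files)) # e.g. "animal1"
finalDf <- do.call(rbind, result)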

R Read files, manipulate file names and write files from/to folder

I have 10 files in a folder which I have to manipulate. At the moment I do that manually, meaning I adapt the script ten times. Now I'm trying to do it automatically.
First I get the file names with this command:
data <- list.files(path=".", pattern=".csv", full.names=TRUE)
Now my idea is to iterate over the file names with a for loop:
for(i in data) {
  df <- read.csv("i", header=T)
  df$sum <- sum(df$value1, df$value2)
  write.table(df, file=i, row.names=FALSE)
}
I'm not sure if the read.csv command works. Moreover, I don't want to overwrite the original files. The original file names have the following structure:
30LOV_1.csv
30LOV_2.csv
100LOV_1.csv
2000LOV_1.csv
I want to add something like _20min to the file names, e.g.
30LOV_1_20min.csv
30LOV_2_20min.csv
100LOV_1_20min.csv
2000LOV_1_20min.csv
How can I achieve this? Do you have any suggestions?
Thanks
You can use sub to add additional information to the end of your filename. I used parentheses to create a capture group grabbing the first part of the file name (.*), and then recalled it in the replacement with \\1.
for(i in data) {
  df <- read.csv(i, header=TRUE)   # note: i, not "i"; the loop variable holds the file name
  df$sum <- sum(df$value1, df$value2)
  filename <- sub("(.*)\\.csv", "\\1_20min.csv", i)
  filename
  # [1] "30LOV_1_20min.csv"
  write.table(df, file=filename, row.names=FALSE)
}

Reading in files stored in a dataframe in R

I have three directories that each contain about 2,000 files - the files have exactly the same format, but they come from 3 different sources. For each set of 3 files, I need to read in the data, merge them, do some calculations, and store the output. I've already got my script running for a test case; now I'm trying to loop it over all the files (so, 2,000 sets of 3 files each).
I only want to read in 3 at a time, of course. I thought of this approach: create a dataframe of files where the columns represent the 3 types and the rows represent files. I do that here:
type1Files <- list.files(path="path_to_dir", pattern="*.tsv", full.names=TRUE, recursive=FALSE)
type2Files <- list.files(path="path_to_dir", pattern="*.tsv", full.names=TRUE, recursive=FALSE)
type3Files <- list.files(path="path_to_dir", pattern="*.tsv", full.names=TRUE, recursive=FALSE)
files.df <- cbind.data.frame(type1=type1Files, type2=type2Files, type3=type3Files)
Now I need to read these files by column, looping over rows so only 3 files get opened in one loop. The issue is that I cannot read in a file using read.table, and I think it's because of the format of the filename (read.table() is not being fed the right format).
head(files.df) #confirms that each file is not surrounded by double quotes as required by read.table
My read.table statement:
type1.df <- read.table(x, header=FALSE, sep="\t", stringsAsFactors=FALSE)
Where, for x, I have tried the following:
shQuote(files.df[1,"type1"])
dQuote(files.df[1,"type1"])
file.t <- files.df[1,"type1"]
paste0('"',file.t,'"')
I've tried them all directly in read.table() as well as saving to objects and naming the object in read.table(). I even tried using cat() because I thought the escaped quotes might be the problem. Nothing works: I either get "unexpected input" as the error, or the typical "Error in file(file, "rt") : cannot open the connection". Furthermore, if I paste the exact filename that is printed in the error into my read.table() statement, it runs just fine. So, after many hours, I am stumped.
Can this be done in this way?
Thank you all for your advice.
Consider iterating directly from the lists without using an intermediary dataframe. With Map, you can pass the three vectors of 2,000 file paths each and make iterative cbind calls across all three types. Below, cbind.data.frame prefixes the columns with type1, type2, and type3.
bind_dfs <- function(x, y, z) {
  xdf <- read.table(x, header=FALSE, sep="\t", stringsAsFactors=FALSE)
  ydf <- read.table(y, header=FALSE, sep="\t", stringsAsFactors=FALSE)
  zdf <- read.table(z, header=FALSE, sep="\t", stringsAsFactors=FALSE)
  cbind.data.frame(type1=xdf, type2=ydf, type3=zdf)
}

dfList <- Map(bind_dfs, type1Files, type2Files, type3Files)
Also, to run your calculations, you can either extend the bind_dfs method
bind_dfs <- function(x, y, z) {
  xdf <- read.table(x, header=FALSE, sep="\t", stringsAsFactors=FALSE)
  ydf <- read.table(y, header=FALSE, sep="\t", stringsAsFactors=FALSE)
  zdf <- read.table(z, header=FALSE, sep="\t", stringsAsFactors=FALSE)
  df <- cbind(xdf, ydf, zdf)
  df <- #... other calculations
  return(df)
}
Or run another loop over the dataframe list:
newdfList <- lapply(dfList, function(df){
  df <- # ... other calculations
  return(df)
})
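As for why the original read.table() calls failed: a plausible culprit (an assumption, not confirmed by the OP) is that cbind.data.frame() converted the file-name columns to factors under the pre-R 4.0 default stringsAsFactors = TRUE, and read.table() needs a character path rather than a factor. A minimal sketch of that fix:
# build the dataframe without factors ...
files.df <- cbind.data.frame(type1=type1Files, type2=type2Files, type3=type3Files,
                             stringsAsFactors=FALSE)
# ... or coerce the value at read time:
type1.df <- read.table(as.character(files.df[1, "type1"]),
                       header=FALSE, sep="\t", stringsAsFactors=FALSE)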

Loop in R to read many files

I have been wondering if anybody knows a way to create a loop that loads files/databases in R.
Say I have some files like this: data1.csv, data2.csv, ..., data100.csv.
In some programming languages one can write something like data + { x } + .csv and the system recognizes it as datax.csv, and then you can apply the loop.
Any ideas?
Sys.glob() is another possibility - its sole purpose is globbing, or wildcard expansion.
dataFiles <- lapply(Sys.glob("data*.csv"), read.csv)
That will read all the files of the form data[x].csv into list dataFiles, where [x] is nothing or anything.
[Note this is a different pattern to that in @Joshua's answer. There, list.files() takes a regular expression, whereas Sys.glob() just uses standard wildcards; which wildcards can be used is system dependent, and details can be found on the help page ?Sys.glob.]
See ?list.files.
myFiles <- list.files(pattern="data.*csv")
Then you can loop over myFiles.
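For example, a minimal sketch of that loop (the processing step is a placeholder):
for (f in myFiles) {
  dat <- read.csv(f)
  # ... process dat here ...
}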
I would put all the CSV files in a directory, create a list, and loop to read all the csv files from the directory into the list.
setwd("~/Documents/")
ldf <- list() # creates a list
listcsv <- dir(pattern = "*.csv") # creates the list of all the csv files in the directory
for (k in 1:length(listcsv)){
ldf[[k]] <- read.csv(listcsv[k])
}
str(ldf[[1]])
Read all the csv files in the working directory and bind them row-wise into one merged data frame:
library(dplyr)
library(readr)
list_file <- list.files(pattern = "*.csv") %>%
  lapply(read.csv, stringsAsFactors=FALSE) %>%
  bind_rows
fi <- list.files(directory_path, full.names=TRUE)
dat <- lapply(fi, read.csv)
dat will contain the datasets in a list.
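One small optional addition, assuming you want to look the datasets up by file name rather than by position:
names(dat) <- basename(fi)
# dat[["data1.csv"]] then returns that file's data frame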
Let's assume that your files have the file format that you mentioned in your question and that they are located in the working directory.
You can vectorise the creation of the file names if they have a simple naming structure, then apply a loading function over all the files (here I used the purrr package, but you could also use lapply):
library(purrr)
c(1:100) %>% paste0("data", ., ".csv") %>% map(read.csv)
Here's another solution using a for loop. I like it better than the others because of its flexibility and because all dfs are stored directly in the global environment.
Assuming you've already set your working directory, the loop will iteratively read all files and store them in the global environment with the names "data1", "data2", and so on.
for (i in 1:100) {
  filename <- paste0("data", i)
  wd <- paste0("data", i, ".csv")
  assign(filename, read.csv(wd))
}
First, set the working directory.
Find and store all the files ending with .csv.
Bind all of them row-wise.
Following is the code sample:
setwd("C:/yourpath")
temp <- list.files(pattern = "*.csv")
allData <- do.call("rbind",lapply(Sys.glob(temp), read.csv))
This may be helpful if you have one data file per participant, as in psychology/sports/medicine etc.:
setwd("C:/yourpath")
temp <- list.files(pattern = "*.sav")
#Maybe you want to unselect /delete IDs
DEL <- grep('ID(04|08|11|13|19).sav', temp)
temp2 <- temp[-DEL]
#Make a list of that contains all data
read.all <- lapply(temp2, read_sav)
#View(read.all[1])
#Option 1: put one under the next
df <- do.call("rbind", read.all)
Option 2: do something within each dataset (single IDs), e.g. get the mean of certain variables for each participant:
mw_extraktion <- function(data_raw){
  data_raw <- data.frame(data_raw)
  #you may now calculate e.g. the mean for a certain variable for each ID
  ID <- data_raw$ID[1]
  data_OneID <- c(ID, Var2, Var3) #put your new variables (e.g. means) here
  return(data_OneID)
} #end of function
data_combined <- t(data.frame(sapply(read.all, mw_extraktion)))
