I have been wondering if anybody knows a way to create a loop that loads files/databases in R.
Say I have some files like this: data1.csv, data2.csv, ..., data100.csv.
In some programming languages one can write something like data + { x } + .csv and the system reads it as datax.csv, which can then be used inside a loop.
Any ideas?
Sys.glob() is another possibility - its sole purpose is globbing, or wildcard expansion.
dataFiles <- lapply(Sys.glob("data*.csv"), read.csv)
That will read all the files of the form data[x].csv into the list dataFiles, where [x] is nothing or anything.
[Note this is a different pattern to that in @Joshua's answer. There, list.files() takes a regular expression, whereas Sys.glob() just uses standard wildcards. Which wildcards can be used is system dependent; details can be found on the help page ?Sys.glob.]
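For illustration, here is a minimal sketch contrasting the two styles (assuming the data1.csv ... data100.csv names from the question):
# wildcard (glob) style: * matches any run of characters
glob_files <- Sys.glob("data*.csv")
# regular-expression style: the rough equivalent for list.files()
regex_files <- list.files(pattern = "^data.*\\.csv$")
identical(sort(glob_files), sort(regex_files)) # TRUE if the same files match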
See ?list.files.
myFiles <- list.files(pattern="^data.*\\.csv$")
Then you can loop over myFiles.
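For example, a minimal sketch of such a loop (assuming the files sit in the working directory):
for (f in myFiles) {
  df <- read.csv(f)
  # ... process df here; printing its dimensions as a stand-in
  print(dim(df))
}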
I would put all the CSV files in a directory, create a list and do a loop to read all the csv files from the directory in the list.
setwd("~/Documents/")
ldf <- list() # initialise an empty list
listcsv <- dir(pattern = "\\.csv$") # character vector of all the csv file names in the directory
for (k in seq_along(listcsv)){
  ldf[[k]] <- read.csv(listcsv[k])
}
str(ldf[[1]])
Read all the csv files (each with the same headers) and bind them row-wise into one merged data frame:
library(dplyr)
library(readr)
list_file <- list.files(pattern = "\\.csv$") %>%
  lapply(read.csv, stringsAsFactors = FALSE) %>%
  bind_rows()
fi <- list.files(directory_path, full.names = TRUE)
dat <- lapply(fi, read.csv)
dat will contain the datasets in a list.
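As a small follow-up sketch, you may want to name the list elements after the files, so datasets can be looked up by name (basename() strips the directory part):
names(dat) <- basename(fi)
dat[["data1.csv"]] # access one dataset by file name (assuming a file called data1.csv exists)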
Let's assume that your files have the file format that you mentioned in your question and that they are located in the working directory.
You can vectorise creation of the file names if they have a simple naming structure, then apply a loading function to all the files (here I used the purrr package, but you could also use lapply):
library(purrr)
1:100 %>% paste0("data", ., ".csv") %>% map(read.csv)
Here's another solution using a for loop. I like it better than the others because of its flexibility and because all data frames are stored directly in the global environment.
Assuming you've already set your working directory, the loop will read each file in turn and store it in the global environment under the names "data1", "data2", and so on.
ids <- 1:100 # avoid naming a variable "list", which masks the base function
for (i in ids) {
  filename <- paste0("data", i)
  filepath <- paste0("data", i, ".csv")
  assign(filename, read.csv(filepath))
}
First, set the working directory.
Find and store all the files ending with .csv.
Bind all of them row-wise.
Following is the code sample:
setwd("C:/yourpath")
temp <- list.files(pattern = "\\.csv$")
allData <- do.call(rbind, lapply(temp, read.csv))
This may be helpful if you have datasets for participants, as in psychology/sports/medicine etc.
library(haven) # read_sav() comes from the haven package
setwd("C:/yourpath")
temp <- list.files(pattern = "\\.sav$")
#Maybe you want to unselect/delete some IDs
DEL <- grep('ID(04|08|11|13|19)\\.sav', temp)
temp2 <- temp[-DEL]
#Make a list that contains all the data
read.all <- lapply(temp2, read_sav)
#View(read.all[[1]])
#Option 1: put one under the next
df <- do.call("rbind", read.all)
Option 2: do something within each dataset (single IDs), e.g. get the mean of certain parts for each participant:
mw_extraktion <- function(data_raw){
  data_raw <- data.frame(data_raw)
  #you may now calculate e.g. the mean of a certain variable for each ID
  ID <- data_raw$ID[1]
  data_OneID <- c(ID, Var2, Var3) #put your new variables (e.g. means) here; Var2 and Var3 are placeholders
} #end of function
data_combined <- t(data.frame(sapply(read.all, mw_extraktion)))
Related
Suppose we have files file1.csv, file2.csv, ... , and file100.csv in directory C:\R\Data and we want to read them all into separate data frames (e.g. file1, file2, ... , and file100).
The reason for this is that, despite having similar names, the files have different structures, so it is not that useful to have them in a list.
I could use lapply, but that returns a single list containing 100 data frames. Instead I want these data frames in the global environment.
How do I read multiple files directly into the global environment? Or, alternatively, how do I unpack the contents of a list of data frames into it?
Thank you all for replying.
For completeness, here is my final answer for loading any number of (tab) delimited files, in this case with 6 columns of data each, where column 1 is character, 2 is factor, and the remainder numeric:
##Read files named xyz1111.csv, xyz2222.csv, etc.
filenames <- list.files(path="../Data/original_data",
                        pattern="^xyz.*\\.csv$")
##Create a vector of data frame names without the ".csv" part
##(called dfnames rather than names, which would mask the base function)
dfnames <- substr(filenames, 1, 7)
###Load all files
for(i in dfnames){
  filepath <- file.path("../Data/original_data", paste0(i, ".csv"))
  assign(i, read.delim(filepath,
                       colClasses=c("character","factor",rep("numeric",4)),
                       sep = "\t"))
}
Quick draft, untested:
Use list.files(), aka dir(), to dynamically generate your list of files.
This returns a vector; just run along the vector in a for loop.
Read the i-th file, then use assign() to place the contents into a new variable named file_i.
That should do the trick for you.
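A minimal sketch of that draft (the \.csv$ pattern and the file_i naming scheme are illustrative assumptions):
files <- list.files(pattern = "\\.csv$") # dynamically generate the list of files
for (i in seq_along(files)) {
  # read the i-th file and place its contents into a new variable file_i
  assign(paste0("file_", i), read.csv(files[i]))
}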
Use assign with a character variable containing the desired name of your data frame.
for(i in 1:100) {
  oname <- paste0("file", i)
  assign(oname, read.csv(paste0(oname, ".csv")))
}
This answer is intended as a more useful complement to Hadley's answer.
While the OP specifically wanted each file read into their R workspace as a separate object, many other people naively landing on this question may think that that's what they want to do, when in fact they'd be better off reading the files into a single list of data frames.
So for the record, here's how you might do that.
#If the path is different than your working directory
# you'll need to set full.names = TRUE to get the full
# paths.
my_files <- list.files("path/to/files")
#Further arguments to read.csv can be passed after it
all_csv <- lapply(my_files, read.csv)
#Set the name of each list element to its
# respective file name. Note full.names = FALSE to
# get only the file names, not the full path.
names(all_csv) <- gsub(".csv","",
list.files("path/to/files",full.names = FALSE),
fixed = TRUE)
Now any of the files can be referred to by all_csv[["filename"]], which really isn't much worse than just having separate filename variables in your workspace, and often it is much more convenient.
Here is a way to unpack a list of data.frames using just lapply
filenames <- list.files(path="../Data/original_data",
                        pattern="^xyz.*\\.csv$")
filelist <- lapply(filenames, read.csv)
#if necessary, assign names to data.frames
names(filelist) <- c("one","two","three")
#note the invisible function keeps lapply from spitting out the data.frames to the console
invisible(lapply(names(filelist), function(x) assign(x,filelist[[x]],envir=.GlobalEnv)))
Reading all the CSV files from a folder and creating data frames named after the files:
setwd("your path to folder where CSVs are")
filenames <- gsub("\\.csv$","", list.files(pattern="\\.csv$"))
for(i in filenames){
assign(i, read.csv(paste(i, ".csv", sep="")))
}
A simple way to access the elements of a list from the global environment is to attach() the list. Note that this actually creates a new environment on the search path and copies the elements of your list into it, so you may want to remove the original list after attaching, to prevent having two potentially different copies floating around.
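A small sketch of that approach, using the all_csv list built above:
attach(all_csv) # copies the list's data frames into a new environment on the search path
rm(all_csv)     # optional: remove the original list so only one copy remains
# each data frame can now be referred to directly by its (file-derived) name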
I want to update the answer given by Joran:
#If the path is different than your working directory
# you'll need to set full.names = TRUE to get the full
# paths.
my_files <- list.files(path="set your directory here", full.names=TRUE)
#full.names=TRUE is important to be added here
#Further arguments to read.csv can be passed in ...
all_csv <- lapply(my_files, read.csv)
#Set the name of each list element to its
# respective file name. Note full.names = FALSE to
# get only the file names, not the full path.
names(all_csv) <- gsub(".csv","",list.files("copy and paste your directory here",full.names = FALSE),fixed = TRUE)
#Now you can create a dataset based on each filename
df <- as.data.frame(all_csv$nameofyourfilename)
A simplified version, assuming your csv files are in the working directory:
listcsv <- list.files(pattern = "\\.csv$") # vector of csv file names
fnames <- substr(listcsv, 1, nchar(listcsv) - 4) # file names without the ".csv" part
#cycle through the names and assign each relevant dataframe using read.csv
for (k in seq_along(listcsv)){
  assign(fnames[k], read.csv(listcsv[k]))
}
#copy all the files you want to read in R into your working directory
a <- dir(pattern = "\\.csv$")
#remove the ".csv" from each file name (a single gsub call, no loop needed)
list1 <- gsub("\\.csv$", "", a)
#Final step: read each file and assign it to an object named after it
for(i in list1){
  assign(i, read.csv(paste0(i, ".csv")))
}
Use list.files and map_dfr to read many csv files
df <- list.files(data_folder, full.names = TRUE) %>%
map_dfr(read_csv)
Reproducible example
First write sample csv files to a temporary directory.
It's more complicated than I thought it would be.
library(dplyr)
library(purrr)
library(purrrlyr)
library(readr)
data_folder <- file.path(tempdir(), "iris")
dir.create(data_folder)
iris %>%
# Keep the Species column in the output
# Create a new column that will be used as the grouping variable
mutate(species_group = Species) %>%
group_by(species_group) %>%
nest() %>%
by_row(~write.csv(.$data,
file = file.path(data_folder, paste0(.$species_group, ".csv")),
row.names = FALSE))
Read these csv files into one data frame.
Note the Species column has to be present in the csv files, otherwise we would lose that information.
iris_csv <- list.files(data_folder, full.names = TRUE) %>%
map_dfr(read_csv)
I've used a lot of posts to get me this far (such as here: R list files with multiple conditions, and here: How can I read multiple files from multiple directories into R for processing?), but I can't accomplish what I need in R.
I have many .csv files distributed in multiple subdirectories that I want to read in and then save as separate objects named after the corresponding basename. The end result will be to rbind each of those files together. Here's a sample dir structure and some of what I've tried:
./DATA/Cat_Animal/animal1.csv
./DATA/Dog_Animal/animal2.csv
./DATA/Dog_Animal/animal3.csv
./DATA/Dog_Animal/animal3.1.csv
#read in all csv files
files <- list.files(path="./DATA", pattern="*.csv", full.names=TRUE, recursive=TRUE)
But this results in all files in all subdirectories. I want to match specific files (animalsX.csv) in specific subdirectories matching the pattern (X_Animal) such as this:
files <- dir(path=paste0("./DATA/", pattern="*+_Animal"), recursive=TRUE, full.names=TRUE, pattern="animal+.*csv")
Once I get my list of files, I want to read each of them in and save each to the corresponding file's basename. So the file named animal1.csv
would be saved to animal1. I think I need to use the function basename() somewhere in a loop, but I'm not sure how.
Help is very much appreciated; I've spent a lot of time trying out various options with little progress.
This question is really two questions; consider splitting them up. On the last part of your question, how to rbind a list full of data.frames together, try:
finalDf <- do.call(rbind, result)
You'll likely need to use str_split() from the stringr package to extract the parts of the file path you need. You could also use str_extract() with regular expressions.
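For instance, a minimal sketch of pulling the pieces out of one path (the example path comes from the directory structure in the question):
library(stringr)
f <- "./DATA/Dog_Animal/animal2.csv"
str_split(f, "/")[[1]] # "." "DATA" "Dog_Animal" "animal2.csv"
basename(f)            # "animal2.csv" -- handy for naming the objects
str_extract(f, "animal[0-9]+") # "animal2"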
I think I found a work-around for the short term because luckily I only have a few subdirectories currently.
myFiles1 <- list.files(path = "./DATA/Cat_Animal/", pattern="^animal.*\\.csv$")
processFile <- function(f) {
  df <- read.csv(file = paste0("./DATA/Cat_Animal/", f))
}
result1 <- lapply(myFiles1, processFile)
#then do it again for the next subdir:
myFiles2 <- list.files(path = "./DATA/Dog_Animal/", pattern="^animal.*\\.csv$")
processFile <- function(f) {
  df <- read.csv(file = paste0("./DATA/Dog_Animal/", f))
}
result2 <- lapply(myFiles2, processFile)
finalDf <- do.call(rbind, c(result1, result2))
I know there is a better way, but I can't figure out the pattern matching for the subdirectories! It's so easy in unix, for example.
You can simply do it in two steps.
a <- list.files(path="./DATA", pattern="_Animal$", full.names=TRUE, recursive=FALSE)
a
#[1] "./DATA/Cat_Animal" "./DATA/Dog_Animal"
files <- list.files(path=a, pattern="^animal", full.names=TRUE)
files
#[1] "./DATA/Cat_Animal/animal1.txt" "./DATA/Dog_Animal/animal2.txt" "./DATA/Dog_Animal/animal3.txt"
#[4] "./DATA/Dog_Animal/animal4.txt"
In the first step, please make sure to use full.names = TRUE and recursive = FALSE. You need full.names = TRUE to get the file path, not just the file name, otherwise you lose the path to animal*.csv in the second step. And recursive = TRUE would return nothing, since Dog_Animal and Cat_Animal are folders, not files.
I have read several questions related to this but none is what I am looking for.
The best one is by far using the readxl package:
library(readxl)
file.list <- list.files(pattern = '\\.xlsx$')
df.list <- lapply(file.list, read_excel)
but as it is explained, it gives a list.
What I want is to get each file by its name in the working directory.
What I am doing is to setwd() into the directory that has all the xlsx files, then load them one by one based on their name, for example:
mydf1 <- read_excel("mydf1.xlsx")
mydfb <- read_excel("mydfb.xlsx")
datac <- read_excel("datac.xlsx")
Is there any other way to get them without repeating the name over and over again?
You can use assign with for loop:
library(readxl)
file.list <- list.files(pattern = "\\.xlsx$")
for(i in file.list) {
assign(sub("\\.xlsx$", "", i), read_excel(i))
}
PS: you need sub to remove the file extension (otherwise you would get an object named mydf1.xlsx instead of mydf1).
This is a perfect use case for the purrr package:
library(readxl)
library(tidyverse) #loads purrr
#for each excel file name, read the excel sheet and append to df
file.names <- list.files(pattern = "\\.xlsx$") # assuming the files sit in the working directory
df.excel <- file.names %>% map_df( ~ read_excel(path = .x))
You could use something like this in your loop:
lapply(file.list, function(x){
  df <- read_excel(x)
  y <- gsub("\\.xlsx$", "", x)
  assign(y, df, envir = globalenv())
})
You only think that you want each one loading into the global environment. As you become more experienced with R you will find that in most (if not all) cases it is better to keep related objects like this together in a list.
If they are all in a list then you can use lapply or sapply to run the same command on every element instead of trying to create a new loop where you get each object and process it.
The list approach is less likely to overwrite other objects that you may want to keep, or to cause other action-at-a-distance bugs (which can be very hard to track down).
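For instance, a small sketch of running the same command over every element of the df.list built above:
row.counts <- sapply(df.list, nrow) # one call covers every file
df.list.clean <- lapply(df.list, na.omit) # transformed copies stay together in a new list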
Building on the purrr approach by @SethRaithel, this provides a column with the file names.
library(tidyverse)
library(readxl)
# create a list of files matching a regular expression
# in a defined path
file_list <- fs::dir_ls(temp_path, regexp="\\.xlsx?$")
data_new <- file_list %>%
# convert to a tibble
as_tibble() %>%
# create column with just file name for reference
mutate(file = fs::path_file(value)) %>%
# uses map to read each file into a list-column of data frames
mutate(data = purrr::map(value, .f=readxl::read_excel)) %>%
unnest(cols=data) %>%
janitor::clean_names() %>%
select(-value)
I have multiple text files (around 60) that I merge into a single file. I am looking for a way of adding only the first 4 digits of the file name in a variable for each file. An example of a file name is 1111_2222_3333.txt.
So basically I need an additional variable that includes the first 4 digits of the file name per file.
I did find the following related topics, but they do not allow me to include only the first 4 digits:
How can I turn the filename into a variable when reading multiple csvs into R
R: Read multiple files and label them based on the file name
My code, which does not include the file name yet, is currently:
files <- list.files("pathname", pattern="\\.TXT$")
masterfilesales <- do.call(rbind, lapply(files, read.table))
Update: Although the initial answer is correct, the same goal can be achieved in fewer steps by using sapply with simplify=FALSE instead of lapply, because sapply automatically assigns the file names to the elements of the list:
library(data.table)
files <- list.files("pathname", pattern="*.TXT")
file.list <- sapply(files, read.table, simplify=FALSE)
masterfilesales <- rbindlist(file.list, idcol="id")[, id := substr(id,1,4)]
Old answer: To achieve what you want, you can utilize a combination of the setattr function and the idcol parameter of the rbindlist function from the data.table package as follows:
library(data.table)
files <- list.files("pathname", pattern="*.TXT")
file.list <- lapply(files, read.table)
setattr(file.list, "names", files)
masterfilesales <- rbindlist(file.list, idcol="id")[, id := substr(id,1,4)]
Alternatively, you can set the filenames in base R with:
attr(file.list, "names") <- files
or:
names(file.list) <- files
and bind them together with bind_rows from the dplyr package (which also has an .id parameter to create an id-column):
masterfilesales <- bind_rows(file.list, .id="id") %>% mutate(id = substr(id,1,4))
Are you looking for something like this?
c("1111_444.txt", "443343iqueh.txt") -> a
substring(a, first=1, last=4)
I'm a student from Germany. I want to create a summary (0.25 & 0.75 quantile, mean, min, max) and different plots for special columns (e.g. Inflow or Low).
The issue is that there is not only one .csv file; there are about 3200 files in that folder, with different names (ISIN numbers of portfolios, all starting with DE000LS9xxx).
After looking through different platforms and this forum, I tried different possibilities. My last try was to name every file 001.csv, 002.csv, etc. and use an answer from this forum:
directory <- setwd("~/Desktop/Uni/paper/testdata/")
Inflowmean <- function(directory, Inflow, id = 1:3) {
filenames <- sprintf("%03d.csv", id)
filenames <- paste(directory, filenames, sep=";", dec=",")
ldf <- lapply(filenames, read.csv)
df=ldply(ldf)
summary(df[, Inflow], na.rm = TRUE)
}
I really hope that you can help me, because I'm new and just started to learn commands in RStudio. It seems that I'm not able to handle it; I also tried different tutorials and the help function in the program...
Thank you so much!
It is rather unclear what your question actually is, but there are a number of problems with your code:
directory <- setwd("~/Desktop/Uni/paper/testdata/"): See ?setwd - it returns the current directory before changing the working directory, not ~/Desktop/Uni/paper/testdata/. You probably want
directory <- "~/Desktop/Uni/paper/testdata/"
setwd(directory)
filenames <- paste(directory, filenames, sep=";", dec=",") -- this will create file names like "~/Desktop/Uni/paper/testdata/;001.csv;,". You probably want the separator to be / or .Platform$file.sep. I don't know why you have dec=",", but paste will just paste it onto the end. Try pasting a few things together to see what gives you file names that make sense for your data.
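For instance, a small sketch of building sensible paths with file.path (the 001.csv naming comes from the question):
directory <- "~/Desktop/Uni/paper/testdata"
filenames <- sprintf("%03d.csv", 1:3)
file.path(directory, filenames)
#[1] "~/Desktop/Uni/paper/testdata/001.csv" "~/Desktop/Uni/paper/testdata/002.csv"
#[3] "~/Desktop/Uni/paper/testdata/003.csv"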
Your ldply syntax is wrong: you probably want
ldply(ldf, function (x) summary(x[, Inflow], na.rm=T))
See ?ldply for more information. Also, to use ldply, you need library(plyr) somewhere. If you just want base R, you could try
do.call(rbind, lapply(ldf, function (x) summary(x[, Inflow], na.rm=T)))
where lapply applies your function (summary(x[, Inflow], na.rm=T)) to each of your data frames in ldf, and do.call(rbind, ...) joins all the summaries together into a single data frame.
From
Using R to list all files with a specified extension
and
Opening all files in a folder, and applying a function
filenames <- list.files("~/Desktop/Uni/paper/testdata", pattern="*.csv", full.names=TRUE)
ldf <- lapply(filenames, read.csv)
res <- lapply(ldf, summary)