Read data from multiple .csv files in R (3.4.1)

I'm trying to build up a function that can import/read several data tables in .csv files, and then compute statistics on the selected files.
Each of the 332 .csv files contains a table with the same column names: Date, Pollutant and id. There are a lot of missing values.
This is the function I wrote so far, to compute the mean of values for a pollutant:
pollutantmean <- function(directory, pollutant, id = 1:332) {
  library(dplyr)
  setwd(directory)
  good <- c()
  for (i in id) {
    task1 <- read.csv(sprintf("%03d.csv", i))
  }
  p <- select(task1, pollutant)
  good <- c(good, complete.cases(p))
  mean(p[good, ])
}
The problem I have is that each time the loop runs, a new file is read and the data already read are replaced by the data from the new file.
So I end up with a function that works perfectly fine with a single file, but not when I want to select multiple files.
E.g. if I ask for id = 10:20, I end up with the mean calculated only on file 20.
How could I change the code so that I can select multiple files?
Thank you!

My answer offers a way of doing what you want to do (if I understood everything correctly) without using a loop. My two assumptions are: (1) you have 332 *.csv files with the same header (column names), so all files have the same structure, and (2) you can combine your tables into one big data frame.
If these two assumptions are correct, I would use a list of your files to import them as data frames - so this answer needs no explicit loop.
# rbindlist() comes from the data.table package
library(data.table)

# This creates a list with the names of your files. You have to provide the path to this folder.
file_list <- list.files(path = [your path where your *.csv files are saved in], full.names = TRUE)

# This will create a list of data frames.
mylist <- lapply(file_list, read.csv)

# This will 'row-bind' the data frames of the list into one big data frame.
mydata <- rbindlist(mylist)

# Now you can perform your calculation on this big data frame, using your column
# information to filter or subset it (if necessary).
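For instance, using the column names from the question (Date, Pollutant, id), the mean of a pollutant over all files, or over a subset of ids, could then be computed along these lines (a sketch with a tiny stand-in data frame in place of the real combined table):

```r
# Tiny stand-in for the combined 'mydata' built above
mydata <- data.frame(id = c(10, 15, 25), Pollutant = c(1, NA, 3))

# Mean of the Pollutant column, ignoring missing values
mean(mydata$Pollutant, na.rm = TRUE) # 2

# Restricted to a subset of ids, e.g. 10:20
mean(mydata$Pollutant[mydata$id %in% 10:20], na.rm = TRUE) # 1
```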
I hope this helps.

Maybe something like this?
library(dplyr)
pollutantmean <- function(directory, pollutant, id = 1:332) {
  od <- setwd(directory)
  on.exit(setwd(od)) # restore the working directory when the function exits
  task_list <- lapply(sprintf("%03d.csv", id), read.csv)
  task_all <- bind_rows(task_list) # pool all selected files into one data frame
  p <- pull(task_all, pollutant)
  mean(p, na.rm = TRUE) # mean over the non-missing values of the pooled column
}
Notes:
- Put all your library() calls at the beginning of your scripts; they will be much easier to read. Never put them inside a function.
- Setting a working directory inside a function is also a bad idea: when the function returns, the change is still in effect and you might get lost. The better way is to set working directories outside functions, but since you've set it inside the function, I've adapted the code accordingly (on.exit() restores it).
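As a sketch of that setwd-free alternative (base R only; the function and argument names mirror the question, and returning the pooled mean is an assumption about what pollutantmean should compute):

```r
pollutantmean <- function(directory, pollutant, id = 1:332) {
  # Build full paths with file.path() instead of changing the working directory
  paths <- file.path(directory, sprintf("%03d.csv", id))
  # Read and pool all selected files, then average the requested column
  task_all <- do.call(rbind, lapply(paths, read.csv))
  mean(task_all[[pollutant]], na.rm = TRUE)
}
```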

Related

Renaming files in folder with R

I have many files in my working directory with the same name followed by a number, such as "name_#.csv". They each contain the same formatted time series data.
The file names are very long, so when I import them the data frame names are super long too. I'd like to rename each as "df_#" so that I can then create another function to plot each individually, or quickly plot, for example, the first three without typing in this mega-long name.
I don't want to concatenate anything to the current names, but to rename each one completely, ending with the number in the list as it iterates through the files.
Here is an example I have so far.
name <- list.files(pattern = "*.csv")
for (i in 1:length(name)) assign(paste0("df", name[i]), read.csv(name[i], skip = 15))
This is just adding a 'df' to the front and not changing the whole name.
I'm also not sure if it makes sense to proceed this way. Essentially my data is three replicates of time series data on the same sample and I eventually want to take three at a time and plot them on the same graph and so forth until the end of the files.
You can name the file in your global environment without renaming the original file in the folder by just telling R you want to assign it that name in the loop with a few modifications to your original code. For instance:
# Define file path to desired folder
file_path <- "Desktop/SO Example/" # example file path, though it could be the working directory
# Your code to get CSV file names in the folder
name <- list.files(path = file_path, pattern="*.csv")
# Modify the loop to assign a new name
for (x in seq_along(name)) {
  assign(paste0("df_", x),
         read.csv(paste0(file_path, name[x]), skip = 15))
}
This will load the data as df_1, df_2, etc. I believe in your assign() call you were using paste0("df", name[i]), which concatenates "df" with the filename in position i rather than the value of i itself - that is why "df" was being prepended to each filename on import.
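If the individual objects don't strictly need to live in the global environment, a named list is often easier to work with than assign(), since you can lapply() over it for plotting. A sketch reusing the example file_path from above:

```r
file_path <- "Desktop/SO Example/" # same example folder as above

# Read every CSV into one named list instead of separate df_1, df_2, ... objects
name <- list.files(path = file_path, pattern = "*.csv")
dfs <- lapply(file.path(file_path, name), read.csv, skip = 15)
names(dfs) <- paste0("df_", seq_along(dfs))

# Access tables as dfs$df_1 or dfs[["df_2"]], or plot the first three:
# lapply(dfs[1:3], plot)
```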

To stack up results in one masterfile in R

Using this script I have created a specific folder for each csv file and then saved all my further analysis results in this folder. The name of the folder and csv file are same. The csv files are stored in the main/master directory.
Now, I have created a csv file in each of these folders which contains a list of all the fitted values.
I would now like to do the following:
1. Set the working directory to the particular filename
2. Read the fitted-values file
3. Add a row/column stating the name of the site / unique ID
4. Add it to the master file stored in the main directory, with a title specifying the site name/filename (stacked by rows or by columns, it doesn't really matter)
5. Return to the main directory to pick the next file
6. Repeat the loop
Using merge(), rbind() or cbind() combines all the data under one set of column names. I want to keep all the sites separate for comparison at a later stage.
This is what I'm using at the moment and I'm lost on how to proceed further.
setwd("path") # main directory
path <- "path" # need this for convenience while switching back to the main directory

# import all file names and create a character vector
files <- list.files(path = path, pattern = "*.csv")

for (i in seq_along(files)) {
  fileName <- read.csv(files[i])
  base <- strsplit(files[i], ".csv")[[1]] # getting the filename without extension
  setwd(file.path(path, base)) # setting the working directory to the folder of the same name
  master <- read.csv(paste(base, "_fiited_values curve.csv"))
  # read the fitted-value csv file for the site and store it in a list
}
I want to construct a for loop to make one master file with the files in different directories. I do not want to merge all under one column name.
For example, If I have 50 similar csv files and each had two columns of data, I would like to have one csv file which accommodates all of it; but in its original format rather than appending to the existing row/column. So then I will have 100 columns of data.
Please tell me what further information can I provide?
For reading a group of files from a number of different directories, with pathnames patha, pathb, pathc:
paths <- c('patha', 'pathb', 'pathc')
files <- unlist(sapply(paths, function(path)
  list.files(path, pattern = "*.csv", full.names = TRUE)))
listContainingAllFiles <- lapply(files, read.csv)
If you want to be really quick about it, you can grab fread from data.table:
library(data.table)
listContainingAllFiles = lapply(files, fread)
Either way this will give you a list of all objects, kept separate. If you want to join them together vertically/horizontally, then:
do.call(rbind, listContainingAllFiles) # stack vertically
do.call(cbind, listContainingAllFiles) # bind side by side
EDIT: note that the latter makes no sense unless your rows actually correspond across files. It makes far more sense to just create a field tracking which location the data is from.
If you want to use the names of the files as the method of determining sample location (I don't see where you're getting this info from in your example), then you want to do this as you read in the files, so:
listContainingAllFiles <- lapply(files, function(file)
  data.frame(filename = file, read.csv(file)))
Then later you can split that column to get your details (assuming, of course, you have a standard naming convention).
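For example, if the filenames follow a pattern like site_01.csv (an invented convention for illustration), the site identifier can be recovered from the filename column afterwards:

```r
# Stand-in for one combined table built as above, with the filename per row
d <- data.frame(filename = c("patha/site_01.csv", "pathb/site_02.csv"),
                value = c(1.5, 2.5))

# Strip the directory and the extension to recover the site identifier
d$site <- sub("\\.csv$", "", basename(d$filename))
d$site # "site_01" "site_02"
```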

I'd like to delete the top (header) row from a list of multiple tables

I read in multiple .csv files from a directory using list.files(path, pattern=".csv"), then lapply(data, read.csv), which gives me a list of tables in R. The headers come in as if they were part of the data, and I'd like to delete the first row from each table in the list so I can set my own headers.
I was able to do this when I read in one file at a time, using data[-1,], but now it's not working on the list of tables.
Do I have to turn them into a data frame first?
If so I'm not sure how to go about this in a data frame..?
Thx in advance
Suppose you want to change the header of the files in your directory, then attach them together:
library(data.table) # provides fread(), setnames() and rbindlist()

myfun <- function(x) {
  dataset <- fread(x, header = TRUE, sep = ",")
  setnames(dataset, c("Name1", "Name2"))
  return(dataset)
}
data <- rbindlist(lapply(list.files(), myfun))
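If data.table isn't available, the same idea works in base R: skipping the first line and reading with header = FALSE drops the old header in one step (a sketch with the same placeholder column names, writing a small demo file so it is self-contained):

```r
# Demo file standing in for one of your CSVs
writeLines(c("old1,old2", "1,3", "2,4"), "demo.csv")

myfun_base <- function(x) {
  dataset <- read.csv(x, header = FALSE, skip = 1) # skip = 1 drops the old header
  names(dataset) <- c("Name1", "Name2")            # your own headers
  dataset
}

myfun_base("demo.csv") # a 2x2 data frame with columns Name1, Name2
```

As in the data.table version, the tables can then be stacked with do.call(rbind, lapply(list.files(pattern = "*.csv"), myfun_base)).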

Read, process and export analysis results from multiple .csv files in R

I have a bunch of CSV files and I would like to perform the same analysis (in R) on the data within each file. Firstly, I assume each file must be read into R (as opposed to running a function on the CSV and providing output, like a sed script).
What is the best way to input numerous CSV files to R, in order to perform the analysis and then output separate results for each input?
Thanks (btw I'm a complete R newbie)
You could go for Sean's option, but it's going to lead to several problems:
You'll end up with a lot of unrelated objects in the environment, with the same name as the file they belong to. This is a problem because...
For loops can be pretty slow, and because you've got this big pile of unrelated objects, you're going to have to rely on for loops over the filenames for each subsequent piece of analysis - otherwise, how the heck are you going to remember what the objects are named so that you can call them?
Calling objects by pasting their names in as strings - which you'll have to do, because, again, your only record of what the object is called is in this list of strings - is a real pain. Have you ever tried to call an object when you can't write its name in the code? I have, and it's horrifying.
A better way of doing it might be with lapply().
# List files
filelist <- list.files(pattern = "*.csv")

# Now we use lapply to perform a set of operations
# on each entry in the list of filenames.
to_dispose_of <- lapply(filelist, function(x) {
  # Read in the file specified by 'x' - an entry in filelist
  data.df <- read.csv(x, skip = 1, header = TRUE)

  # Store the filename, minus .csv. This will be important later.
  filename <- substr(x = x, start = 1, stop = nchar(x) - 4)

  # Your analysis work goes here. You only have to write it out once
  # to perform it on each individual file.
  ...

  # Eventually you'll end up with a data frame or a vector of analysis
  # to write out. Great! Since you've kept the filename around,
  # you can do that trivially.
  write.table(x = data_to_output,
              file = paste0(filename, "_analysis.csv"),
              sep = ",")
})
And done.
You can try the following code after putting all csv files in the same directory.
names <- list.files(pattern = "*.csv") # csv file names
for (i in 1:length(names)) {
  assign(names[i], read.csv(names[i], skip = 1, header = TRUE))
}
Hope this helps!

Assigning unknown variable to new variable name

I have to load in many files and transform their data. Each file contains only one data.table, but the tables have various names.
I would like to run a single script over all of the files - to do so, I must assign the unknown data.table to a common name ... say blob.
What is the R way of doing this? At present, my best guess (which seems like a hack, but works) is to load the data.table into a new environment, and then: assign('blob', get(objects(envir = newEnv)[1], envir = newEnv)).
In a reproducible context this is:
newEnv <- new.env()
assign('a', 1:10, envir = newEnv)
assign('blob', get(objects(envir = newEnv)[1], envir = newEnv))
Is there a better way?
The R way is to create a single object, i.e. a single list of data tables.
Here is some pseudocode that contains three steps:
Use list.files() to create a list of all files in a folder.
Use lapply() and read.csv() to read your files and create a list of data frames. Replace read.csv() with read.table() or whatever is appropriate for your data.
Use lapply() again, this time with as.data.table() to convert the data frames to data tables.
The pseudocode:
library(data.table) # for as.data.table()

filenames <- list.files("path/to/files", full.names = TRUE)
dat <- lapply(filenames, read.csv)
dat <- lapply(dat, as.data.table)
Your result should be a single list, called dat, containing a data table for each of your original files.
I assume that you saved the data.tables using save(), something like this:
d1 <- data.table(value=1:10)
save(d1, file="data1.rdata")
and your problem is that when you load the file you don't know the name (here: d1) that you used when saving the file. Correct?
I suggest you use instead saveRDS() and readRDS() for saving/loading single objects:
d1 <- data.table(value=1:10)
saveRDS(d1, file="data1.rds")
blob <- readRDS("data1.rds")
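Because readRDS() returns the object directly, this also combines nicely with the list-based pattern from the other answers: a folder of .rds files can be pulled into one named list with no name guessing (a sketch; run in the directory holding the .rds files):

```r
files <- list.files(pattern = "\\.rds$")
blobs <- lapply(files, readRDS)           # one entry per file
names(blobs) <- sub("\\.rds$", "", files) # e.g. blobs$data1
```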
