Can't name loaded files - R - r

So I have a folder with bunch of csv, I set the wd to that folder and extracted the files names:
data_dir <- "~/Desktop/All Waves Data/csv"
setwd(data_dir)
vecFiles <- list.files(data_dir)
all good, now the problem comes when I try to load all of the files using a loop on vecFiles:
for(fl in vecFiles) {
fl <- read.csv(vecFiles[i], header = T, fill = T)
}
The loop treats 'fl' as a plain string when it comes to the naming, resulting only saving the last file under 'fl' (by overwriting the previous one at each time).
I was trying to figure out why this happens but failed.
Any explanation?
Edit: Trying to achieve the following: assume you have a folder with data1.csv, data2.csv ... datan.csv, I want to load them into separate data frames named data1, data2 ..... datan

You want to read in all csv file from your working directory and have the locations of those files saved in vecFiles.
Why your attempt doesn't work
What you are currently doing doesn't work, because you are overwriting the object fn with the newly loaded csv file in every iteration. After all iterations have been run through, you are left with only the last overwritten fn object.
Another example to clarify why fn only contains the value of the last csv-file: If you declare fn <- "abc" in line1, and in line2 you say fn <- "def" (i.e. you overwrite fn from line1) you will obviously have the value "def" saved in fn after line2, right?
fn <- "abc"
fn <- "def"
fn
#[1] "def"
Solutions
There are two prominent ways to solve this: 1) stick with a slightly altered for-loop. 2) Use sapply().
1) The altered for loop: Create an empty list called fn, and assign the loaded csv files to the i-th element of that list in every iteration:
fn <- list()
for(i in seq_along(vecFiles)){
fn[[i]] <- read.csv(vecFiles[i], header=T, fill=T)
}
names(fn) <- vecFiles
2) Use sapply(): sapply() is a function that R-users like to use instead of for-loops.
fn <- sapply(vecFiles, read.csv, header=T, fill=T)
names(fn) <- vecFiles
Note that you can also use lapply() instead of sapply(). The only difference is that lapply() gives you a list as output

You're not declaring anything new when you load the file. Each time you load, it loads into fl, because of that you would only see the last file in vecFiles.
Couple of potential solutions.
First lapply:
fl <- lapply(vecFiles, function(x) read.csv(x, header=T, fill=t) )
names(fl) <- vecFiles
This will create a list of elements within fl.
Second 'rbind':
Under the assumption your data has all the same columns:
fl <- read.csv(vecFiles[1], header=T, fill=t)
for(i in vecFiles[2:length(vecFiles)]){
fl <- rbind(fl, read.csv(vecFiles[i], header=T, fill=t) )
}
Hopefully that is helpful!

Related

Saving data frames using a for loop with file names corresponding to data frames

I have a few data frames (colors, sets, inventory) and I want to save each of them into a folder that I have set as my wd. I want to do this using a for loop, but I am not sure how to write the file argument such that R understands that it should use the elements of the vector as the file names.
I might write:
DFs <- c("colors", "sets", "inventory")
for (x in 1:length(DFs)){
save(x, file = "x.Rda")
}
The goal would be that the files would save as colors.Rda, sets.Rda, etc. However, the last element to run through the loop simply saves as x.Rda.
In short, perhaps my question is: how do you tell R that I am wanting to use elements being run through a loop within an argument when that argument requires a character string?
For bonus points, I am sure I will encounter the same problem if I want to load a series of files from that folder in the future. Rather than loading each one individually, I'd also like to write a for loop. To load these a few minutes ago, I used the incredibly clunky code:
sets_file <- "~/Documents/ME teaching/R notes/datasets/sets.csv"
sets <- read.csv(sets_file)
inventories_file <- "~/Documents/ME teaching/R notes/datasets/inventories.csv"
inventories <- read.csv(inventories_file)
colors_file <- "~/Documents/ME teaching/R notes/datasets/colors.csv"
colors <- read.csv(colors_file)
For compactness I use lapply instead of a for loop here, but the idea is the same:
lapply(DFs, \(x) save(list=x, file=paste0(x, ".Rda"))))
Note that you need to generate the varying file names by providing x as a variable and not as a character (as part of the file name).
To load those files, you can simply do:
lapply(paste0(DFs, ".Rda"), load, envir = globalenv())
To save you can do this:
DFs <- list(color, sets, inventory)
names(DFs) = c("color", "sets", "inventory")
for (x in 1:length(DFs)){
dx = paste(names(DFs)[[x]], "Rda", sep = ".")
dfx = DFs[[x]]
save(dfx, file = dx)
}
To specify the path just inform in the construction of the dx object as following to read.
To read:
DFs <- c("colors", "sets", "inventory")
# or
DFs = dir("~/Documents/ME teaching/R notes/datasets/")
for(x in 1:length(DFs)){
arq = paste("~/Documents/ME teaching/R notes/datasets/", DFs[x], ".csv", sep = "")
DFs[x] = read.csv(arq)
}
It will read as a list, so you can access using [[]] indexation.

Reading multiple csv files

I need to read multiple csv files and print the first six values for each of these. I tried this code but it is obviously wrong because the value of di is overwritten each iteration of the loop. How can I read multiple files?
library(xlsx)
for(i in 1:7){
di = read.csv(file.choose(),header=T)
print(di)
}
d = list(d1,d2,d3,d4,d5,d6,d7)
lapply(d,head)
If you want to keep you data frames in a list, rather than assigning each to a new object.
Option 1:
fs <- dir(pattern = ".csv")
d <- list()
for (i in seq_along(fs)) {
d[[i]] <- read.csv(fs[[1]])
print(head(d[[i]]))
}
Option 2:
fs <- dir(pattern = ".csv")
d <- lapply(fs, read.csv)
lapply(d, head)
Using option 1, you need to initialize an empty list to populate and assign with double bracket [[ notation. Using option 2, you don't need to initialize an empty list.
I'm slightly confused by if you want to just print the 6 lines or store them and if you want to keep the remainder of the csv files or not. Let's say all you want is to print the 6 lines, then, assuming you know the file names you can do this with
print(read.csv(filename, nlines = 6))
And repeat for each file. Alternatively if you want to save each file then you could do
f1 <- read.csv(filename, nlines = 6)
Repeat for each and use print(head).
Alternatively using your method but fixing the overwrite problem:
library(xlsx)
for(i in 1:7)
assign(paste0("d",i), read.csv(file.choose(),header=T))
lapply(list(d1,d2,d3,d4,d5,d6,d7),head)
The use of assign dynamically assigns names so that each is unique and doesn't overwrite each other, which I think is what you were aiming for. This isn't very 'elegant' though but fits with your chosen method

For loop to read table R

I would like to loop through a vector of directory names and insert the directory name into the read.table function and 'see' the tables outside the loop.
I have a vector of directory names:
dir_names <- c("SRR2537079","SRR2537080","SRR2537081","SRR2537082", "SRR2537083","SRR2537084")
I now want to loop through these and read the tables in each directory.
I have:
list.data<-list()
for(i in dir_names){
#print(i)
list.data[[i]] <- read.table('dir_names[i]/circularRNA_known.txt', header=FALSE, sep="\t",stringsAsFactors=FALSE)
}
but this isn't recognizing dir_names[i]. Do I have to use paste somehow??
You are right, you need to paste the value. i will also be the list element not a number, so you don't need to call it as dir_names[i] just i
list.data<-list()
for(i in dir_names){
#print(i)
list.data[[i]] <- read.table(paste0(i,'/circularRNA_known.txt'), header=FALSE, sep="\t",stringsAsFactors=FALSE)
}
Can I also suggest that (just for your info, if you wanted a more elegant solution) that you could use plyr's llply instead of a loop. It means it can all happen within one line, and could easily change the output to combine all files into a data.frame (using ldply) if they are in consistent formats
list.data.2 <- llply(dir_names, function(x) read.table(paste0(x,"/circularRNA_known.txt"), header=FALSE, sep="\t",stringsAsFactors=FALSE))
dir_names[i] should be used as a variable.
list.data<-list()
for(i in (1:length(dir_names))){
#print(i)
list.data[[i]] <- read.table(paste0(dir_names[i], '/circularRNA_known.txt'), header=FALSE, sep="\t",stringsAsFactors=FALSE)
}

R: save each loop result into one data frame

I have written a loop in R (still learning). My purpose is to pick the max AvgConc and max Roll_TotDep from each looping file, and then have two data frames that each contains all the max numbers picked from individual files. The code I wrote only save the last iteration results (for only one single file)... Can someone point me a right direction to revise my code, so I can append the result of each new iteration with previous ones? Thanks!
data.folder <- "D:\\20150804"
files <- list.files(path=data.folder)
for (i in 1:length(files)) {
sub <- read.table(file.path(data.folder, files[i]), header=T)
max1Conc <- sub[which.max(sub$AvgConc),]
maxETD <- sub[which.max(sub$Roll_TotDep),]
write.csv(max1Conc, file= "max1Conc.csv", append=TRUE)
write.csv(maxETD, file= "maxETD.csv", append=TRUE)
}
The problem is that max1Conc and maxETD are not lists data.frames or vectors (or other types of object capable of storing more than one value).
To fix this:
maxETD<-vector()
max1Conc<-vector()
for (i in 1:length(files)) {
sub <- read.table(file.path(data.folder, files[i]), header=T)
max1Conc <- append(max1Conc,sub[which.max(sub$AvgConc),])
maxETD <- append(maxETD,sub[which.max(sub$Roll_TotDep),])
write.csv(max1Conc, file= "max1Conc.csv", append=TRUE)
write.csv(maxETD, file= "maxETD.csv", append=TRUE)
}
The difference here is that I made the two variables you wish to write out empty vectors (max1Conc and maxETD), and then used the append command to add each successive value to the vectors.
There are more idiomatic R ways of accomplishing your goal; personally, I suggest you look into learning the apply family of functions. (http://adv-r.had.co.nz/Functionals.html)
I can't directly test the whole thing because I don't have a directory with files like yours, but I tested the parts, and I think this should work as an apply-driven alternative. It starts with a pair of functions, one to ingest a file from your directory and other to make a row out of the two max values from each of those files:
library(dplyr)
data.folder <- "D:\\20150804"
getfile <- function(filename) {
sub <- read.table(file.path(data.folder, filename), header=TRUE)
return(sub)
}
getmaxes <- function(df) {
rowi <- data.frame(AvConc.max = max(df[,"AvConc"]), ETD.max = max(df[,"ETD"]))
return(rowi)
}
Then it uses a couple of rounds of lapply --- embedded in piping courtesy ofdplyr --- to a) build a list with each data set as an item, b) build a second list of one-row data frames with the maxes from each item in the first list, c) rbind those rows into one big data frame, d) and then cbind the filenames to that data frame for reference.
dfmax <- lapply(as.list(list.files(path = data.folder)), getfiles) %>%
lapply(., getmaxes) %>%
Reduce(function(...) rbind(...), .) %>%
data.frame(file = list.files(path = data.folder), .)

Building a mean across several csv files

I have an assignment on Coursera and I am stuck - I do not necessarily need or want a complete answer (as this would be cheating) but a hint in the right direction would be highly appreciated.
I have over 300 CSV files in a folder (named 001.csv, 002.csv and so on). Each contains a data frame with a header. I am writing a function that will take three arguments: the location of the files, the name of the column you want to calculate the mean (inside the data frames) and the files you want to use in the calculation (id).
I have tried to keep it as simple as possible:
pm <- function(directory, pollutant, id = 1:332) {
setwd("C:/Users/cw/Documents")
setwd(directory)
files <<- list.files()
First of all, set the wd and get a list of all files
x <- id[1]
x
get the starting point of the user-specified ID.
Problem
for (i in x:length(id)) {
df <- rep(NA, length(id))
df[i] <- lapply(files[i], read.csv, header=T)
result <- do.call(rbind, df)
return(df)
}
}
So this is where I am hitting a wall: I would need to take the user-specified input from above (e.g. 10:25) and put the content from files "010.csv" through "025.csv" into a dataframe to actually come up with the mean of one specific column.
So my idea was to run a for-loop along the length of id (e.g. 16 for 10:25) starting with the starting point of the specified id. Within this loop I would then need to take the appropriate values of files as the input for read.csv and put the content of the .csv files in a dataframe.
I can get single .csv files and put them into a dataframe, but not several.
Does anybody have a hint how I could procede?
Based on your example e.g. 16 files for 10:25, i.e. 010.csv, 011.csv, 012.csv, etc.
Under the assumption that your naming convention follows the order of the files in the directory, you could try:
csvFiles <- list.files(pattern="\\.csv")[10:15]#here [10:15] ... in production use your function parameter here
file_list <- vector('list', length=length(csvFiles))
df_list <- lapply(X=csvFiles, read.csv, header=TRUE)
names(df_list) <- csvFiles #OPTIONAL: if you want to rename (later rows) to the csv list
df <- do.call("rbind", df_list)
mean(df[ ,"columnName"])
These code snippets should be possible to pimp and incorprate into your routine.
You can aggregate your csv files into one big table like this :
for(i in 100:250)
{
infile<-paste("C:/Users/cw/Documents/",i,".csv",sep="")
newtable<-read.csv(infile)
newtable<-cbind(newtable,rep(i,dim(newtable)[1]) # if you want to be able to identify tables after they are aggregated
bigtable<-rbind(bigtable,newtable)
}
(you will have to replace 100:250 with the user-specified input).
Then, calculating what you want shouldn't be very hard.
That won't works for files 001 to 099, you'll have to distinguish those from the others because of the "0" but it's fixable with little treatment.
Why do you have lapply inside a for loop? Just do lapply(files[files %in% paste0(id, ".csv")], read.csv, header=T).
They should also teach you to never use <<-.

Resources