I'm quite new to R, sorry if the programming looks bad.
The goal is to create file names from a common prefix, i.e. given a prefix, loop x times to produce prefix-1, prefix-2, prefix-3, and then use those file names in read.csv() calls.
I've gotten the code to work, but very inefficiently, as shown below:
name <- vector(mode="character", length=0)
for (i in 1:numruns)name[i] <- paste(prefix, "-", i, ".log", sep="")
if (numruns == 1) {
raw_data_1 <-read.csv(name[1], header=F, sep="\t", skip=11)
}
if (numruns == 2) {
raw_data_1 <-read.csv(name[1], header=F, sep="\t", skip=11)
raw_data_2 <-read.csv(name[2], header=F, sep="\t", skip=11)
}
if (numruns == 3) {
raw_data_1 <-read.csv(name[1], header=F, sep="\t", skip=11)
raw_data_2 <-read.csv(name[2], header=F, sep="\t", skip=11)
raw_data_3 <-read.csv(name[3], header=F, sep="\t", skip=11) #import files
}
I'm trying to learn how to be more efficient. The above works for my purposes, but I feel like I should be able to wrap it all up in the initial loop that produces the names. When I try to modify the original loop, I can't get it to work...
for (i in 1:numruns){
name[i] <- paste(prefix, "-", i, ".log", sep="")
raw_data <- paste("raw_data_", i, sep="")
print(raw_data)
raw_data <- read.csv(name[i], header=F, sep="\t", skip=11)
}
Rather than getting raw_data_1, raw_data_2, raw_data_3... I get "raw_data". I'm confused because print(raw_data) actually prints "raw_data_1" through "raw_data_3" correctly (but only "raw_data" actually contains any data).
Thanks for any help or critique on my code to make it more efficient.
You should start using native vectorization early on. It may be confusing at first, but eventually you'll see all its power and beauty. Notice that many base functions are vectorized, so looping over their arguments is often redundant (see the paste0 usage below). Also learn about the apply family; it is an essential tool right from the start (see the lapply call).
Since reading multiple files is a common task, here's the chain I frequently use. We build all file names first according to a known pattern. Then we read them all at once, without any loops whatsoever. Finally, we may want to combine a list of files into a single data frame.
n <- 4
prefix <- 'some_prefix'
file_names <- paste0(prefix, '-', seq_len(n), '.log')
#[1] "some_prefix-1.log" "some_prefix-2.log" "some_prefix-3.log" "some_prefix-4.log"
# a list of data frames
df_list <- lapply(file_names, function(x) read.csv(x, header=FALSE, sep='\t', skip=11))
# total data frame (if all data frames are compatible)
df_total <- do.call(rbind, df_list)
One way to do this is to put them in a list along the lines of:
raw_data <- vector(mode = "list", length = numruns) #allocate space for list
for (i in 1:numruns){ raw_data[[i]] <- read.csv(name[i], header=F, sep="\t", skip=11)}
You can use lapply to do this in one command instead; it might be worth reading up on for the future.
The reason that your code isn't working is that you're assigning the string "raw_data_1" to raw_data, and then overwriting it with the data from the file. If you really want to go down the route of having lots of variables, have a look at assign() and get().
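A minimal sketch of the assign()/get() route, using toy data frames in place of the real read.csv() calls (all names and values here are hypothetical):

```r
# Create numbered variables with assign(), then fetch one by name with get().
# Toy data frames stand in for the CSV files.
for (i in 1:3) {
  assign(paste0("raw_data_", i), data.frame(run = i, value = i * 10))
}
get("raw_data_2")$value  # 20
```

That said, collecting the data frames in a list is almost always easier to work with than many numbered variables.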
I have a few data frames (colors, sets, inventory) and I want to save each of them into a folder that I have set as my wd. I want to do this using a for loop, but I am not sure how to write the file argument such that R understands that it should use the elements of the vector as the file names.
I might write:
DFs <- c("colors", "sets", "inventory")
for (x in 1:length(DFs)){
save(x, file = "x.Rda")
}
The goal would be that the files would save as colors.Rda, sets.Rda, etc. However, the last element to run through the loop simply saves as x.Rda.
In short, perhaps my question is: how do you tell R that I am wanting to use elements being run through a loop within an argument when that argument requires a character string?
For bonus points, I am sure I will encounter the same problem if I want to load a series of files from that folder in the future. Rather than loading each one individually, I'd also like to write a for loop. To load these a few minutes ago, I used the incredibly clunky code:
sets_file <- "~/Documents/ME teaching/R notes/datasets/sets.csv"
sets <- read.csv(sets_file)
inventories_file <- "~/Documents/ME teaching/R notes/datasets/inventories.csv"
inventories <- read.csv(inventories_file)
colors_file <- "~/Documents/ME teaching/R notes/datasets/colors.csv"
colors <- read.csv(colors_file)
For compactness I use lapply instead of a for loop here, but the idea is the same:
lapply(DFs, \(x) save(list = x, file = paste0(x, ".Rda")))
Note that the varying file names must be generated by using x as a variable (paste0(x, ".Rda")), not by embedding the letter x inside a character string ("x.Rda").
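A quick illustration of that difference (toy value, nothing written to disk):

```r
x <- "colors"
"x.Rda"            # the literal string "x.Rda": x is not substituted
paste0(x, ".Rda")  # "colors.Rda": the value of x is used
```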
To load those files, you can simply do:
lapply(paste0(DFs, ".Rda"), load, envir = globalenv())
To save you can do this:
DFs <- list(colors, sets, inventory)
names(DFs) = c("colors", "sets", "inventory")
for (x in 1:length(DFs)){
dx = paste(names(DFs)[[x]], "Rda", sep = ".")
dfx = DFs[[x]]
save(dfx, file = dx) # note: each file stores its data frame under the name dfx
}
To specify the path, just include it when constructing the dx object, as done below for reading.
To read:
DFs <- c("colors", "sets", "inventory")
# or, deriving the names from the folder itself (stripping the .csv extension):
DFs = sub("\\.csv$", "", dir("~/Documents/ME teaching/R notes/datasets/"))
dat <- vector(mode = "list", length = length(DFs))
for(x in 1:length(DFs)){
arq = paste("~/Documents/ME teaching/R notes/datasets/", DFs[x], ".csv", sep = "")
dat[[x]] = read.csv(arq)
}
The data frames end up in the list dat, so you can access each one using [[ ]] indexing.
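For example, with a list like that (toy data frames here for illustration):

```r
# Two small data frames stand in for the loaded csv files.
dat <- list(data.frame(a = 1:2), data.frame(b = 3:4))
dat[[2]]      # the second data frame
dat[[2]]$b    # its column b: 3 4
```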
So I have a folder with bunch of csv, I set the wd to that folder and extracted the files names:
data_dir <- "~/Desktop/All Waves Data/csv"
setwd(data_dir)
vecFiles <- list.files(data_dir)
all good, now the problem comes when I try to load all of the files using a loop on vecFiles:
for(fl in vecFiles) {
fl <- read.csv(vecFiles[i], header = T, fill = T)
}
The loop just reuses 'fl' as a plain variable name, so only the last file ends up saved under 'fl' (each iteration overwrites the previous one).
I was trying to figure out why this happens but failed.
Any explanation?
Edit: Trying to achieve the following: assume you have a folder with data1.csv, data2.csv ... datan.csv, I want to load them into separate data frames named data1, data2 ..... datan
You want to read in all the csv files from your working directory, whose paths you have saved in vecFiles.
Why your attempt doesn't work
What you are currently doing doesn't work because you overwrite the object fl with the newly loaded csv file in every iteration. After all iterations have run, you are left with only the last, overwritten fl object.
Another example to clarify why fl only contains the last csv file: if you declare fl <- "abc" on line 1, and on line 2 you say fl <- "def" (i.e. you overwrite the fl from line 1), you will obviously have the value "def" saved in fl after line 2, right?
fl <- "abc"
fl <- "def"
fl
#[1] "def"
Solutions
There are two prominent ways to solve this: 1) stick with a slightly altered for-loop, or 2) use sapply().
1) The altered for-loop: create an empty list called fl, and assign each loaded csv file to the i-th element of that list in every iteration:
fl <- list()
for(i in seq_along(vecFiles)){
fl[[i]] <- read.csv(vecFiles[i], header=TRUE, fill=TRUE)
}
names(fl) <- vecFiles
2) Use sapply(): sapply() is a function that R users like to use instead of for-loops.
fl <- sapply(vecFiles, read.csv, header=TRUE, fill=TRUE)
names(fl) <- vecFiles
Note that you can also use lapply() instead of sapply(). The only difference is that lapply() always gives you a list as output.
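A tiny illustration of the difference (no files needed):

```r
# sapply() simplifies where it can; lapply() always returns a list.
x <- list(a = 1:3, b = 4:6)
sapply(x, sum)  # simplified to a named vector: a = 6, b = 15
lapply(x, sum)  # stays a list with elements $a and $b
```

With data frames, sapply() may even simplify the result to a matrix, which is rarely what you want, so lapply() is the safer default when reading files.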
You're not declaring anything new when you load each file. Every iteration loads into the same variable fl, which is why you only end up with the last file in vecFiles.
Couple of potential solutions.
First lapply:
fl <- lapply(vecFiles, function(x) read.csv(x, header=TRUE, fill=TRUE))
names(fl) <- vecFiles
This will create a list of elements within fl.
Second 'rbind':
Under the assumption your data has all the same columns:
fl <- read.csv(vecFiles[1], header=TRUE, fill=TRUE)
for(i in 2:length(vecFiles)){
fl <- rbind(fl, read.csv(vecFiles[i], header=TRUE, fill=TRUE))
}
Hopefully that is helpful!
I have three directories that each contain about 2,000 files - the files have the same exact format but they are from 3 different sources. For each set of 3 files, I need to read in the data, merge them, and do some calculations and store the output. I've already got my script running for a test case; now I'm trying to loop it over all the files (so, 2000 sets of 3 files of each).
I only want to read in 3 at a time, of course. I thought of this approach: create a dataframe of files where the columns represent the 3 types and the rows represent files. I do that here:
type1Files <- list.files(path="path_to_dir", pattern="*.tsv", full.names=TRUE, recursive=FALSE)
type2Files <- list.files(path="path_to_dir", pattern="*.tsv", full.names=TRUE, recursive=FALSE)
type3Files <- list.files(path="path_to_dir", pattern="*.tsv", full.names=TRUE, recursive=FALSE)
files.df <- cbind.data.frame(type1=type1Files, type2=type2Files, type3=type3Files)
Now I need to read these files by column, looping over rows so only 3 files get opened in one loop. The issue is that I cannot read in a file using read.table, and I think it's because of the format of the filename (read.table() is not being fed the right format).
head(files.df) #confirms that each file is not surrounded by double quotes as required by read.table
My read.table statement:
type1.df <- read.table(x, header=FALSE, sep="\t", stringsAsFactors=FALSE)
Where, for x, I have tried the following:
shQuote(files.df[1,"type1"])
dQuote(files.df[1,"type1"])
file.t <- files.df[1,"type1"]
paste0('"',file.t,'"')
I've tried them all directly in read.table() as well as saving them to objects and naming the object in read.table(). I even tried using cat() because I thought the escaped quotes might be the problem. Nothing works. I either get "unexpected input" as the error, or the typical error: "Error in file(file, "rt") : cannot open the connection." Furthermore, if I paste the exact filename that is printed in the error into my read.table() statement, it runs just fine. So, after many hours, I am stumped.
Can this be done in this way?
Thank you all for your advice.
Consider iterating directly over the lists without an intermediary dataframe. With Map, you can walk the three lists of 2,000 file paths in parallel, reading and binding the corresponding trio of files at each step. Below, cbind.data.frame prefixes the columns with type1, type2, and type3.
bind_dfs <- function(x,y,z) {
xdf <- read.table(x, header=FALSE, sep="\t", stringsAsFactors=FALSE)
ydf <- read.table(y, header=FALSE, sep="\t", stringsAsFactors=FALSE)
zdf <- read.table(z, header=FALSE, sep="\t", stringsAsFactors=FALSE)
cbind.data.frame(type1=xdf, type2=ydf, type3=zdf)
}
dfList <- Map(bind_dfs, type1Files, type2Files, type3Files)
Also, to run your calculations, you can either extend the bind_dfs method
bind_dfs <- function(x,y,z) {
xdf <- read.table(x, header=FALSE, sep="\t", stringsAsFactors=FALSE)
ydf <- read.table(y, header=FALSE, sep="\t", stringsAsFactors=FALSE)
zdf <- read.table(z, header=FALSE, sep="\t", stringsAsFactors=FALSE)
df <- cbind(xdf, ydf, zdf)
df <- #... other calculations
return(df)
}
Or use another loop over the dataframe list:
newdfList <- lapply(dfList, function(df){
df <- # ... other calculations
return(df)
})
I have written a loop in R (still learning). My purpose is to pick the max AvgConc and max Roll_TotDep from each file in the loop, and then build two data frames that each contain all the max values picked from the individual files. The code I wrote only saves the last iteration's results (for only one single file)... Can someone point me in the right direction to revise my code, so I can append the result of each new iteration to the previous ones? Thanks!
data.folder <- "D:\\20150804"
files <- list.files(path=data.folder)
for (i in 1:length(files)) {
sub <- read.table(file.path(data.folder, files[i]), header=T)
max1Conc <- sub[which.max(sub$AvgConc),]
maxETD <- sub[which.max(sub$Roll_TotDep),]
write.csv(max1Conc, file= "max1Conc.csv", append=TRUE)
write.csv(maxETD, file= "maxETD.csv", append=TRUE)
}
The problem is that max1Conc and maxETD hold only a single row each; they are overwritten on every iteration instead of accumulating results across files.
To fix this:
maxETD <- data.frame()
max1Conc <- data.frame()
for (i in 1:length(files)) {
sub <- read.table(file.path(data.folder, files[i]), header=T)
max1Conc <- rbind(max1Conc, sub[which.max(sub$AvgConc),])
maxETD <- rbind(maxETD, sub[which.max(sub$Roll_TotDep),])
}
write.csv(max1Conc, file= "max1Conc.csv", row.names=FALSE)
write.csv(maxETD, file= "maxETD.csv", row.names=FALSE)
The difference here is that the two results start as empty data frames, each file's max row is appended with rbind, and the files are written once after the loop (write.csv ignores append=TRUE, so writing inside the loop just overwrote the file each time).
There are more idiomatic R ways of accomplishing your goal; personally, I suggest you look into learning the apply family of functions. (http://adv-r.had.co.nz/Functionals.html)
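For instance, the accumulation above can be phrased with lapply and one rbind at the end (a sketch with toy data frames standing in for the real files; the column names are taken from the question):

```r
# Two toy data frames play the role of the files read in the loop.
dfs <- list(data.frame(AvgConc = c(1, 5), Roll_TotDep = c(2, 3)),
            data.frame(AvgConc = c(4, 2), Roll_TotDep = c(9, 1)))
# One row per file: the row holding that file's maximum AvgConc.
max1Conc <- do.call(rbind, lapply(dfs, function(sub) sub[which.max(sub$AvgConc), ]))
max1Conc$AvgConc  # 5 4
```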
I can't directly test the whole thing because I don't have a directory with files like yours, but I tested the parts, and I think this should work as an apply-driven alternative. It starts with a pair of functions: one to ingest a file from your directory, the other to make a one-row data frame out of the two max values from each of those files:
library(dplyr)
data.folder <- "D:\\20150804"
getfile <- function(filename) {
sub <- read.table(file.path(data.folder, filename), header=TRUE)
return(sub)
}
getmaxes <- function(df) {
rowi <- data.frame(AvgConc.max = max(df[,"AvgConc"]), Roll_TotDep.max = max(df[,"Roll_TotDep"]))
return(rowi)
}
Then it uses a couple of rounds of lapply, embedded in piping courtesy of dplyr, to a) build a list with each data set as an item, b) build a second list of one-row data frames with the maxes from each item in the first list, c) rbind those rows into one big data frame, and d) cbind the filenames to that data frame for reference.
dfmax <- lapply(as.list(list.files(path = data.folder)), getfile) %>%
lapply(., getmaxes) %>%
Reduce(function(...) rbind(...), .) %>%
data.frame(file = list.files(path = data.folder), .)
First off, this is related to a homework question for the Coursera R programming course. I have found other ways to do what I want to do but my research has led me to a question I'm curious about. I have a variable number of csv files that I need to pull data from and then take the mean of the "pollutant" column in said files. The files are listed in their directory with an id number. I put together the following code which works fine for a single csv file but doesn't work for multiple csv files:
pollutantmean <- function (directory, pollutant, id = 1:332) {
id <- formatC(id, width=3, flag="0")
dataset <- read.csv(paste(directory, "/", id, ".csv", sep=""), header=TRUE)
mean(dataset[,pollutant], na.rm = TRUE)
}
I also know how to rbind multiple csv files together if I know the ids when I am creating the function, but I am not sure how to apply rbind to a variable range of ids, or if that's even possible. I found other ways to do it, such as calling lapply and then unlisting the data; I'm just curious whether there is an easier way.
Well, this uses an lapply, but it might be what you want.
file_list <- list.files("*your directory*", full.names = T)
combined_data <- do.call(rbind, lapply(file_list, read.csv, header = TRUE))
This will turn all of your files into one large dataset, and from there it's easy to take the mean. Is that what you wanted?
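As a self-contained sketch of that idea (the temporary folder, file names, and pollutant column here are invented for the demo):

```r
# Build a throwaway folder with two small csv files, then combine and average.
demo_dir <- file.path(tempdir(), "pollutant_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(pollutant = c(1, 2)), file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(pollutant = c(3, 4)), file.path(demo_dir, "002.csv"), row.names = FALSE)

file_list <- list.files(demo_dir, full.names = TRUE)
combined_data <- do.call(rbind, lapply(file_list, read.csv, header = TRUE))
mean(combined_data$pollutant, na.rm = TRUE)  # 2.5
```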
An alternative way of doing this would be to step through file by file, taking sums and number of observations and then taking the mean afterwards, like so:
sums <- numeric()
n <- numeric()
i <- 1
for(file in file_list){
temp_df <- read.csv(file, header = T)
sums[i] <- sum(temp_df$pollutant)
n[i] <- nrow(temp_df)
i <- i + 1
}
new_mean <- sum(sums)/sum(n)
Note that both of these methods require that only your desired csvs are in that folder. You can use a pattern argument in the list.files call if you have other files in there that you're not interested in.
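A quick toy check that summing per-file sums and counts reproduces the pooled mean:

```r
# Two vectors stand in for the pollutant columns of two files.
a <- c(1, 2, 3); b <- c(4, 5)
pooled  <- mean(c(a, b))                               # 3
running <- (sum(a) + sum(b)) / (length(a) + length(b)) # 3
pooled == running  # TRUE
```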
A vector is not accepted for 'file' in read.csv(file, ...).
Below is a slight modification of yours: a vector of file paths is created, and the paths are looped over with sapply.
files <- paste("directory-name/",formatC(1:332, width=3, flag="0"),
".csv",sep="")
pollutantmean <- function(file, pollutant) {
dataset <- read.csv(file, header = TRUE)
mean(dataset[, pollutant], na.rm = TRUE)
}
sapply(files, pollutantmean, pollutant = "sulfate")  # "sulfate" is just an example; pass your pollutant column name