I have three directories that each contain about 2,000 files. The files have exactly the same format, but they come from 3 different sources. For each set of 3 files, I need to read in the data, merge it, do some calculations, and store the output. I already have my script running for a test case; now I'm trying to loop it over all the files (so, 2,000 sets of 3 files each).
I only want to read in 3 at a time, of course. I thought of this approach: create a dataframe of files where the columns represent the 3 types and the rows represent files. I do that here:
type1Files <- list.files(path="path_to_dir1", pattern="*.tsv", full.names=TRUE, recursive=FALSE)
type2Files <- list.files(path="path_to_dir2", pattern="*.tsv", full.names=TRUE, recursive=FALSE)
type3Files <- list.files(path="path_to_dir3", pattern="*.tsv", full.names=TRUE, recursive=FALSE)
files.df <- cbind.data.frame(type1=type1Files, type2=type2Files, type3=type3Files)
Now I need to read these files by column, looping over rows so only 3 files get opened in one loop. The issue is that I cannot read in a file using read.table, and I think it's because of the format of the filename (read.table() is not being fed the right format).
head(files.df) #confirms that each file is not surrounded by double quotes as required by read.table
My read.table statement:
type1.df <- read.table(x, header=FALSE, sep="\t", stringsAsFactors=FALSE)
Where, for x, I have tried the following:
shQuote(files.df[1,"type1"])
dQuote(files.df[1,"type1"])
file.t <- files.df[1,"type1"]
paste0('"',file.t,'"')
I've tried them all directly in read.table() as well as saving them to objects and naming the object in read.table(). I even tried using cat() because I thought the escaped quotes might be the problem. Nothing works. I either get "unexpected input" as the error, or the typical error: "Error in file(file, "rt") : cannot open the connection." Furthermore, if I paste the exact filename printed in the error into my read.table() statement, it runs just fine. So, after many hours, I am stumped.
Can this be done in this way?
Thank you all for your advice.
Consider iterating directly over the file lists without an intermediary dataframe. With Map, you can walk all three lists elementwise and cbind each trio of data frames. Below, cbind.data.frame prefixes the columns with type1, type2, and type3.
bind_dfs <- function(x,y,z) {
xdf <- read.table(x, header=FALSE, sep="\t", stringsAsFactors=FALSE)
ydf <- read.table(y, header=FALSE, sep="\t", stringsAsFactors=FALSE)
zdf <- read.table(z, header=FALSE, sep="\t", stringsAsFactors=FALSE)
cbind.data.frame(type1=xdf, type2=ydf, type3=zdf)
}
dfList <- Map(bind_dfs, type1Files, type2Files, type3Files)
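A side note: because type1Files is a character vector, Map() names the elements of dfList by those file paths, so individual results can be pulled out by name or by index:

```r
length(dfList)      # one merged data frame per trio of files
head(dfList[[1]])   # the first merged data frame
names(dfList)[1]    # named after the first type1 file path
```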
Also, to run your calculations, you can either extend the bind_dfs function
bind_dfs <- function(x,y,z) {
xdf <- read.table(x, header=FALSE, sep="\t", stringsAsFactors=FALSE)
ydf <- read.table(y, header=FALSE, sep="\t", stringsAsFactors=FALSE)
zdf <- read.table(z, header=FALSE, sep="\t", stringsAsFactors=FALSE)
df <- cbind(xdf, ydf, zdf)
df <- #... other calculations
return(df)
}
Or run another loop over the list of dataframes:
newdfList <- lapply(dfList, function(df){
df <- # ... other calculations
return(df)
})
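Since the question also mentions storing the output, the same Map() pattern can carry an output path per trio. This is just a sketch; the `merged_<n>.tsv` naming scheme is an assumption, so substitute whatever output names you need:

```r
# hypothetical output names, one per set of three input files
out_names <- paste0("merged_", seq_along(type1Files), ".tsv")

Map(function(x, y, z, out) {
  df <- bind_dfs(x, y, z)   # read and combine the three files
  # ... other calculations ...
  write.table(df, out, sep = "\t", row.names = FALSE, quote = FALSE)
}, type1Files, type2Files, type3Files, out_names)
```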
I have tried to create a for loop that does something for each of 4 csv files similar to this but with more files.
dat1<- read.csv("female.csv", header =T)
dat2<- read.csv("male.csv", header =T)
for (i in 1:2) {
message("Female, Male")
Temp <- dat[i][(dat[i]$NAME == "Temp"), ]
Temp <- Temp[complete.cases(Temp)]
print(mean(Temp$MEAN))
}
However, I get an error:
Error in Temp$MEAN : $ operator is invalid for atomic vectors
Not sure why this isn't working. Any help would be appreciated for looping through csv files!
Personally, I think the easiest way to do this is with the plyr package:
library(plyr)
myFiles <- c("male.csv", "female.csv")
dat <- ldply(myFiles, read.csv)
dat <- dat[complete.cases(dat), ]
mean(dat$MEAN)
The way this works is that you first create a vector of file names. Then ldply() applies read.csv() to each filename and automatically combines the results into a single data.frame. Then you do the complete.cases() and mean() in the usual way.
Edit:
But if you want the mean of each file then here is one way of doing it:
# create a vector of files
myFiles <- c("male.csv", "female.csv")
# create a function that properly handles ONLY ONE ELEMENT
readAndCalc <- function(x){ # pass in the filename
tmp <- read.csv(x) # read the single file
tmp <- tmp[complete.cases(tmp), ] # complete.cases()
mean(tmp$MEAN) # mean
}
x <- "male.csv"
readAndCalc(x) # test with ONE file
sapply(myFiles, readAndCalc) # run with all your files
The way this works is that you first create a vector of filenames, just like before. Then you create a function that processes ONLY ONE file at a time. Then you can test that the function works using the readAndCalc function you just created. Finally do it for all your files with the sapply() function. Hope that helps.
I have the following part of the code that contains two loops. I have some txt files, which I want to read and analyze in R separately, one by one. Currently, I have a problem importing them into R. For example, the name of the first file is "C:/Users/User 1/Documents/Folder 1/1 1986.txt". To read it into R, I made the following loop:
## company
for(i in 1)
{
## year
for(j in 1986)
{
df=read.delim(paste("C:/Users/User 1/Documents/Folder 1/", i, j, ".txt"), stringsAsFactors=FALSE, header=FALSE)
df<-data.frame(rename(df, c("V3"="weight")))
}
}
When I run the loop, I get the following error:
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file 'C:/Users/User 1/Documents/Folder 1/ 13 1986 .txt': No such file or directory
How do I avoid those additional gaps that R assumes to exist in the name of the original file?
You should replace paste with paste0.
By default, paste uses a single space as the separator, which produces the stray spaces you see in the file name; paste0 uses no separator at all.
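To see the difference (the path here is illustrative):

```r
i <- 13; j <- 1986
paste("C:/Folder 1/", i, j, ".txt")   # "C:/Folder 1/ 13 1986 .txt" -- spaces inserted
paste0("C:/Folder 1/", i, j, ".txt")  # "C:/Folder 1/131986.txt"    -- no separator

# since the real file name has a space between company and year,
# add that one space explicitly:
paste0("C:/Folder 1/", i, " ", j, ".txt")  # "C:/Folder 1/13 1986.txt"
```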
Because I don't know exactly what your files look like, maybe this won't help you... but this is how I read in files with a loop:
First: setting the working directory
setwd("/Users/User 1/Documents/Folder 1")
Then I always save my data as one Excel file with different sheets. For this example, I have 15 sheets in my Excel file named 2000-2014: the first sheet is called "2000", the second "2001", and so on.
library(readxl) # read_excel() comes from the readxl package

sheets <- list() # creating an empty list named sheets
k <- c(2000:2014)
for(i in 1:15){
sheets[[i]] <- read_excel("2000-2014.xlsx", sheet = i) # every sheet becomes one element of the list sheets
sheets[[i]]$Year <- k[i] # to every list element I add a column "Year", matching the actual year the data is from
}
Now I want my data from 2000 to 2014 merged into one big data frame. I can still analyse the years one by one!
data <- do.call(rbind.data.frame, sheets)
To tidy my data and get it into the form Hadley Wickham and ggplot2 like (http://vita.had.co.nz/papers/tidy-data.pdf), I restructure it:
library(dplyr) # for the pipe %>%

data_restructed <- data %>%
as.data.frame() %>%
tidyr::gather(key = "categories", value = "values", 2:12)
2:12 because in my case columns 2:12 contain all the values, while column 1 contains country names. Now you have all your data in one big dataframe and can analyse it split by specific variables like the year, the category, or year AND category, and so on.
I would avoid the loop in this case and go with lapply.
Files <- list.files('C:/Users/User 1/Documents/Folder 1/', pattern = "*.txt", full.names = TRUE)
fileList <- lapply(Files, FUN = function(x){
df <- read.delim(x, stringsAsFactors = FALSE, header = FALSE)
df <- data.frame(rename(df, c("V3" = "weight")))
return(df)
})
do.call('rbind', fileList)
First off, this is related to a homework question for the Coursera R programming course. I have found other ways to do what I want to do but my research has led me to a question I'm curious about. I have a variable number of csv files that I need to pull data from and then take the mean of the "pollutant" column in said files. The files are listed in their directory with an id number. I put together the following code which works fine for a single csv file but doesn't work for multiple csv files:
pollutantmean <- function (directory, pollutant, id = 1:332) {
id <- formatC(id, width=3, flag="0")
dataset <- read.csv(paste(directory, "/", id, ".csv", sep=""), header=TRUE)
mean(dataset[,pollutant], na.rm = TRUE)
}
I also know how to rbind multiple csv files together if I know the ids when I am creating the function, but I am not sure how to apply rbind to a variable range of ids, or if that's even possible. I found other ways to do it, such as calling lapply and then unlisting the data; I'm just curious if there is an easier way.
Well, this uses an lapply, but it might be what you want.
file_list <- list.files("*your directory*", full.names = T)
combined_data <- do.call(rbind, lapply(file_list, read.csv, header = TRUE))
This will turn all of your files into one large dataset, and from there it's easy to take the mean. Is that what you wanted?
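For example, assuming the column of interest is literally named pollutant (substitute your actual column name):

```r
mean(combined_data$pollutant, na.rm = TRUE)
```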
An alternative way of doing this would be to step through file by file, taking sums and number of observations and then taking the mean afterwards, like so:
sums <- numeric()
n <- numeric()
i <- 1
for(file in file_list){
temp_df <- read.csv(file, header = T)
sums[i] <- sum(temp_df$pollutant)
n[i] <- nrow(temp_df)
i <- i + 1
}
new_mean <- sum(sums)/sum(n)
Note that both of these methods require that only your desired csvs are in that folder. You can use a pattern argument in the list.files call if you have other files in there that you're not interested in.
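For example, to pick up only the csv files (the directory name is a placeholder):

```r
file_list <- list.files("your_directory", pattern = "\\.csv$", full.names = TRUE)
```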
read.csv(file, ...) does not accept a vector for 'file'.
Below is a slight modification of yours: a vector of file paths is created, and sapply loops over them.
files <- paste("directory-name/",formatC(1:332, width=3, flag="0"),
".csv",sep="")
pollutantmean <- function(file, pollutant) {
dataset <- read.csv(file, header = TRUE)
mean(dataset[, pollutant], na.rm = TRUE)
}
sapply(files, pollutantmean, pollutant = "sulfate") # extra arguments are passed through; substitute your pollutant column name
I would like to apply a loop in R to process several files, one at a time. The files all follow exactly the same pattern; only the number in the string "...split1..." increases across files. So I have files like "...split1...", "...split2..." ... "...split777...". I want output files following the same logic: "newsplit1.txt", "newsplit2.txt" ... "newsplit777.txt".
all <- read.table("nsamplescluster.split1.adjusted", header=TRUE, sep=";")
all <- all[, -grep("GType", colnames(all))]
write.table(all, "newsplit1.txt", sep=";")
Cheers!
Use a loop and paste the file names:
for(i in 1:777){
infile <- paste0("nsamplescluster.split",i,".adjusted")
outfile <- paste0("newsplit",i,".txt")
all <- read.table(infile, header=TRUE, sep=";")
all <- all[, -grep("GType", colnames(all))]
write.table(all, outfile, sep=";")
}
If the files are all in the same directory, you can also use
filenames<- list.files(your.directory, pattern="nsamplescluster")
This will create a vector with all file names in your.directory with the indicated pattern. You can then use this to loop over your files. For instance,
for(i in filenames){
do stuff
}
This may come in handy if the number of files changes.
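For instance, the "do stuff" part can derive each output name from the input name instead of counting to 777; a sketch, assuming the file names match the pattern in the question:

```r
filenames <- list.files(your.directory, pattern = "nsamplescluster", full.names = TRUE)

for(f in filenames){
  all <- read.table(f, header = TRUE, sep = ";")
  all <- all[, -grep("GType", colnames(all))]
  # turn ".../nsamplescluster.splitN.adjusted" into "newsplitN.txt"
  outfile <- sub(".*split([0-9]+).*", "newsplit\\1.txt", f)
  write.table(all, outfile, sep = ";")
}
```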
I'm quite new to R, sorry if the programming looks bad.
The goal is to create filenames based on a common prefix, i.e. given a prefix, loop x times to produce prefix-1, prefix-2, prefix-3, and then use those filenames in read.csv().
I've gotten the code to work, but very inefficiently, as below:
name <- vector(mode="character", length=0)
for (i in 1:numruns)name[i] <- paste(prefix, "-", i, ".log", sep="")
if (numruns == 1) {
raw_data_1 <-read.csv(name[1], header=F, sep="\t", skip=11)
}
if (numruns == 2) {
raw_data_1 <-read.csv(name[1], header=F, sep="\t", skip=11)
raw_data_2 <-read.csv(name[2], header=F, sep="\t", skip=11)
}
if (numruns == 3) {
raw_data_1 <-read.csv(name[1], header=F, sep="\t", skip=11)
raw_data_2 <-read.csv(name[2], header=F, sep="\t", skip=11)
raw_data_3 <-read.csv(name[3], header=F, sep="\t", skip=11) #import files
}
I'm trying to learn how to be more efficient. The above works for my purposes, but I feel like I should be able to wrap it all up in the initial loop that produces the names. When I try to modify the original loop, I can't get it to work:
for (i in 1:numruns){
name[i] <- paste(prefix, "-", i, ".log", sep="")
raw_data <- paste("raw_data_", i, sep="")
print(raw_data)
raw_data <- read.csv(name[i], header=F, sep="\t", skip=11)
}
Rather than getting raw_data_1, raw_data_2, raw_data_3..., I only get "raw_data". I'm confused because print(raw_data) actually prints "raw_data_1" through "raw_data_3" correctly (but only "raw_data" actually contains any data).
Thanks for any help or critique on my code to make it more efficient.
You should start using native vectorization early on. It may be confusing at first, but eventually you'll see all its power and beauty. Notice that many base functions are vectorized, so looping over arguments is often redundant (see the paste0 usage below). Learn about the apply family early; it is an essential tool (see the lapply call).
Since reading multiple files is a common task, here's the chain I frequently use. We build all file names first according to a known pattern. Then we read them all at once, without any loops whatsoever. Finally, we may want to combine a list of files into a single data frame.
n <- 4
prefix <- 'some_prefix'
file_names <- paste0(prefix, '-', seq_len(n), '.log')
#[1] "some_prefix-1.log" "some_prefix-2.log" "some_prefix-3.log" "some_prefix-4.log"
# a list of data frames
df_list <- lapply(file_names, function(x) read.csv(x, head=F, sep='\t', skip=11))
# total data frame (if all data frames are compatible)
df_total <- do.call(rbind, df_list)
One way to do this is to put them in a list along the lines of:
raw_data <- vector(mode = "list", length = numruns) #allocate space for list
for (i in 1:numruns){ raw_data[[i]] <- read.csv(name[i], header=F, sep="\t", skip=11)}
You can use lapply to do this in one command instead; it might be worth reading up on for the future.
The reason your code isn't working is that you're assigning the string "raw_data_1" to raw_data, and then overwriting it with the data from the file. If you really want to go down the route of having lots of separate variables, have a look at assign() and get().
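For completeness, this is roughly what the assign()/get() route looks like, though the list approach above is usually preferable:

```r
for (i in 1:numruns) {
  # create a variable named raw_data_<i> in the calling environment
  assign(paste0("raw_data_", i),
         read.csv(name[i], header = FALSE, sep = "\t", skip = 11))
}

head(get("raw_data_1"))  # retrieve a data frame by its constructed name
```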