After having searched for help in different threads on this topic, I still have not become wiser. Therefore: Here comes another question on looping through multiple data files...
OK. I have multiple CSV files in one folder containing 5 columns of data. The filenames are as follows:
Moist yyyymmdd hh_mm_ss.csv
I would like to create a script that reads and processes the CSV files one by one, doing the following steps:
1) load file
2) check number of rows and exclude file if less than 3 registrations
3) calculate mean value of all measurements (=rows) for column 2
4) calculate mean value of all measurements (=rows) for column 4
5) output the filename timestamp, mean column 2 and mean column 4 to a data frame.
I have written the following function:
moist.each.mean <- function() {
library("tcltk")
directory <- tk_choose.dir("","Choose folder for Humidity data files")
setwd(directory)
filelist <- list.files(path = directory)
filetitles <- regmatches(filelist, regexpr("[0-9].*[0-9]", filelist))
mdf <- data.frame(timestamp=character(), humidity=numeric(), temp=numeric())
for(i in 1:length(filelist)){
file.in[[i]] <- read.csv(filelist[i], header=F)
if (nrow(file.in[[i]]<3)){
print("discard")
} else {
newrow <- c(filetitles[[i]], round(mean(file.in[[i]]$V2),1), round(mean(file.in[[i]]$V4),1))
mdf <- rbind(mdf, newrow)
}
}
names(mdf) <- c("timestamp", "humidity", "temp")
}
But I keep getting an error:
Error in `[[<-.data.frame`(`*tmp*`, i, value = list(V1 = c(10519949L, :
replacement has 18 rows, data has 17
Any ideas?
Thx, kruemelprinz
I'd also suggest using (l)apply. Here's my take:
getMeans <- function(fpath,
                     target_cols = c(2),
                     sep = ",",
                     dec = ".",
                     header = TRUE,
                     min_obs_threshold = 3) {
  # full paths of all .csv files in fpath
  fcsv <- list.files(fpath, pattern = "\\.csv$", full.names = TRUE)
  csv_list <- lapply(fcsv, read.table, sep = sep,
                     dec = dec, header = header)
  csv_rows <- sapply(csv_list, nrow)
  # keep only files with at least min_obs_threshold rows
  rel_csv_list <- csv_list[csv_rows >= min_obs_threshold]
  # drop = FALSE keeps a data frame even when only one column is selected
  lapply(rel_csv_list, function(x) colMeans(x[, target_cols, drop = FALSE]))
}
Also with that kind of error message, the debugger might be very helpful.
Just run debug(moist.each.mean) and execute the function stepwise.
Here's a slightly different approach. Use lapply to read each csv file, exclude it if necessary, otherwise create a summary. This gives you a list where each element is a data frame summary. Then use rbind to create the final summary data frame.
Without a sample of your data, I can't be sure the code below exactly matches your problem, but hopefully it will be enough to get you where you want to go.
# Get vector of filenames to read (full paths, so the working directory does not matter)
filelist = list.files(path=directory, pattern="\\.csv$", full.names=TRUE)
# Read all the csv files into a list and create summaries
df.list = lapply(filelist, function(f) {
  # header=FALSE so that the columns are named V1, V2, ... as used below
  file.in = read.csv(f, header=FALSE, stringsAsFactors=FALSE)
  if (nrow(file.in) < 3) {
    # Fewer than 3 rows: report the file. The printed string is what gets
    # returned, and it is filtered out later because it is not a data frame
    print(paste("Discard", f))
  } else {
    # Otherwise, capture file timestamp (yyyymmdd hh_mm_ss) and summarise data frame
    data.frame(timestamp=substr(basename(f), 7, 23),
               humidity=round(mean(file.in$V2),1),
               temp=round(mean(file.in$V4),1))
  }
})
# Bind list into final summary data frame (excluding the list elements
# that don't contain a data frame because they didn't have enough rows
# to be included in the summary)
result = do.call(rbind, df.list[sapply(df.list, is.data.frame)])
One issue with your original code is that you create a vector of summary results rather than a data frame of results:
c(filetitles[[i]], round(mean(file.in[[i]]$V2),1), round(mean(file.in[[i]]$V4),1)) is a vector with three elements. What you actually want is a data frame with three columns:
data.frame(timestamp=filetitles[[i]],
humidity=round(mean(file.in[[i]]$V2),1),
temp=round(mean(file.in[[i]]$V4),1))
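To illustrate the difference between the two, here is a minimal sketch. The timestamp and the numbers are made up for illustration and are not taken from the original data. It shows that c() coerces everything to character, while a one-row data frame keeps each column's type when it is bound onto the result:

```r
# c() coerces mixed values to a single type, here character
newrow_vec <- c("20150101 12_00_00", round(45.678, 1), round(21.234, 1))
class(newrow_vec)

# A one-row data frame keeps each column's own type
newrow_df <- data.frame(timestamp = "20150101 12_00_00",   # made-up value
                        humidity  = round(45.678, 1),
                        temp      = round(21.234, 1),
                        stringsAsFactors = FALSE)

# Start from an empty, typed data frame and append the row
mdf <- data.frame(timestamp = character(), humidity = numeric(),
                  temp = numeric(), stringsAsFactors = FALSE)
mdf <- rbind(mdf, newrow_df)
sapply(mdf, class)
```

With the vector version, the means would end up stored as text, which is easy to miss until you try to plot or average them.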
Thanks for the suggestions using lapply. This is definitely of value as it saves a whole lot of code as well! Meanwhile, I managed to fix my original code as well:
library("tcltk")
# directory: path to csv files
directory <-
tk_choose.dir("","Choose folder for Humidity data files")
setwd(directory)
filelist <- list.files(path = directory)
filetitles <-
regmatches(filelist, regexpr("[0-9].*[0-9]", filelist))
mdf <- data.frame()
for (i in 1:length(filelist)) {
file.in <- read.csv(filelist[i], header = F, skipNul = T)
if (nrow(file.in) < 3) {
print("discard")
} else {
newrow <-
matrix(
c(filetitles[[i]], round(mean(file.in$V2, na.rm=T),1), round(mean(file.in$V4, na.rm=T),1)), nrow = 1, ncol =
3, byrow = T
)
mdf <- rbind(mdf, newrow)
}
}
names(mdf) <- c("timestamp", "humidity", "temp")
However, I could not get it to work as a function: mdf then contained only one row, holding the data of the last file. Instead of adding rows, each iteration seemed to overwrite row 1. Run without a function wrapper, it worked fine.
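For what it's worth, one way the same steps might be wrapped in a function is sketched below. This is only a sketch under assumptions: summarise_moist() and its min_rows argument are hypothetical names, the files are assumed to have no header row, and every file name is assumed to contain the digits of the timestamp. Each file produces either NULL or a one-row data frame, and the combined result is returned explicitly as the value of the function:

```r
# Hypothetical helper, not from the thread
summarise_moist <- function(directory, min_rows = 3) {
  files <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  rows <- lapply(files, function(f) {
    dat <- read.csv(f, header = FALSE)
    if (nrow(dat) < min_rows) return(NULL)   # too few registrations: skip file
    fname <- basename(f)
    data.frame(
      # same regular expression as in the original code
      timestamp = regmatches(fname, regexpr("[0-9].*[0-9]", fname)),
      humidity  = round(mean(dat$V2, na.rm = TRUE), 1),
      temp      = round(mean(dat$V4, na.rm = TRUE), 1),
      stringsAsFactors = FALSE
    )
  })
  # drop the skipped files, then bind the one-row data frames together;
  # the value of this last expression is what the function returns
  do.call(rbind, Filter(Negate(is.null), rows))
}
```

It would be called as mdf <- summarise_moist(directory), so the result has to be assigned by the caller. Nothing created inside the function is visible outside it otherwise.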
I have a large number of CSV files. I need to extract relevant data from each file, and compile all of the relevant data into a new file.
I have been copying/pasting the code below and changing relevant details (e.g., file name) to repeat the same process for many CSV files. After that, I use cbind()/write.xlsx() to combine all of the relevant data and write it to an excel file. I need a more efficient method to accomplish this task.
How can I:
incorporate a loop that imports a large number of CSV files (to replace #1 below)
select relevant rows based on a string instead of entering specific row numbers (to replace #2 below)
combine all of the relevant data into a single data frame with each file's data in one column
library(tidyr)
# 1 - import raw data files
file1 <- read.csv ("1.csv", header = FALSE, sep = "\n")
# 2 - select relevant rows
file1 <- as.data.frame(file1[c(41:155),])
colnames(file1) <- c("file1")
#separate components of each line from raw csv file / isolate data
temp1 <- separate(file1, file1, into = c("Text", "IntNum", "Data"), sep = "\\s")
temp1 <- temp1$Data
temp1 <- as.data.frame(temp1)
If the number of relevant rows in each file is the same, you could do it like this. Option 1 shows a solution using a loop, option 2 shows a solution using sapply.
In a first step I generate three csv-files to make the code reproducible. The start row in each file is defined by "start", the end row by "end". I then get a list with the names of these files with dir().
#make csv-files, target vector always same length (3)
set.seed(1)
for (i in 1:3) {
df <- data.frame(x = c(rep(0, sample(1:10,1)), "start",
paste0("dat", i),
"end",rep(0, sample(1:10, 1))))
write.csv(df, file = paste0("file", i, ".csv"), quote = FALSE, row.names = FALSE)
}
#get list of file names
allFiles <- dir(pattern = glob2rx("*.csv"))
Option 1 - loop
For the loop you could first initialize a result data frame ("outDF") with the number of columns set to the number of csv-files and the number of rows set to the length of the target vector in each file ("start" to "end"). You can then loop over the files and fill the data frame. The start and end rows can be indexed using which().
#initialise result data frame
outDF <- data.frame(matrix(nrow = 3, ncol = length(allFiles),
dimnames = list(NULL, allFiles)))
#loop over csv files
for (iFile in allFiles) {
idat <- read.csv(iFile, stringsAsFactors = FALSE) #read csv
outDF[, iFile] <- idat[which(idat$x == "start"):which(idat$x == "end"),]
}
Option 2 - sapply
Instead of a loop you could use sapply with a custom function to extract the relevant rows in each file. This returns a matrix which you could then transform into a dataframe.
out <- sapply(allFiles, FUN = function(x) {
idat <- read.csv(x, stringsAsFactors = FALSE)
return(idat[which(idat$x == "start"):which(idat$x == "end"),])
})
outDF <- as.data.frame(out)
If the number of rows between "start" and "end" differs between files, the above options won't work. In this case you could generate a data frame by first using lapply() (similar to option 2) to generate a result list (with different lengths of the list elements) and then padding shorter list elements with NAs before transforming the result into a data frame again.
#make csv-files with target vector of different lengths (3:12)
set.seed(1)
for (i in 1:3) {
df <- data.frame(x = c(rep(0, sample(1:10,1)), "start",
rep(paste0("dat", i), sample(1:10,1)),
"end",rep(0, sample(1:10, 1))))
write.csv(df, file = paste0("file", i, ".csv"), quote = FALSE, row.names = FALSE)
}
#lapply
out <- lapply(allFiles, FUN = function(x) {
idat = read.csv(x, stringsAsFactors = FALSE)
return(idat[which(idat$x == "start"):which(idat$x == "end"),])
})
out <- lapply(out, `length<-`, max(lengths(out)))
outDF <- as.data.frame(do.call(cbind, out))
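The padding step can be seen in isolation with a small sketch. The two vectors are made up for illustration. Assigning a longer length to a vector fills the new positions with NA, which is what makes the elements the same length before cbind:

```r
# Two result vectors of different lengths (made-up values)
out <- list(c("a", "b", "c"), c("d"))

# `length<-` extends each element to the longest length, padding with NA
padded <- lapply(out, `length<-`, max(lengths(out)))

# Now every element has length 3 and can be bound column-wise
m <- do.call(cbind, padded)
m
```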
I'm trying to write a function called complete that takes a file directory (which has csv files titled 1-332) and the title of the file as a number, and prints out the number of rows without NA in the sulfate or nitrate columns. I am trying to use mutate to add a column titled nob, which is 1 if neither column is NA, and then take the sum of nob for my answer, but I get an error message that the object nob is not found. How can I fix this? The specific file directory in question is downloaded within this block of code.
library(tidyverse)
if(!file.exists("rprog-data-specdata.zip")) {
temp <- tempfile()
download.file("https://d396qusza40orc.cloudfront.net/rprog%2Fdata%2Fspecdata.zip",temp)
unzip(temp)
unlink(temp)
}
complete <- function(directory, id = 1:332){
#create a list of files
files_full <- list.files(directory, full.names = TRUE)
#create an empty data frame
dat <- data.frame()
for(i in id){
dat <- rbind(dat, read.csv(files_full[i]))
}
mutate(dat, nob = ifelse(!is.na(dat$sulfate) & !is.na(dat$nitrate), 1, 0))
x <- summarise(dat, sum = sum(nob))
return(x)
}
When one runs the following code, nobs should be 117, but I get an error message instead:
complete("specdata", 1)
Error: object 'nob' not found
I think the function below should get what you need. Rather than a loop, I prefer map (or apply) in this setting. It's difficult to say where your code went wrong without the error message or an example I can run on my machine, however.
Happy Coding,
Daniel
library(tidyverse)
complete <- function(directory, id = 1:332){
#create a list of files
files_full <- list.files(directory, full.names = TRUE)
# cycle over each requested file to get the number of nonmissing rows
purrr::map_int(
  files_full[id],                         # only the files selected by id
  ~ read.csv(.x) %>%                      # read in datafile
    dplyr::select(sulfate, nitrate) %>%   # select two columns of interest
    tidyr::drop_na() %>%                  # drop missing observations
    nrow()                                # get the number of rows with no missing data
) %>%
  sum()                                   # sum the total number of rows not missing among the selected files
}
As mentioned, avoid building objects in a loop. Instead, consider building a list of data frames from each csv then call rbind once. In fact, even consider base R (i.e., tinyverse) for all your needs:
complete <- function(directory, id = 1:332){
# create a list of files
files_full <- list.files(directory, full.names = TRUE)
# create a list of data frames
df_list <- lapply(files_full[id], read.csv)
# build a single data frame with nob column
dat <- transform(do.call(rbind, df_list),
nob = ifelse(!is.na(sulfate) & !is.na(nitrate), 1, 0)
)
return(sum(dat$nob))
}
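As for the error in the question's code: mutate() returns a modified copy and does not change dat in place, so if the result is not assigned, the nob column does not exist when summarise() runs. A minimal sketch with made-up data (the values below are not from specdata) shows the assignment, plus a base R equivalent using complete.cases():

```r
library(dplyr)

# Made-up data standing in for one monitor file
dat <- data.frame(sulfate = c(1.2, NA, 3.4, NA),
                  nitrate = c(0.5, 0.7, NA, NA))

# Assign the result of mutate(); otherwise dat has no nob column
dat <- mutate(dat, nob = ifelse(!is.na(sulfate) & !is.na(nitrate), 1, 0))
x <- summarise(dat, nobs = sum(nob))
x

# Base R equivalent: count rows where both columns are present
n_complete <- sum(complete.cases(dat[, c("sulfate", "nitrate")]))
n_complete
```

Inside mutate() the columns can be referred to by name, so the dat$ prefix is not needed.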
So I am trying to read several csv files, take their first column and create a new file. I have succeeded using qpcR and data.table using the following code:
FileNames <- dir(pattern = "*.csv")
x <- integer()
for (FileName in FileNames) {
data <- read.csv(file = FileName, header=FALSE, skip=1)
y <- data[,1]
x<-qpcR:::cbind.na(x, y)
rm(data)
}
write.csv(x, file = 'test.csv')
This works fine, however I have discovered that I can read just the first column of my data using the data.table library.
x <- integer()
for (FileName in FileNames) {
data <- fread(FileName,select=1,skip=1, header=FALSE)
y <- data[1:nrow(data),]
x<-qpcR:::cbind.na(x, y)
rm(data)
}
write.csv(x, file = 'test.csv')
However this seems to treat y as a data value or integer, which throws up the error:
Error in data.table::data.table(...) :
Item 2 has no length. Provide at least one item (such as NA, NA_integer_ etc) to be repeated to match the 11 rows in the longest column. Or, all columns can be 0 length, for insert()ing rows into.
Any help on this would be great thanks.
It turns out, after investigating with typeof(), that I needed to convert the list generated by fread to a numeric vector by adding the following line:
data <- as.numeric(unlist(data))
This then worked
I’m looking to do the following in R.
I have 250+ csv files of chromatographic data structured similarly to the example below, but with 21 rows instead of three:
1 4.708252 BB 9.946890 7.830349 0.01982016 4.684836 4.742056
2 4.970352 BB 1.792341 1.497008 0.01896829 4.945352 5.005390
3 6.393414 BB 6.599891 5.309925 0.01950091 6.368413 6.428723
What I want to do is read a subset of the data in all 250 files into a single data frame, which is easy enough — but I also need to restructure it a fair bit.
Every row in the table above is a peak. I only want the data from the first and fourth columns (which are ‘peak number’ and ‘area under the peak’, respectively), and in the output I need to make each peak an individual column, rather than a row as above, with the peak number as the header. Finally, I want to create a new column where each row (that is, the data from each individual csv file) is given the same name as the csv file name.
So, imagine I have 3 files: ABC1.csv, ABC2.csv, and ABC3.csv. Each file looks like my example above. I want to automatically take all those files and merge them into a single data frame such as the one below.
ID 1 2 3
ABC1 9.94689 1.792341 6.599891
ABC2 9.76651 1.932332 6.600022
ABC3 8.99193 2.556471 6.718934
I hope I’ve made this clear enough. I’ve been able to manage most of the steps but haven’t been successful writing them into a single script. And I have no idea how, if there is any way, to make the file name into a variable.
Cheers
I am assuming the working directory is set to where the files are. Then you can get the list of files below.
filenames <- list.files()
Have a helper function to read a file and keep just columns 1 and 4.
readdata <- function(filename) {
df <- read.csv(filename)
vec <- df[, 4]
names(vec) <- df[, 1]
return(vec)
}
Loop over all of the files and rbind them
result <- do.call(rbind, lapply(filenames, readdata))
Name them as you like
row.names(result) <- filenames
The following code can probably be of some help, though the file name handling is still not working properly:
path <- "C:\\Users\\Vidyut\\"
filenames <- list.files(path = path,pattern = ".csv")
l <- data.frame(ID=character(),col1=numeric(),col2=numeric(),col3=numeric(),stringsAsFactors=FALSE)
for (i in filenames) {
#i = filenames[1]
full = paste(path,i,sep="")
m <- read.csv(full, header=F)
# extract the subset of rows required from each file
# m <- m[c(),]
n<- m[,c(1,4)]
y <- gsub('.csv','',i)
print("y=")
print(y)
d <- list(ID=as.character(y),col1=n[1,2],col2=n[2,2],col3=n[3,2])
print("d=")
print(d)
l <- rbind.data.frame(l,d)
print("l=")
print(l)
}
Mind you, this is not very pretty code - just something hacked together to get the job done (visible from the multiple print lines scattered across).
Here's a solution for you. This only works if we can assume that there are exactly 21 peaks in each file and they are in order 1:21. If that's not the case a few changes to the code should remedy this.
folder = "c:/temp/"
files <- dir(folder)
first_loop <- TRUE
for (file in files) {
# Read one file, only the first and fourth columns
temp <- read.csv(file=paste0(folder,file),
header = FALSE,
colClasses = c("integer", "NULL", "NULL", "numeric", "NULL", "NULL", "NULL", "NULL"))
# Transpose the data
temp <- data.frame(t(temp))
# Remove the peak number
temp <- temp[2,]
# Concatenate the dataframes together
temp$file <- file
if (first_loop) {
data <- temp
first_loop <- FALSE
} else {
data <- rbind(data, temp)
}
}
data
Within a for loop, I am trying to run a function between two columns of data in my data frame, and move to another data set on every iteration of the loop. I would like to collect the output of every iteration into one vector of answers.
I can't get past the following errors (listed below my code), depending on whether I add or remove row.names = NULL in the data <- read.csv... part of the following code (line 4 of the for-loop):
** Edited to include directory references, where the error ultimately was:
corr <- function(directory, threshold = 0) {
source("complete.R")
# The source() call above, together with my (unshown) directory organization, was where my error was
lookup <- complete("specdata")
setwd(file.path(getwd(), directory))
files <- list.files(full.names = TRUE) # read file names
len <- length(files)
answer2 <- vector("numeric")
answer <- vector("numeric")
dataN <- data.frame()
for (i in 1:len) {
if (lookup[i,"nobs"] > threshold){
# TRUE -> read that file, remove the NA data and add to the overall data frame
data <- read.csv(file = files[i], header = TRUE, sep = ",")
#remove incomplete
dataN <- data[complete.cases(data),]
#If yes, compute the correlation and assign its results to an intermediate vector.
answer<-cor(dataN[,"sulfate"],dataN[,"nitrate"])
answer2 <- c(answer2,answer)
}
}
setwd("../")
return(answer2)
}
1) Error in read.table(file = file, header = header, sep = sep, quote = quote, :
duplicate 'row.names' are not allowed
vs.
2) Error in `[.data.frame`(data, , 2:3) : undefined columns selected
What I've tried
referring to the column names directly "colA"
initializing data and dataN to empty data.frames before the for loop
initializing answer2 to an empty vector
Getting a better understanding of how vectors, matrices and data.frames work with each other
Thank you!
My problem was that the function .R file I was referencing in the code above was in the same directory as the data files I was looping through and analyzing. My "files" vector had an incorrect length because it also picked up the other .R function file I had made and referenced earlier in the function. I believe this R file is what caused the 'undefined columns' error.
I apologize, I ended up not even putting up the right area of code where the problem lay.
Key Takeaway: You can always move between directories within a function! In fact, it may be very necessary if you want to perform a function on all the contents of a directory of interest
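A related option, sketched here under assumptions, is to restrict list.files() to csv files, so that a stray .R file in the data directory is never picked up, and to ask for full paths so that setwd() is not needed. The directory and file names below are made up purely for the demonstration:

```r
# Build a throwaway directory holding two data files and one script
demo_dir <- file.path(tempdir(), "specdata_demo")   # hypothetical location
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("001.csv", "002.csv", "complete.R")))

# pattern keeps only names ending in .csv; full.names returns usable paths
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
basename(files)
length(files)
```

With this, length(files) counts only the data files, so it lines up with the rows of the lookup table.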
One approach:
# get the list of file names
files <- list.files(path='~',pattern='\\.csv$',full.names = TRUE)
# load all files
list.data <- lapply(files,read.csv, header = TRUE, sep = ",", row.names = NULL)
# remove rows with NAs
complete.data <- lapply(list.data,function(d) d[complete.cases(d),])
# compute correlation of the 2nd and 3rd columns in every data set
answer <- sapply(complete.data,function(d) cor(d[,2],d[,3]))
The same idea, but with a slightly different implementation:
cr <- function(fname) {
d <- read.csv(fname, header = TRUE, sep = ",", row.names = NULL)
dc <- d[complete.cases(d),]
cor(dc[,2],dc[,3])
}
answer2 <- sapply(files,cr)
example of CSV files:
# ==> a.csv <==
# a,b,c,d
# 1,2,3,4
# 11,12,13,14
# 11,NA,13,14
# 11,12,13,14
#
# ==> b.csv <==
# A,B,C,D
# 101,102,103,104
# 101,102,103,104
# 11,12,13,14