I'm fairly new to working with R but am trying to get this done. I have dozens of ENVI spectral datasets stored in a directory. Each dataset is separated into two files. They all follow the same naming convention, i.e.:
ID_YYYYMMDD_350-2500nm.esl
ID_YYYYMMDD_350-2500nm.hdr
The task is to read the dataset, add two columns (ID and date, taken from the filename), and store the result in a *.csv file. I got this to work for a single file (hardcoded):
library(caTools)
setwd("D:/some/path/software_scripts")
### filename without extension
name <- "011a_20100509_350-2500nm"
### split filename in area-id and date
flaeche <- substr(name, 1, 4)
date <- as.Date(substr(name, 6, 13), "%Y%m%d")
### get values from ENVI-file in a matrix
spectrum <- read.ENVI(paste0(name, ".esl"), headerfile = paste0(name, ".hdr"))
### add columns
spectrum <- cbind(Flaeche = flaeche, Datum = as.character(date), spectrum)
### CSV-Dataset with all values
write.csv(spectrum, file = paste0(name, ".csv"))
I want to combine all available files into one *.csv file. I know that I have to use list.files, but I have no idea how to apply the read.ENVI function to each file and append the resulting matrices to the CSV as I go.
Update:
library(caTools)
setwd("D:/some/path/mean")
files <- list.files() # change or leave totally empty if setwd() put you in the right spot
all_names <- sub("^([^.]*).*", "\\1", files) # strip off extensions
name <- unique(all_names) # get rid of duplicates from .esl and .hdr
# wrap your existing code in a function
mungeENVI <- function(name) {
  # split filename into area-id and date
  flaeche <- substr(name, 1, 4)
  date <- as.Date(substr(name, 6, 13), "%Y%m%d")
  # get values from ENVI-file in a matrix
  spectrum <- read.ENVI(paste0(name, ".esl"), headerfile = paste0(name, ".hdr"))
  # add columns
  spectrum <- cbind(Flaeche = flaeche, Datum = as.character(date), spectrum)
  return(spectrum)
}
# use lapply to 'loop' over each name
list_of_ENVIs <- lapply(name, mungeENVI) # returns a list
# use do.call(rbind, x) to turn it into a big data.frame
final_df <- do.call(rbind, list_of_ENVIs)
# now write output
write.csv(final_df, "all_results.csv")
You can find a sample dataset here: Sample dataset
I work with a lot of lab data where I can rely on the output files being in a reliable format (same column order, column names, header format, etc.), so this assumes that your ENVI files are similar to that. If your files are not like that, I'm happy to help with that too; I'd just need to see a dummy file or two.
Anyways here's the idea:
library(caTools)
library(lubridate)
library(magrittr)
setwd("~/Binfo/TST/Stack/") # adjust as needed
files <- list.files("data/", full.name = T) # adjust as needed
all_names <- gsub("\\.\\D{3}", "", files) # strip off extensions
names1 <- unique(all_names) # get rid of duplicates
# wrap your existing code in a function
mungeENVI <- function(name) {
  # pull the area-id (e.g. "011a") and the date string out of the file path
  f <- gsub(".*\\/(\\d{3}\\D)_.*", "\\1", name)
  d <- gsub(".*_(\\d+)_.*", "\\1", name) %>% ymd()
  # get values from ENVI-file in a matrix
  spectrum <- read.ENVI(paste0(name, ".esl"), headerfile = paste0(name, ".hdr"))
  # add columns
  spectrum <- cbind(Flaeche = f, Datum = as.character(d), spectrum)
  return(spectrum)
}
# use lapply to 'loop' over each name
list_of_ENVIs <- lapply(names1, mungeENVI) # returns a list
# use do.call(rbind, x) to turn it into a big data.frame
final_df <- do.call(rbind, list_of_ENVIs)
# now write output
write.csv(final_df, "data/all_results.csv")
Let me know if you have any problems and we can go from there. Cheers.
I edited my answer a bit. I think the problem you were hitting is in list.files(): it should have had the argument full.names = TRUE so that the returned paths include the directory. I also adjusted your parsing method to be a little more defensive and use grep capture expressions. I tested the code with your two example files (four files, really) and it built out a large matrix (66,743 elements). I also used lubridate; I think it's a better way to work with dates and times.
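To see why full.names matters, here is a quick sketch (the "data" directory name is just the one assumed in the code above):
# without full.names you get bare file names, which read.ENVI cannot locate
# unless the working directory happens to be "data" itself
list.files("data")                     # "011a_20100509_350-2500nm.esl" ...
list.files("data", full.names = TRUE)  # "data/011a_20100509_350-2500nm.esl" ...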
Related
I have a folder with several .csv files containing raw data with multiple rows and 39 columns (x obs. of 39 variables), which have been read into R as follows:
# Name path containing .csv files as folder
folder = ("/users/.../");
# Find the number of files in the folder
file_list = list.files(path=folder, pattern="*.csv")
# Read files in the folder
for (i in 1:length(file_list))
{
assign(file_list[i],
read.csv(paste(folder, file_list[i], sep='')))
}
I want to find the mean of a specific column in each of these .csv files and save it in a vector as follows:
for (i in 1:length(file_list))
{
clean = na.omit(file_list[i])
ColumnNameMean[i] = mean(clean["ColumnName"])
}
When I run the above fragment of code, I get the error "argument is not numeric or logical: returning NA". This happens in spite of attempting to remove the NA values using na.omit. Using complete.cases,
clean = file_list[i][complete.cases(file_list[i]), ]
I get the error "incorrect number of dimensions", even though the number of columns hasn't been explicitly stated.
How do I fix this?
Edit: corrected clean[i] to clean (and vice versa). Ran code, same error.
Sample .csv file
There are several things wrong with your code.
folder = ("/users/.../"); You don't need the parenthesis and you definitely do not need the semi-colon. The semi-colon separates instructions, does not end them. So, this instruction is in fact two instructions, the assigment of a string to folder and between the ; and the newline the NULL instruction.
You are creating many objects in the global environment in the for loop where you assign the return value of read.csv. It is much better to read in the files into a list of data.frames.
na.omit can remove all rows from the data.frames. And there is no need to use it since mean has a na.rm argument.
You compute the mean values of each column of each data.frame. Though the data.frames are processed in a loop, the columns are not and R has a fast colMeans function.
You mistake [ for [[. The correct ways would be either clean[, "ColumnName"] or clean[["ColumnName"]]; the short sketch below shows the difference.
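A minimal illustration, using a made-up one-column data.frame (the column name is just a placeholder):
df <- data.frame(ColumnName = c(1, 2, NA))
mean(df["ColumnName"])    # df["ColumnName"] is a one-column data.frame, so mean()
                          # warns "argument is not numeric or logical" and returns NA
mean(df[["ColumnName"]])  # NA: [[ extracts the numeric vector, but the NA remains
mean(df[["ColumnName"]], na.rm = TRUE)  # 1.5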
Now the code, revised. I present several alternatives to compute the columns' means.
First, read all files in one go. I set the working directory before reading them and reset after.
folder <- "/users/.../"
file_list <- list.files(path = folder, pattern = "^muse.*\\.csv$")
old_dir <- setwd(folder)
df_list <- lapply(file_list, read.csv)
setwd(old_dir)
Now compute the means of three columns.
cols <- c("Delta_TP9", "Delta_AF7", "Theta_TP9")
All_Means <- lapply(df_list, function(DF) colMeans(DF[cols], na.rm = TRUE))
names(All_Means) <- file_list
Compute the means of all columns starting with Delta or Theta. Get those columns names with grep.
df_names <- names(df_list[[1]])
cols2 <- grep("^Delta", df_names, value = TRUE)
cols2 <- c(cols2, grep("^Theta", df_names, value = TRUE))
All_Means_2 <- lapply(df_list, function(DF) colMeans(DF[cols2], na.rm = TRUE))
names(All_Means_2) <- file_list
Finally, compute the means of all numeric columns. Note that this time the index vector cols3 is a logical vector.
cols3 <- sapply(df_list[[1]], is.numeric)
All_Means_3 <- lapply(df_list, function(DF) colMeans(DF[cols3], na.rm = TRUE))
names(All_Means_3) <- file_list
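If you would rather have those per-file means in one table instead of a named list, a small follow-up using the objects defined above:
# stack the named list of mean vectors into a matrix: one row per file,
# one column per numeric column, with file names as row names
All_Means_Table <- do.call(rbind, All_Means_3)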
Try it like this:
setwd("U:/Playground/StackO/")
# List the .csv files in the folder
file_list = list.files(path=getwd(), pattern="\\.csv$")
# Read files in the folder
for (i in 1:length(file_list)){
assign(file_list[i],
read.csv(file_list[i]))
}
ColumnNameMean <- numeric(length(file_list)) # rep(NULL, n) would just collapse to NULL
for (i in 1:length(file_list)){
clean = get(file_list[i])
ColumnNameMean[i] = mean(clean[,"Delta_TP10"])
}
ColumnNameMean
#> [1] 1.286201
I used get to retrieve the data.frame; otherwise file_list[i] just returns a string. I think this is an idiom used in other languages like Python. I tried to stay true to the way you were doing it, but there are easier ways than indexing like this.
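A tiny sketch of how assign() and get() pair up (the file name and column here are just placeholders):
# assign() binds an object to a name given as a string; get() looks the
# string up again. file_list[i] alone is only the string, not the data.
assign("male.csv", data.frame(Delta_TP10 = c(1.2, 1.4)))
class("male.csv")       # "character": just the string itself
class(get("male.csv"))  # "data.frame": the object stored under that name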
Maybe this:
lapply(list.files(path = getwd(), pattern = "\\.csv$"), function(f) {
  dt <- read.csv(f)
  mean(dt[, "Delta_TP10"])
})
PS: Be careful with na.omit(): it removes ALL rows containing any NA, which in your case is your whole data.frame, since the Elements column is entirely NA.
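A quick demonstration of that trap, with toy columns standing in for the real ones:
# one all-NA column makes na.omit() drop every row
df <- data.frame(Delta_TP10 = c(1.2, 1.4), Elements = NA)
nrow(na.omit(df))  # 0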
Apologies if this may seem simple, but I can't find a workable answer anywhere on the site.
My data is in the form of a csv, with the filename being a name and a number. It's not quite as simple as having files with a generic word and an increasing number...
I've achieved exactly what I want to do with just one file, but the issue is that there are a couple of hundred to do, so changing the name each time is quite tedious.
Posting my original single-batch code here in the hopes someone may be able to ease the growing tension of failed searches.
# set workspace
getwd()
setwd(".../Desktop/R Workspace")
# bring in original file, skipping first four rows
Person_7<- read.csv("PersonRound7.csv", header=TRUE, skip=4)
# cut matrix down to 4 columns
Person7<- Person_7[,c(1,2,9,17)]
# give columns names
colnames(Person7) <- c("Time","Spare", "Distance","InPeriod")
# find the empty rows, create new subset. Take 3 rows away for empty lines.
nullrow <- (which(Person7$Spare == "Velocity"))-3
Person7 <- Person7[(1:nullrow), ]
#keep 3 needed columns from matrix
Person7<- Person7[,c(1,3,4)]
colnames(Person7) <- c("Time","Distance","InPeriod")
#convert distance and time columns to numeric
options(digits=9)
Person7$Distance <- as.numeric(as.character(Person7$Distance))
Person7$Time <- as.numeric(as.character(Person7$Time))
#Create the differences column for distance
Person7$Diff <- c(0, diff(Person7$Distance))
...whole heap of other stuff...
#export Minutes to an external file
write.csv(Person7_maxs, ".../Desktop/GPS Minutes/Person7.csv")
So the three-part issue is as follows:
1. I can create a list or vector to read through the file names, but not a dataframe for each, each time (if that's even a good way to do it).
2. The variable names throughout the code will need to change: instead of just being "Person1" and "Person2", they'll be more like "Johnny1" and "Lou23".
3. Each resulting dataframe needs to be exported to its own csv file with the original name.
Taking any and all suggestions on board - struggling with this one.
Cheers!
Consider using one list of the ~200 dataframes. There is no need for separate named objects flooding the global environment (though list2env is still shown below). Hence, use lapply() to iterate through all csv files in the working directory, then simply name each element of the list after the corresponding file's basename:
setwd(".../Desktop/R Workspace")
files <- list.files(path=getwd(), pattern="\\.csv$")
# CREATE DATA FRAME LIST
dfList <- lapply(files, function(f) {
df <- read.csv(f, header=TRUE, skip=4)
df <- setNames(df[c(1,2,9,17)], c("Time","Spare","Distance","InPeriod"))
# ...same code referencing temp variable, df
write.csv(df_max, paste0(".../Desktop/GPS Minutes/", f))
return(df)
})
# NAME EACH ELEMENT TO CORRESPONDING FILE'S BASENAME
dfList <- setNames(dfList, gsub("\\.csv$", "", files))
# REFERENCE A DATAFRAME WITH LIST INDEXING
str(dfList$PersonRound7) # PRINT STRUCTURE
View(dfList$PersonRound7) # VIEW DATA FRAME
dfList$PersonRound7$Time # OUTPUT ONE COLUMN
# OUTPUT ALL DFS TO SEPARATE OBJECTS (THOUGH NOT NEEDED)
list2env(dfList, envir = .GlobalEnv)
I have tried to create a for loop that does something for each of 4 csv files, similar to this but with more files.
dat1<- read.csv("female.csv", header =T)
dat2<- read.csv("male.csv", header =T)
for (i in 1:2) {
message("Female, Male")
Temp <- dat[i][(dat[i]$NAME == "Temp"), ]
Temp <- Temp[complete.cases(Temp)]
print(mean(Temp$MEAN))
}
However, I get an error:
Error in Temp$MEAN : $ operator is invalid for atomic vectors
Not sure why this isn't working. Any help would be appreciated for looping through csv files!
Personally, I think the easiest way to do this is with the plyr package:
library(plyr)
myFiles <- c("male.csv", "female.csv")
dat <- ldply(myFiles, read.csv)
dat <- dat[complete.cases(dat), ]
mean(dat$MEAN)
The way this works is that you first create a vector of file names. Then the ldply() function performs the function read.csv() on the vector of filenames, and converts the output automatically to a data.frame. Then you do the complete.cases() and mean() in the usual way.
Edit:
But if you want the mean of each file then here is one way of doing it:
# create a vector of files
myFiles <- c("male.csv", "female.csv")
# create a function that properly handles ONLY ONE ELEMENT
readAndCalc <- function(x){ # pass in the filename
tmp <- read.csv(x) # read the single file
tmp <- tmp[complete.cases(tmp), ] # complete.cases()
mean(tmp$MEAN) # mean
}
x <- "male.csv"
readAndCalc(x) # test with ONE file
sapply(myFiles, readAndCalc) # run with all your files
The way this works is that you first create a vector of filenames, just like before. Then you create a function that processes ONLY ONE file at a time. You can test that the function works on a single file, and finally run it on all your files with sapply(). Hope that helps.
I have 100 files, from each of which I want to extract the 4th column (total_volume) containing 100,000 rows, and put them together in one big file that then contains 100 columns of 100,000 rows each. I was trying something with the following script:
setwd("/run/media/mydirectory")
library(data.table)
fileNames <- Sys.glob("*.txt.csv")
#read file in fileNames
for (fileName in fileNames) {
dataDf <- read.delim(fileName, header = FALSE)
# remove columns with only example values
dataDf <- dataDf[, -(7:14)]
# convert data frame to data table
dataDt <- data.table(dataDf)
# set column names
setnames(dataDt, c("mcs", "cell_type", "cell_number", "total_volume"))
#new file with only total volume
total_volume <- dataDt$total_volume
#export file
write.table(dataDt$total_volume, file = "total_volume20.csv")
}
But what I get is that all columns are superimposed, and the resulting .csv file holds the 4th column of only the last file. I would like the columns to be next to each other instead of superimposed. How could I do that?
Thanks in advance!
P.S. Obviously the overwriting thing happens because I used a loop. However, I am not sure how else to combine everything, so suggestions are very welcome!
You haven't given us a reproducible example, so I can't test this properly, but this should give you a table with one column for total volume from each of the files you get from the call to Sys.glob(). The idea is to make a function that does what you want for one file; use lapply() to make a list with the results of that function for each file in your target environment; then cbind the columns in that list into one big table.
setwd("/run/media/mydirectory")
library(data.table)
fileNames <- Sys.glob("*.txt.csv")
# For the function, I'm reproducing your code. You could do in fewer lines and without
# data.table if you like, but maybe there's a reason you chose this approach.
extractor <- function(fileName) {
require(data.table)
dataDf <- read.delim(fileName, header = FALSE)
dataDf <- dataDf[, -(7:14)]
dataDt <- data.table(dataDf)
setnames(dataDt, c("mcs", "cell_type", "cell_number", "total_volume"))
total_volume <- dataDt$total_volume
return(total_volume)
}
total.list <- lapply(fileNames, extractor)
total.table <- Reduce(cbind, total.list)
write.table(total.table, file = "total_volume20.csv")
Or do that last bit in one line if you like:
write.table(Reduce(cbind, lapply(Sys.glob("*.txt.csv"), extractor)), file="total_volume20.csv")
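As a side note on the design choice: do.call() hands the whole list to cbind() in a single call, which gives the same result here while avoiding Reduce()'s pairwise binding:
total.table <- do.call(cbind, total.list)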
I have written a loop in R (still learning). My goal is to pick the max AvgConc and the max Roll_TotDep from each file in the loop, and then have two data frames that each contain all the max values picked from the individual files. The code I wrote only saves the results of the last iteration (for one single file)... Can someone point me in the right direction to revise my code, so I can append the result of each new iteration to the previous ones? Thanks!
data.folder <- "D:\\20150804"
files <- list.files(path=data.folder)
for (i in 1:length(files)) {
sub <- read.table(file.path(data.folder, files[i]), header=T)
max1Conc <- sub[which.max(sub$AvgConc),]
maxETD <- sub[which.max(sub$Roll_TotDep),]
write.csv(max1Conc, file= "max1Conc.csv", append=TRUE)
write.csv(maxETD, file= "maxETD.csv", append=TRUE)
}
The problem is that max1Conc and maxETD each hold only a single result: they are overwritten on every iteration instead of being accumulated in an object capable of storing more than one value (such as a list).
To fix this:
maxETD <- list()
max1Conc <- list()
for (i in 1:length(files)) {
  sub <- read.table(file.path(data.folder, files[i]), header = TRUE)
  max1Conc[[i]] <- sub[which.max(sub$AvgConc), ]
  maxETD[[i]] <- sub[which.max(sub$Roll_TotDep), ]
}
# combine the collected rows and write each output file once, after the loop
write.csv(do.call(rbind, max1Conc), file = "max1Conc.csv")
write.csv(do.call(rbind, maxETD), file = "maxETD.csv")
The difference here is that the two variables start out as empty lists, each iteration stores its one-row result with [[i]], and the rows are combined with do.call(rbind, ...) and written out once after the loop. (Writing inside the loop would not accumulate anyway: write.csv ignores its append argument, with a warning.)
There are more idiomatic R ways of accomplishing your goal; personally, I suggest you look into learning the apply family of functions. (http://adv-r.had.co.nz/Functionals.html)
I can't directly test the whole thing because I don't have a directory with files like yours, but I tested the parts, and I think this should work as an apply-driven alternative. It starts with a pair of functions: one to ingest a file from your directory, and the other to make a one-row data frame out of the two max values from each of those files:
library(dplyr)
data.folder <- "D:\\20150804"
getfile <- function(filename) {
sub <- read.table(file.path(data.folder, filename), header=TRUE)
return(sub)
}
getmaxes <- function(df) {
rowi <- data.frame(AvgConc.max = max(df[, "AvgConc"]),
                   Roll_TotDep.max = max(df[, "Roll_TotDep"]))
return(rowi)
}
Then it uses a couple of rounds of lapply, embedded in piping courtesy of dplyr, to a) build a list with each data set as an item, b) build a second list of one-row data frames with the maxes from each item in the first list, c) rbind those rows into one big data frame, and d) cbind the filenames to that data frame for reference.
dfmax <- lapply(as.list(list.files(path = data.folder)), getfile) %>%
lapply(., getmaxes) %>%
Reduce(function(...) rbind(...), .) %>%
data.frame(file = list.files(path = data.folder), .)