Undefined Columns Selected v. duplicate 'row.names' are not allowed - r

Within a for loop, I am trying to run a function on two columns of my data frame, moving to another data set on every iteration of the loop. I would like to collect the output of every iteration into one vector of answers.
I can't get past the following errors (listed below my code), which depend on whether I add or remove row.names = NULL in the data <- read.csv... part of the following code (line 4 of the for loop):
** Edited to include directory references, where the error ultimately was:
corr <- function(directory, threshold = 0) {
source("complete.R")
# The line above, together with my directory organization (not shown), is where my error was
lookup <- complete("specdata")
setwd(paste0(getwd(),"/",directory,sep=""))
files <-list.files(full.names="TRUE") #read file names
len <- length(files)
answer2 <- vector("numeric")
answer <- vector("numeric")
dataN <- data.frame()
for (i in 1:len) {
if (lookup[i,"nobs"] > threshold){
# TRUE -> read that file, remove the NA data and add to the overall data frame
data <- read.csv(file = files[i], header = TRUE, sep = ",")
#remove incomplete
dataN <- data[complete.cases(data),]
#If yes, compute the correlation and assign its results to an intermediate vector.
answer<-cor(dataN[,"sulfate"],dataN[,"nitrate"])
answer2 <- c(answer2,answer)
}
}
setwd("../")
return(answer2)
}
1) Error in read.table(file = file, header = header, sep = sep, quote = quote, :
duplicate 'row.names' are not allowed
vs.)
2) Error in [.data.frame(data, , 2:3) : undefined columns selected
What I've tried
referring to the column names directly "colA"
initializing data and dataN to empty data.frames before the for loop
initializing answer2 to an empty vector
Getting a better understanding of how vectors, matrices and data.frames work with each other
** Thank you!**

My problem was that the function .R file referenced in the code above was in the same directory as the data files I was looping through and analyzing. My "files" vector had the wrong length because it also picked up another .R function file that I had made and referenced earlier in the function. I believe this R file is what caused the 'undefined columns' error.
I apologize; I ended up not even posting the part of the code where the problem lay.
Key Takeaway: You can always move between directories within a function! In fact, it may be very necessary if you want to perform a function on all the contents of a directory of interest
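The cause described above (a stray .R script being listed together with the data files) can also be avoided without changing the working directory. The following is a minimal sketch, assuming a throwaway folder under tempdir() with made-up file contents; the folder name specdata_demo and the values are hypothetical, and only list.files, read.csv, complete.cases and cor are relied on.

```r
# Hypothetical folder holding two data files and one stray script
data_dir <- file.path(tempdir(), "specdata_demo")
dir.create(data_dir, showWarnings = FALSE)
write.csv(data.frame(sulfate = c(1, 2, 3), nitrate = c(2, 4, 7)),
          file.path(data_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(5, NA, 1, 4), nitrate = c(1, 2, 3, 9)),
          file.path(data_dir, "002.csv"), row.names = FALSE)
writeLines("# helper script", file.path(data_dir, "complete.R"))

# Without a pattern, the script is listed alongside the data files
all_files <- list.files(data_dir, full.names = TRUE)

# With a regex pattern only the CSV files are returned;
# full.names = TRUE makes setwd() unnecessary
csv_files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)

# One correlation per file, computed on complete rows only
cors <- vapply(csv_files, function(f) {
  d <- read.csv(f)
  d <- d[complete.cases(d), ]
  cor(d$sulfate, d$nitrate)
}, numeric(1))
```

Comparing length(all_files) with length(csv_files) shows the off-by-one that made the original "files" vector disagree with the lookup table.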

One approach:
# get the list of file names
files <- list.files(path='~',pattern='\\.csv$',full.names = TRUE)  # pattern is a regular expression, not a glob
# load all files
list.data <- lapply(files,read.csv, header = TRUE, sep = ",", row.names = NULL)
# remove rows with NAs
complete.data <- lapply(list.data,function(d) d[complete.cases(d),])
# compute correlation of the 2nd and 3rd columns in every data set
answer <- sapply(complete.data,function(d) cor(d[,2],d[,3]))
The same idea, but a slightly different realization:
cr <- function(fname) {
d <- read.csv(fname, header = TRUE, sep = ",", row.names = NULL)
dc <- d[complete.cases(d),]
cor(dc[,2],dc[,3])
}
answer2 <- sapply(files,cr)
example of CSV files:
# ==> a.csv <==
# a,b,c,d
# 1,2,3,4
# 11,12,13,14
# 11,NA,13,14
# 11,12,13,14
#
# ==> b.csv <==
# A,B,C,D
# 101,102,103,104
# 101,102,103,104
# 11,12,13,14

Related

How to loop over on different files and save the output with filename in R?

I have several files with the names RTDFE, TRYFG, FTYGS, WERTS...like 100 files in txt format. For each file, I'm using the following code and writing the output in a file.
name = c("RTDFE")
file1 <- paste0(name, "_filter",".txt")
file2 <- paste0(name, "_data",".txt")
### One
A <- read.delim(file1, sep = "\t", header = FALSE)
#### two
B <- read.delim(file2, sep = "\t", header = FALSE)
C <- merge(A, B, by="XYZ")
nrow(C)
145
Output:
Samples Common
RTDFE 145
Each time, I assign one file name to the variable name, run my code, and write the output to the file. Instead, I want the code to run on all the files in one go and produce the following output. Common is the number of rows of the merged data frame C.
The output I need:
Samples Common
RTDFE 145
TRYFG ...
FTYGS ...
WERTS ...
How to do this? Any help.
How about putting all your names in a single vector, called names, like this:
names<-c("TRYFG","RTDFE",...)
and then feeding each one to a function that reads the files, merges them, and returns the rows
f<-function(n) {
fs = paste0(n,c("_filter", "_data"),".txt")
C = merge(
read.delim(fs[1],sep="\t", header=F),
read.delim(fs[2],sep="\t", header=F), by="XYZ")
data.frame(Samples=n,Common=nrow(C))
}
Then just call this function f on each of the values in names, row-binding the results together:
do.call(rbind, lapply(names, f))
An easy way to create the vector names is like this:
p = "_(filter|data).txt"
names = unique(gsub(p,"",list.files(pattern = p)))
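The name extraction above can be tried without any files on disk. This is a small sketch, assuming a hypothetical vector of file names in place of the list.files() result; it also escapes the dot and anchors the pattern at the end of the name, which is a slightly stricter variant of the pattern shown above rather than the answer's exact code.

```r
# Hypothetical file listing; in practice this would come from list.files()
listed <- c("RTDFE_data.txt", "RTDFE_filter.txt",
            "TRYFG_data.txt", "TRYFG_filter.txt")

# Escape the dot and anchor at the end so only the suffix is stripped
p <- "_(filter|data)\\.txt$"

# Each sample appears twice after stripping, so keep unique values
sample_names <- unique(sub(p, "", listed))
print(sample_names)
```

The result holds one entry per sample, which is what the lapply call over names expects.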
I am making some assumptions here.
The first assumption is that you have all these files in a folder with no other text files (.txt) in this folder.
If so you can get the list of files with the command list.files.
But when doing so you will get both the "_data.txt" and the "_filter.txt" files.
We need a way to extract the basic part of the name.
I use "str_replace" to remove the "_data.txt" and the "_filter.txt" from the list.
But when doing so you will get a list with two entries. Therefore I use the "unique" command.
I store this in "lfiles" that will now contain "RTDFE, TRYFG, FTYGS, WERTS..." and any other file that satisfy the conditions.
After this I run a for loop on this list.
I reopen the files similarly as you do.
I merge by XYZ and I immediately put the results in a data frame.
By using rbind I keep adding results to the data frame "res".
library(stringr)
lfiles=list.files(path = ".", pattern = ".txt")
## we strip, from the files, the "_filter and the data
lfiles=unique( sapply(lfiles, function(x){
x=str_replace(x, "_data.txt", "")
x=str_replace(x, "_filter.txt", "")
return(x)
} ))
res=NULL
for(i in lfiles){
file1 <- paste0(i, "_filter.txt")
file2 <- paste0(i, "_data.txt")
### One
A <- read.delim(file1, sep = "\t", header = FALSE)
#### two
B <- read.delim(file2, sep = "\t", header = FALSE)
res=rbind(res, data.frame(Samples=i, Common=nrow(merge(A, B, by="XYZ"))))  # append to res instead of overwriting it
}
OK, I will assume you have a folder called "data" with files named "RTDFE_filter.txt, RTDFE_data.txt, TRYFG_filter.txt, TRYFG_data.txt", etc. (only and exactly these files).
This code should give a possible way
# save the file names
files = list.files("data", full.names = TRUE)  # full paths, sorted alphabetically
# get indexes for "data" (for "filter" indexes, add 1)
files_data_index = seq(1, length(files), 2) # 1, 3, 5, ...
# loop on indexes
results = lapply(files_data_index, function(i) {
A <- read.delim(files[i+1], sep = "\t", header = FALSE)
B <- read.delim(files[i], sep = "\t", header = FALSE)
C <- merge(A, B, by="XYZ")
# sample name is the part of the file name before the first underscore
samp = strsplit(basename(files[i]), "_")[[1]][1]
com = nrow(C)
return(c(Samples = samp, Common = com))
})
# combine results
do.call(rbind, results)

searching multiple txt files for data and reporting result to new table

I have thousands of txt files containing Mass, %Base data. I need to search each file for a row within a specific mass range. Then, report that row into a new table with the filename as an additional character. The goal is a table of (Mass, %Base, Filename) for all of the text files based on the condition of the search.
Existing File example for file1name.txt:
Mass %Base
100 .1
101 26.2
...
900 0
Goal:
Mass %Base File
375.004 98 file1name
375.003 96 file2name
My current code is:
library(tidyverse)
library(readr)
#setwd to where data is located
setwd("Z:/Dnigra")
#set path where data is located
path <- "Z:/Dnigra"
mc <- 375.3 #mc is the calculated target mass
limit<- 0.1 # the width of the search window
#finds the files with the correct extensions
fs <-list.files(path, pattern=glob2rx("*.txt$"))
for (f in fs){
fname <- file.path(path, f)
df <- read_tsv(fname,col_names=FALSE, skip =1)
#filters the data that includes the target mass
df <- between(mc,limit,limit)
#create new data based on contents
allSpectra <- data.frame(df,f)
#write new data to sep file
write.table(allSpectra ,"allwobble.csv",
append= T,
sep=",",
row = F
)
}
The end result is a table with:
df f
FALSE filename
Also errors:
Parsed with column specification: cols( X1 = col_character(), X2 = col_character() ) Warning: 2536 parsing failures.
I think there may be a few things here to address:
First, with read_tsv you might want to specify the column types as double if appropriate, so values are not read in as character strings. This would affect your ability to filter and subset based on Mass.
Next, the between statement has the syntax of:
between(x, left, right)
where x <= right and x >= left. If you want to make sure your mc value is between 375.2 and 375.4 you might want between(X1, mc-limit, mc+limit) instead. Note that since no header was read in, the Mass variable is assumed first as X1.
When you use write.table and append, you might want to set col.names to FALSE (or include header on first write).
Hope this is helpful to you.
for (f in fs){
fname <- file.path(path, f)
df <- read_tsv(fname, col_names = FALSE, skip=1, col_types = "dd")
#filters the data that includes the target mass
df <- filter(df, between(X1, mc-limit, mc+limit))
#create new data based on contents
allSpectra <- data.frame(df,f)
#write new data to sep file
write.table(allSpectra ,"allwobble.csv",
append= T,
sep=",",
row = F,
col.names = FALSE
)
}
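The between() correction can be checked in isolation. This is a sketch in base R only, assuming a small made-up spectrum; mc and limit come from the question, the data values and the file label are hypothetical, and the comparison written out by hand is the same inclusive test that between(x, left, right) is described as performing above.

```r
mc <- 375.3     # target mass, taken from the question
limit <- 0.1    # half-width of the search window

# The original call between(mc, limit, limit) asks whether 375.3 lies
# between 0.1 and 0.1, which is a single FALSE
original_test <- mc >= limit & mc <= limit

# Hypothetical spectrum with columns named as read_tsv names them
spec <- data.frame(X1 = c(100, 375.25, 375.38, 375.6, 900),
                   X2 = c(0.1, 98, 96, 3, 0))

# Row-wise test on the mass column against mc - limit and mc + limit
in_window <- spec$X1 >= mc - limit & spec$X1 <= mc + limit
hits <- spec[in_window, ]
hits$File <- "file1name"
print(hits)
```

The single FALSE in original_test is consistent with the one-row "FALSE filename" table reported in the question.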
Thanks @Ben. I had gotten to that point last night and had added a tolerance calculation. The "dd" definitely helped but required a col_names argument to get past another error. The final code is below. A parsing error comes up, but it does what it needs to do!
tol<- .02 # the width of the search window
mmneg <- mc - tol
mmpos <- mc + tol
#finds the files with the correct extensions
fs <-list.files(path, pattern=glob2rx("*.txt$"))
for (f in fs){
fname <- file.path(path, f)
df <- read_tsv(fname, skip =1,skip_empty_rows = T, col_types="dd", col_names=c("X1","X2"))
#filters the data that excludes the offending peak
df<- filter(df,between(X1,mmneg,mmpos))
#create new data based on contents
allSpectra <- data.frame(df,f)
#write new data to sep file
write.table(allSpectra ,"Caviunin_20_.csv",
append= T,
sep=",",
row = F,
col.names = F
)
}

Write File name as column header

I have a loop that looks at a group of files, takes the 4th column of each and combines them together. I would like to use the filename that comes after the "Output" folder as the header of each column.
files2 <- list.files(path="c:/Users/~/Output",pattern="*.csv", full.names=TRUE, recursive=FALSE)
newdata <- (1:51)
for(ii in files2){
titlename2<- tools::file_path_sans_ext(basename(files2))
#genes <- read.csv(files2[1], header=True)[,1] # gene names
mydata2 <-read.csv(ii, header = T, stringsAsFactors=FALSE)
mydata2<- mydata2[,4]
newdata <- cbind(newdata,mydata2)
colnames(newdata)= c(files2)
}
However, when I try and apply the filename I get the following error:
Error in dimnames(x) <- dn :
length of 'dimnames' [2] not equal to array extent
How do I apply the file name as the column header?
Thanks in Advance.
The problem comes from the fact that colnames(newdata) and c(files2) do not have the same length.
You could for example move colnames(newdata) = c(files2) after the for loop and replace c(files2) by something like c("ID", files2) (as you have length(files2) + 1 columns).
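One way to keep the number of names and the number of columns in step is to build all columns first and name them once. This is a sketch under assumptions: the folder output_demo, the files sampleA.csv and sampleB.csv and their contents are invented for illustration, and the leading ID column from the question is left out for brevity.

```r
# Hypothetical output folder with two small CSV files
out_dir <- file.path(tempdir(), "output_demo")
dir.create(out_dir, showWarnings = FALSE)
for (nm in c("sampleA", "sampleB")) {
  write.csv(data.frame(a = 1:3, b = 4:6, c = 7:9, d = c(10, 20, 30)),
            file.path(out_dir, paste0(nm, ".csv")), row.names = FALSE)
}
files2 <- list.files(out_dir, pattern = "\\.csv$", full.names = TRUE)

# Read the 4th column of each file into a list, then bind once
cols <- lapply(files2, function(f) read.csv(f, stringsAsFactors = FALSE)[, 4])
newdata <- do.call(cbind, cols)

# One name per column: the file name without directory or extension
colnames(newdata) <- tools::file_path_sans_ext(basename(files2))
print(newdata)
```

Because the names are assigned after the binding is finished, their count always equals the number of columns.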

Mean values from multiple csv to data frame

After having searched for help in different threads on this topic, I still have not become wiser. Therefore: Here comes another question on looping through multiple data files...
OK. I have multiple CSV files in one folder containing 5 columns of data. The filenames are as follows:
Moist yyyymmdd hh_mm_ss.csv
I would like to create a script that reads and processes the CSV files one by one, doing the following steps:
1) load file
2) check number of rows and exclude file if less than 3 registrations
3) calculate mean value of all measurements (=rows) for column 2
4) calculate mean value of all measurements (=rows) for column 4
5) output the filename timestamp, mean column 2 and mean column 4 to a data frame,
I have written the following function
moist.each.mean <- function() {
library("tcltk")
directory <- tk_choose.dir("","Choose folder for Humidity data files")
setwd(directory)
filelist <- list.files(path = directory)
filetitles <- regmatches(filelist, regexpr("[0-9].*[0-9]", filelist))
mdf <- data.frame(timestamp=character(), humidity=numeric(), temp=numeric())
for(i in 1:length(filelist)){
file.in[[i]] <- read.csv(filelist[i], header=F)
if (nrow(file.in[[i]]<3)){
print("discard")
} else {
newrow <- c(filetitles[[i]], round(mean(file.in[[i]]$V2),1), round(mean(file.in[[i]]$V4),1))
mdf <- rbind(mdf, newrow)
}
}
names(mdf) <- c("timestamp", "humidity", "temp")
}
But I keep getting an error:
Error in `[[<-.data.frame`(`*tmp*`, i, value = list(V1 = c(10519949L, :
replacement has 18 rows, data has 17
Any ideas?
Thx, kruemelprinz
I'd also suggest to use (l)apply... Here's my take:
getMeans <- function(fpath,runfct,
target_cols = c(2),
sep=",",
dec=".",
header = T,
min_obs_threshold = 3){
f <- list.files(fpath)
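fcsv <- f[grepl("\\.csv$",f)]  # the backslash must be doubled inside an R string
fcsv <- file.path(fpath,fcsv)  # works whether or not fpath ends with a slash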
csv_list <- lapply(fcsv,read.table,sep = sep,
dec = dec, header = header)
csv_rows <- sapply(csv_list,nrow)
rel_csv_list <- csv_list[!(csv_rows < min_obs_threshold)]
lapply(rel_csv_list,function(x) colMeans(x[,target_cols,drop=FALSE]))  # drop=FALSE keeps a single column two-dimensional
}
Also with that kind of error message, the debugger might be very helpful.
Just run debug(moist.each.mean) and execute the function stepwise.
Here's a slightly different approach. Use lapply to read each csv file, exclude it if necessary, otherwise create a summary. This gives you a list where each element is a data frame summary. Then use rbind to create the final summary data frame.
Without a sample of your data, I can't be sure the code below exactly matches your problem, but hopefully it will be enough to get you where you want to go.
# Get vector of filenames to read
filelist=list.files(path=directory, pattern="csv")
# Read all the csv files into a list and create summaries
df.list = lapply(filelist, function(f) {
file.in = read.csv(f, header=FALSE, stringsAsFactors=FALSE)  # header=FALSE so the columns are named V1, V2, ... as used below
# Set to empty data frame if file has less than 3 rows of data
if (nrow(file.in) < 3) {
print(paste("Discard", f))
# Otherwise, capture file timestamp and summarise data frame
} else {
data.frame(timestamp=substr(f, 7, 22),
humidity=round(mean(file.in$V2),1),
temp=round(mean(file.in$V4),1))
}
})
# Bind list into final summary data frame (excluding the list elements
# that don't contain a data frame because they didn't have enough rows
# to be included in the summary)
result = do.call(rbind, df.list[sapply(df.list, is.data.frame)])
One issue with your original code is that you create a vector of summary results rather than a data frame of results:
c(filetitles[[i]], round(mean(file.in[[i]]$V2),1), round(mean(file.in[[i]]$V4),1)) is a vector with three elements. What you actually want is a data frame with three columns:
data.frame(timestamp=filetitles[[i]],
humidity=round(mean(file.in[[i]]$V2),1),
temp=round(mean(file.in[[i]]$V4),1))
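The difference between the two forms can be seen on made-up values. This is a small sketch, assuming hypothetical timestamps and means; it only illustrates that c() coerces everything to one type, while one-row data frames keep a type per column when bound together.

```r
# c() coerces mixed values to a single type, here character
as_vector <- c("20150101 10_00_00", round(45.26, 1), round(21.04, 1))
print(class(as_vector))

# One-row data frames keep the numeric columns numeric
rows <- list(
  data.frame(timestamp = "20150101 10_00_00", humidity = 45.3, temp = 21.0),
  data.frame(timestamp = "20150101 11_00_00", humidity = 47.1, temp = 20.5)
)
mdf <- do.call(rbind, rows)
str(mdf)
```

With the vector form, the means end up as strings and would need converting back before any further arithmetic.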
Thanks for the suggestions using lapply. This is definitely of value as it saves a whole lot of code as well! Meanwhile, I managed to fix my original code as well:
library("tcltk")
# directory: path to csv files
directory <-
tk_choose.dir("","Choose folder for Humidity data files")
setwd(directory)
filelist <- list.files(path = directory)
filetitles <-
regmatches(filelist, regexpr("[0-9].*[0-9]", filelist))
mdf <- data.frame()
for (i in 1:length(filelist)) {
file.in <- read.csv(filelist[i], header = F, skipNul = T)
if (nrow(file.in) < 3) {
print("discard")
} else {
newrow <-
matrix(
c(filetitles[[i]], round(mean(file.in$V2, na.rm=T),1), round(mean(file.in$V4, na.rm=T),1)), nrow = 1, ncol =
3, byrow = T
)
mdf <- rbind(mdf, newrow)
}
}
names(mdf) <- c("timestamp", "humidity", "temp")
Only I did not get it to work as a function because then I would only have one row in mdf containing the last file data. Somehow it did not add rows but overwrite row 1 with each iteration. But using it without a function wrapper worked fine...

Calculate the mean of one column from several CSV files

I have over 300 CSV files in a folder (named 001.csv, 002.csv and so on). Each contains a data frame with a header. I am writing a function that will take three arguments: the location of the files, the name of the column you want to calculate the mean (inside the data frames), and the files to use in the calculation.
Here is my function:
pollutantmean2 <- function(directory = getwd(), pollutant, id = 1:332) {
# add one or two zeros to ID so that they match the CSV file names
filenames <- sprintf("%03d.csv", id)
# path to specdata folder
# if no path is provided, default is working directory
filedir <- file.path(directory, filenames)
# get the data from selected ID or IDs from the specified path
dataset <- read.csv(filedir, header = TRUE)
# calculate mean removing all NAs
polmean <- mean(dataset$pollutant, na.rm = TRUE)
# return mean
polmean
}
It appears there are two things wrong with my code. To break it down, I separated the function into two separate function to handle the two tasks: 1) get the required files and 2) calculate the mean of the desired column (aka pollutant).
Task 1: Getting the appropriate files - It works as long as I only want one file. If I select a range of files, such as 1:25 I get an error message that says Error in file(file, "rt") : invalid 'description' argument. I have Googled this error but still have no clue how to fix it.
# function that obtains csv files and stores them
getfile <- function(directory = getwd(), id) {
filenames <- sprintf("%03d.csv", id)
filedir <- file.path(directory, filenames)
dataset <- read.csv(filedir, header = TRUE)
dataset
}
If I run getfile("specdata", 1) it works fine, but if I run getfile("specdata", 1:10) I get the following error: Error in file(file, "rt") : invalid 'description' argument.
Task 2: Calculating mean of specified named column - Assuming I have a usable data frame, I then try to calculate the mean with the following function:
calcMean <- function(dataset, pollutant) {
polmean <- mean(dataset$pollutant, na.rm = TRUE)
polmean
}
But if I run calcMean(mydata, "sulfate") (where mydata is a data frame I loaded manually) I get an error message:
Warning message:
In mean.default(dataset$pollutant, na.rm = TRUE) :
argument is not numeric or logical: returning NA
The odd thing is that if I run mean(mydata$sulfate, na.rm = TRUE) in the console, it works fine.
I have researched this for several days and after endless tweaking, I have run out of ideas.
You do not need more functions. The solution can be simpler from my understanding in 6 lines:
pollutantmean <- function(directory, pollutant, id = 1:10) {
filenames <- sprintf("%03d.csv", id)
filenames <- paste(directory, filenames, sep="/")
ldf <- lapply(filenames, read.csv)
df=plyr::ldply(ldf)  # ldply comes from the plyr package
# ldf is your list of data.frames; df is all of them stacked into one data.frame
mean(df[, pollutant], na.rm = TRUE)
}
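The warning in the question comes from how `$` treats its right-hand side, which is why the function above indexes with the variable instead. This is a minimal sketch with an invented data frame; the column names sulfate and nitrate follow the question, the values are hypothetical.

```r
mydata <- data.frame(sulfate = c(1.5, NA, 2.5), nitrate = c(0.5, 1.0, NA))
pollutant <- "sulfate"

# `$` looks for a column literally named "pollutant", which does not exist,
# so it returns NULL and mean() warns and gives NA
missing_col <- is.null(mydata$pollutant)

# `[[` evaluates the variable and uses its value as the column name
result <- mean(mydata[[pollutant]], na.rm = TRUE)
print(result)
```

The same reasoning applies to df[, pollutant] in the function above, which also uses the value of the variable rather than its name.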
I think your major problem is listing the files in your working directory and reading them into R. Try the list.files function. Example code that may work for you:
files <- list.files(pattern = "\\.csv$") ## creates a vector with all CSV file names in your folder
polmean <- rep(0,length(files))
for(i in 1:length(files)){
data <- read.csv(files[i],header=T)
polmean[i] <- mean(data[[pollutant]], na.rm = TRUE) ## pollutant holds the column name as a string, e.g. "sulfate"
}
result <- cbind(files,polmean)
write.csv(result,"result_polmeans.csv")
This program gives you the data with name of file in the first column and corresponding means in the second column.
