I've been trying to merge a list of dataframes and keep getting the error:
"Error in [.data.frame(y, ids$y, y.cols, drop = FALSE) :
undefined columns selected".
Below is the code I've used:
# Skip the first 14 lines of metadata in each data file
read_OO <- function(filename) {
  read.delim(filename, skip = 14)
}
filenames <- list.files(folderpath, pattern = "\\.txt$", full.names = TRUE)
filelist <- lapply(filenames, read_OO)
# Sample IDs are the file names without folder or extension
SampleIDs <- tools::file_path_sans_ext(basename(filenames))
names(filelist) <- SampleIDs
filelist <- mapply(cbind, filelist, SampleIDs, SIMPLIFY = FALSE)
col_names <- c("Wavelength", "Absorbance", "SampleIDs")
filelist <- lapply(filelist, setNames, col_names)
abs2017 <- plyr::join_all(filelist, by = c("Wavelength", "Absorbance", "SampleIDs"), type = "full", match = "all")
The error occurs on the last line.
I've also tried merging with Reduce and merge:
t <- Reduce(function(x, y) merge(x, y,
by=c("Wavelength","Absorbance", "SampleIDs"),
all = TRUE), filelist)
But that stops the code at an "approximate location" (it doesn't give a specific error and says it can't find the source).
Is there something I can look for in my file structure that may be the problem? I can't find any inconsistencies between the files (they're all identical outputs from a machine).
It turned out that a single file had a slightly different format from all the other files, so this is now solved. Once that file was corrected, the code above worked.
If anyone has any comments on how to scan through a list and check for structure discrepancies that would be appreciated!
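One way to scan a list for structure discrepancies is sketched below. It is only a sketch: `check_structure` is a hypothetical helper (not part of any package), and it assumes that most files are well-formed, so the most common layout is treated as the reference. It would be run on the list straight after reading, before the columns are renamed.

```r
# Hypothetical helper: flag data frames whose layout differs from the majority
check_structure <- function(df_list) {
  # One signature string per data frame: column count plus "name:class" pairs
  sig <- vapply(df_list, function(d) {
    classes <- vapply(d, function(col) class(col)[1], character(1))
    paste0(ncol(d), " cols | ", paste(names(d), classes, sep = ":", collapse = ", "))
  }, character(1))
  # Treat the most common signature as the reference layout
  reference <- names(which.max(table(sig)))
  # Return only the elements that deviate from it
  sig[sig != reference]
}

# Toy data: two well-formed tables and one with an extra column
ok1 <- data.frame(Wavelength = 1:3, Absorbance = c(0.1, 0.2, 0.3))
ok2 <- data.frame(Wavelength = 4:6, Absorbance = c(0.4, 0.5, 0.6))
bad <- data.frame(Wavelength = 1:3, Absorbance = c(0.1, 0.2, 0.3), Extra = letters[1:3])

odd <- check_structure(list(a = ok1, b = ok2, c = bad))
print(odd)
```

The names of the returned vector are the list names (here the sample IDs would be the file names), so the odd file can be opened and inspected directly.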
I need to read a list of csv files and, from each of the resulting dataframes, extract the values of one specific column (the second one) from the 13th row onwards.
Here's my try:
temp <- list.files(FILEPATH, pattern="\\.csv$", full.names = TRUE)
for (i in 1:length(temp)){
  assign(temp[i], read.csv(temp[i], header=TRUE, skip=13, na.strings=c("", "NA")))
  subset(temp[i], select=2) #extract the second column of the dataframe
  temp[i] <- na.omit(temp[i])
}
However, this doesn't work. On the one hand, I think that's because of the skip argument of the read.csv command, as it apparently ignores the headers. On the other hand, if skip is not used, the following error pops up:
Error in subset.default(temp[i], select = 2) : argument "subset" is missing, with no default
When I insert the argument subset=TRUE in the subset command, it doesn't give any error, but no extraction is performed.
Any possible solution?
Without seeing the files it's not easy to tell, but I would use lapply rather than a for loop. Maybe you can take inspiration from something like the following. I use read.table because, with skip = 13, read.csv would by default treat the first line it reads as column headers. Note that I avoid the use of assign.
df_list <- lapply(temp, read.table, sep = ",", skip = 13, na.strings = c("", "NA"))
names(df_list) <- temp
col2_list <- lapply(df_list, `[[`, 2)
col2_list <- lapply(col2_list, na.omit)
names(col2_list) <- temp
col2_list
If you want col2_list to be a list of data frames with just one column each (column 2 of the original files), then, as I said in a comment, use
col2_list <- lapply(df_list, `[`, 2)
And to rename that one column and renumber the rows consecutively
new_name <- "the_column_of_choice" # change this!
col2_list <- lapply(col2_list, function(x){
names(x) <- new_name
row.names(x) <- NULL
x
})
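To see the difference between the two extraction operators, here is a small self-contained sketch. The file is a made-up temporary one with 13 metadata lines followed by two columns, which is only an assumption about what the real files look like.

```r
# Build a throwaway file: 13 metadata lines followed by two data columns
tmp <- tempfile(fileext = ".csv")
writeLines(c(paste("meta line", 1:13), "1,10", "2,NA", "3,30"), tmp)

df <- read.table(tmp, sep = ",", skip = 13, na.strings = c("", "NA"))

as_vector <- df[[2]]   # `[[` extracts the column itself (a vector)
as_frame  <- df[2]     # `[` keeps a one-column data frame

print(na.omit(as_vector))
print(na.omit(as_frame))

unlink(tmp)            # remove the temporary file
```

In both cases na.omit drops the missing entry; the only difference is whether you end up with a vector or a data frame.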
After having searched for help in different threads on this topic, I still have not become wiser. Therefore: Here comes another question on looping through multiple data files...
OK. I have multiple CSV files in one folder containing 5 columns of data. The filenames are as follows:
Moist yyyymmdd hh_mm_ss.csv
I would like to create a script that reads and processes the CSV files one by one, doing the following steps:
1) load file
2) check number of rows and exclude file if less than 3 registrations
3) calculate mean value of all measurements (=rows) for column 2
4) calculate mean value of all measurements (=rows) for column 4
5) output the filename timestamp, mean of column 2 and mean of column 4 to a data frame
I have written the following function
moist.each.mean <- function() {
library("tcltk")
directory <- tk_choose.dir("","Choose folder for Humidity data files")
setwd(directory)
filelist <- list.files(path = directory)
filetitles <- regmatches(filelist, regexpr("[0-9].*[0-9]", filelist))
mdf <- data.frame(timestamp=character(), humidity=numeric(), temp=numeric())
for(i in 1:length(filelist)){
file.in[[i]] <- read.csv(filelist[i], header=F)
if (nrow(file.in[[i]]<3)){
print("discard")
} else {
newrow <- c(filetitles[[i]], round(mean(file.in[[i]]$V2),1), round(mean(file.in[[i]]$V4),1))
mdf <- rbind(mdf, newrow)
}
}
names(mdf) <- c("timestamp", "humidity", "temp")
}
but I keep getting an error:
Error in `[[<-.data.frame`(`*tmp*`, i, value = list(V1 = c(10519949L, :
replacement has 18 rows, data has 17
Any ideas?
Thx, kruemelprinz
I'd also suggest using (l)apply. Here's my take:
getMeans <- function(fpath,
                     target_cols = c(2),
                     sep = ",",
                     dec = ".",
                     header = TRUE,
                     min_obs_threshold = 3) {
  # Full paths of all CSV files in fpath
  fcsv <- list.files(fpath, pattern = "\\.csv$", full.names = TRUE)
  csv_list <- lapply(fcsv, read.table, sep = sep, dec = dec, header = header)
  # Keep only files with at least min_obs_threshold rows
  csv_rows <- sapply(csv_list, nrow)
  rel_csv_list <- csv_list[csv_rows >= min_obs_threshold]
  # drop = FALSE keeps a data frame even when a single column is selected
  lapply(rel_csv_list, function(x) colMeans(x[, target_cols, drop = FALSE]))
}
Also with that kind of error message, the debugger might be very helpful.
Just run debug(moist.each.mean) and execute the function stepwise.
Here's a slightly different approach. Use lapply to read each csv file, exclude it if necessary, otherwise create a summary. This gives you a list where each element is a data frame summary. Then use rbind to create the final summary data frame.
Without a sample of your data, I can't be sure the code below exactly matches your problem, but hopefully it will be enough to get you where you want to go.
# Get vector of filenames to read (full paths, so the working directory doesn't matter)
filelist = list.files(path=directory, pattern="\\.csv$", full.names=TRUE)
# Read all the csv files into a list and create summaries
df.list = lapply(filelist, function(f) {
  # The files have no header row, so the columns are named V1, V2, ...
  file.in = read.csv(f, header=FALSE, stringsAsFactors=FALSE)
  # Print a message (and return no data frame) if file has fewer than 3 rows of data
  if (nrow(file.in) < 3) {
    print(paste("Discard", f))
  # Otherwise, capture file timestamp and summarise data frame
  } else {
    # Characters 7 to 23 of "Moist yyyymmdd hh_mm_ss.csv" are the timestamp
    data.frame(timestamp=substr(basename(f), 7, 23),
               humidity=round(mean(file.in$V2),1),
               temp=round(mean(file.in$V4),1))
  }
})
# Bind list into final summary data frame (excluding the list elements
# that don't contain a data frame because they didn't have enough rows
# to be included in the summary)
result = do.call(rbind, df.list[sapply(df.list, is.data.frame)])
One issue with your original code is that you create a vector of summary results rather than a data frame of results:
c(filetitles[[i]], round(mean(file.in[[i]]$V2),1), round(mean(file.in[[i]]$V4),1)) is a character vector with three elements, because c() coerces the numbers to strings to match the file title. What you actually want is a one-row data frame with three columns:
data.frame(timestamp=filetitles[[i]],
humidity=round(mean(file.in[[i]]$V2),1),
temp=round(mean(file.in[[i]]$V4),1))
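A small illustration of that difference, using made-up values rather than the real measurements:

```r
# c() coerces mixed types to a common type, here character
row_vec <- c("20170101 12_00_00", round(45.678, 1), round(21.234, 1))
print(class(row_vec))

# A one-row data frame keeps each column's own type
row_df <- data.frame(timestamp = "20170101 12_00_00",
                     humidity  = round(45.678, 1),
                     temp      = round(21.234, 1))
print(sapply(row_df, class))

# Rows built this way stack cleanly with rbind
both <- rbind(row_df, row_df)
print(both)
```

With the data frame version the humidity and temp columns stay numeric, so they can be averaged or plotted later without converting them back.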
Thanks for the suggestions using lapply. This is definitely of value as it saves a whole lot of code as well! Meanwhile, I managed to fix my original code as well:
library("tcltk")
# directory: path to csv files
directory <- tk_choose.dir("", "Choose folder for Humidity data files")
setwd(directory)
filelist <- list.files(path = directory)
filetitles <- regmatches(filelist, regexpr("[0-9].*[0-9]", filelist))
mdf <- data.frame()
for (i in 1:length(filelist)) {
  file.in <- read.csv(filelist[i], header = FALSE, skipNul = TRUE)
  if (nrow(file.in) < 3) {
    print("discard")
  } else {
    # One-row character matrix: timestamp, mean humidity, mean temperature
    newrow <- matrix(
      c(filetitles[[i]],
        round(mean(file.in$V2, na.rm = TRUE), 1),
        round(mean(file.in$V4, na.rm = TRUE), 1)),
      nrow = 1, ncol = 3, byrow = TRUE
    )
    mdf <- rbind(mdf, newrow)
  }
}
names(mdf) <- c("timestamp", "humidity", "temp")
However, I did not get it to work as a function: mdf then contained only one row, holding the data of the last file. Somehow each iteration overwrote row 1 instead of adding a new row. Without a function wrapper it worked fine.
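One thing worth checking when wrapping this in a function: an R function returns the value of its last expression, and names(mdf) <- ... evaluates to the names, not to the data frame, so mdf has to be returned explicitly. Whether that explains the one-row result is a guess. The sketch below is one possible wrapper, not the poster's code: `moist_means` is a hypothetical name, the folder is passed as an argument instead of being chosen with tcltk, and every file name is assumed to contain the digits of the timestamp.

```r
# Hypothetical wrapper; `directory` is passed in rather than chosen interactively
moist_means <- function(directory) {
  files <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  rows <- lapply(files, function(f) {
    d <- read.csv(f, header = FALSE)
    if (nrow(d) < 3) return(NULL)          # too few registrations: skip file
    fname <- basename(f)
    data.frame(timestamp = regmatches(fname, regexpr("[0-9].*[0-9]", fname)),
               humidity  = round(mean(d$V2, na.rm = TRUE), 1),
               temp      = round(mean(d$V4, na.rm = TRUE), 1))
  })
  rows <- Filter(Negate(is.null), rows)    # drop the skipped files
  mdf <- do.call(rbind, rows)              # one row per remaining file
  mdf                                      # return the data frame explicitly
}
```

It would be called as mdf <- moist_means(directory), so the result is assigned in the calling environment rather than modified inside the function.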
Within a for loop, I am trying to run a function on two columns of my data frame, moving to another data set on every iteration of the loop. I would like to collect the output of every iteration in one vector of answers.
I can't get past the following errors (listed below my code), which depend on whether I add or remove row.names = NULL in the data <- read.csv... part of the code (line 4 of the for loop):
** Edited to include the directory references, which is where the error ultimately was:
corr <- function(directory, threshold = 0) {
  source("complete.R")
  # The line above, together with my unseen directory organization, is where my error was
  lookup <- complete("specdata")
  setwd(file.path(getwd(), directory))
  files <- list.files(full.names = TRUE) # read file names
  len <- length(files)
  answer2 <- vector("numeric")
  answer <- vector("numeric")
  dataN <- data.frame()
  for (i in 1:len) {
    if (lookup[i,"nobs"] > threshold){
      # TRUE -> read that file, remove the NA data and add to the overall data frame
      data <- read.csv(file = files[i], header = TRUE, sep = ",")
      #remove incomplete
      dataN <- data[complete.cases(data),]
      #If yes, compute the correlation and assign its results to an intermediate vector.
      answer<-cor(dataN[,"sulfate"],dataN[,"nitrate"])
      answer2 <- c(answer2,answer)
    }
  }
  setwd("../")
  return(answer2)
}
1) Error in read.table(file = file, header = header, sep = sep, quote = quote, :
duplicate 'row.names' are not allowed
vs.)
2) Error in [.data.frame(data, , 2:3) : undefined columns selected
What I've tried
referring to the column names directly "colA"
initializing data and dataN to empty data.frames before the for loop
initializing answer2 to an empty vector
Getting a better understanding of how vectors, matrices and data.frames work with each other
** Thank you!**
My problem was that the function .R file I was referencing in the code above was in the same directory as the data files I was looping through and analyzing. My "files" vector had an incorrect length, because it also picked up another .R function file that I had made and referenced earlier in the function. I believe this R file is what caused the 'undefined columns' error.
I apologize, I ended up not even posting the area of code where the problem lay.
Key Takeaway: You can always move between directories within a function! In fact, it may well be necessary if you want to perform a function on all the contents of a directory of interest.
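As an alternative to changing directories, the stray script can be kept out of the file list by filtering on the extension. This is only a sketch: the folder name and file names are made up for the demonstration, and it assumes the data files all end in .csv.

```r
# Temporary folder holding two data files and one stray script
data_dir <- file.path(tempdir(), "specdata_demo")   # hypothetical folder name
dir.create(data_dir, showWarnings = FALSE)
file.create(file.path(data_dir, c("001.csv", "002.csv", "complete.R")))

# Without a pattern, every file in the folder is returned
all_files <- list.files(data_dir, full.names = TRUE)
# With a pattern, only names ending in .csv are returned
csv_files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)

print(basename(all_files))   # includes the script
print(basename(csv_files))   # data files only
```

Because full.names = TRUE returns complete paths, read.csv can open the files without any setwd call.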
One approach:
# get the list of file names
files <- list.files(path='~',pattern='\\.csv$',full.names = TRUE)
# load all files
list.data <- lapply(files,read.csv, header = TRUE, sep = ",", row.names = NULL)
# remove rows with NAs
complete.data <- lapply(list.data,function(d) d[complete.cases(d),])
# compute correlation of the 2nd and 3rd columns in every data set
answer <- sapply(complete.data,function(d) cor(d[,2],d[,3]))
The same idea, but a slightly different implementation:
cr <- function(fname) {
d <- read.csv(fname, header = TRUE, sep = ",", row.names = NULL)
dc <- d[complete.cases(d),]
cor(dc[,2],dc[,3])
}
answer2 <- sapply(files,cr)
Example CSV files:
# ==> a.csv <==
# a,b,c,d
# 1,2,3,4
# 11,12,13,14
# 11,NA,13,14
# 11,12,13,14
#
# ==> b.csv <==
# A,B,C,D
# 101,102,103,104
# 101,102,103,104
# 11,12,13,14
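To try this end to end, the two sample files can be written to a scratch folder first. The folder name below is made up; the file contents are the ones listed above. In both files the third column is the second plus one, so each correlation should equal 1 up to floating-point error.

```r
demo_dir <- file.path(tempdir(), "cor_demo")   # hypothetical scratch folder
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("a,b,c,d", "1,2,3,4", "11,12,13,14", "11,NA,13,14", "11,12,13,14"),
           file.path(demo_dir, "a.csv"))
writeLines(c("A,B,C,D", "101,102,103,104", "101,102,103,104", "11,12,13,14"),
           file.path(demo_dir, "b.csv"))

files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
answer <- vapply(files, function(f) {
  d <- read.csv(f, header = TRUE)
  d <- d[complete.cases(d), ]   # drops the row containing NA in a.csv
  cor(d[, 2], d[, 3])           # correlation of the 2nd and 3rd columns
}, numeric(1))
print(unname(answer))
```

vapply is used here instead of sapply only so that the result is guaranteed to be a numeric vector with one value per file.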
I would like to be able to scan a csv file row by row in R and exclude the rows that contain the word "target".
The problem is that the data comes from different places and the word "target" can come up in a number of different columns in the data frame.
So I need a line in a function that will look for this string, and if it is not present, then append that row to a new data frame (that I will then write out as a new csv).
Any and all help gratefully received.
Andrie's comment is probably the way most users would approach this, but if you want to do this at the reading in stage, you can try this:
Read in your csv using readLines and make any lines that have the text target blank:
temp = gsub(".*target.*", "", readLines("test.csv"))
Use read.table to convert temp to a data.frame. Since all lines that have the text target are now blank, the default blank.lines.skip=TRUE in read.table should correctly read in the rest of your data as a data.frame.
read.table(text=temp, sep=",", header=TRUE)
Use readLines:
lines <- readLines(file)
n.lines <- length(lines)
vec.1 <- character(n.lines)
vec.2 <- character(n.lines)
# more vectors as necessary
counter <- 0
for (i in 1:n.lines){
  # strsplit returns a list; take its first element to get the fields
  this.line <- strsplit(lines[i], ",")[[1]]
  if ("target" %in% this.line) next
  counter <- counter + 1
  vec.1[counter] <- this.line[1]
  vec.2[counter] <- this.line[2]
  # etc.
}
df <- data.frame(vec.1[1:counter], vec.2[1:counter])
You may have to change n.lines slightly and change the indexing of the for loop if your file has headers; two lines would change as follows:
n.lines <- length(lines) - 1
and
for(i in 2:(n.lines+1)){
I would call from.readLines <- readLines(filename) and then just sub-select the rows that don't contain the target string: data <- read.csv(text = from.readLines[-grep('target', from.readLines)], header = F).
The faster way to do it (if your file is huge) would be to run grep -v 'target' original.csv > new.csv on the command line first and then run read.csv("new.csv", ...) in R.
But anyway,
> #Without header
> from.readLines <- c('afaf,afasf,target', 'afaf,target,afasf', 'dagdg,asgst,sagga', 'dagdg,dg,sfafgsgg')
> data <- read.csv(text = from.readLines[-grep('target', from.readLines)], header = F)
> print(data)
V1 V2 V3
1 dagdg asgst sagga
2 dagdg dg sfafgsgg
>
> #With header
> from.readLines <- c('var1,var2,var3', 'afaf,afasf,target', 'afaf,target,afasf', 'dagdg,asgst,sagga', 'dagdg,dg,sfafgsgg')
> data <- read.csv(text = from.readLines[-(grep('target', from.readLines[-1]) + 1)])
> print(data)
var1 var2 var3
1 dagdg asgst sagga
2 dagdg dg sfafgsgg
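One caveat with negative indexing: if no line contains the string, grep returns an empty vector and -grep(...) then selects nothing at all. A logical filter with grepl avoids that. The sketch below uses made-up lines and assumes the header itself does not contain the word "target".

```r
from.readLines <- c("var1,var2,var3", "afaf,afasf,target", "dagdg,asgst,sagga")

# TRUE for every line that does not contain the string
keep <- !grepl("target", from.readLines, fixed = TRUE)
data <- read.csv(text = from.readLines[keep])
print(data)

# The same filter when nothing matches: all lines are kept
no_hit <- c("var1,var2,var3", "a,b,c")
data2 <- read.csv(text = no_hit[!grepl("target", no_hit, fixed = TRUE)])
print(data2)

# By contrast, -grep() with no match yields an empty selection
empty <- no_hit[-grep("target", no_hit)]
print(length(empty))
```

fixed = TRUE treats "target" as a literal string rather than a regular expression, which is enough here and avoids surprises if the search term ever contains special characters.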