I wrote a simple function:
myfunction <- function(fileName, stringsAsFactors = TRUE,
                       check.names = FALSE, skip = 1, ...) {
  Data <- read.delim(fileName, skip = skip,
                     stringsAsFactors = stringsAsFactors,
                     check.names = check.names, ...)
  Index <- as.numeric(as.factor(Data[, 1]))  # integer codes for the first column
  cb <- cbind(Data, Index)
  return(cb)
}
This function reads the file into Data, creates an Index according to the first column, and then cbinds Data and the Index.
The function will be applied to files named myfile_00.txt, myfile_01.txt, and so on. For a single file it looks like:
myfunction(fileName = "myfile_00.txt")
myfunction(fileName = "myfile_01.txt")
.......
I have around 1000 files, so I suppose the loop can be adapted from another post:
mytxt <- dir(pattern = ".txt")
n <- length(mytxt)
mylist <- vector("list", n)
for (i in 1:n) {
  mylist[[i]] <- read.delim(mytxt[i], header = FALSE, skip = 1)
}
then:
d <- lapply(mylist, myfunction)
Unfortunately it does not work: when using lapply, an error occurs:
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
'file' must be a character string or connection
Since I'm new to R, I'm probably making mistakes I'm not able to figure out.
As @Arun pointed out, you are trying to run your function twice: once on the files and once on the data frames you have created. Instead, your code should look like this:
files <- list.files(pattern = "\\.txt$")  # escape the dot so only names ending in ".txt" match
mylist <- lapply(files, myfunction)
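If you later need to know which result came from which file, one optional addition (a small sketch, not required for the fix) is to name the list after the files:
names(mylist) <- files  # e.g. mylist[["myfile_00.txt"]] is that file's result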
Related
I have 1500+ .txt files called data_{date from 2015070918 to today}, all with 7 columns of data and variable numbers of rows. I have managed to use the following code to extract and merge the data into one table:
files <- list.files(pattern = ".txt")
myData <- lapply(files, function(x) {
  tryCatch(read.table(x, header = FALSE, sep = ','), error = function(e) NULL)
})
Note: there are no headers on the columns; currently I don't even know which variable is which!
At the moment the data only has the date in the file name, so it isn't possible to distinguish between each subset of daily data. I want to create an additional column containing the date, which I can extract if I can include the filename in an additional column.
I searched on Stack Exchange and came across this possible solution: Importing multiple .csv files into R and adding a new column with file name
df <- do.call(rbind, lapply(files, function(x) {
  cbind(read.csv(x, header = FALSE, sep = ","),
        name = strsplit(x, '\\.')[[1]][1])
}))
However I get the following error:
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
no lines available in input
I have used read.csv on individual files and they have imported without any issues. Any ideas to resolve this would be greatly appreciated!
This should work, if your read.table command is correct:
myData_list <- lapply(files, function(x) {
  out <- tryCatch(read.table(x, header = FALSE, sep = ','),
                  error = function(e) NULL)
  if (!is.null(out)) {
    out$source_file <- x  # record which file each row came from
  }
  return(out)
})
myData <- data.table::rbindlist(myData_list)
In the past I have found that you can spare yourself a lot of headaches by using data.table::fread instead of read.table. So you could consider this:
myData_list <- lapply(files, function(x) {
  out <- data.table::fread(x, header = FALSE)
  out$source_file <- x
  return(out)
})
myData <- data.table::rbindlist(myData_list)
You can add the tryCatch part back if necessary. Depending on how the files vector looks, basename() might be worth using on the source_file column.
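For illustration, if files held paths like "data/myfile_00.txt" (a hypothetical layout), basename() would strip the directory part:
myData$source_file <- basename(myData$source_file)  # "data/myfile_00.txt" -> "myfile_00.txt"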
You could try using lapply with an index corresponding to each of the files:
files <- list.files(pattern = ".txt")
myData <- lapply(seq_along(files), function(x) {
  tryCatch(
    {
      dt <- read.table(files[x], header = FALSE, sep = ',')
      dt$index <- x  # or files[x] if you want to use the file name instead
      dt
    },
    error = function(e) NULL
  )
})
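As in the answers above, the resulting list can then be stacked into one data frame; a minimal sketch, assuming the columns line up across files:
## Drop NULL entries from failed reads, then bind the rest by row
myData <- do.call(rbind, Filter(Negate(is.null), myData))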
Here is the data I am working with. https://d396qusza40orc.cloudfront.net/rprog%2Fdata%2Fspecdata.zip
I'm trying to create a function called pollutantmean that will load selected files, aggregate (rbind) the columns, and return a mean of a certain column. I have figured out everything except how to run the loop so I can turn the multiple files into one big data frame.
for (id in 1:5) {
  files_full <- Sys.glob("*.csv")
  fileQ <- files_full[[id]]
  empty_tbl <- rbind(empty_tbl, read.csv(fileQ, header = TRUE))
}
This for loop works by itself, but when I try to use my bigger function
pollutantmean <- function(directory = "specdata", pollutant, id = 1:332) {
empty_tbl <- data.frame()
for (id in 1:332) {
files_full <- Sys.glob("*.csv")
fileQ <- files_full[[i]]
empty_tbl <- rbind(empty_tbl, read.csv(fileQ, header = TRUE))
}
goodata <- na.omit(empty_tbl)
if(pollutant == "sulfate") {
mean(goodata[,2])
} else {
mean(goodata[,3])
}
}
I get the error:
"Error in read.table(file = file, header = header, sep = sep, quote = quote, :
'file' must be a character string or connection".
I am at a complete loss over how to fix this and have tried many, many different ways. I'm sure I'm messing something up with the naming of the file, but when I try the for loop by itself it works fine...
Consider using lapply() on the csv files, making use of the directory argument of the function. The code below assumes specdata is a subfolder of the current working directory:
pollutantmean <- function(directory = "specdata", pollutant) {
files_full <- Sys.glob(paste0(directory,"/*.csv"))[1:332] # FIRST 332 CSVs IN DIRECTORY
dfList <- lapply(files_full, read.csv, header=TRUE)
df <- do.call(rbind, dfList)
gooddata <- na.omit(df)
pmean <- ifelse(pollutant == "sulfate", mean(gooddata[,2]), mean(gooddata[,3]))
}
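A usage sketch, assuming the unzipped specdata folder sits under the current working directory:
pollutantmean("specdata", "sulfate")  # mean of column 2 across the 332 files
pollutantmean("specdata", "nitrate")  # mean of column 3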
I am working in a directory, but the data I want to read is in a subdirectory. I get an error when I try to read the csv files; my code is the following:
setwd("~/Documents/")
files <- list.files(path = "data/")
f <- list()
for (i in 1:length(files)) {
f[[i]] <- read.csv(files[i], header = T, sep = ";")
}
And the error I get is:
Error in file(file, "rt"): cannot open the connection
What am I doing wrong?
The following will work, assuming you have correctly specified the other read.csv parameters.
setwd("~/Documents/")
files <- list.files(path = "data/")
f <- list()
for (i in 1:length(files)) {
f[[i]] <- read.csv(paste0("data/",files[i]), header = T, sep = ";")
}
Alternatively, you could drop the paste0 and simply set your working directory to ~/Documents/data/ in the first place.
setwd("~/Documents/data/")
files <- list.files() #No parameter necessary now since you're in the proper directory
f <- list()
for (i in 1:length(files)) {
f[[i]] <- read.csv(files[i], header = T, sep = ";")
}
If you need to be in ~/Documents/ at the end of this loop, then finish it up by adding the following after the loop.
setwd("~/Documents/")
I need to run the same set of code for multiple CSV files, and I want to do it with the same macro. Below is the code I am executing, but the results are not coming out properly: it is reading the data in 2-D format while I need to run it in 3-D format.
lf <- list.files(path = "D:/THD/data", pattern = ".csv",
                 full.names = TRUE, recursive = TRUE, include.dirs = TRUE)
ds <- lapply(lf, read.table)
I don't know if this is going to be useful, but one of the ways I do it is:
## Step 1: read files
mycsv <- dir(pattern = ".csv")
n <- length(mycsv)
mylist <- vector("list", n)
for (i in 1:n) mylist[[i]] <- read.csv(mycsv[i], header = TRUE)
Then I usually just use an apply function to change things, for example:
## Change column names
mylist <- lapply(mylist, function(x) {
  names(x) <- c("type", "date", paste0("v", 1:24), "total")
  return(x)
})
## Change the type column for weekday/weekend
mylist <- lapply(mylist, function(x) {
  f <- c("we", "we", "wd", "wd", "wd", "wd", "wd")
  x$type <- rep(f, 52, length.out = 365)
  return(x)
})
and so on.
Then, after all the changes, I save with the following code (it is also sometimes useful to split the original file name and rename each file so it is saved with part of that name, which lets me track individual files later).
## For example, some of my files had a pattern in the name such as "201_E424220_N563500.csv", so I split this to save with a new name like this:
mylist <- lapply(1:length(mylist), function(i) {
  mylist.i <- mylist[[i]]
  s <- strsplit(mycsv[i], "_", fixed = TRUE)[[1]]
  d <- cbind(mylist.i[, c("type", "date")],
             ID = s[1], Easting = s[2], Northing = s[3],
             mylist.i[, 3:ncol(mylist.i)])
  return(d)
})
for (i in 1:n)
  write.csv(file = paste("file", i, ".csv", sep = ""), mylist[[i]], row.names = FALSE)
I hope this will help. When you get some time, please read about the plyr package, as I am sure it will be very useful for you; it is a very useful package with lots of data-analysis options. plyr has apply functions such as:
## l_ply: split list, apply function and discard results
## ldply: split list, apply function and return results in a data frame
## laply: split list, apply function and return results in an array
For example, you can use ldply to read all your csv files and return a data frame, something like:
library(plyr)

data <- ldply(list.files(pattern = ".csv"), function(fname) {
  j <- read.csv(fname, header = TRUE)
  return(j)
})
So here data will be a single data frame holding the data from all your csv files.
Thanks, Ayan
I am new to R and trying to do some correlation analysis on multiple sets of data. I am able to do the analysis, but I am trying to figure out how I can output the results of my data. I'd like to have output like the following:
NAME,COR1,COR2
....,....,....
....,....,....
If I could write such a file to output, then I can post process it as needed. My processing script looks like this:
run_analysis <- function(logfile, name)
{
  preds <- read.table(logfile, header = TRUE, sep = ",")
  # do something with the data: create some_col, another_col, etc.
  result1 <- cor(some_col, another_col)
  result2 <- cor(some_col2, another_col2)
  # somehow output name, result1, result2 to a CSV file
}
args <- commandArgs(trailingOnly = TRUE)
date <- args[1]
basepath <- args[2]
logbase <- paste(basepath, date, sep = "/")
logfile_pattern <- paste("*", date, "csv", sep = ".")
logfiles <- list.files(path = logbase, pattern = logfile_pattern)
for (f in logfiles) {
  name <- unlist(strsplit(f, "\\."))[1]
  logfile <- paste(logbase, f, sep = "/")
  run_analysis(logfile, name)
}
Is there an easy way to create a blank data frame and then add data to it, row by row?
Have you looked at the functions in R for writing data to files, such as write.csv and write.table? Note that write.csv deliberately ignores append = TRUE (with a warning), so to append one row at a time use write.table with CSV settings. Perhaps something like this:
rs <- data.frame(name = name, COR1 = result1, COR2 = result2)
write.table(rs, "path/to/file", sep = ",", append = TRUE, row.names = FALSE)
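One wrinkle with appending is the header row; a common pattern (a sketch, with a placeholder path) is to write column names only when the file does not exist yet:
out <- "path/to/file"
write.table(rs, out, sep = ",", row.names = FALSE,
            append = file.exists(out), col.names = !file.exists(out))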
I like using the foreach library for this sort of thing:
library(foreach)

run_analysis <- function(logfile, name) {
  preds <- read.table(logfile, header = TRUE, sep = ",")
  # do something with the data: create some_col, another_col, etc.
  result1 <- cor(some_col, another_col)
  result2 <- cor(some_col2, another_col2)
  # Return one row of results.
  data.frame(name = name, cor1 = result1, cor2 = result2)
}
args <- commandArgs(trailingOnly = TRUE)
date <- args[1]
basepath <- args[2]
logbase <- paste(basepath, date, sep = "/")
logfile_pattern <- paste("*", date, "csv", sep = ".")
logfiles <- list.files(path = logbase, pattern = logfile_pattern)

## Collect results from run_analysis into a table, by rows.
dat <- foreach (f = logfiles, .combine = "rbind") %do% {
  name <- unlist(strsplit(f, "\\."))[1]
  logfile <- paste(logbase, f, sep = "/")
  run_analysis(logfile, name)
}

## Write output.
write.csv(dat, "output.dat", quote = FALSE)
This generates one row of output on each call to run_analysis and binds the rows into a single table called dat (the .combine = "rbind" argument to foreach causes row binding). Then you can just use write.csv to get the output you want.
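If you would rather not add the foreach dependency, the same row binding can be done in base R; an equivalent sketch using the logfiles and run_analysis defined above:
rows <- lapply(logfiles, function(f) {
  name <- unlist(strsplit(f, "\\."))[1]
  run_analysis(paste(logbase, f, sep = "/"), name)
})
dat <- do.call(rbind, rows)  # same result as .combine = "rbind"
write.csv(dat, "output.dat", quote = FALSE)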