I have a large database that I've split into multiple files. Each file is saved in the same directory, and there is a numerical sequence in the naming scheme so the order of the database is maintained. I've done this to reduce the time and memory it takes to load and manipulate the database. I would like to start analyzing the database in sequence, which I intend to accomplish using a rollapply-like function. My problem arises when the window has to span two files at once, and this is where I need help. Here is a dummy dataset that will create five CSV files with a naming scheme similar to my database:
library(readr)
val <- c(1,2,3,4,5)
df_1 <- data.frame(val)
write_csv(df_1, "1_database.csv", col_names = TRUE)
write_csv(df_1, "2_database.csv", col_names = TRUE)
write_csv(df_1, "3_database.csv", col_names = TRUE)
write_csv(df_1, "4_database.csv", col_names = TRUE)
write_csv(df_1, "5_database.csv", col_names = TRUE)
Keep in mind that this database is huge, and causes memory and time issues on my current machine. The solution MUST have a component that "forgets". This means recurrently joining the files, or loading them all at once to the R environment is not an option. When a new file is loaded, the last file must be removed from the R environment. I can have at maximum three files loaded at once. For example files 1-3 can be loaded, and then file 1 needs to be removed before file 4 is loaded.
The output can be a single list of all files - the combination of files 1-5 in a single list.
For the sake of simplicity, let's say I want to use a window of 2, and I want to calculate the mean of this window. I'm imagining something like this (see below), but this may be a failed approach, and I'm open to anything.
appreciated_function <- function(x){
# Your greatly appreciated function
}
rollapply(df, 2, appreciated_function, by.column = FALSE, align = "left")
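Suppose the window width is k. Iterate through all files, and for each one read that file plus the first k-1 rows of the next (except for the last file, which has no successor). Apply rollapply to the combined rows and append the result to what we have so far. At most one full file and k-1 rows of the next are in memory at any time. Alternatively, if the output is too large, we could write out each result instead of appending it. The code below uses sum as the window function; substitute mean to get the rolling mean asked for in the question.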
At the bottom we check that it gives the expected result.
library(readr)
library(zoo)
val <- c(1,2,3,4,5)
df_1 <- data.frame(val)
write_csv(df_1, "1_database.csv", col_names = TRUE)
write_csv(df_1, "2_database.csv", col_names = TRUE)
write_csv(df_1, "3_database.csv", col_names = TRUE)
write_csv(df_1, "4_database.csv", col_names = TRUE)
write_csv(df_1, "5_database.csv", col_names = TRUE)
d <- dir(pattern = "database.csv$")
k <- 2
r <- NULL
for(i in seq_along(d)) {
Next <- if (i != length(d)) read_csv(d[i+1], n_max = k-1)
DF <- rbind(read_csv(d[i]), Next)
r0 <- rollapply(DF, k, sum, align = "left")
# if output too large replace next statement with one to write out r0
r <- rbind(r, r0)
}
# check
r2 <- rollapply(data.frame(val = sequence(rep(5, 5))), k, sum, align = "left")
identical(r, r2)
## [1] TRUE
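Two practical points may be worth sketching here. First, dir() returns file names in alphabetical order, so with ten or more files "10_database.csv" would sort before "2_database.csv". Second, the comment in the loop suggests writing each result out instead of growing r. The base-R sketch below (no zoo or readr) does both. roll_mean() and roll_files() are hypothetical helper names, and the sketch assumes a single numeric column called val and that every file has at least k-1 rows.

```r
# Hypothetical helper: left-aligned rolling mean of width k over a vector
roll_mean <- function(x, k) {
  n <- length(x) - k + 1
  if (n < 1) return(numeric(0))
  vapply(seq_len(n), function(i) mean(x[i:(i + k - 1)]), numeric(1))
}

# Hypothetical driver: processes one file at a time and appends to out_file
roll_files <- function(files, k, out_file) {
  # Order by the leading number so that "10_..." comes after "2_..."
  files <- files[order(as.numeric(sub("_.*$", "", basename(files))))]
  if (file.exists(out_file)) file.remove(out_file)
  for (i in seq_along(files)) {
    cur <- read.csv(files[i])$val
    # Borrow the first k-1 rows of the next file so windows can span files
    if (i < length(files) && k > 1) {
      cur <- c(cur, read.csv(files[i + 1], nrows = k - 1)$val)
    }
    res <- data.frame(val = roll_mean(cur, k))
    first <- !file.exists(out_file)
    # Write the header only once, then append
    write.table(res, out_file, sep = ",", row.names = FALSE,
                col.names = first, append = !first)
  }
  invisible(out_file)
}

# Usage (assumed file layout):
# files <- list.files(pattern = "database\\.csv$", full.names = TRUE)
# roll_files(files, k = 2, out_file = "rolling_mean.csv")
```

Only the current file and k-1 rows of its successor are held in memory during each iteration; the accumulated result lives on disk.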
Related
I have a large number of CSV files. I need to extract relevant data from each file, and compile all of the relevant data into a new file.
I have been copying/pasting the code below and changing relevant details (e.g., file name) to repeat the same process for many CSV files. After that, I use cbind()/write.xlsx() to combine all of the relevant data and write it to an Excel file. I need a more efficient method to accomplish this task.
How can I:
incorporate a loop that imports a large number of CSV files (to replace #1 below)
select relevant rows based on a string instead of entering specific row numbers (to replace #2 below)
combine all of the relevant data into a single data frame with each file's data in one column
library(tidyr)
# 1 - import raw data files
file1 <- read.csv ("1.csv", header = FALSE, sep = "\n")
# 2 - select relevant rows
file1 <- as.data.frame(file1[c(41:155),])
colnames(file1) <- c("file1")
#separate components of each line from raw csv file / isolate data
temp1 <- separate(file1, file1, into = c("Text", "IntNum", "Data"), sep = "\\s")
temp1 <- temp1$Data
temp1 <- as.data.frame(temp1)
If the number of relevant rows in each file is the same, you could do it like this. Option 1 shows a solution using a loop, option 2 shows a solution using sapply.
In a first step I generate three csv-files to make the code reproducible. The start row in each file is defined by "start", the end row by "end". I then get a list with the names of these files with dir().
#make csv-files, target vector always same length (3)
set.seed(1)
for (i in 1:3) {
df <- data.frame(x = c(rep(0, sample(1:10,1)), "start",
paste0("dat", i),
"end",rep(0, sample(1:10, 1))))
write.csv(df, file = paste0("file", i, ".csv"), quote = FALSE, row.names = FALSE)
}
#get list of file names
allFiles <- dir(pattern = glob2rx("*.csv"))
Option 1 - loop
For the loop you could first initialize a result data frame ("outDF") with the number of columns set to the number of csv-files and the number of rows set to the length of the target vector in each file ("start" to "end"). You can then loop over the files and fill the data frame. The start and end rows can be indexed using which().
#initialise result data frame
outDF <- data.frame(matrix(nrow = 3, ncol = length(allFiles),
dimnames = list(NULL, allFiles)))
#loop over csv files
for (iFile in allFiles) {
idat <- read.csv(iFile, stringsAsFactors = FALSE) #read csv
outDF[, iFile] <- idat[which(idat$x == "start"):which(idat$x == "end"),]
}
Option 2 - sapply
Instead of a loop you could use sapply with a custom function to extract the relevant rows in each file. This returns a matrix which you could then transform into a dataframe.
out <- sapply(allFiles, FUN = function(x) {
idat <- read.csv(x, stringsAsFactors = FALSE)
return(idat[which(idat$x == "start"):which(idat$x == "end"),])
})
outDF <- as.data.frame(out)
If the number of rows between "start" and "end" differs between files, the above options won't work. In this case you could generate a data frame by first using lapply() (similar to option 2) to generate a result list (with different lengths of the list elements) and then padding shorter list elements with NAs before transforming the result into a data frame again.
#make csv-files with target vectors of different lengths (3:12)
set.seed(1)
for (i in 1:3) {
df <- data.frame(x = c(rep(0, sample(1:10,1)), "start",
rep(paste0("dat", i), sample(1:10,1)),
"end",rep(0, sample(1:10, 1))))
write.csv(df, file = paste0("file", i, ".csv"), quote = FALSE, row.names = FALSE)
}
#lapply
out <- lapply(allFiles, FUN = function(x) {
idat = read.csv(x, stringsAsFactors = FALSE)
return(idat[which(idat$x == "start"):which(idat$x == "end"),])
})
out <- lapply(out, `length<-`, max(lengths(out)))
outDF <- do.call(cbind, out)
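The padding step can be seen in isolation on a toy list. This is only a sketch with made-up values standing in for the result of lapply(). Because lapply() over allFiles returns an unnamed list, the columns produced by cbind have no names; giving the list names (for example names(out) <- allFiles) is one way to label each column with its file.

```r
# Toy stand-in for the lapply() result: vectors of unequal length
out <- list(file1.csv = c("dat1", "dat1"),
            file2.csv = c("dat2", "dat2", "dat2", "dat2"))

# `length<-` extends each vector to the common length, filling with NA
padded <- lapply(out, `length<-`, max(lengths(out)))

# The list names become the column names of the data frame
outDF <- as.data.frame(padded, stringsAsFactors = FALSE)
print(outDF)
```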
A machine I use spits out .csv files named by the time. But I need them named after the plate they were read from, which is contained within the file.
I created a list of files:
files <- list.files(path="", pattern="*.csv")
I then tried using a for-loop to first create a data frame from each file containing the 1st row only, then to create a variable from the relevant piece of data, (the desired name), and then renaming the files.
for(x in files)
{
y <- read.csv(x, nrow = 1, header = FALSE, stringsAsFactors = TRUE)
z <- y[2, 2]
file.rename(x, z)
}
It didn't work. After 7 hours of trying (new to R) I am here. Please give simple advice, I have basically zero R experience.
I believe the following for loop does what the question asks for if the new filename is the second value on the first line of each file.
If it is in a different column, change nmax and the index in x[2] to the appropriate column number.
fls <- list.files(pattern = '\\.csv')
for(f in fls){
x <- scan(file = f, what = character(), nmax = 2, nlines = 1, sep = ',')
g <- paste0(x[2], '.csv')
file.rename(f, g)
}
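As for why the loop in the question did not work: read.csv(x, nrow = 1, ...) reads a single row, so y[2, 2] (row 2) is NA, and the new name also lacks the .csv extension. A more defensive variant of the renaming loop might look like the sketch below. rename_by_cell() is a hypothetical helper; it assumes the plate name sits in column col of the first line and skips files whose target name is missing or already taken.

```r
# Hypothetical helper: rename each csv after a cell on its first line
rename_by_cell <- function(files, col = 2) {
  renamed <- character(0)
  for (f in files) {
    # Read only the first line; keep strings as plain characters
    first <- read.csv(f, header = FALSE, nrows = 1, stringsAsFactors = FALSE)
    new_name <- trimws(as.character(first[1, col]))
    target <- file.path(dirname(f), paste0(new_name, ".csv"))
    # Never overwrite an existing file or rename to an empty name
    if (is.na(new_name) || !nzchar(new_name) || file.exists(target)) {
      warning("Skipping ", f, ": no usable or unused name found")
      next
    }
    if (file.rename(f, target)) renamed <- c(renamed, target)
  }
  invisible(renamed)
}

# Usage (assumed directory):
# rename_by_cell(list.files(pattern = "\\.csv$", full.names = TRUE))
```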
I have many .csv files of different sizes. I select some of them that meet a condition (in the example, those whose names match my id). They are ordered by date and can be huge. I need to know the minimum and maximum dates of these files.
I can read all of the selected files, keeping only the column date.hour, and then easily find the minimum and maximum of all the date values.
But since I repeat this for thousands of ids, it would be a lot faster if I could read only the first and last rows of my files.
Does anyone have an idea of how to solve this?
This code works well, but I wish to improve it.
Function to read several files at once:
read.tables.simple <- function(file.names, ...) {
  require(plyr)
  ldply(file.names, function(fn) data.frame(read.table(fn, ...)))
}
Reading the files and selecting the minimum and maximum dates for all of these:
diri <- dir()
dat <- read.tables.simple(diri[1], header = TRUE, sep = ";", colClasses = "character")
colclass <- rep("NULL", ncol(dat))
x <- which(colnames(dat) == "date.hour")
colclass[x] <- "character"
x <- grep("id", diri)
dat <- read.tables.simple(diri[x], header = TRUE, sep = ";", colClasses = colclass)
datmin <- min(dat$date.hour)
datmax <- max(dat$date.hour)
In general, read.table is very slow. If you use read_tsv, read_csv or read_delim from the readr library, it will already be much, much faster.
If you are on Linux/Mac OS, you can also read only the first or last parts by setting up a pipe, which will be more or less instant, no matter how large your file is. Let's assume you have no column headers:
library(readr)
read_last <- function(file) {
read_tsv(pipe(paste('tail -n 1', file)), col_names=FALSE)
}
# Readr can already read only a select number of lines, use `n_max`
first <- read_tsv(file, n_max=1, col_names=FALSE)
If you want to go in on parallelism, you can even read files in parallel, see e.g., library(parallel) and ?mclapply
The following function will read the first two lines of your csv (the header row and the first data row), then seek to the end of the file and read the last line. It will then stick these three lines together to read them as a two-row csv in memory, from which it returns the column date.hour. This will contain your minimum and maximum values, since the times are arranged in order.
You need to tell the function the maximum line length. It's OK if you over-estimate this, but make sure the number is less than a third of your file size.
read_head_tail <- function(file_path, line_length = 100)
{
con <- file(file_path)
open(con)
seek(con, where = 0)
first <- suppressWarnings(readChar(con, nchars = 2 * line_length))
first <- strsplit(first, "\n")[[1]][1:2]
seek(con, where = file.info(file_path)$size - line_length)
last <- suppressWarnings(readChar(con, nchars = line_length))
last <- strsplit(last, "\n")[[1]]
last <- last[length(last)]
close(con)
csv <- paste(paste0(first, collapse = "\n"), last, sep = "\n")
df <- read.csv(text = csv, stringsAsFactors = FALSE)[-1]
return(df$date.hour)
}
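A variant of the same idea, sketched under a few assumptions: the files use ";" as separator and have a header row (as in the question), the content is single-byte text, and tail_bytes is larger than the longest line. read_first_last() is a hypothetical name. It reads the head with readLines() and only the last tail_bytes bytes of the file with readBin(), so the cost does not grow with file size.

```r
# Hypothetical helper: return the first and last data rows of a ';'-separated file
read_first_last <- function(path, tail_bytes = 1024) {
  first <- readLines(path, n = 2)               # header + first data row
  size <- file.info(path)$size
  con <- file(path, "rb")
  on.exit(close(con))
  # Jump close to the end and read the remaining bytes
  seek(con, where = max(0, size - tail_bytes))
  chunk <- readBin(con, what = "raw", n = tail_bytes)
  tail_lines <- strsplit(rawToChar(chunk), "\r?\n")[[1]]
  tail_lines <- tail_lines[nzchar(tail_lines)]  # drop empty trailing pieces
  last <- tail_lines[length(tail_lines)]
  read.table(text = c(first, last), header = TRUE, sep = ";",
             stringsAsFactors = FALSE)
}

# Usage (assumed column name from the question):
# rng <- read_first_last("some_file.csv")$date.hour
```

If the file has a single data row, the first and last rows returned are the same row.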
I'm quite new at R and a bit stuck on what I feel is likely a common operation to do. I have a number of files (57 with ~1.5 billion rows cumulatively by 6 columns) that I need to perform basic functions on. I'm able to read these files in and perform the calculations I need no problem but I'm tripping up in the final output. I envision the function working on 1 file at a time, outputting the worked file and moving onto the next.
After calculations I would like to output 57 new .txt files named after the file the input data first came from. So far I'm able to perform the calculations on smaller test datasets and spit out 1 appended .txt file but this isn't what I want as a final output.
#list filenames
files <- list.files(path=, pattern="*.txt", full.names=TRUE, recursive=FALSE)
#begin looping process
loop_output = lapply(files,
function(x) {
#Load 'x' file in
DF<- read.table(x, header = FALSE, sep= "\t")
#Call calculated height average a name
R_ref= 1647.038203
#Add column names to .las data
colnames(DF) <- c("X","Y","Z","I","A","FC")
#Calculate return
DF$R_calc <- (R_ref - DF$Z)/cos(DF$A*pi/180)
#Calculate intensity
DF$Ir_calc <- DF$I * (DF$R_calc^2/R_ref^2)
#Output new .txt with calculated columns
write.table(DF, file=, row.names = FALSE, col.names = FALSE, append = TRUE,fileEncoding = "UTF-8")
})
My latest code endeavors have been to mess around with the initial lapply/sapply function like so:
#begin looping process
loop_output = sapply(names(files),
function(x) {
As well as the output line:
#Output new .csv with calculated columns
write.table(DF, file=paste0(names(DF), "txt", sep="."),
row.names = FALSE, col.names = FALSE, append = TRUE,fileEncoding = "UTF-8")
From what I've been reading the file naming function during write.table output may be one of the keys I don't have fully aligned yet with the rest of the script. I've been viewing a lot of other asked questions that I felt were applicable:
Using lapply to apply a function over list of data frames and saving output to files with different names
Write list of data.frames to separate CSV files with lapply
to no luck. I deeply appreciate any insights or paths towards the right direction on inputting x number of files, performing the same function on each, then outputting the same x number of files. Thank you.
The reason the output is directed to the same file is probably that file = paste0(names(DF), "txt", sep=".") returns the same value for every iteration. That is, DF must have the same column names in every iteration, therefore names(DF) will be the same, and paste0(names(DF), "txt", sep=".") will be the same. Along with the append = TRUE option the result is that all output is written to the same file.
Inside the anonymous function, x is the name of the input file. Instead of using names(DF) as a basis for the output file name you could do some transformation of this character string.
For example, given
x <- "/foo/raw_data.csv"
Inside the function you could do something like this
infile <- x
outfile <- file.path(dirname(infile), gsub('raw', 'clean', basename(infile)))
outfile
[1] "/foo/clean_data.csv"
Then use the new name for output, with append = FALSE (unless you need it to be true)
write.table(DF, file = outfile, row.names = FALSE, col.names = FALSE, append = FALSE, fileEncoding = "UTF-8")
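To see what the original paste0() call evaluates to, it can be run on the column names alone. The sketch below uses the column names from the question and a made-up input path; the "_calc" suffix is an arbitrary choice for illustration. Since names(DF) does not depend on the input file, nothing in that expression can distinguish one file from the next, whereas a name built from x does.

```r
cols <- c("X", "Y", "Z", "I", "A", "FC")   # identical for every input file

# paste0() has no sep argument, so "." is pasted on as one more string,
# giving one candidate name per column rather than one per file
bad <- paste0(cols, "txt", sep = ".")
print(bad)

# Deriving the name from the input path gives one distinct name per file
x <- "/foo/scan_01.txt"                    # hypothetical input path
outfile <- file.path(dirname(x), sub("\\.txt$", "_calc.txt", basename(x)))
print(outfile)
```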
Using your code, this is the general idea:
require(purrr)
#list filenames
files <- list.files(path=, pattern="*.txt", full.names=TRUE, recursive=FALSE)
#Call calculated height average a name
R_ref= 1647.038203
dfTransform <- function(file){
colnames(file) <- c("X","Y","Z","I","A","FC")
#Calculate return
file$R_calc <- (R_ref - file$Z)/cos(file$A*pi/180)
#Calculate intensity
file$Ir_calc <- file$I * (file$R_calc^2/R_ref^2)
return(file)
}
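# Process one file at a time so that only one data frame is in memory.
# The "_calc" suffix for the output name is an assumption; adapt as needed.
walk(files, function(f) {
  DF <- dfTransform(read.table(f, header = FALSE, sep = "\t"))
  write.table(DF, file = sub("\\.txt$", "_calc.txt", f),
              row.names = FALSE, col.names = FALSE, fileEncoding = "UTF-8")
})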
After having searched for help in different threads on this topic, I still have not become wiser. Therefore: Here comes another question on looping through multiple data files...
OK. I have multiple CSV files in one folder containing 5 columns of data. The filenames are as follows:
Moist yyyymmdd hh_mm_ss.csv
I would like to create a script that reads and processes the CSV files one by one, doing the following steps:
1) load file
2) check number of rows and exclude file if less than 3 registrations
3) calculate mean value of all measurements (=rows) for column 2
4) calculate mean value of all measurements (=rows) for column 4
5) output the filename timestamp, mean column 2 and mean column 4 to a data frame.
I have written the following function
moist.each.mean <- function() {
library("tcltk")
directory <- tk_choose.dir("","Choose folder for Humidity data files")
setwd(directory)
filelist <- list.files(path = directory)
filetitles <- regmatches(filelist, regexpr("[0-9].*[0-9]", filelist))
mdf <- data.frame(timestamp=character(), humidity=numeric(), temp=numeric())
for(i in 1:length(filelist)){
file.in[[i]] <- read.csv(filelist[i], header=F)
if (nrow(file.in[[i]]<3)){
print("discard")
} else {
newrow <- c(filetitles[[i]], round(mean(file.in[[i]]$V2),1), round(mean(file.in[[i]]$V4),1))
mdf <- rbind(mdf, newrow)
}
}
names(mdf) <- c("timestamp", "humidity", "temp")
}
but I keep getting an error:
Error in `[[<-.data.frame`(`*tmp*`, i, value = list(V1 = c(10519949L, :
replacement has 18 rows, data has 17
Any ideas?
Thx, kruemelprinz
I'd also suggest to use (l)apply... Here's my take:
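getMeans <- function(fpath, runfct,  # runfct is not used in this function
                     target_cols = c(2),
                     sep = ",",
                     dec = ".",
                     header = TRUE,
                     min_obs_threshold = 3){
  # full paths of all csv files in fpath
  fcsv <- list.files(fpath, pattern = "\\.csv$", full.names = TRUE)
  csv_list <- lapply(fcsv, read.table, sep = sep,
                     dec = dec, header = header)
  csv_rows <- sapply(csv_list, nrow)
  # keep only files with at least min_obs_threshold rows
  rel_csv_list <- csv_list[csv_rows >= min_obs_threshold]
  # drop = FALSE keeps a data frame even when a single column is selected
  lapply(rel_csv_list, function(x) colMeans(x[, target_cols, drop = FALSE]))
}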
Also with that kind of error message, the debugger might be very helpful.
Just run debug(moist.each.mean) and execute the function stepwise.
Here's a slightly different approach. Use lapply to read each csv file, exclude it if necessary, otherwise create a summary. This gives you a list where each element is a data frame summary. Then use rbind to create the final summary data frame.
Without a sample of your data, I can't be sure the code below exactly matches your problem, but hopefully it will be enough to get you where you want to go.
# Get vector of filenames to read
filelist=list.files(path=directory, pattern="csv")
# Read all the csv files into a list and create summaries
df.list = lapply(filelist, function(f) {
file.in = read.csv(f, header=FALSE, stringsAsFactors=FALSE)
# Skip the file (only print a message) if it has less than 3 rows of data
if (nrow(file.in) < 3) {
print(paste("Discard", f))
# Otherwise, capture file timestamp and summarise data frame
} else {
# characters 7 to 23 of "Moist yyyymmdd hh_mm_ss.csv" hold the timestamp
data.frame(timestamp=substr(f, 7, 23),
humidity=round(mean(file.in$V2),1),
temp=round(mean(file.in$V4),1))
}
})
# Bind list into final summary data frame (excluding the list elements
# that don't contain a data frame because they didn't have enough rows
# to be included in the summary)
result = do.call(rbind, df.list[sapply(df.list, is.data.frame)])
One issue with your original code is that you create a vector of summary results rather than a data frame of results:
c(filetitles[[i]], round(mean(file.in[[i]]$V2),1), round(mean(file.in[[i]]$V4),1)) is a vector with three elements. What you actually want is a data frame with three columns:
data.frame(timestamp=filetitles[[i]],
humidity=round(mean(file.in[[i]]$V2),1),
temp=round(mean(file.in[[i]]$V4),1))
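The difference is easy to check on made-up values (the timestamp and numbers below are purely illustrative): c() coerces everything to character as soon as one element is a string, while data.frame() keeps one type per column, so rows can later be stacked with rbind() and the means stay numeric.

```r
# c() coerces all elements to a common type, here character
newrow_vec <- c("20200101 10_00_00", round(55.26, 1), round(21.04, 1))
print(class(newrow_vec))

# data.frame() keeps a separate type for each column
newrow_df <- data.frame(timestamp = "20200101 10_00_00",
                        humidity = round(55.26, 1),
                        temp = round(21.04, 1))
print(sapply(newrow_df, class))

# One-row data frames with the same columns stack cleanly
stacked <- rbind(newrow_df, newrow_df)
print(nrow(stacked))
```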
Thanks for the suggestions using lapply. This is definitely of value as it saves a whole lot of code as well! Meanwhile, I managed to fix my original code as well:
library("tcltk")
# directory: path to csv files
directory <-
tk_choose.dir("","Choose folder for Humidity data files")
setwd(directory)
filelist <- list.files(path = directory)
filetitles <-
regmatches(filelist, regexpr("[0-9].*[0-9]", filelist))
mdf <- data.frame()
for (i in 1:length(filelist)) {
file.in <- read.csv(filelist[i], header = F, skipNul = T)
if (nrow(file.in) < 3) {
print("discard")
} else {
newrow <- matrix(
c(filetitles[[i]], round(mean(file.in$V2, na.rm=T),1), round(mean(file.in$V4, na.rm=T),1)),
nrow = 1, ncol = 3, byrow = T
)
mdf <- rbind(mdf, newrow)
}
}
names(mdf) <- c("timestamp", "humidity", "temp")
The only catch is that I could not get it to work as a function: in that case mdf ended up with a single row containing the data of the last file. Somehow each iteration overwrote row 1 instead of adding a row. Using the code without a function wrapper worked fine, though.
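One thing that may be relevant to the function version: an R function returns the value of its last evaluated expression, and in the function from the question that expression is names(mdf) <- ..., so mdf itself is never returned. The sketch below is one possible function form that returns the data frame explicitly. moist_means() is a hypothetical name, and it assumes headerless files named like "Moist yyyymmdd hh_mm_ss.csv" with numeric columns 2 and 4.

```r
# Hypothetical function form: one summary row per file with enough rows
moist_means <- function(directory, min_rows = 3) {
  files <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  rows <- lapply(files, function(f) {
    dat <- read.csv(f, header = FALSE, skipNul = TRUE)
    if (nrow(dat) < min_rows) return(NULL)   # discarded files contribute nothing
    fname <- basename(f)
    data.frame(
      timestamp = regmatches(fname, regexpr("[0-9].*[0-9]", fname)),
      humidity  = round(mean(dat$V2, na.rm = TRUE), 1),
      temp      = round(mean(dat$V4, na.rm = TRUE), 1)
    )
  })
  mdf <- do.call(rbind, rows)                # rbind() ignores the NULL entries
  mdf                                        # return the data frame explicitly
}

# Usage (assumed folder):
# result <- moist_means("path/to/humidity/files")
```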