I need to read many files into R, do some clean up, and then combine them into one data frame. The files all basically start like this:
=~=~=~=~=~=~=~=~=~=~=~= PuTTY log 2016.07.11 09:47:35 =~=~=~=~=~=~=~=~=~=~=~=
up
Upload #18
Reader: S1 Site: AA
--------- upload 18 start ---------
Type,Date,Time,Duration,Type,Tag ID,Ant,Count,Gap
E,2016-07-05,11:45:44.17,"upload 17 complete"
D,2016-07-05,11:46:24.69,00:00:00.87,HA,900_226000745055,A2,8,1102
D,2016-07-05,11:46:43.23,00:00:01.12,HA,900_226000745055,A2,10,143
The row with column headers is "Type,Date,Time,Duration,Type,Tag ID,Ant,Count,Gap". Data should have 9 columns. The problem is that the number of rows above the header string is different for every file, so I cannot simply use skip = 5. I also only need lines that begin with "D,", everything else is messages, not data.
What is the best way to read in my files, ensuring that I have 9 columns and skipping all the junk?
I have been using the read_csv function from the readr package because so far it has produced the fewest formatting issues, but I am open to new ideas, including a way to read in just the lines that begin with "D,". I toyed with using read.table and skip = grep("Type,", readLines(i)), but it doesn't seem to find the header string correctly. Here's my basic code:
dataFiles <- Sys.glob("*.*")
datalist <- list()
for (i in dataFiles) {
d01 <- read_csv(i, col_names = F, na = "NA", skip = 35)
# do clean-up stuff
datalist[[i]] <- d01
}
Another base R solution is the following: read the file line by line, find the indices of the rows that begin with "D," and of the header row, then split those lines on "," and put the pieces in a data.frame, assigning the names from the header row.
lines <- readLines(i)
dataRows <- grep("^D,", lines)
names <- unlist(strsplit(lines[grep("Type,", lines)], split = ","))
data <- as.data.frame(matrix(unlist(strsplit(lines[dataRows], ",")), nrow = length(dataRows), byrow=T))
names(data) <- names
Output:
Type Date Time Duration Type Tag ID Ant Count Gap
1 D 2016-07-05 11:46:24.69 00:00:00.87 HA 900_226000745055 A2 8 1102
2 D 2016-07-05 11:46:43.23 00:00:01.12 HA 900_226000745055 A2 10 143
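The same idea can be wrapped in a small helper and applied to every file. The sketch below is one possible arrangement, not part of the original answer: `read_d_rows` is a hypothetical name, the column names are typed out by hand (the second `Type` is renamed `Type2` to avoid a duplicate), and the sample log is written to a temporary file purely for demonstration.

```r
# Hypothetical helper: keep only the lines starting with "D," and parse them as CSV.
read_d_rows <- function(path) {
  cols <- c("Type", "Date", "Time", "Duration", "Type2",
            "TagID", "Ant", "Count", "Gap")
  lines <- readLines(path)
  d_lines <- grep("^D,", lines, value = TRUE)
  if (length(d_lines) == 0) {
    # No data rows: return an empty data frame with the expected columns
    empty <- data.frame(matrix(character(0), ncol = length(cols)),
                        stringsAsFactors = FALSE)
    return(setNames(empty, cols))
  }
  read.csv(text = d_lines, header = FALSE, col.names = cols,
           stringsAsFactors = FALSE)
}

# Demonstration data, shortened from the log shown in the question
sample_log <- c(
  "=~=~= PuTTY log 2016.07.11 09:47:35 =~=~=",
  "up",
  "Type,Date,Time,Duration,Type,Tag ID,Ant,Count,Gap",
  "E,2016-07-05,11:45:44.17,\"upload 17 complete\"",
  "D,2016-07-05,11:46:24.69,00:00:00.87,HA,900_226000745055,A2,8,1102",
  "D,2016-07-05,11:46:43.23,00:00:01.12,HA,900_226000745055,A2,10,143"
)
tmp <- tempfile(fileext = ".log")
writeLines(sample_log, tmp)

files <- c(tmp, tmp)  # stand-in for Sys.glob("*.*")
combined <- do.call(rbind, lapply(files, read_d_rows))
```

Because the junk lines are discarded before parsing, the number of lines above the header no longer matters.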
You can use a custom function to loop over each file, keep only the rows whose Type column starts with D, and bind them all together at the end. Drop the bind_rows if you want them as a list of separate data frames.
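load_data <- function(path) {
  require(dplyr)
  # Full paths, so there is no need to change the working directory
  files <- dir(path, full.names = TRUE)
  read_files <- function(x) {
    # Force 9 columns; otherwise the width is guessed from the first few (junk) lines
    data_file <- read.csv(x, header = FALSE, col.names = paste0("V", 1:9), fill = TRUE,
                          stringsAsFactors = FALSE, na.strings = c("", "NA"))
    row.number <- grep("^Type$", data_file[, 1])
    # The header contains "Type" twice, so make the names unique before filtering
    colnames(data_file) <- make.unique(as.character(unlist(data_file[row.number, ])))
    # Drop the header row and everything above it
    data_file <- data_file[-(1:row.number), ]
    data_file <- data_file %>%
      filter(grepl("^D", Type))
    return(data_file)
  }
  lapply(files, read_files)
}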
list_of_file <- bind_rows(load_data("YOUR_FOLDER_PATH"))
If your header row always begins with the word Type, you can simply omit the skip option from your initial read, and then remove any rows before the header row. Here's some code to get you started (not tested):
dataFiles <- Sys.glob("*.*")
datalist <- list()
for (i in dataFiles) {
d01 <- read_csv(i, col_names = F, na = "NA")
headerRow <- which(d01[[1]] == "Type")
d01 <- d01[(headerRow + 1):nrow(d01), ] # This keeps all rows after the header row.
# do clean-up stuff
datalist[[i]] <- d01
}
If you want to keep the header, you can use:
for (i in dataFiles) {
d01 <- read_csv(i, col_names = F, na = "NA")
headerRow <- which(d01[[1]] == "Type")
header <- unlist(d01[headerRow, ]) # Get names from header row.
d01 <- d01[(headerRow + 1):nrow(d01), ] # This keeps all rows after the header row.
d01 <- setNames(d01, header) # Assign names.
# do clean-up stuff
datalist[[i]] <- d01
}
I have several files with the names RTDFE, TRYFG, FTYGS, WERTS...like 100 files in txt format. For each file, I'm using the following code and writing the output in a file.
name = c("RTDFE")
file1 <- paste0(name, "_filter",".txt")
file2 <- paste0(name, "_data",".txt")
### One
A <- read.delim(file1, sep = "\t", header = FALSE)
#### two
B <- read.delim(file2, sep = "\t", header = FALSE)
C <- merge(A, B, by="XYZ")
nrow(C)
145
Output:
Samples Common
RTDFE 145
Each time, I assign a file name to the variable name, run my code, and write the output to a file. Instead, I want to run the code on all the files in one go and get the following output, where Common is the number of rows of the merged data frame C.
The output I need:
Samples Common
RTDFE 145
TRYFG ...
FTYGS ...
WERTS ...
How to do this? Any help.
How about putting all your names in a single vector, called names, like this:
names<-c("TRYFG","RTDFE",...)
and then feeding each one to a function that reads the files, merges them, and returns the rows
f<-function(n) {
fs = paste0(n,c("_filter", "_data"),".txt")
C = merge(
read.delim(fs[1],sep="\t", header=F),
read.delim(fs[2],sep="\t", header=F), by="XYZ")
data.frame(Samples=n,Common=nrow(C))
}
Then just call this function f on each of the values in names, row-binding the results together:
do.call(rbind, lapply(names, f))
An easy way to create the vector names is like this:
p = "_(filter|data).txt"
names = unique(gsub(p,"",list.files(pattern = p)))
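To see the pieces working together, here is a self-contained sketch that runs in a temporary folder. It makes assumptions the question does not confirm: the files are read with a header line that contains an `XYZ` column (otherwise `by = "XYZ"` has nothing to match), and the file contents, `demo_dir` and `count_common` are invented for illustration.

```r
# Work in a temporary folder so the demo does not touch real data
demo_dir <- file.path(tempdir(), "merge_demo")
dir.create(demo_dir, showWarnings = FALSE)

# Hypothetical helper to write one tab-separated file with a header line
write_tsv <- function(df, name) {
  write.table(df, file.path(demo_dir, name), sep = "\t",
              quote = FALSE, row.names = FALSE)
}
write_tsv(data.frame(XYZ = c("a", "b", "c")), "RTDFE_filter.txt")
write_tsv(data.frame(XYZ = c("b", "c", "d"), value = 1:3), "RTDFE_data.txt")
write_tsv(data.frame(XYZ = c("x", "y")), "TRYFG_filter.txt")
write_tsv(data.frame(XYZ = c("y", "z"), value = 1:2), "TRYFG_data.txt")

# Count the rows shared by <sample>_filter.txt and <sample>_data.txt
count_common <- function(n, dir) {
  fs <- file.path(dir, paste0(n, c("_filter", "_data"), ".txt"))
  merged <- merge(read.delim(fs[1]), read.delim(fs[2]), by = "XYZ")
  data.frame(Samples = n, Common = nrow(merged), stringsAsFactors = FALSE)
}

# Derive the sample names from the file names, then bind the per-sample rows
pat <- "_(filter|data)\\.txt$"
samples <- unique(sub(pat, "", list.files(demo_dir, pattern = pat)))
result <- do.call(rbind, lapply(samples, count_common, dir = demo_dir))
```

With these made-up files, `result` has one row per sample, and writing it out with `write.table` would give the requested Samples/Common table.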
I am making some assumptions here.
The first assumption is that you have all these files in a folder with no other text files (.txt) in this folder.
If so you can get the list of files with the command list.files.
But when doing so you will get both the "_data.txt" and the "_filter.txt" files.
We need a way to extract the base part of the name.
I use "str_replace" to remove the "_data.txt" and the "_filter.txt" from each file name.
That leaves every base name in the list twice, so I use the "unique" command.
I store this in "lfiles", which will now contain "RTDFE, TRYFG, FTYGS, WERTS..." and any other file that satisfies the conditions.
After this I run a for loop on this list.
I reopen the files similarly as you do.
I merge by XYZ and I immediately put the results in a data frame.
By using rbind I keep adding results to the data frame "res".
library(stringr)
lfiles=list.files(path = ".", pattern = ".txt")
## strip the "_data.txt" and "_filter.txt" suffixes from the file names
lfiles=unique( sapply(lfiles, function(x){
x=str_replace(x, "_data.txt", "")
x=str_replace(x, "_filter.txt", "")
return(x)
} ))
res=NULL
for(i in lfiles){
file1 <- paste0(i, "_filter.txt")
file2 <- paste0(i, "_data.txt")
### One
A <- read.delim(file1, sep = "\t", header = FALSE)
#### two
B <- read.delim(file2, sep = "\t", header = FALSE)
res=rbind(res, data.frame(Samples=i, Common=nrow(merge(A, B, by="XYZ"))))
}
OK, I will assume you have a folder called "data" with files named RTDFE_filter.txt, RTDFE_data.txt, TRYFG_filter.txt, TRYFG_data.txt, etc. (only and exactly these files).
This code shows one possible way:
# save the file names (with their paths, so they can be read from any working directory)
files = list.files("data", full.names = TRUE)
# get indexes for "data" (for "filter" indexes, add 1)
files_data_index = seq(1, length(files), 2) # 1, 3, 5, ...
# loop on indexes
results = lapply(files_data_index, function(i) {
  A <- read.delim(files[i+1], sep = "\t", header = FALSE)
  B <- read.delim(files[i], sep = "\t", header = FALSE)
  C <- merge(A, B, by="XYZ")
  # the sample name is the part of the file name before the first "_"
  samp = strsplit(basename(files[i]), "_")[[1]][1]
  com = nrow(C)
  return(c(Samples = samp, Common = com))
})
# combine results
do.call(rbind, results)
I have a large number of CSV files. I need to extract relevant data from each file, and compile all of the relevant data into a new file.
I have been copying/pasting the code below and changing relevant details (e.g., file name) to repeat the same process for many CSV files. After that, I use cbind()/write.xlsx() to combine all of the relevant data and write it to an excel file. I need a more efficient method to accomplish this task.
How can I:
incorporate a loop that imports a large number of CSV files (to replace #1 below)
select relevant rows based on a string instead of entering specific row numbers (to replace #2 below)
combine all of the relevant data into a single data frame with each file's data in one column
library(tidyr)
# 1 - import raw data files
file1 <- read.csv ("1.csv", header = FALSE, sep = "\n")
# 2 - select relevant rows
file1 <- as.data.frame(file1[c(41:155),])
colnames(file1) <- c("file1")
#separate components of each line from raw csv file / isolate data
temp1 <- separate(file1, file1, into = c("Text", "IntNum", "Data"), sep = "\\s")
temp1 <- temp1$Data
temp1 <- as.data.frame(temp1)
If the number of relevant rows in each file is the same, you could do it like this. Option 1 shows a solution using a loop, option 2 shows a solution using sapply.
In a first step I generate three csv-files to make the code reproducible. The start row in each file is defined by "start", the end row by "end". I then get a list with the names of these files with dir().
#make csv-files, target vector always same length (3)
set.seed(1)
for (i in 1:3) {
df <- data.frame(x = c(rep(0, sample(1:10,1)), "start",
paste0("dat", i),
"end",rep(0, sample(1:10, 1))))
write.csv(df, file = paste0("file", i, ".csv"), quote = FALSE, row.names = FALSE)
}
#get list of file names
allFiles <- dir(pattern = glob2rx("*.csv"))
Option 1 - loop
For the loop you could first initialize a result data frame ("outDF") with the number of columns set to the number of csv-files and the number of rows set to the length of the target vector in each file ("start" to "end"). You can then loop over the files and fill the data frame. The start and end rows can be indexed using which().
#initialise result data frame
outDF <- data.frame(matrix(nrow = 3, ncol = length(allFiles),
dimnames = list(NULL, allFiles)))
#loop over csv files
for (iFile in allFiles) {
idat <- read.csv(iFile, stringsAsFactors = FALSE) #read csv
outDF[, iFile] <- idat[which(idat$x == "start"):which(idat$x == "end"),]
}
Option 2 - sapply
Instead of a loop you could use sapply with a custom function to extract the relevant rows in each file. This returns a matrix which you could then transform into a dataframe.
out <- sapply(allFiles, FUN = function(x) {
idat <- read.csv(x, stringsAsFactors = FALSE)
return(idat[which(idat$x == "start"):which(idat$x == "end"),])
})
outDF <- as.data.frame(out)
If the number of rows between "start" and "end" differs between files, the above options won't work. In this case you could first use lapply() (similar to option 2) to generate a result list whose elements have different lengths, then pad the shorter elements with NAs before transforming the result into a data frame again.
#make csv-files with target vectors of different lengths (3:12)
set.seed(1)
for (i in 1:3) {
df <- data.frame(x = c(rep(0, sample(1:10,1)), "start",
rep(paste0("dat", i), sample(1:10,1)),
"end",rep(0, sample(1:10, 1))))
write.csv(df, file = paste0("file", i, ".csv"), quote = FALSE, row.names = FALSE)
}
#lapply
out <- lapply(allFiles, FUN = function(x) {
idat = read.csv(x, stringsAsFactors = FALSE)
return(idat[which(idat$x == "start"):which(idat$x == "end"),])
})
out <- lapply(out, `length<-`, max(lengths(out)))
outDF <- as.data.frame(do.call(cbind, out))
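The padding step can be tried in isolation, without any files. In this small sketch the list `out` is made-up stand-in data for the per-file results, and the element names are assumptions.

```r
# Three extracted vectors of different lengths (stand-ins for the per-file results)
out <- list(file1 = c("a", "b"),
            file2 = c("c", "d", "e", "f"),
            file3 = "g")

# Extend every vector to the length of the longest one; the new slots become NA
n_max <- max(lengths(out))
padded <- lapply(out, function(v) {
  length(v) <- n_max
  v
})

# Equal-length vectors can now be combined, one column per file
outDF <- as.data.frame(padded, stringsAsFactors = FALSE)
```

Assigning to `length(v)` is what the `length<-` call in the answer does; the anonymous function just spells it out.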
After having searched for help in different threads on this topic, I still have not become wiser. Therefore: Here comes another question on looping through multiple data files...
OK. I have multiple CSV files in one folder containing 5 columns of data. The filenames are as follows:
Moist yyyymmdd hh_mm_ss.csv
I would like to create a script that reads and processes the CSV files one by one, doing the following steps:
1) load file
2) check number of rows and exclude file if less than 3 registrations
3) calculate mean value of all measurements (=rows) for column 2
4) calculate mean value of all measurements (=rows) for column 4
5) output the filename timestamp, mean column 2 and mean column 4 to a data frame,
I have written the following function
moist.each.mean <- function() {
library("tcltk")
directory <- tk_choose.dir("","Choose folder for Humidity data files")
setwd(directory)
filelist <- list.files(path = directory)
filetitles <- regmatches(filelist, regexpr("[0-9].*[0-9]", filelist))
mdf <- data.frame(timestamp=character(), humidity=numeric(), temp=numeric())
for(i in 1:length(filelist)){
file.in[[i]] <- read.csv(filelist[i], header=F)
if (nrow(file.in[[i]]<3)){
print("discard")
} else {
newrow <- c(filetitles[[i]], round(mean(file.in[[i]]$V2),1), round(mean(file.in[[i]]$V4),1))
mdf <- rbind(mdf, newrow)
}
}
names(mdf) <- c("timestamp", "humidity", "temp")
}
but I keep getting an error:
Error in `[[<-.data.frame`(`*tmp*`, i, value = list(V1 = c(10519949L, :
replacement has 18 rows, data has 17
Any ideas?
Thx, kruemelprinz
I'd also suggest to use (l)apply... Here's my take:
getMeans <- function(fpath,runfct,
target_cols = c(2),
sep=",",
dec=".",
header = T,
min_obs_threshold = 3){
f <- list.files(fpath)
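fcsv <- f[grepl("\\.csv$", f)]   # "\." alone is not a valid escape in an R string
fcsv <- file.path(fpath, fcsv)
csv_list <- lapply(fcsv,read.table,sep = sep,
dec = dec, header = header)
csv_rows <- sapply(csv_list,nrow)
rel_csv_list <- csv_list[!(csv_rows < min_obs_threshold)]
# drop = FALSE keeps a data frame even when only one target column is selected
lapply(rel_csv_list,function(x) colMeans(x[,target_cols, drop = FALSE]))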
}
Also with that kind of error message, the debugger might be very helpful.
Just run debug(moist.each.mean) and execute the function stepwise.
Here's a slightly different approach. Use lapply to read each csv file, exclude it if necessary, otherwise create a summary. This gives you a list where each element is a data frame summary. Then use rbind to create the final summary data frame.
Without a sample of your data, I can't be sure the code below exactly matches your problem, but hopefully it will be enough to get you where you want to go.
# Get vector of filenames to read
filelist=list.files(path=directory, pattern="csv")
# Read all the csv files into a list and create summaries
df.list = lapply(filelist, function(f) {
file.in = read.csv(f, header=FALSE, stringsAsFactors=FALSE)
# Print a message (and return no data frame) if file has less than 3 rows of data
if (nrow(file.in) < 3) {
print(paste("Discard", f))
# Otherwise, capture file timestamp and summarise data frame
} else {
data.frame(timestamp=substr(f, 7, 23),
humidity=round(mean(file.in$V2),1),
temp=round(mean(file.in$V4),1))
}
})
# Bind list into final summary data frame (excluding the list elements
# that don't contain a data frame because they didn't have enough rows
# to be included in the summary)
result = do.call(rbind, df.list[sapply(df.list, is.data.frame)])
One issue with your original code is that you create a vector of summary results rather than a data frame of results:
c(filetitles[[i]], round(mean(file.in[[i]]$V2),1), round(mean(file.in[[i]]$V4),1)) is a vector with three elements. What you actually want is a data frame with three columns:
data.frame(timestamp=filetitles[[i]],
humidity=round(mean(file.in[[i]]$V2),1),
temp=round(mean(file.in[[i]]$V4),1))
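The difference is easy to check at the console. In this sketch the timestamp and the two means are invented values, used only to show how the types behave.

```r
# c() coerces everything to a single type, so the numbers become strings
as_vector <- c("20160711 09_47_35", round(55.26, 1), round(21.04, 1))

# data.frame() keeps one type per column
as_frame <- data.frame(timestamp = "20160711 09_47_35",
                       humidity = round(55.26, 1),
                       temp = round(21.04, 1),
                       stringsAsFactors = FALSE)

class(as_vector)         # a single class for the whole vector
sapply(as_frame, class)  # one class per column
```

Row-binding data frames built this way keeps humidity and temp numeric, so they can still be averaged or plotted afterwards.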
Thanks for the suggestions using lapply. This is definitely of value as it saves a whole lot of code as well! Meanwhile, I managed to fix my original code as well:
library("tcltk")
# directory: path to csv files
directory <-
tk_choose.dir("","Choose folder for Humidity data files")
setwd(directory)
filelist <- list.files(path = directory)
filetitles <-
regmatches(filelist, regexpr("[0-9].*[0-9]", filelist))
mdf <- data.frame()
for (i in 1:length(filelist)) {
file.in <- read.csv(filelist[i], header = F, skipNul = T)
if (nrow(file.in) < 3) {
print("discard")
} else {
newrow <-
matrix(
c(filetitles[[i]], round(mean(file.in$V2, na.rm=T),1), round(mean(file.in$V4, na.rm=T),1)), nrow = 1, ncol =
3, byrow = T
)
mdf <- rbind(mdf, newrow)
}
}
names(mdf) <- c("timestamp", "humidity", "temp")
However, I could not get it to work as a function: I would end up with only one row in mdf, containing the data of the last file. Somehow each iteration overwrote row 1 instead of adding a new row. Using it without a function wrapper worked fine...
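One thing worth checking in a function version: in the function posted in the question, the last expression is the names(mdf) <- ... assignment, so mdf itself is never returned, and an mdf built inside a function is local to it. Whether that explains the overwritten row is a guess, but returning the result explicitly avoids that class of problem. A minimal sketch, assuming comma-separated files without a header; `summarise_files`, `demo_dir` and the file contents are hypothetical.

```r
# Hypothetical function: one summary row per file, skipping files that are too short
summarise_files <- function(files, min_rows = 3) {
  rows <- lapply(files, function(f) {
    dat <- read.csv(f, header = FALSE)
    if (nrow(dat) < min_rows) return(NULL)  # too few registrations
    data.frame(file = basename(f),
               humidity = round(mean(dat$V2, na.rm = TRUE), 1),
               temp = round(mean(dat$V4, na.rm = TRUE), 1),
               stringsAsFactors = FALSE)
  })
  rows <- Filter(Negate(is.null), rows)  # drop the discarded files
  do.call(rbind, rows)                   # last expression, so this is the return value
}

# Demo with two made-up files: one with 4 rows, one with only 2
demo_dir <- file.path(tempdir(), "moist_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.table(data.frame(1:4, c(50, 52, 54, 56), 0, c(20, 21, 22, 23), 0),
            file.path(demo_dir, "Moist 20160711 09_47_35.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)
write.table(data.frame(1:2, c(40, 41), 0, c(18, 19), 0),
            file.path(demo_dir, "Moist 20160711 10_00_00.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)

mdf <- summarise_files(list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE))
```

The caller assigns the returned data frame, so nothing depends on variables defined outside the function.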
My files are systematically named and all in the same folder, so I want to take advantage of that and write a function to read them one by one instead of doing this manually for each one.
The names are stored in this text file:
DF <- read.table(text=" site column row
1 abs 1259 463
2 adm 1253 460
3 afrm 1258 463", header=T)
I want to write a function that goes through DF row by row. For instance, for the first row:
DF$site is abs, so:
file1=read.table("C:\\Data\\abs.txt",sep = "\t")
DF$column is 1259
DF$row is 463
So
wf= read.table("C:\\Users\\measurement_ 1259_463.txt", sep =' ' , header =TRUE)
Now I do any calculations with file1 and wf.........
And then go to the second row and so on.
Create a character vector with the file names you want to read and follow the instructions in consolidating data frames in R or reading multiple csv files in R.
files <- data.frame(
site = paste("C:\\Data\\", DF$site, ".txt", sep=""),
measurement = paste("C:\\Users\\measurement_", DF$column, "_",
DF$row, ".txt", sep=""),
stringsAsFactors = FALSE)
results <- Map(function(s, m){
file1 <- read.table(s, sep="\t")
wt <- read.table(m, sep=' ', header=TRUE)
# Do stuff
return(result)
}, files$site, files$measurement)
# Alternatively
results <- vector("list", nrow(files))
for(i in 1:nrow(files)){
file1 <- read.table(files$site[i], sep="\t")
wt <- read.table(files$measurement[i], sep=' ', header=TRUE)
# Do stuff
results[[i]] <- result # store whatever "Do stuff" produced
}
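For a version that can be run end to end, the sketch below creates small stand-in files in a temporary folder and then walks through DF with Map. The folder, the file contents and the returned row counts are assumptions made for the demo; the real paths from the question would replace `base`.

```r
DF <- data.frame(site = c("abs", "adm"),
                 column = c(1259, 1253),
                 row = c(463, 460),
                 stringsAsFactors = FALSE)

# Create made-up input files in a temporary folder
base <- file.path(tempdir(), "site_demo")
dir.create(base, showWarnings = FALSE)
for (k in seq_len(nrow(DF))) {
  write.table(data.frame(v = 1:3),
              file.path(base, paste0(DF$site[k], ".txt")),
              sep = "\t", row.names = FALSE, col.names = FALSE)
  write.table(data.frame(x = 1:2, y = 3:4),
              file.path(base, paste0("measurement_", DF$column[k], "_", DF$row[k], ".txt")),
              sep = " ", row.names = FALSE, quote = FALSE)
}

# One call per row of DF; each call reads the matching pair of files
results <- Map(function(site, col, row) {
  file1 <- read.table(file.path(base, paste0(site, ".txt")), sep = "\t")
  wf <- read.table(file.path(base, paste0("measurement_", col, "_", row, ".txt")),
                   sep = " ", header = TRUE)
  # Placeholder calculation: report how many rows each file has
  c(site_rows = nrow(file1), wf_rows = nrow(wf))
}, DF$site, DF$column, DF$row)
```

Because the first argument is a character vector, Map names the result list after the sites, so `results$abs` holds the output for the first row.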
I would like to be able to scan a csv file row by row in R and exclude the rows that contain the word "target".
The problem is that the data comes from different places and the word "target" can come up in a number of different columns in the data frame.
So I need a line in a function that will look for this string, and if it is not present, then append that row to a new data frame (that I will then write out as a new csv).
Any and all help gratefully received.
Andrie's comment is probably the way most users would approach this, but if you want to do this at the reading in stage, you can try this:
Read in your csv using readLines and make any lines that have the text target blank:
temp = gsub(".*target.*", "", readLines("test.csv"))
Use read.table to convert temp to a data.frame. Since all lines that have the text target are now blank, the default blank.lines.skip=TRUE in read.table should correctly read in the rest of your data as a data.frame.
read.table(text=temp, sep=",", header=TRUE)
Use readLines:
lines <- readLines(file)
n.lines <- length(lines)
vec.1 <- rep(0, n.lines)
vec.2 <- rep(0, n.lines)
# more vectors as necessary
counter <- 0
for (i in 1:n.lines){
this.line <- strsplit(lines[i], ",")[[1]]
if ("target" %in% this.line) next
counter <- counter + 1
vec.1[counter] <- this.line[1]
vec.2[counter] <- this.line[2]
# etc.
}
df <- data.frame(vec.1[1:counter], vec.2[1:counter])
You may have to change n.lines slightly and change the indexing of the for loop if your file has headers; two lines would change as follows:
n.lines <- length(lines) - 1
and
for(i in 2:(n.lines+1)){
I would call from.readLines <- readLines(filename) and then just sub-select the rows that don't contain the target string: data <- read.csv(text = from.readLines[-grep('target', from.readLines)], header = F).
A faster way (if your file is huge) is to run grep -v 'target' original.csv > new.csv on the command line first and then run read.csv("new.csv", ...) in R.
But anyway,
> #Without header
> from.readLines <- c('afaf,afasf,target', 'afaf,target,afasf', 'dagdg,asgst,sagga', 'dagdg,dg,sfafgsgg')
> data <- read.csv(text = from.readLines[-grep('target', from.readLines)], header = F)
> print(data)
V1 V2 V3
1 dagdg asgst sagga
2 dagdg dg sfafgsgg
>
> #With header
> from.readLines <- c('var1,var2,var3', 'afaf,afasf,target', 'afaf,target,afasf', 'dagdg,asgst,sagga', 'dagdg,dg,sfafgsgg')
> data <- read.csv(text = from.readLines[-(grep('target', from.readLines[-1]) + 1)])
> print(data)
var1 var2 var3
1 dagdg asgst sagga
2 dagdg dg sfafgsgg
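One caveat with negative indexing: when no line contains the string, grep returns an empty vector, and indexing with its negation selects nothing instead of everything. A logical filter with grepl does not have that problem. A small sketch with made-up rows:

```r
# Made-up input in which no line contains "target"
rows <- c("var1,var2,var3", "dagdg,asgst,sagga", "dagdg,dg,sfafgsgg")

# Negative indexing with an empty match vector drops every line
dropped_all <- rows[-grep("target", rows)]
length(dropped_all)

# A logical filter keeps all lines when nothing matches
kept <- rows[!grepl("target", rows, fixed = TRUE)]
data <- read.csv(text = kept)
```

Using `fixed = TRUE` treats "target" as a literal string rather than a regular expression, which is enough for a plain word.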