So I am trying to read several csv files, take their first column and create a new file. I have succeeded using qpcR and data.table using the following code:
FileNames <- dir(pattern = "*.csv")
x <- integer()
for (FileName in FileNames) {
data <- read.csv(file = FileName, header=FALSE, skip=1)
y <- data[,1]
x<-qpcR:::cbind.na(x, y)
rm(data)
}
write.csv(x, file = 'test.csv')
This works fine, however I have discovered that I can read just the first column of my data using the data.table library.
x <- integer()
for (FileName in FileNames) {
data <- fread(FileName,select=1,skip=1, header=FALSE)
y <- data[1:nrow(data),]
x<-qpcR:::cbind.na(x, y)
rm(data)
}
write.csv(x, file = 'test.csv')
However this seems to treat y as a data value or integer, which throws up the error:
Error in data.table::data.table(...) :
Item 2 has no length. Provide at least one item (such as NA, NA_integer_ etc) to be repeated to match the 11 rows in the longest column. Or, all columns can be 0 length, for insert()ing rows into.
Any help on this would be great thanks.
Turns out after investigating using typeof(), that I needed to convert the list generated by fread, to a numeric by adding the following line.
data <- as.numeric(unlist(data))
This then worked
Related
I got many .csv files of different sizes. I choose some of them who correspond at a condition (those matching with my id in the example). They are ordered by date and can be huge. I need to know the minimum and maximum dates of these files.
I can read all of those wanted and only for the column date.hour, and then I can find easily the minimum and maximum of all the dates values.
But it would be a lot faster, as I repeat this for thousand ids, if I could read only the first and last rows of my files.
Does anyone got an idea of how to solve this ?
This code works well, but I wish to improve it.
function to read several files at once
`read.tables.simple <- function(file.names, ...) {require(plyr)
ldply(file.names, function(fn) data.frame(read.table(fn, ...)))}`
reading the files and selecting the minimum and maximum dates for all of theses
`diri <- dir()
dat <- read.tables.simple(diri[1], header = TRUE, sep = ";", colClasses = "character")
colclass <- rep("NULL", ncol(dat))
x <- which(colnames(dat) == "date.hour")
colclass[x] <- "character"
x <- grep("id", diri)
dat <- read.tables.simple(diri[x], header = TRUE, sep = ";", colClasses = colclass)
datmin <- min(dat$date.hour)
datmax <- max(dat$date.hour)`
In general, read.table is very slow. If you use read_tsv, read_csv or read_delim from the readr library, it will already be much, much faster.
If you are on Linux/Mac OS, you can also read only the first or last parts by setting up a pipe, which will be more or less instant, no matter how large your file is. Let's assume you have no column headers:
library(readr)
read_last <- function(file) {
read_tsv(pipe(paste('tail -n 1', file)), col_names=FALSE)
}
# Readr can already read only a select number of lines, use `n_max`
first <- read_tsv(file, n_max=1, col_names=FALSE)
If you want to go in on parallelism, you can even read files in parallel, see e.g., library(parallel) and ?mclapply
The following function will read the first two lines of your csv (the header row and the first data row), then seek to the end of the file and read the last line. It will then stick these three lines together to read them as a two-row csv in memory from which it returns the column date.time. This will have your minimum and maximum values, since the times are arranged in order.
You need to tell the function the maximum line length. It's OK if you over-estimate this, but make sure the number is less than a third of your file size.
read_head_tail <- function(file_path, line_length = 100)
{
con <- file(file_path)
open(con)
seek(con, where = 0)
first <- suppressWarnings(readChar(con, nchars = 2 * line_length))
first <- strsplit(first, "\n")[[1]][1:2]
seek(con, where = file.info(file_path)$size - line_length)
last <- suppressWarnings(readChar(con, nchars = line_length))
last <- strsplit(last, "\n")[[1]]
last <- last[length(last)]
close(con)
csv <- paste(paste0(first, collapse = "\n"), last, sep = "\n")
df <- read.csv(text = csv, stringsAsFactors = FALSE)[-1]
return(df$date.hour)
}
I have a folder full of .txt files that I want to loop through and compress into one data frame, but each .txt file is data for one subject and there are no columns in the text files that indicate subject number or time point in the study (e.g. 1-5). I need to add a line or two of code into my loop that looks for strings of four numbers (i.e. each file is labeled something like: "4325.5_ERN_No_Startle") and just creates a column with 4325 and another column with 5 that will appear for every data point for that subject until the loop gets to the next one. I have been looking for awhile but am still coming up empty, any suggestions?
I also have not quite gotten the loop to work:
path = "/Users/me/Desktop/Event Codes/ERN task/ERN text files transferred"
out.file <- ""
file <- ""
file.names <- dir(path, pattern =".txt")
for(i in 1:length(file.names)){
file <- read.table(file.names[i],header=FALSE, fill = TRUE)
out.file <- rbind(out.file, file)
}
which runs okay until I get this error message part way through:
Error in read.table(file.names[i], header = FALSE, fill = TRUE) :
no lines available in input
Consider using regex to parse the file name for study period and subject, both of which are then binded in a lapply of list.files:
path = "path/to/text/files"
# ANY TXT FILE WITH PATTERN OF 4 DIGITS FOLLOWED BY A PERIOD AND ONE DIGIT
file.names <- list.files(path, pattern="*[0-9]{4}\\.[0-9]{1}.*txt", full.names=TRUE)
# IMPORT ALL FILES INTO A LIST OF DATAFRAMES AND BINDS THE REGEX EXTRACTS
dfList <- lapply(file.names, function(x) {
if (file.exists(x)) {
data.frame(period=regmatches(x, gregexpr('[0-9]{4}', x))[[1]],
subject=regmatches(x, gregexpr('\\.[0-9]{1}', x))[[1]],
read.table(x, header=FALSE, fill=TRUE),
stringsAsFactors = FALSE)
}
})
# COMBINE EACH DATA FRAME INTO ONE
df <- do.call(rbind, dfList)
# REMOVE PERIOD IN SUBJECT (NEEDED EARLIER FOR SPECIAL DIGIT)
df['subject'] <- sapply(df['subject'],
function(x) gsub("\\.", "", x))
You can try to use tryCatchwhich basically would give you a NULL instead of an error.
file <- tryCatch(read.table(file.names[i],header=FALSE, fill = TRUE), error=function(e) NULL))
After having searched for help in different threads on this topic, I still have not become wiser. Therefore: Here comes another question on looping through multiple data files...
OK. I have multiple CSV files in one folder containing 5 columns of data. The filenames are as follows:
Moist yyyymmdd hh_mm_ss.csv
I would like to create a script that reads processes the CSV-files one by one doing the following steps:
1) load file
2) check number of rows and exclude file if less than 3 registrations
3) calculate mean value of all measurements (=rows) for column 2
4) calculate mean value of all measurements (=rows) for column 4
5) output the filename timestamp, mean column 2 and mean column 4 to a data frame,
I have written the following function
moist.each.mean <- function() {
library("tcltk")
directory <- tk_choose.dir("","Choose folder for Humidity data files")
setwd(directory)
filelist <- list.files(path = directory)
filetitles <- regmatches(filelist, regexpr("[0-9].*[0-9]", filelist))
mdf <- data.frame(timestamp=character(), humidity=numeric(), temp=numeric())
for(i in 1:length(filelist)){
file.in[[i]] <- read.csv(filelist[i], header=F)
if (nrow(file.in[[i]]<3)){
print("discard")
} else {
newrow <- c(filetitles[[i]], round(mean(file.in[[i]]$V2),1), round(mean(file.in[[i]]$V4),1))
mdf <- rbind(mdf, newrow)
}
}
names(mdf) <- c("timestamp", "humidity", "temp")
}
but i keep getting an error:
Error in `[[<-.data.frame`(`*tmp*`, i, value = list(V1 = c(10519949L, :
replacement has 18 rows, data has 17
Any ideas?
Thx, kruemelprinz
I'd also suggest to use (l)apply... Here's my take:
getMeans <- function(fpath,runfct,
target_cols = c(2),
sep=",",
dec=".",
header = T,
min_obs_threshold = 3){
f <- list.files(fpath)
fcsv <- f[grepl("\.csv",f)]
fcsv <- paste0(fpath,fcsv)
csv_list <- lapply(fcsv,read.table,sep = sep,
dec = dec, header = header)
csv_rows <- sapply(csv_list,nrow)
rel_csv_list <- csv_list[!(csv_rows < min_obs_threshold)]
lapply(rel_csv_list,function(x) colMeans(x[,target_cols]))
}
Also with that kind of error message, the debugger might be very helpful.
Just run debug(moist.each.mean) and execute the function stepwise.
Here's a slightly different approach. Use lapply to read each csv file, exclude it if necessary, otherwise create a summary. This gives you a list where each element is a data frame summary. Then use rbind to create the final summary data frame.
Without a sample of your data, I can't be sure the code below exactly matches your problem, but hopefully it will be enough to get you where you want to go.
# Get vector of filenames to read
filelist=list.files(path=directory, pattern="csv")
# Read all the csv files into a list and create summaries
df.list = lapply(filelist, function(f) {
file.in = read.csv(f, header=TRUE, stringsAsFactors=FALSE)
# Set to empty data frame if file has less than 3 rows of data
if (nrow(file.in) < 3) {
print(paste("Discard", f))
# Otherwise, capture file timestamp and summarise data frame
} else {
data.frame(timestamp=substr(f, 7, 22),
humidity=round(mean(file.in$V2),1),
temp=round(mean(file.in$V4),1))
}
})
# Bind list into final summary data frame (excluding the list elements
# that don't contain a data frame because they didn't have enough rows
# to be included in the summary)
result = do.call(rbind, df.list[sapply(df.list, is.data.frame)])
One issue with your original code is that you create a vector of summary results rather than a data frame of results:
c(filetitles[[i]], round(mean(file.in[[i]]$V2),1), round(mean(file.in[[i]]$V4),1)) is a vector with three elements. What you actually want is a data frame with three columns:
data.frame(timestamp=filetitles[[i]],
humidity=round(mean(file.in[[i]]$V2),1),
temp=round(mean(file.in[[i]]$V4),1))
Thanks for the suggestions using lapply. This is definitely of value as it saves a whole lot of code as well! Meanwhile, I managed to fix my original code as well:
library("tcltk")
# directory: path to csv files
directory <-
tk_choose.dir("","Choose folder for Humidity data files")
setwd(directory)
filelist <- list.files(path = directory)
filetitles <-
regmatches(filelist, regexpr("[0-9].*[0-9]", filelist))
mdf <- data.frame()
for (i in 1:length(filelist)) {
file.in <- read.csv(filelist[i], header = F, skipNul = T)
if (nrow(file.in) < 3) {
print("discard")
} else {
newrow <-
matrix(
c(filetitles[[i]], round(mean(file.in$V2, na.rm=T),1), round(mean(file.in$V4, na.rm=T),1)), nrow = 1, ncol =
3, byrow = T
)
mdf <- rbind(mdf, newrow)
}
}
names(mdf) <- c("timestamp", "humidity", "temp")
Only I did not get it to work as a function because then I would only have one row in mdf containing the last file data. Somehow it did not add rows but overwrite row 1 with each iteration. But using it without a function wrapper worked fine...
I’m looking to do the following in R.
I have 250+ csv files of chromatographic data structured similarly to the example below, but with 21 rows instead of three:
1 4.708252 BB 9.946890 7.830349 0.01982016 4.684836 4.742056
2 4.970352 BB 1.792341 1.497008 0.01896829 4.945352 5.005390
3 6.393414 BB 6.599891 5.309925 0.01950091 6.368413 6.428723
What I want to do is read a subset of the data in all 250 files into a single data frame, which is easy enough — but I also need to restructure it a fair bit.
Every row in the table above is a peak. I only want the data from the first and fourth columns (which are ‘peak number’ and ‘area under the peak’, respectively), and in the output I need to make each peak an individual column, rather than a row as above, with the peak number as the header. Finally, I want to create a new column where each row (that is, the data from each individual csv file) is given the same name as the csv file name.
So, imagine I have 3 files: ABC1.csv, ABC2.csv, and ABC3.csv. Each file looks like my example above. I want to automatically take all those files and merge them into a single data frame such as the one below.
ID 1 2 3
ABC1 9.94689 1.792341 6.599891
ABC2 9.76651 1.932332 6.600022
ABC3 8.99193 2.556471 6.718934
I hope I’ve made this clear enough. I’ve been able to manage most of the steps but haven’t been successful writing them into a single script. And I have no idea how, if there is any way, to make the file name into a variable.
Cheers
I am assuming the working directory is set to where the files are. Then you can get the list of files below.
filenames <- list.files()
Have a helper function to read a file and keep just columns 1 and 4.
readdata <- function(filename) {
df <- read.csv(filename)
vec <- df[, 4]
names(vec) <- df[, 1]
return(vec)
}
Loop over all of the files and rbind them
result <- do.call(rbind, lapply(filenames, readdata))
Name them as you like
row.names(result) <- filenames
this following code can probably be of some help, though the file name is still not working properly -
path <- "C:\\Users\\Vidyut\\"
filenames <- list.files(path = path,pattern = ".csv")
l <- data.frame(ID=character(),col1=numeric(),col2=numeric(),col3=numeric(),stringsAsFactors=FALSE)
for (i in filenames) {
#i = filenames[1]
full = paste(path,i,sep="")
m <- read.csv(full, header=F)
# extract the subset of rows required from each file
# m <- m[c(),]
n<- m[,c(1,4)]
y <- gsub('.csv','',i)
print("y=")
print(y)
d <- list(ID=as.character(y),col1=n[1,2],col2=n[2,2],col3=n[3,2])
print("d=")
print(d)
l <- rbind.data.frame(l,d)
print("l=")
print(l)
}
Mind you, this is not very pretty code - just something hacked together to get the job done (visible from the multiple print lines scattered across).
Here's a solution for you. This only works if we can assume that there are exactly 21 peaks in each file and they are in order 1:21. If that's not the case a few changes to the code should remedy this.
folder = "c:/temp/"
files <- dir(folder)
first_loop <- TRUE
for (file in files) {
# Read one file, only the first and fourth columns
temp <- read.csv(file=paste0(folder,file),
header = FALSE,
colClasses = c("integer", "NULL", "NULL", "numeric", "NULL", "NULL", "NULL", "NULL"))
# Transpose the data
temp <- data.frame(t(temp))
# Remove the peak number
temp <- temp[2,]
# Concatenate the dataframes together
temp$file <- file
if (first_loop) {
data <- temp
first_loop <- FALSE
} else {
data <- rbind(data, temp)
}
}
data
I am new to R program and currently working on a set of financial data. Now I got around 10 csv files under my working directory and I want to analyze one of them and apply the same command to the rest of csv files.
Here are all the names of these files: ("US%10y.csv", "UK%10y.csv", "GER%10y.csv","JAP%10y.csv", "CHI%10y.csv", "SWI%10y.csv","SOA%10y.csv", "BRA%10y.csv", "CAN%10y.csv", "AUS%10y.csv")
For example, because the Date column in CSV files are Factor so I need to change them to Date format:
CAN <- read.csv("CAN%10y.csv", header = T, sep = ",")
CAN$Date <- as.character(CAN$Date)
CAN$Date <- as.Date(CAN$Date, format ="%m/%d/%y")
CAN_merge <- merge(all.dates.frame, CAN, all = T)
CAN_merge$Bid.Yield.To.Maturity <- NULL
all.dates.frame is a data frame of 731 consecutive days. I want to merge them so that each file will have the same number of rows which later enables me to combine 10 files together to get a 731 X 11 master data frame.
Surely I can copy and paste this code and change the file name, but is there any simple approach to use apply or for loop to do that ???
Thank you very much for your help.
This should do the trick. Leave a comment if a certain part doesn't work. Wrote this blind without testing.
Get a list of files in your current directory ending in name .csv
L = list.files(".", ".csv")
Loop through each of the name and reads in each file, perform the actions you want to perform, return the data.frame DF_Merge and store them in a list.
O = lapply(L, function(x) {
DF <- read.csv(x, header = T, sep = ",")
DF$Date <- as.character(CAN$Date)
DF$Date <- as.Date(CAN$Date, format ="%m/%d/%y")
DF_Merge <- merge(all.dates.frame, CAN, all = T)
DF_Merge$Bid.Yield.To.Maturity <- NULL
return(DF_Merge)})
Bind all the DF_Merge data.frames into one big data.frame
do.call(rbind, O)
I'm guessing you need some kind of indicator, so this may be useful. Create a indicator column based on the first 3 characters of your file name rep(substring(L, 1, 3), each = 731)
A dplyr solution (though untested since no reproducible example given):
library(dplyr)
file_list <- c("US%10y.csv", "UK%10y.csv", "GER%10y.csv","JAP%10y.csv", "CHI%10y.csv", "SWI%10y.csv","SOA%10y.csv", "BRA%10y.csv", "CAN%10y.csv", "AUS%10y.csv")
can_l <- lapply(
file_list
, read.csv
)
can_l <- lapply(
can_l
, function(df) {
df %>% mutate(Date = as.Date(as.character(Date), format ="%m/%d/%y"))
}
)
# Rows do need to match when column-binding
can_merge <- left_join(
all.dates.frame
, bind_cols(can_l)
)
can_merge <- can_merge %>%
select(-Bid.Yield.To.Maturity)
One possible solution would be to read all the files into R in the form of a list, and then use lapply to to apply a function to all data files. For example:
# Create vector of file names in working direcotry
files <- list.files()
files <- files[grep("csv", files)]
#create empty list
lst <- vector("list", length(files))
#Read files in to list
for(i in 1:length(files)) {
lst[[i]] <- read.csv(files[i])
}
#Apply a function to the list
l <- lapply(lst, function(x) {
x$Date <- as.Date(as.character(x$Date), format = "%m/%d/%y")
return(x)
})
Hope it's helpful.