I need to run the same set of code on multiple CSV files, and I want to do it with something like a macro. Below is the code I am executing, but the results are not coming out properly: it is reading the data in a 2-d format, while I need it in a 3-d format.
lf = list.files(path = "D:/THD/data", pattern = ".csv",
full.names = TRUE, recursive = TRUE, include.dirs = TRUE)
ds<-lapply(lf,read.table)
I don't know if this is going to be useful, but here is one way I do it:
##Step 1 read files
mycsv = dir(pattern=".csv")
n <- length(mycsv)
mylist <- vector("list", n)
for(i in 1:n) mylist[[i]] <- read.csv(mycsv[i],header = T)
Then I usually just use an apply function to change things. For example,
## Change column names
mylist <- lapply(mylist, function(x) {
  names(x) <- c("type", "date", paste0("v", 1:24), "total")
  return(x)
})
## change the type column for weekday/weekend
mylist <- lapply(mylist, function(x) {
f = c("we", "we", "wd", "wd", "wd", "wd", "wd")
x$type = rep(f,52, length.out = 365)
return(x)
})
and so on.
Then, after all the changes, I save with the following code (it is also sometimes useful to split the original file name and save each file under a new name built from part of the original, so that I can track the individual files later):
## for example, some of my files had a file-name pattern such as "201_E424220_N563500.csv", so I split this to save with a new name like this:
mylist <- lapply(seq_along(mylist), function(i) {
mylist.i <- mylist[[i]]
s = strsplit(mycsv[i], "_" , fixed = TRUE)[[1]]
d = cbind(mylist.i[, c("type", "date")], ID = s[1], Easting = s[2], Northing = s[3], mylist.i[, 3:ncol(mylist.i)])
return(d)
})
for(i in 1:n)
write.csv(mylist[[i]], file = paste("file", i, ".csv", sep = ""), row.names = FALSE)
I hope this helps. When you get some time, please read about the plyr package, as I am sure it will be very useful for you; it has lots of data-analysis options. plyr has apply functions such as:
## l_ply split list, apply function and discard result
## ldply split list, apply function and return result in data frame
## laply split list, apply function and return result in an array
For example, you can use ldply to read all your csv files and return a data frame, something like:
library(plyr)
data = ldply(list.files(pattern = ".csv"), function(fname) {
  j = read.csv(fname, header = TRUE)
  return(j)
})
So here data will be a single data frame containing the data from all of your csv files.
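To round out that list, here is a minimal sketch of l_ply used for the writing step from earlier (it reuses the mylist object built above; l_ply discards its result, which is what you want when you only care about the side effect of writing the files):
library(plyr)
## write each cleaned data frame back to disk, discarding the result
l_ply(seq_along(mylist), function(i) {
  write.csv(mylist[[i]], file = paste("file", i, ".csv", sep = ""), row.names = FALSE)
})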
Thanks, Ayan
I have a large number of CSV files. I need to extract relevant data from each file, and compile all of the relevant data into a new file.
I have been copying/pasting the code below and changing relevant details (e.g., file name) to repeat the same process for many CSV files. After that, I use cbind()/write.xlsx() to combine all of the relevant data and write it to an excel file. I need a more efficient method to accomplish this task.
How can I:
incorporate a loop that imports a large number of CSV files (to replace #1 below)
select relevant rows based on a string instead of entering specific row numbers (to replace #2 below)
combine all of the relevant data into a single data frame with each file's data in one column
library(tidyr)
# 1 - import raw data files
file1 <- read.csv ("1.csv", header = FALSE, sep = "\n")
# 2 - select relevant rows
file1 <- as.data.frame(file1[c(41:155),])
colnames(file1) <- c("file1")
#separate components of each line from raw csv file / isolate data
temp1 <- separate(file1, file1, into = c("Text", "IntNum", "Data"), sep = "\\s")
temp1 <- temp1$Data
temp1 <- as.data.frame(temp1)
If the number of relevant rows in each file is the same, you could do it like this. Option 1 shows a solution using a loop, option 2 shows a solution using sapply.
In a first step I generate three csv-files to make the code reproducible. The start row in each file is defined by "start", the end row by "end". I then get a list with the names of these files with dir().
#make csv-files, target vector always same length (3)
set.seed(1)
for (i in 1:3) {
df <- data.frame(x = c(rep(0, sample(1:10,1)), "start",
paste0("dat", i),
"end",rep(0, sample(1:10, 1))))
write.csv(df, file = paste0("file", i, ".csv"), quote = FALSE, row.names = FALSE)
}
#get list of file names
allFiles <- dir(pattern = glob2rx("*.csv"))
Option 1 - loop
For the loop you could first initialize a result data frame ("outDF") with the number of columns set to the number of csv-files and the number of rows set to the length of the target vector in each file ("start" to "end"). You can then loop over the files and fill the data frame. The start and end rows can be indexed using which().
#initialise result data frame
outDF <- data.frame(matrix(nrow = 3, ncol = length(allFiles),
dimnames = list(NULL, allFiles)))
#loop over csv files
for (iFile in allFiles) {
idat <- read.csv(iFile, stringsAsFactors = FALSE) #read csv
outDF[, iFile] <- idat[which(idat$x == "start"):which(idat$x == "end"),]
}
Option 2 - sapply
Instead of a loop you could use sapply with a custom function to extract the relevant rows in each file. This returns a matrix which you could then transform into a dataframe.
out <- sapply(allFiles, FUN = function(x) {
idat <- read.csv(x, stringsAsFactors = FALSE)
return(idat[which(idat$x == "start"):which(idat$x == "end"),])
})
outDF <- as.data.frame(out)
If the number of rows between "start" and "end" differs between files, the above options won't work. In this case you could generate a data frame by first using lapply() (similar to option 2) to generate a result list (with elements of different lengths) and then padding the shorter elements with NAs before transforming the result into a data frame again.
#make csv-files with target vectors of different lengths (3 to 12)
set.seed(1)
for (i in 1:3) {
df <- data.frame(x = c(rep(0, sample(1:10,1)), "start",
rep(paste0("dat", i), sample(1:10,1)),
"end",rep(0, sample(1:10, 1))))
write.csv(df, file = paste0("file", i, ".csv"), quote = FALSE, row.names = FALSE)
}
#lapply
out <- lapply(allFiles, FUN = function(x) {
idat = read.csv(x, stringsAsFactors = FALSE)
return(idat[which(idat$x == "start"):which(idat$x == "end"),])
})
out <- lapply(out, `length<-`, max(lengths(out)))
outDF <- as.data.frame(do.call(cbind, out))
names(outDF) <- allFiles
I have a large dataframe my_df in R containing 1,983,000 records. The following lines of sample code take the chunk of 1000 rows starting at row 25001, do some processing, and write the processed data to a file on the local disk.
my_df1 <- my_df[25001:26000,]
my_df1$end <- as.POSIXct(paste(my_df1$end,"23:59",sep = ""))
my_df1$year <- lubridate::year(my_df1$start)
str_data <- my_df1
setwd("path_to_local_dir/data25001_26000")
write.table(str_data, file = "data25001-26000.csv",row.names = F,col.names = F,quote = F)
and so on like this:
my_df2 <- my_df[26001:27000,]
...
I would like to automate this task so that each chunk of 1000 records is processed and written to a new directory. Any advice on how this could be done?
Consider generalizing your process into a function, data_to_disk, and calling that function with an iterator method like lapply, passing a sequence of integers built with seq() for each subsequent thousand rows. Also, incorporate dynamic directory creation (though maybe dump all 1,000+ files into one directory instead of creating 1,000+ dirs?).
data_to_disk <- function(num) {
  # process one chunk of 1,000 rows (note the comma for row subsetting)
  str_data <- within(my_df[num:(num + 999), ], {
    end <- as.POSIXct(paste0(end, "23:59"))
    year <- lubridate::year(start)
  })
  # create this chunk's directory if it does not exist yet
  my_dir <- paste0("path_to_local_dir/data", num, "_", num + 999)
  if (!dir.exists(my_dir)) dir.create(my_dir)
  write.table(str_data, file = paste0(my_dir, "/", "data", num, "-", num + 999, ".csv"),
              row.names = FALSE, col.names = FALSE, quote = FALSE)
  return(str_data)   # return the processed chunk, not the full my_df
}
seqs <- seq(25001, nrow(my_df), by=1000)
head(seqs)
# [1] 25001 26001 27001 28001 29001 30001
tail(seqs)
# [1] 1977001 1978001 1979001 1980001 1981001 1982001
# LIST OF 1,958 DATA FRAMES
df_list <- lapply(seqs, data_to_disk)
Here is my code doing the sliced loop:
step1 = 1000
runto = nrow(my_df)
nsteps = ceiling(runto/step1)
for( part in seq_len(nsteps) ) { # part = 1
cat( part, 'of', nsteps, '\n')
fr = (part-1)*step1 + 1
to = min(part*step1, runto)
my_df1 = my_df[fr:to,]
# ... (processing steps that produce str_data from my_df1) ...
write.table(str_data, file = paste0("data",fr,"-",to,".csv"))
}
rm(part, step1, runto, nsteps, fr, to)
You can add a grouping variable to your data first (e.g., to identify every 1000 rows), then use d_ply() to split the data and write to file.
library(plyr)    # d_ply
library(dplyr)   # %>%, filter, distinct
library(readr)   # write_csv
df <- data.frame(var = runif(1000000))
# 100 folds of 10,000 rows here; use breaks = nrow(df)/1000 for chunks of 1000 rows
df$fold <- cut(seq(1, nrow(df)), breaks = 100, labels = FALSE)
df %>% filter(fold<=2) %>% # only writes first two files
d_ply(.,.(fold), function(i){
# make filenames 'data1.csv', 'data2.csv'
write_csv(i,paste0('data',distinct(i,fold),'.csv'))
})
This is similar to @Parfait's answer but takes a lot of the work out of the function. Specifically, it creates a copy of the entire dataset and then performs the time manipulations on it once, up front.
my_df1 <- my_df
my_df1$end <- as.POSIXct(paste(my_df1$end,"23:59",sep = ""))
my_df1$year <- lubridate::year(my_df1$start)
lapply(seq(25001, nrow(my_df1), by = 1000),
       function(i) write.table(my_df1[i:(i + 999), ],
                               file = paste0('path_to_local_dir/data',
                                             i, '-', i + 999, '.csv'),
                               row.names = FALSE, col.names = FALSE, quote = FALSE)
)
For me, I'd probably just do:
write.table(my_df1, file = ...)
and be done with it. I don't see the advantages of splitting it up - 1 million rows really isn't that many.
I am creating an object calling all .csv files in a directory, reading them in according to some specifications, and merging them.
Before merging them I want to take the first two letters of each file name and create a new column in each table containing those two letters as a variable.
I got this far:
temp = list.files(pattern="*.csv")
myfiles = lapply(temp, function(x) read.csv(x,
header=TRUE,
#sep=";",
stringsAsFactors=F,
encoding = "UTF-8",
na.strings = c("NA",""),
colClasses=c("code"="character")))
myfiles.final = do.call(rbind, myfiles)
When I try to create the new variable, though, I generate a replacement that has double the rows of the data:
temp.2 <- lapply(temp, function(x) substr(x, start = 1, stop = 2))
myfiles.2 = lapply(myfiles,
function(x){
a <- temp.2[seq_along(myfiles)]
x$identifier <- rep(a,nrow(x))
return(x)
})
In the folder the files are named, for example, AA029893.csv, BB024593.csv, and so on. For the first table I just want a new column called "identifier" that has "AA" for all entries, for the second "BB", and so on.
Thanks a lot
lapply is good for iterating over one list (e.g., the myfiles data frames). To add a column to each data frame, you want to iterate in parallel over two lists: the list of data frames and the list of names. Map does this (for an arbitrary number of lists):
myfiles.2 = Map(function(dd, nn) {dd$identifier = nn; return(dd)},
dd = myfiles, nn = temp.2)
An easier alternative is to add the column post-hoc:
myfiles.final = do.call(rbind, myfiles)
myfiles.final$identifier = rep(
  sapply(temp, function(x) substr(x, start = 1, stop = 2)),
  times = sapply(myfiles, nrow)   # one repeat count per file: its number of rows
)
The easiest alternative is to use data.table::rbindlist or dplyr::bind_rows, either of which will automatically add an ID column based on the names of your list. Depending on the size of your data, they might be quite a bit faster as well.
names(myfiles) = sapply(temp, function(x) substr(x, start = 1, stop = 2))
myfiles.2 = dplyr::bind_rows(myfiles, .id = "identifier")
myfiles.2 = data.table::rbindlist(myfiles, idcol = "identifier")
I have a bunch of files to which I need to apply a test, and I need a way to automatically write the results for each file into a single output file. Here is what I do:
library(ape)
stud_files <- list.files("path/dir/data",full.names = T)
for (f in stud_files) {
df <- read.table(f, header=TRUE, sep=";")
df_xts <- as.xts(df$cola, order.by = as.Date(df$colb,"%m/%d/%Y"))
pet <- testa(df_xts)
res <- data.frame(estimate = pet$estimate,
p.value=pet$p.value,
logi = pet$alternative)
write.dna(res,file = "res_testa.xls",format = "sequential")
}
This loop works well, except for the last command, which is meant to write the results of each file consecutively; it saves only the last result. Also, the results are saved as a string, not as a table as I defined above (a data.frame). Any ideas? Thanks in advance.
Check help(write.dna).
write.dna(x, file, format = "interleaved", append = FALSE,
nbcol = 6, colsep = " ", colw = 10, indent = NULL,
blocksep = 1)
append a logical, if TRUE the data are appended to the file without
erasing the data possibly existing in the file, otherwise the file (if
it exists) is overwritten (FALSE the default).
Set append = TRUE and you should be all set.
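As a minimal sketch (untested against your testa() output, and writing to a .csv name adapted from your loop), the last line of the loop could append to one file rather than overwrite it; switching from write.dna to write.table also means a data frame like res is written as a table rather than as a string:
## inside the for loop, replacing the write.dna() call
out_file <- "res_testa.csv"   # assumed output name, adapted from the question
write.table(res, file = out_file, sep = ",",
            row.names = FALSE,
            col.names = !file.exists(out_file),  # write the header only once
            append = file.exists(out_file))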
As some of the comments point out, however, you are probably better off generating your table, and then writing it all at once to a file. Unless you have billions of files, you likely won't run out of memory.
Here is how I would approach this.
library(ape)
library(data.table)
library(xts)   # for as.xts()
stud_files <- list.files("path/dir/data",full.names = T)
sumfunc <- function(f) {
df <- read.table(f, header=TRUE, sep=";")
df_xts <- as.xts(df$cola, order.by = as.Date(df$colb,"%m/%d/%Y"))
pet <- testa(df_xts)
res <- data.table(estimate = pet$estimate,
p.value=pet$p.value,
logi = pet$alternative)
return(res)
}
lres <- lapply(stud_files, sumfunc)
dat <- rbindlist(lres)
write.table(dat,
file = "res_testa.csv",
sep = ",",
quote = FALSE,
row.names = FALSE)
I have different dataframes and what I want to do is:
apply a function repeated times to each dataframe
save results of each repetition on a new dataframe keeping the name of the original dataframes and adding something else to differentiate it
Here is what I have tried so far:
# read all files to list
dataframes <- dir( pattern = ".txt")
list_dataframes <- llply(dataframes, read.csv, header = T, sep =" ", dec=".", na.string = "nd")
n <- length(dataframes)
# apply myfunction 10 times
for (j in 1:10){
modified_list <- llply(list_dataframes, myfunction)
}
if (j <10){
num.char <- paste("n0", j, sep="")
} else num.char <- paste("n", j, sep="")
# save back data frames
for (i in 1:n)
write.table(file = paste( "newfile/_modified",num.char, ".csv", sep = ""),
modified_list[i], row.names = F)
What I want as a result is the modified dataframes (in this case the 10 repetitions for each df of the list) that will have:
the name of the original df
the new name
and the number of iteration
Something like originaldfname_newname_n0
I cannot figure out where I'm messing up. Any help will be deeply appreciated.
Two major issues, I think:
the } that closes your for (j in 1:10) loop should come after your second for loop (the one that writes the files), not immediately after the llply call;
your last line should probably reference modified_list[[i]] instead of using the single-[ notation.
So your code should work (untested, slightly modified for style) as:
library(plyr)
# read all files to list
dataframes <- dir(pattern = ".txt")
list_dataframes <- llply(dataframes, read.csv,
header = T, sep = " ", dec=".", na.string = "nd")
n <- length(dataframes)
# apply myfunction 10 times
for (j in 1:10) {
modified_list <- llply(list_dataframes, myfunction)
# save back data frames
for (i in 1:n)
write.table(file = sprintf("newfile/%s_newname_%02d.csv", dataframes[i], j),
modified_list[[i]], row.names = FALSE)
}
If this were code golf, the last portion could be reduced a little with:
for (j in 1:10) {
mapply(function(df, nm) write.csv(file = sprintf('newfile/%s_newname_%02d.csv', nm, j),
df, row.names = FALSE),
llply(list_dataframes, myfunction), dataframes)
}
(This doesn't necessarily make it perfectly clearer, but it does reduce things a bit. Use it if you at some point prefer to not use for loops, though the performance in this case will be almost identical.)
Note:
Please include required libraries, e.g., library(plyr).
Though lapply would have worked just fine, I kept the use of llply to match your example.
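For reference, a minimal sketch of the same read step with base lapply (same arguments as the llply call above, spelling out na.strings in full):
## base-R equivalent of the llply read step
list_dataframes <- lapply(dataframes, read.csv,
                          header = TRUE, sep = " ", dec = ".", na.strings = "nd")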