Reading several large files in a loop - r

I am trying to read several large files in a loop. So instead of doing:
library(fst)
df1 <-read_fst("C:/data1.fst", c(1:2), from = 1, to = 1000)
df2 <-read_fst("C:/data2.fst", c(1:2), from = 1, to = 1000)
df3 <-read_fst("C:/data3.fst", c(1:2), from = 1, to = 1000)
I would like to do something like this:
for(i in 1:3){
df_i <- read_fst("C:/data_i.fst", c(1:2), from = 1, to = 1000)
}

You can use list.files to find all .fst files in a given directory and then loop through them:
files <- list.files(pattern = "\\.fst$") # .fst files in your current directory
df_list <- vector("list", length(files)) # preallocate the list of data frames
for (i in seq_along(files))
  df_list[[i]] <- fst::read_fst(files[i], ...) # replace ... with your own arguments
You can refine the pattern argument of list.files to match only certain names, e.g. pattern = "^data\\d+\\.fst$" to match data1.fst, data2.fst and so on.
You can also specify the directory to search via the path argument and get full file paths back by setting full.names = TRUE.
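To see how the path, pattern and full.names arguments work together, here is a minimal sketch that runs against a temporary directory. The folder name and the file names are made up for illustration, and the files are empty placeholders rather than real .fst files.

```r
# Hypothetical setup: a temporary folder with empty placeholder files
demo_dir <- file.path(tempdir(), "fst_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("data1.fst", "data2.fst", "data10.fst", "notes.txt")))

# Escape the dot and anchor both ends so only names like data<number>.fst match
files <- list.files(path = demo_dir, pattern = "^data[0-9]+\\.fst$", full.names = TRUE)

# list.files() sorts alphabetically, so data10 would come before data2;
# reorder by the number embedded in the file name if the order matters
idx <- as.integer(gsub("[^0-9]", "", basename(files)))
files <- files[order(idx)]
basename(files)
```

Sorting by the embedded number is only needed when the position in the list has to match the file number.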

It is better to collect the loop output in a list. First build a vector with the paths of the files (myvec below; change 1:3 to 1:n for more files). Each result of the loop is then stored as one element of List. Here is the code:
library(fst)
# Create an empty list
List <- list()
# Vector of file paths
myvec <- paste0("C:/data", 1:3, ".fst")
# Loop over the paths and store each data frame in the list
for (i in seq_along(myvec)) {
  List[[i]] <- read_fst(myvec[i], c(1:2), from = 1, to = 1000)
}
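The same idea can also be written with lapply, which returns the list directly. The sketch below is only an illustration: it uses small CSV files and read.csv as a stand-in so that it runs without the fst package, and read_one is a hypothetical wrapper whose body you would replace with your read_fst call.

```r
# Hypothetical stand-in data: three small CSV files in a temporary folder
demo_dir <- file.path(tempdir(), "loop_demo")
dir.create(demo_dir, showWarnings = FALSE)
paths <- file.path(demo_dir, paste0("data", 1:3, ".csv"))
for (p in paths) {
  write.csv(data.frame(a = 1:5, b = letters[1:5]), p, row.names = FALSE)
}

# read_one() is a hypothetical wrapper; for .fst files its body would be
# fst::read_fst(path, 1:2, from = 1, to = 1000)
read_one <- function(path) read.csv(path, nrows = 3)

# lapply() builds the list in one step; naming the elements after the
# files lets you write df_list$data2 instead of df_list[[2]]
df_list <- lapply(paths, read_one)
names(df_list) <- tools::file_path_sans_ext(basename(paths))
```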

Related

Read nested folder and file name, export to Excel file

So I am tasked with building an excel spreadsheet cataloging a drive with various nested folders and files.
This SO gets me somewhat there but I am confused on how to get my desired output.
I know that there might be a command to get file info and I can break that into these columns.
Apart from splitting the directories into subdirectory columns, the following adaptation of the function from the answer linked in the question (Stibu's answer) might help.
rfl <- function(path) {
  folders <- list.dirs(path, recursive = FALSE, full.names = FALSE)
  if (length(folders) == 0) {
    files <- list.files(path, full.names = TRUE)
    finfo <- file.info(files)
    Filename <- basename(files)
    FileType <- tools::file_ext(files)
    DateModified <- finfo$mtime
    FullFilePath <- dirname(files)
    size <- finfo$size
    data.frame(Filename, FileType, DateModified, FullFilePath, size)
  } else {
    sublist <- lapply(paste0(path, "/", folders), rfl)
    setNames(sublist, folders)
  }
}
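Note that rfl returns a nested list of data frames, one per folder that has no subfolders, so files sitting next to subfolders are not listed. If a single table is needed for the spreadsheet, the nested list can be flattened. The sketch below is one possible way to do that; flatten_rfl is a hypothetical helper, and the folder tree is made up for the demonstration.

```r
# rfl() condensed from the function above so the sketch runs on its own
rfl <- function(path) {
  folders <- list.dirs(path, recursive = FALSE, full.names = FALSE)
  if (length(folders) == 0) {
    files <- list.files(path, full.names = TRUE)
    finfo <- file.info(files)
    data.frame(Filename = basename(files), FileType = tools::file_ext(files),
               DateModified = finfo$mtime, FullFilePath = dirname(files),
               size = finfo$size)
  } else {
    setNames(lapply(file.path(path, folders), rfl), folders)
  }
}

# Hypothetical helper: walk the nested list and collect the data frames
flatten_rfl <- function(x) {
  if (is.data.frame(x)) list(x) else do.call(c, lapply(unname(x), flatten_rfl))
}

# Made-up folder tree: <root>/a/x.txt and <root>/b/c/y.csv
root <- file.path(tempdir(), "rfl_demo")
dir.create(file.path(root, "a"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "b", "c"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(root, "a", "x.txt"), file.path(root, "b", "c", "y.csv"))

# Stack all per-folder data frames into one catalog
catalog <- do.call(rbind, flatten_rfl(rfl(root)))
```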
If you have the full paths and file names, you can loop through them and parse each one into these columns. You can get more file information with file.info:
files <- c("I:/Administration/Budget/2015-BUDGET DOCUMENT.xlsx",
           "I:/Administration/Budget/2014-2015 Budget/BUDGET DOCUMENT.xlsx")
# files <- list.files("I:", recursive = TRUE, full.names = TRUE) # this could take a while to run
file_info <- vector("list", length(files)) # one element per file
for (i in seq_along(files)) {
  fullpath <- dirname(files[i])
  fullname <- basename(files[i])
  # split at the last dot only, so names that contain dots stay intact
  file_ext <- c(tools::file_path_sans_ext(fullname), tools::file_ext(fullname))
  file_meta <- file.info(files[i])[c("size", "mtime")]
  path <- unlist(strsplit(fullpath, "/", fixed = TRUE))[-1] # drop the drive letter
  file_info[[i]] <- unlist(c(file_ext, file_meta, fullpath, path))
}
l <- lapply(file_info, `length<-`, max(lengths(file_info)))
df <- data.frame(do.call(rbind, l))
names(df) <- c("filename", "extension", "size", "modified", paste0("sub", 1:(ncol(df) - 4)))
rownames(df) <- NULL
df$modified <- as.POSIXct(as.numeric(df$modified), origin = "1970-01-01")
df$size <- as.numeric(df$size)
If you do not have the list of files, you can search the drive recursively using list.files() with recursive = TRUE: list.files("I:", recursive = TRUE, full.names = TRUE)
Note:
l <- lapply(file_info, `length<-`, max(lengths(file_info))) sets the vector length of each list element to be the same. This is necessary because otherwise when the vectors are stacked with unequal lengths values get recycled. A simple example of this is: rbind(1:3, 1:5)
The output of unlist(c(file_ext, file_meta, fullpath, path)) is a vector and vectors in R are atomic, meaning all elements have to be the same class. That means everything gets converted to character in this case, which is why we have the lines df$modified <- ... and df$size <- ... at the end to convert them to their appropriate type.
If you want to write this data frame to Excel, check out xlsx::write.xlsx or openxlsx::write.xlsx. If you don't have those packages installed, you'll need to run install.packages() first.
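The two points in the note above can be checked on a small example. This is only an illustration with made-up vectors, not part of the catalog code itself.

```r
# Two vectors of unequal length, as happens with paths of different depth
vecs <- list(c("a", "b", "c"), c("a", "b", "c", "d", "e"))

# Pad every vector with NA up to the length of the longest one
padded <- lapply(vecs, `length<-`, max(lengths(vecs)))
m <- do.call(rbind, padded)
m[1, ]  # the short row is filled with NA instead of recycled values

# Mixing types in one atomic vector coerces everything to character
mixed <- unlist(list("report", 1024, TRUE))
class(mixed)
```

Without the padding step, rbind would recycle the shorter vector (with a warning), which would put wrong values in the trailing columns.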
Output
Because these files/locations don't actually exist on my computer there are NA values in the size and date modified fields:
              filename extension size modified                                      sub1           sub2   sub3             sub4
1 2015-BUDGET DOCUMENT      xlsx   NA     <NA>                  I:/Administration/Budget Administration Budget             <NA>
2      BUDGET DOCUMENT      xlsx   NA     <NA> I:/Administration/Budget/2014-2015 Budget Administration Budget 2014-2015 Budget

How to iterate a function through a list of matrices

I have a folder full of csv files that I have read and turned into matrices.
setwd("~/Desktop/EMD Test")
FilesToProcess <- list.files(pattern = "csv")
listOfFiles <- lapply(FilesToProcess, function(x) {
  out <- read.csv(x, header = FALSE, stringsAsFactors = FALSE)
  as.matrix(out)
})
Now I need to do an EMD calculation comparing all the files to the first one. Manually it looks like this:
emd(listOfFiles[[1]], listOfFiles[[2]])
What I would like to do is run this command with all the files in listOfFiles, like
emd(listOfFiles[[1]], listOfFiles[[x]])
I have tried several things with lapply and for loops but nothing has worked.
We can use a nested lapply if we want the pairwise emd for all combinations of list elements:
lapply(seq_along(listOfFiles), function(i) lapply(seq_along(listOfFiles),
  function(j) emd(listOfFiles[[i]], listOfFiles[[j]])))
Another option is combn, which could be more efficient because it evaluates each unordered pair only once:
combn(listOfFiles, 2, FUN = function(x) emd(x[[1]], x[[2]]), simplify = FALSE)
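To make the difference in the number of comparisons concrete, here is a small sketch of the pairwise approach on made-up matrices. dist_fun is a hypothetical stand-in for emd (a plain mean absolute difference) so that the example runs without the package that provides emd.

```r
# Hypothetical stand-in for emd(): mean absolute difference of two matrices
dist_fun <- function(a, b) mean(abs(a - b))

# Made-up input: four small constant matrices
mats <- lapply(1:4, function(k) matrix(k, nrow = 2, ncol = 2))

# Each column of `pairs` holds the list positions of one unordered pair
pairs <- combn(seq_along(mats), 2)
res <- apply(pairs, 2, function(p) dist_fun(mats[[p[1]]], mats[[p[2]]]))
names(res) <- paste(pairs[1, ], pairs[2, ], sep = "-")
res
```

With four matrices this computes choose(4, 2) = 6 distances, whereas the nested lapply computes all 16, including each matrix against itself.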
If you want to compare the first file with all the other files, you can use a for loop like this:
FilesToProcess <- list.files(pattern = "\\.csv$")
result <- vector("list", length(FilesToProcess) - 1)
for (i in seq_along(FilesToProcess)[-1]) {
  # result[[i - 1]] holds the comparison of file 1 with file i
  result[[i - 1]] <- emd(listOfFiles[[1]], listOfFiles[[i]])
}
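The first-against-the-rest comparison can also be written without index bookkeeping. This is a sketch on made-up matrices, assuming the distance function returns a single number; dist_fun is a hypothetical stand-in for emd.

```r
# Hypothetical stand-in for emd(): mean absolute difference of two matrices
dist_fun <- function(a, b) mean(abs(a - b))

# Made-up input: four small constant matrices
mats <- lapply(1:4, function(k) matrix(k, nrow = 2, ncol = 2))

# Compare the first matrix with each of the others; vapply() checks that
# every result is a single number
first_vs_rest <- vapply(mats[-1], function(m) dist_fun(mats[[1]], m), numeric(1))
first_vs_rest
```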

Extract data from text files using for loop

I have 40 text files with names :
[1] "2006-03-31.txt" "2006-06-30.txt" "2006-09-30.txt" "2006-12-31.txt" "2007-03-31.txt"
[6] "2007-06-30.txt" "2007-09-30.txt" "2007-12-31.txt" "2008-03-31.txt" etc...
I need to extract one specific value from each file. I know how to do it individually, but this takes a while:
m_value1 <- `2006-03-31.txt`$Marknadsvarde_tot[1]
m_value2 <- `2006-06-30.txt`$Marknadsvarde_tot[1]
m_value3 <- `2006-09-30.txt`$Marknadsvarde_tot[1]
m_value4 <- `2006-12-31.txt`$Marknadsvarde_tot[1]
Can someone help me with a for loop which would extract the data from a specific column and row through all the different text files please?
Assuming your files are all in the same folder, you can use list.files to get the names of all the files, then loop through them and pick out the value you need. Something like this:
m_value <- character() # or whatever the type of your variable is
filelist <- list.files(path = "...", pattern = "\\.txt$", full.names = TRUE)
for (i in seq_along(filelist)) {
  df <- read.table(filelist[i], header = TRUE)
  m_value[i] <- df$Marknadsvarde_tot[1]
}
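As a runnable illustration of the same loop, here is a sketch that writes a few made-up files first and then collects the first value of the column into a named vector. The folder, the values and the second column are invented for the demonstration; only the column name Marknadsvarde_tot and the file naming come from the question.

```r
# Made-up files for the demonstration, named like the ones in the question
demo_dir <- file.path(tempdir(), "txt_demo")
dir.create(demo_dir, showWarnings = FALSE)
dates <- c("2006-03-31", "2006-06-30", "2006-09-30")
for (k in seq_along(dates)) {
  write.table(data.frame(Marknadsvarde_tot = c(100.5 * k, 1), other = 0),
              file.path(demo_dir, paste0(dates[k], ".txt")),
              row.names = FALSE, quote = FALSE)
}

files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# Read only the first data row of each file and keep one column
m_value <- vapply(files, function(f) {
  as.numeric(read.table(f, header = TRUE, nrows = 1)$Marknadsvarde_tot[1])
}, numeric(1))

# Name each value after its file so it can be looked up by date
names(m_value) <- tools::file_path_sans_ext(basename(files))
m_value
```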
EDIT:
In case you have already imported all the data, you can use get:
txt_files <- list.files(pattern = "\\.txt$")
for (i in txt_files) {
  x <- read.delim(i, header = TRUE)
  assign(i, x)
}
m_value <- character()
for (i in seq_along(txt_files)) {
  m_value[i] <- get(txt_files[i])$Marknadsvarde_tot[1]
}
You could use the select parameter of fread from the data.table package for this:
library(data.table)
file.list <- list.files(pattern = "\\.txt$")
# header = TRUE so the column can be selected by name; nrows = 1 reads only the first data row
lapply(file.list, fread, select = "Marknadsvarde_tot", nrows = 1, header = TRUE)
This will result in a list of data.tables. If you just want a vector with all the values:
sapply(file.list, function(x) fread(x, select = "Marknadsvarde_tot", nrows = 1, header = TRUE)[[1]])
temp <- list.files(pattern = "\\.txt$")
library(data.table)
# Read every file with fread and create one object per file in the global environment
list2env(
  lapply(setNames(temp, make.names(gsub("\\.txt$", "", temp))),
         fread), envir = .GlobalEnv)
This adds data.table to an existing answer from "Importing multiple .csv files into R".
After you have read all your files, you can extract data from the data.tables using DT[i, j, by], where i is your row condition.

reading excel files into a single dataframe with readxl R

I have a bunch of excel files and I want to read them and merge them into a single data frame.
I have the following code:
library(readxl)
files <- list.files()
f <- list()
data_names <- gsub("[.]xls", "", files)
To read each Excel file into a data frame:
for (i in 1:length(files)){
assign(data_names[i], read_excel(files[i], sheet = 1, skip = 6))
}
But if I try to save the result in a single variable, only the last file is kept:
for (i in 1:length(files)){
temp <- read_excel(files[i], sheet = 1, skip = 6)
}
I would do this using plyr:
library(readxl)
library(plyr)
files <- list.files(".", "\\.xls")
data <- ldply(files, read_excel, sheet = 1, skip = 6)
If you wanted to add a column with the file name, you could instead do:
data <- ldply(files, function(fil) {
  data.frame(File = fil, read_excel(fil, sheet = 1, skip = 6))
})
I would recommend using a list in R. assign can be quite confusing, and the values are then awkward to retrieve with get.
It should look like this:
l <- vector("list", length(files))
for (i in seq_along(files)) {
  l[[i]] <- read_excel(files[i], sheet = 1, skip = 6)
}
ltogether <- do.call("rbind", l)
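Here is a runnable sketch of the list-then-rbind pattern that also records which file each row came from. It uses made-up CSV files and read.csv as a stand-in so that it runs without readxl; for the Excel files you would swap in read_excel(f, sheet = 1, skip = 6).

```r
# Made-up CSV files stand in for the Excel workbooks
demo_dir <- file.path(tempdir(), "bind_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (k in 1:3) {
  write.csv(data.frame(id = 1:2, value = k * c(10, 20)),
            file.path(demo_dir, paste0("book", k, ".csv")), row.names = FALSE)
}
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file and tag its rows with the source file name
pieces <- lapply(files, function(f) data.frame(File = basename(f), read.csv(f)))

# rbind() needs identical column names in every piece
combined <- do.call(rbind, pieces)
```

Keeping the pieces in a list until the final rbind avoids the problem from the question, where each iteration overwrote the previous result.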

Reading multiple csv of same format in a data frame

I need to run the same code on multiple CSV files, and I want to do it in a loop, like a macro. Below is the code I am running, but the results are not what I expect: it reads the data in a 2-D format, while I need it in a 3-D format.
lf = list.files(path = "D:/THD/data", pattern = ".csv",
full.names = TRUE, recursive = TRUE, include.dirs = TRUE)
ds<-lapply(lf,read.table)
I don't know if this is going to be useful, but one way I do it is:
## Step 1: read the files
mycsv <- dir(pattern = "\\.csv$")
n <- length(mycsv)
mylist <- vector("list", n)
for (i in seq_len(n)) mylist[[i]] <- read.csv(mycsv[i], header = TRUE)
Then I usually just use lapply to change things, for example:
## Change the column names
mylist <- lapply(mylist, function(x) {
  names(x) <- c("type", "date", paste0("v", 1:24), "total")
  return(x)
})
## Change the type column to weekday/weekend
mylist <- lapply(mylist, function(x) {
  f <- c("we", "we", "wd", "wd", "wd", "wd", "wd")
  x$type <- rep(f, length.out = 365) # repeat the weekly pattern over 365 days
  return(x)
})
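The fixed weekly pattern above only lines up with the calendar when the data start on a Saturday. As a hedged alternative, the labels can be derived from actual dates. The sketch below assumes 2011 as an example year (it starts on a Saturday and is not a leap year); the year is an assumption for illustration, not something stated in the answer.

```r
# Fixed pattern from the answer: weekend, weekend, then five weekdays
f <- c("we", "we", "wd", "wd", "wd", "wd", "wd")
type_fixed <- rep(f, length.out = 365)

# Deriving the label from real dates avoids hard-coding the start day.
# 2011 is an assumed example year, chosen because it begins on a Saturday.
dates <- seq(as.Date("2011-01-01"), as.Date("2011-12-31"), by = "day")
wday <- as.POSIXlt(dates)$wday  # 0 = Sunday, ..., 6 = Saturday
type_dates <- ifelse(wday %in% c(0, 6), "we", "wd")

# Both approaches agree for a year that starts on a Saturday
identical(type_fixed, type_dates)
table(type_fixed)
```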
and so on.
Then, after making all the changes, I save the files with the following code. (It is also sometimes useful to split the original file name and keep parts of it with the saved data, so that I can track each individual file later.)
## For example, some of my files had names such as "201_E424220_N563500.csv", so I split the name and added the pieces as new columns like this:
mylist <-lapply(1:length(mylist), function(i) {
mylist.i <- mylist[[i]]
s = strsplit(mycsv[i], "_" , fixed = TRUE)[[1]]
d = cbind(mylist.i[, c("type", "date")], ID = s[1], Easting = s[2], Northing = s[3], mylist.i[, 3:ncol(mylist.i)])
return(d)
})
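One detail of the split above: because the file name still carries its extension, the third piece comes out as "N563500.csv". A small sketch of stripping the extension first, using the file name from the example:

```r
fname <- "201_E424220_N563500.csv"  # file name pattern from the example above

# Strip the extension before splitting, so the last piece is "N563500"
# rather than "N563500.csv"
s <- strsplit(tools::file_path_sans_ext(fname), "_", fixed = TRUE)[[1]]
meta <- data.frame(ID = s[1], Easting = s[2], Northing = s[3])
meta
```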
for (i in seq_len(n))
  write.csv(mylist[[i]], file = paste0("file", i, ".csv"), row.names = FALSE)
I hope this helps. When you get some time, please read about the plyr package, as I am sure it will be very useful for you; it is a very useful package with lots of data analysis options. plyr has apply-style functions such as:
## l_ply split list, apply function and discard result
## ldply split list, apply function and return result in data frame
## laply split list, apply function and return result in an array
For example, you can use ldply to read all your CSV files and return a single data frame, something like:
data <- ldply(list.files(pattern = "\\.csv$"), function(fname) {
  j <- read.csv(fname, header = TRUE)
  return(j)
})
So here data will be a single data frame containing the data from all your CSV files.
Thanks,Ayan
