Read nested folder and file name, export to Excel file - r

So I am tasked with building an excel spreadsheet cataloging a drive with various nested folders and files.
This SO gets me somewhat there but I am confused on how to get my desired output.
I know that there might be a command to get file info and I can break that into these columns.

Apart from the directories split into subdirs, the adaptation of the function in the question's link, Stibu's answer, might be of help.
rfl <- function(path) {
folders <- list.dirs(path, recursive = FALSE, full.names = FALSE)
if (length(folders)==0) {
files <- list.files(path, full.names = TRUE)
finfo <- file.info(files)
Filename <- basename(files)
FileType <- tools::file_ext(files)
DateModified <- finfo$mtime
FullFilePath <- dirname(files)
size <- finfo$size
data.frame(Filename, FileType, DateModified, FullFilePath, size)
} else {
sublist <- lapply(paste0(path,"/",folders),rfl)
setNames(sublist,folders)
}
}

If you have the full path and file names then you can loop through that and parse it into these columns. You can get more file info with file.info:
files <- c("I:/Administration/Budget/2015-BUDGET DOCUMENT.xlsx",
"I:/Administration/Budget/2014-2015 Budget/BUDGET DOCUMENT.xlsx")
# files <- list.files("I:", recursive = T, full.names = T) # this could take a while to run
file_info <- list(length = length(files))
for (i in seq_along(files)){
fullpath <- dirname(files[i])
fullname <- basename(files[i])
file_ext <- unlist(strsplit(fullname, ".", fixed = T))
file_meta <- file.info(files[i])[c("size", "mtime")]
path <- unlist(strsplit(fullpath, "/", fixed = T))[-1]
file_info[[i]] <- unlist(c(file_ext, file_meta, fullpath, path))
}
l <- lapply(file_info, `length<-`, max(lengths(file_info)))
df <- data.frame(do.call(rbind, l))
names(df) <- c("filename", "extension", "size", "modified", paste0("sub", 1:(ncol(df) - 4)))
rownames(df) <- NULL
df$modified <- as.POSIXct.numeric(as.numeric(df$modified), origin = "1970-01-01")
df$size <- as.numeric(df$size)
If you do not have the files you can recursively search the drive using list.files() with recursive = T: list.files("I:", recursive = T, full.names = T)
Note:
l <- lapply(file_info, `length<-`, max(lengths(file_info))) sets the vector length of each list element to be the same. This is necessary because otherwise when the vectors are stacked with unequal lengths values get recycled. A simple example of this is: rbind(1:3, 1:5)
The output of unlist(c(file_ext, file_meta, fullpath, path)) is a vector and vectors in R are atomic, meaning all elements have to be the same class. That means everything gets converted to character in this case, which is why we have the lines df$modified <- ... and df$size <- ... at the end to convert them to their appropriate type.
If you want to output this data frame to excel check out xlsx::write.xlsx or openxlsx::write.xlsx. If you don't have those libraries installed you'll need to use install.packages() first.
Output
Because these files/locations don't actually exist on my computer there are NA values in the size and date modified fields:
filename extension size modified sub1 sub2 sub3 sub4
1 2015-BUDGET DOCUMENT xlsx NA <NA> I:/Administration/Budget Administration Budget <NA>
2 BUDGET DOCUMENT xlsx NA <NA> I:/Administration/Budget/2014-2015 Budget Administration Budget 2014-2015 Budget

Related

Trying to rename multiple.csv files using data contained within the file

A machine I use spits out .csv files named by the time. But I need them named after the plate they were read from, which is contained within the file.
I created list of files:
files <- list.files(path="", pattern="*.csv")
I then tried using a for-loop to first create a data frame from each file containing the 1st row only, then to create a variable from the relevant piece of data, (the desired name), and then renaming the files.
for(x in files)
{
y <- read.csv(x, nrow = 1, header = FALSE, stringsAsFactors = TRUE)
z <- y[2, 2]
file.rename(x, z)
}
It didn't work. After 7 hours of trying (new to R) I am here. Please give simple advice, I have basically zero R experience.
I believe the following for loop does what the question asks for if the new filename is the second column header value.
If it is not, change nmax to the appropriate column number.
fls <- list.files(pattern = '\\.csv')
for(f in fls){
x <- scan(file = f, what = character(), nmax = 2, nlines = 1, sep = ',')
g <- paste0(x[2], '.csv')
file.rename(f, g)
}

Select CSV files and read in pairs

I am comparing two pairs of csv files each at a time. The files I have each end with a number like cars_file2.csv, Lorries_file3.csv, computers_file4.csv, phones_file5.csv. I have like 70 files per folder and the way I am comparing is, I compare cars_file2.csv and Lorries_file3.csv then Lorries_file3.csv and
computers_file4.csv, and the pattern is 2,3,3,4,4,5 like that. Is there a smart way I can handle this instead of manually coming back and change file like the way I am reading here or I can use the last number on each csv to read them smartly. NOTE the files have same suffixes _file:
library(daff)
setwd("path")
# Load csvs to compare into data frames
x_original <- read.csv("cars_file2.csv", strip.white=TRUE, stringsAsFactors = FALSE)
x_changed <- read.csv("Lorries_file3.csv", strip.white=TRUE, stringsAsFactors = FALSE)
render(diff_data(x_original,x_changed ,ignore_whitespace=TRUE,count_like_a_spreadsheet = FALSE))
My intention is to compare each two pairs of csv and recorded, Field additions, deletions and modified
You may want to load all files at once and do your comparison with a full list of files.
This may help:
# your path
path <- "insert your path"
# get folders in this path
dir_data <- as.list(list.dirs(path))
# get all filenames
dir_data <- lapply(dir_data,function(x){
# list of folders
files <- list.files(x)
files <- paste(x,files,sep="/")
# only .csv files
files <- files[substring(files,nchar(files)-3,nchar(files)) %in% ".csv"]
# remove possible errors
files <- files[!is.na(files)]
# save if there are files
if(length(files) >= 1){
return(files)
}
})
# delete NULL-values
dir_data <- compact(dir_data)
# make it a named vector
dir_data <- unique(unlist(dir_data))
names(dir_data) <- sub(pattern = "(.*)\\..*$", replacement = "\\1", basename(dir_data))
names(dir_data) <- as.numeric(substring(names(dir_data),nchar(names(dir_data)),nchar(names(dir_data))))
# remove possible NULL-values
dir_data <- dir_data[!is.na(names(dir_data))]
# make it a list again
dir_data <- as.list(dir_data)
# load data
data_upload <- lapply(dir_data,function(x){
if(file.exists(x)){
data <- read.csv(x,header=T,sep=";")
}else{
data <- "file not found"
}
return(data)
})
# setup for comparison
diffs <- lapply(as.character(sort(as.numeric(names(data_upload)))),function(x){
# check if the second dataset exists
if(as.character(as.numeric(x)+1) %in% names(data_upload)){
# first dataset
print(data_upload[[x]])
# second dataset
print(data_upload[[as.character(as.numeric(x)+1)]])
# do your operations here
comparison <- render(diff_data(data_upload[[x]],
data_upload[[as.character(as.numeric(x)+1)]],
ignore_whitespace=T,count_like_a_spreadsheet = F))
numbers <- c(x, as.numeric(x)+1)
# save both the comparison data and the numbers of the datasets
return(list(comparison,numbers))
}
})
# you can find the differences here
diffs
This script loads all csv-files in a folder and its sub-folders and puts them into a list by their numbers. In case there are no doubles, this will work. If you have doubles, you will have to adjust the part where the vector is named so that you can index the full names of the files afterwards.
A simple for- loop using paste will read-in the pairs:
for (i in 1:70) { # assuming the last pair is cars_file70.csv and Lorries_file71.csv
x_original <- read.csv(paste0("cars_file",i,".csv"), strip.white=TRUE, stringsAsFactors = FALSE)
x_changed <- read.csv(paste0("Lorries_file3",i+1,".csv"), strip.white=TRUE, stringsAsFactors = FALSE)
render(diff_data(x_original,x_changed ,ignore_whitespace=TRUE,count_like_a_spreadsheet = FALSE))
}
For simplicity I used 2 .csv files.
csv_1
1,2,4
csv_2
1,8,10
Load all the .csv files from folder,
files <- dir("Your folder path", pattern = '\\.csv', full.names = TRUE)
tables <- lapply(files, read.csv)
#create empty list to store comparison output
diff <- c()
Loop through all loaded files and compare,
for (pos in 1:length(csv)) {
if (pos != length(csv)) { #ignore last one
#save comparison output
diff[[pos]] <- diff_data(as.data.frame(csv[pos]), as.data.frame(csv[pos + 1]), ignore_whitespace=TRUE,count_like_a_spreadsheet = FALSE)
}
}
Compared output by diff
[[1]]
Daff Comparison: ‘as.data.frame(tables[pos])’ vs. ‘as.data.frame(tables[pos + 1])’
+++ +++ --- ---
## X1 X8 X10 X2 X4

Passing a list of one function to another, across different directories

I am trying to assign random numbers to a copy of a file in various directories (tricky to replicate). The directory structure is as follows:
1100
1100/Images
I first create the new directories and copy the images across. For this I have the following working
NewImageGen <- function(singledir){
#
Directoryforrandoms <- "RandomNumAsignment"
Directoryforrandoms <- paste(singledir, "/", Directoryforrandoms, sep = "")
dir.create(Directoryforrandoms,
showWarnings = F)
#
Imagedir <- paste(singledir, '/Images', sep = '')
filestocopy <- list.files(Imagedir,
recursive = F,
full.names = T)
file.copy(from = filestocopy,
to = Directoryforrandoms,
overwrite = F)
#
newfiles <- list.files(path = Directoryforrandoms,
pattern = ".tif", # they are all tiff files
full.names = T)
#
return(newfiles)
}
NewImages <- pblapply(alldirs, FUN = NewImageGen)
This gives me a large list, which is divided into four (due to there being four directories in this instance). I want to pass the newfiles to another function which generates a random number prefix and sticks it on the front. I can do this on a regular list of files using:
RandomNumGen <- function(singleimg){
randomnumber <- as.character(sample(100000:999999, 1, replace=F))
singlerename <- sub('^', randomnumber, singleimg)
file.rename(from = singleimg,
to = singlerename)
}
It runs through all elements of the list but returns a frustrating false.
Any help would be top notch!
pblapply() returns a list while list.files() returns a character vector.
So if your function works with the return value of list.files() try changing your call from pblapply() to pbsapply().
Edit:
newfiles, i.e. the return value of the NewImageGen() function, contains full paths (as you set full.names = T in the last list.files() call).
sub('^', randomnumber, singleimg) will add the random number before the first part of the path like 677570/home/Jim/1100/Images/RandomNumAsignment/original_image_name.tif.
Instead you want to do
file.path(dirname(singleimg), sub('^', randomnumber, basename(singleimg))

Combine csv files with common file identifier

I have a list of approximately 500 csv files each with a filename that consists of a six-digit number followed by a year (ex. 123456_2015.csv). I would like to append all files together that have the same six-digit number. I tried to implement the code suggested in this question:
Import and rbind multiple csv files with common name in R but I want the appended data to be saved as new csv files in the same directory as the original files are currently saved. I have also tried to implement the below code however the csv files produced from this contain no data.
rm(list=ls())
filenames <- list.files(path = "C:/Users/smithma/Desktop/PM25_test")
NAPS_ID <- gsub('.+?\\([0-9]{5,6}?)\\_.+?$', '\\1', filenames)
Unique_NAPS_ID <- unique(NAPS_ID)
n <- length(Unique_NAPS_ID)
for(j in 1:n){
curr_NAPS_ID <- as.character(Unique_NAPS_ID[j])
NAPS_ID_pattern <- paste(".+?\\_(", curr_NAPS_ID,"+?)\\_.+?$", sep = "" )
NAPS_filenames <- list.files(path = "C:/Users/smithma/Desktop/PM25_test", pattern = NAPS_ID_pattern)
write.csv(do.call("rbind", lapply(NAPS_filenames, read.csv, header = TRUE)),file = paste("C:/Users/smithma/Desktop/PM25_test/MERGED", "MERGED_", Unique_NAPS_ID[j], ".csv", sep = ""), row.names=FALSE)
}
Any help would be greatly appreciated.
Because you're not doing any data manipulation, you don't need to treat the files like tabular data. You only need to copy the file contents.
filenames <- list.files("C:/Users/smithma/Desktop/PM25_test", full.names = TRUE)
NAPS_ID <- substr(basename(filenames), 1, 6)
Unique_NAPS_ID <- unique(NAPS_ID)
for(curr_NAPS_ID in Unique_NAPS_ID){
NAPS_filenames <- filenames[startsWith(basename(filenames), curr_NAPS_ID)]
output_file <- paste0(
"C:/Users/nwerth/Desktop/PM25_test/MERGED_", curr_NAPS_ID, ".csv"
)
for (fname in NAPS_filenames) {
line_text <- readLines(fname)
# Write the header from the first file
if (fname == NAPS_filenames[1]) {
cat(line_text[1], '\n', sep = '', file = output_file)
}
# Append every line in the file except the header
line_text <- line_text[-1]
cat(line_text, file = output_file, sep = '\n', append = TRUE)
}
}
My changes:
list.files(..., full.names = TRUE) is usually the best way to go.
Because the digits appear at the start of the filenames, I suggest substr. It's easier to get an idea of what's going on when skimming the code.
Instead of looping over the indices of a vector, loop over the values. It's more succinct and less likely to cause problems if the vector's empty.
startsWith and endsWith are relatively new functions, and they're great.
You only care about copying lines, so just use readLines to get them in and cat to get them out.
You might consider something like this:
##will take the first 6 characters of each file name
six.digit.filenames <- substr(filenames, 1,6)
path <- "C:/Users/smithma/Desktop/PM25_test/"
unique.numbers <- unique(six.digit.filenames)
for(j in unique.numbers){
sub <- filenames[which(substr(filenames,1,6) == j)]
data.for.output <- c()
for(file in sub){
##now do your stuff with these files including read them in
data <- read.csv(paste0(path,file))
data.for.output <- rbind(data.for.output,data)
}
write.csv(data.for.output,paste0(path,j, '.csv'), row.names = F)
}

Applying a function on all csv files from a certain folder

I am reading csv files from a certain folder, which all have the same structure. Furthermore, I have created a function which adds a certain value to a dataFrame.
I have created the "folder reading" - part and also created the function. However, I now need to connect these two with each other. This is where I am having my problems:
Here is my code:
addValue <- function(valueToAdd, df.file, writterPath) {
df.file$result <- df.file$Value + valueToAdd
x <- x + 1
df.file <- as.data.frame(do.call(cbind, df.file))
fullFilePath <- paste(writterPath, x , "myFile.csv", sep="")
write.csv(as.data.frame(df.file), fullFilePath)
}
#1.reading R files
path <- "C:/Users/RFiles/files/"
files <- list.files(path=path, pattern="*.csv")
for(file in files)
{
perpos <- which(strsplit(file, "")[[1]]==".")
assign(
gsub(" ","",substr(file, 1, perpos-1)),
read.csv(paste(path,file,sep="")))
}
#2.appyling function
writterPath <- "C:/Users/RFiles/files/results/"
addValue(2, sys, writterPath)
How to apply the addValue() function in my #1.reading R files construct? Any recommendations?
I appreciate your answers!
UPDATE
When trying out the example code, I get:
+ }
+ ## If you really need to change filenames with numbers,
+ newfname <- file.path(npath, paste0(x, basename(fname)))
+ ## otherwise just use `file.path(npath, basename(fname))`.
+
+ ## (4) Write back to a different file location:
+ write.csv(newdat, file = newfname, row.names = FALSE)
+ }
Error in `$<-.data.frame`(`*tmp*`, "results", value = numeric(0)) :
replacement has 0 rows, data has 11
Any suggestions?
There are several problems with your code (e.g., x in your function is never defined and is not retained between calls to addValue), so I'm guessing that this is a chopped-down version of the real code and you still have remnants remaining. Instead of picking it apart verbosely, I'll just offer my own suggested code and a few pointers.
The function addValue looks like it is good for changing a data.frame, but I would not have guessed (by the name, at least) that it would also write the file to disk (and potentially over-write an existing file).
I'm guessing you are trying to (1) read a file, (2) "add value" to it, (3) assign it to a global variable, and (4) write it to disk. The third can be problematic (and contentious with some programmers), but I'll leave it for now.
Unless writing to disk is inherent to your idea of "adding value" to a data.frame, I recommend you keep #2 separate from #4. Below is a suggested alternative to your code:
addValue <- function(valueToAdd, df) {
df$results <- df$Value + valueToAdd
## more stuff here?
return(df)
}
opath <- 'c:/Users/RFiles/files/raw' # notice the difference
npath <- 'c:/Users/RFiles/files/adjusted'
files <- list.files(path = opath, pattern = '*.csv', full.names = TRUE)
x <- 0
for (fname in files) {
x <- x + 1
## (1) read in and (2) "add value" to it
dat <- read.csv(fname)
newdat <- addValue(2, dat)
## (3) Conditionally assign to a global variable:
varname <- gsub('\\.[^.]*$', '', basename(fname))
if (! exists(varname)) {
assign(x = varname, value = newdat)
} else {
warning('variable exists, did not overwrite: ', varname)
}
## If you really need to change filenames with numbers,
newfname <- file.path(npath, paste0(x, basename(fname)))
## otherwise just use `file.path(npath, basename(fname))`.
## (4) Write back to a different file location:
write.csv(newdat, file = newfname, row.names = FALSE)
}
Notice that it will not overwrite global variables. This may be an annoying check, but will keep you from losing data if you accidentally run this section of code.
An alternative to assigning numerous variables to the global address space is to save all of them to a single list. Assuming they are the same format, you will likely be dealing with them with identical (or very similar) analytical methods, so putting them all in one list will facilitate that. The alternative of tracking disparate variable names can be tiresome.
## addValue as defined previously
opath <- 'c:/Users/RFiles/files/raw'
npath <- 'c:/Users/RFiles/files/adjusted'
ofiles <- list.files(path = opath, pattern = '*.csv', full.names = TRUE)
nfiles <- file.path(npath, basename(ofiles))
dats <- mapply(function(ofname, nfname) {
dat <- read.csv(ofname)
newdat <- addValue(2, dat)
write.csv(newdat, file = nfname, row.names = FALSE)
newdat
}, ofiles, nfiles, SIMPLIFY = FALSE)
length(dats) # number of files
names(dats) # one for each file

Resources