Using unzip() in R: how to deal with duplicate file names? - r

I have a large number of nested directories with .ZIP files containing .CSV files that I want to loop through in R, extract the contents using unzip(), and then read the csv files into R.
However, there are many cases (numbering thousands of files) where there are multiple .zip files in the same directory containing .csv files with identical file names. If I set the overwrite=FALSE argument in unzip(), it ignores all duplicated names after the first. What I want is for it to extract all files but add some suffix to the file name that will allow the duplicated files to be extracted to the same directory, so that I do not have to create even more nested subdirectories to hold the files.
Example:
Directory ~/zippedfiles contains:
archive1.zip (consists of foo.csv, bar.csv), archive2.zip (foo.csv, blah.csv)
Run the following:
unzip('~/zippedfiles/archive1.zip', exdir='~/zippedfiles', overwrite=FALSE)
unzip('~/zippedfiles/archive2.zip', exdir='~/zippedfiles', overwrite=FALSE)
The result is
bar.csv
blah.csv
foo.csv
The desired result is
bar.csv
blah.csv
foo.csv
foo(1).csv

Rather than renaming the duplicate file names, why not keep them unique by assigning a separate folder to each unzip action (just like your OS probably would)? This way you don't have to worry about changing file names, and you end up with a single list referencing all unzipped files:
setwd( '~/zippedfiles' )
# get a list of ".zip" files (the pattern is a regular expression, so escape the dot)
ziplist <- list.files( pattern = "\\.zip$", ignore.case = TRUE )
# start a fresh vector to fill
unzippedlist <- vector( mode = "character", length = 0L )
# for every ".zip" file we found...
for( zipfile in ziplist ) {
  # decide on a name for an output folder (the archive name without its extension)
  outfolder <- sub( "\\.zip$", "", zipfile, ignore.case = TRUE )
  # create the output folder
  dir.create( outfolder )
  # unzip into the new output folder (zipfile is the loop variable, so no quotes)
  unzip( zipfile, exdir = outfolder, overwrite = FALSE )
  # get a list of files just unzipped
  newunzipped <- list.files( path = outfolder, full.names = TRUE )
  # add that new list of files to the complete list
  unzippedlist <- c( unzippedlist, newunzipped )
}
The vector unzippedlist should contain all of your unzipped files, with every one being unique, not necessarily by file name, but by a combination of directory and filename. So you can pass it as a vector to capture all of your files.
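If everything really does have to land in one directory, as the question asks, one option is to extract each archive to a temporary folder and move its files across one at a time, picking a new name whenever the target already exists. The following is a minimal sketch rather than a solution tested on thousands of archives; next_free_path() and unzip_keep_all() are hypothetical helper names, and the foo(1).csv naming simply mirrors the desired result in the question.

```r
# Return a path in `dir` for `fname` that does not exist yet,
# appending (1), (2), ... before the extension when needed.
next_free_path <- function(dir, fname) {
  candidate <- file.path(dir, fname)
  stem <- tools::file_path_sans_ext(fname)
  ext  <- tools::file_ext(fname)
  suffix <- if (nzchar(ext)) paste0(".", ext) else ""
  i <- 0
  while (file.exists(candidate)) {
    i <- i + 1
    candidate <- file.path(dir, paste0(stem, "(", i, ")", suffix))
  }
  candidate
}

# Extract one archive into `exdir` without overwriting or skipping anything.
unzip_keep_all <- function(zipfile, exdir) {
  tmp <- tempfile("unzip_")
  dir.create(tmp)
  on.exit(unlink(tmp, recursive = TRUE))
  # unzip() returns the paths of the files it extracted
  extracted <- unzip(zipfile, exdir = tmp, junkpaths = TRUE)
  for (f in extracted) {
    file.copy(f, next_free_path(exdir, basename(f)))
  }
  invisible(NULL)
}

# Possible usage with the directory from the question:
# zips <- list.files("~/zippedfiles", pattern = "\\.zip$", full.names = TRUE)
# for (z in zips) unzip_keep_all(z, "~/zippedfiles")
```

The trade-off compared with one folder per archive is that the suffix no longer tells you which archive a file came from.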

A solution for you might be to use system()/system2() and then use one of the countless Unix methods to achieve that.

Related

Moving and copying multiple files

I have a list of source paths and destination paths in Excel. How can I move these files?
source                     destination
C:/users/desk/1/a.pdf      C:/users/desktop/2
C:/users/desk/1/b.pdf      C:/users/desktop/3
C:/users/desk/1/abb.pdf    C:/users/desktop/56
I need to copy each file from its source to the respective destination.
To copy the files, you can use file.copy. Its to argument accepts either a single directory or a vector of file paths (one per source file), and the files are copied to that directory or those paths.
As your destination column contains only directory paths, you need to specify full paths (including file names) for new files. To do this, you can use file.path and basename to concatenate the original file names (in source) to the new directories (destination).
df = data.frame(
source = c('C:/users/desk/1/a.pdf', 'C:/users/desk/1/b.pdf', 'C:/users/desk/1/abb.pdf'),
destination = c('C:/users/desktop/2', 'C:/users/desktop/3', 'C:/users/desktop/56')
)
file.copy(from = df$source, to = file.path(df$destination, basename(df$source)))
To move the files, you can use file.rename.
file.rename(from = df$source, to = file.path(df$destination, basename(df$source)))
Note 1: file.rename may only work when moving files between locations on the same drive. To move files across drives you could use file.copy followed by file.remove to remove the original files after copying. If doing this, you should be careful not to remove files if the copy operation fails, e.g.:
file.move <- function(from, to, ...) {
  # copy files and store vector of success
  cp <- file.copy(from = from, to = to, ...)
  # remove those files that were successful
  file.remove(from[cp])
  # warn about unsuccessful files
  if (any(!cp)) {
    warning(
      'The following files could not be moved:\n',
      paste(from[!cp], collapse = '\n')
    )
  }
}
file.move(from = df$source, to = file.path(df$destination, basename(df$source)))
Note 2: This is all assuming that you have read in your excel data using one of read.csv or data.table::fread (for .csv files) or readxl::read_excel (for .xls or .xlsx files)
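One thing the snippets above take for granted is that every destination folder already exists; file.copy and file.rename do not create missing directories. A minimal sketch of creating them first, using throwaway folders under a temporary directory in place of the real paths (the folder and file names here are made up for illustration):

```r
# Throwaway source files and destinations in a temporary folder (hypothetical names)
root <- tempfile("copy_demo_")
dir.create(file.path(root, "1"), recursive = TRUE)
src <- file.path(root, "1", c("a.pdf", "b.pdf"))
file.create(src)

df <- data.frame(
  source = src,
  destination = file.path(root, c("2", "3")),
  stringsAsFactors = FALSE
)

# Create each destination folder that does not exist yet
for (d in unique(df$destination)) {
  if (!dir.exists(d)) dir.create(d, recursive = TRUE)
}

# Copy each file into its own destination folder
copied <- file.copy(from = df$source,
                    to = file.path(df$destination, basename(df$source)))
```

Here copied is a logical vector with one entry per file, so any FALSE entries point at the copies that failed.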

Read all files in specific folder in R

I am trying to read all files in a specific sub-folder of the wd. I have been able to add a for loop successfully, but the loop only looks at files within the wd. I thought the command line:
directory <- 'folder.I.want.to.look.in'
would enable this but the script still only looks in the wd. However, the above command does help create a list of the correct files. I have included the script below that I have written but not sure what I need to modify to aim it at a specific sub-folder.
directory <- 'folder.I.want.to.look.in'
files <- list.files(path = directory)
out_file <- read_excel("file.to.be.used.in.output", col_names = TRUE)
for (filename in files){
show(filename)
filepath <- paste0(filename)
## Import data
data <- read_excel(filepath, skip = 8, col_names = TRUE)
data <- data[, -c(6:8)]
further script
}
The further script is irrelevant to this question and works fine. I just can't get the loop to look over each file in files from directory. Many thanks in advance
Set your base directory, and then use it to create a vector of all the files with list.files, e.g.:
base_dir <- 'path/to/my/working/directory'
all_files <- file.path(base_dir, list.files(base_dir, recursive = TRUE))
Then just loop over all_files. By default, list.files has recursive = FALSE, i.e., it only returns the files and directories directly inside the directory you specify, rather than descending into each subfolder. Setting recursive = TRUE returns each file's path relative to your base directory, which is why we join it to base_dir with file.path (plain paste0 would drop the separator between the two).
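An alternative to joining the paths yourself is to let list.files return them ready to use with full.names = TRUE. The sketch below uses small CSV files in a temporary folder so that it runs without any packages; in the question's script, read_excel(filepath, skip = 8, col_names = TRUE) would take the place of read.csv. The folder and file names are placeholders.

```r
# Placeholder sub-folder with two small files
directory <- tempfile("subfolder_")
dir.create(directory)
write.csv(data.frame(x = 1:2), file.path(directory, "a.csv"), row.names = FALSE)
write.csv(data.frame(x = 3:5), file.path(directory, "b.csv"), row.names = FALSE)

# full.names = TRUE prepends `directory`, so the paths work from any working directory
files <- list.files(path = directory, pattern = "\\.csv$", full.names = TRUE)

results <- list()
for (filepath in files) {
  # read each file via its full path rather than its bare name
  results[[basename(filepath)]] <- read.csv(filepath)
}
```

The loop in the question failed because paste0(filename) leaves the bare file name unchanged, which R then looks for in the working directory.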

Unzip Multiple files containing same name using R

I have 105 zipped files in a folder. Each contains a single csv file, and all of those csv files have the same name, i.e. 'EapTransactions_1'.
Currently I am using the following code in R to extract all of them into a new folder :
library(plyr)
outDir<-"C:/Users/dhritul.gupta/Migration Files/Trial1/extract"
zipF=list.files(path = "C:/Users/dhritul.gupta/Migration Files/Trial1", pattern = "*.zip", full.names = TRUE)
ldply(.data = zipF, .fun = unzip, exdir = outDir)
The problem with this approach is that, since all the file names are the same, each file overwrites the previous one and only the last one is saved.
Is there any way to save each of them by renaming them or adding a prefix/suffix to the file names during extraction?
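You may try using file.rename to add a unique number to each extracted file right after the call to unzip that produces it, so the next archive cannot overwrite it:
zipF <- list.files(path = "C:/Users/dhritul.gupta/Migration Files/Trial1",
                   pattern = "\\.zip$", full.names = TRUE)
for (i in seq_along(zipF)) {
  # unzip() returns the paths of the files it extracted
  extracted <- unzip(zipF[i], exdir = outDir)
  # prefix the archive's index to keep every file name unique
  file.rename(extracted, file.path(outDir, paste0(i, "_", basename(extracted))))
}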
I tried to build something based on Tim's idea. It worked for me when I stored the files at a temporary location to rename the files. I then moved the renamed files to the final destination and deleted the temporary files.
TempoutDir <- "C:/Users/dhritul.gupta/Migration Files/Trial1/extract/Temp" # Define a temp location
setwd(TempoutDir) # setwd for rename/remove functions to work
for (i in seq_along(zipF))
{
  unzip(zipF[i], exdir = TempoutDir, overwrite = FALSE)
  # Files would be overwritten because they share a name, so give each file a new name
  # with a random number (runif), save it at the final location, and empty the temp folder
  a <- list.files(TempoutDir) # Vector with actual file name
  # Vector with a random number prepended to the file name
  b <- paste(runif(length(a), min = 0, max = 1000), a)
  file.rename(a, b) # Rename the file in temp location
  file.copy(list.files(TempoutDir), outDir) # Copy file from temp location to main location
  file.remove(list.files(TempoutDir)) # Delete files in Temp location
  rm(a)
  rm(b) # Delete vectors a, b from environment
}
You should now have all the files in the desired folder, each with a random number prepended to its file name, and nothing left in the temp folder.
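Random numbers make the names unique but not traceable. A variation, sketched here under the assumption that the archive names themselves are unique, is to prefix each extracted file with the name of the archive it came from. prefixed_name() and unzip_with_prefix() are hypothetical helpers, not part of any package.

```r
# Build "<archive name>_<file name>" for files extracted from `zipfile`
prefixed_name <- function(zipfile, files) {
  archive <- tools::file_path_sans_ext(basename(zipfile))
  paste0(archive, "_", basename(files))
}

# Extract one archive to a temporary folder, then copy the files
# into `exdir` under their prefixed names
unzip_with_prefix <- function(zipfile, exdir) {
  tmp <- tempfile("unzip_")
  dir.create(tmp)
  on.exit(unlink(tmp, recursive = TRUE))
  extracted <- unzip(zipfile, exdir = tmp, junkpaths = TRUE)
  file.copy(extracted, file.path(exdir, prefixed_name(zipfile, extracted)))
}

# Possible usage with the variables from the question:
# for (z in zipF) unzip_with_prefix(z, outDir)
```

Because each archive gets its own temporary folder, nothing needs to be renamed in place and the working directory is never changed.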

copy csv file from multiple directories to a new one in R

I am trying to extract many .csv files from multiple directories/subdirectories and copy them in a new folder, where I would like to end up with only .csv files.
The csv files are stored in subdirectories with the following structure:
D:\R data\main_folder\03\07\04\BBB_0120180307031414614.csv
D:\R data\main_folder\03\07\05\BBB_0120180307031414615.csv
I am trying the list.files function to extract the csv files names only.
my_dirs <- list.files("D:\\R data\\main_folder\\",pattern="\\.csv$" ,recursive = TRUE,
include.dirs = FALSE,full.names = FALSE)
The problem is that csv files are listed with the directory path, e.g.
03/07/03/BBB_0120180307031414614.csv
And this, even though full.names and include.dirs is set to FALSE.
This prevents me from copying those files in a new folder, as the name is not recognized.
What am I doing wrong?
Thanks
Use the basename function together with list.files, as shown below.
If I understood you correctly, you want to fetch the names of the .csv files present in different directories.
I made a temp folder in the Documents directory of my Windows machine. Inside it are two folders, "one" and "two", which contain the csv files "just_one.csv" and "just_two.csv" respectively.
So to fetch the names "just_one.csv" and "just_two.csv", I could do this:
basename(list.files("C:/Users/C_Nfdl_99878314/Documents/temp", "\\.csv$", recursive = TRUE))
Which results in:
[1] "just_one.csv" "just_two.csv"
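basename gives the bare file names, but copying still needs the full source paths. A minimal sketch of the copy step, assuming file names are unique across subfolders (otherwise the later copies are skipped, since file.copy does not overwrite by default); the folders are created in a temporary location purely for illustration:

```r
# Demo tree: main_folder/03/07/04/x.csv and main_folder/03/07/05/y.csv (made-up names)
main_folder <- tempfile("main_folder_")
dir.create(file.path(main_folder, "03", "07", "04"), recursive = TRUE)
dir.create(file.path(main_folder, "03", "07", "05"), recursive = TRUE)
file.create(file.path(main_folder, "03", "07", "04", "x.csv"))
file.create(file.path(main_folder, "03", "07", "05", "y.csv"))

new_folder <- tempfile("csv_only_")
dir.create(new_folder)

# full.names = TRUE returns paths that file.copy can open directly
csv_paths <- list.files(main_folder, pattern = "\\.csv$",
                        recursive = TRUE, full.names = TRUE)

# Drop the subdirectory part so everything lands directly in new_folder
copied <- file.copy(csv_paths, file.path(new_folder, basename(csv_paths)))
```

With recursive = TRUE, full.names = FALSE still returns paths relative to the starting folder, which is the behaviour the question ran into.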

renaming file based on folder names with R

I have a folder Tmin which contains 18 folders. Each of the 18 folders contains hundreds of files. I would like to write an R program that adds the name of the containing folder to the name of every file in it. I do not want to give each file a completely different name; I only want to add the folder name at the beginning of the file name. I am new to R and to programming, and I was not able to write a batch function that repeats the operation for each folder. The two attached pictures show what I would like to obtain.
For example, the file called "name_date.tiff" contained in the folder "MACA_Miroc" will become "MACA_Miroc_name_date.tiff". Moreover, I would like to repeat the operation automatically for each folder. Thanks in advance for any help!
[Images: desired result and organization of my folders and files]
This ought to work:
mydir <- getwd()
primary_folder <- "C:/Users/Desktop/Test_Data/"
# subfolders whose path contains "MACA"
subfolders <- grep("MACA", list.dirs(primary_folder, full.names = TRUE, recursive = FALSE),
                   value = TRUE)
renameFunc <- function(z){
  setwd(z)
  # files to rename in this folder
  fnames <- dir(recursive = FALSE, pattern = "\\.(tiff|csv)$")
  # folder name to prepend, e.g. "MACA_Miroc"
  addname <- basename(z)
  # "name_date.tiff" becomes "MACA_Miroc_name_date.tiff"
  file.rename(fnames, paste0(addname, "_", fnames))
}
lapply(subfolders, renameFunc)
setwd(mydir)
I think if you're not in the directory where you want to change file names, you must specify the full path. Changing the working directory was a quick way, but you could use full paths instead (using regular expressions to get the correct from and to values for file.rename()). I got some errors at one point when I was not in the directory where I wanted to change the names.
I feel this allows more control over which folders you want to change the names in since incorrect operation can be very messy. You may also want to skip some file extensions or subfolders etc.
Your path folder
folder<-"C:/path/example/"
Extract files list
files<-list.files(folder)
Extract folder name
folder_name<-unlist(strsplit(folder,"/"))[length(unlist(strsplit(folder,"/")))]
Rename all files
file.rename(from = paste0(folder,files),to = paste0(folder,folder_name,"_",files))
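To repeat this for every subfolder without changing the working directory, the same idea can be applied to full paths. A sketch, using a temporary Tmin folder with made-up contents; add_folder_prefix() is a hypothetical helper name:

```r
# Prefix every file directly inside `folder` with the folder's own name
add_folder_prefix <- function(folder) {
  files <- list.files(folder, full.names = TRUE)
  files <- files[!dir.exists(files)]   # leave nested directories alone
  new_names <- paste0(basename(folder), "_", basename(files))
  file.rename(files, file.path(folder, new_names))
}

# Demo: Tmin/MACA_Miroc/name_date.tiff (layout taken from the question's example)
tmin <- file.path(tempfile("demo_"), "Tmin")
dir.create(file.path(tmin, "MACA_Miroc"), recursive = TRUE)
file.create(file.path(tmin, "MACA_Miroc", "name_date.tiff"))

# Apply the helper to each immediate subfolder of Tmin
subfolders <- list.dirs(tmin, full.names = TRUE, recursive = FALSE)
invisible(lapply(subfolders, add_folder_prefix))
```

Running it a second time would add the prefix again, so it is worth trying it on a copy of the data first.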
