Compress files from a directory in gzip format (*.gz) - r

I have a directory that contains files with different file extensions, and I have to compress them one by one because the 7z program does not let me do it in bulk. For example:
file1.xyz
file2.rrr
file3.qwe
file250.pep
Expected output:
file1.xyz.gz
file2.rrr.gz
file3.qwe.gz
file250.pep.gz
Any idea how to do this in R? Thank you.

Yes, you can do this in R. Assuming your files are in a subdirectory called files:
# list every entry in ./files/ with its full path
files <- dir("./files/", full.names = TRUE)
# compress each file to <name>.gz, keeping the original
lapply(files, R.utils::gzip, remove = FALSE)
Note that remove = FALSE is important if you do not want the original files deleted after compression. The documentation lists other options, e.g. whether to overwrite existing files of the same name.
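As a small extension of the same idea (still assuming the R.utils package), you can skip sub-directories, which gzip() cannot compress, and allow overwriting .gz files left over from a previous run:
files <- dir("./files/", full.names = TRUE)
files <- files[!dir.exists(files)]  # gzip() errors on directories
invisible(lapply(files, R.utils::gzip, remove = FALSE, overwrite = TRUE))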

Related

How to select specific files according to a spreadsheet criteria and then copy from directory to another directory in R?

I have a task that requires me to use a specific column in a CSV spreadsheet that stores the file names, for example:
File Name
CA-001
WV-001
ma-001
My task is to move some files from folder 'source' to folder 'target'.
And I'm using this CSV spreadsheet as a crosswalk to select any files with names that match what's in the column 'File Name'. Then I'm asking R to copy them from the source folder, which contains not only these files but also other files that are not in this list (e.g. CO-001, SC-001, ...). If it's helpful, all of the files are PDFs, so we don't need to worry about file type. I want only the files whose names match what's in the CSV spreadsheet. How can I do this?
I have some sample code below, but it still didn't execute successfully.
source <- "C:/Users/53038/MovePDF/Test_From"
target <- "C:/Users/53038/MovePDF/Test_To"
all.files <- list.files(path = source)
csvfile <- read.csv('C:/Users/53038/MovePDF/Master.csv')
toCopy <- all.files[all.files %in% csvfile$Move]
file.copy(toCopy, target)
Thank you!
With the provided code, the filenames you want to match will be in csvfile$File.Name (read.csv converts the space in the 'File Name' header to a dot).
I'm assuming the source directory is potentially very large. Instead of performing slow regular-expression matching on substrings (when we know the exact filenames), and/or getting a complete file listing (which is also slow), I only check whether the exact wanted filenames exist before copying them:
source <- "C:/Users/53038/MovePDF/Test_From"
target <- "C:/Users/53038/MovePDF/Test_To"
csvfile <- read.csv('C:/Users/53038/MovePDF/Master.csv')
# add .pdf suffix
toCopy <- paste0(csvfile$File.Name,'.pdf')
# add source directory path
toCopy <- file.path(source, toCopy)
# optional: extract only the existing files from toCopy. You can skip this step if you're sure they exist and/or you don't mind receiving errors
toCopy <- toCopy[file.exists(toCopy)]
# make it so
file.copy(toCopy, target, overwrite = TRUE)
I would preferably keep the .pdf extension in the filename at all times, so also in the source CSV. There would be an issue on case-sensitive filesystems (almost all Linux installations, rarely macOS or Windows) if the extension is .PDF, .Pdf, etc.
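If the extension case in the CSV cannot be controlled, a minimal sketch (reusing source, target, and csvfile from above) is to build both common spellings per name and copy whichever actually exists:
# try both .pdf and .PDF for each name; on case-insensitive filesystems
# both candidates may point at the same file, which is harmless here
candidates <- file.path(source, c(paste0(csvfile$File.Name, ".pdf"),
                                  paste0(csvfile$File.Name, ".PDF")))
toCopy <- candidates[file.exists(candidates)]
file.copy(toCopy, target, overwrite = TRUE)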

Read in data from same subfolder in different subfolders to R

I have multiple folders in my data directory, such that the files all share the common path "~/Desktop/Data/". Each File folder in the Data folder is different, such that:
/Desktop
/Data
/File1/Data1/
/File2/Data1/
/File3/Data1/
The File folders are different, but they all contain a data folder with the same name. I have .dta files in each of these data subfolders that I would like to read into R.
EDIT: I should also note the contents in the File folders to be:
../Filex
/Data1 -- What I want to read from
/Data2
/Data3
/Code
with /Filex/Data1 being the main folder of interest. All File folders are structured this way.
I have consulted multiple Stack Overflow threads and so far have only figured out how to list them all if the File folders had been named the same. However, I am unsure how to read the data into R when these File folders are named slightly differently.
I have tried this so far, but I get an empty set in return:
files <- dir("~/Desktop/Data/*/Data/", recursive=TRUE, full.names=TRUE, pattern="\\.dta$")
For actual data, downloading files from ICPSR might help in replicating the issue.
EDIT: I am working on macOS 10.15.5.
Thank you so much for your assistance!
Try
files <- dir("~/Desktop/Data", pattern = "\\.dta$", full.names = TRUE, recursive = TRUE)
# to make sure /Data is there, as suggested by @Martin Gal:
files[grepl("Data/", files)]
This Regex tester and this Regex cheatsheet were very useful in arriving at the solution.
Tested under Windows:
files <- dir("c:/temp", pattern = "\\.dta$", full.names = TRUE, recursive = TRUE)
files[grepl("Data/", files)]
[1] "c:/temp/File1/Data/test2.dta" "c:/temp/File2/Data/test.dta"

copy csv file from multiple directories to a new one in R

I am trying to extract many .csv files from multiple directories/subdirectories and copy them into a new folder, where I would like to end up with only .csv files.
The csv files are stored in subdirectories with the following structure:
D:\R data\main_folder\03\07\04\BBB_0120180307031414614.csv
D:\R data\main_folder\03\07\05\BBB_0120180307031414615.csv
I am trying the list.files function to extract the csv file names only.
my_dirs <- list.files("D:\\R data\\main_folder\\", pattern = "\\.csv$", recursive = TRUE,
                      include.dirs = FALSE, full.names = FALSE)
The problem is that csv files are listed with the directory path, e.g.
03/07/03/BBB_0120180307031414614.csv
And this even though full.names and include.dirs are set to FALSE.
This prevents me from copying those files to a new folder, as the name is not recognized.
What am I doing wrong?
Thanks
Use the basename function coupled with list.files, as below.
If I understood you correctly, you want to fetch the names of the .csv files present in different directories.
I made a temp folder in the Documents directory of my Windows machine. Inside that I have two folders, "one" and "two"; inside these folders I have CSV files named "just_one.csv" and "just_two.csv".
So if I want to fetch the names "just_one.csv" and "just_two.csv", I could do this:
basename(list.files("C:/Users/C_Nfdl_99878314/Documents/temp", "\\.csv$", recursive = TRUE))
Which results to:
[1] "just_one.csv" "just_two.csv"

Looping through folder and finding specific file in R

I am trying to loop through many folders in a directory, looking for a particular xml file buried in one of the folders. I would then like to save the location of that file and run my code against it (I will not include that code here). What I am asking is how to loop through all the folders and then open the specific file.
For example:
My main folder would be: C:\Parsing
It has two folders named "folder1" and "folder2".
Each folder has an xml file that I am interested in; let's say it's called "needed.xml".
I would like a script that loops through the directory and finds those particular files.
Do you know how I could do that in R?
Using list.files and grepl you could look recursively through all sub-folders:
rootPath <- "C:/Parsing"
listFiles <- list.files(rootPath, recursive = TRUE)
searchFileName <- "needed.xml"
# paths whose file name matches the one we are looking for
presentFile <- listFiles[grepl(searchFileName, listFiles, fixed = TRUE)]
if (length(presentFile)) cat("File", searchFileName, "is present at", presentFile, "\n")
Is this what you're looking for?
require(XML)
fol <- list.files("C:/Parsing")
for (i in fol) {
  # build the expected location of the xml file inside each sub-folder
  path <- file.path("C:/Parsing", i, "needed.xml")
  if (file.exists(path)) {
    needed <- xmlToList(path)
  }
}
This will locate your xml file and read it into R as a list. It wasn't clear from your question whether you wanted the output to be the data itself or just the directory location of your data, which could then be supplied to another function/script. If you just want the location, remove the xmlToList call.
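If only the location is wanted, a minimal sketch of that variant (same C:/Parsing layout as above) is:
fol <- list.files("C:/Parsing")
paths <- file.path("C:/Parsing", fol, "needed.xml")
found <- paths[file.exists(paths)]  # locations where the file actually exists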
I would do something like this (replace \\.xml$ with needed\\.xml$ if you only want that file):
list.files(path = "C:/Parsing", pattern = "\\.xml$", recursive = TRUE, full.names = TRUE)
This will recursively look for files with the .xml extension under C:/Parsing and return the full paths of the matched files.
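Combining this with the reading step from the answer above, a minimal sketch (assuming the XML package) is:
library(XML)
hits <- list.files(path = "C:/Parsing", pattern = "needed\\.xml$",
                   recursive = TRUE, full.names = TRUE)
needed <- lapply(hits, xmlToList)  # parse every match into a list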

How to get into the directory of a file in R?

I have a directory "space" containing 300 CSV files, and its path is "C://rstuff//space".
And I have a function:
myfunction <- function(my_dir, x, y){
}
I want to open some of the CSV files, so I want to get the location of these files, and I use the argument 'my_dir' to indicate their location.
I want to use setwd(paste0("C://rstuff//", my_dir)) (thanks to Batanichek's comment), but I don't think this is a good way to set the path. If I don't know the path exactly, what should I do? Are there any good methods?
You can use list.files:
setwd("C://rstuff//space")
my_files <- list.files(pattern = "\\.csv$",
                       full.names = TRUE, recursive = TRUE, ignore.case = TRUE)
This finds all CSV files in your working directory and gives you their paths starting from the working directory:
[1] "./csvs2/data_1-10.csv"
[2] "./csvs2/old/data_1001-1010.csv"
[3] "./overview/results.csv"
Then you can specify the ones you want to use. For example, I give the important CSV files a number after an "_", e.g. "data_23", so you can exclude all non-important files with:
my_files <- my_files[grepl("_", my_files)]  # keep only names containing "_"
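A minimal sketch of the same idea without changing the working directory, wired into the question's own skeleton (my_dir, x, and y are the question's placeholders; reading with read.csv is an assumption about what happens next):
myfunction <- function(my_dir, x, y) {
  # build the search path from the base directory and my_dir
  my_files <- list.files(file.path("C://rstuff", my_dir),
                         pattern = "\\.csv$", full.names = TRUE,
                         recursive = TRUE, ignore.case = TRUE)
  my_files <- my_files[grepl("_", my_files)]  # keep only the "_" files
  lapply(my_files, read.csv)                  # assumption: read each CSV
}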
