How to read files that contain a certain number/word in R?

I have a huge number of files and the filenames look like this:
AL0201A0000500100day.1-1-2011
AL0201A0000700100dymax.1-1-2011
AL0201A0000900100day.1-1-2011
AT0ILL10000400100day.1-1-1997
AT0HBG10000700500hour8.1-1-1995
I only want to read the files whose names contain, for example, 00007. The files are in different folders and subfolders.
How do I do that? I can't get wildcard (*) matching to work in R. I know how to do this in Python, but I'm getting nowhere in R. :/
Any suggestions?

You can find files whose name contains "00007", in the current working directory and all sub-directories with:
files <- list.files(pattern = '00007', recursive = T)
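If you then want to read those files in, ask list.files for full paths so they work from the current working directory; a minimal sketch (assuming the files are plain delimited text that read.table can handle):
files <- list.files(pattern = "00007", recursive = TRUE, full.names = TRUE)
# read each matching file; adjust the read.table arguments to your file format
data_list <- lapply(files, read.table)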

Related

Is there a way to reference files in a folder within the working directory in R?

I have already finished with my RMarkdown and I'm trying to clean up the workspace a little. This isn't strictly necessary, more of an organizational habit (I'm not even sure it's good practice), so that I can keep the data separate from the scripts and other R- and git-related files.
I have a bunch of .csv files for data that I used. Previously they were on (for example)
C:/Users/Documents/Project
which is what I set as my working directory. But now I want them in
C:/Users/Documents/Project/Data
The problem is that this breaks the following code, because the files are no longer in the working directory.
#create one big dataframe by unioning all the data
bigfile <- vroom(list.files(pattern = "*.csv"))
I've tried pointing list.files() at the full path where the csvs are, but no luck.
bigfile <- vroom(list.files(path = "C:/Users/Documents/Project/Data", pattern = "*.csv"))
Error: 'data1.csv' does not exist in current working directory ('C:/Users/Documents/Project').
Is there a way to only access the /Data folder once for creating my dataframe with vroom() instead of changing the working directory multiple times?
You can list files including those in all subdirectories (Data in particular) using list.files(pattern = "*.csv", recursive = TRUE)
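For example (a minimal sketch, assuming the Data folder sits directly below the working directory and the vroom package is installed):
library(vroom)
# recursive = TRUE returns relative paths such as "Data/data1.csv",
# which vroom can read from the current working directory
bigfile <- vroom(list.files(pattern = "\\.csv$", recursive = TRUE))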
Best practices
Have one directory of raw and only raw data (the stuff you measured)
Have another directory of external data (e.g. reference databases). This is something you can remove afterwards and re-download if required.
Have another directory for the source code
Put only the source code directory under version control, plus one other file containing checksums of the raw and external data to prove integrity
Everything else must be reproducible from the raw data and the source code, and can be removed after the project. You may want to keep small result files (e.g. tables) that take a long time to reproduce.
You can list the files and capture the full file paths, right?
bigfile <- vroom(list.files(path = "C:/Users/Documents/Project/Data", pattern = "*.csv", full.names = T))
and that should read the files in that directory without reference to your wd
Try one of these:
# list all csv files within Data within current directory
Sys.glob("Data/*.csv")
# list all csv files within immediate subdirectories of current directory
Sys.glob("*/*.csv")
If you only have csv files then these would also work, but they seem less desirable. They might be useful, though, if you quickly want to review what files and directories are there. (Be very careful not to use the second one within statements that delete files: if you are not in the directory you think you are in, you can wind up deleting files you did not intend to delete. The first one might too, but it is a bit safer, since it would only lead to deleting the wrong files if the directory you are in happens to have a Data subdirectory.)
# list all files & directories within Data within current directory
Sys.glob("Data/*")
# list all files & directories within immediate subdirectories of current directory
Sys.glob("*/*")
If the subfolder always has the same name (or the same number of characters), you should be able to do it with substring. In your example, "Data" has 4 characters (5 with the /), so the following code should work:
Repository <- substring(getwd(), 1, nchar(getwd())-5)

Is there a way to load csv files saved in different folders with only a partial file name in R

I am trying to load multiple csv files that are each saved in different folders within my working directory in R. However, I only know part of each file name.
For example the file in "folder1" will be named "xxx_xxx_folder1.csv", and the file in "folder2" is "xxx_xxx_folder2.csv" etc. There is only one csv in each folder.
I was wondering is there a way to load files saved in different folders with only a partial file name?
The only way I have got it to partially work so far is to have all the files in one folder.
Thanks and sorry if any of this is unclear!
From your description, you could use list.files with the option recursive = TRUE to get a list of your csv files, then loop over that list to read them:
fn <- list.files(PATH_TO_WORKING_DIRECTORY, "\\.csv$", recursive = TRUE, full.names = TRUE)
lapply(fn, read.csv)
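If you also want to restrict the match to the part of the name you do know, pattern takes a regular expression; a sketch, assuming the known prefix is "xxx_xxx_" as in the example above:
fn <- list.files(PATH_TO_WORKING_DIRECTORY, pattern = "^xxx_xxx_.*\\.csv$",
                 recursive = TRUE, full.names = TRUE)
dat <- lapply(fn, read.csv)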

How can I copy a file of a certain extension or containing a certain phrase/pattern in R?

I'm developing a for loop to copy a specific file from one directory, paste it into another, and then rename it. The code itself is fairly solid, as it was previously used to do the same thing for a different file.
My issue is that the file I want has had different naming conventions in the past, making it more difficult than just copying and renaming every file with one name.
Notably, all files possess the same "pattern" of '_descriptor.xml' at the end -- the bit that has changed is in the beginning, which is generally 9 characters (but not always). Is there a clean way to tell R to copy files that have this pattern?
Thanks!
You can use the pattern argument in list.files to select files that have a specific pattern in their names.
files_to_copy <- list.files(pattern = '_descriptor\\.xml$', full.names = TRUE)
We can also do
files_to_copy <- list.files(pattern = '_descriptor[.]xml$', full.names = TRUE)
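The copy step itself can then be done with file.copy(); a minimal sketch, assuming a hypothetical destination folder called "dest" and that the original file names are kept (renaming would just mean building different destination paths):
dir.create("dest", showWarnings = FALSE)
# copy each matched file into dest under its original name
file.copy(files_to_copy, file.path("dest", basename(files_to_copy)))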

Reading data from zip files located in zip files with R

I'd like to use R to extract data from zip files located in zip files (i.e. perform some ZIP file inception).
An example "directory" of one of my datapoints looks like this:
C:\ZipMother.zip\ZipChild.zip\index.txt
My goal is to read in the "index.txt" from each ZipChild.zip. The issue is that I have 324 ZipMother.zip files with an average of 2000 ZipChild.zip files, therefore unzipping the ZipMother.zip files is a concern due to memory constraints (the ZipMother files are about 600 megabytes on average).
With the unzip function, I can successfully get the file paths of each ZipChild located in the ZipMother, but I cannot use it to list the files located in the ZipChild archives.
Therefore,
unzip("./ZipMother.zip",list=TRUE)
works just fine, but...
unzip("./ZipMother.zip/ZipChild.zip",list=TRUE)
gives me the following error
Error in unzip("./ZipMother.zip/ZipChild.zip", list = TRUE) :
zip file './ZipMother.zip/ZipChild.zip' cannot be opened
Is there any way to use unzip or another method to extract the data from the ZipChild files?
Once I get this to work, I plan on using the ldply function to compile the index.txt files into a dataset.
Any input is very much appreciated. Thank you!
A reproducible example (i.e. a link to a zip file with the appropriate structure) would be useful, but how about:
tmpd <- tempdir()
## extract just the child
unzip("./ZipMother.zip",
files="zipChild.zip",exdir=tmpd)
ff <- file.path(tmpd,"zipChild.zip")
index <- unzip(ff,list=TRUE)
unlink(ff)
This could obviously be packaged into a function for convenience.
It could be slow, but it means you never have to unpack more than one child at a time ...
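A sketch of such a wrapper (read_child_index is an illustrative name, and reading index.txt as a whitespace-delimited table with a header row is an assumption about the file layout):
read_child_index <- function(mother, child) {
  tmpd <- tempdir()
  # pull just one child zip out of the mother zip
  unzip(mother, files = child, exdir = tmpd)
  childpath <- file.path(tmpd, child)
  on.exit(unlink(childpath))
  # unz() opens a connection directly to a file inside the extracted child zip
  read.table(unz(childpath, "index.txt"), header = TRUE)  # header = TRUE is assumed
}
index <- read_child_index("./ZipMother.zip", "ZipChild.zip")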

How to get R to read in files from multiple subdirectories under one large directory?

I am trying to get started with writing my first R code. I have searched for this answer but I am not quite sure what I've found is what I'm looking for exactly. I know how to get R to read in multiple files in the same subdirectory, but I'm not quite sure how to get it to read in one specific file from multiple subdirectories.
For instance, I have a main directory containing a series of trajectory replicates; each replicate is in its own subdirectory. The breakdown is as follows:
"Main Dir" -> "SubDir1" -> "ReplicateDirs 1-6"
From each "ReplicateDir" I want R to pull the "RMSD.dat" table (file) to read from. All of the RMSD.dat files have identical names, they are just in different directories and contain different data of course.
I could move all the files to one folder but this doesn't seem like the most efficient way to attack this problem.
If anyone could enlighten me, I'd appreciate it.
Thanks
This should work; of course, change Main Dir to your directory:
dat.files <- list.files(path = "Main Dir",
                        recursive = TRUE,
                        pattern = "RMSD.dat",
                        full.names = TRUE)
If you want to read the files in, you could use the function below:
readDatFile <- function(f) {
  read.csv(f)  # you may have to change read.csv to match your data type
}
And apply to the list of files:
data.files <- lapply(dat.files, readDatFile)
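If you then want a single combined data frame (assuming the RMSD.dat files all share the same columns), you can bind the list together:
combined <- do.call(rbind, data.files)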
