Extract segment of filename - r

I'm trying to extract a filename and save the dataframe with that same name.
The problem I have is that if the filename for some reason is inside a folder with a similar word, stringr will return that word as well.
filename <- "~folder/testdata/2016/testdata 2016.csv"
If I run this:
library(stringr)
str <- str_trim(stringr::str_extract(filename,"[t](.*)"), "left")
it returns testdata/2016/testdata 2016.csv when all I want is testdata 2016. Ideally, it would be even better to get testdata2016.
I've been trying several combinations but there has to be a simpler way of doing this. If there was a way of reading the path from right to left, starting at .csv stop at /, I wouldn't have this issue.

You can use either of the approaches below:
library(stringr)
str_replace(str_extract(filename,"\\w*\\s+\\w*(?=\\.)"),"\\s+","")
str_replace_all(basename(filename),"\\s+|\\.csv","")
You can use basename approach as suggested by Benjamin.
?basename:
basename removes all of the path up to and including the last path
separator (if any).
Output:
[1] "testdata2016"

Plenty of help in base R (tools pkg comes with the default R install):
gsub(" ", "",
tools::file_path_sans_ext(
basename("~folder/testdata/2016/testdata 2016.csv")))
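The base R approach above can be wrapped in a small helper, which makes it easier to reuse the cleaned name when storing the data frame. This is only a sketch: `path_to_name` and the `frames` list are hypothetical names, not part of the answers, and the data frame is a placeholder.

```r
# Hypothetical helper (not from the answers): derive an object name from a path.
path_to_name <- function(path) {
  stem <- tools::file_path_sans_ext(basename(path))  # drop directories, then the extension
  gsub("\\s+", "", stem)                             # remove any whitespace
}

filename <- "~folder/testdata/2016/testdata 2016.csv"
df_name <- path_to_name(filename)

# Store a data frame under that name in a list rather than in the global environment.
frames <- list()
frames[[df_name]] <- data.frame(x = 1:3)  # placeholder data
names(frames)
```

Keeping the data frames in a named list is a design choice for the sketch; `assign(df_name, ...)` would also work if a top-level variable is really needed.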

Related

Switching the order of paste() in piping in R

I am fairly new to R and I would like to paste the string "exampletext" in front of each file name within the path.
csvList <- list.files(path = "./csv_by_subject") %>%
paste0("*exampletext")
Currently this code renders things like "csv*exampletext" and I want it to be "*exampletextcsv". I would like to continue using dplyr and piping - help appreciated!
As others pointed out, the pipe is not necessary here. But if you do want to use it, you just have to specify that the second argument to paste0 is "the thing you are piping", which you do using a period (.):
list.files(path = "./csv_by_subject") %>%
paste0("*exampletext", .)
paste0('exampletext', csvList) should do the trick. It's not necessarily using dplyr and piping, but it's taking advantage of the vectorization features that R provides.
If you'd like to paste *exampletext before all of the file names, you can reverse the order of what you're doing now using paste0 and passing the second argument as list.files. paste0 can handle vectors as the second argument and will apply the paste to each element.
csvList <- paste0("*exampletext", list.files(path = "./csv_by_subject"))
This returns a few examples from a local folder on my machine:
csvList
[1] "*exampletext_error_metric.m"
[2] "*exampletext_get_K_clusters.m"
...
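For comparison, here is a sketch of the same argument-order issue using the native pipe, assuming R 4.1 or later. The native pipe does not understand the magrittr `.` placeholder, so an anonymous function is used instead. The file names are made up and stand in for the result of list.files().

```r
# Stand-in for list.files(); these file names are made up.
files <- c("a.csv", "b.csv")

# The piped value becomes the first argument, so the fixed text ends up at the end.
suffixed <- files |> paste0("*exampletext")

# An anonymous function lets the piped value go wherever it is needed.
prefixed <- files |> (\(f) paste0("*exampletext", f))()

prefixed
```

With magrittr loaded, `files %>% paste0("*exampletext", .)` gives the same result as `prefixed`.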

How to modify i in an R loop?

I have several large R objects saved as .RData files: "this.RData", "that.RData", "andTheOther.RData" and so on. I don't have enough memory, so I want to load each in a loop, extract some rows, and unload it. However, once I load(i), I need to strip the ".RData" part of (i) before I can do anything with objects "this", "that", "andTheOther". I want to do the opposite of what is described in How to iterate over file names in a R script? How can I do that? Thx
Edit: I omitted to mention the files are not in the working directory and have a filepath as well. I came across Getting filename without extension in R and file_path_sans_ext takes out the extension but the rest of the path is still there.
Do you mean something like this?
i <- c("/path/to/this.RDat", "/another/path/to/that.RDat")
f <- gsub(".*/([^/]+)", "\\1", i)
f1 <- gsub("\\.RDat", "", f)
f1
[1] "this" "that"
For Windows paths, you have to use "\\" instead of "/".
Edit: Explanation. Technically, these are called "regular expressions" (regexps), not "patterns".
.        any character
.*       any number (including 0) of any characters
.*/      any number of any characters, followed by a /
[^/]     any character except /
[^/]+    one or more characters, none of which is /
( and )  enclose groups. You can refer to the groups in the replacement as \\1, \\2 etc.
So, look for any run of characters followed by /, followed by one or more characters that are not the path separator. Replace the whole match with the captured "anything but the separator" part.
There are many good tutorials on regexps; just search for one.
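As a quick check of the explanation, the sketch below applies the same capture-group idea to two made-up paths and compares the result with basename(). It uses sub() rather than gsub() and adds $ anchors; both are small variations on the answer, not the answer's exact code.

```r
paths <- c("/path/to/this.RData", "that.RData")  # the second has no directory part

# Keep only what follows the last "/"; a path without "/" is left unchanged.
file_part <- sub(".*/([^/]+)$", "\\1", paths)

# Anchor the extension with $ so only a trailing ".RData" is removed.
object_names <- sub("\\.RData$", "", file_part)
object_names
```

Because `.*` is greedy, it consumes everything up to the last "/", which is why only the final path component survives.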
A simple way to do this would be to extract the base name from the filepaths with base::basename() and then remove the file extension with tools::file_path_sans_ext().
paths_to_files <- c("./path/to/this.RData", "./another/path/to/that.RData")
tools::file_path_sans_ext(
basename(
paths_to_files
)
)
## Returns:
## [1] "this" "that"
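Putting this together with the loop from the question might look like the sketch below. It assumes each .RData file holds a single object named after the file, as the question describes. The temporary directory, the two small data frames, and the `first_rows` list are made up for illustration.

```r
# Hypothetical setup: two small objects saved under a temporary directory.
tmp <- tempfile("rdata_demo")
dir.create(tmp)
this <- data.frame(id = 1:5, value = c(10, 20, 30, 40, 50))
that <- data.frame(id = 1:3, value = c(7, 8, 9))
save(this, file = file.path(tmp, "this.RData"))
save(that, file = file.path(tmp, "that.RData"))
rm(this, that)

paths <- list.files(tmp, pattern = "\\.RData$", full.names = TRUE)
first_rows <- list()
for (p in paths) {
  obj_name <- tools::file_path_sans_ext(basename(p))  # "this", "that"
  env <- new.env()
  load(p, envir = env)                                # load into a scratch environment
  first_rows[[obj_name]] <- head(get(obj_name, envir = env), 2)  # extract some rows
  rm(env)                                             # drop the reference to the large object
}
names(first_rows)
```

Loading into a scratch environment keeps the large objects out of the global workspace, so removing the environment is enough to make them eligible for garbage collection.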

R - how to find exact pattern when listing files

I have a number of files from which I would like to find only the ones that match an exact pattern.
When I run:
mods=c('GISS-E2-H','GISS-E2-R','GISS-E2-R-CC')
files <- list.files(idir, pattern=mods[1])
I got the results:
> files
[1] "clt_Amon_GISS-E2-H-CC_historical_r1i1p1_185001-190012.nc"
[2] "clt_Amon_GISS-E2-H-CC_historical_r1i1p1_190101-195012.nc"
[3] "clt_Amon_GISS-E2-H-CC_historical_r1i1p1_195101-201012.nc"
[4] "clt_Amon_GISS-E2-H_historical_r1i1p1_185001-190012.nc"
[5] "clt_Amon_GISS-E2-H_historical_r1i1p1_190101-195012.nc"
[6] "clt_Amon_GISS-E2-H_historical_r1i1p1_195101-200512.nc"
which is wrong, because I just wanted the last three names (which match the EXACT pattern I wish).
Even if I use regex to create the pattern, I get an empty vector as the result:
files <- list.files(idir, pattern=paste("^",mods[1],"$", sep=''), full.names=T)
> files
character(0)
What am I missing here?
Thanks!
Your solution works; the first three files also contain the pattern GISS-E2-H.
To get only the last three, you can do as suggested by @G.Grothendieck and add the _ to mods:
mods=c('GISS-E2-H_','GISS-E2-R','GISS-E2-R-CC')
Now to test your solution I'll create the files:
allfiles <- c("clt_Amon_GISS-E2-H-CC_historical_r1i1p1_185001-190012.nc",
"clt_Amon_GISS-E2-H-CC_historical_r1i1p1_190101-195012.nc",
"clt_Amon_GISS-E2-H-CC_historical_r1i1p1_195101-201012.nc",
"clt_Amon_GISS-E2-H_historical_r1i1p1_185001-190012.nc",
"clt_Amon_GISS-E2-H_historical_r1i1p1_190101-195012.nc",
"clt_Amon_GISS-E2-H_historical_r1i1p1_195101-200512.nc")
for (file in allfiles) {
write("empty file", file)
}
Now it works:
> list.files(getwd(), pattern=mods[1])
[1] "clt_Amon_GISS-E2-H_historical_r1i1p1_185001-190012.nc" "clt_Amon_GISS-E2-H_historical_r1i1p1_190101-195012.nc"
[3] "clt_Amon_GISS-E2-H_historical_r1i1p1_195101-200512.nc"
Edit:
An alternative is as originally proposed, and instead of replacing mods you can append the _ inside list.files:
mods=c('GISS-E2-H','GISS-E2-R','GISS-E2-R-CC') #Original
list.files(getwd(), pattern=paste0(mods[1], "_"))
I would use this with caution, though. If you turn this into some kind of loop to also read the other file patterns in mods, the _ will be appended to all patterns, making them possibly incorrect.
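One way around that concern, assuming every model name in the file names is delimited by an underscore on both sides (as in the listing above), is to wrap each name in underscores and match it literally. This is a sketch on a plain character vector rather than a real directory. The third file name is made up following the same naming scheme, and `files_by_model` is a hypothetical name.

```r
allfiles <- c("clt_Amon_GISS-E2-H-CC_historical_r1i1p1_185001-190012.nc",
              "clt_Amon_GISS-E2-H_historical_r1i1p1_185001-190012.nc",
              "clt_Amon_GISS-E2-R_historical_r1i1p1_185001-190012.nc")
mods <- c("GISS-E2-H", "GISS-E2-H-CC", "GISS-E2-R")

# "_<model>_" can only match the complete model field, never a longer name.
files_by_model <- lapply(mods, function(m) {
  allfiles[grepl(paste0("_", m, "_"), allfiles, fixed = TRUE)]
})
names(files_by_model) <- mods
files_by_model[["GISS-E2-H"]]
```

The same string could be passed as `pattern` to list.files(), since none of its characters are special in a regular expression outside brackets.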
Try this:
files <- list.files(idir, pattern = ".*GISS-E2-H_.*")
Your original vector of patterns was this:
mods=c('GISS-E2-H','GISS-E2-R','GISS-E2-R-CC')
Wrapped in ^ and $, each of these only matches a file named exactly GISS-E2-H etc. Since no such files exist in your idir, you were getting back character(0).

R programming, naming output file using variable

I would like to direct output to a file, using a write.csv statement. I want to write 16 different output files, labeling each one with a suffix from 1 through 16.
Example as written now:
trackfilenums=1:16
for (i in trackfilenums){
# calculations etc.
write.csv(max.hsi, 'Severity_Index.csv', row.names=F)
}
I would like for the output csv files to be labeled 'Severity_Index_1.csv', 'Severity_Index_2.csv', etc. Not sure how to do this in R language.
Thanks!
Kimberly
You will want to use paste0:
write.csv(max.hsi, paste0("Severity_Index_", i,".csv"), row.names=F)
Some people like to have file names like Name_01 Name_02 etc instead of Name_1 Name_2 etc. This may, for example, make the alphabetical order more reasonable: with some software, otherwise, 10 would come after 1, 20 after 2, etc.
This kind of numbering can be achieved with sprintf:
sprintf("Severity_Index_%02d.csv", 7)
The interesting part is %02d: this says that the value is an integer (you could use %02i as well) that will take up at least 2 positions, with a leading zero added if necessary.
# try also
sprintf("Severity_Index_%03d.csv", 7)
sprintf("Severity_Index_%2d.csv", 7)
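Since sprintf is vectorized, all 16 names can be built at once and then used inside the loop from the question. The sketch below writes to a temporary directory, and the `max.hsi` data frame is a made-up placeholder for the real calculation.

```r
out_dir <- tempfile("severity_demo")
dir.create(out_dir)

trackfilenums <- 1:16
file_names <- sprintf("Severity_Index_%02d.csv", trackfilenums)

for (i in trackfilenums) {
  max.hsi <- data.frame(track = i, hsi = i * 0.5)  # placeholder for the real calculations
  write.csv(max.hsi, file.path(out_dir, file_names[i]), row.names = FALSE)
}

written <- list.files(out_dir)  # alphabetical order matches numeric order thanks to the padding
head(written, 3)
```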
To add to the other answers here, I find it's also a good idea to sanitise the pasted string to make sure it is ok for the file system. For that purpose I have the following function:
fsSafe <- function(string) {
safeString <- gsub("[^[:alnum:]]", "_", string)
safeString <- gsub("_+", "_", safeString)
safeString
}
This simply replaces every run of non-alphanumeric characters with a single underscore.
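A short usage sketch for fsSafe follows. Note that the dot of a file extension is also non-alphanumeric, so it seems safer to sanitise only the stem and append the extension afterwards. The label below is made up.

```r
fsSafe <- function(string) {
  safeString <- gsub("[^[:alnum:]]", "_", string)  # every non-alphanumeric character becomes "_"
  safeString <- gsub("_+", "_", safeString)        # collapse runs of "_" into one
  safeString
}

label <- "Severity Index: run #3"            # hypothetical label
stem <- fsSafe(label)
file_name <- paste0(stem, ".csv")            # add the extension after sanitising

whole_name <- fsSafe("Severity Index.csv")   # sanitising the full name also replaces the dot
file_name
```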

Using R to list all files with a specified extension

I'm very new to R and am working on updating an R script to iterate through a series of .dbf tables created using ArcGIS and produce a series of graphs.
I have a directory, C:\Scratch, that will contain all of my .dbf files. However, when ArcGIS creates these tables, it also includes a .dbf.xml file. I want to remove these .dbf.xml files from my file list and thus my iteration. I've tried searching and experimenting with regular expressions to no avail. This is the basic expression I'm using (Excluding all of the various experimentation):
files <- list.files(pattern = "dbf")
Can anyone give me some direction?
files <- list.files(pattern = "\\.dbf$")
$ at the end means end of string. "dbf$" will work too, but adding \\. (. is a special character in regular expressions, so you need to escape it) ensures that you match only files with the extension .dbf (in case you have e.g. .adbf files).
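The difference between the three patterns can be seen on a few empty files. This is a sketch in a temporary directory with made-up file names.

```r
demo_dir <- tempfile("dbf_demo")
dir.create(demo_dir)
file.create(file.path(demo_dir, c("roads.dbf", "roads.dbf.xml", "notes.adbf")))

any_dbf  <- list.files(demo_dir, pattern = "dbf")       # matches all three names
ends_dbf <- list.files(demo_dir, pattern = "dbf$")      # still keeps notes.adbf
dbf_only <- list.files(demo_dir, pattern = "\\.dbf$")   # only roads.dbf
dbf_only
```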
Try this which uses globs rather than regular expressions so it will only pick out the file names that end in .dbf
filenames <- Sys.glob("*.dbf")
Peg the pattern to find "\\.dbf" at the end of the string using the $ character:
list.files(pattern = "\\.dbf$")
Gives you the list of files with full path:
Sys.glob(file.path(file_dir, "*.dbf")) ## file_dir = directory containing the files
I am not very good in using sophisticated regular expressions, so I'd do such task in the following way:
files <- list.files()
dbf.files <- files[!grepl(".xml", files, fixed = TRUE)]
The first line just lists all files in the working directory. The second drops everything containing ".xml" (grepl returns TRUE for such strings in the 'files' vector; subsetting with the negated logical vector removes those entries, and, unlike negative indices from grep, it still works when nothing matches).
The "fixed" argument is just my whim, as I usually want crude literal matching without fancy Perl-style regexps, which may surprise me.
I'm aware that such a solution simply reflects drawbacks in my education, but for a novice it may be useful =) at least it's easy.
