I'm reading multiple .txt files using list.files() and file.path(). I just want to parse the full path names and extract the portion after the last "/" and before the ".".
Here is the structure of the file path names:
"C:/Users/Alexandre/Desktop/COURS/FORMATIONS/THESE/PROJET/RESULTATS/Vessel features/Fusion/OK/SAT-DPL192C.txt"
The code I've tried
# l <- list.files(pattern = "SAT(.+)*.txt")
# f <- file.path(getwd(), c=(l))
f <- c("C:/Users/Alexandre/Desktop/COURS/FORMATIONS/THESE/PROJET/RESULTATS/Vessel features/Fusion/OK/SAT-DPL192C.txt", "C:/Users/Alexandre/Desktop/COURS/FORMATIONS/THESE/PROJET/RESULTATS/Vessel features/Fusion/OK/SAT-DPL193D.txt")
d <- lapply(f, read.delim)
names(d) <- gsub(".*/(.*)..*", "1", f)
The last line gives [1] "1" "1" instead of [1] "DPL192C" "DPL193D", etc.
I've also tried a pattern like ".*/(.+)*..*" for the portion to keep, with the same result.
A . is a special character, so you need to escape it. Also, when you want to refer to the captured group, you need to use \\1, not just 1. Try this:
gsub(".*/(.*)\\..*", "\\1", f)
# [1] "SAT-DPL192C" "SAT-DPL193D"
I have a directory with my .Rproj file and a "data" folder for all outputs. There are 40 subdirectories within the data folder, each containing an "output.csv". The subdirectories have completely different names, but all end with 1 or 2.
data/****1/output.csv
data/****2/output.csv
The asterisks represent the varying part of the name (a different number of letters), and each csv I need has the exact same name.
I need to separately list all of the "output.csv"s based on whether their subdirectory ends with 1 or 2, and I have been trying with the grep() function:
allOutputFiles <- list.files(pattern = "output.csv", recursive = TRUE, full.names = TRUE)
files1 <- grep(pattern = "./data/1$", allOutputFiles, value = TRUE)
files2 <- grep(pattern = "./data/2$", allOutputFiles, value = TRUE)
But every time I run it, it returns character(0). If I add a '\' in front of the 1$, it returns invalid regular expression './data/\1$', reason 'Invalid back reference'
How do I properly apply wildcard to the varying file path?
We can use dirname to get the parent directory, then gsub to extract the last character of it. Then we use split to separate the filenames by this one letter.
# allOutputFiles <- c("data/****1/output.csv","data/****2/output.csv")
allOutputFiles <- list.files(pattern = "output.csv", recursive = TRUE, full.names = TRUE)
gsub(".*(.)$", "\\1", dirname(allOutputFiles))
# [1] "1" "2"
out <- split(allOutputFiles, gsub(".*(.)$", "\\1", dirname(allOutputFiles)))
out
# $`1`
# [1] "data/****1/output.csv"
# $`2`
# [1] "data/****2/output.csv"
If you want to index on that, index with out[["1"]] (while out[[1]] would conveniently work here, that's coincidence based on your choice of last-letters and should not be relied on).
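If you only ever need the two groups, an alternative sketch is to filter the full paths directly on how the parent directory name ends (same allOutputFiles as above):
# keep the paths whose parent directory name ends in 1 (or 2)
files1 <- allOutputFiles[grepl("1$", dirname(allOutputFiles))]
files2 <- allOutputFiles[grepl("2$", dirname(allOutputFiles))]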
I have a list of files saved as a list after running
files <- list.files(pattern=".txt")
So when I run files I have something like the following:
AA1131.report.txt
BB1132.report.txt
CC0900.report.txt
.
.
.
I want to get just the first part of each filename, before the .report.txt, so in R I tried:
>files <- list.files(pattern=".txt")
>files <- strsplit(files, "\\.")
>files[[1]][1]
[1] "AA1131"
I was expecting:
[1] "AA1131"
[1] "BB1132"
[1] "CC0900"
Or some way to get them all and save them as a vector or list so I can use them as IDs in the first column of my tibble.
We need to loop over the list (from strsplit) and extract the first element
sapply(files, `[[`, 1)
files[[1]] extracts only the first list element.
Also, this can be done without strsplit:
trimws(files, whitespace = "\\..*")
or with sub
sub("\\..*", "", files)
I have a set of .csv files that correspond to specific stations for different years. I would like to build a pattern that finds all files sharing the same characters from the 10th character up to .csv.
So far, I have the following:
files = list.files(pattern = ".csv")
files
 [1] "data2011_AAST0100.csv" "data2011_ADST0500.csv" "data2011_AETR0100.csv"
 [4] "data2011_AIST0200.csv" "data2011_AKST0200.csv" "data2011_AMST0100.csv"
 [7] "data2012_AAST0100.csv" "data2012_AETR0100.csv" "data2012_AIST0200.csv"
[10] "data2012_AMST0100.csv" "data2012_ANST0100.csv" "data2012_APST0300.csv"
[13] "data2013_AAST0100.csv" "data2013_AETR0100.csv" "data2013_AIST0200.csv"
[16] "data2013_AMST0100.csv" "data2013_ANST0100.csv" "data2013_APST0300.csv"
However, I would like to have something like this, which basically seeks for all similar pattern names after the 10th character.
files = list.files(pattern = "AAST")
files
[1] "data2011_AAST0100.csv" "data2012_AAST0100.csv" "data2013_AAST0100.csv"
My goal is to apply the following loop for all stations.
outfile = ""
for (i in 1:length(files)){
tempData = read.csv(files[i], header = FALSE, sep="", na.strings=c(" "))
colnames(tempData) = unlist(headers)
df[is.na(tempData)] = NA
outfile <- rbind(outfile, tempData)
}
Instead of using the pattern argument in list.files(), you can get all the file names by calling list.files() and then extract the station codes with str_extract_all() from the stringr package (loaded here via the tidyverse).
library(tidyverse)
# test <- list.files()
test <- c("data2011_AAST0100.csv", "data2011_ADST0500.csv", "data2011_AETR0100.csv",
"data2011_AIST0200.csv", "data2011_AKST0200.csv", "data2011_AMST0100.csv")
unique(unlist(str_extract_all(test, pattern = '(?<=\\_).{4}')))
[1] "AAST" "ADST" "AETR" "AIST" "AKST" "AMST"
This assumes that the names you want to extract are always 4 letters.
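From there, one way to run your read-and-bind step per station is to grep each extracted code back against the file names. A sketch, assuming every file for a given station has the same columns:
stations <- unique(unlist(str_extract_all(test, pattern = '(?<=\\_).{4}')))
# for each station code, read all matching files and stack them row-wise
byStation <- lapply(stations, function(s) {
  matches <- grep(s, test, value = TRUE, fixed = TRUE)
  do.call(rbind, lapply(matches, read.csv, header = FALSE, sep = "", na.strings = " "))
})
names(byStation) <- stations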
I am looking for an elegant way to insert a character string (a name) into a directory path and create a .csv file. I found one possible solution, but I am looking for another that "inserts" text between specific characters rather than "replacing" it.
#lets start
df <-data.frame()
name <- c("John Johnson")
dir <- c("C:/Users/uzytkownik/Desktop/.csv")
#how to insert "name" vector between "Desktop/" and "." to get:
dir <- c("C:/Users/uzytkownik/Desktop/John Johnson.csv")
write.csv(df, file=dir)
#???
#I found the answer but it is not very elegant in my opinion
library(qdapRegex)
dir2 <- c("C:/Users/uzytkownik/Desktop/ab.csv")
dir2<-rm_between(dir2,'a','b', replacement = name)
> dir2
[1] "C:/Users/uzytkownik/Desktop/John Johnson.csv"
write.csv(df, file=dir2)
I like sprintf syntax for "fill-in-the-blank" style string construction:
name <- c("John Johnson")
sprintf("C:/Users/uzytkownik/Desktop/%s.csv", name)
# [1] "C:/Users/uzytkownik/Desktop/John Johnson.csv"
Another option, if you can't put the %s in the directory string, is to use sub. This is replacing, but it replaces .csv with <name>.csv.
dir <- c("C:/Users/uzytkownik/Desktop/.csv")
sub(".csv", paste0(name, ".csv"), dir, fixed = TRUE)
# [1] "C:/Users/uzytkownik/Desktop/John Johnson.csv"
This should get you what you need.
dir <- "C:/Users/uzytkownik/Desktop/.csv"
name <- "joe depp"
dirsplit <- strsplit(dir,"\\/\\.")
paste0(dirsplit[[1]][1],"/",name,".",dirsplit[[1]][2])
[1] "C:/Users/uzytkownik/Desktop/joe depp.csv"
I find that paste0() is the way to go, so long as you store your directory and extension separately:
path <- "some/path/"
file <- "file"
ext <- ".csv"
write.csv(myobj, file = paste0(path, file, ext))
For those unfamiliar, paste0(...) is shorthand for paste(..., sep = "").
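A closely related option, if you'd rather not keep the trailing "/" on the path yourself, is file.path(), which inserts the separators for you. A small sketch with the same pieces:
path <- "some/path"   # note: no trailing "/" needed
file <- "file"
ext  <- ".csv"
# file.path() adds the "/" between components; paste0() glues on the extension
write.csv(myobj, file = file.path(path, paste0(file, ext)))
# writes to "some/path/file.csv"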
Let's suppose you have a list with the desired names for the data structures you want to save, for instance:
names <- c("file_1", "file_2", "file_3")
Now, you want to build the path in which you are going to save your files by adding the name plus the extension:
path <- "/Users/Documents/Test_Folder/"
extension <- ".csv"
A simple way to achieve this is to use paste0() to build the full path for the file argument of write.csv() inside an lapply(), as follows:
lapply(names, function(x) {
  write.csv(x = data,
            file = paste0(path, x, extension))
})
The good thing about this approach is that you can iterate over the list containing the names of your files, and the final path is updated automatically. One possible extension is to define a list of extensions and update the path accordingly.
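For that extension idea, here is a sketch using Map() to pair each name with its own extension (the vectors are illustrative, and data is the object being saved, as above):
names      <- c("file_1", "file_2", "file_3")
extensions <- c(".csv", ".csv", ".txt")
path       <- "/Users/Documents/Test_Folder/"
# Map() walks names and extensions in parallel, building one output path per pair;
# note write.csv() still writes comma-separated content regardless of the extension
Map(function(nm, ext) {
  write.csv(x = data, file = paste0(path, nm, ext))
}, names, extensions)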
I'm new to R and haven't been able to locate an answer to this question. I am using the following to create a new variable that tags each line according to whether it contains one of the words:
a$keywordtag <- (1:nrow(a) %in% c(sapply(needle, grep, a$text, fixed = TRUE)))
the 'needle' or the words to search for is being read in as:
needle <- c("foo", "x", "y")
However, I want the needle to be read in from a csv file. read.csv doesn't seem to have an option to read the values in as character strings, and stringsAsFactors=FALSE doesn't work either. Any suggestions?
The csv would be:
a <- read.table(text='
"foo"
"x"
"y"', header=FALSE)
You should have all the text in one string, with each line ending in a newline character:
(rc <- read.csv(text = paste0(needle, collapse = "\n"), header = FALSE))
V1
1 foo
2 x
3 y
identical(a, rc)
# [1] TRUE
You could also try readLines
read.csv(text = readLines(textConnection(needle)), sep = "\n", header = FALSE)
V1
1 foo
2 x
3 y
In the last line, if needle is actually a file, replace textConnection(needle) with the file name
If stringsAsFactors=FALSE isn't working for you, you might focus on troubleshooting that. The following code should work just fine to read in as character strings:
> needle = read.csv("PathToNeedle\\needle.csv", stringsAsFactors=FALSE, header=FALSE)
> needle[1]
V1
1 foo
2 x
3 y
> typeof(needle[1,1])
[1] "character"
If the csv file you are reading in to needle is really just:
"foo"
"x"
"y"
then that's very peculiar. What is the resulting dataframe you get when you run read.csv? If it simply isn't working, an alternative to try is to directly specify the data type as follows:
needle = read.csv("PathToNeedle\\needle.csv", colClasses=c('character'), header=FALSE)