I am trying to extract a bunch of information from filenames using regular expressions in R. As I match the pattern, str_view() shows me the correct set of strings. Yet when I try to sub those out and extract the remaining portion, it doesn't work. I also tried str_extract(), but that isn't working either. What am I doing wrong?
fname <- "TC2L6C_2020-08-14_1516_6C-ASG_29_00020.tab"
fext <- tools::file_path_sans_ext(fname)
stringr::str_view(fext, ".*-ASG_\\d+_", match = TRUE)
P_num <- gsub(".*-ASG_\\d{2}_", "", fext)
P_num <- stringr::str_extract(fname, "(?<=-ASG_\\d+)([^_])*(?=\\.tab)")
Using trimws from base R: the whitespace argument (added in R 3.6.0) accepts a regular expression, so here it strips everything up through the last underscore from the front and the extension from the back:
trimws(fname, whitespace = ".*_|\\..*")
[1] "00020"
data
fname <- "TC2L6C_2020-08-14_1516_6C-ASG_29_00020.tab"
Here is a simple approach using sub:
fname <- "TC2L6C_2020-08-14_1516_6C-ASG_29_00020.tab"
output <- sub("^.*-ASG_\\d+_(.*)\\.tab$", "\\1", fname)
output
[1] "00020"
Above we use a capture group to isolate the portion of the filename, sans extension, that you want to keep; the \\1 backreference in the replacement returns only that captured text.
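For completeness, the str_extract() attempt in the question fails because ICU (stringr's regex engine) rejects unbounded quantifiers such as \\d+ inside a lookbehind. A minimal working sketch, assuming the digit block before the target is always two digits as in the example:
library(stringr)
# A bounded lookbehind (\\d{2} instead of \\d+) is accepted by ICU
str_extract(fname, "(?<=-ASG_\\d{2}_)[^_.]+(?=\\.tab$)")
# [1] "00020"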
I have a dataframe I am trying to convert to RDF to edit in Protege. The dataframe unfortunately has ASCII escape codes that are not visible when the strings are printed, most notoriously \u0020, which is the code for a space.
x <- "\u0020".
x
> " "
grepl() works fine when searching for the pattern, but does not return the original string when the result is printed.
match <- grep(pattern = "\u0020", x = x, value = TRUE)
match
> " "
The problem is that these codes are throwing Protege off, and I'm trying to normalize them to basic characters, e.g. \u0020 to " ", but I cannot find any regex that will catch these and replace them with the single non-code character. The pattern [^ -~] does not catch these values, and I'm otherwise completely blind to these strings. How can I normalize any of these codes in R?
Personally, I would just unescape all of the Unicode escapes in the file using the stringi library.
Given a CSV file, test.csv, that looks like:
col1,col2,col3
\u0020, moretext, evenmoretext
First, load it as a data.frame:
> frame <- read.csv("test.csv", encoding="UTF-8")
> frame
col1 col2 col3
1 \\u0020 moretext evenmoretext
Next, find all of the occurrences that you want to replace and use stri_unescape_unicode to turn them into something that Protege likes.
> frame$col1
[1] "\\u0020"
> library(stringi)
> frame$col1 <- stri_unescape_unicode(frame$col1)
> frame$col1
[1] " "
Once replaced, you should be able to write your csv back to disk without the unicode entries.
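If the escapes can appear in more than one column, a small sketch of the full cleanup (assuming R >= 4.0, where read.csv returns character columns by default; the output filename is arbitrary):
library(stringi)
# Unescape every character column, then write the cleaned frame back out
is_chr <- vapply(frame, is.character, logical(1))
frame[is_chr] <- lapply(frame[is_chr], stri_unescape_unicode)
write.csv(frame, "test_clean.csv", row.names = FALSE)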
I have an issue creating a string with a special character. I have asked a similar question and have also read answers to similar questions about my problem, but I am not able to find the solution.
I want to create a string containing a special character. I have been trying with cat, but I know it is only for printing, not for saving the string in a variable in R.
I want as a result this:
> cat("C:\\Users\\ppp\\ddd\\")
C:\Users\ppp\ddd\
and I have been trying with paste and collapse but without success:
> x = c("C:","Users","ppp","ddd")
> t <- paste0(x, collapse = '\n')
> t
[1] "C:\nUsers\nppp\nddd"
Are you sure you don't want
x = c("C:","Users","ppp","ddd")
t <- paste0(x, collapse = '/')
t
[1] "C:/Users/ppp/ddd"
R uses this format for setting working directories.
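For instance, with a hypothetical path:
setwd("C:/Users/ppp/ddd")  # forward slashes need no escaping on Windows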
You can also do:
x = c("C:","Users","ppp","ddd")
t <- paste0(x, collapse = '\\')
t
[1] "C:\\Users\\ppp\\ddd"
Although this result looks wrong when printed, if you are using the string in, for example, a shell() command in R to be interpreted by Windows, it will be interpreted correctly.
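The apparent doubling is only print() escaping the backslashes; the stored string contains single ones. A quick check:
x <- c("C:", "Users", "ppp", "ddd")
t <- paste0(x, collapse = "\\")
print(t)  # [1] "C:\\Users\\ppp\\ddd"  <- print() shows the escaped form
cat(t)    # C:\Users\ppp\ddd           <- the characters actually stored
nchar(t)  # 16: each \\ above is a single backslash character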
Not answering directly... but
t <- paste0(x, collapse = '/')
"C:/Users/ppp/ddd" seems to work on windows.
I can create a list of csv files in folder_A:
list1 <- dir_ls("path to folder_A")
I can define a function to add a column with filenames and combine these files into one dataframe:
read_and_save_combo <- function(fileX){
  read_csv(fileX) %>%
    mutate(fileX = path_file(fileX))
}
combo_df <- map_df(list1, read_and_save_combo)
I want to add another column with the enclosing folder name (it would be the same for all files: folder_A). If I use dirname() on an individual file, I get the full parent directory path to folder_A, but I only want the characters "folder_A". If I use dirname() as part of the function, I get another column, but it is filled with ".". Less importantly, I don't know why I get "." instead of the full path; more importantly, is there a function like path_parentfoldername that would let me add a column with only the name of the folder containing each file to each row of the combined dataframe?
Thanks!
Edit:
New function for clarity after answers:
read_and_save_combo <- function(fileX){
  read_csv(fileX) %>%
    mutate(filename = path_file(fileX),
           foldername = dirname(fileX) %>%
             str_replace(pattern = ".*/", replacement = ""))
}
This works because . matches any single character and * repeats it zero or more times, so ".*/" greedily matches everything up to and including the last /. Gregor said this, but now I understand it.
Also, I was getting the column filled with "." because, inside the function, I was reading one file but then mutating with dirname() applied to the whole list, which is a vector with more than one element.
You can use dirname + basename :
list1 <- list.files('folder_A_path', full.names = TRUE)
read_and_save_combo <- function(fileX) {
  readr::read_csv(fileX) %>%
    dplyr::mutate(fileX = basename(dirname(fileX)))
}
combo_df <- purrr::map_df(list1, read_and_save_combo)
If your file is at the path 'Users/Downloads/FolderA/Filename.csv' :
dirname('Users/Downloads/FolderA/Filename.csv')
#[1] "Users/Downloads/FolderA"
basename(dirname('Users/Downloads/FolderA/Filename.csv'))
#[1] "FolderA"
"path to folder_A" is a bad example, use "path/to/folder_A". You need to delete everything from the start through the last /:
library(stringr)
str_replace("path/to/folder_A", pattern = ".*/", replacement = "")
# [1] "folder_A"
If you're worried about \\ or other non-standard things, use dirname() as the input.
Here are two ways to do what I wanted, using the helpful answers above:
read_and_save_combo <- function(file){
  read_csv(file) %>%
    mutate(filename = path_file(file),
           foldername = basename(dirname(file)))
}
read_and_save_combo <- function(file){
  read_csv(file) %>%
    mutate(filename = path_file(file),
           foldername = dirname(file) %>%
             str_replace(pattern = ".*/", replacement = ""))
}
Other basic things I learned that could be helpful for other beginners:
(1) While writing the function, point all of the functions (read_csv(), dirname(), etc.) at one uniform variable (here written as file, but it could be a single letter like g). That way you avoid the problem I had, where part of the function acted on one file while another part acted on the whole list.
(2) filex and fileX appear far too similar to each other in certain fonts, which can mess you up (watch the capitalization).
Using R, I am trying to loop the import of CSV files only if the filename contains a specific string.
For example, I have a list of files with names 'file01042016_abc.csv', 'file020142016_abc.csv', 'file03042016_abc.csv'...'file26092019_abc.csv' and I have a list of specific values in the format '01042016', '05042016', '09042016', etc.
I would like to only import the files if the filename contains the string value in the second list.
I can import them all together (shown below), but there are several thousand files and it takes a considerable amount of time, so I would like to reduce that by importing only the files needed, based on the condition mentioned above.
files <- list.files(path)
for (i in 1:length(files)) {
  assign(paste("Df", files[i], sep = "_"), read.csv(paste(path, files[i], sep='')))
}
Any help/suggestions would be greatly appreciated. Thank you.
Using regex along with grepl:
files <- list.files(path)
formats <- c("01042016", "05042016", "09042016")
regex <- paste(formats, collapse="|")
sapply(files, function(x) {
  if (grepl(regex, x)) {
    # assign() inside a function needs an explicit environment,
    # or the data frame vanishes with the function's local frame
    assign(paste("Df", x, sep = "_"),
           read.csv(paste(path, x, sep='')),
           envir = .GlobalEnv)
  }
})
The strategy here is to generate a single regex alternation containing all numeric filename fragments which would whitelist a file as a candidate to be read. For the sample data given above, regex would become:
01042016|05042016|09042016
Then, we call grepl on each file to see if it matches one of the whitelisted patterns. Note that I switched to using sapply, as list.files() returns a character vector of filenames.
We can just prefilter the files vector, and then loop as normal.
files0 <- c('file01042016_abc.csv', 'file020142016_abc.csv',
'file03042016_abc.csv', 'file26092019_abc.csv',
'file09042016_abc.csv')
k <- c('01042016', '05042016', '09042016')
pat <- paste(k, collapse="|")
files <- grep(pat, files0, value=TRUE)
files
# [1] "file01042016_abc.csv" "file09042016_abc.csv"
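As a side note, rather than assign(), a common alternative is to read the filtered files into a named list (a sketch, assuming path points at the folder of CSVs):
files <- grep(pat, list.files(path, full.names = TRUE), value = TRUE)
dfs <- lapply(files, read.csv)  # one data frame per matching file
names(dfs) <- paste("Df", basename(files), sep = "_")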
I am having trouble reading a file containing lines like the one below in R.
"_:b5507F4C7x59005","Fabiana D\"atri"
Any idea? How can I make read.table understand that \" is the escape of quote?
Cheers,
Alexandre
It seems to me that read.table/read.csv cannot handle escaped quotes.
...But I think I have an (ugly) work-around inspired by @nullglob: first read the file WITHOUT a quote character (this won't handle embedded commas, as @Ben Bolker noted), then go through the string columns and remove the quotes.
The test file looks like this (I added a non-string column for good measure):
13,"foo","Fab D\"atri","bar"
21,"foo2","Fab D\"atri2","bar2"
And here is the code:
# Generate test file
writeLines(c("13,\"foo\",\"Fab D\\\"atri\",\"bar\"",
             "21,\"foo2\",\"Fab D\\\"atri2\",\"bar2\""), "foo.txt")
# Read ignoring quotes
tbl <- read.table("foo.txt", as.is=TRUE, quote='', sep=',', header=FALSE, row.names=NULL)
# Go through and clean up
for (i in seq_len(NCOL(tbl))) {
  if (is.character(tbl[[i]])) {
    x <- tbl[[i]]
    x <- substr(x, 2, nchar(x) - 1)    # Remove surrounding quotes
    tbl[[i]] <- gsub('\\\\"', '"', x)  # Unescape quotes
  }
}
The output is then correct:
> tbl
V1 V2 V3 V4
1 13 foo Fab D"atri bar
2 21 foo2 Fab D"atri2 bar2
On Linux/Unix (or on Windows with cygwin or GnuWin32), you can use sed to convert the escaped double quotes \" to doubled double quotes "" which can be handled well by read.csv:
p <- pipe(paste0('sed \'s/\\\\"/""/g\' "', FILENAME, '"'))
d <- read.csv(p, ...)
rm(p)
Effectively, the following sed command is used to preprocess the CSV input:
sed 's/\\"/""/g' file.csv
I don't call this beautiful, but at least you don't have to leave the R environment...
My apologies ahead of time that this isn't more detailed -- I'm right in the middle of a code crunch.
You might consider using the scan() function. I created a simple sample file "sample.csv," which consists of:
V1,V2
"_:b5507F4C7x59005","Fabiana D\"atri"
Two quick possibilities are (with output commented so you can copy-paste to the command line):
test <- scan("sample.csv", sep=",", what='character',allowEscapes=TRUE)
## Read 4 items
test
##[1] "V1" "V2" "_:b5507F4C7x59005"
##[4] "Fabiana D\\atri\n"
or
test <- scan("sample.csv", sep=",", what='character',comment.char="\\")
## Read 4 items
test
## [1] "V1" "V2" "_:b5507F4C7x59005"
## [4] "Fabiana D\\atri\n"
You'll probably need to play around with it a little more to get what you want. And I see that you've already mentioned writeLines, so you may have already tried this. Either way, good luck!
I was able to get your example to work by setting the quote argument:
> read.csv('test.csv',quote="'",head=FALSE)
V1 V2
1 "_:b5507F4C7x59005" "Fabiana D\\"atri"
2 "_:b5507F4C7x59005" "Fabiana D\\"atri"
read_delim from package readr can handle escaped and doubled double quotes, using the arguments escape_double and escape_backslash.
For example, if our file escapes quotes by doubling them:
"quote""","hello"
1,2
then we use
read_delim(file, delim=',') # default escape_backslash=FALSE, escape_double=TRUE
If our file escapes quotes with a backslash:
"quote\"","hello"
1,2
we use
read_delim(file, delim=',', escape_double=FALSE, escape_backslash=TRUE)
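A quick end-to-end check of the backslash-escaped case (a sketch; the two lines written out are exactly the example above, and the filename is arbitrary):
library(readr)
writeLines(c('"quote\\"","hello"', '1,2'), "esc.csv")
read_delim("esc.csv", delim = ",",
           escape_double = FALSE, escape_backslash = TRUE)
# A tibble: 1 x 2, with columns `quote"` and `hello` holding 1 and 2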
As of newer versions of readr, read_delim() is the correct answer.
data = read_delim(filename, delim = "\t", quote = "\"",
                  escape_backslash = TRUE, escape_double = FALSE,
                  # The columns depend on your data
                  col_names = c("timeStart", "posEnd", "added", "removed"),
                  col_types = "nncc")
This should be fine with read.csv(). Take a look at the help in ?read.csv: the option for specifying the quote character is quote = "...". In this case, though, there may be a problem: read.csv() seems to prefer to see matching quotes.
I tried the same with read.table("sample.txt", header = FALSE, as.is = TRUE), with your text in sample.txt, and it seems to work. When all else fails with read.csv(), I tend to back up to read.table() and specify the parameters carefully.