R sub with perl - starts search backwards?

I have strings that look like the one shown below. I need to extract the part of the string that is between the first // and the first subsequent /. I use sub with perl = F but it's roughly 4 times slower than with perl = T. So I tried perl = T and found that the search seems to start from the END of the string??
a = "https://moo.com/meh/woof//A.ds.serving/hgtht//ghhg/tjtke"
print(gsub(".*//(.*?)/.*","\\1",a))
"moo.com"
print(gsub(".*//(.*?)/.*","\\1",a,perl=T))
"ghhg"
moo.com is what I need. I am very surprised to see this - is it documented somewhere? How can I rewrite it with perl - I have 20M rows to work with, and speed is important. Thanks!
Edit: it is not given that every string will start with http

You can try .*?//(.*?)/.* to make the first .* lazy too so that // will match the first // instance:
gsub(".*?//(.*?)/.*","\\1",a,perl=T)
# [1] "moo.com"
And ?gsub says:
The standard regular-expression code has been reported to be very slow
when applied to extremely long character strings (tens of thousands of
characters or more): the code used when perl = TRUE seems much faster
and more reliable for such usages.
The standard version of gsub does not substitute correctly repeated
word-boundaries (e.g. pattern = "\b"). Use perl = TRUE for such
matches.
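Since speed matters over 20M rows, a variant worth trying is to replace the lazy capture group with a negated character class, which usually backtracks less (a sketch of the same idea, not benchmarked here):
a <- "https://moo.com/meh/woof//A.ds.serving/hgtht//ghhg/tjtke"
# lazily skip to the first "//", then grab everything up to the next "/"
sub(".*?//([^/]+)/.*", "\\1", a, perl = TRUE)
# [1] "moo.com"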

Related

How to grep a string ending in a specific punctuation mark

I'm trying to grep strings that end in a dash in R, but having trouble. I've worked out how to grep strings ending in any punctuation mark, maybe not the best way but this worked:
grep("\\#[[:print:]]+[[:punct:]]$",c)
Can't for the life of me work out how to grep strings that end specifically in a dash
for example these strings:
- # (piano) - not this.
- # hello hello - not this either.
I'd like to sub all the stuff between the dashes (and including the dashes) with nothing "" and leave the text to the right of the second dash, which ends in a full stop. So, I would like the output to be (for example, based on the examples above):
not this.
and
not this either.
Any help would be appreciated.
Thank you!
Maro
UPDATE:
Hi again everyone,
I'm just updating my original question again:
So what I had in my original data was these three examples (I tried to simplify in my original post above, but I think it might be helpful for you all to see what I was actually dealing with):
1. - # (Piano) - no, and neither can you.
2. - # (Piano) - uh-huh.
3. - # Many dreams ago - Try it again.
(numbers 1-3 are for the purposes of making things clearer, they are not part of the strings)
I was trying to find a way to delete all the stuff between and including the two dashes, and leave all the stuff after the second dash, so I wanted my output to be:
no, and neither can you.
uh-huh.
Try it again.
I ended up using this:
gsub(("-[[:blank:]]#[[:blank:]]\\(?[A-Z][a-z]*\\)?[[:blank:]]-", "", c)
which helped me get 1. and 2. in one go. But this didn't help with 3 - I thought by including the question mark after the open and close bracket (which I thought meant 'optional') this would help me get all three targets, but for some reason it didn't. To then get 3, I just ended up targeting that specific string i.e. - # Many dreams ago -, by using:
gsub(("- # Many dreams ago -"), "", c)
I'm new to this, so not the best solution I'm sure.
In my original post (this has been edited a couple of times) I included square brackets around the three strings, which explains some of the answers I originally received from members of the community. Apologies for the confusion!
Thanks everyone - if there's anything that doesn't make sense, please let me know, and I'll try to clarify.
Maro
If you want to stay in between the square brackets, you can start the match at #, then use a negated character class [^][]* to match optional chars other than an opening or closing square bracket, and finally match the last -.
Replace the match with an empty string.
c <- "[- # (piano) - not this.]"
sub("#[^][]*-", "", c)
Output
[1] "[- not this.]"
For a more specific match of that string format, you can match the whole line including the square brackets, the # and the string ending in a full stop, and capture the part you want to keep.
In the replacement, use the capture group value.
c <- c("[- # (piano) - not this.]", "[- # hello hello - not this either.]")
sub("\\[[^][#]*#[^][]*-\\s*([^][]*\\.)]", "\\1", c)
Output
[1] "not this." "not this either."

How to remove "\" from paste function output with quotation marks?

I'm working with the following code:
Y_Columns <- c("Y.1.1")
paste('{"ImportId":"', Y_Columns, '"}', sep = "")
The paste function produces the following output:
"{\"ImportId\":\"Y.1.1\"}"
How do I get the paste function to omit the \? Such that, the output is:
"{"ImportId":"Y.1.1"}"
Thank you for your help.
Note: I did do a search on SO to see if there were any Q's that asked "what is an escape character in R". But I didn't review all the 160 answers, only the first 20.
This is one way of demonstrating what I wrote in my comment:
out <- paste('{"ImportId":"', Y_Columns, '"}', sep = "")
out
#[1] "{\"ImportId\":\"Y.1.1\"}"
?print
print(out,quote=FALSE)
#[1] {"ImportId":"Y.1.1"}
Both R and regex patterns use escape characters to allow special characters to be displayed in print output or input. (And sometimes regex patterns need to have doubled escapes.) R has a few characters that need to be "escaped" in certain situations. You illustrated one such situation: including a double-quote character inside a result that will be printed with surrounding double-quotes. If you were intending to include any single quotes inside a character value that was delimited by single quotes at the time of creation, they would have needed to be escaped as well.
out2 <- '\'quoted\''
nchar(out2)
#[1] 8 ... note that neither the surrounding single-quote delimiters nor the backslashes get counted
> out2
[1] "'quoted'" ... and the default output quote-char is a double-quote.
Here's a good Q&A to review:How to replace '+' using gsub() function in R
It has two answers, both useful: one shows how to double escape a special character and the other shows how to use teh fixed argument to get around that requirement.
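For instance, with a made-up vector x, the two approaches from that Q&A look like this:
x <- c("a+b", "c+d")
gsub("\\+", " plus ", x)              # escape the metacharacter in the pattern
# [1] "a plus b" "c plus d"
gsub("+", " plus ", x, fixed = TRUE)  # or skip regex interpretation entirely
# [1] "a plus b" "c plus d"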
And another potentially useful Q&A on the topic of handling Windows paths:
File path issues in R using Windows ("Hex digits in character string" error)
And some further useful reading suggestions: look at the series of help pages that start with capital letters. (Since I can never remember which one has which nugget of essential information, I tried ?Syntax first, and it has a "See Also" list of essential reading: Arithmetic, Comparison, Control, Extract, Logic, NumericConstants, Paren, Quotes, Reserved.) I then realized that what I wanted to refer you to was most likely ?Quotes, where all the R-specific escape sequence letters should be listed.

time pattern in list.files function (R)

I'm trying to get a list of subdirectories from a path. These subdirectories have a time pattern month\day\hour, i.e. 03\21\11.
I naively used the following:
list.files("path",pattern="[0-9]\[0-9]\[0-9]", recursive = TRUE, include.dirs = TRUE)
But it doesn't work.
How to code for the digitdigit\digitdigit\digitdigit pattern here?
Thank you
This Regex works for 10\11\18.
(\d\d\\\d\d\\\d\d)
I think you may need lazy matching for regex, unless there's always two digits - in which case other responses look valid.
If you could provide a vector of file name strings, that would be super helpful.
Capturing backslashes is confusing; I've found this thread helpful: R - gsub replacing backslashes
My guess is something like this: '[0-9]+?\\\\[0-9]+?\\\\[0-9]+'
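Either way, remember that a single literal backslash has to be written as "\\\\" in an R pattern string: doubled once for the R parser and once more for the regex engine. A small check against made-up strings (the names here are hypothetical):
paths <- c("03\\21\\11", "3\\1\\2", "notes.txt")   # i.e. 03\21\11, 3\1\2, notes.txt
grepl("[0-9]{2}\\\\[0-9]{2}\\\\[0-9]{2}", paths)
# [1]  TRUE FALSE FALSE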

Regex match alphabetic/numeric that is not x

Assume I want to extract all strings starting in either ftp or ftpk (example made up).
I currently have a solution:
Get all the strings starting with ftp but not those starting in
ftpx or ftpc.
I wonder how I can make it more general (because right now I'm listing the exceptions which can become tedious), something like:
Get all the strings starting with ftp but not those starting in
ftpX where X is any alphabetic/numeric that is not k.
# Data:
vec <- c("ftp:ladpmxqgvt", "ftpx:xfiwyoloqu", "ftpk:yol.qdsrehn",
"ftpc:krjqdzsuhb", "ftpk:yolo.taxukj", "ftp:qvxarpkjid",
"ebutlngqkr", "yolx.vhznja")
# Current solution (desired output)
grep("^ftp[^xc]", vec, value = TRUE)
"ftp:ladpmxqgvt" "ftpk:yol.qdsrehn" "ftpk:yolo.taxukj" "ftp:qvxarpkjid"
Code
^ftpk?:
If you don't know whether : will follow ftp, you can use the following, which simply ensures ftp or ftpk is followed by a word boundary (a non-word character or the end of the string):
^ftpk?\b
Results
Input
ftp:ladpmxqgvt
ftpx:xfiwyoloqu
ftpk:yol.qdsrehn
ftpc:krjqdzsuhb
ftpk:yolo.taxukj
ftp:qvxarpkjid
ebutlngqkr
yolx.vhznja
Output
Only the matches are listed below:
ftp:ladpmxqgvt
ftpk:yol.qdsrehn
ftpk:yolo.taxukj
ftp:qvxarpkjid
Explanation
^ Assert position at the start of the line
ftp Match this literally
k? Match k literally zero or once
: Match this literally
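In an R call the backslash in \b has to be doubled; a quick check against the question's vec, using the word-boundary version since not every string is guaranteed to contain a colon:
grep("^ftpk?\\b", vec, value = TRUE, perl = TRUE)
# [1] "ftp:ladpmxqgvt" "ftpk:yol.qdsrehn" "ftpk:yolo.taxukj" "ftp:qvxarpkjid"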
I think this solution most closely mimics the sentence:
Get all the strings starting with ftp but not those starting in ftpX where X is any alphabetic/numeric that is not k.
grep("ftp(?!k)[[:alnum:]](*SKIP)(*FAIL)|ftp", vec, value = TRUE, perl = TRUE)
or
grep("ftp(?!(?!k)[[:alnum:]])", vec, value = TRUE, perl = TRUE)
Result:
[1] "ftp:ladpmxqgvt" "ftpk:yol.qdsrehn" "ftpk:yolo.taxukj" "ftp:qvxarpkjid"
Note:
The first solution uses the (*SKIP)(*FAIL) trick to avoid matching particular patterns. In this case, I am using it to avoid matching ftp followed by an alphanumeric character except k, and matching any ftp that was not avoided.
The second solution is similar, but uses negative lookahead. (?!k)[[:alnum:]] matches all alphanumerics except k, while ftp(?!(?!k)[[:alnum:]]) matches ftp not immediately followed by any alphanumerics except k.
The advantage of these two solutions is that one can add to the things to avoid. Just add them to (?!k)[[:alnum:]] or (?!(?!k)[[:alnum:]]).
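For example, to also accept ftpq as a prefix (a made-up extra exception), only the inner lookahead needs to change:
grep("ftp(?!(?!k|q)[[:alnum:]])", vec, value = TRUE, perl = TRUE)
# same four matches as before, since vec contains no "ftpq" strings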

How to modify i in an R loop?

I have several large R objects saved as .RData files: "this.RData", "that.RData", "andTheOther.RData" and so on. I don't have enough memory, so I want to load each in a loop, extract some rows, and unload it. However, once I load(i), I need to strip the ".RData" part of (i) before I can do anything with objects "this", "that", "andTheOther". I want to do the opposite of what is described in How to iterate over file names in a R script? How can I do that? Thx
Edit: I omitted to mention the files are not in the working directory and have a filepath as well. I came across Getting filename without extension in R and file_path_sans_ext takes out the extension but the rest of the path is still there.
Do you mean something like this?
i <- c("/path/to/this.RDat", "/another/path/to/that.RDat")
f <- gsub(".*/([^/]+)", "\\1", i)
f1 <- gsub("\\.RDat", "", f)
f1
[1] "this" "that"
On Windows-style paths (with backslash separators) you would have to use "\\\\" in the pattern instead of "/".
Edit: Explanation. Technically, these are called "regular expressions" (regexps), not "patterns".
.        any character
.*       arbitrary number (including 0) of any kind of characters
.*/      arbitrary number of any kind of characters, followed by a /
[^/]     any character but not /
[^/]+    arbitrary number (1 or more) of any kind of characters, but not /
( and )  enclose groups. You can use the groups when replacing as \\1, \\2 etc.
So, look for any kind of character, followed by /, followed by anything but not the path separator. Replace this with the "anything but not separator".
There are many good tutorials for regexps, just look for it.
A simple way to do this would be to extract the base name from the file paths with base::basename() and then remove the file extension with tools::file_path_sans_ext().
paths_to_files <- c("./path/to/this.RData", "./another/path/to/that.RData")
tools::file_path_sans_ext(
  basename(
    paths_to_files
  )
)
## Returns:
## [1] "this" "that"
