Substring by specific character [duplicate] - r

I would like to extract filename from url in R. For now I do it as follows, but maybe it can be done shorter like in python. assuming path is just string.
path="http://www.exanple.com/foo/bar/fooXbar.xls"
in R:
tail(strsplit(path,"[/]")[[1]],1)
in Python:
path.split("/")[-1:]
Maybe some sub, gsub solution?

There's a function for that...
basename(path)
[1] "fooXbar.xls"

#SimonO101 has the most robust answer IMO, but some other options:
Since regular expressions are greedy, you can use that to your advantage
sub('.*/', '', path)
# [1] "fooXbar.xls"
Also, you shouldn't need the [] around the / in your strsplit.
> tail(strsplit(path,"/")[[1]],1)
[1] "fooXbar.xls"

Related

Regular Expression using R Programming Language

String<- "46,XX,t(1;19)(p32;q13.3),t(6;9)(p22;q34),del(32)t(12;16)(p12;q21)[cp20]"
The value I want to extract is t(1;19)(p32;q13.3), t(6;9)(p22;q34), t(12;16)(p12;q21)
The regex I'm using
ABC<-str_extract(String, regex("t.{1,16}"))
output I Get: t(1;19)(p32;q13.3
I know my code I incomplete but I'm unable to figure out a way to extract this information.
Thank you in advance
Assuming your String is :
String<- "46,XX,t(1;19)(p32;q13.3),t(6;9)(p22;q34),del(32)t(12;16)(p12;q21)[cp20]"
We can use str_extract_all as :
stringr::str_extract_all(String, "t\\(.*?\\)\\(.*?\\)")[[1]]
#[1] "t(1;19)(p32;q13.3)" "t(6;9)(p22;q34)" "t(12;16)(p12;q21)"
This returns "t" followed by everything in round brackets (()), followed by everything in another round bracket next to it.

Running regex in R using str_extract_all has regexp not yet implemented

I am trying to use regex to parse a file using regex. Most of the solutions to using regex in R use the stringr package. I have not found another way, or another package to use that would work. If you have another way of going about this that would also be acceptable.
What I am trying to accomplish is to grab a couple of values that are seperated by spaces with the last value being some comma seperated values of variable length. This should go into a matrix or df in table like format is it is currently.
foo foo_123bar foo,bar,bazz
foo2 foo_456bar foo2,bar2
I have the working example of my regex here.
There could be a couple of issues I could be running into. The first could be that the regex I am writing is not supported by R's regex engine. Although I have the feeling from this that would be supported. I have seen that R uses a POSIX like format which could make things interesting. The second simply could be exactly what the error message bellow is showing. This is not a feature that has been coded in yet. This however would be the most troubling because I don't know another way to solve my problem without this package.
Below is the R code that I am using to replicate this error
library("stringr")
string = " foo foo_123bar foo,bar,bazz\n foo2 foo_456bar foo2,bar2,bazz2"
pattern = "
(?(DEFINE)
(?<blanks>[[:blank:]]+)
(?<var>\"?[[:alnum:]_]+\"?)
(?<csvar>(\"?[[:alnum:]_]+\"?,?)+)
)
^
(?&blanks)((?&var))
(?&blanks)((?&var))
(?&blanks)((?&csvar))"
# Both of these are throwing the error
str_extract_all(string, pattern)
str_extract_all(string, regex(pattern, multiline=TRUE, comments=TRUE))
> Error in stri_extract_all_regex(string, pattern, simplify = simplify, :
> Use of regexp feature that is not yet implemented. (U_REGEX_UNIMPLEMENTED)
# Using the example from ?str_extract_all runs without error
shopping_list <- c("apples x4", "bag of flour", "bag of sugar", "milk x2")
str_extract_all(shopping_list, "\\b[a-z]+\\b", simplify = TRUE)
I am looking for a solution, not necessarily a stringr solution, but this is the only way I found that fits my needs. The other simpler R regex functions only accept the pattern and not the extra parameters that include the multi line and comment functionality that I am using.
You have a PCRE regex that can only be used in methods/functions that parse the regex with the PCRE regex library (or Boost, it is based on PCRE). stringr str_extract parses the regex with the ICU regex library. ICU regex does not support recursion and DEFINE block. You just can't use the in-pattern approach to define subpatterns and then re-use them.
Instead, just declare the regex parts you need to re-use as variables and build the pattern dynamically:
library("stringr")
string = " foo foo_123bar foo,bar,bazz\n foo2 foo_456bar foo2,bar2,bazz2"
blanks <- "[[:blank:]]+"
vars <- "\"?[[:alnum:]_]+\"?"
csvar <- "(?:\"?[[:alnum:]_]+\"?,?)+"
pattern <- paste0("^",blanks,"(", vars, ")",blanks,"(", vars,")",blanks,"(",csvar, ")")
str_match_all(string, pattern)
# [[1]]
# [,1] [,2] [,3] [,4]
#[1,] " foo foo_123bar foo,bar,bazz" "foo" "foo_123bar" "foo,bar,bazz"
Note: you need to use str_match (or str_match_all) to extract the capturing group values as str_extract or str_extract_all only allows access to the whole match values.

How to extract specific characters in R String

I have a file name string:
directoryLocation<-"\Users\me\Dropbox\Work\"
How can I extract all the "\" and replace it with "\"? In other languages, you can loop through the string and then replace character by character, but I don't think you can do that in R.
I tried
substr(directoryLocation,1,1)
but it is highly optimized to this case...how can it be more general?
Thanks
gsub is the general tool for this, but as others have noted you need a confusing four slashes to account for the escapes: you need to escape for both R text and the regexp engine simultaneously.
An alternative, if using Windows, is to use normalizePath and setting the winslash parameter:
normalizePath(directoryLocation,winslash="/",mustWork=FALSE)
[1] "C:/Users/me/Dropbox/Work/"
Though this may perform additional work on expanding relative paths to absolute ones (seen here by prepending with C:).
In theory this would do what you want
gsub("\\\", "/", directoryLocation)
however...
R> directoryLocation<-"\\Users\\me\\Dropbox\\Work\\"
R> directoryLocation
[1] "\\Users\\me\\Dropbox\\Work\\"
R> gsub("\\\\", "/", directoryLocation)
[1] "/Users/me/Dropbox/Work/"
At least on windows one needs to escape all of the backslashes, but gsub is what you want.
gsub("\\\\","/","\\Users\\me\\Dropbox\\Work\\")
[1] "/Users/me/Dropbox/Work/"

How do I strip dollar signs ($) from data/ escape special characters in R?

I've been using gsub("toreplace","replacement", myvector) to clean out data in R. While this works for commas and the like, removing "$" has no effect. So if I do gsub("$","",myvector) all the dollar signs remain in place.
I think this is because $ is a special character in R. I tried escaping it "\$" but that yields the same result (no effect). And I couldn't find a resource on escaping special characters in R.
Obviously I should do this in preprocessing. But I was wondering if anyone out there knew how to either a) escape special characters in R b) get rid of pesky $ in R directly. For science.
You have to escape it twice, first for R, second for the regex.
gsub('\\$', '', c("a$a", "bb$"))
[1] "aa" "bb"
See ?Quotes for details on quoting and escaping.
Use fixed = TRUE:
gsub('$', '', c("a$a", "bb$"), fixed = TRUE)
Then you don't need to worry about any special characters. In stringr, this is implemented a little differently:
library(stringr)
str_replace_all(c("$100","ta$ty"), fixed("$"), "")
Thanks to DiggyF and James for the examples!
Escaping characters can be a pain some times, but just putting it in square brackets (make it a character class) helps with this:
> gsub("[$]","",c("$100","ta$ty"))
[1] "100" "taty"
if you have $ followed by number in set of data columns (e.g. $400,000) there is an easier way that worked like charm for me.
data%>%
mutate_at(5:6, parse_number)
where 5:6 are the data column numbers.

How can I determine the current directory name in R?

The only solution I've encountered is to use regular expressions and recursively replace the first directory until you get a word with no slashes.
gsub("/\\w*/","/",gsub("/\\w*/","/",getwd()))
Is there anything slightly more elegant? (and more portable?)
Your example code doesn't work for me, but you're probably looking for either basename or dirname:
> getwd()
[1] "C:/cvswork/data"
> basename(getwd())
[1] "data"
> dirname(getwd())
[1] "C:/cvswork"
If you didn't know basename (and I didn't), you could have used this:
tail(strsplit(getwd(), "/")[[1]], 1)

Resources