using str_replace_all to replace / with \ - r

I'm working through R for Data Science and one of the exercises asks me to replace all forward slashes with backslashes. I can't get this to work.
> x <- c("//w+", "//b[aeiou]//b")
> str_replace_all(x, "/", "\\")
[1] "w+" "b[aeiou]b"
The online solution doesn't work either, as it replaces one forward slash with two backslashes.
> x <- c("//w+", "//b[aeiou]//b")
> str_replace_all(x, "/", "\\\\")
[1] "\\\\w+" "\\\\b[aeiou]\\\\b"
Edit: I'm adding this to clarify my question. I literally want the string "//" to be "\\". I can't get that to happen. Here's an example in action showing how it's not working.
This works because I have used \ correctly in the string:
> x <- "\\w+'\\w+"
> sentence <- "Open the crate but don't break the glass."
> str_extract(sentence, x)
[1] "don't"
This doesn't work. I mistakenly used / instead of \ and tried to use str_replace_all to fix this:
> y <- "//w+'//w+"
> z <- str_replace_all(y, "/", "\\\\")
> str_extract(sentence, z)
[1] NA
That's because z is not "\\w+'\\w+" like I want it to be, but rather:
> z
[1] "\\\\w+'\\\\w+"

The solution given online is actually working correctly! The extra backslashes that you're seeing are the escape characters necessary for other functions to correctly interpret the presence of \ characters.
The following commands:
x <- c("//w+", "//b[aeiou]//b")
y <- str_replace_all(x, "/", "\\\\")
produce a new vector y. When printed to the R console, you'll see this:
[1] "\\\\w+" "\\\\b[aeiou]\\\\b"
This looks wrong, but it isn't. Again, the extra backslashes are there to escape the literal backslashes. If you feed these strings to a function that interprets strings, you'll see that the string representation is actually correct, with each forward slash replaced with a backslash:
message(y)
\\w+\\b[aeiou]\\b
cat(y)
\\w+ \\b[aeiou]\\b
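To make the difference between the printed form and the stored characters concrete, here is a minimal sketch in base R. It assumes gsub() as a stand-in for str_replace_all() (the replacement "\\\\" is read the same way there, as one literal backslash), and the names bad and fixed are only illustrative. The second half covers the case from the question's edit, where each pair "//" is meant to become a single backslash.

```r
x <- c("//w+", "//b[aeiou]//b")

# Base R stand-in for str_replace_all(x, "/", "\\\\"):
# each forward slash becomes one literal backslash.
y <- gsub("/", "\\\\", x)

print(y)       # escaped view: every backslash is shown doubled
writeLines(y)  # raw view: the characters actually stored
nchar(y)       # 4 13, the same lengths as the elements of x

# Case from the question's edit: "//" should become ONE backslash,
# so the pattern has to match both slashes at once.
bad <- "//w+'//w+"
fixed <- gsub("//", "\\\\", bad)
identical(fixed, "\\w+'\\w+")  # TRUE

sentence <- "Open the crate but don't break the glass."
regmatches(sentence, regexpr(fixed, sentence))  # "don't"
```

If stringr is available, str_replace_all(bad, "//", "\\\\") should give the same result as the gsub() call above.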

str_replace_all(x, "/", "\\\\") and str_replace_all(x, "/", "\\") both work in R for this problem.

Related

Regular expression to extract specific part of a URL

I have a vector of URLs and need to extract a certain part of it. I've tried using a regex tester to see if my attempts worked, but they were no good.
The URLs I have are in this format: https://www.baseball-reference.com/teams/MIL/1976.shtml
I need to extract the three letters after "teams/" (so for the example above, I need "MIL").
Does anyone have any idea how to get the correct regular expression to get this working? Thanks.
1) basename/dirname Try this:
u <- "https://www.baseball-reference.com/teams/MIL/1976.shtml" # input data
basename(dirname(u))
## [1] "MIL"
2) sub or with a regular expression:
sub(".*teams/(.*?)/.*", "\\1", u)
## [1] "MIL"
3) strsplit Split the string on / and take the second last component.
s <- strsplit(u, "/")[[1]]
s[length(s) - 1]
## [1] "MIL"
4) gsub Since the required substring is all upper case and no other characters in the input are, this gsub, which removes every character that is not an upper case letter, would work:
gsub("[^A-Z]", "", u)
## [1] "MIL"
Many different ways to achieve this using regexp's. Here's one:
url <- "https://www.baseball-reference.com/teams/MIL/1976.shtml"
gsub(".+teams/(\\w{3}).+$", "\\1", url);
#[1] "MIL"
Or
x <- c('https://www.baseball-reference.com/teams/MIL/1976.shtml')
pattern <- "/teams/([^/]+)"
m <- regexec(pattern, x)
res = regmatches(x, m)[[1]]
res[2]
which yields
[1] "MIL"
Consider using the stringr package to simplify your code when handling strings.
Use a regular expression with a positive lookbehind to catch the upper case letters following the string "teams/":
stringr::str_extract(url, "(?<=teams\\/)[A-Z]*")
In your case, if the URLs literally all begin with the same string https://www.baseball-reference.com/teams/ then you can avoid regex entirely and use a simple substring to get the three-letter code which follows:
stringr::str_sub(url, 42, 44)
Here are the results:
> url <- "https://www.baseball-reference.com/teams/MIL/1976.shtml"
>
> stringr::str_extract(url, "(?<=teams\\/)[A-Z]*")
[1] "MIL"
>
> stringr::str_sub(url, 42, 44)
[1] "MIL"
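Since the question mentions a vector of URLs, here is a small sketch of how two of the approaches above behave on more than one element. The second URL is a made-up example that follows the same format, and the variable names are hypothetical.

```r
urls <- c(
  "https://www.baseball-reference.com/teams/MIL/1976.shtml",
  "https://www.baseball-reference.com/teams/NYY/2001.shtml"  # hypothetical URL
)

# Lookbehind version in base R: perl = TRUE is needed for (?<=...).
codes_regex <- regmatches(urls, regexpr("(?<=/teams/)[A-Z]{3}", urls, perl = TRUE))

# Path-based version: the team code is the last directory in the path.
codes_path <- basename(dirname(urls))

codes_regex  # "MIL" "NYY"
codes_path   # "MIL" "NYY"
```

One caveat with the regmatches() version: elements that do not match are dropped from the result, so the output can be shorter than the input if some URLs have a different shape.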

Use gsub to replace curly apostrophe with straight apostrophe in R list of character vectors

Looking for some guidance on how to replace a curly apostrophe with a straight apostrophe in an R list of character vectors.
The reason I'm replacing the curly apostrophes - later in the script, I check each list item, to see if it's found in a dictionary (using qdapDictionary) to ensure it's a real word and not garbage. The dictionary uses straight apostrophes, so words with the curly apostrophes are being "rejected."
A sample of the code I have currently follows. In my test list, item #6 contains a curly apostrophe, and item #2 has a straight apostrophe.
Example:
list_TestWords <- as.list(c("this", "isn't", "ideal", "but", "we", "can’t", "fix", "it"))
func_ReplaceTypographicApostrophes <- function(x) {
gsub("’", "'", x, ignore.case = TRUE)
}
list_TestWords_Fixed <- lapply(list_TestWords, func_ReplaceTypographicApostrophes)
The result: No change. Item 6 still using curly apostrophe. See output below.
list_TestWords_Fixed
[[1]]
[1] "this"
[[2]]
[1] "isn't"
[[3]]
[1] "ideal"
[[4]]
[1] "but"
[[5]]
[1] "we"
[[6]]
[1] "can’t"
[[7]]
[1] "fix"
[[8]]
[1] "it"
Any help you can offer will be most appreciated!
This might work: gsub("[\u2018\u2019\u201A\u201B\u2032\u2035]", "'", x)
I found it over here: http://axonflux.com/handy-regexes-for-smart-quotes
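A shorter variant of the same idea, as a sketch: write the curly apostrophe as the escape "\u2019" so the script does not depend on how the source file is encoded, and use fixed = TRUE so the pattern is taken literally. The shortened word list and the helper name straighten are made up for illustration.

```r
words <- list("this", "isn't", "can\u2019t")

# Check which character is really in the data: 8217 is U+2019.
utf8ToInt("\u2019")

# fixed = TRUE treats the pattern as a literal string, not a regex.
straighten <- function(w) gsub("\u2019", "'", w, fixed = TRUE)

fixed_words <- lapply(words, straighten)
unlist(fixed_words)  # "this" "isn't" "can't"
```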
You might be running up against a bug in R on Windows. Try using utf8::as_utf8 on your input. Alternatively, this also works:
library(utf8)
list_TestWords <- as.list(c("this", "isn't", "ideal", "but", "we", "can’t", "fix", "it"))
lapply(list_TestWords, utf8_normalize, map_quote = TRUE)
This will replace the following characters with ASCII apostrophe:
U+055A ARMENIAN APOSTROPHE
U+2018 LEFT SINGLE QUOTATION MARK
U+2019 RIGHT SINGLE QUOTATION MARK
U+201B SINGLE HIGH-REVERSED-9 QUOTATION MARK
U+FF07 FULLWIDTH APOSTROPHE
It will also convert your text to composed normal form (NFC).
I see a problem in your call to gsub:
gsub("/’", "/'", x, ignore.case = TRUE)
You are prefixing the curly single quote with a forward slash. I don't know why you are doing this. I could speculate that you are trying to escape the quote characters, but this is having the side effect that your pattern is now trying to match a forward slash followed by a quote. As this never occurs in your text, no replacements are being made. You should be doing this:
gsub("’", "'", x, ignore.case = TRUE)
Follow the link below for a demo which shows that using the above gsub calls works as you expect.
Demo
Was about to say the same thing.
Try using str_replace from the stringr package; you will not need to use slashes.
I was facing a similar problem. Somehow none of the solutions worked for me. So I devised an indirect way of doing it by identifying the apostrophe and replacing it with the required format.
gsub("(\\w)(\\W)(\\w\\s)", "\\1'\\3","sid’s bicycle")
[1] "sid's bicycle"
Hope it helps someone.

R retrieving strings with sub: Why this does not work?

I would like to extract parts of strings. The string is:
> (x <- 'ab/cd efgh "xyz xyz"')
[1] "ab/cd efgh \"xyz xyz\""
Now, I would like first to extract the first part:
> # get "ab/cd efgh"
> sub(" \"[/A-Za-z ]+\"","",x)
[1] "ab/cd efgh"
But I don't succeed in extracting the second part:
> # get "xyz xyz"
> sub("(\"[A-Za-z ]+\")$","\\1",x, perl=TRUE)
[1] "ab/cd efgh \"xyz xyz\""
What is wrong with this code?
Thanks for help.
Your last snippet does not work because you reinsert the whole match back into the result: (\"[A-Za-z ]+\")$ matches and captures ", 1+ letters and spaces, " into Group 1 and \1 in the replacement puts it back.
You may actually get the last part inside quotes by removing all chars other than " at the start of the string:
x <- 'ab/cd efgh "xyz xyz"'
sub('^[^"]+', "", x)
See the R demo
The sub here will find and replace just once. The pattern matches the start of the string (with ^) followed by 1+ chars other than " (with the [^"]+ negated character class), so everything before the first quote is removed.
To get this to work with sub, you have to match the whole string. The help file says
For sub and gsub return a character vector of the same length and with the same attributes as x (after possible coercion to character). Elements of character vectors x which are not substituted will be returned unchanged (including any declared encoding).
So to get this to work with your regex, prepend the sometimes risky catch-all ".*"
sub(".*(\"[A-Za-z ]+\")$","\\1",x, perl=TRUE)
[1] "\"xyz xyz\""
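For comparison, a short sketch of the two styles side by side: sub() with a pattern that spans the whole string, and regmatches() with regexpr(), which returns only the matched part and so needs no catch-all. The variable names are illustrative.

```r
x <- 'ab/cd efgh "xyz xyz"'

# sub(): the pattern has to consume everything that should disappear.
with_sub <- sub('.*("[A-Za-z ]+")$', "\\1", x)

# regmatches(): extracts the match itself, the rest is ignored.
with_regmatches <- regmatches(x, regexpr('"[A-Za-z ]+"$', x))

# Moving the group inside the quotes drops the quotes from the result.
no_quotes <- sub('.*"([A-Za-z ]+)"$', "\\1", x)

writeLines(c(with_sub, with_regmatches, no_quotes))
```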

string manipulation to remove the name of files

I have a list of strings
/temp/123/afedcgid/abc.csv
/temp/123/4388dkfa/abc1.csv
/temp/123/4388dkfa/ab1.csv
I want to remove name of the file from the strings
The results desired are
/temp/123/afedcgid
/temp/123/4388dkfa
/temp/123/4388dkfa
How can I do it? Thanks.
You could try the below,
sub("/[^/]*$", "", x)
It removes all the chars from the last / symbol.
OR
> x <- "/temp/123/afedcgid/abc.csv"
> sub("(.*)/.*", "\\1", x)
[1] "/temp/123/afedcgid"
Here (.*) captures all the chars from the start up to the last / symbol (excluding /). Then the following chars are matched by .*. Replacing the matched chars with the chars inside group 1 will give you the desired output.
Example:
> x <- "/temp/123/afedcgid/abc.csv"
> sub("/[^/]*$", "", x)
[1] "/temp/123/afedcgid"
OR
regmatches(x, gregexpr(".+(?=/)", x, perl=TRUE))
Use this regex to catch the characters you want to replace:
\/\w+\.\w+$
try this demo
Demo
files <- c("/temp/123/afedcgid/abc.csv" ,
"/temp/123/4388dkfa/abc1.csv" , "/temp/123/4388dkfa/ab1.csv")
sub("\\/\\w+\\.\\w+$" , "" , files)
As you may know, in R you need \\ to escape a character in a regex, because the backslash itself has to be escaped inside the string literal.
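If the strings are ordinary file paths, base R's dirname() might be enough on its own, with no regex involved. A small sketch using the three paths from the question, compared against the first sub() answer:

```r
files <- c("/temp/123/afedcgid/abc.csv",
           "/temp/123/4388dkfa/abc1.csv",
           "/temp/123/4388dkfa/ab1.csv")

# dirname() drops the last path component (the file name).
dirs <- dirname(files)

# Regex version from above, for comparison.
dirs_sub <- sub("/[^/]*$", "", files)

dirs
identical(dirs, dirs_sub)  # TRUE for these inputs
```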

How to Convert "space" into "%20" with R

As the title says, I'm trying to figure out how to convert the spaces between words to %20.
For example,
> y <- "I Love You"
How to make y = I%20Love%20You
> y
[1] "I%20Love%20You"
Thanks a lot.
Another option would be URLencode():
y <- "I love you"
URLencode(y)
[1] "I%20love%20you"
gsub() is one option:
R> gsub(pattern = " ", replacement = "%20", x = y)
[1] "I%20Love%20You"
The function curlEscape() from the package RCurl gets the job done.
library('RCurl')
y <- "I love you"
curlEscape(urls=y)
[1] "I%20love%20you"
I like URLencode() but be aware that sometimes it does not work as expected if your url already contains a %20 together with a real space, in which case not even the repeated option of URLencode() is doing what you want.
In my case, I needed to run both URLencode() and gsub consecutively to get exactly what I needed, like so:
a = "encoded%20space/real space.csv"
URLencode(a)
#returns: "encoded%20space/real space.csv"
#note the spaces that are not transformed
URLencode(a, repeated=TRUE)
#returns: "encoded%2520space/real%20space.csv"
#note the %2520 in the first part
gsub(" ", "%20", URLencode(a))
#returns: "encoded%20space/real%20space.csv"
In this particular example, gsub() alone would have been enough, but URLencode() is of course doing more than just replacing spaces.
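To illustrate what "more than just replacing spaces" can mean, here is a brief sketch with utils::URLencode() and its reserved argument, plus URLdecode() for the round trip. The input "a&b c" is a made-up example.

```r
y <- "I Love You"

# Spaces become %20; letters are left alone.
enc <- utils::URLencode(y)

# By default reserved characters such as & are kept as they are;
# reserved = TRUE percent-encodes them too.
enc_default  <- utils::URLencode("a&b c")
enc_reserved <- utils::URLencode("a&b c", reserved = TRUE)

# URLdecode() reverses the encoding.
dec <- utils::URLdecode(enc)

c(enc, enc_default, enc_reserved, dec)
```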

Resources