Extract text in two columns from a string - r

I have a table where one column has data like this:
table$test_string<- "[projectname](https://somewebsite.com/projectname/Abc/xyz-09)"
1.) I am trying to extract the first part of this string, the text within the square brackets, into one column, i.e.
table$project_name <- "projectname"
using the regex:
project_name <- "^\\[|(?:[a-zA-Z]|[0-9])+|\\]$"
table$project_name <- str_extract(table$test_string, project_name)
If I test the regex on a single value (one row of the table), it works using
str_extract_all(table$test_string, project_name)[[1]][2].
However, I get NA when I apply the regex pattern to the whole table, and an error if I use str_extract_all.
2.) Second part of the string, which is a URL in another column,
table$url_link <- "https://somewebsite.com/projectname/Abc/xyz-09"
I am using the following regex expression for URL:
url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
table$url_link <- str_extract(table$test_string, url_pattern)
and this works on the whole table; however, I still get the closing parenthesis ')' at the end of the URL.
What am I missing here? Why does the first regex work on an individual value but not on the whole table?
And for the URL, how do I drop the closing parenthesis?

It feels like you could simplify things considerably by using parentheses to define capture groups. For example:
test_string<- "[projectname](https://somewebsite.com/projectname/Abc/xyz-09)"
regex <- "\\[(.*)\\]\\((.*)\\)"
gsub(regex, "\\1", test_string)
#> [1] "projectname"
gsub(regex, "\\2", test_string)
#> [1] "https://somewebsite.com/projectname/Abc/xyz-09"
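The same capture-group idea carries over to a whole column, because sub() is vectorized. The sketch below is one possible way to fill both columns at once. The data frame tbl and its second row are made up for illustration, and the pattern uses negated character classes instead of .* so that a group cannot run past its own closing bracket.

```r
# Hypothetical data frame standing in for the asker's table
tbl <- data.frame(
  test_string = c(
    "[projectname](https://somewebsite.com/projectname/Abc/xyz-09)",
    "[other](https://example.com/other/page-1)"
  ),
  stringsAsFactors = FALSE
)

# Group 1: text inside [...]; group 2: text inside (...)
pattern <- "^\\[([^]]*)\\]\\(([^)]*)\\)$"

# sub() works element-wise, so each call fills an entire column
tbl$project_name <- sub(pattern, "\\1", tbl$test_string)
tbl$url_link     <- sub(pattern, "\\2", tbl$test_string)

print(tbl[, c("project_name", "url_link")])
```

Note that sub() returns its input unchanged for rows that do not match, so it may be worth checking grepl(pattern, tbl$test_string) first if the column is not uniformly formatted.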

We can make use of convenient functions from qdapRegex
library(qdapRegex)
rm_round(test_string, extract = TRUE)[[1]]
#[1] "https://somewebsite.com/projectname/Abc/xyz-09"
rm_square(test_string, extract = TRUE)[[1]]
#[1] "projectname"
data
test_string<- "[projectname](https://somewebsite.com/projectname/Abc/xyz-09)"

Related

Splicing names in a list, and adding characters to a specific string?

Hello there: I currently have a list of file names (100s) which are separated by multiple "/" at certain points. I would like to find the last "/" in each name and replace it with "/Old". A quick example of what I have tried:
I have managed to do it for a single file name in the list but can't seem to apply it to the whole list.
Test<- "Cars/sedan/Camry"
To find the last "/" in the name, I tried the following:
Last <- tail(gregexpr("/", Test)[[1]], n= 1)
str_sub(Test, Last, Last)<- "/Old"
Which gives me
Test
[1] "Cars/sedan/OldCamry"
Which is exactly what I need, but I am having trouble applying tail and gregexpr to my list of names so that it does it all at the same time.
Thanks for any help!
Apologies for my poor formatting; I am still adjusting.
If your file names are in a character vector you can use str_replace() from the stringr package for this:
items <- c(
"Cars/sedan/Camry",
"Cars/sedan/XJ8",
"Cars/SUV/Cayenne"
)
stringr::str_replace(items, pattern = "([^/]+$)", replacement = "Old\\1")
[1] "Cars/sedan/OldCamry" "Cars/sedan/OldXJ8" "Cars/SUV/OldCayenne"
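For comparison, here is a base R sketch of the same replacement, in case adding a package is not an option. It assumes the same items vector as above and anchors on the last slash rather than on the final path component.

```r
items <- c("Cars/sedan/Camry", "Cars/sedan/XJ8", "Cars/SUV/Cayenne")

# "/([^/]*)$" matches the last slash plus everything after it;
# the captured tail is put back behind "/Old"
renamed <- sub("/([^/]*)$", "/Old\\1", items)
print(renamed)
```

Because sub() is vectorized, no loop over the file names is needed, and an element without any "/" is returned unchanged.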
Here is a stringi function as an alternative.
If your data frame is "df" and your text is in a column named "text":
library(dplyr)
library(stringi)
df %>%
mutate(new_text = stringi::stri_replace_last_fixed(text, '/', '/Old'))

String Manipulation in R data frames

I just learnt R and was trying to clean the Amount_USD column of a table using string manipulation, with the code given below. I could not work out why no changes were made. Please help.
Code:
csv_file2$Amount_USD <- ifelse(str_sub(csv_file$Amount_USD,1,10) == "\\\xc2\\\xa0",
str_sub(csv_file$Amount_USD,12,-1),csv_file2$Amount_USD)
Result:
\\xc2\\xa010,000,000
\\xc2\\xa016,200,000
\\xc2\\xa019,350,000
Expected Result:
10,000,000
16,200,000
19,350,000
You could use the following code, but maybe there is a more compact way:
vec <- c("\\xc2\\xa010,000,000", "\\xc2\\xa016,200,000", "\\xc2\\xa019,350,000")
gsub("(\\\\x[[:alpha:]]\\d\\\\x[[:alpha:]]0)([0-9,]*)", "\\2", vec)
[1] "10,000,000" "16,200,000" "19,350,000"
A compact way to extract the numbers is by using str_extract and negative lookahead:
library(stringr)
str_extract(vec, "(?!0)[\\d,]+$")
[1] "10,000,000" "16,200,000" "19,350,000"
How this works:
(?!0): a negative lookahead that makes sure the match does not start with 0
[\\d,]+$: a character class allowing only digits and commas, occurring one or more times, right up to the end of the string ($)
Alternatively:
str_sub(vec, start = 9)
There were a few minor issues with your code.
The main one is the two unneeded backslashes in your matching string (one before each x). This also leads to a counting error in your first str_sub(), where you should be taking the first 8 characters, not 10. Finally, you should take the substring starting at the character right after the text you want to match (i.e. position 9, not 12). The following should work:
csv_file2$Amount_USD <- ifelse(str_sub(csv_file$Amount_USD,1,8) == "\\xc2\\xa0", str_sub(csv_file$Amount_USD,9,-1),csv_file2$Amount_USD)
However, I would have done this with a more compact gsub than provided above. As long as the text at the start to remove is always going to be "\\xc2\\xa0", you can simply replace it with nothing. Note that for gsub you will need to escape all the backslashes, and hence you end up with:
csv_file2$Amount_USD <- gsub("\\\\xc2\\\\xa0", replacement = "", csv_file2$Amount_USD)
Personally, especially if you plan to do any sort of mathematics with this column, I would go the additional step and remove the commas, and then coerce the column to be numeric:
csv_file2$Amount_USD <- as.numeric(gsub("(\\\\xc2\\\\xa0)|,", replacement = "", csv_file2$Amount_USD))
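As a possible simplification, fixed = TRUE avoids the double layer of escaping altogether, since the prefix is then treated as plain text rather than as a regex. The sketch below assumes the column holds the literal eight characters \xc2\xa0 (which look like the escaped UTF-8 bytes of a non-breaking space) followed by the amount; vec is a stand-in for csv_file2$Amount_USD.

```r
# Each element starts with the literal text: backslash, x, c, 2, backslash, x, a, 0
vec <- c("\\xc2\\xa010,000,000", "\\xc2\\xa016,200,000", "\\xc2\\xa019,350,000")
prefix <- "\\xc2\\xa0"
print(nchar(prefix))  # 8 characters, matching the count used in str_sub() above

# fixed = TRUE treats the pattern as plain text, so the backslashes
# only need the usual R string escaping, not regex escaping as well
cleaned <- sub(prefix, "", vec, fixed = TRUE)

# Drop the thousands separators before converting to numeric
amounts <- as.numeric(gsub(",", "", cleaned, fixed = TRUE))
print(cleaned)
print(amounts)
```

If the real data turn out to contain an actual non-breaking space character instead of this escaped text, the prefix would need to be changed accordingly.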

Replacing string variable with punctuation in R without removing other string

In R, I am having trouble replacing a substring that contains punctuation. I.e., within the string "r.Export", I am trying to replace "r." with "Report.". I've used gsub and below is my code:
string <- "r.Export"
short <- "r."
replacement <- "Report."
gsub(short,replacement,string)
The desired output is: "Report.Export" however gsub seems to replace the second r such that the output is:
Report.ExpoReport.
Using sub() instead is not a solution either because I am doing multiple gsubs where sometimes the string to be replaced is:
short <- "o."
So, then the o's in r.Export are replaced anyway and it becomes a complete mess.
string <- "r.Export"
short <- "r\\."
replacement <- "Report."
gsub(short,replacement,string)
Returns:
[1] "Report.Export"
Or, using fixed=TRUE:
string <- "r.Export"
short <- "r."
replacement <- "Report."
gsub(short,replacement,string, fixed=TRUE)
Returns:
[1] "Report.Export"
Explanation: Without the fixed=TRUE argument, gsub expects a regular expression as first argument. And with regular expressions . is a placeholder for 'any character'. If you want the literal . (period) you have to use either \\. (i.e. escaping the period) or the aforementioned argument fixed=TRUE
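To make the difference concrete, the three variants can be run side by side on the string from the question. This is only an illustrative sketch; the variable names are chosen here for the comparison.

```r
string <- "r.Export"

# Unescaped: "." matches any character, so "rt" in "Export" also matches
regex_any <- gsub("r.", "Report.", string)

# Escaped dot: only a literal "r." matches
regex_literal <- gsub("r\\.", "Report.", string)

# fixed = TRUE: the pattern is plain text, no escaping needed
fixed_literal <- gsub("r.", "Report.", string, fixed = TRUE)

print(c(regex_any, regex_literal, fixed_literal))
```

The first result reproduces the unwanted "Report.ExpoReport." from the question, while the other two give "Report.Export".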
Since your pattern contains a character (.) that has a special meaning in regex, use fixed = TRUE, which matches the string as is.
gsub(short,replacement,string, fixed = TRUE)
#[1] "Report.Export"
I might actually add word boundaries and lookaheads to the mix here, to ensure as targeted a match as possible:
string <- "r.Export"
replacement <- "Report."
output <- gsub("\\br\\.(?=\\w)", replacement, string, perl=TRUE)
output
[1] "Report.Export"
This approach ensures that we only match r. when the r starts a word (it is at the start of the string or preceded by a non-word character such as whitespace), and only when the dot is followed by a word character. Consider the sentence "The project r.Export needed a programmer." We wouldn't want to replace the final r. in this case.
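A small sketch of that sentence shows what the extra anchors buy. The variable names are illustrative only.

```r
sentence <- "The project r.Export needed a programmer."

# Plain fixed matching also hits the "r." that ends "programmer."
loose <- gsub("r.", "Report.", sentence, fixed = TRUE)

# \\b requires the r to start a word; (?=\\w) requires a word character after the dot
targeted <- gsub("\\br\\.(?=\\w)", "Report.", sentence, perl = TRUE)

print(loose)
print(targeted)
```

With fixed matching the sentence ends in "programmeReport.", whereas the targeted pattern leaves "programmer." alone and only rewrites r.Export.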
We can use sub
sub(short,replacement,string, fixed = TRUE)
#[1] "Report.Export"

Extract shortest matching string regex

Minimal Reprex
Suppose I have the string as1das2das3D. I want to extract everything from the letter a to the letter D. There are three different substrings that match this - I want the shortest / right-most match, i.e. as3D.
One solution I know to make this work is stringr::str_extract("as1das2das3D", "a[^a]+D")
Real Example
Unfortunately, I can't get this to work on my real data. In my real data I have string with (potentially) two URLs and I'm trying to extract the one that's immediately followed by rel=\"next\". So, in the below example string, I'd like to extract the URL https://abc.myshopify.com/ZifQ.
foo <- "<https://abc.myshopify.com/YifQ>; rel=\"previous\", <https://abc.myshopify.com/ZifQ>; rel=\"next\""
# what I've tried
stringr::str_extract(foo, '(?<=\\<)https://.*(?=\\>; rel\\="next)') # wrong output
stringr::str_extract(foo, '(?<=\\<)https://(?!https)+(?=\\>; rel\\="next)') # error
You could do:
stringr::str_extract(foo,"https:[^;]+(?=>; rel=\"next)")
[1] "https://abc.myshopify.com/ZifQ"
or even
stringr::str_extract(foo,"https(?:(?!https).)+(?=>; rel=\"next)")
[1] "https://abc.myshopify.com/ZifQ"
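It may help to see why a lazy quantifier alone does not give the shortest match: the regex engine still takes the leftmost starting point. The sketch below uses the minimal string from the question with base R's regmatches() and regexpr(); the variable names are illustrative.

```r
x <- "as1das2das3D"

# Lazy quantifier: the match still starts at the first "a"
lazy <- regmatches(x, regexpr("a.*?D", x, perl = TRUE))

# Negated class: the match may not contain another "a", so only the last "a" works
negated <- regmatches(x, regexpr("a[^a]+D", x, perl = TRUE))

# Tempered dot: the same idea, usable when the start marker is longer than one character
tempered <- regmatches(x, regexpr("a(?:(?!a).)+D", x, perl = TRUE))

print(c(lazy, negated, tempered))
```

The lazy version returns the whole string, while the other two return "as3D". The https(?:(?!https).)+ pattern above is the tempered form applied to a multi-character start marker.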
Would this be an option?
Split the string on ";" or "," followed by whitespace, find the element equal to the target string, and take the URL from the preceding index. Note that the result keeps the angle brackets.
urls <- strsplit(foo, ";\\s+|,\\s+")[[1]]
urls[which(urls == "rel=\"next\"") - 1]
#[1] "<https://abc.myshopify.com/ZifQ>"
Here is another option.
gsub(".+\\, <(.+)>; rel=\"next\"", "\\1", foo, perl = TRUE)
#[1] "https://abc.myshopify.com/ZifQ"

Extract characters from string based on rule (repeated hyphen)

I have a large dataframe with a column that looks something like this:
var <- c("150507-001-0000001", "KMD070515-2-0000001",
"15144KMD01AA-0000001", "Z75Z151222-0000001")
What I want to do is extract part of the string. I want all characters until the second hyphen. So this is what I need:
150507-001
KMD070515-2
15144KMD01AA-0000001
Z75Z151222-0000001
So I know if I only wanted the data before the hyphen I'd do this:
> var <- sub("-.*", "", var)
> var
150507
KMD070515
15144KMD01AA
Z75Z151222
I've also tried a package qdap which kinda gave me what I wanted:
library("qdap")
var <- beg2char(var, "-", 2)
I do get the column I need with the last code; however, something seems to be wrong, because a left_join based on that column doesn't work. I can find a match by copy-pasting in the data view, but left_join doesn't find anything. A left_join on the var made with sub (see above) does work. But for some of my rows I need the characters after the first hyphen (and before the second) to find a match.
Here is a non regex solution, for those who might be interested:
x <- "150507-001-0000001"
paste(strsplit(x, "-")[[1]][1:2], collapse="-")
[1] "150507-001"
If you wanted to apply this logic to your entire vector, then use:
sapply(var, function(x) paste(strsplit(x, "-")[[1]][1:2], collapse="-"))
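One caveat with indexing [1:2] is that a value with no hyphen at all would be padded with NA and come out as "x-NA". A slightly more defensive sketch uses head() and vapply(), which also returns a plain unnamed character vector. The extra element "NOHYPHEN" is a made-up value added here only to show that case.

```r
var <- c("150507-001-0000001", "KMD070515-2-0000001",
         "15144KMD01AA-0000001", "Z75Z151222-0000001",
         "NOHYPHEN")  # hypothetical value without any hyphen

# Split each string on "-", keep at most the first two pieces, and glue them back
first_two <- vapply(
  strsplit(var, "-", fixed = TRUE),
  function(parts) paste(head(parts, 2), collapse = "-"),
  character(1)
)
print(first_two)
```

Strings with only one hyphen are returned whole, matching the expected output in the question.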
We can use sub to match one or more characters that are not a -, followed by a - and another run of characters that are not a -, capture that as a group ((...)), and replace the whole string with the backreference (\\1) to the captured group:
sub("^([^-]+-[^-]+).*", "\\1", var)
#[1] "150507-001" "KMD070515-2"
#[3] "15144KMD01AA-0000001" "Z75Z151222-0000001"

Resources