String comparison with R - r

I have two datasets that I want to link (inner_join) on a common key, which is a string. The problem is that in one of the two datasets the key is incomplete, although the incomplete key is contained in the other one, as in the following example:
key for 1st dataset: PV955--075P412171042--
and for the 2nd: PV955--???P412171042--
The ??? represents digits that are missing. So my question is: can we do some kind of string comparison/inclusion to check whether the characters of my 2nd key are included in my 1st key, and join on that if they are?
Idk if the issue is clear, and thanks for the answers.

It's hard to answer without seeing your data, but you can try this:
library(stringr)
> str_detect("075P412171042","P412171042")
[1] TRUE

In base R with regular expressions:
key1 <- "PV955--075P412171042--"
key2 <- "PV955--???P412171042--"
key2re <- gsub("--...", "--...", key2)
grepl(key2re, key1)
## [1] TRUE
This replaces the 3 unknown characters after "--" with dots, which match any character in a regular expression.
Then grepl checks whether the resulting pattern matches the other key.
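To carry this over to the join itself, here is a minimal sketch in base R. The data frames df1 and df2 and the columns value1 and value2 are hypothetical, made up for illustration. It also assumes that "?" is the only placeholder and that the keys contain no other regex metacharacters.

```r
# Hypothetical example data; column names are assumptions.
df1 <- data.frame(key = c("PV955--075P412171042--", "PV955--123P999999999--"),
                  value1 = c(10, 20), stringsAsFactors = FALSE)
df2 <- data.frame(key = "PV955--???P412171042--",
                  value2 = "a", stringsAsFactors = FALSE)

# Turn each literal "?" into the regex wildcard "." and anchor the pattern
# so that the whole key has to match.
patterns <- paste0("^", gsub("?", ".", df2$key, fixed = TRUE), "$")

# For each incomplete key, collect the df1 rows whose key matches it.
matches <- lapply(seq_along(patterns), function(i) {
  hit <- grep(patterns[i], df1$key)
  if (length(hit) == 0) return(NULL)  # no partner row: dropped, as in an inner join
  data.frame(df1[hit, , drop = FALSE], value2 = df2$value2[i], row.names = NULL)
})
joined <- do.call(rbind, Filter(Negate(is.null), matches))
joined
```

This loops over the incomplete keys, so it is only practical for moderately sized data. For larger tables a dedicated fuzzy-join package may be a better fit.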

Related

In R, how do I remove a specific character (in this case ",") from a column that contains other occurrences of the same character that I don't want to remove?

I have a dataset in which some columns hold monetary values, but judging by the column names and their descriptions, I believe the numbers are represented incorrectly. For example, one value is 5,52,32,974; I believe there is one comma too many, or a comma in the wrong position. Is it possible to remove a particular comma and arrive at a representation such as 55.232.974 (in dollars, say)? The dataset is a .csv file. Thanks in advance.
If I understand correctly, your data is stored as strings.
Then you could use the following code:
a <- c("5,52,32,974", "5,52,32,974", "5,52,32,974")
b <- gsub(",", "", a)
as.numeric(b)
#[1] 55232974 55232974 55232974
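If the dotted display from the question (55.232.974) is wanted as well, one possible sketch is to format the cleaned numbers with base R's format(). Using "." as the thousands separator and "," as the decimal mark is an assumption about the intended locale.

```r
a <- c("5,52,32,974", "5,52,32,974", "5,52,32,974")

# Strip every comma, then convert to numeric as above.
b <- as.numeric(gsub(",", "", a))

# Re-insert a separator every three digits; decimal.mark is set to ","
# so that it does not clash with the "." used as big.mark.
formatted <- format(b, big.mark = ".", decimal.mark = ",")
formatted
```

The result is a character vector meant for display, so keep b around for any arithmetic.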

Extract all numbers from a character string into a SINGLE character string of numbers in the original order [duplicate]

How can I extract digits from a string that can have a structure of xxxx.x or xxxx.x-x and combine them as a number? e.g.
list <- c("1010.1-1", "1010.2-1", "1010.3-1", "1030-1", "1040-1",
"1060.1-1", "1060.2-1", "1070-1", "1100.1-1", "1100.2-1")
The desired (numeric) output would be:
101011, 101021, 101031...
I tried
regexp <- "([[:digit:]]+)"
solution <- str_extract(list, regexp)
However that only extracts the first set of digits; and using something like
regexp <- "([[:digit:]]+\\.[[:digit:]]+\\-[[:digit:]]+)"
returns the first result (data in its initial form) if matched otherwise NA for shorter strings. Thoughts?
Remove all non-digit symbols:
list <- c("1010.1-1", "1010.2-1", "1010.3-1", "1030-1", "1040-1", "1060.1-1", "1060.2-1", "1070-1", "1100.1-1", "1100.2-1")
as.numeric(gsub("\\D+", "", list))
## => [1] 101011 101021 101031 10301 10401 106011 106021 10701 110011 110021
I have no experience with R but I do know regular expressions. When I look at the pattern you're specifying "([[:digit:]]+)". I assume [[:digit:]] stands for [0-9], so you're capturing one group of digits.
It seems to me you're missing a + to make it capture multiple groups of digits.
I'd think you'd need to use "([[:digit:]]+)+".
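For what it is worth, a quantifier on the group still matches only one consecutive run of digits, and str_extract returns only the first match. One way to collect every run and glue the runs together is sketched below in base R. The variable names x and runs are illustrative only.

```r
x <- c("1010.1-1", "1010.2-1", "1030-1")

# gregexpr finds every run of digits; regmatches returns them as a list
# with one character vector per input string.
runs <- regmatches(x, gregexpr("[[:digit:]]+", x))

# Paste the runs of each string together, then convert to numeric.
result <- as.numeric(vapply(runs, paste, character(1), collapse = ""))
result
```

For this input the result is the same as removing all non-digit characters with gsub.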

Extract shortest matching string regex

Minimal Reprex
Suppose I have the string as1das2das3D. I want to extract everything from the letter a to the letter D. There are three different substrings that match this - I want the shortest / right-most match, i.e. as3D.
One solution I know to make this work is stringr::str_extract("as1das2das3D", "a[^a]+D")
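A base R equivalent of that call, shown as a small sketch, makes it easier to see why the negated class works: [^a]+ can never run across another "a", so a match has to begin at the last "a" before the "D".

```r
s <- "as1das2das3D"

# regexpr locates the first successful match; regmatches extracts it.
# Attempts starting at the first two "a"s fail, because [^a]+ stops at the
# next "a" before an uppercase "D" is reached.
shortest <- regmatches(s, regexpr("a[^a]+D", s))
shortest
```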
Real Example
Unfortunately, I can't get this to work on my real data. In my real data I have string with (potentially) two URLs and I'm trying to extract the one that's immediately followed by rel=\"next\". So, in the below example string, I'd like to extract the URL https://abc.myshopify.com/ZifQ.
foo <- "<https://abc.myshopify.com/YifQ>; rel=\"previous\", <https://abc.myshopify.com/ZifQ>; rel=\"next\""
# what I've tried
stringr::str_extract(foo, '(?<=\\<)https://.*(?=\\>; rel\\="next)') # wrong output
stringr::str_extract(foo, '(?<=\\<)https://(?!https)+(?=\\>; rel\\="next)') # error
You could do:
stringr::str_extract(foo,"https:[^;]+(?=>; rel=\"next)")
[1] "https://abc.myshopify.com/ZifQ"
or even
stringr::str_extract(foo,"https(?:(?!https).)+(?=>; rel=\"next)")
[1] "https://abc.myshopify.com/ZifQ"
Would this be an option?
Split the string on ; or , then find the element equal to the target string and take the URL at the preceding index.
urls <- strsplit(foo, ";\\s+|,\\s+")[[1]]
urls[which(urls == "rel=\"next\"") - 1]
#[1] "<https://abc.myshopify.com/ZifQ>"
Here is another option.
gsub(".+\\, <(.+)>; rel=\"next\"", "\\1", foo, perl = TRUE)
#[1] "https://abc.myshopify.com/ZifQ"

R: replacing a table column with a modified version of that column

I am currently using R and I have produced a table with 3 columns. The first column contains names that look like "XXX_YYY_ZZZ", and I only want to keep the "XXX" part. I tried gsub but couldn't get it to work, so I turned to strapplyc(), which works but produces only one column. What I want is to keep my initial table, but with the first column replaced by the strapplyc() output. Any other approach you think would fit better is welcome too!
Thank you in advance.
Since you have not shown any sample data, I am creating a simple example here for testing.
cal1 <- c("XXX_YYY_ZZZ","XXX_YYY_ZZZ")
gsub("_.*","",cal1)
Output will be as follows.
> gsub("_.*","",cal1)
[1] "XXX" "XXX"
Works for me. Here is a regex which looks for three groups of text, separated by underscores. The ^ indicates start of string and $ indicates end of string. I capture first (\\1) group, but there's nothing stopping you from capturing \\2, \\3 or even \\1\\3.
gsub("^(.*)_(.*)_(.*)$", "\\1", "XXX_YYY_ZZZ")
[1] "XXX"
You could also use strsplit.
> strsplit("XXX_YYY_ZZZ", "_")[[1]][1]
[1] "XXX"

Need to count character

I have a dataframe LoopVariable and the following couple of lines of code:
print(unique(LoopVariable[,"Job..R"]))
[1] "14047/2" "18331/3"
My output is two character strings, which is fine. My question now is: how can I count the elements of that output for use in further calculations? In other words, I have two strings and I need their count as an integer. In my example here the integer value would be 2.
Use the length() function for this. You can find more about the function by typing ?length into your console.
This is likely what you should expect:
length(unique(LoopVariable[,"Job..R"]))
[1] 2
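A self-contained sketch of the same idea follows. The contents of LoopVariable are invented here so that the snippet runs on its own.

```r
# Made-up stand-in for the asker's data frame; only the column name
# "Job..R" comes from the question.
LoopVariable <- data.frame(Job..R = c("14047/2", "18331/3", "14047/2"),
                           stringsAsFactors = FALSE)

# unique() drops the repeated job, and length() counts what is left.
n_jobs <- length(unique(LoopVariable[, "Job..R"]))

# The count is an ordinary integer, so it can be used in arithmetic.
n_jobs * 10
```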
