I can't figure out how to get this regex to work.
My sample data vector looks like this:
claims40 1.1010101
clinical41 391.1
...
It follows the pattern of:
a name,
followed with no spaces by a version number, and
then various other numbers.
I'm trying to create a new column in the data frame with just the name, which can be a variable amount of characters.
So the new column should look like:
claims
clinical
...
When I try to use the expression:
^(.*?)\\d
in regexp, I don't get the correct character match length.
Question: What is the correct regex to capture everything in a string prior to the first number?
gsub("[^a-zA-Z]", "", c("claims40 1.1010101", "clinical41 391.1"))
# [1] "claims" "clinical"
Also this posix style:
gsub("[^[:alpha:]]", "", c("claims40 1.1010101", "clinical41 391.1"))
# [1] "claims" "clinical"
If you specifically want to match until the first digit, you could also do this
gsub("^(.+?)(?=\\d).*", "\\1", c("claims40 1.1010101", "clinical41 391.1"), perl = TRUE)
[1] "claims" "clinical"
Also with str_extract from stringr:
stringr::str_extract(c("claims40 1.1010101", "clinical41 391.1"), "^[[:alpha:]]+")
# [1] "claims" "clinical"
This "extracts" the alphabetical characters instead of removing everything else.
Related
I have a table where one column has data like this:
table$test_string<- "[projectname](https://somewebsite.com/projectname/Abc/xyz-09)"
1.) I am trying to extract the first part of this string within the square brackets in one column, i.e.
table$project_name <- "projectname"
using the regex:
project_name <- "^\\[|(?:[a-zA-Z]|[0-9])+|\\]$"
table$project_name <- str_extract(table$test_string, project_name)
If I test the regex on 1 value (1 row individually) of the table, the above regex works with using
str_extract_all(table$test_string, project_name[[1]][2]).
However, I get NA when I apply the regex pattern to the whole table and an error if I use str_extract_all.
2.) Second part of the string, which is a URL in another column,
table$url_link <- "https://somewebsite.com/projectname/Abc/xyz-09"
I am using the following regex expression for URL:
url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
table$url_link <- str_extract(table$test_string, url_pattern)
and this works on the whole table, however, I still get the ')' last paranthesis in the url link.
What am I missing here? and why does the first regex work individually and not on the whole table?
and for the url, how do I not get the last paranthesis?
It feels like you could simplify things considerably by using parentheses to group capture. For example:
test_string<- "[projectname](https://somewebsite.com/projectname/Abc/xyz-09)"
regex <- "\\[(.*)\\]\\((.*)\\)"
gsub(regex, "\\1", test_string)
#> [1] "projectname"
gsub(regex, "\\2", test_string)
#> [1] "https://somewebsite.com/projectname/Abc/xyz-09"
We can make use of convenient functions from qdapRegex
library(qdapRegex)
rm_round(test_string, extract = TRUE)[[1]]
#[1] "https://somewebsite.com/projectname/Abc/xyz-09"
rm_square(test_string, extract = TRUE)[[1]]
#[1] "projectname"
data
test_string<- "[projectname](https://somewebsite.com/projectname/Abc/xyz-09)"
I have a set of 'filename.extension's and I want to extract just the filename. I am having trouble extracting the full filename when the filename shares a character with the file extension. for example, the filename.extension "qrs.sas7bdat" has
filename="qrs"
extension="sas7bdat"
In this case one may observe that the filename shares in common with the extension the character "s".
Here's some R code to give more context:
files_sas <- c("abc.sas7bdat","qrs.sas7bdat")
stringr::str_extract(files_sas,"(?:.*|.*s)[^\\.sas7bdat]")
This set of code returns the following character vector:
"abc" "qr"
This is not what I want -- the desired result I want follows:
c("abc","qrs")
It looks like I'm close, and so I am hoping someone might be able to help me get my desired result.
Many thanks.
We can use sub to match the . (. is a metacharacter that matches any character, so we escape (\\) iit, followed by other character (.*), in the replacement, we can specify blank ("")
sub("\\..*", "", files_sas)
#[1] "abc" "qrs"
Or with stringr
library(stringr)
str_remove(files_sas, "\\..*")
Or with file_path_sans_ext
tools::file_path_sans_ext(files_sas)
#[1] "abc" "qrs"
I have a column of texts look like below:
str1 = "ABCID 123456789 is what I'm looking for, could you help me to check this Item's status?"
I want to use gsub function in R to extract "ABCID 123456789" from there. The number might change with different numbers, but ABCID is a constant. Can someone know the solution with that please? Thanks very much!
We can use str_extract to select the fixed word followed by space and one or more numbers (\\d+)
library(stringr)
str_extract(df1$col1, "ABCID \\d+")
If there are multiple instances, use str_extract_all
str_extract_all(df1$col1, "ABCID \\d+")
NOTE: The OP states that to extract "ABCID 123456789" from there
If the number has constant length (9) you could you use positive lookbehind:
sub("(?<=ABCID \\d{9}).*", "", str1, perl = TRUE)
# [1] "ABCID 123456789"
Match the beginning of string (^) leading letters (ABCID), a space, digits (\d+) and everything else (.*) and replace it all with the captured portion, i.e. the portion within parentheses. Note that we want to use sub, not gsub, here because there is only one substitution.
sub("^(ABCID \\d+).*", "\\1", str1)
## [1] "ABCID 123456789"
How to selectively remove characters from a string following a pattern?
I wish to remove the 7 figures and the preceding colon.
For example:
"((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950"
should become
"((Northern_b,Tropical_b)N19"
x <- "((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950"
gsub("[:]\\d{1}[.]\\d{6}", "", x)
The gsub function does string replacement and replaces all matched found in the string (see ?gsub). An alternative, if you want something with a more friendly names is str_replace_all from the stringr package.
The regular expression makes use of the \\d{n} search, which looks for digits. The integer indicates the number of digits to look for. So \\d{1} looks for a set of digits with length 1. \\d{6} looks for a set of digits of length 6.
gsub('[:][0-9.]+','',x)
[1] "((Northern_b,Tropical_b)N19"
Another approach to solve this problem
library(stringr)
str1 <- c("((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950")
str_replace_all(str1, regex("(:\\d{1,}\\.\\d{1,})"), "")
#[1] "((Northern_b,Tropical_b)N19"
I have a character vector that contains text similar to the following:
text <- c("ABc.def.xYz", "ge", "lmo.qrstu")
I would like to remove everything before a .:
> "xYz" "ge" "qrstu"
However, the grep function seems to be treating . as a letter:
pattern <- "([A-Z]|[a-z])+$"
grep(pattern, text, value = T)
> "ABc.def.xYz" "ge" "lmo.qrstu"
The pattern works elsewhere, such as on regexpal.
How can I get grep to behave as expected?
grep is for finding the pattern. It returns the index of the vector that matches a pattern. If, value=TRUE is specified, it returns the value. From the description, it seems that you want to remove substring instead of returning a subset of the initial vector.
If you need to remove the substring, you can use sub
sub('.*\\.', '', text)
#[1] "xYz" "ge" "qrstu"
As the first argument, we match a pattern i.e. '.*\\.'. It matches one of more characters (.*) followed by a dot (\\.). The \\ is needed to escape the . to treat it as that symbol instead of any character. This will match until the last . character in the string. We replace that matched pattern with a '' as the replacement argument and thereby remove the substring.
grep doesn't do any replacements. It searches for matches and returns the indices (or the value if you specify value=T) that give a match. The results you're getting are just saying that those meet your criteria at some point in the string. If you added something that doesn't meet the criteria anywhere into your text vector (for example: "9", "#$%23", ...) then it wouldn't return those when you called grep on it.
If you want it just to return the matched portion you should look at the regmatches function. However for your purposes it seems like sub or gsub should do what you want.
gsub(".*\\.", "", text)
I would suggest reading the help page for regexs ?regex. The wikipedia page is a decent read as well but note that R's regexs are a little different than some others. https://en.wikipedia.org/wiki/Regular_expression
You may try str_extract function from stringr package.
str_extract(text, "[^.]*$")
This would match all the non-dot characters exists at the last.
Your pattern does work, the problem is that grep does something different than what you are thinking it does.
Let's first use your pattern with str_extract_all from the package stringr.
library(stringr)
str_extract_all(text, pattern ="([A-Z]|[a-z])+$")
[[1]]
[1] "xYz"
[[2]]
[1] "ge"
[[3]]
[1] "qrstu"
Notice that the results came as you expected!
The problem you are having is that grep will give you the complete element that matches you regular expression and not only the matching part of the element. For example, in the example below, grep will return you the first element because it matches "a":
grep(pattern = "a", x = c("abcdef", "bcdf"), value = TRUE)
[1] "abcdef"