Is it possible to extract words from a string starting with $ in R?
x <- c("$abc", "abc", "$123", "456")
desired results
(case 1)
[1] "$abc" "$123"
or even better (case 2)
[1] "$abc"
Thanks
We can use str_detect from stringr
library(stringr)
x[str_detect(x, "^\\$[A-Za-z]")]
#[1] "$abc" "$AC-DC"
data
x <- c("$abc", "abc", "$123", "456", "$AC-DC", "A-Z")
The base function startsWith returns, for each element, TRUE if the value starts with the given prefix and FALSE otherwise, so you could do something like this:
x[startsWith(x,"$")]
This is Python code rather than R, but the logic carries over:
L = ["$abc", "abc", "$123", "456"]
for i in L:
    # keep only the strings that start with "$"
    if i.startswith("$"):
        print(i)
I created a list named L, then used a for loop to go through its strings one by one and print each one that starts with "$".
Using grep:
x <- c("$abc", "abc", "$123", "456", "$AC-DC", "A-Z")
grep("^\\$[A-Za-z]", x, value=TRUE)
#[1] "$abc" "$AC-DC"
^ means starts with.
\\$ means search for literal $.
[A-Za-z] means any letter.
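The two cases from the question can be compared side by side in base R. This is a minimal sketch, assuming the question's original four-element vector; the names case1 and case2 are only illustrative.

```r
x <- c("$abc", "abc", "$123", "456")

# Case 1: every element that starts with a literal "$"
case1 <- x[startsWith(x, "$")]

# Case 2: "$" must be followed immediately by a letter
case2 <- grep("^\\$[A-Za-z]", x, value = TRUE)

print(case1)  # [1] "$abc" "$123"
print(case2)  # [1] "$abc"
```

startsWith does plain prefix matching, so the "$" needs no escaping there, while the regex version needs "\\$".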
Related
In R, I have some strings with the following pattern of letters and numbers:
A11B3XyC4
A1B14C23XyC16
B14C23XyC16D3
B14C23C16D3
I want to remove the part "Xy" (always the same letters) and when I do this I want to increase the number behind the Letter B by one (everything else should stay the same).
When there is no "Xy" in the string there is no change to the string
The result should look like this:
A11B4C4
A1B15C23C16
B15C23C16D3
B14C23C16D3
Could you point me to a function capable of this? I struggle with doing a calculation (x+1) with a string.
Thank you!
We can wrap the replacement in case_when so that only strings containing 'Xy' are changed: str_remove drops the 'Xy', and str_replace then increments the number that follows 'B'.
library(stringr)
library(dplyr)
# convert back to character so the replacement is a string
inc <- function(x) as.character(as.numeric(x) + 1)
case_when(str_detect(str1, "Xy") ~ str_replace(str_remove(str1, "Xy"), "(?<=B)(\\d+)", inc),
          TRUE ~ str1)
[1] "A11B4C4" "A1B15C23C16" "B15C23C16D3" "B14C23C16D3"
data
str1 <- c("A11B3XyC4", "A1B14C23XyC16", "B14C23XyC16D3", "B14C23C16D3")
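The same idea can be sketched in base R without stringr or dplyr. This is only an illustration under the assumption that every string containing "Xy" also has a "B" followed by digits; the names has_xy, changed and result are made up for the example.

```r
str1 <- c("A11B3XyC4", "A1B14C23XyC16", "B14C23XyC16D3", "B14C23C16D3")

has_xy <- grepl("Xy", str1, fixed = TRUE)

# Work only on the strings that contain "Xy", with "Xy" removed
changed <- sub("Xy", "", str1[has_xy], fixed = TRUE)

# Locate the digits that directly follow "B" (lookbehind needs perl = TRUE)
m <- regexpr("(?<=B)[0-9]+", changed, perl = TRUE)

# Replace each matched number with itself plus one
regmatches(changed, m) <- as.character(as.numeric(regmatches(changed, m)) + 1)

# Strings without "Xy" are carried over untouched
result <- str1
result[has_xy] <- changed
print(result)  # [1] "A11B4C4" "A1B15C23C16" "B15C23C16D3" "B14C23C16D3"
```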
I have a column as below.
9453, 55489, 4588, 18893, 4457, 2339, 45489HQ, 7833HQ
I would like to add a leading zero if the number has fewer than 5 digits. However, some numbers have "HQ" at the end and some don't. (I did check other posts, but they don't have a similar problem with the "HQ" part.)
So the desired output is:
09453, 55489, 04588, 18893, 04457, 02339, 45489HQ, 07833HQ
Any idea how to do this? Thank you so much for reading my post!
A one-liner using regular expressions:
my_strings <- c("9453", "55489", "4588",
                "18893", "4457", "2339", "45489HQ", "7833HQ")
gsub("^([0-9]{1,4})(HQ|$)", "0\\1\\2", my_strings)
# [1] "09453"   "55489"   "04588"   "18893"   "04457"   "02339"   "45489HQ" "07833HQ"
Explanation:
^ start of string
[0-9]{1,4} one to four digits in a row
(HQ|$) the string "HQ" or the end of the string
Parentheses represent capture groups in order. So 0\\1\\2 means 0 followed by the first capture group [0-9]{1,4} and the second capture group HQ|$.
Of course, if there are five digits, the regex doesn't match, so the string is left unchanged. Note that this adds exactly one zero, which is enough for the four-digit inputs here.
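To see what the two capture groups actually hold, regexec() can list them. The sketch below reuses the answer's pattern; the extra input "123" is an assumption added only to show that the replacement prepends a single zero.

```r
pattern <- "^([0-9]{1,4})(HQ|$)"

# regexec() returns the full match followed by each capture group
groups <- regmatches("7833HQ", regexec(pattern, "7833HQ"))[[1]]
print(groups)  # [1] "7833HQ" "7833"   "HQ"

# The replacement prepends a single "0", so a three-digit input
# ends up four characters wide rather than five
short <- gsub(pattern, "0\\1\\2", c("7833HQ", "123"))
print(short)   # [1] "07833HQ" "0123"
```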
I was going to use the sprintf approach, but found that the stringr package provides a very easy solution.
library(stringr)
x <- c("9453", "55489", "4588", "18893", "4457", "2339", "45489HQ", "7833HQ")
x
[1] "9453" "55489" "4588" "18893" "4457" "2339" "45489HQ" "7833HQ"
This can be padded with a single call to stringr::str_pad():
stringr::str_pad(x, 5, side="left", pad="0")
[1] "09453" "55489" "04588" "18893" "04457" "02339" "45489HQ" "7833HQ"
If the number needs to be padded even when the whole string is already at least 5 characters wide (as with "7833HQ" above), then the number and text need to be separated with a regex.
The following will work. It combines regex matching with the very helpful sprintf() function:
sprintf("%05.0f%s",  # zero-padded number (%05.0f) followed by the text (%s)
        as.numeric(gsub("^(\\d+).*", "\\1", x)),      # get the leading number
        gsub("[[:digit:]]+([a-zA-Z]*)$", "\\1", x))   # get just the text at the end
[1] "09453" "55489" "04588" "18893" "04457" "02339" "45489HQ" "07833HQ"
Another attempt, which will also work in cases like "123" or "1HQR":
x <- c("18893","4457","45489HQ","7833HQ","123", "1HQR")
regmatches(x, regexpr("^\\d+", x)) <- sprintf("%05d", as.numeric(sub("\\D+$","",x)))
x
#[1] "18893" "04457" "45489HQ" "07833HQ" "00123" "00001HQR"
This finds the digits at the start of each string (^\\d+) and replaces them with a zero-padded version produced by sprintf. The number passed to sprintf is obtained by stripping any non-digit characters (\\D+$) from the end of the string.
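The regmatches approach can be wrapped in a small function. The helper name pad_leading_digits and its width argument are hypothetical, and the sketch assumes the leading number fits in an R integer; elements without leading digits are returned unchanged.

```r
# Hypothetical helper: zero-pad the leading run of digits to `width`
pad_leading_digits <- function(x, width = 5) {
  m <- regexpr("^[0-9]+", x)              # leading digits, -1 where absent
  digits <- as.integer(regmatches(x, m))  # only the elements that matched
  regmatches(x, m) <- formatC(digits, width = width, flag = "0", format = "d")
  x
}

padded <- pad_leading_digits(c("4457", "7833HQ", "123", "1HQR", "HQ"))
print(padded)  # [1] "04457" "07833HQ" "00123" "00001HQR" "HQ"
```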
We can use only sprintf() and gsub() by splitting up the parts then putting them back together.
sprintf("%05d%s", as.numeric(gsub("[^0-9]+", "", x)), gsub("[0-9]+", "", x))
# [1] "18893" "04457" "45489HQ" "07833HQ" "00123" "00001HQR"
Using @thelatemail's data:
x <- c("18893", "4457", "45489HQ", "7833HQ", "123", "1HQR")
I want to get a vector of the words within a string in R that begin with $`GPE.
This is what I tried:
grep(pattern = "$`GPE", x = GPE_string, value = TRUE)
However it returned: character(0)
You can do this using str_extract_all in stringr:
library(stringr)
str_extract_all(GPE_string, "(\\$`GPE.+?)\\b")
Explanation:
The $ in the pattern needs to be escaped with \\ because an unescaped $ is the end-of-string anchor
The part enclosed in (...) will be extracted
\\b means word boundary, and .+? means one or more characters, matched lazily (as few as possible)
The result of str_extract_all is a list with one character vector of matches for each string in the input vector.
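The effect of escaping can be checked in base R as well. This is a rough sketch: the contents of GPE_string and txt are assumed sample data, and the character class [A-Za-z_] is an assumption about what counts as part of a word.

```r
GPE_string <- c("$`GPE_Hello", "$`GPEWorld", "Hello", "World")

# Unescaped "$" is an end-of-string anchor, so nothing can match
unescaped <- grep("$`GPE", GPE_string, value = TRUE)

# Escaping "$" makes it a literal character
escaped <- grep("\\$`GPE", GPE_string, value = TRUE)

# fixed = TRUE treats the whole pattern as a literal string
literal <- grep("$`GPE", GPE_string, value = TRUE, fixed = TRUE)

# Pulling the matching words out of one longer string
txt <- "first $`GPE_Hello then $`GPEWorld and plain Hello"
words <- regmatches(txt, gregexpr("\\$`GPE[A-Za-z_]*", txt))[[1]]
print(words)  # [1] "$`GPE_Hello" "$`GPEWorld"
```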
You need to escape the $, and in an R string the backslash itself has to be doubled.
Try
grep(pattern="\\$`GPE", x=GPE_string, value=TRUE)
If you're only looking for the words that start with "$`GPE", you can do:
GPE_string[startsWith(GPE_string, "$`GPE")]
So for example,
> GPE_string<- c("$`GPE_Hello", "$`GPEWorld", "Hello", "World")
> GPE_string
[1] "$`GPE_Hello" "$`GPEWorld" "Hello" "World"
> GPE_string[startsWith(GPE_string, "$`GPE")]
[1] "$`GPE_Hello" "$`GPEWorld"
I know this should be simple, but I cannot return a subset of characters from a string using regex in R.
Foo <- 'propertyid=R206411&state_id='
Reg <- 'propertyid=(.*)&state_id='
Test <- grep(pattern=Reg, x=Foo, value=TRUE)
This captures the entire string for me and I want to capture just the R206411. The string I want to capture might vary in length and content, so the key is to have the capture begin after the '=' in propertyid=, and then end the capture once it sees the '&' in '&state_id'.
Thanks for your time.
You can use positive lookbehind and lookahead assertions (these require perl=TRUE) like this:
Foo <- 'propertyid=R206411&state_id='
Reg <- gregexpr('(?<=propertyid=).*(?=&state_id=)', Foo, perl=TRUE)
regmatches(Foo, Reg)
Well, grep doesn't play well with captured groups which is what you are trying to do. What you probably want is gsub
Foo <- 'propertyid=R206411&state_id='
Reg <- 'propertyid=(.*)&state_id='
gsub(Reg, "\\1", Foo)
# [1] "R206411"
Here we take your pattern and replace the match with "\1", which stands for the first capture group (the part in parentheses); because R requires backslashes in strings to be escaped, it is written "\\1". Since the pattern matches the whole string, the whole string is replaced with just the captured portion.
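Another base R route is regexec(), which keeps the capture groups separate from the full match. A minimal sketch using Foo and Reg from the question; the names parts, captured and unmatched, and the input "state_id=only", are only illustrative.

```r
Foo <- 'propertyid=R206411&state_id='
Reg <- 'propertyid=(.*)&state_id='

# regexec() keeps the capture groups: element 1 is the full match,
# element 2 is the first group
parts <- regmatches(Foo, regexec(Reg, Foo))[[1]]
captured <- parts[2]
print(captured)   # [1] "R206411"

# When the pattern does not match, sub() returns the input unchanged
unmatched <- sub(Reg, "\\1", "state_id=only")
print(unmatched)  # [1] "state_id=only"
```

The second call shows one thing to watch for with the sub/gsub approach: a non-matching input comes back as-is rather than as an empty result.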
The strapplyc function in the gsubfn package can do exactly that. Using Foo and Reg from the question:
> library(gsubfn)
>
> strapplyc(Foo, Reg, simplify = TRUE)
[1] "R206411"
I have strings that look like this.
x <- c("P2134.asfsafasfs","P0983.safdasfhdskjaf","8723.safhakjlfds")
I need to end up with:
"2134", "0983", and "8723"
Essentially, I need to extract the first four characters that are numbers from each element. Some begin with a letter (disallowing me from using a simple substring() function).
I guess technically, I could do something like:
x <- gsub("^P","",x)
x <- substr(x,1,4)
But I want to know how I would do this with regex!
You could use str_match from the stringr package:
library(stringr)
print(c(str_match(x, "\\d\\d\\d\\d")))
# [1] "2134" "0983" "8723"
You can do this with sub too.
> sub('.?([0-9]{4}).*', '\\1', x)
[1] "2134" "0983" "8723"
I used sub instead of gsub to ensure that only the first match is replaced. .? matches any single character and is optional (a plain . would fail for the element without the leading P). The () define a group that I reference in the replacement as '\\1'; if there were more sets of (), I could reference them with '\\2' and so on. Inside the group, where your syntax was already correct, I ask for digits only and exactly 4 of them. The final .* matches zero or more trailing characters of any type.
Your syntax was working, but you were replacing something with itself so you wind up with the same output.
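If replacing is not needed at all, the match can be extracted directly. A small sketch with base R's regexpr() and regmatches(), using the question's x; the name first_four is only illustrative.

```r
x <- c("P2134.asfsafasfs", "P0983.safdasfhdskjaf", "8723.safhakjlfds")

# regexpr() finds the first run of four digits in each element;
# regmatches() returns just the matched text
first_four <- regmatches(x, regexpr("[0-9]{4}", x))
print(first_four)  # [1] "2134" "0983" "8723"
```

Unlike sub, this drops elements that contain no match, so the result can be shorter than the input.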
This will get you the first four digits of a string, regardless of where in the string they appear.
mapply(function(x, m) paste0(x[m], collapse=""),
       strsplit(x, ""),
       lapply(gregexpr("\\d", x), "[", 1:4))
Breaking it down into pieces:
# this will get you a list of matches of digits, and their location in each x
matches <- gregexpr("\\d", x)
# this keeps only the positions of the first four digits
matches <- lapply(matches, "[", 1:4)
# individual characters of x
splits <- strsplit(x, "")
# get the appropriate string
mapply(function(x, m) paste0(x[m], collapse=""), splits, matches)
Another group capturing approach that doesn't assume 4 numbers.
x <- c("P2134.asfsafasfs","P0983.safdasfhdskjaf","8723.safhakjlfds")
gsub("(^[^0-9]*)(\\d+)([^0-9].*)", "\\2", x)
## [1] "2134" "0983" "8723"