Extracting a character that contains a certain type of element in R - r

For example, let's say I have the following string:
vec <- " #_Jim98 Did you turn off the stove #9am?"
I would like to count the number of words starting with # that contain only letters, numbers, #, and the underscore symbol. In the case above, the count would be 1, since #9am? contains the ? symbol and so is not counted.
Also, it could not be longer than 10 characters.

1) Search for # followed by any number of the allowed characters "\\w" followed by a whitespace character "\\s" or | end of string $. If zero word characters are allowable then change the + to *. The expression is vectorized, i.e. x can be a character vector. No packages are used.
x <- " #_Jim98 Did you turn off the stove #9am?" # test input
pat <- "#\\w+(\\s|$)"
lengths(regmatches(x, gregexpr(pat, x)))
## [1] 1
Note that the reason for regmatches is that gregexpr produces a -1 rather than a zero length vector for no matches whereas regmatches will produce a zero length vector. Thus it works for the edge case of no matches.
2) A slightly more compact solution would be this where pat is from above:
library(gsubfn)
lengths(strapplyc(x, pat))
## [1] 1
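The no-match edge case mentioned in 1) can be checked directly. This is a small sketch in base R; the second test string is made up for illustration and is not part of the question.

```r
x <- c(" #_Jim98 Did you turn off the stove #9am?",
       "no tags here")            # hypothetical second input with no match
pat <- "#\\w+(\\s|$)"
m <- gregexpr(pat, x)

# gregexpr() signals "no match" with a single -1, so every element has length 1
lengths(m)

# regmatches() returns character(0) for the no-match element, so its count is 0
counts <- lengths(regmatches(x, m))
counts
```

Counting the elements of the gregexpr result directly would therefore report one match for a string that has none.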

We can do this with a regular expression. I'm assuming that you are counting words separated by space characters or occurring at the beginning or end of the string. This assumes the # is at the start of the word, and I match a # followed by any number of word characters \\w (letters, digits, and underscores; the extra _ alternative in the pattern is redundant but harmless). You can remove the first (^|\\s) if you don't care about having # at the beginning of the word and would like to count 3 words in, for example, " #_Jim98 Did the Latin#s or tom#domain turn off the stove #9am?"
stringr::str_count(" #_Jim98 Did you turn off the stove #9am?", "(^|\\s)#(\\w|_)*?($|\\s)")
#> [1] 1
Created on 2018-04-12 by the reprex package (v0.2.0).
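One caveat with patterns that consume the delimiting whitespace: when two tags are separated by a single space, that space is used up by the first match, so the second tag can be missed. A possible variant, sketched here in base R with PCRE lookarounds, checks the boundaries without consuming them. The helper name count_tags and the "#a #b" input are made up for illustration, and this version requires at least one word character after the #.

```r
# Count "#word" tokens delimited by whitespace or the string boundaries.
# (?<!\\S) and (?!\\S) are zero-width, so one space can delimit two tags.
count_tags <- function(x) {
  pat <- "(?<!\\S)#\\w+(?!\\S)"
  lengths(regmatches(x, gregexpr(pat, x, perl = TRUE)))
}

count_tags(" #_Jim98 Did you turn off the stove #9am?")
count_tags("#a #b")   # made-up input with two adjacent tags
```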

Related

regex to find the position of the first four concurrent unique values

I've solved 2022 Advent of Code day 6, but was wondering if there was a regex way to find the first occurrence of 4 non-repeating characters:
From the question:
bvwbjplbgvbhsrlpgdmjqwftvncz
bvwbjplbgvbhsrlpgdmjqwftvncz
# discard as repeating letter b
bvwbjplbgvbhsrlpgdmjqwftvncz
# match the 5th character, which signifies the end of the first four character block with no repeating characters
in R I've tried:
txt <- "bvwbjplbgvbhsrlpgdmjqwftvncz"
str_match("(.*)\1", txt)
But I'm having no luck
You can use
stringr::str_extract(txt, "(.)(?!\\1)(.)(?!\\1|\\2)(.)(?!\\1|\\2|\\3)(.)")
See the regex demo. Here, each (.) captures any char into consecutively numbered groups, and the (?!...) negative lookaheads make sure each subsequent . does not match the already captured char(s).
See the R demo:
library(stringr)
txt <- "bvwbjplbgvbhsrlpgdmjqwftvncz"
str_extract(txt, "(.)(?!\\1)(.)(?!\\1|\\2)(.)(?!\\1|\\2|\\3)(.)")
## => [1] "vwbj"
Note that stringr::str_match (like stringr::str_extract) takes the input string as the first argument and the regex as the second argument.
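Since the question asks for a position rather than the matched text, the same pattern can also be used with base R's regexpr, which reports where the match starts. This is a sketch under the assumption that the wanted answer is the index of the last character of the first run of four distinct characters.

```r
txt <- "bvwbjplbgvbhsrlpgdmjqwftvncz"
pat <- "(.)(?!\\1)(.)(?!\\1|\\2)(.)(?!\\1|\\2|\\3)(.)"

# regexpr() returns the 1-based start of the first match;
# backreferences inside lookaheads need perl = TRUE
m <- regexpr(pat, txt, perl = TRUE)
start <- as.integer(m)
end <- start + attr(m, "match.length") - 1L

start                      # where the four distinct characters begin
end                        # position of the fourth distinct character
substr(txt, start, end)
```

For this input the match starts at position 2, so the marker ends at position 5, in line with the "5th character" in the question.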

Matching character followed by exactly 1 digit

I need to align the formatting of some clinical trial IDs to merge two databases. For example, in database A patient 123 visit 1 is stored as '123v01' and in database B just '123v1'.
I can match A to B by grep match those containing 'v0' and strip out the trailing zero to just 'v', but for academic interest & expanding R / regex skills, I want to reverse match B to A by matching only those containing 'v' followed by only 1 digit, so I can then separately pad that digit with a leading zero.
For a reprex:
string <- c("123v1", "123v01", "123v001")
I can match those with >= 2 digits following a 'v', then inverse subset
> idx <- grepl("v(\\d{2})", string)
> string[!idx]
[1] "123v1"
But there must be a way to match 'v' followed by just a single digit only? I have tried the lookarounds
# Negative look ahead "v not followed by 2+ digits"
grepl("v(?!\\d{2})", string)
# Positive look behind "single digit following v"
grepl("(?<=v)\\d{1})", string)
But both return an 'invalid regex' error
Any suggestions?
You need to set the perl=TRUE flag on your grepl function.
e.g.
grepl("v(?!\\d{2})", string, perl=TRUE)
[1] TRUE FALSE FALSE
See this question for more info.
You may use
grepl("v\\d(?!\\d)", string, perl=TRUE)
The v\d(?!\d) pattern matches v, then one digit, and then makes sure there is no digit immediately to the right of the current location (i.e. after the v + 1 digit).
See the regex demo.
Note that you need to enable PCRE regex flavor with the perl=TRUE argument.
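The padding step mentioned in the question can be done in the same call by capturing the single digit. A minimal sketch, reusing the question's example vector:

```r
string <- c("123v1", "123v01", "123v001")

# Capture the lone digit after "v" and re-insert it behind a "0";
# the lookahead (?!\\d) leaves multi-digit visit numbers untouched
padded <- sub("v(\\d)(?!\\d)", "v0\\1", string, perl = TRUE)
padded
```

Only "123v1" is changed here, because in the other two strings the digit right after v is followed by another digit.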

Stringr function or gsub() to find an x digit string and extract first x digits?

Regex and stringr newbie here. I have a data frame with a column from which I want to find 10-digit numbers and keep only the first three digits. Otherwise, I want to just keep whatever is there.
So to make it easy let's just pretend it's a simple vector like this:
new<-c("111", "1234567891", "12", "12345")
I want to write code that will return a vector with elements: 111, 123, 12, and 12345. I also need to write code (I'm assuming I'll do this iteratively) where I extract the first two digits of a 5-digit string, like the last element above.
I've tried:
gsub("\\d{10}", "", new)
but I don't know what I could put for the replacement argument to get what I'm looking for. Also tried:
str_replace(new, "\\d{10}", "")
But again I don't know what to put in for the replacement argument to get just the first x digits.
Edit: I disagree that this is a duplicate question because it's not just that I want to extract the first X digits from a string but that I need to do that with specific strings that match a pattern (e.g., 10 digit strings.)
If you are willing to use stringr, the package that provides the str_replace you are already using, just use str_extract:
library(stringr)
vec <- c(111, 1234567891, 12)
str_extract(vec, "^\\d{1,3}")
The regex ^\\d{1,3} matches between 1 and 3 digits at the very beginning of the string. str_extract, as the name implies, extracts and returns these matches. Note that this keeps the first three digits of any input, so a 5-digit value such as "12345" would also be cut to "123".
You may use
new<-c("111", "1234567891", "12")
sub("^(\\d{3})\\d{7}$", "\\1", new)
## => [1] "111" "123" "12"
See the R online demo and the regex demo.
Details
^ - start of string anchor
(\d{3}) - Capturing group 1 (this value is accessed using \1 in the replacement pattern): three digit chars
\d{7} - seven digit chars
$ - end of string anchor.
So, the sub command only matches strings that are composed only of 10 digits, captures the first three into a separate group, and then replaces the whole string (as it is the whole match) with the three digits captured in Group 1.
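The question also asks for the first two digits of 5-digit strings. Following the same idea, the two rules can be applied one after the other. This is a sketch on the question's full example vector; the intermediate names step1 and step2 are only for illustration.

```r
new <- c("111", "1234567891", "12", "12345")

# Step 1: strings of exactly 10 digits keep their first three digits
step1 <- sub("^(\\d{3})\\d{7}$", "\\1", new)

# Step 2: strings of exactly 5 digits keep their first two digits
step2 <- sub("^(\\d{2})\\d{3}$", "\\1", step1)

step1
step2
```

Because both patterns are anchored with ^ and $, strings of any other length pass through unchanged.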
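You can use substring, where my_vec is the input vector (this keeps at most three characters of every element, not only of the 10-digit ones):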
as.numeric(substring(my_vec,1,3))
#[1] 111 123 12

Replace number with a random number of same amount of digits

I have a string containing some numbers and want to replace every single digit with a single random digit.
E.g. "111" should be replaced with 3 random numbers between 0-9 that are concatenated like "364".
My idea was to match a number, get the number of digits, calculate as many random numbers and concatenate them to finally replace my matched number:
test <- "this is 1 example 123. I like the no.37"
gsub("([0-9])", paste0(sample(0:9, nchar("\\1")), collapse = ""), test)
My goal would be to have a string where every single digit is replaced by a random digit. E.g.
"this is 3 an example 628. I like the no.09"
I tried some approaches but can't find a good solution.
Use the gsubfn library, it will make things simpler:
library(gsubfn)
test <- "this is 1 example 123. I like the no.37"
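# replace = TRUE lets digits repeat and avoids an error for runs longer than 10 digits
gsubfn("[0-9]+", ~ paste0(sample(0:9, nchar(x), replace = TRUE), collapse = ""), test)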
[1] "this is 8 example 205. I like the no.37"
Here, gsubfn matches every run of one or more digits in the string (see the [0-9]+ pattern). Each match is then passed to the callback as x, where nchar(x) gives the length of the matched digit substring, so the same number of random digits is generated.
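If an extra package is not wanted, a similar result can probably be obtained in base R with the replacement form of regmatches. This is only a sketch: it matches each digit individually instead of whole runs, and the seed is arbitrary.

```r
test <- "this is 1 example 123. I like the no.37"

set.seed(42)                          # arbitrary seed, only for reproducibility
m <- gregexpr("[0-9]", test)          # one match per digit
n <- lengths(regmatches(test, m))     # number of digits found

# Draw with replacement so digits may repeat, then write them back in place
out <- test
regmatches(out, m) <- list(as.character(sample(0:9, n, replace = TRUE)))
out
```

Everything that is not a digit stays where it was, so the result has the same length and layout as the input.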

Retain string till character limit with last complete word, and store remaining words in 2nd variable

Take these example strings. I want to split them such that the length is limited to X or fewer characters, a complete word is at the end of each string, and the remaining part is stored in another column. The words are always separated by a space. I came across this partial solution in TSQL (it doesn't create a variable for the extra words), but I need to do it in R. I was given the first half of a solution in a previous question, but it doesn't put the remaining words in new variables. I need help creating the new variable.
{gsub(patt="(^.{2,100})([ ].+)", repl="\\1",y)}
For example:
XOVEW VJIEW NI **stays** XOVEW VJIEW NI (assuming X is 14)
XOVEW VJIEW NIGOI **becomes** XOVEW VJIEW (NIGOI goes to a new vector)
XOVEW VJIEWNIGOI **becomes** XOVEW (assuming X is 14)
So new variable will contain c("NIGOI","VJIEWNIGOI") coming from 2nd and 3rd row above.
v1 <- ifelse( nchar(vect) > 14, gsub( "(.*)\\s+(\\w+)", "\\1 - \\2", vect),vect);
values <- data.frame(do.call('rbind', lapply(strsplit(v1,split="-"), `length<-`,2)));
Output:
[,1] [,2]
[1,] "XOVEW VJIEW NI" NA
[2,] "XOVEW VJIEW " " NIGOI"
[3,] "XOVEW " " VJIEWNIGOI"
I first check with nchar whether each string is longer than 14 characters (see ?nchar if you want the details).
Wherever it is longer than 14, I create a string separated by a dash. This is just to segregate the two parts: the first part is everything before the last word, and the second part is the last word of the statement.
To match these I used a regex: a dot represents any character and a star means zero or more repetitions (together, any run of characters), \\s+ matches one or more whitespace characters, and \\w+ matches one or more word characters. Collectively, the match separates the last word from the rest of the string in cases where the string is longer than 14 characters, as selected by ifelse. The two parts are captured as \\1 and \\2 and written back with a dash between them, where \\1 is everything before the last word and \\2 is the last word of the string.
Finally, do.call is used with rbind (to bind all the rows) and lapply (to pad every element to the same number of columns).
I hope this explains your query.
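An alternative that follows the question's three examples more literally is to let the regex itself find the last word boundary within the limit. The sketch below is one possible approach, not the answer's method: the function name split_at_word, the column names kept and rest, and the choice to return NA when even the first word exceeds the limit are all assumptions made for illustration.

```r
# Split each string into a part of at most max_len characters that ends on a
# complete word, plus the remainder.
split_at_word <- function(y, max_len = 14) {
  # Group 1: up to max_len characters, greedy; it must be followed by
  # whitespace or the end of the string. Group 2: whatever is left.
  pat <- sprintf("^(.{1,%d})(?:\\s+|$)(.*)$", max_len)
  ok <- grepl(pat, y, perl = TRUE)   # FALSE if the first word is too long
  data.frame(
    kept = ifelse(ok, sub(pat, "\\1", y, perl = TRUE), NA_character_),
    rest = ifelse(ok, sub(pat, "\\2", y, perl = TRUE), y),
    stringsAsFactors = FALSE
  )
}

vect <- c("XOVEW VJIEW NI", "XOVEW VJIEW NIGOI", "XOVEW VJIEWNIGOI")
res <- split_at_word(vect, 14)
res
```

The greedy quantifier makes the kept part as long as the limit allows, and backtracking moves it back to the nearest preceding space when the limit falls inside a word.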

Resources