Match string between ; and % [duplicate] - r

I wish to extract the decimal value in the string without the % sign. So in this case, I want the numeric 0.45
x <- "document.write(GIC_annual[\"12-17 MTH\"][\"99999.99\"]);0.450%"
str_extract(x, "^;[0-9.]")
My attempt fails. Here's my thinking.
Begin the extraction at the semicolon ^;
Grab any numbers between 0 and 9.
Include the decimal point

You also have this option:
stringr::str_extract(x, "\\d\\.\\d{1,}(?=%)")
[1] "0.450"
The lookahead checks that a % follows the digits; if it does, the digits before it are returned without the % itself.
Details
\\d a digit;
\\. a literal dot;
\\d{1,} one or more digits after the dot;
(?=%) a lookahead that requires a % right after the digits; the % is checked but is not part of the returned match
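The same idea can be sketched in base R, assuming stringr is not available. The sketch also shows why the pattern from the question finds nothing. The variable names m and rate are only illustrative.

```r
x <- "document.write(GIC_annual[\"12-17 MTH\"][\"99999.99\"]);0.450%"

# "^;[0-9.]" cannot match: ^ anchors at the start of the string, and x
# starts with "d", not ";". The class [0-9.] would also match one character only.
grepl("^;[0-9.]", x)
#> [1] FALSE

# Lookahead in base R needs perl = TRUE.
m <- regexpr("\\d\\.\\d+(?=%)", x, perl = TRUE)
rate <- regmatches(x, m)
rate
#> [1] "0.450"
```

The earlier "99999.99" is skipped because it is not followed by %.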

Since you don't want the semicolon in the output, put it in a lookbehind.
stringr::str_extract(x, "(?<=;)[0-9]\\.[0-9]+")
#[1] "0.450"
In base R, using sub:
sub('.*;([0-9]\\.[0-9]+).*', '\\1', x)
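The question asks for the numeric 0.45 rather than the string "0.450", so a conversion step is still needed. A minimal sketch, with rate_chr and rate_num as illustrative names:

```r
x <- "document.write(GIC_annual[\"12-17 MTH\"][\"99999.99\"]);0.450%"

# Extract the text between ";" and "%" with a capture group.
rate_chr <- sub('.*;([0-9]\\.[0-9]+).*', '\\1', x)
rate_chr
#> [1] "0.450"

# Convert to numeric; the trailing zero disappears.
rate_num <- as.numeric(rate_chr)
rate_num
#> [1] 0.45
```

Note that [0-9] before the dot allows a single digit only. If values such as 12.5% can occur (an assumption, the question shows only one example), [0-9]+ would be the safer choice.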


Regex: extracting matches preceding a pattern in R [duplicate]

I'm trying to extract matches preceding a pattern in R. Let's say that I have a vector consisting of the following elements:
my_vector
> [1] "ABCC12|94160" "ABCC13|150000" "ABCC1|4363" "ACTA1|58"
[5] "ADNP2|22850" "ADNP|23394" "ARID1B|57492" "ARID2|196528"
I'm looking for a regular expression to extract all characters preceding the "|". The expected result must be something like this:
my_new_vector
> [1] "ABCC12" "ABCC13" "ABCC1" "ACTA1"
and so on.
I have already tried stringr functions and regular expressions based on lookarounds, but without success.
I would really appreciate any advice or help with this issue.
Thanks in advance!
We could use trimws and pass, as the whitespace argument, a regex that matches the | (a metacharacter, so it is escaped as \\|) followed by zero or more characters (.*). The whitespace argument requires R 3.6.0 or later.
trimws(my_vector, whitespace = "\\|.*")
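A plain sub call does the same job and also works on older R versions. This is a sketch using the first four elements from the question; before_bar and after_bar are illustrative names.

```r
my_vector <- c("ABCC12|94160", "ABCC13|150000", "ABCC1|4363", "ACTA1|58")

# Delete the "|" and everything after it.
before_bar <- sub("\\|.*", "", my_vector)
before_bar
# c("ABCC12", "ABCC13", "ABCC1", "ACTA1")

# The mirror image: delete everything up to and including the "|".
after_bar <- sub(".*\\|", "", my_vector)
after_bar
# c("94160", "150000", "4363", "58")
```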

Remove first character of string with condition in R [duplicate]

I'm trying to remove the 0 that appears at the beginning of some observations for Zipcode in the following table:
I think the sub function is probably my best choice, but I only want to do the replacement for observations that begin with 0, not for all observations as the following does:
data_individual$Zipcode <- sub(".", "", data_individual$Zipcode)
Is there a way to condition this so it only removes the first character if the Zipcode starts with 0? Maybe grepl for those that begin with 0 and generate a dummy variable to use?
We can use ^0+ as the pattern, i.e. one or more 0s at the start (^) of the string, instead of . (which in a regex matches any character)
data_individual$Zipcode <- sub("^0+", "", data_individual$Zipcode)
Or with tidyverse
library(stringr)
data_individual$Zipcode <- str_remove(data_individual$Zipcode, "^0+")
Another option without regex is to convert to numeric, since numeric values do not keep leading zeros (this assumes all zipcodes contain only digits)
data_individual$Zipcode <- as.numeric(data_individual$Zipcode)
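The three variants differ when a value starts with more than one 0. A small sketch on a made-up vector (zip is hypothetical, standing in for data_individual$Zipcode):

```r
zip <- c("01234", "90210", "00501")   # hypothetical zipcodes

# ^0+ strips every leading zero.
all_zeros <- sub("^0+", "", zip)
all_zeros
# c("1234", "90210", "501")

# ^0 strips only the first character, and only when it is a 0.
first_zero <- sub("^0", "", zip)
first_zero
# c("1234", "90210", "0501")

# as.numeric also drops all leading zeros, but changes the type.
as_num <- as.numeric(zip)
as_num
# c(1234, 90210, 501)
```

If the goal is literally "remove the first character when it is 0", the ^0 form matches the question most closely.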

Remove characters which repeat more than twice in a string [duplicate]

I have this text:
F <- "hhhappy birthhhhhhdayyy"
and I want to remove the repeated characters. I tried this code
https://stackoverflow.com/a/11165145/10718214
and it works, but I only need to collapse a character when it repeats more than twice; if it appears exactly twice, it should be kept.
so the output that I expect is
"happy birthday"
any help?
Try using gsub, with the pattern (.)\\1{2,}:
F <- "hhhappy birthhhhhhdayyy"
gsub("(.)\\1{2,}", "\\1", F)
[1] "happy birthday"
Explanation of regex:
(.) match and capture any single character
\\1{2,} then match the same character two or more additional times (three or more in total)
We replace the whole run with a single copy of the character. In the replacement string, \\1 refers to the first capture group.
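A short sketch to confirm that doubled letters survive while longer runs collapse. The name txt is used instead of F here, since F is also R's shorthand for FALSE; the second input string is made up.

```r
txt <- "hhhappy birthhhhhhdayyy"

# Runs of three or more collapse to one character; "pp" is left alone.
fixed <- gsub("(.)\\1{2,}", "\\1", txt)
fixed
#> [1] "happy birthday"

# Hypothetical second input: "ooo" collapses, the "oo" in "book" stays.
fixed2 <- gsub("(.)\\1{2,}", "\\1", "coool book")
fixed2
#> [1] "col book"

# Variant (an assumption about a possible alternative need): keep two copies.
fixed3 <- gsub("(.)\\1{2,}", "\\1\\1", "coool book")
fixed3
#> [1] "cool book"
```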

Regex capture 1 character [duplicate]

Whenever a single English letter stands alone as a word, I want it to be joined to the preceding text.
gsub('(.*)\\s+([a-zA-Z]{1})', "\\1\\2", 'Anti-Candida a ингибинов')
Anti-Candidaa ингибинов
For the example below, it should return 'Anti-Candida am ингибинов' as 'am' is of length 2.
gsub('(.*)\\s+([a-zA-Z]{1})', "\\1\\2", 'Anti-Candida am ингибинов')
You can use this regex:
\W+([a-zA-Z])\b
and replace with \\1 (in an R string the pattern is written "\\W+([a-zA-Z])\\b"). The trick here is to match a word boundary after the single letter.
Your regex will work as well, if you just add that \b at the end.
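Put together in R, the suggestion could look like the sketch below. The helper name join_single_letter is made up for illustration, and the ASCII inputs are simplified stand-ins for the strings in the question.

```r
# Join a stand-alone single ASCII letter to the text before it.
# \\W+ consumes the separator, \\b requires the letter to end a word.
join_single_letter <- function(s) {
  gsub("\\W+([a-zA-Z])\\b", "\\1", s)
}

one <- join_single_letter("Anti-Candida a test")
one
#> [1] "Anti-Candidaa test"

# "am" has two letters, so there is no word boundary after "a": unchanged.
two <- join_single_letter("Anti-Candida am test")
two
#> [1] "Anti-Candida am test"
```

Because \\W matches any non-word character, separators such as a hyphen before a single letter would be consumed too; \\s+ is the narrower choice if only whitespace should be removed.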

Remove characters in string before specific symbol(including it) [duplicate]

To begin with: yes, similar questions already exist here, but the solutions don't work as they should, at least for me.
I'd like to remove everything (letters and digits, in any combination) before the first semicolon, along with the semicolon itself.
So we have some strings:
x <- "1;ABC;GEF2"
y <- "X;EER;3DR"
Let's try gsub() with . and *, which means any character occurring 0 or more times:
gsub(".*;", "", x)
gsub(".*;", "", y)
And as a result I get:
[1] "GEF2"
[1] "3DR"
But I'd like to have:
[1] "ABC;GEF2"
[1] "EER;3DR"
Why did it 'catch' the second occurrence of the semicolon instead of the first?
You could use
gsub("[^;]*;(.*)", "\\1", x)
# [1] "ABC;GEF2"
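As for the "why": .* is greedy, so it takes as many characters as it can and the ; in the pattern ends up matching the last semicolon. A sketch of two ways around that, using the strings from the question (greedy, first_x, first_y and lazy_x are illustrative names):

```r
x <- "1;ABC;GEF2"
y <- "X;EER;3DR"

# Greedy .* runs to the last ";".
greedy <- sub(".*;", "", x)
greedy
#> [1] "GEF2"

# [^;]* cannot cross a ";", so the match stops at the first one.
first_x <- sub("^[^;]*;", "", x)
first_y <- sub("^[^;]*;", "", y)
first_x
#> [1] "ABC;GEF2"
first_y
#> [1] "EER;3DR"

# A lazy quantifier is another option (perl = TRUE shown here).
lazy_x <- sub(".*?;", "", x, perl = TRUE)
lazy_x
#> [1] "ABC;GEF2"
```

sub is enough here because only the first match needs replacing.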
