str_extract in R to extract number from a string


I have a table like the one below, and I would like to extract the number following the last underscore:
description desired_output
desc_lvl1_id_1 1
desc_lvl1_id_2 2
The solution I have come up with has two steps: first extract the underscore together with the number, then strip the underscore:
gsub("_", "", str_extract(description, "_[0-9]"))
I'm hoping this can be done in one step.

We can use a positive lookbehind, (?<=_), so that str_extract matches one or more digits (\\d+) that are preceded by an underscore, without including the underscore in the result.
library(stringr)
df1$desired_output <- as.numeric(str_extract(df1$description, '(?<=_)\\d+'))
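For readers who want to avoid the stringr dependency, the same lookbehind can be sketched in base R. The data frame below is an assumption that mirrors the two rows shown in the question; note that regmatches drops elements without a match, so this sketch assumes every description contains an underscore followed by digits.

```r
# Sample data mirroring the question (df1 is the name used in the answer above)
df1 <- data.frame(description = c("desc_lvl1_id_1", "desc_lvl1_id_2"),
                  stringsAsFactors = FALSE)

# regexpr() locates the first match in each string; perl = TRUE enables lookbehind
m <- regexpr("(?<=_)\\d+", df1$description, perl = TRUE)
df1$desired_output <- as.numeric(regmatches(df1$description, m))

# Alternative without lookaround: capture the digits after the last underscore
trailing <- as.numeric(sub(".*_(\\d+)$", "\\1", df1$description))
```

Both approaches should give the same numbers here; the sub() variant is stricter because it requires the digits to sit at the end of the string.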

Related

Add symbol between the letter S and any number in a column dataframe

I am trying to add a - between letter S and any number in a column of a data frame. So, this is an example:
VariableA
TRS34
MMH22
GFSR104
GS23
RRTM55
P3
S4
My desired output is:
VariableA
TRS-34
MMH22
GFSR104
GS-23
RRTM55
P3
S-4
I was trying to use gsub:
gsub('^([a-z])-([0-9]+)$','\\1d\\2',myDF$VariableA)
but this is not working.
How can I solve this?
Thanks!

Your ^([a-z])-([0-9]+)$ regex attempts to match strings that start with a single lowercase letter, followed by a - and then one or more digits up to the end of the string. This can't work because there are no hyphens in the strings yet: you want to insert them.
You can use
gsub('(S)([0-9])', '\\1-\\2', myDF$VariableA)
The (S)([0-9]) regex matches and captures S into Group 1 (\1) and then any digit is captured into Group 2 (\2) and the replacement pattern is a concatenation of group values with a hyphen in between.
If there is only one substitution expected, replace gsub with sub.
Other variations:
gsub('(S)(\\d)', '\\1-\\2', myDF$VariableA) # \d also matches digits
gsub('(?<=S)(?=\\d)', '-', myDF$VariableA, perl=TRUE) # Lookarounds make backreferences redundant
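As a quick check, the capture-group and lookaround variants can be run against the sample column from the question. This is a minimal sketch; the data frame is reconstructed from the question's listing.

```r
# Sample column reconstructed from the question
myDF <- data.frame(VariableA = c("TRS34", "MMH22", "GFSR104", "GS23",
                                 "RRTM55", "P3", "S4"),
                   stringsAsFactors = FALSE)

# Capture S and the following digit, then re-emit them with a hyphen in between
with_groups <- gsub("(S)([0-9])", "\\1-\\2", myDF$VariableA)

# Zero-width position between S and a digit; nothing is consumed, so no backreferences
with_lookarounds <- gsub("(?<=S)(?=\\d)", "-", myDF$VariableA, perl = TRUE)

print(with_groups)
```

GFSR104 is left alone because its S is followed by R rather than a digit.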

Here is the version I like, using gsub:
myDF$VariableA <- gsub('S(\\d)', 'S-\\1', myDF$VariableA)
This needs only one capture group, because the literal S is written directly into the replacement.

Using the stringr package:
library(stringr)
str_replace_all(myDF$VariableA, 'S(\\d)', 'S-\\1')

You could also use lookbehinds if you set perl=TRUE:
> gsub('(?<=S)([0-9]+)', '-\\1', myDF$VariableA, perl=TRUE)
[1] "TRS-34" "MMH22" "GFSR104" "GS-23" "RRTM55" "P3" "S-4"

How to split strings separated by many semicolons in R?

I want to count the segments of a text that are separated by a ; which comes directly after a number. In the text named txt below, the first two semicolons should not count; only the ; that follows the digits 6 and 5 (in 805-806; and 123-125;) should be considered. In other words, the code should look behind each ; for a digit to decide whether it is a separator.
library(stringr)
txt <- "A;B; dd (2020) text pp. 805-806; Mining; exercise (1999), ee, p-123-125; F;G;H text, (2017) kk"
lengths(strsplit(txt, ";")) gives me 8. In my case, however, it should be 3. Any help is highly appreciated.

We can use a regex lookbehind to split only on a ; that follows a digit ((?<=[0-9])) and then get the lengths
lengths(strsplit(txt, "(?<=[0-9]);", perl = TRUE))
#[1] 3
Or count the digit-plus-semicolon separators with str_count and add 1
library(stringr)
str_count(txt, '[0-9];') + 1
#[1] 3
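To see which pieces the lookbehind split actually produces, the segments can be inspected directly. This is a small base R sketch using the txt value from the question; trimming the leading spaces is an extra step added here for readability.

```r
txt <- "A;B; dd (2020) text pp. 805-806; Mining; exercise (1999), ee, p-123-125; F;G;H text, (2017) kk"

# Split only on a semicolon that directly follows a digit
parts <- strsplit(txt, "(?<=[0-9]);", perl = TRUE)[[1]]

# Remove the space left at the start of the later segments
parts <- trimws(parts)
n_parts <- length(parts)

print(parts)
```

The semicolons after A, B, Mining, F and G stay inside their segments because none of them is preceded by a digit.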

How to substitute a character in multiples locations with R

I'm trying to split a dataframe on "," separators. However, some parts of the strings match the pattern [0-9][,][0-9]{2}, and I'd like to replace only the comma inside it, not the whole pattern, in order to preserve the numeric values.
I tried to solve this with stringr, but got stuck on the following incorrect result:
library(stringr)
string <- '"name: John","age: 27","height: 1,73", "weight: 78,30"'
str_replace_all(string, "[0-9][,][0-9]{2}", "[0-9][;][0-9]{2}")
[1] "\"name: John\",\"age: 27\",\"height: [0-9][;][0-9]{2}\", \"weight: 7[0-9][;][0-9]{2}\""
I know it can be done with substitution by position, but the string is too big.
I'd appreciate any help. Thanks in advance.

You need to use capturing groups around the parts of the pattern you need to keep and, in the replacement pattern, refer to those submatches with backreferences:
> str_replace_all(string, "([0-9]),([0-9]{2})", "\\1;\\2")
[1] "\"name: John\",\"age: 27\",\"height: 1;73\", \"weight: 78;30\""
Or the same regex can be used with gsub:
> gsub("([0-9]),([0-9]{2})", "\\1;\\2", string)
[1] "\"name: John\",\"age: 27\",\"height: 1;73\", \"weight: 78;30\""
Details:
([0-9]) - capturing group 1, whose value is referred to using \\1 in the replacement pattern, matching a single digit
, - a comma
([0-9]{2}) - capturing group 2, whose value is referred to using \\2 in the replacement pattern, matching 2 digits.
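Once the decimal commas are protected, the original goal of splitting on the remaining commas becomes straightforward. The following base R sketch assumes that the quotes and surrounding spaces should be stripped from each field, which the question does not state explicitly.

```r
string <- '"name: John","age: 27","height: 1,73", "weight: 78,30"'

# Step 1: replace only the comma that sits between a digit and two digits
protected <- gsub("([0-9]),([0-9]{2})", "\\1;\\2", string)

# Step 2: the remaining commas are real separators, so split on them literally
fields <- strsplit(protected, ",", fixed = TRUE)[[1]]

# Step 3 (assumption): drop the double quotes and surrounding whitespace
fields <- trimws(gsub('"', "", fields, fixed = TRUE))

print(fields)
```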

How to extract numbers between multiple underscores

I have files named as follows:
"C:/FolderInf/infoAnalysis/und_2017_01_28_12.csv"
I would like to extract the digits (actually a date) from the file names using gsub() or sub(), i.e. without any package involved.
Some failed attempts:
gsub("\\D", "", lfs) also keeps the trailing 12, but I need only the first 8 digits.
gsub("(.+?)(\\_.*)", "\\2", lfs) makes it complicated to remove the remaining unwanted characters.

You can specify this extraction exactly:
gsub(".*(\\d{4}_\\d{2}_\\d{2}).*", "\\1", x)
## [1] "2017_01_28"

We can use basename to drop the directory part and then remove two pieces with a single alternation: from the start of the string (^), one or more characters that are not a _ ([^_]+) followed by a _; or (|) a _ followed by one or more characters that are not a _ ([^_]+) up to the end of the string ($). Both pieces are replaced with an empty string ("").
gsub("^[^_]+_|_[^_]+$", "", basename(str1))
#[1] "2017_01_28"
data
str1 <- "C:/FolderInf/infoAnalysis/und_2017_01_28_12.csv"

You could try (but there are probably more elegant solutions available):
substr(gsub("[^0-9]","",test),1,8)
[1] "20170128"

Using the base R function strsplit (no package needed):
paste(as.character(unlist(strsplit(as.character(string),"[_.]"))[c(2,3,4)]),collapse=" ")

Since you said they are dates, you can use the following regexp, for example:
^[^_]*_([0-9]*)_([0-9]*)_([0-9]*)_([0-9]*)
Beware that you cannot abbreviate this expression as _([0-9]*){4}: a repeated group is still a single capture group (not the four that the expression above has), so you would not get the four numbers separately. If you need only the date, you can cut the regexp before the last group.
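One way to apply this four-group expression in base R is regexec together with regmatches, which return the full match followed by each capture group. The sketch below reuses the variable name lfs from the question; applying the pattern to basename(lfs) and converting the result with as.Date are assumptions added for illustration.

```r
lfs <- "C:/FolderInf/infoAnalysis/und_2017_01_28_12.csv"
fname <- basename(lfs)  # "und_2017_01_28_12.csv"

pattern <- "^[^_]*_([0-9]*)_([0-9]*)_([0-9]*)_([0-9]*)"

# Element 1 is the whole match; elements 2 to 5 are the four capture groups
m <- regmatches(fname, regexec(pattern, fname))[[1]]

# Keep year, month and day; the fourth group (12) is not part of the date
date_string <- paste(m[2:4], collapse = "_")
as_date <- as.Date(date_string, format = "%Y_%m_%d")

print(date_string)
```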

selective removal of characters following a pattern using R

How to selectively remove characters from a string following a pattern?
I wish to remove each 7-digit number (such as 0.005926) together with the preceding colon.
For example:
"((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950"
should become
"((Northern_b,Tropical_b)N19"

x <- "((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950"
gsub("[:]\\d{1}[.]\\d{6}", "", x)
The gsub function does string replacement and replaces all matches found in the string (see ?gsub). An alternative, if you want something with a friendlier name, is str_replace_all from the stringr package.
The regular expression uses \\d{n}, which matches exactly n digits. So \\d{1} matches a single digit and \\d{6} matches a run of exactly six digits. [:] and [.] are single-character classes that match a literal colon and a literal dot.
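The fixed-width quantifiers work for the sample because every number has one digit before the dot and six after it. The sketch below contrasts that with a variable-width pattern; the second input string y is a made-up example, not taken from the question.

```r
x <- "((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950"

# Exactly one digit, a literal dot, exactly six digits
fixed_width <- gsub(":\\d{1}\\.\\d{6}", "", x)

# One or more digits on each side of the dot
flexible <- gsub(":\\d+\\.\\d+", "", x)

# Hypothetical input with different number widths
y <- "(A:12.5,B:0.25)"
fixed_on_y <- gsub(":\\d{1}\\.\\d{6}", "", y)   # nothing matches, y is unchanged
flexible_on_y <- gsub(":\\d+\\.\\d+", "", y)    # both numbers are removed

print(flexible_on_y)
```

If the branch lengths are always printed with the same width, the fixed-width pattern is the stricter choice; otherwise the variable-width one is probably safer.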

gsub('[:][0-9.]+','',x)
[1] "((Northern_b,Tropical_b)N19"

Another approach to solve this problem
library(stringr)
str1 <- c("((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950")
str_replace_all(str1, regex("(:\\d{1,}\\.\\d{1,})"), "")
#[1] "((Northern_b,Tropical_b)N19"
