I am learning to extract string information using the stringr library in R. Say I have the string LC_Cars_20160601_01.hdf5.rds. The digits 01 before ".hdf5" indicate that the file is for participant # 01. How can I extract this number? I have tried str_extract, but I don't know what to provide in the pattern argument. Please guide.
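One option is to use gsub to remove everything that is not required: the text up to and including the last run of digits followed by an underscore, and everything from the first remaining dot onward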
gsub(".*\\d+_|\\..*$", "", str1)
#[1] "01"
data
str1 <- "LC_Cars_20160601_01.hdf5.rds"
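Since the question asks for an extraction pattern rather than a removal, a lookahead is another way to say "the digits immediately before .hdf5". This is a minimal base R sketch, assuming the participant number is always directly followed by ".hdf5"; the same pattern should also be usable as the pattern argument of str_extract, since ICU supports lookaheads.

```r
str1 <- "LC_Cars_20160601_01.hdf5.rds"

# \\d+ matches a run of digits; (?=\\.hdf5) requires ".hdf5" to follow
# without making it part of the match. perl = TRUE enables the lookahead.
m <- regexpr("\\d+(?=\\.hdf5)", str1, perl = TRUE)
participant <- regmatches(str1, m)
participant
# [1] "01"
```

The date 20160601 is skipped because it is followed by an underscore, not by ".hdf5".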
I've seen a lot of similar questions, but I wasn't able to get the desired output.
I have a string means_variab_textimput_x2_200.txt and I want to capture ONLY what is between the second and third underscores: textimput
I'm using R, stringr, I've tried many things, but none solved the issue:
my_string <- "means_variab_textimput_x2_200.txt"
str_extract(my_string, '[_]*[^_]*[_]*[^_]*[_]*[^_]*')
"means_variab_textimput"
str_extract(my_string, '^(?:([^_]+)_){4}')
"means_variab_textimput_x2_"
str_extract(my_string, '[_]*[^_]*[_]*[^_]*[_]*[^_]*\\.') ## the closer I got was this
"_textimput_x2_200."
Any ideas? Ps: I'm VERY new to Regex, so details would be much appreciated :)
Additional question: can I also get only a "part" of the word? Say, instead of textimput, only text, but without counting the words? It would be good to know both possibilities.
Several related questions were helpful, but I couldn't get the final expected result. Thanks in advance.
stringr uses ICU-based regular expressions. Lookarounds are an option, but ICU requires a lookbehind to have a bounded length, and here the prefix length is not fixed, so an open-ended (?<=...) wouldn't work. Another option is either to remove the unwanted substrings with str_remove, or to use str_replace: match the whole string, capture the third word as a run of characters that are not _ ([^_]+), and replace the match with the backreference (\\1) to the captured word
library(stringr)
str_replace(my_string, "^[^_]+_[^_]+_([^_]+)_.*", "\\1")
[1] "textimput"
If we need only the first four characters of that word
str_replace(my_string, "^[^_]+_[^_]+_([^_]{4}).*", "\\1")
[1] "text"
In base R, it is easier to split the string with strsplit and pick the third word by indexing
strsplit(my_string, "_")[[1]][3]
# [1] "textimput"
Or use perl = TRUE in regexpr, where \\K discards everything matched so far, so only the third word is returned
regmatches(my_string, regexpr("^([^_]+_){2}\\K[^_]+", my_string, perl = TRUE))
# [1] "textimput"
For the substring
regmatches(my_string, regexpr("^([^_]+_){2}\\K[^_]{4}", my_string, perl = TRUE))
[1] "text"
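The strsplit approach above indexes into a single string. As a sketch of how it could be generalized to a whole vector, here is a small helper; nth_field and its arguments are hypothetical names introduced for illustration, not part of base R or stringr.

```r
# Hypothetical helper: return the n-th piece of each string after splitting
# on `sep`, optionally truncated to the first `width` characters.
# Strings with fewer than n pieces yield NA.
nth_field <- function(x, n, sep = "_", width = NULL) {
  pieces <- strsplit(x, sep, fixed = TRUE)
  out <- vapply(
    pieces,
    function(p) if (length(p) >= n) p[n] else NA_character_,
    character(1)
  )
  if (!is.null(width)) out <- substr(out, 1, width)
  out
}

my_strings <- c("means_variab_textimput_x2_200.txt", "a_b")
nth_field(my_strings, 3)             # "textimput" NA
nth_field(my_strings, 3, width = 4)  # "text"      NA
```

Splitting with fixed = TRUE treats the separator as a literal string, so no regex escaping is needed for separators such as ".".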
Following up on the question in the comments about restricting the size of the extracted word: this is easily done with a quantifier. For example, to extract only the first 4 characters:
sub("[^_]+_[^_]+_([^_]{4}).*$", "\\1", my_string)
[1] "text"
I have a column of texts that look like the one below:
str1 = "ABCID 123456789 is what I'm looking for, could you help me to check this Item's status?"
I want to use the gsub function in R to extract "ABCID 123456789" from it. The number may differ between rows, but ABCID is constant. Does anyone know how to do this? Thanks very much!
We can use str_extract to select the fixed word followed by a space and one or more digits (\\d+); here df1$col1 stands for the column that holds the text
library(stringr)
str_extract(df1$col1, "ABCID \\d+")
If there are multiple instances, use str_extract_all
str_extract_all(df1$col1, "ABCID \\d+")
NOTE: The OP's goal is to extract "ABCID 123456789" from the text, which str_extract does directly without gsub
If the number has a constant length (9), you could use a positive lookbehind:
sub("(?<=ABCID \\d{9}).*", "", str1, perl = TRUE)
# [1] "ABCID 123456789"
Match the beginning of the string (^), the leading letters (ABCID), a space, the digits (\d+) and everything else (.*), and replace it all with the captured portion, i.e. the portion within parentheses. We use sub rather than gsub here because only one substitution per string is needed.
sub("^(ABCID \\d+).*", "\\1", str1)
## [1] "ABCID 123456789"
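One thing to keep in mind with the capture-and-replace approach is that sub returns a string unchanged when the pattern does not match. The following sketch, using a made-up second element with no identifier, contrasts that with regexpr and regmatches, which make it possible to return NA instead.

```r
x <- c("ABCID 123456789 is what I'm looking for", "no identifier here")

# sub() leaves a non-matching element untouched
via_sub <- sub("^(ABCID \\d+).*", "\\1", x)
via_sub
# [1] "ABCID 123456789"    "no identifier here"

# regexpr() returns -1 where nothing matched, so those slots can stay NA
m <- regexpr("ABCID \\d+", x)
via_regmatches <- rep(NA_character_, length(x))
via_regmatches[m > 0] <- regmatches(x, m)
via_regmatches
# [1] "ABCID 123456789" NA
```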
I'm trying to extract part of a string using stringr.
I'm aiming for the output to be E5_1_C33 and E5_1_C23, but instead I'm getting NA.
Any help would be appreciated!
library(stringr)
mystring <- c("can_ComplianceWHOInfrastructurePol_E5_1_C33","can_ComplianceWHOInfrastructurePol_E5_1_C23")
str_extract(mystring, "A\\d_\\d_B\\d\\d$")
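I slightly modified your line, since you need to match any letter, not only A and B. Note that the range is written [A-Za-z] rather than [A-z], because [A-z] also covers the punctuation between Z and a, including the underscore:
str_extract(mystring, "[A-Za-z]\\d_\\d_[A-Za-z]\\d\\d$")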
Here's a base R approach using gsub
gsub(".*(\\w{2}_\\w{1}_\\w{3})$", "\\1", mystring)
[1] "E5_1_C33" "E5_1_C23"
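When widening a pattern from specific letters to "any letter", a range written as [A-z] is a common pitfall: in code-point order it also includes the characters [ \ ] ^ _ and the backtick. The sketch below checks this with perl = TRUE (so that ranges follow code points) and shows a letters-only version of the extraction in base R.

```r
mystring <- c("can_ComplianceWHOInfrastructurePol_E5_1_C33",
              "can_ComplianceWHOInfrastructurePol_E5_1_C23")

# In code-point order, "_" sits between "Z" and "a", so [A-z] accepts it
underscore_in_wide_range <- grepl("[A-z]", "_", perl = TRUE)     # TRUE
underscore_in_letters    <- grepl("[A-Za-z]", "_", perl = TRUE)  # FALSE

# Letters-only version of the extraction
pat <- "[A-Za-z]\\d_\\d_[A-Za-z]\\d{2}$"
codes <- regmatches(mystring, regexpr(pat, mystring, perl = TRUE))
codes
# [1] "E5_1_C33" "E5_1_C23"
```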
I have a string list as below:
df = read.table(text="AC1=60;AD=393,115;AF1=0.318816;BQB=0.508823;DP=1016;DP4=393
AC1=190;AD=2,747;AF1=1;BQB=0.0722892;DP=749;DP4=2,0,747,0;FQ=-43.6844
AC1=150;AD=1,5;AF1=0.787353;DP=6;DP4=1,0,5,0;VDB=0.00215942
AC1=47;AD=660,182;AF1=0.24862;BQB=0.680047;DP=1684;DP4=660,0,182,0
AC1=47;AD=659,183;AF1=0.248425;DP=842;DP4=0,659,0,183;FQ=999
AC1=78;AD=23,17;AF1=0.408247;BQB=1;DP=40;DP4=23,0,17,0", header=FALSE, stringsAsFactors=F)
Each element is separated by ";". I would like to extract only the "DP=[0-9]" part. The expected result is:
DP=1016
DP=749
DP=6
DP=1684
DP=842
DP=40
I appreciate any help.
In base:
gsub(".*((?<=;)DP=[^;]+(?=;)).*", "\\1", df$V1, perl=TRUE)
#[1] "DP=1016" "DP=749" "DP=6" "DP=1684" "DP=842" "DP=40"
I was surprised when the resident regex genius suggested the use of packages for text extraction. sub and gsub can get unruly when pulling out a specific string:
library(stringr)
str_extract_all(df$V1, "(?<=;)DP=[^;]+(?=;)")
Here is one regular expression that will work
gsub(".*;(DP=[0-9.]+);.*$", "\\1", df$V1)
If the "DP=" substring can contain multiple entries separated by commas, as substrings like "DP4=" do in some of the example data, then, as @pierre-lafortune notes in the comments below and in his answer, you might be better off with the negated character class [^;]:
gsub(".*;(DP=[^;]+);.*$", "\\1", df$V1)
Of course, you could just add the comma to the character class,
gsub(".*;(DP=[0-9.,]+);.*$", "\\1", df$V1)
but there may be other characters you want to keep as well. So [^;] would be the most inclusive approach.
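All of the patterns above require a ";" on both sides of the "DP=" field, so a record where DP is the first or last field would come back unchanged from gsub. As a sketch of an alternative that does not depend on position, the fields can be split on ";" and selected by prefix. The helper name get_field and the sample vector v are made up for illustration.

```r
# Made-up records: DP in the middle, at the start, and at the end
v <- c("AC1=60;AD=393,115;DP=1016;DP4=393",
       "DP=6;DP4=1,0,5,0",
       "AC1=78;DP=40")

# Hypothetical helper: split each record on ";" and keep the first field
# that starts with "<key>="; records without such a field yield NA.
get_field <- function(x, key) {
  prefix <- paste0(key, "=")
  vapply(strsplit(x, ";", fixed = TRUE), function(fields) {
    hit <- fields[startsWith(fields, prefix)]
    if (length(hit) > 0) hit[1] else NA_character_
  }, character(1))
}

via_split <- get_field(v, "DP")
via_split
# [1] "DP=1016" "DP=6"    "DP=40"

# For comparison, the delimiter-anchored pattern only handles the first record
via_gsub <- gsub(".*;(DP=[^;]+);.*$", "\\1", v)
```

Matching on the prefix "DP=" also keeps "DP4=" fields from being picked up by mistake.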
How to selectively remove characters from a string following a pattern?
I wish to remove the 7 figures and the preceding colon.
For example:
"((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950"
should become
"((Northern_b,Tropical_b)N19"
x <- "((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950"
gsub("[:]\\d{1}[.]\\d{6}", "", x)
The gsub function performs string replacement and replaces all matches found in the string (see ?gsub). An alternative, if you want a friendlier name, is str_replace_all from the stringr package.
The regular expression uses \\d{n}, which matches exactly n consecutive digits. So \\d{1} matches a single digit and \\d{6} matches a run of six digits. [:] and [.] are character classes that match a literal colon and a literal dot.
gsub('[:][0-9.]+','',x)
[1] "((Northern_b,Tropical_b)N19"
Another approach to solve this problem
library(stringr)
str1 <- c("((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950")
str_replace_all(str1, regex("(:\\d{1,}\\.\\d{1,})"), "")
#[1] "((Northern_b,Tropical_b)N19"
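The three patterns in this thread agree on the example string but differ in how strict they are about the shape of the number. The sketch below runs them on a made-up string, x2, whose first number does not have one digit before and six digits after the dot.

```r
x2 <- "(A:12.5,B:0.000001)"  # made-up input with a differently shaped number

# Exactly one digit, a dot, exactly six digits: ":12.5" is left in place
fixed_width <- gsub("[:]\\d{1}[.]\\d{6}", "", x2)

# Colon followed by any run of digits and dots
char_class <- gsub("[:][0-9.]+", "", x2)

# Colon, one or more digits, a dot, one or more digits
flexible <- gsub(":\\d+\\.\\d+", "", x2)

fixed_width  # "(A:12.5,B)"
char_class   # "(A,B)"
flexible     # "(A,B)"
```

Which one is appropriate depends on whether the number format in the real data is guaranteed to be fixed.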