Substring in R with stringr

I have a string that looks like this:
my_sting="AC=1;AN=249706;AF=4.00471e-06;rf_tp_probability=8.55653e-01;"
It is based on a column in my data:
REF ALT QUAL FILTER INFO
1 C A 3817.77 PASS AN=2;AF=4.00471e06;rf_tp_probability=8.55653
2 C G 3817.77 PASS AN=3;AF=5;rf_tp_probability=8.55653
I wish to select only the part that starts with AF= and ends with the value that AF is equal to.
For example, here: AF=4.00471e-06
I tried this:
print(str_extract_all(my_sting, "AF=.+;"))
[[1]]
[1] "AF=4.00471e-06;rf_tp_probability=8.55653e-01;"
But it returned everything up to the end instead of returning just AF=4.00471e-06.
Is there any way to fix this? Thank you.

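The .+ in your pattern is greedy, so it runs on to the last ; in the string. You can instead write the pattern using a negated character class, [^;]+, which stops at the first ;: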
library(stringr)
my_sting="AC=1;AN=249706;AF=4.00471e-06;rf_tp_probability=8.55653e-01;"
print(str_extract_all(my_sting, "AF=[^;]+"))
Output
[[1]]
[1] "AF=4.00471e-06"

Another option: use a lazy quantifier together with a lookahead for "followed by ;" (i.e., (?=;)):
my_sting="AC=1;AN=249706;AF=4.00471e-06;rf_tp_probability=8.55653e-01;"
str_extract(my_sting, "AF=.*?(?=;)")
#> [1] "AF=4.00471e-06"
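Since the string comes from the INFO column of a data frame, here is a minimal sketch of applying the same pattern to the whole column. The data frame df and the new column names AF_field and AF are assumptions made for illustration; the numeric conversion is an extra step that the answers above do not include.

```r
library(stringr)

# Hypothetical data frame mirroring the INFO column from the question
df <- data.frame(
  INFO = c("AN=2;AF=4.00471e06;rf_tp_probability=8.55653",
           "AN=3;AF=5;rf_tp_probability=8.55653"),
  stringsAsFactors = FALSE
)

# str_extract is vectorised: one match (or NA) per row
df$AF_field <- str_extract(df$INFO, "AF=[^;]+")

# Optional extra step: drop the "AF=" prefix and convert to a number
df$AF <- as.numeric(str_remove(df$AF_field, "^AF="))

print(df[, c("AF_field", "AF")])
```

Note that the pattern "AF=" would also match inside a longer key that happens to end in AF; if your data has such keys, the pattern would need an anchor such as a preceding ;.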

Related

Why does group capturing extract everything in str_replace in R

Context
a = 'g_pm10year1126.81 - 139.90'
I have a character vector a. I want to extract the content after year1 in the string a ("126.81 - 139.90").
By using str_extract(a, "(?<=year1).*") I successfully extracted the content I wanted.
After that, I tried to use group capturing in the str_replace function, but it returned the whole string a.
Question
My question is why str_replace(a, "(?<=year1)(.*)", '\\1') returns "g_pm10year1126.81 - 139.90".
As I understand it, it should return 126.81 - 139.90.
Reproducible code:
library(stringr)
a = 'g_pm10year1126.81 - 139.90'
> str_extract(a, "(?<=year1).*")
[1] "126.81 - 139.90"
> str_replace(a, "(?<=year1)(.*)", '\\1')
[1] "g_pm10year1126.81 - 139.90"
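The issue is that you are replacing the captured group with itself. str_replace only replaces the text that the pattern matches, and the lookbehind (?<=year1) is not part of the match, so everything before it is left in place. Hence you are not changing anything and end up with your input string.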
To achieve your desired result using str_replace you have to replace the part before the captured group, i.e. you could do:
library(stringr)
a = 'g_pm10year1126.81 - 139.90'
str_replace(a, "^.*?(?<=year1)(.*)", '\\1')
#> [1] "126.81 - 139.90"
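As a small sketch of the same idea without a lookbehind: since only the matched text is replaced, the prefix can simply be matched and dropped. The variable names out_remove and out_replace are made up for this illustration.

```r
library(stringr)

a <- "g_pm10year1126.81 - 139.90"

# Remove everything up to and including the first "year1"
out_remove <- str_remove(a, "^.*?year1")

# Same result with a replacement: match the prefix, keep only group 1
out_replace <- str_replace(a, "^.*?year1(.*)", "\\1")

print(out_remove)
print(out_replace)
```

Both calls should give "126.81 - 139.90", because in each case the pattern consumes the part of the string that is to be discarded.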

Add a character to a specific part of a string?

I have a list of file names as such:
"A/B/file.jpeg"
"A/C/file2.jpeg"
"B/C/file3.jpeg"
and a couple of variations of such.
My question is: how can I add "new" (or any other characters) to each of these file names after the second "/", regardless of the length of the string/name?
Results would ideally be:
"A/B/newfile.jpeg"
"A/B/newfile2.jpeg" etc.
Thanks!
Another possible solution, based on stringr::str_replace:
library(stringr)
l <- c("A/B/file.jpeg", "A/B/file2.jpeg", "A/B/file3.jpeg")
str_replace(l, "\\/(?=file)", "\\/new")
#> [1] "A/B/newfile.jpeg" "A/B/newfile2.jpeg" "A/B/newfile3.jpeg"
Using gsub.
gsub('(file)', 'new\\1', x)
# [1] "A/B/newfile.jpeg" "A/C/newfile2.jpeg" "B/C/newfile3.jpeg"
Data:
x <- c("A/B/file.jpeg", "A/C/file2.jpeg", "B/C/file3.jpeg")
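Both answers above rely on the file name starting with "file". A minimal sketch that instead anchors on the second "/", as the question asks, could look like this; the extra entry "A/B/photo.png" and the name renamed are made up to show that the file name itself does not matter.

```r
x <- c("A/B/file.jpeg", "A/C/file2.jpeg", "B/C/file3.jpeg", "A/B/photo.png")

# Capture "<no slashes>/<no slashes>/" at the start of the string,
# then re-emit it followed by "new"
renamed <- sub("^([^/]*/[^/]*/)", "\\1new", x)

print(renamed)
# [1] "A/B/newfile.jpeg"  "A/C/newfile2.jpeg" "B/C/newfile3.jpeg" "A/B/newphoto.png"
```

Using sub rather than gsub is enough here, since the pattern is anchored with ^ and can match at most once per string.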

How do I find a subtext without comma using regex in R?

I have a data frame as:
result <- c('Ab1 : 256 ug/mL(R), Ab2(disk); 18mm(S)', 'Ab1 : 4 ug/mL(S), Ab2(disk); <2mm(R)')
df <- data.frame(result)
What should I do if I would like to check whether '(R)' appears after 'Ab1' (antibiotic 1)?
grep("Ab1[[:print:]]*\\(R\\)", result)
gives
[1] 1 2
while the result I want is
[1] 1
Try this:
grep("Ab1[^(]*?\\(R\\)", result)
[1] 1
Ab1: match 'Ab1' literally
[^(]*?: match anything besides an opening parenthesis, non-greedily
\\(R\\): match '(R)' literally
In the second case, it is not possible to do this match without first consuming at least one opening parenthesis, hence only the first matches.
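In line with the question title, a variant that forbids commas instead of opening parentheses may be easier to read. This is only a sketch, assuming a comma always separates the Ab1 entry from the next antibiotic; the column name ab1_resistant is made up for illustration.

```r
result <- c("Ab1 : 256 ug/mL(R), Ab2(disk); 18mm(S)",
            "Ab1 : 4 ug/mL(S), Ab2(disk); <2mm(R)")
df <- data.frame(result, stringsAsFactors = FALSE)

# [^,]* cannot cross the comma that ends the Ab1 entry,
# so an "(R)" belonging to Ab2 is never reached
df$ab1_resistant <- grepl("Ab1[^,]*\\(R\\)", df$result)

print(df$ab1_resistant)
# [1]  TRUE FALSE
```

grepl returns one logical per row, which is convenient for adding a column or filtering; grep with the same pattern returns the matching indices instead.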

Delete duplicate elements in String in R

I've got some problems deleting duplicate elements in a string.
My data look similar to this:
idvisit path
1 1,16,23,59
2 2,14,14,19
3 5,19,23,19
4 10,10
5 23,23,27,29,23
I have a column containing a unique ID and a column containing a path for web page navigation.
The right column contains some cases where pages were just reloaded and the page was tracked twice or even more often.
The pages are separated with commas and are saved as factors.
My problem is that I don't want to have the same page several times in a row, so the data should look like this:
idvisit path
1 1,16,23,59
2 2,14,19
3 5,19,23,19
4 10
5 23,27,29,23
Repeated pages next to each other should be removed. I know how to delete a specific repeated number using regular expressions, but I have about 20,000 different pages and can't do this for each of them.
Does anyone have a solution or a hint for my problem?
Thanks
Sebastian
We can use tidyverse. Use separate_rows to split the 'path' variable by the delimiter (,) and convert to long format; then, grouped by 'idvisit', paste together the values of the run-length encoding (rle), which keeps one value per run of consecutive duplicates:
library(tidyverse)
separate_rows(df1, path) %>%
group_by(idvisit) %>%
summarise(path = paste(rle(path)$values, collapse=","))
# A tibble: 5 × 2
# idvisit path
# <int> <chr>
#1 1 1,16,23,59
#2 2 2,14,19
#3 3 5,19,23,19
#4 4 10
#5 5 23,27,29,23
Or a base R option is
df1$path <- sapply(strsplit(df1$path, ","), function(x) paste(rle(x)$values, collapse=","))
NOTE: If the 'path' column is factor class, convert to character before passing as argument to strsplit i.e. strsplit(as.character(df1$path), ",")
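To see why rle does the job in both versions above, here is a small worked example on a single path from the question. The variable names pages, runs and collapsed are only for illustration.

```r
# Split one path into its individual pages
pages <- strsplit("23,23,27,29,23", ",")[[1]]

# rle groups consecutive equal values into runs
runs <- rle(pages)
print(runs$values)   # one entry per run: "23" "27" "29" "23"
print(runs$lengths)  # how long each run was: 2 1 1 1

# Pasting the run values back together drops the consecutive repeats only
collapsed <- paste(runs$values, collapse = ",")
print(collapsed)
# [1] "23,27,29,23"
```

The last 23 survives because it is not adjacent to the first two, which is exactly the behaviour asked for.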
Using the stringr package and its str_replace_all function, I think the following regular expression gets what you want: ([0-9]+),\\1, replaced with \\1 (we need to escape the \ special character):
library(stringr)
> str_replace_all("5,19,23,19", "([0-9]+),\\1", "\\1")
[1] "5,19,23,19"
> str_replace_all("10,10", "([0-9]+),\\1", "\\1")
[1] "10"
> str_replace_all("2,14,14,19", "([0-9]+),\\1", "\\1")
[1] "2,14,19"
You can use it in vector form: x <- c("5,19,23,19", "10,10", "2,14,14,19"), then:
str_replace_all(x, "([0-9]+),\\1", "\\1")
[1] "5,19,23,19" "10" "2,14,19"
or using sapply:
result <- sapply(x, function(x) str_replace_all(x, "([0-9]+),\\1", "\\1"))
Then:
> result
5,19,23,19 10,10 2,14,14,19
"5,19,23,19" "10" "2,14,19"
Notes:
The first line of the output shows the names attribute (sapply names each result after its input string):
> str(result)
Named chr [1:3] "5,19,23,19" "10" "2,14,19"
- attr(*, "names")= chr [1:3] "5,19,23,19" "10,10" "2,14,14,19"
If you don't want to see the names (they do not affect the result), just do:
attributes(result) <- NULL
Then,
> result
[1] "5,19,23,19" "10" "2,14,19"
Explanation about the regular expression used: ([0-9]+),\\1
([0-9]+): Starts with a group 1 delimited by () and finds any digit (at least one)
,: Then comes a punctuation sign: , (we can include spaces here, but the original example only uses this character as delimiter)
\\1: Then comes an identical string to the group 1, i.e.: the repeated number. If that doesn't happen, then the pattern doesn't match.
Then, if the pattern matches, it is replaced with the value of \\1, i.e. the first occurrence of the number in the matched text.
How to handle more than one duplicated number, for example 2,14,14,14,19?:
Just use this regular expression instead: ([0-9]+)(,\\1)+. It matches when the delimiter followed by the same number is repeated at least once. You can try other possibilities on regex101.com (in my humble opinion it is more user-friendly than other online regular expression checkers).
I hope this works for you; it is a flexible solution, and you just need to adapt the pattern to your needs.
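One thing to check when adapting the pattern: without word boundaries, the backreference can match the first digits of a longer number, so "1,16" is seen as "1,1" followed by "6". The sketch below compares the pattern with and without \\b, using base R gsub with perl = TRUE; the test vector and the names unsafe and safe are made up for illustration.

```r
x <- c("1,16,23,59", "2,14,14,14,19", "10,10", "5,19,23,19")

# Without boundaries, "1,1" is matched inside "1,16" and the first path is corrupted
unsafe <- gsub("([0-9]+)(,\\1)+", "\\1", x, perl = TRUE)
print(unsafe[1])
# [1] "16,23,59"

# \\b before the number and after each repeat forces whole-number comparisons
safe <- gsub("\\b([0-9]+)(,\\1\\b)+", "\\1", x, perl = TRUE)
print(safe)
# [1] "1,16,23,59" "2,14,19"    "10"         "5,19,23,19"
```

The examples shown in the answer give the same results with either pattern; the difference only appears when one page number is a prefix of its neighbour.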

R: Find pattern and get the values in between

I am using readLines() to extract the HTML code from a site. In almost every line of the code there is a pattern of the form <td>VALUE1<td>VALUE2<td>. I would like to take the values in between the <td> tags. I tried some combinations such as:
output <- gsub(pattern='(.*<td>)(.*)(<td>.*)(.*)(.*<td>)',replacement='\\2',x='<td>VALUE1<td>VALUE2<td>')
but the output gives back only one value. Any idea how to do that?
string <- "<td>VALUE1<td>VALUE2<td>"
# One step: extract every run of word characters that sits between two <td> tags
regmatches(string, gregexpr("(?<=<td>)\\w+(?=<td>)", string, perl = TRUE))
# [[1]]
# [1] "VALUE1" "VALUE2"
# The same thing in two steps. First, gregexpr returns the start positions
# and the lengths of the matches:
indices <- gregexpr("(?<=<td>)\\w+(?=<td>)", string, perl = TRUE)
# indices[[1]] contains
# [1]  5 15                 (two matches, starting at characters 5 and 15)
# attr(,"match.length")
# [1] 6 6                   (each match is 6 characters long)
# Then regmatches uses these positions to cut the substrings out of the string:
regmatches(string, indices)
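The \\w+ in the pattern above only matches letters, digits and underscores. If the cell values can contain spaces or punctuation, a sketch using [^<]+ (anything up to the next tag) may be closer to what is needed. The sample lines and the names html_lines, pattern and values are assumptions standing in for the real readLines() output.

```r
# Hypothetical lines, standing in for the output of readLines()
html_lines <- c("<td>VALUE1<td>VALUE2<td>",
                "<td>12.5 kg<td>n/a<td>")

# Anything that is not "<", preceded and followed by a <td> tag
pattern <- "(?<=<td>)[^<]+(?=<td>)"

# gregexpr and regmatches are vectorised: one character vector per input line
values <- regmatches(html_lines, gregexpr(pattern, html_lines, perl = TRUE))

print(values[[1]])
# [1] "VALUE1" "VALUE2"
print(values[[2]])
# [1] "12.5 kg" "n/a"
```

Because the lookahead does not consume the closing <td>, the same tag can serve as the opening tag of the next value.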
Did you take a look at the "XML" package that can extract tables from HTML? You probably need to provide more context of the entire message that you are trying to parse so that we could see if it might be appropriate.

Resources