Extract characters between two characters R [duplicate] - r

This question already has answers here:
Extracting a string between other two strings in R
(4 answers)
Closed 3 years ago.
I have a df and I want to extract the tissue name between the './' and '.v8'.
So for this df the result would be a column with just 'Thyroid', 'Esophagus_Muscularis', and 'Adipose_Subcutaneous'.
gene<-c("ENSG00000065485.19","ENSG00000079112.9","ENSG00000079112")
tissue<-c("./Thyroid.v8.signif_variant_gene_pairs.txt.gz","./Esophagus_Muscularis.v8.signif_variant_gene_pairs.txt.gz","./Adipose_Subcutaneous.v8.signif_variant_gene_pairs.txt.gz")
df<-data.frame(gene,tissue)
I really struggle with regex and tried:
pattern="/.\(.*)/.v8(.*)"
result <- regmatches(df$tissue,regexec(pattern,df$tissue))
but I get:
Error: '\(' is an unrecognized escape in character string starting ""/.\("

In an R string, the backslash itself must be escaped (\\). Here we use regex lookarounds to match the word (\\w+) that follows the . (a metacharacter, so it is escaped) and the /, and that is followed by a . (again escaped with \\) and 'v8'.
library(stringr)
library(dplyr)
df %>%
mutate(new = str_extract(tissue, "(?<=\\.[/])\\w+(?=\\.v8)"))
# gene tissue new
#1 ENSG00000065485.19 ./Thyroid.v8.signif_variant_gene_pairs.txt.gz Thyroid
#2 ENSG00000079112.9 ./Esophagus_Muscularis.v8.signif_variant_gene_pairs.txt.gz Esophagus_Muscularis
#3 ENSG00000079112 ./Adipose_Subcutaneous.v8.signif_variant_gene_pairs.txt.gz Adipose_Subcutaneous
Here (?<=\\.[/]) is a positive lookbehind that matches the . and the / preceding the word (\\w+), and (?=\\.v8) is a positive lookahead that matches the . and the string 'v8' after the word. In short, the pattern looks for a word with a given pattern before and after it and extracts only the word.
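For comparison, the regmatches/regexec attempt from the question can also be made to work in base R once the backslashes are doubled and the dot and slash are put in the right order. This is only a minimal sketch, not the method used in the answer above; the name tissue_name is illustrative.

```r
tissue <- c("./Thyroid.v8.signif_variant_gene_pairs.txt.gz",
            "./Esophagus_Muscularis.v8.signif_variant_gene_pairs.txt.gz",
            "./Adipose_Subcutaneous.v8.signif_variant_gene_pairs.txt.gz")

# "\\./" matches a literal "./", "(.*)" captures the tissue name,
# and "\\.v8" matches a literal ".v8"
pattern <- "\\./(.*)\\.v8"
matches <- regmatches(tissue, regexec(pattern, tissue))

# Each list element holds the full match first, then the captured group
tissue_name <- vapply(matches, function(m) m[2], character(1))
tissue_name
```

Unlike the lookaround version, this returns the capture group explicitly, so the full match ("./Thyroid.v8") is still available in matches if it is needed.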

Related

Regex: extracting matches preceding a pattern in R [duplicate]

This question already has answers here:
Remove part of string after "."
(6 answers)
Extract string before "|" [duplicate]
(3 answers)
Closed 1 year ago.
I'm trying to extract matches preceding a pattern in R. Let's say that I have a vector consisting of the following elements:
my_vector
> [1] "ABCC12|94160" "ABCC13|150000" "ABCC1|4363" "ACTA1|58"
[5] "ADNP2|22850" "ADNP|23394" "ARID1B|57492" "ARID2|196528"
I'm looking for a regular expression to extract all characters preceding the "|". The expected result must be something like this:
my_new_vector
> [1] "ABCC12" "ABCC13" "ABCC1" "ACTA1"
and so on.
I have already tried using stringr functions and regular expressions based on lookarounds, but without success.
I would really appreciate any advice or help with this issue.
Thanks in advance!
We could use trimws and specify whitespace as a regex that matches the | (a metacharacter, so escaped with \\) followed by zero or more characters (.*)
trimws(my_vector, whitespace = "\\|.*")
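If the whitespace argument of trimws is not available (it was added in R 3.6.0), the same idea can be sketched with sub, or without any regex by splitting on a literal |. The vector below only repeats the first four elements from the question, and first_piece is an illustrative name.

```r
my_vector <- c("ABCC12|94160", "ABCC13|150000", "ABCC1|4363", "ACTA1|58")

# Remove the first "|" and everything after it
my_new_vector <- sub("\\|.*", "", my_vector)
my_new_vector

# Regex-free variant: split on a literal "|" and keep the first piece
first_piece <- vapply(strsplit(my_vector, "|", fixed = TRUE),
                      function(p) p[1], character(1))
first_piece
```

Both variants assume that each element contains the text of interest before its first |.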

Gsub in R for hyphens and digits [duplicate]

This question already has answers here:
Trim a string to a specific number of characters in R
(3 answers)
Using gsub in R to remove values in Zip Code field
(1 answer)
Closed 2 years ago.
I'm trying to use gsub on the df$Zipcode in the following data frame:
#Sample
df <-data.frame(ID = c(1,2,3,4,5,6,7),
Zipcode =c("10001-2838", "95011", "95011", "100028018", "84321", "84321", "94011"))
df
I want to take everything after the "-" (hyphen) out and replace it with nothing. Something like:
df$Zipcode <- gsub("\-", "", df$Zipcode)
But I don't think that is quite right. I also want to take the first 5 digits of all Zipcodes that are longer than 5 digits, like observation 4. Which should just be 10002. Maybe this is correct:
df$Zipcode <- gsub("[:6:]", "", df$Zipcode)
We can capture the first 5 characters that are not a - as a group and replace with the backreference (\\1) of the captured group
df$Zipcode <- sub("^([^-]{5}).*", "\\1", df$Zipcode)
df$Zipcode
#[1] "10001" "95011" "95011" "10002" "84321" "84321" "94011"
I think what you're looking for is this:
sub("(\\d{5}).*", "\\1", df$Zipcode)
#[1] "10001" "95011" "95011" "10002" "84321" "84321" "94011"
This matches the first 5 digits, puts them into a capturing group, and 'remembers' them (but not the rest) via backreference \\1 in the replacement argument to sub.
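Since every value in the sample starts with five digits, a regex-free sketch with substr gives the same result here. This assumes the ZIP code always occupies the first five characters, which holds for the sample data but is not guaranteed in general; zip5 is an illustrative name.

```r
zipcode <- c("10001-2838", "95011", "95011", "100028018",
             "84321", "84321", "94011")

# Keep characters 1 to 5; shorter strings are returned unchanged
zip5 <- substr(zipcode, 1, 5)
zip5
```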

Remove first character of string with condition in R [duplicate]

This question already has answers here:
remove leading 0s with stringr in R
(3 answers)
Closed 2 years ago.
I'm trying to remove the 0 that appears at the beginning of some observations for Zipcode in the following table:
I think the sub function is probably my best choice but I only want to do the replacement for observations that begin with 0, not all observations like the following does:
data_individual$Zipcode <-sub(".", "", data_individual$Zipcode)
Is there a way to condition this so it only removes the first character if the Zipcode starts with 0? Maybe grepl for those that begin with 0 and generate a dummy variable to use?
We can specify ^0+ as the pattern, i.e. one or more 0s at the start (^) of the string, instead of . (which in regex matches any character)
data_individual$Zipcode <- sub("^0+", "", data_individual$Zipcode)
Or with tidyverse
library(stringr)
data_individual$Zipcode <- str_remove(data_individual$Zipcode, "^0+")
Another option without regex would be to convert to numeric, as numeric values don't keep leading 0s (assuming all zipcodes contain only digits). Note that the column is then numeric rather than character.
data_individual$Zipcode <- as.numeric(data_individual$Zipcode)
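The difference between the two approaches can be seen in a small sketch. The ZIP codes below are hypothetical, since the question's table is not reproduced here, and the variable names are illustrative.

```r
# Hypothetical ZIP codes; the question's table is not available
zipcode <- c("01234", "90210", "00501")

# Regex approach: strips leading zeros, result stays character
stripped <- sub("^0+", "", zipcode)
stripped

# Numeric conversion: leading zeros disappear, result becomes numeric
as_number <- as.numeric(zipcode)
as_number

class(stripped)
class(as_number)
```

Which one is preferable depends on whether later steps expect the ZIP code as text or as a number.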

Match string between ; and % [duplicate]

This question already has answers here:
Extracting a string between other two strings in R
(4 answers)
Closed 2 years ago.
I wish to extract the decimal value in the string without the % sign. So in this case, I want the numeric 0.45
x <- "document.write(GIC_annual[\"12-17 MTH\"][\"99999.99\"]);0.450%"
str_extract(x, "^;[0-9.]")
My attempt fails. Here's my thinking.
Begin the extraction at the semicolon ^;
Grab any numbers between 0 and 9.
Include the decimal point
You also have this option:
stringr::str_extract(x, "\\d\\.\\d{1,}(?=%)")
#[1] "0.450"
So basically you look ahead and check whether there is a %; if there is, you capture the digits before it.
Details
\\d a digit;
\\. a literal dot;
\\d{1,} one or more digits after the dot;
(?=%) a lookahead that checks for a % after the digits without including it in the match, so only the number is returned.
Since you don't want the semicolon in the output, use it in a lookbehind.
stringr::str_extract(x, "(?<=;)[0-9]\\.[0-9]+")
#[1] "0.450"
In base R, using sub:
sub('.*;([0-9]\\.[0-9]+).*', '\\1', x)
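The question asks for the numeric 0.45, while the patterns above return the string "0.450". A minimal sketch of the remaining step, assuming the base R sub approach, is to pass the captured text to as.numeric; captured and value are illustrative names.

```r
x <- "document.write(GIC_annual[\"12-17 MTH\"][\"99999.99\"]);0.450%"

# Capture the number that follows the semicolon
captured <- sub(".*;([0-9]\\.[0-9]+).*", "\\1", x)
captured

# Convert the captured string to a number
value <- as.numeric(captured)
value
```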

Keep part of string after last sign. [duplicate]

This question already has answers here:
Extract last word in string in R
(5 answers)
Closed 4 years ago.
I would like to keep only the string after the last | sign in my rownames which looks like this:
in:
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Chromatiales|f__Woeseiaceae|g__Woeseia"
out:
g__Woeseia
I have this code which keeps everything from the start until a given sign:
gsub("^.*\\.",".",x)
We could do this by capturing a group. Using sub, match characters (.*) up to the last |, capture zero or more characters that are not a | (([^|]*)) up to the end ($) of the string, and replace the whole match with the backreference (\\1) to the captured group
sub(".*\\|([^|]*)$", "\\1", str1)
#[1] "g__Woeseia"
Or match all characters up to the last | (.* is greedy) and replace them with blank ("")
sub(".*\\|", "", str1)
#[1] "g__Woeseia"
data
str1 <- "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Chromatiales|f__Woeseiaceae|g__Woeseia"
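A regex-free sketch of the same idea splits the string on a literal | and keeps the last piece. The names pieces and last_piece are illustrative, and the code handles a single string; for a vector of rownames the indexing would need to be applied per element.

```r
str1 <- "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Chromatiales|f__Woeseiaceae|g__Woeseia"

# fixed = TRUE treats "|" as a plain character rather than regex alternation
pieces <- strsplit(str1, "|", fixed = TRUE)[[1]]

# Keep only the final piece
last_piece <- pieces[length(pieces)]
last_piece
```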
