Keep part of string after last sign. [duplicate] - r

This question already has answers here:
Extract last word in string in R
(5 answers)
Closed 4 years ago.
I would like to keep only the string after the last | sign in my rownames which looks like this:
in:
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Chromatiales|f__Woeseiaceae|g__Woeseia"
out:
g__Woeseia
I have this code which keeps everything from the start until a given sign:
gsub("^.*\\.",".",x)

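We could do this with a capture group. Using sub, match any characters (.*) up to a literal | (escaped as \\|, because | is a regex metacharacter), then capture zero or more characters that are not a | (([^|]*)) up to the end ($) of the string, and replace the whole match with the backreference (\\1) to the captured group. Because .* is greedy, the escaped | ends up matching the last | in the string.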
sub(".*\\|([^|]*)$", "\\1", str1)
#[1] "g__Woeseia"
Or match everything up to and including the last | (the greedy .* pushes the match to the final one) and replace it with an empty string ("")
sub(".*\\|", "", str1)
#[1] "g__Woeseia"
data
str1 <- "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Chromatiales|f__Woeseiaceae|g__Woeseia"
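Since the question is about row names rather than a single string, here is a minimal sketch of how the second pattern might be applied to them. The matrix m and its row names are made up for illustration; the original data are not shown in the post.

```r
# Hypothetical matrix standing in for the asker's data (not from the original post)
m <- matrix(1:4, nrow = 2,
            dimnames = list(c("d__Bacteria|p__Proteobacteria|g__Woeseia",
                              "d__Bacteria|p__Firmicutes|g__Bacillus"),
                            c("s1", "s2")))

# Greedy .* runs to the last |, so only the final field survives
short_names <- sub(".*\\|", "", rownames(m))

# make.unique() is an added precaution in case two rows share the same last field
rownames(m) <- make.unique(short_names)
rownames(m)
```

sub is vectorised over its input, so no loop is needed. The make.unique step is optional and is not part of the original answer.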

Related

Regex: extracting matches preceding a pattern in R [duplicate]

This question already has answers here:
Remove part of string after "."
(6 answers)
Extract string before "|" [duplicate]
(3 answers)
Closed 1 year ago.
I'm trying to extract matches preceding a pattern in R. Let's say that I have a vector consisting of the following elements:
my_vector
> [1] "ABCC12|94160" "ABCC13|150000" "ABCC1|4363" "ACTA1|58"
[5] "ADNP2|22850" "ADNP|23394" "ARID1B|57492" "ARID2|196528"
I'm looking for a regular expression to extract all characters preceding the "|". The expected result must be something like this:
my_new_vector
> [1] "ABCC12" "ABCC13" "ABCC1" "ACTA1"
and so on.
I have already tried using stringr functions and regular expressions based on look arounds, but I failed.
I would really appreciate your advice and help in solving my issue.
Thanks in advance!
We could use trimws (its whitespace argument requires R 3.6.0 or later) and specify the whitespace as a regex that matches the | (a metacharacter, so it is escaped with \\) followed by zero or more characters (.*)
trimws(my_vector, whitespace = "\\|.*")
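For readers on an older R version, or who find the trimws trick hard to read, two base R alternatives can be sketched as follows. The vector below repeats the first four elements from the question; by_sub and by_split are names chosen here for illustration.

```r
my_vector <- c("ABCC12|94160", "ABCC13|150000", "ABCC1|4363", "ACTA1|58")

# Remove the first | and everything after it
by_sub <- sub("\\|.*", "", my_vector)

# Split on a literal | (fixed = TRUE means no regex escaping) and keep the first piece
by_split <- vapply(strsplit(my_vector, "|", fixed = TRUE), `[`, character(1), 1)

by_sub
identical(by_sub, by_split)
```

Both approaches should give "ABCC12" "ABCC13" "ABCC1" "ACTA1" for this input. The strsplit version avoids regex entirely, which may be easier to maintain if the separator changes.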

Gsub in R for hyphens and digits [duplicate]

This question already has answers here:
Trim a string to a specific number of characters in R
(3 answers)
Using gsub in R to remove values in Zip Code field
(1 answer)
Closed 2 years ago.
I'm trying to use gsub on the df$Zipcode in the following data frame:
#Sample
df <-data.frame(ID = c(1,2,3,4,5,6,7),
Zipcode =c("10001-2838", "95011", "95011", "100028018", "84321", "84321", "94011"))
df
I want to take everything after the "-" (hyphen) out and replace it with nothing. Something like:
df$Zipcode <- gsub("\-", "", df$Zipcode)
But I don't think that is quite right. I also want to keep only the first 5 digits of any Zipcode that is longer than 5 digits, like observation 4, which should become just 10002. Maybe this is correct:
df$Zipcode <- gsub("[:6:]", "", df$Zipcode)
We can capture the first 5 characters that are not a - as a group and replace with the backreference (\\1) of the captured group
df$Zipcode <- sub("^([^-]{5}).*", "\\1", df$Zipcode)
df$Zipcode
#[1] "10001" "95011" "95011" "10002" "84321" "84321" "94011"
I think what you're looking for is this:
sub("(\\d{5}).*", "\\1", df$Zipcode)
[1] "10001" "95011" "95011" "10002" "84321" "84321" "94011"
This matches the first 5 digits, puts them into a capturing group, and 'remembers' them (but not the rest) via backreference \\1 in the replacement argument to sub.
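Because every value in the sample data starts with five digits, a regex is not strictly needed. As a hedged alternative (not one of the original answers), substr can simply keep the first five characters:

```r
# Sample data from the question; stringsAsFactors = FALSE keeps Zipcode as character in R < 4.0
df <- data.frame(ID = c(1, 2, 3, 4, 5, 6, 7),
                 Zipcode = c("10001-2838", "95011", "95011", "100028018",
                             "84321", "84321", "94011"),
                 stringsAsFactors = FALSE)

# Keep characters 1 to 5 of each value, whatever they are
df$Zipcode <- substr(df$Zipcode, 1, 5)
df$Zipcode
```

This assumes the first five characters are always the digits wanted. A value such as "1234-5678" would come back as "1234-", so the regex versions above are the safer choice when the input is less regular.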

Extract characters between two characters R [duplicate]

This question already has answers here:
Extracting a string between other two strings in R
(4 answers)
Closed 3 years ago.
I have a df and I want to extract the tissue name between the './' and '.v8'
So for this df the result would be a column with just 'Thyroid', 'Esophagus_Muscularis', 'Adipose_Subcutaneous'
gene<-c("ENSG00000065485.19","ENSG00000079112.9","ENSG00000079112")
tissue<-c("./Thyroid.v8.signif_variant_gene_pairs.txt.gz","./Esophagus_Muscularis.v8.signif_variant_gene_pairs.txt.gz","./Adipose_Subcutaneous.v8.signif_variant_gene_pairs.txt.gz")
df<-data.frame(gene,tissue)
I really struggle with regex and tried:
pattern="/.\(.*)/.v8(.*)"
result <- regmatches(df$tissue,regexec(pattern,df$tissue))
but I get:
Error: '(' is an unrecognized escape in character string starting
""/.("
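In an R string the backslash itself must be escaped, so a regex escape is written as \\ (the single \( in the pattern above is what triggers the error). Here, we use regex lookarounds to match the word (\\w+) that follows the . (a metacharacter, so escaped) and the /, and that is followed by the . (again escaped with \\) and 'v8'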
library(stringr)
library(dplyr)
df %>%
mutate(new = str_extract(tissue, "(?<=\\.[/])\\w+(?=\\.v8)"))
# gene tissue new
#1 ENSG00000065485.19 ./Thyroid.v8.signif_variant_gene_pairs.txt.gz Thyroid
#2 ENSG00000079112.9 ./Esophagus_Muscularis.v8.signif_variant_gene_pairs.txt.gz Esophagus_Muscularis
#3 ENSG00000079112 ./Adipose_Subcutaneous.v8.signif_variant_gene_pairs.txt.gz Adipose_Subcutaneous
The (?<=\\.[/]) is a positive lookbehind that matches the . and the / preceding the word (\\w+), and (?=\\.v8) is a positive lookahead that matches the . and the string 'v8' after the word. So, basically, it looks for a word that has a given pattern before and after it and extracts only the word
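The asker's own regmatches/regexec attempt can also be made to work in base R once the escapes are doubled. The following is a sketch under that assumption, not part of the original answer; tissue_name is a name introduced here.

```r
tissue <- c("./Thyroid.v8.signif_variant_gene_pairs.txt.gz",
            "./Esophagus_Muscularis.v8.signif_variant_gene_pairs.txt.gz",
            "./Adipose_Subcutaneous.v8.signif_variant_gene_pairs.txt.gz")

# Literal "./", then group 1 (the tissue name), then literal ".v8", then group 2 (the rest)
pattern <- "\\./(.*)\\.v8(.*)"
m <- regmatches(tissue, regexec(pattern, tissue))

# Each list element is c(full match, group 1, group 2); element 2 is the tissue name
tissue_name <- sapply(m, `[`, 2)
tissue_name
```

regexec returns the capture groups alongside the full match, which is why the second element is selected. This avoids loading stringr and dplyr when only one column is needed.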

Remove characters which repeat more than twice in a string [duplicate]

This question already has answers here:
remove repeated character between words
(4 answers)
Closed 3 years ago.
I have this text:
F <- "hhhappy birthhhhhhdayyy"
and I want to remove the repeated characters. I tried the code from
https://stackoverflow.com/a/11165145/10718214
and it works, but I only need to collapse a character when it repeats more than 2 times; if it appears exactly 2 times, it should be kept.
so the output that I expect is
"happy birthday"
any help?
Try using gsub, with the pattern (.)\\1{2,}:
F <- ("hhhappy birthhhhhhdayyy")
gsub("(.)\\1{2,}", "\\1", F)
[1] "happy birthday"
Explanation of regex:
(.) match and capture any single character
\\1{2,} then match the same character two or more times
We replace each such run with just the single matching character. The \\1 in the replacement refers to the first capture group, and gsub (unlike sub) applies the replacement to every run in the string.
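To make the two details above concrete, here is a small sketch contrasting sub with gsub and showing that doubled letters are left alone. The variable x and the test string "aabbbcccc" are chosen here for illustration.

```r
# Same text as in the question, named x to avoid masking F (the shorthand for FALSE)
x <- "hhhappy birthhhhhhdayyy"

# sub() stops after the first run it finds
once <- sub("(.)\\1{2,}", "\\1", x)

# gsub() collapses every run of three or more identical characters
all_runs <- gsub("(.)\\1{2,}", "\\1", x)

# A run of exactly two ("aa") does not match {2,} repeats of the capture, so it is kept
doubles <- gsub("(.)\\1{2,}", "\\1", "aabbbcccc")

once      # "happy birthhhhhhdayyy"
all_runs  # "happy birthday"
doubles   # "aabc"
```

The "pp" in "happy" survives for the same reason as the "aa" in the last example.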

Remove a part of a string [duplicate]

This question already has answers here:
Remove part of string after "."
(6 answers)
Closed 5 years ago.
I got a string;
"Enviroment is dangerous.123"
Now I want to remove everything after "dangerous" so the result will be
"Enviroment is dangerous"
I got different text strings of different length. So it needs to respond to the string "dangerous"
How do I do that?
We can use sub to match the literal . (escaped as \\.) followed by one or more digits (\\d+) up to the end of the string ($) and replace the match with an empty string ("")
sub("\\.\\d+$", "", str1)
#[1] "Enviroment is dangerous"
data
str1 <- "Enviroment is dangerous.123"
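The question also asks for the match to key on the word "dangerous" itself rather than on the trailing digits. One possible sketch, assuming the keyword appears at most once per string, captures the word and drops whatever follows it. The extra strings in x are invented for illustration.

```r
str1 <- "Enviroment is dangerous.123"

# Capture the keyword, match the rest of the string, and put back only the keyword
res1 <- sub("(dangerous).*", "\\1", str1)

# Hypothetical extra inputs: a different suffix, and a string without the keyword
x <- c(str1, "This road is dangerous!! beware", "no keyword here")
res_all <- sub("(dangerous).*", "\\1", x)

res1
res_all
```

Strings that do not contain the keyword are returned unchanged, because sub only rewrites what the pattern matches.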
