This question already has answers here:
Extract last word in string in R
(5 answers)
Closed 4 years ago.
I would like to keep only the string after the last | sign in my rownames, which look like this:
in:
"d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Chromatiales|f__Woeseiaceae|g__Woeseia"
out:
g__Woeseia
I have this code which keeps everything from the start until a given sign:
gsub("^.*\\.",".",x)
We can do this with a capture group. With sub, the greedy .* matches everything up to the last |; ([^|]*) then captures zero or more characters that are not a | up to the end ($) of the string, and the whole match is replaced by the backreference (\\1) to the captured group:
sub(".*\\|([^|]*)$", "\\1", str1)
#[1] "g__Woeseia"
Or match everything up to the last | (.* is greedy) and replace it with an empty string (""):
sub(".*\\|", "", str1)
#[1] "g__Woeseia"
data
str1 <- "d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Chromatiales|f__Woeseiaceae|g__Woeseia"
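Since the question is about row names, here is a minimal sketch of applying the same pattern to them. The matrix m and its shortened taxonomy strings are made up for illustration; only the sub call comes from the answer above.

```r
# Hypothetical matrix whose row names are |-separated taxonomy strings
m <- matrix(1:4, nrow = 2,
            dimnames = list(c("d__Bacteria|p__Proteobacteria|g__Woeseia",
                              "d__Bacteria|p__Firmicutes|g__Bacillus"),
                            c("s1", "s2")))

# sub() is vectorised, so it handles every row name in one call
rownames(m) <- sub(".*\\|", "", rownames(m))
rownames(m)
```

If two rows end up with the same last field, a matrix will accept the duplicated names but a data frame will not, so it may be worth checking anyDuplicated() on the result first.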
This question already has answers here:
Remove part of string after "."
(6 answers)
Extract string before "|" [duplicate]
(3 answers)
Closed 1 year ago.
I'm trying to extract matches preceding a pattern in R. Let's say that I have a vector consisting of the following elements:
my_vector
> [1] "ABCC12|94160" "ABCC13|150000" "ABCC1|4363" "ACTA1|58"
[5] "ADNP2|22850" "ADNP|23394" "ARID1B|57492" "ARID2|196528"
I'm looking for a regular expression to extract all characters preceding the "|". The expected result must be something like this:
my_new_vector
> [1] "ABCC12" "ABCC13" "ABCC1" "ACTA1"
and so on.
I have already tried using stringr functions and regular expressions based on look arounds, but I failed.
I would really appreciate your advice and help in solving my issue.
Thanks in advance!
We could use trimws and specify whitespace as a regex that matches the | (a metacharacter, so it is escaped with \\) followed by zero or more characters (.*):
trimws(my_vector, whitespace = "\\|.*")
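The whitespace argument of trimws is only available from R 3.6.0 onwards. As a sketch for older versions, a plain sub call that drops everything from the first | should give the same result on this kind of data. Here my_vector is rebuilt from the first few values shown in the question.

```r
my_vector <- c("ABCC12|94160", "ABCC13|150000", "ABCC1|4363", "ACTA1|58")

# Remove the first | and everything after it
my_new_vector <- sub("\\|.*", "", my_vector)
my_new_vector
```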
This question already has answers here:
Trim a string to a specific number of characters in R
(3 answers)
Using gsub in R to remove values in Zip Code field
(1 answer)
Closed 2 years ago.
I'm trying to use gsub on the df$Zipcode in the following data frame:
#Sample
df <-data.frame(ID = c(1,2,3,4,5,6,7),
Zipcode =c("10001-2838", "95011", "95011", "100028018", "84321", "84321", "94011"))
df
I want to take everything after the "-" (hyphen) out and replace it with nothing. Something like:
df$Zipcode <- gsub("\-", "", df$Zipcode)
But I don't think that is quite right. I also want to take the first 5 digits of all Zipcodes that are longer than 5 digits, like observation 4. Which should just be 10002. Maybe this is correct:
df$Zipcode <- gsub("[:6:]", "", df$Zipcode)
We can capture the first 5 characters that are not a - as a group, match the rest of the string with .*, and replace the whole match with the backreference (\\1) to the captured group:
df$Zipcode <- sub("^([^-]{5}).*", "\\1", df$Zipcode)
df$Zipcode
#[1] "10001" "95011" "95011" "10002" "84321" "84321" "94011"
I think what you're looking for is this:
sub("(\\d{5}).*", "\\1", df$Zipcode)
[1] "10001" "95011" "95011" "10002" "84321" "84321" "94011"
This matches the first 5 digits, puts them into a capturing group, and 'remembers' them (but not the rest) via backreference \\1 in the replacement argument to sub.
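The two patterns above agree on the sample data but are not interchangeable in general. The sketch below compares them, plus a regex-free substr variant, on a subset of the question's values and on one made-up input (odd) that does not start with digits.

```r
zip <- c("10001-2838", "95011", "100028018")

# Both patterns from the answers agree on the sample data
a <- sub("^([^-]{5}).*", "\\1", zip)
b <- sub("(\\d{5}).*", "\\1", zip)

# Without a regex: keep the first five characters
c5 <- substr(zip, 1, 5)

# Hypothetical input that does not start with digits: the anchored pattern
# keeps the first five non-hyphen characters, while the unanchored one
# keeps the prefix and trims after the first run of five digits
odd <- "AB12345-678"
sub("^([^-]{5}).*", "\\1", odd)  # "AB123"
sub("(\\d{5}).*", "\\1", odd)    # "AB12345"
```

Which behaviour is right depends on what malformed zip codes look like in the real data, so it is worth testing a few of those before picking one.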
This question already has answers here:
Extracting a string between other two strings in R
(4 answers)
Closed 3 years ago.
I have a df and I want to extract the tissue name between the './' and '.v8'
So for this df the result would be a column with just 'Thyroid', 'Esophagus_Muscularis', and 'Adipose_Subcutaneous'.
gene<-c("ENSG00000065485.19","ENSG00000079112.9","ENSG00000079112")
tissue<-c("./Thyroid.v8.signif_variant_gene_pairs.txt.gz","./Esophagus_Muscularis.v8.signif_variant_gene_pairs.txt.gz","./Adipose_Subcutaneous.v8.signif_variant_gene_pairs.txt.gz")
df<-data.frame(gene,tissue)
I really struggle with regex and tried:
pattern="/.\(.*)/.v8(.*)"
result <- regmatches(df$tissue,regexec(pattern,df$tissue))
but I get:
Error: '(' is an unrecognized escape in character string starting
""/.("
In an R string, the backslash itself has to be escaped (\\). Here, we use regex lookarounds to match the word (\\w+) that follows the . (a metacharacter, so escaped) and the /, and that is followed by the . (escaped with \\) and 'v8':
library(stringr)
library(dplyr)
df %>%
mutate(new = str_extract(tissue, "(?<=\\.[/])\\w+(?=\\.v8)"))
# gene tissue new
#1 ENSG00000065485.19 ./Thyroid.v8.signif_variant_gene_pairs.txt.gz Thyroid
#2 ENSG00000079112.9 ./Esophagus_Muscularis.v8.signif_variant_gene_pairs.txt.gz Esophagus_Muscularis
#3 ENSG00000079112 ./Adipose_Subcutaneous.v8.signif_variant_gene_pairs.txt.gz Adipose_Subcutaneous
(?<=\\.[/]) is a positive lookbehind that matches the . and the / preceding the word (\\w+), and (?=\\.v8) is a positive lookahead that matches the . and the string 'v8' after the word. So, basically, it looks for a word that has a given pattern before and after it and extracts only the word.
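For comparison, the regmatches/regexec attempt from the question can also be made to work in base R once the escapes are fixed. This is only a sketch of one possible repair, not the approach the answer above uses; the names pattern, m and new are illustrative.

```r
tissue <- c("./Thyroid.v8.signif_variant_gene_pairs.txt.gz",
            "./Esophagus_Muscularis.v8.signif_variant_gene_pairs.txt.gz",
            "./Adipose_Subcutaneous.v8.signif_variant_gene_pairs.txt.gz")

# Escape the literal dots with \\ ; the / needs no escaping
pattern <- "\\./(.*)\\.v8"
m <- regmatches(tissue, regexec(pattern, tissue))

# Each list element holds the full match first, then the capture group
new <- vapply(m, function(x) x[2], character(1))
new
```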
This question already has answers here:
remove repeated character between words
(4 answers)
Closed 3 years ago.
I have this text:
F <- "hhhappy birthhhhhhdayyy"
and I want to remove the repeat characters, I tried this code
https://stackoverflow.com/a/11165145/10718214
and it works, but I only need to collapse a character when it is repeated more than twice; if it is repeated exactly twice, I want to keep it.
so the output that I expect is
"happy birthday"
any help?
Try using gsub, with the pattern (.)\\1{2,}:
F <- ("hhhappy birthhhhhhdayyy")
gsub("(.)\\1{2,}", "\\1", F)
[1] "happy birthday"
Explanation of regex:
(.) match and capture any single character
\\1{2,} then match the same character two or more additional times (three or more in a row in total)
We replace each such run with just the single captured character. In the replacement, \\1 refers to the first capture group.
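To see why the quantifier matters, here is a small sketch contrasting {2,} with +, which collapses every repeated run. The variable names keep_doubles and collapse_all are made up for illustration.

```r
x <- "hhhappy birthhhhhhdayyy"

# {2,}: collapse runs of three or more, so the double "pp" survives
keep_doubles <- gsub("(.)\\1{2,}", "\\1", x)

# +: collapse any run of two or more, which also shortens "pp"
collapse_all <- gsub("(.)\\1+", "\\1", x)

keep_doubles
collapse_all
```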
This question already has answers here:
Remove part of string after "."
(6 answers)
Closed 5 years ago.
I got a string;
"Enviroment is dangerous.123"
Now I want to remove everything after "dangerous" so the result will be
"Enviroment is dangerous"
I have text strings of different lengths, so the match needs to key on the string "dangerous".
How do I do that?
We can use sub to match the . followed by one or more digits (\\d+) up to the end of the string ($) and replace the match with an empty string (""):
sub("\\.\\d+$", "", str1)
#[1] "Enviroment is dangerous"
data
str1 <- "Enviroment is dangerous.123"
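The pattern above relies on the trailing ".123" rather than on the word itself. If the cut should be keyed on "dangerous", as the question asks, one possible sketch is to capture the word and drop whatever follows it. The second test string is made up for illustration.

```r
x <- c("Enviroment is dangerous.123",
       "This road is dangerous at night")  # second string is hypothetical

# Capture the keyword and drop everything after its first occurrence
res <- sub("(dangerous).*", "\\1", x)
res
```

Strings that do not contain "dangerous" are returned unchanged, since sub only replaces when the pattern matches.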