extract text from email and between two dots in R

extract text from email and between two dots in R - r

I have some email address where I am trying to extract the domain from. I found a solution here but it is taking too long.
I am trying with the following approach:
First remove all the text before the # sign.
gsub("#(.+)$", "\\1", emails)
Other - not used
qdapRegex::ex_between(emails, ".", ".")
Data:
emails <- c("ut317#hotmail.com", "drrro#iueywapp.com", "esdfdsfos#lasdfsdfsdstores.com",
"asfds#mobsdaff.com", "asfsdaf.gsdsfdsfd#hotmail.org", "asdfdsaf#sdffsddapp.com",
"wqrerq.mwqerweing#mwerqwie.com", "qwera#niweqrerw.tv", "qwereqr3rew7#hotmail.com",
"mqwerwewrk#moweqrewsfaslay.com")

You can try the following:
str_sub(emails, str_locate(emails, "#")[,1]+1)
Output:
[1] "hotmail.com" "iueywapp.com" "lasdfsdfsdstores.com" "mobsdaff.com"
[5] "hotmail.org" "sdffsddapp.com" "mwerqwie.com" "niweqrerw.tv"
[9] "hotmail.com" "moweqrewsfaslay.com"

how about
sub(".*#(.*)\\..*","\\1",emails)
[1] "hotmail" "iueywapp" "lasdfsdfsdstores" "mobsdaff"
[5] "hotmail" "sdffsddapp" "mwerqwie" "niweqrerw"
[9] "hotmail" "moweqrewsfaslay"
or if you want everything after the #:
sub(".*#","",emails)
[1] "hotmail.com" "iueywapp.com" "lasdfsdfsdstores.com"
[4] "mobsdaff.com" "hotmail.org" "sdffsddapp.com"
[7] "mwerqwie.com" "niweqrerw.tv" "hotmail.com"
[10] "moweqrewsfaslay.com"

We can use trimws from base R
trimws(emails, whitespace= '.*#')
#[1] "hotmail.com" "iueywapp.com" "lasdfsdfsdstores.com" "mobsdaff.com" "hotmail.org" "sdffsddapp.com"
#[7] "mwerqwie.com" "niweqrerw.tv" "hotmail.com" "moweqrewsfaslay.com"
trimws(emails, whitespace= '.*#|\\..*')
#[1] "hotmail" "iueywapp" "lasdfsdfsdstores" "mobsdaff" "hotmail" "sdffsddapp" "mwerqwie"
#[8] "niweqrerw" "hotmail" "moweqrewsfaslay"

Related

How to use regex in R?

I want to remove the dashes and keep only the first 4 substrings except for the last character.
sub.maf.barcode <- gsub("^([^-]*-[^-]*-[^-]*-[^-]*).{1}$", "\\1", ori.maf.barcode$Tumor_Sample_Barcode)
> ori.maf.barcode$Tumor_Sample_Barcode[1:5]
[1] "TCGA-2K-A9WE-01A-11D-A382-10" "TCGA-2Z-A9J1-01A-11D-A382-10"
[3] "TCGA-2Z-A9J2-01A-11D-A382-10" "TCGA-2Z-A9J3-01A-12D-A382-10"
[5] "TCGA-2Z-A9J5-01A-21D-A382-10"
Expected output:
[1] "TCGA-2K-A9WE-01" "TCGA-2Z-A9J1-01"
[3] "TCGA-2Z-A9J2-01" "TCGA-2Z-A9J3-01"
[5] "TCGA-2Z-A9J5-01"

You could do
gsub('.-[^-]*-[^-]*-.[^-]*$', "", ori.maf.barcode$Tumor_Sample_Barcode)
#> [1] "TCGA-2K-A9WE-01" "TCGA-2Z-A9J1-01" "TCGA-2Z-A9J2-01"
#> [4] "TCGA-2Z-A9J3-01" "TCGA-2Z-A9J5-01"
Or
substr(ori.maf.barcode$Tumor_Sample_Barcode, 1, 15)
#> [1] "TCGA-2K-A9WE-01" "TCGA-2Z-A9J1-01" "TCGA-2Z-A9J2-01"
#> [4] "TCGA-2Z-A9J3-01" "TCGA-2Z-A9J5-01"

using str_extract
library(stringr)
str_extract(ori.maf.barcode$Tumor_Sample_Barcode, "^([^-]+-){3}\\d+")
-output
[1] "TCGA-2K-A9WE-01" "TCGA-2Z-A9J1-01" "TCGA-2Z-A9J2-01"
[4] "TCGA-2Z-A9J3-01" "TCGA-2Z-A9J5-01"

Change the row names in R

i have two dataframes with similar rownames:
> rownames(abundance)[1:10]
[1] "X001.V2.fastq_mapped_to_agora.txt.uniq"
[2] "X001.V8.fastq_mapped_to_agora.txt.uniq"
[3] "X003.V17.fastq_mapped_to_agora.txt.uniq"
[4] "X003.V2.fastq_mapped_to_agora.txt.uniq"
[5] "X003.V8.fastq_mapped_to_agora.txt.uniq"
[6] "X004.V2.fastq_mapped_to_agora.txt.uniq"
[7] "X004.V8.fastq_mapped_to_agora.txt.uniq"
[8] "X005.V2.fastq_mapped_to_agora.txt.uniq"
[9] "X005.V8.fastq_mapped_to_agora.txt.uniq"
[10] "X006.V2.fastq_mapped_to_agora.txt.uniq"
> rownames(fluxes)[1:10]
[1] "001.V8" "003.V17" "003.V2" "003.V8" "004.V2" "004.V8" "005.V2"
[8] "005.V8" "006.V2" "006.V8"
But the row names of the dataframe abundance is larger. How can i make the names of each rows like the rownames of fluxes. It can be like from "X" to second ".".

We could use sub:
rownames(abundance) <- sub("X(.*)\\.fastq_mapped_to_agora\\.txt\\.uniq", "\\1", rownames(abundance))
Output:
[1] "001.V2" "001.V8" "003.V17" "003.V2" "003.V8" "004.V2" "004.V8" "005.V2" "005.V8" "006.V2"

We may use trimws
rownames(abundance) <- trimws(rownames(abundance), whitespace = "\\..*")
Or could be
rownames(abundance) <- sub("^([^.]+\\.[^.]+)\\..*", "\\1", rownames(abundance))
-testing
> trimws("X001.V2.fastq_mapped_to_agora.txt.uniq", whitespace = "\\..*")
[1] "X001"
> sub("^([^.]+\\.[^.]+)\\..*", "\\1", "X001.V2.fastq_mapped_to_agora.txt.uniq")
[1] "X001.V2"

How to extract text from a column using R

How would I go about extracting, for each row (there are ~56,000 records in an Excel file) in a specific column, only part of a string? I need to keep all text to the left of the last '/' forward slash. The challenge is that not all cells have the same number of '/'. There is always a filename (*.wav) at the end of the last '/', but the number of characters in the filename is not always the same (sometimes 5 and sometimes 6).
Below are some examples of the strings in the cells:
cloch/51.wav
grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav
grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav
AB_AeolinaL/025-C#.wav
AB_AeolinaL/026-D.wav
AB_violadamourL/rel99999/091-G.wav
AB_violadamourL/rel99999/092-G#.wav
AB_violadamourR/024-C.wav
AB_violadamourR/025-C#.wav
The extracted text should be:
cloch
grand/Grand_bombarde/02-suchy_Grand_bombarde
grand/Grand_bombarde/02-suchy_Grand_bombarde
AB_AeolinaL
AB_AeolinaL
AB_violadamourL/rel99999
AB_violadamourL/rel99999
AB_violadamourR
AB_violadamourR
Can anyone recommend a strategy using R?

You can use the stringr package str_remove(string,pattern) function like:
str = "grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav"
str_remove(str,"/[0-9]+[-]*[A-Z]*[#]*[.][a-z]+")
Output:
> str_remove(str,"/[0-9]+[-]*[A-Z]*[#]*[.][a-z]+")
[1] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
Then you can just iterate over all other strings:
strings <- c("cloch/51.wav",
"grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav",
"grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav",
"AB_AeolinaL/025-C#.wav",
"AB_AeolinaL/026-D.wav",
"AB_violadamourL/rel99999/091-G.wav",
"AB_violadamourL/rel99999/092-G#.wav",
"AB_violadamourR/024-C.wav",
"AB_violadamourR/025-C#.wav")
str_remove(strings,"/[0-9]+[-]*[A-Z]*[#]*[.][a-z]+")
Output:
> str_remove(strings,"/[0-9]+[-]*[A-Z]*[#]*[.][a-z]+")
[1] "cloch"
[2] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[3] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[4] "AB_AeolinaL"
[5] "AB_AeolinaL"
[6] "AB_violadamourL/rel99999"
[7] "AB_violadamourL/rel99999"
[8] "AB_violadamourR"
[9] "AB_violadamourR"

You have to substract strings using this method:
substr(strings,1,regexpr("\\/[^\\/]*$", strings)-1)
[1] "cloch"
[2] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[3] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[4] "AB_AeolinaL"
[5] "AB_AeolinaL"
[6] "AB_violadamourL/rel99999"
[7] "AB_violadamourL/rel99999"
[8] "AB_violadamourR"
[9] "AB_violadamourR"
Input
strings<-c("cloch/51.wav","grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav","grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav","AB_AeolinaL/025-C#.wav","AB_AeolinaL/026-D.wav","AB_violadamourL/rel99999/091-G.wav","AB_violadamourL/rel99999/092-G#.wav","AB_violadamourR/024-C.wav","AB_violadamourR/025-C#.wav")
In which this regex regexpr("\\/[^\\/]*$", strings) gives you the position of the last "/"

Assuming that the strings you propose are in a column of a dataframe:
df <- data.frame(x = 1:5, y = c("cloch/51.wav",
"grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav",
"grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav",
"AB_AeolinaL/025-C#.wav",
"AB_AeolinaL/026-D.wav"))
# I define a function that separates a string at each "/"
# throws the last piece and reattaches the pieces
cut_str <- function(s) {
st <- head((unlist(strsplit(s, "\\/"))), -1)
r <- paste(st, collapse = "/")
return(r)
}
# through the sapply function I get the desired result
new_strings <- as.vector(sapply(df$y, FUN = cut_str))
new_strings
[1] "cloch"
[2] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[3] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[4] "AB_AeolinaL"
[5] "AB_AeolinaL"

You could use
dirname(strings)
If there is no /, this returns ., which you could remove afterwards if you like, e.g.:
res <- dirname(strings)
res[res=="."] <- ""
``

You could start the match with / followed by 1 or more times any char except a forward slash or a whitespace char using a negated character class [^\\s/]+
Then match .wav at the end of the string using $
Replace the match with an empty string using sub for example.
[^\\s/]+\\.wav$
See the regex matches | R demo
strings <- c("cloch/51.wav",
"grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav",
"grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav",
"AB_AeolinaL/025-C#.wav",
"AB_AeolinaL/026-D.wav",
"AB_violadamourL/rel99999/091-G.wav",
"AB_violadamourL/rel99999/092-G#.wav",
"AB_violadamourR/024-C.wav",
"AB_violadamourR/025-C#.wav")
sub("/[^\\s/]+\\.wav$", "", strings)
Output
[1] "cloch"
[2] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[3] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[4] "AB_AeolinaL"
[5] "AB_AeolinaL"
[6] "AB_violadamourL/rel99999"
[7] "AB_violadamourL/rel99999"
[8] "AB_violadamourR"
[9] "AB_violadamourR"

Remove string from a vector in R

I have a vector that looks like
> inecodes
[1] "01001" "01002" "01049" "01003" "01006" "01037" "01008" "01004" "01009" "01010" "01011"
[12] "01013" "01014" "01016" "01017" "01021" "01022" "01023" "01046" "01056" "01901" "01027"
[23] "01019" "01020" "01028" "01030" "01031" "01032" "01902" "01033" "01036" "01058" "01034"
[34] "01039" "01041" "01042" "01043" "01044" "01047" "01051" "01052" "01053" "01054" "01055"
And I want to remove these "numbers" from this vector:
>pob
[1] "01001-Alegría-Dulantzi" "01002-Amurrio"
[3] "01049-Añana" "01003-Aramaio"
[5] "01006-Armiñón" "01037-Arraia-Maeztu"
[7] "01008-Arratzua-Ubarrundia" "01004-Artziniega"
[9] "01009-Asparrena" "01010-Ayala/Aiara"
[11] "01011-Baños de Ebro/Mañueta" "01013-Barrundia"
[13] "01014-Berantevilla" "01016-Bernedo"
[15] "01017-Campezo/Kanpezu" "01021-Elburgo/Burgelu"
[17] "01022-Elciego" "01023-Elvillar/Bilar"
[19] "01046-Erriberagoitia/Ribera Alta"
They are longer that these samples and they don't have the same length. The answer must to be like following:
>pob
[1] "Alegría-Dulantzi" "Amurrio"
[3] "Añana" "Aramaio"
[5] "Armiñón" "Arraia-Maeztu"
[7] "Arratzua-Ubarrundia" "Artziniega"
[9] "Asparrena" "Ayala/Aiara"
[11] "Baños de Ebro/Mañueta" "Barrundia"
[13] "Berantevilla" "Bernedo"
[15] "Campezo/Kanpezu" "Elburgo/Burgelu"
[17] "Elciego" "Elvillar/Bilar"
[19] "Erriberagoitia/Ribera Alta"

Not sure why you needed inecodes at all, since you can use sub to remove all digits:
sub('^\\d+-', '', pob)
Result:
[1] "Alegría-Dulantzi" "Amurrio" "Añana"
[4] "Aramaio" "Armiñón" "Arraia-Maeztu"
[7] "Arratzua-Ubarrundia" "Artziniega" "Asparrena"
[10] "Ayala/Aiara" "Baños de Ebro/Mañueta" "Barrundia"
[13] "Berantevilla" "Bernedo" "Campezo/Kanpezu"
[16] "Elburgo/Burgelu" "Elciego" "Elvillar/Bilar"
[19] "Erriberagoitia/Ribera Alta"
One reason that you might need inecodes is that you have codes in pob that don't exist in inecodes, but that doesn't seem like the case here. If you insist on using inecodes to remove numbers from pob, you can use str_replace_all from stringr:
library(stringr)
str_replace_all(pob, setNames(rep("", length(inecodes)), paste0(inecodes, "-")))
This gives you the exact same result:
[1] "Alegría-Dulantzi" "Amurrio" "Añana"
[4] "Aramaio" "Armiñón" "Arraia-Maeztu"
[7] "Arratzua-Ubarrundia" "Artziniega" "Asparrena"
[10] "Ayala/Aiara" "Baños de Ebro/Mañueta" "Barrundia"
[13] "Berantevilla" "Bernedo" "Campezo/Kanpezu"
[16] "Elburgo/Burgelu" "Elciego" "Elvillar/Bilar"
[19] "Erriberagoitia/Ribera Alta"
Data:
inecodes = c("01001", "01002", "01049", "01003", "01006", "01037", "01008",
"01004", "01009", "01010", "01011", "01013", "01014", "01016",
"01017", "01021", "01022", "01023", "01046", "01056", "01901",
"01027", "01019", "01020", "01028", "01030", "01031", "01032",
"01902", "01033", "01036", "01058", "01034", "01039", "01041",
"01042", "01043", "01044", "01047", "01051", "01052", "01053",
"01054", "01055")
pob = c("01001-Alegría-Dulantzi", "01002-Amurrio", "01049-Añana", "01003-Aramaio",
"01006-Armiñón", "01037-Arraia-Maeztu", "01008-Arratzua-Ubarrundia",
"01004-Artziniega", "01009-Asparrena", "01010-Ayala/Aiara", "01011-Baños de Ebro/Mañueta",
"01013-Barrundia", "01014-Berantevilla", "01016-Bernedo", "01017-Campezo/Kanpezu",
"01021-Elburgo/Burgelu", "01022-Elciego", "01023-Elvillar/Bilar",
"01046-Erriberagoitia/Ribera Alta")

library(stringr)
for(code in inecodes) {
ix <- which(str_detect(pob, code))
pob[ix] <- unlist(str_split(pob, "-", 2))[2]
}

Try this. Match should be much faster
pos<-which(!is.na(pob[match(sub('^([0-9]+)-.*$','\\1',pob),inecodes)]))
pob[pos]<-sub('^[0-9]+-(.*)$','\\1',pob[pos])
Please do post the timings if you manage to get this. Match usually solves many computational issues for large data sets lookup. Would like to see if there are any opposite scenarios.

A bit shorter than sub, str_detect and str_replace is str_remove:
library(stringr)
c("01001-Alegría-Dulantzi", "01002-Amurrio") %>%
str_remove("[0-9]*-")
returns
"Alegría-Dulantzi" "Amurrio"

Case insensitive sort of vector of string in R

I have the following vector:
mylist <- c("MBT.LN.ID", "ISA51VG.LN.ID", "R848.LN.ID", "sHz.LN.ID", "FK565.LN.ID",
"bCD.LN.ID", "MALP2s.LN.ID", "ADX.LN.ID", "AddaVax.LN.ID", "FCA.LN.ID",
"Pam3CSK4.LN.ID", "D35.LN.ID", "ALM.LN.ID", "K3.LN.ID", "K3SPG.LN.ID",
"MPLA.LN.ID", "DMXAA.LN.ID", "cGAMP.LN.ID", "Poly_IC.LN.ID",
"cdiGMP.LN.ID")
I'd like to sort them alphabetically in case-insensitive manner.
The expected output is this:
[1] "AddaVax.LN.ID" "ADX.LN.ID" "ALM.LN.ID" "bCD.LN.ID" "cdiGMP.LN.ID" "cGAMP.LN.ID"
[7] "D35.LN.ID" "DMXAA.LN.ID" "FCA.LN.ID" "FK565.LN.ID" "ISA51VG.LN.ID" "K3.LN.ID"
[13] "K3SPG.LN.ID" "MALP2s.LN.ID" "MBT.LN.ID" "MPLA.LN.ID" "Pam3CSK4.LN.ID" "Poly_IC.LN.ID"
[19] "R848.LN.ID" "sHz.LN.ID"
I tried this but failed (Using R.3.2.0 alpha):
> sort(mylist)
[1] "ADX.LN.ID" "ALM.LN.ID" "AddaVax.LN.ID" "D35.LN.ID"
[5] "DMXAA.LN.ID" "FCA.LN.ID" "FK565.LN.ID" "ISA51VG.LN.ID"
[9] "K3.LN.ID" "K3SPG.LN.ID" "MALP2s.LN.ID" "MBT.LN.ID"
[13] "MPLA.LN.ID" "Pam3CSK4.LN.ID" "Poly_IC.LN.ID" "R848.LN.ID"
[17] "bCD.LN.ID" "cGAMP.LN.ID" "cdiGMP.LN.ID" "sHz.LN.ID"

Try
mylist[order(tolower(mylist))]

As noted by #Pascal, this is documented in help(Comparison) and sort is local specific. One Option is switching your local (for example Sys.setlocale("LC_TIME", "us")), but that could be inconvenient. Another option could be using gtools::mixedsort which could be also useful because you string also contains numbers.
library(gtools)
mixedsort(mylist)
# [1] "AddaVax.LN.ID" "ADX.LN.ID" "ALM.LN.ID" "bCD.LN.ID" "cdiGMP.LN.ID" "cGAMP.LN.ID" "D35.LN.ID" "DMXAA.LN.ID" "FCA.LN.ID" "FK565.LN.ID"
# [11] "ISA51VG.LN.ID" "K3.LN.ID" "K3SPG.LN.ID" "MALP2s.LN.ID" "MBT.LN.ID" "MPLA.LN.ID" "Pam3CSK4.LN.ID" "Poly_IC.LN.ID" "R848.LN.ID" "sHz.LN.ID"

> library(searchable)
> sort(ignore.case(mylist))
[1] "AddaVax.LN.ID" "ADX.LN.ID" "ALM.LN.ID" "bCD.LN.ID" "cdiGMP.LN.ID"
[6] "cGAMP.LN.ID" "D35.LN.ID" "DMXAA.LN.ID" "FCA.LN.ID" "FK565.LN.ID"
[11] "ISA51VG.LN.ID" "K3.LN.ID" "K3SPG.LN.ID" "MALP2s.LN.ID" "MBT.LN.ID"
[16] "MPLA.LN.ID" "Pam3CSK4.LN.ID" "Poly_IC.LN.ID" "R848.LN.ID" "sHz.LN.ID"

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

extract text from email and between two dots in R - r

You can try the following: str_sub(emails, str_locate(emails, "#")[,1]+1) Output: [1] "hotmail.com" "iueywapp.com" "lasdfsdfsdstores.com" "mobsdaff.com" [5] "hotmail.org" "sdffsddapp.com" "mwerqwie.com" "niweqrerw.tv" [9] "hotmail.com" "moweqrewsfaslay.com"

Related

How to use regex in R?

Change the row names in R

How to extract text from a column using R

Remove string from a vector in R

Case insensitive sort of vector of string in R

Categories

Resources