How to extract text from a column using R

How to extract text from a column using R - r

How would I go about extracting, for each row (there are ~56,000 records in an Excel file) in a specific column, only part of a string? I need to keep all text to the left of the last '/' forward slash. The challenge is that not all cells have the same number of '/'. There is always a filename (*.wav) at the end of the last '/', but the number of characters in the filename is not always the same (sometimes 5 and sometimes 6).
Below are some examples of the strings in the cells:
cloch/51.wav
grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav
grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav
AB_AeolinaL/025-C#.wav
AB_AeolinaL/026-D.wav
AB_violadamourL/rel99999/091-G.wav
AB_violadamourL/rel99999/092-G#.wav
AB_violadamourR/024-C.wav
AB_violadamourR/025-C#.wav
The extracted text should be:
cloch
grand/Grand_bombarde/02-suchy_Grand_bombarde
grand/Grand_bombarde/02-suchy_Grand_bombarde
AB_AeolinaL
AB_AeolinaL
AB_violadamourL/rel99999
AB_violadamourL/rel99999
AB_violadamourR
AB_violadamourR
Can anyone recommend a strategy using R?

You can use the stringr package str_remove(string,pattern) function like:
str = "grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav"
str_remove(str,"/[0-9]+[-]*[A-Z]*[#]*[.][a-z]+")
Output:
> str_remove(str,"/[0-9]+[-]*[A-Z]*[#]*[.][a-z]+")
[1] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
Then you can just iterate over all other strings:
strings <- c("cloch/51.wav",
"grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav",
"grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav",
"AB_AeolinaL/025-C#.wav",
"AB_AeolinaL/026-D.wav",
"AB_violadamourL/rel99999/091-G.wav",
"AB_violadamourL/rel99999/092-G#.wav",
"AB_violadamourR/024-C.wav",
"AB_violadamourR/025-C#.wav")
str_remove(strings,"/[0-9]+[-]*[A-Z]*[#]*[.][a-z]+")
Output:
> str_remove(strings,"/[0-9]+[-]*[A-Z]*[#]*[.][a-z]+")
[1] "cloch"
[2] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[3] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[4] "AB_AeolinaL"
[5] "AB_AeolinaL"
[6] "AB_violadamourL/rel99999"
[7] "AB_violadamourL/rel99999"
[8] "AB_violadamourR"
[9] "AB_violadamourR"

You have to substract strings using this method:
substr(strings,1,regexpr("\\/[^\\/]*$", strings)-1)
[1] "cloch"
[2] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[3] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[4] "AB_AeolinaL"
[5] "AB_AeolinaL"
[6] "AB_violadamourL/rel99999"
[7] "AB_violadamourL/rel99999"
[8] "AB_violadamourR"
[9] "AB_violadamourR"
Input
strings<-c("cloch/51.wav","grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav","grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav","AB_AeolinaL/025-C#.wav","AB_AeolinaL/026-D.wav","AB_violadamourL/rel99999/091-G.wav","AB_violadamourL/rel99999/092-G#.wav","AB_violadamourR/024-C.wav","AB_violadamourR/025-C#.wav")
In which this regex regexpr("\\/[^\\/]*$", strings) gives you the position of the last "/"

Assuming that the strings you propose are in a column of a dataframe:
df <- data.frame(x = 1:5, y = c("cloch/51.wav",
"grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav",
"grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav",
"AB_AeolinaL/025-C#.wav",
"AB_AeolinaL/026-D.wav"))
# I define a function that separates a string at each "/"
# throws the last piece and reattaches the pieces
cut_str <- function(s) {
st <- head((unlist(strsplit(s, "\\/"))), -1)
r <- paste(st, collapse = "/")
return(r)
}
# through the sapply function I get the desired result
new_strings <- as.vector(sapply(df$y, FUN = cut_str))
new_strings
[1] "cloch"
[2] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[3] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[4] "AB_AeolinaL"
[5] "AB_AeolinaL"

You could use
dirname(strings)
If there is no /, this returns ., which you could remove afterwards if you like, e.g.:
res <- dirname(strings)
res[res=="."] <- ""
``

You could start the match with / followed by 1 or more times any char except a forward slash or a whitespace char using a negated character class [^\\s/]+
Then match .wav at the end of the string using $
Replace the match with an empty string using sub for example.
[^\\s/]+\\.wav$
See the regex matches | R demo
strings <- c("cloch/51.wav",
"grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav",
"grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav",
"AB_AeolinaL/025-C#.wav",
"AB_AeolinaL/026-D.wav",
"AB_violadamourL/rel99999/091-G.wav",
"AB_violadamourL/rel99999/092-G#.wav",
"AB_violadamourR/024-C.wav",
"AB_violadamourR/025-C#.wav")
sub("/[^\\s/]+\\.wav$", "", strings)
Output
[1] "cloch"
[2] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[3] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[4] "AB_AeolinaL"
[5] "AB_AeolinaL"
[6] "AB_violadamourL/rel99999"
[7] "AB_violadamourL/rel99999"
[8] "AB_violadamourR"
[9] "AB_violadamourR"

Related

How to use regex in R?

I want to remove the dashes and keep only the first 4 substrings except for the last character.
sub.maf.barcode <- gsub("^([^-]*-[^-]*-[^-]*-[^-]*).{1}$", "\\1", ori.maf.barcode$Tumor_Sample_Barcode)
> ori.maf.barcode$Tumor_Sample_Barcode[1:5]
[1] "TCGA-2K-A9WE-01A-11D-A382-10" "TCGA-2Z-A9J1-01A-11D-A382-10"
[3] "TCGA-2Z-A9J2-01A-11D-A382-10" "TCGA-2Z-A9J3-01A-12D-A382-10"
[5] "TCGA-2Z-A9J5-01A-21D-A382-10"
Expected output:
[1] "TCGA-2K-A9WE-01" "TCGA-2Z-A9J1-01"
[3] "TCGA-2Z-A9J2-01" "TCGA-2Z-A9J3-01"
[5] "TCGA-2Z-A9J5-01"

You could do
gsub('.-[^-]*-[^-]*-.[^-]*$', "", ori.maf.barcode$Tumor_Sample_Barcode)
#> [1] "TCGA-2K-A9WE-01" "TCGA-2Z-A9J1-01" "TCGA-2Z-A9J2-01"
#> [4] "TCGA-2Z-A9J3-01" "TCGA-2Z-A9J5-01"
Or
substr(ori.maf.barcode$Tumor_Sample_Barcode, 1, 15)
#> [1] "TCGA-2K-A9WE-01" "TCGA-2Z-A9J1-01" "TCGA-2Z-A9J2-01"
#> [4] "TCGA-2Z-A9J3-01" "TCGA-2Z-A9J5-01"

using str_extract
library(stringr)
str_extract(ori.maf.barcode$Tumor_Sample_Barcode, "^([^-]+-){3}\\d+")
-output
[1] "TCGA-2K-A9WE-01" "TCGA-2Z-A9J1-01" "TCGA-2Z-A9J2-01"
[4] "TCGA-2Z-A9J3-01" "TCGA-2Z-A9J5-01"

Change the row names in R

i have two dataframes with similar rownames:
> rownames(abundance)[1:10]
[1] "X001.V2.fastq_mapped_to_agora.txt.uniq"
[2] "X001.V8.fastq_mapped_to_agora.txt.uniq"
[3] "X003.V17.fastq_mapped_to_agora.txt.uniq"
[4] "X003.V2.fastq_mapped_to_agora.txt.uniq"
[5] "X003.V8.fastq_mapped_to_agora.txt.uniq"
[6] "X004.V2.fastq_mapped_to_agora.txt.uniq"
[7] "X004.V8.fastq_mapped_to_agora.txt.uniq"
[8] "X005.V2.fastq_mapped_to_agora.txt.uniq"
[9] "X005.V8.fastq_mapped_to_agora.txt.uniq"
[10] "X006.V2.fastq_mapped_to_agora.txt.uniq"
> rownames(fluxes)[1:10]
[1] "001.V8" "003.V17" "003.V2" "003.V8" "004.V2" "004.V8" "005.V2"
[8] "005.V8" "006.V2" "006.V8"
But the row names of the dataframe abundance is larger. How can i make the names of each rows like the rownames of fluxes. It can be like from "X" to second ".".

We could use sub:
rownames(abundance) <- sub("X(.*)\\.fastq_mapped_to_agora\\.txt\\.uniq", "\\1", rownames(abundance))
Output:
[1] "001.V2" "001.V8" "003.V17" "003.V2" "003.V8" "004.V2" "004.V8" "005.V2" "005.V8" "006.V2"

We may use trimws
rownames(abundance) <- trimws(rownames(abundance), whitespace = "\\..*")
Or could be
rownames(abundance) <- sub("^([^.]+\\.[^.]+)\\..*", "\\1", rownames(abundance))
-testing
> trimws("X001.V2.fastq_mapped_to_agora.txt.uniq", whitespace = "\\..*")
[1] "X001"
> sub("^([^.]+\\.[^.]+)\\..*", "\\1", "X001.V2.fastq_mapped_to_agora.txt.uniq")
[1] "X001.V2"

Is there a specific function in R to merge 2 vectors [duplicate]

This question already has answers here:
Pasting two vectors with combinations of all vectors' elements
(8 answers)
Closed 2 years ago.
I have two vectors, one that contains a list of variables, and one that contains dates, such as
Variables_Pays <- c("PIB", "ConsommationPrivee","ConsommationPubliques",
"FBCF","ProductionIndustrielle","Inflation","InflationSousJacente",
"PrixProductionIndustrielle","CoutHoraireTravail")
Annee_Pays <- c("2000","2001")
I want to merge them to have a vector with each variable indexed by my date, that is my desired output is
> Colonnes_Pays_Principaux
[1] "PIB_2020" "PIB_2021" "ConsommationPrivee_2020"
[4] "ConsommationPrivee_2021" "ConsommationPubliques_2020" "ConsommationPubliques_2021"
[7] "FBCF_2020" "FBCF_2021" "ProductionIndustrielle_2020"
[10] "ProductionIndustrielle_2021" "Inflation_2020" "Inflation_2021"
[13] "InflationSousJacente_2020" "InflationSousJacente_2021" "PrixProductionIndustrielle_2020"
[16] "PrixProductionIndustrielle_2021" "CoutHoraireTravail_2020" "CoutHoraireTravail_2021"
Is there a simpler / more readabl way than a double for loop as I have tried and succeeded below ?
Colonnes_Pays_Principaux <- vector()
for (Variable in (1:length(Variables_Pays))){
for (Annee in (1:length(Annee_Pays))){
Colonnes_Pays_Principaux=
append(Colonnes_Pays_Principaux,
paste(Variables_Pays[Variable],Annee_Pays[Annee],sep="_")
)
}
}

expand.grid will create a data frame with all combinations of the two vectors.
with(
expand.grid(Variables_Pays, Annee_Pays),
paste0(Var1, "_", Var2)
)
#> [1] "PIB_2000" "ConsommationPrivee_2000"
#> [3] "ConsommationPubliques_2000" "FBCF_2000"
#> [5] "ProductionIndustrielle_2000" "Inflation_2000"
#> [7] "InflationSousJacente_2000" "PrixProductionIndustrielle_2000"
#> [9] "CoutHoraireTravail_2000" "PIB_2001"
#> [11] "ConsommationPrivee_2001" "ConsommationPubliques_2001"
#> [13] "FBCF_2001" "ProductionIndustrielle_2001"
#> [15] "Inflation_2001" "InflationSousJacente_2001"
#> [17] "PrixProductionIndustrielle_2001" "CoutHoraireTravail_2001"

We can use outer :
c(t(outer(Variables_Pays, Annee_Pays, paste, sep = '_')))
# [1] "PIB_2000" "PIB_2001"
# [3] "ConsommationPrivee_2000" "ConsommationPrivee_2001"
# [5] "ConsommationPubliques_2000" "ConsommationPubliques_2001"
# [7] "FBCF_2000" "FBCF_2001"
# [9] "ProductionIndustrielle_2000" "ProductionIndustrielle_2001"
#[11] "Inflation_2000" "Inflation_2001"
#[13] "InflationSousJacente_2000" "InflationSousJacente_2001"
#[15] "PrixProductionIndustrielle_2000" "PrixProductionIndustrielle_2001"
#[17] "CoutHoraireTravail_2000" "CoutHoraireTravail_2001"

No real need to go beyond the basics here! Use paste for pasting the strings and rep to repeat either Annee_Pays och Variables_Pays to get all combinations:
Variables_Pays <- c("PIB", "ConsommationPrivee","ConsommationPubliques",
"FBCF","ProductionIndustrielle","Inflation","InflationSousJacente",
"PrixProductionIndustrielle","CoutHoraireTravail")
Annee_Pays <- c("2000","2001")
# To get this is the same order as in your example:
paste(rep(Variables_Pays, rep(2, length(Variables_Pays))), Annee_Pays, sep = "_")
# Alternative order:
paste(Variables_Pays, rep(Annee_Pays, rep(length(Variables_Pays), 2)), sep = "_")
# Or, if order doesn't matter too much:
paste(Variables_Pays, rep(Annee_Pays, length(Variables_Pays)), sep = "_")

In base R:
Variables_Pays <- c("PIB", "ConsommationPrivee","ConsommationPubliques",
"FBCF","ProductionIndustrielle","Inflation","InflationSousJacente",
"PrixProductionIndustrielle","CoutHoraireTravail")
Annee_Pays <- c("2000","2001")
cbind(paste(Variables_Pays, Annee_Pays,sep="_"),paste(Variables_Pays, rev(Annee_Pays),sep="_")

extract text from email and between two dots in R

I have some email address where I am trying to extract the domain from. I found a solution here but it is taking too long.
I am trying with the following approach:
First remove all the text before the # sign.
gsub("#(.+)$", "\\1", emails)
Other - not used
qdapRegex::ex_between(emails, ".", ".")
Data:
emails <- c("ut317#hotmail.com", "drrro#iueywapp.com", "esdfdsfos#lasdfsdfsdstores.com",
"asfds#mobsdaff.com", "asfsdaf.gsdsfdsfd#hotmail.org", "asdfdsaf#sdffsddapp.com",
"wqrerq.mwqerweing#mwerqwie.com", "qwera#niweqrerw.tv", "qwereqr3rew7#hotmail.com",
"mqwerwewrk#moweqrewsfaslay.com")

You can try the following:
str_sub(emails, str_locate(emails, "#")[,1]+1)
Output:
[1] "hotmail.com" "iueywapp.com" "lasdfsdfsdstores.com" "mobsdaff.com"
[5] "hotmail.org" "sdffsddapp.com" "mwerqwie.com" "niweqrerw.tv"
[9] "hotmail.com" "moweqrewsfaslay.com"

how about
sub(".*#(.*)\\..*","\\1",emails)
[1] "hotmail" "iueywapp" "lasdfsdfsdstores" "mobsdaff"
[5] "hotmail" "sdffsddapp" "mwerqwie" "niweqrerw"
[9] "hotmail" "moweqrewsfaslay"
or if you want everything after the #:
sub(".*#","",emails)
[1] "hotmail.com" "iueywapp.com" "lasdfsdfsdstores.com"
[4] "mobsdaff.com" "hotmail.org" "sdffsddapp.com"
[7] "mwerqwie.com" "niweqrerw.tv" "hotmail.com"
[10] "moweqrewsfaslay.com"

We can use trimws from base R
trimws(emails, whitespace= '.*#')
#[1] "hotmail.com" "iueywapp.com" "lasdfsdfsdstores.com" "mobsdaff.com" "hotmail.org" "sdffsddapp.com"
#[7] "mwerqwie.com" "niweqrerw.tv" "hotmail.com" "moweqrewsfaslay.com"
trimws(emails, whitespace= '.*#|\\..*')
#[1] "hotmail" "iueywapp" "lasdfsdfsdstores" "mobsdaff" "hotmail" "sdffsddapp" "mwerqwie"
#[8] "niweqrerw" "hotmail" "moweqrewsfaslay"

Splitting a sequence within dataframe?

I have a csv file like this,
x <- read.csv("C:/Users/XXXX/Documents/XXXX/Day1_15042014/work2.csv")
class(x)
x$Sequence.window![enter image description here][1]
> x$Sequence.window
[1] VVELRKTGGDTLEFHKFYKNFSSGLKDVVWN
[2] PGLTTQGTKFGRKIVKTLAYRVKSTQPSSGN
[3] EATEFYLRYYVGHKGKFGHEFLEFEFREDGK
[4] LVPVVWGERKTPEIEKKGFGASSKAATSLPS
[5] NMNELPEKKNSAGFIKLEDKQKLIVEMEKSV
[6] PTLHFNYRYFETDAPKDVPGAPRQWWFGGGT
[7] PDPTTAPMEAAKQPKKKRSRSKKCKSVNNLD
[8] PAKAAKTAKVTSPAKKAVAATKKVATVATKK
The class of this is a dataframe . I would now like to split the sequence window within a range 10:22 ( Ex [1] VVELRKTGGDTLEFHKFYKNFSSGLKDVVWN, output should be like [1] DTLEFHKFYKNFS for all the sequences) . How would I do this within a data frame?

You can use the substr function
#dummy data
x <- read.table(text="Sequence.window
VVELRKTGGDTLEFHKFYKNFSSGLKDVVWN
PGLTTQGTKFGRKIVKTLAYRVKSTQPSSGN
EATEFYLRYYVGHKGKFGHEFLEFEFREDGK",header=TRUE,as.is=TRUE)
#substr from 10 to 22
substr(x$Sequence.window,start=10,stop=22)
#[1] "DTLEFHKFYKNFS" "FGRKIVKTLAYRV" "YVGHKGKFGHEFL"

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

How to extract text from a column using R - r

You could use dirname(strings) If there is no /, this returns ., which you could remove afterwards if you like, e.g.: res <- dirname(strings) res[res=="."] <- "" ``

Related

How to use regex in R?

Change the row names in R

Is there a specific function in R to merge 2 vectors [duplicate]

extract text from email and between two dots in R

Splitting a sequence within dataframe?

Categories

Resources