How to subset a character vector based on substring matches?

I want to create a variable ori.same.maf.barcodes that stores the strings of ori.maf.barcode whose substring before the fourth "-" character matches a string in sub.same.barcodes.
How sub.same.barcodes and ori.maf.barcode were generated: sub.maf.barcode is a shortened copy of ori.maf.barcode$Tumor_Sample_Barcode, and sub.same.barcodes is the intersection of sub.maf.barcode and sub.met.barcode. Now I want to match sub.same.barcodes back to ori.maf.barcode.
ori.maf.barcode <- maf@clinical.data
sub.maf.barcode <- gsub("^([^-]*-[^-]*-[^-]*-[^-]*).*", "\\1", ori.maf.barcode$Tumor_Sample_Barcode) # Keep only the first four dash-separated fields
sub.same.barcodes <- intersect(sub.maf.barcode, sub.met.barcode)
Attempt:
ori.same.maf.barcodes <- ori.maf.barcode %in% sub.same.barcodes
But my code returns "FALSE" instead of a character vector.
dput(ori.maf.barcode[1:20])
structure(list(Tumor_Sample_Barcode = c("TCGA-2K-A9WE-01A-11D-A382-10",
"TCGA-2Z-A9J1-01A-11D-A382-10", "TCGA-2Z-A9J2-01A-11D-A382-10",
"TCGA-2Z-A9J3-01A-12D-A382-10", "TCGA-2Z-A9J5-01A-21D-A382-10",
"TCGA-2Z-A9J6-01A-11D-A382-10", "TCGA-2Z-A9J7-01A-11D-A382-10",
"TCGA-2Z-A9J8-01A-11D-A42J-10", "TCGA-2Z-A9JD-01A-11D-A42J-10",
"TCGA-2Z-A9JG-01A-11D-A42J-10", "TCGA-2Z-A9JI-01A-11D-A42J-10",
"TCGA-2Z-A9JJ-01A-11D-A42J-10", "TCGA-2Z-A9JK-01A-11D-A42J-10",
"TCGA-2Z-A9JM-01A-12D-A42J-10", "TCGA-2Z-A9JN-01A-21D-A42J-10",
"TCGA-2Z-A9JO-01A-11D-A42J-10", "TCGA-2Z-A9JQ-01A-11D-A42J-10",
"TCGA-2Z-A9JR-01A-12D-A42J-10", "TCGA-2Z-A9JS-01A-21D-A42J-10",
"TCGA-3Z-A93Z-01A-11D-A36X-10")), class = c("data.table", "data.frame"
), row.names = c(NA, -20L), .internal.selfref = <pointer: 0x0000025e377005d0>)
dput(sub.met.barcode[1:20])
c("TCGA-BQ-7058-01A", "TCGA-DZ-6131-01A", "TCGA-UZ-A9PZ-01A",
"TCGA-2Z-A9JQ-01A", "TCGA-BQ-5887-11A", "TCGA-G7-7502-01A", "TCGA-B1-A47M-11A",
"TCGA-SX-A7SO-01A", "TCGA-HE-A5NJ-01A", "TCGA-MH-A856-01A", "TCGA-A4-8312-01A",
"TCGA-BQ-5892-01A", "TCGA-A4-7732-11A", "TCGA-5P-A9K9-01A", "TCGA-UZ-A9PX-01A",
"TCGA-BQ-7061-01A", "TCGA-BQ-5876-01A", "TCGA-DZ-6134-01A", "TCGA-BQ-5884-01A",
"TCGA-BQ-5889-11A")

We could use sub to extract the substring up to the fourth "-" and then use %in% to build a logical vector for subsetting
i1 <- trimws(sub("^(([^-]+-){4}).*", "\\1", ori.maf.barcode),
whitespace = "-") %in%
sub("^(([^-]+-){4}).*", "\\1", sub.same.barcodes)
ori.same.maf.barcodes <- ori.maf.barcode[i1]
-output
> ori.same.maf.barcodes
[1] "TCGA-BQ-7058-01A-11D-1963-05"
[2] "TCGA-2Z-A9JQ-01A-11D-A42K-05"
[3] "TCGA-BQ-5887-11A-01D-1963-05"
update
Using the new dput in the OP's post, 'ori.maf.barcode' is a data.table with a column named 'Tumor_Sample_Barcode'. Extract the column with $ or [[ in base R, or use the data.table methods directly to subset
library(data.table)
ori.maf.barcode[trimws(sub("^(([^-]+-){4}).*", "\\1",
Tumor_Sample_Barcode),
whitespace = "-") %in% sub("^(([^-]+-){4}).*", "\\1", sub.met.barcode)]
Tumor_Sample_Barcode
<char>
1: TCGA-2Z-A9JQ-01A-11D-A42J-10
data
ori.maf.barcode <- c("TCGA-BQ-7058-01A-11D-1963-05",
"TCGA-DZ-6131-01A-11D-1963-05",
"TCGA-UZ-A9PZ-01A-11D-A42K-05", "TCGA-2Z-A9JQ-01A-11D-A42K-05",
"TCGA-BQ-5887-11A-01D-1963-05", "TCGA-G7-7502-01A-12D-A43K-06"
)
sub.same.barcodes <- c("TCGA-BQ-7058-01A", "TCGA-DZ-6131-02A",
"TCGA-UZ-A9PZ-03A",
"TCGA-2Z-A9JQ-01A", "TCGA-BQ-5887-11A", "TCGA-2Z-A9JQ-01A")
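The same idea can be sketched without a regular expression by splitting on "-" and pasting the first four fields back together. This is only an illustration: barcode_prefix is a hypothetical helper name, not part of any package, and the data is the sample shown above.

```r
# Hypothetical helper: keep the first `n` dash-separated fields of each string.
barcode_prefix <- function(x, n = 4) {
  parts <- strsplit(x, "-", fixed = TRUE)
  vapply(parts, function(p) paste(head(p, n), collapse = "-"), character(1))
}

ori.maf.barcode <- c("TCGA-BQ-7058-01A-11D-1963-05",
                     "TCGA-DZ-6131-01A-11D-1963-05",
                     "TCGA-UZ-A9PZ-01A-11D-A42K-05",
                     "TCGA-2Z-A9JQ-01A-11D-A42K-05",
                     "TCGA-BQ-5887-11A-01D-1963-05",
                     "TCGA-G7-7502-01A-12D-A43K-06")
sub.same.barcodes <- c("TCGA-BQ-7058-01A", "TCGA-DZ-6131-02A",
                       "TCGA-UZ-A9PZ-03A", "TCGA-2Z-A9JQ-01A",
                       "TCGA-BQ-5887-11A", "TCGA-2Z-A9JQ-01A")

# Logical vector: TRUE where the four-field prefix is one of the wanted barcodes.
keep <- barcode_prefix(ori.maf.barcode) %in% sub.same.barcodes
ori.same.maf.barcodes <- ori.maf.barcode[keep]
print(ori.same.maf.barcodes)
```

Splitting on a fixed string avoids regex escaping, at the cost of building an intermediate list; for very long vectors the sub() version may be faster.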

Please note that with the sample data you have provided it is not possible for the value TCGA-G7-7502-01A-12D-A43K-06 to appear in the output.
library(stringr)
sub.same.barcodes <- c("TCGA-BQ-7058-01A", "TCGA-DZ-6131-02A", "TCGA-UZ-A9PZ-03A",
"TCGA-2Z-A9JQ-01A", "TCGA-BQ-5887-11A", "TCGA-2Z-A9JQ-01A")
ori.maf.barcode <- c("TCGA-BQ-7058-01A-11D-1963-05", "TCGA-DZ-6131-01A-11D-1963-05",
"TCGA-UZ-A9PZ-01A-11D-A42K-05", "TCGA-2Z-A9JQ-01A-11D-A42K-05",
"TCGA-BQ-5887-11A-01D-1963-05", "TCGA-G7-7502-01A-12D-A43K-06")
idx <- which(str_extract(ori.maf.barcode, '.{4}-.{2}-.{4}-.{3}') %in% sub.same.barcodes)
ori.same.maf.barcodes <- ori.maf.barcode[ idx ]
print(ori.same.maf.barcodes)
Output:
[1] "TCGA-BQ-7058-01A-11D-1963-05" "TCGA-2Z-A9JQ-01A-11D-A42K-05" "TCGA-BQ-5887-11A-01D-1963-05"

You're almost there, but your code ori.maf.barcode %in% sub.same.barcodes only evaluates a logical test that returns TRUE and FALSE values, which is what you are seeing. To get back the values for which the test is TRUE, you need to use that expression as a subsetting index.
ori.maf.barcode[which(ori.maf.barcode %in% sub.same.barcodes)]
If it is a vector, this returns another vector containing only those entries for which the logical test is TRUE.
You also need to match on the start of each string, as iod said below, to get the entries based on their first part.
This loop picks them out one prefix at a time and adds them to a new vector:
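new.barcodes <- c()
# 'prefix' is used as the loop variable to avoid masking the base function sub()
for (prefix in sub.same.barcodes) {
  matched <- ori.maf.barcode[which(startsWith(ori.maf.barcode, prefix))]
  new.barcodes <- c(new.barcodes, matched)
}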
This will iterate through your prefixes and pull out what you want into a new vector
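A small sketch of why the original attempt printed a single FALSE, assuming the barcodes live in a one-column data.frame (or data.table) as in the dput above. The two-row table and the names ori, same, whole_table and hits are made up for illustration.

```r
# A one-column table, mimicking the structure shown in the question's dput.
ori <- data.frame(
  Tumor_Sample_Barcode = c("TCGA-2Z-A9JQ-01A-11D-A42J-10",
                           "TCGA-3Z-A93Z-01A-11D-A36X-10"),
  stringsAsFactors = FALSE
)
same <- c("TCGA-2Z-A9JQ-01A", "TCGA-BQ-7058-01A")

# %in% on the table itself compares whole columns, so it yields one value
# per column (here a single FALSE), not one value per row.
whole_table <- ori %in% same
print(whole_table)

# Extract the column first, then test every barcode against every prefix.
barcodes <- ori$Tumor_Sample_Barcode
is_hit <- Reduce(`|`, lapply(same, function(p) startsWith(barcodes, p)))
hits <- barcodes[is_hit]
print(hits)
```

Unlike the loop, this keeps the matches in their original order and returns each barcode at most once, even if a prefix is repeated.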

Related

Subset character vector by pattern

I have a character vector made up of filenames like:
vector <- c("LR1_0001_a", "LR1_0002_b", "LR02_0001_b", "LR02_0002_x", "LR3_001_c")
My goal is to subset this vector based on pattern matching the first x number of characters (dynamically), up to the first "_". The outputs would look something like this:
solution1 <- c("LR1_0001_a", "LR1_0002_b")
solution2 <- c("LR02_0001_b", "LR02_0002_x")
solution3 <- c("LR3_001_c")
I have experimented with combinations of unique and grep but have not had any luck so far.

We can use sub to remove everything after underscore "_" and split the vector.
output <- split(vector, sub('_.*', '', vector))
output
#$LR02
#[1] "LR02_0001_b" "LR02_0002_x"
#$LR1
#[1] "LR1_0001_a" "LR1_0002_b"
#$LR3
#[1] "LR3_001_c"
This returns a list of vectors, which is usually a better way to manage data than creating a number of objects in the global environment. However, if you want them as separate vectors, we can use list2env.
list2env(output, .GlobalEnv)
This will create vectors with the name LR02, LR1 and LR3 respectively.
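If the list is kept instead of being unpacked, each group can still be reached by its prefix. A minimal sketch using the sample vector from the question; the name groups is illustrative.

```r
vector <- c("LR1_0001_a", "LR1_0002_b", "LR02_0001_b", "LR02_0002_x", "LR3_001_c")

# Group the file names by everything before the first underscore.
groups <- split(vector, sub("_.*", "", vector))

# Access one group by name.
print(groups[["LR1"]])

# Number of files per prefix.
print(lengths(groups))

# Iterate over all groups without creating global variables.
for (prefix in names(groups)) {
  cat(prefix, ":", paste(groups[[prefix]], collapse = ", "), "\n")
}
```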

Base R solution (coerce vector to data.frame):
# Split vector into list (as in ronak's answer):
vect_list <- split(vect, sub("_.*", "", vect))
# Pad each vector in the list to be the same length as the longest vector:
padded_vect_list <- lapply(vect_list,
function(x){length(x) = max(lengths(vect_list)); return(x)})
# Coerce the list of vectors into a dataframe:
df <- data.frame(do.call("cbind", padded_vect_list))
Data:
vect <- c("LR1_0001_a", "LR1_0002_b", "LR02_0001_b", "LR02_0002_x", "LR3_001_c")

We can use trimws
out <- split(vector, trimws(vector, whitespace = "_.*"))
and then use list2env
list2env(out, .GlobalEnv)

In R, how do I split each string in a vector to return everything before the Nth instance of a character?

Example:
df <- data.frame(Name = c("J*120_234_458_28", "Z*23_205_a834_306", "H*_39_004_204_99_04902"))
I would like to be able to select everything before the third underscore for each row in the dataframe. I understand how to split the string apart:
df$New <- sapply(strsplit((df$Name),"_"), `[`)
But this places a list in each row. I've thus far been unable to figure out how to use sapply to unlist() each row of df$New select the first N elements of the list to paste/collapse them back together. Because the length of each subelement can be distinct, and the number of subelements can also be distinct, I haven't been able to figure out an alternative way of getting this info.

We specify 'n' and, after splitting the character column by '_', extract the first n - 1 components
n <- 4
lapply(strsplit(as.character(df$Name), "_"), `[`, seq_len(n - 1))
If we need to paste them together, we can loop over the list with lapply/sapply and use an anonymous function (function(x)) that takes the first n - 1 elements with head and pastes them together
sapply(strsplit(as.character(df$Name), "_"), function(x)
paste(head(x, n - 1), collapse="_"))
#[1] "J*120_234_458" "Z*23_205_a834" "H*_39_004"
Or use regex method
sub("^([^_]+_[^_]+_[^_]+)_.*", "\\1", df$Name)
#[1] "J*120_234_458" "Z*23_205_a834" "H*_39_004"
Or if the 'n' is really large, then
pat <- sprintf("^(([^_]+_){%d}[^_]+)_.*", n - 2)
sub(pat, "\\1", df$Name)
Or
sub("^(([^_]+_){2}[^_]+)_.*", "\\1", df$Name)
#[1] "J*120_234_458" "Z*23_205_a834" "H*_39_004"
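As an alternative to building a regular expression, the position of the Nth separator can be located with gregexpr and the string cut with substr. This is a sketch only; before_nth is a hypothetical helper, and here n counts separators directly (n = 3 means "everything before the third underscore"). Strings with fewer than n separators are returned unchanged, which is an assumption about the desired behaviour.

```r
# Hypothetical helper: return everything before the n-th occurrence of `sep`.
before_nth <- function(x, n, sep = "_") {
  pos <- gregexpr(sep, x, fixed = TRUE)
  # Character index just before the n-th separator, or NA if there are fewer.
  cut <- vapply(pos, function(p) {
    if (p[1] > 0 && length(p) >= n) p[n] - 1L else NA_integer_
  }, integer(1))
  ifelse(is.na(cut), x, substr(x, 1, cut))
}

df <- data.frame(Name = c("J*120_234_458_28", "Z*23_205_a834_306",
                          "H*_39_004_204_99_04902"),
                 stringsAsFactors = FALSE)
df$New <- before_nth(df$Name, 3)
print(df$New)
```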

Changing values into numeric values

I have a few columns whose values look like 525K or 1.1M. I want to convert those values to thousands or millions as numerics without using any R package besides base R and tidyr.
Is there anyone who can help me with a code or a function how I can do this in a simple and quick way?
I have tried to do it by hand with removing the 'M' or 'K' and the '.'.
players_set$Value <- gsub(pattern = "M", replacement = "000000 ",
x = players_set$Value, fixed = TRUE)

For a base R option, we can try using sub to generate an arithmetic expression, based on the K or M unit. Then, use eval with parse to get the final number:
getValue <- function(input) {
output <- sub("M", "*1000000", sub("K", "*1000", input))
eval(parse(text=output))
}
getValue("525K")
getValue("1.1M")
[1] 525000
[1] 1100000

Here is another option with a named vector matching
getValue <- function(input) {
# remove characters except LETTERS
v1 <- gsub("[0-9.€]+", "", input)
# remove characters except digits
v2 <- gsub("[A-Za-z€]+", "", input)
# create a named vector
keyval <- setNames(c(1e6, 1e3), c("M", "K"))
# match the LETTERS (v1) with the keyval to get the numeric value
# multiply with v2
unname(as.numeric(v2) *keyval[v1])
}
getValue("525K")
#[1] 525000
getValue("1.1M")
#[1] 1100000
getValue("€525K")
#[1] 525000
getValue("€1.1M")
#[1] 1100000
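For a whole column, a vectorised variant that also tolerates values without a suffix may be convenient. The sketch below assumes the only suffixes are K and M (upper or lower case) and that anything without a suffix is already a plain number; to_number is a hypothetical name.

```r
# Hypothetical helper: convert strings such as "525K" or "1.1M" to numbers.
to_number <- function(x) {
  num  <- as.numeric(gsub("[^0-9.]", "", x))   # keep digits and the decimal point
  unit <- toupper(gsub("[^A-Za-z]", "", x))    # keep the letter suffix, if any
  mult <- c(K = 1e3, M = 1e6)[unit]            # look up the multiplier by name
  mult[is.na(mult)] <- 1                       # no (or unknown) suffix: leave as is
  unname(num * mult)
}

values <- c("525K", "1.1M", "€750", "2m")
print(to_number(values))
```

Because nothing is passed through eval(parse()), unexpected text in the column cannot be executed as code.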

add running counter for semi-consecutive strings in vector

I would like to add a number indicating the x-th occurrence of a word in a vector. (This question is different from "Make a column with duplicated values unique in a dataframe" because I have a simple vector and want to avoid the overhead of casting it to a data.frame.)
E.g. for the vector:
book, ship, umbrella, book, ship, ship
the output would be:
book, ship, umbrella, book2, ship2, ship3
I have solved this myself by transposing the vector to a dataframe and next using the grouping function. That feels like using a sledgehammer to crack nuts:
# add consecutive number for equal string
words <- c("book", "ship", "umbrella", "book", "ship", "ship")
# transpose word vector to data.frame for grouping
df <- data.frame(words = words)
df <- df %>% group_by(words) %>% mutate(seqN = row_number())
# combine columns and remove '1' for first occurrence
wordsVec <- paste0(df$words, df$seqN)
gsub("1", "", wordsVec)
# [1] "book" "ship" "umbrella" "book2" "ship2" "ship3"
Is there a more clean solution, e.g. using the stringr package?

You can still utilize row_number() from dplyr but you don't need to convert to data frame, i.e.
sub('1$', '', ave(words, words, FUN = function(i) paste0(i, row_number(i))))
#[1] "book" "ship" "umbrella" "book2" "ship2" "ship3"
Another option is to use make.unique along with gsubfn to increment your values by 1, i.e.
library(gsubfn)
gsubfn("\\d+", function(x) as.numeric(x) + 1, make.unique(words))
#[1] "book" "ship" "umbrella" "book.2" "ship.2" "ship.3"
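A base R sketch that needs no extra package: ave() with seq_along numbers the occurrences within each word, and the suffix is attached only from the second occurrence on. The names occurrence and words_numbered are illustrative.

```r
words <- c("book", "ship", "umbrella", "book", "ship", "ship")

# Running count of each word: 1 for its first occurrence, 2 for the second, ...
occurrence <- ave(seq_along(words), words, FUN = seq_along)

# Append the counter only where it is greater than 1.
words_numbered <- ifelse(occurrence == 1, words, paste0(words, occurrence))
print(words_numbered)
```

Deciding on the counter itself, rather than stripping a trailing "1" afterwards, avoids accidentally altering words that end in the digit 1 or counters such as 11.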

grep() and sub() and regular expression

I'd like to change the variable names in my data.frame from e.g. "pmm_StartTimev4_E2_C19_1" to "pmm_StartTimev4_E2_C19". So if the name ends with an underscore followed by any number it gets removed.
But I'd like for this to happen only if the variable name has the word "Start" in it.
I've got a muddled up bit of code that doesn't work. Any help would be appreciated!
# Current data frame:
dfbefore <- data.frame(a=c("pmm_StartTimev4_E2_C19_1","pmm_StartTimev4_E2_E2_C1","delivery_C1_C12"),b=c("pmm_StartTo_v4_E2_C19_2","complete_E1_C12_1","pmm_StartTo_v4_E2_C19"))
# Desired data frame:
dfafter <- data.frame(a=c("pmm_StartTimev4_E2_C19","pmm_StartTimev4_E2_E2_C1","delivery_C1_C12"),b=c("pmm_StartTo_v4_E2_C19","complete_E1_C12_1","pmm_StartTo_v4_E2_C19"))
# Current code:
sub((.*{1,}[0-9]*).*","",grep("Start",names(df),value = TRUE)

How about something like this using gsub().
stripcol <- function(x) {
gsub("(.*Start.*)_\\d+$", "\\1", as.character(x))
}
dfnew <- dfbefore
dfnew[] <- lapply(dfbefore, stripcol)
We use the regular expression to look for "Start" and then grab everything but the underscore number at the end. We use lapply to apply the function to all columns.

doit <- function(x) {
  x <- as.character(x)
  # only touch values that contain "Start"
  if (grepl("Start", x)) {
    # drop a trailing underscore followed by digits
    x <- sub("_[0-9]+$", "", x)
  }
  return(x)
}
apply(dfbefore, c(1, 2), doit)
a b
[1,] "pmm_StartTimev4_E2_C19" "pmm_StartTo_v4_E2_C19"
[2,] "pmm_StartTimev4_E2_E2_C1" "complete_E1_C12_1"
[3,] "delivery_C1_C12" "pmm_StartTo_v4_E2_C19"

We can use sub to capture groups where the 'Start' substring is also present followed by an underscore and one or more numbers. In the replacement, use the backreference of the captured group. As there are multiple columns, use lapply to loop over the columns, apply the sub and assign the output back to the original data
out <- dfbefore
out[] <- lapply(dfbefore, sub,
pattern = "^(.*_Start.*)_\\d+$", replacement ="\\1")
out
dfafter[] <- lapply(dfafter, as.character)
all.equal(out, dfafter, check.attributes = FALSE)
#[1] TRUE
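The question speaks of variable names, while the sample data stores the strings as values. If the strings really are column names, the same pattern can be applied to names() directly. A sketch with a made-up data frame (the column contents are placeholders):

```r
# Made-up data frame whose column names follow the pattern from the question.
df <- data.frame(pmm_StartTimev4_E2_C19_1 = 1:2,
                 complete_E1_C12_1        = 3:4,
                 pmm_StartTo_v4_E2_C19    = 5:6)

# Strip a trailing "_<digits>" only from names that contain "Start".
names(df) <- sub("^(.*Start.*)_\\d+$", "\\1", names(df))
print(names(df))
```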
