Related
I want to create ori.same.maf.barcodes variable to store the strings of ori.maf.barcode if the substrings before fourth "-" character matches the strings in sub.same.barcodes.
How sub.same.barcodes and ori.maf.barcode were generated. sub.maf.barcode is the subset of the ori.maf.barcode$Tumor_Sample_Barcode. The sub.same.barcodes is the intersect of sub.maf.barcode and sub.met.barcode. Now, I want to match sub.same.barcodes back to ori.maf.barcode.
ori.maf.barcode <- maf#clinical.data
sub.maf.barcode <- gsub("^([^-]*-[^-]*-[^-]*-[^-]*).*", "\\1", ori.maf.barcode$Tumor_Sample_Barcode) # Remove the dashes and keep only the first 4
sub.same.barcodes <- intersect(sub.maf.barcode, sub.met.barcode)
Attempt:
ori.same.maf.barcodes <- ori.maf.barcode %in% sub.same.barcodes
But my code returns "FALSE" instead of a character vector.
dput(ori.maf.barcode[1:20])
structure(list(Tumor_Sample_Barcode = c("TCGA-2K-A9WE-01A-11D-A382-10",
"TCGA-2Z-A9J1-01A-11D-A382-10", "TCGA-2Z-A9J2-01A-11D-A382-10",
"TCGA-2Z-A9J3-01A-12D-A382-10", "TCGA-2Z-A9J5-01A-21D-A382-10",
"TCGA-2Z-A9J6-01A-11D-A382-10", "TCGA-2Z-A9J7-01A-11D-A382-10",
"TCGA-2Z-A9J8-01A-11D-A42J-10", "TCGA-2Z-A9JD-01A-11D-A42J-10",
"TCGA-2Z-A9JG-01A-11D-A42J-10", "TCGA-2Z-A9JI-01A-11D-A42J-10",
"TCGA-2Z-A9JJ-01A-11D-A42J-10", "TCGA-2Z-A9JK-01A-11D-A42J-10",
"TCGA-2Z-A9JM-01A-12D-A42J-10", "TCGA-2Z-A9JN-01A-21D-A42J-10",
"TCGA-2Z-A9JO-01A-11D-A42J-10", "TCGA-2Z-A9JQ-01A-11D-A42J-10",
"TCGA-2Z-A9JR-01A-12D-A42J-10", "TCGA-2Z-A9JS-01A-21D-A42J-10",
"TCGA-3Z-A93Z-01A-11D-A36X-10")), class = c("data.table", "data.frame"
), row.names = c(NA, -20L), .internal.selfref = <pointer: 0x0000025e377005d0>)
dput(sub.met.barcode[1:20])
c("TCGA-BQ-7058-01A", "TCGA-DZ-6131-01A", "TCGA-UZ-A9PZ-01A",
"TCGA-2Z-A9JQ-01A", "TCGA-BQ-5887-11A", "TCGA-G7-7502-01A", "TCGA-B1-A47M-11A",
"TCGA-SX-A7SO-01A", "TCGA-HE-A5NJ-01A", "TCGA-MH-A856-01A", "TCGA-A4-8312-01A",
"TCGA-BQ-5892-01A", "TCGA-A4-7732-11A", "TCGA-5P-A9K9-01A", "TCGA-UZ-A9PX-01A",
"TCGA-BQ-7061-01A", "TCGA-BQ-5876-01A", "TCGA-DZ-6134-01A", "TCGA-BQ-5884-01A",
"TCGA-BQ-5889-11A")
We could use sub to extract the substring till the fourth - and then use %in% on the logical vector to subset
i1 <- trimws(sub("^(([^-]+-){4}).*", "\\1", ori.maf.barcode),
whitespace = "-") %in%
sub("^(([^-]+-){4}).*", "\\1", sub.same.barcodes)
ori.same.maf.barcodes <- ori.maf.barcode[i1]
-output
> ori.same.maf.barcodes
[1] "TCGA-BQ-7058-01A-11D-1963-05"
[2] "TCGA-2Z-A9JQ-01A-11D-A42K-05"
[3] "TCGA-BQ-5887-11A-01D-1963-05"
update
Using the new dput in the OP' post, the 'ori.maf.barcode' is a data.table with column named as 'Tumor_Sample_Barcode'. Extract the column with $ or [[ in base R or directly use the data.table methods to subset
library(data.table)
ori.maf.barcode[trimws(sub("^(([^-]+-){4}).*", "\\1",
Tumor_Sample_Barcode),
whitespace = "-") %in% sub("^(([^-]+-){4}).*", "\\1", sub.met.barcode)]
Tumor_Sample_Barcode
<char>
1: TCGA-2Z-A9JQ-01A-11D-A42J-10
data
ori.maf.barcode <- c("TCGA-BQ-7058-01A-11D-1963-05",
"TCGA-DZ-6131-01A-11D-1963-05",
"TCGA-UZ-A9PZ-01A-11D-A42K-05", "TCGA-2Z-A9JQ-01A-11D-A42K-05",
"TCGA-BQ-5887-11A-01D-1963-05", "TCGA-G7-7502-01A-12D-A43K-06"
)
sub.same.barcodes <- c("TCGA-BQ-7058-01A", "TCGA-DZ-6131-02A",
"TCGA-UZ-A9PZ-03A",
"TCGA-2Z-A9JQ-01A", "TCGA-BQ-5887-11A", "TCGA-2Z-A9JQ-01A")
Please note that with the sample data you have provided it is not possible for the value TCGA-G7-7502-01A-12D-A43K-06 to appear in the output.
library(stringr)
sub.same.barcodes <- c("TCGA-BQ-7058-01A", "TCGA-DZ-6131-02A", "TCGA-UZ-A9PZ-03A",
"TCGA-2Z-A9JQ-01A", "TCGA-BQ-5887-11A", "TCGA-2Z-A9JQ-01A")
ori.maf.barcode <- c("TCGA-BQ-7058-01A-11D-1963-05", "TCGA-DZ-6131-01A-11D-1963-05",
"TCGA-UZ-A9PZ-01A-11D-A42K-05", "TCGA-2Z-A9JQ-01A-11D-A42K-05",
"TCGA-BQ-5887-11A-01D-1963-05", "TCGA-G7-7502-01A-12D-A43K-06")
idx <- which(str_extract_all(ori.maf.barcode, '.{4}-.{2}-.{4}-.{3}') %in% sub.same.barcodes)
ori.same.maf.barcodes <- ori.maf.barcode[ idx ]
print(ori.same.maf.barcodes)
Output:
[1] "TCGA-BQ-7058-01A-11D-1963-05" "TCGA-2Z-A9JQ-01A-11D-A42K-05" "TCGA-BQ-5887-11A-01D-1963-05"
Your almost there, but your code ori.maf.barcode %in% sub.same.barcodes creates the logical equation that returns TRUE and FALSE, which is what you are seeing. In order to get back the values which equate to TRUE you need to pass that expression into a subsetting method to get back what you want.
ori.maf.barcode[which(ori.maf.barcode %in% sub.same.barcodes)]
If it is a vector this should return another vector with only those entries which are TRUE in the logical statement.
And you need to string match to get the entries based on the first part as iod said below:
This is a loop picks them out one at a time and adds them to a new vector
new.barcodes<-c()
for (sub in sub.same.barcodes){
new<- ori.maf.barcode[which(startsWith(ori.maf.barcode, sub))]
new.barcodes<-c(new.barcodes, new)
}
This will iterate through your prefixes and pull out what you want into a new vector
I am currently helping a friend with his research and am gathering information about different natural disasters that occured from 2004-2016. The data can be found using this link:
https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/
when you import it to R it gives helpful information, however, my friend, and now I, am only interested in State, Year, Month, Event, Type, County, Direct & indirect deaths and injuries, and property damage. So first I am extracting the columns I need and will later in the code combine them back together, however the data is currently in string mode, for the Property Damage column I need it to present as numeric since it is in cash value. So for example, I have a data entry in that column that looks like "8.6k" and I need it as this 8600 and for all the "NA" entries to be replaced with a 0.
I have this so far but it gives me back a string of "NA"s. Can anyone think of a better way of doing this?
State<- W2004$STATE
Year<-W2004$YEAR
Month<-W2004$MONTH_NAME
Event<-W2004$EVENT_TYPE
Type<-W2004$CZ_TYPE
County<-W2004$CZ_NAME
Direct_Death<-W2004$DEATHS_DIRECT
Indirect_Death<-W2004$DEATHS_INDIRECT
Direct_Injury<-W2004$INJURIES_DIRECT
Indirect_Injury<-W2004$INJURIES_INDIRECT
W2004$DAMAGE_PROPERTY<-as.numeric(W2004$DAMAGE_PROPERTY)
Damage_Property<-W2004$DAMAGE_PROPERTY
l <- cbind( all the columns up there)
print(l)
We can try using a case when expression here, to map each type of unit to a bona fide number. Going with the two examples you actually showed us:
library(dplyr)
x <- c("1.00M", "8.6k")
result <- case_when(
grepl("\\d+k$", x) ~ as.numeric(sub("\\D+$", "", x)) * 1000,
grepl("\\d+M$", x) ~ as.numeric(sub("\\D+$", "", x)) * 1000000,
TRUE ~ as.numeric(sub("\\D+$", "", x))
)
You can extract the letter and use switch() which is easily maintainable, if you want to add additional symbols it is very easy.
First, the setup:
options(scipen = 999) # to prevent R from printing scientific numbers
library(stringr) # to extract letters
This is the sample vector:
numbers_with_letters <- c("1.00M", "8.6k", 50)
Use lapply() to loop through vector, extract the letter, replace it with a number, remove the letter, convert to numeric, and multiply:
lapply(numbers_with_letters, function(x) {
letter <- str_extract(x, "[A-Za-z]")
letter_to_num <- switch(letter,
k = 1000,
M = 1000000,
1) # 1 is the default option if no letter found
numbers_with_letters <- as.numeric(gsub("[A-Za-z]", "", x))
#remove all NAs and replace with 0
numbers_with_letters[is.na(numbers_with_letters)] <- 0
return(numbers_with_letters * letter_to_num)
})
This returns:
[[1]]
[1] 1000000
[[2]]
[1] 8600
[[3]]
[1] 50
[[4]]
[1] 0
Maybe I'm oversimplifying here, but . . .
library(tidyverse)
data <- tibble(property_damage = c("8.6k", "NA"))
data %>%
mutate(
as_number = if_else(
property_damage != "NA",
str_extract(property_damage, "\\d+\\.*\\d*"),
"0"
),
as_number = as.numeric(as_number)
)
I'd like to change the variable names in my data.frame from e.g. "pmm_StartTimev4_E2_C19_1" to "pmm_StartTimev4_E2_C19". So if the name ends with an underscore followed by any number it gets removed.
But I'd like for this to happen only if the variable name has the word "Start" in it.
I've got a muddled up bit of code that doesn't work. Any help would be appreciated!
# Current data frame:
dfbefore <- data.frame(a=c("pmm_StartTimev4_E2_C19_1","pmm_StartTimev4_E2_E2_C1","delivery_C1_C12"),b=c("pmm_StartTo_v4_E2_C19_2","complete_E1_C12_1","pmm_StartTo_v4_E2_C19"))
# Desired data frame:
dfafter <- data.frame(a=c("pmm_StartTimev4_E2_C19","pmm_StartTimev4_E2_E2_C1","delivery_C1_C12"),b=c("pmm_StartTo_v4_E2_C19","complete_E1_C12_1","pmm_StartTo_v4_E2_C19"))
# Current code:
sub((.*{1,}[0-9]*).*","",grep("Start",names(df),value = TRUE)
How about something like this using gsub().
stripcol <- function(x) {
gsub("(.*Start.*)_\\d+$", "\\1", as.character(x))
}
dfnew <- dfbefore
dfnew[] <- lapply(dfbefore, stripcol)
We use the regular expression to look for "Start" and then grab everything but the underscore number at the end. We use lapply to apply the function to all columns.
doit <- function(x){
x <- as.character(x)
if(grepl("Start",x)){
x <- gsub("_([0-9])","",x)
}
return(x)
}
apply(dfbefore,c(1,2),doit)
a b
[1,] "pmm_StartTimev4_E2_C19" "pmm_StartTo_v4_E2_C19"
[2,] "pmm_StartTimev4_E2_E2_C1" "complete_E1_C12_1"
[3,] "delivery_C1_C12" "pmm_StartTo_v4_E2_C19"
We can use sub to capture groups where the 'Start' substring is also present followed by an underscore and one or more numbers. In the replacement, use the backreference of the captured group. As there are multiple columns, use lapply to loop over the columns, apply the sub and assign the output back to the original data
out <- dfbefore
out[] <- lapply(dfbefore, sub,
pattern = "^(.*_Start.*)_\\d+$", replacement ="\\1")
out
dfafter[] <- lapply(dfafter, as.character)
all.equal(out, dfafter, check.attributes = FALSE)
#[1] TRUE
I have an issue about replacing strings with the new ones conditionally.
I put short version of my real problem so far its working however I need a better solution since there are many rows in the real data.
strings <- c("ca_A33","cb_A32","cc_A31","cd_A30")
Basicly I want to replace strings with replace_strings. First item in the strings replaced with the first item in the replace_strings.
replace_strings <- c("A1","A2","A3","A4")
So the final string should look like
final string <- c("ca_A1","cb_A2","cc_A3","cd_A4")
I write some simple function assign_new
assign_new <- function(x){
ifelse(grepl("A33",x),gsub("A33","A1",x),
ifelse(grepl("A32",x),gsub("A32","A2",x),
ifelse(grepl("A31",x),gsub("A31","A3",x),
ifelse(grepl("A30",x),gsub("A30","A4",x),x))))
}
assign_new(strings)
[1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"
Ok it seems we have solution. But lets say if I have A1000 to A1 and want to replace them from A1 to A1000 I need to do 1000 of rows of ifelse statement. How can we tackle that?
If your vectors are ordered to be matched, then you can use:
> paste0(gsub("(.*_)(.*)","\\1", strings ), replace_strings)
[1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"
You can use regmatches.First obtain all the characters that are followed by _ using regexpr then replace as shown below
`regmatches<-`(strings,regexpr("(?<=_).*",strings,perl = T),value=replace_strings)
[1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"
Not the fastests but very tractable and easy to maintain:
for (i in 1:length(strings)) {
strings[i] <- gsub("\\d+$", i, strings[i])
}
"\\d+$" just matches any number at the end of the string.
EDIT: Per #Onyambu's comment, removing map2_chr as paste is a vectorized function.
foo <- function(x, y){
x <- unlist(lapply(strsplit(x, "_"), '[', 1))
paste(x, y, sep = "_"))
}
foo(strings, replace_strings)
with x being strings and y being replace_strings. You first split the strings object at the _ character, and paste with the respective replace_strings object.
EDIT:
For objects where there is no positional relationship you could create a reference table (dataframe, list, etc.) and match your values.
reference_tbl <- data.frame(strings, replace_strings)
foo <- function(x){
y <- reference_tbl$replace_strings[match(x, reference_tbl$strings)]
x <- unlist(lapply(strsplit(x, "_"), '[', 1))
paste(x, y, sep = "_")
}
foo(strings)
Using the dplyr package:
strings <- c("ca_A33","cb_A32","cc_A31","cd_A30")
replace_strings <- c("A1","A2","A3","A4")
df <- data.frame(strings, replace_strings)
df <- mutate(rowwise(df),
strings = gsub("_.*",
paste0("_", replace_strings),
strings)
)
df <- select(df, strings)
Output:
# A tibble: 4 x 1
strings
<chr>
1 ca_A1
2 cb_A2
3 cc_A3
4 cd_A4
yet another way:
mapply(function(x,y) gsub("(\\w\\w_).*",paste0("\\1",y),x),strings,replace_strings,USE.NAMES=FALSE)
# [1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"
Currently I'm using nested ifelse functions with grepl to check for matches to a vector of strings in a data frame, for example:
# vector of possible words to match
x <- c("Action", "Adventure", "Animation")
# data
my_text <- c("This one has Animation.", "This has none.", "Here is Adventure.")
my_text <- as.data.frame(my_text)
my_text$new_column <- ifelse (
grepl("Action", my_text$my_text) == TRUE,
"Action",
ifelse (
grepl("Adventure", my_text$my_text) == TRUE,
"Adventure",
ifelse (
grepl("Animation", my_text$my_text) == TRUE,
"Animation", NA)))
> my_text$new_column
[1] "Animation" NA "Adventure"
This is fine for just a few elements (e.g., the three here), but how do I return when the possible matches are much larger (e.g., 150)? Nested ifelse seems crazy. I know I can grepl multiple things at once as in the code below, but this return a logical telling me only if the string was matched, not which one was matched. I'd like to know what was matched (in the case of multiple, any of the matches is fine.
x <- c("Action", "Adventure", "Animation")
my_text <- c("This one has Animation.", "This has none.", "Here is Adventure.")
grepl(paste(x, collapse = "|"), my_text)
returns: [1] TRUE FALSE TRUE
what i'd like it to return: "Animation" ""(or FALSE) "Adventure"
Following the pattern here, a base solution.
x <- c("ActionABC", "AdventureDEF", "AnimationGHI")
regmatches(x, regexpr("(Action|Adventure|Animation)", x))
stringr has an easier way to do this
library(stringr)
str_extract(x, "(Action|Adventure|Animation)")
Building on Benjamin's base solution, use lapply so that you will have a character(0) value when there is no match.
Just using regmatches on your sample code directly, will you give the following error.
my_text$new_column <-regmatches(x = my_text$my_text, m = regexpr(pattern = paste(x, collapse = "|"), text = my_text$my_text))
Error in `$<-.data.frame`(`*tmp*`, new_column, value = c("Animation", :
replacement has 2 rows, data has 3
This is because there are only 2 matches and it will try to fit the matches values in the data frame column which has 3 rows.
To fill non-matches with a special value so that this operation can be done directly we can use lapply.
my_text$new_column <-
lapply(X = my_text$my_text, FUN = function(X){
regmatches(x = X, m = regexpr(pattern = paste(x, collapse = "|"), text = X))
})
This will put character(0) where there is no match.
Table screenshot
Hope this helps.
This will do it...
my_text$new_column <- unlist(
apply(
sapply(x, grepl, my_text$my_text),
1,
function(y) paste("",x[y])))
The sapply produces a logical matrix showing which of the x terms appears in each element of your column. The apply then runs through this row-by-row and pastes together all of the values of x corresponding to TRUE values. (It pastes a "" at the start to avoid NAs and keep the length of the output the same as the original data.) If there are two terms in x matched for a row, they will be pasted together in the output.