Separating into separate columns based on 2 delimiters - r

I'm an R beginner and have a basic question I can't seem to figure out. I have row values that need to be separated into different columns, but there are more than one delimiter I am trying to use. The Expression_level column contains an ensembl gene ID with its corresponding value, in the form ensembl:exp value, but there are sometimes 2 ensembl IDs in the same row separated by ;. I want to have a column for ensembl and for gene expression value, but not sure how to separate while keeping them mapped to the correct ID/expression value. This is the type of data I am working with: rna_seq and this is what I am trying to get out: org_rna. TYIA
rna_seq= cbind("Final_gene" = c("KLHL15", "CPXCR1", "MAP7D3", "WDR78"), "Expression_level" = c("1.62760683812965:ENSG00000174010", "-9.96578428466209:ENSG00000147183",
"-4.32192809488736:ENSG00000129680", "-1.39592867633114:ENSG00000152763;-9.96578428466209:ENSG00000231080"))
org_rna = cbind("Final_gene" = c("KLHL15", "CPXCR1", "MAP7D3", "WDR78", "WDR78"), "Ensembl" = c("ENSG00000174010", "ENSG00000147183", "ENSG00000129680", "ENSG00000152763", "ENSG00000231080")
, "Expression" = c("1.62760683812965", "-9.96578428466209", "-4.32192809488736", "-1.39592867633114", "-9.96578428466209"))

library(tidyr)
library(dplyr)
rna_seq %>%
as.data.frame() %>%
# separate cells containing multiple values into
# multiple rows
separate_rows(Expression_level, sep = ";") %>%
# extract pairs
extract(col = Expression_level,
into = c("Expression", "Ensembl"),
regex = "(.*):(.*)")
# A tibble: 5 x 3
# Final_gene Expression Ensembl
# <chr> <chr> <chr>
# KLHL15 1.62760683812965 ENSG00000174010
# CPXCR1 -9.96578428466209 ENSG00000147183
# MAP7D3 -4.32192809488736 ENSG00000129680
# WDR78 -1.39592867633114 ENSG00000152763
# WDR78 -9.96578428466209 ENSG00000231080

Another (less elegant) solution using separate():
library(tidyr)
library(dplyr)
rna_seq |> as.data.frame() |>
# Separate any second IDs
separate(Expression_level, sep = ";", into = c("ID1", "ID2")) |>
# Reshape to longer (columns to rows)
pivot_longer(cols = starts_with("ID")) |>
# Separate Expression from Ensembl
separate(value, sep = ":", into = c("Expression", "Ensembl")) |>
filter(!is.na(Expression)) |>
select(Final_gene, Ensembl, Expression)

Related

Rowwise comparison of the length of a string against a list of string lengths

Consider the following data frame with two columns of strings of variable length:
library("tidyverse")
df <- tibble(REF = c("TTG", "CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT", "T", "TTGTGTGTGTGTGTGTGTGTGT"),
ALT = c("T", "CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT,CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT", "TTG", "TTGTGTGTGTGTGTGTGTGTGTGT"))
# # A tibble: 4 × 2
# REF ALT
# <chr> <chr>
# 1 TTG T
# 2 CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT,CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
# 3 T TTG
# 4 TTGTGTGTGTGTGTGTGTGTGT TTGTGTGTGTGTGTGTGTGTGTGT
Differently from column REF, column ALT sometimes includes several strings concatenated by comma (e.g. row 2).
I want to compare the length of strings in REF (REF_LEN) and ALT (ALT_LEN), and generate a TYPE column with values:
"SNM" when REF_LEN = ALT_LEN
"INS" when REF_LEN < ALT_LEN
"DEL" when REF_LEN > ALT_LEN
But I want to do it in a way that, when several strings are present in ALT, the output of this new TYPE column contains these types as well separated by a comma. i.e., the expected output here would be:
"DEL" "INS,DEL" "INS" "INS"
So far, I know how to get the length of values in ALT, but I fail at collapsing these values, as the output will contain lengths from all ALTs in the table, not just pairwise (i.e. 1,35,31,3,24):
df %>%
dplyr::mutate(REF_LEN = str_length(REF),
ALT_LEN = str_split(ALT, ","),
ALT_LEN = purrr::map(ALT_LEN, str_length) %>% unlist() %>% paste(collapse = ","))
Code above is incomplete as you can see, but I am also unable to work in a different direction using a helper function that gets the TYPE column above done. This will return many errors, but not sure why, since it seems to work nicely with values from ALT_LEN individually:
name <- function(alt_lens, ref_len) {
alt_lens <- unlist(alt_lens)
ifelse(alt_lens < ref_len, "DEL", ifelse(alt_lens > ref_len, "INS", "SNM"))
}
df %>%
dplyr::mutate(REF_LEN = str_length(REF),
ALT_LEN = str_split(ALT, ","),
TYPE = purrr::map(ALT_LEN, str_length) %>% name(REF_LEN))
Any ideas? thanks!
Here's a codegolf-ish base R solution :
df <- data.frame(REF = c("TTG", "CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT", "T", "TTGTGTGTGTGTGTGTGTGTGT"),
ALT = c("T", "CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT,CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT", "TTG", "TTGTGTGTGTGTGTGTGTGTGTGT"))
df$TYPE <- mapply(
function(x, y) paste(c("INS", "SNM", "DEL")[2 + sign(nchar(x)- nchar(y))], collapse = ","),
df$REF, strsplit(df$ALT, ","), USE.NAMES = FALSE)
df$TYPE
#> [1] "DEL" "INS,DEL" "INS" "INS"
Created on 2022-04-20 by the reprex package (v2.0.1)
Update: Removed first answer. Thanks to akrun for pointing me there!. The concept is the same: using nchar with case_when, the difference is to use separate_rows from tidyr package:
library(dplyr)
library(tidyr)
df %>%
mutate(id = row_number()) %>%
separate_rows(ALT, sep = ",") %>%
mutate(TYPE = case_when(nchar(REF)==nchar(ALT) ~ "SNM",
nchar(REF)< nchar(ALT) ~ "INS",
nchar(REF)> nchar(ALT) ~ "DEL",
TRUE ~ NA_character_)) %>%
group_by(id) %>%
mutate(TYPE = toString(TYPE)) %>%
slice(1)
REF ALT id TYPE
<chr> <chr> <int> <chr>
1 TTG T 1 DEL
2 CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT 2 INS, DEL
3 T TTG 3 INS
4 TTGTGTGTGTGTGTGTGTGTGT TTGTGTGTGTGTGTGTGTGTGTGT 4 INS

Select unique values

I need to change this function that doesn't match for unique values. For example, if I want MAPK4, the function matches MAPK41 and AMAPK4 etc. The function must select only the unique values.
Function:
library(dplyr)
df2 <- df %>%
rowwise() %>%
mutate(mutated = paste(mutated_genes[unlist(
lapply(mutated_genes, function(x) grepl(x,genes, ignore.case = T)))], collapse=","),
circuit_name = gsub("", "", circuit_name)) %>%
select(-genes) %>%
data.frame()
data:
df <-structure(list(circuit_name = c("hsa04010__117", "hsa04014__118" ), genes = c("MAP4K4,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP3*,DUSP3*,DUSP3*,DUSP3*,PPM1A,AKT3,AKT3,AKT3,ZAK,MAP3K12,MAP3K13,TRAF2,CASP3,IL1R1,IL1R1,TNFRSF1A,IL1A,IL1A,TNF,RAC1,RAC1,RAC1,RAC1,MAP2K7,MAPK8,MAPK8,MAPK8,MECOM,HSPA1A,HSPA1A,HSPA1A,HSPA1A,HSPA1A,HSPA1A,MAP4K3,MAPK8IP2,MAP4K1", "MAP4K4,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*")), class = "data.frame", row.names = c(NA, -2L))
mutated_genes <- c("MAP4K4", "MAP3K12","TRAF2", "CACNG3")
output:
circuit_name mutated
1 hsa04010__117 MAP4K4,TRAF2
2 hsa04014__118 MAP4K4
A base R approach would be by splitting the genes on "," and return those string which match mutated_genes.
df$mutated <- sapply(strsplit(df$genes, ","), function(x)
toString(grep(paste0(mutated_genes, collapse = "|"), x, value = TRUE)))
df[c(1, 3)]
# circuit_name mutated
#1 hsa04010__117 MAP4K4, MAP3K12, TRAF2
#2 hsa04014__118 MAP4K4
Please note that based on the mutated_genes vector, your expected output is missing MAP3K12 for hsa04010__117.
Here is a tidyverse possibility
df %>%
separate_rows(genes) %>%
filter(genes %in% mutated_genes) %>%
group_by(circuit_name) %>%
summarise(mutated = toString(genes))
## A tibble: 2 x 2
# circuit_name mutated
# <chr> <chr>
#1 hsa04010__117 MAP4K4, MAP3K12, TRAF2
#2 hsa04014__118 MAP4K4
Explanation: We separate comma-separated entries into different rows, then select only those rows where genes %in% mutated_genes and summarise results per circuit_name by concatenating genes entries.
PS. Personally I'd recommend keeping the data in a tidy long format (i.e. don't concatenate entries with toString); that way you have one row per gene, which will make any post-processing of the data much more straightforward.
We can use str_extract
library(stringr)
df$mutated <- sapply(str_extract_all(df$genes, paste(mutated_genes,
collapse="|")), toString)

Sum by aggregating complex paired names in R

In R, I'm trying to aggregate a dataframe based on unique IDs, BUT I need to use some kind of wild card value for the IDs. Meaning I have paired names like this:
lion_tiger
elephant_lion
tiger_lion
And I need the lion_tiger and tiger_lion IDs to be summed together, because the order in the pair does not matter.
Using this dataframe as an example:
df <- data.frame(pair = c("1_3","2_4","2_2","1_2","2_1","4_2","3_1","4_3","3_2"),
value = c("12","10","19","2","34","29","13","3","14"))
So the values for pair IDs, "1_2" and "2_1" need to be summed in a new table. That new row would then read:
1_2 36
Any suggestions? While my example has numbers as the pair IDs, in reality I would need this to read in text (like the lion_tiger" example above).
We can split the 'pair' column by _, then sort and paste it back, use it in a group by function to get the sum
tapply(as.numeric(as.character(df$value)),
sapply(strsplit(as.character(df$pair), '_'), function(x)
paste(sort(as.numeric(x)), collapse="_")), FUN = sum)
Or another option is gsubfn
library(gsubfn)
df$pair <- gsubfn('([0-9]+)_([0-9]+)', ~paste(sort(as.numeric(c(x, y))), collapse='_'),
as.character(df$pair))
df$value <- as.numeric(as.character(df$value))
aggregate(value~pair, df, sum)
Using tidyverse and purrrlyr
df <- data.frame(name=c("lion_tiger","elephant_lion",
"tiger_lion"),value=c(1,2,3),stringsAsFactors=FALSE)
require(tidyverse)
require(purrrlyr)
df %>% separate(col = name, sep = "_", c("A", "B")) %>%
by_row(.collate = "rows",
..f = function(this_row) {
paste0(sort(c(this_row$A, this_row$B)), collapse = "_")
}) %>%
rename(sorted = ".out") %>%
group_by(sorted) %>%
summarize(sum(value))%>%show
## A tibble: 2 x 2
# sorted `sum(value)`
# <chr> <dbl>
#1 elephant_lion 2
#2 lion_tiger 4

Dynamic adding of columns in R

I need some help with finding a good way to dynamically add columns with counts for different categories that I need to extract from a string.
In my data, I have a column that contains names of categories and counts thereof. The fields can be empty or contain any combination of categories one can think of. Here are some examples:
themes:firstcategory_1;secondcategory_33;thirdcategory_5
themes:secondcategory_33;fourthcategory_2
themes:fifthcategory_1
What I need is a column for each category (should have the category's name) and the count extracted from the strings above. The list of categories is dynamic, so I don't know beforehand which ones exist.
How do I approach this?
This code will get a column for each category with the counts for each row.
library(dplyr)
library(tidyr)
library(stringr)
# Create test dataframe
df <- data.frame(themes = c("firstcategory_1;secondcategory_33;thirdcategory_5", "secondcategory_33;fourthcategory_2","fifthcategory_1"), stringsAsFactors = FALSE)
# Get the number of columns to split values into
cols <- max(str_count(df$themes,";")) + 1
# Get vector of temporary column names
cols <- paste0("col",c(1:cols))
df <- df %>%
# Add an ID column based on row number
mutate(ID = row_number()) %>%
# Separate multiple categories by semicolon
separate(col = themes, into = cols, sep = ";", fill = "right") %>%
# Gather categories into a single column
gather_("Column", "Value", cols) %>%
# Drop temporary column
select(-Column) %>%
# Filter out NA values
filter(!is.na(Value)) %>%
# Separate categories from their counts by underscore
separate(col = Value, into = c("Category","Count"), sep = "_", fill = "right") %>%
# Spread categories to create a column for each category, with the count for each ID in that category
spread(Category, Count)

Not able to convert a list into a data frame

I'm new in programming in R, and I've been having this problem for several days now. I started with a list, I created from splitting a file. This list contains a number of items in a single row.
head(sales2)
> $`7143443`
>>[1] "SSS-U-CCXVCSS1" "L-CCX-8GETTS-LIC"
>$`7208993`
>>[1] "NFFGSR4=" "1MV-FT-1=" "VI-NT/TE="
>$`7241758`
>>[1] "PW_SQSGG="
>$`9273628`
>>[1] "O1941-SE9" "CCO887VA-K9" "2901-SEC/K9" "CO1941-C/K9"
>$`9371709`
>>[1] "HGR__SASS=" "WWQTTB0S-L" "WS-RRRT48FP" "WTTTF24PS-L"
[5] "GEDQTT8TS-L" "WD-SRNS-2S-L"
>$`9830473`
>>[1] "SPA$FFSB0S"
I wanted it to convert into a data frame , I used
x<-do.call(rbind, lapply(sales2,data.frame))
It gets converted in the data frame ,but it converts like this
> head(x,6)
id
> 7143443.1 "SSS-U-CCXVCSS1"
> 7143443.2 "L-CCX-8GETTS-LIC"
> 7208993.1 "NFFGSR4="
> 7208993.2 "1MV-FT-1="
> 7208993.3 "VI-NT/TE="
> 7241758 "PW_SQSGG="
I want 7143443's all item in a single row not in multiple row
Through this I want to calculate how many rows contain 2 items together
for example "WS-C2960S-48TS-L" , "WS-C2960S-24TS-L",
these 2 elements are there in how many rows?
You can also say probability of these over all other elements.
I am not sure what is your final desired output. But the following script can convert your list to a data frame. Perhaps you can begin your analysis from this data frame.
# Create example list
sales2 <- list(`7143443` = c("SSS-U-CCXVCSS1", "L-CCX-8GETTS-LIC"),
`7208993` = c("NFFGSR4=", "1MV-FT-1=", "VI-NT/TE="),
`7241758` = "PW_SQSGG=",
`9273628` = c("O1941-SE9", "CCO887VA-K9", "2901-SEC/K9", "CO1941-C/K9"),
`9371709` = c("HGR__SASS=", "WWQTTB0S-L", "WS-RRRT48FP", "WTTTF24PS-L",
"GEDQTT8TS-L", "WD-SRNS-2S-L"),
`9830473` = "SPA$FFSB0S")
# Load packages
library(dplyr)
library(purrr)
dat <- map(sales2, data_frame) %>% # Convert each list element to a data frame
bind_rows(.id = "ID") %>% # Combine all data frame
rename(Value = `.x[[i]]`) %>% # Change the name of the second column
group_by(ID) %>% # Group by the first column
summarise(Value = paste0(Value, collapse = " ")) # Collapse the second column
dat
# A tibble: 6 × 2
ID Value
<chr> <chr>
1 7143443 SSS-U-CCXVCSS1 L-CCX-8GETTS-LIC
2 7208993 NFFGSR4= 1MV-FT-1= VI-NT/TE=
3 7241758 PW_SQSGG=
4 9273628 O1941-SE9 CCO887VA-K9 2901-SEC/K9 CO1941-C/K9
5 9371709 HGR__SASS= WWQTTB0S-L WS-RRRT48FP WTTTF24PS-L GEDQTT8TS-L WD-SRNS-2S-L
6 9830473 SPA$FFSB0S
Update
After reading original poster's comment, I decided to update my solution, to count how many rows contain two specified string patterns.
Here one row is a unique ID. So I assume that the request can be rephrased to "How many IDs contain two specified string patterns?" If this is the case, I would prefer not to collapse all the observations. Because after collapsing all observations to from one ID per row, we need to develop a strategy to match the string, such as using the regular expression. I am not familiar with regular string, so I will leave this for others to provide solutions.
In addition, the original poster did not specify which two strings are the targeted, so I would develop a strategy that the users can replace the targeted string case by case.
dat <- map(sales2, data_frame) %>% # Convert each list element to a data frame
bind_rows(.id = "ID") %>% # Combine all data frame
rename(Value = `.x[[i]]`) # Change the name of the second column
# After this, there is no need to collapse the rows
# Set the target string, User can change the strings here
target_string1 <- c("SSS-U-CCXVCSS1", "L-CCX-8GETTS-LIC")
dat2 <- dat %>%
filter(Value %in% target_string1) %>% # Filter rows matching the targeted string
distinct(ID, Value, .keep_all = TRUE) %>% # Only keep one row if ID and Value have exact duplicated
count(ID) %>% # Count how many rows per ID
filter(n > 1) %>% # Keep only ID that the Count number is larger than 1
select(ID)
dat2
# A tibble: 1 × 1
ID
<chr>
1 7143443

Resources