R: Extracting Bigrams with Zero-Width Lookaheads

I want to extract bigrams from sentences using the regex described here, and store the output in a new column that references the original.
library(dplyr)
library(stringr)
library(splitstackshape)
df <- data.frame(a =c("apple orange plum"))
# Single Words - Successful
df %>%
# Base R
mutate(b = sapply(regmatches(a,gregexpr("\\w+\\b", a, perl = TRUE)),
paste, collapse=";")) %>%
# Duplicate with Stringr
mutate(c = sapply(str_extract_all(a,"\\w+\\b"),paste, collapse=";")) %>%
cSplit(., c(2,3), sep = ";", direction = "long")
Initially, I thought the problem was with the regex engine, but neither stringr::str_extract_all (ICU) nor base::regmatches (PCRE) works.
# Bigrams - Fails
df %>%
# Base R
mutate(b = sapply(regmatches(a,gregexpr("(?=(\\b\\w+\\s+\\w+))", a, perl = TRUE)),
paste, collapse=";")) %>%
# Duplicate with Stringr
mutate(c = sapply(str_extract_all(a,"(?=(\\b\\w+\\s+\\w+))"),paste, collapse=";")) %>%
cSplit(., c(2,3), sep = ";", direction = "long")
As a result, I'm guessing the problem has to do with using a zero-width lookahead around a capturing group. Is there any valid regex in R that will allow these bigrams to be extracted?

As @WiktorStribiżew suggested, using str_match_all helps here. Here's how to apply it to multiple rows in a data frame. Let
(df <- data.frame(a = c("one two three", "four five six")))
# a
# 1 one two three
# 2 four five six
Then we may do
df %>% rowwise() %>%
do(data.frame(., b = str_match_all(.$a, "(?=(\\b\\w+\\s+\\w+))")[[1]][, 2], stringsAsFactors = FALSE))
# Source: local data frame [4 x 2]
# Groups: <by row>
#
# A tibble: 4 x 2
# a b
# * <fct> <chr>
# 1 one two three one two
# 2 one two three two three
# 3 four five six four five
# 4 four five six five six
where stringsAsFactors = FALSE is just to avoid warnings that come from binding rows.
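The reason regmatches returns empty strings for the bigram pattern is that the overall match of a lookahead is zero-width; the text of interest sits in the capturing group. A minimal base R sketch, assuming a single input string, reads the capture positions that gregexpr(..., perl = TRUE) attaches as the capture.start and capture.length attributes. The variable names are illustrative, not taken from the answers above.

```r
x <- "apple orange plum"

# Each overall match is zero-width, so regmatches() would give "".
m <- gregexpr("(?=(\\b\\w+\\s+\\w+))", x, perl = TRUE)[[1]]

# With perl = TRUE the positions of capture group 1 are stored as attributes.
starts <- attr(m, "capture.start")[, 1]
lens   <- attr(m, "capture.length")[, 1]

# Cut the captured text out of the original string.
bigrams <- substring(x, starts, starts + lens - 1)
print(bigrams)
```

For several strings, the same steps could be wrapped in a function and applied with lapply.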

Related

Separating into separate columns based on 2 delimiters

I'm an R beginner and have a basic question I can't seem to figure out. I have row values that need to be separated into different columns, but there is more than one delimiter involved. The Expression_level column contains an expression value with its corresponding Ensembl gene ID, in the form exp value:ensembl, but there are sometimes 2 entries in the same row separated by ;. I want one column for the Ensembl ID and one for the gene expression value, but I'm not sure how to separate them while keeping each ID mapped to the correct expression value. This is the type of data I am working with: rna_seq and this is what I am trying to get out: org_rna. TYIA
rna_seq= cbind("Final_gene" = c("KLHL15", "CPXCR1", "MAP7D3", "WDR78"), "Expression_level" = c("1.62760683812965:ENSG00000174010", "-9.96578428466209:ENSG00000147183",
"-4.32192809488736:ENSG00000129680", "-1.39592867633114:ENSG00000152763;-9.96578428466209:ENSG00000231080"))
org_rna = cbind("Final_gene" = c("KLHL15", "CPXCR1", "MAP7D3", "WDR78", "WDR78"), "Ensembl" = c("ENSG00000174010", "ENSG00000147183", "ENSG00000129680", "ENSG00000152763", "ENSG00000231080")
, "Expression" = c("1.62760683812965", "-9.96578428466209", "-4.32192809488736", "-1.39592867633114", "-9.96578428466209"))
library(tidyr)
library(dplyr)
rna_seq %>%
as.data.frame() %>%
# separate cells containing multiple values into
# multiple rows
separate_rows(Expression_level, sep = ";") %>%
# extract pairs
extract(col = Expression_level,
into = c("Expression", "Ensembl"),
regex = "(.*):(.*)")
# A tibble: 5 x 3
# Final_gene Expression Ensembl
# <chr> <chr> <chr>
# KLHL15 1.62760683812965 ENSG00000174010
# CPXCR1 -9.96578428466209 ENSG00000147183
# MAP7D3 -4.32192809488736 ENSG00000129680
# WDR78 -1.39592867633114 ENSG00000152763
# WDR78 -9.96578428466209 ENSG00000231080
Another (less elegant) solution using separate():
library(tidyr)
library(dplyr)
rna_seq |> as.data.frame() |>
# Separate any second IDs
separate(Expression_level, sep = ";", into = c("ID1", "ID2")) |>
# Reshape to longer (columns to rows)
pivot_longer(cols = starts_with("ID")) |>
# Separate Expression from Ensembl
separate(value, sep = ":", into = c("Expression", "Ensembl")) |>
filter(!is.na(Expression)) |>
select(Final_gene, Ensembl, Expression)
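For comparison, the same two-step split (first on ;, then on :) can be sketched in base R without tidyr. This is only an illustration: it assumes every entry has the form value:ID, it uses just two rows of the example data, and org is a hypothetical name.

```r
rna_seq <- data.frame(
  Final_gene = c("MAP7D3", "WDR78"),
  Expression_level = c("-4.32192809488736:ENSG00000129680",
                       "-1.39592867633114:ENSG00000152763;-9.96578428466209:ENSG00000231080"),
  stringsAsFactors = FALSE
)

# Step 1: split each cell on ";" so that every value:ID pair stands alone.
pieces <- strsplit(rna_seq$Expression_level, ";", fixed = TRUE)

# Step 2: split every pair on ":" and stack the halves into a two-column matrix.
parts <- do.call(rbind, strsplit(unlist(pieces), ":", fixed = TRUE))

# Repeat each gene name once per pair so the rows stay aligned.
org <- data.frame(
  Final_gene = rep(rna_seq$Final_gene, lengths(pieces)),
  Ensembl    = parts[, 2],
  Expression = parts[, 1],
  stringsAsFactors = FALSE
)
print(org)
```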

How to remove everything from a row except pattern

I have a dataframe that contains one column separated by ; like this
AB00001;09843;AB00002;GD00001
AB84375;34
AB84375;AB84375
74859375;AB001;4455;FG3455
What I want is to remove everything except the codes that start with AB:
AB00001;AB00002
AB84375
AB84375;AB84375
AB001
I've tried to separate them with separate(), but I don't know how to continue. Any suggestions?
If your data frame is called df and your column is called V1, you could try:
sapply(strsplit(df$V1, ";"), function(x) paste(grep("^AB", x, value = TRUE), collapse = ";"))
#> [1] "AB00001;AB00002" "AB84375" "AB84375;AB84375" "AB001"
This splits at all the semicolons then matches all strings starting with "AB", then joins them back together with semicolons.
I thought of using stringr and Daniel O's data:
df %>%
mutate(data = str_extract_all(data, "AB\\w+"))
which gives us
data
1 AB00001, AB00002
2 AB84375
3 AB84375, AB84375
4 AB001
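Note that str_extract_all leaves a list column, which prints with commas instead of the semicolon-separated strings asked for. A base R sketch of the same extract-then-collapse idea follows; it assumes the column is available as a character vector (here called V1, as in the other answers) and adds a \\b word boundary so that a code such as XAB001 would not be picked up.

```r
V1 <- c("AB00001;09843;AB00002;GD00001",
        "AB84375;34",
        "AB84375;AB84375",
        "74859375;AB001;4455;FG3455")

# Extract every token that begins with "AB" at a word boundary.
hits <- regmatches(V1, gregexpr("\\bAB\\w*", V1, perl = TRUE))

# Collapse each row's matches back into one semicolon-separated string.
result <- vapply(hits, paste, character(1), collapse = ";")
print(result)
```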
1) Base R. Assuming DF is as shown reproducibly in the Note at the end, we prefix each line with a semicolon, apply gsub with the pattern shown, and finally remove the semicolon we added. No packages are used.
transform(DF, V1 = sub("^;", "", gsub("(;AB\\d+)|;[^;]*", "\\1", paste0(";", V1))))
giving:
V1
1 AB00001;AB00002
2 AB84375
3 AB84375;AB84375
4 AB001
2) dplyr/tidyr. This one is longer than the others in this answer, but it is straightforward and has no complex regular expressions.
library(dplyr)
library(tidyr)
DF %>%
mutate(id = 1:n()) %>%
separate_rows(V1, sep = ";") %>%
filter(substr(V1, 1, 2) == "AB") %>%
group_by(id) %>%
summarize(V1 = paste(V1, collapse = ";")) %>%
ungroup %>%
select(-id)
giving:
# A tibble: 4 x 1
V1
<chr>
1 AB00001;AB00002
2 AB84375
3 AB84375;AB84375
4 AB001
3) gsubfn Replace codes that do not start with AB with an empty string and then remove redundant semicolons from what is left.
library(gsubfn)
transform(DF, V1 = gsub("^;|;$", "", gsub(";+", ";",
gsubfn("[^;]*", ~ if (substr(x, 1, 2) == "AB") x else "", V1))))
giving:
V1
1 AB00001;AB00002
2 AB84375
3 AB84375;AB84375
4 AB001
Note
Lines <- "AB00001;09843;AB00002;GD00001
AB84375;34
AB84375;AB84375
74859375;AB001;4455;FG3455"
DF <- read.table(text = Lines, as.is = TRUE, strip.white = TRUE)

Select unique values

I need to change this function because it does not restrict itself to exact matches. For example, if I want MAPK4, the function also matches MAPK41, AMAPK4, etc. The function must select only exact matches.
Function:
library(dplyr)
df2 <- df %>%
rowwise() %>%
mutate(mutated = paste(mutated_genes[unlist(
lapply(mutated_genes, function(x) grepl(x,genes, ignore.case = T)))], collapse=","),
circuit_name = gsub("", "", circuit_name)) %>%
select(-genes) %>%
data.frame()
data:
df <-structure(list(circuit_name = c("hsa04010__117", "hsa04014__118" ), genes = c("MAP4K4,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP3*,DUSP3*,DUSP3*,DUSP3*,PPM1A,AKT3,AKT3,AKT3,ZAK,MAP3K12,MAP3K13,TRAF2,CASP3,IL1R1,IL1R1,TNFRSF1A,IL1A,IL1A,TNF,RAC1,RAC1,RAC1,RAC1,MAP2K7,MAPK8,MAPK8,MAPK8,MECOM,HSPA1A,HSPA1A,HSPA1A,HSPA1A,HSPA1A,HSPA1A,MAP4K3,MAPK8IP2,MAP4K1", "MAP4K4,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*")), class = "data.frame", row.names = c(NA, -2L))
mutated_genes <- c("MAP4K4", "MAP3K12","TRAF2", "CACNG3")
output:
circuit_name mutated
1 hsa04010__117 MAP4K4,TRAF2
2 hsa04014__118 MAP4K4
A base R approach would be to split the genes on "," and return those strings which match mutated_genes.
df$mutated <- sapply(strsplit(df$genes, ","), function(x)
toString(grep(paste0(mutated_genes, collapse = "|"), x, value = TRUE)))
df[c(1, 3)]
# circuit_name mutated
#1 hsa04010__117 MAP4K4, MAP3K12, TRAF2
#2 hsa04014__118 MAP4K4
Please note that based on the mutated_genes vector, your expected output is missing MAP3K12 for hsa04010__117.
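One caveat: grep with a pattern built by pasting names together with "|" still matches substrings, so a longer name containing MAP4K4 would also be returned. A minimal sketch of exact matching, assuming the genes have already been split on commas, compares whole names with %in% instead. The names MAP4K41 and AMAP4K4 below are hypothetical near-misses added only for illustration.

```r
mutated_genes <- c("MAP4K4", "MAP3K12", "TRAF2", "CACNG3")

# Hypothetical input: row 1 contains two near-miss names that must NOT match.
genes <- c("MAP4K4,MAP4K41,AMAP4K4,TRAF2", "MAP4K4,DUSP10*")

# A regex alternation matches substrings, so the near-misses slip through.
partial <- grep(paste(mutated_genes, collapse = "|"),
                c("MAP4K41", "AMAP4K4"), value = TRUE)

# %in% compares complete strings, so only exact gene names are kept.
mutated <- vapply(strsplit(genes, ",", fixed = TRUE),
                  function(x) toString(unique(x[x %in% mutated_genes])),
                  character(1))
print(mutated)
```

If a regex is preferred, anchoring the alternation as paste0("^(", paste(mutated_genes, collapse = "|"), ")$") should have the same effect on the split names.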
Here is a tidyverse possibility
df %>%
separate_rows(genes) %>%
filter(genes %in% mutated_genes) %>%
group_by(circuit_name) %>%
summarise(mutated = toString(genes))
## A tibble: 2 x 2
# circuit_name mutated
# <chr> <chr>
#1 hsa04010__117 MAP4K4, MAP3K12, TRAF2
#2 hsa04014__118 MAP4K4
Explanation: We separate comma-separated entries into different rows, then select only those rows where genes %in% mutated_genes and summarise results per circuit_name by concatenating genes entries.
PS. Personally I'd recommend keeping the data in a tidy long format (i.e. don't concatenate entries with toString); that way you have one row per gene, which will make any post-processing of the data much more straightforward.
We can use str_extract_all
library(stringr)
df$mutated <- sapply(str_extract_all(df$genes, paste(mutated_genes,
collapse="|")), toString)

Sum by aggregating complex paired names in R

In R, I'm trying to aggregate a dataframe based on unique IDs, BUT I need to use some kind of wild card value for the IDs. Meaning I have paired names like this:
lion_tiger
elephant_lion
tiger_lion
And I need the lion_tiger and tiger_lion IDs to be summed together, because the order in the pair does not matter.
Using this dataframe as an example:
df <- data.frame(pair = c("1_3","2_4","2_2","1_2","2_1","4_2","3_1","4_3","3_2"),
value = c("12","10","19","2","34","29","13","3","14"))
So the values for pair IDs, "1_2" and "2_1" need to be summed in a new table. That new row would then read:
1_2 36
Any suggestions? While my example has numbers as the pair IDs, in reality I would need this to read in text (like the "lion_tiger" example above).
We can split the 'pair' column by _, then sort and paste it back, and use it in a group-by function to get the sum
tapply(as.numeric(as.character(df$value)),
sapply(strsplit(as.character(df$pair), '_'), function(x)
paste(sort(as.numeric(x)), collapse="_")), FUN = sum)
Or another option is gsubfn
library(gsubfn)
df$pair <- gsubfn('([0-9]+)_([0-9]+)', ~paste(sort(as.numeric(c(x, y))), collapse='_'),
as.character(df$pair))
df$value <- as.numeric(as.character(df$value))
aggregate(value~pair, df, sum)
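Both snippets above sort with as.numeric, which suits the numeric example but would turn text names into NA. A minimal base R sketch for text pairs, assuming exactly one underscore-separated pair per row, sorts the two names as character strings to build an order-independent key. The key and totals names are illustrative.

```r
df <- data.frame(pair  = c("lion_tiger", "elephant_lion", "tiger_lion"),
                 value = c(1, 2, 3),
                 stringsAsFactors = FALSE)

# Sort the two names alphabetically so "tiger_lion" and "lion_tiger" share a key.
df$key <- vapply(strsplit(df$pair, "_", fixed = TRUE),
                 function(x) paste(sort(x), collapse = "_"),
                 character(1))

# Sum the values within each order-independent key.
totals <- aggregate(value ~ key, data = df, FUN = sum)
print(totals)
```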
Using tidyverse and purrrlyr
df <- data.frame(name=c("lion_tiger","elephant_lion",
"tiger_lion"),value=c(1,2,3),stringsAsFactors=FALSE)
require(tidyverse)
require(purrrlyr)
df %>% separate(col = name, sep = "_", c("A", "B")) %>%
by_row(.collate = "rows",
..f = function(this_row) {
paste0(sort(c(this_row$A, this_row$B)), collapse = "_")
}) %>%
rename(sorted = ".out") %>%
group_by(sorted) %>%
summarize(sum(value))%>%show
## A tibble: 2 x 2
# sorted `sum(value)`
# <chr> <dbl>
#1 elephant_lion 2
#2 lion_tiger 4

How to change code T-25-4 into T-25-04 in a dataframe in R?

I have a data.frame in R. The first column contains codes like T-25-4. I want to change them to T-25-04 and so on, so that the last number always has 2 digits.
Example:
T-25-1
T-25-2
T-25-3
T-25-4
T-25-5
T-25-6
T-25-7
T-25-8
T-25-9
Borrowing first part of ycw's answer, but simpler with mutate and gsub:
library(tidyverse)
dt <- tibble(Col = c("T-25-1", "T-25-2", "T-25-3", "T-25-4", "T-25-5",
"T-25-6", "T-25-7", "T-25-8", "T-25-9"))
dt %>%
mutate(Col = gsub("(\\d)$", paste0("0", "\\1"), Col))
If the last number can go higher than 9 and you don't want to add a 0 in that case:
dt %>%
mutate(Col = ifelse(nchar(sub(".*-(\\d+)$", "\\1", Col)) < 2, # Check if last number is less than 10
sub("(\\d+)$", paste0("0", "\\1"), Col), # Add 0 in front if less than 10
Col))
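The same padding can also be sketched in base R without any packages. The two variants below assume the code always ends in a hyphen followed by an integer; T-25-10 is a hypothetical extra value included only to show that two-digit numbers are left alone.

```r
codes <- c("T-25-1", "T-25-9", "T-25-10")

# Variant 1: a single regex that only fires when exactly one digit follows the last "-".
padded_sub <- sub("-(\\d)$", "-0\\1", codes)

# Variant 2: split on "-", format the last piece with sprintf, and glue back together.
padded_fmt <- vapply(strsplit(codes, "-", fixed = TRUE), function(p) {
  n <- length(p)
  p[n] <- sprintf("%02d", as.integer(p[n]))  # zero-pad to width 2
  paste(p, collapse = "-")
}, character(1))

print(padded_sub)
print(padded_fmt)
```

The sprintf variant generalises more easily to other widths (for example "%03d"), while the regex variant is shorter when only one extra digit is ever needed.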
We can use functions from tidyverse and stringr. dt2 is the final output.
library(tidyverse)
library(stringr)
# Create example data frame
dt <- tibble(Col = c("T-25-1", "T-25-2", "T-25-3", "T-25-4", "T-25-5",
"T-25-6", "T-25-7", "T-25-8", "T-25-9"))
# Process the data
dt2 <- dt %>%
# Separate the original column to three columns
separate(Col, into = c("Col1", "Col2", "Col3")) %>%
# Pad zero to Col3 until the width is 2
mutate(Col3 = str_pad(Col3, width = 2, side= "left", pad = "0")) %>%
# Combine all three columns, separated by "-"
unite(Col, Col1:Col3, sep = "-")
# View the results
dt2
# A tibble: 9 x 1
Col
* <chr>
1 T-25-01
2 T-25-02
3 T-25-03
4 T-25-04
5 T-25-05
6 T-25-06
7 T-25-07
8 T-25-08
9 T-25-09
