extract string from multiple columns in new column


I want to find a word in different columns and mutate it in a new column.
"data" is an example and "goal" is what I want. I tried a lot but couldn't get it to work.
library(dplyr)
library(stringr)
data <- tibble(
component1 = c(NA, NA, "Word", NA, NA, "Word"),
component2 = c(NA, "Word", "different_word", NA, NA, "not_this")
)
goal <- tibble(
component1 = c(NA, NA, "Word", NA, NA, "Word"),
component2 = c(NA, "Word", "different_word", NA, NA, "not_this"),
component = c(NA, "Word", "Word", NA, NA, "Word")
)
not_working <- data %>%
mutate(component = across(starts_with("component"), ~ str_extract(.x, "Word")))

For your provided data structure we could use coalesce:
library(dplyr)
data %>%
mutate(component = coalesce(component1, component2))
component1 component2 component
<chr> <chr> <chr>
1 NA NA NA
2 NA Word Word
3 Word different_word Word
4 NA NA NA
5 NA NA NA
6 Word not_this Word

With if_any and str_detect:
library(dplyr)
library(stringr)
data %>%
mutate(component = ifelse(if_any(starts_with("component"), str_detect, "Word"), "Word", NA))
output
component1 component2 component
<chr> <chr> <chr>
1 NA NA NA
2 NA Word Word
3 Word different_word Word
4 NA NA NA
5 NA NA NA
6 Word not_this Word
If you want to stick with str_extract, this would be the way to go:
data %>%
mutate(across(starts_with("component"), ~ str_extract(.x, "Word"),
.names = "{.col}_extract")) %>%
mutate(component = coalesce(component1_extract, component2_extract),
.keep = "unused")
# A tibble: 6 × 3
component1 component2 component
<chr> <chr> <chr>
1 NA NA NA
2 NA Word Word
3 Word different_word Word
4 NA NA NA
5 NA NA NA
6 Word not_this Word
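
One difference between the answers above is worth illustrating: coalesce on the raw columns returns the first non-NA value, whatever it is, while extracting first keeps only the searched word. The sketch below uses a made-up tibble `demo` (not from the question) where the first column holds a non-matching string, so treat it as an illustration under that assumption rather than a drop-in solution.

```r
library(dplyr)
library(stringr)

# Hypothetical data: row 1 has a non-matching string in the first column
demo <- tibble(
  component1 = c("other", NA, "Word"),
  component2 = c("Word", "Word", NA)
)

result <- demo %>%
  mutate(
    # First non-NA value, regardless of its content
    plain = coalesce(component1, component2),
    # First non-NA match of "Word" across the two columns
    extracted = coalesce(
      str_extract(component1, "Word"),
      str_extract(component2, "Word")
    )
  )

result
```

In row 1, `plain` picks up "other" while `extracted` gives "Word", so the plain coalesce is only safe when a non-matching string never precedes the word you are looking for.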

Related

How to Filter out specific rows in a data frame using dplyr?

I have several sheets that I import from Excel. While these sheets are similar, there are some differences due to manual entry. I am trying to filter out the row that has "Total" and anything beyond that row. The logic I have works for df1 and df3, but I am not sure how to get it to work for df2. Could someone please help?
df1<-structure(list(...1 = structure(c(1630022400, 1630108800, 1630195200,
1630281600, 1630368000, NA), tzone = "UTC", class = c("POSIXct",
"POSIXt")), `Vinayak Trading` = c(1984.31, NA, NA, NA, NA, 2916.17
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))
df2<-structure(list(...1 = c("44526", "44527", "44528", "44529", "44530",
"Total"), `Vinayak Trading` = c(NA, NA, NA, NA, NA, 0)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
df3<-structure(list(...1 = c("44680", "44681", NA, "Total", NA, NA
), `Vinayak Trading` = c(NA, NA, NA, 2736.42, NA, NA)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
transform <- function(df) {
names(df)[1] <- "Date"
df <- df %>%
filter(row_number() < which(is.na(Date))) %>% #To tackle sheets where Total is not present
filter(row_number() < which(Date=="Total")) %>% #To remove Total in sheets where it is present
select(-Total) #To remove the Total column
}
df1 <- transform(df1) # Desired result
df2 <- transform(df2) # Error due to no NAs - don't know how to handle
df3 <- transform(df3) # Desired result with warning

We can use dplyr::cumany to remove "Total" (or NA) and anything beyond.
transform2 <- function(df) {
df %>%
rename(Date = 1) %>%
filter(!cumany(Date %in% c("Total", NA))) %>%
select(-any_of("Total"))
}
transform2(df1)
# # A tibble: 5 × 2
# Date `Vinayak Trading`
# <dttm> <dbl>
# 1 2021-08-27 00:00:00 1984.
# 2 2021-08-28 00:00:00 NA
# 3 2021-08-29 00:00:00 NA
# 4 2021-08-30 00:00:00 NA
# 5 2021-08-31 00:00:00 NA
transform2(df2)
# # A tibble: 5 × 2
# Date `Vinayak Trading`
# <chr> <dbl>
# 1 44526 NA
# 2 44527 NA
# 3 44528 NA
# 4 44529 NA
# 5 44530 NA
transform2(df3)
# # A tibble: 2 × 2
# Date `Vinayak Trading`
# <chr> <dbl>
# 1 44680 NA
# 2 44681 NA
We can use rename(Date = 1) as an inline replacement for names(df)[1] <- "Date", it seems a bit more pipe-esque;
== NA doesn't return true/false, but %in% NA does; we can use is.na(Date) | Date == "Total", or we can use Date %in% c("Total", NA) with the results one would expect;
cumany is a "cumulative any", meaning that when a value returns true, then all subsequent values will be true as well, see cumany(c(F,T,F)); for its opposite, see cumall(c(T,F,T)); for a base-R equivalent, use cumsum(cond) > 0 for cumany and cumsum(!cond) == 0 for cumall; and
I use select(-any_of("Total")) since it will remove the column if it exists and do nothing otherwise (none of your sample data included it, so I thought it better to be safe).
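
To make the notes on cumany, cumall and NA comparison concrete, here is a small sketch. The vectors `cond`, `cond2` and `x` are made up for illustration; the base-R lines are the equivalents mentioned above.

```r
library(dplyr)

cond <- c(FALSE, TRUE, FALSE)

# cumany(): once a TRUE is seen, every later position is TRUE
any_dplyr <- cumany(cond)
any_base  <- cumsum(cond) > 0

# cumall(): TRUE only while every value so far has been TRUE
cond2 <- c(TRUE, FALSE, TRUE)
all_dplyr <- cumall(cond2)
all_base  <- cumsum(!cond2) == 0

# Comparing with == propagates NA, whereas %in% always gives TRUE/FALSE
x <- c("44680", NA, "Total")
eq_total <- x == "Total"
in_set   <- x %in% c("Total", NA)

any_dplyr  # FALSE  TRUE  TRUE
all_dplyr  #  TRUE FALSE FALSE
eq_total   # FALSE    NA  TRUE
in_set     # FALSE  TRUE  TRUE
```

Because `in_set` never contains NA, it can be passed straight to cumany inside filter without an extra is.na() check.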

Forming a new column from whichever of two columns isn’t NA [duplicate]

I have a simplified dataframe:
test <- data.frame(
x = c(1,2,3,NA,NA,NA),
y = c(NA, NA, NA, 3, 2, NA),
a = c(NA, NA, NA, NA, NA, TRUE)
)
I want to create a new column rating that has the value of the number in either column x or column y. The dataset is structured so that whenever there's a numeric value in x, there's an NA in y. If both columns are NA, then the value in rating should be NA.
In this case, the expected output is: 1,2,3,3,2,NA

With coalesce:
library(dplyr)
test %>%
mutate(rating = coalesce(x, y))
x y a rating
1 1 NA NA 1
2 2 NA NA 2
3 3 NA NA 3
4 NA 3 NA 3
5 NA 2 NA 2
6 NA NA TRUE NA

library(dplyr)
test %>%
mutate(rating = if_else(is.na(x),
y, x))
x y a rating
1 1 NA NA 1
2 2 NA NA 2
3 3 NA NA 3
4 NA 3 NA 3
5 NA 2 NA 2
6 NA NA TRUE NA

Here are several solutions.
# Input
test <- data.frame(
x = c(1,2,3,NA,NA,NA),
y = c(NA, NA, NA, 3, 2, NA),
a = c(NA, NA, NA, NA, NA, TRUE)
)
# Base R solution
test$rating <- ifelse(!is.na(test$x), test$x,
ifelse(!is.na(test$y), test$y, NA))
# dplyr solution
library(dplyr)
test <- test %>%
mutate(rating = case_when(!is.na(x) ~ x,
!is.na(y) ~ y,
TRUE ~ NA_real_))
# data.table solution
library(data.table)
setDT(test)
test[, rating := ifelse(!is.na(x), x, ifelse(!is.na(y), y, NA))]
Created on 2022-12-23 with reprex v2.0.2

test <- data.frame(
x = c(1,2,3,NA,NA,NA),
y = c(NA, NA, NA, 3, 2, NA),
a = c(NA, NA, NA, NA, NA, TRUE)
)
test$rating <- dplyr::coalesce(test$x, test$y)

Merge column if duplicates in rows between columns

I have a dataframe such as :
COL1 COL2 COL3 COL4 COL5 COL6 COL7
1 Sp1-2 Sp1-2 Sp3_2-54 Sp3-2 Sp3-2 Sp3-2 SP9-43
2 Sp5-1 Sp5-2 Sp2-4 Sp9-2 Sp10-3 SP9-90 NA
3 Sp_7-3 Sp_7-3 NA SP6-56 Sp2-7 SP3-3 NA
And I would simply like to merge columns when at least two elements are duplicated.
For example, in COL1 and COL2, Sp1-2 and Sp_7-3 are duplicated in both columns, so I merge the two columns, adding a pipe "|" between the non-duplicated elements:
COL1|COL2 COL3 COL4|COL5|COL6 COL7
1 Sp1-2 Sp3_2-54 Sp3-2 SP9-43
2 Sp5-1|Sp5-2 Sp2-4 Sp9-2|Sp10-3|SP9-90 NA
3 Sp_7-3 NA SP6-56|Sp2-7|SP3-3 NA
Here is the dput format :
structure(list(COL1 = c("Sp1-2", "Sp5-1", "Sp_7-3"), COL2 = c("Sp1-2",
"Sp5-2", "Sp_7-3"), COL3 = c("Sp3_2-54", "Sp2-4", NA), COL4 = c("Sp3-2",
"Sp9-2", "SP6-56"), COL5 = c("Sp3-2", "Sp10-3", "Sp2-7"), COL6 = c("Sp3-2",
"SP9-90", "SP3-3"), COL7 = c("SP9-43", NA, NA)), class = "data.frame", row.names = c(NA,
-3L))
Another example :
G136 G348 G465
1 NA NA NA
2 NA NA NA
3 SP4-140 SP4-140 NA
4 SP2-8 NA NA
5 SP3-59 NA NA
6 SP1_contig.682-8 NA SP1_contig.682-8
expected output:
G136|G348|G465
1 NA
2 NA
3 SP4-140
4 SP2-8
5 SP3-59
6 SP1_contig.682-8
The dput format:
dat<- structure(list(G136 = c(NA, NA, "SP4-140", "SP2-8", "SP3-59", "SP1_contig.682-8", NA, NA, NA), G348 = c(NA, NA, "SP4-140", NA, NA, NA, NA, NA, NA), G465 = c(NA, NA, NA, NA, NA, "SP1_contig.682-8", NA, NA, NA)), row.names = c(NA, -9L), class = c("tbl_df", "tbl", "data.frame"))

This is probably best handled by reshaping your data first; then it's straightforward to use various groupings to achieve your desired result:
library(tidyr)
library(dplyr)
library(tibble) # for rowid_to_column()
dat %>%
rowid_to_column() %>%
pivot_longer(-rowid) %>%
filter(!is.na(value)) %>%
group_by(rowid, value) %>%
mutate(new_name = paste(name, collapse = "|")) %>%
separate_rows(new_name, sep = "\\|") %>%
group_by(name) %>%
mutate(new_name = paste(unique(new_name), collapse = "|")) %>%
group_by(value) %>%
filter(nchar(new_name) == max(nchar(new_name))) %>%
ungroup() %>%
select(-name) %>%
pivot_wider(names_from = new_name, values_from = value, values_fn = ~ paste(unique(.x), collapse = "|")) %>%
complete(rowid = full_seq(c(1, rowid), 1))
# A tibble: 3 × 5
rowid `COL1|COL2` COL3 `COL4|COL5|COL6` COL7
<dbl> <chr> <chr> <chr> <chr>
1 1 Sp1-2 Sp3_2-54 Sp3-2 SP9-43
2 2 Sp5-1|Sp5-2 Sp2-4 Sp9-2|Sp10-3|SP9-90 NA
3 3 Sp_7-3 NA SP6-56|Sp2-7|SP3-3 NA
And using the data in your second example gives:
# A tibble: 6 × 2
rowid `G136|G348|G465`
<dbl> <chr>
1 1 NA
2 2 NA
3 3 SP4-140
4 4 SP2-8
5 5 SP3-59
6 6 SP1_contig.682-8

It's really messy...but you may try
library(igraph)
library(stringdist)
library(data.table)
table(df[1,])
d <- c()
for (i in 1:(ncol(df)-1)){
for (j in (i+1):ncol(df)) {
if(any(na.omit(stringdist(df[,i], df[,j], method = "lv") == 0))) {
d <- rbind(d, c(i,j))
}
}
}
dd <- data.table(d)
net <- graph_from_data_frame(d = dd, directed = F)
key <- split(names(V(net)), components(net)$membership)
res <- matrix(NA,nrow = nrow(df), ncol = 0)
names_dummy <- c()
df_dummy <- c()
for (i in key){
i <- as.numeric(i)
names_dummy <- c(names_dummy, paste0(colnames(df)[i], collapse = "|"))
df_dummy <- cbind(df_dummy, apply(df[,i], 1, function(x) {paste0(unique(unlist(x)), collapse = "|")}))
}
colnames(df_dummy) <- names_dummy
df_dummy
res <- cbind(df_dummy, df[,-as.numeric(unlist(key))])
res <- res[,sort(colnames(res))]
res
COL1|COL2 COL3 COL4|COL5|COL6 COL7
1 Sp1-2 Sp3_2-54 Sp3-2 SP9-43
2 Sp5-1|Sp5-2 Sp2-4 Sp9-2|Sp10-3|SP9-90 <NA>
3 Sp_7-3 <NA> SP6-56|Sp2-7|SP3-3 <NA>

How to find common and unique elements across multiple data-frames

Objective: to find the elements (rows, which are gene names) that are common across my different comparisons.
This was the answer which I tried to follow.
df1 = data.frame(genes = c('gene1', 'gene3', 'gene4', 'gene2'))
df2 = data.frame(genes = c('gene3', 'gene2', 'gene5', 'gene1', "genet"))
df3 = data.frame(genes = c('gene6', 'gene3', 'gene4', 'gdene7', 'genex', "gene10"))
dfList <- list(df1, df2, df3)
reduce(dfList, inner_join)
reduce(dfList, inner_join)
Joining, by = "genes"
Joining, by = "genes"
genes
1 gene3
This fails in this case
df1 = data.frame(genes = c('gene1', 'gene3', 'gene4', 'gene2'))
df2 = data.frame(genes = c('gene3', 'gene2', 'gene5', 'gene1', "genet"))
df3 = data.frame(genes = c('gene6', 'gene13', 'gene4', 'gdene7', 'genex', "gene10"))
dfList <- list(df1, df2, df3)
reduce(dfList, inner_join)
reduce(dfList, inner_join)
Joining, by = "genes"
Joining, by = "genes"
[1] genes
<0 rows> (or 0-length row.names)
How can I address this problem? I have shown a small example here; my real data has about 15 comparisons.
Expected output
gene3 df1 df2 df3 ## for common genes
gene1 df1 df2 ## for genes which are not present across all combinations
gene2
In the first case the solution works because gene3 is present in all three data frames, but it fails when a gene is present in only two of them.
So how do I find every combination of data frames in which each gene is present?
For example, gene3 is present in all three and is reported, but gene1 and gene2 are present in df1 and df2 and are not reported.
A gene present in every comparison is unlikely in my real data, so I would like to see, for each gene, all the comparisons in which it is present.
My actual data frames are stored in a list, named as follows:
names(result_abd)
[1] "M0_vs_M1_TCGA_stages" "M0_vs_M2_TCGA_stages" "M0_vs_M3_TCGA_stages" "M0_vs_M4_TCGA_stages" "M0_vs_M5_TCGA_stages" "M1_vs_M2_TCGA_stages"
[7] "M1_vs_M3_TCGA_stages" "M1_vs_M4_TCGA_stages" "M1_vs_M5_TCGA_stages" "M2_vs_M3_TCGA_stages" "M2_vs_M4_TCGA_stages" "M2_vs_M5_TCGA_stages"
[13] "M3_vs_M4_TCGA_stages" "M3_vs_M5_TCGA_stages" "M4_vs_M5_TCGA_stages"
So I would have 15 columns, one for each data frame.
I ran your code, and the output is as follows:
dput(head(a))
structure(list(gene = c("ENSG00000000003", "ENSG00000000971",
"ENSG00000002726", "ENSG00000003989", "ENSG00000005381", "ENSG00000006534"
), dfM0_vs_M1_TCGA_stages = c("M0_vs_M1_TCGA_stages", "M0_vs_M1_TCGA_stages",
"M0_vs_M1_TCGA_stages", "M0_vs_M1_TCGA_stages", "M0_vs_M1_TCGA_stages",
"M0_vs_M1_TCGA_stages"), dfM0_vs_M2_TCGA_stages = c(NA, "M0_vs_M2_TCGA_stages",
"M0_vs_M2_TCGA_stages", NA, "M0_vs_M2_TCGA_stages", NA), dfM0_vs_M3_TCGA_stages = c("M0_vs_M3_TCGA_stages",
"M0_vs_M3_TCGA_stages", "M0_vs_M3_TCGA_stages", NA, "M0_vs_M3_TCGA_stages",
NA), dfM0_vs_M4_TCGA_stages = c("M0_vs_M4_TCGA_stages", NA, "M0_vs_M4_TCGA_stages",
NA, "M0_vs_M4_TCGA_stages", "M0_vs_M4_TCGA_stages"), dfM0_vs_M5_TCGA_stages = c("M0_vs_M5_TCGA_stages",
NA, "M0_vs_M5_TCGA_stages", NA, "M0_vs_M5_TCGA_stages", "M0_vs_M5_TCGA_stages"
), dfM1_vs_M2_TCGA_stages = c(NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_), dfM1_vs_M3_TCGA_stages = c(NA,
NA, NA, NA, "M1_vs_M3_TCGA_stages", NA), dfM1_vs_M4_TCGA_stages = c(NA,
"M1_vs_M4_TCGA_stages", NA, NA, NA, NA), dfM1_vs_M5_TCGA_stages = c(NA,
NA, "M1_vs_M5_TCGA_stages", NA, NA, NA), dfM2_vs_M3_TCGA_stages = c(NA,
NA, NA, NA, "M2_vs_M3_TCGA_stages", NA), dfM2_vs_M4_TCGA_stages = c(NA,
"M2_vs_M4_TCGA_stages", NA, NA, NA, NA), dfM2_vs_M5_TCGA_stages = c(NA,
NA, "M2_vs_M5_TCGA_stages", NA, "M2_vs_M5_TCGA_stages", NA),
dfM3_vs_M4_TCGA_stages = c(NA, "M3_vs_M4_TCGA_stages", NA,
NA, "M3_vs_M4_TCGA_stages", NA), dfM3_vs_M5_TCGA_stages = c(NA,
"M3_vs_M5_TCGA_stages", NA, NA, "M3_vs_M5_TCGA_stages", NA
), dfM4_vs_M5_TCGA_stages = c(NA, NA, "M4_vs_M5_TCGA_stages",
NA, NA, NA)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
Dataframe format
A tibble: 6 × 16
gene dfM0_vs_M1_TCGA… dfM0_vs_M2_TCGA… dfM0_vs_M3_TCGA… dfM0_vs_M4_TCGA… dfM0_vs_M5_TCGA… dfM1_vs_M2_TCGA… dfM1_vs_M3_TCGA… dfM1_vs_M4_TCGA…
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 ENSG00000000003 M0_vs_M1_TCGA_s… NA M0_vs_M3_TCGA_s… M0_vs_M4_TCGA_s… M0_vs_M5_TCGA_s… NA NA NA
2 ENSG00000000971 M0_vs_M1_TCGA_s… M0_vs_M2_TCGA_s… M0_vs_M3_TCGA_s… NA NA NA NA M1_vs_M4_TCGA_s…
3 ENSG00000002726 M0_vs_M1_TCGA_s… M0_vs_M2_TCGA_s… M0_vs_M3_TCGA_s… M0_vs_M4_TCGA_s… M0_vs_M5_TCGA_s… NA NA NA
4 ENSG00000003989 M0_vs_M1_TCGA_s… NA NA NA NA NA NA NA
5 ENSG00000005381 M0_vs_M1_TCGA_s… M0_vs_M2_TCGA_s… M0_vs_M3_TCGA_s… M0_vs_M4_TCGA_s… M0_vs_M5_TCGA_s… NA M1_vs_M3_TCGA_s… NA
6 ENSG00000006534 M0_vs_M1_TCGA_s… NA NA M0_vs_M4_TCGA_s… M0_vs_M5_TCGA_s… NA NA NA
Now this is what I wanted. As a next step, take the gene ENSG00000000971 as an example:
it is present in 7 comparisons and reported as NA in the others. How do I group them?
For example, I would like to build another data frame that lists, for genes present in multiple comparisons, only the comparisons where they occur, leaving out the NAs.

It's not clear to me exactly how you want your output formatted (several index columns, one column containing a string, or a list column), but here's one option. I start by combining your list of data frames into a single data frame, with an index indicating the source.
library(tidyverse)
dfList <- list(df1, df2, df3)
dfList %>%
bind_rows(.id="df") %>%
pivot_wider(names_from=df, names_prefix="df", values_from=df)
# A tibble: 10 × 4
genes df1 df2 df3
<chr> <chr> <chr> <chr>
1 gene1 1 2 NA
2 gene3 1 2 3
3 gene4 1 NA 3
4 gene2 1 2 NA
5 gene5 NA 2 NA
6 genet NA 2 NA
7 gene6 NA NA 3
8 gdene7 NA NA 3
9 genex NA NA 3
10 gene10 NA NA 3
Addition in response to OP's question below. (Though note that's actually a new question and really should be a new post.)
dfList %>%
bind_rows(.id="df") %>%
group_by(genes) %>%
summarise(minDF=min(df), maxDF=max(df)) %>%
filter(minDF == maxDF & maxDF == 3) %>%
pull(genes)
[1] "gdene7" "gene10" "gene6" "genex"
Once again, the key is to put all the data into a single data frame. (And the desired format of the output is not clear.)
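
Following the same idea, here is a rough sketch of one possible output format for the follow-up question: one row per gene that occurs in more than one data frame, with a count and a string listing where it occurs. The column names `source`, `n_sources` and `sources` and the list names are my own choices, not something the OP specified.

```r
library(dplyr)

df1 <- data.frame(genes = c("gene1", "gene3", "gene4", "gene2"))
df2 <- data.frame(genes = c("gene3", "gene2", "gene5", "gene1", "genet"))
df3 <- data.frame(genes = c("gene6", "gene3", "gene4", "gdene7", "genex", "gene10"))

# The list names become the values of the .id column in bind_rows()
dfList <- list(df1 = df1, df2 = df2, df3 = df3)

membership <- dfList %>%
  bind_rows(.id = "source") %>%
  distinct(genes, source) %>%          # guard against repeated genes within a data frame
  group_by(genes) %>%
  summarise(
    n_sources = n(),                   # number of data frames containing the gene
    sources = paste(source, collapse = " "),
    .groups = "drop"
  ) %>%
  filter(n_sources > 1)                # keep genes shared by at least two data frames

membership
```

With a named list such as result_abd, the comparison names would appear in `sources` instead of df1, df2, df3, and NA never enters the picture because absent combinations simply have no row.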

remove NA values and combine non NA values into a single column

I have a data set which has numeric and NA values in all columns. I would like to create a new column with all non NA values and preserve the row names
v1 v2 v3 v4 v5
a 1 NA NA NA NA
b NA 2 NA NA NA
c NA NA 3 NA NA
d NA NA NA 4 NA
e NA NA NA NA 5
I have tried using the coalesce function from dplyr
digital_metrics_FB <- fb_all_data %>%
mutate(fb_metrics = coalesce("v1",
"v2",
"v3",
"v4",
"v5"))
and also tried an apply function
df2 <- sapply(fb_all_data,function(x) x[!is.na(x)])
still cannot get it to work.
I am looking for the final result to be where all non NA values come together in the final column and the row names are preserved
final
a 1
b 2
c 3
d 4
e 5
any help would be much appreciated

We can use pmax
do.call(pmax, c(fb_all_data , na.rm = TRUE))
If a row has more than one non-NA element and we want to combine them as a string, a simple base R option would be
data.frame(final = apply(fb_all_data, 1, function(x) toString(x[!is.na(x)])))
Or using coalesce
library(dplyr)
library(tibble)
fb_all_data %>%
rownames_to_column('rn') %>%
transmute(rn, final = coalesce(v1, v2, v3, v4, v5)) %>%
column_to_rownames('rn')
# final
#a 1
#b 2
#c 3
#d 4
#e 5
Or using tidyverse (pmap_chr comes from purrr), for multiple non-NA elements
library(purrr)
fb_all_data %>%
rownames_to_column('rn') %>%
transmute(rn, final = pmap_chr(.[-1], ~ c(...) %>%
na.omit %>%
toString)) %>%
column_to_rownames('rn')
NOTE: Here we are showing data that the OP showed as example and not some other dataset
data
fb_all_data <- structure(list(v1 = c(1L, NA, NA, NA, NA), v2 = c(NA, 2L, NA,
NA, NA), v3 = c(NA, NA, 3L, NA, NA), v4 = c(NA, NA, NA, 4L, NA
), v5 = c(NA, NA, NA, NA, 5L)), class = "data.frame",
row.names = c("a",
"b", "c", "d", "e"))

With tidyverse, you can do:
df %>%
rownames_to_column() %>%
gather(var, val, -1, na.rm = TRUE) %>%
group_by(rowname) %>%
summarise(val = paste(val, collapse = ", "))
rowname val
<chr> <chr>
1 a 1
2 b 2, 3
3 c 3
4 d 4
5 e 5
Sample data to have a row with more than one non-NA value:
df <- read.table(text = " v1 v2 v3 v4 v5
a 1 NA NA NA NA
b NA 2 3 NA NA
c NA NA 3 NA NA
d NA NA NA 4 NA
e NA NA NA NA 5", header = TRUE)
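
Since gather() is superseded in current tidyr, the same reshaping can probably be written with pivot_longer(). This is a sketch on the sample data above, assuming a tidyr version that provides pivot_longer() with values_drop_na; the result name `res` is mine.

```r
library(dplyr)
library(tidyr)
library(tibble)

df <- read.table(text = "   v1 v2 v3 v4 v5
a  1 NA NA NA NA
b NA  2  3 NA NA
c NA NA  3 NA NA
d NA NA NA  4 NA
e NA NA NA NA  5", header = TRUE)

res <- df %>%
  rownames_to_column() %>%                       # row names become column "rowname"
  pivot_longer(-rowname, names_to = "var", values_to = "val",
               values_drop_na = TRUE) %>%        # same role as na.rm = TRUE in gather()
  group_by(rowname) %>%
  summarise(val = paste(val, collapse = ", "))

res
```

Row b ends up as "2, 3" and the other rows keep their single value, matching the gather() version.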
