How to find common and unique elements across multiple data-frames

Objective: find the elements (rows, which are gene names) that are common across my different comparisons.
This is the answer I tried to follow:
library(dplyr)  # inner_join()
library(purrr)  # reduce()
df1 = data.frame(genes = c('gene1', 'gene3', 'gene4', 'gene2'))
df2 = data.frame(genes = c('gene3', 'gene2', 'gene5', 'gene1', "genet"))
df3 = data.frame(genes = c('gene6', 'gene3', 'gene4', 'gdene7', 'genex', "gene10"))
dfList <- list(df1, df2, df3)
reduce(dfList, inner_join)
Joining, by = "genes"
Joining, by = "genes"
genes
1 gene3
This fails in the following case, where no gene is present in all three data frames:
df1 = data.frame(genes = c('gene1', 'gene3', 'gene4', 'gene2'))
df2 = data.frame(genes = c('gene3', 'gene2', 'gene5', 'gene1', "genet"))
df3 = data.frame(genes = c('gene6', 'gene13', 'gene4', 'gdene7', 'genex', "gene10"))
dfList <- list(df1, df2, df3)
reduce(dfList, inner_join)
Joining, by = "genes"
Joining, by = "genes"
[1] genes
<0 rows> (or 0-length row.names)
How do I address this problem? This is a small example; I actually have about 15 comparisons.
Expected output:
gene3 df1 df2 df3 ## for genes common to all data frames
gene1 df1 df2 ## for genes that are not present in every data frame
gene2
In the first case the solution works because gene3 is present in all three data frames, but it fails when a gene is present in only two of them.
So how do I find, for each gene, every combination of data frames in which it is present?
For example, gene3 is present in all three, so it is reported, but gene1 and gene2 are present in df1 and df2 and are not reported.
It is unlikely that any group of genes is present in every condition, so I would like to see all the combinations in which each gene is present.
My actual data frames are stored in a list with these names:
names(result_abd)
[1] "M0_vs_M1_TCGA_stages" "M0_vs_M2_TCGA_stages" "M0_vs_M3_TCGA_stages" "M0_vs_M4_TCGA_stages" "M0_vs_M5_TCGA_stages" "M1_vs_M2_TCGA_stages"
[7] "M1_vs_M3_TCGA_stages" "M1_vs_M4_TCGA_stages" "M1_vs_M5_TCGA_stages" "M2_vs_M3_TCGA_stages" "M2_vs_M4_TCGA_stages" "M2_vs_M5_TCGA_stages"
[13] "M3_vs_M4_TCGA_stages" "M3_vs_M5_TCGA_stages" "M4_vs_M5_TCGA_stages"
So I would have 15 columns, one for each data frame.
I ran your code, and the output is as follows:
dput(head(a))
structure(list(gene = c("ENSG00000000003", "ENSG00000000971",
"ENSG00000002726", "ENSG00000003989", "ENSG00000005381", "ENSG00000006534"
), dfM0_vs_M1_TCGA_stages = c("M0_vs_M1_TCGA_stages", "M0_vs_M1_TCGA_stages",
"M0_vs_M1_TCGA_stages", "M0_vs_M1_TCGA_stages", "M0_vs_M1_TCGA_stages",
"M0_vs_M1_TCGA_stages"), dfM0_vs_M2_TCGA_stages = c(NA, "M0_vs_M2_TCGA_stages",
"M0_vs_M2_TCGA_stages", NA, "M0_vs_M2_TCGA_stages", NA), dfM0_vs_M3_TCGA_stages = c("M0_vs_M3_TCGA_stages",
"M0_vs_M3_TCGA_stages", "M0_vs_M3_TCGA_stages", NA, "M0_vs_M3_TCGA_stages",
NA), dfM0_vs_M4_TCGA_stages = c("M0_vs_M4_TCGA_stages", NA, "M0_vs_M4_TCGA_stages",
NA, "M0_vs_M4_TCGA_stages", "M0_vs_M4_TCGA_stages"), dfM0_vs_M5_TCGA_stages = c("M0_vs_M5_TCGA_stages",
NA, "M0_vs_M5_TCGA_stages", NA, "M0_vs_M5_TCGA_stages", "M0_vs_M5_TCGA_stages"
), dfM1_vs_M2_TCGA_stages = c(NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_), dfM1_vs_M3_TCGA_stages = c(NA,
NA, NA, NA, "M1_vs_M3_TCGA_stages", NA), dfM1_vs_M4_TCGA_stages = c(NA,
"M1_vs_M4_TCGA_stages", NA, NA, NA, NA), dfM1_vs_M5_TCGA_stages = c(NA,
NA, "M1_vs_M5_TCGA_stages", NA, NA, NA), dfM2_vs_M3_TCGA_stages = c(NA,
NA, NA, NA, "M2_vs_M3_TCGA_stages", NA), dfM2_vs_M4_TCGA_stages = c(NA,
"M2_vs_M4_TCGA_stages", NA, NA, NA, NA), dfM2_vs_M5_TCGA_stages = c(NA,
NA, "M2_vs_M5_TCGA_stages", NA, "M2_vs_M5_TCGA_stages", NA),
dfM3_vs_M4_TCGA_stages = c(NA, "M3_vs_M4_TCGA_stages", NA,
NA, "M3_vs_M4_TCGA_stages", NA), dfM3_vs_M5_TCGA_stages = c(NA,
"M3_vs_M5_TCGA_stages", NA, NA, "M3_vs_M5_TCGA_stages", NA
), dfM4_vs_M5_TCGA_stages = c(NA, NA, "M4_vs_M5_TCGA_stages",
NA, NA, NA)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
Dataframe format
A tibble: 6 × 16
gene dfM0_vs_M1_TCGA… dfM0_vs_M2_TCGA… dfM0_vs_M3_TCGA… dfM0_vs_M4_TCGA… dfM0_vs_M5_TCGA… dfM1_vs_M2_TCGA… dfM1_vs_M3_TCGA… dfM1_vs_M4_TCGA…
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 ENSG00000000003 M0_vs_M1_TCGA_s… NA M0_vs_M3_TCGA_s… M0_vs_M4_TCGA_s… M0_vs_M5_TCGA_s… NA NA NA
2 ENSG00000000971 M0_vs_M1_TCGA_s… M0_vs_M2_TCGA_s… M0_vs_M3_TCGA_s… NA NA NA NA M1_vs_M4_TCGA_s…
3 ENSG00000002726 M0_vs_M1_TCGA_s… M0_vs_M2_TCGA_s… M0_vs_M3_TCGA_s… M0_vs_M4_TCGA_s… M0_vs_M5_TCGA_s… NA NA NA
4 ENSG00000003989 M0_vs_M1_TCGA_s… NA NA NA NA NA NA NA
5 ENSG00000005381 M0_vs_M1_TCGA_s… M0_vs_M2_TCGA_s… M0_vs_M3_TCGA_s… M0_vs_M4_TCGA_s… M0_vs_M5_TCGA_s… NA M1_vs_M3_TCGA_s… NA
6 ENSG00000006534 M0_vs_M1_TCGA_s… NA NA M0_vs_M4_TCGA_s… M0_vs_M5_TCGA_s… NA NA NA
This is what I wanted. As a next step, here is an example of what I would like to see.
The gene ENSG00000000971 is present in 7 comparisons and is reported as NA in the others. How do I group genes like this?
For instance, I would like to make another data frame of the genes that are present in multiple comparisons, leaving out the comparisons where they are NA.

It's not clear to me exactly how you want your output formatted (several index columns, one column containing a string, or a list column). But here's one option. I start by combining your list of data frames into a single data frame, with an index indicating the source.
library(tidyverse)
dfList <- list(df1, df2, df3)
dfList %>%
bind_rows(.id="df") %>%
pivot_wider(names_from=df, names_prefix="df", values_from=df)
# A tibble: 10 × 4
genes df1 df2 df3
<chr> <chr> <chr> <chr>
1 gene1 1 2 NA
2 gene3 1 2 3
3 gene4 1 NA 3
4 gene2 1 2 NA
5 gene5 NA 2 NA
6 genet NA 2 NA
7 gene6 NA NA 3
8 gdene7 NA NA 3
9 genex NA NA 3
10 gene10 NA NA 3
Addition in response to OP's question below. (Though note that's actually a new question and really should be a new post.)
dfList %>%
bind_rows(.id="df") %>%
group_by(genes) %>%
summarise(minDF=min(df), maxDF=max(df)) %>%
filter(minDF == maxDF & maxDF == 3) %>%
pull(genes)
[1] "gdene7" "gene10" "gene6" "genex"
Once again, the key is to put all the data into a single data frame. (And the desired format of the output is not clear.)
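For the original goal of seeing which data frames each gene appears in, the same single-long-table idea can be sketched in base R. This is only an illustration under assumptions: the list names df1, df2 and df3 and the column names present_in and n_sources are made up for this example and are not part of the answer above.

```r
df1 <- data.frame(genes = c("gene1", "gene3", "gene4", "gene2"))
df2 <- data.frame(genes = c("gene3", "gene2", "gene5", "gene1", "genet"))
df3 <- data.frame(genes = c("gene6", "gene3", "gene4", "gdene7", "genex", "gene10"))
dfList <- list(df1 = df1, df2 = df2, df3 = df3)

# Stack everything into one long table with a column naming the source
long <- do.call(rbind, lapply(names(dfList), function(nm) {
  data.frame(genes = dfList[[nm]]$genes, source = nm)
}))

# One entry per gene: the sources it occurs in, and how many there are
sources <- tapply(long$source, long$genes,
                  function(s) paste(sort(unique(s)), collapse = ","))
counts <- tapply(long$source, long$genes, function(s) length(unique(s)))

# present_in and n_sources are hypothetical column names for this sketch
membership <- data.frame(
  genes = names(sources),
  present_in = as.vector(sources),
  n_sources = as.vector(counts)
)

# Genes found in at least two data frames
shared <- membership[membership$n_sources >= 2, ]
print(shared)
```

With a named list such as result_abd, that list would take the place of dfList, assuming each element has a column of gene identifiers (its name would replace genes here).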

Related

extract string from multiple columns in new column

I want to find a word in different columns and mutate it in a new column.
"data" is an example and "goal" is what I want. I tried a lot but I didn't get is work.
library(dplyr)
library(stringr)
data <- tibble(
component1 = c(NA, NA, "Word", NA, NA, "Word"),
component2 = c(NA, "Word", "different_word", NA, NA, "not_this")
)
goal <- tibble(
component1 = c(NA, NA, "Word", NA, NA, "Word"),
component2 = c(NA, "Word", "different_word", NA, NA, "not_this"),
component = c(NA, "Word", "Word", NA, NA, "Word")
)
not_working <- data %>%
mutate(component = across(starts_with("component"), ~ str_extract(.x, "Word")))

For your provided data structure we could use coalesce:
library(dplyr)
data %>%
mutate(component = coalesce(component1, component2))
component1 component2 component
<chr> <chr> <chr>
1 NA NA NA
2 NA Word Word
3 Word different_word Word
4 NA NA NA
5 NA NA NA
6 Word not_this Word

With if_any and str_detect:
library(dplyr)
library(stringr)
data %>%
mutate(component = ifelse(if_any(starts_with("component"), ~ str_detect(.x, "Word")), "Word", NA))
output
component1 component2 component
<chr> <chr> <chr>
1 NA NA NA
2 NA Word Word
3 Word different_word Word
4 NA NA NA
5 NA NA NA
6 Word not_this Word
If you want to stick with str_extract, this would be the way to go:
data %>%
mutate(across(starts_with("component"), ~ str_extract(.x, "Word"),
.names = "{.col}_extract")) %>%
mutate(component = coalesce(component1_extract, component2_extract),
.keep = "unused")
# A tibble: 6 × 3
component1 component2 component
<chr> <chr> <chr>
1 NA NA NA
2 NA Word Word
3 Word different_word Word
4 NA NA NA
5 NA NA NA
6 Word not_this Word
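
The same "is the word in any of these columns" check can also be sketched without dplyr. This is a base R illustration, not one of the answers above; it assumes the columns of interest all start with "component" and that a substring match on "Word" is what is wanted, as with str_detect.

```r
data <- data.frame(
  component1 = c(NA, NA, "Word", NA, NA, "Word"),
  component2 = c(NA, "Word", "different_word", NA, NA, "not_this")
)

# Columns to search (selected before the new column is added)
cols <- grep("^component", names(data), value = TRUE)

# Logical matrix: TRUE where a cell contains "Word"; grepl() gives FALSE for NA
hits <- sapply(data[cols], function(x) grepl("Word", x, fixed = TRUE))

# "Word" if any searched column matched in that row, otherwise NA
data$component <- ifelse(rowSums(hits) > 0, "Word", NA_character_)
print(data)
```

Unlike coalesce, this does not depend on the matching value being the first non-NA cell in the row.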

R cleaning data frame to identify which columns don't have corresponding output columns

I need to tidy my dataframe so that it has inputs (e.g. A.1, B.1, C.1, ...) on column V1 and outputs (abc and 123, def and 456, ...) on columns V2 and V3 (these two columns always come in pairs). This is what I would like to look like:
df_clean <- data.frame(V1 = c("A.1", "B.1", "C.1", "D.1", "E.1", "F.1", "C.1", "E.1", "G.1", "H.1"),
V2 = c("abc", "def", NA, "ghi", NA, NA, NA, NA, "jkl", "mno"),
V3 = c("123", "456", NA, "789", NA, NA, NA, NA, "101", "112"))
But right now it looks like below.
df <- data.frame(V1 = c("A.1", NA, "B.1", NA, "C.1", NA, NA, "E.1", NA, NA, NA, NA, NA, "H.1"),
V2 = c(NA, "abc", NA, "def", NA, "D.1", "ghi", NA, "F.1", "C.1", "E.1", "G.1", "jkl", "mno"),
V3 = c(NA, "123", NA, "456", NA, NA, "789", NA, NA, NA, NA, NA, "101", "112"))
To elaborate: the first four rows of df show that when consecutive inputs have outputs, the inputs are in the first column and the outputs are in the second and third columns. But if an input has no output (like C.1), the next input (D.1) is placed in the second column, and the following inputs continue to be placed there until an input with an output appears. For example, D.1 is followed by ghi and 789 in the second and third columns. Because D.1 had outputs, E.1 is placed back in the first column. But E.1 doesn't have an output, so F.1 is now in the second column. F.1 doesn't have an output either, so the next inputs (C.1, E.1 and G.1) are also in the second column. G.1 has an output, so jkl and 101 are in the second and third columns. Then H.1 goes back in the first column.
I've thought about combining V2 and V3, but this doesn't work if my input is in V2. I think I can identify the inputs, since they all end in the string .1, but I haven't gotten much further than this. Any suggestions on how to clean this data frame up?

Using dplyr/tidyr you could do something like this:
library(dplyr)
library(tidyr)
df |>
# Move the V2's to V1 where necessary
mutate(V1_old = V1,
V1 = if_else(is.na(V1_old) & is.na(V3), V2, V1_old),
V2 = if_else(is.na(V1_old) & is.na(V3), NA_character_, V2)) |>
select(-V1_old) |>
# Fill NA's, for V1, down
fill(V1, .direction = "down") |>
# Fill NA's within groups V1, for V2 + V3, up
group_by(V1) |>
fill(-V1, .direction = "up") |>
# Filter away only duplicates for columns without NA's
# If that's a mistake, use distinct() instead
filter(is.na(V2) | !duplicated(cur_data())) |>
ungroup()
Output:
# A tibble: 10 × 3
V1 V2 V3
<chr> <chr> <chr>
1 A.1 abc 123
2 B.1 def 456
3 C.1 NA NA
4 D.1 ghi 789
5 E.1 NA NA
6 F.1 NA NA
7 C.1 NA NA
8 E.1 NA NA
9 G.1 jkl 101
10 H.1 mno 112

Here is one more suggestion:
library(dplyr)
library(tidyr)
library(stringr)
df %>%
mutate(V1 = coalesce(V1, V2),
V1 = if_else(str_detect(V1, "\\w\\.\\d+"), V1, NA_character_),
V2 = coalesce(V2, V1),
V2 = if_else(is.na(V3), NA_character_, V2)) %>%
fill(V1, .direction = "down") %>%
group_by(x = cumsum(V1 != lag(V1, default = first(V1)))) %>%
arrange(desc(row_number()), .by_group = TRUE) %>%
slice(1) %>%
ungroup() %>%
select(1,2,3)
V1 V2 V3
<chr> <chr> <chr>
1 A.1 abc 123
2 B.1 def 456
3 C.1 NA NA
4 D.1 ghi 789
5 E.1 NA NA
6 F.1 NA NA
7 C.1 NA NA
8 E.1 NA NA
9 G.1 jkl 101
10 H.1 mno 112
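
For comparison, here is a base R sketch of the same cleanup. It rests on two assumptions that hold for the example but may not hold in general: an input is any value ending in a dot followed by digits, and every record starts at the row holding its input. The names misplaced, record and first_non_na are made up for this illustration.

```r
df <- data.frame(
  V1 = c("A.1", NA, "B.1", NA, "C.1", NA, NA, "E.1", NA, NA, NA, NA, NA, "H.1"),
  V2 = c(NA, "abc", NA, "def", NA, "D.1", "ghi", NA, "F.1", "C.1", "E.1", "G.1", "jkl", "mno"),
  V3 = c(NA, "123", NA, "456", NA, NA, "789", NA, NA, NA, NA, NA, "101", "112")
)

# Inputs that ended up in V2 (assumed pattern: a dot followed by digits at the end)
misplaced <- !is.na(df$V2) & grepl("\\.[0-9]+$", df$V2)

# Move misplaced inputs to V1 and blank them out of V2
v1 <- ifelse(misplaced, df$V2, df$V1)
v2 <- ifelse(misplaced, NA_character_, df$V2)

# Each input starts a new record; rows without an input belong to the previous one
record <- cumsum(!is.na(v1))

# First non-missing value in a vector, or NA if there is none
first_non_na <- function(x) x[!is.na(x)][1]

clean <- data.frame(
  V1 = as.vector(tapply(v1, record, first_non_na)),
  V2 = as.vector(tapply(v2, record, first_non_na)),
  V3 = as.vector(tapply(df$V3, record, first_non_na))
)
print(clean)
```

Numbering the records with cumsum, rather than grouping by the input name, keeps the repeated inputs (C.1 and E.1 each appear twice) as separate rows.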

Looping loading spreadsheets, making a new dataframe from specific cells, and then merging into one dataframe - R

Goals:
Read in all xlsm files from a folder (my working directory)
Pull specific cells from each file and make a new, clean dataframe OR pull the values into a vector
Combine these into one dataframe
I have a large number of excel "forms" that I need to import into r for analysis. Unfortunately, the form was not designed with this goal in mind, so when reading it into r, the dataframe is not in any shape for analysis. Here is an example dataframe:
library(tidyverse)
df_ex <- data.frame(Form.Title = c(NA, "Name:", "ID:", NA, NA, "Result 1:", "Result 2:", "Result 3:", NA, NA, NA),
X = c(NA, "a", 12345, NA, NA, 4, 7, 2, NA, "Count 1:", "Count 3:"),
Additional.Form.Title = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 9, 3),
X.1 = c(NA, "Title:", "Phone Number:", "email:", NA, NA, NA, NA, NA, NA, NA),
X.2 = c(NA, "x", "123-456-7890", "ex#x.com", NA, NA, NA, NA, NA, "Count2:", "Count4:"),
X.3 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 16, 12)
)
Which would look like:
Form.Title X Additional.Form.Title X.1 X.2 X.3
1 <NA> <NA> NA <NA> <NA> NA
2 Name: a NA Title: x NA
3 ID: 12345 NA Phone Number: 123-456-7890 NA
4 <NA> <NA> NA email: ex#x.com NA
5 <NA> <NA> NA <NA> <NA> NA
6 Result 1: 4 NA <NA> <NA> NA
7 Result 2: 7 NA <NA> <NA> NA
8 Result 3: 2 NA <NA> <NA> NA
9 <NA> <NA> NA <NA> <NA> NA
10 <NA> Count 1: 9 <NA> Count2: 16
11 <NA> Count 3: 3 <NA> Count4: 12
My goal is to take certain cells and move them into a proper dataframe. For example, "Name:" [2,1] would become a column name and "a" [2,2] would be a value in the row under that column. I would then want to loop this process for the rest of the forms and merge the rows into one dataframe.
I started with this, but it did not work
library(readxl)
library(tidyverse)
test <- list.files(pattern = "*xlsm") %>%
map(., ~ read_excel(.x, sheet = 1)) %>%
map(., ~ data.frame(Name = .x[2,2], Result1 = .x[6,2], Result2 = .x[7,2], Result3 = .x[8,2])) %>%
bind_rows()
Trying a different way, but stuck at the attempted loop. Any help with how to proceed or a better route is greatly appreciated.
library(readxl)
library(tidyverse)
#Read in all spreadsheets in working directory
list_xlsms <- list.files(pattern = "*.xlsm") %>%
map(., ~ read_excel(.x, sheet = 1))
#Make empty dataframe. I'll add rows later
df <- data.frame(matrix(ncol = 3, nrow = 0))
colnames(df) <- c('Name', 'Result1', 'Result2')
#Pull specific values from spreadsheets and create vector to add row to empty dataframe.
#Need to do this with all spreadsheets in list_xlsms
#Will then add all the vectors as rows to the empty dataframe
for (i in list_xlsms)
{
c(i$X[2], i$X[6], i$X[7])
}
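
One possible way to finish the loop idea is to write a small function that pulls the wanted cells out of one form, apply it to every form, and bind the results. The sketch below uses mock forms instead of real xlsm files; make_form and extract_form are hypothetical helpers, and the cell positions are taken from the example data frame above, so they would need checking against what read_excel actually returns.

```r
# Mock forms standing in for the spreadsheets read by read_excel()
make_form <- function(name, results) {
  data.frame(
    Form.Title = c(NA, "Name:", "ID:", NA, NA, "Result 1:", "Result 2:", "Result 3:"),
    X = c(NA, name, "12345", NA, NA, results)
  )
}
forms <- list(
  make_form("a", c("4", "7", "2")),
  make_form("b", c("1", "5", "9"))
)

# Pull fixed cells out of one form; [[row, col]] returns a bare value,
# whereas [row, col] on a tibble returns a 1 x 1 tibble
extract_form <- function(form) {
  data.frame(
    Name = form[[2, 2]],
    Result1 = as.numeric(form[[6, 2]]),
    Result2 = as.numeric(form[[7, 2]]),
    Result3 = as.numeric(form[[8, 2]])
  )
}

# One row per form, combined into a single data frame
combined <- do.call(rbind, lapply(forms, extract_form))
print(combined)
```

With real files, list_xlsms from the attempt above would take the place of forms.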

How to get all rows sharing same url into 1 row?

After unnesting, my data frame has multiple rows with NA values that could be summarized into one row per link. All of the data is text/character.
Example:
link feature-1 feature-2 feature-3
link_1 a. NA NA
link_1. NA NA b
link_1. NA. c NA
link2 NA. a NA
link_2 NA NA d
link_2 x NA NA

Assuming that you are only ever combining NA values and text, then I recommend the following:
library(dplyr)
# here is a mock dataset
df = data.frame(grp = c('a','a','a','b','b','b'),
value1 = c(NA,NA,'text','text',NA,NA),
value2 = c(NA,'txt',NA,NA,'txt',NA),
stringsAsFactors = FALSE)
df %>%
# convert NA values to empty text strings
mutate(value1 = ifelse(is.na(value1), "", value1),
value2 = ifelse(is.na(value2), "", value2)) %>%
# specify the groups
group_by(grp) %>%
# append all the text in each group into a single row
summarise(val1 = paste(value1, collapse = ""),
val2 = paste(value2, collapse = ""))
Based on this answer.
Looking at the data in your question, you might need to first standardize some values, because "link_1" vs "link_1." and "NA" vs "NA." will be treated as different values.
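A standardization step along those lines could look like the sketch below. The cleaning rules are assumptions read off the example in the question (trailing dots are noise, the text "NA" means missing, and "link2" should be "link_2"); clean_cell and first_non_na are hypothetical helper names.

```r
raw <- data.frame(
  link = c("link_1", "link_1.", "link_1.", "link2", "link_2", "link_2"),
  feature.1 = c("a.", NA, "NA.", "NA.", NA, "x"),
  feature.2 = c(NA, NA, "c", "a", NA, NA),
  feature.3 = c(NA, "b", NA, NA, "d", NA)
)

# Strip trailing dots, then turn the literal text "NA" into a real NA
clean_cell <- function(x) {
  x <- sub("\\.+$", "", x)
  x[which(x == "NA")] <- NA
  x
}
raw[] <- lapply(raw, clean_cell)

# Make sure every link has an underscore after "link"
raw$link <- sub("^link_?", "link_", raw$link)

# Collapse each group to its first non-missing value per column
first_non_na <- function(x) x[!is.na(x)][1]
collapsed <- do.call(rbind, lapply(split(raw, raw$link), function(g) {
  data.frame(link = g$link[1], lapply(g[-1], first_non_na))
}))
print(collapsed)
```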

You can use across to get first non-NA value by group in multiple columns.
library(dplyr)
df %>% group_by(link) %>% summarise(across(starts_with('feature'), ~na.omit(.)[1]))
# link feature.1 feature.2 feature.3
# <chr> <chr> <chr> <chr>
#1 link_1 a c b
#2 link_2 x a d
data
df <- structure(list(link = c("link_1", "link_1", "link_1", "link_2",
"link_2", "link_2"), feature.1 = c("a", NA, NA, NA, NA, "x"),
feature.2 = c(NA, NA, "c", "a", NA, NA), feature.3 = c(NA,
"b", NA, NA, "d", NA)), class = "data.frame", row.names = c(NA, -6L))

remove NA values and combine non NA values into a single column

I have a data set which has numeric and NA values in all columns. I would like to create a new column with all non NA values and preserve the row names
v1 v2 v3 v4 v5
a 1 NA NA NA NA
b NA 2 NA NA NA
c NA NA 3 NA NA
d NA NA NA 4 NA
e NA NA NA NA 5
I have tried using the coalesce function from dplyr
digital_metrics_FB <- fb_all_data %>%
mutate(fb_metrics = coalesce("v1",
"v2",
"v3",
"v4",
"v5"))
and also tried an apply function
df2 <- sapply(fb_all_data,function(x) x[!is.na(x)])
still cannot get it to work.
I am looking for the final result to be where all non NA values come together in the final column and the row names are preserved
final
a 1
b 2
c 3
d 4
e 5
any help would be much appreciated

We can use pmax
do.call(pmax, c(fb_all_data , na.rm = TRUE))
If there are more than one non-NA element and want to combine as a string, a simple base R option would be
data.frame(final = apply(fb_all_data, 1, function(x) toString(x[!is.na(x)])))
Or using coalesce
library(dplyr)
library(tibble)
fb_all_data %>%
rownames_to_column('rn') %>%
transmute(rn, final = coalesce(v1, v2, v3, v4, v5)) %>%
column_to_rownames('rn')
# final
#a 1
#b 2
#c 3
#d 4
#e 5
Or using tidyverse, for multiple non-NA elements
library(purrr)
fb_all_data %>%
rownames_to_column('rn') %>%
transmute(rn, final = pmap_chr(.[-1], ~ c(...) %>%
na.omit %>%
toString)) %>%
column_to_rownames('rn')
NOTE: Here we are showing data that the OP showed as example and not some other dataset
data
fb_all_data <- structure(list(v1 = c(1L, NA, NA, NA, NA), v2 = c(NA, 2L, NA,
NA, NA), v3 = c(NA, NA, 3L, NA, NA), v4 = c(NA, NA, NA, 4L, NA
), v5 = c(NA, NA, NA, NA, 5L)), class = "data.frame",
row.names = c("a",
"b", "c", "d", "e"))

With tidyverse, you can do:
library(tidyverse)
df %>%
rownames_to_column() %>%
gather(var, val, -1, na.rm = TRUE) %>%
group_by(rowname) %>%
summarise(val = paste(val, collapse = ", "))
rowname val
<chr> <chr>
1 a 1
2 b 2, 3
3 c 3
4 d 4
5 e 5
Sample data to have a row with more than one non-NA value:
df <- read.table(text = " v1 v2 v3 v4 v5
a 1 NA NA NA NA
b NA 2 3 NA NA
c NA NA 3 NA NA
d NA NA NA 4 NA
e NA NA NA NA 5", header = TRUE)
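
Before collapsing with pmax or coalesce, it may be worth checking how many non-NA values each row holds, since both keep a single value per row. The base R sketch below uses the sample data above; the names n_values and ambiguous are made up for this illustration.

```r
df <- read.table(text = "  v1 v2 v3 v4 v5
a  1 NA NA NA NA
b NA  2  3 NA NA
c NA NA  3 NA NA
d NA NA NA  4 NA
e NA NA NA NA  5", header = TRUE)

# Count the non-NA cells in each row before collapsing
n_values <- rowSums(!is.na(df))

# pmax() keeps the largest non-NA value in each row; row names are preserved
final <- data.frame(final = do.call(pmax, c(df, na.rm = TRUE)),
                    row.names = rownames(df))

# Rows that held more than one value, so pmax() dropped something
ambiguous <- names(n_values)[n_values > 1]
print(final)
print(ambiguous)
```

For rows flagged this way, one of the string-combining options above is probably the safer choice.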
