Collapse columns in a dataframe (R) - r

Basically, I have a dataframe, df
Beginning1 Protein2 Protein3 Protein4 Biomarker1
Pathway3 A G NA NA F
Pathway8 Z G NA NA E
Pathway9 A G Z H F
Pathway6 Y G Z H E
Pathway2 A G D NA F
Pathway5 Q G D NA E
Pathway1 A D K NA F
Pathway7 A B C D F
Pathway4 V B C D E
And I want to combine the dataframe so that those rows when are identical from "Protein2" to "Protein4" are condense, giving the following:
Beginning1 Protein2 Protein3 Protein4 Biomarker1
Pathway3 A,Z G NA NA F,E
Pathway9 A,Y G Z H F,E
Pathway2 A,Q G D NA F,E
Pathway1 A D K NA F
Pathway7 A,V B C D F,E
This is very similar to a question that I asked before (Consolidating duplicate rows in a dataframe), however the difference is that I am also consolidating the "Beginning1" row.
So far, I have tried:
library(dat.table)
dat<-data.table(df)
Total_collapse <- dat[, .(
Biomarker1 = paste0(Biomarker1, collapse = ", ")),
by = .(Beginning1, Protein1, Protein2, Protein3)]
Total_collapse <- dat[, .(
Beginning1 = paste0(Beginning1, collapse = ", ")),
by = .(Protein1, Protein2, Protein3)]
which gives the output:
Beginning1 Protein2 Protein3 Protein4 Biomarker1
Pathway3 G NA NA F,E
Pathway9 G Z H F,E
Pathway2 G D NA F,E
Pathway1 D K NA F
Pathway7 B C D F,E
Does anyone know how to fix this problem? I have also tried duplicating the solution from Collapse / concatenate / aggregate a column to a single comma separated string within each group, but have had no success.
I am sorry if it is a simple error- I am pretty new to R.

Here's a possible solution using dplyr
df %>% group_by_at(vars(Protein2:Protein4)) %>%
summarize_all(paste, collapse=",")

Using data.table you can use .SD to refer to all columns not specified in the by argument. Then we can use lapply to accomplish the paste() with collapse.
library(data.table)
dt <- read.table(text = "Beginning1 Protein2 Protein3 Biomarker1
A G NA NA F
Z G NA NA E
A G Z H F
Y G Z H E
A G D NA F
Q G D NA E
A D K NA F
A B C D F
V B C D E",header = T)
dt <- data.table(dt)
dt[,lapply(.SD, function(col) paste(col, collapse=", ")),
by=.(Protein2, Protein3, Protein4)]
Output
Protein2 Protein3 Protein4 Beginning1 Biomarker1
1: G NA NA A, Z F, E
2: G Z H A, Y F, E
3: G D NA A, Q F, E
4: D K NA A F
5: B C D A, V F, E

We can use aggregate from base R
r1 <- aggregate(cbind(Beginning1, Biomarker1)~., replace(df,is.na(df), "NA"), FUN = toString)
r1
# Protein2 Protein3 Protein4 Beginning1 Biomarker1
#1 B C D A, V F, E
#2 G Z H A, Y F, E
#3 G D NA A, Q F, E
#4 D K NA A F
#5 G NA NA A, Z F, E
r1[r1=="NA"] <- NA

Related

How to replace values of several columns based on/ another column in R within each row?

I am working on a data set (30000 x 500 ) where I need to replace some values in columns based on data from another column. The problem is that in each row, the reference values change. Here is an sub-example of the data set:
#Create a data frame
df <- data.frame(SNP = c("SNP1","SNP2","SNP3","SNP4","SNP5","SNP6","SNP7","SNP8","SNP9","SNP10"),
A_allele = c("C","G","C","G","C","C","A","T","G","C"),
B_allele = c("G","A","T","A","A","G","T","A","C","A"),
alleles = c("C/G","G/A","C/T","G/A","C/A","C/G","A/T","T/A","G/C","C/A"),
line_1 = sample(c("A","B"),10, replace = TRUE),
line_2 = sample(c("A","B"),10, replace = TRUE),
line_3 = sample(c("A","B"),10, replace = TRUE),
line_4 = sample(c("A","B"),10, replace = TRUE),
line_5 = sample(c("A","B"),10, replace = TRUE),
line_6 = sample(c("A","B"),10, replace = TRUE),
line_7 = sample(c("A","B"),10, replace = TRUE),
line_8 = sample(c("A","B"),10, replace = TRUE),
line_9 = sample(c("A","B"),10, replace = TRUE),
line_10 = sample(c("A","B"),10, replace = TRUE)
)
df
head(df)
SNP A_allele B_allele alleles line_1 line_2 line_3 line_4 line_5 line_6 line_7 line_8 line_9 line_10
1 SNP1 C G C/G B A B A B B B B B A
2 SNP2 G A G/A A B A A A B B A B A
3 SNP3 C T C/T B B A B B B A A A A
4 SNP4 G A G/A A B B A B A B B B A
5 SNP5 C A C/A B A B B B A B A B B
6 SNP6 C G C/G B A B A B A B B B B
7 SNP7 A T A/T B A A B A A B A B A
8 SNP8 T A T/A A B A B A A B B A B
9 SNP9 G C G/C B A B B B B A B A B
10 SNP10 C A C/A B B B B B A A A A A
For each row, A_allele and B_allele columns serve as reference values to change A or B values in the 10 lines. When there is an "A" value => use the value from column A_allele and when there is a "B" value => use the value from column_B.
In the example, this should be as following:
Row 1: Change lines with A to C / Change lines with B to G
Row 2: Change lines with A to G / Change lines with B to A
Row 3: Change lines with A to C / Change lines with B to T
Row 10: The same idea.
The output should look something like this:
SNP A_allele B_allele alleles line_1 line_2 line_3 line_4 line_5 line_6 line_7 line_8 line_9 line_10
1 SNP1 C G C/G G C G C G G G G G C
2 SNP2 G A G/A G A G G G A A G A G
3 SNP3 C T C/T T T C T T T C C C C
4 SNP4 G A G/A G A A G A G A A A G
5 SNP5 C A C/A A C A A A C A C A A
6 SNP6 C G C/G G C G C G C G G G G
7 SNP7 A T A/T T A A T A A T A T A
8 SNP8 T A T/A T A T A T T A A T A
9 SNP9 G C G/C C G C C C C G C G C
10 SNP10 C A C/A A A A A A C C C C C
As there are ~30000 rows, I would like an efficient code to run if it possible.
Any suggestions?
You can do
library(tidyverse)
df %>% mutate(across(starts_with("line"), ~ifelse(. == "A", str_sub(alleles, 1, 1), str_sub(alleles, 3, 3))))
#output with df generated with set.seed(2021)
SNP A_allele B_allele alleles line_1 line_2 line_3 line_4 line_5 line_6 line_7 line_8 line_9 line_10
1 SNP1 C G C/G C C G C C C G G C G
2 SNP2 G A G/A A A A A G G G G G G
3 SNP3 C T C/T T T C C T T T T T C
4 SNP4 G A G/A A G A A A G G A G A
5 SNP5 C A C/A C C C A C A A C C A
6 SNP6 C G C/G G C C C C C G C G G
7 SNP7 A T A/T T A T T T T T A T A
8 SNP8 T A T/A A T A T A A A T A T
9 SNP9 G C G/C C C C C C G G G C C
10 SNP10 C A C/A A C A C A C C C C A
As stated in comments, if column name do not follow a pattern, Option-1 you can store these in a vector say vars and use this inside across
set.seed(2021)
df <- data.frame(SNP = c("SNP1","SNP2","SNP3","SNP4","SNP5","SNP6","SNP7","SNP8","SNP9","SNP10"),
A_allele = c("C","G","C","G","C","C","A","T","G","C"),
B_allele = c("G","A","T","A","A","G","T","A","C","A"),
alleles = c("C/G","G/A","C/T","G/A","C/A","C/G","A/T","T/A","G/C","C/A"),
line_1 = sample(c("A","B"),10, replace = TRUE),
line_2 = sample(c("A","B"),10, replace = TRUE),
line_3 = sample(c("A","B"),10, replace = TRUE),
line_4 = sample(c("A","B"),10, replace = TRUE),
line_5 = sample(c("A","B"),10, replace = TRUE),
line_6 = sample(c("A","B"),10, replace = TRUE),
line_7 = sample(c("A","B"),10, replace = TRUE),
cat = sample(c("A","B"),10, replace = TRUE),
dog = sample(c("A","B"),10, replace = TRUE),
rabbit = sample(c("A","B"),10, replace = TRUE)
)
vars <- c("line_1", "line_2", "line_3", "line_4", "line_5", "line_6", "line_7", "cat", "dog", "rabbit")
df %>% mutate(across(.cols = vars, ~ifelse(. == "A", str_sub(alleles, 1, 1), str_sub(alleles, 3, 3))))
Note: Using an external vector in selections is ambiguous.
i Use `all_of(vars)` instead of `vars` to silence this message.
i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
This message is displayed once per session.
SNP A_allele B_allele alleles line_1 line_2 line_3 line_4 line_5 line_6 line_7 cat dog rabbit
1 SNP1 C G C/G C C G C C C G G C G
2 SNP2 G A G/A A A A A G G G G G G
3 SNP3 C T C/T T T C C T T T T T C
4 SNP4 G A G/A A G A A A G G A G A
5 SNP5 C A C/A C C C A C A A C C A
6 SNP6 C G C/G G C C C C C G C G G
7 SNP7 A T A/T T A T T T T T A T A
8 SNP8 T A T/A A T A T A A A T A T
9 SNP9 G C G/C C C C C C G G G C C
10 SNP10 C A C/A A C A C A C C C C A
Option-2 you may also column indexes directly
df %>% mutate(across(5:14, ~ifelse(. == "A", str_sub(alleles, 1, 1), str_sub(alleles, 3, 3))))
SNP A_allele B_allele alleles line_1 line_2 line_3 line_4 line_5 line_6 line_7 cat dog rabbit
1 SNP1 C G C/G C C G C C C G G C G
2 SNP2 G A G/A A A A A G G G G G G
3 SNP3 C T C/T T T C C T T T T T C
4 SNP4 G A G/A A G A A A G G A G A
5 SNP5 C A C/A C C C A C A A C C A
6 SNP6 C G C/G G C C C C C G C G G
7 SNP7 A T A/T T A T T T T T A T A
8 SNP8 T A T/A A T A T A A A T A T
9 SNP9 G C G/C C C C C C G G G C C
10 SNP10 C A C/A A C A C A C C C C A
You can use across in dplyr along with ifelse.
library(dplyr)
df %>% mutate(across(starts_with('line'), ~ifelse(. == 'A', A_allele, B_allele)))
Or lapply in base R :
cols <- grep('line', names(df))
df[cols] <- lapply(df[cols], function(x) ifelse(x == 'A', df$A_allele, df$B_allele))

Combine multiple columns into vector by row with dplyr

I am trying to combine multiple columns into a single cell for each row and then remove missing values.
Sample data:
df <- data.frame(a=c("a", "b", "c", "d"),
b=c(NA, "a", "b", "c"),
c=c("a", "b", "e", "g"))
Attempt:
df %>% rowwise() %>%
mutate(collapse=as.character(paste(a,b,c, collapse=",")),
collapse_nona=na.omit(collapse))
Output:
# A tibble: 4 x 5
a b c collapse collapse_nona
* <fct> <fct> <fct> <chr> <chr>
1 a NA a a NA a,b a b,c b e,d c… a NA a,b a b,c b e,d …
2 b a b a NA a,b a b,c b e,d c… a NA a,b a b,c b e,d …
3 c b e a NA a,b a b,c b e,d c… a NA a,b a b,c b e,d …
4 d c g a NA a,b a b,c b e,d c… a NA a,b a b,c b e,d …
1) I am not successfully creating cells with values for each row (the whole column appears in collapse).
2) Cells in the collapse column do not behave like a vector.
Desired output
a b c collapse collapse_nona
* <fct> <fct> <fct> <chr> <chr>
1 a NA a a NA a a a
2 b a b b a b b a b
3 c b e c b e c b e
4 d c g d c g d c g
Thank you
With unite, there is an option for na.rm and it is by default FALSE
library(tidyr)
library(dplyr)
df %>%
mutate_all(as.character) %>%
unite(collapse, a, b,c, remove = FALSE, sep=" ") %>%
unite(collapse_nona, a, b, c, remove = FALSE, sep=" ", na.rm = TRUE) %>%
select(names(df), everything())
# a b c collapse collapse_nona
#1 a <NA> a a NA a a a
#2 b a b b a b b a b
#3 c b e c b e c b e
#4 d c g d c g d c g
Or with paste and str_remove_all (from stringr) - Note that paste/str_c are vectorized, so there is no need to loop over each row with rowwise
df %>%
mutate(collapse = paste(a, b, c),
collapse_nona = str_remove_all(collapse, "\\sNA|NA\\s"))
# a b c collapse collapse_nona
#1 a <NA> a a NA a a a
#2 b a b b a b b a b
#3 c b e c b e c b e
#4 d c g d c g d c g
Another option is pmap to loop over each row, remove the NA elements with na.omit and then paste or str_c (from stringr)
library(dplyr)
library(stringr)
library(purrr)
df %>%
mutate_all(as.character) %>%
mutate(collapse_nona = pmap_chr(., ~ c(...) %>%
na.omit %>%
str_c(collapse=" ")))
# a b c collapse_nona
#1 a <NA> a a a
#2 b a b b a b
#3 c b e c b e
#4 d c g d c g
The think the core issue is that you don't want collapse, you want sep. Then rowwise calculation is unnecessary. Also, NA will get printed as character, so you cannot remove them with na.omit
df %>%
mutate(collapse = paste(a,b,c, sep = " "), collapse_nona = gsub("NA", "", collapse))
a b c collapse collapse_nona
1 a <NA> a a NA a a a
2 b a b b a b b a b
3 c b e c b e c b e
4 d c g d c g d c g
I think this does it. You could play around with the sep argument in str_c.
library(dplyr)
library(stringr)
df %>%
mutate(collapse = str_c(str_replace_na(a), str_replace_na(b), str_replace_na(c), sep = " "),
collapse_nona = str_c(str_replace_na(a, ""), str_replace_na(b, ""), str_replace_na(c,""), sep = " "))
a b c collapse collapse_nona
1 a <NA> a a NA a a a
2 b a b b a b b a b
3 c b e c b e c b e
4 d c g d c g d c g

I need to reshape my wide table into long format but keeping multiple fields for each record, for example:

src hop1 hop2 hop3 hop4 destination
A B C NA NA NA
P Q R S NA NA
H I J K L M
Now I want to melt this table to below format
src nn en
A B C
B C C
P Q S
Q R S
R S S
H I M
I J M
J K M
K L M
L M M
I have basic understanding of melt and reshape, but not able to visualize how i can leverage those functions for the desired output.
You essentially want to gather all the columns up, grouped by each row. Then look forward to the leading value, and to the last value. So something like this in dplyr:
dat %>%
mutate(row=row_number()) %>%
gather(key, src, -row) %>%
drop_na() %>%
group_by(row) %>%
mutate(nn=lead(src), en=last(src)) %>%
ungroup() %>%
filter(!is.na(nn)) %>%
arrange(row) %>%
select(src, nn, en)
## A tibble: 10 x 3
# src nn en
# <chr> <chr> <chr>
# 1 A B C
# 2 B C C
# 3 P Q S
# 4 Q R S
# 5 R S S
# 6 H I M
# 7 I J M
# 8 J K M
# 9 K L M
#10 L M M
Or like this in base R:
tmp <- na.omit(data.frame(row=seq_len(nrow(dat)), src=unlist(dat), stringsAsFactors=FALSE))
tmp$nn <- ave(tmp$src, tmp$row, FUN=function(x) c(tail(x,-1),NA) )
tmp$en <- ave(tmp$src, tmp$row, FUN=function(x) tail(x,1) )
tmp <- tmp[!is.na(tmp$nn),]
tmp[order(tmp$row), c("src","nn","en")]
# src nn en
#src1 A B C
#hop11 B C C
#src2 P Q S
#hop12 Q R S
#hop22 R S S
#src3 H I M
#hop13 I J M
#hop23 J K M
#hop33 K L M
#hop43 L M M
Where dat was:
dat <- read.table(text="src hop1 hop2 hop3 hop4 destination
A B C NA NA NA
P Q R S NA NA
H I J K L M", header=TRUE, stringsAsFactors=FALSE)
Or we can use apply from base R
out <- do.call(rbind, apply(dat, 1, function(x) {
x1 <- na.omit(x)
data.frame(src = x1[-length(x1)], nn = x1[-1], en = x1[length(x1)])
}))
row.names(out) <- NULL
out
# src nn en
#1 A B C
#2 B C C
#3 P Q S
#4 Q R S
#5 R S S
#6 H I M
#7 I J M
#8 J K M
#9 K L M
#10 L M M
Or another option in base R would be to use max.col to find the last non-NA element, use rep to replicate the last non-NA element in each row and create data.frame
ij <- cbind(seq_len(nrow(dat)), max.col(!is.na(dat), "last"))
v1 <- dat[ij]
i1 <- rowSums(!is.na(dat))
src <- na.omit(c(t(replace(dat, ij, NA))))
nn <- na.omit(c(t(dat)[-1,]))
data.frame(src ,nn, en = rep(v1, i1-1 ))
# src nn en
#1 A B C
#2 B C C
#3 P Q S
#4 Q R S
#5 R S S
#6 H I M
#7 I J M
#8 J K M
#9 K L M
#10 L M M

How to mutate columns whose column names differ by a suffix?

In a dataset like
data_frame(a=letters, a_1=letters, b=letters, b_1=letters)
I would like to concatenate the columns that share a similar "root", namely a with a_1 and b with b_1. The output should look like
# A tibble: 26 x 2
a b
<chr> <chr>
1 a a a a
2 b b b b
3 c c c c
4 d d d d
5 e e e e
6 f f f f
7 g g g g
8 h h h h
9 i i i i
10 j j j j
# ... with 16 more rows
If you're looking for a tidyverse approach, you can do it using tidyr::unite_:
library(tidyr)
# get a list column name groups
cols <- split(names(df), sub("_.*", "", names(df)))
# loop through list and unite columns
for(x in names(cols)) {
df <- unite_(df, x, cols[[x]], sep = " ")
}
Here is one way to go about it,
ind <- sub('_.*', '', names(df))
as.data.frame(sapply(unique(ind), function(i) do.call(paste, df[i == ind])))
# a b
#1 a a a a
#2 b b b b
#3 c c c c
#4 d d d d
#5 e e e e
#6 f f f f
#7 g g g g
#8 h h h h

How to assign text into a column based on another dataframe in r

I have a dataframe:
df = read.table(text="group X1 X2 X3 X4 X5 X6 X7
P1 H H H H H H H
P1 C D C D B C C
P1 D C B A C D H
P1 D C B A C D D
P1 C D C D B C D
P2 C D B D C D C
P2 H H H H H H H
P2 D C C A B C D
P3 C D C D B C C
P3 H H H H H H H
P3 C D C D B C C
P3 D C B A C D D", header=T, stringsAsFactors=F)
I have another dataframe:
df2 = read.table(text="Group col R S
P1 'X2 X4 X7' 'C A D' 'D D C'
P2 'X2 X3 X4 X6' 'C C A C' 'D B D D'
P3 'X3 X5 X6 X7' 'B C D D' 'C B C C'", header=T, stringsAsFactors=F)
I would like to add a column named "assign" to hold the assignment which is based on df2. For example, if df$group=="P1", then only concatenate columns in df shown in df2$col "P1" row, if all columns have the same letter "H", then assign "H" to the "assign" column; if match the string in df2$R column, assign "R"; if match the string in df2$S column, assign "S"; if not match any three cases as mentioned, then assign "U".
I have tested my script in the group "P1", but I don't know how to return the assigned value to the df and go through the loop. Appreciate any helps.
I expect the result as:
df = read.table(text="group 1 2 3 4 5 6 7 assign
P1 H H H H H H H H
P1 C D C D B C C S
P1 D C B A C D D U
P1 D C B A C D D R
P1 C D C D B C D U
P2 C D B D C D C S
P2 H H H H H H H H
P2 D C C A B C D R
P3 C D C D B C C S
P3 H H H H H H H H
P3 C D C D B C C S
P3 D C B A C D D R
", header=T, stringsAsFactors=F)
You can use data.table and solve your problem in three steps:
merge the data.tables
this is the key step, build a pattern to match later, the cool thing is that we can use a flexible number of .SDcols across the by groups in the data.table
build the assign variable
Here is the code:
# data
require(data.table)
dt = data.table(df)
dt2 = data.table(df2)
# add col_int, a list(!) of col indices, to dt2 for each Group
dt3 = dt2[, list(col_name = strsplit(col, ' ')[[1]]), by = Group]
dt3 = dt3[, col_idx := match(col_name, names(dt))]
dt3 = dt3[, list(col_idx = list(col_idx)), by = Group]
dt2 = merge(dt2, dt3, by = 'Group')
# solution
dt = merge(x = dt,
y = dt2,
by = 'Group')
idx_matching_table = names(dt)
# a: using strings
dt[,
j = pattern := {
.SD[, do.call('paste', c(.SD)), .SDcols = strsplit(col, ' ')[[1]]]
},
by = list(Group, col)]
# b: using indices
dt[,
j = pattern_2 := {
# .SD has less cols (compared to dt), therefore find out what the integer index of col_idx in .SD is:
col_idx_sd = match(idx_matching_table[col_idx[[1]]], names(.SD))
.SD[, do.call('paste', c(.SD)), .SDcols = col_idx_sd]
},
by = list(Group, col)]
dt[, identical(pattern, pattern_2)] # TRUE
dt[, assign := 'U']
dt[pattern %like% '[H ]+H', assign := 'H']
dt[pattern == R, assign := 'R']
dt[pattern == S, assign := 'S']
EDIT I replaced apply(.SD, 1, paste, collapse = ' ') with do.call('paste', c(.SD)) to avoid coercion to matrix.

Resources