How to mutate columns whose column names differ by a suffix? - r

In a dataset like
data_frame(a=letters, a_1=letters, b=letters, b_1=letters)
I would like to concatenate the columns that share a similar "root", namely a with a_1 and b with b_1. The output should look like
# A tibble: 26 x 2
   a     b
   <chr> <chr>
 1 a a   a a
 2 b b   b b
 3 c c   c c
 4 d d   d d
 5 e e   e e
 6 f f   f f
 7 g g   g g
 8 h h   h h
 9 i i   i i
10 j j   j j
# ... with 16 more rows

If you're looking for a tidyverse approach, you can do it using tidyr::unite:
library(tidyr)
# group the column names by their root (everything before the first underscore)
cols <- split(names(df), sub("_.*", "", names(df)))
# loop over the groups and unite each group's columns into one column named after the root
for (x in names(cols)) {
  df <- unite(df, !!x, all_of(cols[[x]]), sep = " ")
}

Here is one way to go about it in base R:
ind <- sub('_.*', '', names(df))
as.data.frame(sapply(unique(ind), function(i) do.call(paste, df[i == ind])))
# a b
#1 a a a a
#2 b b b b
#3 c c c c
#4 d d d d
#5 e e e e
#6 f f f f
#7 g g g g
#8 h h h h
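The grouping idea behind both answers can be checked on the question's data with a self-contained sketch. It uses base R only, and a plain data.frame stands in for the tibble (an assumption made so that no extra packages are needed). The names `root`, `groups`, and `out` are illustrative and do not come from the answers.

```r
# Sample data from the question, as a plain data.frame
# (assumption: a tibble would behave the same way here)
df <- data.frame(a = letters, a_1 = letters, b = letters, b_1 = letters,
                 stringsAsFactors = FALSE)

# Strip everything from the first underscore to get each column's root
root <- sub("_.*", "", names(df))

# Column names grouped by root: list(a = c("a", "a_1"), b = c("b", "b_1"))
groups <- split(names(df), root)

# Paste the columns of each group element-wise (paste's default separator is a space)
out <- as.data.frame(
  lapply(groups, function(cols) do.call(paste, unname(as.list(df[cols])))),
  stringsAsFactors = FALSE
)

head(out, 3)
```

Using lapply() rather than sapply() keeps one list element per root, so the result converts directly to a data frame instead of passing through a character matrix.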

Related

R. Create dataframe with conditional combinations of elements from vector

I have a vector with around 600 unique elements: A, B, C, D, E, F, G, H, I, etc. Using R, I would like to get a dataframe with 4 columns, where each row has all possible combinations of 4 elements under the following conditions:
"A" goes always in column 1.
Column 2 has B or C.
Columns 3 and 4 have pairs of the remaining elements (pair X, Y is considered equal to pair Y, X).
I expect to get something like:
1 2 3 4
A B D E
A B F G
A B H I
A C D E
A C F G
A C H I
A possible solution using combn(), expand.grid() and tidyr::separate, based on @akrun's comment:
library(magrittr)
library(tidyr)
vec_a <- LETTERS[1]
vec_b <- LETTERS[2:3]
vec_c <- LETTERS[4:26]
vec_d <- combn(vec_c, 2, FUN = paste, collapse = " ")
res <- expand.grid(vec_a, vec_b, vec_d) %>%
tidyr::separate(Var3, c("Var3","Var4"), " ")
head(res, 25)
#> Var1 Var2 Var3 Var4
#> 1 A B D E
#> 2 A C D E
#> 3 A B D F
#> 4 A C D F
#> 5 A B D G
#> 6 A C D G
#> 7 A B D H
#> 8 A C D H
#> 9 A B D I
#> 10 A C D I
#> 11 A B D J
#> 12 A C D J
#> 13 A B D K
#> 14 A C D K
#> 15 A B D L
#> 16 A C D L
#> 17 A B D M
#> 18 A C D M
#> 19 A B D N
#> 20 A C D N
#> 21 A B D O
#> 22 A C D O
#> 23 A B D P
#> 24 A C D P
#> 25 A B D Q
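The same idea can be sketched without tidyr by keeping the pairs as a two-column matrix instead of pasting and re-splitting them. This is a minimal sketch on a shortened alphabet; `pairs`, `idx`, and the column names V1 to V4 are illustrative names, not part of the answer.

```r
first  <- "A"
second <- c("B", "C")
rest   <- LETTERS[4:9]   # D..I, shortened for illustration

# One row per unordered pair: combn() never emits both (X, Y) and (Y, X)
pairs <- t(combn(rest, 2))

# Cross every pair (by row index) with every choice for column 2
idx <- expand.grid(pair = seq_len(nrow(pairs)), second = second,
                   stringsAsFactors = FALSE)

res <- data.frame(V1 = first,
                  V2 = idx$second,
                  V3 = pairs[idx$pair, 1],
                  V4 = pairs[idx$pair, 2],
                  stringsAsFactors = FALSE)

nrow(res)   # 2 * choose(6, 2) = 30
```

The result has 2 * choose(n, 2) rows, where n is the number of elements left for columns 3 and 4. For example, n = 597 gives 355,812 rows, which should still be manageable.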

Combine multiple columns into vector by row with dplyr

I am trying to combine multiple columns into a single cell for each row and then remove missing values.
Sample data:
df <- data.frame(a=c("a", "b", "c", "d"),
b=c(NA, "a", "b", "c"),
c=c("a", "b", "e", "g"))
Attempt:
df %>% rowwise() %>%
mutate(collapse=as.character(paste(a,b,c, collapse=",")),
collapse_nona=na.omit(collapse))
Output:
# A tibble: 4 x 5
a b c collapse collapse_nona
* <fct> <fct> <fct> <chr> <chr>
1 a NA a a NA a,b a b,c b e,d c… a NA a,b a b,c b e,d …
2 b a b a NA a,b a b,c b e,d c… a NA a,b a b,c b e,d …
3 c b e a NA a,b a b,c b e,d c… a NA a,b a b,c b e,d …
4 d c g a NA a,b a b,c b e,d c… a NA a,b a b,c b e,d …
1) I am not successfully creating cells with values for each row (the whole column appears in collapse).
2) Cells in the collapse column do not behave like a vector.
Desired output
a b c collapse collapse_nona
* <fct> <fct> <fct> <chr> <chr>
1 a NA a a NA a a a
2 b a b b a b b a b
3 c b e c b e c b e
4 d c g d c g d c g
Thank you
With unite, there is an na.rm option, which is FALSE by default:
library(tidyr)
library(dplyr)
df %>%
mutate_all(as.character) %>%
unite(collapse, a, b,c, remove = FALSE, sep=" ") %>%
unite(collapse_nona, a, b, c, remove = FALSE, sep=" ", na.rm = TRUE) %>%
select(names(df), everything())
# a b c collapse collapse_nona
#1 a <NA> a a NA a a a
#2 b a b b a b b a b
#3 c b e c b e c b e
#4 d c g d c g d c g
Or with paste and str_remove_all (from stringr). Note that paste/str_c are vectorized, so there is no need to loop over each row with rowwise:
library(stringr)
df %>%
mutate(collapse = paste(a, b, c),
collapse_nona = str_remove_all(collapse, "\\sNA|NA\\s"))
# a b c collapse collapse_nona
#1 a <NA> a a NA a a a
#2 b a b b a b b a b
#3 c b e c b e c b e
#4 d c g d c g d c g
Another option is pmap to loop over each row, remove the NA elements with na.omit and then paste or str_c (from stringr)
library(dplyr)
library(stringr)
library(purrr)
df %>%
mutate_all(as.character) %>%
mutate(collapse_nona = pmap_chr(., ~ c(...) %>%
na.omit %>%
str_c(collapse=" ")))
# a b c collapse_nona
#1 a <NA> a a a
#2 b a b b a b
#3 c b e c b e
#4 d c g d c g
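For comparison, the row-wise "drop the NAs, then paste" step can also be sketched in base R with apply(). This is an illustrative alternative, not part of the answer above, and it assumes the columns are character rather than factor. The name `cols` is hypothetical.

```r
df <- data.frame(a = c("a", "b", "c", "d"),
                 b = c(NA, "a", "b", "c"),
                 c = c("a", "b", "e", "g"),
                 stringsAsFactors = FALSE)

cols <- c("a", "b", "c")

# paste() is vectorized over rows; NA becomes the string "NA"
df$collapse <- paste(df$a, df$b, df$c)

# For each row, keep the non-missing values and join them with a space
df$collapse_nona <- apply(df[cols], 1, function(row) {
  paste(row[!is.na(row)], collapse = " ")
})

df
```

Here collapse is the right argument inside the function, because each call receives one row as a single vector that has to be joined into one string.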
I think the core issue is that you don't want collapse, you want sep. Then the rowwise calculation is unnecessary. Also, paste converts NA to the string "NA", so you cannot remove it afterwards with na.omit; strip it with a regular expression instead:
df %>%
mutate(collapse = paste(a,b,c, sep = " "), collapse_nona = trimws(gsub("\\bNA\\b\\s*", "", collapse)))
a b c collapse collapse_nona
1 a <NA> a a NA a a a
2 b a b b a b b a b
3 c b e c b e c b e
4 d c g d c g d c g
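The difference between sep and collapse can be seen on two short vectors. This is a minimal illustration with made-up inputs `x` and `y`.

```r
x <- c("a", "b")
y <- c("x", "y")

# sep joins the arguments element by element: one string per position
paste(x, y, sep = "-")
#> [1] "a-x" "b-y"

# collapse then joins those per-position results into a single string
paste(x, y, collapse = ",")
#> [1] "a x,b y"

# Missing values are converted to the literal string "NA"
paste("a", NA, "a")
#> [1] "a NA a"
```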
I think this does it. You could play around with the sep argument in str_c.
library(dplyr)
library(stringr)
df %>%
mutate(collapse = str_c(str_replace_na(a), str_replace_na(b), str_replace_na(c), sep = " "),
collapse_nona = str_c(str_replace_na(a, ""), str_replace_na(b, ""), str_replace_na(c,""), sep = " "))
a b c collapse collapse_nona
1 a <NA> a a NA a a a
2 b a b b a b b a b
3 c b e c b e c b e
4 d c g d c g d c g

r create new data frame that matches in rows elements grouped by another column

I want to create a new data frame from the df below. In the new data frame (df2), each element of df$name is placed in the first column and paired, one row per pair, with every other element of df$name that belongs to the same df$group.
df <- data.frame(group = rep(letters[1:2], each=3),
name = LETTERS[1:6])
> df
group name
1 a A
2 a B
3 a C
4 b D
5 b E
6 b F
In this example, "A", "B", and "C" in df$name belong to "a" in df$group, and I want to put them in the same row in a new data frame. The desired output looks like this:
> df2
V1 V2
1 A B
2 A C
3 B A
4 B C
5 C A
6 C B
7 D E
8 D F
9 E D
10 E F
11 F D
12 F E
We could do this in base R with merge
out <- setNames(subset(merge(df, df, by.x = 'group', by.y = 'group'),
name.x != name.y, select = -group), c("V1", "V2"))
row.names(out) <- NULL
out
# V1 V2
#1 A B
#2 A C
#3 B A
#4 B C
#5 C A
#6 C B
#7 D E
#8 D F
#9 E D
#10 E F
#11 F D
#12 F E
In my opinion this is a case for a self-join. With dplyr, one solution is:
library(dplyr)
inner_join(df, df, by="group") %>%
filter(name.x != name.y) %>%
select(V1 = name.x, V2 = name.y)
# V1 V2
# 1 A B
# 2 A C
# 3 B A
# 4 B C
# 5 C A
# 6 C B
# 7 D E
# 8 D F
# 9 E D
# 10 E F
# 11 F D
# 12 F E
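Another option builds all combinations within each group with expand.grid and then drops the rows where both names are equal:
df <- data.frame(group = rep(letters[1:2], each=3),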
name = LETTERS[1:6])
library(tidyverse)
df %>%
group_by(group) %>% # for every group
summarise(v = list(expand.grid(V1=name, V2=name))) %>% # create all combinations of names
select(v) %>% # keep only the combinations
unnest(v) %>% # unnest combinations
filter(V1 != V2) # exclude rows with same names
# # A tibble: 12 x 2
# V1 V2
# <fct> <fct>
# 1 B A
# 2 C A
# 3 A B
# 4 C B
# 5 A C
# 6 B C
# 7 E D
# 8 F D
# 9 D E
# 10 F E
# 11 D F
# 12 E F

Collapse columns in a dataframe (R)

Basically, I have a dataframe, df
Beginning1 Protein2 Protein3 Protein4 Biomarker1
Pathway3 A G NA NA F
Pathway8 Z G NA NA E
Pathway9 A G Z H F
Pathway6 Y G Z H E
Pathway2 A G D NA F
Pathway5 Q G D NA E
Pathway1 A D K NA F
Pathway7 A B C D F
Pathway4 V B C D E
And I want to collapse the dataframe so that rows that are identical from "Protein2" to "Protein4" are condensed into one, giving the following:
Beginning1 Protein2 Protein3 Protein4 Biomarker1
Pathway3 A,Z G NA NA F,E
Pathway9 A,Y G Z H F,E
Pathway2 A,Q G D NA F,E
Pathway1 A D K NA F
Pathway7 A,V B C D F,E
This is very similar to a question that I asked before (Consolidating duplicate rows in a dataframe); the difference is that I am also consolidating the "Beginning1" column.
So far, I have tried:
library(data.table)
dat<-data.table(df)
Total_collapse <- dat[, .(
Biomarker1 = paste0(Biomarker1, collapse = ", ")),
by = .(Beginning1, Protein1, Protein2, Protein3)]
Total_collapse <- dat[, .(
Beginning1 = paste0(Beginning1, collapse = ", ")),
by = .(Protein1, Protein2, Protein3)]
which gives the output:
Beginning1 Protein2 Protein3 Protein4 Biomarker1
Pathway3 G NA NA F,E
Pathway9 G Z H F,E
Pathway2 G D NA F,E
Pathway1 D K NA F
Pathway7 B C D F,E
Does anyone know how to fix this problem? I have also tried duplicating the solution from Collapse / concatenate / aggregate a column to a single comma separated string within each group, but have had no success.
I am sorry if it is a simple error- I am pretty new to R.
Here's a possible solution using dplyr:
library(dplyr)
df %>% group_by_at(vars(Protein2:Protein4)) %>%
summarize_all(paste, collapse=",")
Using data.table you can use .SD to refer to all columns not specified in the by argument. Then we can use lapply to accomplish the paste() with collapse.
library(data.table)
dt <- read.table(text = "Beginning1 Protein2 Protein3 Protein4 Biomarker1
A G NA NA F
Z G NA NA E
A G Z H F
Y G Z H E
A G D NA F
Q G D NA E
A D K NA F
A B C D F
V B C D E", header = TRUE)
dt <- data.table(dt)
dt[,lapply(.SD, function(col) paste(col, collapse=", ")),
by=.(Protein2, Protein3, Protein4)]
Output
Protein2 Protein3 Protein4 Beginning1 Biomarker1
1: G NA NA A, Z F, E
2: G Z H A, Y F, E
3: G D NA A, Q F, E
4: D K NA A F
5: B C D A, V F, E
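We can use aggregate from base R. The formula interface drops rows containing NA by default, so the NA values are replaced with the string "NA" before aggregating and restored afterwards: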
r1 <- aggregate(cbind(Beginning1, Biomarker1)~., replace(df,is.na(df), "NA"), FUN = toString)
r1
# Protein2 Protein3 Protein4 Beginning1 Biomarker1
#1 B C D A, V F, E
#2 G Z H A, Y F, E
#3 G D NA A, Q F, E
#4 D K NA A F
#5 G NA NA A, Z F, E
r1[r1=="NA"] <- NA
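The placeholder trick can be sketched with aggregate()'s data-frame interface as well. The documentation for aggregate() states that rows with missing values in any of the by variables are omitted, which is why NA is swapped for the string "NA" before grouping. The data below is retyped from the question without its row names, and `keys`, `vals`, and `by_keys` are illustrative names.

```r
df <- data.frame(
  Beginning1 = c("A", "Z", "A", "Y", "A", "Q", "A", "A", "V"),
  Protein2   = c("G", "G", "G", "G", "G", "G", "D", "B", "B"),
  Protein3   = c(NA, NA, "Z", "Z", "D", "D", "K", "C", "C"),
  Protein4   = c(NA, NA, "H", "H", NA, NA, NA, "D", "D"),
  Biomarker1 = c("F", "E", "F", "E", "F", "E", "F", "F", "E"),
  stringsAsFactors = FALSE
)

keys <- c("Protein2", "Protein3", "Protein4")   # grouping columns
vals <- c("Beginning1", "Biomarker1")           # columns to collapse

# Replace NA in the grouping columns with a placeholder string
by_keys <- df[keys]
by_keys[is.na(by_keys)] <- "NA"

# Collapse each value column to a comma-separated string per group
res <- aggregate(df[vals], by = by_keys, FUN = toString)

# Turn the placeholder back into a real NA
res[keys] <- lapply(res[keys], function(k) replace(k, k == "NA", NA))

res
```

This assumes that no grouping column contains a genuine value spelled "NA"; if one might, a different placeholder string would be safer.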

How to mutate a subset of columns with dplyr?

I have this tbl
data_frame(a_a = letters[1:10], a_b = letters[1:10], a = letters[1:10])
And I am trying to replace every d in each column whose name starts with a_ with the string new value.
I thought the below code would do the job, but it doesn't:
data_frame(a_a = letters[1:10], a_b = letters[1:10], a = letters[1:10]) %>%
mutate_each(vars(starts_with('a_'), funs(gsub('d', 'new value',.))))
instead it gives
Error: is.fun_list(calls) is not TRUE
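The error arises because a misplaced parenthesis puts funs(...) inside vars(...); mutate_each expects the functions first and the column selection after them. Following this similar question, and taking dft as your input, you can try: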
dft %>%
dplyr::mutate_each(funs(replace(., . == "d", "nval")), matches("a_"))
which gives:
## A tibble: 10 × 3
# a_a a_b a
# <chr> <chr> <chr>
#1 a a a
#2 b b b
#3 c c c
#4 nval nval d
#5 e e e
#6 f f f
#7 g g g
#8 h h h
#9 i i i
#10 j j j
