Combine multiple columns into vector by row with dplyr - r

I am trying to combine multiple columns into a single cell for each row and then remove missing values.
Sample data:
df <- data.frame(a = c("a", "b", "c", "d"),
                 b = c(NA, "a", "b", "c"),
                 c = c("a", "b", "e", "g"))
Attempt:
df %>%
  rowwise() %>%
  mutate(collapse = as.character(paste(a, b, c, collapse = ",")),
         collapse_nona = na.omit(collapse))
Output:
# A tibble: 4 x 5
a b c collapse collapse_nona
* <fct> <fct> <fct> <chr> <chr>
1 a NA a a NA a,b a b,c b e,d c… a NA a,b a b,c b e,d …
2 b a b a NA a,b a b,c b e,d c… a NA a,b a b,c b e,d …
3 c b e a NA a,b a b,c b e,d c… a NA a,b a b,c b e,d …
4 d c g a NA a,b a b,c b e,d c… a NA a,b a b,c b e,d …
1) I am not successfully creating cells with values for each row (the whole column appears in collapse).
2) Cells in the collapse column do not behave like a vector.
Desired output
a b c collapse collapse_nona
* <fct> <fct> <fct> <chr> <chr>
1 a NA a a NA a a a
2 b a b b a b b a b
3 c b e c b e c b e
4 d c g d c g d c g
Thank you

With unite, there is an na.rm option, which is FALSE by default:
library(tidyr)
library(dplyr)
df %>%
  mutate_all(as.character) %>%
  unite(collapse, a, b, c, remove = FALSE, sep = " ") %>%
  unite(collapse_nona, a, b, c, remove = FALSE, sep = " ", na.rm = TRUE) %>%
  select(names(df), everything())
# a b c collapse collapse_nona
#1 a <NA> a a NA a a a
#2 b a b b a b b a b
#3 c b e c b e c b e
#4 d c g d c g d c g
Or with paste and str_remove_all (from stringr). Note that paste/str_c are vectorized, so there is no need to loop over each row with rowwise:
df %>%
  mutate(collapse = paste(a, b, c),
         collapse_nona = str_remove_all(collapse, "\\sNA|NA\\s"))
# a b c collapse collapse_nona
#1 a <NA> a a NA a a a
#2 b a b b a b b a b
#3 c b e c b e c b e
#4 d c g d c g d c g
Another option is pmap to loop over each row, remove the NA elements with na.omit, and then collapse with paste or str_c (from stringr):
library(dplyr)
library(stringr)
library(purrr)
df %>%
  mutate_all(as.character) %>%
  mutate(collapse_nona = pmap_chr(., ~ c(...) %>%
                                    na.omit %>%
                                    str_c(collapse = " ")))
# a b c collapse_nona
#1 a <NA> a a a
#2 b a b b a b
#3 c b e c b e
#4 d c g d c g

I think the core issue is that you don't want collapse, you want sep. Then the rowwise calculation is unnecessary. Also, paste turns NA into the character string "NA", so you cannot remove it with na.omit:
df %>%
  mutate(collapse = paste(a, b, c, sep = " "),
         collapse_nona = gsub("NA", "", collapse))
a b c collapse collapse_nona
1 a <NA> a a NA a a a
2 b a b b a b b a b
3 c b e c b e c b e
4 d c g d c g d c g
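For reference, a quick illustration of the distinction described above (not part of the original answer): sep joins its inputs element-wise, one result per row, while collapse flattens everything into a single string.
paste(c("a", "b"), c("x", "y"), sep = ",")
#[1] "a,x" "b,y"
paste(c("a", "b"), c("x", "y"), collapse = ",")
#[1] "a x,b y"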

I think this does it. You could play around with the sep argument in str_c.
library(dplyr)
library(stringr)
df %>%
  mutate(collapse = str_c(str_replace_na(a), str_replace_na(b), str_replace_na(c), sep = " "),
         collapse_nona = str_c(str_replace_na(a, ""), str_replace_na(b, ""), str_replace_na(c, ""), sep = " "))
a b c collapse collapse_nona
1 a <NA> a a NA a a a
2 b a b b a b b a b
3 c b e c b e c b e
4 d c g d c g d c g

Related

R. Create dataframe with conditional combinations of elements from vector

I have a vector with around 600 unique elements: A, B, C, D, E, F, G, H, I, etc. Using R, I would like to get a dataframe with 4 columns, where each row has all possible combinations of 4 elements under the following conditions:
"A" goes always in column 1.
Column 2 has B or C.
Columns 3 and 4 have pairs of the remaining elements (pair X, Y is considered equal to pair Y, X). I expect to get something like:
1 2 3 4
A B D E
A B F G
A B H I
A C D E
A C F G
A C H I
A possible solution using combn(), expand.grid() and tidyr::separate(), based on @akrun's comment.
library(magrittr)
library(tidyr)
vec_a <- LETTERS[1]
vec_b <- LETTERS[2:3]
vec_c <- LETTERS[4:26]
vec_d <- combn(vec_c, 2, FUN = paste, collapse = " ")
res <- expand.grid(vec_a, vec_b, vec_d) %>%
  tidyr::separate(Var3, c("Var3", "Var4"), " ")
head(res, 25)
#> Var1 Var2 Var3 Var4
#> 1 A B D E
#> 2 A C D E
#> 3 A B D F
#> 4 A C D F
#> 5 A B D G
#> 6 A C D G
#> 7 A B D H
#> 8 A C D H
#> 9 A B D I
#> 10 A C D I
#> 11 A B D J
#> 12 A C D J
#> 13 A B D K
#> 14 A C D K
#> 15 A B D L
#> 16 A C D L
#> 17 A B D M
#> 18 A C D M
#> 19 A B D N
#> 20 A C D N
#> 21 A B D O
#> 22 A C D O
#> 23 A B D P
#> 24 A C D P
#> 25 A B D Q
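Note that the rows follow expand.grid() order rather than being grouped by the second column as in the question's sketch. If that ordering matters, the result can be sorted afterwards, for example with base order():
res[order(res$Var2, res$Var3, res$Var4), ]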

Group values in rows according into similar columns

I have a column with multiple values inside it, like:
ColumnX1
A,D,C,B,F,E,G
F,A,B,E,G,C
C,D,G,F,A,T
I split the data with
Species_Data2 <- data.frame(str_split_fixed(Species_Data$Other.Anopheline.species, ",", 21))
But I got the values as below; I now have a dataframe like:
X1 X2 X3 X4 X5 X6 X7
A D C B F E G
F A B E G NA C
C D G F A T NA
I want to make a dataframe like:
X1 X2 X3 X4 X5 X6 X7 X8
A B C D E F G NA
A B C NA E F G NA
A NA C D NA F G T
and then I want to use the letters as the column names:
Colnames
'A' 'B' 'C' 'D' 'E' 'F' 'G' 'T'
A B C D E F G NA
A B C NA E F G NA
A NA C D NA F G T
I tried sorting, but it does not work that well and comes up with 0 values.
If I understand correctly, the OP wants to rearrange the data so that there is a separate column for each letter. If a letter is present in a row, then the letter appears in the appropriate column/row of the reshaped data. NA indicates that a letter is missing in a row. In addition, the letter columns should be arranged in alphabetical order.
1. dplyr/tidyr approach
If we start with the data.frame resulting from the OP's call to stringr::str_split_fixed(), we need to reshape the split data from wide to long format, remove empty entries, order the rows so that the columns appear in letter order, and reshape to wide format again. For reshaping, a row id is required. To achieve the desired output, pivot_wider() has to be called with the names_from = value parameter:
library(dplyr)
library(tidyr)
as.data.frame(stringr::str_split_fixed(DF$ColumnX1, ",", 21)) %>%
  mutate(rn = row_number()) %>%
  pivot_longer(-rn) %>%
  filter(value != "") %>%
  arrange(as.character(value)) %>%
  pivot_wider(rn, names_from = value)
rn A B C D E F G T
<int> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct>
1 1 A B C D E F G NA
2 2 A B C NA E F G NA
3 3 A NA C D NA F G T
2. data.table approach
If we start from the original, unsplit data, there is a much more concise variant which uses data.table's dcast() for reshaping:
library(data.table)
setDT(DF)[, stringr::str_split(ColumnX1, ","), by = 1:nrow(DF)][, dcast(.SD, nrow ~ V1)]
nrow A B C D E F G T
1: 1 A B C D E F G <NA>
2: 2 A B C <NA> E F G <NA>
3: 3 A <NA> C D <NA> F G T
If required, the additional row id column can be removed in both approaches.
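For example (a small addition to the answer; res_tidy and res_dt are placeholder names for the results of the two pipelines above):
res_tidy %>% select(-rn)   # tidyr/dplyr result: drop the helper id
res_dt[, nrow := NULL]     # data.table result: delete the helper column by reference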
Data
DF <- data.frame(ColumnX1 = c("A,D,C,B,F,E,G",
                              "F,A,B,E,G,C",
                              "C,D,G,F,A,T"))
EDIT: Duplicate values
In a comment, the OP has disclosed that the production dataset contains duplicate values.
In case of duplicate values, dcast() uses the length() function by default to aggregate the data.
With a modified dataset DF2 which contains duplicate values in rows 1 and 2, the original data.table approach returns:
library(data.table)
setDT(DF2)[, stringr::str_split(ColumnX1, ","), by = 1:nrow(DF2)][, dcast(.SD, nrow ~ V1)]
nrow A B C D E F G T
1: 1 1 1 2 1 1 1 1 0
2: 2 1 1 1 0 1 2 1 0
3: 3 1 0 1 1 0 1 1 1
Here, dcast() shows the count of each letter instead of the letter itself.
The expected behaviour can be restored by removing the duplicate values with unique() before reshaping:
setDT(DF2)[, stringr::str_split(ColumnX1, ","), by = 1:nrow(DF2)][
  , dcast(unique(.SD), nrow ~ V1)]
nrow A B C D E F G T
1: 1 A B C D E F G <NA>
2: 2 A B C <NA> E F G <NA>
3: 3 A <NA> C D <NA> F G T
Also the dplyr/tidyr approach needs to be modified by specifying an appropriate aggregation function in the call to pivot_wider():
library(dplyr)
library(tidyr)
as.data.frame(stringr::str_split_fixed(DF2$ColumnX1, ",", 21)) %>%
  mutate(rn = row_number()) %>%
  pivot_longer(-rn) %>%
  filter(value != "") %>%
  arrange(as.character(value)) %>%
  pivot_wider(rn, names_from = value, values_fn = list(value = unique))
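An alternative, not from the original answer, is to drop the duplicates before reshaping, e.g. with dplyr::distinct(), so that no aggregation function is needed in pivot_wider():
as.data.frame(stringr::str_split_fixed(DF2$ColumnX1, ",", 21)) %>%
  mutate(rn = row_number()) %>%
  pivot_longer(-rn) %>%
  filter(value != "") %>%
  distinct(rn, value) %>%              # keep one copy of each letter per row
  arrange(as.character(value)) %>%
  pivot_wider(rn, names_from = value)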
Data with duplicate values
DF2 <- data.frame(ColumnX1 = c("A,D,C,B,F,E,G,C",
                               "F,A,B,E,G,C,F",
                               "C,D,G,F,A,T"))

r create new data frame that matches in rows elements grouped by another column

I want to create a new data frame from the df below. In the new data frame (df2), each element of df$name is placed in the first column and matched in its row with the other elements of df$name that share the same df$group.
df <- data.frame(group = rep(letters[1:2], each = 3),
                 name = LETTERS[1:6])
> df
group name
1 a A
2 a B
3 a C
4 b D
5 b E
6 b F
In this example, "A", "B", and "C" in df$name belong to "a" in df$group, and I want to put them in the same row in a new data frame. The desired output looks like this:
> df2
V1 V2
1 A B
2 A C
3 B A
4 B C
5 C A
6 C B
7 D E
8 D F
9 E D
10 E F
11 F D
12 F E
We could do this in base R with merge
out <- setNames(subset(merge(df, df, by.x = 'group', by.y = 'group'),
                       name.x != name.y, select = -group), c("V1", "V2"))
row.names(out) <- NULL
out
# V1 V2
#1 A B
#2 A C
#3 B A
#4 B C
#5 C A
#6 C B
#7 D E
#8 D F
#9 E D
#10 E F
#11 F D
#12 F E
In my opinion it's a case of a self-join. Using dplyr, a solution can be:
library(dplyr)
inner_join(df, df, by = "group") %>%
  filter(name.x != name.y) %>%
  select(V1 = name.x, V2 = name.y)
# V1 V2
# 1 A B
# 2 A C
# 3 B A
# 4 B C
# 5 C A
# 6 C B
# 7 D E
# 8 D F
# 9 E D
# 10 E F
# 11 F D
# 12 F E
df <- data.frame(group = rep(letters[1:2], each = 3),
                 name = LETTERS[1:6])
library(tidyverse)
df %>%
  group_by(group) %>%                                        # for every group
  summarise(v = list(expand.grid(V1 = name, V2 = name))) %>% # create all combinations of names
  select(v) %>%                                              # keep only the combinations
  unnest(v) %>%                                              # unnest combinations
  filter(V1 != V2)                                           # exclude rows with same names
# # A tibble: 12 x 2
# V1 V2
# <fct> <fct>
# 1 B A
# 2 C A
# 3 A B
# 4 C B
# 5 A C
# 6 B C
# 7 E D
# 8 F D
# 9 D E
# 10 F E
# 11 D F
# 12 E F
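Note that the row order of this last result differs from the desired df2 (B A comes before A B). If that matters, appending arrange(V1, V2) to the pipeline restores it:
df %>%
  group_by(group) %>%
  summarise(v = list(expand.grid(V1 = name, V2 = name))) %>%
  select(v) %>%
  unnest(v) %>%
  filter(V1 != V2) %>%
  arrange(V1, V2)   # added step: sort to match df2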

How to mutate columns whose column names differ by a suffix?

In a dataset like
data_frame(a=letters, a_1=letters, b=letters, b_1=letters)
I would like to concatenate the columns that share a similar "root", namely a with a_1 and b with b_1. The output should look like
# A tibble: 26 x 2
a b
<chr> <chr>
1 a a a a
2 b b b b
3 c c c c
4 d d d d
5 e e e e
6 f f f f
7 g g g g
8 h h h h
9 i i i i
10 j j j j
# ... with 16 more rows
If you're looking for a tidyverse approach, you can do it using tidyr::unite_:
library(tidyr)
# get a list of column-name groups
cols <- split(names(df), sub("_.*", "", names(df)))
# loop through the list and unite the columns in each group
for (x in names(cols)) {
  df <- unite_(df, x, cols[[x]], sep = " ")
}
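Note that unite_() has since been deprecated. With current versions of tidyr the same loop can be written with unite(), unquoting the name of the new column and selecting the columns with all_of(); a sketch under that assumption:
library(tidyr)
cols <- split(names(df), sub("_.*", "", names(df)))
for (x in names(cols)) {
  df <- unite(df, !!x, all_of(cols[[x]]), sep = " ")
}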
Here is one way to go about it in base R:
ind <- sub('_.*', '', names(df))
as.data.frame(sapply(unique(ind), function(i) do.call(paste, df[i == ind])))
# a b
#1 a a a a
#2 b b b b
#3 c c c c
#4 d d d d
#5 e e e e
#6 f f f f
#7 g g g g
#8 h h h h

How to mutate a subset of columns with dplyr?

I have this tbl
data_frame(a_a = letters[1:10], a_b = letters[1:10], a = letters[1:10])
And I am trying to substitute every "d" in each column starting with a_ with the value "new value".
I thought the below code would do the job, but it doesn't:
data_frame(a_a = letters[1:10], a_b = letters[1:10], a = letters[1:10]) %>%
  mutate_each(vars(starts_with('a_'), funs(gsub('d', 'new value', .))))
Instead, it gives:
Error: is.fun_list(calls) is not TRUE
Guided by this similar question and considering dft as your input, you can try:
dft %>%
  dplyr::mutate_each(funs(replace(., . == "d", "nval")), matches("a_"))
which gives:
## A tibble: 10 × 3
# a_a a_b a
# <chr> <chr> <chr>
#1 a a a
#2 b b b
#3 c c c
#4 nval nval d
#5 e e e
#6 f f f
#7 g g g
#8 h h h
#9 i i i
#10 j j j
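For reference, mutate_each() has long been deprecated. With dplyr 1.0 or later the same substitution can be written with across() (a sketch, assuming the same dft as input):
dft %>%
  mutate(across(starts_with("a_"), ~ replace(., . == "d", "nval")))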
