This question already has answers here:
Finding the index or unique values from a dataframe column
(4 answers)
Closed 2 years ago.
Consider the following:
library(dplyr)
df <- data.frame(group_1 = c("A", "B", "A", "C"),
group_2 = c("B", "C", "B", "A"))
> df
group_1 group_2
1 A B
2 B C
3 A B
4 C A
I would like to receive the following output, in pseudocode:
df %>%
group_by(group_1, group_2) %>%
summarize(rows = whichever_rows_contain_group_1_and_group_2, .groups = "keep")
group_1 group_2 rows
A B 1,3
B C 2
C A 4
I've tried playing around with rownames() with not much luck. What is the appropriate command with summarize() that I can use to get what I seek?
The value of rows, for each row, should be in ascending order.
Try working around row_number() to create a new variable and then use summarise() to obtain the desired variable using toString(). Here the code:
library(dplyr)
#Code
dfnew <- df %>%
mutate(id=row_number()) %>%
group_by(group_1,group_2) %>%
summarise(Var=toString(id))
Output:
# A tibble: 3 x 3
# Groups: group_1 [3]
group_1 group_2 Var
<fct> <fct> <chr>
1 A B 1, 3
2 B C 2
3 C A 4
Another option can be (Many thanks and all credit to #ThomasIsCoding):
#Code2
dfnew2 <- df %>%
mutate(id=row_number()) %>%
group_by(group_1,group_2) %>%
summarise_at("id",toString)
Same output:
# A tibble: 3 x 3
# Groups: group_1 [3]
group_1 group_2 id
<fct> <fct> <chr>
1 A B 1, 3
2 B C 2
3 C A 4
Try aggregate like below
aggregate(rows ~ ., cbind(df,rows = 1:nrow(df)),c)
which gives
group_1 group_2 rows
1 C A 4
2 A B 1, 3
3 B C 2
Here is a tidyverse way.
library(dplyr)
library(tibble)
df %>%
rowid_to_column() %>%
group_by(group_1, group_2) %>%
summarise(rows = paste0(rowid, collapse = ","))
# A tibble: 3 x 3
# Groups: group_1 [3]
group_1 group_2 rows
<chr> <chr> <chr>
1 A B 1,3
2 B C 2
3 C A 4
Using dplyr and paste0
> library(dplyr)
> df %>% mutate(rn = row_number()) %>% group_by(group_1, group_2) %>% summarise(rows = paste0(rn, collapse = ','))
`summarise()` regrouping output by 'group_1' (override with `.groups` argument)
# A tibble: 3 x 3
# Groups: group_1 [3]
group_1 group_2 rows
<chr> <chr> <chr>
1 A B 1,3
2 B C 2
3 C A 4
>
Related
I have a data frame containing a varying number of data points in the same column:
library(tidyverse)
df <- tribble(~id, ~data,
"A", "a;b;c",
"B", "e;f")
I want to obtain one row per data point, separating the content of column data and distributing it on rows. This code gives the expected result, but is clumsy:
df %>%
separate(data,
into = paste0("dat_",1:5),
sep = ";",
fill = "right") %>%
pivot_longer(starts_with("dat_"),
names_to = "data_number",
names_pattern = "dat_(\\d+)") %>%
filter(!is.na(value))
#> # A tibble: 5 x 3
#> id data_number value
#> <chr> <chr> <chr>
#> 1 A 1 a
#> 2 A 2 b
#> 3 A 3 c
#> 4 B 1 e
#> 5 B 2 f
Tidyverse solutions preferred.
Here is one way
library(dplyr)
library(tidyr)
library(data.table)
df %>%
separate_rows(data) %>%
mutate(data_number = rowid(id), .before = 2)
-output
# A tibble: 5 x 3
id data_number data
<chr> <int> <chr>
1 A 1 a
2 A 2 b
3 A 3 c
4 B 1 e
5 B 2 f
library(dplyr)
library(tidyr)
df %>%
separate_rows(data)
output:
# A tibble: 5 x 2
id data
<chr> <chr>
1 A a
2 A b
3 A c
4 B e
5 B f
Using str_split and unnest -
library(tidyverse)
df %>%
mutate(data = str_split(data, ';'),
data_number = map(data, seq_along)) %>%
unnest(c(data, data_number))
# id data data_number
# <chr> <chr> <int>
#1 A a 1
#2 A b 2
#3 A c 3
#4 B e 1
#5 B f 2
Dear dplyr/tidyverse companions, I am looking for a nice solution to the following problem. I only get my solutions in base R with a loop. How do you solve this cleanly in tidyverse?
I have a dataset called data, which has not useful column names and not useful values (integer).
data <- tibble(var1 = rep(c(1:3), 2),
var2 = rep(c(1:3), 2))
# A tibble: 6 x 2
var1 var2
<int> <int>
1 1 1
2 2 2
3 3 3
4 1 1
5 2 2
6 3 3
Additional I have a coding table, which has for every column a better name (var1 -> variable1) and a better value (1 -> "a")
coding <- tibble(variable = c(rep("var1", 3),rep("var2", 3)),
name = c(rep("variable1", 3),rep("variable2", 3)),
code = rep(c(1:3), 2),
value = rep(c("a", "b", "c"), 2))
# A tibble: 6 x 4
variable name code value
<chr> <chr> <int> <chr>
1 var1 variable1 1 a
2 var1 variable1 2 b
3 var1 variable1 3 c
4 var2 variable2 1 a
5 var2 variable2 2 b
6 var2 variable2 3 c
I'm looking for a result, which has transformed names of the columns and the real values as factors in the dataset, compare:
result <- tibble(variable1 = factor(rep(c("a", "b", "c"), 2)),
variable2 = factor(rep(c("a", "b", "c"), 2)))
# A tibble: 6 x 2
variable1 variable2
<fct> <fct>
1 a a
2 b b
3 c c
4 a a
5 b b
6 c c
Thank you for your commitment :) :) :) :)
library(dplyr)
library(tidyr)
data %>%
stack() %>%
left_join(coding, by = c(ind = "variable", values = "code")) %>%
group_by(name) %>%
mutate(j = row_number()) %>%
pivot_wider(id_cols = j, values_from = value) %>%
select(-j)
# # A tibble: 6 x 2
# variable1 variable2
# <chr> <chr>
# 1 a a
# 2 b b
# 3 c c
# 4 a a
# 5 b b
# 6 c c
A general solution for any number of columns -
create a row number column to identify each row
get data in long format
join it with coding for each value
keep only unique rows and get it back in wide format.
library(dplyr)
library(tidyr)
data %>%
mutate(row = row_number()) %>%
pivot_longer(cols = -row, values_to = 'code') %>%
left_join(coding, by = 'code') %>%
select(row, name = name.y, value) %>%
distinct() %>%
pivot_wider() %>%
select(-row)
# variable1 variable2
# <chr> <chr>
#1 a a
#2 b b
#3 c c
#4 a a
#5 b b
#6 c c
I could write a loop to do this, but I was wondering how this might be done in R with dplyr. I have a data frame with two columns. Column 1 is the group, Column 2 is the value. I would like a data frame that has every combination of two values from each group in two separate columns. For example:
input = data.frame(col1 = c(1,1,1,2,2), col2 = c("A","B","C","E","F"))
input
#> col1 col2
#> 1 1 A
#> 2 1 B
#> 3 1 C
#> 4 2 E
#> 5 2 F
and have it return
output = data.frame(col1 = c(1,1,1,2), col2 = c("A","B","C","E"), col3 = c("B","C","A","F"))
output
#> col1 col2 col3
#> 1 1 A B
#> 2 1 B C
#> 3 1 C A
#> 4 2 E F
I'd like to be able to include it within dplyr syntax:
input %>%
group_by(col1) %>%
???
I tried writing my own function that produces a data frame of combinations like what I need from a vector and sent it into the group_map function, but didn't have success:
combos = function(x, ...) {
x = t(combn(x, 2))
return(as.data.frame(x))
}
input %>%
group_by(col1) %>%
group_map(.f = combos)
Produced an error.
Any suggestions?
You can do :
library(dplyr)
data <- input %>%
group_by(col1) %>%
summarise(col2 = t(combn(col2, 2)))
cbind(data[1], data.frame(data$col2))
# col1 X1 X2
# <dbl> <chr> <chr>
#1 1 A B
#2 1 A C
#3 1 B C
#4 2 E F
input %>%
group_by(col1) %>%
nest(data=-col1) %>%
mutate(out= map(data, ~ t(combn(unlist(.x), 2)))) %>%
unnest(out) %>% select(-data)
# A tibble: 4 x 2
# Groups: col1 [2]
col1 out[,1] [,2]
<dbl> <chr> <chr>
1 1 A B
2 1 A C
3 1 B C
4 2 E F
Or :
combos = function(x, ...) {
return(tibble(col1=x[[1,1]],col2=t(combn(unlist(x[[2]], use.names=F), 2))))
}
input %>%
group_by(col1) %>%
group_map(.f = combos, .keep=T) %>% invoke(rbind,.) %>% tibble
# A tibble: 4 x 2
col1 col2[,1] [,2]
<dbl> <chr> <chr>
1 1 A B
2 1 A C
3 1 B C
4 2 E F
Thank you! In terms of parsimony, I like both the answer from Ben
input %>%
group_by(col1) %>%
do(data.frame(t(combn(.$col2, 2))))
and Ronak
data <- input %>%
group_by(col1) %>%
summarise(col2 = t(combn(col2, 2)))
cbind(data[1], data.frame(data$col2))
I have a data frame with grouped variable and I want to sum them by group. It's easy with dplyr.
library(dplyr)
library(magrittr)
data <- data.frame(group = c("a", "a", "b", "c", "c"), n1 = 1:5, n2 = 2:6)
data %>% group_by(group) %>%
summarise_all(sum)
# A tibble: 3 x 3
group n1 n2
<fctr> <int> <int>
1 a 3 5
2 b 3 4
3 c 9 11
But now I want a new column total with the sum of n1 and n2 by group. Like this:
# A tibble: 3 x 3
group n1 n2 ttl
<fctr> <int> <int> <int>
1 a 3 5 8
2 b 3 4 7
3 c 9 11 20
How can I do that with dplyr?
EDIT:
Actually, it's just an example, I have a lot of variables.
I tried these two codes but it's not in the right dimension...
data %>% group_by(group) %>%
summarise_all(sum) %>%
summarise_if(is.numeric, sum)
data %>% group_by(group) %>%
summarise_all(sum) %>%
mutate_if(is.numeric, .funs = sum)
You can use mutate after summarize:
data %>%
group_by(group) %>%
summarise_all(sum) %>%
mutate(tt1 = n1 + n2)
# A tibble: 3 x 4
# group n1 n2 tt1
# <fctr> <int> <int> <int>
#1 a 3 5 8
#2 b 3 4 7
#3 c 9 11 20
If need to sum all numeric columns, you can use rowSums with select_if (to select numeric columns) to sum columns up:
data %>%
group_by(group) %>%
summarise_all(sum) %>%
mutate(tt1 = rowSums(select_if(., is.numeric)))
# A tibble: 3 x 4
# group n1 n2 tt1
# <fctr> <int> <int> <dbl>
#1 a 3 5 8
#2 b 3 4 7
#3 c 9 11 20
We can use apply together with the dplyr functions.
data <- data.frame(group = c("a", "a", "b", "c", "c"), n1 = 1:5, n2 = 2:6)
data %>% group_by(group) %>%
summarise_all(sum) %>%
mutate(ttl = apply(.[, 2:ncol(.)], 1, sum))
# A tibble: 3 × 4
group n1 n2 ttl
<fctr> <int> <int> <int>
1 a 3 5 8
2 b 3 4 7
3 c 9 11 20
Or rowSums with the same strategy. The key is to use . to specify the data frame and [] with x:ncol(.) to keep the columns you want.
data %>% group_by(group) %>%
summarise_all(sum) %>%
mutate(ttl = rowSums(.[, 2:ncol(.)]))
# A tibble: 3 × 4
group n1 n2 ttl
<fctr> <int> <int> <dbl>
1 a 3 5 8
2 b 3 4 7
3 c 9 11 20
Base R
cbind(aggregate(.~group, data, sum), ttl = sapply(split(data[,-1], data$group), sum))
# group n1 n2 ttl
#a a 3 5 8
#b b 3 4 7
#c c 9 11 20
We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(data)), grouped by 'group', get the sum of each columns in the Subset of data.table, and then with Reduce, get the sum of the rows of the columns of interest
library(data.table)
setDT(data)[, lapply(.SD, sum) , group][, tt1 := Reduce(`+`, .SD),
.SDcols = names(data)[-1]][]
# group n1 n2 tt1
#1: a 3 5 8
#2: b 3 4 7
#3: c 9 11 20
Or with base R
addmargins(as.matrix(rowsum(data[-1], data$group)), 2)
# n1 n2 Sum
#a 3 5 8
#b 3 4 7
#c 9 11 20
Or with dplyr
data %>%
group_by(group) %>%
summarise_all(sum) %>%
mutate(tt = rowSums(.[-1]))
I have the following tibble:
library(tidyverse)
df <- tibble::tribble(
~gene, ~colB, ~colC,
"a", 1, 2,
"b", 2, 3,
"c", 3, 4,
"d", 1, 1
)
df
#> # A tibble: 4 x 3
#> gene colB colC
#> <chr> <dbl> <dbl>
#> 1 a 1 2
#> 2 b 2 3
#> 3 c 3 4
#> 4 d 1 1
What I want to do is to filter every columns after gene column
for values greater or equal 2 (>=2). Resulting in this:
gene, colB, colC
a NA 2
b 2 3
c 3 4
How can I achieve that?
The number of columns after genes actually is more than just 2.
One solution: convert from wide to long format, so you can filter on just one column, then convert back to wide at the end if required. Note that this will drop genes where no values meet the condition.
library(tidyverse)
df %>%
gather(name, value, -gene) %>%
filter(value >= 2) %>%
spread(name, value)
# A tibble: 3 x 3
gene colB colC
* <chr> <dbl> <dbl>
1 a NA 2
2 b 2 3
3 c 3 4
The forthcoming dplyr 0.6 (install from GitHub now, if you like) has filter_at, which can be used to filter to any rows that have a value greater than or equal to 2, and then na_if can be applied similarly through mutate_at, so
df %>%
filter_at(vars(-gene), any_vars(. >= 2)) %>%
mutate_at(vars(-gene), funs(na_if(., . < 2)))
#> # A tibble: 3 x 3
#> gene colB colC
#> <chr> <dbl> <dbl>
#> 1 a NA 2
#> 2 b 2 3
#> 3 c 3 4
or similarly,
df %>%
mutate_at(vars(-gene), funs(na_if(., . < 2))) %>%
filter_at(vars(-gene), any_vars(!is.na(.)))
which can be translated for use with dplyr 0.5:
df %>%
mutate_at(vars(-gene), funs(na_if(., . < 2))) %>%
filter(rowSums(is.na(.)) < (ncol(.) - 1))
All return the same thing.
We can use data.table
library(data.table)
setDT(df)[df[, Reduce(`|`, lapply(.SD, `>=`, 2)), .SDcols = colB:colC]
][, (2:3) := lapply(.SD, function(x) replace(x, x < 2, NA)), .SDcols = colB:colC][]
# gene colB colC
#1: a NA 2
#2: b 2 3
#3: c 3 4
Or with melt/dcast
dcast(melt(setDT(df), id.var = 'gene')[value>=2], gene ~variable)
# gene colB colC
#1: a NA 2
#2: b 2 3
#3: c 3 4
Alternatively we could also try the below code
df %>% rowwise %>%
filter(any(c_across(starts_with('col'))>=2)) %>%
mutate(across(starts_with('col'), ~ifelse(!(.>=2), NA, .)))
Created on 2023-02-05 with reprex v2.0.2
# A tibble: 3 × 3
# Rowwise:
gene colB colC
<chr> <dbl> <dbl>
1 a NA 2
2 b 2 3
3 c 3 4