A beautiful solution to decode a table with dplyr and mutate - R

Dear dplyr/tidyverse companions, I am looking for a clean solution to the following problem. So far I can only solve it in base R with a loop. How would you solve this cleanly in the tidyverse?
I have a dataset called data, which has uninformative column names and uninformative integer values.
data <- tibble(var1 = rep(c(1:3), 2),
var2 = rep(c(1:3), 2))
# A tibble: 6 x 2
var1 var2
<int> <int>
1 1 1
2 2 2
3 3 3
4 1 1
5 2 2
6 3 3
Additionally, I have a coding table that gives every column a better name (var1 -> variable1) and every value a better label (1 -> "a"):
coding <- tibble(variable = c(rep("var1", 3),rep("var2", 3)),
name = c(rep("variable1", 3),rep("variable2", 3)),
code = rep(c(1:3), 2),
value = rep(c("a", "b", "c"), 2))
# A tibble: 6 x 4
variable name code value
<chr> <chr> <int> <chr>
1 var1 variable1 1 a
2 var1 variable1 2 b
3 var1 variable1 3 c
4 var2 variable2 1 a
5 var2 variable2 2 b
6 var2 variable2 3 c
I'm looking for a result with the renamed columns and the decoded values as factors, like this:
result <- tibble(variable1 = factor(rep(c("a", "b", "c"), 2)),
variable2 = factor(rep(c("a", "b", "c"), 2)))
# A tibble: 6 x 2
variable1 variable2
<fct> <fct>
1 a a
2 b b
3 c c
4 a a
5 b b
6 c c
Thank you for your commitment :) :) :) :)

library(dplyr)
library(tidyr)
data %>%
stack() %>%
left_join(coding, by = c(ind = "variable", values = "code")) %>%
group_by(name) %>%
mutate(j = row_number()) %>%
pivot_wider(id_cols = j, values_from = value) %>%
select(-j)
# # A tibble: 6 x 2
# variable1 variable2
# <chr> <chr>
# 1 a a
# 2 b b
# 3 c c
# 4 a a
# 5 b b
# 6 c c

A general solution for any number of columns -
create a row number column to identify each row
get the data in long format, keeping the original column name in variable
join it with coding on both variable and code
keep the row, the new name and the value, and get it back in wide format.
library(dplyr)
library(tidyr)
data %>%
  mutate(row = row_number()) %>%
  pivot_longer(cols = -row, names_to = 'variable', values_to = 'code') %>%
  # match on the column name as well as the code, so each column uses its own coding
  left_join(coding, by = c('variable', 'code')) %>%
  select(row, name, value) %>%
  pivot_wider(names_from = name, values_from = value) %>%
  select(-row)
# variable1 variable2
# <chr> <chr>
#1 a a
#2 b b
#3 c c
#4 a a
#5 b b
#6 c c
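Both answers above return character columns, while the question asks for factors. Here is a minimal sketch of one way to get factors and the new names without reshaping, assuming every column listed in coding exists in data. The names lookup, new_names and decoded are made up for this illustration.

```r
library(dplyr)

data <- tibble(var1 = rep(1:3, 2), var2 = rep(1:3, 2))
coding <- tibble(
  variable = rep(c("var1", "var2"), each = 3),
  name     = rep(c("variable1", "variable2"), each = 3),
  code     = rep(1:3, 2),
  value    = rep(c("a", "b", "c"), 2)
)

# Named vector for rename(): names are the new names, values the old ones.
lookup <- distinct(coding, variable, name)
new_names <- setNames(lookup$variable, lookup$name)

decoded <- data %>%
  # Recode each column with the coding rows that belong to that column.
  mutate(across(all_of(lookup$variable), function(x) {
    key <- coding[coding$variable == cur_column(), ]
    factor(x, levels = key$code, labels = key$value)
  })) %>%
  rename(all_of(new_names))

decoded
```

Because the factor levels come from coding, a label that never occurs in data is still kept as a level.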

Related

Pivot longer with multiple data points in a single column

I have a data frame containing a varying number of data points in the same column:
library(tidyverse)
df <- tribble(~id, ~data,
"A", "a;b;c",
"B", "e;f")
I want to obtain one row per data point, separating the content of column data and distributing it on rows. This code gives the expected result, but is clumsy:
df %>%
separate(data,
into = paste0("dat_",1:5),
sep = ";",
fill = "right") %>%
pivot_longer(starts_with("dat_"),
names_to = "data_number",
names_pattern = "dat_(\\d+)") %>%
filter(!is.na(value))
#> # A tibble: 5 x 3
#> id data_number value
#> <chr> <chr> <chr>
#> 1 A 1 a
#> 2 A 2 b
#> 3 A 3 c
#> 4 B 1 e
#> 5 B 2 f
Tidyverse solutions preferred.
Here is one way
library(dplyr)
library(tidyr)
library(data.table)
df %>%
separate_rows(data) %>%
mutate(data_number = rowid(id), .before = 2)
-output
# A tibble: 5 x 3
id data_number data
<chr> <int> <chr>
1 A 1 a
2 A 2 b
3 A 3 c
4 B 1 e
5 B 2 f
library(dplyr)
library(tidyr)
df %>%
separate_rows(data)
output:
# A tibble: 5 x 2
id data
<chr> <chr>
1 A a
2 A b
3 A c
4 B e
5 B f
Using str_split and unnest -
library(tidyverse)
df %>%
mutate(data = str_split(data, ';'),
data_number = map(data, seq_along)) %>%
unnest(c(data, data_number))
# id data data_number
# <chr> <chr> <int>
#1 A a 1
#2 A b 2
#3 A c 3
#4 B e 1
#5 B f 2
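If you would rather not load data.table just for rowid(), the numbering can probably be done with a grouped row_number() instead. A small sketch along those lines, with the separator given explicitly; the name long is only for illustration.

```r
library(dplyr)
library(tidyr)

df <- tribble(~id, ~data,
              "A", "a;b;c",
              "B", "e;f")

long <- df %>%
  # One row per data point; split only on the semicolon.
  separate_rows(data, sep = ";") %>%
  # Number the data points within each id.
  group_by(id) %>%
  mutate(data_number = row_number()) %>%
  ungroup()

long
```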

Add levels missing in one group to summary table using dplyr

When summarizing data, some groups may have observations not present in another group. In the example below, group 2 has no males. How can I in a tidy way, insert these observations in a summary table?
data example:
a <- data.frame(gender=factor(c("m", "m", "m", "f", "f", "f", "f")), group=c(1,1,1,1,1,2,2))
gender group
1 m 1
2 m 1
3 m 1
4 f 1
5 f 1
6 f 2
7 f 2
data summary:
a %>% group_by(gender, group) %>% summarise(n=n())
gender group n
<fct> <dbl> <int>
1 f 1 2
2 f 2 2
3 m 1 3
Desired output:
gender group n
<fct> <dbl> <int>
1 f 1 2
2 f 2 2
3 m 1 3
4 m 2 0
At the end, we can use complete
library(dplyr)
library(tidyr)
a %>%
group_by(gender, group) %>%
summarise(n=n(), .groups = 'drop') %>%
complete(gender, group, fill = list(n = 0))
-output
# A tibble: 4 x 3
# gender group n
# <fct> <dbl> <dbl>
#1 f 1 2
#2 f 2 2
#3 m 1 3
#4 m 2 0
Another option is to reshape to wide and then back to long format
a %>%
pivot_wider(names_from = group, values_from = group,
values_fn = length, values_fill = 0) %>%
pivot_longer(cols = -gender, names_to = 'group', values_to = 'n')
It is easier in base R
as.data.frame(table(a))
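One detail about the complete() answer above: n comes back as a double because the fill value 0 is a double. If an integer count matters, a sketch like the following should keep the type, using count() as shorthand for group_by() plus summarise(); the name counts is only for illustration.

```r
library(dplyr)
library(tidyr)

a <- data.frame(gender = factor(c("m", "m", "m", "f", "f", "f", "f")),
                group  = c(1, 1, 1, 1, 1, 2, 2))

counts <- a %>%
  count(gender, group) %>%
  # 0L is an integer, so n stays an integer column after filling.
  complete(gender, group, fill = list(n = 0L))

counts
```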

How to get all combinations of 2 from a grouped column in a data frame

I could write a loop to do this, but I was wondering how this might be done in R with dplyr. I have a data frame with two columns. Column 1 is the group, Column 2 is the value. I would like a data frame that has every combination of two values from each group in two separate columns. For example:
input = data.frame(col1 = c(1,1,1,2,2), col2 = c("A","B","C","E","F"))
input
#> col1 col2
#> 1 1 A
#> 2 1 B
#> 3 1 C
#> 4 2 E
#> 5 2 F
and have it return
output = data.frame(col1 = c(1,1,1,2), col2 = c("A","B","C","E"), col3 = c("B","C","A","F"))
output
#> col1 col2 col3
#> 1 1 A B
#> 2 1 B C
#> 3 1 C A
#> 4 2 E F
I'd like to be able to include it within dplyr syntax:
input %>%
group_by(col1) %>%
???
I tried writing my own function that produces a data frame of combinations like what I need from a vector and sent it into the group_map function, but didn't have success:
combos = function(x, ...) {
x = t(combn(x, 2))
return(as.data.frame(x))
}
input %>%
group_by(col1) %>%
group_map(.f = combos)
Produced an error.
Any suggestions?
You can do :
library(dplyr)
data <- input %>%
group_by(col1) %>%
summarise(col2 = t(combn(col2, 2)))
cbind(data[1], data.frame(data$col2))
# col1 X1 X2
# <dbl> <chr> <chr>
#1 1 A B
#2 1 A C
#3 1 B C
#4 2 E F
input %>%
group_by(col1) %>%
nest(data=-col1) %>%
mutate(out= map(data, ~ t(combn(unlist(.x), 2)))) %>%
unnest(out) %>% select(-data)
# A tibble: 4 x 2
# Groups: col1 [2]
col1 out[,1] [,2]
<dbl> <chr> <chr>
1 1 A B
2 1 A C
3 1 B C
4 2 E F
Or :
combos = function(x, ...) {
return(tibble(col1=x[[1,1]],col2=t(combn(unlist(x[[2]], use.names=F), 2))))
}
input %>%
group_by(col1) %>%
group_map(.f = combos, .keep=T) %>% invoke(rbind,.) %>% tibble
# A tibble: 4 x 2
col1 col2[,1] [,2]
<dbl> <chr> <chr>
1 1 A B
2 1 A C
3 1 B C
4 2 E F
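As for why the attempt in the question fails: group_map() hands the function each group's rows as a data frame rather than a vector, so combn(x, 2) most likely sees a single column and stops with "n < m". A minimal sketch of a fix that pulls out the column first and uses group_modify() so the result stays a data frame; it assumes R >= 4.0 (strings are not converted to factors) and that every group has at least two values.

```r
library(dplyr)

input <- data.frame(col1 = c(1, 1, 1, 2, 2),
                    col2 = c("A", "B", "C", "E", "F"))

# .x is the group's data frame; take the column before calling combn().
combos <- function(x, ...) {
  as.data.frame(t(combn(x$col2, 2)))
}

pairs <- input %>%
  group_by(col1) %>%
  group_modify(combos) %>%
  ungroup()

pairs
```

The pair columns get the default names V1 and V2 from as.data.frame().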
Thank you! In terms of parsimony, I like both the answer from Ben
input %>%
group_by(col1) %>%
do(data.frame(t(combn(.$col2, 2))))
and Ronak
data <- input %>%
group_by(col1) %>%
summarise(col2 = t(combn(col2, 2)))
cbind(data[1], data.frame(data$col2))
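In newer dplyr (1.1.0 and later), returning several rows per group from summarise() is deprecated in favour of reframe(). A sketch of the same idea with reframe(), returning two ordinary columns instead of a matrix column; the names first, second and pairs are made up here, and every group is assumed to have at least two values.

```r
library(dplyr)

input <- data.frame(col1 = c(1, 1, 1, 2, 2),
                    col2 = c("A", "B", "C", "E", "F"))

pairs <- input %>%
  reframe({
    m <- t(combn(col2, 2))               # one row per pair
    tibble(first = m[, 1], second = m[, 2])
  }, .by = col1)

pairs
```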

Extract which rows groups appear using dplyr [duplicate]

Consider the following:
library(dplyr)
df <- data.frame(group_1 = c("A", "B", "A", "C"),
group_2 = c("B", "C", "B", "A"))
> df
group_1 group_2
1 A B
2 B C
3 A B
4 C A
I would like to receive the following output, in pseudocode:
df %>%
group_by(group_1, group_2) %>%
summarize(rows = whichever_rows_contain_group_1_and_group_2, .groups = "keep")
group_1 group_2 rows
A B 1,3
B C 2
C A 4
I've tried playing around with rownames() with not much luck. What is the appropriate command with summarize() that I can use to get what I seek?
The value of rows, for each row, should be in ascending order.
Try using row_number() to create a new variable and then use summarise() with toString() to collapse the row numbers within each group. Here is the code:
library(dplyr)
#Code
dfnew <- df %>%
mutate(id=row_number()) %>%
group_by(group_1,group_2) %>%
summarise(Var=toString(id))
Output:
# A tibble: 3 x 3
# Groups: group_1 [3]
group_1 group_2 Var
<fct> <fct> <chr>
1 A B 1, 3
2 B C 2
3 C A 4
Another option can be (Many thanks and all credit to @ThomasIsCoding):
#Code2
dfnew2 <- df %>%
mutate(id=row_number()) %>%
group_by(group_1,group_2) %>%
summarise_at("id",toString)
Same output:
# A tibble: 3 x 3
# Groups: group_1 [3]
group_1 group_2 id
<fct> <fct> <chr>
1 A B 1, 3
2 B C 2
3 C A 4
Try aggregate like below
aggregate(rows ~ ., cbind(df,rows = 1:nrow(df)),c)
which gives
group_1 group_2 rows
1 C A 4
2 A B 1, 3
3 B C 2
Here is a tidyverse way.
library(dplyr)
library(tibble)
df %>%
rowid_to_column() %>%
group_by(group_1, group_2) %>%
summarise(rows = paste0(rowid, collapse = ","))
# A tibble: 3 x 3
# Groups: group_1 [3]
group_1 group_2 rows
<chr> <chr> <chr>
1 A B 1,3
2 B C 2
3 C A 4
Using dplyr and paste0
> library(dplyr)
> df %>% mutate(rn = row_number()) %>% group_by(group_1, group_2) %>% summarise(rows = paste0(rn, collapse = ','))
`summarise()` regrouping output by 'group_1' (override with `.groups` argument)
# A tibble: 3 x 3
# Groups: group_1 [3]
group_1 group_2 rows
<chr> <chr> <chr>
1 A B 1,3
2 B C 2
3 C A 4
>
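All the answers above collapse the row numbers into a string. If the indices are needed for further computation, a list column may be more convenient. A small sketch, with idx and row as illustrative names:

```r
library(dplyr)

df <- data.frame(group_1 = c("A", "B", "A", "C"),
                 group_2 = c("B", "C", "B", "A"))

idx <- df %>%
  mutate(row = row_number()) %>%
  group_by(group_1, group_2) %>%
  # Keep the row numbers as an integer vector per group.
  summarise(rows = list(row), .groups = "drop")

idx$rows[[1]]  # row numbers for the A/B group
```

Row numbers are assigned before grouping, so within each group they are already in ascending order.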

Using `rle` function along with `dplyr` `group_by` command to mapping grouping variable

I have a data frame with three columns, similar to the one given below. I want to extract the pattern of transitions between consecutive values in column a.
With help from a few developers (@thelatemail and @David T), I was able to identify the pattern with the rle function; please see here - using rle function to identify pattern. Now I would like to go further and add grouping information to the extracted pattern. I tried dplyr's do function (see the code below), but it does not work.
The example data and desired output are given as well for your reference.
##mycode that produces error - needs to be fixed
test <- data%>%
group_by(b, c)%>%
do(., data.frame(from = rle(.$a)$values), to = lead(rle(.$a)$values))
##code to create the data frame
a <- c( "a", "b", "b", "b", "a", "c", "a", "b", "d", "d", "d", "e", "f", "f", "e", "e")
b <- c(rep("experiment", times = 8), rep("control", times = 8))
c <- c(rep("A01", times = 4), rep("A02", times = 4), rep("A03", times = 4), rep("A04", times = 4))
data <- data.frame(c,b,a)
## desired output
c b from to fromCount toCount
<chr> <chr> <int> <int>
1 A01 experimental a b 1 3
2 A02 experimental a c 1 1
3 A02 experimental c a 1 1
4 A02 experimental a b 1 1
5 A03 control d e 3 1
6 A04 control f e 2 2
Compared to the earlier post here, the information gets compressed since we apply grouping to the a column.
We could use rleid from data.table
library(data.table)
library(dplyr)
data %>%
group_by(b, c, grp = rleid(a)) %>%
summarise(from = first(a), fromCount = n()) %>%
mutate(to = lead(from), toCount = lead(fromCount)) %>%
ungroup %>%
select(-grp) %>%
filter(!is.na(to)) %>%
arrange(c)
# A tibble: 6 x 6
# b c from fromCount to toCount
# <chr> <chr> <chr> <int> <chr> <int>
#1 experiment A01 a 1 b 3
#2 experiment A02 a 1 c 1
#3 experiment A02 c 1 a 1
#4 experiment A02 a 1 b 1
#5 control A03 d 3 e 1
#6 control A04 f 2 e 2
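If you are on dplyr 1.1.0 or later, consecutive_id() should be able to stand in for data.table's rleid(), which avoids the extra dependency. A sketch of the same pipeline under that assumption; the data frame is rebuilt inline and out is an illustrative name.

```r
library(dplyr)

data <- data.frame(
  c = rep(c("A01", "A02", "A03", "A04"), each = 4),
  b = rep(c("experiment", "control"), each = 8),
  a = c("a", "b", "b", "b", "a", "c", "a", "b",
        "d", "d", "d", "e", "f", "f", "e", "e")
)

out <- data %>%
  # grp changes every time the value of a changes.
  group_by(b, c, grp = consecutive_id(a)) %>%
  summarise(from = first(a), fromCount = n(), .groups = "drop_last") %>%
  # Still grouped by b and c here, so lead() stays within one subject.
  mutate(to = lead(from), toCount = lead(fromCount)) %>%
  ungroup() %>%
  select(-grp) %>%
  filter(!is.na(to)) %>%
  arrange(c)

out
```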
Or use rle directly. After grouping by 'b' and 'c', summarise with rle to create a list column and extract its 'values' and 'lengths' in the same summarise call. Then create 'to' and 'toCount' as the lead of 'from' and 'fromCount', filter out the NA elements, and arrange the rows by the 'c' column.
data %>%
group_by(b, c) %>%
summarise(rl = list(rle(a)),
from = rl[[1]]$values,
fromCount = rl[[1]]$lengths) %>%
mutate(to = lead(from),
toCount = lead(fromCount)) %>%
ungroup %>%
select(-rl) %>%
filter(!is.na(to)) %>%
arrange(c)
# A tibble: 6 x 6
# b c from fromCount to toCount
# <chr> <chr> <chr> <int> <chr> <int>
#1 experiment A01 a 1 b 3
#2 experiment A02 a 1 c 1
#3 experiment A02 c 1 a 1
#4 experiment A02 a 1 b 1
#5 control A03 d 3 e 1
#6 control A04 f 2 e 2
We could also loop over the rle list column ('rl') with map and build a tibble from its values and lengths together with their leads. Then use unnest_wider to create the columns and unnest to expand the list structure, filter out the NA elements, and arrange.
library(tidyr)
library(purrr)
data %>%
group_by(b, c) %>%
summarise(rl = list(rle(a))) %>%
ungroup %>%
mutate(out = map(rl,
~ tibble(from = .x$values,
fromCount = .x$lengths,
to = lead(from),
toCount = lead(fromCount)))) %>%
unnest_wider(c(out)) %>%
unnest(from:toCount) %>%
filter(!is.na(to)) %>%
arrange(c) %>%
select(-rl)
Or, in the tidyverse, create a function that computes the rle of the Tracking column for a single subject
rleSlice <- function(Tracking) {
rlTrack <- rle(as.character(Tracking)) # Strip the levels from the factor, they interfere
tibble(from = rlTrack$values, to = lead(rlTrack$values),
fromCount = rlTrack$lengths, toCount = lead(rlTrack$lengths)) %>%
filter(!is.na(to)) %>%
list()
}
Make sure it's behaving
rleSlice(c("a", "b", "b", "b", "c"))
[[1]]
# A tibble: 2 x 4
from to fromCount toCount
<chr> <chr> <int> <int>
1 a b 1 3
2 b c 3 1
Now we'll group and get the rle for each participant
data %>%
as_tibble() %>%
# This is easier to track than all these a,b,c's
rename(Subject = c, Test = b, Tracking = a) %>%
group_by(Subject, Test) %>%
summarise(Slice = rleSlice(Tracking)) %>%
unnest(col = "Slice") %>%
ungroup()
# A tibble: 6 x 6
Subject Test from to fromCount toCount
<fct> <fct> <chr> <chr> <int> <int>
1 A01 experiment a b 1 3
2 A02 experiment a c 1 1
3 A02 experiment c a 1 1
4 A02 experiment a b 1 1
5 A03 control d e 3 1
6 A04 control f e 2 2

Resources