When summarizing data, some groups may have observations not present in another group. In the example below, group 2 has no males. How can I in a tidy way, insert these observations in a summary table?
data example:
a <- data.frame(gender=factor(c("m", "m", "m", "f", "f", "f", "f")), group=c(1,1,1,1,1,2,2))
gender group
1 m 1
2 m 1
3 m 1
4 f 1
5 f 1
6 f 2
7 f 2
data summary:
a %>% group_by(gender, group) %>% summarise(n=n())
gender group n
<fct> <dbl> <int>
1 f 1 2
2 f 2 2
3 m 1 3
Desired output:
gender group n
<fct> <dbl> <int>
1 f 1 2
2 f 2 2
3 m 1 3
4 m 2 0
At the end, we can use complete
library(dplyr)
library(tidyr)
a %>%
group_by(gender, group) %>%
summarise(n=n(), .groups = 'drop') %>%
complete(gender, group, fill = list(n = 0))
-output
# A tibble: 4 x 3
# gender group n
# <fct> <dbl> <dbl>
#1 f 1 2
#2 f 2 2
#3 m 1 3
#4 m 2 0
Or an option is also to reshape to wide and then back to long format
a %>%
pivot_wider(names_from = group, values_from = group,
values_fn = length, values_fill = 0) %>%
pivot_longer(cols = -gender, names_to = 'group', values_to = 'n')
It is more easier in base R
as.data.frame(table(a))
Related
I have a data frame containing a varying number of data points in the same column:
library(tidyverse)
df <- tribble(~id, ~data,
"A", "a;b;c",
"B", "e;f")
I want to obtain one row per data point, separating the content of column data and distributing it on rows. This code gives the expected result, but is clumsy:
df %>%
separate(data,
into = paste0("dat_",1:5),
sep = ";",
fill = "right") %>%
pivot_longer(starts_with("dat_"),
names_to = "data_number",
names_pattern = "dat_(\\d+)") %>%
filter(!is.na(value))
#> # A tibble: 5 x 3
#> id data_number value
#> <chr> <chr> <chr>
#> 1 A 1 a
#> 2 A 2 b
#> 3 A 3 c
#> 4 B 1 e
#> 5 B 2 f
Tidyverse solutions preferred.
Here is one way
library(dplyr)
library(tidyr)
library(data.table)
df %>%
separate_rows(data) %>%
mutate(data_number = rowid(id), .before = 2)
-output
# A tibble: 5 x 3
id data_number data
<chr> <int> <chr>
1 A 1 a
2 A 2 b
3 A 3 c
4 B 1 e
5 B 2 f
library(dplyr)
library(tidyr)
df %>%
separate_rows(data)
output:
# A tibble: 5 x 2
id data
<chr> <chr>
1 A a
2 A b
3 A c
4 B e
5 B f
Using str_split and unnest -
library(tidyverse)
df %>%
mutate(data = str_split(data, ';'),
data_number = map(data, seq_along)) %>%
unnest(c(data, data_number))
# id data data_number
# <chr> <chr> <int>
#1 A a 1
#2 A b 2
#3 A c 3
#4 B e 1
#5 B f 2
p <- data.frame(x = c("A", "B", "C", "A", "B"),
y = c("A", "B", "D", "A", "B"),
z = c("B", "C", "B", "D", "E"))
p
d <- p %>%
group_by(x) %>%
summarize(occurance1 = count(x),
occurance2 = count(y),
occurance3 = count(z),
total = occurance1 + occurance2 + occurance3)
d
Output:
A tibble: 3 x 5
x occurance1 occurance2 occurance3 total
<chr> <int> <int> <int> <int>
1 A 2 2 1 5
2 B 2 2 1 5
3 C 1 1 1 3
I have a dataset similar to the one above where I'm trying to get the counts of the different factors in each column. The first one works perfectly, probably because it's grouped by (x), but I've run into various problems with the other two rows. As you can see, it doesn't count "D" at all in y, instead counting it as "C" and z doesn't have an "A" in it, but there's a count of 1 for A. Help?
count needs data.frame/tibble as input and not a vector. To make this work, we may need to reshape to 'long' format with pivot_longer and apply the count on the columns, and then use adorn_totals to get the total column
library(dplyr)
library(tidyr)
library(janitor)
p %>%
pivot_longer(cols = everything()) %>%
count(name, value) %>%
pivot_wider(names_from = value, values_from = n, values_fill = 0) %>%
janitor::adorn_totals('col')
-output
name A B C D E Total
x 2 2 1 0 0 5
y 2 2 0 1 0 5
z 0 2 1 1 1 5
In addition to akrun's solution here is one without janitor using select_if:
p %>%
pivot_longer(
cols = everything(),
names_to = "name",
values_to = "values"
) %>%
count(name,values) %>%
pivot_wider(names_from = values, values_from = n, values_fill = 0) %>%
ungroup() %>%
mutate(Total = rowSums(select_if(., is.integer), na.rm = TRUE))
name A B C D E Total
<chr> <int> <int> <int> <int> <int> <dbl>
1 x 2 2 1 0 0 5
2 y 2 2 0 1 0 5
3 z 0 2 1 1 1 5
Dear dplyr/tidyverse companions, I am looking for a nice solution to the following problem. I only get my solutions in base R with a loop. How do you solve this cleanly in tidyverse?
I have a dataset called data, which has not useful column names and not useful values (integer).
data <- tibble(var1 = rep(c(1:3), 2),
var2 = rep(c(1:3), 2))
# A tibble: 6 x 2
var1 var2
<int> <int>
1 1 1
2 2 2
3 3 3
4 1 1
5 2 2
6 3 3
Additional I have a coding table, which has for every column a better name (var1 -> variable1) and a better value (1 -> "a")
coding <- tibble(variable = c(rep("var1", 3),rep("var2", 3)),
name = c(rep("variable1", 3),rep("variable2", 3)),
code = rep(c(1:3), 2),
value = rep(c("a", "b", "c"), 2))
# A tibble: 6 x 4
variable name code value
<chr> <chr> <int> <chr>
1 var1 variable1 1 a
2 var1 variable1 2 b
3 var1 variable1 3 c
4 var2 variable2 1 a
5 var2 variable2 2 b
6 var2 variable2 3 c
I'm looking for a result, which has transformed names of the columns and the real values as factors in the dataset, compare:
result <- tibble(variable1 = factor(rep(c("a", "b", "c"), 2)),
variable2 = factor(rep(c("a", "b", "c"), 2)))
# A tibble: 6 x 2
variable1 variable2
<fct> <fct>
1 a a
2 b b
3 c c
4 a a
5 b b
6 c c
Thank you for your commitment :) :) :) :)
library(dplyr)
library(tidyr)
data %>%
stack() %>%
left_join(coding, by = c(ind = "variable", values = "code")) %>%
group_by(name) %>%
mutate(j = row_number()) %>%
pivot_wider(id_cols = j, values_from = value) %>%
select(-j)
# # A tibble: 6 x 2
# variable1 variable2
# <chr> <chr>
# 1 a a
# 2 b b
# 3 c c
# 4 a a
# 5 b b
# 6 c c
A general solution for any number of columns -
create a row number column to identify each row
get data in long format
join it with coding for each value
keep only unique rows and get it back in wide format.
library(dplyr)
library(tidyr)
data %>%
mutate(row = row_number()) %>%
pivot_longer(cols = -row, values_to = 'code') %>%
left_join(coding, by = 'code') %>%
select(row, name = name.y, value) %>%
distinct() %>%
pivot_wider() %>%
select(-row)
# variable1 variable2
# <chr> <chr>
#1 a a
#2 b b
#3 c c
#4 a a
#5 b b
#6 c c
I have a dataframe with three columns that has information similar to the data frame given below. Now I wish to extract information search pattern based on the information in column a.
Based on the support from few developers (#thelatemail and #David T), I was able to identify the pattern with rle function, please see here - using rle function to identify pattern. Now, I wish to move ahead and add grouping information to the extracted pattern. I tried with dplyr do function - refer to the code below. However, this does not work.
The example data and desired output is given as well for your reference.
##mycode that produces error - needs to be fixed
test <- data%>%
group_by(b, c)%>%
do(., data.frame(from = rle(.$a)$values), to = lead(rle(.$a)$values))
##code to create the data frame
a <- c( "a", "b", "b", "b", "a", "c", "a", "b", "d", "d", "d", "e", "f", "f", "e", "e")
b <- c(rep("experiment", times = 8), rep("control", times = 8))
c <- c(rep("A01", times = 4), rep("A02", times = 4), rep("A03", times = 4), rep("A04", times = 4))
data <- data.frame(c,b,a)
## desired output
c b from to fromCount toCount
<chr> <chr> <int> <int>
1 A01 experimental a b 1 3
2 A02 experimental a c 1 1
3 A02 experimental c a 1 1
4 A02 experimental a b 1 1
5 A03 control d e 3 1
6 A04 control f e 2 2
Compared to the earlier post here, the information gets compressed since we apply grouping to the a column.
We could use rleid from data.table
library(data.table)
library(dplyr)
data %>%
group_by(b, c, grp = rleid(a)) %>%
summarise(from = first(a), fromCount = n()) %>%
mutate(to = lead(from), toCount = lead(fromCount)) %>%
ungroup %>%
select(-grp) %>%
filter(!is.na(to)) %>%
arrange(c)
# A tibble: 6 x 6
# b c from fromCount to toCount
# <chr> <chr> <chr> <int> <chr> <int>
#1 experiment A01 a 1 b 3
#2 experiment A02 a 1 c 1
#3 experiment A02 c 1 a 1
#4 experiment A02 a 1 b 1
#5 control A03 d 3 e 1
#6 control A04 f 2 e 2
Or using rle, after grouping by 'b', 'c', summarise with rle to create a list column, then extract the 'values' and 'lengths' from column in summarise, create the 'to', 'toCount' on the lead of the 'from', 'fromCount' column filter out the NA elements and arrange the rows based on the 'c' column
data %>%
group_by(b, c) %>%
summarise(rl = list(rle(a)),
from = rl[[1]]$values,
fromCount = rl[[1]]$lengths) %>%
mutate(to = lead(from),
toCount = lead(fromCount)) %>%
ungroup %>%
select(-rl) %>%
filter(!is.na(to)) %>%
arrange(c)
# A tibble: 6 x 6
# b c from fromCount to toCount
# <chr> <chr> <chr> <int> <chr> <int>
#1 experiment A01 a 1 b 3
#2 experiment A02 a 1 c 1
#3 experiment A02 c 1 a 1
#4 experiment A02 a 1 b 1
#5 control A03 d 3 e 1
#6 control A04 f 2 e 2
We could also loop over the rle list column ('rl') with map, extract the components, and take the lead of the lengths, values in a tibble, use unnest_wider to create the columns and unnest the list structure, filter out the NA elements and arrange
library(tidyr)
library(purrr)
data %>%
group_by(b, c) %>%
summarise(rl = list(rle(a))) %>%
ungroup %>%
mutate(out = map(rl,
~ tibble(from = .x$values,
fromCount = .x$lengths,
to = lead(from),
toCount = lead(fromCount)))) %>%
unnest_wider(c(out)) %>%
unnest(from:toCount) %>%
filter(!is.na(to)) %>%
arrange(c) %>%
select(-rl)
Or in the tidyverse, create a function that does the rle for the Tracking for a single subject
rleSlice <- function(Tracking) {
rlTrack <- rle(as.character(Tracking)) # Strip the levels from the factor, they interfere
tibble(from = rlTrack$values, to = lead(rlTrack$values),
fromCount = rlTrack$lengths, toCount = lead(rlTrack$lengths)) %>%
filter(!is.na(to)) %>%
list()
}
Make sure it's behaving
[[1]]
rleSlice(c("a", "b", "b", "b", "c"))
A tibble: 2 x 4
from to fromCount toCount
<chr> <chr> <int> <int>
1 a b 1 3
2 b c 3 1
Now we'll group and get the rle for each participant
data %>%
as_tibble() %>%
# This is easier to track than all these a,b,c's
rename(Subject = c, Test = b, Tracking = a) %>%
group_by(Subject, Test) %>%
summarise(Slice = rleSlice(Tracking)) %>%
unnest(col = "Slice") %>%
ungroup()
# A tibble: 6 x 6
Subject Test from to fromCount toCount
<fct> <fct> <chr> <chr> <int> <int>
1 A01 experiment a b 1 3
2 A02 experiment a c 1 1
3 A02 experiment c a 1 1
4 A02 experiment a b 1 1
5 A03 control d e 3 1
6 A04 control f e 2 2
I have a data frame with grouped variable and I want to sum them by group. It's easy with dplyr.
library(dplyr)
library(magrittr)
data <- data.frame(group = c("a", "a", "b", "c", "c"), n1 = 1:5, n2 = 2:6)
data %>% group_by(group) %>%
summarise_all(sum)
# A tibble: 3 x 3
group n1 n2
<fctr> <int> <int>
1 a 3 5
2 b 3 4
3 c 9 11
But now I want a new column total with the sum of n1 and n2 by group. Like this:
# A tibble: 3 x 3
group n1 n2 ttl
<fctr> <int> <int> <int>
1 a 3 5 8
2 b 3 4 7
3 c 9 11 20
How can I do that with dplyr?
EDIT:
Actually, it's just an example, I have a lot of variables.
I tried these two codes but it's not in the right dimension...
data %>% group_by(group) %>%
summarise_all(sum) %>%
summarise_if(is.numeric, sum)
data %>% group_by(group) %>%
summarise_all(sum) %>%
mutate_if(is.numeric, .funs = sum)
You can use mutate after summarize:
data %>%
group_by(group) %>%
summarise_all(sum) %>%
mutate(tt1 = n1 + n2)
# A tibble: 3 x 4
# group n1 n2 tt1
# <fctr> <int> <int> <int>
#1 a 3 5 8
#2 b 3 4 7
#3 c 9 11 20
If need to sum all numeric columns, you can use rowSums with select_if (to select numeric columns) to sum columns up:
data %>%
group_by(group) %>%
summarise_all(sum) %>%
mutate(tt1 = rowSums(select_if(., is.numeric)))
# A tibble: 3 x 4
# group n1 n2 tt1
# <fctr> <int> <int> <dbl>
#1 a 3 5 8
#2 b 3 4 7
#3 c 9 11 20
We can use apply together with the dplyr functions.
data <- data.frame(group = c("a", "a", "b", "c", "c"), n1 = 1:5, n2 = 2:6)
data %>% group_by(group) %>%
summarise_all(sum) %>%
mutate(ttl = apply(.[, 2:ncol(.)], 1, sum))
# A tibble: 3 × 4
group n1 n2 ttl
<fctr> <int> <int> <int>
1 a 3 5 8
2 b 3 4 7
3 c 9 11 20
Or rowSums with the same strategy. The key is to use . to specify the data frame and [] with x:ncol(.) to keep the columns you want.
data %>% group_by(group) %>%
summarise_all(sum) %>%
mutate(ttl = rowSums(.[, 2:ncol(.)]))
# A tibble: 3 × 4
group n1 n2 ttl
<fctr> <int> <int> <dbl>
1 a 3 5 8
2 b 3 4 7
3 c 9 11 20
Base R
cbind(aggregate(.~group, data, sum), ttl = sapply(split(data[,-1], data$group), sum))
# group n1 n2 ttl
#a a 3 5 8
#b b 3 4 7
#c c 9 11 20
We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(data)), grouped by 'group', get the sum of each columns in the Subset of data.table, and then with Reduce, get the sum of the rows of the columns of interest
library(data.table)
setDT(data)[, lapply(.SD, sum) , group][, tt1 := Reduce(`+`, .SD),
.SDcols = names(data)[-1]][]
# group n1 n2 tt1
#1: a 3 5 8
#2: b 3 4 7
#3: c 9 11 20
Or with base R
addmargins(as.matrix(rowsum(data[-1], data$group)), 2)
# n1 n2 Sum
#a 3 5 8
#b 3 4 7
#c 9 11 20
Or with dplyr
data %>%
group_by(group) %>%
summarise_all(sum) %>%
mutate(tt = rowSums(.[-1]))