Summarize information by group in data table in R

I'm trying to get multiple summary statistics in R grouped by Team. I used the code below, but the output is not what I want.
Please point me in a better direction. Thanks!
set.seed(77)
data <- data.frame(Team =sample(c("A","B"),30, replace=TRUE),
gender=sample(c("female","male"),30, replace=TRUE),
Age =sample(c(0:100),30, replace=T))
dat <- data %>%
group_by(Team, gender) %>%
dplyr::summarize_all(list(my_mean = mean,
my_sum = sum,
my_sd = sd)) %>%
as.data.frame()
df <- data %>%
group_by(Team) %>%
summarize(total = n(gender),
mean = mean(Age),
Max_Age = max(Age),
Min_Age = min(Age),
sd = sd(Age),
)
I want to get output like the one in this picture.

You may need to build two data frames: one with the summary statistics of Age per Team (age_summary in the example below) and one with the count of Team members per gender and Team (gender_summary in the example below), and then merge them into a single data frame (say summary_df).
library(tidyverse)
set.seed(77)
data <- data.frame(
Team = sample(c("A", "B"), 30, replace = TRUE),
gender = sample(c("female", "male"), 30, replace = TRUE),
Age = sample(c(0:100), 30, replace = TRUE)
)
age_summary <- data %>%
group_by(Team) %>%
summarize(
mean = mean(Age),
Max = max(Age),
Min = min(Age),
sd = sd(Age)
) %>%
column_to_rownames("Team") %>%
t() %>%
as_tibble(
rownames = "age_summary"
)
gender_summary <- data %>%
group_by(Team) %>%
count(gender) %>%
ungroup() %>%
pivot_wider(names_from = Team, values_from = n)
summary_df <- full_join(
age_summary,
gender_summary
) %>%
mutate(
"item" = if_else(
is.na(gender),
"Age",
"Sex"
)
) %>%
unite("summary", c(age_summary, gender), na.rm = TRUE, remove = FALSE) %>%
relocate(item, .before = 1) %>%
select(-c(age_summary, gender))
# # A tibble: 6 × 4
# item summary A B
# <chr> <chr> <dbl> <dbl>
# 1 Age mean 45.6 57.8
# 2 Age Max 92 82
# 3 Age Min 5 14
# 4 Age sd 30.1 22.1
# 5 Sex female 8 9
# 6 Sex male 7 6
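As a side note on the code in the question: `n()` takes no arguments, so `n(gender)` fails. A minimal sketch of a one-row-per-Team summary follows, assuming the counts per gender are wanted as extra columns rather than as rows; the result column names (`n_female`, `mean_age`, and so on) are made up for illustration.

```r
library(dplyr)

set.seed(77)
data <- data.frame(
  Team = sample(c("A", "B"), 30, replace = TRUE),
  gender = sample(c("female", "male"), 30, replace = TRUE),
  Age = sample(0:100, 30, replace = TRUE)
)

# One row per Team: n() counts rows in the current group and takes no arguments;
# counts per gender come from summing a logical comparison.
team_summary <- data %>%
  group_by(Team) %>%
  summarise(
    total = n(),
    n_female = sum(gender == "female"),
    n_male = sum(gender == "male"),
    mean_age = mean(Age),
    max_age = max(Age),
    min_age = min(Age),
    sd_age = sd(Age)
  )

team_summary
```

This keeps Teams as rows; the answer above transposes so that Teams become columns, which is closer to the layout requested.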

Related

Making a Sankey Diagram in R

I'm trying to create a Sankey Diagram. I am using R with either {plotly} or {networkD3} packages. Both ask for the same type of data: source, target, value. I'm not really sure what source, target, and value is supposed to be and how to aggregate my data to this format. I have the following:
data.frame(
UniqID = rep(c(1:10), times=4),
Year = c(rep("2005", times=10), rep("2010", times=10), rep("2015", times=10), rep("2020", times=10)),
Response_Variable = round(runif(n = 40, min = 0, max = 2), digits = 0)
)
The response variable is a categorical variable of 0, 1, or 2. I would like to show the flow of the classes of this variable from one year to the next. The final product should look something like this:
In my case, "Wave" would be Year and "Outcome" would be the classes (0, 1, 2) of the response variable.
You don't really have enough information in your data to make a chart exactly like that because with the data you provided it's not clear which things changed from one category to the next across years. Maybe you were trying to achieve that with the UniqID column, but the way the data is, it doesn't make sense...
df <- data.frame(UniqID=rep(c(1:10), times=4),
Year=rep(c("2005", "2010", "2015", "2020"), times=10),
Response_Variable=round(runif(n=40, min = 0, max = 2), digits=0))
library(dplyr)
df %>% arrange(UniqID, Year) %>% filter(UniqID == 1)
#> UniqID Year Response_Variable
#> 1 1 2005 2
#> 2 1 2005 1
#> 3 1 2015 1
#> 4 1 2015 0
Ignoring that, the data format you're asking about is a list of "links", each one defining a movement from one node (the "source") to another node (the "target"). In your case, each year-category combination is a node, and you need a list of the links between those nodes, plus optionally a "value" for each link; here the number of occurrences of the source node makes the most sense. You could reshape your data to that format like this...
df %>%
group_by(Year, Response_Variable) %>%
summarise(value = n(), .groups = "drop") %>%
mutate(source = paste(Year, Response_Variable, sep = "_")) %>%
group_by(Response_Variable) %>%
mutate(target = lead(source, order_by = Year)) %>%
filter(!is.na(target))
#> # A tibble: 9 × 5
#> # Groups: Response_Variable [3]
#> Year Response_Variable value source target
#> <chr> <dbl> <int> <chr> <chr>
#> 1 2005 0 4 2005_0 2010_0
#> 2 2005 1 3 2005_1 2010_1
#> 3 2005 2 3 2005_2 2010_2
#> 4 2010 0 2 2010_0 2015_0
#> 5 2010 1 6 2010_1 2015_1
#> 6 2010 2 2 2010_2 2015_2
#> 7 2015 0 3 2015_0 2020_0
#> 8 2015 1 3 2015_1 2020_1
#> 9 2015 2 4 2015_2 2020_2
To get to the more specific format that {networkD3} requires, you need one data.frame for links and one that lists each node. The links data.frame needs to refer to each node in the nodes data.frame by its 0-based index. You can set that up like this...
library(dplyr)
library(networkD3)
df <-
data.frame(
UniqID=rep(c(1:10), times=4),
Year=rep(c("2005", "2010", "2015", "2020"), times=10),
Response_Variable=round(runif(n=40, min = 0, max = 2), digits=0)
)
links <-
df %>%
group_by(Year, Response_Variable) %>%
summarise(value = n(), .groups = "drop") %>%
mutate(source = paste(Year, Response_Variable, sep = "_")) %>%
group_by(Response_Variable) %>%
mutate(target = lead(source, order_by = Year)) %>%
filter(!is.na(target)) %>%
ungroup() %>%
select(source, target, value)
nodes <- data.frame(node_id = unique(c(links$source, links$target)))
links$source <- match(links$source, nodes$node_id) - 1
links$target <- match(links$target, nodes$node_id) - 1
sankeyNetwork(
Links = links,
Nodes = nodes,
Source = "source",
Target = "target",
Value = "value",
NodeID = "node_id"
)
#> Links is a tbl_df. Converting to a plain data frame.
Given the modification to your example data, it would look like this...
library(dplyr)
library(networkD3)
df <-
data.frame(
UniqID=rep(c(1:10), times=4),
Year=c(rep("2005", times=10), rep("2010", times=10), rep("2015", times=10), rep("2020", times=10)),
Response_Variable=round(runif(n=40, min = 0, max = 2), digits=0)
)
links <-
df %>%
arrange(UniqID, Year) %>%
mutate(source = paste(Year, Response_Variable, sep = "_")) %>%
group_by(UniqID) %>%
mutate(target = lead(source, order_by = Year)) %>%
filter(!is.na(target)) %>%
ungroup() %>%
select(UniqID, source, target) %>%
group_by(source, target) %>%
summarise(value = n(), .groups = "drop")
nodes <- data.frame(node_id = unique(c(links$source, links$target)))
nodes$node_label <- sub("(.*)_([0-9]+)$", "\\1 (response \\2)", nodes$node_id)
nodes$node_group <- sub("^.*_", "", nodes$node_id)
links$source <- match(links$source, nodes$node_id) - 1
links$target <- match(links$target, nodes$node_id) - 1
sankeyNetwork(
Links = links,
Nodes = nodes,
Source = "source",
Target = "target",
Value = "value",
NodeID = "node_label",
NodeGroup = "node_group"
)
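The 0-based node indexing step can be checked in isolation on a toy example. This is only a sketch with made-up link names and values (`link_names` is a hypothetical data frame, not part of the question's data); it uses base R only.

```r
# Hypothetical links, identified by node name.
link_names <- data.frame(
  source = c("2005_0", "2005_1", "2010_0"),
  target = c("2010_0", "2010_0", "2015_1"),
  value = c(3, 2, 4)
)

# Every distinct node name, in order of first appearance.
nodes <- data.frame(node_id = unique(c(link_names$source, link_names$target)))

# match() returns 1-based positions in nodes$node_id; subtracting 1 gives
# the 0-based indices that the links table is expected to use.
links <- data.frame(
  source = match(link_names$source, nodes$node_id) - 1,
  target = match(link_names$target, nodes$node_id) - 1,
  value = link_names$value
)

links
```

Because the indices refer to row positions in `nodes`, the nodes data frame must not be reordered after the links have been converted.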
The answer is to use ggsankey rather than plotly or networkD3!

Randomly add NA values to dataframe with the proportion set by group

I would like to randomly add NA values to my dataframe with the proportion set by group.
library(tidyverse)
set.seed(1)
dat <- tibble(group = c(rep("A", 100),
rep("B", 100)),
value = rnorm(200))
pA <- 0.5
pB <- 0.2
# does not work
# was trying to create another column that i could use with
# case_when to set value to NA if missing==1
dat %>%
group_by(group) %>%
mutate(missing = rbinom(n(), 1, c(pA, pB))) %>%
summarise(mean = mean(missing))
I'd create a small tibble to keep track of the expected missingness rates, and join it to the first data frame. Then draw a uniform random number for each row and set the value to missing when the draw falls below that row's rate.
This is easy to generalize to more than two groups as well.
library("tidyverse")
set.seed(1)
dat <- tibble(
group = c(
rep("A", 100),
rep("B", 100)
),
value = rnorm(200)
)
expected_nans <- tibble(
group = c("A", "B"),
p = c(0.5, 0.2)
)
dat_with_nans <- dat %>%
inner_join(
expected_nans,
by = "group"
) %>%
mutate(
r = runif(n()),
value = if_else(r < p, NA_real_, value)
) %>%
select(
-p, -r
)
dat_with_nans %>%
group_by(
group
) %>%
summarise(
mean(is.na(value))
)
#> # A tibble: 2 × 2
#> group `mean(is.na(value))`
#> <chr> <dbl>
#> 1 A 0.53
#> 2 B 0.17
Created on 2022-03-11 by the reprex package (v2.0.1)
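The attempt in the question does not work because `rbinom()` recycles the probability vector `c(pA, pB)` across the rows within each group, so both groups receive a mix of the two rates. A possible variant of the join approach looks the rate up by group name in a named vector instead. This is a sketch only; the third group and its rate are made up to show the generalization.

```r
library(dplyr)

set.seed(1)
dat <- tibble(
  group = rep(c("A", "B", "C"), each = 100),
  value = rnorm(300)
)

# Hypothetical missingness rates, one per group, keyed by group name.
p_missing <- c(A = 0.5, B = 0.2, C = 0.05)

dat_na <- dat %>%
  mutate(
    # Index the named vector by group to get one rate per row.
    p = unname(p_missing[group]),
    value = if_else(runif(n()) < p, NA_real_, value)
  ) %>%
  select(-p)

# Observed share of missing values per group.
dat_na %>%
  group_by(group) %>%
  summarise(share_missing = mean(is.na(value)))
```

The observed shares will only approximate the target rates, since each row is drawn independently.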
Nesting and unnesting works
library(tidyverse)
dat <- tibble(group = c(rep("A", 1000),
rep("B", 1000)),
value = rnorm(2000))
pA <- .1
pB <- 0.5
set.seed(1)
dat %>%
group_by(group) %>%
nest() %>%
mutate(p = case_when(
group=="A" ~ pA,
TRUE ~ pB
)) %>%
mutate(data = purrr::map(data, ~ mutate(.x, missing = rbinom(n(), 1, p)))) %>%
unnest(cols = data) %>%
summarise(mean = mean(missing))
# # A tibble: 2 × 2
#   group  mean
#   <chr> <dbl>
# 1 A     0.11
# 2 B     0.481
set.seed(1)
dat %>%
group_by(group) %>%
nest() %>%
mutate(p = case_when(
group=="A" ~ pA,
TRUE ~ pB
)) %>%
mutate(data = purrr::map(data, ~ mutate(.x, missing = rbinom(n(), 1, p)))) %>%
unnest(cols = data) %>%
ungroup() %>%
mutate(value = case_when(
missing == 1 ~ NA_real_,
TRUE ~ value
)) %>%
select(-p, -missing)

Rotate table in R

How can I melt/reshape/rotate my table from this:
          profit     lost      obs    fc.mape
mean    3724.743 804.1835 427.8899 0.21037696
std.dev  677.171 406.1391 372.5544 0.06072549
To this:
        mean std.dev
profit     x
lost       x
obs        x
fc.mape    x
Here is a tidyverse solution. I find it too complicated but it works. Maybe there are simpler ones.
library(dplyr)
library(tidyr)
df1 %>%
mutate(id = row.names(.)) %>%
pivot_longer(
cols = -id,
names_to = "stat"
) %>%
group_by(id) %>%
mutate(n = row_number()) %>%
ungroup() %>%
pivot_wider(
id_cols = c(n, stat),
names_from = id,
values_from = value
) %>%
select(-n)
## A tibble: 4 x 3
# stat mean std.dev
# <chr> <dbl> <dbl>
#1 profit 3725. 677.
#2 lost 804. 406.
#3 obs 428. 373.
#4 fc.mape 0.210 0.0607
Data
df1 <-
structure(list(profit = c(3724.743, 677.171), lost = c(804.1835,
406.1391), obs = c(427.8899, 372.5544), fc.mape = c(0.21037696,
0.06072549)), class = "data.frame", row.names = c("mean", "std.dev"))
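A simpler option may be a plain transpose in base R. The sketch below assumes all columns are numeric, as in the example data; `t()` returns a matrix, so it is converted back to a data frame afterwards. The name `rotated` is made up for illustration.

```r
df1 <- data.frame(
  profit = c(3724.743, 677.171),
  lost = c(804.1835, 406.1391),
  obs = c(427.8899, 372.5544),
  fc.mape = c(0.21037696, 0.06072549),
  row.names = c("mean", "std.dev")
)

# t() swaps rows and columns; the old column names become row names.
rotated <- as.data.frame(t(df1))

rotated
```

If a column held character data, `t()` would coerce the whole matrix to character, in which case the pivot-based approach above is the safer choice.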

dplyr count unique and repeat id's by months

I have a df that looks like the following:
ID DATE
12 10-20-20
12 10-22-20
10 10-15-20
9 10-10-20
11 11-01-20
7 11-02-20
I would like to group by month and then create a column for unique id count and repeat id count like below:
MONTH Unique_Count Repeat_Count
10-1-20 2 2
11-1-20 2 0
I am able to get the date down to the first of the month and group by ID but I am not sure how to count unique instances within the months.
df %>%
mutate(month = floor_date(as.Date(DATE), "month")) %>%
group_by(ID) %>%
mutate(count = n())
Are you perhaps looking for:
library(dplyr)
library(lubridate)
df %>%
mutate(month = strftime(floor_date(as.Date(DATE, "%m-%d-%y"), "month"),
"%m-%d-%y")) %>%
group_by(month) %>%
summarize(unique_count = length(which(table(ID) == 1)),
repeat_count = sum(table(ID)[(which(table(ID) > 1))]))
#> # A tibble: 2 x 3
#> month unique_count repeat_count
#> <chr> <int> <int>
#> 1 10-01-20 2 2
#> 2 11-01-20 2 0
Here's a shot at it:
library(lubridate)
library(dplyr)
dates <- as.Date(c("2020-10-15", "2020-10-15", "2020-11-16", "2020-11-16", "2020-11-16"))
ids <- c(12, 12, 13, 13, 14)
df <- data.frame(dates, ids)
duplicates <- df %>%
group_by(dates_floored = floor_date(dates, unit = "month"), ids) %>%
mutate(duplicate_count = n()) %>%
filter(duplicate_count > 1) %>%
distinct(ids, .keep_all = TRUE)
uniques <- df %>%
group_by(dates_floored = floor_date(dates, unit = "month"), ids) %>%
mutate(unique_count = n()) %>%
filter(unique_count < 2) %>%
distinct(ids, .keep_all = TRUE)
df_cleaned <- full_join(uniques, duplicates, by = c("ids", "dates", "dates_floored")) %>%
group_by(dates_floored) %>%
summarize(count_duplicates = sum(duplicate_count, na.rm = TRUE),
count_unique = sum(unique_count, na.rm = TRUE))
df_cleaned
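Another way to express the same idea is to count rows per month and ID first, then summarise those counts. This is a sketch built on the sample data from the question; the intermediate column name `n_rows` is made up, and the month label is produced with `format()` rather than lubridate.

```r
library(dplyr)

df <- data.frame(
  ID = c(12, 12, 10, 9, 11, 7),
  DATE = c("10-20-20", "10-22-20", "10-15-20", "10-10-20", "11-01-20", "11-02-20")
)

monthly <- df %>%
  # Parse the month-day-year strings, then label each row with the first of its month.
  mutate(month = format(as.Date(DATE, format = "%m-%d-%y"), "%m-01-%y")) %>%
  # One row per month and ID, with the number of occurrences.
  count(month, ID, name = "n_rows") %>%
  group_by(month) %>%
  summarise(
    unique_count = sum(n_rows == 1),          # IDs seen exactly once in the month
    repeat_count = sum(n_rows[n_rows > 1])    # rows belonging to IDs seen more than once
  )

monthly
```

Note that the month label is a character string here, so sorting across years would need a real date column instead.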

More efficient way to perform calculations on multiple (combined) columns by group

What is a more efficient way to perform calculations on multiple combined columns by group?
I have a dataset with Manager Effectiveness & Team Effectiveness components. How can I quickly calculate the number of 5s for each component by gender?
The desired outcome is like so:
Number of 5s for 'Manager effectiveness' = 2
Number of 5s for 'Team effectiveness' = 0
So far, I've tried the dplyr method:
Data %>%
group_by(gender) %>%
summarise(sum(c(`Manager EQ`, `Manager IQ`) == 5))
Data %>%
group_by(gender) %>%
summarise(sum(c(`Team collaboration`, `Team friendliness`) == 5))
Though it works, typing each column name quickly becomes tedious and error-prone as more columns are involved.
We can use summarise_at
library(dplyr)
Data %>%
group_by(gender) %>%
summarise_at(vars(starts_with('Manager')), ~ sum(. == 5))
Or if we are checking the sum of all numeric columns, use summarise_if
Data %>%
group_by(gender) %>%
summarise_if(is.numeric, ~ sum(. == 5))
This can be wrapped in a function
f1 <- function(dat, colPrefix, grp, val) {
dat %>%
group_by_at(grp) %>%
summarise_at(vars(starts_with(colPrefix)), ~ sum(. == val))
}
f1(Data, "Manager", "gender", 5)
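In current dplyr the scoped verbs such as `summarise_at()` are superseded by `across()`. A sketch of the equivalent call follows. The small `Data` tibble is made up for illustration, using the column names from the question, and the combined `manager_5s` column is one possible way to get a single count per component.

```r
library(dplyr)

# Hypothetical data with the column names used in the question.
Data <- tibble(
  gender = c("F", "F", "M", "M"),
  `Manager EQ` = c(5, 3, 4, 5),
  `Manager IQ` = c(2, 5, 1, 3),
  `Team collaboration` = c(4, 4, 2, 1),
  `Team friendliness` = c(3, 2, 5, 1)
)

# Count the 5s in every column whose name starts with "Manager", per gender.
fives <- Data %>%
  group_by(gender) %>%
  summarise(across(starts_with("Manager"), ~ sum(.x == 5)), .groups = "drop")

# Add the per-column counts together for one number per component.
fives$manager_5s <- rowSums(fives[startsWith(names(fives), "Manager")])

fives
```

Swapping `"Manager"` for `"Team"` gives the team component; the prefix could also be passed in as a function argument, as in `f1` above.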
Mostly expanding on @akrun's answer:
library(tidyverse)
## made-up data, 100 observations
set.seed(133)
dat <- 1:5
gen <- c("M", "F")
z <- tibble(me = sample(dat, 100, TRUE),
mi = sample(dat, 100, TRUE),
tc = sample(dat, 100, TRUE),
tf = sample(dat, 100, TRUE),
gender = sample(gen, 100, TRUE))
# Grouping by gender, counting 5's, and reshaping data
z %>%
group_by(gender) %>%
summarise_at(vars(everything()), ~ sum(. == 5)) %>%
pivot_longer(me:tf) %>%
mutate(name = paste0("# 5's for ", name)) %>%
pivot_wider(id_cols = gender, names_from = name, values_from = value)
Output:
# A tibble: 2 x 5
gender `# 5's for me` `# 5's for mi` `# 5's for tc` `# 5's for tf`
<chr> <int> <int> <int> <int>
1 F 6 6 8 5
2 M 10 14 20 5
This is starting to get a little hacky, but in response to Amanda's comment and my misunderstanding of the question:
z %>%
group_by(gender) %>%
summarise_at(vars(everything()), ~ sum(. == 5)) %>%
pivot_longer(me:tf) %>%
mutate(name = paste0("# 5's for ", name)) %>%
mutate(grp = ifelse(str_detect(name, 'm'), 'manager', 'team')) %>%
group_by(gender, grp) %>%
summarise(total_5s = sum(value))
Gives results:
# A tibble: 4 x 3
# Groups: gender [2]
gender grp total_5s
<chr> <chr> <int>
1 F manager 12
2 F team 13
3 M manager 24
4 M team 25
Unfortunately this relies heavily on making a distinction and group based on the column names of the original data.
