Summarise+Group_by+mean+sd various columns - graph

I need a new data frame with the mean and sd of several columns, split by gender, so that I can make my graphs later. After hours of work I managed it, but in a silly way :D, one column at a time.
I'm a complete beginner, so I work by trial and error for hours until something works. I know I would do better with a teacher or more explanation.
I computed the per-row means and sds like this:
mediapi <- rowMeans(datos[1:325, c(31, 37)])
Sdbi <- rowSds(as.matrix(datos[1:325, c(6:18)])) # rowSds() comes from the matrixStats package
Then I summarised each column one by one, because when I tried to do it in a single table it did not work:
datos %>%
group_by(Sexo) %>%
summarize(m = mean(mediaci), # calculates the mean
s = sd(mediaci), # calculates the standard deviation
n = n()) %>% # calculates the total number of observations
ungroup()
datos %>%
group_by(Sexo) %>%
summarize(m = mean(mediaii), # calculates the mean
s = sd(mediaii), # calculates the standard deviation
n = n()) %>% # calculates the total number of observations
ungroup()
datos %>%
group_by(Sexo) %>%
summarize(m = mean(mediabi), # calculates the mean
s = sd(mediabi), # calculates the standard deviation
n = n()) %>% # calculates the total number of observations
ungroup()
datos %>%
group_by(Sexo) %>%
summarize(m = mean(mediaai), # calculates the mean
s = sd(mediaai), # calculates the standard deviation
n = n()) %>% # calculates the total number of observations
ungroup()
datos %>%
group_by(Sexo) %>%
summarize(m = mean(mediabai), # calculates the mean
s = sd(mediabai), # calculates the standard deviation
n = n()) %>% # calculates the total number of observations
ungroup()
datos %>%
group_by(Sexo) %>%
summarize(m = mean(mediapi),# calculates the mean
s = sd(mediapi),
n = n()) %>% # calculates the total number of observations
ungroup()
datos %>%
group_by(Sexo) %>%
summarize(m = mean(mediandi),# calculates the mean
s = sd(mediandi),
n = n()) %>% # calculates the total number of observations
ungroup()
datos %>%
group_by(Sexo) %>%
summarize(m = mean(mediansi),# calculates the mean
s = sd(mediansi),
n = n()) %>% # calculates the total number of observations
ungroup()
Can someone tell me an easier and more efficient way? Thanks a lot.
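One possible way to collapse the repeated blocks above into a single call is dplyr's across(), sketched here under some assumptions: the toy `datos` below is a made-up stand-in (the real data is not shown), and the name `resumen` and the `starts_with("media")` selection are only illustrative, relying on all summary columns sharing the "media" prefix.

```r
library(dplyr)

# Hypothetical stand-in for `datos`; only two of the "media..." columns are mocked up
datos <- data.frame(
  Sexo    = c("F", "F", "M", "M"),
  mediaci = c(1, 3, 2, 4),
  mediaii = c(2, 4, 6, 8)
)

resumen <- datos %>%
  group_by(Sexo) %>%
  summarise(
    # apply mean() and sd() to every column whose name starts with "media"
    across(starts_with("media"), list(m = mean, s = sd), .names = "{.col}_{.fn}"),
    n = n(),          # number of observations per group
    .groups = "drop"  # return an ungrouped result
  )

print(resumen)
```

The result has one row per value of Sexo and one pair of columns (`_m`, `_s`) per summarised variable, which is usually a convenient shape to reshape for plotting.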

Related

How do I create a table in R with conditional formatting and row and column totals?

Are there any R packages that I can use to replicate the table below?
I would like a table with conditional formatting for the table values but no conditional formatting on the row and column grand totals.
The following code reproduces the values in the table, along with the row and column grand totals:
library(tidyverse)
# vectors
dates <- rep(date_vec <- c(as.Date("2022-01-01"), as.Date("2022-02-01"), as.Date("2022-03-01")), 30)
row_groups <- c(rep("row_group1", 20), rep("row_group2", 30), rep("row_group3", 10), rep("row_group4", 30))
col_groups <- c(rep("col_group1", 10), rep("col_group2", 10), rep("col_group3", 30), rep("col_group4", 40))
# dataframe
df <- tibble(dates, row_groups, col_groups)
# column grand totals
col_group_total <- df %>%
group_by(dates, col_groups) %>%
count() %>%
group_by(col_groups) %>%
summarise(mean = mean(n)) %>%
mutate(pct = mean/sum(mean))
# row grand totals
row_group_total <- df %>%
group_by(dates, row_groups) %>%
count() %>%
group_by(row_groups) %>%
summarise(mean = mean(n)) %>%
mutate(pct = mean/sum(mean))%>%
ungroup()
# table values
group_total <- df %>%
group_by(dates, row_groups, col_groups) %>%
count() %>%
group_by(row_groups, col_groups) %>%
summarise(count = mean(n)) %>%
ungroup() %>%
mutate(pct = count/sum(count))%>%
ungroup()
red_color <- "#f4cccc"
yellow_color <- "#f3f0ce"
green_color <- "#d9ead3"
library(janitor); library(gt)
df %>%
tabyl(row_groups, col_groups) %>%
adorn_percentages("all") %>%
adorn_totals(c("col")) -> df_tabyl
gt(df_tabyl) %>%
data_color(columns = col_group1:col_group4,
colors = scales::col_numeric(
palette = c(red_color, yellow_color, green_color),
domain = range(df_tabyl[1:4,2:5])
)
) %>%
fmt_percent(columns = -row_groups,
rows = everything()) %>%
summary_rows(
columns = -row_groups,
fns = list("Total" = "sum"),
formatter = fmt_percent
)
The coloring differs from your example because the col_numeric function maps values linearly along the three provided colors, and 11% is only 1/3 of the way between 0% and 33%. I'm not sure which approach you expect.
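To make the linear mapping concrete, here is a small sketch using the same three colors. The domain of 0 to 0.33 is an assumption taken from the percentages mentioned above, not the actual range of the table, and the names `pal` and `cols` are illustrative.

```r
library(scales)

red_color    <- "#f4cccc"
yellow_color <- "#f3f0ce"
green_color  <- "#d9ead3"

# Palette function over an assumed domain of 0% to 33%
pal <- col_numeric(
  palette = c(red_color, yellow_color, green_color),
  domain  = c(0, 0.33)
)

# 0.11 is a third of the way along the scale, so it gets a red/yellow blend;
# the middle color (yellow) is only reached at the midpoint, 0.165
cols <- pal(c(0, 0.11, 0.165, 0.33))
print(cols)
```

If a different mapping is wanted (for example, colors assigned by rank or by fixed bins), scales also provides col_quantile() and col_bin().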

How to repeat an operation for several subsets and groups of the same dataset with dplyr?

I am wondering whether there is a way, using functional programming, to repeat the same operations on different subsets of a dataset.
Below is an example of how I would do it "manually", but my question is: is there a way to apply the same summary to different subsets of the same dataset?
Here is a sample dataset:
dt <- data.frame(group = rep(LETTERS[1:3], each = 12*3),
year = rep(2018:2020, each = 12),
month = rep(1:12, times = 3),
value = rnorm(12*3*3, 2, .3))
And this is what I am doing right now. There are three ways of grouping (per group, per group AND per year, and per group and per year for a subset of the months). Then, the same action is carried out (summary with mean, min, max).
The code below accomplishes what I want, but I wonder if there is a more efficient way to do this, ideally, using dplyr.
bind_rows(
# First grouping
dt %>% group_by(group) %>%
# Common summary
summarise(mean = mean(value),
min = min(value),
max = max(value)) %>%
mutate(grouping = "per group"),
# Second grouping
dt %>% group_by(group, year) %>%
# Common summary
summarise(mean = mean(value),
min = min(value),
max = max(value)) %>%
mutate(grouping = "per group and per year"),
# Third grouping
dt %>% filter (month %in% 6:8) %>% group_by(group, year) %>%
# Common summary
summarise(mean = mean(value),
min = min(value),
max = max(value)) %>%
mutate(grouping = "per group, summer months")
)
Any idea?
library(purrr)
library(dplyr)
groupings <- list(
. %>% group_by(group),
. %>% group_by(group, year),
. %>% filter (month %in% 6:8) %>% group_by(group, year)
)
grouping_labels <- list(
"per group",
"per group and per year",
"per group, summer months"
)
common_summary <- . %>%
summarise(mean = mean(value),
min = min(value),
max = max(value))
map2(
groupings,
grouping_labels,
~ dt %>% .x() %>% common_summary() %>% mutate(grouping = .y)
) %>%
bind_rows()

More efficient way of taking averages over multiple lists

I have some data where I use the rsample package to create rolling windows (I use the iris data set here). The rolling_iris dataset contains a number of lists.
I would like to compute the min, max, mean and sd of each of the lists. That is, for split 1, compute the min across the first 4 columns, and so on. I currently do this by mapping over the splits, using pivot_longer to rearrange the data, computing the statistics, and finally using pivot_wider to get the data back into its original form. This is quite slow.
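library(dplyr)
library(purrr)
library(tidyr)   # pivot_longer(), pivot_wider()
library(rsample) # rolling_origin(), analysis(), assessment()
rolling_iris <- rsample::rolling_origin(iris, initial = 10, assess = 1, cumulative = FALSE, skip = 0)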
rolling_iris_statistics <- map(rolling_iris$splits, ~analysis(.x) %>%
pivot_longer(cols = 1:4) %>%
mutate(
min = min(value),
max = max(value),
mean = mean(value),
sd = sd(value)
) %>%
group_by(name) %>%
mutate(rowID = row_number()) %>%
pivot_wider(names_from = name, values_from = value)
)
I would like to map over each of the lists and compute the above statistics. Then once this is done scale the analysis by the following function.
Scale_Me <- function(x){
(x - min(x)) / (max(x) - min(x))
}
Additional:
rolling_iris_analysis <- map(rolling_iris$splits, ~analysis(.x))
rolling_iris_assessment <- map(rolling_iris$splits, ~assessment(.x))
EDIT:
I managed to compute the following (I am not sure if it is "faster")
analysis <- map(rolling_iris$splits, ~analysis(.x))
map(analysis, ~select(., c(1:4)) %>% as.matrix %>% mean())
The code below extracts the analysis set of each split, so rolling_iris_dfs is a list of data frames. You can then iterate over each data frame and compute the statistics. (The $data element of a split holds the full dataset, not the window, so analysis() is used to get the window.)
rolling_iris_dfs <- map(rolling_iris$splits, analysis)
rolling_iris_stats <- map(rolling_iris_dfs, ~ .x %>%
pivot_longer(cols = 1:4) %>%
mutate(
min = min(value),
max = max(value),
mean = mean(value),
sd = sd(value)
) %>%
group_by(name) %>%
mutate(rowID = row_number()) %>%
pivot_wider(names_from = name, values_from = value)
)
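If the pivoting is the slow part, one alternative worth trying is to skip it and compute the statistics directly on a numeric matrix per window. This is only a sketch, not a benchmarked solution; the names `window_stats` and `scaled_windows` are made up, and Scale_Me is the function from the question applied to all four columns at once.

```r
library(rsample)

rolling_iris <- rolling_origin(iris, initial = 10, assess = 1,
                               cumulative = FALSE, skip = 0)

Scale_Me <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# One row of statistics per window, computed over the first four columns
window_stats <- do.call(rbind, lapply(rolling_iris$splits, function(split) {
  values <- as.matrix(analysis(split)[, 1:4])
  data.frame(min = min(values), max = max(values),
             mean = mean(values), sd = sd(values))
}))

# Each window rescaled to [0, 1] using its own overall min and max
scaled_windows <- lapply(rolling_iris$splits, function(split) {
  Scale_Me(as.matrix(analysis(split)[, 1:4]))
})

head(window_stats)
```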

Calculate percentage with group by dplyr

I want to calculate the percentage of each value for every character column in my data frame, but the percentage comes out wrong.
My code:
for(i in names(which((sapply(creditDF,class) == "character")))){
distribution <- creditDF %>%
group_by_at(.vars = i) %>%
summarise(value = n(),
percent = value/sum(value)) %>%
select(label = i, value, percent)
}
Result:
label value percent
<chr> <int> <dbl>
1 chéquier autorisé 415 1
2 chéquier interdit 53 1
For the first line, the percentage should be 415/468*100.
How can I fix my problem?
Thanks for your help.
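Inside summarise(), sum(value) is evaluated within each group, so every percentage is value/value = 1. Here, we need to ungroup and compute the percentage in a separate step to get the sum of the whole 'value' column, i.e.
-- %>%
group_by_at(.vars = i) %>%
summarise(value = n()) %>%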
ungroup() %>%
mutate(percent = value/sum(value)) %>%
select(label = i, value, percent)
}
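As a self-contained illustration of the same idea, the sketch below uses count() followed by mutate() and keeps one result per character column instead of overwriting `distribution` on each iteration. The toy `creditDF`, its column names and the name `distributions` are assumptions made up for the example.

```r
library(dplyr)

# Hypothetical stand-in for creditDF
creditDF <- data.frame(
  cheque_book = c("allowed", "allowed", "allowed", "forbidden"),
  amount      = c(100, 200, 300, 400),
  stringsAsFactors = FALSE
)

# Names of the character columns only
char_cols <- names(creditDF)[sapply(creditDF, is.character)]

distributions <- lapply(char_cols, function(col) {
  creditDF %>%
    count(label = .data[[col]], name = "value") %>%  # one row per distinct value
    mutate(percent = value / sum(value))              # share of the whole column
})
names(distributions) <- char_cols

print(distributions$cheque_book)
```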

Using dplyr to get counts

I want to be able to get the counts, standard deviation and mean of certain variables after grouping them. I am able to get the mean and std, but getting the counts is giving me an error. This is the following code I have:
NYC_Trees %>%
group_by(Condition) %>%
dplyr::summarise(mean = round(mean(Compensatory.Value), 2),
sd = round(sd(Compensatory.Value), 2),
count(NYC_Trees,Condition, wt = Compensatory.Value))
I get the error: cannot handle.
I want the output such as:
Condition Native N Mean Std
What am I doing wrong?
count(NYC_Trees, Condition, wt = Compensatory.Value) should be the same as NYC_Trees %>% group_by(Condition) %>% summarise(n = sum(Compensatory.Value)). It returns a whole data frame with one row per Condition rather than a single value per group, and therefore the summarise function cannot handle it.
So you could just have the line n = sum(Compensatory.Value) inside the summarise:
NYC_Trees %>%
group_by(Condition) %>%
dplyr::summarise(mean = round(mean(Compensatory.Value), 2),
sd = round(sd(Compensatory.Value), 2),
n = sum(Compensatory.Value))
Is that what you are trying to do? If you just want the number of values in each group, you can use n = n() instead:
NYC_Trees %>%
group_by(Condition) %>%
dplyr::summarise(mean = round(mean(Compensatory.Value), 2),
sd = round(sd(Compensatory.Value), 2),
n = n())
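The difference between the two options can be checked on a small made-up data frame. The `trees` data below is purely illustrative and only borrows the column names from the question.

```r
library(dplyr)

# Hypothetical miniature version of NYC_Trees
trees <- data.frame(
  Condition          = c("Good", "Good", "Poor"),
  Compensatory.Value = c(10, 20, 5)
)

# Weighted count: sums Compensatory.Value within each Condition
weighted <- count(trees, Condition, wt = Compensatory.Value)

# The same result written with group_by() + summarise()
summed <- trees %>%
  group_by(Condition) %>%
  summarise(n = sum(Compensatory.Value))

# Plain group sizes: number of rows per Condition
sizes <- trees %>%
  group_by(Condition) %>%
  summarise(n = n())

print(weighted)
print(sizes)
```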
