Adding overall mean when using group_by - r

I am using the dplyr package to generate some tables, and I'm making use of janitor's adorn_totals("row") function.
This works fine when I want to sum values within the groups; however, in some cases I want an overall mean instead of a sum. Is there an adorn_means function?
Sample code:
Regions2 <- Data %>%
  filter(!is.na(REGION)) %>%
  group_by(REGION) %>%
  summarise(NumberofPeople = length(Names)) %>%
  adorn_totals("row")
Here my "total" row is simply the sum of all people within the regions. This gives me:
REGION NumberofPeople
East Midlands 578,943
East of England 682,917
London 1,247,540
North East 245,830
North West 742,886
South East 963,040
South West 623,684
West Midlands 653,335
Yorkshire 553,853
TOTAL 6,292,028
My next piece of code generates an average salary for each region, but I want to add an overall average as the total row.
Regions3 <- Data %>%
  filter(!is.na(REGION)) %>%
  filter(!is.na(AVGSalary)) %>%
  group_by(REGION) %>%
  summarise(AverageSalary = mean(AVGSalary))
If I use adorn_totals("row") as before, I simply get the sum of the averages, not the overall average for the dataset.
How do I get the overall average?
UPDATE with some toy data:
Data
people region salary
person1 London 1000
person2 South West 1050
person3 South East 900
person4 London 800
person5 Scotland 1020
person6 South West 750
person7 East 600
person8 London 1200
person9 South West 1150
The group averages are therefore:
London 1000
South West 983.33
South East 900
Scotland 1020
East 600
I want to add the overall average to the bottom:
Total 941.11

1) Because the overall average is the weighted average of the group averages (not the plain average of the averages), i.e. it is 941 and not 901 here, we maintain an n column so that at the end we can correctly compute the overall average. Although the data shown does not have any NAs, we use drop_na so the code also works with data that does; it removes any row containing an NA.
library(dplyr)
library(tidyr)
Region %>%
  drop_na %>%
  group_by(region) %>%
  summarize(avg = mean(salary), n = n()) %>%
  ungroup %>%
  bind_rows(summarize(., region = "Overall Avg",
                      avg = sum(avg * n) / sum(n),
                      n = sum(n))) %>%
  select(-n)
giving:
# A tibble: 6 x 2
region avg
<chr> <dbl>
1 East 600
2 London 1000
3 Scotland 1020
4 South East 900
5 South West 983.
6 Overall Avg 941.
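As a quick sanity check, the weighted figure matches the overall mean computed directly from the raw, NA-free salaries:
Region %>%
  drop_na %>%
  summarize(avg = mean(salary))
# avg = 941.1111, matching the Overall Avg row above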
2) Another approach would be to construct the Overall Avg line by going back to the original data:
Region %>%
  drop_na %>%
  group_by(region) %>%
  summarize(avg = mean(salary)) %>%
  ungroup %>%
  bind_rows(summarize(Region %>% drop_na, region = "Overall Avg", avg = mean(salary)))
giving:
# A tibble: 6 x 2
region avg
<chr> <dbl>
1 East 600
2 London 1000
3 Scotland 1020
4 South East 900
5 South West 983.
6 Overall Avg 941.
2a) If you object to referring to Region twice, then try this:
Region_ <- Region %>%
  drop_na

Region_ %>%
  group_by(region) %>%
  summarize(avg = mean(salary)) %>%
  ungroup %>%
  bind_rows(summarize(Region_, region = "Overall Avg", avg = mean(salary)))
2b) Or as a single pipeline, where Region_ is now local to the pipeline and will automatically be removed after the pipeline completes:
Region %>%
  drop_na %>%
  { Region_ <- .
    Region_ %>%
      group_by(region) %>%
      summarize(avg = mean(salary)) %>%
      ungroup %>%
      bind_rows(summarize(Region_, region = "Overall Avg", avg = mean(salary)))
  }
Note
We used this as the input:
Lines <- "people region salary
person1 London 1000
person2 South West 1050
person3 South East 900
person4 London 800
person5 Scotland 1020
person6 South West 750
person7 East 600
person8 London 1200
person9 South West 1150"
library(gsubfn)
Region <- read.pattern(text = Lines, pattern = "^(\\S+) +(.*) (\\d+)$",
                       as.is = TRUE, skip = 1, strip.white = TRUE,
                       col.names = read.table(text = Lines, nrow = 1, as.is = TRUE))

One option is to add a row with bind_rows (tibble() here replaces the now-deprecated data_frame()):
library(dplyr)
Data %>%
  group_by(region) %>%
  summarise(Avgsalary = mean(salary)) %>%
  bind_rows(tibble(region = 'Total',
                   Avgsalary = mean(.$Avgsalary, na.rm = TRUE)))
Or another option is add_row from tibble:
Data %>%
  group_by(region) %>%
  summarise(Avgsalary = mean(salary)) %>%
  add_row(region = 'Total', Avgsalary = mean(.$Avgsalary))
If the total should instead be the overall mean of the raw salaries (not the mean of the group averages), we need to calculate it before grouping:
Data %>%
  mutate(Total = mean(salary)) %>%
  group_by(region) %>%
  summarise(Avgsummary = mean(salary), Total = first(Total)) %>%
  add_row(region = 'Total', Avgsummary = .$Total[1]) %>%
  select(-Total)

Related

map_df -- Argument 1 must be a data frame or a named atomic vector

I am an infectious diseases physician and have set myself the challenge of creating a data frame with the UK's cumulative published cases of monkeypox, so I can graph it as a running tally or a choropleth map, as there is no nice dashboard for this at present.
All the data is published as HTML web pages rather than as a nice CSV, so I am trying to scrape it all off the internet using the rvest package.
Data is only published intermittently (about twice per week) with the cumulative totals for each of the 4 home nations in the UK.
I have managed to get working code to pull data from each of the separate web pages, and testing it on the first 2 pages in my mpx_gov_uk_pages list works well, giving a small example tibble:
library(tidyverse)
library(lubridate)
library(rvest)
library(janitor)
# load in overview page url which has links to each date of published cases
mpx_gov_uk_overview_page <- c("https://www.gov.uk/government/publications/monkeypox-outbreak-epidemiological-overview")
# extract urls for each date page
mpx_gov_uk_pages <- mpx_gov_uk_overview_page %>%
  read_html %>%
  html_nodes(".govuk-link") %>%
  html_attr('href') %>%
  str_subset("\\d{1,2}-[a-z]+-\\d{4}") %>%
  paste0("https://www.gov.uk", .) %>%
  as.character()
# make table for home nations for each date
table1 <- mpx_gov_uk_pages[1] %>%
  read_html() %>%
  html_table() %>%
  .[[1]] %>%
  janitor::clean_names() %>%
  rename(area = starts_with(c("uk", "devolved")),
         cases = matches(c("total", "confirmed_cases"))) %>%
  separate(cases, c("cases", NA), sep = "\\s\\(") %>%
  mutate(date = dmy(str_extract(mpx_gov_uk_pages[1], "\\d{1,2}-[a-z]+-\\d{4}")),
         cases = as.numeric(gsub(",", "", cases))) %>%
  select(date, area, cases) %>%
  filter(!area %in% c("Total"))
table2 <- mpx_gov_uk_pages[2] %>%
  read_html() %>%
  html_table() %>%
  .[[1]] %>%
  janitor::clean_names() %>%
  rename(area = starts_with(c("uk", "devolved")),
         cases = matches(c("total", "confirmed_cases"))) %>%
  separate(cases, c("cases", NA), sep = "\\s\\(") %>%
  mutate(date = dmy(str_extract(mpx_gov_uk_pages[2], "\\d{1,2}-[a-z]+-\\d{4}")),
         cases = as.numeric(gsub(",", "", cases))) %>%
  select(date, area, cases) %>%
  filter(!area %in% c("Total"))
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [4].
# Combine tables
bind_rows(table1, table2)
#> # A tibble: 8 × 3
#> date area cases
#> <date> <chr> <dbl>
#> 1 2022-08-02 England 2638
#> 2 2022-08-02 Northern Ireland 24
#> 3 2022-08-02 Scotland 65
#> 4 2022-08-02 Wales 32
#> 5 2022-07-29 England 2436
#> 6 2022-07-29 Northern Ireland 19
#> 7 2022-07-29 Scotland 61
#> 8 2022-07-29 Wales 30
I want to automate this by creating a generic function and passing the list of URLs to purrr::map_df, as there will be an ever-growing number of pages (there are already 13):
pull_first_table <- function(x){
  x %>%
    read_html() %>%
    html_table() %>%
    .[[1]] %>%
    janitor::clean_names() %>%
    rename(area = starts_with(c("uk", "devolved")),
           cases = matches(c("total", "confirmed_cases"))) %>%
    separate(cases, c("cases", NA), sep = "\\s\\(") %>%
    mutate(date = dmy(str_extract({{x}}, "\\d{1,2}-[a-z]+-\\d{4}")),
           cases = as.numeric(gsub(",", "", cases))) %>%
    select(date, area, cases) %>%
    filter(!area %in% c("Total"))
}
summary_table <- map_df(mpx_gov_uk_pages, ~ pull_first_table)
Error in `dplyr::bind_rows()`:
! Argument 1 must be a data frame or a named atomic vector.
Run `rlang::last_error()` to see where the error occurred.
The generic function seems to work fine when I supply it with a single element, e.g. mpx_gov_uk_pages[2], but I cannot seem to get map_df to work properly even though the web scraping is producing tibbles.
All help and pointers greatly welcomed.
We just need the function itself, not a lambda expression. With ~ pull_first_table, each iteration returns the function object rather than calling it, which is why bind_rows() then complains that argument 1 is not a data frame.
map_dfr(mpx_gov_uk_pages, pull_first_table)
Output:
# A tibble: 52 × 3
date area cases
<date> <chr> <dbl>
1 2022-08-02 England 2638
2 2022-08-02 Northern Ireland 24
3 2022-08-02 Scotland 65
4 2022-08-02 Wales 32
5 2022-07-29 England 2436
6 2022-07-29 Northern Ireland 19
7 2022-07-29 Scotland 61
8 2022-07-29 Wales 30
9 2022-07-26 England 2325
10 2022-07-26 Northern Ireland 18
# … with 42 more rows
If we do use a lambda expression, it must actually call the function:
map_dfr(mpx_gov_uk_pages, ~ pull_first_table(.x))
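To see why the bare ~ pull_first_table fails: the body of a purrr lambda is the value returned, so every iteration yields the function object itself, and bind_rows() then receives functions instead of data frames. A minimal sketch with a hypothetical function f:
library(purrr)
f <- function(x) data.frame(x = x)
map(1:2, ~ f)          # a list of two function objects; map_df() would then fail in bind_rows()
map_dfr(1:2, f)        # calls f on each element and row-binds the results
map_dfr(1:2, ~ f(.x))  # equivalent lambda that actually calls f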

Sort data frame rows in the right columns

I am currently in the data cleaning process. My data has a six-digit number of rows, and I cannot come up with a solution to get the values into the right columns. Can you give me a hint, please?
Thanks in advance.
df <- data.frame(price = c("['380€']", "3hr 15 min", "4hr", "3hr 55min", "2h", "20€"),
                 airlines = c("['Icelandir']", "€1,142", "16€", "17€", "19€", "Iberia"),
                 duration = c("['3h']", "Turkish airlines", "KLM", "easyJet", "2 hr 1min", "Finnair"),
                 depart = c("LGW", "AMS", "NUE", "ZRH", "LHR", "VAR"))
My desired output is:
price        airline         duration          price_right  airline_right     duration_right  depart
['380€']     ['Icelandair']  ['3h']            ['380€']     ['Icelandair']    ['3h']          LGW
3 hr 15 min  €1,142          Turkish airlines  €1,142       Turkish airlines  3 hr 15 min     AMS
4hr          €16             KLM               €16          KLM               4hr             NUE
3hr 55min    €17             easyJet           €17          easyJet           3hr 55min       ZRH
2h           €19             2hr 1min          €19          Iberia            2h              LHR
2hr min      "Iberia"        Finnair           €20          Finnair           2hr 1min        VAR
For this example we could do something like this:
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(everything()) %>%
  arrange(value) %>%
  group_by(group = as.integer(gl(n(), 3, n()))) %>%
  mutate(id = row_number()) %>%
  mutate(name = case_when(id == 1 ~ "price",
                          id == 2 ~ "duration",
                          id == 3 ~ "airlines",
                          TRUE ~ NA_character_)) %>%
  ungroup() %>%
  select(-group, -id) %>%
  group_by(name) %>%
  mutate(id = row_number()) %>%
  pivot_wider(names_from = name, values_from = value) %>%
  select(-id)
price duration airlines
<chr> <chr> <chr>
1 ['380€'] ['3h'] ['Icelandir']
2 €1,142 3hr 15 min Turkish airlines
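The key step is group_by(group = as.integer(gl(n(), 3, n()))), which labels consecutive rows of the sorted long data in blocks of three, so each block can be spread back into price, duration and airlines. A quick illustration of what gl() produces here:
# block labels over 9 rows: one group id per block of three
as.integer(gl(9, 3, 9))
#> [1] 1 1 1 2 2 2 3 3 3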

Can I get a ggplot2 bar chart to display direct values for the Y axis?

When I plot my bar chart, it puts out values on the Y axis that I don't understand. How can I get the bar chart to use the actual values?
# Here is the code for my graph
stock %>%
  # Tidy data to be handled correctly
  group_by(year) %>%
  filter(year == "2017") %>%
  pivot_longer(bio_sus:bio_notsus) %>%
  mutate(value2 = ifelse(name == "bio_sus", -1 * value, value)) %>%
  # make the graph
  ggplot(aes(ocean_whole, value2/100, fill = name)) +
  geom_bar(stat = "identity")
The bar chart is putting out values between 2.5 and -2.5, while my value2 values range between 100 and -100.
ocean_sub code year ocean_whole name value value2
<chr> <chr> <dbl> <chr> <chr> <dbl> <dbl>
1 Eastern Central Atlantic NA 2017 atlantic bio_sus 57.1 -57.1
2 Eastern Central Atlantic NA 2017 atlantic bio_notsus 42.9 42.9
3 Eastern Central Pacific NA 2017 pacific bio_sus 86.7 -86.7
4 Eastern Central Pacific NA 2017 pacific bio_notsus 13.3 13.3
5 Eastern Indian Ocean NA 2017 indian bio_sus 68.6 -68.6
How can I get the chart to display the actual values?
# My code is from TidyTuesday's Global Seafood:
stock <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-10-12/fish-stocks-within-sustainable-levels.csv')
#transformed in the following way
oceans <- c("pacific", "atlantic", "indian", "mediterranean")
lu <- stack(sapply(oceans, grep, x = stock$entity, ignore.case = TRUE))
stock$oceans <- stock$entity
stock$oceans[lu$values] <- as.character(lu$ind)
stock %>%
  group_by(oceans) %>%
  summarise(across(matches("^share"), sum))
colnames(stock) <- c("ocean_sub", "code", "year", "bio_sus", "bio_notsus", "ocean_whole")
Your tibble contains multiple values per ocean_whole before you hand it over to ggplot(). geom_bar() stacks them, so the sums of these values account for the unexpected numbers. Check:
library(dplyr)
stock %>%
  group_by(year) %>%
  filter(year == "2017") %>%
  pivot_longer(bio_sus:bio_notsus) %>%
  mutate(value2 = ifelse(name == "bio_sus", -1 * value, value)) %>%
  group_by(ocean_whole, name) %>%
  summarise(sum(value2))
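If the goal is one stacked bar per ocean_whole on the original scale, one fix (a sketch; whether the sub-regions should be averaged or summed depends on your intent) is to aggregate before plotting and drop the /100:
library(dplyr)
library(tidyr)
library(ggplot2)
stock %>%
  filter(year == "2017") %>%
  pivot_longer(bio_sus:bio_notsus) %>%
  mutate(value2 = ifelse(name == "bio_sus", -1 * value, value)) %>%
  group_by(ocean_whole, name) %>%
  summarise(value2 = mean(value2), .groups = "drop") %>%  # or sum(), depending on intent
  ggplot(aes(ocean_whole, value2, fill = name)) +
  geom_col()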

Finding rows that have the minimum of a specific factor group

I am attempting to find the minimum incomes from the state.x77 dataset based on the state.region variable.
df1 <- data.frame(state.region,state.x77,row.names = state.name)
tapply(state.x77,state.region,min)
I am trying to get it to output which state has the lowest income for each region, e.g. for the South, Alabama would be the lowest income. I'm trying to use tapply but I keep getting an error saying:
Error in tapply(state.x77, state.region, min) :
arguments must have same length
What is the issue?
Here is a solution. The error occurs because state.x77 is a 50 x 8 matrix: it has 400 elements, while state.region has only 50, so the arguments passed to tapply differ in length. First extract the Income column as a named vector; then use tapply to get the names of the minimum incomes.
state <- setNames(state.x77[, "Income"], rownames(state.x77))
tapply(state, state.region, function(x) names(x)[which.min(x)])
# Northeast South North Central West
# "Maine" "Mississippi" "South Dakota" "New Mexico"
The following, more complicated, code will output state names, regions and incomes.
df1 <- data.frame(
  State = rownames(state.x77),
  Income = state.x77[, "Income"],
  Region = state.region
)
merge(aggregate(Income ~ Region, df1, min), df1)[c(3, 1, 2)]
# State Region Income
#1 South Dakota North Central 4167
#2 Maine Northeast 3694
#3 Mississippi South 3098
#4 New Mexico West 3601
And another solution with aggregate but avoiding merge (note that matching on the income value assumes the minimum incomes are unique in df1$Income):
agg <- aggregate(Income ~ Region, df1, min)
i <- match(agg$Income, df1$Income)
data.frame(
  State = df1$State[i],
  Region = df1$Region[i],
  Income = df1$Income[i]
)
# State Region Income
#1 Maine Northeast 3694
#2 Mississippi South 3098
#3 South Dakota North Central 4167
#4 New Mexico West 3601
You can also use this solution:
library(dplyr)
library(tibble)
state2 <- as.data.frame(state.x77)  # assumed definition; state2 was not shown in the original answer
state2 %>%
  rownames_to_column() %>%
  bind_cols(state.region) %>%
  rename(State = rowname,
         Region = ...10) %>%
  group_by(Region, State) %>%
  summarise(Income = sum(Income)) %>%
  arrange(desc(Income)) %>%
  slice_tail(n = 1)
# A tibble: 4 x 3
# Groups: Region [4]
Region State Income
<fct> <chr> <dbl>
1 Northeast Maine 3694
2 South Mississippi 3098
3 North Central South Dakota 4167
4 West New Mexico 3601
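A more direct dplyr route (a sketch, using the df1 built above; slice_min() needs dplyr >= 1.0.0) keeps the whole row at each group's minimum:
library(dplyr)
df1 %>%
  group_by(Region) %>%
  slice_min(Income, n = 1) %>%
  ungroup()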

How to find the percentage after taking the count using summarise

Let's say I have this data set:
footballplayers Nationality
A Germany
B Germany
C France
D France
E Belgium
F Belgium
I took the count:
df %>% group_by(Nationality) %>% summarise(count = n())
Nationality count
Germany 2
France 2
Belgium 2
Now I have to find the percentage of each nationality, e.g. 2/6 * 100 in this case. How do I do it in a single query after taking the count, so that I can use it in a pie chart?
Try this:
df %>%
  group_by(Nationality) %>%
  summarise(Count = n()) %>%
  mutate(percentage = round(Count / sum(Count) * 100, 2))
You can omit the round function if you do not want to round the percentage!
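Since the counts are destined for a pie chart, a minimal ggplot2 sketch (assuming the result above is stored in res) maps the percentages onto geom_col() plus coord_polar():
library(dplyr)
library(ggplot2)
res <- df %>%
  group_by(Nationality) %>%
  summarise(Count = n()) %>%
  mutate(percentage = Count / sum(Count) * 100)
ggplot(res, aes(x = "", y = percentage, fill = Nationality)) +
  geom_col(width = 1) +
  coord_polar(theta = "y")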
