transform() to add rows with dplyr() - r

I've got a data frame (df) with two variables, site and purchase.
I'd like to use dplyr to group my data by site and purchase, and get the counts and percentages for the grouped data. However, I'd also like the tibble to include rows labelled ALLSITES, representing the data of all the sites grouped by purchase, so that I end up with a tibble similar to dfgoal.
The problem is that my current code doesn't produce the ALLSITES rows. I've tried adding a base R function (transform()) into the dplyr pipeline, which doesn't work.
Any help would be much appreciated.
Starting point (df):
df <- data.frame(site=c("LON","MAD","PAR","MAD","PAR","MAD","PAR","MAD","PAR","LON","MAD","LON","MAD","MAD","MAD"),purchase=c("a1","a2","a1","a1","a1","a1","a1","a1","a1","a2","a1","a2","a1","a2","a1"))
Desired outcome:
dfgoal <- data.frame(site=c("LON","LON","MAD","MAD","PAR","ALLSITES","ALLSITES"),purchase=c("a1","a2","a1","a2","a1","a1","a2"),bin=c(1,2,6,2,4,11,4),bin_per=c(33.33333,66.66667,75.00000,25.00000,100.00000,73.33333,26.66667))
Current code:
library(dplyr)
# First attempt: counts and percentages per site (no ALLSITES rows)
df %>%
  group_by(site, purchase) %>%
  summarize(bin = sum(purchase==purchase)) %>%
  group_by(site) %>%
  mutate(bin_per = (bin/sum(bin)*100))
# Second attempt: stack an ALLSITES copy of df first (this is the part that fails)
df %>%
  rbind(df, transform(df, site = "ALLSITES") %>%
  group_by(site, purchase) %>%
  summarize(bin = sum(purchase==purchase)) %>%
  group_by(site) %>%
  mutate(bin_per = (bin/sum(bin)*100))

We can start from the output of the first code block, assigned to df1. Group it by 'site', replaced with the constant string 'ALLSITES', and by 'purchase'; take the sum of 'bin'; recompute 'bin_per' on the ungrouped result; then use bind_rows to stack the new rows under df1.
df1 %>%
  ungroup() %>%
  group_by(site = 'ALLSITES', purchase) %>%
  summarise(bin = sum(bin)) %>%
  ungroup() %>%
  mutate(bin_per = 100*(bin/sum(bin))) %>%
  bind_rows(df1, .)
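A possible alternative, sketched here rather than taken from the answer above, is to repair the idea behind the second attempt in the question: stack a copy of df whose site is overwritten with "ALLSITES" underneath the original rows, then count and compute percentages once. The name result is only illustrative, and count() sorts alphabetically, so the ALLSITES rows come first rather than last as in dfgoal.

```r
library(dplyr)

df <- data.frame(
  site = c("LON","MAD","PAR","MAD","PAR","MAD","PAR","MAD",
           "PAR","LON","MAD","LON","MAD","MAD","MAD"),
  purchase = c("a1","a2","a1","a1","a1","a1","a1","a1",
               "a1","a2","a1","a2","a1","a2","a1")
)

result <- bind_rows(df, mutate(df, site = "ALLSITES")) %>%  # original rows + ALLSITES copy
  count(site, purchase, name = "bin") %>%                   # rows per site/purchase pair
  group_by(site) %>%
  mutate(bin_per = 100 * bin / sum(bin)) %>%                # share within each site
  ungroup()

result
```

Because every original row appears exactly once under its own site and once under ALLSITES, the percentages within each site still sum to 100.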

Related

After grouping, cannot get dplyr's slice to select top 3 of each grouping

I am trying to retain only the top 3 records from each grouping based on l5_ppg_max. This code sets up the table correctly, but when I add the slice code it does not select the top 3 records of each group.
#library(reticulate)
library(tidyverse)
library(plotly)
library(janitor)
library(readxl)
library(reprex)
player_projection <- read_csv("DFF_NHL_cheatsheet.csv", col_names = TRUE)
team__reg_line <- player_projection %>%
  clean_names() %>%
  mutate_if(is.numeric, ~replace_na(., 0)) %>%
  filter(!position == "G") %>%
  filter(!reg_line == 0) %>%
  select(team, reg_line, l5_ppg_max, salary) %>%
  arrange(team, reg_line, desc(l5_ppg_max)) %>%
  group_by(team, reg_line, salary, l5_ppg_max)
When I add this line:
slice_head(n = 3)
It still returns all records.
I also tried top_n(3), but read that it was deprecated, so I stayed with dplyr's slice functions. This is quite easy to do manually in Excel, but I need to do it in R for ggplot output.
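One likely cause, offered as a sketch rather than a confirmed diagnosis since the CSV is not available: group_by(team, reg_line, salary, l5_ppg_max) groups by every selected column, so almost every row forms its own group and slice_head(n = 3) has nothing to drop. Grouping by team and reg_line only should give the intended result. The tibble toy and its values below are hypothetical stand-ins for the real data.

```r
library(dplyr)

# Hypothetical stand-in for the cleaned cheat-sheet data
toy <- tibble(
  team       = rep(c("BOS", "NYR"), each = 5),
  reg_line   = 1,
  l5_ppg_max = c(5, 3, 9, 1, 7, 2, 8, 4, 6, 10),
  salary     = 1000 * (1:10)
)

top3 <- toy %>%
  group_by(team, reg_line) %>%                        # group only by the keys
  slice_max(l5_ppg_max, n = 3, with_ties = FALSE) %>% # three largest values per group
  ungroup()

top3
```

slice_max() is used here so the result does not depend on a prior arrange(); arrange() followed by slice_head(n = 3) on the same two-column grouping would be an equally reasonable choice.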

Getting rid of NA values in R when trying to aggregate columns

I'm trying to aggregate this df by the last value in each corresponding country observation. For some reason, the last value that is added to the tibble is not correct.
aggre_data <- combined %>%
  group_by(location) %>%
  summarise(Last_value_vacc = last(people_vaccinated_per_hundred))
aggre_data
I believe it has something to do with all of the NA values throughout the df. However I did try:
aggre_data <- combined %>%
group_by(location) %>%
summarise(Last_value_vacc = last(people_vaccinated_per_hundred(na.rm = TRUE)))
aggre_data
Update:
combined %>%
  group_by(location) %>%
  arrange(date, .by_group = TRUE) %>% # or whichever column defines "last"
  summarise(Last_value_vacc = last(na.omit(people_vaccinated_per_hundred)))
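To illustrate why the update helps, here is a minimal sketch on made-up data: last() returns whatever sits in the final position, NA included, whereas dropping the NA values first returns the most recent reported value. The tibble combined_toy and its numbers are hypothetical; only the column names follow the question.

```r
library(dplyr)

# Hypothetical data: location A ends with an NA, location B starts with one
combined_toy <- tibble(
  location = c("A", "A", "A", "B", "B"),
  date     = as.Date(c("2021-01-01", "2021-01-02", "2021-01-03",
                       "2021-01-01", "2021-01-02")),
  people_vaccinated_per_hundred = c(1.5, 2.0, NA, NA, 0.7)
)

aggre_toy <- combined_toy %>%
  group_by(location) %>%
  arrange(date, .by_group = TRUE) %>%
  summarise(
    last_raw    = last(people_vaccinated_per_hundred),          # may be NA
    last_non_na = last(na.omit(people_vaccinated_per_hundred))  # last reported value
  )

aggre_toy
```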

How to print a grouped_df grouped by two variables on two tables with dplyr in R

I want to group by two variables, compute a mean for the groups, then print the result on distinct tables.
Unlike the below where I get all my means in a single table, I would like one output table for x==1 and another one for x==2
data = tibble(x=factor(sample(1:2,10,rep=TRUE)),
              y=factor(sample(letters[1:2],10,rep=TRUE)),
              z=1:10)
data %>% group_by(x) %>% summarize(Mean_z=mean(z))
res = data %>% group_by(x,y) %>% summarize(Mean_z=mean(z))
print(res)
res %>% knitr::kable() %>% kableExtra::kable_styling()
You want separate outputs for when x==1 and x==2. A simple way with dplyr would be to filter:
library(dplyr)
data = tibble(x=factor(sample(1:2,10,rep=TRUE)),
              y=factor(sample(letters[1:2],10,rep=TRUE)),
              z=1:10)
res = data %>% group_by(x,y) %>% summarize(Mean_z=mean(z))
x1 = res %>%
  filter(x == 1)
x2 = res %>%
  filter(x == 2)
x1 %>% knitr::kable() %>% kableExtra::kable_styling()
x2 %>% knitr::kable() %>% kableExtra::kable_styling()
I'm not sure why you have this line of code:
data %>% group_by(x) %>% summarize(Mean_z=mean(z))
It doesn't create a new object, so its output won't be available to subsequent lines of code. If you did store and use it, it would give you the mean of z for each value of x, without splitting by y.
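If x had many levels, filtering one level at a time would get repetitive. A possible alternative, sketched under the assumption that a named list of tables is acceptable, is base R's split(). The data below is fixed rather than sampled so the result is reproducible, and the name tables is only illustrative.

```r
library(dplyr)

# Deterministic stand-in for the sampled data in the question
data <- tibble(
  x = factor(rep(1:2, each = 4)),
  y = factor(rep(c("a", "b"), times = 4)),
  z = 1:8
)

res <- data %>%
  group_by(x, y) %>%
  summarise(Mean_z = mean(z), .groups = "drop")

# One table per level of x, named after the level
tables <- split(res, res$x)

tables[["1"]]
tables[["2"]]

# Each element can then be printed separately, for example:
# for (tbl in tables) print(knitr::kable(tbl))
```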

How do I filter based on a count within summarise in order to use as part of other summarise functions?

I am looking to figure out how to filter after grouping my data within summarise. I have 2 created columns below. I'd ideally like to filter the seasonTotal column within summarise to a value of greater than 3, and then calculate the homeRunsPerSeason based on that filtered count.
Reprex below:
library(Lahman)
library(tidyverse)
data <- Lahman::Batting
data <- data %>%
  filter(yearID > 2015)
grouped_data <- data %>%
  group_by(playerID) %>%
  summarise(seasonTotal = n(),
            homeRunsPerSeason = sum(HR / seasonTotal)
  )
Separate each of the steps you want to accomplish. Calculate the season total, filter, then summarize.
grouped_data <- data %>%
  group_by(playerID) %>%
  mutate(seasonTotal = n()) %>%
  filter(seasonTotal > 3) %>%
  summarise(homeRunsPerSeason = sum(HR / seasonTotal))
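As a small variant of the same idea, the sketch below filters on the group size directly with n() inside a grouped filter(), so no helper column is needed before summarising. The tibble batting_toy is a hypothetical stand-in for the Lahman data, with one row per player-season.

```r
library(dplyr)

# Hypothetical data: p1 has four seasons, p2 only two
batting_toy <- tibble(
  playerID = c(rep("p1", 4), rep("p2", 2)),
  HR       = c(10, 20, 30, 40, 5, 5)
)

grouped_toy <- batting_toy %>%
  group_by(playerID) %>%
  filter(n() > 3) %>%                      # keep players with more than 3 rows
  summarise(
    seasonTotal       = n(),
    homeRunsPerSeason = sum(HR) / seasonTotal
  )

grouped_toy
```

Only p1 survives the filter, with 100 home runs over 4 seasons.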

Summarise with multiple conditions based on years

I would like to create a set of columns holding the paper count for each year, in effect applying a different filter condition per column inside summarise.
This is my code:
words_list <- data %>%
  select(Keywords, year) %>%
  unnest_tokens(word, Keywords) %>%
  filter(between(year,1990,2017)) %>%
  group_by(word) %>%
  summarise(papers_count = n()) %>%
  arrange(desc(papers_count))
The code above gives me two columns, 'word' and 'papers_count'. I would like to create more columns like papers_count (papers_count1990, papers_count1991, etc.), one for each year between 1990 and 2017.
I am looking for something like this:
words_list <- data %>%
  select(Keywords, year) %>%
  unnest_tokens(word, Keywords) %>%
  filter(between(year,1990,2017)) %>%
  group_by(word) %>%
  summarise(tot_papers_count = n(), papers_count_1991 = n()year="1991", ...) %>%
  arrange(desc(papers_count))
Does anybody have any suggestions?
I would suggest adding year to the group_by, and then using spread to create multiple summary columns.
library(tidyr)
words_list_by_year <- data %>%
  select(Keywords, year) %>%
  unnest_tokens(word, Keywords) %>%
  filter(between(year,1990,2017)) %>%
  group_by(year,word) %>%
  summarise(papers_count = n()) %>%
  spread(year,papers_count,fill=0)
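In current tidyr, spread() is superseded by pivot_wider(), which can also prefix the new column names. The sketch below applies the same reshape to a hypothetical, already tokenised table (words_toy, one row per word occurrence) and adds a total column for sorting; the data and the name words_wide are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the output of unnest_tokens()
words_toy <- tibble(
  word = c("soil", "soil", "water", "soil", "water", "water", "water"),
  year = c(1990, 1991, 1991, 1991, 1990, 1990, 1991)
)

words_wide <- words_toy %>%
  filter(between(year, 1990, 2017)) %>%
  count(word, year, name = "papers_count") %>%   # one row per word/year pair
  pivot_wider(
    names_from   = year,
    values_from  = papers_count,
    names_prefix = "papers_count_",
    values_fill  = 0L                            # years with no occurrence become 0
  ) %>%
  mutate(tot_papers_count = rowSums(across(starts_with("papers_count_")))) %>%
  arrange(desc(tot_papers_count))

words_wide
```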
