How to add by row (using babynames dataset) - r

I'm trying to create a ggplot using the babynames dataset that compares the percentage of girls and boys with a certain name over a range of years. I'm a little familiar with adding by column, which would look like babynames$boys + babynames$girls if I had one column with the number of girls with a certain name and another with the number of boys. I'm conceptually stuck, so all I have so far is:
babynames %>%
filter(name == "Jordan") %>%
group_by(year, sex) %>%
summarize(total = sum(n))

So you want the percentages?
Try:
babynames %>%
filter(name == "Jordan") %>%
group_by(year, sex) %>%
summarize(total = sum(n)) %>%
mutate(both = sum(total)) %>%
mutate(perc = total/both*100)
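To see what the pipeline above produces and how it could feed a plot, here is a minimal sketch. The `toy` tibble is made-up data in the same shape as babynames (`year`, `sex`, `name`, `n`), and the object names `jordan_pct` and `p` are only for illustration. It computes the total per year in an explicit `group_by(year)` step instead of relying on the grouping that `summarise()` leaves behind.

```r
library(dplyr)
library(ggplot2)

# Made-up counts shaped like babynames; not real data
toy <- tibble(
  year = c(2000, 2000, 2001, 2001),
  sex  = c("F", "M", "F", "M"),
  name = "Jordan",
  n    = c(25, 75, 40, 60)
)

jordan_pct <- toy %>%
  filter(name == "Jordan") %>%
  group_by(year, sex) %>%
  summarise(total = sum(n), .groups = "drop") %>%
  group_by(year) %>%                            # one group per year
  mutate(perc = total / sum(total) * 100) %>%   # share of each sex within the year
  ungroup()

# One line per sex, percentage on the y axis
p <- ggplot(jordan_pct, aes(x = year, y = perc, colour = sex)) +
  geom_line() +
  labs(y = "Share of babies named Jordan (%)")
```

Printing `p` draws the plot; with the real data you would replace `toy` with `babynames`.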

Related

R - Issue with Ranking and Grouping

I have the following question that I am trying to solve with R:
"For each year, first calculate the mean observed value for each country (to allow for settings where countries may have more than 1 value per year, note that this is true in this data set). Then rank countries by increasing MMR for each year.
Calculate the mean ranking across all years, extract the mean ranking for 10 countries with the lowest ranking across all years, and print the resulting table."
This is what I have so far:
dput(mmr)
tib2 <- mmr %>%
group_by(country, year) %>%
summarise(mean = mean(mmr)) %>%
arrange(mean) %>%
group_by(country)
tib2
My output is so close to where I need it to be, I just need to make each country have only one row (that has the mean ranking for each country).
Thank you!
Repeat the same summary a second time, but group by country only instead of by (country, year):
tib2 <- mmr %>%
group_by(country, year) %>%
summarise(mean_mmr = mean(mmr)) %>%
arrange(mean_mmr) %>%
group_by(country) %>%
# note: this averages the yearly mean MMR values, not the yearly ranks
summarise(mean_mmr = mean(mean_mmr)) %>%
arrange(mean_mmr) %>%
ungroup() %>%
slice_min(mean_mmr, n = 10)
tib2
Not sure without the data, but does this work?
tib2 <- mmr %>%
group_by(country, year) %>%
summarise(mean1 = mean(mmr)) %>%
ungroup() %>%
group_by(year) %>%
mutate(rank1 = rank(mean1)) %>%
ungroup() %>%
group_by(country) %>%
summarise(rank = mean(rank1))%>%
ungroup() %>%
arrange(rank) %>%
slice_head(n=10)
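To check the rank-then-average logic on something small, here is a sketch with made-up data. `toy_mmr` and its values are invented for illustration; only the column names `country`, `year` and `mmr` come from the question.

```r
library(dplyr)

# Made-up data: country A has two observations in 2000
toy_mmr <- tibble(
  country = c("A", "A", "B", "C", "A", "B", "C"),
  year    = c(2000, 2000, 2000, 2000, 2001, 2001, 2001),
  mmr     = c(10, 30, 50, 5, 20, 30, 10)
)

mean_rank <- toy_mmr %>%
  group_by(country, year) %>%
  summarise(mean_mmr = mean(mmr), .groups = "drop") %>%  # one value per country-year
  group_by(year) %>%
  mutate(yearly_rank = rank(mean_mmr)) %>%               # rank countries within each year
  group_by(country) %>%
  summarise(mean_rank = mean(yearly_rank)) %>%           # average the ranks across years
  slice_min(mean_rank, n = 2)                            # keep the lowest mean ranks
```

Note that `slice_min()` keeps ties by default, so it can return more than `n` rows; pass `with_ties = FALSE` if exactly `n` rows are required.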

Having trouble using the filter function in R

The question I am given to answer is:
In the state of Maryland, which two counties occur most frequently in the dataset?
My data set has a column called 'State' that contains state abbreviations. I am having trouble displaying the frequency of only the counties that are in Maryland.
This is what I have so far:
hw1_dataset_for_msully56 %>%
filter(State == MD) %>%
group_by(County) %>%
summarise(n = n()) %>%
arrange(-n)
We need quotes around MD, i.e. "MD". Without them, R looks for an object named MD instead of comparing against the string:
hw1_dataset_for_msully56 %>%
filter(State == "MD") %>%
group_by(County) %>%
summarise(n = n()) %>%
arrange(-n)
Also, instead of group_by/summarise, it can be simplified with count
hw1_dataset_for_msully56 %>%
filter(State == 'MD') %>%
count(County) %>%
arrange(-n)
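As a further shortening, `count()` has a `sort` argument, so the separate `arrange()` is optional. A small sketch, assuming made-up rows (`toy` below is invented; only the `State` and `County` column names come from the question):

```r
library(dplyr)

# Made-up rows for illustration only
toy <- tibble(
  State  = c("MD", "MD", "MD", "MD", "MD", "MD", "VA", "VA"),
  County = c("Howard", "Howard", "Howard", "Calvert", "Calvert",
             "Kent", "Fairfax", "Fairfax")
)

top_two <- toy %>%
  filter(State == "MD") %>%         # string comparison, hence the quotes
  count(County, sort = TRUE) %>%    # counts per county, largest first
  slice_head(n = 2)                 # the two most frequent counties
```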

Extract all columns of nested tibble data based on condition

For example, I want to extract all variables from the row with the minimal value of one variable (i.e. year, in gapminder nested by country) and add them to the nested tibble.
library(tidyverse)
library(gapminder)
data("gapminder")
gap_nested <- gapminder %>%
nest(data = -country) %>%
mutate(year = map(data, ~ min(.x$year)))
How do I do this?
You can use the filter function from the dplyr package (included in tidyverse), as in this example:
gap_nested <- gapminder %>%
nest(data = -country) %>%
# map_int returns a plain integer column instead of a list column
mutate(year = map_int(data, ~ min(.x$year))) %>%
filter(year == 1952)
This returns only the countries whose minimum year equals 1952, the first year recorded in gapminder.
Hope this helps.
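If the goal is instead to pull every variable from each country's earliest row and add them as columns next to the nested data, one possible approach is `slice_min()` inside `map()` followed by `unnest()`. This is only a sketch under that reading of the question; `toy_gap` is made-up data and `first_obs` is a hypothetical column name.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Made-up data shaped like a small slice of gapminder
toy_gap <- tibble(
  country = c("A", "A", "B", "B"),
  year    = c(1952L, 1957L, 1962L, 1957L),
  lifeExp = c(40, 42, 55, 53),
  pop     = c(100, 110, 200, 190)
)

gap_nested <- toy_gap %>%
  nest(data = -country) %>%
  # for each country, keep the single row with the smallest year
  mutate(first_obs = map(data, ~ slice_min(.x, year, n = 1, with_ties = FALSE))) %>%
  # spread that row's variables into regular columns
  unnest(first_obs)
```

The result keeps the `data` list column and gains `year`, `lifeExp` and `pop` taken from the earliest observation of each country.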

Iterating over multiple lists using purrr::map

Below is my code
library(gapminder)
library(tidyverse)
lst <- unique(gapminder$continent)
ylst = c(2007, 1952)
map2_dfr(lst, ylst, ~ gapminder %>%
filter(continent == .x & year == .y) %>%
arrange(desc(gdpPercap)) %>%
slice(1) %>%
select(continent, country, gdpPercap, year))
The data is the gapminder data from the R library 'gapminder'.
I want to find the country with the highest gdpPercap for each year for each continent using purrr.
However, this code gives me an error saying that the lengths of my two lists are not the same.
What is the map syntax to iterate over two lists, when the lengths are not the same? And how should I use that to fix the code and achieve my objective?
I would do this by grouping and nesting:
gapminder %>%
filter(year %in% ylst) %>%
group_by(continent, year) %>%
nest() %>%
mutate(data=map(data, ~top_n(., 1, gdpPercap))) %>%
unnest(c(data)) %>%
select(continent, country,gdpPercap,year)
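On the map syntax itself: `map2()` walks its two inputs in parallel, so they need the same length (or length 1). To visit every continent/year combination you could first build all pairs, for example with `tidyr::expand_grid()`, and then map over the two resulting columns. A sketch with made-up data (`toy_gap`, `combos` and `top_gdp` are illustrative names, and the values are invented):

```r
library(dplyr)
library(purrr)
library(tidyr)

# Made-up data shaped like gapminder
toy_gap <- tibble(
  continent = c("X", "X", "X", "X", "Y", "Y", "Y", "Y"),
  country   = c("a", "b", "a", "b", "c", "d", "c", "d"),
  year      = c(1952, 1952, 2007, 2007, 1952, 1952, 2007, 2007),
  gdpPercap = c(1, 2, 4, 3, 10, 9, 11, 12)
)

# Every continent paired with every year: two equal-length columns
combos <- expand_grid(
  continent = unique(toy_gap$continent),
  year      = c(2007, 1952)
)

top_gdp <- map2(combos$continent, combos$year, function(cont, yr) {
  toy_gap %>%
    filter(continent == cont, year == yr) %>%
    slice_max(gdpPercap, n = 1, with_ties = FALSE) %>%
    select(continent, country, gdpPercap, year)
}) %>%
  bind_rows()
```

With the real gapminder data, where `continent` is a factor, wrapping it in `as.character()` before `unique()` avoids comparing factors inside `filter()`.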

Summarise with multiple conditions based on years

I would like to create a set of columns holding the paper count for each year, i.e. apply multiple filter conditions inside a single dplyr summarise:
This is my code:
words_list <- data %>%
select(Keywords, year) %>%
unnest_tokens(word, Keywords) %>%
filter(between(year,1990,2017)) %>%
group_by(word) %>%
summarise(papers_count = n()) %>%
arrange(desc(papers_count))
The code above gives me two columns, 'word' and 'papers_count'. I would like to create more columns like papers_count (papers_count1990, papers_count1991, etc.), one for each year between 1990 and 2017.
I am looking for something like this:
words_list <- data %>%
select(Keywords, year) %>%
unnest_tokens(word, Keywords) %>%
filter(between(year,1990,2017)) %>%
group_by(word) %>%
summarise(tot_papers_count = n(), papers_count_1991 = sum(year == 1991), ...) %>%
arrange(desc(tot_papers_count))
Does anybody have a suggestion?
I would suggest adding year to the group_by, and then using spread to create multiple summary columns.
library(tidyr)
words_list_by_year <- data %>%
select(Keywords, year) %>%
unnest_tokens(word, Keywords) %>%
filter(between(year,1990,2017)) %>%
group_by(year,word) %>%
summarise(papers_count = n()) %>%
spread(year,papers_count,fill=0)
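In current tidyr, `spread()` is superseded by `pivot_wider()`, which can also add the column prefix the question asks for. A sketch, assuming the tokens have already been produced by `unnest_tokens()`; `toy_words` stands in for that result and is made-up data:

```r
library(dplyr)
library(tidyr)

# Made-up tokens: one row per (word, paper year)
toy_words <- tibble(
  word = c("soil", "soil", "water", "soil", "water", "water", "water"),
  year = c(1990, 1990, 1990, 1991, 1991, 1991, 1991)
)

words_by_year <- toy_words %>%
  count(word, year, name = "papers_count") %>%   # same as group_by + summarise(n())
  pivot_wider(
    names_from   = year,
    values_from  = papers_count,
    names_prefix = "papers_count_",              # gives papers_count_1990, ...
    values_fill  = 0                             # years with no papers become 0
  ) %>%
  mutate(tot_papers_count = rowSums(across(starts_with("papers_count_")))) %>%
  arrange(desc(tot_papers_count))
```

This gives one row per word, one `papers_count_<year>` column per year, and a total column to sort by.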
