Summarise with multiple conditions based on years

Summarise with multiple conditions based on years - r

I would like to create a set of columns based on papers count for each number of year, therefore filtering multiple conditions in dplyr through summarise:
This is my code:
words_list <- data %>%
select(Keywords, year) %>%
unnest_tokens(word, Keywords) %>%
filter(between(year,1990,2017)) %>%
group_by(word) %>%
summarise(papers_count = n()) %>%
arrange(desc(papers_count))
The code above gives me two columns, 'word' and 'papers_count', I would like to create more columns like papers_count (papers_count1990, papers_count1991, etc..) based on each year between 1990 and 2017.
I Am looking for something like ths:
words_list <- data %>%
select(Keywords, year) %>%
unnest_tokens(word, Keywords) %>%
filter(between(year,1990,2017)) %>%
group_by(word) %>%
summarise(tot_papers_count = n(), papers_count_1991 = n()year="1991", ...) %>%
arrange(desc(papers_count))
please does anybody have any suggestion?

I would suggest adding year to the group_by, and then using spread to create multiple summary columns.
library(tidyr)
words_list_by_year <- data %>%
select(Keywords, year) %>%
unnest_tokens(word, Keywords) %>%
filter(between(year,1990,2017)) %>%
group_by(year,word) %>%
summarise(papers_count = n()) %>%
spread(year,papers_count,fill=0)

Related

How do you efficiently group by multiple columns in dplyr

With dplyr you can group by columns like this:
library(dplyr)
df <- data.frame(a=c(1,2,1,3,1,4,1,5), b=c(2,3,4,1,2,3,4,5))
df %>%
group_by(a) %>%
summarise(count = n())
If I want to group by two columns all the guides say:
df %>%
group_by(a,b) %>%
summarise(count = n())
But can I not feed the group_by() parameters more efficiently somehow, rather than having to type them in explicitly, e.g. like:
cols = colnames(df)
df %>%
group_by(cols) %>%
summarise(count = n())
I have examples where I want to group by 10+ columns, and it is pretty horrible to write it out if you can just parse their names.

across and curly-curly is the answer (even though it doesn't make sense to group_by using all your columns)
cols = colnames(df)
df %>%
group_by(across({{cols}}) %>%
summarise(count = n())

You can use across with any of the tidy selectors. For example if you want all columns
df %>%
group_by(across(everything())) %>%
summarise(count = n())
Of if you want a list
cols <- c("a","b")
df %>%
group_by(across(all_of(cols))) %>%
summarise(count = n())
See help("language", package="tidyselect") for all the selection options.

Unique rows based on two logical conditions

I want my dataframe to return unique rows based on two logical conditions (OR not AND).
But when I ran this, df %>% group_by(sex) %>% distinct(state, education) %>% summarise(n=n()) I got deduplicated rows based on the two conditions joined by AND not OR.
Is there a way to get something like this df %>% group_by(sex) %>% distinct(state | education) %>% summarise(n=n()) so that the deduplicated rows will be joined by OR not AND?
Thank you.

You can use tidyr::pivot_longer and then distinct afterwards:
df %>%
pivot_longer(c(state, education), names_to = "type", values_to = "value")
group_by(sex) %>%
distinct(value) %>%
summarise(n = n())
In this case, pivot_longer simply puts state and education into one column called value.

R - Issue with Ranking and Grouping

I have the following question that I am trying to solve with R:
"For each year, first calculate the mean observed value for each country (to allow for settings where countries may have more than 1 value per year, note that this is true in this data set). Then rank countries by increasing MMR for each year.
Calculate the mean ranking across all years, extract the mean ranking for 10 countries with the lowest ranking across all years, and print the resulting table."
This is what I have so far:
dput(mmr)
tib2 <- mmr %>%
group_by(country, year) %>%
summarise(mean = mean(mmr)) %>%
arrange(mean) %>%
group_by(country)
tib2
My output is so close to where I need it to be, I just need to make each country have only one row (that has the mean ranking for each country).
Here is the result:
Output
Thank you!

Just repeat the same analysis, but instead of grouping by (country, year), just group by country:
tib2 <- mmr %>%
group_by(country, year) %>%
summarise(mean_mmr = mean(mmr)) %>%
arrange(mean) %>%
group_by(country) %>%
summarise(mean_mmr = mean(mean_mmr)) %>%
arrange(mean_mmr) %>%
ungroup() %>%
slice_min(n=10)
tib2

Not sure without the data, but does this work?
tib2 <- mmr %>%
group_by(country, year) %>%
summarise(mean1 = mean(mmr)) %>%
ungroup() %>%
group_by(year) %>%
mutate(rank1 = rank(mean1)) %>%
ungroup() %>%
group_by(country) %>%
summarise(rank = mean(rank1))%>%
ungroup() %>%
arrange(rank) %>%
slice_head(n=10)

summing up the first 5 elements of a list

I have a data frame that contains a column with varying numbers of integer values. I need to take the first five of these values and sum them up. I found a way to do it for one, but can't seem to generalize it to loop through all:
Here is the code for the first element:
results$occupied[1] %>%
strsplit(",") %>%
as.list() %>%
unlist() %>%
head(5) %>%
as.numeric() %>%
sum()
And what does not work for all elements:
results %>%
rowwise() %>%
select(occupied) %>%
as.character() %>%
strsplit(",") %>%
as.list() %>%
unlist() %>%
head(5) %>%
as.numeric() %>%
sum()

In base R, you can do :
sapply(strsplit(results$occupied, ","), function(x) sum(as.numeric(head(x, 5))))
Or using dplyr and purrr
library(dplyr)
library(purrr)
results %>%
mutate(total_sum = map_dbl(strsplit(occupied, ","),
~sum(as.numeric(head(.x, 5)))))
Similarly, using rowwise :
results %>%
rowwise() %>%
mutate(total_sum = sum(as.numeric(head(strsplit(occupied, ",")[[1]], 5))))

We can use separate_rows to split the 'occupied' column and expand the rows, then do a group by row number and get the sum of the first five elements
library(dplyr)
library(tidyr)
results %>%
mutate(rn = row_number()) %>%
separate_rows(occupied, convert = TRUE) %>%
group_by(rn) %>%
slice(seq_len(5)) %>%
summmarise(total_sum = sum(occupied)) %>%
select(-rn) %>%
bind_cols(results, .)

How to add by row (using babynames dataset)

I'm trying to create a ggplot of using the babynames dataset which shows a comparison between the percentage of girls and boys that have a certain name over a range of years. I'm a little familiar with adding by column which would look like babynames$boys + babynames$girls if I created a column with the number of girls with a certain name and a column of boys with a certain name. I'm a bit conceptually stuck so far so I just have:
babynames %>%
filter(name == "Jordan") %>%
group_by(year, sex) %>%
summarize(total = sum(n))

So you want the percentages?
Try:
babynames %>%
filter(name == "Jordan") %>%
group_by(year, sex) %>%
summarize(total = sum(n)) %>%
mutate(both = sum(total)) %>%
mutate(perc = total/both*100)

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Summarise with multiple conditions based on years - r

Related

How do you efficiently group by multiple columns in dplyr

Unique rows based on two logical conditions

R - Issue with Ranking and Grouping

summing up the first 5 elements of a list

How to add by row (using babynames dataset)

Categories

Resources