Finding the largest year interval in R - r

Suppose I have 10 years of data, with a name associated with each year, like the following:
Name Year
A 1990
B 1991
C 1992
A 1993
A 1994
.
.
.
I want to find the name that has been out of use for the longest time.
Can anybody help me with how to do this?

Using dplyr:
library(dplyr)
mutate(your_data, max_year = max(Year)) %>%
group_by(Name) %>%
summarize(most_recent = max(Year),
unused_length = first(max_year) - most_recent) %>%
ungroup() %>%
arrange(most_recent)
This will order the names by their most recent use, with the oldest most recent use first.
If you only care about the single name that has been out of use the longest, you just need the first row of the result. Add slice(1) to the chain like so:
mutate(your_data, max_year = max(Year)) %>%
group_by(Name) %>%
summarize(most_recent = max(Year),
unused_length = first(max_year) - most_recent) %>%
ungroup() %>%
arrange(most_recent) %>%
slice(1)
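As a quick check of the chain above, here is a minimal sketch on made-up data. The values in `your_data` are hypothetical and only extend the sample from the question.

```r
library(dplyr)

# Hypothetical sample data in the same shape as the question
your_data <- data.frame(
  Name = c("A", "B", "C", "A", "A", "C"),
  Year = c(1990, 1991, 1992, 1993, 1994, 1995)
)

result <- your_data %>%
  mutate(max_year = max(Year)) %>%      # latest year in the whole data set
  group_by(Name) %>%
  summarize(most_recent = max(Year),    # last year each name was used
            unused_length = first(max_year) - most_recent) %>%
  arrange(most_recent)                  # oldest "most recent use" first

result

# The name that has gone unused the longest is the first row
longest_unused <- slice(result, 1)
longest_unused
```

With this data, B was last used in 1991 and the data runs to 1995, so B comes out first with an unused_length of 4.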

Related

Getting rid of NA values in R when trying to aggregate columns

I'm trying to aggregate this df by the last value in each corresponding country observation. For some reason, the last value that is added to the tibble is not correct.
aggre_data <- combined %>%
group_by(location) %>%
summarise(Last_value_vacc = last(people_vaccinated_per_hundred))
aggre_data
I believe it has something to do with all of the NA values throughout the df. However I did try:
aggre_data <- combined %>%
group_by(location) %>%
summarise(Last_value_vacc = last(people_vaccinated_per_hundred(na.rm = TRUE)))
aggre_data
Update:
combined %>%
group_by(location) %>%
arrange(date, .by_group = TRUE) %>% # sort within each group (by date, or whichever column defines "last")
summarise(Last_value_vacc = last(na.omit(people_vaccinated_per_hundred)))
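A small sketch of why the NA values matter, using a hypothetical `combined` data frame where the most recent rows of each country are missing. The column names follow the question; the values are invented.

```r
library(dplyr)

# Hypothetical data: the latest observations for each location are NA
combined <- data.frame(
  location = c("X", "X", "X", "Y", "Y", "Y"),
  date = as.Date(c("2021-01-01", "2021-01-02", "2021-01-03",
                   "2021-01-01", "2021-01-02", "2021-01-03")),
  people_vaccinated_per_hundred = c(1.0, 2.5, NA, 0.4, NA, NA)
)

aggre_data <- combined %>%
  group_by(location) %>%
  arrange(date, .by_group = TRUE) %>%
  summarise(
    # last() alone returns the final row's value, which is NA here
    last_raw = last(people_vaccinated_per_hundred),
    # dropping NAs first returns the last value that was actually reported
    Last_value_vacc = last(na.omit(people_vaccinated_per_hundred))
  )

aggre_data
```

Here last_raw is NA for both locations, while Last_value_vacc picks up 2.5 for X and 0.4 for Y.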

R - Issue with Ranking and Grouping

I have the following question that I am trying to solve with R:
"For each year, first calculate the mean observed value for each country (to allow for settings where countries may have more than 1 value per year, note that this is true in this data set). Then rank countries by increasing MMR for each year.
Calculate the mean ranking across all years, extract the mean ranking for 10 countries with the lowest ranking across all years, and print the resulting table."
This is what I have so far:
dput(mmr)
tib2 <- mmr %>%
group_by(country, year) %>%
summarise(mean = mean(mmr)) %>%
arrange(mean) %>%
group_by(country)
tib2
My output is so close to where I need it to be, I just need to make each country have only one row (that has the mean ranking for each country).
Here is the result:
Output
Thank you!
Just repeat the same analysis, but instead of grouping by (country, year), just group by country:
tib2 <- mmr %>%
group_by(country, year) %>%
summarise(mean_mmr = mean(mmr)) %>%
arrange(mean_mmr) %>%
group_by(country) %>%
summarise(mean_mmr = mean(mean_mmr)) %>%
arrange(mean_mmr) %>%
ungroup() %>%
slice_min(mean_mmr, n = 10)
tib2
Not sure without the data, but does this work?
tib2 <- mmr %>%
group_by(country, year) %>%
summarise(mean1 = mean(mmr)) %>%
ungroup() %>%
group_by(year) %>%
mutate(rank1 = rank(mean1)) %>%
ungroup() %>%
group_by(country) %>%
summarise(rank = mean(rank1))%>%
ungroup() %>%
arrange(rank) %>%
slice_head(n=10)
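Since the original data was not posted, here is a sketch of the rank-then-average approach on a tiny, hypothetical `mmr` data frame (three countries, two years, one country with two values in a single year). The numbers are invented for illustration.

```r
library(dplyr)

# Hypothetical data: country A has two observations in 2000
mmr <- data.frame(
  country = c("A", "A", "B", "C", "A", "B", "C"),
  year    = c(2000, 2000, 2000, 2000, 2001, 2001, 2001),
  mmr     = c(10, 30, 50, 5, 40, 20, 10)
)

mean_rank <- mmr %>%
  group_by(country, year) %>%
  summarise(mean1 = mean(mmr), .groups = "drop") %>%  # one value per country-year
  group_by(year) %>%
  mutate(rank1 = rank(mean1)) %>%                     # rank countries within each year
  group_by(country) %>%
  summarise(rank = mean(rank1)) %>%                   # average the ranks across years
  arrange(rank)

mean_rank
```

In this toy data C ranks first in both years (mean rank 1), while A and B each end up with a mean rank of 2.5. On the real data, slice_head(n = 10) after arrange() keeps the ten lowest mean rankings.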

Reclassify attributes that are less than x% of total as 'other'

Okay so I have data as so:
ID Name Job
001 Bill Carpenter
002 Wilma Lawyer
003 Greyson Lawyer
004 Eddie Janitor
I want to group these together for analysis, so that any job that makes up less than x percent of the whole is grouped into "Other".
How can I do this? Here is what I tried:
df %>%
group_by(Job) %>%
summarize(count = n()) %>%
mutate(pct = count/sum(count)) %>%
arrange(desc(count)) %>%
drop_na()
Now I know what the percentages are, but how do I integrate this into the original data so that everything below x becomes "Other"? (Let's say less than or equal to 25% is "Other".)
Maybe there's a more straightforward way....
You can try this:
library(dplyr)
df %>%
count(Job) %>%
mutate(n = n/sum(n)) %>%
left_join(df, by = 'Job') %>%
mutate(Job = replace(Job, n <= 0.25, 'Other'))
To integrate the calculation into the original data, we do a left_join and then replace the values.
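A join-free variant of the same idea is sketched below. This is a different technique from the left_join above: it uses dplyr's add_count() to attach the per-job count to every row. The data frame reproduces the sample from the question; `job_n` and `pct` are helper column names chosen here for illustration.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  ID   = c("001", "002", "003", "004"),
  Name = c("Bill", "Wilma", "Greyson", "Eddie"),
  Job  = c("Carpenter", "Lawyer", "Lawyer", "Janitor"),
  stringsAsFactors = FALSE
)

lumped <- df %>%
  add_count(Job, name = "job_n") %>%                  # rows per job, repeated on each row
  mutate(pct = job_n / n(),                           # share of all rows
         Job = if_else(pct <= 0.25, "Other", Job)) %>%
  select(-job_n, -pct)                                # drop the helper columns

lumped
```

With a threshold of 25%, Carpenter and Janitor (one row out of four each) become "Other", and Lawyer (two out of four) is kept. The original row order is preserved because nothing is joined or summarised.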

Creating a funnel using a pivot table in R considering NA column

I have the following dataset:
library(tidyverse)
dataset <- data.frame(id = c(121,122,123,124,125),
segment = c("A","B","B","A",NA),
Web = c(1,1,1,1,1),
Tryout = c(1,1,1,0,1),
Purchase = c(1,0,1,0,0),
stringsAsFactors = FALSE)
As you can see, this table describes a funnel: from web visits (the number of rows), to tryouts, to purchases. A useful view of this funnel would be:
Step Total A B NA
Web 5 2 2 1
Tryout 4 1 2 1
Purchase 2 1 1 0
So I tried row by row doing this. The web views code is:
dataset %>% mutate(segment = ifelse(is.na(segment), "NA", segment)) %>%
group_by(segment) %>% summarise(Total = n()) %>%
ungroup() %>% spread(segment, Total) %>% mutate(Total = `A` + `B` + `NA`) %>%
select(Total,A,B,`NA`)
This worked fine, except that I have to add the row name manually. For the other steps, such as tryout and purchase, is there a way to do it all in one simpler piece of code, without binding rows? Keep in mind that this is an example and I have many columns, so any help will be greatly appreciated.
Here is one option: remove the 'id' column and convert the data to 'long' format; grouped by 'name', get the sum of 'value' as 'Total'; then, grouped by 'name', 'segment' and 'Total', do the second sum; finally, get the distinct rows and pivot back to 'wide' format.
library(dplyr)
library(tidyr)
dataset %>%
select(-id) %>%
pivot_longer(cols = -segment) %>%
group_by(name) %>%
mutate(Total = sum(value)) %>%
group_by(name, segment, Total) %>%
mutate(n = sum(value)) %>%
ungroup %>%
select(-value) %>%
distinct %>%
pivot_wider(names_from = segment, values_from = n)
# A tibble: 3 x 5
# name Total A B `NA`
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 Web 5 2 2 1
#2 Tryout 4 1 2 1
#3 Purchase 2 1 1 0
dataset %>%
select(-id) %>%
group_by(segment) %>%
summarise_all(sum) %>%
gather(Step, val, -segment) %>%
spread(segment, val) %>%
mutate(Total = rowSums(.[,-1]))
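For comparison, a shorter sketch that aims for the same funnel table with a single grouped summarise between pivot_longer() and pivot_wider(). It is an alternative arrangement rather than either answer's exact method, and it assumes a dplyr version with across() (1.0 or later). The `dataset` below is the one from the question.

```r
library(dplyr)
library(tidyr)

dataset <- data.frame(id = c(121, 122, 123, 124, 125),
                      segment = c("A", "B", "B", "A", NA),
                      Web = c(1, 1, 1, 1, 1),
                      Tryout = c(1, 1, 1, 0, 1),
                      Purchase = c(1, 0, 1, 0, 0),
                      stringsAsFactors = FALSE)

funnel <- dataset %>%
  select(-id) %>%
  pivot_longer(cols = -segment, names_to = "Step", values_to = "value") %>%
  group_by(Step, segment) %>%
  summarise(n = sum(value), .groups = "drop") %>%     # count per step and segment
  pivot_wider(names_from = segment, values_from = n) %>%
  mutate(Total = rowSums(across(-Step)))              # sum across the segment columns

funnel
```

Note that the rows come out in alphabetical order of Step (Purchase, Tryout, Web) rather than funnel order, so a final arrange(desc(Total)) or a factor with explicit levels may be wanted.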

Obtain more variables after grouping, summarising with select (dplyr)

My data frame:
date | weekday | price
2018 | 1 | 25
2018 | 1 | 35
2019 | 2 | 40
I try to run this code under dplyr:
pi %>%
group_by(date) %>%
group_by(date) %>%
summarise(price = sum(price, na.rm = T)) %>%
select(price, date, weekday) %>%
print()
It doesn't work.
Any solution? Thanks in advance
Follow the order: select-->group_by-->summarise
df %>%
select(price, date, weekday) %>%
group_by(date, weekday) %>%
summarise(sum(price, na.rm = TRUE))
People are correctly suggesting to group_by date and weekday, but if you have a lot of columns, that could be a pain to write out. Here's another idiom I frequently use for data.frames with lots of columns:
pi %>%
group_by(date) %>%
mutate(price = sum(price, na.rm = T)) %>%
filter(row_number() == 1)
This keeps the first value of every other column within each group, without having to write the columns out explicitly.
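A minimal sketch of that idiom on the question's sample rows. The data frame is named `prices` here (a hypothetical name, used instead of `pi` to avoid masking R's built-in constant).

```r
library(dplyr)

# Sample rows from the question
prices <- data.frame(
  date    = c(2018, 2018, 2019),
  weekday = c(1, 1, 2),
  price   = c(25, 35, 40)
)

by_date <- prices %>%
  group_by(date) %>%
  mutate(price = sum(price, na.rm = TRUE)) %>%  # group total written onto every row
  filter(row_number() == 1) %>%                 # keep one row per group
  ungroup()

by_date
```

This gives one row per date (2018 with price 60, 2019 with price 40) and carries weekday along. The idiom is only safe when the other columns are constant within a group, or when the first value is an acceptable representative.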
