Okay, so I have data like this:
ID Name Job
001 Bill Carpenter
002 Wilma Lawyer
003 Greyson Lawyer
004 Eddie Janitor
I want to group these together for analysis, so any job that appears in less than x percent of the whole will be grouped into "Other".
How can I do this? Here is what I tried:
df %>%
group_by(Job) %>%
summarize(count = n()) %>%
mutate(pct = count/sum(count)) %>%
arrange(desc(count)) %>%
drop_na()
Now I know what the percentages are, but how do I integrate this into the original data to make everything at or below X "Other"? (Let's say less than or equal to 25% is "Other".)
Maybe there's a more straightforward way....
You can try this:
library(dplyr)
df %>%
count(Job) %>%
mutate(n = n/sum(n)) %>%
left_join(df, by = 'Job') %>%
mutate(Job = replace(Job, n <= 0.25, 'Other'))
To integrate the calculation into the original data, we do a left_join back onto df and then replace the values.
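As an aside, if you're open to adding forcats, fct_lump_prop() collapses rare levels in one step; a minimal sketch (check how its cutoff handles the exact 25% boundary against your "less than or equal to" rule):
library(dplyr)
library(forcats)
# Lump any Job whose share of rows falls below the prop cutoff into "Other"
df %>%
mutate(Job = fct_lump_prop(Job, prop = 0.25, other_level = "Other"))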
I have the following question that I am trying to solve with R:
"For each year, first calculate the mean observed value for each country (to allow for settings where countries may have more than 1 value per year, note that this is true in this data set). Then rank countries by increasing MMR for each year.
Calculate the mean ranking across all years, extract the mean ranking for 10 countries with the lowest ranking across all years, and print the resulting table."
This is what I have so far:
dput(mmr)
tib2 <- mmr %>%
group_by(country, year) %>%
summarise(mean = mean(mmr)) %>%
arrange(mean) %>%
group_by(country)
tib2
My output is so close to where I need it to be; I just need to make each country have only one row (containing that country's mean ranking).
Here is the result:
[screenshot of the output]
Thank you!
Just repeat the same analysis, but after computing the per-(country, year) means, group by country alone and summarise again:
tib2 <- mmr %>%
group_by(country, year) %>%
summarise(mean_mmr = mean(mmr)) %>%
arrange(mean_mmr) %>%
group_by(country) %>%
summarise(mean_mmr = mean(mean_mmr)) %>%
arrange(mean_mmr) %>%
ungroup() %>%
slice_min(mean_mmr, n = 10)
tib2
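One note on the last step: slice_min() needs to be told which column to order by (here mean_mmr), and with the default with_ties = TRUE it can return more than 10 rows if countries tie on the mean; pass with_ties = FALSE if you want exactly 10.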
Not sure without the data, but does this work?
tib2 <- mmr %>%
group_by(country, year) %>%
summarise(mean1 = mean(mmr)) %>%
ungroup() %>%
group_by(year) %>%
mutate(rank1 = rank(mean1)) %>%
ungroup() %>%
group_by(country) %>%
summarise(rank = mean(rank1)) %>%
ungroup() %>%
arrange(rank) %>%
slice_head(n=10)
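Without the real data, a tiny made-up mmr tibble (hypothetical values; column names taken from the question) is enough to sanity-check the pipeline, since it has more than one observation per country-year:
library(dplyr)
# Hypothetical toy data: three countries, two years, repeated observations per year
mmr <- tibble(
  country = rep(c("A", "B", "C"), each = 3),
  year = rep(c(2000, 2000, 2001), times = 3),
  mmr = c(10, 12, 11, 5, 7, 6, 20, 22, 21)
)
Running either pipeline above on it should give one averaged rank per country.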
I would like to group a dataframe by 4 variables, summarise it with a count, and then calculate the percentage of counts each row accounts for compared to the total counts in each group of the 1st variable. As a last step, I calculate a cumulative percentage and assign the row to a category based on certain thresholds.
A simple example first:
library(nycflights13)
library(dplyr)
test <- flights %>%
left_join(airlines, by = c('carrier'), na_matches = "never") %>%
group_by(origin, name) %>%
summarise(count_flights = n()) %>%
arrange(origin, desc(count_flights)) %>%
mutate(prop = prop.table(count_flights) * 100,
       cumprop = cumsum(prop),
       ABC = cut(cumprop, c(0, 80, 95, 100), labels = c('A', 'B', 'C')))
This works fine: I get the number of flights per NYC airport and carrier, along with the percentage each row accounts for in relation to the airport total.
Now, this does not work when grouping by 2 more variables:
test2 <- flights %>%
left_join(airlines, by = c('carrier'), na_matches = "never") %>%
group_by(origin, name, dest, day) %>%
summarise(count_flights = n()) %>%
arrange(origin, desc(count_flights)) %>%
mutate(prop = prop.table(count_flights) * 100,
       cumprop = cumsum(prop),
       ABC = cut(cumprop, c(0, 80, 95, 100), labels = c('A', 'B', 'C')))
What I expect is for the cumsum to reach 100 just before a change of airport/origin, or, put another way, for the percentage of each row to be calculated against the total flights of that airport.
Any thoughts?
The best way to do this is to group_by the variables that you want to use for the new, less specific bucket (origin), and then divide the count by the total count in a mutate:
flights %>%
left_join(airlines, by = c('carrier'), na_matches = "never") %>%
group_by(origin, name, dest, day) %>%
summarise(count_flights = n()) %>%
arrange(origin) %>%
group_by(origin) %>%
mutate(prop = count_flights / sum(count_flights),
       cumprop = cumsum(prop))
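If you also want the cumulative ABC buckets from your original code, the same prop.table-style scaling and cut() call should drop straight into the grouped mutate; a sketch (proportions multiplied by 100 so the 0/80/95/100 thresholds still apply):
library(nycflights13)
library(dplyr)
flights %>%
left_join(airlines, by = c('carrier'), na_matches = "never") %>%
group_by(origin, name, dest, day) %>%
summarise(count_flights = n()) %>%
arrange(origin, desc(count_flights)) %>%
group_by(origin) %>%
mutate(prop = count_flights / sum(count_flights) * 100,
       cumprop = cumsum(prop),
       ABC = cut(cumprop, c(0, 80, 95, 100), labels = c('A', 'B', 'C')))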
I commonly need to compute differences between groups, nested by some interval and/or additional grouping. For computing a single variable, this is easy to accomplish with spread and mutate. Here's a reproducible example with the dataset ChickWeight; don't get distracted by the calculation itself (this is just a toy example), my question is about how to handle a dataset structured like the dataframe ChickSum created below.
# reproducible dataset
data(ChickWeight)
ChickSum = ChickWeight %>%
filter(Time == max(Time) | Time == min(Time)) %>%
group_by(Diet, Time) %>%
summarize(mean.weight = mean(weight)) %>%
ungroup()
Here is how I might go about calculating the change in average chick weight between the first and last time, stratified by diet:
# Compute change in mean weight between first and last time
ChickSum %>%
spread(Time, mean.weight) %>%
mutate(weight.change = `21` - `0`)
However, this doesn't work so well with multiple variables:
ChickSum2 = ChickWeight %>%
filter(Time == max(Time) | Time == min(Time)) %>%
group_by(Diet, Time) %>%
# now also compute variable "count"
summarize(count = n(), mean.weight = mean(weight)) %>%
ungroup()
I can't spread by Time and both count and mean.weight; my current solution is to do two spread-mutate operations---once for count and again for mean.weight---and then join the results.
ChickCountChange = ChickSum2 %>%
select(-mean.weight) %>%
spread(Time, count) %>%
mutate(count.change = `21` - `0`)
ChickWeightChange = ChickSum2 %>%
select(-count) %>%
spread(Time, mean.weight) %>%
mutate(weight.change = `21` - `0`)
full_join(
  select(ChickWeightChange, Diet, weight.change),
  select(ChickCountChange, Diet, count.change),
  by = "Diet")
Is there another approach to these types of computation? I've been trying to conceive of a strategy that combines group_by and purrr::pmap in order to avoid spread but still maintain the advantages of the above approach (such as spread's fill argument for choosing how to handle missing group combinations), but I haven't figured it out. I'm open to suggestions or alternative data structures/ways of thinking about the problem.
You might try re-grouping, then using lag() to calculate the differences; because summarize() returns rows ordered by the grouping variables (Diet, then Time), lag() within each Diet picks up the Time == 0 row, so each difference is last minus first. Works for your toy example, but it may be better to see some of your real dataset:
ChickWeight %>%
filter(Time == max(Time) | Time == min(Time)) %>%
group_by(Diet, Time) %>%
# now also compute variable "count"
summarize(count = n(), mean.weight = mean(weight)) %>%
ungroup() %>%
group_by(Diet) %>%
mutate(count.change = count - lag(count),
       weight.change = mean.weight - lag(mean.weight)) %>%
filter(Time == max(Time))
Result:
Diet Time count mean.weight count.change weight.change
<fct> <dbl> <int> <dbl> <int> <dbl>
1 1 21 16 178. -4 136.
2 2 21 10 215. 0 174
3 3 21 10 270. 0 230.
4 4 21 9 239. -1 198.
So I came up with a potential/partial solution in the process of writing up a reproducible example. Essentially, we use gather so the variable names themselves become a key column, which lets us spread by Time once and compute the change for all variables together:
ChickSum2 %>%
gather(variable, value, count, mean.weight) %>%
spread(Time, value) %>%
mutate(Change = `21` - `0`) %>%
select(Diet, variable, Change) %>%
spread(variable, Change)
This works only if the following two conditions are true:
1. All variables are the same type (e.g. both mean.weight and count are numeric).
2. The difference calculation is the same for all variables (e.g. I want to compute last - first for all variables).
I guess the second condition could be relaxed by using e.g. case_when.
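Another option, if tidyr 1.0 or later is available: pivot_wider(), the successor to spread(), can widen several value columns at once, which sidesteps both conditions (and its values_fill argument plays the role of spread()'s fill); a sketch:
library(dplyr)
library(tidyr)
ChickSum2 %>%
pivot_wider(names_from = Time, values_from = c(count, mean.weight)) %>%
mutate(count.change = count_21 - count_0,
       weight.change = mean.weight_21 - mean.weight_0) %>%
select(Diet, count.change, weight.change)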
I'm trying to create a ggplot using the babynames dataset which shows a comparison between the percentage of girls and boys that have a certain name over a range of years. I'm a little familiar with adding columns together, which would look like babynames$boys + babynames$girls if I created a column with the number of girls with a certain name and a column with the number of boys with that name. I'm a bit conceptually stuck; so far I just have:
babynames %>%
filter(name == "Jordan") %>%
group_by(year, sex) %>%
summarize(total = sum(n))
So you want the percentages?
Try:
babynames %>%
filter(name == "Jordan") %>%
group_by(year, sex) %>%
summarize(total = sum(n)) %>%
mutate(both = sum(total)) %>%
mutate(perc = total/both*100)
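From there, to get the comparison plot the question describes, one option (assuming ggplot2 is available) is a line per sex over the years:
library(babynames)
library(dplyr)
library(ggplot2)
babynames %>%
filter(name == "Jordan") %>%
group_by(year, sex) %>%
summarize(total = sum(n)) %>%
mutate(both = sum(total),
       perc = total / both * 100) %>%
ggplot(aes(x = year, y = perc, colour = sex)) +
geom_line()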
Suppose I have 10 years of data, with a name associated with each year, like the following:
Name Year
A 1990
B 1991
C 1992
A 1993
A 1994
.
.
.
I want to find the name that has been out of use for the longest time.
Can anybody help me figure out how to do this?
Using dplyr:
library(dplyr)
mutate(your_data, max_year = max(Year)) %>%
group_by(Name) %>%
summarize(most_recent = max(Year),
          unused_length = first(max_year) - most_recent) %>%
ungroup() %>%
arrange(most_recent)
This will order the names by their most recent use, with the oldest most recent use first.
If you only care about getting that one most out-of-use name, you just need the first row of the result. Add slice(1) to the chain like so:
mutate(your_data, max_year = max(Year)) %>%
group_by(Name) %>%
summarize(most_recent = max(Year),
          unused_length = first(max_year) - most_recent) %>%
ungroup() %>%
arrange(most_recent) %>%
slice(1)
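An equivalent to arrange() plus slice(1) is slice_min(), which keeps ties by default (useful if several names were last used in the same, oldest year):
mutate(your_data, max_year = max(Year)) %>%
group_by(Name) %>%
summarize(most_recent = max(Year),
          unused_length = first(max_year) - most_recent) %>%
ungroup() %>%
slice_min(most_recent, n = 1)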