How to compute cumulative shares and rank products in R?

I have data about sales by year and by product, let's say like this:
Year <- c(2010,2010,2010,2010,2010,2011,2011,2011,2011,2011,2012,2012,2012,2012,2012)
Model <- c("a","b","c","d","e","a","b","c","d","e","a","b","c","d","e")
Sale <- c("30","45","23","33","24","11","56","19","45","56","33","32","89","33","12")
df <- data.frame(Year, Model, Sale)
Firstly, I need to calculate a "Share" column that represents the share of each product within each year.
Secondly, I need to compute the cumulative share within each year.
In the third step I need to identify the products that together account for up to 70% of total sales in the last year (2012 in this case), keep only those products in the whole data frame, add a ranking column (based on the last year), and collapse all remaining products into an "Other" category. So the final data frame should look like this:

This is a fairly complex data wrangling task, but it can be achieved with dplyr:
library(dplyr)

df %>%
  mutate(Sale = as.numeric(Sale)) %>%
  group_by(Year) %>%
  mutate(Share = 100 * Sale / sum(Sale),
         Year_order = order(order(-Share))) %>%
  arrange(Year, Year_order, .by_group = TRUE) %>%
  mutate(Cumm.Share = cumsum(Share)) %>%
  ungroup() %>%
  mutate(below_70 = Model %in% Model[Year == max(Year) & Cumm.Share < 70]) %>%
  mutate(Model = ifelse(below_70, Model, 'Other')) %>%
  group_by(Year, Model) %>%
  summarize(Sale = sum(Sale), Share = sum(Share), .groups = 'keep') %>%
  group_by(Year) %>%
  mutate(pseudoShare = ifelse(Model == 'Other', 0, Share)) %>%
  arrange(Year, -pseudoShare, .by_group = TRUE) %>%
  ungroup() %>%
  mutate(Rank = match(Model, Model[Year == max(Year)])) %>%
  select(-pseudoShare)
#> # A tibble: 9 x 5
#> Year Model Sale Share Rank
#> <dbl> <chr> <dbl> <dbl> <int>
#> 1 2010 a 30 19.4 2
#> 2 2010 c 23 14.8 1
#> 3 2010 Other 102 65.8 3
#> 4 2011 c 19 10.2 1
#> 5 2011 a 11 5.88 2
#> 6 2011 Other 157 84.0 3
#> 7 2012 c 89 44.7 1
#> 8 2012 a 33 16.6 2
#> 9 2012 Other 77 38.7 3
Note that in the output this code has kept groups a and c, rather than c and d, as in your expected output. This is because a and d have the same value in the final year (16.6), and therefore either could be chosen.
Created on 2022-04-21 by the reprex package (v2.0.1)
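If the choice between tied models needs to be deterministic, a secondary sort key before ranking will fix which one is kept. A small sketch (not part of the original answer), breaking ties alphabetically by Model:

```r
library(dplyr)

# Hypothetical 2012 shares, with a and d tied at 16.6
shares <- data.frame(Model = c("a", "c", "d"),
                     Share = c(16.6, 44.7, 16.6))

# Model acts as a tie-breaker, so repeated runs keep the same product
shares %>% arrange(desc(Share), Model)
```

Adding Model as a final key in the arrange() calls of the pipeline above would make the 70% cutoff reproducible across runs.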

Year <- c(2010,2010,2010,2010,2010,2011,2011,2011,2011,2011,2012,2012,2012,2012,2012)
Model <- c("a","b","c","d","e","a","b","c","d","e","a","b","c","d","e")
Sale <- c("30","45","23","33","24","11","56","19","45","56","33","32","89","33","12")
df <- data.frame(Year, Model, Sale, stringsAsFactors = FALSE)

years <- unique(df$Year)
shares <- c()
cumshares <- c()
for (year in years) {
  extract <- df[df$Year == year, ]
  sale <- as.numeric(extract$Sale)
  share <- 100 * sale / sum(sale)
  shares <- append(shares, share)
  cumshare <- rev(cumsum(rev(share)))  # cumulative share from the bottom up
  cumshares <- append(cumshares, cumshare)
}
df$Share <- shares
df$Cumm.Share <- cumshares
df
gives
> df
Year Model Sale Share Cumm.Share
1 2010 a 30 19.354839 100.000000
2 2010 b 45 29.032258 80.645161
3 2010 c 23 14.838710 51.612903
4 2010 d 33 21.290323 36.774194
5 2010 e 24 15.483871 15.483871
6 2011 a 11 5.882353 100.000000
7 2011 b 56 29.946524 94.117647
8 2011 c 19 10.160428 64.171123
9 2011 d 45 24.064171 54.010695
10 2011 e 56 29.946524 29.946524
11 2012 a 33 16.582915 100.000000
12 2012 b 32 16.080402 83.417085
13 2012 c 89 44.723618 67.336683
14 2012 d 33 16.582915 22.613065
15 2012 e 12 6.030151 6.030151
I don't understand what you mean by step 3: how do you decide which products to keep?

Related

Inventory Projection Calculation in R

I am trying to replace an obsolete Excel report currently used for sales forecasting and inventory projections by our supply chain team and I am using R for this.
The desired output is a data frame with one of the columns being the projected closing inventory positions for each week across a span of N weeks.
The part I am struggling with is the recursive calculation for the closing inventory positions. Below is a subset of the data frame with dummy data, where "stock_projection" is the desired result.
I've just started learning about recursion in R so I am not really sure on how to implement this here. Any help will be much appreciated!
week forecast opening_stock stock_projection
   1       10           100              100
   2       11                             89
   3       12                             77
   4       10                             67
   5       11                             56
   6       10                             46
   7       12                             34
   8       11                             23
   9        9                             14
  10       12                              2
Update
I have managed to modify the solution explained here and have replicated the above outcome:
library(dplyr)
library(tidyr)

inventory <- tibble(week = 1, opening_stock = 100)
forecast <- tibble(week = 2:10, forecast = c(11, 12, 10, 11, 10, 12, 11, 9, 12))

dat <- full_join(inventory, forecast)

dat2 <- dat %>%
  mutate(forecast = -forecast) %>%
  gather(transaction, value, -week) %>%
  arrange(week) %>%
  mutate(value = replace_na(value, 0))

dat2 %>%
  mutate(value = cumsum(value)) %>%
  group_by(week) %>%
  summarise(stock_projection = last(value))
Despite working like a charm, I am wondering whether there is another way to achieve this?
I think in the question above, you don't have to worry too much about recursion because the stock projection looks just like the opening stock minus the cumulative sum of the forecast. You could do that with:
library(dplyr)
dat <- tibble(
  week = 1:10,
  forecast = c(10, 11, 12, 10, 11, 10, 12, 11, 9, 12),
  opening_stock = c(100, rep(NA, 9))
)

dat <- dat %>%
  mutate(fcst = case_when(week == 1 ~ 0,
                          TRUE ~ forecast),
         stock_projection = case_when(
           week == 1 ~ opening_stock,
           TRUE ~ opening_stock[1] - cumsum(fcst))) %>%
  dplyr::select(-fcst)
dat
# # A tibble: 10 × 4
# week forecast opening_stock stock_projection
# <int> <dbl> <dbl> <dbl>
# 1 1 10 100 100
# 2 2 11 NA 89
# 3 3 12 NA 77
# 4 4 10 NA 67
# 5 5 11 NA 56
# 6 6 10 NA 46
# 7 7 12 NA 34
# 8 8 11 NA 23
# 9 9 9 NA 14
# 10 10 12 NA 2
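If you do want something closer to a recursive formulation, `purrr::accumulate()` carries the stock level forward one week at a time. A sketch, assuming the same dummy forecasts as above:

```r
library(purrr)

forecast <- c(10, 11, 12, 10, 11, 10, 12, 11, 9, 12)
opening_stock <- 100

# Week 1 keeps the opening stock; each later week subtracts its own
# forecast from the previous week's projection.
stock_projection <- accumulate(forecast[-1], `-`, .init = opening_stock)
stock_projection
#> [1] 100  89  77  67  56  46  34  23  14   2
```

This generalizes to recurrences that are not a simple cumulative sum, since the function passed to accumulate() sees the previous result at every step.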

How do I create a new factor level that summarizes total values of other factor levels?

I have a dataset where I have at least three columns
year sex value
1 2019 M 10
2 2019 F 20
3 2020 M 50
4 2020 F 20
I would like to group by the first column, year, and then add another level to sex that corresponds to the total value in column 3, that is, I would like something like this:
year sex value
<int> <chr> <dbl>
1 2019 M 10
2 2019 F 20
3 2019 Total 30
4 2020 M 50
5 2020 F 20
6 2020 Total 70
Any help is appreciated, especially in dplyr.
Here is just another way of doing this:
library(dplyr)
library(purrr)
df %>%
  group_split(year) %>%
  map_dfr(~ add_row(.x, year = first(.x$year), sex = "Total", value = sum(.x$value)))
# A tibble: 6 x 3
year sex value
<int> <chr> <dbl>
1 2019 M 10
2 2019 F 20
3 2019 Total 30
4 2020 M 50
5 2020 F 20
6 2020 Total 70
You can summarise the data for each year and bind it to the original dataset.
library(dplyr)
df %>%
  group_by(year) %>%
  summarise(sex = 'Total',
            value = sum(value)) %>%
  bind_rows(df) %>%
  arrange(year, sex)
# year sex value
# <int> <chr> <dbl>
#1 2019 F 20
#2 2019 M 10
#3 2019 Total 30
#4 2020 F 20
#5 2020 M 50
#6 2020 Total 70
Or in base R -
aggregate(value ~ year, df, sum) |>
  transform(sex = 'Total') |>
  rbind(df)
data
df <- data.frame(year = rep(2019:2020, each = 2),
                 sex = c('M', 'F'), value = c(10, 20, 50, 20))

Drop rows conditional on value on other rows using dplyr in R

Using the example data provided below: for each institution type ("a" and "b") I want to drop rows with fac == "no" if there exists a row with fac == "yes" for the same year. I then want to sum the values by year. I am, however, not able to figure out how to drop the correct "no" rows. Below are a couple of my attempts, based on answers given here.
set.seed(123)
ext <- tibble(
  institution = c(rep("a", 7), rep("b", 7)),
  year = rep(c("2005", "2005", "2006", "2007", "2008", "2009", "2009"), 2),
  fac = rep(c("yes", "no", "no", "no", "no", "yes", "no"), 2),
  value = sample(1:100, 14, replace = TRUE)
)

ext %>%
  group_by(institution, year) %>%
  filter(if (fac == "yes") fac != "no")

ext %>%
  group_by(institution, year) %>%
  case_when(fac == "yes" ~ filter(., fac != "no"))

ext %>%
  group_by(institution, year) %>%
  {if (fac == "yes") filter(., fac != "no")}
Another way would be:
library(dplyr)
ext %>%
  group_by(institution, year) %>%
  filter(fac == 'yes' | n() < 2)
# institution year fac value
# 1 a 2005 yes 31
# 2 a 2006 no 51
# 3 a 2007 no 14
# 4 a 2008 no 67
# 5 a 2009 yes 42
# 6 b 2005 yes 43
# 7 b 2006 no 25
# 8 b 2007 no 90
# 9 b 2008 no 91
# 10 b 2009 yes 69
In case you want the overall amounts by year, add these two lines, which will yield the following output:
group_by(year) %>%
summarise(value=sum(value))
# year value
# <chr> <int>
# 1 2005 74
# 2 2006 76
# 3 2007 104
# 4 2008 158
# 5 2009 111
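One caveat, not in the original answer: filter(fac == 'yes' | n() < 2) relies on each (institution, year) group having at most two rows. A hypothetical group with two "no" rows and no "yes" would lose both rows, whereas the all(fac == 'no') approach in the other answer keeps them:

```r
library(dplyr)

# Hypothetical group: two "no" rows and no "yes" in the same year
ext2 <- tibble(institution = "a", year = "2010",
               fac = c("no", "no"), value = c(5, 7))

ext2 %>%
  group_by(institution, year) %>%
  filter(fac == 'yes' | n() < 2)   # returns zero rows
```

With the sample data in the question this situation never occurs, so both filters give the same result there.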
Does this work? With the summarise step I assumed you want to sum by year after applying the filtering.
library(dplyr)
ext %>%
  group_by(institution, year) %>%
  filter(fac == 'yes' | all(fac == 'no'))
# A tibble: 10 x 4
# Groups: institution, year [10]
institution year fac value
<chr> <chr> <chr> <int>
1 a 2005 yes 31
2 a 2006 no 51
3 a 2007 no 14
4 a 2008 no 67
5 a 2009 yes 42
6 b 2005 yes 43
7 b 2006 no 25
8 b 2007 no 90
9 b 2008 no 91
10 b 2009 yes 69
ext %>%
  group_by(institution, year) %>%
  filter(fac == 'yes' | all(fac == 'no')) %>%
  ungroup() %>%
  group_by(year) %>%
  summarise(value = sum(value))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 5 x 2
year value
<chr> <int>
1 2005 74
2 2006 76
3 2007 104
4 2008 158
5 2009 111
Try creating a flag to identify the "yes" occurrences and then filter only the desired values. Group by institution and year, then count how many "yes" values each group contains. With that count you can flag the "no" rows whenever the group also contains a "yes". Finally, keep only the rows where Flag is zero, which drops the rows as you expected. Here is the code:
library(dplyr)
#Code
newdf <- ext %>%
  group_by(institution, year) %>%
  mutate(NYes = length(fac[fac == 'yes']),
         Flag = ifelse(fac == 'no' & NYes >= 1, 1, 0)) %>%
  filter(Flag == 0) %>%
  select(-c(NYes, Flag))
Output:
# A tibble: 10 x 4
# Groups: institution, year [10]
institution year fac value
<chr> <chr> <chr> <int>
1 a 2005 yes 31
2 a 2006 no 51
3 a 2007 no 14
4 a 2008 no 67
5 a 2009 yes 42
6 b 2005 yes 43
7 b 2006 no 25
8 b 2007 no 90
9 b 2008 no 91
10 b 2009 yes 69
And the full code to summarise by year:
#Code 2
newdf <- ext %>%
  group_by(institution, year) %>%
  mutate(NYes = length(fac[fac == 'yes']),
         Flag = ifelse(fac == 'no' & NYes >= 1, 1, 0)) %>%
  filter(Flag == 0) %>%
  select(-c(NYes, Flag)) %>%
  ungroup() %>%
  group_by(year) %>%
  summarise(value = sum(value))
Output:
# A tibble: 5 x 2
year value
<chr> <int>
1 2005 74
2 2006 76
3 2007 104
4 2008 158
5 2009 111
An option with data.table
library(data.table)
setDT(ext)[ext[, .I[fac == 'yes'|all(fac == 'no')], .(institution, year)]$V1]

Select sample from a grouping variable depending on another grouping in R

I have the following data frame with 1,000 rows: 10 cities, each with 100 rows. I would like to randomly select 10 names per city, by year, and the 10 sampled names should be spread across the years within each city, i.e. the 10 names for City 1 should not come only from 1996, for instance.
City Year name
1 1 1996 b
2 1 1996 c
3 1 1997 d
4 1 1997 e
...
101 2 1996 f
102 2 1996 g
103 2 1997 h
104 2 1997 i
Desired Final Sample Data
City Year name
1 1 1996 b
2 1 1998 c
3 1 2001 d
...
11 2 1997 g
12 2 1999 h
13 2 2005 b
...
21 3 1998 a
22 3 2010 c
23 3 2005 d
Sample Data
df1 <- data.frame(City = rep(1:10, each = 100),
                  Year = rep(1996:2015, each = 5),
                  name = rep(letters[1:25], 40))
I am failing to randomly select the 10 sample names by Year (without repeating years, unless a city has fewer than 10 years) for all 10 cities. How can I get past this?
The final sample should have 10 names for each city, and years should not repeat unless there are fewer than 10 years in that city.
Thank you.
First group by City and use sample_n to sample a sub-data frame.
Then group by City and Year, and sample one element of name per group. Don't forget to set the RNG seed to make the result reproducible.
library(dplyr)
set.seed(2020)
df1 %>%
  group_by(City) %>%
  sample_n(min(n(), 10)) %>%
  ungroup() %>%
  group_by(City, Year) %>%
  summarise(name = sample(name, 1))
#`summarise()` regrouping output by 'City' (override with `.groups` argument)
## A tibble: 4 x 3
## Groups: City [2]
# City Year name
# <int> <int> <chr>
#1 1 1996 b
#2 1 1997 e
#3 2 1996 f
#4 2 1997 h
Data
df1 <- read.table(text = "
City Year name
1 1 1996 b
2 1 1996 c
3 1 1997 d
4 1 1997 e
101 2 1996 f
102 2 1996 g
103 2 1997 h
104 2 1997 i
", header = TRUE)
Edit
Instead of reinventing the wheel, use package sampling, function strata to get an index into the data set and then filter its corresponding rows.
library(dplyr)
library(sampling)
set.seed(2020)
df1 %>%
  mutate(row = row_number()) %>%
  filter(row %in% strata(df1, stratanames = c('City', 'Year'),
                         size = rep(1, 1000), method = 'srswor')$ID_unit) %>%
  select(-row) %>%
  group_by(City) %>%
  sample_n(10) %>%
  arrange(City, Year)
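A dplyr-only variant of the same stratified idea, not from the original answer, first samples one name per (City, Year) pair and then at most 10 of those year-rows per city, so years cannot repeat within a city. A sketch using slice_sample() (dplyr >= 1.0), with df1 restated so it runs standalone:

```r
library(dplyr)

df1 <- data.frame(City = rep(1:10, each = 100),
                  Year = rep(1996:2015, each = 5),
                  name = rep(letters[1:25], 40))
set.seed(2020)

res <- df1 %>%
  group_by(City, Year) %>%
  slice_sample(n = 1) %>%   # one name per year within each city
  group_by(City) %>%
  slice_sample(n = 10) %>%  # then at most 10 distinct years per city
  arrange(City, Year) %>%
  ungroup()
res
```

Note that when a city has fewer than 10 years, slice_sample() truncates to the available years rather than repeating them.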

R: How to run a custom function after group_by using group_map?

I have a data frame as below:
dt <- data.frame(year = sample(c(2000:2019), 100, replace = TRUE),
                 month = sample(c(1:12), 100, replace = TRUE),
                 paitent_ID = sample(c(1:50), 100, replace = TRUE),
                 state = sample(c(1:10), 100, replace = TRUE))
and I need to apply the function below to this dataset after grouping and sorting:
newState <- function(dt){
  dt["new"] = dt[0, "state"] * 3
  dt
}
So, this function is supposed to add a new column called new to each group.
Here is the group_by:
library(dplyr)
dt %>%
  group_by(paitent_ID) %>%
  group_map(~ .x %>%
              arrange(year, month)) %>%
  group_map(~ .x %>%
              newState())
When I run the code, it complains with:
Error in UseMethod("group_split") :
no applicable method for 'group_split' applied to an object of class "list"
As #André Oliveira mentions in the comments, it is recommended to use mutate for adding a column. However, it is possible to do so with group_modify after making some small changes to your function.
newState <- function(dt, groupvars){
  dt["new"] = dt[1, "state"] * 3
  dt
}

dt %>%
  group_by(paitent_ID) %>%
  arrange(year, month) %>%
  group_modify(newState) %>%
  ungroup()
# # A tibble: 100 x 5
# paitent_ID year month state new
# <int> <int> <int> <int> <dbl>
# 1 1 2006 5 3 9
# 2 2 2012 12 3 9
# 3 3 2013 11 8 24
# 4 3 2014 10 1 24
# 5 3 2019 5 6 24
# 6 4 2006 7 5 15
# 7 4 2006 7 2 15
# 8 5 2003 8 8 24
# 9 7 2015 12 2 6
# 10 7 2017 8 10 6
And a more conventional approach
dt %>%
  group_by(paitent_ID) %>%
  arrange(year, month) %>%
  mutate(new = state[1] * 3)
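For reference, the error in the question happens because group_map() returns a plain list (it calls group_split() internally, as the error message shows), so chaining a second group_map() onto its result fails. A minimal demonstration:

```r
library(dplyr)

dt2 <- data.frame(paitent_ID = c(1, 1, 2), state = c(3, 8, 1))

out <- dt2 %>%
  group_by(paitent_ID) %>%
  group_map(~ .x)   # a list with one tibble per group, not a grouped tibble

class(out)
#> [1] "list"
```

group_modify(), by contrast, returns a grouped tibble, which is why it can slot into the middle of a pipeline.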
