I have a data frame with sales at the ppg and product level. I want to find out how many products contribute towards a particular percentage (e.g. 75%) of sales, like testing the Pareto principle.
The data is
df= structure(list(Ppg = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("p1",
"p2"), class = "factor"), product = structure(c(1L, 2L, 3L, 4L,
1L, 2L, 3L), .Label = c("A", "B", "C", "D"), class = "factor"),
sales = c(50, 40, 30, 80, 100, 70, 30)), .Names = c("Ppg",
"product", "sales"), row.names = c(NA, -7L), class = "data.frame")
> df
Ppg product sales
1 p1 A 50
2 p1 B 40
3 p1 C 30
4 p1 D 80
5 p2 A 100
6 p2 B 70
7 p2 C 30
I retrieved the cumulative sum using dplyr:
df %>% group_by(Ppg) %>% mutate(c1 = cumsum(sales))
Ppg product sales c1
<fctr> <fctr> <dbl> <dbl>
1 p1 A 50 50
2 p1 B 40 90
3 p1 C 30 120
4 p1 D 80 200
5 p2 A 100 100
6 p2 B 70 170
7 p2 C 30 200
Is there any way to
i) calculate the proportion of sales (based on the cumulative sum), and
ii) find how many distinct products contributed towards a certain % of sales?
For example, for ppg p1, 2 distinct products (A & B combined give 75% of sales).
So finally something like below would be ideal:
ppg Number_Products_towards_75%
p1 2
p2 1
Assuming you're fine using the order that the products are currently in to get your answer (since reordering the rows would give you different results):
For 1, you can get the result with an extra mutate. Just divide the cumulative sum by the sum of all the sales in that group:
df %>%
group_by(Ppg) %>%
mutate(c1 = cumsum(sales)) %>%
mutate(percent = c1 / sum(sales))
Gets you:
# A tibble: 7 x 5
# Groups: Ppg [2]
Ppg product sales c1 percent
<fctr> <fctr> <dbl> <dbl> <dbl>
1 p1 A 50.0 50.0 0.250
2 p1 B 40.0 90.0 0.450
3 p1 C 30.0 120 0.600
4 p1 D 80.0 200 1.00
5 p2 A 100 100 0.500
6 p2 B 70.0 170 0.850
7 p2 C 30.0 200 1.00
For 2, you could then use a mutate to add a column for whether that product is below the threshold, and summarize to count the products below the threshold (then add one to the count, since one more product gets you over it).
threshold <- 0.5
df %>%
group_by(Ppg) %>%
mutate(c1 = cumsum(sales)) %>%
mutate(percent = c1 / sum(sales)) %>%
mutate(isbelowthreshold = percent < threshold) %>% # add a column for if it's below the threshold
summarize(count = sum(isbelowthreshold) + 1) # we need to add one since one extra product will put you over the threshold
gets you:
# A tibble: 2 x 2
Ppg count
<fctr> <dbl>
1 p1 3.00
2 p2 1.00
But again, this depends on the order of the products. Consider ordering them from highest to lowest sales first, with something like:
df %>%
group_by(Ppg) %>%
arrange(Ppg, desc(sales))
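Putting the pieces together, here is a minimal sketch that sorts each group from highest to lowest sales before counting. It assumes a 75% threshold and that "contributing towards" means the smallest number of top-selling products whose cumulative share reaches the threshold; the names pareto and n_products are made up for illustration.

```r
library(dplyr)

# Same values as the question's df, rebuilt so the example is self-contained
df <- data.frame(
  Ppg     = c("p1", "p1", "p1", "p1", "p2", "p2", "p2"),
  product = c("A", "B", "C", "D", "A", "B", "C"),
  sales   = c(50, 40, 30, 80, 100, 70, 30)
)

threshold <- 0.75  # assumed target share of sales

pareto <- df %>%
  group_by(Ppg) %>%
  arrange(desc(sales), .by_group = TRUE) %>%      # biggest sellers first within each Ppg
  mutate(percent = cumsum(sales) / sum(sales)) %>%  # cumulative share of group sales
  summarise(n_products = which(percent >= threshold)[1])  # first row that reaches the threshold

pareto
# p1 needs 3 products (0.40, 0.65, 0.85), p2 needs 2 (0.50, 0.85)
```

Because the last cumulative share in each group is always 1, which() finds at least one row for any threshold up to 1.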
I have a problem on how to recategorize a variable according to whether it meets a certain condition or not. That is, if the category does not meet the criteria, it is assigned to another category that does.
My data has the following form:
data = data.frame(firm_size = c("Micro", "Small", "Medium","Big"),
employees = c(5,10,100,1000))
> data
firm_size employees
1 Micro 5
2 Small 10
3 Medium 100
4 Big 1000
So, if my condition is that companies with 10 employees or fewer must be grouped together into a single combined category, the result should be:
> new_data
firm_size employees
1 Micro-Small 15
3 Medium 100
4 Big 1000
What I'm trying to do is write a function that generalizes this procedure, for example, that also works if my data is
> data
firm_size employees
1 Micro 5
2 Small 8
3 Medium 9
4 Big 1000
> new_data
firm_size employees
1 Micro-Small-Medium 22
4 Big 1000
I think that this can be done with the tools of the tidyverse.
Thanks in advance
Here's an approach with tally:
library(dplyr)
size <- 10
data %>%
arrange(firm_size,desc(employees)) %>%
group_by(firm_size = c(as.character(firm_size[employees > size]),
rep(paste(firm_size[employees <= size], collapse = "-"),
sum(employees <= size)))) %>%
tally(employees, name = "employees")
## A tibble: 3 x 2
# firm_size employees
# <chr> <dbl>
#1 Big 1000
#2 Medium 100
#3 Small-Micro 15
And for your second set of data:
data2 %>%
arrange(firm_size,desc(employees)) %>%
group_by(firm_size = c(as.character(firm_size[employees > size]),
rep(paste(firm_size[employees <= size], collapse = "-"),
sum(employees <= size)))) %>%
tally(employees, name = "employees")
## A tibble: 2 x 2
# firm_size employees
# <chr> <int>
#1 Big 1000
#2 Medium-Small-Micro 22
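The grouping key above appears to rely on the rows over the cutoff coming first after arrange(). As a hedged alternative, the sketch below builds the key row by row with if_else(), so it does not depend on row order. The column names small and group are made up for illustration, and firm_size is assumed to be a character column.

```r
library(dplyr)

# Same values as the question's data, with firm_size kept as character
data <- data.frame(
  firm_size = c("Micro", "Small", "Medium", "Big"),
  employees = c(5, 10, 100, 1000),
  stringsAsFactors = FALSE
)

size <- 10  # categories at or below this cutoff are merged

merged <- data %>%
  mutate(
    small = employees <= size,
    # rows at or below the cutoff share one combined label; the rest keep their own
    group = if_else(small, paste(firm_size[small], collapse = "-"), firm_size)
  ) %>%
  group_by(group) %>%
  summarise(employees = sum(employees))

merged
# Big 1000, Medium 100, Micro-Small 15
```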
Data
data <- structure(list(firm_size = structure(c(3L, 4L, 2L, 1L), .Label = c("Big",
"Medium", "Micro", "Small"), class = "factor"), employees = c(5,
10, 100, 1000)), class = "data.frame", row.names = c(NA, -4L))
data2 <- structure(list(firm_size = structure(c(3L, 4L, 2L, 1L), .Label = c("Big",
"Medium", "Micro", "Small"), class = "factor"), employees = c(5L,
8L, 9L, 1000L)), class = "data.frame", row.names = c("1", "2",
"3", "4"))
You can use the great forcats package
library(tidyverse)
data <- data.frame(
firm_size = c("Micro", "Small", "Medium", "Big", "Small"),
employees = c(5, 10, 100, 1000, 10)
)
# If you need n groups
data %>%
mutate(firm_size2 = firm_size %>% as_factor() %>% fct_lump(n = 2, w = employees)) %>%
group_by(firm_size2) %>%
summarise(sum_emp = sum(employees),.groups = "drop")
#> # A tibble: 3 x 2
#> firm_size2 sum_emp
#> <fct> <dbl>
#> 1 Medium 100
#> 2 Big 1000
#> 3 Other 25
# If you need at least x on the sum of a vector
data %>%
mutate(firm_size2 = firm_size %>% as_factor() %>% fct_lump_min(min = 10, w = employees)) %>%
group_by(firm_size2) %>%
summarise(sum_emp = sum(employees),.groups = "drop")
#> # A tibble: 4 x 2
#> firm_size2 sum_emp
#> <fct> <dbl>
#> 1 Small 20
#> 2 Medium 100
#> 3 Big 1000
#> 4 Other 5
Created on 2020-06-11 by the reprex package (v0.3.0)
Yet another solution, set into a custom function:
library(tidyverse)
mymerge <- function(dat, min) {
  # rows at or below the cutoff are collapsed into a single merged row
  merged_dat <- dat %>%
    filter(employees <= min) %>%
    summarize(firm_size = str_flatten(firm_size, collapse = " - "),
              employees = sum(employees))
  # rows above the cutoff are kept as they are, then the merged row is appended
  dat %>%
    filter(employees > min) %>%
    bind_rows(merged_dat)
}
mymerge(data, 30)
firm_size employees
1 Medium 100
2 Big 1000
3 Micro - Small 15
mymerge(data, 300)
firm_size employees
1 Big 1000
2 Micro - Small - Medium 115
I want to rank quantiles for my data, but I get a high rank for the highest score. I want the lowest rank (i.e., 1) for the highest score.
The following code gives me a high rank for the highest score, which I am not interested in:
M %>%
mutate(quantile = ntile(Score, 10))
I have tried reverse = TRUE and include.lowest = TRUE, but neither worked. Can you help me? I felt that it was not necessary to provide a sample of data.
Just negate the values:
M %>% mutate(quantile = ntile(-Score, 10))
Example:
df1 %>%
mutate(quantile = ntile(-score, 3))
# # A tibble: 6 x 4
# ntile date score quantile
# <int> <date> <dbl> <int>
# 1 1 2005-08-31 -2.39 3
# 2 1 2005-09-30 0.573 2
# 3 1 2005-10-31 -1.61 3
# 4 1 2005-11-30 5.43 1
# 5 1 2005-12-31 0.106 2
# 6 1 2006-01-31 6.66 1
Toy Data:
df1 <- structure(list(ntile = c(1L, 1L, 1L, 1L, 1L, 1L),
date = structure(c(13026, 13056, 13087, 13117, 13148, 13179),
class = "Date"),
score = c(-2.38916419707325, 0.572675136581781, -1.61130358515631,
5.42706994951004, 0.105533424368025, 6.65697289481407)),
row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
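To see why negating works, here is a small sketch on a made-up vector (scores is not from the question). dplyr's desc() is shown as an equivalent spelling, since it returns the negated ranking key.

```r
library(dplyr)

scores <- c(10, 20, 30, 40)  # hypothetical scores, lowest to highest

ascending  <- ntile(scores, 2)        # default: highest scores land in the last bucket
negated    <- ntile(-scores, 2)       # negation flips the order
descending <- ntile(desc(scores), 2)  # same idea, written with desc()

ascending   # 1 1 2 2
negated     # 2 2 1 1
descending  # 2 2 1 1
```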
To be honest, I couldn't come up with a decent title for this.
Basically, I have a data frame:
ID Qty BasePrice Total
1 2 30 50
1 1 20 20
2 4 5 15
For each line I want to calculate the following:
Result = (Qty * BasePrice) - Total
Which is supposedly easy to do in R. However, I want to group the results by ID (sum them).
Sample Output:
ID Qty BasePrice Total Results
1 2 30 50 10
1 1 20 20 10
2 4 5 15 5
For instance, for ID=1, the values represent ((2*30)-50)+((1*20)-20)
Any idea how I can achieve this?
Thanks!
We can group by 'ID' and take the sum of the difference between the product of 'Qty' and 'BasePrice' and the 'Total':
library(dplyr)
df1 %>%
group_by(ID) %>%
mutate(Result = sum((Qty * BasePrice) - Total))
# A tibble: 3 x 5
# Groups: ID [2]
# ID Qty BasePrice Total Result
# <int> <int> <int> <int> <int>
#1 1 2 30 50 10
#2 1 1 20 20 10
#3 2 4 5 15 5
data
df1 <- structure(list(ID = c(1L, 1L, 2L), Qty = c(2L, 1L, 4L), BasePrice = c(30L,
20L, 5L), Total = c(50L, 20L, 15L)), class = "data.frame", row.names = c(NA,
-3L))
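If one row per ID is enough, a hedged variant is to compute the per-row difference first and then collapse with summarise() instead of mutate(). The names RowResult and per_id are made up for illustration.

```r
library(dplyr)

# Same values as df1 above
df1 <- data.frame(
  ID        = c(1L, 1L, 2L),
  Qty       = c(2L, 1L, 4L),
  BasePrice = c(30L, 20L, 5L),
  Total     = c(50L, 20L, 15L)
)

per_id <- df1 %>%
  mutate(RowResult = Qty * BasePrice - Total) %>%  # per-row difference: 10, 0, 5
  group_by(ID) %>%
  summarise(Result = sum(RowResult))               # one row per ID

per_id
# ID 1 -> 10, ID 2 -> 5
```

mutate() keeps every original row and repeats the group total, while summarise() returns a single row per group.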
We have some tidy data with treatments (multiple samples and control), time points, and measured values. I want to normalize all the samples by dividing by the corresponding time point in the control variable.
I know how I would do this with each value in its own column, but can't figure out how to use a combination of gather, mutate, summarise etc. from tidyr or dplyr to do this in a straightforward way.
Here is a sample data frame definition:
structure(list(time = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
value = c(10, 20, 15, 100, 210, 180, 110, 180, 140),
as.factor.treat. = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
.Label = c("c", "t1", "t2"), class = "factor")),
.Names = c("time", "value", "treat"),
row.names = c(NA, -9L), class = "data.frame")
Data frame looks like this:
time value treat
1 10 c
2 20 c
3 15 c
1 100 t1
2 210 t1
3 180 t1
1 110 t2
2 180 t2
3 140 t2
Expected output: the same, but with a normvalue column containing c(1,1,1,10,10.5,12,11,9,9.333333).
I'd like to get out columns of normalized value for each treatment and time point using tidyverse procedures...
If you group by time (assuming that, as in the example, it is the grouping variable for time-point) then we can use bracket notation in a mutate statement to search only within the group. We can use that to access the control value for each group and then divide the un-normalized value by that:
df %>%
group_by(time) %>%
mutate(value.norm = value / value[treat == 'c'])
# A tibble: 9 x 4
# Groups: time [3]
time value treat value.norm
<dbl> <dbl> <fct> <dbl>
1 1 10 c 1
2 2 20 c 1
3 3 15 c 1
4 1 100 t1 10
5 2 210 t1 10.5
6 3 180 t1 12
7 1 110 t2 11
8 2 180 t2 9
9 3 140 t2 9.33
All this does is take the value column of each row and divide it by the value for the control sample with the same time value. As you can see, it doesn't care if sample t1 is missing an observation for time == 1:
df <- structure(list(time = c(1, 2, 3, 2, 3, 1, 2, 3),
value = c(10, 20, 15, 210, 180, 110, 180, 140),
as.factor.treat. = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L),
.Label = c("c", "t1", "t2"), class = "factor")),
.Names = c("time", "value", "treat"),
row.names = c(NA, -8L), class = "data.frame")
df %>%
group_by(time) %>%
mutate(value.norm = value / value[treat == 'c'])
# A tibble: 8 x 4
# Groups: time [3]
time value treat value.norm
<dbl> <dbl> <fct> <dbl>
1 1 10 c 1
2 2 20 c 1
3 3 15 c 1
4 2 210 t1 10.5
5 3 180 t1 12
6 1 110 t2 11
7 2 180 t2 9
8 3 140 t2 9.33
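The bracket approach assumes every time point has exactly one control row. As a sketch of an alternative under the same data layout, the control values can be pulled into their own table and joined back by time; a time point without a control then simply gets NA. The names controls, control and normalized are made up for illustration.

```r
library(dplyr)

# Same values as the question's data frame
df <- data.frame(
  time  = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  value = c(10, 20, 15, 100, 210, 180, 110, 180, 140),
  treat = c("c", "c", "c", "t1", "t1", "t1", "t2", "t2", "t2")
)

# One row per time point holding the control measurement
controls <- df %>%
  filter(treat == "c") %>%
  select(time, control = value)

normalized <- df %>%
  left_join(controls, by = "time") %>%      # attach the matching control value
  mutate(value.norm = value / control) %>%  # NA where no control exists
  select(-control)

normalized
```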
I have a data frame with duplicated IDs. An ID stands for a specific entity. The IDs are duplicated because the dataset refers to a process that every entity can go through multiple times.
Here is a small example dat:
library(dplyr)
glimpse(dat)
Observations: 6
Variables: 3
$ ID <dbl> 1, 1, 1, 2, 2, 2
$ Amount <dbl> 10, 70, 80, 50, 10, 10
$ Product <fct> A, B, C, B, E, A
ID stands for the entity, Amount stands for the amount of money the entity has spent, and Product stands for the good the entity bought.
The issue is that I have to "condense" this data. So, every ID / entity may occur only once. For the continuous variable, this is not an issue because I can simply calculate the mean per ID.
library(tidyr)
dat_con_ID <- dat %>%
select(ID) %>%
unique()
dat_con_Amount <- dat %>%
group_by(ID) %>%
summarise(Amount = mean(Amount))
dat_con <- inner_join(dat_con_ID, dat_con_Amount, by = "ID")
glimpse(dat_con)
Observations: 2
Variables: 2
$ ID <dbl> 1, 2
$ Amount <dbl> 53.33333, 23.33333
The problem is that I can't calculate the mean of Product because it's a categorical variable. An option would be to make a dummy variable out of this factor and calculate the mean, but since the original data frame is really huge this is not a good solution. Any idea how to handle this problem?
Maybe you are trying to do this:
I am using the data.table library. I also modified your data by adding one extra row for ID = 1, so that you can see the difference in the output.
Data:
library('data.table')
dat <- data.table(ID =as.double(c(1, 1, 1, 2, 2, 2,1)),
Amount = as.double(c( 10, 70, 80, 50, 10, 10, 20)),
Product = factor( c('A', 'B', 'C', 'B', 'E', 'A', 'A')))
Code:
# average amount per id
dat[, .(avg_amt = mean(Amount)), by = .(ID) ]
# ID avg_amt
# 1: 1 45.00000
# 2: 2 23.33333
# average product per id
dat[, .SD[, .N, by = Product ][, .( avg_pdt = N/sum(N), Product)], by = .(ID) ]
# ID avg_pdt Product
# 1: 1 0.5000000 A
# 2: 1 0.2500000 B
# 3: 1 0.2500000 C
# 4: 2 0.3333333 B
# 5: 2 0.3333333 E
# 6: 2 0.3333333 A
# combining average amount and average product per id
dat[, .SD[, .N, by = Product ][, .( Product,
avg_pdt = N/sum(N),
avg_amt = mean(Amount))],
by = .(ID) ]
# ID Product avg_pdt avg_amt
# 1: 1 A 0.5000000 45.00000
# 2: 1 B 0.2500000 45.00000
# 3: 1 C 0.2500000 45.00000
# 4: 2 B 0.3333333 23.33333
# 5: 2 E 0.3333333 23.33333
# 6: 2 A 0.3333333 23.33333
edit
Another idea would be to count 'Product' per 'ID', calculate the mean of 'Amount' and the relative frequency of each product, and then spread the data by 'Product' to end up with wide-format data in which every ID / entity occurs only once.
dat %>%
add_count(Product, ID) %>%
group_by(ID) %>%
mutate(Amount = mean(Amount),
n = n / n()) %>%
unique() %>%
spread(Product, n, sep = "_") %>%
ungroup()
# A tibble: 2 x 6
# ID Amount Product_A Product_B Product_C Product_E
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1. 45.0 0.500 0.250 0.250 NA
#2 2. 23.3 0.333 0.333 NA 0.333
My first attempt, not what OP was looking for but in case someone is interested:
As suggested by @steveb in the comments, you could summarise Product as a string.
library(dplyr)
dat %>%
group_by(ID) %>%
summarise(Amount = mean(Amount),
Product = toString( sort(unique(Product)))
)
# A tibble: 2 x 3
# ID Amount Product
# <dbl> <dbl> <chr>
#1 1. 45.0 A, B, C
#2 2. 23.3 A, B, E
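If a single representative value per ID is preferred over a string, one option (an assumption about what "condensing" should mean here, not something the question asks for) is the most frequent Product. In this sketch ties go to the first factor level, and the names TopProduct and condensed are made up for illustration.

```r
library(dplyr)

# Same values as the dat used in this answer
dat <- data.frame(
  ID      = c(1, 1, 1, 2, 2, 2, 1),
  Amount  = c(10, 70, 80, 50, 10, 10, 20),
  Product = factor(c("A", "B", "C", "B", "E", "A", "A"))
)

condensed <- dat %>%
  group_by(ID) %>%
  summarise(
    Amount     = mean(Amount),
    # most frequent level; which.max() returns the first maximum, so ties go to the earliest level
    TopProduct = names(which.max(table(Product)))
  )

condensed
# ID 1 -> A (2 of 4 rows); ID 2 -> A (three-way tie, first level wins)
```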
data
dat <- structure(list(ID = c(1, 1, 1, 2, 2, 2, 1), Amount = c(10, 70,
80, 50, 10, 10, 20), Product = structure(c(1L, 2L, 3L, 2L, 4L,
1L, 1L), .Label = c("A", "B", "C", "E"), class = "factor")), .Names = c("ID",
"Amount", "Product"), row.names = c(NA, -7L), class = c("tbl_df",
"tbl", "data.frame"))