I observe the number of purchases of several different customers (in the example below: four) on several different days (five). Now I want to create a new variable that sums up each individual user's purchases within the window of the last 20 purchases made in total, across users.
Example data:
> da <- data.frame(customer_id = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4),
+ day = c("2016-04-11","2016-04-12","2016-04-13","2016-04-14","2016-04-15","2016-04-11","2016-04-12","2016-04-13","2016-04-14","2016-04-15","2016-04-11","2016-04-12","2016-04-13","2016-04-14","2016-04-15","2016-04-11","2016-04-12","2016-04-13","2016-04-14","2016-04-15"),
+ n_purchase = c(5,2,8,0,3,2,0,3,4,0,2,4,5,1,0,2,3,5,0,3))
> da
customer_id day n_purchase
1 1 2016-04-11 5
2 1 2016-04-12 2
3 1 2016-04-13 8
4 1 2016-04-14 0
5 1 2016-04-15 3
6 2 2016-04-11 2
7 2 2016-04-12 0
8 2 2016-04-13 3
9 2 2016-04-14 4
10 2 2016-04-15 0
11 3 2016-04-11 2
12 3 2016-04-12 4
13 3 2016-04-13 5
14 3 2016-04-14 1
15 3 2016-04-15 0
16 4 2016-04-11 2
17 4 2016-04-12 3
18 4 2016-04-13 5
19 4 2016-04-14 0
20 4 2016-04-15 3
I need to know three things to construct my variable:
(1) What's the overall number of purchases on a day across users (day purchases)?
(2) What's the cumulative number of purchases across users starting from the first day (cumsum_day_purchases)?
(3) Starting from the current observation, on which day did the 20 immediately preceding purchases (across users) begin? This is where I have trouble coding such a variable.
> library(dplyr)
> da %>%
+ group_by(day) %>%
+ mutate(day_purchases = sum(n_purchase)) %>%
+ group_by(customer_id) %>%
+ mutate(cumsum_day_purchases = cumsum(day_purchases))
# A tibble: 20 x 5
# Groups: customer_id [4]
customer_id day n_purchase day_purchases cumsum_day_purchases
<dbl> <fct> <dbl> <dbl> <dbl>
1 1 2016-04-11 5 11 11
2 1 2016-04-12 2 9 20
3 1 2016-04-13 8 21 41
4 1 2016-04-14 0 5 46
5 1 2016-04-15 3 6 52
6 2 2016-04-11 2 11 11
7 2 2016-04-12 0 9 20
8 2 2016-04-13 3 21 41
9 2 2016-04-14 4 5 46
10 2 2016-04-15 0 6 52
11 3 2016-04-11 2 11 11
12 3 2016-04-12 4 9 20
13 3 2016-04-13 5 21 41
14 3 2016-04-14 1 5 46
15 3 2016-04-15 0 6 52
16 4 2016-04-11 2 11 11
17 4 2016-04-12 3 9 20
18 4 2016-04-13 5 21 41
19 4 2016-04-14 0 5 46
20 4 2016-04-15 3 6 52
I will now compute the variable I wish to have by hand for the following dataset.
For all observations on day 2016-04-12, I compute the cumulative sum of purchases of a specific customer by adding the number of purchases on the current day and the preceding day, because in total all customers together made 20 purchases across those two days.
For day 2016-04-13, I only use a user's number of purchases on this day itself, because there were 21 (41 - 20) new purchases on that day alone.
This results in the following output:
> da = da %>% ungroup() %>%
+ mutate(cumsum_last_20_purchases = c(5,5+2,8,0,0+3,2,2+0,3,4,4+0,2,2+4,5,1,1+0,2,2+3,5,0,0+3))
> da
# A tibble: 20 x 6
customer_id day n_purchase day_purchases cumsum_day_purchases cumsum_last_20_purchases
<dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 1 2016-04-11 5 11 11 5
2 1 2016-04-12 2 9 20 7
3 1 2016-04-13 8 21 41 8
4 1 2016-04-14 0 5 46 0
5 1 2016-04-15 3 6 52 3
6 2 2016-04-11 2 11 11 2
7 2 2016-04-12 0 9 20 2
8 2 2016-04-13 3 21 41 3
9 2 2016-04-14 4 5 46 4
10 2 2016-04-15 0 6 52 4
11 3 2016-04-11 2 11 11 2
12 3 2016-04-12 4 9 20 6
13 3 2016-04-13 5 21 41 5
14 3 2016-04-14 1 5 46 1
15 3 2016-04-15 0 6 52 1
16 4 2016-04-11 2 11 11 2
17 4 2016-04-12 3 9 20 5
18 4 2016-04-13 5 21 41 5
19 4 2016-04-14 0 5 46 0
20 4 2016-04-15 3 6 52 3
We can create a new grouping based on runs of the day_purchases column being at or above 20 (computed with rle), and then use cumsum within each group:
library(dplyr)
da %>%
group_by(day) %>%
mutate(day_purchases = sum(n_purchase)) %>%
group_by(customer_id) %>%
mutate(above = with(rle(day_purchases >= 20), rep(1:length(lengths), lengths))) %>%
group_by(above, .add = TRUE) %>%
mutate(cumsum_last_20_purchases = cumsum(n_purchase))
#> # A tibble: 20 x 6
#> # Groups: customer_id, above [12]
#> customer_id day n_purchase day_purchases above cumsum_last_20_purchas…
#> <dbl> <fct> <dbl> <dbl> <int> <dbl>
#> 1 1 2016-04-11 5 11 1 5
#> 2 1 2016-04-12 2 9 1 7
#> 3 1 2016-04-13 8 21 2 8
#> 4 1 2016-04-14 0 5 3 0
#> 5 1 2016-04-15 3 6 3 3
#> 6 2 2016-04-11 2 11 1 2
#> 7 2 2016-04-12 0 9 1 2
#> 8 2 2016-04-13 3 21 2 3
#> 9 2 2016-04-14 4 5 3 4
#> 10 2 2016-04-15 0 6 3 4
#> 11 3 2016-04-11 2 11 1 2
#> 12 3 2016-04-12 4 9 1 6
#> 13 3 2016-04-13 5 21 2 5
#> 14 3 2016-04-14 1 5 3 1
#> 15 3 2016-04-15 0 6 3 1
#> 16 4 2016-04-11 2 11 1 2
#> 17 4 2016-04-12 3 9 1 5
#> 18 4 2016-04-13 5 21 2 5
#> 19 4 2016-04-14 0 5 3 0
#> 20 4 2016-04-15 3 6 3 3
Created on 2020-07-28 by the reprex package (v0.3.0)
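For reference, here is a minimal base R sketch of what the rle() step contributes, using the five daily totals from above (11, 9, 21, 5, 6):
x <- c(11, 9, 21, 5, 6) >= 20         # FALSE FALSE TRUE FALSE FALSE
r <- rle(x)                           # runs: FALSE (x2), TRUE (x1), FALSE (x2)
rep(seq_along(r$lengths), r$lengths)  # 1 1 2 3 3 -> one group id per run
Each switch between "below 20" and "at or above 20" starts a new run, so the per-customer cumsum restarts exactly where a day with 20+ total purchases resets the window.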
I would like to repeat the first two rows for each id two more times, replacing the NA rows. I don't know how to do that. Does anyone have a suggestion?
id <- rep(1:4,each=6)
scored <- c(12,13,NA,NA,NA,NA,14,20,NA,NA,NA,NA,23,56,NA,NA,NA,NA, 45,78,NA,NA,NA,NA)
df <- data.frame(id,scored)
df
id scored
1 1 12
2 1 13
3 1 NA
4 1 NA
5 1 NA
6 1 NA
7 2 14
8 2 20
9 2 NA
10 2 NA
11 2 NA
12 2 NA
13 3 23
14 3 56
15 3 NA
16 3 NA
17 3 NA
18 3 NA
19 4 45
20 4 78
21 4 NA
22 4 NA
23 4 NA
24 4 NA
I want it to look like:
df
id score
1 1 12
2 1 13
3 1 12
4 1 13
5 1 12
6 1 13
7 2 14
8 2 20
9 2 14
10 2 20
11 2 14
12 2 20
13 3 23
14 3 56
15 3 23
16 3 56
17 3 23
18 3 56
19 4 45
20 4 78
21 4 45
22 4 78
23 4 45
24 4 78
We can do a grouped rep on the non-NA elements of 'scored':
library(dplyr)
df %>%
group_by(id) %>%
mutate(scored = rep(scored[!is.na(scored)], length.out = n()))
# A tibble: 24 x 2
# Groups: id [4]
# id scored
# <int> <dbl>
# 1 1 12
# 2 1 13
# 3 1 12
# 4 1 13
# 5 1 12
# 6 1 13
# 7 2 14
# 8 2 20
# 9 2 14
#10 2 20
# … with 14 more rows
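If you prefer base R, here is a sketch of the same grouped rep using ave() on the df from above:
df$scored <- ave(df$scored, df$id,
                 FUN = function(x) rep(x[!is.na(x)], length.out = length(x)))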
I need to segregate the data into 4 equal chunks (quartile classes) ordered by Qty_Ordered. I tried the bins.quantiles() function (from the binr package) in R, but it is not working. Are there any other methods that can be used?
Input
SL.No Item Qty_Ordered
1 VT25 2
2 VT58 4
3 VT40 10
4 VT58 2
5 VT 69 12
6 VT 67 6
7 VT45 21
8 VT 25 16
9 VT 40 24
10 VT98 10
11 VT78 18
12 VT40 6
13 VT 25 26
14 VT85 6
15 VT78 10
16 VT25 4
17 VT40 15
18 VT69 24
Output
SL.No Item Qty_Ordered Class
1 VT25 2 1
4 VT58 2 1
2 VT58 4 1
16 VT25 4 1
6 VT 67 6 2
12 VT40 6 2
14 VT85 6 2
3 VT40 10 2
10 VT98 10 2
15 VT78 10 3
5 VT 69 12 3
17 VT40 15 3
8 VT 25 16 3
11 VT78 18 3
7 VT45 21 4
9 VT 40 24 4
18 VT69 24 4
13 VT 25 26 4
Maybe this?
library(data.table)
test <- fread(input = "SL.No Item Qty_Ordered
1 VT25 2
2 VT58 4
3 VT40 10
4 VT58 2
5 VT69 12
6 VT67 6
7 VT45 21
8 VT25 16
9 VT40 24
10 VT98 10
11 VT78 18
12 VT40 6
13 VT25 26
14 VT85 6
15 VT78 10
16 VT25 4
17 VT40 15
18 VT69 24", header = T)
setorder(test, Qty_Ordered)
test[, Class := .I %/% ((.N+1)/4) + 1]
test
# SL.No Item Qty_Ordered Class
# 1: 1 VT25 2 1
# 2: 4 VT58 2 1
# 3: 2 VT58 4 1
# 4: 16 VT25 4 1
# 5: 6 VT67 6 2
# 6: 12 VT40 6 2
# 7: 14 VT85 6 2
# 8: 3 VT40 10 2
# 9: 10 VT98 10 2
# 10: 15 VT78 10 3
# 11: 5 VT69 12 3
# 12: 17 VT40 15 3
# 13: 8 VT25 16 3
# 14: 11 VT78 18 3
# 15: 7 VT45 21 4
# 16: 9 VT40 24 4
# 17: 18 VT69 24 4
# 18: 13 VT25 26 4
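To see how the integer-division step splits the 18 sorted rows into the four classes, here is the arithmetic in plain R:
n <- 18                  # number of rows after sorting
i <- 1:n                 # row indices, i.e. what .I holds in data.table
i %/% ((n + 1) / 4) + 1  # 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4
The boundaries fall at multiples of (n + 1)/4 = 4.75, giving class sizes of 4, 5, 5 and 4, matching the output above.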
Here's a way using the tidyverse:
library(tidyverse)
df <- read.table(text = "SL.No Item Qty_Ordered
1 VT25 2
2 VT58 4
3 VT40 10
4 VT58 2
5 VT69 12
6 VT67 6
7 VT45 21
8 VT25 16
9 VT40 24
10 VT98 10
11 VT78 18
12 VT40 6
13 VT25 26
14 VT85 6
15 VT78 10
16 VT25 4
17 VT40 15
18 VT69 24",header = T)
df %>%
mutate(Class = findInterval(x = Qty_Ordered, vec = quantile(Qty_Ordered), rightmost.closed = TRUE)) %>%
arrange(Class)
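dplyr also provides ntile() as a shorter route to four roughly equal groups; note, though, that ntile() splits purely by rank, so with tied values the class boundaries may differ slightly from the expected output above:
library(dplyr)
df %>%
  arrange(Qty_Ordered) %>%
  mutate(Class = ntile(Qty_Ordered, 4))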
Suppose I have the following data frame.
table<-data.frame(group=c(0,5,10,15,20,25,30,35,40,0,5,10,15,20,25,30,35,40,0,5,10,15,20,25,30,35,40),plan=c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3),price=c(1,4,5,6,8,9,12,12,12,3,5,6,7,10,12,20,20,20,5,6,8,12,15,20,22,28,28))
group plan price
1 0 1 1
2 5 1 4
3 10 1 5
4 15 1 6
5 20 1 8
6 25 1 9
7 30 1 12
8 35 1 12
9 40 1 12
10 0 2 3
11 5 2 5
12 10 2 6
13 15 2 7
14 20 2 10
15 25 2 12
16 30 2 20
17 35 2 20
18 40 2 20
How can I get the rows from the table up to the maximum price within each plan, without duplicates?
So the result would be:
group plan price
1 0 1 1
2 5 1 4
3 10 1 5
4 15 1 6
5 20 1 8
6 25 1 9
7 30 1 12
10 0 2 3
11 5 2 5
12 10 2 6
13 15 2 7
14 20 2 10
15 25 2 12
16 30 2 20
You can use slice in dplyr:
library(dplyr)
table %>%
group_by(plan) %>%
slice(1:which.max(price == max(price)))
which.max gives the index of the first occurrence of price == max(price). Using that, I can slice the data.frame to only keep rows for each plan up to the maximum price.
Result:
# A tibble: 22 x 3
# Groups: plan [3]
group plan price
<dbl> <dbl> <dbl>
1 0 1 1
2 5 1 4
3 10 1 5
4 15 1 6
5 20 1 8
6 25 1 9
7 30 1 12
8 0 2 3
9 5 2 5
10 10 2 6
# ... with 12 more rows
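As a side note, which.max(price) already returns the index of the first maximum, so the slice can be written slightly more simply, with the same result on this data:
table %>%
  group_by(plan) %>%
  slice(1:which.max(price))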
Example data.frame:
df = read.table(text = 'colA colB
2 7
2 7
2 7
2 7
1 7
1 7
1 7
89 5
89 5
89 5
88 5
88 5
70 5
70 5
70 5
69 5
69 5
44 4
44 4
44 4
43 4
42 4
42 4
41 4
41 4
120 1
100 1', header = TRUE)
I need to add an index column based on colA and colB, where colB gives the exact number of rows in each group (though the same group size can repeat). colB groups rows based on colA and colA - 1.
Expected output:
colA colB index_col
2 7 1
2 7 1
2 7 1
2 7 1
1 7 1
1 7 1
1 7 1
89 5 2
89 5 2
89 5 2
88 5 2
88 5 2
70 5 3
70 5 3
70 5 3
69 5 3
69 5 3
44 4 4
44 4 4
44 4 4
43 4 4
42 4 5
42 4 5
41 4 5
41 4 5
120 1 6
100 1 7
UPDATE
How can I adapt the code that works for the above df for the same purpose, but with colB values grouped based on colA, colA - 1 and colA - 2 (i.e. considering 3 days instead of 2)?
new_df = read.table(text = 'colA colB
3 10
3 10
3 10
2 10
2 10
2 10
2 10
1 10
1 10
1 10
90 7
90 7
89 7
89 7
89 7
88 7
88 7
71 7
71 7
70 7
70 7
70 7
69 7
69 7
44 5
44 5
44 5
43 5
42 5
41 5
41 5
41 5
40 5
40 5
120 1
100 1', header = TRUE)
Expected output:
colA colB index_col
3 10 1
3 10 1
3 10 1
2 10 1
2 10 1
2 10 1
2 10 1
1 10 1
1 10 1
1 10 1
90 7 2
90 7 2
89 7 2
89 7 2
89 7 2
88 7 2
88 7 2
71 7 3
71 7 3
70 7 3
70 7 3
70 7 3
69 7 3
69 7 3
44 5 4
44 5 4
44 5 4
43 5 4
42 5 4
41 5 5
41 5 5
41 5 5
40 5 5
40 5 5
120 1 6
100 1 7
Thanks
We can use rleid:
library(data.table)
index_col <- setDT(df)[, if(colB[1L] < .N) ((seq_len(.N)-1) %/% colB[1L])+1
else as.numeric(colB), rleid(colB)][, rleid(V1)]
df[, index_col := index_col]
df
# colA colB index_col
# 1: 2 7 1
# 2: 2 7 1
# 3: 2 7 1
# 4: 2 7 1
# 5: 1 7 1
# 6: 1 7 1
# 7: 1 7 1
# 8: 70 5 2
# 9: 70 5 2
#10: 70 5 2
#11: 69 5 2
#12: 69 5 2
#13: 89 5 3
#14: 89 5 3
#15: 89 5 3
#16: 88 5 3
#17: 88 5 3
#18: 120 1 4
#19: 100 1 5
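For reference, rleid() simply numbers consecutive runs of equal values, which is what drives the grouping here:
library(data.table)
rleid(c(7, 7, 5, 5, 5, 1))
# [1] 1 1 2 2 2 3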
Or a one-liner would be
setDT(df)[, index_col := df[, ((seq_len(.N)-1) %/% colB[1L])+1, rleid(colB)][, as.integer(interaction(.SD, drop = TRUE, lex.order = TRUE))]]
Update
Based on the new update in the OP's post
setDT(new_df)[, index_col := cumsum(c(TRUE, abs(diff(colA))> 1))
][, colB := .N , index_col]
new_df
# colA colB index_col
# 1: 3 10 1
# 2: 3 10 1
# 3: 3 10 1
# 4: 2 10 1
# 5: 2 10 1
# 6: 2 10 1
# 7: 2 10 1
# 8: 1 10 1
# 9: 1 10 1
#10: 1 10 1
#11: 71 7 2
#12: 71 7 2
#13: 70 7 2
#14: 70 7 2
#15: 70 7 2
#16: 69 7 2
#17: 69 7 2
#18: 90 7 3
#19: 90 7 3
#20: 89 7 3
#21: 89 7 3
#22: 89 7 3
#23: 88 7 3
#24: 88 7 3
#25: 44 2 4
#26: 43 2 4
#27: 120 1 5
#28: 100 1 6
An approach in base R:
df$idxcol <- cumsum(c(1,abs(diff(df$colA)) > 1) + c(0,diff(df$colB) != 0) > 0)
which gives:
> df
colA colB idxcol
1 2 7 1
2 2 7 1
3 2 7 1
4 2 7 1
5 1 7 1
6 1 7 1
7 1 7 1
8 70 5 2
9 70 5 2
10 70 5 2
11 69 5 2
12 69 5 2
13 89 5 3
14 89 5 3
15 89 5 3
16 88 5 3
17 88 5 3
18 120 1 4
19 100 1 5
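Broken into named steps, that one-liner does the following:
jump <- c(1, abs(diff(df$colA)) > 1)    # 1 where colA jumps by more than 1
bchg <- c(0, diff(df$colB) != 0)        # 1 where colB changes value
df$idxcol <- cumsum((jump + bchg) > 0)  # new index whenever either happens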
On the updated example data, you need to adapt the approach to:
n <- 1
idx1 <- cumsum(c(1, diff(df$colA) < -n) + c(0, diff(df$colB) != 0) > 0)
idx2 <- ave(df$colA, cumsum(c(1, diff(df$colA) < -n)), FUN = function(x) c(0, cumsum(diff(x)) < -n ))
idx2[idx2==1 & c(0,diff(idx2))==0] <- 0
df$idxcol <- idx1 + cumsum(idx2)
which gives:
> df
colA colB idxcol
1 2 7 1
2 2 7 1
3 2 7 1
4 2 7 1
5 1 7 1
6 1 7 1
7 1 7 1
8 89 5 2
9 89 5 2
10 89 5 2
11 88 5 2
12 88 5 2
13 70 5 3
14 70 5 3
15 70 5 3
16 69 5 3
17 69 5 3
18 44 4 4
19 44 4 4
20 44 4 4
21 43 4 4
22 42 4 5
23 42 4 5
24 41 4 5
25 41 4 5
26 120 1 6
27 100 1 7
For new_df, just change n to 2 and you will get the desired output for that as well.
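The key change is the drop test: diff(colA) < -n flags only decreases larger than n, so runs of consecutive days (drops of at most n) stay together. A small illustration with made-up values and n = 2 (the jump up from 1 to 90 is caught separately by the colB change):
colA <- c(3, 2, 1, 90, 89, 88, 71, 70)
n <- 2
c(TRUE, diff(colA) < -n)
# TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE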
I have a data frame consisting of the fluorescence readout of multiple cells tracked over time, for example:
Number=c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4)
Fluorescence=c(9,10,20,30,8,11,21,31,6,12,22,32,7,13,23,33)
df = data.frame(Number, Fluorescence)
Which gets:
Number Fluorescence
1 1 9
2 2 10
3 3 20
4 4 30
5 1 8
6 2 11
7 3 21
8 4 31
9 1 6
10 2 12
11 3 22
12 4 32
13 1 7
14 2 13
15 3 23
16 4 33
Number pertains to the cell number. What I want is to collate the fluorescence readout by cell number. The data.frame here cycles through cells 1-4 repeatedly, whereas really I want something like this:
Number Fluorescence
1 1 9
2 1 8
3 1 6
4 1 7
5 2 10
6 2 11
7 2 12
8 2 13
9 3 20
10 3 21
11 3 22
12 3 23
13 4 30
14 4 31
15 4 32
16 4 33
Or, even more ideal, would be having columns based on Number, each holding the respective cell's fluorescence:
1 2 3 4
1 9 10 20 30
2 8 11 21 31
3 6 12 22 32
4 7 13 23 33
I've used the which function to extract them one at a time:
Cell1=df[which(df[,1]==1),2]
But this would require me to write a line for each cell (of which there are hundreds).
Thank you for any help with this! Apologies that I'm still a bit of an R noob.
How about this:
library(tidyr);library(data.table)
number <- c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4)
fl <- c(9,10,20,30,8,11,21,31,6,12,22,32,7,13,23,33)
df <- data.table(number,fl)
df[, index:=1:.N, keyby=number]
df
number fl index
1: 1 9 1
2: 1 8 2
3: 1 6 3
4: 1 7 4
5: 2 10 1
6: 2 11 2
7: 2 12 3
8: 2 13 4
9: 3 20 1
10: 3 21 2
11: 3 22 3
12: 3 23 4
13: 4 30 1
14: 4 31 2
15: 4 32 3
16: 4 33 4
The index is added as a unique identifier for the spread function from tidyr. See this post for more information.
spread(df,number,fl)
index 1 2 3 4
1: 1 9 10 20 30
2: 2 8 11 21 31
3: 3 6 12 22 32
4: 4 7 13 23 33
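Note that spread() has since been superseded in tidyr; with tidyr >= 1.0.0 the same reshape can be written with pivot_wider(), using the df built above (its index column becomes the id):
library(tidyr)
pivot_wider(df, names_from = number, values_from = fl)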