How to calculate proportions based on group by - r

I am trying to calculate consecutive proportions of the target feature.
Data Set
df <- data.frame(ID = c(11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
22, 22, 22, 22, 22, 22, 22, 22, 22, 22),
target = c(0, 0, 0, 1, 1, 1, 0, 1, 1, 1,
0, 0, 1, 1, 1, 0, 1, 0, 1, 1))
ID target
1 11 0
2 11 0
3 11 0
4 11 1
5 11 1
6 11 1
7 11 0
8 11 1
9 11 1
10 11 1
11 22 0
12 22 0
13 22 1
14 22 1
15 22 1
16 22 0
17 22 1
18 22 0
19 22 1
20 22 1
This is what I tried:
df <- df %>%
group_by(ID) %>%
mutate(count_per_ID = row_number(),
consecutive_target = sequence(rle(as.character(target))$lengths),
val = ifelse(target == 0, 0, consecutive_target),
proportion_target_by_ID = val / count_per_ID) %>%
ungroup()
I created count_per_ID, which gives the running row index within each ID group.
Then consecutive_target counts consecutive observations of the target feature, restarting each time the value switches between 0 and 1.
val copies the values from consecutive_target where target is 1, and is 0 where target is 0.
proportion_target_by_ID divides val by count_per_ID.
The issue is that wherever val is 0, the proportion drops back to 0, so the idea of a running proportion of target values by ID breaks down.
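To see what the run-length machinery is doing, here is a minimal base-R sketch of what `sequence(rle(...)$lengths)` produces for ID 11:

```r
# target values for ID 11
target <- c(0, 0, 0, 1, 1, 1, 0, 1, 1, 1)

# rle() collapses the vector into runs of identical values;
# sequence() then counts 1, 2, ... within each run, restarting
# at every switch between 0 and 1
runs <- rle(as.character(target))
sequence(runs$lengths)
# [1] 1 2 3 1 2 3 1 1 2 3
```

This matches the consecutive_target column in the output below.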
ID target count_per_ID consecutive_target val proportion_target_by_ID
<dbl> <dbl> <int> <int> <dbl> <dbl>
1 11 0 1 1 0 0
2 11 0 2 2 0 0
3 11 0 3 3 0 0
4 11 1 4 1 1 0.25
5 11 1 5 2 2 0.4
6 11 1 6 3 3 0.5
7 11 0 7 1 0 0
8 11 1 8 1 1 0.125
9 11 1 9 2 2 0.222
10 11 1 10 3 3 0.3
11 22 0 1 1 0 0
12 22 0 2 2 0 0
13 22 1 3 1 1 0.333
14 22 1 4 2 2 0.5
15 22 1 5 3 3 0.6
16 22 0 6 1 0 0
17 22 1 7 1 1 0.143
18 22 0 8 1 0 0
19 22 1 9 1 1 0.111
20 22 1 10 2 2 0.2
How the result should look:
ID target count_per_ID consecutive_target val proportion_target_by_ID
<dbl> <dbl> <int> <int> <dbl> <dbl>
1 11 0 1 1 0 0
2 11 0 2 2 0 0
3 11 0 3 3 0 0
4 11 1 4 1 1 0.25
5 11 1 5 2 2 0.4
6 11 1 6 3 3 0.5
7 11 0 7 1 3 0.429
8 11 1 8 1 4 0.5
9 11 1 9 2 5 0.556
10 11 1 10 3 6 0.6
11 22 0 1 1 0 0
12 22 0 2 2 0 0
13 22 1 3 1 1 0.333
14 22 1 4 2 2 0.5
15 22 1 5 3 3 0.6
16 22 0 6 1 3 0.5
17 22 1 7 1 4 0.571
18 22 0 8 1 4 0.5
19 22 1 9 1 5 0.556
20 22 1 10 2 6 0.6

An option is to change the code for creating 'val' from
val = ifelse(target == 0, 0, consecutive_target)
to
val = cumsum(target != 0)
Full code:
df %>%
group_by(ID) %>%
mutate(count_per_ID = row_number(),
consecutive_target = sequence(rle(as.character(target))$lengths),
val = cumsum(target != 0),
proportion_target_by_ID = val / count_per_ID)
# A tibble: 20 x 6
# Groups: ID [2]
# ID target count_per_ID consecutive_target val proportion_target_by_ID
# <dbl> <dbl> <int> <int> <int> <dbl>
# 1 11 0 1 1 0 0
# 2 11 0 2 2 0 0
# 3 11 0 3 3 0 0
# 4 11 1 4 1 1 0.25
# 5 11 1 5 2 2 0.4
# 6 11 1 6 3 3 0.5
# 7 11 0 7 1 3 0.429
# 8 11 1 8 1 4 0.5
# 9 11 1 9 2 5 0.556
#10 11 1 10 3 6 0.6
#11 22 0 1 1 0 0
#12 22 0 2 2 0 0
#13 22 1 3 1 1 0.333
#14 22 1 4 2 2 0.5
#15 22 1 5 3 3 0.6
#16 22 0 6 1 3 0.5
#17 22 1 7 1 4 0.571
#18 22 0 8 1 4 0.5
#19 22 1 9 1 5 0.556
#20 22 1 10 2 6 0.6

Recode when there is a missing category in R

I need help with recoding. Here is how my dataset looks.
df <- data.frame(id = c(1,1,1,1,1, 2,2,2,2,2, 3,3,3,3,3, 4,4,4,4,4),
score = c(0,1,0,1,0, 0,2,0,2,2, 0,3,3,0,0, 0,1,3,1,3))
> df
id score
1 1 0
2 1 1
3 1 0
4 1 1
5 1 0
6 2 0
7 2 2
8 2 0
9 2 2
10 2 2
11 3 0
12 3 3
13 3 3
14 3 0
15 3 0
16 4 0
17 4 1
18 4 3
19 4 1
20 4 3
Some ids have missing score categories. When this happens within an id, I would like to recode the score categories. So:
a) if the score options are `0,1,2` and `1` is missing, then `2` needs to be recoded as `1`,
b) if the score options are `0,1,2,3` and `1,2` are missing, then `3` needs to be recoded as `1`,
c) if the score options are `0,1,2,3` and `2` is missing, then `3` needs to be recoded as `2`,
the idea being that there should not be any missing score categories in between.
The desired output would be:
> df.1
id score score.recoded
1 1 0 0
2 1 1 1
3 1 0 0
4 1 1 1
5 1 0 0
6 2 0 0
7 2 2 1
8 2 0 0
9 2 2 1
10 2 2 1
11 3 0 0
12 3 3 1
13 3 3 1
14 3 0 0
15 3 0 0
16 4 0 0
17 4 1 1
18 4 3 2
19 4 1 1
20 4 3 2
df %>%
group_by(id) %>%
mutate(score = as.numeric(factor(score)) - 1)
# A tibble: 20 x 2
# Groups: id [4]
id score
<dbl> <dbl>
1 1 0
2 1 1
3 1 0
4 1 1
5 1 0
6 2 0
7 2 1
8 2 0
9 2 1
10 2 1
11 3 0
12 3 1
13 3 1
14 3 0
15 3 0
16 4 0
17 4 1
18 4 2
19 4 1
20 4 2
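The factor() trick works because the factor levels are the sorted unique values within each group, so `as.numeric()` renumbers them 1, 2, ... with no gaps, and subtracting 1 maps the lowest category back to 0 (this assumes 0 is present in every group). A quick check on the id 2 pattern:

```r
# id 2 has categories 0 and 2; category 1 is missing
score <- c(0, 2, 0, 2, 2)

# factor levels are "0", "2" -> codes 1, 2 -> subtract 1 -> 0, 1
as.numeric(factor(score)) - 1
# [1] 0 1 0 1 1
```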
Using data.table
library(data.table)
setDT(df)[, score.recoded := 0][
score >0, score.recoded := match(score, score), id]
Output:
> df
id score score.recoded
<num> <num> <int>
1: 1 0 0
2: 1 1 1
3: 1 0 0
4: 1 1 1
5: 1 0 0
6: 2 0 0
7: 2 2 1
8: 2 0 0
9: 2 2 1
10: 2 2 1
11: 3 0 0
12: 3 3 1
13: 3 3 1
14: 3 0 0
15: 3 0 0
16: 4 0 0
17: 4 1 1
18: 4 3 2
19: 4 1 1
20: 4 3 2
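The `match(score, score)` idiom maps each positive score to the position of its first occurrence within the group, which produces a dense 1, 2, ... renumbering in order of first appearance. Note this agrees with the factor-based answer here only because lower scores happen to appear before higher ones within each id; renumbering strictly by value would be `match(score, sort(unique(score)))`. A quick illustration on id 4:

```r
# positive scores for id 4
score <- c(1, 3, 1, 3)

# each value is replaced by the index of its first occurrence
match(score, score)
# [1] 1 2 1 2
```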

Using row index number to calculate values

I'm having trouble using the row number as an index. For example, I want a new column, sales_next4, that gives the total sales over the next 4 days (including the current day).
The issue with my code is that I don't know how to use row_number() as an index; what I'm doing fetches the actual value of the column instead.
#heres to create the data
df <- read.table(text = "day price price_change sales High_sales_ind
1 5 0 12 1
2 5 0 6 0
3 5 0 5 0
4 5 0 4 0
5 5 0 10 1
6 5 0 10 1
7 5 0 10 1
8 5 0 12 1
9 5 0 14 1
10 7 2 3 0
11 7 0 2 0", header = TRUE)
#my code
df<- df %>% mutate(sales_next4 = sales[row_number():sales_rownumber()+4)
What I need:
day price price_change sales High_sales_ind sales_next4
1 5 0 12 1 27
2 5 0 6 0 25
3 5 0 5 0 29
4 5 0 4 0 34
5 5 0 10 1 42
6 5 0 10 1 46
7 5 0 10 1 39
8 5 0 12 1 31
9 5 0 14 1 19
10 7 2 3 0 5
11 7 0 2 0 2
Any help would be appreciated.
You can use rollapply from the zoo package for cases like this, assuming that the days are consecutive as in the example data provided.
You'll need to use the partial = and align = arguments to fill the column correctly, see ?rollapply for the details.
library(dplyr)
library(zoo)
df <- df %>%
mutate(sales_next4 = rollapply(sales, 4, sum, partial = TRUE, align = "left"))
Result:
day price price_change sales High_sales_ind sales_next4
1 1 5 0 12 1 27
2 2 5 0 6 0 25
3 3 5 0 5 0 29
4 4 5 0 4 0 34
5 5 5 0 10 1 42
6 6 5 0 10 1 46
7 7 5 0 10 1 39
8 8 5 0 12 1 31
9 9 5 0 14 1 19
10 10 7 2 3 0 5
11 11 7 0 2 0 2
You can use map_dbl() from purrr to do a rolling sum based on the day column.
library(dplyr)
library(purrr)
df %>%
mutate(sales_next4 = map_dbl(day, ~ sum(sales[between(day, .x, .x+3)])))
# day price price_change sales High_sales_ind sales_next4
# 1 1 5 0 12 1 27
# 2 2 5 0 6 0 25
# 3 3 5 0 5 0 29
# 4 4 5 0 4 0 34
# 5 5 5 0 10 1 42
# 6 6 5 0 10 1 46
# 7 7 5 0 10 1 39
# 8 8 5 0 12 1 31
# 9 9 5 0 14 1 19
# 10 10 7 2 3 0 5
# 11 11 7 0 2 0 2
Using slider
library(dplyr)
library(slider)
df %>%
mutate(sales_next4 = slide_dbl(day, ~ sum(sales[.x]), .after = 3))
day price price_change sales High_sales_ind sales_next4
1 1 5 0 12 1 27
2 2 5 0 6 0 25
3 3 5 0 5 0 29
4 4 5 0 4 0 34
5 5 5 0 10 1 42
6 6 5 0 10 1 46
7 7 5 0 10 1 39
8 8 5 0 12 1 31
9 9 5 0 14 1 19
10 10 7 2 3 0 5
11 11 7 0 2 0 2
You can use Reduce() and data.table::shift()
library(data.table)
setDT(df)[, sales_next4 := Reduce(`+`, shift(c(sales, 0, 0, 0), -3:0))[1:.N]]
Output:
day price price_change sales High_sales_ind sales_next4
1 1 5 0 12 1 27
2 2 5 0 6 0 25
3 3 5 0 5 0 29
4 4 5 0 4 0 34
5 5 5 0 10 1 42
6 6 5 0 10 1 46
7 7 5 0 10 1 39
8 8 5 0 12 1 31
9 9 5 0 14 1 19
10 10 7 2 3 0 5
11 11 7 0 2 0 2
Or, do this as part of a dplyr/mutate pipeline:
mutate(df, sales_next4 = Reduce(`+`, data.table::shift(c(sales,0,0,0),0:-3))[1:nrow(df)])

Counter max frequency non consecutive numbers

I have some data where one of the variables is a counter built with some requirements. What I need to know is how many times that counter reaches 1 for each ID; if there are several 1's in a row, they only count once.
For example, if an ID has the counter 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, I would say that the ID has a frequency of 3.
frec_counter counts the number of non-consecutive times a 1 appears. When there is a run of consecutive 1's, only the last one is numbered.
My data:
id <- c(10,10,10,10,10,11,11,11,11,11,11,12,12,12,13, 13, 15, 14)
counter <- c(0,0,1,1,0,1,0,1,0,1,1,1,1,1,0,0,1,1)
DF <- data.frame(id, counter); DF
ID 10 has 0,0,1,1,0: five observations but only one non-consecutive run of 1's, so frec_counter is set to 0,0,0,1,0.
My desired output:
id <- c(10,10,10,10,10,11,11,11,11,11,11,12,12,12,13, 13, 15, 14)
counter <- c(0,0,1,1,0,1,0,1,0,1,1,1,1,1,0,0,1,1)
frec_counter <- c(0,0,0,1,0,1,0,2,0,0,3,0,0,1,0,0,1,1)
max_counter <- c(1,1,1,1,1,3,3,3,3,3,3,1,1,1,0,0,1,1)
DF <- data.frame(id, counter, frec_counter, max_counter); DF
Here is one approach using tidyverse:
library(tidyverse)
DF %>%
group_by(id) %>% #group by id
mutate(one = ifelse(counter == lead(counter), 0, counter), #if the leading value is the same replace the value with 0
one = ifelse(is.na(one), counter, one), #to handle last in group where lead results in NA
frec_counter1 = cumsum(one), #get cumulative sum of 1s
frec_counter1 = ifelse(one == 0, 0 , frec_counter1), #replace the cumsum values with 0 where approprate
max_counter1 = max(frec_counter1)) %>% #get the max frec_counter1 per group
select(-one) #remove dummy variable
#output
id counter frec_counter max_counter frec_counter1 max_counter1
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 10 0 0 1 0 1
2 10 0 0 1 0 1
3 10 1 0 1 0 1
4 10 1 1 1 1 1
5 10 0 0 1 0 1
6 11 1 1 3 1 3
7 11 0 0 3 0 3
8 11 1 2 3 2 3
9 11 0 0 3 0 3
10 11 1 0 3 0 3
11 11 1 3 3 3 3
12 12 1 0 1 0 1
13 12 1 0 1 0 1
14 12 1 1 1 1 1
15 13 0 0 0 0 0
16 13 0 0 0 0 0
17 15 1 1 1 1 1
18 14 1 1 1 1 1
Your data:
id <- c(10,10,10,10,10,11,11,11,11,11,11,12,12,12,13, 13, 15, 14)
counter <- c(0,0,1,1,0,1,0,1,0,1,1,1,1,1,0,0,1,1)
DF <- data.frame(id, counter)
id counter
1 10 0
2 10 0
3 10 1
4 10 1
5 10 0
6 11 1
7 11 0
8 11 1
9 11 0
10 11 1
11 11 1
12 12 1
13 12 1
14 12 1
15 13 0
16 13 0
17 15 1
18 14 1
If all you wanted was the final counts, we could do that in base R:
counts <- with(DF, split(counter, id))
lengths <- lapply(counts, rle)
final <- lapply(lengths, function(x) sum(x$values == 1))
$`10`
[1] 1
$`11`
[1] 3
$`12`
[1] 1
$`13`
[1] 0
$`14`
[1] 1
$`15`
[1] 1
But since you specifically want a data frame with the intermediary "flags", the tidyverse set of packages works better:
library(tidyverse)
df.new <- DF %>%
group_by(id) %>%
mutate(
frec_counter = counter == 1 & (is.na(lead(counter)) | lead(counter) == 0),
frec_counter = as.numeric(frec_counter),
max_counter = sum(frec_counter)
)
# A tibble: 18 x 4
# Groups: id [6]
id counter frec_counter max_counter
<dbl> <dbl> <dbl> <dbl>
1 10 0 0 1
2 10 0 0 1
3 10 1 0 1
4 10 1 1 1
5 10 0 0 1
6 11 1 1 3
7 11 0 0 3
8 11 1 1 3
9 11 0 0 3
10 11 1 0 3
11 11 1 1 3
12 12 1 0 1
13 12 1 0 1
14 12 1 1 1
15 13 0 0 0
16 13 0 0 0
17 15 1 1 1
18 14 1 1 1

Suitable alternative for ddply function

How can I convert the following tibble to the final result posted below using dplyr?
> group_by(hth, team) %>% arrange(team)
Source: local data frame [26 x 14]
Groups: team [13]
team CSK DC DD GL KKR KTK KXIP MI PW RCB RPSG
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 CSK 0 8 11 0 11 2 9 10 4 10 0
2 CSK 0 2 5 0 5 0 8 12 2 9 0
3 DC 2 0 8 0 2 1 7 5 3 8 0
4 DC 8 0 3 0 7 0 3 5 1 3 0
5 DD 5 3 0 0 7 2 8 5 2 10 2
6 DD 11 8 0 2 10 0 10 13 4 7 0
7 GL 0 0 2 0 0 0 0 0 0 1 0
8 GL 0 0 0 0 2 0 2 2 0 2 2
9 KKR 5 7 10 2 0 0 5 10 3 15 0
10 KKR 11 2 7 0 0 2 14 8 2 3 2
# ... with 16 more rows, and 2 more variables: RR <dbl>, SH <dbl>
>
I used plyr's ddply function and was able to achieve the result.
> ddply(hth, .(team), function(x) colSums(x[,-1], na.rm = TRUE))
team CSK DC DD GL KKR KTK KXIP MI PW RCB RPSG RR SH
1 CSK 0 10 16 0 16 2 17 22 6 19 0 17 6
2 DC 10 0 11 0 9 1 10 10 4 11 0 9 0
3 DD 16 11 0 2 17 2 18 18 6 17 2 16 8
4 GL 0 0 2 0 2 0 2 2 0 3 2 0 3
5 KKR 16 9 17 2 0 2 19 18 5 18 2 15 9
6 KTK 2 1 2 0 2 0 1 1 1 2 0 2 0
7 KXIP 17 10 18 2 19 1 0 18 6 18 2 15 8
8 MI 22 10 18 2 18 1 18 0 6 19 2 16 8
9 PW 6 4 6 0 5 1 6 6 0 5 0 5 2
10 RCB 19 11 17 3 18 2 18 19 5 0 2 16 9
11 RPSG 0 0 2 2 2 0 2 2 0 2 0 0 2
12 RR 17 9 16 0 15 2 15 16 5 16 0 0 7
13 SH 6 0 8 3 9 0 8 8 2 9 2 7 0
>
How to achieve the same using just dplyr functions?
Looks like you are grouping by team and summing the columns, in dplyr:
library(dplyr)
hth %>%
group_by(team) %>%
summarise_all(funs(sum), na.rm = TRUE)
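As a side note, `funs()` was deprecated in dplyr 0.8 and `summarise_all()` has since been superseded; on current dplyr (1.0+) the idiomatic equivalent uses `across()`. A sketch, assuming `hth` is the tibble shown above:

```r
library(dplyr)

hth %>%
  group_by(team) %>%
  summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))
```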

Removing the unordered pairs repeated twice in a file in R

I have a file like this in R.
**0 1**
0 2
**0 3**
0 4
0 5
0 6
0 7
0 8
0 9
0 10
**1 0**
1 11
1 12
1 13
1 14
1 15
1 16
1 17
1 18
1 19
**3 0**
As we can see, there are duplicate unordered pairs in this file (the marked pairs), like
1 0
and
0 1
I wish to remove these duplicate pairs. I also want to count the number of such pairs and append the count after the row that is kept. If a pair is not repeated, then 1 should be written in the third column.
For example (a sample of the output file):
0 1 2
0 2 1
0 3 2
0 4 1
0 5 1
0 6 1
0 7 1
0 8 1
0 9 1
0 10 1
1 11 1
1 12 1
1 13 1
1 14 1
1 15 1
1 16 1
1 17 1
1 18 1
1 19 1
How can I achieve it in R?
Here is a way using transform, pmin and pmax to reorder the data by row, and then aggregate to provide a count:
# data
x <- data.frame(a=c(rep(0,10),rep(1,10),3),b=c(1:10,0,11:19,0))
#logic
aggregate(count~a+b,transform(x,a=pmin(a,b), b=pmax(a,b), count=1),sum)
a b count
1 0 1 2
2 0 2 1
3 0 3 2
4 0 4 1
5 0 5 1
6 0 6 1
7 0 7 1
8 0 8 1
9 0 9 1
10 0 10 1
11 1 11 1
12 1 12 1
13 1 13 1
14 1 14 1
15 1 15 1
16 1 16 1
17 1 17 1
18 1 18 1
19 1 19 1
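One subtlety worth noting: `transform()` evaluates all of its arguments against the original data frame, so `b = pmax(a, b)` still sees the untouched `a` (dplyr's `mutate()`, by contrast, would use the already-updated `a`). A tiny sketch:

```r
x <- data.frame(a = 3, b = 1)

# both pmin and pmax see the original a = 3, b = 1,
# so the pair is reordered to a = 1, b = 3
transform(x, a = pmin(a, b), b = pmax(a, b))
```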
Here's one approach:
First, create a vector of the columns sorted and then pasted together.
x <- apply(mydf, 1, function(x) paste(sort(x), collapse = " "))
Then, use ave to create the counts you are looking for.
mydf$count <- ave(x, x, FUN = length)
Finally, you can use the "x" vector again, this time to detect and remove duplicated values.
mydf[!duplicated(x), ]
# V1 V2 count
# 1 0 1 2
# 2 0 2 1
# 3 0 3 2
# 4 0 4 1
# 5 0 5 1
# 6 0 6 1
# 7 0 7 1
# 8 0 8 1
# 9 0 9 1
# 10 0 10 1
# 12 1 11 1
# 13 1 12 1
# 14 1 13 1
# 15 1 14 1
# 16 1 15 1
# 17 1 16 1
# 18 1 17 1
# 19 1 18 1
# 20 1 19 1
