Calculate the ratio between different columns, with some restrictions - r

Consider the dataset:
example1 = data.frame("year" = c(1,1,3,4,1,2,3,4,1,2,3,4,5),
                      "household" = c(1,1,1,1,2,2,2,2,2,2,2,2,2),
                      "person" = c(1,1,1,1,1,1,1,1,2,2,2,2,2),
                      "expected income" = c(seq(140,260,10)),
                      "income" = c(seq(110,230,10)))
Just to give an idea, person = 1 is the father of the family and person = 2 is the mother; the complete dataset will also contain the children, but that doesn't matter right now.
I need to calculate the ratio between column(4) "expected income" in year(i) and column(5) "income" in year(i+1).
Furthermore, the ratio has to be calculated only when the "person" and "household" are the same.
For example, the ratio between col(4)-row(4) and col(5)-row(5) must not be calculated, because those rows belong to two men from different households;
the same goes for col(4)-row(8) and col(5)-row(9), because they are two different people within the same household.
Instead of the ratio between the "expected income" and the "income" of two different people, I need an NA.
It has to be done generically, since this is just a simplification of a dataset with more than 60,000 rows.

It sounds like you need to group by household and person, then find the ratio of the expected income to the lead value of income:
library(tidyverse)

example1 %>%
  group_by(person, household) %>%
  mutate(ratio = expected.income / lead(income))
#> # A tibble: 13 x 6
#> # Groups:   person, household [3]
#>     year household person expected.income income ratio
#>    <dbl>     <dbl>  <dbl>           <dbl>  <dbl> <dbl>
#>  1     1         1      1             140    110  1.17
#>  2     2         1      1             150    120  1.15
#>  3     3         1      1             160    130  1.14
#>  4     4         1      1             170    140 NA
#>  5     1         2      1             180    150  1.12
#>  6     2         2      1             190    160  1.12
#>  7     3         2      1             200    170  1.11
#>  8     4         2      1             210    180 NA
#>  9     1         2      2             220    190  1.1
#> 10     2         2      2             230    200  1.10
#> 11     3         2      2             240    210  1.09
#> 12     4         2      2             250    220  1.09
#> 13     5         2      2             260    230 NA
Created on 2022-05-11 by the reprex package (v2.0.1)

Is this what you are looking for:
library(dplyr)

example1 %>%
  mutate(ratio = ifelse(person == household, expected.income / income, NA))
Output:
   year household person expected.income income    ratio
1     1         1      1             140    110 1.272727
2     1         1      1             150    120 1.250000
3     3         1      1             160    130 1.230769
4     4         1      1             170    140 1.214286
5     1         2      1             180    150       NA
6     2         2      1             190    160       NA
7     3         2      1             200    170       NA
8     4         2      1             210    180       NA
9     1         2      2             220    190 1.157895
10    2         2      2             230    200 1.150000
11    3         2      2             240    210 1.142857
12    4         2      2             250    220 1.136364
13    5         2      2             260    230 1.130435

First, order by household, person and year. Then calculate the ratio and set it to NA wherever the next line is not the next year, not the same household, or not the same person.
. <- example1
. <- .[order(.$household, .$person, .$year), ]
.$ratio <- .$expected.income / c(.$income[-1], NA)
is.na(.$ratio) <- (1 + .$year) != c(.$year[-1], NA) |
  .$household != c(.$household[-1], NA) |
  .$person != c(.$person[-1], NA)
.
#    year household person expected.income income    ratio
# 1     1         1      1             140    110       NA
# 2     1         1      1             150    120       NA
# 3     3         1      1             160    130 1.142857
# 4     4         1      1             170    140       NA
# 5     1         2      1             180    150 1.125000
# 6     2         2      1             190    160 1.117647
# 7     3         2      1             200    170 1.111111
# 8     4         2      1             210    180       NA
# 9     1         2      2             220    190 1.100000
# 10    2         2      2             230    200 1.095238
# 11    3         2      2             240    210 1.090909
# 12    4         2      2             250    220 1.086957
# 13    5         2      2             260    230       NA
I don't know whether starting two times with year 1 is a typo, but it does show whether the next-year condition is taken into account.
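For completeness, the same next-year check can also be written with dplyr. This is only a sketch along the lines of the base R logic above, not one of the original answers:
library(dplyr)

example1 %>%
  arrange(household, person, year) %>%
  group_by(household, person) %>%
  mutate(ratio = if_else(lead(year) == year + 1,
                         expected.income / lead(income),
                         NA_real_)) %>%
  ungroup()
Here if_else() yields NA both when the next row within a group is not the following year and at the end of each group, where lead() returns NA.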

cumsum by participant and reset on 0 R [duplicate]

I have a data frame that looks like this below. I need to sum the number of correct trials by participant, and reset the counter when it gets to a 0.
Participant TrialNumber Correct
118 1 1
118 2 1
118 3 1
118 4 1
118 5 1
118 6 1
118 7 1
118 8 0
118 9 1
118 10 1
120 1 1
120 2 1
120 3 1
120 4 1
120 5 0
120 6 1
120 7 0
120 8 1
120 9 1
120 10 1
I've tried using splitstackshape:
df$Count <- getanID(cbind(df$Participant, cumsum(df$Correct)))[,.id]
But it cumulatively sums the correct trials when it gets to a 0 and not by participant:
Participant TrialNumber Correct Count
118 1 1 1
118 2 1 1
118 3 1 1
118 4 1 1
118 5 1 1
118 6 1 1
118 7 1 1
118 8 0 2
118 9 1 1
118 10 1 1
120 1 1 1
120 2 1 1
120 3 1 1
120 4 1 1
120 5 0 2
120 6 1 1
120 7 0 2
120 8 1 1
120 9 1 1
120 10 1 1
I then tried using dplyr:
df %>%
  group_by(Participant) %>%
  mutate(Count = cumsum(Correct)) %>%
  ungroup %>%
  as.data.frame(df)
Participant TrialNumber Correct Count
118 1 1 1
118 2 1 2
118 3 1 3
118 4 1 4
118 5 1 5
118 6 1 6
118 7 1 7
118 8 0 7
118 9 1 8
118 10 1 9
120 1 1 1
120 2 1 2
120 3 1 3
120 4 1 4
120 5 0 4
120 6 1 5
120 7 0 5
120 8 1 6
120 9 1 7
120 10 1 8
Which gets me closer, but still doesn't reset the counter when it gets to 0. If anyone has any suggestions on how to do this, it would be greatly appreciated, thank you.
Does this work?
library(dplyr)
library(data.table)

df %>%
  mutate(grp = rleid(Correct)) %>%
  group_by(Participant, grp) %>%
  mutate(Count = cumsum(Correct)) %>%
  select(-grp)
# A tibble: 10 x 4
# Groups:   Participant, grp [6]
     grp Participant Correct Count
   <int> <chr>         <dbl> <dbl>
 1     1 A                 1     1
 2     1 A                 1     2
 3     1 A                 1     3
 4     2 A                 0     0
 5     3 A                 1     1
 6     3 B                 1     1
 7     3 B                 1     2
 8     4 B                 0     0
 9     5 B                 1     1
10     5 B                 1     2
Toy data:
df <- data.frame(
  Participant = c(rep("A", 5), rep("B", 5)),
  Correct = c(1,1,1,0,1,1,1,0,1,1)
)
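If you would rather stay within dplyr and skip the data.table dependency, the run id can also be built from a cumulative count of the zeros. A sketch under that assumption; the run helper column is my addition, not part of the answer above:
library(dplyr)

df %>%
  group_by(Participant, run = cumsum(Correct == 0)) %>%
  mutate(Count = cumsum(Correct)) %>%
  ungroup() %>%
  select(-run)
Every 0 starts a new run, so the cumulative sum restarts at the row containing the 0.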

How can I create a lag difference variable within group relative to baseline?

I would like a variable that is the lagged difference from the within-group baseline. I have panel data that I have balanced.
my_data <- data.frame(id = c(1,1,1,2,2,2,3,3,3),
                      group = c(1,2,3,1,2,3,1,2,3),
                      score = as.numeric(c(0,150,170,80,100,110,75,100,0)))
id group score
1 1 1 0
2 1 2 150
3 1 3 170
4 2 1 80
5 2 2 100
6 2 3 110
7 3 1 75
8 3 2 100
9 3 3 0
I would like it to look like this:
id group score lag_diff_baseline
1 1 1 0 NA
2 1 2 150 150
3 1 3 170 170
4 2 1 80 NA
5 2 2 100 20
6 2 3 110 30
7 3 1 75 NA
8 3 2 100 25
9 3 3 0 -75
The data.table version of @Liam's answer:
library(data.table)

setDT(my_data)
my_data[, .(id, group, score, lag_diff_baseline = score - first(score)), by = id]
I missed the easy answer:
library(dplyr)

my_data %>%
  group_by(id) %>%
  mutate(lag_diff_baseline = score - first(score))
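Note that score - first(score) gives 0 rather than NA on the baseline row itself. If you want the NA shown in the desired output, one possible variant (assuming the first row within each id is the baseline) is:
library(dplyr)

my_data %>%
  group_by(id) %>%
  mutate(lag_diff_baseline = if_else(row_number() == 1,
                                     NA_real_,
                                     score - first(score))) %>%
  ungroup()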

How to rank a column with a condition

I have a data frame :
dt <- read.table(text = "
1 390
1 366
1 276
1 112
2 97
2 198
2 400
2 402
3 110
3 625
4 137
4 49
4 9
4 578 ")
The first column is Index and the second is distance.
I want to add a column that ranks the distance by Index in descending order (the highest distance is ranked first).
The result will be:
dt <- read.table(text = "
1 390 1
1 66 4
1 276 2
1 112 3
2 97 4
2 198 3
2 300 2
2 402 1
3 110 2
3 625 1
4 137 2
4 49 3
4 9 4
4 578 1")
Another base R approach:
dt$Rank <- unlist(tapply(-dt$V2, dt$V1, rank))
A tidyverse solution:
library(dplyr)

dt %>%
  group_by(V1) %>%
  mutate(Rank = rank(-V2))
Or, in base R with ave:
transform(dt, s = ave(-V2, V1, FUN = rank))
   V1  V2 s
1   1 390 1
2   1  66 4
3   1 276 2
4   1 112 3
5   2  97 4
6   2 198 3
7   2 300 2
8   2 402 1
9   3 110 2
10  3 625 1
11  4 137 2
12  4  49 3
13  4   9 4
14  4 578 1
You could group, arrange, and use row_number(). The result is a bit easier on the eyes than a simple rank, I think, and so worth an extra step.
dt %>%
  group_by(V1) %>%
  arrange(V1, desc(V2)) %>%
  mutate(rank = row_number())
# A tibble: 14 x 3
# Groups:   V1 [4]
      V1    V2  rank
   <int> <int> <int>
 1     1   390     1
 2     1   366     2
 3     1   276     3
 4     1   112     4
 5     2   402     1
 6     2   400     2
 7     2   198     3
 8     2    97     4
 9     3   625     1
10     3   110     2
11     4   578     1
12     4   137     2
13     4    49     3
14     4     9     4
An alternative that keeps the original (scrambled) row order is min_rank:
dt %>%
  group_by(V1) %>%
  mutate(min_rank(desc(V2)))
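For completeness, a data.table option (a sketch I am adding here, not one of the original answers) does the same grouped descending rank with frank():
library(data.table)

setDT(dt)[, Rank := frank(-V2), by = V1]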

Create Customized weighted variable in R

My data set looks like this
set.seed(1)
data <- data.frame(ITEMID = 101:120,DEPT = c(rep(1,10),rep(2,10)),
CLASS = c(1,1,1,1,1,2,2,2,2,2,1,1,1,1,1,2,2,2,2,2),
SUBCLASS = c(3,3,3,3,4,4,4,4,4,3,3,3,3,3,3,4,4,4,4,4),
PRICE = sample(1:20,20),UNITS = sample(1:100,20)
)
> data
ITEMID DEPT CLASS SUBCLASS PRICE UNITS
1 101 1 1 3 6 94
2 102 1 1 3 8 22
3 103 1 1 3 11 64
4 104 1 1 3 16 13
5 105 1 1 4 4 26
6 106 1 2 4 14 37
7 107 1 2 4 15 2
8 108 1 2 4 9 36
9 109 1 2 4 19 81
10 110 1 2 3 1 31
11 111 2 1 3 3 44
12 112 2 1 3 2 54
13 113 2 1 3 20 90
14 114 2 1 3 10 17
15 115 2 1 3 5 72
16 116 2 2 4 7 57
17 117 2 2 4 12 67
18 118 2 2 4 17 9
19 119 2 2 4 18 60
20 120 2 2 4 13 34
Now I want to add another column called PRICE_RATIO using the following logic:
Taking ITEMID 101 and grouping by DEPT, CLASS and SUBCLASS yields prices c(6,8,11,16) and UNITS c(94,22,64,13) for ITEMIDs c(101,102,103,104) respectively.
Now, for each item ID, the variable PRICE_RATIO will be the ratio of the price of that item to the weighted price of all other ITEMIDs in the group. For example:
For item ID 101 the other items are c(102,103,104), whose total UNITS is (22 + 64 + 13) = 99, so their weights are (22/99, 64/99, 13/99). The weighted price of all other items is therefore (22/99)*8 + (64/99)*11 + (13/99)*16 = 10.9899, and the value of PRICE_RATIO is 6/10.9899 = 0.546 (about 0.55).
Similarly for all other items.
Any help in creating the code for this will be greatly appreciated.
One solution to your problem, and to such problems generally, is the dplyr package and its data-munging capabilities. The logic is as you describe: group by the desired columns, then mutate the desired value, i.e. the sum-product of price and units (excluding the current row's product) divided by the remaining units, and finally the ratio of price to that weighted price. You can execute every step of this computation separately (I encourage that, so you can learn) and see exactly what it does.
library(dplyr)

data %>%
  group_by(DEPT, CLASS, SUBCLASS) %>%
  mutate(price_ratio = round(PRICE /
                               ((sum(UNITS * PRICE) - UNITS * PRICE) /
                                  (sum(UNITS) - UNITS)),
                             2))
Output is as follows:
Source: local data frame [20 x 7]
Groups: DEPT, CLASS, SUBCLASS [6]

   ITEMID  DEPT CLASS SUBCLASS PRICE UNITS price_ratio
    <int> <dbl> <dbl>    <dbl> <int> <int>       <dbl>
1     101     1     1        3     6    94        0.55
2     102     1     1        3     8    22        0.93
3     103     1     1        3    11    64        1.50
4     104     1     1        3    16    13        1.99
5     105     1     1        4     4    26         NaN
6     106     1     2        4    14    37        0.88
7     107     1     2        4    15     2        0.97
8     108     1     2        4     9    36        0.52
9     109     1     2        4    19    81        1.63
10    110     1     2        3     1    31         NaN
11    111     2     1        3     3    44        0.29
12    112     2     1        3     2    54        0.18
13    113     2     1        3    20    90        4.86
14    114     2     1        3    10    17        1.08
15    115     2     1        3     5    72        0.46
16    116     2     2        4     7    57        0.48
17    117     2     2        4    12    67        0.93
18    118     2     2        4    17     9        1.36
19    119     2     2        4    18    60        1.67
20    120     2     2        4    13    34        1.03
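As a quick sanity check, the worked arithmetic from the question for ITEMID 101 can be reproduced by hand (a small sketch using the prices and units quoted above, not part of the original answer):
price <- c(6, 8, 11, 16)
units <- c(94, 22, 64, 13)

# weighted price of the other items in the group (102, 103, 104)
weighted_other <- sum(units[-1] * price[-1]) / sum(units[-1])  # 10.9899
price[1] / weighted_other                                      # 0.546, which rounds to the 0.55 above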

Group Data in R for consecutive rows

If there's not a quick 1-3 liner for this in R, I'll definitely just use Linux sort and a short Python program using groupby, so don't bend over backwards trying to get something crazy working. Here's the input data frame:
df_in <- data.frame(
  ID = c(1,1,1,1,1,2,2,2,2,2),
  weight = c(150,150,151,150,150,170,170,170,171,171),
  start_day = c(1,4,7,10,11,5,10,15,20,25),
  end_day = c(4,7,10,11,30,10,15,20,25,30)
)
ID weight start_day end_day
1 1 150 1 4
2 1 150 4 7
3 1 151 7 10
4 1 150 10 11
5 1 150 11 30
6 2 170 5 10
7 2 170 10 15
8 2 170 15 20
9 2 171 20 25
10 2 171 25 30
I would like to do some basic aggregation by ID and weight, but only when the group is in consecutive rows of df_in. Specifically, the desired output is
df_desired_out <- data.frame(
  ID = c(1,1,1,2,2),
  weight = c(150,151,150,170,171),
  min_day = c(1,7,10,5,20),
  max_day = c(7,10,30,20,30)
)
ID weight min_day max_day
1 1 150 1 7
2 1 151 7 10
3 1 150 10 30
4 2 170 5 20
5 2 171 20 30
This question seems to be extremely close to what I want, but I'm having lots of trouble adapting it for some reason.
In dplyr, I would do this by creating another grouping variable for the consecutive rows. This is what cumsum(c(1, diff(weight) != 0)) is doing in the code chunk below. An example of this is also here.
The group creation can be done within group_by, and then you can proceed accordingly with making any summaries by group.
library(dplyr)

df_in %>%
  group_by(ID, group_weight = cumsum(c(1, diff(weight) != 0)), weight) %>%
  summarise(start_day = min(start_day), end_day = max(end_day))
Source: local data frame [5 x 5]
Groups: ID, group_weight [?]

     ID group_weight weight start_day end_day
  (dbl)        (dbl)  (dbl)     (dbl)   (dbl)
1     1            1    150         1       7
2     1            2    151         7      10
3     1            3    150        10      30
4     2            4    170         5      20
5     2            5    171        20      30
This approach does leave you with the extra grouping variable in the dataset, which can be removed, if needed, with select(-group_weight) after ungrouping.
First we combine ID and weight. The quick-and-dirty way is using paste:
df_in$id_weight <- paste(df_in$ID, df_in$weight, sep = '_')
df_in
   ID weight start_day end_day id_weight
1   1    150         1       4     1_150
2   1    150         4       7     1_150
3   1    151         7      10     1_151
4   1    150        10      11     1_150
5   1    150        11      30     1_150
6   2    170         5      10     2_170
7   2    170        10      15     2_170
8   2    170        15      20     2_170
9   2    171        20      25     2_171
10  2    171        25      30     2_171
A safer way is to use interaction or group_indices; see: Combine values in 4 columns to a single unique value.
We can group consecutively using rle.
rlel <- rle(df_in$id_weight)$lengths
df_in$group <- unlist(lapply(1:length(rlel), function(i) rep(i, rlel[i])))
df_in
   ID weight start_day end_day id_weight group
1   1    150         1       4     1_150     1
2   1    150         4       7     1_150     1
3   1    151         7      10     1_151     2
4   1    150        10      11     1_150     3
5   1    150        11      30     1_150     3
6   2    170         5      10     2_170     4
7   2    170        10      15     2_170     4
8   2    170        15      20     2_170     4
9   2    171        20      25     2_171     5
10  2    171        25      30     2_171     5
Now with the convenient group number we can summarize by group.
df_in %>%
  group_by(group) %>%
  summarize(id_weight = id_weight[1],
            start_day = min(start_day),
            end_day = max(end_day))
# A tibble: 5 x 4
  group id_weight start_day end_day
  <int> <chr>         <dbl>   <dbl>
1     1 1_150             1       7
2     2 1_151             7      10
3     3 1_150            10      30
4     4 2_170             5      20
5     5 2_171            20      30
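As an aside, the unlist(lapply(...)) step that expands the run lengths can be written more compactly. A sketch of two equivalent ways to build the same group column, assuming id_weight has been created as above:
# base R: repeat each run index by its run length
rlel <- rle(df_in$id_weight)$lengths
df_in$group <- rep(seq_along(rlel), rlel)

# or with data.table's run-length id
library(data.table)
df_in$group <- rleid(df_in$id_weight)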
In base R you can aggregate by ID and weight (note that this ignores the consecutive-rows requirement, so the two separate runs of ID 1 / weight 150 collapse into a single row):
merge(aggregate(cbind(min_day = start_day) ~ ID + weight, df_in, min),
      aggregate(cbind(max_day = end_day) ~ ID + weight, df_in, max))
Produces:
  ID weight min_day max_day
1  1    150       1      30
2  1    151       7      10
3  2    170       5      20
4  2    171      20      30
