How to find percentile and then group in R

I have a data frame like below (df).
day area hour time count
___ ____ _____ ___ ____
1 1 0 1 10
1 1 0 2 12
1 1 0 3 8
1 1 0 4 12
1 1 0 5 15
1 1 0 6 18
1 1 1 1 10
1 1 1 2 12
1 1 1 3 8
1 1 1 4 12
1 1 1 5 15
1 1 1 6 18
1 1 1 7 12
1 1 1 8 15
1 1 1 9 18
1 1 2 1 10
1 1 2 2 18
1 1 2 3 19
.....
2 1 0 1 18
2 1 0 2 12
2 1 0 3 18
2 1 0 4 12
2 1 1 1 8
2 1 1 2 12
2 1 1 3 18
2 1 1 4 10
2 1 1 5 15
2 1 1 6 18
2 1 1 7 12
2 1 1 8 15
2 1 1 9 18
2 1 2 1 10
2 1 2 2 18
2 1 2 3 19
2 1 2 4 9
2 1 2 5 18
2 1 2 6 9
.....
30 99 23 1 9
30 99 23 2 8
30 99 23 3 9
30 99 23 4 19
30 99 23 5 18
30 99 23 6 9
30 99 23 7 19
30 99 23 8 8
30 99 23 9 19
Here I have data for 30 days, 87 areas (1 to 82, plus 90, 93, 95, 97 and 99) and 24 hours (0 to 23) per day. The data records the time taken to cross an area and how many vehicles crossed in that time.
For example:
day area hour time count
___ ____ _____ ___ ____
1 1 0 1 10
1 1 0 2 12
1 1 0 3 8
1 1 0 4 12
1 1 0 5 15
1 1 0 6 18
This gives me, for day 1 and hour 0, the time taken to cross area 1:
time count cumulative_count
___ ___ ________________
1 10 10
2 12 22
3 8 30
4 12 42
5 15 57
6 18 75
10 vehicles crossed the area in 1 minute.
12 vehicles crossed the area in 2 minutes.
8 vehicles crossed the area in 3 minutes.
12 vehicles crossed the area in 4 minutes.
15 vehicles crossed the area in 5 minutes.
18 vehicles crossed the area in 6 minutes.
From this I want to calculate how much time it took for 80% of the vehicles to cross area 1 on day 1, hour 0. The total is 10+12+8+12+15+18 = 75 vehicles, and 80% of 75 is 60. So the time taken for 80% of the vehicles (60 of them) to pass area 1 on day 1, hour 0 will be between 5 and 6 minutes (nearer to 5). So the result will be like:
day area hour time_taken_for_80%vehicles_to_pass
___ ____ ____ ___________________________________
1 1 0 5.33(approximately)
1 1 1 7.30
1 1 2 2.16
....
30 1 23 3.13
1 2 0 ---
1 2 1 ---
1 2 2 ---
1 2 3 ---
.......
30 99 21 ---
30 99 22 ---
30 99 23 ---
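The arithmetic described above for day 1, area 1, hour 0 can be sketched in base R. This is only an illustration: it assumes vehicles are spread evenly within each minute, so the 80% point is found by linear interpolation between the two bracketing rows. Under that assumption the value comes out at about 5.17 rather than the rough 5.33 quoted above.

```r
# Counts for day 1, area 1, hour 0, copied from the example above
time  <- 1:6
count <- c(10, 12, 8, 12, 15, 18)

cum    <- cumsum(count)        # running total of vehicles
target <- 0.8 * sum(count)     # 80% of 75 vehicles

# First row at which the running total reaches the target, and the row before it
hi <- which(cum >= target)[1]
lo <- hi - 1                   # assumes the target is not already reached in row 1

# Linear interpolation between the two bracketing rows (an assumption,
# not something the data itself guarantees)
t80 <- time[lo] + (target - cum[lo]) / (cum[hi] - cum[lo])
t80
```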
I know I have to take a quantile and group by area, day and hour, so I tried:
library(dplyr)
grp <- group_by(df, day,area,hour,quantile(df$count,0.8))
But it does not work. Any help is appreciated.

My solution calculates, for each time, the cumulative percentage of vehicles that have crossed the area, then takes the first time at which that percentage is above 80%:
str <- 'day area hour time count
1 1 0 1 10
1 1 0 2 12
1 1 0 3 8
1 1 0 4 12
1 1 0 5 15
1 1 0 6 18
1 1 1 1 10
1 1 1 2 12
1 1 1 3 8
1 1 1 4 12
1 1 1 5 15
1 1 1 6 18
1 1 1 7 12
1 1 1 8 15
1 1 1 9 18
1 1 2 1 10
1 1 2 2 18
1 1 2 3 19'
file <- textConnection(str)
df <- read.table(file, header = T)
df
library(dplyr)
df %>% group_by(day, area, hour) %>%
  mutate(cumcount = cumsum(count),
         p = cumcount/max(cumcount)) %>%
  filter(p > 0.8) %>%
  summarise(time = min(time))
result:
day area hour time
<int> <int> <int> <int>
1 1 1 0 6
2 1 1 1 8
3 1 1 2 3
Or with a linear estimation of the time when 80% is reached:
df %>% group_by(day, area, hour) %>%
  mutate(cumcount = cumsum(count),
         p = cumcount/max(cumcount),
         g = +(p > 0.8)) %>%               # g = 1 once the 80% mark is passed
  group_by(day, area, hour, g) %>%
  # keep the last row below 80% (g = 0) and the first row above it (g = 1)
  filter(row_number((g*2-1)*time) == 1) %>%
  group_by(day, area, hour) %>%
  # interpolate linearly between those two rows
  summarise(time = min(time) + (0.8-min(p))/(max(p)-min(p)))
result:
day area hour time
<int> <int> <int> <dbl>
1 1 1 0 5.166667
2 1 1 1 7.600000
3 1 1 2 2.505263
Or get the same result using lag and lead:
df %>% group_by(day, area, hour) %>%
  arrange(time, .by_group = TRUE) %>%   # sort by time within each group so cumsum runs in time order
  mutate(cumcount = cumsum(count),
         p = cumcount/max(cumcount)) %>%
  filter((p >= 0.8 & lag(p) < 0.8) | (p < 0.8 & lead(p) >= 0.8)) %>%
  summarise(time = min(time) + (0.8-min(p))/(max(p)-min(p)))
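The interpolation step can also be written as a small helper built on base R's approx(). This is a sketch rather than part of the answer above: the name time_at_share and its share argument are made up for illustration, and the data frame holds only the three sample groups (day 1, area 1, hours 0 to 2).

```r
# Hypothetical helper: time at which a given share of the vehicles has
# crossed, found by linear interpolation on the cumulative share
time_at_share <- function(time, count, share = 0.8) {
  ord <- order(time)
  p <- cumsum(count[ord]) / sum(count)
  # ties = "ordered" tolerates repeated p values (rows with count 0)
  approx(x = p, y = time[ord], xout = share, ties = "ordered")$y
}

# The three groups from the sample data (day 1, area 1, hours 0 to 2)
df <- data.frame(
  hour  = rep(0:2, times = c(6, 9, 3)),
  time  = c(1:6, 1:9, 1:3),
  count = c(10, 12, 8, 12, 15, 18,
            10, 12, 8, 12, 15, 18, 12, 15, 18,
            10, 18, 19)
)

# One interpolated time per hour
res <- sapply(split(df, df$hour), function(g) time_at_share(g$time, g$count))
round(res, 4)
```

With dplyr, the same helper could be called inside summarise() after group_by(day, area, hour). Note that approx() returns NA when the requested share lies below the first cumulative share, that is, when the first row alone already exceeds 80%.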

Related

Keep previous value if it is under a certain threshold

I would like to create a variable called treatment_cont that is computed within each ID as follows:
ID day day_diff treatment treatment_cont
1 0 NA 1 1
1 14 14 1 1
1 20 6 2 2
1 73 53 1 1
2 0 NA 1 1
2 33 33 1 1
2 90 57 2 2
2 112 22 3 2
2 152 40 1 1
2 178 26 4 1
treatment_cont is the same as treatment, except that when day_diff (the number of days since the previous treatment) is lower than 30 we want to keep the previous treatment regime.
I have tried many ways with dplyr, manipulating the table, but I cannot figure out how to do it efficiently.
Probably a conditional mutate using case_when and lag might work:
df %>% mutate(treatment_cont = case_when(day_diff < 30 ~ treatment,TRUE ~ lag(treatment)))
You are probably looking for lag (and perhaps its brother, lead):
library(dplyr)
library(tidyr) # needed for replace_na
df %>%
  replace_na(list(day_diff = 0)) %>%
  group_by(ID) %>%
  arrange(day) %>%
  mutate(
    # within 30 days of the previous row, carry the previous treatment forward
    treatment_cont = ifelse(day_diff < 30, lag(treatment, default = treatment[1]), treatment)
  ) %>%
  ungroup %>%
  arrange(ID, day)
# A tibble: 10 x 5
ID day day_diff treatment treatment_cont
<int> <int> <dbl> <int> <int>
1 1 0 0 1 1
2 1 14 14 1 1
3 1 20 6 2 1
4 1 73 53 1 1
5 2 0 0 1 1
6 2 33 33 1 1
7 2 90 57 2 2
8 2 112 22 3 2
9 2 152 40 1 1
10 2 178 26 4 1
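For comparison, the same one-row look-back can be sketched in base R with ave(). This is an illustration only: it assumes the rows are already sorted by ID and day, and the data frame is typed in from the question's table (without the desired treatment_cont column).

```r
# Sample data typed in from the question (rows already sorted by ID and day)
df <- data.frame(
  ID        = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  day       = c(0, 14, 20, 73, 0, 33, 90, 112, 152, 178),
  day_diff  = c(NA, 14, 6, 53, NA, 33, 57, 22, 40, 26),
  treatment = c(1, 1, 2, 1, 1, 1, 2, 3, 1, 4)
)

# Previous treatment within each ID (the first row falls back to itself)
prev <- ave(df$treatment, df$ID, FUN = function(v) c(v[1], v[-length(v)]))

# Keep the previous treatment when the gap is known and under 30 days
df$treatment_cont <- ifelse(!is.na(df$day_diff) & df$day_diff < 30,
                            prev, df$treatment)
df$treatment_cont
```

Like the dplyr version, this looks back exactly one row, so it does not chain a regime across several consecutive short gaps.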

create a filter variable to indicate time order value is available in r

I have student scores, and some students have multiple scores taken at different times. Here is the sample dataset I have.
df <- data.frame(
id = c(1,2,2,3,4,5,5,6),
time = c(1,1,2,1,1,1,2,1),
score = c(15,16,18,19,22,29,19,52))
> df
id time score
1 1 1 15
2 2 1 16
3 2 2 18
4 3 1 19
5 4 1 22
6 5 1 29
7 5 2 19
8 6 1 52
The time variable here is actual time, but I just put the order number for simplicity. I need flag variables to show which scores are a student's first and which are the second.
Here is my desired output.
> df
id time score score1 score2
1 1 1 15 1 0
2 2 1 16 1 0
3 2 2 18 0 1
4 3 1 19 1 0
5 4 1 22 1 0
6 5 1 29 1 0
7 5 2 19 0 1
8 6 1 52 1 0
Does anyone have an idea?
Thanks!
Does this work:
library(dplyr)
df %>% mutate(score1 = case_when(time == 1 ~ 1, TRUE ~ 0),
score2 = case_when(time == 2 ~ 1, TRUE ~ 0))
id time score score1 score2
1 1 1 15 1 0
2 2 1 16 1 0
3 2 2 18 0 1
4 3 1 19 1 0
5 4 1 22 1 0
6 5 1 29 1 0
7 5 2 19 0 1
8 6 1 52 1 0
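Since a logical comparison converts cleanly to 0/1, the same flags can also be sketched in base R without case_when. The data frame below is the sample from the question.

```r
df <- data.frame(
  id    = c(1, 2, 2, 3, 4, 5, 5, 6),
  time  = c(1, 1, 2, 1, 1, 1, 2, 1),
  score = c(15, 16, 18, 19, 22, 29, 19, 52)
)

# TRUE/FALSE becomes 1/0 under as.integer()
df$score1 <- as.integer(df$time == 1)
df$score2 <- as.integer(df$time == 2)
df
```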

R MICE Imputation

data=data.frame("student"=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5),
"time"=c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4),
"v1"=c(16,12,14,12,17,16,12,12,13,12,16,16,10,10,14,17,17,12,10,11),
"v2"=c(1,1,3,2,2,2,3,1,2,1,2,1,3,1,1,2,3,3,1,2),
"v3"=c(4,1,4,4,2,2,2,2,1,3,2,3,1,2,2,1,4,1,1,4),
"v4"=c(NA,27,NA,42,40,48,45,25,29,NA,NA,27,NA,NA,NA,NA,NA,NA,44,39),
"v5"=c(NA,1,NA,NA,1,3,3,2,NA,NA,NA,1,NA,NA,NA,NA,3,2,4,1),
"v6"=c(NA,0,1,NA,1,NA,1,NA,0,NA,1,1,NA,NA,NA,NA,0,0,NA,0),
"v7"=c(0,1,1,NA,0,1,1,0,1,0,NA,0,NA,NA,NA,NA,0,1,NA,1),
"v8"=c(1,NA,0,1,0,0,NA,1,1,NA,0,0,NA,NA,NA,NA,1,0,NA,1))
This is my sample data and with it I am seeking to:
A. For time = 1 use v1-v3 to impute v4-v8 using MICE (v4 is continuous, v5 is categorical, v6-v8 are binary)
B. After values are imputed for time = 1, I want to fill NA values that follow with the previous value. So if the variable for time 1-4 is: NA,NA,0,1 and the imputed value at time 1 is 1, then it could be: 1-1-0-1
I attempted:
dataNEW <- mice(data[,data$time == 1],m=5,maxit=50,meth='pmm',seed=500)
A. For time = 1 use v1-v3 to impute v4-v8 using MICE (v4 is continuous, v5 is categorical, v6-v8 are binary)
First, variables v5 - v8 have to be converted to factors:
data$v5 <- factor(data$v5)
data$v6 <- factor(data$v6)
data$v7 <- factor(data$v7)
data$v8 <- factor(data$v8)
Create a predictor matrix to tell mice to use only v1-v3 to predict v4-v8:
Pred_Matrix <- 1 - diag(ncol(data))
Pred_Matrix[,c(1:2, 6:10)] <- 0
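To see what this matrix encodes, it can help to label its rows and columns. In mice's predictor matrix each row is a variable to be imputed and each column a candidate predictor, with 1 meaning "use this column". The sketch below needs only base R; the dimnames are added purely for readability and are not part of the answer's code.

```r
vars <- c("student", "time", paste0("v", 1:8))

# Start from "every variable predicts every other variable" ...
Pred_Matrix <- 1 - diag(length(vars))
dimnames(Pred_Matrix) <- list(target = vars, predictor = vars)

# ... then switch off student, time and v4-v8 as predictors
Pred_Matrix[, c(1:2, 6:10)] <- 0

# Columns that still contain a 1 are the variables used as predictors
predictors <- colnames(Pred_Matrix)[colSums(Pred_Matrix) > 0]
predictors

# Row for v4: which variables are used to impute it
Pred_Matrix["v4", ]
```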
Impute using only 1 imputation (the default is 5) because all you want are the imputed values; you're not doing anything else such as pooling the results for modelling.
library(mice)
impA <- mice(subset(data, subset = time == 1), predictorMatrix = Pred_Matrix, m = 1)
The imputed data can be extracted using the complete function (from the mice package, not tidyr).
B. After values are imputed for time = 1, I want to fill NA
values that follow with the previous value. So if the variable for
time 1-4 is: NA,NA,0,1 and the imputed value at time 1 is 1, then it
could be: 1-1-0-1
library(dplyr)
library(tidyr) # Needed for the fill function
mice::complete(impA) %>%
rbind(subset(data, subset=time!=1)) %>%
arrange(student, time) %>%
group_by(student) %>%
fill(v4:v8)
# A tibble: 20 x 10
# Groups: student [5]
student time v1 v2 v3 v4 v5 v6 v7 v8
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <fct> <fct>
1 1 1 16 1 4 40 2 1 0 1
2 1 2 12 1 1 27 1 0 1 1
3 1 3 14 3 4 27 1 1 1 0
4 1 4 12 2 4 42 1 1 1 1
5 2 1 17 2 2 40 1 1 0 0
6 2 2 16 2 2 48 3 1 1 0
7 2 3 12 3 2 45 3 1 1 0
8 2 4 12 1 2 25 2 1 0 1
9 3 1 13 2 1 29 1 0 1 1
10 3 2 12 1 3 29 1 0 0 1
11 3 3 16 2 2 29 1 1 0 0
12 3 4 16 1 3 27 1 1 0 0
13 4 1 10 3 1 40 1 0 0 0
14 4 2 10 1 2 40 1 0 0 0
15 4 3 14 1 2 40 1 0 0 0
16 4 4 17 2 1 40 1 0 0 0
17 5 1 17 3 4 40 3 0 0 1
18 5 2 12 3 1 40 2 0 1 0
19 5 3 10 1 1 44 4 0 1 0
20 5 4 11 2 4 39 1 0 1 1
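The fill step is a "last observation carried forward" within each student. A minimal base R sketch of that idea, using a hypothetical helper named locf and the example from the question (NA, NA, 0, 1 with the first value imputed as 1):

```r
# Hypothetical helper: replace each NA with the most recent non-NA value
locf <- function(v) {
  idx <- cumsum(!is.na(v))   # index of the latest non-NA value seen so far
  idx[idx == 0] <- NA        # leading NAs have nothing to carry forward
  v[!is.na(v)][idx]
}

# Time 1 has been imputed as 1; times 2-4 are NA, 0, 1
locf(c(1, NA, 0, 1))

# A leading NA stays NA because there is no earlier value
locf(c(NA, 2, NA))
```

This is why the imputation at time = 1 has to happen first: once every student has a value at time 1, nothing is left unfilled.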
Data
Note, I had to change the first value of v5 to 2, otherwise the polyreg imputation fails (there are only two categories for time=1).
data=data.frame("student"=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5),
"time"=c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4),
"v1"=c(16,12,14,12,17,16,12,12,13,12,16,16,10,10,14,17,17,12,10,11),
"v2"=c(1,1,3,2,2,2,3,1,2,1,2,1,3,1,1,2,3,3,1,2),
"v3"=c(4,1,4,4,2,2,2,2,1,3,2,3,1,2,2,1,4,1,1,4),
"v4"=c(NA,27,NA,42,40,48,45,25,29,NA,NA,27,NA,NA,NA,NA,NA,NA,44,39),
"v5"=c(2,1,NA,NA,1,3,3,2,NA,NA,NA,1,NA,NA,NA,NA,3,2,4,1),
"v6"=c(NA,0,1,NA,1,NA,1,NA,0,NA,1,1,NA,NA,NA,NA,0,0,NA,0),
"v7"=c(0,1,1,NA,0,1,1,0,1,0,NA,0,NA,NA,NA,NA,0,1,NA,1),
"v8"=c(1,NA,0,1,0,0,NA,1,1,NA,0,0,NA,NA,NA,NA,1,0,NA,1))

Calculating entropy in grouped panel data

I have a grouped data structure (different households answering a weekly opinion poll) and I observe every household over 52 weeks (in the example 4 weeks). Now I want to indicate the value of a household at a given point in time using entropy. The value of a household participating in the poll should be higher, if the household didn't participate in the past weeks. So a household always answering the poll should have a lower value in these 4 given weeks than a household answering every two weeks in the two weeks when it does participate. It's important that for a given household the inequality measure varies over weeks.
What's the best way to do so? If it's entropy, how do I apply it to a panel data structure using R?
The data structure is as follows:
da_poll <- data.frame(household = c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4), participation = c(1,1,1,1,0,0,0,1,0,1,0,1,1,1,1,0))
da_poll
household participation
1 1 1
2 1 1
3 1 1
4 1 1
5 2 0
6 2 0
7 2 0
8 2 1
9 3 0
10 3 1
11 3 0
12 3 1
13 4 1
14 4 1
15 4 1
16 4 0
# 1 indicates participation, 0 no participation.
I have tried to group it by households, but then I only get one value for each household:
da_poll %>%
group_by(household) %>%
mutate(entropy = entropy(participation))
# A tibble: 16 x 4
# Groups: household [4]
household week participation entropy
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 1.39
2 1 2 1 1.39
3 1 3 1 1.39
4 1 4 1 1.39
5 2 1 0 0
6 2 2 0 0
7 2 3 0 0
8 2 4 1 0
9 3 1 0 0.693
10 3 2 1 0.693
11 3 3 0 0.693
12 3 4 1 0.693
13 4 1 1 1.10
14 4 2 1 1.10
15 4 3 1 1.10
16 4 4 0 1.10
If I group based on household and week, I also get something strange:
da_poll %>%
group_by(household, week) %>%
mutate(entropy = entropy(participation))
# A tibble: 16 x 4
# Groups: household, week [16]
household week participation entropy
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 0
2 1 2 1 0
3 1 3 1 0
4 1 4 1 0
5 2 1 0 NA
6 2 2 0 NA
7 2 3 0 NA
8 2 4 1 0
9 3 1 0 NA
10 3 2 1 0
11 3 3 0 NA
12 3 4 1 0
13 4 1 1 0
14 4 2 1 0
15 4 3 1 0
16 4 4 0 NA
To calculate the entropy cumulatively you need to write your own cumulative function. There is probably a more idiomatic tidyverse way to do it, but this is what I came up with. Based on your post and your comments, entropy may not be the metric you are looking for.
cummulEntropy <- function(x){
unlist(lapply(seq_along(x), function(i) entropy::entropy(x[1:i])))
}
da_poll %>%
group_by(household) %>%
mutate(entropy=cummulEntropy(participation))
# A tibble: 16 x 3
# Groups: household [4]
# household participation entropy
# <dbl> <dbl> <dbl>
# 1 1 1 0
# 2 1 1 0.693
# 3 1 1 1.10
# 4 1 1 1.39
# 5 2 0 NA
# 6 2 0 NA
# 7 2 0 NA
# 8 2 1 0
# 9 3 0 NA
#10 3 1 0
#11 3 0 0
#12 3 1 0.693
#13 4 1 0
#14 4 1 0.693
#15 4 1 1.10
#16 4 0 1.10
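To make the numbers above easier to interpret, here is a base R sketch of the same cumulative calculation without the entropy package. It treats the participation values as counts and computes plain Shannon entropy with the natural log, which is assumed here to be comparable to what the package does by default; the function names shannon and cumulative_entropy are made up for illustration.

```r
# Shannon entropy (natural log) of a vector of non-negative counts
shannon <- function(counts) {
  total <- sum(counts)
  if (total == 0) return(NA_real_)   # undefined when nothing has been observed
  p <- counts[counts > 0] / total
  -sum(p * log(p))
}

# Entropy of the first i values, for every i
cumulative_entropy <- function(x) {
  vapply(seq_along(x), function(i) shannon(x[seq_len(i)]), numeric(1))
}

h1 <- cumulative_entropy(c(1, 1, 1, 1))  # household 1
h2 <- cumulative_entropy(c(0, 0, 0, 1))  # household 2
round(h1, 3)
h2
```

Household 1 grows as log(1), log(2), log(3), log(4): the measure rises with every week of participation, which is the opposite of the "rarer participation is worth more" behaviour asked for in the question.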

Lagged function within group

I would like to write code to compute, within each group, the sum of lagged differences as shown in the table below:
ID x rank U R Required Output Value
1 1 1 U1 R1 -
1 1 2 U2 R2 R2-U1
1 1 3 U3 R3 (R3-U2) + (R3-U1)
1 1 4 U4 R4 (R4-U3) + (R4-U2) + (R4-U1)
1 0 5 U5 R5 R5
1 0 6 U6 R6 R6
1 0 7 U7 R7 R7
2 1 1 U8 R8 -
2 1 2 U9 R9 R9-U8
2 1 3 U10 R10 (R10-U9) + (R10 - U8)
2 1 4 U11 R11 (R11-U10) + (R11 - U9) + (R11 - U8)
3 1 1 U12 R12 -
3 0 2 U13 R13 R13
3 0 3 U14 R14 R14
ID is the unique group identifier. x is a bool and depending on its value the required output is either sum of difference with previous values or same period value. "rank" is a rank ordering column and the maximum rank can vary within each group. "U" and "R" are the main columns of interest.
To give a numerical example, I need the following:
ID x rank U R Required Output Value
1 1 1 10 7 -
1 1 2 9 11 1
1 1 3 10 10 1 + 0 = 1
1 1 4 11 13 3+4+3 = 10
1 0 5 7 8 8
1 0 6 8 8 8
1 0 7 5 7 7
2 1 1 3 2 -
2 1 2 9 15 12
2 1 3 13 14 16
2 1 4 1 14 17
3 1 1 12 1 -
3 0 2 14 9 9
3 0 3 1 11 11
R code to generate this table:
ID = c(rep(1,7),rep(2,4),rep(3,3))
x = c(rep(1,4),rep(0,3),rep(1,5),rep(0,2))
rank = c(1:7,1:4,1:3)
U = c(10,9,10,11,7,8,5,3,9,13,1,12,14,1)
R = c(7,11,10,13,8,8,7,2,15,14,14,1,9,11)
dat = cbind(ID,x,rank,U,R)
colnames(dat)=c("ID","x","rank","U","R")
Here's a tidyverse solution:
library(dplyr)
library(tidyr)
dat %>%
as_tibble() %>%
group_by(ID) %>%
mutate(output = ifelse(x, lag(rank) * R - lag(cumsum(U)), R))
Result:
# A tibble: 14 x 6
# Groups: ID [3]
ID x rank U R output
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 10 7 NA
2 1 1 2 9 11 1
3 1 1 3 10 10 1
4 1 1 4 11 13 10
5 1 0 5 7 8 8
6 1 0 6 8 8 8
7 1 0 7 5 7 7
8 2 1 1 3 2 NA
9 2 1 2 9 15 12
10 2 1 3 13 14 16
11 2 1 4 1 14 17
12 3 1 1 12 1 NA
13 3 0 2 14 9 9
14 3 0 3 1 11 11
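The one-line mutate works because the sum of (R_i - U_j) over all earlier rows j equals (i - 1) * R_i minus the running total of the earlier U values. A quick base R check of that identity on the x = 1 rows of ID 1, comparing a brute-force loop with the closed form (the variable names are illustrative only):

```r
# x == 1 rows of ID 1 from the numerical example
U <- c(10, 9, 10, 11)
R <- c(7, 11, 10, 13)

# Brute force: for row i, add up R[i] - U[j] over every earlier row j
brute <- sapply(seq_along(R), function(i) {
  if (i == 1) NA_real_ else sum(R[i] - U[seq_len(i - 1)])
})

# Closed form: (i - 1) * R[i] minus the cumulative sum of earlier U values
closed <- (seq_along(R) - 1) * R - c(NA, head(cumsum(U), -1))

brute
closed
```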
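Here is a base R solution using ave:
dat <- as.data.frame(dat) # within() needs a data frame, not the matrix returned by cbind()
dat <- within(dat, output <- ave(R, ID, x, FUN = function(v) v*(seq_along(v)-1)) - ave(U, ID, x, FUN = function(v) c(NA, cumsum(v)[-length(v)])))
dat <- within(dat, output <- ifelse(x == 0, R, output))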
such that
> dat
ID x rank U R output
1 1 1 1 10 7 NA
2 1 1 2 9 11 1
3 1 1 3 10 10 1
4 1 1 4 11 13 10
5 1 0 5 7 8 8
6 1 0 6 8 8 8
7 1 0 7 5 7 7
8 2 1 1 3 2 NA
9 2 1 2 9 15 12
10 2 1 3 13 14 16
11 2 1 4 1 14 17
12 3 1 1 12 1 NA
13 3 0 2 14 9 9
14 3 0 3 1 11 11
