R MICE Imputation

data=data.frame("student"=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5),
"time"=c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4),
"v1"=c(16,12,14,12,17,16,12,12,13,12,16,16,10,10,14,17,17,12,10,11),
"v2"=c(1,1,3,2,2,2,3,1,2,1,2,1,3,1,1,2,3,3,1,2),
"v3"=c(4,1,4,4,2,2,2,2,1,3,2,3,1,2,2,1,4,1,1,4),
"v4"=c(NA,27,NA,42,40,48,45,25,29,NA,NA,27,NA,NA,NA,NA,NA,NA,44,39),
"v5"=c(NA,1,NA,NA,1,3,3,2,NA,NA,NA,1,NA,NA,NA,NA,3,2,4,1),
"v6"=c(NA,0,1,NA,1,NA,1,NA,0,NA,1,1,NA,NA,NA,NA,0,0,NA,0),
"v7"=c(0,1,1,NA,0,1,1,0,1,0,NA,0,NA,NA,NA,NA,0,1,NA,1),
"v8"=c(1,NA,0,1,0,0,NA,1,1,NA,0,0,NA,NA,NA,NA,1,0,NA,1))
This is my sample data, and with it I am seeking to:
A. For time = 1, use v1-v3 to impute v4-v8 using MICE (v4 is continuous, v5 is categorical, v6-v8 are binary).
B. After values are imputed for time = 1, fill the NA values that follow with the previous value. So if a variable for times 1-4 is NA, NA, 0, 1 and the imputed value at time 1 is 1, it becomes 1, 1, 0, 1.
I attempted:
dataNEW <- mice(data[,data$time == 1],m=5,maxit=50,meth='pmm',seed=500)

A. For time = 1, use v1-v3 to impute v4-v8 using MICE (v4 is continuous, v5 is categorical, v6-v8 are binary)
First, variables v5 - v8 have to be converted to factors:
data$v5 <- factor(data$v5)
data$v6 <- factor(data$v6)
data$v7 <- factor(data$v7)
data$v8 <- factor(data$v8)
Create a predictor matrix to tell mice to use only v1-v3 to predict v4-v8:
Pred_Matrix <- 1 - diag(ncol(data))
Pred_Matrix[,c(1:2, 6:10)] <- 0
Impute using only 1 imputation (the default is 5) because all you want are the imputed values; you're not doing anything else such as pooling the results for modelling.
impA <- mice(subset(data, subset = time==1), pred = Pred_Matrix, m = 1)
The imputed data can be extracted using the complete function (from the mice package, not tidyr).
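As a self-contained illustration of that extraction step (using nhanes, a small dataset bundled with mice, rather than impA):

```r
library(mice)

# Impute once (m = 1), then pull the completed data out of the mids object
imp <- mice(nhanes, m = 1, printFlag = FALSE, seed = 500)
completed <- mice::complete(imp, action = 1)  # action = 1 selects the first (only) imputation

anyNA(completed)  # FALSE: every NA has been filled in
```

The same `mice::complete(impA)` call returns the completed time == 1 rows here.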
B. After imputed values are imputed for time = 1, I want to fill NA values that follow with the previous value. So if the variable for time 1-4 is NA, NA, 0, 1 and the imputed value at time 1 is 1, then it could be 1, 1, 0, 1.
library(dplyr)
library(tidyr) # Needed for the fill function
mice::complete(impA) %>%
rbind(subset(data, subset=time!=1)) %>%
arrange(student, time) %>%
group_by(student) %>%
fill(v4:v8)
# A tibble: 20 x 10
# Groups: student [5]
student time v1 v2 v3 v4 v5 v6 v7 v8
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <fct> <fct>
1 1 1 16 1 4 40 2 1 0 1
2 1 2 12 1 1 27 1 0 1 1
3 1 3 14 3 4 27 1 1 1 0
4 1 4 12 2 4 42 1 1 1 1
5 2 1 17 2 2 40 1 1 0 0
6 2 2 16 2 2 48 3 1 1 0
7 2 3 12 3 2 45 3 1 1 0
8 2 4 12 1 2 25 2 1 0 1
9 3 1 13 2 1 29 1 0 1 1
10 3 2 12 1 3 29 1 0 0 1
11 3 3 16 2 2 29 1 1 0 0
12 3 4 16 1 3 27 1 1 0 0
13 4 1 10 3 1 40 1 0 0 0
14 4 2 10 1 2 40 1 0 0 0
15 4 3 14 1 2 40 1 0 0 0
16 4 4 17 2 1 40 1 0 0 0
17 5 1 17 3 4 40 3 0 0 1
18 5 2 12 3 1 40 2 0 1 0
19 5 3 10 1 1 44 4 0 1 0
20 5 4 11 2 4 39 1 0 1 1
Data
Note, I had to change the first value of v5 to 2, otherwise the polyreg imputation fails (there are only two categories for time=1).
data=data.frame("student"=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5),
"time"=c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4),
"v1"=c(16,12,14,12,17,16,12,12,13,12,16,16,10,10,14,17,17,12,10,11),
"v2"=c(1,1,3,2,2,2,3,1,2,1,2,1,3,1,1,2,3,3,1,2),
"v3"=c(4,1,4,4,2,2,2,2,1,3,2,3,1,2,2,1,4,1,1,4),
"v4"=c(NA,27,NA,42,40,48,45,25,29,NA,NA,27,NA,NA,NA,NA,NA,NA,44,39),
"v5"=c(2,1,NA,NA,1,3,3,2,NA,NA,NA,1,NA,NA,NA,NA,3,2,4,1),
"v6"=c(NA,0,1,NA,1,NA,1,NA,0,NA,1,1,NA,NA,NA,NA,0,0,NA,0),
"v7"=c(0,1,1,NA,0,1,1,0,1,0,NA,0,NA,NA,NA,NA,0,1,NA,1),
"v8"=c(1,NA,0,1,0,0,NA,1,1,NA,0,0,NA,NA,NA,NA,1,0,NA,1))

Related

create a filter variable to indicate time order value is available in r

I have student scores, and some students have multiple scores with a time order. Here is the sample dataset I have.
df <- data.frame(
id = c(1,2,2,3,4,5,5,6),
time = c(1,1,2,1,1,1,2,1),
score = c(15,16,18,19,22,29,19,52))
> df
id time score
1 1 1 15
2 2 1 16
3 2 2 18
4 3 1 19
5 4 1 22
6 5 1 29
7 5 2 19
8 6 1 52
The time variable here is actual time, but I just used a number order for simplicity. I need flag variables to show which students took the first score and which took the second.
Here is my desired output.
> df
id time score score1 score2
1 1 1 15 1 0
2 2 1 16 1 0
3 2 2 18 0 1
4 3 1 19 1 0
5 4 1 22 1 0
6 5 1 29 1 0
7 5 2 19 0 1
8 6 1 52 1 0
Does anyone have an idea?
Thanks!
Does this work:
library(dplyr)
df %>% mutate(score1 = case_when(time == 1 ~ 1, TRUE ~ 0),
score2 = case_when(time == 2 ~ 1, TRUE ~ 0))
id time score score1 score2
1 1 1 15 1 0
2 2 1 16 1 0
3 2 2 18 0 1
4 3 1 19 1 0
5 4 1 22 1 0
6 5 1 29 1 0
7 5 2 19 0 1
8 6 1 52 1 0
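Since a logical comparison already coerces to 0/1, a base-R one-liner per flag does the same job (a sketch on the question's data):

```r
df <- data.frame(
  id    = c(1, 2, 2, 3, 4, 5, 5, 6),
  time  = c(1, 1, 2, 1, 1, 1, 2, 1),
  score = c(15, 16, 18, 19, 22, 29, 19, 52))

# A logical vector coerces to integer 0/1, so each flag is a one-liner
df$score1 <- as.integer(df$time == 1)
df$score2 <- as.integer(df$time == 2)
```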

Calculating entropy in grouped panel data

I have grouped data (different households answering a weekly opinion poll), and I observe every household over 52 weeks (4 weeks in the example). Now I want to measure the value of a household's participation at a given point in time using entropy. A household's participation should be worth more if the household didn't participate in the past weeks: a household that always answers the poll should have a lower value over these 4 weeks than a household that answers every two weeks, in the weeks when it does participate. It's important that for a given household the measure varies over weeks.
What's the best way to do so? If it's entropy, how do I apply it to a panel data structure using R?
The data structure is as follows:
da_poll <- data.frame(household = c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
week = rep(1:4, times = 4),
participation = c(1,1,1,1,0,0,0,1,0,1,0,1,1,1,1,0))
da_poll
   household week participation
1          1    1             1
2          1    2             1
3          1    3             1
4          1    4             1
5          2    1             0
6          2    2             0
7          2    3             0
8          2    4             1
9          3    1             0
10         3    2             1
11         3    3             0
12         3    4             1
13         4    1             1
14         4    2             1
15         4    3             1
16         4    4             0
# 1 indicates participation, 0 no participation.
I have tried to group it by households, but then I only get one value for each household:
da_poll %>%
group_by(household) %>%
mutate(entropy = entropy(participation))
# A tibble: 16 x 4
# Groups: household [4]
household week participation entropy
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 1.39
2 1 2 1 1.39
3 1 3 1 1.39
4 1 4 1 1.39
5 2 1 0 0
6 2 2 0 0
7 2 3 0 0
8 2 4 1 0
9 3 1 0 0.693
10 3 2 1 0.693
11 3 3 0 0.693
12 3 4 1 0.693
13 4 1 1 1.10
14 4 2 1 1.10
15 4 3 1 1.10
16 4 4 0 1.10
If I group based on household and week, I also get something strange:
da_poll %>%
group_by(household, week) %>%
mutate(entropy = entropy(participation))
# A tibble: 16 x 4
# Groups: household, week [16]
household week participation entropy
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 0
2 1 2 1 0
3 1 3 1 0
4 1 4 1 0
5 2 1 0 NA
6 2 2 0 NA
7 2 3 0 NA
8 2 4 1 0
9 3 1 0 NA
10 3 2 1 0
11 3 3 0 NA
12 3 4 1 0
13 4 1 1 0
14 4 2 1 0
15 4 3 1 0
16 4 4 0 NA
To calculate the entropy cumulatively you need to write your own cumulative function. There is probably a more tidyverse-idiomatic way to do it, but this is what I came up with. Based on your post and your comments, entropy may not be the metric you are looking for.
cumulEntropy <- function(x){
  unlist(lapply(seq_along(x), function(i) entropy::entropy(x[1:i])))
}
da_poll %>%
group_by(household) %>%
mutate(entropy = cumulEntropy(participation))
# A tibble: 16 x 3
# Groups: household [4]
# household participation entropy
# <dbl> <dbl> <dbl>
# 1 1 1 0
# 2 1 1 0.693
# 3 1 1 1.10
# 4 1 1 1.39
# 5 2 0 NA
# 6 2 0 NA
# 7 2 0 NA
# 8 2 1 0
# 9 3 0 NA
#10 3 1 0
#11 3 0 0
#12 3 1 0.693
#13 4 1 0
#14 4 1 0.693
#15 4 1 1.10
#16 4 0 1.10
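If you'd rather avoid the entropy package dependency, its default plug-in estimator takes only a few lines to reproduce; a sketch (plugin_entropy and cumul_entropy are names of my own):

```r
# Plug-in (maximum-likelihood) entropy of a count vector, which is what
# entropy::entropy computes by default: normalise to proportions, drop
# zeros, return -sum(p * log(p))
plugin_entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  -sum(p * log(p))
}

# Apply it cumulatively, one prefix of the vector at a time
cumul_entropy <- function(x) {
  vapply(seq_along(x), function(i) plugin_entropy(x[1:i]), numeric(1))
}

cumul_entropy(c(1, 1, 1, 1))  # 0, log(2), log(3), log(4) -- i.e. 0, 0.693, 1.10, 1.39
```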

Assign sequential group ID given a group start indicator

I need to assign subgroup IDs given a group ID and an indicator showing the beginning of the new subgroup. Here's a test dataset:
group <- c(rep("A", 8), rep("B", 8))
x1 <- c(rep(0, 3), rep(1, 3), rep(0, 2))
x2 <- rep(0:1, 4)
df <- data.frame(group=group, indic=c(x1, x2))
Here is the resulting data frame:
df
group indic
1 A 0
2 A 0
3 A 0
4 A 1
5 A 1
6 A 1
7 A 0
8 A 0
9 B 0
10 B 1
11 B 0
12 B 1
13 B 0
14 B 1
15 B 0
16 B 1
indic==1 means that row is the beginning of a new subgroup, and the subgroup should be numbered 1 higher than the previous subgroup. Where indic==0 the subgroup should be the same as the previous subgroup. The subgroup numbering starts at 1. When the group variable changes, the subgroup numbering resets to 1. I would like to use the tidyverse framework.
Here is the result that I want:
df
group indic subgroup
1 A 0 1
2 A 0 1
3 A 0 1
4 A 1 2
5 A 1 3
6 A 1 4
7 A 0 4
8 A 0 4
9 B 0 1
10 B 1 2
11 B 0 2
12 B 1 3
13 B 0 3
14 B 1 4
15 B 0 4
16 B 1 5
I would like to show some methods that I've already tried, but I haven't been able to find anything even close. Any help will be appreciated.
You can just use
library(dplyr)
df %>% group_by(group) %>%
mutate(subgroup=cumsum(indic)+1)
# group indic subgroup
# <fct> <dbl> <dbl>
# 1 A 0 1
# 2 A 0 1
# 3 A 0 1
# 4 A 1 2
# 5 A 1 3
# 6 A 1 4
# 7 A 0 4
# 8 A 0 4
# 9 B 0 1
# 10 B 1 2
# 11 B 0 2
# 12 B 1 3
# 13 B 0 3
# 14 B 1 4
# 15 B 0 4
# 16 B 1 5
We use dplyr to do the grouping, and then cumsum takes the cumulative sum of the indic column, so the subgroup number increases each time it sees a 1.
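The same numbering also works without any packages, using ave() on the question's test data:

```r
# Rebuild the question's test data
group <- c(rep("A", 8), rep("B", 8))
x1 <- c(rep(0, 3), rep(1, 3), rep(0, 2))
x2 <- rep(0:1, 4)
df <- data.frame(group = group, indic = c(x1, x2))

# ave() runs cumsum within each group; + 1 starts the numbering at 1
df$subgroup <- ave(df$indic, df$group, FUN = cumsum) + 1
```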

If a value appears in the row, all subsequent rows should take this value (with dplyr)

I'm just starting to learn R and I'm already facing my first big problem.
Let's take the following panel dataset as an example:
N=5
T=3
time<-rep(1:T, times=N)
id<- rep(1:N,each=T)
dummy<- c(0,0,1,1,0,0,0,1,0,0,0,1,0,1,0)
df<-as.data.frame(cbind(id, time,dummy))
id time dummy
1 1 1 0
2 1 2 0
3 1 3 1
4 2 1 1
5 2 2 0
6 2 3 0
7 3 1 0
8 3 2 1
9 3 3 0
10 4 1 0
11 4 2 0
12 4 3 1
13 5 1 0
14 5 2 1
15 5 3 0
I now want the dummy variable for all rows of a cross section to take the value 1 after the 1 for this cross section appears for the first time. So, what I want is:
id time dummy
1 1 1 0
2 1 2 0
3 1 3 1
4 2 1 1
5 2 2 1
6 2 3 1
7 3 1 0
8 3 2 1
9 3 3 1
10 4 1 0
11 4 2 0
12 4 3 1
13 5 1 0
14 5 2 1
15 5 3 1
So I guess I need something like:
df_new<-df %>%
group_by(id) %>%
???
I already tried to set all zeros to NA and use the na.locf function, but it didn't really work.
Anybody got an idea?
Thanks!
Use cummax
df %>%
group_by(id) %>%
mutate(dummy = cummax(dummy))
# A tibble: 15 x 3
# Groups: id [5]
# id time dummy
# <dbl> <dbl> <dbl>
# 1 1 1 0
# 2 1 2 0
# 3 1 3 1
# 4 2 1 1
# 5 2 2 1
# 6 2 3 1
# 7 3 1 0
# 8 3 2 1
# 9 3 3 1
#10 4 1 0
#11 4 2 0
#12 4 3 1
#13 5 1 0
#14 5 2 1
#15 5 3 1
Without additional packages you could do
transform(df, dummy = ave(dummy, id, FUN = cummax))

How to find percentile and then group in R

I have a data frame like below (df).
day area hour time count
___ ____ _____ ___ ____
1 1 0 1 10
1 1 0 2 12
1 1 0 3 8
1 1 0 4 12
1 1 0 5 15
1 1 0 6 18
1 1 1 1 10
1 1 1 2 12
1 1 1 3 8
1 1 1 4 12
1 1 1 5 15
1 1 1 6 18
1 1 1 7 12
1 1 1 8 15
1 1 1 9 18
1 1 2 1 10
1 1 2 2 18
1 1 2 3 19
.....
2 1 0 1 18
2 1 0 2 12
2 1 0 3 18
2 1 0 4 12
2 1 1 1 8
2 1 1 2 12
2 1 1 3 18
2 1 1 4 10
2 1 1 5 15
2 1 1 6 18
2 1 1 7 12
2 1 1 8 15
2 1 1 9 18
2 1 2 1 10
2 1 2 2 18
2 1 2 3 19
2 1 2 4 9
2 1 2 5 18
2 1 2 6 9
.....
30 99 23 1 9
30 99 23 2 8
30 99 23 3 9
30 99 23 4 19
30 99 23 5 18
30 99 23 6 9
30 99 23 7 19
30 99 23 8 8
30 99 23 9 19
Here I have data for 30 days for 87 areas (1 to 82, plus 90, 93, 95, 97, 99) and 24 hours (0 to 23) per day. So the data records how long it took to cross each area and how many vehicles crossed.
For example:
day area hour time count
___ ____ _____ ___ ____
1 1 0 1 10
1 1 0 2 12
1 1 0 3 8
1 1 0 4 12
1 1 0 5 15
1 1 0 6 18
This gives me, for day 1, hour 0, the time taken to cross area 1:
time count cumulative_count
___ ___ ________________
1 10 10
2 12 22
3 8 30
4 12 42
5 15 57
6 18 75
10 vehicles crossed the area in 1 minute.
12 vehicles crossed the area in 2 minutes.
8 vehicles crossed the area in 3 minutes.
12 vehicles crossed the area in 4 minutes.
15 vehicles crossed the area in 5 minutes.
18 vehicles crossed the area in 6 minutes.
From this I want to calculate how much time it took for 80% of the vehicles to cross area 1 on day 1, hour 0. The total is 10+12+8+12+15+18 = 75 vehicles, and 80% of 75 is 60. So the time taken for 80% of the vehicles (60 of the 75) to pass area 1 on day 1, hour 0 will be between 5 and 6 minutes (nearer to 5). The result will look like:
day area hour time_taken_for_80%vehicles_to_pass
___ ____ ____ ___________________________________
1 1 0 5.33(approximately)
1 1 1 7.30
1 1 2 2.16
....
30 1 23 3.13
1 2 0 ---
1 2 1 ---
1 2 2 ---
1 2 3 ---
.......
30 99 21 ---
30 99 22 ---
30 99 23 ---
I know I have to take a quantile and then group by area, day and hour. So I tried:
library(dplyr)
grp <- group_by(df, day,area,hour,quantile(df$count,0.8))
But it does not work. Any help is appreciated.
My solution calculates the percentage of vehicles that crossed the area, for each time. Then gets the first time the percentage is above 80%:
str <- 'day area hour time count
1 1 0 1 10
1 1 0 2 12
1 1 0 3 8
1 1 0 4 12
1 1 0 5 15
1 1 0 6 18
1 1 1 1 10
1 1 1 2 12
1 1 1 3 8
1 1 1 4 12
1 1 1 5 15
1 1 1 6 18
1 1 1 7 12
1 1 1 8 15
1 1 1 9 18
1 1 2 1 10
1 1 2 2 18
1 1 2 3 19'
file <- textConnection(str)
df <- read.table(file, header = T)
df
library(dplyr)
df %>% group_by(day, area, hour) %>%
mutate(cumcount = cumsum(count),
p = cumcount/max(cumcount)) %>%
filter(p > 0.8) %>%
summarise(time = min(time))
result:
day area hour time
<int> <int> <int> <int>
1 1 1 0 6
2 1 1 1 8
3 1 1 2 3
Or with a linear estimation of the time when 80% is reached:
df %>% group_by(day, area, hour) %>%
mutate(cumcount = cumsum(count),
p = cumcount/max(cumcount),
g = +(p > 0.8),
order = (g*2-1)*time) %>%
group_by(day, area, hour,g) %>%
filter(row_number((g*2-1)*time)==1) %>%
group_by(day, area, hour) %>%
summarise(time = min(time)+(0.8-min(p))/(max(p)-min(p)))
result:
day area hour time
<int> <int> <int> <dbl>
1 1 1 0 5.166667
2 1 1 1 7.600000
3 1 1 2 2.505263
Or get the same result using lag and lead:
df %>% group_by(day, area, hour) %>%
arrange(hour) %>%
mutate(cumcount = cumsum(count),
p = cumcount/max(cumcount)) %>%
filter((p >= 0.8&lag(p)<0.8)|(p < 0.8&lead(p)>=0.8)) %>%
summarise(time = min(time)+(0.8-min(p))/(max(p)-min(p)))
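The linear interpolation in the two variants above can also be handed to stats::approx, which returns the time at which the cumulative proportion reaches 0.8; a sketch that rebuilds the example data first:

```r
library(dplyr)

# Same example data as above (day 1, area 1, hours 0-2)
df <- data.frame(
  day   = 1,
  area  = 1,
  hour  = rep(c(0, 1, 2), c(6, 9, 3)),
  time  = c(1:6, 1:9, 1:3),
  count = c(10, 12, 8, 12, 15, 18,
            10, 12, 8, 12, 15, 18, 12, 15, 18,
            10, 18, 19))

# approx() linearly interpolates time as a function of the cumulative
# proportion of vehicles, evaluated at 80%
df %>%
  group_by(day, area, hour) %>%
  summarise(time = approx(x = cumsum(count) / sum(count),
                          y = time, xout = 0.8)$y,
            .groups = "drop")
```

This reproduces the 5.166667, 7.600000, 2.505263 values from the second solution.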
