Calculating entropy in grouped panel data - r

I have a grouped data structure (different households answering a weekly opinion poll) and I observe every household over 52 weeks (4 weeks in the example). I want to express the value of a household at a given point in time using entropy. The value of a household participating in the poll should be higher if the household didn't participate in the preceding weeks. So a household that always answers the poll should have a lower value in these 4 weeks than a household that answers every two weeks has in the two weeks when it does participate. It's important that, for a given household, the measure varies over weeks.
What's the best way to do so? If it's entropy, how do I apply it to a panel data structure using R?
The data structure is as follows:
da_poll <- data.frame(household = c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4), participation = c(1,1,1,1,0,0,0,1,0,1,0,1,1,1,1,0))
da_poll
household participation
1 1 1
2 1 1
3 1 1
4 1 1
5 2 0
6 2 0
7 2 0
8 2 1
9 3 0
10 3 1
11 3 0
12 3 1
13 4 1
14 4 1
15 4 1
16 4 0
# 1 indicates participation, 0 no participation.
I have tried to group it by households, but then I only get one value for each household:
da_poll %>%
group_by(household) %>%
mutate(entropy = entropy(participation))
# A tibble: 16 x 4
# Groups: household [4]
household week participation entropy
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 1.39
2 1 2 1 1.39
3 1 3 1 1.39
4 1 4 1 1.39
5 2 1 0 0
6 2 2 0 0
7 2 3 0 0
8 2 4 1 0
9 3 1 0 0.693
10 3 2 1 0.693
11 3 3 0 0.693
12 3 4 1 0.693
13 4 1 1 1.10
14 4 2 1 1.10
15 4 3 1 1.10
16 4 4 0 1.10
If I group by household and week, I also get something strange:
da_poll %>%
group_by(household, week) %>%
mutate(entropy = entropy(participation))
# A tibble: 16 x 4
# Groups: household, week [16]
household week participation entropy
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 0
2 1 2 1 0
3 1 3 1 0
4 1 4 1 0
5 2 1 0 NA
6 2 2 0 NA
7 2 3 0 NA
8 2 4 1 0
9 3 1 0 NA
10 3 2 1 0
11 3 3 0 NA
12 3 4 1 0
13 4 1 1 0
14 4 2 1 0
15 4 3 1 0
16 4 4 0 NA

To calculate the entropy cumulatively you need to write your own cumulative function. There is probably a more tidyverse-idiomatic way to do it, but this is what I came up with. Based on your post and your comments, entropy may not be the metric you are looking for.
library(dplyr)
# Entropy of the first i values of x, for every i from 1 to length(x)
cummulEntropy <- function(x){
  unlist(lapply(seq_along(x), function(i) entropy::entropy(x[1:i])))
}
da_poll %>%
  group_by(household) %>%
  mutate(entropy = cummulEntropy(participation))
# A tibble: 16 x 3
# Groups: household [4]
# household participation entropy
# <dbl> <dbl> <dbl>
# 1 1 1 0
# 2 1 1 0.693
# 3 1 1 1.10
# 4 1 1 1.39
# 5 2 0 NA
# 6 2 0 NA
# 7 2 0 NA
# 8 2 1 0
# 9 3 0 NA
#10 3 1 0
#11 3 0 0
#12 3 1 0.693
#13 4 1 0
#14 4 1 0.693
#15 4 1 1.10
#16 4 0 1.10
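One reason the numbers above look odd is that entropy::entropy() takes a vector of bin counts, so c(1,1,1,1) is read as four equally filled bins (log(4) = 1.39) rather than as four observations of the same answer. The following is only a sketch of an alternative, assuming you want the entropy of the observed 0/1 sequence up to each week; obs_entropy and cum_obs_entropy are hypothetical helper names, and it uses base R only. It may still not be the "value" measure you are after.

```r
# Shannon entropy (natural log) of the empirical distribution of the values in x.
# Here x is treated as a sequence of observations, not as bin counts.
obs_entropy <- function(x) {
  p <- as.numeric(table(x)) / length(x)
  -sum(p * log(p))
}

# Entropy of the first i observations, for every i from 1 to length(x).
cum_obs_entropy <- function(x) {
  vapply(seq_along(x), function(i) obs_entropy(x[seq_len(i)]), numeric(1))
}

da_poll <- data.frame(
  household = rep(1:4, each = 4),
  participation = c(1,1,1,1, 0,0,0,1, 0,1,0,1, 1,1,1,0)
)

# ave() applies the function within each household and keeps the row order.
da_poll$entropy_obs <- ave(da_poll$participation, da_poll$household,
                           FUN = cum_obs_entropy)
da_poll
```

With this definition a household that always participates stays at 0, and the measure rises as soon as both answers have been observed. The same helper can be used inside group_by(household) %>% mutate(...).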

Related

R Count Unique By Group in DPLYR

HAVE = data.frame("TRIMESTER" = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,4,4,4,4,4,4),
"STUDENT" = c(1,2,3,3,4,2,5,6,7,1,2,2,2,2,2,1,2,3,4,5))
HAVE$WANT1 = c(4,4,4,4,4,5,5,5,5,5,1,1,1,1,5,5,5,5,5,5)
HAVE$WANT2 = c(0,0,0,0,0,1,1,1,1,1,0,0,0,0,1,1,1,1,1,1)
I have HAVE and wish to append two columns. WANT1 should count the unique values of STUDENT within each TRIMESTER. WANT2 should be the number of times STUDENT == 5 appears within each TRIMESTER: student 5 appears zero times in TRIMESTER == 1, so the value for all rows with TRIMESTER == 1 is 0, while student 5 appears once in TRIMESTER == 4, so the value there is 1.
After grouping by 'TRIMESTER', get the count of distinct elements of 'STUDENT' with n_distinct, and the count of rows where STUDENT is 5 with sum on a logical expression.
library(dplyr)
HAVE %>%
  group_by(TRIMESTER) %>%
  mutate(WANT1new = n_distinct(STUDENT),
         WANT2NEW = sum(STUDENT == 5)) %>%
  ungroup()
Output:
# A tibble: 20 × 6
TRIMESTER STUDENT WANT1 WANT2 WANT1new WANT2NEW
<dbl> <dbl> <dbl> <dbl> <int> <int>
1 1 1 4 0 4 0
2 1 2 4 0 4 0
3 1 3 4 0 4 0
4 1 3 4 0 4 0
5 1 4 4 0 4 0
6 2 2 5 1 5 1
7 2 5 5 1 5 1
8 2 6 5 1 5 1
9 2 7 5 1 5 1
10 2 1 5 1 5 1
11 3 2 1 0 1 0
12 3 2 1 0 1 0
13 3 2 1 0 1 0
14 3 2 1 0 1 0
15 4 2 5 1 5 1
16 4 1 5 1 5 1
17 4 2 5 1 5 1
18 4 3 5 1 5 1
19 4 4 5 1 5 1
20 4 5 5 1 5 1
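The code below should produce the desired result. Note that WANT2 here is a 0/1 indicator of whether student 5 appears in the trimester, not a count; the two coincide in this example because student 5 never appears more than once per trimester.
library(dplyr)
HAVE %>%
  group_by(TRIMESTER) %>%
  mutate(WANT1 = length(unique(STUDENT)),
         WANT2 = as.numeric(any(5 == STUDENT)))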
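For comparison, here is a minimal base R sketch of the same per-trimester counts using ave(), which applies a function within each group and returns a vector aligned with the original rows. It assumes the HAVE data frame from the question and is just one way to do this without dplyr.

```r
HAVE <- data.frame(
  TRIMESTER = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,4,4,4,4,4,4),
  STUDENT   = c(1,2,3,3,4,2,5,6,7,1,2,2,2,2,2,1,2,3,4,5)
)

# Number of distinct students per trimester, repeated on every row of that trimester.
HAVE$WANT1 <- ave(HAVE$STUDENT, HAVE$TRIMESTER,
                  FUN = function(s) length(unique(s)))

# Number of rows per trimester where STUDENT equals 5.
HAVE$WANT2 <- ave(as.numeric(HAVE$STUDENT == 5), HAVE$TRIMESTER, FUN = sum)

HAVE
```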

create a filter variable to indicate time order value is available in r

I have student scores, and some students have multiple scores taken at different times. Here is a sample of my dataset.
df <- data.frame(
id = c(1,2,2,3,4,5,5,6),
time = c(1,1,2,1,1,1,2,1),
score = c(15,16,18,19,22,29,19,52))
> df
id time score
1 1 1 15
2 2 1 16
3 2 2 18
4 3 1 19
5 4 1 22
6 5 1 29
7 5 2 19
8 6 1 52
The time variable is an actual timestamp in my data, but I use order numbers here for simplicity. I need flag variables showing which rows are a student's first score and which are the second.
Here is my desired output.
> df
id time score score1 score2
1 1 1 15 1 0
2 2 1 16 1 0
3 2 2 18 0 1
4 3 1 19 1 0
5 4 1 22 1 0
6 5 1 29 1 0
7 5 2 19 0 1
8 6 1 52 1 0
Does anyone have an idea?
Thanks!
Does this work?
library(dplyr)
df %>% mutate(score1 = case_when(time == 1 ~ 1, TRUE ~ 0),
              score2 = case_when(time == 2 ~ 1, TRUE ~ 0))
id time score score1 score2
1 1 1 15 1 0
2 2 1 16 1 0
3 2 2 18 0 1
4 3 1 19 1 0
5 4 1 22 1 0
6 5 1 29 1 0
7 5 2 19 0 1
8 6 1 52 1 0
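The comparison time == 1 relies on time already being an order number. If time is a real date, as the question mentions, one option is to rank the dates within each student first. The sketch below assumes that setup; the dates are made up for illustration and attempt is a hypothetical helper column.

```r
df <- data.frame(
  id    = c(1, 2, 2, 3, 4, 5, 5, 6),
  # Hypothetical dates, standing in for the real time variable.
  time  = as.Date(c("2020-01-05", "2020-01-07", "2020-02-01", "2020-01-09",
                    "2020-01-10", "2020-01-12", "2020-03-02", "2020-01-15")),
  score = c(15, 16, 18, 19, 22, 29, 19, 52)
)

# Order of each score within its student: 1 = earliest, 2 = second, ...
df$attempt <- ave(as.numeric(df$time), df$id,
                  FUN = function(t) rank(t, ties.method = "first"))

# Flags for the first and second score of each student.
df$score1 <- as.integer(df$attempt == 1)
df$score2 <- as.integer(df$attempt == 2)

df
```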

R define a new variable as count starting when condition is met

I'm trying to add two new variables to my data frame. The first, start, should be a running count within each group, beginning at 0; the second, stop, is the same count but beginning at 1. The count should begin once a second variable, Var1, first exceeds 0. It is also important that the count continues to the last row of the group (so it shouldn't stop if Var1 is 0 again) and that NAs are ignored, in the sense that counting continues through them.
Consider the following dataset as an example
ID Var1 start stop
1 0
1 1 0 1
1 4 1 2
1 2 2 3
1 NA 3 4
1 4 4 5
2 0
2 0
2 3 0 1
2 5 1 2
2 9 2 3
2 0 3 4
I don't really care what values start and stop take before Var1 > 0 for the first time, so whether they are 0 or NA is not important.
Thanks very much for the good answers in advance!!
A quick-and-dirty solution that should work; use select() to drop the intermediate columns I created along the way. In the output, end2 and star2 correspond to stop and start.
library(tidyverse)
df_example <- read_table("ID Var1 start stop
1 0
1 1 0 1
1 4 1 2
1 2 2 3
1 NA 3 4
1 4 4 5
2 0
2 0
2 3 0 1
2 5 1 2
2 9 2 3
2 0 3 4")
df_example %>%
  group_by(ID) %>%
  mutate(greater_1 = if_else(replace_na(Var1, 1) > 0, 1, 0),
         run_sum = cumsum(greater_1),
         to_fill = if_else(run_sum == 1, 1, NA_real_)) %>%
  fill(to_fill) %>%
  mutate(end2 = cumsum(to_fill %>% replace_na(0)),
         star2 = if_else(end2 - 1 > 0, end2 - 1, 0))
#> # A tibble: 12 x 9
#> # Groups: ID [2]
#> ID Var1 start stop greater_1 run_sum to_fill end2 star2
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 NA NA 0 0 NA 0 0
#> 2 1 1 0 1 1 1 1 1 0
#> 3 1 4 1 2 1 2 1 2 1
#> 4 1 2 2 3 1 3 1 3 2
#> 5 1 NA 3 4 1 4 1 4 3
#> 6 1 4 4 5 1 5 1 5 4
#> 7 2 0 NA NA 0 0 NA 0 0
#> 8 2 0 NA NA 0 0 NA 0 0
#> 9 2 3 0 1 1 1 1 1 0
#> 10 2 5 1 2 1 2 1 2 1
#> 11 2 9 2 3 1 3 1 3 2
#> 12 2 0 3 4 0 3 1 4 3
Created on 2020-08-04 by the reprex package (v0.3.0)
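A more compact variant of the same idea, as a sketch in base R: flag every row at or after the first Var1 > 0 within the group, then take a running count of that flag. count_from_first_positive is a hypothetical helper name, and rows before the first positive value are left as NA, which the question says is acceptable.

```r
df <- data.frame(
  ID   = rep(1:2, each = 6),
  Var1 = c(0, 1, 4, 2, NA, 4,  0, 0, 3, 5, 9, 0)
)

# Running count that begins at 1 on the first row where v > 0 and then
# keeps increasing to the end of the group, including NA and 0 rows.
count_from_first_positive <- function(v) {
  started <- cumsum(!is.na(v) & v > 0) > 0
  n_since <- cumsum(started)
  ifelse(started, n_since, NA)
}

df$stop  <- ave(df$Var1, df$ID, FUN = count_from_first_positive)
df$start <- df$stop - 1

df
```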

If a value appears in the row, all subsequent rows should take this value (with dplyr)

I'm just starting to learn R and I'm already facing the first bigger problem.
Let's take the following panel dataset as an example:
N=5
T=3
time<-rep(1:T, times=N)
id<- rep(1:N,each=T)
dummy<- c(0,0,1,1,0,0,0,1,0,0,0,1,0,1,0)
df<-as.data.frame(cbind(id, time,dummy))
id time dummy
1 1 1 0
2 1 2 0
3 1 3 1
4 2 1 1
5 2 2 0
6 2 3 0
7 3 1 0
8 3 2 1
9 3 3 0
10 4 1 0
11 4 2 0
12 4 3 1
13 5 1 0
14 5 2 1
15 5 3 0
I now want the dummy variable to take the value 1 in all rows of a cross section after the first 1 appears for that cross section. So, what I want is:
id time dummy
1 1 1 0
2 1 2 0
3 1 3 1
4 2 1 1
5 2 2 1
6 2 3 1
7 3 1 0
8 3 2 1
9 3 3 1
10 4 1 0
11 4 2 0
12 4 3 1
13 5 1 0
14 5 2 1
15 5 3 1
So I guess I need something like:
df_new<-df %>%
group_by(id) %>%
???
I already tried to set all zeros to NA and use the na.locf function, but it didn't really work.
Anybody got an idea?
Thanks!
Use cummax
df %>%
group_by(id) %>%
mutate(dummy = cummax(dummy))
# A tibble: 15 x 3
# Groups: id [5]
# id time dummy
# <dbl> <dbl> <dbl>
# 1 1 1 0
# 2 1 2 0
# 3 1 3 1
# 4 2 1 1
# 5 2 2 1
# 6 2 3 1
# 7 3 1 0
# 8 3 2 1
# 9 3 3 1
#10 4 1 0
#11 4 2 0
#12 4 3 1
#13 5 1 0
#14 5 2 1
#15 5 3 1
Without additional packages you could do
transform(df, dummy = ave(dummy, id, FUN = cummax))
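To see why the grouping matters, here is a small sketch comparing cummax applied to the whole column with cummax applied per id. It rebuilds the example data directly and uses base R only.

```r
df <- data.frame(
  id    = rep(1:5, each = 3),
  time  = rep(1:3, times = 5),
  dummy = c(0,0,1, 1,0,0, 0,1,0, 0,0,1, 0,1,0)
)

# cummax returns the running maximum, so a 1 is carried forward once it appears.
cummax(c(0, 1, 0))

# Without grouping, the first 1 (row 3) spills over into every later id.
ungrouped <- cummax(df$dummy)

# Per id, the running maximum restarts at the first row of each cross section.
grouped <- ave(df$dummy, df$id, FUN = cummax)

cbind(df, ungrouped, grouped)
```

Note that cummax propagates missing values: once an NA is encountered, that element and all following elements of the result are NA, so NAs in dummy would need to be handled first.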

How to calculate recency in R

I have the following data:
set.seed(20)
round<-rep(1:10,2)
part<-rep(1:2, c(10,10))
game<-rep(rep(1:2,c(5,5)),2)
pay1<-sample(1:10,20,replace=TRUE)
pay2<-sample(1:10,20,replace=TRUE)
pay3<-sample(1:10,20,replace=TRUE)
decs<-sample(1:3,20,replace=TRUE)
previous_max<-c(0,1,0,0,0,0,0,1,0,0,0,0,1,1,1,0,0,1,1,0)
gamematrix<-cbind(part,game,round,pay1,pay2,pay3,decs,previous_max )
gamematrix<-data.frame(gamematrix)
Here is the output:
part game round pay1 pay2 pay3 decs previous_max
1 1 1 1 9 5 6 2 0
2 1 1 2 8 1 1 1 1
3 1 1 3 3 5 5 3 0
4 1 1 4 6 1 5 1 0
5 1 1 5 10 3 8 3 0
6 1 2 6 10 1 5 1 0
7 1 2 7 1 10 7 3 0
8 1 2 8 1 10 8 2 1
9 1 2 9 4 1 5 1 0
10 1 2 10 4 7 7 2 0
11 2 1 1 8 4 1 1 0
12 2 1 2 8 5 5 2 0
13 2 1 3 1 9 3 1 1
14 2 1 4 8 2 10 2 1
15 2 1 5 2 6 2 3 1
16 2 2 6 5 5 6 2 0
17 2 2 7 4 5 1 2 0
18 2 2 8 2 10 5 2 1
19 2 2 9 3 7 3 2 1
20 2 2 10 9 3 1 1 0
How can I calculate a new indicator variable "previous_max" that shows whether, in a given round, the participant chose the option that had the maximal payoff in the previous round of the same game?
So I want something like the following:
Participant (part) 1:
In the first round of each game, previous_max is "0" (no previous round). In round 2, previous_max = "1", because in round 1 the maximal pay was max(pay1,pay2,pay3) = max(9,5,6) = 9 (which is pay1), and in round 2 the participant's decision (decs) was 1 (the option with the maximal value in the previous round).
In round 3, previous_max = 0, because the maximal value in round 2 was 8 (which is "pay1"), but the participant chose "3" (which is pay3).
Here's a solution using dplyr and purrr::map.
I would have preferred group_by over split, but max.col ignores groups and I don't know of a dplyr equivalent.
The output differs slightly from yours, but I think that's because of mistakes in your expected values; please explain if not and I'll update my answer.
library(purrr)
library(dplyr)
gamematrix %>%
split(.$part) %>%
map(~ .x %>% mutate(
prev_max = as.integer(
decs ==
c(0,max.col(.[c("pay1","pay2","pay3")])[-n()]) # the number of the max columns, offset by one
))) %>%
bind_rows
#    part game round pay1 pay2 pay3 decs prev_max
# 1 1 1 1 9 5 6 2 0
# 2 1 1 2 8 1 1 1 1
# 3 1 1 3 3 5 5 3 0
# 4 1 1 4 6 1 5 1 0
# 5 1 1 5 10 3 8 3 0
# 6 1 2 6 10 1 5 1 1
# 7 1 2 7 1 10 7 3 0
# 8 1 2 8 1 10 8 2 1
# 9 1 2 9 4 1 5 1 0
# 10 1 2 10 4 7 7 2 0
# 11 2 1 1 8 4 1 1 0
# 12 2 1 2 8 5 5 2 0
# 13 2 1 3 1 9 3 1 1
# 14 2 1 4 8 2 10 2 1
# 15 2 1 5 2 6 2 3 1
# 16 2 2 6 5 5 6 2 1
# 17 2 2 7 4 5 1 2 0
# 18 2 2 8 2 10 5 2 1
# 19 2 2 9 3 7 3 2 1
# 20 2 2 10 9 3 1 1 0
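The rows where this output differs from the question's previous_max (rows 6 and 16) are the first rounds of game 2, because the code above splits by part only, so round 6 is compared with round 5 of the previous game. Below is a sketch in base R that also resets at game boundaries. It hardcodes the values printed in the question (sample() results can differ across R versions) and passes ties.method = "first" to max.col, whose default breaks ties at random; both choices are assumptions on my part.

```r
gamematrix <- data.frame(
  part  = rep(1:2, each = 10),
  game  = rep(rep(1:2, each = 5), 2),
  round = rep(1:10, 2),
  pay1  = c(9,8,3,6,10,10,1,1,4,4,8,8,1,8,2,5,4,2,3,9),
  pay2  = c(5,1,5,1,3,1,10,10,1,7,4,5,9,2,6,5,5,10,7,3),
  pay3  = c(6,1,5,5,8,5,7,8,5,7,1,5,3,10,2,6,1,5,3,1),
  decs  = c(2,1,3,1,3,1,3,2,1,2,1,2,1,2,3,2,2,2,2,1)
)

# Index (1, 2 or 3) of the column with the highest payoff in each round.
best <- max.col(as.matrix(gamematrix[c("pay1", "pay2", "pay3")]),
                ties.method = "first")

# Shift by one round within each participant and game; the first round gets NA.
prev_best <- ave(best, gamematrix$part, gamematrix$game,
                 FUN = function(b) c(NA, b[-length(b)]))

# 1 if the decision matches the previous round's best option, otherwise 0.
gamematrix$prev_max <- as.integer(!is.na(prev_best) & gamematrix$decs == prev_best)

gamematrix
```

On this data the result matches the previous_max vector given in the question.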
