R define a new variable as count starting when condition is met

I'm trying to add two new variables to my data frame. The first, named start, should be a running count starting at 0 and continuing to the last row of each group; the second, named stop, is the same count but starting at 1. The count should begin once a second variable (Var1) first takes a value > 0. It is also important that the count continues until the last row of the group (so it shouldn't stop if Var1 = 0 again) and that NAs are ignored, in the sense that counting continues through them.
Consider the following dataset as an example:
ID Var1 start stop
1 0
1 1 0 1
1 4 1 2
1 2 2 3
1 NA 3 4
1 4 4 5
2 0
2 0
2 3 0 1
2 5 1 2
2 9 2 3
2 0 3 4
I don't really care what values start and stop take before Var1 > 0 for the first time, so whether it's 0 or NA is not important.
Thanks very much for the good answers in advance!!

A quick and dirty solution that should work; the extra columns I created as intermediate steps can be dropped afterwards with select().
library(tidyverse)
df_example <- read_table("ID Var1 start stop
1 0
1 1 0 1
1 4 1 2
1 2 2 3
1 NA 3 4
1 4 4 5
2 0
2 0
2 3 0 1
2 5 1 2
2 9 2 3
2 0 3 4")
df_example %>%
group_by(ID) %>%
mutate(greater_1 = if_else(replace_na(Var1,1) > 0,1,0),
run_sum = cumsum(greater_1),
to_fill = if_else(run_sum == 1,1,NA_real_)) %>%
fill(to_fill) %>%
mutate(end2 = cumsum(to_fill %>% replace_na(0)),
star2 = if_else(end2 -1 > 0,end2 -1,0))
#> # A tibble: 12 x 9
#> # Groups: ID [2]
#> ID Var1 start stop greater_1 run_sum to_fill end2 star2
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 NA NA 0 0 NA 0 0
#> 2 1 1 0 1 1 1 1 1 0
#> 3 1 4 1 2 1 2 1 2 1
#> 4 1 2 2 3 1 3 1 3 2
#> 5 1 NA 3 4 1 4 1 4 3
#> 6 1 4 4 5 1 5 1 5 4
#> 7 2 0 NA NA 0 0 NA 0 0
#> 8 2 0 NA NA 0 0 NA 0 0
#> 9 2 3 0 1 1 1 1 1 0
#> 10 2 5 1 2 1 2 1 2 1
#> 11 2 9 2 3 1 3 1 3 2
#> 12 2 0 3 4 0 3 1 4 3
Created on 2020-08-04 by the reprex package (v0.3.0)
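A more compact variant of the same idea, as a minimal base R sketch rather than a replacement for the answer above. It assumes that an NA in Var1 never starts the count on its own (unlike the replace_na(Var1, 1) step above, which treats NA as positive); the column name started is made up for illustration.

```r
# Example data from the question (Var1 only; start/stop are recomputed)
df <- data.frame(
  ID   = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  Var1 = c(0, 1, 4, 2, NA, 4, 0, 0, 3, 5, 9, 0)
)

# 1 from the first row of each ID where Var1 > 0 onward, 0 before that.
# Assumption: NA counts as "not positive", so it cannot start the count.
df$started <- ave(df$Var1, df$ID,
                  FUN = function(x) cumsum(!is.na(x) & x > 0) > 0)

# stop is a running count of started rows; start lags it by one (floored at 0)
df$stop  <- ave(df$started, df$ID, FUN = cumsum)
df$start <- pmax(df$stop - 1, 0)

df
```

Rows before the first positive Var1 get 0 in both columns, which the question says is acceptable.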

Related

R Count Unique By Group in DPLYR

HAVE = data.frame("TRIMESTER" = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,4,4,4,4,4,4),
"STUDENT" = c(1,2,3,3,4,2,5,6,7,1,2,2,2,2,2,1,2,3,4,5))
HAVE$WANT1 = c(4,4,4,4,4,5,5,5,5,5,1,1,1,1,5,5,5,5,5,5)
HAVE$WANT2 = c(0,0,0,0,0,1,1,1,1,1,0,0,0,0,1,1,1,1,1,1)
I have HAVE and wish to append two columns. WANT1 counts the unique values of STUDENT within each TRIMESTER. WANT2 is the number of times STUDENT == 5 appears in each TRIMESTER: student 5 appears zero times in TRIMESTER == 1, so the value for all rows with TRIMESTER == 1 is 0, but student 5 appears once in TRIMESTER == 4, so the value there is 1.
After grouping by 'TRIMESTER', get the count of distinct elements of 'STUDENT' with n_distinct and the count of STUDENT 5 with sum on a logical expression
library(dplyr)
HAVE %>%
group_by(TRIMESTER) %>%
mutate(WANT1new = n_distinct(STUDENT),
WANT2NEW = sum(STUDENT == 5)) %>%
ungroup
-output
# A tibble: 20 × 6
TRIMESTER STUDENT WANT1 WANT2 WANT1new WANT2NEW
<dbl> <dbl> <dbl> <dbl> <int> <int>
1 1 1 4 0 4 0
2 1 2 4 0 4 0
3 1 3 4 0 4 0
4 1 3 4 0 4 0
5 1 4 4 0 4 0
6 2 2 5 1 5 1
7 2 5 5 1 5 1
8 2 6 5 1 5 1
9 2 7 5 1 5 1
10 2 1 5 1 5 1
11 3 2 1 0 1 0
12 3 2 1 0 1 0
13 3 2 1 0 1 0
14 3 2 1 0 1 0
15 4 2 5 1 5 1
16 4 1 5 1 5 1
17 4 2 5 1 5 1
18 4 3 5 1 5 1
19 4 4 5 1 5 1
20 4 5 5 1 5 1
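The code below should produce the desired result for this example. Note that any() gives a 0/1 indicator of whether student 5 appears at all, not a count, so it matches WANT2 only because student 5 appears at most once per trimester here.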
library(dplyr)
HAVE %>%
group_by(TRIMESTER) %>%
mutate(WANT1 = length(unique(STUDENT)),
WANT2 = as.numeric(any(5 == STUDENT)))
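To see where the two answers differ, here is a small base R sketch on made-up data in which student 5 appears twice in one trimester. The data frame toy and the column names n_unique, n_five and has_five are hypothetical, not from the question.

```r
# Hypothetical data: student 5 appears twice in trimester 1, never in trimester 2
toy <- data.frame(
  TRIMESTER = c(1, 1, 1, 2, 2, 2),
  STUDENT   = c(5, 5, 3, 1, 2, 2)
)

# Number of distinct students per trimester
toy$n_unique <- ave(toy$STUDENT, toy$TRIMESTER,
                    FUN = function(x) length(unique(x)))

# Count of rows where STUDENT == 5 (what sum() on a logical gives)
toy$n_five <- ave(toy$STUDENT, toy$TRIMESTER,
                  FUN = function(x) sum(x == 5))

# 0/1 indicator of whether student 5 appears at all (what any() gives)
toy$has_five <- ave(toy$STUDENT, toy$TRIMESTER,
                    FUN = function(x) as.numeric(any(x == 5)))

toy
```

In trimester 1 the count is 2 while the indicator is 1, so which one is wanted depends on whether repeated appearances should add up.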

R update values within a grouped df with information from updated previous value

I would like to conditionally mutate variables (var1, var2) within groups (id) at different timepoints (timepoint), using previously updated/mutated values, according to this function:
change_function <- function(value,pastvalue,timepoint){
if(timepoint==1){valuenew=value} else
if(value==0){valuenew=pastvalue-1}
if(value==1){valuenew=pastvalue}
if(value==2){valuenew=pastvalue+1}
return(valuenew)
}
pastvalue is the MUTATED/UPDATED value at timepoint - 1, for timepoints 2 to 4.
Here is an example and output file:
``` r
#example data
df <- data.frame(id=c(1,1,1,1,2,2,2,2),timepoint=c(1,2,3,4,1,2,3,4),var1=c(1,0,1,2,2,2,1,0),var2=c(2,0,1,2,3,2,1,0))
df
#> id timepoint var1 var2
#> 1 1 1 1 2
#> 2 1 2 0 0
#> 3 1 3 1 1
#> 4 1 4 2 2
#> 5 2 1 2 3
#> 6 2 2 2 2
#> 7 2 3 1 1
#> 8 2 4 0 0
#desired output
output <- data.frame(id=c(1,1,1,1,2,2,2,2),timepoint=c(1,2,3,4,1,2,3,4),var1=c(1,0,0,1,2,3,3,2),var2=c(2,1,1,2,3,4,4,3))
output
#> id timepoint var1 var2
#> 1 1 1 1 2
#> 2 1 2 0 1
#> 3 1 3 0 1
#> 4 1 4 1 2
#> 5 2 1 2 3
#> 6 2 2 3 4
#> 7 2 3 3 4
#> 8 2 4 2 3
```
Created on 2020-11-23 by the reprex package (https://reprex.tidyverse.org) (v0.3.0)
My Approach: use my function using dplyr::mutate_at
library(dplyr)
df %>%
group_by(id) %>%
mutate_at(.vars=vars(var1,var2),
.funs=funs(.=change_function(.,dplyr::lag(.),timepoint)))
However, this does not work because if/else is not vectorized.
Update 1:
Using a nested ifelse function does not give the desired output, because it does not use the updated past values:
change_function <- function(value,pastvalue,timepoint){
ifelse((timepoint==1),value,
ifelse((value==0),pastvalue-1,
ifelse((value==1),pastvalue,
ifelse((value==2),pastvalue+1,NA))))
}
library(dplyr)
df %>%
group_by(id) %>%
mutate_at(.vars=vars(var1,var2),
.funs=funs(.=change_function(.,dplyr::lag(.),timepoint)))
id TimePoint var1 var2 var1_. var2_.
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 2 1 2
2 1 2 0 0 0 1
3 1 3 1 1 0 0
4 1 4 2 2 2 2
5 2 1 2 3 2 3
6 2 2 2 2 3 4
7 2 3 1 1 2 2
8 2 4 0 0 0 0
Update 2:
According to the comments, purrr::accumulate could be used.
Thanks to akrun, I could get the correct function:
# write a vectorized function
change_function <- function(prev, new) {
change=if_else(new==0,-1,
if_else(new==1,0,1))
if_else(is.na(new), new, prev + change)
}
# use purrr::accumulate
library(purrr)
df %>%
group_by(id) %>%
mutate_at(.vars=vars(var1,var2),
.funs=funs(accumulate(.,change_function)))
# A tibble: 8 x 4
# Groups: id [2]
id timepoint var1 var2
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 2
2 1 2 0 1
3 1 3 0 1
4 1 4 1 2
5 2 1 2 3
6 2 2 3 4
7 2 3 3 4
8 2 4 2 3
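For comparison, the same running update can be sketched in base R with Reduce(accumulate = TRUE), which plays the role of purrr::accumulate here. This is only an illustration: it assumes rows are already sorted by timepoint within id and that there are no NAs, and the names step, running, var1_new and var2_new are made up.

```r
df <- data.frame(id        = c(1, 1, 1, 1, 2, 2, 2, 2),
                 timepoint = c(1, 2, 3, 4, 1, 2, 3, 4),
                 var1      = c(1, 0, 1, 2, 2, 2, 1, 0),
                 var2      = c(2, 0, 1, 2, 3, 2, 1, 0))

# One update step on scalars: 0 lowers the previous (already updated) value
# by 1, 1 keeps it, and any other value raises it by 1.
step <- function(prev, new) {
  change <- if (new == 0) -1 else if (new == 1) 0 else 1
  prev + change
}

# Fold step over a vector and keep every intermediate result;
# the first element is used as the starting value.
running <- function(x) Reduce(step, x, accumulate = TRUE)

df$var1_new <- ave(df$var1, df$id, FUN = running)
df$var2_new <- ave(df$var2, df$id, FUN = running)

df
```

Because each step receives the previously updated value, this avoids the problem with lag(), which only sees the original column.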

Rows sequence by group using two columns

Suppose I have the following df
data <- data.frame(ID = c(1,1,1,1,1,1,1,2,2,2,2,3,3,3),
Value = c(1,1,0,1,0,1,1,1,0,0,1,0,0,0),
Result = c(1,1,2,3,4,5,5,1,2,2,3,1,1,1))
How can I obtain column Result from the first two columns?
I have tried different approaches using rle, seq, cumsum and cur_group_id but can't get the Result column easily
library(data.table)
library(dplyr)
data %>%
group_by(ID) %>%
mutate(Result2 = rleid(Value))
This gives us:
ID Value Result Result2
<dbl> <dbl> <dbl> <int>
1 1 1 1 1
2 1 1 1 1
3 1 0 2 2
4 1 1 3 3
5 1 0 4 4
6 1 1 5 5
7 1 1 5 5
8 2 1 1 1
9 2 0 2 2
10 2 0 2 2
11 2 1 3 3
12 3 0 1 1
13 3 0 1 1
14 3 0 1 1
Does this work:
library(dplyr)
data %>% group_by(ID) %>% mutate(r = rep(seq_along(rle(ID*Value)$values), rle(ID*Value)$lengths))
# A tibble: 14 x 4
# Groups: ID [3]
ID Value Result r
<dbl> <dbl> <dbl> <int>
1 1 1 1 1
2 1 1 1 1
3 1 0 2 2
4 1 1 3 3
5 1 0 4 4
6 1 1 5 5
7 1 1 5 5
8 2 1 1 1
9 2 0 2 2
10 2 0 2 2
11 2 1 3 3
12 3 0 1 1
13 3 0 1 1
14 3 0 1 1
We could use rle with ave in base R
data$Result2 <- with(data, ave(Value, ID, FUN =
function(x) inverse.rle(within.list(rle(x), values <- seq_along(values)))))
data$Result2
#[1] 1 1 2 3 4 5 5 1 2 2 3 1 1 1
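Another base R possibility, sketched under the assumption that Value has no NAs: start a new run whenever the value differs from the previous row, and number the runs with cumsum. The helper name run_id is made up for illustration.

```r
data <- data.frame(ID     = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
                   Value  = c(1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0),
                   Result = c(1, 1, 2, 3, 4, 5, 5, 1, 2, 2, 3, 1, 1, 1))

# The first row always opens run 1; every change in x opens the next run
run_id <- function(x) cumsum(c(1, diff(x) != 0))

# Apply per ID so the numbering restarts in each group
data$Result2 <- ave(data$Value, data$ID, FUN = run_id)

all(data$Result2 == data$Result)
```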

Calculating entropy in grouped panel data

I have a grouped data structure (different households answering a weekly opinion poll) and I observe every household over 52 weeks (in the example 4 weeks). Now I want to indicate the value of a household at a given point in time using entropy. The value of a household participating in the poll should be higher, if the household didn't participate in the past weeks. So a household always answering the poll should have a lower value in these 4 given weeks than a household answering every two weeks in the two weeks when it does participate. It's important that for a given household the inequality measure varies over weeks.
What's the best way to do so? If it's entropy, how do I apply it to a panel data structure using R?
The data structure is as follows:
da_poll <- data.frame(household = c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4), participation = c(1,1,1,1,0,0,0,1,0,1,0,1,1,1,1,0))
da_poll
household participation
1 1 1
2 1 1
3 1 1
4 1 1
5 2 0
6 2 0
7 2 0
8 2 1
9 3 0
10 3 1
11 3 0
12 3 1
13 4 1
14 4 1
15 4 1
16 4 0
# 1 indicates participation, 0 no participation.
I have tried to group it by households, but then I only get one value for each household:
da_poll %>%
group_by(household) %>%
mutate(entropy = entropy(participation))
# A tibble: 16 x 4
# Groups: household [4]
household week participation entropy
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 1.39
2 1 2 1 1.39
3 1 3 1 1.39
4 1 4 1 1.39
5 2 1 0 0
6 2 2 0 0
7 2 3 0 0
8 2 4 1 0
9 3 1 0 0.693
10 3 2 1 0.693
11 3 3 0 0.693
12 3 4 1 0.693
13 4 1 1 1.10
14 4 2 1 1.10
15 4 3 1 1.10
16 4 4 0 1.10
If I group by household and week, I also get something strange:
da_poll %>%
group_by(household, week) %>%
mutate(entropy = entropy(participation))
# A tibble: 16 x 4
# Groups: household, week [16]
household week participation entropy
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 0
2 1 2 1 0
3 1 3 1 0
4 1 4 1 0
5 2 1 0 NA
6 2 2 0 NA
7 2 3 0 NA
8 2 4 1 0
9 3 1 0 NA
10 3 2 1 0
11 3 3 0 NA
12 3 4 1 0
13 4 1 1 0
14 4 2 1 0
15 4 3 1 0
16 4 4 0 NA
To calculate the entropy cumulatively you need to write your own cumulative function. There is probably a more tidyverse-idiomatic way to do it, but this is what I came up with. Based on your post and your comments, entropy may not be the metric you are looking for.
cummulEntropy <- function(x){
unlist(lapply(seq_along(x), function(i) entropy::entropy(x[1:i])))
}
da_poll %>%
group_by(household) %>%
mutate(entropy=cummulEntropy(participation))
# A tibble: 16 x 3
# Groups: household [4]
# household participation entropy
# <dbl> <dbl> <dbl>
# 1 1 1 0
# 2 1 1 0.693
# 3 1 1 1.10
# 4 1 1 1.39
# 5 2 0 NA
# 6 2 0 NA
# 7 2 0 NA
# 8 2 1 0
# 9 3 0 NA
#10 3 1 0
#11 3 0 0
#12 3 1 0.693
#13 4 1 0
#14 4 1 0.693
#15 4 1 1.10
#16 4 0 1.10
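The numbers above are consistent with the vector being treated as counts that are normalised to proportions before the entropy is taken, which would explain why four 1s give log(4) ≈ 1.39 and a lone 0 gives NA. A minimal base R sketch of that reading follows; plugin_entropy and cumulative_entropy are hypothetical helpers written for illustration, not the entropy package's implementation.

```r
# Entropy of the proportions implied by a vector of counts (natural log)
plugin_entropy <- function(counts) {
  p <- counts / sum(counts)  # proportions; 0/0 gives NaN when all counts are 0
  p <- p[p > 0]              # drop empty cells (an NaN entry is kept as NA)
  -sum(p * log(p))
}

# Entropy of the first i observations, for every i
cumulative_entropy <- function(x) {
  sapply(seq_along(x), function(i) plugin_entropy(x[seq_len(i)]))
}

plugin_entropy(c(1, 1, 1, 1))      # household 1, all four weeks at once
cumulative_entropy(c(0, 1, 0, 1))  # household 3, week by week
```

Read this way, a household that always participates gets the highest value, which is the opposite of what the question asks for and may be why entropy is not the right metric here.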

If a value appears in the row, all subsequent rows should take this value (with dplyr)

I'm just starting to learn R and I'm already facing the first bigger problem.
Let's take the following panel dataset as an example:
N=5
T=3
time<-rep(1:T, times=N)
id<- rep(1:N,each=T)
dummy<- c(0,0,1,1,0,0,0,1,0,0,0,1,0,1,0)
df<-as.data.frame(cbind(id, time,dummy))
id time dummy
1 1 1 0
2 1 2 0
3 1 3 1
4 2 1 1
5 2 2 0
6 2 3 0
7 3 1 0
8 3 2 1
9 3 3 0
10 4 1 0
11 4 2 0
12 4 3 1
13 5 1 0
14 5 2 1
15 5 3 0
I now want the dummy variable for all rows of a cross section to take the value 1 after the 1 for this cross section appears for the first time. So, what I want is:
id time dummy
1 1 1 0
2 1 2 0
3 1 3 1
4 2 1 1
5 2 2 1
6 2 3 1
7 3 1 0
8 3 2 1
9 3 3 1
10 4 1 0
11 4 2 0
12 4 3 1
13 5 1 0
14 5 2 1
15 5 3 1
So I guess I need something like:
df_new<-df %>%
group_by(id) %>%
???
I already tried to set all zeros to NA and use the na.locf function, but it didn't really work.
Anybody got an idea?
Thanks!
Use cummax
df %>%
group_by(id) %>%
mutate(dummy = cummax(dummy))
# A tibble: 15 x 3
# Groups: id [5]
# id time dummy
# <dbl> <dbl> <dbl>
# 1 1 1 0
# 2 1 2 0
# 3 1 3 1
# 4 2 1 1
# 5 2 2 1
# 6 2 3 1
# 7 3 1 0
# 8 3 2 1
# 9 3 3 1
#10 4 1 0
#11 4 2 0
#12 4 3 1
#13 5 1 0
#14 5 2 1
#15 5 3 1
Without additional packages you could do
transform(df, dummy = ave(dummy, id, FUN = cummax))
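cummax works here because dummy only takes the values 0 and 1. As a hedged alternative sketch in base R, the same "switch on and stay on" behaviour can be written with cumsum, assuming rows are sorted by time within id; the column name dummy_on is made up for illustration.

```r
df <- data.frame(id    = rep(1:5, each = 3),
                 time  = rep(1:3, times = 5),
                 dummy = c(0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0))

# Within each id, the running sum becomes positive at the first 1
# and stays positive afterwards, so the flag never switches back off.
df$dummy_on <- ave(df$dummy, df$id,
                   FUN = function(x) as.numeric(cumsum(x) > 0))

df
```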
