Penalized cumulative sum in R

I need to calculate a penalized cumulative sum.
Individuals "A", "B" and "C" were supposed to get tested every other year. Every time they get tested, they accumulate 1 point. However, when they miss a test, 1 point is deducted from their cumulative score.
I have the following code:
data.frame(year = rep(1990:1995, 3), person.id = c(rep("A", 6), rep("B", 6), rep("C", 6)), needs.testing = rep(c("Yes", "No"), 9), test.compliance = c(c(1,0,1,0,1,0), c(1,0,1,0,0,0), c(1,0,0,0,0,0)), penalized.compliance.cum.sum = c(c(1,1,2,2,3,3), c(1,1,2,2,1,1), c(1,1,0,0,-1,-1)))
...which gives the following:
year person.id needs.testing test.compliance penalized.compliance.cum.sum
1 1990 A Yes 1 1
2 1991 A No 0 1
3 1992 A Yes 1 2
4 1993 A No 0 2
5 1994 A Yes 1 3
6 1995 A No 0 3
7 1990 B Yes 1 1
8 1991 B No 0 1
9 1992 B Yes 1 2
10 1993 B No 0 2
11 1994 B Yes 0 1
12 1995 B No 0 1
13 1990 C Yes 1 1
14 1991 C No 0 1
15 1992 C Yes 0 0
16 1993 C No 0 0
17 1994 C Yes 0 -1
18 1995 C No 0 -1
As is evident, "A" fully complied. "B" partially complied (in 1994 he was supposed to get tested but missed the test, so his cumulative sum drops from 2 to 1). Finally, "C" complied just once, in 1990; every time she needed to get tested after that, she missed the test.
What I need is some code to get the "penalized.compliance.cum.sum" variable.
Please note:
Tests are every other year.
The "penalized.compliance.cum.sum" variable carries the previous score forward.
It only starts deducting if the individual misses the test in a testing year (denoted in the "needs.testing" variable).
For instance, individual "C" complies in 1990. In 1991 she doesn't need to get tested, and hence keeps her score of 1. Then she misses the 1992 test, and 1 is subtracted from her cumulative score, giving a score of 0 in 1992. She then keeps missing tests, ending the study with a score of -1.
Also, I need to assign different penalties (i.e. different numbers). In this example, it's just 1. However, I need to be able to penalize using other numbers such as 0.5, 0.1, and others.
Thanks!

Using case_when
library(dplyr)
df1 %>%
group_by(person.id) %>%
mutate(res = cumsum(case_when(needs.testing == "Yes" ~ 1- 2 *(test.compliance < 1), TRUE ~ 0)))

base R
do.call(rbind, by(dat, dat$person.id,
function(z) transform(z, res = cumsum(ifelse(needs.testing == "Yes", 1-2*(test.compliance < 1), 0)))
))
# year person.id needs.testing test.compliance penalized.compliance.cum.sum res
# A.1 1990 A Yes 1 1 1
# A.2 1991 A No 0 1 1
# A.3 1992 A Yes 1 2 2
# A.4 1993 A No 0 2 2
# A.5 1994 A Yes 1 3 3
# A.6 1995 A No 0 3 3
# B.7 1990 B Yes 1 1 1
# B.8 1991 B No 0 1 1
# B.9 1992 B Yes 1 2 2
# B.10 1993 B No 0 2 2
# B.11 1994 B Yes 0 1 1
# B.12 1995 B No 0 1 1
# C.13 1990 C Yes 1 1 1
# C.14 1991 C No 0 1 1
# C.15 1992 C Yes 0 0 0
# C.16 1993 C No 0 0 0
# C.17 1994 C Yes 0 -1 -1
# C.18 1995 C No 0 -1 -1
by splits a frame by its INDICES argument (dat$person.id here); inside the function, z is the data for just one group. This lets us run cumsum without the running total carrying over from one person to the next.
by returns a list, and the canonical base-R way to combine lists into a frame is either rbind(a, b) when only two frames, or do.call(rbind, list(...)) when there may be more than two frames in the list.
The 1-2*(.) is just a trick to waffle between +1 and -1 based on test.compliance.
This has the side-effect of potentially changing the order of the rows. For instance, if the data were ordered first by year and then by person.id, the by-group calculations would still be correct, but the output would be grouped by person.id (and ordered by year within each group). Minor, but worth noting if row order matters to you.
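If the original row order matters, one possible base R alternative is ave(), which applies a function within groups and returns the results in the original row positions. A minimal sketch, assuming a small made-up frame (not the question's data) that is sorted by year and then by person:

```r
# Made-up example data, ordered by year first and person second
dat <- data.frame(
  year            = rep(1990:1991, each = 2),
  person.id       = rep(c("A", "B"), 2),
  needs.testing   = "Yes",
  test.compliance = c(1, 0, 1, 0)
)

# +1 for a passed test, -1 for a missed one, 0 when no test was due
step <- ifelse(dat$needs.testing == "Yes",
               1 - 2 * (dat$test.compliance < 1),
               0)

# Cumulative sum within each person; rows stay where they were
dat$res <- ave(step, dat$person.id, FUN = cumsum)
dat
```

Here the rows of A and B stay interleaved, and each person's running total is computed independently.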
dplyr
library(dplyr)
dat %>%
group_by(person.id) %>%
mutate(res = cumsum(if_else(needs.testing == "Yes", 1-2*(test.compliance < 1), 0))) %>%
ungroup()
data.table
library(data.table)
datDT <- as.data.table(dat)
datDT[, res := cumsum(fifelse(needs.testing == "Yes", 1-2*(test.compliance < 1), 0)), by = .(person.id)]

This might do the trick for you?
df <- data.frame(year = rep(1990:1995, 3), person.id = c(rep("A", 6), rep("B", 6), rep("C", 6)), needs.testing = rep(c("Yes", "No"), 9), test.compliance = c(c(1,0,1,0,1,0), c(1,0,1,0,0,0), c(1,0,0,0,0,0)), penalized.compliance.cum.sum = c(c(1,1,2,2,3,3), c(1,1,2,2,1,1), c(1,1,0,0,-1,-1)))
library("dplyr")
penalty <- -1
df %>%
group_by(person.id) %>%
mutate(cumsum = cumsum(ifelse(needs.testing == "Yes" & test.compliance == 0, penalty, test.compliance)))
## A tibble: 18 x 6
## Groups: person.id [3]
# year person.id needs.testing test.compliance penalized.compliance.cum.sum cumsum
# <int> <chr> <chr> <dbl> <dbl> <dbl>
# 1 1990 A Yes 1 1 1
# 2 1991 A No 0 1 1
# 3 1992 A Yes 1 2 2
# 4 1993 A No 0 2 2
# 5 1994 A Yes 1 3 3
# 6 1995 A No 0 3 3
# 7 1990 B Yes 1 1 1
# 8 1991 B No 0 1 1
# 9 1992 B Yes 1 2 2
#10 1993 B No 0 2 2
#11 1994 B Yes 0 1 1
#12 1995 B No 0 1 1
#13 1990 C Yes 1 1 1
#14 1991 C No 0 1 1
#15 1992 C Yes 0 0 0
#16 1993 C No 0 0 0
#17 1994 C Yes 0 -1 -1
#18 1995 C No 0 -1 -1
You can then easily adjust the penalty variable to be whatever penalty you want.
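To try several penalties side by side, the same idea can be wrapped in a small base R function. This is only a sketch: penalized_cumsum() and its penalty and reward arguments are hypothetical names, not part of the answers above, and penalty is passed as a positive magnitude here.

```r
# Hypothetical helper (not from the original answers)
penalized_cumsum <- function(df, penalty = 1, reward = 1) {
  due  <- df$needs.testing == "Yes"
  step <- ifelse(due & df$test.compliance == 1, reward,
          ifelse(due & df$test.compliance == 0, -penalty, 0))
  # running total per person, returned in the original row order
  ave(step, df$person.id, FUN = cumsum)
}

df <- data.frame(
  year            = rep(1990:1995, 3),
  person.id       = rep(c("A", "B", "C"), each = 6),
  needs.testing   = rep(c("Yes", "No"), 9),
  test.compliance = c(1,0,1,0,1,0, 1,0,1,0,0,0, 1,0,0,0,0,0)
)

df$res_1    <- penalized_cumsum(df, penalty = 1)
df$res_half <- penalized_cumsum(df, penalty = 0.5)
df
```

With penalty = 1 this should match the penalized.compliance.cum.sum column from the question; with penalty = 0.5 each missed test costs half a point instead.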

Related

How do I add a column indicating the years since a binary variable in R?

I thought this would be trivial, and I think it must be, but I am very tired and stuck at this problem at the moment.
Consider a df with two columns, one with a year, and the other with a binary variable indicating some event.
df <- data.frame(year = c(2000,2001,2002,2003,2004, 2005,2006,2007,2008,2010),
flag = c(0,0,0,1,0,0,0,1,0,0))
I want to create a third column that simply counts the years since the last flag and that resets when a new flag appears.
I thought this code would do the job:
First, add a 0 as "year_since" for every year with a flag, then, if there was a flag in the previous year, add 1 to the value of the previous "year_since".
df <- df %>% mutate(year_since = ifelse(flag == 1, 0, NA)) %>%
mutate(year_since = ifelse(dplyr::lag(flag, n=1, order_by = "year") == 1 & is.na(year_since),
dplyr::lag(year_since, n=1, order_by = "year")+1, year_since))
However, this returns NA for every row that should be 1,2,3, and so on.
You could do
library(dplyr)
df %>%
group_by(group = cumsum(flag)) %>%
mutate(year_since = ifelse(group == 0, NA, seq(n()) - 1)) %>%
ungroup() %>%
select(-group)
#> # A tibble: 10 x 3
#> year flag year_since
#> <dbl> <dbl> <dbl>
#> 1 2000 0 NA
#> 2 2001 0 NA
#> 3 2002 0 NA
#> 4 2003 1 0
#> 5 2004 0 1
#> 6 2005 0 2
#> 7 2006 0 3
#> 8 2007 1 0
#> 9 2008 0 1
#> 10 2010 0 2
Created on 2022-09-16 with reprex v2.0.2
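Note that the grouped counter above counts rows since the last flag, which equals the number of years only when no year is missing; for 2010 it reports 2 although the last flag was in 2007. If calendar years are what is wanted, one option is to subtract the year of the first row in each group. A base R sketch under that assumption (rows sorted by year):

```r
df <- data.frame(year = c(2000:2008, 2010),
                 flag = c(0, 0, 0, 1, 0, 0, 0, 1, 0, 0))

# Each flag starts a new group; group 0 is the period before any flag
grp <- cumsum(df$flag)

# Year of the first row in each group (the flagged year for groups >= 1)
flag_year <- ave(df$year, grp, FUN = function(y) y[1])

# NA before the first flag, otherwise calendar years since the flag
df$year_since <- ifelse(grp == 0, NA, df$year - flag_year)
df
```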
Using data.table
library(data.table)
setDT(df)[, year_since := (NA^!cummax(flag)) * rowid(cumsum(flag))-1]
-output
> df
year flag year_since
<num> <num> <num>
1: 2000 0 NA
2: 2001 0 NA
3: 2002 0 NA
4: 2003 1 0
5: 2004 0 1
6: 2005 0 2
7: 2006 0 3
8: 2007 1 0
9: 2008 0 1
10: 2010 0 2

Attempting to create panel-data from cross sectional data

I'm attempting to transform data from the Global Terrorism Database so that instead of the unit being terror events, it will be "Country_Year" with one variable having the number of terror events that year.
I've managed to create a data frame with one column holding all the Country_Year combinations. I've also found that
table(GTD_94_Land$country_txt, GTD_94_Land$iyear) shows the values that I would like the new variable to have. What I can't figure out is how to store these numbers as a variable.
So my data look like this
eventid iyear crit1 crit2 crit3 country country_txt
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 199401010008 1994 1 1 1 182 Somalia
2 199401010012 1994 1 1 1 209 Turkey
3 199401010013 1994 1 1 1 209 Turkey
4 199401020003 1994 1 1 1 209 Turkey
5 199401020007 1994 1 1 0 106 Kuwait
6 199401030002 1994 1 1 1 209 Turkey
7 199401030003 1994 1 1 1 228 Yemen
8 199401030006 1994 1 1 0 53 Cyprus
9 199401040005 1994 1 1 0 209 Turkey
10 199401040006 1994 1 1 0 209 Turkey
11 199401040007 1994 1 1 1 209 Turkey
12 199401040008 1994 1 1 1 209 Turkey
and I would like to transform it so that I have
Terror attacks iyear crit1 crit2 crit3 country country_txt
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 1 1994 1 1 1 182 Somalia
2 8 1994 1 1 1 209 Turkey
5 1 1994 1 1 0 106 Kuwait
7 1 1994 1 1 1 228 Yemen
8 1 1994 1 1 0 53 Cyprus
I've looked at some solutions, but most of them seem to assume that the number the new variable should hold is already in the data.
All help is appreciated!
Assuming df is the original dataframe:
library(dplyr)
df_out = df %>%
dplyr::select(-eventid) %>%
dplyr::group_by(country_txt,iyear) %>%
dplyr::mutate(Terrorattacks = n()) %>%
dplyr::slice(1L) %>%
dplyr::ungroup()
Ideally, I would use summarise but since I don't know the summarising criteria for other columns, I have simply used mutate and slice.
Note: The 'crit' columns values would be the first occurrence of the 'country_txt' and 'iyear'.
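Since the question already found the right numbers with table(), another route is to convert that table into a data frame directly. A minimal base R sketch, assuming a cut-down copy of the sample rows (gtd and terror_attacks are illustrative names, not from the question):

```r
# Cut-down copy of the 12 sample rows from the question
gtd <- data.frame(
  iyear       = 1994,
  country_txt = c("Somalia", "Turkey", "Turkey", "Turkey", "Kuwait", "Turkey",
                  "Yemen", "Cyprus", "Turkey", "Turkey", "Turkey", "Turkey")
)

# One row per country-year combination, with the event count as a column
counts <- as.data.frame(
  table(country_txt = gtd$country_txt, iyear = gtd$iyear),
  responseName = "terror_attacks",
  stringsAsFactors = FALSE
)

# table() stores the year as a label, so convert it back to a number
counts$iyear <- as.integer(as.character(counts$iyear))
counts
```

Unlike the grouped approaches, table() also keeps country-year combinations with zero events, which can be useful for a panel.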
Here's a data.table solution. If the data set has already been filtered to have crit1 and crit2 equal to 1 (which you gave as a condition in a comment), you can remove the first argument (crit1 == 1 & crit2 == 1)
library(data.table)
set.seed(1011)
dat <- data.table(eventid = round(runif(100, 1000, 10000)),
iyear = sample(1994:1996, 100, rep = T),
crit1 = rbinom(100, 1, .9),
crit2 = rbinom(100, 1, .9),
crit3 = rbinom(100, 1, .9),
country = sample(1:3, 100, rep = T))
dat[, country_txt := LETTERS[country]]
## keep events meeting crit1 and crit2, then count them per country-year
dat[crit1 == 1 & crit2 == 1, .N, .(country, country_txt, iyear)]
#> country country_txt iyear N
#> 1: 1 A 1994 10
#> 2: 1 A 1995 4
#> 3: 3 C 1995 10
#> 4: 1 A 1996 7
#> 5: 2 B 1996 9
#> 6: 3 C 1996 5
#> 7: 2 B 1994 8
#> 8: 3 C 1994 13
#> 9: 2 B 1995 10
Created on 2019-09-24 by the reprex package (v0.3.0)

"Transition" Variable in R

Let's say I have this data set:
library(data.table)
mydata <- data.table(year=1991:2000,
z=c(0,0,1,1,1,1,1,0,0,0))
If I call the dataset, it will look something like this:
mydata
year z
1: 1991 0
2: 1992 0
3: 1993 1
4: 1994 1
5: 1995 1
6: 1996 1
7: 1997 1
8: 1998 0
9: 1999 0
10: 2000 0
What I need is:
A transition variable, call it c. If I had it, the data set would look something like this:
year z c
1: 1991 0 0
2: 1992 0 0
3: 1993 1 1
4: 1994 1 NA
5: 1995 1 NA
6: 1996 1 NA
7: 1997 1 NA
8: 1998 0 0
9: 1999 0 0
10: 2000 0 0
Essentially, c marks when there has been a transition in variable z, from z=0 to z=1. When it does, c puts a 1 just once and then starts putting NA's until it returns to the original state (z=0). Then, it starts putting zeroes.
I have another id variable, but that would complicate the example. I think I can manage that part.
** EDITED **: In fact, it does not matter whether I have an id variable or not.
It sounds easy, but not being an R expert myself, it's killing me!
You can use rleid to create a group variable, and then replace duplicated 1 in z with NA using ifelse statement:
mydata[, c := ifelse(duplicated(z) & z == 1, NA_integer_, z), by = rleid(z)][]
# year z c
# 1: 1991 0 0
# 2: 1992 0 0
# 3: 1993 1 1
# 4: 1994 1 NA
# 5: 1995 1 NA
# 6: 1996 1 NA
# 7: 1997 1 NA
# 8: 1998 0 0
# 9: 1999 0 0
#10: 2000 0 0
Another attempt:
mydata[, c := z]
mydata[c==1, c := replace(c,-1,NA), by=rleid(z)]
# year z c
# 1: 1991 0 0
# 2: 1992 0 0
# 3: 1993 1 1
# 4: 1994 1 NA
# 5: 1995 1 NA
# 6: 1996 1 NA
# 7: 1997 1 NA
# 8: 1998 0 0
# 9: 1999 0 0
#10: 2000 0 0
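library(data.table)
mydata <- data.table(year=1991:2000, z=c(0,0,1,1,1,1,1,0,0,0))
# compare each z with the previous row's z; 1 where the value changed
mydata[, c := ifelse(z != shift(z, type="lag"), 1, 0)]
# the first row has no predecessor, so its comparison is NA; set it to 0
mydata[1, c := 0]
This uses data.table's shift(x, n=1L, fill=NA, type=c("lag", "lead"), give.names=FALSE), which shifts x in the "lag" or "lead" direction by n steps. The first comparison produces an NA, which the last line changes to 0. Note that this flags every change in z, including the 1-to-0 change in 1998, and writes 0 rather than NA for the remaining z=1 rows, so it does not reproduce the requested output exactly. See ?shift in your session to read more about this function.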
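For comparison, the requested column can also be built in base R by looking at the previous value of z directly. This is only a sketch; it assumes the rows are sorted by year and treats the row before the first one as z = 0 (prev_z and c_out are illustrative names):

```r
z <- c(0, 0, 1, 1, 1, 1, 1, 0, 0, 0)

# Previous value of z, assuming an implicit 0 before the first row
prev_z <- c(0, head(z, -1))

# 0 while z is 0; 1 on the row where z switches from 0 to 1; NA afterwards
c_out <- ifelse(z == 0, 0, ifelse(prev_z == 0, 1, NA))
c_out
```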

adding rows to data.frame conditionally

I have a big data.frame of flowers and fruits on a plant from a 30-year survey. I want to add rows of zeros (0) for the months in which the plant did not have flowers or fruits (because it is a seasonal species).
Example:
Year Month Flowers Fruits
2004 6 25 2
2004 7 48 4
2005 7 20 1
2005 8 16 1
I want to add the months that are not included, with values of zero, so I was thinking of a function that recognizes the missing months and fills them with 0.
Thanks.
## x is the data frame you gave in the question
x <- data.frame(
Year = c(2004, 2004, 2005, 2005),
Month = c(6, 7, 7, 8),
Flowers = c(25, 48, 20, 16),
Fruits = c(2, 4, 1, 1)
)
## y is the data frame that will provide the missing values,
## so you can replace 2004 and 2005 with whatever your desired
## time interval is
y <- expand.grid(Year = 2004:2005, Month = 1:12)
## this final step fills in missing dates and replaces NA's with zeros
library(tidyr)
x <- merge(x, y, all = TRUE) %>%
replace_na(list(Flowers = 0, Fruits = 0))
## if you don't want to use tidyr, you can alternatively do
x <- merge(x, y, all = TRUE)
x[is.na(x)] <- 0
It looks like this:
head(x, 10)
# Year Month Flowers Fruits
# 1 2004 1 0 0
# 2 2004 2 0 0
# 3 2004 3 0 0
# 4 2004 4 0 0
# 5 2004 5 0 0
# 6 2004 6 25 2
# 7 2004 7 48 4
# 8 2004 8 0 0
# 9 2004 9 0 0
# 10 2004 10 0 0
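Rather than typing the year range by hand, the grid can also be built from the years that occur in the data. A small sketch under that assumption (a year with no rows at all would still be missing from the result):

```r
x <- data.frame(
  Year    = c(2004, 2004, 2005, 2005),
  Month   = c(6, 7, 7, 8),
  Flowers = c(25, 48, 20, 16),
  Fruits  = c(2, 4, 1, 1)
)

# Every observed year crossed with all twelve months
y <- expand.grid(Year = unique(x$Year), Month = 1:12)

# Full outer join on Year and Month, then zero-fill the gaps
filled <- merge(x, y, all = TRUE)
filled[is.na(filled)] <- 0
filled
```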
Here is another option using expand and left_join
library(dplyr)
library(tidyr)
expand(df1, Year, Month = 1:12) %>%
left_join(., df1) %>%
replace_na(list(Flowers=0, Fruits=0))
# Year Month Flowers Fruits
# <int> <int> <dbl> <dbl>
#1 2004 1 0 0
#2 2004 2 0 0
#3 2004 3 0 0
#4 2004 4 0 0
#5 2004 5 0 0
#6 2004 6 25 2
#7 2004 7 48 4
#8 2004 8 0 0
#9 2004 9 0 0
#10 2004 10 0 0
#.. ... ... ... ...

Create Indicator Data Frame Based on Interval Ranges

I am trying to create a "long" data frame of indicator ("dummy") variables out of a very peculiar type of "wide" data frame in R that has interval ranges of years defining my data.
What I have looks like this:
f=data.frame(name=c("A","B","C"),
year.start=c(1990,1994,1993),year.end=c(1994,1995,1993))
name year.start year.end
1 A 1990 1994
2 B 1994 1995
3 C 1993 1993
Update: I have changed the value of year.start for A to 1990 from the initial example of 1993 to address some of the answers below which rely on unique values instead of intervals.
What I would like is a long data frame that would look like this, with an entry for each of the possible years in the original data frame, eg, 1990 through 1995 where 1 = present and 0 = absent.
name year indicator
A 1990 1
A 1991 1
A 1992 1
A 1993 1
A 1994 1
A 1995 0
B 1990 0
B 1991 0
B 1992 0
B 1993 0
B 1994 1
B 1995 1
C 1990 0
C 1991 0
C 1992 0
C 1993 1
C 1994 0
C 1995 0
Try as I might, I don't see how I can do this with Hadley Wickham's reshape2 package.
Thanks!
Someone else might have suggestion for reshape2, but here is a base R solution:
years <- factor(unlist(f[-1]), levels=seq(min(f[-1]), max(f[-1]), by=1))
result <- data.frame(table(years, rep(f[[1]], length.out=length(years))))
# years Var2 Freq
# 1 1990 A 1
# 2 1991 A 0
# 3 1992 A 0
# 4 1993 A 0
# 5 1994 A 1
# 6 1995 A 0
# 7 1990 B 0
# 8 1991 B 0
# 9 1992 B 0
# 10 1993 B 0
# 11 1994 B 1
# 12 1995 B 1
# 13 1990 C 0
# 14 1991 C 0
# 15 1992 C 0
# 16 1993 C 2
# 17 1994 C 0
# 18 1995 C 0
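The table above counts only the start and end years themselves, so A shows 0 for 1991 to 1993 and C shows 2 for 1993. A base R sketch that tests each year against the whole interval instead, assuming the updated data from the question (grid and idx are illustrative names):

```r
f <- data.frame(name       = c("A", "B", "C"),
                year.start = c(1990, 1994, 1993),
                year.end   = c(1994, 1995, 1993))

# Every name crossed with every year in the overall range
grid <- expand.grid(year = min(f$year.start):max(f$year.end),
                    name = f$name,
                    stringsAsFactors = FALSE)

# Look up each row's interval, then test whether the year falls inside it
idx <- match(grid$name, f$name)
grid$indicator <- as.integer(grid$year >= f$year.start[idx] &
                             grid$year <= f$year.end[idx])

grid <- grid[, c("name", "year", "indicator")]
grid
```

Because expand.grid varies its first argument fastest, the rows come out ordered by name and then by year.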
Here is a step-by-step breakdown using data.table. (The output shown comes from the initial example, in which A started in 1993.)
library(data.table)
f <- as.data.table(f)
## ALL OF NAME-YEAR COMBINATIONS
ALL <- f[, CJ(name=name, year=seq(min(year.start), max(year.end)))]
## WHICH COMBINATIONS EXIST
PRESENT <- f[, list(year = seq(year.start, year.end)), by=name]
## SETKEYS FOR MERGING
setkey(ALL, name, year)
setkey(PRESENT, name, year)
## INITIALIZE INDICATOR TO ZERO, THEN SET TO 1 FOR THOSE PRESENT
ALL[, indicator := 0]
ALL[PRESENT, indicator := 1]
ALL
name year indicator
1: A 1993 1
2: A 1994 1
3: A 1995 0
4: B 1993 0
5: B 1994 1
6: B 1995 1
7: C 1993 1
8: C 1994 0
9: C 1995 0
Here's another solution, similar to the ones above, which aims to be straightforward:
zz <- cbind(name=f[1],year=rep(min(f[-1]):max(f[-1]),each=nrow(f)))
zz$indicator <- as.numeric((f$name==zz$name &
f$year.start<=zz$year &
f$year.end >=zz$year))
result <- zz[order(zz$name,zz$year),]
The first line builds a template with all the names and all the years. The second line sets indicator based on whether it is present in the range. The third line just reorders the result.
Another base R solution
f=data.frame(name=c("A","B","C"),
year.start=c(1993,1994,1993),year.end=c(1994,1995,1993), stringsAsFactors=FALSE)
x <- expand.grid(unique(f$name),min(f$year.start):max(f$year.end))
names(x) <- c("name", "year")
x$indicator <- sapply(1:nrow(x), function(i) sum(x$name[i]==f$name & x$year[i] >= f$year.start & x$year[i] <= f$year.end))
x[order(x$name),]
