Create Indicator Data Frame Based on Interval Ranges in R

I am trying to create a "long" data frame of indicator ("dummy") variables out of a very peculiar type of "wide" data frame in R that has interval ranges of years defining my data.
What I have looks like this:
f <- data.frame(name = c("A", "B", "C"),
                year.start = c(1990, 1994, 1993),
                year.end = c(1994, 1995, 1993))
  name year.start year.end
1    A       1990     1994
2    B       1994     1995
3    C       1993     1993
Update: I have changed the value of year.start for A to 1990 from the initial example of 1993 to address some of the answers below which rely on unique values instead of intervals.
What I would like is a long data frame that looks like this, with an entry for each of the possible years in the original data frame, e.g., 1990 through 1995, where 1 = present and 0 = absent.
name year indicator
A    1990         1
A    1991         1
A    1992         1
A    1993         1
A    1994         1
A    1995         0
B    1990         0
B    1991         0
B    1992         0
B    1993         0
B    1994         1
B    1995         1
C    1990         0
C    1991         0
C    1992         0
C    1993         1
C    1994         0
C    1995         0
Try as I might, I don't see how I can do this with Hadley Wickham's reshape2 package.
Thanks!

Someone else might have a suggestion for reshape2, but here is a base R solution (note that it tabulates only the start and end years rather than filling in the years between them, which is why C shows a count of 2 for 1993):
years <- factor(unlist(f[-1]), levels=seq(min(f[-1]), max(f[-1]), by=1))
result <- data.frame(table(years, rep(f[[1]], length.out=length(years))))
# years Var2 Freq
# 1 1990 A 1
# 2 1991 A 0
# 3 1992 A 0
# 4 1993 A 0
# 5 1994 A 1
# 6 1995 A 0
# 7 1990 B 0
# 8 1991 B 0
# 9 1992 B 0
# 10 1993 B 0
# 11 1994 B 1
# 12 1995 B 1
# 13 1990 C 0
# 14 1991 C 0
# 15 1992 C 0
# 16 1993 C 2
# 17 1994 C 0
# 18 1995 C 0

Here is a step-by-step breakdown using data.table. (The output below was generated with the original example data, in which A's interval starts in 1993.)
library(data.table)
f <- as.data.table(f)
## ALL OF NAME-YEAR COMBINATIONS
ALL <- f[, CJ(name=name, year=seq(min(year.start), max(year.end)))]
## WHICH COMBINATIONS EXIST
PRESENT <- f[, list(year = seq(year.start, year.end)), by=name]
## SETKEYS FOR MERGING
setkey(ALL, name, year)
setkey(PRESENT, name, year)
## INITIALIZE INDICATOR TO ZERO, THEN SET TO 1 FOR THOSE PRESENT
ALL[, indicator := 0]
ALL[PRESENT, indicator := 1]
ALL
name year indicator
1: A 1993 1
2: A 1994 1
3: A 1995 0
4: B 1993 0
5: B 1994 1
6: B 1995 1
7: C 1993 1
8: C 1994 0
9: C 1995 0

Here's another solution, similar to the ones above, which aims to be straightforward:
zz <- cbind(name = f[1], year = rep(min(f[-1]):max(f[-1]), each = nrow(f)))
zz$indicator <- as.numeric(f$name == zz$name &
                           f$year.start <= zz$year &
                           f$year.end   >= zz$year)
result <- zz[order(zz$name, zz$year), ]
The first statement builds a template with every name-year combination. The second sets the indicator based on whether the year falls inside that name's range. The third just reorders the result.

Another base R solution
f <- data.frame(name = c("A", "B", "C"),
                year.start = c(1993, 1994, 1993),
                year.end = c(1994, 1995, 1993),
                stringsAsFactors = FALSE)
x <- expand.grid(unique(f$name), min(f$year.start):max(f$year.end))
names(x) <- c("name", "year")
x$indicator <- sapply(1:nrow(x), function(i)
  sum(x$name[i] == f$name & x$year[i] >= f$year.start & x$year[i] <= f$year.end))
x[order(x$name), ]
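For completeness, here is a minimal base R sketch of the interval-expansion idea (my own addition, not one of the posted answers; it works with either version of f):
expanded <- do.call(rbind, lapply(seq_len(nrow(f)), function(i)
  data.frame(name = f$name[i], year = f$year.start[i]:f$year.end[i])))
grid <- expand.grid(name = unique(f$name),
                    year = min(f$year.start):max(f$year.end))
## a name-year pair is present if it appears among the expanded intervals
grid$indicator <- as.numeric(paste(grid$name, grid$year) %in%
                             paste(expanded$name, expanded$year))
grid[order(grid$name, grid$year), ]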

Related

Fill in Column Based on Other Rows (R)

I am looking for a way to fill in a column in R based on values in a different column. Below is what my data looks like.
year action player  end
2001      1   Mike 2003
2002      0   Mike   NA
2003      0   Mike   NA
2004      0   Mike   NA
2001      0   Alan   NA
2002      0   Alan   NA
2003      1   Alan 2004
2004      0   Alan   NA
I would like to either change the "action" column or create a new column such that it reflects the duration between the "year" and "end" variables. Below is what it would look like:
year action player  end
2001      1   Mike 2003
2002      1   Mike   NA
2003      1   Mike   NA
2004      0   Mike   NA
2001      0   Alan   NA
2002      0   Alan   NA
2003      1   Alan 2004
2004      1   Alan   NA
I have tried to do this with the following loop:
i <- 0
z <- 0
for (i in 1:nrow(df)) {
  i <- z + i + 1
  if (df[i, 2] == 0) {}
  else {df[i, 5] = (df[i, 4] - df[i, 1])}
  z <- df[i, 5]
  for (z in i:nrow(df)) {df[i, 2] = 1}
}
Here, my i value is skyrocketing, breaking the loop. I am not sure why that is occurring. I'd be interested to know either how to fix my approach or how to do this in a smarter fashion.
There's no need for explicit loops here.
First group your data frame by player. Then find the rows where the cumulative sum (cumsum) of action is greater than 0 and the year is less than or equal to the end year of the group. If the row meets these conditions, set action to 1, otherwise to 0.
Using the dplyr package you could achieve this in a couple of lines:
library(dplyr)
df %>%
  group_by(player) %>%
  mutate(action = as.numeric(cumsum(action) > 0 & year <= na.omit(end)[1]))
#> # A tibble: 8 x 4
#> # Groups: player [2]
#> year action player end
#> <int> <dbl> <chr> <int>
#> 1 2001 1 Mike 2003
#> 2 2002 1 Mike NA
#> 3 2003 1 Mike NA
#> 4 2004 0 Mike NA
#> 5 2001 0 Alan NA
#> 6 2002 0 Alan NA
#> 7 2003 1 Alan 2004
#> 8 2004 1 Alan NA
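For comparison, a hedged base R sketch of the same grouped logic (my own addition; it assumes df still holds the original data shown above):
## compute the 0/1 flag per player, then put the results back in the original row order
df$action <- unsplit(lapply(split(df, df$player), function(d)
  as.numeric(cumsum(d$action) > 0 & d$year <= na.omit(d$end)[1])), df$player)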

Penalized cumulative sum in R

I need to calculate a penalized cumulative sum.
Individuals "A", "B" and "C" were supposed to get tested every other year. Every time they get tested, they accumulate 1 point. However, when they miss a test, their cumulative score gets deducted in 1.
I have the following code:
data.frame(year = rep(1990:1995, 3),
           person.id = c(rep("A", 6), rep("B", 6), rep("C", 6)),
           needs.testing = rep(c("Yes", "No"), 9),
           test.compliance = c(c(1,0,1,0,1,0), c(1,0,1,0,0,0), c(1,0,0,0,0,0)),
           penalized.compliance.cum.sum = c(c(1,1,2,2,3,3), c(1,1,2,2,1,1), c(1,1,0,0,-1,-1)))
...which gives the following:
year person.id needs.testing test.compliance penalized.compliance.cum.sum
1 1990 A Yes 1 1
2 1991 A No 0 1
3 1992 A Yes 1 2
4 1993 A No 0 2
5 1994 A Yes 1 3
6 1995 A No 0 3
7 1990 B Yes 1 1
8 1991 B No 0 1
9 1992 B Yes 1 2
10 1993 B No 0 2
11 1994 B Yes 0 1
12 1995 B No 0 1
13 1990 C Yes 1 1
14 1991 C No 0 1
15 1992 C Yes 0 0
16 1993 C No 0 0
17 1994 C Yes 0 -1
18 1995 C No 0 -1
As is evident, "A" fully complied. "B" somewhat complied (in 1994 he was supposed to get tested but missed the test, and consequently his cumulative sum drops from 2 to 1). Finally, "C" complied just once, in 1990; after that, every time she needed to get tested, she missed the test.
What I need is some code to get the "penalized.compliance.cum.sum" variable.
Please note:
Tests are every other year.
The "penalized.compliance.cum.sum" variable keeps adding the previous score.
But starts deducting only if the individual misses the test on the testing year (denoted in the "needs.testing" variable).
For instance, individual "C" complies in year 1990. In 1991 she doesn't need to get tested, and hence keeps her score of 1. Then, she misses the 1992 test, and 1 is subtracted from her cumulative score, getting a score of 0 in 1992. Then she keeps missing test getting a -1 at the end of the study.
Also, I need to assign different penalties (i.e. different numbers). In this example, it's just 1. However, I need to be able to penalize using other numbers such as 0.5, 0.1, and others.
Thanks!
Using case_when
library(dplyr)
df1 %>%
  group_by(person.id) %>%
  mutate(res = cumsum(case_when(needs.testing == "Yes" ~ 1 - 2 * (test.compliance < 1),
                                TRUE ~ 0)))
base R
do.call(rbind, by(dat, dat$person.id,
function(z) transform(z, res = cumsum(ifelse(needs.testing == "Yes", 1-2*(test.compliance < 1), 0)))
))
# year person.id needs.testing test.compliance penalized.compliance.cum.sum res
# A.1 1990 A Yes 1 1 1
# A.2 1991 A No 0 1 1
# A.3 1992 A Yes 1 2 2
# A.4 1993 A No 0 2 2
# A.5 1994 A Yes 1 3 3
# A.6 1995 A No 0 3 3
# B.7 1990 B Yes 1 1 1
# B.8 1991 B No 0 1 1
# B.9 1992 B Yes 1 2 2
# B.10 1993 B No 0 2 2
# B.11 1994 B Yes 0 1 1
# B.12 1995 B No 0 1 1
# C.13 1990 C Yes 1 1 1
# C.14 1991 C No 0 1 1
# C.15 1992 C Yes 0 0 0
# C.16 1993 C No 0 0 0
# C.17 1994 C Yes 0 -1 -1
# C.18 1995 C No 0 -1 -1
by splits a frame up by the INDICES (dat$person.id here); inside the function, z is the data for just that group. This lets us operate on each person's rows without worrying about the person changing partway through a vector.
by returns a list, and the canonical base R way to combine a list of frames into one frame is rbind(a, b) when there are only two frames, or do.call(rbind, list(...)) when there may be more than two.
The 1-2*(.) is just a trick to flip between +1 and -1 based on test.compliance.
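A quick standalone illustration of that trick (my own addition):
x <- c(1, 0, 1, 0)   # test.compliance values
1 - 2 * (x < 1)      # compliance of 1 maps to +1, anything less maps to -1
# [1]  1 -1  1 -1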
This has the side effect of potentially changing the order of the rows. For instance, if the data were ordered first by year and then by person.id, the by-group calculations would still be correct, but the output would be grouped by person.id (and ordered by year within each group). Minor, but note it if you need the rows in a particular order.
dplyr
library(dplyr)
dat %>%
  group_by(person.id) %>%
  mutate(res = cumsum(if_else(needs.testing == "Yes", 1 - 2 * (test.compliance < 1), 0))) %>%
  ungroup()
data.table
library(data.table)
datDT <- as.data.table(dat)
datDT[, res := cumsum(fifelse(needs.testing == "Yes", 1-2*(test.compliance < 1), 0)), by = .(person.id)]
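If you need a penalty other than 1 (as the question mentions), here is a hedged tweak of the dplyr version above, where pen is an assumed penalty magnitude of your choosing:
pen <- 0.5
dat %>%
  group_by(person.id) %>%
  mutate(res = cumsum(if_else(needs.testing == "Yes",
                              if_else(test.compliance == 1, 1, -pen),  # reward 1, deduct pen
                              0))) %>%
  ungroup()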
This might do the trick for you?
df <- data.frame(year = rep(1990:1995, 3),
                 person.id = c(rep("A", 6), rep("B", 6), rep("C", 6)),
                 needs.testing = rep(c("Yes", "No"), 9),
                 test.compliance = c(c(1,0,1,0,1,0), c(1,0,1,0,0,0), c(1,0,0,0,0,0)),
                 penalized.compliance.cum.sum = c(c(1,1,2,2,3,3), c(1,1,2,2,1,1), c(1,1,0,0,-1,-1)))
library("dplyr")
penalty <- -1
df %>%
  group_by(person.id) %>%
  mutate(cumsum = cumsum(ifelse(needs.testing == "Yes" & test.compliance == 0, penalty, test.compliance)))
## A tibble: 18 x 6
## Groups: person.id [3]
# year person.id needs.testing test.compliance penalized.compliance.cum.sum cumsum
# <int> <chr> <chr> <dbl> <dbl> <dbl>
# 1 1990 A Yes 1 1 1
# 2 1991 A No 0 1 1
# 3 1992 A Yes 1 2 2
# 4 1993 A No 0 2 2
# 5 1994 A Yes 1 3 3
# 6 1995 A No 0 3 3
# 7 1990 B Yes 1 1 1
# 8 1991 B No 0 1 1
# 9 1992 B Yes 1 2 2
#10 1993 B No 0 2 2
#11 1994 B Yes 0 1 1
#12 1995 B No 0 1 1
#13 1990 C Yes 1 1 1
#14 1991 C No 0 1 1
#15 1992 C Yes 0 0 0
#16 1993 C No 0 0 0
#17 1994 C Yes 0 -1 -1
#18 1995 C No 0 -1 -1
You can then easily adjust the penalty variable to be whatever penalty you want.

How to find observations whose dummy variable changes from 1 to 0 (and not vice versa) in a data frame in R

I have a survey composed of n individuals; each individual is present more than one time in the survey (panel). I have a variable pens, which is a dummy that takes value 1 if the individual invests in a complementary pension form. For example:
df <- data.frame(year = c(2002, 2002, 2004, 2004, 2006, 2008),
                 id = c(1, 2, 1, 2, 3, 3),
                 y.b = c(1950, 1943, 1950, 1943, 1966, 1966),
                 sex = c("F", "M", "F", "M", "M", "M"),
                 income = c(100000, 55000, 88000, 66000, 12000, 24000),
                 pens = c(0, 1, 1, 0, 1, 1))
year id y.b sex income pens
2002 1 1950 F 100000 0
2002 2 1943 M 55000 1
2004 1 1950 F 88000 1
2004 2 1943 M 66000 0
2006 3 1966 M 12000 1
2008 3 1966 M 24000 1
where id is the individual, y.b is year of birth, pens is the dummy variable regarding complementary pension.
I want to know whether there are individuals who invested in a complementary pension form in year t but no longer held it in year t+2 (the survey is conducted every two years). In this way I want to know how many people had a complementary pension form but gave it up before retirement (for example, for economic reasons).
I tried with this command:
df$x <- (ave(df$pens, df$id, FUN = function(x)length(unique(x)))==1)*1
which(df$x=="0")
and it actually gives me the individuals whose pens variable changed over time (the command checks whether a variable is constant in time). For this reason I find individuals whose pens variable changed from 0 (didn't have a complementary pension) in year t to 1 in year t+2 and vice versa; but I am only interested in individuals whose pens variable was 1 (had a complementary pension) in year t and 0 in year t+2.
If I use this command with the df I get that for id 1 and 2 the variable x is 0 (pens variable isn't constant), but I'd need to find a way to get just id 2 (whose pens variable changed from 1 to 0).
df$x <- (ave(df$pens, df$id, FUN = function(x)length(unique(x)))==1)*1
which(df$x=="0")
year id pens x
1 2002 1 0 0
2 2002 2 1 0
3 2004 1 1 0
4 2004 2 0 0
5 2006 3 1 1
6 2008 3 1 1
(for the sake of simplicity I omitted the other variables)
So the desired output is:
year id pens x
1 2002 1 0 1
2 2002 2 1 0
3 2004 1 1 1
4 2004 2 0 0
5 2006 3 1 1
6 2008 3 1 1
only id 2 has x=0 since the pens variable changed from 1 to 0.
Thanks in advance
This assigns 1 to the id's for which there is a decline in pens and 0 otherwise.
transform(d.d, x = ave(pens, id, FUN = function(x) any(diff(x) < 0)))
giving:
year id y.b sex income pens x
1 2002 1 1950 F 100000 0 0
2 2002 2 1943 M 55000 1 1
3 2004 1 1950 F 88000 1 0
4 2004 2 1943 M 66000 0 1
5 2006 3 1966 M 12000 1 0
6 2008 3 1966 M 24000 1 0
This should work even if there are more than 2 rows per id, but if we knew there were always exactly 2 rows, we could omit the any, simplifying it to:
transform(d.d, x = ave(pens, id, FUN = diff) < 0)
Note: The input in reproducible form is:
Lines <- "year id y.b sex income pens
2002 1 1950 F 100000 0
2002 2 1943 M 55000 1
2004 1 1950 F 88000 1
2004 2 1943 M 66000 0
2006 3 1966 M 12000 1
2008 3 1966 M 24000 1"
d.d <- read.table(text = Lines, header = TRUE, check.names = FALSE)
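A hedged dplyr version of the same check (my own addition, not part of the original answer; it uses the same convention of x = 1 for a decline in pens):
library(dplyr)
d.d %>%
  group_by(id) %>%
  mutate(x = as.integer(any(diff(pens) < 0))) %>%
  ungroup()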

"Transition" Variable in R

Let's say I have this data set:
library(data.table)
mydata <- data.table(year = 1991:2000,
                     z = c(0, 0, 1, 1, 1, 1, 1, 0, 0, 0))
If I call the dataset, it will look something like this:
mydata
year z
1: 1991 0
2: 1992 0
3: 1993 1
4: 1994 1
5: 1995 1
6: 1996 1
7: 1997 1
8: 1998 0
9: 1999 0
10: 2000 0
What I need is:
A transition variable, call it c. If I had it in the dataset, it would look something like this:
year z c
1: 1991 0 0
2: 1992 0 0
3: 1993 1 1
4: 1994 1 NA
5: 1995 1 NA
6: 1996 1 NA
7: 1997 1 NA
8: 1998 0 0
9: 1999 0 0
10: 2000 0 0
Essentially, c marks when there has been a transition in variable z from z=0 to z=1. When that happens, c takes the value 1 just once and then NA until z returns to its original state (z=0), after which it goes back to zeroes.
I have another id variable, but that would complicate the example. I think I can manage that part.
** EDITED **: In fact, it does not matter whether I have an id variable or not.
It sounds easy, but not being an R expert myself, it's killing me!
You can use rleid to create a group variable, and then replace the duplicated 1s in z with NA using an ifelse statement:
mydata[, c := ifelse(duplicated(z) & z == 1, NA_integer_, z), by = rleid(z)][]
# year z c
# 1: 1991 0 0
# 2: 1992 0 0
# 3: 1993 1 1
# 4: 1994 1 NA
# 5: 1995 1 NA
# 6: 1996 1 NA
# 7: 1997 1 NA
# 8: 1998 0 0
# 9: 1999 0 0
#10: 2000 0 0
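For reference, a quick standalone look at the grouping that rleid produces on z (my own addition):
rleid(mydata$z)
# [1] 1 1 2 2 2 2 2 3 3 3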
Another attempt:
mydata[, c := z]
mydata[c==1, c := replace(c,-1,NA), by=rleid(z)]
# year z c
# 1: 1991 0 0
# 2: 1992 0 0
# 3: 1993 1 1
# 4: 1994 1 NA
# 5: 1995 1 NA
# 6: 1996 1 NA
# 7: 1997 1 NA
# 8: 1998 0 0
# 9: 1999 0 0
#10: 2000 0 0
library(data.table)
mydata <- data.table(year=1991:2000, z=c(0,0,1,1,1,1,1,0,0,0))
mydata[, c := ifelse(z != shift(z, type = "lag"), 1, 0)]
mydata[1, c := 0]
Check the data.table function shift(x, n=1L, fill=NA, type=c("lag", "lead"), give.names=FALSE). It shifts x in either the "lag" or "lead" direction, with n the number of steps. The first comparison generates an NA, which is changed to 0 in the last line. You can run ?shift in your session to read more about this function.
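A small standalone example of shift (my own addition, assuming data.table is loaded):
shift(c(0, 0, 1, 1, 0), type = "lag")
# [1] NA  0  0  1  1
shift(c(0, 0, 1, 1, 0), type = "lead")
# [1]  0  1  1  0 NA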

Rearranging data frame in R with summarizing values

I need to rearrange a data frame, which currently looks like this:
> counts
year score freq rounded_year
1: 1618 0 25 1620
2: 1619 2 1 1620
3: 1619 0 20 1620
4: 1620 1 6 1620
5: 1620 0 70 1620
---
11570: 1994 107 1 1990
11571: 1994 101 2 1990
11572: 1994 10 194 1990
11573: 1994 1 30736 1990
11574: 1994 0 711064 1990
But what I need is the count of the unique values in score per decade (rounded_year).
So, the data frame should look like this:
rounded_year      0     1 2 3 [...]  total
1620            115     6 1 0          122
---
1990         711064 30736 0 0       741997
I've played around with aggregate and ddply, but so far without success. I hope it's clear what I mean; I don't know how to describe it better.
Any ideas?
A simple example using dplyr and tidyr.
dt <- data.frame(year = c(1618, 1619, 1620, 1994, 1994, 1994),
                 score = c(0, 1, 0, 2, 2, 3),
                 freq = c(3, 5, 2, 6, 7, 8),
                 rounded_year = c(1620, 1620, 1620, 1990, 1990, 1990))
dt
# year score freq rounded_year
# 1 1618 0 3 1620
# 2 1619 1 5 1620
# 3 1620 0 2 1620
# 4 1994 2 6 1990
# 5 1994 2 7 1990
# 6 1994 3 8 1990
library(dplyr)
library(tidyr)
dt %>%
  group_by(rounded_year, score) %>%
  summarise(freq = sum(freq)) %>%
  mutate(total = sum(freq)) %>%
  spread(score, freq, fill = 0)
# Source: local data frame [2 x 6]
#
# rounded_year total 0 1 2 3
# (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
# 1 1620 10 5 5 0 0
# 2 1990 21 0 0 13 8
In case you prefer to work with data.table (as the dataset you provide looks more like a data.table), you can use this:
library(data.table)
library(tidyr)
dt <- setDT(dt)[, .(freq = sum(freq)), by = c("rounded_year", "score")]
dt <- dt[, total := sum(freq), by = "rounded_year"]
dt <- spread(dt, score, freq, fill = 0)
dt
# rounded_year total 0 1 2 3
# 1: 1620 10 5 5 0 0
# 2: 1990 21 0 0 13 8
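Note that tidyr's spread() has since been superseded by pivot_wider(); here is a hedged equivalent sketch, starting again from the original dt data frame defined above:
library(dplyr)
library(tidyr)
dt %>%
  group_by(rounded_year, score) %>%
  summarise(freq = sum(freq), .groups = "drop_last") %>%
  mutate(total = sum(freq)) %>%   # total per decade before widening
  ungroup() %>%
  pivot_wider(names_from = score, values_from = freq, values_fill = 0)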
