How to find observations whose dummy variable changes from 1 to 0 (and not vice versa) in a df in R

I have a survey composed of n individuals; each individual is present more than one time in the survey (panel). I have a variable pens, which is a dummy that takes value 1 if the individual invests in a complementary pension form. For example:
df <- data.frame(year=c(2002,2002,2004,2004,2006,2008), id=c(1,2,1,2,3,3), y.b=c(1950,1943,1950,1943,1966,1966), sex=c("F", "M", "F", "M", "M", "M"), income=c(100000,55000,88000,66000,12000,24000), pens=c(0,1,1,0,1,1))
year id y.b sex income pens
2002 1 1950 F 100000 0
2002 2 1943 M 55000 1
2004 1 1950 F 88000 1
2004 2 1943 M 66000 0
2006 3 1966 M 12000 1
2008 3 1966 M 24000 1
where id is the individual, y.b is year of birth, pens is the dummy variable regarding complementary pension.
I want to know whether there are individuals who invested in a complementary pension form in year t but no longer held it in year t+2 (the survey is conducted every two years). This would tell me how many people had a complementary pension form but withdrew from it before retirement or gave it up (for example for economic reasons).
I tried with this command:
df$x <- (ave(df$pens, df$id, FUN = function(x)length(unique(x)))==1)*1
which(df$x=="0")
This gives me the individuals whose pens variable changed over time (the command checks whether a variable is constant over time). So I find individuals whose pens variable changed from 0 (no complementary pension) in year t to 1 in year t+2 and vice versa, but I am only interested in individuals whose pens variable was 1 (had a complementary pension) in year t and 0 in year t+2.
If I use this command with the df I get that for id 1 and 2 the variable x is 0 (pens variable isn't constant), but I'd need to find a way to get just id 2 (whose pens variable changed from 1 to 0).
df$x <- (ave(df$pens, df$id, FUN = function(x)length(unique(x)))==1)*1
which(df$x=="0")
year id pens x
1 2002 1 0 0
2 2002 2 1 0
3 2004 1 1 0
4 2004 2 0 0
5 2006 3 1 1
6 2008 3 1 1
(for the sake of simplicity I omitted the other variables)
So the desired output is:
year id pens x
1 2002 1 0 1
2 2002 2 1 0
3 2004 1 1 1
4 2004 2 0 0
5 2006 3 1 1
6 2008 3 1 1
only id 2 has x=0 since the pens variable changed from 1 to 0.
Thanks in advance

This assigns 1 to the id's for which there is a decline in pens and 0 otherwise.
transform(d.d, x = ave(pens, id, FUN = function(x) any(diff(x) < 0)))
giving:
year id y.b sex income pens x
1 2002 1 1950 F 100000 0 0
2 2002 2 1943 M 55000 1 1
3 2004 1 1950 F 88000 1 0
4 2004 2 1943 M 66000 0 1
5 2006 3 1966 M 12000 1 0
6 2008 3 1966 M 24000 1 0
This should work even if there are more than 2 rows per id, but if we knew there were always exactly 2 rows per id we could omit the any, simplifying it to the following (x is then logical rather than 0/1):
transform(d.d, x = ave(pens, id, FUN = diff) < 0)
Note: The input in reproducible form is:
Lines <- "year id y.b sex income pens
2002 1 1950 F 100000 0
2002 2 1943 M 55000 1
2004 1 1950 F 88000 1
2004 2 1943 M 66000 0
2006 3 1966 M 12000 1
2008 3 1966 M 24000 1"
d.d <- read.table(text = Lines, header = TRUE, check.names = FALSE)
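Since diff compares neighbouring rows, the approach above relies on each id's rows being in year order. A minimal sketch that sorts first and then lists the ids with a 1 to 0 step is shown below; the column name dropped and the vector dropped_ids are made up for illustration, and only the columns needed for the check are rebuilt.

```r
# Rebuild the relevant columns of the example data from the question.
df <- data.frame(
  year = c(2002, 2002, 2004, 2004, 2006, 2008),
  id   = c(1, 2, 1, 2, 3, 3),
  pens = c(0, 1, 1, 0, 1, 1)
)

# diff() compares neighbouring rows, so sort each id's rows by year first.
df <- df[order(df$id, df$year), ]

# dropped is 1 on every row of an id that has at least one 1 -> 0 step,
# and 0 otherwise (ids with a single row get 0).
df$dropped <- ave(df$pens, df$id, FUN = function(p) any(diff(p) < 0))

# ids that held the pension in one wave and not in the next observed wave
dropped_ids <- unique(df$id[df$dropped == 1])
print(dropped_ids)
```

If the asker's coding is preferred (0 for a 1 to 0 change, 1 otherwise), 1 - df$dropped gives it.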

Related

Fill in Column Based on Other Rows (R)

I am looking for a way to fill in a column in R based on values in a different column. Below is what my data looks like.
year action player end
2001 1 Mike 2003
2002 0 Mike NA
2003 0 Mike NA
2004 0 Mike NA
2001 0 Alan NA
2002 0 Alan NA
2003 1 Alan 2004
2004 0 Alan NA
I would like to either change the "action" column or create a new column such that it reflects the duration between the "year" and "end" variables. Below is what it would look like:
year action player end
2001 1 Mike 2003
2002 1 Mike NA
2003 1 Mike NA
2004 0 Mike NA
2001 0 Alan NA
2002 0 Alan NA
2003 1 Alan 2004
2004 1 Alan NA
I have tried to do this with the following loop:
i <- 0
z <- 0
for (i in 1:nrow(df)){
i <- z + i + 1
if (df[i, 2] == 0) {}
else {df[i, 5] = (df[i, 4] - df[i, 1])}
z <- df[i,5]
for (z in i:nrow(df)){df[i, 2] = 1}
}
Here, my i value is skyrocketing, breaking the loop. I am not sure why that is occurring. I'd be interested to know either how to fix my approach or how to do this in a smarter fashion.
There's no need for explicit loops here.
First group your data frame by player. Then find the rows where the cumulative sum (cumsum) of action is greater than 0 and the year is less than or equal to the end year of the group. If the row meets these conditions, set action to 1, otherwise to 0.
Using the dplyr package you could achieve this in a couple of lines:
library(dplyr)
df %>%
group_by(player) %>%
mutate(action = as.numeric(cumsum(action) > 0 & year <= na.omit(end)[1]))
#> # A tibble: 8 x 4
#> # Groups: player [2]
#> year action player end
#> <int> <dbl> <chr> <int>
#> 1 2001 1 Mike 2003
#> 2 2002 1 Mike NA
#> 3 2003 1 Mike NA
#> 4 2004 0 Mike NA
#> 5 2001 0 Alan NA
#> 6 2002 0 Alan NA
#> 7 2003 1 Alan 2004
#> 8 2004 1 Alan NA
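For readers without dplyr, the same rule (the action has started, and the year is not past the player's end year) can be sketched in base R with ave. This is only an illustration: the helper vectors started and end_year are made-up names, and it assumes, as in the example, that each player has exactly one non-missing end value and that rows are in year order within player.

```r
# Example data from the question.
df <- data.frame(
  year   = rep(2001:2004, times = 2),
  action = c(1, 0, 0, 0, 0, 0, 1, 0),
  player = rep(c("Mike", "Alan"), each = 4),
  end    = c(2003, NA, NA, NA, NA, NA, 2004, NA)
)

# TRUE from the first row where action is 1 onwards, within each player.
started <- ave(df$action, df$player, FUN = cumsum) > 0

# Spread each player's single non-missing end year over all of that player's rows.
end_year <- ave(df$end, df$player, FUN = function(e) na.omit(e)[1])

# 1 while the action is running, 0 before it starts and after it ends.
df$action <- as.numeric(started & df$year <= end_year)
print(df)
```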

Cumulative sum for 2 criteria in R

I have a data frame in which I want to calculate cumulative sums based on 2 criteria
dfdata = data.frame(car = c("toyota","toyota","toyota","toyota","toyota",
"honda","honda","honda","honda",
"lada","lada","lada","lada"),
year = c(2000,2000,2001,2001,2002,2001,2001,2002,2002,2003,2004,2005,2006),
id = c("a","b","a","c","a","d","d","d","e","f","f","f","f"))
Here is the data:
dfdata
car year id
1 toyota 2000 a
2 toyota 2000 b
3 toyota 2001 a
4 toyota 2001 c
5 toyota 2002 a
6 honda 2001 d
7 honda 2001 d
8 honda 2002 d
9 honda 2002 e
10 lada 2003 f
11 lada 2004 f
12 lada 2005 f
13 lada 2006 f
Imagine I was observing cars passing by and that the plate on it is an "ID". So a car with the same id is the exact same car.
1. I want the number of cars per company that I've seen in one year.
2. I want the cumulative sum of cars per company that I've seen across the years.
3. I want the cumulative sum of the cars per company that I've seen more than once (one column counting the ones I've seen again in the same year and in other years, AND another column counting the ones that I've seen again ONLY in other years).
Here is how I got point 1. and point 2.
dfdata %>%
group_by(car, year) %>%
dplyr::summarise(nb = n()) %>%
dplyr::mutate(cs = cumsum(nb)) %>%
ungroup()
nb is the number of cars from a certain manufacturer I've seen in a particular year. cs is the cumulative sum of the cars across the years.
# A tibble: 9 x 4
car year nb cs
<fct> <dbl> <int> <int>
1 honda 2001 2 2
2 honda 2002 2 4
3 lada 2003 1 1
4 lada 2004 1 2
5 lada 2005 1 3
6 lada 2006 1 4
7 toyota 2000 2 2
8 toyota 2001 2 4
9 toyota 2002 1 5
But notice that I've lost the ID column. How can I get the number of cars that I've seen multiple times for the same ID?
Final output should be based on grouping ID (to answer point 3):
car year nb cs curetrap curetrap.no.same.year
1 honda 2001 2 2 1 0
2 honda 2002 2 4 2 1
3 lada 2003 1 1 0 0
4 lada 2004 1 2 1 1
5 lada 2005 1 3 2 2
6 lada 2006 1 4 3 3
7 toyota 2000 2 2 0 0
8 toyota 2001 2 4 1 1
9 toyota 2002 1 5 2 2
This is because "honda" have been seen 2 times in 2001 and 2 times in 2002. So the cumulative sum is 2 in 2001 and 2 + 2 in 2002. Then, within the same year I've seen the honda "d" twice, meaning that I "recaptured" the "d" 2001 honda hence the "1" in curetrap for 2001. In 2002, I recaptured the honda "d" again, thus the cumulative sum increased. For "curetrap.no.same.year" it's the same thing, but I want to ignore the recapture of the honda "d" in 2001 since it's the same year.
How is it possible to do that? Since I'm losing the ID information, do I need to do it in 2 steps?
So far this is what I have:
tab.df = cbind(table(dfdata$id,dfdata$year),
car = as.character(dfdata[match(unique(dfdata$id),table = dfdata$id),"car"]))
df.df = as.data.frame(tab.df)
2000 2001 2002 2003 2004 2005 2006 car
a 1 1 1 0 0 0 0 toyota
b 1 0 0 0 0 0 0 toyota
c 0 1 0 0 0 0 0 toyota
d 0 2 1 0 0 0 0 honda
e 0 0 1 0 0 0 0 honda
f 0 0 0 1 1 1 1 lada
Which shows all the times I've seen a car in a year for a certain ID.
You can factor the problem into 2 steps by first adding binary variables in your original dataset which will flag the records you want to count, and then by simply computing sum and cumsum of these flags.
The following code gives the result you want
dfdata %>%
group_by(car, id) %>%
arrange(year, .by_group=TRUE) %>%
dplyr::mutate(already_seen = row_number()>1, already_seen_diff_year = year>year[1]) %>%
group_by(car, year) %>%
dplyr::summarise(nb = n(), cs = nb, curetrap = sum(already_seen), curetrap.no.same.year = sum(already_seen_diff_year)) %>%
dplyr::mutate_at(vars(cs, curetrap, curetrap.no.same.year), cumsum) %>%
ungroup()
NB: duplicating the variable (cs = nb) is just a trick that makes the subsequent call to mutate_at easier to write.
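The same two-step idea (flag the rows first, then sum and cumulate the flags) can also be sketched in base R. This is not the answer's code, just an illustration under the same assumptions; key, first_year, d and out are made-up names.

```r
dfdata <- data.frame(
  car  = c(rep("toyota", 5), rep("honda", 4), rep("lada", 4)),
  year = c(2000, 2000, 2001, 2001, 2002, 2001, 2001, 2002, 2002,
           2003, 2004, 2005, 2006),
  id   = c("a", "b", "a", "c", "a", "d", "d", "d", "e", "f", "f", "f", "f")
)

# Step 1: flag the rows. Sort so that the first sighting of each car/id comes first.
d <- dfdata[order(dfdata$car, dfdata$id, dfdata$year), ]
key <- paste(d$car, d$id)
d$already_seen <- as.integer(duplicated(key))              # any sighting after the first
first_year <- ave(d$year, key, FUN = min)
d$already_seen_diff_year <- as.integer(d$year > first_year) # re-sighting in a later year
d$nb <- 1L

# Step 2: sum the flags per car and year, then cumulate within car.
out <- aggregate(cbind(nb, already_seen, already_seen_diff_year) ~ car + year,
                 data = d, FUN = sum)
out <- out[order(out$car, out$year), ]
out$cs                    <- ave(out$nb, out$car, FUN = cumsum)
out$curetrap              <- ave(out$already_seen, out$car, FUN = cumsum)
out$curetrap.no.same.year <- ave(out$already_seen_diff_year, out$car, FUN = cumsum)
print(out[, c("car", "year", "nb", "cs", "curetrap", "curetrap.no.same.year")])
```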

Filter a df with NA to get only individuals that appear more than once in R

I am using a national survey to run a regression: the survey is conducted every two years, and some individuals are interviewed repeatedly while others are interviewed just once.
Now I want to turn the df into a panel (keep only the individuals that appear more than once). The df looks like this:
year nquest nord nordp sex age
2000 10 1 1 F 40
2000 10 2 2 M 43
2000 30 1 1 M 30
2002 10 1 1 F 42
2002 10 2 2 M 45
2002 10 3 NA F 15
2002 30 1 1 M 32
2004 10 1 1 F 44
2004 10 2 2 M 47
2004 10 3 3 F 17
2004 50 1 NA M 66
where nquest is the code number of the family, nord is the code number of the individual, and nordp is the code number that the individual had in the previous survey. When a new individual is interviewed, the value of nordp is missing (R automatically inserts NA). For example, individual 3 of family 10 has nordp=NA in 2002 because that is the first time she is interviewed, while in 2004 her nordp is 3 (the number she had in 2002).
I can't use nord to filter the df because the composition of the family may change (for example in 2002 in family x the mother has nordp=2 (it means that in 2000 nord was 2) and nord=2 but the next year nord could be 1 (for example if she gets divorced) but nordp is still 2).
I tried to filter using this command:
df <- df %>%
group_by(nquest, nordp) %>%
filter(n()>1)
but I don't get the right df: if several individuals of the same family are new (nordp=NA), they are treated as the same person, since nordp is NA the first time an individual appears.
How can I also keep the individuals who appear for the first time in a given year (nordp=NA)? I tried to write a command using age (the age in t should equal the age in t-2 plus 2; for example, if age is 20 in 2000, it is 22 in 2002), but it didn't work.
Note that the df contains thousands of observations, so I can't check it manually.
The final df should be:
year nquest nordp sex age
2000 10 1 F 40
2000 10 2 M 43
2000 30 1 M 30
2002 10 1 F 42
2002 10 2 M 45
2002 10 3 F 15
2002 30 1 M 32
2004 10 1 F 44
2004 10 2 M 47
2004 10 3 F 17
As you can see, only the individuals that appear more than once remain, and nquest=10 nordp=30 appears three times; with my command it appears just two times because in the first year nordp was NA.
We wish to assign unique IDs to individuals, then filter by the count of unique IDs. The main idea is to chain together the nordp and nord values within each family over years. Here's an idea inspired by Identify groups of linked episodes which chain together. First, load the igraph package, via library(igraph). Then the following function assigns IDs for a given family.
assignID <- function(d) {
fields <- names(d) # store original column names
d$nordp[is.na(d$nordp)] <- seq_len(sum(is.na(d$nordp))) + 100
d$nordp_x <- (d$year-2) * 1000 + d$nordp
d$nord_x <- d$year * 1000 + d$nord
dd <- d[, c("nordp_x", "nord_x")]
gr.test <- graph_from_data_frame(dd) # formerly graph.data.frame()
links <- data.frame(org_id = unique(unlist(dd)),
id = components(gr.test)$membership) # formerly clusters()
d <- merge(d, links, by.x = "nord_x", by.y = "org_id", all.x = TRUE)
d$uid <- d$nquest * 100 + d$id
d[, c(fields, "uid")]
}
The function can "tell", for example, that
year nordp nord
2000 1 1
2002 1 2
2004 2 3
is the same individual, by chaining together the nordp and nord over the years, and assigns the same unique ID to all 3 rows. So, for example,
assignID(subset(df, nquest == 10))
# year nquest nord nordp sex age dob uid
# 1 2000 10 1 1 F 40 1960 1001
# 2 2000 10 2 2 M 43 1957 1002
# 3 2002 10 1 1 F 42 1960 1001
# 4 2002 10 2 2 M 45 1957 1002
# 5 2002 10 3 101 F 15 1987 1003
# 6 2004 10 1 1 F 44 1960 1001
# 7 2004 10 2 2 M 47 1957 1002
# 8 2004 10 3 3 F 17 1987 1003
gives us an additional column with the uid for each individual.
The remaining steps are straightforward. We split the dataframe by nquest, apply assignID to each subset, and rbind the output:
dd <- do.call(rbind, by(df, df$nquest, assignID))
Then we can just group by uid and filter by count:
dd %>% group_by(uid) %>% filter(n()>1)
# Source: local data frame [10 x 8]
# Groups: uid [4]
# year nquest nord nordp sex age dob uid
# <int> <int> <int> <dbl> <fctr> <int> <int> <dbl>
# 1 2000 10 1 1 F 40 1960 1001
# 2 2000 10 2 2 M 43 1957 1002
# 3 2002 10 1 1 F 42 1960 1001
# 4 2002 10 2 2 M 45 1957 1002
# 5 2002 10 3 101 F 15 1987 1003
# 6 2004 10 1 1 F 44 1960 1001
# 7 2004 10 2 2 M 47 1957 1002
# 8 2004 10 3 3 F 17 1987 1003
# 9 2000 30 1 1 M 30 1970 3001
# 10 2002 30 1 1 M 32 1970 3001
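If igraph is not available, the chaining can also be sketched with a plain loop. This is a swapped-in technique, not the answer's method, and it rests on assumptions: waves are exactly two years apart, nordp refers to nord in the previous wave of the same family, and anyone with no matching row in the previous wave (including everyone in the first wave) is treated as new. The names uid, next_uid, prev and panel are made up for the sketch, and only the columns needed for linking are rebuilt.

```r
df <- data.frame(
  year   = c(2000, 2000, 2000, 2002, 2002, 2002, 2002, 2004, 2004, 2004, 2004),
  nquest = c(10, 10, 30, 10, 10, 10, 30, 10, 10, 10, 50),
  nord   = c(1, 2, 1, 1, 2, 3, 1, 1, 2, 3, 1),
  nordp  = c(1, 2, 1, 1, 2, NA, 1, 1, 2, 3, NA)
)

# Process waves in chronological order so the previous wave already has its uid.
df <- df[order(df$year, df$nquest, df$nord), ]
df$uid <- NA_integer_
next_uid <- 1L

for (i in seq_len(nrow(df))) {
  # Row of the same person in the previous wave: same family, year - 2,
  # and nord equal to this row's nordp. which() drops the NA comparisons.
  prev <- which(df$nquest == df$nquest[i] &
                df$year == df$year[i] - 2 &
                df$nord == df$nordp[i])
  if (length(prev) == 1) {
    df$uid[i] <- df$uid[prev]   # same person: inherit the id
  } else {
    df$uid[i] <- next_uid       # new person: start a new id
    next_uid <- next_uid + 1L
  }
}

# Keep only individuals observed in more than one wave.
panel <- df[ave(df$uid, df$uid, FUN = length) > 1, ]
print(panel)
```

The loop rescans the data frame for every row, so it is quadratic in the number of rows. That should be acceptable for a few thousand observations, but the graph-based version is likely the better choice for large surveys.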

Create Indicator Data Frame Based on Interval Ranges

I am trying to create a "long" data frame of indicator ("dummy") variables out of a very peculiar type of "wide" data frame in R that has interval ranges of years defining my data.
What I have looks like this:
f=data.frame(name=c("A","B","C"),
year.start=c(1990,1994,1993),year.end=c(1994,1995,1993))
name year.start year.end
1 A 1990 1994
2 B 1994 1995
3 C 1993 1993
Update: I have changed the value of year.start for A to 1990 from the initial example of 1993 to address some of the answers below which rely on unique values instead of intervals.
What I would like is a long data frame that would look like this, with an entry for each of the possible years in the original data frame, eg, 1990 through 1995 where 1 = present and 0 = absent.
name year indicator
A 1990 1
A 1991 1
A 1992 1
A 1993 1
A 1994 1
A 1995 0
B 1990 0
B 1991 0
B 1992 0
B 1993 0
B 1994 1
B 1995 1
C 1990 0
C 1991 0
C 1992 0
C 1993 1
C 1994 0
C 1995 0
Try as I might, I don't see how I can do this with Hadley Wickham's reshape2 package.
Thanks!
Someone else might have a suggestion for reshape2, but here is a base R solution:
years <- factor(unlist(f[-1]), levels=seq(min(f[-1]), max(f[-1]), by=1))
result <- data.frame(table(years, rep(f[[1]], length.out=length(years))))
# years Var2 Freq
# 1 1990 A 1
# 2 1991 A 0
# 3 1992 A 0
# 4 1993 A 0
# 5 1994 A 1
# 6 1995 A 0
# 7 1990 B 0
# 8 1991 B 0
# 9 1992 B 0
# 10 1993 B 0
# 11 1994 B 1
# 12 1995 B 1
# 13 1990 C 0
# 14 1991 C 0
# 15 1992 C 0
# 16 1993 C 2
# 17 1994 C 0
# 18 1995 C 0
Here is a step-by-step breakdown, using data.table:
library(data.table)
f <- as.data.table(f)
## ALL OF NAME-YEAR COMBINATIONS
ALL <- f[, CJ(name=name, year=seq(min(year.start), max(year.end)))]
## WHICH COMBINATIONS EXIST
PRESENT <- f[, list(year = seq(year.start, year.end)), by=name]
## SETKEYS FOR MERGING
setkey(ALL, name, year)
setkey(PRESENT, name, year)
## INITIALIZE INDICATOR TO ZERO, THEN SET TO 1 FOR THOSE PRESENT
ALL[, indicator := 0]
ALL[PRESENT, indicator := 1]
ALL
name year indicator
1: A 1993 1
2: A 1994 1
3: A 1995 0
4: B 1993 0
5: B 1994 1
6: B 1995 1
7: C 1993 1
8: C 1994 0
9: C 1995 0
Here's another solution, similar to the ones above, which aims to be straightforward:
zz <- cbind(name=f[1],year=rep(min(f[-1]):max(f[-1]),each=nrow(f)))
zz$indicator <- as.numeric((f$name==zz$name &
f$year.start<=zz$year &
f$year.end >=zz$year))
result <- zz[order(zz$name,zz$year),]
The first line builds a template with all the names and all the years. The second line sets indicator based on whether it is present in the range. The third line just reorders the result.
Another base R solution
f=data.frame(name=c("A","B","C"),
year.start=c(1993,1994,1993),year.end=c(1994,1995,1993), stringsAsFactors=FALSE)
# all name/year combinations over the full range of years in f
x <- expand.grid(unique(f$name),min(f$year.start):max(f$year.end))
names(x) <- c("name", "year")
x$indicator <- sapply(1:nrow(x), function(i) sum(x$name[i]==f$name & x$year[i] >= f$year.start & x$year[i] <= f$year.end))
x[order(x$name),]
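Several outputs above were produced with the original data (before year.start for A was changed to 1990). As a cross-check on the updated data, here is a minimal base R sketch that builds every name/year pair and tests whether the year falls inside the interval; grid and result are names chosen for the sketch.

```r
f <- data.frame(name = c("A", "B", "C"),
                year.start = c(1990, 1994, 1993),
                year.end   = c(1994, 1995, 1993))

# Every name combined with every year in the overall range.
grid <- expand.grid(name = f$name,
                    year = min(f$year.start):max(f$year.end),
                    stringsAsFactors = FALSE)

# Attach each name's interval, then test whether the year lies inside it.
grid <- merge(grid, f, by = "name")
grid$indicator <- as.integer(grid$year >= grid$year.start &
                             grid$year <= grid$year.end)

result <- grid[order(grid$name, grid$year), c("name", "year", "indicator")]
print(result)
```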

How to create timeseries by grouping entries in R?

I want to create a time series from 01/01/2004 until 31/12/2010 of daily mortality data in R. The raw data that I have now (.csv file), has as columns day - month - year and every row is a death case. So if the mortality on a certain day is for example equal to four, there are four rows with that date. If there is no death case reported on a specific day, that day is omitted in the dataset.
What I need is a time-series with 2557 rows (from 01/01/2004 until 31/12/2010) wherein the total number of death cases per day is listed. If there is no death case on a certain day, I still need that day to be in the list with a "0" assigned to it.
Does anyone know how to do this?
Thanks,
Gosia
Example of the raw data:
day month year
1 1 2004
3 1 2004
3 1 2004
3 1 2004
6 1 2004
7 1 2004
What I need:
day month year deaths
1 1 2004 1
2 1 2004 0
3 1 2004 3
4 1 2004 0
5 1 2004 0
6 1 2004 1
df <- read.table(text="day month year
1 1 2004
3 1 2004
3 1 2004
3 1 2004
6 1 2004
7 1 2004",header=TRUE)
#transform to dates
dates <- as.Date(with(df,paste(year,month,day,sep="-")))
#contingency table
tab <- as.data.frame(table(dates))
names(tab)[2] <- "deaths"
tab$dates <- as.Date(tab$dates)
#sequence of dates
res <- data.frame(dates=seq(from=min(dates),to=max(dates),by="1 day"))
#merge
res <- merge(res,tab,by="dates",all.x=TRUE)
res[is.na(res$deaths),"deaths"] <- 0
res
# dates deaths
#1 2004-01-01 1
#2 2004-01-02 0
#3 2004-01-03 3
#4 2004-01-04 0
#5 2004-01-05 0
#6 2004-01-06 1
#7 2004-01-07 1
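The code above spans only the dates present in the data. To cover the full period asked for in the question (01/01/2004 to 31/12/2010), one option is to fix the sequence of days up front and count against it. A minimal sketch, where all_days and counts are names chosen for illustration:

```r
df <- read.table(text = "day month year
1 1 2004
3 1 2004
3 1 2004
3 1 2004
6 1 2004
7 1 2004", header = TRUE)

# One Date per death case.
dates <- as.Date(sprintf("%04d-%02d-%02d", df$year, df$month, df$day))

# Every day of the study period, whether or not a death was recorded.
all_days <- seq(as.Date("2004-01-01"), as.Date("2010-12-31"), by = "day")

# Counting against a factor with all days as levels keeps the zero-count days.
counts <- table(factor(as.character(dates), levels = as.character(all_days)))
res <- data.frame(dates = all_days, deaths = as.vector(counts))

head(res, 7)
nrow(res)
```

Dates outside the chosen period would become NA in the factor and be silently left out of the counts, so it is worth checking that sum(res$deaths) equals the number of rows in the raw data.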
