Match household partners without a loop in R

I am trying to avoid a loop while recoding the variable that identifies household partners.
hldid denotes the household and persid the person within the household.
The variable partner indicates the persid of the partner, and child indicates whether the row is a child.
What is missing from the variable partner is the persid for both partners.
For instance, in hldid == 1, the person with persid == 1 has a value of 0 for partner, while it should be 2.
This is what the data looks like:
> test
hldid persid age sex relresp partner child
1 1 1 26 2 0 0 0
2 1 2 26 1 1 1 0
3 2 1 59 2 0 0 0
4 2 2 64 1 1 1 0
5 3 1 76 2 0 0 0
6 4 1 65 2 0 0 0
7 4 2 64 1 1 1 0
8 5 1 52 2 0 0 0
9 5 2 51 1 1 1 0
10 5 3 20 2 21 0 1
11 5 4 14 2 21 0 1
12 7 1 69 1 0 0 0
13 7 2 70 2 1 1 0
I managed to write a rather ugly loop, but it is too slow for the entire dataset.
test$partnerREC = test$partner
for (i in 1:13) {
  for (j in 1:13) {
    if (
      test$hldid[i] == test$hldid[i + 1] &  # verify that the household is the same
      (test$persid[i] == test$partner[j])
    ) {
      test$partnerREC[i] = test$persid[j]   # put the persid for each partner
    }
  }
}
> test
hldid persid age sex relresp partner child partnerREC
1 1 1 26 2 0 0 0 2
2 1 2 26 1 1 1 0 1
3 2 1 59 2 0 0 0 2
4 2 2 64 1 1 1 0 1
5 3 1 76 2 0 0 0 0
6 4 1 65 2 0 0 0 2
7 4 2 64 1 1 1 0 1
8 5 1 52 2 0 0 0 2
9 5 2 51 1 1 1 0 1
10 5 3 20 2 21 0 1 0
11 5 4 14 2 21 0 1 0
12 7 1 69 1 0 0 0 2
13 7 2 70 2 1 1 0 1
Any idea how I could use data.table for solving this?
test = structure(list(hldid = c(1, 1, 2, 2, 3, 4, 4, 5, 5, 5, 5, 7,
7), persid = c(1, 2, 1, 2, 1, 1, 2, 1, 2, 3, 4, 1, 2), age = c(26,
26, 59, 64, 76, 65, 64, 52, 51, 20, 14, 69, 70), sex = c(2, 1,
2, 1, 2, 2, 1, 2, 1, 2, 2, 1, 2), relresp = c(0, 1, 0, 1, 0,
0, 1, 0, 1, 21, 21, 0, 1), partner = c(0, 1, 0, 1, 0, 0, 1, 0,
1, 0, 0, 0, 1), child = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
0)), class = "data.frame", row.names = c(NA, -13L))

A possible solution:
library(data.table) # load the package
setDT(test) # convert 'test' to a 'data.table'
test[, partnerREC := persid[c(pmin(2, .N):1, rep(0, pmax(.N, 2) - 2))] *
       (persid %in% 1:2) *
       (.N != 1)
     , by = hldid][]
which gives:
> test
hldid persid age sex relresp partner child partnerREC
1: 1 1 26 2 0 0 0 2
2: 1 2 26 1 1 1 0 1
3: 2 1 59 2 0 0 0 2
4: 2 2 64 1 1 1 0 1
5: 3 1 76 2 0 0 0 0
6: 4 1 65 2 0 0 0 2
7: 4 2 64 1 1 1 0 1
8: 5 1 52 2 0 0 0 2
9: 5 2 51 1 1 1 0 1
10: 5 3 20 2 21 0 1 0
11: 5 4 14 2 21 0 1 0
12: 7 1 69 1 0 0 0 2
13: 7 2 70 2 1 1 0 1
This solution is based on the assumption (derived from the example data) that only persid 1 and 2 are partners; anything higher is a child.
What this does:
Group by hldid.
Reverse the order of the first two persons in a household with persid[c(pmin(2,.N):1, rep(0, (pmax(.N,2)-2)))], where pmin ensures that an index vector of length 1 is constructed when a household has only one person (a small worked illustration follows below).
Multiply that by (persid %in% 1:2) to get zeroes for the children.
Multiply that by (.N != 1) to get a value of zero for households with only one person.
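Not part of the original answer: a small worked illustration of that index trick for a four-person household such as hldid == 5, showing how the zero indices and the recycling behave.
persid <- c(1, 2, 3, 4)
N <- length(persid)
idx <- c(pmin(2, N):1, rep(0, pmax(N, 2) - 2))
idx                                        # 2 1 0 0: swap the first two persons, zeros for the rest
persid[idx]                                # 2 1    : zero indices are dropped when subsetting
persid[idx] * (persid %in% 1:2) * (N != 1) # 2 1 0 0: recycling against the (persid %in% 1:2) mask restores the zeros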

# Base R alternative: loop over the households and recode within each one
df <- data.frame(matrix(data = NA, ncol = 7))
names(df) <- names(test)
for (id in unique(test$hldid)) {
  t <- test[test$hldid == id, ]
  # give the person that is pointed at the persid of the row doing the pointing
  t$partner[t$persid == t$partner[t$partner != 0]] <- t$persid[which(t$partner != 0)]
  df <- rbind(df, t)
}
df <- df[-1, ]  # drop the initial all-NA row

A base R solution is more complicated than Jaap's data.table solution.
I work with a copy.
test2 <- test
After running the code in the question, run the following.
test2$partnerREC <- test2$partner
sp <- split(test2, test2$hldid)
test2 <- lapply(sp, function(DF){
  i <- with(DF, which(persid %in% partner))
  j <- with(DF, which(partner %in% persid))
  #cat("i:", i, "\tj:", j, "\n")
  DF$partnerREC[i] <- DF$persid[j]
  DF
})
test2 <- do.call(rbind, test2)
row.names(test2) <- NULL
Now compare both results.
identical(test, test2)
#[1] TRUE

You can get there through some dplyr steps: join the data on itself and update the value of partner where persid matches partner.
library(dplyr)
test2 <- left_join(test,
                   test %>% select(hldid, persid, partner) %>% filter(partner != 0),
                   by = "hldid") %>%
  filter(persid.x == partner.y) %>%
  mutate(partner.x = persid.y)
This gives you the head of household with their partner id matched up, but you'd still have to join this back to the original data (I'm not sure what the equivalent of a SQL UPDATE is in dplyr lingo).
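Not part of the original answer: one hedged way to finish that update step, assuming dplyr >= 1.0.0 (which added rows_update()); fix and test_updated are hypothetical names.
fix <- test2 %>%
  transmute(hldid, persid = persid.x, partner = partner.x) # keys plus the corrected partner value
test_updated <- rows_update(test, fix, by = c("hldid", "persid"))
This patches the matching rows of the original data with the corrected partner values, which is roughly what a SQL UPDATE does.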

1. Create your test data.frame
library(tidyverse)
test <- tribble(
~hldid, ~persid, ~age, ~sex, ~relresp, ~partner, ~child,
1, 1, 26, 2, 0, 0, 0,
1, 2, 26, 1, 1, 1, 0,
2, 1, 59, 2, 0, 0, 0,
2, 2, 64, 1, 1, 1, 0,
3, 1, 76, 2, 0, 0, 0,
4, 1, 65, 2, 0, 0, 0,
4, 2, 64, 1, 1, 1, 0,
5, 1, 52, 2, 0, 0, 0,
5, 2, 51, 1, 1, 1, 0,
5, 3, 20, 2, 21, 0, 1,
5, 4, 14, 2, 21, 0, 1,
7, 1, 69, 1, 0, 0, 0,
7, 2, 70, 2, 1, 1, 0)
2. arrange(), group_by(), and mutate() come to the rescue
test %>%
  # arrange the data in case the raw data is not already sorted
  arrange(hldid, child, persid) %>%
  # group each household
  group_by(hldid) %>%
  # match the first and second household members as each other's partner
  mutate(partnerREC = ifelse(persid == first(persid), nth(persid, 2), first(persid))) %>%
  # correct partnerREC for children and singles
  mutate(partnerREC = ifelse(child == 1 | is.na(partnerREC), 0, partnerREC)) %>%
  # un-group it
  ungroup()

A loop solution using Rcpp
Source the Rcpp script:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector HHgrid(CharacterVector hid, NumericVector persid, NumericVector partner,
                     NumericVector partnerRec) {
  int nrows = hid.size();
  for (int i = 1; i < nrows - 1; i++) {
    for (int j = 0; j < nrows - 1; j++) {
      if ((hid(i) == hid(i + 1)) & (persid(i) == partner(j))) {
        partnerRec(i) = persid(j);
      }
    }
    Rcout << i << std::endl;
  }
  return(partnerRec);
}
and run the function:
test$partnerRec <- test$partner + 0  # column to be filled in; the + 0 forces a real copy
HHgrid(hid = test$hldid, persid = test$persid, partner = test$partner,
       partnerRec = test$partnerRec)
There is just a slight issue with the first case (if you know how to fix it)
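Not part of the original answer, but a possible fix, sketched under the assumption that the goal is to reproduce the partnerREC column from the question: loop over every row, compare the households of rows i and j directly, and clone the partner column inside the function so nothing is modified by reference (HHgrid2 is a hypothetical name).
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector HHgrid2(CharacterVector hid, NumericVector persid, NumericVector partner) {
  int nrows = hid.size();
  NumericVector partnerRec = clone(partner); // start from the existing partner column
  for (int i = 0; i < nrows; i++) {
    for (int j = 0; j < nrows; j++) {
      // row j lists person i as its partner, within the same household
      if (hid(i) == hid(j) && partner(j) != 0 && persid(i) == partner(j)) {
        partnerRec(i) = persid(j);
      }
    }
  }
  return partnerRec;
}
Called from R as, for example, test$partnerREC2 <- HHgrid2(as.character(test$hldid), test$persid, test$partner), this should reproduce the partnerREC column shown in the question for the example data.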

Related

How do I create a column which has the value of the last time a condition was met in a separate row in R

I have data that looks like this:
d <- data.frame(Item = c(1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0))
I would like to create a column whose value is based on the last time a 0 was present in the column d$Item. I don't really know how to get started with something like this in R.
Expected outcome is this:
d$recent <- c(NA, 0, 0, 1, 2, 3, 4, 5, 6, 0, 0, 1, 0, 1, 2, 3, 0)
where each row gives the distance to the most recent observation of 0 (0 = on the same row, 1 = previous row, etc.).
Edit: Changed row to column, was posting before coffee. Also added the expected result.
You can try rle + sequence
transform(
  d,
  recent = with(rle(Item), sequence(lengths)) * (Item != 0)
)
which gives
Item recent
1 1 1
2 0 0
3 0 0
4 1 1
5 1 2
6 1 3
7 1 4
8 1 5
9 1 6
10 0 0
11 0 0
12 1 1
13 0 0
14 1 1
15 1 2
16 1 3
17 0 0
You can do this with sequence. It calculates the distance to the latest 1.
dif <- diff(c(which(d$Item == 1), length(d$Item) + 1))
sequence(dif, 0)
#[1] 0 1 2 0 0 0 0 0 0 1 2 0 1 0 0 0 1
Edit:
dif <- diff(c(1, which(d$Item != 1), length(d$Item) + 1))
sequence(dif, 0)
#[1] 0 0 0 1 2 3 4 5 6 0 0 1 0 1 2 3 0

about the equivalent R code for forvalues in Stata [duplicate]

When dealing with longitudinal recurrent events, I would like to detect the events that meet some condition.
data <- data.frame(id=c(rep(1, 4), rep(2, 3), rep(3, 3), rep(4,4)),
event=c(1, 1, 0, 0, 1, 1, 0, 1, 1, 0,1,1,1,1),
eventcount=c(1, 2, 0, 0, 1, 2, 0, 1, 2, 3,1,2,3,4),
firstevent=c(1, 0, 0, 0, 1, 0, 0, 1, 0, 0,1,0,0,0),
time=c(100, 250, 150, 300, 240, 400, 150, 200, 320, 360,100,210,220,235) )
We want to detect events that occurred from 100 to 150 days after the first event.
In Stata, we can use
gen event2 = 1 if id==id[_n-1] & time-time[_n-1]>100 & time-time[_n-1]<=150 & firstevent[_n-1]==1 & firstevent==0 & event==1
forvalues i = 2/4 {
  replace event2 = 1 if id==id[_n-`i'] & time-time[_n-`i']>100 & time-time[_n-`i']<=150 & firstevent[_n-`i']==1 & firstevent==0 & event==1
}
I would like to obtain the dataset below.
data_after <- data.frame(id=c(rep(1, 4), rep(2, 3), rep(3, 3), rep(4,4)),
event=c(1, 1, 0, 0, 1, 1, 0, 1, 1, 0,1,1,1,1),
eventcount=c(1, 2, 0, 0, 1, 2, 0, 1, 2, 3,1,2,3,4),
firstevent=c(1, 0, 0, 0, 1, 0, 0, 1, 0, 0,1,0,0,0),
time=c(100, 250, 150, 300, 240, 400, 150, 200, 320, 360,100,210,220,235),
afterevent=c(NA,1,NA,NA,NA,NA,NA,NA,1,NA,NA,1,1,1))
How can we write equivalent R code for this Stata code?
In the original data, we have more ids and more recurrent events.
This would correspond to your output:
library(dplyr)
data %>%
  group_by(id) %>%
  mutate(afterevent = ifelse(time <= (time[firstevent == 1] + 150) &
                             (time >= time[firstevent == 1] + 100), 1, NA))
Output:
# A tibble: 14 x 6
# Groups: id [4]
id event eventcount firstevent time afterevent
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 100 NA
2 1 1 2 0 250 1
3 1 0 0 0 150 NA
4 1 0 0 0 300 NA
5 2 1 1 1 240 NA
6 2 1 2 0 400 NA
7 2 0 0 0 150 NA
8 3 1 1 1 200 NA
9 3 1 2 0 320 1
10 3 0 3 0 360 NA
11 4 1 1 1 100 NA
12 4 1 2 0 210 1
13 4 1 3 0 220 1
14 4 1 4 0 235 1
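Not from the original answer: since the real data has more ids and more recurrent events, a data.table sketch of the same grouped logic may be useful; it mirrors the boundaries used in the dplyr answer above and assumes exactly one firstevent == 1 row per id.
library(data.table)
setDT(data)                      # convert 'data' to a data.table in place
data[, afterevent := {
  ft <- time[firstevent == 1][1] # time of the first event within this id
  fifelse(time - ft >= 100 & time - ft <= 150, 1, NA_real_)
}, by = id]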

How to count the number of rows from the last non-zero value (or to the next, if the latter is greater than k)

I've got data like below:
ex <- structure(list(id = 1:20, V1 = c(1, 0, 0, 0, 9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0)), class = "data.frame", row.names = c(NA, -20L))
Now I need to create a new column that counts the number of rows from the latest non-zero value. One additional condition is that if that number is greater than k, we instead count the number of rows before the next non-zero value (with a minus sign). So, assuming k = 10, the output would look like below:
structure(list(id = 1:20, V1 = c(1, 0, 0, 0, 9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0), counter = c(0, 1, 2, 3, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, -2, -1, 0, 1, 2)), class = "data.frame", row.names = c(NA, -20L))
What's an efficient way to do this, using tidyverse packages if possible?
A base R approach with ave: we first calculate the number of rows from the last non-zero value, and then, for values greater than k, we reverse the index position and add a negative sign to get the sequence in -2, -1 order.
k <- 10
ave(ex$V1, cumsum(ex$V1 != 0), FUN = function(x) {
  inds <- seq_along(x) - 1
  ifelse(inds > k, -rev(inds) - 1, inds)
})
#[1] 0 1 2 3 0 1 2 3 4 5 6 7 8 9 10 -2 -1 0 1 2
Another option:
ex <- structure(list(id = 1:20, V1 = c(1, 0, 0, 0, 9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0)), class = "data.frame", row.names = c(NA, -20L))
library(tidyverse)
k_LIMIT <- 10
ex %>%
mutate(my_group_var = cumsum(V1)) %>%
group_by(my_group_var) %>%
mutate(want = ifelse((row_number() - 1) > k_LIMIT,
-(max((row_number() - 1)) - (row_number() - 1) + 1),
(row_number() - 1))) %>%
ungroup() %>%
select(-my_group_var)
# A tibble: 20 x 3
# id V1 want
# <int> <dbl> <dbl>
# 1 1 1 0
# 2 2 0 1
# 3 3 0 2
# 4 4 0 3
# 5 5 9 0
# 6 6 0 1
# 7 7 0 2
# 8 8 0 3
# 9 9 0 4
#10 10 0 5
#11 11 0 6
#12 12 0 7
#13 13 0 8
#14 14 0 9
#15 15 0 10
#16 16 0 -2
#17 17 0 -1
#18 18 1 0
#19 19 0 1
#20 20 0 2
Here is one option with data.table
library(data.table)
k <- 10
setDT(ex)[, grp := cumsum(V1 != 0)
][, counter := seq_len(.N) - 1, grp
][counter > k, counter := -as.numeric(rev(seq_len(.N))), grp
][, grp := NULL][]
#. id V1 counter
# 1: 1 1 0
# 2: 2 0 1
# 3: 3 0 2
# 4: 4 0 3
# 5: 5 9 0
# 6: 6 0 1
# 7: 7 0 2
# 8: 8 0 3
# 9: 9 0 4
#10: 10 0 5
#11: 11 0 6
#12: 12 0 7
#13: 13 0 8
#14: 14 0 9
#15: 15 0 10
#16: 16 0 -2
#17: 17 0 -1
#18: 18 1 0
#19: 19 0 1
#20: 20 0 2
You can calculate a step as the difference versus the lagged id (an additional condition on the value could be added), and then build the sequences, counting up or down depending on whether the gap exceeds k.
library(dplyr)
Steps <- ex %>%
  rbind(c(nrow(ex) + 1, 1)) %>%             # sentinel non-zero row after the last id
  filter(sign(V1) != 0) %>%
  mutate(step = id - c(1, (lag(id)[-1])))   # gap between consecutive non-zero rows
counter <- c()
k <- 10
for (i in 1:nrow(Steps)) {
  s <- Steps$step[i]
  if (s > 0) {
    counter <- c(counter, 0:min(s - 1, k))  # count up from the non-zero value
    if (s > k + 1) {
      counter <- c(counter, (k + 1 - s):-1) # count down towards the next non-zero value
    }
  }
}
counter <- counter[ex$id]

Conditions including comparison of consecutive elements within columns

Could you please help me with the following task. Here is a template of my dataset:
Category <- c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3)
PrevRule <- c(-1, -1, -1, -1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2)
UserRule <- c(2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 0, 0, 0, 2, 2, 2, 2, 2, 2, 0, 0, 0, 1, 1, 1, 1)
Correct <- c(0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1)
df <- data.frame(Category, PrevRule, UserRule, Correct)
I need to create an additional PP column according to the following rules:
i is the row index; n is the number of rows in each Category.
For Category == 1:
If Correct[i:i+2] == 0 then PP[i+2:n] = UserRule[i+2]
Else PP = NA
For Category > 1:
Do PP = PrevRule until Correct[i:i+2] == 0
Then PP[i+2:n] = UserRule[i+2]
So, in the end, the PP column should look like in the following table (the PP column starts with NA):
Category PrevRule UserRule Correct PP
1 1 -1 2 0 NA
2 1 -1 2 0 NA
3 1 -1 2 0 2
4 1 -1 1 1 2
5 1 -1 1 1 2
6 1 -1 1 1 2
7 1 -1 1 1 2
8 1 -1 1 1 2
9 2 1 1 0 1
10 2 1 1 0 1
11 2 1 2 1 1
12 2 1 0 0 1
13 2 1 0 0 1
14 2 1 0 0 0
15 2 1 2 1 0
16 2 1 2 1 0
17 2 1 2 1 0
18 2 1 2 1 0
19 2 1 2 1 0
20 3 2 2 0 2
21 3 2 0 0 2
22 3 2 0 0 2
23 3 2 0 0 0
24 3 2 1 1 0
25 3 2 1 1 0
26 3 2 1 1 0
27 3 2 1 1 0
Any advice and suggestions would be greatly appreciated!
Thank you!
It was really difficult to understand what you meant, but here is my solution. It is not perfect, as it uses a while loop, but it should work:
ret <- array(dim = nrow(df))
i <- 1
while (i < nrow(df) - 3) {
  if (Category[i] == 1 & all(Correct[i + 0:2] == 0)) {
    tmp <- max(which(Category == 1))  # end of category 1
    ret[(i + 2):tmp] <- UserRule[i + 2]
    # set index i to the index-value of the last element in category 1
    i <- tmp
  } else {
    ret[i] <- NA  # actually not necessary, because PP is NA per default
  }
  print("From now on, I will only ask clear questions!")
  if (Category[i] >= 2) {
    ret[i] <- PrevRule[i]
    if (all(Correct[(i - 2):i] == 0)) {  # 3 consecutive 0s in Correct
      tmp <- max(which(Category == Category[i]))  # end of the current category
      ret[i:tmp] <- UserRule[i]
      i <- tmp  # set index i to the index-value of the last element in the current category
    }
  }
  i <- i + 1
}
df$PP <- ret

Generate a new variable based upon the values of the previous and next row by group

I am using panel data with multiple subjects (id) and have an event (first_occurrence) that occurs on different days. My goal is to create a new variable (result) that is 1 on the 2 days preceding the first occurrence, the day of the first occurrence, and the 2 days following the first occurrence.
Here is an example that includes both the sample data and the desired output:
data <- structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,
2, 3, 3, 3, 3, 3, 3, 3), day = c(0, 1, 2, 3, 4, 5, 6, 7, 0, 1,
2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 6), first_occurrence = c(0, 0,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1), desired_output = c(1,
1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1)), .Names = c("id",
"day", "first_occurrence", "desired_output"), row.names = c(NA,
-21L), class = "data.frame")
Although this may not be the most efficient solution, I managed to get the code working in Stata (see the Stata code below), but I would like to get it working in R as well and would appreciate any thoughts folks have.
Thanks!
Stata code:
tsset id day
gen run = .
by id: replace run = cond(L.run == ., 1, L.run + 1)
gen test = .
replace test = run if(first_occurrence == 1)
gen test2 = .
by id: replace test2 = test[_n-1]
gen test3 = .
by id: replace test3 = test[_n-2]
gen test4 = .
by id: replace test4 = test[_n+1]
gen test5 = .
by id: replace test5 = test[_n+2]
egen test_sum = rowtotal(test test2 test3 test4 test5)
replace test_sum = 1 if(test_sum >= 1)
rename test_sum result
drop run test test2 test3 test4 test5
Here's another approach using the package dplyr:
require(dplyr) # install and load the package
data %>%
  arrange(id, day) %>% # sort the data by id and day; if it already is sorted, you can remove this row
  group_by(id) %>%
  mutate(n = 1:n(),
         result = ifelse(abs(n - n[first_occurrence == 1]) <= 2, 1, 0)) %>%
  select(-n)
# id day first_occurrence desired_output result
#1 1 0 0 1 1
#2 1 1 0 1 1
#3 1 2 1 1 1
#4 1 3 0 1 1
#5 1 4 0 1 1
#6 1 5 0 0 0
#7 1 6 0 0 0
#8 1 7 0 0 0
#9 2 0 0 0 0
#10 2 1 0 0 0
#11 2 2 0 1 1
#12 2 3 0 1 1
#13 2 4 1 1 1
#14 2 5 0 1 1
#15 3 0 0 0 0
#16 3 1 0 0 0
#17 3 2 0 0 0
#18 3 3 0 0 0
#19 3 4 0 1 1
#20 3 5 0 1 1
#21 3 6 1 1 1
What the code does is: first group by id, then add another column (n) that counts the rows per group, from 1 to the number of rows in the group. It then creates another column, result, with an ifelse that checks the absolute difference between the current row's n and the n of the row where first_occurrence is 1. If that difference is 2 or less, result is 1, otherwise 0. The last line removes the column n.
Edit:
It would probably be more efficient to place the mutate(n = 1:n()) before the group_by:
data %>%
  arrange(id, day) %>% # sort the data by id and day; if it already is sorted, you can remove this row
  mutate(n = 1:n()) %>%
  group_by(id) %>%
  mutate(result = ifelse(abs(n - n[first_occurrence == 1]) <= 2, 1, 0)) %>%
  select(-n)
Here's one way. You can use ave to work by group, then use which.max to find the first occurrence and calculate the distance from that position for all the other values.
close <- (with(data, ave(first_occurrence, id, FUN = function(x)
  abs(seq_along(x) - which.max(x)))
) <= 2) + 0
Here I use +0 to turn the logical values into 0/1 values. Now you can combine that with your existing data
cbind(data, close)
And that gives
id day first_occurrence desired_output close
1 1 0 0 1 1
2 1 1 0 1 1
3 1 2 1 1 1
4 1 3 0 1 1
5 1 4 0 1 1
6 1 5 0 0 0
7 1 6 0 0 0
8 1 7 0 0 0
9 2 0 0 0 0
10 2 1 0 0 0
11 2 2 0 1 1
12 2 3 0 1 1
13 2 4 1 1 1
14 2 5 0 1 1
15 3 0 0 0 0
16 3 1 0 0 0
17 3 2 0 0 0
18 3 3 0 0 0
19 3 4 0 1 1
20 3 5 0 1 1
21 3 6 1 1 1
as desired. Note that this method assumes that the data is sorted by day.
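If it might not be, a quick hedged one-liner (not part of the original answer) to sort it first:
data <- data[order(data$id, data$day), ] # order by id, then day, before computing 'close'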
