I have data that looks like this:
d <- data.frame(Item = c(1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0))
I would like to create a column whose value is based on the last time a 0 appeared in d$Item. I don't really know how to get started with something like this in R.
Expected outcome is this:
d$recent <- c(NA, 0, 0, 1, 2, 3, 4, 5, 6, 0, 0, 1, 0, 1, 2, 3, 0)
Where each row gives the distance to the most recent observation of 0 (0 = on the same row, 1 = previous row, etc.)
edit: Changed row to column, was posting before coffee. Also added expected result.
You can try rle + sequence
transform(
d,
recent = with(rle(Item), sequence(lengths)) * (Item != 0)
)
which gives
Item recent
1 1 1
2 0 0
3 0 0
4 1 1
5 1 2
6 1 3
7 1 4
8 1 5
9 1 6
10 0 0
11 0 0
12 1 1
13 0 0
14 1 1
15 1 2
16 1 3
17 0 0
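To see why this works, it helps to inspect the intermediate pieces (a quick sketch using the data above):

```r
d <- data.frame(Item = c(1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0))

r <- rle(d$Item)          # runs of identical values
r$lengths                 # 1 2 6 2 1 1 3 1
sequence(r$lengths)       # restarts 1, 2, ... within each run
# multiplying by (Item != 0) zeroes out the positions inside runs of 0
sequence(r$lengths) * (d$Item != 0)
#  [1] 1 0 0 1 2 3 4 5 6 0 0 1 0 1 2 3 0
```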
You can do this with sequence. The following calculates the distance to the latest 1:
dif <- diff(c(which(d$Item == 1), length(d$Item) + 1))
sequence(dif, 0)
#[1] 0 1 2 0 0 0 0 0 0 1 2 0 1 0 0 0 1
Edit: to count from the most recent 0 instead, as requested:
dif <- diff(c(1, which(d$Item != 1), length(d$Item) + 1))
sequence(dif, 0)
#[1] 0 0 0 1 2 3 4 5 6 0 0 1 0 1 2 3 0
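Note that the second argument here is sequence()'s from= argument, added in R 4.0.0; it makes each generated run start at 0 instead of the default 1:

```r
sequence(c(3, 2))           # default: each run starts at 1 -> 1 2 3 1 2
sequence(c(3, 2), from = 0) # start each run at 0           -> 0 1 2 0 1
```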
When dealing with longitudinal recurrent events, I would like to detect the events that meet some condition.
data <- data.frame(id=c(rep(1, 4), rep(2, 3), rep(3, 3), rep(4,4)),
event=c(1, 1, 0, 0, 1, 1, 0, 1, 1, 0,1,1,1,1),
eventcount=c(1, 2, 0, 0, 1, 2, 0, 1, 2, 3,1,2,3,4),
firstevent=c(1, 0, 0, 0, 1, 0, 0, 1, 0, 0,1,0,0,0),
time=c(100, 250, 150, 300, 240, 400, 150, 200, 320, 360,100,210,220,235) )
We want to detect events that occurred from 100 to 150 days after the first event.
In Stata, we can use
gen event2=1 if id==id[_n-1]& time-time[_n-1]>100 & time-time[_n-1]<=150 & firstevent[_n-1]==1 & firstevent==0 & event==1
forvalues i = 2/4 {
replace event2=1 if id==id[_n-`i']& time-time[_n-`i']>100 &time-time[_n-`i']<=150 & firstevent[_n-`i']==1 & firstevent==0 & event==1
}
I would like to obtain the dataset below.
data_after <- data.frame(id=c(rep(1, 4), rep(2, 3), rep(3, 3), rep(4,4)),
event=c(1, 1, 0, 0, 1, 1, 0, 1, 1, 0,1,1,1,1),
eventcount=c(1, 2, 0, 0, 1, 2, 0, 1, 2, 3,1,2,3,4),
firstevent=c(1, 0, 0, 0, 1, 0, 0, 1, 0, 0,1,0,0,0),
time=c(100, 250, 150, 300, 240, 400, 150, 200, 320, 360,100,210,220,235),
afterevent=c(NA,1,NA,NA,NA,NA,NA,NA,1,NA,NA,1,1,1))
How can we write equivalent R code for this Stata code?
In the original data, we have more ids and more recurrent events.
This would correspond to your output:
library(dplyr)
data %>%
group_by(id) %>%
mutate(afterevent = ifelse(time <= (time[firstevent == 1] + 150) &
(time >= time[firstevent == 1] + 100), 1, NA))
Output:
# A tibble: 14 x 6
# Groups: id [4]
id event eventcount firstevent time afterevent
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 100 NA
2 1 1 2 0 250 1
3 1 0 0 0 150 NA
4 1 0 0 0 300 NA
5 2 1 1 1 240 NA
6 2 1 2 0 400 NA
7 2 0 0 0 150 NA
8 3 1 1 1 200 NA
9 3 1 2 0 320 1
10 3 0 3 0 360 NA
11 4 1 1 1 100 NA
12 4 1 2 0 210 1
13 4 1 3 0 220 1
14 4 1 4 0 235 1
I am trying to avoid using a loop in my attempt to recode the partner-identification variable for household members.
hldid denotes the household and persid the person in the household.
The variable partner indicates the persid of the partner, and child indicates whether the row is a child.
What is missing from the variable partner is the persid for both partners.
For instance, for hldid == 1, persid == 1 has a value of 0 for partner, while it should be 2.
This is what the data looks like:
> test
hldid persid age sex relresp partner child
1 1 1 26 2 0 0 0
2 1 2 26 1 1 1 0
3 2 1 59 2 0 0 0
4 2 2 64 1 1 1 0
5 3 1 76 2 0 0 0
6 4 1 65 2 0 0 0
7 4 2 64 1 1 1 0
8 5 1 52 2 0 0 0
9 5 2 51 1 1 1 0
10 5 3 20 2 21 0 1
11 5 4 14 2 21 0 1
12 7 1 69 1 0 0 0
13 7 2 70 2 1 1 0
I managed to create a quite ugly loop; however, it is too slow for the entire dataset.
test$partnerREC = test$partner
for(i in 1:13){
  for(j in 1:13){
    if(
      test$hldid[i] == test$hldid[j] & # verify the rows belong to the same household
      (test$persid[i] == test$partner[j])
    )
    {
      test$partnerREC[i] = test$persid[j] # put the persid for each partner
    }
  }
}
> test
hldid persid age sex relresp partner child partnerREC
1 1 1 26 2 0 0 0 2
2 1 2 26 1 1 1 0 1
3 2 1 59 2 0 0 0 2
4 2 2 64 1 1 1 0 1
5 3 1 76 2 0 0 0 0
6 4 1 65 2 0 0 0 2
7 4 2 64 1 1 1 0 1
8 5 1 52 2 0 0 0 2
9 5 2 51 1 1 1 0 1
10 5 3 20 2 21 0 1 0
11 5 4 14 2 21 0 1 0
12 7 1 69 1 0 0 0 2
13 7 2 70 2 1 1 0 1
Any idea how I could use data.table for solving this?
test = structure(list(hldid = c(1, 1, 2, 2, 3, 4, 4, 5, 5, 5, 5, 7,
7), persid = c(1, 2, 1, 2, 1, 1, 2, 1, 2, 3, 4, 1, 2), age = c(26,
26, 59, 64, 76, 65, 64, 52, 51, 20, 14, 69, 70), sex = c(2, 1,
2, 1, 2, 2, 1, 2, 1, 2, 2, 1, 2), relresp = c(0, 1, 0, 1, 0,
0, 1, 0, 1, 21, 21, 0, 1), partner = c(0, 1, 0, 1, 0, 0, 1, 0,
1, 0, 0, 0, 1), child = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
0)), class = "data.frame", row.names = c(NA, -13L))
A possible solution:
library(data.table) # load the package
setDT(test) # convert 'test' to a 'data.table'
test[, partnerREC := persid[c(pmin(2,.N):1,rep(0,(pmax(.N,2)-2)))] *
(persid %in% 1:2) *
(.N != 1)
, by = hldid][]
which gives:
> test
hldid persid age sex relresp partner child partnerREC
1: 1 1 26 2 0 0 0 2
2: 1 2 26 1 1 1 0 1
3: 2 1 59 2 0 0 0 2
4: 2 2 64 1 1 1 0 1
5: 3 1 76 2 0 0 0 0
6: 4 1 65 2 0 0 0 2
7: 4 2 64 1 1 1 0 1
8: 5 1 52 2 0 0 0 2
9: 5 2 51 1 1 1 0 1
10: 5 3 20 2 21 0 1 0
11: 5 4 14 2 21 0 1 0
12: 7 1 69 1 0 0 0 2
13: 7 2 70 2 1 1 0 1
This solution is based on the assumption (derived from the example data) that only 'persid' 1 & 2 are partners, anything higher are children.
What this does:
Group by hldid.
Swap the order of the first two persons in a household, but only when the household has more than one member, with persid[c(pmin(2,.N):1,rep(0,(pmax(.N,2)-2)))], where pmin ensures a vector of length 1 is constructed when a household has only one person.
Multiply that by (persid %in% 1:2) to get zeroes for the children.
Multiply that by (.N != 1) to get a value of zero for one-person households.
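To make the index mechanics concrete, here is how the expression evaluates for household 5 from the example (four members, two of whom are children) outside of data.table:

```r
persid <- c(1, 2, 3, 4)   # household 5: two partners plus two children
.N <- length(persid)      # mimic data.table's .N inside the group

idx <- c(pmin(2, .N):1, rep(0, pmax(.N, 2) - 2))
idx                       # 2 1 0 0 (zero indices are dropped when subsetting)
persid[idx]               # 2 1, recycled against the masks below
persid[idx] * (persid %in% 1:2) * (.N != 1)
# [1] 2 1 0 0
```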
Another option is a base R loop over households:
df <- data.frame(matrix(data = NA, ncol = 7))
names(df) <- names(test)
for(id in unique(test$hldid)){
  t <- test[test$hldid == id, ]
  t$partner[t$persid == t$partner[t$partner != 0]] <- t$persid[which(t$partner != 0)]
  df <- rbind(df, t)
}
df <- df[-1, ]
A base R solution is more complicated than Jaap's data.table solution.
I work with a copy.
test2 <- test
After running the code in the question, run the following.
test2$partnerREC <- test2$partner
sp <- split(test2, test2$hldid)
test2 <- lapply(sp, function(DF){
i <- with(DF, which(persid %in% partner))
j <- with(DF, which(partner %in% persid))
#cat("i:", i, "\tj:", j, "\n")
DF$partnerREC[i] <- DF$persid[j]
DF
})
test2 <- do.call(rbind, test2)
row.names(test2) <- NULL
Now compare both results.
identical(test, test2)
#[1] TRUE
You can get there through some dplyr steps to join the data on itself and update the value of partner when persid == partner.
test2 <- left_join(test, test %>% select(hldid, persid, partner) %>% filter(partner != 0), by=c("hldid")) %>%
filter(persid.x == partner.y) %>%
mutate(partner.x = persid.y)
This will give you the head of household with their partner id matched up, but you'd have to rejoin this to the original data (I'm not sure what the equivalent of SQL's UPDATE is in dplyr lingo).
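One way to sketch that missing update step (my assumption, not part of the original answer): build a lookup table with each recorded partner link reversed, join it back, and use coalesce() to keep the original partner value wherever no match was found. The data frame below is a trimmed version of test with only the columns involved.

```r
library(dplyr)

test <- structure(list(hldid  = c(1, 1, 2, 2, 3, 4, 4, 5, 5, 5, 5, 7, 7),
                       persid = c(1, 2, 1, 2, 1, 1, 2, 1, 2, 3, 4, 1, 2),
                       partner = c(0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1)),
                  class = "data.frame", row.names = c(NA, -13L))

# reverse each recorded link: the person listed in 'partner' gets this row's persid
lookup <- test %>%
  filter(partner != 0) %>%
  select(hldid, persid = partner, partnerREC = persid)

test %>%
  left_join(lookup, by = c("hldid", "persid")) %>%
  mutate(partnerREC = coalesce(partnerREC, partner))
```

This reproduces the expected partnerREC column: children and singles keep their 0, and both partners end up pointing at each other.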
1. Create your test data.frame
library(tidyverse)
test <- tribble(
~hldid, ~persid, ~age, ~sex, ~relresp, ~partner, ~child,
1, 1, 26, 2, 0, 0, 0,
1, 2, 26, 1, 1, 1, 0,
2, 1, 59, 2, 0, 0, 0,
2, 2, 64, 1, 1, 1, 0,
3, 1, 76, 2, 0, 0, 0,
4, 1, 65, 2, 0, 0, 0,
4, 2, 64, 1, 1, 1, 0,
5, 1, 52, 2, 0, 0, 0,
5, 2, 51, 1, 1, 1, 0,
5, 3, 20, 2, 21, 0, 1,
5, 4, 14, 2, 21, 0, 1,
7, 1, 69, 1, 0, 0, 0,
7, 2, 70, 2, 1, 1, 0)
2. arrange(), group_by(), and mutate() come to the rescue
test %>%
# arrange the data in case the raw data did not
arrange(hldid, child, persid) %>%
# group each household
group_by(hldid) %>%
# match first and second household person as each other's partner
mutate(partnerREC = ifelse(persid == first(persid), nth(persid, 2), first(persid))) %>%
# correct partnerREC for children and singles
mutate(partnerREC = ifelse(child == 1 | is.na(partnerREC), 0, partnerREC)) %>%
# un-group it
ungroup()
A loop solution using Rcpp
Source the C++ script (e.g. with Rcpp::sourceCpp())
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector HHgrid(NumericVector hid, NumericVector persid, NumericVector partner,
                     NumericVector partnerRec) {

  int nrows = hid.size();

  for (int i = 1; i < nrows; i++) {      // starts at 1, so the first row is never updated
    for (int j = 0; j < nrows; j++) {
      if ((hid(i) == hid(j)) && (persid(i) == partner(j))) {
        partnerRec(i) = persid(j);
      }
    }
  }
  return partnerRec;
}
and run the function
test$partnerRec <- test$partner  # initialise the result vector
test$partnerRec <- HHgrid(hid = test$hldid, persid = test$persid,
                          partner = test$partner, partnerRec = test$partnerRec)
There is just a slight issue with the first case (if you know how to fix it)
I've got data like below:
ex <- structure(list(id = 1:20, V1 = c(1, 0, 0, 0, 9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0)), class = "data.frame", row.names = c(NA, -20L))
Now I need to create a new column that counts the number of rows since the latest non-zero value. One additional condition: if that count would exceed k, we instead count the number of rows remaining before the next non-zero value, with a minus sign. So, assuming k = 10, the output would look like below:
structure(list(id = 1:20, V1 = c(1, 0, 0, 0, 9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0), counter = c(0, 1, 2, 3, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, -2, -1, 0, 1, 2)), class = "data.frame", row.names = c(NA, -20L))
What's the efficient way to do this using tidyverse packages, if possible?
A base R approach with ave, where we first calculate the number of rows since the last non-zero value and then, for counts greater than k, reverse the index positions and negate them to get the -2, -1 countdown sequence.
k <- 10
ave(ex$V1, cumsum(ex$V1 != 0), FUN = function(x) {
  inds <- seq_along(x) - 1
  ifelse(inds > k, -rev(inds) - 1, inds)
})
#[1] 0 1 2 3 0 1 2 3 4 5 6 7 8 9 10 -2 -1 0 1 2
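To see why the grouping works: cumsum(ex$V1 != 0) starts a new group at each non-zero value, so every run from one non-zero value up to (but excluding) the next is handled by a separate call of the function passed to ave().

```r
ex <- structure(list(id = 1:20,
                     V1 = c(1, 0, 0, 0, 9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0)),
                class = "data.frame", row.names = c(NA, -20L))

cumsum(ex$V1 != 0)
#  [1] 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3
```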
Another option:
ex <- structure(list(id = 1:20, V1 = c(1, 0, 0, 0, 9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0)), class = "data.frame", row.names = c(NA, -20L))
library(tidyverse)
k_LIMIT <- 10
ex %>%
mutate(my_group_var = cumsum(V1)) %>%
group_by(my_group_var) %>%
mutate(want = ifelse((row_number() - 1) > k_LIMIT,
-(max((row_number() - 1)) - (row_number() - 1) + 1),
(row_number() - 1))) %>%
ungroup() %>%
select(-my_group_var)
# A tibble: 20 x 3
# id V1 want
# <int> <dbl> <dbl>
# 1 1 1 0
# 2 2 0 1
# 3 3 0 2
# 4 4 0 3
# 5 5 9 0
# 6 6 0 1
# 7 7 0 2
# 8 8 0 3
# 9 9 0 4
#10 10 0 5
#11 11 0 6
#12 12 0 7
#13 13 0 8
#14 14 0 9
#15 15 0 10
#16 16 0 -2
#17 17 0 -1
#18 18 1 0
#19 19 0 1
#20 20 0 2
Here is one option with data.table
library(data.table)
k <- 10
setDT(ex)[, grp := cumsum(V1 != 0)
][, counter := seq_len(.N) - 1, grp
][counter > k, counter := -as.numeric(rev(seq_len(.N))), grp
][, grp := NULL][]
#. id V1 counter
# 1: 1 1 0
# 2: 2 0 1
# 3: 3 0 2
# 4: 4 0 3
# 5: 5 9 0
# 6: 6 0 1
# 7: 7 0 2
# 8: 8 0 3
# 9: 9 0 4
#10: 10 0 5
#11: 11 0 6
#12: 12 0 7
#13: 13 0 8
#14: 14 0 9
#15: 15 0 10
#16: 16 0 -2
#17: 17 0 -1
#18: 18 1 0
#19: 19 0 1
#20: 20 0 2
You can calculate each step as the difference between consecutive non-zero positions (a sentinel row is appended so the final run is included), and then build the sequences, switching to a negative countdown whenever a step exceeds k + 1:
library(dplyr)
k <- 10
Steps <- ex %>%
  rbind(c(nrow(ex) + 1, 1)) %>%   # sentinel non-zero value just past the last row
  filter(V1 != 0) %>%
  mutate(step = id - c(1, lag(id)[-1]))
res <- c()
for (i in seq_len(nrow(Steps))) {
  len <- Steps$step[i]
  if (len == 0) next
  res <- c(res, 0:(min(len, k + 1) - 1))            # forward count, capped at k
  if (len > k + 1) res <- c(res, (k - len + 1):-1)  # countdown to next non-zero
}
ex$counter <- res[seq_len(nrow(ex))]
I am using panel data with multiple subjects (id) and have an event (first_occurrence) that occurs on different days. My goal is to create a new variable (result) that is 1 on the 2 days preceding the first occurrence, the day of the first occurrence, and the 2 days following the first occurrence.
Here is an example that includes both the sample data and the desired output:
data <- structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,
2, 3, 3, 3, 3, 3, 3, 3), day = c(0, 1, 2, 3, 4, 5, 6, 7, 0, 1,
2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 6), first_occurrence = c(0, 0,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1), desired_output = c(1,
1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1)), .Names = c("id",
"day", "first_occurrence", "desired_output"), row.names = c(NA,
-21L), class = "data.frame")
Although this may not be the most efficient solution, I managed to get the code working in Stata (see below). I would like to get it working in R as well and would appreciate any thoughts folks have.
Thanks!
Stata code:
tsset id day
gen run = .
by id: replace run = cond(L.run == ., 1, L.run + 1)
gen test = .
replace test = run if(first_occurrence == 1)
gen test2 = .
by id: replace test2 = test[_n-1]
gen test3 = .
by id: replace test3 = test[_n-2]
gen test4 = .
by id: replace test4 = test[_n+1]
gen test5 = .
by id: replace test5 = test[_n+2]
egen test_sum = rowtotal(test test2 test3 test4 test5)
replace test_sum = 1 if(test_sum >= 1)
rename test_sum result
drop run test test2 test3 test4 test5
Here's another approach using the package dplyr:
require(dplyr) #install and load the package
data %.%
arrange(id, day) %.% # to sort the data by id and day. If it is already, you can remove this row
group_by(id) %.%
mutate(n = 1:n(),
result = ifelse(abs(n - n[first_occurrence == 1]) <= 2, 1, 0)) %.%
select(-n)
# id day first_occurrence desired_output result
#1 1 0 0 1 1
#2 1 1 0 1 1
#3 1 2 1 1 1
#4 1 3 0 1 1
#5 1 4 0 1 1
#6 1 5 0 0 0
#7 1 6 0 0 0
#8 1 7 0 0 0
#9 2 0 0 0 0
#10 2 1 0 0 0
#11 2 2 0 1 1
#12 2 3 0 1 1
#13 2 4 1 1 1
#14 2 5 0 1 1
#15 3 0 0 0 0
#16 3 1 0 0 0
#17 3 2 0 0 0
#18 3 3 0 0 0
#19 3 4 0 1 1
#20 3 5 0 1 1
#21 3 6 1 1 1
What the code does: it first groups by id, then adds a column n that numbers the rows within each group from 1 to the group size. It then creates a column result with an ifelse that checks the absolute difference between the current row's n and the n of the row where first_occurrence is 1. If that difference is 2 or less, result is 1, otherwise 0. The last line removes the column n.
Edit:
It would probably be more efficient to place the mutate(n = 1:n()) before the group_by:
data %.%
arrange(id, day) %.% # to sort the data by id and day. If it is already, you can remove this row
mutate(n = 1:n()) %.%
group_by(id) %.%
mutate(result = ifelse(abs(n - n[first_occurrence == 1]) <= 2, 1, 0)) %.%
select(-n)
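The %.% operator shown above comes from early dplyr and has since been removed in favor of the %>% pipe (and row_number() replaces the manual n column). A sketch of the same logic under current dplyr:

```r
library(dplyr)

data <- structure(list(
  id = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3),
  day = c(0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 6),
  first_occurrence = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1)
), class = "data.frame", row.names = c(NA, -21L))

data %>%
  arrange(id, day) %>%
  group_by(id) %>%
  # within each id, flag rows within 2 positions of the first occurrence
  mutate(result = ifelse(abs(row_number() - row_number()[first_occurrence == 1]) <= 2, 1, 0)) %>%
  ungroup()
```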
Here's one way. You can use ave to work by group: which.max finds the position of the first occurrence, and each row's distance from that position is then computed for the whole group.
close<-(with(data, ave(first_occurrence, id, FUN=function(x)
abs(seq_along(x)-which.max(x)))
)<=2)+0
Here I use +0 to turn the logical values into 0/1 values. Now you can combine that with your existing data
cbind(data, close)
And that gives
id day first_occurrence desired_output close
1 1 0 0 1 1
2 1 1 0 1 1
3 1 2 1 1 1
4 1 3 0 1 1
5 1 4 0 1 1
6 1 5 0 0 0
7 1 6 0 0 0
8 1 7 0 0 0
9 2 0 0 0 0
10 2 1 0 0 0
11 2 2 0 1 1
12 2 3 0 1 1
13 2 4 1 1 1
14 2 5 0 1 1
15 3 0 0 0 0
16 3 1 0 0 0
17 3 2 0 0 0
18 3 3 0 0 0
19 3 4 0 1 1
20 3 5 0 1 1
21 3 6 1 1 1
as desired. Note that this method assumes that the data is sorted by day.