I am trying to create a variable that counts how many consecutive preceding rows of another variable contained a 1. However, the count should persist even if a single row is missing a 1 (i.e., 10111011 should register as an 8).
The following code shows how I currently count sequential 1s:
input <- c(1,0,1,1,0,1,1,0,1,0,1)
dfseq <- data.frame(input)
dfseq$seq <- sequence(rle(as.character(dfseq$input))$lengths)
which produces the following dataframe:
data_struc <-
structure(list(
input = c(1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1),
seq = c(1L,
1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L)
),
row.names = c(NA,-11L),
class = "data.frame")
However, I want the sequence to tolerate one row of "failure", so that it keeps counting consecutive 1s even if a single row contains a 0 and the 1s then continue. It should only stop counting once two 0s appear consecutively.
I'd use a lagged variable with an OR condition:
library(dplyr)
dfseq %>% mutate(
cum_result = cumsum(input == 1 | (lag(input) == 1 & lead(input, default = 1) == 1))
)
# input seq cum_result
# 1 1 1 1
# 2 0 1 2
# 3 1 1 3
# 4 1 2 4
# 5 0 1 5
# 6 1 1 6
# 7 1 2 7
# 8 0 1 8
# 9 1 1 9
# 10 0 1 10
# 11 1 1 11
You were on the right track using rle. Using an extended dataset (shown under Data below) to illustrate the "allowing" part, count every 1 plus every run of 0s that has length one:
rle_obj <- rle(dfseq$input)
sum(dfseq$input) + sum(ifelse(rle_obj$lengths[rle_obj$values==0]==1,1,0))
[1] 12
Data
dfseq <- structure(list(input = c(1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0,
0, 1), seq = c(1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1)), row.names = c(NA,
-14L), class = "data.frame")
dfseq
input seq
1 1 1
2 0 1
3 1 1
4 1 2
5 0 1
6 1 1
7 1 2
8 0 1
9 1 1
10 0 1
11 1 1
12 0 1
13 0 2
14 1 1
I'm not sure we all understood the question correctly. The sample data does not clarify much, and it hides possible mistakes because every row in it continues the sequence under the OP's rule. To be sure, the OP could provide the desired output for a sample that includes rows that break the sequence according to their criteria.
I changed the sample data a bit; this is how I interpreted the question.
dt <- data.frame(
input = c(1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1)
)
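library(data.table)
setDT(dt)
# shift() is data.table's own lag/lead, so dplyr does not need to be loaded
dt[, seq_count := seq_len(.N),
   by = rleid(input == 1 | (shift(input) == 1 & shift(input, type = "lead") != 0))]
# rows that belong to a run of two or more 0s break the sequence
dt[input == 0 & (shift(input, type = "lead") == 0 | shift(input) == 0), seq_count := NA]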
dt
# input seq_count
# 1: 1 1
# 2: 0 2
# 3: 1 3
# 4: 1 4
# 5: 0 NA
# 6: 0 NA
# 7: 1 1
# 8: 0 2
# 9: 1 3
# 10: 1 4
# 11: 0 5
# 12: 1 6
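For comparison, here is a minimal base R sketch of one more reading of the question: a running count that only looks at the current and earlier rows, tolerates a single 0 inside a streak, and resets at the second consecutive 0. The function name `tolerant_streak` is hypothetical, and the rule that a 0 cannot start a streak is an assumption, since the question does not say.

```r
# Hypothetical helper: running streak that survives one isolated 0.
# Assumptions: a 0 cannot start a streak, and the count resets at the
# second consecutive 0 (no look-ahead at later rows).
tolerant_streak <- function(x) {
  out <- integer(length(x))
  streak <- 0L
  for (i in seq_along(x)) {
    prev_zero <- i > 1 && x[i - 1] == 0
    if (x[i] == 1) {
      streak <- streak + 1L
    } else if (prev_zero || streak == 0L) {
      streak <- 0L            # second 0 in a row, or nothing to continue
    } else {
      streak <- streak + 1L   # single 0 inside a streak is tolerated
    }
    out[i] <- streak
  }
  out
}

tolerant_streak(c(1, 0, 1, 1, 1, 0, 1, 1))
# 1 2 3 4 5 6 7 8
tolerant_streak(c(1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1))
# 1 2 3 4 5 0 1 2 3 4 5 6
```

Because this version never looks ahead, row 5 of the second example still gets a count (5); it cannot know yet that row 6 is also a 0. The data.table answer above marks both rows as NA instead.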
I have some dplyr code that I'm moving to data.table, and this is a problem I just ran into. I want the row-to-row difference in b stored in column c if a is greater than or equal to 3. However, after running this code:
df = data.frame(a = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
b = c(0, 1, 0, 1, 0, 1, 1, 0, 3, 4, 5))
setDT(df)
df[ , c := ifelse(a >= 3, c(0, diff(b)), b), by = .(a)]
all the elements in c are 0. Why is this?
df
a b c
1: 1 0 0
2: 1 1 0
3: 1 0 0
4: 1 1 0
5: 2 0 0
6: 2 1 0
7: 2 1 0
8: 3 0 0
9: 3 3 0
10: 3 4 0
11: 3 5 0
What I thought was the equivalent dplyr:
df = data.frame(a = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
b = c(0, 1, 0, 1, 0, 1, 1, 0, 3, 4, 5))
df %>%
group_by(a) %>%
mutate(c = ifelse( a >= 3, c(0, diff(b)), b))
From the help for ifelse(test, yes, no), it should return...
A vector of the same length and attributes (including dimensions and "class") as test and data values from the values of yes or no. The mode of the answer will be coerced from logical to accommodate first any values taken from yes and then any values taken from no.
However:
> df %>% group_by(a) %>% do(print(.$a))
[1] 1 1 1 1
[1] 2 2 2
[1] 3 3 3 3
> data.table(df)[, print(a), by=a]
[1] 1
[1] 2
[1] 3
As explained in the help pages, since the first argument has a length of one, if you pass vectors for the other parts, only their first element is used:
> ifelse(TRUE, 1:10, eleventy + million)
[1] 1
You should probably use if ... else ... when working with a constant value, like...
> data.table(df)[, b := if (a >= 3) c(0, diff(b)) else b, by=a]
or even better, in this case you can assign to a subset:
> data.table(df)[a >= 3, b := c(0, diff(b)), by=a]
Regarding why a has length 1 for the data.table idiom, see its FAQ question "Inside each group, why are the group variables length-1?"
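The difference between the two constructs can be seen without data.table at all. The following is a small base R sketch; the name `a_in_group` is made up here and stands in for the length-1 grouping variable that data.table passes to each group.

```r
b <- c(0, 3, 4, 5)
a_in_group <- 3   # stand-in for the length-1 group variable (hypothetical name)

# ifelse() returns something shaped like its test, so only one value survives
ifelse(a_in_group >= 3, c(0, diff(b)), b)
# 0

# if/else returns the whole selected vector
if (a_in_group >= 3) c(0, diff(b)) else b
# 0 3 1 1
```

In the grouped `:=` call, that single value from ifelse() is then recycled over every row of the group.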
To illustrate this better, I am using a dataset in which the first value of b is non-zero. In your dataset the first b of every group was 0, and c(0, diff(b)) also starts with 0, so the two cases were hard to tell apart.
What happens here is that the output of ifelse is a vector of length 1, which is then recycled across the whole group.
library(data.table)
df = data.frame(a = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
b = c(10, 1, 0, 1, 0, 1, 1, 0, 3, 4, 5))
Look below:
setDT(df)[ , c := ifelse(a >= 3, c(0, diff(b)), b), by = .(a)][]
#> a b c
#> 1: 1 10 10
#> 2: 1 1 10
#> 3: 1 0 10
#> 4: 1 1 10
#> 5: 2 0 0
#> 6: 2 1 0
#> 7: 2 1 0
#> 8: 3 0 0
#> 9: 3 3 0
#> 10: 3 4 0
#> 11: 3 5 0
Now, let's look at some other examples; here I am using a simple vector of length 4 (instead of c(0,diff(b))):
setDT(df)[ , c := ifelse(a >= 3L, c(20,2,3,4), -999), by=a][]
#> a b c
#> 1: 1 10 -999
#> 2: 1 1 -999
#> 3: 1 0 -999
#> 4: 1 1 -999
#> 5: 2 0 -999
#> 6: 2 1 -999
#> 7: 2 1 -999
#> 8: 3 0 20
#> 9: 3 3 20
#> 10: 3 4 20
#> 11: 3 5 20
You can see that the first element is still assigned to every row of c in that group of a.
A workaround is to use diff on a to detect where it is not changing (i.e. diff(a) == 0) and use that as a pseudo-grouping together with the other condition, as below:
setDT(df)[, c := ifelse(a >= 3 & c(FALSE, diff(a) == 0), c(0, diff(b)), b)][]
#> a b c
#> 1: 1 10 10
#> 2: 1 1 1
#> 3: 1 0 0
#> 4: 1 1 1
#> 5: 2 0 0
#> 6: 2 1 1
#> 7: 2 1 1
#> 8: 3 0 0
#> 9: 3 3 3
#> 10: 3 4 1
#> 11: 3 5 1
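An alternative that keeps real grouping is to compute the within-group differences first and apply the condition afterwards. This is a base R sketch using ave(), not something taken from the answers above; the intermediate name `d` is arbitrary.

```r
df <- data.frame(a = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
                 b = c(10, 1, 0, 1, 0, 1, 1, 0, 3, 4, 5))

# Within-group differences, with 0 for the first row of every group
d <- ave(df$b, df$a, FUN = function(x) c(0, diff(x)))

# Here the test has one element per row, so ifelse() works element-wise
df$c <- ifelse(df$a >= 3, d, df$b)
df$c
# 10 1 0 1 0 1 1 0 3 1 1
```

Unlike the diff(a) == 0 trick, the first row of a qualifying group gets 0 here regardless of its own b value.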
I am trying to avoid a loop while recoding the partner identification variable in household data.
hldid denotes the household, while persid denotes the person within the household.
The variable partner gives the persid of the partner, and child indicates whether the row is a child.
What is missing is that partner is only filled in for one of the two partners.
For instance, in hldid == 1, persid == 1 has a partner value of 0 when it should be 2.
This is what the data looks like:
> test
hldid persid age sex relresp partner child
1 1 1 26 2 0 0 0
2 1 2 26 1 1 1 0
3 2 1 59 2 0 0 0
4 2 2 64 1 1 1 0
5 3 1 76 2 0 0 0
6 4 1 65 2 0 0 0
7 4 2 64 1 1 1 0
8 5 1 52 2 0 0 0
9 5 2 51 1 1 1 0
10 5 3 20 2 21 0 1
11 5 4 14 2 21 0 1
12 7 1 69 1 0 0 0
13 7 2 70 2 1 1 0
I managed to write a rather ugly loop, but it is too slow for the entire dataset.
test$partnerREC = test$partner
for(i in 1:13){
for(j in 1:13){
if(
test$hldid[i] == test$hldid[i+1] & # verify if household is the same
(test$persid[i] == test$partner[j])
)
{
test$partnerREC[i] = test$persid[j] # put the persid for each partner
}
}
}
> test
hldid persid age sex relresp partner child partnerREC
1 1 1 26 2 0 0 0 2
2 1 2 26 1 1 1 0 1
3 2 1 59 2 0 0 0 2
4 2 2 64 1 1 1 0 1
5 3 1 76 2 0 0 0 0
6 4 1 65 2 0 0 0 2
7 4 2 64 1 1 1 0 1
8 5 1 52 2 0 0 0 2
9 5 2 51 1 1 1 0 1
10 5 3 20 2 21 0 1 0
11 5 4 14 2 21 0 1 0
12 7 1 69 1 0 0 0 2
13 7 2 70 2 1 1 0 1
Any idea how I could use data.table for solving this?
test = structure(list(hldid = c(1, 1, 2, 2, 3, 4, 4, 5, 5, 5, 5, 7,
7), persid = c(1, 2, 1, 2, 1, 1, 2, 1, 2, 3, 4, 1, 2), age = c(26,
26, 59, 64, 76, 65, 64, 52, 51, 20, 14, 69, 70), sex = c(2, 1,
2, 1, 2, 2, 1, 2, 1, 2, 2, 1, 2), relresp = c(0, 1, 0, 1, 0,
0, 1, 0, 1, 21, 21, 0, 1), partner = c(0, 1, 0, 1, 0, 0, 1, 0,
1, 0, 0, 0, 1), child = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
0)), class = "data.frame", row.names = c(NA, -13L))
A possible solution:
library(data.table) # load the package
setDT(test) # convert 'test' to a 'data.table'
test[, partnerREC := persid[c(pmin(2,.N):1,rep(0,(pmax(.N,2)-2)))] *
(persid %in% 1:2) *
(.N != 1)
, by = hldid][]
which gives:
> test
hldid persid age sex relresp partner child partnerREC
1: 1 1 26 2 0 0 0 2
2: 1 2 26 1 1 1 0 1
3: 2 1 59 2 0 0 0 2
4: 2 2 64 1 1 1 0 1
5: 3 1 76 2 0 0 0 0
6: 4 1 65 2 0 0 0 2
7: 4 2 64 1 1 1 0 1
8: 5 1 52 2 0 0 0 2
9: 5 2 51 1 1 1 0 1
10: 5 3 20 2 21 0 1 0
11: 5 4 14 2 21 0 1 0
12: 7 1 69 1 0 0 0 2
13: 7 2 70 2 1 1 0 1
This solution is based on the assumption (derived from the example data) that only 'persid' 1 & 2 are partners and that anything higher is a child.
What this does:
Group by hldid.
Swap the order of the first two persons in a household, but only when the household has more than one person, with
persid[c(pmin(2,.N):1,rep(0,(pmax(.N,2)-2)))], where pmin ensures that an index vector of length 1 is constructed when a household has only one person.
Multiply that by (persid %in% 1:2) to get zeros for children.
Multiply that by (.N != 1) to get a value of zero for households of one person.
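The indexing step is easier to follow on a single household. The sketch below wraps the same expression in a plain base R function; `swap_partners` is a hypothetical name, its argument plays the role of one household's persid column, and `n` plays the role of .N.

```r
# Hypothetical helper mirroring the expression used inside the grouped call
swap_partners <- function(persid) {
  n <- length(persid)
  idx <- c(pmin(2, n):1, rep(0, pmax(n, 2) - 2))  # e.g. 2 1 0 0 for n = 4
  swapped <- persid[idx]                          # zero indices are dropped
  # swapped is recycled to length n; the two masks then zero out
  # children and single-person households
  swapped * (persid %in% 1:2) * (n != 1)
}

swap_partners(1:2)  # 2 1
swap_partners(1)    # 0
swap_partners(1:4)  # 2 1 0 0
```

Note that the result relies on recycling the two swapped values across the household. For a household of three this still yields the expected values, but R emits a warning because 3 is not a multiple of 2.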
df <- data.frame(matrix(data = NA, ncol = 7))
names(df) <- names(test)
for(id in unique(test$hldid)){
t <- test[test$hldid==id,]
t$partner[t$persid == t$partner[t$partner!=0]] <- t$persid[which(t$partner!=0)]
df <- rbind(df, t)
}
df <- df[-1,]
A base R solution is more complicated than Jaap's data.table solution.
I work with a copy.
test2 <- test
After running the code in the question, run the following.
test2$partnerREC <- test2$partner
sp <- split(test2, test2$hldid)
test2 <- lapply(sp, function(DF){
i <- with(DF, which(persid %in% partner))
j <- with(DF, which(partner %in% persid))
#cat("i:", i, "\tj:", j, "\n")
DF$partnerREC[i] <- DF$persid[j]
DF
})
test2 <- do.call(rbind, test2)
row.names(test2) <- NULL
Now compare both results.
identical(test, test2)
#[1] TRUE
You can get there with a few dplyr steps: join the data to itself and update the value of partner where persid == partner.
library(dplyr)
test2 <- left_join(test, test %>% select(hldid, persid, partner) %>% filter(partner != 0), by = "hldid") %>%
  filter(persid.x == partner.y) %>%
  mutate(partner.x = persid.y)
This gives you the head of each household with their partner's id filled in, but you would still have to join it back to the original data (I'm not sure what the equivalent of an SQL UPDATE is in dplyr terms).
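One way to do the "update" in place, without a join, is a lookup with match(). This is a base R sketch and a different technique from the self-join; the data frame `hh` is a reduced copy of the example data, and the names `me`, `named_by` and `who` are made up for illustration.

```r
# Reduced copy of the example data (only the columns needed here)
hh <- data.frame(
  hldid   = c(1, 1, 2, 2, 3, 4, 4, 5, 5, 5, 5, 7, 7),
  persid  = c(1, 2, 1, 2, 1, 1, 2, 1, 2, 3, 4, 1, 2),
  partner = c(0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1)
)

me       <- paste(hh$hldid, hh$persid)   # key identifying each row
named_by <- paste(hh$hldid, hh$partner)  # key of the person each row names
who      <- match(me, named_by)          # row that names me as partner, or NA

# Keep an existing partner id; otherwise take the persid of whoever names me
hh$partnerREC <- ifelse(hh$partner != 0, hh$partner,
                        ifelse(is.na(who), 0, hh$persid[who]))
hh$partnerREC
# 2 1 2 1 0 2 1 2 1 0 0 2 1
```

This assumes that at most one person in a household names any given person as partner, since match() only returns the first hit.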
1. Create your test data.frame
library(tidyverse)
test <- tribble(
~hldid, ~persid, ~age, ~sex, ~relresp, ~partner, ~child,
1, 1, 26, 2, 0, 0, 0,
1, 2, 26, 1, 1, 1, 0,
2, 1, 59, 2, 0, 0, 0,
2, 2, 64, 1, 1, 1, 0,
3, 1, 76, 2, 0, 0, 0,
4, 1, 65, 2, 0, 0, 0,
4, 2, 64, 1, 1, 1, 0,
5, 1, 52, 2, 0, 0, 0,
5, 2, 51, 1, 1, 1, 0,
5, 3, 20, 2, 21, 0, 1,
5, 4, 14, 2, 21, 0, 1,
7, 1, 69, 1, 0, 0, 0,
7, 2, 70, 2, 1, 1, 0)
2. arrange(), group_by(), and mutate() come to the rescue
test %>%
# arrange the data in case the raw data did not
arrange(hldid, child, persid) %>%
# group each household
group_by(hldid) %>%
# match first and second household person as each other's partner
mutate(partnerREC = ifelse(persid == first(persid), nth(persid, 2), first(persid))) %>%
# correct partnerREC for child and single
mutate(partnerREC = ifelse(child == 1 | is.na(partnerREC), 0, partnerREC)) %>%
# un-group it
ungroup()
A loop solution using Rcpp
Source the Rcpp script:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector HHgrid(CharacterVector hid, NumericVector persid, NumericVector partner,
NumericVector partnerRec) {
int nrows = hid.size();
  // C++ indices start at 0; stop one row early because of hid(i + 1)
  for (int i = 0; i < nrows - 1; i++) {
    for (int j = 0; j < nrows; j++) {
      if ((hid(i) == hid(i + 1)) && (persid(i) == partner(j))) {
        partnerRec(i) = persid(j);
      }
    }
    Rcout << i << std::endl;
  }
  return partnerRec;
}
and run the function
HHgrid(hid = test$hldid, persid = test$persid, partner = test$partner, partnerRec = test$partnerREC)
Note that C++ vectors are zero-indexed, so the outer loop has to start at i = 0; starting at 1 skips the first row.
I have data like the following:
ex <- structure(list(id = 1:20, V1 = c(1, 0, 0, 0, 9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0)), class = "data.frame", row.names = c(NA, -20L))
Now I need to create a new column that counts the number of rows since the most recent non-zero value. One additional condition: if that number is greater than k, the column should instead count the number of rows remaining before the next non-zero value (with a minus sign). So, assuming k = 10, the output would look like this:
structure(list(id = 1:20, V1 = c(1, 0, 0, 0, 9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0), counter = c(0, 1, 2, 3, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, -2, -1, 0, 1, 2)), class = "data.frame", row.names = c(NA, -20L))
What's the efficient way to do this using tidyverse packages, if possible?
A base R approach with ave: we first calculate the number of rows since the last non-zero value, and then, for values greater than k, reverse the index positions and negate them to get the sequence in -2, -1 order.
k <- 10
ave(ex$V1, cumsum(ex$V1 != 0), FUN = function(x) {
inds <- seq_along(x) - 1
ifelse(inds > k, -rev(inds) - 1, inds)
})
#[1] 0 1 2 3 0 1 2 3 4 5 6 7 8 9 10 -2 -1 0 1 2
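The index arithmetic is easiest to check on one block in isolation. The sketch below reproduces it for the 13-row block that starts at the value 9 in the example; `rev_count` is a made-up intermediate name.

```r
k <- 10
inds <- 0:12                  # row offsets within one block of 13 rows
rev_count <- -rev(inds) - 1   # -13 -12 ... -2 -1: rows left before the block ends
ifelse(inds > k, rev_count, inds)
# 0 1 2 3 4 5 6 7 8 9 10 -2 -1
```

Only the offsets above k (11 and 12) pick up their value from the reversed, negated sequence.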
Another option:
ex <- structure(list(id = 1:20, V1 = c(1, 0, 0, 0, 9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0)), class = "data.frame", row.names = c(NA, -20L))
library(tidyverse)
k_LIMIT <- 10
ex %>%
mutate(my_group_var = cumsum(V1)) %>%
group_by(my_group_var) %>%
mutate(want = ifelse((row_number() - 1) > k_LIMIT,
-(max((row_number() - 1)) - (row_number() - 1) + 1),
(row_number() - 1))) %>%
ungroup() %>%
select(-my_group_var)
# A tibble: 20 x 3
# id V1 want
# <int> <dbl> <dbl>
# 1 1 1 0
# 2 2 0 1
# 3 3 0 2
# 4 4 0 3
# 5 5 9 0
# 6 6 0 1
# 7 7 0 2
# 8 8 0 3
# 9 9 0 4
#10 10 0 5
#11 11 0 6
#12 12 0 7
#13 13 0 8
#14 14 0 9
#15 15 0 10
#16 16 0 -2
#17 17 0 -1
#18 18 1 0
#19 19 0 1
#20 20 0 2
Here is one option with data.table
library(data.table)
k <- 10
setDT(ex)[, grp := cumsum(V1 != 0)
][, counter := seq_len(.N) - 1, grp
][counter > k, counter := -as.numeric(rev(seq_len(.N))), grp
][, grp := NULL][]
# id V1 counter
# 1: 1 1 0
# 2: 2 0 1
# 3: 3 0 2
# 4: 4 0 3
# 5: 5 9 0
# 6: 6 0 1
# 7: 7 0 2
# 8: 8 0 3
# 9: 9 0 4
#10: 10 0 5
#11: 11 0 6
#12: 12 0 7
#13: 13 0 8
#14: 14 0 9
#15: 15 0 10
#16: 16 0 -2
#17: 17 0 -1
#18: 18 1 0
#19: 19 0 1
#20: 20 0 2
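You can compute each block's length as the difference between the id of one non-zero value and the id of the next one (an extra condition on the value could be added here), and then build the sequences, switching to the reversed negative count once a block runs past k:
library(dplyr)
k <- 10
Steps <- ex %>%
  rbind(c(nrow(ex) + 1, 1)) %>%      # sentinel row marking the end of the data
  filter(V1 != 0) %>%
  mutate(step = lead(id) - id) %>%   # rows from this non-zero value to the next
  filter(!is.na(step))
counter <- c()
for (i in seq_len(nrow(Steps))) {
  s <- Steps$step[i]
  counter <- c(counter, 0:min(s - 1, k))
  if (s - 1 > k) {
    counter <- c(counter, (k - s + 1):-1)
  }
}
counter <- counter[ex$id]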
vector A:
a = c(0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1)
vector B: (only used for initialization)
b = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
Dataframe:
dft <- data.frame(a,b)
The following for-loop compares for each row "i" the value A[i] with A[i+1] in vector A.
If i+1 is different -> write "count"
else check i+2 and increment "count" ...
The idea is to know for each row, the number of rows until the value in A changes.
count = 0
# takes forever (for a large set) but does its job
for(i in 1:nrow(dft)) {
for(j in i+1:nrow(dft)-1) {
j_value <- dft[j,"a"]
i_value <- dft[i,"a"]
if (!is.na(j_value) & !is.na(i_value)){
tmp_value <- abs(i_value - j_value)
if(tmp_value > 0) {
dft[i,"b"] <- count
count = 0
break
} else {
count = count + 1
}
}
}
}
Results should be:
b
1: 5
2: 4
3: 3
4: 2
5: 1
6: 1
7: 2
8: 1
9: 3
10: 2
11: 1
12: 5
13: 4
14: 3
15: 2
16: 1
17: 0
The following should work:
b = rle(a)
unlist(mapply(":", b$lengths, 1))
# [1] 5 4 3 2 1 1 2 1 3 2 1 5 4 3 2 1 1
Or in one line:
with(rle(a), unlist(Map(":", lengths, 1)))
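To see what these calls work with, it helps to look at the run lengths directly. The sketch below also shows an equivalent way to build the countdown with sequence() and rev(), which is a variation on the answer rather than the answer's own code; the names `runs` and `countdown` are arbitrary.

```r
a <- c(0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1)
runs <- rle(a)
runs$lengths
# 5 1 2 3 5 1

# Count up within the reversed runs, then flip the whole vector back
countdown <- rev(sequence(rev(runs$lengths)))
countdown
# 5 4 3 2 1 1 2 1 3 2 1 5 4 3 2 1 1
```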
Using "data.table", you can do the following:
library(data.table)
data.table(a)[, b := .N:1, rleid(a)][]
# a b
# 1: 0 5
# 2: 0 4
# 3: 0 3
# 4: 0 2
# 5: 0 1
# 6: 1 1
# 7: 0 2
# 8: 0 1
# 9: 1 3
# 10: 1 2
# 11: 1 1
# 12: 0 5
# 13: 0 4
# 14: 0 3
# 15: 0 2
# 16: 0 1
# 17: 1 1
How about this, using data.table. There's a bit of reverse ordering, and use of shift to compare values with subsequent values. It might be a little convoluted, but it seems to work.
library( data.table )
dft <- data.table(a)
dft[ , f := shift( a, 1L, fill = F, type = "lead" ) != a
][ .N:1, b := seq_len(.N), by = cumsum(f)
][ , f := NULL ]
dft
a b
1: 0 5
2: 0 4
3: 0 3
4: 0 2
5: 0 1
6: 1 1
7: 0 2
8: 0 1
9: 1 3
10: 1 2
11: 1 1
12: 0 5
13: 0 4
14: 0 3
15: 0 2
16: 0 1
17: 1 1
Here is another approach with lapply:
# The data
a=c(0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1)
# An index of the data
ind <- 1:length(a)
# The function to apply
f <- function(x){ifelse(!is.na(which(a[x]!=a[x:length(a)])[1] - 1), # Check if we are in the last group before series ends
which(a[x]!=a[x:length(a)])[1] - 1, # if not return distance to nearest value change
ind[length(a)] - x + 1) # if we are return length of last block of values
}
unlist(lapply(ind, f)) # Apply and unlist to vector
#> [1] 5 4 3 2 1 1 2 1 3 2 1 5 4 3 2 1 1
If you wanted you could reduce it to just the which() statement, in which case the last block of homogenous values would be assigned an NA. Depending on the context there are different ways you might want to treat the last block, as the number of repetitions until the value changes is censored (maybe you want to supply a string in the second term of the ifelse like '1+').
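As a rough illustration of the censoring point, the following base R sketch marks the last block as NA instead of giving it a count. It uses rle() rather than the which() expression above, and the names `runs`, `run_id` and `b` are chosen only for this example.

```r
a <- c(0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1)
runs <- rle(a)
run_id <- rep(seq_along(runs$lengths), runs$lengths)  # run each element belongs to
b <- rev(sequence(rev(runs$lengths)))                 # rows left in the current run
b[run_id == max(run_id)] <- NA                        # last run: no change is observed
b
# 5 4 3 2 1 1 2 1 3 2 1 5 4 3 2 1 NA
```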