Conditions involving comparison of consecutive elements within columns - r

Could you please help me with the following task? Here is a template of my dataset:
Category <- c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3)
PrevRule <- c(-1, -1, -1, -1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2)
UserRule <- c(2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 0, 0, 0, 2, 2, 2, 2, 2, 2, 0, 0, 0, 1, 1, 1, 1)
Correct <- c(0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1)
df <- data.frame(Category, PrevRule, UserRule, Correct)
I need to create an additional PP column according to the following rules:
i is the row index and n is the last row of each Category.
For Category == 1:
If Correct[i:(i+2)] are all 0, then PP[(i+2):n] = UserRule[i+2]
Else PP = NA
For Category > 1:
Set PP = PrevRule until Correct[i:(i+2)] are all 0
Then PP[(i+2):n] = UserRule[i+2]
So, in the end, the PP column should look like the following table (the PP column starts with NA):
Category PrevRule UserRule Correct PP
1 1 -1 2 0 NA
2 1 -1 2 0 NA
3 1 -1 2 0 2
4 1 -1 1 1 2
5 1 -1 1 1 2
6 1 -1 1 1 2
7 1 -1 1 1 2
8 1 -1 1 1 2
9 2 1 1 0 1
10 2 1 1 0 1
11 2 1 2 1 1
12 2 1 0 0 1
13 2 1 0 0 1
14 2 1 0 0 0
15 2 1 2 1 0
16 2 1 2 1 0
17 2 1 2 1 0
18 2 1 2 1 0
19 2 1 2 1 0
20 3 2 2 0 2
21 3 2 0 0 2
22 3 2 0 0 2
23 3 2 0 0 0
24 3 2 1 1 0
25 3 2 1 1 0
26 3 2 1 1 0
27 3 2 1 1 0
Any advice and suggestions would be greatly appreciated!
Thank you!

It was really difficult to understand what you meant, but here is my solution. It is not perfect because it uses a while loop, but it should work:
ret <- array(dim = nrow(df))
i <- 1
while (i < nrow(df) - 3) {
  if (Category[i] == 1 & all(Correct[i + 0:2] == 0)) {
    tmp <- max(which(Category == 1)) # end of category 1
    ret[(i + 2):tmp] <- UserRule[i + 2]
    # set index i to the index of the last element in category 1
    i <- tmp
  } else {
    ret[i] <- NA # not actually necessary, because PP is NA by default
  }
  if (Category[i] >= 2) {
    ret[i] <- PrevRule[i]
    if (all(Correct[(i - 2):i] == 0)) { # 3 consecutive 0s in Correct
      tmp <- max(which(Category == Category[i])) # end of the current category
      ret[i:tmp] <- UserRule[i]
      i <- tmp # set index i to the index of the last element in the current category
    }
  }
  i <- i + 1
}
df$PP <- ret
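If you want to avoid the while loop, a small helper can locate the first run of three consecutive zeros in Correct within each Category, which is the event both of your rules hinge on. This is only a sketch of that building block (first_zero_run is my own name, not part of the solution above):
first_zero_run <- function(correct, len = 3) {
  # run-length encode the "is zero" indicator
  r <- rle(correct == 0)
  ends <- cumsum(r$lengths)
  starts <- ends - r$lengths + 1
  hit <- which(r$values & r$lengths >= len)
  if (length(hit) == 0) NA_integer_ else starts[hit[1]]
}
# starting position (within each Category) of the first run of 3 incorrect answers
tapply(df$Correct, df$Category, first_zero_run)
From those per-category positions you can then fill PP with NA or PrevRule before the run and with the corresponding UserRule value from that point on.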

Related

How do I create a column which has the value of the last time a condition was met in a separate row in R

I have data that looks like this:
d <- data.frame(Item = c(1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0))
I would like to create a column where the value is based on the last time a 0 was present in the column d$Item. I don't really know how to get started with something like this in R.
Expected outcome is this:
d$recent <- c(NA, 0, 0, 1, 2, 3, 4, 5, 6, 0, 0, 1, 0, 1, 2, 3, 0)
Where each row is the most recent observation of 0 (0 = on the same row, 1 = previous row, etc.)
edit: Changed row to column, was posting before coffee. Also added expected result.
You can try rle + sequence
transform(
  d,
  recent = with(rle(Item), sequence(lengths)) * (Item != 0)
)
which gives
Item recent
1 1 1
2 0 0
3 0 0
4 1 1
5 1 2
6 1 3
7 1 4
8 1 5
9 1 6
10 0 0
11 0 0
12 1 1
13 0 0
14 1 1
15 1 2
16 1 3
17 0 0
You can do this with sequence. It calculates, for each row, the distance to the most recent 1; the Edit below gives the distance to the most recent 0, which matches your expected output.
dif <- diff(c(which(d$Item == 1), length(d$Item) + 1))
sequence(dif, 0)
#[1] 0 1 2 0 0 0 0 0 0 1 2 0 1 0 0 0 1
Edit:
dif <- diff(c(1, which(d$Item != 1), length(d$Item) + 1))
sequence(dif, 0)
#[1] 0 0 0 1 2 3 4 5 6 0 0 1 0 1 2 3 0
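If your R version is older than 4.0 (sequence() only gained the from argument in 4.0), a cumulative counter gives the same result; this is just a sketch, and grp is my own helper name:
grp <- cumsum(d$Item == 0)                        # a new group starts at every 0
d$recent <- ave(d$Item, grp, FUN = seq_along) - 1 # rows elapsed since the last 0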

How to find the most commonly occurring combinations of Boolean variables by row in R

I have a series of 14 Boolean variables and I would like to find the top 3 combinations of 3 or more variables (where the value == 1).
Sample data:
df <- data.frame(ID = c(1, 2, 3, 4, 5, 6, 7, 8),
                 var1 = c(0, 0, 1, 1, 1, 0, 0, 1),
                 var2 = c(1, 0, 0, 1, 1, 1, 1, 0),
                 var3 = c(0, 0, 1, 1, 1, 1, 0, 0),
                 var4 = c(1, 1, 1, 1, 1, 0, 1, 1),
                 var5 = c(0, 0, 0, 1, 1, 0, 1, 1))
df
> df
ID var1 var2 var3 var4 var5
1 1 0 1 0 1 0
2 2 0 0 0 1 0
3 3 1 0 1 1 0
4 4 1 1 1 1 1
5 5 1 1 1 1 1
6 6 0 1 1 0 0
7 7 0 1 0 1 1
8 8 1 0 0 1 1
I found a solution to bring all column names together per unique occurrence:
library(reshape2)  # melt()
library(plyr)      # ddply(), summarize
library(magrittr)  # %>%

# Bring to long format
df_long <- df %>%
  melt(id.vars = "ID")

# Collapse the variables that have a '1' together per row
df_combo <- ddply(df_long, "ID", summarize,
                  combos = paste(variable[value == 1], collapse = "/"))
> df_combo
ID combos
1 1 var2/var4
2 2 var4
3 3 var1/var3/var4
4 4 var1/var2/var3/var4/var5
5 5 var1/var2/var3/var4/var5
6 6 var2/var3
7 7 var2/var4/var5
8 8 var1/var4/var5
If I only wanted counts of unique combinations this would be fine, but I would like to know the number of times each combination of 3 or more variables occurs, even in cases where other variables also occur. The combination var1/var4/var5 occurs 3 times in the above example, but twice it occurs alongside two other variables.
There must be an easy way to extract this information; I just can't think of it. Thank you for your help!
An attempt, using combn as the workhorse function.
arr <- which(df[-1] == 1, arr.ind = TRUE)
tmp <- tapply(arr[, "col"], arr[, "row"],
              FUN = function(x) if (length(x) >= 3) combn(x, 3, simplify = FALSE))
tmp <- data.frame(do.call(rbind, unlist(tmp, rec = FALSE)))
aggregate(count ~ ., cbind(tmp, count = 1), sum)
## X1 X2 X3 count
##1 1 2 3 2
##2 1 2 4 2
##3 1 3 4 3
##4 2 3 4 2
##5 1 2 5 2
##6 1 3 5 2
##7 2 3 5 2
##8 1 4 5 3
##9 2 4 5 3
##10 3 4 5 2
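To make the result easier to read, the numeric column indices can be mapped back to variable names and sorted by frequency. A sketch building on the tmp object above (res is my own name):
res <- aggregate(count ~ ., cbind(tmp, count = 1), sum)
res[1:3] <- lapply(res[1:3], function(i) names(df)[-1][i])  # indices -> variable names
res[order(-res$count), ]                                    # most common triples first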

about the equivalent R code for forvalues in Stata [duplicate]

This question already has an answer here: about the equivalent command for forvalues (1 answer). Closed 4 years ago.
When dealing with longitudinal recurrent events, I would like to detect the events which meet some condition.
data <- data.frame(id = c(rep(1, 4), rep(2, 3), rep(3, 3), rep(4, 4)),
                   event = c(1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1),
                   eventcount = c(1, 2, 0, 0, 1, 2, 0, 1, 2, 3, 1, 2, 3, 4),
                   firstevent = c(1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0),
                   time = c(100, 250, 150, 300, 240, 400, 150, 200, 320, 360, 100, 210, 220, 235))
I want to detect events which occurred from 100 to 150 days after the first event.
In Stata, we can use
gen event2 = 1 if id == id[_n-1] & time - time[_n-1] > 100 & time - time[_n-1] <= 150 & firstevent[_n-1] == 1 & firstevent == 0 & event == 1
forvalues i = 2/4 {
    replace event2 = 1 if id == id[_n-`i'] & time - time[_n-`i'] > 100 & time - time[_n-`i'] <= 150 & firstevent[_n-`i'] == 1 & firstevent == 0 & event == 1
}
I would like to obtain the dataset below.
data_after <- data.frame(id = c(rep(1, 4), rep(2, 3), rep(3, 3), rep(4, 4)),
                         event = c(1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1),
                         eventcount = c(1, 2, 0, 0, 1, 2, 0, 1, 2, 3, 1, 2, 3, 4),
                         firstevent = c(1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0),
                         time = c(100, 250, 150, 300, 240, 400, 150, 200, 320, 360, 100, 210, 220, 235),
                         afterevent = c(NA, 1, NA, NA, NA, NA, NA, NA, 1, NA, NA, 1, 1, 1))
How can we write equivalent R code for this Stata code?
In the original data, we have more ids and more recurrent events.
This would correspond to your output:
library(dplyr)
data %>%
  group_by(id) %>%
  mutate(afterevent = ifelse(time <= (time[firstevent == 1] + 150) &
                               (time >= time[firstevent == 1] + 100), 1, NA))
Output:
# A tibble: 14 x 6
# Groups: id [4]
id event eventcount firstevent time afterevent
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 100 NA
2 1 1 2 0 250 1
3 1 0 0 0 150 NA
4 1 0 0 0 300 NA
5 2 1 1 1 240 NA
6 2 1 2 0 400 NA
7 2 0 0 0 150 NA
8 3 1 1 1 200 NA
9 3 1 2 0 320 1
10 3 0 3 0 360 NA
11 4 1 1 1 100 NA
12 4 1 2 0 210 1
13 4 1 3 0 220 1
14 4 1 4 0 235 1
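If you also want to mirror the remaining Stata conditions (firstevent == 0, event == 1, and the strict > 100 lower bound), one possible variation is sketched below; it assumes, as in your example, that every id has exactly one row with firstevent == 1:
data %>%
  group_by(id) %>%
  mutate(afterevent = ifelse(event == 1 & firstevent == 0 &
                               time - time[firstevent == 1] > 100 &
                               time - time[firstevent == 1] <= 150,
                             1, NA)) %>%
  ungroup()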

Match household partners without loop

I am trying to avoid using a loop in my attempt to recode the household partner identification variable.
hldid denotes the household, while persid identifies the person within the household.
The variable partner indicates the persid of the partner, and child indicates whether the row is a child.
What is missing from the variable partner is the persid for both partners.
For instance, in hldid == 1, the person with persid == 1 has a value of 0 for partner while it should be 2.
This is what the data looks like:
> test
hldid persid age sex relresp partner child
1 1 1 26 2 0 0 0
2 1 2 26 1 1 1 0
3 2 1 59 2 0 0 0
4 2 2 64 1 1 1 0
5 3 1 76 2 0 0 0
6 4 1 65 2 0 0 0
7 4 2 64 1 1 1 0
8 5 1 52 2 0 0 0
9 5 2 51 1 1 1 0
10 5 3 20 2 21 0 1
11 5 4 14 2 21 0 1
12 7 1 69 1 0 0 0
13 7 2 70 2 1 1 0
I managed to create a rather ugly loop, but it is too slow for the entire dataset.
test$partnerREC = test$partner
for (i in 1:13) {
  for (j in 1:13) {
    if (test$hldid[i] == test$hldid[i + 1] & # verify if the household is the same
        test$persid[i] == test$partner[j]) {
      test$partnerREC[i] = test$persid[j] # put the persid for each partner
    }
  }
}
> test
hldid persid age sex relresp partner child partnerREC
1 1 1 26 2 0 0 0 2
2 1 2 26 1 1 1 0 1
3 2 1 59 2 0 0 0 2
4 2 2 64 1 1 1 0 1
5 3 1 76 2 0 0 0 0
6 4 1 65 2 0 0 0 2
7 4 2 64 1 1 1 0 1
8 5 1 52 2 0 0 0 2
9 5 2 51 1 1 1 0 1
10 5 3 20 2 21 0 1 0
11 5 4 14 2 21 0 1 0
12 7 1 69 1 0 0 0 2
13 7 2 70 2 1 1 0 1
Any idea how I could use data.table for solving this?
test = structure(list(hldid = c(1, 1, 2, 2, 3, 4, 4, 5, 5, 5, 5, 7,
7), persid = c(1, 2, 1, 2, 1, 1, 2, 1, 2, 3, 4, 1, 2), age = c(26,
26, 59, 64, 76, 65, 64, 52, 51, 20, 14, 69, 70), sex = c(2, 1,
2, 1, 2, 2, 1, 2, 1, 2, 2, 1, 2), relresp = c(0, 1, 0, 1, 0,
0, 1, 0, 1, 21, 21, 0, 1), partner = c(0, 1, 0, 1, 0, 0, 1, 0,
1, 0, 0, 0, 1), child = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
0)), class = "data.frame", row.names = c(NA, -13L))
A possible solution:
library(data.table) # load the package
setDT(test)         # convert 'test' to a 'data.table'
test[, partnerREC := persid[c(pmin(2, .N):1, rep(0, (pmax(.N, 2) - 2)))] *
       (persid %in% 1:2) *
       (.N != 1),
     by = hldid][]
which gives:
> test
hldid persid age sex relresp partner child partnerREC
1: 1 1 26 2 0 0 0 2
2: 1 2 26 1 1 1 0 1
3: 2 1 59 2 0 0 0 2
4: 2 2 64 1 1 1 0 1
5: 3 1 76 2 0 0 0 0
6: 4 1 65 2 0 0 0 2
7: 4 2 64 1 1 1 0 1
8: 5 1 52 2 0 0 0 2
9: 5 2 51 1 1 1 0 1
10: 5 3 20 2 21 0 1 0
11: 5 4 14 2 21 0 1 0
12: 7 1 69 1 0 0 0 2
13: 7 2 70 2 1 1 0 1
This solution is based on the assumption (derived from the example data) that only persid 1 and 2 are partners; anything higher is a child.
What this does:
Group by hldid.
Swap the order of the first two persons in a household, but only when the household has at least two people, with persid[c(pmin(2,.N):1, rep(0,(pmax(.N,2)-2)))], where pmin ensures that a vector of length 1 is constructed when a household has only one person.
Multiply that by (persid %in% 1:2) to get zeros for children.
Multiply that by (.N != 1) to get a value of zero for one-person households.
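If that assumption does not hold in the full data (i.e. partners are not always persid 1 and 2), a more general sketch is to match each person to the household member whose partner field points back at them, falling back to the original partner value otherwise; this is my own variation, not part of the answer above:
test[, partnerREC := {
  j <- match(persid, partner)              # row whose 'partner' points at this persid
  ifelse(is.na(j), partner, persid[j])     # fall back to the original partner code
}, by = hldid][]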
df <- data.frame(matrix(data = NA, ncol = 7))
names(df) <- names(test)
for (id in unique(test$hldid)) {
  t <- test[test$hldid == id, ]
  t$partner[t$persid == t$partner[t$partner != 0]] <- t$persid[which(t$partner != 0)]
  df <- rbind(df, t)
}
df <- df[-1, ]
A base R solution is more complicated than Jaap's data.table solution.
I work with a copy.
test2 <- test
After running the code in the question, run the following.
test2$partnerREC <- test2$partner
sp <- split(test2, test2$hldid)
test2 <- lapply(sp, function(DF) {
  i <- with(DF, which(persid %in% partner))
  j <- with(DF, which(partner %in% persid))
  #cat("i:", i, "\tj:", j, "\n")
  DF$partnerREC[i] <- DF$persid[j]
  DF
})
test2 <- do.call(rbind, test2)
row.names(test2) <- NULL
Now compare both results.
identical(test, test2)
#[1] TRUE
You can get there through some dplyr steps to join the data on itself and update the value of partner when persid == partner.
test2 <- left_join(test,
                   test %>% select(hldid, persid, partner) %>% filter(partner != 0),
                   by = c("hldid")) %>%
  filter(persid.x == partner.y) %>%
  mutate(partner.x = persid.y)
This will give you the head of household with their partner id matched up, but you'd have to rejoin this to the original data (I'm not sure what the equivalent of SQL UPDATE is in dplyr lingo).
1. Create your test data.frame
library(tidyverse)
test <- tribble(
~hldid, ~persid, ~age, ~sex, ~relresp, ~partner, ~child,
1, 1, 26, 2, 0, 0, 0,
1, 2, 26, 1, 1, 1, 0,
2, 1, 59, 2, 0, 0, 0,
2, 2, 64, 1, 1, 1, 0,
3, 1, 76, 2, 0, 0, 0,
4, 1, 65, 2, 0, 0, 0,
4, 2, 64, 1, 1, 1, 0,
5, 1, 52, 2, 0, 0, 0,
5, 2, 51, 1, 1, 1, 0,
5, 3, 20, 2, 21, 0, 1,
5, 4, 14, 2, 21, 0, 1,
7, 1, 69, 1, 0, 0, 0,
7, 2, 70, 2, 1, 1, 0)
2. arrange(), group_by(), and mutate() come to the rescue
test %>%
  # arrange the data in case the raw data is not already ordered
  arrange(hldid, child, persid) %>%
  # group each household
  group_by(hldid) %>%
  # match the first and second household persons as each other's partner
  mutate(partnerREC = ifelse(persid == first(persid), nth(persid, 2), first(persid))) %>%
  # correct partnerREC for children and singles
  mutate(partnerREC = ifelse(child == 1 | is.na(partnerREC), 0, partnerREC)) %>%
  # un-group it
  ungroup()
A loop solution using Rcpp
Source the Rcpp script:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector HHgrid(CharacterVector hid, NumericVector persid, NumericVector partner,
                     NumericVector partnerRec) {

  int nrows = hid.size();

  for (int i = 1; i < nrows - 1; i++) {
    for (int j = 0; j < nrows - 1; j++) {
      if ((hid(i) == hid(i + 1)) & (persid(i) == partner(j))) {
        partnerRec(i) = persid(j);
      }
    }
    Rcout << i << std::endl;
  }

  return(partnerRec);
}
and run the function
HHgrid(hid = test$hldid, persid = test$persid, partner = test$partner, partnerRec = test$partnerRec)
There is just a slight issue with the first case (if you know how to fix it)
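A sketch of how the function might be compiled and called from R (assuming the C++ code above is saved as HHgrid.cpp; note that test$partnerRec does not exist yet in the original data, so initialise it from partner first):
library(Rcpp)
sourceCpp("HHgrid.cpp")          # compiles the file and exports HHgrid()

test$partnerRec <- test$partner  # start from the raw partner codes
test$partnerREC <- HHgrid(hid = as.character(test$hldid),
                          persid = test$persid,
                          partner = test$partner,
                          partnerRec = test$partnerRec)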

R / Metrics / AUC : How can these numbers give an AUC of 1?

I have the following toy dataset:
df1<-structure(list(X1 = c(1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), X2 = c(0.564666517055852,
0.993248174442609, 0.517237113309667, 0.0128217854167603, 0.952654357126895,
0.958073009436008, 0.860038905846366, 0.425314512801637, 0.809327038447625,
0.985049417726494, 0.165982081954436, 0.517237113309667, 0.00211852090504078,
0.296898500479658, 0.994690775408805, 0.999991149759367, 0.999949243479285,
0.999979994962211, 0.409697759931823, 0.999995828877373, 0.999991594894354,
0.999999834424374, 0.952641245900919, 0.999998774453881, 0.999999777896636,
0.999998864433372, 0.998786297471059, 0.999927421881167, 0.998265361329274,
0.999550929839182, 0.999900216754163, 0.999912135543067, 0.999999924775596,
0.996227950775217, 0.998265981873947, 0.999959584436354, 0.999993039255167,
0.99968139946193, 0.999999997308486, 0.999999458017638, 0.999996417856357,
0.99958403590535, 0.999998891765696, 0.999999624757926, 0.999818190766803,
0.999997979863151, 0.999974432439759, 0.996227950775217, 0.999999771762929,
0.983441425608786, 0.99999843468322)), .Names = c("X1", "X2"), row.names = c(NA,
-51L), class = "data.frame")
When I use the auc() function from the Metrics package, it tells me that the score is 1.
> Metrics::auc(df1$X1, df1$X2)
[1] 1
This doesn't seem correct. Any suggestions?
That is correct. You only have 3 zeros, which is certainly going to cause issues, but just look here (all zeros have lower predicted probabilities than the ones):
> df1[order(df1[,2]),]
X1 X2
13 0 0.002118521
4 0 0.012821785
11 0 0.165982082
14 1 0.296898500
19 1 0.409697760
8 1 0.425314513
3 1 0.517237113
12 1 0.517237113
1 1 0.564666517
9 1 0.809327038
7 1 0.860038906
23 1 0.952641246
5 1 0.952654357
6 1 0.958073009
50 1 0.983441426
10 1 0.985049418
2 1 0.993248174
15 1 0.994690775
34 1 0.996227951
48 1 0.996227951
29 1 0.998265361
35 1 0.998265982
27 1 0.998786297
30 1 0.999550930
42 1 0.999584036
38 1 0.999681399
45 1 0.999818191
31 1 0.999900217
32 1 0.999912136
28 1 0.999927422
17 1 0.999949243
36 1 0.999959584
47 1 0.999974432
18 1 0.999979995
16 1 0.999991150
21 1 0.999991595
37 1 0.999993039
20 1 0.999995829
41 1 0.999996418
46 1 0.999997980
51 1 0.999998435
24 1 0.999998774
26 1 0.999998864
43 1 0.999998892
40 1 0.999999458
44 1 0.999999625
49 1 0.999999772
25 1 0.999999778
22 1 0.999999834
33 1 0.999999925
39 1 0.999999997
By plotting the ROC curve of your data, you'll see that the AUC is 1.0.
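For example, with the pROC package (my choice here, not something used in the question), the curve hugs the top-left corner because every 0 scores below every 1:
library(pROC)
roc_obj <- roc(response = df1$X1, predictor = df1$X2)
plot(roc_obj)   # perfect separation: the curve passes through the top-left corner
auc(roc_obj)    # Area under the curve: 1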
