How to calculate matching objects without removing NA or 0 - r

I have an output for example as below:
ID C1 C2 C3 C4 C5 C6
1 0 1 2 2 1 1
2 0 1 1 2 1 1
3 1 0 1 1 1 1
4 2 0 2 2 1 2
5 2 1 1 0 2 2
6 1 2 1 0 1 2
7 2 2 2 2 0 2
8 1 1 1 1 0 1
9 1 1 2 2 2 0
10 1 2 1 2 1 0
I determine the co-occurrence of objects using the example from "faster way to compare rows in a data frame":
nr <- nrow(dt)          # number of rows
nc <- ncol(dt) - 1      # number of value columns (C1:C6)
totalmatches <- vector("list", nr - 1)
for (i in 1:(nr - 1)) {
  # all combinations of i with i+1 to nr
  samplematch <- cbind(dt[i], dt[(i + 1):nr])
  # renaming the comparison sample columns
  setnames(samplematch, append(colnames(dt), paste0(colnames(dt), "2")))
  # calculating number of matches
  samplematch[, noofmatches := 0]
  for (j in 1:nc) {
    samplematch[, noofmatches := noofmatches +
                  1 * (get(paste0("C", j)) == get(paste0("C", j, "2")))]
  }
  # removing individual value columns and matches < 5
  samplematch <- samplematch[noofmatches >= 5, list(ID, ID2, noofmatches)]
  # adding to the list
  totalmatches[[i]] <- samplematch
}
The result obtained through the above loop helps me identify the total matches between each pair of IDs. However, I only want to count a column as a match when both values in C1:C6 are 1 or 2, not 0. Since every row contains exactly one 0, the maximum possible match count for each pair is 5, not 6.
The output I require should contain information such as:
ID1 ID2 Match
1 2 4/5
1 3 2/5
1 4 3/5
: : :
: : :
2 3 3/5
2 4 2/5
How should the function be written so that it does not remove any rows, given that every row contains a 0 value?

In the code below, IDs is a data table of all pairs of distinct IDs. For each pair, x <- df[c(ID1, ID2), -1] extracts the non-ID columns of df for the two IDs. The code then creates a logical vector that is TRUE for columns that are non-zero (x[1] != 0) and equal in both rows (x[2] == x[1]). The sum of this vector is the number of matches.
library(data.table)
setDT(df)
setkey(df, ID)
IDs <- CJ(ID1 = df$ID, ID2 = df$ID)[ID1 != ID2]
IDs[, Match := {
  x <- df[c(ID1, ID2), -1]
  sum(x[1] != 0 & x[2] == x[1])
}, by = .(ID1, ID2)]
head(IDs)
# ID1 ID2 Match
# 1: 1 2 4
# 2: 1 3 2
# 3: 1 4 3
# 4: 1 5 1
# 5: 1 6 1
# 6: 1 7 2
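If the "4/5" display from the question is wanted, the count can be formatted afterwards (a small sketch; the denominator is 5 because every row contains exactly one 0):
IDs[, Match := paste0(Match, "/5")]
head(IDs, 3)
#    ID1 ID2 Match
# 1:   1   2   4/5
# 2:   1   3   2/5
# 3:   1   4   3/5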
Data used:
df <- fread('
ID C1 C2 C3 C4 C5 C6
1 0 1 2 2 1 1
2 0 1 1 2 1 1
3 1 0 1 1 1 1
4 2 0 2 2 1 2
5 2 1 1 0 2 2
6 1 2 1 0 1 2
7 2 2 2 2 0 2
8 1 1 1 1 0 1
9 1 1 2 2 2 0
10 1 2 1 2 1 0
')


Counting Frequencies of Sequences

Suppose there are two students; each student takes an exam multiple times (e.g. result_id = 1 is the first exam, result_id = 2 is the second exam, etc.). A student can either "pass" (1) or "fail" (0).
The data looks something like this:
library(data.table)
my_data = data.frame(
  id = c(1,1,1,1,1,1,2,2,2,2,2,2,2,2,2),
  results = c(0,1,0,1,0,0,1,1,1,0,1,1,0,1,0),
  result_id = c(1,2,3,4,5,6,1,2,3,4,5,6,7,8,9)
)
my_data = setDT(my_data)
id results result_id
1: 1 0 1
2: 1 1 2
3: 1 0 3
4: 1 1 4
5: 1 0 5
6: 1 0 6
7: 2 1 1
8: 2 1 2
9: 2 1 3
10: 2 0 4
11: 2 1 5
12: 2 1 6
13: 2 0 7
14: 2 1 8
15: 2 0 9
I am interested in counting the number of times that a student passes an exam, given that the student passed the previous two exams.
I tried to do this with the following code:
my_data$current_exam = shift(my_data$results, 0)
my_data$prev_exam = shift(my_data$results, 1)
my_data$prev_2_exam = shift(my_data$results, 2)
# Count the number of exam results for each record
out <- my_data[!is.na(prev_exam), .(tally = .N), by = .(id, current_exam, prev_exam, prev_2_exam)]
out = na.omit(out)
My code produces the following results:
> out
id current_exam prev_exam prev_2_exam tally
1: 1 0 1 0 2
2: 1 1 0 1 1
3: 1 0 0 1 1
4: 2 1 0 0 1
5: 2 1 1 0 2
6: 2 1 1 1 1
7: 2 0 1 1 2
8: 2 1 0 1 2
9: 2 0 1 0 1
However, I do not think that my code is correct.
For example, with Student_ID = 2: my code says that "Current_Exam = 1, Prev_Exam = 1, Prev_2_Exam = 0" happens 1 time, but looking at the actual data, this does not happen at all.
Can someone please show me what I am doing wrong and how I can correct this?
Note: I think that this should be the expected output:
> expected_output
id current_exam prev_exam prev_2_exam tally
1: 1 0 1 0 2
2: 1 1 0 1 1
3: 1 0 0 1 1
4: 2 1 0 0 1
5: 2 1 1 0 1
6: 2 1 1 1 1
7: 2 0 1 1 2
8: 2 1 0 1 2
9: 2 0 1 0 0
You did not consider that you cannot shift the results across id boundaries without inserting NA values.
. <- my_data[order(my_data$id, my_data$result_id), ] # sort if needed
.$p1 <- ave(.$results, .$id, FUN = \(x) c(NA, x[-length(x)])) # lag results by 1 within id
.$p2 <- ave(.$p1, .$id, FUN = \(x) c(NA, x[-length(x)]))      # lag p1 by 1 (= results by 2)
aggregate(list(tally = .$p1), .[c("id", "results", "p1", "p2")], length)
# id results p1 p2 tally
#1 1 0 1 0 2
#2 2 0 1 0 1
#3 2 1 1 0 1
#4 1 0 0 1 1
#5 1 1 0 1 1
#6 2 1 0 1 2
#7 2 0 1 1 2
#8 2 1 1 1 1
. # the data with the lagged columns p1 and p2
# id results result_id p1 p2
#1 1 0 1 NA NA
#2 1 1 2 0 NA
#3 1 0 3 1 0
#4 1 1 4 0 1
#5 1 0 5 1 0
#6 1 0 6 0 1
#7 2 1 1 NA NA
#8 2 1 2 1 NA
#9 2 1 3 1 1
#10 2 0 4 1 1
#11 2 1 5 0 1
#12 2 1 6 1 0
#13 2 0 7 1 1
#14 2 1 8 0 1
#15 2 0 9 1 0
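Since the question used data.table, the same within-id lags can be written there directly (a sketch; the grouped shift() replaces the ungrouped prev_exam columns created in the question, and should reproduce the aggregate() counts above):
my_data[, `:=`(prev_exam   = shift(results, 1),
               prev_2_exam = shift(results, 2)), by = id] # lags stay within each id
out <- my_data[!is.na(prev_2_exam),
               .(tally = .N),
               by = .(id, current_exam = results, prev_exam, prev_2_exam)]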
An option would be to use stats::filter as a running sum to indicate those who passed 3 times in a row.
cbind(., n = ave(.$results, .$id, FUN = \(x) stats::filter(x, c(1,1,1), sides = 1)))
# id results result_id n
#1 1 0 1 NA
#2 1 1 2 NA
#3 1 0 3 1
#4 1 1 4 2
#5 1 0 5 1
#6 1 0 6 1
#7 2 1 1 NA
#8 2 1 2 NA
#9 2 1 3 3
#10 2 0 4 2
#11 2 1 5 2
#12 2 1 6 2
#13 2 0 7 2
#14 2 1 8 2
#15 2 0 9 1
If only the number of times that a student passes an exam, given that the student passed the previous two exams, is needed:
sum(ave(.$results, .$id, FUN = \(x) stats::filter(x, c(1,1,1)) == 3), na.rm = TRUE)
#[1] 1
sum(ave(.$results, .$id, FUN = \(x)
x==1 & c(x[-1], 0) == 1 & c(x[-1:-2], 0, 0) == 1))
#[1] 1
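Since my_data is a data.table, the same count can also be written there (a sketch; shift() fills with NA at the start of each id, and the NA comparisons drop out of the sum):
my_data[, hit := results == 1 & shift(results, 1) == 1 & shift(results, 2) == 1,
        by = id]
my_data[, sum(hit, na.rm = TRUE)]
# [1] 1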
When trying to count events that happen in series, cumsum() comes in quite handy. As opposed to creating multiple lagged variables, this scales well to counts across a larger number of events:
library(tidyverse)
d <- my_data |>
  group_by(id) |> # group to cumulate within student only
  mutate(
    csum = cumsum(results), # cumulative sum of results
    # subtract the cumulative sum from 3 observations before; this gives the
    # number of exams passed in the current and previous 2 observations
    i = csum - lag(csum, 3, 0)
  )
# Ungroup to get global count
d |>
  ungroup() |>
  count(i == 3) # count the cases where the number of passes within 3 observations equals 3
#> # A tibble: 2 × 2
#> `i == 3` n
#> <lgl> <int>
#> 1 FALSE 14
#> 2 TRUE 1
# Retaining the group gives counts by student
d |>
  count(i == 3)
#> # A tibble: 3 × 3
#> # Groups: id [2]
#> id `i == 3` n
#> <dbl> <lgl> <int>
#> 1 1 FALSE 6
#> 2 2 FALSE 8
#> 3 2 TRUE 1
Since you provided the data as data.table, here is how to do the same in that ecosystem:
my_data[, csum := cumsum(results), by = .(id)]
my_data[, i := csum - shift(csum, 3, fill = 0), by = .(id)] # shift() is data.table's lag
my_data[, .(n_cases = sum(i == 3)), by = id]
#> id n_cases
#> 1: 1 0
#> 2: 2 1
Here's an approach using dplyr. It uses the lag function to look back 1 and 2 results. If the sum together with the current result is 3, then the condition is met. In the example you provided, the condition is only met once.
my_data %>%
  group_by(id) %>%
  mutate(threex = ifelse(results + lag(results, 1) + lag(results, 2) == 3, 1, 0)) %>%
  filter(!is.na(threex))
id results result_id threex
<dbl> <dbl> <dbl> <dbl>
1 1 0 3 0
2 1 1 4 0
3 1 0 5 0
4 1 0 6 0
5 2 1 3 1
6 2 0 4 0
7 2 1 5 0
8 2 1 6 0
9 2 0 7 0
10 2 1 8 0
11 2 0 9 0
If you then just want to capture the cases when the condition is met, add a filter.
my_data %>%
  group_by(id) %>%
  mutate(threex = ifelse(results + lag(results, 1) + lag(results, 2) == 3, 1, 0)) %>%
  filter(threex == 1)
id results result_id threex
<dbl> <dbl> <dbl> <dbl>
1 2 1 3 1
If you are looking to understand how many times the condition is met per id, you can do this.
my_data %>%
  group_by(id) %>%
  mutate(threex = ifelse(results + lag(results, 1) + lag(results, 2) == 3, 1, 0)) %>%
  filter(threex == 1) %>%
  select(id) %>%
  summarize(count = n())
id count
<dbl> <int>
1 2 1
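The last three steps can also be collapsed into a single summarize, which has the side benefit of keeping ids where the condition never occurs (a sketch):
my_data %>%
  group_by(id) %>%
  summarize(count = sum(results + lag(results, 1) + lag(results, 2) == 3,
                        na.rm = TRUE))
# id 1 gets count 0, id 2 gets count 1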

Reshape complex time-to-event data in R

I have the following data frame with the beginning time, the end time, and a Date on which the individual got observation A or B.
df =
id Date Start_Date End_Date A B
1 2 1 4 1 0
1 3 1 4 0 1
2 3 2 9 1 0
2 6 2 9 1 0
2 7 2 9 1 0
2 2 2 9 0 1
What I want to do is order the times chronologically (creating a new Time variable) and fill in the A and B information accordingly; that is, if the individual got A at time 2, it should also have A at the following times (i.e. 3 up to End_Date). The time points are not regular but follow the changes in Date (see individual 2):
Cool_df =
id Time A B
1 1 0 0
1 2 1 0
1 3 1 1
1 4 1 1
2 2 0 1
2 3 1 1
2 6 1 1
2 7 1 1
2 9 1 1
Any recommendation is highly appreciated, because I do not know where to start.
Here is a data.table approach
library(data.table)
setDT(df)
# Summarise dates
ans <- df[, .(Date = unique(c(min(Start_Date), Date, max(End_Date)))), by = .(id)]
# Join
ans[ df[A==1,], A := 1, on = .(id,Date)]
ans[ df[B==1,], B := 1, on = .(id,Date)]
#fill down NA's using "locf"
cols.to.fill = c("A","B")
ans[, (cols.to.fill) := lapply(.SD, nafill, type = "locf"),
by = .(id), .SDcols = cols.to.fill]
#fill other NA with zero
ans[is.na(ans)] <- 0
# id Date A B
# 1: 1 1 0 0
# 2: 1 2 1 0
# 3: 1 3 1 1
# 4: 1 4 1 1
# 5: 2 2 0 1
# 6: 2 3 1 1
# 7: 2 6 1 1
# 8: 2 7 1 1
# 9: 2 9 1 1
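If the column should be named Time as in Cool_df, a rename completes the job:
setnames(ans, "Date", "Time")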

How do I find the sum of inequalities by unique group/subgroup pairs?

Suppose I am working with the following data.table:
dta <- setDT(
  data.frame(
    id = c("A","A","A","B","B","C","C","C"),
    subid = c(1,1,2,1,2,1,1,1),
    x1 = c(1,1,3,1,2,3,3,3),
    x2 = c(3,3,1,1,1,3,3,3)
  )
)
> dta
id subid x1 x2
1: A 1 1 3
2: A 1 1 3
3: A 2 3 1
4: B 1 1 1
5: B 2 2 1
6: C 1 3 3
7: C 1 3 3
8: C 1 3 3
For each unique id-subid pairing, I would like to find the total number of times that x1<x2 and the total number of times that x1>=x2, and have those counts added to the data.table as new columns, aggregated to the id level.
The outcome would look something like:
id subid x1 x2 lt gt
1: A 1 1 3 1 1
2: A 1 1 3 1 1
3: A 2 3 1 1 1
4: B 1 1 1 0 2
5: B 2 2 1 0 2
6: C 1 3 3 0 1
7: C 1 3 3 0 1
8: C 1 3 3 0 1
For example, of the two unique id-subid pairings for id="A", one has x1<x2 and one has x1>x2, which means that for A the "less-than" variable has a value of 1 (i.e. dta$lt[dta$id=="A"] <- 1), and the same for "greater-than" (dta$gt[dta$id=="A"] <- 1).
I have been searching for a solution to this but have not had much luck. I have found solutions to similar problems (e.g. counting number of unique observations by unique pairings), but have not been able to modify them to suit my needs. In particular, I am struggling to aggregate the count from the id-subid level to the id level. (It could be that I'm not exactly sure how to search for -- or even word -- this question.)
I've been able to do this using nested loops on a data frame, but I suspect there is a more efficient way of doing it. In particular, I am curious about doing this using data.table.
A possible solution, where unique(.SD) ensures that each distinct id-subid row is counted only once within each id:
dta[, c('lt', 'gt') := unique(.SD)[, .(sum(x1 < x2), sum(x1 >= x2))], by = .(id)]
which gives:
> dta
id subid x1 x2 lt gt
1: A 1 1 3 1 1
2: A 1 1 3 1 1
3: A 2 3 1 1 1
4: B 1 1 1 0 2
5: B 2 2 1 0 2
6: C 1 3 3 0 1
7: C 1 3 3 0 1
8: C 1 3 3 0 1
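To see what unique(.SD) operates on, here are the deduplicated rows (a quick check, not part of the solution):
unique(dta[, .(id, subid, x1, x2)])
#    id subid x1 x2
# 1:  A     1  1  3
# 2:  A     2  3  1
# 3:  B     1  1  1
# 4:  B     2  2  1
# 5:  C     1  3  3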

Repeating loop and adding columns in R

I am trying to build R code that will take my loop and run it 20 times, each time adding a column to the existing data frame. Here I tried it by repeating the code 3 times, but I feel there must be an easier way to automate this. I am very grateful for any help.
My original data file (called "igel") contains two columns ("Year" and "Grid") and 1096 rows. In the loop I pick a random number from the column "Grid" and check whether it has been picked before. If so, it adds 0 to a new column; if not, it adds 1.
Here is the code:
library(tidyverse) # needed for %>% and add_row()
a <- data.frame(matrix(ncol = 2, nrow = 0))
x <- c("number", "count")
colnames(a) <- x
for (i in 1:1096) {
  num_i <- sample(igel$Grid, 1)
  count_i <- c(if (num_i %in% a$number == TRUE) {0} else {1})
  a <- a %>% add_row(number = num_i, count = count_i)
}
b <- data.frame(matrix(ncol = 2, nrow = 0))
x <- c("number", "count")
colnames(b) <- x
for (i in 1:1096) {
  num_i <- sample(igel$Grid, 1)
  count_i <- c(if (num_i %in% b$number == TRUE) {0} else {1})
  b <- b %>% add_row(number = num_i, count = count_i)
}
c <- data.frame(matrix(ncol = 2, nrow = 0))
x <- c("number", "count")
colnames(c) <- x
for (i in 1:1096) {
  num_i <- sample(igel$Grid, 1)
  count_i <- c(if (num_i %in% c$number == TRUE) {0} else {1})
  c <- c %>% add_row(number = num_i, count = count_i)
}
df.total <- cbind(a$count, b$count, c$count)
Consider sapply, and even its wrapper replicate, and calculate number and count separately with vectorized operations instead of growing an object row by row inside a loop.
# RUNS 3 SAMPLES OF igel$Grid 1,096 TIMES (ADJUST 3 TO ANY POSITIVE INT LIKE 20)
grid_number <- data.frame(replicate(3, replicate(1096, sample(igel$Grid, 1))))
# FOR EACH COLUMN, CHECK WHETHER THE CURRENT ROW VALUE APPEARED IN ANY EARLIER ROW
grid_count <- sapply(grid_number, function(col)
  sapply(seq_along(col), function(i)
    # seq_len(i - 1) is empty when i = 1, so the first draw always counts as new
    ifelse(col[i] %in% col[seq_len(i - 1)], 0, 1)
  )
)
While the above does not exactly reproduce your df.total (a matrix, not a data frame) due to the random sampling within iterations, the two have the same structure:
dim(df.total)
# [1] 1096 3
dim(grid_count)
# [1] 1096 3
Try to avoid iterating through rows. It is rarely necessary, if ever. Here is one approach (replace n with 1096 and elem with igel$Grid):
n = 20
elem = 1:5
df.total = list()
for (i in 1:5) {
  a = data.frame(number = sample(elem, n, replace=TRUE))
  a$count = as.numeric(duplicated(a$number))
  df.total[[i]] = a
}
df.total = as.data.frame(df.total)
df.total
## number count number.1 count.1 number.2 count.2 number.3 count.3 number.4 count.4
## 1 4 0 2 0 5 0 4 0 1 0
## 2 3 0 5 0 3 0 4 1 3 0
## 3 5 0 3 0 4 0 2 0 4 0
## 4 5 1 1 0 2 0 5 0 3 1
## 5 2 0 4 0 2 1 5 1 5 0
## 6 4 1 2 1 2 1 5 1 5 1
## 7 5 1 1 1 3 1 2 1 4 1
## 8 5 1 2 1 5 1 5 1 4 1
## 9 2 1 1 1 1 0 1 0 1 1
## 10 3 1 1 1 5 1 4 1 1 1
## 11 5 1 3 1 1 1 3 0 5 1
## 12 2 1 1 1 2 1 5 1 1 1
## 13 3 1 5 1 4 1 5 1 4 1
## 14 1 0 4 1 2 1 4 1 1 1
## 15 4 1 4 1 2 1 5 1 1 1
## 16 4 1 2 1 5 1 2 1 5 1
## 17 3 1 1 1 1 1 3 1 2 0
## 18 2 1 2 1 2 1 2 1 2 1
## 19 2 1 3 1 1 1 2 1 1 1
## 20 1 1 3 1 2 1 1 1 3 1
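Note that duplicated() flags the repeats with 1, which is the opposite of the convention in the question (1 for a first occurrence, 0 for a repeat). If the original coding is needed, negate it:
a$count = as.numeric(!duplicated(a$number))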

Most frequent values in sliding window dataframe in R

I have the following dataframe (df):
A B T Required col (window = 3)
1 1 0 1
2 3 0 3
3 4 0 4
4 2 1 1 4
5 6 0 0 2
6 4 1 1 0
7 7 1 1 1
8 8 1 1 1
9 1 0 0 1
I would like to add the required column as follows:
Insert in the current row the previous row's value of A or B.
If, in the last 3 (window) rows, the content of column A equals column T most of the time, choose A; otherwise choose B. (There can be more columns, in which case the column that most often equals T is chosen.)
What is the most efficient way to do this for a big data.table?
I changed the column named T to TC, to avoid confusion with T as an abbreviation for TRUE.
library(tidyverse)
library(data.table)
df[, newcol := {
  equal <- A == TC
  map(1:.N, ~ if (.x <= 3) NA
      else if (sum(equal[.x - 1:3]) > 3/2) A[.x - 1]
      else B[.x - 1])
}]
df
# N A B TC newcol
# 1: 1 1 0 1 NA
# 2: 2 3 0 3 NA
# 3: 3 4 0 4 NA
# 4: 4 2 1 1 4
# 5: 5 6 0 0 2
# 6: 6 4 1 1 0
# 7: 7 7 1 1 1
# 8: 8 8 1 1 1
# 9: 9 1 0 0 1
This works too, but it's less clear and likely less efficient:
df[, newcol := shift(A == TC, 1:3) %>%
     pmap_lgl(~ sum(...) > 3/2) %>%
     ifelse(shift(A), shift(B))]
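data.table's own rolling functions can express the same idea without purrr (a sketch; the window is hard-coded to 3 and the majority-vote choice is lagged by one row):
df[, newcol2 := shift(fifelse(frollsum(+(A == TC), 3) > 3/2, A, B))]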
data:
df <- fread("
N A B TC
1 1 0 1
2 3 0 3
3 4 0 4
4 2 1 1
5 6 0 0
6 4 1 1
7 7 1 1
8 8 1 1
9 1 0 0
")
Probably much less efficient than the answer by Ryan, but without additional packages.
A <- c(1,3,4,2,6,4,7,8,1)
B <- c(0,0,0,1,0,1,1,1,0)
TC <- c(1,3,4,1,0,1,1,1,0)
req <- rep(NA, 9)
df <- data.frame(A, B, TC, req)
window <- 3
for (i in window:(length(req) - 1)) {
  equal <- sum(df$A[(i - window + 1):i] == df$TC[(i - window + 1):i])
  if (equal > window/2) {
    df$req[i + 1] <- df$A[i]
  } else {
    df$req[i + 1] <- df$B[i]
  }
}
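For large data, the loop can be vectorized with stats::filter as a moving sum (a sketch using the vectors defined above):
equal  <- as.numeric(df$A == df$TC)
maj    <- as.vector(stats::filter(equal, rep(1, window), sides = 1)) > window/2
choice <- ifelse(maj, df$A, df$B)        # majority vote: take A, otherwise B
df$req <- c(NA, choice[-length(choice)]) # required value is the previous row's choice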
