Counting pairs of column elements but removing duplicate values - R

I've got a dataset with event data in the below format
> order_df
# A tibble: 10 x 4
       H     M     B    FB
   <int> <dbl> <dbl> <dbl>
 1     1     1     0     0
 2     1     1     0     0
 3     1     0     1     0
 4     1     0     0     1
 5     0     1     0     0
 6     0     0     0     0
 7     0     1     1     1
 8     0     0     1     0
 9     0     1     0     0
10     0     0     0     0
I'd like to show it as a matrix pairs, which I can achieve with the below code
> order_matrix = as.matrix(order_df)
> pair_matrix <- crossprod(order_matrix)
> pair_matrix
   H M B FB
H  4 2 1  1
M  2 5 1  1
B  1 1 3  1
FB 1 1 1  2
However, the diagonal entries (e.g. M:M) count every row of the original dataframe in which that column contains a 1, whereas I'd like each of those values to count only the rows where ONLY that column contains a 1.
In the example above I'd like the H:H pair to be 0, as every instance of H also included another column. The M:M pair would be 1, as only 1 instance included only M.

I'm a little confused about the output here, since if the only rows that are counted are rows with a single 1 in them, then the resulting matrix will only have entries on the diagonal. In other words, it would be better to return a vector than the matrix.
You also say in your question that M should be 1, since it only appears on its own once. It actually appears on its own twice (row 5 and row 9).
You can get the result you need by removing all rows with a row sum of more than one then taking the column sums:
colSums(as.matrix(order_df[rowSums(order_df) == 1, ]))
#>  H  M  B FB
#>  0  2  1  0
and if you check carefully, this is correct.
If you really want the result in a matrix, just remove the rows with more than one value and take the cross product of that:
crossprod(as.matrix(order_df[rowSums(order_df) == 1, ]))
#>    H M B FB
#> H  0 0 0  0
#> M  0 2 0  0
#> B  0 0 1  0
#> FB 0 0 0  0
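If the goal is to keep the off-diagonal pair counts from crossprod() and only change the diagonal to the "solo" counts, the two ideas can be combined by overwriting the diagonal. A minimal sketch, rebuilding order_df from the question:

```r
# Rebuild order_df from the question
order_df <- data.frame(
  H  = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0),
  M  = c(1, 1, 0, 0, 1, 0, 1, 0, 1, 0),
  B  = c(0, 0, 1, 0, 0, 0, 1, 1, 0, 0),
  FB = c(0, 0, 0, 1, 0, 0, 1, 0, 0, 0)
)
order_matrix <- as.matrix(order_df)
pair_matrix  <- crossprod(order_matrix)  # all pairwise co-occurrences

# Overwrite the diagonal with counts of rows where ONLY that column is 1
solo_rows <- order_matrix[rowSums(order_matrix) == 1, , drop = FALSE]
diag(pair_matrix) <- colSums(solo_rows)
pair_matrix
#>    H M B FB
#> H  0 2 1  1
#> M  2 2 1  1
#> B  1 1 1  1
#> FB 1 1 1  0
```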

Summarizing/counting multiple binary variables

For the purpose of this question, my data set includes 16 columns (c1_d, c2_d, ..., c16_d) and 364 rows (1-364). This is what it briefly looks like:
c1_d c2_d c3_d c4_d c5_d c6_d c7_d c8_d c9_d c10_d c11_d c12_d c13_d c14_d c15_d c16_d
1 1 0 0 1 0 0 0 1 0 1 0 0 1 0 0 0
2 1 1 0 1 1 1 0 1 1 1 1 0 1 0 0 0
3 1 1 0 1 1 1 1 1 0 1 1 0 1 0 1 0
4 0 0 0 0 0 1 0 1 0 0 1 0 0 0 1 0
5 1 0 1 1 1 1 0 1 0 1 1 0 0 0 1 0
Please note that for example row 1, has five 1s and 11 0s.
This is what I'm trying to do: basically, count how many rows have how many 1s (i.e. by the end of this analysis I want something like: 20 rows had zero 1s, 33 rows had one 1, 100 rows had ten 1s, etc.).
I created a data frame including all the rows (364) and columns (16) I needed. I tried using the print.data.frame function, and its results are shown above, but it doesn't give me the number of 0s and 1s per row. I also tried functions such as table, ftable, and xtabs, but they don't really work for more than three variables.
I would highly appreciate your help on this.
If I understand correctly:
library(dplyr)
library(tidyr)
df %>%
  transmute(count0 = rowSums(df == 0),
            count1 = rowSums(df == 1)) %>%
  pivot_longer(everything()) %>%
  count(name, value)
   name   value     n
   <chr>  <dbl> <int>
 1 count0     5     1
 2 count0     6     1
 3 count0     7     1
 4 count0    11     1
 5 count0    12     1
 6 count1     4     1
 7 count1     5     1
 8 count1     9     1
 9 count1    10     1
10 count1    11     1
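Since every value is 0 or 1, a shorter base-R route to the same tally is to tabulate the row sums directly; this sketch uses only the five sample rows shown in the question:

```r
# Five sample rows from the question (16 binary columns)
df <- as.data.frame(rbind(
  c(1,0,0,1,0,0,0,1,0,1,0,0,1,0,0,0),
  c(1,1,0,1,1,1,0,1,1,1,1,0,1,0,0,0),
  c(1,1,0,1,1,1,1,1,0,1,1,0,1,0,1,0),
  c(0,0,0,0,0,1,0,1,0,0,1,0,0,0,1,0),
  c(1,0,1,1,1,1,0,1,0,1,1,0,0,0,1,0)
))
# "number of 1s per row" -> "how many rows have that many"
table(rowSums(df))
#>  4  5  9 10 11
#>  1  1  1  1  1
```

The count of 0s per row follows for free, since it is always 16 minus the row sum.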

ifelse replace value if it is lower than previous

I am working with a dataset which has some errors in it; numbers are sometimes registered wrong. Here is a toy data example:
The issue is that the Reversal column should only be counting up (per unique ID). So in a vector of 0,0,0,1,1,1,0,1,2,2,0,0,2,3, the 0's following the 1 and 2 should not be 0's. Instead, they should be equal to whatever value came before. I tried to remedy this by using the lag function from the dplyr package:
Data$Reversal <- ifelse(Data$Reversal < lag(Data$Reversal), lag(Data$Reversal), Data$Reversal)
But this results in numerous issues:
1. The first value becomes NA. I've tried using default = Data$Reversal in the lag call, but to no avail.
2. The Reversal value should reset to 0 for each unique ID, but now it continues across IDs. I tried a messy version using group_by(ID) but could not get it to work, as it broke my earlier ifelse call.
3. This only works when there is a single error; if there are two errors in a row, only the first value gets fixed.
Alternatively, I found this thread in which the answer provided by Andrie also seems promising. This fixes problem 1 and 3, but I can't get this code to work per ID (using the group_by function).
Andrie's answer:
local({
  r <- rle(data)
  x <- r$values
  x0 <- which(x == 0)                # index positions of zeroes
  xt <- x[x0 - 1] == x[x0 + 1]       # zeroes surrounded by the same value
  r$values[x0[xt]] <- x[x0[xt] - 1]  # substitute with the surrounding value
  inverse.rle(r)
})
Any help would be much appreciated.
I think cummax does exactly what you need.
Base R
dat$Reversal <- ave(dat$Reversal, dat$ID, FUN = cummax)
dat
#    ID Owner Reversal Success
# 1   1     A        0       0
# 2   1     A        0       0
# 3   1     A        0       0
# 4   1     B        1       1
# 5   1     B        1       0
# 6   1     B        1       0
# 7   1 error        1       0
# 8   1 error        1       0
# 9   1     B        1       0
# 10  1     B        1       0
# 11  1     C        1       1
# 12  1     C        2       0
# 13  1 error        2       0
# 14  1     C        2       0
# 15  1     C        3       1
# 16  2     J        0       0
# 17  2     J        0       0
dplyr
dat %>%
  group_by(ID) %>%
  mutate(Reversal = cummax(Reversal)) %>%
  ungroup()
data.table
as.data.table(dat)[, Reversal := cummax(Reversal), by = .(ID)][]
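As a quick sanity check, on the example vector from the question cummax() carries the running maximum forward, overwriting the erroneous dips back to 0:

```r
# The vector of Reversal values given in the question
x <- c(0, 0, 0, 1, 1, 1, 0, 1, 2, 2, 0, 0, 2, 3)
cummax(x)
#> [1] 0 0 0 1 1 1 1 1 2 2 2 2 2 3
```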
Data, courtesy of https://extracttable.com/
dat <- read.table(header = TRUE, text = "
ID Owner Reversal Success
1 A 0 0
1 A 0 0
1 A 0 0
1 B 1 1
1 B 1 0
1 B 1 0
1 error 0 0
1 error 0 0
1 B 1 0
1 B 1 0
1 C 1 1
1 C 2 0
1 error 0 0
1 C 2 0
1 C 3 1
2 J 0 0
2 J 0 0")

calculating dataframe row combinations and matches with a separate column

I am trying to match all combinations of a dataframe (each combination reduces to a 1 or a 0 based on the sum) to another column and count the matches. I hacked this together but I feel like there is a better solution. Can someone suggest a better way to do this?
library(HapEstXXR)
test<-data.frame(a=c(0,1,0,1),b=c(1,1,1,1),c=c(0,1,0,1))
actual<-c(0,1,1,1)
ps<-powerset(1:dim(test)[2])
lapply(ps, function(x){
  tt <- rowSums(test[, c(x)])  # Note: this fails when there is only one column
  tt[tt > 1] <- 1              # if the sum is greater than 1, reduce it to 1
  cbind(sum(tt == actual, na.rm = TRUE), colnames(test)[x])
})
> test
a b c
1 0 1 0
2 1 1 1
3 0 1 0
4 1 1 1
Goal: compare all combinations of columns (order doesn't matter) to the actual column and see which matches most:
b c a ab ac bc abc actual
1 0 0  0  0  0   0      0
1 1 1  1  1  1   1      1
1 0 0  0  0  0   0      1
1 1 1  1  1  1   1      1
matches:
a: 3
b: 3
c: 3
ab: 3
....
Your code seems fine to me, I just simplified it a little bit:
sapply(ps, function(x){
  tt <- rowSums(test[, x, drop = FALSE]) > 0
  colname <- paste(names(test)[x], collapse = '')
  setNames(sum(tt == actual, na.rm = TRUE), colname)  # named vector of length one
})
#   a   b  ab   c  ac  bc abc
#   3   3   3   3   3   3   3
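As an aside, if HapEstXXR is hard to install (it has not always been available on CRAN), the power set of column indices can be built in base R with combn(); the powerset() below is a stand-in for the package function, not part of the original answer:

```r
# Stand-in for HapEstXXR::powerset(): all non-empty subsets of 1..n
powerset <- function(n) {
  unlist(lapply(seq_len(n), function(k) combn(n, k, simplify = FALSE)),
         recursive = FALSE)
}

test   <- data.frame(a = c(0, 1, 0, 1), b = c(1, 1, 1, 1), c = c(0, 1, 0, 1))
actual <- c(0, 1, 1, 1)

ps  <- powerset(ncol(test))
res <- sapply(ps, function(x) {
  tt <- rowSums(test[, x, drop = FALSE]) > 0       # reduce the subset to 1/0
  setNames(sum(tt == actual, na.rm = TRUE),        # count matches vs actual
           paste(names(test)[x], collapse = ""))
})
res
#>   a   b   c  ab  ac  bc abc
#>   3   3   3   3   3   3   3
```

The subsets come out in a slightly different order than HapEstXXR's (grouped by size), but the counts are the same.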

calculating sum of all against all columns with matching row count

I have a df with several columns having values 0 or 1. Something like:
a b c d e
1 0 0 0 0
0 1 0 1 0
0 1 0 1 0
1 0 1 0 1
I would like to create a 5 by 5 matrix showing total count if columns have 1 in same row. I only want to consider 1's and in case of diagonal it would automatically reflect total row in that column with 1. Output something like:
  a b c d e
a 2 0 1 0 1
b 0 2 0 2 0
c 1 0 1 0 1
d 0 2 0 2 0
e 1 0 1 0 1
Convert to matrix and take cross product:
m <- as.matrix(d)
crossprod(m,m)
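A reproducible version with the data from the question; note that crossprod(m) with a single argument computes the same t(m) %*% m as crossprod(m, m):

```r
# The 4 x 5 binary data from the question
d <- data.frame(a = c(1, 0, 0, 1),
                b = c(0, 1, 1, 0),
                c = c(0, 0, 0, 1),
                d = c(0, 1, 1, 0),
                e = c(0, 0, 0, 1))
m <- as.matrix(d)
crossprod(m)  # co-occurrence counts; diagonal = per-column totals
#>   a b c d e
#> a 2 0 1 0 1
#> b 0 2 0 2 0
#> c 1 0 1 0 1
#> d 0 2 0 2 0
#> e 1 0 1 0 1
```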

How to exclude cases that do not repeat X times in R?

I have long-format, unbalanced longitudinal data and would like to exclude all cases that do not contain complete information, i.e. all cases that do not repeat 8 times. Can someone help me find a solution?
Below is an example: I have three subjects {A, B, and C}, with 8 observations each for A and B but only 2 for C. How can I delete the rows in which C is present, given that it has fewer than 8 repeated measurements?
temp = scan()
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
A 1 1 1 1
A 0 1 0 0
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
B 1 1 1 1
B 0 1 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
C 1 1 1 1
C 0 1 0 0
Any help?
Assuming your variable names are V1, V2... and so on, here's one approach:
temp[temp$V1 %in% names(which(table(temp$V1) == 8)), ]
The table(temp$V1) == 8 matches the values in the V1 column that have exactly 8 cases. The names(which(... part creates a basic character vector that we can match using %in%.
And another:
temp[ave(as.character(temp$V1), temp$V1, FUN = length) == "8", ]
Here's another approach:
temp <- read.table(text="
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
A 1 1 1 1
A 0 1 0 0
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
B 1 1 1 1
B 0 1 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
C 1 1 1 1
C 0 1 0 0", header=FALSE)
do.call(rbind,
        Filter(function(subgroup) nrow(subgroup) == 8,
               split(temp, temp[[1]])))
split breaks the data.frame up by its first column, then Filter drops the subgroups that don't have 8 rows. Finally, do.call(rbind, ...) collapses the remaining subgroups back into a single data.frame.
If the first column of temp is character (rather than factor, which you can verify with str(temp)) and the rows are ordered by subgroup, you could also do:
with(rle(temp[[1]]), temp[rep(lengths==8, times=lengths), ])
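For completeness, the same filter is a short dplyr pipeline (group by the ID column, keep groups of exactly 8 rows); this sketch rebuilds a minimal stand-in for the data rather than reusing the full read.table block:

```r
library(dplyr)

# Minimal stand-in for the question's data: 8 rows for A and B, only 2 for C
temp <- data.frame(V1 = rep(c("A", "B", "C"), times = c(8, 8, 2)),
                   V2 = 1)

kept <- temp %>%
  group_by(V1) %>%
  filter(n() == 8) %>%  # keep subjects with exactly 8 rows
  ungroup()

table(kept$V1)
#> A B
#> 8 8
```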
