calculating dataframe row combinations and matches with a separate column - r

I am trying to match all combinations of a dataframe (each combination reduces to a 1 or a 0 based on the sum) to another column and count the matches. I hacked this together but I feel like there is a better solution. Can someone suggest a better way to do this?
library(HapEstXXR)
test<-data.frame(a=c(0,1,0,1),b=c(1,1,1,1),c=c(0,1,0,1))
actual<-c(0,1,1,1)
ps<-powerset(1:dim(test)[2])
lapply(ps,function(x){
tt<-rowSums(test[,c(x)]) #Note: this fails when there is only one column
tt[tt>1]<-1 #if the sum is greater than 1 reduce it to 1
cbind(sum(tt==actual,na.rm=T),colnames(test)[x])
})
> test
a b c
1 0 1 0
2 1 1 1
3 0 1 0
4 1 1 1
Goal: compare all combinations of columns (order doesn't matter) to the actual column and see which one matches most often.
b c a ab ac bc abc actual
1 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1
1 0 0 0 0 0 0 1
1 1 1 1 1 1 1 1
matches:
a: 3
b: 3
c: 3
ab: 3
....

Your code seems fine to me, I just simplified it a little bit:
sapply(ps,function(x){
tt <- rowSums(test[,x,drop=FALSE]) > 0 # drop=FALSE keeps a data frame even when x selects a single column
colname <- paste(names(test)[x],collapse='')
setNames(sum(tt==actual,na.rm=TRUE), colname) # a named vector of length one
})
# a b ab c ac bc abc
# 3 3 3 3 3 3 3
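If you would rather not depend on HapEstXXR, the column subsets can also be built with base R's combn(). This is only a sketch of the same idea as the answer above; the names subsets and matches are made up for illustration, and the subsets come out ordered by size rather than in powerset() order.

```r
test <- data.frame(a = c(0, 1, 0, 1), b = c(1, 1, 1, 1), c = c(0, 1, 0, 1))
actual <- c(0, 1, 1, 1)

# All non-empty subsets of column indices, grouped by subset size
subsets <- unlist(
  lapply(seq_along(test), function(k) combn(seq_along(test), k, simplify = FALSE)),
  recursive = FALSE
)

# For each subset: collapse the columns to 0/1 (row-wise OR) and count agreements with `actual`
matches <- sapply(subsets, function(idx) {
  combined <- rowSums(test[, idx, drop = FALSE]) > 0
  sum(combined == actual)
})
names(matches) <- sapply(subsets, function(idx) paste(names(test)[idx], collapse = ""))
matches
```

Because of drop = FALSE, the single-column case that broke the original rowSums() call goes through the same code path as the multi-column ones.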

Related

If else Condition in R based on different columns and rows

I have a dataset with an ID column with multiple visits for every ID. I am trying to create a new variable Status, which will check the Visit column and Value column. The conditions are as follows
For visit in 1,2 & 3, if the values are 1,1,1 then 1
For visit in 1,2 & 3, if the values are 0,1,1 then 0
For visit in 1,2 & 3, if the values are 0,0,0 then 0
How do I specify this condition in R ?
Below is a sample dataset
ID Visit Value
1 1 1
1 2 1
1 3 1
2 1 1
2 2 0
2 3 0
3 1 0
3 2 0
3 3 0
4 1 0
4 2 1
4 3 1
Result dataset
ID Visit Value Status
1 1 1 1
1 2 1 1
1 3 1 1
2 1 1 0
2 2 0 0
2 3 0 0
3 1 0 0
3 2 0 0
3 3 0 0
4 1 0 0
4 2 1 0
4 3 1 0
I'd have tried something like this (suppose your initial table is called df):
status <- c()
for (i in 1:4) { # 1:4 are the IDs in the sample data
  # Status is 1 only when all three visits have Value 1 (sum == 3), otherwise 0
  if (sum(df[df$ID == i, 'Value']) == 3) {
    status <- c(status, rep(1, 3))
  } else {
    status <- c(status, rep(0, 3))
  }
}
df <- cbind(df, Status = status)
I hope that it will help you
I believe that case_when from the dplyr package is what you need to use. Here are more details on that function: https://dplyr.tidyverse.org/reference/case_when.html
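For comparison, here is a loop-free sketch in base R. It is not the case_when approach; it uses ave() to apply the rule per ID, and it assumes (as in the sample data) that Status should be 1 only when every Value for that ID is 1. The data frame is rebuilt from the sample table in the question.

```r
# Sample data from the question
df <- data.frame(
  ID    = rep(1:4, each = 3),
  Visit = rep(1:3, times = 4),
  Value = c(1, 1, 1,  1, 0, 0,  0, 0, 0,  0, 1, 1)
)

# ave() runs FUN within each ID group and recycles the single result to every row of the group
df$Status <- ave(df$Value, df$ID, FUN = function(v) as.numeric(all(v == 1)))
df
```

Unlike the loop, this does not hard-code the number of IDs or the number of visits per ID.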

How to flag duplicate values in r - newbie

I'm trying to flag duplicate IDs in another column. I don't necessarily want to remove them yet, just create an indicator (0/1) of whether each ID is unique or a duplicate. In SQL, it would be like this:
SELECT ID, count(ID) count from TABLE group by ID) a
On TABLE.ID = a.ID
set ID Duplicate Flag Column 1 = 1
where count > 1;
Is there a way to do this simply in R?
Any help would be greatly appreciated.
As an example of duplicated let's start with some values (numbers here, but strings would do the same thing)
x <- c(9, 1:5, 3:7, 0:8)
x
# 9 1 2 3 4 5 3 4 5 6 7 0 1 2 3 4 5 6 7 8
If you want to flag the second and later copies
as.numeric(duplicated(x))
# 0 0 0 0 0 0 1 1 1 0 0 0 1 1 1 1 1 1 1 0
If you want to flag all values that occur two or more times
as.numeric(x %in% x[duplicated(x)])
# 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0
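Applied to a data frame column, the "flag everything that occurs more than once" variant (the closest match to the count > 1 condition in the question) might look like the sketch below. The data frame and the column name dup_flag are made up for illustration.

```r
# Hypothetical data: IDs "a" and "b" each occur twice
df <- data.frame(ID = c("a", "b", "a", "c", "b", "d"))

# duplicated() marks later copies; fromLast = TRUE marks earlier copies.
# Combining the two flags every row whose ID occurs more than once.
df$dup_flag <- as.integer(duplicated(df$ID) | duplicated(df$ID, fromLast = TRUE))
df
```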

Comparing vectors values by shifting reading frame

I have "Y maze" sequence data containing the characters A, B, and C. I am trying to quantify the number of times those three values are found together. The data looks like this:
Animal=c(1,2,3,4,5)
VisitedZones=c(1,2,3,4,5)
data=data.frame(Animal, VisitedZones)
data[1,2]=("A,C,B,A,C,A,B,A,C,A,C,A,C,B,B,C,A,C,C,C")
data[2,2]=("A,C,B,A,C,A,B,A,C,A,C,A,C,B")
data[3,2]=("A,C,B,A,C,A,B,A,C,A")
data[4,2]=("A,C,B,A,C,A,A,A,B,A,C,A,C,A,C,B")
data[5,2]=("A,C,B,A,C,A,A,A,B,")
The tricky part is that I also have to consider the reading frame so that I can find all instances of ABC combinations. There are three reading frames. For example:
Here is the working example I have so far.
Split <- strsplit(data$VisitedZones, ",", fixed = TRUE)
## How long is each list element?
Ncol <- vapply(Split, length, 1L)
## Create an empty character matrix to store the results
M <- matrix(NA_character_, nrow = nrow(data),ncol = max(Ncol),
dimnames = list(NULL, paste0("V", sequence(max(Ncol)))))
## Use matrix indexing to figure out where to put the results
M[cbind(rep(1:nrow(data), Ncol),sequence(Ncol))] <- unlist(Split,
use.names = FALSE)
# Bind the values back together, here as a "data.table" (faster)
v2=data.table(Animal = data$Animal, M)
# I get error here
df=mutate(as.data.frame(v2),trio=paste0(v2,lead(v2),lead(v2,2)))
table(df$trio[1:(length(v2)-2)])
It would be great if I could get something like this:
Animal VisitedZones ABC ACB BCA BAC CAB CBA
1 A,B,C,A,B.C... 2 0 1 0 1 0
2 A,B,C,C... 1 0 0 0 0 0
3 A,C,B,A... 0 1 0 0 0 1
df<-mutate(as.data.frame(v2),trio=paste0(v2,lead(v2),lead(v2,2)))
table(df$trio[1:(length(v2)-2)])
Using dplyr, I generate for every letter in your vector the three-letter combination that starts from it, then create a table of frequencies of all found combinations (minus the last two, which are incomplete).
Result:
AAB ABC BCA CAA CAB
1 6 5 1 4
Your revised question is basically completely different, so I'll answer it here.
First, I would say your data structure doesn't make much sense to me, so I'll start out by reshaping it into something I can work with:
v2<-as.data.frame(t(v2))
Flip it over so the letters are in columns, not rows;
v2<-tidyr::gather(v2,"v","letter",na.rm=T)
Melt the table so it's long data (so that I'll be able to use lead etc).
v2<-group_by(v2,v)
df=mutate(v2,trio=paste0(letter,lead(letter),lead(letter,2)))
This brings us back basically to where we were at the end of the last question, only the data is grouped by the "animal" variable (here called "v" and represented by V1 thru V5).
df<-df[!grepl("NA",df$trio),]
Even though we removed the unnecessary NA's, we still end up having those pesky ABNA and ANANA etc at the end of each group, so this line will remove anything with an NA in it.
tt<-table(df$v,df$trio)
And finally, we create the table but also break it by "v". The result is this:
AAA AAB ABA ACA ACB ACC BAC BBC BCA CAA CAB CAC CBA CBB CCC
V1 0 0 1 3 2 1 2 1 1 0 1 3 1 1 1
V2 0 0 1 3 2 0 2 0 0 0 1 2 1 0 0
V3 0 0 1 2 1 0 2 0 0 0 1 0 1 0 0
V4 1 1 1 3 2 0 2 0 0 1 0 2 1 0 0
V5 1 1 0 1 1 0 1 0 0 1 0 0 1 0 0
You can now cbind it to your original data to get something like what you described, but it requires just an additional step, because of the way table saves its results:
data<-cbind(data,spread(as.data.frame(tt),Var2,Freq))[,-3]
Which ends up looking like this:
Animal VisitedZones AAA AAB ABA ACA ACB ACC BAC BBC BCA CAA CAB CAC CBA CBB CCC
1 1 A,C,B,A,C,A,B,A,C,A,C,A,C,B,B,C,A,C,C,C 0 0 1 3 2 1 2 1 1 0 1 3 1 1 1
2 2 A,C,B,A,C,A,B,A,C,A,C,A,C,B 0 0 1 3 2 0 2 0 0 0 1 2 1 0 0
3 3 A,C,B,A,C,A,B,A,C,A 0 0 1 2 1 0 2 0 0 0 1 0 1 0 0
4 4 A,C,B,A,C,A,A,A,B,A,C,A,C,A,C,B 1 1 1 3 2 0 2 0 0 1 0 2 1 0 0
5 5 A,C,B,A,C,A,A,A,B, 1 1 0 1 1 0 1 0 0 1 0 0 1 0 0
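If you only need the counts for one sequence at a time, the same sliding-window idea can be sketched in base R without reshaping anything. This is an alternative to the dplyr/tidyr steps above, not a restatement of them; count_trios is a made-up helper name.

```r
# Count overlapping three-letter windows (i.e. all reading frames) in one comma-separated string
count_trios <- function(s) {
  x <- strsplit(s, ",", fixed = TRUE)[[1]]
  x <- x[nzchar(x)]                       # drop empty pieces, if any
  n <- length(x)
  if (n < 3) return(table(character(0)))  # too short to contain a triplet
  table(paste0(x[1:(n - 2)], x[2:(n - 1)], x[3:n]))
}

tab <- count_trios("A,C,B,A,C,A,B,A,C,A")  # animal 3 from the question
tab
```

For animal 3 this gives the same non-zero counts as row V3 of the table above. Calling lapply(data$VisitedZones, count_trios) would give one such table per animal.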

Considering Combination of Vectors Using Regex in R

I was looking for some insight on how to tackle this problem.
For instance, let's say I have vectors A, B, C, and D and every possible combination between them. I want to write a generic function that would create a matrix like this:
A B C D A&B A&C A&D B&C B&D C&D
A&B 1 1 0 0 2 1 1 1 1 0
A&C 1 0 1 0 1 2 1 1 0 1
A&D 1 0 0 1 1 1 2 0 1 1
B&C 0 1 1 0 1 1 0 2 1 1
B&D 0 1 0 1 1 0 1 1 2 1
C&D 0 0 1 1 0 1 1 1 1 2
The matching combination would be assigned a value of 2, while others that contain either value would be assigned a value of 1.
For instance, for A&B, A&B would be assigned a 2 while other vectors carrying either A or B would be assigned 1.
Right now I was considering on using Regex to check for overlapping names and creating a for loop. Would there be an easier/simpler way to tackle the problem?
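Reading the example matrix, each cell appears to be the number of elements the row pair shares with the column (2 for the same pair, 1 for a partial overlap, 0 for none). Under that assumption, a regex is not needed; a sketch with combn() and intersect() could look like this. The names items, pairs, cols, and label are made up for illustration.

```r
items <- c("A", "B", "C", "D")

pairs <- combn(items, 2, simplify = FALSE)  # row entries: A&B, A&C, ...
cols  <- c(as.list(items), pairs)           # column entries: singles, then pairs
label <- function(s) paste(s, collapse = "&")

# Each cell is the size of the intersection between the row pair and the column entry
m <- sapply(cols, function(cl) sapply(pairs, function(p) length(intersect(p, cl))))
dimnames(m) <- list(sapply(pairs, label), sapply(cols, label))
m
```

Working on the underlying sets rather than on the names also avoids false matches if one vector name happens to be a substring of another.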

How to write new column conditional on grouped rows in R?

I have a data frame where each Item has three categories (a, b,c) and a numeric Answer for each category is recorded (either 0 or 1). I would like to create a new column contingent on the rows in the Answer column. This is how my data frame looks like:
Item <- rep(c(1:3), each=3)
Option <- rep(c('a','b','c'), times=3)
Answer <- c(1,1,0,1,0,1,1,1,1)
df <- data.frame(Item, Option, Answer)
Item Option Answer
1 1 a 1
2 1 b 1
3 1 c 0
4 2 a 1
5 2 b 0
6 2 c 1
7 3 a 1
8 3 b 1
9 3 c 1
What is needed: whenever the three categories in the Option column are 1, the New column should receive a 1. In any other case, the column should have a 0. The desired output should look like this:
Item Option Answer New
1 1 a 1 0
2 1 b 1 0
3 1 c 0 0
4 2 a 1 0
5 2 b 0 0
6 2 c 1 0
7 3 a 1 1
8 3 b 1 1
9 3 c 1 1
I tried to achieve this without using a loop, but I got stuck because I don't know how to make a new column contingent on a group of rows, not just a single one. I have tried this solution but it doesn't work if the rows are not grouped in pairs.
Do you have any suggestions? Thanks a bunch!
This should work:
library(dplyr)
df %>%
group_by(Item) %>%
mutate(New = as.numeric(all(as.logical(Answer)))) # 1 only if every Answer in the Item is 1
Using data.table:
library(data.table)
DT <- data.table(Item, Option, Answer)
DT[, Index := as.numeric(all(as.logical(Answer))), by = Item]
DT
Item Option Answer Index
1: 1 a 1 0
2: 1 b 1 0
3: 1 c 0 0
4: 2 a 1 0
5: 2 b 0 0
6: 2 c 1 0
7: 3 a 1 1
8: 3 b 1 1
9: 3 c 1 1
Or using only base R
df$Index <- with(df, +(ave(!!Answer, Item, FUN = all)))
df$Index
#[1] 0 0 0 0 0 0 1 1 1
