Considering Combination of Vectors Using Regex in R

I was looking for some insight on how to tackle this problem.
For instance, let's say I have vectors A, B, C, and D and every possible combination between them. I want to write a generic function that would create a matrix like this:
    A B C D A&B A&C A&D B&C B&D C&D
A&B 1 1 0 0   2   1   1   1   1   0
A&C 1 0 1 0   1   2   1   1   0   1
A&D 1 0 0 1   1   1   2   0   1   1
B&C 0 1 1 0   1   1   0   2   1   1
B&D 0 1 0 1   1   0   1   1   2   1
C&D 0 0 1 1   0   1   1   1   1   2
The matching combination is assigned a value of 2, while any other column that contains either element is assigned a value of 1.
For instance, in the A&B row, the A&B column gets a 2, while every other column containing either A or B gets a 1.
Right now I am considering using regex to check for overlapping names inside a for loop. Is there an easier or simpler way to tackle the problem?
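One possible way to build this matrix without regex is to treat each row and column label as a set of elements and count how many elements they share. The sketch below assumes the four vectors can be represented by their names; `items`, `pairs`, `cols`, and `m` are illustrative names, not from the question.

```r
# Names standing in for the vectors (assumption: only the labels matter here)
items <- c("A", "B", "C", "D")

# All pairwise combinations, kept as a list of character vectors
pairs <- combn(items, 2, simplify = FALSE)
pair_names <- vapply(pairs, paste, character(1), collapse = "&")

# Columns are the single items followed by the pairs
cols <- c(as.list(items), pairs)

# Each cell is the number of elements shared by the row pair and the column set
m <- sapply(cols, function(col) {
  vapply(pairs, function(p) length(intersect(p, col)), integer(1))
})
dimnames(m) <- list(pair_names, c(items, pair_names))
m
```

Because the value is simply the size of the intersection, an exact match yields 2, a partial overlap yields 1, and no overlap yields 0, which matches the rule described above.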

Related

Create a co-occurrence matrix from a single column of observations

I have a series of data frames, each with an individual identifier (in this example a letter A-E), and the site number it was observed at.
In this example, I have 3 data frames:
Letters<-c("A","B","C","D","E")
Site1<-c(1,1,2,2,2)
Site2<-c(10,10,20,30,30)
Site3<-c(17,27,37,47,57)
Df1<-data.frame(Letters, Site1)
Df2<-data.frame(Letters, Site2)
Df3<-data.frame(Letters, Site3)
For the first one, it ends up looking like this:
Df1
  Letters Site1
1       A     1
2       B     1
3       C     2
4       D     2
5       E     2
Individuals A and B were found at site 1, and individuals C, D, and E were found at site 2.
I'm looking for a way to track which individuals are found at the same site within a single data frame (note that the site numbers change each time, so I only care about within-data-frame groupings).
I'm assuming I would create an individual co-occurrence matrix for each data frame, containing only a 1 or a 0 to indicate whether two individuals overlapped. The last step would then be to add the matrices up, like so:
DF1 co-occurrence
  A B C D E
A 1 1 0 0 0
B 1 1 0 0 0
C 0 0 1 1 1
D 0 0 1 1 1
E 0 0 1 1 1
DF2 co-occurrence
  A B C D E
A 1 1 0 0 0
B 1 1 0 0 0
C 0 0 1 0 0
D 0 0 0 1 1
E 0 0 0 1 1
DF3 co-occurrence
  A B C D E
A 1 0 0 0 0
B 0 1 0 0 0
C 0 0 1 0 0
D 0 0 0 1 0
E 0 0 0 0 1
And then add them up to see who is most often grouped with whom:
  A B C D E
A 3 2 0 0 0
B 2 3 0 0 0
C 0 0 3 1 1
D 0 0 1 3 2
E 0 0 1 2 3
I'm not sure how to implement this kind of workflow in R, or whether this is even the best way to approach the problem. My hope is to end up with a matrix like the last one above, or some similar way to quantify total co-occurrence.
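The workflow described above could be sketched in base R roughly as follows. This is only one possible approach: it assumes every data frame lists the same individuals in the same order, and the names `sites`, `cooc`, and `total` are illustrative.

```r
Letters <- c("A", "B", "C", "D", "E")

# Site columns from the three example data frames
sites <- list(
  Site1 = c(1, 1, 2, 2, 2),
  Site2 = c(10, 10, 20, 30, 30),
  Site3 = c(17, 27, 37, 47, 57)
)

# One 0/1 co-occurrence matrix per data frame:
# outer() compares every individual's site with every other individual's site
cooc <- lapply(sites, function(s) {
  m <- outer(s, s, "==") * 1
  dimnames(m) <- list(Letters, Letters)
  m
})

# Element-wise sum of the per-data-frame matrices
total <- Reduce(`+`, cooc)
total
```

The diagonal of `total` equals the number of data frames, since each individual always co-occurs with itself.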

I need help putting values from one vector into another in R

I have two vectors in R
Vector 1
0 0 0 0 0 0 0 0 0 0
Vector 2
1 1 3 1 1 1 1 1
I need to put the values from vector 2 into vector 1 but into specific positions so that vector 1 becomes
1 1 3 0 0 1 1 1 1 1
I need to do this in one line of code. I tried doing:
vector1[1:3,6:10] = vector2[1:3,4:8]
but I am getting the error "incorrect number of dimensions".
Is it possible to do this?
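Yes. A vector has only one dimension, so a comma inside [ ] is read as a second dimension; wrap the positions in c() to build a single index vector:
vector1[c(1:3,6:10)] = vector2[c(1:3,4:8)]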
> vector1
[1] 1 1 3 0 0 1 1 1 1 1
Alternatively, use negative indexing to fill every position except 4 and 5:
vector1[-(4:5)] <- vector2
vector1
[1] 1 1 3 0 0 1 1 1 1 1

Comparing vectors values by shifting reading frame

I have "Y maze" sequence data containing the characters A, B, and C. I am trying to quantify the number of times those three values are found together. The data looks like this:
Animal=c(1,2,3,4,5)
VisitedZones=c(1,2,3,4,5)
data=data.frame(Animal, VisitedZones)
data[1,2]=("A,C,B,A,C,A,B,A,C,A,C,A,C,B,B,C,A,C,C,C")
data[2,2]=("A,C,B,A,C,A,B,A,C,A,C,A,C,B")
data[3,2]=("A,C,B,A,C,A,B,A,C,A")
data[4,2]=("A,C,B,A,C,A,A,A,B,A,C,A,C,A,C,B")
data[5,2]=("A,C,B,A,C,A,A,A,B,")
The tricky part is that I also have to consider the reading frame so that I can find all instances of ABC combinations. There are three reading frames, for example:
Here is the working example I have so far.
library(data.table)
library(dplyr)
Split <- strsplit(data$VisitedZones, ",", fixed = TRUE)
## How long is each list element?
Ncol <- vapply(Split, length, 1L)
## Create an empty character matrix to store the results
M <- matrix(NA_character_, nrow = nrow(data), ncol = max(Ncol),
            dimnames = list(NULL, paste0("V", sequence(max(Ncol)))))
## Use matrix indexing to figure out where to put the results
M[cbind(rep(1:nrow(data), Ncol), sequence(Ncol))] <- unlist(Split,
                                                            use.names = FALSE)
# Bind the values back together, here as a "data.table" (faster)
v2 <- data.table(Animal = data$Animal, M)
# I get error here
df <- mutate(as.data.frame(v2), trio = paste0(v2, lead(v2), lead(v2, 2)))
table(df$trio[1:(length(v2)-2)])
It would be great if I could get something like this:
Animal VisitedZones   ABC ACB BCA BAC CAB CBA
1      A,B,C,A,B,C...   2   0   1   0   1   0
2      A,B,C,C...       1   0   0   0   0   0
3      A,C,B,A...       0   1   0   0   0   1
df<-mutate(as.data.frame(v2),trio=paste0(v2,lead(v2),lead(v2,2)))
table(df$trio[1:(length(v2)-2)])
Using dplyr, I generate for every letter in your vector the three-letter combination that starts from it, then create a table of frequencies of all found combinations (minus the last two, which are incomplete).
Result:
AAB ABC BCA CAA CAB
  1   6   5   1   4
Your revised question is basically completely different, so I'll answer it here.
First, I would say your data structure doesn't make much sense to me, so I'll start out by reshaping it into something I can work with:
v2<-as.data.frame(t(v2))
Flip it over so the letters are in columns, not rows.
v2<-tidyr::gather(v2,"v","letter",na.rm=TRUE)
Melt the table into long format (so that I'll be able to use lead etc.).
v2<-group_by(v2,v)
df<-mutate(v2,trio=paste0(letter,lead(letter),lead(letter,2)))
This brings us back to basically where we were at the end of the last question, only now the data is grouped by the "animal" variable (here called "v" and represented by V1 through V5).
df<-df[!grepl("NA",df$trio),]
Even though we removed the unnecessary NAs, each group still ends with those pesky ABNA, ANANA, etc., so this line removes anything with an NA in it.
tt<-table(df$v,df$trio)
And finally, we create the table, broken down by "v". The result is this:
   AAA AAB ABA ACA ACB ACC BAC BBC BCA CAA CAB CAC CBA CBB CCC
V1   0   0   1   3   2   1   2   1   1   0   1   3   1   1   1
V2   0   0   1   3   2   0   2   0   0   0   1   2   1   0   0
V3   0   0   1   2   1   0   2   0   0   0   1   0   1   0   0
V4   1   1   1   3   2   0   2   0   0   1   0   2   1   0   0
V5   1   1   0   1   1   0   1   0   0   1   0   0   1   0   0
You can now cbind it to your original data to get something like what you described, but it requires just an additional step, because of the way table saves its results:
data<-cbind(data,spread(as.data.frame(tt),Var2,Freq))[,-3]
Which ends up looking like this:
  Animal VisitedZones                            AAA AAB ABA ACA ACB ACC BAC BBC BCA CAA CAB CAC CBA CBB CCC
1      1 A,C,B,A,C,A,B,A,C,A,C,A,C,B,B,C,A,C,C,C   0   0   1   3   2   1   2   1   1   0   1   3   1   1   1
2      2 A,C,B,A,C,A,B,A,C,A,C,A,C,B               0   0   1   3   2   0   2   0   0   0   1   2   1   0   0
3      3 A,C,B,A,C,A,B,A,C,A                       0   0   1   2   1   0   2   0   0   0   1   0   1   0   0
4      4 A,C,B,A,C,A,A,A,B,A,C,A,C,A,C,B           1   1   1   3   2   0   2   0   0   1   0   2   1   0   0
5      5 A,C,B,A,C,A,A,A,B,                        1   1   0   1   1   0   1   0   0   1   0   0   1   0   0
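For comparison, the same per-animal triplet counts could probably be obtained in base R without reshaping, by sliding a window of three over each split sequence. This is only a sketch under the assumption that the zones are stored as comma-separated strings; `zones`, `trio_list`, and `animal` are illustrative names.

```r
zones <- c(
  "A,C,B,A,C,A,B,A,C,A,C,A,C,B,B,C,A,C,C,C",
  "A,C,B,A,C,A,B,A,C,A,C,A,C,B",
  "A,C,B,A,C,A,B,A,C,A",
  "A,C,B,A,C,A,A,A,B,A,C,A,C,A,C,B",
  "A,C,B,A,C,A,A,A,B,"
)

# For each animal, paste every letter with the two letters that follow it
trio_list <- lapply(strsplit(zones, ",", fixed = TRUE), function(s) {
  n <- length(s)
  if (n < 3) return(character(0))  # too short to contain a triplet
  paste0(s[1:(n - 2)], s[2:(n - 1)], s[3:n])
})

# Label each triplet with the animal it came from, then cross-tabulate
animal <- rep(seq_along(trio_list), lengths(trio_list))
tt <- table(animal = animal, trio = unlist(trio_list))
tt
```

Since the window only runs to position n - 2, no incomplete triplets are produced and there is nothing containing NA to filter out afterwards.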

calculating dataframe row combinations and matches with a separate column

I am trying to match all combinations of a dataframe (each combination reduces to a 1 or a 0 based on the sum) to another column and count the matches. I hacked this together but I feel like there is a better solution. Can someone suggest a better way to do this?
library(HapEstXXR)
test<-data.frame(a=c(0,1,0,1),b=c(1,1,1,1),c=c(0,1,0,1))
actual<-c(0,1,1,1)
ps<-powerset(1:dim(test)[2])
lapply(ps,function(x){
  tt<-rowSums(test[,c(x)]) #Note: this fails when there is only one column
  tt[tt>1]<-1 #if the sum is greater than 1 reduce it to 1
  cbind(sum(tt==actual,na.rm=T),colnames(test)[x])
})
> test
  a b c
1 0 1 0
2 1 1 1
3 0 1 0
4 1 1 1
Goal: compare every combination of columns (order doesn't matter) to the actual column and see which one matches most often.
b c a ab ac bc abc actual
1 0 0  0  0  0   0      0
1 1 1  1  1  1   1      1
1 0 0  0  0  0   0      1
1 1 1  1  1  1   1      1
matches:
a: 3
b: 3
c: 3
ab: 3
....
Your code seems fine to me; I just simplified it a little:
sapply(ps,function(x){
  tt <- rowSums(test[,x,drop=FALSE]) > 0
  colname <- paste(names(test)[x],collapse='')
  setNames(sum(tt==actual,na.rm=TRUE), colname) # make a named vector of length one
})
#   a   b  ab   c  ac  bc abc
#   3   3   3   3   3   3   3
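If the HapEstXXR package is not available, the subsets can also be enumerated with base R's `combn`. The following is a sketch of that substitution, not the original answer's method; `subsets` and `matches` are illustrative names, and the subsets come out ordered by size rather than in the `powerset` order shown above.

```r
test <- data.frame(a = c(0, 1, 0, 1), b = c(1, 1, 1, 1), c = c(0, 1, 0, 1))
actual <- c(0, 1, 1, 1)

# Every non-empty subset of column names, grouped by subset size
subsets <- unlist(
  lapply(seq_along(test), function(k) combn(names(test), k, simplify = FALSE)),
  recursive = FALSE
)

# For each subset: collapse the row sums to 0/1 and count agreements with `actual`
matches <- vapply(subsets, function(cols) {
  combined <- rowSums(test[, cols, drop = FALSE]) > 0  # drop = FALSE keeps one-column subsets as data frames
  sum(combined == actual)
}, integer(1))
names(matches) <- vapply(subsets, paste, character(1), collapse = "")
matches
```

Passing `drop = FALSE` is what avoids the single-column failure noted in the question's code.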

How to exclude cases that do not repeat X times in R?

I have unbalanced longitudinal data in long format. I would like to exclude all cases that do not contain complete information, by which I mean all cases that are not repeated 8 times. Can someone help me find a solution?
Below is an example: I have three subjects (A, B, and C). I have 8 observations each for A and B, but only 2 for C. How can I delete the rows for C on the grounds that it has fewer than 8 repeated measurements?
temp = scan()
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
A 1 1 1 1
A 0 1 0 0
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
B 1 1 1 1
B 0 1 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
C 1 1 1 1
C 0 1 0 0
Any help?
Assuming your variable names are V1, V2... and so on, here's one approach:
temp[temp$V1 %in% names(which(table(temp$V1) == 8)), ]
The table(temp$V1) == 8 matches the values in the V1 column that have exactly 8 cases. The names(which(... part creates a basic character vector that we can match using %in%.
And another:
temp[ave(as.character(temp$V1), temp$V1, FUN = length) == "8", ]
Here's another approach:
temp <- read.table(text="
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
A 1 1 1 1
A 0 1 0 0
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
B 1 1 1 1
B 0 1 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
C 1 1 1 1
C 0 1 0 0", header=FALSE)
do.call(rbind,
        Filter(function(subgroup) nrow(subgroup) == 8,
               split(temp, temp[[1]])))
split breaks the data.frame up by its first column, then Filter drops the subgroups that don't have 8 rows. Finally, do.call(rbind, ...) collapses the remaining subgroups back into a single data.frame.
If the first column of temp is character (rather than factor, which you can verify with str(temp)) and the rows are ordered by subgroup, you could also do:
with(rle(temp[[1]]), temp[rep(lengths==8, times=lengths), ])
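As a further variation on the `ave` idea above, the group sizes can be computed on a row index instead of a character vector, which keeps the comparison numeric. This is a minimal sketch with a reduced stand-in data set; the `V2` column and the names `n_per_id` and `complete` are illustrative assumptions.

```r
# Stand-in data: 8 rows each for A and B, only 2 rows for C
temp <- data.frame(
  V1 = rep(c("A", "B", "C"), times = c(8, 8, 2)),
  V2 = 1
)

# For every row, the number of rows that share its V1 value
n_per_id <- ave(seq_along(temp$V1), temp$V1, FUN = length)

# Keep only the subjects with exactly 8 measurements
complete <- temp[n_per_id == 8, ]
table(complete$V1)
```

Unlike the `rle` version, this does not require the rows to be ordered by subject.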
