How do I match a group of observations with a dyad? - r

Say I have a data frame with a list of names and the companies they have as clients:
name <- c("Anne", "Anne", "Mary", "Mary", "Mary", "Joe", "Joe", "Joe", "David", "David", "David", "David", "David")
company <- c("A", "B", "C", "D", "E", "A", "B", "C", "D", "E", "F", "G", "H")
df1 <- data.frame(name, company)
Then I have a second data frame where I have companies who are working together on projects:
company1 <- c("A", "B", "C", "D", "E", "F", "G", "H")
company2 <- c("B", "C", "E", "E", "G", "A", "B", "C")
df2 <- data.frame(company1, company2)
My preferred outcome would be something like this:
name A B C D E F G No of sets
1 Anne 1 1 0 0 0 0 0 1
2 David 0 0 0 1 1 1 1 1
3 Joe 1 1 1 0 0 0 0 2
4 Mary 0 0 1 1 1 0 0 1
So this counts the number of "sets" that match the sets in df2. For example, Anne has A and B with 1s, and it matches row 1 in df2. Joe has A, B, C, and both A and B and B and C are rows in df2, thus Joe's row has two matches.

I think this might work for you; let me know. The result differs from your expected output because you left out column H, which I presume was a typo. Likewise, shouldn't Mary's No_of_sets also equal 2?
# Tabulate the frequency of name x company combinations
r <- as.data.frame.matrix(table(df1$name, df1$company))
r
#> A B C D E F G H
#> Anne 1 1 0 0 0 0 0 0
#> David 0 0 0 1 1 1 1 1
#> Joe 1 1 1 0 0 0 0 0
#> Mary 0 0 1 1 1 0 0 0
# Get "sets" of companies working together
s <- paste(df2$company1, df2$company2)
s
#> [1] "A B" "B C" "C E" "D E" "E G" "F A" "G B" "H C"
# Get all potential company sets associated with each name
m <- apply(r, MARGIN = 1, FUN = function(x) combn(names(which(x==1)), 2))
# Intersect sets of companies potentially working together (m) with
# companies actually working together (df2)
# (You could use a nested apply here, but I thought that it
# would be too opaque. Looping is a little more clear.)
for(name in rownames(r)){
  pairs <- m[[name]]
  ppairs <- apply(pairs, 2, paste0, collapse = " ")
  r[which(rownames(r)==name),"No_of_sets"] <- length(intersect(ppairs, s))
}
r
#> A B C D E F G H No_of_sets
#> Anne 1 1 0 0 0 0 0 0 1
#> David 0 0 0 1 1 1 1 1 2
#> Joe 1 1 1 0 0 0 0 0 2
#> Mary 0 0 1 1 1 0 0 0 2
Created on 2021-10-19 by the reprex package (v2.0.1)
Edit: Let's say there's a chance that one name isn't working with more than one company. In that case, you'd need to add a conditional to account for this in both steps. First, new data... notice that the name "Solo" is only working with one company.
r
#> A B C D E F G H
#> Anne 1 1 0 0 0 0 0 0
#> David 0 0 0 1 1 1 1 1
#> Joe 1 1 1 0 0 0 0 0
#> Mary 0 0 1 1 1 0 0 0
#> Solo 1 0 0 0 0 0 0 0
m <- apply(r, MARGIN = 1, FUN = function(x)
  if(length(names(which(x==1)))>1) {
    combn(names(which(x==1)), 2)
  } else names(which(x==1))
)
m
#> $Anne
#> [,1]
#> [1,] "A"
#> [2,] "B"
#>
#> $David
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] "D" "D" "D" "D" "E" "E" "E" "F" "F" "G"
#> [2,] "E" "F" "G" "H" "F" "G" "H" "G" "H" "H"
#>
#> $Joe
#> [,1] [,2] [,3]
#> [1,] "A" "A" "B"
#> [2,] "B" "C" "C"
#>
#> $Mary
#> [,1] [,2] [,3]
#> [1,] "C" "C" "D"
#> [2,] "D" "E" "E"
#>
#> $Solo
#> [1] "A"
for(name in rownames(r)){
  pairs <- m[[name]]
  if(length(pairs)>1){
    ppairs <- apply(pairs, 2, paste0, collapse = " ")
  } else ppairs <- pairs
  r[which(rownames(r)==name),"No_of_sets"] <- length(intersect(ppairs, s))
}
r
#> A B C D E F G H No_of_sets
#> Anne 1 1 0 0 0 0 0 0 1
#> David 0 0 0 1 1 1 1 1 2
#> Joe 1 1 1 0 0 0 0 0 2
#> Mary 0 0 1 1 1 0 0 0 2
#> Solo 1 0 0 0 0 0 0 0 0
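One caveat with the approach above: combn() returns each person's pairs in alphabetical order, so a df2 row such as F, A is stored as "F A" and could never match "A F". That does not change the result for this data, but it might for other inputs. A minimal sketch of an order-insensitive alternative, assuming a "set" simply means a df2 row whose two companies are both clients of the same person (the names clients, count_sets and no_of_sets are made up for illustration):

```r
name <- c("Anne", "Anne", "Mary", "Mary", "Mary", "Joe", "Joe", "Joe",
          "David", "David", "David", "David", "David")
company <- c("A", "B", "C", "D", "E", "A", "B", "C", "D", "E", "F", "G", "H")
df1 <- data.frame(name, company)
df2 <- data.frame(company1 = c("A", "B", "C", "D", "E", "F", "G", "H"),
                  company2 = c("B", "C", "E", "E", "G", "A", "B", "C"))

# One character vector of client companies per person
clients <- split(as.character(df1$company), as.character(df1$name))

# Count the df2 rows whose two companies both appear in a person's client list;
# the order of company1/company2 does not matter here
count_sets <- function(cl) sum(df2$company1 %in% cl & df2$company2 %in% cl)

no_of_sets <- vapply(clients, count_sets, integer(1))
no_of_sets
```

Because this never calls combn(), a person with a single company needs no special case and simply gets 0. Note that duplicated rows in df2 would each be counted, unlike with intersect().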

Related

Create a dynamic variable that performs a logical test on a vector and returns a vector

I am trying to create a variable that checks the first five elements of the vector X and returns a vector of ones and zeros: one if the element equals X[i] and zero if not.
Something similar to =((X1:$X5=X1)*1) in Excel.
X <- c("A","B","C","D","E","F","G","H")
#i tried this but i had errors
T <- length(X)
vec <- rep(0,T)
for(i in 1:T){
j <- 5
if(T-i<=j){
j <- T-i}
else{j <- 5}
if(X[i] == X[i:(i+j)]){
vec[i] <- 1}
else{
vec[i] <- 0
}
}
Perhaps you are looking for
(X[1:5] == X[1]) * 1
#> [1] 1 0 0 0 0
For example:
X <- c("A", "B", "A", "D", "A")
(X[1:5] == X[1]) * 1
#> [1] 1 0 1 0 1
EDIT
For a list of vectors matching the criteria you could do:
X <- c("A", "A", "B", "C", "A", "B", "C", "E", "C", "A")
h <- 5
lapply(seq(length(X) - h), function(i) (X[i:(i + h)] == X[i]) * 1)
#> [[1]]
#> [1] 1 1 0 0 1 0
#>
#> [[2]]
#> [1] 1 0 0 1 0 0
#>
#> [[3]]
#> [1] 1 0 0 1 0 0
#>
#> [[4]]
#> [1] 1 0 0 1 0 1
#>
#> [[5]]
#> [1] 1 0 0 0 0 1
Created on 2020-08-20 by the reprex package (v0.3.0)
We could use rollapply
library(zoo)
h <- 5
+(rollapply(X, width = h + 1, FUN = function(x) x[1] == x))
# [,1] [,2] [,3] [,4] [,5] [,6]
#[1,] 1 1 0 0 1 0
#[2,] 1 0 0 1 0 0
#[3,] 1 0 0 1 0 0
#[4,] 1 0 0 1 0 1
#[5,] 1 0 0 0 0 1
Or using embed from base R
m1 <- embed(c(X, X), h + 1)[1:h, (h+1):1]
+(m1 == m1[,1])
# [,1] [,2] [,3] [,4] [,5] [,6]
#[1,] 1 1 0 0 1 0
#[2,] 1 0 0 1 0 0
#[3,] 1 0 0 1 0 0
#[4,] 1 0 0 1 0 1
#[5,] 1 0 0 0 0 1
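For reference, the loop in the question fails because if() needs a single TRUE/FALSE, whereas X[i] == X[i:(i+j)] is a whole vector of comparisons. A minimal sketch of a corrected loop, assuming the intent was one 0/1 vector per starting position with the window shrinking near the end of X (res and last are illustrative names):

```r
X <- c("A", "A", "B", "C", "A", "B", "C", "E", "C", "A")
h <- 5
n <- length(X)

res <- vector("list", n)
for (i in seq_len(n)) {
  last <- min(i + h, n)                      # do not run past the end of X
  res[[i]] <- as.integer(X[i:last] == X[i])  # whole comparison vector, no if()
}
res[[1]]
```

Unlike the lapply() version, this also returns the shorter windows for the last h positions; drop those elements if only full windows are wanted.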

how to transform data frame to simultaneous equations in R

I have this matrix.
mat<-c("A","NODATA","NODATA","NODATA","A","B","C","NODATA","A","B","C","NODATA","D","E","A","NODATA","D","B","A","NODATA")
mat2 <- matrix(mat, nrow = 4, ncol = 5)
mat3<-t(mat2)
colnames(mat3)<-c("col1","col2","col3","col4")
mat3
col1 col2 col3 col4
[1,] "A" "NODATA" "NODATA" "NODATA"
[2,] "A" "B" "C" "NODATA"
[3,] "A" "B" "C" "NODATA"
[4,] "D" "E" "A" "NODATA"
[5,] "D" "B" "A" "NODATA"
I want to convert it into an indicator table like the one below in R.
A B C D E NODATA
1 0 0 0 0 1
1 1 1 0 0 1
1 1 1 0 0 1
1 0 0 1 1 1
1 1 0 1 0 1
Do you have any idea how to do this?
Thank you.
library(dplyr)
library(tidyr)  # pivot_wider() comes from tidyr, not dplyr
data.frame(rows=seq_len(nrow(mat3))[row(mat3)], values=c(mat3)) %>%
  mutate(a=1) %>%
  pivot_wider(id_cols="rows", names_from="values", values_from="a", values_fn=list(a=length)) %>%
  mutate_all(~ +!is.na(.)) %>%
  select(-rows) %>%
  select(sort(colnames(.)))
# # A tibble: 5 x 6
# A B C D E NODATA
# <int> <int> <int> <int> <int> <int>
# 1 1 0 0 0 0 1
# 2 1 1 1 0 0 1
# 3 1 1 1 0 0 1
# 4 1 0 0 1 1 1
# 5 1 1 0 1 0 1
The first line (data.frame(...)) was suggested by https://stackoverflow.com/a/26838774/3358272.
Here is a base R approach. We first create a matrix of zeros with one row per row of the original matrix and one column per unique value in it. Then we convert the original matrix to pairs of "coordinates" (row, value pairs) that indicate where a 1 should be placed, and assign 1 at those positions.
mat3_pairs <- cbind(c(row(mat3)), c(mat3))
new_mat <- matrix(rep(0, length(unique(mat3_pairs[,2])) * nrow(mat3)), nrow = nrow(mat3))
colnames(new_mat) <- sort(unique(mat3_pairs[,2]))
rownames(new_mat) <- as.character(1:nrow(mat3))
new_mat[mat3_pairs] <- 1
new_mat
Output
A B C D E NODATA
1 1 0 0 0 0 1
2 1 1 1 0 0 1
3 1 1 1 0 0 1
4 1 0 0 1 1 1
5 1 1 0 1 0 1
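The same "row, value" idea can also be sketched with table(), which cross-tabulates row numbers against cell values; anything counted at least once becomes a 1. This is only an alternative sketch, and the names tab and ind are made up for illustration:

```r
mat3 <- matrix(c("A", "NODATA", "NODATA", "NODATA",
                 "A", "B", "C", "NODATA",
                 "A", "B", "C", "NODATA",
                 "D", "E", "A", "NODATA",
                 "D", "B", "A", "NODATA"),
               nrow = 5, ncol = 4, byrow = TRUE)

# Count how often each value occurs in each row of mat3
tab <- table(c(row(mat3)), c(mat3))

# Turn the counts into 0/1 indicators, keeping the row and column names
ind <- (unclass(tab) > 0) * 1L
ind
```

The columns come out in sorted order because table() sorts the factor levels it builds from the values.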

Transform event list dataframe in adjacency dataframe

I have a df in which every column represents an event and the cells contain the individuals, like this:
df=data.frame(topic1=c("a", "b","c", "d"), topic2=c("e","f", "g", "a"), topic3=c("b","c","g","h"))
I need to transform it into an adjacency df, like this:
topic1 topic2 topic3
a 1 1 0
b 1 0 1
c 1 0 1
d 1 0 0
e 0 1 0
f 0 1 0
g 0 1 1
h 0 0 1
THX!
Form levs containing the levels in sorted order and then for each column of df determine which levs are in it. This gives a logical matrix which we can convert to numeric using +.
levs <- sort(unique(unlist(df))) # a b c d e f g h
+ sapply(df, function(x) levs %in% x)
giving:
topic1 topic2 topic3
[1,] 1 1 0
[2,] 1 0 1
[3,] 1 0 1
[4,] 1 0 0
[5,] 0 1 0
[6,] 0 1 0
[7,] 0 1 1
[8,] 0 0 1
The last line could be written even more compactly as:
+ sapply(df, `%in%`, x = levs)
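Note that the matrix above has no row names, while the requested output is labelled a to h. A small sketch of how the labels could be attached and the result turned back into a data frame (adj and adj_df are illustrative names; as.character() is only there so the code also works if the columns are factors):

```r
df <- data.frame(topic1 = c("a", "b", "c", "d"),
                 topic2 = c("e", "f", "g", "a"),
                 topic3 = c("b", "c", "g", "h"))

levs <- sort(unique(as.character(unlist(df))))

# Logical membership matrix, converted to 0/1 with unary +
adj <- +sapply(df, function(x) levs %in% x)

# Label the rows with the individuals and convert to a data frame
rownames(adj) <- levs
adj_df <- as.data.frame(adj)
adj_df
```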

Summing labels line-section by line in r

I have a large dataframe of 34,000 rows x 24 columns, each of which contains a category label. I would like to efficiently go through the dataframe and count how many times each label appears in each section of a row, including 0s.
(I've used a for loop driving a length(which(...)) statement, which wasn't terribly efficient.)
Example:
df.test<-as.data.frame(rbind(c("A", "B", "C","B","A","A"),c("C", "C", "C","C","C","C"), c("A", "B", "B","A","A","A")))
df.res<-as.data.frame(matrix(ncol=6, nrow=3))
Let's say columns 1:3 in df.test are from one dataset, 4:6 from the other. What is the most efficient way to generate df.res to show this:
A B C A B C
1 1 1 2 1 0
0 0 3 0 0 3
1 2 0 3 0 0
One way, using a lot of *apply calls, is the following:
#list with the different data frames
df_ls <- sapply(seq(1, ncol(df.test), 3), function(x) df.test[,x:(x+2)], simplify = F)
#count each category
df.res <- do.call(cbind,
lapply(df_ls, function(df.) { t(apply(df., 1,
function(x) { table(factor(unlist(x), levels = c("A", "B", "C"))) })) }))
#> df.res
# A B C A B C
#[1,] 1 1 1 2 1 0
#[2,] 0 0 3 0 0 3
#[3,] 1 2 0 3 0 0
Simulating a dataframe like the one you described:
DF <- data.frame(replicate(24, sample(LETTERS[1:3], 34000, T)), stringsAsFactors = F)
#> head(DF)
# X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22 X23 X24
#1 B C C C B A C B B A C C B C B B B C B C C B B C
#2 C B C A B C B C A B A C B B A A C A B B B C A B
#3 B C C A A A C A C A A A B B A A A C B B A C C C
#4 C C A B A B B B A A A C C A B A C C A C C C B A
#5 B B A A A A C A B B A B B A C A A A C A A C B C
#6 C A C C A B B C C C B C A B B B B B A C A A B A
#> dim(DF)
#[1] 34000 24
DF_ls <- sapply(seq(1, ncol(DF), 3), function(x) DF[,x:(x+2)], simplify = F)
system.time(
DF.res <- do.call(cbind,
lapply(DF_ls, function(df.) { t(apply(df., 1,
function(x) { table(factor(unlist(x), levels = c("A", "B", "C"))) })) })))
#user system elapsed
#59.84 0.07 60.73
#> head(DF.res)
# A B C A B C A B C A B C A B C A B C A B C A B C
#[1,] 0 1 2 1 1 1 0 2 1 1 0 2 0 2 1 0 2 1 0 1 2 0 2 1
#[2,] 0 1 2 1 1 1 1 1 1 1 1 1 1 2 0 2 0 1 0 3 0 1 1 1
#[3,] 0 1 2 3 0 0 1 0 2 3 0 0 1 2 0 2 0 1 1 2 0 0 0 3
#[4,] 1 0 2 1 2 0 1 2 0 2 0 1 1 1 1 1 0 2 1 0 2 1 1 1
#[5,] 1 2 0 3 0 0 1 1 1 1 2 0 1 1 1 3 0 0 2 0 1 0 1 2
#[6,] 1 0 2 1 1 1 0 1 2 0 1 2 1 2 0 0 3 0 2 0 1 2 1 0
EDIT Some more comments on the approach.
I'll do the above step by step.
The first step is to split out the different dataframes that are bound together; each of those dataframes is put in a list. The function function(x) df.test[,x:(x+2)] subsets the whole dataframe for each value of x in seq(1, ncol(df.test), 3), and simplify = F makes sapply return a list. Extending this, if your different dataframes were 4 columns wide, 3 would be changed to 4 in the above sequence (and x+2 to x+3).
#> df_ls <- sapply(seq(1, ncol(df.test), 3), function(x) df.test[,x:(x+2)], simplify = F)
#> df_ls
#[[1]]
# V1 V2 V3
#1 A B C
#2 C C C
#3 A B B
#[[2]]
# V4 V5 V6
#1 B A A
#2 C C C
#3 A A A
The next step is to lapply over the previously made list a function that counts each category in each row of one dataframe (i.e. element of the list). The function is this: t(apply(df., 1, function(x) { table(factor(unlist(x), levels = c("A", "B", "C"))) })). The inner function (function(x)) turns one row into a factor whose levels are all the categories and counts (table) the number of times each category occurred in that row. apply applies this function to each row (MARGIN = 1) of the dataframe. So now we have counted the frequency of each category in each row of one dataframe.
#> table(factor(unlist(df_ls[[1]][3,]), levels = c("A", "B", "C")))
#df_ls[[1]][3,] is the third row of the first dataframe of df_ls
#(i.e. _one_ row of _one_ dataframe)
#A B C
#1 2 0
#> apply(df_ls[[1]], 1,
#+ function(x) { table(factor(unlist(x), levels = c("A", "B", "C"))) })
# [,1] [,2] [,3] #df_ls[[1]] is the first dataframe of df_ls (i.e. _one_ dataframe)
#A 1 0 1
#B 1 0 2
#C 1 3 0
Because the return value of apply is not in the wanted form, we use t to swap rows with columns.
The next step is to lapply all the above to each dataframe (i.e. element of the list).
#> lapply(df_ls, function(df.) { t(apply(df., 1,
#+ function(x) { table(factor(unlist(x), levels = c("A", "B", "C"))) })) })
#[[1]]
# A B C
#[1,] 1 1 1
#[2,] 0 0 3
#[3,] 1 2 0
#[[2]]
# A B C
#[1,] 2 1 0
#[2,] 0 0 3
#[3,] 3 0 0
The last step is to cbind all those elements together. The way to bind by column all the elements of a list is to do.call cbind on that list.
#NOT the expected, using only cbind
#> cbind(lapply(df_ls, function(df.) { t(apply(df., 1,
#+ function(x) { table(factor(unlist(x), levels = c("A", "B", "C"))) })) }))
# [,1]
#[1,] Integer,9
#[2,] Integer,9
#Correct!
#> do.call(cbind, lapply(df_ls, function(df.) { t(apply(df., 1,
#+ function(x) { table(factor(unlist(x), levels = c("A", "B", "C"))) })) }))
# A B C A B C
#[1,] 1 1 1 2 1 0
#[2,] 0 0 3 0 0 3
#[3,] 1 2 0 3 0 0
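Since table() is called once per row in the approach above, a vectorised variant may be worth trying on the full 34,000-row data. The following is only a sketch, not timed here: for each block of columns it compares the whole block with one label at a time and sums the matches per row with rowSums(). The names levs, width, starts, count_block and df.fast are made up for illustration.

```r
df.test <- as.data.frame(rbind(c("A", "B", "C", "B", "A", "A"),
                               c("C", "C", "C", "C", "C", "C"),
                               c("A", "B", "B", "A", "A", "A")))

levs   <- c("A", "B", "C")                   # labels to count, zeros included
width  <- 3                                  # columns per original dataset
starts <- seq(1, ncol(df.test), by = width)  # first column of each block

count_block <- function(s) {
  block <- as.matrix(df.test[, s:(s + width - 1)])
  # One vector of per-row counts for each label
  counts <- lapply(levs, function(l) rowSums(block == l))
  out <- do.call(cbind, counts)
  colnames(out) <- levs
  out
}

df.fast <- do.call(cbind, lapply(starts, count_block))
df.fast
```

The work per block is three whole-matrix comparisons instead of one table() call per row, which is why it is likely to scale better.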

R: loop / function to create a matrix for comparison (contrasts)

I have the following type of data, i.e. combinations of factors:
P1 <- c("a", "a", "a", "a", "b", "b", "b", "c", "c", "d")
P2 <- c("a", "b", "c", "d", "b", "c", "d", "c", "d", "d")
myfactors <- data.frame(P1, P2)
P1 P2
1 a a
2 a b
3 a c
4 a d
5 b b
6 b c
7 b d
8 c c
9 c d
10 d d
In the real world there might be any number of factor levels, so I am trying to write a function that is applicable to any number of levels. I want to set contrasts for all combinations available in the data set, for example in this data set a-b, a-c, a-d, b-c, b-d, c-d. The contrast rule, for example for "a-b", is:
if P1 = P2 = a or P1 = P2 = b, the coefficient = -1,
if P1 = a, P2 = b or P1 = b, P2 = a, the coefficient = 2,
else the coefficient = 0
The output coefficient matrix will look like the following:
P1 P2 a-b a-c a-d b-c b-d c-d
a  a   -1  -1  -1   0   0   0
a  b    2   0   0   0   0   0
a  c    0   2   0   0   0   0
a  d    0   0   2   0   0   0
b  b   -1   0   0  -1  -1   0
b  c    0   0   0   2   0   0
b  d    0   0   0   0   2   0
c  c    0  -1   0  -1   0  -1
c  d    0   0   0   0   0   2
d  d    0   0  -1   0  -1  -1
Since the function I have in mind should be flexible, if I apply it to the following dataset,
P1 <- c("CI", "CI", "CI", "CD", "CD", "CK", "CK")
P2 <- c("CI", "CD", "CK", "CD", "CK", "CK", "CI")
mydf2 <- data.frame(P1, P2)
mydf2
P1 P2
1 CI CI
2 CI CD
3 CI CK
4 CD CD
5 CD CK
6 CK CK
7 CK CI
The expected coefficient matrix for this dataframe is:
P1 P2 CI-CD CI-CK CD-CK CK-CI
CI CI -1 -1 0 -1
CI CD 2 0 0 0
CI CK 0 2 0 0
CD CD -1 0 -1 0
CD CK 0 0 2 0
CK CK 0 -1 -1 -1
CK CI 0 0 0 2
I tried several ways but could not arrive at a working program.
EDITS:
(1) I am not testing all possible combinations; only the combinations that appear in P1 and P2 are tested.
(2) I intend to develop a solution not only for this instance but for general application, for example to the myfactors dataframe above.
You didn't supply a reason for your particular choice of the 6 ordered combinations of P1 and P2 values, so I just ran through them all:
combos <- cbind( combn(unique(c(P2, P1)), 2), combn(unique(c(P2, P1)), 2)[2:1, ])
combos
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "CI" "CI" "CD" "CD" "CK" "CK"
[2,] "CD" "CK" "CK" "CI" "CI" "CD"
As I went through the logic it seemed more compact to test for conditions 1) and 2) and just use Boolean math to return the results. If both conditions are untrue you get 0. I've checked the entries that do not match yours and I think your construction was wrong in spots. You have 0 in the "CI-CK" row 7 and I think the answer by your rules should be 2:
sapply(1:ncol(combos), function(x) with( mydf2,
2*( (P1==combos[1,x] & P2 == combos[2,x]) | (P2==combos[1,x] & P1 == combos[2,x])) -
(P1 == P2 & P1 %in% combos[,x]) ) )
#---------------
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] -1 -1 0 -1 -1 0
[2,] 2 0 0 2 0 0
[3,] 0 2 0 0 2 0
[4,] -1 0 -1 -1 0 -1
[5,] 0 0 2 0 0 2
[6,] 0 -1 -1 0 -1 -1
[7,] 0 2 0 0 2 0
#------------------
mydf2[ , 3:8] <- sapply(1:ncol(combos), function(x) with( mydf2,
2*( (P1==combos[1,x] & P2 == combos[2,x]) | (P2==combos[1,x] & P1 == combos[2,x])) -
(P1 == P2 & P1 %in% combos[,x]) ) )
mydf2
#-----------------
P1 P2 CI-CD CI-CK CD-CK CD-CI CK-CI CK-CD
1 CI CI -1 -1 0 -1 -1 0
2 CI CD 2 0 0 2 0 0
3 CI CK 0 2 0 0 2 0
4 CD CD -1 0 -1 -1 0 -1
5 CD CK 0 0 2 0 0 2
6 CK CK 0 -1 -1 0 -1 -1
7 CK CI 0 2 0 0 2 0
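The assignment above does not by itself produce the "CI-CD"-style column names, and it builds every ordered combination rather than only those present in the data, as requested in edit (1) of the question. A sketch of a general helper that addresses both, assuming the columns are always called P1 and P2 (make_contrasts is a hypothetical name, not an existing function):

```r
make_contrasts <- function(d) {
  p1 <- as.character(d$P1)
  p2 <- as.character(d$P2)

  # Distinct (P1, P2) pairs with P1 != P2, in order of first appearance
  off   <- p1 != p2
  pairs <- unique(cbind(p1[off], p2[off]))

  out <- matrix(0, nrow = nrow(d), ncol = nrow(pairs))
  for (k in seq_len(nrow(pairs))) {
    a <- pairs[k, 1]
    b <- pairs[k, 2]
    # 2 for the cross a x b (either order), -1 for a x a or b x b, else 0
    out[, k] <- 2 * ((p1 == a & p2 == b) | (p1 == b & p2 == a)) -
      (p1 == p2 & p1 %in% c(a, b))
  }
  colnames(out) <- paste(pairs[, 1], pairs[, 2], sep = "-")
  cbind(d, out)
}

P1 <- c("CI", "CI", "CI", "CD", "CD", "CK", "CK")
P2 <- c("CI", "CD", "CK", "CD", "CK", "CK", "CI")
mydf2 <- data.frame(P1, P2)
res <- make_contrasts(mydf2)
res
```

Because both CI-CK and CK-CI occur in mydf2, both columns are produced and are identical under the stated rule; sorting each pair before calling unique() would collapse them into one.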
