create a scoring matrix from two dataframes - r

I am trying to compare sets of variables (X) that are stored in two dataframes (foo, bar). Each X is a unique independent variable that has up to 10 values of Y associated with it. I would like to compare every foo.X with every bar.X by counting the Y values they have in common, so the output could be a matrix with one row per unique foo.x and one column per unique bar.x.
For this simple example of foo and bar, I would want a 2x2 matrix comparing a,b with c,d:
foo <- data.frame(x= c('a', 'a', 'a', 'b', 'b', 'b'), y=c('ab', 'ac', 'ad', 'ae', 'fx', 'fy'))
bar <- data.frame(x= c('c', 'c', 'c', 'd', 'd', 'd'), y=c('ab', 'xy', 'xz', 'xy', 'fx', 'xz'))
EDIT:
I've left the following code for other newbies to learn from (for loops are effective but probably very suboptimal), but the two solutions below both work. In particular, Ramnath's use of data.table is very efficient when dealing with very large dataframes.
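Store the dataframes as lists, where the values of y for each x are stored using the stack function:
library(plyr)
# Assumes y is a character column (the data.frame() default since R 4.0)
foo.list <- dlply(foo, .(x), function(x) stack(x, select = y))
bar.list <- dlply(bar, .(x), function(x) stack(x, select = y))
Write a function for comparing membership in two stacked lists:
comparelists <- function(list1, list2) {
  # Initialise the counter once, outside the loops, so it is not reset
  count <- 0
  for (i in list1$values) {
    for (j in list2$values) {
      if (i == j) count <- count + 1
    }
  }
  return(count)
}
Write an output matrix:
# One row per group in foo, one column per group in bar
output.matrix <- matrix(0, nrow = length(foo.list), ncol = length(bar.list),
                        dimnames = list(names(foo.list), names(bar.list)))
for (i in names(foo.list)) {
  for (j in names(bar.list)) {
    output.matrix[i, j] <- comparelists(foo.list[[i]], bar.list[[j]])
  }
}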

There must be a hundred ways to do this; here is one that feels relatively straightforward to me:
library(reshape2)
foo <- data.frame(x = c('a', 'a', 'a', 'b', 'b', 'b'),
                  y = c('ab', 'ac', 'ad', 'ae', 'fx', 'fy'))
bar <- data.frame(x = c('c', 'c', 'c', 'd', 'd', 'd'),
                  y = c('ab', 'xy', 'xz', 'xy', 'fx', 'xz'))
# Create a function that counts the number of common elements in two groups
nShared <- function(A, B) {
  length(intersect(with(foo, y[x==A]), with(bar, y[x==B])))
}
# Enumerate all combinations of groups in foo and bar
(combos <- expand.grid(foo.x=unique(foo$x), bar.x=unique(bar$x)))
#   foo.x bar.x
# 1     a     c
# 2     b     c
# 3     a     d
# 4     b     d
# Find number of elements in common among all pairs of groups
combos$n <- mapply(nShared, A=combos$foo.x, B=combos$bar.x)
# Reshape results into matrix form
# Name the value column explicitly so dcast does not have to guess it
dcast(combos, foo.x ~ bar.x, value.var = "n")
#   foo.x c d
# 1     a 1 0
# 2     b 0 1

Here is a simpler approach using merge:
library(reshape2)
df1 <- merge(foo, bar, by = 'y')
dcast(df1, x.x ~ x.y, length)
#   x.x c d
# 1   a 1 0
# 2   b 0 1
EDIT: The merge can be faster using data.table. Here is the code:
library(data.table)
foo_dt <- data.table(foo, key = 'y')
bar_dt <- data.table(bar, key = 'y')
# Keyed join on y; nomatch = 0 drops rows without a match (an inner join)
df1 <- bar_dt[foo_dt, nomatch = 0]
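For comparison, the same shared-count matrix can probably be built in base R alone. This is only a sketch and does not come from either answer: it assumes the foo and bar data frames from the question, and the names foo_u, bar_u, all_y, foo_tab, bar_tab and shared are made up for illustration. Each table marks which y values belong to which x group, and crossprod() multiplies the two so that each cell counts the y values a pair of groups has in common.

```r
foo <- data.frame(x = c('a', 'a', 'a', 'b', 'b', 'b'),
                  y = c('ab', 'ac', 'ad', 'ae', 'fx', 'fy'))
bar <- data.frame(x = c('c', 'c', 'c', 'd', 'd', 'd'),
                  y = c('ab', 'xy', 'xz', 'xy', 'fx', 'xz'))

# Drop repeated (x, y) rows so each shared value is counted once per pair
foo_u <- unique(foo)
bar_u <- unique(bar)

# Use one common set of y values so the rows of both tables line up
all_y <- union(as.character(foo_u$y), as.character(bar_u$y))

# Incidence tables: rows are y values, columns are x groups, cells are 0/1
foo_tab <- unclass(table(factor(foo_u$y, levels = all_y), foo_u$x))
bar_tab <- unclass(table(factor(bar_u$y, levels = all_y), bar_u$x))

# Equivalent to t(foo_tab) %*% bar_tab: cell [i, j] counts the y values
# shared by foo group i and bar group j
shared <- crossprod(foo_tab, bar_tab)
shared
```

Unlike the merge approach, groups that share nothing with any other group still get a row or column of zeros here, because the tables are built from all groups rather than from matched rows only.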

Related

Analysis of many attendance lists

I have 8 attendance lists from 8 different conferences. I need to know which people attended at least 7 of the 8 conferences. I don't want to check name by name in each list, so I'm planning to do it in R, but I have no clue how. Any suggestions?
There might be a simpler way (my R is getting a bit rusty), but this works:
library(dplyr)
unique_attendees <- c('a', 'b', 'c', 'd', 'e')
conf1_attendees <- c('a','b')
conf2_attendees <- c('a','b','c')
conf3_attendees <- c('a','b','c','e')
conf4_attendees <- c('b', 'e')
conf5_attendees <- c('a','d', 'e')
conf6_attendees <- c('a','d', 'e')
conf7_attendees <- c('a','b', 'e')
conf8_attendees <- c('a','b', 'c')
conferences <- list(conf1_attendees, conf2_attendees, conf3_attendees, conf4_attendees, conf5_attendees, conf6_attendees, conf7_attendees,conf8_attendees)
attendance_record <- dplyr::bind_rows(lapply(unique_attendees, function(x) {
  cat('Working with:', x, '\n')
  # One TRUE/FALSE per conference; %in% compares whole names, whereas
  # grepl() would also match substrings (e.g. 'a' inside 'ana')
  attended <- vapply(conferences, function(y) x %in% y, logical(1))
  data.frame(person = x, number_attended = sum(attended))
}))
result <- attendance_record %>%
  mutate(attended_at_least_7 = number_attended >= 7)
print(result)
Output:
  person number_attended attended_at_least_7
1      a               7                TRUE
2      b               6               FALSE
3      c               3               FALSE
4      d               2               FALSE
5      e               5               FALSE
Obviously you'll need to adapt it to your problem since we don't know how your records are stored.
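If each conference's attendees are already held in a character vector, the same count can likely be obtained without dplyr. A minimal base R sketch under that assumption, reusing the made-up attendee names from the example above (counts and regulars are illustrative names, not part of the original answer):

```r
conferences <- list(
  c('a', 'b'), c('a', 'b', 'c'), c('a', 'b', 'c', 'e'), c('b', 'e'),
  c('a', 'd', 'e'), c('a', 'd', 'e'), c('a', 'b', 'e'), c('a', 'b', 'c')
)

# unique() guards against a name listed twice at the same conference
counts <- table(unlist(lapply(conferences, unique)))

# People who appear on at least 7 of the 8 lists
regulars <- names(counts)[counts >= 7]
regulars
```

This relies on names being spelled identically across lists; real attendance sheets would probably need cleaning (case, whitespace, accents) first.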

applying a function to multiple lists (R)

I have two lists:
source <- list(c(5,10,20,30))
source.val <- list(c('A', 'B', 'C', 'D'))
Each element in source has a corresponding value in source.val. I want to create a dataframe from the above two lists that looks like this:
source.val_5 source.val_10 source.val_20 source.val_30
A B C D
I did this
tempList <- list()
for (i in seq_along(source[[1]])) {
  tempList[[i]] <- data.frame(variable = paste0('source.val_', source[[1]][[i]]),
                              value = source.val[[1]][[i]])
}
temp.dat <- do.call('rbind', tempList)
# Pivot the long table (temp.dat, not the undefined finalList) into one wide row
temp.dat_wider <- tidyr::pivot_wider(temp.dat, names_from = variable, values_from = value)
Now I want to do this across a bigger list
source <- list(c(5,10,20,30),
c(5,10,20,30),
c(5,10,20,30),
c(5,10,20,30))
source.val <- list(c('A', 'B', 'C', 'D'),
c('B', 'B', 'D', 'D'),
c('C', 'B', 'A', 'D'),
c('D', 'B', 'B', 'D'))
The resulting table will have 4 rows, looking like this:
A tibble: 4 x 4
source.val_5 source.val_10 source.val_20 source.val_30
A B C D
B B D D
C B A D
D B B D
What is the best way to use a function like mapply to achieve my desired result?
For the example shared, where all the elements of source have the same order, you can do:
cols <- paste0('source.val_', sort(unique(unlist(source))))
setNames(do.call(rbind.data.frame, source.val), cols)
# source.val_5 source.val_10 source.val_20 source.val_30
#1 A B C D
#2 B B D D
#3 C B A D
#4 D B B D
However, for the general case where the values within each element of source do not follow the same order, you can first reorder source.val based on source:
source.val <- Map(function(x, y) y[order(x)], source, source.val)
and then use the code above.
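To see the reordering step at work, here is a small sketch with a second element of source deliberately shuffled. The data values are invented for illustration, and the row binding uses as.data.frame(do.call(rbind, ...)) as a stand-in for rbind.data.frame, so treat it as one possible variant rather than the answer's exact code.

```r
source <- list(c(5, 10, 20, 30),
               c(30, 10, 20, 5))          # second element is out of order
source.val <- list(c('A', 'B', 'C', 'D'),
                   c('W', 'X', 'Y', 'Z'))  # W belongs to 30, Z to 5

# Reorder each set of values so it follows the sorted order of its source
source.val <- Map(function(x, y) y[order(x)], source, source.val)

cols <- paste0('source.val_', sort(unique(unlist(source))))

# Stack the reordered vectors as rows, then attach the column names
res <- setNames(as.data.frame(do.call(rbind, source.val),
                              stringsAsFactors = FALSE),
                cols)
res
```

The second row comes out as Z, X, Y, W, that is, the values lined up under 5, 10, 20 and 30 respectively.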

Efficient way to get graph communities

I have the following type of data where I know there will be lots of unconnected communities (good separation). I want to efficiently (fast and preferably low/no dependencies) separate the data into their communities. I know I can use igraph to do this task but was hoping there's a fast base R way to extract these communities. I show how I separate the communities using igraph below.
Is there a fast, base R way to extract communities? Sharing non-base R approaches is fine too so that this question is more useful to future searchers.
dat <- data.frame(
  x = c('A', 'A', 'B', 'C', 'D', 'F', 'E', 'W', 'X', 'R', 'W'),
  y = c('A', 'B', 'C', 'C', 'F', 'F', 'E', 'Y', 'P', 'P', 'P')
)
mat <- xtabs(~ x + y, data = dat)
library(igraph)
g <- graph.data.frame(dat)
plot(g)
clust <- cluster_walktrap(g)
data.frame(
  val = clust$names,
  group = clust$membership
)
Desired Output
## val group
## 1 A 2
## 2 B 2
## 3 C 2
## 4 D 3
## 5 F 3
## 6 E 4
## 7 W 1
## 8 X 1
## 9 R 1
## 10 Y 1
## 11 P 1
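One possible base R direction, offered as a sketch rather than a tested answer: if the communities are fully unconnected, as described, they coincide with the connected components of the graph, which can be found by repeatedly propagating the smallest label along each edge. This is not the walktrap algorithm (which can also split a connected component), and the names nodes, group and communities are made up here. The group numbers are arbitrary labels, so they differ from the ones in the desired output while the groupings themselves match.

```r
dat <- data.frame(
  x = c('A', 'A', 'B', 'C', 'D', 'F', 'E', 'W', 'X', 'R', 'W'),
  y = c('A', 'B', 'C', 'C', 'F', 'F', 'E', 'Y', 'P', 'P', 'P'),
  stringsAsFactors = FALSE
)

nodes <- unique(c(dat$x, dat$y))
# Start with every node in its own group, labelled by its position in nodes
group <- setNames(seq_along(nodes), nodes)

# Repeatedly pull both ends of each edge to the smaller of their two labels
# until a full pass over the edges changes nothing
changed <- TRUE
while (changed) {
  changed <- FALSE
  for (k in seq_len(nrow(dat))) {
    ends <- c(dat$x[k], dat$y[k])
    if (group[[ends[1]]] != group[[ends[2]]]) {
      group[ends] <- min(group[ends])
      changed <- TRUE
    }
  }
}

# Renumber the surviving labels as 1, 2, 3, ... in order of first appearance
communities <- data.frame(val = nodes, group = match(group, unique(group)))
communities
```

The explicit R loop over edges keeps the sketch dependency-free but is unlikely to be fast on large edge lists; for those, igraph's components() is probably the more practical choice.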

Count number of time combination of events appear in dataframe columns ext

This is an extension of the question asked in "Count number of times combination of events occurs in dataframe columns". I will reword the question here so it is self-contained:
I have a data frame and I want to calculate the number of times each combination of events in two columns occurs (in any order), with a zero if a combination doesn't appear.
For example say I have
df <- data.frame('x' = c('a', 'b', 'c', 'c', 'c'),
'y' = c('c', 'c', 'a', 'a', 'b'))
So
x y
a c
b c
c a
c a
c a
c b
a and b do not occur together, a and c occur together 4 times (rows 2, 4, 5 and 6, counting the header as row 1) and b and c twice (rows 3 and 7), so I would want to return
x-y num
a-b 0
a-c 4
b-c 2
I hope this makes sense? Thanks in advance
This should do it:
res = table(df)
To convert to data frame:
resdf = as.data.frame(res)
The resdf data.frame looks like:
x y Freq
1 a a 0
2 b a 0
3 c a 2
4 a b 0
5 b b 0
6 c b 1
7 a c 1
8 b c 1
9 c c 0
Note that this answer takes order into account. If ordering of the columns is unimportant, then modifying the original data.frame prior to the process will remove the effect of ordering (a-c treated the same as c-a).
df1 = as.data.frame(t(apply(df,1,sort)))
As said, you can do this with factor() and expand.grid() (or another way to get all possible combinations)
all.possible <- expand.grid(c('a','b','c'), c('a','b','c'))
all.possible <- all.possible[all.possible[, 1] != all.possible[, 2], ]
all.possible <- unique(apply(all.possible, 1, function(x) paste(sort(x), collapse='-')))
df <- data.frame('x' = c('a', 'b', 'c', 'c', 'c'),
'y' = c('c', 'c', 'a', 'a', 'b'))
table(factor(apply(df , 1, function(x) paste(sort(x), collapse='-')), levels=all.possible))
An alternative, because I was a bit bored. Perhaps a bit more generalised? But probably still uglier than it could be...
library(plyr)
df2 <- as.data.frame(table(df))
# Label each off-diagonal pair with its sorted key; same-value pairs yield NULL
df2$com <- apply(df2[,1:2],1,function(x) if(x[1] != x[2]) paste(sort(x),collapse='-'))
df2 <- df2[df2$com != "NULL",]
ddply(df2, .(unlist(com)), summarise,
      num = sum(Freq))
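The order-independent count can also be sketched in base R without apply() or plyr. This is a minimal illustration, assuming the five-row df exactly as defined in the code above and that both columns are character; events, all_pairs, observed and counts are made-up names. pmin() and pmax() put the alphabetically smaller event first in each row, and combn() lists every possible pair so that unseen combinations show up as zero.

```r
df <- data.frame(x = c('a', 'b', 'c', 'c', 'c'),
                 y = c('c', 'c', 'a', 'a', 'b'),
                 stringsAsFactors = FALSE)

# Every distinct event, and every unordered pair of two different events
events <- sort(unique(c(df$x, df$y)))
all_pairs <- combn(events, 2, paste, collapse = '-')

# Build an order-independent key per row: smaller event first, then larger
observed <- paste(pmin(df$x, df$y), pmax(df$x, df$y), sep = '-')

# Using all_pairs as factor levels keeps pairs that never occur, with count 0
counts <- table(factor(observed, levels = all_pairs))
counts
```

Rows where both columns hold the same event produce a key such as a-a, which is not among the levels and is therefore dropped from the table.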

Combine vector and data.frame matching column values and vector values

I have
vetor <- c(1,2,3)
data <- data.frame(id=c('a', 'b', 'a', 'c', 'a'))
I need a data.frame output that matches each vector value to a specific id, resulting in:
id vector1
1 a 1
2 b 2
3 a 1
4 c 3
5 a 1
Here are two approaches I often use for similar situations:
vetor <- c(1,2,3)
key <- data.frame(vetor=vetor, mat=c('a', 'b', 'c'))
data <- data.frame(id=c('a', 'b', 'a', 'c', 'a'))
data$vector1 <- key[match(data$id, key$mat), 'vetor']
#or with merge
merge(data, key, by.x = "id", by.y = "mat")
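A third idiom that may be handy for small lookups is a named vector. This sketch assumes the same pairing as the key above (a = 1, b = 2, c = 3) and that id is a character column; lookup is a made-up name.

```r
vetor <- c(1, 2, 3)
data <- data.frame(id = c('a', 'b', 'a', 'c', 'a'), stringsAsFactors = FALSE)

# Name each value by the id it belongs to (this pairing is an assumption)
lookup <- setNames(vetor, c('a', 'b', 'c'))

# Indexing by name returns one value per row, in the original row order
data$vector1 <- unname(lookup[data$id])
data
```

Like match(), this keeps the rows in their original order, whereas merge() sorts the result by the key column by default.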
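So you want one unique integer for each distinct value in the id column?
This is what a factor provides in R. Before R 4.0.0, data.frame() converted strings to factors by default, so id was already one; in R 4.0.0 and later you need to request it with stringsAsFactors = TRUE (or wrap the column in factor()).
To convert to a numeric representation, use as.numeric:
data <- data.frame(id=c('a', 'b', 'a', 'c', 'a'), stringsAsFactors = TRUE)
data$vector1 <- as.numeric(data$id)
This works because data$id is then not a column of strings, but a factor, whose underlying integer codes follow the sorted order of its levels.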
Here's an answer I found that follows the "mathematical.coffee" tip:
vector1 <- c('b','a','a','c','a','a') # 3 elements to be labeled: a, b and c
labels <- factor(vector1, labels= c('char a', 'char b', 'char c') )
data.frame(vector1, labels)
The only thing we need to watch is that factor(vector1, ...) sorts the distinct values of vector1 to form its levels (alphabetically here), so the labels must be supplied in that same sorted order.
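One way to avoid depending on the implicit sort is to spell out the levels as well. A short sketch, reusing vector1 from above; the explicit levels argument and the codes variable are additions for illustration:

```r
vector1 <- c('b', 'a', 'a', 'c', 'a', 'a')

# Spelling out levels fixes the order, so each label is paired by position
labels <- factor(vector1,
                 levels = c('a', 'b', 'c'),
                 labels = c('char a', 'char b', 'char c'))
data.frame(vector1, labels)

# as.integer() exposes the underlying codes: a = 1, b = 2, c = 3
codes <- as.integer(labels)
codes
```

With levels given explicitly, reordering them (for example c('c', 'b', 'a')) changes the integer codes in a predictable way, as long as the labels are listed in the matching order.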
