Create edgelist for all interactions from a data.frame in R

I am trying to do network analysis in igraph but am having some issues transforming my dataset into an edge list (with weights), given the differing number of columns.
The data set looks as follows (df1), though it is much larger, of course. The first column is the main operator id (a main operator can also be a partner and vice versa, so the ids stay the same in the edge list). The challenge is that the number of partners varies (from 0 to 40) and every interaction has to be considered (not just "IdMain to IdPartnerX").
IdMain IdPartner1 IdPartner2 IdPartner3 IdPartner4 .....
1 4 3 7 6
2 3 1 NA NA
3 1 4 2 NA
4 9 6 3 NA
.
.
I already got the helpful tip to use reshape2 to do this, like:
data_melt <- reshape2::melt(data, id.vars = "IdMain")
edgelist <- data_melt[!is.na(data_melt$value), c("IdMain", "value")]
However, this only creates a 'directed' edgelist (from Main to Partners). What I need is something like below, where every interaction is recorded.
Id1 Id2
1 4
1 3
1 7
1 6
4 3
4 7
4 6
3 7
etc
Does anyone have a tip on the best way to go? I also looked into the igraph library and couldn't find a function to do this.

There is no need for reshape(2), melting, etc. You just need to grab every combination of column pairs and then bind them together.
x <- read.table(text="IdMain IdPartner1 IdPartner2 IdPartner3 IdPartner4
1 4 3 7 6
2 3 1 NA NA
3 1 4 2 NA
4 9 6 3 NA", header=TRUE)
idx <- t(combn(seq_along(x), 2))
edgelist <- lapply(1:nrow(idx), function(i) x[, c(idx[i, 1], idx[i, 2])])
edgelist <- lapply(edgelist, setNames, c("ID1","ID2"))
edgelist <- do.call(rbind, edgelist)
edgelist <- edgelist[rowSums(is.na(edgelist))==0, ]
edgelist
# ID1 ID2
# 1 1 4
# 2 2 3
# 3 3 1
# 4 4 9
# 5 1 3
# 6 2 1
# 7 3 4
# 8 4 6
# 9 1 7
# 11 3 2
# 12 4 3
# 13 1 6
# 17 4 3
# 18 3 1
# 19 1 4
# 20 9 6
# 21 4 7
# 23 1 2
# 24 9 3
# 25 4 6
# 29 3 7 <--
# 31 4 2
# 32 6 3
# 33 3 6 <--
# 37 7 6 <--
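Since the question's end goal is a weighted igraph network, a possible next step (my sketch, not part of the answer above; it assumes the igraph package is installed) is to build an undirected graph from this edge list and collapse repeated pairs into edge weights:

```r
library(igraph)

# Build an undirected graph from the two-column edge list above
g <- graph_from_data_frame(edgelist, directed = FALSE)

# Collapse duplicate edges into a 'weight' attribute counting occurrences
E(g)$weight <- 1
g <- simplify(g, edge.attr.comb = list(weight = "sum"))
```

Note that simplify() also drops self-loops by default; pass remove.loops = FALSE if an operator interacting with itself should be kept.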

Using the data below, you can achieve what looks to be your goal with apply and combn. This returns a list of matrices with the pairwise combinations of the row elements of your data.frame.
myPairs <- apply(t(dat), 2, function(x) t(combn(x[!is.na(x)], 2)))
Note that the output of apply can be finicky and it is necessary here to have at least one row with an NA so that apply will return a list rather than a matrix.
If you want a data.frame at the end, use do.call and rbind to put the matrices together and then data.frame and setNames for the object coercion and to add names.
setNames(data.frame(do.call(rbind, myPairs)), c("Id1", "Id2"))
Id1 Id2
1 1 4
2 1 3
3 1 7
4 1 6
5 4 3
6 4 7
7 4 6
8 3 7
9 3 6
10 7 6
11 2 3
12 2 1
13 3 1
14 3 1
15 3 4
16 3 2
17 1 4
18 1 2
19 4 2
20 4 9
21 4 6
22 4 3
23 9 6
24 9 3
25 6 3
data
dat <-
structure(list(IdMain = 1:4, IdPartner1 = c(4L, 3L, 1L, 9L),
IdPartner2 = c(3L, 1L, 4L, 6L), IdPartner3 = c(7L, NA, 2L,
3L), IdPartner4 = c(6L, NA, NA, NA)), .Names = c("IdMain",
"IdPartner1", "IdPartner2", "IdPartner3", "IdPartner4"),
class = "data.frame", row.names = c(NA, -4L))


I would like to extract the columns of each element of a list in R

I'd like to extract the 3rd column (c) of each element in this list and store the result.
(I've listed the data frame in this example so that it looks like the long list of lists I have):
set.seed(59)
df<- data.frame(a=c(1,4,5,2),b=c(9,2,7,4),c=c(5,2,9,4))
df1 <- list(df, 2*df)
[[1]]
a b c
1 1 9 5
2 4 2 2
3 5 7 9
4 2 4 4
[[2]]
a b c
1 2 18 10
2 8 4 4
3 10 14 18
4 4 8 8
Seems fairly simple for just one element
> df1[[1]]["c"]
c
1 5
2 2
3 9
4 4
> df1["c"] # cries again
[[1]]
NULL
All I want to see is:
[[1]]
c
1 5
2 2
3 9
4 4
[[2]]
c
1 10
2 4
3 18
4 8
Thanks in advance
Use lapply:
data <- lapply(df1, function(x) x[, 'c', drop = FALSE])
data
#[[1]]
# c
#1 5
#2 2
#3 9
#4 4
#[[2]]
# c
#1 10
#2 4
#3 18
#4 8
When you subset a one-column data.frame, R coerces the result to the lowest possible dimension, which is a vector in this case. drop = FALSE is needed to keep it as a data.frame.
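An equivalent idiom passes the single-bracket extraction function directly to lapply; single-bracket subsetting of a data.frame by column name keeps the data.frame class, so no drop = FALSE is needed:

```r
df <- data.frame(a = c(1, 4, 5, 2), b = c(9, 2, 7, 4), c = c(5, 2, 9, 4))
df1 <- list(df, 2 * df)

# "[" with a column name on a data.frame returns a one-column data.frame
data <- lapply(df1, "[", "c")
```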

Conditional average or replace value if zero in dataframe

I have a problem finding the right code for the following problem.
Here is a simplified, shortened version of my dataframe df:
Line Id Amount
1 1 10
2 2 12
3 2 13
4 2 0
5 3 11
6 4 12
7 4 14
8 5 0
9 6 11
10 6 0
I would like to create another column Amount_Avrg with the following conditions:
- if several lines have the same Id and an Amount different from zero (the case for lines 2 and 3, and for lines 6 and 7), calculate the average of those amounts
- if a line has an Amount equal to 0, then:
A/ erase it if it is alone, i.e. there is no other line with the same Id and a value different from 0 (the case of line 8)
B/ if there is one other line with the same Id and a value different from 0 (the case for lines 9 and 10), replace the 0 with the value of that line
C/ if there are two or more lines with a value different from zero (the case of line 4, where lines 2 and 3 are non-zero), replace the 0 with the average of the other amounts
The final dataframe I am expecting would then look like this one:
Line Id Amount Amount_Avrg
1 1 10 10
2 2 12 12.5
3 2 13 12.5
4 2 0 12.5
5 3 11 11
6 4 12 13
7 4 14 13
9 6 11 11
10 6 0 11
I have read in many answers that for loops are not efficient in R, so if you could help me with another solution, that would be fantastic :-)
Using dplyr, we can group_by Id, take the mean of the non-zero Amount values, and then remove the rows where that mean is NA (i.e. Ids with only zeros).
library(dplyr)
df %>%
group_by(Id) %>%
mutate(mn = mean(Amount[Amount > 0])) %>%
filter(!is.na(mn))
# Line Id Amount mn
# <int> <int> <int> <dbl>
#1 1 1 10 10
#2 2 2 12 12.5
#3 3 2 13 12.5
#4 4 2 0 12.5
#5 5 3 11 11
#6 6 4 12 13
#7 7 4 14 13
#8 9 6 11 11
#9 10 6 0 11
Or with data.table
library(data.table)
setDT(df)[, mn := mean(Amount[Amount > 0]), by = Id][!is.na(mn)]
data
df <- structure(list(Line = 1:10, Id = c(1L, 2L, 2L, 2L, 3L, 4L, 4L,
5L, 6L, 6L), Amount = c(10L, 12L, 13L, 0L, 11L, 12L, 14L, 0L,
11L, 0L)), class = "data.frame", row.names = c(NA, -10L))
You can use ave to calculate the mean per Id and then subset with !is.na to remove the rows where an Id has only 0s.
x$Amount_Avrg <- ave(x$Amount, x$Id, FUN=function(x) mean(x[x>0]))
x <- x[!is.na(x$Amount_Avrg),]
x
# Line Id Amount Amount_Avrg
#1 1 1 10 10.0
#2 2 2 12 12.5
#3 3 2 13 12.5
#4 4 2 0 12.5
#5 5 3 11 11.0
#6 6 4 12 13.0
#7 7 4 14 13.0
#9 9 6 11 11.0
#10 10 6 0 11.0
Or with within and na.omit:
na.omit(within(x, Amount_Avrg <- ave(Amount, Id, FUN=function(x) mean(x[x>0]))))
Or using aggregate and merge:
merge(x, aggregate(cbind(Amount_Avrg = Amount) ~ Id, data=x[x$Amount>0,], mean))
Data:
x <- read.table(header=TRUE, text="Line Id Amount
1 1 10
2 2 12
3 2 13
4 2 0
5 3 11
6 4 12
7 4 14
8 5 0
9 6 11
10 6 0")
If you create a summary table of all the nonzero-means, you can right-join that to the original table to get the result displayed in the question.
library(data.table)
setDT(df)
nonzero_means <- df[Amount > 0, .(Amount_Avg = mean(Amount)), Id]
df[nonzero_means, on = .(Id)]
# Line Id Amount Amount_Avg
# 1: 1 1 10 10.0
# 2: 2 2 12 12.5
# 3: 3 2 13 12.5
# 4: 4 2 0 12.5
# 5: 5 3 11 11.0
# 6: 6 4 12 13.0
# 7: 7 4 14 13.0
# 8: 9 6 11 11.0
# 9: 10 6 0 11.0
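The same summarize-then-join idea can also be written with dplyr (a sketch; it assumes the df defined in the answers above and that dplyr is installed):

```r
library(dplyr)

df %>%
  filter(Amount > 0) %>%              # ignore zero amounts when averaging
  group_by(Id) %>%
  summarise(Amount_Avg = mean(Amount)) %>%
  inner_join(df, ., by = "Id")        # inner join drops Ids that only had zeros
```

The inner join keeps the original row order of df, and Id 5 (whose only Amount is 0) never appears in the summary table, so its row is dropped.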

concatenate data based on a certain sequence

My data look like this, with the variable day ranging from 1 to 232. This is just a shortened version; the real data have over 20,000,000 rows.
day time
1 2
1 2
2 2
2 3
3 4
3 5
4 4
4 2
and I have a vector that contains 1000 values randomly selected from the sequence of the variable day (1-232), say
df=c(3,4,1,2,...,4,1,3)
I want to create a new dataset sorted according to this sequence: first extract the rows with day = 3 from the data, then append the rows with day = 4, then day = 1, and rbind thereafter. For example, the first four elements of the sequence should give:
day time
3 4
3 5
4 4
4 2
1 2
1 2
2 2
2 3
Base R method:
x <- structure(list(day = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), time = c(2L,
2L, 2L, 3L, 4L, 5L, 4L, 2L)), class = "data.frame", row.names = c(NA,
-8L))
df <- c(3,4,1,2,4,1,3)
do.call("rbind.data.frame", lapply(df, function(i) subset(x, day == i)))
# day time
# 5 3 4
# 6 3 5
# 7 4 4
# 8 4 2
# 1 1 2
# 2 1 2
# 3 2 2
# 4 2 3
# 71 4 4
# 81 4 2
# 11 1 2
# 21 1 2
# 51 3 4
# 61 3 5
The use of do.call("rbind.data.frame", ...) is subject to the usual data.frame construction behavior, meaning that if your real data has any columns of type character, you will likely want to do
do.call("rbind.data.frame", c(lapply(df, function(i) subset(x, day == i)), stringsAsFactors = FALSE))
Also, it could easily be replaced (without the risk of factors) with data.table::rbindlist or dplyr::bind_rows.
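As an illustration, the rbindlist alternative mentioned above would look like this (a sketch; it assumes the x and df objects from this answer and that data.table is installed):

```r
library(data.table)

# rbindlist avoids the rowname mangling and stringsAsFactors concerns
# of do.call("rbind.data.frame", ...)
rbindlist(lapply(df, function(i) subset(x, day == i)))
```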
If I understand correctly, you can do this in a pretty straightforward manner with data.table:
library(data.table)
df <- fread(text = "day time
1 2
1 2
2 2
2 3
3 4
3 5
4 4
4 2", header = TRUE)
seqs <- data.table(day = c(3,4,1,2,4,1,3))
df[seqs, on = "day"]
#> day time
#> 1: 3 4
#> 2: 3 5
#> 3: 4 4
#> 4: 4 2
#> 5: 1 2
#> 6: 1 2
#> 7: 2 2
#> 8: 2 3
#> 9: 4 4
#> 10: 4 2
#> 11: 1 2
#> 12: 1 2
#> 13: 3 4
#> 14: 3 5
Created on 2019-02-10 by the reprex package (v0.2.1)
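The same keyed reordering can be written with dplyr, where a left join against the sequence table repeats and reorders the rows in one step (a sketch; it assumes the same df as above and that dplyr is installed):

```r
library(dplyr)

seqs <- data.frame(day = c(3, 4, 1, 2, 4, 1, 3))

# Rows of df are pulled in the order of seqs$day; days appearing twice
# in seqs duplicate the matching rows, just like the data.table join
left_join(seqs, df, by = "day")
```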

Reshaping different variables for selecting values from one column in R

Below is a sample of my data; I have more Rs and Os.
A R1 O1 R2 O2 R3 O3
1 3 3 5 3 6 4
2 3 3 5 4 7 4
3 4 4 5 5 6 5
I want to get the following data
A R O Value
1 3 1 3
1 5 2 3
1 6 3 4
2 3 1 3
2 5 2 4
2 7 3 4
3 4 1 4
3 5 2 5
3 6 3 5
I tried the melt function, but I was unsuccessful. Any help would be very much appreciated.
A solution using dplyr and tidyr. The key is to use gather to collect all the columns other than A, then use extract to split the column names, and finally use spread to convert the data frame back to wide format.
library(dplyr)
library(tidyr)
dt2 <- dt %>%
gather(Column, Number, -A) %>%
extract(Column, into = c("Column", "ID"), regex = "([A-Z]+)([0-9]+)") %>%
spread(Column, Number) %>%
select(A, R, O = ID, Value = O)
dt2
# A R O Value
# 1 1 3 1 3
# 2 1 5 2 3
# 3 1 6 3 4
# 4 2 3 1 3
# 5 2 5 2 4
# 6 2 7 3 4
# 7 3 4 1 4
# 8 3 5 2 5
# 9 3 6 3 5
DATA
dt <- read.table(text = "A R1 O1 R2 O2 R3 O3
1 3 3 5 3 6 4
2 3 3 5 4 7 4
3 4 4 5 5 6 5",
header = TRUE)
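In current tidyr (1.0+), gather/extract/spread are superseded; the same reshape fits in a single pivot_longer() call using the ".value" sentinel (a sketch against the same dt, mirroring the select() from the answer above):

```r
library(dplyr)
library(tidyr)

dt %>%
  pivot_longer(-A,
               names_to = c(".value", "ID"),        # "R2" -> value column "R", ID "2"
               names_pattern = "([A-Z]+)([0-9]+)") %>%
  select(A, R, O = ID, Value = O)
```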

Calculate score based on responses to multiple-responses questions

I have user data from between 1 and 20 multiple-response questions. In other words, users did not all answer the same number of questions; they could choose how many of the 20 questions they wanted to answer. All questions have the same number of response options (which was 44), and users could choose as few or as many response options as they wanted for each question.
To give an example, here's a subset of the data (representing 3 multiple-response questions with 5 response options on each):
mydata <- structure(list(id = 1:5, q1.response1 = c(1L, NA, 1L, NA, 1L),
q1.response2 = c(NA, 1L, 1L, NA, NA), q1.response3 = c(NA,
1L, 1L, 1L, NA), q1.response4 = c(1L, 1L, 1L, NA, 1L), q1.response5 = c(NA,
1L, 1L, NA, NA), q2.response1 = c(NA, 1L, NA, NA, NA), q2.response2 = c(1L,
NA, 1L, 1L, 1L), q2.response3 = c(NA, 1L, NA, 1L, NA), q2.response4 = c(1L,
NA, NA, NA, 1L), q2.response5 = c(NA, 1L, NA, 1L, NA), q3.response1 = c(1L,
1L, NA, 1L, NA), q3.response2 = c(NA, 1L, NA, NA, NA), q3.response3 = c(1L,
NA, NA, 1L, NA), q3.response4 = c(1L, 1L, NA, NA, NA), q3.response5 = c(1L,
NA, NA, NA, NA)), .Names = c("id", "q1.response1", "q1.response2",
"q1.response3", "q1.response4", "q1.response5", "q2.response1",
"q2.response2", "q2.response3", "q2.response4", "q2.response5",
"q3.response1", "q3.response2", "q3.response3", "q3.response4",
"q3.response5"), class = "data.frame", row.names = c(NA, -5L))
A "1" indicates that they checked off that option; NA indicates that they did not.
What I want to do is calculate the following score for each of the 5 users: the sum over i of ni*log2(ni), where ni is the number of response options that fall into a particular combination of questions, and i = 1, ..., 2^k, where k is the number of multiple-choice questions.
For example, if a person responds to 3 multiple-response questions (as in the above example), each of the 5 response options will fall into exactly one of 2^3 = 8 possible group combinations:
1) Response options selected only in question 1
2) Response options selected only in question 2
3) Response options selected only in question 3
4) Response options selected in both question 1 and 2
5) Response options selected in both question 1 and 3
6) Response options selected in both question 2 and 3
7) Response options selected in question 1, 2, and 3
8) Response options that were not selected at all
As an example, for respondent #1 in the sample data:
1) Response options selected only in question 1: none = 0 responses
2) Response options selected only in question 2: response2 = 1 response
3) Response options selected only in question 3: response3, response5 = 2 responses
4) Response options selected in both question 1 and 2: none = 0 responses
5) Response options selected in both question 1 and 3: response1 = 1 response
6) Response options selected in both question 2 and 3: none = 0 responses
7) Response options selected in question 1, 2, and 3: response4 = 1 response
8) Response options that were not selected at all: none = 0 responses
So the score for this respondent would be (taking 0*log2(0) = 0):
(0*log2(0))+(1*log2(1))+(2*log2(2))+(0*log2(0))+(1*log2(1))+(0*log2(0))+(1*log2(1))+(0*log2(0)) = 2
Any idea how to code this in R?
The first thing I would do here is transform your data into long format. There are various ways of doing this, such as the base R reshape() function, and the reshape2 package, but I actually decided to do it manually in this case by constructing a new data.frame using the constructor function data.frame() and a few carefully written calls to rep(). This approach also depends on flattening the original data.frame (minus the initial id column, which I instantiate separately) to a vector via as.matrix() and then c(), which follows the original data across rows and then across columns. The rep() calls had to be designed to align with that order.
mydata;
## id q1.response1 q1.response2 q1.response3 q1.response4 q1.response5 q2.response1 q2.response2 q2.response3 q2.response4 q2.response5 q3.response1 q3.response2 q3.response3 q3.response4 q3.response5
## 1 1 1 NA NA 1 NA NA 1 NA 1 NA 1 NA 1 1 1
## 2 2 NA 1 1 1 1 1 NA 1 NA 1 1 1 NA 1 NA
## 3 3 1 1 1 1 1 NA 1 NA NA NA NA NA NA NA NA
## 4 4 NA NA 1 NA NA NA 1 1 NA 1 1 NA 1 NA NA
## 5 5 1 NA NA 1 NA NA 1 NA 1 NA NA NA NA NA NA
NU <- nrow(mydata);
NQ <- 3;
NO <- 5;
long <- data.frame(id=rep(mydata$id,NQ*NO),question=rep(1:NQ,each=NO*NU),option=rep(1:NO,each=NU,NQ),response=c(as.matrix(mydata[-1])));
long;
## id question option response
## 1 1 1 1 1
## 2 2 1 1 NA
## 3 3 1 1 1
## 4 4 1 1 NA
## 5 5 1 1 1
## 6 1 1 2 NA
## 7 2 1 2 1
## 8 3 1 2 1
## 9 4 1 2 NA
## 10 5 1 2 NA
## 11 1 1 3 NA
## 12 2 1 3 1
## 13 3 1 3 1
## 14 4 1 3 1
## 15 5 1 3 NA
## 16 1 1 4 1
## 17 2 1 4 1
## 18 3 1 4 1
## 19 4 1 4 NA
## 20 5 1 4 1
## 21 1 1 5 NA
## 22 2 1 5 1
## 23 3 1 5 1
## 24 4 1 5 NA
## 25 5 1 5 NA
## 26 1 2 1 NA
## 27 2 2 1 1
## 28 3 2 1 NA
## 29 4 2 1 NA
## 30 5 2 1 NA
## 31 1 2 2 1
## 32 2 2 2 NA
## 33 3 2 2 1
## 34 4 2 2 1
## 35 5 2 2 1
## 36 1 2 3 NA
## 37 2 2 3 1
## 38 3 2 3 NA
## 39 4 2 3 1
## 40 5 2 3 NA
## 41 1 2 4 1
## 42 2 2 4 NA
## 43 3 2 4 NA
## 44 4 2 4 NA
## 45 5 2 4 1
## 46 1 2 5 NA
## 47 2 2 5 1
## 48 3 2 5 NA
## 49 4 2 5 1
## 50 5 2 5 NA
## 51 1 3 1 1
## 52 2 3 1 1
## 53 3 3 1 NA
## 54 4 3 1 1
## 55 5 3 1 NA
## 56 1 3 2 NA
## 57 2 3 2 1
## 58 3 3 2 NA
## 59 4 3 2 NA
## 60 5 3 2 NA
## 61 1 3 3 1
## 62 2 3 3 NA
## 63 3 3 3 NA
## 64 4 3 3 1
## 65 5 3 3 NA
## 66 1 3 4 1
## 67 2 3 4 1
## 68 3 3 4 NA
## 69 4 3 4 NA
## 70 5 3 4 NA
## 71 1 3 5 1
## 72 2 3 5 NA
## 73 3 3 5 NA
## 74 4 3 5 NA
## 75 5 3 5 NA
Here's a demo of how to use reshape() to accomplish the same thing. As you can see, this requires two consecutive calls to reshape(), because we need to longify both the option variable and the question variable. The order of these two columns ends up being reversed from what I created above, but that's circumstantial. Note that this approach absolves us of the need to manually store (or derive, which could theoretically be done) NQ and NO in advance of the transformation, but at the expense of the complexity of appeasing the quirks of the reshape() function.
long1 <- transform(structure(reshape(mydata,dir='l',varying=2:ncol(mydata),timevar='option'),reshapeLong=NULL),option=as.integer(sub('^response','',option,perl=T)));
long2 <- transform(structure(reshape(long1,dir='l',idvar=c('id','option'),varying=3:ncol(long1),timevar='question',sep=''),reshapeLong=NULL),question=as.integer(question),response=q,q=NULL);
rownames(long2) <- NULL;
identical(long,long2[names(long)]);
## [1] TRUE
The next step is figuring out which options for each user fell into which category. By "category" I'm referring to the combination of questions for which that user selected that particular option. Your formula requires first summing up the number of options that fall into each category.
Initially, I got the idea to normalize each user's selections for a particular option to a single number by treating each question as a binary digit and summing up the place-value-weighted numerical value of each selection. So, for example, if a user selected a particular option on questions 1 and 3, but not on question 2, then that would be binary 101 which would normalize to 5. This was the result, using aggregate() to group by id and option:
combo <- aggregate(cbind(category=response)~id+option,long,function(x) sum(x*2^(length(x):1-1),na.rm=T),na.action=na.pass);
combo;
## id option category
## 1 1 1 5
## 2 2 1 3
## 3 3 1 4
## 4 4 1 1
## 5 5 1 4
## 6 1 2 2
## 7 2 2 5
## 8 3 2 6
## 9 4 2 2
## 10 5 2 2
## 11 1 3 1
## 12 2 3 6
## 13 3 3 4
## 14 4 3 7
## 15 5 3 0
## 16 1 4 7
## 17 2 4 5
## 18 3 4 4
## 19 4 4 0
## 20 5 4 6
## 21 1 5 1
## 22 2 5 6
## 23 3 5 4
## 24 4 5 2
## 25 5 5 0
However, I then realized that this approach could easily lead to a problem. The issue is that it requires multiplying by place values that extend up to 2^(k-1). For your particular case k is 20, so that's only 2^19 = 524288, which is perfectly manageable, but imagine if you had 100 questions; the largest place value would be 2^99 = 633825300114114700748351602688! That doesn't fit into a 32-bit integer and so it would be converted to a double (roughly 6.33825300114115e+29), and that would screw up the entire aggregation we're going to have to do next (stay tuned), since nearby categories would be "rounded" together, due to the limited absolute precision of doubles.
I thought about how to solve this problem, and realized that it makes most sense to just switch to a string representation of the category. This will allow us to handle large numbers of questions, while still providing a simple and easily comparable representation of the category. I also manually set it as a factor rather than a character vector, which will be useful later on for the tabulate() call. So, here's the improved solution, again using aggregate() to group by id and option:
combo <- aggregate(cbind(category=response)~id+option,long,function(x) factor(paste(replace(x,is.na(x),0),collapse='')),na.action=na.pass);
combo;
## id option category
## 1 1 1 101
## 2 2 1 011
## 3 3 1 100
## 4 4 1 001
## 5 5 1 100
## 6 1 2 010
## 7 2 2 101
## 8 3 2 110
## 9 4 2 010
## 10 5 2 010
## 11 1 3 001
## 12 2 3 110
## 13 3 3 100
## 14 4 3 111
## 15 5 3 000
## 16 1 4 111
## 17 2 4 101
## 18 3 4 100
## 19 4 4 000
## 20 5 4 110
## 21 1 5 001
## 22 2 5 110
## 23 3 5 100
## 24 4 5 010
## 25 5 5 000
As a slight alternative, to save on characters, we could use more compact encodings than the above binary strings. Here's a rather intricate line of code to build hex strings:
combo <- aggregate(cbind(category=response)~id+option,long,function(x) factor(paste(c(0:9,letters[1:6])[colSums(matrix(c(rep(0,ceiling(length(x)/4)*4-length(x)),x)*2^(3:0),4),na.rm=T)+1],collapse='')),na.action=na.pass);
combo;
## id option category
## 1 1 1 5
## 2 2 1 3
## 3 3 1 4
## 4 4 1 1
## 5 5 1 4
## 6 1 2 2
## 7 2 2 5
## 8 3 2 6
## 9 4 2 2
## 10 5 2 2
## 11 1 3 1
## 12 2 3 6
## 13 3 3 4
## 14 4 3 7
## 15 5 3 0
## 16 1 4 7
## 17 2 4 5
## 18 3 4 4
## 19 4 4 0
## 20 5 4 6
## 21 1 5 1
## 22 2 5 6
## 23 3 5 4
## 24 4 5 2
## 25 5 5 0
Note how this result looks identical to the place value solution given earlier. That's just because this sample data has only 3 questions and thus only 8 categories, which doesn't extend into the hexadecimal letter range. On the other hand, the identicalness is a nice demonstration of how both solutions use a kind of numerical value representation, with the place value solution using actual integers, and this solution using hexadecimal strings.
The next step is to aggregate on the category, summing up ni*log2(ni) over all categories.
Now, since the addend is zero for ni = 0, we don't actually have to add up the value for every possible category; we can ignore the ones that are not present. This is fortunate, since there are 2^k categories, which would become huge for large k. In other words, all we have to do is sum up the expression for each category that is represented in the data. Furthermore, since the addend is also zero for ni = 1 (because log2(1) = 0), we can excise every category that has fewer than 2 constituents. Thus we have:
res <- aggregate(cbind(score=category)~id,combo,function(x) { nc <- tabulate(x); nc <- nc[nc>1]; sum(nc*log2(nc)); });
res;
## id score
## 1 1 2
## 2 2 4
## 3 3 8
## 4 4 2
## 5 5 2
This was a very complex question, and I might have made a mistake somewhere, so please check my work!
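The whole pipeline can also be condensed into a few lines of base R, building the binary category strings per respondent directly from the wide data (a compact sketch; the names k and n_opts are mine, and mydata is recreated from the question so it runs standalone):

```r
# mydata as given in the question
mydata <- structure(list(id = 1:5, q1.response1 = c(1L, NA, 1L, NA, 1L),
    q1.response2 = c(NA, 1L, 1L, NA, NA), q1.response3 = c(NA, 1L, 1L, 1L, NA),
    q1.response4 = c(1L, 1L, 1L, NA, 1L), q1.response5 = c(NA, 1L, 1L, NA, NA),
    q2.response1 = c(NA, 1L, NA, NA, NA), q2.response2 = c(1L, NA, 1L, 1L, 1L),
    q2.response3 = c(NA, 1L, NA, 1L, NA), q2.response4 = c(1L, NA, NA, NA, 1L),
    q2.response5 = c(NA, 1L, NA, 1L, NA), q3.response1 = c(1L, 1L, NA, 1L, NA),
    q3.response2 = c(NA, 1L, NA, NA, NA), q3.response3 = c(1L, NA, NA, 1L, NA),
    q3.response4 = c(1L, 1L, NA, NA, NA), q3.response5 = c(1L, NA, NA, NA, NA)),
    class = "data.frame", row.names = c(NA, -5L))

k <- 3       # number of questions
n_opts <- 5  # response options per question

scores <- sapply(seq_len(nrow(mydata)), function(r) {
  # n_opts x k matrix of this respondent's answers (options in rows,
  # questions in columns; the wide data is ordered q1, q2, q3)
  m <- matrix(unlist(mydata[r, -1]), nrow = n_opts, ncol = k)
  m[is.na(m)] <- 0
  category <- apply(m, 1, paste, collapse = "")  # e.g. "101"
  nc <- as.vector(table(category))               # n_i per category present
  sum(nc * log2(nc))                             # log2(1) = 0, singletons contribute nothing
})
scores
# [1] 2 4 8 2 2
```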
