Creating a counter-party frequency matrix in R

I have data from a barter economy. I am trying to create a matrix that counts how frequently items act as counterparties with other items.
As an example:
myDat <- data.frame(
  TradeID = as.factor(c(1,1,1,2,2,2,3,3,4,4,5,5,6,6,7,7,8,8,8)),
  Origin  = as.factor(c(1,0,0,1,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0)),
  ItemID  = as.factor(c(1,2,3,4,5,1,1,6,7,1,1,8,7,5,1,1,2,3,4))
)
TradeID Origin ItemID
1 1 1 1
2 1 0 2
3 1 0 3
4 2 1 4
5 2 1 5
6 2 0 1
7 3 1 1
8 3 0 6
9 4 1 7
10 4 0 1
11 5 1 1
12 5 0 8
13 6 1 7
14 6 0 5
15 7 1 1
16 7 0 1
17 8 1 2
18 8 0 3
19 8 0 4
20 9 1 1
21 9 0 8
TradeID identifies a specific transaction, ItemID identifies an item, and Origin indicates which direction the item went.
For example, given my data, the matrix I'd create would look like this:
  1 2 3 4 5 6 7 8
1 1 1 1 1 1 1 1 2
2 1 0 1 1 0 0 0 0
3 1 1 0 0 0 0 0 0
4 1 1 0 0 0 0 0 0
5 1 0 0 0 0 0 1 0
6 1 0 0 0 0 0 0 0
7 1 0 0 0 1 0 0 0
8 2 0 0 0 0 0 0 0
The value 2 at [1,8] indicates that items 1 and 8 were counterparties in two trades. (Note that it's a symmetric matrix, so [8,1] also has the value 2.)
The value 1 at [1,2] indicates that items 1 and 2 were counterparties in only one trade (all the other 1s throughout the matrix indicate the same).
As an edge case, note that the value 1 at [1,1] indicates that item 1 was a counterparty to itself once (trade 7).
A little extra insight into my motivation: note in my simple example that item 1 tends to act as a counterparty with many different items. In a barter economy (one without explicit money), we might expect a commodity currency to act as a counterparty relatively more frequently than other items. A matrix like this would be the first step in one way of discovering which item was a commodity currency.
I've been struggling with this for a while. But I think I'm nearly done with an overly complicated solution, which I'll post shortly.
I'm curious if y'all might offer a bit of help also.

Alright, I think I've got this figured out. The short answer is:
Reduce("+",by(myDat, myDat$TradeID, function(x) pmin(table(x$ItemID[x$Origin==0]) %o% table(x$ItemID[x$Origin==1]) + table(x$ItemID[x$Origin==1]) %o% table(x$ItemID[x$Origin==0]),1)))
Which gives the following matrix, matching the desired result:
1 2 3 4 5 6 7 8
1 1 1 1 1 1 1 1 2
2 1 0 1 1 0 0 0 0
3 1 1 0 0 0 0 0 0
4 1 1 0 0 0 0 0 0
5 1 0 0 0 0 0 1 0
6 1 0 0 0 0 0 0 0
7 1 0 0 0 1 0 0 0
8 2 0 0 0 0 0 0 0
Here's the long answer. You can get a list of matrices, one per TradeID, using the by, outer (%o%), and table functions. But summing both directions double-counts trade 7, where item 1 is traded for item 1, so I use the pmin function to cap each trade's cells at 1. Then I sum across the list using the Reduce function.
And here are the steps to get there. Note the addition of TradeID #9, which was left out of the question's code.
# Data
myDat <- data.frame(
  TradeID = as.factor(c(1,1,1,2,2,2,3,3,4,4,5,5,6,6,7,7,8,8,8,9,9)),
  Origin  = as.factor(c(1,0,0,1,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0,1,0)),
  ItemID  = as.factor(c(1,2,3,4,5,1,1,6,7,1,1,8,7,5,1,1,2,3,4,1,8))
)
# Sum in 1 direction
by(myDat, myDat$TradeID, function(x)
  table(x$ItemID[x$Origin == 0]) %o% table(x$ItemID[x$Origin == 1]))
# Sum in both directions
by(myDat, myDat$TradeID, function(x)
  table(x$ItemID[x$Origin == 1]) %o% table(x$ItemID[x$Origin == 0]) +
  table(x$ItemID[x$Origin == 0]) %o% table(x$ItemID[x$Origin == 1]))
# Remove double-count in trade 7
by(myDat, myDat$TradeID, function(x)
  pmin(table(x$ItemID[x$Origin == 0]) %o% table(x$ItemID[x$Origin == 1]) +
       table(x$ItemID[x$Origin == 1]) %o% table(x$ItemID[x$Origin == 0]), 1))
# Sum across lists
Reduce("+",by(myDat, myDat$TradeID, function(x) pmin(table(x$ItemID[x$Origin==0]) %o% table(x$ItemID[x$Origin==1]) + table(x$ItemID[x$Origin==1]) %o% table(x$ItemID[x$Origin==0]),1)))
One way to speed this up would be to sum in only 1 direction (taking advantage of symmetry) and then clean up the results.
result = Reduce("+", by(myDat, myDat$TradeID, function(x)
  table(x$ItemID[x$Origin == 0]) %o% table(x$ItemID[x$Origin == 1])))
result2 = result + t(result)
diag(result2) = diag(result)
result2
1 2 3 4 5 6 7 8
1 1 1 1 1 1 1 1 2
2 1 0 1 1 0 0 0 0
3 1 1 0 0 0 0 0 0
4 1 1 0 0 0 0 0 0
5 1 0 0 0 0 0 1 0
6 1 0 0 0 0 0 0 0
7 1 0 0 0 1 0 0 0
8 2 0 0 0 0 0 0 0
This appears to run nearly twice as fast. (The benchmark below times only the one-direction Reduce() step; the symmetrizing clean-up lines add negligible overhead on data of this size.)
> microbenchmark(Reduce("+",by(myDat, myDat$TradeID, function(x) pmin(table(x$ItemID[x$Origin==0]) %o% table(x$ItemID[x$Origin==1]) + table(x$ItemID[x$Origin==1]) %o% table(x$ItemID[x$Origin==0]),1))))
Unit: milliseconds
min lq median uq max neval
7.489092 7.733382 7.955861 8.536359 9.83216 100
> microbenchmark(Reduce("+",by(myDat, myDat$TradeID, function(x) table(x$ItemID[x$Origin==0]) %o% table(x$ItemID[x$Origin==1]))))
Unit: milliseconds
min lq median uq max neval
4.023964 4.18819 4.277767 4.452824 5.801171 100
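Returning to the commodity-currency motivation in the question: with result2 in hand, one quick screen for a commodity-currency candidate is to count each item's distinct counterparties. This is only a first-pass sketch of mine, not a statistical test.
# rank items by how many distinct counterparties they have
sort(rowSums(result2 > 0), decreasing = TRUE)
Item 1 comes out on top, acting as a counterparty to all eight items (itself included).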

This will give you the number of observations per TradeID and ItemID
myDat <- data.frame(
  TradeID = as.factor(c(1,1,1,2,2,2,3,3,4,4,5,5,6,6,7,7,8,8,8)),
  Origin  = as.factor(c(1,0,0,1,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0)),
  ItemID  = as.factor(c(1,2,3,4,5,1,1,6,7,1,1,8,7,5,1,1,2,3,4))
)
result = tapply(myDat$Origin, list(myDat$ItemID, myDat$TradeID), length)
result[is.na(result)] = 0  # item/trade combinations with no observations become 0
result["1","7"]            # 2: item 1 appears twice in trade 7
result will then be:
> result
1 2 3 4 5 6 7 8
1 1 1 1 1 1 0 2 0
2 1 0 0 0 0 0 0 1
3 1 0 0 0 0 0 0 1
4 0 1 0 0 0 0 0 1
5 0 1 0 0 0 1 0 0
6 0 0 1 0 0 0 0 0
7 0 0 0 1 0 1 0 0
8 0 0 0 0 1 0 0 0
This will give you the proportion of 1 Origin per TradeID and ItemID
result = tapply(myDat$Origin, list(myDat$ItemID, myDat$TradeID),
                function(x) sum(as.numeric(as.character(x))) / length(x))
You can set the NA values in the last matrix to 0 using result[is.na(result)] = 0, but that would conflate having no observations with having nothing but 0-Origin trades.
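Building on this item-by-trade view, here's a sketch of my own (not part of the answer above) that turns the whole counterparty computation from the original question into a single matrix product, using one direction-specific count matrix per side:
# item-by-trade counts, one matrix per direction (factor subsetting keeps all levels)
A <- table(myDat$ItemID[myDat$Origin == 0], myDat$TradeID[myDat$Origin == 0])
B <- table(myDat$ItemID[myDat$Origin == 1], myDat$TradeID[myDat$Origin == 1])
oneway <- A %*% t(B)        # oneway[i, j]: trades pairing i (Origin 0) with j (Origin 1)
sym <- oneway + t(oneway)   # symmetrize, as in the accepted answer
diag(sym) <- diag(oneway)   # count self-trades (like trade 7) only once
sym
Note that myDat here is the question's version of the data, without trade 9, so [1,8] will be 1 rather than 2.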

This will give you the number of observations per consecutive ItemIDs:
idxList <- with(myDat, tapply(ItemID, TradeID, FUN = function(items)
  lapply(seq(length(items) - 1),
         function(i) sort(c(items[i], items[i + 1])))))
# indices of observations
idx <- do.call(rbind, unlist(idxList, recursive = FALSE))
# create a matrix
ids <- unique(myDat$ItemID)
mat <- matrix(0, length(ids), length(ids))
# place values in matrix
for (i in seq(nrow(idx))) {
  mat[idx[i, , drop = FALSE]] <- mat[idx[i, , drop = FALSE]] + 1
}
# create symmetric matrix
mat[lower.tri(mat)] <- t(mat)[lower.tri(mat)]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 1 1 0 0 1 1 1 1
[2,] 1 0 2 0 0 0 0 0
[3,] 0 2 0 1 0 0 0 0
[4,] 0 0 1 0 1 0 0 0
[5,] 1 0 0 1 0 0 1 0
[6,] 1 0 0 0 0 0 0 0
[7,] 1 0 0 0 1 0 0 0
[8,] 1 0 0 0 0 0 0 0

How can I create dummy variables from a numeric variable in R?
I want to create N dummy variables, such that the numeric variable gives how many zeros come first, counting from the first column. Imagine N = 6. Like this:
x
a 5
b 2
c 4
d 1
e 9
It must become:
1 2 3 4 5 6
a 0 0 0 0 0 1
b 0 0 1 1 1 1
c 0 0 0 0 1 1
d 0 1 1 1 1 1
e 0 0 0 0 0 0
Thank you!
Here's a hacky solution for you
x = c(5,2,4,1,9)
N = 6
out = matrix(1, length(x), N)
for (i in 1:length(x))
  out[i, 1:min(x[i], N)] = 0  # zero out the first min(x[i], N) entries of row i
> out
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0 0 0 0 0 1
[2,] 0 0 1 1 1 1
[3,] 0 0 0 0 1 1
[4,] 0 1 1 1 1 1
[5,] 0 0 0 0 0 0
We could do this in a vectorized manner by creating a row/column index and assigning 0 into an already-created matrix of 1s based on that index:
m1 <- matrix(1, ncol = N, nrow = length(x),
             dimnames = list(letters[seq_along(x)], seq_len(N)))
x1 <- pmin(x, ncol(m1))
m1[cbind(rep(seq_len(nrow(m1)), x1), sequence(x1))] <- 0
m1
# 1 2 3 4 5 6
#a 0 0 0 0 0 1
#b 0 0 1 1 1 1
#c 0 0 0 0 1 1
#d 0 1 1 1 1 1
#e 0 0 0 0 0 0
data
x <- c(5,2,4,1,9)
N <- 6
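For completeness, here's a one-line variant of my own using outer(), a sketch assuming the same x and N: an entry should be 1 exactly when its column index exceeds the x value for that row.
out <- +outer(x, seq_len(N), "<")   # out[i, j] is 1 iff x[i] < j
dimnames(out) <- list(letters[seq_along(x)], seq_len(N))
out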

How can I create this special sequence?

I would like to create the following vector sequence.
0 1 0 0 2 0 0 0 3 0 0 0 0 4
My thought was to create the 0s first with rep(), but I'm not sure how to insert the 1:4.
Create a diagonal matrix, take the upper triangle, and remove the first element:
d <- diag(0:4)
d[upper.tri(d, TRUE)][-1L]
# [1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
If you prefer a one-liner that makes no global assignments, wrap it up in a function:
(function() { d <- diag(0:4); d[upper.tri(d, TRUE)][-1L] })()
# [1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
And for code golf purposes, here's another variation using d from above:
d[!lower.tri(d)][-1L]
# [1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
rep and rbind up to their old tricks:
rep(rbind(0,1:4),rbind(1:4,1))
#[1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
This essentially creates two matrices: one for the values, and one for how many times each value is repeated. rep() does not care that its inputs are matrices; it just flattens them back to vectors, going down each column in order.
rbind(0,1:4)
# [,1] [,2] [,3] [,4]
#[1,] 0 0 0 0
#[2,] 1 2 3 4
rbind(1:4,1)
# [,1] [,2] [,3] [,4]
#[1,] 1 2 3 4
#[2,] 1 1 1 1
You can use rep() to create a sequence in which each value i appears i + 1 times:
n <- 4
myseq <- rep(seq_len(n), seq_len(n) + 1)
# [1] 1 1 2 2 2 3 3 3 3 4 4 4 4 4
Then you can use diff() to find the elements you want. You need to append a 1 to the end of the diff() output, since you always want the last value.
c(diff(myseq), 1)
# [1] 0 1 0 0 1 0 0 0 1 0 0 0 0 1
Then you just need to multiply the original sequence with the diff() output.
myseq <- myseq * c(diff(myseq), 1)
myseq
# [1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
Or build each run of zeros plus its value directly, then flatten:
unlist(lapply(1:4, function(i) c(rep(0, i), i)))
# [1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
# the sequence
s = 1:4
# create zeros vector
vec = rep(0, sum(s+1))
# assign the sequence to the corresponding position in the zeros vector
vec[cumsum(s+1)] <- s
vec
# [1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
Or to be more succinct, use replace:
replace(rep(0, sum(s+1)), cumsum(s+1), s)
# [1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4

How to reset cumsum at end of consecutive string [duplicate]

If I have the following vector:
x = c(1,1,1,0,0,0,0,1,1,0,0,1,1,1,0,0,1,1,1,1,0,0,0,0,1,1,1)
how can I calculate the cumulative sum for all of the consecutive 1's, resetting each time I hit a 0?
So, the desired output would look like this:
> y
[1] 1 2 3 0 0 0 0 1 2 0 0 1 2 3 0 0 1 2 3 4 0 0 0 0 1 2 3
This works:
unlist(lapply(rle(x)$lengths, FUN = function(z) 1:z)) * x
# [1] 1 2 3 0 0 0 0 1 2 0 0 1 2 3 0 0 1 2 3 4 0 0 0 0 1 2 3
It relies pretty heavily on your special case of only having 1s and 0s, but for that case it works great! Even better, with @nicola's suggested improvements:
sequence(rle(x)$lengths) * x
# [1] 1 2 3 0 0 0 0 1 2 0 0 1 2 3 0 0 1 2 3 4 0 0 0 0 1 2 3
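If your vector might ever contain values other than 0 and 1, here is a sketch of a variant (mine, not from the thread) that drops the 0/1 assumption by grouping on the zeros themselves:
ave(x, cumsum(x == 0), FUN = cumsum)
# [1] 1 2 3 0 0 0 0 1 2 0 0 1 2 3 0 0 1 2 3 4 0 0 0 0 1 2 3
Each zero starts a new group, and cumsum runs independently within each group, so any positive values accumulate until the next reset.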
I read this post about how to split a vector and used splitAt2 by @Calimo.
So it's like this:
splitAt2 <- function(x, pos) {
  out <- list()
  pos2 <- c(1, pos, length(x) + 1)
  for (i in seq_along(pos2[-1])) {
    out[[i]] <- x[pos2[i]:(pos2[i + 1] - 1)]
  }
  return(out)
}
x = c(1,1,1,0,0,0,0,1,1,0,0,1,1,1,0,0,1,1,1,1,0,0,0,0,1,1,1)
where_split = which(x == 0)
x_split = splitAt2(x, where_split)
unlist(sapply(x_split, cumsum))
# [1] 1 2 3 0 0 0 0 1 2 0 0 1 2 3 0 0 1 2 3 4 0 0 0 0 1 2 3
Here is another option
library(data.table)
ave(x, rleid(x), FUN=seq_along)*x
#[1] 1 2 3 0 0 0 0 1 2 0 0 1 2 3 0 0 1 2 3 4 0 0 0 0 1 2 3
Or without any packages
ave(x, cumsum(c(TRUE, x[-1]!= x[-length(x)])), FUN=seq_along)*x
#[1] 1 2 3 0 0 0 0 1 2 0 0 1 2 3 0 0 1 2 3 4 0 0 0 0 1 2 3

String decomposition

I need to decompose about 75 million character strings using R. I need something like a term-document matrix, where each term that occurs becomes a column and, wherever the term occurs, the matrix element is coded as 1.
I have:
About 75 million character strings ranging in length from about 0-100 characters; they represent a time series giving coded information about what happened in that period. Each code is exactly one character and corresponds to a time period.
I need:
Some kind of matrix or way of conveying the information that takes away the time series and just tells me how many times a certain code was reported in each series.
For instance:
The string "ABCDEFG-123" would become a row in the matrix where each character would be tallied as occurring once. If this is too difficult, a matrix of 0s and 1s would also give me some information, though I would prefer to keep as much information as possible.
Does anyone have any ideas of how to do this quickly? There are 20 possible codes.
Example:
my20chars = c(LETTERS[1:10], 0:9)
set.seed(1)
x = replicate(1e4, paste0(sample(c(my20chars,"-"),10, replace=TRUE), collapse=""))
One approach:
library(data.table)
d = setDT(stack(strsplit(setNames(x,x),"")))
dcast(d[ values %in% my20chars ], ind ~ values, fun = length)
Result:
ind 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J
1: ---8EEAD8I 0 0 0 0 0 0 0 0 2 0 1 0 0 1 2 0 0 0 1 0
2: --33B6E-32 0 0 1 3 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0
3: --3IFBG8GI 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 1 2 0 2 0
4: --4210I8H5 1 1 1 0 1 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0
5: --5H4DE9F- 0 0 0 0 1 1 0 0 0 1 0 0 0 1 1 1 0 1 0 0
---
9996: JJFJBJ24AJ 0 0 1 0 1 0 0 0 0 0 1 1 0 0 0 1 0 0 0 5
9997: JJI-J-0FGB 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 1 3
9998: JJJ1B54H63 0 1 0 1 1 1 1 0 0 0 0 1 0 0 0 0 0 1 0 3
9999: JJJED7A3FI 0 0 0 1 0 0 0 1 0 0 1 0 0 1 1 1 0 0 1 3
10000: JJJIF6GI13 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0 2 3
Benchmark:
library(microbenchmark)
nstrs = 1e5
nchars = 10
x = replicate(nstrs, paste0(sample(c(my20chars,"-"), nchars, replace=TRUE), collapse=""))
microbenchmark(
  dcast = {
    d = setDT(stack(strsplit(setNames(x, x), "")))
    dcast(d[values %in% my20chars], ind ~ values, fun = length, value.var = "ind")
  },
  times = 10)
# Unit: seconds
# expr min lq mean median uq max neval
# dcast 3.112633 3.423935 3.480692 3.494176 3.573967 3.741931 10
So, this is not fast enough to handle the OP's 75 million strings, but may be a good place to start.
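Another base-R sketch in the same spirit (my own, untested at the 75-million scale): map each character to its index in my20chars and tabulate per string, which avoids building the long stacked table entirely.
# count the 20 codes per string via match() + tabulate()
# (tabulate silently ignores the NAs produced by non-code characters like "-")
counts <- t(vapply(strsplit(x, ""),
                   function(ch) tabulate(match(ch, my20chars), nbins = 20),
                   integer(20)))
dimnames(counts) <- list(x, my20chars)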
I really like @Frank's solution, but here's another way, which has two advantages:
It uses a sparse matrix format, so you are more likely to fit everything into memory; and
It is (even) simpler.
It uses our quanteda package, where you tokenise the characters in each string and form a document-feature matrix from these in one command:
my20chars = c(LETTERS[1:10], 0:9)
set.seed(1)
x = replicate(1e4, paste0(sample(c(my20chars,"-"),10, replace=TRUE), collapse=""))
require(quanteda)
myDfm <- dfm(x, what = "character", toLower = FALSE, verbose = FALSE)
# for equivalent printing, does not change content:
myDfm <- myDfm[, order(features(myDfm))]
rownames(myDfm) <- x
head(myDfm)
# Document-feature matrix of: 6 documents, 20 features.
# 6 x 20 sparse Matrix of class "dfmSparse"
# features
# docs 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J
# FH29E8933B 0 0 1 2 0 0 0 0 1 2 0 1 0 0 1 1 0 1 0 0
# ED4I605-H6 1 0 0 0 1 1 2 0 0 0 0 0 0 1 1 0 0 1 1 0
# 9E3CFIAI8H 0 0 0 1 0 0 0 0 1 1 1 0 1 0 1 1 0 1 2 0
# 020D746C5I 2 0 1 0 1 1 1 1 0 0 0 0 1 1 0 0 0 0 1 0
# 736116A054 1 2 0 1 1 1 2 1 0 0 1 0 0 0 0 0 0 0 0 0
# 08JFBCG03I 2 0 0 1 0 0 0 0 1 0 0 1 1 0 0 1 1 0 1 1
Disadvantage:
It's (much) slower.
Benchmark:
microbenchmark(
  dcast = {
    d = setDT(stack(strsplit(setNames(x, x), "")))
    dcast(d[values %in% my20chars], ind ~ values, fun = length, value.var = "ind")
  },
  quanteda = dfm(x, what = "character", toLower = FALSE, removePunct = FALSE,
                 verbose = FALSE),
  times = 10)
# Unit: seconds
# expr min lq mean median uq max neval
# dcast 2.380971 2.423677 2.465338 2.429331 2.521256 2.636102 10
# quanteda 21.106883 21.168145 21.369443 21.345173 21.519018 21.883966 10
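Given the memory pressure at 75 million strings, a possible middle path (a sketch of mine, assuming the Matrix package; not benchmarked here) is to build the sparse document-feature matrix directly from (row, column) triplets:
library(Matrix)
chars <- strsplit(x, "")
ii <- rep(seq_along(chars), lengths(chars))  # row index for every character
jj <- match(unlist(chars), my20chars)        # column index; NA for "-"
keep <- !is.na(jj)
# duplicate (i, j) pairs are summed, which gives exactly the per-string counts
m <- sparseMatrix(i = ii[keep], j = jj[keep], x = 1,
                  dims = c(length(chars), 20),
                  dimnames = list(x, my20chars))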

Expand a single column to a wide/model matrix format

Suppose I have a column in a matrix or data.frame as follows:
df <- data.frame(col1=sample(letters[1:3], 10, TRUE))
I want to expand this out to multiple columns, one for each level in the column, with 0/1 entries indicating the presence or absence of each level in each row:
newdf <- data.frame(a=rep(0, 10), b=rep(0,10), c=rep(0,10))
for (i in 1:length(levels(df$col1))) {
  curLetter <- levels(df$col1)[i]
  newdf[which(df$col1 == curLetter), curLetter] <- 1
}
newdf
I know there's a simple clever solution to this, but I can't figure out what it is.
I've tried expand.grid on df, which returns itself as is. Similarly melt in the reshape2 package on df returned df as is. I've also tried reshape but it complains about incorrect dimensions or undefined columns.
Obviously, model.matrix is the most direct candidate here, but I'll present three alternatives: table, lapply, and dcast (the last because this question is tagged reshape2).
table
table(sequence(nrow(df)), df$col1)
#
# a b c
# 1 1 0 0
# 2 0 1 0
# 3 0 1 0
# 4 0 0 1
# 5 1 0 0
# 6 0 0 1
# 7 0 0 1
# 8 0 1 0
# 9 0 1 0
# 10 1 0 0
lapply
newdf <- data.frame(a = rep(0, 10), b = rep(0, 10), c = rep(0, 10))
newdf[] <- lapply(names(newdf), function(x) {
  newdf[[x]][df[, 1] == x] <- 1
  newdf[[x]]
})
newdf
# a b c
# 1 1 0 0
# 2 0 1 0
# 3 0 1 0
# 4 0 0 1
# 5 1 0 0
# 6 0 0 1
# 7 0 0 1
# 8 0 1 0
# 9 0 1 0
# 10 1 0 0
dcast
library(reshape2)
dcast(df, sequence(nrow(df)) ~ df$col1, fun.aggregate=length, value.var = "col1")
# sequence(nrow(df)) a b c
# 1 1 1 0 0
# 2 2 0 1 0
# 3 3 0 1 0
# 4 4 0 0 1
# 5 5 1 0 0
# 6 6 0 0 1
# 7 7 0 0 1
# 8 8 0 1 0
# 9 9 0 1 0
# 10 10 1 0 0
It's very easy with model.matrix
model.matrix(~ df$col1 + 0)
The term + 0 means that the intercept is not included. Hence, you receive a dummy variable for each factor level.
The result:
df$col1a df$col1b df$col1c
1 0 0 1
2 0 1 0
3 0 0 1
4 1 0 0
5 0 1 0
6 1 0 0
7 1 0 0
8 0 1 0
9 1 0 0
10 0 1 0
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$`df$col1`
[1] "contr.treatment"
