If, for some reason, a base R (two dimensional) table object comes along in your R workflow, what is the best (concise, readable, efficient) way to convert it to a data.table while keeping its dimension structure?
Example data:
set.seed(1)
tab <- structure(
sample(0:1, size = 15, replace = TRUE),
.Dim = c(5,3),
.Dimnames = list(
Pr = c("P1", "P2", "P3", "P4", "P9"),
Tr = c("T1", "T2", "T3")
),
class = "table"
)
Example of failed attempts
setDT(tab)
# Error in setDT(tab) :
# Argument 'x' to 'setDT' should be a 'list', 'data.frame' or 'data.table'
as.data.table(tab) # same data.table(tab), *could* be fixed with dcast()
# Pr Tr N
# 1: P1 T1 0
# 2: P2 T1 0
# 3: P3 T1 1
# 4: P4 T1 1
# 5: P9 T1 0
# 6: P1 T2 1
# 7: P2 T2 1
# 8: P3 T2 1
# 9: P4 T2 1
# 10: P9 T2 0
# 11: P1 T3 0
# 12: P2 T3 0
# 13: P3 T3 1
# 14: P4 T3 0
# 15: P9 T3 1
Two possible solutions are using unclass (converting the table into a regular 2d array) or the matrix method of as.data.frame:
# Convert it to a regular array with unclass
data.table(unclass(tab), keep.rownames = "Pr")
# Pr T1 T2 T3
# 1: P1 0 1 0
# 2: P2 0 1 0
# 3: P3 1 1 1
# 4: P4 1 1 0
# 5: P9 0 0 1
# Convert it first to a data.frame
setDT(as.data.frame.matrix(tab), keep.rownames = "Pr")[]
# Pr T1 T2 T3
# 1: P1 0 1 0
# 2: P2 0 1 0
# 3: P3 1 1 1
# 4: P4 1 1 0
# 5: P9 0 0 1
Related
This question already has answers here:
Creating co-occurrence matrix
(5 answers)
Closed 1 year ago.
I have a simple matrix, e.g.
test <- matrix(c("u1","p1","u1","p2","u2","p2","u2",
"p3","u3","p1","u4","p2","u5","p1",
"u5","p3","u6","p3","u7","p4","u7",
"p3","u8","p1","u9","p4"),
ncol=2,byrow=TRUE)
colnames(test) <- c("user","product")
test1<-as.data.frame(test)
test:
user product
1 u1 p1
2 u1 p2
3 u2 p2
4 u2 p3
5 u3 p1
6 u4 p2
7 u5 p1
8 u5 p3
9 u6 p3
10 u7 p4
11 u7 p3
12 u8 p1
13 u9 p4
I want to count how many users bought product pair together, such as p1&p2, p2& p3...
table(test1$product,test1$product) give me this :
p1 p2 p3 p4
p1 4 0 0 0
p2 0 3 0 0
p3 0 0 4 0
p4 0 0 0 2
How can I get the right result as :
p1 p2 p3 p4
p1 4 1 1 0
p2 1 3 1 0
p3 1 1 4 1
p4 0 0 1 2
Looking at your desired output, you are looking for the crossprod function:
crossprod(table(test1))
# product
# product p1 p2 p3 p4
# p1 4 1 1 0
# p2 1 3 1 0
# p3 1 1 4 1
# p4 0 0 1 2
This is the same as crossprod(table(test1$user, test1$product)) (reflecting Dennis's comment).
A similar question tagged to this post requested an efficient solution, but got deleted undeleted now. We decided to post the solution here.
Here is one with RcppEigen to do the crossproduct
library(RcppEigen)
library(inline)
prodFun <- '
typedef Eigen::Map<Eigen::MatrixXi> MapMti;
const MapMti B(as<MapMti>(BB));
const MapMti C(as<MapMti>(CC));
return List::create(B.adjoint() * C);
'
funCPr <- cxxfunction(signature(BB= "matrix", CC = "matrix"),
prodFun, plugin = "RcppEigen")
tbl <- table(test1)
funCPr(tbl, tbl)[[1]]
# [,1] [,2] [,3] [,4]
#[1,] 4 1 1 0
#[2,] 1 3 1 0
#[3,] 1 1 4 1
#[4,] 0 0 1 2
Benchmarks
set.seed(24)
test2 <- data.frame(user = sample(1:5000, 1e6, replace=TRUE),
product = sample(paste0("p", 1:50), 1e6, replace = TRUE),
stringsAsFactors=FALSE)
tbl1 <- table(test2)
library(microbenchmark)
microbenchmark(cPP = funCPr(tbl1, tbl1)[[1]],
CrossP = crossprod(tbl1),
adjMat = adjmat(tbl1)$adjacency,
unit = "relative", times = 10L)
#Unit: relative
# expr min lq mean median uq max neval cld
# cPP 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10 a
# CrossP 2.079867 2.070509 2.234376 2.074388 2.290516 2.676798 10 a
# adjMat 6.223034 6.500791 9.619088 7.197824 7.771270 31.394812 10 b
NOTE: This could be made more efficient by doing the table in Rcpp
Ananda's solution is superior (it is lighter weight and requires no external package) but I am putting down another. I believe this is called an adjacency matrix (smarter people feel free to edit this if I'm wrong):
library(qdap)
adjmat(table(test1))$adjacency
## product
## product p1 p2 p3 p4
## p1 4 1 1 0
## p2 1 3 1 0
## p3 1 1 4 1
## p4 0 0 1 2
I have the following dataframe (DF_A):
PARTY_ID O1 O2 O3 O4
P1 0 0 1 0
P2 2 1 0 1
P3 0 0 0 0
P4 2 1 1 1
P5 1 0 0 1
I also have another dataframe (DF_B) with the position of the columns that I need in DF_A. This is DF_B:
PARTY_ID POS_1 POS_2
P1 1 2
P2 2 1
P3 3 1
P4 2 1
P5 1 4
I need to give the position of the columns (DF_B) for getting the values of DF_A. The desired result is something like this:
PARTY_ID V1 V2
P1 0 0
P2 1 2
P3 0 0
P4 1 2
P5 1 1
I'm trying to use which function, but it seems not to work.
Can anyone please help me?
SIDE NOTE: I need to do this the fastest way possible because my real data have more than 100K rows.
A quick and dirty way to do this using apply from base R:
DF_C <- apply(DF_A, 1, function(x) {
cols_to_use <- as.numeric(unlist(DF_B[DF_B$"PARTY_ID"==x["PARTY_ID"],2:3]))
x[-1][cols_to_use]
})
DF_C <- cbind(DF_A$PARTY_ID,t(DF_C))
colnames(DF_C) <- c("PARTY_ID", "V1","V2")
> ind <- as.matrix(DF_B[,-1])
> t(sapply(1:nrow(ind), function(i) DF_A[, -1][, ind[i,]][i,] ))
O1 O2
[1,] 0 0
[2,] 1 2
[3,] 0 0
[4,] 1 2
[5,] 1 1
If you want to get a data.frame:
> DF <- t(sapply(1:nrow(ind), function(i) DF_A[, -1][, ind[i,]][i,] ))
> data.frame(PARTY_ID=DF_A[,1], DF)
PARTY_ID O1 O2
1 P1 0 0
2 P2 1 2
3 P3 0 0
4 P4 1 2
5 P5 1 1
The one with a simple for loop:
DF_C <- DF_B # creating dataframe with same dimension and column/row identifiers
for(i in 1:nrow(DF_C)) { DF_C[i,] <- DF_A[i,as.numeric(DF_B[i,])] } #over rows
I have got a matrix describing relationship between diffetent people. If there is any connection between people I have "1" in a particular cell, otherwise - "0". How to turn this into a data.frame with two columns looks like:
person1 -- person4
person1 -- person6
person2 -- person1
?
Use melt from reshape2:
library(reshape2)
set.seed(1)
mx <- matrix(sample(0:1, 9, r=T), nrow=3, dimnames=replicate(2, paste0("p", 1:3), s=F))
# p1 p2 p3
# p1 0 1 1
# p2 0 0 1
# p3 1 1 1
melt(mx)
# Var1 Var2 value
# 1 p1 p1 0
# 2 p2 p1 0
# 3 p3 p1 1
# 4 p1 p2 1
# 5 p2 p2 0
# 6 p3 p2 1
# 7 p1 p3 1
# 8 p2 p3 1
# 9 p3 p3 1
I have a data.frame that looks like this:
C1 C2 C3 C4
1 -1 -1 1
1 1 -1 1
1 1 -1 1
1 -1 1 -1
I would like to count -1 and 1 occurrences by columns so I used:
tab= apply(DF, 2, table)
After I used the following string:
final <- as.data.frame(do.call("cbind", tab))
to write the result as data.frame. Unfortunately it gives me back an error because of the first element:
tab[[1]]
1
4
tab[[2]]
-1 1
2 2
..........
So I would like to add 0 to tab[[1]] regarding -1 frequencies to be able to write the results as data.frame.
And a third way:
x <- read.table(text = "C1 C2 C3 C4
1 -1 -1 1
1 1 -1 1
1 1 -1 1
1 -1 1 -1 ", header = TRUE)
sapply(sapply(x, factor, levels = c(1, -1), simplify = FALSE), table)
C1 C2 C3 C4
1 4 2 1 3
-1 0 2 3 1
Some benchmarking:
xx <- as.data.frame(matrix(sample(c(-1,1), 1e7, replace=TRUE), ncol=100))
Roland <- function(DF) {
res <- table(stack(DF))
res2 <- as.data.frame(res)
reshape(res2, timevar = "ind", idvar = "values", direction = "wide")
}
Roman <- function(x) {
sapply(sapply(x, factor, levels = c(1, -1), simplify = FALSE), table)
}
user20650 <- function(x) {
rbind(colSums(x == 1), colSums(x==-1))
}
require(microbenchmark)
microbenchmark(m1 <- Roland(xx), m2 <- Roman(xx), m3 <- user20650(xx), times = 2)
Unit: milliseconds
expr min lq median uq max neval
m1 <- Roland(xx) 17624.6297 17624.6297 18116.6595 18608.6893 18608.6893 2
m2 <- Roman(xx) 13838.2030 13838.2030 14301.9159 14765.6288 14765.6288 2
m3 <- user20650(xx) 786.3689 786.3689 788.7253 791.0818 791.0818 2
DF <- read.table(text="C1 C2 C3 C4
1 -1 -1 1
1 1 -1 1
1 1 -1 1
1 -1 1 -1 ",header=TRUE)
res <- table(stack(DF))
# ind
# values C1 C2 C3 C4
# -1 0 2 3 1
# 1 4 2 1 3
res2 <- as.data.frame(res)
# values ind Freq
# 1 -1 C1 0
# 2 1 C1 4
# 3 -1 C2 2
# 4 1 C2 2
# 5 -1 C3 3
# 6 1 C3 1
# 7 -1 C4 1
# 8 1 C4 3
reshape(res2, timevar = "ind", idvar = "values", direction = "wide")
# values Freq.C1 Freq.C2 Freq.C3 Freq.C4
# 1 -1 0 2 3 1
# 2 1 4 2 1 3
An alternative is res <- ftable(stack(DF)), which can be written to a file directly using write.ftable.
Here is my data:
sub <- paste ("s", 1:6, sep = "")
mark1a <- c("A", "A", "B", "d1", "A", 2)
mark1b <- c("A", "B", "d1", 2, "d1", "A")
myd <- data.frame (sub, mark1a, mark1b)
myd
sub mark1a mark1b
1 s1 A A
2 s2 A B
3 s3 B d1
4 s4 d1 2
5 s5 A d1
6 s6 2 A
I want create a design matrix of the pair of variables (columns) - mark1a and mark1b. A design matrix will consists of length (unique (c(mark1a, mark1b))) for each unique (c(mark1a, mark1b). then 1 or 2 based on if the particular number is present once or twice in the columns and else 0. The following is expected output (not a figure):
I could understand how this can be done:
You could try something like this:
cbind(myd, t(apply(myd, 1, function(x) sapply(unique(unlist(myd[, 2:3])), function(y) sum(x==y)))))
1 s1 A A 2 0 0 0
2 s2 A B 1 1 0 0
3 s3 B d1 0 1 1 0
4 s4 d1 2 0 0 1 1
5 s5 A d1 1 0 1 0
6 s6 2 A 1 0 0 1
First, make sure that the mark1a and mark1b columns share the same levels:
all.levels <- levels(myd["mark1a", "mark1b"])
levels(myd$mark1a) <- all.levels
levels(myd$mark1b) <- all.levels
Then you can compute the sum of two frequency tables and bind it to myd:
library(plyr)
cbind(myd, ddply(myd, "sub", function(x)table(x$mark1a) + table(x$mark1b))[,-1])
# sub mark1a mark1b 2 A B d1
# 1 s1 A A 0 2 0 0
# 2 s2 A B 0 1 1 0
# 3 s3 B d1 0 0 1 1
# 4 s4 d1 2 1 0 0 1
# 5 s5 A d1 0 1 0 1
# 6 s6 2 A 1 1 0 0
I would say the solution from #jmsigner is the way go to for a one-liner, but I usually get confused by those nested apply (and its relatives) solutions.
Here's a similar solution:
# Identify all the levels in `mark1a` and `mark1b`
mydLevels = unique(c(levels(myd$mark1a), levels(myd$mark1b)))
# Use these levels and an anonymous function with `lapply`
temp = data.frame(lapply(mydLevels,
function(x) rowSums(myd[-1] == x)+0))
colnames(temp) = mydLevels
# This gives you the correct output, but not in the order
# that you have in your original question.
cbind(myd, temp)
# sub mark1a mark1b 2 A B d1
# 1 s1 A A 0 2 0 0
# 2 s2 A B 0 1 1 0
# 3 s3 B d1 0 0 1 1
# 4 s4 d1 2 1 0 0 1
# 5 s5 A d1 0 1 0 1
# 6 s6 2 A 1 1 0 0