R Data Transformations

I need to look at the data in a data frame in a different way. Here is the problem:
I have a data frame as follows:
Person Item BuyOrSell
1 a B
1 b S
1 a S
2 d B
3 a S
3 e S
One of the requirements I have is to see the data as follows: show the number of transactions made by each Person on each item, broken down by the transaction type (B or S).
Person aB aS bB bS dB dS eB eS
1 1 1 0 1 0 0 0 0
2 0 0 0 0 1 0 0 0
3 1 0 0 0 0 0 0 1
So I created a new column by pasting together the Item and BuyOrSell values:
df$newcol <- paste(df$Item, df$BuyOrSell, sep = "-")
with(df, table(Person, newcol))
and was able to achieve the above results.
The last transformation requirement, which was a tough nut to crack, was as follows:
aB aS bB bS dB dS eB eS
aB 1 1 0 1 0 0 0 0
aS 1 2 0 1 0 0 0 1
bB 0 0 0 0 0 0 0 0
bS 1 1 0 0 0 0 0 0
dB 0 0 0 0 1 0 0 0
dS 0 0 0 0 0 0 0 0
eB 0 0 0 0 0 0 0 0
eS 0 1 0 0 0 0 0 1
where the above table had to be filled in with the number of people who made a particular transaction and also made a transaction on another item.
I tried table(newcol, newcol), but it generated counts only on the diagonal (aB-aB, aS-aS, bB-bB, ...) and 0s for all other combinations.
Any ideas on what package or command will let me crack this nut?
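For reference, here is a reproducible version of the example data frame (a sketch; plain character columns are assumed):
df <- data.frame(Person = c(1, 1, 1, 2, 3, 3),
                 Item = c("a", "b", "a", "d", "a", "e"),
                 BuyOrSell = c("B", "S", "S", "B", "S", "S"),
                 stringsAsFactors = FALSE)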

Isn't the final result just:
# Following the dcast approach in the answer below, but using `acast` so the result is a matrix
A <- acast(Person~Item+BuyOrSell,data=df,fun.aggregate=length,drop=FALSE)
# A' * A
> t(A) %*% A
# a_B a_S b_B b_S d_B d_S e_B e_S
# a_B 1 1 0 1 0 0 0 0
# a_S 1 2 0 1 0 0 0 1
# b_B 0 0 0 0 0 0 0 0
# b_S 1 1 0 1 0 0 0 0
# d_B 0 0 0 0 1 0 0 0
# d_S 0 0 0 0 0 0 0 0
# e_B 0 0 0 0 0 0 0 0
# e_S 0 1 0 0 0 0 0 1
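As a side note, base R's crossprod() computes the same product without writing out the transpose:
crossprod(A)   # equivalent to t(A) %*% A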

I think there is a better way, but here's a method using the package reshape2.
require(reshape2)
#reshapes data so each item and buy/sell event interaction occurs once
df2 <- dcast(Person~Item+BuyOrSell,data=df,fun.aggregate=length,drop=FALSE)
df2
# Person a_B a_S b_B b_S d_B d_S e_B e_S
# 1 1 1 1 0 1 0 0 0 0
# 2 2 0 0 0 0 1 0 0 0
# 3 3 0 1 0 0 0 0 0 1
#reshapes data so every row is an interaction by person
df3 <- melt(df2,id.vars="Person")
head(df3)
# Person variable value
# 1 1 a_B 1
# 2 2 a_B 0
# 3 3 a_B 0
# 4 1 a_S 1
# 5 2 a_S 0
# 6 3 a_S 1
#removes empty rows where no action occurred
#removes value column
df4 <- with(df3,
  data.frame(Person = rep.int(Person, value), variable = rep.int(variable, value)))
#performs a self-merge: now each row is
#every combination of two actions that one person has done
df5 <- merge(df4,df4,by="Person")
head(df5)
# Person variable.x variable.y
# 1 1 a_B a_B
# 2 1 a_B a_S
# 3 1 a_B b_S
# 4 1 a_S a_B
# 5 1 a_S a_S
# 6 1 a_S b_S
#tabulates variable interactions
with(df5,table(variable.x,variable.y))

Blue Magister, your solution works perfectly, and I analyzed each and every step you performed.
The output of df4 was as follows:
Person variable
1 1 a_B
2 1 a_S
3 3 a_S
4 1 b_S
5 2 d_B
6 3 e_S
The output of with(df5,table(variable.x,variable.y)) was
variable.y
variable.x a_B a_S b_B b_S d_B d_S e_B e_S
a_B 1 1 0 1 0 0 0 0
a_S 1 2 0 1 0 0 0 1
b_B 0 0 0 0 0 0 0 0
b_S 1 1 0 1 0 0 0 0
d_B 0 0 0 0 1 0 0 0
d_S 0 0 0 0 0 0 0 0
e_B 0 0 0 0 0 0 0 0
e_S 0 1 0 0 0 0 0 1
which is exactly what I want.
When I look at the output of df4, it is almost identical to my newcol solution (using paste):
> df[, c("Person", "newcol")]
Person newcol
1 1 a-B
2 1 b-S
3 1 a-S
4 2 d-B
5 3 a-S
6 3 e-S
The only difference here is the ordering of the rows when compared to your df4.
So, I ended up running these commands:
dfx <- merge(df,df,by="Person")
with(dfx,table(newcol.x,newcol.y))
and it generated the following:
newcol.y
newcol.x a-B a-S b-S d-B e-S
a-B 1 1 1 0 0
a-S 1 2 1 0 1
b-S 1 1 1 0 0
d-B 0 0 0 1 0
e-S 0 1 0 0 1
The above output is missing a few rows and columns. What am I doing differently from you?
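A likely explanation, with a sketch of a fix below: table() only tabulates the levels it actually sees, and newcol is a plain character vector containing just the five combinations that occur, whereas df4$variable is a factor carrying all eight item/type levels (because of drop = FALSE in the dcast step). Making newcol a factor whose levels cover every Item x BuyOrSell combination restores the missing rows and columns:
# build every Item x BuyOrSell combination as factor levels (assumes the "-" separator used above)
all_levels <- with(df, paste(rep(sort(unique(Item)), each = 2), c("B", "S"), sep = "-"))
df$newcol <- factor(paste(df$Item, df$BuyOrSell, sep = "-"), levels = all_levels)
dfx <- merge(df, df, by = "Person")
with(dfx, table(newcol.x, newcol.y))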

Related

Count occurrences of teams in matrix in R

I have a 1000*16 matrix from a simulation with team names as characters. I want to count the number of occurrences per team across all 16 columns.
I know I could do apply(test, 2, table), but that makes the data hard to work with afterward, since not all teams are included in every column.
If you have a vector of all the unique team names, you could do something like this. I'm counting occurrences across the 16 columns for each row, tabulating over a factor with all levels so that every team (in this case, letter) is included even when it does not appear.
set.seed(15)
letter_mat <- matrix(
  sample(
    LETTERS,
    size = 1000 * 16,
    replace = TRUE
  ),
  ncol = 16,
  nrow = 1000
)
output <- t(
  apply(
    letter_mat,
    1,
    function(x) table(factor(x, levels = LETTERS))
  )
)
head(output)
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
[1,] 1 2 0 1 1 1 1 0 0 0 1 0 0 0 0 1 1 1 1 1 0 1 1 0 0 1
[2,] 0 1 0 2 2 1 0 1 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 2 2 1
[3,] 1 1 0 0 1 0 1 2 1 0 0 0 0 0 1 0 1 0 1 1 0 0 3 0 1 1
[4,] 0 1 0 0 0 1 0 0 0 2 0 1 0 0 1 1 1 1 2 0 2 3 0 0 0 0
[5,] 2 1 0 0 0 0 0 2 0 2 1 1 1 0 0 2 0 2 1 0 0 1 0 0 0 0
[6,] 0 0 0 0 0 1 3 1 0 0 0 0 1 1 3 0 1 0 0 1 0 0 0 1 0 3
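If you want the counts per column instead (matching the original apply(test, 2, table) attempt), the same factor trick applies with MARGIN = 2; a small sketch:
col_output <- apply(letter_mat, 2, function(x) table(factor(x, levels = LETTERS)))
dim(col_output)   # 26 x 16: one row per team, one column per original column, zeros included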

Building a symmetric binary matrix

I have a matrix that looks, for example, like this:
rownames V1
a 1
c 3
b 2
d 4
y 2
q 4
i 1
j 1
r 3
I want to make a symmetric binary matrix whose dimnames are the same as the rownames of the matrix above. I want to fill this matrix with 1s and 0s in such a way that 1 indicates that two variables have the same number next to them and 0 indicates the opposite. This matrix would look like:
dimnames
a c b d y q i j r
a 1 0 0 0 0 0 1 1 0
c 0 1 0 0 0 0 0 0 1
b 0 0 1 0 1 0 0 0 0
d 0 0 0 1 0 1 0 0 0
y 0 0 1 0 1 0 0 0 0
q 0 0 0 1 0 1 0 0 0
i 1 0 0 0 0 0 1 1 0
j 1 0 0 0 0 0 1 1 0
r 0 1 0 0 0 0 0 0 1
Does anybody know how I can do that?
Use dist:
DF <- read.table(text = "rownames V1
a 1
c 3
b 2
d 4
y 2
q 4
i 1
j 1
r 3", header = TRUE)
res <- as.matrix(dist(DF$V1)) == 0L
#alternatively:
#res <- !as.matrix(dist(DF$V1))
#diag(res) <- 0L #for the first version of the question, i.e. a zero diagonal
res <- +(res) #for the second version, i.e. to coerce to an integer matrix
dimnames(res) <- list(DF$rownames, DF$rownames)
res
#  a c b d y q i j r
#a 1 0 0 0 0 0 1 1 0
#c 0 1 0 0 0 0 0 0 1
#b 0 0 1 0 1 0 0 0 0
#d 0 0 0 1 0 1 0 0 0
#y 0 0 1 0 1 0 0 0 0
#q 0 0 0 1 0 1 0 0 0
#i 1 0 0 0 0 0 1 1 0
#j 1 0 0 0 0 0 1 1 0
#r 0 1 0 0 0 0 0 0 1
You can also do this using table and tcrossprod.
tcrossprod(table(DF))
# rownames
# rownames a b c d i j q r y
# a 1 0 0 0 1 1 0 0 0
# b 0 1 0 0 0 0 0 0 1
# c 0 0 1 0 0 0 0 1 0
# d 0 0 0 1 0 0 1 0 0
# i 1 0 0 0 1 1 0 0 0
# j 1 0 0 0 1 1 0 0 0
# q 0 0 0 1 0 0 1 0 0
# r 0 0 1 0 0 0 0 1 0
# y 0 1 0 0 0 0 0 0 1
If you want the row and column order as they are found in the data, rather than alphanumerically, you can subset
tcrossprod(table(DF))[DF$rownames, DF$rownames]
or use factor
tcrossprod(table(factor(DF$rownames, levels=unique(DF$rownames)), DF$V1))
If your data is large or sparse, you can build a sparse table with xtabs and use the Matrix package's sparse matrix algebra, with similar ways to change the order of the resulting table as before.
Matrix::tcrossprod(xtabs(data=DF, ~ rownames + V1, sparse=TRUE))
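For a small example like this, the same matrix can also be built directly with outer(), which keeps the rows in their original order (a minimal alternative sketch):
res2 <- +outer(DF$V1, DF$V1, "==")   # 1 where two rows share the same V1 value
dimnames(res2) <- list(DF$rownames, DF$rownames)
res2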

Transpose and create categorical values in R

I have a data frame with the structure below, from which I am looking to spread the categorical variables into indicator columns. The intent is to find the weighted mix of the variables.
data <- read.table(header=T, text='
subject weight sex test
1 2 M control
2 3 F cond1
3 2 F cond2
4 4 M control
5 3 F control
6 2 F control
')
data
Expected output:
subject weight control_F control_M cond1_F cond1_M cond2_F cond2_M
1 2 0 1 0 0 0 0
2 3 0 0 1 0 0 0
3 2 0 0 0 0 1 0
4 4 0 1 0 0 0 0
5 3 1 0 0 0 0 0
6 2 1 0 0 0 0 0
I tried using a combination of ifelse and cut, but just couldn't produce the output.
Any ideas on how I can do this?
TIA
You may use
model.matrix(~ subject + weight + sex:test - 1, data)
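To see what this produces (a quick sketch; the exact column names depend on the factor levels in data): the `- 1` drops the intercept, and `sex:test` expands into one 0/1 indicator column per sex/test combination.
m <- model.matrix(~ subject + weight + sex:test - 1, data)
colnames(m)   # inspect the generated indicator columns before renaming them to e.g. control_F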
I think model.matrix is most natural here (see @Julius' answer), but here's an alternative:
library(data.table)
setDT(data)
dcast(data, subject+weight~test+sex, fun=length, drop=c(TRUE,FALSE))
subject weight cond1_F cond1_M cond2_F cond2_M control_F control_M
1: 1 2 0 0 0 0 0 1
2: 2 3 1 0 0 0 0 0
3: 3 2 0 0 1 0 0 0
4: 4 4 0 0 0 0 0 1
5: 5 3 0 0 0 0 1 0
6: 6 2 0 0 0 0 1 0
To get the columns in the "right" order (with the control first), set factor levels before casting:
data[, test := relevel(test, "control")]
dcast(data, subject+weight~test+sex, fun=length, drop=c(TRUE,FALSE))
subject weight control_F control_M cond1_F cond1_M cond2_F cond2_M
1: 1 2 0 1 0 0 0 0
2: 2 3 0 0 1 0 0 0
3: 3 2 0 0 0 0 1 0
4: 4 4 0 1 0 0 0 0
5: 5 3 1 0 0 0 0 0
6: 6 2 1 0 0 0 0 0
(Note: reshape2's dcast isn't so good here, since its drop option applies to both rows and cols.)

How to force table to have equal dimensions?

How can I force the dimensions of a table to be equal in R?
For example:
a <- c(0,1,2,3,4,5,1,3,4,5,3,4,5)
b <- c(1,2,3,3,3,3,3,3,3,3,5,5,6)
c <- table(a,b)
print(c)
# b
#a 1 2 3 5 6
# 0 1 0 0 0 0
# 1 0 1 1 0 0
# 2 0 0 1 0 0
# 3 0 0 2 1 0
# 4 0 0 2 1 0
# 5 0 0 2 0 1
However, I am looking for the following result:
print(c)
# b
#a 0 1 2 3 4 5 6
# 0 0 1 0 0 0 0 0
# 1 0 0 1 1 0 0 0
# 2 0 0 0 1 0 0 0
# 3 0 0 0 2 0 1 0
# 4 0 0 0 2 0 1 0
# 5 0 0 0 2 0 0 1
# 6 0 0 0 0 0 0 0
By using factors. table doesn't know the levels of your variable unless you tell it in some way!
a <- c(0,1,2,3,4,5,1,3,4,5,3,4,5)
b <- c(1,2,3,3,3,3,3,3,3,3,5,5,6)
a <- factor(a, levels = 0:6)
b <- factor(b, levels = 0:6)
table(a,b)
# b
#a 0 1 2 3 4 5 6
# 0 0 1 0 0 0 0 0
# 1 0 0 1 1 0 0 0
# 2 0 0 0 1 0 0 0
# 3 0 0 0 2 0 1 0
# 4 0 0 0 2 0 1 0
# 5 0 0 0 2 0 0 1
# 6 0 0 0 0 0 0 0
Edit: The general way to force a square cross-tabulation is to do something like
x <- factor(a, levels = union(a, b))
y <- factor(b, levels = union(a, b))
table(x, y)

Faster way to multiplication in data frame

I have a data frame (named t) like this:
ID N com_a com_b com_c
A 3 1 0 0
A 5 0 1 0
B 1 1 0 0
B 1 0 1 0
B 4 0 0 1
B 4 1 0 0
I am trying to compute com_a*N, com_b*N, and com_c*N:
ID N com_a com_b com_c com_a_N com_b_N com_c_N
A 3 1 0 0 3 0 0
A 5 0 1 0 0 5 0
B 1 1 0 0 1 0 0
B 1 0 1 0 0 1 0
B 4 0 0 1 0 0 4
B 4 1 0 0 4 0 0
I used a for loop, but it takes a long time. How do I do this quickly with big data?
for (i in 1:dim(t)[1]) {
  t$com_a_N[i] <- t$com_a[i] * t$N[i]
  t$com_b_N[i] <- t$com_b[i] * t$N[i]
  t$com_c_N[i] <- t$com_c[i] * t$N[i]
}
t <- transform(t,
com_a_N=com_a*N,
com_b_N=com_b*N,
com_c_N=com_c*N)
should be much faster. data.table solutions might be faster still.
You can use sweep for this
(st <- sweep(t[, 3:5], 1, t$N, "*"))
# com_a com_b com_c
#1 3 0 0
#2 0 5 0
#3 1 0 0
#4 0 1 0
#5 0 0 4
#6 4 0 0
The new names can be created with paste and setNames, and you can add the new columns to the existing data.frame with cbind. This will scale for any number of columns.
cbind(t, setNames(st, paste(names(st), "N", sep="_")))
# ID N com_a com_b com_c com_a_N com_b_N com_c_N
#1 A 3 1 0 0 3 0 0
#2 A 5 0 1 0 0 5 0
#3 B 1 1 0 0 1 0 0
#4 B 1 0 1 0 0 1 0
#5 B 4 0 0 1 0 0 4
#6 B 4 1 0 0 4 0 0
A data.table solution, as proposed by @BenBolker:
library(data.table)
setDT(t)[, c("com_a_N", "com_b_N", "com_c_N") := list(com_a*N, com_b*N, com_c*N)]
## ID N com_a com_b com_c com_a_N com_b_N com_c_N
## 1: A 3 1 0 0 3 0 0
## 2: A 5 0 1 0 0 5 0
## 3: B 1 1 0 0 1 0 0
## 4: B 1 0 1 0 0 1 0
## 5: B 4 0 0 1 0 0 4
## 6: B 4 1 0 0 4 0 0
Even faster, using vectorised multiplication across the columns:
cbind(dat,dat[,3:5]*dat$N)
Though you should set the colnames afterwards.
To avoid using explicit column indices (not recommended), you can use some grep magic:
cbind(dat,dat[,grep('com',colnames(dat))]*dat$N)
Another option with dplyr:
require(dplyr)
t <- mutate(t, com_a_N=com_a*N,
com_b_N=com_b*N,
com_c_N=com_c*N)
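With more recent dplyr (1.0 or later), across() avoids spelling out each column; a sketch, assuming the same _N suffix for the new columns:
library(dplyr)
t <- mutate(t, across(starts_with("com"), ~ .x * N, .names = "{.col}_N"))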
