Faster way to do multiplication in a data frame - R

I have a data frame (named t) like this:
ID N com_a com_b com_c
A 3 1 0 0
A 5 0 1 0
B 1 1 0 0
B 1 0 1 0
B 4 0 0 1
B 4 1 0 0
I am trying to compute com_a*N, com_b*N and com_c*N, to get this:
ID N com_a com_b com_c com_a_N com_b_N com_c_N
A 3 1 0 0 3 0 0
A 5 0 1 0 0 5 0
B 1 1 0 0 1 0 0
B 1 0 1 0 0 1 0
B 4 0 0 1 0 0 4
B 4 1 0 0 4 0 0
I use a for loop, but it takes a long time. How can I do this quickly on big data?
for (i in 1:dim(t)[1]){
t$com_a_N[i]=t$com_a[i]*t$N[i]
t$com_b_N[i]=t$com_b[i]*t$N[i]
t$com_c_N[i]=t$com_c[i]*t$N[i]
}

t <- transform(t,
com_a_N=com_a*N,
com_b_N=com_b*N,
com_c_N=com_c*N)
This should be much faster than the loop. data.table solutions might be faster still.

You can use sweep for this
(st <- sweep(t[, 3:5], 1, t$N, "*"))
# com_a com_b com_c
#1 3 0 0
#2 0 5 0
#3 1 0 0
#4 0 1 0
#5 0 0 4
#6 4 0 0
The new names can be created with paste and setNames, and you can add the new columns to the existing data.frame with cbind. This will scale for any number of columns.
cbind(t, setNames(st, paste(names(st), "N", sep="_")))
# ID N com_a com_b com_c com_a_N com_b_N com_c_N
#1 A 3 1 0 0 3 0 0
#2 A 5 0 1 0 0 5 0
#3 B 1 1 0 0 1 0 0
#4 B 1 0 1 0 0 1 0
#5 B 4 0 0 1 0 0 4
#6 B 4 1 0 0 4 0 0

A data.table solution as proposed by @BenBolker:
library(data.table)
setDT(t)[, c("com_a_N", "com_b_N", "com_c_N") := list(com_a*N, com_b*N, com_c*N)]
## ID N com_a com_b com_c com_a_N com_b_N com_c_N
## 1: A 3 1 0 0 3 0 0
## 2: A 5 0 1 0 0 5 0
## 3: B 1 1 0 0 1 0 0
## 4: B 1 0 1 0 0 1 0
## 5: B 4 0 0 1 0 0 4
## 6: B 4 1 0 0 4 0 0

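Even faster, multiply the columns directly; the vector dat$N is recycled down each column (this is element-wise multiplication, not matrix multiplication). Here dat is the data frame called t in the question:
cbind(dat, dat[, 3:5] * dat$N)
You will still need to set the new column names afterwards.
To avoid explicit column indices (not recommended), you can select the columns by name with grep:
cbind(dat, dat[, grep('com', colnames(dat))] * dat$N)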
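The vectorized multiplication and the renaming step can be combined. This is a minimal base R sketch, assuming the columns to scale all start with "com_"; the helper name cols is made up for illustration:

```r
# Sample data from the question (called dat here, as in this answer)
dat <- data.frame(ID = c("A", "A", "B", "B", "B", "B"),
                  N = c(3, 5, 1, 1, 4, 4),
                  com_a = c(1, 0, 1, 0, 0, 1),
                  com_b = c(0, 1, 0, 1, 0, 0),
                  com_c = c(0, 0, 0, 0, 1, 0))

# Pick the columns by name rather than by position
cols <- grep("^com_", names(dat), value = TRUE)

# Multiply every selected column by N in one vectorized step and
# store the results under new names (com_a_N, com_b_N, com_c_N)
dat[paste0(cols, "_N")] <- dat[cols] * dat$N

dat
```

Because the new names are derived from cols, the same two lines work for any number of matching columns.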

Another option with dplyr:
require(dplyr)
t <- mutate(t, com_a_N=com_a*N,
com_b_N=com_b*N,
com_c_N=com_c*N)

Related

summing all possible left to right diagonals along specified columns in a data frame by group?

Suppose I have something like this:
df<-data.frame(group=c(1, 1,2, 2, 2, 4,4,4,4,6,6,6),
binary1=c(1,0,1,0,0,0,0,0,0,0,0,0),
binary2=c(0,1,0,1,0,1,0,0,0,0,1,1),
binary3=c(0,0,0,0,1,0,1,0,0,0,0,0),
binary4=c(0,0,0,0,0,0,0,1,0,0,0,0))
I want to sum along all possible left-to-right diagonals within groups (i.e. groups 1, 2, 4 and 6) and return the max sum. This is part of a larger data frame, so I would like to restrict the sums to binary1-binary4. Does anyone know if this is possible?
Here's my desired output:
group binary1 binary2 binary3 binary4 want
1 1 1 0 0 0 2
2 1 0 1 0 0 2
3 2 1 0 0 0 3
4 2 0 1 0 0 3
5 2 0 0 1 0 3
6 4 0 1 0 0 3
7 4 0 0 1 0 3
8 4 0 0 0 1 3
9 4 0 0 0 0 3
10 6 0 0 0 0 1
11 6 0 1 0 0 1
12 6 0 1 0 0 1
I have circled the "diagonals" I would like summed for group 4 in this image as an example:
Here is another solution where we use row and col indices to get all possible combinations of diagonals. Use by to split by group and merge it with original dataframe.
max_diag <- function(x) max(sapply(split(as.matrix(x), row(x) - col(x)), sum))
merge(df, stack(by(df[-1], df$group, max_diag)), by.x = "group", by.y = "ind")
# group binary1 binary2 binary3 binary4 values
#1 1 1 0 0 0 2
#2 1 0 1 0 0 2
#3 2 1 0 0 0 3
#4 2 0 1 0 0 3
#5 2 0 0 1 0 3
#6 4 0 1 0 0 3
#7 4 0 0 1 0 3
#8 4 0 0 0 1 3
#9 4 0 0 0 0 3
#10 6 0 0 0 0 1
#11 6 0 1 0 0 1
#12 6 0 1 0 0 1
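To see why splitting on row(x) - col(x) works: every cell on the same left-to-right diagonal has the same row-minus-column offset. A small sketch using the group 4 rows from the question (the names g4, offsets and diag_sums are illustrative only):

```r
# Rows of group 4, binary1..binary4
g4 <- rbind(c(0, 1, 0, 0),
            c(0, 0, 1, 0),
            c(0, 0, 0, 1),
            c(0, 0, 0, 0))

# Each cell gets its diagonal label: 0 is the main diagonal,
# negative values lie above it, positive values below it
offsets <- row(g4) - col(g4)

# Sum the cells that share a label, one sum per diagonal
diag_sums <- sapply(split(g4, offsets), sum)

diag_sums
max(diag_sums)
```

The three 1s of group 4 all sit at offset -1, which is the diagonal that produces the maximum.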
You can split the data.frame by group and sum the diagonal of each piece using diag(). Once you have the diagonal sum per group, put it back into the data.frame by indexing with the group.
Should group 4 be zero? Or am I missing something:
DIAG = by(df[,-1],df$group,function(i)sum(diag(as.matrix(i))))
df$want = DIAG[as.character(df$group)]
If I understand your definition correctly, we can define a function that computes the sum of the diagonal starting at each column:
main_diag = function(m){
sapply(1:(ncol(m)-1),function(i)sum(diag(m[,i:ncol(m)])))
}
Thanks to @IceCreamToucan for correcting this. Then we take the max over these diagonal sums for the matrix and for its transpose:
DIAG = by(df[,-1],df$group,function(i){
i = as.matrix(i)
max(main_diag(i),main_diag(t(i)))
})
df$want = DIAG[as.character(df$group)]
group binary1 binary2 binary3 binary4 want
1 1 1 0 0 0 2
2 1 0 1 0 0 2
3 2 1 0 0 0 3
4 2 0 1 0 0 3
5 2 0 0 1 0 3
6 4 0 1 0 0 3
7 4 0 0 1 0 3
8 4 0 0 0 1 3
9 4 0 0 0 0 3
10 6 0 0 0 0 1
11 6 0 1 0 0 1
12 6 0 1 0 0 1

Transpose and create categorical values in R

I have a data frame with the below structure from which I am looking to transpose the variables into categorical. Intent is to find the weighted mix of the variables.
data <- read.table(header=T, text='
subject weight sex test
1 2 M control
2 3 F cond1
3 2 F cond2
4 4 M control
5 3 F control
6 2 F control
')
data
Expected output:
subject weight control_F control_M cond1_F cond1_M cond2_F cond2_M
1 2 0 1 0 0 0 0
2 3 0 0 1 0 0 0
3 2 0 0 0 0 1 0
4 4 0 1 0 0 0 0
5 3 1 0 0 0 0 0
6 2 1 0 0 0 0 0
I tried using a combination of ifelse and cut, but just couldn't produce the output.
Any ideas on how I can do this?
TIA
You may use
model.matrix(~ subject + weight + sex:test - 1, data)
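If you want the column names and order from the expected output exactly, one possible variation is to build the combined factor yourself and hand that to model.matrix. This is only a sketch: the names grp, ind and res and the explicit level order are assumptions made for illustration.

```r
data <- data.frame(subject = 1:6,
                   weight = c(2, 3, 2, 4, 3, 2),
                   sex = c("M", "F", "F", "M", "F", "F"),
                   test = c("control", "cond1", "cond2",
                            "control", "control", "control"))

# Combine test and sex into one factor; lex.order = TRUE makes the
# first factor (test) vary slowest: control_F, control_M, cond1_F, ...
grp <- interaction(factor(data$test, levels = c("control", "cond1", "cond2")),
                   factor(data$sex, levels = c("F", "M")),
                   sep = "_", lex.order = TRUE)

# With no intercept, a single factor expands to one 0/1 column per level
ind <- model.matrix(~ 0 + grp)
colnames(ind) <- levels(grp)

res <- cbind(data[c("subject", "weight")], ind)
res
```

Empty combinations such as cond1_M are kept as all-zero columns because interaction() retains unused levels by default.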
I think model.matrix is most natural here (see @Julius' answer), but here's an alternative:
library(data.table)
setDT(data)
dcast(data, subject+weight~test+sex, fun=length, drop=c(TRUE,FALSE))
subject weight cond1_F cond1_M cond2_F cond2_M control_F control_M
1: 1 2 0 0 0 0 0 1
2: 2 3 1 0 0 0 0 0
3: 3 2 0 0 1 0 0 0
4: 4 4 0 0 0 0 0 1
5: 5 3 0 0 0 0 1 0
6: 6 2 0 0 0 0 1 0
To get the columns in the "right" order (with the control first), set factor levels before casting:
data[, test := relevel(factor(test), "control")]  # factor() is needed when test was read in as character (the default since R 4.0)
dcast(data, subject+weight~test+sex, fun=length, drop=c(TRUE,FALSE))
subject weight control_F control_M cond1_F cond1_M cond2_F cond2_M
1: 1 2 0 1 0 0 0 0
2: 2 3 0 0 1 0 0 0
3: 3 2 0 0 0 0 1 0
4: 4 4 0 1 0 0 0 0
5: 5 3 1 0 0 0 0 0
6: 6 2 1 0 0 0 0 0
(Note: reshape2's dcast isn't so good here, since its drop option applies to both rows and cols.)

adding data frame of counts to template data frame in R

I have data.frames of counts such as:
a <- data.frame(id=1:10,
"1"=c(rep(1,3),rep(0,7)),
"3"=c(rep(0,4),rep(1,6)))
names(a)[2:3] <- c("1","3")
a
> a
id 1 3
1 1 1 0
2 2 1 0
3 3 1 0
4 4 0 0
5 5 0 1
6 6 0 1
7 7 0 1
8 8 0 1
9 9 0 1
10 10 0 1
and a template data.frame such as
m <- data.frame(id=1:10,
"1"= rep(0,10),
"2"= rep(0,10),
"3"= rep(0,10),
"4"= rep(0,10))
names(m)[-1] <- 1:4
m
> m
id 1 2 3 4
1 1 0 0 0 0
2 2 0 0 0 0
3 3 0 0 0 0
4 4 0 0 0 0
5 5 0 0 0 0
6 6 0 0 0 0
7 7 0 0 0 0
8 8 0 0 0 0
9 9 0 0 0 0
10 10 0 0 0 0
and I want to add the values of a into the template m in the appropriate columns, leaving the rest as 0.
This works, but I would like to know if there is a more elegant way, perhaps using plyr or data.table:
library(plyr)  # provides rbind.fill
provi <- rbind.fill(a,m)
provi[is.na(provi)] <- 0
mnew <- aggregate(provi[,-1],by=list(provi$id),FUN=sum)
names(mnew)[1] <- "id"
mnew <- mnew[c(1,order(names(mnew)[-1])+1)]
mnew
> mnew
id 1 2 3 4
1 1 1 0 0 0
2 2 1 0 0 0
3 3 1 0 0 0
4 4 0 0 0 0
5 5 0 0 1 0
6 6 0 0 1 0
7 7 0 0 1 0
8 8 0 0 1 0
9 9 0 0 1 0
10 10 0 0 1 0
I guess the concise option would be:
m[names(a)] <- a
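The one-liner relies on a and m having the same ids in the same row order. As a hedged sketch for the case where that does not hold, rows can be aligned by id with match first. The smaller data frames and the helper names rows and cols below are made up for illustration:

```r
# Template with ids 1..5 and count columns "1".."4", all zero
m <- data.frame(id = 1:5, `1` = 0, `2` = 0, `3` = 0, `4` = 0,
                check.names = FALSE)

# Counts for only some ids, in a different order
a <- data.frame(id = c(4L, 2L, 5L),
                `1` = c(1, 0, 0),
                `3` = c(0, 1, 1),
                check.names = FALSE)

rows <- match(a$id, m$id)        # where each row of a belongs in m
cols <- setdiff(names(a), "id")  # count columns present in a

# Write the counts into the matching rows and columns only
m[rows, cols] <- a[cols]
m
```

Columns and ids that do not appear in a keep their template value of 0.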
Or we match the column names ('i1'), use that to create the column index with max.col, cbind with the row index ('i2'), and a similar step can be done to create 'i3'. We change the values in 'm' corresponding to 'i2' with the 'a' values based on 'i3'.
i1 <- match(names(a)[-1], names(m)[-1])
i2 <- cbind(m$id, i1[max.col(a[-1], 'first')]+1L)
i3 <- cbind(a$id, max.col(a[-1], 'first')+1L)
m[i2] <- a[i3]
m
# id 1 2 3 4
#1 1 1 0 0 0
#2 2 1 0 0 0
#3 3 1 0 0 0
#4 4 0 0 0 0
#5 5 0 0 1 0
#6 6 0 0 1 0
#7 7 0 0 1 0
#8 8 0 0 1 0
#9 9 0 0 1 0
#10 10 0 0 1 0
A data.table option would be melt/dcast
library(data.table)
dcast(melt(setDT(a), id.var='id')[,
variable:= factor(variable, levels=1:4)],
id~variable, value.var='value', drop=FALSE, fill=0)
# id 1 2 3 4
# 1: 1 1 0 0 0
# 2: 2 1 0 0 0
# 3: 3 1 0 0 0
# 4: 4 0 0 0 0
# 5: 5 0 0 1 0
# 6: 6 0 0 1 0
# 7: 7 0 0 1 0
# 8: 8 0 0 1 0
# 9: 9 0 0 1 0
#10: 10 0 0 1 0
A similar dplyr/tidyr option would be
library(dplyr)
library(tidyr)
gather(a, Var, Val, -id) %>%
mutate(Var=factor(Var, levels=1:4)) %>%
spread(Var, Val, drop=FALSE, fill=0)
You could use merge, too:
res <- suppressWarnings(merge(a, m, by="id", suffixes = c("", "")))
(res[, which(!duplicated(names(res)))][, names(m)])
# id 1 2 3 4
# 1 1 1 0 0 0
# 2 2 1 0 0 0
# 3 3 1 0 0 0
# 4 4 0 0 0 0
# 5 5 0 0 1 0
# 6 6 0 0 1 0
# 7 7 0 0 1 0
# 8 8 0 0 1 0
# 9 9 0 0 1 0
# 10 10 0 0 1 0

How to convert two factors to adjacency matrix in R?

I have a data frame with two columns (key and value) where each column is a factor:
df = data.frame(gl(3,4,labels=c('a','b','c')), gl(6,2))
colnames(df) = c("key", "value")
key value
1 a 1
2 a 1
3 a 2
4 a 2
5 b 3
6 b 3
7 b 4
8 b 4
9 c 5
10 c 5
11 c 6
12 c 6
I want to convert it to adjacency matrix (in this case 3x6 size) like:
1 2 3 4 5 6
a 1 1 0 0 0 0
b 0 0 1 1 0 0
c 0 0 0 0 1 1
So that I can run clustering on it (group keys that have similar values together) with either kmeans or hclust.
Closest that I was able to get was using model.matrix( ~ value, df) which results in:
(Intercept) value2 value3 value4 value5 value6
1 1 0 0 0 0 0
2 1 0 0 0 0 0
3 1 1 0 0 0 0
4 1 1 0 0 0 0
5 1 0 1 0 0 0
6 1 0 1 0 0 0
7 1 0 0 1 0 0
8 1 0 0 1 0 0
9 1 0 0 0 1 0
10 1 0 0 0 1 0
11 1 0 0 0 0 1
12 1 0 0 0 0 1
but results aren't grouped by key yet.
From another side I can collapse this dataset into groups using:
aggregate(df$value, by=list(df$key), unique)
Group.1 x.1 x.2
1 a 1 2
2 b 3 4
3 c 5 6
But I don't know what to do next...
Can someone help to solve this?
An easy way to do it in base R:
res <-table(df)
res[res>0] <-1
res
value
#key 1 2 3 4 5 6
# a 1 1 0 0 0 0
# b 0 0 1 1 0 0
# c 0 0 0 0 1 1
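Since the goal is clustering, here is one way the result could be passed on to hclust. It is a sketch only: converting to a plain integer matrix and using the "binary" distance are choices made for illustration, not something the question prescribes.

```r
df <- data.frame(key = gl(3, 4, labels = c("a", "b", "c")),
                 value = gl(6, 2))

# Cross-tabulate, then reduce the counts to 0/1 in a plain matrix
counts <- table(df$key, df$value)
adj <- matrix(as.integer(counts > 0), nrow = nrow(counts),
              dimnames = list(rownames(counts), colnames(counts)))
adj

# Distance between keys based on which values they share
d <- dist(adj, method = "binary")
hc <- hclust(d)
```

In this sample data no two keys share a value, so all pairwise distances are equal and the tree is not informative; with overlapping keys, cutree(hc, k) would return the group labels.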

R data.table condition within group, but recorded at first instance in group

I have data that looks a bit like this:
df <- data.frame(ID=c(rep(1,4),rep(2,2),rep(3,2),4), TYPE=c(1,3,2,4,1,2,2,3,2),
SEQUENCE=c(seq(1,4),1,2,1,2,1))
ID TYPE SEQUENCE
1 1 1
1 3 2
1 2 3
1 4 4
2 1 1
2 2 2
3 2 1
3 3 2
4 2 1
I now need to check whether a certain type is present in each ID block (binary), but only record the answer in the first record per block (SEQUENCE == 1).
The best I came up with so far is coding them in the row they are present in, e.g.
library(data.table)
DT <- data.table(df)
DT$A[DT$TYPE==1] <- 1
DT$B[DT$TYPE==2] <- 1
DT$C[DT$TYPE==3] <- 1
DT$D[DT$TYPE==4] <- 1
DT[is.na(DT)] <- 0
RESULT:
ID TYPE SEQUENCE A B C D
1 1 1 1 0 0 0
1 3 2 0 0 1 0
1 2 3 0 1 0 0
1 4 4 0 0 0 1
2 1 1 1 0 0 0
2 2 2 0 1 0 0
3 2 1 0 1 0 0
3 3 2 0 0 1 0
4 2 1 0 1 0 0
However, the result should look like this:
ID TYPE SEQUENCE A B C D
1 1 1 1 1 1 1
1 3 2 0 0 0 0
1 2 3 0 0 0 0
1 4 4 0 0 0 0
2 1 1 1 1 0 0
2 2 2 0 0 0 0
3 2 1 0 1 1 0
3 3 2 0 0 0 0
4 2 1 0 1 0 0
I assume this can be done with data.table, but I haven't quite found the correct syntax.
This makes one copy of the data.table:
DT[, FAC := factor(TYPE, labels=LETTERS[1:4])]
DT <- dcast.data.table(DT, ID+TYPE+SEQUENCE~FAC, fun.aggregate=length)
DT[,LETTERS[1:4] := lapply(.SD,
function(x) c(any(as.logical(x)), rep(0L, length(x)-1))),
.SDcols=LETTERS[1:4], by=ID]
# ID TYPE SEQUENCE A B C D
#1: 1 1 1 1 1 1 1
#2: 1 2 3 0 0 0 0
#3: 1 3 2 0 0 0 0
#4: 1 4 4 0 0 0 0
#5: 2 1 1 1 1 0 0
#6: 2 2 2 0 0 0 0
#7: 3 2 1 0 1 1 0
#8: 3 3 2 0 0 0 0
#9: 4 2 1 0 1 0 0
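Note that the output above is ordered by ID and TYPE, so the flags land on the first row of each ID in that order. A base R sketch that writes the flags explicitly to the SEQUENCE == 1 row and keeps the original row order could look like this (the names present, flags, first and res are illustrative):

```r
df <- data.frame(ID = c(rep(1, 4), rep(2, 2), rep(3, 2), 4),
                 TYPE = c(1, 3, 2, 4, 1, 2, 2, 3, 2),
                 SEQUENCE = c(seq(1, 4), 1, 2, 1, 2, 1))

# ID x TYPE table of presence (TRUE if the type occurs in that ID block)
present <- table(df$ID, factor(df$TYPE, levels = 1:4)) > 0

# Start with all zeros, one column per type
flags <- matrix(0L, nrow(df), 4, dimnames = list(NULL, LETTERS[1:4]))

# Fill only the first record of each block with that block's presence row
first <- df$SEQUENCE == 1
flags[first, ] <- present[as.character(df$ID[first]), ]

res <- cbind(df, flags)
res
```

This assumes every ID block has exactly one row with SEQUENCE == 1, as in the sample data.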
