I have a data frame with two columns (key and value) where each column is a factor:
df = data.frame(gl(3,4,labels=c('a','b','c')), gl(6,2))
colnames(df) = c("key", "value")
key value
1 a 1
2 a 1
3 a 2
4 a 2
5 b 3
6 b 3
7 b 4
8 b 4
9 c 5
10 c 5
11 c 6
12 c 6
I want to convert it to an adjacency matrix (3x6 in this case) like:
1 2 3 4 5 6
a 1 1 0 0 0 0
b 0 0 1 1 0 0
c 0 0 0 0 1 1
The goal is to run clustering on it (grouping keys that have similar values together) with either kmeans or hclust.
The closest I was able to get was model.matrix( ~ value, df), which results in:
(Intercept) value2 value3 value4 value5 value6
1 1 0 0 0 0 0
2 1 0 0 0 0 0
3 1 1 0 0 0 0
4 1 1 0 0 0 0
5 1 0 1 0 0 0
6 1 0 1 0 0 0
7 1 0 0 1 0 0
8 1 0 0 1 0 0
9 1 0 0 0 1 0
10 1 0 0 0 1 0
11 1 0 0 0 0 1
12 1 0 0 0 0 1
but results aren't grouped by key yet.
Alternatively, I can collapse this dataset into groups using:
aggregate(df$value, by=list(df$key), unique)
Group.1 x.1 x.2
1 a 1 2
2 b 3 4
3 c 5 6
But I don't know what to do next...
Can someone help to solve this?
An easy way to do it in base R:
res <- table(df)
res[res > 0] <- 1
res
#    value
#key 1 2 3 4 5 6
#  a 1 1 0 0 0 0
#  b 0 0 1 1 0 0
#  c 0 0 0 0 1 1
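To connect this back to the clustering goal in the question, here is a minimal sketch that feeds the 0/1 table to hclust. The "binary" distance, the value k = 2, and the names adj, hc and clusters are assumptions made for illustration; the question does not specify them.

```r
# Rebuild the example data from the question
df <- data.frame(key = gl(3, 4, labels = c("a", "b", "c")),
                 value = gl(6, 2))

# Cross-tabulate, then drop the "table" class to get a plain integer matrix
adj <- unclass(table(df))
adj[adj > 0] <- 1L               # collapse counts to presence/absence

# Assumed choice: binary distance between keys, then hierarchical clustering
hc <- hclust(dist(adj, method = "binary"))
clusters <- cutree(hc, k = 2)    # k = 2 is arbitrary here
```

In this toy data no two keys share a value, so every pairwise distance is 1 and the tree is not informative. With real data, keys that share many values should end up closer together.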
Suppose I have something like this:
df<-data.frame(group=c(1, 1,2, 2, 2, 4,4,4,4,6,6,6),
binary1=c(1,0,1,0,0,0,0,0,0,0,0,0),
binary2=c(0,1,0,1,0,1,0,0,0,0,1,1),
binary3=c(0,0,0,0,1,0,1,0,0,0,0,0),
binary4=c(0,0,0,0,0,0,0,1,0,0,0,0))
I want to sum along all possible left-to-right diagonals within each group (i.e. groups 1, 2, 4 and 6) and return the max sum. The data is in a data frame, so I would like to sum only along binary1-binary4. Does anyone know if this is possible?
Here's my desired output:
group binary1 binary2 binary3 binary4 want
1 1 1 0 0 0 2
2 1 0 1 0 0 2
3 2 1 0 0 0 3
4 2 0 1 0 0 3
5 2 0 0 1 0 3
6 4 0 1 0 0 3
7 4 0 0 1 0 3
8 4 0 0 0 1 3
9 4 0 0 0 0 3
10 6 0 0 0 0 1
11 6 0 1 0 0 1
12 6 0 1 0 0 1
I have circled the "diagonals" I would like summed for group 4 in this image as an example:
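Here is another solution that uses the row and col indices to enumerate every diagonal: cells with the same value of row(x) - col(x) lie on the same left-to-right diagonal. Use by to compute the maximum per group, then merge the result back into the original data frame.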
max_diag <- function(x) max(sapply(split(as.matrix(x), row(x) - col(x)), sum))
merge(df, stack(by(df[-1], df$group, max_diag)), by.x = "group", by.y = "ind")
# group binary1 binary2 binary3 binary4 values
#1 1 1 0 0 0 2
#2 1 0 1 0 0 2
#3 2 1 0 0 0 3
#4 2 0 1 0 0 3
#5 2 0 0 1 0 3
#6 4 0 1 0 0 3
#7 4 0 0 1 0 3
#8 4 0 0 0 1 3
#9 4 0 0 0 0 3
#10 6 0 0 0 0 1
#11 6 0 1 0 0 1
#12 6 0 1 0 0 1
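To see why row(x) - col(x) identifies diagonals, here is a small sketch on the four rows of group 4 from the question, typed in by hand. The names g4, diag_id and diag_sums are only illustrative.

```r
# Rows of group 4 (columns binary1..binary4), entered row by row
g4 <- matrix(c(0, 1, 0, 0,
               0, 0, 1, 0,
               0, 0, 0, 1,
               0, 0, 0, 0), nrow = 4, byrow = TRUE)

# Every cell on the same left-to-right diagonal shares one row - col value
diag_id <- row(g4) - col(g4)

# Sum the cells of each diagonal; names run from "-3" to "3"
diag_sums <- sapply(split(g4, diag_id), sum)

max(diag_sums)   # the diagonal just above the main one holds three 1s
```

The diagonal labelled "-1" (one step above the main diagonal) contains the three 1s, which is where the value 3 for group 4 comes from.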
You can split the data.frame by group and sum the diagonal of each piece using diag(). Once you have the diagonal sum per group, put it back into the data.frame by indexing with the group.
Shouldn't group 4 be zero? Or am I missing something:
DIAG = by(df[,-1],df$group,function(i)sum(diag(as.matrix(i))))
df$want = DIAG[as.character(df$group)]
If I understand your definition correctly, we can define a function that calculates the sum of the main diagonal and of each diagonal above it:
main_diag = function(m){
sapply(1:(ncol(m)-1),function(i)sum(diag(m[,i:ncol(m)])))
}
Thanks to @IceCreamToucan for correcting this. Then we take the max over all of these diagonals, for both the matrix and its transpose:
DIAG = by(df[,-1],df$group,function(i){
i = as.matrix(i)
max(main_diag(i),main_diag(t(i)))
})
df$want = DIAG[as.character(df$group)]
group binary1 binary2 binary3 binary4 want
1 1 1 0 0 0 2
2 1 0 1 0 0 2
3 2 1 0 0 0 3
4 2 0 1 0 0 3
5 2 0 0 1 0 3
6 4 0 1 0 0 3
7 4 0 0 1 0 3
8 4 0 0 0 1 3
9 4 0 0 0 0 3
10 6 0 0 0 0 1
11 6 0 1 0 0 1
12 6 0 1 0 0 1
Let's say I have a contingency table (made using the table function in R):
digit
ID 1 2 3 4 5 6 7 8 9
1672120 23 16 8 10 12 13 3 3 5
1672121 2 1 0 0 0 0 1 0 0
1672122 1 2 1 0 1 0 0 1 0
1672123 0 1 1 0 0 0 0 0 0
1672124 1 1 0 1 1 0 0 0 0
1672125 5 2 5 1 1 1 0 0 2
1672127 2 1 2 1 0 0 0 0 0
1672128 2 0 0 1 0 1 0 0 1
1672129 1 0 1 0 0 0 1 0 0
If I want to remove the rows where the number of counts is smaller than 5 from the contingency table, how should I do it?
Since you don't provide reproducible sample data, here is an example based on the mtcars dataset.
Let's create a count table of mtcars$gear vs. mtcars$carb:
tbl <- table(mtcars$gear, mtcars$carb)
#
# 1 2 3 4 6 8
# 3 3 4 3 5 0 0
# 4 4 4 0 4 0 0
# 5 0 2 0 1 1 1
We then select only those rows where at least one count is larger than 2
tbl[apply(tbl > 2, 1, any), ]
#
# 1 2 3 4 6 8
# 3 3 4 3 5 0 0
# 4 4 4 0 4 0 0
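If the intent is instead to drop rows whose total count is below a threshold, a sketch along the same lines could use rowSums. The threshold of 10 and the names row_totals and kept are assumptions chosen so that the mtcars example actually drops a row.

```r
tbl <- table(mtcars$gear, mtcars$carb)

# Total number of observations in each row of the table
row_totals <- rowSums(tbl)

# Keep rows whose total reaches the (arbitrary) threshold;
# drop = FALSE keeps the result two-dimensional even if one row survives
kept <- tbl[row_totals >= 10, , drop = FALSE]
kept
```

For the table in the question, the same pattern with a threshold of 5 would be tbl[rowSums(tbl) >= 5, , drop = FALSE].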
I have the following data frame:
T a b c
1 1 0 0 0
2 2 1 0 0
3 5 1 0 0
4 6 1 0 0
5 7 0 1 0
6 9 0 1 0
7 10 0 0 1
8 12 0 0 0
9 14 0 0 0
10 15 1 0 0
11 16 1 0 0
12 17 0 1 0
13 18 0 0 1
I want to subset this data frame and create a list of data frames. Each data frame should contain the rows of the original that form a run of consecutive 1s in column a, followed by a run in column b, and finally one in column c. The expected result (for this data frame) is a list of 2 data frames:
data frame 1:
T a b c
1 2 1 0 0
2 5 1 0 0
3 6 1 0 0
4 7 0 1 0
5 9 0 1 0
6 10 0 0 1
and data frame 2:
T a b c
1 15 1 0 0
2 16 1 0 0
3 17 0 1 0
4 18 0 0 1
Any ideas?
Thank you in advance!
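Based on the expected output, we can flag the rows that have a 1 in any column, number the consecutive runs with rle, and split on the run id: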
i1 <- do.call(pmax, df1[-1])
grp <- inverse.rle(within.list(rle(i1 ==1), {values <- seq_along(values)}))
split(df1[i1==1,], grp[i1==1])
#$`2`
# T a b c
#2 2 1 0 0
#3 5 1 0 0
#4 6 1 0 0
#5 7 0 1 0
#6 9 0 1 0
#7 10 0 0 1
#$`4`
# T a b c
#10 15 1 0 0
#11 16 1 0 0
#12 17 0 1 0
#13 18 0 0 1
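The answer does not define df1, so here is a self-contained sketch that rebuilds it from the table in the question and shows the intermediate run ids. The name runs is illustrative. Note that this approach only looks for rows with a 1 in any column; it does not check that the a, b, c order holds inside each run.

```r
# Data from the question, entered by hand
df1 <- data.frame(
  T = c(1, 2, 5, 6, 7, 9, 10, 12, 14, 15, 16, 17, 18),
  a = c(0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0),
  b = c(0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0),
  c = c(0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1)
)

# 1 if any of a, b, c is 1 in that row, else 0
i1 <- do.call(pmax, df1[-1])

# Run-length encode the flag, then replace each run's value by its index
runs <- rle(i1 == 1)
runs$values <- seq_along(runs$values)
grp <- inverse.rle(runs)   # one run id per row: 1, 2 x 6, 3 x 2, 4 x 4

# Keep only flagged rows and split them by run id
pieces <- split(df1[i1 == 1, ], grp[i1 == 1])
names(pieces)
```

The list names are "2" and "4" because runs 1 and 3 are the stretches of all-zero rows that get filtered out.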
I have data.frames of counts such as:
a <- data.frame(id=1:10,
"1"=c(rep(1,3),rep(0,7)),
"3"=c(rep(0,4),rep(1,6)))
names(a)[2:3] <- c("1","3")
a
> a
id 1 3
1 1 1 0
2 2 1 0
3 3 1 0
4 4 0 0
5 5 0 1
6 6 0 1
7 7 0 1
8 8 0 1
9 9 0 1
10 10 0 1
and a template data.frame such as
m <- data.frame(id=1:10,
"1"= rep(0,10),
"2"= rep(0,10),
"3"= rep(0,10),
"4"= rep(0,10))
names(m)[-1] <- 1:4
m
> m
id 1 2 3 4
1 1 0 0 0 0
2 2 0 0 0 0
3 3 0 0 0 0
4 4 0 0 0 0
5 5 0 0 0 0
6 6 0 0 0 0
7 7 0 0 0 0
8 8 0 0 0 0
9 9 0 0 0 0
10 10 0 0 0 0
and I want to add the values of a into the template m in the appropriate columns, leaving the rest as 0.
This works, but I would like to know if there is a more elegant way, perhaps using plyr or data.table:
library(plyr)  # rbind.fill comes from plyr
provi <- rbind.fill(a,m)
provi[is.na(provi)] <- 0
mnew <- aggregate(provi[,-1],by=list(provi$id),FUN=sum)
names(mnew)[1] <- "id"
mnew <- mnew[c(1,order(names(mnew)[-1])+1)]
mnew
> mnew
id 1 2 3 4
1 1 1 0 0 0
2 2 1 0 0 0
3 3 1 0 0 0
4 4 0 0 0 0
5 5 0 0 1 0
6 6 0 0 1 0
7 7 0 0 1 0
8 8 0 0 1 0
9 9 0 0 1 0
10 10 0 0 1 0
I guess the concise option would be:
m[names(a)] <- a
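The one-liner above relies on a and m having the same ids in the same row order. If that cannot be assumed, one possible variant aligns the rows with match first. This is a sketch on made-up miniature data; the names rows and cols are illustrative.

```r
# Small template and a count table whose ids are a shuffled subset
m <- data.frame(id = 1:4, "1" = 0, "2" = 0, "3" = 0, check.names = FALSE)
a <- data.frame(id = c(3, 1), "1" = c(1, 0), "3" = c(0, 1),
                check.names = FALSE)

rows <- match(a$id, m$id)         # row of m that holds each id of a
cols <- setdiff(names(a), "id")   # value columns shared with the template

# Copy the values into the matching rows and columns only
m[rows, cols] <- a[cols]
m
```

Ids of a that are missing from m would give NA in rows, so this sketch assumes every id in a exists in the template.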
Or we match the column names ('i1'), use that to create the column index with max.col, cbind with the row index ('i2'), and a similar step can be done to create 'i3'. We change the values in 'm' corresponding to 'i2' with the 'a' values based on 'i3'.
i1 <- match(names(a)[-1], names(m)[-1])
i2 <- cbind(m$id, i1[max.col(a[-1], 'first')]+1L)
i3 <- cbind(a$id, max.col(a[-1], 'first')+1L)
m[i2] <- a[i3]
m
# id 1 2 3 4
#1 1 1 0 0 0
#2 2 1 0 0 0
#3 3 1 0 0 0
#4 4 0 0 0 0
#5 5 0 0 1 0
#6 6 0 0 1 0
#7 7 0 0 1 0
#8 8 0 0 1 0
#9 9 0 0 1 0
#10 10 0 0 1 0
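The two building blocks used here are max.col and indexing with a two-column (row, column) matrix. A tiny sketch of both, on made-up data with illustrative names:

```r
scores <- rbind(c(1, 0),
                c(0, 1),
                c(0, 0))

# Column of the first maximum in each row; an all-zero row yields column 1
first_max <- max.col(scores, ties.method = "first")

# A two-column matrix used as an index addresses one cell per row of idx
idx <- cbind(seq_len(nrow(scores)), first_max)

target <- matrix(0, nrow = 3, ncol = 2)
target[idx] <- scores[idx]   # copy only the addressed cells
target
```

For the all-zero row the addressed cell holds 0, so copying it leaves the template unchanged, which is why row 4 of the result above stays at 0.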
A data.table option would be melt/dcast
library(data.table)
dcast(melt(setDT(a), id.var='id')[,
variable:= factor(variable, levels=1:4)],
id~variable, value.var='value', drop=FALSE, fill=0)
# id 1 2 3 4
# 1: 1 1 0 0 0
# 2: 2 1 0 0 0
# 3: 3 1 0 0 0
# 4: 4 0 0 0 0
# 5: 5 0 0 1 0
# 6: 6 0 0 1 0
# 7: 7 0 0 1 0
# 8: 8 0 0 1 0
# 9: 9 0 0 1 0
#10: 10 0 0 1 0
A similar dplyr/tidyr option would be
library(dplyr)
library(tidyr)
gather(a, Var, Val, -id) %>%
mutate(Var=factor(Var, levels=1:4)) %>%
spread(Var, Val, drop=FALSE, fill=0)
You could use merge, too:
res <- suppressWarnings(merge(a, m, by="id", suffixes = c("", "")))
(res[, which(!duplicated(names(res)))][, names(m)])
# id 1 2 3 4
# 1 1 1 0 0 0
# 2 2 1 0 0 0
# 3 3 1 0 0 0
# 4 4 0 0 0 0
# 5 5 0 0 1 0
# 6 6 0 0 1 0
# 7 7 0 0 1 0
# 8 8 0 0 1 0
# 9 9 0 0 1 0
# 10 10 0 0 1 0
Suppose I have the following data frames
treatment1<-data.frame(id=c(1,2,7))
treatment2<-data.frame(id=c(3,7,10))
control<-data.frame(id=c(4,5,8,9))
I want to create a new data frame that is the union of those 3 and has one indicator column per source data frame, taking the value 1 when the id appears in it:
experiment<-data.frame(id=c(1:10),treatment1=0, treatment2=0, control=0)
where experiment$treatment1[1]=1 etc etc
What is the best way of doing this in R?
Thanks!
Updated as per @Flodel:
kk<-rbind(treatment1,treatment2,control)
var1<-c("treatment1","treatment2","control")
kk$df<-rep(var1,c(dim(treatment1)[1],dim(treatment2)[1],dim(control)[1]))
kk
id df
1 1 treatment1
2 2 treatment1
3 7 treatment1
4 3 treatment2
5 7 treatment2
6 10 treatment2
7 4 control
8 5 control
9 8 control
10 9 control
If you want it in the form of 1s and 0s, you can use table:
ll<-table(kk)
ll
df
id control treatment1 treatment2
1 0 1 0
2 0 1 0
3 0 0 1
4 1 0 0
5 1 0 0
7 0 1 1
8 1 0 0
9 1 0 0
10 0 0 1
If you want it as a data.frame, then you can use reshape:
kk2<-reshape(data.frame(ll),timevar = "df",idvar = "id",direction = "wide")
names(kk2)[-1]<-sort(var1)
kk2
id control treatment1 treatment2
1 1 0 1 0
2 2 0 1 0
3 3 0 0 1
4 4 1 0 0
5 5 1 0 0
6 7 0 1 1
7 8 1 0 0
8 9 1 0 0
9 10 0 0 1
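A possible variation on the table approach sets the factor levels explicitly, so that the columns keep the original order and ids that appear in no group (such as 6) still get a row. This is a sketch; the names groups, long, tab and wide, and the assumed id range 1:10, are illustrative.

```r
treatment1 <- data.frame(id = c(1, 2, 7))
treatment2 <- data.frame(id = c(3, 7, 10))
control    <- data.frame(id = c(4, 5, 8, 9))

groups <- list(treatment1 = treatment1, treatment2 = treatment2,
               control = control)

# Long format: one row per (id, source); explicit levels fix order and range
long <- data.frame(
  id = factor(unlist(lapply(groups, `[[`, "id"), use.names = FALSE),
              levels = 1:10),
  df = factor(rep(names(groups), sapply(groups, nrow)),
              levels = names(groups))
)

tab <- table(long)   # 10 x 3 table of 0/1 counts

# Turn the 2-D table into a data frame with one column per group
wide <- as.data.frame.matrix(tab)
experiment <- cbind(id = as.integer(rownames(tab)), wide)
experiment
```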
df.bind <- function(...) {
df.names <- all.names(substitute(list(...)))[-1L]
ids.list <- setNames(lapply(list(...), `[[`, "id"), df.names)
num.ids <- max(unlist(ids.list))
tabs <- lapply(ids.list, tabulate, num.ids)
data.frame(id = seq(num.ids), tabs)
}
df.bind(treatment1, treatment2, control)
# id treatment1 treatment2 control
# 1 1 1 0 0
# 2 2 1 0 0
# 3 3 0 1 0
# 4 4 0 0 1
# 5 5 0 0 1
# 6 6 0 0 0
# 7 7 1 1 0
# 8 8 0 0 1
# 9 9 0 0 1
# 10 10 0 1 0
(Notice how it does include a row for id == 6.)
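The row for id 6 appears because tabulate counts every integer from 1 up to nbins, including values that never occur. A minimal illustration with the ids of treatment2 (the names ids, counts and indicator are illustrative):

```r
ids <- c(3, 7, 10)

# One bin per integer 1..10; ids that are absent get a count of 0
counts <- tabulate(ids, nbins = 10)
counts

# tabulate counts duplicates, so cap at 1 if a strict 0/1 indicator is needed
indicator <- as.integer(counts > 0)
```

With unique ids, as in this question, counts and indicator are identical.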
Taking
treatment1<-data.frame(id=c(1,2,7))
treatment2<-data.frame(id=c(3,7,10))
control<-data.frame(id=c(4,5,8,9))
You can use this:
x <- c("treatment1", "treatment2", "control")
f <- function(s) within(get(s), assign(s, 1))
r <- Reduce(function(x,y) merge(x,y,all=TRUE), lapply(x, f))
r[is.na(r)] <- 0
Result:
> r
id treatment1 treatment2 control
1 1 1 0 0
2 2 1 0 0
3 3 0 1 0
4 4 0 0 1
5 5 0 0 1
6 7 1 1 0
7 8 0 0 1
8 9 0 0 1
9 10 0 1 0
This illustrates what I was imagining to be the rbind strategy:
alldf <- rbind(treatment1,treatment2,control)
alldf$grps <- model.matrix( ~ factor( c( rep(1,nrow(treatment1)),
rep(2,nrow(treatment2)),
rep(3,nrow(control) ) ))-1)
dimnames( alldf[[2]])[2]<- list(c("trt1","trt2","ctrl"))
alldf
#-------------------
id grps.trt1 grps.trt2 grps.ctrl
1 1 1 0 0
2 2 1 0 0
3 7 1 0 0
4 3 0 1 0
5 7 0 1 0
6 10 0 1 0
7 4 0 0 1
8 5 0 0 1
9 8 0 0 1
10 9 0 0 1
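The "- 1" in the formula is what turns model.matrix into a plain one-column-per-level indicator encoding: it removes the intercept, so no level is absorbed as the baseline. A small sketch on made-up group labels, with illustrative names g and mm:

```r
g <- factor(c("trt1", "trt2", "ctrl", "trt1"),
            levels = c("trt1", "trt2", "ctrl"))

# Without "- 1" the first level would become the intercept column
mm <- model.matrix(~ g - 1)
colnames(mm)   # "gtrt1" "gtrt2" "gctrl"
mm
```

Each row has exactly one 1, in the column of that observation's level.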