Automatic subsetting of a dataframe on the basis of a prediction matrix - r

I have created a prediction matrix for a large dataset as follows:
library(mice)
dfpredm <- quickpred(df, mincor=.3)
A B C D E F G H I J
A 0 1 1 1 0 1 0 1 1 0
B 1 0 0 0 1 0 1 0 0 1
C 0 0 0 1 1 0 0 0 0 0
D 1 0 1 0 0 1 0 1 0 1
E 0 1 0 1 0 1 1 0 1 0
F 0 0 1 0 0 0 1 0 0 0
G 0 1 0 1 0 0 0 0 0 0
H 1 0 1 0 0 1 0 0 0 1
I 0 1 0 1 1 0 1 0 0 0
J 1 0 1 0 0 1 0 1 0 0
I would like to create a subset of the original df on the basis of dfpredm.
More specifically, I would like to do the following:
Let's assume that my dependent variable is F.
According to the prediction matrix, F is predicted by C and G (the columns with a 1 in the F row).
In addition, C and G are best predicted by D, E and B, D respectively.
The idea is now to create a subset of df for the dependent variable F, keeping the columns for which the value in the F row is 1:
Fpredictors <- df[,(dfpredm["F",]) == 1]
But I would also like to do the same for each variable that has a 1 in the F row. I am thinking of first getting the column names like this:
Fpredcol <- colnames(dfpredm[,(dfpredm["F",]) == 1])
And then looping over these column names with a for loop?
For the specific example I would like to end up with this subset:
dfsub <- df[,c("F","C","G","B","E","D")]
I would however like to automate this process. Could anyone show me how to do this?

Here is one strategy that seems like it would work for you:
first_preds <- function(dat, predictor) {
cols <- which(dat[predictor, ] == 1)
names(dat)[cols]
}
# wrap first_preds() for getting best and second best predictors
first_and_second_preds <- function(dat, predictor) {
matches <- first_preds(dat, predictor)
matches <- c(matches, unlist(lapply(matches, function(x) first_preds(dat, x))))
unique(c(predictor, matches)) # base R, so no pipe package has to be loaded
}
dat[first_and_second_preds(dat, "F")] # order is not exactly the same as your output
F C G D E B
A 1 1 0 1 0 1
B 0 0 1 0 1 0
C 0 0 0 1 1 0
D 1 1 0 0 0 0
E 1 0 1 1 0 1
F 0 1 1 0 0 0
G 0 0 0 1 0 1
H 1 1 0 0 0 0
I 0 0 1 1 1 1
J 1 1 0 0 0 0
Not sure if the ordering in the result is important, but you could add the logic if it is.
Using dat from here (a kinder way to share small R data on SO):
dat <- read.table(
text = "A B C D E F G H I J
A 0 1 1 1 0 1 0 1 1 0
B 1 0 0 0 1 0 1 0 0 1
C 0 0 0 1 1 0 0 0 0 0
D 1 0 1 0 0 1 0 1 0 1
E 0 1 0 1 0 1 1 0 1 0
F 0 0 1 0 0 0 1 0 0 0
G 0 1 0 1 0 0 0 0 0 0
H 1 0 1 0 0 1 0 0 0 1
I 0 1 0 1 1 0 1 0 0 0
J 1 0 1 0 0 1 0 1 0 0",
header = TRUE
)
Something a little more general that lets you pass self-selected predictors directly:
all_preds <- function(dat, predictors) {
unlist(lapply(predictors, function(x) names(dat)[which(dat[x, ] == 1 )]))
}
dat[all_preds(dat, c("A", "B"))]
B C D F H I A E G J
A 1 1 1 1 1 1 0 0 0 0
B 0 0 0 0 0 0 1 1 1 1
C 0 0 1 0 0 0 0 1 0 0
D 0 1 0 1 1 0 1 0 0 1
E 1 0 1 1 0 1 0 0 1 0
F 0 1 0 0 0 0 0 0 1 0
G 1 0 1 0 0 0 0 0 0 0
H 0 1 0 1 0 0 1 0 0 1
I 1 0 1 0 0 0 0 1 1 0
J 0 1 0 1 1 0 1 0 0 0
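If you ever need more than two levels of predictors, the same lookup can be repeated in a loop. The following is only a sketch under assumptions: preds_to_depth() and its depth argument are hypothetical names made up for illustration, not something from mice, and the prediction matrix is rebuilt inside the block so it runs on its own.

```r
# Prediction matrix from the question (rows = targets, columns = predictors)
dat <- read.table(text = "A B C D E F G H I J
A 0 1 1 1 0 1 0 1 1 0
B 1 0 0 0 1 0 1 0 0 1
C 0 0 0 1 1 0 0 0 0 0
D 1 0 1 0 0 1 0 1 0 1
E 0 1 0 1 0 1 1 0 1 0
F 0 0 1 0 0 0 1 0 0 0
G 0 1 0 1 0 0 0 0 0 0
H 1 0 1 0 0 1 0 0 0 1
I 0 1 0 1 1 0 1 0 0 0
J 1 0 1 0 0 1 0 1 0 0", header = TRUE)

# Hypothetical helper: collect predictors of `target` up to `depth` levels
preds_to_depth <- function(dat, target, depth = 2) {
  found <- target      # everything collected so far, in discovery order
  frontier <- target   # variables whose predictors are looked up next
  for (i in seq_len(depth)) {
    nxt <- unlist(lapply(frontier, function(x) names(dat)[which(dat[x, ] == 1)]))
    frontier <- setdiff(nxt, found)    # keep only variables not seen before
    if (length(frontier) == 0) break   # nothing new, stop early
    found <- c(found, frontier)
  }
  found
}

preds_to_depth(dat, "F", depth = 2)
# "F" "C" "G" "D" "E" "B"
```

With depth = 2 this gives the same columns as first_and_second_preds(); subsetting the original data would then be df[preds_to_depth(dfpredm, "F")], assuming dfpredm is converted to a data frame first.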

Related

double spread without hard coding

I am stuck here. I tried using spread twice from tidyr, and I tried joining, but none of these methods gives the right solution without some hard coding.
Is there any way to transform this data:
cat1 cat2 title
1 A G AB
2 B G BC
3 C B CD
4 D G DE
5 E H EF
6 F A FG
into this:
A B C D E F G H
AB 1 0 0 0 0 0 1 0
BC 0 1 0 0 0 0 1 0
CD 0 1 1 0 0 0 0 0
DE 0 0 0 1 0 0 1 0
EF 0 0 0 0 1 0 0 1
FG 1 0 0 0 0 1 0 0
Sample data:
df<-data.frame(cat1=LETTERS[1:6],
cat2=c('G','G','B','G','H','A'),
title=paste0(LETTERS[1:6],LETTERS[2:7]))
Since I usually get dplyr answers faster: Base R or tidyr only solutions are also very welcome
I don't know if this qualifies as not hard coding for the OP:
library(magrittr) # provides %>%
df %>%
tidyr::gather(key = vars, value = values, cat1, cat2) %>%
dplyr::mutate(vars = 1) %>%
tidyr::spread(key = values, value = vars, fill = 0)
# title A B C D E F G H
# 1 AB 1 0 0 0 0 0 1 0
# 2 BC 0 1 0 0 0 0 1 0
# 3 CD 0 1 1 0 0 0 0 0
# 4 DE 0 0 0 1 0 0 1 0
# 5 EF 0 0 0 0 1 0 0 1
# 6 FG 1 0 0 0 0 1 0 0
Just melt first, then cast:
library(reshape2)
library(magrittr) # provides %>%
melt(df, id="title") %>% dcast(title ~ value, length)
title A B C D E F G H
1 AB 1 0 0 0 0 0 1 0
2 BC 0 1 0 0 0 0 1 0
3 CD 0 1 1 0 0 0 0 0
4 DE 0 0 0 1 0 0 1 0
5 EF 0 0 0 0 1 0 0 1
6 FG 1 0 0 0 0 1 0 0
melt stacks all the values into a single column, which dcast then counts per title.
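Since base R solutions were also welcome, here is a minimal sketch using only table(). It assumes the sample data from the question and that each title should simply be counted against both of its categories; the names wide and category are my own.

```r
df <- data.frame(cat1 = LETTERS[1:6],
                 cat2 = c("G", "G", "B", "G", "H", "A"),
                 title = paste0(LETTERS[1:6], LETTERS[2:7]),
                 stringsAsFactors = FALSE)

# Repeat the titles once per category column, then cross-tabulate
# title against the stacked category values
wide <- table(title = rep(df$title, times = 2),
              category = c(df$cat1, df$cat2))
wide

# as.data.frame.matrix(wide) turns the table into a plain data frame
```

No column names other than cat1, cat2 and title appear in the code, so new category values show up as new columns automatically.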

Building a symmetric binary matrix

I have a matrix that is for example like this:
rownames V1
a 1
c 3
b 2
d 4
y 2
q 4
i 1
j 1
r 3
I want to build a symmetric binary matrix whose dimnames are the rownames of the matrix above. An entry should be 1 when the two variables have the same number in V1, and 0 otherwise. The matrix would look like this:
dimnames
a c b d y q i j r
a 1 0 0 0 0 0 1 1 0
c 0 1 0 0 0 0 0 0 1
b 0 0 1 0 1 0 0 0 0
d 0 0 0 1 0 1 0 0 0
y 0 0 1 0 1 0 0 0 0
q 0 0 0 1 0 1 0 0 0
i 1 0 0 0 0 0 1 1 0
j 1 0 0 0 0 0 1 1 0
r 0 1 0 0 0 0 0 0 1
Does anybody know how I can do that?
Use dist:
DF <- read.table(text = "rownames V1
a 1
c 3
b 2
d 4
y 2
q 4
i 1
j 1
r 3", header = TRUE)
res <- as.matrix(dist(DF$V1)) == 0L
#alternatively:
#res <- !as.matrix(dist(DF$V1))
#diag(res) <- 0L #for the first version of the question, i.e. a zero diagonal
res <- +(res) #for the second version, i.e. to coerce to an integer matrix
dimnames(res) <- list(DF$rownames, DF$rownames)
#the output below was printed before the dimnames were assigned:
# 1 2 3 4 5 6 7 8 9
#1 1 0 0 0 0 0 1 1 0
#2 0 1 0 0 0 0 0 0 1
#3 0 0 1 0 1 0 0 0 0
#4 0 0 0 1 0 1 0 0 0
#5 0 0 1 0 1 0 0 0 0
#6 0 0 0 1 0 1 0 0 0
#7 1 0 0 0 0 0 1 1 0
#8 1 0 0 0 0 0 1 1 0
#9 0 1 0 0 0 0 0 0 1
You can do this using table and tcrossprod.
tcrossprod(table(DF))
# rownames
# rownames a b c d i j q r y
# a 1 0 0 0 1 1 0 0 0
# b 0 1 0 0 0 0 0 0 1
# c 0 0 1 0 0 0 0 1 0
# d 0 0 0 1 0 0 1 0 0
# i 1 0 0 0 1 1 0 0 0
# j 1 0 0 0 1 1 0 0 0
# q 0 0 0 1 0 0 1 0 0
# r 0 0 1 0 0 0 0 1 0
# y 0 1 0 0 0 0 0 0 1
If you want the row and column order as they are found in the data, rather than alphanumerically, you can subset
tcrossprod(table(DF))[DF$rownames, DF$rownames]
or use factor
tcrossprod(table(factor(DF$rownames, levels=unique(DF$rownames)), DF$V1))
If your data is large or sparse, you can use the sparse matrix algebra in xtabs, with similar ways to change the order of the resulting table as before.
Matrix::tcrossprod(xtabs(data=DF, ~ rownames + V1, sparse=TRUE))
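For small inputs, another way to get the same matrix is to compare every value with every other value using outer(). This is just a sketch on the sample data; it builds the full n x n comparison in memory, so it is not meant for large data.

```r
DF <- read.table(text = "rownames V1
a 1
c 3
b 2
d 4
y 2
q 4
i 1
j 1
r 3", header = TRUE, stringsAsFactors = FALSE)

# TRUE where two rows share the same V1 value; unary + coerces to 0/1
res <- +outer(DF$V1, DF$V1, "==")
dimnames(res) <- list(DF$rownames, DF$rownames)
res
```

The rows and columns come out in the order of the data, so no reordering step is needed.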

R: match rows of a matrices and get location specific info

Say you have a matrix M1 as such:
A B C D E F G H I J
353 1 0 1 0 0 1 0 0 1 1
288 1 0 1 0 0 1 1 0 1 1
275 1 0 1 0 1 1 0 0 1 1
236 0 0 1 0 0 1 0 0 1 1
235 0 0 1 0 0 1 1 0 1 1
227 1 0 1 0 1 1 1 0 1 1
The rownames are the values I am after (they are not random; they carry meaning, as I will explain).
Say you have another matrix M2 as such:
A B C D E F G H I J AA
[1,] 0 0 0 0 0 0 0 0 0 0 0
[2,] 1 0 0 0 0 0 0 0 0 0 0
[3,] 0 1 0 0 0 0 0 0 0 0 1
[4,] 1 1 0 0 0 0 0 0 0 0 0
[5,] 0 0 1 0 0 0 0 0 0 0 1
[6,] 1 0 1 0 0 0 0 0 0 0 0
Note that columns A to J are the same as in M1; the only new column is AA.
Now, I want something like:
for (i in 1:nrow(M2)){
if(M2[i,"AA"]==1){
#-1 since M1 doesn't have the AA column
vec = M2[i,1:(ncol(M2)-1)]
#BELOW is the part I am not sure how to implement:
#get the rowname from M1 whose row matches vec, and set M2[i,"AA"] to that value
}
}
The result should be 0, since in this example there are no rows of M1 matching any rows of M2[,A:J]
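One possible way to avoid the loop is to collapse each row into a single string key and look the keys up with match(). This is only a sketch: fill_matches() and its flag argument are hypothetical names, and it assumes the rownames of M1 are numeric and that the shared columns carry the same names in both matrices.

```r
cols <- LETTERS[1:10]
M1 <- matrix(c(1,0,1,0,0,1,0,0,1,1,
               1,0,1,0,0,1,1,0,1,1,
               1,0,1,0,1,1,0,0,1,1,
               0,0,1,0,0,1,0,0,1,1,
               0,0,1,0,0,1,1,0,1,1,
               1,0,1,0,1,1,1,0,1,1),
             ncol = 10, byrow = TRUE,
             dimnames = list(c("353", "288", "275", "236", "235", "227"), cols))
M2 <- matrix(c(0,0,0,0,0,0,0,0,0,0,0,
               1,0,0,0,0,0,0,0,0,0,0,
               0,1,0,0,0,0,0,0,0,0,1,
               1,1,0,0,0,0,0,0,0,0,0,
               0,0,1,0,0,0,0,0,0,0,1,
               1,0,1,0,0,0,0,0,0,0,0),
             ncol = 11, byrow = TRUE, dimnames = list(NULL, c(cols, "AA")))

# Hypothetical helper: for rows of M2 flagged with 1, store the rowname of
# the identical row in M1 (as a number); everything else becomes 0
fill_matches <- function(M1, M2, flag = "AA") {
  shared <- colnames(M1)
  key1 <- apply(M1[, shared, drop = FALSE], 1, paste, collapse = "_")
  key2 <- apply(M2[, shared, drop = FALSE], 1, paste, collapse = "_")
  hit <- match(key2, key1)                   # NA where no row of M1 matches
  found <- M2[, flag] == 1 & !is.na(hit)
  M2[, flag] <- 0
  M2[found, flag] <- as.numeric(rownames(M1)[hit[found]])
  M2
}

fill_matches(M1, M2)[, "AA"]
# all zeros for this example, since no row of M1 matches
```

If M1 contains duplicate rows, match() returns the first one, which may or may not be what you want.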

merge two rows of n dimension by considering one column

I have a matrix that was formed using cbind():
! ? c e i k l t
dif 0 0 1 0 0 0
dor 1 0 0 0 0 0
dor 0 0 0 0 0 1
same 0 0 0 1 0 0
same 0 1 0 0 0 0
Please suggest R code that merges the rows sharing the same name, as below:
! ? c e i k l t
same 1 1 0 1 0 0
dif 0 0 1 0 0 0
dor 1 0 0 0 0 1
Thank you..
df <- read.table(header=TRUE, text="ID c e i k l t
dif 0 0 1 0 0 0
dor 1 0 0 0 0 0
dor 0 0 0 0 0 1
same 0 0 0 1 0 0
same 0 1 0 0 0 0")
require(plyr)
ddply(df,.(ID),function(x)colSums(x[,-1]))
ID c e i k l t
1 dif 0 0 1 0 0 0
2 dor 1 0 0 0 0 1
3 same 0 1 0 1 0 0
Alternatively, with aggregate in base R:
aggregate(df[, -1], list(df[, 1]), function(x) {
Reduce("|", x)
})
# Group.1 c e i k l t
# 1 dif 0 0 1 0 0 0
# 2 dor 1 0 0 0 0 1
# 3 same 0 1 0 1 0 0
Do you want the sum, or do you want the logical OR?
Logical OR:
require(functional)
aggregate(. ~ ID, data=df, FUN=Compose(any, as.numeric))
ID c e i k l t
1 dif 0 0 1 0 0 0
2 dor 1 0 0 0 0 1
3 same 0 1 0 1 0 0
Sum:
aggregate(. ~ ID, data=df, FUN=sum)
The result here is the same.
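For the sum, base R also has rowsum(), which adds up rows by group without any extra package. A small sketch on the same data follows; the names sums and any_set are my own.

```r
df <- read.table(header = TRUE, text = "ID c e i k l t
dif 0 0 1 0 0 0
dor 1 0 0 0 0 0
dor 0 0 0 0 0 1
same 0 0 0 1 0 0
same 0 1 0 0 0 0")

# Column sums within each ID; the group labels become the row names
sums <- rowsum(df[, -1], group = df$ID)
sums

# Logical OR instead of a sum: 1 wherever the group had at least one 1
any_set <- +(sums > 0)
```

The two results only differ when a group has more than one 1 in the same column, which does not happen in this example.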

Change a long table to wide table

Suppose I have a long table like this:
A <- rep(c("a","b","c","d"),each=4)
B <- rep(c("e","f","g","h"),4)
C <- rep(c("i","j"),8)
D <- rnorm(16)
df <- data.frame(A,B,C,D)
head(df)
A B C D
1 a e i -0.18984508
2 a f j -1.82703822
3 a g i -0.17307580
4 a h j -1.38104238
5 b e i 0.08699983
6 b f j -0.36442461
I would like to change the long table to a wide format so that each element in columns A and B becomes the title of a column. Each cell should be 1 or 0, indicating whether the element occurs in that row. Columns C and D remain the same. The desired table is something like this:
C D a b e f g h
i -0.18984508 1 0 1 0 0 0
j -1.82703822 1 0 0 1 0 0
i -0.17307580 1 0 0 0 1 0
j -1.38104238 1 0 0 0 0 1
i 0.08699983 0 1 1 0 0 0
j -0.36442461 0 1 0 1 0 0
This is a form of reshaping which can be done with the reshape2 package.
library("reshape2")
dcast(melt(df, id.vars=c("C", "D")), C+D~value, fun.aggregate=length)
which gives
C D a b c d e f g h
1 i -1.44485242 0 1 0 0 0 0 1 0
2 i -0.80834639 0 0 0 1 0 0 1 0
3 i -0.15202085 0 0 0 1 1 0 0 0
4 i -0.05626233 1 0 0 0 1 0 0 0
5 i 0.12031754 1 0 0 0 0 0 1 0
6 i 0.62206658 0 0 1 0 0 0 1 0
7 i 0.77101891 0 1 0 0 1 0 0 0
8 i 1.38752097 0 0 1 0 1 0 0 0
9 j -2.52137154 0 0 0 1 0 0 0 1
10 j -0.53231537 0 1 0 0 0 0 0 1
11 j -0.30178539 1 0 0 0 0 0 0 1
12 j -0.29823112 1 0 0 0 0 1 0 0
13 j -0.12988540 0 1 0 0 0 1 0 0
14 j 0.00517754 0 0 1 0 0 1 0 0
15 j 0.51452289 0 0 1 0 0 0 0 1
16 j 0.53260223 0 0 0 1 0 1 0 0
The order is not the same as in the original data set, but if that is important, put an order column in, carry it through, and then sort on it at the end.
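If keeping the original row order matters, a different route is to build the 0/1 columns directly in base R instead of reshaping, which leaves the rows where they are. This is a sketch of that alternative, not the reshape2 approach above; set.seed() is added only so the example is reproducible, and lv, ind and wide are names I made up.

```r
set.seed(42)  # assumption: fixed seed, not in the original question
A <- rep(c("a", "b", "c", "d"), each = 4)
B <- rep(c("e", "f", "g", "h"), 4)
C <- rep(c("i", "j"), 8)
D <- rnorm(16)
df <- data.frame(A, B, C, D, stringsAsFactors = FALSE)

# All distinct values found in A or B become the new column names
lv <- sort(unique(c(df$A, df$B)))

# One 0/1 column per value: 1 when the value appears in A or B of that row
ind <- sapply(lv, function(v) as.integer(df$A == v | df$B == v))

wide <- cbind(df[c("C", "D")], ind)
head(wide)
```

Because nothing is sorted or aggregated, row k of wide always corresponds to row k of df.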

Resources