Many categorical variable at the same time in Matrix - r

I've collected data as follow :
A B C D E F G
1 1 0 0 0 0 0 0
1,2 0 1 0 0 0 0 2
1,2,3 0 0 0 0 0 0 0
1,3 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0
2,3 4 0 0 0 5 0 0
3 1 3 0 0 0 2 0
4 0 0 0 0 0 0 0
For each Color (A,B,C,D,E,F,G) it corresponds to one or many category at the same time(1,2,3,4) according sample. For many category, there is comma separation.
I want to simplify my data to have it as follows :
A B C D E F G
1 1 1 0 0 0 0 2
3 4 0 0 0 5 2 0
2 4 1 0 0 5 0 2
4 0 0 0 0 0 0 0
is there a simple way (function) to do this ?
Reproducible example :
DF <- read.table(text = " Color Cat
A 1
B 1
C 4,2
D 1,3
E 1,2
F 3
G 5
A 2
B 3
C 1,2
D 4,3
E 3
F 1
G 1" , header = TRUE)
DF = table(DF$Cat,DF$Color)
cats <- strsplit(rownames(DF), ",", fixed = TRUE)
DF <- DF[rep(seq_len(nrow(DF)), sapply(cats, length)),]
DF$cat <- unlist(cats)
DF <- aggregate(. ~ cat, DF, FUN = sum)

DF <- read.table(text = " A B C D E F G
1 1 0 0 0 0 0 0
1,2 0 1 0 0 0 0 2
1,2,3 0 0 0 0 0 0 0
1,3 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0
2,3 4 0 0 0 5 0 0
3 1 3 0 0 0 2 0
4 0 0 0 0 0 0 0", header = TRUE)
#split the row names
cats <- strsplit(rownames(DF), ",", fixed = TRUE)
#repeat each row of the DF times the number of cats
DF <- DF[rep(seq_len(nrow(DF)), sapply(cats, length)),]
#add column with cats
DF$cat <- unlist(cats)
#aggregate (your question is unclear regarding how)
DF <- aggregate(. ~ cat, DF, FUN = sum) #or FUN = max???
# cat A B C D E F G
#1 1 1 1 0 0 0 0 2
#2 2 4 1 0 0 5 0 2
#3 3 5 3 0 0 5 2 0
#4 4 0 0 0 0 0 0 0

Related

Creating new columns using data in one column and fill with data

If I have a data frame like this:
dt <- data.frame(cols = letters[1:6])
dt
#> cols
#> 1 a
#> 2 b
#> 3 c
#> 4 d
#> 5 e
#> 6 f
How to create new columns using data in the cols column (with 1s on the diagonal):
a b c d e f
a 1 0 0 0 0 0
b 0 1 0 0 0 0
c 0 0 1 0 0 0
d 0 0 0 1 0 0
e 0 0 0 0 1 0
f 0 0 0 0 0 1
In base R, we can use table
out <-table(dt$col, dt$col)
-output
out
a b c d e f
a 1 0 0 0 0 0
b 0 1 0 0 0 0
c 0 0 1 0 0 0
d 0 0 0 1 0 0
e 0 0 0 0 1 0
f 0 0 0 0 0 1
Or use diag
`dimnames<-`(diag(nrow(dt)), list(dt$col, dt$col))
Another possible solution:
m <- matrix(0, 6, 6, dimnames = list(dt$cols, dt$cols))
diag(m) <- 1
m
#> a b c d e f
#> a 1 0 0 0 0 0
#> b 0 1 0 0 0 0
#> c 0 0 1 0 0 0
#> d 0 0 0 1 0 0
#> e 0 0 0 0 1 0
#> f 0 0 0 0 0 1

Make a adjacency matrix in R

I want to make an adjacency matrix from a dataframe (mydata) consisting several rows with following rule:
List all letters as a square matrix
Count and sum number of connection from source from rest of columns (p1 p2 p3 p4 p5) of corresponding rows. For example, b is connected with a (2 and 8 rows) 5 times.
If letter is not included in source , connection values should be zero.
The dataframe is:
mydf <- data.frame(p1=c('a','a','a','b','g','b','c','c','d'),
p2=c('b','c','d','c','d','e','d','e','e'),
p3=c('a','a','c','c','d','d','d','a','a'),
p4=c('a','a','b','c','c','e','d','a','b'),
p5=c('a','b','c','d','I','b','b','c','z'),
source=c('a','b','c','d','e','e','a','b','d'))
The adjacency matrix should be as following
a b c d e g I z
a 4 2 1 3 0 0 0 0
b 5 1 3 0 1 0 0 0
c 1 1 2 1 0 0 0 0
d 1 2 3 2 1 0 0 1
e 0 2 1 3 2 1 1 0
g 0 0 0 0 0 0 0 0
I 0 0 0 0 0 0 0 0
z 0 0 0 0 0 0 0 0
I have hundreds of columns and thousands of rows. I would appreciate having any fastest way to do it in R
In base R, we can use table :
vals <- unlist(mydf[-ncol(mydf)])
table(factor(rep(mydf$source, ncol(mydf) - 1), levels = unique(vals)), vals)
# vals
# a b c d e g I z
# a 4 2 1 3 0 0 0 0
# b 5 1 3 0 1 0 0 0
# g 0 0 0 0 0 0 0 0
# c 1 1 2 1 0 0 0 0
# d 1 2 3 2 1 0 0 1
# e 0 2 1 3 2 1 1 0
# I 0 0 0 0 0 0 0 0
# z 0 0 0 0 0 0 0 0
In tidyverse we can do :
library(dplyr)
library(tidyr)
mydf %>%
pivot_longer(cols = -source) %>%
count(source, value) %>%
pivot_wider(names_from = value, values_from = n) %>%
complete(source = names(.)[-1]) %>%
mutate_all(~replace_na(., 0))

Automatic subsetting of a dataframe on the basis of a prediction matrix

I have created a prediction matrix for large dataset as follows:
library(mice)
dfpredm <- quickpred(df, mincor=.3)
A B C D E F G H I J
A 0 1 1 1 0 1 0 1 1 0
B 1 0 0 0 1 0 1 0 0 1
C 0 0 0 1 1 0 0 0 0 0
D 1 0 1 0 0 1 0 1 0 1
E 0 1 0 1 0 1 1 0 1 0
**F 0 0 1 0 0 0 1 0 0 0**
G 0 1 0 1 0 0 0 0 0 0
H 1 0 1 0 0 1 0 0 0 1
I 0 1 0 1 1 0 1 0 0 0
J 1 0 1 0 0 1 0 1 0 0
I would like to create a subset of the original df on the basis on dfpredm.
More specifically I would like to do the following:
Let's assume that my dependent variable is F.
According to the prediction matrix F is correlated with C and G.
In addition, C and G are best predicted by D,E and B,D respectively.
The idea is now to create a subset of df based on the dependent variable F,for which in the F row the value is 1.
Fpredictors <- df[,(dfpredm["F",]) == 1]
But also do the same for the variables where the rows in F are 1. I am thinking of first getting the column names like this:
Fpredcol <-colnames(dfpredm[,(dfpredm["c241",]) == 1])
And then doing a for loop with these column names?
For the specific example I would like to end up with the subset.
dfsub <- df[,c("F","C","G","B","E","D")]
I would however like to automate this process. Could anyone show me how to do this?
Here is one strategy that seems like it would work for you:
first_preds <- function(dat, predictor) {
cols <- which(dat[predictor, ] == 1)
names(dat)[cols]
}
# wrap first_preds() for getting best and second best predictors
first_and_second_preds <- function(dat, predictor) {
matches <- first_preds(dat, predictor)
matches <- c(matches, unlist(lapply(matches, function(x) first_preds(dat, x))))
c(predictor, matches) %>% unique()
}
dat[first_and_second_preds(dat, "F")] # order is not exactly the same as your output
F C G D E B
A 1 1 0 1 0 1
B 0 0 1 0 1 0
C 0 0 0 1 1 0
D 1 1 0 0 0 0
E 1 0 1 1 0 1
F 0 1 1 0 0 0
G 0 0 0 1 0 1
H 1 1 0 0 0 0
I 0 0 1 1 1 1
J 1 1 0 0 0 0
Not sure if the ordering in the result is important, but you could add the logic if it is.
Using dat from here (a kinder way to share small R data on SO):
dat <- read.table(
text = "A B C D E F G H I J
A 0 1 1 1 0 1 0 1 1 0
B 1 0 0 0 1 0 1 0 0 1
C 0 0 0 1 1 0 0 0 0 0
D 1 0 1 0 0 1 0 1 0 1
E 0 1 0 1 0 1 1 0 1 0
F 0 0 1 0 0 0 1 0 0 0
G 0 1 0 1 0 0 0 0 0 0
H 1 0 1 0 0 1 0 0 0 1
I 0 1 0 1 1 0 1 0 0 0
J 1 0 1 0 0 1 0 1 0 0",
header = TRUE
)
Something a little more general that would let you use self_select predictors directly:
all_preds <- function(dat, predictors) {
unlist(lapply(predictors, function(x) names(dat)[which(dat[x, ] == 1 )]))
}
dat[all_preds(dat, c("A", "B"))]
B C D F H I A E G J
A 1 1 1 1 1 1 0 0 0 0
B 0 0 0 0 0 0 1 1 1 1
C 0 0 1 0 0 0 0 1 0 0
D 0 1 0 1 1 0 1 0 0 1
E 1 0 1 1 0 1 0 0 1 0
F 0 1 0 0 0 0 0 0 1 0
G 1 0 1 0 0 0 0 0 0 0
H 0 1 0 1 0 0 1 0 0 1
I 1 0 1 0 0 0 0 1 1 0

double spread without hard coding

I am stuck here. I tried using spread twice from tidyr, I tried joining. But none of these methods give the right solution without some hard coding.
Is there any way to tranform this data:
cat1 cat2 title
1 A G AB
2 B G BC
3 C B CD
4 D G DE
5 E H EF
6 F A FG
into this:
A B C D E F G H
AB 1 0 0 0 0 0 1 0
BC 0 1 0 0 0 0 1 0
CD 0 1 1 0 0 0 0 0
DE 0 0 0 1 0 0 1 0
EF 0 0 0 0 1 0 0 1
FG 1 0 0 0 0 1 0 0
Sample data:
df<-data.frame(cat1=LETTERS[1:6],
cat2=c('G','G','B','G','H','A'),
title=paste0(LETTERS[1:6],LETTERS[2:7]))
Since I usually get dplyr answers faster: Base R or tidyr only solutions are also very welcome
I don't know if this qualifies as not hard coding for the op
df %>%
tidyr::gather(key = vars, value = values, cat1, cat2) %>%
dplyr::mutate(vars = 1) %>%
tidyr::spread(key = values, value = vars, fill = 0)
# title A B C D E F G H
# 1 AB 1 0 0 0 0 0 1 0
# 2 BC 0 1 0 0 0 0 1 0
# 3 CD 0 1 1 0 0 0 0 0
# 4 DE 0 0 0 1 0 0 1 0
# 5 EF 0 0 0 0 1 0 0 1
# 6 FG 1 0 0 0 0 1 0 0
Just melt first, then cast:
require(reshape2)
melt(df, id="title") %>% dcast(title ~ value, length)
title A B C D E F G H
1 AB 1 0 0 0 0 0 1 0
2 BC 0 1 0 0 0 0 1 0
3 CD 0 1 1 0 0 0 0 0
4 DE 0 0 0 1 0 0 1 0
5 EF 0 0 0 0 1 0 0 1
6 FG 1 0 0 0 0 1 0 0
melt puts all the values in a single column to cast.

Transform data frame

I have a questionnaire with an open-ended question like "Please name up to ten animals", which gives me the following data frame (where each letter stands for an animal):
nrow <- 1000
list <- vector("list", nrow)
for(i in 1:nrow){
na <- rep(NA, sample(1:10, 1))
list[[i]] <- sample(c(letters, na), 10, replace=FALSE)
}
df <- data.frame()
df <- rbind(df, do.call(rbind, list))
head(df)
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 1 r <NA> a j w e i h u z
# 2 t o e x d v <NA> z n c
# 3 f y e s n c z i u k
# 4 y <NA> v j h z p i c q
# 5 w s v f <NA> c g b x e
# 6 p <NA> a h v x k z o <NA>
How can I transform this data frame to look like the following data frame? Remember that I don't actually know the column names.
r <- 1000
c <- length(letters)
t1 <- matrix(rbinom(r*c,1,0.5),r,c)
colnames(t1) <- letters
head(t1)
# a b c d e f g h i j k l m n o p q r s t u v w x y z
# [1,] 0 1 0 1 0 0 0 1 0 0 1 1 1 1 0 0 0 1 0 1 0 1 1 0 1 0
# [2,] 1 1 1 1 0 1 0 1 1 1 1 0 1 0 0 0 1 1 1 0 0 1 0 1 0 1
# [3,] 0 1 0 0 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0
# [4,] 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 0 1 0 1 1 0 0
# [5,] 1 0 1 1 1 1 1 1 1 0 1 1 0 0 0 0 1 1 0 1 1 0 0 1 0 0
# [6,] 1 1 0 1 1 0 0 1 0 0 1 0 0 0 0 0 1 1 1 0 0 0 1 1 0 1
td <- data.frame(t(apply(df, 1, function(x) as.numeric( unique(unlist(df)) %in% x))))
colnames (td) <- unique(unlist(df))
letters could be replaced with a vector of animal names colnames(t1).
You can do the following using tidyr which could be much faster than other approaches, though I like the approach by #germcd very much. You may need to tinker with the select, removing NAs as well as a blank space, which may be an artifact of the simulated data you provided:
require(tidyr)
## Add an ID for each record:
df$id <- 1:nrow(df)
out <- (df %>%
gather(column, animal, -id) %>%
filter(animal != " ") %>%
spread(animal, column)
)
head(out)
This code gathers the unnamed columns into a long format, removes any empty columns or missing data, and then spreads by the unique values of the animal column. This also has the potentially desirable property of preserving the column order in which the animals were named. If it's not desirable then you could easily convert the resulting animal columns to numeric:
out_num <- out
out_num[,-1] <- as.numeric((!is.na(out[,-1])))
head(out_num)
You can try mtabulate from the "qdapTools" package:
library(qdapTools)
head(mtabulate(as.data.frame(t(df))))
# c d i l m o r v x y a f s t k p u b h j n q e g w z
# 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 2 0 1 0 0 1 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
# 3 0 0 1 0 0 0 1 0 1 1 1 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0
# 4 1 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0
# 5 0 1 0 0 0 0 1 0 0 0 0 0 1 0 1 1 0 1 1 0 1 1 0 0 0 0
# 6 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 1 1 0 1 0 1 0 0 0 0
There are, of course, many other options.
For example, cSplit_e from my "splitstackshape" package (with the downside that inefficiently, you need to paste the values together first before you can split them):
library(splitstackshape)
library(dplyr)
As ones and zeroes:
df %>%
mutate(combined = apply(., 1, function(x) paste(na.omit(x), collapse = ","))) %>%
cSplit_e("combined", ",", mode = "binary", type = "character", fill = 0) %>%
select(starts_with("combined_")) %>%
head
# combined_a combined_b combined_c combined_d combined_e combined_f combined_g combined_h combined_i
# 1 0 0 1 1 0 0 0 0 1
# 2 1 0 0 1 0 1 0 0 0
# 3 1 0 0 0 0 0 0 0 1
# 4 0 1 1 0 0 0 0 1 1
# 5 0 1 0 1 0 0 0 1 0
# 6 0 1 0 0 0 0 0 0 0
# combined_j combined_k combined_l combined_m combined_n combined_o combined_p combined_q combined_r
# 1 0 0 1 1 0 1 0 0 1
# 2 0 0 0 1 0 0 0 0 0
# 3 0 1 0 0 0 0 1 0 1
# 4 1 0 1 0 1 0 0 0 0
# 5 0 1 0 0 1 0 1 1 1
# 6 1 1 0 1 0 0 0 1 0
# combined_s combined_t combined_u combined_v combined_w combined_x combined_y combined_z
# 1 0 0 0 1 0 1 1 0
# 2 1 1 0 0 0 0 0 0
# 3 0 1 1 0 0 1 1 0
# 4 0 0 1 0 0 0 1 0
# 5 1 0 0 0 0 0 0 0
# 6 1 1 1 0 0 0 0 0
As the original values:
df %>%
mutate(combined = apply(., 1, function(x) paste(na.omit(x), collapse = ","))) %>%
cSplit_e("combined", ",", mode = "value", type = "character", fill = "") %>%
select(starts_with("combined_")) %>%
head
# combined_a combined_b combined_c combined_d combined_e combined_f combined_g combined_h combined_i
# 1 c d i
# 2 a d f
# 3 a i
# 4 b c h i
# 5 b d h
# 6 b
# combined_j combined_k combined_l combined_m combined_n combined_o combined_p combined_q combined_r
# 1 l m o r
# 2 m
# 3 k p r
# 4 j l n
# 5 k n p q r
# 6 j k m q
# combined_s combined_t combined_u combined_v combined_w combined_x combined_y combined_z
# 1 v x y
# 2 s t
# 3 t u x y
# 4 u y
# 5 s
# 6 s t u
Alternatively, you can use "reshape2":
library(reshape2)
## The values
dcast(melt(as.matrix(df), na.rm = TRUE),
Var1 ~ value, value.var = "value")
## ones and zeroes
dcast(melt(as.matrix(df), na.rm = TRUE),
Var1 ~ value, value.var = "value", fun.aggregate = length)

Resources