Using any in nested ifelse statement - r

data:
set.seed(1337)
m <- matrix(sample(c(0,0,0,1),size = 50,replace=T),ncol=5) %>% as.data.frame
colnames(m)<-LETTERS[1:5]
code:
m %<>%
mutate(newcol = ifelse(A==1&(B==1|C==1)&(D==1|E==1),1,
ifelse(any(A,B,C,D,E),0,NA)),
desiredResult= ifelse(A==1&(B==1|C==1)&(D==1|E==1),1,
ifelse(!(A==0&B==0&C==0&D==0&E==0),0,NA)))
looks like:
A B C D E newcol desiredResult
1 0 1 1 1 0 0 0
2 0 1 0 0 1 0 0
3 0 1 0 0 0 0 0
4 0 0 0 0 0 0 NA
5 0 1 0 1 0 0 0
6 0 0 1 0 0 0 0
7 1 1 1 1 0 1 1
8 0 1 1 0 0 0 0
9 0 0 0 0 0 0 NA
10 0 0 1 0 0 0 0
question
I want newcol to be the same as desiredResult.
Why can't I use any in that "stratified" manner of ifelse. Is there a function like any that would work in that situation?
possible workaround
I could define a function
any_vec <- function(...) {apply(cbind(...),1,any)} but this does not make me smile too much.
like suggested in the answer
using pmax works exactly like a vectorized any.
m %>%
mutate(pmaxResult = ifelse(A==1& pmax(B,C) & pmax(D,E),1,
ifelse(pmax(A,B,C,D,E),0,NA)),
desiredResult= ifelse(A==1&(B==1|C==1)&(D==1|E==1),1,
ifelse(!(A==0&B==0&C==0&D==0&E==0),0,NA)))

Here's an alternative approach. I converted to logical at the beginning and back to integer at the end:
m %>%
mutate_all(as.logical) %>%
mutate(newcol = A & pmax(B,C) & pmax(D, E) ,
newcol = replace(newcol, !newcol & !pmax(A,B,C,D,E), NA)) %>%
mutate_all(as.integer)
# A B C D E newcol
# 1 0 1 1 1 0 0
# 2 0 1 0 0 1 0
# 3 0 1 0 0 0 0
# 4 0 0 0 0 0 NA
# 5 0 1 0 1 0 0
# 6 0 0 1 0 0 0
# 7 1 1 1 1 0 1
# 8 0 1 1 0 0 0
# 9 0 0 0 0 0 NA
# 10 0 0 1 0 0 0
I basically replaced the any with pmax.

Related

Automatic subsetting of a dataframe on the basis of a prediction matrix

I have created a prediction matrix for large dataset as follows:
library(mice)
dfpredm <- quickpred(df, mincor=.3)
A B C D E F G H I J
A 0 1 1 1 0 1 0 1 1 0
B 1 0 0 0 1 0 1 0 0 1
C 0 0 0 1 1 0 0 0 0 0
D 1 0 1 0 0 1 0 1 0 1
E 0 1 0 1 0 1 1 0 1 0
**F 0 0 1 0 0 0 1 0 0 0**
G 0 1 0 1 0 0 0 0 0 0
H 1 0 1 0 0 1 0 0 0 1
I 0 1 0 1 1 0 1 0 0 0
J 1 0 1 0 0 1 0 1 0 0
I would like to create a subset of the original df on the basis on dfpredm.
More specifically I would like to do the following:
Let's assume that my dependent variable is F.
According to the prediction matrix F is correlated with C and G.
In addition, C and G are best predicted by D,E and B,D respectively.
The idea is now to create a subset of df based on the dependent variable F,for which in the F row the value is 1.
Fpredictors <- df[,(dfpredm["F",]) == 1]
But also do the same for the variables where the rows in F are 1. I am thinking of first getting the column names like this:
Fpredcol <-colnames(dfpredm[,(dfpredm["c241",]) == 1])
And then doing a for loop with these column names?
For the specific example I would like to end up with the subset.
dfsub <- df[,c("F","C","G","B","E","D")]
I would however like to automate this process. Could anyone show me how to do this?
Here is one strategy that seems like it would work for you:
first_preds <- function(dat, predictor) {
cols <- which(dat[predictor, ] == 1)
names(dat)[cols]
}
# wrap first_preds() for getting best and second best predictors
first_and_second_preds <- function(dat, predictor) {
matches <- first_preds(dat, predictor)
matches <- c(matches, unlist(lapply(matches, function(x) first_preds(dat, x))))
c(predictor, matches) %>% unique()
}
dat[first_and_second_preds(dat, "F")] # order is not exactly the same as your output
F C G D E B
A 1 1 0 1 0 1
B 0 0 1 0 1 0
C 0 0 0 1 1 0
D 1 1 0 0 0 0
E 1 0 1 1 0 1
F 0 1 1 0 0 0
G 0 0 0 1 0 1
H 1 1 0 0 0 0
I 0 0 1 1 1 1
J 1 1 0 0 0 0
Not sure if the ordering in the result is important, but you could add the logic if it is.
Using dat from here (a kinder way to share small R data on SO):
dat <- read.table(
text = "A B C D E F G H I J
A 0 1 1 1 0 1 0 1 1 0
B 1 0 0 0 1 0 1 0 0 1
C 0 0 0 1 1 0 0 0 0 0
D 1 0 1 0 0 1 0 1 0 1
E 0 1 0 1 0 1 1 0 1 0
F 0 0 1 0 0 0 1 0 0 0
G 0 1 0 1 0 0 0 0 0 0
H 1 0 1 0 0 1 0 0 0 1
I 0 1 0 1 1 0 1 0 0 0
J 1 0 1 0 0 1 0 1 0 0",
header = TRUE
)
Something a little more general that would let you use self_select predictors directly:
all_preds <- function(dat, predictors) {
unlist(lapply(predictors, function(x) names(dat)[which(dat[x, ] == 1 )]))
}
dat[all_preds(dat, c("A", "B"))]
B C D F H I A E G J
A 1 1 1 1 1 1 0 0 0 0
B 0 0 0 0 0 0 1 1 1 1
C 0 0 1 0 0 0 0 1 0 0
D 0 1 0 1 1 0 1 0 0 1
E 1 0 1 1 0 1 0 0 1 0
F 0 1 0 0 0 0 0 0 1 0
G 1 0 1 0 0 0 0 0 0 0
H 0 1 0 1 0 0 1 0 0 1
I 1 0 1 0 0 0 0 1 1 0

R - Creating a new column within a data frame when two or more columns are a match in a row

I'm currently stuck on a part of my code that feels intuitive but I can't figure a way to do it. I have a very big data frame (nrows = 34036, ncol = 43) in which I want to create a continuous sequence of the variables where the value of the row is 1 (without having multiple columns with 1). It consists of only zeros and ones similar to the following:
A B C D
1 0 0 0
0 0 0 1
0 0 0 1
0 0 0 0
0 0 0 0
1 0 1 0
1 0 1 0
0 1 0 0
0 1 0 0
1 0 0 1
I was able to remove the zeroes using:
#find the sum of each row
placeholderData <- transform(placeholderData, sum=rowSums(placeholderData))
placeholderData <- placeholderData[!(placeholderData$sum <= 0),]
And the data frame now looks like:
A B C D sum
1 0 0 0 1
0 0 0 1 1
0 0 0 1 1
1 0 1 0 2
1 0 1 0 2
0 1 0 0 1
0 1 0 0 1
1 0 0 1 2
My main problem comes when there are two or more 1's in a row. To try to solve this, I used the following code to identify the columns that have a sum of 2 or more:
placeholderData$Matches <- lapply(apply(placeholderData == 1, 1, which), names)
Which added the following column to the data frame:
A B C D sum Matches
1 0 0 0 1 A
0 0 0 1 1 D
0 0 0 1 1 D
1 0 1 0 2 c("A","C")
1 0 1 0 2 c("A","C")
0 1 0 0 1 B
0 1 0 0 1 B
1 0 0 1 2 c("A", "D")
I added the Matches column as an approach to solve the problem, but I'm not sure how would I do it without using a lot of logical operators (I don't know what columns have matches or not). What I would like to do is to aggregate the rows that have more than (or equal to) two 1's into a new column, to be able to have a data frame like this:
A B C D AC AD sum Matches
1 0 0 0 0 0 1 A
0 0 0 1 0 0 1 D
0 0 0 1 0 0 1 D
0 0 0 0 1 0 1 c("A","C")
0 0 0 0 1 0 1 c("A","C")
0 1 0 0 0 0 1 B
0 1 0 0 0 0 1 B
0 0 0 0 0 1 1 c("A", "D")
Then, I would be able to use my code as normal (It works just fine when there are no repeated values in rows). I tried searching to find similar questions, but I'm not sure if I was even asking the right question. I was wondering if anyone could provide some help or some ideas that I could try.
Thank you very much!
This seems a lot like making dummy variables, so I would use the model.matrix function commonly used for dummy variables (one-hot encoding):
m = read.table(header = T, text = "A B C D
1 0 0 0
0 0 0 1
0 0 0 1
0 0 0 0
0 0 0 0
1 0 1 0
1 0 1 0
0 1 0 0
0 1 0 0
1 0 0 1")
m = m[rowSums(m) > 0, ]
d = factor(sapply(apply(m == 1, 1, which), function(x) paste(names(m)[x], collapse = "")))
result = data.frame(model.matrix(~ d + 0))
names(result) = levels(d)
# A AC AD B D
# 1 1 0 0 0 0
# 2 0 0 0 0 1
# 3 0 0 0 0 1
# 4 0 1 0 0 0
# 5 0 1 0 0 0
# 6 0 0 0 1 0
# 7 0 0 0 1 0
# 8 0 0 1 0 0

Split column of comma-separated numbers into multiple columns based on value

I have a column f in my dataframe that I would like to spread into multiple columns based on the values in that column. For example:
df <- structure(list(f = c(NA, "18,17,10", "12,8", "17,11,6", "18",
"12", "12", NA, "17,11", "12")), .Names = "f", row.names = c(NA,
10L), class = "data.frame")
df
# f
# 1 <NA>
# 2 18,17,10
# 3 12,8
# 4 17,11,6
# 5 18
# 6 12
# 7 12
# 8 <NA>
# 9 17,11
# 10 12
How would I split column f into multiple columns indicating the numbers in the row. I'm interested in something like this:
6 8 10 11 12 17 18
1 0 0 0 0 0 0 0
2 0 0 1 0 0 1 1
3 0 1 0 0 1 0 0
4 1 0 0 1 0 1 0
5 0 0 0 0 0 0 1
6 0 0 0 0 1 0 0
7 0 0 0 0 1 0 0
8 0 0 0 0 0 0 0
9 0 0 0 1 0 1 0
10 0 0 0 0 1 0 0
I'm thinking I could useunique on the f column to create the seperate columns based on the different numbers and then do a grepl to determine if the specific number is in column f but I was wondering if there was a better way. Something similar to spread or separate in the tidyr package.
A solution using tidyr::separate_rows will be as:
library(tidyverse)
df %>% mutate(ind = row_number()) %>%
separate_rows(f, sep=",") %>%
mutate(f = ifelse(is.na(f),0, f)) %>%
count(ind, f) %>%
spread(f, n, fill = 0) %>%
select(-2) %>% as.data.frame()
# ind 10 11 12 17 18 6 8
# 1 1 0 0 0 0 0 0 0
# 2 2 1 0 0 1 1 0 0
# 3 3 0 0 1 0 0 0 1
# 4 4 0 1 0 1 0 1 0
# 5 5 0 0 0 0 1 0 0
# 6 6 0 0 1 0 0 0 0
# 7 7 0 0 1 0 0 0 0
# 8 8 0 0 0 0 0 0 0
# 9 9 0 1 0 1 0 0 0
# 10 10 0 0 1 0 0 0 0
This could be achieved by splitting on the ,, the stack it to a two column data.frame and get the frequency with table
df1 <- na.omit(stack(setNames(lapply(strsplit(df$f, ","),
as.numeric), seq_len(nrow(df))))[, 2:1])
table(df1)
# values
#ind 6 8 10 11 12 17 18
# 1 0 0 0 0 0 0 0
# 2 0 0 1 0 0 1 1
# 3 0 1 0 0 1 0 0
# 4 1 0 0 1 0 1 0
# 5 0 0 0 0 0 0 1
# 6 0 0 0 0 1 0 0
# 7 0 0 0 0 1 0 0
# 8 0 0 0 0 0 0 0
# 9 0 0 0 1 0 1 0
# 10 0 0 0 0 1 0 0

Building a symmetric binary matrix

I have a matrix that is for example like this:
rownames V1
a 1
c 3
b 2
d 4
y 2
q 4
i 1
j 1
r 3
I want to make a Symmetric binary matrix that it's dimnames of that is the same as rownames of above matrix. I want to fill these matrix by 1 & 0 in such a way that 1 indicated placing variables that has the same number in front of it and 0 for the opposite situation.This matrix would be like
dimnames
a c b d y q i j r
a 1 0 0 0 0 0 1 1 0
c 0 1 0 0 0 0 0 0 1
b 0 0 1 0 1 0 0 0 0
d 0 0 0 1 0 1 0 0 0
y 0 0 1 0 1 0 0 0 0
q 0 0 0 1 0 1 0 0 0
i 1 0 0 0 0 0 1 1 0
j 1 0 0 0 0 0 1 1 0
r 0 1 0 0 0 0 0 0 1
Anybody know how can I do that?
Use dist:
DF <- read.table(text = "rownames V1
a 1
c 3
b 2
d 4
y 2
q 4
i 1
j 1
r 3", header = TRUE)
res <- as.matrix(dist(DF$V1)) == 0L
#alternatively:
#res <- !as.matrix(dist(DF$V1))
#diag(res) <- 0L #for the first version of the question, i.e. a zero diagonal
res <- +(res) #for the second version, i.e. to coerce to an integer matrix
dimnames(res) <- list(DF$rownames, DF$rownames)
# 1 2 3 4 5 6 7 8 9
#1 1 0 0 0 0 0 1 1 0
#2 0 1 0 0 0 0 0 0 1
#3 0 0 1 0 1 0 0 0 0
#4 0 0 0 1 0 1 0 0 0
#5 0 0 1 0 1 0 0 0 0
#6 0 0 0 1 0 1 0 0 0
#7 1 0 0 0 0 0 1 1 0
#8 1 0 0 0 0 0 1 1 0
#9 0 1 0 0 0 0 0 0 1
You can do this using table and crossprod.
tcrossprod(table(DF))
# rownames
# rownames a b c d i j q r y
# a 1 0 0 0 1 1 0 0 0
# b 0 1 0 0 0 0 0 0 1
# c 0 0 1 0 0 0 0 1 0
# d 0 0 0 1 0 0 1 0 0
# i 1 0 0 0 1 1 0 0 0
# j 1 0 0 0 1 1 0 0 0
# q 0 0 0 1 0 0 1 0 0
# r 0 0 1 0 0 0 0 1 0
# y 0 1 0 0 0 0 0 0 1
If you want the row and column order as they are found in the data, rather than alphanumerically, you can subset
tcrossprod(table(DF))[DF$rownames, DF$rownames]
or use factor
tcrossprod(table(factor(DF$rownames, levels=unique(DF$rownames)), DF$V1))
If your data is large or sparse, you can use the sparse matrix algebra in xtabs, with similar ways to change the order of the resulting table as before.
Matrix::tcrossprod(xtabs(data=DF, ~ rownames + V1, sparse=TRUE))

How to force table to have equal dimensions?

How can I force the dimensions of a table to be equal in R?
For example:
a <- c(0,1,2,3,4,5,1,3,4,5,3,4,5)
b <- c(1,2,3,3,3,3,3,3,3,3,5,5,6)
c <- table(a,b)
print(c)
# b
#a 1 2 3 5 6
# 0 1 0 0 0 0
# 1 0 1 1 0 0
# 2 0 0 1 0 0
# 3 0 0 2 1 0
# 4 0 0 2 1 0
# 5 0 0 2 0 1
However, I am looking for the following result:
print(c)
# b
#a 0 1 2 3 4 5 6
# 0 0 1 0 0 0 0 0
# 1 0 0 1 1 0 0 0
# 2 0 0 0 1 0 0 0
# 3 0 0 0 2 0 1 0
# 4 0 0 0 2 0 1 0
# 5 0 0 0 2 0 0 1
# 6 0 0 0 0 0 0 0
By using factors. table doesn't know the levels of your variable unless you tell it in some way!
a <- c(0,1,2,3,4,5,1,3,4,5,3,4,5)
b <- c(1,2,3,3,3,3,3,3,3,3,5,5,6)
a <- factor(a, levels = 0:6)
b <- factor(b, levels = 0:6)
table(a,b)
# b
#a 0 1 2 3 4 5 6
# 0 0 1 0 0 0 0 0
# 1 0 0 1 1 0 0 0
# 2 0 0 0 1 0 0 0
# 3 0 0 0 2 0 1 0
# 4 0 0 0 2 0 1 0
# 5 0 0 0 2 0 0 1
# 6 0 0 0 0 0 0 0
Edit The general way to force a square cross-tabulation is to do something like
x <- factor(a, levels = union(a, b))
y <- factor(b, levels = union(a, b))
table(x, y)

Resources