Counting existing permutations in R

I have a large dataset with columns IDNum, Var1, Var2, Var3, Var4, Var5, Var6. The variables are boolean with value either 0 or 1. Each row could be one of 64 different possible permutations. I would like to count the number of rows corresponding to each permutation present. Is there an efficient way to write this in R?

aggregate can do this. Here's a shorter example:
r <- function() rbinom(10, 1, .5)
d <- data.frame(IDNum=1:10, Var1=r(), Var2=r())
d
IDNum Var1 Var2
1 1 0 1
2 2 0 1
3 3 0 0
4 4 1 0
5 5 1 1
6 6 0 0
7 7 1 1
8 8 1 0
9 9 0 1
10 10 0 1
Now to count the number of each combination:
> aggregate(d$IDNum, d[-1], FUN=length)
Var1 Var2 x
1 0 0 2
2 1 0 2
3 0 1 4
4 1 1 2
The values in d$IDNum aren't used for their content here; aggregate just needs some vector to split by combination. The d$IDNum values in each combination are passed to length, which returns the count for that group.
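As a rough sketch, the same count can also be written with aggregate's formula interface, which extends to Var1 through Var6 by listing more columns on the right-hand side. The toy data frame d2 and the column name n are made up for illustration and are not part of the original answer.

```r
# Hypothetical toy data with a known answer
d2 <- data.frame(IDNum = 1:6,
                 Var1 = c(0, 0, 1, 1, 0, 1),
                 Var2 = c(1, 1, 0, 0, 1, 1))

# Count rows per (Var1, Var2) combination; only combinations present appear
counts <- aggregate(IDNum ~ Var1 + Var2, data = d2, FUN = length)

# Rename the count column (the name "n" is an arbitrary choice)
names(counts)[names(counts) == "IDNum"] <- "n"
counts
```

The counts should add up to the number of rows in the input, which is a quick sanity check on a large dataset.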

Using table gives a slightly different result: it lists all possible combinations, whether or not they are present in the data. Example data:
nam <- c("IDNum",paste0("Var",1:6))
n <- 5
set.seed(23)
dat <- setNames(data.frame(1:n,replicate(6,sample(0:1,n,replace=TRUE))),nam)
# IDNum Var1 Var2 Var3 Var4 Var5 Var6
#1 1 1 0 1 0 1 1
#2 2 0 1 1 1 0 1
#3 3 0 1 0 1 0 1
#4 4 1 1 0 1 1 0
#5 5 1 1 1 1 0 1
Count em up:
data.frame(table(dat[-1]))
# Var1 Var2 Var3 Var4 Var5 Var6 Freq
#1 0 0 0 0 0 0 0
#...
#28 1 1 0 1 1 0 1
#...
#43 0 1 0 1 0 1 1
#...
#47 0 1 1 1 0 1 1
#48 1 1 1 1 0 1 1
#...
#54 1 0 1 0 1 1 1
#...
#64 1 1 1 1 1 1 0
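Since the question asks only for the combinations that are present, the zero-frequency rows can be dropped afterwards. A minimal sketch, assuming a smaller made-up data frame dat3 with three variables (so 8 possible combinations instead of 64):

```r
# Hypothetical toy data: 5 rows, 3 binary variables
dat3 <- data.frame(IDNum = 1:5,
                   Var1 = c(1, 0, 0, 1, 1),
                   Var2 = c(0, 1, 1, 1, 1),
                   Var3 = c(1, 1, 0, 0, 1))

# table() enumerates every possible combination, including absent ones
all_counts <- as.data.frame(table(dat3[-1]))

# Keep only the combinations that actually occur
present <- all_counts[all_counts$Freq > 0, ]
present
```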

You can also use the count function in dplyr:
library(dplyr)
r <- function() rbinom(10, 1, .5)
d <- data.frame(IDNum=1:10, Var1=r(), Var2=r())
d
d %>% count(Var1, Var2)
Output:
Var1 Var2 n
1 0 0 3
2 0 1 3
3 1 0 1
4 1 1 3

Related

Count occurrences of distinct values across multiple columns and groups

I've got a dataframe like the one below (in the actual dataset there are a few thousand rows and more than 300 variables):
df <- data.frame (Gr = c("A","A","A","B","B","B","B","B","B"),
Var1 = c("a","b","c","e","a","a","c","e","b"),
Var2 = c("a","a","a","d","b","b","c","a","e"),
Var3 = c("e","a","b",NA,"a","b","c","d","a"),
Var4 = c("e",NA,"a","e","a","b","d","c",NA))
which returns:
Gr Var1 Var2 Var3 Var4
1 A a a e e
2 A b a a <NA>
3 A c a b a
4 B e d <NA> e
5 B a b a a
6 B a b b b
7 B c c c d
8 B e a d c
9 B b e a <NA>
and would like to get number of occurrences of each value (a,b,c,d,e, and NA) in each variable and in each group. Hence, the output should look something like the following:
df1 <- data.frame(Vars = c("Var1","Var2","Var3","Var4"),
a = c(1,3,1,1),
b = c(1,0,1,0),
c = c(1,0,0,0),
d = c(0,0,0,0),
e = c(0,0,1,1),
na = c(0,0,0,1))
df2 <- data.frame(Vars = c("Var1","Var2","Var3","Var4"),
a = c(2,1,2,1),
b = c(1,2,1,1),
c = c(1,1,1,1),
d = c(0,1,1,1),
e = c(2,1,0,1),
na = c(0,0,1,1))
output <- list(df1,df2)
names(output) <- c("A","B")
which looks like:
$A
Vars a b c d e na
1 Var1 1 1 1 0 0 0
2 Var2 3 0 0 0 0 0
3 Var3 1 1 0 0 1 0
4 Var4 1 0 0 0 1 1
$B
Vars a b c d e na
1 Var1 2 1 1 0 2 0
2 Var2 1 2 1 1 1 0
3 Var3 2 1 1 1 0 1
4 Var4 1 1 1 1 1 1
I haven't been able to make any considerable progress so far, and a tidyverse solution is preferred.
We may use mtabulate after splitting
library(qdapTools)
lapply(split(df[-1], df$Gr), mtabulate)
If we need the na count as well
lapply(split(replace(df[-1], is.na(df[-1]), "na"), df$Gr), mtabulate)
-output
$A
a b c e na
Var1 1 1 1 0 0
Var2 3 0 0 0 0
Var3 1 1 0 1 0
Var4 1 0 0 1 1
$B
a b c d e na
Var1 2 1 1 0 2 0
Var2 1 2 1 1 1 0
Var3 2 1 1 1 0 1
Var4 1 1 1 1 1 1
Or using tidyverse
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -Gr, names_to = "Vars") %>%
pivot_wider(names_from = value, values_from = value,
values_fn = length, values_fill = 0) %>%
{split(.[-1], .$Gr)}
-output
$A
# A tibble: 4 × 7
Vars a e b `NA` c d
<chr> <int> <int> <int> <int> <int> <int>
1 Var1 1 0 1 0 1 0
2 Var2 3 0 0 0 0 0
3 Var3 1 1 1 0 0 0
4 Var4 1 1 0 1 0 0
$B
# A tibble: 4 × 7
Vars a e b `NA` c d
<chr> <int> <int> <int> <int> <int> <int>
1 Var1 2 2 1 0 1 0
2 Var2 1 1 2 0 1 1
3 Var3 2 0 1 1 1 1
4 Var4 1 1 1 1 1 1
An NA-safe base R approach using colSums
val <- sort(unique(unlist(df[-1])), na.last=TRUE)
as.list(lapply(split(df[-1], df$Gr), function(dlist)
data.frame(sapply(val, function(x)
colSums(dlist == x | (is.na(dlist) & is.na(x)), na.rm=TRUE)), check.names=FALSE)))
$A
a b c d e NA
Var1 1 1 1 0 0 0
Var2 3 0 0 0 0 0
Var3 1 1 0 0 1 0
Var4 1 0 0 0 1 1
$B
a b c d e NA
Var1 2 1 1 0 2 0
Var2 1 2 1 1 1 0
Var3 2 1 1 1 0 1
Var4 1 1 1 1 1 1
reshape2::recast(df,Gr+variable~value,length,id.var = 'Gr')
Gr variable a b c d e NA
1 A Var1 1 1 1 0 0 0
2 A Var2 3 0 0 0 0 0
3 A Var3 1 1 0 0 1 0
4 A Var4 1 0 0 0 1 1
5 B Var1 2 1 1 0 2 0
6 B Var2 1 2 1 1 1 0
7 B Var3 2 1 1 1 0 1
8 B Var4 1 1 1 1 1 1
If you must split them:
split(reshape2::recast(df,Gr+variable~value,length,id.var = 'Gr'), ~Gr)
$A
Gr variable a b c d e NA
1 A Var1 1 1 1 0 0 0
2 A Var2 3 0 0 0 0 0
3 A Var3 1 1 0 0 1 0
4 A Var4 1 0 0 0 1 1
$B
Gr variable a b c d e NA
5 B Var1 2 1 1 0 2 0
6 B Var2 1 2 1 1 1 0
7 B Var3 2 1 1 1 0 1
8 B Var4 1 1 1 1 1 1
in base R:
ftable(cbind(df[1], stack(replace(df, is.na(df),'na'), -1)),col.vars = 2)
values a b c d e na
Gr ind
A Var1 1 1 1 0 0 0
Var2 3 0 0 0 0 0
Var3 1 1 0 0 1 0
Var4 1 0 0 0 1 1
B Var1 2 1 1 0 2 0
Var2 1 2 1 1 1 0
Var3 2 1 1 1 0 1
Var4 1 1 1 1 1 1

Recode only certain values and keep others as it is in R

I am trying to recode a list of columns var1:var8 in the data frame "sampledf", changing the values "B" and "D" into "0" but keeping the other values as they are.
sampledf <- data.frame(
var1 = c(1,4,2,1,1,0,0,1,0,0,0),
var2 = c(1,1,"D",1,0,0,1,"B",0,"D",0),
var3 = c(1,5,2,1,"B",0,1,1,1,0,0),
var4 = c(1,1,0,1,2,0,1,1,5,1,1),
var5 = c(0,4,"D",1,0,0,0,1,1,1,1),
var6 = c(1,"D",0,1,0,2,1,1,0,1,0),
var7 = c(1,1,0,0,1,"E",1,0,"D",1,1),
var8 = c(1,1,0,0,2,5,1,"D",0,3,1))
This is what I tried, but it did not work. Unlike this example, my real dataset has a very long list of other values, so I cannot supply them all manually. All I want is to change these two values and keep the others as they are.
sampledfnew <- sampledf %>% mutate(across(var1:var2, ~recode(
.x,
'B'=0L,
'D'=0L,
TRUE ~ X,
)))
Can anyone help me fix the error here?
Thank you
There are many ways to do this. Using ifelse -
library(dplyr)
change_values <- c('B', 'D')
sampledf %>% mutate(across(var1:var2, ~ifelse(.x %in% change_values, 0, .x)))
# var1 var2 var3 var4 var5 var6 var7 var8
#1 1 1 1 1 0 1 1 1
#2 4 1 5 1 4 D 1 1
#3 2 0 2 0 D 0 0 0
#4 1 1 1 1 1 1 0 0
#5 1 0 B 2 0 0 1 2
#6 0 0 0 0 0 2 E 5
#7 0 1 1 1 0 1 1 1
#8 1 0 1 1 1 1 0 D
#9 0 0 1 5 1 0 D 0
#10 0 0 0 1 1 1 1 3
#11 0 0 0 1 1 0 1 1
Here are some alternatives to ifelse, which is prone to at least two not-insignificant issues (class-dropping and class-ambiguity, discussed below).
sampledf %>%
mutate(
across(var1:var8, ~ if_else(
. %in% c("B", "D"),
if (is.character(.)) "0" else 0, # could also be maybechar(0, .) from below
.)
)
)
# var1 var2 var3 var4 var5 var6 var7 var8
# 1 1 1 1 1 0 1 1 1
# 2 4 1 5 1 4 0 1 1
# 3 2 0 2 0 0 0 0 0
# 4 1 1 1 1 1 1 0 0
# 5 1 0 0 2 0 0 1 2
# 6 0 0 0 0 0 2 E 5
# 7 0 1 1 1 0 1 1 1
# 8 1 0 1 1 1 1 0 0
# 9 0 0 1 5 1 0 0 0
# 10 0 0 0 1 1 1 1 3
# 11 0 0 0 1 1 0 1 1
In case you don't always want B/D to be replaced with the same value, use case_when:
maybechar <- function(val, src) if (is.character(src)) as.character(val) else val
sampledf %>%
mutate(
across(var1:var8, ~ case_when(
. == "B" ~ maybechar(0, .),
. == "D" ~ maybechar(0, .),
TRUE ~ .)
)
)
Notes:
Most of the replacement being done here is actually replacing with a "0" string instead of a 0 integer, because most of your data is string.
The use of ifelse by itself is something I often recommend against due to class ambiguity. With ifelse it is possible to change the class of the return value without realizing it. Compare ifelse(c(T,T), 1:2, c("A","B")) with ifelse(c(T,F), 1:2, c("A","B")) to see what I mean. This is "dangerous"/risky, and it is one thing that if_else explicitly guards against. (This is also enforced by case_when in my second code block.)
It is because of the previous point that I suggested the use of something like maybechar, which might look like slightly sloppy code but is at least a little more declarative/intentional about it. I give two ways to do it: the first is explicitly without a helper function, shown in the if_else example above; the second is with the helper function. It seems more prudent to use the helper function with case_when, since the operation is being done multiple times, so the code is a little easier to read (imo).
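The class ambiguity mentioned in the notes can be seen directly in base R. This is only an illustration of the two calls quoted above; the variable names all_true and mixed are made up.

```r
# Every test is TRUE, so only the "yes" values are used: result stays integer
all_true <- ifelse(c(TRUE, TRUE), 1:2, c("A", "B"))

# One test is FALSE, so a "no" value is mixed in: result is silently character
mixed <- ifelse(c(TRUE, FALSE), 1:2, c("A", "B"))

class(all_true)  # "integer"
class(mixed)     # "character"
```

The return class therefore depends on the data in the test vector, not just on the code, which is the risk the notes describe.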
Another base R solution is:
sampledf[apply(sampledf, 2, \(x) x %in% c("B", "D"))] <- 0
> sampledf
var1 var2 var3 var4 var5 var6 var7 var8
1 1 1 1 1 0 1 1 1
2 4 1 5 1 4 0 1 1
3 2 0 2 0 0 0 0 0
4 1 1 1 1 1 1 0 0
5 1 0 0 2 0 0 1 2
6 0 0 0 0 0 2 E 5
7 0 1 1 1 0 1 1 1
8 1 0 1 1 1 1 0 0
9 0 0 1 5 1 0 0 0
10 0 0 0 1 1 1 1 3
11 0 0 0 1 1 0 1 1
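A column-wise variant in base R might look like the sketch below. It uses lapply with replace so each column is handled separately: numeric columns stay numeric, and character columns receive the string "0". The small data frame s and the vector change_values are illustrative assumptions, not the asker's data.

```r
# Hypothetical toy data: one numeric column, one character column
s <- data.frame(v1 = c(1, 4, 0),
                v2 = c("1", "D", "B"),
                stringsAsFactors = FALSE)

change_values <- c("B", "D")

# s[] <- keeps the data.frame structure while replacing each column
s[] <- lapply(s, function(x) replace(x, x %in% change_values, 0))
s
```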

R- count values in data.frame

df <- data.frame(row.names = c('ID1','ID2','ID3','ID4'),var1 = c(0,1,2,3),var2 = c(0,0,0,0),var3 = c(1,2,3,0),var4 = c('1','1','2','2'))
> df
var1 var2 var3 var4
ID1 0 0 1 1
ID2 1 0 2 1
ID3 2 0 3 2
ID4 3 0 0 2
I want df to look like this
var1 var2 var3 var4
0 1 4 1 0
1 1 0 1 2
2 1 0 1 2
3 1 0 1 0
So I want the values of df to be counted. The problem is, that not every value occurs in every column.
I tried this lapply(df,table) but that returns a list which I cannot convert into a data.frame (because of said reason).
I could do it kind of manually with table(df$var1) and bind everything together after doing that with every var, but that is boring. Can you find a better way?
Thanks ;)
Call table function with factor levels which are present in the entire dataset.
sapply(df,function(x) table(factor(x, levels = 0:3)))
# var1 var2 var3 var4
#0 1 4 1 0
#1 1 0 1 2
#2 1 0 1 2
#3 1 0 1 0
If you don't know beforehand what levels your data can take, we can find it from data itself.
vec <- unique(unlist(df))
sapply(df, function(x) table(factor(x, levels = vec)))
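Because every column is tabulated against the same levels, sapply returns a matrix with one row per level, and that matrix converts cleanly to a data.frame. A small sketch using the data from the question (the name counts_df is made up):

```r
df <- data.frame(row.names = c('ID1', 'ID2', 'ID3', 'ID4'),
                 var1 = c(0, 1, 2, 3),
                 var2 = c(0, 0, 0, 0),
                 var3 = c(1, 2, 3, 0),
                 var4 = c('1', '1', '2', '2'))

# Shared levels give every table the same length, so sapply can simplify
counts_df <- as.data.frame(sapply(df, function(x) table(factor(x, levels = 0:3))))
counts_df
```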
We could do this without any loop
table(c(col(df)), unlist(df))
# 0 1 2 3
# 1 1 1 1 1
# 2 4 0 0 0
# 3 1 1 1 1
# 4 0 2 2 0

Calculate rank based on several columns, with a precedence rule [duplicate]

I have a dataframe like this
df <- expand.grid(0:1, 0:1, 0:1, 0:1)
df
Var1 Var2 Var3 Var4
1 0 0 0 0
2 1 0 0 0
3 0 1 0 0
4 1 1 0 0
5 0 0 1 0
6 1 0 1 0
7 0 1 1 0
8 1 1 1 0
9 0 0 0 1
10 1 0 0 1
11 0 1 0 1
12 1 1 0 1
13 0 0 1 1
14 1 0 1 1
15 0 1 1 1
16 1 1 1 1
I am trying to create a Rank column based on some conditions on Var1, Var2, Var3, Var4
The order of ranking precedence is determined by the variables
Column Var1 has the highest preference and if it has a value of 1, then it is given a higher rank
Column Var2 has a higher preference over Var3, Var4
Columns Var1 and Var2 have higher preference over Var3, Var4
There is NO preference between Var3 and Var4; they are only used as counts for ranking
If any rows have the same counts for Var3, Var4, then they are ranked with the same number.
My desired output is
Var1 Var2 Var3 Var4 rank
1 0 0 0 0 12
2 1 0 0 0 6
3 0 1 0 0 9
4 1 1 0 0 3
5 0 0 1 0 11
6 1 0 1 0 5
7 0 1 1 0 8
8 1 1 1 0 2
9 0 0 0 1 11
10 1 0 0 1 5
11 0 1 0 1 8
12 1 1 0 1 2
13 0 0 1 1 10
14 1 0 1 1 4
15 0 1 1 1 7
16 1 1 1 1 1
I am trying to do this manually but it is not very efficient
df %>%
mutate(rank = case_when(
Var1 == 1 & Var2 == 1 & Var3 == 1 & Var4 == 1~ "1",
Var1 == 1 & Var2 == 1 & Var3 == 1 & Var4 == 0~ "2",
TRUE ~ ""
))
I want to apply the logic to a larger dataset. Is there an efficient way to do this? Can someone point me in the right direction?
frank and frankv in data.table "accepts vectors, lists, data.frames or data.tables as input", which can be useful here.
First, frankv. It has a cols argument where columns to be ranked can be specified in a character vector - convenient if there are many column names which need to be generated programmatically. It also has a neat order argument.
library(data.table)
setDT(df)
df[ , Var34 := Var3 + Var4]
cols = c("Var1", "Var2", "Var34")
df[ , r := frankv(.SD, cols, order = -1L, ties.method = "dense")]
df[ , Var34 := NULL]
# Var1 Var2 Var3 Var4 r
# 1: 0 0 0 0 12
# 2: 1 0 0 0 6
# 3: 0 1 0 0 9
# 4: 1 1 0 0 3
# 5: 0 0 1 0 11
# 6: 1 0 1 0 5
# 7: 0 1 1 0 8
# 8: 1 1 1 0 2
# 9: 0 0 0 1 11
# 10: 1 0 0 1 5
# 11: 0 1 0 1 8
# 12: 1 1 0 1 2
# 13: 0 0 1 1 10
# 14: 1 0 1 1 4
# 15: 0 1 1 1 7
# 16: 1 1 1 1 1
frank is handy for interactive use (this assumes the helper column Var34 is still present, i.e. it is run before Var34 is removed):
df[ , r := frank(.SD, -Var1, -Var2, -Var34, ties.method = "dense")]
Related answers: How to emulate SQLs rank functions in R?; Rank based on several variables
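I propose this, which is a small trick: weight the columns by precedence so that Var1 dominates Var2, and Var2 dominates the count of Var3 and Var4.
df <- expand.grid(0:1, 0:1, 0:1, 0:1)
df[,1] <- df[,1] * 100 # Var1 has the highest precedence
df[,2] <- df[,2] * 10 # Var2 comes next
# Var3 and Var4 keep weight 1, so only their count matters
score <- rowSums(df)
# negate so that the highest score gets rank 1; ties share a rank
as.numeric(as.factor(-score))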
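For comparison, here is a base R sketch of the ranking rule from the question that needs neither data.table nor numeric weights. The helper dense_rank_desc is a hypothetical name introduced here; it sorts the keys in descending order of precedence and gives tied rows the same rank.

```r
# Hypothetical helper: dense rank over several keys, highest values first.
# Keys are given in order of precedence.
dense_rank_desc <- function(...) {
  keys <- data.frame(...)
  # Order rows by each key descending (negate numeric keys)
  o <- do.call(order, unname(lapply(keys, function(k) -k)))
  sorted <- keys[o, , drop = FALSE]
  # A new rank starts whenever the key combination changes
  r <- integer(nrow(keys))
  r[o] <- cumsum(!duplicated(sorted))
  r
}

df <- expand.grid(Var1 = 0:1, Var2 = 0:1, Var3 = 0:1, Var4 = 0:1)
# Var1 first, then Var2, then the count of Var3 and Var4
df$rank <- dense_rank_desc(df$Var1, df$Var2, df$Var3 + df$Var4)
df
```

On the 16-row grid from the question this reproduces the desired rank column, with 12 distinct ranks.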

How to create new variables using loop and ifelse statement

I have a number of variables with the same name but different suffixes. For example, var1, var2, var3, var4, var5, var6, ... and so on. Each of these variables has a random sequence of 0, 1, and 2. Using these variables, I am trying to create a new variable called testvariable. If any of the already existing variables has 1, I will set testvariable to 1. If they have 0 or 2, I will assign 0.
Is there a simple loop and/or ifelse statement I can use to create this variable? My real data is a lot more complex than this, so I don't want to copy and paste each individual variable and values.
Edit: This is for R.
If I understand you correctly, if any of the variables like var1, var2, .., etc has a value 1, then the testvariable must be 1 else 0, then do:
Sample df:
var1 var2 var3 var4 var5
1 1 1 1 1 1
2 2 2 2 2 2
3 0 1 0 1 0
4 0 0 0 0 1
5 1 1 2 1 2
6 2 2 2 2 2
7 2 2 1 2 2
8 1 1 2 1 1
9 0 0 0 0 0
Code:
df$testvariable <- ifelse(rowSums(df[, grepl("var", names(df))] == 1) > 0, 1, 0)
Output:
var1 var2 var3 var4 var5 testvariable
1 1 1 1 1 1 1
2 2 2 2 2 2 0
3 0 1 0 1 0 1
4 0 0 0 0 1 1
5 1 1 2 1 2 1
6 2 2 2 2 2 0
7 2 2 1 2 2 1
8 1 1 2 1 1 1
9 0 0 0 0 0 0
I would use mget for this. mget gives the variables as a list. Then you can check for each element of the list using sapply and combine the results using any. At the end I took advantage of your encoding of 0 and 1, but you can also use an if statement.
var1 <- c(0,0,1,2)
var2 <- c(2,2,2,2)
var3 <- c(0,2,0,2)
var4 <- c(0,2,2,2)
any(sapply(mget(paste0("var", 1:4)), function(x) 1 %in% x)) * 1
#> [1] 1
any(sapply(mget(paste0("var", 2:4)), function(x) 1 %in% x)) * 1
#> [1] 0
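Note that the any(...) calls above return a single value for the whole set of variables. If one value per observation is wanted instead, a possible sketch is to combine element-wise comparisons with Reduce. The data frame obs, and the assumption that the columns are named var followed by digits, are illustrative only.

```r
# Hypothetical toy data, one row per observation
obs <- data.frame(var1 = c(0, 0, 1, 2),
                  var2 = c(2, 2, 2, 2),
                  var3 = c(0, 2, 0, 2),
                  var4 = c(0, 2, 2, 1))

# Select the columns by name pattern instead of typing them out
vars <- grep("^var[0-9]+$", names(obs), value = TRUE)

# For each column test "== 1", then OR the results together row by row
obs$testvariable <- as.integer(Reduce(`|`, lapply(obs[vars], function(x) x == 1)))
obs
```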

Resources