Finding "complete" groups in R

I have a large dataset of groups and subgroups. I want to filter the data according to the "completeness" of the groups, i.e. within each group, all levels of the sub-groups (a and b) should occur.
A small example:
group <- rep(c("A", "B", "C"), each=5)
a <- c(1,1,2,2,3,1,1,1,3,3,1,2,2,3,3)
b <- c("a", "a", "a", "b", "c", "a", "a", "a", "b", "c", "a", "b", "b", "b", "b")
df <- data.frame(group, a, b)
   group a b
1      A 1 a
2      A 1 a
3      A 2 a
4      A 2 b
5      A 3 c
6      B 1 a
7      B 1 a
8      B 1 a
9      B 3 b
10     B 3 c
11     C 1 a
12     C 2 b
13     C 2 b
14     C 3 b
15     C 3 b
So here only A would be considered complete because all levels of a and b occur. Is there an efficient (and flexible) way to filter with those conditions?

I would do something like this:
# note: `a` and `b` here refer to the full vectors still in the workspace
sapply(split(df, df$group), function(x) all(a %in% x$a) & all(b %in% x$b))
## A B C
## TRUE FALSE FALSE
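Since the result is a named logical vector, it can be used directly to keep only the complete groups; a small sketch building on the above (the `complete` name is just illustrative):
complete <- sapply(split(df, df$group), function(x) all(a %in% x$a) & all(b %in% x$b))
df[df$group %in% names(complete)[complete], ]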

Here is a dplyr solution:
library(dplyr)
df %>%
  group_by(group) %>%
  mutate(
    a_complete = all(unique(df$a) %in% a),
    b_complete = all(unique(df$b) %in% b)
  ) %>%
  filter(a_complete, b_complete) %>%
  select(-ends_with("complete"))
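A variant of the same idea that avoids reaching back into the global `df` from inside the pipe: precompute the full level sets once, then filter per group (a sketch; `a_levels`/`b_levels` are illustrative names):
a_levels <- unique(df$a)
b_levels <- unique(df$b)
df %>%
  group_by(group) %>%
  filter(all(a_levels %in% a), all(b_levels %in% b)) %>%
  ungroup()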

I would try data.table, something like
library(data.table)
setDT(df)[, indx := length(unique(a)) + length(unique(b))]
df[, indx2 := length(unique(a)) + length(unique(b)), by = group]
df[indx == indx2]
# group a b indx indx2
# 1: A 1 a 6 6
# 2: A 1 a 6 6
# 3: A 2 a 6 6
# 4: A 2 b 6 6
# 5: A 3 c 6 6
Or for a more general solution, you can specify the column names and then use .SDcols, something like
cols <- c("a", "b")
setDT(df)[, indx := Reduce(sum, lapply(.SD, function(x) length(unique(x)))), .SDcols = cols]
df[, indx2 := Reduce(sum, lapply(.SD, function(x) length(unique(x)))), .SDcols = cols, by = group]
df[indx == indx2]
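On reasonably recent versions of data.table, the helper uniqueN() does the length(unique(x)) counting for you, so the same logic can be written a bit more compactly (a sketch):
cols <- c("a", "b")
df[, indx := sum(sapply(.SD, uniqueN)), .SDcols = cols]
df[, indx2 := sum(sapply(.SD, uniqueN)), .SDcols = cols, by = group]
df[indx == indx2]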

Related

How to table two variables which contain listed observations?

Two of the variables in the df I am working with may contain multiple values per observation. I want to table the frequencies of these variables, but can't use table() on type 'list'... I've created a sample df below:
library(dplyr)

col_a <- c("a", "b", "c", "a,b", "b,c")
col_b <- c("c", "b", "a", "a,a", "a,c")
df <- data.frame(col_a, col_b)
df <- df %>%
  mutate(col_a = strsplit(col_a, ","),
         col_b = strsplit(col_b, ","))
This outputs:
col_a col_b
1 a c
2 b b
3 c a
4 c("a", "b") c("a", "a")
5 c("b", "c") c("a", "c")
Now, table(df$col_a, df$col_b) returns Error in order(y) : unimplemented type 'list' in 'orderVector1'. In order to table the variables, I want to unlist the concatenated observations so that it looks like this:
col_a col_b
1 a c
2 b b
3 c a
4 a a
5 a a
6 b a
7 b a
8 b a
9 b c
10 c a
11 c c
Any ideas on how to accomplish this?
We may use separate_rows on the original data (before the strsplit step):
library(tidyr)
library(dplyr)
df %>%
  separate_rows(col_a) %>%
  separate_rows(col_b)
Output:
# A tibble: 11 × 2
col_a col_b
<chr> <chr>
1 a c
2 b b
3 c a
4 a a
5 a a
6 b a
7 b a
8 b a
9 b c
10 c a
11 c c
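With the rows separated, the original goal of tabulating the two variables works; a short sketch (assuming `df` still holds the original comma-separated strings):
df_long <- df %>%
  separate_rows(col_a) %>%
  separate_rows(col_b)
with(df_long, table(col_a, col_b))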

Reorder one row in tibble - move it to the last row

How do I rearrange the rows in a tibble?
I wish to reorder the rows so that the row with x == "c" goes to the bottom of the tibble and everything else stays the same.
library(dplyr)
tbl <- tibble(x = c("a", "b", "c", "d", "e", "f", "g", "h"),
              y = 1:8)
An alternative to dplyr::arrange(), using base R (FALSE sorts before TRUE, and the sort is stable, so all other rows keep their relative order):
tbl[order(tbl$x == "c"), ] # Thanks to Merijn van Tilborg
Output:
# x y
# <chr> <int>
# 1 a 1
# 2 b 2
# 3 d 4
# 4 e 5
# 5 f 6
# 6 g 7
# 7 h 8
# 8 c 3
tbl |> dplyr::arrange(x == "c")
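Because arrange() performs a stable sort, the same trick extends to sending several values to the bottom at once, e.g. (illustrative):
tbl |> dplyr::arrange(x %in% c("c", "e"))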
Using forcats, convert x to a factor with "c" as the last level, then arrange. Because fct_relevel() is applied only inside arrange(), this doesn't change the class of the column x.
library(forcats)
tbl %>%
  arrange(fct_relevel(x, "c", after = Inf))
# # A tibble: 8 x 2
# x y
# <chr> <int>
# 1 a 1
# 2 b 2
# 3 d 4
# 4 e 5
# 5 f 6
# 6 g 7
# 7 h 8
# 8 c 3
If the order of x matters, it is better to keep x as a factor. The code below changes the class from character to factor, with "c" as the last level:
tbl %>%
  mutate(x = fct_relevel(x, "c", after = Inf)) %>%
  arrange(x)
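A quick check (not part of the original answer) that the releveling worked; `tbl2` is an illustrative name:
tbl2 <- tbl %>% mutate(x = fct_relevel(x, "c", after = Inf))
levels(tbl2$x)  # "c" should now be the last level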

Sort data.table by grouping variable with condition

I have the following data.table (in reality mine is much bigger, with more groups and more variables):
library(data.table)
Data <- data.table(Group = rep(c("a", "b"), each = 3),
                   Var = 1:6)
> print(Data)
Group Var
1: a 1
2: a 2
3: a 3
4: b 4
5: b 5
6: b 6
Now I want to sort the data.table, but only the rows where Group == "a".
My poor attempt was the following:
> Data[Group == "a", .SD[.N:1]]
Group Var
1: a 3
2: a 2
3: a 1
I know why this is wrong, but I cannot think of a solution that leads to my desired output:
Group Var
1: a 3
2: a 2
3: a 1
4: b 4
5: b 5
6: b 6
Your attempt would work with
library(data.table)
Data[Group == "a"] <- Data[Group == "a", .SD[.N:1]]
Data
# Group Var
#1: a 3
#2: a 2
#3: a 1
#4: b 4
#5: b 5
#6: b 6
But the above just reverses the rows. If you want to sort the rows in decreasing order based on Var, you could do
Data[Group == "a"] <- Data[Group == "a", .SD[order(-Var)]]
If you want to do this for multiple groups, you can do
grps <- c("a", "b")
Data[Group %in% grps] <- Data[Group %in% grps, .SD[order(-Var)], Group]
A better approach with data.table is to update by reference, as suggested by @markus (note the by = Group, so the sort happens within each group rather than across all selected rows):
Data[Group %in% grps, Var := .SD[order(-Var)]$Var, by = Group]
Without using the .SD notation:
> Data[Group == "a", Var := sort(Var, decreasing = TRUE)]
> Data
Group Var
1: a 3
2: a 2
3: a 1
4: b 4
5: b 5
6: b 6
As you have a large data.table and more groups, you may wish to consider the .I notation (see the article by @nathaneastwood), as it can give better performance in some situations. Here, .I identifies the row numbers of interest. Let us change the example so that we are interested in two groups:
Data <- data.table(Group = rep(c("a", "b", "c"), each = 3), Var = 10:18)
Then:
> Data[Data[, .I[Group %in% c("a", "c")]], Var := sort(Var, decreasing = TRUE), by = Group]
> Data
Group Var
1: a 12
2: a 11
3: a 10
4: b 13
5: b 14
6: b 15
7: c 18
8: c 17
9: c 16
For completeness, the basic idea is contained in:
Data[Group %in% c("a", "c"), Var:= sort(Var, decreasing = TRUE), by = Group]

R - subset rows by rows in another data frame

Let's say I have a data frame df containing only factors/categorical variables. I have another data frame conditions where each row contains a different combination of the different factor levels of some subset of variables in df (made using expand.grid and levels etc.). I'm trying to figure out a way of subsetting df based on each row of conditions. So for example, if the column names of conditions are c("A", "B", "C") and the first row is c('a1', 'b1', 'c1'), then I want df[df$A == 'a1' & df$B == 'b1' & df$C == 'c1',], and so on.
I'd think this is a great time to use merge (or dplyr::*_join or ...):
df1 <- expand.grid(A = letters[1:4], B = LETTERS[1:4], stringsAsFactors = FALSE)
df1$rn <- seq_len(nrow(df1))
# 'df2' contains the conditions we want to filter (retain)
df2 <- data.frame(
  a1 = c('a', 'a', 'c'),
  b1 = c('B', 'C', 'C'),
  stringsAsFactors = FALSE
)
df1
# A B rn
# 1 a A 1
# 2 b A 2
# 3 c A 3
# 4 d A 4
# 5 a B 5
# 6 b B 6
# 7 c B 7
# 8 d B 8
# 9 a C 9
# 10 b C 10
# 11 c C 11
# 12 d C 12
# 13 a D 13
# 14 b D 14
# 15 c D 15
# 16 d D 16
df2
# a1 b1
# 1 a B
# 2 a C
# 3 c C
Using df2 to define which combinations we need to keep,
merge(df1, df2, by.x=c('A','B'), by.y=c('a1','b1'))
# A B rn
# 1 a B 5
# 2 a C 9
# 3 c C 11
# or
dplyr::inner_join(df1, df2, by=c(A='a1', B='b1'))
(I defined df2 with different column names just to show how it works, but in reality since its purpose is "solely" to be declarative on which combinations to filter, it would make sense to me to have the same column names, in which case the by= argument just gets simpler.)
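As a side note: when the goal is purely to filter df1 rather than to combine columns, dplyr::semi_join() may be a cleaner fit, since it keeps the matching rows of df1 without attaching anything from df2 (a sketch):
dplyr::semi_join(df1, df2, by = c(A = 'a1', B = 'b1'))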
One option is to create the condition with Reduce, matching df against one row of conditions at a time (the first row here):
df[Reduce(`&`, Map(`==`, df[c("A", "B", "C")], conditions[1, c("A", "B", "C")])), ]
Or another option is rowSums:
df[rowSums(df[c("A", "B", "C")] ==
           conditions[1, c("A", "B", "C")][col(df[c("A", "B", "C")])]) == 3, ]
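To run this against every row of conditions instead of just the first, one option is to loop over the rows (a sketch using the question's names; `subsets` is illustrative):
subsets <- lapply(seq_len(nrow(conditions)), function(i) {
  df[Reduce(`&`, Map(`==`, df[names(conditions)], conditions[i, ])), ]
})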

Number of non-NA records by column, grouped

I have a data.table that looks something like this:
> dt <- data.table(
    group1 = c("a", "a", "a", "b", "b", "b", "b"),
    group2 = c("x", "x", "y", "y", "z", "z", "z"),
    data1 = c(NA, rep(T, 3), rep(F, 2), "sometimes"),
    data2 = c("sometimes", rep(F, 3), rep(T, 2), NA))
> dt
group1 group2 data1 data2
1: a x NA sometimes
2: a x TRUE FALSE
3: a y TRUE FALSE
4: b y TRUE FALSE
5: b z FALSE TRUE
6: b z FALSE TRUE
7: b z sometimes NA
My goal is to find the number of non-NA records in each data column, grouped by group1 and group2.
group1 group2 data1 data2
1: a x 1 2
2: a y 1 1
3: b y 1 1
4: b z 3 2
I have this code left over from dealing with another part of the dataset, which had no NAs and was logical:
dt[
  ,
  lapply(.SD, sum),
  by = list(group1, group2),
  .SDcols = c("data3", "data4")
]
But it won't work with NA values or with non-logical columns.
dt[, lapply(.SD, function(x) sum(!is.na(x))), by = .(group1, group2)]
# group1 group2 data1 data2
#1: a x 1 2
#2: a y 1 1
#3: b y 1 1
#4: b z 3 2
Another alternative is to melt/dcast in order to avoid the column-by-column operation. melt with na.rm = TRUE drops the NAs, and dcast then defaults to the length function:
dcast(melt(dt, id = c("group1", "group2"), na.rm = TRUE), group1 + group2 ~ variable)
# Aggregate function missing, defaulting to 'length'
# group1 group2 data1 data2
# 1: a x 1 2
# 2: a y 1 1
# 3: b y 1 1
# 4: b z 3 2
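Passing the aggregate explicitly gives the same result and silences the "Aggregate function missing" message:
dcast(melt(dt, id = c("group1", "group2"), na.rm = TRUE),
      group1 + group2 ~ variable, fun.aggregate = length)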
Using dplyr (with some help from David Arenburg & eddi):
library(dplyr)
dt %>% group_by(group1, group2) %>% summarise_each(funs(sum(!is.na(.))))
Source: local data table [4 x 4]
Groups: group1
group1 group2 data1 data2
1 a x 1 2
2 a y 1 1
3 b y 1 1
4 b z 3 2
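summarise_each() and funs() have since been deprecated in dplyr; on current versions the same computation is written with across() (a sketch):
dt %>%
  group_by(group1, group2) %>%
  summarise(across(starts_with("data"), ~ sum(!is.na(.x))), .groups = "drop")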
