I have the following data.table (in practice my data.table is much bigger, with more groups and more variables):
Data <- data.table(Group = rep(c("a", "b"), each = 3),
                   Var = 1:6)
> print(Data)
Group Var
1: a 1
2: a 2
3: a 3
4: b 4
5: b 5
6: b 6
Now I want to sort the rows in reverse order, but only where Group == "a".
My poor attempt was the following:
> Data[Group == "a", .SD[.N:1]]
Group Var
1: a 3
2: a 2
3: a 1
I know why this is wrong, but I cannot think of a solution that leads to my desired output:
Group Var
1: a 3
2: a 2
3: a 1
4: b 4
5: b 5
6: b 6
Your attempt would work with
library(data.table)
Data[Group == "a"] <- Data[Group == "a", .SD[.N:1]]
Data
# Group Var
#1: a 3
#2: a 2
#3: a 1
#4: b 4
#5: b 5
#6: b 6
But the above just reverses the rows. If you want to sort the rows in decreasing order based on Var, you could do
Data[Group == "a"] <- Data[Group == "a", .SD[order(-Var)]]
If you have multiple groups you can do
groups <- c("a", "b")
Data[Group %in% groups] <- Data[Group %in% groups, .SD[order(-Var)], by = Group]
A better approach with data.table is to update by reference, as suggested by @markus:
Data[Group %in% groups, Var := .SD[order(-Var)]$Var, by = Group]
Without using the .SD notation:
> Data[Group == "a", Var := sort(Var, decreasing = TRUE)]
> Data
Group Var
1: a 3
2: a 2
3: a 1
4: b 4
5: b 5
6: b 6
As you have a large data.table with many groups, you may wish to consider the .I notation (see the article by @nathaneastwood), as it can give better performance in some situations. Here, .I identifies the row numbers of interest. Let us change the example so that we are interested in two groups:
Data <- data.table(Group = rep(c("a", "b", "c"), each = 3), Var = 10:18)
Then:
> Data[Data[, .I[Group %in% c("a", "c")]], Var := sort(Var, decreasing = TRUE), by = Group]
> Data
Group Var
1: a 12
2: a 11
3: a 10
4: b 13
5: b 14
6: b 15
7: c 18
8: c 17
9: c 16
For completeness, the basic idea is contained in:
Data[Group %in% c("a", "c"), Var := sort(Var, decreasing = TRUE), by = Group]
I am struggling again to understand how the mult argument works when performing an update-on-join.
What I am trying to do is to implement a left join as defined in lj.
For performance reasons I'd like to update the left table.
The non-trivial part is that when the left table and the right table have a column in common (not counting the join columns), I'd like to use the first value in the right table to override the value in the left table.
I thought mult would help me deal with this multiple-match issue, but I cannot get it right:
library(data.table)
X <- data.table(x = c("a", "a", "b", "c", "d"), y = c(0, 1, 1, 2, 2), t = 0:4)
X
# x y t
# <char> <num> <int>
#1: a 0 0
#2: a 1 1
#3: b 1 2
#4: c 2 3
#5: d 2 4
Y <- data.table(xx = c("f", "b", "c", "c", "e", "a"), y = c(2, NA, 3, 4, 5, 6), u = 2:7)
Y
# xx y u
# <char> <num> <int>
#1: f 2 2
#2: b NA 3
#3: c 3 4
#4: c 4 5
#5: e 5 6
#6: a 6 7
# Expected result
# x y t
# <char> <num> <int>
#1: a 6 0 <= single match on xx == "a" so Y[xx == "a", y] is used
#2: a 6 1 <= single match on xx == "a" so Y[xx == "a", y] is used
#3: b NA 2 <= single match on xx == "b" so Y[xx == "b", y] is used
#4: c 3 3 <= mult match on xx == "c" so Y[xx == "c", y[1L]] is used
#5: d 2 4 <= no xx == "d" in Y so nothing changes
copy(X)[Y, y := i.y, by = .EACHI, on = c(x = "xx"), mult = "first"][]
# x y t
# <char> <num> <int>
#1: a 6 0
#2: a 1 1 <= a should always have the same value ie 6
#3: b NA 2
#4: c 4 3 <= y == 4 is not the first value of y in the Y table
#5: d 2 4
# Using mult = "all" is the closest I get from the right result
copy(X)[Y, y := i.y, by = .EACHI, on = c(x = "xx"), mult = "all"][]
# x y t
# <char> <num> <int>
#1: a 6 0
#2: a 6 1
#3: b NA 2
#4: c 4 3 <= y == 4 is not the first value of y in the Y table
#5: d 2 4
Can someone explain to me what's wrong in the above?
I guess I could use Y[X, ...] to get what I want; the issue is that X is very large, and the performance is much worse using Y[X, ...].
I'd like to use the first value in the right table to override the value of the left table
Select the first values and update with them alone:
X[unique(Y, by="xx", fromLast=FALSE), on=.(x=xx), y := i.y]
x y t
1: a 6 0
2: a 6 1
3: b NA 2
4: c 3 3
5: d 2 4
fromLast= can select the first or last row when dropping dupes.
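For example, to take the last value per xx instead (a variation on the above, not what was asked for):
# a sketch: keeping the LAST y per xx when dropping dupes
copy(X)[unique(Y, by = "xx", fromLast = TRUE), on = .(x = xx), y := i.y][]
# the x == "c" row would now get y = 4 (the last "c" value in Y) instead of 3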
How multiple matches are handled:
In x[i, mult=], if a row of i has multiple matches, mult determines which matching row(s) of x are selected. This explains the results shown in the OP.
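A quick sketch with the tables above, using a plain (non-update) join:
# Y[xx == "a"] matches two rows of X (t == 0 and t == 1)
X[Y[xx == "a"], on = .(x = xx), mult = "first"]  # returns only X's first matching row
X[Y[xx == "a"], on = .(x = xx), mult = "all"]    # the default; returns both matching rows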
In x[i, v := i.v], if multiple rows of i match to the same row in x, all of the relevant i-rows write to the x-row sequentially, so the last i-row gets the final write. Turn on verbose output to see how many edits are made in an update -- it will exceed the number of x rows in this case (because the rows are edited repeatedly):
options(datatable.verbose=TRUE)
data.table(a=1,b=2)[.(a=1, b=3:4), on=.(a), b := i.b][]
# Assigning to 2 row subset of 1 rows
a b
1: 1 4
mult is effectively always "last" in the case of an update-on-join with :=.
I recall this being described somewhere in the documentation.
Let's say I have a data frame df containing only factors/categorical variables. I have another data frame conditions where each row contains a different combination of the different factor levels of some subset of variables in df (made using expand.grid and levels etc.). I'm trying to figure out a way of subsetting df based on each row of conditions. So for example, if the column names of conditions are c("A", "B", "C") and the first row is c('a1', 'b1', 'c1'), then I want df[df$A == 'a1' & df$B == 'b1' & df$C == 'c1',], and so on.
I'd think this is a great time to use merge (or dplyr::*_join or ...):
df1 <- expand.grid(A = letters[1:4], B = LETTERS[1:4], stringsAsFactors = FALSE)
df1$rn <- seq_len(nrow(df1))
# 'df2' contains the conditions we want to filter (retain)
df2 <- data.frame(
a1 = c('a', 'a', 'c'),
b1 = c('B', 'C', 'C'),
stringsAsFactors = FALSE
)
df1
# A B rn
# 1 a A 1
# 2 b A 2
# 3 c A 3
# 4 d A 4
# 5 a B 5
# 6 b B 6
# 7 c B 7
# 8 d B 8
# 9 a C 9
# 10 b C 10
# 11 c C 11
# 12 d C 12
# 13 a D 13
# 14 b D 14
# 15 c D 15
# 16 d D 16
df2
# a1 b1
# 1 a B
# 2 a C
# 3 c C
Using df2 to define which combinations we need to keep,
merge(df1, df2, by.x=c('A','B'), by.y=c('a1','b1'))
# A B rn
# 1 a B 5
# 2 a C 9
# 3 c C 11
# or
dplyr::inner_join(df1, df2, by=c(A='a1', B='b1'))
(I defined df2 with different column names just to show how it works, but in reality since its purpose is "solely" to be declarative on which combinations to filter, it would make sense to me to have the same column names, in which case the by= argument just gets simpler.)
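For illustration, a sketch where df2 is renamed to share df1's column names (df2_same is just an illustrative name):
df2_same <- setNames(df2, c("A", "B"))
merge(df1, df2_same)              # by= defaults to the shared column names
dplyr::inner_join(df1, df2_same)  # by= is inferred from the shared names, with a message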
One option is to create the condition with Reduce:
df[Reduce(`&`, Map(`==`, df[c("A", "B", "C")], conditions[1, c("A", "B", "C")])), ]
Or another option is rowSums, where == 3 requires all three columns to match:
df[rowSums(df[c("A", "B", "C")] ==
           conditions[1, c("A", "B", "C")][col(df[c("A", "B", "C")])]) == 3, ]
My question is as follows:
I would like to generate column d based on the information in column c. Column c provides the name of the column from which to fetch the value for that given row.
a b c d
1 5 3 a 5
2 8 6 b 6
3 12 8 a 12
My current method is very inefficient:
DT[, d := mget(c)]               # d becomes a list column holding the full column named in c
e <- numeric(nrow(DT))           # initialise the result vector
for (i in 1:nrow(DT)) { e[i] <- DT[, d][[i]][i] }
DT[, e := e]
I would greatly appreciate a one-liner solution.
You can group by the values in column c, and use get() to get the values.
dt[, d := get(c), by = c]
which gives
dt
# a b c d
# 1: 5 3 a 5
# 2: 8 6 b 6
# 3: 12 8 a 12
Data:
dt <- data.table(a = c(5, 8, 12), b = c(3, 6, 8), c = c("a", "b", "a"))
You actually don't even need the grouping if you don't want it:
DT$d <- sapply(1:nrow(DT), function(i) DT[i, get(as.character(DT[i, c]))])
> DT
a b c d
1: 5 3 a 5
2: 8 6 b 6
3: 12 8 a 12
This solution is also more flexible in that it allows c to refer to any column in the data.
data
DT <- structure(list(a = c(5L, 8L, 12L), b = c(3L, 6L, 8L),
                     c = structure(c(1L, 2L, 1L), .Label = c("a", "b"), class = "factor")),
                .Names = c("a", "b", "c"), class = c("data.table", "data.frame"),
                row.names = c(NA, -3L))
setDT(DT)  # dput's .internal.selfref pointer is not portable; setDT() rebuilds it
Your data:
a <- c(5,8,12)
b <- c(3,6,8)
c <- c("a", "b", "a")
df <- as.data.frame(cbind(a, b, c))  # note: cbind() coerces everything to a character matrix first
This is how you could do it.
d <- NULL
for (i in 1:NROW(df)) { d <- c(d, as.character(df[i, as.character(c[i])])) }
df$d <- d
# a b c d
#1 5 3 a 5
#2 8 6 b 6
#3 12 8 a 12
This allows you to do the same thing as above in the for loop using just 1 line of code (similar to MikeyMike's answer).
df$d <- sapply(1:NROW(df), function(i) as.character(df[i, as.character(c[i])]))
You can use an ifelse statement:
dt[, d := ifelse(c == "a", a, b)]
dt
# a b c d
# 1: 5 3 a 5
# 2: 8 6 b 6
# 3: 12 8 a 12
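If c can name more than two columns, nested ifelse() calls get unwieldy; data.table's fcase() (added in data.table 1.13.0) is one way to keep it flat, a sketch:
dt[, d := fcase(c == "a", a,
                c == "b", b)]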
Another option is to reshape your data, which also handles the case where c refers to many different columns:
dt[, id := seq_len(nrow(dt))        # create an id column to join back on
   ][melt(dt, id.vars = c("id", "c"))[c == variable],
     d := value, on = "id"          # reshape, keep rows where c matches the column name, join back
   ][, id := NULL]                  # drop the helper id column
dt
# a b c d
# 1: 5 3 a 5
# 2: 8 6 b 6
# 3: 12 8 a 12
I have a data.table that looks something like this:
> dt <- data.table(
group1 = c("a", "a", "a", "b", "b", "b", "b"),
group2 = c("x", "x", "y", "y", "z", "z", "z"),
data1 = c(NA, rep(T, 3), rep(F, 2), "sometimes"),
data2 = c("sometimes", rep(F,3), rep(T,2), NA))
> dt
group1 group2 data1 data2
1: a x NA sometimes
2: a x TRUE FALSE
3: a y TRUE FALSE
4: b y TRUE FALSE
5: b z FALSE TRUE
6: b z FALSE TRUE
7: b z sometimes NA
My goal is to find the number of non-NA records in each data column, grouped by group1 and group2.
group1 group2 data1 data2
1: a x 1 2
2: a y 1 1
3: b y 1 1
4: b z 3 2
I have this code left over from dealing with another part of the dataset, which had no NAs and was logical:
dt[, lapply(.SD, sum), by = list(group1, group2), .SDcols = c("data3", "data4")]
But it won't work with NA values, or non-logical values.
dt[, lapply(.SD, function(x) sum(!is.na(x))), by = .(group1, group2)]
# group1 group2 data1 data2
#1: a x 1 2
#2: a y 1 1
#3: b y 1 1
#4: b z 3 2
Another alternative is melt/dcast, which avoids the per-column operation. With na.rm = TRUE the NAs are dropped, and dcast then defaults to length as the aggregation function:
dcast(melt(dt, id = c("group1", "group2"), na.rm = TRUE), group1 + group2 ~ variable)
# Aggregate function missing, defaulting to 'length'
# group1 group2 data1 data2
# 1: a x 1 2
# 2: a y 1 1
# 3: b y 1 1
# 4: b z 3 2
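To silence the "Aggregate function missing" message, pass the function explicitly:
dcast(melt(dt, id = c("group1", "group2"), na.rm = TRUE),
      group1 + group2 ~ variable, fun.aggregate = length)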
Using dplyr (with some help from David Arenburg & eddi):
library(dplyr)
dt %>% group_by(group1, group2) %>% summarise_each(funs(sum(!is.na(.))))
Source: local data table [4 x 4]
Groups: group1
group1 group2 data1 data2
1 a x 1 2
2 a y 1 1
3 b y 1 1
4 b z 3 2
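summarise_each() and funs() have since been deprecated in dplyr; with current dplyr (>= 1.0.0) the equivalent is:
dt %>%
  group_by(group1, group2) %>%
  summarise(across(everything(), ~ sum(!is.na(.x))), .groups = "drop")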
I have a large dataset of groups and subgroups. I want to filter the data according to the "completeness" of the groups, i.e. within each group all levels of the subgroups (a and b) should occur.
A small example:
group <- rep(c("A", "B", "C"), each=5)
a <- c(1,1,2,2,3,1,1,1,3,3,1,2,2,3,3)
b <- c("a", "a", "a", "b", "c", "a", "a", "a", "b", "c", "a", "b", "b", "b", "b")
df <- data.frame(group, a, b)
group a b
1 A 1 a
2 A 1 a
3 A 2 a
4 A 2 b
5 A 3 c
6 B 1 a
7 B 1 a
8 B 1 a
9 B 3 b
10 B 3 c
11 C 1 a
12 C 2 b
13 C 2 b
14 C 3 b
15 C 3 b
So here only A would be considered complete because all levels of a and b occur. Is there an efficient (and flexible) way to filter with those conditions?
I would do something like this:
sapply(split(df, df$group), function(x) all(a %in% x$a) & all(b %in% x$b))
## A B C
## TRUE FALSE FALSE
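The result is one logical per group; to actually filter df with it, a short follow-up sketch:
keep <- sapply(split(df, df$group), function(x) all(a %in% x$a) & all(b %in% x$b))
df[df$group %in% names(keep)[keep], ]  # keeps only the rows of complete groups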
Here is a dplyr solution:
library(dplyr)
df %>%
group_by(group) %>%
mutate(
a_complete = all(unique(df$a) %in% a),
b_complete = all(unique(df$b) %in% b)
) %>%
filter(a_complete, b_complete) %>%
select(- ends_with("complete"))
I would try data.table, something like
library(data.table)
setDT(df)[, indx := length(unique(a)) + length(unique(b))]
df[, indx2 := length(unique(a)) + length(unique(b)), by = group]
df[indx == indx2]
# group a b indx indx2
# 1: A 1 a 6 6
# 2: A 1 a 6 6
# 3: A 2 a 6 6
# 4: A 2 b 6 6
# 5: A 3 c 6 6
Or for a more general solution, you can specify the column names and then use .SDcols, something like
cols <- c("a", "b")
setDT(df)[, indx := Reduce(sum, lapply(.SD, function(x) length(unique(x)))), .SDcols = cols]
df[, indx2 := Reduce(sum, lapply(.SD, function(x) length(unique(x)))), .SDcols = cols, by = group]
df[indx == indx2]
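As a side note, newer data.table versions provide uniqueN() (added in 1.9.6), which allows a more compact version of the same idea; a sketch:
df[, if (uniqueN(a) == uniqueN(df$a) && uniqueN(b) == uniqueN(df$b)) .SD, by = group]
# returns only the rows of groups in which every level of a and b occurs (here, group A)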