Grouping factor levels in a data.table - r

I'm trying to combine factor levels in a data.table & wondering if there's a data.table-y way to do so.
Example:
DT = data.table(id = 1:20, ind = as.factor(sample(8, 20, replace = TRUE)))
I want to say types 1,3,8 are in group A; 2 and 4 are in group B; and 5,6,7 are in group C.
Here's what I've been doing, which has been quite slow in the full version of the problem:
DT[ind %in% c(1, 3, 8), grp := as.factor("A")]
DT[ind %in% c(2, 4), grp := as.factor("B")]
DT[ind %in% c(5, 6, 7), grp := as.factor("C")]
Another approach, suggested by this related question, would I guess translate like so:
DT[ , grp := ind]
levels(DT$grp) = c("A", "B", "A", "B", "C", "C", "C", "A")
Or perhaps (given I've got 65 underlying groups and 18 aggregated groups, this feels a little neater)
DT[ , grp := ind]
lev <- letters[1:8]
lev[c(1, 3, 8)] <- "A"
lev[c(2, 4)] <- "B"
lev[5:7] <- "C"
levels(DT$grp) <- lev
Both of these seem unwieldy; does this seem like the appropriate way to do this in data.table?
For reference, I timed a beefed-up version of this with 10,000,000 observations and some more subgroup/supergroup levels. My original approach is the slowest (running all those logical checks is costly), the second is the fastest, and the third is a close second. Of the latter two, I prefer the third for readability.
(Keying DT before searching speeds things up, but it only halves the gap vis-a-vis the latter two methods)

Update:
I recently learned of a much simpler way to re-associate factor levels from this question and a closer reading of ?levels. No merges, correspondence table, etc. necessary, just pass a named list to levels:
levels(DT$ind) = list(A = c(1, 3, 8), B = c(2, 4), C = 5:7)
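A minimal sketch of the named-list approach, assuming you would rather write the result to a new column than overwrite ind. The name grouping and the seed are illustrative only; levels of ind that are not listed in grouping would become NA.

```r
library(data.table)

set.seed(42)  # arbitrary seed, only for reproducibility of the toy data
DT <- data.table(id = 1:20,
                 ind = factor(sample(8, 20, replace = TRUE), levels = 1:8))

# hypothetical mapping: new level name = old levels it absorbs
grouping <- list(A = c(1, 3, 8), B = c(2, 4), C = 5:7)

# copy ind, relabel the copy's levels, and assign it by reference
DT[, grp := {g <- ind; levels(g) <- grouping; g}]
```

The original ind column is left untouched, so the mapping can be checked with DT[, .N, by = .(ind, grp)].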
Original Answer:
As suggested by @Arun, we have the option of creating the correspondence as a separate data.table, then joining it to the original:
match_dt = data.table(ind = as.factor(1:12),
grp = as.factor(c("A", "B", "A", "B", "C", "C",
"C", "A", "D", "E", "F", "D")))
setkey(DT, ind)
setkey(match_dt, ind)
DT = match_dt[DT]
We can also do this in (what I consider to be) the more readable fashion like so (with marginal speed costs):
levels <- letters[1:12]
levels[c(1, 3, 8)] <- "A"
levels[c(2, 4)] <- "B"
levels[5:7] <- "C"
levels[c(9, 12)] <- "D"
levels[10] <- "E"
levels[11] <- "F"
match_dt <- data.table(ind = as.factor(1:12),
grp = as.factor(levels))
setkey(DT, ind)
setkey(match_dt, ind)
DT = match_dt[DT]
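The same lookup can also be written as an update join, which adds grp to DT by reference without setting keys or reordering rows. This is a sketch under the assumption that a reasonably recent data.table (with the on= argument) is available; the toy data and seed are made up for illustration.

```r
library(data.table)

set.seed(1)  # arbitrary seed for the toy data
DT <- data.table(id = 1:20,
                 ind = factor(sample(12, 20, replace = TRUE), levels = 1:12))

match_dt <- data.table(ind = factor(1:12),
                       grp = factor(c("A", "B", "A", "B", "C", "C",
                                      "C", "A", "D", "E", "F", "D")))

# for every row of DT, look up its ind in match_dt and copy grp across
DT[match_dt, on = "ind", grp := i.grp]
```

Because the assignment happens in place, DT keeps its original row order and no second copy of the table is created.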

Related

How to combine multiple vectors such that elements of each vector are distributed as equally as possible?

Let's say I have two or more vectors with two or more elements (single factor) each, e.g.
v1 = c("a", "a", "a")
v2 = c("b", "b")
What I want to do is to merge all vectors and distribute the elements for each group as equally as possible.
For the simple example above there would be a single solution:
c("a", "b", "a", "b", "a")
If v1 = c("a", "a", "a", "a") any of these
c("a", "b", "a", "b", "a", "a")
c("a", "b", "a", "a", "b", "a")
c("a", "a", "b", "a", "b", "a")
would be the best solution. Is there a built-in function that can do this? Any ideas how to implement it?
This would work for two vectors.
v1 = c("a", "a", "a")
v2 = c("b", "b")
distribute_equally <- function(v1, v2) {
v3 <- c(v1, v2)
tab <- sort(table(v3))
c(rep(names(tab), min(tab)), rep(names(tab)[2], diff(range(tab))))
}
distribute_equally(v1, v2)
#[1] "b" "a" "b" "a" "a"
distribute_equally(c('a', 'a'), c('b', 'b'))
#[1] "a" "b" "a" "b"
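The function above only handles two vectors. One possible generalization, which is not taken from any of the answers here, is to give the k-th of n elements of each group the fractional position (k - 0.5) / n and then sort all elements by that position. A minimal sketch, with spread_evenly as a hypothetical name:

```r
spread_evenly <- function(...) {
  v <- c(...)
  n <- ave(seq_along(v), v, FUN = length)     # size of each element's group
  k <- ave(seq_along(v), v, FUN = seq_along)  # rank of each element within its group
  v[order((k - 0.5) / n)]                     # sort by evenly spaced target position
}

spread_evenly(c("a", "a", "a"), c("b", "b"))
spread_evenly(c("a", "a", "a", "a"), c("b", "b"))
```

Ties in the target position are broken by the order in which the vectors were passed, so the result is deterministic but not necessarily the only reasonable arrangement.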
Thinking of the problem in terms of experimental design optimization, we can get a general solution using the MaxProQQ function in the MaxPro package.
Each position in the merged vector can be thought of as coming from a discrete quantitative factor, and the factors from your v1, v2, etc. can be thought of as qualitative factors. Here's some example code (MaxProQQ takes integer factors instead of characters, but you can convert it afterward):
library(MaxPro)
set.seed(1)
v1 <- rep(1, sample.int(10, 1))
v2 <- rep(2, sample.int(10, 1))
v3 <- rep(3, sample.int(10, 1))
v4 <- rep(4, sample.int(10, 1))
vComb <- c(v1, v2, v3, v4)
vMerge1234 <- MaxProQQ(cbind(1:length(vComb), sample(vComb, length(vComb))), p_nom = 1)$Design
vMerge1234 <- vMerge1234[order(vMerge1234[,1]),][,2]
> vMerge1234
[1] 4 3 4 2 4 3 4 1 2 4 3 4 2 4 3 1 4 3 2 4 1 3 4
Generate 100 samples, say, without replacement from c(v1, v2) giving m which is 5x100 with one column per sample. Then find the column for which the sum of the variances of the frequencies over each group is minimized. If there are more than two vectors just concatenate them in the line marked ## and the rest of the code stays the same.
set.seed(123)
v1 = c("a", "a", "a")
v2 = c("b", "b")
v <- c(v1, v2) ##
m <- replicate(100, sample(v))
varsum <- apply(m, 2, function(x) {
f <- factor(x, levels = unique(v))
sum(tapply(f, v, function(x) var(table(x))))
})
m[, which.min(varsum)]
## [1] "a" "a" "b" "b" "a"

R add all combinations of three values of a vector to a three-dimensional array

I have a data frame with two columns. The first one "V1" indicates the objects on which the different items of the second column "V2" are found, e.g.:
V1 <- c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C")
V2 <- c("a","b","c","d","a","c","d","a","b","d","e")
df <- data.frame(V1, V2)
"A" for example contains "a", "b", "c", and "d". What I am looking for is a three dimensional array with dimensions of length(unique(V2)) (and the names "a" to "e" as dimnames).
For each unique value of V1 I want all possible combinations of three V2 items (e.g. for "A" these would include c("a", "b", "c"), c("a", "b", "d"), and c("b", "c", "d")).
Each of these "three-item-co-occurrences" should be regarded as a coordinate in the three-dimensional array and therefore be added to the frequency count which the values in the array should display. The outcome should be the following array
ar <- array(data = c(0,0,0,0,0,0,0,1,2,1,0,1,0,2,0,0,2,2,0,1,0,1,0,1,0,
0,0,1,2,1,0,0,0,0,0,1,0,0,1,0,2,0,1,0,1,1,0,0,1,0,
0,1,0,2,0,1,0,0,1,0,0,0,0,0,0,2,1,0,0,0,0,0,0,0,0,
0,2,2,0,1,2,0,1,0,1,2,1,0,0,0,0,0,0,0,0,1,1,0,0,0,
0,1,0,1,0,1,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0),
dim = c(5, 5, 5),
dimnames = list(c("a", "b", "c", "d", "e"),
c("a", "b", "c", "d", "e"),
c("a", "b", "c", "d", "e")))
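To make the three-item combinations concrete, here is a small base R sketch that lists them per object using combn. The name combos is illustrative; this only enumerates the combinations and does not build the array.

```r
V1 <- c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C")
V2 <- c("a", "b", "c", "d", "a", "c", "d", "a", "b", "d", "e")

# one matrix per object, one row per unordered three-item combination
combos <- lapply(split(V2, V1), function(x) t(combn(x, 3)))

combos[["A"]]
```

Objects with four items yield four combinations each, and "B", with exactly three items, yields one.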
I was wondering about the 3D symmetry of your result. It took me a while to understand that you want to have all permutations of all combinations.
library(gtools) #for the permutations
foo <- function(x) {
#all combinations:
combs <- combn(x, 3, simplify = FALSE)
#all permutations for each of the combinations:
combs <- do.call(rbind, lapply(combs, permutations, n = 3, r = 3))
#tabulate:
do.call(table, lapply(asplit(combs, 2), factor, levels = letters[1:5]))
}
#apply grouped by V1, then sum the results
res <- Reduce("+", tapply(df$V2, df$V1, foo))
#check
all((res - ar)^2 == 0)
#[1] TRUE
I used to use the crossjoin CJ() to retain the pairwise count of all combinations of two different V2 items
res <- setDT(df)[,CJ(unique(V2), unique(V2)), V1][V1!=V2,
.N, .(V1,V2)][order(V1,V2)]
This code creates a data frame res with three columns. V1 and V2 contain the respective items of V2 from the original data frame df, and N contains the count (how many times V1 and V2 appear with the same value of V1 from the original data frame df).
Now, I found that I could perform this crossjoin with three 'dimensions' as well by just adding another unique(V2) and adapting the rest of the code accordingly.
The result is a data frame with four columns. V1, V2, and V3 indicate the original V2 items and N again shows the number of mutual appearances with the same original V1 objects.
res <- setDT(df)[,CJ(unique(V2), unique(V2), unique(V2)), V1][V1!=V2 & V1 != V3 & V2 != V3,
.N, .(V1,V2,V3)][order(V1,V2,V3)]
The advantage of this code is that empty combinations (those which do not appear at all) are not stored. It worked with 1,000,000 unique values in V1 and over 600 unique items in V2, which would otherwise have required an extremely large 600 x 600 x 600 array.

Check which rows in a data.table are identical

I need a solution that shows me which rows are identical but I can't find a clever solution (a solution without a bunch of complex loops). I would prefer a data.table solution.
What I want to have is a list with line numbers that have the identical entries.
An example:
library(data.table)
Data <- data.table(A = c("a", "a", "c"),
B = c("A", "A", "B"))
The first and the second line are identical.
My desired output:
[[1]]
[1] 1 2
[[2]]
[1] 3
Here is something quick and dirty:
Data[, .(.I, .GRP), by = .(A, B)][, list(split(I, GRP))]$V1
Could be simplified to:
Data[, .(list(.I)), by = .(A, B)]$V1
That was my solution until sindri_baldur came up with a better solution:
library(purrr)  # provides map() and re-exports the %>% pipe used below
Data.unique <- unique(Data)
Data.unique[, G := .I]
Data[, I := .I]
Data.full <-
merge(Data,
Data.unique,
by = c("A", "B"))
Data.full %>%
split(by = "G") %>%
map(~ .x[, I])
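For reference, the grouped .I idea can be applied to all columns at once by grouping on names(Data), so the column names do not have to be spelled out. A minimal sketch on the example data; the column name rows and the variable groups are illustrative.

```r
library(data.table)

Data <- data.table(A = c("a", "a", "c"),
                   B = c("A", "A", "B"))

# group by every column; each group's row numbers become one list element
groups <- Data[, .(rows = list(.I)), by = names(Data)]$rows
groups
```

Each list element holds the row numbers of one set of identical rows, in order of first appearance.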

coercing data frame rows to matrix in R

I'm unsure of better terminology for my question, so forgive me for the long winded approach.
I'm trying to use two identifying variables, id and duration to fill up the rows of a matrix where the columns denote half hour periods (so there should be 6 for a 3 hour period) and the rows are a given person's activities in those time periods. If the activities do not fill up the matrix, a dummy variable should be used instead. I've written an example below which should help clarify.
Example:
data has 3 columns, id, activity, and duration. id and duration should serve as identifying variables and activity should serve as the variable in the matrix.
data <- data.frame(id = c(1, 1, 1, 2, 2, 3, 3, 3),
activity = c("a", "b", "c", "d", "e", "b", "b", "a"),
duration = c(60, 30, 90, 45, 30, 15, 60, 100))
For the example, I used a 3-hour duration, hence the 6 columns in the matrix. The matrix below is the wanted output. DUMMY appears wherever the total duration of a person's activities is shorter than the duration covered by the matrix. In this example, the total duration is 180 minutes (3 hours * 60), so person 2, whose activity durations sum to 75 (45 + 30), gets the DUMMY variable once the first 75 minutes of activities are done.
mat <- t(matrix(c("a", "a", "b", "c", "c", "c",
"d", "d", "e", "DUMMY", "DUMMY", "DUMMY",
"b", "b", "b", "a", "a", "a"),
nrow = 6, ncol = 3))
colnames(mat) <- c("0", "30", "60", "90", "120", "150")
I'm unsure how to fill the matrix mat above with the data above. Any help would be appreciated. Please let me know if the question needs to be made clearer.
EDIT: edited output
EDIT2: Added matrix column names
EDIT3: Added info on dummy variable
EDIT4: Would it be easier if I added start and end time instead of duration?
An approach would be to locate the activities for every 30-min interval by "id":
ints = seq(0, by = 30, length.out = 6)
data2 = do.call(rbind,
lapply(split(data, data$id),
function(d) {
dur = d$duration
i = findInterval(ints, c(cumsum(c(0, dur[-length(dur)])), sum(dur)))
data.frame(id = d$id[1], ints = ints, activity = d$activity[i])
}))
And on the new "data.frame":
tapply(as.character(data2$activity), data2[c("id", "ints")], identity)
# ints
#id 0 30 60 90 120 150
# 1 "a" "a" "b" "c" "c" "c"
# 2 "d" "d" "e" NA NA NA
# 3 "b" "b" "b" "a" "a" "a"
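The output above leaves NA where the question asked for DUMMY. A compact sketch of the same findInterval idea that builds the character matrix directly and fills the gaps; the names rows, starts and act are illustrative, and the example assumes activities are listed in chronological order per id.

```r
data <- data.frame(id = c(1, 1, 1, 2, 2, 3, 3, 3),
                   activity = c("a", "b", "c", "d", "e", "b", "b", "a"),
                   duration = c(60, 30, 90, 45, 30, 15, 60, 100))

ints <- seq(0, by = 30, length.out = 6)  # start of each half-hour slot

rows <- lapply(split(data, data$id), function(d) {
  starts <- cumsum(c(0, d$duration))     # start of each activity, then total end
  act <- as.character(d$activity)[findInterval(ints, starts)]
  ifelse(is.na(act), "DUMMY", act)       # slots past the last activity
})

mat <- do.call(rbind, rows)
colnames(mat) <- ints
mat
```

Each slot is assigned the activity in progress at the slot's start time, which matches the approach in the answer above.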

Preserving zero length groups with aggregate

I just noticed that aggregate drops empty groups from the result. How can I solve this? E.g.
xx <- c("a", "b", "d", "a", "d", "a")
xx <- factor(xx, levels = c("a", "b", "c", "d"))
y <- rnorm(60, 5, 1)
z <- matrix(y, 6, 10)
aggregate(z, by = list(groups = xx), sum)
xx is a factor variable with 4 levels, but the result has just 3 rows, and I would like a row for the "c" level with zeros. I would like the same behavior as table(xx), which gives frequencies even for levels with no observations.
We can create another data.frame with just the levels of 'xx' and then merge with the aggregate. The output will have all the 'groups' while the row corresponding to the missing level for the other columns will be NA.
merge(data.frame(groups=levels(xx)),
aggregate(z, by = list(groups = xx), sum), all.x=TRUE)
Another option might be to convert to 'long' format with melt and then use dcast with fun.aggregate as 'sum' and drop=FALSE
library(data.table)
dcast(melt(data.table(groups=xx, z), id.var='groups'),
groups~variable, value.var='value', sum, drop=FALSE)
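Since the question asks for zeros rather than NA in the empty group, the merge result can be post-processed. A minimal sketch building on the merge approach above; res and num are illustrative names, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed for the toy data
xx <- factor(c("a", "b", "d", "a", "d", "a"), levels = c("a", "b", "c", "d"))
z <- matrix(rnorm(60, 5, 1), 6, 10)

res <- merge(data.frame(groups = levels(xx)),
             aggregate(z, by = list(groups = xx), sum),
             all.x = TRUE)

# replace NA with 0 in the numeric columns only, leaving groups untouched
num <- vapply(res, is.numeric, logical(1))
res[num][is.na(res[num])] <- 0
res
```

Restricting the replacement to numeric columns avoids touching the grouping column, whatever type it ends up with after the merge.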
