Why do I get duplicated data.table rows after aggregation?

I aggregated a data.table by a column and set that column as the key, then was surprised to find that the table still contained duplicated rows.
What is the reason for this?
My table was special in that it had two columns with exactly the same values (I had to keep both for a practical reason), and I aggregated the table by one of them.
A simple example:
> library(data.table)
> dat = data.table(
+ class1 = c('a', 'a', 'b'),
+ class2 = c('a', 'a', 'b'),
+ value = 1:3)
> aggr = dat[, list(class2, sum(value)), keyby = class1]
> stopifnot(!any(duplicated(aggr)))
Error: !any(duplicated(aggr)) is not TRUE

If you use an aggregation function for all columns, you get the expected result without duplicated rows:
> library(data.table)
> dat = data.table(
+ class1 = c('a', 'a', 'b'),
+ class2 = c('a', 'a', 'b'),
+ value = 1:3)
> aggr = dat[, list(class2[[1]], sum(value)), keyby = class1]
> stopifnot(!any(duplicated(aggr)))
Note that the difference is that I take only the first element of the class2 column; any other function that returns a single value per group works as well. In the first version, class2 contributed two elements for group 'a', so data.table emitted one (identical) row per element, which is where the duplicates came from.
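For completeness, here is a sketch of the same fix with the result columns named explicitly (reusing dat from the example above; the column names are only illustrative):
aggr = dat[, list(class2 = class2[1], value = sum(value)), keyby = class1]
# one row per key, so no duplicated rows
stopifnot(nrow(aggr) == uniqueN(dat$class1))
stopifnot(!any(duplicated(aggr)))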

How to retain class of variable in `tapply`?

Suppose my data frame is set up like so:
X <- data.frame(
  id = c('A', 'A', 'B', 'B'),
  dt = as.Date(c('2020-01-01', '2020-01-02', '2021-01-01', '2021-01-02'))
)
and I want to populate a variable of the id-specific minimum value of date dt
Doing: X$dtmin <- with(X, tapply(dt, id, min)[id]) gives a numeric because simplify=T in tapply has cast the value to numeric. Why has it done this? Setting simplify=F returns a list in which each element has the desired data structure, but assigning that to the variable in my data frame X casts the values back to numeric. Calling as.Date(<output>, origin='1970-01-01') seems needlessly verbose. How can I retain the data structure of dt?
We may use
X$dtmin <- with(X, do.call("c", tapply(dt, id, min, simplify = FALSE)[id]))
Or use dplyr
library(dplyr)
X %>%
  mutate(dtmin = min(dt), .by = "id")
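If you prefer to stay in base R, a sketch with ave() should also keep the Date class, because ave() writes the per-group minima back into the original Date vector:
# per-id minimum of dt, returned as a Date vector of the same length as X
X$dtmin <- with(X, ave(dt, id, FUN = min))
class(X$dtmin)  # expected: "Date"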

assertr: Automatically verify assumptions on columns

I want to automatically verify assumptions on columns of my tibble using assertr. The problem is that I have hundreds of columns so I can't apply them column by column. Data looks like this:
df <- tibble(
  x = c(1, 0, 1, 1, 2),
  y = c('A', 'B', 'C', 'D', 'A'),
  z = c(1/3, 4, 5/7, 100, 3))
Then I have another tibble which describes types of columns:
df_map <- tibble(
  col = c('x', 'y', 'z'),
  col_type = c('POSSIBLE VALUES 0 AND 1', 'LESS THAN 1 MISSING VALUE', 'POSITIVE VALUE')
)
This can be written like:
df %>%
  assert(in_set(0, 1), x) %>%
  assert_rows(num_row_NAs, within_bounds(0, 1), y) %>%
  verify(z > 0)
My question is how to apply these verifications (or any other) with this mapping so I don't need to write them for every single column.
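One possible sketch (not a fixed assertr recipe): loop over df_map and dispatch on col_type, passing each column name dynamically with all_of(); the three labels below and the all_of() selection are assumptions based on the example mapping.
library(dplyr)
library(assertr)
checked <- df
for (i in seq_len(nrow(df_map))) {
  col  <- df_map$col[i]
  type <- df_map$col_type[i]
  checked <- switch(type,
    'POSSIBLE VALUES 0 AND 1'   = assert(checked, in_set(0, 1), all_of(col)),
    'LESS THAN 1 MISSING VALUE' = assert_rows(checked, num_row_NAs,
                                              within_bounds(0, 1), all_of(col)),
    'POSITIVE VALUE'            = assert(checked, within_bounds(0, Inf,
                                                  include.lower = FALSE), all_of(col)),
    checked  # unknown label: skip this column
  )
}
# Note: on the example data the first check fails (x contains a 2),
# which is exactly the kind of violation the assertion is meant to catch.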

Why isn't a data.table sorted properly after calling setDT?

When a data.table is turned into a data frame and then back into a data.table, it may keep the sorted attribute even though it is no longer sorted (see the example below). This leads to incorrect results when merging data.tables, and possibly to undetected bugs.
Is this the expected behavior? What is the best way to turn a data.frame into a sorted data.table and verify that it is indeed sorted?
library(data.table)
library(dplyr)
a <- data.table(id = c('a', 'B', 'c'), value = c(1,2,3))
b <- data.table(id = c('a', 'B', 'c'))
setkey(a,id)
a_sum <- a %>%
  group_by(id) %>%
  summarize_at(vars(value), sum)
setDT(a_sum, key = "id")
a_sum_nokey = setkey(copy(a_sum), NULL)
merged_key_fails = merge(a_sum, b, by="id")
merged_no_key_works = merge(a_sum_nokey, b, by="id")
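One hedged workaround (assuming a stale 'sorted' attribute survives the round trip and is then trusted) is to drop the key explicitly, rebuild it so data.table re-sorts the rows itself, and check the ordering before merging:
setDT(a_sum)
setkey(a_sum, NULL)   # drop any key / 'sorted' attribute carried over
setkey(a_sum, id)     # physically re-sorts the rows in data.table's C-locale order
# sanity check: the key column is now in data.table's own sort order
stopifnot(identical(a_sum$id, a_sum[order(id), id]))
merged <- merge(a_sum, b, by = "id")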

Dynamically passing a variable to the `i` expression in data.table

SO #24833247 covers nearly all the use cases for passing column names dynamically to a data.table within a function. However, it misses one that I'm currently trying to address: passing variables to the i expression.
I'm trying to refactor some data-cleansing code into a function that converts certain values to NA after I've pulled the data into a data.table.
For example, given the following:
dt <- data.table(colA = c('A', 'b', '~', 'd', ''), colB = c('', '?', 'a1', 'a2', 'z4'))
dt[colA %in% c('~', ''), colA := NA]
dt[colB %in% c('~', ''), colB := NA]
I want a generic function that replaces the '~', '?' and '' values with NA, instead of having to explicitly code each transformation.
dt <- data.table(colA = c('A', 'b', '~', 'd', ''), colB = c('', '?', 'a1', 'a2', 'z4'))
clearCol(dt, colA)
clearCol(dt, colB)
The j expression is straightforward:
clearCol <- function(dt, f) {
  f = substitute(f)
  dt[, (f) := NA]
}
clearCol(data.table(colA = c('A', 'b', '~', 'd', '')), colA)[]
   colA
1:   NA
2:   NA
3:   NA
4:   NA
5:   NA
However, extending it to add the variable to the i expression fails:
clearCol <- function(dt, f) {
  f = substitute(f)
  dt[(f) %in% c('~', ''), (f) := NA]
}
clearCol(data.table(colA = c('A', 'b', '~', 'd', '')), colA)[]
Error in match(x, table, nomatch = 0L) : 'match' requires vector arguments
Swapping to this seems to work, but the lack of output with verbose = TRUE (compared to the hard-coded method at the top) leaves me concerned that it will not scale well on the large data sets I'm working with:
clearCol <- function(dt, f) {
  f = deparse(substitute(f))
  dt[get(f) %in% c('~', ''), (f) := NA]
}
clearCol(data.table(colA = c('A', 'b', '~', 'd', '')), colA)[]
colA
1: A
2: b
3: NA
4: d
5: NA
Is there another way of doing what I want?
You can follow FAQ 1.6 to get the verbose output:
cc = function(d, col, vs = c("~", ""), verb = FALSE){
  col = substitute(col)
  ix = substitute(col %in% vs)
  d[eval(ix), as.character(col) := NA, verbose = verb][]
}
dt <- data.table(colA = c('A', 'b', '~', 'd', ''), colB = c('', '?', 'a1', 'a2', 'z4'))
cc(dt, colA, verb = TRUE)
which gives
Creating new index 'colA'
Starting bmerge ...done in 0 secs
Detected that j uses these columns: <none>
Assigning to 2 row subset of 5 rows
Dropping index 'colA' due to update on 'colA' (column 1)
colA colB
1: A
2: b ?
3: NA a1
4: d a2
5: NA z4
However, notice what the verbose output is saying here. It's creating an index (assuming you didn't do something to create it already, which seems likely since the data was only just read in)... and then it's removing that index (since it is invalidated by the edit to the column). That hardly sounds like something that would do much to contribute to efficiency.
If you really want to do this efficiently, there are a couple of options:
1. Use na.strings when reading the data in.
2. Use set if you have a ton of columns and somehow can't do #1 (see the sketch below).
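A rough sketch of option 2, assuming the same placeholder strings as above; set() updates columns by reference and does not create or drop indexes, so it scales better across many columns:
dt <- data.table(colA = c('A', 'b', '~', 'd', ''), colB = c('', '?', 'a1', 'a2', 'z4'))
na_strings <- c('~', '', '?')
for (j in names(dt)) {
  hits <- which(dt[[j]] %in% na_strings)
  if (length(hits)) set(dt, i = hits, j = j, value = NA)
}
dt[]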

Grouping and summing up dummy variables from caret in R

I have data like this
dataset = data.frame(id = c(1,2,1,4,5,6), class = c('a', 'a', 'b', 'a', 'b', 'b') )
I want to convert it into dummy variables, but caret's dummyVars doesn't collapse by id; it returns the same number of rows as the input. How do I group it so that id 1 has both the a and b variables set to 1?
dummies <- caret::dummyVars(id ~ . , data = dataset)
predict(dummies, newdata = dataset)
In this case use the dcast function from data.table:
library(data.table)
setDT(dataset)
dataset[,dummy:=1]
d2 = dcast(dataset,id~class,value.var = 'dummy',fun.aggregate = length)
d2[is.na(d2)] = 0
Note that this solution returns the number of a's and b's found for each id. If you need only 1 or 0, change fun.aggregate to, for example:
fun.aggregate = function(x) as.integer(length(x) >0)
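Continuing from the code above, the 0/1 variant would look roughly like this (expected output shown as comments):
d2 = dcast(dataset, id ~ class, value.var = 'dummy',
           fun.aggregate = function(x) as.integer(length(x) > 0))
d2
#    id a b
# 1:  1 1 1
# 2:  2 1 0
# 3:  4 1 0
# 4:  5 0 1
# 5:  6 0 1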
dummyVars works row-wise, so the value of id doesn't matter to it.
Aggregate the predicted result instead. If you store the output of predict in a variable named dummies2:
aggregate(. ~ id, data=dummies2, FUN=sum)
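Putting that together (a sketch; it assumes the id column is bound back onto the predicted dummies, since predict() on a dummyVars object returns only the dummy columns):
library(caret)
dummies <- dummyVars(id ~ ., data = dataset)
# predict() drops id, so add it back before aggregating
dummies2 <- data.frame(id = dataset$id, predict(dummies, newdata = dataset))
# sum the dummies per id; the comparison turns counts into strict 0/1 flags
aggregate(. ~ id, data = dummies2, FUN = function(x) as.integer(sum(x) > 0))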

Resources