Why isn't a data.table sorted properly after calling setDT?

When a data.table is turned into a data frame and then back into a data.table, it may keep the sorted attribute even though it is no longer sorted (see the example below). This leads to incorrect results when merging data.tables, and possibly to undetected bugs.
Is this the expected behavior? What is the best way to turn a data.frame into a sorted data.table and verify that it is indeed sorted?
library(data.table)
library(dplyr)
a <- data.table(id = c('a', 'B', 'c'), value = c(1,2,3))
b <- data.table(id = c('a', 'B', 'c'))
setkey(a,id)
a_sum <- a %>%
  group_by(id) %>%
  summarize_at(vars(value), sum)
setDT(a_sum, key = "id")
a_sum_nokey = setkey(copy(a_sum), NULL)
merged_key_fails = merge(a_sum, b, by="id")
merged_no_key_works = merge(a_sum_nokey, b, by="id")
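One way to check whether a key column really is in the order data.table expects is sketched below. It assumes that data.table orders character keys in C-locale (so "B" sorts before "a") and relies on base R's radix sort, which also collates in C-locale. The helper name is_c_sorted is hypothetical, and the check assumes the column has no NA values.

```r
# Hypothetical helper: TRUE if a character vector is already in C-locale order.
# sort(method = "radix") collates in the C locale regardless of the session locale.
# Assumes x contains no NA values (sort() drops them by default).
is_c_sorted <- function(x) {
  identical(x, sort(x, method = "radix"))
}

ids_locale <- c("a", "B", "c")   # order typically produced by a locale-aware sort
ids_c      <- c("B", "a", "c")   # C-locale order: uppercase before lowercase

is_c_sorted(ids_locale)  # FALSE
is_c_sorted(ids_c)       # TRUE
```

Applied to the example, is_c_sorted(a_sum$id) returning FALSE while key(a_sum) still reports "id" would indicate a stale key. Dropping the key with setkey(a_sum, NULL) and then setting it again should force a real sort.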

Related

How to retain class of variable in `tapply`?

Suppose my data frame is set up like so:
X <- data.frame(
  id = c('A', 'A', 'B', 'B'),
  dt = as.Date(c('2020-01-01', '2020-01-02', '2021-01-01', '2021-01-02'))
)
and I want to add a variable holding the id-specific minimum of the date dt.
Doing X$dtmin <- with(X, tapply(dt, id, min)[id]) gives a numeric, because the default simplify = TRUE in tapply has cast the value to numeric. Why has it done this? Setting simplify = FALSE returns a list in which each element has the desired class, but assigning that list to a column of X casts the values back to numeric. Calling as.Date(<output>, origin = '1970-01-01') seems needlessly verbose. How can I retain the class of dt?
We may use
X$dtmin <- with(X, do.call("c", tapply(dt, id, min, simplify = FALSE)[id]))
Or use dplyr
library(dplyr)
X %>%
  mutate(dtmin = min(dt), .by = "id")
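The loss of class can be reproduced without tapply at all. The sketch below assumes that tapply's simplification flattens the per-group results much as unlist() does, which drops the Date class, whereas combining the elements with c() keeps it. The name per_id_min is introduced only for illustration.

```r
X <- data.frame(
  id = c("A", "A", "B", "B"),
  dt = as.Date(c("2020-01-01", "2020-01-02", "2021-01-01", "2021-01-02"))
)

# Flattening a list of Dates with unlist() drops the class attribute
class(unlist(list(as.Date("2020-01-01"))))   # "numeric"

# Combining the same kind of list with c() dispatches to the Date method
per_id_min <- tapply(X$dt, X$id, min, simplify = FALSE)
X$dtmin <- unname(do.call("c", per_id_min[X$id]))

class(X$dtmin)   # "Date"
```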

How to use ifelse to replace the values in a column with values in another column in a dataframe in R?

For example, I merged two data frames using full_join() from dplyr as follows:
df_1 <- data.frame(id = c(1,2,3,4,5), x = c('a', 'b', 'c', 'd', 'e'))
df_2 <- data.frame(id = c(2,4,5,6,7,8), y = c('f', 'g', 'h', 'i', 'j', 'k'))
df <- full_join(df_2, df_1, by = 'id')
I want to use ifelse() to do the following:
For each row, check whether there is a missing value in the x column.
If yes, put "NO" into the y column.
If no, put the value of x into the y column.
I tried this code:
df$y <- ifelse(is.null(x), "NO", x)
But the result was not what I wanted.
What did I do wrong? Could you provide some suggestions on fixing the code?
Thank you a lot.
The following will do what you want:
df$y <- ifelse(is.na(df$x), "NO", df$x)
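The problem is twofold. is.null() tests whether the whole object is NULL and returns a single TRUE or FALSE, whereas is.na() tests each element for a missing value. The column also needs to be referenced as df$x, because a bare x is not looked up inside df.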
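The difference between the two tests is easy to see on a small vector. This is a minimal sketch with made-up values rather than the joined data frame from the question:

```r
x <- c("a", NA, "c")

is.null(x)   # FALSE: a single value for the whole object
is.na(x)     # FALSE TRUE FALSE: one value per element

# ifelse() needs the element-wise test to choose a value for each row
y <- ifelse(is.na(x), "NO", x)
y            # "a" "NO" "c"
```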

dplyr join two tables within a function where one variable name is an argument to the function

I am trying to join two tables using dplyr within a function, where one of the variable names is defined by an argument to the function. In other dplyr functions, there is usually a version available for non-standard evaluation, e.g. select & select_, rename and rename_, etc, but not for the _join family. I found this answer, but I cannot get it to work in my code below:
df1 <- data.frame(gender = rep(c('M', 'F'), 5), var1 = letters[1:10])
new_join <- function(df, sexvar){
  df2 <- data.frame(sex = rep(c('M', 'F'), 10), var2 = letters[20:1])
  # initial attempt using usual dplyr behaviour:
  # left_join(df, df2, by = c(sexvar = 'sex'))
  # attempt using NSE:
  # left_join(df, df2,
  #           by = c(eval(substitute(var), list(var = as.name(sexvar)))) = 'sex'))
  # attempt using setNames:
  # left_join(df, df2, by = setNames(sexvar, 'sex'))
}
new_join(df1, 'gender')
The first and second attempt give the error
Error: 'sexvar' column not found in rhs, cannot join
while the last attempt gives the error
Error: 'gender' column not found in lhs, cannot join,
which at least shows it knows I want the column gender, but somehow doesn't see it as a column heading.
Can anyone point out where I am going wrong?
Try:
library(dplyr)
df1 <- data.frame(gender = rep(c('M', 'F'), 5), var1 = letters[1:10])
new_join <- function(df, sexvar){
  df2 <- data.frame(sex = rep(c('M', 'F'), 10), var2 = letters[20:1])
  # named character vector: the name is the column in df, the value is the column in df2
  join_vars <- c('sex')
  names(join_vars) <- sexvar
  left_join(df, df2, by = join_vars)
}
new_join(df1, 'gender')
I'm sure there's a more elegant way of getting this to work using lazy evaluation, etc., but this should get you up-and-running in the meantime.
A one-liner in your block can look like this (which is similar to your last attempt, with the arguments the other way round):
left_join(df, df2, by = structure("sex", names = sexvar))
It is also possible to extend this to two variables, assuming sexvarDF1 names the column in df and sexvarDF2 the column in df2:
left_join(df, df2, by = structure(sexvarDF2, names = sexvarDF1))
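The approaches above all build the same thing: a character vector whose name is the column in the left table and whose value is the column in the right table. The sketch below uses only base R to show that the different spellings are equivalent; the names by_a, by_b and by_c are introduced for illustration.

```r
sexvar <- "gender"   # column name in the left-hand table, supplied at run time

by_a <- structure("sex", names = sexvar)
by_b <- setNames("sex", sexvar)   # setNames(object, nm): value first, name second
by_c <- c(gender = "sex")         # the hard-coded equivalent

identical(by_a, by_b)   # TRUE
identical(by_a, by_c)   # TRUE
names(by_a)             # "gender"

# The setNames attempt in the question had the arguments the other way round,
# which names the vector "sex" and stores "gender" as its value
by_reversed <- setNames(sexvar, "sex")
names(by_reversed)      # "sex"
```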

R: Match a dataframe with 3 others, and create a column

I have a big data frame (df) with a variable id, and 3 other data frames (df1, df2, df3) that contain some of these ids. For example, the big data frame has ids 1:100, while df1 might have 1, 2, 4, 11, etc.
What I need to do is add a column to the big data frame that says which of the smaller data frames each row came from.
df$new[df$id %in% df1$id] <- 1
df$new[df$id %in% df2$id] <- 2
df$new[df$id %in% df3$id] <- 3
df$new<- factor(df$new, labels = c('a', 'b', 'c'))
This is my solution, but I don't really like it. Any other ideas?
We can use a nested ifelse
with(df, ifelse(id %in% df1$id, 'a',
         ifelse(id %in% df2$id, 'b',
         ifelse(id %in% df3$id, 'c', id))))
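A self-contained sketch of the nested ifelse() on made-up data follows. The ids are invented for illustration, and NA is used for ids found in none of the smaller data frames, which is an assumption (the answer above falls back to id instead).

```r
df  <- data.frame(id = 1:6)
df1 <- data.frame(id = c(1, 2))
df2 <- data.frame(id = c(2, 3, 4))   # id 2 deliberately overlaps with df1
df3 <- data.frame(id = 5)

# The first matching data frame wins; unmatched ids become NA
df$new <- with(df, ifelse(id %in% df1$id, "a",
                   ifelse(id %in% df2$id, "b",
                   ifelse(id %in% df3$id, "c", NA))))

df$new   # "a" "a" "b" "b" "c" NA
```

With overlapping ids the two approaches differ: the nested ifelse() keeps the first match (id 2 is labelled "a"), whereas the sequential assignments in the question let the last match overwrite earlier ones.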

Why do I get duplicated data.table rows after aggregation?

I aggregated a data.table by a column, and set that as a key, then was surprised to find that the table still contained duplicated rows.
What is the reason for this?
My table was special in that I had two columns with exactly the same values (but had to keep both for a practical reason), and I aggregated the table by one of those.
A simple example:
> library(data.table)
> dat = data.table(
+ class1 = c('a', 'a', 'b'),
+ class2 = c('a', 'a', 'b'),
+ value = 1:3)
> aggr = dat[, list(class2, sum(value)), keyby = class1]
> stopifnot(!any(duplicated(aggr)))
Error: !any(duplicated(aggr)) is not TRUE
If you use an aggregation function for all columns, then you get the expected result, without duplicated rows:
> library(data.table)
> dat = data.table(
+ class1 = c('a', 'a', 'b'),
+ class2 = c('a', 'a', 'b'),
+ value = 1:3)
> aggr = dat[, list(class2[[1]], sum(value)), keyby = class1]
> stopifnot(!any(duplicated(aggr)))
The only difference is that I take the first element of the class2 column. Any other function that returns a single value per group works as well.
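When the second column is known to carry the same values as the grouping column, another option is to group by both columns, so that class2 is kept without being returned once per input row. This is a sketch of that alternative rather than the approach described above; the column name total is introduced for illustration.

```r
library(data.table)

dat <- data.table(
  class1 = c("a", "a", "b"),
  class2 = c("a", "a", "b"),
  value  = 1:3
)

# Group (and key) by both columns; each group then yields exactly one row
aggr <- dat[, .(total = sum(value)), keyby = .(class1, class2)]

nrow(aggr)              # 2
any(duplicated(aggr))   # FALSE
```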
