Take the following data table:
# IMPUTING VALUES
library(data.table)
set.seed(1337)
mydt <- data.table(Year = rep(2000:2005, each = 10),
                   Type = c("A", "B"),
                   Value = 30 + rnorm(60))
naRows <- sample(nrow(mydt), 15)
mydt[naRows, Value := NA]
setkey(mydt, Year, Type)
How would I go about imputing the NAs with the median by Year and Type? I have tried the following
# computed medians
computedMedians <- mydt[, .(Median = median(Value, na.rm = TRUE)), keyby = .(Year, Type)]
# dataset of just NA rows
dtNAs <- mydt[is.na(Value), .SD, by = .(Year, Type)]
mydt[is.na(Value),
     Imputations := dtNAs[computedMedians, nomatch = 0][, Median],
     by = .(Year, Type)]
mydt
but when you run the code, you'll see that it works except when a group is missing its data completely; in that case the computed medians get recycled. Is there a simpler way, or how would you go about fixing just that last error?
If you prefer updating the rows without copying the entire column, then:
require(data.table) # v1.9.6+
cols = c("Year", "Type")
mydt[is.na(Value), Value := mydt[.BY, median(Value, na.rm = TRUE), on = cols], by = c(cols)]
.BY is a special symbol which is a named list containing the groups. Although this requires a join with the entire data.table every time, it should be quite fast, as it's searching for only one group.
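As a quick sanity check (not part of the original answer): after this update, any rows that are still NA can only belong to groups whose Value was entirely missing.
# Rows left NA after imputation come only from all-NA groups:
mydt[is.na(Value), .N, by = .(Year, Type)]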
There's no need to make a secondary table; it can be done inside a single by-group call:
mydt[, Value := replace(Value, is.na(Value), median(Value, na.rm = TRUE)),
     by = .(Year, Type)]
This imputation doesn't guarantee that all missing values are filled (e.g., 2005-B is still NA).
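If you need every NA filled even when an entire group is missing, one option (a sketch, not from the original answer) is a second pass that falls back to the overall median:
# Second pass: groups that were entirely NA get the overall median
# computed across the whole table.
mydt[is.na(Value), Value := median(mydt$Value, na.rm = TRUE)]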
Related
As a natural consequence of the data.table subsetting logic in i, I often end up in situations where I have a variable defined for only part of an id (like "total economic crises before 2007 per country" being counted only for data before 2007, hence NA for anything later). Here is a slightly more general example:
library("data.table")
Data <- data.table(id = rep(c(1, 2, 3), each = 4),
                   variable = c(3, 3, NA, NA, NA, NA, 4, NA, NA, NA, NA, NA))
When I subsequently need this variable defined over the entire dataset, I want to fill up the NA's by group. I usually do this using max by group:
Data[, variable_full := max(variable, na.rm = TRUE), by = id]
Data[variable_full == -Inf, variable_full := NA]  # overwrite the -Inf (and warning) that max() produces for all-NA groups
But, for whatever reason, this takes a very long time on large datasets. Is there a more efficient, more data.table-like way of doing this?
edit: "large datasets" is currently 8 million observations and it stops my workflow because it takes several minutes. Other data.table operations take split seconds because data.table is amazing.
Perhaps a join?
Data[, variable_full := variable]
Data[is.na(variable),
     variable_full := Data[!is.na(variable), max(variable), by = .(id)
                           ][Data[is.na(variable), ], V1, on = .(id)]][]
A (slightly) shorter version of the join line is
Data[is.na(variable),
     variable_full := Data[!is.na(variable), max(variable), by = .(id)][.SD, V1, on = .(id)]]
Here, the Data[is.na(variable), ] part has been replaced with .SD, because .SD is already the subset defined by i at the beginning of the line.
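As an aside, a version that avoids the self-join entirely (a sketch, not from the original answer) computes the group max of the non-NA values directly inside :=, guarding against all-NA groups:
# max() of the non-missing values per group; return NA (rather than
# max()'s -Inf plus a warning) when a group has none.
Data[, variable_full := {
  v <- variable[!is.na(variable)]
  if (length(v)) max(v) else NA_real_
}, by = id]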
If you want to install the collapse package, I think this could be much faster:
library(collapse)
Data <- Data |> gby(id) |> fmutate(variable_full = fmax(variable)) |> setDT()
gby is 'group by' and fmutate is 'fast mutate'. The default output is a grouped data frame, so it needs the setDT() at the end.
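A nice side effect, based on collapse's documented handling of all-NA groups: fmax() returns NA rather than -Inf, so the -Inf clean-up step from the question shouldn't be needed. A quick check, assuming the data above:
# id 3 has no non-missing values; fmax() yields NA for it, not -Inf:
Data[id == 3, unique(variable_full)]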
This may have been asked before, and I have looked through the Reference semantics vignette, but I can't seem to find the answer. SO also suggested revising my title, so I will be fine if someone posts a link to the answer!
I have a MWE below. I am trying to sum the column val by day of the month. From my understanding, in SCENARIO 1 below, since I am not assigning the result of lapply to any new column through :=, the data.table is just printed.
However, in SCENARIO 2, when I assign new column variables by reference using :=, the new columns are created (with the correct values), but each daily value is repeated for every hour of the day, when I want just the daily values.
SCENARIO 3 also gives the desired result, but requires the creation of a new data.table.
I also wouldn't think of set(), because its value is applied row by row, and I need to aggregate certain columns by group.
Thanks for any help,
library(data.table)
library(magrittr)
set.seed(123)
# create data.table to group by
dt <- data.table(year = rep(2018, times = 24 * 31),
                 month = rep(1, times = 24 * 31),
                 day = rep(1:31, each = 24),
                 hour = rep(0:23, times = 31)) %>%
  .[, val := sample(100, size = .N, replace = TRUE)]  # .N avoids referencing dt before it exists
# SCENARIO 1
# creates the desired table but only prints it; doesn't modify dt by reference (because it is missing `:=`)
dt[, lapply(.SD, sum), .SDcols = "val", by = .(year, month, day)]
# SCENARIO 2
# creates the desired val column, but recycles each daily sum across every row (hour) of its group
dt[, val := lapply(.SD, sum), .SDcols = "val", by = .(year, month, day)]
# SCENARIO 3
# this also works, but requires creating a new data.table
new_dt <- dt[, lapply(.SD, sum), .SDcols = "val", by = .(year, month, day)]
I don't see any problem with creating a new data.table object; you can assign it to the same name to overwrite the original.
dt <- dt[, lapply(.SD, sum), .SDcols = "val", by = .(year, month, day)]
Currently you cannot change the number of rows of a data.table by reference, so a rewrite such as dt <- unique(dt) is required, according to the discussion in this feature request: https://github.com/Rdatatable/data.table/issues/635.
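For example, starting from SCENARIO 2, the rewrite that issue refers to could look like this (a sketch):
# The grouped := recycles the daily sum across every hour, so drop
# the hour column and deduplicate to end up with one row per day:
dt[, val := sum(val), by = .(year, month, day)]
dt <- unique(dt[, .(year, month, day, val)])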
I'm trying to calculate the mean over the last 4 values in a data.table. I thought the following would work:
library(data.table)
dt <- data.table(a = 1:10)
dt[, means := rowMeans(shift(a, 0:3), na.rm = TRUE)]
but it returns 'x' must be an array of at least two dimensions
so I tested with
lags <- paste0("a.lag", c(1,2,3))
dt[, (lags) := shift(a, 1:3)]
dt[, means := rowMeans(c("a", lags), na.rm = TRUE)]
same error. Surprisingly, the following works:
dt[, means := rowMeans(.SD, na.rm = TRUE), .SDcols = c("a", lags)]
Why does .SD give rowMeans a two-dimensional input here when the other calls don't? Is it a bug or am I missing something? Using data.table 1.11.9.
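For what it's worth, the difference appears to be dimensionality rather than a bug: .SD is a data.table (it has dimensions), while shift(a, 0:3) returns a plain list and c("a", lags) is just a character vector, and rowMeans() requires a matrix-like input. One workaround, a sketch:
# Bind the list returned by shift() into a matrix so rowMeans()
# gets the two-dimensional input it expects:
dt[, means := rowMeans(do.call(cbind, shift(a, 0:3)), na.rm = TRUE)]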
Preface:
I have a column in a data.table of difftime values with units set to days. I am trying to create another data.table summarizing the values with
dt2 <- dt[, .(AvgTime = mean(DiffTime)), by = Group]
When printing the new data.table, I see values such as
1.925988e+00 days
1.143287e+00 days
1.453975e+01 days
I would like to limit the decimal place values for this column only (i.e. not setting options() unless I can do this specifically for difftime values this way). When I try to do this using the method above, modified, e.g.
dt2 <- dt[, .(AvgTime = round(mean(DiffTime), 2)), by = Group]
I am left with NA values, with both the base round() and format() functions returning the warning:
In mean(DiffTime) : argument is not numeric or logical.
Oddly enough, if I perform the same operation on a numeric field, this runs with no problems. Also, if I run the two separate lines of code, I can accomplish what I am looking to do:
dt2 <- dt[, .(AvgTime = mean(DiffTime)), by = Group]
dt2[, AvgTime := round(AvgTime, 2)]
Reproducible Example:
library(data.table)
set.seed(1)
dt <- data.table(
  Date1 = sample(seq(as.Date('2017/10/01'), as.Date('2017/10/31'), by = "days"),
                 24, replace = FALSE) + abs(rnorm(24)) / 10,
  Date2 = sample(seq(as.Date('2017/10/01'), as.Date('2017/10/31'), by = "days"),
                 24, replace = FALSE) + abs(rnorm(24)) / 10,
  Num1 = abs(rnorm(24)) * 10,
  Group = rep(LETTERS[1:4], each = 6)
)
dt[, DiffTime := abs(difftime(Date1, Date2, units = 'days'))]
# Warnings/NA:
class(dt$DiffTime) # "difftime"
dt2 <- dt[, .(AvgTime = round(mean(DiffTime), 2)), by = .(Group)]
# Works when numeric/not difftime:
class(dt$Num1) # "numeric"
dt2 <- dt[, .(AvgNum = round(mean(Num1), 2)), by = .(Group)]
# Works, but takes an additional step:
dt2 <- dt[, .(AvgTime = mean(DiffTime)), by = .(Group)]
dt2[, AvgTime := round(AvgTime, 2)]
# Works with base::mean:
class(dt$DiffTime) # "difftime"
dt2 <- dt[, .(AvgTime = round(base::mean(DiffTime), 2)), by = .(Group)]
Question:
Why am I not able to complete this conversion (rounding of the mean) in one step when the class is difftime? Am I missing something in my execution? Is this some sort of bug in data.table where it can't properly handle the difftime?
Issue added on GitHub.
Update: the issue appears to be resolved after updating from data.table version 1.10.4 to 1.12.8.
This was fixed by #3567 on 2019-05-15 and released in data.table 1.12.4 on 2019-10-03. That also explains the base::mean behaviour above: data.table internally optimizes a plain grouped mean() (GForce), and the optimized path dropped the difftime class, while the explicitly qualified base::mean bypasses the optimization.
This might be a little late but if you really want it to work you can do:
as.numeric(round(as.difftime(difftime(DATE1, DATE2)), 0))
I recently ran into the same problem using data.table_1.11.8. One quick workaround is to use base::mean instead of mean.
I have a data.table with quite a few columns. I need to loop through them and create new columns using some condition. Currently I am writing a separate condition for each column. Let me explain with an example. Consider this sample data:
set.seed(71)
DT <- data.table(town = rep(c('A','B'), each = 10),
                 tc = rep(c('C','D'), 10),
                 one = rnorm(20, 1, 1),
                 two = rnorm(20, 2, 1),
                 three = rnorm(20, 3, 1),
                 four = rnorm(20, 4, 1),
                 five = rnorm(20, 5, 2),
                 six = rnorm(20, 6, 2),
                 seven = rnorm(20, 7, 2),
                 total = rnorm(20, 28, 3))
For each of the columns from one to total, I need to create 4 new columns, i.e. mean, sd, uplimit, and lowlimit, for a 2-sigma outlier calculation. I am doing this with:
DTnew <- DT[, as.list(unlist(lapply(.SD, function(x)
                list(mean = mean(x), sd = sd(x),
                     uplimit = mean(x) + 1.96 * sd(x),
                     lowlimit = mean(x) - 1.96 * sd(x))))),
            by = .(town, tc)]
I then merge this DTnew data.table with DT:
DTmerge <- merge(DT, DTnew, by = c('town', 'tc'))
Now, to flag the outliers, I am writing a separate line of code for each variable:
DTAoutlier <- DTmerge[, one.Aoutlier := ifelse(one >= one.lowlimit & one <= one.uplimit, 0, 1)]
DTAoutlier <- DTmerge[, two.Aoutlier := ifelse(two >= two.lowlimit & two <= two.uplimit, 0, 1)]
DTAoutlier <- DTmerge[, three.Aoutlier := ifelse(three >= three.lowlimit & three <= three.uplimit, 0, 1)]
Can someone help simplify this code so that:
I don't have to write a separate line of code for each outlier column. In this example we have only 8 variables, but what if we had 100? Would we end up writing 100 lines of code? Can this be done using a for loop, and how?
More generally, how can we add new columns in data.table while retaining the original columns? For example, below I take the log of columns 3 to 10. If I don't create a new DTlog, it overwrites the original columns in DT. How can I retain the original columns in DT and have the new columns as well?
DTlog <- DT[, lapply(.SD, log), by = .(town, tc), .SDcols = 3:10]
Look forward to some expert suggestions.
We can do this using :=. We subset the column names that are not the grouping variables ('nm'). Create a vector of names to assign for the new columns using outer ('nm1'). Then, we use the OP's code, unlist the output and assign (:=) it to 'nm1' to create the new columns.
nm <- names(DT)[-(1:2)]
nm1 <- c(outer(c("Mean", "SD", "uplimit", "lowlimit"), nm, paste, sep = "_"))
# column-major order of outer() matches the unlist() output below:
# Mean_one, SD_one, uplimit_one, lowlimit_one, Mean_two, ...
DT[, (nm1) := unlist(lapply(.SD, function(x) {
        Mean <- mean(x)
        SD <- sd(x)
        uplimit <- Mean + 1.96 * SD
        lowlimit <- Mean - 1.96 * SD
        list(Mean, SD, uplimit, lowlimit)
      }), recursive = FALSE),
   by = .(town, tc)]
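To see the naming scheme this produces (a quick check, not part of the original answer):
# Each measure column gains four companions, e.g. for 'one':
grep("_one$", names(DT), value = TRUE)
# "Mean_one" "SD_one" "uplimit_one" "lowlimit_one"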
The second part of the question involves a logical comparison between columns. One option is to subset the initial columns and the 'lowlimit' and 'uplimit' columns separately, do the comparison (as these have the same dimensions), negate it so that 1 flags an outlier (matching the OP's 0/1 coding), coerce the logical output to binary with +, and assign it to the original dataset to create the outlier columns.
m1 <- +(!(DT[, nm, with = FALSE] >= DT[, paste("lowlimit", nm, sep = "_"), with = FALSE] &
          DT[, nm, with = FALSE] <= DT[, paste("uplimit", nm, sep = "_"), with = FALSE]))
DT[, paste(nm, "Aoutlier", sep = ".") := as.data.frame(m1)]
Or instead of comparing data.tables, we can also use a for loop with set (which would be more efficient)
nm2 <- paste(nm, "Aoutlier", sep = ".")
DT[, (nm2) := NA_integer_]
for (j in nm) {
  set(DT, i = NULL, j = paste(j, "Aoutlier", sep = "."),
      value = as.integer(!(DT[[j]] >= DT[[paste("lowlimit", j, sep = "_")]] &
                           DT[[j]] <= DT[[paste("uplimit", j, sep = "_")]])))
}
The 'log' columns can also be created with :=
DT[, paste(nm, "log", sep = ".") := lapply(.SD, log), by = .(town, tc), .SDcols = nm]
Your data should probably be in long format:
m = melt(DT, id=c("town","tc"))
Then just write your test once
m[, is_outlier := +(abs(value - mean(value)) > 1.96 * sd(value)),
  by = .(town, tc, variable)]
I see no outliers in this data (according to the given definition of outlier):
m[, .N, by = is_outlier]  # this is a handy alternative to table()
#    is_outlier   N
# 1:          0 160
How it works:
- melt keeps the id columns and stacks all the rest into variable (the column names) and value (the column contents).
- +x does the same thing as as.integer(x), coercing TRUE/FALSE to 1/0.
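For example (a quick check): DT has 20 rows and 8 measure columns, so m has 160 rows, one block of 20 per original column:
m[variable == "one", .N]  # 20 rows came from column 'one'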
If you really like your data in wide format, though:
vjs = setdiff(names(DT), c("town", "tc"))
DT[, paste0(vjs, ".out") := lapply(.SD, function(x) +(abs(x - mean(x)) > 1.96 * sd(x))),
   by = .(town, tc), .SDcols = vjs]
For completeness, it should be noted that dplyr's mutate_each provides a handy way of tackling such problems:
library(dplyr)
result <- DT %>%
  group_by(town, tc) %>%
  mutate_each(funs(mean, sd,
                   uplimit = (mean(.) + 1.96 * sd(.)),
                   lowlimit = (mean(.) - 1.96 * sd(.)),
                   Aoutlier = as.integer(. < mean(.) - 1.96 * sd(.) |
                                         . > mean(.) + 1.96 * sd(.))),
              -town, -tc)
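Note that mutate_each() has since been deprecated in dplyr; a rough modern equivalent (a sketch, assuming dplyr >= 1.0) uses across():
library(dplyr)

result <- DT %>%
  group_by(town, tc) %>%
  mutate(across(one:total,
                list(mean = mean, sd = sd,
                     uplimit  = ~ mean(.x) + 1.96 * sd(.x),
                     lowlimit = ~ mean(.x) - 1.96 * sd(.x),
                     # 1 flags an outlier, matching the OP's coding
                     Aoutlier = ~ as.integer(abs(.x - mean(.x)) > 1.96 * sd(.x)))))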