Loop through data.table and create new columns based on some condition - r

I have a data.table with quite a few columns. I need to loop through them and create new columns based on some condition. Currently I am writing a separate conditional line for each column. Let me explain with an example. Consider this sample data:
set.seed(71)
DT <- data.table(town = rep(c('A','B'), each=10),
                 tc = rep(c('C','D'), 10),
                 one = rnorm(20,1,1),
                 two = rnorm(20,2,1),
                 three = rnorm(20,3,1),
                 four = rnorm(20,4,1),
                 five = rnorm(20,5,2),
                 six = rnorm(20,6,2),
                 seven = rnorm(20,7,2),
                 total = rnorm(20,28,3))
For each of the columns from one to total, I need to create 4 new columns, i.e. mean, sd, uplimit, and lowlimit, for a 2-sigma outlier calculation. I am doing this with:
DTnew <- DT[, as.list(unlist(lapply(.SD, function(x)
              list(mean = mean(x), sd = sd(x),
                   uplimit = mean(x) + 1.96*sd(x),
                   lowlimit = mean(x) - 1.96*sd(x))))),
            by = .(town, tc)]
I then merge this DTnew data.table with my DT:
DTmerge <- merge(DT, DTnew, by = c('town','tc'))
Now, to flag the outliers, I am writing a separate line of code for each variable:
DTAoutlier <- DTmerge[, one.Aoutlier := ifelse(one >= one.lowlimit & one <= one.uplimit, 0, 1)]
DTAoutlier <- DTmerge[, two.Aoutlier := ifelse(two >= two.lowlimit & two <= two.uplimit, 0, 1)]
DTAoutlier <- DTmerge[, three.Aoutlier := ifelse(three >= three.lowlimit & three <= three.uplimit, 0, 1)]
Can someone help simplify this code so that
I don't have to write a separate line of code for each variable's outlier flag? In this example we have only 8 variables, but what if we had 100? Would we end up writing 100 lines of code? Can this be done using a for loop, and how?
In general, for data.table, how can we add new columns while retaining the original columns? For example, below I take the log of columns 3 to 10. If I don't create a new DTlog, it overwrites the original columns in DT. How can I keep the original columns in DT and also have the new columns in DT?
DTlog <- DT[, lapply(.SD, log), by = .(town, tc), .SDcols = 3:10]
Look forward to some expert suggestions.

We can do this using :=. We subset the column names that are not the grouping variables ('nm'). Create a vector of names to assign for the new columns using outer ('nm1'). Then we use the OP's code, unlist the output, and assign (:=) it to 'nm1' to create the new columns.
nm <- names(DT)[-(1:2)]
# the stat varies fastest here, so the names line up with the unlisted values:
# "Mean_one", "SD_one", "uplimit_one", "lowlimit_one", "Mean_two", ...
nm1 <- c(outer(c("Mean", "SD", "uplimit", "lowlimit"), nm, paste, sep="_"))
DT[, (nm1) := unlist(lapply(.SD, function(x) {
       Mean = mean(x)
       SD = sd(x)
       uplimit = Mean + 1.96*SD
       lowlimit = Mean - 1.96*SD
       list(Mean, SD, uplimit, lowlimit)
     }), recursive = FALSE),
   by = .(town, tc)]
The second part of the question involves doing a logical comparison between columns. One option would be to subset the initial columns and the 'lowlimit' and 'uplimit' columns separately, and do the comparison (as these have the same dimensions) to get a logical matrix, which can be coerced to binary with unary +. Then assign it to the original dataset to create the outlier columns.
m1 <- +(DT[, nm, with = FALSE] >= DT[, paste("lowlimit", nm, sep="_"), with = FALSE] &
        DT[, nm, with = FALSE] <= DT[, paste("uplimit", nm, sep="_"), with = FALSE])
DT[, paste(nm, "Aoutlier", sep=".") := as.data.frame(m1)]
Or, instead of comparing data.tables, we can use a for loop with set (which would be more efficient):
nm2 <- paste(nm, "Aoutlier", sep=".")
DT[, (nm2) := NA_integer_]
for(j in nm){
  set(DT, i = NULL, j = paste(j, "Aoutlier", sep="."),
      value = as.integer(DT[[j]] >= DT[[paste("lowlimit", j, sep="_")]] &
                         DT[[j]] <= DT[[paste("uplimit", j, sep="_")]]))
}
The 'log' columns can also be created with :=
DT[, paste(nm, "log", sep=".") := lapply(.SD, log), by = .(town, tc), .SDcols = nm]

Your data should probably be in long format:
m = melt(DT, id=c("town","tc"))
Then just write your test once
m[, is_outlier := +(abs(value - mean(value)) > 1.96*sd(value)),
  by = .(town, tc, variable)]
I see no outliers in this data (according to the given definition of outlier):
m[, .N, by=is_outlier]   # this is a handy alternative to table()
#    is_outlier   N
# 1:          0 160
How it works
melt keeps the id columns and stacks all the rest into variable (the column names) and value (the column contents).
+x does the same thing as as.integer(x), coercing TRUE/FALSE to 1/0.
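A quick illustration of that coercion:
+(c(TRUE, FALSE, TRUE))
# [1] 1 0 1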
If you really like your data in wide format, though:
vjs = setdiff(names(DT), c("town","tc"))
DT[, paste0(vjs, ".out") := lapply(.SD, function(x) +(abs(x - mean(x)) > 1.96*sd(x))),
   by = .(town, tc), .SDcols = vjs]

For completeness, it should be noted that dplyr's mutate_each provides a handy way of tackling such problems:
library(dplyr)
result <- DT %>%
  group_by(town, tc) %>%
  mutate_each(funs(mean, sd,
                   uplimit  = (mean(.) + 1.96*sd(.)),
                   lowlimit = (mean(.) - 1.96*sd(.)),
                   Aoutlier = as.integer(. >= mean(.) - 1.96*sd(.) &
                                         . <= mean(.) + 1.96*sd(.))),
              -town, -tc)
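Note that mutate_each has since been deprecated in dplyr. A rough modern equivalent, sketched with across() (available from dplyr 1.0), would be:
result <- DT %>%
  group_by(town, tc) %>%
  mutate(across(one:total,
                list(mean = mean, sd = sd,
                     uplimit  = ~ mean(.x) + 1.96*sd(.x),
                     lowlimit = ~ mean(.x) - 1.96*sd(.x),
                     Aoutlier = ~ as.integer(.x >= mean(.x) - 1.96*sd(.x) &
                                             .x <= mean(.x) + 1.96*sd(.x)))))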

Related

Data table in r, assign value to a specific column not using its name

I am writing a function which will impute zero if a value in a column is NA.
The tables I will need to impute will be in this format:
tab = data.table(V1 = 1, var = NA, perc = NA)
Tables will have different column names but the one to impute will always be the second one.
To simplify, the function could be:
impute = function(DT, variable) {
  DT[is.na(get(variable)), variable := 0]
}
That second 'variable' needs to be wrapped in something to work, I assume. I would like to point it to
variable = colnames(tab)[2]
Can anyone help, please?
You can wrap variable in (): the parentheses force variable to be evaluated, so := assigns to the column whose name it holds instead of creating a new column literally named 'variable'.
library(data.table)
impute = function(DT, variable) {
  DT[is.na(get(variable)), (variable) := 0]
}
variable = colnames(tab)[2]
impute(tab, variable)
tab
#    V1 var perc
# 1:  1   0   NA
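To see why the parentheses matter: without them, := takes the left-hand side literally, so a new column named variable is created instead of updating var (a quick demonstration on a throwaway copy):
tab2 <- data.table(V1 = 1, var = NA, perc = NA)
variable <- colnames(tab2)[2]
tab2[is.na(get(variable)), variable := 0]
names(tab2)
# [1] "V1"       "var"      "perc"     "variable"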
I don't think you need a function for it; you can do
cols <- colnames(tab)[2]
tab[, (cols) := lapply(.SD, function(z) replace(z, is.na(z), 0)), .SDcols = cols]
though if you want one, you could do
na0 <- function(x, default = 0) replace(x, is.na(x), default)
tab[, (cols) := lapply(.SD, na0), .SDcols = cols]

R data.table: efficiently access and update a variable column name in j expression with grouping [duplicate]

This question already has answers here: Apply a function to every specified column in a data.table and update by reference (7 answers). Closed 2 years ago.
I want to apply a transformation (whose type, loosely speaking, is "vector" -> "vector") to a list of columns in a data table, and this transformation will involve a grouping operation.
Here is the setup and what I would like to achieve:
library(data.table)
set.seed(123)
n <- 1000
DT <- data.table(
  date = seq.Date(as.Date('2000/1/1'), by='day', length.out = n),
  A = runif(n),
  B = rnorm(n),
  C = rexp(n))
DT[, A.prime := (A - mean(A))/sd(A), by=year(date)]
DT[, B.prime := (B - mean(B))/sd(B), by=year(date)]
DT[, C.prime := (C - mean(C))/sd(C), by=year(date)]
The goal is to avoid typing out the column names. In my actual application, I have a list of columns I would like to apply this transformation to.
columns <- c("A", "B", "C")
for (x in columns) {
  # This doesn't work:
  # target <- DT[, (x - mean(x, na.rm=TRUE))/sd(x, na.rm=TRUE), by=year(date)]
  # This doesn't work either:
  # target <- DT[, (..x - mean(..x, na.rm=TRUE))/sd(..x, na.rm=TRUE), by=year(date)]
  # THIS WORKS! But it is tedious writing get(x) every time.
  target <- DT[, (get(x) - mean(get(x), na.rm=TRUE))/sd(get(x), na.rm=TRUE),
               by=year(date)][, V1]
  set(DT, j = paste0(x, ".prime"), value = target)
}
Question: What is the idiomatic way to achieve the above result? There are two things which might be improved:
How do I avoid typing out get(x) every time I use x to access a column?
Is accessing [, V1] the most efficient way of doing this? Is it possible to update DT directly by reference, without creating an intermediate data.table?
You can use .SDcols to specify the columns that you want to operate on:
library(data.table)
columns <- c("A", "B", "C")
newcolumns <- paste0(columns, ".prime")
DT[, (newcolumns) := lapply(.SD, function(x) (x - mean(x))/sd(x)),
   by = year(date), .SDcols = columns]
This avoids using get(x) every time and updates the data.table by reference.
I think Ronak's answer is superior and preferable; I am just writing this to demonstrate that a common syntax for more complicated j queries is to use a full {} expression:
target <- DT[, by = year(date), {
  xval = eval(as.name(x))
  (xval - mean(xval, na.rm=TRUE))/sd(xval, na.rm=TRUE)
}]$V1
Two other small differences:
I used eval(as.name(.)) instead of get; the former is more trustworthy and, in my experience, faster.
I replaced [, V1] with $V1 -- the former requires the overhead of [.data.table.
You might also like to know that the base function scale will do the center and normalize steps more concisely (if slightly inefficiently, since it is a bit too general).
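For instance, a sketch combining scale with the .SDcols approach above (the .prime2 suffix is just illustrative; note that scale has no na.rm argument, so this assumes no missing values):
DT[, paste0(columns, ".prime2") := lapply(.SD, function(x) as.vector(scale(x))),
   by = year(date), .SDcols = columns]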

A particular syntactic construct in R

In
Orders1 = Orders[Datecreated < floor_date(send_Date, unit='week', week_start = 7) - weeks(PrevWeek),
                 .(Previous_Sales = sum(Sales)),
                 by = .(Category, send_Date = floor_date(send_Date, unit='week', week_start = 7))]
What does the . in .(Previous_Sales=sum(Sales)) mean? This is a syntactic nuance with which I am not familiar.
Also, what does by=.(Category,s... do?
Can someone help?
Here the . is an alias for list in data.table. It is creating a summarised output column:
.(Previous_Sales=sum(Sales))
Or with list
list(Previous_Sales=sum(Sales))
In dplyr, the similar syntax would be
summarise(Previous_Sales = sum(Sales))
and for creating a column/modifying an existing column, use
mutate(Previous_Sales = sum(Sales))
With data.table, updating/creating a column is done with :=
Previous_Sales := sum(Sales)
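For example, applied to the OP's Orders table (a sketch), this adds the column by reference instead of returning a summary:
Orders[, Previous_Sales := sum(Sales), by = Category]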
Similarly, the by would also be a list of column names
by = list(Category, send_Date = floor_date(send_Date, unit='week', week_start = 7))
which we can also write as
by = .(Category, send_Date = floor_date(send_Date, unit='week', week_start = 7))
In the context of data.table, the syntax is consistent, in the order
dt[i, j, by]
where i is where we specify the row condition for subsetting, j is where we apply functions on a column or columns, and by gives the grouping columns. Using a simple example with iris:
as.data.table(iris)[Sepal.Length < 5, .(Sum = sum(Sepal.Width)), by = Species]
Here the i is Sepal.Length < 5: it selects only the rows meeting that condition for summing the 'Sepal.Width', and since the by option is provided, it computes the sum of 'Sepal.Width' for each 'Species', resulting in 3 rows (there are 3 unique 'Species'). We can also do this without the i option by doing the subsetting in j itself:
as.data.table(iris)[, .(Sum = sum(Sepal.Width[Sepal.Length < 5])), by = Species]
For summarisation, both of these are fine, but if we do an assignment (:=), the results differ:
as.data.table(iris)[Sepal.Length < 5, Sum := sum(Sepal.Width), by = Species]
This would create a column 'Sum' and fill in the sum values only in the rows where Sepal.Length < 5; all other rows will be NA. If we use the second option
as.data.table(iris)[, Sum := sum(Sepal.Width[Sepal.Length < 5]), by = Species]
there won't be any NA elements, because the subsetting happens within j to create a single sum value for each 'Species'.
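A quick way to see the difference on throwaway copies:
library(data.table)
dt1 <- as.data.table(iris)[Sepal.Length < 5, Sum := sum(Sepal.Width), by = Species]
dt2 <- as.data.table(iris)[, Sum := sum(Sepal.Width[Sepal.Length < 5]), by = Species]
dt1[, sum(is.na(Sum))]   # > 0: rows with Sepal.Length >= 5 stay NA
dt2[, sum(is.na(Sum))]   # 0: every row gets its group's sum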

Standardize by group using data.table

Is it possible to use data.table to standardize a number of variables by a number of group variables?
DT <- data.table(V1 = 1:20, V2 = 40:21,
                 gr = c(rep('a', 10), rep('b', 10)),
                 grr = rep(c(rep('a', 5), rep('b', 5)), 2))
gr and grr are the group variables. I want to add V1.z and V2.z to that data.table, containing the standardized score within each gr-by-grr group.
Here is some extremely clunky code for that, just to explain what I want:
DTaa <- DT[gr == 'a' & grr == 'a', ]
DTab <- DT[gr == 'a' & grr == 'b', ]
DTba <- DT[gr == 'b' & grr == 'a', ]
DTbb <- DT[gr == 'b' & grr == 'b', ]
DTaa <- DTaa[, V1.z := scale(V1)]
DTaa <- DTaa[, V2.z := scale(V2)]
DTab <- DTab[, V1.z := scale(V1)]
DTab <- DTab[, V2.z := scale(V2)]
DTba <- DTba[, V1.z := scale(V1)]
DTba <- DTba[, V2.z := scale(V2)]
DTbb <- DTbb[, V1.z := scale(V1)]
DTbb <- DTbb[, V2.z := scale(V2)]
DTn <- rbind(DTaa, DTab, DTba, DTbb)
There is probably a way to do it using by in one or two lines.
I'm hoping to then use it in a function that accepts the data, the target variables (in the example, V1 and V2), and the group variables (in the example, gr and grr) as arguments.
If you have a solution that does not use data.table, that's also fine (I tried using mutate_at from dplyr but couldn't find much documentation about that function).
After grouping by 'gr' and 'grr', loop over the Subset of Data.table (.SD), scale it (the output of scale is a matrix, so we convert it to a vector with as.vector), and assign (:=) the output to the new columns.
DT[, paste0(names(DT)[1:2], ".z") := lapply(.SD, function(x) as.vector(scale(x))),
   by = .(gr, grr)]
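To wrap this into the kind of function asked for, here is a sketch (the name standardize and the .z suffix are illustrative; the table is modified by reference):
standardize <- function(data, targets, groups) {
  data[, paste0(targets, ".z") := lapply(.SD, function(x) as.vector(scale(x))),
       by = groups, .SDcols = targets]
  data[]
}
standardize(DT, targets = c("V1", "V2"), groups = c("gr", "grr"))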

Changing multiple Columns in data.table r

I am looking for a way to manipulate multiple columns in a data.table in R. As I have to address the columns dynamically, as well as use a second input, I wasn't able to find an answer.
The idea is to index two or more series on a certain date by dividing all values by the series' value on that date, e.g.:
set.seed(132)
# simulate some data
dt <- data.table(date = seq(from = as.Date("2000-01-01"), by = "days", length.out = 10),
                 X1 = cumsum(rnorm(10)),
                 X2 = cumsum(rnorm(10)))
# set a date for the index
indexDate <- as.Date("2000-01-05")
# get the column names to be able to select the columns dynamically
cols <- colnames(dt)
cols <- cols[substr(cols, 1, 1) == "X"]
Part 1: The Easy data.frame/apply approach
df <- as.data.frame(dt)
# get the right row number for the indexDate
rownum <- max((1:nrow(df)) * (df$date == indexDate))
# use apply to iterate over all columns
df[, cols] <- apply(df[, cols], 2, function(x, i){ x / x[i] }, i = rownum)
Part 2: The (fast) data.table approach
So far my data.table approach looks like this:
for(nam in cols) {
  div <- as.numeric(dt[rownum, nam, with = FALSE])
  dt[, nam := dt[, nam, with = FALSE] / div, with = FALSE]
}
In particular, all the with = FALSE calls do not look very data.table-like.
Do you know any faster/more elegant way to perform this operation?
Any idea is greatly appreciated!
One option would be to use set, as this involves multiple columns. The advantage of using set is that it avoids the overhead of [.data.table, which makes it faster.
library(data.table)
for(j in cols){
  set(dt, i = NULL, j = j, value = dt[[j]]/dt[[j]][rownum])
}
Or a slightly slower option would be
dt[, (cols) := lapply(.SD, function(x) x/x[rownum]), .SDcols = cols]
Following up on your code and the answer given by akrun, I would recommend using .SDcols to extract the numeric columns and lapply to loop through them. Here's how I would do it:
index <- as.Date("2000-01-05")
rownum <- max((dt$date == index) * (1:nrow(dt)))
dt[, lapply(.SD, function(i) i/i[rownum]), .SDcols = is.numeric]
Using .SDcols can be especially useful if you have a large number of numeric columns and you'd like to apply this division to all of them.
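Note that the line above returns a new table rather than updating dt. To also update by reference (a sketch assuming every numeric column should be indexed):
num_cols <- names(dt)[sapply(dt, is.numeric)]
dt[, (num_cols) := lapply(.SD, function(i) i/i[rownum]), .SDcols = num_cols]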
