format data.table column values (row-wise) if numeric - R

I am creating a summary data.table to be inserted into a knitr report using xtable. I would like to check each row value in each column: if is.numeric() == TRUE, format the number, then convert it back to a character; if is.numeric() == FALSE, return the value unchanged. The actual data.table may have many columns.
Here's what I have below, with the desired output at the bottom:
library(data.table)
library(magrittr)

dt <- data.table(A = c("apples", "bananas", 1000000.999),
                 B = c("red", 5000000.999, 0.99))
dt
a <- dt[, lapply(.SD, function(x) {
  if (is.na(is.numeric(x))) {
    prettyNum(as.numeric(x), digits = 0, big.mark = ",")
  } else {
    x
  }
})]
a
b <- dt[, A := ifelse(is.na(is.numeric(A)),
                      format(as.numeric(A), digits = 0, big.mark = ","),
                      A)] %>%
  .[, B := ifelse(is.na(is.numeric(B)),
                  format(as.numeric(B), digits = 0, big.mark = ","),
                  B)]
b
desired <- data.table(A = c("apples", "bananas", "1,000,000"),
                      B = c("red", "5,000,000", "1"))
desired
From my understanding, lapply in the j argument of data.table syntax operates on each column as a vector, so it can be used for functions like mean(), sum(), na.approx(), etc., and wouldn't necessarily work here. But I would like to loop over each column in the data.table without specifying each column name, since there could be many columns and naming them would be cumbersome. It's kind of like I know the circle doesn't go in the square but I really want it to!
I tried the := ifelse() approach, which I thought should work, but it seems to return only the first element. On a different data.table where the column is entirely numeric, the same approach yields all NA.
Thanks for any help!

We can use set together with number() from scales. Loop through the sequence of columns with a for loop, identify the indices of elements that consist only of digits or . ('i1'), use that as the i in set, convert those elements to numeric, and apply number() to format them:
library(scales)
library(data.table)

for(j in seq_along(dt)) {
  i1 <- grep("^[0-9.]+$", dt[[j]])
  set(dt, i = i1, j = j, value = number(as.numeric(dt[[j]][i1]), big.mark = ","))
}
dt
#            A         B
# 1:    apples       red
# 2:   bananas 5,000,001
# 3: 1,000,001         1
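For completeness, here is a sketch of the OP's lapply-in-j idea with a working numeric test (run on a fresh copy, since set above modified dt in place; suppressWarnings hides the expected coercion warnings for non-numeric strings):
dt2 <- data.table(A = c("apples", "bananas", 1000000.999),
                  B = c("red", 5000000.999, 0.99))
a <- dt2[, lapply(.SD, function(x) {
  # elements that parse as numeric get formatted; the rest pass through
  num <- suppressWarnings(as.numeric(x))
  out <- x
  out[!is.na(num)] <- number(num[!is.na(num)], big.mark = ",")
  out
})]
a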

Related

R data.table: efficiently access and update a variable column name in j expression with grouping [duplicate]

I want to apply a transformation (whose type, loosely speaking, is "vector" -> "vector") to a list of columns in a data table, and this transformation will involve a grouping operation.
Here is the setup and what I would like to achieve:
library(data.table)
set.seed(123)
n <- 1000
DT <- data.table(
  date = seq.Date(as.Date('2000/1/1'), by = 'day', length.out = n),
  A = runif(n),
  B = rnorm(n),
  C = rexp(n))

DT[, A.prime := (A - mean(A))/sd(A), by = year(date)]
DT[, B.prime := (B - mean(B))/sd(B), by = year(date)]
DT[, C.prime := (C - mean(C))/sd(C), by = year(date)]
The goal is to avoid typing out the column names. In my actual application, I have a list of columns I would like to apply this transformation to.
columns <- c("A", "B", "C")
for (x in columns) {
  # This doesn't work:
  # target <- DT[, (x - mean(x, na.rm = TRUE))/sd(x, na.rm = TRUE), by = year(date)]
  # Neither does this:
  # target <- DT[, (..x - mean(..x, na.rm = TRUE))/sd(..x, na.rm = TRUE), by = year(date)]
  # THIS WORKS! But it is tedious writing get(x) every time.
  target <- DT[, (get(x) - mean(get(x), na.rm = TRUE))/sd(get(x), na.rm = TRUE), by = year(date)][, V1]
  set(DT, j = paste0(x, ".prime"), value = target)
}
Question: What is the idiomatic way to achieve the above result? There are two things which may possibly be improved:
How to avoid typing out get(x) every time I use x to access a column?
Is accessing [, V1] the most efficient way of doing this? Is it possible to update DT directly by reference, without creating an intermediate data.table?
You can use .SDcols to specify the columns that you want to operate on:
library(data.table)
columns <- c("A", "B", "C")
newcolumns <- paste0(columns, ".prime")
DT[, (newcolumns) := lapply(.SD, function(x) (x - mean(x))/sd(x)),
   by = year(date), .SDcols = columns]
This avoids using get(x) every time and updates the data.table by reference.
I think Ronak's answer is superior and preferable; I'm just writing this to demonstrate that a common syntax for more complicated j queries is to use a full {} expression:
target <- DT[, by = year(date), {
  xval = eval(as.name(x))
  (xval - mean(xval, na.rm = TRUE))/sd(xval, na.rm = TRUE)
}]$V1
Two other small differences:
I used eval(as.name(.)) instead of get; the former is more trustworthy and, in my experience, faster.
I replaced [, V1] with $V1 -- the former incurs the overhead of [.data.table.
You might also like to know that the base function scale will do the center and normalize steps more concisely (if slightly inefficiently, being a bit too general).
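As a rough sketch (assuming the columns and newcolumns vectors from the answer above), the scale-based variant could look like this; as.vector drops the matrix attributes that scale returns:
DT[, (newcolumns) := lapply(.SD, function(x) as.vector(scale(x))),
   by = year(date), .SDcols = columns]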

dynamically subsetting a data.table in R

My data.table contains K columns of claims, among 30 other columns. I want to subset the data.table such that only rows which have no 0 claims remain.
So, first I get all the column names I need for filtering. For the purpose of this example, I have chosen K = 2:
> claimsCols = c("claimsnext", paste0("claims" , 1:K))
> claimsCols
[1] "claimsnext" "claims1" "claims2"
I have tried subsetting like this:
for(i in claimsCols){
  BTplan <- BTplan[claimsCols[i] == 0, ]
  i+1
}
This doesn't work:
Error in i + 1 : non-numeric argument to binary operator
Surely there is a better way to do this?
I would basically do what akrun does:
idx = BTplan[ , Reduce(`&`, .SD), .SDcols = patterns('claims')]
BTplan = BTplan[idx]
The innovations are:
Use patterns in .SDcols to specify the columns to include by pattern.
& automatically converts numeric to logical, i.e. 1.1 & 2.2 is TRUE, and the result becomes FALSE as soon as there's a 0 anywhere (hence filtering out the corresponding row); see the quick illustration below.
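A quick illustration of that coercion, with made-up values:
Reduce(`&`, list(c(1.1, 0, 3), c(2.2, 5, 0)))
# [1]  TRUE FALSE FALSE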
In a future version of data.table this will be slightly more efficient and concise (and hopefully more readable):
idx = BTplan[ , pall(.SD), .SDcols = patterns('claims')]
BTplan = BTplan[idx]
Keep an eye on this pull request:
https://github.com/Rdatatable/data.table/pull/4448
In the OP's code, i is each element of 'claimsCols', which is character, so i + 1 won't work; in fact, it is not needed:
for(colnm in claimsCols) {
  BTplan <- BTplan[BTplan[[colnm]] != 0, ]
}
Or using data.table syntax
library(data.table)
setDT(BTplan)
BTplan[BTplan[, Reduce(`&`, lapply(.SD, `!=`, 0)), .SDcols = claimsCols]]
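BTplan isn't shown in the question, so here is a small made-up table to see the filter in action (rows 1 and 3 contain a 0 claim and are dropped):
BTplan <- data.table(claimsnext = c(0, 1, 2),
                     claims1    = c(1, 1, 0),
                     claims2    = c(2, 3, 4))
claimsCols <- c("claimsnext", "claims1", "claims2")
BTplan[BTplan[, Reduce(`&`, lapply(.SD, `!=`, 0)), .SDcols = claimsCols]]
#    claimsnext claims1 claims2
# 1:          1       1       3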

How can I program a loop in R?

How can I program a loop so that all eight tables are calculated one after the other?
The code:
dt_M1_I <- M1_I
dt_M1_I <- data.table(dt_M1_I)
dt_M1_I[, I := as.numeric(gsub(",", ".", I))]
dt_M1_I[, day := substr(t, 1, 10)]
dt_M1_I[, hour := substr(t, 12, 16)]
dt_M1_I_median <- dt_M1_I[, list(median_I = median(I, na.rm = TRUE)), by = .(day, hour)]
This should be calculated for:
M1_I
M2_I
M3_I
M4_I
M1_U
M2_U
M3_U
M4_U
Thank you very much for your help!
Whenever you have several variables of the same kind, especially when you find yourself numbering them, as you did, step back and replace them with a single list variable. I do not recommend doing what the other answer suggested.
That is, instead of M1_I…M4_I and M1_U…M4_U, have two variables m_i and m_u (using lower case in variable names is conventional), which are each lists of four data.tables.
Alternatively, you might want to use a single variable, m, which contains nested lists of data.tables (m = list(list(i = …, u = …), …)).
Assuming the first, you can then iterate over them as follows:
give_this_a_meaningful_name = function (df) {
  dt <- data.table(df)
  dt[, I := as.numeric(gsub(",", ".", I))]
  dt[, day := substr(t, 1, 10)]
  dt[, hour := substr(t, 12, 16)]
  dt[, list(median_I = median(I, na.rm = TRUE)), by = .(day, hour)]
}

m_i_median = lapply(m_i, give_this_a_meaningful_name)
(Note also the introduction of consistent spacing around operators; good readability is paramount for writing bug-free code.)
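If the eight tables already exist as free-standing variables, one way to gather them into such lists is mget (a sketch, assuming the variable names from the question):
m_i <- mget(paste0("M", 1:4, "_I"))
m_u <- mget(paste0("M", 1:4, "_U"))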
You can use a combination of a for loop and the get/assign functions like this:
# create a vector of the data.frame names
dts <- c('M1_I', 'M2_I', 'M3_I', 'M4_I', 'M1_U', 'M2_U', 'M3_U', 'M4_U')

# iterate over each data.frame
for (dt in dts){
  # get the actual data.frame (not the string name of it)
  tmp <- get(dt)
  tmp <- data.table(tmp)
  tmp[, I := as.numeric(gsub(",", ".", I))]
  tmp[, day := substr(t, 1, 10)]
  tmp[, hour := substr(t, 12, 16)]
  tmp <- tmp[, list(median_I = median(I, na.rm = TRUE)), by = .(day, hour)]
  # assign the modified data.frame to the name you want (paste0 adds the 'dt_' prefix)
  assign(paste0('dt_', dt), tmp)
}

Loop through data.table and create new columns based on some condition

I have a data.table with quite a few columns. I need to loop through them and create new columns based on some condition. Currently I am writing a separate line of code for each column. Let me explain with an example. Consider this sample data:
set.seed(71)
DT <- data.table(town = rep(c('A','B'), each = 10),
                 tc = rep(c('C','D'), 10),
                 one = rnorm(20, 1, 1),
                 two = rnorm(20, 2, 1),
                 three = rnorm(20, 3, 1),
                 four = rnorm(20, 4, 1),
                 five = rnorm(20, 5, 2),
                 six = rnorm(20, 6, 2),
                 seven = rnorm(20, 7, 2),
                 total = rnorm(20, 28, 3))
For each of the columns from one to total, I need to create 4 new columns, i.e. mean, sd, uplimit, and lowlimit, for a 2-sigma outlier calculation. I am doing this with:
DTnew <- DT[, as.list(unlist(lapply(.SD, function(x)
  list(mean = mean(x), sd = sd(x),
       uplimit = mean(x) + 1.96*sd(x),
       lowlimit = mean(x) - 1.96*sd(x))))),
  by = .(town, tc)]
I then merge this DTnew data.table with my DT:
DTmerge <- merge(DT, DTnew, by = c('town','tc'))
Now to come up with the outliers, I am writing a separate line of code for each variable:
DTAoutlier <- DTmerge[, one.Aoutlier := ifelse(one >= one.lowlimit & one <= one.uplimit, 0, 1)]
DTAoutlier <- DTmerge[, two.Aoutlier := ifelse(two >= two.lowlimit & two <= two.uplimit, 0, 1)]
DTAoutlier <- DTmerge[, three.Aoutlier := ifelse(three >= three.lowlimit & three <= three.uplimit, 0, 1)]
Can someone help simplify this code so that:
1. I don't have to write separate lines of code for each outlier column. In this example we have only 8 variables, but what if we had 100? Would we end up writing 100 lines of code? Can this be done using a for loop? How?
2. In general, how can we add new columns to a data.table while retaining the original columns? For example, below I take the log of columns 3 to 10. If I don't create a new DTlog, it overwrites the original columns in DT. How can I retain the original columns in DT and have the new columns as well?
DTlog <- DT[, (lapply(.SD, log)), by = .(town, tc), .SDcols = 3:10]
Look forward to some expert suggestions.
We can do this using :=. We subset the column names that are not the grouping variables ('nm'), create a vector of names for the new columns using outer ('nm1'), then use the OP's code, unlist the output, and assign (:=) it to 'nm1' to create the new columns.
nm <- names(DT)[-(1:2)]
nm1 <- c(t(outer(c("Mean", "SD", "uplimit", "lowlimit"), nm, paste, sep = "_")))
DT[, (nm1) := unlist(lapply(.SD, function(x) {
       Mean = mean(x)
       SD = sd(x)
       uplimit = Mean + 1.96*SD
       lowlimit = Mean - 1.96*SD
       list(Mean, SD, uplimit, lowlimit)
     }), recursive = FALSE),
   by = .(town, tc)]
The second part of the question involves doing a logical comparison between columns. One option would be to subset the initial columns, the 'lowlimit' and 'uplimit' columns separately and do the comparison (as these have the same dimensions) to get a logical output which can be coerced to binary with +. Then assign it to the original dataset to create the outlier columns.
m1 <- +(DT[, nm, with = FALSE] >= DT[, paste("lowlimit", nm, sep = "_"), with = FALSE] &
        DT[, nm, with = FALSE] <= DT[, paste("uplimit", nm, sep = "_"), with = FALSE])
DT[, paste(nm, "Aoutlier", sep = ".") := as.data.frame(m1)]
Or instead of comparing data.tables, we can also use a for loop with set (which would be more efficient)
nm2 <- paste(nm, "Aoutlier", sep = ".")
DT[, (nm2) := NA_integer_]
for(j in nm){
  set(DT, i = NULL, j = paste(j, "Aoutlier", sep = "."),
      value = as.integer(DT[[j]] >= DT[[paste("lowlimit", j, sep = "_")]] &
                           DT[[j]] <= DT[[paste("uplimit", j, sep = "_")]]))
}
The 'log' columns can also be created with :=
DT[, paste(nm, "log", sep = ".") := lapply(.SD, log), by = .(town, tc), .SDcols = nm]
Your data should probably be in long format:
m = melt(DT, id=c("town","tc"))
Then just write your test once
m[, is_outlier := +(abs(value - mean(value)) > 1.96*sd(value)),
  by = .(town, tc, variable)]
I see no outliers in this data (according to the given definition of outlier):
m[, .N, by = is_outlier]   # this is a handy alternative to table()
#    is_outlier   N
# 1:          0 160
How it works:
melt keeps the id columns and stacks all the rest into variable (the column names) and value (the column contents).
+x does the same thing as as.integer(x), coercing TRUE/FALSE to 1/0.
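For instance, the coercion in isolation:
+(c(TRUE, FALSE, TRUE))
# [1] 1 0 1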
If you really like your data in wide format, though:
vjs = setdiff(names(DT), c("town", "tc"))
DT[, paste0(vjs, ".out") := lapply(.SD, function(x) +(abs(x - mean(x)) > 1.96*sd(x))),
   by = .(town, tc), .SDcols = vjs]
For completeness, it should be noted that dplyr's mutate_each (since deprecated in favour of across()) provides a handy way of tackling such problems:
library(dplyr)
result <- DT %>%
  group_by(town, tc) %>%
  mutate_each(funs(mean, sd,
                   uplimit = (mean(.) + 1.96*sd(.)),
                   lowlimit = (mean(.) - 1.96*sd(.)),
                   Aoutlier = as.integer(. >= mean(.) - 1.96*sd(.) &
                                         . <= mean(.) + 1.96*sd(.))),
              -town, -tc)
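Since mutate_each is retired, a rough sketch of the modern equivalent with across() (dplyr >= 1.0), where 1 marks an outlier to match the OP's convention:
library(dplyr)
result <- DT %>%
  group_by(town, tc) %>%
  mutate(across(one:total,
                ~ as.integer(abs(.x - mean(.x)) > 1.96*sd(.x)),
                .names = "{.col}.Aoutlier")) %>%
  ungroup()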

Changing multiple columns in a data.table in R

I am looking for a way to manipulate multiple columns in a data.table in R. As I have to address the columns dynamically, as well as use a second input, I wasn't able to find an answer.
The idea is to index two or more series on a certain date by dividing all values by the series' value on that date, e.g.:
set.seed(132)
# simulate some data
dt <- data.table(date = seq(from = as.Date("2000-01-01"), by = "days", length.out = 10),
                 X1 = cumsum(rnorm(10)),
                 X2 = cumsum(rnorm(10)))
# set a date for the index
indexDate <- as.Date("2000-01-05")
# get the column names to be able to select the columns dynamically
cols <- colnames(dt)
cols <- cols[substr(cols, 1, 1) == "X"]
Part 1: The Easy data.frame/apply approach
df <- as.data.frame(dt)
# get the right row number for the indexDate
rownum <- max((1:nrow(df)) * (df$date == indexDate))
# use apply to iterate over all columns
df[, cols] <- apply(df[, cols], 2, function(x, i){ x / x[i] }, i = rownum)
Part 2: The (fast) data.table approach
So far my data.table approach looks like this:
for(nam in cols) {
  div <- as.numeric(dt[rownum, nam, with = FALSE])
  dt[, nam := dt[, nam, with = FALSE] / div, with = FALSE]
}
Especially all the with = FALSE calls look not very data.table-like.
Do you know any faster/more elegant way to perform this operation?
Any idea is greatly appreciated!
One option would be to use set, as this involves multiple columns. The advantage of using set is that it avoids the overhead of [.data.table and makes it faster.
library(data.table)
for(j in cols){
  set(dt, i = NULL, j = j, value = dt[[j]]/dt[[j]][rownum])
}
Or a slightly slower option would be
dt[, (cols) := lapply(.SD, function(x) x/x[rownum]), .SDcols = cols]
Following up on your code and the answer given by akrun, I would recommend using .SDcols to extract the numeric columns and lapply to loop through them. Here's how I would do it:
index <- as.Date("2000-01-05")
rownum <- max((dt$date == index) * (1:nrow(dt)))
dt[, lapply(.SD, function(i) i/i[rownum]), .SDcols = is.numeric]
Using .SDcols can be especially useful if you have a large number of numeric columns and you'd like to apply this division to all of them.
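The call above returns a new table; here is a sketch of the same idea updating dt by reference instead (picking the numeric columns by name, which also works on older data.table versions that lack function-valued .SDcols):
numcols <- names(dt)[sapply(dt, is.numeric)]
dt[, (numcols) := lapply(.SD, function(x) x/x[rownum]), .SDcols = numcols]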
