Add multiple new columns to a data table

I have a data set and need to add several extra columns that rank the existing columns. At the moment I add one column at a time, but I was hoping for a more efficient way, ideally by passing in the column names as a character vector. Here is a simple example:
require(data.table)
dt <- data.table(x = rnorm(10),
y = rnorm(10))
dt[, ":=" (rank_x = rank(x, ties.method = "min"),
rank_y = rank(y, ties.method = "min"))]
The ranking method is the same in all cases so I was hoping to use something like
cols <- c("x", "y")
dt[, cols := lapply(.SD, function(x) rank(x, ties.method = "min")), .SDcols = cols]

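Build the new column names with paste0 on the left-hand side of := and apply rank to every column listed in .SDcols. (Writing cols := ... does not work because data.table treats cols as a literal column name; the names have to come from an expression such as paste0(...), or from (cols), which would overwrite x and y.)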
dt[, paste0("rank_", cols) := lapply(.SD, rank, ties.method = "min"), .SDcols = cols]
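A small worked example may make the result easier to check by eye. This is only a sketch on made-up values (the four-row table below is an assumption, not the asker's data); it uses the same call and leaves the original x and y columns in place.

```r
library(data.table)

# Hypothetical toy data with a tie in each column
dt <- data.table(x = c(3, 1, 2, 1), y = c(10, 30, 20, 20))
cols <- c("x", "y")

# One new rank_* column per entry in cols, added by reference
dt[, paste0("rank_", cols) := lapply(.SD, rank, ties.method = "min"), .SDcols = cols]

print(dt)
# rank_x is 4, 1, 3, 1 and rank_y is 1, 4, 2, 2 (ties share the lowest rank)
```

Because the left-hand side is an expression rather than a bare name, the same line works unchanged when cols holds more column names.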

Related

Converting columns to numeric if possible does not work within a function

I have example data as follows:
library(data.table)
set.seed(1)
DT <- data.table(panelID = sample(50,50), # Creates a panel ID
Country = c(rep("Albania",30),rep("Belarus",50), rep("Chilipepper",20)),
some_NA = sample(0:5, 6),
some_NA_factor = sample(0:5, 6),
Group = c(rep(1,20),rep(2,20),rep(3,20),rep(4,20),rep(5,20)),
Time = rep(seq(as.Date("2010-01-03"), length=20, by="1 month") - 1,5),
wt = 15*round(runif(100)/10,2),
Income = round(rnorm(10,-5,5),2),
Happiness = sample(10,10),
Sex = round(rnorm(10,0.75,0.3),2),
Age = sample(100,100),
Educ = round(rnorm(10,0.75,0.3),2))
DT [, uniqueID := .I] # Creates a unique ID # https://stackoverflow.com/questions/11036989/replace-all-0-values-to-na
DT$some_NA_factor <- factor(DT$some_NA_factor)
DT$Group <- as.character(DT$Group)
DT2 <- copy(DT)
This is what I want to do: convert a column (in this case column 5, Group) to numeric if that is possible.
dfs <- c("DT", "DT2")
conv_to_num_check <- function(z) is.character(z) && (mean(grepl("^ *-?[\\d.]+(?:e-?\\d+)?$", z, perl = TRUE), na.rm=TRUE)>0.9)
for (i in length(dfs)) {
cols <- which(sapply(get(dfs[i]), conv_to_num_check))
setDT(get(dfs[i]))[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]
}
But when I check the class:
class(DT$Group) # Is character
When I do:
cols <- which(sapply(DT, conv_to_num_check))
setDT(DT)[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]
class(DT$Group) # Is numeric
It works. What am I doing wrong?

There is just a small error in the line for (i in length(dfs)): length(dfs) returns the single value 2, so the loop body runs only once, with i = 2, and DT is never touched:
for (i in length(dfs)) {
print(i)
}
# [1] 2
It will work if you change it to:
for (i in seq_along(dfs)) {
cols <- which(sapply(get(dfs[i]), conv_to_num_check))
setDT(get(dfs[i]))[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]
}
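The difference between the two loop headers can be seen without any data.table code. The sketch below only records which names each loop visits; visited_length and visited_seq are illustrative names, not part of the original question.

```r
dfs <- c("DT", "DT2")

# Loop over length(dfs): iterates over the single value 2
visited_length <- character(0)
for (i in length(dfs)) {
  visited_length <- c(visited_length, dfs[i])
}

# Loop over seq_along(dfs): iterates over 1 and 2
visited_seq <- character(0)
for (i in seq_along(dfs)) {
  visited_seq <- c(visited_seq, dfs[i])
}

print(visited_length)  # only "DT2"
print(visited_seq)     # "DT" and "DT2"
```

This matches the symptom in the question: the table named second is converted, while the first one keeps its character column.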

R fast cosine distance between consecutive rows of a data.table

How can I efficiently calculate distances between (almost) consecutive rows of a large-ish data.table (~4m rows)? I've outlined my current approach below, but it is very slow. My actual data has up to a few hundred columns. I need to calculate lags and leads for future use, so I create these and then use them to calculate the distances.
library(data.table)
library(proxy)
set_shift_col <- function(df, shift_dir, shift_num, data_cols, byvars = NULL){
  df[, (paste0(data_cols, "_", shift_dir, shift_num)) := shift(.SD, shift_num, fill = NA, type = shift_dir), byvars, .SDcols = data_cols]
}
set_shift_dist <- function(dt, shift_dir, shift_num, data_cols){
  stopifnot(shift_dir %in% c("lag", "lead"))
  shift_str <- paste0(shift_dir, shift_num)
  dt[, (paste0("dist", "_", shift_str)) := as.numeric(
    proxy::dist(
      rbindlist(list(
        .SD[, data_cols, with = FALSE],
        .SD[, paste0(data_cols, "_", shift_str), with = FALSE]
      ), use.names = FALSE),
      method = "cosine")
  ), 1:nrow(dt)]
}
n <- 10000
test_data <- data.table(a = rnorm(n), b = rnorm(n), c = rnorm(n), d = rnorm(n))
cols <- c("a", "b", "c", "d")
set_shift_col(test_data, "lag", 1, cols)
set_shift_col(test_data, "lag", 2, cols)
set_shift_col(test_data, "lead", 1, cols)
set_shift_col(test_data, "lead", 2, cols)
set_shift_dist(test_data, "lag", 1, cols)
I'm sure this is a very inefficient approach, any suggestions would be appreciated!

You aren't taking advantage of the vectorisation in proxy::dist: rather than calling it once per row, you can get all the distances you need from a single call.
Try this replacement function and compare the speed:
set_shift_dist2 <- function(dt, shift_dir, shift_num, data_cols){
  stopifnot(shift_dir %in% c("lag", "lead"))
  shift_str <- paste0(shift_dir, shift_num)
  dt[, (paste0("dist2", "_", shift_str)) := proxy::dist(
    .SD[, data_cols, with = FALSE],
    .SD[, paste0(data_cols, "_", shift_str), with = FALSE],
    method = "cosine",
    pairwise = TRUE
  )]
}
You could also do it in one go, without storing shifted copies of the data in the table:
test_data[, dist_lag1 := proxy::dist(
.SD,
as.data.table(shift(.SD, 1)),
pairwise = TRUE,
method = 'cosine'
), .SDcols = c('a', 'b', 'c', 'd')]
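For comparison, the same row-versus-previous-row distance can be sketched with plain matrix arithmetic, which avoids per-row function calls altogether. This is not the answer's method: it assumes cosine distance is defined as 1 minus cosine similarity (worth confirming against proxy's "cosine" measure before relying on it), and the names cur, prev, cos_sim and dist_lag1_manual are made up for illustration.

```r
library(data.table)

set.seed(1)
test_data <- data.table(a = rnorm(6), b = rnorm(6), c = rnorm(6), d = rnorm(6))
cols <- c("a", "b", "c", "d")

# Current rows and the same rows shifted down by one (lag 1), as matrices
cur  <- as.matrix(test_data[, cols, with = FALSE])
prev <- as.matrix(test_data[, shift(.SD, 1), .SDcols = cols])

# Row-wise cosine similarity, computed for all rows at once
cos_sim <- rowSums(cur * prev) / sqrt(rowSums(cur^2) * rowSums(prev^2))

# Assumed definition: cosine distance = 1 - cosine similarity
test_data[, dist_lag1_manual := 1 - cos_sim]

print(test_data)
```

The first row has no predecessor, so its distance is NA. With a few hundred columns the two matrices take roughly twice the memory of the data itself, which is the trade-off for doing everything in one vectorised pass.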

Bin data within a group using breaks from another DF

How can I avoid the for loop in the following code to speed up the computation? (The real data is about 1e6 times larger.)
id = rep(1:5, 20)
v = 1:100
df = data.frame(groupid = id, value = v)
df = dplyr::arrange(df, groupid)
bkt = rep(seq(0, 100, length.out = 4), 5)
id = rep(1:5, each = 4)
bktpts = data.frame(groupid = id, value = bkt)
for (i in 1:5) {
df[df$groupid == i, "bin"] = cut(df[df$groupid == i, "value"],
bktpts[bktpts$groupid == i, "value"],
include.lowest = TRUE, labels = F)
}

I'm not sure why your bktpts is formatted the way it is.
But here is a data.table solution that should be (at least a bit) faster than your for loop.
library( data.table )
setDT(df)[ setDT(bktpts)[, `:=`( id = seq_len(.N),
value_next = shift( value, type = "lead", fill = 99999999 ) ),
by = .(groupid) ],
bin := i.id,
on = .( groupid, value >= value, value < value_next ) ][]

Another way:
library(data.table)
setDT(df); setDT(bktpts)
bktpts[, b := rowid(groupid) - 1L]
df[, b := bktpts[copy(.SD), on=.(groupid, value), roll = -Inf, x.b]]
# check result
df[, any(b != bin)]
# [1] FALSE
See ?data.table for how rolling joins work.
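To see what roll = -Inf does here, it may help to run the join on a tiny hand-made example. The three values and three breaks below are assumptions chosen so that one value falls exactly on a break; the join is done outside of j for readability, which is a small departure from the one-liner above.

```r
library(data.table)

# Hypothetical single-group data: values to bin, and the break points
df     <- data.table(groupid = 1L, value = c(5, 10, 15))
bktpts <- data.table(groupid = 1L, value = c(0, 10, 20))

# Number the breaks within each group, starting at 0
bktpts[, b := rowid(groupid) - 1L]

# For each row of df, take b from the matching break, or from the next
# break above the value when there is no exact match (roll = -Inf)
rolled <- bktpts[df, on = .(groupid, value), roll = -Inf, x.b]
df[, b := rolled]

# Reference result from cut() with the same breaks
df[, bin := cut(value, c(0, 10, 20), include.lowest = TRUE, labels = FALSE)]

print(df)
# b and bin are both 1, 1, 2
```

One edge case to keep in mind: a value exactly equal to the lowest break would match that break directly and get b = 0, whereas cut with include.lowest = TRUE puts it in bin 1. The data in the question has no such value, so the two agree there.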

I came up with another data.table answer:
library(data.table) # load package
# set to data.table
setDT(df)
setDT(bktpts)
# Make a join
df[bktpts[, list(.(value)), by = groupid], bks := V1, on = "groupid"]
# define the bins:
df[, bin := cut(value, bks[[1]], include.lowest = TRUE, labels = FALSE), by = groupid]
# remove the unneeded bks column
df[, bks := NULL]
Explaining the code:
bktpts[, list(.(value)), by = groupid] is a new table that holds, in a list column, all the values of value for each groupid. If you run it on its own, you'll see where we're going.
bks := V1 assigns to the variable bks in df whatever is in V1, which is the name of the list column in the previous table. on = "groupid" names the variable on which the join is made.
The code defining the bins needs little explanation, except for the bks[[1]] bit. It has to be [[ in order to extract the contents of the list element and provide a plain vector, as required by the cut function.
EDIT TO ADD:
All of these data.table commands can be chained into a single (rather unintelligible) call:
df[bktpts[, list(.(value)), by = groupid],
bks := V1,
on = "groupid"][,
bin := cut(value,
bks[[1]],
include.lowest = TRUE,
labels = FALSE),
by = groupid][,
bks := NULL]
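The intermediate list-column table mentioned in the explanation can be inspected on its own. The sketch below uses made-up break points for two groups and spells .( ) out as list( ); breaks_by_group, group2_breaks and bins are illustrative names only.

```r
library(data.table)

# Hypothetical break points: three per group, different for each group
bktpts <- data.table(groupid = rep(1:2, each = 3),
                     value   = c(0, 10, 20, 0, 4, 20))

# One row per groupid; V1 is a list column holding that group's breaks
breaks_by_group <- bktpts[, list(list(value)), by = groupid]
print(breaks_by_group)

# [[1]] pulls the plain numeric vector out of the list element,
# which is the form cut() expects for its breaks argument
group2_breaks <- breaks_by_group[groupid == 2, V1][[1]]
bins <- cut(c(5, 15), group2_breaks, include.lowest = TRUE, labels = FALSE)
print(bins)  # both values fall in the second interval, (4, 20]
```

After the join, every row of df carries a copy of its group's breaks in bks, which is why bks[[1]] inside a by = groupid call is enough to recover them.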

Utilizing roll functions with data.table

I'm having problems specifically applying functions from the roll package using data.table. I'm attempting to calculate rolling metrics on column DT$obs for each group DT$group. I'm able to calculate rolling metrics using the zoo package, but I'd like to use some of the additional arguments in roll package functions.
Demo of the error is below.
require(data.table)
require(zoo)
require(roll)
# Fabricated Data:
DT <- data.table(group = rep(c("A", "B"), each = 20), obs = runif(40, min = 0, max = 100))
# Calculate a rolling sum (this is working properly)
DT[, RollingSum := lapply(.SD, function(x) zoo::rollsumr(x, k = 5, fill = NA)), by = "group", .SDcols = "obs"]
# Attempt to calculate a rolling z-score (this throws me an error)
DT[, RollingZScore := lapply(.SD, function(x) roll::roll_scale(as.matrix(x), width = 10, min_obs = 5)), by = "group", .SDcols = "obs"]
I can't figure out what's different about the zoo function and the roll function. They each return numeric vectors. Any guidance appreciated.

As @Frank describes, the problem is that the result of roll_scale (and thus each element of the lapply output) is a matrix. You can either use sapply instead of lapply, or wrap the call in as.vector inside your function definition.
DT[, RollingZScore := sapply(.SD,
function(x) roll::roll_scale(as.matrix(x), width = 10, min_obs = 5)),
by = "group", .SDcols = "obs"]
or
DT[, RollingZScore := lapply(.SD,
function(x) as.vector(roll::roll_scale(as.matrix(x), width = 10, min_obs = 5))),
by = "group", .SDcols = "obs"]
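The matrix-versus-vector distinction can be reproduced without the roll package. In the sketch below, base R's scale() is used purely as a stand-in because it also returns a one-column matrix for vector input; it is not a rolling function, and the column name z is made up for the example.

```r
library(data.table)

set.seed(1)
DT <- data.table(group = rep(c("A", "B"), each = 5),
                 obs   = runif(10, min = 0, max = 100))

# scale() returns a one-column matrix, not a plain vector
m <- scale(DT$obs)
print(is.matrix(m))             # TRUE
print(is.matrix(as.vector(m)))  # FALSE: as.vector drops the dim attribute

# Wrapping the result in as.vector gives := an ordinary numeric column
DT[, z := as.vector(scale(obs)), by = group]
print(DT)
```

The same wrapping applies to any function that hands back a one-column matrix, which is the situation described for roll_scale above.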

This can be done with rollapplyr by simply defining a function that returns NA if the input has fewer than 5 elements:
Scale <- function(x) if (length(x) < 5) NA else tail(scale(x), 1)
DT[, rollingScore := rollapplyr(obs, 10, Scale, partial = TRUE), by = "group"]

rolling average to multiple variables in R using data.table package

I would like to get a rolling average for each of my numeric variables. Using the data.table package, I know how to compute it for a single variable. But how should I revise the code so that it processes multiple variables at once, rather than changing the variable name and repeating the procedure several times? Thanks.
Suppose I have other numeric variables named as "V2", "V3", and "V4".
require(data.table)
setDT(data)
setkey(data,Receptor,date)
data[ , `:=` ('RollConc' = rollmean(AvgConc, 48, align="left", na.pad=TRUE)) , by=Receptor]
A copy of my sample data can be found at:
https://drive.google.com/file/d/0B86_a8ltyoL3OE9KTUstYmRRbFk/view?usp=sharing
I would like to get 5-hour rolling means for "AvgConc","TotDep","DryDep", and "WetDep" by each receptor.

From your description you want something like this, which is similar to one example that can be found in one of the data.table vignettes:
library(data.table)
set.seed(42)
DT <- data.table(x = rnorm(10), y = rlnorm(10), z = runif(10), g = c("a", "b"), key = "g")
library(zoo)
DT[, paste0("ravg_", c("x", "y")) := lapply(.SD, rollmean, k = 3, na.pad = TRUE),
by = g, .SDcols = c("x", "y")]

Now, one can use the frollmean function in the data.table package for this.
library(data.table)
xy <- c("x", "y")
DT[, (xy):= lapply(.SD, frollmean, n = 3, fill = NA, align="center"),
by = g, .SDcols = xy]
Here, the x and y columns are overwritten with their rolling averages. The example data used above is:
# Data
set.seed(42)
DT <- data.table(x = rnorm(10), y = rlnorm(10), z = runif(10),
g = c("a", "b"), key = "g")
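If the original columns should be kept, the paste0 naming from the first answer can be combined with frollmean. The following is a minimal sketch on made-up data; the ravg_ prefix and the right-aligned window are choices made for this example, not something the question prescribes.

```r
library(data.table)

# Hypothetical data: two groups of four rows each
DT <- data.table(g = rep(c("a", "b"), each = 4),
                 x = c(1, 2, 3, 4, 10, 20, 30, 40),
                 y = c(2, 4, 6, 8, 1, 1, 1, 1))
xy <- c("x", "y")

# Add ravg_x and ravg_y next to x and y; each window covers the current
# row and the two rows before it within the same group
DT[, paste0("ravg_", xy) := lapply(.SD, frollmean, n = 3, fill = NA, align = "right"),
   by = g, .SDcols = xy]

print(DT)
# ravg_x is NA, NA, 2, 3 for group a and NA, NA, 20, 30 for group b
```

Changing n and align adapts the window; the first n - 1 rows of each group stay NA because the window is incomplete there.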
