outcome variable as argument in regression function - r

I have a datasetup function which currently has 2 arguments: testData and ID1. I want to include outcome variable as an argument.
Suppose outcomevar=c(y1,y2,y3) then the function should create the lagged and differenced variable of my outcome variable.
preparedata<-function(testData,ID1,outcomevar){
#Order temp data by firm and date
testData <- testData[order(testData$firm,testData$date),]
#Create lagged outcomevar for each firm
testData <- ddply(testData, .(firm), transform,
ly1 = c( NA, y1[-length(y1)] ) )
#Create differenced variable
testData$dy1<-(testData$y1-testData$ly1)
}
where the "l" and "d" in front of y1 stand for lagged and differenced.
How can I include the outcome variable as a function argument?
Thanks
T

Here's a solution using data tables:
# create sample dataset
set.seed(1)
df <- data.frame(firm=rep(LETTERS[1:5],each=10),
date=as.Date("2014-01-01")+1:10,
y1=sample(1:100,50),y2=sample(1:100,50),y3=sample(1:100,50))
preparedata<-function(testData,ID1,outcomevar){
require(data.table)
DT <- as.data.table(testData)
setkey(DT,firm,date)
DT[,lag := c(NA,unlist(.SD)[-.N]), by=firm, .SDcols=outcomevar]
DT[,diff := c(NA,diff(unlist(.SD))), by=firm, .SDcols=outcomevar]
setnames(DT,c("lag","diff"),paste0(c("l","d"),outcomevar))
return(DT)
}
result <- preparedata(df,1,outcomevar="y1")
head(result)
# firm date y1 y2 y3 ly1 dy1
# 1: A 2014-01-02 27 48 66 NA NA
# 2: A 2014-01-03 37 86 35 27 10
# 3: A 2014-01-04 57 43 27 37 20
# 4: A 2014-01-05 89 24 97 57 32
# 5: A 2014-01-06 20 7 61 89 -69
# 6: A 2014-01-07 86 10 21 20 66
This assumes you pass the name of the column containing the "outcomevar", not the column itself.
You should read the documentation on data tables (?data.table), but in brief this code converts the input data frame to a data table, orders the data table (using setkey(...)), and adds two new columns by reference: lag and diff. .SD is a special variable in the data table framework which is an alias for "the subset of the original DT containing the rows specified in by=...". You can specify which columns to include using .SDcols=.... The diff(...) function calculates lagged differences, which is the same thing you were doing. Finally, we rename the columns lag and diff to, e.g. ly1 and dy1.
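Since the question mentions several outcome variables (y1, y2, y3), the same idea can be extended to a vector of column names. The following is only a sketch, not the code from the answer above: it assumes data.table 1.9.6 or later (for shift()), the helper name add_lag_diff is made up for illustration, and the toy data are invented.

```r
library(data.table)

# Hypothetical helper (not from the answer above): for every column named in
# outcomevars, add l<name> (lag within firm) and d<name> (first difference).
add_lag_diff <- function(testData, outcomevars) {
  DT <- as.data.table(testData)   # convert to a data.table
  setkey(DT, firm, date)          # sort by firm, then date
  for (v in outcomevars) {
    lname <- paste0("l", v)
    dname <- paste0("d", v)
    # shift() lags by one row within each firm; the first row becomes NA
    DT[, (lname) := shift(.SD[[1L]]), by = firm, .SDcols = v]
    # difference = current value minus lagged value
    DT[, (dname) := .SD[[1L]] - .SD[[2L]], .SDcols = c(v, lname)]
  }
  DT[]
}

# Invented toy data with the same column layout as the question
toy <- data.frame(
  firm = rep(c("A", "B"), each = 3),
  date = rep(as.Date("2014-01-01") + 1:3, 2),
  y1   = c(1, 4, 9, 2, 3, 5),
  y2   = c(10, 20, 40, 5, 5, 6)
)
res <- add_lag_diff(toy, c("y1", "y2"))
```

Here res gains the columns ly1, dy1, ly2 and dy2. As in the answer, the column names are passed as character strings.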

Here is an outline of a function that relies more heavily on your example:
preparedata<-function(testData,outcomevar){
require(plyr)
testData <- testData[order(testData$firm,testData$date),]
testData$tmp.var <- with(testData, eval(parse(text=outcomevar)))
testData <- ddply(testData, .(firm), transform,
lvar = c( NA, tmp.var[-length(tmp.var)]))
testData$tmp.var <- NULL
testData <- within(testData, assign(paste("d", outcomevar, sep=""),
testData[,outcomevar]-testData$lvar))
colnames(testData)[grep("lvar", colnames(testData))] <- paste("l", outcomevar, sep="")
return(testData)
}
Using the df defined in jihoward's answer, we get
> head(preparedata(df,"y1"))
firm date y1 y2 y3 ly1 dy1
1 A 2014-01-02 27 48 66 NA NA
2 A 2014-01-03 37 86 35 27 10
3 A 2014-01-04 57 43 27 37 20
4 A 2014-01-05 89 24 97 57 32
5 A 2014-01-06 20 7 61 89 -69
6 A 2014-01-07 86 10 21 20 66
This function returns a data frame in which ly1 is the lagged variable and dy1 is the differenced variable for the outcome named in the second argument, outcomevar. Note that you pass the column name as a character string: write "y1", not y1, when you call the function.

You could process all outcome variables simultaneously by first gathering them into a key-value column pair:
set.seed(1)
df <- data.frame(
firm = rep(LETTERS[1:5], each = 10),
date = as.Date("2014-01-01") + 1:10,
y1 = sample(100, 50),
y2 = sample(100, 50),
y3 = sample(100, 50)
)
library(dplyr)
library(tidyr)
df %>%
gather(key, value, y1:y3) %>%
group_by(firm, key) %>%
mutate(lag = lag(value), diff = value - lag) # current minus previous value, as in dy1 = y1 - ly1
#> Source: local data frame [150 x 6]
#> Groups: firm, key
#>
#> firm date key value lag diff
#> 1 A 2014-01-02 y1 27 NA NA
#> 2 A 2014-01-03 y1 37 27 10
#> 3 A 2014-01-04 y1 57 37 20
#> 4 A 2014-01-05 y1 89 57 32
#> 5 A 2014-01-06 y1 20 89 -69
#> 6 A 2014-01-07 y1 86 20 66
#> 7 A 2014-01-08 y1 97 86 11
#> 8 A 2014-01-09 y1 62 97 -35
#> 9 A 2014-01-10 y1 58 62 -4
#> 10 A 2014-01-11 y1 6 58 -52
#> .. ... ... ... ... ... ...

Related

How to randomly select row from a dataframe for which the row skewness is larger than a given value in R

I am trying to select random rows from a data frame with 1000 lines (and six columns) where the skewness of the line is larger than a given value (say Sk > 0.3).
I've generated the following data frame
df=data.frame(replicate(6,sample(10:100,1000,rep=TRUE)))
I can get row skewness from the fBasics package:
rowSkewness(df) gives:
[8] -0.2243295435 0.5306809351 0.0707122386 0.0341447417 0.3339384838 -0.3910593364 -0.6443905090
[15] 0.5603809206 0.4406091534 -0.3736108832 0.0397860038 0.9970040772 -0.7702547535 0.2065830354
But now I need to select, say, 10 rows of df whose row skewness is greater than, say, 0.1. Maybe with
for (a in 1:10) {
sample.data[a,] = sample(x=df[which(rowSkewness(df[sample(1:nrow(df),1)>0.1),], size = 1, replace = TRUE)
}
or something like this?
Any thoughts on this will be appreciated.
thanks in advance.
You can use the sample_n() or sample_frac() function, which makes your version a little shorter:
library(tidyr)
library(fBasics)
df=data.frame(replicate(6,sample(10:100,1000,rep=TRUE)))
x=df %>% dplyr::filter(rowSkewness(df)>0.1) %>% dplyr::sample_n(10)
Got it:
x=df %>% dplyr::filter(rowSkewness(df)>0.1)
# sample() on a data frame draws columns, so draw row indices instead
sample.data = x[sample(nrow(x), size = samplesize, replace = TRUE), ]
Just do a subset:
res1 <- DF[fBasics::rowSkewness(DF) > .1, ]
head(res1)
# X1 X2 X3 X4 X5 X6
# 7 56 28 21 93 74 24
# 8 33 56 23 44 10 12
# 12 29 19 29 38 94 95
# 13 35 51 54 98 66 10
# 14 12 51 24 23 36 68
# 15 50 37 81 22 55 97
Or with e1071::skewness:
res2 <- DF[apply(as.matrix(DF), 1, e1071::skewness) > .1, ]
stopifnot(all.equal(res1, res2))
Data
set.seed(42); DF <- data.frame(replicate(6, sample(10:100, 1000, rep=TRUE)))
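The filter-then-sample step can also be sketched in base R alone. This is a minimal illustration under stated assumptions: row_skew is a hypothetical helper that computes a simple moment-based skewness, whose scaling may differ slightly from fBasics::rowSkewness or e1071::skewness, so borderline rows could be classified differently.

```r
set.seed(42)
DF <- data.frame(replicate(6, sample(10:100, 1000, replace = TRUE)))

# Hypothetical helper: moment-based skewness of each row, m3 / m2^(3/2)
row_skew <- function(d) {
  m <- as.matrix(d)
  centred <- m - rowMeans(m)   # subtract each row's mean
  rowMeans(centred^3) / rowMeans(centred^2)^1.5
}

sk <- row_skew(DF)
candidates <- which(sk > 0.1)          # indices of rows above the threshold
picked <- DF[sample(candidates, 10), ] # 10 distinct rows drawn at random
```

Sampling row indices rather than the data frame itself matters here, because sample() applied to a data frame draws columns, not rows.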

Rank subset into quantiles using Ntile

I have a dataset containing 42840 observations with a total of 119 unique months (Dataset$date). I want to assign a quantile to every Dataset$Value within each month, ranking them from 1 (lowest value) to 5 (highest value).
Date Name(ID) Value Quantile (I want to add this column where i assign the values a quantile from 1 to 5)
2009-03 1 35 (1-5)
2009-04 1 20 ...
2009-05 1 65 ...
2009-03 2 24 ...
2009-04 2 77 ...
2009-03 3 110 ...
.
.
.
2018-12 3 125 ...
2009-03 56 24 ...
2009-04 56 65 ...
2009-03 57 26 ...
2009-04 57 67 ...
2009-03 58 99 ...
I've tried the ntile function, which works well on the whole dataset, but I can't find a way to apply it separately within each date.
Any suggestions?
You could use the base rank function with dplyr's group_by:
library(dplyr)
# Create some data
N <- 3
dat <- tibble(
date = rep(1:12,N),
value = runif(12*N, 0, 100)
)
# The rescale function we will use later to fit on your 1-5 scale
## Adapted From https://stackoverflow.com/questions/25962508/rescaling-a-variable-in-r
RESCALE <- function (x, nx1, nx2, minx, maxx) {
nx = nx1 + (nx2 - nx1) * (x - minx)/(maxx - minx)
return(ceiling(nx))
}
# What you want
dat %>%
group_by(date) %>% # Group the data by date so that mutate will compute the ranks within each month
mutate(rank_detail = rank(value), # ranks the values within each group
rank_group = RESCALE(rank_detail, 1, 5, min(rank_detail), max(rank_detail)) ) %>% # rescales the ranking to your 1 to 5 scale
arrange(date)
# A tibble: 36 x 4
# # Groups: date [12]
# date value rank_detail rank_group
# <int> <dbl> <dbl> <dbl>
# 1 1 92.7 3 5
# 2 1 53.6 2 3
# 3 1 47.8 1 1
# 4 2 24.6 2 3
# 5 2 72.2 3 5
# 6 2 11.5 1 1
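Since the question asks about ntile specifically, note that dplyr's ntile() can also be called inside a grouped mutate(), in which case the buckets are computed separately for each group. A minimal sketch with made-up data (the column names date, value and quantile are only illustrative):

```r
library(dplyr)

set.seed(1)
# Made-up data: 12 "months" with 10 observations each
dat <- tibble(
  date  = rep(1:12, each = 10),
  value = runif(120, 0, 100)
)

ranked <- dat %>%
  group_by(date) %>%                    # one set of buckets per month
  mutate(quantile = ntile(value, 5)) %>% # 1 = lowest fifth, 5 = highest fifth
  ungroup()
```

Unlike the rescaled rank above, ntile() fills the five buckets as evenly as possible within each month.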

return a vector in a column in data.table

I have a data.table in R, and I'm looking to create a vector based on .SDcols row by row.
library("data.table")
dt = data.table(
id=1:6,
A1=sample(100,6),
A2=sample(100,6),
A3=sample(100,6),
B1=sample(100,6),
B2=sample(100,6),
B3=sample(100,6)
)
dt[,x1:=paste(.SD,collapse = ","),.SDcols=A1:B3,by=id]
dt[,x2:=strsplit(x1,",")] # x2 vector of characters
Now x2 is a vector of characters.
However, I expected x2 to be a vector of integers.
R > dt
id A1 A2 A3 B1 B2 B3 x2
1: 1 72 23 76 10 35 14 c(72,23,76,10,35,14)
2: 2 44 28 77 29 20 63 c(44,28,77,29,20,63)
3: 3 18 34 43 77 76 100 c(18,34,43,77,76,100)
4: 4 15 33 50 87 86 86 c(15,33,50,87,86,86)
5: 5 71 71 41 75 8 3 c(71,71,41,75,8,3)
6: 6 11 89 98 42 72 27 c(11,89,98,42,72,27)
I tried with several solutions, all failed.
dt[,x2:=.(list(.SD)),.SDcols=A1:B3,by=id] #x2 is <data.table>
dt[,x2:=.(lapply(.SD,c)),.SDcols=A1:B3,by=id]
dt[,x2:=.(c(.SD)), .SDcols=A1:B3,by=id] #RHS 1 is length 6 (greater than the size (1) of group 1). The last 5 element(s) will be discarded.
dt[,x2:=c(.SD),.SDcols=A1:B3,by=id] # x2 equals A1
dt[,x2:=lapply(.SD,c),.SDcols=A1:B3,by=id] # x2 equals A1
dt[,x2:=sapply(.SD,c),.SDcols=A1:B3,by=id] # x2 equals A1
Any suggestion?
Thanks in advance
=====================================================================
edit: thanks Jaap,
dt[, x2 := lapply(strsplit(x1, ","), as.integer)] # it works
Still, I wonder whether there is a more elegant solution?
=====================================================================
edit2:
new solutions, base function is much more useful than I thought.
dt[,ABC0:=apply(rbind(.SD), 1, list),.SDcols=A1:B3,by=id]
dt[,ABC1:=apply(cbind(.SD), 1, list),.SDcols=A1:B3,by=id]
or more simple
dt[,ABC2:=lapply(.SD,rbind),.SDcols=A1:B3]
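One way to get an integer vector per row without the paste/strsplit round trip is to wrap the unlisted .SD in a list inside a grouped assignment. This is a sketch rather than a definitive answer; it assumes that id identifies each row uniquely, as in the example data, and the variable name cols is introduced here for illustration.

```r
library(data.table)

set.seed(1)
dt <- data.table(
  id = 1:6,
  A1 = sample(100, 6), A2 = sample(100, 6), A3 = sample(100, 6),
  B1 = sample(100, 6), B2 = sample(100, 6), B3 = sample(100, 6)
)

cols <- c("A1", "A2", "A3", "B1", "B2", "B3")

# For each id (one row per group), unlist() flattens the one-row .SD into an
# integer vector; the inner list() makes it a single list-column cell.
dt[, x2 := .(list(unlist(.SD, use.names = FALSE))), by = id, .SDcols = cols]
```

Each element of dt$x2 is then an integer vector of length 6, so no conversion from character is needed.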

Automate regression by rows

I have a data.frame
set.seed(100)
exp <- data.frame(exp = c(rep(LETTERS[1:2], each = 10)), re = c(rep(seq(1, 10, 1), 2)), age1 = seq(10, 29, 1), age2 = seq(30, 49, 1),
h = c(runif(20, 10, 40)), h2 = c(40 + runif(20, 4, 9)))
I'd like to fit an lm for each row of the data set (h and h2 ~ age1 and age2).
I do it with a loop:
exp$modelh <- 0
for (i in 1:length(exp$exp)){
age = c(exp$age1[i], exp$age2[i])
h = c(exp$h[i], exp$h2[i])
model = lm(age ~ h)
exp$modelh[i] = coef(model)[1] + 100 * coef(model)[2]
}
This works well but takes some time on very large files. I would be grateful for a faster solution, e.g. with dplyr.
Using dplyr, we can try with rowwise() and do. Inside the do, we concatenate (c) the 'age1', 'age2' to create 'age', likewise, we can create 'h', apply lm, extract the coef to create the column 'modelh'.
library(dplyr)
exp %>%
rowwise() %>%
do({
age <- c(.$age1, .$age2)
h <- c(.$h, .$h2)
model <- lm(age ~ h)
data.frame(., modelh = coef(model)[1] + 100*coef(model)[2])
} )
gives the output
# exp re age1 age2 h h2 modelh
#1 A 1 10 30 19.23298 46.67906 68.85506
#2 A 2 11 31 17.73018 47.55402 66.17050
#3 A 3 12 32 26.56967 46.69174 84.98486
#4 A 4 13 33 11.69149 47.74486 61.98766
#5 A 5 14 34 24.05648 46.10051 82.90167
#6 A 6 15 35 24.51312 44.85710 89.21053
#7 A 7 16 36 34.37208 47.85151 113.37492
#8 A 8 17 37 21.10962 48.40977 74.79483
#9 A 9 18 38 26.39676 46.74548 90.34187
#10 A 10 19 39 15.10786 45.38862 75.07002
#11 B 1 20 40 28.74989 46.44153 100.54666
#12 B 2 21 41 36.46497 48.64253 125.34773
#13 B 3 22 42 18.41062 45.74346 81.70062
#14 B 4 23 43 21.95464 48.77079 81.20773
#15 B 5 24 44 32.87653 47.47637 115.95097
#16 B 6 25 45 30.07065 48.44727 101.10688
#17 B 7 26 46 16.13836 44.90204 84.31080
#18 B 8 27 47 20.72575 47.14695 87.00805
#19 B 9 28 48 20.78425 48.94782 84.25406
#20 B 10 29 49 30.70872 44.65144 128.39415
We could do this with the devel version of data.table i.e. v1.9.5. Instructions to install the devel version are here.
We convert the 'data.frame' to 'data.table' (setDT), create a column 'rn' with the option keep.rownames=TRUE. We melt the dataset by specifying the patterns in the measure to convert from 'wide' to 'long' format. Grouped by 'rn', we do the lm and get the coef. This can be assigned as a new column in the original dataset ('exp') while removing the unwanted 'rn' column by assigning (:=) it to NULL.
library(data.table)#v1.9.5+
modelh <- melt(setDT(exp, keep.rownames=TRUE), measure=patterns('^age', '^h'),
value.name=c('age', 'h'))[, {model <- lm(age ~h)
coef(model)[1] + 100 * coef(model)[2]},rn]$V1
exp[, modelh:= modelh][, rn := NULL]
exp
# exp re age1 age2 h h2 modelh
# 1: A 1 10 30 19.23298 46.67906 68.85506
# 2: A 2 11 31 17.73018 47.55402 66.17050
# 3: A 3 12 32 26.56967 46.69174 84.98486
# 4: A 4 13 33 11.69149 47.74486 61.98766
# 5: A 5 14 34 24.05648 46.10051 82.90167
# 6: A 6 15 35 24.51312 44.85710 89.21053
# 7: A 7 16 36 34.37208 47.85151 113.37492
# 8: A 8 17 37 21.10962 48.40977 74.79483
# 9: A 9 18 38 26.39676 46.74548 90.34187
#10: A 10 19 39 15.10786 45.38862 75.07002
#11: B 1 20 40 28.74989 46.44153 100.54666
#12: B 2 21 41 36.46497 48.64253 125.34773
#13: B 3 22 42 18.41062 45.74346 81.70062
#14: B 4 23 43 21.95464 48.77079 81.20773
#15: B 5 24 44 32.87653 47.47637 115.95097
#16: B 6 25 45 30.07065 48.44727 101.10688
#17: B 7 26 46 16.13836 44.90204 84.31080
#18: B 8 27 47 20.72575 47.14695 87.00805
#19: B 9 28 48 20.78425 48.94782 84.25406
#20: B 10 29 49 30.70872 44.65144 128.39415
Great (double) answer from @akrun.
Just a suggestion for your future analysis as you mentioned "it's an example of a bigger problem". Obviously, if you are really interested in building models rowwise then you'll create more and more columns as your age and h observations increase. If you get N observations you'll have to use 2xN columns for those 2 variables only.
I'd suggest to use a long data format in order to increase your rows instead of your columns.
Something like:
exp[1,] # how your first row (model building info) looks like
# exp re age1 age2 h h2
# 1 A 1 10 30 19.23298 46.67906
reshape(exp[1,], # how your model building info is transformed
varying = list(c("age1","age2"),
c("h","h2")),
v.names = c("age_value","h_value"),
direction = "long")
# exp re time age_value h_value id
# 1.1 A 1 1 10 19.23298 1
# 1.2 A 1 2 30 46.67906 1
Apologies if the "bigger problem" refers to something else and this answer is irrelevant.
With base R, the function sprintf can help us create formulas. And lapply carries out the calculation.
strings <- sprintf("c(%f,%f) ~ c(%f,%f)", exp$age1, exp$age2, exp$h, exp$h2)
lst <- lapply(strings, function(x) {model <- lm(as.formula(x));coef(model)[1] + 100 * coef(model)[2]})
exp$modelh <- unlist(lst)
exp
# exp re age1 age2 h h2 modelh
# 1 A 1 10 30 19.23298 46.67906 68.85506
# 2 A 2 11 31 17.73018 47.55402 66.17050
# 3 A 3 12 32 26.56967 46.69174 84.98486
# 4 A 4 13 33 11.69149 47.74486 61.98766
# 5 A 5 14 34 24.05648 46.10051 82.90167
# 6 A 6 15 35 24.51312 44.85710 89.21053
# 7 A 7 16 36 34.37208 47.85151 113.37493
# 8 A 8 17 37 21.10962 48.40977 74.79483
# 9 A 9 18 38 26.39676 46.74548 90.34187
# 10 A 10 19 39 15.10786 45.38862 75.07002
# 11 B 1 20 40 28.74989 46.44153 100.54666
# 12 B 2 21 41 36.46497 48.64253 125.34773
# 13 B 3 22 42 18.41062 45.74346 81.70062
# 14 B 4 23 43 21.95464 48.77079 81.20773
# 15 B 5 24 44 32.87653 47.47637 115.95097
# 16 B 6 25 45 30.07065 48.44727 101.10688
# 17 B 7 26 46 16.13836 44.90204 84.31080
# 18 B 8 27 47 20.72575 47.14695 87.00805
# 19 B 9 28 48 20.78425 48.94782 84.25406
# 20 B 10 29 49 30.70872 44.65144 128.39416
In the lapply function the expression as.formula(x) is what converts the formulas created in the first line into a format usable by the lm function.
Benchmark
library(dplyr)
library(microbenchmark)
set.seed(100)
big.exp <- data.frame(age1=sample(30, 1e4, T),
age2=sample(30:50, 1e4, T),
h=runif(1e4, 10, 40),
h2= 40 + runif(1e4,4,9))
microbenchmark(
plafort = {strings <- sprintf("c(%f,%f) ~ c(%f,%f)", big.exp$age1, big.exp$age2, big.exp$h, big.exp$h2)
lst <- lapply(strings, function(x) {model <- lm(as.formula(x));coef(model)[1] + 100 * coef(model)[2]})
big.exp$modelh <- unlist(lst)},
akdplyr = {big.exp %>%
rowwise() %>%
do({
age <- c(.$age1, .$age2)
h <- c(.$h, .$h2)
model <- lm(age ~ h)
data.frame(., modelh = coef(model)[1] + 100*coef(model)[2])
} )}
,times=5)
Unit: seconds
expr min lq mean median uq max neval cld
plafort 13.00605 13.41113 13.92165 13.56927 14.53814 15.08366 5 a
akdplyr 26.95064 27.64240 29.40892 27.86258 31.02955 33.55940 5 b
(Note: I downloaded the newest 1.9.5 devel version of data.table today, but continued to receive errors when trying to test it.
The results also differ fractionally (1.93 x 10^-8). Rounding likely accounts for the difference.)
all.equal(pl, ak)
[1] "Attributes: < Component “class”: Lengths (1, 3) differ (string compare on first 1) >"
[2] "Attributes: < Component “class”: 1 string mismatch >"
[3] "Component “modelh”: Mean relative difference: 1.933893e-08"
Conclusion
The lapply approach seems to perform well compared to dplyr with respect to speed, but its rounding may be an issue: sprintf's %f keeps only six decimal places of each input. Improvements may be possible, perhaps by using apply after converting to a matrix to increase speed and efficiency.
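A further speed-up is possible for this particular setup, because each row's regression uses exactly two points, so the fitted line passes through both of them and the coefficients have a closed form. The sketch below uses that observation instead of calling lm() per row; it assumes h and h2 differ in every row, and the data frame name dat is made up so as not to mask base::exp.

```r
set.seed(100)
# Made-up data with the same column layout as the question
dat <- data.frame(age1 = seq(10, 29), age2 = seq(30, 49),
                  h = runif(20, 10, 40), h2 = 40 + runif(20, 4, 9))

# Two points determine the line exactly: slope and intercept in closed form
slope     <- (dat$age2 - dat$age1) / (dat$h2 - dat$h)
intercept <- dat$age1 - slope * dat$h
dat$modelh <- intercept + 100 * slope   # fully vectorised, no loop

# Cross-check against lm() fitted on each row separately
check <- vapply(seq_len(nrow(dat)), function(i) {
  fit <- lm(c(dat$age1[i], dat$age2[i]) ~ c(dat$h[i], dat$h2[i]))
  unname(coef(fit)[1] + 100 * coef(fit)[2])
}, numeric(1))
all.equal(dat$modelh, check)
```

With more than two observations per row this shortcut no longer applies, and a long-format approach as suggested above would be the more natural route.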

Add new columns to a data.table containing many variables

I want to add many new columns simultaneously to a data.table based on by-group computations. A working example of my data would look something like this:
Time Stock x1 x2 x3
1: 2014-08-22 A 15 27 34
2: 2014-08-23 A 39 44 29
3: 2014-08-24 A 20 50 5
4: 2014-08-22 B 42 22 43
5: 2014-08-23 B 44 45 12
6: 2014-08-24 B 3 21 2
Now I want to scale and sum many of the variables to get an output like:
Time Stock x1 x2 x3 x2_scale x3_scale x2_sum x3_sum
1: 2014-08-22 A 15 27 34 -1.1175975 0.7310560 121 68
2: 2014-08-23 A 39 44 29 0.3073393 0.4085313 121 68
3: 2014-08-24 A 20 50 5 0.8102582 -1.1395873 121 68
4: 2014-08-22 B 42 22 43 -0.5401315 1.1226726 88 57
5: 2014-08-23 B 44 45 12 1.1539172 -0.3274462 88 57
6: 2014-08-24 B 3 21 2 -0.6137858 -0.7952265 88 57
A brute force implementation of my problem would be:
library(data.table)
set.seed(123)
d <- data.table(Time = rep(seq.Date( Sys.Date(), length=3, by="day" )),
Stock = rep(LETTERS[1:2], each=3 ),
x1 = sample(1:50, 6),
x2 = sample(1:50, 6),
x3 = sample(1:50, 6))
d[,x2_scale:=scale(x2),by=Stock]
d[,x3_scale:=scale(x3),by=Stock]
d[,x2_sum:=sum(x2),by=Stock]
d[,x3_sum:=sum(x3),by=Stock]
Other posts describing a similar issue (Add multiple columns to R data.table in one function call? and Assign multiple columns using := in data.table, by group) suggest the following solution:
d[, c("x2_scale","x3_scale"):=list(scale(x2),scale(x3)), by=Stock]
d[, c("x2_sum","x3_sum"):=list(sum(x2),sum(x3)), by=Stock]
But again, this would get very messy with a lot of variables and also this brings up an error message with scale (but not with sum since this isn't returning a vector).
Is there a more efficient way to achieve the required result (keeping in mind that my actual data set is quite large)?
I think that with a small modification to your last code you can easily do both for as many variables as you want
vars <- c("x2", "x3") # <- Choose the variable you want to operate on
d[, paste0(vars, "_", "scale") := lapply(.SD, function(x) scale(x)[, 1]), .SDcols = vars, by = Stock]
d[, paste0(vars, "_", "sum") := lapply(.SD, sum), .SDcols = vars, by = Stock]
## Time Stock x1 x2 x3 x2_scale x3_scale x2_sum x3_sum
## 1: 2014-08-22 A 13 14 32 -1.1338934 1.1323092 87 44
## 2: 2014-08-23 A 25 39 9 0.7559289 -0.3701780 87 44
## 3: 2014-08-24 A 18 34 3 0.3779645 -0.7621312 87 44
## 4: 2014-08-22 B 44 8 6 -0.4730162 -0.7258662 59 32
## 5: 2014-08-23 B 49 3 18 -0.6757374 1.1406469 59 32
## 6: 2014-08-24 B 15 48 8 1.1487535 -0.4147807 59 32
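The [, 1] after scale(x) in the code above is what makes the assignment work: scale() returns a one-column matrix (with centring and scaling attributes) rather than a plain vector. A small base R illustration with made-up numbers:

```r
x <- c(1, 2, 3)

s <- scale(x)   # one-column matrix with "scaled:center"/"scaled:scale" attributes
is.matrix(s)    # TRUE
dim(s)          # 3 1

v <- s[, 1]     # extracting the first column gives a plain numeric vector
is.matrix(v)    # FALSE
v               # -1 0 1
```

Functions such as sum or mean already return plain vectors, which is why they need no such treatment.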
For simple functions (that don't need special treatment like scale) you could easily do something like
vars <- c("x2", "x3") # <- Define the variable you want to operate on
funs <- c("min", "max", "mean", "sum") # <- define your function
for(i in funs){
d[, paste0(vars, "_", i) := lapply(.SD, eval(i)), .SDcols = vars, by = Stock]
}
Another variation using data.table
vars <- c("x2", "x3")
d[, paste0(rep(vars, each=2), "_", c("scale", "sum")) := do.call(`cbind`,
lapply(.SD, function(x) list(scale(x)[,1], sum(x)))), .SDcols=vars, by=Stock]
d
# Time Stock x1 x2 x3 x2_scale x2_sum x3_scale x3_sum
#1: 2014-08-22 A 15 27 34 -1.1175975 121 0.7310560 68
#2: 2014-08-23 A 39 44 29 0.3073393 121 0.4085313 68
#3: 2014-08-24 A 20 50 5 0.8102582 121 -1.1395873 68
#4: 2014-08-22 B 42 22 43 -0.5401315 88 1.1226726 57
#5: 2014-08-23 B 44 45 12 1.1539172 88 -0.3274462 57
#6: 2014-08-24 B 3 21 2 -0.6137858 88 -0.7952265 57
Based on comments from @Arun, you could also do:
cols <- paste0(rep(vars, each=2), "_", c("scale", "sum"))
d[,(cols):= unlist(lapply(.SD, function(x) list(scale(x)[,1L], sum(x))),
rec=F), by=Stock, .SDcols=vars]
You're probably looking for a pure data.table solution, but you could also consider using dplyr here since it works with data.tables as well (no need for conversion). From dplyr you could use the function mutate_at, as in this example (with the first data set you showed in your question):
library(dplyr)
dt %>%
group_by(Stock) %>%
mutate_at(vars(x2, x3), funs(sum, scale))
#Source: local data table [6 x 9]
#Groups: Stock
#
# Time Stock x1 x2 x3 x2_sum x3_sum x2_scale x3_scale
#1 2014-08-22 A 15 27 34 121 68 -1.1175975 0.7310560
#2 2014-08-23 A 39 44 29 121 68 0.3073393 0.4085313
#3 2014-08-24 A 20 50 5 121 68 0.8102582 -1.1395873
#4 2014-08-22 B 42 22 43 88 57 -0.5401315 1.1226726
#5 2014-08-23 B 44 45 12 88 57 1.1539172 -0.3274462
#6 2014-08-24 B 3 21 2 88 57 -0.6137858 -0.7952265
You can easily add more functions to be calculated, which will create more columns for you. Note that mutate_at applies the functions only to the columns selected inside vars() (x2 and x3 in this example). You can instead name the columns you don't want to transform (e.g. vars(-c(x2, x3))), or use mutate_all to apply the functions to every column except the grouping variable (Stock). In dplyr 0.8.0 and later, funs() is deprecated in favour of a named list such as list(sum = sum, scale = scale).
EDIT: replaced mutate_each above with mutate_at, as mutate_each is deprecated.
EDIT: cleaner version using functional. I think this is the closest to the dplyr answer.
library(functional)
funs <- list(scale=Compose(scale, c), sum=sum) # See data.table issue #783 on github for the need for this
cols <- paste0("x", 2:3)
cols.all <- outer(cols, names(funs), paste, sep="_")
d[,
c(cols.all) := unlist(lapply(funs, Curry(lapply, X=.SD)), rec=F),
.SDcols=cols,
by=Stock
]
Produces:
Time Stock x1 x2 x3 x2_scale x3_scale x2_sum x3_sum
1: 2014-08-22 A 15 27 34 -1.1175975 0.7310560 121 68
2: 2014-08-23 A 39 44 29 0.3073393 0.4085313 121 68
3: 2014-08-24 A 20 50 5 0.8102582 -1.1395873 121 68
4: 2014-08-22 B 42 22 43 -0.5401315 1.1226726 88 57
5: 2014-08-23 B 44 45 12 1.1539172 -0.3274462 88 57
6: 2014-08-24 B 3 21 2 -0.6137858 -0.7952265 88 57

Resources