Calculating a weighted mean in data.table in R with varying weights

My question relates to this previously asked question:
Calculating a weighted mean using data.table in R with weights in one of the table columns
In my case, I have different weight columns for the columns I want to aggregate. Let's say I have four columns col_a, col_b, col_c and col_d, and I want to aggregate col_a and col_b with weights w_1 and col_c, col_d with weights w_2. Example:
require(data.table)
id <- c(1,1,1,2,2,2)
col_a <- c(123,56,87,987,1003,10)
col_b <- c(17,234,20,88,765,69)
col_c <- c(45,90,543,30,1,543)
col_d <- c(60,43,700,3,88,46)
w_1 <- c(1,1,1,1,1,1)
w_2 <- c(1.5,1,1.2,0.8,1,1)
dt <- data.table(id, col_a, col_b, col_c, col_d, w_1, w_2);dt
Now the desired result would look like this (col_a and col_b weighted by w_1, col_c and col_d by w_2):
data.table(id = c(1, 2),
           col_a = c(weighted.mean(col_a[1:3], w_1[1:3]), weighted.mean(col_a[4:6], w_1[4:6])),
           col_b = c(weighted.mean(col_b[1:3], w_1[1:3]), weighted.mean(col_b[4:6], w_1[4:6])),
           col_c = c(weighted.mean(col_c[1:3], w_2[1:3]), weighted.mean(col_c[4:6], w_2[4:6])),
           col_d = c(weighted.mean(col_d[1:3], w_2[1:3]), weighted.mean(col_d[4:6], w_2[4:6])))
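For reference, evaluating the above with the weights as defined gives approximately:
   id     col_a     col_b    col_c    col_d
1:  1  88.66667  90.33333 218.6757 262.9730
2:  2 666.66667 307.33333 202.8571  48.7143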
This, I thought, could be accomplished similarly to akrun's answer to this post:
R collapse multiple rows into 1 row using specific function to each column
where I would have the two functions weighted.mean(x, w_1) and weighted.mean(x, w_2) instead of min or median.
Here is how far I got:
colsToKeep <- c("col_a","col_b","col_c","col_d")
dt[, Map(function(x,y) get(x)(y, na.rm = TRUE),
setNames(rep(c('weighted.mean', 'weighted.mean'),2),names(.SD)), .SD),.SDcols=colsToKeep, by = id]
My question: how can I get the arguments w = w_1 and w = w_2 into the setNames construct? Is that even possible?

Could be something like this too:
colsToKeep <- c("col_a", "col_b", "col_c", "col_d")
colsToW <- c("w_1", "w_1", "w_2", "w_2")
eval(parse(text = paste0(
  "dt[, .(",
  paste0("w_", colsToKeep, " = weighted.mean(", colsToKeep, ",", colsToW, ")",
         collapse = ", "),
  "), by = id]")))
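For clarity, the constructed string parses to the following query (written out by hand):
dt[, .(w_col_a = weighted.mean(col_a, w_1),
       w_col_b = weighted.mean(col_b, w_1),
       w_col_c = weighted.mean(col_c, w_2),
       w_col_d = weighted.mean(col_d, w_2)), by = id]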

or this one:
dt[, Map(function(x, y, w) get(x)(y, w, na.rm = TRUE),
         setNames(rep('weighted.mean', length(colsToKeep)), paste0("W_", colsToKeep)),
         .SD[, ..colsToKeep], .SD[, ..colsToW]),
   by = id]
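A quick spot-check of one cell against a direct call (a sketch; value rounded):
# W_col_c for id == 1 should match the direct weighted mean
weighted.mean(dt[id == 1, col_c], dt[id == 1, w_2])
# [1] 218.6757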

As mentioned by Roland, you can melt into a long format. The benefit is that, in the long run, you do not have to change the code every time a new column arrives. Explanations are inline in the comments. You can print mdt to take a look.
# melt into long format: column values in one column, their names in another
mdt <- melt(dt, id.var = c("id", grep("^w", names(dt), value = TRUE)),
            variable.name = "col", value.name = "colVal")
mdt <- melt(mdt, id.var = c("id", "col", "colVal"),
            variable.name = "w", value.name = "wVal")
# probably need a programmatic way rather than typing the columns out
# (see the sketch after this answer)
myPairs <- data.table(rbind(
  c(col = "col_a", w = "w_1"),
  c(col = "col_b", w = "w_1"),
  c(col = "col_c", w = "w_2"),
  c(col = "col_d", w = "w_2")))
# calculate the weighted mean according to myPairs, then pivot back to wide
dcast(mdt[myPairs, on = .(col, w),
          weighted.mean(colVal, wVal),
          by = .(id, col)],
      id ~ col,
      value.var = "V1")

Related

R data.table: efficiently access and update a variable column name in j expression with grouping [duplicate]

I want to apply a transformation (whose type, loosely speaking, is "vector" -> "vector") to a list of columns in a data table, and this transformation will involve a grouping operation.
Here is the setup and what I would like to achieve:
library(data.table)
set.seed(123)
n <- 1000
DT <- data.table(
  date = seq.Date(as.Date('2000/1/1'), by = 'day', length.out = n),
  A = runif(n),
  B = rnorm(n),
  C = rexp(n))
DT[, A.prime := (A - mean(A))/sd(A), by=year(date)]
DT[, B.prime := (B - mean(B))/sd(B), by=year(date)]
DT[, C.prime := (C - mean(C))/sd(C), by=year(date)]
The goal is to avoid typing out the column names. In my actual application, I have a list of columns I would like to apply this transformation to.
columns <- c("A", "B", "C")
for (x in columns) {
  # This doesn't work.
  # target <- DT[, (x - mean(x, na.rm = TRUE))/sd(x, na.rm = TRUE), by = year(date)]
  # This doesn't work either.
  # target <- DT[, (..x - mean(..x, na.rm = TRUE))/sd(..x, na.rm = TRUE), by = year(date)]
  # THIS WORKS! But it is tedious writing "get(x)" every time.
  target <- DT[, (get(x) - mean(get(x), na.rm = TRUE))/sd(get(x), na.rm = TRUE), by = year(date)][, V1]
  set(DT, j = paste0(x, ".prime"), value = target)
}
Question: what is the idiomatic way to achieve the above result? There are two things which could possibly be improved:
How do I avoid typing get(x) every time I use x to access a column?
Is accessing [, V1] the most efficient way of doing this? Is it possible to update DT directly by reference, without creating an intermediate data.table?
You can use .SDcols to specify the columns that you want to operate on:
library(data.table)
columns <- c("A", "B", "C")
newcolumns <- paste0(columns, ".prime")
DT[, (newcolumns) := lapply(.SD, function(x) (x - mean(x))/sd(x)),
   by = year(date), .SDcols = columns]
This avoids using get(x) every time and updates the data.table by reference.
I think Ronak's answer is superior & preferable; I am just writing this to demonstrate that a common syntax for more complicated j queries is to use a full {} expression:
target <- DT[, by = year(date), {
  xval = eval(as.name(x))
  (xval - mean(xval, na.rm = TRUE))/sd(xval, na.rm = TRUE)
}]$V1
Two other small differences:
I used eval(as.name(.)) instead of get; the former is more trustworthy and, in my experience, faster.
I replaced [, V1] with $V1; the former incurs the overhead of [.data.table.
You might also like to know that the base function scale will do the centering & normalizing steps more concisely (if slightly inefficiently, for being a bit too general).
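A minimal sketch of that idea, reusing the columns/newcolumns vectors from Ronak's answer (note that scale() returns a one-column matrix, hence the as.vector()):
DT[, (newcolumns) := lapply(.SD, function(x) as.vector(scale(x))),
   by = year(date), .SDcols = columns]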

Sample from specific rows in a dataframe column [duplicate]

I'm looking for an efficient way to select rows from a data table such that I have one representative row for each unique value in a particular column.
Let me propose a simple example:
require(data.table)
y = c('a','b','c','d','e','f','g','h')
x = sample(2:10,8,replace = TRUE)
z = rep(y,x)
dt = as.data.table( z )
My objective is to subset the data table dt by sampling one row for each letter a-h in column z.
The OP provided only a single column in the example. Assuming that there are multiple columns in the original dataset, we group by 'z', sample one row from the sequence of rows per group, get the row index (.I), extract the column holding the row indices ($V1), and use that to subset the rows of 'dt'.
dt[dt[, .I[sample(.N, 1)], by = z]$V1]
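For readers new to .I, the one-liner unpacks as follows (a sketch; idx is a hypothetical intermediate name):
idx <- dt[, .I[sample(.N, 1)], by = z]  # one random row number per group, stored in column V1
dt[idx$V1]                              # subset the original table by those row numbers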
You can use dplyr:
library(dplyr)
dt %>%
  group_by(z) %>%
  sample_n(1)
I think that shuffling the data.table row-wise and then applying unique(..., by) could also work. Groups are formed with by, and the previous shuffling trickles down inside each group:
# shuffle the data.table row-wise
dt <- dt[sample(dim(dt)[1])]
# uniqueness by given column(s)
unique(dt, by = "z")
Below is an example on a bigger data.table with grouping by 3 columns. Comparing with akrun's solution, it seems to give the same grouping:
set.seed(2017)
dt <- data.table(c1 = sample(52*10^6),
                 c2 = sample(LETTERS, replace = TRUE),
                 c3 = sample(10^5, replace = TRUE),
                 c4 = sample(10^3, replace = TRUE))
# the shuffling & uniqueness
system.time( test1 <- unique(dt[sample(dim(dt)[1])], by = c("c2","c3","c4")) )
# user system elapsed
# 13.87 0.49 14.33
# akrun's solution
system.time( test2 <- dt[dt[ , .I[sample(.N,1)] , by = c("c2","c3","c4")]$V1] )
# user system elapsed
# 11.89 0.10 12.01
# Grouping is identical (so all groups are being sampled in both cases)
identical(x = test1[, .(c2, c3)][order(c2, c3)],
          y = test2[, .(c2, c3)][order(c2, c3)])
# [1] TRUE
For sampling more than one row per group, check here.
An updated workflow for dplyr: I added a second column, v, so there is a value column alongside the grouping column z.
require(data.table)
y = c('a','b','c','d','e','f','g','h')
x = sample(2:10,8,replace = TRUE)
z = rep(y,x)
v <- 1:length(z)
dt = data.table(z,v)
library(dplyr)
dt %>%
  group_by(z) %>%
  slice_sample(n = 1)

Assign by reference a list of results to a number of columns of a data.table

Imagine you have 2 distributions resulting from two simulations stored in a data.frame:
sim1 = 1:10
sim2 = 91:100
sim = data.frame(sim1, sim2)
Now, we want to find the 10th and 90th percentiles of each distribution. This can be done by:
diffSim = ncol(sim)
confidenceInterval = c(0.1, 0.9)
results = lapply(1:diffSim, function(j) {
  quantile(sim[, j], confidenceInterval, names = FALSE, type = 3)
})
I would like to store these results in a data.table by assigning by reference (:=). However, I first need to get results into the appropriate shape (i.e. a data.table of 1 row and 4 columns). To do so, I subsequently apply unlist, matrix and as.data.table to results:
DT = data.table(Col1 = "Result")
DT[, c("col2", "col3", "col4", "col5") := as.data.table(matrix(unlist(results), nrow = 1))]
I don't like this at all. Is there a shorter way of doing this?
Not necessarily shorter, but everything in data.table:
library(data.table)
setDT(sim)[, .(col1 = 'Result',
               cols = paste0('col', 2:5),
               vals = unlist(lapply(.SD, quantile, probs = confidenceInterval, type = 3)))
           ][, dcast(.SD, col1 ~ cols, value.var = 'vals')]
which gives:
col1 col2 col3 col4 col5
1: Result 1 9 91 99
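Alternatively, since := accepts a list with one element per target column, the matrix() step can be skipped entirely; a minimal sketch, assuming results from the question:
DT = data.table(Col1 = "Result")
# unlist(results) flattens to a vector of 4 numbers; as.list() hands := one value per column
DT[, c("col2", "col3", "col4", "col5") := as.list(unlist(results))]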

Loop through data.table and create new columns basis some condition

I have a data.table with quite a few columns. I need to loop through them and create new columns using some condition. Currently I am writing a separate condition for each column. Let me explain with an example. Consider this sample data:
set.seed(71)
DT <- data.table(town = rep(c('A','B'), each = 10),
                 tc = rep(c('C','D'), 10),
                 one = rnorm(20, 1, 1),
                 two = rnorm(20, 2, 1),
                 three = rnorm(20, 3, 1),
                 four = rnorm(20, 4, 1),
                 five = rnorm(20, 5, 2),
                 six = rnorm(20, 6, 2),
                 seven = rnorm(20, 7, 2),
                 total = rnorm(20, 28, 3))
For each of the columns from one to total, I need to create 4 new columns, i.e. mean, sd, uplimit and lowlimit, for a 2-sigma outlier calculation. I am doing this by:
DTnew <- DT[, as.list(unlist(lapply(.SD, function(x)
              list(mean = mean(x), sd = sd(x),
                   uplimit = mean(x) + 1.96*sd(x),
                   lowlimit = mean(x) - 1.96*sd(x))))),
            by = .(town, tc)]
I then merge this DTnew data.table with my DT:
DTmerge <- merge(DT, DTnew, by= c('town','tc'))
Now, to flag the outliers, I am writing a separate line of code for each variable:
DTAoutlier <- DTmerge[, one.Aoutlier := ifelse(one >= one.lowlimit & one <= one.uplimit, 0, 1)]
DTAoutlier <- DTmerge[, two.Aoutlier := ifelse(two >= two.lowlimit & two <= two.uplimit, 0, 1)]
DTAoutlier <- DTmerge[, three.Aoutlier := ifelse(three >= three.lowlimit & three <= three.uplimit, 0, 1)]
Can someone help simplify this code so that:
I don't have to write a separate line of code for each outlier column. In this example we have only 8 variables, but what if we had 100? Would we end up writing 100 lines of code? Can this be done using a for loop, and how?
In general, how can we add new columns to a data.table while retaining the original columns? For example, below I am taking the log of columns 3 to 10. If I don't create a new DTlog, the result does not retain the original columns of DT. How can I keep the original columns in DT and have the new columns as well?
DTlog <- DT[,(lapply(.SD,log)),by = .(town,tc),.SDcols=3:10]
I look forward to some expert suggestions.
We can do this using :=. We subset the column names that are not the grouping variables ('nm'), create a vector of names for the new columns using outer ('nm1'), then use the OP's code, unlist the output, and assign (:=) it to 'nm1' to create the new columns.
nm <- names(DT)[-(1:2)]
nm1 <- c(t(outer(c("Mean", "SD", "uplimit", "lowlimit"), nm, paste, sep = "_")))
DT[, (nm1) := unlist(lapply(.SD, function(x) {
     Mean = mean(x)
     SD = sd(x)
     uplimit = Mean + 1.96*SD
     lowlimit = Mean - 1.96*SD
     list(Mean, SD, uplimit, lowlimit)
   }), recursive = FALSE),
   by = .(town, tc)]
The second part of the question involves a logical comparison between columns. One option would be to subset the initial columns and the 'lowlimit'/'uplimit' columns separately, then do the comparison (as these have the same dimensions) to get a logical output, which can be coerced to binary with +. Then assign it to the original dataset to create the outlier columns.
m1 <- +(DT[, nm, with = FALSE] >= DT[, paste("lowlimit", nm, sep = "_"), with = FALSE] &
        DT[, nm, with = FALSE] <= DT[, paste("uplimit", nm, sep = "_"), with = FALSE])
DT[, paste(nm, "Aoutlier", sep = ".") := as.data.frame(m1)]
Or instead of comparing data.tables, we can also use a for loop with set (which would be more efficient)
nm2 <- paste(nm, "Aoutlier", sep = ".")
DT[, (nm2) := NA_integer_]
for (j in nm) {
  set(DT, i = NULL, j = paste(j, "Aoutlier", sep = "."),
      value = as.integer(DT[[j]] >= DT[[paste("lowlimit", j, sep = "_")]] &
                         DT[[j]] <= DT[[paste("uplimit", j, sep = "_")]]))
}
The 'log' columns can also be created with :=
DT[,paste(nm, "log", sep=".") := lapply(.SD,log),by = .(town,tc),.SDcols=nm]
Your data should probably be in long format:
m = melt(DT, id=c("town","tc"))
Then just write your test once:
m[, is_outlier := +(abs(value - mean(value)) > 1.96*sd(value)),
  by = .(town, tc, variable)]
I see no outliers in this data (according to the given definition of outlier):
m[, .N, by=is_outlier] # this is a handy alternative to table()
# is_outlier N
# 1: 0 160
How it works:
melt keeps the id columns and stacks all the rest into variable (the column names) and value (the column contents).
+x does the same thing as as.integer(x), coercing TRUE/FALSE to 1/0.
If you really like your data in wide format, though:
vjs = setdiff(names(DT), c("town","tc"))
DT[, paste0(vjs, ".out") := lapply(.SD, function(x) +(abs(x - mean(x)) > 1.96*sd(x))),
   by = .(town, tc), .SDcols = vjs]
For completeness, it should be noted that dplyr's mutate_each (since superseded by across()) provides a handy way of tackling such problems:
library(dplyr)
result <- DT %>%
  group_by(town, tc) %>%
  mutate_each(funs(mean, sd,
                   uplimit = (mean(.) + 1.96*sd(.)),
                   lowlimit = (mean(.) - 1.96*sd(.)),
                   Aoutlier = as.integer(. >= mean(.) - 1.96*sd(.) &
                                         . <= mean(.) + 1.96*sd(.))),
              -town, -tc)

Give factors numerical value [R]

I want to predict a numerical variable. I have a couple of factors, and for each factor I have a numerical equivalent. Ideally, I would assign that numerical equivalent to the factor and use it in the prediction. Is this possible?
If this is not possible, I guess I will need to replace the factors with their numerical equivalents. What is the best way to do so?
An Example:
df = data.frame(f=c("a","b","a","c"),v=c(2,4,2,6))
lookup = data.frame(name=c("a","b","c"),v=c(1,2,3))
What I would like to get:
df2 = data.frame(f=c(1,2,1,3),v=c(2,4,2,6))
cor(df2$f,df2$v) # will be 1
Or
df2 <- merge(df, lookup, by.x = "f", by.y = "name")
cor(df2[, 2], df2[, 3])
Or, if your data sets are big:
library(data.table)
setkey(setDT(df), f)
setkey(setDT(lookup), name)
df2 <- df[lookup]
cor(df2[, 2, with = F], df2[, 3, with = F])
Does this help?
cor(lookup$v[match(df$f,lookup$name)],df$v)
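Unpacked, for clarity (idx and f_num are hypothetical intermediate names):
idx <- match(df$f, lookup$name)  # position of each f value in lookup$name
f_num <- lookup$v[idx]           # numeric equivalents, aligned with the rows of df
cor(f_num, df$v)                 # 1 for this toy data, since v = 2*f_num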
