Suppose I have the following data.table:
player_id prestige_score_0 prestige_score_1 prestige_score_2 prestige_score_3 prestige_score_4
1: 100284 0.0001774623 2.519792e-03 5.870781e-03 7.430179e-03 7.937716e-03
2: 103819 0.0001774623 1.426482e-03 3.904329e-03 5.526974e-03 6.373850e-03
3: 100656 0.0001774623 2.142518e-03 4.221423e-03 5.822705e-03 6.533448e-03
4: 104745 0.0001774623 1.084913e-03 3.061197e-03 4.383649e-03 5.091851e-03
5: 104925 0.0001774623 1.488457e-03 2.926728e-03 4.360301e-03 5.068171e-03
I want to find the difference between the values in each pair of adjacent columns, starting from prestige_score_0.
For one step it should look like this: df[, prestige_score_0] - df[, prestige_score_1]
How can I do this in data.table (and save these differences as a data.table, keeping player_id as well)?
This is how you can do this in a tidy way:
# make it tidy
df2 <- melt(df,
            id.vars = "player_id",
            variable.name = "column_name",
            value.name = "prestige_score")
# extract numbers from column names
df2[, score_number := as.numeric(gsub("prestige_score_", "", column_name))]
# compute differences by player
df2[, diff := prestige_score - shift(prestige_score, n = 1L, type = "lead"),
by = player_id]
# if necessary, reshape back to original format
dcast(df2, player_id ~ score_number, value.var = c("prestige_score", "diff"))
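If you only want player_id plus the differences back in wide form (one column per step), a possible follow-up sketch that drops the NA produced for the last score by the lead shift:
diffs_wide <- dcast(df2[!is.na(diff)], player_id ~ score_number, value.var = "diff")
The resulting columns 0 to 3 then hold the difference between each score and the next one.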
You can subtract a shifted version of the whole data.table from itself:
dt = data.table(id = c("A", "B"), matrix(rexp(10, rate = .1), ncol = 5))
dt_shift = data.table(id = dt[, id], dt[, 2:(ncol(dt) - 1)] - dt[, 3:ncol(dt)])
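A sketch applying the same idea to the question's df, assuming it holds player_id plus prestige_score_0 ... prestige_score_4:
score_cols <- grep("^prestige_score_", names(df), value = TRUE)
n <- length(score_cols)
diffs <- data.table(player_id = df$player_id,
                    df[, score_cols[-n], with = FALSE] - df[, score_cols[-1], with = FALSE])
setnames(diffs, c("player_id", paste0("diff_", 0:(n - 2), "_", 1:(n - 1))))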
You could use a for loop:
# one iteration per adjacent pair of prestige_score columns
for (i in 1:(ncol(df) - 2)) {
  df[, paste0("diff_", i - 1, "_", i) :=
       get(paste0("prestige_score_", i - 1)) - get(paste0("prestige_score_", i))]
}
This might not be the most efficient if you have a lot of columns though.
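If there are many columns, a set()-based loop is a common data.table idiom that avoids the per-iteration overhead of [.data.table; a minimal sketch under the same column layout as above:
score_cols <- grep("^prestige_score_", names(df), value = TRUE)
for (i in seq_len(length(score_cols) - 1)) {
  set(df, j = paste0("diff_", i - 1, "_", i),
      value = df[[score_cols[i]]] - df[[score_cols[i + 1]]])
}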
I currently solve this with a workaround, but I would like to know if there is a more efficient way.
See below for exemplary data:
library(data.table)
library(anytime)
library(tidyverse)
library(dplyr)
library(batchtools)
# Lookup table
Date <- c("1990-03-31", "1990-06-30", "1990-09-30", "1990-12-31",
"1991-03-31", "1991-06-30", "1991-09-30", "1991-12-31")
period <- c(1:8)
metric_1 <- rep(c(2000, 3500, 4000, 100000), 2)
metric_2 <- rep(c(200, 350, 400, 10000), 2)
id <- 22
dt <- setDT(data.frame(Date, period, id, metric_1, metric_2))
# Fill and match table 2
Date_2 <- c("1990-08-30", "1990-02-28", "1991-07-31", "1991-09-30", "1991-10-31")
random <- c(10:14)
id_2 <- c(22,33,57,73,999)
dt_fill <- setDT(data.frame(Date_2, random, id_2))
# Convert date columns to type date
dt[ , Date := anydate(Date)]
dt_fill[ , Date_2 := anydate(Date_2)]
Now for the data wrangling. I want to get the most recent preceding data from dt (aka lookup table) into dt_fill. I do this with an easy 1-line rolling join like this.
# Rolling join
dt_res <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE]
# if not all id_2 present in id column in table 1, we get rows with NA
# I want to only retain the rows with id's that were originally in the lookup table
Then I end up with a bunch of rows where the newly added columns are filled with NAs, which I would like to get rid of. I do this with a semi-join. I found the older solutions quite hard to understand and settled for the batchtools::sjoin() function, which is essentially also a one-liner.
dt_final <- sjoin(dt_res, dt, by = "id")
Is there a more efficient way of getting a clean result from a rolling join than doing the rolling join first and then a semi-join with the original data set? It is also not very fast for very long data sets. Thanks!
Essentially, I found two approaches that are both viable solutions.
Solution 1
The first, proposed by lil_barnacle, is an elegant one-liner that reads as follows:
# Rolling join with the nomatch argument set to 0
dt_res <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE, nomatch=0]
Original approach
Adding the nomatch argument and setting it to 0 (nomatch = 0) is equivalent to doing the rolling join first and the semi-join afterwards:
# Rolling join without specified nomatch argument
dt_res <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE]
# Semi-join required
dt_final <- sjoin(dt_res, dt, by = "id")
Solution 2
Second, the solution I came up with was to 'align' both data sets before the rolling join by filtering on the join variable, like so:
# Aligning data sets by filtering accd. to joined 'variable'
dt_fill <- dt_fill[id_2 %in% dt[ , unique(id)]]
# Rolling join without need to specify nomatch argument
dt_res <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE]
I usually work with dplyr, but I am facing a rather large data set and my approach is very slow. I basically need to filter a df, group it by dates, and count the occurrences within each group.
Sample data (I have already turned everything into a data.table):
library(data.table)
library(dplyr)
set.seed(123)
df <- data.table(startmonth = seq(as.Date("2014-07-01"),as.Date("2014-11-01"),by="months"),
endmonth = seq(as.Date("2014-08-01"),as.Date("2014-12-01"),by="months")-1)
df2 <- data.table(id = sample(1:10, 5, replace = T),
start = sample(seq(as.Date("2014-07-01"),as.Date("2014-10-01"),by="days"),5),
end = df$startmonth + sample(10:90,5, replace = T)
)
#cross joining
res <- setkey(df2[,c(k=1,.SD)],k)[df[,c(k=1,.SD)],allow.cartesian=TRUE][,k:=NULL]
My dplyr approach works but is slow:
res %>% filter(start <= endmonth & end >= startmonth) %>%
  group_by(startmonth, endmonth) %>%
  summarise(countmonth = n())
My data.table knowledge is limited, but I guess we would setkey() on the date columns and use something like res[, ':='(COUNT = .N, IDX = 1:.N), by = .(startmonth, endmonth)] to get the counts by group; however, I'm not sure how the filter fits in.
Appreciate your help!
You could do the counting inside the join:
df2[df, on=.(start <= endmonth, end >= startmonth), allow.cartesian=TRUE, .N, by=.EACHI]
start end N
1: 2014-07-31 2014-07-01 1
2: 2014-08-31 2014-08-01 4
3: 2014-09-30 2014-09-01 5
4: 2014-10-31 2014-10-01 3
5: 2014-11-30 2014-11-01 3
or add it as a new column in df:
df[, n :=
df2[.SD, on=.(start <= endmonth, end >= startmonth), allow.cartesian=TRUE, .N, by=.EACHI]$N
]
How it works. The syntax is x[i, on=, allow.cartesian=, j, by=.EACHI]. Each row of i is used to look up values in x. The symbol .EACHI indicates that the aggregation (j=.N) will be done for each row of i.
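A tiny standalone illustration of by=.EACHI on made-up data, showing the per-row-of-i aggregation:
x <- data.table(g = c("a", "a", "b"), v = 1:3)
i <- data.table(g = c("a", "b"))
x[i, on = .(g), .N, by = .EACHI]
#    g N
# 1: a 2
# 2: b 1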
My question relates to this previously asked question:
Calculating a weighted mean using data.table in R with weights in one of the table columns
In my case, I have different weight columns across the columns I want to aggregate. Let's say I have four columns col_a, col_b, col_c and col_d, and let's assume I want to aggregate col_a and col_b with weights w_1, and col_c and col_d with w_2. Example:
require(data.table)
id <- c(1,1,1,2,2,2)
col_a <- c(123,56,87,987,1003,10)
col_b <- c(17,234,20,88,765,69)
col_c <- c(45,90,543,30,1,543)
col_d <- c(60,43,700,3,88,46)
w_1 <- c(1,1,1,1,1,1)
w_2 <- c(1.5,1,1.2,0.8,1,1)
dt <- data.table(id, col_a, col_b, col_c, col_d, w_1, w_2);dt
Now the desired result would look like this:
data.table(id = c(1, 2),
           col_a = c(weighted.mean(col_a[1:3], w_1[1:3]), weighted.mean(col_a[4:6], w_1[4:6])),
           col_b = c(weighted.mean(col_b[1:3], w_1[1:3]), weighted.mean(col_b[4:6], w_1[4:6])),
           col_c = c(weighted.mean(col_c[1:3], w_2[1:3]), weighted.mean(col_c[4:6], w_2[4:6])),
           col_d = c(weighted.mean(col_d[1:3], w_2[1:3]), weighted.mean(col_d[4:6], w_2[4:6])))
This, I thought, could be accomplished similarly to akrun's answer to this post:
R collapse multiple rows into 1 row using specific function to each column
where I would have the two functions weighted.mean(x, w_1) and weighted.mean(x, w_2) instead of min or median.
Here is how far I got:
colsToKeep <- c("col_a","col_b","col_c","col_d")
dt[, Map(function(x, y) get(x)(y, na.rm = TRUE),
         setNames(rep(c('weighted.mean', 'weighted.mean'), 2), names(.SD)), .SD),
   .SDcols = colsToKeep, by = id]
My question: how can I get the arguments w = w_1 and w = w_2 into the setNames call? Is that even possible?
It could be something like this, too:
colsToKeep <- c("col_a", "col_b", "col_c", "col_d")
colsToW <- c("w_1", "w_1", "w_2", "w_2")
eval(parse(text = paste0("dt[, .(", paste0("w_", colsToKeep, " = weighted.mean(", colsToKeep, ",", colsToW, ")", collapse = ", "), "), by = id]")))
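For reference, with these columns the constructed string parses to roughly the following call:
dt[, .(w_col_a = weighted.mean(col_a, w_1),
       w_col_b = weighted.mean(col_b, w_1),
       w_col_c = weighted.mean(col_c, w_2),
       w_col_d = weighted.mean(col_d, w_2)),
   by = id]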
or this one:
dt[, Map(function(x,y,w) get(x)(y, w, na.rm = TRUE),
setNames(rep('weighted.mean',length(colsToKeep)), paste0("W_", colsToKeep)),
.SD[, ..colsToKeep], .SD[, ..colsToW]),
by = id]
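A possibly simpler variant of the same idea is to loop over explicit column/weight pairs; the pairs object below is a made-up helper, so treat this as a sketch:
pairs <- list(col_a = "w_1", col_b = "w_1", col_c = "w_2", col_d = "w_2")
dt[, Map(function(v, w) weighted.mean(get(v), get(w)), names(pairs), pairs), by = id]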
As mentioned by Roland, you can melt into a long format. The benefit is that, in the long run, you do not have to change the code every time a new column appears. Explanations are inline; you can print mdt to take a look.
# melt into a long format, with the column names in one column and their values in another
mdt <- melt(dt, id.vars = c("id", grep("^w", names(dt), value = TRUE)),
            variable.name = "col", value.name = "colVal")
mdt <- melt(mdt, id.vars = c("id", "col", "colVal"),
            variable.name = "w", value.name = "wVal")
# probably need a programmatic way of building this rather than typing out the columns (see the sketch after this answer)
myPairs <- data.table(rbind(
c(col="col_a", w="w_1"),
c(col="col_b", w="w_1"),
c(col="col_c", w="w_2"),
c(col="col_d", w="w_2")))
#calculate weighted mean according to myPairs and then pivot the table
dcast(mdt[myPairs, on=.(col, w),
weighted.mean(colVal, wVal),
by=.(id, col)],
id ~ col,
value.var="V1")
I am trying to understand why I can't order by a new variable that I create on the same line.
Currently I need to write two lines: one to create the new variable and another to order by it.
Can this be done on a single line in data.table?
DF <- data.table(ID = c(1,2,1,2,1,1,1,1,2), Value = c(1,1,1,1,1,1,1,1,1))
newDF <- DF[order(-Count), .(Count = .N), by = ID]
# Gives error: Error in eval(v, x, parent.frame()) : object 'Count' not found
# Works Correctly
newDF <- DF[, .(Count = .N), by = ID]
newDF <- newDF[order(-Count)]
> newDF
ID Count
1: 1 6
2: 2 3
You can simply chain the two operations on a single line. Within a single call, the i expression order(-Count) is evaluated before j creates Count, which is why the one-step version errors; chaining runs the aggregation first and then orders its result:
DF[, .(Count = .N), by = ID][order(-Count)]
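If you prefer sorting by reference, a near-equivalent sketch uses setorder() on the chained result:
newDF <- setorder(DF[, .(Count = .N), by = ID], -Count)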
I have a data.table with quite a few columns. I need to loop through them and create new columns using some condition. Currently I am writing a separate line of code for each column. Let me explain with an example. Consider this sample data:
set.seed(71)
DT <- data.table(town = rep(c('A','B'), each=10),
tc = rep(c('C','D'), 10),
one = rnorm(20,1,1),
two = rnorm(20,2,1),
three = rnorm(20,3,1),
four = rnorm(20,4,1),
five = rnorm(20,5,2),
six = rnorm(20,6,2),
seven = rnorm(20,7,2),
total = rnorm(20,28,3))
For each of the columns from one to total, I need to create 4 new columns, i.e. mean, sd, uplimit and lowlimit, for a 2-sigma outlier calculation. I am doing this with:
DTnew <- DT[, as.list(unlist(lapply(.SD, function(x)
              list(mean = mean(x), sd = sd(x),
                   uplimit = mean(x) + 1.96*sd(x),
                   lowlimit = mean(x) - 1.96*sd(x))))),
            by = .(town, tc)]
I then merge this DTnew data.table with my DT:
DTmerge <- merge(DT, DTnew, by= c('town','tc'))
Now, to flag the outliers, I am writing a separate line of code for each variable:
DTAoutlier <- DTmerge[ ,one.Aoutlier := ifelse (one >= one.lowlimit & one <= one.uplimit,0,1)]
DTAoutlier <- DTmerge[ ,two.Aoutlier := ifelse (two >= two.lowlimit & two <= two.uplimit,0,1)]
DTAoutlier <- DTmerge[ ,three.Aoutlier := ifelse (three >= three.lowlimit & three <= three.uplimit,0,1)]
Can someone help simplify this code so that I don't have to write a separate line of code for each outlier column? In this example we have only 8 variables, but what if we had 100? Would we end up writing 100 lines of code? Can this be done using a for loop? How?
In general, how can we add new columns to a data.table while retaining the original columns? For example, below I am taking the log of columns 3 to 10. If I don't create a new DTlog, it overwrites the original columns in DT. How can I retain the original columns in DT and have the new columns as well?
DTlog <- DT[,(lapply(.SD,log)),by = .(town,tc),.SDcols=3:10]
Look forward to some expert suggestions.
We can do this using :=. We subset the column names that are not the grouping variables ('nm'), create a vector of names for the new columns using outer ('nm1'), and then use the OP's code, unlist the output, and assign (:=) it to 'nm1' to create the new columns.
nm <- names(DT)[-(1:2)]
nm1 <- c(outer(c("Mean", "SD", "uplimit", "lowlimit"), nm, paste, sep = "_"))
DT[, (nm1) := unlist(lapply(.SD, function(x) {
       Mean <- mean(x)
       SD <- sd(x)
       uplimit <- Mean + 1.96 * SD
       lowlimit <- Mean - 1.96 * SD
       list(Mean, SD, uplimit, lowlimit)
     }), recursive = FALSE),
   by = .(town, tc)]
The second part of the question involves a logical comparison between columns. One option is to subset the initial columns and the 'lowlimit' and 'uplimit' columns separately and compare them (as these have the same dimensions) to get a logical output, which can be coerced to binary with +. Then assign it to the original dataset to create the outlier columns.
m1 <- +(DT[, nm, with = FALSE] >= DT[, paste("lowlimit", nm, sep = "_"), with = FALSE] &
        DT[, nm, with = FALSE] <= DT[, paste("uplimit", nm, sep = "_"), with = FALSE])
DT[, paste(nm, "Aoutlier", sep = ".") := as.data.frame(m1)]
Or instead of comparing data.tables, we can also use a for loop with set (which would be more efficient)
nm2 <- paste(nm, "Aoutlier", sep=".")
DT[, (nm2) := NA_integer_]
for(j in nm){
set(DT, i = NULL, j = paste(j, "Aoutlier", sep="."),
value = as.integer(DT[[j]] >= DT[[paste("lowlimit", j, sep="_")]] &
DT[[j]] <= DT[[paste("uplimit", j, sep="_")]]))
}
The 'log' columns can also be created with :=
DT[,paste(nm, "log", sep=".") := lapply(.SD,log),by = .(town,tc),.SDcols=nm]
Your data should probably be in long format:
m = melt(DT, id=c("town","tc"))
Then just write your test once
m[,
is_outlier := +(abs(value-mean(value)) > 1.96*sd(value))
, by=.(town, tc, variable)]
I see no outliers in this data (according to the given definition of outlier):
m[, .N, by=is_outlier] # this is a handy alternative to table()
# is_outlier N
# 1: 0 160
How it works
melt keeps the id columns and stacks all the rest into variable (the column names) and value (the column contents).
+x does the same thing as as.integer(x), coercing TRUE/FALSE to 1/0.
If you really like your data in wide format, though:
vjs = setdiff(names(DT), c("town","tc"))
DT[,
paste0(vjs,".out") := lapply(.SD, function(x) +(abs(x-mean(x)) > 1.96*sd(x)))
, by=.(town, tc), .SDcols=vjs]
For completeness, it should be noted that dplyr's mutate_each provides a handy way of tackling such problems:
library(dplyr)
result <- DT %>%
  group_by(town, tc) %>%
  mutate_each(funs(mean, sd,
                   uplimit = (mean(.) + 1.96 * sd(.)),
                   lowlimit = (mean(.) - 1.96 * sd(.)),
                   Aoutlier = as.integer(. >= mean(.) - 1.96 * sd(.) &
                                         . <= mean(.) + 1.96 * sd(.))),
              -town, -tc)
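Note that mutate_each() has since been deprecated in dplyr; a rough sketch of the same idea with across() follows (the generated columns get names like one_mean and one_Aoutlier, so it is not a drop-in replacement for the exact names above):
library(dplyr)
result <- DT %>%
  group_by(town, tc) %>%
  mutate(across(one:total,
                list(mean = mean,
                     sd = sd,
                     uplimit = ~ mean(.x) + 1.96 * sd(.x),
                     lowlimit = ~ mean(.x) - 1.96 * sd(.x),
                     Aoutlier = ~ as.integer(.x >= mean(.x) - 1.96 * sd(.x) &
                                             .x <= mean(.x) + 1.96 * sd(.x)))))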