Subset by group with data.table compared to aggregating a data.table - r

This is a follow-up question to Subset by group with data.table, using the same data.table:
library(data.table)
bdt <- as.data.table(baseball)
# Aggregating and losing information on other columns
dt1 <- bdt[ , .(max_g = max(g)), by = id]
# Aggregating and keeping information on other columns
dt2 <- bdt[bdt[, .I[g == max(g)], by = id]$V1]
Why do dt1 and dt2 differ in number of rows?
Isn't dt2 supposed to give the same result, just without losing the respective information in the other columns?

As @Frank pointed out:
bdt[ , .(max_g = max(g)), by = id] gives you the maximum value per group, while
bdt[bdt[ , .I[g == max(g)], by = id]$V1] identifies all rows that attain this maximum.
See What is the difference between arg max and max? for a mathematical explanation and try this slim version in R:
library(data.table)
bdt <- as.data.table(baseball)
dt <- bdt[id == "woodge01"][order(-g)]
dt[ , .(max = max(g)), by = id]
dt[ dt[ , .I[g == max(g)], by = id]$V1 ]
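Since the baseball data set ships with the plyr package, a small made-up table (purely illustrative) shows the same distinction without that dependency, including what happens when the group maximum is attained more than once; the outputs in the comments are roughly what data.table prints:
library(data.table)
# toy data (hypothetical): player "a" reaches its maximum g twice
toy <- data.table(id = c("a", "a", "a", "b", "b"),
                  g  = c(10, 25, 25, 7, 30))
# one row per group: the maximum itself
toy[ , .(max_g = max(g)), by = id]
#    id max_g
# 1:  a    25
# 2:  b    30
# all rows attaining the maximum: the tie in group "a" yields two rows
toy[ toy[ , .I[g == max(g)], by = id]$V1 ]
#    id  g
# 1:  a 25
# 2:  a 25
# 3:  b 30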

Related

R data.table merge duplicate rows and concatenate unique values

I am trying to merge duplicate rows using data.table aggregate, but I need to figure out how to concatenate the non-duplicated columns as strings in the output:
dt = data.table(
  ensembl_id = c("ENSRNOG00000055068", "ENSRNOG00000055068", "ENSRNOG00000055068"),
  hsapiens_ensembl_id = c("ENSG00000196262", "ENSG00000236334", "ENSG00000263353"),
  chr = c(14, 14, 14),
  start = c(22706901, 22706901, 22706901),
  hsapiens_symbol = c("PPIA", "PPIAL4G", "PPIAL4A"),
  hsapiens_chr = c(7, 1, 1)
)
dt[, lapply(.SD, paste(...,sep=",")), by=ensembl_id] # <- need magic join/paste function
desired output:
ensembl_id hsapiens_ensembl_id chr start hsapiens_symbol hsapiens_chr
1: ENSRNOG00000055068 ENSG00000196262,ENSG00000236334,ENSG00000263353 14 22706901 PPIA,PPIAL4G,PPIAL4A 7,1,1
We can use collapse with paste instead of sep, and also include 'chr' and 'start' as grouping variables:
library(data.table)
dt[, lapply(.SD, paste, collapse=","), by = .(chr, start, ensembl_id)]
Or more compactly, with toString
dt[, lapply(.SD, toString), by = .(chr, start, ensembl_id)]
If there are duplicates, get the unique values and then paste:
dt[, lapply(.SD, function(x) toString(unique(x))), by = .(chr, start, ensembl_id)]
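For reference, a quick look at the collapsed columns of this last variant on the example dt (res is just an illustrative name) shows the ", " separator used by toString and the de-duplicated hsapiens_chr:
res <- dt[, lapply(.SD, function(x) toString(unique(x))), by = .(chr, start, ensembl_id)]
res$hsapiens_ensembl_id
# [1] "ENSG00000196262, ENSG00000236334, ENSG00000263353"
res$hsapiens_chr
# [1] "7, 1"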

consecutively subtracting columns in data.table

Suppose I have the following data.table:
player_id prestige_score_0 prestige_score_1 prestige_score_2 prestige_score_3 prestige_score_4
1: 100284 0.0001774623 2.519792e-03 5.870781e-03 7.430179e-03 7.937716e-03
2: 103819 0.0001774623 1.426482e-03 3.904329e-03 5.526974e-03 6.373850e-03
3: 100656 0.0001774623 2.142518e-03 4.221423e-03 5.822705e-03 6.533448e-03
4: 104745 0.0001774623 1.084913e-03 3.061197e-03 4.383649e-03 5.091851e-03
5: 104925 0.0001774623 1.488457e-03 2.926728e-03 4.360301e-03 5.068171e-03
And I want to find the difference between the values in consecutive columns, starting from prestige_score_0.
In one step it should look like this: df[,prestige_score_0] - df[,prestige_score_1]
How can I do it in data.table (and save these differences as a data.table, keeping player_id as well)?
This is how you can do this in a tidy way:
# make it tidy
df2 <- melt(df,
            id.vars = "player_id",
            variable.name = "column_name",
            value.name = "prestige_score")
# extract numbers from column names
df2[, score_number := as.numeric(gsub("prestige_score_", "", column_name))]
# compute differences by player
df2[, diff := prestige_score - shift(prestige_score, n = 1L, type = "lead"),
by = player_id]
# if necessary, reshape back to original format
dcast(df2, player_id ~ score_number, value.var = c("prestige_score", "diff"))
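A self-contained sketch of this pipeline, using a small made-up table in place of the question's data (the values and the two-player layout are illustrative only):
library(data.table)
df <- data.table(player_id        = c(100284, 103819),
                 prestige_score_0 = c(0.10, 0.20),
                 prestige_score_1 = c(0.15, 0.22),
                 prestige_score_2 = c(0.30, 0.25))
# long format: one row per player and score column
df2 <- melt(df,
            id.vars = "player_id",
            variable.name = "column_name",
            value.name = "prestige_score")
# numeric index extracted from the column names
df2[, score_number := as.numeric(gsub("prestige_score_", "", column_name))]
# difference between each score and the next one for the same player
df2[, diff := prestige_score - shift(prestige_score, n = 1L, type = "lead"),
    by = player_id]
# back to wide format: prestige_score_* and diff_* columns side by side
dcast(df2, player_id ~ score_number, value.var = c("prestige_score", "diff"))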
You can also subtract a shifted version of the whole dt from itself:
# example data: an id column followed by 5 numeric columns
dt = data.table(id = c("A", "B"), matrix(rexp(10, rate = .1), ncol = 5))
# each numeric column minus the one to its right
dt_shift = data.table(id = dt[, id], dt[, 2:(ncol(dt)-1)] - dt[, 3:ncol(dt)])
You could use a for loop (assuming df is a data.frame whose columns are player_id followed by the prestige_score_* columns):
for(i in 1:(ncol(df) - 2)) {
  df[, paste0("diff_", i - 1, "_", i)] <- df[, paste0("prestige_score_", i - 1)] -
    df[, paste0("prestige_score_", i)]
}
This might not be the most efficient if you have a lot of columns though.

Removing infrequent rows in a data frame

Let's say I have a following very simple data frame:
a <- rep(5,30)
b <- rep(4,80)
d <- rep(7,55)
df <- data.frame(Column = c(a,b,d))
What would be the most generic way to remove all rows whose value appears fewer than 60 times?
I know you could say "in this case it's just a", but in my real data there are many more frequencies, so I wouldn't want to specify them one by one.
I was thinking of writing a loop such that if the length() of a group is smaller than 60, those rows get deleted, but perhaps you have other ideas. Thanks in advance.
A solution using dplyr.
library(dplyr)
df2 <- df %>%
  group_by(Column) %>%
  filter(n() >= 60)
Or a solution in base R. Note that split() orders its groups by the sorted group values, so the kept values are taken from the names of that result to keep them aligned:
targetID <- sapply(split(df, df$Column), function(x) nrow(x) >= 60)
df2 <- df[df$Column %in% names(targetID)[targetID], , drop = FALSE]
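A quick check on the result should confirm that only the value appearing at least 60 times is kept:
table(df2$Column)   # only the value 4 remains, with its 80 rows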
We create a frequency table and then subset the rows based on the 'count' of values in 'Column'
tbl <- table(df$Column) >=60
subset(df, Column %in% names(tbl)[tbl])
Or with ave from base R
df[with(df, ave(Column, Column, FUN = length)>=60),]
Or we use data.table
library(data.table)
setDT(df)[, .SD[.N >= 60], Column]
Or another option with data.table is .I
setDT(df)[df[, .I[.N >=60], Column]$V1]
If there is more than one grouping column, place them in a list (or, more compactly, in .()):
setDT(df)[df[, .I[.N >=60], by = .(Column1, Column2)]$V1]
If there are many columns, we can also pass them as a character vector or as an object holding the column names:
colnms <- paste0("Column", 1:5)
setDT(df)[df[, .I[.N >=60], by = c(colnms)]$V1]
Using data.table
library(data.table)
setDT(df)
df[Column %in% df[, .N, by = Column][N >= 60, Column]]
There is also a variant to Eric Watt's answer which uses a join instead of %in%:
library(data.table)
setDT(df)
df[df[, .N, by = Column][N >= 60, .(Column)], on = "Column"]
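A quick sanity check (just counting rows) suggests both formulations keep the same 80 rows of the example data:
nrow(df[Column %in% df[, .N, by = Column][N >= 60, Column]])        # 80
nrow(df[df[, .N, by = Column][N >= 60, .(Column)], on = "Column"])  # 80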

r - computing statistics for variables passed on to setDT

Is there a way to pass on variables, for which a statistic needs to be computed, to setDT?
The example below illustrates my issue. Only A yields the desired result. As I would like to change var into a vector and pass its elements to setDT via a loop, A is not an option.
I also prefer not using sqldf.
col1 <- c('Group 1','Group 1','Group 2','Group 2')
col2 <- c(0.2,0.3,0.5,0.6)
col3 <- c(0.1,0.2,0.3,0.4)
X <- data.frame(col1,col2,col3)
var <- "col2"
A <- setDT(X)[, list(nbrObs = .N, average = mean(col2)), by = .(col1)]
B <- setDT(X)[, list(nbrObs = .N, average = mean(X[[var]])), by = .(col1)]
C <- setDT(X)[, list(nbrObs = .N, average = mean(var)), by = .(col1)]
We can either pass on the variables by specifying them in .SDcols and then apply the function to the Subset of Data.table (.SD). If there are multiple variables, make sure to loop through the .SD, i.e. lapply(.SD, mean).
setDT(X)[, list(nbrObs = .N, average = mean(.SD[[1L]])), by = .(col1), .SDcols= var]
Or another option would be to convert the string to a symbol with as.name or as.symbol and evaluate it (eval).
setDT(X)[, list(nbrObs = .N, average = mean(eval(as.name(var)))), by = .(col1)]
Or yet another option is using get to return the value.
setDT(X)[, list(nbrObs = .N, average = mean(get(var))), by = .(col1)]
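If several such variables need the same statistic, the lapply(.SD, mean) form mentioned above handles them in a single call; vars below is an illustrative name, not part of the original question:
vars <- c("col2", "col3")
setDT(X)[, c(list(nbrObs = .N), lapply(.SD, mean)), by = .(col1), .SDcols = vars]
#       col1 nbrObs col2 col3
# 1: Group 1      2 0.25 0.15
# 2: Group 2      2 0.55 0.35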

Changing multiple Columns in data.table r

I am looking for a way to manipulate multiple columns in a data.table in R. As I have to address the columns dynamically as well as a second input, I wasn't able to find an answer.
The idea is to index two or more series on a certain date by dividing all values by the value on that date, e.g.:
set.seed(132)
# simulate some data
dt <- data.table(date = seq(from = as.Date("2000-01-01"), by = "days", length.out = 10),
                 X1 = cumsum(rnorm(10)),
                 X2 = cumsum(rnorm(10)))
# set a date for the index
indexDate <- as.Date("2000-01-05")
# get the column names to be able to select the columns dynamically
cols <- colnames(dt)
cols <- cols[substr(cols, 1, 1) == "X"]
Part 1: The Easy data.frame/apply approach
df <- as.data.frame(dt)
# get the right rownumber for the indexDate
rownum <- max((1:nrow(df))*(df$date==indexDate))
# use apply to iterate over all columns
df[, cols] <- apply(df[, cols],
                    2,
                    function(x, i){x / x[i]}, i = rownum)
Part 2: The (fast) data.table approach
So far my data.table approach looks like this:
for(nam in cols) {
  div <- as.numeric(dt[rownum, nam, with = FALSE])
  dt[ ,
      nam := dt[, nam, with = FALSE] / div,
      with = FALSE]
}
In particular, all the with = FALSE calls do not look very data.table-like.
Do you know any faster/more elegant way to perform this operation?
Any idea is greatly appreciated!
One option would be to use set, as this involves multiple columns. The advantage of using set is that it avoids the overhead of [.data.table, which makes it faster.
library(data.table)
for(j in cols){
  set(dt, i = NULL, j = j, value = dt[[j]] / dt[[j]][rownum])
}
Or a slightly slower option would be
dt[, (cols) :=lapply(.SD, function(x) x/x[rownum]), .SDcols=cols]
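Either way, a quick check should confirm the indexing worked: on the index date each X* column equals exactly 1 (the ..cols prefix assumes a reasonably recent data.table version):
dt[date == indexDate, ..cols]
#    X1 X2
# 1:  1  1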
Following up on your code and the answer given by akrun, I would recommend using .SDcols to extract the numeric columns and lapply to loop through them. Here's how I would do it:
index <- as.Date("2000-01-05")
rownum <- max((dt$date == index) * (1:nrow(dt)))
dt[, lapply(.SD, function(i) i / i[rownum]), .SDcols = is.numeric]
Using .SDcols could be especially useful if you have a large number of numeric columns and you'd like to apply this division to all of them.
