Transposition with aggregation of data.table [duplicate] - r

Suppose we have a data.table like this:
TYPE KEY VALUE
1: 1 A 10
2: 1 B 10
3: 1 A 40
4: 2 B 20
5: 2 B 40
I need to generate the following aggregated data.table, where each number is the sum of VALUE for the given TYPE and KEY:
TYPE A B
1: 1 50 10
2: 2 0 60
In a real-life problem there are many different values of KEY, so it's impossible to hardcode them.
How can I achieve this?

One way I could think of is:
# to ensure all levels are present when using `tapply`
DT[, KEY := factor(KEY, levels=unique(KEY))]
DT[, as.list(tapply(VALUE, KEY, sum)), by = TYPE]
# TYPE A B
# 1: 1 50 10
# 2: 2 NA 60
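A more idiomatic route here (in the spirit of the duplicate target) is data.table's own dcast. This is a minimal sketch of my own, not part of the original answer; fill = 0 yields the zeros from the desired output instead of the NA above:
library(data.table)
DT <- data.table(TYPE = c(1, 1, 1, 2, 2),
                 KEY = c("A", "B", "A", "B", "B"),
                 VALUE = c(10, 10, 40, 20, 40))
# one column per KEY, summing VALUE; fill = 0 for absent TYPE/KEY combinations
dcast(DT, TYPE ~ KEY, value.var = "VALUE", fun.aggregate = sum, fill = 0)
#    TYPE  A  B
# 1:    1 50 10
# 2:    2  0 60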


Extract the n highest value by group with data.table in R [duplicate]

Consider the following:
DT <- data.table(a = sample(1:2), b = sample(1:100, 10), d = rnorm(10))
How can I display the whole DT containing only the 3 highest values of b for each level of a?
I found an approximate solution here:
> DT[order(-b), head(b, 3), a]
a V1
1: 2 99
2: 2 83
3: 2 75
4: 1 96
5: 1 71
6: 1 67
However, it only displays the columns a and b (renamed to V1). I would like the same DT but with all the columns and the original column names. Please consider that in practice my DT has many columns; I'm aiming for an elegant solution that doesn't require listing all the column names.
We can order by b and keep the first three rows of .SD within each group:
library(data.table)
DT[order(-b), head(.SD, 3), by = a]
# a b d
#1: 1 100 1.4647474
#2: 1 61 -1.1250266
#3: 1 51 0.9435628
#4: 2 82 0.3302404
#5: 2 72 -0.0219803
#6: 2 55 1.6865777
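On large tables, a common variant (my addition, not part of the original answer) is to collect row numbers with .I and subset once, which avoids materializing .SD for every group:
# .I holds row locations in the original DT, even when i reorders the rows
idx <- DT[order(-b), head(.I, 3), by = a]$V1
DT[idx]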

In data.table in R, how can we create a sequenced indicator variable from the values of two columns? [duplicate]

In the data.table package in R, I am wondering how to create an index that identifies the rows whose values are the same in two columns. For example, for the following data table,
> M <- data.table(matrix(c(2,2,2,2,2,2,2,5,2,5,3,3,3,6), ncol = 2, byrow = T))
> M
V1 V2
1: 2 2
2: 2 2
3: 2 2
4: 2 5
5: 2 5
6: 3 3
7: 3 6
I would like to create a new column that assigns the same index to every row with the same pair of values in the two columns, so that I get something like:
> M
V1 V2 Index
1: 2 2 1
2: 2 2 1
3: 2 2 1
4: 2 5 2
5: 2 5 2
6: 3 3 3
7: 3 6 4
I essentially would like a group counter repeated within each group, as above; is there a nice way to do it?
We can use .GRP after grouping by 'V1' and 'V2':
M[, Index := .GRP, by = .(V1, V2)]
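When identical pairs are always contiguous, as in this example, data.table's rleid() is an alternative worth knowing (my note, not part of the original answer); unlike .GRP it would start a new index if a pair reappeared later:
# run-length id over both columns; matches .GRP's output here because runs are contiguous
M[, Index := rleid(V1, V2)]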

Return subset of rows based on aggregate of keyed rows [duplicate]

I would like to subset a data.table within each group, based on an aggregate function computed over that group's rows. For example, for each key, return all rows whose value is greater than the mean of a field calculated only over the rows in that group. Example:
library(data.table)
t = data.table(Group = rep(c(1:5), each = 5), Detail = c(1:25))
setkey(t, 'Group')
library(foreach)
library(dplyr)
ret = foreach(grp = t[, unique(Group)], .combine = bind_rows, .multicombine = T) %do%
  t[Group == grp & Detail > t[Group == grp, mean(Detail)], ]
# Group Detail
# 1: 1 4
# 2: 1 5
# 3: 2 9
# 4: 2 10
# 5: 3 14
# 6: 3 15
# 7: 4 19
# 8: 4 20
# 9: 5 24
#10: 5 25
The question is: is it possible to code the last two lines succinctly using data.table features? Sorry if this is a repeat; I am also struggling to explain the exact goal well enough for Google/Stack Overflow to find it.
Using the .SD special symbol works. I was not aware of it, thanks:
t[, .SD[Detail > mean(Detail)], by = Group]
This also works, with some performance gains:
indx <- t[, .I[Detail > mean(Detail)], by = Group]$V1
t[indx]
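The .I route computes the qualifying row numbers once and subsets a single time, instead of materializing .SD for each group, which is where the gains tend to come from. A quick way to check this on your own data (a sketch using the microbenchmark package; not part of the original answer):
library(microbenchmark)
microbenchmark(
  sd_way = t[, .SD[Detail > mean(Detail)], by = Group],
  i_way = t[t[, .I[Detail > mean(Detail)], by = Group]$V1]
)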

R converting from short form to long form with counts in the short form [duplicate]

I have a large table (~100M row and 28 columns) in the below format:
ID A B C
1 2 0 1
2 0 1 0
3 0 1 2
4 1 0 0
The columns besides ID (which is unique) give the counts for each type (i.e. A, B, C). I would like to convert this to the long form below.
ID Type
1 A
1 A
1 C
2 B
3 B
3 C
3 C
4 A
I would also like to use data.table (rather than data.frame) given the size of my data set. I checked the reshape2 package for converting between long and short forms, but I am not clear whether the melt function allows counts in the short form as above.
Any suggestions on how I can convert this in a fast and efficient way in R using reshape2 and/or data.table?
Update
You can try the following:
DT[, rep(names(.SD), .SD), by = ID]
# ID V1
# 1: 1 A
# 2: 1 A
# 3: 1 C
# 4: 2 B
# 5: 3 B
# 6: 3 C
# 7: 3 C
# 8: 4 A
It keeps the order you want, too...
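To get the Type column name from the question directly (my tweak, not in the original answer), wrap j in .():
DT[, .(Type = rep(names(.SD), .SD)), by = ID]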
You can try the following. I've never used expandRows on what would become ~ 300 million rows, but it's basically rep, so it shouldn't be slow.
This uses melt + expandRows from my "splitstackshape" package. It works with data.frames or data.tables, so you might as well use data.table for the faster melting...
library(reshape2)
library(splitstackshape)
expandRows(melt(mydf, id.vars = "ID"), "value")
# The following rows have been dropped from the input:
#
# 2, 3, 5, 8, 10, 12
#
# ID variable
# 1 1 A
# 1.1 1 A
# 4 4 A
# 6 2 B
# 7 3 B
# 9 1 C
# 11 3 C
# 11.1 3 C
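For completeness, a pure data.table rendering of the same idea (my sketch, not from the original answers): melt to long, then repeat each row by its count, so zero counts drop out on their own.
library(data.table)
long <- melt(DT, id.vars = "ID", variable.name = "Type")
long <- long[rep(seq_len(nrow(long)), long$value)]  # repeat each row 'value' times
long[, value := NULL]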

Joining tables with identical (non-keyed) column names in R data.table

How do you deal with identically named, non-key columns when joining data.tables? I am looking for the equivalent of SQL's table.field notation.
For instance, let's say I have a table DT that is repopulated with new data in column v every time period, and a table DT_HIST that stores the entries from previous time periods (t). I want to find the difference between the current and previous time period's v for each x.
In this case: DT is time period 3, and DT_HIST has time periods 1 and 2:
DT <- data.table(x=c(1,2,3,4),v=c(20,20,35,30))
setkey(DT,x)
DT_HIST <- data.table(x=rep(seq(1,4,1),2),v=c(40,40,40,40,30,25,45,40),t=c(rep(1,4),rep(2,4)))
setkey(DT_HIST,x)
> DT
x v
1: 1 20
2: 2 20
3: 3 35
4: 4 30
> DT_HIST
x v t
1: 1 40 1
2: 1 30 2
3: 2 40 1
4: 2 25 2
5: 3 40 1
6: 3 45 2
7: 4 40 1
8: 4 40 2
I would like to join DT with DT_HIST[t==1,] on x and calculate the difference in v.
Just joining the tables results in columns v and v.1.
> DT[DT_HIST[t==2],]
x v v.1 t
1: 1 20 30 2
2: 2 20 25 2
3: 3 35 45 2
4: 4 30 40 2
However, I can't find a way to refer to the different v columns when doing the join.
> DT[DT_HIST[t==2],list(delta=v-v.1)]
Error in `[.data.table`(DT, DT_HIST[t == 2], list(delta = v - v.1)) :
object 'v.1' not found
> DT[DT_HIST[t==2],list(delta=v-v)]
x delta
1: 1 0
2: 2 0
3: 3 0
4: 4 0
If this is a duplicate, I apologize. I searched and couldn't find a similar question.
Also, I realize that I can simply rename the columns after joining and then run my desired expression, but I want to know whether I'm approaching this in completely the wrong way.
You can use i.colname to access a column of the data.table given in i. I see you're using an old data.table version; there have been a few changes since then: duplicated join column names now get an i. prefix instead of a numeric postfix (making them consistent with the i. access of joined columns), and there is no by-without-by by default anymore.
In the latest version (1.9.3), this is what you get:
DT[DT_HIST[t==2],list(delta = v - i.v)]
# delta
#1: -10
#2: -5
#3: -10
#4: -10
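If you also want to keep the join column alongside the difference (my extension of the answer's one-liner, under the same up-to-date-version assumption):
DT[DT_HIST[t == 2], .(x, delta = v - i.v)]
#    x delta
# 1: 1   -10
# 2: 2    -5
# 3: 3   -10
# 4: 4   -10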
