Extract the n highest values by group with data.table in R [duplicate]

Consider the following:
DT <- data.table(a = sample(1:2), b = sample(1:100, 10), d = rnorm(10))
How can I display the whole DT, keeping only the rows with the 3 highest values of b for each level of a?
I found an approximate solution here:
> DT[order(-b), head(b, 3), a]
a V1
1: 2 99
2: 2 83
3: 2 75
4: 1 96
5: 1 71
6: 1 67
However, it only displays the columns a and b (the latter named V1). I would like to obtain the same DT but with all the columns and the original column names. Please consider that in practice my DT has many columns. I'm aiming for an elegant solution that doesn't require listing all the column names.

We can order the rows by descending b in i, then take the top 3 rows of .SD within each group of a.
library(data.table)
DT[order(-b),head(.SD, 3),a]
# a b d
#1: 1 100 1.4647474
#2: 1 61 -1.1250266
#3: 1 51 0.9435628
#4: 2 82 0.3302404
#5: 2 72 -0.0219803
#6: 2 55 1.6865777
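For a reproducible check, here is a minimal sketch of the same idiom. The seed and the sample table are assumptions made for illustration (a is built with rep() so that both groups have five rows), and by = a is spelled out, but the call is otherwise the one above.

```r
library(data.table)

set.seed(42)  # assumed seed, only to make the sample reproducible
DT <- data.table(a = rep(1:2, each = 5),
                 b = sample(1:100, 10),
                 d = rnorm(10))

# Sort all rows by descending b, then keep the first 3 rows of each group.
# .SD holds every column except the grouping column, so no names are listed.
top3 <- DT[order(-b), head(.SD, 3), by = a]
top3
```

The result keeps the original column names (a, b, d) because .SD is returned as is rather than a single unnamed vector.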

Related

Repeat a vector on a dataframe

I want to add a column to my data.table based on a vector. However, my data.table has 20 rows and my vector has 7 values. I want the data.table to be repeated 7 times so that each block of 20 rows gets one value from the vector. This might be simple, but I can't work out how to do it.
Some sample data -
library(data.table)
set.seed(9901)
#### create the sample variables for creating the data
group <- c(1:7)
brn <- sample(1:10, 20, replace = TRUE)
period <- c(101:120)
df1 <- data.table(cbind(brn,period))
So in this case I want to add a column group. The data.table would then have 140 rows: 20 rows for group 1, then 20 rows for group 2, and so on.
Apparently you want this:
df1[CJ(group, period), on = .(period)]
# brn period group
# 1: 3 101 1
# 2: 9 102 1
# 3: 9 103 1
# 4: 5 104 1
# 5: 5 105 1
# ---
#136: 9 116 7
#137: 7 117 7
#138: 10 118 7
#139: 2 119 7
#140: 7 120 7
CJ creates a data.table containing the Cartesian product (every combination) of the vectors passed to it. This data.table is then joined with df1 on the column given in on.
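The two steps can be inspected separately. This is a minimal sketch with made-up brn values (an assumption; the question draws them with sample()), and the intermediate name grid is hypothetical.

```r
library(data.table)

group  <- 1:7
period <- 101:120
df1 <- data.table(brn = rep(1:10, times = 2), period = period)  # assumed values

# Step 1: every combination of group and period (7 * 20 = 140 rows).
grid <- CJ(group = group, period = period)

# Step 2: for each row of grid, look up the matching row of df1 by period.
res <- df1[grid, on = .(period)]
res
```

Because each period occurs exactly once in df1, every row of grid finds exactly one match, so the result has one row per combination.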
I would (1) pair each value of group with a full copy of df1, collecting the resulting datasets in a list, and (2) bind them together:
AllLists<-apply(as.data.frame(group),1,function(x) cbind(x,df1))
do.call( "rbind",AllLists)
A solution with data.table. Is that what you are looking for?
library(data.table)
df2 <- df1[rep(1:nrow(df1), times = 7),
][,group := rep(group, each = 20)]
But Roland's solution in the comments is definitely more elegant.

In data.table in R, how can we create a sequenced indicator variable by the values of two columns? [duplicate]

In the data.table package in R, for a given data table, I am wondering how an indicator index can be created for the values that are the same in two columns. For example, for the following data table,
> M <- data.table(matrix(c(2,2,2,2,2,2,2,5,2,5,3,3,3,6), ncol = 2, byrow = T))
> M
V1 V2
1: 2 2
2: 2 2
3: 2 2
4: 2 5
5: 2 5
6: 3 3
7: 3 6
I would like to create a new column that numbers the distinct combinations of values in the two columns, so that rows with the same pair of values share the same index, like this:
> M
V1 V2 Index
1: 2 2 1
2: 2 2 1
3: 2 2 1
4: 2 5 2
5: 2 5 2
6: 3 3 3
7: 3 6 4
I essentially would like to repeat values of .N above, is there a nice way to do it?
We can use .GRP after grouping by 'V1' and 'V2'
M[, Index := .GRP, .(V1, V2)]
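As a self-contained sketch of the above, using the table from the question: .GRP is a group counter, so every row of the first (V1, V2) group gets 1, every row of the second gets 2, and so on, in order of first appearance.

```r
library(data.table)

M <- data.table(matrix(c(2, 2, 2, 2, 2, 2, 2, 5, 2, 5, 3, 3, 3, 6),
                       ncol = 2, byrow = TRUE))

# .GRP is the running number of the current group; := adds it by reference.
M[, Index := .GRP, by = .(V1, V2)]
M
```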

Perform operations on a data frame based on a factor

I'm having a hard time describing this (as the poor question title probably shows), so it's best explained with an example.
Using dplyr, the result of a group_by and summarize gives me a data frame that I want to manipulate further by factor.
As an example, here's a data frame that looks like the result of my dplyr operations:
> df <- data.frame(run=as.factor(c(rep(1,3), rep(2,3))),
group=as.factor(rep(c("a","b","c"),2)),
sum=c(1,8,34,2,7,33))
> df
run group sum
1 1 a 1
2 1 b 8
3 1 c 34
4 2 a 2
5 2 b 7
6 2 c 33
I want to divide sum by a value that depends on run. For example, if I have:
> total <- data.frame(run=as.factor(c(1,2)),
total=c(45,47))
> total
run total
1 1 45
2 2 47
Then my final data frame will look like this:
> df
run group sum percent
1 1 a 1 1/45
2 1 b 8 8/45
3 1 c 34 34/45
4 2 a 2 2/47
5 2 b 7 7/47
6 2 c 33 33/47
Here I inserted the fractions in the percent column by hand to show the operation I want to perform.
I know there is probably some dplyr way to do this with mutate but I can't seem to figure it out right now. How would this be accomplished?
(In base R)
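You can use total as a look-up table to get the total for each run of df. Indexing with the factor df$run uses its integer codes, so this relies on the rows of total being in the same order as the factor levels: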
total[df$run,'total']
[1] 45 45 45 47 47 47
And you simply use it to divide the sum and assign the result to a new column:
df$percent <- df$sum / total[df$run,'total']
run group sum percent
1 1 a 1 0.02222222
2 1 b 8 0.17777778
3 1 c 34 0.75555556
4 2 a 2 0.04255319
5 2 b 7 0.14893617
6 2 c 33 0.70212766
If your "run" values are 1, 2, ..., n, then this will work:
divisor <- c(45,47) # c(45,47,...up to n divisors)
df$percent <- df$sum/divisor[df$run]
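Both base R answers above index by position. A variant that looks the total up by value, and so does not depend on the row order of total, can be sketched with match(). This is an alternative offered under that assumption, not part of the original answers.

```r
df <- data.frame(run   = as.factor(c(rep(1, 3), rep(2, 3))),
                 group = as.factor(rep(c("a", "b", "c"), 2)),
                 sum   = c(1, 8, 34, 2, 7, 33))
total <- data.frame(run = as.factor(c(1, 2)), total = c(45, 47))

# match() returns, for each df$run, the row of total with the same run value.
idx <- match(df$run, total$run)
df$percent <- df$sum / total$total[idx]
df
```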
First, merge the total values into your df:
library(dplyr)
library(magrittr) # provides the %<>% assignment pipe
df2 <- merge(df, total, by = "run")
Then you can call mutate:
df2 %<>% mutate(percent = sum / total)
Convert to data.table in-place, then merge and add new column, again in-place:
library(data.table)
setDT(df)[total, on = 'run', percent := sum/total]
df
# run group sum percent
#1: 1 a 1 0.02222222
#2: 1 b 8 0.17777778
#3: 1 c 34 0.75555556
#4: 2 a 2 0.04255319
#5: 2 b 7 0.14893617
#6: 2 c 33 0.70212766

R converting from short form to long form with counts in the short form [duplicate]

I have a large table (~100M rows and 28 columns) in the format below:
ID A B C
1 2 0 1
2 0 1 0
3 0 1 2
4 1 0 0
The columns other than ID (which is unique) give the counts for each type (i.e. A, B, C). I would like to convert this to the long form below.
ID Type
1 A
1 A
1 C
2 B
3 B
3 C
3 C
4 A
I would also like to use a data.table (rather than a data frame) given the size of my data set. I looked at the reshape2 package for converting between short and long form, but it is not clear to me whether the melt function can handle counts in the short form as above.
Any suggestions on how I can do this conversion quickly and efficiently in R using reshape2 and/or data.table?
Update
You can try the following:
DT[, rep(names(.SD), .SD), by = ID]
# ID V1
# 1: 1 A
# 2: 1 A
# 3: 1 C
# 4: 2 B
# 5: 3 B
# 6: 3 C
# 7: 3 C
# 8: 4 A
Keeps the order you want too...
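A self-contained sketch of the same one-liner follows. The table is the four-row example from the question; naming the result column Type and passing the counts through unlist(.SD) are small additions made here for clarity, not part of the original answer.

```r
library(data.table)

DT <- data.table(ID = 1:4,
                 A = c(2L, 0L, 0L, 1L),
                 B = c(0L, 1L, 1L, 0L),
                 C = c(1L, 0L, 2L, 0L))

# Within each ID, .SD is a one-row table of counts. rep() repeats each
# column name as many times as its count; a count of 0 drops the name.
long <- DT[, .(Type = rep(names(.SD), unlist(.SD))), by = ID]
long
```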
You can try the following. I've never used expandRows on what would become ~ 300 million rows, but it's basically rep, so it shouldn't be slow.
This uses melt + expandRows from my "splitstackshape" package. It works with data.frames or data.tables, so you might as well use data.table for the faster melting....
library(reshape2)
library(splitstackshape)
expandRows(melt(mydf, id.vars = "ID"), "value")
# The following rows have been dropped from the input:
#
# 2, 3, 5, 8, 10, 12
#
# ID variable
# 1 1 A
# 1.1 1 A
# 4 4 A
# 6 2 B
# 7 3 B
# 9 1 C
# 11 3 C
# 11.1 3 C

Transposition with aggregation of data.table [duplicate]

Suppose that we have a data.table like this:
TYPE KEY VALUE
1: 1 A 10
2: 1 B 10
3: 1 A 40
4: 2 B 20
5: 2 B 40
I need to generate the following aggregated data.table (numbers are sums of values for given TYPE and KEY):
TYPE A B
1: 1 50 10
2: 2 0 60
In a real life problem there are a lot of different values for KEY so it's impossible to hardcode them.
How can I achieve that?
One way I could think of is:
# to ensure all levels are present when using `tapply`
DT[, KEY := factor(KEY, levels=unique(KEY))]
DT[, as.list(tapply(VALUE, KEY, sum)), by = TYPE]
# TYPE A B
# 1: 1 50 10
# 2: 2 NA 60
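Note that the tapply approach yields NA where the question asks for 0. As an alternative sketch (a different technique from the answer above), data.table's dcast can aggregate while reshaping and fill empty cells with a chosen value. The column types used to rebuild the example table are assumptions.

```r
library(data.table)

DT <- data.table(TYPE  = c(1, 1, 1, 2, 2),
                 KEY   = c("A", "B", "A", "B", "B"),
                 VALUE = c(10, 10, 40, 20, 40))

# One row per TYPE, one column per KEY; cells are summed VALUEs,
# and TYPE/KEY combinations with no rows are filled with 0.
wide <- dcast(DT, TYPE ~ KEY, value.var = "VALUE",
              fun.aggregate = sum, fill = 0)
wide
```

The KEY values are taken from the data, so nothing has to be hardcoded.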
