R data.table: group first, last and middle rows using rleid

I need to group a data.table using rleid. There should be three groups: one for the first row, one for the last row, and one for all the rows in between.
I know how to group if I have a condition, like
dt[, group := rleid(condition)]
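For reference, rleid() assigns a new run id each time the value changes along a vector, e.g.:
library(data.table)
rleid(c("a", "a", "b", "b", "a"))
# [1] 1 1 2 2 3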

You can construct a vector of length nrow(dt) whose middle nrow(dt) - 2 elements are constant, and apply rleid() to it:
dt[, group := rleid(c(1, rep(2, nrow(dt) - 2), 3))]

You can make a vector of all the same values, then replace individual elements (e.g. the first and last) with something else. The code below creates a column which is 1L for the first row, 3L for the last row, and 2L otherwise.
df[, group := replace(rep(2L, .N), c(1L, .N), c(1L, 3L))]
Another way using rleid is
df[, group := rleid(.I %in% c(1L, .N))]
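For illustration, on a hypothetical 5-row table this gives:
library(data.table)
df <- data.table(x = 1:5)
df[, group := rleid(.I %in% c(1L, .N))][]
#    x group
# 1: 1     1
# 2: 2     2
# 3: 3     2
# 4: 4     2
# 5: 5     3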
You can also do grouping operations on variables you create, not already in the data table.
df <- data.table(x = runif(100))
df[, .(sumx = sum(x)),
   .(group = replace(rep(2L, nrow(df)), c(1L, nrow(df)), c(1L, 3L)))]
#    group       sumx
# 1:     1  0.1546382
# 2:     2 48.1939765
# 3:     3  0.4710213

Related

How do I multiply grouped values inside a column of data.table and return the data.table only with the result rows

How do I multiply the values inside a column, grouped by another column?
Let's say I have:
dt = data.table(group = c(1,1,2,2), value = c(2,3,4,5))
I want to multiply together the elements of value that belong to the same group, hence that would return:
dt=data.table(group=c(1,2), value=c(6,20))
I tried it with cumprod
dt[, new_value := cumprod(value), by = group]
but then that returns
dt=data.table(group=c(1,1,2,2), value=c(2,6,4,20)) and I don't know how to remove the rows that I don't need: those with values 2 and 4.
...
Taking the maximum is not a solution because the values could also be negative.
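A quick illustration of the problem:
cumprod(c(-2, 3))
# [1] -2 -6
# max() would pick -2, but the group product is -6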
Update for visibility, using @chinsoon12's solution from the comments:
dt[, .(new_value = prod(value)), by = group]
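which returns the desired two-row result directly:
#    group new_value
# 1:     1         6
# 2:     2        20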
Here's one option where you first perform the calculation and then take the last row by group.
dt[, .(new_value = cumprod(value)), by = group][,.SD[.N], by = group]
   group new_value
1:     1         6
2:     2        20

In R: How to subset a large dataframe by top 5 longest runs of frequent values in 1 column?

I have a dataframe with 1 column. The values in this column can ONLY be "good" or "bad". I would like to find the top 5 largest runs of "bad".
I am able to use the rle() function to get the run lengths of all the "good" and "bad" values.
How do I find the 5 largest runs that are ONLY "bad"?
How do I get the starting and ending indices of the top 5 largest runs of ONLY "bad"?
Your assistance is much appreciated!
One option would be rleid. Convert the 'data.frame' to 'data.table' (setDT(df1)) and create a grouping column with rleid (which generates a unique id for each run of adjacent matching elements). Add the number of elements per group ('n') and the row number ('rn') as columns, subset the rows where 'goodbad' is "bad", order by 'n' in decreasing order, then, grouped by 'grp', summarise the first and last row numbers as well as the 'goodbad' entry, and finally keep only the five largest run lengths.
library(data.table)
setDT(df1)[, grp := rleid(goodbad)][, n := .N, grp][, rn := .I][
  goodbad == 'bad'][order(-n),
    .(goodbad = first(goodbad), n = n, start = rn[1], last = rn[.N]), .(grp)][
  n %in% head(unique(n), 5)][, grp := NULL][]
Or we can use rle and other base R methods
rl <- rle(df1$goodbad)
grp <- with(rl, rep(seq_along(values), lengths))
df2 <- transform(df1, grp = grp, n = rep(rl$lengths, rl$lengths),
                 rn = seq_len(nrow(df1)))
df3 <- subset(df2, goodbad == 'bad')
do.call(data.frame, aggregate(rn ~ grp,
        subset(df3[order(-df3$n), ], n %in% head(unique(n), 5)), range))
data
set.seed(24)
df1 <- data.frame(goodbad = sample(c("good", "bad"), 100,
                                   replace = TRUE), stringsAsFactors = FALSE)
The sort(...) function arranges things by increasing or decreasing order. The default is increasing, but you can set "decreasing = TRUE". Use ?sort for more info.
The which(...) function returns the INDEX of values that meet a logical criterion. The code below sorts the times column (assuming your data has one) of the rows where goodbad == "good".
sort(your.df$times[which(your.df$goodbad == "good")])
If you wanted the top 5 you could sort in decreasing order and take the first five elements:
top5_good <- sort(your.df$times[which(your.df$goodbad == "good")], decreasing = TRUE)[1:5]
top5_bad <- sort(your.df$times[which(your.df$goodbad == "bad")], decreasing = TRUE)[1:5]

How to explicitly name the count column generated by the .N function?

I want to group a data.table by an id column and then count how many times each id occurs. This can be done as follows:
dt <- data.table(id = c(1, 1, 2))
dt_by_id <- dt[, .N, by = id]
dt_by_id
   id N
1:  1 2
2:  2 1
That works fine, but I want the N column to have a different name (e.g. count). In the help it says:
.N is an integer, length 1, containing the number of rows in the group. This may be useful when the column names are not known in advance and for convenience generally. When grouping by i, .N is the number of rows in x matched to, for each row of i, regardless of whether nomatch is NA or 0. It is renamed to N (no dot) in the result (otherwise a column called ".N" could conflict with the .N variable, see FAQ 4.6 for more details and example), unless it is explicitly named; ... .
How to "explicitly name" the N-column when creating the dt_by_id data table? (I know how to rename it afterwards.) I tried
dt_by_id <- dt[, count = .N, by = id]
but this led to
Error in `[.data.table`(dt, , count = .N, by = id) :
unused argument (count = .N)
You have to wrap the output of your calculation in a list if you want to give it your own name:
dt[, .(count=.N), by = id]
This is identical to dt[, list(count=.N), by = id], if you prefer; . is an alias for list here.
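Either spelling gives the renamed column directly:
#    id count
# 1:  1     2
# 2:  2     1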
If the column has already been created with the default name, then use setnames
setnames(dt_by_id, "N", 'count')
or using rename
library(dplyr)
dt_by_id %>%
  rename(count = N)
#    id count
# 1:  1     2
# 2:  2     1
Using dplyr::count(x, name = "new column") will replace the default column name n with a new name.
dt <- data.frame(id = c(1, 1, 2))
dt %>%
  dplyr::count(id, name = 'ID')
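On the example data this returns the count under the new name (here 'ID'):
#   id ID
# 1  1  2
# 2  2  1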

Update join with multiple rows

Question
When doing an update-join, where the i table has multiple rows per key, how can you control which row is returned?
Example
In this example, the update-join returns the last row from dt2
library(data.table)
dt1 <- data.table(id = 1)
dt2 <- data.table(id = 1, letter = letters)
dt1[
  dt2
  , on = "id"
  , letter := i.letter
]
dt1
#    id letter
# 1:  1      z
How can I control it to return the 1st, 2nd, nth row, rather than defaulting to the last?
References
A couple of references similar to this by user @Frank:
data.table tutorial - in particular the 'warning' on update-joins
Issue on github
The most flexible idea I can think of is to only join the part of dt2 which contains the rows you want. So, for the second row:
dt1[
  dt2[, .SD[2], by=id]
  , on = "id"
  , letter := i.letter
]
dt1
#    id letter
# 1:  1      b
With a hat-tip to @Frank for simplifying the sub-select of dt2.
How can I control it to return the 1st, 2nd, nth row, rather than defaulting to the last?
Not elegant, but sort-of works:
n = 3L
dt1[, v := dt2[.SD, on=.(id), x.letter[n], by=.EACHI]$V1]
A couple of problems:
It doesn't get GForce optimization, unlike the plain grouped version, e.g. as seen here:
> dt2[, letter[3], by=id, verbose=TRUE]
Detected that j uses these columns: letter
Finding groups using forderv ... 0.020sec
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000sec
lapply optimization is on, j unchanged as 'letter[3]'
GForce optimized j to '`g[`(letter, 3)'
Making each group and running j (GForce TRUE) ... 0.000sec
   id V1
1:  1  c
If n is outside of 1:.N for some joined groups, no warning will be given:
n = 40L
dt1[, v := dt2[.SD, on=.(id), x.letter[n], by=.EACHI]$V1]
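Here x.letter[40] is NA for the only group, so v is silently set to NA:
dt1$v
# [1] NA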
Alternatively, make a habit of checking that i in an update join x[i] is "keyed" by the join columns:
cols = "id"
stopifnot(nrow(dt2) == uniqueN(dt2, by=cols))
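With the dt2 above (26 rows for a single id), this check fails loudly rather than letting the join silently overwrite:
# Error: nrow(dt2) == uniqueN(dt2, by = cols) is not TRUE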
And then build a different i table to join on, if appropriate:
mDT = dt2[, .(letter = letter[3L]), by=id]
dt1[mDT, on=cols, v := i.letter]

Consecutively subtracting columns in data.table

Suppose I have the following data.table:
player_id prestige_score_0 prestige_score_1 prestige_score_2 prestige_score_3 prestige_score_4
1: 100284 0.0001774623 2.519792e-03 5.870781e-03 7.430179e-03 7.937716e-03
2: 103819 0.0001774623 1.426482e-03 3.904329e-03 5.526974e-03 6.373850e-03
3: 100656 0.0001774623 2.142518e-03 4.221423e-03 5.822705e-03 6.533448e-03
4: 104745 0.0001774623 1.084913e-03 3.061197e-03 4.383649e-03 5.091851e-03
5: 104925 0.0001774623 1.488457e-03 2.926728e-03 4.360301e-03 5.068171e-03
And I want to find the difference between the values in consecutive columns, starting from prestige_score_0.
In one step it should look like this: df[,prestige_score_0] - df[,prestige_score_1]
How can I do it in data.table (and save these differences as a data.table, keeping player_id as well)?
This is how you can do this in a tidy way:
# make it tidy
df2 <- melt(df,
            id.vars = "player_id",
            variable.name = "column_name",
            value.name = "prestige_score")
# extract numbers from column names
df2[, score_number := as.numeric(gsub("prestige_score_", "", column_name))]
# compute differences by player
df2[, diff := prestige_score - shift(prestige_score, n = 1L, type = "lead"),
    by = player_id]
# if necessary, reshape back to original format
dcast(df2, player_id ~ score_number, value.var = c("prestige_score", "diff"))
You can subtract a whole dt from a column-shifted version of itself:
dt = data.table(id=c("A","B"),matrix(rexp(10, rate=.1), ncol=5))
dt_shift = data.table(id=dt[,id], dt[, 2:(ncol(dt)-1)] - dt[,3:ncol(dt)])
You could use a for loop:
for (i in 1:(ncol(df) - 2)) {  # ncol(df) - 2 consecutive gaps, since one column is player_id
  set(df, j = paste0("diff_", i - 1, "_", i),
      value = df[[paste0("prestige_score_", i - 1)]] -
        df[[paste0("prestige_score_", i)]])
}
This might not be the most efficient if you have a lot of columns though.
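If you'd rather avoid the loop, here is a vectorized sketch (assuming the prestige_score_<i> naming above; cols and n are helper names introduced for illustration):
library(data.table)
cols <- grep("^prestige_score_", names(df), value = TRUE)
n <- length(cols)
# subtract each score column from the one before it in a single := call
df[, paste0("diff_", 0:(n - 2), "_", 1:(n - 1)) :=
     Map(`-`, .SD[, -n, with = FALSE], .SD[, -1, with = FALSE]),
   .SDcols = cols]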
