data.table: subset last row by group (retain order)

Have:
> aDT <- data.table(ID = c(3,3,2,2,2,3), colA = c(5,5,4,4,4,5), colC = c(1:6))
> aDT
ID colA colC
1: 3 5 1
2: 3 5 2
3: 2 4 3
4: 2 4 4
5: 2 4 5
6: 3 5 6
Need:
> aDT <- data.table(ID = c(3,2,3), colA = c(5,4,5), colC = c(2,5,6))
> aDT
ID colA colC
1: 3 5 2
2: 2 4 5
3: 3 5 6
Tried:
> aDT[, .SD[.N], by = list(ID,colA)]
ID colA colC
1: 3 5 6
2: 2 4 5
As you can see, the result is not really what I need. How can I fix it?
(By the way, I would like to retain the original order.)

You are not really grouping by ID and colA but by their consecutive chunks, for which you can use rleid:
aDT[aDT[, .I[.N], rleid(ID, colA)]$V1]
# ID colA colC
#1: 3 5 2
#2: 2 4 5
#3: 3 5 6
.I[.N] extracts the global row number of the last row for each group:
aDT[, .I[.N], rleid(ID, colA)]
# rleid V1
#1: 1 2
#2: 2 5
#3: 3 6
There are three groups in total; the row numbers of their last rows are 2, 5, and 6. Then use these row numbers to subset the original data.table.
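An equivalent sketch, if you prefer to avoid the self-join: the run ids from rleid() are non-decreasing, so the last row of each run is exactly the row whose run id is not marked by duplicated(..., fromLast = TRUE) (which flags every element that occurs again later when scanning from the back):
library(data.table)
r <- rleid(aDT$ID, aDT$colA)   # run id per consecutive (ID, colA) chunk: 1 1 2 2 2 3
aDT[!duplicated(r, fromLast = TRUE)]
#    ID colA colC
# 1:  3    5    2
# 2:  2    4    5
# 3:  3    5    6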

Related

Sum all possible combinations of columns in R

I have a data table (100 rows x 25 cols) that is structured like this:
ColA ColB ColC ColD
1: 1 3 1 2
2: 2 2 1 2
3: 3 1 1 2
I want to add column values together, in every possible combination.
The output would include, for example:
ColA+B ColA+C ColA+D ColB+C ColB+D etc.
BUT! I don't just want pairs. I am trying to get every combination. I also want to see, for example:
ColA+B+C ColA+B+D ColA+C+D ColB+C+D
And:
ColA+B+C+D
Ideally I could simply add all these combinations to the right of the base dataset (I am looking to run a correlation matrix on all of them). I am far from an R expert. I see there are packages like combinat, but they don't seem to get at what I'm after. I would be very grateful indeed for any suggestions.
Thank you.
I'm hesitant to present this as a suggestion: it works with four columns, but as @DanAdams commented, this explodes with 25 columns:
choose(25,2) # 25 columns, 2 each
# [1] 300
choose(25,3) # 25 columns, 3 each
# [1] 2300
### 25 columns, in sets of 2 through 25 at a time
sum(sapply(2:25, choose, n=25))
# [1] 33554406
But let's assume that you can keep the number of combinations under control. Change 2:4 below to the set sizes you need (here: pairs, triples, and the full set of four).
combs <- do.call(c, lapply(2:4, function(z) asplit(combn(names(dat), z), 2)))
names(combs) <- sapply(combs, paste, collapse = "_")
length(combs)
# [1] 11
combs[c(1,2,10,11)]
# $ColA_ColB
# [1] "ColA" "ColB"
# $ColA_ColC
# [1] "ColA" "ColC"
# $ColB_ColC_ColD
# [1] "ColB" "ColC" "ColD"
# $ColA_ColB_ColC_ColD
# [1] "ColA" "ColB" "ColC" "ColD"
ign <- Map(function(cols, nm) dat[, (nm) := rowSums(.SD), .SDcols = cols], combs, names(combs))
dat[]
# ColA ColB ColC ColD ColA_ColB ColA_ColC ColA_ColD ColB_ColC ColB_ColD ColC_ColD ColA_ColB_ColC ColA_ColB_ColD ColA_ColC_ColD ColB_ColC_ColD ColA_ColB_ColC_ColD
# <int> <int> <int> <int> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
# 1: 1 3 1 2 4 2 3 4 5 3 5 6 4 6 7
# 2: 2 2 1 2 4 3 4 3 4 3 5 6 5 5 7
# 3: 3 1 1 2 4 4 5 2 3 3 5 6 6 4 7
BTW: I'm inferring that your data is of class data.table, hence the assign-by-reference side effect I'm using here. If that's not the case, then here it is in base R:
dat <- cbind(dat, data.frame(lapply(combs, function(cols) rowSums(subset(dat, select = cols)))))
dat
# ColA ColB ColC ColD ColA_ColB ColA_ColC ColA_ColD ColB_ColC ColB_ColD ColC_ColD ColA_ColB_ColC ColA_ColB_ColD ColA_ColC_ColD ColB_ColC_ColD ColA_ColB_ColC_ColD
# 1 1 3 1 2 4 2 3 4 5 3 5 6 4 6 7
# 2 2 2 1 2 4 3 4 3 4 3 5 6 5 5 7
# 3 3 1 1 2 4 4 5 2 3 3 5 6 6 4 7
(Please don't blame me if your R crashes due to memory exhaustion. Save your work often.)
Data
dat <- setDT(structure(list(ColA = 1:3, ColB = 3:1, ColC = c(1L, 1L, 1L), ColD = c(2L, 2L, 2L)), class = c("data.table", "data.frame"), row.names = c(NA, -3L)))
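For the follow-up the question mentions (a correlation matrix over all these derived columns), a minimal sketch on the data above: cor() works directly on a numeric matrix. Note that with this toy data several derived columns are constant (zero variance), so their correlations come out NA and cor() warns; with the real 100-row data that is much less likely.
cormat <- cor(as.matrix(dat))  # 15 x 15 with the toy data: 4 original + 11 derived columns
dim(cormat)
# [1] 15 15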

conditional data.table match for subset of data.table

This post is related to the previous post here: match rows of two data.tables to fill subset of a data.table
Not sure how I can integrate them together.
I have a situation where, in addition to colB of DT1 being NA, a couple more conditions should apply for the merge, but the merge doesn't work.
> DT1 <- data.table(colA = c(1,1, 2,2,2,3,3), colB = c('A', NA, 'AA', 'B', NA, 'A', 'C'), timeA = c(2,4,3,4,6,1,4))
> DT1
colA colB timeA
1: 1 A 2
2: 1 <NA> 4
3: 2 AA 3
4: 2 B 4
5: 2 <NA> 6
6: 3 A 1
7: 3 C 4
> DT2 <- data.table(colC = c(1,1,1,2,2,3), timeB1 = c(1,3,6, 2,4, 1), timeB2 = c(2,5,7,3,5,4), colD = c('Z', 'YY', 'AB', 'JJ', 'F', 'RR'))
> DT2
colC timeB1 timeB2 colD
1: 1 1 2 Z
2: 1 3 5 YY
3: 1 6 7 AB
4: 2 2 3 JJ
5: 2 4 5 F
6: 3 1 4 RR
Using the same guideline as in that post, I'd like to merge colD of DT2 into colB of DT1, but only for the NA values of colB in DT1, AND only using the values of colD for which timeA in DT1 is between timeB1 and timeB2 in DT2. I tried the following, but the merge doesn't happen:
> output <- DT1[DT2, on = .(colA = colC), colB := ifelse(is.na(x.colB) & i.timeB1 <= x.timeA & x.timeA <= i.timeB2, i.colD, x.colB)]
> output
colA colB timeA
1: 1 A 2
2: 1 <NA> 4
3: 2 AA 3
4: 2 B 4
5: 2 <NA> 6
6: 3 A 1
7: 3 C 4
Nothing changes in output.
This is my desired output:
> desired_output
colA colB timeA
1: 1 A 2
2: 1 YY 4 --> should find a match
3: 2 AA 3
4: 2 B 4
5: 2 <NA> 6 --> shouldn't find a match
6: 3 A 1
7: 3 C 4
Why doesn't this work?
I'd like to use data.table operations only, without additional packages.
An in-place update of colB in DT1 works as follows:
DT1[is.na(colB), colB := DT2[DT1[is.na(colB)],
on = .(colC = colA, timeB1 <= timeA, timeB2 >= timeA), colD]]
print(DT1)
colA colB timeA
1: 1 A 2
2: 1 YY 4
3: 2 AA 3
4: 2 B 4
5: 2 <NA> 6
6: 3 A 1
7: 3 C 4
This subsets the rows where colB is NA and, after a non-equi join on the conditions defined in on = ..., replaces the missing values with the matching values found in colD. As for why the original attempt fails: in the update join DT1[DT2, on = .(colA = colC), colB := ...], each DT1 row can match several DT2 rows and the assignment runs once per match, so the last match wins; for row 2 the last matching DT2 row (timeB1 = 6, timeB2 = 7) fails the time condition, so the ifelse most likely writes x.colB (i.e. NA) back over the earlier YY.
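To see what the inner join produces on its own, here is a minimal inspection sketch (same DT1/DT2 as above). The unmatched row comes back with NA in colD, which is why row 5 stays NA. Note that after a non-equi join the join columns carry the values from i (DT1), hence the renaming of colC/timeB1 back to colA/timeA in j for readability:
DT2[DT1[is.na(colB)],
    on = .(colC = colA, timeB1 <= timeA, timeB2 >= timeA),
    .(colA = colC, timeA = timeB1, colD)]
#    colA timeA colD
# 1:    1     4   YY
# 2:    2     6 <NA>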
Possibly not the shortest answer, but it gets the job done. I'm no data.table expert, so I welcome improvements/suggestions.
DT1[is.na(colB), colB := DT1[is.na(colB)][DT2, colB := i.colD, on = c("colA == colC", "timeA >= timeB1", "timeA <= timeB2")]$colB]
What it does:
first, it subsets DT1 to all rows where is.na(colB) is TRUE
then, it updates colB in those rows with the colB vector taken from the result of a non-equi join of that same subset on DT2
A bonus is that DT1 is changed by reference, so it should be pretty fast and memory-efficient on large data (I think).
colA colB timeA
1: 1 A 2
2: 1 YY 4
3: 2 AA 3
4: 2 B 4
5: 2 <NA> 6
6: 3 A 1
7: 3 C 4

delete a row where the order is wrong within a group

I have data with about 1000 groups; each group is ordered from 1-100 (the values can be any numbers within 100).
As I was looking through the data, I found that some groups had bad orders, i.e., the order would run up toward 100 and then suddenly a 24 would show up.
How can I delete all of these error rows?
I would like to find all rows that don't follow the order within the group and just delete them.
Any help would be great!
lag gives the previous value, so diff = order - lag(order) is the difference between the current value and the previous one; filtering on diff >= 0 keeps only the rows where the value did not decrease. The order == min(order) condition is needed because lag yields NA for the first row of each group (and here the first row is also the group minimum, so it is kept). I keep the helper column diff so you can check the result, but you can drop it with %>% select(-diff).
library(dplyr)
df1 %>% group_by(gruop) %>% mutate(diff = order-lag(order)) %>%
filter(diff >= 0 | order==min(order))
# A tibble: 8 x 3
# Groups: gruop [2]
gruop order diff
<int> <int> <int>
1 1 1 NA
2 1 3 2
3 1 5 2
4 1 10 5
5 2 1 NA
6 2 4 3
7 2 4 0
8 2 8 4
Data
df1 <- read.table(text="
gruop order
1 1
1 3
1 5
1 10
1 2
2 1
2 4
2 4
2 8
2 3
",header=T, stringsAsFactors = F)
Assuming the order column increments by 1 every time, we can use ave and keep, within each group, only the rows whose difference from the previous row is 1.
df[ave(df$order, df$group, FUN = function(x) c(1, diff(x))) == 1, ]
# group order
#1 1 1
#2 1 2
#3 1 3
#4 1 4
#6 2 1
#7 2 2
#8 2 3
#9 2 4
EDIT
For the updated example, we can just change the comparison
df[ave(df$order, df$group, FUN = function(x) c(1, diff(x))) >= 0, ]
Playing with data.table:
library(data.table)
setDT(df1)[, diffo := c(1, diff(order)), group][diffo == 1, .(group, order)]
group order
1: 1 1
2: 1 2
3: 1 3
4: 1 4
5: 2 1
6: 2 2
7: 2 3
8: 2 4
Where df1 is:
df1 <- data.frame(
group = rep(1:2, each = 5),
order = c(1:4, 2, 1:4, 3)
)
EDIT
If you only need increasing order, and not steps of one, then you can do:
df3 <- transform(df1, order = c(1,3,5,10,2,1,4,7,9,3))
setDT(df3)[, diffo := c(1, diff(order)), group][diffo >= 1, .(group, order)]
group order
1: 1 1
2: 1 3
3: 1 5
4: 1 10
5: 2 1
6: 2 4
7: 2 7
8: 2 9

How to store the row number by group [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 5 years ago.
I have data, stored as a data.table dt or a data.frame df, with multiple observations per id-month combination. I want to store the row number in a variable, let's call it row.
I know how to do this in dplyr but want to learn how to do it in (pure) data.table. I assume it is a trivially easy operation, but I can't seem to find a solution that works.
Reprex:
library(dplyr)
library(data.table)
df <- tibble(id = c(1, 1, 1, 2, 2, 2), month = c(1, 1, 2, 1, 1, 2))
dt <- data.table(df)
My dplyr solution gives the expected output:
df %>%
group_by(id, month) %>%
mutate(row = row_number(id))
# A tibble: 6 x 3
# Groups: id, month [4]
id month row
<dbl> <dbl> <int>
1 1 1 1
2 1 1 2
3 1 2 1
4 2 1 1
5 2 1 2
6 2 2 1
Doing similar operations on a data.table yields something different:
dt[, row := row_number(id), by = c("id", "month")]
id month row
1: 1 1 1
2: 1 1 1
3: 1 2 1
4: 2 1 1
5: 2 1 1
6: 2 2 1
Or:
dt[, row := .I, by = c("id", "month")]
id month row
1: 1 1 1
2: 1 1 2
3: 1 2 3
4: 2 1 4
5: 2 1 5
6: 2 2 6
I assume I understand why this happens (row_number(id) simply refers to the first row number of the first row of each group), but do not know how to get the expected result in pure data.table.
.N is the size of the current group, so seq_len(.N) enumerates its rows:
dt[, row := seq_len(.N), by = c("id", "month")]
dt
id month row
1: 1 1 1
2: 1 1 2
3: 1 2 1
4: 2 1 1
5: 2 1 2
6: 2 2 1
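Equivalently, rowid() computes the same within-group counter in a single call (a sketch, assuming data.table >= 1.9.8, which introduced rowid()):
dt[, row := rowid(id, month)]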

Retain rows up to first occurrence of a value in a column, by group. Groups without value allowed

I have a data frame like this one:
> df
id type
1 1 a
2 1 a
3 1 b
4 1 a
5 1 b
6 2 a
7 2 a
8 2 b
9 3 a
10 3 a
I want to keep all rows for each group (id) up to the first occurrence of value 'b' in the type column. For groups without type 'b', I want to keep all their rows.
The resulting data frame should look like this:
> dfnew
id type
1 1 a
2 1 a
3 1 b
4 2 a
5 2 a
6 2 b
7 3 a
8 3 a
I tried the following code, but it retains additional rows that have the value 'a' beyond the first occurrence of 'b', and only excludes additional occurrences of 'b', which is not what I want. Look at row 4 in the following; I want to get rid of it.
> df %>% group_by(id) %>% filter(cumsum(type == 'b') <= 1)
Source: local data frame [9 x 2]
Groups: id
id type
1 1 a
2 1 a
3 1 b
4 1 a
5 2 a
6 2 a
7 2 b
8 3 a
9 3 a
You could combine match or which with slice, or (as mentioned by @Richard) use which.max:
library(dplyr)
df %>%
group_by(id) %>%
slice(if(any(type == "b")) 1:which.max(type == "b") else row_number())
# Source: local data table [8 x 2]
# Groups: id
#
# id type
# 1 1 a
# 2 1 a
# 3 1 b
# 4 2 a
# 5 2 a
# 6 2 b
# 7 3 a
# 8 3 a
Or you could try it with data.table
library(data.table)
setDT(df)[, if(any(type == "b")) .SD[1:which.max(type == "b")] else .SD, by = id]
# id type
# 1: 1 a
# 2: 1 a
# 3: 1 b
# 4: 2 a
# 5: 2 a
# 6: 2 b
# 7: 3 a
# 8: 3 a
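For what it's worth, the asker's own cumsum attempt only needs a small repair, sketched here: lag the cumulative count so the comparison looks at the number of 'b's strictly before the current row; then the first 'b' is kept while everything after it is dropped:
library(dplyr)
df %>%
  group_by(id) %>%
  # keep a row only if no 'b' has occurred earlier in its group
  filter(lag(cumsum(type == "b"), default = 0L) == 0L) %>%
  ungroup()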
