Using column numbers for grouping in data.table rather than names in R

I have code that needs to be flexible, so I cannot hard-code column names when grouping. Instead, I want to specify the grouping columns by number, since a range of positions is easy to adjust (columns 1 through X, rather than the names of columns 1, 2, ..., X).
Example data set:
set.seed(007)
DF <- data.frame(X=1:20, Y=sample(c(0,1), 20, TRUE), Z=sample(0:5, 20, TRUE), Q =sample(0:5, 20, TRUE))
DF
X Y Z Q
1 1 1 3 4
2 2 0 1 2
3 3 0 5 4
4 4 0 5 2
5 5 0 5 5
6 6 1 0 1
7 7 0 3 0
8 8 1 2 4
9 9 0 5 5
10 10 0 2 5
11 11 0 4 3
12 12 0 1 4
13 13 1 1 4
14 14 0 1 3
15 15 0 2 4
16 16 0 5 2
17 17 1 2 0
18 18 0 4 1
19 19 1 5 2
20 20 0 2 1
A grouping (by Z and Q) that finds the X that maximizes Y, and returns both:
library(data.table)
DF <- data.table(DF)
DF[, list(Y=max(Y),X=X[which.max(Y)]), by=list(Z, Q)]
Result:
DF[, list(Y=max(Y),X=X[which.max(Y)]), by=list(Z, Q)]
Z Q Y X
1: 3 4 1 1
2: 1 2 0 2
3: 5 4 0 3
4: 5 2 1 19
5: 5 5 0 5
6: 0 1 1 6
7: 3 0 0 7
8: 2 4 1 8
9: 2 5 0 10
10: 4 3 0 11
11: 1 4 1 13
12: 1 3 0 14
13: 2 0 1 17
14: 4 1 0 18
15: 2 1 0 20
I want to do this purely using column numbers, because of the nature of my code. Additionally, if there were another column, I might want to group by that extra column as well, and I might also want to return another argmax in the first part.

Maybe just pick off names(DF) with column numbers, combined with eval(parse(...))?
useColNums <- function(data, a, b) {
n <- names(data)
y <- n[a[1]]
x <- n[a[2]]
groupby <- sprintf("list(%s)", paste(n[b], collapse=","))
argmax <- sprintf("list(%1$s=max(%1$s),%2$s=%2$s[which.max(%1$s)])", y, x)
data[, eval(parse(text=argmax)), by=eval(parse(text=groupby))]
}
x <- useColNums(DF, 2:1, 3:4)
y <- DF[, list(Y=max(Y),X=X[which.max(Y)]), by=list(Z, Q)]
identical(x, y)
# [1] TRUE
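A possible alternative that avoids eval(parse(...)): data.table accepts a character vector in by and in .SDcols, so the column numbers can be turned into names once with names(DT)[...]. This is only a sketch under the question's setup; DT, group_cols, y_col and x_col are names made up for illustration.

```r
library(data.table)

set.seed(7)
DT <- data.table(X = 1:20,
                 Y = sample(c(0, 1), 20, TRUE),
                 Z = sample(0:5, 20, TRUE),
                 Q = sample(0:5, 20, TRUE))

# Translate column positions into names once (hypothetical helper variables)
group_cols <- names(DT)[3:4]  # columns to group by
y_col <- names(DT)[2]         # column to maximise
x_col <- names(DT)[1]         # column whose value at the maximum is returned

# .SD holds only the columns listed in .SDcols, in that order
res <- DT[, {
  y <- .SD[[1]]
  x <- .SD[[2]]
  list(max(y), x[which.max(y)])
}, by = group_cols, .SDcols = c(y_col, x_col)]

# Unnamed results come back as V1, V2; rename them to the original names
setnames(res, c("V1", "V2"), c(y_col, x_col))
```

Adding another grouping column then only means widening the range in names(DT)[3:4].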

Did you find an answer that works for you? Something like this is possible, but it is not pretty, which may mean it is hard to maintain:
DF[, list(Y=max(eval(as.symbol(colnames(DF)[2]))),
X=eval(as.symbol(colnames(DF)[1]))[which.max(eval(as.symbol(colnames(DF)[2])))]),
by=list(Z=eval(as.symbol(colnames(DF)[3])),
Q=eval(as.symbol(colnames(DF)[4])))]
Now you could put those as.symbol(colnames()) into a function and make this easier to read:
cn <- function( dt, col ) { as.symbol(colnames(dt)[col]) }
DF[, list(Y=max(eval(cn(DF,2))),
X=eval(cn(DF,1))[which.max(eval(cn(DF,2)))]),
by=list(Z=eval(cn(DF,3)), Q=eval(cn(DF,4)))]
Does this solve that problem of grouping by column numbers for you?

You could use a combination of grep with your code:
> set.seed(007)
> DF <- data.frame(X=1:20, Y=sample(c(0,1), 20, TRUE), Z=sample(0:5, 20, TRUE), Q =sample(0:5, 20, TRUE))
> DF = data.table(DF)
> DF[, list(Y=max(Y),X=X[which.max(Y)]), by=c(names(DF)[grep("Q", colnames(DF))], names(DF)[grep("Z", colnames(DF))])]
Q Z Y X
1: 4 3 1 1
2: 2 1 0 2
3: 4 5 0 3
4: 2 5 1 19
5: 5 5 0 5
6: 1 0 1 6
7: 0 3 0 7
8: 4 2 1 8
9: 5 2 0 10
10: 3 4 0 11
11: 4 1 1 13
12: 3 1 0 14
13: 0 2 1 17
14: 1 4 0 18
15: 1 2 0 20


Ranking duplicated rows in R [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 1 year ago.
I am trying to create an additional variable (new variable-> flag) that will number the repetition of observation in my variable starting from 0.
dataset <- data.frame(id = c(1,1,1,2,2,4,6,6,6,7,7,7,7,8))
intended results will look like:
id flag
1 0
1 1
1 2
2 0
2 1
4 0
6 0
6 1
6 2
7 0
7 1
7 2
7 3
8 0
Thank You!
You may try
dataset$flag <- unlist(lapply(rle(dataset$id)$lengths, function(x) seq_len(x) - 1))
id flag
1 1 0
2 1 1
3 1 2
4 2 0
5 2 1
6 4 0
7 6 0
8 6 1
9 6 2
10 7 0
11 7 1
12 7 2
13 7 3
14 8 0
data.table:
library(data.table)
setDT(dataset)[, flag := rowid(id) - 1]
dataset
id flag
1: 1 0
2: 1 1
3: 1 2
4: 2 0
5: 2 1
6: 4 0
7: 6 0
8: 6 1
9: 6 2
10: 7 0
11: 7 1
12: 7 2
13: 7 3
14: 8 0
Base R:
dataset$flag = sequence(rle(dataset$id)$lengths) - 1
dataset
id flag
1 1 0
2 1 1
3 1 2
4 2 0
5 2 1
6 4 0
7 6 0
8 6 1
9 6 2
10 7 0
11 7 1
12 7 2
13 7 3
14 8 0
Another base option:
transform(dataset,
flag = Reduce(function(x, y) y * x + y, duplicated(id), accumulate = TRUE))
id flag
1 1 0
2 1 1
3 1 2
4 2 0
5 2 1
6 4 0
7 6 0
8 6 1
9 6 2
10 7 0
11 7 1
12 7 2
13 7 3
14 8 0
dplyr -
library(dplyr)
dataset %>% group_by(id) %>% mutate(flag = row_number() - 1)
# id flag
# <dbl> <dbl>
# 1 1 0
# 2 1 1
# 3 1 2
# 4 2 0
# 5 2 1
# 6 4 0
# 7 6 0
# 8 6 1
# 9 6 2
#10 7 0
#11 7 1
#12 7 2
#13 7 3
#14 8 0
Base R with similar logic
transform(dataset, flag = ave(id, id, FUN = seq_along) - 1)
Another way to get the expected result, with a little more code:
library(dplyr)
# Count the rows per id
x <- dataset %>%
  group_by(id) %>%
  summarise(nreg = n())
# Build the 0-based counter for each id and stack the pieces
df <- data.frame()
for (i in seq_len(nrow(x))) {
  flag <- data.frame(id = rep(x$id[i], x$nreg[i]),
                     flag = seq(0, x$nreg[i] - 1))
  df <- rbind(df, flag)
}
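One point that may be worth checking before picking among the answers above: the rle-based versions number rows within consecutive runs, while the grouped versions (rowid, ave, row_number) number rows within each id wherever it occurs. They agree on the question's data because the ids are contiguous. A small base R sketch with a made-up vector (ids is an illustration, not from the question) shows where they differ:

```r
# Hypothetical ids in which id 1 reappears after id 2
ids <- c(1, 1, 2, 1)

# Run-based: the counter restarts whenever the value changes
run_based <- sequence(rle(ids)$lengths) - 1

# Group-based: the counter continues across non-adjacent rows of the same id
group_based <- ave(ids, ids, FUN = seq_along) - 1

run_based    # 0 1 0 0
group_based  # 0 1 0 2
```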

R Configure Data With Data.Table

data=data.frame("Student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
"Grade"=c(5,6,7,3,4,5,4,5,6,8,9,10,2,3,4),
"Pass"=c(NA,0,1,0,1,1,0,1,0,0,NA,NA,0,0,0),
"NEWPass"=c(0,0,1,0,1,1,0,1,1,0,0,0,0,0,0),
"GradeNEWPass"=c(7,7,7,4,4,4,5,5,5,10,10,10,4,4,4),
"GradeBeforeNEWPass"=c(6,6,6,3,3,3,4,4,4,10,10,10,4,4,4))
I have a data.frame called data. It has column names Student, Grade and Pass. I wish to do this:
NEWPass: Take Pass and, for every Student, fill in NA values with the previous value. If the first value is NA, then put a zero. Then take the running maximum.
GradeNEWPass: Take the lowest value of Grade at which a Student got a one in NEWPass. If a Student did not get a one in NEWPass, this equals the maximum grade.
GradeBeforeNEWPass: Take the value of Grade just before a Student first got a one in NEWPass. If a Student did not get a one in NEWPass, this equals the maximum grade.
Attempts:
setDT(data)[, NEWPassTry := cummax(Pass), by = Student]
data$GradeNEWPass = data$NEWPassTry * data$Grade
data[, GradeNEWPass := min(GradeNEWPass), by = Student]
Not pretty, admittedly, but your logic includes words like "if any ... for a student", so it's a group-wise conditional, not a row-wise conditional.
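library(data.table)
library(magrittr) # just for %>% for breakout, not required
mydata <- as.data.table(data) # assumption: 'mydata' is the question's 'data' as a data.table
mydata %>%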
.[, NEWPass2 :=
cummax(fifelse(seq_len(.N) == 1 & is.na(Pass), 0,
zoo::na.locf(Pass, na.rm = FALSE))), by = .(Student) ] %>%
.[, GradeNEWPass2 :=
if (any(NEWPass2 > 0)) min(Grade[ NEWPass2 > 0 ]) else max(Grade),
by = .(Student) ] %>%
.[, GradeBeforeNEWPass2 :=
if (NEWPass2[1] == 0 && any(NEWPass2 > 0)) Grade[ which(NEWPass2 > 0)[1] - 1 ] else max(Grade),
by = .(Student) ]
# Student Grade Pass NEWPass GradeNEWPass GradeBeforeNEWPass NEWPass2 GradeNEWPass2 GradeBeforeNEWPass2
# 1: 1 5 NA 0 7 6 0 7 6
# 2: 1 6 0 0 7 6 0 7 6
# 3: 1 7 1 1 7 6 1 7 6
# 4: 2 3 0 0 4 3 0 4 3
# 5: 2 4 1 1 4 3 1 4 3
# 6: 2 5 1 1 4 3 1 4 3
# 7: 3 4 0 0 5 4 0 5 4
# 8: 3 5 1 1 5 4 1 5 4
# 9: 3 6 0 1 5 4 1 5 4
# 10: 4 8 0 0 10 10 0 10 10
# 11: 4 9 NA 0 10 10 0 10 10
# 12: 4 10 NA 0 10 10 0 10 10
# 13: 5 2 0 0 4 4 0 4 4
# 14: 5 3 0 0 4 4 0 4 4
# 15: 5 4 0 0 4 4 0 4 4
I'm using magrittr::%>% solely to break it out into stages of computation, it is not required.
We can use data.table methods. Grouped by 'Student', create a logical index ('i1') that is TRUE where 'Pass' is 1 and not NA, get the position of the first 1 with which and head ('i2'), and calculate the max of 'Grade' ('mx'). Then create the three columns from these indexes: 'v1' is the cumulative maximum of the binary index; 'v2' is 'Grade' at position 'i2' if there are any 1s, otherwise 'mx'; 'v3' is the same, but with 1 subtracted from the index to get the preceding 'Grade' value.
library(data.table)
setDT(data)[, c('NEWPass1', 'GradeNEWPass1', 'GradeBeforeNEWPass1') :={
i1 <- Pass == 1 & !is.na(Pass)
i2 <- head(which(i1), 1)
mx <- max(Grade, na.rm = TRUE)
v1 <- cummax(+(i1))
v2 <- if(any(i1)) Grade[i2] else mx
v3 <- if(any(i1)) Grade[max(1, i2-1)] else mx
.(v1, v2, v3)}, Student]
data
# Student Grade Pass NEWPass GradeNEWPass GradeBeforeNEWPass NEWPass1 GradeNEWPass1 GradeBeforeNEWPass1
# 1: 1 5 NA 0 7 6 0 7 6
# 2: 1 6 0 0 7 6 0 7 6
# 3: 1 7 1 1 7 6 1 7 6
# 4: 2 3 0 0 4 3 0 4 3
# 5: 2 4 1 1 4 3 1 4 3
# 6: 2 5 1 1 4 3 1 4 3
# 7: 3 4 0 0 5 4 0 5 4
# 8: 3 5 1 1 5 4 1 5 4
# 9: 3 6 0 1 5 4 1 5 4
#10: 4 8 0 0 10 10 0 10 10
#11: 4 9 NA 0 10 10 0 10 10
#12: 4 10 NA 0 10 10 0 10 10
#13: 5 2 0 0 4 4 0 4 4
#14: 5 3 0 0 4 4 0 4 4
#15: 5 4 0 0 4 4 0 4 4
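The NEWPass rule from the question (carry the last value forward, treat a leading NA as 0, then take a running maximum) can also be sketched for a single student's Pass vector with data.table's nafill, assuming a data.table version recent enough to provide it. The function name new_pass is made up for illustration.

```r
library(data.table)

# Hypothetical helper: applies the three steps of the NEWPass rule to one vector
new_pass <- function(pass) {
  filled <- nafill(pass, type = "locf")               # carry previous value forward
  filled <- nafill(filled, type = "const", fill = 0)  # a leading NA becomes 0
  cummax(filled)                                      # running maximum
}

new_pass(c(NA, 0, 1))   # Student 1: 0 0 1
new_pass(c(0, 1, 0))    # Student 3: 0 1 1
new_pass(c(0, NA, NA))  # Student 4: 0 0 0
```

Inside a grouped call this would be used along the lines of data[, NEWPass := new_pass(Pass), by = Student].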

Count number of values which are less than current value

I'd like to count the rows in the column input if the values are smaller than the current row (Please see the results wanted below). The issue to me is that the condition is based on current row value, so it is very different from general case where the condition is a fixed number.
data <- data.frame(input = c(1,1,1,1,2,2,3,5,5,5,5,6))
input
1 1
2 1
3 1
4 1
5 2
6 2
7 3
8 5
9 5
10 5
11 5
12 6
The results I expect are shown below. For example, for observations 5 and 6 (with value 2), there are 4 observations (those with value 1) that are less than 2, so count is 4.
input count
1 1 0
2 1 0
3 1 0
4 1 0
5 2 4
6 2 4
7 3 6
8 5 7
9 5 7
10 5 7
11 5 7
12 6 11
Edit: as I am dealing with grouped data in dplyr, the result I ultimately want is shown below; that is, the condition should be evaluated dynamically within each group.
data <- data.frame(id = c(1,1,2,2,2,3,3,4,4,4,4,4),
input = c(1,1,1,1,2,2,3,5,5,5,5,6),
count=c(0,0,0,0,2,0,1,0,0,0,0,4))
id input count
1 1 1 0
2 1 1 0
3 2 1 0
4 2 1 0
5 2 2 2
6 3 2 0
7 3 3 1
8 4 5 0
9 4 5 0
10 4 5 0
11 4 5 0
12 4 6 4
Here is an option with tidyverse
library(tidyverse)
data %>%
mutate(count = map_int(input, ~ sum(.x > input)))
# input count
#1 1 0
#2 1 0
#3 1 0
#4 1 0
#5 2 4
#6 2 4
#7 3 6
#8 5 7
#9 5 7
#10 5 7
#11 5 7
#12 6 11
Update
With the updated data, add the group by 'id' in the above code
data %>%
group_by(id) %>%
mutate(count1 = map_int(input, ~ sum(.x > input)))
# A tibble: 12 x 4
# Groups: id [4]
# id input count count1
# <dbl> <dbl> <dbl> <int>
# 1 1 1 0 0
# 2 1 1 0 0
# 3 2 1 0 0
# 4 2 1 0 0
# 5 2 2 2 2
# 6 3 2 0 0
# 7 3 3 1 1
# 8 4 5 0 0
# 9 4 5 0 0
#10 4 5 0 0
#11 4 5 0 0
#12 4 6 4 4
In base R, we can use sapply and, for each input, count how many values are smaller than it.
data$count <- sapply(data$input, function(x) sum(x > data$input))
data
# input count
#1 1 0
#2 1 0
#3 1 0
#4 1 0
#5 2 4
#6 2 4
#7 3 6
#8 5 7
#9 5 7
#10 5 7
#11 5 7
#12 6 11
With dplyr one way would be using rowwise function and following the same logic.
library(dplyr)
data %>%
rowwise() %>%
mutate(count = sum(input > data$input))
1. outer and rowSums
data$count <- with(data, rowSums(outer(input, input, `>`)))
2. table and cumsum
tt <- cumsum(table(data$input))
v <- setNames(c(0, head(tt, -1)), names(tt))
data$count <- v[match(data$input, names(v))]
3. data.table non-equi join
Perhaps more efficient with a non-equi join in data.table. Count number of rows (.N) for each match (by = .EACHI).
library(data.table)
setDT(data)
data[data, on = .(input < input), .N, by = .EACHI]
If your data is grouped by 'id', as in your update, join on that variable as well:
data[data, on = .(id, input < input), .N, by = .EACHI]
# id input N
# 1: 1 1 0
# 2: 1 1 0
# 3: 2 1 0
# 4: 2 1 0
# 5: 2 2 2
# 6: 3 2 0
# 7: 3 3 1
# 8: 4 5 0
# 9: 4 5 0
# 10: 4 5 0
# 11: 4 5 0
# 12: 4 6 4
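Another way to look at the problem, offered as a sketch rather than as one of the original answers: the number of values strictly smaller than the current one is the minimum rank minus one, so base R's rank with ties.method = "min" avoids the pairwise comparisons. The vectors below are copied from the question; count_all and count_by_id are illustrative names.

```r
id    <- c(1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4)
input <- c(1, 1, 1, 1, 2, 2, 3, 5, 5, 5, 5, 6)

# Ungrouped: tied values share the lowest rank, so rank - 1 counts smaller values
count_all <- rank(input, ties.method = "min") - 1

# Grouped: apply the same idea within each id
count_by_id <- ave(input, id,
                   FUN = function(v) rank(v, ties.method = "min") - 1)

count_all    # 0 0 0 0 4 4 6 7 7 7 7 11
count_by_id  # 0 0 0 0 2 0 1 0 0 0 0 4
```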

data.table: Select n specific rows before & after other rows meeting a condition

Given the following example data table:
library(data.table)
DT <- fread("grp y exclude
a 1 0
a 2 0
a 3 0
a 4 1
a 5 0
a 7 1
a 8 0
a 9 0
a 10 0
b 1 0
b 2 0
b 3 0
b 4 1
b 5 0
b 6 1
b 7 1
b 8 0
b 9 0
b 10 0
c 5 1
d 1 0")
I want to select
1. by group grp
2. all rows that have y==5
3. and up to two rows before and after each row from 2. within the grouping,
4. but for 3., only those rows that have exclude==0.
Assuming each group has max one row with y==5, this would yield the desired result for 1.-3.:
idx <- -2:2 # 2 rows before match, the matching row itself, and two rows after match
(row_numbers <- DT[,.I[{
x <- rep(which(y==5),each=length(idx))+idx
x[x>0 & x<=.N]
}], by=grp]$V1)
# [1] 3 4 5 6 7 12 13 14 15 16 20
DT[row_numbers]
# grp y exclude
# 1: a 3 0
# 2: a 4 1
# 3: a 5 0 # y==5 + two rows before and two rows after
# 4: a 7 1
# 5: a 8 0
# 6: b 3 0
# 7: b 4 1
# 8: b 5 0 # y==5 + two rows before and two rows after
# 9: b 6 1
# 10: b 7 1
# 11: c 5 1 # y==5 + nothing, because the group has only 1 element
However, how do I incorporate 4. so that I get
# grp y exclude
# 1: a 2 0
# 2: a 3 0
# 3: a 5 0
# 4: a 8 0
# 5: a 9 0
# 6: b 2 0
# 7: b 3 0
# 8: b 5 0
# 9: b 8 0
# 10: b 9 0
# 11: c 5 1
? Feels like I'm close, but I guess I looked too long at heads and whiches, now, so I'd be thankful for some fresh ideas.
A bit more simplified:
DT[DT[, rn := .I][exclude==0 | y==5][, rn[abs(.I - .I[y==5]) <= 2], by=grp]$V1]
# grp y exclude rn
#1: a 2 0 2
#2: a 3 0 3
#3: a 5 0 5
#4: a 8 0 7
#5: a 9 0 8
#6: b 2 0 11
#7: b 3 0 12
#8: b 5 0 14
#9: b 8 0 17
#10: b 9 0 18
#11: c 5 1 20
You are very close. This should do it:
row_numbers <- DT[exclude==0 | y==5, .I[{
x <- rep(which(y==5), each=length(idx)) + idx
x[x>0 & x<=.N]
}], by=grp]$V1
DT[row_numbers]
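Both answers rely on the question's assumption of at most one y==5 row per group. If a group could contain several matches, one possible variation (a sketch only, with hits, near, kept and sel as made-up names) is to filter first and then keep every remaining row that lies within two positions of any match:

```r
library(data.table)

# Same data as in the question, built without fread
DT <- data.table(
  grp     = c(rep("a", 9), rep("b", 10), "c", "d"),
  y       = c(1, 2, 3, 4, 5, 7, 8, 9, 10, 1:10, 5, 1),
  exclude = c(0, 0, 0, 1, 0, 1, 0, 0, 0,
              0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0)
)

DT[, rn := .I]                      # remember the original row numbers
kept <- DT[exclude == 0 | y == 5]   # drop excluded rows, but keep every match

sel <- kept[, {
  hits <- which(y == 5)             # positions of matches within the filtered group
  # a row is kept if it is within 2 positions of at least one match
  near <- vapply(seq_len(.N), function(i) any(abs(i - hits) <= 2), logical(1))
  .(rn = rn[near])
}, by = grp]$rn

DT[sel]
```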

How to refer to multiple previous rows in R data.table

I have a question regarding data.table in R.
I have a dataset like this:
data <- data.table(a=c(1:7,12,32,13),b=c(1,5,6,7,8,3,2,5,1,4))
a b
1: 1 1
2: 2 5
3: 3 6
4: 4 7
5: 5 8
6: 6 3
7: 7 2
8: 12 5
9: 32 1
10: 13 4
Now I want to generate a third column c, which compares the value of a in each row to all previous values of b and checks whether any of those values of b is bigger than a. E.g., at row 5, a=5 and the previous values of b are 1, 5, 6, 7. Since 6 and 7 are bigger than 5, the value of c should be 1; otherwise it would be 0.
The result should be like this
a b c
1: 1 1 NA
2: 2 5 0
3: 3 6 1
4: 4 7 1
5: 5 8 1
6: 6 3 1
7: 7 2 1
8: 12 5 0
9: 32 1 0
10: 13 4 0
I tried a for loop, but it takes a very long time. I also tried shift, but I cannot refer to multiple previous rows with shift. Does anyone have a recommendation?
library(data.table)
data <- data.table(a=c(1:7,12,32,13),b=c(1,5,6,7,8,3,2,5,1,4))
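# Strict '<' matches "bigger than a"; as.integer() gives the 0/1 coding from the question
data[, c := as.integer(a < shift(cummax(b)))]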
This is a base R solution (see the dplyr solution below):
# Row 1 has no previous rows, so it stays NA
data$c <- c(NA, sapply(2:nrow(data), function(x) as.integer(any(data$a[x] < data$b[1:(x-1)]))))
## a b c
## 1: 1 1 NA
## 2: 2 5 0
## 3: 3 6 1
## 4: 4 7 1
## 5: 5 8 1
## 6: 6 3 1
## 7: 7 2 1
## 8: 12 5 0
## 9: 32 1 0
## 10: 13 4 0
EDIT
Here is a simpler solution using dplyr
library(dplyr)
### Given the cumulative max and comparing to 'a', set c to 1/0.
data %>% mutate(c = ifelse(a < lag(cummax(b)), 1, 0))
## a b c
## 1 1 1 NA
## 2 2 5 0
## 3 3 6 1
## 4 4 7 1
## 5 5 8 1
## 6 6 3 1
## 7 7 2 1
## 8 12 5 0
## 9 32 1 0
## 10 13 4 0
### Using 'shift' (from data.table) with dplyr
data %>% mutate(c = ifelse(a < shift(cummax(b)), 1, 0))
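The reason the cumulative maximum works: "some previous b is bigger than a" is the same as "the largest previous b is bigger than a", so only the running maximum of b, shifted down by one row, is needed. A package-free sketch of the same idea, checked against the brute-force definition (prev_max, c_fast and c_slow are illustrative names):

```r
a <- c(1:7, 12, 32, 13)
b <- c(1, 5, 6, 7, 8, 3, 2, 5, 1, 4)

# Largest b seen strictly before each row; row 1 has no previous rows
prev_max <- c(NA, head(cummax(b), -1))
c_fast <- as.integer(a < prev_max)

# Brute-force version of the definition, for comparison
c_slow <- c(NA, sapply(2:length(a), function(i) as.integer(any(b[1:(i - 1)] > a[i]))))

c_fast  # NA 0 1 1 1 1 1 0 0 0
```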
