Count number of values which are less than current value - r

I'd like to count, for each row of the column input, how many values are smaller than the value in the current row (see the desired results below). The difficulty is that the condition depends on the current row's value, which differs from the usual case where the threshold is a fixed number.
data <- data.frame(input = c(1,1,1,1,2,2,3,5,5,5,5,6))
input
1 1
2 1
3 1
4 1
5 2
6 2
7 3
8 5
9 5
10 5
11 5
12 6
The results I expect look like this. For example, observations 5 and 6 have value 2, and there are 4 observations (all with value 1) smaller than 2, so count is 4.
input count
1 1 0
2 1 0
3 1 0
4 1 0
5 2 4
6 2 4
7 3 6
8 5 7
9 5 7
10 5 7
11 5 7
12 6 11
Edit: as I am dealing with grouped data in dplyr, the result I ultimately want is shown below; that is, the condition should be evaluated dynamically within each group.
data <- data.frame(id = c(1,1,2,2,2,3,3,4,4,4,4,4),
input = c(1,1,1,1,2,2,3,5,5,5,5,6),
count=c(0,0,0,0,2,0,1,0,0,0,0,4))
id input count
1 1 1 0
2 1 1 0
3 2 1 0
4 2 1 0
5 2 2 2
6 3 2 0
7 3 3 1
8 4 5 0
9 4 5 0
10 4 5 0
11 4 5 0
12 4 6 4

Here is an option with tidyverse
library(tidyverse)
data %>%
mutate(count = map_int(input, ~ sum(.x > input)))
# input count
#1 1 0
#2 1 0
#3 1 0
#4 1 0
#5 2 4
#6 2 4
#7 3 6
#8 5 7
#9 5 7
#10 5 7
#11 5 7
#12 6 11
Update
With the updated data, add the group by 'id' in the above code
data %>%
group_by(id) %>%
mutate(count1 = map_int(input, ~ sum(.x > input)))
# A tibble: 12 x 4
# Groups: id [4]
# id input count count1
# <dbl> <dbl> <dbl> <int>
# 1 1 1 0 0
# 2 1 1 0 0
# 3 2 1 0 0
# 4 2 1 0 0
# 5 2 2 2 2
# 6 3 2 0 0
# 7 3 3 1 1
# 8 4 5 0 0
# 9 4 5 0 0
#10 4 5 0 0
#11 4 5 0 0
#12 4 6 4 4

In base R, we can use sapply and, for each input value, count how many values in the column are smaller than it.
data$count <- sapply(data$input, function(x) sum(x > data$input))
data
# input count
#1 1 0
#2 1 0
#3 1 0
#4 1 0
#5 2 4
#6 2 4
#7 3 6
#8 5 7
#9 5 7
#10 5 7
#11 5 7
#12 6 11
With dplyr, one way is to use rowwise() and follow the same logic.
library(dplyr)
data %>%
rowwise() %>%
mutate(count = sum(input > data$input))
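The rowwise() call compares each row against the whole column data$input, so it does not respect groups. For the grouped case from the question's edit, one possible base R sketch applies the same sapply logic per id via ave. The helper name count_smaller is hypothetical and not part of the answer above.

```r
# Data from the question's edit (id plus input)
data <- data.frame(id = c(1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4),
                   input = c(1, 1, 1, 1, 2, 2, 3, 5, 5, 5, 5, 6))

# Hypothetical helper: for each value, count strictly smaller values in x
count_smaller <- function(x) sapply(x, function(v) sum(x < v))

# ave() applies the helper within each id and returns a vector in row order
data$count <- ave(data$input, data$id, FUN = count_smaller)
data$count
# 0 0 0 0 2 0 1 0 0 0 0 4
```

This is quadratic in the group size, like the sapply version, so it is best suited to small or moderately sized groups.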

1. outer and rowSums
data$count <- with(data, rowSums(outer(input, input, `>`)))
2. table and cumsum
tt <- cumsum(table(data$input))
v <- setNames(c(0, head(tt, -1)), c(head(names(tt), -1), tail(names(tt), 1)))
data$count <- v[match(data$input, names(v))]
3. data.table non-equi join
Perhaps more efficient with a non-equi join in data.table. Count number of rows (.N) for each match (by = .EACHI).
library(data.table)
setDT(data)
data[data, on = .(input < input), .N, by = .EACHI]
If your data is grouped by 'id', as in your update, join on that variable as well:
data[data, on = .(id, input < input), .N, by = .EACHI]
# id input N
# 1: 1 1 0
# 2: 1 1 0
# 3: 2 1 0
# 4: 2 1 0
# 5: 2 2 2
# 6: 3 2 0
# 7: 3 3 1
# 8: 4 5 0
# 9: 4 5 0
# 10: 4 5 0
# 11: 4 5 0
# 12: 4 6 4
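As a further alternative (an assumption on my part, not taken from the answers above), the count of strictly smaller values can also be derived from ranks: with ties.method = "min", the rank of a value is one plus the number of values strictly below it. A minimal sketch, checked against the outer approach:

```r
input <- c(1, 1, 1, 1, 2, 2, 3, 5, 5, 5, 5, 6)

# Minimum rank minus one = number of strictly smaller values
count_rank <- rank(input, ties.method = "min") - 1

# Reference result from the outer/rowSums approach
count_outer <- rowSums(outer(input, input, `>`))

count_rank
# 0 0 0 0 4 4 6 7 7 7 7 11
```

Because rank sorts once instead of comparing every pair, this avoids building the n-by-n matrix that outer creates. It assumes the input contains no NA values.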

Ranking duplicated rows in R [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
I am trying to create an additional variable, flag, that numbers the repeated observations of each id, starting from 0.
dataset <- data.frame(id = c(1,1,1,2,2,4,6,6,6,7,7,7,7,8))
The intended result looks like this:
id flag
1 0
1 1
1 2
2 0
2 1
4 0
6 0
6 1
6 2
7 0
7 1
7 2
7 3
8 0
Thank You!
You may try
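# use the full component name `lengths`, and lapply so equal run lengths are not simplified to a matrix
dataset$flag <- unlist(lapply(rle(dataset$id)$lengths, function(x) seq_len(x) - 1))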
id flag
1 1 0
2 1 1
3 1 2
4 2 0
5 2 1
6 4 0
7 6 0
8 6 1
9 6 2
10 7 0
11 7 1
12 7 2
13 7 3
14 8 0
data.table:
library(data.table)
setDT(dataset)[, flag := rowid(id) - 1]
dataset
id flag
1: 1 0
2: 1 1
3: 1 2
4: 2 0
5: 2 1
6: 4 0
7: 6 0
8: 6 1
9: 6 2
10: 7 0
11: 7 1
12: 7 2
13: 7 3
14: 8 0
Base R:
dataset$flag = sequence(rle(dataset$id)$lengths) - 1
dataset
id flag
1 1 0
2 1 1
3 1 2
4 2 0
5 2 1
6 4 0
7 6 0
8 6 1
9 6 2
10 7 0
11 7 1
12 7 2
13 7 3
14 8 0
Another base option:
transform(dataset,
flag = Reduce(function(x, y) y * x + y, duplicated(id), accumulate = TRUE))
id flag
1 1 0
2 1 1
3 1 2
4 2 0
5 2 1
6 4 0
7 6 0
8 6 1
9 6 2
10 7 0
11 7 1
12 7 2
13 7 3
14 8 0
dplyr -
library(dplyr)
dataset %>% group_by(id) %>% mutate(flag = row_number() - 1)
# id flag
# <dbl> <dbl>
# 1 1 0
# 2 1 1
# 3 1 2
# 4 2 0
# 5 2 1
# 6 4 0
# 7 6 0
# 8 6 1
# 9 6 2
#10 7 0
#11 7 1
#12 7 2
#13 7 3
#14 8 0
Base R with similar logic
transform(dataset, flag = ave(id, id, FUN = seq_along) - 1)
Another way to get the expected result, with a little more code:
library(dplyr)
# count the rows per id
x <- dataset %>%
  group_by(id) %>%
  summarise(nreg = n())
df <- data.frame()
for (i in seq_len(nrow(x))) {
  # 0-based sequence for the i-th id
  flag <- data.frame(id = rep(x$id[i], x$nreg[i]),
                     flag = seq(0, x$nreg[i] - 1))
  df <- rbind(df, flag)
}
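One point worth checking before choosing among these answers: the rle-based solutions number rows within consecutive runs, whereas the grouped solutions (ave, group_by, rowid) number rows within each id regardless of position. The two agree when equal ids are contiguous, as in the question's data, but differ otherwise. A small sketch with a hypothetical unsorted id vector:

```r
# Hypothetical ids where the same id reappears after a gap
id <- c(1, 1, 2, 1, 2)

# Run-based numbering: restarts at every change of id
flag_rle <- sequence(rle(id)$lengths) - 1
flag_rle
# 0 1 0 0 0

# Group-based numbering: continues counting across gaps
flag_ave <- ave(seq_along(id), id, FUN = seq_along) - 1
flag_ave
# 0 1 0 2 1
```

If the data may be unsorted and the count should run per id, sorting by id first or using a grouped approach seems the safer choice.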

Use R to find values for which a condition is first met

Consider the following sample dataset. Id is an individual identifier.
rm(list=ls()); set.seed(1)
n<-100
X<-rbinom(n, 1, 0.5) #binary covariate
j<-rep (1:n)
dat<-data.frame(id=1:n, X)
ntp<- rep(4, n)
mat<-matrix(ncol=3,nrow=1)
m=0; w <- mat
for(l in ntp)
{
m=m+1
ft<- seq(from = 2, to = 8, length.out = l)
# ft<- seq(from = 1, to = 9, length.out = l)
ft<-sort(ft)
seq<-rep(ft,each=2)
seq<-c(0,seq,10)
matid<-cbind( matrix(seq,ncol=2,nrow=l+1,byrow=T ) ,m)
w<-rbind(w,matid)
}
d<-data.frame(w[-1,])
colnames(d)<-c("time1","time2","id")
D <- round( merge(d,dat,by="id") ,2) #merging dataset
nr<-nrow(D)
D$Survival_time<-round(rexp(nr, 0.1)+1,3)
head(D,15)
id time1 time2 X Survival_time
1 1 0 2 0 21.341
2 1 2 4 0 18.987
3 1 4 6 0 4.740
4 1 6 8 0 13.296
5 1 8 10 0 6.397
6 2 0 2 0 10.566
7 2 2 4 0 2.470
8 2 4 6 0 14.907
9 2 6 8 0 8.620
10 2 8 10 0 13.376
11 3 0 2 1 45.239
12 3 2 4 1 11.545
13 3 4 6 1 11.352
14 3 6 8 1 19.760
15 3 8 10 1 7.547
How can I obtain, for each individual, the first value of Survival_time that is less than time2? I should end up with the following values:
id Survival_time
1 4.740
2 2.470
3 7.547
Also, how can I subset the data so that each individual's rows stop once this condition occurs? That is, obtain:
id time1 time2 X Survival_time
1 1 0 2 0 21.341
2 1 2 4 0 18.987
3 1 4 6 0 4.740
6 2 0 2 0 10.566
7 2 2 4 0 2.470
11 3 0 2 1 45.239
12 3 2 4 1 11.545
13 3 4 6 1 11.352
14 3 6 8 1 19.760
15 3 8 10 1 7.547
Using data.table
library(data.table)
setDT(D)[, .SD[seq_len(.N) <= which(Survival_time < time2)[1]], id]
-output
id time1 time2 X Survival_time
1: 1 0 2 0 21.341
2: 1 2 4 0 18.987
3: 1 4 6 0 4.740
4: 2 0 2 0 10.566
5: 2 2 4 0 2.470
6: 3 0 2 1 45.239
7: 3 2 4 1 11.545
8: 3 4 6 1 11.352
9: 3 6 8 1 19.760
10: 3 8 10 1 7.547
Slight variation:
library(dplyr)
D %>% # Take D, and then
group_by(id) %>% # group by id, and then
filter(Survival_time < time2) %>% # keep Survival times < time2, and then
slice(1) %>% # keep the first row per id, and then
ungroup() # ungroup
You can use -
library(dplyr)
D %>%
group_by(id) %>%
summarise(Survival_time = Survival_time[match(TRUE, Survival_time < time2)])
#Also using which.max
#summarise(Survival_time = Survival_time[which.max(Survival_time < time2)])
# id Survival_time
# <int> <dbl>
#1 1 4.74
#2 2 2.47
#3 3 7.55
To select the rows up to that point you may use -
D %>%
group_by(id) %>%
filter(row_number() <= match(TRUE, Survival_time < time2)) %>%
ungroup
# id time1 time2 X Survival_time
# <int> <int> <int> <int> <dbl>
# 1 1 0 2 0 21.3
# 2 1 2 4 0 19.0
# 3 1 4 6 0 4.74
# 4 2 0 2 0 10.6
# 5 2 2 4 0 2.47
# 6 3 0 2 1 45.2
# 7 3 2 4 1 11.5
# 8 3 4 6 1 11.4
# 9 3 6 8 1 19.8
#10 3 8 10 1 7.55
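The answers above assume every id meets the condition at least once. When an id never does, match(TRUE, ...) and which(...)[1] both return NA. A base R sketch of both tasks with that case made explicit, using the 15 rows printed in the question; keeping all rows for ids without a hit is an assumption about the desired behaviour:

```r
# First three individuals, as printed by head(D, 15) in the question
D <- data.frame(
  id = rep(1:3, each = 5),
  time2 = rep(c(2, 4, 6, 8, 10), times = 3),
  Survival_time = c(21.341, 18.987, 4.740, 13.296, 6.397,
                    10.566, 2.470, 14.907, 8.620, 13.376,
                    45.239, 11.545, 11.352, 19.760, 7.547)
)
hit <- D$Survival_time < D$time2

# Row index of the first hit per id (NA when an id never meets the condition)
first_row <- sapply(split(seq_len(nrow(D)), D$id),
                    function(i) i[match(TRUE, hit[i])])
first_values <- D[first_row[!is.na(first_row)], c("id", "Survival_time")]

# Keep rows up to and including the first hit;
# assumption: ids without any hit keep all of their rows
keep <- as.logical(ave(hit, D$id, FUN = function(h) {
  k <- match(TRUE, h)
  if (is.na(k)) rep(TRUE, length(h)) else seq_along(h) <= k
}))
truncated <- D[keep, ]
```

Here first_values holds 4.740, 2.470 and 7.547 for ids 1 to 3, and truncated keeps rows 1-3, 6-7 and 11-15, matching the output requested in the question.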

R Configure Data With Data.Table

data=data.frame("Student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
"Grade"=c(5,6,7,3,4,5,4,5,6,8,9,10,2,3,4),
"Pass"=c(NA,0,1,0,1,1,0,1,0,0,NA,NA,0,0,0),
"NEWPass"=c(0,0,1,0,1,1,0,1,1,0,0,0,0,0,0),
"GradeNEWPass"=c(7,7,7,4,4,4,5,5,5,10,10,10,4,4,4),
"GradeBeforeNEWPass"=c(6,6,6,3,3,3,4,4,4,10,10,10,4,4,4))
I have a data.frame called data. It has column names Student, Grade and Pass. I wish to do this:
NEWPass: Take Pass and, for every Student, fill in NA values with the previous value. If the first value is NA, then put a zero. The result should then be a running maximum.
GradeNEWPass: Take the lowest value of Grade at which a Student has a one in NEWPass. If a Student never has a one in NEWPass, this equals the maximum Grade.
GradeBeforeNEWPass: Take the value of Grade just BEFORE a Student first has a one in NEWPass. If a Student never has a one in NEWPass, this equals the maximum Grade.
Attempts:
setDT(data)[, NEWPassTry := cummax(Pass), by = Student]
data$GradeNEWPass = data$NEWPassTry * data$Grade
data[, GradeNEWPass := min(GradeNEWPass), by = Student]
Not pretty, admittedly, but your logic includes words like "if any ... for a student", so it's a group-wise conditional, not a row-wise conditional.
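library(data.table)
library(magrittr) # just for %>% for breakout, not required
mydata <- as.data.table(data) # assumption: mydata is the question's data as a data.table
mydata %>%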
.[, NEWPass2 :=
cummax(fifelse(seq_len(.N) == 1 & is.na(Pass), 0,
zoo::na.locf(Pass, na.rm = FALSE))), by = .(Student) ] %>%
.[, GradeNEWPass2 :=
if (any(NEWPass2 > 0)) min(Grade[ NEWPass2 > 0 ]) else max(Grade),
by = .(Student) ] %>%
.[, GradeBeforeNEWPass2 :=
if (NEWPass2[1] == 0 && any(NEWPass2 > 0)) Grade[ which(NEWPass2 > 0)[1] - 1 ] else max(Grade),
by = .(Student) ]
# Student Grade Pass NEWPass GradeNEWPass GradeBeforeNEWPass NEWPass2 GradeNEWPass2 GradeBeforeNEWPass2
# 1: 1 5 NA 0 7 6 0 7 6
# 2: 1 6 0 0 7 6 0 7 6
# 3: 1 7 1 1 7 6 1 7 6
# 4: 2 3 0 0 4 3 0 4 3
# 5: 2 4 1 1 4 3 1 4 3
# 6: 2 5 1 1 4 3 1 4 3
# 7: 3 4 0 0 5 4 0 5 4
# 8: 3 5 1 1 5 4 1 5 4
# 9: 3 6 0 1 5 4 1 5 4
# 10: 4 8 0 0 10 10 0 10 10
# 11: 4 9 NA 0 10 10 0 10 10
# 12: 4 10 NA 0 10 10 0 10 10
# 13: 5 2 0 0 4 4 0 4 4
# 14: 5 3 0 0 4 4 0 4 4
# 15: 5 4 0 0 4 4 0 4 4
I'm using magrittr::%>% solely to break it out into stages of computation; it is not required.
We can use data.table methods. Grouped by 'Student', create a logical index ('i1') that is TRUE where 'Pass' is 1 and not NA, get the position of the first 1 with which and head ('i2'), and calculate the max of 'Grade' ('mx'). Then create the three columns from these: 'v1' is the cumulative maximum of the binary index; 'v2' is the 'Grade' at index 'i2' if there are any 1s, or else 'mx'; 'v3' is the same, except that 1 is subtracted from the index to get the 'Grade' just before the first 1.
library(data.table)
setDT(data)[, c('NEWPass1', 'GradeNEWPass1', 'GradeBeforeNEWPass1') :={
i1 <- Pass == 1 & !is.na(Pass)
i2 <- head(which(i1), 1)
mx <- max(Grade, na.rm = TRUE)
v1 <- cummax(+(i1))
v2 <- if(any(i1)) Grade[i2] else mx
v3 <- if(any(i1)) Grade[max(1, i2-1)] else mx
.(v1, v2, v3)}, Student]
data
# Student Grade Pass NEWPass GradeNEWPass GradeBeforeNEWPass NEWPass1 GradeNEWPass1 GradeBeforeNEWPass1
# 1: 1 5 NA 0 7 6 0 7 6
# 2: 1 6 0 0 7 6 0 7 6
# 3: 1 7 1 1 7 6 1 7 6
# 4: 2 3 0 0 4 3 0 4 3
# 5: 2 4 1 1 4 3 1 4 3
# 6: 2 5 1 1 4 3 1 4 3
# 7: 3 4 0 0 5 4 0 5 4
# 8: 3 5 1 1 5 4 1 5 4
# 9: 3 6 0 1 5 4 1 5 4
#10: 4 8 0 0 10 10 0 10 10
#11: 4 9 NA 0 10 10 0 10 10
#12: 4 10 NA 0 10 10 0 10 10
#13: 5 2 0 0 4 4 0 4 4
#14: 5 3 0 0 4 4 0 4 4
#15: 5 4 0 0 4 4 0 4 4
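Because Pass is binary and the result is a running maximum, filling an NA with the previous value and replacing it with 0 lead to the same NEWPass: a carried-forward 0 adds nothing, and a carried-forward 1 is already captured by the running maximum. A base R sketch under that observation, without zoo or data.table; the names d and per_student are hypothetical:

```r
d <- data.frame(
  Student = rep(1:5, each = 3),
  Grade = c(5, 6, 7, 3, 4, 5, 4, 5, 6, 8, 9, 10, 2, 3, 4),
  Pass = c(NA, 0, 1, 0, 1, 1, 0, 1, 0, 0, NA, NA, 0, 0, 0)
)

# Running maximum per Student, treating NA as 0
d$NEWPass <- ave(d$Pass, d$Student,
                 FUN = function(p) cummax(ifelse(is.na(p), 0, p)))

# One value per Student: lowest Grade with NEWPass == 1, else the maximum Grade
per_student <- sapply(split(d, d$Student), function(g) {
  passed <- g$NEWPass > 0
  if (any(passed)) min(g$Grade[passed]) else max(g$Grade)
})

# Map the per-Student value back onto every row
d$GradeNEWPass <- unname(per_student[as.character(d$Student)])
```

On the question's data this reproduces the NEWPass and GradeNEWPass columns given there. The equivalence relies on Pass containing only 0, 1 and NA.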

If a value appears in the row, all subsequent rows should take this value (with dplyr)

I'm just starting to learn R and I'm already facing the first bigger problem.
Let's take the following panel dataset as an example:
N=5
T=3
time<-rep(1:T, times=N)
id<- rep(1:N,each=T)
dummy<- c(0,0,1,1,0,0,0,1,0,0,0,1,0,1,0)
df<-as.data.frame(cbind(id, time,dummy))
id time dummy
1 1 1 0
2 1 2 0
3 1 3 1
4 2 1 1
5 2 2 0
6 2 3 0
7 3 1 0
8 3 2 1
9 3 3 0
10 4 1 0
11 4 2 0
12 4 3 1
13 5 1 0
14 5 2 1
15 5 3 0
I now want the dummy variable for all rows of a cross section to take the value 1 after the 1 for this cross section appears for the first time. So, what I want is:
id time dummy
1 1 1 0
2 1 2 0
3 1 3 1
4 2 1 1
5 2 2 1
6 2 3 1
7 3 1 0
8 3 2 1
9 3 3 1
10 4 1 0
11 4 2 0
12 4 3 1
13 5 1 0
14 5 2 1
15 5 3 1
So I guess I need something like:
df_new<-df %>%
group_by(id) %>%
???
I already tried to set all zeros to NA and use the na.locf function, but it didn't really work.
Anybody got an idea?
Thanks!
Use cummax
df %>%
group_by(id) %>%
mutate(dummy = cummax(dummy))
# A tibble: 15 x 3
# Groups: id [5]
# id time dummy
# <dbl> <dbl> <dbl>
# 1 1 1 0
# 2 1 2 0
# 3 1 3 1
# 4 2 1 1
# 5 2 2 1
# 6 2 3 1
# 7 3 1 0
# 8 3 2 1
# 9 3 3 1
#10 4 1 0
#11 4 2 0
#12 4 3 1
#13 5 1 0
#14 5 2 1
#15 5 3 1
Without additional packages you could do
transform(df, dummy = ave(dummy, id, FUN = cummax))
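cummax works here because, once a 1 has been seen within an id, the running maximum stays at 1 for all later rows. This depends on the rows being in time order within each id. A minimal sketch that sorts first, using a hypothetical shuffled copy of the question's data:

```r
df <- data.frame(
  id = rep(1:5, each = 3),
  time = rep(1:3, times = 5),
  dummy = c(0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0)
)

# Hypothetical input whose rows are not in time order
shuffled <- df[c(3, 1, 2, 4:15), ]

# Sort by id and time, then carry the first 1 forward within each id
shuffled <- shuffled[order(shuffled$id, shuffled$time), ]
shuffled$dummy <- ave(shuffled$dummy, shuffled$id, FUN = cummax)
shuffled$dummy
# 0 0 1 1 1 1 0 1 1 0 0 1 0 1 1
```

Without the ordering step, the 1 for id 1 would sit in the first row of the shuffled data and would spread to rows that actually come earlier in time.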

Extracting event rows from a data frame

I have this data frame:
df <-
ID var TIME value method
1 3 0 2 1
1 3 2 2 1
1 3 3 0 1
1 4 0 10 1
1 4 2 10 1
1 4 4 5 1
1 4 6 5 1
2 3 0 2 1
2 3 2 2 1
2 3 3 0 1
2 4 0 10 1
2 4 2 10 1
2 4 4 5 1
2 4 6 5 1
I want to extract the rows that have a new event in the value column. For example, for ID=1, var=3 has a value of 2 at TIME=0. This value stays the same at TIME=2, so I would keep only the first row (TIME=0) and discard the second row. In the third row, however, the value for var=3 has changed to zero, so I also have to extract this row. And so on for the rest of the variables. This has to be applied for every subject ID. For the above df, the result should be as follows:
dfevent <-
ID var TIME value method
1 3 0 2 1
1 3 3 0 1
1 4 0 10 1
1 4 4 5 1
2 3 0 2 1
2 3 3 0 1
2 4 0 10 1
2 4 4 5 1
Could anyone help me do this in R? I have a huge data set and I want to extract the rows at which a new event has occurred in the value of every var. I have five variables in the data frame, numbered 3, 4, 5, 6, and 7. The above is an example with two of them (variable numbers 3 and 4).
This does it using dplyr
library(dplyr)
df %>%
group_by(ID, var) %>%
mutate(tf = ifelse(value==lag(value), 1, 0)) %>%
filter(is.na(tf) | tf==0) %>%
select(-tf)
# ID var TIME value method
#1 1 3 0 2 1
#2 1 3 3 0 1
#3 1 4 0 10 1
#4 1 4 4 5 1
#5 2 3 0 2 1
#6 2 3 3 0 1
#7 2 4 0 10 1
#8 2 4 4 5 1
Basically, I created an extra variable that returns 1 when the value is the same as in the preceding row within groups of unique ID/var combinations. I then keep the rows where it is NA (the first row of a group) or 0 (the value changed), and drop the variable before returning the output.
Base solution:
df[with(df, abs(ave(value,ID,var,FUN=function(x) c(1,diff(x)) ))) > 0,]
# ID var TIME value method
#1 1 3 0 2 1
#3 1 3 3 0 1
#4 1 4 0 10 1
#6 1 4 4 5 1
#8 2 3 0 2 1
#10 2 3 3 0 1
#11 2 4 0 10 1
#13 2 4 4 5 1
From the expected results, you may also try rleid from data.table
library(data.table)#data.table_1.9.5
setDT(df)[df[, .I[1L] , list(ID, var, rleid(value))]$V1]
# ID var TIME value method
#1: 1 3 0 2 1
#2: 1 3 3 0 1
#3: 1 4 0 10 1
#4: 1 4 4 5 1
#5: 2 3 0 2 1
#6: 2 3 3 0 1
#7: 2 4 0 10 1
#8: 2 4 4 5 1
Or a similar approach to @thelatemail's, grouped by both ID and var
setDT(df)[df[, .I[abs(c(1,diff(value)))>0] , .(ID, var)]$V1]
Or
unique(setDT(df)[, id:=rleid(value)], by=c('ID', 'var', 'id'))
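Grouping by ID alone is not enough in general: if one var ends on the same value that the next var starts with, the first row of the next var is not detected as a change. A base R sketch with a hypothetical four-row data frame that shows the difference:

```r
# Hypothetical data: var 3 ends at value 5 and var 4 starts at value 5
df <- data.frame(ID = c(1, 1, 1, 1),
                 var = c(3, 3, 4, 4),
                 TIME = c(0, 2, 0, 2),
                 value = c(2, 5, 5, 5))

# First row of each group is always flagged (the leading 1), then any change
is_event <- function(x) c(1, diff(x))

changed_id_only <- with(df, ave(value, ID, FUN = is_event) != 0)
changed_id_var <- with(df, ave(value, ID, var, FUN = is_event) != 0)

changed_id_only
# TRUE TRUE FALSE FALSE
changed_id_var
# TRUE TRUE TRUE FALSE
df[changed_id_var, ]
```

With the question's sample data both groupings give the same rows, because the value happens to change at every var boundary.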
