I have a dataset of the following form:
ID Var1 Var2
1 2 0
1 8 0
1 12 0
1 11 1
1 10 1
2 5 0
2 8 0
2 7 0
2 6 1
2 5 1
I would like to subset the data frame and create a new data frame containing, for each group, only the rows from the one where Var1 first reaches its group maximum (including that row) up to the row where Var2 becomes 1 for the first time (also including that row). The result should look like this:
ID Var1 Var2
1 12 0
1 11 1
2 8 0
2 7 0
2 6 1
The original dataset contains a number of NAs, and the function should simply ignore those. Also, if Var2 never reaches 1 for a group, it should just add all of that group's rows to the new data frame (of course, only the ones from the point where Var1 reaches its group maximum).
However, I cannot wrap my head around the programming. Can anyone help?
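For reference, a minimal base R sketch of the requested subsetting (assuming the data frame is called df1, that the group maximum occurs before Var2 first turns 1 as in the example, and ignoring the NA handling for brevity):
# for each ID, keep rows from the first occurrence of the maximum of Var1
# up to and including the first row where Var2 == 1 (or the last row if none)
do.call(rbind, lapply(split(df1, df1$ID), function(g) {
  start <- which.max(g$Var1)
  end   <- if (any(g$Var2 == 1)) which(g$Var2 == 1)[1] else nrow(g)
  g[start:end, ]
}))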
A dplyr solution with a cumsum-based filter will do what the question asks for.
library(dplyr)
df1 %>%
  group_by(ID) %>%
  # the first condition drops rows before Var1 first reaches its group maximum,
  # the second keeps rows only up to (and including) the first row where Var2 == 1
  filter(cumsum(Var1 == max(Var1)) == 1, cumsum(Var2) <= 1)
## A tibble: 5 x 3
## Groups: ID [2]
# ID Var1 Var2
# <int> <int> <int>
#1 1 12 0
#2 1 11 1
#3 2 8 0
#4 2 7 0
#5 2 6 1
Edit
Here is a solution that addresses the OP's comment and the question edit.
library(tidyr)  # replace_na() comes from tidyr
df1 %>%
  group_by(ID) %>%
  mutate_at(vars(starts_with('Var')), ~ replace_na(., 0L)) %>%
  filter(cumsum(Var1 == max(Var1)) == 1, cumsum(Var2) <= 1)
Data
df1 <- read.table(text = "
ID Var1 Var2
1 2 0
1 8 0
1 12 0
1 11 1
1 10 1
2 5 0
2 8 0
2 7 0
2 6 1
2 5 1
", header = TRUE)
Using data.table with .I
library(data.table)
setDT(df1)[df1[, .I[cumsum(Var1 == max(Var1)) & cumsum(Var2) <= 1], by="ID"]$V1]
# ID Var1 Var2
#1: 1 12 0
#2: 1 11 1
#3: 2 8 0
#4: 2 7 0
#5: 2 6 1
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
                      Var1 = c(2L, 8L, 12L, 11L, 10L, 5L, 8L, 7L, 6L, 5L),
                      Var2 = c(0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L)),
                 class = "data.frame", row.names = c(NA, -10L))
Here is a data.table translation of Rui Barradas' working solution:
library(data.table)
dat <- fread(text = "
ID Var1 Var2
1 2 0
1 8 0
1 12 0
1 11 1
1 10 1
2 5 0
2 8 0
2 7 0
2 6 1
2 5 1
", header = TRUE)
dat[, .SD[cumsum(Var1 == max(Var1)) & cumsum(Var2) <= 1], by="ID"]
Related
Suppose I want to find duplicate rows based on certain columns:
cols<-c("col1", "col2")
I know that for data frame df4 the duplicate rows are:
Jo<-df4[duplicated(df4[cols]) | duplicated(df4[cols], fromLast = TRUE), ]
and removing these duplicate rows from the data set is done with:
No<-df4[!(duplicated(df4[cols]) | duplicated(df4[cols], fromLast = TRUE)), ]
I want to modify the above code. Suppose there is a column called mode that takes integer values between 1 and 4. I do not want rows to count as duplicates when all of them have mode == 2.
example
col1 col2 mode
1 3 5
5 3 9
1 2 1
1 2 1
3 2 2
3 2 2
4 1 3
4 1 2
4 1 2
output
Jo:
col1 col2 mode
1 2 1
1 2 1
4 1 3
4 1 2
4 1 2
No:
col1 col2 mode
1 3 5
5 3 9
3 2 2
3 2 2
In the above example, the two rows with col1 == 3 and col2 == 2 both have mode == 2, so they are not treated as duplicates; but the last three rows are treated as duplicates, since one of them has a mode other than 2.
Based on the updated dataset,
library(dplyr)
out1 <- df2 %>%
  group_by_at(vars(cols)) %>%
  filter(n() > 1, !all(mode == 2))
out2 <- anti_join(df2, out1)
out1
# A tibble: 5 x 3
# Groups: col1, col2 [2]
# col1 col2 mode
# <int> <int> <int>
#1 1 2 1
#2 1 2 1
#3 4 1 3
#4 4 1 2
#5 4 1 2
out2
# col1 col2 mode
#1 1 3 5
#2 5 3 9
#3 3 2 2
#4 3 2 2
Or with data.table
library(data.table)
i1 <- setDT(df2)[ , .I[.N > 1 & !all(mode == 2)], by = cols]$V1
df2[i1]
# col1 col2 mode
#1: 1 2 1
#2: 1 2 1
#3: 4 1 3
#4: 4 1 2
#5: 4 1 2
df2[!i1]
# col1 col2 mode
#1: 1 3 5
#2: 5 3 9
#3: 3 2 2
#4: 3 2 2
Or using base R
i1 <- duplicated(df2[1:2])|duplicated(df2[1:2], fromLast = TRUE)
out11 <- df2[i1 & with(df2, !ave(mode==2, col1, col2, FUN = all)),]
out22 <- df2[setdiff(row.names(df2), row.names(out11)),]
data
df2 <- structure(list(col1 = c(1L, 5L, 1L, 1L, 3L, 3L, 4L, 4L, 4L),
                      col2 = c(3L, 3L, 2L, 2L, 2L, 2L, 1L, 1L, 1L),
                      mode = c(5L, 9L, 1L, 1L, 2L, 2L, 3L, 2L, 2L)),
                 class = "data.frame", row.names = c(NA, -9L))
I have a data frame like this:
ID Number Var
1 2 6
1 2 7
1 1 8
1 2 9
1 2 10
2 2 3
2 2 4
2 1 5
2 2 6
Each person has several records.
For each person, there is only one record where Number is 1; the rest are 2.
The variable Var takes different values for the same person.
When Number equals 1, the corresponding Var (call it P) differs from person to person.
Now, for every person, I want to delete the rows where Var > P.
In the end, I want this:
ID Number Var
1 2 6
1 2 7
1 1 8
2 2 3
2 2 4
2 1 5
You can use dplyr::first where Number == 1 to get the first Var value:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(Flag = first(Var[Number == 1])) %>%
  filter(Var <= Flag) %>%
  select(-Flag)
# short version, if you are sure there is exactly one row with Number == 1 per ID
df %>% group_by(ID) %>% filter(Var <= Var[Number==1])
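Note that the short version assumes every ID really has a Number == 1 row; if a group lacks one, Var[Number == 1] has length zero and the filter condition will not recycle cleanly. A slightly more defensive variant (my own sketch, not part of the original answer) supplies a default:
df %>% group_by(ID) %>% filter(Var <= first(Var[Number == 1], default = Inf))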
Here is a solution with data.table:
library(data.table)
dt <- fread(
"ID Number Var
1 2 6
1 2 7
1 1 8
1 2 9
1 2 10
2 2 3
2 2 4
2 1 5
2 2 6")
dt[, .SD[Var <= Var[Number==1]], ID]
# ID Number Var
# 1: 1 2 6
# 2: 1 2 7
# 3: 1 1 8
# 4: 2 2 3
# 5: 2 2 4
# 6: 2 1 5
A base R option would be
# Var * (Number == 1) is zero except on the Number == 1 row, so ave() returns
# each ID's P value, recycled across the whole group
df1[with(df1, Var <= ave(Var * (Number == 1), ID, FUN = function(x) x[x != 0])), ]
# ID Number Var
#1 1 2 6
#2 1 2 7
#3 1 1 8
#6 2 2 3
#7 2 2 4
#8 2 1 5
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
                      Number = c(2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L),
                      Var = c(6L, 7L, 8L, 9L, 10L, 3L, 4L, 5L, 6L)),
                 row.names = c(NA, -9L), class = "data.frame")
Suppose we have the following data with column names "id", "time" and "x":
df<-
structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
time = c(20L, 6L, 7L, 11L, 13L, 2L, 6L),
x = c(1L, 1L, 0L, 1L, 1L, 1L, 0L)
),
.Names = c("id", "time", "x"),
class = "data.frame",
row.names = c(NA,-7L)
)
Each id has multiple observations for time and x. I want to extract the last observation for each id and form a new data frame that repeats these observations according to the number of observations per id in the original data. I am able to extract the last observation for each id using the following code:
library(dplyr)
df <- df %>%
  group_by(id) %>%
  filter((x == 0 & row_number() == n()) | (x == 1 & row_number() == n()))
What is left unresolved is the repetition aspect. The expected output would look like
df <-
structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
time = c(7L, 7L, 7L, 13L, 13L, 6L, 6L),
x = c(0L, 0L, 0L, 1L, 1L, 0L, 0L)
),
.Names = c("id", "time", "x"),
class = "data.frame",
row.names = c(NA,-7L)
)
Thanks for your help in advance.
We can use ave to replace each row number with the last (maximum) row number of its id group and then subset the data frame with that index, which repeats the last row of each group once per original row.
df[ave(1:nrow(df), df$id, FUN = max), ]
# id time x
#3 1 7 0
#3.1 1 7 0
#3.2 1 7 0
#5 2 13 1
#5.1 2 13 1
#7 3 6 0
#7.1 3 6 0
You can do this by using last() to grab the last row within each id.
df %>%
  group_by(id) %>%
  mutate(time = last(time),
         x = last(x))
Because last(x) returns a single value, it gets expanded out to fill all the rows in the mutate() call.
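A tiny illustration of that recycling, with made-up data:
data.frame(g = c(1, 1, 2), v = c(10, 20, 30)) %>%
  group_by(g) %>%
  mutate(last_v = last(v))
# last_v is 20, 20, 30: one value per group, repeated across that group's rows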
This can also be applied to an arbitrary number of variables using mutate_at:
df %>%
  group_by(id) %>%
  mutate_at(vars(-id), ~ last(.))
slice will be your friend in the tidyverse I reckon:
df %>%
  group_by(id) %>%
  slice(rep(n(), n()))
## A tibble: 7 x 3
## Groups: id [3]
# id time x
# <int> <int> <int>
#1 1 7 0
#2 1 7 0
#3 1 7 0
#4 2 13 1
#5 2 13 1
#6 3 6 0
#7 3 6 0
In data.table, you could also use the mult= argument of a join:
library(data.table)
setDT(df)
df[df[,.(id)], on="id", mult="last"]
# id time x
#1: 1 7 0
#2: 1 7 0
#3: 1 7 0
#4: 2 13 1
#5: 2 13 1
#6: 3 6 0
#7: 3 6 0
And in base R, a merge will get you there too:
merge(df["id"], df[!duplicated(df$id, fromLast=TRUE),])
# id time x
#1 1 7 0
#2 1 7 0
#3 1 7 0
#4 2 13 1
#5 2 13 1
#6 3 6 0
#7 3 6 0
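Another base R possibility (my own sketch, not from the original answers) indexes the last row of each id and repeats it according to the group sizes:
last_idx <- tapply(seq_len(nrow(df)), df$id, max)  # last row number per id
df[rep(last_idx, table(df$id)), ]                  # repeat it once per original row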
Using data.table you can try
library(data.table)
setDT(df)[,.(time=rep(time[.N],.N), x=rep(x[.N],.N)), by=id]
id time x
1: 1 7 0
2: 1 7 0
3: 1 7 0
4: 2 13 1
5: 2 13 1
6: 3 6 0
7: 3 6 0
Following @thelatemai, to avoid naming the columns you can also try
df[, .SD[rep(.N,.N)], by=id]
id time x
1: 1 7 0
2: 1 7 0
3: 1 7 0
4: 2 13 1
5: 2 13 1
6: 3 6 0
7: 3 6 0
I have a data frame with two columns (ident and value). I would like to create a counter that restarts every time ident changes and also whenever value changes within each ident. Here is an example to make it clear:
# ident value counter
#--------------------
# 1 0 1
# 1 0 2
# 1 1 1
# 1 1 2
# 1 1 3
# 1 0 1
# 1 1 1
# 1 1 2
# 2 1 1
# 2 0 1
# 2 0 2
# 2 0 3
I've tried the plyr package
ddply(mydf, .(ident, value), transform, .id = seq_along(ident))
I get the same result with the data.table package.
A data.table alternative uses the rleid/rowid functions. With rleid you create a run-length id for consecutive values, which can be used as a grouping variable; 1:.N or rowid then creates the counter. The code:
library(data.table)
# option 1:
setDT(d)[, counter := 1:.N, by = .(ident,rleid(value))]
# option 2:
setDT(d)[, counter := rowid(ident, rleid(value))]
which both give:
> d
ident value counter
1: 1 0 1
2: 1 0 2
3: 1 1 1
4: 1 1 2
5: 1 1 3
6: 1 0 1
7: 1 1 1
8: 1 1 2
9: 2 1 1
10: 2 0 1
11: 2 0 2
12: 2 0 3
With dplyr it is a bit less straightforward:
library(dplyr)
d %>%
  group_by(ident, val.gr = cumsum(value != lag(value, default = first(value)))) %>%
  mutate(counter = row_number()) %>%
  ungroup() %>%
  select(-val.gr)
As an alternative to the cumsum-function you could also use rleid from data.table.
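For example, a sketch of the same pipeline grouped on data.table::rleid(value) instead of the cumsum expression (assuming data.table is installed):
d %>%
  group_by(ident, val.gr = data.table::rleid(value)) %>%
  mutate(counter = row_number()) %>%
  ungroup() %>%
  select(-val.gr)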
Used data:
d <- structure(list(ident = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
value = c(0L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L)),
.Names = c("ident", "value"), class = "data.frame", row.names = c(NA, -12L))
We can paste the two columns together and use the lengths component of rle to get the lengths of the consecutive runs. We then use sequence to generate the counter.
df$counter <- sequence(rle(paste0(df$ident, df$value))$lengths)
df
# ident value counter
#1 1 0 1
#2 1 0 2
#3 1 1 1
#4 1 1 2
#5 1 1 3
#6 1 0 1
#7 1 1 1
#8 1 1 2
#9 2 1 1
#10 2 0 1
#11 2 0 2
#12 2 0 3
Suppose I have the following data frame:
Base Coupled Derived Decl
1 0 0 1
1 7 0 1
1 1 0 1
2 3 12 1
1 0 4 1
Here is the dput output:
temp <- structure(list(Base = c(1L, 1L, 1L, 2L, 1L), Coupled = c(0L,7L, 1L, 3L, 0L), Derived = c(0L, 0L, 0L, 12L, 4L), Decl = c(1L, 1L, 1L, 1L, 1L)), .Names = c("Base", "Coupled", "Derived", "Decl"), row.names = c(NA, 5L), class = "data.frame")
I want to compute the median for each column. Then, for each row, I want to count the number of cell values greater than the median for their respective columns and append this as a column called AboveMedians.
In the example, the medians would be c(1,1,0,1). The resulting table I want would be
Base Coupled Derived Decl AboveMedians
1 0 0 1 0
1 7 0 1 1
1 1 0 1 0
2 3 12 1 3
1 0 4 1 1
What is the elegant R way to do this? I have something involving a for-loop and sapply, but this doesn't seem optimal.
Thanks.
We can use colMedians from matrixStats after converting the data.frame to a matrix.
library(matrixStats)
Medians <- colMedians(as.matrix(temp))
Medians
#[1] 1 1 0 1
Then, replicate 'Medians' to match the dimensions of 'temp', do the comparison, and take the rowSums of the logical matrix.
temp$AboveMedians <- rowSums(temp > Medians[col(temp)])
temp$AboveMedians
#[1] 0 1 0 3 1
Or a base R only option is
apply(temp, 2, median)
# Base Coupled Derived Decl
# 1 1 0 1
rowSums(sweep(temp, 2, apply(temp, 2, median), FUN = ">"))
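To store the counts as the new column, the same expression can be assigned directly (a small usage sketch, assuming temp is still the original four-column data frame):
temp$AboveMedians <- rowSums(sweep(temp, 2, apply(temp, 2, median), FUN = ">"))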
Another alternative:
library(dplyr)
library(purrr)
temp %>%
  by_row(function(x) {
    sum(x > summarise_each(., funs(median)))
  },
  .to = "AboveMedian",
  .collate = "cols")
Which gives:
#Source: local data frame [5 x 5]
#
# Base Coupled Derived Decl AboveMedian
# <int> <int> <int> <int> <int>
#1 1 0 0 1 0
#2 1 7 0 1 1
#3 1 1 0 1 0
#4 2 3 12 1 3
#5 1 0 4 1 1
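Note that by_row has since moved from purrr to the purrrlyr package. A rough dplyr-only sketch of the same idea using rowwise() and c_across() (my own adaptation, assuming a recent dplyr version, not part of the original answer):
library(dplyr)
meds <- apply(temp, 2, median)  # column medians as a named numeric vector
temp %>%
  rowwise() %>%
  mutate(AboveMedian = sum(c_across(Base:Decl) > meds)) %>%
  ungroup()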