R aggregate column until one condition is met

I have a data frame of this form:
ID Var1 Var2
1 1 1
1 2 2
1 3 3
1 4 2
1 5 2
2 1 4
2 2 8
2 3 10
2 4 10
2 5 7
For each ID, I would like to find the maximum of Var1 among the rows that come before Var2 reaches its maximum. The result should be a new data frame containing only one row per ID, so the outcome should look something like this:
ID Var1
1 2
2 2
In other words, the function should take the maximum of Var1, but only consider the rows before Var2 reaches its maximum. The rows containing the maximum itself should not be included, and neither should the rows after it.
I tried building something with a while loop, but it didn't work out. I'd also be thankful if the solution doesn't use data.table.
Thanks in advance.

Maybe you could do something like this:
DF <- structure(list(
ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
Var1 = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L),
Var2 = c(1L, 2L, 3L, 2L, 2L, 4L, 8L, 10L, 10L, 7L)),
class = "data.frame", row.names = c(NA, -10L))
library(dplyr)
DF %>% group_by(ID) %>%
# seq_len() selects no rows (rather than 1:0) when the maximum is in the first row
slice(seq_len(which.max(Var2) - 1)) %>%
slice_max(Var1) %>%
select(ID, Var1)
#> # A tibble: 2 x 2
#> # Groups: ID [2]
#> ID Var1
#> <int> <int>
#> 1 1 2
#> 2 2 2
Created on 2020-08-04 by the reprex package (v0.3.0)
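
For comparison, here is a minimal base R sketch of the same idea, in case you want to avoid extra packages. The helper name max_before_peak is made up for illustration, and returning NA when Var2 already peaks in the first row of a group is an assumption, since the question does not say what should happen then.

```r
DF <- data.frame(
  ID   = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
  Var1 = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L),
  Var2 = c(1L, 2L, 3L, 2L, 2L, 4L, 8L, 10L, 10L, 7L)
)

# Hypothetical helper: maximum of Var1 over the rows that come before
# Var2 first reaches its maximum within one group.
max_before_peak <- function(g) {
  n_before <- which.max(g$Var2) - 1L      # number of rows before the first peak
  if (n_before == 0L) return(NA_integer_) # assumption: no earlier rows gives NA
  max(g$Var1[seq_len(n_before)])
}

# Apply the helper to each ID and collect one row per ID.
per_id <- vapply(split(DF, DF$ID), max_before_peak, integer(1))
result <- data.frame(ID = as.integer(names(per_id)), Var1 = unname(per_id))
result
```

which.max() returns the position of the first maximum, so ties in Var2 (as in ID 2) are cut off at the first occurrence.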

Related

R code to assign a sequence based off of multiple variables [duplicate]

This question already has answers here:
Recode dates to study day within subject
(2 answers)
Closed 3 years ago.
I have data structured as below:
ID Day Desired Output
1 1 1
1 1 1
1 1 1
1 2 2
1 2 2
1 3 3
2 4 1
2 4 1
2 5 2
3 6 1
3 6 1
Is it possible to create a sequence for the desired output without using a loop? The dataset is quite large so a loop won't work, is it possible to do this with the dplyr package or maybe a combination of cumsum/diff?
One option is to group by 'ID' and then match 'Day' against the unique values of 'Day' within each group, so that each distinct day gets the position of its first appearance:
library(dplyr)
df1 %>%
group_by(ID) %>%
mutate(desired = match(Day, unique(Day)))
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L,
3L), Day = c(1L, 1L, 1L, 2L, 2L, 3L, 4L, 4L, 5L, 6L, 6L)), row.names = c(NA,
-11L), class = "data.frame")
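
The same match-against-unique idea can probably be written without dplyr as well. The following is a rough base R sketch using ave() to apply the function per 'ID'; the column name desired is just the one used above.

```r
df1 <- data.frame(
  ID  = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
  Day = c(1L, 1L, 1L, 2L, 2L, 3L, 4L, 4L, 5L, 6L, 6L)
)

# Within each ID, number the days by the order in which they first appear.
df1$desired <- ave(df1$Day, df1$ID, FUN = function(d) match(d, unique(d)))
df1
```

ave() returns the values in the original row order, so the data does not need to be sorted by 'ID' first.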

Need help replacing values with NA when another condition is met in R (i.e. when another variable is a specific value)

I'm trying to delete some repeating information in my data set and replace it with NA. Here's an example of the data:
DataTable1
ID Day x y
1 1 1 3
1 2 1 3
2 1 2 5
2 2 2 5
3 1 3 4
3 2 3 4
4 1 4 6
4 2 4 6
I'm trying to replace "x" and "y" values with "NA" when Day=1. This is what I want:
ID Day x y
1 1 NA NA
1 2 1 3
2 1 NA NA
2 2 2 5
3 1 NA NA
3 2 3 4
4 1 NA NA
4 2 4 6
I'm not really sure where to start or how to go about this. I tried using the replace_with_na_if function from the naniar library. Otherwise, I am unsure what to try.
replace_with_na_if(data.frame=DataTable1$x,
condition=DataTable1$Day== 2)
I received an error message that reads:
Error in replace_with_na_if(data.frame = DataTable1$x, condition = DataTable1$Day == :
unused argument (data.frame = DataTable1$x)
An option in base R is to create a logical vector from the elements of 'Day', use it to index the rows of the 'x' and 'y' columns, and assign NA to them:
i1 <- df1$Day == 1
df1[i1, c('x', 'y')] <- NA
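
As a self-contained sketch of this indexing approach on the data from the question (DataTable1 is rebuilt here with data.frame(), and the name is_day1 is only illustrative):

```r
DataTable1 <- data.frame(
  ID  = rep(1:4, each = 2),
  Day = rep(1:2, times = 4),
  x   = rep(1:4, each = 2),
  y   = rep(c(3L, 5L, 4L, 6L), each = 2)
)

# TRUE for the rows whose x and y should be blanked out.
is_day1 <- DataTable1$Day == 1

# Assign NA to both columns in those rows; other rows are untouched.
DataTable1[is_day1, c("x", "y")] <- NA
DataTable1
```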
Here's a data.table solution. Since you may be new to R: you need to install the data.table package first. If you have a large data set, data.table may be faster than a plain data frame. I also find the syntax easy to read and understand.
#Create the data frame:
df <- structure(list(ID = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), Day = c(1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L), x = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), y = c(3L, 3L, 5L, 5L,
4L, 4L, 6L, 6L)), class = "data.frame", row.names = c(NA, -8L))
library(data.table)
dt <- setDT(df) # convert the data frame to a data.table
dt[Day == 1, c("x","y") := NA] # where Day equals 1, make the columns x and y equal NA
Good luck and welcome to stackoverflow!
With dplyr, we can use mutate_at and replace:
library(dplyr)
df %>% mutate_at(vars(x, y), ~replace(., Day == 1, NA))
# ID Day x y
#1 1 1 NA NA
#2 1 2 1 3
#3 2 1 NA NA
#4 2 2 2 5
#5 3 1 NA NA
#6 3 2 3 4
#7 4 1 NA NA
#8 4 2 4 6
data
df <- structure(list(ID = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), Day = c(1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L), x = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), y = c(3L, 3L, 5L, 5L,
4L, 4L, 6L, 6L)), class = "data.frame", row.names = c(NA, -8L))

get max value of x in relation of two variables in R [duplicate]

This question already has answers here:
Select the row with the maximum value in each group
(19 answers)
Extract row corresponding to minimum value of a variable by group
(9 answers)
Closed 4 years ago.
In my data
data=structure(list(v1 = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
v2 = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L), x = c(10L,
1L, 2L, 3L, 4L, 3L, 2L, 30L, 3L, 5L)), .Names = c("v1", "v2",
"x"), class = "data.frame", row.names = c(NA, -10L))
There are 3 variables.
For each category of v1, I need to get only the row in which x has its maximum value.
For example, take the first category of v1 and find the category of v2 for which x is largest.
It is
v1=1, v2=1, x=10
Then take the second category of v1 and find the category of v2 for which x is largest.
It is v1=2, v2=3, x=30
So the desired output is
v1 v2 x
1 1 10
2 3 30
How to do it?
Here is a solution using data.table:
library(data.table)
setDT(data)
data[, .SD[which.max(x)], keyby = v1]
v1 v2 x
1: 1 1 10
2: 2 3 30
And for completeness an ugly base-R solution:
t(sapply(split(data, data[["v1"]]), function(s) s[which.max(s[["x"]]),]))
v1 v2 x
1 1 1 10
2 2 3 30
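
If the matrix-of-lists result of t(sapply(...)) is inconvenient, a slightly tidier base R variant might look like the sketch below. It binds the per-group rows back into an ordinary data frame; the name top_rows is made up for illustration.

```r
data <- data.frame(
  v1 = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
  v2 = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L),
  x  = c(10L, 1L, 2L, 3L, 4L, 3L, 2L, 30L, 3L, 5L)
)

# For each v1 group, keep the single row where x is largest,
# then stack the one-row data frames back together.
top_rows <- do.call(
  rbind,
  lapply(split(data, data$v1), function(s) s[which.max(s$x), ])
)
rownames(top_rows) <- NULL
top_rows
```

Note that which.max() keeps only the first row when several rows tie for the maximum, whereas a condition like x == max(x) keeps all tied rows.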
Using dplyr:
library(dplyr)
data %>%
group_by(v1) %>%
filter(x == max(x))
# A tibble: 2 x 3
# Groups: v1 [2]
v1 v2 x
<int> <int> <int>
1 1 1 10
2 2 3 30

Subsetting a dataframe based on summation of rows of a given column

I am dealing with data with three variables (i.e. id, time, gender). It looks like
df <-
structure(
list(
id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
time = c(21L, 3L, 4L, 9L, 5L, 9L, 10L, 6L, 27L, 3L, 4L, 10L),
gender = c(1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L)
),
.Names = c("id", "time", "gender"),
class = "data.frame",
row.names = c(NA,-12L)
)
That is, each id has four observations for time and gender. For each id, I want to keep the rows up to and including the first row at which the cumulative sum of time is greater than or equal to 25. Notice that for id 2 all observations are kept, and for id 3 only the first observation is kept. The expected result would look like:
df <-
structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L ),
time = c(21L, 3L, 4L, 5L, 9L, 10L, 6L, 27L ),
gender = c(1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L)
),
.Names = c("id", "time", "gender"),
class = "data.frame",
row.names = c(NA,-8L)
)
Any help on this is highly appreciated.
One option is using lag of cumsum as:
library(dplyr)
df %>% group_by(id,gender) %>%
filter(lag(cumsum(time), default = 0) < 25 )
# # A tibble: 8 x 3
# # Groups: id, gender [3]
# id time gender
# <int> <int> <int>
# 1 1 21 1
# 2 1 3 1
# 3 1 4 1
# 4 2 5 0
# 5 2 9 0
# 6 2 10 0
# 7 2 6 0
# 8 3 27 1
Using data.table (updated based on feedback from @Renu):
library(data.table)
setDT(df)
df[,.SD[shift(cumsum(time), fill = 0) < 25], by=.(id,gender)]
Another option is to create a logical vector for each 'id', cumsum(time) >= 25, that is TRUE wherever the cumulative sum of 'time' is greater than or equal to 25.
You can then filter for rows where the cumsum of this vector is less than or equal to 1, i.e. keep the entries up to and including the first TRUE for each 'id'.
df %>%
group_by(id) %>%
filter(cumsum( cumsum(time) >= 25 ) <= 1)
# A tibble: 8 x 3
# Groups: id [3]
# id time gender
# <int> <int> <int>
# 1 1 21 1
# 2 1 3 1
# 3 1 4 1
# 4 2 5 0
# 5 2 9 0
# 6 2 10 0
# 7 2 6 0
# 8 3 27 1
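You can also try this dplyr construction:
library(dplyr)
dt <- group_by(df, id) %>%
# cumulative sum of time within each group
mutate(sum_time = cumsum(time)) %>%
# keep rows while the running total before the current row is below 25,
# so that the row which first reaches 25 is still included
filter(lag(sum_time, default = 0) < 25) %>%
# drop the helper column from the result
select(-sum_time)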
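
The rule used in the answers above (keep a row while the total of the rows before it, within the same id, is still below 25) can also be sketched in base R. This is only an illustration; prev_total and subset_df are made-up names.

```r
df <- data.frame(
  id     = rep(1:3, each = 4),
  time   = c(21L, 3L, 4L, 9L, 5L, 9L, 10L, 6L, 27L, 3L, 4L, 10L),
  gender = c(1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L)
)

# Running total of the rows before the current one, computed per id
# (cumulative sum minus the current value).
prev_total <- ave(df$time, df$id, FUN = function(t) cumsum(t) - t)

# A row is kept as long as the total before it has not yet reached 25.
subset_df <- df[prev_total < 25, ]
subset_df
```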

R: How to delete rows of a data frame based on the values of a given column

I have 100 simulated data sets, for example a single set is shown below
pid time status
1 2 1
1 6 0
1 4 1
2 3 0
2 1 1
2 7 1
3 8 1
3 11 1
3 2 0
pid denotes the patient id. Each patient has three records in the time and status columns.
I want to write R code that applies the following rule per patient: if the patient's first observation has status 0, keep only that first row and delete all of the patient's following rows, including those with status 1; otherwise, delete every row with status 0. The output should look like
pid time status
1 2 1
1 4 1
2 3 0
3 8 1
3 11 1
As there are 100 simulated data sets, the positions of the 0s and 1s in the status column differ between data sets. Could anyone help with R code that performs this task?
Thank you in advance.
The dplyr package can help. I added a patient to your example data so that one pid has multiple 0 values.
Group by pid and use first() to get the first value of status; because of the grouping, this value applies to all records of that pid. Then filter: if the first status is 0, keep only the row where row_number() == 1, in case there are more records with 0 (see pid 4); if the first status is 1, keep all records with status 1.
library(dplyr)
df %>%
group_by(pid) %>%
filter((first(status) == 0 & row_number() == 1) | (first(status) == 1 & status == 1))
# A tibble: 6 x 3
# Groups: pid [4]
pid time status
<int> <int> <int>
1 1 2 1
2 1 4 1
3 2 3 0
4 3 8 1
5 3 11 1
6 4 3 0
data:
df <-
structure(
list(
pid = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L),
time = c(2L, 6L, 4L, 3L, 1L, 7L, 8L, 11L, 2L, 3L, 6L, 8L),
status = c(1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L)
),
.Names = c("pid", "time", "status"),
class = "data.frame",
row.names = c(NA,-12L)
)
This question is more appropriate on https://stackoverflow.com.
Here is an attempt using tapply() (it's a little verbose):
dat <- structure(list(pid = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
time = c(2L, 6L, 4L, 3L, 1L, 7L, 8L, 11L, 2L),
status = c(1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 0L)),
.Names = c("pid", "time", "status"), class = "data.frame",
row.names = c(NA, -9L))
ind <- unlist(tapply(dat$status, dat$pid, function(x) {
# browser()
y <- (rep(FALSE, length(x)))
if (x[1] == 1) {
y[x != 0] <- TRUE
} else {
y[1] <- TRUE
}
y
}))
dat[ind, ]
#> pid time status
#> 1 1 2 1
#> 3 1 4 1
#> 4 2 3 0
#> 7 3 8 1
#> 8 3 11 1
ind is a vector of TRUEs and FALSEs, which will indicate whether a row of dat should be kept or not according to your rules.
I use tapply(X, INDEX, FUN) to apply a function to subsets of a vector (here X = dat$status), which are defined by a grouping factor (here INDEX = dat$pid).
Here, I used an anonymous function (i.e., FUN = function(x){}) to do something with each subset of X.
In particular, I first define y, which I will return later, to be a vector of FALSEs.
If the first status is 1 for a subgroup, I turn all elements that are non-zero (i.e., y[x != 0]) into TRUE.
Otherwise, I turn only the first element (i.e., y[1]) into TRUE.
You may uncomment the browser() statement and see at the console what the function does by typing n (for next) or x or y (to see what they are).
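
A shorter variant of the same rule could use ave() instead of tapply(). This is only a sketch, and keep_flag and kept are made-up names. One difference worth noting: unlist(tapply(...)) returns the flags grouped by pid, so it relies on the rows already being sorted by pid, whereas ave() returns its result in the original row order.

```r
dat <- data.frame(
  pid    = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
  time   = c(2L, 6L, 4L, 3L, 1L, 7L, 8L, 11L, 2L),
  status = c(1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 0L)
)

# Per patient: if the first status is 1, flag every non-zero row;
# otherwise flag only the first row. ave() stores the flags as 0/1.
keep_flag <- ave(dat$status, dat$pid, FUN = function(x) {
  if (x[1] == 1) x != 0 else seq_along(x) == 1
})

kept <- dat[keep_flag == 1, ]
kept
```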
