This question already has answers here:
Select the row with the maximum value in each group
(19 answers)
Closed 21 days ago.
I have a problem subsetting my dataset. I would like to subset my data so that (1) there is one observation per ID and (2) a row with "event" = 1 is retained if one occurs at any time for that ID, while not losing any IDs.
An example dataset looks like this:
ID event
A 1
A 1
A 0
A 1
B 0
B 0
B 0
C 0
C 1
Desired output
A 1
B 0
C 1
I imagine this would be done with dplyr, starting from df %>% group_by(ID), but I'm unsure how to prioritize any row with event = 1 without dropping IDs whose rows all have event = 0. I do not want to lose any of the IDs.
Any help would be appreciated - thank you very much.
We may use
aggregate(event ~ ID, df1, max)
ID event
1 A 1
2 B 0
3 C 1
Or with dplyr
library(dplyr)
df1 %>%
group_by(ID) %>%
slice_max(n = 1, event, with_ties = FALSE) %>%
ungroup
# A tibble: 3 × 2
ID event
<chr> <int>
1 A 1
2 B 0
3 C 1
data
df1 <- structure(list(ID = c("A", "A", "A", "A", "B", "B", "B", "C",
"C"), event = c(1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L)),
class = "data.frame", row.names = c(NA,
-9L))
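If the data frame carries other columns that should travel with the selected row, slice_max() keeps the whole row, whereas aggregate() returns only the aggregated column. A minimal sketch, assuming a hypothetical extra column visit that is not in the original data:

```r
library(dplyr)

# Same ID/event values as df1, plus a hypothetical `visit` column
df2 <- data.frame(
  ID    = c("A", "A", "A", "A", "B", "B", "B", "C", "C"),
  event = c(1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L),
  visit = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 1L, 2L)
)

# One row per ID, preferring a row with event == 1 when there is one
one_per_id <- df2 %>%
  group_by(ID) %>%
  slice_max(event, n = 1, with_ties = FALSE) %>%
  ungroup()

one_per_id
```

Every ID survives because each group always has a maximum, even when that maximum is 0.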
I have a dataset with multiple columns. I want to apply cumsum() to one column, conditional on another column: when a given event occurs, the sum should restart from zero, but the row where the event occurs should still be included in the running sum. Here is an example to be more precise.
inv ass port G cumsum(G)
i x 2 1 1
i x 2 0 1
i x 0 1 2
i x 3 0 0
i x 3 1 1
So in the 3rd row the condition port == 0 holds. I want cumsum(G) to still include the value of G on the 3rd row and to restart the count from the following row.
I'm using dplyr to group_by(investor, asset) but I'm stuck here since I'm doing:
res1 <- res %>%
group_by(investor, asset) %>%
mutate(posdays = ifelse(operation < 0 & portfolio == 0, 0, cumsum(G))) %>%
ungroup()
This restarts the cumsum() but excludes the 3rd row's value from the sum.
I think I need something that says: cumsum(G), but when condition "x" holds in the previous row, restart the sum on the following row.
Can you help me?
You may use cumsum to create groups as well.
library(dplyr)
df <- df %>%
group_by(group = cumsum(dplyr::lag(port == 0, default = 0))) %>%
mutate(cumsum_G = cumsum(G)) %>%
ungroup
df
# inv ass port G group cumsum_G
# <chr> <chr> <int> <int> <dbl> <int>
#1 i x 2 1 0 1
#2 i x 2 0 0 1
#3 i x 0 1 0 2
#4 i x 3 0 1 0
#5 i x 3 1 1 1
You may remove the group column from output using %>% select(-group).
data
df <- structure(list(inv = c("i", "i", "i", "i", "i"), ass = c("x",
"x", "x", "x", "x"), port = c(2L, 2L, 0L, 3L, 3L), G = c(1L,
0L, 1L, 0L, 1L)), class = "data.frame", row.names = c(NA, -5L))
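The question groups by investor and asset, so the reset flag probably needs to be computed within each of those groups as well. A sketch under that assumption, using a hypothetical second investor "j" that is not in the original data:

```r
library(dplyr)

# Original five rows for investor "i" plus three hypothetical rows for "j"
df_multi <- data.frame(
  inv  = c("i", "i", "i", "i", "i", "j", "j", "j"),
  ass  = rep("x", 8),
  port = c(2L, 2L, 0L, 3L, 3L, 1L, 0L, 4L),
  G    = c(1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L)
)

res <- df_multi %>%
  group_by(inv, ass) %>%
  # lagging the flag by one row keeps the port == 0 row in the old run
  mutate(grp = cumsum(dplyr::lag(port == 0, default = FALSE))) %>%
  group_by(inv, ass, grp) %>%
  mutate(cumsum_G = cumsum(G)) %>%
  ungroup() %>%
  select(-grp)

res$cumsum_G
# 1 1 2 0 1 1 2 1
```

Computing grp inside group_by(inv, ass) stops a port == 0 row at the end of one investor's data from resetting the first row of the next investor.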
This question already has an answer here:
Create a group index for values connected directly and indirectly
(1 answer)
Closed 2 years ago.
I have a dataset with two variables. As a simple example:
df <- rbind(c("A",1),c("B",2),c("C",2),c("C",3),c("D",4),c("D",5),c("E",1))
I would like to group the rows so that rows sharing either the first or the second component, directly or through a chain of such rows, fall into the same group. The desired output would be a third column with the following values:
c(1,2,2,2,3,3,1)
If I use dplyr and group_by and cur_group_id(), I would get groups by the first and second component, obtaining therefore
c(1,2,3,4,5,6,7)
Can anyone help me in an easy way, it could be either base R, dplyr or data.table, to obtain the desired group?
Thank you
Perhaps igraph could be a helpful tool for you
library(igraph)
df$grp <- membership(components(graph_from_data_frame(df, directed = FALSE)))[df$X1]
which gives
> df
X1 X2 grp
1 A 1 1
2 B 2 2
3 C 2 2
4 C 3 2
5 D 4 3
6 D 5 3
7 E 1 1
Data
> dput(df)
structure(list(X1 = c("A", "B", "C", "C", "D", "D", "E"), X2 = c(1L,
2L, 2L, 3L, 4L, 5L, 1L)), row.names = c(NA, -7L), class = "data.frame")
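If adding igraph as a dependency is not an option, the same grouping can be approximated in base R by repeatedly propagating the smallest label across rows that share X1 or X2. This is a different technique from the igraph components() call above, shown only as a rough sketch:

```r
df <- data.frame(
  X1 = c("A", "B", "C", "C", "D", "D", "E"),
  X2 = c(1L, 2L, 2L, 3L, 4L, 5L, 1L)
)

# Start with one label per distinct X1 value
grp <- as.integer(factor(df$X1, levels = unique(df$X1)))

repeat {
  old <- grp
  # Rows sharing X1 take the smallest label among them, then likewise for X2
  grp <- ave(grp, df$X1, FUN = min)
  grp <- ave(grp, df$X2, FUN = min)
  if (identical(grp, old)) break  # no label changed: components are stable
}

# Renumber the surviving labels as 1, 2, 3, ... in order of appearance
df$grp <- match(grp, unique(grp))
df$grp
# 1 2 2 2 3 3 1
```

Labels can only decrease, so the loop terminates; on long chains it may need many passes, which is where a graph library is likely the better choice.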
I have a column of numbers (index) in a data frame like the one below. I am trying to check whether, within each group, these numbers increase by exactly 1 from one row to the next. For example, groups B and C do not increase by 1 throughout. While I can check by sight, my data frame is thousands of rows long, so I'd prefer to automate this. Does anyone have advice? Thank you!
group index
A 0
A 1
A 2
A 3
A 4
B 0
B 1
B 2
B 2
C 0
C 3
C 1
C 2
...
I think this works. diff calculates the differences between consecutive numbers, and all then checks whether every difference is 1. dat2 is the final output.
library(dplyr)
dat2 <- dat %>%
group_by(group) %>%
summarize(Result = all(diff(index) == 1)) %>%
ungroup()
dat2
# # A tibble: 3 x 2
# group Result
# <chr> <lgl>
# 1 A TRUE
# 2 B FALSE
# 3 C FALSE
DATA
dat <- read.table(text = "group index
A 0
A 1
A 2
A 3
A 4
B 0
B 1
B 2
B 2
C 0
C 3
C 1
C 2",
header = TRUE, stringsAsFactors = FALSE)
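With thousands of rows it may also help to see which rows break the pattern, not only which groups. A minimal base R sketch of the same diff idea; the step column and the bad data frame are names introduced here for illustration:

```r
dat <- read.table(text = "group index
A 0
A 1
A 2
A 3
A 4
B 0
B 1
B 2
B 2
C 0
C 3
C 1
C 2", header = TRUE, stringsAsFactors = FALSE)

# Difference to the previous row within each group; NA on each group's first row
dat$step <- ave(dat$index, dat$group, FUN = function(v) c(NA, diff(v)))

# Rows whose step from the previous row is not exactly 1
bad <- dat[!is.na(dat$step) & dat$step != 1, ]
bad
```

For this data the flagged rows are the second 2 in group B and the 3 and 1 in group C.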
Maybe aggregate could help
> aggregate(.~group,df1,function(v) all(diff(v)==1))
group index
1 A TRUE
2 B FALSE
3 C FALSE
We can group by group, take the difference between the current and previous value (shift), and check whether all the differences are equal to 1.
library(data.table)
setDT(df1)[, .(Result = all((index - shift(index))[-1] == 1)), group]
# group Result
#1: A TRUE
#2: B FALSE
#3: C FALSE
data
df1 <- structure(list(group = c("A", "A", "A", "A", "A", "B", "B", "B",
"B", "C", "C", "C", "C"), index = c(0L, 1L, 2L, 3L, 4L, 0L, 1L,
2L, 2L, 0L, 3L, 1L, 2L)), class = "data.frame", row.names = c(NA,
-13L))
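One edge case worth knowing with all the diff-based checks: a group with a single row has no differences, and all() of an empty logical vector is TRUE. Whether that should count as "ascending" is a judgment call. A small sketch, assuming a hypothetical single-row group D:

```r
# Group D is hypothetical and has only one row
df2 <- data.frame(
  group = c("A", "A", "A", "D"),
  index = c(0L, 1L, 2L, 7L)
)

# Plain check: the single-row group passes because there is nothing to compare
plain <- tapply(df2$index, df2$group, function(v) all(diff(v) == 1))

# Stricter variant: additionally require at least two rows per group
strict <- tapply(df2$index, df2$group,
                 function(v) length(v) > 1 && all(diff(v) == 1))

plain   # A TRUE, D TRUE
strict  # A TRUE, D FALSE
```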
This question already has answers here:
Remove groups that contain certain strings
(4 answers)
Closed 3 years ago.
Three doctors each diagnose a patient.
Question 1: how do I filter the patients whom all 3 doctors diagnose with disease B (whether B.1, B.2 or B.3)?
Question 2: how do I filter the patients whom any of the 3 doctors diagnoses with disease A?
set.seed(20200107)
df <- data.frame(id = rep(1:5,each =3),
disease = sample(c('A','B'), 15, replace = TRUE))
df$disease <- as.character(df$disease)
df[1,2] <- 'A'
df[4,2] <- 'B.1'
df[5,2] <- 'B.2'
df[6,2] <- 'B.3'
df
I have an approach in mind but don't know how to write the code. I think the any() or all() function should be used.
First, I want to group patients by id.
Second, check whether all the diseases in each group are A or B.
The code would look something like this:
df %>% group_by(id) %>% filter_all(all_vars(disease == B))
You can use all assuming every patient is checked by 3 doctors only.
library(dplyr)
df %>% group_by(id) %>% summarise(disease_B = all(grepl('B', disease)))
# id disease_B
# <int> <lgl>
#1 1 FALSE
#2 2 TRUE
#3 3 FALSE
#4 4 FALSE
#5 5 FALSE
If you want to subset the rows of the patient, we can use filter
df %>% group_by(id) %>% filter(all(grepl('B', disease)))
For question 2: similarly, we can use any
df %>% group_by(id) %>% summarise(disease_A = any(grepl('A', disease)))
data
df <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L, 5L, 5L, 5L), disease = c("A", "A", "A", "B.1", "B.2",
"B.3", "B", "A", "A", "B", "A", "A", "B", "A", "B")), row.names = c(NA,
-15L), class = "data.frame")
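Note that grepl('B', disease) matches any string that contains a B. If other disease codes could contain that letter, an anchored pattern may be safer. A base R sketch of both questions that returns the matching ids; the pattern assumes subtypes always look like B.1, B.2 and so on:

```r
df <- data.frame(
  id = rep(1:5, each = 3),
  disease = c("A", "A", "A", "B.1", "B.2", "B.3", "B", "A", "A",
              "B", "A", "A", "B", "A", "B"),
  stringsAsFactors = FALSE
)

# TRUE for "B" or "B.<digits>" only (assumed subtype format)
is_B <- grepl("^B(\\.[0-9]+)?$", df$disease)

# Question 1: ids where every diagnosis is some form of B
all_B <- tapply(is_B, df$id, all)
ids_all_B <- as.integer(names(all_B)[as.vector(all_B)])

# Question 2: ids where at least one diagnosis is exactly A
any_A <- tapply(df$disease == "A", df$id, any)
ids_any_A <- as.integer(names(any_A)[as.vector(any_A)])

ids_all_B  # 2
ids_any_A  # 1 3 4 5
```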
For question 1, you can replace B.1, B.2, ... with B, then count the occurrences of each "Disease" per patient and keep only the rows where the count is 3 and the disease is B:
library(tidyverse)
df %>% group_by(id) %>%
mutate(Disease = gsub("\\.[0-9]+","",disease)) %>%
count(Disease) %>%
filter(n == 3 & Disease == "B")
# A tibble: 2 x 3
# Groups: id [2]
id Disease n
<int> <chr> <int>
1 2 B 3
2 4 B 3
For question 2, similarly, you can replace B.1, ... with B, keep only the rows where Disease is A, then count the rows per patient. The output is the patient id and the number of doctors who diagnosed disease A:
df %>% group_by(id) %>%
mutate(Disease = gsub("\\.[0-9]+","",disease))%>%
filter(Disease == "A") %>%
count(id)
# A tibble: 3 x 2
# Groups: id [3]
id n
<int> <int>
1 1 1
2 3 3
3 5 2
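When stripping the subtype suffix with gsub, whether the dot is escaped matters: in a regular expression an unescaped . matches any character. The difference does not show up on codes like B.1, but it can on codes without a dotted suffix. A short illustration, where "A10" is a hypothetical code not present in the data:

```r
codes <- c("B.1", "B.22", "B", "A10")

# Unescaped dot: any character followed by digits is removed
unescaped <- gsub(".[0-9]+", "", codes)

# Escaped dot: only a literal "." followed by digits is removed
escaped <- gsub("\\.[0-9]+", "", codes)

unescaped  # "B" "B" "B" ""
escaped    # "B" "B" "B" "A10"
```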
This question already has answers here:
Remove group from data.frame if at least one group member meets condition
(4 answers)
Closed 1 year ago.
Problem:
I want to remove all the rows of a specific category if one of the rows has a certain value in another column (similar to the problems in the links below). The main difference is that I would like the removal to apply only if that row also matches a criterion in a third column.
Making a practice df
prac_df <- tibble::tibble(
subj = rep(1:4, each = 4),
trial = rep(rep(1:4, each = 2), times = 2),
ias = rep(c('A', 'B'), times = 8),
fixations = c(17, 14, 0, 0, 15, 0, 8, 6, 3, 2, 3,3, 23, 2, 3,3)
)
So my data frame looks like this.
subj ias fixations
1 1 A 17
2 1 B 14
3 2 A 0
4 2 B 0
5 3 A 15
6 3 B 0
7 4 A 8
8 4 B 6
I want to remove all of subject 2 because it has a value of 0 in the fixations column in a row where ias is A. However, I want to do this without removing subject 3: although it also has a 0, that 0 is in a row where ias is B.
My attempt so far.
new.df <- prac_df[with(prac_df, ave(prac_df$fixations != 0, subj, FUN = all)),]
However this is missing the part that will only get rid of it if it has the value A in the ias column. I've attempted various uses of & or if but I feel like there's likely a clever and clean way I just don't know of.
My goal is to make a df like this.
subj ias fixations
1 1 A 17
2 1 B 14
3 3 A 15
4 3 B 0
5 4 A 8
6 4 B 6
Thank you very much!
Related questions:
R: Remove rows from data frame based on values in several columns
How to remove all rows belonging to a particular group when only one row fulfills the condition in R?
We group by 'subj' and then filter based on the logical condition created with any and !
library(dplyr)
df1 %>%
group_by(subj) %>%
filter(!any(fixations==0 & ias == "A"))
# subj ias fixations
# <int> <chr> <int>
#1 1 A 17
#2 1 B 14
#3 3 A 15
#4 3 B 0
#5 4 A 8
#6 4 B 6
Or use all with |
df1 %>%
group_by(subj) %>%
filter(all(fixations!=0 | ias !="A"))
The same approach can be used with ave from base R
df1[with(df1, !ave(fixations==0 & ias =="A", subj, FUN = any)),]
data
df1 <- structure(list(subj = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), ias = c("A",
"B", "A", "B", "A", "B", "A", "B"), fixations = c(17L, 14L, 0L,
0L, 15L, 0L, 8L, 6L)), .Names = c("subj", "ias", "fixations"),
class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8"))
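The any and all versions are two spellings of the same condition: by De Morgan's law, "no row has fixations == 0 and ias == A" is the same as "every row has fixations != 0 or ias != A". A self-contained sketch that runs both on the example data and compares the results:

```r
library(dplyr)

df1 <- data.frame(
  subj = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L),
  ias = c("A", "B", "A", "B", "A", "B", "A", "B"),
  fixations = c(17L, 14L, 0L, 0L, 15L, 0L, 8L, 6L),
  stringsAsFactors = FALSE
)

# Drop a subject if any of its rows has zero fixations on ias "A"
keep_any <- df1 %>%
  group_by(subj) %>%
  filter(!any(fixations == 0 & ias == "A")) %>%
  ungroup()

# Equivalent form: keep a subject only if every row avoids that combination
keep_all <- df1 %>%
  group_by(subj) %>%
  filter(all(fixations != 0 | ias != "A")) %>%
  ungroup()

identical(keep_any$subj, keep_all$subj)
# TRUE
```

Subject 3 is kept in both versions because its zero sits on an ias "B" row.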