I have a df, part of which looks like the following:
| Number|Category| A1|A2|B1|B2|C1|C2|A |B |C |
| ------| -------|---|--|--|--|--|--|--|--|--|
| 1 | 1 | 10|30|5 |15|NA|NA|5 |10|NA|
| 2 | 2 | 10|30|5 |15|25|35|40|20|45|
The conditions are:
A1 & A2, B1 & B2, and C1 & C2 are the lower and upper limits, respectively, of the factors A, B, and C,
and the columns A, B, C hold the measurements.
If a measurement is below the lower limit, the factor is "passed";
if it is between the two limits, the factor is in "danger";
if the measurement is above the upper limit, the factor is "failed".
For Category = 1 we are permitted only 1 failure among the factors, in which case we classify the asset as "risk"; but if we have 2 failures, then the asset (row 1) is a "fail".
For Category = 2, 2 failures are permitted: if one factor fails the asset is "at risk", if we have 2 failures it is "risk", and if we have 3 failures it is "fail".
So I would like to calculate, for every row (asset), the status of every factor and then the status of the asset. I am trying to do that with a for loop and an if-else statement that iterates through all these columns of every row, but that seems difficult as I am a beginner. The final result is to attach the following columns to the dataset. Thank you in advance.
|Number|Aa |Bb |Cc |Result |
|------|------|------|------|-------|
|1 |passed|danger|NA | risk |
|2 |failed|failed|failed| failed|
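For reference, the example rows above can be reconstructed like this (a sketch inferred from the tables; the NA limits for C in row 1 are as shown):
df <- data.frame(
  Number = c(1, 2), Category = c(1, 2),
  A1 = c(10, 10), A2 = c(30, 30),
  B1 = c(5, 5),   B2 = c(15, 15),
  C1 = c(NA, 25), C2 = c(NA, 35),
  A  = c(5, 40),  B  = c(10, 20), C = c(NA, 45)
)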
This can be done in dplyr alone, without reshaping the data or using any loop (for/while), using across(), cur_data() and cur_column(), which are certainly powerful functions from dplyr.
library(dplyr, warn.conflicts = F)
df
#> Number Category A1 A2 B1 B2 C1 C2 A B C
#> 1 1 1 10 30 5 15 NA NA 5 10 NA
#> 2 2 2 10 30 5 15 25 35 40 20 45
df %>%
  group_by(Number, Category) %>%
  transmute(across(c('A', 'B', 'C'),
                   ~ case_when(
                       is.na(.) | is.na(get(paste0(cur_column(), 1))) |
                         is.na(get(paste0(cur_column(), 2))) ~ NA_character_,
                       . < get(paste0(cur_column(), 1)) ~ 'passed',
                       . <= get(paste0(cur_column(), 2)) ~ 'danger',
                       TRUE ~ 'failed'),
                   .names = '{.col}{tolower(.col)}')) %>%
  mutate(Result = ifelse(rowSums(cur_data() == 'failed', na.rm = TRUE) <= Category,
                         'risk', 'failed'))
#> # A tibble: 2 x 6
#> # Groups: Number, Category [2]
#> Number Category Aa Bb Cc Result
#> <int> <int> <chr> <chr> <chr> <chr>
#> 1 1 1 passed danger <NA> risk
#> 2 2 2 failed failed failed failed
Created on 2021-07-06 by the reprex package (v2.0.0)
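A small caveat if you are on a newer dplyr: cur_data() was deprecated in dplyr 1.1.0 in favour of pick(), so on recent versions the last step can be written as follows (same logic, different accessor):
# dplyr >= 1.1.0: pick(everything()) plays the role of cur_data()
mutate(Result = ifelse(rowSums(pick(everything()) == 'failed', na.rm = TRUE) <= Category,
                       'risk', 'failed'))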
You can also use the following solution, which is a combination of base R and tidyverse:
library(dplyr)
library(purrr)

id_cols <- c(1, 2)  # positions of the Number and Category columns
tmp <- df[-id_cols]

lapply(split.default(tmp, gsub("(\\w)\\d+?", "\\1", names(tmp))),
       function(x) cbind(df[id_cols], x)) %>%
  imap(~ .x %>%
         mutate(!!{.y} := pmap_chr(., ~
           ifelse(any(is.na(..3), is.na(..4), is.na(..5)), "NA",
                  ifelse(..5 > ..3 & ..5 < ..4, "danger",
                         ifelse(..5 < ..3, "passed", "failed"))))) %>%
         select(-c(3, 4))) %>%
  reduce(~ full_join(..1, ..2, by = c("Number", "Category"))) %>%
  rowwise() %>%
  mutate(Result = case_when(
    Category == 1 & sum(c_across(A:C) == "failed") <= 1 ~ "Risk",
    Category == 1 & sum(c_across(A:C) == "failed") > 1  ~ "Fail",
    Category == 2 & sum(c_across(A:C) == "failed") == 1 ~ "At_Risk",
    Category == 2 & sum(c_across(A:C) == "failed") == 2 ~ "Risk",
    Category == 2 & sum(c_across(A:C) == "failed") == 3 ~ "Fail"
  ))
# A tibble: 2 x 6
# Rowwise:
Number Category A B C Result
<dbl> <dbl> <chr> <chr> <chr> <chr>
1 1 1 passed danger NA Risk
2 2 2 failed failed failed Fail
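If the split.default() step looks opaque, here is the intermediate grouping on its own (assuming the same df): the regex strips the trailing digit from each column name, so the columns group by their letter prefix.
tmp <- df[-c(1, 2)]  # drop Number and Category
gsub("(\\w)\\d+?", "\\1", names(tmp))
#> [1] "A" "A" "B" "B" "C" "C" "A" "B" "C"
split.default(tmp, gsub("(\\w)\\d+?", "\\1", names(tmp)))
#> $A holds columns A1, A2, A; $B holds B1, B2, B; $C holds C1, C2, C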
Much of your problem is caused by the untidy nature of your data frame. I started to provide solutions based on both your untidy data and a tidy equivalent, but the untidy solution, whilst possible, became just too painful.
So, here's a solution based on a tidy equivalent of your data frame.
First, make it tidy. The reason your data frame is untidy is that your column names contain information, namely that A1 and A2 contain the acceptance limits for values in A, and so on. We can correct this by making the data frame longer.
The process is a little long because of the extent of the untidiness of the original. It might be possible to create a more compact version of the transformation using, say, names_pattern and other advanced arguments to pivot_longer(), but the long version at least has the benefit of clarity. (A sketch of one such compact version appears after the output below.)
library(tidyverse)  # pivot_longer(), str_sub(), left_join()

longDF <- df %>%
  select(Number, Category, A, B, C) %>%
  pivot_longer(
    c(-Category, -Number),
    names_to = "Variable",
    values_to = "Value"
  ) %>%
  left_join(
    df %>%
      select(Number, Category, A1, B1, C1) %>%
      pivot_longer(
        c(-Category, -Number),
        names_to = "Variable",
        values_to = "Lower"
      ) %>%
      mutate(Variable = str_sub(Variable, 1, 1)),
    by = c("Number", "Category", "Variable")
  ) %>%
  left_join(
    df %>%
      select(Number, Category, A2, B2, C2) %>%
      pivot_longer(
        c(-Category, -Number),
        names_to = "Variable",
        values_to = "Upper"
      ) %>%
      mutate(Variable = str_sub(Variable, 1, 1)),
    by = c("Number", "Category", "Variable")
  )
longDF
# A tibble: 6 x 6
Number Category Variable Value Lower Upper
<dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 1 1 A 5 10 30
2 1 1 B 10 5 15
3 1 1 C NA NA NA
4 2 2 A 40 10 30
5 2 2 B 20 5 15
6 2 2 C 45 25 35
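For the curious, here is a sketch of the more compact names_pattern version mentioned above (untested against edge cases; the rename_with() step suffixes the measurement columns so that .value can tell them apart from the limits):
df %>%
  rename_with(~ paste0(.x, "_value"), c(A, B, C)) %>%
  pivot_longer(
    -c(Number, Category),
    names_pattern = "([ABC])_?(.+)",
    names_to = c("Variable", ".value")
  ) %>%
  rename(Lower = `1`, Upper = `2`, Value = value)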
So at this point, we have columns that define the Category of the test, the Variable being measured, its Value and the two acceptance limits (Lower and Upper).
Now, determining the acceptability of each Value is straightforward.
longDF <- longDF %>%
  mutate(
    Result = ifelse(
      Value < Lower,
      "Pass",
      ifelse(Value < Upper, "Danger", "Fail")
    )
  )
longDF
# A tibble: 6 x 7
Number Category Variable Value Lower Upper Result
<dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
1 1 1 A 5 10 30 Pass
2 1 1 B 10 5 15 Danger
3 1 1 C NA NA NA NA
4 2 2 A 40 10 30 Fail
5 2 2 B 20 5 15 Fail
6 2 2 C 45 25 35 Fail
Also, note that the categorisation of each value is independent of both the Variable and the number of possible variables. So the code is robust in these respects.
Now we can categorise the results by Number and Category.
longDF %>%
  group_by(Number, Category, Result) %>%
  summarise(N = n(), .groups = "drop") %>%
  pivot_wider(
    names_from = Result,
    values_from = N,
    values_fill = 0
  )
# A tibble: 2 x 7
Number Category Danger Pass `NA` Fail
<dbl> <dbl> <int> <int> <int> <int>
1 1 1 1 1 1 0
2 2 2 0 0 0 3
Again, we are robust with respect to both the number of Category and Number values, and their labels.
Evaluating the overall results is also straightforward, but slightly long-winded because of the various options. Note that your text is inconsistent with the desired output, because you haven't explained how an overall result of "risk" for Category = 1 with no failures is obtained. I've gone with the text; if you want to match the sample output, the changes to the code should be simple once the criteria are defined.
longDF %>%
  group_by(Number, Category, Result) %>%
  summarise(N = n(), .groups = "drop") %>%
  pivot_wider(
    names_from = Result,
    values_from = N,
    values_fill = 0
  ) %>%
  mutate(
    Result = ifelse(
      Category == 1,
      ifelse(Fail == 0, "Pass", ifelse(Fail == 1, "Risk", "Fail")),
      ifelse(Fail < 2, "Pass", ifelse(Fail == 2, "Risk", "Fail"))
    )
  )
# A tibble: 2 x 7
Number Category Danger Pass `NA` Fail Result
<dbl> <dbl> <int> <int> <int> <int> <chr>
1 1 1 1 1 1 0 Pass
2 2 2 0 0 0 3 Fail
If you need to know which Variable caused potential failures, that can also be obtained from longDF with a small change to the grouping.
longDF %>%
  group_by(Category, Variable, Result) %>%
  summarise(N = n(), .groups = "drop") %>%
  pivot_wider(
    names_from = Variable,
    values_from = Result
  )
# A tibble: 2 x 5
Category N A B C
<dbl> <int> <chr> <chr> <chr>
1 1 1 Pass Danger NA
2 2 1 Fail Fail Fail
And, of course, you could join these two data frames together to get a comprehensive description of both the overall results and the component variable assessments.
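For example (a sketch; overallDF and byVariableDF are hypothetical names, assuming you assign the two summary pipelines above instead of just printing them):
overallDF %>%
  left_join(byVariableDF, by = "Category")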
Is there a function that counts the number of observations within unique groups and not the number of distinct groups as n_distinct() does?
I'm summarising data with dplyr and group_by(), and I'm trying to calculate the mean number of observations per level of a different grouping variable.
df <- data.frame(id = c('A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'),
                 id.2 = c('1', '2', '2', '1', '1', '1', '2', '2'),
                 v = sample(1:10, 8))

df %>%
  group_by(id.2) %>%
  summarise(n.mean = mean(n_distinct(id)),
            v.mean = mean(v))
# A tibble: 2 × 3
id.2 n.mean v.mean
<chr> <dbl> <dbl>
1 1 3 5
2 2 2 4.5
What I instead need:
id.2 n.mean v.mean
1 1 5
2 2 4.5
because for
id.2 == 1, n.mean is the mean of 1 observation for A, 2 for B, and 1 for C:
> mean(1, 2, 1)
[1] 1
and for id.2 == 2, n.mean is the mean of 2 observations for A, 0 for B, and 2 for C:
> mean(2, 0, 2)
[1] 2
I tried grouping by group_by(id, id.2) first to count the observations, and then passing those counts on when grouping by only id.2 in a subsequent step, but that didn't work (though I probably just don't know how to implement this with dplyr, as I'm not very experienced with tidyverse solutions).
You are not using mean correctly. mean(1, 2, 1) ignores all but the first argument and therefore will return 1 no matter what other numbers are in the second and third positions. For id.2 == 1, you'd want mean(c(1, 2, 1)), which returns 1.333.
We can use table to quickly calculate the frequencies of id within each grouping of id.2, and then take the mean of those. We can compute v.mean in the same step.
library(tidyverse)

df %>%
  group_by(id.2) %>%
  summarize(
    n.mean = mean(table(id)),
    v.mean = mean(v)
  )
id.2 n.mean v.mean
<chr> <dbl> <dbl>
1 1 1.33 4.25
2 2 2 6
Your example notes that id.2 == 2 does not have any values for id == B. It is not clear whether your desired solution counts this as a zero-length category, or simply ignores it. The solution above ignores it. The following includes it as a zero-length category by first complete-ing the input data (note new row #7, which has NA data):
df_complete <- complete(df, id.2, id)
df_complete
id.2 id v
<chr> <chr> <int>
1 1 A 9
2 1 B 1
3 1 B 2
4 1 C 5
5 2 A 4
6 2 A 7
7 2 B NA
8 2 C 3
9 2 C 10
We can convert id to factor data, which will force table to preserve its unique levels even in groupings of zero length:
df_complete %>%
  group_by(id.2) %>%
  mutate(id = factor(id)) %>%
  filter(!is.na(v)) %>%
  summarize(
    n.mean = mean(table(id)),
    v.mean = mean(v, na.rm = TRUE)
  )
id.2 n.mean v.mean
<chr> <dbl> <dbl>
1 1 1.33 4.25
2 2 1.33 6
Or an alternate recipe that does not rely on table:
df_complete %>%
  group_by(id.2, id) %>%
  summarize(
    n_rows = sum(!is.na(v)),
    id_mean = mean(v)
  ) %>%
  group_by(id.2) %>%
  summarize(
    n.mean = mean(n_rows),
    v.mean = weighted.mean(id_mean, n_rows, na.rm = TRUE)
  )
id.2 n.mean v.mean
<chr> <dbl> <dbl>
1 1 1.33 4.25
2 2 1.33 6
Note that when providing randomized example data, you should use set.seed to control the randomization and ensure reproducibility. Here is what I used:
set.seed(0)
df <- data.frame(id = c('A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'),
                 id.2 = c('1', '2', '2', '1', '1', '1', '2', '2'),
                 v = sample(1:10, 8))
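Incidentally, the two-step grouping you attempted does work: counting rows per (id.2, id) pair and then averaging within id.2 is equivalent to the table() approach (and likewise ignores zero-length categories):
df %>%
  count(id.2, id) %>%          # rows per (id.2, id) pair
  group_by(id.2) %>%
  summarise(n.mean = mean(n))  # average group size within id.2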
I have 1 dataframe of data and multiple "reference" dataframes. I'm trying to automate checking whether the values of the dataframe match the values of the reference dataframes. Importantly, the values must also be in the same order as the values in the reference dataframes. The columns shown are the ones of importance, but my real dataset contains many more columns.
Below is a toy dataset.
Dataframe
group type value
1 A Teddy
1 A William
1 A Lars
2 B Dolores
2 B Elsie
2 C Maeve
2 C Charlotte
2 C Bernard
Reference_A
type value
A Teddy
A William
A Lars
Reference_B
type value
B Elsie
B Dolores
Reference_C
type value
C Maeve
C Hale
C Bernard
For example, in the toy dataset, group 1 would score 1.0 (100% correct) because all its values in A match the values, and the order of values, in Reference_A. However, group 2 would score 0.0 for B because the values in B are out of order compared to Reference_B, and 0.66 for C because 2/3 values in C match the values and order of values in Reference_C.
Desired output
group type score
1 A 1.0
2 B 0.0
2 C 0.66
This was helpful, but does not take order into account:
Check whether values in one data frame column exist in a second data frame
Update: Thank you to everyone that has provided solutions! These solutions are great for the toy dataset, but have not yet been adaptable to datasets with more columns. Again, like I wrote in my post, the columns that I've listed above are the ones of importance; I'd prefer not to drop the unneeded columns if possible.
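For anyone following along, the toy data can be reconstructed like this (a sketch inferred from the tables above; note that the first answer below refers to Dataframe as df1):
Dataframe <- data.frame(
  group = c(1, 1, 1, 2, 2, 2, 2, 2),
  type  = c("A", "A", "A", "B", "B", "C", "C", "C"),
  value = c("Teddy", "William", "Lars", "Dolores", "Elsie",
            "Maeve", "Charlotte", "Bernard")
)
Reference_A <- data.frame(type = "A", value = c("Teddy", "William", "Lars"))
Reference_B <- data.frame(type = "B", value = c("Elsie", "Dolores"))
Reference_C <- data.frame(type = "C", value = c("Maeve", "Hale", "Bernard"))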
We may also do this with mget to return a list of data.frames, bind them together, and take a grouped mean of a logical vector:
library(dplyr)
mget(ls(pattern = '^Reference_[A-Z]$')) %>%
  bind_rows() %>%
  bind_cols(df1) %>%
  group_by(group, type = type...1) %>%
  summarise(score = mean(value...2 == value...5))
# Groups: group [2]
# group type score
# <int> <chr> <dbl>
#1 1 A 1
#2 2 B 0
#3 2 C 0.667
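Note that this relies on mget(ls(pattern = ...)) returning the reference frames in alphabetical order (A, B, C), which happens to line up row-for-row with df1 after bind_rows(); the value...1/value...5 names come from bind_cols() repairing the duplicated column names. If the main data were sorted differently, the column-bind alignment would silently break.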
This is another tidyverse solution. Here, I am adding a counter (i.e. a rowname) to both the references and the data. Then I join them together on type and rowname. At the end, I summarise on type to get the desired output.
library(dplyr)
library(purrr)
library(tibble)

list(Reference_A, Reference_B, Reference_C) %>%
  map(rownames_to_column) %>%
  bind_rows() %>%
  left_join({Dataframe %>%
               group_split(type) %>%
               map(rownames_to_column) %>%
               bind_rows()},
            ., by = c("type", "rowname")) %>%
  group_by(type) %>%
  dplyr::summarise(group = head(group, 1),
                   score = sum(value.x == value.y) / n())
#> # A tibble: 3 x 3
#> type group score
#> <chr> <int> <dbl>
#> 1 A 1 1
#> 2 B 2 0
#> 3 C 2 0.667
Here's a "tidy" method:
library(dplyr)
# library(purrr) # map2_dbl
Reference <- bind_rows(Reference_A, Reference_B, Reference_C) %>%
  nest_by(type, .key = "ref") %>%
  ungroup()
Reference
# # A tibble: 3 x 2
# type ref
# <chr> <list<tbl_df[,1]>>
# 1 A [3 x 1]
# 2 B [2 x 1]
# 3 C [3 x 1]
Dataframe %>%
  nest_by(group, type, .key = "data") %>%
  left_join(Reference, by = "type") %>%
  mutate(
    score = purrr::map2_dbl(data, ref, ~ {
      x <- .x$value  # measured values for this group/type
      y <- .y$value  # reference values for this type
      if (length(x) == 0 || length(y) == 0) return(NA_real_)
      if (length(x) != length(y)) return(0)
      sum((is.na(x) & is.na(y)) | x == y) / length(x)
    })
  ) %>%
  select(-data, -ref) %>%
  ungroup()
# # A tibble: 3 x 3
# group type score
# <int> <chr> <dbl>
# 1 1 A 1
# 2 2 B 0
# 3 2 C 0.667
https://www.kaggle.com/nowke9/ipldata (contains the IPL data)
This is an exploratory study of the IPL data set (link above). After merging both files on "id" and "match_id", I created four more variables, namely total_extras, total_runs_scored, total_fours_hit and total_sixes_hit. Now I wish to combine these newly created variables into one single data frame. When I assign these variables to a single object named batsman_aggregate, selecting only the required columns, I get an error message.
library(tidyverse)
deliveries_tbl <- read.csv("deliveries_edit.csv")
matches_tbl <- read.csv("matches.csv")
combined_matches_deliveries_tbl <- deliveries_tbl %>%
  left_join(matches_tbl, by = c("match_id" = "id"))

# Add team score and team extra columns for each match, each inning.
total_score_extras_combined <- combined_matches_deliveries_tbl %>%
  group_by(id, inning, date, batting_team, bowling_team, winner) %>%
  mutate(total_score = sum(total_runs, na.rm = TRUE)) %>%
  mutate(total_extras = sum(extra_runs, na.rm = TRUE)) %>%
  group_by(total_score, total_extras, id, inning, date, batting_team, bowling_team, winner) %>%
  select(id, inning, total_score, total_extras, date, batting_team, bowling_team, winner) %>%
  distinct(total_score, total_extras) %>%
  glimpse() %>%
  ungroup()

# Batsman aggregate (runs, balls, fours, sixes, SR)
# Batsman score in each match
batsman_score_in_a_match <- combined_matches_deliveries_tbl %>%
  group_by(id, inning, batting_team, batsman) %>%
  mutate(total_batsman_runs = sum(batsman_runs, na.rm = TRUE)) %>%
  distinct(total_batsman_runs) %>%
  glimpse() %>%
  ungroup()

# Number of deliveries played.
balls_faced <- combined_matches_deliveries_tbl %>%
  filter(wide_runs == 0) %>%
  group_by(id, inning, batsman) %>%
  summarise(deliveries_played = n()) %>%
  ungroup()

# Number of 4s and 6s by a batsman in each match.
fours_hit <- combined_matches_deliveries_tbl %>%
  filter(batsman_runs == 4) %>%
  group_by(id, inning, batsman) %>%
  summarise(fours_hit = n()) %>%
  glimpse() %>%
  ungroup()

sixes_hit <- combined_matches_deliveries_tbl %>%
  filter(batsman_runs == 6) %>%
  group_by(id, inning, batsman) %>%
  summarise(sixes_hit = n()) %>%
  glimpse() %>%
  ungroup()

batsman_aggregate <- c(batsman_score_in_a_match, balls_faced, fours_hit, sixes_hit) %>%
  select(id, inning, batsman, total_batsman_runs, deliveries_played, fours_hit, sixes_hit)
The error message is:
Error: `select()` doesn't handle lists.
The required output is a data set containing the newly constructed variables.
You'll have to join those four tables, not combine using c.
And the join type is left_join, so that all batsmen are included in the output. Those who didn't face any balls or hit any boundaries will have NA, but you can easily replace these with 0.
I've ignored the by since dplyr will assume you want c("id", "inning", "batsman"), the only 3 common columns in all four data sets.
batsman_aggregate <- left_join(batsman_score_in_a_match, balls_faced) %>%
left_join(fours_hit) %>%
left_join(sixes_hit) %>%
select(id, inning, batsman, total_batsman_runs, deliveries_played, fours_hit, sixes_hit) %>%
replace(is.na(.), 0)
# A tibble: 11,335 x 7
id inning batsman total_batsman_runs deliveries_played fours_hit sixes_hit
<int> <int> <fct> <int> <dbl> <dbl> <dbl>
1 1 1 DA Warner 14 8 2 1
2 1 1 S Dhawan 40 31 5 0
3 1 1 MC Henriques 52 37 3 2
4 1 1 Yuvraj Singh 62 27 7 3
5 1 1 DJ Hooda 16 12 0 1
6 1 1 BCJ Cutting 16 6 0 2
7 1 2 CH Gayle 32 21 2 3
8 1 2 Mandeep Singh 24 16 5 0
9 1 2 TM Head 30 22 3 0
10 1 2 KM Jadhav 31 16 4 1
# ... with 11,325 more rows
There are also 2 batsmen who didn't face any delivery:
batsman_aggregate %>% filter(deliveries_played == 0)
# A tibble: 2 x 7
id inning batsman total_batsman_runs deliveries_played fours_hit sixes_hit
<int> <int> <fct> <int> <dbl> <dbl> <dbl>
1 482 2 MK Pandey 0 0 0 0
2 7907 1 MJ McClenaghan 2 0 0 0
One of whom apparently scored 2 runs! So I think the batsman_runs column has some errors: the match record clearly shows that on the second-last delivery of the first innings, 2 wides were scored, not runs to the batsman.
I am trying to compare a new algorithm result versus an old one. I need to know approximately how many days of a difference the new algorithm has in predicting a "D" versus the old one.
I can't seem to figure out how to point to the first row (day) that contains a 'D' (i.e. min(day) where new == 'D') without filtering. (I was able to grab the row using a double filter, because of the grouping, but couldn't make use of it.) I want to use it inside summarise with dplyr, which is why I have included pseudo-code similar to where I am currently at with my own dataset.
In my data there are groups of varying length (number of days) for each ID, which is why I made groups of different lengths in the example.
library(dplyr)

id  <- c(123, 123, 123, 123, 123, 456, 456, 456, 456)
old <- c('S', 'S', 'S', 'S', 'D', 'S', 'S', 'D', 'D')
new <- c('S', 'S', 'D', 'D', 'D', 'S', 'D', 'D', 'D')
day <- c(1, 2, 3, 4, 5, 1, 2, 3, 4)
data <- data.frame(id, old, new, day)
data
#> id old new day
#> 1 123 S S 1
#> 2 123 S S 2
#> 3 123 S D 3
#> 4 123 S D 4
#> 5 123 D D 5
#> 6 456 S S 1
#> 7 456 S D 2
#> 8 456 D D 3
#> 9 456 D D 4
d <- data %>%
  group_by(id) %>%
  arrange(day, .by_group = TRUE) %>%
  add_tally(new == 'S', name = 'S') %>%
  add_tally(new == 'D', name = 'D') %>%
  group_by(id, S, D)
# summarise(diff = (day of 1st old D) - (day of 1st new D))
# Expected outcome
ido <- c(123, 456)
S <- c(2, 1)
D <- c(3, 3)
diff <- c(2, 1)
outcome <- data.frame(ido, S, D, diff)
outcome
#> ido S D diff
#> 1 123 2 3 2
#> 2 456 1 3 1
Created on 2019-12-26 by the reprex package (v0.3.0)
We can group_by id, count the occurrences of 'S' and 'D', and compute the difference between the first occurrences of 'D' in old and new.
library(dplyr)
data %>%
  group_by(id) %>%
  summarise(S = sum(new == 'S'),
            D = sum(new == 'D'),
            diff = which.max(old == 'D') - which.max(new == 'D'))

# OR, if there could be an id without any 'D', use
# diff = which(old == 'D')[1] - which(new == 'D')[1]
# A tibble: 2 x 4
# id S D diff
# <dbl> <int> <int> <int>
#1 123 2 3 2
#2 456 1 3 1
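The which.max() trick works because, on a logical vector, the maximum is the first TRUE, so its index is returned:
# which.max() on a logical vector gives the position of the first TRUE
which.max(c(FALSE, TRUE, TRUE))
#> [1] 2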
We can use pivot_wider after summarising to get the frequency count, after creating a column that takes the difference between the 'day' values based on the first occurrence of 'D' in both the 'old' and 'new' columns.
library(dplyr)
library(tidyr)
data %>%
  group_by(id) %>%
  group_by(diff = day[match("D", old)] - day[match("D", new)],
           new, .add = TRUE) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  pivot_wider(names_from = new, values_from = n)
# A tibble: 2 x 4
# id diff D S
# <dbl> <dbl> <int> <int>
#1 123 2 3 2
#2 456 1 3 1
I have a dataframe:
df <- data.frame(
  Group = c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'),
  Activity = c('EOSP', 'NOR', 'EOSP', 'COSP', 'NOR', 'EOSP', 'WL', 'NOR'),
  TimeLine = c(1, 2, 3, 4, 1, 2, 3, 4)
)
I want to keep only two activities for each group, and in the order in which I am filtering. For example, I am looking only for the activities EOSP and NOR, in that order. This code:
df %>%
  group_by(Group) %>%
  filter(all(c('EOSP', 'NOR') %in% Activity) & Activity %in% c('EOSP', 'NOR'))
results in:
# A tibble: 6 x 3
# Groups: Group [2]
Group Activity TimeLine
<fct> <fct> <dbl>
1 A EOSP 1
2 A NOR 2
3 A EOSP 3
4 B NOR 1
5 B EOSP 2
6 B NOR 4
I don't want row 3, as that EOSP occurs after NOR. Similarly for group B, I don't want row 4, as that NOR occurs before EOSP. How do I achieve this?
You can use match to get the first instance of Activity == 'EOSP' and use slice to remove everything before it. Once you do that, you can remove duplicates and filter on EOSP and NOR, i.e.
library(tidyverse)
df %>%
  group_by(Group) %>%
  mutate(new = match('EOSP', Activity)) %>%
  slice(new:n()) %>%
  distinct(Activity, .keep_all = TRUE) %>%
  filter(Activity %in% c('EOSP', 'NOR'))
which gives,
# A tibble: 4 x 4
# Groups: Group [2]
Group Activity TimeLine new
<fct> <fct> <dbl> <int>
1 A EOSP 1 1
2 A NOR 2 1
3 B EOSP 2 2
4 B NOR 4 2
NOTE 1: You can ungroup() and select(-new) to tidy up the result.
NOTE 2: The warning messages issued here,
Warning messages:
1: In new:4L : numerical expression has 4 elements: only the first used
2: In new:4L : numerical expression has 4 elements: only the first used
do not affect the result, since we only need the first element of new and all of its elements are the same anyway.
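If you'd rather not see those warnings, take the first element of new explicitly; a minor variation with the same result:
df %>%
  group_by(Group) %>%
  mutate(new = match('EOSP', Activity)) %>%
  slice(first(new):n()) %>%  # first(new) avoids the "only the first used" warning
  distinct(Activity, .keep_all = TRUE) %>%
  filter(Activity %in% c('EOSP', 'NOR'))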
Here is an option with the data.table package: join df with itself, subset to keep only the EOSP activity, and compute the minimum TimeLine by group; then keep only the rows whose TimeLine is greater than or equal to that minimum, which guarantees NOR is kept only if an EOSP occurs before it. Finally, drop duplicated Group/Activity combinations if you want to keep only 2 activities per group:
library(data.table)
setDT(df)  # convert df to a data.table first

df[df[Activity == "EOSP", min(TimeLine), by = Group], on = "Group"][
  Activity %in% c("NOR", "EOSP") & TimeLine >= V1][
    !duplicated(paste(Group, Activity))]
# Group Activity TimeLine V1
#1: A EOSP 1 1
#2: A NOR 2 1
#3: B EOSP 2 2
#4: B NOR 4 2
Here is a dplyr idea:
df %>%
  filter(Activity %in% c('EOSP', 'NOR')) %>%
  group_by(Group) %>%
  mutate(tmp = which(Activity == 'EOSP' & !duplicated(Activity))) %>%
  filter(row_number() %in% c(tmp, tmp + 1))
# A tibble: 4 x 4
# Groups: Group [2]
Group Activity TimeLine tmp
<fct> <fct> <dbl> <int>
1 A EOSP 1 1
2 A NOR 2 1
3 B EOSP 2 2
4 B NOR 4 2