R: Keep row from duplicates (several columns) based on condition

Basically I want to:
If rows are duplicated on the combination of some specific columns, then keep only the row that has the lowest value on another column.
Example data (there's a lot more variance in my real data):
ID BilagNr Henstand Aftale Belob RP Pos Dps Udlign rykkedage
1 111 01-01-2017 1111 100 YA 1 1 10
1 122 02-01-2017 1222 100 YA 1 1 40
1 111 01-07-2017 1111 100 YA 1 1 100
2 222 01-01-2017 2121 299 YA 1 4 5
2 222 01-01-2017 2121 299 YA 1 4 98
2 212 01-05-2017 7654 299 BS 1 3
3 333 01-08-2017 7654 345 BS 2 45
4 444 01-01-2017 7654 345 BS 3 1 4 68
4 411 09-01-2017 7654 345 BS 1 4 43
5 555 01-01-2017 5555 700 BS 1 13
5 555 01-01-2017 5555 700 BS 1 67
6 666 01-01-2017 4720 100 BS 1 23
6 666 03-01-2017 1234 100 BS 2 1 23
6 666 07-08-2017 1234 120 BS 3 1 1 23
7 777 01-01-2017 1234 90 BS 1 1 23
7 777 01-01-2017 1234 90 BS 1 1 199
So I want to only keep these:
ID BilagNr Henstand Aftale Belob RP Pos Dps Udlign rykkedage
1 111 01-01-2017 1111 100 YA 1 1 10
1 122 02-01-2017 1222 100 YA 1 1 40
2 222 01-01-2017 2121 299 YA 1 4 5
2 212 01-05-2017 7654 299 BS 1 3
3 333 01-08-2017 7654 345 BS 2 45
4 444 01-01-2017 7654 345 BS 3 1 4 68
4 411 09-01-2017 7654 345 BS 1 4 43
5 555 01-01-2017 5555 700 BS 1 13
6 666 01-01-2017 4720 100 BS 1 23
6 666 03-01-2017 1234 100 BS 2 1 23
6 666 07-08-2017 1234 120 BS 3 1 1 23
7 777 01-01-2017 1234 90 BS 1 1 23
In other words:
If the rows are duplicated in a combination of the columns ID, BilagNr, Henstand, Aftale, Belob, RP, Pos, Dps, Udlign then keep only one of the duplicated rows and choose this from the condition that rykkedage has to be the smallest of the duplicated rows.
I hope it makes sense.
Furthermore, is it possible to add code that keeps all the duplicated rows that have the same value in rykkedage? I have a large dataset, and I'm not sure whether this is even a problem.
Thank you!

We can group by 'ID', 'BilagNr', ..., 'Udlign', and then slice each group at the index of the minimum value in 'rykkedage':
library(dplyr)
df1 %>%
group_by(ID, BilagNr, Henstand, Aftale, Belob, RP, Pos, Dps, Udlign) %>%
slice(which.min(rykkedage))
# A tibble: 13 x 10
# Groups: ID, BilagNr, Henstand, Aftale, Belob, RP, Pos, Dps, Udlign [13]
# ID BilagNr Henstand Aftale Belob RP Pos Dps Udlign rykkedage
# <int> <int> <chr> <int> <int> <chr> <int> <int> <int> <int>
# 1 1 111 01-01-2017 1111 100 YA 1 1 NA 10
# 2 1 111 01-07-2017 1111 100 YA 1 1 NA 100
# 3 1 122 02-01-2017 1222 100 YA 1 NA 1 40
# 4 2 212 01-05-2017 7654 299 BS 1 NA NA 3
# 5 2 222 01-01-2017 2121 299 YA 1 NA 4 5
# 6 3 333 01-08-2017 7654 345 BS 2 NA NA 45
# 7 4 411 09-01-2017 7654 345 BS 1 NA 4 43
# 8 4 444 01-01-2017 7654 345 BS 3 1 4 68
# 9 5 555 01-01-2017 5555 700 BS 1 NA NA 13
#10 6 666 01-01-2017 4720 100 BS 1 NA NA 23
#11 6 666 03-01-2017 1234 100 BS 2 NA 1 23
#12 6 666 07-08-2017 1234 120 BS 3 1 1 23
#13 7 777 01-01-2017 1234 90 BS 1 NA 1 23
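On the follow-up about ties: which.min returns only the first minimum, so slice(which.min(...)) keeps a single row per group even when several rows share the lowest value. A minimal sketch of one way to keep every tied row, assuming a small made-up data frame `toy` with one key column (the real data would group on all nine columns):

```r
library(dplyr)

# Hypothetical toy data: key "b" has two rows tied on the minimum
toy <- data.frame(
  key       = c("a", "a", "b", "b", "b"),
  rykkedage = c(10, 100, 5, 5, 98)
)

# Keep every row whose rykkedage equals the minimum of its group
ties_kept <- toy %>%
  group_by(key) %>%
  filter(rykkedage == min(rykkedage)) %>%
  ungroup()
```

If rykkedage can contain NA, min(rykkedage, na.rm = TRUE) is probably what is wanted inside the filter.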

Related

Updating categories with too few observations

Please note that this question has been edited after r2evans' answer.
Example data
I have example data as follows:
library(data.table)
vars_of_interest <- c("A", "B", "C")
vars_of_interest_obs_tot <- c("A_tot", "B_tot", "C_tot")
adapted_BMstratum <- c("A_adapted_BMstratum", "B_adapted_BMstratum", "C_adapted_BMstratum")
full_df_bm <- fread("A B C BMstratum
1 NA NA 1110
23 1 2 1120
1 NA 1 1130
6 NA NA 1140
NA 1 1 1100
2 2 4 1110
NA 1 2 1120
NA 21 11 1130")
# Counting the current observations
setDT(full_df_bm)[, (vars_of_interest_obs_tot) := lapply(vars_of_interest, function(x) sum(!is.na(get(x)))),by = c("BMstratum")]
print(full_df_bm)
# A B C BMstratum A_tot B_tot C_tot
# 1: 1 NA NA 1110 2 1 1
# 2: 23 1 2 1120 1 2 2
# 3: 1 NA 1 1130 1 1 2
# 4: 6 NA NA 1140 1 0 0
# 5: NA 1 1 1100 0 1 1
# 6: 2 2 4 1110 2 1 1
# 7: NA 1 2 1120 1 2 2
# 8: NA 21 11 1130 1 1 2
# The adapted strata start the same as the original
setDT(full_df_bm)[, (adapted_BMstratum):=BMstratum]
print(full_df_bm)
# A B C BMstratum A_tot B_tot C_tot A_adapted_BMstratum B_adapted_BMstratum C_adapted_BMstratum
# 1: 1 NA NA 1110 2 1 1 1110 1110 1110
# 2: 23 1 2 1120 1 2 2 1120 1120 1120
# 3: 1 NA 1 1130 1 1 2 1130 1130 1130
# 4: 6 NA NA 1140 1 0 0 1140 1140 1140
# 5: NA 1 1 1100 0 1 1 1100 1100 1100
# 6: 2 2 4 1110 2 1 1 1110 1110 1110
# 7: NA 1 2 1120 1 2 2 1120 1120 1120
# 8: NA 21 11 1130 1 1 2 1130 1130 1130
Updating the strata
For every variable in adapted_BMstratum, I would like to manually decide what to do when there are fewer than 2 observations for each of the variables A, B, or C.
for (i in seq_along(adapted_BMstratum)) {
# If stratum 1110 has less than two observations change to 1120
setDT(full_df_bm)[get(vars_of_interest_obs_tot[i])<2 & get(adapted_BMstratum[i])==1110, (adapted_BMstratum[i]):=1120 ,]
# Update the observations
bygroup <- adapted_BMstratum[i]
setDT(full_df_bm)[, (vars_of_interest_obs_tot) := lapply(vars_of_interest, function(x) sum(!is.na(get(x)))),by = bygroup]
# If stratum 1120 has less than two observations change to 1110
setDT(full_df_bm)[get(vars_of_interest_obs_tot[i])<2 & get(adapted_BMstratum[i])==1120, (adapted_BMstratum[i]):=1110,]
# Update the observations
bygroup <- adapted_BMstratum[i]
setDT(full_df_bm)[, (vars_of_interest_obs_tot) := lapply(vars_of_interest, function(x) sum(!is.na(get(x)))),by = bygroup]
# If stratum 1130 has less than two observations change to 1110
setDT(full_df_bm)[get(vars_of_interest_obs_tot[i])<2 & get(adapted_BMstratum[i])==1120, (adapted_BMstratum[i]):=1110,]
# Update the observations
bygroup <- adapted_BMstratum[i]
setDT(full_df_bm)[, (vars_of_interest_obs_tot) := lapply(vars_of_interest, function(x) sum(!is.na(get(x)))),by = bygroup]
# If any strata after has less than 2 observations, change them all to 1110
setDT(full_df_bm)[get(vars_of_interest_obs_tot[i])<2 & (get(adapted_BMstratum[i])==1110 || get(adapted_BMstratum[i])==1120 || get(adapted_BMstratum[i])==1130), (adapted_BMstratum[i]):=1110,]
# Update the observations a last time
bygroup <- adapted_BMstratum[i]
setDT(full_df_bm)[, (vars_of_interest_obs_tot) := lapply(vars_of_interest, function(x) sum(!is.na(get(x)))),by = bygroup]
}
However, this does not give the desired outcome:
A B C BMstratum A_tot B_tot C_tot A_adapted_BMstratum B_adapted_BMstratum C_adapted_BMstratum
1: 1 NA NA 1110 2 1 1 1110 1110 1110
2: 23 1 2 1120 1 2 2 1110 1120 1120
3: 1 NA 1 1130 1 1 2 1110 1110 1130
4: 6 NA NA 1140 1 0 0 1110 1110 1110
5: NA 1 1 1100 0 1 1 1110 1110 1110
6: 2 2 4 1110 2 1 1 1110 1110 1110
7: NA 1 2 1120 1 2 2 1110 1120 1120
8: NA 21 11 1130 1 1 2 1110 1110 1130
In addition it gives the following warnings:
Warning messages:
1: In get(adapted_BMstratum[i]) == 1110 || get(adapted_BMstratum[i]) == :
'length(x) = 8 > 1' in coercion to 'logical(1)'
2: In get(adapted_BMstratum[i]) == 1110 || get(adapted_BMstratum[i]) == :
'length(x) = 8 > 1' in coercion to 'logical(1)'
3: In get(adapted_BMstratum[i]) == 1110 || get(adapted_BMstratum[i]) == :
'length(x) = 8 > 1' in coercion to 'logical(1)'
Desired outcome
NOTE: For B_adapted_stratum all have been changed to 1110 because 1110,1120 and 1130, (if they exist) did not all have at least 2 observations.
# A B C BMstratum A_tot B_tot C_tot A_adapted_BMstratum B_adapted_BMstratum C_adapted_BMstratum
# 1: 1 NA NA 1110 4 6 4 1110 1120 1120
# 2: 23 1 2 1120 4 6 4 1110 1120 1120
# 3: 1 NA 1 1130 4 6 2 1110 1120 1130
# 4: 6 NA NA 1140 1 0 0 1140 1140 1140
# 5: NA 1 1 1100 0 1 1 1100 1100 1100
# 6: 2 2 4 1110 4 6 4 1110 1120 1120
# 7: NA 1 2 1120 4 6 4 1110 1120 1120
# 8: NA 21 11 1130 4 6 2 1110 1120 1130
Note: The strata 1100 and 1140 should not be touched, but should not be removed either. This is because I need to add manual rules for these numbers separately. In the real data, there are many more numbers and rules, and I think it would become too messy to write everything out.
Here's a start, though I don't know how to assign 1120 to A_adapted_BMstratum since the two categories are identical:
full_df_bm[, c(adapted_BMstratum) := lapply(.SD, function(z) fifelse(z < 2, BMstratum[which.min(z)] , BMstratum)),
.SDcols = vars_of_interest_obs_tot]
# A_tot B_tot C_tot BMstratum A_adapted_BMstratum B_adapted_BMstratum C_adapted_BMstratum
# <int> <int> <int> <int> <int> <int> <int>
# 1: 1 2 1 1110 1110 1110 1110
# 2: 1 1 2 1120 1110 1120 1120
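On the warnings quoted in the question: `||` is a scalar operator, so with length-8 vectors it used only the first element (recent R versions raise an error instead). Inside a data.table `i` expression the row-wise forms are `|` or `%in%`. A minimal sketch on made-up data (the table `dt` and its columns `stratum` and `n_obs` are hypothetical, not the asker's):

```r
library(data.table)

# Hypothetical data: one row per stratum with its observation count
dt <- data.table(
  stratum = c(1110, 1120, 1130, 1140),
  n_obs   = c(1, 1, 3, 1)
)

# %in% is vectorised, so the condition is evaluated row by row
dt[n_obs < 2 & stratum %in% c(1110, 1120, 1130), stratum := 1110]
```

Stratum 1140 is left alone because it is not in the listed set, and 1130 is left alone because it has enough observations.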

Estimate number of people alive for every single year between 1850 and 1950 in R

I'm still having trouble regarding my workflow. I need to estimate the number of people alive by gender in every single year between 1850 and 1950. I have the following information:
id, birth_year, death_year and gender
id <- c(1:6)
birth_year <- c(1850:1855)
death_year <- c(1890:1895)
gender <- c("female", "male", "female", "male", "male", "male")
df <- data.frame(id, birth_year, death_year, gender)
Thinking about the steps to achieve my goal, I realized that I should add a column to my df for each year. In each column, I would estimate the age of person i at year x, then the age of person i + 1 at year x + 1, with i = 1 and x = 1850.
df$age1850 <- 1850 - df$birth_year
df$age1851 <- 1851 - df$birth_year
df$age1852 <- 1852 - df$birth_year
df$age1853 <- 1853 - df$birth_year
df$age1854 <- 1854 - df$birth_year
df$age1855 <- 1855 - df$birth_year
# The expected result would be:
id birth_year death_year gender age1850 age1851 age1852 age1853 age1854 age1855
1 1 1850 1890 female 0 1 2 3 4 5
2 2 1851 1891 male -1 0 1 2 3 4
3 3 1852 1892 female -2 -1 0 1 2 3
4 4 1853 1893 male -3 -2 -1 0 1 2
5 5 1854 1894 male -4 -3 -2 -1 0 1
6 6 1855 1895 male -5 -4 -3 -2 -1 0
Thanks in advance!
To estimate the number of people alive by gender in every single year between 1850 and 1950, you can use table and subset your df by year.
df$gender <- as.factor(df$gender)
years <- 1850:1950
sapply(setNames(years, years), function(i) {table(df$gender[df$birth_year <= i &
df$death_year >= i])})
# 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863
#female 1 1 2 2 2 2 2 2 2 2 2 2 2 2
#male 0 1 1 2 3 4 4 4 4 4 4 4 4 4
# 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877
#female 2 2 2 2 2 2 2 2 2 2 2 2 2 2
#male 4 4 4 4 4 4 4 4 4 4 4 4 4 4
# 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891
#female 2 2 2 2 2 2 2 2 2 2 2 2 2 1
#male 4 4 4 4 4 4 4 4 4 4 4 4 4 4
# 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905
#female 1 0 0 0 0 0 0 0 0 0 0 0 0 0
#male 3 3 2 1 0 0 0 0 0 0 0 0 0 0
#...
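The same condition (born in or before the year, died in or after it) can also be evaluated for all years at once. A minimal base R sketch, assuming the df from the question; `alive` and `counts` are names introduced here for illustration:

```r
df <- data.frame(
  id         = 1:6,
  birth_year = 1850:1855,
  death_year = 1890:1895,
  gender     = c("female", "male", "female", "male", "male", "male")
)
years <- 1850:1950

# alive[i, j] is TRUE when person i is alive in years[j]
alive <- outer(df$birth_year, years, `<=`) & outer(df$death_year, years, `>=`)
colnames(alive) <- years

# Sum the rows of the 0/1 matrix within each gender
counts <- rowsum(alive + 0L, df$gender)
counts[, c("1850", "1855", "1891", "1896")]
```

This builds a persons-by-years matrix, so for a very large df the sapply version above may be lighter on memory.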

R fill new column based on interval from another dataset (lookup)

Let's say I have this dataset:
df1 = data.frame(groupID = c(rep("a", 6), rep("b", 6), rep("c", 6)),
testid = c(111, 222, 333, 444, 555, 666, 777, 888, 999, 1010, 1111, 1212, 1313, 1414, 1515, 1616, 1717, 1818))
df1
groupID testid
1 a 111
2 a 222
3 a 333
4 a 444
5 a 555
6 a 666
7 b 777
8 b 888
9 b 999
10 b 1010
11 b 1111
12 b 1212
13 c 1313
14 c 1414
15 c 1515
16 c 1616
17 c 1717
18 c 1818
And I have this 2nd dataset:
df2 = data.frame(groupID = c("a", "a", "a", "a", "b", "b", "b", "c", "c", "c"),
testid = c(222, 333, 555, 666, 777, 999, 1010, 1313, 1616, 1818),
bd = c(1, 1, 2, 2, 0, 1, 1, 1, 1, 2))
df2
groupID testid bd
1 a 222 1
2 a 333 1
3 a 555 2
4 a 666 2
5 b 777 0
6 b 999 1
7 b 1010 1
8 c 1313 1
9 c 1616 1
10 c 1818 2
I want to use the intervals in the 2nd dataset to fill in a new variable in the 1st dataset: within each group, values between two occurrences of the same bd should be filled with that bd, and everything else should be NA.
Desired output:
groupID testid new_bd
1 a 111 NA
2 a 222 1
3 a 333 1
4 a 444 NA
5 a 555 2
6 a 666 2
7 b 777 0
8 b 888 NA
9 b 999 1
10 b 1010 1
11 b 1111 NA
12 b 1212 NA
13 c 1313 1
14 c 1414 1
15 c 1515 1
16 c 1616 1
17 c 1717 NA
18 c 1818 2
Ideally I would like a dplyr/tidyr solution, but I am open to any approach.
Similar questions, but these fill all values:
R: Filling timeseries values but only within last 12 months
R autofill blanks in variable until next value
I would start by collapsing df2 into the start and end of each range. After that you can loop over the ranges or use any other approach.
library(dplyr)
grps <- df2 %>% group_by(groupID, bd) %>% summarize(start = min(testid), end = max(testid))
grps
groupID bd start end
<fct> <dbl> <dbl> <dbl>
1 a 1 222 333
2 a 2 555 666
3 b 0 777 777
4 b 1 999 1010
5 c 1 1313 1616
6 c 2 1818 1818
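df1$bd <- NA
for (i in seq_len(nrow(grps))) {
  # match on the group as well as the range, so ranges from other groups cannot leak in
  in_range <- as.character(df1$groupID) == as.character(grps$groupID[i]) &
    df1$testid >= grps$start[i] & df1$testid <= grps$end[i]
  df1$bd[in_range] <- grps$bd[i]
}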
df1
groupID testid bd
1 a 111 NA
2 a 222 1
3 a 333 1
4 a 444 NA
5 a 555 2
6 a 666 2
7 b 777 0
8 b 888 NA
9 b 999 1
10 b 1010 1
11 b 1111 NA
12 b 1212 NA
13 c 1313 1
14 c 1414 1
15 c 1515 1
16 c 1616 1
17 c 1717 NA
18 c 1818 2
Maybe I have overlooked a simpler method, but here is what I came up with using dplyr. We first left_join df1 and df2 and fill the bd column downwards. We then group_by groupID and bd, find the first and last index of the non-NA values in each group, and replace with NA the filled values that fall before the first index or after the last index.
library(dplyr)
left_join(df1, df2, by = c("groupID", "testid")) %>%
mutate(bd1 = bd) %>%
tidyr::fill(bd) %>%
group_by(groupID, bd) %>%
mutate(minRow = if (all(is.na(bd))) 1 else first(which(!is.na(bd1))),
maxRow = if (all(is.na(bd))) n() else last(which(!is.na(bd1))),
new_bd = replace(bd, is.na(bd1) & (row_number() < minRow |
row_number() > maxRow), NA)) %>%
ungroup() %>%
select(names(df1), new_bd)
# groupID testid new_bd
# <fct> <dbl> <dbl>
# 1 a 111 NA
# 2 a 222 1
# 3 a 333 1
# 4 a 444 NA
# 5 a 555 2
# 6 a 666 2
# 7 b 777 0
# 8 b 888 NA
# 9 b 999 1
#10 b 1010 1
#11 b 1111 NA
#12 b 1212 NA
#13 c 1313 1
#14 c 1414 1
#15 c 1515 1
#16 c 1616 1
#17 c 1717 NA
#18 c 1818 2
Here is a solution that works on my test data above but won't run on my large dataset, where I get Error: cannot allocate vector of size 45.5 Gb. I believe it is related to the problem outlined here: "The same size explosion can happen if you have lots of the same level in both with otherwise different rows". In my actual dataset I'm working with date variables; I didn't think this would affect the problem, but maybe it does. I'm not sure whether there is a workaround using fuzzyjoin, since it works on a subset of the data.
library(tidyverse)
library(fuzzyjoin)
library(tidylog)
grps <- df2 %>% group_by(groupID, bd) %>% summarize(start = min(testid), end = max(testid))
grps
df1 %>%
fuzzy_left_join(grps,
by = c("groupID" = "groupID",
"testid" = "start",
"testid" = "end"),
match_fun = list(`==`, `>=`, `<=`)) %>%
select(groupID = groupID.x, testid, bd, start, end)
select: dropped 2 variables (groupID.x, groupID.y)
groupID testid bd start end
1 a 111 NA NA NA
2 a 222 1 222 333
3 a 333 1 222 333
4 a 444 NA NA NA
5 a 555 2 555 666
6 a 666 2 555 666
7 b 777 0 777 777
8 b 888 NA NA NA
9 b 999 1 999 1010
10 b 1010 1 999 1010
11 b 1111 NA NA NA
12 b 1212 NA NA NA
13 c 1313 1 1313 1616
14 c 1414 1 1313 1616
15 c 1515 1 1313 1616
16 c 1616 1 1313 1616
17 c 1717 NA NA NA
18 c 1818 2 1818 1818
data.table solution:
library(data.table)
new <- setDT(grps)[setDT(df1),
  .(groupID, testid, x.start, x.end, x.bd),
  on = .(groupID, start <= testid, end >= testid)]
new
groupID testid x.start x.end x.bd
1: a 111 NA NA NA
2: a 222 222 333 1
3: a 333 222 333 1
4: a 444 NA NA NA
5: a 555 555 666 2
6: a 666 555 666 2
7: b 777 777 777 0
8: b 888 NA NA NA
9: b 999 999 1010 1
10: b 1010 999 1010 1
11: b 1111 NA NA NA
12: b 1212 NA NA NA
13: c 1313 1313 1616 1
14: c 1414 1313 1616 1
15: c 1515 1313 1616 1
16: c 1616 1313 1616 1
17: c 1717 NA NA NA
18: c 1818 1818 1818 2
I think it may be possible in fuzzyjoin using internal_join, but I'm not sure: https://github.com/dgrtwo/fuzzyjoin/issues/50
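Another option that may be worth trying on the large dataset is a data.table update join, which writes the new column into df1 by reference instead of materialising a joined table first. This is a sketch under the assumption that the intervals within a group do not overlap; the data is rebuilt here as data.tables so the block is self-contained:

```r
library(data.table)

df1 <- data.table(
  groupID = rep(c("a", "b", "c"), each = 6),
  testid  = c(111, 222, 333, 444, 555, 666, 777, 888, 999, 1010, 1111, 1212,
              1313, 1414, 1515, 1616, 1717, 1818)
)
df2 <- data.table(
  groupID = c("a", "a", "a", "a", "b", "b", "b", "c", "c", "c"),
  testid  = c(222, 333, 555, 666, 777, 999, 1010, 1313, 1616, 1818),
  bd      = c(1, 1, 2, 2, 0, 1, 1, 1, 1, 2)
)

# One row per (groupID, bd) giving the closed interval [start, end]
grps <- df2[, .(start = min(testid), end = max(testid)), by = .(groupID, bd)]

# Non-equi update join: rows of df1 whose testid falls inside an interval
# of the same group receive that interval's bd; all other rows stay NA
df1[grps, on = .(groupID, testid >= start, testid <= end), new_bd := i.bd]
```

Whether this actually avoids the memory error depends on the real data, so it is best checked on a subset first.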

R: Calculating New Variable R Code

I have
id_1 id_2 name count total
1 001 111 a 15
2 001 111 b 3
3 001 111 sum 28 28
4 002 111 a 7
5 002 111 b 33
6 002 111 sum 48 48
I want the rows that share the same id_1 and id_2 to share the total, like
id_1 id_2 name count total
1 001 111 a 15 28
2 001 111 b 3 28
3 001 111 sum 28 28
4 002 111 a 7 48
5 002 111 b 33 48
6 002 111 sum 48 48
We can use fill from tidyr.
library(tidyr)
dat2 <- dat %>% fill(total, .direction = "up")
dat2
# id_1 id_2 name count total
# 1 1 111 a 15 28
# 2 1 111 b 3 28
# 3 1 111 sum 28 28
# 4 2 111 a 7 48
# 5 2 111 b 33 48
# 6 2 111 sum 48 48
DATA
dat <- read.table(text = " id_1 id_2 name count total
1 001 111 a 15 NA
2 001 111 b 3 NA
3 001 111 sum 28 28
4 002 111 a 7 NA
5 002 111 b 33 NA
6 002 111 sum 48 48",
header = TRUE, stringsAsFactors = FALSE)
Consider base R's ave to compute the group maximum (na.rm = TRUE skips the NA values):
df$total <- ave(df$total, df$id_1, df$id_2, FUN = function(i) max(i, na.rm = TRUE))
df
# id_1 id_2 name count total
# 1 1 111 a 15 28
# 2 1 111 b 3 28
# 3 1 111 sum 28 28
# 4 2 111 a 7 48
# 5 2 111 b 33 48
# 6 2 111 sum 48 48
Using zoo and data.table:
df <- read.table(text = "id_1 id_2 name count total
001 111 a 15 NA
001 111 b 3 NA
001 111 sum 28 28
002 111 a 7 NA
002 111 b 33 NA
002 111 sum 48 48",
header = TRUE, stringsAsFactors = FALSE) # create data
library(zoo) # load packages
library(data.table)
# convert df to data.table and carry total forward and backward within each id pair
setDT(df)[, total := na.locf(na.locf(total, na.rm = FALSE), na.rm = FALSE, fromLast = TRUE), by = c("id_1", "id_2")]
Output:
id_1 id_2 name count total
1: 1 111 a 15 28
2: 1 111 b 3 28
3: 1 111 sum 28 28
4: 2 111 a 7 48
5: 2 111 b 33 48
6: 2 111 sum 48 48
Simple approach using the normal dplyr way:
dat %>% group_by(id_1, id_2) %>% mutate(total=count[name == "sum"])
Alternatively:
dat %>% group_by(id_1, id_2) %>% mutate(total=na.omit(total)[1])
id_1 id_2 name count total
<int> <int> <chr> <int> <int>
1 1 111 a 15 28
2 1 111 b 3 28
3 1 111 sum 28 28
4 2 111 a 7 48
5 2 111 b 33 48
6 2 111 sum 48 48

Creating a rank column based on two other (linked) columns in R

I have the following dataframe (example data) which has the dates of different DVD recordings for different pairs of birds for numerous broods:
PairID BroodRef DVDdate
1 512 2004-05-22
1 512 2004-05-30
1 512 2004-05-26
1 588 2004-06-30
1 588 2004-07-04
1 588 2004-07-09
2 673 2004-07-19
3 543 2004-06-03
3 543 2004-06-07
3 543 2004-06-11
3 620 2004-07-19
3 39 2005-05-19
3 39 2005-05-23
What I'd like is a brood number for each pair, such as:
PairID BroodRef DVDdate BroodNumber
1 512 2004-05-22 1
1 512 2004-05-30 1
1 512 2004-05-26 1
1 588 2004-06-30 2
1 588 2004-07-04 2
1 588 2004-07-09 2
2 673 2004-07-19 1
3 543 2004-06-03 1
3 543 2004-06-07 1
3 543 2004-06-11 1
3 620 2004-07-19 2
3 39 2005-05-19 3
3 39 2005-05-23 3
I have tried
ddply(df,.(PairID),transform,BroodNumber = dense_rank(BroodRef))
which I saw on another question, but this results in Pair 3, BroodRef 39 being BroodNumber 1 rather than the 3 it should be.
Appreciate any help!
We could use rleid() from data.table to create a sequence based on BroodRef, grouped by PairID.
library(data.table)
setDT(df)[,BroodNumber := rleid(BroodRef), by = PairID]
# PairID BroodRef DVDdate BroodNumber
# 1: 1 512 2004-05-22 1
# 2: 1 512 2004-05-30 1
# 3: 1 512 2004-05-26 1
# 4: 1 588 2004-06-30 2
# 5: 1 588 2004-07-04 2
# 6: 1 588 2004-07-09 2
# 7: 2 673 2004-07-19 1
# 8: 3 543 2004-06-03 1
# 9: 3 543 2004-06-07 1
#10: 3 543 2004-06-11 1
#11: 3 620 2004-07-19 2
#12: 3 39 2005-05-19 3
#13: 3 39 2005-05-23 3
We can use dplyr
library(dplyr)
df1 %>%
group_by(PairID) %>%
mutate(BroodNumber = match(BroodRef, unique(BroodRef)))
# PairID BroodRef DVDdate BroodNumber
# (int) (int) (chr) (int)
#1 1 512 2004-05-22 1
#2 1 512 2004-05-30 1
#3 1 512 2004-05-26 1
#4 1 588 2004-06-30 2
#5 1 588 2004-07-04 2
#6 1 588 2004-07-09 2
#7 2 673 2004-07-19 1
#8 3 543 2004-06-03 1
#9 3 543 2004-06-07 1
#10 3 543 2004-06-11 1
#11 3 620 2004-07-19 2
#12 3 39 2005-05-19 3
#13 3 39 2005-05-23 3
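The two answers agree on this data but are not interchangeable in general. rleid() starts a new number whenever the value changes, so a BroodRef that reappears after a different one gets a fresh number, whereas match(x, unique(x)) numbers values by first appearance. A small sketch on a made-up vector (`refs` is hypothetical):

```r
library(data.table)

# Hypothetical brood references where 512 reappears after 588
refs <- c(512, 512, 588, 512)

rleid(refs)                # 1 1 2 3: numbers consecutive runs
match(refs, unique(refs))  # 1 1 2 1: numbers distinct values in order of appearance
```

If the rows of each brood are not guaranteed to be contiguous within a pair, the match() form is likely the safer choice.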
