I have the following dataframe (example data) which has the dates of different DVD recordings for different pairs of birds for numerous broods:
PairID BroodRef DVDdate
1 512 2004-05-22
1 512 2004-05-30
1 512 2004-05-26
1 588 2004-06-30
1 588 2004-07-04
1 588 2004-07-09
2 673 2004-07-19
3 543 2004-06-03
3 543 2004-06-07
3 543 2004-06-11
3 620 2004-07-19
3 39 2005-05-19
3 39 2005-05-23
What I'd like is a brood number for each pair, such as:
PairID BroodRef DVDdate BroodNumber
1 512 2004-05-22 1
1 512 2004-05-30 1
1 512 2004-05-26 1
1 588 2004-06-30 2
1 588 2004-07-04 2
1 588 2004-07-09 2
2 673 2004-07-19 1
3 543 2004-06-03 1
3 543 2004-06-07 1
3 543 2004-06-11 1
3 620 2004-07-19 2
3 39 2005-05-19 3
3 39 2005-05-23 3
I have tried
ddply(df,.(PairID),transform,BroodNumber = dense_rank(BroodRef))
which I saw on another question, but this results in Pair 3, BroodRef 39 being BroodNumber 1 rather than the 3 it should be.
Appreciate any help!
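We could use rleid() from data.table to create a run-length id based on BroodRef, grouped by PairID. rleid() starts a new number whenever the value changes, so this assumes that the rows of each brood sit next to each other within a pair, as they do in the example.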
library(data.table)
setDT(df)[,BroodNumber := rleid(BroodRef), by = PairID]
# PairID BroodRef DVDdate BroodNumber
# 1: 1 512 2004-05-22 1
# 2: 1 512 2004-05-30 1
# 3: 1 512 2004-05-26 1
# 4: 1 588 2004-06-30 2
# 5: 1 588 2004-07-04 2
# 6: 1 588 2004-07-09 2
# 7: 2 673 2004-07-19 1
# 8: 3 543 2004-06-03 1
# 9: 3 543 2004-06-07 1
#10: 3 543 2004-06-11 1
#11: 3 620 2004-07-19 2
#12: 3 39 2005-05-19 3
#13: 3 39 2005-05-23 3
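We can also use dplyr. Here match() against unique() numbers each BroodRef by the order in which it first appears within the pair:
library(dplyr)
df %>%
  group_by(PairID) %>%
  mutate(BroodNumber = match(BroodRef, unique(BroodRef)))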
# PairID BroodRef DVDdate BroodNumber
# (int) (int) (chr) (int)
#1 1 512 2004-05-22 1
#2 1 512 2004-05-30 1
#3 1 512 2004-05-26 1
#4 1 588 2004-06-30 2
#5 1 588 2004-07-04 2
#6 1 588 2004-07-09 2
#7 2 673 2004-07-19 1
#8 3 543 2004-06-03 1
#9 3 543 2004-06-07 1
#10 3 543 2004-06-11 1
#11 3 620 2004-07-19 2
#12 3 39 2005-05-19 3
#13 3 39 2005-05-23 3
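For comparison, the same numbering can be sketched in base R. This is only a minimal sketch, not part of the answers above: it rebuilds the PairID and BroodRef columns from the question, and the short vector x is a made-up example that shows where a run-based id (what rleid() computes) and match()/unique() would disagree, namely when the same value reappears after a different one.

```r
# PairID and BroodRef columns copied from the question's example data
df <- data.frame(
  PairID   = c(1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 3, 3, 3),
  BroodRef = c(512, 512, 512, 588, 588, 588, 673, 543, 543, 543, 620, 39, 39)
)

# Number each BroodRef by order of first appearance within its pair
df$BroodNumber <- ave(df$BroodRef, df$PairID,
                      FUN = function(x) match(x, unique(x)))

# Hypothetical vector where a value comes back after a gap
x <- c(5, 5, 7, 5)
by_first_appearance <- match(x, unique(x))            # 1 1 2 1
by_run              <- cumsum(c(TRUE, diff(x) != 0))  # 1 1 2 3
```

dense_rank(), by contrast, ranks by value rather than by position, which is why BroodRef 39 ended up as BroodNumber 1 in the question.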
Please note that this question has been edited after r2evans' answer.
Example data
I have example data as follows:
library(data.table)
vars_of_interest <- c("A", "B", "C")
vars_of_interest_obs_tot <- c("A_tot", "B_tot", "C_tot")
adapted_BMstratum <- c("A_adapted_BMstratum", "B_adapted_BMstratum", "C_adapted_BMstratum")
full_df_bm <- fread("A B C BMstratum
1 NA NA 1110
23 1 2 1120
1 NA 1 1130
6 NA NA 1140
NA 1 1 1100
2 2 4 1110
NA 1 2 1120
NA 21 11 1130")
# Counting the current observations
setDT(full_df_bm)[, (vars_of_interest_obs_tot) := lapply(vars_of_interest, function(x) sum(!is.na(get(x)))),by = c("BMstratum")]
print(full_df_bm)
# A B C BMstratum A_tot B_tot C_tot
# 1: 1 NA NA 1110 2 1 1
# 2: 23 1 2 1120 1 2 2
# 3: 1 NA 1 1130 1 1 2
# 4: 6 NA NA 1140 1 0 0
# 5: NA 1 1 1100 0 1 1
# 6: 2 2 4 1110 2 1 1
# 7: NA 1 2 1120 1 2 2
# 8: NA 21 11 1130 1 1 2
# The adapted strata start the same as the original
setDT(full_df_bm)[, (adapted_BMstratum):=BMstratum]
print(full_df_bm)
# A B C BMstratum A_tot B_tot C_tot A_adapted_BMstratum B_adapted_BMstratum C_adapted_BMstratum
# 1: 1 NA NA 1110 2 1 1 1110 1110 1110
# 2: 23 1 2 1120 1 2 2 1120 1120 1120
# 3: 1 NA 1 1130 1 1 2 1130 1130 1130
# 4: 6 NA NA 1140 1 0 0 1140 1140 1140
# 5: NA 1 1 1100 0 1 1 1100 1100 1100
# 6: 2 2 4 1110 2 1 1 1110 1110 1110
# 7: NA 1 2 1120 1 2 2 1120 1120 1120
# 8: NA 21 11 1130 1 1 2 1130 1130 1130
Updating the strata
For every variable in adapted_BMstratum, I would like to manually decide what to do when there are fewer than 2 observations for each of the variables A, B, or C.
for (i in seq_along(adapted_BMstratum)) {
# If stratum 1110 has less than two observations change to 1120
setDT(full_df_bm)[get(vars_of_interest_obs_tot[i])<2 & get(adapted_BMstratum[i])==1110, (adapted_BMstratum[i]):=1120 ,]
# Update the observations
bygroup <- adapted_BMstratum[i]
setDT(full_df_bm)[, (vars_of_interest_obs_tot) := lapply(vars_of_interest, function(x) sum(!is.na(get(x)))),by = bygroup]
# If stratum 1120 has less than two observations change to 1110
setDT(full_df_bm)[get(vars_of_interest_obs_tot[i])<2 & get(adapted_BMstratum[i])==1120, (adapted_BMstratum[i]):=1110,]
# Update the observations
bygroup <- adapted_BMstratum[i]
setDT(full_df_bm)[, (vars_of_interest_obs_tot) := lapply(vars_of_interest, function(x) sum(!is.na(get(x)))),by = bygroup]
# If stratum 1130 has less than two observations change to 1110
setDT(full_df_bm)[get(vars_of_interest_obs_tot[i])<2 & get(adapted_BMstratum[i])==1120, (adapted_BMstratum[i]):=1110,]
# Update the observations
bygroup <- adapted_BMstratum[i]
setDT(full_df_bm)[, (vars_of_interest_obs_tot) := lapply(vars_of_interest, function(x) sum(!is.na(get(x)))),by = bygroup]
# If any strata after has less than 2 observations, change them all to 1110
setDT(full_df_bm)[get(vars_of_interest_obs_tot[i])<2 & (get(adapted_BMstratum[i])==1110 || get(adapted_BMstratum[i])==1120 || get(adapted_BMstratum[i])==1130), (adapted_BMstratum[i]):=1110,]
# Update the observations a last time
bygroup <- adapted_BMstratum[i]
setDT(full_df_bm)[, (vars_of_interest_obs_tot) := lapply(vars_of_interest, function(x) sum(!is.na(get(x)))),by = bygroup]
}
This does however not give the desired outcome:
A B C BMstratum A_tot B_tot C_tot A_adapted_BMstratum B_adapted_BMstratum C_adapted_BMstratum
1: 1 NA NA 1110 2 1 1 1110 1110 1110
2: 23 1 2 1120 1 2 2 1110 1120 1120
3: 1 NA 1 1130 1 1 2 1110 1110 1130
4: 6 NA NA 1140 1 0 0 1110 1110 1110
5: NA 1 1 1100 0 1 1 1110 1110 1110
6: 2 2 4 1110 2 1 1 1110 1110 1110
7: NA 1 2 1120 1 2 2 1110 1120 1120
8: NA 21 11 1130 1 1 2 1110 1110 1130
In addition it gives the following warnings:
Warning messages:
1: In get(adapted_BMstratum[i]) == 1110 || get(adapted_BMstratum[i]) == :
'length(x) = 8 > 1' in coercion to 'logical(1)'
2: In get(adapted_BMstratum[i]) == 1110 || get(adapted_BMstratum[i]) == :
'length(x) = 8 > 1' in coercion to 'logical(1)'
3: In get(adapted_BMstratum[i]) == 1110 || get(adapted_BMstratum[i]) == :
'length(x) = 8 > 1' in coercion to 'logical(1)'
Desired outcome
NOTE: For B_adapted_BMstratum all have been changed to 1110 because 1110, 1120 and 1130 (if they exist) did not all have at least 2 observations.
# A B C BMstratum A_tot B_tot C_tot A_adapted_BMstratum B_adapted_BMstratum C_adapted_BMstratum
# 1: 1 NA NA 1110 4 6 4 1110 1120 1120
# 2: 23 1 2 1120 4 6 4 1110 1120 1120
# 3: 1 NA 1 1130 4 6 2 1110 1120 1130
# 4: 6 NA NA 1140 1 0 0 1140 1140 1140
# 5: NA 1 1 1100 0 1 1 1100 1100 1100
# 6: 2 2 4 1110 4 6 4 1110 1120 1120
# 7: NA 1 2 1120 4 6 4 1110 1120 1120
# 8: NA 21 11 1130 4 6 2 1110 1120 1130
Note: The strata 1100 and 1140 should not be touched, but should not be removed either. This is because I need to add manual rules for these numbers separately. In the real data, there are many more numbers and rules, and I think it would become too messy to write everything out.
Here's a start, though I don't know how to assign 1120 to A_adapted_BMstratum since the two categories are identical:
full_df_bm[, c(adapted_BMstratum) := lapply(.SD, function(z) fifelse(z < 2, BMstratum[which.min(z)] , BMstratum)),
.SDcols = vars_of_interest_obs_tot]
# A_tot B_tot C_tot BMstratum A_adapted_BMstratum B_adapted_BMstratum C_adapted_BMstratum
# <int> <int> <int> <int> <int> <int> <int>
# 1: 1 2 1 1110 1110 1110 1110
# 2: 1 1 2 1120 1110 1120 1120
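As a side note on the warnings quoted in the question: `||` is the scalar form of OR and is meant for single TRUE/FALSE values (as far as I know, recent R versions raise an error instead of a warning when an operand is longer than one), whereas a row filter needs the element-wise `|` or `%in%`. It also looks as if the block commented "If stratum 1130 ..." in the question tests for 1120, which may be a copy-paste slip. A minimal sketch on a made-up vector of strata, not the full data:

```r
# Hypothetical strata vector, one value per row
stratum <- c(1110, 1120, 1130, 1140, 1100)

# Element-wise OR: one TRUE/FALSE per row
is_target_or <- stratum == 1110 | stratum == 1120 | stratum == 1130

# Equivalent and shorter with %in%
is_target_in <- stratum %in% c(1110, 1120, 1130)
```

Either logical vector can then be combined with the count condition using `&` inside the data.table `i` argument.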
I'm still having trouble regarding my workflow. I need to estimate the number of people alive by gender in every single year between 1850 and 1950. I have the following information:
id, birth_year, death_year and gender
id <- c(1:6)
birth_year <- c(1850:1855)
death_year <- c(1890:1895)
gender <- c("female", "male", "female", "male", "male", "male")
df <- data.frame(id, birth_year, death_year, gender)
Thinking about the steps to achieve my goal, I realized that I should add a column to my df for each year. In each column, I would calculate the age of person i at year x, then the age of person i + 1 at year x + 1, starting with i = 1 and x = 1850.
df$age1850 <- 1850 - df$birth_year
df$age1851 <- 1851 - df$birth_year
df$age1852 <- 1852 - df$birth_year
df$age1853 <- 1853 - df$birth_year
df$age1854 <- 1854 - df$birth_year
df$age1855 <- 1855 - df$birth_year
# The expected result would be:
id birth_year death_year gender age1850 age1851 age1852 age1853 age1854 age1855
1 1 1850 1890 female 0 1 2 3 4 5
2 2 1851 1891 male -1 0 1 2 3 4
3 3 1852 1892 female -2 -1 0 1 2 3
4 4 1853 1893 male -3 -2 -1 0 1 2
5 5 1854 1894 male -4 -3 -2 -1 0 1
6 6 1855 1895 male -5 -4 -3 -2 -1 0
Thanks in advance!
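To estimate the number of people alive by gender in every single year between 1850 and 1950, you can use table() and subset your df by year. Converting gender to a factor first makes table() report a zero count for a gender with nobody alive in a given year.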
df$gender <- as.factor(df$gender)
years <- 1850:1950
sapply(setNames(years, years), function(i) {
  table(df$gender[df$birth_year <= i & df$death_year >= i])
})
# 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863
#female 1 1 2 2 2 2 2 2 2 2 2 2 2 2
#male 0 1 1 2 3 4 4 4 4 4 4 4 4 4
# 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877
#female 2 2 2 2 2 2 2 2 2 2 2 2 2 2
#male 4 4 4 4 4 4 4 4 4 4 4 4 4 4
# 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891
#female 2 2 2 2 2 2 2 2 2 2 2 2 2 1
#male 4 4 4 4 4 4 4 4 4 4 4 4 4 4
# 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905
#female 1 0 0 0 0 0 0 0 0 0 0 0 0 0
#male 3 3 2 1 0 0 0 0 0 0 0 0 0 0
#...
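To spot-check individual columns of that table, the condition can be wrapped in a small helper. This is just a sketch using the question's example data; alive_in() is a hypothetical name introduced here, and it assumes a person counts as alive in both their birth year and their death year, as in the code above.

```r
# Example data from the question
df <- data.frame(
  id         = 1:6,
  birth_year = 1850:1855,
  death_year = 1890:1895,
  gender     = factor(c("female", "male", "female", "male", "male", "male"))
)

# Count people alive in one year, by gender (birth and death years inclusive)
alive_in <- function(data, year) {
  table(data$gender[data$birth_year <= year & data$death_year >= year])
}

alive_1853 <- alive_in(df, 1853)  # female 2, male 2
alive_1892 <- alive_in(df, 1892)  # female 1, male 3
```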
I'd like to get a summary of time series data where group is "Flare" and the max value of the FlareLength is the data of interest for that group.
If I have a dataframe, like this:
Date Flare FlareLength
1 2015-12-01 0 1
2 2015-12-02 0 2
3 2015-12-03 0 3
4 2015-12-04 0 4
5 2015-12-05 0 5
6 2015-12-06 0 6
7 2015-12-07 1 1
8 2015-12-08 1 2
9 2015-12-09 1 3
10 2015-12-10 1 4
11 2015-12-11 0 1
12 2015-12-12 0 2
13 2015-12-13 0 3
14 2015-12-14 0 4
15 2015-12-15 0 5
16 2015-12-16 0 6
17 2015-12-17 0 7
18 2015-12-18 0 8
19 2015-12-19 0 9
20 2015-12-20 0 10
21 2015-12-21 0 11
22 2016-01-11 1 1
23 2016-01-12 1 2
24 2016-01-13 1 3
25 2016-01-14 1 4
26 2016-01-15 1 5
27 2016-01-16 1 6
28 2016-01-17 1 7
29 2016-01-18 1 8
I'd like output like:
Date Flare FlareLength
1 2015-12-06 0 6
2 2015-12-10 1 4
3 2015-12-21 0 11
4 2016-01-18 1 8
I have tried various aggregate forms but I'm not very familiar with the time series wrinkle.
Using dplyr, we can create a grouping variable by comparing the FlareLength with the previous FlareLength value and select the row with maximum FlareLength in the group.
library(dplyr)
df %>%
  group_by(gr = cumsum(FlareLength < lag(FlareLength,
                                         default = first(FlareLength)))) %>%
  slice(which.max(FlareLength)) %>%
  ungroup() %>%
  select(-gr)
# A tibble: 4 x 3
# Date Flare FlareLength
# <fct> <int> <int>
#1 2015-12-06 0 6
#2 2015-12-10 1 4
#3 2015-12-21 0 11
#4 2016-01-18 1 8
In base R, ave() can do the same:
subset(df, FlareLength == ave(FlareLength, cumsum(c(TRUE, diff(FlareLength) < 0)),
                              FUN = max))
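To see how the grouping expression works, here is a small sketch on a made-up FlareLength vector (not the question's data). It assumes, as in the question, that FlareLength counts up within a run and drops back when a new run starts.

```r
# Made-up FlareLength values: three runs that each restart at 1
FlareLength <- c(1, 2, 3, 1, 2, 1, 2, 3, 4)

# A drop relative to the previous value marks the start of a new run
run_id <- cumsum(c(TRUE, diff(FlareLength) < 0))  # 1 1 1 2 2 3 3 3 3

# Largest FlareLength within each run
run_max <- tapply(FlareLength, run_id, max)       # 3 2 4
```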
Basically I want to:
If rows are duplicated on the combination of some specific columns, then keep only the row that has the lowest value on another column.
Example data (there's a lot more variance in my real data):
ID BilagNr Henstand Aftale Belob RP Pos Dps Udlign rykkedage
1 111 01-01-2017 1111 100 YA 1 1 10
1 122 02-01-2017 1222 100 YA 1 1 40
1 111 01-07-2017 1111 100 YA 1 1 100
2 222 01-01-2017 2121 299 YA 1 4 5
2 222 01-01-2017 2121 299 YA 1 4 98
2 212 01-05-2017 7654 299 BS 1 3
3 333 01-08-2017 7654 345 BS 2 45
4 444 01-01-2017 7654 345 BS 3 1 4 68
4 411 09-01-2017 7654 345 BS 1 4 43
5 555 01-01-2017 5555 700 BS 1 13
5 555 01-01-2017 5555 700 BS 1 67
6 666 01-01-2017 4720 100 BS 1 23
6 666 03-01-2017 1234 100 BS 2 1 23
6 666 07-08-2017 1234 120 BS 3 1 1 23
7 777 01-01-2017 1234 90 BS 1 1 23
7 777 01-01-2017 1234 90 BS 1 1 199
So I want to only keep these:
ID BilagNr Henstand Aftale Belob RP Pos Dps Udlign rykkedage
1 111 01-01-2017 1111 100 YA 1 1 10
1 122 02-01-2017 1222 100 YA 1 1 40
2 222 01-01-2017 2121 299 YA 1 4 5
2 212 01-05-2017 7654 299 BS 1 3
3 333 01-08-2017 7654 345 BS 2 45
4 444 01-01-2017 7654 345 BS 3 1 4 68
4 411 09-01-2017 7654 345 BS 1 4 43
5 555 01-01-2017 5555 700 BS 1 13
6 666 01-01-2017 4720 100 BS 1 23
6 666 03-01-2017 1234 100 BS 2 1 23
6 666 07-08-2017 1234 120 BS 3 1 1 23
7 777 01-01-2017 1234 90 BS 1 1 23
In other words:
If the rows are duplicated in a combination of the columns ID, BilagNr, Henstand, Aftale, Belob, RP, Pos, Dps, Udlign then keep only one of the duplicated rows and choose this from the condition that rykkedage has to be the smallest of the duplicated rows.
I hope it makes sense.
Furthermore, is it possible to add code that keeps duplicated rows that have the same value in rykkedage? I have a large dataset, and I'm not sure if this is even a problem.
Thank you!
We can group by 'ID', 'BilagNr', ..., 'Udlign', and then slice the rows with the index of the minimum value in 'rykkedage'
library(dplyr)
df1 %>%
  group_by(ID, BilagNr, Henstand, Aftale, Belob, RP, Pos, Dps, Udlign) %>%
  slice(which.min(rykkedage))
# A tibble: 13 x 10
# Groups: ID, BilagNr, Henstand, Aftale, Belob, RP, Pos, Dps, Udlign [13]
# ID BilagNr Henstand Aftale Belob RP Pos Dps Udlign rykkedage
# <int> <int> <chr> <int> <int> <chr> <int> <int> <int> <int>
# 1 1 111 01-01-2017 1111 100 YA 1 1 NA 10
# 2 1 111 01-07-2017 1111 100 YA 1 1 NA 100
# 3 1 122 02-01-2017 1222 100 YA 1 NA 1 40
# 4 2 212 01-05-2017 7654 299 BS 1 NA NA 3
# 5 2 222 01-01-2017 2121 299 YA 1 NA 4 5
# 6 3 333 01-08-2017 7654 345 BS 2 NA NA 45
# 7 4 411 09-01-2017 7654 345 BS 1 NA 4 43
# 8 4 444 01-01-2017 7654 345 BS 3 1 4 68
# 9 5 555 01-01-2017 5555 700 BS 1 NA NA 13
#10 6 666 01-01-2017 4720 100 BS 1 NA NA 23
#11 6 666 03-01-2017 1234 100 BS 2 NA 1 23
#12 6 666 07-08-2017 1234 120 BS 3 1 1 23
#13 7 777 01-01-2017 1234 90 BS 1 NA 1 23
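Regarding the follow-up about ties: which.min() returns only the first minimum, so rows that share the smallest rykkedage within a group are dropped. One possible way to keep them, sketched here in base R on a made-up three-column table (only ID, BilagNr and rykkedage, with a tie added for illustration), is to compare each value with its group minimum. In the dplyr pipeline, the analogous change would presumably be to replace slice(which.min(rykkedage)) with filter(rykkedage == min(rykkedage)).

```r
# Made-up data: the second group has two rows tied at the minimum
df <- data.frame(
  ID        = c(1, 1, 2, 2, 2),
  BilagNr   = c(111, 111, 222, 222, 222),
  rykkedage = c(10, 100, 5, 5, 98)
)

# One key per combination of the grouping columns
key <- paste(df$ID, df$BilagNr, sep = "_")

# Keep every row whose rykkedage equals the minimum of its group
keep   <- df$rykkedage == ave(df$rykkedage, key, FUN = min)
result <- df[keep, ]
```

Here rows 1, 3 and 4 are kept, so both tied rows of the second group survive.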
This is the fips data set:
State Fips State.Abbreviation ANSI.Code GU.Name
1 1 67 AL 2403054 Abbeville
2 1 73 AL 2403063 Adamsville
3 1 117 AL 2403069 Alabaster
4 1 95 AL 2403074 Albertville
5 1 123 AL 2403077 Alexander City
6 1 107 AL 2403080 Aliceville
7 1 39 AL 2403097 Andalusia
8 1 15 AL 2403101 Anniston
:
:
:
41774 51 720 VA 1498434 Norton
41775 51 730 VA 1498435 Petersburg
41776 51 735 VA 1498436 Poquoson
41777 51 740 VA 1498556 Portsmouth
41778 51 750 VA 1498438 Radford
41779 51 760 VA 1789073 Richmond
41780 51 770 VA 1498439 Roanoke
41781 51 775 VA 1789074 Salem
41782 51 790 VA 1789075 Staunton
41783 51 800 VA 1498560 Suffolk
41784 51 810 VA 1498559 Virginia Beach
41785 51 820 VA 1498443 Waynesboro
41786 51 830 VA 1789076 Williamsburg
41787 51 840 VA 1789077 Winchester
dim(fips)
[1] 2937 5
This is the head of the head.cancer data set:
PUBCSNUM REG MAR_STAT RACE1V NHIADE SEX FIPS Fips State State.Abbreviation
1 93261752 1544 2 15 0 1 3 3 34 NY
2 93264865 1544 2 1 0 1 15 15 34 NY
3 93268186 1544 2 1 0 1 5 5 34 NY
4 93272027 1544 2 1 0 2 17 17 34 NY
5 93274555 1544 1 1 0 1 13 13 34 NY
6 93275343 1544 5 1 0 2 25 25 34 NY
7 93279759 1544 5 1 0 2 9 9 34 NY
8 93280754 1544 2 1 0 2 35 35 34 NY
9 93281166 1544 2 1 0 2 31 31 34 NY
10 93282602 1544 5 1 0 1 33 33 34 NY
11 93287646 1544 1 1 0 1 11 11 34 NY
12 93288255 1544 4 1 4 1 39 39 34 NY
13 93290660 1544 9 1 0 2 25 25 34 NY
14 93291461 1544 1 1 6 1 39 39 34 NY
15 93291778 1544 2 1 0 1 3 3 34 NY
dim(head.cancer)
[1] 75313 10
When I merge them, I expect to get the same number of rows as head.cancer (75313), but I get 951423 rows.
Here is my code and output:
n = merge(head.cancer,fips, by=c('State','Fips','State.Abbreviation'), all.x= TRUE)
State Fips State.Abbreviation PUBCSNUM REG MAR_STAT RACE1V NHIADE SEX FIPS ANSI.Code GU.Name
1 6 5 CA 70128269 1541 4 1 0 2 5 2409693 Amador City
2 6 5 CA 70128269 1541 4 1 0 2 5 2411446 Plymouth
3 6 5 CA 70128269 1541 4 1 0 2 5 226085 Jackson
4 6 5 CA 70128269 1541 4 1 0 2 5 1675841 Amador
5 6 5 CA 70128269 1541 4 1 0 2 5 2418631 Ione Band of Miwok
6 6 5 CA 70128269 1541 4 1 0 2 5 2412019 Sutter Creek
7 6 5 CA 70128269 1541 4 1 0 2 5 2410110 Ione
8 6 5 CA 70128269 1541 4 1 0 2 5 2410128 Jackson
9 6 5 CA 67476209 1541 2 1 1 2 5 2409693 Amador City
10 6 5 CA 67476209 1541 2 1 1 2 5 2411446 Plymouth
11 6 5 CA 67476209 1541 2 1 1 2 5 226085 Jackson
12 6 5 CA 67476209 1541 2 1 1 2 5 1675841 Amador
13 6 5 CA 67476209 1541 2 1 1 2 5 2418631 Ione Band of Miwok
14 6 5 CA 67476209 1541 2 1 1 2 5 2412019 Sutter Creek
15 6 5 CA 67476209 1541 2 1 1 2 5 2410110 Ione
16 6 5 CA 67476209 1541 2 1 1 2 5 2410128 Jackson
17 6 5 CA 56544761 1541 4 1 0 2 5 2409693 Amador City
18 6 5 CA 56544761 1541 4 1 0 2 5 2411446 Plymouth
19 6 5 CA 56544761 1541 4 1 0 2 5 226085 Jackson
20 6 5 CA 56544761 1541 4 1 0 2 5 1675841 Amador
dim(n)
[1] 951423 12
In the first 8 rows, the same "PUBCSNUM" is repeated 8 times. "PUBCSNUM" is the ID, so it should be unique, and each row is supposed to get only one "ANSI.Code" value, but now there are many. I don't know why the rows are duplicated like that.
Please help me. I have been stuck for a couple of hours and couldn't figure it out. Thanks
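The output above suggests that fips holds several rows for the same State/Fips/State.Abbreviation combination (one per GU.Name), in which case merge() returns one row per matching pair. A minimal sketch of that effect, using three of the ANSI codes shown above and only a few columns; keeping the first fips row per key is an arbitrary choice made purely for illustration, not a recommendation for the real data.

```r
keys <- c("State", "Fips", "State.Abbreviation")

# Two patients from the same county (values taken from the output above)
head_cancer <- data.frame(PUBCSNUM = c(70128269, 67476209),
                          State = 6, Fips = 5, State.Abbreviation = "CA")

# Three fips rows sharing the same key
fips <- data.frame(State = 6, Fips = 5, State.Abbreviation = "CA",
                   ANSI.Code = c(2409693, 2411446, 226085),
                   GU.Name   = c("Amador City", "Plymouth", "Jackson"))

# Each patient matches all three fips rows: 2 x 3 = 6 rows
n_all <- merge(head_cancer, fips, by = keys, all.x = TRUE)

# Check whether the key is unique in fips
has_dup_keys <- anyDuplicated(fips[keys]) > 0

# With one fips row per key, the row count matches head_cancer again
fips_unique <- fips[!duplicated(fips[keys]), ]
n_one <- merge(head_cancer, fips_unique, by = keys, all.x = TRUE)
```

If each patient really should map to a single place, the merge would need an extra key column that identifies it, or fips would have to be reduced to one row per county beforehand.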