Full Join in dplyr - r

I have a dataframe looking like:
library(tidyverse)
df <- tibble::tribble(
~sub_date, ~period,
"2019-01", 1,
"2019-01", 2,
"2019-01", 3,
"2019-02", 1,
"2019-02", 2,
"2019-03", 1,
"2019-03", 2,
"2019-03", 3,
"2019-03", 4
)
sub_date period
<chr> <dbl>
1 2019-01 1
2 2019-01 2
3 2019-01 3
4 2019-02 1
5 2019-02 2
6 2019-03 1
7 2019-03 2
8 2019-03 3
9 2019-03 4
and another:
period <- tibble::tribble(
~period, ~forecast,
1, 10,
2, 20,
3, 30,
4, 40,
5, 50,
6, 60,
7, 70
)
period forecast
<dbl> <dbl>
1 1 10
2 2 20
3 3 30
4 4 40
5 5 50
6 6 60
7 7 70
I am struggling to join them so that, for every sub_date in df, I get all of the periods (and their forecasts) from period, i.e. the result should have the number of rows in period times the number of distinct sub_date values in df,
as follows:
df_output <- tibble::tribble(
~sub_date, ~period, ~forecast,
"2019-01", 1, 10,
"2019-01", 2, 20,
"2019-01", 3, 30,
"2019-01", 4, 40,
"2019-01", 5, 50,
"2019-01", 6, 60,
"2019-01", 7, 70,
"2019-02", 1, 10,
"2019-02", 2, 20,
"2019-02", 3, 30,
"2019-02", 4, 40,
"2019-02", 5, 50,
"2019-02", 6, 60,
"2019-02", 7, 70,
"2019-03", 1, 10,
"2019-03", 2, 20,
"2019-03", 3, 30,
"2019-03", 4, 40,
"2019-03", 5, 50,
"2019-03", 6, 60,
"2019-03", 7, 70
)
# A tibble: 21 x 3
sub_date period forecast
<chr> <dbl> <dbl>
1 2019-01 1 10
2 2019-01 2 20
3 2019-01 3 30
4 2019-01 4 40
5 2019-01 5 50
6 2019-01 6 60
7 2019-01 7 70
8 2019-02 1 10
9 2019-02 2 20
10 2019-02 3 30
# … with 11 more rows
I assumed it was a full join but I don't get the desired result.
Any help?

You can use tidyr::crossing to obtain your desired result:
crossing(select(df, sub_date), period)
Note that you are not looking for a join, since you want every sub_date combined (or crossed) with every combination of period and forecast.
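If you prefer to stay with join-style verbs, newer dplyr (1.1.0 or later, which is an assumption about your installed version) has an explicit cross join that produces the same 21 rows:
library(dplyr)
cross_join(distinct(df, sub_date), period) %>%
  arrange(sub_date, period)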

You can try to merge the tables. Try this to see if it gives you what you need:
df <- df %>% distinct(sub_date)
answer <- merge(period, df, all = TRUE)
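A variant of the same idea that avoids overwriting df, plus a quick check of the result (a sketch; it assumes the tibbles defined in the question and that dplyr is loaded):
answer <- merge(period, distinct(df, sub_date))  # no shared columns, so merge() returns the Cartesian product
nrow(answer)                                     # 21 = 7 periods x 3 sub_dates
arrange(answer, sub_date, period)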


Why can't I plot my model selection plot correctly?

I am working with multiple regression models. After running the dredge function, I got approximately 54 000 different combinations. I selected the first 300 models and ran this code:
par(mar=c(1,4,10,3))
plot(fitt, labels = c("Intercept",
                      "YOFE",
                      "'RW Closeness'",
                      "'LW Closeness'",
                      "Age",
                      "SES",
                      "'GAD-7 Score'",
                      "Fantasy",
                      "'Personal Distress'",
                      "'Empathic Concern'",
                      "'Perspective Taking'",
                      "'PHQ-9 Score'",
                      "'Religious Affinity'",
                      "'Agreement with IH'",
                      "'Moral Judgement of IH'",
                      "'Harm Assessment of IH'",
                      "'Agreement with IB'",
                      "'Moral Judgement of IB'",
                      "RMET",
                      "Sex"),
     ylab = expression("Cumulative" ~ italic(w[i]*(AICc))),
     col = c(colfunc(1)), border = "gray30", labAsExpr = TRUE)
10 minutes later, I got this error:
Error in (function (text, side = 3, line = 0, outer = FALSE, at = NA, :
zero-length 'text' specified
In addition: Warning message:
In max(strwidth(arg[["text"]], cex = arg$cex, units = "in")) :
no non-missing arguments to max; returning -Inf
And this is the output plot:
I've tried plotting only the first model and the same error appears:
This also happens when using the whole model selection table (54 000 combinations).
What is a solution to this?
I'm running the latest versions of R and RStudio on my 2016 12-inch MacBook.
Note: I've tried increasing the plot-window size manually by dragging the edges without any improvement.
This is what I'd like my plot to look like:
EDIT: Here is the data file and the code.
modeloglobal<-lm(PROMEDIO_CREENCIA_NFALSA_CORONAVIRUS~Edad+Sex+
AnEdu+
Estrato_1+
GAD_TOTAL+
PHQ_TOTAL+
PracticRel_2+
CercanPolDer_1+
CercanPolIz_1+
RMET_TOTAL+
IRI_PREOCUPACIÓN_EMPATICA+
IRI_FANTASÍA+
IRI_MALESTAR_PERSONAL+
IRI_TOMA_DE_PERSPECTIVA+
PROMEDIO_DILEMAS_BI_ACTUARIGUAL_CORONAVIRUS+
PROMEDIO_DILEMAS_BI_BIENOMAL_CORONAVIRUS+
PROMEDIO_DI_SINPOL_ACTUARIGUAL+
PROMEDIO_DI_SINPOL_BIENOMAL+
PROMEDIO_DI_SINPOL_DANO, data=fake_news,na.action="na.fail")
library(MuMIn)
fitt<-dredge(modeloglobal,trace=2)
m.sel <- model.sel(fitt)
m.sel2 <- m.sel[1:300,]
library(binovisualfields)
And here is the code that produces the error (using the subset of the first 300 rows):
par(mar=c(1,4,10,3))
plot(m.sel2, labels = c("Intercept",
                        "YOFE",
                        "'RW Closeness'",
                        "'LW Closeness'",
                        "Age",
                        "SES",
                        "'GAD-7 Score'",
                        "Fantasy",
                        "'Personal Distress'",
                        "'Empathic Concern'",
                        "'Perspective Taking'",
                        "'PHQ-9 Score'",
                        "'Religious Affinity'",
                        "'Agreement with IH'",
                        "'Moral Judgement of IH'",
                        "'Harm Assessment of IH'",
                        "'Agreement with IB'",
                        "'Moral Judgement of IB'",
                        "RMET",
                        "Sex"),
     ylab = expression("Cumulative" ~ italic(w[i]*(AICc))),
     col = c(colfunc(1)), border = "gray30", labAsExpr = TRUE)
EDIT 2: Here is a preview of the data frame (the raw dput() output follows below).
ResponseId Edad Sex Genero Nacion Resid Estrato_1 Gastos salud
1 R_25GEak825Ohmb9G 18 Female Femenino Colombia Colombia 7 Seguro privado
2 R_1kT7u0PALDHV8H6 20 Female Femenino Colombia Colombia 5 Seguro privado
3 R_2cpBb5Ifzj7lVGs 21 Female Femenino Colombia Colombia 6 Seguro privado
4 R_sGqNUMTXTJzwC09 20 Male Masculino Colombia Colombia 5 Seguro del Estado
5 R_2Cpixt9Z5FJkhg1 36 Male Masculino Colombia Colombia 6 Otro (especifique)
6 R_3QFq50SZNs6CePA 18 Female Femenino Colombia Colombia 7 Seguro privado
Relig PracticRel_2 AnEdu Q161 Ecron Epsiq Q183 Eneu Q184
1 Ninguna 0 15 Estudiante 1 0 <NA> 0 <NA>
2 Cristianismo (Catolicismo) 2 15 Estudiante 0 0 <NA> 0 <NA>
3 Cristianismo (Catolicismo) 2 19 Estudiante 0 0 <NA> 0 <NA>
4 Cristianismo (Catolicismo) 2 15 Estudiante 0 0 <NA> 0 <NA>
5 Cristianismo (Catolicismo) 1 17 Empleado de tiempo completo 0 0 <NA> 0 <NA>
6 Cristianismo (Catolicismo) 4 15 Estudiante 0 0 <NA> 0 <NA>
NPviven Sustancias Pviven AdhAS LevantarAS_1 CumplimAS_1 HorasFuera
1 1 1 Padres 1 5 6 Menos de una hora
2 3 0 Padres,Hermanos 1 1 6 Menos de una hora
3 4 0 Padres,Hermanos 1 2 6 Menos de una hora
4 4 0 Padres,Hermanos 1 2 6 Menos de una hora
5 3 0 Pareja,Hijos 1 2 3 Entre cuatro y seis horas
6 3 0 Padres,Hermanos 1 2 6 Entre una y tres horas
Apoyo CV19_1 ContagUd ContagEC Prob_1_contagio Prob_2_familiar_contagio
1 1 No 0 81 100
2 4 No 0 81 35
3 6 No 0 60 80
4 4 No 0 4 15
5 5 No 0 40 40
6 6 No 0 79 86
Prob_3_contagio_poblaciongeneral Caract_1 Caract_2 Inv_3 Caract_3 Caract_4 Caract_5 Caract_6 Caract_8
1 87 4 2 1 6 4 5 4 5
2 81 5 4 3 4 4 5 2 3
3 80 4 4 1 6 6 6 1 2
4 20 6 5 5 2 1 5 1 5
5 60 2 1 2 5 4 3 2 3
6 70 5 4 2 5 6 2 5 6
Caract_9 Caract_11 Caract_14 INV_15 Caract_15 Caract_16 Caract_17 CompPan_1 CompPan_2 CompPan_3
1 5 3 2 4 3 5 5 1 6 1
2 4 5 4 5 2 3 3 4 5 8
3 6 1 6 6 1 6 6 1 1 1
4 5 5 2 6 1 3 1 1 3 2
5 4 1 1 5 2 2 2 2 2 2
6 6 2 3 5 2 6 5 2 7 3
CompPan_4 CompPan_5 CompPan_6 CercanPolDer_1 CercanPolIz_1 IDpol_1 PHQ_TOTAL GAD_TOTAL
1 5 5 7 8 2 5 8 6
2 8 8 8 7 3 5 4 3
3 3 2 4 6 3 4 2 3
4 4 3 3 5 5 4 3 3
5 3 3 2 5 5 4 2 2
6 6 2 7 3 8 3 7 7
INTEROCEPCION_TOTAL BIS BAS_FUN_SEEKING BAS_REWARD_RESPONSIVENESS BAS_DRIVE BAS_TOTAL
1 45 19 14 19 11 44
2 44 20 10 17 14 41
3 24 17 10 19 13 42
4 17 17 9 14 8 31
5 36 21 10 17 11 38
6 41 25 6 17 13 36
IRI_TOMA_DE_PERSPECTIVA IRI_MALESTAR_PERSONAL IRI_FANTASÍA IRI_PREOCUPACIÓN_EMPATICA RMET_TOTAL
1 14 13 14 19 7
2 18 11 14 20 4
3 17 4 10 20 10
4 16 9 11 12 7
5 10 11 7 10 10
6 16 11 16 18 8
PROMEDIO_TIEMPO_REACCION_RMET PROMEDIO_CREENCIA_NFALSA_TODAS PROMEDIO_CREENCIA_NFALSA_CORONAVIRUS
1 2.411750 2.8 2.666667
2 3.348500 2.8 2.333333
3 3.261083 2.4 2.000000
4 6.390500 2.2 1.666667
5 13.212667 1.8 1.333333
6 4.218583 3.6 2.666667
PROMEDIO_CREENCIA_NFALSA_OTRO PROMEDIO_TIEMPOREACCION_NFALSA PROMEDIO_CREENCIA_NVERDADERA_TODAS
1 3.0 4.3438 3.333333
2 3.5 9.4222 3.000000
3 3.0 5.9734 3.666667
4 3.0 10.1448 2.666667
5 2.5 16.3196 1.333333
6 5.0 7.1954 3.333333
PROMEDIO_CREENCIA_NVERDADERA_CORONAVIRUS PROMEDIO_CREENCIA_NVERDADERA_OTRO
1 5 5
2 4 5
3 6 5
4 5 3
5 1 3
6 6 4
PROMEDIO_TIEMPOREACCION_NVERDADERA PROMEDIO_CREENCIA_NMISLEADING_TODAS
1 5.6440 2.666667
2 7.0430 2.666667
3 8.0265 3.666667
4 4.0495 3.000000
5 32.2400 1.666667
6 9.5830 4.333333
PROMEDIO_TIEMPOREACCION_NMISLEADING PROMEDIO_DILEMAS_BI_BIENOMAL_CORONAVIRUS
1 5.726667 1.000000
2 12.012333 4.000000
3 5.753000 4.333333
4 4.969667 1.333333
5 15.233000 0.000000
6 30.045667 3.666667
PROMEDIO_DILEMAS_BI_ACTUARIGUAL_CORONAVIRUS DILEMA_BI_CONTROL_BIENOMAL DILEMA_BI_CONTROL_ACTUARIGUAL
1 5.666667 4 7
2 7.666667 5 4
3 9.666667 2 6
4 4.333333 0 2
5 3.666667 -3 2
6 9.333333 4 10
PROMEDIO_DILEMAS_BI_BIENOMAL_JUNTOS PROMEDIO_DILEMAS_BI_ACTUARIGUAL_JUNTOS
1 1.75 6.00
2 4.25 6.75
3 3.75 8.75
4 1.00 3.75
5 -0.75 3.25
6 3.75 9.50
PROMEDIO_DILEMAS_DI_BIENOMAL PROMEDIO_DILEMAS_DI_ACTUARIGUAL PROMEDIO_DILEMAS_DI_DANO
1 0.5000000 6.666667 5.666667
2 1.8333333 7.666667 6.166667
3 0.5000000 5.666667 5.333333
4 1.6666667 5.000000 5.500000
5 0.8333333 4.833333 5.666667
6 0.1666667 5.166667 7.000000
TIEMPOREACCION_DILEMAS_DI TIEMPOREACCION_DILEMAS_BI PROMEDIO_DI_SINPOL_BIENOMAL
1 12.140500 7.89900 0.2
2 9.130667 9.99550 1.2
3 6.998333 9.25175 -1.0
4 1.857833 2.84125 0.4
5 19.014333 32.82850 0.8
6 11.633667 16.92000 0.2
PROMEDIO_DI_SINPOL_ACTUARIGUAL PROMEDIO_DI_SINPOL_DANO COMPRAS_COVID19 PERCEPCION_RIESGO_TOTAL
1 7.00 7.25 4.166667 39
2 8.00 6.75 6.833333 37
3 4.25 7.25 2.000000 42
4 4.50 7.00 2.666667 38
5 5.00 7.75 2.333333 26
6 5.50 7.75 4.500000 46
PERCEPCION_RIESGO_INDICE PROB_CONTAGIO_TOTAL PROMEDIO_DILEMASPOLITICOS_BIENOMAL
1 3.9 89.33333 1.0
2 3.7 65.66667 2.5
3 4.2 73.33333 4.0
4 3.8 13.00000 4.0
5 2.6 46.66667 0.5
6 4.6 78.33333 0.0
PROMEDIO_DILEMASPOLITICOS_ACTUARIGUAL PROMEDIO_DILEMASPOLITICOS_DANO D31_1_DI D32_2_DI D33_3_DI
1 6.0 2.5 -2 4 9
2 7.0 5.0 3 9 7
3 8.5 1.5 -3 3 8
4 6.0 2.5 0 3 8
5 4.5 1.5 -2 4 8
6 4.5 5.5 4 9 7
D41_1_DI D42_2_DI D43_3_DI D51_1_DI D52_2_DI D53_3_DI D61_1_DI D62_2_DI D63_3_DI D71_1_DIP D72_2_DIP
1 -1 7 7 5 10 4 -1 7 9 0 4
2 1 8 9 0 7 4 2 8 7 3 7
3 0 6 7 1 5 6 -3 3 8 3 7
4 0 5 8 4 7 3 -2 3 9 4 3
5 3 7 9 1 3 7 2 6 7 -2 2
6 1 8 6 0 4 9 -4 1 9 -4 1
D73_3_DIP D81_1_DIP D82_2_DIP D83_3_DIP D91_1_BI D92_2_BI D101_1_BI D102_2_BI D111_1_BI D112_2_BI
1 3 2 8 2 -3 4 3 9 3 4
2 6 2 7 4 3 8 5 8 4 7
3 2 5 10 1 5 10 5 10 3 9
4 2 4 9 3 4 9 0 2 0 2
5 2 3 7 1 -1 3 3 6 -2 2
6 8 4 8 3 4 9 5 10 2 9
D121_1_BI D122_2_BI total_iri promedio_falsaymisleading prediccioncompraspercprob
1 4 7 60 2.750 4.249759
2 5 4 63 2.750 4.404450
3 2 6 51 2.875 4.431635
4 0 2 48 2.500 5.143974
5 -3 2 38 1.750 3.765907
6 4 10 61 3.875 4.893797
prediccioncomprasperc
1 4.474456
2 4.439994
3 4.521980
4 4.689385
5 3.762449
6 4.967286
Here is the raw dput() output:
structure(list(ResponseId = c("R_25GEak825Ohmb9G", "R_1kT7u0PALDHV8H6",
"R_2cpBb5Ifzj7lVGs", "R_sGqNUMTXTJzwC09", "R_2Cpixt9Z5FJkhg1",
"R_3QFq50SZNs6CePA"), Edad = c(18, 20, 21, 20, 36, 18), Sex = structure(c(2L,
2L, 2L, 1L, 1L, 2L), .Label = c("Male", "Female"), class = "factor"),
Genero = c("Femenino", "Femenino", "Femenino", "Masculino",
"Masculino", "Femenino"), Nacion = c("Colombia", "Colombia",
"Colombia", "Colombia", "Colombia", "Colombia"), Resid = c("Colombia",
"Colombia", "Colombia", "Colombia", "Colombia", "Colombia"
), Estrato_1 = c(7, 5, 6, 5, 6, 7), `Gastos salud` = c("Seguro privado",
"Seguro privado", "Seguro privado", "Seguro del Estado",
"Otro (especifique)", "Seguro privado"), Relig = c("Ninguna",
"Cristianismo (Catolicismo)", "Cristianismo (Catolicismo)",
"Cristianismo (Catolicismo)", "Cristianismo (Catolicismo)",
"Cristianismo (Catolicismo)"), PracticRel_2 = c(0, 2, 2,
2, 1, 4), AnEdu = c(15, 15, 19, 15, 17, 15), Q161 = c("Estudiante",
"Estudiante", "Estudiante", "Estudiante", "Empleado de tiempo completo",
"Estudiante"), Ecron = c(1, 0, 0, 0, 0, 0), Epsiq = c(0,
0, 0, 0, 0, 0), Q183 = c(NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_), Eneu = c(0,
0, 0, 0, 0, 0), Q184 = c(NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_), NPviven = c("1",
"3", "4", "4", "3", "3"), Sustancias = c(1, 0, 0, 0, 0, 0
), Pviven = c("Padres", "Padres,Hermanos", "Padres,Hermanos",
"Padres,Hermanos", "Pareja,Hijos", "Padres,Hermanos"), AdhAS = c(1,
1, 1, 1, 1, 1), LevantarAS_1 = c(5, 1, 2, 2, 2, 2), CumplimAS_1 = c(6,
6, 6, 6, 3, 6), HorasFuera = c("Menos de una hora", "Menos de una hora",
"Menos de una hora", "Menos de una hora", "Entre cuatro y seis horas",
"Entre una y tres horas"), `Apoyo CV19_1` = c(1, 4, 6, 4,
5, 6), ContagUd = c("No", "No", "No", "No", "No", "No"),
ContagEC = c(0, 0, 0, 0, 0, 0), Prob_1_contagio = c(81, 81,
60, 4, 40, 79), Prob_2_familiar_contagio = c(100, 35, 80,
15, 40, 86), Prob_3_contagio_poblaciongeneral = c(87, 81,
80, 20, 60, 70), Caract_1 = c(4, 5, 4, 6, 2, 5), Caract_2 = c(2,
4, 4, 5, 1, 4), Inv_3 = c(1, 3, 1, 5, 2, 2), Caract_3 = c(6,
4, 6, 2, 5, 5), Caract_4 = c(4, 4, 6, 1, 4, 6), Caract_5 = c(5,
5, 6, 5, 3, 2), Caract_6 = c(4, 2, 1, 1, 2, 5), Caract_8 = c(5,
3, 2, 5, 3, 6), Caract_9 = c(5, 4, 6, 5, 4, 6), Caract_11 = c(3,
5, 1, 5, 1, 2), Caract_14 = c(2, 4, 6, 2, 1, 3), INV_15 = c(4,
5, 6, 6, 5, 5), Caract_15 = c(3, 2, 1, 1, 2, 2), Caract_16 = c(5,
3, 6, 3, 2, 6), Caract_17 = c(5, 3, 6, 1, 2, 5), CompPan_1 = c(1,
4, 1, 1, 2, 2), CompPan_2 = c(6, 5, 1, 3, 2, 7), CompPan_3 = c(1,
8, 1, 2, 2, 3), CompPan_4 = c(5, 8, 3, 4, 3, 6), CompPan_5 = c(5,
8, 2, 3, 3, 2), CompPan_6 = c(7, 8, 4, 3, 2, 7), CercanPolDer_1 = c(8,
7, 6, 5, 5, 3), CercanPolIz_1 = c(2, 3, 3, 5, 5, 8), IDpol_1 = c(5,
5, 4, 4, 4, 3), PHQ_TOTAL = c(8, 4, 2, 3, 2, 7), GAD_TOTAL = c(6,
3, 3, 3, 2, 7), INTEROCEPCION_TOTAL = c(45, 44, 24, 17, 36,
41), BIS = c(19, 20, 17, 17, 21, 25), BAS_FUN_SEEKING = c(14,
10, 10, 9, 10, 6), BAS_REWARD_RESPONSIVENESS = c(19, 17,
19, 14, 17, 17), BAS_DRIVE = c(11, 14, 13, 8, 11, 13), BAS_TOTAL = c(44,
41, 42, 31, 38, 36), IRI_TOMA_DE_PERSPECTIVA = c(14, 18,
17, 16, 10, 16), IRI_MALESTAR_PERSONAL = c(13, 11, 4, 9,
11, 11), IRI_FANTASÍA = c(14, 14, 10, 11, 7, 16), IRI_PREOCUPACIÓN_EMPATICA = c(19,
20, 20, 12, 10, 18), RMET_TOTAL = c(7, 4, 10, 7, 10, 8),
PROMEDIO_TIEMPO_REACCION_RMET = c(2.41175, 3.3485, 3.26108333333333,
6.3905, 13.2126666666667, 4.21858333333333), PROMEDIO_CREENCIA_NFALSA_TODAS = c(2.8,
2.8, 2.4, 2.2, 1.8, 3.6), PROMEDIO_CREENCIA_NFALSA_CORONAVIRUS = c(2.66666666666667,
2.33333333333333, 2, 1.66666666666667, 1.33333333333333,
2.66666666666667), PROMEDIO_CREENCIA_NFALSA_OTRO = c(3, 3.5,
3, 3, 2.5, 5), PROMEDIO_TIEMPOREACCION_NFALSA = c(4.3438,
9.4222, 5.9734, 10.1448, 16.3196, 7.1954), PROMEDIO_CREENCIA_NVERDADERA_TODAS = c(3.33333333333333,
3, 3.66666666666667, 2.66666666666667, 1.33333333333333,
3.33333333333333), PROMEDIO_CREENCIA_NVERDADERA_CORONAVIRUS = c(5,
4, 6, 5, 1, 6), PROMEDIO_CREENCIA_NVERDADERA_OTRO = c(5,
5, 5, 3, 3, 4), PROMEDIO_TIEMPOREACCION_NVERDADERA = c(5.644,
7.043, 8.0265, 4.0495, 32.24, 9.583), PROMEDIO_CREENCIA_NMISLEADING_TODAS = c(2.66666666666667,
2.66666666666667, 3.66666666666667, 3, 1.66666666666667,
4.33333333333333), PROMEDIO_TIEMPOREACCION_NMISLEADING = c(5.72666666666667,
12.0123333333333, 5.753, 4.96966666666667, 15.233, 30.0456666666667
), PROMEDIO_DILEMAS_BI_BIENOMAL_CORONAVIRUS = c(1, 4, 4.33333333333333,
1.33333333333333, 0, 3.66666666666667), PROMEDIO_DILEMAS_BI_ACTUARIGUAL_CORONAVIRUS = c(5.66666666666667,
7.66666666666667, 9.66666666666667, 4.33333333333333, 3.66666666666667,
9.33333333333333), DILEMA_BI_CONTROL_BIENOMAL = c(4, 5, 2,
0, -3, 4), DILEMA_BI_CONTROL_ACTUARIGUAL = c(7, 4, 6, 2,
2, 10), PROMEDIO_DILEMAS_BI_BIENOMAL_JUNTOS = c(1.75, 4.25,
3.75, 1, -0.75, 3.75), PROMEDIO_DILEMAS_BI_ACTUARIGUAL_JUNTOS = c(6,
6.75, 8.75, 3.75, 3.25, 9.5), PROMEDIO_DILEMAS_DI_BIENOMAL = c(0.5,
1.83333333333333, 0.5, 1.66666666666667, 0.833333333333333,
0.166666666666667), PROMEDIO_DILEMAS_DI_ACTUARIGUAL = c(6.66666666666667,
7.66666666666667, 5.66666666666667, 5, 4.83333333333333,
5.16666666666667), PROMEDIO_DILEMAS_DI_DANO = c(5.66666666666667,
6.16666666666667, 5.33333333333333, 5.5, 5.66666666666667,
7), TIEMPOREACCION_DILEMAS_DI = c(12.1405, 9.13066666666666,
6.99833333333333, 1.85783333333333, 19.0143333333333, 11.6336666666667
), TIEMPOREACCION_DILEMAS_BI = c(7.899, 9.9955, 9.25175,
2.84125, 32.8285, 16.92), PROMEDIO_DI_SINPOL_BIENOMAL = c(0.2,
1.2, -1, 0.4, 0.8, 0.2), PROMEDIO_DI_SINPOL_ACTUARIGUAL = c(7,
8, 4.25, 4.5, 5, 5.5), PROMEDIO_DI_SINPOL_DANO = c(7.25,
6.75, 7.25, 7, 7.75, 7.75), COMPRAS_COVID19 = c(4.16666666666667,
6.83333333333333, 2, 2.66666666666667, 2.33333333333333,
4.5), PERCEPCION_RIESGO_TOTAL = c(39, 37, 42, 38, 26, 46),
PERCEPCION_RIESGO_INDICE = c(3.9, 3.7, 4.2, 3.8, 2.6, 4.6
), PROB_CONTAGIO_TOTAL = c(89.3333333333333, 65.6666666666667,
73.3333333333333, 13, 46.6666666666667, 78.3333333333333),
PROMEDIO_DILEMASPOLITICOS_BIENOMAL = c(1, 2.5, 4, 4, 0.5,
0), PROMEDIO_DILEMASPOLITICOS_ACTUARIGUAL = c(6, 7, 8.5,
6, 4.5, 4.5), PROMEDIO_DILEMASPOLITICOS_DANO = c(2.5, 5,
1.5, 2.5, 1.5, 5.5), D31_1_DI = c(-2, 3, -3, 0, -2, 4), D32_2_DI = c(4,
9, 3, 3, 4, 9), D33_3_DI = c(9, 7, 8, 8, 8, 7), D41_1_DI = c(-1,
1, 0, 0, 3, 1), D42_2_DI = c(7, 8, 6, 5, 7, 8), D43_3_DI = c(7,
9, 7, 8, 9, 6), D51_1_DI = c(5, 0, 1, 4, 1, 0), D52_2_DI = c(10,
7, 5, 7, 3, 4), D53_3_DI = c(4, 4, 6, 3, 7, 9), D61_1_DI = c(-1,
2, -3, -2, 2, -4), D62_2_DI = c(7, 8, 3, 3, 6, 1), D63_3_DI = c(9,
7, 8, 9, 7, 9), D71_1_DIP = c(0, 3, 3, 4, -2, -4), D72_2_DIP = c(4,
7, 7, 3, 2, 1), D73_3_DIP = c(3, 6, 2, 2, 2, 8), D81_1_DIP = c(2,
2, 5, 4, 3, 4), D82_2_DIP = c(8, 7, 10, 9, 7, 8), D83_3_DIP = c(2,
4, 1, 3, 1, 3), D91_1_BI = c(-3, 3, 5, 4, -1, 4), D92_2_BI = c(4,
8, 10, 9, 3, 9), D101_1_BI = c(3, 5, 5, 0, 3, 5), D102_2_BI = c(9,
8, 10, 2, 6, 10), D111_1_BI = c(3, 4, 3, 0, -2, 2), D112_2_BI = c(4,
7, 9, 2, 2, 9), D121_1_BI = c(4, 5, 2, 0, -3, 4), D122_2_BI = c(7,
4, 6, 2, 2, 10), total_iri = c(60, 63, 51, 48, 38, 61), promedio_falsaymisleading = c(2.75,
2.75, 2.875, 2.5, 1.75, 3.875), prediccioncompraspercprob = c(`1` = 4.24975892576113,
`2` = 4.40445037029013, `3` = 4.43163539588384, `4` = 5.14397435590305,
`5` = 3.76590707825915, `6` = 4.8937968160894), prediccioncomprasperc = c(`1` = 4.47445595202732,
`2` = 4.4399943212902, `3` = 4.52198006754018, `4` = 4.68938453833302,
`5` = 3.7624488758014, `6` = 4.96728571465517)), row.names = c(NA,
6L), class = c("tbl_df", "tbl", "data.frame"))

Can I create many categories of one variable based on two other conditions in R? [duplicate]

This question already has answers here:
How collect additional row data on binned data in R
(1 answer)
Group value in range r
(3 answers)
Closed 3 years ago.
I am doing a statistical analysis on a big data frame (more than 48,000,000 rows) in R. Here is an example of the data:
structure(list(herd = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3), cows = c(1, 2,
3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3, 4,
5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3, 4, 5, 6,
7, 8, 9, 10, 11, 12, 13, 14, 15, 16), `date` = c("11/03/2013",
"12/03/2013", "13/03/2013", "14/03/2013", "15/03/2013", "16/03/2013",
"13/05/2012", "14/05/2012", "15/05/2012", "16/05/2012", "17/05/2012",
"18/05/2012", "10/07/2016", "11/07/2016", "12/07/2016", "13/07/2016",
"11/03/2013", "12/03/2013", "13/03/2013", "14/03/2013", "15/03/2013",
"16/03/2013", "13/05/2012", "14/05/2012", "15/05/2012", "16/05/2012",
"17/05/2012", "18/05/2012", "10/07/2016", "11/07/2016", "12/07/2016",
"13/07/2016", "11/03/2013", "12/03/2013", "13/03/2013", "14/03/2013",
"15/03/2013", "16/03/2013", "13/05/2012", "14/05/2012", "15/05/2012",
"16/05/2012", "17/05/2012", "18/05/2012", "10/07/2016", "11/07/2016",
"12/07/2016", "13/07/2016"), glicose = c(240666, 23457789, 45688688,
679, 76564, 6574553, 78654, 546432, 76455643, 6876, 7645432,
876875, 98654, 453437, 98676, 9887554, 76543, 9775643, 986545,
240666, 23457789, 45688688, 679, 76564, 6574553, 78654, 546432,
76455643, 6876, 7645432, 876875, 98654, 453437, 98676, 9887554,
76543, 9775643, 986545, 240666, 23457789, 45688688, 679, 76564,
6574553, 78654, 546432, 76455643, 6876)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -48L))
I need to identify how many cows fall into each of the following categories of glicose, by herd and by date:
<=100000
>100000 and <=150000
>150000 and <=200000
>200000 and <=250000
>250000 and <=400000
>400000
I tried to use the functions filter() and select() but could not categorize the variable like that.
I also tried to make a vector for each category, but it did not work:
ht <- df %>% group_by(herd, date) %>%
filter(glicose < 100000)
Actually I do not have a clue of how I could do this. Please help!
I expect to get the number of cows in each category of each herd based on each date in a table like this:
Calling your data df,
df %>%
mutate(glicose_group = cut(glicose, breaks = c(0, seq(1e5, 2.5e5, by = 0.5e5), 4e5, Inf)),
date = as.Date(date, format = "%d/%m/%Y")) %>%
group_by(herd, date, glicose_group) %>%
count
# # A tibble: 48 x 4
# # Groups: herd, date, glicose_group [48]
# herd date glicose_group n
# <dbl> <date> <fct> <int>
# 1 1 2012-05-13 (0,1e+05] 1
# 2 1 2012-05-14 (4e+05,Inf] 1
# 3 1 2012-05-15 (4e+05,Inf] 1
# 4 1 2012-05-16 (0,1e+05] 1
# 5 1 2012-05-17 (4e+05,Inf] 1
# 6 1 2012-05-18 (4e+05,Inf] 1
# 7 1 2013-03-11 (2e+05,2.5e+05] 1
# 8 1 2013-03-12 (4e+05,Inf] 1
# 9 1 2013-03-13 (4e+05,Inf] 1
# 10 1 2013-03-14 (0,1e+05] 1
# # ... with 38 more rows
I also threw in a conversion to Date class, which is probably a good idea.
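If you also want the counts laid out with one column per glicose category (roughly the table you sketched), a possible follow-up, assuming tidyr 1.1 or newer for pivot_wider():
library(tidyr)
df %>%
  mutate(glicose_group = cut(glicose, breaks = c(0, seq(1e5, 2.5e5, by = 0.5e5), 4e5, Inf)),
         date = as.Date(date, format = "%d/%m/%Y")) %>%
  count(herd, date, glicose_group) %>%                                      # one row per herd/date/category
  pivot_wider(names_from = glicose_group, values_from = n, values_fill = 0)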

dplyr differences between pairs in nested groups

I'd like to use dplyr to calculate differences in value between people nested in pairs, by session.
dat <- data.frame(person=c(rep(1, 10),
rep(2, 10),
rep(3, 10),
rep(4, 10),
rep(5, 10),
rep(6, 10),
rep(7, 10),
rep(8, 10)),
pair=c(rep(1, 20),
rep(2, 20),
rep(3, 20),
rep(4, 20)),
condition=c(rep("NEW", 10),
rep("OLD", 10),
rep("NEW", 10),
rep("OLD", 10),
rep("NEW", 10),
rep("OLD", 10),
rep("NEW", 10),
rep("OLD", 10)),
session=rep(seq(from=1, to=10, by=1), 8),
value=c(0, 2, 4, 8, 16, 16, 18, 20, 20, 20,
0, 1, 1, 2, 4, 5, 8, 12, 15, 15,
0, 2, 8, 10, 15, 16, 18, 20, 20, 20,
0, 4, 4, 6, 6, 8, 10, 12, 12, 18,
0, 6, 8, 10, 16, 16, 18, 20, 20, 20,
0, 2, 2, 3, 4, 8, 8, 8, 10, 12,
0, 10, 12, 16, 18, 18, 18, 20, 20, 20,
0, 2, 2, 8, 10, 10, 11, 12, 15, 20)
)
For instance, person 1 and 2 make a pair (pair==1):
person==1 & session==2: 2
person==2 & session==2: 1
Difference (NEW-OLD) is 2-1=1.
Here's what I have tried so far. I think I need to group_by() first and then summarise(), but I have not cracked this nut.
dat %>%
mutate(session = factor(session)) %>%
group_by(condition, pair, session) %>%
summarise(pairDiff = value-first(value))
Desired output:
Your output can be obtained by:
dat %>% group_by(pair,session) %>% arrange(condition) %>% summarise(diff = -diff(value))
Source: local data frame [40 x 3]
Groups: pair [?]
# A tibble: 40 x 3
pair session diff
<dbl> <dbl> <dbl>
1 1 1 0
2 1 2 1
3 1 3 3
4 1 4 6
5 1 5 12
6 1 6 11
7 1 7 10
8 1 8 8
9 1 9 5
10 1 10 5
# ... with 30 more rows
The arrange ensures that NEW and OLD are in the correct positions, but the solution does depend on there being exactly 2 values for each combination of pair and session.
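If you'd rather not rely on row order at all, the same result can be obtained by indexing on condition inside summarise (a sketch; it still assumes exactly one NEW and one OLD value per pair and session):
dat %>%
  group_by(pair, session) %>%
  summarise(diff = value[condition == "NEW"] - value[condition == "OLD"]) %>%
  ungroup()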
You can spread condition to headers and then do the subtraction NEW - OLD:
library(dplyr); library(tidyr)
dat %>%
select(-person) %>%
spread(condition, value) %>%
mutate(diff = NEW - OLD) %>%
select(session, pair, diff)
# A tibble: 40 x 3
# session pair diff
# <dbl> <dbl> <dbl>
# 1 1 1 0
# 2 2 1 1
# 3 3 1 3
# 4 4 1 6
# 5 5 1 12
# 6 6 1 11
# 7 7 1 10
# 8 8 1 8
# 9 9 1 5
#10 10 1 5
# ... with 30 more rows
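spread() still works, but with tidyr 1.0 or later (an assumption about your version) the same reshape is usually written with pivot_wider():
library(dplyr); library(tidyr)
dat %>%
  select(-person) %>%
  pivot_wider(names_from = condition, values_from = value) %>%
  mutate(diff = NEW - OLD) %>%
  select(session, pair, diff)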

Averaging dataframe based on current row-value and preceding rows

I have a simple data set with the following form
df<- data.frame(c(10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20),
c(80, 80, 80, 80, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90, 90, 90, 80, 80, 80, 80, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90, 90, 90),
c(1, 1, 2, 2, 3, 3, 4, 4, 1, 1, 2, 2, 3, 3, 4, 4, 1, 1, 2, 2, 3, 3, 4, 4, 1, 1, 2, 2, 3, 3, 4, 4),
c(25, 75, 20, 40, 60, 50, 20, 10, 20, 30, 40, 60, 25, 75, 20, 40, 5, 5, 2, 4, 6, 5, 2, 1, 2, 3, 4, 6, 2, 7, 2, 4))
colnames(df)<-c("car_number", "year", "marker", "val")
What I am trying to do is quite simple, actually: per car_number, I want to find the average of the values associated with a marker value and the preceding 3 marker values. So for the example data above the output I want is
car=10, year=80 1: 50
car=10, year=80 2: 40
car=10, year=80 3: 45
car=10, year=80 4: 37.5
car=10, year=90 1: 31.25
car=10, year=90 2: 36.25
car=10, year=90 3: 35
car=10, year=90 4: 38.75
car=20, year=80 1: 5
car=20, year=80 2: 4
car=20, year=80 3: 4.5
car=20, year=80 4: 3.75
car=20, year=90 1: 3.125
car=20, year=90 2: 3.625
car=20, year=90 3: 3.375
car=20, year=90 4: 3.750
Note that for simplicity of the example each marker above appears exactly twice. That is not the case with the real data, so I am thinking a general solution will involve some sort of group_by (?)
Any efficient solution is welcome!
Here is a second example data set and output:
df<- data.frame(c(10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20),
c(80, 80, 80, 80, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90, 90, 90, 80, 80, 80, 80, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90, 90, 90),
c(1, 2, 2, 2, 3, 3, 4, 4, 1, 1, 2, 2, 3, 3, 3, 4, 1, 1, 1, 2, 3, 3, 4, 4, 4, 1, 2, 2, 3, 3, 3, 4),
c(25, 75, 20, 40, 60, 50, 20, 10, 20, 30, 40, 60, 25, 75, 20, 40, 5, 5, 2, 4, 6, 5, 2, 1, 2, 3, 4, 6, 2, 7, 2, 4))
colnames(df)<-c("car_number", "year", "marker", "val")
And the output is (based on the rules above)
car=10, year=80 1: Mean[{25}] = 25
car=10, year=80 2: Mean[{40, 20, 75, 25}] = 40
car=10, year=80 3: Mean[{50, 60, 40, 20, 75, 25}] = 45
car=10, year=80 4: Mean[{10, 20, 50, 60, 40, 20, 75, 25}] = 37.5
car=10, year=90 1: Mean[{30, 20, 10, 20, 50, 60, 40, 20, 75}] = 36.11
car=10, year=90 2: Mean[{60, 40, 30, 20, 10, 20, 50, 60}] = 36.25
car=10, year=90 3: Mean[{20, 75, 25, 60, 40, 30, 20, 10, 20}] = 33.33
car=10, year=90 4: Mean[{40, 20, 75, 25, 60, 40, 30, 20}] = 38.75
car=20, year=80 1: Mean[{2, 5, 5}] = 4
car=20, year=80 2: Mean[{4, 2, 5, 5}] = 4
car=20, year=80 3: Mean[{5, 6, 4, 2, 5, 5}] = 4.5
car=20, year=80 4: Mean[{2, 1, 2, 5, 6, 4, 2, 5, 5}] = 3.55
car=20, year=90 1: Mean[{3, 2, 1, 2, 5, 6, 4}] = 3.29
car=20, year=90 2: Mean[{6, 4, 3, 2, 1, 2, 5, 6}] = 3.625
car=20, year=90 3: Mean[{2, 7, 2, 6, 4, 3, 2, 1, 2}] = 3.22
car=20, year=90 4: Mean[{4, 2, 7, 2, 6, 4, 3}] = 4
A first group_by computes the mean by car_number, year, marker, and retains its weight (number of rows).
A second group_by by car_number allows us to retrieve lagging means and weights to compute the desired mean:
library(purrr)
library(dplyr)
df %>%
arrange(car_number, year, marker) %>%
group_by(car_number, year, marker) %>%
summarise(mean_1 = mean(val, na.rm = TRUE), weight = n()) %>%
group_by(car_number) %>%
mutate(mean_2 = pmap_dbl(
list(mean_1, lag(mean_1), lag(mean_1, 2), lag(mean_1, 3),
weight, lag(weight), lag(weight, 2), lag(weight, 3)),
~ weighted.mean(c(..1, ..2, ..3, ..4),
c(..5, ..6, ..7, ..8),
na.rm = TRUE)
)) %>%
ungroup()
Result:
# # A tibble: 16 × 6
# car_number year marker mean_1 weight mean_2
# <dbl> <dbl> <dbl> <dbl> <int> <dbl>
# 1 10 80 1 50.0 2 50.000
# 2 10 80 2 30.0 2 40.000
# 3 10 80 3 55.0 2 45.000
# 4 10 80 4 15.0 2 37.500
# 5 10 90 1 25.0 2 31.250
# 6 10 90 2 50.0 2 36.250
# 7 10 90 3 50.0 2 35.000
# 8 10 90 4 30.0 2 38.750
# 9 20 80 1 5.0 2 5.000
# 10 20 80 2 3.0 2 4.000
# 11 20 80 3 5.5 2 4.500
# 12 20 80 4 1.5 2 3.750
# 13 20 90 1 2.5 2 3.125
# 14 20 90 2 5.0 2 3.625
# 15 20 90 3 4.5 2 3.375
# 16 20 90 4 3.0 2 3.750
Edit: Alternative syntax for purrr versions prior to 0.2.2.9000:
df %>%
arrange(car_number, year, marker) %>%
group_by(car_number, year, marker) %>%
summarise(mean_1 = mean(val, na.rm = TRUE), weight = n()) %>%
group_by(car_number) %>%
mutate(mean_2 = pmap_dbl(
list(mean_1, lag(mean_1), lag(mean_1, 2), lag(mean_1, 3),
weight, lag(weight), lag(weight, 2), lag(weight, 3)),
function(a, b, c, d, e, f, g, h)
weighted.mean(c(a, b, c, d),
c(e, f, g, h),
na.rm = TRUE)
)) %>%
ungroup()
Just throwing a base R solution in the mix. We can make a custom function using Reduce with accumulate = TRUE and tail(x, 4) to ensure that only the last four group means (the current one and up to three preceding) are included. All this after we average the data set by car_number, year, marker, i.e.
f1 <- function(x){
sapply(Reduce(c, x, accumulate = TRUE), function(i) mean(tail(i,4)))
}
dd <- aggregate(val ~ car_number+year+marker, df, mean)
dd <- dd[order(dd$car_number, dd$year, dd$marker),]
dd$new_avg <- with(dd, ave(val, car_number, FUN = f1))
dd
# car_number year marker val new_avg
#1 10 80 1 50.0 50.000
#5 10 80 2 30.0 40.000
#9 10 80 3 55.0 45.000
#13 10 80 4 15.0 37.500
#3 10 90 1 25.0 31.250
#7 10 90 2 50.0 36.250
#11 10 90 3 50.0 35.000
#15 10 90 4 30.0 38.750
#2 20 80 1 5.0 5.000
#6 20 80 2 3.0 4.000
#10 20 80 3 5.5 4.500
#14 20 80 4 1.5 3.750
#4 20 90 1 2.5 3.125
#8 20 90 2 5.0 3.625
#12 20 90 3 4.5 3.375
#16 20 90 4 3.0 3.750
Here is a method with data.table that modifies Frank's suggestion in David Arenburg's answer here.
# aggregate data by car_number, year, and marker
dfNew <- setDT(df)[, .(val=mean(val)), by=.(car_number, year, marker)]
# calculate average of the current and previous three values
dfNew[, val := rowMeans(dfNew[,shift(val, 0:3), by=car_number][, -1], na.rm=TRUE)]
The first line is a standard aggregation call. The second line adapts the rowMeans method from the linked answer: rowMeans is fed a data.table of the shifted values, where the shift occurs within car_number (thanks to Sotos for catching this), chained to a statement that drops the first column (using -1), i.e. the car_number column returned by the first part of the chain.
This returns:
car_number year marker val
1: 10 80 1 50.000
2: 10 80 2 40.000
3: 10 80 3 45.000
4: 10 80 4 37.500
5: 10 90 1 31.250
6: 10 90 2 36.250
7: 10 90 3 35.000
8: 10 90 4 38.750
9: 20 80 1 5.000
10: 20 80 2 4.000
11: 20 80 3 4.500
12: 20 80 4 3.750
13: 20 90 1 3.125
14: 20 90 2 3.625
15: 20 90 3 3.375
16: 20 90 4 3.750
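Recent data.table versions (1.12.0 or later, an assumption about what is installed) also provide frollmean() with adaptive windows, which could replace the second line above; a sketch that starts from the freshly aggregated dfNew:
# at row i within each car_number, average the current group mean and up to 3 preceding ones
dfNew[, val := frollmean(val, n = pmin(seq_len(.N), 4L), adaptive = TRUE), by = car_number]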
You can do it this way:
df %>%
group_by(car_number, year, marker) %>%
summarise(s = sum(val), w = n()) %>% # sum and number of values
group_by(car_number) %>%
mutate(S = cumsum(s) - cumsum(lag(s, 4, default=0))) %>% # sum of last four s
mutate(W = cumsum(w) - cumsum(lag(w, 4, default=0))) %>% # same for the weights
mutate(result = S/W)
Output of your second example:
# Source: local data frame [16 x 8]
# Groups: car_number [2]
#
# car_number year marker s w S W result
# <dbl> <dbl> <dbl> <dbl> <int> <dbl> <int> <dbl>
# 1 10 80 1 25 1 25 1 25.000000
# 2 10 80 2 135 3 160 4 40.000000
# 3 10 80 3 110 2 270 6 45.000000
# 4 10 80 4 30 2 300 8 37.500000
# 5 10 90 1 50 2 325 9 36.111111
# 6 10 90 2 100 2 290 8 36.250000
# 7 10 90 3 120 3 300 9 33.333333
# 8 10 90 4 40 1 310 8 38.750000
# 9 20 80 1 12 3 12 3 4.000000
# 10 20 80 2 4 1 16 4 4.000000
# 11 20 80 3 11 2 27 6 4.500000
# 12 20 80 4 5 3 32 9 3.555556
# 13 20 90 1 3 1 23 7 3.285714
# 14 20 90 2 10 2 29 8 3.625000
# 15 20 90 3 11 3 29 9 3.222222
# 16 20 90 4 4 1 28 7 4.000000
Edit:
It's probably more efficient with the package RcppRoll; you can try S = roll_sum(c(0, 0, 0, s), 4) (and the same for W).
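A sketch of what that could look like, assuming RcppRoll is installed and keeping the same grouped summary as above:
library(dplyr)
library(RcppRoll)
df %>%
  group_by(car_number, year, marker) %>%
  summarise(s = sum(val), w = n()) %>%
  group_by(car_number) %>%
  mutate(S = roll_sum(c(0, 0, 0, s), 4),  # sum of the last four group sums (zero-padded at the start)
         W = roll_sum(c(0, 0, 0, w), 4),  # same for the weights
         result = S / W)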
Considering df as your input, you can use dplyr and zoo. Grouping only over car_number, you can try:
df %>%
group_by(car_number, year, marker) %>%
summarise(mm = mean(val)) %>%
group_by(car_number) %>%
mutate(rM=rollapply(mm, if_else(row_number() < 4, marker, 4), FUN=mean, align="right"))%>%
select(year, rM)
which gives:
Source: local data frame [16 x 3]
Groups: car_number [2]
car_number year rM
<dbl> <dbl> <dbl>
1 10 80 50.000
2 10 80 40.000
3 10 80 45.000
4 10 80 37.500
5 10 90 31.250
6 10 90 36.250
7 10 90 35.000
8 10 90 38.750
9 20 80 5.000
10 20 80 4.000
11 20 80 4.500
12 20 80 3.750
13 20 90 3.125
14 20 90 3.625
15 20 90 3.375
16 20 90 3.750
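zoo also has rollapplyr() with partial = TRUE, which handles the shorter windows at the start of each group without computing the widths by hand; a sketch under the same grouped-mean setup (it matches the first example, but like the rollapply call above it ignores the group weights in the second example):
library(dplyr)
library(zoo)
df %>%
  group_by(car_number, year, marker) %>%
  summarise(mm = mean(val)) %>%
  group_by(car_number) %>%
  mutate(rM = rollapplyr(mm, width = 4, FUN = mean, partial = TRUE)) %>%
  select(year, rM)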

Identify objects with repeated measures, and with the same ID between years

I would like to reclassify the names of some individuals in a data frame with consecutive letters, and the reclassification criterion has to change every X years after the first occurrence of an individual. I explain it better with an example.
ID <- c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 1, 2, 6, 8, 12, 7, 15, 16, 17, 18, 19, 20, 1, 21, 22, 19 )
Year <- c (1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6)
df <- data.frame (ID, Year)
df
I have a dataset with repeated measures of some individuals over 6 years. As you can see, some IDs, like 1 or 8, are repeated: Year == 1,2,3,4,5 for ID == 1 and Year == 2,4 for ID == 8. However, different individuals may have the same ID if some time has passed since the first occurrence of an individual, because we consider that an individual dies every 2 years and its ID may then be reused.
In this hypothetical case, we assume that the life of an individual is 2 years and that during the sampling we can recognise different individuals perfectly. The ID == 1 in Year == 1 and Year == 2 represents the same individual, but ID == 1 in Years 1-2, Years 3-4 and Year 5 represents different individuals, because the individual with ID == 1 from Year == 1 couldn't live that long. The problem is that the first occurrence of an individual may happen in different years and repeatedly, as in this case. So the code has to forget an ID 2 years after its first occurrence and classify a new occurrence as a new individual.
I would like to name each individual with a unique ID. The new name does not have to be arranged chronologically, as you can see with ID == 1 in Year == 5. I only need each individual to get a unique name.
Below I have put the expected result.
ID <- c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 1, 2, 6, 8, 12, 7, 15, 16, 17, 18, 19, 20, 1, 21, 22, 19 )
Year <- c (1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 1, 6, 6, 6)
new_ID <- c("A", "B", "C", "D", "E", "F", "G", "A", "B", "C", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "M", "N", "Q", "S", "L", "T", "U", "V", "W", "X", "Y", "Z", "CC", "AA", "BB", "Y")
new_df <- data.frame (ID, Year, new_ID)
new_df
As you can see, ID == 1 has different new_ID values in Year == 1, Year == 4 and Year == 5, because we assume that if an individual occurs for the first time in Year == 1, an individual with the same ID in Year == 3 is a different one, and the same for the individual that occurs in Year == 5.
Thanks in advance.
You can use dplyr and cut:
library(dplyr)
df %>% group_by(ID) %>%
mutate(x = as.numeric(cut(Year, seq(min(Year)-1, max(Year)+1, 2))),
idout = paste0(ID, ".", x))
ID Year x idout
1 1 1 1 1.1
2 2 1 1 2.1
3 3 1 1 3.1
4 4 1 1 4.1
5 5 1 1 5.1
6 6 1 1 6.1
7 7 1 1 7.1
8 1 2 1 1.1
9 2 2 1 2.1
10 3 2 1 3.1
11 8 2 1 8.1
12 9 2 1 9.1
13 10 2 1 10.1
14 11 2 1 11.1
15 12 2 1 12.1
16 1 3 2 1.2
17 2 3 2 2.2
18 3 3 2 3.2
19 4 3 2 4.2
20 5 3 2 5.2
21 6 3 2 6.2
22 1 4 2 1.2
23 2 4 2 2.2
24 6 4 2 6.2
25 8 4 2 8.2
26 12 4 2 12.2
27 7 5 3 7.3
28 15 5 1 15.1
29 16 5 1 16.1
30 17 5 1 17.1
31 18 5 1 18.1
32 19 5 1 19.1
33 20 5 1 20.1
34 1 5 3 1.3
35 21 6 1 21.1
36 22 6 1 22.1
37 19 6 1 19.1
NB: there are two mismatches with your desired output: row 34, and rows 15 and 26, where you have an L at Years 2 and 4 with the same ID. I think these are mistakes?
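If you want letter codes as in your desired output rather than the 1.1-style codes, here is a minimal sketch building on the cut() idea above (labels are assigned in order of first appearance and extended past Z with two-letter codes):
library(dplyr)
lab <- c(LETTERS, apply(combn(LETTERS, 2), 2, paste, collapse = ""))  # A, B, ..., Z, AB, AC, ...
df %>%
  group_by(ID) %>%
  mutate(x = as.numeric(cut(Year, seq(min(Year) - 1, max(Year) + 1, 2))),
         idout = paste0(ID, ".", x)) %>%
  ungroup() %>%
  mutate(new_ID = lab[match(idout, unique(idout))])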
ID <- c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 1, 2, 6, 8, 12, 7, 15, 16, 17, 18, 19, 20, 1, 21, 22, 19 )
Year <- c (1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6)
new_ID <- c("A", "B", "C", "D", "E", "F", "G", "A", "B", "C", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "M", "N", "Q", "S", "L", "T", "U", "V", "W", "X", "Y", "Z", "CC", "AA", "BB", "Y")
new_df <- data.frame (ID, Year, new_ID)
new_df
# if every ID renews on the same schedule, use:
newID<-sapply(unique(ID), function(x) c(0,cumsum(diff(Year[ID==x]))%%2))
# if some IDs renew on a different schedule, use a different modulus for those IDs (here ID 1):
newID<-sapply(unique(ID), function(x) {
mod<-2
if(x==1) mod <- 3
c(0,cumsum(diff(Year[ID==x]))%%mod)
})
names(newID) <- unique(ID)
new_df <- data.frame(ID, Year, IDcond = NA, new_ID = NA)
# copy the renewal indicator (0 = first occurrence of a new individual) into the data frame
for(i in unique(ID)){
  new_df[new_df[, 1] == i, 3] <- newID[[which(unique(ID) == i)]]
}
# label pool: A-Z followed by all two-letter combinations
ltrs <- c(LETTERS, apply(combn(LETTERS, 2, simplify = TRUE), 2, function(x) paste(x, sep = "", collapse = "")))
ltrn <- 0
# assign a fresh label at each first occurrence; otherwise reuse the label of the previous row with the same ID
for(i in 1:nrow(new_df)){
  if(new_df[i, 3] == 0) {ltrn <- ltrn + 1; new_df[i, 4] <- ltrs[ltrn]}
  else {ind <- which(new_df[, 1] == new_df[i, 1])
    ind <- ind[ind < i]
    new_df[i, 4] <- tail(new_df[ind, 4], 1)}
}
new_df
> new_df
ID Year IDcond new_ID
1 1 1 0 A
2 2 1 0 B
3 3 1 0 C
4 4 1 0 D
5 5 1 0 E
6 6 1 0 F
7 7 1 0 G
8 1 2 1 A
9 2 2 1 B
10 3 2 1 C
11 8 2 0 H
12 9 2 0 I
13 10 2 0 J
14 11 2 0 K
15 12 2 0 L
16 1 3 0 M
17 2 3 0 N
18 3 3 0 O
19 4 3 0 P
20 5 3 0 Q
21 6 3 0 R
22 1 4 1 M
23 2 4 1 N
24 6 4 1 R
25 8 4 0 S
26 12 4 0 T
27 7 5 0 U
28 15 5 0 V
29 16 5 0 W
30 17 5 0 X
31 18 5 0 Y
32 19 5 0 Z
33 20 5 0 AB
34 1 5 0 AC
35 21 6 0 AD
36 22 6 0 AE
37 19 6 1 Z
