I am calculating the correlation between each variable and the target feature in a dataframe. It works fine for every variable except one, Age, which does not produce a correlation; instead I get NA. I removed all NA values before even starting the analysis, so the data is clean.
This is the code (PD is the target variable, it is binary, and I want to compare it with all other variables):
pearsons = c()
for (i in 1:length(colnames(Train_set))){
pearsons[i] = cor(Train_set[,i], Train_set$PD, method = 'pearson')
}
This is the data structure (only part of it):
> glimpse(Train_set)
Rows: 1,219
Columns: 56
$ PD <dbl> 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1,…
$ gender <int> 2, 2, 2, 1, 2, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 2, 1, 2, 2, 2, 2,…
$ cancer_type <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
$ Treatment <int> 5, 6, 6, 6, 5, 6, 5, 6, 5, 5, 5, 5, 6, 6, 6, 6, 6, 5, 5, 6, 6, 6, 6, 5, 6, 5, 5, 6, 5,…
$ totaldata_new.Age <int> 50, 66, 51, 60, 31, 70, 51, 56, 65, 62, 55, 69, 32, 82, 60, 49, 56, 59, 50, 51, 70, 74…
$ Adipocytes <dbl> 0.000000000, 0.000000000, 0.005592077, 0.005844092, 0.038175712, 0.000000000, 0.005063…
$ B.cells <dbl> 0.045214394, 1.300478781, 0.184967801, 0.032890485, 0.041641426, 0.006477740, 0.653999…
$ Basophils <dbl> 0.120695085, 0.065615816, 0.362173522, 0.039214941, 0.225555640, 0.056926623, 0.019076…
totaldata_new.Age is the Age variable. I tried converting it with as.numeric() and as.integer(), but neither worked.
This is the training set (via dput):
structure(list(PD = c(0, 0, 1, 1, 1, 1, 0, 0, 1, 1), gender = c(2L,
2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L), cancer_type = c(3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), Treatment = c(5L, 6L, 6L, 6L,
5L, 6L, 5L, 6L, 5L, 5L), totaldata_new.Age = c(50L, 66L, 51L,
60L, 31L, 70L, 51L, 56L, 65L, 62L), Adipocytes = c(0, 0, 0.00559207695850587,
0.00584409167696122, 0.0381757121622292, 0, 0.00506330308366599,
0, 0.0156430635414994, 0), B.cells = c(0.0452143935493372, 1.30047878079526,
0.184967800962064, 0.0328904854435036, 0.0416414264467815, 0.00647774047514386,
0.653999365837062, 0.0331653878504112, 0.0286461940371656, 0.0888471904628742
), Basophils = c(0.120695085116671, 0.0656158162440011, 0.362173521572841,
0.0392149412975555, 0.225555640419744, 0.0569266227666268, 0.0190762558461507,
0.0733199539844435, 0.20291673586147, 0.0757313145147394), CD4..memory.T.cells = c(0,
0.24081994997988, 0, 0.0084070550945875, 0, 0, 0.0704387567897827,
0, 0.0177784010286187, 0.00653794301542519), CD4..naive.T.cells = c(0,
0.222121262122827, 0, 0, 0, 0, 0.0337776019379054, 0, 0, 0)), row.names = c("Pt10",
"Pt101", "Pt103", "Pt106", "Pt11", "Pt17", "Pt18", "Pt26", "Pt27",
"Pt28"), class = "data.frame")
Why is this variable producing NA, while other variables give good results of the correlation?
It looks like there are NA values in the columns of interest.
To avoid this problem, the cor() function has a 'use' argument, which the help explains as:
"giving a method for computing covariances in the presence of missing values."
I'd recommend changing your code to:
pearsons[i] = cor(Train_set[,i], Train_set$PD, method = 'pearson',
use = "complete.obs")
Hope that helps!
Samuel
I am working on calculating growth for a coral demography dataset and need to compare Max Diameter (cm) across time steps to determine at which TimeStep corals shrank. I attempted to use lag(), but for some reason my new column is all NA instead of having NA only in the rows where a new coral ID starts. Does anyone have a sense of what I need to do so that my Diff column only has NAs where a transition to a new colony occurs?
Dataframe
A tibble: 20 x 22
`Taxonomic Code` ID Date Year Site_long Shelter `Module #` Side Location Settlement_Area TimeStep size_class `Cover Code` `Max Diameter (… `Max Orthogonal…
<chr> <fct> <date> <chr> <fct> <fct> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 PR H30 2018-11-27 18 Hanauma … Low 216 S D3 0.759 7 3 2 22 17
2 PR H30 2019-02-26 19 Hanauma … Low 216 S D3 0.751 8 3 1 24 19
3 PR H30 2019-05-28 19 Hanauma … Low 216 S D3 0.607 9 3 1 30 20
4 PR H30 2019-08-27 19 Hanauma … Low 216 S D3 0.615 10 1 1 8 8
5 PR H30 2019-11-26 19 Hanauma … Low 216 S D3 0.622 11 5 1 46 30
6 PR H37 2018-09-09 18 Hanauma … High 215 S C1 0.759 6 2 1 14 12
7 PR H37 2018-11-27 18 Hanauma … High 215 S C1 0.751 7 3 1 22 19
8 PR H37 2019-03-12 19 Hanauma … High 215 S C1 0.759 8 3 1 26 20
9 PR H37 2019-05-21 19 Hanauma … High 215 S C1 0.759 9 3 3 29 21
10 PR H37 2019-09-03 19 Hanauma … High 215 S C1 0.683 10 3 1 30 26
11 PR H66 2018-06-05 18 Hanauma … High 213 N A1 0.759 5 2 1 20 19
12 PR H66 2018-09-09 18 Hanauma … High 213 N A1 0.759 6 2 1 20 19
13 PR H66 2018-12-04 18 Hanauma … High 213 N A1 0.653 7 3 1 24 22
14 PR H66 2019-03-05 19 Hanauma … High 213 N A1 0.759 8 3 1 25 24
15 PR H66 2019-05-28 19 Hanauma … High 213 N A1 0.615 9 3 1 28 24
16 PR H66 2019-09-03 19 Hanauma … High 213 N A1 0.531 10 3 1 23 20
17 PR H66 2019-12-03 19 Hanauma … High 213 N A1 0.600 11 3 1 23 16
18 PR H76 2018-09-09 18 Hanauma … High 213 N A4 0.759 6 3 1 21 18
19 PR H76 2018-12-04 18 Hanauma … High 213 N A4 0.653 7 3 1 24 12
20 PR H76 2019-03-05 19 Hanauma … High 213 N A4 0.759 8 3 1 22 19
# … with 7 more variables: `Height (cm)` <dbl>, `Status Code` <chr>, area_mm_squared <dbl>, area_cm_squared <dbl>, Volume_mm_cubed <dbl>, Volume_cm_cubed <dbl>, MD <dbl>
Dataframe Code
data <- structure(list(`Taxonomic Code` = c("PR", "PR", "PR", "PR", "PR",
"PR", "PR", "PR", "PR", "PR", "PR", "PR", "PR", "PR", "PR", "PR",
"PR", "PR", "PR", "PR"), ID = structure(c(35L, 35L, 35L, 35L,
35L, 38L, 38L, 38L, 38L, 38L, 55L, 55L, 55L, 55L, 55L, 55L, 55L,
61L, 61L, 61L), .Label = c("H1051", "H108", "H110", "H1101",
"H112", "H113", "H116", "H118", "H1188", "H1211", "H122", "H125",
"H1253", "H1289", "H171", "H172", "H174", "H186", "H187", "H188",
"H189", "H191", "H192", "H236", "H237", "H244", "H252", "H254",
"H258", "H274", "H277", "H288", "H292", "H293", "H30", "H332",
"H366", "H37", "H374", "H396", "H466", "H479", "H484", "H499",
"H531", "H560", "H580", "H593", "H597", "H625", "H644", "H647",
"H649", "H653", "H66", "H693", "H695", "H712", "H728", "H737",
"H76", "H760", "H774", "H854", "H926", "H96", "H963", "H98",
"H985", "H991", "H996", "W1038", "W1101", "W1152", "W1154", "W1192",
"W1208", "W1209", "W1214", "W1227", "W1243", "W1245", "W1315",
"W1345", "W1361", "W1377", "W1399", "W1438", "W1494", "W1495",
"W1537", "W1557", "W1614", "W1636", "W1655", "W1669", "W1690",
"W1697", "W1729", "W1741", "W1758", "W1782", "W1785", "W1847",
"W1919", "W2000", "W2004", "W2011", "W2036", "W2044", "W2046",
"W2131", "W2133", "W234", "W249", "W251", "W254", "W307", "W355",
"W359", "W369", "W433", "W450", "W461", "W470", "W480", "W538",
"W542", "W544", "W584", "W601", "W606", "W781", "W79", "W807",
"W872", "W874", "W887", "W890", "W891", "W923", "W952"), class = "factor"),
Date = structure(c(17862, 17953, 18044, 18135, 18226, 17783,
17862, 17967, 18037, 18142, 17687, 17783, 17869, 17960, 18044,
18142, 18233, 17783, 17869, 17960), class = "Date"), Year = c("18",
"19", "19", "19", "19", "18", "18", "19", "19", "19", "18",
"18", "18", "19", "19", "19", "19", "18", "18", "19"), Site_long = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L), .Label = c("Hanauma Bay", "Waikiki"), class = "factor"),
Shelter = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("High",
"Low"), class = "factor"), `Module #` = c(216, 216, 216,
216, 216, 215, 215, 215, 215, 215, 213, 213, 213, 213, 213,
213, 213, 213, 213, 213), Side = c("S", "S", "S", "S", "S",
"S", "S", "S", "S", "S", "N", "N", "N", "N", "N", "N", "N",
"N", "N", "N"), Location = c("D3", "D3", "D3", "D3", "D3",
"C1", "C1", "C1", "C1", "C1", "A1", "A1", "A1", "A1", "A1",
"A1", "A1", "A4", "A4", "A4"), Settlement_Area = c(0.75902336,
0.751433126, 0.607218688, 0.614808922, 0.622399155, 0.75902336,
0.751433126, 0.75902336, 0.75902336, 0.683121024, 0.75902336,
0.75902336, 0.65276009, 0.75902336, 0.614808922, 0.531316352,
0.599628454, 0.75902336, 0.65276009, 0.75902336), TimeStep = c(7,
8, 9, 10, 11, 6, 7, 8, 9, 10, 5, 6, 7, 8, 9, 10, 11, 6, 7,
8), size_class = c(3, 3, 3, 1, 5, 2, 3, 3, 3, 3, 2, 2, 3,
3, 3, 3, 3, 3, 3, 3), `Cover Code` = c(2, 1, 1, 1, 1, 1,
1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), `Max Diameter (cm)` = c(22,
24, 30, 8, 46, 14, 22, 26, 29, 30, 20, 20, 24, 25, 28, 23,
23, 21, 24, 22), `Max Orthogonal (cm)` = c(17, 19, 20, 8,
30, 12, 19, 20, 21, 26, 19, 19, 22, 24, 24, 20, 16, 18, 12,
19), `Height (cm)` = c(2, 2, 3, 1, 3, 1, 2, 1, 1, 3, 1, 1,
1, 2, 2, 2, 2, 1, 1, 1), `Status Code` = c(NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, "B", NA, NA, "PB", NA, NA,
NA, NA), area_mm_squared = c(374, 456, 600, 64, 1380, 168,
418, 520, 609, 780, 380, 380, 528, 600, 672, 460, 368, 378,
288, 418), area_cm_squared = c(3.74, 4.56, 6, 0.64, 13.8,
1.68, 4.18, 5.2, 6.09, 7.8, 3.8, 3.8, 5.28, 6, 6.72, 4.6,
3.68, 3.78, 2.88, 4.18), Volume_mm_cubed = c(391.651884147528,
477.522083345649, 942.477796076938, 33.5103216382911, 2167.69893097696,
87.9645943005142, 437.728576400178, 272.271363311115, 318.871654339364,
1225.22113490002, 198.967534727354, 198.967534727354, 276.460153515902,
628.318530717959, 703.716754404114, 481.710873550435, 385.368698840348,
197.920337176157, 150.79644737231, 218.864288200089), Volume_cm_cubed = c(0.391651884147528,
0.477522083345649, 0.942477796076938, 0.0335103216382911,
2.16769893097696, 0.0879645943005142, 0.437728576400178,
0.272271363311115, 0.318871654339364, 1.22522113490002, 0.198967534727354,
0.198967534727354, 0.276460153515902, 0.628318530717959,
0.703716754404114, 0.481710873550435, 0.385368698840348,
0.197920337176157, 0.15079644737231, 0.218864288200089),
MD = c(22, 24, 30, 8, 46, 14, 22, 26, 29, 30, 20, 20, 24,
25, 28, 23, 23, 21, 24, 22)), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))
Code
data_new <- data %>% group_by(ID, TimeStep) %>%
mutate(Diff = `Max Diameter (cm)` - dplyr::lag(`Max Diameter (cm)`))
Output
data_output <- structure(list(`Taxonomic Code` = c("PR", "PR", "PR", "PR", "PR",
"PR", "PR", "PR", "PR", "PR", "PR", "PR", "PR", "PR", "PR", "PR",
"PR", "PR", "PR", "PR"), ID = structure(c(35L, 35L, 35L, 35L,
35L, 38L, 38L, 38L, 38L, 38L, 55L, 55L, 55L, 55L, 55L, 55L, 55L,
61L, 61L, 61L), .Label = c("H1051", "H108", "H110", "H1101",
"H112", "H113", "H116", "H118", "H1188", "H1211", "H122", "H125",
"H1253", "H1289", "H171", "H172", "H174", "H186", "H187", "H188",
"H189", "H191", "H192", "H236", "H237", "H244", "H252", "H254",
"H258", "H274", "H277", "H288", "H292", "H293", "H30", "H332",
"H366", "H37", "H374", "H396", "H466", "H479", "H484", "H499",
"H531", "H560", "H580", "H593", "H597", "H625", "H644", "H647",
"H649", "H653", "H66", "H693", "H695", "H712", "H728", "H737",
"H76", "H760", "H774", "H854", "H926", "H96", "H963", "H98",
"H985", "H991", "H996", "W1038", "W1101", "W1152", "W1154", "W1192",
"W1208", "W1209", "W1214", "W1227", "W1243", "W1245", "W1315",
"W1345", "W1361", "W1377", "W1399", "W1438", "W1494", "W1495",
"W1537", "W1557", "W1614", "W1636", "W1655", "W1669", "W1690",
"W1697", "W1729", "W1741", "W1758", "W1782", "W1785", "W1847",
"W1919", "W2000", "W2004", "W2011", "W2036", "W2044", "W2046",
"W2131", "W2133", "W234", "W249", "W251", "W254", "W307", "W355",
"W359", "W369", "W433", "W450", "W461", "W470", "W480", "W538",
"W542", "W544", "W584", "W601", "W606", "W781", "W79", "W807",
"W872", "W874", "W887", "W890", "W891", "W923", "W952"), class = "factor"),
Date = structure(c(17862, 17953, 18044, 18135, 18226, 17783,
17862, 17967, 18037, 18142, 17687, 17783, 17869, 17960, 18044,
18142, 18233, 17783, 17869, 17960), class = "Date"), Year = c("18",
"19", "19", "19", "19", "18", "18", "19", "19", "19", "18",
"18", "18", "19", "19", "19", "19", "18", "18", "19"), Site_long = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L), .Label = c("Hanauma Bay", "Waikiki"), class = "factor"),
Shelter = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("High",
"Low"), class = "factor"), `Module #` = c(216, 216, 216,
216, 216, 215, 215, 215, 215, 215, 213, 213, 213, 213, 213,
213, 213, 213, 213, 213), Side = c("S", "S", "S", "S", "S",
"S", "S", "S", "S", "S", "N", "N", "N", "N", "N", "N", "N",
"N", "N", "N"), Location = c("D3", "D3", "D3", "D3", "D3",
"C1", "C1", "C1", "C1", "C1", "A1", "A1", "A1", "A1", "A1",
"A1", "A1", "A4", "A4", "A4"), Settlement_Area = c(0.75902336,
0.751433126, 0.607218688, 0.614808922, 0.622399155, 0.75902336,
0.751433126, 0.75902336, 0.75902336, 0.683121024, 0.75902336,
0.75902336, 0.65276009, 0.75902336, 0.614808922, 0.531316352,
0.599628454, 0.75902336, 0.65276009, 0.75902336), TimeStep = c(7,
8, 9, 10, 11, 6, 7, 8, 9, 10, 5, 6, 7, 8, 9, 10, 11, 6, 7,
8), size_class = c(3, 3, 3, 1, 5, 2, 3, 3, 3, 3, 2, 2, 3,
3, 3, 3, 3, 3, 3, 3), `Cover Code` = c(2, 1, 1, 1, 1, 1,
1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), `Max Diameter (cm)` = c(22,
24, 30, 8, 46, 14, 22, 26, 29, 30, 20, 20, 24, 25, 28, 23,
23, 21, 24, 22), `Max Orthogonal (cm)` = c(17, 19, 20, 8,
30, 12, 19, 20, 21, 26, 19, 19, 22, 24, 24, 20, 16, 18, 12,
19), `Height (cm)` = c(2, 2, 3, 1, 3, 1, 2, 1, 1, 3, 1, 1,
1, 2, 2, 2, 2, 1, 1, 1), `Status Code` = c(NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, "B", NA, NA, "PB", NA, NA,
NA, NA), area_mm_squared = c(374, 456, 600, 64, 1380, 168,
418, 520, 609, 780, 380, 380, 528, 600, 672, 460, 368, 378,
288, 418), area_cm_squared = c(3.74, 4.56, 6, 0.64, 13.8,
1.68, 4.18, 5.2, 6.09, 7.8, 3.8, 3.8, 5.28, 6, 6.72, 4.6,
3.68, 3.78, 2.88, 4.18), Volume_mm_cubed = c(391.651884147528,
477.522083345649, 942.477796076938, 33.5103216382911, 2167.69893097696,
87.9645943005142, 437.728576400178, 272.271363311115, 318.871654339364,
1225.22113490002, 198.967534727354, 198.967534727354, 276.460153515902,
628.318530717959, 703.716754404114, 481.710873550435, 385.368698840348,
197.920337176157, 150.79644737231, 218.864288200089), Volume_cm_cubed = c(0.391651884147528,
0.477522083345649, 0.942477796076938, 0.0335103216382911,
2.16769893097696, 0.0879645943005142, 0.437728576400178,
0.272271363311115, 0.318871654339364, 1.22522113490002, 0.198967534727354,
0.198967534727354, 0.276460153515902, 0.628318530717959,
0.703716754404114, 0.481710873550435, 0.385368698840348,
0.197920337176157, 0.15079644737231, 0.218864288200089),
MD = c(22, 24, 30, 8, 46, 14, 22, 26, 29, 30, 20, 20, 24,
25, 28, 23, 23, 21, 24, 22), Diff = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_
)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -20L), groups = structure(list(ID = structure(c(35L,
35L, 35L, 35L, 35L, 38L, 38L, 38L, 38L, 38L, 55L, 55L, 55L, 55L,
55L, 55L, 55L, 61L, 61L, 61L), .Label = c("H1051", "H108", "H110",
"H1101", "H112", "H113", "H116", "H118", "H1188", "H1211", "H122",
"H125", "H1253", "H1289", "H171", "H172", "H174", "H186", "H187",
"H188", "H189", "H191", "H192", "H236", "H237", "H244", "H252",
"H254", "H258", "H274", "H277", "H288", "H292", "H293", "H30",
"H332", "H366", "H37", "H374", "H396", "H466", "H479", "H484",
"H499", "H531", "H560", "H580", "H593", "H597", "H625", "H644",
"H647", "H649", "H653", "H66", "H693", "H695", "H712", "H728",
"H737", "H76", "H760", "H774", "H854", "H926", "H96", "H963",
"H98", "H985", "H991", "H996", "W1038", "W1101", "W1152", "W1154",
"W1192", "W1208", "W1209", "W1214", "W1227", "W1243", "W1245",
"W1315", "W1345", "W1361", "W1377", "W1399", "W1438", "W1494",
"W1495", "W1537", "W1557", "W1614", "W1636", "W1655", "W1669",
"W1690", "W1697", "W1729", "W1741", "W1758", "W1782", "W1785",
"W1847", "W1919", "W2000", "W2004", "W2011", "W2036", "W2044",
"W2046", "W2131", "W2133", "W234", "W249", "W251", "W254", "W307",
"W355", "W359", "W369", "W433", "W450", "W461", "W470", "W480",
"W538", "W542", "W544", "W584", "W601", "W606", "W781", "W79",
"W807", "W872", "W874", "W887", "W890", "W891", "W923", "W952"
), class = "factor"), TimeStep = c(7, 8, 9, 10, 11, 6, 7, 8,
9, 10, 5, 6, 7, 8, 9, 10, 11, 6, 7, 8), .rows = list(1L, 2L,
3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L,
16L, 17L, 18L, 19L, 20L)), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE))
The issue is with the grouping. When 'TimeStep' is included, there is only a single row in each group, and the lag of a single element is NA. Group by ID alone:
library(dplyr)
data %>%
  group_by(ID) %>%
  mutate(Diff = `Max Diameter (cm)` - dplyr::lag(`Max Diameter (cm)`))
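If the rows are not guaranteed to be ordered by TimeStep within each colony, it is safer to arrange first. A small sketch building on the above, assuming TimeStep fully orders the observations within each ID:
library(dplyr)
data %>%
  arrange(ID, TimeStep) %>%   # keep time steps in chronological order within each colony
  group_by(ID) %>%
  mutate(Diff = `Max Diameter (cm)` - dplyr::lag(`Max Diameter (cm)`)) %>%
  ungroup()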
I have been struggling with this for a while now and I haven't been able to find a comparable question asked anywhere, hence my first question on here!
I'm fairly new to R so please excuse any obvious errors I have made.
I have a dataset with a row for each subscription that a user has or has had. Some users have multiple rows, while others have only one. Only active or previously active subscriptions are present.
I have two variables, Begindate and Enddate, which state when the subscription started and when it ended. I have already created relationlength variables that give the number of days between these two dates for each type of subscription, so the relationlength variables only count the days during which a subscription was active.
What I would like to do is insert rows between the different subscription rows for the time periods in which no subscription was active, starting from the earliest Begindate known for the specific user and ending on a given date on which all subscriptions end (20-04-2022).
I have tried taking the date difference between the first Begindate known for a user and the final date and subtracting the relation lengths known for the other subscription types. However, I could not make this work.
An example of what the df currently looks like:
(rl standing for relationlength)
ID Begindate Enddate Subscrtype active rl_fixed rl_promotional Productgroup
1 2019-08-26 2022-04-20 fixed 1 968 0 1
1 2018-08-24 2019-08-23 fixed 0 364 0 1
1 2015-08-24 2016-08-23 promo 0 0 364 2
2 2019-08-26 2019-09-12 fixed 0 17 0 1
2 2018-08-24 2019-08-23 fixed 0 364 0 1
What I would like it to look like:
ID Begindate Enddate Subscrtype active rl_fixed rl_promo rl_none Productgroup
1 2019-08-26 2022-04-20 fixed 1 968 0 0 1
1 2019-08-24 2019-08-25 none 0 0 0 2 NA
1 2018-08-24 2019-08-23 fixed 0 364 0 0 1
1 2016-08-24 2018-08-23 none 0 0 0 729 NA
1 2015-08-24 2016-08-23 promo 0 0 364 0 2
2 2019-09-13 2022-04-20 none 0 0 0 950 NA
2 2019-08-26 2019-09-12 fixed 0 17 0 0 1
2 2019-08-24 2019-08-25 none 0 0 0 2 NA
2 2018-08-24 2019-08-23 fixed 0 364 0 0 1
The end goal is to aggregate and have a clear overview of the specific relation lengths for the different types of relations possible for a user.
Thank you in advance!
dput for one specific user in the real df:
structure(list(ï..CRM.relatienummer = structure(c(1L, 1L, 1L,
1L, 1L, 1L), .Label = "1", class = "factor"), Begindatum = c("2019-08-26",
"2018-08-24", "2017-08-24", "2016-08-24", "2015-08-20", "2016-06-01"
), Einddatum = c("2022-04-20", "2019-08-23", "2018-08-23", "2017-08-23",
"2016-05-31", "2016-08-19"), Type.abonnement = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = "Actie", class = "factor"), Status_dummy = c(1,
0, 0, 0, 0, 0), relationlength_fixed = c(0, 0, 0, 0, 0, 0), relationlength_promo = c(968,
364, 364, 364, 285, 79), relationlength_trial = c(0, 0, 0, 0,
0, 0), fixed_dummy = c(0, 0, 0, 0, 0, 0), trial_dummy = c(0,
0, 0, 0, 0, 0), promotional_dummy = c(1, 1, 1, 1, 1, 1)), row.names = c("1:20610",
"2:38646", "2:39231", "2:39232", "2:39248", "2:39837"), class = "data.frame")
Edit:
I have tried to run this code:
dfs <- split(testdata,testdata$ï..CRM.relatienummer)
r <- lapply(seq(length(dfs)), function(k){
v <- dfs[[k]]
vt <- data.frame(unique(v$ï..CRM.relatienummer),
as.character((as.Date(v$Einddatum)+1)[-1]),
as.character((as.Date(v$Begindatum)-1)[-nrow(v)]),
0,
0,
0,
0,
(as.Date(v$Begindatum)-1)[-nrow(v)] - (as.Date(v$Einddatum)+1)[-1],
NA,
0,
0,
0,
0,
0)
colnames(vt) <- c(colnames(v)[-ncol(v)],"rl_none",colnames(v)[ncol(v)])
(testdata <- rbind(data.frame(v[-ncol(v)],rl_none = 0,v[ncol(v)]),vt))[order(as.Date(testdata$Begindatum),decreasing = T),]
})
res <- data.frame(Reduce(rbind,r),row.names = NULL)
on this dataframe, unfortunately with no luck:
structure(list(ï..CRM.relatienummer = structure(c("d45248b8974dc4f8ff948779e0fd07e20f304e929ada4e14c0420aebed81e9b5",
"2ab04e80b3e64601147df977d6054c04ffa80014b3691b25dd1cc8ef85cea06a",
"2ab04e80b3e64601147df977d6054c04ffa80014b3691b25dd1cc8ef85cea06a",
"bcf2c99e6dc974380f967204b9623dce2c8a3fad694dc0b4430fcbf77f8f39f3",
"bcf2c99e6dc974380f967204b9623dce2c8a3fad694dc0b4430fcbf77f8f39f3",
"f8610cd0237858ac9384d6ba209759ae306860ffabb3f8e6c3d6fc68dbaddc51",
"e5b8b3f46165e48aec8bbe65ed1cb29d18a0492fbcac44803372f672348459db",
"c737815b2365b01a8a85c380364a0f721685a131de98cd7790b4d40bb8c4e05b",
"b9c0272caa8d5d3497d28cce3bda5d3d17c22f18c5f65c5e82c572b410a8ea71",
"b9c0272caa8d5d3497d28cce3bda5d3d17c22f18c5f65c5e82c572b410a8ea71",
"539c6c3e604245008daefbe500ff29357bee91f82a7896126bd0f69848524cb7",
"d361338bed51cb9c8aa73fd8914cbf392f4e05e7b073f637f7b150cf02b89c8c",
"505d3df3f1298e07aa96073490b72acd2391da06ad4cfbd5a9fbde3a3de79684",
"826443481cbb5b4e061040d443a0ce8d94322615d8ffae1e68b2ff7d896afcf7",
"2b59a1ec028c261c0f22cd6a49220dc7cec9a9fb0fabe2296b4ba77a60cfdaae"
), class = c("hash", "sha256")), Begindatum = c("2019-06-14",
"2019-03-01", "2019-09-02", "2019-03-03", "2019-04-01", "2019-09-21",
"2019-02-02", "2019-06-11", "2019-02-05", "2019-02-09", "2019-07-24",
"2019-05-08", "2019-09-27", "2019-08-03", "2019-04-03"), Einddatum = c("2022-04-20",
"2019-09-01", "2022-04-20", "2019-03-31", "2022-04-20", "2022-04-20",
"2019-02-14", "2019-07-08", "2019-02-11", "2020-02-08", "2019-09-03",
"2019-06-18", "2019-11-07", "2019-08-16", "2022-04-20"), Status_dummy = c(1,
0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1), relationlength_fixed = c(0,
184, 961, 28, 0, 0, 0, 0, 6, 0, 0, 0, 0, 0, 0), relationlength_promo = c(1041,
0, 0, 0, 1115, 942, 12, 0, 0, 364, 0, 0, 0, 0, 1113), relationlength_trial = c(0,
0, 0, 0, 0, 0, 0, 27, 0, 0, 41, 41, 41, 13, 0), rl_none = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), fixed_dummy = c(0,
1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0), trial_dummy = c(0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0), promotional_dummy = c(1,
0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1), active_subscr_dummy = c(3,
0, 5, 0, 3, 3, 0, 0, 0, 3, 0, 0, 1, 0, 3), hashedEmail = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c("1:1",
"1:2", "1:3", "1:4", "1:5", "1:6", "1:7", "1:8", "1:9", "1:10",
"1:11", "1:12", "1:13", "1:14", "1:15"), class = "data.frame")
Hopefully this is what you are expecting:
# split by user, build the "gap" rows for each user, then recombine
dfs <- split(df, df$ID)
r <- lapply(seq_along(dfs), function(k){
  v <- dfs[[k]]
  # gap rows: each gap starts the day after one subscription ends (Enddate + 1)
  # and ends the day before the next, more recent subscription begins (Begindate - 1)
  vt <- data.frame(unique(v$ID),
                   as.character((as.Date(v$Enddate) + 1)[-1]),
                   as.character((as.Date(v$Begindate) - 1)[-nrow(v)]),
                   "none",
                   0,
                   0,
                   0,
                   (as.Date(v$Begindate) - 1)[-nrow(v)] - (as.Date(v$Enddate) + 1)[-1],
                   NA)
  colnames(vt) <- c(colnames(v)[-ncol(v)], "rl_none", colnames(v)[ncol(v)])
  # add rl_none = 0 to the original rows, append the gap rows,
  # and sort with the most recent Begindate first
  out <- rbind(data.frame(v[-ncol(v)], rl_none = 0, v[ncol(v)]), vt)
  out[order(as.Date(out$Begindate), decreasing = TRUE), ]
})
res <- data.frame(Reduce(rbind, r), row.names = NULL)
which gives
> res
ID Begindate Enddate Subscrtype active rl_fixed rl_promo rl_none Productgroup
1 1 2019-08-26 2022-04-20 fixed 1 968 0 0 1
2 1 2019-08-24 2019-08-25 none 0 0 0 1 NA
3 1 2018-08-24 2019-08-23 fixed 0 364 0 0 1
4 1 2016-08-24 2018-08-23 none 0 0 0 729 NA
5 1 2015-08-24 2016-08-23 promo 0 0 364 0 2
6 2 2019-08-26 2019-09-12 fixed 0 17 0 0 1
7 2 2019-08-24 2019-08-25 none 0 0 0 1 NA
8 2 2018-08-24 2019-08-23 fixed 0 364 0 0 1
DATA
structure(list(ID = c(1L, 1L, 1L, 2L, 2L), Begindate = structure(c(3L,
2L, 1L, 3L, 2L), .Label = c("2015-08-24", "2018-08-24", "2019-08-26"
), class = "factor"), Enddate = structure(c(4L, 2L, 1L, 3L, 2L
), .Label = c("2016-08-23", "2019-08-23", "2019-09-12", "2022-04-20"
), class = "factor"), Subscrtype = structure(c(1L, 1L, 2L, 1L,
1L), .Label = c("fixed", "promo"), class = "factor"), active = c(1L,
0L, 0L, 0L, 0L), rl_fixed = c(968L, 364L, 0L, 17L, 364L), rl_promo = c(0L,
0L, 364L, 0L, 0L), Productgroup = c(1L, 1L, 2L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-5L))
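Since the stated end goal is an aggregated overview of relation lengths per type and user, a possible follow-up on the res object (my own sketch, not part of the output shown above):
library(dplyr)
res %>%
  group_by(ID) %>%
  summarise(days_fixed = sum(as.numeric(rl_fixed)),   # as.numeric() also handles difftime values
            days_promo = sum(as.numeric(rl_promo)),
            days_none  = sum(as.numeric(rl_none)))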
I am attempting to develop a time-varying Cox proportional hazards (CPH) model in R and was wondering if anyone has written code to help format data into the counting-process structure that time-varying / time-dependent CPH models use.
To make the problem reproducible and somewhat simpler, I have extracted the first 100 rows of data, which feature 4 variables (id, date, y, and x). id is a unique subject identifier. date is an integer sequence from 0 to n days of observation for each id. y is the status or outcome of the hazard analysis, and x is the time-varying covariate. In this example, once y = 1 has occurred, the data for each subject are censored and no additional rows should be included in the ideal output dataframe.
The data are structured so that each subject has one row per day of observation.
head(test)
id date y x
1 0 0 0
1 1 0 1
1 2 0 1
1 3 0 1
1 4 0 1
1 5 0 0
However, as I understand it, the cph function in R requires time-varying covariates to be structured with start and stop times, so the rows in the head(test) block above need to be recoded into 3 rows with the intervals (0,1], (1,5], and (5,6].
The first 100 rows of data can be reconstructed using this code:
dput(test)
structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5,
5, 5, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9,
9, 9, 9), date = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2,
3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2,
3, 4, 5, 6, 7, 8, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 7,
8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
0, 1, 2, 3, 4, 5, 6, 7, 8), y = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 0), x = c(0L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 0L,
1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L,
0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L)), .Names = c("id",
"date", "y", "x"), row.names = c(NA, -100L), class = "data.frame")
Ideally, I am trying to recode these data so that the output would be:
head(ideal_output)
id start end y x
1 0 1 0 0
1 1 5 0 1
1 5 6 0 0
1 6 7 0 1
1 7 9 0 0
1 9 11 0 1
1 11 20 0 0
2 0 8 0 0
3 0 1 0 0
3 1 3 0 1
3 3 4 0 0
3 4 6 0 1
3 6 7 1 1
4 0 2 0 0
4 2 4 0 1
4 4 7 0 0
5 0 9 0 0
6 0 7 0 0
7 0 1 0 0
7 1 2 0 1
7 2 3 0 0
7 3 4 1 0
8 0 3 0 0
8 3 4 1 1
9 0 2 0 0
9 2 5 0 1
9 5 6 1 1
I have done this manually to create the ideal_output above, but it is an error-prone process and untenable for the hundreds of ids and several covariates that I need to evaluate. Consequently, any help in developing an automated way to approach this data formatting challenge would be greatly appreciated. Thanks!
I think the SurvSplit() function (from the eha package) is the answer to your problem.
Look at:
http://www.rdocumentation.org/packages/eha/functions/SurvSplit
Alternatively, try googling "Chapter 5 Extended and Stratified Cox" (nus.edu.sg).
As @Ham suggests, you can use tmerge(). Here is an example:
> #####
> # `dat` is the data.frame you provided
> library(survival)
>
> # make baseline data.frame for tmerge
> baseline <- by(dat, dat$id, function(x){
+ n <- nrow(x)
+ # avoid slow data.frame call
+ structure(list(
+ id = x$id[1], start = x$date[1], x = x$x[1], end = x$date[n],
+ dummy = 0),
+ row.names = 1L, class = "data.frame")
+ })
> baseline <- do.call(rbind, baseline)
> baseline # show baseline data
id start x end dummy
1 1 0 0 19 0
2 2 0 0 7 0
3 3 0 0 12 0
4 4 0 0 6 0
5 5 0 0 8 0
6 6 0 0 6 0
7 7 0 0 11 0
8 8 0 0 14 0
9 9 0 0 8 0
>
> # use tmerge
> final_dat <- tmerge(baseline, baseline, id = id, y = event(end, dummy))
> final_dat <- tmerge(
+ final_dat, dat, id = id, y = cumtdc(date, y), x = tdc(date, x))
> final_dat[final_dat$id == 3, ] # look at one example
id start x end dummy tstart tstop y
27 3 0 0 12 0 0 1 0
28 3 0 1 12 0 1 2 0
29 3 0 1 12 0 2 3 0
30 3 0 0 12 0 3 4 0
31 3 0 1 12 0 4 5 0
32 3 0 1 12 0 5 6 0
33 3 0 1 12 0 6 7 1
34 3 0 1 12 0 7 8 1
35 3 0 1 12 0 8 9 1
36 3 0 1 12 0 9 10 1
37 3 0 1 12 0 10 11 1
38 3 0 0 12 0 11 12 1
>
> # drop rows after the first event: keep only rows whose cumulative y is 0 or 1
> final_dat <- within(final_dat, ycum <- unlist(tapply(y, id, cumsum)))
> final_dat <- final_dat[final_dat$ycum < 2, ]
> final_dat$ycum <- NULL
> final_dat[final_dat$id == 3, ]
id start x end dummy tstart tstop y
27 3 0 0 12 0 0 1 0
28 3 0 1 12 0 1 2 0
29 3 0 1 12 0 2 3 0
30 3 0 0 12 0 3 4 0
31 3 0 1 12 0 4 5 0
32 3 0 1 12 0 5 6 0
33 3 0 1 12 0 6 7 1
>
> # remove rows where x matches the previous x value, but:
> # * keep those where y = 1
> # * update tstop for the last row, since the original last row may be removed
> final_dat <- within(
+ final_dat,
+ max_t <- unlist(tapply(tstop, id, function(z) rep(max(z), length(z)))))
> final_dat <- within(
+ final_dat,
+ keep <- unlist(tapply(x, id, function(z)
+ c(TRUE, z[-1] != z[-length(z)]))))
>
> final_dat <- final_dat[final_dat$keep | final_dat$y, ]
>
> final_dat <- within(
+ final_dat, is_last <- unlist(tapply(id, id, function(z)
+ seq_along(z) == length(z))))
>
> needs_update <- final_dat$is_last & !final_dat$y
> final_dat[needs_update, "tstop"] <-
+ final_dat[needs_update, "max_t"] + 1
>
> # have to update the tstop column
> final_dat <- within(final_dat, tstop <- unlist(by(
+ cbind(tstart, tstop), id, function(z) {
+ n <- nrow(z)
+ c(z$tstart[-1], z$tstop[n])
+ })))
>
> # show final data.frame
> final_dat[, c("id", "tstart", "tstop", "y", "x")]
id tstart tstop y x
1 1 0 1 0 0
2 1 1 5 0 1
6 1 5 6 0 0
7 1 6 7 0 1
8 1 7 9 0 0
10 1 9 11 0 1
12 1 11 20 0 0
20 2 0 8 0 0
27 3 0 1 0 0
28 3 1 3 0 1
30 3 3 4 0 0
31 3 4 6 0 1
33 3 6 7 1 1
39 4 0 2 0 0
41 4 2 4 0 1
43 4 4 7 0 0
45 5 0 9 0 0
53 6 0 7 0 0
59 7 0 1 0 0
60 7 1 2 0 1
61 7 2 3 0 0
62 7 3 4 1 0
70 8 0 3 0 0
73 8 3 4 1 1
84 9 0 2 0 0
86 9 2 5 0 1
89 9 5 6 1 1
The code after tmerge can be made faster with dplyr or data.table. If you have more covariate columns than just x, then I suggest that you: 1) store an index column in dat and pass that to tdc() in the tmerge call instead of x, and 2) merge the tables afterwards with merge(). You will also need to update the line that builds the keep indicator. Otherwise the code should be identical.
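For instance, the post-tmerge cleanup could look like this in dplyr (my own sketch, only checked against the sample data; it assumes final_dat is the object right after the two tmerge() calls and reproduces the same logic: censor after the first event, start a new interval whenever x changes or y = 1, and add one day to the last interval of censored subjects):
library(dplyr)
final_dat %>%
  group_by(id) %>%
  filter(cumsum(y) < 2) %>%                       # drop rows after the first event
  mutate(new_int = row_number() == 1 | x != dplyr::lag(x) | y == 1,
         run = cumsum(new_int)) %>%               # run = interval identifier within id
  group_by(id, run) %>%
  summarise(tstart = first(tstart), tstop = last(tstop),
            y = last(y), x = first(x), .groups = "drop_last") %>%
  mutate(tstop = if_else(row_number() == n() & y == 0, tstop + 1, tstop)) %>%
  ungroup() %>%
  select(id, tstart, tstop, y, x)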
I think the tmerge() function is the answer to your problem.
Look at: https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf