R elementwise calculations in dataframe which contains lists

I have the following dataframe df:
adj_coords
1 2, 3, 4, 5, 6, 7
2 1, 3, 7, 8, 9, 10
3 1, 2, 4, 10, 11, 12
4 1, 3, 5, 12, 13, 14
5 1, 4, 6, 14, 15, 16
6 1, 5, 7, 16, 17, 18
adj_coords_material_amounts
1 0.0000, 0.0000, 0.0000, 0.0000, 632.6667, 264.3333
2 263.0000, 0.0000, 264.3333, 262.6667, 0.0000, 238.6667
3 263.0000, 0.0000, 0.0000, 238.6667, 0.0000, 298.3333
4 263.0000, 0.0000, 0.0000, 298.3333, 300.6667, 279.3333
5 263.0000, 0.0000, 632.6667, 279.3333, 0.0000, 273.3333
6 263.0000, 0.0000, 264.3333, 273.3333, 0.0000, 0.0000
df<-structure(list(adj_coords = list(2:7, c(1L, 3L, 7L, 8L, 9L, 10L
), c(1L, 2L, 4L, 10L, 11L, 12L), c(1L, 3L, 5L, 12L, 13L, 14L),
c(1L, 4L, 6L, 14L, 15L, 16L), c(1L, 5L, 7L, 16L, 17L, 18L
)), adj_coords_material_amounts = list(c(0, 0, 0, 0, 632.666666666666,
264.333333333334), c(263, 0, 264.333333333334, 262.666666666667,
0, 238.666666666667), c(263, 0, 0, 238.666666666667, 0, 298.333333333333
), c(263, 0, 0, 298.333333333333, 300.666666666667, 279.333333333334
), c(263, 0, 632.666666666666, 279.333333333334, 0, 273.333333333334
), c(263, 0, 264.333333333334, 273.333333333334, 0, 0))), row.names = c(NA,
6L), class = "data.frame")
I would like to sample one element from each row of adj_coords but only where the corresponding element in adj_coords_material_amounts is >0.

Loop over each paired set of adj_coords and adj_coords_material_amounts using mapply, and sample one value from the elements whose amount is > 0.
##set.seed(1)
mapply(
\(co,ma) sample(co[ma > 0], 1),
df[["adj_coords"]], df[["adj_coords_material_amounts"]]
)
#[1] 6 10 12 1 6 1
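One caveat with sample(co[ma > 0], 1): when exactly one element passes the filter, sample() treats that single number n as the range 1:n rather than as the only candidate. Below is a minimal sketch of a guard. The helper name sample_one and the toy data are assumptions made for illustration and are not part of the original question.

```r
# Toy data (assumed): the first row has exactly one positive amount.
adj_coords <- list(c(2L, 3L, 10L), c(1L, 3L, 7L))
amounts    <- list(c(0, 0, 5),     c(263, 0, 264.3))

# Hypothetical helper: sample an index into the candidates instead of
# sampling the candidates directly, so a single candidate is returned as-is.
sample_one <- function(co, ma) {
  candidates <- co[ma > 0]
  if (length(candidates) == 0L) return(NA_integer_)  # nothing to sample from
  candidates[sample.int(length(candidates), 1L)]
}

set.seed(1)
picked <- mapply(sample_one, adj_coords, amounts)
picked[1]  # always 10; sample(10, 1) could have returned any of 1..10
```

Returning NA for rows without any positive amount is a design choice; stopping with an error may suit other use cases better.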

I am not that familiar with dplyr, but below is one of my attempts
library(dplyr)
library(tidyr)

df %>%
  mutate(id = 1:n()) %>%
  unnest(c(adj_coords, adj_coords_material_amounts)) %>%
  filter(adj_coords_material_amounts > 0) %>%
  group_by(id) %>%
  slice_sample(n = 1) %>%
  ungroup() %>%
  select(!id)
and you will see
# A tibble: 6 × 2
adj_coords adj_coords_material_amounts
<int> <dbl>
1 7 264.
2 8 263.
3 1 263
4 14 279.
5 16 273.
6 1 263

Related

Add empty rows for gaps between subscriptions

I have been struggling with this for a while now and I haven't been able to find a comparable question asked anywhere, hence my first question on here!
I'm fairly new to R so please excuse any obvious errors I have made.
I have a dataset which has a row for each subscription that a user has or has had. Some users have multiple rows, while some others only have one. Only active or previously active subscriptions are present.
I have two variables, called Begindate and Enddate, which state when the subscription started and when it ended, respectively. I have already created relationlength variables which state the number of days between these two dates for each type of subscription. This means that the relationlength variables only give the number of days during which a subscription was active.
What I would like to do is create empty rows in between the different subscription rows for the time periods in which no subscription was active, starting from the earliest Begindate known for the specific user and ending on a given date where all subscriptions end (20-04-2022).
I have tried to compare the date difference from the first begindate known for a user and the final date and subtracting the relation length known for the other subscription types. However, I could not make this work.
An example of what the df currently looks like:
(rl standing for relationlength)
ID Begindate Enddate Subscrtype active rl_fixed rl_promotional Productgroup
1 2019-08-26 2022-04-20 fixed 1 968 0 1
1 2018-08-24 2019-08-23 fixed 0 364 0 1
1 2015-08-24 2016-08-23 promo 0 0 364 2
2 2019-08-26 2019-09-12 fixed 0 17 0 1
2 2018-08-24 2019-08-23 fixed 0 364 0 1
What I would like it to look like:
ID Begindate Enddate Subscrtype active rl_fixed rl_promo rl_none Productgroup
1 2019-08-26 2022-04-20 fixed 1 968 0 0 1
1 2019-08-24 2019-08-25 none 0 0 0 2 NA
1 2018-08-24 2019-08-23 fixed 0 364 0 0 1
1 2016-08-24 2018-08-23 none 0 0 0 729 NA
1 2015-08-24 2016-08-23 promo 0 0 364 0 2
2 2019-09-13 2022-04-20 none 0 0 0 950 NA
2 2019-08-26 2019-09-12 fixed 0 17 0 0 1
2 2019-08-24 2019-08-25 none 0 0 0 2 NA
2 2018-08-24 2019-08-23 fixed 0 364 0 0 1
The end goal is to aggregate and have a clear overview of the specific relation lengths for the different types of relations possible for a user.
Thank you in advance!
dput for one specific user in the real df:
structure(list(ï..CRM.relatienummer = structure(c(1L, 1L, 1L,
1L, 1L, 1L), .Label = "1", class = "factor"), Begindatum = c("2019-08-26",
"2018-08-24", "2017-08-24", "2016-08-24", "2015-08-20", "2016-06-01"
), Einddatum = c("2022-04-20", "2019-08-23", "2018-08-23", "2017-08-23",
"2016-05-31", "2016-08-19"), Type.abonnement = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = "Actie", class = "factor"), Status_dummy = c(1,
0, 0, 0, 0, 0), relationlength_fixed = c(0, 0, 0, 0, 0, 0), relationlength_promo = c(968,
364, 364, 364, 285, 79), relationlength_trial = c(0, 0, 0, 0,
0, 0), fixed_dummy = c(0, 0, 0, 0, 0, 0), trial_dummy = c(0,
0, 0, 0, 0, 0), promotional_dummy = c(1, 1, 1, 1, 1, 1)), row.names = c("1:20610",
"2:38646", "2:39231", "2:39232", "2:39248", "2:39837"), class = "data.frame")
Edit:
I have tried to run this code:
dfs <- split(testdata,testdata$ï..CRM.relatienummer)
r <- lapply(seq(length(dfs)), function(k){
v <- dfs[[k]]
vt <- data.frame(unique(v$ï..CRM.relatienummer),
as.character((as.Date(v$Einddatum)+1)[-1]),
as.character((as.Date(v$Begindatum)-1)[-nrow(v)]),
0,
0,
0,
0,
(as.Date(v$Begindatum)-1)[-nrow(v)] - (as.Date(v$Einddatum)+1)[-1],
NA,
0,
0,
0,
0,
0)
colnames(vt) <- c(colnames(v)[-ncol(v)],"rl_none",colnames(v)[ncol(v)])
(testdata <- rbind(data.frame(v[-ncol(v)],rl_none = 0,v[ncol(v)]),vt))[order(as.Date(testdata$Begindatum),decreasing = T),]
})
res <- data.frame(Reduce(rbind,r),row.names = NULL)
On this dataframe, with no luck unfortunately:
structure(list(ï..CRM.relatienummer = structure(c("d45248b8974dc4f8ff948779e0fd07e20f304e929ada4e14c0420aebed81e9b5",
"2ab04e80b3e64601147df977d6054c04ffa80014b3691b25dd1cc8ef85cea06a",
"2ab04e80b3e64601147df977d6054c04ffa80014b3691b25dd1cc8ef85cea06a",
"bcf2c99e6dc974380f967204b9623dce2c8a3fad694dc0b4430fcbf77f8f39f3",
"bcf2c99e6dc974380f967204b9623dce2c8a3fad694dc0b4430fcbf77f8f39f3",
"f8610cd0237858ac9384d6ba209759ae306860ffabb3f8e6c3d6fc68dbaddc51",
"e5b8b3f46165e48aec8bbe65ed1cb29d18a0492fbcac44803372f672348459db",
"c737815b2365b01a8a85c380364a0f721685a131de98cd7790b4d40bb8c4e05b",
"b9c0272caa8d5d3497d28cce3bda5d3d17c22f18c5f65c5e82c572b410a8ea71",
"b9c0272caa8d5d3497d28cce3bda5d3d17c22f18c5f65c5e82c572b410a8ea71",
"539c6c3e604245008daefbe500ff29357bee91f82a7896126bd0f69848524cb7",
"d361338bed51cb9c8aa73fd8914cbf392f4e05e7b073f637f7b150cf02b89c8c",
"505d3df3f1298e07aa96073490b72acd2391da06ad4cfbd5a9fbde3a3de79684",
"826443481cbb5b4e061040d443a0ce8d94322615d8ffae1e68b2ff7d896afcf7",
"2b59a1ec028c261c0f22cd6a49220dc7cec9a9fb0fabe2296b4ba77a60cfdaae"
), class = c("hash", "sha256")), Begindatum = c("2019-06-14",
"2019-03-01", "2019-09-02", "2019-03-03", "2019-04-01", "2019-09-21",
"2019-02-02", "2019-06-11", "2019-02-05", "2019-02-09", "2019-07-24",
"2019-05-08", "2019-09-27", "2019-08-03", "2019-04-03"), Einddatum = c("2022-04-20",
"2019-09-01", "2022-04-20", "2019-03-31", "2022-04-20", "2022-04-20",
"2019-02-14", "2019-07-08", "2019-02-11", "2020-02-08", "2019-09-03",
"2019-06-18", "2019-11-07", "2019-08-16", "2022-04-20"), Status_dummy = c(1,
0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1), relationlength_fixed = c(0,
184, 961, 28, 0, 0, 0, 0, 6, 0, 0, 0, 0, 0, 0), relationlength_promo = c(1041,
0, 0, 0, 1115, 942, 12, 0, 0, 364, 0, 0, 0, 0, 1113), relationlength_trial = c(0,
0, 0, 0, 0, 0, 0, 27, 0, 0, 41, 41, 41, 13, 0), rl_none = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), fixed_dummy = c(0,
1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0), trial_dummy = c(0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0), promotional_dummy = c(1,
0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1), active_subscr_dummy = c(3,
0, 5, 0, 3, 3, 0, 0, 0, 3, 0, 0, 1, 0, 3), hashedEmail = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c("1:1",
"1:2", "1:3", "1:4", "1:5", "1:6", "1:7", "1:8", "1:9", "1:10",
"1:11", "1:12", "1:13", "1:14", "1:15"), class = "data.frame")
Hopefully this is what you are expecting
dfs <- split(df,df$ID)
r <- lapply(seq(length(dfs)), function(k){
v <- dfs[[k]]
vt <- data.frame(unique(v$ID),
as.character((as.Date(v$Enddate)+1)[-1]),
as.character((as.Date(v$Begindate)-1)[-nrow(v)]),
"none",
0,
0,
0,
(as.Date(v$Begindate)-1)[-nrow(v)] - (as.Date(v$Enddate)+1)[-1],
NA)
colnames(vt) <- c(colnames(v)[-ncol(v)],"rl_none",colnames(v)[ncol(v)])
(df <- rbind(data.frame(v[-ncol(v)],rl_none = 0,v[ncol(v)]),vt))[order(as.Date(df$Begindate),decreasing = T),]
})
res <- data.frame(Reduce(rbind,r),row.names = NULL)
which gives
> res
ID Begindate Enddate Subscrtype active rl_fixed rl_promo rl_none Productgroup
1 1 2019-08-26 2022-04-20 fixed 1 968 0 0 1
2 1 2019-08-24 2019-08-25 none 0 0 0 1 NA
3 1 2018-08-24 2019-08-23 fixed 0 364 0 0 1
4 1 2016-08-24 2018-08-23 none 0 0 0 729 NA
5 1 2015-08-24 2016-08-23 promo 0 0 364 0 2
6 2 2019-08-26 2019-09-12 fixed 0 17 0 0 1
7 2 2019-08-24 2019-08-25 none 0 0 0 1 NA
8 2 2018-08-24 2019-08-23 fixed 0 364 0 0 1
DATA
structure(list(ID = c(1L, 1L, 1L, 2L, 2L), Begindate = structure(c(3L,
2L, 1L, 3L, 2L), .Label = c("2015-08-24", "2018-08-24", "2019-08-26"
), class = "factor"), Enddate = structure(c(4L, 2L, 1L, 3L, 2L
), .Label = c("2016-08-23", "2019-08-23", "2019-09-12", "2022-04-20"
), class = "factor"), Subscrtype = structure(c(1L, 1L, 2L, 1L,
1L), .Label = c("fixed", "promo"), class = "factor"), active = c(1L,
0L, 0L, 0L, 0L), rl_fixed = c(968L, 364L, 0L, 17L, 364L), rl_promo = c(0L,
0L, 364L, 0L, 0L), Productgroup = c(1L, 1L, 2L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-5L))
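The answer above builds the gap rows between consecutive subscriptions but does not add the trailing gap up to the common end date (the 2019-09-13 to 2022-04-20 row for ID 2 in the desired output). Below is a minimal base R sketch of one way to compute all gap periods. It assumes that subscriptions of one user do not overlap; the names subs, cutoff and find_gaps are hypothetical. rl_none is computed as Enddate minus Begindate, the same convention the existing rl_ columns appear to follow.

```r
# Reduced copy of the example data: only the columns needed to find gaps.
subs <- data.frame(
  ID        = c(1L, 1L, 1L, 2L, 2L),
  Begindate = as.Date(c("2019-08-26", "2018-08-24", "2015-08-24",
                        "2019-08-26", "2018-08-24")),
  Enddate   = as.Date(c("2022-04-20", "2019-08-23", "2016-08-23",
                        "2019-09-12", "2019-08-23"))
)
cutoff <- as.Date("2022-04-20")  # date on which all subscriptions end

# Hypothetical helper: gap periods for one user (assumes no overlaps).
find_gaps <- function(v, cutoff) {
  v <- v[order(v$Begindate), ]
  gap_start <- v$Enddate + 1                   # day after each subscription ends
  gap_end   <- c(v$Begindate[-1] - 1, cutoff)  # day before the next one starts, or the cutoff
  keep <- gap_start <= gap_end                 # drop empty gaps
  data.frame(
    ID        = v$ID[keep],
    Begindate = gap_start[keep],
    Enddate   = gap_end[keep],
    rl_none   = as.numeric(gap_end[keep] - gap_start[keep])
  )
}

gaps <- do.call(rbind, lapply(split(subs, subs$ID), find_gaps, cutoff = cutoff))
gaps
```

The resulting rows can then be combined with the original rows (after adding the missing columns with 0 or NA) and sorted by ID and Begindate.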

dplyr: How does bind_rows() change the original dataframe

hth1 is a data frame that I already have.
> hth1
Source: local data frame [13 x 14]
Groups: team [13]
team CSK DC DD GL KKR KTK KXIP MI PW RCB RPSG
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 CSK 0 8 11 0 11 2 9 10 4 10 0
2 DC 2 0 8 0 2 1 7 5 3 8 0
3 DD 5 3 0 0 7 2 8 5 2 10 2
4 GL 0 0 2 0 0 0 0 0 0 1 0
5 KKR 5 7 10 2 0 0 5 10 3 15 0
6 KTK 0 0 0 0 2 0 1 0 1 2 0
7 KXIP 8 3 10 2 14 0 0 11 2 6 1
8 MI 12 5 13 2 8 1 7 0 3 11 1
9 PW 2 1 4 0 2 0 4 3 0 1 0
10 RCB 9 3 7 2 3 0 12 8 4 0 1
11 RPSG 0 0 0 2 2 0 1 1 0 1 0
12 RR 8 2 7 0 14 1 7 6 2 7 0
13 SH 3 0 4 0 5 0 4 5 2 5 2
# ... with 2 more variables: RR <dbl>, SH <dbl>
Why do the data frame returned by bind_rows() and the original data frame differ?
> h <- list(hth1)
> hth_b1 <- bind_rows(h)
> identical(hth1, hth_b1)
[1] FALSE
> class(hth_b1)
[1] "grouped_df" "tbl_df" "tbl" "data.frame"
> class(hth1)
[1] "grouped_df" "tbl_df" "tbl" "data.frame"
> setequal(hth1, hth_b1)
TRUE
> anti_join(hth1, hth_b1)
Joining, by = c("team", "CSK", "DC", "DD", "GL", "KKR", "KTK", "KXIP", "MI", "PW", "RCB", "RPSG", "RR", "SH")
Source: local data frame [0 x 14]
Groups: team [13]
# ... with 14 variables: team <chr>, CSK <dbl>, DC <dbl>, DD <dbl>, GL <dbl>,
# KKR <dbl>, KTK <dbl>, KXIP <dbl>, MI <dbl>, PW <dbl>, RCB <dbl>,
# RPSG <dbl>, RR <dbl>, SH <dbl>
What am I missing? I have been stuck here for a long time.
Update 1:
As requested by Benjamin, I ran dput() on both data frames. Here is the output.
> dput(hth_b1)
structure(list(team = c("CSK", "DC", "DD", "GL", "KKR", "KTK",
"KXIP", "MI", "PW", "RCB", "RPSG", "RR", "SH"), CSK = c(0, 2,
5, 0, 5, 0, 8, 12, 2, 9, 0, 8, 3), DC = c(8, 0, 3, 0, 7, 0, 3,
5, 1, 3, 0, 2, 0), DD = c(11, 8, 0, 2, 10, 0, 10, 13, 4, 7, 0,
7, 4), GL = c(0, 0, 0, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0), KKR = c(11,
2, 7, 0, 0, 2, 14, 8, 2, 3, 2, 14, 5), KTK = c(2, 1, 2, 0, 0,
0, 0, 1, 0, 0, 0, 1, 0), KXIP = c(9, 7, 8, 0, 5, 1, 0, 7, 4,
12, 1, 7, 4), MI = c(10, 5, 5, 0, 10, 0, 11, 0, 3, 8, 1, 6, 5
), PW = c(4, 3, 2, 0, 3, 1, 2, 3, 0, 4, 0, 2, 2), RCB = c(10,
8, 10, 1, 15, 2, 6, 11, 1, 0, 1, 7, 5), RPSG = c(0, 0, 2, 0,
0, 0, 1, 1, 0, 1, 0, 0, 2), RR = c(9, 7, 9, 0, 1, 1, 8, 10, 3,
9, 0, 0, 7), SH = c(3, 0, 4, 3, 4, 0, 4, 3, 0, 4, 0, 0, 0)), .Names = c("team",
"CSK", "DC", "DD", "GL", "KKR", "KTK", "KXIP", "MI", "PW", "RCB",
"RPSG", "RR", "SH"), row.names = c(NA, -13L), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), vars = list(team), indices = list(
0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L), group_sizes = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), biggest_group_size = 1L, labels = structure(list(
team = c("CSK", "DC", "DD", "GL", "KKR", "KTK", "KXIP", "MI",
"PW", "RCB", "RPSG", "RR", "SH")), row.names = c(NA, -13L
), class = "data.frame", vars = list(team), .Names = "team"))
>
> dput(hth1)
structure(list(team = c("CSK", "DC", "DD", "GL", "KKR", "KTK",
"KXIP", "MI", "PW", "RCB", "RPSG", "RR", "SH"), CSK = c(0, 2,
5, 0, 5, 0, 8, 12, 2, 9, 0, 8, 3), DC = c(8, 0, 3, 0, 7, 0, 3,
5, 1, 3, 0, 2, 0), DD = c(11, 8, 0, 2, 10, 0, 10, 13, 4, 7, 0,
7, 4), GL = c(0, 0, 0, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0), KKR = c(11,
2, 7, 0, 0, 2, 14, 8, 2, 3, 2, 14, 5), KTK = c(2, 1, 2, 0, 0,
0, 0, 1, 0, 0, 0, 1, 0), KXIP = c(9, 7, 8, 0, 5, 1, 0, 7, 4,
12, 1, 7, 4), MI = c(10, 5, 5, 0, 10, 0, 11, 0, 3, 8, 1, 6, 5
), PW = c(4, 3, 2, 0, 3, 1, 2, 3, 0, 4, 0, 2, 2), RCB = c(10,
8, 10, 1, 15, 2, 6, 11, 1, 0, 1, 7, 5), RPSG = c(0, 0, 2, 0,
0, 0, 1, 1, 0, 1, 0, 0, 2), RR = c(9, 7, 9, 0, 1, 1, 8, 10, 3,
9, 0, 0, 7), SH = c(3, 0, 4, 3, 4, 0, 4, 3, 0, 4, 0, 0, 0)), .Names = c("team",
"CSK", "DC", "DD", "GL", "KKR", "KTK", "KXIP", "MI", "PW", "RCB",
"RPSG", "RR", "SH"), class = c("grouped_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -13L), vars = list(team), labels = structure(list(
team = c("CSK", "DC", "DD", "GL", "KKR", "KTK", "KXIP", "MI",
"PW", "RCB", "RPSG", "RR", "SH")), class = "data.frame", row.names = c(NA,
-13L), vars = list(team), drop = TRUE, .Names = "team"), indices = list(
0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L), drop = TRUE, group_sizes = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), biggest_group_size = 1L)
The two outputs differ: hth1 has an extra drop = TRUE attribute.
I don't understand why it is missing from the other one.
A reproducible example:
library(tidyverse)
test1 <- mtcars %>% group_by(cyl)
test2 <- bind_rows(list(test1))
identical(test1, test2) #FALSE
all_equal(test1, test2) #TRUE
You can check both their attributes and you can see the rownames differ:
rownames(test1)
[1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
[4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
[7] "Duster 360" "Merc 240D" "Merc 230"
[10] "Merc 280" "Merc 280C" "Merc 450SE"
[13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
[16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
[19] "Honda Civic" "Toyota Corolla" "Toyota Corona"
[22] "Dodge Challenger" "AMC Javelin" "Camaro Z28"
[25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2"
[28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
[31] "Maserati Bora" "Volvo 142E"
rownames(test2)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13"
[14] "14" "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26"
[27] "27" "28" "29" "30" "31" "32"
Never expect tibbles to treat your rownames with respect; they may be silently dropped at any time.
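To see which attribute makes identical() return FALSE without reading two dput() dumps side by side, the attribute lists can be compared directly. This is a minimal base R sketch; the helper name differing_attributes and the toy data frames are assumptions for illustration, and the row-name difference is created by hand rather than by bind_rows().

```r
# Two data frames with the same columns but different row names.
a <- data.frame(x = 1:3, row.names = c("r1", "r2", "r3"))
b <- a
rownames(b) <- NULL  # reset to automatic row names

identical(a, b)                            # FALSE: the attributes differ
all.equal(a, b, check.attributes = FALSE)  # TRUE: the column data matches

# Hypothetical helper: names of the attributes that are not identical.
differing_attributes <- function(x, y) {
  ax <- attributes(x)
  ay <- attributes(y)
  nms <- union(names(ax), names(ay))
  same <- vapply(nms, function(n) identical(ax[[n]], ay[[n]]), logical(1))
  nms[!same]
}

differing_attributes(a, b)
```

Applied to hth1 and hth_b1, the same helper should point at the attributes that dput() shows as different.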
Forgive the formatting on this answer, but it would appear that you have labels attached to one object, and not in the other. Where the labels got attached or removed isn't something I can know without looking at code that generates the objects. I've bolded the difference in your objects below.
Note: not formatting this as code is a deliberate choice. Formatting as code prevents me from marking the difference in the structure in bold text
dput(hth_b1)
structure(list(team = c("CSK", "DC", "DD", "GL", "KKR", "KTK",
"KXIP", "MI", "PW", "RCB", "RPSG", "RR", "SH"), CSK = c(0, 2,
5, 0, 5, 0, 8, 12, 2, 9, 0, 8, 3), DC = c(8, 0, 3, 0, 7, 0, 3,
5, 1, 3, 0, 2, 0), DD = c(11, 8, 0, 2, 10, 0, 10, 13, 4, 7, 0,
7, 4), GL = c(0, 0, 0, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0), KKR = c(11,
2, 7, 0, 0, 2, 14, 8, 2, 3, 2, 14, 5), KTK = c(2, 1, 2, 0, 0,
0, 0, 1, 0, 0, 0, 1, 0), KXIP = c(9, 7, 8, 0, 5, 1, 0, 7, 4,
12, 1, 7, 4), MI = c(10, 5, 5, 0, 10, 0, 11, 0, 3, 8, 1, 6, 5
), PW = c(4, 3, 2, 0, 3, 1, 2, 3, 0, 4, 0, 2, 2), RCB = c(10,
8, 10, 1, 15, 2, 6, 11, 1, 0, 1, 7, 5), RPSG = c(0, 0, 2, 0,
0, 0, 1, 1, 0, 1, 0, 0, 2), RR = c(9, 7, 9, 0, 1, 1, 8, 10, 3,
9, 0, 0, 7), SH = c(3, 0, 4, 3, 4, 0, 4, 3, 0, 4, 0, 0, 0)), .Names = c("team",
"CSK", "DC", "DD", "GL", "KKR", "KTK", "KXIP", "MI", "PW", "RCB",
"RPSG", "RR", "SH"), row.names = c(NA, -13L), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), vars = list(team), indices = list(
0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L), group_sizes = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), biggest_group_size = 1L , labels = structure(list(
team = c("CSK", "DC", "DD", "GL", "KKR", "KTK", "KXIP", "MI",
"PW", "RCB", "RPSG", "RR", "SH")), row.names = c(NA, -13L
), class = "data.frame", vars = list(team), .Names = "team"))
dput(hth1)
structure(list(team = c("CSK", "DC", "DD", "GL", "KKR", "KTK",
"KXIP", "MI", "PW", "RCB", "RPSG", "RR", "SH"), CSK = c(0, 2,
5, 0, 5, 0, 8, 12, 2, 9, 0, 8, 3), DC = c(8, 0, 3, 0, 7, 0, 3,
5, 1, 3, 0, 2, 0), DD = c(11, 8, 0, 2, 10, 0, 10, 13, 4, 7, 0,
7, 4), GL = c(0, 0, 0, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0), KKR = c(11,
2, 7, 0, 0, 2, 14, 8, 2, 3, 2, 14, 5), KTK = c(2, 1, 2, 0, 0,
0, 0, 1, 0, 0, 0, 1, 0), KXIP = c(9, 7, 8, 0, 5, 1, 0, 7, 4,
12, 1, 7, 4), MI = c(10, 5, 5, 0, 10, 0, 11, 0, 3, 8, 1, 6, 5
), PW = c(4, 3, 2, 0, 3, 1, 2, 3, 0, 4, 0, 2, 2), RCB = c(10,
8, 10, 1, 15, 2, 6, 11, 1, 0, 1, 7, 5), RPSG = c(0, 0, 2, 0,
0, 0, 1, 1, 0, 1, 0, 0, 2), RR = c(9, 7, 9, 0, 1, 1, 8, 10, 3,
9, 0, 0, 7), SH = c(3, 0, 4, 3, 4, 0, 4, 3, 0, 4, 0, 0, 0)), .Names = c("team",
"CSK", "DC", "DD", "GL", "KKR", "KTK", "KXIP", "MI", "PW", "RCB",
"RPSG", "RR", "SH"), class = c("grouped_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -13L), vars = list(team), labels = structure(list(
team = c("CSK", "DC", "DD", "GL", "KKR", "KTK", "KXIP", "MI",
"PW", "RCB", "RPSG", "RR", "SH")), class = "data.frame", row.names = c(NA,
-13L), vars = list(team), drop = TRUE, .Names = "team"), indices = list(
0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L), drop = TRUE, group_sizes = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), biggest_group_size = 1L)
In the example below, I will add labels to the mtcars data frame, then run it through bind_rows, and you'll see that the labels are no longer present. This is what I believe is happening to your data.
library(Hmisc)
mtcars2 <- mtcars
label(mtcars2, self = FALSE) <- toupper(names(mtcars))
library(dplyr)
mtcars3 <- bind_rows(mtcars2)
identical(mtcars2, mtcars3)
label(mtcars3)

Dplyr: group_by and convert multiple columns to a vector

I have a question about how to convert multiple columns to a vector. I have the following dataset that I would like to group them by their condition and take all the position count into one vector. I know I can use as.vector() to convert them individually but I wonder if there is a dplyr way. Thank you!
test <- structure(list(gene_id = c("gene0", "gene0", "gene0", "gene0",
"gene0", "gene0", "gene0", "gene0", "gene0", "gene0", "gene0",
"gene0", "gene0", "gene0", "gene0", "gene0", "gene0", "gene0",
"gene0", "gene0", "gene0", "gene0", "gene0", "gene0"), codon_index = c(1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L,
3L, 1L, 2L, 3L, 1L, 2L, 3L), position_1_count = c(2L, 7L, 8L,
0L, 2L, 22L, 19L, 15L, 134L, 1L, 127L, 30L, 0L, 0L, 1L, 4L, 65L,
234L, 1L, 3L, 57L, 0L, 4L, 16L), position_2_count = c(0L, 5L,
5L, 0L, 3L, 2L, 3L, 13L, 134L, 0L, 36L, 5L, 0L, 0L, 0L, 1L, 150L,
7L, 0L, 7L, 7L, 0L, 6L, 1L), position_3_count = c(0L, 2L, 1L,
0L, 4L, 0L, 3L, 32L, 43L, 3L, 9L, 1L, 0L, 0L, 0L, 4L, 105L, 1L,
0L, 14L, 5L, 0L, 6L, 1L), condition = structure(c(1L, 1L, 1L,
7L, 7L, 7L, 3L, 3L, 3L, 5L, 5L, 5L, 8L, 8L, 8L, 2L, 2L, 2L, 4L,
4L, 4L, 6L, 6L, 6L), .Label = c("c", "cup", "n", "nup", "p",
"pup", "min", "rich"), class = "factor")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -24L), .Names = c("gene_id",
"codon_index", "position_1_count", "position_2_count", "position_3_count",
"condition"))
> head(a)
# A tibble: 6 × 6
gene_id codon_index position_1_count position_2_count position_3_count condition
<chr> <int> <int> <int> <int> <fctr>
1 gene0 1 2 0 0 c
2 gene0 2 7 5 2 c
3 gene0 3 8 5 1 c
4 gene0 1 0 0 0 min
5 gene0 2 2 3 4 min
6 gene0 3 22 2 0 min
How can we convert this dataset to (I didn't add the column names here)
2 0 0 7 5 2 8 5 1 c
0 0 0 2 3 4 22 2 0 min
Another alternative:
library(purrr)
library(purrrlyr) # slice_rows() and by_slice() were moved from purrr to purrrlyr
test %>%
slice_rows("condition") %>%
by_slice(function(x) unlist(x[-(1:2)]), .to = "vec")
Which gives:
# condition vec
#1 c 2, 7, 8, 0, 5, 5, 0, 2, 1
#2 cup 4, 65, 234, 1, 150, 7, 4, 105, 1
#3 n 19, 15, 134, 3, 13, 134, 3, 32, 43
#4 nup 1, 3, 57, 0, 7, 7, 0, 14, 5
#5 p 1, 127, 30, 0, 36, 5, 3, 9, 1
#6 pup 0, 4, 16, 0, 6, 1, 0, 6, 1
#7 min 0, 2, 22, 0, 3, 2, 0, 4, 0
#8 rich 0, 0, 1, 0, 0, 0, 0, 0, 0
As mentioned in the comments by @advance, if you want the result row-wise:
test %>%
slice_rows("condition") %>%
by_slice(function(x) as.vector(t(x[-(1:2)])), .to = "vec")
# condition vec
#1 c 2, 0, 0, 7, 5, 2, 8, 5, 1
#2 cup 4, 1, 4, 65, 150, 105, 234, 7, 1
#3 n 19, 3, 3, 15, 13, 32, 134, 134, 43
#4 nup 1, 0, 0, 3, 7, 14, 57, 7, 5
#5 p 1, 0, 3, 127, 36, 9, 30, 5, 1
#6 pup 0, 0, 0, 4, 6, 6, 16, 1, 1
#7 min 0, 0, 0, 2, 3, 4, 22, 2, 0
#8 rich 0, 0, 0, 0, 0, 0, 1, 0, 0
Or adapting @DavidArenburg's comment, using do() instead of summarise():
test %>%
group_by(condition) %>%
select(position_1_count:condition) %>%
do(res = c(t(.[,-4])))
Which gives:
# condition res
#1 c 2, 0, 0, 7, 5, 2, 8, 5, 1
#2 cup 4, 1, 4, 65, 150, 105, 234, 7, 1
#3 n 19, 3, 3, 15, 13, 32, 134, 134, 43
#4 nup 1, 0, 0, 3, 7, 14, 57, 7, 5
#5 p 1, 0, 3, 127, 36, 9, 30, 5, 1
#6 pup 0, 0, 0, 4, 6, 6, 16, 1, 1
#7 min 0, 0, 0, 2, 3, 4, 22, 2, 0
#8 rich 0, 0, 0, 0, 0, 0, 1, 0, 0
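For comparison, the same row-wise flattening can be sketched in base R with split() and t(), without any extra packages. The data frame counts below is a reduced copy of the first six rows of the question's data, and the names counts, pos_cols and by_condition are assumptions for illustration.

```r
# First six rows of the example data, position columns and condition only.
counts <- data.frame(
  condition        = c("c", "c", "c", "min", "min", "min"),
  position_1_count = c(2L, 7L, 8L, 0L, 2L, 22L),
  position_2_count = c(0L, 5L, 5L, 0L, 3L, 2L),
  position_3_count = c(0L, 2L, 1L, 0L, 4L, 0L)
)

# Select the position columns by name pattern.
pos_cols <- grep("^position_", names(counts), value = TRUE)

# Split by condition, then read each block row by row:
# t() transposes, and as.vector() reads the transposed matrix column by column.
by_condition <- lapply(
  split(counts[pos_cols], counts$condition),
  function(d) as.vector(t(d))
)

by_condition$c    # 2 0 0 7 5 2 8 5 1
by_condition$min  # 0 0 0 2 3 4 22 2 0
```

The result is a named list with one integer vector per condition, in the row-wise order asked for in the question.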
Am I correct that what you want is a separate vector for all of the counts for each condition? If so, a mix of dplyr and tidyr should do it. First, I gather to put all of the counts in a single column. Then, split to separate by the condition, then use lapply to generate a list, containing a separate vector for each condition:
a %>%
gather(Location, Count, starts_with("position")) %>%
split(.$condition) %>%
lapply(function(x){x$Count})
gives:
$c
[1] 2 7 8 0 5 5 0 2 1
$cup
[1] 4 65 234 1 150 7 4 105 1
$n
[1] 19 15 134 3 13 134 3 32 43
$nup
[1] 1 3 57 0 7 7 0 14 5
$p
[1] 1 127 30 0 36 5 3 9 1
$pup
[1] 0 4 16 0 6 1 0 6 1
$min
[1] 0 2 22 0 3 2 0 4 0
$rich
[1] 0 0 1 0 0 0 0 0 0
If the order matters (and is wrong above), you should be able to sort before splitting, e.g. by adding arrange(codon_index) after gather().
After taking Peterson's idea, I think this code works the best:
test %>%
  gather(Location, Count, starts_with("position")) %>%
  arrange(codon_index) %>%
  group_by(condition) %>%
  do(count = as.vector(t(.$Count)))
The result will look like this
> ans = test %>% gather(Location, Count, starts_with("position")) %>% arrange(codon_index) %>% group_by(condition) %>% do(count = as.vector(t(.$Count)))
# A tibble: 8 × 2
condition count
* <fctr> <list>
1 c <int [9]>
2 cup <int [9]>
3 n <int [9]>
4 nup <int [9]>
5 p <int [9]>
6 pup <int [9]>
7 min <int [9]>
8 rich <int [9]>
> ans$count[[1]]
[1] 2 0 0 7 5 2 8 5 1
Thanks a lot for everyone's help!

Sort character in vector of string in R

I have data like,
df <- structure(list(Sex = structure(c(1L, 1L, 2L, 1L, 2L, 2L, 1L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("F", "M"), class = "factor"),
Age = c(19L, 16L, 16L, 13L, 16L, 30L, 16L, 30L, 16L, 30L,
30L, 16L, 19L, 1L, 30L), I = c(1, 1, 0, 0, 1, 0, 1, 0, 1,
0, 0, 0, 1, 0, 1), E = c(0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1,
1, 0, 1, 0), S = c(1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0,
0, 1), N = c(0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0),
F = c(1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1), T = c(0,
1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0), C = c(1, 1, 1,
0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1), D = c(0, 0, 0, 1, 0,
1, 0, 1, 0, 1, 1, 1, 1, 0, 0), type = c("CIFS", "CITN", "CESF",
"DEFS", "CIFN", "DETS", "CITS", "DEFS", "CIFN", "DEFN", "DETS",
"DETS", "DINF", "CENT", "CIFS"), PO = runif(15, -3, 3), AO = runif(15, -3, 3)), .Names = c("Sex",
"Age", "I", "E", "S", "N", "F", "T", "C", "D", "type", "PO",
"AO"), class = c("tbl_dt", "tbl", "data.table", "data.frame"), row.names = c(NA,
-15L))
I want to sort the characters within each string of the column type (not the rows of the column) and keep the same structure afterwards. For example, CIFS should become CFIS. I tried to do it as follows:
df <- within(df, {
type <- apply(sapply(strsplit(df[, type], split=''), sort), 2,
function(x) paste0(x, collapse = ''))
})
Is there a simpler solution that I have missed?
Since you are using data.table, I would suggest
df[, type := paste(sort(unlist(strsplit(type, ""))), collapse = ""), by = type]
as described in How to sort letters in a string?
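Grouping with by = type means each distinct string is sorted once and the result is reused for every row that carries it. The same idea can be sketched in base R with unique() and match(); the vector types and the other variable names below are assumptions for illustration.

```r
types <- c("CIFS", "CITN", "CIFS", "DETS")

# Sort the characters of each distinct value only once.
u <- unique(types)
sorted_u <- vapply(
  strsplit(u, ""),
  function(ch) paste(sort(ch), collapse = ""),
  character(1)
)

# Map the sorted values back to the original positions.
sorted_types <- sorted_u[match(types, u)]
sorted_types  # "CFIS" "CINT" "CFIS" "DEST"
```

This only pays off when the column has many repeated values; for a handful of rows, the plain vapply() over all strings is just as good.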
This should work for both data.frame and data.table (base R only):
df$type <- vapply(strsplit(df$type, split=''),FUN=function(x)paste(sort(x),collapse=''),"")
Result:
> df
Sex Age I E S N F T C D type PO AO
1 F 19 1 0 1 0 1 0 1 0 CFIS 2.9750666 2.0308410
2 F 16 1 0 0 1 0 1 1 0 CINT 0.7902187 2.0891158
3 M 16 0 1 1 0 1 0 1 0 CEFS -1.7173785 2.4774140
4 F 13 0 1 1 0 1 0 0 1 DEFS 1.5352127 -1.9272470
5 M 16 1 0 0 1 1 0 1 0 CFIN -0.2160741 1.7359897
6 M 30 0 1 1 0 0 1 0 1 DEST 2.6314981 -0.6252466
7 F 16 1 0 1 0 0 1 1 0 CIST -1.6032894 -1.9938226
8 M 30 0 1 1 0 1 0 0 1 DEFS 0.7748583 -2.0935737
9 F 16 1 0 0 1 1 0 1 0 CFIN -2.9368356 0.3363364
10 F 30 0 1 0 1 1 0 0 1 DEFN -0.6506217 2.6681535
11 F 30 0 1 1 0 0 1 0 1 DEST -0.4432578 0.4627441
12 F 16 0 1 1 0 0 1 0 1 DEST 2.0236760 2.7684298
13 F 19 1 0 0 1 1 0 0 1 DFIN -1.1774931 2.6546726
14 F 1 0 1 0 1 0 1 1 0 CENT -2.2365388 2.7902646
15 F 30 1 0 1 0 1 0 1 0 CFIS -1.6139238 -2.4982620

Data Formatting for Time Varying Covariate Cox Proportional Hazards Modeling in R

I am attempting to develop a time varying Cox proportional hazards (CPH) model in R and was wondering if anyone has generated any code to help format data for the counting structure that is used in time varying / time dependent CPH models.
To make the problem reproducible and somewhat simpler, I have extracted the first 100 rows of data, which features 4 variables (id, date, y, and x). The id is a unique subject identifier. The date is an integer sequence from 0 to n days of observation for each id. y is the status or outcome of the hazard analysis and x is the time varying covariate. In this example, once y = 1 has occurred the data for each subject will be censored and no additional data should be included in the ideal output dataframe.
The data are structured so that each subject has 1 row that corresponds to each day of observation.
head(test)
id date y x
1 0 0 0
1 1 0 1
1 2 0 1
1 3 0 1
1 4 0 1
1 5 0 0
However, as I understand it, the cph function in R requires that time varying covariates be structured in such a way that the start and end variables need to be recoded into 3 rows with intervals from (0,1] and (1,5] and (5,6] for the data featured in the head(test) code block above.
The first 100 rows of data can be reconstructed using this code:
dput(test)
structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5,
5, 5, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9,
9, 9, 9), date = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2,
3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2,
3, 4, 5, 6, 7, 8, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 7,
8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
0, 1, 2, 3, 4, 5, 6, 7, 8), y = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 0), x = c(0L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 0L,
1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L,
0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L)), .Names = c("id",
"date", "y", "x"), row.names = c(NA, -100L), class = "data.frame")
Ideally, I am trying to recode these data so that the output would be:
head(ideal_output)
id start end y x
1 0 1 0 0
1 1 5 0 1
1 5 6 0 0
1 6 7 0 1
1 7 9 0 0
1 9 11 0 1
1 11 20 0 0
2 0 8 0 0
3 0 1 0 0
3 1 3 0 1
3 3 4 0 0
3 4 6 0 1
3 6 7 1 1
4 0 2 0 0
4 2 4 0 1
4 4 7 0 0
5 0 9 0 0
6 0 7 0 0
7 0 1 0 0
7 1 2 0 1
7 2 3 0 0
7 3 4 1 0
8 0 3 0 0
8 3 4 1 1
9 0 2 0 0
9 2 5 0 1
9 5 6 1 1
I have done this manually to create the ideal_output above but it is an error prone process and untenable for the hundreds of id's and several covariates that I need to evaluate. Consequently, any help would be greatly appreciated in developing an automated way to approach this data formatting challenge. Thanks!
I think the SurvSplit() function is the answer to your problem.
See:
http://www.rdocumentation.org/packages/eha/functions/SurvSplit
Alternatively, try to google: Chapter 5 Extended and Stratified Cox - nus.edu.sg
As @Ham suggests, you can use tmerge. Here is an example
> #####
> # `dat` is the data.frame you provided
> library(survival)
>
> # make baseline data.frame for tmerge
> baseline <- by(dat, dat$id, function(x){
+ n <- nrow(x)
+ # avoid slow data.frame call
+ structure(list(
+ id = x$id[1], start = x$date[1], x = x$x[1], end = x$date[n],
+ dummy = 0),
+ row.names = 1L, class = "data.frame")
+ })
> baseline <- do.call(rbind, baseline)
> baseline # show baseline data
id start x end dummy
1 1 0 0 19 0
2 2 0 0 7 0
3 3 0 0 12 0
4 4 0 0 6 0
5 5 0 0 8 0
6 6 0 0 6 0
7 7 0 0 11 0
8 8 0 0 14 0
9 9 0 0 8 0
>
> # use tmerge
> final_dat <- tmerge(baseline, baseline, id = id, y = event(end, dummy))
> final_dat <- tmerge(
+ final_dat, dat, id = id, y = cumtdc(date, y), x = tdc(date, x))
> final_dat[final_dat$id == 3, ] # look at one example
id start x end dummy tstart tstop y
27 3 0 0 12 0 0 1 0
28 3 0 1 12 0 1 2 0
29 3 0 1 12 0 2 3 0
30 3 0 0 12 0 3 4 0
31 3 0 1 12 0 4 5 0
32 3 0 1 12 0 5 6 0
33 3 0 1 12 0 6 7 1
34 3 0 1 12 0 7 8 1
35 3 0 1 12 0 8 9 1
36 3 0 1 12 0 9 10 1
37 3 0 1 12 0 10 11 1
38 3 0 0 12 0 11 12 1
>
> # keep rows only up to and including the first event (first y = 1) per id
> final_dat <- within(final_dat, ycum <- unlist(tapply(y, id, cumsum)))
> final_dat <- final_dat[final_dat$ycum < 2, ]
> final_dat$ycum <- NULL
> final_dat[final_dat$id == 3, ]
id start x end dummy tstart tstop y
27 3 0 0 12 0 0 1 0
28 3 0 1 12 0 1 2 0
29 3 0 1 12 0 2 3 0
30 3 0 0 12 0 3 4 0
31 3 0 1 12 0 4 5 0
32 3 0 1 12 0 5 6 0
33 3 0 1 12 0 6 7 1
>
> # remove rows whose x value matches the previous row's x. But
> # * keep those where y = 1
> # * update tstop for the last row where the last row may be removed
> final_dat <- within(
+ final_dat,
+ max_t <- unlist(tapply(tstop, id, function(z) rep(max(z), length(z)))))
> final_dat <- within(
+ final_dat,
+ keep <- unlist(tapply(x, id, function(z)
+ c(TRUE, z[-1] != z[-length(z)]))))
>
> final_dat <- final_dat[final_dat$keep | final_dat$y, ]
>
> final_dat <- within(
+ final_dat, is_last <- unlist(tapply(id, id, function(z)
+ seq_along(z) == length(z))))
>
> needs_update <- final_dat$is_last & !final_dat$y
> final_dat[needs_update, "tstop"] <-
+ final_dat[needs_update, "max_t"] + 1
>
> # have to update the tstop column
> final_dat <- within(final_dat, tstop <- unlist(by(
+ cbind(tstart, tstop), id, function(z) {
+ n <- nrow(z)
+ c(z$tstart[-1], z$tstop[n])
+ })))
>
> # show final data.frame
> final_dat[, c("id", "tstart", "tstop", "y", "x")]
id tstart tstop y x
1 1 0 1 0 0
2 1 1 5 0 1
6 1 5 6 0 0
7 1 6 7 0 1
8 1 7 9 0 0
10 1 9 11 0 1
12 1 11 20 0 0
20 2 0 8 0 0
27 3 0 1 0 0
28 3 1 3 0 1
30 3 3 4 0 0
31 3 4 6 0 1
33 3 6 7 1 1
39 4 0 2 0 0
41 4 2 4 0 1
43 4 4 7 0 0
45 5 0 9 0 0
53 6 0 7 0 0
59 7 0 1 0 0
60 7 1 2 0 1
61 7 2 3 0 0
62 7 3 4 1 0
70 8 0 3 0 0
73 8 3 4 1 1
84 9 0 2 0 0
86 9 2 5 0 1
89 9 5 6 1 1
The code after tmerge can be made faster with dplyr or data.table. If you have more covariates than the single x, I suggest that you store an index column in dat and use it in tdc() within tmerge instead of x, then merge the tables afterwards with merge(). You will also need to update the line that builds the keep indicator. Otherwise the code should be identical.
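As an alternative to the tmerge post-processing, the daily rows can be collapsed into (start, end] intervals directly with a run-length approach in base R. This is a minimal sketch under stated assumptions: a single covariate x, one row per day per id, and censoring after the first y = 1. The helper name to_intervals and the reduced data (ids 3 and 4 from the question, id 3 shortened to nine days) are chosen for illustration.

```r
# Reduced daily data: id 3 has an event on day 6, id 4 has no event.
daily <- data.frame(
  id   = c(rep(3, 9), rep(4, 7)),
  date = c(0:8, 0:6),
  y    = c(0, 0, 0, 0, 0, 0, 1, 0, 0,   0, 0, 0, 0, 0, 0, 0),
  x    = c(0, 1, 1, 0, 1, 1, 1, 1, 1,   0, 0, 1, 1, 0, 0, 0)
)

# Hypothetical helper: collapse one subject's daily rows into intervals.
to_intervals <- function(d) {
  d <- d[order(d$date), ]
  # Drop everything after the first event day.
  first_event <- match(1, d$y)
  if (!is.na(first_event)) d <- d[seq_len(first_event), ]
  # A new interval starts whenever x or y differs from the previous day.
  n <- nrow(d)
  new_run <- c(TRUE, d$x[-1] != d$x[-n] | d$y[-1] != d$y[-n])
  run_id <- cumsum(new_run)
  data.frame(
    id    = d$id[1],
    start = as.vector(tapply(d$date, run_id, min)),
    end   = as.vector(tapply(d$date, run_id, max)) + 1,
    y     = as.vector(tapply(d$y, run_id, max)),
    x     = as.vector(tapply(d$x, run_id, function(v) v[1]))
  )
}

intervals <- do.call(rbind, lapply(split(daily, daily$id), to_intervals))
rownames(intervals) <- NULL
intervals
```

For ids 3 and 4 this gives the same intervals as the ideal_output in the question. With several covariates, the new_run condition would need one comparison per covariate.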
I think the tmerge() function is the answer to your problem.
look at: https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf
