I've converted a data frame into wide format and now want to compute paired t-tests to obtain p-values. I have managed to do this for each pair of columns individually, but it's a lot more code than I feel is necessary. I'm still very new to R, data and coding generally, and couldn't easily see a solution here on Stack Overflow.
My wide data frame is:
> head(df_wide)
# A tibble: 6 x 21
Assessor Appearance1 Appearance2 Aroma_1 Aroma_2 Flavour_1 Flavour_2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 10 10 10 10 10 10
2 6 7 7 5 8 4
# ... with 14 more variables
I want to perform a paired t-test for each attribute, i.e. Appearance1 vs. Appearance2, Aroma1 vs. Aroma2, etc. The 14 other variables are all <dbl> and are also attributes that should be tested as paired columns.
Ideally, the output would be a vector of just the p-values, rather than having all the information. I've managed to do that coding for individual pairs, but I wanted to know if this would be possible to do as part of performing the T-Test over multiple pairs of columns.
Here is the code I have for the first two attributes:
p_values <- c(t.test(df_wide$`Appearance1`, df_wide$`Appearance2`, paired = T)[["p.value"]],
t.test(df_wide$`Aroma1`, df_wide$`Aroma2`, paired = T)[["p.value"]])
This creates the vector I want, but is cumbersome and error-prone. Ideally, I'd be able to perform it over all the pairs at once without needing to use column names.
I do have the original data frame in long format, if it would be easier to do it using that (EDIT: used dput() for the first 20 rows instead of head()):
> dput(df_test[1:20,])
structure(list(Assessor = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10),
Product = c("MC", "MV", "MC", "MV", "MV", "MC", "MC", "MV", "MV", "MC", "MC", "MV", "MC", "MV", "MC", "MV", "MV", "MC", "MV", "MC"),
Appearance = c(10, 10, 6, 7, 9, 6, 7, 8, 9, 8, 10, 8, 6, 6, 9, 8, 8, 8, 9, 9),
Aroma = c(10, 10, 7, 5, 9, 8, 6, 7, 5, 7, 9, 8, 6, 6, 5, 3, 6, 7, 9, 6),
Flavour = c(10, 10, 8, 4, 10, 7, 7, 6, 8, 8, 9, 10, 8, 8, 6, 8, 7, 9, 9, 8),
Texture = c(10, 10, 8, 8, 9, 6, 7, 8, 8, 8, 9, 10, 8, 8, 9, 8, 8, 9, 9, 8),
`JAR Colour` = c(3, 2, 2, 3, 3, 3, 3, 3, 3, 2, 3, 2, 3, 2, 3, 3, 3, 3, 3, 3),
`JAR Strength Chocolate` = c(2, 2, 3, 2, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 3, 3, 2),
`JAR Strength Vanilla` = c(3, 3, 3, 2, 3, 2, 3, 3, 2, 3, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3),
`JAR Sweetness` = c(2, 3, 3, 1, 3, 2, 2, 2, 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 3),
`JAR Creaminess` = c(3, 3, 3, 3, 3, 1, 3, 2, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3),
`Overall Acceptance` = c(9, 10, 8, 4, 10, 5, 7, 7, 8, 8, 9, 10, 8, 8, 8, 8, 8, 9, 8, 8)),
row.names = c(NA, -20L), class = c("tbl_df", "tbl", "data.frame"))
The Product variable is the one which was used to make the paired columns in the wide format data frame. Thanks in advance.
If I understand correctly, this can be done directly on the long-format data:
df <- structure(list(Assessor = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10),
Product = c("MC", "MV", "MC", "MV", "MV", "MC", "MC", "MV", "MV", "MC", "MC", "MV", "MC", "MV", "MC", "MV", "MV", "MC", "MV", "MC"),
Appearance = c(10, 10, 6, 7, 9, 6, 7, 8, 9, 8, 10, 8, 6, 6, 9, 8, 8, 8, 9, 9),
Aroma = c(10, 10, 7, 5, 9, 8, 6, 7, 5, 7, 9, 8, 6, 6, 5, 3, 6, 7, 9, 6),
Flavour = c(10, 10, 8, 4, 10, 7, 7, 6, 8, 8, 9, 10, 8, 8, 6, 8, 7, 9, 9, 8),
Texture = c(10, 10, 8, 8, 9, 6, 7, 8, 8, 8, 9, 10, 8, 8, 9, 8, 8, 9, 9, 8),
`JAR Colour` = c(3, 2, 2, 3, 3, 3, 3, 3, 3, 2, 3, 2, 3, 2, 3, 3, 3, 3, 3, 3),
`JAR Strength Chocolate` = c(2, 2, 3, 2, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 3, 3, 2),
`JAR Strength Vanilla` = c(3, 3, 3, 2, 3, 2, 3, 3, 2, 3, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3),
`JAR Sweetness` = c(2, 3, 3, 1, 3, 2, 2, 2, 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 3),
`JAR Creaminess` = c(3, 3, 3, 3, 3, 1, 3, 2, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3),
`Overall Acceptance` = c(9, 10, 8, 4, 10, 5, 7, 7, 8, 8, 9, 10, 8, 8, 8, 8, 8, 9, 8, 8)),
row.names = c(NA, -20L), class = c("tbl_df", "tbl", "data.frame"))
head(df)
#> # A tibble: 6 x 12
#> Assessor Product Appearance Aroma Flavour Texture `JAR Colour`
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 MC 10 10 10 10 3
#> 2 1 MV 10 10 10 10 2
#> 3 2 MC 6 7 8 8 2
#> 4 2 MV 7 5 4 8 3
#> 5 3 MV 9 9 10 9 3
#> 6 3 MC 6 8 7 6 3
#> # ... with 5 more variables: JAR Strength Chocolate <dbl>,
#> # JAR Strength Vanilla <dbl>, JAR Sweetness <dbl>, JAR Creaminess <dbl>,
#> # Overall Acceptance <dbl>
library(tidyverse)
map_df(df[-c(1:2)], ~t.test(.x ~ df$Product, paired = TRUE)$p.value)
#> # A tibble: 1 x 10
#> Appearance Aroma Flavour Texture `JAR Colour` `JAR Strength Chocolate`
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.496 0.576 1 0.309 0.678 1
#> # ... with 4 more variables: JAR Strength Vanilla <dbl>, JAR Sweetness <dbl>,
#> # JAR Creaminess <dbl>, Overall Acceptance <dbl>
sapply(df[-c(1:2)], function(x) t.test(x ~ df$Product, paired = TRUE)$p.value)
#> Appearance Aroma Flavour
#> 0.4961016 0.5763122 1.0000000
#> Texture JAR Colour JAR Strength Chocolate
#> 0.3092332 0.6783097 1.0000000
#> JAR Strength Vanilla JAR Sweetness JAR Creaminess
#> 0.6783097 1.0000000 0.4433319
#> Overall Acceptance
#> 0.7803523
Created on 2021-06-22 by the reprex package (v2.0.0)
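A caveat on the formula calls above: as far as I'm aware, R 4.4.0 and later reject paired = TRUE in the formula interface of t.test(), so the code may stop with an error on a current installation. A minimal sketch that avoids the formula interface is to split the long data by Product and pass the two vectors explicitly. The toy data frame and the names mc, mv and attrs below are made up for illustration and are not from the question.

```r
# Toy long-format data in the same shape as the question's df
# (values are invented for illustration).
toy <- data.frame(
  Assessor   = c(1, 1, 2, 2, 3, 3, 4, 4),
  Product    = c("MC", "MV", "MC", "MV", "MV", "MC", "MC", "MV"),
  Appearance = c(10, 9, 6, 7, 9, 6, 7, 8),
  Aroma      = c(10, 8, 7, 5, 9, 8, 6, 7)
)

# Sort so that row i of each product belongs to the same assessor.
toy <- toy[order(toy$Product, toy$Assessor), ]
mc <- toy[toy$Product == "MC", ]
mv <- toy[toy$Product == "MV", ]

# Guard against silently mis-paired rows.
stopifnot(identical(mc$Assessor, mv$Assessor))

# One paired test per attribute column, keeping only the p-value.
attrs <- setdiff(names(toy), c("Assessor", "Product"))
p_values <- sapply(attrs, function(a) {
  t.test(mc[[a]], mv[[a]], paired = TRUE)$p.value
})
p_values
```

Sorting by Assessor before splitting makes the pairing explicit instead of relying on the row order of the input.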
I have data which looks like this:
library(stringi)  # provides stri_rand_strings()
library(dplyr)
library(magrittr)
Codes = c(1, 2, 3, 4, 5, 6, 9)
Codes2 = c(Codes, rep(9, 100))
data <- data.frame(
MASTER_HCU_DI = do.call(paste0, Map(stri_rand_strings, n=100, length=c(4, 3),
pattern = c('[A-Z]', '[0-9]'))),
CODE_1 = sample(Codes, 100, replace = T))
data %<>%
mutate(CODE_2 = if_else(CODE_1 == 9, 9, sample(Codes2, 100, replace = T)),
CODE_3 = if_else(CODE_2 == 9, 9, sample(Codes2, 100, replace = T)))
What I want to do is find the total number of people with each of the possible values of CODE_1, CODE_2, and CODE_3; across all three Codes.
Where all of someone's CODE start with a 9, they are counted as missing. Otherwise, I'd like to ignore the CODE values which start with a 9.
This code does what I want, but seems cumbersome:
data %<>%
mutate(Sum_grp1 = if_else(CODE_1 == 1 | CODE_2 == 1 | CODE_3 == 1, 1, 0),
Sum_grp2 = if_else(CODE_1 == 2 | CODE_2 == 2 | CODE_3 == 2, 1, 0),
Sum_grp3 = if_else(CODE_1 == 3 | CODE_2 == 3 | CODE_3 == 3, 1, 0),
Sum_grp4 = if_else(CODE_1 == 4 | CODE_2 == 4 | CODE_3 == 4, 1, 0),
Sum_grp5 = if_else(CODE_1 == 5 | CODE_2 == 5 | CODE_3 == 5, 1, 0),
Sum_grp6 = if_else(CODE_1 == 6 | CODE_2 == 6 | CODE_3 == 6, 1, 0),
Missing = if_else(CODE_1 == 9 & CODE_2 == 9 & CODE_3 == 9, 1, 0))
Group_counts <- data.frame(
Group = c("Group_1", "Group_2", "Group_3", "Group_4", "Group_5", "Group_6", "Missing"),
Sum = c(sum(data$Sum_grp1 == 1),
sum(data$Sum_grp2 == 1),
sum(data$Sum_grp3 == 1),
sum(data$Sum_grp4 == 1),
sum(data$Sum_grp5 == 1),
sum(data$Sum_grp6 == 1),
sum(data$Missing == 1)))
Expected output looks like this:
Is there an easier way to do this?
Thanks.
You can get the data in long format and use count -
library(dplyr)
library(tidyr)
data %>% pivot_longer(cols = -MASTER_HCU_DI) %>% count(name, value)
Is this what you expect?
data %>% pivot_longer(cols = -MASTER_HCU_DI) %>% group_by(name) %>%
summarise(Sum = sum(value), .groups = 'drop')
# A tibble: 3 x 2
name Sum
<chr> <dbl>
1 GROUP_1 409
2 GROUP_2 897
3 GROUP_3 900
As I understand it, the following functionality outlined in the question is not addressed by the existing answers:
Where all of someone's CODE start with a 9, they are counted as missing. Otherwise, I'd like to ignore the CODE values which start with a 9.
Here is my approach to include this functionality:
library(purrr)
library(dplyr)
data %>%
pmap_dfr(~ table(c(...)[-1])) %>%
set_names(~ paste0("Group_", .x)) %>%
mutate(Missing = ifelse(`Group_9` == 3, 1, NA)) %>%
select(-`Group_9`) %>%
colSums(na.rm = T) %>%
tibble::tibble(Group = names(.), Sum = .) %>%
arrange(Group)
Returns:
# A tibble: 7 x 2
Group Sum
<chr> <dbl>
1 Group_1 23
2 Group_2 13
3 Group_3 16
4 Group_4 13
5 Group_5 13
6 Group_6 11
7 Missing 15
Data used:
data <- structure(list(MASTER_HCU_DI = c("VBHT228", "CAAO199", "NDDI124", "AVZV996", "KMOP513", "AALT248", "IGZC617", "ZDHO229", "GXYV745", "PDTW465", "SEPM505", "ZJWQ323", "VRRU692", "NHOY962", "BBFR276", "NVML939", "VHPV534", "YTXG467", "BOCT360", "ONEO498", "CICL849", "SAIK461", "NZGL739", "NIFD497", "XMVE276", "JHZM922", "LCLV707", "BPKN209", "YTZU211", "LUNI891", "CQTC089", "FBDZ269", "VKCI112", "BLJH968", "LLML439", "TDRV973", "RTFR863", "GZAN917", "WSUI006", "JILN883", "CAHM719", "JCMI028", "BGFZ774", "BGVZ374", "WBUJ792", "DLVT690", "AVKE534", "TDPU030", "SKFI697", "UCLY688", "OODZ687", "IIPR924", "TSES431", "CQSN693", "ZQGJ398", "FMGH661", "ZORF207", "MDWD343", "OBDM142", "SATV193", "MUKZ136", "INAE029", "MWDB125", "JUXN395", "LQGW143", "ALKP557", "WQAR962", "UYZI622", "WKYM520", "WUMH621", "GLRV451", "ISHG990", "OCNW161", "WQMS244", "UQEF227", "IAEZ636", "TEZJ280", "GCCJ844", "EVTF869", "JGJH568", "MDPH890", "EHKR422", "NBIM361", "XEWM477", "PBJP921", "FGEG840", "UJOO120", "XZTB081", "GXCQ610", "ANAR117", "TNIP023", "GLFN787", "SYYV532", "GOTY296", "TXME798", "SUZK405", "VWHY631", "HAXW159", "CCJN761", "GGUN719"), GROUP_1 = c(6, 1, 4, 9, 1, 3, 3, 2, 3, 2, 1, 9, 4, 3, 1, 1, 4, 6, 5, 3, 3, 3, 9, 9, 2, 3, 4, 6, 1, 1, 1, 9, 6, 1, 5, 9, 5, 9, 5, 5, 1, 2, 9, 1, 3, 9, 9, 3, 5, 6, 1, 4, 1, 6, 4, 5, 2, 6, 4, 1, 5, 9, 1, 4, 3, 1, 2, 1, 2, 9, 1, 4, 3, 1, 2, 3, 6, 1, 6, 2, 2, 4, 1, 2, 6, 9, 3, 4, 2, 9, 6, 1, 3, 3, 1, 5, 4, 2, 4, 9), GROUP_2 = c(9, 9, 5, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 5, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 5, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9), GROUP_3 = c(9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 5, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9)), class = "data.frame", row.names = c(NA, -100L))
We can use gather
library(dplyr)
library(tidyr)
data %>%
gather('name', 'value', -MASTER_HCU_DI) %>%
count(name, value)
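gather() is superseded in current tidyr, and count(name, value) tallies per column rather than per person. A rough sketch with pivot_longer() that counts each person once per code and applies the "Missing" rule from the question might look like this. The toy data, the all_missing helper column and the output name group_counts are assumptions for illustration only.

```r
library(dplyr)
library(tidyr)

# Small made-up example in the same shape as the question's data
# (IDs and codes are illustrative, not taken from the original post).
toy <- tibble(
  MASTER_HCU_DI = c("AAAA111", "BBBB222", "CCCC333", "DDDD444"),
  CODE_1 = c(1, 2, 9, 1),
  CODE_2 = c(1, 9, 9, 3),
  CODE_3 = c(9, 9, 9, 9)
)

group_counts <- toy %>%
  pivot_longer(cols = -MASTER_HCU_DI, names_to = "name", values_to = "value") %>%
  group_by(MASTER_HCU_DI) %>%
  # flag people whose codes are all 9
  mutate(all_missing = all(value == 9)) %>%
  ungroup() %>%
  # keep the 9s only for fully missing people; drop them for everyone else
  filter(all_missing | value != 9) %>%
  mutate(Group = if_else(all_missing, "Missing", paste0("Group_", value))) %>%
  # count each person at most once per group
  distinct(MASTER_HCU_DI, Group) %>%
  count(Group, name = "Sum")

group_counts
```

The distinct() step is what stops a person with the same code in two columns from being counted twice.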
I have a dataframe that looks like this
> head(printing_id_map_unique_frames)
# A tibble: 6 x 5
# Groups: frame_number [6]
X1 X2 X3 row_in_frame frame_number
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 3 15 1
2 1 2 3 15 2
3 1 2 3 15 3
4 1 2 3 15 4
5 1 2 3 15 5
6 1 2 3 15 6
As you can see, X1, X2, X3 and row_in_frame are identical across these rows.
However, eventually you get to a row where they change:
X1 X2 X3 row_in_frame frame_number
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 3 15 32
2 1 2 3 15 33
3 1 2 3 5 34**
4 1 4 5 15 35
5 1 4 5 15 36
What I would like to do is essentially compute a dataframe that looks like:
X1 X2 X3 row_in_frame num_duplicates
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 3 15 33
2 1 2 3 5 1
...
Essentially, what I want is to "collapse" over identical first 4 columns and count how many rows of that type there are in the "num_duplicates" column.
Is there a nice way to do this in dplyr, without a messy for loop that tracks a running count and checks for changes?
Below please find a full data structure via dput:
> dput(printing_id_map_unique_frames)
structure(list(X1 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), X2 = c(2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4
), X3 = c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5), row_in_frame = c(15, 15, 15, 15,
15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15,
15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 5, 15, 15,
15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15,
15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 5
), frame_number = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45,
46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61,
62, 63, 64, 65, 66, 67, 68)), row.names = c(NA, -68L), class = c("tbl_df",
"tbl", "data.frame"))
Here is one option with count
library(dplyr) # 1.0.0
df1 %>%
count(!!! rlang::syms(names(.)[1:4]))
Or specify the unquoted column names
df1 %>%
count(X1, X2, X3, row_in_frame)
count sorts its output by the grouping columns. If we don't want to change the order, an option is to convert the first 4 columns to factors whose levels are the unique values in order of first appearance, and then apply count
df1 %>%
mutate(across(1:4, ~ factor(.x, levels = unique(.x)))) %>%
count(!!! rlang::syms(names(.)[1:4])) %>%
type.convert(as.is = TRUE)
# A tibble: 4 x 5
# X1 X2 X3 row_in_frame n
# <int> <int> <int> <int> <int>
#1 1 2 3 15 33
#2 1 2 3 5 1
#3 1 4 5 15 33
#4 1 4 5 5 1
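Note that count() merges all rows with the same combination, wherever they occur. If the same combination can come back later and should be reported as a separate run (an assumption; the question does not say either way), a run identifier can be built first. The sketch below uses made-up data, and the helper columns key and run are hypothetical names.

```r
library(dplyr)

# Made-up frames: the combination (1, 2, 3, 15) appears, changes, and returns.
toy <- tibble(
  X1 = c(1, 1, 1, 1, 1, 1),
  X2 = c(2, 2, 2, 4, 4, 2),
  X3 = c(3, 3, 3, 5, 5, 3),
  row_in_frame = c(15, 15, 5, 15, 15, 15),
  frame_number = 1:6
)

runs <- toy %>%
  mutate(
    # one string per row describing the four columns of interest
    key = paste(X1, X2, X3, row_in_frame, sep = "_"),
    # increment the run id every time the key differs from the previous row
    run = cumsum(key != lag(key, default = first(key))) + 1
  ) %>%
  group_by(run, X1, X2, X3, row_in_frame) %>%
  summarise(num_duplicates = n(), .groups = "drop") %>%
  select(-run)

runs
```

Because the result is grouped by run first, the rows also come out in order of appearance without the factor conversion.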
This code generates a data frame just so:
library(tidyverse)
A = c(7, 4, 3, 12, 6)
B = c(1, 10, 9, 8, 5)
C = c(5, 3, 1, 7, 6)
df <- tibble(A, B, C) %>% gather(letter1, rank)
nested <- df %>% group_by(letter1) %>% nest(ranks = c(rank))
nested
A grouped_df: 3 × 2
letter1 ranks
<chr> <list>
A 7, 4, 3, 12, 6
B 1, 10, 9, 8, 5
C 5, 3, 1, 7, 6
This is the desired data frame:
A tibble: 9 × 4
letter1 letter2 data1 data2
<chr> <chr> <list> <list>
A A 7, 4, 3, 12, 6 7, 4, 3, 12, 6
B A 1, 10, 9, 8, 5 7, 4, 3, 12, 6
C A 5, 3, 1, 7, 6 7, 4, 3, 12, 6
A B 7, 4, 3, 12, 6 1, 10, 9, 8, 5
B B 1, 10, 9, 8, 5 1, 10, 9, 8, 5
C B 5, 3, 1, 7, 6 1, 10, 9, 8, 5
A C 7, 4, 3, 12, 6 5, 3, 1, 7, 6
B C 1, 10, 9, 8, 5 5, 3, 1, 7, 6
C C 5, 3, 1, 7, 6 5, 3, 1, 7, 6
Once this step is solved, I'll run a mutate using data1 and data2 to get value, and then selecting letter1, letter2 and value will give an edgelist. I'm working with about 700 letters and the ranks lists will all be the same size and contain about 20 elements.
I'd expected to be able to use expand or expand.grid, but to no avail. Any tidyverse suggestions will be greatly appreciated.
crossing can be used
library(tidyr)
library(purrr)
library(dplyr)
crossing(ind1 = seq_len(nrow(nested)),
ind2 = seq_len(nrow(nested))) %>%
pmap_dfr(~ bind_cols(nested[..1,], nested[..2,]) )
We can use crossing after renaming the second dataframe.
tidyr::crossing(nested, setNames(nested, c('letter2', 'rank2')))
# letter1 ranks letter2 rank2
#1 A 7, 4, 3, 12, 6 A 7, 4, 3, 12, 6
#2 A 7, 4, 3, 12, 6 B 1, 10, 9, 8, 5
#3 A 7, 4, 3, 12, 6 C 5, 3, 1, 7, 6
#4 B 1, 10, 9, 8, 5 A 7, 4, 3, 12, 6
#5 B 1, 10, 9, 8, 5 B 1, 10, 9, 8, 5
#6 B 1, 10, 9, 8, 5 C 5, 3, 1, 7, 6
#7 C 5, 3, 1, 7, 6 A 7, 4, 3, 12, 6
#8 C 5, 3, 1, 7, 6 B 1, 10, 9, 8, 5
#9 C 5, 3, 1, 7, 6 C 5, 3, 1, 7, 6
The same is also valid for expand_grid.
tidyr::expand_grid(nested, setNames(nested, c('letter2', 'rank2')))
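To illustrate the follow-up step the question describes (a mutate over the two list columns, then an edge list), here is a minimal sketch. The Spearman correlation is only a stand-in for whatever value is actually wanted, and the names edges and value are assumptions. The nested tibble is rebuilt ungrouped so the example is self-contained.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Ungrouped version of the question's nested data.
nested <- tibble(
  letter1 = c("A", "B", "C"),
  ranks   = list(c(7, 4, 3, 12, 6), c(1, 10, 9, 8, 5), c(5, 3, 1, 7, 6))
)

edges <- expand_grid(
  nested %>% rename(data1 = ranks),
  nested %>% rename(letter2 = letter1, data2 = ranks)
) %>%
  # compute one number per pair of rank vectors (placeholder metric)
  mutate(value = map2_dbl(data1, data2, ~ cor(.x, .y, method = "spearman"))) %>%
  select(letter1, letter2, value)

edges
```

With about 700 letters this produces 490,000 rows, so if the value is symmetric it may be worth filtering to letter1 <= letter2 before the mutate.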
I have a dataset of 3 variables: ID, Date and Years_service. Like this:
library(data.table)
data <- structure(list(ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2), Date = structure(c(1230768000,
1233446400, 1235865600, 1238544000, 1241136000, 1243814400, 1246406400,
1249084800, 1251763200, 1254355200, 1257033600, 1259625600, 1262304000,
1264982400, 1267401600, 1270080000, 1272672000, 1275350400, 1277942400,
1280620800, 1283299200, 1285891200, 1288569600, 1291161600, 1293840000,
1296518400, 1298937600, 1301616000, 1304208000, 1306886400, 1309478400,
1312156800, 1314835200, 1317427200, 1320105600, 1322697600, 1325376000,
1328054400, 1330560000, 1333238400, 1335830400, 1338508800, 1341100800,
1343779200, 1346457600, 1349049600, 1351728000, 1354320000, 1356998400,
1359676800, 1362096000, 1364774400, 1367366400, 1370044800, 1372636800,
1375315200, 1377993600, 1380585600, 1383264000, 1385856000, 1388534400,
1391212800, 1393632000, 1396310400, 1398902400, 1401580800, 1404172800,
1406851200, 1409529600, 1412121600, 1414800000, 1417392000, 1420070400,
1422748800, 1425168000, 1427846400, 1430438400, 1433116800, 1435708800,
1438387200, 1441065600, 1443657600, 1446336000, 1448928000, 1451606400,
1454284800, 1456790400, 1459468800, 1462060800, 1464739200, 1467331200,
1470009600, 1472688000, 1475280000, 1330560000, 1333238400, 1335830400,
1338508800, 1341100800, 1343779200, 1346457600, 1349049600, 1351728000,
1354320000, 1356998400, 1359676800, 1362096000, 1364774400, 1367366400,
1370044800, 1372636800, 1375315200, 1377993600, 1380585600, 1383264000,
1385856000, 1388534400, 1391212800, 1393632000, 1396310400, 1398902400,
1401580800, 1404172800, 1406851200, 1409529600, 1412121600, 1414800000,
1417392000, 1420070400, 1422748800, 1425168000, 1427846400, 1430438400,
1433116800, 1435708800, 1438387200, 1441065600, 1443657600, 1446336000,
1448928000, 1451606400, 1454284800, 1456790400, 1459468800, 1462060800,
1464739200, 1467331200, 1470009600, 1472688000, 1475280000), class =
c("POSIXct",
"POSIXt"), tzone = "UTC"), Years_service = c(19, 19, 19, 19,
19, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21,
21, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22, 22, 22, 22, 22, 22,
22, 22, 22, 22, 22, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23,
23, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 25, 25, 25,
25, 25, 25, 25, 25, 25, 25, 25, 25, 26, 26, 26, 26, 26, 26, 26,
26, 26, 26, 26, 26, 27, 27, 27, 27, 27, 8, 8, 8, 8, 8, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10,
10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13),
month_1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2,
3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8,
9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1,
2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7,
8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5,
6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)), .Names = c("ID", "Date",
"Years_service", "month_1"), row.names = c(NA, -150L), class =
c("data.table",
"data.frame"))
I want a new variable that contains for each ID the date for which years of service is maximum and the month of date is minimum. Something like this:
ID Date Years_service Date_1
1: 1 2009-01-01 19 2016-06-01
2: 1 2009-02-01 19 2016-06-01
3: 1 2009-03-01 19 2016-06-01
4: 1 2009-04-01 19 2016-06-01
5: 1 2009-05-01 19 2016-06-01
---
146: 2 2016-06-01 12 2016-08-01
147: 2 2016-07-01 12 2016-08-01
148: 2 2016-08-01 13 2016-08-01
149: 2 2016-09-01 13 2016-08-01
150: 2 2016-10-01 13 2016-08-01
My desired output is Date_1
I tried this:
data[,Date_1 := Date[which.max(Years_service) & which.min(month_1)], by = ID]
but it didn't work.
How can I achieve this?
One option is to get the row index (.I) of the rows where 'Years_service' is at its maximum for each 'ID'. Using that index, take the 'Date' at the minimum 'month_1' within each 'ID', then join the result back to the original data on 'ID' to create the 'Date_1' column
i1 <- data[, .I[Years_service == max(Years_service)], ID]$V1
data[data[i1, Date[which.min(month_1)], ID], Date_1 :=V1, on = .(ID)]
data
# ID Date Years_service month_1 Date_1
# 1: 1 2009-01-01 19 1 2016-06-01
# 2: 1 2009-02-01 19 2 2016-06-01
# 3: 1 2009-03-01 19 3 2016-06-01
# 4: 1 2009-04-01 19 4 2016-06-01
# 5: 1 2009-05-01 19 5 2016-06-01
# ---
#146: 2 2016-06-01 12 6 2016-08-01
#147: 2 2016-07-01 12 7 2016-08-01
#148: 2 2016-08-01 13 8 2016-08-01
#149: 2 2016-09-01 13 9 2016-08-01
#150: 2 2016-10-01 13 10 2016-08-01
Or extract the 'Date' corresponding to the minimum 'month_1' from within .SD (the Subset of Data.table), after filtering it to the rows with the maximum 'Years_service'
data[, Date_1 := .SD[Years_service == max(Years_service),
Date[which.min(month_1)]], ID]
Another option is to order the rows by decreasing 'Years_service' and increasing 'month_1', then assign 'Date_1' as the first 'Date' within each 'ID'
data[order(-Years_service, month_1), Date_1 := Date[1], ID]
Or using tidyverse
library(tidyverse)
data %>%
group_by(ID) %>%
arrange(desc(Years_service), month_1) %>%
mutate(Date_1 = first(Date))
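As a quick sanity check, the .SD approach and the order approach can be compared on a small table. This is only a sketch on invented data; the columns Date_a and Date_b are hypothetical names used to hold the two results side by side.

```r
library(data.table)

# Invented data in the same shape as the question's table.
toy <- data.table(
  ID = c(1, 1, 1, 1, 2, 2, 2),
  Date = as.Date(c("2016-05-01", "2016-06-01", "2016-07-01", "2016-08-01",
                   "2016-07-01", "2016-08-01", "2016-09-01")),
  Years_service = c(26, 27, 27, 27, 12, 13, 13),
  month_1 = c(5, 6, 7, 8, 7, 8, 9)
)

# Approach 1: within each ID, keep rows at the maximum Years_service,
# then take the Date with the smallest month_1.
toy[, Date_a := .SD[Years_service == max(Years_service),
                    Date[which.min(month_1)]], by = ID]

# Approach 2: order by decreasing service and increasing month,
# then take the first Date within each ID.
toy[order(-Years_service, month_1), Date_b := Date[1], by = ID]

toy
```

Both columns should agree row by row. If several rows tie on the minimum month_1, both approaches take the first such row in their own ordering, which is worth keeping in mind for data with duplicate months.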