I have a data frame which is individual-level data of people in a country. In the said data frame I have information on county or municipality of residence, sex, age, race and cancer status. I want to aggregate the data into a new data frame ordered by counties and stratified by age (in categories), sex and race. That is, create subgroups defined by a combination of these multiple variables. The original data has a structure similar to the fictitious data below.
structure(list(Person_ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27,
28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40), County_ID = c(1,
1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4,
4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6), Age = c(39,
21, 65, 87, 19, 16, 48, 52, 31, 19, 24, 44, 38,
39, 40, 27, 69, 71, 52, 53, 80, 23,
21, 29, 38, 34, 39, 73, 54, 50, 52,
43, 55, 57, 37, 24, 44, 37, 38,
40), Sex = c("F", "F", "F", "M", "M", "M", "F",
"M", "M", "F", "F", "F", "M", "M", "F", "F", "M", "M", "M", "M",
"M", "F", "F", "F", "M", "F", "F", "M", "M", "M", "F", "F", "F",
"F", "F", "F", "F", "F", "M", "M"), Race = c(1, 2, 1, 2, 3, 3,
3, 1, 1, 2, 2, 1, 2, 1, 2, 3, 3, 3, 2, 1, 2, 2, 3, 1, 3, 2, 3,
1, 2, 3, 3, 1, 2, 2, 2, 3, 1, 1, 2, 2), `Cancer-status` = c(0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1,
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0)), row.names = c(NA,
-40L), class = c("tbl_df", "tbl", "data.frame"))
with a structure like

Person_ID  County_ID  Age  Sex  Race  Cancer_status
1          1          30   M    1     1
2          1          41   M    2     0
3          1          19   F    1     0
4          1          37   F    3     1
5          2          28   F    3     0
6          3          65   M    1     1

where Cancer_status is a dummy or binary variable and Race is a factor variable.
I want a new data frame in the format below (similar to the structure of pennLC$data in the SpatialEpi package), with the counts of cancer cases and population ordered by county and stratified by the 3 strata (race, sex and age). The new age variable is a factor (categorical) variable.

county  cancer  pop_county  race  Sex  age
1       0       1492        1     F    Under 40
1       0       365         1     F    40-59
1       1       68          1     F    60-69
1       0       73          1     F    70+
1       0       23351       2     F    Under 40
1       5       12136       2     F    40-59

Thank you,
I'm assuming you want dplyr. Given your sample data, try this:
library(dplyr)
DF %>%
mutate(Age = cut(Age, c(0, 40, 60, 70, Inf), right = FALSE)) %>%
group_by(County_ID, Race, Sex, Age) %>%
summarize(cancer = sum(`Cancer-status`), pop_county = n()) %>%
ungroup()
# # A tibble: 37 x 6
# County_ID Race Sex Age cancer pop_county
# <dbl> <dbl> <chr> <fct> <dbl> <int>
# 1 1 1 F [0,40) 0 1
# 2 1 1 F [60,70) 0 1
# 3 1 2 F [0,40) 0 1
# 4 1 2 M [70,Inf) 0 1
# 5 1 3 M [0,40) 0 1
# 6 2 1 M [0,40) 1 1
# 7 2 1 M [40,60) 0 1
# 8 2 2 F [0,40) 1 2
# 9 2 3 F [40,60) 0 1
# 10 2 3 M [0,40) 0 1
# # ... with 27 more rows
You'll need to relabel the Age factor.
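One way to do that relabelling, as a minimal sketch: cut() accepts a labels argument, so the bins can be named when they are created. The label strings are taken from the desired output in the question; the age vector here is made up for illustration.

```r
# Hypothetical ages, chosen to hit every bin and the bin edges
age <- c(39, 21, 65, 87, 19, 40, 60, 70)

# right = FALSE gives left-closed bins: [0,40), [40,60), [60,70), [70,Inf)
age_group <- cut(
  age,
  breaks = c(0, 40, 60, 70, Inf),
  right  = FALSE,
  labels = c("Under 40", "40-59", "60-69", "70+")
)

age_group
```

The same labels argument can be added to the cut() call inside mutate() above.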
I would like to calculate Day.Before_nextCLS from the 3 columns below:
tibble::tribble(
~Day, ~CLS, ~BAL.D,
0, 0, NA,
3, 0, 15000,
6, 0, 10000,
20, 0, 2000,
25, 0, -4771299,
26, 0, -1615637,
27, 0, -920917,
31, 1, -923089,
32, 1, -81863,
33, 1, 19865,
34, 1, 9865,
37, 1, 609865
)
Desired output is below tribble.
For Day 27, Day.Before_nextCLS is 4,
because the next CLS value (1) first appears on Day 31, and the interval between 27 and 31 is 4.
tibble::tribble(
~Day, ~CLS, ~BAL.D, ~Day.Before_nextCLS,
0, 0, NA, 31,
3, 0, 15000, 28,
6, 0, 10000, 25,
20, 0, 2000, 11,
25, 0, -4771299, 6,
26, 0, -1615637, 5,
27, 0, -920917, 4,
31, 1, -923089, NA, # NA because there is no Day on which CLS == 2
32, 1, -81863, NA,
33, 1, 19865, NA,
34, 1, 9865, NA,
37, 1, 609865, NA,
)
How can I achieve this?
Thank you very much!!
We create a lead column holding the next row's Day, then group by CLS and subtract each row's Day from the last value of the lead column in its group (which is the first Day of the next CLS block):
library(dplyr)
df1 %>%
mutate(DayLead = lead(Day)) %>%
group_by(CLS) %>%
mutate(Day.Before_nextCLS = last(DayLead) - Day, DayLead = NULL) %>%
ungroup()
-output
# A tibble: 12 × 4
Day CLS BAL.D Day.Before_nextCLS
<dbl> <dbl> <dbl> <dbl>
1 0 0 NA 31
2 3 0 15000 28
3 6 0 10000 25
4 20 0 2000 11
5 25 0 -4771299 6
6 26 0 -1615637 5
7 27 0 -920917 4
8 31 1 -923089 NA
9 32 1 -81863 NA
10 33 1 19865 NA
11 34 1 9865 NA
12 37 1 609865 NA
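The same result can also be sketched in base R without a lead column. This is only an illustration, and it assumes that CLS increases in steps of 1, so that the "next" CLS of a row is simply CLS + 1. The data frame repeats the Day and CLS columns from the question; the name starts is made up here.

```r
df1 <- data.frame(
  Day = c(0, 3, 6, 20, 25, 26, 27, 31, 32, 33, 34, 37),
  CLS = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
)

# First Day on which each CLS value appears
starts <- aggregate(Day ~ CLS, data = df1, FUN = min)

# Look up the first Day of the next CLS value; match() returns NA when
# there is no next CLS, and indexing with NA gives NA
idx <- match(df1$CLS + 1, starts$CLS)
df1$Day.Before_nextCLS <- starts$Day[idx] - df1$Day

df1
```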
Please see my code below:
# functions to get percentile threshold, and assign new values to outliers
get_low_perc <- function(var_name) {
return(quantile(var_name, c(0.01)))
}
get_hi_perc <- function(var_name) {
return(quantile(var_name, c(0.99)))
}
round_up <- function(target_var, flag_var, floor) {
target_var <- as.numeric(ifelse(flag_var == 1, floor, target_var))
return(as.integer(target_var))
}
round_down <- function(target_var, flag_var, ceiling) {
target_var <- as.numeric(ifelse(flag_var == 1, ceiling, target_var))
return(as.integer(target_var))
}
# try putting it all together
no_way <- function(df, df_col_name, df_col_flagH, df_col_flagL) {
lo_perc <- get_low_perc(df_col_name)
hi_perc <- get_hi_perc(df_col_name)
df$df_col_flagH <- as.factor(ifelse(df_col_name < lo_perc, 1, 0))
df$df_col_flagL <- as.factor(ifelse(df_col_name > hi_perc, 1, 0))
df_col_name <- round_up(df_col_name, df_col_flagL, lo_perc)
df_col_name <- round_down(df_col_name, df_col_flagH, hi_perc)
# names(df)[names(df)=='df_col_flagH'] <-
# boxplot(df_col_name)
return(df)
}
I have created 5 custom functions. The first two get the 1st percentile and the 99th percentile of a given variable, respectively. The next two round the values in these variables up or down depending on whether they fall below the 1st percentile or above the 99th percentile. The last function tries to put all of these together to output a new dataframe containing the same columns as the original df, the updated column, and two new columns indicating values that were flagged as below the 1st percentile and above the 99th percentile. I have produced a mock dataframe below, since I can't share my actual data here.
df2 = data.frame(col = c(1, 3, 4, 5, 8, 7, 67, 744, 876, 8, 8, 54, 9),
col1 = c(9, 6, 8, 3, 4, 5, 8, 7, 67, 744, 87, 33, 77),
col2 = c(8, 2, 8, 4, 87, 66, 54, 99, 77, 77, 88, 67, 102))
Ideally, after I call the function using the command "no_way(df2, df2$col1, df2$new_col1, df2$new_col2)", I want an output dataframe looking like:
df2 = data.frame(col = c(1, 3, 4, 5, 8, 7, 67, 744, 876, 8, 8, 54, 9),
col1 = c(9, 6, 8, 3, 4, 5, 8, 7, 67, 744, 87, 33, 77), # updated with appropriate values
col2 = c(8, 2, 8, 4, 87, 66, 54, 99, 77, 77, 88, 67, 102),
new_col1 = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0),
new_col2 = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0))
^ Where new_col1 and new_col2 are column names given by the user when calling the function. I am currently getting the dataframe as expected, but the new columns created have kept the function parameters' names, as in:
df2 = data.frame(col = c(1, 3, 4, 5, 8, 7, 67, 744, 876, 8, 8, 54, 9),
col1 = c(9, 6, 8, 3, 4, 5, 8, 7, 67, 744, 87, 33, 77), # updated with appropriate values
col2 = c(8, 2, 8, 4, 87, 66, 54, 99, 77, 77, 88, 67, 102),
df_col_flagH = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0),
df_col_flagL = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0))
I would not mind renaming the columns afterwards, but I will be using this function on 17 columns, so that wouldn't be optimal. Please help.
You should pass the column names (both the existing column and the new ones) as strings and index with [[ ]].
Also ifelse(condition, 1, 0) can be simplified to as.integer(condition).
no_way <- function(df, df_col_name, df_col_flagH, df_col_flagL) {
lo_perc <- get_low_perc(df[[df_col_name]])
hi_perc <- get_hi_perc(df[[df_col_name]])
df[[df_col_flagH]] <- as.factor(as.integer(df[[df_col_name]] < lo_perc))
df[[df_col_flagL]] <- as.factor(as.integer(df[[df_col_name]] > hi_perc))
# Pass the flag columns themselves, not their names. Despite its name,
# df_col_flagH marks values below the 1st percentile, so it goes with round_up.
df[[df_col_name]] <- round_up(df[[df_col_name]], df[[df_col_flagH]], lo_perc)
df[[df_col_name]] <- round_down(df[[df_col_name]], df[[df_col_flagL]], hi_perc)
return(df)
}
df2 <- no_way(df2, "col1", "new_col1", "new_col2")
df2
#    col col1 col2 new_col1 new_col2
#1     1    9    8        0        0
#2     3    6    2        0        0
#3     4    8    8        0        0
#4     5    3    4        1        0
#5     8    4   87        0        0
#6     7    5   66        0        0
#7    67    8   54        0        0
#8   744    7   99        0        0
#9   876   67   77        0        0
#10    8  665   77        0        1
#11    8   87   88        0        0
#12   54   33   67        0        0
#13    9   77  102        0        0
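Since the function is meant to be used on many columns, here is a minimal sketch of how string column names make that a simple loop. The helper flag_and_cap and the "_low"/"_high" suffixes are made up for illustration; unlike round_up/round_down above, it caps values with pmax()/pmin() and does not truncate them to integers.

```r
# Hypothetical helper: flag values outside the 1st/99th percentiles of one
# column, then cap them at those percentiles
flag_and_cap <- function(df, col, flag_low, flag_high, probs = c(0.01, 0.99)) {
  limits <- quantile(df[[col]], probs, names = FALSE)
  df[[flag_low]]  <- as.integer(df[[col]] < limits[1])
  df[[flag_high]] <- as.integer(df[[col]] > limits[2])
  df[[col]] <- pmin(pmax(df[[col]], limits[1]), limits[2])
  df
}

df2 <- data.frame(col  = c(1, 3, 4, 5, 8, 7, 67, 744, 876, 8, 8, 54, 9),
                  col1 = c(9, 6, 8, 3, 4, 5, 8, 7, 67, 744, 87, 33, 77),
                  col2 = c(8, 2, 8, 4, 87, 66, 54, 99, 77, 77, 88, 67, 102))

# Apply the helper to every column, building the flag names from the column name
for (nm in c("col", "col1", "col2")) {
  df2 <- flag_and_cap(df2, nm, paste0(nm, "_low"), paste0(nm, "_high"))
}

names(df2)
```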
I have a df like this:
I want to transform the continuous Age variable into a discrete one that equals a if the original value was between 1 and 2, and b if it was between 3 and 4. This means aggregating Value 1 and Value 2 by summing the entries for Age=1 + Age=2 and for Age=3 + Age=4. The output would be something like this:
The 146 is the sum of the Value1 entry for Age=1 (75) and Age=2 (71).
I thought of using group_by and summarise:
df2 = df %>% group_by(Sex, Race) %>%
summarise(across(starts_with("Value"), fun))
Where fun would be some function that checks the Age values and sums accordingly. But I'm not very familiar with these dplyr functions and couldn't get it to work. Thanks for the help!
Data:
df = structure(list(Sex = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,
2, 2, 2), Race = c(1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2,
2, 2), Age = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4
), `Value 1` = c(75, 71, 52, 51, 24, 21, 70, 58, 67, 68, 36,
22, 91, 43, 33, 57), `Value 2` = c(22, 22, 49, 1, 20, 18, 34,
0, 27, 37, 31, 83, 29, 24, 10, 99)), row.names = c(NA, -16L), class = c("tbl_df",
"tbl", "data.frame"))
We can use case_when to do the recoding of 'Age' based on the values
library(dplyr)
df %>%
group_by(Sex, Race, Age = case_when(Age %in% 1:2 ~ 'a',
Age %in% 3:4 ~ 'b')) %>%
summarise(across(everything(), ~ sum(.x, na.rm = TRUE)), .groups = 'drop')
-output
# A tibble: 8 x 5
# Sex Race Age `Value 1` `Value 2`
#* <dbl> <dbl> <chr> <dbl> <dbl>
#1 1 1 a 146 44
#2 1 1 b 103 50
#3 1 2 a 45 38
#4 1 2 b 128 34
#5 2 1 a 135 64
#6 2 1 b 58 114
#7 2 2 a 134 53
#8 2 2 b 90 109
Based on the OP's comment, if the original data have lots of categories, an easier option is cut or findInterval
df %>%
group_by(Sex, Race, Age = cut(Age, breaks = c(-Inf,
seq(0, 90, by = 5), Inf), labels = letters[1:20])) %>%
summarise(across(everything(), ~ sum(.x, na.rm = TRUE)), .groups = 'drop')
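For the findInterval route, a minimal sketch of how the binning behaves on its own. The ages are made up for illustration; the breaks are the same 5-year steps used in the cut call above.

```r
# Hypothetical ages, including values on and beyond the break points
age <- c(0, 4, 5, 17, 90, 95)

# Lower edges of the 5-year bands: 0, 5, ..., 90
breaks <- seq(0, 90, by = 5)

# findInterval returns the index of the band each age falls into
# (bands are closed on the left); anything at or above 90 lands in the last band
band <- findInterval(age, breaks)

data.frame(age, band, band_start = breaks[band])
```

The integer band can then be used as the grouping variable in place of the cut result.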
I have a dataframe that looks like this
> head(printing_id_map_unique_frames)
# A tibble: 6 x 5
# Groups: frame_number [6]
X1 X2 X3 row_in_frame frame_number
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 3 15 1
2 1 2 3 15 2
3 1 2 3 15 3
4 1 2 3 15 4
5 1 2 3 15 5
6 1 2 3 15 6
As you can see, X1, X2, X3 and row_in_frame are identical across these rows.
However, eventually you get to rows where they change:
X1 X2 X3 row_in_frame frame_number
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 3 15 32
2 1 2 3 15 33
3 1 2 3 5 34**
4 1 4 5 15 35
5 1 4 5 15 36
What I would like to do is essentially compute a dataframe that looks like:
X1 X2 X3 row_in_frame num_duplicates
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 3 15 33
2 1 2 3 5 1
...
Essentially, what I want is to "collapse" rows with identical values in the first 4 columns and record how many rows of each type there are in the "num_duplicates" column.
Is there a nice way to do this in dplyr, without a messy for loop that keeps a running count and checks for changes?
Below please find a full data structure via dput:
> dput(printing_id_map_unique_frames)
structure(list(X1 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), X2 = c(2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4
), X3 = c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5), row_in_frame = c(15, 15, 15, 15,
15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15,
15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 5, 15, 15,
15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15,
15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 5
), frame_number = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45,
46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61,
62, 63, 64, 65, 66, 67, 68)), row.names = c(NA, -68L), class = c("tbl_df",
"tbl", "data.frame"))
Here is one option with count
library(dplyr) # 1.0.0
df1 %>%
count(!!! rlang::syms(names(.)[1:4]))
Or specify the unquoted column names
df1 %>%
count(X1, X2, X3, row_in_frame)
If we don't want to change the order, an option is to convert the first 4 columns to factor with levels specified as the unique values (which is the same as the order of occurrence of values) and then apply the count
df1 %>%
mutate(across(1:4, ~ factor(.x, levels = unique(.x)))) %>%
count(!!! rlang::syms(names(.)[1:4])) %>%
type.convert(as.is = TRUE)
# A tibble: 4 x 5
# X1 X2 X3 row_in_frame n
# <int> <int> <int> <int> <int>
#1 1 2 3 15 33
#2 1 2 3 5 1
#3 1 4 5 15 33
#4 1 4 5 5 1
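For comparison, the same counts can be sketched in base R with aggregate. This is only an illustration: the data frame below rebuilds the dput data from the question in compact form, and the helper column num_duplicates is set to 1 per row so that summing it yields the count. Note that aggregate sorts the groups rather than keeping the order of occurrence.

```r
# Compact reconstruction of the 68-row data from the question
df1 <- data.frame(
  X1 = 1,
  X2 = rep(c(2, 4), each = 34),
  X3 = rep(c(3, 5), each = 34),
  row_in_frame = rep(c(rep(15, 33), 5), times = 2),
  frame_number = 1:68
)

# Add a column of ones and sum it within each combination of the first 4 columns
counts <- aggregate(
  num_duplicates ~ X1 + X2 + X3 + row_in_frame,
  data = transform(df1, num_duplicates = 1),
  FUN = sum
)

counts
```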
I have two dataframes:
deploy.info <- data.frame(Echo_ID = c("20180918_7.5Fa_1", "20180918_Sebre_3", "20190808_Bake_2", "20190808_NH_2"),
uppermost_bin = c(2, 7, 8, 12))
spc <- data.frame(species = c("RS", "GS", "YG", "RR", "BR", "GT", "CB"),
percent_dist = c(0, 25, 80, 100, 98, 60, 100),
percent_dist_from_surf = c(0, 25, 80, 100, 98, 60, 100),
'20180918_7.5Fa_1' = c(1, 1, 1, "NA", "NA", 1, "NA"),
'20180918_Sebre_3' = c(1, 2, "NA", "NA", "NA", 4, "NA"),
'20190808_Bake_2' = c(1, 3, 7, "NA", "NA", 6, "NA"),
'20190808_NH_2' = c(1, 2, 8, "NA", "NA", 6, "NA"))
The last four columns in the spc data frame refer to each Echo_ID that I am dealing with in the deploy.info data frame. I want to replace the NAs in the spc data frame with the uppermost_bin values for each of the Echo_IDs. Does anyone know how to go about doing this?
My desired end product would look like:
i.want.this <- data.frame(species = c("RS", "GS", "YG", "RR", "BR", "GT", "CB"),
percent_dist = c(0, 25, 80, 100, 98, 60, 100),
percent_dist_from_surf = c(0, 25, 80, 100, 98, 60, 100),
'20180918_7.5Fa_1' = c(1, 1, 1, 2, 2, 1, 2),
'20180918_Sebre_3' = c(1, 2, 7, 7, 7, 4, 7),
'20190808_Bake_2' = c(1, 3, 7, 8, 8, 6, 8),
'20190808_NH_2' = c(1, 2, 8, 12, 12, 6, 12))
I have over 100 columns like this and would rather not go in and have to do this change by hand. Any ideas are greatly appreciated.
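We can use Map to replace the NA elements in the columns named after 'Echo_ID' with the corresponding values of 'uppermost_bin'. In the OP's dataset those columns were factors (they contain the string "NA"), so they are first converted to the correct type with type.convert. Because the column names start with a digit, data.frame() prefixes them with "X", hence the paste0 below.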
nm1 <- paste0("X", deploy.info$Echo_ID)
spc <- type.convert(spc, as.is = TRUE)
spc[nm1] <- Map(function(x, y) replace(x, is.na(x), y),
spc[nm1], deploy.info$uppermost_bin)
spc
# species percent_dist percent_dist_from_surf X20180918_7.5Fa_1 X20180918_Sebre_3 X20190808_Bake_2 X20190808_NH_2
#1 RS 0 0 1 1 1 1
#2 GS 25 25 1 2 3 2
#3 YG 80 80 1 7 7 8
#4 RR 100 100 2 7 8 12
#5 BR 98 98 2 7 8 12
#6 GT 60 60 1 4 6 6
#7 CB 100 100 2 7 8 12
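If the "X" prefix is unwanted, a minimal sketch of an alternative: building the data frame with check.names = FALSE keeps the original column names, and a plain loop over deploy.info does the replacement. The two-row, two-column data below is a cut-down, made-up version of the question's data with real NA values rather than "NA" strings.

```r
deploy.info <- data.frame(Echo_ID = c("20180918_7.5Fa_1", "20180918_Sebre_3"),
                          uppermost_bin = c(2, 7))

# check.names = FALSE stops data.frame() from prefixing the names with "X"
spc <- data.frame(species = c("RS", "RR"),
                  `20180918_7.5Fa_1` = c(1, NA),
                  `20180918_Sebre_3` = c(NA, 4),
                  check.names = FALSE)

for (i in seq_len(nrow(deploy.info))) {
  # as.character() guards against Echo_ID being stored as a factor
  id <- as.character(deploy.info$Echo_ID[i])
  spc[[id]][is.na(spc[[id]])] <- deploy.info$uppermost_bin[i]
}

spc
```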