Compare aggregate value across groups in R

With the df below,
I need to compute the median of the variable metric for each of the teams tm1, tm2 and tm3 on a per locid, day, hour combination basis,
and then keep only those locid, day, hour combinations whose metric median is the same across teams tm1, tm2 and tm3.
set.seed(100)
df <- data.frame(
locid = sample(c(1111,1122,1133), 20, replace=TRUE),
day = sample(c(1:3), 20, replace=TRUE),
hour = sample(c(1:4), 20, replace=TRUE),
team = sample(c("tm1", "tm2", "tm3"), 20, replace=TRUE),
metric = sample(1:5, 20, replace=TRUE )
)
My attempt:
df_medians <- df %>%
group_by(locid, day, hour, team) %>%
summarise(metric_median = median(metric))
This gives the median per team for each locid, day, hour combination. I now need to find the locid, day, hour combos that have the same median value across teams tm1, tm2 and tm3.
df_medians %>% group_by(locid, day, hour, team) %>% summarise(??what here??)
I was trying with dplyr, but a base-R solution is fine.
As a simpler example, consider the data below, which has measurements from two different locations for two teams.
+-------+------+-------+-------+---------+
| locid | day | hour | team | metric |
+-------+------+-------+-------+---------+
| 1111 | 1 | 1 | tm1 | 3 |
| 1111 | 1 | 1 | tm1 | 2 |
| 1111 | 1 | 1 | tm1 | 1 |
| 1111 | 1 | 1 | tm2 | 1 |
| 1111 | 1 | 1 | tm2 | 2 |
| 1111 | 1 | 1 | tm2 | 3 |
| 1122 | 1 | 1 | tm1 | 3 |
| 1122 | 1 | 1 | tm1 | 2 |
| 1122 | 1 | 1 | tm1 | 1 |
| 1122 | 1 | 1 | tm2 | 1 |
| 1122 | 1 | 1 | tm2 | 2 |
| 1122 | 1 | 1 | tm2 | 1 |
+-------+------+-------+-------+---------+
Step 1 - compute the median by group
+-------+------+-------+-------+-------------+
| locid | day | hour | team | metric_med |
+-------+------+-------+-------+-------------+
| 1111 | 1 | 1 | tm1 | 2 |
| 1111 | 1 | 1 | tm2 | 2 |
| 1122 | 1 | 1 | tm1 | 2 |
| 1122 | 1 | 1 | tm2 | 1 |
+-------+------+-------+-------+-------------+
Step 2 - compare the medians within each (locid, day, hour) group; only (1111, 1, 1) has the same metric_med across teams tm1 and tm2
+-------+------+-------+-------------+
| locid | day | hour | metric_med |
+-------+------+-------+-------------+
| 1111 | 1 | 1 | 2 |
+-------+------+-------+-------------+

One way to do it is to spread the teams into columns, one row per locid, day, and hour combination, and then compare them. This solution scales well to more than two teams and to more complicated conditions.
library(dplyr)
library(tidyr)
df %>%
group_by(locid, day, hour, team) %>%
summarize(median = median(metric)) %>%
spread(team, median) %>%
filter(tm1 == tm2)
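With the full three-team df from the question, the same approach just needs one more comparison in the filter. A sketch, assuming every combination of interest has all three teams (a missing team yields NA and the row is dropped by the filter):
df %>%
  group_by(locid, day, hour, team) %>%
  summarize(median = median(metric)) %>%
  spread(team, median) %>%
  # keep combos where all three team medians agree
  filter(tm1 == tm2 & tm2 == tm3)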
Another possible solution is to arrange the summarized results by locid, day, and hour, and then compare the median in one row to its lag. This solution only works for two teams.
df %>%
group_by(locid, day, hour, team) %>%
summarize(median = median(metric)) %>%
arrange(locid, day, hour) %>%
filter(median == lag(median))

Let's re-cast 'all equal' to mean "zero-variance or a single observation". Thus:
df %>%
# per locid, day, hour, team
group_by(locid, day, hour, team) %>%
# compute median
summarize(team_median = median(metric)) %>%
# ungroup before specifying new grouping
ungroup %>%
# for locid, day, hour
group_by(locid, day, hour) %>%
# find the medians that were the same for all teams
# 'the same' here is taken to mean no variance
# or having a single observation
# note that TRUE | NA yields TRUE, but FALSE | NA yields NA.
# As a guard against team_medians that are NA, I wrap the condition in coalesce().
# I've decided that missing team_medians represent non-cases, YMMV
summarize(all_equal = coalesce(n() == 1 | var(team_median) == 0, FALSE)) %>%
filter(all_equal == TRUE) %>%
select(-all_equal)
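For a base-R equivalent (the question allows it), here is a minimal sketch of the same two steps using aggregate() and merge(); the names med, eq and result are just illustrative:
# step 1: median per locid, day, hour, team
med <- aggregate(metric ~ locid + day + hour + team, data = df, FUN = median)
# step 2: per locid, day, hour, check whether every team has the same median
eq <- aggregate(metric ~ locid + day + hour, data = med,
                FUN = function(x) length(unique(x)) == 1)
# keep the agreeing combos and carry the (common) median along
result <- merge(eq[eq$metric, c("locid", "day", "hour")],
                unique(med[, c("locid", "day", "hour", "metric")]))
result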

Related

Display long table with hiding records

I want to show a table that displays the top n records and the bottom n records when the table is very long.
df <- nycflights13::flights
funct <- function(data, var){
var_lab(data[[var]]) <- "Table 1"
t1 <- expss::cro_cpct(data[[var]])
t1
}
funct(data = df, var = "distance")
# I tried the following, but it still doesn't work
t1<- expss::cro_cpct(df[["distance"]]) %>% filter(row_number() <= 10 | row_number() >= (n() - 10)) %>%
add_row(.after = 10)
t2 <- t1 %>% mutate(across(everything(), as.character))
t3 <- t2 %>% mutate(across(everything(), ~replace_na(t2, "...")))
I want to pass a parameter, say n = 10, so that the table shows the first 10 records and the bottom 10 records and trims the rest, without changing the original percentage values.
Not very nice, but works for me:
library(expss)
df <- nycflights13::flights
funct <- function(data, var){
var_lab(data[[var]]) <- "Table 1"
t1 <- expss::cro_cpct(data[[var]])
t1
}
res = funct(data = df, var = "distance")
res = add_rows(
head(res, 10),
NA,
tail(res, 10)
)
# All row labels are located in the first column separated with '|'.
# We need to replace the last label with '...'.
# That's why we have this regular expression here.
res$row_labels[11] = gsub("\\|[^|]+$", "|...", res$row_labels[1])
# I don't recommend using the line below because it converts all numerics to characters.
# It can complicate further processing.
# It's better to leave all columns except row_labels as is, e.g. filled with NA.
res[11, -1] = '...'
res
# | | | #Total |
# | ------- | ------------ | -------------------- |
# | Table 1 | 17 | 0.000296933273154857 |
# | | 80 | 0.014549730384588 |
# | | 94 | 0.28980687459914 |
# | | 96 | 0.180238496804998 |
# | | 116 | 0.131541440007601 |
# | | 143 | 0.130353706914982 |
# | | 160 | 0.111646910706226 |
# | | 169 | 0.161828633869397 |
# | | 173 | 0.0656222533672233 |
# | | 184 | 1.63432073544433 |
# | | ... | ... |
# | | 2475 | 3.34406252227 |
# | | 2521 | 0.0843290495759793 |
# | | 2565 | 1.52237689146495 |
# | | 2569 | 0.0976910468679478 |
# | | 2576 | 0.0926431812243153 |
# | | 2586 | 2.43604057296244 |
# | | 3370 | 0.00237546618523885 |
# | | 4963 | 0.108380644701523 |
# | | 4983 | 0.101551179418961 |
# | | #Total cases | 336776 |
Filter and add_row in between the top and bottom rows:
library(dplyr)
library(tibble)  # add_row()
library(tidyr)   # replace_na()
df <- nycflights13::flights
df %>%
select(carrier, distance) %>%
arrange(desc(distance)) %>%
filter(row_number() <= 10 | row_number() >= (n() - 10)) %>%
mutate(across(everything(), as.character)) %>%
add_row(.after = 10, carrier = "...", distance = "...") %>%
writexl::write_xlsx(., "table.xlsx")
If you want an SPSS-style format, you could do it manually with the janitor package, e.g.
df %>%
janitor::tabyl(distance) %>%
select(-n) %>%
arrange(desc(distance)) %>%
janitor::adorn_totals() %>%
janitor::adorn_pct_formatting() %>%
filter(row_number() <= 10 | row_number() >= (n() - 10)) %>%
add_row(.after = 10) %>%
as_tibble() %>%
mutate(across(everything(), as.character)) %>%
mutate(across(everything(), ~replace_na(.x, "...")))
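Since the question asks for the cutoff to be passed as a parameter, the trimming steps above can be wrapped in a small helper. A sketch; trim_rows and its n_keep argument are just illustrative names:
library(dplyr)
library(tidyr)   # replace_na()
library(tibble)  # add_row()
# keep the first and last n_keep rows of a summary table and insert a
# single "..." separator row in between
trim_rows <- function(tbl, n_keep = 10) {
  tbl %>%
    filter(row_number() <= n_keep | row_number() > (n() - n_keep)) %>%
    add_row(.after = n_keep) %>%
    mutate(across(everything(), as.character)) %>%
    mutate(across(everything(), ~ replace_na(.x, "...")))
}
# usage with the janitor pipeline from above
nycflights13::flights %>%
  janitor::tabyl(distance) %>%
  select(-n) %>%
  arrange(desc(distance)) %>%
  janitor::adorn_totals() %>%
  janitor::adorn_pct_formatting() %>%
  trim_rows(n_keep = 10)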

How to generate Z-test including totals of variables in columns in expss?

Two questions in fact.
How to add totals for variables in columns in expss?
Is it possible to perform Z-test for variables in columns including total as a different category?
Below is a piece of code I ran, but it didn't work; I couldn't even add totals on the right/left side of the column variable.
test_table = tab_significance_options(data = df, compare_type = "subtable", bonferroni = TRUE, subtable_marks = "both") %>%
tab_cells(VAR1) %>%
tab_total_statistic("w_cpct") %>%
tab_cols(VAR2) %>%
tab_stat_cpct() %>%
tab_cols(total(VAR2)) %>%
tab_last_sig_cpct() %>%
tab_pivot(stat_position = "outside_columns")
I would be grateful for any advice.
To compare with the first column you need to additionally specify "first_column" in compare_type. Second, for a correct result one of the total statistics should be cases. Taking all of the above into account:
library(expss)
data(mtcars)
test_table = mtcars %>%
tab_significance_options(compare_type = c("first_column", "subtable"), bonferroni = TRUE, subtable_marks = "both") %>%
tab_total_statistic(c("u_cases", "w_cpct")) %>%
tab_cells(gear) %>%
tab_cols(total(am), am) %>%
tab_stat_cpct() %>%
tab_last_sig_cpct() %>%
tab_pivot()
test_table
# | | | #Total | am | |
# | | | | 0 | 1 |
# | | | | A | B |
# | ---- | ---------------- | ------ | -------- | -------- |
# | gear | 3 | 46.9 | 78.9 + | |
# | | 4 | 37.5 | 21.1 < B | 61.5 > A |
# | | 5 | 15.6 | | 38.5 |
# | | #Total cases | 32 | 19 | 13 |
# | | #Total wtd. cpct | 100 | 100 | 100 |

Identify subsequent event for repeat IDs based on dates and initial event

I am trying to determine repeat IDs based on date and an initial event. Below is a sample data set
+----+------------+-------------------------+
| ID | Date | Investigation or Intake |
+----+------------+-------------------------+
| 1 | 1/1/2019 | Investigation |
| 2 | 1/2/2019 | Investigation |
| 3 | 1/3/2019 | Investigation |
| 4 | 1/4/2019 | Investigation |
| 1 | 1/2/2019 | Intake |
| 2 | 12/31/2018 | Intake |
| 3 | 1/5/2019 | Intake |
+----+------------+-------------------------+
I want to write R code that goes through IDs 1 to 4 (the IDs that have investigations) and checks whether they have a subsequent intake (an intake that happens at a later date than the investigation). The expected output looks like this:
+----+------------+-------------------------+------------+
| ID | Date | Investigation or Intake | New Column |
+----+------------+-------------------------+------------+
| 1 | 1/1/2019 | Investigation | Sub Intake |
| 2 | 1/2/2019 | Investigation | None |
| 3 | 1/3/2019 | Investigation | Sub Intake |
| 4 | 1/4/2019 | Investigation | None |
| 1 | 1/2/2019 | Intake | |
| 2 | 12/31/2018 | Intake | |
| 3 | 1/5/2019 | Intake | |
+----+------------+-------------------------+------------+
What would the code to solve this look like? I am guessing it will involve some kind of loop?
Thanks!
You can do this using the dplyr package, creating the new column with a couple of ifelse statements.
Instead of looping, just check the next entry in the group using the lead function.
This solution assumes that in each group you have one "Investigation" row followed by 0 or more "Intake" rows.
library(dplyr)
df <- data.frame(ID = c(1, 2, 3, 4, 1, 2, 3),
Date = as.Date(c("2019-01-01", "2019-01-02", "2019-01-03", "2019-01-04", "2019-01-02", "2018-12-31", "2019-01-05")),
Investigation_or_Intake = c("Investigation", "Investigation", "Investigation", "Investigation", "Intake", "Intake", "Intake"),
stringsAsFactors = FALSE)
df %>%
group_by(ID) %>% # Make groups according to ID column
mutate(newcol = ifelse(lead(Date) > Date, "Sub Intake", "None"), # Check next entry in the group to see if Date is after current
newcol = ifelse(Investigation_or_Intake == "Investigation" & is.na(newcol), "None", newcol)) # Change "Investigation" entries with no Intake to "None"
This gives
ID Date Investigation_or_Intake newcol
<dbl> <date> <chr> <chr>
1 1 2019-01-01 Investigation Sub Intake
2 2 2019-01-02 Investigation None
3 3 2019-01-03 Investigation Sub Intake
4 4 2019-01-04 Investigation None
5 1 2019-01-02 Intake NA
6 2 2018-12-31 Intake NA
7 3 2019-01-05 Intake NA
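If the Investigation row cannot be assumed to come before the Intake rows within an ID, a hedged alternative (still assuming at most one Investigation per ID) is to compare every Intake date in the group against the Investigation date directly; inv_date and has_sub are just helper names:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(
    # date of the (single) Investigation for this ID
    inv_date = min(Date[Investigation_or_Intake == "Investigation"]),
    # TRUE if any Intake for this ID is dated after the Investigation
    has_sub = any(Investigation_or_Intake == "Intake" & Date > inv_date),
    newcol = ifelse(Investigation_or_Intake == "Investigation",
                    ifelse(has_sub, "Sub Intake", "None"),
                    NA_character_)
  ) %>%
  ungroup() %>%
  select(-inv_date, -has_sub)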

Random sample by group and filtering on the basis of result

I have a dataframe that is generated by the following code
l_ids = c(1, 1, 1, 2, 2, 2, 2)
l_months = c(5, 5, 5, 88, 88, 88, 88)
l_calWeek = c(201708, 201709, 201710, 201741, 201742, 201743, 201744)
value = c(5, 6, 3, 99, 100, 1001, 1002)
dat <- setNames(data.frame(cbind(l_ids, l_months, l_calWeek, value)),
c("ids", "months", "calWeek", "value"))
and looks like this:
+----+-------+----------+-------+
| Id | Month | Cal Week | Value |
+----+-------+----------+-------+
| 1 | 5 | 201708 | 4.5 |
| 1 | 5 | 201709 | 5 |
| 1 | 5 | 201710 | 6 |
| 2 | 88 | 201741 | 75 |
| 2 | 88 | 201742 | 89 |
| 2 | 88 | 201743 | 90 |
| 2 | 88 | 201744 | 51 |
+----+-------+----------+-------+
I would like to randomly sample a calendar week from each id-month group (the months are not calendar months). Then I would like to keep, within each id-month group, all rows from calendar weeks up to and including the sampled week.
An example output could be: suppose the sampling returned cal week 201743 for the group id=2, month=88 and 201709 for the group id=1, month=5; then the final output should be
+----+-------+----------+-------+
| Id | Month | Cal Week | Value |
+----+-------+----------+-------+
| 1 | 5 | 201708 | 4.5 |
| 1 | 5 | 201709 | 5 |
| 2 | 88 | 201741 | 75 |
| 2 | 88 | 201742 | 89 |
| 2 | 88 | 201743 | 90 |
+----+-------+----------+-------+
I tried to work with dplyr's sample_n function (which gives me the random calendar week by id-month group), but then I do not know how to get all calendar weeks prior to that week. Can you help me with this? If possible, I would like to work with dplyr.
Please let me know in case you need further information.
Many thanks
require(dplyr)
set.seed(1) # when sampling please set.seed
sampled <- dat %>% group_by(ids) %>% do(., sample_n(.,1))
sampled_day <- sampled$calWeek
dat %>% group_by(ids) %>%
mutate(max_day = which(calWeek %in% sampled_day)) %>%
filter(row_number() <= max_day)
#You can also just filter directly with row_number() <= which(calWeek %in% sampled_day)
# A tibble: 3 x 4
# Groups: ids [2]
ids months calWeek value
<dbl> <dbl> <dbl> <dbl>
1 1.00 5.00 201708 5.00
2 2.00 88.0 201741 99.0
3 2.00 88.0 201742 100
This depends on the row order! So make sure to arrange by day first. You'll need to think about ties, though. I have edited my previous answer and simply filtered with <=.
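A variant of the same idea that groups by both ids and months and joins the sampled week back in, so the filter does not depend on row order. A sketch: slice_sample() needs dplyr >= 1.0 (use sample_n() on older versions), and sampled_week is just an illustrative name:
library(dplyr)
set.seed(1)
sampled <- dat %>%
  group_by(ids, months) %>%
  slice_sample(n = 1) %>%   # one random week per id-month group
  ungroup() %>%
  select(ids, months, sampled_week = calWeek)
dat %>%
  inner_join(sampled, by = c("ids", "months")) %>%
  filter(calWeek <= sampled_week) %>%   # keep weeks up to and including the sampled one
  select(-sampled_week)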
That should do the trick:
library(dplyr)
sample_and_get_below <- function(df, when, size){
res <- filter(df, calWeek == when) %>%
sample_n(size)
filter(df, calWeek > when) %>%
rbind(res, .)
}
sample_and_get_below(dat, 201741, 1)
ids months calWeek value
1 2 88 201741 99
2 2 88 201742 100
3 2 88 201743 1001
4 2 88 201744 1002

R ddply sum value from next row

I want to sum the column value from a row with the next one.
> df
+----+------+--------+------+
| id | Val | Factor | Col |
+----+------+--------+------+
| 1 | 15 | 1 | 7 |
| 3 | 20 | 1 | 4 |
| 2 | 35 | 2 | 8 |
| 7 | 35 | 1 | 12 |
| 5 | 40 | 1 | 11 |
| 6 | 45 | 2 | 13 |
| 4 | 55 | 1 | 4 |
| 8 | 60 | 1 | 7 |
| 9 | 15 | 2 | 12 |
..........
I would like the mean of the sums Row$Val + nextRow$Val, matched on id and Col. I can't assume that the id or Col values are consecutive.
I am using ddply to summarize my df. I have tried
> ddply(df, .(Factor), summarize,
max(Val),
sum(Val),
mean(Val + df[df$id == id+1 & df$Col = Col]$Val)
)
> "longer object length is not a multiple of shorter object length"
You can build a vector of values with
sapply(df$id, function(x){mean(c(
subset(df, id == x, select = Val, drop = TRUE),
subset(df, id == x+1, select = Val, drop = TRUE)
))})
You could simplify, but I tried to make it as readable as possible.
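A hypothetical way to plug that vector back into the original ddply call (pair_mean is just an illustrative column name; like the sapply above, it matches on consecutive id values and ignores Col):
library(plyr)
df$pair_mean <- sapply(df$id, function(x) {
  mean(c(df$Val[df$id == x], df$Val[df$id == x + 1]))
})
ddply(df, .(Factor), summarise,
      max_val = max(Val),
      sum_val = sum(Val),
      mean_pair = mean(pair_mean))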
You can use rollapply from the zoo package. Since you want the mean of only two consecutive rows, you can try
library(zoo)
rollapply(df[order(df$id), 2], 2, function(x) sum(x)/2)
#[1] 17.5 27.5 35.0 37.5 42.5 50.0 57.5 37.5
You can do something like this with the dplyr package:
library(dplyr)
df <- arrange(df, id)
mean(df$Val + lead(df$Val), na.rm = TRUE)
[1] 76.25
