Adding a variable name to a multiple-answer table in expss (R) / creating a single variable to capture multiple answers

I want to add a variable name for the multiple-answer question Q6, which consists of 12 columns (Q6_1 to Q6_12). Adding a label as follows does not give me the intended result: it adds a total_row column. I just need a label to indicate that this is the table for Q6.
Alternatively, if you know a way to create a single variable that captures all the multiple answers, please let me know.
banner %>%
tab_cells(mrset(Q6_1 %to% Q6_12, lablel="Q6_test")) %>%
tab_stat_cpct() %>%
tab_pivot() %>%
tab_sort_desc()

You have a typo in your code: lablel should be label:
library(expss)
mtcars %>%
tab_cells(mrset(am, cyl, label = "am+cyl")) %>%
tab_stat_cpct() %>%
tab_pivot()
# | | | #Total |
# | ------ | ------------ | ------ |
# | am+cyl | 0 | 59.4 |
# | | 1 | 40.6 |
# | | 4 | 34.4 |
# | | 6 | 21.9 |
# | | 8 | 43.8 |
# | | #Total cases | 32.0 |
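Applied to the question's own column layout, the corrected call could look like the sketch below. The survey data frame and its three Q6_ columns are made up for illustration (the original has twelve columns in an object called banner); only the spelling of the label argument differs from the question's code.

```r
library(expss)

# Hypothetical stand-in for the asker's data: each Q6_* column holds the
# answer code if that option was chosen and NA otherwise.
survey <- data.frame(
  Q6_1 = c(1, NA, 1, 1),
  Q6_2 = c(2, 2, NA, NA),
  Q6_3 = c(NA, 3, 3, NA)
)

q6_table <- survey %>%
  tab_cells(mrset(Q6_1 %to% Q6_3, label = "Q6_test")) %>%  # 'label', not 'lablel'
  tab_stat_cpct() %>%
  tab_pivot() %>%
  tab_sort_desc()

q6_table
```

The mrset() call already treats the columns as one multiple-response variable, so no separate combined column should be needed for tabulation.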

Related

Display long table with hiding records

I want to display a table that shows only the top n and bottom n records when the table is very long.
df <- nycflights13::flights
funct <- function(data, var){
var_lab(data[[var]])<-"Table 1"
t1<- expss::cro_cpct(data[[var]])
t1
}
funct(data=df,var="distance")
# I tried like below but still doesn't work
t1<- expss::cro_cpct(df[["distance"]]) %>% filter(row_number() <= 10 | row_number() >= (n() - 10)) %>%
add_row(.after = 10)
t2 <- t1 %>% mutate(across(everything(), as.character))
t3 <- t2 %>% mutate(across(everything(), ~replace_na(t2, "...")))
I want to add a parameter that trims the table as shown below. For example, with a new parameter n = 10, it should show the first 10 and the last 10 records and drop the rest, without changing the original percentage values.
Not very nice, but works for me:
library(expss)
df <- nycflights13::flights
funct <- function(data, var){
var_lab(data[[var]])<-"Table 1"
t1<- expss::cro_cpct(data[[var]])
t1
}
res = funct(data=df,var="distance")
res = add_rows(
head(res, 10),
NA,
tail(res, 10)
)
# All row labels are located in the first column separated with '|'.
# We need to replace the last label with '...'.
# That's why we have this regular expression here.
res$row_labels[11] = gsub("\\|[^|]+$", "|...", res$row_labels[1])
# I don't recommend using the line below because it converts all numerics to characters.
# It can complicate the further processing.
# It's better to leave all columns except row_labels as is, e. g. filled with NA
res[11, -1] = '...'
res
# | | | #Total |
# | ------- | ------------ | -------------------- |
# | Table 1 | 17 | 0.000296933273154857 |
# | | 80 | 0.014549730384588 |
# | | 94 | 0.28980687459914 |
# | | 96 | 0.180238496804998 |
# | | 116 | 0.131541440007601 |
# | | 143 | 0.130353706914982 |
# | | 160 | 0.111646910706226 |
# | | 169 | 0.161828633869397 |
# | | 173 | 0.0656222533672233 |
# | | 184 | 1.63432073544433 |
# | | ... | ... |
# | | 2475 | 3.34406252227 |
# | | 2521 | 0.0843290495759793 |
# | | 2565 | 1.52237689146495 |
# | | 2569 | 0.0976910468679478 |
# | | 2576 | 0.0926431812243153 |
# | | 2586 | 2.43604057296244 |
# | | 3370 | 0.00237546618523885 |
# | | 4963 | 0.108380644701523 |
# | | 4983 | 0.101551179418961 |
# | | #Total cases | 336776 |
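Since the question asks for a parameter n, the head/tail approach above could be wrapped in a small function. This is only a sketch: trim_table is a hypothetical helper name, mtcars$mpg stands in for the flights data, and the numeric cells of the placeholder row are left as NA, as the comments above recommend.

```r
library(expss)

# Hypothetical helper: keep the first and last n rows of an expss table and
# put a single "..." placeholder row between them.
trim_table <- function(tbl, n = 10) {
  if (nrow(tbl) <= 2 * n) return(tbl)  # nothing to trim
  res <- add_rows(head(tbl, n), NA, tail(tbl, n))
  # Reuse the variable-label part of the first row label and replace the
  # value part with "...".
  res$row_labels[n + 1] <- gsub("\\|[^|]+$", "|...", res$row_labels[1])
  res
}

mpg <- mtcars$mpg
var_lab(mpg) <- "Table 1"
trimmed <- trim_table(cro_cpct(mpg), n = 3)
trimmed
```

Because the percentages are computed before trimming, the remaining rows keep their original values.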
Filter and add_row in between the top and bottom rows:
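library(dplyr)   # select, arrange, filter, mutate, across
library(tibble)  # add_row
df <- nycflights13::flights
df %>%
select(carrier, distance) %>%
arrange(desc(distance)) %>%
# keep the first 10 and the last 10 rows (">=" would keep 11 at the bottom)
filter(row_number() <= 10 | row_number() > (n() - 10)) %>%
mutate(across(everything(), as.character)) %>%
add_row(.after = 10, carrier = "...", distance = "...") %>%
writexl::write_xlsx(., "table.xlsx")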
If you want an SPSS-style format, you could build it manually with the janitor package, e.g.
df %>%
janitor::tabyl(distance ) %>%
select(-n) %>%
arrange(desc(distance )) %>%
janitor::adorn_totals() %>%
janitor::adorn_pct_formatting() %>%
filter(row_number() <= 10 | row_number() >= (n() - 10)) %>%
add_row(.after = 10) %>%
as_tibble() %>%
mutate(across(everything(), as.character)) %>%
mutate(across(everything(), ~replace_na(.x, "...")))

frequency table for banner list

I am trying to create a function that generates a frequency table (showing count, valid percent, and percent) for a list of banners.
I want to export the tables to xlsx files.
For example, for the variable "gear", I want to calculate the table for the banner below:
library(expss)
df <- mtcars
df$all<- 1
df$small<-ifelse(df$vs==1,1,NA)
df$large<-ifelse(df$am ==1,1,NA)
val_lab(df$all)<-c("Total"=1)
val_lab(df$small)<-c("Small"=1)
val_lab(df$large)<-c("Large"=1)
banner <- list(df$all,df$small,df$large)
data <- df
var <- "gear"
var1 <- rlang::parse_expr(var)
expss::var_lab(data[[var]])
#tab1 <- expss::fre(data[[var1]])
table1 <- expss::fre(data[[var1]],
stat_lab = getOption("expss.fre_stat_lab", c("Count N", "Valid percent", "Percent",
"Responses, %", "Cumulative responses, %")))
table1
The output table should look like the one below.
You need to write a custom function around fre:
library(expss)
df <- mtcars
df$all<- 1
df$small<-ifelse(df$vs==1,1,NA)
df$large<-ifelse(df$am ==1,1,NA)
val_lab(df$all)<-c("Total"=1)
val_lab(df$small)<-c("Small"=1)
val_lab(df$large)<-c("Large"=1)
my_fre <- function(curr_var) setNames(expss::fre(curr_var)[, 1:3],
c("row_labels", "Count N", "Valid percent"))
cross_fun_df(df, gear, list(all, small, large), fun = my_fre)
# | | Total | | Small | | Large | |
# | | Count N | Valid percent | Count N | Valid percent | Count N | Valid percent |
# | ------ | ------- | ------------- | ------- | ------------- | ------- | ------------- |
# | 3 | 15 | 46.88 | 3 | 21.43 | | |
# | 4 | 12 | 37.50 | 10 | 71.43 | 8 | 61.54 |
# | 5 | 5 | 15.62 | 1 | 7.14 | 5 | 38.46 |
# | #Total | 32 | 100.00 | 14 | 100.00 | 13 | 100.00 |
# | <NA> | 0 | | 0 | | 0 | |
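The question also mentions exporting to xlsx. One possible route, sketched here under the assumption that the openxlsx package is installed, is the xl_write function from expss. The cro_cpct table and the file name below are stand-ins chosen for illustration; the cross_fun_df result above should be passable in the same way.

```r
library(expss)
library(openxlsx)

# Stand-in table; replace with the table you actually want to export.
tbl <- cro_cpct(mtcars$gear, mtcars$am)

wb <- createWorkbook()
sh <- addWorksheet(wb, "Tables")
xl_write(tbl, wb, sh)  # writes the formatted table onto the sheet

# Hypothetical output location, placed in a temp directory for the example.
out_file <- file.path(tempdir(), "gear_table.xlsx")
saveWorkbook(wb, out_file, overwrite = TRUE)
```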

R: Subset Large Data Frame with Multiple Conditions

I have a large data frame with 12 million rows and 5 columns. I want to subset the large data frame with multiple conditions. I need to do this multiple times with different criteria, so I created a Look-Up Table and a for loop.
The code below loops through and subsets the large data frame, saving each iteration as a list within a list. After the loop completes, I combine the lists into a data frame.
My current set-up functions, but it is painfully slow (about 15 minutes for 8 loops). Subsetting is actually taking more time than it took to calculate the mean and SD for the 12 million-row table!
Any advice on how to speed this up?
>scaled
| chr | site | Average_CPMn | SD_CPMn |
|------|------|--------------|---------|
| chrI | 1 | 0.071 | 0.070 |
| chrI | 2 | 0.120 | 0.111 |
| chrI | 3 | 0.000 | 0.000 |
| chrI | 4 | 0.000 | 0.000 |
| chrI | 5 | 0.000 | 0.000 |
| chrI | 6 | 0.156 | 0.056 |
...12,000,000 rows
>genes.df
| Gene | Chromosome | Meta_Start | Meta_Stop |
|---------|------------|------------|-----------|
| YGL234W | chrVII | 55982 | 59390 |
| YGR061C | chrVII | 611389 | 616465 |
| YMR120C | chrXIII | 507002 | 509780 |
| YLR359W | chrXII | 843782 | 846230 |
scaled <- read_rds("~/Desktop/scaled.rds")
subset_list = list()
for (i in 1:nrow(genes.df)) {
subset <- scaled %>%
dplyr::filter(chr == genes.df$Chromosome[i] & site >= genes.df$Meta_Start[i] & site <= genes.df$Meta_Stop[i]) %>%
dplyr::mutate(Gene = genes.df$Gene[i])
subset_list[[i]] <- subset
}
#combine gene-list into single dataframe
counts_subset <- as.data.frame(do.call(rbind, subset_list)) %>%
left_join(genes.df, by = "Gene")
You haven't shared the data or a sample, so it is difficult to demonstrate. However, I suggest using semi_join (if you only want to subset) or left_join (if you want to add columns instead), something like this in the tidyverse:
scaled %>% semi_join(genes.df %>% pivot_longer(c(Meta_Start, Meta_Stop)) %>%
group_by(Gene, Chromosome) %>%
complete(value = seq(min(value), max(value), 1)) %>%
ungroup() %>% select(-name), by = c('chr' = 'Chromosome', 'site' = 'value'))
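A different technique from the one above, assuming dplyr 1.1.0 or later is available, is a non-equi join with join_by(), which matches each site against the gene ranges directly instead of expanding every position first. The two small data frames below are made-up stand-ins for the question's tables.

```r
library(dplyr)  # join_by() needs dplyr >= 1.1.0

# Toy stand-ins for the question's tables (values are made up).
scaled <- data.frame(
  chr = c("chrI", "chrI", "chrVII", "chrVII", "chrVII"),
  site = c(1, 2, 55981, 55990, 59390),
  Average_CPMn = c(0.071, 0.120, 0.010, 0.020, 0.030)
)
genes.df <- data.frame(
  Gene = "YGL234W", Chromosome = "chrVII",
  Meta_Start = 55982, Meta_Stop = 59390
)

# Keep rows whose chromosome matches and whose site lies within
# [Meta_Start, Meta_Stop]; the gene columns are attached in the same step.
counts_subset <- inner_join(
  scaled, genes.df,
  by = join_by(chr == Chromosome, between(site, Meta_Start, Meta_Stop))
)
counts_subset
```

This replaces both the loop and the later left_join, since the Gene column arrives with the match. Whether it is faster on 12 million rows would need to be measured.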

How to summarize data in R (dplyr) and avoid duplicate identifiers? [duplicate]

This question already has answers here:
Extract row corresponding to minimum value of a variable by group
I'm trying to identify the lowest rate over a range of years for a number of items (ID).
In addition, I would like to know the Year the lowest rate was pulled from.
I'm grouping by ID, but I run into an issue when rates are duplicated across years.
sample data
df <- data.frame(ID = c(1,1,1,2,2,2,3,3,3,4,4,4),
Year = rep(2010:2012,4),
Rate = c(0.3,0.6,0.9,
0.8,0.5,0.2,
0.8,0.4,0.9,
0.7,0.7,0.7))
sample data as table
| ID | Year | Rate |
|:------:|:------:|:------:|
| 1 | 2010 | 0.3 |
| 1 | 2011 | 0.6 |
| 1 | 2012 | 0.9 |
| 2 | 2010 | 0.8 |
| 2 | 2011 | 0.5 |
| 2 | 2012 | 0.2 |
| 3 | 2010 | 0.8 |
| 3 | 2011 | 0.4 |
| 3 | 2012 | 0.9 |
| 4 | 2010 | 0.7 |
| 4 | 2011 | 0.7 |
| 4 | 2012 | 0.7 |
Using dplyr I grouped by ID, then found the lowest rate.
df.Summarise <- df %>%
group_by(ID) %>%
summarise(LowestRate = min(Rate))
This gives me the following
| ID | LowestRate |
| --- | --- |
| 1 | 0.3 |
| 2 | 0.2 |
| 3 | 0.4 |
| 4 | 0.7 |
However, I also need to know the year that data was pulled from.
This is what I would like my final result to look like:
| ID | Rate | Year |
| --- | --- | --- |
| 1 | 0.3 | 2010 |
| 2 | 0.2 | 2012 |
| 3 | 0.4 | 2011 |
| 4 | 0.7 | 2012 |
Here's where I ran into some issues.
Attempt #1: Include "Year" in the original dplyr code
df.Summarise2 <- df %>%
group_by(ID) %>%
summarise(LowestRate = min(Rate),
Year = Year)
Error: Column `Year` must be length 1 (a summary value), not 3
Makes sense. I'm not summarizing "Year" at all. I just want to include that row's value for Year!
Attempt #2: Use mutate instead of summarise
df.Mutate <- df %>%
group_by(ID) %>%
mutate(LowestRate = min(Rate))
So that essentially returns my original dataframe, but with an extra column for LowestRate attached.
How would I go from this to what I want?
I tried to left_join / merge based on ID and Lowest Rate, but there are multiple matches for ID #4. Is there any way to pick only one match (row)?
df.joined <- left_join(df.Summarise,df,by = c("ID","LowestRate" = "Rate"))
df.joined as table
| ID | LowestRate | Year |
| --- | --- | --- |
| 1 | 0.3 | 2010 |
| 2 | 0.2 | 2012 |
| 3 | 0.4 | 2011 |
| 4 | 0.7 | 2010 |
| 4 | 0.7 | 2011 |
| 4 | 0.7 | 2012 |
I've tried looking online, but I can't really find anything that strikes this exactly.
Using ".drop = FALSE" for group_by() didn't help, as it seems to be intended for empty values?
The dataset I'm working with is large, so I'd really like to find how to make this work and avoid hard-coding anything :)
Thanks for any help!
You can group by ID and then filter without summarizing. That way you preserve all columns but keep only the rows with the minimum Rate (if several rows tie for the minimum, as for ID 4, all of them are kept):
df %>%
group_by(ID) %>%
filter(Rate == min(Rate))
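If exactly one row per ID is needed even when rates tie, one option (assuming dplyr 1.0.0 or later) is slice_min() with with_ties = FALSE. This is a sketch on the question's sample data; which of the tied rows is kept for ID 4 is not something the question specifies, so sort the data first if a particular year should win.

```r
library(dplyr)  # slice_min() needs dplyr >= 1.0.0

df <- data.frame(ID = rep(1:4, each = 3),
                 Year = rep(2010:2012, 4),
                 Rate = c(0.3, 0.6, 0.9,
                          0.8, 0.5, 0.2,
                          0.8, 0.4, 0.9,
                          0.7, 0.7, 0.7))

lowest <- df %>%
  group_by(ID) %>%
  slice_min(Rate, n = 1, with_ties = FALSE) %>%  # exactly one row per ID
  ungroup()
lowest
```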

How to generate Z-test including totals of variables in columns in expss?

Two questions in fact.
How to add totals for variables in columns in expss?
Is it possible to perform Z-test for variables in columns including total as a different category?
Below you can find a piece of code I ran, but it didn't work. I couldn't even add totals on the right or left side of the column variable.
test_table = tab_significance_options(data = df, compare_type = "subtable", bonferroni = TRUE, subtable_marks = "both") %>%
tab_cells(VAR1) %>%
tab_total_statistic("w_cpct") %>%
tab_cols(VAR2) %>%
tab_stat_cpct() %>%
tab_cols(total(VAR2)) %>%
tab_last_sig_cpct() %>%
tab_pivot(stat_position = "outside_columns")
I would be grateful for any advice.
To compare with the first column, you additionally need to specify "first_column" in compare_type. Second, for a correct result, one of the total statistics should be cases. Taking all of the above into account:
library(expss)
data(mtcars)
test_table = mtcars %>%
tab_significance_options(compare_type = c("first_column", "subtable"), bonferroni = TRUE, subtable_marks = "both") %>%
tab_total_statistic(c("u_cases", "w_cpct")) %>%
tab_cells(gear) %>%
tab_cols(total(am), am) %>%
tab_stat_cpct() %>%
tab_last_sig_cpct() %>%
tab_pivot()
test_table
# | | | #Total | am | |
# | | | | 0 | 1 |
# | | | | A | B |
# | ---- | ---------------- | ------ | -------- | -------- |
# | gear | 3 | 46.9 | 78.9 + | |
# | | 4 | 37.5 | 21.1 < B | 61.5 > A |
# | | 5 | 15.6 | | 38.5 |
# | | #Total cases | 32 | 19 | 13 |
# | | #Total wtd. cpct | 100 | 100 | 100 |
