Display long table with hiding records - r

I want to show a table that displays the top n and bottom n records when the table is very long.
df <- nycflights13::flights
funct <- function(data, var){
var_lab(data[[var]])<-"Table 1"
t1<- expss::cro_cpct(data[[var]])
t1
}
funct(data=df,var="distance")
# I tried the following, but it still doesn't work
t1<- expss::cro_cpct(df[["distance"]]) %>% filter(row_number() <= 10 | row_number() >= (n() - 10)) %>%
add_row(.after = 10)
t2 <- t1 %>% mutate(across(everything(), as.character))
t3 <- t2 %>% mutate(across(everything(), ~replace_na(t2, "...")))
I want to add a parameter that trims the table as shown below. For example, with n = 10 it should show the first 10 and the last 10 records and hide the rest, without changing the original percentage values.

Not very nice, but works for me:
library(expss)
df <- nycflights13::flights
funct <- function(data, var){
var_lab(data[[var]])<-"Table 1"
t1<- expss::cro_cpct(data[[var]])
t1
}
res = funct(data=df,var="distance")
res = add_rows(
head(res, 10),
NA,
tail(res, 10)
)
# All row labels are located in the first column separated with '|'.
# We need to replace the last label with '...'.
# That's why we have this regular expression here.
res$row_labels[11] = gsub("\\|[^|]+$", "|...", res$row_labels[1])
# I don't recommend using the line below because it converts all numerics to characters.
# It can complicate the further processing.
# It's better to leave all columns except row_labels as is, e. g. filled with NA
res[11, -1] = '...'
res
# | | | #Total |
# | ------- | ------------ | -------------------- |
# | Table 1 | 17 | 0.000296933273154857 |
# | | 80 | 0.014549730384588 |
# | | 94 | 0.28980687459914 |
# | | 96 | 0.180238496804998 |
# | | 116 | 0.131541440007601 |
# | | 143 | 0.130353706914982 |
# | | 160 | 0.111646910706226 |
# | | 169 | 0.161828633869397 |
# | | 173 | 0.0656222533672233 |
# | | 184 | 1.63432073544433 |
# | | ... | ... |
# | | 2475 | 3.34406252227 |
# | | 2521 | 0.0843290495759793 |
# | | 2565 | 1.52237689146495 |
# | | 2569 | 0.0976910468679478 |
# | | 2576 | 0.0926431812243153 |
# | | 2586 | 2.43604057296244 |
# | | 3370 | 0.00237546618523885 |
# | | 4963 | 0.108380644701523 |
# | | 4983 | 0.101551179418961 |
# | | #Total cases | 336776 |
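
The head/tail idea above can also be wrapped into a function that takes n as a parameter. This is only a minimal base R sketch on a plain data frame, not an expss table: `trim_rows` is a hypothetical helper name, and mtcars$hp stands in for the distance column. The percentages are computed on the full table first, so trimming does not change them.

```r
# Hypothetical helper: keep the first and last `n` rows of a data frame
# and put a single "..." placeholder row between them.
trim_rows <- function(tbl, n = 10) {
  stopifnot(is.data.frame(tbl), n >= 1)
  # Nothing to hide when the table has at most 2 * n rows.
  if (nrow(tbl) <= 2 * n) return(tbl)
  top <- head(tbl, n)
  bottom <- tail(tbl, n)
  # Convert to character only for display, so "..." fits in every column.
  top[] <- lapply(top, as.character)
  bottom[] <- lapply(bottom, as.character)
  gap <- top[1, , drop = FALSE]
  gap[1, ] <- "..."
  out <- rbind(top, gap, bottom)
  rownames(out) <- NULL
  out
}

# Percentages are computed on the full data, then the table is trimmed.
freq <- as.data.frame(table(hp = mtcars$hp), stringsAsFactors = FALSE)
freq$pct <- round(100 * freq$Freq / sum(freq$Freq), 1)
trimmed <- trim_rows(freq, n = 3)
trimmed
```

Because the result is all character, it is best to call such a helper as the very last step, just before printing or exporting.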

Filter and add_row in between the top and bottom rows:
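library(dplyr)   # select, arrange, filter, mutate, across
library(tibble)  # add_row, as_tibble
library(tidyr)   # replace_na
df <- nycflights13::flights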
df %>%
select(carrier, distance) %>%
arrange(desc(distance )) %>%
filter(row_number() <= 10 | row_number() > (n() - 10)) %>% # first 10 and last 10 rows
mutate(across(everything(), as.character)) %>%
add_row(.after = 10, carrier = "...", distance = "...") %>%
writexl::write_xlsx(., "table.xlsx")
If you want an SPSS-style format, you could build it manually with the janitor package, e.g.:
df %>%
janitor::tabyl(distance ) %>%
select(-n) %>%
arrange(desc(distance )) %>%
janitor::adorn_totals() %>%
janitor::adorn_pct_formatting() %>%
filter(row_number() <= 10 | row_number() >= (n() - 10)) %>%
add_row(.after = 10) %>%
as_tibble() %>%
mutate(across(everything(), as.character)) %>%
mutate(across(everything(), ~replace_na(.x, "...")))

Related

Adding a variable name for multiple-answer table expss in R / creating a variable to capture multiple answers

I want to add a variable name for the multiple-answer question Q6, which consists of 12 columns (Q6_1 to Q6_12). Adding a label as follows does not give me the intended result: it adds a total_row column. I just need a label to indicate that this is the table for Q6.
Alternatively, if you know a way to create a single variable that captures all the multiple answers, please let me know.
banner %>%
tab_cells(mrset(Q6_1 %to% Q6_12, lablel="Q6_test")) %>%
tab_stat_cpct() %>%
tab_pivot() %>%
tab_sort_desc()
You have a typo in your code: lablel should be label:
library(expss)
mtcars %>%
tab_cells(mrset(am, cyl, label = "am+cyl")) %>%
tab_stat_cpct() %>%
tab_pivot()
# | | | #Total |
# | ------ | ------------ | ------ |
# | am+cyl | 0 | 59.4 |
# | | 1 | 40.6 |
# | | 4 | 34.4 |
# | | 6 | 21.9 |
# | | 8 | 43.8 |
# | | #Total cases | 32.0 |

total() in tab_cols only sum up to one, any suggestion?

Suppose I have dataframe 'y'
WR<-c("S",'J',"T")
B<-c("b1","b2","b3")
wgt<-c(0.3,2,3)
y<-data.frame(WR,B,wgt)
I want to make a column-percentage crosstab with B as rows and WR plus the WR total as columns, using expss functions:
library(expss)
y %>% tab_cols(total(),WR) %>% # Columns
tab_stat_valid_n("Base") %>%
tab_weight(wgt) %>%
tab_stat_valid_n("Projection") %>%
tab_cells(mrset(B))%>% # Row
tab_stat_cpct(total_row_position = "none") %>%
tab_pivot()
Result: the Base value in the #Total column does not add up (it shows 1 instead of 3):
# #Total WR|J WR|S WR|T
# Base 1.000000 1 1.0 1
# Projection 5.300000 2 0.3 3
# b1 5.660377 NA 100.0 NA
# b2 37.735849 100 NA NA
# b3 56.603774 NA NA 100
I think I found the solution: move tab_cells() ahead of the tab_stat_* calls:
y %>% tab_cols(total(),WR) %>% # Columns
tab_cells(mrset(B))%>% # Row
tab_stat_valid_n("Base") %>%
tab_weight(wgt) %>%
tab_stat_valid_n("Projection") %>%
tab_stat_cpct(total_row_position = "none") %>%
tab_pivot()
| | | #Total | WR | | |
| | | | J | S | T |
| -- | ---------- | ------ | --- | ----- | --- |
| B | Base | 3.0 | 1 | 1.0 | 1 |
| | Projection | 5.3 | 2 | 0.3 | 3 |
| b1 | | 5.7 | | 100.0 | |
| b2 | | 37.7 | 100 | | |
| b3 | | 56.6 | | | 100 |
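
The #Total column of the corrected table can be checked by hand. The base R sketch below is not expss code; it only recomputes the expected numbers from the same data (the variable names `base_total`, `projection_total` and `cpct_total` are made up for illustration).

```r
WR  <- c("S", "J", "T")
B   <- c("b1", "b2", "b3")
wgt <- c(0.3, 2, 3)
y   <- data.frame(WR, B, wgt)

# Unweighted number of cases: the "Base" row of the #Total column.
base_total <- nrow(y)
# Weighted number of cases: the "Projection" row of the #Total column.
projection_total <- sum(y$wgt)
# Weighted column percentages of B within the total.
cpct_total <- round(100 * tapply(y$wgt, y$B, sum) / projection_total, 1)

base_total
projection_total
cpct_total
```

These values agree with the Base, Projection and b1 to b3 rows of the #Total column above.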

R: Subset Large Data Frame with Multiple Conditions

I have a large data frame with 12 million rows and 5 columns. I want to subset the large data frame with multiple conditions. I need to do this multiple times with different criteria, so I created a Look-Up Table and a for loop.
The code below loops through and subsets the large data frame, saving each iteration as a list within a list. After the loop completes, I combine the lists into a data frame.
My current set-up functions, but it is painfully slow (about 15 minutes for 8 loops). Subsetting is actually taking more time than it took to calculate the mean and SD for the 12 million-row table!
Any advice on how to speed this up?
>scaled
| chr | site | Average_CPMn | SD_CPMn |
|------|------|--------------|---------|
| chrI | 1 | 0.071 | 0.070 |
| chrI | 2 | 0.120 | 0.111 |
| chrI | 3 | 0.000 | 0.000 |
| chrI | 4 | 0.000 | 0.000 |
| chrI | 5 | 0.000 | 0.000 |
| chrI | 6 | 0.156 | 0.056 |
...12,000,000 rows
>genes.df
| Gene | Chromosome | Meta_Start | Meta_Stop |
|---------|------------|------------|-----------|
| YGL234W | chrVII | 55982 | 59390 |
| YGR061C | chrVII | 611389 | 616465 |
| YMR120C | chrXIII | 507002 | 509780 |
| YLR359W | chrXII | 843782 | 846230 |
scaled <- read_rds("~/Desktop/scaled.rds")
subset_list = list()
for (i in 1:nrow(genes.df)) {
subset <- scaled %>%
dplyr::filter(chr == genes.df$Chromosome[i] & site >= genes.df$Meta_Start[i] & site <= genes.df$Meta_Stop[i]) %>%
dplyr::mutate(Gene = genes.df$Gene[i])
subset_list[[i]] <- subset
}
# combine gene-list into single dataframe
counts_subset <- as.data.frame(do.call(rbind, subset_list)) %>%
left_join(genes.df, by = "Gene")
You haven't shared the data or a sample, so it is difficult to demonstrate. However, I suggest using semi_join (if you only want to subset) or left_join (if you also want to add columns), along these lines in the tidyverse:
scaled %>% semi_join(genes.df %>% pivot_longer(c(Meta_Start, Meta_Stop)) %>%
group_by(Gene, Chromosome) %>%
complete(value = seq(min(value), max(value), 1)) %>%
ungroup %>% select(-name), by = c('chr' = 'Chromosome', 'site' = 'value'))
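
An alternative that avoids expanding every gene range to one row per site is a non-equi join. The sketch below assumes dplyr 1.1.0 or later (for `join_by()` with inequality conditions) and uses tiny made-up stand-ins for scaled and genes.df; the gene names and values are illustrative only, and I have not timed it on 12 million rows.

```r
library(dplyr)

# Small made-up stand-ins for the real tables.
scaled <- tibble(
  chr = rep(c("chrI", "chrII"), each = 6),
  site = rep(1:6, times = 2),
  Average_CPMn = seq(0.1, 1.2, by = 0.1)
)
genes.df <- tibble(
  Gene = c("geneA", "geneB"),
  Chromosome = c("chrI", "chrII"),
  Meta_Start = c(2L, 4L),
  Meta_Stop = c(3L, 6L)
)

# One join replaces the loop: keep every site that falls inside a gene's
# [Meta_Start, Meta_Stop] range on the matching chromosome.
counts_subset <- inner_join(
  scaled, genes.df,
  by = join_by(chr == Chromosome, site >= Meta_Start, site <= Meta_Stop)
)
counts_subset
```

The result already carries Gene, Meta_Start and Meta_Stop, so the extra left_join back to genes.df is not needed.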

How to generate Z-test including totals of variables in columns in expss?

Two questions in fact.
How to add totals for variables in columns in expss?
Is it possible to perform Z-test for variables in columns including total as a different category?
Below is the code I ran, but it didn't work: I couldn't even add totals to the right or left of the column variable.
test_table = tab_significance_options(data = df, compare_type = "subtable", bonferroni = TRUE, subtable_marks = "both") %>%
tab_cells(VAR1) %>%
tab_total_statistic("w_cpct") %>%
tab_cols(VAR2) %>%
tab_stat_cpct() %>%
tab_cols(total(VAR2)) %>%
tab_last_sig_cpct() %>%
tab_pivot(stat_position = "outside_columns")
I would be grateful for any advice.
To compare with the first column, you additionally need to specify "first_column" in 'compare_type'. Second, for a correct result one of the total statistics should be cases. Taking all of the above into account:
library(expss)
data(mtcars)
test_table = mtcars %>%
tab_significance_options(compare_type = c("first_column", "subtable"), bonferroni = TRUE, subtable_marks = "both") %>%
tab_total_statistic(c("u_cases", "w_cpct")) %>%
tab_cells(gear) %>%
tab_cols(total(am), am) %>%
tab_stat_cpct() %>%
tab_last_sig_cpct() %>%
tab_pivot()
test_table
# | | | #Total | am | |
# | | | | 0 | 1 |
# | | | | A | B |
# | ---- | ---------------- | ------ | -------- | -------- |
# | gear | 3 | 46.9 | 78.9 + | |
# | | 4 | 37.5 | 21.1 < B | 61.5 > A |
# | | 5 | 15.6 | | 38.5 |
# | | #Total cases | 32 | 19 | 13 |
# | | #Total wtd. cpct | 100 | 100 | 100 |
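
To get a feel for what the `< B` and `> A` marks in the gear = 4 row mean, here is a textbook two-proportion z-test in base R for that row. It is only an illustration, assuming a pooled standard error and no Bonferroni adjustment; the exact formula expss uses may differ.

```r
# Counts from mtcars: gear by transmission type (am).
tab <- table(mtcars$am, mtcars$gear)
x <- tab[, "4"]      # cars with gear == 4 in each am group
n <- rowSums(tab)    # group sizes

p <- x / n                          # column percentages as proportions
p_pooled <- sum(x) / sum(n)         # pooled proportion under H0
se <- sqrt(p_pooled * (1 - p_pooled) * sum(1 / n))
z <- unname((p[1] - p[2]) / se)     # am = 0 versus am = 1
p_value <- 2 * pnorm(-abs(z))       # two-sided p-value

round(100 * p, 1)
z
p_value
```

A negative z with a small p-value corresponds to column A (am = 0) being significantly lower than column B (am = 1) in that row.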

How to merge attributes on a frequency table in R?

Assume that I have two variables. See the dummy data below:
Out of 250 records:
SEX
Male : 100
Female : 150
HAIR
Short : 110
Long : 140
The code I currently use is provided below. A separate table is created for each variable:
sexTable <- table(myDataSet$Sex)
hairTable <- table(myDataSet$Hair)
View(sexTable):
|------------------|------------------|
| Level | Frequency |
|------------------|------------------|
| Male | 100 |
| Female | 150 |
|------------------|------------------|
View(hairTable)
|------------------|------------------|
| Level | Frequency |
|------------------|------------------|
| Short | 110 |
| Long | 140 |
|------------------|------------------|
My question is how to merge the two tables in R into the following format, and how to calculate the percentage of each level within its variable:
|---------------------|------------------|------------------|
| Variables | Level | Frequency |
|---------------------|------------------|------------------|
| Sex(N=250) | Male | 100 (40%) |
| | Female | 150 (60%) |
| Hair(N=250) | Short | 110 (44%) |
| | Long | 140 (56%) |
|---------------------|------------------|------------------|
We can use bind_rows after converting to data.frame
library(dplyr)
bind_rows(list(sex = as.data.frame(sexTable),
Hair = as.data.frame(hairTable)), .id = 'Variables')
Using a reproducible example
tbl1 <- table(mtcars$cyl)
tbl2 <- table(mtcars$vs)
bind_rows(list(sex = as.data.frame(tbl1),
Hair = as.data.frame(tbl2)), .id = 'Variables')%>%
mutate(Variables = replace(Variables, duplicated(Variables), ""))
If we also need the percentages
dat1 <- transform(as.data.frame(tbl1),
Freq = sprintf('%d (%0.2f%%)', Freq, as.numeric(prop.table(tbl1) * 100)))
dat2 <- transform(as.data.frame(tbl2),
Freq = sprintf('%d (%0.2f%%)', Freq, as.numeric(prop.table(tbl2) * 100)))
bind_rows(list(sex = dat1, Hair = dat2), .id = 'Variables')
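
The same steps can be sketched in base R as one helper that also builds the "Sex(N=250)" style label from the question. `freq_block` is a hypothetical name, and the data below are simulated to match the dummy counts in the question. Note that table() orders levels alphabetically, so Female comes before Male unless the columns are factors with explicit levels.

```r
# Hypothetical helper: one block of the combined table for a single variable.
freq_block <- function(x, name) {
  tab <- table(x)
  pct <- round(100 * prop.table(tab))
  data.frame(
    # The variable label appears only on the first row of each block.
    Variables = c(sprintf("%s(N=%d)", name, sum(tab)), rep("", length(tab) - 1)),
    Level = names(tab),
    Frequency = sprintf("%d (%.0f%%)", as.integer(tab), as.numeric(pct)),
    stringsAsFactors = FALSE
  )
}

# Simulated data matching the dummy counts in the question.
myDataSet <- data.frame(
  Sex = rep(c("Male", "Female"), times = c(100, 150)),
  Hair = rep(c("Short", "Long"), times = c(110, 140))
)

# Build one block per column and stack them.
result <- do.call(rbind, Map(freq_block, myDataSet, names(myDataSet)))
rownames(result) <- NULL
result
```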
