R: Subset Large Data Frame with Multiple Conditions

I have a large data frame with 12 million rows and 5 columns. I want to subset the large data frame with multiple conditions. I need to do this multiple times with different criteria, so I created a Look-Up Table and a for loop.
The code below loops through and subsets the large data frame, saving each iteration's result as an element of a list. After the loop completes, I combine the list elements into a single data frame.
My current set-up functions, but it is painfully slow (about 15 minutes for 8 loops). Subsetting is actually taking more time than it took to calculate the mean and SD for the 12 million-row table!
Any advice on how to speed this up?
>scaled
| chr | site | Average_CPMn | SD_CPMn |
|------|------|--------------|---------|
| chrI | 1 | 0.071 | 0.070 |
| chrI | 2 | 0.120 | 0.111 |
| chrI | 3 | 0.000 | 0.000 |
| chrI | 4 | 0.000 | 0.000 |
| chrI | 5 | 0.000 | 0.000 |
| chrI | 6 | 0.156 | 0.056 |
...12,000,000 rows
>genes.df
| Gene | Chromosome | Meta_Start | Meta_Stop |
|---------|------------|------------|-----------|
| YGL234W | chrVII | 55982 | 59390 |
| YGR061C | chrVII | 611389 | 616465 |
| YMR120C | chrXIII | 507002 | 509780 |
| YLR359W | chrXII | 843782 | 846230 |
library(dplyr)
library(readr)
scaled <- read_rds("~/Desktop/scaled.rds")
subset_list <- list()
for (i in seq_len(nrow(genes.df))) {
  # keep the rows of `scaled` that fall inside gene i's interval on its chromosome
  gene_subset <- scaled %>%
    dplyr::filter(chr == genes.df$Chromosome[i] & site >= genes.df$Meta_Start[i] & site <= genes.df$Meta_Stop[i]) %>%
    dplyr::mutate(Gene = genes.df$Gene[i])
  subset_list[[i]] <- gene_subset
}
# combine gene-list into single dataframe
counts_subset <- as.data.frame(do.call(rbind, subset_list)) %>%
  left_join(genes.df, by = "Gene")

You haven't shared the data or a sample, so it is difficult to demonstrate. However, I suggest using semi_join (if you only want to subset) or left_join (if you want to add the gene columns as well), somewhat like this in the tidyverse. The column names must match genes.df exactly (Meta_Start and Meta_Stop):
scaled %>% semi_join(genes.df %>% pivot_longer(c(Meta_Start, Meta_Stop)) %>%
  group_by(Gene, Chromosome) %>%
  complete(value = seq(min(value), max(value), 1)) %>%
  ungroup %>% select(-name), by = c('chr' = 'Chromosome', 'site' = 'value'))
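As an alternative to expanding every gene interval into one row per site, the range condition can be expressed directly as a join. The following is a minimal sketch, assuming dplyr 1.1.0 or later (which provides join_by() with the between() helper). The toy tables only imitate the column layout from the question, and their values are made up for illustration. Whether this is faster on the real 12-million-row table has not been measured here.

```r
library(dplyr)  # join_by() requires dplyr >= 1.1.0

# Toy stand-ins for the real tables (values are illustrative only)
scaled <- data.frame(
  chr = c("chrI", "chrI", "chrVII", "chrVII", "chrVII"),
  site = c(1, 2, 55981, 55982, 59390),
  Average_CPMn = c(0.071, 0.120, 0.010, 0.020, 0.030)
)
genes.df <- data.frame(
  Gene = c("YGL234W", "YGR061C"),
  Chromosome = c("chrVII", "chrVII"),
  Meta_Start = c(55982, 611389),
  Meta_Stop = c(59390, 616465)
)

# One join instead of one filter per gene: match on chromosome,
# then keep sites with Meta_Start <= site <= Meta_Stop (bounds inclusive)
counts_subset <- inner_join(
  scaled, genes.df,
  by = join_by(chr == Chromosome, between(site, Meta_Start, Meta_Stop))
)
```

Because this is an inner join, the result already carries Gene, Meta_Start and Meta_Stop, so the separate left_join back onto genes.df is not needed in this sketch.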

Related

Adding a variable name for multiple-answer table expss in R / creating a variable to capture multiple answers

I want to add a variable name for the multiple-answer question Q6, which consists of 12 columns (Q6_1 to Q6_12). Adding a label as follows does not give me the intended result: it adds a total_row column. I just need a label to indicate that this is the table for Q6.
Alternatively, if you know a way to create a single variable that captures all the multiple answers, please let me know.
banner %>%
tab_cells(mrset(Q6_1 %to% Q6_12, lablel="Q6_test")) %>%
tab_stat_cpct() %>%
tab_pivot() %>%
tab_sort_desc()
You have a typo in your code: lablel should be label.
library(expss)
mtcars %>%
tab_cells(mrset(am, cyl, label = "am+cyl")) %>%
tab_stat_cpct() %>%
tab_pivot()
# | | | #Total |
# | ------ | ------------ | ------ |
# | am+cyl | 0 | 59.4 |
# | | 1 | 40.6 |
# | | 4 | 34.4 |
# | | 6 | 21.9 |
# | | 8 | 43.8 |
# | | #Total cases | 32.0 |

How to generate Z-test including totals of variables in columns in expss?

Two questions in fact.
How to add totals for variables in columns in expss?
Is it possible to perform Z-test for variables in columns including total as a different category?
Below is the code I ran, but it didn't work. I couldn't even add totals on the right or left side of the column variable.
test_table = tab_significance_options(data = df, compare_type = "subtable", bonferroni = TRUE, subtable_marks = "both") %>%
tab_cells(VAR1) %>%
tab_total_statistic("w_cpct") %>%
tab_cols(VAR2) %>%
tab_stat_cpct() %>%
tab_cols(total(VAR2)) %>%
tab_last_sig_cpct() %>%
tab_pivot(stat_position = "outside_columns")
I would be grateful for any advice.
To compare with the first column, you additionally need to specify "first_column" in compare_type. Second, for a correct result, one of the total statistics should be cases. Taking all of the above into account:
library(expss)
data(mtcars)
test_table = mtcars %>%
tab_significance_options(compare_type = c("first_column", "subtable"), bonferroni = TRUE, subtable_marks = "both") %>%
tab_total_statistic(c("u_cases", "w_cpct")) %>%
tab_cells(gear) %>%
tab_cols(total(am), am) %>%
tab_stat_cpct() %>%
tab_last_sig_cpct() %>%
tab_pivot()
test_table
# | | | #Total | am | |
# | | | | 0 | 1 |
# | | | | A | B |
# | ---- | ---------------- | ------ | -------- | -------- |
# | gear | 3 | 46.9 | 78.9 + | |
# | | 4 | 37.5 | 21.1 < B | 61.5 > A |
# | | 5 | 15.6 | | 38.5 |
# | | #Total cases | 32 | 19 | 13 |
# | | #Total wtd. cpct | 100 | 100 | 100 |

How to merge attributes on a frequency table in R?

Assume that I have two variables. See the dummy data below:
Out of 250 records:
SEX
Male : 100
Female : 150
HAIR
Short : 110
Long : 140
The code I currently use is provided below. A separate table is created for each variable:
sexTable <- table(myDataSet$Sex)
hairTable <- table(myDataSet$Hair)
View(sexTable):
|------------------|------------------|
| Level | Frequency |
|------------------|------------------|
| Male | 100 |
| Female | 150 |
|------------------|------------------|
View(hairTable)
|------------------|------------------|
| Level | Frequency |
|------------------|------------------|
| Short | 110 |
| Long | 140 |
|------------------|------------------|
My question is how to merge the two tables in R into the following format, and how to calculate the percentage of the frequency for each level within its group:
|---------------------|------------------|------------------|
| Variables | Level | Frequency |
|---------------------|------------------|------------------|
| Sex(N=250) | Male | 100 (40%) |
| | Female | 150 (60%) |
| Hair(N=250) | Short | 110 (44%) |
| | Long | 140 (56%) |
|---------------------|------------------|------------------|
We can use bind_rows after converting to data.frame
library(dplyr)
bind_rows(list(sex = as.data.frame(sexTable),
Hair = as.data.frame(hairTable)), .id = 'Variables')
Using a reproducible example
tbl1 <- table(mtcars$cyl)
tbl2 <- table(mtcars$vs)
bind_rows(list(sex = as.data.frame(tbl1),
Hair = as.data.frame(tbl2)), .id = 'Variables')%>%
mutate(Variables = replace(Variables, duplicated(Variables), ""))
If we also need the percentages
dat1 <- transform(as.data.frame(tbl1),
Freq = sprintf('%d (%0.2f%%)', Freq, as.numeric(prop.table(tbl1) * 100)))
dat2 <- transform(as.data.frame(tbl2),
Freq = sprintf('%d (%0.2f%%)', Freq, as.numeric(prop.table(tbl2) * 100)))
bind_rows(list(sex = dat1, Hair = dat2), .id = 'Variables')
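To get closer to the exact layout requested in the question (an "N=" label on the first row of each group and whole-number percentages), the steps above can be wrapped in a small helper. This is a sketch in base R only; freq_block is a hypothetical helper name, and myDataSet is rebuilt here from the counts given in the question rather than taken from real data.

```r
# Dummy data reconstructed from the counts in the question
myDataSet <- data.frame(
  Sex  = factor(rep(c("Male", "Female"), c(100, 150)), levels = c("Male", "Female")),
  Hair = factor(rep(c("Short", "Long"), c(110, 140)), levels = c("Short", "Long"))
)

# Hypothetical helper: one block of rows per variable, with the
# "Name(N=...)" label shown only on the first row of the block
freq_block <- function(x, name) {
  tbl <- table(x)
  pct <- round(100 * prop.table(tbl))
  data.frame(
    Variables = c(sprintf("%s(N=%d)", name, sum(tbl)), rep("", length(tbl) - 1)),
    Level     = names(tbl),
    Frequency = sprintf("%d (%d%%)", as.integer(tbl), as.integer(pct)),
    stringsAsFactors = FALSE
  )
}

merged <- rbind(
  freq_block(myDataSet$Sex, "Sex"),
  freq_block(myDataSet$Hair, "Hair")
)
merged
```

The factor levels are set explicitly because table() otherwise orders levels alphabetically, which would put Female before Male.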

Multiple replacement by matching values within one row

This may be a slightly silly question, but I can't manage to solve my problem.
I have a table with some codes, where some rows contain several codes separated by spaces:
| Codes |
|-------------|
| 12.12 |
| 12.12 12.13 |
| 12.11 12.13 |
| 12.10 |
I have to match this code with values from another table
| Code | Value |
|-------|-------|
| 12.10 | AA |
| 12.11 | BB |
| 12.12 | CC |
| 12.13 | DD |
to get the following result (desired separator is comma, but it doesn't really matter):
| Codes |
|-------|
| CC |
| CC,DD |
| BB,DD |
| AA |
I have tried to achieve result like this:
dataframe1$Codes <- dataframe2$values[match(unlist(strsplit(dataframe1 $Codes)) ,dataframe2$Code)]
But I get error: replacement has X rows, data has Y
Your data:
df <- data.frame(Codes=c("12.12","12.12 12.13","12.11 12.13","12.10"),
stringsAsFactors=F)
vals <- data.frame(Code=c("12.10","12.11","12.12","12.13"),
Value=c("AA","BB","CC","DD"),
stringsAsFactors=F)
I use dplyr and iterators:
library(dplyr)
library(iterators)
Make a nested list of Codes in df:
temp <- lapply(iter(df,by="row"),function(x) unlist(strsplit(x," ")))
Match df$Codes to vals$Code, grab paired vals$Value, and paste and convert to data frame:
df1 <- lapply(iter(temp),function(x) paste0(vals$Value[vals$Code %in% x],collapse=",")) %>%
do.call(rbind,.) %>%
as.data.frame() %>%
rename(Codes=V1)
Output
Codes
1 CC
2 CC,DD
3 BB,DD
4 AA
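The same replacement can also be done without extra packages. The sketch below uses base R only: it builds a named lookup vector from vals, splits each row on spaces, and pastes the matched values back together. The names lookup and df_out are introduced here for illustration. Unlike the %in% approach above, indexing by name keeps the values in the order in which the codes appear in each row.

```r
df <- data.frame(Codes = c("12.12", "12.12 12.13", "12.11 12.13", "12.10"),
                 stringsAsFactors = FALSE)
vals <- data.frame(Code = c("12.10", "12.11", "12.12", "12.13"),
                   Value = c("AA", "BB", "CC", "DD"),
                   stringsAsFactors = FALSE)

# Named vector: names are the codes, elements are the replacement values
lookup <- setNames(vals$Value, vals$Code)

# Split each row into its codes, look each one up, and re-join with commas
df_out <- data.frame(
  Codes = vapply(
    strsplit(df$Codes, " ", fixed = TRUE),
    function(codes) paste(lookup[codes], collapse = ","),
    character(1)
  ),
  stringsAsFactors = FALSE
)
df_out
```

A code that has no entry in vals would show up as NA in the pasted string, so this sketch assumes every code is present in the lookup table.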

summing columns in a data frame based on the values stored within another data frame in R

I want to summarise a group of columns from df1
df1
|   | A     | B     | C     | D     |
|---|-------|-------|-------|-------|
| 1 | 0.870 | 0.435 | 0.968 | 0.679 |
| 2 | 0.456 | 0.259 | 0.906 | 0.467 |
| 3 | 0.298 | 0.256 | 0.457 | 0.768 |
| 4 | 0.994 | 0.987 | 0.365 | 0.765 |
if their column names appear as values within a column called test within df2
df2
|   | test |
|---|------|
| 1 | A    |
| 2 | B    |
I've tried to use the following code, but I get the error shown below it:
columns.to.add <- unique(df2$test)
df2$test <- colSums(columns.to.add)
Error in base::colSums(x, na.rm = na.rm, dims = dims, ...) : 'x' must be an array of at least two dimensions
colSums() needs a data frame or matrix, but columns.to.add is only a character vector of names, so it fails. You have to use the names to index the columns of df1:
colSums(df1[, df2$test])
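A small runnable version of this, using the values from the question, might look as follows. Two details are assumptions added here rather than part of the answer above: as.character() guards against df2$test being a factor (indexing by a factor would use its integer codes instead of the names), and drop = FALSE keeps the result a data frame even if only one column is selected.

```r
df1 <- data.frame(
  A = c(0.870, 0.456, 0.298, 0.994),
  B = c(0.435, 0.259, 0.256, 0.987),
  C = c(0.968, 0.906, 0.457, 0.365),
  D = c(0.679, 0.467, 0.768, 0.765)
)
df2 <- data.frame(test = c("A", "B"), stringsAsFactors = FALSE)

# Column names to sum, as plain character strings
columns.to.add <- unique(as.character(df2$test))

# Select those columns by name, then sum each one
column_sums <- colSums(df1[, columns.to.add, drop = FALSE])
column_sums
```

The result is a named numeric vector with one entry per selected column, so it can be looked up by name, for example column_sums["A"].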
