Add row in etable from expss R package - r

I want to add rows at a specific position in an expss etable. I did it with a brute-force method, but it always adds the row at the start of the etable. Is there a way to add rows at a specific position?
library(tidyverse)
library(expss)
test1 <-
  mtcars %>%
  tab_cells(cyl) %>%
  tab_cols(vs) %>%
  tab_stat_cpct() %>%
  tab_pivot()
test1 %>%
  tibble() %>%
  tibble::add_row(.data = tibble("", test1[2, -1]/test1[1, -1]*100) %>%
                    set_names(names(test1))
                  , .before = 3)

Not sure there's a simple method exported with expss, but we can use expss::add_rows() with a simple custom function to split a table to accomplish this.
insert_row <- function(tbl, where, ...) {
  args <- c(...)
  tbl1 <- tbl[1:where,]
  tbl2 <- tbl[(where+1):nrow(tbl),]
  tbl1 %>%
    add_rows(args) %>%
    add_rows(tbl2)
}
insert_row(test1, 2, c("cyl|4", 300, 40))
| | | vs | |
| | | 0 | 1 |
| --- | ------------ | ---------------- | ---------------- |
| cyl | 4 | 5.55555555555556 | 71.4285714285714 |
| | 6 | 16.6666666666667 | 28.5714285714286 |
| | 4 | 300 | 40 |
| | 8 | 77.7777777777778 | |
| | #Total cases | 18 | 14 |

Solution based on @caldwellst's code, but with automatic ratio calculation:
insert_ratio <- function(tbl, where) {
  if(is.character(where)) {
    # if where is character we search for it in the row labels
    where = grep(where, tbl[[1]], fixed = TRUE)[1]
  }
  isTRUE(where>1) || stop("'where' should be greater than 1 for ratio calculation.")
  isTRUE(where<=NROW(tbl)) || stop("'where' should be less than or equal to the number of rows in the table.")
  tbl1 <- tbl[1:where,]
  # ratio of row 'where' to the row above it, in percent
  to_insert = c(row_labels = tbl[[1]][where], tbl[where, -1]/tbl[where - 1, -1]*100)
  tbl2 <- tbl[(where+1):nrow(tbl),]
  tbl1 %>%
    add_rows(to_insert) %>%
    add_rows(tbl2)
}
insert_ratio(test1, 2)
# | | | vs | |
# | | | 0 | 1 |
# | --- | ------------ | ----- | ---- |
# | cyl | 4 | 5.6 | 71.4 |
# | | 6 | 16.7 | 28.6 |
# | | | 300.0 | 40.0 |
# | | 8 | 77.8 | |
# | | #Total cases | 18.0 | 14.0 |
insert_ratio(test1, "cyl|6")
# the same result
UPDATE
The ratio calculation is moved into a separate function:
ratio = function(tbl, where, label = NULL){
  if(is.character(where)) {
    # if where is character we search for it in the row labels
    where = grep(where, tbl[[1]], fixed = TRUE)[1]
  }
  isTRUE(where>1) || stop("'where' should be greater than 1 for ratio calculation.")
  isTRUE(where<=NROW(tbl)) || stop("'where' should be less than or equal to the number of rows in the table.")
  if(is.null(label)) label = tbl[[1]][where]
  c(row_labels = label, tbl[where, -1]/tbl[where - 1, -1]*100)
}
insert_row = function(tbl, where, row) {
  if(is.character(where)) {
    # if where is character we search for it in the row labels
    where = grep(where, tbl[[1]], fixed = TRUE)[1]
  }
  isTRUE(where<=NROW(tbl)) || stop("'where' should be less than or equal to the number of rows in the table.")
  first_part = seq_len(where)
  tbl1 <- tbl[first_part,]
  tbl2 <- tbl[-first_part,]
  tbl1 %>%
    add_rows(row) %>%
    add_rows(tbl2)
}
insert_row(test1, 2, ratio(test1, 2))
insert_row(test1, "cyl|6", ratio(test1, "cyl|6"))
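The split-and-rebind idea behind both answers is not specific to expss. As a minimal sketch on a plain data.frame, it might look like the following. Here insert_after is a hypothetical helper name, base rbind() stands in for expss::add_rows(), and the toy data is made up for illustration.

```r
# Hypothetical helper: insert `row` after position `after` in a data.frame.
# Base-R analogue of the approach above; not part of expss.
insert_after <- function(df, row, after) {
  stopifnot(after >= 0, after <= nrow(df))
  top <- seq_len(nrow(df)) <= after      # logical index, also safe for after = 0
  out <- rbind(df[top, , drop = FALSE], row, df[!top, , drop = FALSE])
  rownames(out) <- NULL                  # reset row names after binding
  out
}

# Toy data (assumed, not taken from the etable above)
df <- data.frame(label = c("cyl|4", "cyl|6", "cyl|8"),
                 value = c(10, 20, 30),
                 stringsAsFactors = FALSE)
new_row <- data.frame(label = "ratio", value = 200, stringsAsFactors = FALSE)
out <- insert_after(df, new_row, after = 2)
```

Using a logical index rather than a negative one avoids the pitfall that df[-seq_len(0), ] selects no rows at all.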

Related

Split a single row into multiple rows keeping the delimiter intact

I am trying to split a single row in my data set into multiple rows by keeping the delimiter intact.
This is a sample of my input data set
|---------------------|----------------------------------------------- |
| Group | Rules |
|---------------------|----------------------------------------------- |
| 1 | 1. Teams must be split into two |
| | 2. Teams must have ten players in each team |
| | 3. Each player must bring their own gear |
|---------------------|----------------------------------------------- |
When I use the strsplit function, I get the following output:
df = data.frame(rules =unlist(strsplit(as.character(df$Rules),"?=[[digits]]", perl = T)))
|---------------------|----------------------------------------------- |
| Group | Rules |
|---------------------|----------------------------------------------- |
| 1 | 1 |
| 1 | .Teams must be split into two |
| 1 | 2 |
| 1 | .Teams must have ten players in each team |
|---------------------|----------------------------------------------- |
My desired Output
|---------------------|----------------------------------------------- |
| Group | Rules |
|---------------------|----------------------------------------------- |
| 1 | 1.Teams must be split into two |
| 1 | 2.Teams must have ten players in each team |
|---------------------|----------------------------------------------- |
Here is a way to paste each number together with the character string that follows it in column Rules. It throws warnings (as.numeric() coerces the non-numeric strings to NA), not errors.
grp <- cumsum(!is.na(as.numeric(df$Rules)))
res <- lapply(split(df, grp), function(X){
  data.frame(Group = X[[1]][1],
             Rules = paste(X[[2]], collapse = ""))
})
res <- do.call(rbind, res)
res
# Group Rules
#1 1 1.Teams must be split into two
#2 1 2.Teams must have ten players in each team
Data.
df <- data.frame(Group = rep(1, 4),
Rules = c(1, ".Teams must be split into two",
2, ".Teams must have ten players in each team"),
stringsAsFactors = FALSE)
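If the rules start out as one string, as in the question, another option is to extract each numbered item rather than split on the number. This is a minimal sketch, assuming every rule begins with a number followed by a dot and that the rule text itself contains no digits. The rules string is assembled from the question's sample.

```r
# Assumed input: all rules of one group in a single string
rules <- paste("1. Teams must be split into two",
               "2. Teams must have ten players in each team",
               "3. Each player must bring their own gear")

# Match "<number>." followed by everything up to the next digit,
# so the numbering stays attached to its rule
items <- regmatches(rules, gregexpr("[0-9]+\\.[^0-9]*", rules))[[1]]
items <- trimws(items)

out <- data.frame(Group = 1, Rules = items, stringsAsFactors = FALSE)
out
```

If a rule can contain digits in its text, the pattern would need to be tightened, for example by requiring whitespace before the number.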

Generating table from dataframe with proportions of 20 variables, for each row, for each possible combination of said variable in R

I have a dataframe with 1000 rows, each representing a different species; each row has 20 columns holding the proportions of a single variable (amino acids).
For each row (species), I would like to calculate the proportion of each possible combination of the single-letter variables (amino acids).
So each species should have about 1 million calculated combinations of the amino acids (20 letters give 2^20 - 1 = 1,048,575 non-empty combinations).
My code for generating all possible combinations of amino acids is this:
S <- c('G','A','L','M','F','W','K','Q','E','S','P','V','I','C','Y','H','R','N','D','T')
allCombs <- function(x) c(x, lapply(seq_along(x)[-1L],
function(y) combn(x, y, collapse = "")),
recursive = TRUE)
Scombi <- allCombs(S)
My dataframe looks like this:
+----------------------------+----------+------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+
| Species | Domain | Actual OGT | A | C | D | E | F | G | H | I | K | L | M | N | P | Q | R | S | T | V | W | Y |
+----------------------------+----------+------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+
| Acaryochloris_marina | Bacteria | 25 | 0.089806129655016 | 0.011179368033588 | 0.052093758404379 | 0.056116688487831 | 0.033311792369428 | 0.074719969063287 | 0.021456955206517 | 0.062874293719234 | 0.046629846831622 | 0.105160548187069 | 0.023372745414207 | 0.034667218445279 | 0.050847279968411 | 0.052372091362254 | 0.054393907299958 | 0.058415776607691 | 0.059282788930956 | 0.075786041807662 | 0.012266709932789 | 0.025246090272826 |
| Acetobacter_pasteurianus | Bacteria | 26 | 0.113635842586218 | 0.009802006063102 | 0.053600553080754 | 0.058133056353357 | 0.036903783608575 | 0.085210142094237 | 0.021833316616858 | 0.053123968429941 | 0.045353753818743 | 0.096549489115246 | 0.025913145427995 | 0.027225003296464 | 0.052562918173042 | 0.033342785074972 | 0.072705595398914 | 0.049908591821467 | 0.056094207383391 | 0.079084190962059 | 0.010144168305489 | 0.018873482389179 |
| Acetobacterium_woodii | Bacteria | 30 | 0.074955804625209 | 0.011863137047001 | 0.058166310295556 | 0.071786218284636 | 0.03424697521635 | 0.075626240308253 | 0.018397399287915 | 0.087245372635541 | 0.078978610001876 | 0.087790924875632 | 0.03068806687375 | 0.046498124583435 | 0.036120348133785 | 0.031790536900726 | 0.045179171055634 | 0.050727609439901 | 0.055617806111571 | 0.069643619533744 | 0.005984048340735 | 0.028693676448754 |
| Acetohalobium_arabaticum | Bacteria | 37 | 0.07294006171749 | 0.008402092275195 | 0.063388830763099 | 0.094174357919767 | 0.032968396601359 | 0.074335444399095 | 0.014775170057021 | 0.081175614650614 | 0.068173658934912 | 0.096191143631822 | 0.023591084039018 | 0.042176390239929 | 0.036535950562554 | 0.032690297143697 | 0.045929769851454 | 0.05201834344653 | 0.049098780255464 | 0.079225589949997 | 0.004923023531168 | 0.027286000029819 |
| Acholeplasma_laidlawii | Bacteria | 37 | 0.067353087090147 | 0.002160134400001 | 0.056809775441953 | 0.065310218890485 | 0.038735792072418 | 0.069508395797039 | 0.018942086187746 | 0.081435757342441 | 0.084786245636216 | 0.096181862610799 | 0.026545056054257 | 0.045549913713558 | 0.038323250930165 | 0.033008924859672 | 0.047150659509282 | 0.054698408656138 | 0.059971572823796 | 0.072199395290938 | 0.005926270925023 | 0.03540319176793 |
| Achromobacter_xylosoxidans | Bacteria | 30 | 0.120974236639852 | 0.008469732379263 | 0.054028585828065 | 0.055476991380945 | 0.035048667997051 | 0.086814010110846 | 0.02243157894653 | 0.050520668283285 | 0.039296015271673 | 0.099074202941835 | 0.028559018986725 | 0.025845147774914 | 0.049701994138614 | 0.034808403369533 | 0.073998251525545 | 0.050072992977641 | 0.051695040348985 | 0.080314177991249 | 0.011792085285623 | 0.021078197821829 |
+----------------------------+----------+------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+
As you can see, each row has the proportion of each amino acid (A, G, I, etc.) over the entire set of amino acids (all 20 add up to 1), but I would like to generate the proportion of each possible combination, also over 1. So something that looks like the following:
+----------------------+----------+------------+-------------------+-------------------+-------------------+-------------------+
| Species | Domain | Actual OGT | A | AC | AD | AE |
+----------------------+----------+------------+-------------------+-------------------+-------------------+-------------------+
| Acaryochloris_marina | Bacteria | 25 | 0.089806129655016 | 0.191179368033588 | 0.1782093758404379 | 0.186116688487831 |
+----------------------+----------+------------+-------------------+-------------------+-------------------+-------------------+
So for each species there would be about 1 million columns (each representing one of the possible combinations of amino acids, without repetition, so the longest string contains all 20 letters once each).
Apologies for being unclear, does anyone have any ideas on how to create this data-set? (Or the best way of asking/explaining what I should be looking up?)
Species <- structure(list(Species = c("Acaryochloris_marina",
"Acetobacter_pasteurianus",
"Acetobacterium_woodii", "Acetohalobium_arabaticum", "Acholeplasma_laidlawii",
"Achromobacter_xylosoxidans"), Domain = c("Bacteria", "Bacteria",
"Bacteria", "Bacteria", "Bacteria", "Bacteria"), Actual.OGT = c(25,
26, 30, 37, 37, 30), A = c(0.089806129655016, 0.113635842586218,
0.074955804625209, 0.07294006171749, 0.067353087090147, 0.120974236639852
), C = c(0.011179368033588, 0.009802006063102, 0.011863137047001,
0.008402092275195, 0.002160134400001, 0.008469732379263), D = c(0.052093758404379,
0.053600553080754, 0.058166310295556, 0.063388830763099, 0.056809775441953,
0.054028585828065), E = c(0.056116688487831, 0.058133056353357,
0.071786218284636, 0.094174357919767, 0.065310218890485, 0.055476991380945
), F = c(0.033311792369428, 0.036903783608575, 0.03424697521635,
0.032968396601359, 0.038735792072418, 0.035048667997051), G = c(0.074719969063287,
0.085210142094237, 0.075626240308253, 0.074335444399095, 0.069508395797039,
0.086814010110846), H = c(0.021456955206517, 0.021833316616858,
0.018397399287915, 0.014775170057021, 0.018942086187746, 0.02243157894653
), I = c(0.062874293719234, 0.053123968429941, 0.087245372635541,
0.081175614650614, 0.081435757342441, 0.050520668283285), K = c(0.046629846831622,
0.045353753818743, 0.078978610001876, 0.068173658934912, 0.084786245636216,
0.039296015271673), L = c(0.105160548187069, 0.096549489115246,
0.087790924875632, 0.096191143631822, 0.096181862610799, 0.099074202941835
), M = c(0.023372745414207, 0.025913145427995, 0.03068806687375,
0.023591084039018, 0.026545056054257, 0.028559018986725), N = c(0.034667218445279,
0.027225003296464, 0.046498124583435, 0.042176390239929, 0.045549913713558,
0.025845147774914), P = c(0.050847279968411, 0.052562918173042,
0.036120348133785, 0.036535950562554, 0.038323250930165, 0.049701994138614
), Q = c(0.052372091362254, 0.033342785074972, 0.031790536900726,
0.032690297143697, 0.033008924859672, 0.034808403369533), R = c(0.054393907299958,
0.072705595398914, 0.045179171055634, 0.045929769851454, 0.047150659509282,
0.073998251525545), S = c(0.058415776607691, 0.049908591821467,
0.050727609439901, 0.05201834344653, 0.054698408656138, 0.050072992977641
), T = c(0.059282788930956, 0.056094207383391, 0.055617806111571,
0.049098780255464, 0.059971572823796, 0.051695040348985), V = c(0.075786041807662,
0.079084190962059, 0.069643619533744, 0.079225589949997, 0.072199395290938,
0.080314177991249), W = c(0.012266709932789, 0.010144168305489,
0.005984048340735, 0.004923023531168, 0.005926270925023, 0.011792085285623
), Y = c(0.025246090272826, 0.018873482389179, 0.028693676448754,
0.027286000029819, 0.03540319176793, 0.021078197821829)), .Names = c("Species",
"Domain", "Actual.OGT", "A", "C", "D", "E", "F", "G", "H", "I",
"K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y"), row.names = c(NA,
-6L), class = "data.frame")
I'm not entirely sure that R is the right tool for this job. It's going to take a very, very long time. You may be able to reduce that time using the parallel package if you have sufficient cores, however.
I've put together a process that will accomplish what you want. For each species, it takes my computer about eight minutes to generate the "joint proportion." If you run on a single thread, as R does by default, you're looking at close to an hour just to get through the six species in your sample data.
I wrote my script to run in parallel, and using seven cores, it took about 11 minutes to complete all six. Extending this over all 1000 species, I wouldn't be surprised if it took as long as two days to do all this (on seven cores). If you have a large cluster, you may be able to cut it down some.
Please note that this will not give you your results as described in your question. I posted a comment that I wasn't sure what formula you were using to get the joint proportions. I am just taking the sum here for ease of demonstration. You will need to adjust your code appropriately.
library(parallel)
library(dplyr)
library(tidyr)
library(magrittr)
# Reshape data. This will make it easier to split and access proportion
# within each species.
SpeciesLong <-
  Species %>%
  gather(protein, proportion,
         A:Y) %>%
  arrange(Species)
# Get unique proteins (amino acid letters)
S <- unique(SpeciesLong$protein)
# Build the combination list
# Note, this is different than your code, I added FUN = paste0
Scombi <- unlist(lapply(seq_along(S),
                        function(x) combn(S, x, FUN = paste0, collapse = "")))
# Function to get the joint proportion
# I took the sum, for convenience. You'll need to replace this
# with whatever function you use to get the joint proportion.
# The key part is getting the correct proteins, which happens within
# the `sum` call.
joint_protein <- function(protein_combo, data){
  sum(data$proportion[vapply(data$protein,
                             grepl,
                             logical(1),
                             protein_combo)])
}
# Make a list of data frames, one for each species
SplitSpecies <-
  split(SpeciesLong,
        SpeciesLong$Species)
# Make a cluster of processors to run on
cl <- makeCluster(detectCores() - 1)
# export Scombi and joint_protein to all processes in the cluster
clusterExport(cl, c("Scombi", "joint_protein"))
# Get the aggregate values for each species in a one-row data frame.
SpeciesAggregate <-
  parLapply(cl,
            X = SplitSpecies,
            fun = function(data){
              X <- lapply(Scombi,
                          joint_protein,
                          data)
              names(X) <- Scombi
              as.data.frame(X)
            })
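# Release the worker processes
stopCluster(cl)
# Join the results to the Species data
# SpeciesAggregate is a list of one-row data frames in the order of
# names(SplitSpecies), so stack them and match rows by species name.
# You may want to save your data before this step.
SpeciesAggregate <- do.call(rbind, SpeciesAggregate)
Species <- cbind(Species,
                 SpeciesAggregate[match(Species$Species, names(SplitSpecies)), ])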
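To see what the answer computes without the run time, here is a minimal sketch on a made-up three-letter alphabet. It keeps the answer's placeholder assumption that the "joint proportion" of a combination is the sum of its members. The props vector is invented for illustration.

```r
# Toy proportions for one hypothetical species (values are made up)
props <- c(A = 0.5, C = 0.2, D = 0.3)

# For each subset size k, sum the proportions of every k-letter combination
# and name the result by the pasted letters, e.g. "AC"
subset_sums <- unlist(lapply(seq_along(props), function(k) {
  sums <- combn(props, k, FUN = sum)
  names(sums) <- combn(names(props), k, FUN = paste0, collapse = "")
  sums
}))

subset_sums
length(subset_sums) == 2^length(props) - 1  # 7 non-empty subsets of 3 letters
```

With 20 letters the same count is 2^20 - 1 = 1,048,575 per species, which is why the full job is slow and memory-hungry.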

Chi-square p value matrix in r

Is there any way to find the chi-square p-value matrix in 'R' (a matrix with the p-values between the attributes)?
As an example, consider the iris data set. I am looking for a matrix as follows:
| | Sepal length | Sepal width | Petal length | Petal width | Species |
|----------------|--------------|-------------|--------------|-------------|---------|
| Sepal length | | | | | |
| Sepal width | | | | | |
| Petal length | | | | | |
| Petal width | | | | | |
| Species | | | | | |
The elements of the matrix would be the chi-square p-values for the (i,j) variables of the iris data set.
If that is what you want (bearing in mind that only one of those columns is a categorical variable), try this:
chisqmatrix <- function(x) {
  names = colnames(x); num = length(names)
  m = matrix(nrow=num,ncol=num,dimnames=list(names,names))
  # only the upper triangle is filled; the diagonal and lower triangle stay NA
  for (i in 1:(num-1)) {
    for (j in (i+1):num) {
      m[i,j] = chisq.test(x[,i],x[,j])$p.value
    }
  }
  return (m)
}
mat = chisqmatrix(iris)
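Since the chi-square test is meant for categorical variables, a variant of the function above can be tried on columns that really are categorical. The sketch below is one possible take: chisq_pmatrix is a hypothetical name, the choice of mtcars columns is an assumption made for illustration, and warnings about small expected counts are suppressed rather than addressed.

```r
# Hypothetical helper: symmetric matrix of chi-square p-values
# for every pair of columns in a data.frame of categorical variables
chisq_pmatrix <- function(df) {
  vars <- colnames(df)
  m <- matrix(NA_real_, length(vars), length(vars),
              dimnames = list(vars, vars))
  for (pair in combn(vars, 2, simplify = FALSE)) {
    tab <- table(df[[pair[1]]], df[[pair[2]]])
    # small cell counts trigger an approximation warning; silenced here
    p <- suppressWarnings(chisq.test(tab)$p.value)
    m[pair[1], pair[2]] <- p
    m[pair[2], pair[1]] <- p   # mirror into the lower triangle
  }
  m
}

# mtcars columns that behave like categories (an assumed choice)
cat_vars <- mtcars[, c("cyl", "vs", "am", "gear")]
pmat <- chisq_pmatrix(cat_vars)
pmat
```

The diagonal is left as NA because a variable tested against itself carries no information.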

R apply script output in different formats for similar inputs

I'm using a double apply function to get a list of p-values for cor.test between any two columns of two tables.
bc_plist<-apply(bc, 2, function(x) { apply(otud, 2, function(y) { if (cor.test(x,y,method="spearman", exact=FALSE)$p.value<0.05){cor.test(x,y,method="spearman", exact=FALSE)$p.value}}) })
The otud data.frame is 90x11 (90 rows, 11 columns, i.e. dim(otud) is 90 11) and will be used with different data.frames.
bc and hel are both 90x2 data.frames, so for both I get 2*11=22 p-values out of the functions.
bc_plist<-apply(bc, 2, function(x) { apply(otud, 2, function(y) { if (cor.test(x,y,method="spearman", exact=FALSE)$p.value<0.05){cor.test(x,y,method="spearman", exact=FALSE)$p.value}}) })
hel_plist<-apply(hel, 2, function(x) { apply(otud, 2, function(y) { if (cor.test(x,y,method="spearman", exact=FALSE)$p.value<0.05){cor.test(x,y,method="spearman", exact=FALSE)$p.value}}) })
For bc I get an output with dim = NULL: a nested list whose elements are addressed as $axisname$otuname and hold the p-values (a format that I have always got from these scripts and am happy with).
But for hel I get an output with dim(hel_plist) 11 2: an 11x2 table with the p-values written inside.
Shortened examples of output.
hel_plist
+--------+--------------+--------------+
| | axis1 | axis2 |
+--------+--------------+--------------+
| Otu037 | 1.126362e-18 | 0.01158251 |
| Otu005 | 3.017458e-2 | NULL |
| Otu068 | 0.00476002 | NULL |
| Otu070 | 1.27646e-15 | 5.252419e-07 |
+--------+--------------+--------------+
bc_plist
$axis1
$axis1$Otu037
[1] 1.247717e-06
$axis1$Otu005
[1] 1.990313e-05
$axis1$Otu068
[1] 5.664597e-07
Why is it like that when the input formats are all the same? (Shortened examples)
bc
+-------+-----------+-----------+
| group | axis1 | axis2 |
+-------+-----------+-----------+
| 1B041 | 0.125219 | 0.246319 |
| 1B060 | -0.022412 | -0.030227 |
| 1B197 | -0.088005 | -0.305351 |
| 1B222 | -0.119624 | -0.144123 |
| 1B227 | -0.148946 | -0.061741 |
+-------+-----------+-----------+
hel
+-------+---------------+---------------+
| group | axis1 | axis2 |
+-------+---------------+---------------+
| 1B041 | -0.0667782322 | -0.1660606406 |
| 1B060 | 0.0214470932 | -0.0611351008 |
| 1B197 | 0.1761876858 | 0.0927570627 |
| 1B222 | 0.0681058251 | 0.0549292399 |
| 1B227 | 0.0516864361 | 0.0774155225 |
| 1B235 | 0.1205676221 | 0.0181712761 |
+-------+---------------+---------------+
How could I force my scripts to always produce "flat" outputs, as in the case of bc?
OK, the different outputs are caused by the NULL results from the conditional function in the bc_plist case. If I modify the code to replace possible NULLs with NAs, I get 2D tables in every case.
So to keep things consistent:
bc_nmds_plist<-apply(bc_nmds, 2, function(x) { apply(stoma_otud, 2, function(y) { if (cor.test(x,y,method="spearman", exact=FALSE)$p.value<0.05){cor.test(x,y,method="spearman", exact=FALSE)$p.value}else NA}) })
And I get a 2D table out for bc_nmds_plist too.
So I guess this can be called solved, as I now have a piece of code that produces predictable output on any correct input.
If anyone has any idea how to force the output to conform to the previous bc_plist format instead, I would still be interested, as I do actually prefer that form:
$axis1
$axis1$Otu037
[1] 1.247717e-06
$axis1$Otu005
[1] 1.990313e-05
$axis1$Otu068
[1] 5.664597e-07
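One way to get the nested-list form every time is to avoid apply()'s simplification and build the lists explicitly with lapply(). The sketch below assumes both inputs are data.frames. sig_pvalues and the toy data are hypothetical, not from the question.

```r
# Hypothetical helper: for each column of x_df, a named list of the
# p-values below alpha from Spearman tests against each column of y_df
sig_pvalues <- function(x_df, y_df, alpha = 0.05) {
  lapply(x_df, function(x) {
    # run each test once and keep the p-value as a named numeric vector
    p <- vapply(y_df, function(y) {
      cor.test(x, y, method = "spearman", exact = FALSE)$p.value
    }, numeric(1))
    # keep only significant results; as.list() gives the $axis$Otu shape
    as.list(p[!is.na(p) & p < alpha])
  })
}

# Toy data (made up): axis1 is almost perfectly ordered like Otu1
x_df <- data.frame(axis1 = 1:20, axis2 = rep(c(1, 2), 10))
y_df <- data.frame(Otu1 = c(2, 1, 3:20), Otu2 = rep(c(5, 5, 7, 7), 5))
res <- sig_pvalues(x_df, y_df)
str(res)
```

Because lapply() always returns a list, the shape no longer depends on whether any test happens to be non-significant. It also calls cor.test() once per pair instead of twice.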

By group: sum of variable values under condition

I want the sum of a variable's values by group, with certain values excluded conditional on another variable.
How can I do it elegantly without transposing?
In the table below, for each (fTicker, DATE_f) I seek to sum the values of wght, with the wght of the row's own sTicker excluded from the sum.
For example, excl_val for sTicker=A given (fTicker=XLK, DATE_f = 6/20/2003) is wght_AAPL_6/20/2003_XLK + wght_AA_6/20/2003_XLK, without the wght for sTicker=A.
+---------+---------+-----------+-------------+-------------+
| sTicker | fTicker | DATE_f | wght | excl_val |
+---------+---------+-----------+-------------+-------------+
| A | XLK | 6/20/2003 | 0.087600002 | 1.980834016 |
| A | XLK | 6/23/2003 | 0.08585 | 1.898560068 |
| A | XLK | 6/24/2003 | 0.085500002 | |
| AAPL | XLK | 6/20/2003 | 0.070080002 | |
| AAPL | XLK | 6/23/2003 | 0.06868 | |
| AAPL | XLK | 6/24/2003 | 0.068400002 | |
| AA | XLK | 6/20/2003 | 1.910754014 | |
| AA | XLK | 6/23/2003 | 1.829880067 | |
| AA | XLK | 6/24/2003 | 1.819775 | |
| | | | | |
| | | | | |
+---------+---------+-----------+-------------+-------------+
There are several fTicker groups with many sTicker in them (10 to 70), some sTicker may belong to several fTicker. The end result should be an excl_val for each sTicker on each DATE_f and for each fTicker.
I did it by transposing in SAS with resulting file about 6 gb but the same approach in R, blew memory up to 40 gb and it's basically unworkable.
In R, I got as far as this
weights$excl_val <- with(weights, aggregate(wght, list(fTicker, DATE_f), sum, na.rm=T))
but it's just a simple sum (without excluding the necessary observation) and there is a mismatch in the number of rows. If I could condition the sum to exclude the sTicker observation's wght from the summation, I think it might work.
About the excl_val length: I computed it in Excel for just 2 cells, which is why it's short.
Thank you!
Arsenio
When you have data in a data.frame, it is better if the rows are meaningful
(in particular, the columns should have the same length):
in this case, excl_val looks like a separate vector.
After putting the information it contains in the data.frame,
things become easier.
# Sample data
k <- 5
d <- data.frame(
  sTicker = rep(LETTERS[1:k], k),
  fTicker = rep(LETTERS[1:k], each=k),
  DATE_f = sample( seq(Sys.Date(), length=2, by=1), k*k, replace=TRUE ),
  wght = runif(k*k)
)
excl_val <- sample(d$wght, k)
# Add a "valid" column to the data.frame
d$valid <- ! d$wght %in% excl_val
# Compute the sum
library(plyr)
ddply(d, c("fTicker","DATE_f"), summarize, sum=sum(wght[valid]))
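Reading the question as "for each row, the sum of wght over all other sTicker values in the same (fTicker, DATE_f) group", one option that needs no transposing is the group total minus the row's own weight. This is a sketch under that reading, using the numbers from the question's table. It assumes each sTicker appears at most once per group and that wght has no missing values.

```r
# Data typed in from the question's table
weights <- data.frame(
  sTicker = rep(c("A", "AAPL", "AA"), each = 3),
  fTicker = "XLK",
  DATE_f  = rep(c("6/20/2003", "6/23/2003", "6/24/2003"), times = 3),
  wght    = c(0.087600002, 0.08585, 0.085500002,
              0.070080002, 0.06868, 0.068400002,
              1.910754014, 1.829880067, 1.819775),
  stringsAsFactors = FALSE
)

# ave() returns the group sum aligned to every row, so no merge is needed
group_total <- ave(weights$wght, weights$fTicker, weights$DATE_f, FUN = sum)

# Leave-one-out sum: everything in the group except the row's own wght
weights$excl_val <- group_total - weights$wght
weights
```

For the first row this gives 1.980834016, matching the value the asker computed by hand. Memory use stays proportional to the number of rows, since only one extra vector is created.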
