R - Setting the class of an object created with by() - r

First a little bit of context:
In my package summarytools, I've defined a print method for objects of class "summarytools". I have also created a function view() that handles objects created using by() or lapply() in such a way that the output doesn't include the lines stating the group (or the variable, in the case of lapply()). summarytools displays its own headings containing that information, so there is some redundancy when using print. Also, the main headings are not repeated when using view().
Here's an example. Note that in this version (in development), I included a message advising the use of view():
> library(summarytools)
> (tmp <- with(tobacco, by(smoker, gender, freq)))
gender: F
For best results printing list objects with summarytools, use view(x, method = 'pander')
Frequencies
tobacco$smoker
Type: Factor
Group: gender = M
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
Yes 147 30.06 30.06 30.06 30.06
No 342 69.94 100.00 69.94 100.00
<NA> 0 0.00 100.00
Total 489 100.00 100.00 100.00 100.00
------------------------------------------------------------------
gender: M
Frequencies
tobacco$smoker
Type: Factor
Group: gender = F
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
Yes 143 29.24 29.24 29.24 29.24
No 346 70.76 100.00 70.76 100.00
<NA> 0 0.00 100.00
Total 489 100.00 100.00 100.00 100.00
And now using view():
> view(tmp, method = "pander")
Frequencies
tobacco$smoker
Type: Factor
Group: gender = M
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
Yes 147 30.06 30.06 30.06 30.06
No 342 69.94 100.00 69.94 100.00
<NA> 0 0.00 100.00
Total 489 100.00 100.00 100.00 100.00
Group: gender = F
Freq % Valid % Valid Cum. % Total % Total Cum.
----------- ------ --------- -------------- --------- --------------
Yes 143 29.24 29.24 29.24 29.24
No 346 70.76 100.00 70.76 100.00
<NA> 0 0.00 100.00
Total 489 100.00 100.00 100.00 100.00
I've thought about ways through which the objects of class "by" would automatically be dispatched to view() instead of print(). If I add the class "summarytools" to those objects, the print() method could redirect the call to view(), making it simpler for users to get proper, optimal outputs.
The solutions I've thought of, so far, are the following:
Adding a "by" argument to the functions so that I have full control over the created objects' properties. I'm not fond of this solution, since 1) I try to rely on base R functions that people are familiar with rather than introducing new parameters, and 2) I'd still have a similar issue when objects are created with lapply().
Redefining by() so that when it's called from one of summarytools' functions, it appends the desired class to the created objects. I've avoided this because I'm hesitant to redefine base functions. I'd rather not see messages to the effect that objects have been masked when the package is loaded.
Defining a package-specific by(), such as by_st(); I could use basically the same code as by.default() and by.data.frame(), the only difference being that I'd add the "summarytools" class to the created objects. This is a sort of compromise that I'm considering.
My question is the following: could there be other, maybe better solutions I'm not seeing?

You could override the S3 method print.by so that it dispatches to your custom function:
old.print.by = print.by # save the original function so we can restore it later
print.by = summarytools::view # redefine print.by to dispatch to custom function
tmp
To restore the original function later, run print.by = old.print.by.
If you only want your new function to operate on lists that contain objects of class "summarytools", you can use
print.by = function(x, method = 'pander', ...) {
  if ("summarytools" %in% class(x[[1]])) {
    summarytools::view(x, method, ...)
  } else {
    old.print.by(x, ...)
  }
}

How to create correlated tables in a genomic database SQLite3

I'm quite new to sqlite3 but I want to use it for storing some genomic data I have, since manipulating from R takes a lot of time to process. I want to do some basic queries once the database is built, but my problem is, I don't know which tables I should create in order to make the appropriate queries.
This is what my big table looks like:
Chr Start End Ref Alt Callers GATK_Illumina.counts GATK_Illumina.samples GATK_SOLiD.counts GATK_SOLiD.samples LIFE_SOLiD.counts LIFE_SOLiD.samples TVC_Ion.counts TVC_Ion.samples Func.refGene Gene.refGene
chr1 14948 14948 G A GATK_SOLiD 0.38 noSample 1.125 XK713 0.125 noSample 13.43 17E334|17E424|17H593|17J782|17J913|1B566 ncRNA_intronic;downstream WASH7P;DDX11L1
chr1 14948 14948 G A TVC_Ion 0.38 noSample 1.125 XK713 0.125 noSample 13.43 17E334|17E424|17H593|17J782|17J913|1B566 ncRNA_intronic;downstream WASH7P;DDX11L1
chr1 15820 15820 G T GATK_SOLiD 0.38 noSample 1.125 1E695 0.125 noSample 4.43 17E574|17H906|5K083B|6C418 ncRNA_exonic WASH7P
chr1 15820 15820 G T TVC_Ion 0.38 noSample 1.125 1E695 0.125 noSample 4.43 17E574|17H906|5K083B|6C418 ncRNA_exonic WASH7P
chr1 17452 17452 C T GATK_SOLiD 0.38 noSample 1.125 1H823 0.125 noSample 12.43 17G118|17G937|17H906|17J610|17M152|4E832|5C725|5F445|5F685|5H986|5J427 ncRNA_intronic;upstream WASH7P;MIR6859-1;MIR6859-2;MIR6859-3;MIR6859-4
chr1 17452 17452 C T TVC_Ion 0.38 noSample 1.125 1H823 0.125 noSample 12.43 17G118|17G937|17H906|17J610|17M152|4E832|5C725|5F445|5F685|5H986|5J427 ncRNA_intronic;upstream WASH7P;MIR6859-1;MIR6859-2;MIR6859-3;MIR6859-4
chr1 17538 17538 C A GATK_SOLiD 0.38 noSample 3.125 1E695|1H586|9J385 0.125 noSample 24.43 17C851B|17C918|17D521B|17E424|17F076 ncRNA_intronic;upstream WASH7P;MIR6859-1;MIR6859-2;MIR6859-3;MIR6859-4
chr1 17538 17538 C A TVC_Ion 0.38 noSample 3.125 1E695|1H586|9J385 0.125 noSample 24.43 17C851B|17C918|17D521B|17E424|17F076 ncRNA_intronic;upstream WASH7P;MIR6859-1;MIR6859-2;MIR6859-3;MIR6859-4
My queries will search by Chr, Start, and End to display which Callers have those coordinates, and will also search by gene. What I'm unsure about is how to create the tables. I can create a table with Chr, Start, End, Ref, Alt, and Callers, but how do I link it to another table holding samples or genes? A coordinate (Chr, Start, End) can have multiple samples or callers linked to it.
Example queries would be to enter coordinates and display all the associated info, or to search by gene and show all the coordinates that include that gene.
Basically, I would like to know how many tables I should create and how to link them. I understand my coordinate table would be the parent table.
Which elements should be keys in the different tables?
I know nothing about genomes, but based on the data and your description, this looks like a few 1:n relationships, which you can model as relational entities linked by primary/foreign keys. If you're not yet familiar with SQL, there is an introduction here: https://www.w3schools.com/sql/
You may want to create your tables like this, normalizing further if there are additional 1:n or n:n relationships:
genes
gene_id, gene_name, data1, data2
// 1 gene -> many coords
gene_coords
gene_id, chr, start, end, data1, data2
// 1 coord -> many callers (and/or samples)
gene_callers // Are callers & samples 1:1?
gene_id, caller, sample
You can query the callers by using coordinates && chr with something like this:
SELECT gene_callers.caller FROM gene_callers
JOIN gene_coords ON gene_coords.gene_id = gene_callers.gene_id
WHERE gene_coords.start = 14948 AND gene_coords.end = 14948
AND gene_coords.chr = 'chr1';
I'm not sure whether you will be querying multiple genes over a coordinate range. If so, you may not want to store start and end as the same value; store a single coord column instead and use BETWEEN on it. A query like the following would also return the gene data:
SELECT genes.*, gene_callers.caller FROM gene_callers
JOIN genes ON genes.gene_id = gene_callers.gene_id
JOIN gene_coords ON gene_coords.gene_id = gene_callers.gene_id
WHERE gene_coords.coord BETWEEN 14948 AND 17538;
To get callers by gene name, you can do this:
SELECT gene_callers.caller FROM gene_callers
JOIN genes ON genes.gene_id = gene_callers.gene_id
WHERE genes.gene_name = 'a_gene';
You may need to tweak the join types based on any nulls and the dataset you're looking for. You may need to create another table for samples if they are not 1:1 with callers. Hopefully I interpreted your data correctly and this at least can point you in the right direction.
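To make the table layout above concrete, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. It follows the three tables suggested in this answer. The column names start_pos and end_pos, the helper functions callers_at and coords_for_gene, and the handful of rows (taken from the first lines of the question's table) are assumptions for illustration, not a vetted genomic schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema: one gene -> many coordinates, one gene -> many callers/samples.
# start_pos / end_pos are illustrative names chosen to avoid SQL keywords.
conn.executescript("""
CREATE TABLE genes (
    gene_id   INTEGER PRIMARY KEY,
    gene_name TEXT NOT NULL UNIQUE
);
CREATE TABLE gene_coords (
    gene_id   INTEGER NOT NULL REFERENCES genes(gene_id),
    chr       TEXT    NOT NULL,
    start_pos INTEGER NOT NULL,
    end_pos   INTEGER NOT NULL
);
CREATE TABLE gene_callers (
    gene_id INTEGER NOT NULL REFERENCES genes(gene_id),
    caller  TEXT    NOT NULL,
    sample  TEXT
);
""")

# A few rows derived from the question's example data.
conn.executemany("INSERT INTO genes (gene_id, gene_name) VALUES (?, ?)",
                 [(1, "WASH7P"), (2, "DDX11L1")])
conn.executemany(
    "INSERT INTO gene_coords (gene_id, chr, start_pos, end_pos) VALUES (?, ?, ?, ?)",
    [(1, "chr1", 14948, 14948), (2, "chr1", 14948, 14948), (1, "chr1", 15820, 15820)])
conn.executemany(
    "INSERT INTO gene_callers (gene_id, caller, sample) VALUES (?, ?, ?)",
    [(1, "GATK_SOLiD", "XK713"), (1, "TVC_Ion", "17E334"),
     (2, "GATK_SOLiD", "XK713"), (2, "TVC_Ion", "17E334")])


def callers_at(conn, chrom, start, end):
    """Return the distinct callers linked to genes at the given coordinate."""
    rows = conn.execute(
        """SELECT DISTINCT gene_callers.caller FROM gene_callers
           JOIN gene_coords ON gene_coords.gene_id = gene_callers.gene_id
           WHERE gene_coords.chr = ? AND gene_coords.start_pos = ?
             AND gene_coords.end_pos = ?
           ORDER BY gene_callers.caller""",
        (chrom, start, end))
    return [row[0] for row in rows]


def coords_for_gene(conn, gene_name):
    """Return all (chr, start_pos, end_pos) rows recorded for a gene name."""
    return conn.execute(
        """SELECT gene_coords.chr, gene_coords.start_pos, gene_coords.end_pos
           FROM gene_coords
           JOIN genes ON genes.gene_id = gene_coords.gene_id
           WHERE genes.gene_name = ?
           ORDER BY gene_coords.start_pos""",
        (gene_name,)).fetchall()


print(callers_at(conn, "chr1", 14948, 14948))
print(coords_for_gene(conn, "WASH7P"))
```

Because callers hang off gene_id in this layout, a coordinate shared by two genes returns the callers of both, which is why the query uses DISTINCT. If callers really belong to a coordinate rather than a gene, keying gene_callers on a coordinate id instead may be the better fit.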

Syntax error when using count in loop

I am trying to run a loop where I count the total in each file under the variable _merge, and then count certain outcomes of _merge, such as _merge=1 and so on. I then want to calculate percentages by dividing each instance of _merge by the total under _merge.
Below is my code:
/*define local list*/
local ward_names B C D E FN FS GS HE
/*loop for each dbase*/
foreach file of local ward_names {
use "../../../cleaning/sra/output/`file'_ward_CTS_Merged.dta", clear
count if _merge
local ward_count=r(N)
count if _merge==1
local count_master=r(N)
count if _merge==2
local count_using=r(N)
count if _merge==3
local count_match=r(N)
clear
set obs 1
g ward_count='ward_count'
g count_master=`count_master'
g count_using=`count_using'
g count_match=`count_match'
g ward= "`file'"
save "../temp/`file'_collapsed_diagnostics.dta", replace
clear
The code was running fine until I tried to add the total count for each ward file:
g ward_count='ward_count'
'ward_count' invalid name
Is this a syntax error or something more severe?
A local macro reference opens with a backtick (`) and closes with an apostrophe ('), so the left quote must be ` rather than ':
generate ward_count = `ward_count'
EDIT:
As per @NickCox's recommendation, you can improve your code by using the tabulate command with its matcell() option to get the counts all at once:
tabulate _merge, matcell(A)
_merge | Freq. Percent Cum.
------------------------+-----------------------------------
master only (1) | 1 16.67 16.67
matched (3) | 5 83.33 100.00
------------------------+-----------------------------------
Total | 6 100.00
matrix list A
A[2,1]
c1
r1 1
r2 5
So you could then do the following:
generate count_master = A[1,1]
generate count_match = A[2,1]

Unexpected error using JuMP with Julia

I am trying to solve an optimization problem, but I am getting the following error:
"ERROR: Expected m to be a JuMP model, but it has type Int64
in validmodel(::Int64, ::Symbol) at C:\Users\Ting.julia\v0.5\JuMP\src\macros.jl:247
in macro expansion; at C:\Users\Ting.julia\v0.5\JuMP\src\macros.jl:252 [inlined]
in macro expansion; at .\REPL[608]:3 [inlined]
in anonymous at .\:?"
Please see the following code (the error occurs in constraint 2). Please don't mind the way I have defined the arrays; any help is appreciated. Thank you.
using JuMP
using Gurobi
m = Model(solver = GurobiSolver()) #if GurobiSolver is to be used .
## insert all matrixs here
#this is the cost for plant to warehouse
plant=4 #last index for {1,2,3}
product=5 #last index for {2,3,4}
customer=50
warehouse=4
@variable(m, x[i=1:product ,k=1:plant,l=1:warehouse]>=0) #plant to warehouse
@variable(m, y[i=1:product ,k=1:warehouse,l=1:customer]>=0) #warehouse to customer
@variable(m, z[i=1:product ,k=1:plant,l=1:customer ]>=0) #plant to customer
@variable(m, p[i=1:product ,k=1:plant]>=0) #any product i produced at plant k
#THIS GIVES COST OF PRODUCING AT ANY PRODUCT I AT PLANT K
PC=[500 500 500 500;
400 400 400 400;
300 300 300 300;
200 200 200 200;
100 100 100 100]
#DEMAND OF I AT ANY COSTOMER M, SHOULD BE A MATRIX OF (5*50)
D=[4650.28 10882.70 7920.68 2099.06 4920.32 5077.80 2259.10 9289.30 9782.28 4671.85 6625.68 6956.80 5288.12 4144.78 11121.56 9152.47 10206.88 4601.63 2718.91 1439.39 2984.38 3631.17 3934.48 12314.28 4188.04 8437.43 6302.34 1248.62 6286.56 7333.46 11027.86 6233.33 7240.82 5652.13 10276.03 1197.22 11160.13 4510.31 8850.49 8291.09 1081.47 7652.23 3936.85 2640.47 7726.72 1422.96 1644.78 1060.39 6858.66 6554.45;
528.11 4183.80 352.45 366.34 1961.78 3419.11 337.44 708.15 3556.56 1649.95 583.25 1525.97 1569.92 349.93 1904.59 2221.80 2139.63 1822.87 546.11 784.93 948.33 1424.26 1910.64 2275.11 1527.57 2477.49 1592.14 90.86 2635.48 131.02 2402.35 2669.67 105.34 1350.60 4233.60 411.54 687.88 89.09 213.23 2817.29 8.08 1586.51 577.07 1529.34 2919.06 393.97 85.45 214.93 3193.94 1565.64;
480.26 622.67 131.04 14.45 1299.71 599.27 83.08 197.37 1986.77 409.08 371.12 1249.92 216.21 62.43 34.96 1752.75 227.06 184.26 219.92 577.37 138.71 36.23 1659.02 1323.50 236.64 2557.64 76.74 74.08 363.64 52.96 456.67 1589.86 81.89 617.11 509.86 145.52 14.13 83.22 215.03 2749.34 7.12 490.00 120.42 456.03 430.22 165.02 66.16 150.70 2806.58 1403.70;
307.36 474.39 7.56 11.76 882.03 222.62 27.29 158.13 55.94 332.98 171.36 492.81 44.12 24.08 15.57 739.97 11.09 199.51 136.46 194.40 63.72 2.42 355.99 1005.42 66.33 1647.51 47.22 21.32 218.06 11.54 305.81 387.71 8.50 248.38 9.20 76.05 13.12 39.83 146.52 379.44 2.75 239.53 94.06 136.96 290.16 237.75 9.04 110.64 842.58 395.08;
76.52 280.62 5.06 6.75 281.41 215.58 5.78 54.69 20.79 22.08 78.50 322.13 34.13 6.37 11.66 178.33 3.40 142.11 60.70 46.17 6.96 1.15 227.70 669.39 3.21 526.85 45.91 17.00 131.43 11.19 189.00 43.93 3.36 110.66 1.75 41.34 0 38.63 50.78 241.19 0 176.32 94.25 99.59 153.50 123.02 3.76 122.52 853.48 99.62]
a = Array{Float64}(5,4,4)
a[1,1,1]=a[2,1,1]=a[3,1,1]=a[4,1,1]=a[5,1,1]=0.2*528.42
a[1,2,1]=a[2,2,1]=a[3,2,1]=a[4,2,1]=a[5,2,1]=0.2*1366.16
a[1,3,1]=a[2,3,1]=a[3,3,1]=a[4,3,1]=a[5,3,1]=0.2*1525.41
a[1,4,1]=a[2,4,1]=a[3,4,1]=a[4,4,1]=a[5,4,1]=0.2*878.11
a[1,1,2]=a[2,1,2]=a[3,1,2]=a[4,1,2]=a[5,1,2]=0.2*1692.25
a[1,2,2]=a[2,2,2]=a[3,2,2]=a[4,2,2]=a[5,2,2]=0.2*1553.06
a[1,3,2]=a[2,3,2]=a[3,3,2]=a[4,3,2]=a[5,3,2]=0.2*817.18
a[1,4,2]=a[2,4,2]=a[3,4,2]=a[4,4,2]=a[5,4,2]=0.2*2164.69
a[1,1,3]=a[2,1,3]=a[3,1,3]=a[4,1,3]=a[5,1,3]=0.2*2006.5
a[1,2,3]=a[2,2,3]=a[3,2,3]=a[4,2,3]=a[5,2,3]=0.2*1385.04
a[1,3,3]=a[2,3,3]=a[3,3,3]=a[4,3,3]=a[5,3,3]=0.2*998.58
a[1,4,3]=a[2,4,3]=a[3,4,3]=a[4,4,3]=a[5,4,3]=0.2*2148.45
a[1,1,4]=a[2,1,4]=a[3,1,4]=a[4,1,4]=a[5,1,4]=0.2*1073.07
a[1,2,4]=a[2,2,4]=a[3,2,4]=a[4,2,4]=a[5,2,4]=0.2*368.35
a[1,3,4]=a[2,3,4]=a[3,3,4]=a[4,3,4]=a[5,3,4]=0.2*450.12
a[1,4,4]=a[2,4,4]=a[3,4,4]=a[4,4,4]=a[5,4,4]=0.2*1129.27
@objective(m, Min ,sum(a[i,k,l]* x[i,k,l] for i=1:product for k=1:plant for l=1:warehouse) + sum(c_dash[i,l,m]* y[i,l,m] for i=1:product for l=1:warehouse for m=1:plant) +sum(c_dash_dash[i,k,m]* z[i,k,m] for i=1:product for k=1:plant for m=1:customer)+sum(PC[i,k]* p[i,k] for i=1:product for k=1:plant)) #to be changed
@constraint(m,p[1,2]==0)
@constraint(m,p[1,3]==0)
@constraint(m,p[1,4]==0)
@constraint(m,p[2,1]==0)
@constraint(m,p[2,3]==0)
@constraint(m,p[2,4]==0)
@constraint(m,p[3,1]==0)
@constraint(m,p[3,2]==0)
@constraint(m,p[3,4]==0)
@constraint(m,p[4,1]==0)
@constraint(m,p[4,2]==0)
@constraint(m,p[4,3]==0)
@constraint(m,p[5,1]==0)
@constraint(m,p[5,2]==0)
@constraint(m,p[5,3]==0)
@constraint(m,p[1,1]<=450000)
@constraint(m,p[2,2]<=108000)
@constraint(m,p[3,3]<=45000)
@constraint(m,p[4,4]<=18000)
@constraint(m,p[5,4]<=9000)
#constraint 1
@constraint(m,415728.69-0.8* sum(y[i,l,m] for i=1:product for l=1:warehouse for m=1:customer) <=0)
#constraint 2
for m=1:customer
    for i=1:product
        @constraint(m, D[i,m]-sum(z[i,k,m] for k=1:plant)-sum(y[i,l,m] for l=1:warehouse) <=0 ) #cant get
    end
end
#constraint 2
for m=1:customer
    for i=1:product
        @constraint(m, D[i,m]-sum(z[i,k,m] for k=1:plant)-sum(y[i,l,m] for l=1:warehouse) <=0 ) #cant get
    end
end
The error message explains the problem well. Your outer loop variable is m, so inside the loop m refers to the loop variable (an Int64) and not to your model, which is also named m in the outer scope. Rename either the loop variable or the model variable and the problem goes away.
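The same name clash can be shown outside JuMP. The sketch below is a Python analogy only: Model, add_constraint, build_shadowed and build_fixed are hypothetical stand-ins written for this illustration, not part of JuMP or any solver library. It shows a loop variable taking over the name that held the model, and the fix of using distinct names.

```python
class Model:
    """Hypothetical stand-in for a solver model; it only records constraints."""

    def __init__(self):
        self.constraints = []


def add_constraint(model, description):
    # Mimics the kind of type check that produced the error in the question.
    if not isinstance(model, Model):
        raise TypeError(f"Expected a Model, but got {type(model).__name__}")
    model.constraints.append(description)


def build_shadowed(customers):
    """Reuses the name m for both the model and the loop index (the bug)."""
    m = Model()
    try:
        for m in range(1, customers + 1):  # m is now an int, not the model
            add_constraint(m, f"demand of customer {m}")
    except TypeError as err:
        return str(err)
    return "ok"


def build_fixed(customers):
    """Uses distinct names for the model and the loop index (the fix)."""
    model = Model()
    for c in range(1, customers + 1):
        add_constraint(model, f"demand of customer {c}")
    return len(model.constraints)


print(build_shadowed(3))
print(build_fixed(3))
```

In the question's code the equivalent change would be to rename the model (for example to model) or to use a different index letter in the two for loops.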

Insert blank lines in kable

I'm tabling groups of rows in a kable. Each group has between 3 and 5 rows. I want to leave blank lines in the table between groups for readability, but can't get it to work.
I put in a row of all NA, and then set options(knitr.kable.NA=""). This works OK when printed in the console, as here:
|C.01.C.00522 | 3| 1203| 0.043| -0.096| -16.441|
|C.01.C.00522 | 4| 8364| 0.298| 0.159| 31.765|
|C.01.C.00522 | 5| 3494| 0.124| -0.014| -2.588|
| | | | | | |
|C.02.A.00577 | 1| 2496| 0.089| -0.014| -2.410|
|C.02.A.00577 | 2| 1975| 0.070| -0.032| -5.609|
|C.02.A.00577 | 3| 3400| 0.121| 0.018| 3.297|
But in the rendered PDF document there is one table for the first group, and then all the lines after that are unformatted.
C.01.C.00522 3 1203 0.043 -0.096 -16.441 C.01.C.00522 4 8364 0.298 0.159 31.765 C.01.C.00522 5 3494 0.124
-0.014 -2.588
C.02.A.00577 1 2496 0.089 -0.014 -2.410 C.02.A.00577 2 1975 0.070 -0.032 -5.609
I also tried options(knitr.kable.NA='.') and this produces a properly formatted table, but all the dots are a little annoying.
Any ideas?
Thank you, Imran, for mentioning kableExtra. In kableExtra 0.3, which I released last week, a new function called collapse_rows may help in this case.
library(knitr)
library(kableExtra)
dt <- data.frame(id = c(rep("C.01.C.00522", 3), rep("C.02.A.00577", 3)),
                 var1 = c(3, 4, 5, 1, 2, 3), var2 = c(1203, 8364, 3494, 2496, 1975, 3400))
kable(dt, "latex", booktabs = TRUE) %>%
  collapse_rows(columns = 1)

Stata counting substring

My table looks like this:
ID AQ_ATC amountATC
. "A05" 1
123 "A05AA02" 2525
234 "A05AA02" 2525
991 "A05AD39" 190
. "C10" 1
441 "C10AA11" 4330
229 "C10AA22" 3100
. "C05AA" 1
441 "C05AA03" 130
The count for the full 8-character AQ_ATC codes is already correct.
The shorter codes are unique in the table and are substrings of the complete 8-character codes (they represent the first x characters).
What I am looking for is the count of the appearances of the shorter codes throughout the entire table.
For example in this case the resulting table would be
ID AQ_ATC amountATC
. "A05" 2715 <-- 2525 + 190
123 "A05AA02" 2525
234 "A05AA02" 2525
991 "A05AD39" 190
. "C10" 7430 <-- 4330 + 3100
441 "C10AA11" 4330
229 "C10AA22" 3100
. "C05AA" 130 <-- 130
441 "C05AA03" 130
The partial codes do not overlap, by which I mean that if there is "C05" there won't be another partial code such as "C05A1".
I created the amountATC column using
bysort ATC: egen amountATC = total(AQ_ATC==AQ_ATC)
I attempted recycling the code that I had received yesterday but failed in doing so.
My attempt looks like this:
levelsof AQ_ATC, local(ATCvals)
quietly foreach y in AQ_ATC {
local i = 0
quietly foreach x of local ATCvals {
if strpos(`y', `"`x'"') == 1{
local i = `i'+1
replace amountATC = `i'
}
}
}
My idea was to use a counter "i" and increase it by 1 every time an AQ_ATC starts with another AQ_ATC code. Then I write "i" into amountATC, and after iterating over the entire table for one AQ_ATC, I will have an "i" value equal to the number of occurrences of the substring. Then I reset "i" to 0 and continue with the next AQ_ATC.
At least that's how I intended it to work; what it actually did was set all amountATC values to 1.
I also looked into different egen functions such as noccur and moss, but my connection keeps timing out when I try to install the packages.
It seems as if you come from another language and insist on using loops even when they are not strictly necessary. Stata does many things without explicit loops, precisely because commands already apply to all observations.
One way is:
clear
set more off
input ///
ID str15 AQ_ATC amountATC
. "A05" 1
123 "A05AA02" 2525
234 "A05AA02" 2525
991 "A05AD39" 190
. "C10" 1
441 "C10AA11" 4330
229 "C10AA22" 3100
. "C05AA" 1
441 "C05AA03" 130
end
*----- what you want -----
sort AQ_ATC ID
gen grou = sum(missing(ID))
bysort grou AQ_ATC: gen tosum = amountATC if _n == 1 & !missing(ID)
by grou: egen s = total(tosum)
replace amountATC = s if missing(ID)
list, sepby(grou)
Edit
With your edit, the same principles apply. The code below adjusts to your change and is slightly shorter (one line less):
*----- what you want -----
sort AQ_ATC
gen grou = sum(missing(ID))
bysort grou: gen s = sum(amountATC) if AQ_ATC != AQ_ATC[_n+1] & !missing(ID)
by grou: replace amountATC = s[_N] if missing(ID)
More efficient should be:
<snip>
bysort grou: gen s = sum(amountATC) if AQ_ATC != AQ_ATC[_n+1]
by grou: replace amountATC = s[_N] - 1 if missing(ID)
Some comments:
sort is a very handy command. If you sort the data by AQ_ATC they are arranged in such a way that the short (sub)strings are placed before corresponding long strings.
The by: prefix is fundamental and very helpful, and I noticed you can use it after defining appropriate groups. I created the groups taking advantage of the fact that all short (sub)strings have a missing(ID).
Then (by the groups just defined) you only want to add up one value (observation) per distinct AQ_ATC code. That's what the condition if AQ_ATC != AQ_ATC[_n+1] does.
Finally, replace back into your original variable. I would usually generate a copy and work with that, so my original variable doesn't suffer.
An excellent read for the by: prefix is Speaking Stata: How to move step by: step, by Nick Cox.
Edit2
Yet another slightly different way:
*----- what you want -----
sort AQ_ATC
gen grou = sum(missing(ID))
egen t = tag(grou AQ_ATC)
bysort grou: gen s = sum(amountATC * t)
by grou: replace amountATC = s[_N] - 1 if missing(ID)
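For readers who do not use Stata, the sort-then-group idea behind these snippets can be sketched in Python. This is an illustration of the logic only, not a translation of any particular Stata command: the function name fill_prefix_totals and the tuple layout (ID, AQ_ATC, amountATC) are assumptions, with None standing in for a missing ID.

```python
rows = [
    (None, "A05", 1), (123, "A05AA02", 2525), (234, "A05AA02", 2525),
    (991, "A05AD39", 190),
    (None, "C10", 1), (441, "C10AA11", 4330), (229, "C10AA22", 3100),
    (None, "C05AA", 1), (441, "C05AA03", 130),
]


def fill_prefix_totals(rows):
    """Replace the amount of each short code with the sum over its full codes.

    Sorting by code places every short code directly before the full codes
    it prefixes, so a row with a missing ID opens a new group. Within a
    group, each distinct full code is counted once.
    """
    ordered = sorted(rows, key=lambda row: row[1])
    totals = {}     # short code -> total of its group
    current = None  # short code that opened the current group
    seen = set()    # full codes already added to a total
    for id_, code, amount in ordered:
        if id_ is None:
            current = code
            totals[current] = 0
        elif current is not None and code not in seen:
            seen.add(code)
            totals[current] += amount
    # Return the rows in their original order with the short codes filled in.
    return [(i, c, totals[c] if i is None else a) for i, c, a in rows]


for row in fill_prefix_totals(rows):
    print(row)
```

Like the Stata versions, this relies on the short codes not overlapping and on every short-code row having a missing ID.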
