R Sweave: digits number in xtable of prop.table

I'm making an xtableFtable in R Sweave and can't find a way to suppress the digits with this code. What am I doing wrong? I've read that this can happen if your values aren't numeric but factor or character, but is prop.table making them non-numeric? I'm lost...
library(xtable)
a <- ftable(prop.table(table(mtcars$mpg, mtcars$hp), margin=2)*100)
b <- xtableFtable(a, method = "compact", digits = 0)
print.xtableFtable(b, rotate.colnames = TRUE)
I've already tried with digits=c(0,0,0,0...) too.

You could use options(digits) to control how many digits will print. Try something like options(digits = 4) as the first line of your code (change 4 to whatever value you want between 1 and 22). See ?options for more information.
Or round the values before printing
a = round(ftable(prop.table(table(mtcars$mpg, mtcars$hp), margin=2)*100), 2)
b = xtableFtable(a, method = "compact")
print.xtableFtable(b, rotate.colnames = TRUE)

The "digits" argument to xtableFtable seems to be unimplemented (as of my version, which is 1.8.3), since after playing around with it for half an hour nothing seems to make any difference.
There's a hint to this effect in the function documentation:
It is not recommended that users change the values of align, digits or align. First of all, alternative values have not been tested. Secondly, it is most likely that to determine appropriate values for these arguments, users will have to investigate the code for xtableFtable and/or print.xtableFtable.
It's probably just carried over from the xtable function (on which xtableFtable is surely based) as a TODO which the maintainer hasn't gotten around to yet.
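On the question of whether prop.table makes the values non-numeric: it does not. Below is a minimal sketch using only base R that checks this and rounds before the table would be handed to xtableFtable. The cyl and gear columns are an assumption made here just to keep the table small; they are not from the original post.

```r
# Base R only; cyl/gear are stand-ins chosen to keep the table small.
p <- prop.table(table(mtcars$cyl, mtcars$gear), margin = 2) * 100

is.numeric(p)   # prop.table() returns numeric proportions, not factors
colSums(p)      # with margin = 2, every column sums to 100

# Round first, then flatten; the result can be passed on to xtableFtable().
a <- ftable(round(p))
a
```

Rounding up front sidesteps the digits argument entirely, which is why it works even if that argument is ignored.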

Related

I want to be able to manipulate objects in class 'phylo' - ie. round/ turn my bootstrap values from decimals (.998) into percentages (99%)

I am using RStudio, programs ape and phytools. I've generated a tree with 500 bootstrap replicates stored in an object of class phylo.
Where cw is the name of my tree, I've tried the following:
round(cw, digits = 2)
and I get the following error message:
Error in round(cw, digits = 2) :
non-numeric argument to mathematical function
I feel like it's probably a very simple manipulation but I'm not sure how to get there.
Hard to tell without a reproducible example but I guess that your bootstrap scores are probably stored in the $node.label subset of your tree.
You can try the following:
## Are the bootstraps in the $node.label object?
if(!is.null(cw$node.label)) {
## Are they as character or numeric?
class(cw$node.label)
}
If they are numeric values:
cw$node.label <- round(cw$node.label, digits = 2)
If they are characters, you can probably coerce them (that can produce some NAs)
cw$node.label <- round(as.numeric(cw$node.label), digits = 2)
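To get percentages rather than rounded decimals, the same coercion can be followed by a multiplication and a paste. This is only a sketch: cw_demo is a hypothetical plain list standing in for the phylo object, and the empty first label mimics a root node without support, which is an assumption about how the tree was built.

```r
# Hypothetical stand-in for a phylo object: only the node.label slot matters here.
cw_demo <- list(node.label = c("", "0.998", "0.75"))

# Coerce to numeric; empty or non-numeric labels become NA.
boot <- suppressWarnings(as.numeric(cw_demo$node.label))

# Convert to whole percentages and leave unlabeled nodes blank.
# round() turns 0.998 into 100%; use floor() instead if 99% is wanted.
cw_demo$node.label <- ifelse(is.na(boot), "", paste0(round(boot * 100), "%"))
cw_demo$node.label
```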

VennDiagram with shares/relative numbers

Using R and VennDiagram 1.6.9, I want to draw a triple Venn diagram and display shares rather than absolute values. The internal consistency check however can't deal with rounding errors:
draw.triple.venn(area1=0.89, area2=round(0.481, 2), area3=0.5,
n12=0.46, n23=0.4, n13=0.47)
The error due to rounding is extremely small:
> round(0.48, 2)-0.46-0.4+0.38
[1] -5.551115e-17
Using the complete number, i.e. round(0.481, 3), it all works fine, but I don't want that (my real data has a lot more digits). Is there a way to override the internal consistency checks? Or is there maybe a better way to display shares?
Firstly, note that the draw.triple.venn function has parameters print.mode and sigdigs, which might be helpful to you. If those are not enough, you may try hacking the output, by simply replacing the values of all labels with improved values to your taste. Here is an example:
library(grid)
library(VennDiagram)
grid.newpage()
draw.triple.venn(area1=0.89, area2=0.481, area3=0.5,
n12=0.46, n23=0.4, n13=0.47, n123=0.38)
grobjs = grid.ls() # List of all objects on the diagram
for (o in grobjs$name) {
# Pick out all text labels
if (grepl(".text.", o, fixed = TRUE)) {
# Re-format their value
old_value = as.numeric(grid.get(o)$label)
new_value = sprintf("%0.2f", old_value) # non-numeric labels give "NA"
if (new_value != "NA") {
grid.edit(o, label=new_value, redraw=FALSE)
}
}
}
grid.refresh()
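The tiny error shown in the question is ordinary floating-point behaviour rather than anything specific to VennDiagram. A minimal base R sketch of the difference between an exact comparison and a tolerance-based one follows; how the package's own consistency check is written is not something I have verified, so this only illustrates the general effect.

```r
# Exact comparison trips over binary representation error.
exact <- (0.1 + 0.2) == 0.3

# all.equal() compares within a small tolerance; wrap it in isTRUE().
tolerant <- isTRUE(all.equal(0.1 + 0.2, 0.3))

# An explicit tolerance works too.
manual <- abs((0.1 + 0.2) - 0.3) < 1e-9

c(exact = exact, tolerant = tolerant, manual = manual)
```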

How to display coefficients in scientific notation with stargazer

I want to compare the results of different models (lm, glm, plm, pglm) in a table in R using stargazer or a similar tool.
However I can't find a way to display the coefficients in scientific notation. This is kind of a problem because the intercept is rather large (about a million) while other coefficients are small (about e-7) which results in lots of useless zeros making it harder to read the table.
I found a similar question here: Format model display in texreg or stargazer R as scientific.
But the results there require rescaling the variables and since I use count data I wouldn't want to rescale it.
I am grateful for any suggestions.
Here's a reproducible example:
m1 <- lm(Sepal.Length ~ Petal.Length*Sepal.Width,
transform(iris, Sepal.Length = Sepal.Length+1e6,
Petal.Length=Petal.Length*10, Sepal.Width=Sepal.Width*100))
# Coefficients:
# (Intercept) Petal.Length Sepal.Width Petal.Length:Sepal.Width
# 1.000e+06 7.185e-02 8.500e-03 -7.701e-05
I don't believe stargazer has easy support for this.
You could try other alternatives like xtable or any of the many options here (I have not tried them all)
library(xtable)
xtable(m1, display=rep('g', 5)) # or there's `digits` too; see `?xtable`
Or if you're using knitr or pandoc I quite like pander, which has automagic scientific notation already (note: this is pandoc output which looks like markdown, not tex output, and then you knit or pandoc to latex/pdf):
library(pander)
pander(m1)
It's probably worth making a feature request to the package maintainer to include this option.
In the meantime, you can replace numbers in the output with scientific notation auto-magically. There are a few things to be careful about when replacing numbers. It is important not to reformat numbers that are part of the latex encoding. Also, be careful not to replace characters that are part of variable names. For example the . in Sepal.Width could easily be mistaken for a number by regex. The following code should deal with most common situations. But, if someone, for example, calls their variable X_123456789 it might rename this to X_1.23e+09 depending on the scipen setting. So some caution is needed and a more robust solution probably will need to be implemented within the stargazer package.
Here's an example stargazer table to demonstrate on (shamelessly copied from @mathematical.coffee):
library(stargazer)
library(gsubfn)
m1 <- lm(Sepal.Length ~ Petal.Length*Sepal.Width,
transform(iris, Sepal.Length = Sepal.Length+1e6,
Petal.Length=Petal.Length*10, Sepal.Width=Sepal.Width*100))
star = stargazer(m1, header = F, digit.separator = '')
Now a helper function to reformat the numbers. You can play around with the digits and scipen parameters to control the output format. If you want to force scientific format more often use a smaller (more negative) scipen. Otherwise we can have it automatically use scientific format only for very small or large numbers by using a larger scipen. The cutoff parameter is there to prevent reformatting of numbers represented by only a few characters.
replace_numbers = function(x, cutoff=4, digits=3, scipen=-7) {
ifelse(nchar(x) < cutoff, x, prettyNum(as.numeric(x), digits=digits, scientific=scipen))
}
And apply that to the stargazer output using gsubfn::gsubfn
gsubfn("([0-9.]+)", ~replace_numbers(x), star)
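To see what the scipen argument does in isolation, here is a small sketch that formats the same number with a strongly negative and a strongly positive penalty. The value 1234567 and the variable names are made up for illustration.

```r
x <- 1234567

# Negative penalty: scientific notation is preferred almost always.
sci <- prettyNum(x, digits = 3, scientific = -7)

# Positive penalty: fixed notation is preferred unless it is much wider.
fixed <- prettyNum(x, digits = 3, scientific = 7)

c(sci = sci, fixed = fixed)
```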
Another robust way to get scientific notation using stargazer is to hack the decimal.mark parameter. This option lets the user specify the character that separates the integer part of a number from its decimals (usually a period . in most locales). We can usurp this parameter to insert a uniquely identifiable string into any number that we want to be able to find using regex; digit.separator = '' is set as well so that no thousands separators end up inside the numbers. The advantage of searching for numbers this way is that we only find numbers that correspond to numeric values in the stargazer output, i.e. there is no possibility of also matching numbers that are part of variable names (e.g. X_12345) or that are part of the latex formatting code (e.g. \hline \\[-1.8ex]). In the following I use the string ::::, but any unique character string (such as a hash) that we will not find elsewhere in the table will do. It's probably best to avoid having any special regex characters in the identifier mark, as this will complicate things slightly.
Using the example model m1 from this other answer.
mark = '::::'
star = stargazer(m1, header = F, decimal.mark = mark, digit.separator = '')
replace_numbers = function(x, low=0.01, high=1e3, digits = 3, scipen=-7, ...) {
x = gsub(mark,'.',x)
x.num = as.numeric(x)
ifelse(
(abs(x.num) >= low) & (abs(x.num) < high),
round(x.num, digits = digits),
prettyNum(x.num, digits=digits, scientific = scipen, ...)
)
}
reg = paste0("([0-9.\\-]+", mark, "[0-9.\\-]+)")
cat(gsubfn(reg, ~replace_numbers(x), star), sep='\n')
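The marker idea can be tried without stargazer or gsubfn. The sketch below uses a made-up table row (fake_row is hypothetical, not real stargazer output) and base R's regmatches to show that only the marked number is rewritten while the latex spacing code is left alone.

```r
mark <- "::::"

# Hypothetical table row: one marked number plus latex code containing "-1.8".
fake_row <- "Petal.Length & 0::::0000770 \\\\[-1.8ex]"

# Only digits that surround the marker are treated as a number.
reg <- paste0("[0-9]+", mark, "[0-9]+")
m <- gregexpr(reg, fake_row)

# Swap the marker back to a period and reformat in scientific notation.
regmatches(fake_row, m) <- lapply(regmatches(fake_row, m), function(s) {
  sprintf("%.3e", as.numeric(sub(mark, ".", s, fixed = TRUE)))
})
fake_row
```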
Update
If you want to ensure that trailing zeros are retained in the scientific notation, then we can use sprintf instead of prettyNum.
Like this
replace_numbers = function(x, low=0.01, high=1e3, digits = 3) {
x = gsub(mark,'.',x)
x.num = as.numeric(x)
form = paste0('%.', digits, 'e')
ifelse(
(abs(x.num) >= low) & (abs(x.num) < high),
round(x.num, digits = digits),
sprintf(form, x.num)
)
}

Specify monospace font in `menu`

Language: R. Question: Can I specify fixed width font for the menu(..,graphics=T) function?
Explanation:
I recently asked this question on how to have a user select a row of a data frame interactively:
df <- data.frame(a=c(9,10),b=c('hello','bananas'))
df.text <- apply( df, 1, paste, collapse=" | " )
menu(df.text,graphics=T)
I'd like the | to line up. They don't at the moment; fair enough, I haven't padded out the columns to the same width. So I use format to get every column to the same width (later I'll write code to automagically determine the width per column, but let's ignore that for now):
df.padded <- apply(df,2,format,width=8)
df.padded.text <- apply( df.padded, 1, paste, collapse=" | ")
menu( df.padded.text,graphics=T )
See how it's still wonky? Yet, if I look at df.padded, I get:
> df.padded
     a          b
[1,] " 9      " "hello   "
[2,] "10      " "bananas "
So each cell is definitely padded out to the same length.
The reason for this is probably because the default font for this (on my system anyway, Linux) is not fixed width.
So my question is:
Can I specify fixed width font for the menu(..,graphics=T) function?
Update
@RichieCotton noticed that if you look at menu with graphics=T it calls select.list, which in turn calls tcltk::tk_select.list.
So it looks like I'll have to modify tcltk options for this. From @jverzani:
library(tcltk)
tcl("option", "add", "*Listbox.font", "courier 10")
menu(df.padded.text,graphics=T)
Given that menu(...,graphics=T) calls tcltk::tk_select.list when graphics is TRUE, my guess is that this is a viable option, as any distro that would be capable of displaying the graphical menu in the first place would also have tcltk on it, since it needs to call tk_select.list.
(As an aside, I can't find anything in the documentation that would give me the hint to try tcl('option','add',...), let alone that the option was called *Listbox.font!)
Another update -- had a closer look at the select.list and menu code, and it turns out on Windows (or if .Platform$GUI=='AQUA' -- is that Mac?), the tcltk::tk_select.list isn't called at all, and it's just some internal code instead. So modifying '*Listbox.font' won't affect this.
I guess I'll just:
if tcltk is there, load it, set the *Listbox.font to courier, and use tcltk::tk_select.list explicitly
if it isn't there, try menu(...,graphics=T) to at least get a graphical interface (which won't be monospace, but is better than nothing)
if that fails too, then just fallback to menu(...,graphics=F), which will definitely work.
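That three-step fallback could be sketched roughly as follows. choose_row is a hypothetical helper name, and the error handling is an assumption about how failures would surface on a given system rather than something tested across platforms.

```r
# Hypothetical helper: returns the index of the chosen element (0 if cancelled).
choose_row <- function(choices) {
  # 1. Prefer Tk, where the listbox font can be set to a monospace face.
  if (capabilities("tcltk") && requireNamespace("tcltk", quietly = TRUE)) {
    picked <- tryCatch({
      tcltk::tcl("option", "add", "*Listbox.font", "courier 10")
      tcltk::tk_select.list(choices)
    }, error = function(e) NULL)
    # tk_select.list() returns the chosen string; map it back to an index.
    if (!is.null(picked)) return(match(picked, choices, nomatch = 0L))
  }
  # 2. Try the platform's graphical menu, 3. then fall back to the text menu.
  tryCatch(menu(choices, graphics = TRUE),
           error = function(e) menu(choices, graphics = FALSE))
}
```

It would be called as choose_row(df.padded.text) in an interactive session. Note that match() returns the first hit, so duplicated rows would need a different way of mapping the selection back to an index.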
Thanks all.
Another approach to padding:
na.pad <- function(x,len){
x[1:len]
}
makePaddedDataFrame <- function(l,...){
maxlen <- max(sapply(l,length))
data.frame(lapply(l,na.pad,len=maxlen),...)
}
x = c(rep("one",2))
y = c(rep("two",10))
z = c(rep("three",5))
makePaddedDataFrame(list(x=x,y=y,z=z))
The na.pad() function exploits the fact that R will automatically pad a vector with NAs if you try to index non-existent elements.
makePaddedDataFrame() just finds the longest one and pads the rest up to a matching length.
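The indexing behaviour that na.pad() relies on can be checked in isolation (short is just an illustrative name):

```r
short <- c("one", "one")

# Indexing past the end of a vector yields NA for the missing positions.
padded <- short[1:4]
padded
```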
I don't understand why you don't want to use View(df) (get the rowid, put the contents into temp. data frame and display it with the View command)
Edit: well, just use sprintf command
Create a function f to extract the strings from the data frame object
f <- function(x,sep1) {
sep1=format(sep1,width=8)
xa<-gsub(" ","",as.character(x[1]))
a1 <- nchar(xa)
xa=format(xa,width=8)
xb=gsub(" ","",as.character(x[2]))
b1 <- nchar(xb)
xb=format(xb,width=8)
format1=paste("%-",10-a1,"s%s%-",20-b1,"s",sep="")
concat=sprintf(format1,xa,sep1,xb)
concat
}
df <- data.frame(a=c(9,10),b=c('hello','bananas'))
df.text <- apply( df, 1, f,sep1="|")
menu(df.text,graphics=T)
Of course, the limits used in the sprintf format (10 and 20) are the maximum number of characters expected in the data frame columns a and b. You can change them to suit your data.
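If hard-coded widths are a concern, the width of each column can be derived from the data instead. This is a sketch of one way to do it in base R, not the method used in the answer above; it still only lines up visually when the menu is drawn in a fixed-width font.

```r
df <- data.frame(a = c(9, 10), b = c("hello", "bananas"),
                 stringsAsFactors = FALSE)

# format() on a character vector pads every element to the widest one.
padded <- lapply(df, function(col) format(as.character(col)))

# Paste the padded columns together row by row.
df.text <- do.call(paste, c(unname(padded), sep = " | "))
df.text
```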

How to perform basic Multiple Sequence Alignments in R?

(I've tried asking this on BioStars, but for the slight chance that someone from text mining would think there is a better solution, I am also reposting this here)
The task I'm trying to achieve is to align several sequences.
I don't have a basic pattern to match against. All I know is that the "True" pattern should be of length 30 and that the sequences I have had missing values introduced at random points.
Here is an example of such sequences, where on the left we see the real location of the missing values, and on the right we see the sequence that we are able to observe.
My goal is to reconstruct the left column using only the sequences I've got on the right column (based on the fact that many of the letters in each position are the same)
  Real_sequence                  The_sequence_we_see
1 CGCAATACTAAC-AGCTGACTTACGCACCG CGCAATACTAACAGCTGACTTACGCACCG
2 CGCAATACTAGC-AGGTGACTTCC-CT-CG CGCAATACTAGCAGGTGACTTCCCTCG
3 CGCAATGATCAC--GGTGGCTCCCGGTGCG CGCAATGATCACGGTGGCTCCCGGTGCG
4 CGCAATACTAACCA-CTAACT--CGCTGCG CGCAATACTAACCACTAACTCGCTGCG
5 CGCACGGGTAAGAACGTGA-TTACGCTCAG CGCACGGGTAAGAACGTGATTACGCTCAG
6 CGCTATACTAACAA-GTG-CTTAGGC-CTG CGCTATACTAACAAGTGCTTAGGCCTG
7 CCCA-C-CTAA-ACGGTGACTTACGCTCCG CCCACCTAAACGGTGACTTACGCTCCG
Here is an example code to reproduce the above example:
ATCG <- c("A","T","C","G")
set.seed(40)
original.seq <- sample(ATCG, 30, T)
seqS <- matrix(original.seq,200,30, T)
change.letters <- function(x, number.of.changes = 15, letters.to.change.with = ATCG)
{
number.of.changes <- sample(seq_len(number.of.changes), 1)
new.letters <- sample(letters.to.change.with , number.of.changes, T)
where.to.change.the.letters <- sample(seq_along(x) , number.of.changes, F)
x[where.to.change.the.letters] <- new.letters
return(x)
}
change.letters(original.seq)
insert.missing.values <- function(x) change.letters(x, 3, "-")
insert.missing.values(original.seq)
seqS2 <- t(apply(seqS, 1, change.letters))
seqS3 <- t(apply(seqS2, 1, insert.missing.values))
seqS4 <- apply(seqS3,1, function(x) {paste(x, collapse = "")})
library(stringr)
# library(help=stringr)
# str_replace() would only drop the first "-"; str_replace_all() drops every gap
all.seqS <- str_replace_all(seqS4, "-", "")
# how do we align this?
data.frame(Real_sequence = seqS4, The_sequence_we_see = all.seqS)
I understand that if all I had was a string and a pattern I would be able to use
library(Biostrings)
pairwiseAlignment(...)
But in the case I present we are dealing with many sequences to align to one another (instead of aligning them to one pattern).
Is there a known method for doing this in R?
Writing an alignment algorithm in R looks like a bad idea to me, but there is an R interface to the MUSCLE algorithm in the bio3d package (function seqaln()). Be aware of the fact that you have to install this algorithm first.
Alternatively, you can use any of the available algorithms (e.g. ClustalW, MAFFT, T-COFFEE) and import the multiple sequence alignments into R using Bioconductor functionality. See e.g. here.
Though this is quite an old thread, I do not want to miss the opportunity to mention that, since Bioconductor 3.1, there is a package 'msa' that implements interfaces to three different multiple sequence alignment algorithms: ClustalW, ClustalOmega, and MUSCLE. The package runs on all major platforms (Linux/Unix, Mac OS, and Windows) and is self-contained in the sense that you need not install any external software. More information can be found on http://www.bioinf.jku.at/software/msa/ and http://www.bioconductor.org/packages/release/bioc/html/msa.html.
You can perform multiple alignment in R with the DECIPHER package.
Following your example, it would look something like:
library(DECIPHER)
dna <- DNAStringSet(all.seqS)
aligned_DNA <- AlignSeqs(dna)
It is fast and at least as accurate as the other methods listed here (see the paper). I hope that helps!
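Once an alignment is available, from whichever tool, the "true" pattern the question is after can be approximated by a per-column majority vote. The sketch below uses the first four gapped sequences from the question as if they were the alignment output; the majority-vote rule and the variable names are assumptions for illustration, not part of any of the packages mentioned.

```r
# First four gapped sequences from the question, used as a stand-in alignment.
aligned <- c("CGCAATACTAAC-AGCTGACTTACGCACCG",
             "CGCAATACTAGC-AGGTGACTTCC-CT-CG",
             "CGCAATGATCAC--GGTGGCTCCCGGTGCG",
             "CGCAATACTAACCA-CTAACT--CGCTGCG")

# One row per sequence, one column per alignment position.
chars <- do.call(rbind, strsplit(aligned, ""))

# Majority letter per column, ignoring gaps (ties go to the first letter alphabetically).
consensus <- paste(apply(chars, 2, function(col) {
  counts <- table(col[col != "-"])
  if (length(counts) == 0) "-" else names(counts)[which.max(counts)]
}), collapse = "")
consensus
```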
You are looking for a global alignment algorithm on multiple sequences.
Did you look at Wikipedia before asking?
First learn what global alignment is, then look for multiple sequence alignment.
Wikipedia doesn't give a lot of details about algorithms, but this paper is better.
