change data frame so thousands are separated by dots - r

At the moment, I'm working with RMarkdown and Pandoc. My data.frames in R look like this:
3.538e+01 3.542e+01 3.540e+01
9.583e+00 9.406e+00 9.494e+00
2.601e+05 2.712e+05 5.313e+05
After I ran pandoc, the result looks like this:
35.380 35.420 35.400
9.583 9.406 9.494
260116.000 271217.000 531333.000
What it should look like is:
35,380 35,420 35,400
9,583 9,406 9,494
260.116 271.217 531.333
So I want commas as the decimal mark instead of dots, and no decimal part after whole numbers like 260116. Dots to separate the thousands would be nice. Is there a way to directly change the appearance in R, or do I have to set options in knitr/markdown?
Thanks

Here's an example of some of the conversions that can be done with format():
x <- c(35.38, 35.42, 35.4, 9.583, 9.406, 9.494, 260100, 271200, 531300)
format(x, decimal.mark=",", big.mark=".", scientific=FALSE)
# [1] " 35,380" " 35,420" " 35,400" " 9,583" " 9,406"
# [6] " 9,494" "260.100,000" "271.200,000" "531.300,000"
There are several other options, such as trim, justify, and so on that might be of interest in getting your output ready for pandoc.
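If you also want the whole numbers to stay without a decimal part (as in your desired output), one option is to format each value on its own, so the vector-wide common format is not forced onto the integers. A minimal sketch along those lines (plain base R, not a pander/knitr solution):
x <- c(35.38, 9.583, 260116)
# format each element separately, so 260116 does not inherit the
# decimal places needed by the other values
vapply(x, format, character(1),
       decimal.mark = ",", big.mark = ".", scientific = FALSE)
# e.g. "35,38"  "9,583"  "260.116"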

As this question was really inspiring, I recently introduced that big.mark feature in my pander package, which can return markdown-formatted tables from R objects with predefined options -- building on format(), by the way. Small demo:
Load the package (installed from GH until this feature gets to CRAN):
> library(pander)
Create a demo data.frame:
> x <- matrix(c(35.38, 35.42, 35.4, 9.583, 9.406, 9.494, 260100, 271200, 531300), 3, byrow = TRUE)
Set your default options (for a US context the values would be swapped):
> panderOptions('decimal.mark', ',')
> panderOptions('big.mark', '.')
Let pander do the rest:
> pander(x)
------- ------- -------
35,38 35,42 35,4
9,583 9,406 9,494
260.100 271.200 531.300
------- ------- -------
You can find and use even more options there (like the markdown syntax for the table).
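For instance, the table style itself can be changed as well; a quick sketch (option names as I recall them from the pander docs, so double-check there):
# use e.g. the "rmarkdown" (pipe) table syntax instead of the default
# multiline style, and keep wide tables from being split
panderOptions('table.style', 'rmarkdown')
panderOptions('table.split.table', Inf)
pander(x)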


rvest object cloned with rlang::duplicate is not properly cloned

rvest doesn't seem to offer any way to extract text from a parent object only (ignoring its children). One workaround uses xml_remove(), but that mutates the original object - all the way up the memory chain, given R's default lazy evaluation.
I looked to rlang::duplicate(), which promises that "modifying the copy leaves the original object intact", but the clone does not appear to be truly independent. For example:
require(rvest)
h = '<ul>
<li id="target">
text to extract
<ul><li>text to ignore</li><li>this too</li></ul>
</li>
</ul>'
doc = xml2::read_html(h)
x = html_node(doc, '#target')
html_text(x)
#> [1] "\ntext to extract\ntext to ignorethis too"
Now clone x, remove its children, and extract the text:
x2 = rlang::duplicate(x, shallow = FALSE)
children = html_children(x2)
xml2::xml_remove(children)
html_text(x2)
#> [1] "\ntext to extract\n"
That works as intended; however, x has also been mutated:
html_text(x)
#> [1] "\ntext to extract\n"
Any suggestions why this happens and how to work around it? I do not want to start re-attaching children...
First of all, let me say that I think you can solve the issue without copying the data. I'm not an expert in XPath, but I think you can use it to select only the direct text descendants, ignoring text nested in other XML nodes. I.e. the following seems to do the trick without any copy (x defined as in your question):
html_text(html_elements(x, xpath = "text()"))
# [1] "\ntext to extract\n"
That being said, I also have an answer to the question on how to make a deep copy:
The problem is that rlang::duplicate() can only copy R data structures. However, rvest builds on xml2, and xml2 builds on the C library libxml2.
When you create the xml_node object in R, the corresponding data structure is created in libxml2. On the R side, there is basically just a pointer to the libxml2 object. So rlang::duplicate() will only create a copy of that pointer, but not of the underlying data. It cannot do so, because it has no access to it as it is in a different library (that rlang doesn't know of).
The easiest way to create a copy of the underlying data seems to be to serialize and deserialize the XML. I suspect this is not very efficient, though.
Example:
Read in the original data:
require(rvest)
h <- '<ul>
<li id="target">
text to extract
<ul><li>text to ignore</li><li>this too</li></ul>
</li>
</ul>'
doc <- xml2::read_html(h)
x <- html_node(doc, '#target')
Create two copies - one with rlang::duplicate() and one with xml2::xml_unserialize():
x1 <- rlang::duplicate(x, shallow = FALSE)
x2 <- xml2::xml_unserialize(xml2::xml_serialize(x, NULL))
Check that x and x1 are in fact identical, while x2 is a true copy (the memory locations you get will of course be different from the ones shown here):
x$doc
# <pointer: 0x0000023911334ea0>
x1$doc
# <pointer: 0x0000023911334ea0>
# --> same as x
x2$doc
# <pointer: 0x00000239113377d0>
# --> different to x
Test that everything works as intended:
children <- html_children(x2)
xml2::xml_remove(children)
html_text(x2)
# [1] "\n text to extract\n "
html_text(x)
# [1] "\n text to extract\n text to ignorethis too"
Another potential solution (maybe a more general approach) is to use the html_children() function to obtain the text of all the child nodes and then remove that from the full text.
require(rvest)
h = '<ul>
<li id="target">
text to extract
<ul><li>text to ignore</li><li>this too</li></ul>
</li>
</ul>'
doc = xml2::read_html(h)
x = html_node(doc, '#target')
fulltext <- html_text(x)
# [1] "\ntext to extract\ntext to ignorethis too"
#find the text in the children nodes
childtext <- html_children(x) %>% html_text()
# "text to ignorethis too"
#replace the child node text with an empty string
gsub(childtext, "", fulltext) %>% trimws()
#"text to extract"
#alternative using the text from the first child node
firstchild <- xml_child(x, search=1) %>% xml_text()
gsub(paste0(firstchild, ".*"), "", fulltext)
Of course, if there are additional newline ("\n") or formatting characters, the gsub() may break.
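A slightly safer variant treats the child text as a fixed string rather than as a regular expression, so characters such as ( or . in the children cannot break the match (a small, untested tweak of the line above):
# treat childtext literally instead of as a regex pattern
gsub(childtext, "", fulltext, fixed = TRUE) %>% trimws()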

How does R Markdown automatically format print effects into dataframes? Or how can I access special print methods?

I'm working with the WRS2 package, and there are cases where it outputs its analysis (bwtrim) as a list with a special class named after the analysis type (class = "bwtrim"). I can't as.data.frame() it, but I found that there is a custom print method, print.bwtrim, associated with it.
As an example, let's say the output is bwtrim.out <- bwtrim(...). When I print the analysis output in an R Markdown chunk, it seems to "steal" part of the text output and turn it into a data frame.
So here's my question: how can I access print.bwtrim, or how does R Markdown automatically format certain outputs into data frames? I'd like to take this output data frame and use it for other purposes.
Update: Here is a minimal working example -- put the following in a chunk in an Rmd file.
```{r}
library(WRS2)
df <- data.frame(
  subject = rep(c(1:100), each = 2),
  group = rep(c("treatment", "control"), each = 2),
  timepoint = rep(c("pre", "post"), times = 2),
  dv = rnorm(200, mean = 2)
)
analysis <- WRS2::bwtrim(dv ~ group * timepoint,
                         id = subject,
                         data = df,
                         tr = .2)
analysis
```
With this, a data frame automatically shows up after the chunk, showing all the values very nicely. My main question is how I can get this data frame for my own uses. If you do str(analysis), you see that it's a list. If you do class(analysis), you get "bwtrim". If you do methods(class = "bwtrim"), you get the print method, and methods(print) will have a line that says print.bwtrim*. But I can't seem to figure out how to call print.bwtrim myself.
Regarding what R Markdown is doing, compare the following.
If you run this in a chunk, it actually steals the data.frame part and puts it into a separate figure.
```{r}
capture.output(analysis)
```
However, if you run the same line in the console, the entire output comes out properly. What's also interesting is that if you try to assign it to another object, the output will be stolen before it can be assigned.
Compare x when you run the following in either a chunk or the console.
```{r}
x<-capture.output(analysis)
```
This is what I get from the chunk approach when I call x
[1] "Call:"
[2] "WRS2::bwtrim(formula = dv ~ group * timepoint, id = subject, "
[3] " data = df, tr = 0.2)"
[4] ""
[5] ""
This is what I get when I do it all in the console
[1] "Call:"
[2] "WRS2::bwtrim(formula = dv ~ group * timepoint, id = subject, "
[3] " data = df, tr = 0.2)"
[4] ""
[5] " value df1 df2 p.value"
[6] "group 1.0397 1 56.2774 0.3123"
[7] "timepoint 0.0001 1 57.8269 0.9904"
[8] "group:timepoint 0.5316 1 57.8269 0.4689"
[9] ""
My question is: how can I call whatever RStudio/R Markdown is doing to make data frames, so that I can get an easy data frame myself?
Update 2: This is probably not a bug, as discussed here https://github.com/rstudio/rmarkdown/issues/1150.
Update 3: You can access the method by using WRS2:::print.bwtrim(analysis), though I'm still interested in what R Markdown is doing.
Update 4: It might not be the case that R Markdown is stealing the output and automatically making data frames from it, as you can see when you call x after you've already captured the output. Looking at WRS2:::print.bwtrim, it prints a data frame that it creates, which I'm guessing R Markdown recognizes and then formats.
See below for the print.bwtrim.
function (x, ...)
{
    cat("Call:\n")
    print(x$call)
    cat("\n")
    dfx <- data.frame(value = c(x$Qa, x$Qb, x$Qab),
                      df1 = c(x$A.df[1], x$B.df[1], x$AB.df[1]),
                      df2 = c(x$A.df[2], x$B.df[2], x$AB.df[2]),
                      p.value = c(x$A.p.value, x$B.p.value, x$AB.p.value))
    rownames(dfx) <- c(x$varnames[2], x$varnames[3],
                       paste0(x$varnames[2], ":", x$varnames[3]))
    dfx <- round(dfx, 4)
    print(dfx)
    cat("\n")
}
<bytecode: 0x000001f587dc6078>
<environment: namespace:WRS2>
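In other words, since print.bwtrim shows exactly how the table is built, the same data frame can be assembled directly from the bwtrim object; here is a sketch that just mirrors the method body above:
# rebuild the table that print.bwtrim would print, straight from `analysis`
bwtrim_df <- function(x) {
  dfx <- data.frame(value   = c(x$Qa, x$Qb, x$Qab),
                    df1     = c(x$A.df[1], x$B.df[1], x$AB.df[1]),
                    df2     = c(x$A.df[2], x$B.df[2], x$AB.df[2]),
                    p.value = c(x$A.p.value, x$B.p.value, x$AB.p.value))
  rownames(dfx) <- c(x$varnames[2], x$varnames[3],
                     paste0(x$varnames[2], ":", x$varnames[3]))
  round(dfx, 4)
}
result <- bwtrim_df(analysis)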
In R Markdown documents, automatic printing is done by knitr::knit_print rather than print. I don't think there's a knit_print.bwtrim method defined, so it will use the default method, which is defined as
function (x, ..., inline = FALSE)
{
if (inline)
x
else normal_print(x)
}
and normal_print will call print().
You are asking why the output is different. I don't see that when I knit the document to html_document, but I do see it with html_notebook. I don't know the details of what is being done, but if you look at https://rmarkdown.rstudio.com/r_notebook_format.html you can see a discussion of "output source functions", which manipulate chunks to produce different output.
The fancy output you're seeing looks a lot like what knitr::knit_print does for a dataframe, so maybe html_notebook is substituting that in place of print.
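If you want the chunk to show (or hand you) your own table, one option is to provide a knit_print method for the class yourself; a sketch, assuming a helper like the bwtrim_df() outlined in the question above:
# let knitr's automatic printing use our own data frame for bwtrim objects
knit_print.bwtrim <- function(x, ...) {
  knitr::knit_print(bwtrim_df(x), ...)
}
# register the S3 method so knitr's dispatch can find it
registerS3method("knit_print", "bwtrim", knit_print.bwtrim)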

How to quietly change the System Locale in R [duplicate]

I'm looking to suppress the output of one command (in this case, the apply function).
Is it possible to do this without using sink()? I've found the solution described below, but would like to do this in one line if possible.
How to suppress output
It isn't clear why you want to do this without sink, but you can wrap any commands in the invisible() function and it will suppress the output. For instance:
1:10 # prints output
invisible(1:10) # hides it
Otherwise, you can always combine things into one line with a semicolon and parentheses:
{ sink("/dev/null"); ....; sink(); }
Use the capture.output() function. It works very much like a one-off sink(), and unlike invisible(), it can suppress more than just print messages. Set the file argument to /dev/null on UNIX or NUL on Windows. For example, considering Dirk's note:
> invisible(cat("Hi\n"))
Hi
> capture.output( cat("Hi\n"), file='NUL')
>
The following function should do what you want exactly:
hush=function(code){
sink("NUL") # use /dev/null in UNIX
tmp = code
sink()
return(tmp)
}
For example, with this function:
foo <- function() {
  print("BAR!")
  return(42)
}
running
x <- hush(foo())
will assign 42 to x, but it will not print "BAR!" to stdout.
Note that on a UNIX OS you will need to replace "NUL" with "/dev/null".
R only automatically prints the output of unassigned expressions, so just assign the result of the apply to a variable, and it won't get printed.
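For example (a trivial sketch with a made-up matrix m):
m <- matrix(1:6, nrow = 2)
row_sums <- apply(m, 1, sum)  # nothing is printed, because the result is assigned
row_sums                      # prints only when you ask for it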
You can use capture.output() as below. This allows you to use the data later:
log <- capture.output({
  test <- CensReg.SMN(cc = cc, x = x, y = y, nu = NULL, type = "Normal")
})
test$betas
In case anyone's arriving here looking for a solution applicable to R Markdown, this will suppress all output:
```{r error=FALSE, warning=FALSE, message=FALSE}
invisible({capture.output({
# Your code goes here
2 * 2
# etc
# etc
})})
```
The code will run, but the output will not be printed to the HTML document.
invisible(cat("Dataset: ", dataset, fill = TRUE))
invisible(cat(" Width: " ,width, fill = TRUE))
invisible(cat(" Bin1: " ,bin1interval, fill = TRUE))
invisible(cat(" Bin2: " ,bin2interval, fill = TRUE))
invisible(cat(" Bin3: " ,bin3interval, fill = TRUE))
produces output without NULL at the end of the line or on the next line
Dataset: 17 19 26 29 31 32 34 45 47 51 52 59 60 62 63
Width: 15.33333
Bin1: 17 32.33333
Bin2: 32.33333 47.66667
Bin3: 47.66667 63
Turning Hadley's comment into an answer: using the apply family without printing is possible with the plyr package.
x <- 1:2
lapply(x, function(x) x + 1)
#> [[1]]
#> [1] 2
#>
#> [[2]]
#> [1] 3
plyr::l_ply(x, function(x) x + 1)
Here is a version that is robust to errors in the code to be shushed:
quietly <- function(x) {
  sink("/dev/null")  # on Windows (?) instead use `sink("NUL")`
  tryCatch(suppressMessages(x), finally = sink())
}
This is based directly on the accepted answer, for which thanks.
But it avoids leaving output silenced if an error occurs in the quieted code.
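Usage is the same as with hush() above, e.g. with the foo() defined earlier:
x <- quietly(foo())  # "BAR!" is not printed
x
# [1] 42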

How to output results of 'msa' package in R to fasta

I am using the R package msa, a core Bioconductor package, for multiple sequence alignment. Within msa, I am using the MUSCLE alignment algorithm to align protein sequences.
library(msa)
myalign <- msa("test.fa", method=c("Muscle"), type="protein",verbose=FALSE)
The test.fa file is a standard FASTA file, as follows (truncated for brevity):
>sp|P31749|AKT1_HUMAN_RAC
MSDVAIVKEGWLHKRGEYIKTWRPRYFLL
>sp|P31799|AKT1_HUMAN_RAC
MSVVAIVKEGWLHKRGEYIKTWRFLL
When I run the code on the file, I get:
MUSCLE 3.8.31
Call:
msa("test.fa", method = c("Muscle"), type = "protein", verbose = FALSE)
MsaAAMultipleAlignment with 2 rows and 480 columns
aln
[1] MSDVAIVKEGWLHKRGEYIKTWRPRYFLL
[2] MSVVAIVKEGWLHKRGEYIKTWR---FLL
Con MS?VAIVKEGWLHKRGEYIKTWR???FLL
As you can see, a very reasonable alignment.
I want to write the gapped alignment, preferably without the consensus sequence (e.g., Con row), to a fasta file. So, I want:
>sp|P31749|AKT1_HUMAN_RAC
MSDVAIVKEGWLHKRGEYIKTWRPRYFLL
>sp|P31799|AKT1_HUMAN_RAC
MSVVAIVKEGWLHKRGEYIKTWR---FLL
I checked the msa help, and the package does not seem to have a built-in method for writing out to any file type, fasta or otherwise.
The seqinr package looks somewhat promising, because maybe it could read this output as an msf format, albeit a weird one. However, seqinr seems to need a file read in as a starting point. I can't even save this using write(myalign, ...).
I wrote a function:
alignment2Fasta <- function(alignment, filename) {
  sink(filename)
  n <- length(rownames(alignment))
  for (i in seq(1, n)) {
    cat(paste0('>', rownames(alignment)[i]))
    cat('\n')
    the.sequence <- toString(unmasked(alignment)[[i]])
    cat(the.sequence)
    cat('\n')
  }
  sink(NULL)
}
Usage:
mySeqs <- readAAStringSet('test.fa')
myAlignment <- msa(mySeqs)
alignment2Fasta(myAlignment, 'out.fasta')
I think you ought to follow the examples in the help pages that show input with a specific read function first, then work with the alignment:
mySeqs <- readAAStringSet("test.fa")
myAlignment <- msa(mySeqs)
Then the rownames function will deliver the sequence names:
rownames(myAlignment)
[1] "sp|P31749|AKT1_HUMAN_RAC" "sp|P31799|AKT1_HUMAN_RAC"
(Not what you asked for but possibly useful in the future.) Then if you execute:
detail(myAlignment) #function actually in Biostrings
.... you get a text file in interactive mode that you can save
2 29
sp|P31749|AKT1_HUMAN_RAC MSDVAIVKEG WLHKRGEYIK TWRPRYFLL
sp|P31799|AKT1_HUMAN_RAC MSVVAIVKEG WLHKRGEYIK TWR---FLL
If you want to try hacking together a function that writes the file from code, then look at the Biostrings detail function code that is being used:
> showMethods( f= 'detail')
Function: detail (package Biostrings)
x="ANY"
x="MsaAAMultipleAlignment"
(inherited from: x="MultipleAlignment")
x="MultipleAlignment"
showMethods( f= 'detail', classes='MultipleAlignment', includeDefs=TRUE)
Function: detail (package Biostrings)
x="MultipleAlignment"
function (x, ...)
{
    .local <- function (x, invertColMask = FALSE, hideMaskedCols = TRUE)
    {
        FH <- tempfile(pattern = "tmpFile", tmpdir = tempdir())
        .write.MultAlign(x, FH, invertColMask = invertColMask,
                         showRowNames = TRUE, hideMaskedCols = hideMaskedCols)
        file.show(FH)
    }
    .local(x, ...)
}
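One possible hack, then, is to call that unexported helper directly with a filename of your choosing. This is a sketch based only on the method body shown above (note it writes the detail()-style text layout, not FASTA):
# write the detail()-style text to a chosen file instead of a tempfile
write_detail <- function(alignment, filename) {
  Biostrings:::.write.MultAlign(alignment, filename,
                                invertColMask = FALSE,
                                showRowNames = TRUE,
                                hideMaskedCols = TRUE)
}
write_detail(myAlignment, "alignment_detail.txt")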
You may use the export.fasta function from the bios2mds library.
# read the multiple sequence alignment of human GPCRs in FASTA format:
aln <- import.fasta(system.file("msa/human_gpcr.fa", package = "bios2mds"))
export.fasta(aln)
You can convert your msa alignment ("AAStringSet") into an "align" object first, and then export it as FASTA as follows:
library(msa)
library(bios2mds)
mysequences <-readAAStringSet("test.fa")
alignCW <- msa(mysequences)
#https://rdrr.io/bioc/msa/man/msaConvert.html
alignCW_as_align <- msaConvert(alignCW, "bios2mds::align")
export.fasta(alignCW_as_align, outfile = "test_alignment.fa", ncol = 60, open = "w")

Is there any method of reading non-standard table use organizing labels without loop?

As far as I know, parsing libraries like XML and xml2 can read a standard table on a web page perfectly. But there are some sorts of tables that have no table grid, only organizing tags such as “<span>” and “<div>”.
Now I am coping with a table like this:
The table structure is marked up with “<span>”, and every 4 “<span>” tags make up one record. I've used a loop to solve this problem and succeeded, but I want to process it without a loop. I heard that the purrr library may help with this problem, but I don't know how to use it in this situation.
I did my analysis with both “XML” and “xml2”:
Analysis with the “XML” package
pg<-"http://www.irgrid.ac.cn/simple-search?fq=eperson.unique.id%3A311007%5C-000920"
library(XML)
tableNodes <- getNodeSet(htmlParse(pg), "//table[@class='miscTable2']")
itemlines <- xpathApply(tableNodes[[1]], "//tr[@class='itemLine']/td[@width='750']")
ispan <- xmlElementsByTagName(itemlines[[2]], "span")
title <- xmlValue(ispan$span)
isuedate <- xmlValue(ispan$span[1,2])
author <- xmlValue(ispan$span[3])
In this case, “XML” got a list of spans, but this list is stranger than I expected:
> attributes(ispan)
$names
[1] "span" "span" "span" "span"
It seems to have one row only but four columns. However, it doesn't: the 2nd to 4th “span” elements couldn't be selected by column. The first “span” occupied two columns, and the other “span” elements could not be retrieved.
> val <- xmlValue(ispan$span[[1]])
> val
[1] "超高周疲劳裂纹萌生与初始扩展的特征尺度"
> isuedate <- xmlValue(ispan$span[[2]])
> isuedate
[1] " \r\n [科普文章]"
> isuedate <- xmlValue(ispan$span[[3]])
> isuedate
[1] NA
> author <- xmlValue(ispan$span[[4]])
> author
[1] NA
None of the usual list selection methods work:
> title <- xmlValue(ispan$span[1,1])
Error in UseMethod("xmlValue") :
no applicable method for 'xmlValue' applied to an object of class "c('XMLInternalNodeList', 'XMLNodeList')"
title <- xmlValue(ispan$span[1,])
Error in UseMethod("xmlValue") :
no applicable method for 'xmlValue' applied to an object of class "c('XMLInternalNodeList', 'XMLNodeList')"
author <- xmlValue(ispan[1,3])
Error in ispan[1, 3] : incorrect number of dimensions
Analysis with “xml2”
Using “xml2”, the “span” obstacle causes the same problem:
pg<-"http://www.irgrid.ac.cn/simple-search?fq=eperson.unique.id%3A311007%5C-000920"
library(xml2)
tableSource <- xml_find_all(read_html(pg, encoding = "UTF-8"), "//table[@class='miscTable2']")
itemspan <- xml_child(itemspantab, "span")
It could not gather any of these “span” tags:
> itemspan
{xml_nodeset (1)}
[1] <NA>
If we go a step further to locate the “span” tags, it still gets nothing:
> itemspanl <- xml_find_all(itemspantab, '//tr[@class="itemLine"]/td/span')
> itemspan <- xml_child(itemspanl, "span")
> itemspan
{xml_nodeset (40)}
[1] <NA>
[2] <NA>
[3] <NA>
...
A suggestion was to use library(purrr) for this, but it seems to me that purrr processes data frames only, and the “list” prepared by “xml2” could not be analyzed with it.
I don't want to use a loop, and I want to get a result like the one below. Can we do it? I hope those who have experience with “XML” and “xml2” can give me some advice on how to cope with this non-standard table. Thanks a lot.
