git2r::summary() produces different results when called from console and by RStudio

I am trying to write an R package that analyzes the Apache Pig GitHub repository with the help of the git2r package. I also use the testthat package for unit testing.
I have a function, let's call it compute(), which contains code along the lines of:
repo <- repository("C:/normalized/path/to/apache/pig");
allCommits <- commits(repo);
commitSummary <- capture.output(summary(allCommits[[1]]));
print(commitSummary);
The commitSummary is an important part, because I run several regexes on it to recover data like insertions and deletions.
The problem is, when I call compute() from console, it prints out Output 1.
But when compute() is called from my unit test file when I run devtools::test(), it prints out Output 2. (And over the course of me writing this question, after producing Output 2 several times, it produced Output 3.)
When I run the first codeblock in this question from the console, it prints out Output 1, again.
However, when I copy-paste that codeblock into the test file, it prints out Output 3.
I am confused. How is that even possible?
And how can I make sure git2r::summary() uses the format I want?
Output 1
[1] "Commit: d2de56aad939c7c77324066a6f29cc211e29a077"
[2] "Author: Koji Noguchi <knoguchi@apache.org>"
[3] "When: 2016-12-12 23:07:37"
[4] ""
[5] " PIG-5073: Skip e2e Limit_5 test for Tez (knoguchi)"
[6] " "
[7] " "
[8] " git-svn-id: https://svn.apache.org/repos/asf/pig/trunk@1773899 13f79535-47bb-0310-9956-ffa450edef68"
[9] " "
[10] "2 files changed, 18 insertions, 1 deletions"
[11] "CHANGES.txt | -0 + 2 in 1 hunk"
[12] "test/e2e/pig/tests/nightly.conf | -1 +16 in 2 hunks"
[13] ""
Output 2
[d2de56a] 2016-12-12: PIG-5073: Skip e2e Limit_5 test for Tez (knoguchi)
Output 3
[1] " Length Class Mode " " 1 git_commit S4 "
Additional notes that might take away from question's clarity
When I load the function that calls compute() from the unit test file and call it myself, it prints out Output 1. The same happens when I call compute() with the working directory set to the test folder and exactly the same arguments.
To make matters more confusing, up until very recently devtools::test() produced Output 1, then it switched to Output 3 before settling on Output 2.
CRAN's documentation for git2r::summary(object, ...) lists following arguments:
object The commit object.
... Additional arguments affecting the summary produced.
Accepted values of ... are nowhere to be found.

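It turns out this is most likely a name-resolution problem rather than anything inside git2r: base R already has a summary() function (and probably some other packages do too), so an unqualified summary() call can resolve to a different function depending on what is loaded and attached in each context. That would explain why there were 3 different outputs.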
I just changed every
summary(commitObject)
into
git2r::summary(commitObject)
and everything seems to work again.
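The masking effect can be reproduced without git2r. This is only a minimal sketch: the masking summary() defined below is hypothetical and stands in for whatever definition happens to be found first in a given context, and base::summary() plays the role that git2r::summary() plays above.

```r
x <- c(1, 2, 3)

# Hypothetical masking definition, standing in for a summary()
# that is found earlier than the one you meant to call.
summary <- function(object, ...) "masked summary"

unqualified <- summary(x)        # resolves to the masking function above
qualified   <- base::summary(x)  # explicitly resolves to base R's generic

rm(summary)                      # remove the mask again

print(unqualified)
print(class(qualified))
```

The qualified call does not depend on what else is defined or attached, which is why prefixing the package name makes the behaviour reproducible.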

Related

rvest - remove tags and its content from HTML string

Suppose I have the below text:
x <- "<p>I would like to run tests for a package with <code>testthat</code> and compute code coverage with <code>covr</code>. Furthermore, the results from <code>testthat</code> should be saved in the JUnit XML format and the results from <code>covr</code> should be saved in the Cobertura format.</p>\n\n<p>The following code does the trick (when <code>getwd()</code> is the root of the package):</p>\n\n<pre><code>options(\"testthat.output_file\" = \"test-results.xml\")\ndevtools::test(reporter = testthat::JunitReporter$new())\n\ncov <- covr::package_coverage()\ncovr::to_cobertura(cov, \"coverage.xml\")\n</code></pre>\n\n<p>However, the tests are executed <em>twice</em>. Once with <code>devtools::test</code> and once with <code>covr::package_coverage</code>. </p>\n\n<p>My understanding is that <code>covr::package_coverage</code> executes the tests, but it does not produce <code>test-results.xml</code>.</p>\n\n<p>As the title suggests, I would like get both <code>test-results.xml</code> and <code>coverage.xml</code> with a single execution of the test suite.</p>\n"
PROBLEM:
I need to remove all <code></code> tags and their content, whether they stand alone or are nested inside another tag.
I HAVE TRIED:
I tried the following:
content <- xml2::read_html(x) %>%
rvest::html_nodes(css = ":not(code)")
print(content)
But the result is the following, with the <code> tags still present:
{xml_nodeset (8)}
[1] <body>\n<p>I would like to run tests for a package with <code>testthat</code> and compute code coverage with <code>cov ...
[2] <p>I would like to run tests for a package with <code>testthat</code> and compute code coverage with <code>covr</code> ...
[3] <p>The following code does the trick (when <code>getwd()</code> is the root of the package):</p>
[4] <pre><code>options("testthat.output_file" = "test-results.xml")\ndevtools::test(reporter = testthat::JunitReporter$new ...
[5] <p>However, the tests are executed <em>twice</em>. Once with <code>devtools::test</code> and once with <code>covr::pac ...
[6] <em>twice</em>
[7] <p>My understanding is that <code>covr::package_coverage</code> executes the tests, but it does not produce <code>test ...
[8] <p>As the title suggests, I would like get both <code>test-results.xml</code> and <code>coverage.xml</code> with a sin ...
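The :not(code) selector only picks the nodes that are not themselves <code> elements; those nodes still contain their <code> children. The solution was to select the <code> nodes and remove them from the document: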
library(magrittr)  # provides %>%
content <- xml2::read_html(x)
toRemove <- content %>% rvest::html_nodes(css = "code")
xml2::xml_remove(toRemove)
After that, content no longer contained any <code> tags or their content, and nothing had to be manipulated as a string.
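To check that the removal really happened in place, here is a small sketch on a short, made-up HTML string (not the text from the question). It assumes the xml2 package is installed and uses xml2::xml_find_all() with an XPath expression instead of rvest::html_nodes(), purely to keep the example to one package.

```r
# Made-up input for illustration only.
html <- "<p>Run <code>devtools::test()</code> first.</p><pre><code>x = 1</code></pre>"

doc <- xml2::read_html(html)

# Select every <code> element, wherever it is nested, and remove it.
code_nodes <- xml2::xml_find_all(doc, ".//code")
xml2::xml_remove(code_nodes)

# The document was modified in place: no <code> nodes are left.
remaining <- xml2::xml_find_all(doc, ".//code")
text_left <- xml2::xml_text(doc)

print(length(remaining))
print(text_left)
```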

Simultaneously save and print R system call output?

Within my R script I am calling a shell script. I would like to both print the output to the console in real time and save the output for debugging. For example:
system("script.sh")
prints to console in real time,
out <- system("script.sh", intern = TRUE)
saves the output to a variable for debugging,
and
(out <- system("script.sh", intern = TRUE))
will only print the contents of out after the script has finished. Is there any way to both print to console in real time and store the output as a variable?
Since R is waiting for this to complete anyway, generally to see the stdout in real time, you need to poll the process for output. (One can/should also poll for stderr, depending.)
Here's a quick example using processx.
First, I'll create a slow-output shell script; replace this with the real reason you're calling system. I've named this myscript.sh.
#!/bin/bash
for i in `seq 1 5` ; do
  sleep 3
  echo 'hello world: '$i
done
Now let's (1) start a process in the background, then (2) poll its output every second.
proc <- processx::process$new("bash", c("-c", "./myscript.sh"), stdout = "|")
output <- character(0)
while (proc$is_alive()) {
  Sys.sleep(1)
  now <- Sys.time()
  tmstmp <- sprintf("# [%s]", format(now, format = "%T"))
  thisout <- proc$read_output_lines()
  if (length(thisout)) {
    output <- c(output, thisout)
    # collapse so that several lines read in one poll print on separate lines
    message(tmstmp, " New output!\n", paste("#>", thisout, collapse = "\n"))
  } else message(tmstmp)
}
# [13:09:29]
# [13:09:30]
# [13:09:31]
# [13:09:32]New output!
#> hello world: 1
# [13:09:33]
# [13:09:34]
# [13:09:35]New output!
#> hello world: 2
# [13:09:36]
# [13:09:37]
# [13:09:38]New output!
#> hello world: 3
# [13:09:39]
# [13:09:40]
# [13:09:41]New output!
#> hello world: 4
# [13:09:42]
# [13:09:43]
# [13:09:44]New output!
#> hello world: 5
And its output is stored:
output
# [1] "hello world: 1" "hello world: 2" "hello world: 3" "hello world: 4" "hello world: 5"
Ways that this can be extended:
Add/store a timestamp with each message, so you know when it came in. The accuracy and utility of this depends on how frequently you want R to poll the process stdout pipe, and really how much you need this information.
Run the process in the background, and even poll for it in the background cycles. I use the later package and set up a self-recurring function that polls, appends, and re-submits itself into the later process queue. The benefit of this is that you can continue to use R; the drawback is that if you're running long code, then you will not see output until your current code exits and lets R breathe and do something idly. (To understand this bullet, one really must play with the later package, a bit beyond this answer.)
Depending on your intentions, it might be more appropriate for the output to go to a file and "permanently" store it there instead of relying on the R process to keep tabs. There are disadvantages to this, in that now you need to manage polling a file for changes, and R isn't making that easy (it does not have, for instance, direct/easy access to inotify, so now it gets even more complicated).
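If an extra package is not an option, a different technique that needs only base R is to read from a pipe() connection one line at a time. This is a rough sketch, not a replacement for the polling loop above: it assumes a Unix-like shell, the inline shell loop is a stand-in for script.sh, only stdout is captured, and R is blocked while it waits for each line.

```r
# The shell loop below is a placeholder for the real script.
con <- pipe("for i in 1 2 3; do echo line $i; sleep 1; done", open = "r")

output <- character(0)
repeat {
  line <- readLines(con, n = 1)   # blocks until a full line is available
  if (length(line) == 0) break    # character(0) means the process closed stdout
  cat(line, "\n", sep = "")       # echo to the console as soon as it arrives
  output <- c(output, line)       # and keep a copy for later
}
close(con)

print(output)
```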

Reading an attribute value from a NetCDF4 file with nested groups

This should be trivial but I can't for the life of me figure out how to do this: I am trying to read an attribute value from a NetCDF4 file in R. Now, my NetCDF4 file (uploaded here) is fairly complex, i.e. it contains nested groups.
I would like to extract the value of the attribute called gml:posList from the group METADATA/EOP_METADATA/om:featureOfInterest/eop:multiExtentOf/gml:surfaceMembers/gml:exterior using R. I am not sure if it matters in this context, but this group does not contain any variables, only metadata attributes.
I have tried the following
library(ncdf4)
fid = nc_open('S5P_NRTI_L2__NO2____20180728T130136_20180728T130636_04089_01_010100_20180728T140302.nc')
ncatt_get(fid, varid=0, attname='METADATA/EOP_METADATA/om:featureOfInterest/eop:multiExtentOf/gml:surfaceMembers/gml:exterior/gml:posList', verbose=TRUE)
but this returns
[1] "ncatt_get: entering"
[1] "ncatt_get: is a global att"
[1] "ncatt_get: calling ncatt_get_inner for a global att"
[1] "ncatt_get_inner: entering with ncid= 65536 varid= -1 attname= METADATA/EOP_METADATA/om:featureOfInterest/eop:multiExtentOf/gml:surfaceMembers/gml:exterior/gml:posList"
[1] "ncatt_get_inner: about to call R_nc4_inq_att"
[1] "ncatt_get_inner: R_nc4_inq_att returned with error= -43 type= -1"
$`hasatt`
[1] FALSE
$value
[1] 0
presumably indicating that it cannot find the attribute and I assume that I got the path wrong somehow.
So my question is, how do I need to specify the path to an attribute that is a) in a nested group and b) not linked to a specific variable, such that ncatt_get() can find the attribute and return its value?
By the way, just for reference, in Matlab the command
test = ncreadatt(file, 'METADATA/EOP_METADATA/om:featureOfInterest/eop:multiExtentOf/gml:surfaceMembers/gml:exterior', 'gml:posList')
works fine, so I know it's not an issue with the file.
Any hints would be highly appreciated!

R parallel package parSapply(): cannot print message [duplicate]

I found that if there is more than one print call during a parallel computation, only the last one is displayed on the console. So I set the outfile option, hoping to get the output of every print. Here is the R code:
cl <- makeCluster(3, type = "SOCK",outfile="log.txt")
abc <<- 123
clusterExport(cl,"abc")
clusterApplyLB(cl, 1:6,
  function(y) {
    print(paste("before:", abc))
    abc <<- y
    print(paste("after:", abc))
  }
)
stopCluster(cl)
But I just get three records:
starting worker for localhost:11888
Type: EXEC
Type: EXEC
[1] "index: 3"
[1] "before: 123"
[1] "after: 2"
Type: EXEC
[1] "index: 6"
[1] "before: 2"
[1] "after: 6"
Type: DONE
It looks like you're only getting the output from one worker in log.txt. I've often wondered if that could happen, because when you specify outfile="log.txt", each of the workers will open log.txt for appending and then call sink. Here is the code that is executed by the worker processes when outfile is not an empty string:
## all the workers log to the same file.
outcon <- file(outfile, open = "a")
sink(outcon)
sink(outcon, type = "message")
This makes me nervous because I'm not certain what might happen with all of the workers opening the same file for appending at the same time. It may be OS or file system dependent, and it might explain why you're only getting the output from one worker.
For this reason, I tend to use outfile="", in which case this code isn't executed, allowing the output operations to happen normally without redirecting them with the sink function. However, on Windows, you won't see the output if you're using Rgui, so use Rterm instead.
There shouldn't be a problem with multiple print statements in a task, but if you didn't set outfile, you shouldn't see any output since all output is redirected to /dev/null in that case.
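A minimal sketch of the outfile = "" approach, assuming the parallel package's default PSOCK cluster on the local machine rather than the "SOCK" type used in the question. The task function and its messages are made up for illustration. Returning the message as the task's value as well means the text is available even when worker output is not visible.

```r
library(parallel)

# outfile = "" leaves worker output unredirected instead of sinking it to a file.
cl <- makeCluster(2, outfile = "")

res <- clusterApplyLB(cl, 1:4, function(y) {
  print(paste("task", y))    # printed by the worker process
  paste("task", y, "done")   # also returned to the master as the task's value
})

stopCluster(cl)

messages <- unlist(res)
print(messages)
```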

is it possible to redirect console output to a variable?

In R, I'm wondering if it's possible to temporarily redirect the output of the console to a variable?
p.s. There are a few examples on the web on how to use sink() to redirect the output into a filename, but none that I could find showing how to redirect into a variable.
p.p.s. The reason this is useful, in practice, is that I need to print out a portion of the default console output from some of the built-in functions in R.
I believe results <- capture.output(...) is what you need (i.e. using the default file=NULL argument). sink(textConnection("results")); ...; sink() should work as well, but as ?capture.output says, capture.output() is:
Related to ‘sink’ in the same way that ‘with’ is related to ‘attach’.
... which suggests that capture.output() will generally be better since it is more contained (i.e. you don't have to remember to terminate the sink()).
If you want to send the output of multiple statements to a variable you can wrap them in curly brackets {}, but if the block is sufficiently complex it might be better to use sink() (or make your code more modular by wrapping it in functions).
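For illustration, a short sketch of the curly-bracket form, using only base R; the statements inside the block are arbitrary examples.

```r
# Everything printed inside the braces is captured, one element per output line.
results <- capture.output({
  print(1:3)
  cat("done\n")
})

print(results)
```

Nothing is shown on the console while the block runs; the captured lines are only displayed by the final print(results).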
For the record, it is indeed possible to store stdout in a variable with the help of a temporary connection, without calling capture.output, e.g. when you want to save both the results and stdout. Example:
Prepare the variable for the diverted R output:
> stdout <- vector('character')
> con <- textConnection('stdout', 'wr', local = TRUE)
Divert the output:
> sink(con)
Do some stuff:
> 1:10
End the diversion:
> sink()
Close the temporary connection:
> close(con)
Check results:
> stdout
[1] " [1] 1 2 3 4 5 6 7 8 9 10"
