knitr HTML output showing incorrect/strange results: inline code and modifying options not yielding the correct output

I'm creating a report on the statistical analysis of several distributions; more specifically, random populations and how their samples differ from them, with the samples adhering to properties of normal distributions while the larger populations remain skewed in most cases.
Although I'm more than satisfied with the rest of the output, I can't figure out why certain numeric values and their visualisations differ from the ones produced at the command line. Here is some of the code reproducing the discrepancy (first I generate 1,000 random exponentials):
set.seed(1000)
pop <- rexp(1000, 0.2)
When extracting, say, the mean of pop, I get the correct result through the console, which is 4.76475. This is the value I should be getting in the markdown output, but instead knitr displays it as 5.015616.
mean(pop)
[1] 4.76475
```{r, echo = T}
mean(pop)
```
[1] 5.015616
It's not just the mean: almost all of the other required statistical values for the population as well as the sample are affected. In addition, I also get wrong visualisations in the knitted output:
[Original/correct plot image]
[Knitted plot image]
The plots are displayed differently because of the incorrect results. I thought this was a problem with the digits setting, but options(digits = ...) isn't solving it, and neither is the default scipen = 0 setting. I've tried inserting inline code, but it still shows me the incorrect values. I checked knitr's manual in case a chunk setting was missing but couldn't find a fault there. Is something missing here, or is this a bug related to random distributions?
EDIT: I noticed another peculiarity. I created a new markdown file to see if the results varied with each new output file. Let's call it test.Rmd; it contains the same commands I've reproduced here, with the same seed. And I'm getting a totally different result now, still different from the original value from the console session.
EDIT: Roman's point seems to be working. Knitted results are coming closer to the original values but still don't match exactly. Setting the seed to 357 gave me a mean(pop) of 4.881604, which is close to the original value. But why is the seed the game changer here? I thought it had to be 1000.
EDIT: Here's some of the code from the .Rmd file as requested by Phil.
# Load packages
library(ggplot2)
library(knitr)
library(gridExtra)
# Generate random exponentials
set.seed(357)
pop = rexp(1000, 0.2) # lambda is 0.2 with n = 1000
pop.table <- as.data.frame(pop)
# Take a sample simulating 1000 averages of 40 exponentials
sample.exp <- NULL
for (i in 1:1000) {
  sample.exp <- c(sample.exp, mean(rexp(40, 0.2)))  # store the mean of 40 exponentials
}
sample.df <- as.data.frame(sample.exp)
# Generate means and compare
mean(pop) # 4.881604
mean(sample.exp) # 4.992426
# Generate variances and compare
var(pop) # 26.07005
var(sample.exp) # 0.6562298
# Some plots
plot.means.pop <- ggplot(pop.table, aes(x = pop)) +
  geom_histogram(binwidth = 0.9, fill = 'white', colour = 'black') +
  geom_vline(aes(xintercept = mean(pop)), colour = 'red') +
  labs(title = 'Population Mean', x = 'Exponential', y = 'Frequency') +
  theme(legend.position = 'none', plot.title = element_text(hjust = 0.5))
plot.means.sample <- ggplot(sample.df, aes(x = sample.exp)) +
  geom_histogram(binwidth = 0.2, fill = 'white', colour = 'black') +
  geom_vline(aes(xintercept = mean(sample.exp)), colour = 'red', size = 0.8) +
  labs(title = 'Sample Mean', x = 'Exponential', y = 'Frequency') +
  guides(fill = FALSE) +
  theme(plot.title = element_text(hjust = 0.5))
grid.arrange(plot.means.sample, plot.means.pop, ncol = 2, nrow = 1)
So that's pretty much the main portion of the file that is giving me 'close' values, if not errors or the exact results from the command line. Note: the annotated values are the new values after setting the seed to 357, and I've set the same seed in the global environment. The values I'm getting at the console are:
4.76475 for population mean
5.00238 for sample mean
21.80913 for population variance
0.6492991 for sample variance

When asking a question on Stack Overflow it's essential to provide a minimal reproducible example. In particular, have a good read of the first answer there; it will guide you through the process.
I think we've all struggled to help you (and we want to!) because we can't reproduce your issue. Compare the following R and Rmd code when run or knitted, respectively:
# Generate random exponentials
set.seed(1000)
pop = rexp(1000, 0.2) # lambda is 0.2 with n = 1000
mean(pop)
## [1] 5.015616
var(pop)
## [1] 26.07005
and the Rmd:
---
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  message = TRUE,
  warning = TRUE
)
```
```{r}
# Generate random exponentials
set.seed(1000)
pop = rexp(1000, 0.2) # lambda is 0.2 with n = 1000
mean(pop)
var(pop)
```
Which produces the following output:
# Generate random exponentials
set.seed(1000)
pop = rexp(1000, 0.2) # lambda is 0.2 with n = 1000
mean(pop)
## [1] 5.015616
var(pop)
## [1] 26.07005
As you can see, the results are identical from a clean R session and a clean knitr session. This is as expected, because set.seed(), when set the same, should provide the same results every time (see the set.seed man page). When you change the seed to 357, the results change together:
|               | mean    | var      |
|---------------|---------|----------|
| console (`R`) | 4.88... | 22.88... |
| knitr (`Rmd`) | 4.88... | 22.88... |
In your second code block your knitr chunk result is correct for seed 1000, but the console result of 4.76 is incorrect, suggesting to me that your console is producing the incorrect output. This could be for one of a few reasons:
You forgot to set the seed in the console before running rexp(). If you run this line without setting the seed, the result will vary every time. Ensure you run set.seed(1000) first, or put the steps in an R script and source() it so that they run in order.
There's something in your global R environment that is affecting your results. This is less likely because you cleared your R environment, but it is one of the reasons it's important to create a new session from time to time, either by closing and re-opening RStudio or by pressing Ctrl + Shift + F10.
There might be something set in your Rprofile.site or .Rprofile that sets an option on startup and affects your results. Have a look at Customizing startup to check your startup options and, if necessary, correct them.
The output you're seeing isn't because of scipen, since there are no numbers in scientific/engineering notation, and it isn't digits, since the differences you're seeing are larger than rounding differences.
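A quick self-check that the seed, not knitr, drives the numbers (a minimal sketch you can run in any clean session):

```r
# Identical seeds must yield identical draws, in the console or in a knitr chunk
set.seed(1000)
a <- rexp(1000, 0.2)
set.seed(1000)
b <- rexp(1000, 0.2)
identical(a, b)  # TRUE: the same seed reproduces the same draws exactly
```

If this prints TRUE in your console but your console mean still disagrees with the knitted mean, the session state (not the RNG) is the difference.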
If these suggestions still don't solve your issue, post the minimal reproducible example and try on other computers.


R: automatically assigning all colors

I am working with the R programming language. I have this data:
letters = replicate(52, paste(sample(LETTERS, 10, replace=TRUE), collapse=""))
values = rnorm(52, 100, 100)
my_data = data.frame(letters, values)
I am trying to plot this data:
library(ggplot2)
library(waffle)
waffle(my_data, size = 0.6, rows = 10)
But this gives me the error:
! Insufficient values in manual scale. 51 needed but only 8 provided.
Run `rlang::last_error()` to see where the error occurred.
Normally, I would have manually provided the colors - but 51 colors are a lot to insert manually. Is there some automatic way that can recognize how many colors are required and then fill them all in?
Thanks!
You can use a vector of 53 colors from a palette function such as scales::hue_pal()(53). (Note that I have had to alter the way the input data is used, since your unmodified example data and code simply return an error.)
waffle(setNames(abs(round(my_data$values / 10)),
                my_data$letters),
       size = 0.6, rows = 10,
       colors = scales::hue_pal()(53)) +
  theme(legend.position = "bottom")
The obvious caveat is that 53 discrete colors is far too many to have in a waffle plot. It is simply unintelligible from a data visualisation point of view. Whatever you are trying to demonstrate, there will certainly be a better way to do it than a waffle chart with 53 discrete colors.
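More generally, you can derive the palette size from the data instead of hard-coding 53 (a sketch; my_data is the example data frame from the question, and the setNames/round transform is the one used above):

```r
library(scales)
library(waffle)

n <- nrow(my_data)   # one colour per category
pal <- hue_pal()(n)  # n evenly spaced hues (ggplot2's default discrete palette)

waffle(setNames(abs(round(my_data$values / 10)), my_data$letters),
       size = 0.6, rows = 10, colors = pal)
```

This way the plot keeps working if the number of categories changes, though the readability caveat above still applies.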

How to change or reset parameters in the plot(ACF)-device within R-studio

I have estimated a two-intercept mixed multilevel-model using the function lme of the r-package nlme.
After that I checked for autocorrelation by visual inspection using the plot(ACF)-function.
Plotting for the first time I specified maxlag=16.
Now I have two problems. First, the maxlag parameter seems to be stuck somehow, i.e. further plots are all drawn with maxlag = 16 even when maxlag is set to other values. Second, the plot is cropped at y = 0.8 even though the value at lag 0 is obviously 1.
Below I share the respective reprex in the hope of getting answers or input on how to solve these two issues.
Here is a link to the dataset, to copy-paste into the following code script if preferred:
#read.dataset:
datafclr <-read.csv("datafclr.csv", header = TRUE, sep = ",", dec = ".", fill = TRUE)
#required packages:
library("Matrix")
library("nlme")
#model-estimation:
tim2 <- lme(fixed=EERTmn ~ male + female +
(male:time7c) + (female:time7c) +
(male:IERT_Cp) + (female:IERT_Cp) +
(male:IERT_Cp_Partner) + (female:IERT_Cp_Partner)-1,
control=list(maxIter=100000), data=datafclr,
random=~male + female -1|dyade/female, correlation=corAR1(), na.action=na.omit)
summary(tim2)
#checking for autocorrelation:
plot(ACF(tim2, maxlag = 16), alpha = 0.01)
Results in the following plot:
[Plot image: cropped ('thin') ACF plot]
When I change the maxlag:
plot(ACF(tim2, maxlag = 10), alpha = 0.01)
It results in the same plot
Many thanks in advance!
Best,
Patrick
Joes Schwartz helped me solve these issues in the RStudio community. In case someone has the same difficulties I had, I'm sharing his answers here:
First issue: maxlag needs to be typed maxLag, and then the function works fine.
Second issue: detailed help under the following link:
https://community.rstudio.com/t/resetting-plotting-settings-plot-acf-data/19441
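For reference, the corrected call from the reprex above (note the capital L: the argument to nlme::ACF is maxLag, and a misspelled maxlag appears to be silently absorbed, so every call fell back to the same behaviour):

```r
# tim2 is the lme fit from the question; maxLag (capital L) is the real argument
plot(ACF(tim2, maxLag = 10), alpha = 0.01)
```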

How to find byte sizes of R figures on pages?

I would like to monitor the basic quality of the figures produced in R on individual pages, such as the byte size of each page.
At the moment I can only do quality assurance on average page sizes; see the section about that below.
I think there must be something built in for this task that is better than average measures.
The code below produces 4 pages in Rplots.pdf, and I would like to know the byte size of each page of the output; any other statistics about the page outputs are also welcome.
You can get basic memory monitoring of R objects, but I would like the measurements to correspond to the outputs in the PDF.
# https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/plot.html
require(stats) # for lowess, rpois, rnorm
plot(cars)
lines(lowess(cars))
plot(sin, -pi, 2*pi) # see ?plot.function
## Discrete Distribution Plot:
plot(table(rpois(100, 5)), type = "h", col = "red", lwd = 10,
main = "rpois(100, lambda = 5)")
## Simple quantiles/ECDF, see ecdf() {library(stats)} for a better one:
plot(x <- sort(rnorm(47)), type = "s", main = "plot(x, type = \"s\")")
points(x, cex = .5, col = "dark red")
## TODO summarise here the byte size of figures in the figures (1-4)
# Output: Rplot.pdf where 4 pages; I want to know the size of each page in bytes
I am currently doing the basic quality assurance at the command line, but I would like to move some of it into R to spot bugs faster.
Expected output: byte size, for instance like 4th column of ls -l
To get the byte size of the average individual page in an output document
Limitations
Requirement of homogeneity of the data across pages: this method only works if the pages are all from the same sample.
Otherwise it is troublesome, because it gives only an average and does not describe the individual pages.
Other possible weaknesses
PDF elements and metadata: this considers the PDF file as a whole, not the graphic objects themselves. That limits the use of the absolute value, because the file size also includes headers and other metadata that are not about the graphic objects.
Code
filename <- "main.pdf"
filesize <- file.size(filename)
# http://unix.stackexchange.com/q/331175/16920
pages <- Rpoppler::PDF_info(filename)$Pages
# print page size (= filesize / pages)
pagesize <- filesize / pages
## data of example file
filesize  # num 7350960  (bytes)
pages     # int 62
pagesize  # num 118564   (average bytes per page)
Input: just any 62-page document
Output: average individual page size (118564)
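The steps above can be wrapped into a small helper (a sketch; it assumes the Rpoppler package is installed and the file exists):

```r
# Average page size in bytes: total file size divided by page count
avg_page_size <- function(filename) {
  file.size(filename) / Rpoppler::PDF_info(filename)$Pages
}

avg_page_size("main.pdf")
```

Keep in mind this inherits all the limitations listed above: it is a per-file average, not a per-page measurement.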
Testing and's answer
Output (note that you cannot easily change the input to the PDF file you want):
files size_bytes
[1,] "./test_page_size_pdf/page01.pdf" "4,123,942"
[2,] "./test_page_size_pdf/page02.pdf" " 4,971"
[3,] "./test_page_size_pdf/page03.pdf" " 4,672"
[4,] "./test_page_size_pdf/page04.pdf" " 5,370"
Input: just any 64-page document
Expected output: 67 (= 64 + 3) pages, not 4 analysed
R: 3.3.2
OS: Debian 8.5
Download and install the pdftk utility if it is not already on your system, and then try one of the following alternatives from within R.
1) This will return a data frame with the page file sizes in bytes and other information.
myfile <- "Rplots.pdf"
system(paste("pdftk", myfile, "burst"))
file.info(Sys.glob("pg_*.pdf"))
It will also generate a file doc_data.txt with some miscellaneous information that may or may not be of interest.
1a) This alternative will not generate any files. It will simply return the sizes of the pages in bytes as a numeric vector.
myfile <- "Rplots.pdf"
pages <- as.numeric(read.dcf(pipe(paste("pdftk", myfile, "dump_data")))[, "NumberOfPages"])
cmds <- sprintf("pdftk %s cat %d output - | wc -c", myfile, seq_len(pages))
unname(sapply(cmds, function(cmd) scan(pipe(cmd), quiet = TRUE)))
The above should work if pdftk and wc are on your path. Note that on Windows you can find wc in the Rtools distribution; it is typically at "C:\\Rtools\\bin\\wc" once Rtools is installed.
2) This alternative is similar to (1) but uses the animation package:
library(animation)
ani.options(pdftk = "/path/to/pdftk")
pdftk("Rplots.pdf", "burst", "pg_%04d.pdf", "")
file.info(Sys.glob("pg_*.pdf"))
To measure the size of each page in a PDF file I suggest this:
test_size <- TRUE
pdf_name <- "masterpiece"
if (test_size) {
  dir.create("test_page_size_pdf")
  pdf_address <- "./test_page_size_pdf/page%02d.pdf"
} else {
  pdf_address <- paste0("./", pdf_name, ".pdf")
}
pdf(pdf_address, width=10, height=6, onefile=!test_size)
par(mar=c(1,1,1,1), oma=c(1,1,1,1))
plot(rnorm(10^6, 100, 5), type="l")
plot(sin, -pi, 2*pi)
plot(table(rpois(100, 5)), type = "h", col = "red", lwd = 10,
main = "rpois(100, lambda = 5)")
plot(x <- sort(rnorm(47)), type = "s", main = "plot(x, type = \"s\")")
points(x, cex = .5, col = "dark red")
dev.off()
if (test_size) {
  files <- paste0("./test_page_size_pdf/", list.files("./test_page_size_pdf/"))
  size_bytes <- format(file.size(files), big.mark = ",")
  file.remove(files)
  unlink("test_page_size_pdf", recursive = TRUE)  # remove the now-empty folder
  cbind(files, size_bytes)
}
The size of a pdf-page in R depends on three things: the content of the plot(), the options used in the pdf() function and the plotting options which are here defined in par().
All this is difficult to estimate. You also mention that you would like something similar to the shell command ls, which runs on files. So in this solution I create a temporary folder with dir.create(), in which every page of the PDF is saved as a separate file. We implement this with the onefile option. When the plotting is finished, every page file as well as the temporary folder is deleted, and you can see the result in the console.
If you are finished with testing and want the result in a single file, you just have to change the variable in the first line of this script to test_size <- FALSE. By the way, I have some doubt that the size of a page is a proxy for the quality of an image. PDF is a vector format, so the size corresponds to the number of elements: see the size of the first page in my example, where I plot a million points.
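That last point is easy to check directly (a sketch; absolute sizes will vary by system and R version, and it writes two temporary files to the working directory):

```r
# A plot with more elements produces a larger PDF, since PDF stores vectors
pdf("few.pdf");  plot(rnorm(100)); dev.off()
pdf("many.pdf"); plot(rnorm(1e5)); dev.off()
file.size("many.pdf") > file.size("few.pdf")  # TRUE: more elements, bigger file
file.remove(c("few.pdf", "many.pdf"))         # clean up
```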

knitr adds an empty figure with ssplot from seqHMM package

I have the following chunk in RStudio:
<<sumfig, dependson='data', fig.cap="Summary of sequences">>=
ssplot(smult)
@
ssplot is a function in the seqHMM package which creates a frequency graph, and smult is my sequence data.
When I run my code, I get two figures in my pdf: the first one is an empty white figure with label fig:sumfig1 and the second one is the real figure with label fig:sumfig2. I have a similar experience with other plots from this package. I also have some other graphs in my file from other packages which work just fine.
Is it something wrong with the package or I am doing something wrong?
The root of this issue seems to be seqHMM::ssplot, not knitr: even in an interactive session, ssplot generates two plots, an empty one and the actual plot.
If only one plot is generated in the chunk with ssplot, the chunk option fig.keep = "last" can be used to disregard the first plot and show only the second (last) one.
\documentclass{article}
\begin{document}
<<echo = FALSE, message = FALSE, fig.keep = "last">>=
library(seqHMM)
# from ?ssplot
data("biofam3c")
# Creating sequence objects
child_seq <- seqdef(biofam3c$children, start = 15)
marr_seq <- seqdef(biofam3c$married, start = 15)
left_seq <- seqdef(biofam3c$left, start = 15)
## Choosing colors
attr(child_seq, "cpal") <- c("#66C2A5", "#FC8D62")
attr(marr_seq, "cpal") <- c("#AB82FF", "#E6AB02", "#E7298A")
attr(left_seq, "cpal") <- c("#A6CEE3", "#E31A1C")
# Plotting state distribution plots of observations
ssplot(list("Children" = child_seq, "Marriage" = marr_seq,
"Residence" = left_seq))
@
\end{document}
As of knitr 1.14 (the current development version, available on GitHub), you can also use fig.keep to specify which plots exactly you want to keep: fig.keep = c(1,3) will keep the first and the third plot.
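For instance, a chunk that produces three plots but keeps only the first and third could look like this (a sketch with base-graphics placeholders standing in for real plots):

```
<<keep-two, fig.keep = c(1, 3)>>=
plot(1:10)        # kept    (plot 1)
plot(10:1)        # dropped (plot 2)
hist(rnorm(100))  # kept    (plot 3)
@
```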

Why does my chart print the wrong model for Variable Importance?

This seems like a really obvious question, but I've checked through the code and it seems fine. The challenge is that the code runs fine when run line by line, but when knitting the R markdown document, it picks up the wrong random forest and prints the importance for that.
I've tried reinstalling knitr but that hasn't worked.
The data is based on the train titanic dataset
I have 2 models, one called modRF and another mod2. I want to run the chart on mod2, but the output is for modRF.
You can see this by changing the line imp <- importance(mod2$finalModel) to be modRF$...
As I say, when I run this code line by line it works, but in R Markdown (knitting to HTML) it generates the wrong chart. Can someone elaborate?
PS: the random forest models each take less than a minute to run on my machine, so running this code shouldn't take you too long.
Thanks in advance for your help,
J
Here's my code to replicate
suppressMessages(library(caret))
suppressMessages(library(randomForest))
suppressMessages(library(dplyr))
suppressMessages(library(ggplot2))
setwd("~/Kaggle/Titanic")
totaltrain<-read.csv("train.csv")
#Adding features for EDA
totaltrain$CabinYes<-as.numeric(!(totaltrain$Cabin)=="")
ageid<-data.frame("minage"=c(0,20,30,40,50,60),
"AgeLabel"=c("Under 20","20-30","30-40","40-50","50-60","60+"))
#vlookup TRUE equivalent
totaltrain$AgeBracket<-ageid[findInterval(totaltrain$Age,ageid$minage),2]
#findInterval creates an index of which of the initial values most closely matches
#the lookup... Then use with the age id index and return the second column
a<-c(1,2,3,5,7,8,12,13,14)
rates<-totaltrain[,a]
rates$AgeBracket<-as.character(rates$AgeBracket)
rates$AgeBracket[is.na(rates$AgeBracket)]<-"Unknown"
rates$AgeBracket<-as.factor(rates$AgeBracket)
rates$Survived<-as.factor(rates$Survived)
rates$Pclass<-as.factor(rates$Pclass)
rates$CabinYes <- as.factor(rates$CabinYes)
```{r,cache=TRUE}
set.seed(4321)
inTrain <- createDataPartition(y=rates$Survived,
p=0.75, list=FALSE)
training<-rates[inTrain,]
testing<-rates[-inTrain,]
modRF <- train(Survived ~ . - PassengerId, data = training, method = "rf",
               trControl = trainControl(method = "cv", number = 3,
                                        allowParallel = TRUE))
pred<-predict(modRF,newdata=testing)
testing$PredRight<-pred==testing$Survived
sum(testing$PredRight)/length(pred)
```
b<-c(1,2,3,5,6,7,8,12,13)
rates2<-totaltrain[,b]
rates2$Age[is.na(rates2$Age)]<-0
#Model 2
set.seed(2072)
inTrain <- createDataPartition(y=rates$Survived,
p=0.75, list=FALSE)
training<-rates[inTrain,]
testing<-rates[-inTrain,]
mod2 <- train(Survived ~ . - PassengerId, data = training, method = "rf",
              trControl = trainControl(method = "cv", number = 3,
                                       allowParallel = TRUE))
imp<-importance(mod2$finalModel)
impdf<-data.frame(Variables=row.names(imp),Importance=round(imp[,1],2))
rankimp<-impdf %>% mutate(Rank = paste0('#',dense_rank(-Importance)))
ggplot(rankimp, aes(x = reorder(Variables, Importance),
y = Importance, fill = Importance)) +
geom_bar(stat='identity') +
geom_text(aes(x = Variables, y = 0.5, label = Rank),
hjust=0, vjust=0.55, size = 4, colour = 'red') +
labs(x = 'Variables') +
coord_flip()
