R Markdown struggling with read_xlsx(): Warning: Expecting logical

When I run read_xlsx() in my normal .R script, the data reads in fine. But when I run the same .R script via source() from R Markdown, it suddenly takes a very long time (20+ minutes; I always terminate it before it finishes) and I keep getting warning messages where every single column is evaluated with logical as the expected type:
Warning: Expecting logical in DE5073 / R5073C109: got 'HOSPITAL/CLINIC'
Warning: Expecting logical in DG5073 / R5073C111: got 'YES'
Warning: Expecting logical in CQ5074 / R5074C95: got '0'
Warning: Expecting logical in CR5074 / R5074C96: got 'MARKET/GROCERY STORE'
Warning: Expecting logical in CT5074 / R5074C98: got 'NO'
Warning: Expecting logical in CU5074 / R5074C99: got 'YES'
Warning: Expecting logical in CV5074 / R5074C100: got 'Less than one week'
Warning: Expecting logical in CW5074 / R5074C101: got 'NEXT'
Warning: Expecting logical in CX5074 / R5074C102: got '0'
.. etc
I can't share the data here, but it is just a normal xlsx file (30k obs, 110 vars). The data has responses in all capitals like YES and NO. The raw data has filters applied, some additional sheets, and some mild formatting in Excel (no borders, white fill) but I don't think these are affecting it.
An example of my workflow setup is like this:
Dataprep.R:
setwd()
pacman::p_load() # all my packages
df <- read_xlsx("./data/Data.xlsx") %>% type_convert()
## blabla more cleaning stuff
Report.Rmd:
setwd()
pacman::p_load() # all my packages again
source("Dataprep.R")
When I run Dataprep.R on its own, everything finishes in under a minute. But when I source("Dataprep.R") from Report.Rmd, read_xlsx() becomes slow and produces those warnings.
I've also tried moving df <- read_xlsx() from Dataprep.R into Report.Rmd, and it is still as slow as running source(). I've removed type_convert() and tried other things like deleting the extra sheets from the Excel file. source() was originally in the setup chunk of Report.Rmd, but taking it out made no difference.
So I think it is something to do with R Markdown and readxl/read_xlsx(). The exact same code and data evaluate very differently in R versus Rmd, and it's very puzzling.
Would appreciate any insight on this. Is there a fix? Or is this something I will just have to live with (i.e. convert to csv)?
> sessionInfo()
R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22000)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.utf8 LC_CTYPE=English_United Kingdom.utf8 LC_MONETARY=English_United Kingdom.utf8
[4] LC_NUMERIC=C LC_TIME=English_United Kingdom.utf8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] digest_0.6.29 R6_2.5.1 lifecycle_1.0.1 pacman_0.5.1 evaluate_0.15 scales_1.2.0 rlang_1.0.2 cli_3.3.0 rstudioapi_0.13
[10] rmarkdown_2.14 tools_4.2.0 munsell_0.5.0 xfun_0.30 yaml_2.3.5 fastmap_1.1.0 compiler_4.2.0 colorspace_2.0-3 htmltools_0.5.2
[19] knitr_1.39
UPDATE:
So in R Markdown I can use the more generic read_excel(), and that works in my setup chunk. But I still get the same warning messages if I source() the script, even when the sourced script also uses read_excel() instead of read_xlsx(). Very puzzling all around.

When you run that code in a .R script (and probably any other code that generates many warnings), you get a summary of the warnings, something like "There were 50 or more warnings (use warnings() to see the first 50)".
If you run the same code in a standard R Markdown code chunk, however, you actually get every single warning. That could mean printing thousands, millions, or more warnings.
If your question is why this happens in R Markdown and not in plain R, I'm not sure.
But if your question is how to work around it, that's simple: just add the options message=FALSE and warning=FALSE to your code chunk.
It should look something like this:
{r chunk_name, message=FALSE, warning=FALSE}
setwd()
pacman::p_load() # all my packages again
source("Dataprep.R")
Now, about the "setwd()", I would advise against using anything that changes the state of your system (avoid "side effect" functions). They can create problems if you are not very careful. But that is another topic for another day.

Related

Is assignInNamespace() the correct solution for errors rendering distill with data.table?

I use the distill package and R Markdown to write a blog. When rendering the .Rmd file, I get errors when code chunks include the data.table operators := and ., but not when they use functions such as data.table(). The errors occur when the YAML header states draft: false but not when it states draft: true.
The R code chunk in the .Rmd file:
# create a data.table
library(data.table)
DT <- data.table(p = 1:5, q = 6:10)
# operate with ":="
DT[, r := p + q][]
# operate with "."
DT[, .(p)]
With draft: true in the YAML header, the .Rmd file knits with no problem.
With draft: false, I can still run the R code chunks with no error but knitting the .Rmd file produces this error:
Error: Check that is.data.table(DT) == TRUE. Otherwise, :=, `:=`(...)
and let(...) are defined for use in j, once only and in particular
ways. See help(":=").
Execution halted
If I comment out the line with the := operation, knitting the document produces a similar error on the . operation:
Error in .(p) : could not find function "."
Calls: <Anonymous> ... withVisible -> eval -> eval -> [ ->
[.data.table -> [.data.frame
Execution halted
On Stack Overflow I found a similar issue from 3 years ago with the rtvs package (RTVS: Unable to Knit Document with data.table) that suggested a workaround using assignInNamespace(). I copied the suggestion into my code chunk and changed "rtvs" to "distill", as follows:
# workaround
assignInNamespace("cedta.pkgEvalsUserCode",
                  c(data.table:::cedta.pkgEvalsUserCode, "distill"),
                  "data.table")
The help page for assignInNamespace() says the function is intended for use within a package, which my blog is not, but it does solve my problem. Adding this workaround to the first code chunk eliminates the errors and the .Rmd file renders correctly.
My questions are:
Does using assignInNamespace() in this way produce any problematic side effects?
Is there another solution or is this possibly a case where data.table or distill might need a patch?
Session information
R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] data.table_1.14.3
loaded via a namespace (and not attached):
[1] fansi_1.0.2 digest_0.6.29 R6_2.5.1 jsonlite_1.7.3
[5] magrittr_2.0.2 evaluate_0.14 stringi_1.7.6 rlang_1.0.1
[9] cachem_1.0.6 cli_3.2.0 rstudioapi_0.13 jquerylib_0.1.4
[13] bslib_0.3.1 vctrs_0.3.8 rmarkdown_2.11 distill_1.3
[17] tools_4.1.2 stringr_1.4.0 xfun_0.29 yaml_2.2.2
[21] fastmap_1.1.0 compiler_4.1.2 memoise_2.0.1 htmltools_0.5.2
[25] knitr_1.37 downlit_0.4.0 sass_0.4.0
2022-06-06 Update.
I've switched blog-authoring software from distill to quarto. The problem described above does not occur with quarto and porting older posts to quarto has been fairly straightforward.

Computing n-grams on large corpus using R and Quanteda

I am trying to build n-grams from a large text corpus (object size about 1 GB in R) using the great quanteda package.
I don't have a cloud resource available, so I am using my own laptop (Windows and/or Mac, 12 GB RAM) to do the computation.
If I sample the data down into pieces, the code works and I get a (partial) dfm of n-grams of various sizes, but when I try to run the code on the whole corpus, I unfortunately hit memory limits and get the following error (example code for unigrams, single words):
> dfm(corpus, verbose = TRUE, stem = TRUE,
ignoredFeatures = stopwords("english"),
removePunct = TRUE, removeNumbers = TRUE)
Creating a dfm from a corpus ...
... lowercasing
... tokenizing
... indexing documents: 4,269,678 documents
... indexing features:
Error: cannot allocate vector of size 1024.0 Mb
In addition: Warning messages:
1: In unique.default(allFeatures) :
Reached total allocation of 11984Mb: see help(memory.size)
It is even worse if I try to build n-grams with n > 1:
> dfm(corpus, ngrams = 2, concatenator=" ", verbose = TRUE,
ignoredFeatures = stopwords("english"),
removePunct = TRUE, removeNumbers = TRUE)
Creating a dfm from a corpus ...
... lowercasing
... tokenizing
Error: C stack usage 19925140 is too close to the limit
I found this related post, but it looks like it was an issue with dense matrix coercion, later solved, and it doesn't help in my case.
Are there better ways to handle this with limited amount of memory, without having to break the corpus data into pieces?
[EDIT] As requested, sessionInfo() data:
> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.9.6 dplyr_0.4.3 quanteda_0.9.4
loaded via a namespace (and not attached):
[1] magrittr_1.5 R6_2.1.2 assertthat_0.1 Matrix_1.2-3 rsconnect_0.4.2 DBI_0.3.1
[7] parallel_3.2.3 tools_3.2.3 Rcpp_0.12.3 stringi_1.0-1 grid_3.2.3 chron_2.3-47
[13] lattice_0.20-33 ca_0.64
Yes, there is, and it is exactly by breaking it into pieces, but hear me out. Instead of importing the whole corpus, import a piece of it (if it is split across files, import file by file; if it is one giant txt file, fine, use readLines). Compute your n-grams, store them in another file, read the next file/lines, store the n-grams again. This is more flexible and will not run into RAM issues (it will take quite a bit more space than the original corpus, of course, depending on the value of n). Later, you can access the n-grams from the files as usual.
Update as per comment.
As for loading, sparse matrices/arrays sound like a good idea; come to think of it, they might be a good idea for storage too (particularly if you happen to be dealing with bigrams only). If your data is that big, you'll probably have to look into indexing anyway (that should help with storage: instead of storing the words in the bigrams, index all words and store tuples of indices). But it also depends on what your "full n-gram model" is supposed to be for. If it's to look up the conditional probability of (a relatively small number of) words in a text, then you could just do a search (grep) over the stored n-gram files. I'm not sure the indexing overhead would be justified for such a simple task. If you actually need all 12 GB worth of n-grams in a model, and the model has to calculate something that cannot be done piece by piece, then you still need a cluster/cloud.
One more piece of general advice, one that I frequently give to students as well: start small. Instead of 12 GB, train and test on small subsets of the data. That saves you a ton of time while you figure out the exact implementation and iron out bugs, particularly if you happen to be unsure about how these things work.
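Here is a minimal sketch of the chunked approach described above, written against the current quanteda API (tokens() / tokens_ngrams() / dfm()) rather than the 0.9.x calls shown in the question; the file name and chunk size are placeholders:
library(quanteda)
infile <- "corpus.txt"   # hypothetical file with one document per line
chunk_size <- 100000     # lines per chunk; tune to the available RAM
con <- file(infile, open = "r")
i <- 0
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break
  i <- i + 1
  toks <- tokens(lines, remove_punct = TRUE, remove_numbers = TRUE)
  toks <- tokens_remove(toks, stopwords("english"))
  bigrams <- tokens_ngrams(toks, n = 2, concatenator = " ")
  saveRDS(dfm(bigrams), sprintf("dfm_chunk_%03d.rds", i))
}
close(con)
# Later, the per-chunk dfms can be read back and combined, e.g.
# do.call(rbind, lapply(list.files(pattern = "^dfm_chunk_.*\\.rds$"), readRDS))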
Probably too late now, but I had a very similar problem recently (n-grams, R, quanteda and a large text source). I searched for two days and could not find a satisfactory solution; I posted on this forum and others and didn't get an answer. I knew I had to chunk the data and combine the results at the end, but couldn't work out how to do the chunking. In the end I found a somewhat inelegant solution that worked, and answered my own question in the post linked here.
I sliced up the corpus using the 'tm' package's VCorpus, then fed the chunks to quanteda using the corpus() function.
I thought I would post it since I provide the code solution there. Hopefully it will prevent others from spending two days searching.

using -knitr- to weave Rnw files in RStudio

This seems to be a recurrent problem for anyone trying to write dynamic documents with knitr in RStudio (see also here, for instance).
Unfortunately I haven't found a solution on Stack Overflow or by googling more generally.
Here is a toy example I am trying to compile in RStudio. It is the minimal-example-002.Rnw (link):
\documentclass{article}
\usepackage[T1]{fontenc}
\begin{document}
Here is a code chunk.
<<foo, fig.height=4>>=
1+1
letters
chartr('xie', 'XIE', c('xie yihui', 'Yihui Xie'))
par(mar=c(4, 4, .2, .2)); plot(rnorm(100))
@
You can also write inline expressions, e.g. $\pi=\Sexpr{pi}$, and \Sexpr{1.598673e8} is a big number.
\end{document}
My problem is that I am not able to compile the PDF in RStudio using knitr, while by changing the default weaving option to Sweave I do get the final PDF.
More specifically, I work on Windows 7 with the latest RStudio version (0.98.1103); I weave the file using the knitr option and I have disabled the "Always enable Rnw concordance" box.
Did this happen to you?
Any help would be highly appreciated, thank you very much.
EDIT
Apparently it is not an RStudio problem, as I tried to compile the document from R with:
library('knitr')
knit('minimal_ex.Rnw')
and I get the same error:
processing file: minimal_ex.Rnw
|
| | 0%
|
|...................... | 33%
ordinary text without R code
|
|........................................... | 67%
label: foo (with options)
List of 1
$ fig.height: num 4
Quitting from lines 8-10 (minimal_ex.Rnw)
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 3, 0
In addition: Warning messages:
1: In is.na(res[, 1]) :
is.na() applied to non-(list or vector) of type 'NULL'
2: In is.na(res) : is.na() applied to non-(list or vector) of type 'NULL'
EDIT 2:
This is my session info:
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=Italian_Italy.1252 LC_CTYPE=Italian_Italy.1252 LC_MONETARY=Italian_Italy.1252 LC_NUMERIC=C
[5] LC_TIME=Italian_Italy.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] knitr_1.10.5
loaded via a namespace (and not attached):
[1] tools_3.1.1
After spending hours trying to figure out the problem, I updated R (to v3.2.0) and everything works fine now.
It is not clear whether the problem was due to some package conflict; it certainly wasn't an RStudio problem (as I had initially thought).
To add a little to this: it seems to be a bug with the echo parameter, which defaults to TRUE. Setting it to FALSE with knitr and pdfLaTeX as the renderer worked for me. In case you can't update because of dependencies and/or rights issues, this might be a helpful ad hoc fix, since the error message is pretty useless.
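For reference, a sketch of what that looks like, either per chunk or globally in a setup chunk (the chunk names here are only placeholders):
<<myplot, fig.height=4, echo=FALSE>>=
plot(rnorm(100))
@
<<setup, include=FALSE>>=
opts_chunk$set(echo = FALSE)
@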

R read.csv didn't load all rows of .tsv file

A little mystery. I have a .tsv file that contains 58936 rows. I loaded the file into R using this command:
dat <- read.csv("weekly_devdata.tsv", header=FALSE, stringsAsFactors=TRUE, sep="\t")
but nrow(dat) only shows this:
> nrow(dat)
[1] 28485
So I used sed -n to write the rows around where it stopped (the rows before, including, and after that row) to a new file, and I was able to load that file into R, so I don't think there is any corruption in the file.
Is it an environment issue?
Here's my sessionInfo()
> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] tcltk stats graphics grDevices utils datasets methods base
other attached packages:
[1] sqldf_0.4-10 RSQLite_1.0.0 DBI_0.3.1 gsubfn_0.6-6 proto_0.3-10 scales_0.2.4 plotrix_3.5-11
[8] reshape2_1.4.1 dplyr_0.4.1
loaded via a namespace (and not attached):
[1] assertthat_0.1 chron_2.3-45 colorspace_1.2-4 lazyeval_0.1.10 magrittr_1.5 munsell_0.4.2
[7] parallel_3.1.2 plyr_1.8.1 Rcpp_0.11.4 rpart_4.1-8 stringr_0.6.2 tools_3.1.2
Did I run out of memory? Is that why it didn't finish loading?
I had a similar problem lately, and it turned out I had two different problems.
1 - Not all rows had the right number of tabs. I ended up counting them using awk
2 - At some points in the file I had quotes that were not closed. This was causing it to skip over all the lines until it found a closing quote.
I will dig up the awk code I used to investigate and fix these issues and post it.
Since I am using Windows, I used the awk that came with git bash.
This counted the number of tabs in a line and printed out those lines that did not have the right number.
awk -F "\t" 'NF!=6 { print NF-1 ":" $0 } ' Catalog01.csv
I used something similar to count quotes, and I used tr to fix a lot of it.
Pretty sure this was not a memory issue. If the problem is unmatched quotes then try this:
t <- read.csv("weekly_devdata.tsv", header=FALSE, stringsAsFactors=TRUE, sep="\t",
              quote="")
There is also the very useful function count.fields, which I use inside table() to get a high-level view of the consequences of various parameter settings. Take a look at the results of:
table( count.fields( "weekly_devdata.tsv", sep="\t"))
And compare to:
table( count.fields( "weekly_devdata.tsv", sep="\t", quote=""))
It's sometimes necessary to read the file in with readLines, remove one or more lines (assigning the result to, say, clean), and then send the cleaned-up lines to read.table(text=clean, sep="\t", quote="").
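A minimal sketch of that readLines route, assuming the bad rows can be identified by their field count (the expected count of 7 fields is only a placeholder; use whatever table(count.fields(...)) reports as the correct value):
raw <- readLines("weekly_devdata.tsv")
# count tab-separated fields per line, with quoting disabled
n_fields <- count.fields(textConnection(raw), sep = "\t", quote = "",
                         blank.lines.skip = FALSE)
# keep only the lines with the expected number of fields
clean <- raw[n_fields == 7]
dat <- read.table(text = clean, sep = "\t", quote = "", stringsAsFactors = TRUE)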
It could well be some illegal characters in some of the entries... Check how many rows do load and where the issue is occurring, then delve deeper into that row of the raw data. Webdings font, that kind of stuff!

knitr updated from 1.2 to 1.4 error: Quitting from lines

I recently updated knitr to 1.4, and since then my .Rnw files don't compile.
The document is rich (7 chapters, included with child="").
Now, with the recent knitr version, I get an error message:
Quitting from lines 131-792 (/DATEN/anna/tex/CoSta/chapter1.Rnw)
Quitting from lines 817-826 (/DATEN/anna/tex/CoSta/chapter1.Rnw)
Error in if (eval) { :
argument is not interpretable as logical
(the last two lines mean that knitr is looking for a logical value and cannot find one).
At those lines (131 and 817) two figures end. Compiling these snippets separately works.
I have no idea how to resolve this problem.
Thanks in advance for any hints that help resolve my issue.
Here is the sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C
[3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8
[5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] tools stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] knitr_1.4
loaded via a namespace (and not attached):
[1] compiler_2.15.1 digest_0.6.3 evaluate_0.4.7 formatR_0.9
[5] stringr_0.6.2 tcltk_2.15.1
Following Yihui's suggestions, I ran each chapter separately with
knit("chapter1.Rnw")
and so on. No error message occurs, and separate .tex files are created. To provide more information, I display part of the code below.
There is a main document in which several options are set:
<<options-setting,echo=FALSE>>=
showthis <- FALSE
evalthis <- FALSE
evalchapter <- TRUE
opts_chunk$set(comment=NA, fig.width=6, fig.height=4)
@
Each chapter is then included via a child chunk; e.g., chapter 1 is called from
<<child-chapter1, child='chapter1.Rnw', eval=evalchapter>>=
@
The error message which appears when knitting the main Rnw file was given above.
The related figure environment is as follows:
\begin{figure}[ht]
\centering
<<wuerfel-simulation,echo=showthis,fig.height=5>>=
data.sample6 <- sample(1:6,repl=TRUE,100)
table(data.sample6)
barplot(table(data.sample6)/100,col=5,main="Haeufigkeiten beim Wuerfeln")
@
\caption{Visualisierung beim W"urfeln. 100 Versuche.}
\label{fig:muent-vis}
\end{figure}
This is not very advanced, but the error is still as it was given before.
The "Quitting from lines" message concerns a long stretch of text, from line 131 (end of the first chunk) to line 792 (beginning of the follow-up chunk), which is
<<zeiten, echo=showthis, eval=evalthis>>=
zeiten <- c(17,16,20,24,22,15,21,15,17,22)
max(zeiten)
mean(zeiten)
zeiten[4] <- 18; zeiten
mean(zeiten)
sum(zeiten > 20)
@
Is there a problem with correctly closing a chunk?
I have now located the error, and I provide a short piece of code with a reproducible error message. It concerns conditional evaluation of child documents involving \Sexpr:
The main file is the following
\documentclass{article}
\begin{document}
<<options-setting,echo=FALSE>>=
evalchapter <- TRUE
@
<<test,child="test-child.Rnw", eval=evalchapter>>=
@
\end{document}
The related child file 'test-child.Rnw' is
<<no-sexpr>>=
t <- 2:4
@
text \Sexpr{(t <- 2:4)}
Knitting this as is gives the error message from above. If I remove the \Sexpr in the child, everything works nicely.
But everything also works nicely if I remove the conditioning in the call to the child file, i.e., without eval=evalchapter.
Since I use \Sexpr quite often, I would like a solution to this problem. As I mentioned earlier, there were no problems up to knitr version 1.2.
This is related to a change in knitr 1.3 and mentioned in the NEWS:
added an argument options to knit_child() to set global chunk options for child documents; if a parent chunk calls a child document (via the child option), the chunk options of the parent chunk will be used as global options for the child document, e.g. for <<foo, child='bar.Rnw', fig.path='figure/foo-'>>=, the figure path prefix will be figure/foo- in bar.Rnw; see How to avoid figure filenames in child calls for an application
And this caused a bug for inline R code. In your case, the chunk option eval=evalchapter was not evaluated when it was used for evaluating the inline code. I have fixed the bug in the development version v1.4.5 on GitHub.
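For anyone who hit this before the fix reached CRAN, a sketch of installing the development version from GitHub (assuming the devtools package is available):
# install the development version of knitr containing the fix
devtools::install_github("yihui/knitr")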
