Text mining with tm in R antiword error - r

So I'm rather new to R, and I'm learning how to mine text from this handy website: https://eight2late.wordpress.com/2015/05/27/a-gentle-introduction-to-text-mining-using-r/
I do have my own text set of .doc, .docx, and .xlsx files and I'm trying to mine them. They're located in a folder in my working directory called 'files', but I have already encountered an error after simply writing a few lines of code.
The code I have so far is:
library(tm)
library(readtext)
data = readtext('files')
At this point, after waiting for 25 seconds or so, I get the error:
Error: System call to 'antiword' failed (1): The Big Block Depot is damaged
and the code stops running there.
I have tried searching online for solutions but it seems like a fairly rare error and so I only found 1 possible solution at https://github.com/ropensci/antiword/issues/1 but that did not work for me.
That thread suggested that one of my files was corrupt, and recommended running
fixInNamespace(antiword, pos="package:antiword")
to change the error to a warning to not interrupt the reading of the files. I tried that, and at first it raised the error of
Error in as.environment(pos):
no item called "package:antiword" on the search list
After that, I loaded the antiword package with library(antiword) and changed the stop( to a warning(. However, when I ran the data = readtext('files') line again, it immediately raised the error
Error in is_windows() : could not find function "is_windows"
I'm at a loss here! Any help would be appreciated. Should I be using another package in this case?

I had the same problem when I tried to read a .doc file into R, also using the readtext library. What helped me was converting the Word documents from .doc to .docx. When I ran the same code after converting them, it worked.
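To find out which files need converting, it can help to read the folder one file at a time and record the failures instead of stopping at the first one. The following is a minimal sketch, not part of readtext: read_each_safely is a hypothetical helper, and the commented usage assumes the same 'files' folder as in the question.

```r
# Hypothetical helper: apply `reader` to each path and collect failures
# instead of letting the first error abort the whole run.
read_each_safely <- function(paths, reader) {
  results <- lapply(paths, function(p) {
    tryCatch(reader(p), error = function(e) e)
  })
  names(results) <- paths
  failed <- vapply(results, inherits, logical(1), what = "error")
  list(ok = results[!failed], failed = names(results)[failed])
}

# Usage (assumes the readtext package is installed and the folder is 'files'):
# paths <- list.files("files", full.names = TRUE)
# out <- read_each_safely(paths, readtext::readtext)
# out$failed   # the files that could not be read, e.g. damaged .doc files
```

The paths listed in out$failed are the candidates to open in Word and re-save as .docx.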

Related

Rendering a Quarto blog post trips an error when reading in a brms file object

First, I'll apologize for not having a fuller reproducible example, but I'm not entirely sure how to go about that given the various layers to the question/problem.
I'm moving a blog over from Blogdown to a new Quarto-based website and blog. I have three saved brms object files that I'm trying to read into a code chunk in one of the posts. The code chunks work fine when I run them manually, but when I try to render the blog post I get the following error:
Quitting from lines 75-86 (tables-modelsummary-brms.qmd)
Error in stri_replace_all_charclass(str, "[\\u0020\\r\\n\\t]", " ", merge = TRUE) :
invalid UTF-8 byte sequence detected; try calling stri_enc_toutf8()
Calls: .main ... stri_trim -> stri_trim_both -> stri_replace_all_charclass
Execution halted
I've checked the primary data frame contained in the brms model object, and all of the character vectors there are valid UTF-8. These model objects can be quite large, so it's possible I'm missing something buried deep within the object, but so far nothing is apparent.
I tried re-running the models again to ensure that the model objects' files weren't corrupted, and also to make sure that the encoding issue wasn't somehow introduced the last time they were run, which would have been on a Windows machine and a different version of brms.
I've also moved the brms files around to different directories to see if it's a file path issue. The same error comes up regardless of whether the files are in the same folder as the blog post qmd file or in a parent directory file I use for storing site data.
I've also migrated several other posts to the new Quarto site successfully, and some of them also contain R code, but it's all rendering without a problem.
Finally, I don't quite understand how to implement the suggested alternate function found in the error message either.
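One way to search the whole object rather than only the main data frame is to walk every nested character vector and test it with base R's validUTF8(). This is a rough sketch on a toy list, not on a real brms object: find_invalid_utf8 is a hypothetical helper, and the Latin-1 source encoding passed to iconv() is an assumption that would need checking against the real data. The stri_enc_toutf8() function named in the error message comes from the stringi package and is not used here.

```r
# Hypothetical helper: count invalid UTF-8 strings in every character vector
# nested anywhere inside a list-like object.
find_invalid_utf8 <- function(x) {
  bad <- rapply(
    x,
    function(v) sum(!validUTF8(v)),
    classes = "character",
    how = "unlist"
  )
  bad[bad > 0]
}

# Toy stand-in for a model object; "\xe9" is a Latin-1 byte, not valid UTF-8.
obj <- list(data = list(name = c("ok", "caf\xe9")), formula = "y ~ x")
find_invalid_utf8(obj)

# Re-encode the offending vector, ASSUMING its bytes are Latin-1.
fixed <- iconv(obj$data$name, from = "latin1", to = "UTF-8")
all(validUTF8(fixed))
```

The names of the returned vector give the path to each offending element, which narrows down where in the object the bad bytes live.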

Missing command in an R package

So to get to the point: I need to use an R package called machuruku. To get familiar with the package I used the dataset provided in the original paper (https://academic.oup.com/sysbio/article/70/5/1033/6171196). While trying to run the code for the simulation I get an error message saying that the command "machu.simulation" doesn't exist. Any of you have any idea why that's happening? Am I missing a package?
I downloaded the dataset zip file, dove into the second nested zip file Guillory_and_Brown_simulation-validation.zip, then into its file code_simulation-validation.R, and noticed that this source file uses machu.simulation several times before defining the function starting in line 519.
Suggestions:
Grab lines 519 through the end, save into a different file, source that new file, then try to run the code in the beginning of the file again.
Complain (not quietly?) to the authors. The fact that they think this is reproducible means they might have missed something else, too.
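The first suggestion can also be done without copying lines into a new file. This is a minimal sketch: source_from_line is a hypothetical helper, and it assumes that everything from line 519 to the end of the script is safe to evaluate on its own.

```r
# Hypothetical helper: evaluate only lines `from`..end of an R script,
# so that functions defined late in the file exist before they are used.
source_from_line <- function(path, from, envir = globalenv()) {
  lines <- readLines(path, warn = FALSE)
  exprs <- parse(text = lines[from:length(lines)], keep.source = FALSE)
  eval(exprs, envir = envir)
  invisible(NULL)
}

# Usage (file name and line number taken from the answer above):
# source_from_line("code_simulation-validation.R", from = 519)
```

After that call, the code at the top of the file should be able to find machu.simulation.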

BiocParallel error: cannot open the connection, how do I fix it?

I'm trying to use the package bambu to quantify gene counts from bam files. I am using my university's HPC, so I have written an R script and a batch submission file to launch it.
When the script gets to the point of running the bambu function, it gives the following error:
Start generating read class files
| | 0%[W::hts_idx_load2] The index file is older than the data file: ./results/minimap2/KD_R1.sorted.bam.bai
[W::hts_idx_load2] The index file is older than the data file: ./results/minimap2/KD_R3.sorted.bam.bai
[W::hts_idx_load2] The index file is older than the data file: ./results/minimap2/WT_R1.sorted.bam.bai
[W::hts_idx_load2] The index file is older than the data file: ./results/minimap2/WT_R2.sorted.bam.bai
|================== | 25%
Error: BiocParallel errors
element index: 1, 2, 3
first error: cannot open the connection
In addition: Warning message:
stop worker failed:
attempt to select less than one element in OneIndex
Execution halted
So it looks like BiocParallel isn't happy and cannot open a certain connection, but I'm not sure how to fix this?
This is my R script:
#Bambu R script
#load libraries
library(Rsamtools)
library(bambu)
#Creating files
bamFiles<- Rsamtools::BamFileList(c("./results/minimap2/KD_R1.sorted.bam","./results/minimap2/KD_R2.sorted.bam","./results/minimap2/KD_R3.sorted.bam","./results/minimap2/WT_R1.sorted.bam","./results/minimap2/WT_R2.sorted.bam","./results/minimap2/WT_R3.sorted.bam"))
annotation<-prepareAnnotations("./ref_data/Homo_sapiens.GRCh38.104.chr.gtf")
fa.file<-"./ref_data/Homo_sapiens.GRCh38.dna.primary_assembly.fa"
#Running bambu
se<- bambu(reads=bamFiles, annotations=annotation, genome=fa.file,ncore=4)
se
seGene<- transcriptToGeneExpression(se)
#Saving files
save.file<-tempfile(fileext=".gtf")
writeToGTF(rowRanges(se),file=save.file)
save.dir <- tempdir()
writeBambuOutput(se,path=save.dir,prefix="Nanopore_")
writeBambuOutput(seGene,path=save.dir,prefix="Nanopore_")
If you have any ideas on why this happens it would be so helpful! Thank you
I think that @Chris has a good point. Under the hood it seems likely that bambu is running htslib, based on those warnings. While they may indeed only be warnings, I would like to know what the results would look like if you ran this interactively.
This question is hard to answer right now as it's missing some information (what do the files look like, a minimal reproducible example, etc.). But in the meantime here are some possibly useful questions for figuring it out:
what does bamFiles look like? Does it have the right number of read records? Do all of those files have nonzero read records? Are any suspiciously small?
What are the timestamps on the bai vs bam files (e.g. ls -lh ./results/minimap2/)? Are they about what you'd expect or is it wonky? Are any of them (say, ./results/minimap2/WT_R2.sorted.bam.bai) weirdly small?
What happens when you run it interactively? Where does it fail? You say it's at the bambu() call, but how do you know that?
What happens when you run bambu() with ncore=1?
It seems very likely that this is due to a problem with the files, and it is only at the BiocParallel step that the error is bubbling up to the top. Many utilities have an annoying habit of being happy to accept an empty file, only to fail confusingly without informative error messages when asked to do something with the empty file.
You might also consider raising an issue with the developers.
(why the warning is only possibly a problem: The index file sometimes has a timestamp like that for very small alignment files which are generated and indexed programmatically, where the indexing step is near-instantaneous.)
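The file checks above can be scripted in base R before bambu is ever called. This is a minimal sketch: check_bam_index is a hypothetical helper, and it assumes each index sits next to its BAM file with ".bai" appended, as in the warnings shown in the question.

```r
# Hypothetical helper: report size and timestamp problems for BAM files
# and their .bai indexes (assumed to be named <bam>.bai).
check_bam_index <- function(bam_paths) {
  bai_paths <- paste0(bam_paths, ".bai")
  bam <- file.info(bam_paths)
  bai <- file.info(bai_paths)
  data.frame(
    bam          = basename(bam_paths),
    bam_bytes    = bam$size,                # 0 or NA points to an empty or missing file
    bai_bytes    = bai$size,
    index_exists = file.exists(bai_paths),
    index_older  = bai$mtime < bam$mtime,   # TRUE matches the htslib warning
    stringsAsFactors = FALSE
  )
}

# Usage with the paths from the question:
# check_bam_index(c("./results/minimap2/KD_R1.sorted.bam",
#                   "./results/minimap2/KD_R2.sorted.bam"))
```

Rows with a missing index, a zero-byte file, or index_older set to TRUE are the ones worth re-indexing or regenerating before rerunning the job.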

How to solve this error message in rmarkdown?

I am just starting to explore the rmarkdown package. I don't use Rstudio. I use the default R environment. What I did was as follows.
I created a new R document.
Started typing few lines in rmarkdown format.
Saved the file with Rmd extension.
I saved the file in the working directory.
I installed the pandoc using the pkg file.
I installed 'rmarkdown' package. Loaded the package.
Used the following command to render the Rmd file.
rmarkdown::render("Untitled.Rmd")
I get the following error.
Error in tools::file_path_as_absolute(input) : file 'Untitled.Rmd'
does not exist
I tried everything I could think of, such as giving the exact path instead of the filename, but nothing worked. I googled the error message and found nobody with a similar error. Can someone help me with this? What am I missing? What does the error message mean?
Most of the time, a "file not found" error is caused either by a typo or by a file that really is missing (as in your case, where the real file is named differently).
To rule out those possibilities:
Copy the fullpath from your filebrowser.
Make sure the file exists, inside R you could type:
file.exists("/fullpath/to/file")
If that returns TRUE and the error persists, then something else is going on.
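If file.exists() returns FALSE, listing what is actually in the directory usually shows the real name. This is a minimal sketch: find_candidates is a hypothetical helper, and the idea that the editor added a second extension when saving is only a guess.

```r
# Hypothetical helper: list files in `dir` whose names start with the same
# stem as `name`, ignoring case, to expose names such as "Untitled.Rmd.R".
find_candidates <- function(name, dir = getwd()) {
  stem <- tools::file_path_sans_ext(name)
  files <- list.files(dir)
  files[startsWith(tolower(files), tolower(stem))]
}

# Usage:
# getwd()                          # confirm which directory R is looking in
# find_candidates("Untitled.Rmd")  # shows what the file is really called
```

Whatever name comes back is the one to pass to rmarkdown::render(), or the file can be renamed so that it ends in .Rmd.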

Slow or stacking file.choose() in R

When I have a lot of data loaded in R, I have trouble choosing a new file via file.choose() and then loading it via read.csv(). I never get as far as read.csv(), because file.choose() hangs, R crashes, and it reports something like "an unidentified error occurred and R must restart".
I'm using RStudio and running this on Windows 7. The hardware is up to date.
Could someone explain why this is happening and what the remedy would be? Are there other options for selecting a file? I know I can put the path right into the read.csv command, but the file is different every time.
EDIT:
The error just happened again. I cannot reproduce it on demand; it only occurs, with high likelihood, when the conditions for it are met.
The error reads: "R Session Aborted. R encountered fatal error. The session was terminated." The window also shows "Start New Session".
EDIT 2:
Let me rephrase my question: is there another option, such as a command or package, for choosing a file instead of file.choose()?
The error cannot be reproduced, so I can't expect anyone to give a well-founded comment on it. But if someone has run into this in the past and solved it, I would like to hear about it. Thanks.
EDIT 3: Further to the error, I have just spotted this sentence in red in the Console: Error: Unable to provide connection with R
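Other interactive dialogs exist, for example tcltk::tk_choose.files() and rstudioapi::selectFile(), though whether they avoid this crash is unknown. Since the file changes every time, one way to skip the dialog entirely is to pick the most recently modified file in a known folder. This is a minimal sketch: newest_file is a hypothetical helper, and the folder name in the usage comment is an assumption.

```r
# Hypothetical helper: return the most recently modified file in `dir`
# whose name matches `pattern`, as a dialog-free alternative to file.choose().
newest_file <- function(dir, pattern = "\\.csv$") {
  files <- list.files(dir, pattern = pattern, full.names = TRUE)
  if (length(files) == 0) stop("no matching files in ", dir)
  files[which.max(as.numeric(file.info(files)$mtime))]
}

# Usage (assumes new files are dropped into a folder called "incoming"):
# dat <- read.csv(newest_file("incoming"))
```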
