Use of fread() from data.table causes R session to abort

I am working on a project for a MOOC, and was tinkering around with the data.table package in RStudio. Use of the fread() function to import the data files initially worked fine:
fread("UCI HAR Dataset/features.txt")->features
fread("UCI HAR Dataset/test/y_test.txt")->ytest
However, when I tried to run the following line of code, I received a pop-up that said "R Session Aborted: R encountered a fatal error. The session was terminated."
fread("UCI HAR Dataset/test/X_test.txt")->xtest
I don't understand what the problem is. I checked the file names and paths to make sure I had correctly spelled and capitalized everything, and it all checks out. The equivalent code using read.table() works fine and does not cause R to abort. I also tried renaming the file to "x_test.txt", but the same issue occurred.
According to ?fread, the function will only work with "regular delimited files." As far as I can tell, the file is a "regular delimited file", in that all rows have the same number of columns. There are no cells containing "NA" when I use read.table() instead; I checked using anyNA(). Is there a quick way to determine whether a file is "regularly" delimited or not? Is there something else about the original file that could be causing the problem?
UPDATE
After further research and searching through the reported issues listed on the developer's GitHub, I think that my problem lies in having two white spaces at the beginning of each row, which is discussed here. I am unsure why R aborted instead of giving me a warning. The latest development version of data.table (1.9.5) isn't causing the session to abort under the same conditions, though.

Although I do believe you should have contacted the package maintainer first for any situation where the R session was aborted (and it was not due to your mucking with C code), I can offer a strategy for your last request. It is not really specific to fread(), but I've found it useful with the regular read.*() functions. I'm going to assume that this is a comma-separated file, but if it's whitespace-separated you could change sep="," to sep="".
filcnts <- count.fields("UCI HAR Dataset/test/X_test.txt", sep=",")
table(filcnts)
That should be a single-item table. If not, try switching parameters such as quote, sep, blank.lines.skip, or comment.char.
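To make the count.fields() check concrete, here is a minimal sketch on a throwaway file. The file contents are invented for illustration; the leading double space only mimics the layout mentioned in the question's update, and the third row is deliberately short so that the table has more than one entry.

```r
# Build a small whitespace-delimited file (made-up data, not the real X_test.txt).
tmp <- tempfile(fileext = ".txt")
writeLines(c("  1.5 2.5 3.5", "  4.5 5.5 6.5", "  7.5 8.5"), tmp)

# sep = "" means "any run of white space", so leading blanks are skipped.
field_counts <- count.fields(tmp, sep = "")
count_table <- table(field_counts)
print(count_table)

# Line numbers whose field count differs from the most common count.
usual <- as.integer(names(which.max(count_table)))
irregular_rows <- which(field_counts != usual)
print(irregular_rows)
```

If irregular_rows is empty, every row has the same number of fields, which is at least a necessary condition for the file to be "regular".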

Related

BiocParallel error: cannot open the connection, how do I fix it?

I'm trying to use the package bambu to quantify gene counts from bam files. I am using my university's HPC, so I have written an R script and a batch submission file to launch it.
When the script gets to the point of running the bambu function, it gives the following error:
Start generating read class files
| | 0%[W::hts_idx_load2] The index file is older than the data file: ./results/minimap2/KD_R1.sorted.bam.bai
[W::hts_idx_load2] The index file is older than the data file: ./results/minimap2/KD_R3.sorted.bam.bai
[W::hts_idx_load2] The index file is older than the data file: ./results/minimap2/WT_R1.sorted.bam.bai
[W::hts_idx_load2] The index file is older than the data file: ./results/minimap2/WT_R2.sorted.bam.bai
|================== | 25%
Error: BiocParallel errors
element index: 1, 2, 3
first error: cannot open the connection
In addition: Warning message:
stop worker failed:
attempt to select less than one element in OneIndex
Execution halted
So it looks like BiocParallel isn't happy and cannot open a certain connection, but I'm not sure how to fix this?
This is my R script:
#Bambu R script
#load libraries
library(Rsamtools)
library(bambu)
#Creating files
bamFiles<- Rsamtools::BamFileList(c("./results/minimap2/KD_R1.sorted.bam","./results/minimap2/KD_R2.sorted.bam","./results/minimap2/KD_R3.sorted.bam","./results/minimap2/WT_R1.sorted.bam","./results/minimap2/WT_R2.sorted.bam","./results/minimap2/WT_R3.sorted.bam"))
annotation<-prepareAnnotations("./ref_data/Homo_sapiens.GRCh38.104.chr.gtf")
fa.file<-"./ref_data/Homo_sapiens.GRCh38.dna.primary_assembly.fa"
#Running bambu
se<- bambu(reads=bamFiles, annotations=annotation, genome=fa.file,ncore=4)
se
seGene<- transcriptToGeneExpression(se)
#Saving files
save.file<-tempfile(fileext=".gtf")
writeToGTF(rowRanges(se),file=save.file)
save.dir <- tempdir()
writeBambuOutput(se,path=save.dir,prefix="Nanopore_")
writeBambuOutput(seGene,path=save.dir,prefix="Nanopore_")
If you have any ideas on why this happens it would be so helpful! Thank you
I think that @Chris has a good point. Under the hood it seems likely that bambu is running htslib, based on those warnings. While they may indeed only be warnings, I would like to know what the results would look like if you ran this interactively.
This question is hard to answer right now as it's missing some information (what do the files look like, a minimal reproducible example, etc.). But in the meantime here are some possibly useful questions for figuring it out:
What does bamFiles look like? Does it have the right number of read records? Do all of those files have nonzero read records? Are any suspiciously small?
What are the timestamps on the .bai vs .bam files (e.g. ls -lh ./results/minimap2/)? Are they about what you'd expect or is it wonky? Are any of them (say, ./results/minimap2/WT_R2.sorted.bam.bai) weirdly small?
What happens when you run it interactively? Where does it fail? You say it's at the bambu() call, but how do you know that?
What happens when you run bambu() with ncore=1?
It seems very likely that this is due to a problem with the files, and it is only at the BiocParallel step that the error is bubbling up to the top. Many utilities have an annoying habit of being happy to accept an empty file, only to fail confusingly without informative error messages when asked to do something with the empty file.
You might also consider raising an issue with the developers.
(why the warning is only possibly a problem: The index file sometimes has a timestamp like that for very small alignment files which are generated and indexed programmatically, where the indexing step is near-instantaneous.)
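The file checks suggested above (existence, size, index older than data) can also be done from within R. The following is a rough sketch using only base R; check_bam_files() is a hypothetical helper name, and the demo runs on placeholder files in a temporary directory rather than on real alignments.

```r
# Hypothetical helper: summarise each BAM file and its .bai index with base R only.
check_bam_files <- function(bam_paths) {
  bai_paths <- paste0(bam_paths, ".bai")
  bam_info <- file.info(bam_paths)
  bai_info <- file.info(bai_paths)
  data.frame(
    bam = basename(bam_paths),
    bam_exists = file.exists(bam_paths),
    bam_bytes = bam_info$size,
    bai_exists = file.exists(bai_paths),
    bai_bytes = bai_info$size,
    # TRUE when the index was last modified before the data file.
    index_older = bai_info$mtime < bam_info$mtime,
    stringsAsFactors = FALSE
  )
}

# Demo on placeholder files (these are not real BAM files).
demo_dir <- tempfile("bamcheck")
dir.create(demo_dir)
bam <- file.path(demo_dir, c("A.sorted.bam", "B.sorted.bam"))
writeLines("placeholder", bam[1])
writeLines("placeholder", paste0(bam[1], ".bai"))
file.create(bam[2])  # empty data file with no index
# Pretend the first index was written an hour before its data file.
Sys.setFileTime(paste0(bam[1], ".bai"), Sys.time() - 3600)

report <- check_bam_files(bam)
print(report)
```

On the real data, the same function could be called with the six paths passed to BamFileList(); rows with zero bytes, a missing index, or index_older set to TRUE would be the first ones to inspect.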

Unable to import previously working SAS-formats files using R-package 'haven'

Around a year ago, I used the 'haven'-package to import two .sas7bdat files along with their respective .sas7bcat formats and it worked wonderfully.
For some reason, however, it does not any longer even though all the SAS-files incl. format files have remained unchanged since then.
When I try running the code now, R gives me the following error:
Error in df_parse_sas_file(spec_data, spec_cat, encoding = encoding,
catalog_encoding = catalog_encoding, : Failed to parse P:/SAS
files/formats.sas7bcat: Invalid file, or file has unsupported features.
R and the 'haven'-package have been reinstalled to their newest versions since the first time when it worked, so I imagine that this might be the reason, since all the SAS-files and the code remain unchanged.
For this reason, I tried to reinstall the old version of 'haven' but cannot since this apparently requires a manual installation of 'Rtools' which is not allowed on my computer, so I am a bit stuck here.
Any suggestions will be greatly appreciated, thanks.
A potential workaround is the sas7bdat package, which can also read SAS files. I don't know how much extra work this might involve for you, though.
You can read in a dataset with the code
library(sas7bdat)
read.sas7bdat("filename.sas7bdat")

Text mining with tm in R antiword error

So I'm rather new to R, and I'm learning how to mine text from this handy website: https://eight2late.wordpress.com/2015/05/27/a-gentle-introduction-to-text-mining-using-r/
I do have my own text set of .doc, .docx, and .xlsx files and I'm trying to mine them. They're located in a folder in my working directory called 'files', but I have already encountered an error after simply writing a few lines of code.
The code I have so far is:
library(tm)
library(readtext)
data = readtext('files')
At this point, after waiting for 25 seconds or so, I get the error:
Error: System call to 'antiword' failed (1): The Big Block Depot is damaged
and the code stops running there.
I have tried searching online for solutions but it seems like a fairly rare error and so I only found 1 possible solution at https://github.com/ropensci/antiword/issues/1 but that did not work for me.
This solution suggested that one of my files was corrupt, and recommended using the code
fixInNamespace(antiword, pos="package:antiword")
to change the error to a warning to not interrupt the reading of the files. I tried that, and at first it raised the error of
Error in as.environment(pos):
no item called "package:antiword" on the search list
After which, I loaded the antiword library with a library(antiword) and changed the stop( to a warning(. However, when I ran the data = readtext('files') line again, it immediately raised the error
Error in is_windows() : could not find function "is_windows"
I'm at a loss here! Any help would be appreciated. Should I be using another package in this case?
I had the same problem with my code, where I tried to get a .doc file into R. I also used the readtext library. What helped me was converting the Word documents I was trying to get into R from .doc to .docx. When I ran the same code afterwards, it worked.
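If it is unclear which document is the problematic one, reading the files one at a time and catching errors can narrow it down. This is only a sketch: read_each() is a hypothetical helper, and the demo uses a stand-in reader that fails on purpose so that it runs without readtext or antiword installed. In real use, the reader argument would presumably be readtext::readtext.

```r
# Hypothetical helper: apply `reader` to each path and record which paths fail.
read_each <- function(paths, reader) {
  results <- list()
  failures <- character(0)
  for (p in paths) {
    out <- tryCatch(reader(p), error = function(e) {
      message("Skipping ", basename(p), ": ", conditionMessage(e))
      NULL
    })
    if (is.null(out)) {
      failures <- c(failures, p)
    } else {
      results[[p]] <- out
    }
  }
  list(results = results, failures = failures)
}

# Demo with plain text files and a stand-in reader that rejects one of them.
demo_dir <- tempfile("docs")
dir.create(demo_dir)
paths <- file.path(demo_dir, c("good.txt", "bad.txt"))
for (p in paths) writeLines("some text", p)

fake_reader <- function(path) {
  if (grepl("bad", basename(path))) stop("simulated read failure")
  readLines(path)
}

outcome <- read_each(paths, fake_reader)
print(basename(outcome$failures))
```

The files listed in failures are then the candidates for converting from .doc to .docx, as described above.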

gsub error message when addressing column in dataframe in RStudio

For the past couple of days I have been getting the following error message in RStudio from time to time, and I can't figure out what is causing it.
When I type the name of a data.frame in the console followed by $ to address a specific column (for example df$SomeVariable), the following message is shown in the console window and is printed over and over with every letter I type:
Error in gsub(reStrip, "", completions, perl = TRUE) :
input string 38 is invalid UTF-8
The error message doesn't have any real effect. Everything works just fine except the automatic completion of the variable name.
I'm using R version 3.4.4 and RStudio Version 1.0.143 on a Windows computer. In the R script I am currently working on I don't use gsub or any other "string" or regular expression function for that matter. The issue appeared with various data.frames and various types of variables in the data.frames (numeric, integer, date, factor, etc.). It also happens with various packages. Currently, I am using combinations of the packages readr, dplyr, plm, lfe, readstata13, infuser, and RPostgres. The issue disappears for a while after closing RStudio and opening it again but re-appears after working for a while.
Does anyone have an idea what may cause this and how to fix it?
I had the same problem a few days ago. I did some research and found that you can change the encoding when you import the dataset. Change the encoding to "latin1" and maybe that will fix your problem. Sorry for my poor English; I'm from South America. Hope it works.
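When re-importing is inconvenient, one way to look for the culprit is to check the column names directly. The sketch below assumes that the invalid strings are column names (the completions offered after df$) and that their bytes are latin1, as the answer above suggests; neither assumption is confirmed by the question. The data frame is invented for the demo.

```r
# Made-up data frame; the second name is "café" stored as latin1 bytes
# (the byte 0xE9 on its own is not valid UTF-8).
df <- data.frame(id = 1:2, x = c(0.5, 1.5))
names(df)[2] <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))

# validUTF8() is in base R (3.3.0 and later); FALSE marks suspect names.
bad <- which(!validUTF8(names(df)))
print(bad)

# Assumption: the offending bytes are latin1, so convert them to UTF-8.
names(df)[bad] <- iconv(names(df)[bad], from = "latin1", to = "UTF-8")
print(all(validUTF8(names(df))))
```

The same validUTF8() check can be applied to character columns or factor levels if the names turn out to be clean.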

XLConnect 'envir' error

I manage a number of Excel reports, and I use R to do the preprocessing and write the output report. It's great because all I have to do is run the R function and distribute the reports, and the rest of the report writing is inactive time. The reports need to be in Excel format because it is the easiest to disseminate and the audience is large and non-technical. Once the data is pre-processed, I do this very, very simply using XLConnect:
file.copy(from = template,
          to = newFileName)
writeWorksheetToFile(file = newFileName,
                     data = newData,
                     sheet = "Data",
                     clearSheets = TRUE)
However, one of my reports began throwing this error when I attempted to write the new data:
Error in ls(envir = envir, all.names = private) :
invalid 'envir' argument
Furthermore, before throwing the error, the function ties up R for 15 minutes. The normal writing time is less than 10 seconds. I must confess, I don't understand what this error even means, and it did not succumb to my usual debugging methods or to any other SO solution.
I've noticed that others have referred to rJava (reinstalling this package didn't work) and to a Java cache of log files (not sure where this would be located on Mac). I'm especially confused as the report ran with no problems just one day earlier using precisely the same process, AND my other reports using the exact same process still work just fine.
I didn't update Java or R or my OS, or debug/rewrite any of the R code. So, starting from the beginning - how can I investigate this 'envir' error? What would you do if you were in my shoes? I've been working on this for a couple days and I'm stumped.
I'm happy to provide extra information if it will provide better context for more discerning programmers than myself :)
Update:
My previous answer (below) did not, in fact, fix this intermittent error (which as the OP points out is extremely difficult to unpick due to the Java dependency). Instead, I followed the advice given here and migrated from the XLConnect package to openxlsx, which sidesteps the problem entirely.
Previous answer:
I've been frustrated by precisely this error for a while, including the apparent intermittency and the tying up of R for several minutes when writing a workbook.
I just realised what the problem was: the length of the name of an Excel worksheet appears to be limited to 31 characters, and my R code was generating worksheet names in excess of this limit.
Just to be clear, I'm referring to the names of the individual tabbed sheets within an Excel workbook, not the filename of the workbook itself.
Trimming each worksheet name to no more than 31 characters fixed this error for me.
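For reference, the trimming step can be sketched in a few lines of base R. trim_sheet_names() is a hypothetical helper, and the sheet names are made up; the 31-character limit is the one mentioned above.

```r
# Hypothetical helper: cut worksheet names down to Excel's 31-character limit.
trim_sheet_names <- function(sheet_names, max_len = 31L) {
  trimmed <- substr(sheet_names, 1L, max_len)
  # Truncation can make two different names collide, so flag that case.
  if (anyDuplicated(trimmed) > 0L) {
    warning("Some worksheet names are identical after trimming")
  }
  trimmed
}

sheet_names <- c("Data", "Quarterly revenue by region and product line 2023")
trimmed <- trim_sheet_names(sheet_names)
print(nchar(trimmed))
```

Names that are already short enough pass through unchanged, so the helper can be applied to every generated sheet name before writing.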
