Opening a fas format file on Biostrings package in R - r

I am trying to work on a fas extension file with using 'Biostrings' package. I have attempted a multiple different ways, exhausted google searches without much success on different webs/blogs/video tutorials.
Please consider following file path:
"/mydesktop/DNAfile.fas"
I have successfully installed Biostrings package and XVector.
library(Biostrings)
library(Vector)
Then I wrote below lines:
fas1 <- system.file("extdata", "/mydesktop/DNAfile.fas", package="Biostrings")
dna <- readDNAStringSet(fas1)
However, this created error as this:
Error in .Call2("new_input_filexp", filepath, PACKAGE = "XVector") : cannot open file ''
Can someone guide me through this? Someone advised some other user on some other blog that he should consider changing fas into FASTA, but how would that change the workflow?
Thank you!

I probably found an answer. Not quite sure if I can answer my own question. First, both comments by #Maurits and #Poshi were right. I appreciate them again. It's probably a longer version of what they wisely mentioned above, but my thought process is below. This is for other beginners who might fall into same misunderstanding that I went through... to save their time and energy.
First, the fas (or fasta, fa) files are essentially the same, if the file creator did not intend to fool the user by putting a different extension on purpose. One way to know whether it truly is based on the same format is, just by opening them in any ASCII text editor or equivalent environment and browse the first few lines. However, beware if the file size is huge, it will consume your RAM and make your computer uncomfortable.
Second, I checked whether the file truly contains the definition line as well as the sequence. I know it's a blunt way (and may not be totally accurate), but even before I went to ASCII level, I opened it on R with below:
g <- gzcon(file("/MyDesktop/Ecoli_Genome.fas", "r")) # set to your path
fas <- readLines(g) # read directly from .gz file
close(g) # close the file connection
head(fas) # lines of the FASTQ formatted file
Here you can check all sequences with your own eyes.
But obviously, this does not make you compatible to use the Biostrings on this sequence!
Make sure you correctly installed Biostrings. It cannot be installed via install.packages("Biostrings"). Here is the source.
source("https://bioconductor.org/biocLite.R")
The webpage is:
https://bioconductor.org/packages/release/bioc/html/Biostrings.html
Third, understand the fas (or fasta) can store all sequence in one line. It's basically a complex form of a list, and your sequence can be stored in different ways, but contiguous bases usually need not to be stored fragmented. So my sequence was all in one line!
So the solution to my question was:
library(Biostrings)
dna <- readDNAStringSet("/MyDesktop/Ecoli_Genome.fas")
You can confirm by opening it:
> dna
A DNAStringSet instance of length 1
width seq names
[1] 4641652 AGCTTTTCATTCTGACTGCAACGGGCA...AAAAAACGCCTTAGTAAGTATTTTTC U00096.3
Let me know any further comments to learn. Thank you

Related

(How) can I merge audio files in R using the av package?

I have a bunch of audio files, which I extracted from mp4 video files using the av package. Now I want to merge all the audio files into one long output mp3.
My question: Is there a way to merge audio files in R using the av package?
I.e. when having a vector of file paths/names such as
files <- c("file1.mp3", "file2.mp3", "file3.mp3")
I am looking for a function or concise workaround within R that could handle this, maybe similar to:
av_function_that_should_exist_already(files, output = "big_fat_file.mp3")
Note 1: I do not want to paste an ffmpeg command to the terminal. If I wanted to use the terminal or some script, I could have done that. What I would like to do, is to solve this completely within R, preferably using av. (I want to avoid implementing yet another library, and overthrowing my complete code, making it into a library mixtape, when everything else already works just fine).
Note 2: I have already checked this post: How to concatenate multiple .wav files from a list in R?, I am specifically asking about av in this question, preferably not about other packages.
So, I just want to know if this is possible or not (and if maybe I'm just not seeing it). I haven't found anything in the documentation, which is mostly about converting audio and video files, not about concatenating audio or video files such as mp3 or aac.
I was thinking that this should be possible using something like:
av_audio_convert(files, output = "big_fat_file.mp3")
However, this just leads to "file1.mp3" being written to "big_fat_file.mp3" in this example, so from a vector of file names, only the first element will be processed by av_audio_convert.
Thanks for your help and ideas in advance,
Cat

"filename.rdata" file Exploring and Converting to CSV

I'm no R-programmer (because of the problem I started learning it), I'm using Python, In a forcasting task I got a dataset signalList.rdata of a pheomenen called partial discharge.
I tried some commands to load, open and view, Hardly got a glimps
my_data <- get(load('C:/Users/Zack-PC/Desktop/Study/Data Sets/pdCluster/signalList.Rdata'))
but, since i lack deep knowledge about R, I wanted to convert it into a csv file, or any type that I can deal with in python.
or, explore it and copy-paste manually.
so, i'm asking for any solution whether using R or Python or any tool to get what's in the .rdata file.
Have you managed to load the data successfully into your working environment?
If so, write.csv is the function you are looking for.
If not,
setwd("C:/Users/Zack-PC/Desktop/Study/Data Sets/pdCluster/")
signalList <- load("signalList.Rdata")
write.csv(signalList, "signalList.csv")
should do the trick.
If you would like to remove signalList from your working directory,
rm(signalList)
will accomplish this.
Note: changing your working directory isn't necessary, it just makes it easier to read in a comment I feel. You may also specify another path for saving your csv to within the second argument of write.csv.

PDF File Import R

I have multiple .pdf-files (stored in a local folder), that contain text. I would like to import the .pdf-files (i.e., the texts) in R. I applied the function 'read_dir' (R package: [textreadr][1])
library ("textreadr")
Data <- read_dir("<MY PATH>")
The function works well. BUT. For several files, that include special characters (i.e., letters) in their names (such as 'ć'; e.g., 'filenameć.pdf'), the function did not work (error message: 'The following files failed to read in and were removed:' …).
What can I do?
I tried to rename the files via R (did not work (probably due to the same reasons)). That might be a workaround.
I did not want to rename the files manually :)
Follow-Up (only for experts):
For several files, I got one of the following error messages (and I have no idea why):
PDF error: Mismatch between font type and embedded font file
or
PDF error: Couldn't find trailer dictionary
Any suggestions or hints how to solve this issue?
Likely the issue concerns the encoding of the file names. If you absolutely want to use R to rename the files for you, the function you want to use is iconv, determine the encoding of the file names and then convert them to utf-8.
However, a much better system would imply renaming them using bash from command line. Can you provide a more complete set of examples?

Using R, import data from web

I have just started using R, so this may be a very dumb question. I am trying to import the data using:
emdata=read.csv(file="http://lottery.merseyworld.com/cgi-bin/lottery?days=19&Machine=Z&Ballset=0&order=1&show=1&year=0&display=CSV",header=TRUE)
My problem is that it reads the csv file into a single column ( by the way, the lottery data is simply because it is publicly available to download - using as an exercise to understand what I can and can't do in R), instead of formatting it into however many columns of data there are. Would someone mind helping out, please, even though this is trivial
Hm, that's kind of obnoxious for a page purporting to be in csv format. You can skip the first 5 lines, which will cause R to read (most of) the rest of the file correctly.
emdata=read.csv(file=...., header=TRUE, skip=5)
I got the number of lines to skip by looking at the source. You'll still have to remove the cruft in the middle and end, and then clean up the columns (they'll all be factors because of the embedded text).
It would be much easier to save the page to your hard disk, edit it to remove all the useless bits, then import it.
... to answer your REAL question, yes, you can import data directly from the web. In general, wherever you would read a file, you can substitute a fully qualified URL -- R is smart enough to do the Right Thing[tm]. This specific URL just happens to be particularly messy.
You could read text from the given url, filter out the obnoxious lines and then read the result as CSV like so:
lines <- readLines(url("http://lottery.merseyworld.com/cgi-bin/lottery?days=19&Machine=Z&Ballset=0&order=1&show=1&year=0&display=CSV"))
read.csv(text=lines[grep("([^,]*,){5,}", lines)])
The above regular expression matches any lines containing at least five commas.

Converting .pdf files to excel (.xls)

A friend of mine doing an internship asked me 2 hours ago if I could help him avoid to do manually 462 pdf file to .xls using free online soft.
I thought of a shell script using unoconv, but I didn't find out how to use it properly, and I am not sure if unoconv can solve this problem since it mainly converts file to pdf, not the reverse thing.
Conversion from PDF to any other structured format is not always possible and not generally recommended.
Having said that, this does look like a one-off job and there's a fair few of them (462).
It's worth pursuing, if you can reliably extract text from most of them and it's reasonably structured. It's a matter of trying to get regular text output across a sample of the PDF's that you can reliably parse into a table structure.
There's plenty of tools around that target either direct or OCR based text extraction, just google around.
One I like is pstotext from the ghostscript suite; the -bboxes option lets me get the coordinates of each word and leaves it up to me to re-assemble the structure. Despite its name it does work on input PDFs. Downside is that it can be a bit flakey and works on some PDF's but not others.
If you get this far, you'd then most likely then need to write a shell-script or program to convert that to a CSV. You can either open this directly via a spread-sheet or look for tools to convert this into XLS.
PS If he hasn't already, get the intern to ask if there's any possible way of getting at the original data that was used to created the PDFs It will save a lot of time and effort and lead to a way more accurate result.
Update An alternative to pstotext is renderpdf.pl command which is included in the Perl CAM::PDF module. More robust, but just reports text (x,y) position, not bounding boxes.
Other responses on a linked question suggest Tabula, too.
https://github.com/tabulapdf/tabula
I tried and it works very well.

Resources