I have a bunch of audio files, which I extracted from mp4 video files using the av package. Now I want to merge all the audio files into one long output mp3.
My question: Is there a way to merge audio files in R using the av package?
I.e. when having a vector of file paths/names such as
files <- c("file1.mp3", "file2.mp3", "file3.mp3")
I am looking for a function or concise workaround within R that could handle this, maybe similar to:
av_function_that_should_exist_already(files, output = "big_fat_file.mp3")
Note 1: I do not want to paste an ffmpeg command to the terminal. If I wanted to use the terminal or some script, I could have done that. What I would like is to solve this completely within R, preferably using av. (I want to avoid adding yet another library and overhauling my entire code, turning it into a library mixtape, when everything else already works just fine.)
Note 2: I have already checked this post: How to concatenate multiple .wav files from a list in R? I am specifically asking about av in this question, preferably not about other packages.
So, I just want to know if this is possible or not (and if maybe I'm just not seeing it). I haven't found anything in the documentation, which is mostly about converting audio and video files, not about concatenating audio or video files such as mp3 or aac.
I was thinking that this should be possible using something like:
av_audio_convert(files, output = "big_fat_file.mp3")
However, this just leads to "file1.mp3" being written to "big_fat_file.mp3" in this example, so from a vector of file names, only the first element will be processed by av_audio_convert.
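For reference, here is a minimal, runnable version of what I tried (file names are placeholders); if I read the docs right, av_media_info() reports durations, which is how I confirmed that only the first input ends up in the output:
library(av)
files <- c("file1.mp3", "file2.mp3", "file3.mp3")   # placeholder paths
av_audio_convert(files, output = "big_fat_file.mp3")
# duration of the "merged" output vs. the summed duration of the inputs
av_media_info("big_fat_file.mp3")$duration
sum(sapply(files, function(f) av_media_info(f)$duration))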
Thanks for your help and ideas in advance,
Cat
Background
I'm doing some data manipulation (joins, etc.) on a very large dataset in R, so I decided to use a local installation of Apache Spark and sparklyr to be able to use my dplyr code to manipulate it all. (I'm running Windows 10 Pro; R is 64-bit.) I've done the work needed, and now want to output the sparklyr table to a .csv file.
The Problem
Here's the code I'm using to output a .csv file to a folder on my hard drive:
spark_write_csv(d1, "C:/d1.csv")
When I navigate to the directory in question, though, I don't see a single csv file d1.csv. Instead I see a newly created folder called d1, and inside it ~10 .csv files all beginning with "part".
The folder also contains the same number of .csv.crc files, which I see from Googling are "used to store CRC code for a split file archive".
What's going on here? Is there a way to put these files back together, or to get spark_write_csv to output a single file like write.csv?
Edit
A user below suggested that this post may answer the question, and it nearly does, but it seems like the asker there is looking for Scala code, while I'm looking for R code that does the same thing.
I had the exact same issue.
In simple terms, the partitions exist for computational efficiency. If you have multiple partitions, multiple workers/executors can write the table in parallel, one partition each. In contrast, if you only have one partition, the csv file can only be written by a single worker/executor, making the task much slower. The same principle applies not only to writing tables but also to parallel computations.
For more details on partitioning, you can check this link.
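For instance, you can check how many partitions a sparklyr table currently has (table below stands for your own table handle):
library(sparklyr)
sdf_num_partitions(table)   # number of partitions backing the table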
Suppose I want to save the table as a single file at the path path/to/table.csv. I would do this as follows (note that the repartitioned table has to be assigned, or piped straight into the write call):
table <- table %>% sdf_repartition(partitions = 1)
spark_write_csv(table, "path/to/table.csv", ...)
You can check full details of sdf_repartition in the official documentation.
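For completeness, a small end-to-end sketch under assumed names (local Spark connection, mtcars as stand-in data, and a made-up output path). Note that even with a single partition Spark still writes a folder at the given path; it will just contain one part-*.csv file, which you can rename or read back with read.csv:
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
d1 <- copy_to(sc, mtcars, "d1")      # stand-in for your real table
d1 %>%
  sdf_repartition(partitions = 1) %>%
  spark_write_csv("C:/d1_out", mode = "overwrite")
spark_disconnect(sc)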
Data will be divided into multiple partitions. When you save the dataframe to CSV, you will get one file per partition. Before calling the spark_write_csv method you need to bring all the data into a single partition to get a single file.
You can use a method called coalesce to achieve this; in sparklyr it is exposed as sdf_coalesce:
df <- sdf_coalesce(df, 1)
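For example, piped together with the write call (the output path is a placeholder):
df %>%
  sdf_coalesce(1) %>%
  spark_write_csv("C:/d1_out")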
I'm no R programmer (because of this problem I started learning it); I'm using Python. For a forecasting task I got a dataset signalList.rdata of a phenomenon called partial discharge.
I tried some commands to load, open and view it, and hardly got a glimpse:
my_data <- get(load('C:/Users/Zack-PC/Desktop/Study/Data Sets/pdCluster/signalList.Rdata'))
But, since I lack deep knowledge about R, I wanted to convert it into a csv file, or any other type that I can deal with in Python.
Or, explore it and copy-paste manually.
So, I'm asking for any solution, whether using R or Python or any other tool, to get at what's in the .rdata file.
Have you managed to load the data successfully into your working environment?
If so, write.csv is the function you are looking for.
If not,
setwd("C:/Users/Zack-PC/Desktop/Study/Data Sets/pdCluster/")
signalList <- get(load("signalList.Rdata"))   # load() returns the object name(s); get() fetches the object itself
write.csv(signalList, "signalList.csv")
should do the trick.
If you would like to remove signalList from your working environment afterwards,
rm(signalList)
will accomplish this.
Note: changing your working directory isn't necessary; it just makes the code easier to read, I feel. You may also specify another path for saving your csv within the second argument of write.csv.
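If the loaded object turns out to be a list rather than a data frame (the name signalList suggests it might be), write.csv will not work directly. A minimal sketch, assuming every element is a numeric vector of equal length:
obj_names <- load("signalList.Rdata")        # load() returns the names of the loaded objects
signalList <- get(obj_names[1])
str(signalList, max.level = 1)               # inspect the structure before exporting
if (is.list(signalList) && !is.data.frame(signalList)) {
  # as.data.frame() only works cleanly if all elements have the same length
  write.csv(as.data.frame(signalList), "signalList.csv", row.names = FALSE)
}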
I am trying to work on a file with the .fas extension using the 'Biostrings' package. I have attempted multiple different approaches and exhausted Google searches across various websites, blogs and video tutorials without much success.
Please consider following file path:
"/mydesktop/DNAfile.fas"
I have successfully installed Biostrings package and XVector.
library(Biostrings)
library(XVector)
Then I wrote the lines below:
fas1 <- system.file("extdata", "/mydesktop/DNAfile.fas", package="Biostrings")
dna <- readDNAStringSet(fas1)
However, this produced the following error:
Error in .Call2("new_input_filexp", filepath, PACKAGE = "XVector") : cannot open file ''
Can someone guide me through this? Someone advised some other user on some other blog that he should consider changing fas into FASTA, but how would that change the workflow?
Thank you!
I probably found an answer. Not quite sure if I can answer my own question. First, both comments by @Maurits and @Poshi were right; I thank them again. What follows is probably a longer version of what they wisely mentioned above, but my thought process is below. This is for other beginners who might fall into the same misunderstanding that I went through, to save them time and energy.
First, .fas, .fasta and .fa files are essentially the same format, provided the file creator did not put a different extension on purpose to fool the user. One way to know whether it truly is the same format is simply to open it in any plain-text editor or equivalent environment and browse the first few lines. However, beware: if the file is huge, opening it all at once will consume a lot of RAM and make your computer uncomfortable.
Second, I checked whether the file truly contains the definition line as well as the sequence. I know it's a blunt way (and may not be totally accurate), but even before inspecting it at the text level, I opened it in R as below:
g <- gzcon(file("/MyDesktop/Ecoli_Genome.fas", "r")) # set to your path
fas <- readLines(g) # read the lines (gzcon also handles gzip-compressed input)
close(g) # close the file connection
head(fas) # first lines of the FASTA formatted file
Here you can check the sequences with your own eyes.
But obviously, this alone does not let you use Biostrings on the sequence!
Make sure you have installed Biostrings correctly. It cannot be installed via install.packages("Biostrings"); it comes from Bioconductor:
source("https://bioconductor.org/biocLite.R")
biocLite("Biostrings")
The package webpage is:
https://bioconductor.org/packages/release/bioc/html/Biostrings.html
Third, understand that a fas (or fasta) file can store a whole sequence on a single line. The format is essentially a list of named sequences, and the line wrapping can differ between files, but contiguous bases do not have to be stored fragmented across lines. So my sequence was all on one line!
So the solution to my question was:
library(Biostrings)
dna <- readDNAStringSet("/MyDesktop/Ecoli_Genome.fas")
You can confirm by opening it:
> dna
A DNAStringSet instance of length 1
width seq names
[1] 4641652 AGCTTTTCATTCTGACTGCAACGGGCA...AAAAAACGCCTTAGTAAGTATTTTTC U00096.3
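Once loaded, a few quick sanity checks (standard Biostrings accessors, shown here just as an illustration):
length(dna)                                    # number of sequences in the set
width(dna)                                     # length of each sequence
names(dna)                                     # the FASTA header lines
subseq(dna[[1]], start = 1, end = 50)          # first 50 bases of the first sequence
letterFrequency(dna, letters = c("A", "C", "G", "T"))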
Let me know if you have any further comments so I can learn. Thank you!
Situation
I wrote an R program which I split up into multiple R-files for the sake of keeping a good code structure.
There is a Main.R file which references all the other R-files with the 'source()' command, like this:
source(paste(getwd(), dirname1, 'otherfile1.R', sep="/"))
source(paste(getwd(), dirname3, 'otherfile2.R', sep="/"))
...
As you can see, the working directory needs to be set correctly in advance, otherwise, this could go wrong.
Now, if I want to share this R program with someone else, I have to pass all the R files and folders in relative order of each other for things to work. Hence my next question.
Question
Is there a way to replace all the 'source' commands with the actual R script code which it refers to? That way, I have a SINGLE R script file, which I can simply pass along without having to worry about setting the working directory.
I'm not looking for a solution which is an 'R package' (which by the way is one single directory, so I would lose my own directory structure). I'm simply wondering if there is an easy way to combine these self-referencing R files into one single file.
Thanks,
OK, I think you could do something like scanning all the files and then writing them out again into a single new one. This can be done using readLines and sink:
sink("mynewRfile.R")
for (i in seq_len(Nfiles)) {
  current_file <- readLines(filedir[i])                 # read the i-th script
  cat("\n\n#### Current file:", filedir[i], "\n\n")     # header marking where it came from
  cat(current_file, sep = "\n")                         # write its lines into the combined file
}
sink()
Here I have assumed all your file paths are in a vector filedir of length Nfiles; I guess you can adapt that.
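For example, to build that vector from a folder (the R/ subfolder name is just an assumption about your layout):
filedir <- list.files("R", pattern = "\\.R$", full.names = TRUE)
Nfiles  <- length(filedir)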
A friend of mine doing an internship asked me 2 hours ago if I could help him avoid manually converting 462 PDF files to .xls using free online software.
I thought of a shell script using unoconv, but I couldn't figure out how to use it properly, and I am not sure unoconv can solve this problem since it mainly converts files to PDF, not the other way round.
Conversion from PDF to any other structured format is not always possible and not generally recommended.
Having said that, this does look like a one-off job and there's a fair few of them (462).
It's worth pursuing if you can reliably extract text from most of them and it's reasonably structured. It's a matter of getting regular text output across a sample of the PDFs that you can reliably parse into a table structure.
There's plenty of tools around that target either direct or OCR-based text extraction; just google around.
One I like is pstotext from the ghostscript suite; the -bboxes option lets me get the coordinates of each word and leaves it up to me to re-assemble the structure. Despite its name it does work on input PDFs. The downside is that it can be a bit flaky and works on some PDFs but not others.
If you get this far, you'd then most likely need to write a shell script or program to convert that output to a CSV. You can either open that directly in a spreadsheet or look for tools to convert it into XLS.
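If staying in R is acceptable, one such extraction tool is the pdftools package (not mentioned above, so treat this as a hedged alternative sketch; the folder names are placeholders):
library(pdftools)
pdfs <- list.files("pdfs", pattern = "\\.pdf$", full.names = TRUE)   # input folder (placeholder)
dir.create("txt", showWarnings = FALSE)                              # output folder (placeholder)
for (f in pdfs) {
  txt <- pdf_text(f)    # one character string per page
  out <- file.path("txt", paste0(tools::file_path_sans_ext(basename(f)), ".txt"))
  writeLines(txt, out)  # plain text, to be parsed into CSV afterwards
}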
PS: If he hasn't already, get the intern to ask if there's any possible way of getting at the original data that was used to create the PDFs. It will save a lot of time and effort and lead to a far more accurate result.
Update: An alternative to pstotext is the renderpdf.pl command, which is included in the Perl CAM::PDF module. It is more robust, but just reports the (x,y) position of text, not bounding boxes.
Other responses on a linked question suggest Tabula, too.
https://github.com/tabulapdf/tabula
I tried it and it works very well.