How do I download and extract a list of papers in LaTeX format from arXiv? - python-requests

I have a list of papers that I'd like to extract from arXiv (I have the arXiv links / names of the arXiv files), but in LaTeX format. How can I do this in Python?
If we go to this page: https://arxiv.org/format/2010.11645
We can read the following text:
Source:
Delivered as a gzipped tar (.tar.gz) file if there are multiple files, otherwise as a PDF file, or a gzipped TeX, DVI, PostScript or HTML (.gz, .dvi.gz, .ps.gz or .html.gz) file depending on submission format. [ Download source ]
We can download the file by clicking on [ Download source ], but I have no idea what type of file I'm getting back. The filename is simply 2010.11645.
I'd like to download the file in LaTeX format (which I believe is .tex) and then convert it into .txt using pandoc. I believe I'd need to download the files via requests somehow?
How can I do this? Thanks!
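Since the format page says the source may arrive as a gzipped tar, a single gzipped file, or a PDF, one option is to inspect the downloaded bytes rather than trust the filename. The following is a minimal sketch, not a definitive implementation. It assumes the source is served at https://arxiv.org/e-print/<id> (a URL pattern not stated in the question), uses the standard library's urllib instead of requests (requests.get(url).content would return the same bytes), and uses the hypothetical names fetch_source, extract_tex, and "main.tex".

```python
import gzip
import io
import tarfile
import urllib.request


def fetch_source(arxiv_id):
    """Download the raw source bytes for one paper.

    Assumption: the e-print URL pattern below serves the source download.
    """
    url = f"https://arxiv.org/e-print/{arxiv_id}"
    request = urllib.request.Request(url, headers={"User-Agent": "source-fetcher-example"})
    with urllib.request.urlopen(request) as response:
        return response.read()


def extract_tex(payload):
    """Return {filename: text} for the .tex files found in a downloaded payload."""
    if payload.startswith(b"%PDF"):
        return {}  # PDF-only submission: there is no LaTeX source to extract
    if payload.startswith(b"\x1f\x8b"):  # gzip magic number
        payload = gzip.decompress(payload)
    try:
        with tarfile.open(fileobj=io.BytesIO(payload)) as archive:
            return {
                member.name: archive.extractfile(member).read().decode("utf-8", errors="replace")
                for member in archive.getmembers()
                if member.isfile() and member.name.endswith(".tex")
            }
    except tarfile.ReadError:
        # Not a tar archive: assume a single TeX file. It could also be DVI,
        # PostScript or HTML, which this sketch does not check for.
        return {"main.tex": payload.decode("utf-8", errors="replace")}
```

For a list of papers, extract_tex(fetch_source(paper_id)) could be called in a loop, with a pause between requests. Each returned file can then be written to disk and converted with a command along the lines of pandoc paper.tex -t plain -o paper.txt.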

Related

How can I convert these characters to a CSV?

I tried to download a file from LSData, but it brings me to a page full of weird characters. The first few are:
7z¼¯'�DÙ™µUa�����b�������’³_èÚ†à]�&Jgl›Ü)ÉZKŒP7þò|¤ˆëÁëxŠ§u6²ã]’“Àé3lGê7ñ"!èÞ’ïjP³
l½Öv<¹-žøZ¹Æ âäùëOKä#;cÞ Žmï•&?^¢Ø"Á.=ù‚u|õ9žG<އ趽ÈËŒøÂtŠÍÝê/ÂG×à×–R§Ýj×zÛ¥™éwG—ï‘ývíõåò ÂÑ\‡W�ܱò§úßxlø¾Ö¾EºáPnÚR"økv§}6“SLÒ¢ø€m]-Ì«gÐáÅMŠWGU�µOÿDõ™}u¦HŠ_qŠ,/¦lÔ}Áô|,Òäêÿ2l«ª»°úö¡]+€™´í¿¢«|Ãw#êñ:t!
I have no clue what I'm looking at. How can I convert this entire page into a CSV, or into whatever file format I can use in R?
It is a 7z-compressed file. You can download it and extract it to get the CSV file.
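The leading characters 7z¼¯' in the dump match the 7z file signature, which is how the format can be recognised. A minimal sketch of that check, assuming the first bytes of the download are available (the function name is_7z is hypothetical):

```python
# The six-byte signature at the start of every 7z archive.
SEVEN_ZIP_MAGIC = b"7z\xbc\xaf\x27\x1c"


def is_7z(first_bytes):
    """Return True if the given bytes begin with the 7z signature."""
    return first_bytes.startswith(SEVEN_ZIP_MAGIC)
```

Python's standard library has no 7z extractor, so the archive itself still has to be unpacked with a separate tool such as the 7-Zip program.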

Importing to R an Excel file saved as web-page

I would like to open an Excel file saved as webpage using R and I keep getting error messages.
The desired steps are:
1) Upload the file into RStudio
2) Change the format into a data frame / tibble
3) Save the file as an xls
The message I get when I open the file in Excel is that the file format (excel webpage format) and extension format (xls) differ. I have tried the steps in this answer, but to no avail. I would be grateful for any help!
I don't expect anybody will be able to give you a definitive answer without a link to the actual file. The complication is that many services will write files as .xls or .xlsx without them being valid Excel format. This is done because Excel is so common and some non-technical people feel more confident working with Excel files than a csv file. Now, the files will have been stored in a format that Excel can deal with (hence your warning message), but R's libraries are more strict and don't see the actual file type they were expecting, so they fail.
That said, the below steps worked for me when I last encountered this problem. A service was outputting .xls files which were actually just HTML tables saved with an .xls file extension.
1) Download the file to work with it locally. You can script this of course, e.g. with download.file(), but this step helps eliminate other errors involved in working directly with a webpage or connection.
2) Load the full file with readHTMLTable() from the XML package
library(XML)
dTemp = readHTMLTable([filename], stringsAsFactors = FALSE)
This will return a list of dataframes. Your result set will quite likely be the second element or later (see ?readHTMLTable for an example with explanation). You will probably need to experiment here and explore the list structure as it may have nested lists.
3) Extract the relevant list element, e.g.
df = dTemp[[2]]  # [[ ]] returns the data frame itself; [ ] would return a one-element list
You also mention writing out the final data frame as an xls file which suggests you want the old-style format. I would suggest the package WriteXLS for this purpose.
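For comparison, the same idea (treating the mislabelled .xls file as an HTML document and pulling out its tables) can be sketched in Python with only the standard library. This is an illustration under assumptions, not a port of readHTMLTable(): the names TableParser and read_html_tables are hypothetical, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect every <table> in a document as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table
        self._row = None   # row currently being filled, if any
        self._cell = None  # text fragments of the current cell, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def read_html_tables(text):
    """Return all tables found in an HTML string, in document order."""
    parser = TableParser()
    parser.feed(text)
    parser.close()
    return parser.tables
```

As with the R version, the result is a list of tables, so some exploration is usually needed to find the one that holds the data.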
I seriously doubt the Excel file is 'saved as a web page'. I'm pretty sure the file just sits on a server and all you have to do is go fetch it. Some kinds of files (in particular Excel and h5) are binary rather than text files. Downloading them needs an added setting (mode="wb") to tell R that the file is binary and should be handled appropriately.
myurl <- "http://127.0.0.1/imaginary/file.xlsx"
download.file(url=myurl, destfile="localcopy.xlsx", mode="wb")
Or use the downloader package and try something like this:
library(downloader)
myurl <- "http://127.0.0.1/imaginary/file.xlsx"
download(myurl, destfile="localcopy.xlsx", mode="wb")
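Which of the two answers applies depends on what the file really contains, and the first few bytes usually settle it. A minimal sketch, assuming the file's leading bytes have been read in binary mode (the function name sniff_spreadsheet and its return labels are hypothetical):

```python
def sniff_spreadsheet(first_bytes):
    """Guess the real format of a file that claims to be an Excel workbook."""
    if first_bytes.startswith(b"PK\x03\x04"):
        return "xlsx"  # .xlsx files are zip containers
    if first_bytes.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"):
        return "xls"   # legacy binary workbooks use the OLE2 compound file format
    if first_bytes.lstrip().lower().startswith((b"<html", b"<!doctype", b"<table")):
        return "html"  # an HTML table saved with an Excel extension
    return "unknown"
```

A result of "html" points to the table-parsing approach, while "xls" or "xlsx" means a regular Excel reader should work once the file has been downloaded in binary mode.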

How to convert doc to docx file using R code

I tried to read a doc file using readdoc(), but when the doc file contains tables, it cannot read them properly.
Therefore I want to convert the doc file to a docx file so that I can extract the tables using the docxtractr package available in R.
I want to convert a .doc file to a .docx file using R code.

Is plain text the only legal input format for Amazon EMR?

As quoted in the "Developer Guide" of Amazon EMR, the files in the input directory should be formatted as plain text. Does it mean that I cannot upload binary files such as .png files and parse them with a Python script?
Likely not. See for example: https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!topic/cdh-user/AUUZ0DKiJGw
What you can do is make the input data the file names themselves (either in S3 or HDFS). The Hadoop streaming script then receives file names as input, which it can open and process as it sees fit.
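The file-names-as-input approach could look roughly like the sketch below. It assumes each input line is a path the script can open directly; reading from S3 or HDFS would need the matching client library instead of open(). The names process_file and run_mapper are hypothetical, and the byte count stands in for whatever real parsing is needed.

```python
def process_file(path):
    """Placeholder processing step: return the size of a binary file in bytes."""
    with open(path, "rb") as handle:
        return len(handle.read())


def run_mapper(lines, out):
    """Treat each input line as a file name and emit 'name<TAB>result' records."""
    for line in lines:
        path = line.strip()
        if not path:
            continue  # skip blank input lines
        out.write(f"{path}\t{process_file(path)}\n")
```

In a streaming job the script would end with a call such as run_mapper(sys.stdin, sys.stdout), so that the plain-text input seen by Hadoop is only the list of names.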

Merging EBCDIC converted files and pdf files into a single file and pushing to mainframes

I have two pdf files and two text files which are converted into EBCDIC format. The two text files act as cover files for the pdf files, containing details like pdf name, number of pages, etc. in a fixed format.
Cover1.det, Firstpdf.pdf, Cover2.det, Secondpdf.pdf
Format of the cover file could be:
Firstpdf.pdf|22|03/31/2012
that is
pdfname|page num|date generated
which is then converted into EBCDIC format.
I want to merge all these files in a single file in the order first text file, first pdf file, second text file, second pdf file.
The idea is to then push this single merged file into mainframes using scp.
1) How can I merge the four files mentioned above into a single file?
2) Do I also need to convert the pdf files to EBCDIC format? If yes, how?
3) As far as I know, mainframe files also need record length details during transit. How can I find the record length of the file, if I succeed in merging them into a single file at all?
I remember reading somewhere that it could be done using put and append in ftp. However since I have to use scp, I am not sure how to achieve this merging.
Thanks for reading.
1) Why not use something like pkzip?
2) I don't think converting the pdf files to EBCDIC is necessary or even possible. The files need to be transferred in binary mode.
3) Using pkzip and scp you will not need the record length
File merging can easily be achieved using the cat command in Unix with the > (redirect) and >> (append) operators.
Also, if the next file should start from a new line (as was my case) a blank echo could be inserted between files.
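The same byte-for-byte merge that cat performs can be sketched in Python. The cover text is encoded with the cp037 codec here, which is an assumption: the thread does not say which EBCDIC code page the mainframe expects. The pdf files are copied unchanged in binary mode, and the names encode_cover and merge_files are hypothetical.

```python
import shutil


def encode_cover(text, codec="cp037"):
    """Encode a cover record as EBCDIC (cp037 is an assumed code page)."""
    return text.encode(codec)


def merge_files(paths, destination):
    """Concatenate the given files, in order, into one binary file."""
    with open(destination, "wb") as merged:
        for path in paths:
            with open(path, "rb") as part:
                shutil.copyfileobj(part, merged)
```

Calling merge_files(["Cover1.det", "Firstpdf.pdf", "Cover2.det", "Secondpdf.pdf"], "merged.bin") would give the order described in the question, with "merged.bin" as a placeholder output name.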
