Extract hyperlinks from a spooled PDF file in Python (FastAPI)

I receive form data from the frontend and read it with FastAPI as shown below:
from fastapi import FastAPI, File, UploadFile

app = FastAPI()

@app.post("/file_upload")
async def upload_file(pdf: UploadFile = File(...)):
    print("Content = ", pdf.content_type, pdf.filename, pdf.spool_max_size)
    return {"filename": "Success"}
Now I need to extract hyperlinks from these spooled files with the help of pypdfextractor, as shown below:
import pdfx
from os.path import exists
from config import availableUris

def getHrefsFromPDF(pdfPath: str) -> list:
    # pdfx works on a path, so the file must exist on disk
    if not exists(pdfPath):
        raise FileNotFoundError("PDF File not Found")
    pdf = pdfx.PDFx(pdfPath)
    # the 'url' entry holds the list of hyperlinks found in the document
    return pdf.get_references_as_dict().get('url', [])
But I am not sure how to convert the spooled file (received from FastAPI) into a format that pdfx can read.
Additionally, I tried to inspect the bytes that come out of the file. When I do this:
data = await pdf.read()
the type of data is bytes. Converting it with the str function gives an escaped string that is total gibberish to me, and decoding it as "utf-8" throws UnicodeDecodeError.

FastAPI gives you a SpooledTemporaryFile. You may be able to use that file object directly if pdfx has an API that works on a file object rather than a str representing a path (!). Otherwise, write a new temporary file to disk and work with that:
from tempfile import TemporaryDirectory
from pathlib import Path
import pdfx

@app.post("/file_upload")
async def upload_file(pdf: UploadFile = File(...)):
    with TemporaryDirectory() as d:  # copy the upload to disk so pdfx can read it by path
        tmpf = Path(d) / "pdf.pdf"
        with tmpf.open("wb") as f:
            f.write(pdf.file.read())  # pdf.file is the underlying file object; pdf.read() is a coroutine
        p = pdfx.PDFx(str(tmpf))
        ...
It may be that pdfx.PDFx will take a Path object. I'll update this answer if so. I've kept the read-write step synchronous for ease, but you can make it asynchronous if there is a reason to do so.
Note that it would be better to find a way of doing this with the SpooledTemporaryFile directly.
As for your data showing as bytes: well, PDFs are (basically) binary files, so what did you expect?
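A minimal sketch of the temporary-file approach, written so that it runs without FastAPI or pdfx installed. The helper name run_on_temp_pdf and the extract callback are hypothetical, and first_bytes is only a stand-in for a real pdfx-based extractor:

```python
from pathlib import Path
from tempfile import TemporaryDirectory
from typing import Callable, TypeVar

T = TypeVar("T")


def run_on_temp_pdf(data: bytes, extract: Callable[[str], T]) -> T:
    """Write the uploaded bytes to a temporary .pdf file and call extract(path)."""
    with TemporaryDirectory() as d:
        tmp_path = Path(d) / "upload.pdf"
        tmp_path.write_bytes(data)  # binary write, no text decoding involved
        # The file only exists inside this block, so extract must finish here.
        return extract(str(tmp_path))


# Stand-in for a pdfx-based extractor (hypothetical): returns the first bytes of the file.
def first_bytes(path: str) -> bytes:
    return Path(path).read_bytes()[:5]


sample = b"%PDF-1.4\n\xff\xfe binary payload"
header = run_on_temp_pdf(sample, first_bytes)
```

Inside the endpoint this would presumably be called as run_on_temp_pdf(await pdf.read(), getHrefsFromPDF). The sample bytes also show why decoding as "utf-8" fails: a PDF may contain byte values such as 0xFF that are not valid UTF-8.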

Related

Capture File Stream Data in Argparse

I'd like to capture the help output from argparse in a string:
import argparse
pp = argparse.ArgumentParser(description="foo")
pp.add_argument('bundle_dir', help='Directory containing bundles',
                default='default_val')
pp.print_help()  # Want this to go to a string
For its part argparse lets you pass a file handle to print_help:
def print_help(self, file=None):
    if file is None:
        file = _sys.stdout
    self._print_message(self.format_help(), file)
Is there any way to create an object that acts like a file but captures the help data so I can use it as a string?
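One possible approach, sketched with the standard library only: io.StringIO is an in-memory text file, so it can be passed as the file argument of print_help. Since print_help only writes the result of format_help(), calling format_help() directly also returns the help text as a string. The prog="demo" argument is an addition here, used only to keep the output independent of the script name:

```python
import argparse
import io

pp = argparse.ArgumentParser(prog="demo", description="foo")
pp.add_argument("bundle_dir", help="Directory containing bundles")

# Option 1: hand print_help a file-like object that keeps everything in memory.
buffer = io.StringIO()
pp.print_help(file=buffer)
captured = buffer.getvalue()

# Option 2: skip the file entirely and ask for the formatted help text.
direct = pp.format_help()
```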

How to process and save HTTP body as-is in Haskell?

I have tried the following code to download HTML, but it transforms non-ASCII characters into a series of escaped characters like <U+009B> and 0033200400\0031\0031.
openURL x = getResponseBody =<< simpleHTTP (getRequest x)
download url path = do
    src <- openURL url
    writeFile path src
How can I change this code to write the HTTP response exactly as received? How should one search and manipulate strings in such content?
The string output like "\1234\5678" is actually only two characters long: the data is preserved, but you need to interpret it correctly. Probably the best way to do that is to use Text which, instead of being a list of Chars, is actually a byte array representing UTF-8 code points.
To do this, you need to use a slightly more general interface in HTTP: mkRequest :: BufferType ty => RequestMethod -> URI -> Request ty. Text does not directly instantiate BufferType, so we'll go through ByteString, which represents binary chunks of data and has no particular interpretation of the encoding of that data.
We can then use decodeUtf8 to convert the raw UTF-8 bytes to Text:
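import Data.Text (Text)
import Data.Text.Encoding (decodeUtf8)
import Data.ByteString (ByteString)
import Network.HTTP

\uri -> do
    -- request the body as raw bytes; decodeUtf8 needs a ByteString, not a Text
    rawData <- getResponseBody =<< simpleHTTP (mkRequest GET uri) :: IO ByteString
    return (decodeUtf8 rawData)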
Note that decodeUtf8 is partial: it may fail in a way that cannot be caught in pure code, mandating a restart or a handler all the way up in your IO stack. If this is undesirable, or if there's a good chance that you're downloading text which isn't valid UTF-8, you can use decodeUtf8', which returns an Either.

Sequencefiles which map a single key to multiple values

I am trying to do some preprocessing on data that will be fed to LucidWorks Big Data for indexing. LWBD accepts SolrXML in the form of Sequencefile files. I want to create a Pig script which will take all the SolrXML files in a directory and output them in the format
filename_1 => <here goes some XML>
...
filename_N => <here goes some more XML>
Pig's native PigStorage() load function can automatically create a column that includes the name of the file from which the data was extracted, which ideally would look like this:
{"filename_1", "<here goes some XML>"}
...
{"filename_N", "<here goes some more XML>"}
However, PigStorage() also automatically uses '\n' as a line delimiter, so what I actually end up with is a bag that looks like this:
{"filename_1", "<some partial XML from file 1>"}
{"filename_1", "<some more partial XML from file 1>"}
{"filename_1", "<the end of file 1>"}
...
I'm sure you get the picture. My question is, if I were to write this bag to a SequenceFile, how would it be read by other applications? Could it be combined as
"filename_1" => "<some partial XML from file 1>
<some more partial XML from file 1>
<the end of file 1>"
, by the default handling of the application I feed it to? Or is there some post-processing that I can do to get it into this format? Thank you for your help.
Since I can't find anything about a builtin SequenceFile writer, I'm assuming you are using a UDF (and if you aren't, then you need to).
You'll have to group the files (by filename) ahead of time, and then send that to the writer UDF.
DESCRIBE xml ;
-- xml: {filename: chararray, xml_data: chararray}
B = FOREACH (GROUP xml BY filename)
        GENERATE group AS filename, xml.xml_data AS all_xml_data ;
Depending on how you have written the SequenceFile writer, it may be easier to convert the all_xml_data bag ahead of time to a chararray using a Python UDF like:
@outputSchema('xml_complete: chararray')
def stringify(bag):
    delim = ''
    return delim.join(bag)
NOTE: It is important to realize that this way the order of the XML data will become jumbled. If your data allows it, stringify could be expanded to reorganize it.
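A sketch of how stringify might restore the order, assuming each record carries a line number. That field is hypothetical (the relation above only has filename and xml_data), and the sketch assumes the bag reaches the Python UDF as a list of tuples. It is written as plain Python so it can be tried outside Pig; inside Pig it would also need the outputSchema decorator:

```python
def stringify_ordered(bag, delim="\n"):
    """Join XML fragments in line order.

    bag: list of (line_no, xml_fragment) tuples. line_no is a hypothetical
    field that would have to be added when the data is loaded.
    """
    ordered = sorted(bag, key=lambda item: item[0])
    return delim.join(fragment for _, fragment in ordered)


# Fragments arriving out of order, as they might after a GROUP.
jumbled = [(3, "</doc>"), (1, "<doc>"), (2, "<field>x</field>")]
result = stringify_ordered(jumbled)
```

The delimiter defaults to a newline here because the fragments were originally split on '\n'; use an empty string to match the original stringify.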

Haskell Network.HTTP incorrectly downloading image

I'm trying to download images using the Network.HTTP module and having little success.
import Network.HTTP
main = do
    jpg <- get "http://www.irregularwebcomic.net/comics/irreg2557.jpg"
    writeFile "irreg2557.jpg" jpg where
        get url = simpleHTTP (getRequest url) >>= getResponseBody
The output file appears in the current directory, but fails to display in Chromium or Ristretto. Ristretto reports "Error interpreting JPEG image file (Not a JPEG file: starts with 0xc3 0xbf)".
writeFile :: FilePath -> String -> IO ()
String. That's your problem, right there. String is for unicode text. Attempting to store binary data in it will lead to corruption. It's not clear in this case whether the corruption is being done by simpleHTTP or by writeFile, but it's ultimately unimportant. You're using the wrong type, and something is corrupting the data when confronted with bytes that don't make up a valid unicode encoding.
As for fixing this, newer versions of HTTP are polymorphic in their return type, and can handle returning the raw bytes in a ByteString. You just need to change how you're writing the bytes to the file, so that it won't infer that you want a String.
import qualified Data.ByteString as B
import Network.HTTP
import Network.URI (parseURI)

main = do
    jpg <- get "http://www.irregularwebcomic.net/comics/irreg2557.jpg"
    B.writeFile "irreg2557.jpg" jpg
  where
    get url = let uri = case parseURI url of
                          Nothing -> error $ "Invalid URI: " ++ url
                          Just u -> u in
              simpleHTTP (defaultGETRequest_ uri) >>= getResponseBody
The construction to get a polymorphic Request is a bit clumsy. If issue #1 ever gets fixed then using getRequest url will suffice.
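The bytes in the error message are consistent with this explanation. The sketch below is in Python rather than Haskell and is only an illustration: a JPEG file starts with the bytes 0xFF 0xD8, and if each byte is treated as a character and then written out as UTF-8 text, the output starts with 0xC3 0xBF. Whether this is exactly what happens inside simpleHTTP or writeFile is an assumption:

```python
jpeg_magic = bytes([0xFF, 0xD8])        # every JPEG file starts with these two bytes

as_text = jpeg_magic.decode("latin-1")  # each byte becomes one character (U+00FF, U+00D8)
corrupted = as_text.encode("utf-8")     # writing those characters out as UTF-8 text

print(corrupted.hex())  # c3bfc398
```

Two bytes of image data have become four, which is why the file is no longer a valid JPEG.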

how to run and get document stats from boilerpipe article extractor?

There's something I'm not quite understanding about the use of boilerpipe's ArticleExtractor class. I am also very new to Java, so perhaps my basic knowledge of this environment is at fault.
Anyhow, I'm trying to use boilerpipe to extract the main article from some raw HTML source I have collected. The raw HTML contents of a webpage are stored in a java.lang.String variable (let's call it htmlstr).
I know how to run boilerpipe to print the extracted text to the output window as follows:
java.lang.String htmlstr = "<!DOCTYPE.... ****html source**** ... </html>";
java.lang.String article = ArticleExtractor.INSTANCE.getText(htmlstr);
System.out.println(article);
However, I'm not sure how to run BP by first instantiating an instance of the ArticleExtractor class, then calling it with the 'TextDocument' input datatype. The TextDocument datatype is itself somehow constructed from BP's 'TextBlock' datatype, and perhaps I am not doing this correctly...
What is the proper way to construct a TextDocument type variable from my htmlstr string variable?
So my problem is then in using the processing method of BP's Article Extractor class aside from calling the ArticleExtractor getText method as per the example above. In other words, I'm not sure how to use the
ArticleExtractor.process(TextDocument doc);
method.
It is my understanding that one is required to run this ArticleExtractor process method to then be able to use the same "TextDocument doc" variable for getting document stats, using BP's
TextDocumentStatistics(TextDocument doc, boolean contentOnly)
method? I would like to use the stats to determine how good the filtering was estimated to be.
Any code examples someone could help me out with?
Code written in Jython (Conversion to java should be easy)
1) How to get TextDocument from a HTML String:
import org.xml.sax.InputSource as InputSource
import de.l3s.boilerpipe.sax.HTMLDocument as HTMLDocument
import de.l3s.boilerpipe.document.TextDocument as TextDocument
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput as BoilerpipeSAXInput
import de.l3s.boilerpipe.extractors.ArticleExtractor as ArticleExtractor
import de.l3s.boilerpipe.estimators.SimpleEstimator as SimpleEstimator
import de.l3s.boilerpipe.document.TextDocumentStatistics as TextDocumentStatistics
import de.l3s.boilerpipe.document.TextBlock as TextBlock
htmlDoc = HTMLDocument(rawHtmlString)
inputSource = htmlDoc.toInputSource()
boilerpipeSaxInput = BoilerpipeSAXInput(inputSource)
textDocument = boilerpipeSaxInput.getTextDocument()
2) How to process TextDocument using Article Extractor (continued from above)
content = ArticleExtractor.INSTANCE.getText(textDocument)
3) How to get TextDocumentStatistics (continued from above)
content_list = [] #replace python 'List' Object with ArrayList in java
content_list.append(TextBlock(content)) #replace with arrayList.add(TextBlock(content))
content_td = TextDocument(content_list)
content_stats = TextDocumentStatistics(content_td, True)#True for article content statistics only
Note: The Javadocs that accompany the boilerpipe 1.2 jar should be somewhat useful for future reference.
