how to run and get document stats from boilerpipe article extractor? - web-scraping

There's something I'm not quite understanding about the use of boilerpipe's ArticleExtractor class. Admittedly, I am also very new to Java, so perhaps my basic knowledge of this environment is at fault.
Anyhow, I'm trying to use boilerpipe to extract the main article from some raw HTML source I have collected. The HTML source text is stored in a java.lang.String variable (let's call it htmlstr) that holds the raw HTML contents of a webpage.
I know how to run boilerpipe to print the extracted text to the output window as follows:
java.lang.String htmlstr = "<!DOCTYPE.... ****html source**** ... </html>";
java.lang.String article = ArticleExtractor.INSTANCE.getText(htmlstr);
System.out.println(article);
However, I'm not sure how to run BP by first instantiating an instance of the ArticleExtractor class, then calling it with the 'TextDocument' input datatype. The TextDocument datatype is itself somehow constructed from BP's 'TextBlock' datatype, and perhaps I am not doing this correctly...
What is the proper way to construct a TextDocument type variable from my htmlstr string variable?
So my problem is in using the processing method of BP's ArticleExtractor class, as opposed to just calling the ArticleExtractor getText method as in the example above. In other words, I'm not sure how to use the
ArticleExtractor.process(TextDocument doc);
method.
It is my understanding that one is required to run this ArticleExtractor process method first, so that the same "TextDocument doc" variable can then be used for getting document stats via BP's
TextDocumentStatistics(TextDocument doc, boolean contentOnly)
constructor? I would like to use the stats to estimate how good the filtering was.
Any code examples someone could help me out with?

Code written in Jython (conversion to Java should be easy)
1) How to get TextDocument from a HTML String:
import org.xml.sax.InputSource as InputSource
import de.l3s.boilerpipe.sax.HTMLDocument as HTMLDocument
import de.l3s.boilerpipe.document.TextDocument as TextDocument
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput as BoilerpipeSAXInput
import de.l3s.boilerpipe.extractors.ArticleExtractor as ArticleExtractor
import de.l3s.boilerpipe.estimators.SimpleEstimator as SimpleEstimator
import de.l3s.boilerpipe.document.TextDocumentStatistics as TextDocumentStatistics
import de.l3s.boilerpipe.document.TextBlock as TextBlock
htmlDoc = HTMLDocument(rawHtmlString)
inputSource = htmlDoc.toInputSource()
boilerpipeSaxInput = BoilerpipeSAXInput(inputSource)
textDocument = boilerpipeSaxInput.getTextDocument()
2) How to process TextDocument using Article Extractor (continued from above)
content = ArticleExtractor.INSTANCE.getText(textDocument)
3) How to get TextDocumentStatistics (continued from above)
content_list = [] #replace python 'List' Object with ArrayList in java
content_list.append(TextBlock(content)) #replace with arrayList.add(TextBlock(content))
content_td = TextDocument(content_list)
content_stats = TextDocumentStatistics(content_td, True)#True for article content statistics only
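4) How to compare statistics before and after filtering (continued from above). This is a hedged sketch rather than tested code: process() and the accessor getNumWords() are taken from the boilerpipe 1.2 javadocs, so double-check the names there. Run it on a freshly built textDocument from step 1, since process() marks the content blocks in place:
stats_before = TextDocumentStatistics(textDocument, False)  # statistics over all text blocks
ArticleExtractor.INSTANCE.process(textDocument)             # marks the article content blocks in place
stats_after = TextDocumentStatistics(textDocument, True)    # statistics over content blocks only
print "words in page:   ", stats_before.getNumWords()
print "words in article:", stats_after.getNumWords()
Comparing the two counts gives a rough measure of how much of the page was kept as article content.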
Note: The Javadocs accompanying the boilerpipe 1.2 jar should be somewhat useful for future reference

Related

Not able to access certain JSON properties in Autoloader

I have a JSON file that is loaded by two different Autoloaders.
One uses schema evolution and, besides replacing spaces in the JSON property names, writes the JSON directly to a Delta table; I can see all the values are there properly.
In the second one I am mapping to a defined schema and only use a subset of properties, so I use a lot of withColumn calls and then a select to narrow down to my defined column list.
Autoloader definition:
df = (spark
    .readStream
    .format('cloudFiles')
    .option('cloudFiles.format', 'json')
    .option('multiLine', 'true')
    .option('cloudFiles.schemaEvolutionMode', 'rescue')
    .option('cloudFiles.includeExistingFiles', 'true')
    .option('cloudFiles.schemaLocation', bronze_schema)
    .option('cloudFiles.inferColumnTypes', 'true')
    .option('pathGlobFilter', '*.json')
    .load(upload_path)
    .transform(lambda df: remove_spaces_from_columns(df))
    .withColumn(...
Writer:
df.writeStream.format('delta') \
    .queryName(al_stream_name) \
    .outputMode('append') \
    .option('checkpointLocation', checkpoint_path) \
    .option('mergeSchema', 'true') \
    .trigger(once=True) \
    .table(bronze_table)
The issue is that some of the source columns load OK and I get their values, while others are constantly null in the output table.
For example:
.withColumn('vl_rating', col('risk_severity.value')) # works
.withColumn('status', col('status.name')) # always null
...
.select(
    'rating',
    'status',
    ...
The JSON is quite simple; these are all string values and they are always populated. The same code works against another similar JSON file in another Autoloader without issue.
I have run out of ideas for fault-finding on this. My imports are minimal, and outside of Autoloader the JSON loads fine, e.g.:
%python
import pyspark.sql.functions as psf
jsontest = spark.read.option('inferSchema','true').json('dbfs:....json')
df = jsontest.withColumn('status', psf.col('status.name')).select('status')
display(df)
This results in the values of the status.name property of the JSON file.
Any ideas would be greatly appreciated.
I have found generally what is causing this. Interesting cause!
I am scanning a whole directory of JSON files, and the schema evolves over time (as expected). But when I clear out the Autoloader schema and checkpoint directories and only scan the latest JSON file, it all works correctly.
So what I surmise is that something in schema evolution with the older JSON files causes Autoloader to get into a state where it will not put certain properties into the stream to the writer.
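One way to check this surmise (a hedged sketch, not verified against this data): with cloudFiles.schemaEvolutionMode set to 'rescue', fields that do not match the tracked schema are not silently dropped but written into the _rescued_data column as a JSON string. Inspecting that column in the bronze table (assuming it is carried through the select) shows whether the "null" properties were actually rescued:
import pyspark.sql.functions as psf

rescued = (spark.read.table(bronze_table)   # the table written by the writer above
    .where(psf.col('_rescued_data').isNotNull())
    .select('_rescued_data')
    .limit(20))
display(rescued)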
If anyone has any recommendation on how to implement some data-quality analysis in an Autoloader, I would be most appreciative if you would share.

Extract Hyperlink from a spool pdf file in Python

I am getting my form data from the frontend and reading it using FastAPI as shown below:
@app.post("/file_upload")
async def upload_file(pdf: UploadFile = File(...)):
    print("Content = ", pdf.content_type, pdf.filename, pdf.spool_max_size)
    return {"filename": "Succcess"}
Now what I need to do is extract hyperlinks from these spooled files with the help of pypdfextractor as shown below:
import pdfx
from os.path import exists
from config import availableUris

def getHrefsFromPDF(pdfPath: str) -> dict:
    if not exists(pdfPath):
        raise FileNotFoundError("PDF File not Found")
    pdf = pdfx.PDFx(pdfPath)
    return pdf.get_references_as_dict().get('url', [])
But I am not sure how to convert the spooled file (received from FastAPI) into a format pdfx can read.
Additionally, I also tried to study the bytes that come out of the file. When I try to do this:
data = await pdf.read()
The data type shows as bytes. When I try to convert it using the str function, it gives a Unicode-encoded string which is total gibberish to me. I also tried to decode it using "utf-8", which throws a UnicodeDecodeError.
FastAPI gives you a SpooledTemporaryFile. You may be able to use that file object directly if there is some API in pdfx which will work on a file object rather than a str representing a path (!). Otherwise, make a new temporary file on disk and work with that:
from tempfile import TemporaryDirectory
from pathlib import Path
from fastapi import File, UploadFile
import pdfx

@app.post("/file_upload")
async def upload_file(pdf: UploadFile = File(...)):
    # Add the file into temporary storage for re-reading purposes
    with TemporaryDirectory() as d:
        tmpf = Path(d) / "pdf.pdf"
        with tmpf.open("wb") as f:
            f.write(await pdf.read())
        p = pdfx.PDFx(str(tmpf))
        ...
It may be that pdfx.PDFx will take a Path object. I'll update this answer if so. I've kept the write to disk synchronous for ease, but you can make it asynchronous if there is a reason to do so.
Note that it would be better to find a way of doing this with the SpooledTemporaryFile.
As to your data showing as bytes: well, pdfs are (basically) binary files: what did you expect?
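Putting the two pieces together, here is a hedged sketch of a complete endpoint reusing the getHrefsFromPDF idea from the question (the response shape is illustrative, not tested):
from tempfile import TemporaryDirectory
from pathlib import Path
from fastapi import FastAPI, File, UploadFile
import pdfx

app = FastAPI()

@app.post("/file_upload")
async def upload_file(pdf: UploadFile = File(...)):
    # Persist the spooled upload to disk so pdfx can open it by path.
    with TemporaryDirectory() as d:
        tmpf = Path(d) / "pdf.pdf"
        tmpf.write_bytes(await pdf.read())
        urls = pdfx.PDFx(str(tmpf)).get_references_as_dict().get('url', [])
    return {"filename": pdf.filename, "urls": urls}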

Documenter.jl: @ref a specific method of a function

Let's say I have two methods
"""
f(x::Integer)
Integer version of `f`.
"""
f(x::Integer) = println("I'm an integer!")
"""
f(x::Float64)
Float64 version of `f`.
"""
f(x::Float64) = println("I'm floating!")
and produce doc entries for those methods in my documentation using Documenter.jl's @autodocs or @docs.
How can I refer (with #ref) to one of the methods?
I'm searching for something like [integer version](@ref f(::Integer)) (which unfortunately does not work) rather than just [function f](@ref f) or [f](@ref).
Note that for generating the doc entries @docs has a similar feature. From the guide page:
[...] include the type in the signature with
```@docs
length(::T)
```
Thanks in advance.
x-ref: https://discourse.julialang.org/t/documenter-jl-ref-a-specific-method-of-a-function/8792
x-ref: https://github.com/JuliaDocs/Documenter.jl/issues/569#issuecomment-362760811
As pointed out by @mortenpi on Discourse and GitHub:
You would normally refer to a function with [`f`](@ref), with the name
of the function being referred to between backticks in the text part of
the link. You can then also refer to specific signatures, e.g. with
[`f(::Integer)`](@ref).
The @ref section in the docs should be updated to mention this
possibility.

Sequencefiles which map a single key to multiple values

I am trying to do some preprocessing on data that will be fed to LucidWorks Big Data for indexing. LWBD accepts SolrXML in the form of SequenceFile files. I want to create a Pig script which will take all the SolrXML files in a directory and output them in the format
filename_1 => <here goes some XML>
...
filename_N => <here goes some more XML>
Pig's native PigStorage() load function can automatically create a column that includes the name of the file from which the data was extracted, which ideally would look like this:
{"filename_1", "<here goes some XML>"}
...
{"filename_N", "<here goes some more XML>"}
However, PigStorage() also automatically uses '\n' as a line delimiter, so what I actually end up with is a bag that looks like this:
{"filename_1", "<some partial XML from file 1>"}
{"filename_1", "<some more partial XML from file 1>"}
{"filename_1", "<the end of file 1>"}
...
I'm sure you get the picture. My question is, if I were to write this bag to a SequenceFile, how would it be read by other applications? Could it be combined as
"filename_1" => "<some partial XML from file 1>
<some more partial XML from file 1>
<the end of file 1>"
, by the default handling of the application I feed it to? Or is there some post-processing that I can do to get it into this format? Thank you for your help.
Since I can't find anything about a builtin SequenceFile writer, I'm assuming you are using a UDF (and if you aren't, then you need to).
You'll have to group the files (by filename) ahead of time, and then send that to the writer UDF.
DESCRIBE xml ;
-- xml: {filename: chararray, xml_data: chararray}
B = FOREACH (GROUP xml BY filename)
    GENERATE group AS filename, xml.xml_data AS all_xml_data ;
Depending on how you have written the SequenceFile writer, it may be easier to convert the all_xml_data bag ahead of time to a chararray using a Python UDF like:
@outputSchema('xml_complete: chararray')
def stringify(bag):
    delim = ''
    return delim.join(bag)
NOTE: It is important to realize that this way the order of the XML data will become jumbled. If possible based on your data, stringify can maybe be expanded to reorganize it.

Return a list of internationalized names of languages in Plone 4

By querying the portal_languages tool I can get a list of language names:
>>> from Products.CMFPlone.utils import getToolByName
>>> ltool = getToolByName(context, 'portal_languages')
>>> language_names = [name for code, name in ltool.listAvailableLanguages()]
[u'Abkhazian', u'Afar', u'Afrikaans', u'Albanian', u'Amharic', (...)
But how can I return a list of localized language names?
[EDIT] What I want is the list of language names in the language of the current user, as shown in @@language-controlpanel. See: http://i.imgur.com/rGfjG.png
If you want translated language names in many different languages, install Babel (http://pypi.python.org/pypi/Babel). There's good documentation on it, for example http://packages.python.org/Babel/display.html:
>>> from babel import Locale
>>> locale = Locale('de', 'DE')
>>> locale.languages['ja']
u'Japanisch'
Plone only includes native and English language names. The zope.i18n package has some of this data, but it's really incomplete and outdated, so Babel is your best bet.
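For example, a short sketch (untested; it assumes ltool from the question and German as the target UI language, and falls back to the untranslated name when Babel has no entry for a code):
>>> from babel import Locale
>>> ui_locale = Locale('de')
>>> localized_names = [ui_locale.languages.get(code, name)
...                    for code, name in ltool.listAvailableLanguages()]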
Use the listAvailableLanguageInformation() method instead:
>>> from Products.CMFPlone.utils import getToolByName
>>> ltool = getToolByName(context, 'portal_languages')
>>> native_language_names = [entry[u'native']
... for entry in ltool.listAvailableLanguageInformation()]
[u'Afrikaans', u'Aymara', u'Az\u0259ri T\xfcrk\xe7\u0259si', u'Bahasa Indonesia', ...]
Note that the @@language-controlpanel view uses the zope.i18n.locales module to provide translated languages; but that list is so incomplete that the languages list is not translated for most UI languages. Apparently Italian is one language where this is translated.
You can reach the locales structure via the request, or via the @@plone_state view. The locales.displayNames.languages dictionary maps a language code (2 letters) to the local language name:
>>> from Products.CMFPlone.utils import getToolByName
>>> ltool = getToolByName(context, 'portal_languages')
>>> languages = request.locales.displayNames.languages
>>> language_names = [languages.get(code, name) for code, name in ltool.listAvailableLanguages()]
[u'abkhazian', u'afar', u'afrikaans', u'albanese', u'amarico', ...]
As you can see, the language names are lowercased, not properly capitalized. Also, the data is expensive to parse (the package contains XML files parsed on first access) so it can take several moments before this data is available to you on first access.
Your best bet would be to use Babel, as Hanno states, as it actually has far more current information available, and not just for a handful of languages.
Thanks to Martijn's help I was able to solve the issue. This is the final working code that will generate the vocabulary of translated language names. Very useful if you want to make a localized selection field such as the one found in the language control panel.
from Products.CMFCore.interfaces import ISiteRoot
from zope.component import getMultiAdapter
from zope.site.hooks import getSite
from zope.globalrequest import getRequest
from zope.schema.interfaces import IContextSourceBinder
from zope.schema.vocabulary import SimpleVocabulary, SimpleTerm
from five import grok  # assuming five.grok, as implied by the decorator below

@grok.provider(IContextSourceBinder)
def languages(context):
    """
    Return a vocabulary of language codes and
    translated language names.
    """
    # z3c.form KSS inline validation hack
    if not ISiteRoot.providedBy(context):
        for item in getSite().aq_chain:
            if ISiteRoot.providedBy(item):
                context = item
    # retrieve the localized language names
    request = getRequest()
    portal_state = getMultiAdapter((context, request), name=u'plone_portal_state')
    lang_items = portal_state.locale().displayNames.languages.items()
    # build the vocabulary
    return SimpleVocabulary(
        [SimpleTerm(value=lcode, token=lcode, title=lname)
         for lcode, lname in lang_items]
    )
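A hedged usage sketch (the field name below is hypothetical, not from the original code): because languages is declared as an IContextSourceBinder, it can serve as the source of a Choice field in a schema:
from zope import schema

language = schema.Choice(
    title=u"Language",
    source=languages,
    required=False,
)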
