MLCP import Exception - xquery

I have a 1 GB CSV file. I tried to import it as a binary file using MarkLogic mlcp-8.0.6:
./mlcp.sh import -mode local -options_file (portname, portno, username, password) -input_file_path "/input_path/file.csv" -document_type binary -output_uri_replace "input_path,'/output_path/'"
but it threw the exception below:
INFO contentpump.LocalJobRunner: Content type: BINARY
INFO contentpump.FileAndDirectoryInputFormat: Total input paths to process: 1
INFO contentpump.LocalJobRunner: completed 0%
INFO contentpump.LocalJobRunner: completed 100%
ERROR contentpump.MultithreadedMapper: java.lang.NegativeArraySizeException
Could anyone explain why this happens? Is it due to the size of the file? When I tried to load the same file using the xdmp:document-load function, it loaded without a problem:
xdmp:document-load('input_path/file.csv',
  <options xmlns="xdmp:document-load">
    <uri>output_path</uri>
    <format>binary</format>
  </options>
)
Thanks.

Related

Extract hyperlinks from a spooled PDF file in Python

I am getting form data from the frontend and reading it with FastAPI as shown below:
from fastapi import FastAPI, File, UploadFile  # imports added for completeness

app = FastAPI()

@app.post("/file_upload")
async def upload_file(pdf: UploadFile = File(...)):
    print("Content =", pdf.content_type, pdf.filename, pdf.spool_max_size)
    return {"filename": "Success"}
Now what I need to do is extract hyperlinks from these spooled files with the help of pdfx, as shown below:
import pdfx
from os.path import exists
from config import availableUris

def getHrefsFromPDF(pdfPath: str) -> dict:
    if not exists(pdfPath):
        raise FileNotFoundError("PDF File not Found")
    pdf = pdfx.PDFx(pdfPath)
    return pdf.get_references_as_dict().get('url', [])
But I am not sure how to convert the spooled file (received from FastAPI) into a format pdfx can read.
Additionally, I tried to study the bytes that come out of the file. When I do this:
data = await pdf.read()
the type of data shows as bytes. When I try to convert it using str() I get an escaped string that is total gibberish to me; I also tried to decode it using "utf-8", which throws a UnicodeDecodeError.
FastAPI gives you a SpooledTemporaryFile. You may be able to use that file object directly if there is some API in pdfx that works on a file object rather than a str representing a path (!). Otherwise, make a new temporary file on disk and work with that:
from tempfile import TemporaryDirectory
from pathlib import Path
import pdfx

@app.post("/file_upload")
async def upload_file(pdf: UploadFile = File(...)):
    # Write the upload into temporary storage so pdfx can re-read it from disk
    with TemporaryDirectory() as d:
        tmpf = Path(d) / "pdf.pdf"
        with tmpf.open("wb") as f:
            # pdf.file is the underlying SpooledTemporaryFile, read synchronously
            f.write(pdf.file.read())
        p = pdfx.PDFx(str(tmpf))
        ...
It may be that pdfx.PDFx will take a Path object; I'll update this answer if so. I've kept the read-write loop synchronous for ease (reading the underlying pdf.file directly), but you can make it asynchronous (await pdf.read()) if there is a reason to do so.
Note that it would be better to find a way of doing this with the SpooledTemporaryFile.
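If you can switch libraries, some PDF parsers accept file-like objects directly, which lets you work with the SpooledTemporaryFile itself. Here is a minimal sketch using pypdf rather than pdfx, extracting only /URI link annotations, so treat both the library choice and the helper name as assumptions:

from pypdf import PdfReader

def hrefs_from_file(fileobj) -> list:
    """Collect /URI link targets from a binary file-like object (hypothetical helper)."""
    reader = PdfReader(fileobj)
    urls = []
    for page in reader.pages:
        if "/Annots" not in page:
            continue
        for annot in page["/Annots"]:
            # Link annotations carry an action dictionary (/A) whose
            # /URI entry is the hyperlink target.
            action = annot.get_object().get("/A")
            if action is not None:
                uri = action.get_object().get("/URI")
                if uri is not None:
                    urls.append(str(uri))
    return urls

Inside the endpoint this would be called as hrefs_from_file(pdf.file), with no temporary file at all.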
As to your data showing as bytes: well, PDFs are (basically) binary files: what did you expect?

Python: Saving multiple images into a folder using requests

I need to save all 6 images into a local folder.
The script I found re-writes a single file multiple times and ends up producing only 1 image:
import requests

img_list = ["https://ae01.alicdn.com/kf/HTB1tT70vhuTBuNkHFNRq6A9qpXa3.jpg", "https://ae01.alicdn.com/kf/HTB12HGkvwKTBuNkSne1q6yJoXXaR.jpg", "https://ae01.alicdn.com/kf/HTB1_yDic56guuRjy0Fmq6y0DXXaY.jpg", "https://ae01.alicdn.com/kf/HTB1RopgXffsK1RjSszgq6yXzpXa5.jpg", "https://ae01.alicdn.com/kf/HTB1R6sJXgHqK1RjSZFkq6x.WFXaF.jpg", "https://ae01.alicdn.com/kf/HTB1_XlhXojrK1RkHFNRq6ySvpXaR.jpg"]

for x in img_list:
    with open('/Users/reezalaq/PycharmProjects/wholesale/img/pic1.jpg', 'wb') as handle:
        response = requests.get(x, stream=True)
        if not response.ok:
            print(response)
        for block in response.iter_content(1024):
            if not block:
                break
            handle.write(block)
It needs to save all 6 images separately. No error message so far.
The script rewrites the same file each time because you're using the same file name; it never changes.
The problem is here:
with open('/Users/reezalaq/PycharmProjects/wholesale/img/pic1.jpg', 'wb')
The first argument of open() is the file path. The second argument is the mode, which you have set to 'wb', or write/binary. So in your loop you rewrite the contents of pic1.jpg every time (see https://docs.python.org/3.5/library/functions.html#open).
You can pre-define a list of filenames and use those, or do something more dynamic, like:
for img in img_list:
    file_name = img.split('/')[-1]
    with open(file_name, 'wb') as handle:
        ....
This grabs the image's file name from the URL you're downloading it from (e.g., HTB1tT70vhuTBuNkHFNRq6A9qpXa3.jpg for the first URL) and uses it as the file name on your system. (Note: this assumes names will be unique.)
Edit:
You can define your folder path before the for loop, then change the open() call to include that path. So:
import os  # do this at the top of your file

folder_path = '/Users/reezalaq/PycharmProjects/wholesale/img/'
for img in img_list:
    file_name = img.split('/')[-1]
    with open(os.path.join(folder_path, file_name), 'wb') as handle:
        ....
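Putting the pieces together, a minimal complete version might look like this (same URLs and folder as above):

import os
import requests

img_list = [
    "https://ae01.alicdn.com/kf/HTB1tT70vhuTBuNkHFNRq6A9qpXa3.jpg",
    "https://ae01.alicdn.com/kf/HTB12HGkvwKTBuNkSne1q6yJoXXaR.jpg",
    "https://ae01.alicdn.com/kf/HTB1_yDic56guuRjy0Fmq6y0DXXaY.jpg",
    "https://ae01.alicdn.com/kf/HTB1RopgXffsK1RjSszgq6yXzpXa5.jpg",
    "https://ae01.alicdn.com/kf/HTB1R6sJXgHqK1RjSZFkq6x.WFXaF.jpg",
    "https://ae01.alicdn.com/kf/HTB1_XlhXojrK1RkHFNRq6ySvpXaR.jpg",
]
folder_path = '/Users/reezalaq/PycharmProjects/wholesale/img/'

for img in img_list:
    file_name = img.split('/')[-1]  # last URL segment becomes the file name
    response = requests.get(img, stream=True)
    if not response.ok:
        print(response)
        continue  # skip failed downloads instead of writing empty files
    with open(os.path.join(folder_path, file_name), 'wb') as handle:
        for block in response.iter_content(1024):
            handle.write(block)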

where can I download the ispell *.dict and *.affix files?

I am quite new to PostgreSQL full text search and I am setting up the configuration as follows (exactly as in the docs):
CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE = ispell,
    DictFile = english,
    AffFile = english,
    StopWords = english
);
So this, I think, expects the files english.dict and english.affix in, for example:
/usr/share/postgresql/9.2/tsearch_data
But these files are not there. I just have ispell_sample.dict and ispell_sample.affix, which work fine when used above.
So... I followed this post, downloaded the required dictionary from the OpenOffice people, and renamed the .dic file to .dict and the .aff file to .affix. I then checked (using file -bi english.affix and file -bi english.dict) that both are UTF-8 encoded.
When I run the above CREATE TEXT SEARCH DICTIONARY statement, I get the error:
ERROR: wrong affix file format for flag
CONTEXT: line 2778 of configuration file "/usr/share/postgresql/9.2/tsearch_data/english.affix": "COMPOUNDMIN 1
"
I was wondering if anyone has clues on how to solve this problem, or has encountered it before.
Thanks.
UPDATE 1: I guess the question can be rephrased as follows:
where can I download the ispell *.dict and *.affix file for postgres
Here's a good reference: https://www.cs.hmc.edu/~geoff/ispell-dictionaries.html. It collects ispell dictionaries for many languages.

Plone: TypeError: Can't pickle objects in acquisition wrappers

I am using / fixing collective.logbook to save errors on the site. Currently logbook fails on my site on some exceptions:
File "/srv/plone/xxx/src/collective.logbook/collective/logbook/events.py", line 101, in hand
transaction.commit()
File "/srv/plone/buildout-cache/eggs/transaction-1.1.1-py2.6.egg/transaction/_manager.py", line 8
return self.get().commit()
File "/srv/plone/buildout-cache/eggs/transaction-1.1.1-py2.6.egg/transaction/_transaction.py", li
self._commitResources()
File "/srv/plone/buildout-cache/eggs/transaction-1.1.1-py2.6.egg/transaction/_transaction.py", li
rm.commit(self)
File "/srv/plone/buildout-cache/eggs/ZODB3-3.10.5-py2.6-linux-x86_64.egg/ZODB/Connection.py", lin
self._commit(transaction)
File "/srv/plone/buildout-cache/eggs/ZODB3-3.10.5-py2.6-linux-x86_64.egg/ZODB/Connection.py", lin
self._store_objects(ObjectWriter(obj), transaction)
File "/srv/plone/buildout-cache/eggs/ZODB3-3.10.5-py2.6-linux-x86_64.egg/ZODB/Connection.py", lin
p = writer.serialize(obj) # This calls __getstate__ of obj
File "/srv/plone/buildout-cache/eggs/ZODB3-3.10.5-py2.6-linux-x86_64.egg/ZODB/serialize.py", line
return self._dump(meta, obj.__getstate__())
File "/srv/plone/buildout-cache/eggs/ZODB3-3.10.5-py2.6-linux-x86_64.egg/ZODB/serialize.py", line
self._p.dump(state)
TypeError: Can't pickle objects in acquisition wrappers.
This is obviously because logbook tries to write a record of the error which refers to an acquired object. I assume the solution is to strip these kinds of objects from the error record.
However, how can I figure out which object is the bad one, how it ends up in the transaction manager, and which Python object references cause the issue? Or anything else that could help me debug this?
If you can reproduce this reliably, you can put a print statement or pdb.set_trace() in the ZODB Connection._register method (in ZODB/Connection.py inside the ZODB egg):
def _register(self, obj=None):
    # ... skipped lines ...
    if obj is not None:
        self._registered_objects.append(obj)
        # Insert print statement here.
Now whenever any object has been marked as changed or is added to the connection as a new object, it'll be printed to the console. That should help you with the debugging process. Good luck!
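If editing the egg is awkward, a hypothetical monkey-patch from a debugging script achieves the same thing; the names mirror the ZODB code above, and this is a sketch rather than a tested patch:

import traceback
from ZODB.Connection import Connection

_orig_register = Connection._register

def _register(self, obj=None):
    # Log every object registered with the connection, plus the call
    # stack that registered it, then defer to the original method.
    if obj is not None:
        print("registered with connection:", repr(obj))
        traceback.print_stack()
    return _orig_register(self, obj)

Connection._register = _register

The printed stacks should point straight at the code handing the acquisition-wrapped object to the transaction.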

Mxmlc generates different binary on same source

I'm compiling a single .as file into a SWF using mxmlc.
Whenever I run mxmlc, the result differs in size even though the source code is unchanged.
For example,
// Test.as
package
{
    public class Test
    {
    }
}
And I generate the .swf using mxmlc:
mxmlc Test.as
The resulting size varies between 461 and 465 bytes.
I suppose it's because of timestamp-like things in the compiler, but I could not find out how to fix or disable them. Any ideas on generating the "same binary from same source"? Thanks!
Finally, I found that the metadata tag (Tag Type = 77) and the undocumented 'product info' tag (Tag Type = 41) both contain the compilation time.
I succeeded in removing the timestamps with the following steps (steps 1 and 3 are sketched below):
1. open the SWF and un-zlib it
2. clear the timestamps in the metadata tag and the product info tag
3. re-zlib it and write a new .swf
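A minimal Python sketch of the un-zlib / re-zlib steps, assuming a zlib-compressed ('CWS') SWF; the tag editing of step 2 would operate on body between the two calls:

import zlib

def read_swf_body(path):
    # A SWF starts with a 3-byte signature ('FWS' = uncompressed,
    # 'CWS' = zlib-compressed), a version byte, and a 4-byte
    # uncompressed length; in a 'CWS' file everything after these
    # 8 header bytes is a single zlib stream.
    with open(path, "rb") as f:
        header = f.read(8)
        assert header[:3] == b"CWS", "expected a zlib-compressed SWF"
        return header, zlib.decompress(f.read())

def write_swf(path, header, body):
    # Re-zlib the (edited) body behind the original 8-byte header.
    with open(path, "wb") as f:
        f.write(header)
        f.write(zlib.compress(body))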
But I'm not happy with that, as it needs extra work on the SWF file. I want to find an easier way; maybe there is a 'bypass product info' option in mxmlc.
You can find more information on the SWF file structure and the metadata tag at http://www.adobe.com/devnet/swf.html, and on the product info tag at http://wahlers.com.br/claus/blog/undocumented-swf-tags-written-by-mxmlc/
You need to override the metadata the compiler writes into the resulting SWF file. You can do this with the -raw-metadata compiler argument.
Usage:
mxmlc -raw-metadata <XML_String> Test.as
Example:
mxmlc -raw-metadata '' Test.as
(The resulting SWF is always 190 bytes.)
There are two timestamps to remove:
1. The date in the metadata tag: for mxmlc, replace the default
<metadata date=" " />
with an empty
<raw-metadata></raw-metadata>
2. The timestamp in the ProductInfo tag: download the SDK source code, modify ProductInfo.java so the timestamp stays the same, and then update ProductInfo.class in your_sdk_dir\lib\swfutils.jar.
However, even after doing this, mxmlc still generates different binaries from the same source. I think that is because I can't change the compiler's link order.
