CopyError - OFS.CopySupport.manage_pasteObjects limited to ~ <160 objects? - plone

I'm using a view to archive old content by moving it into another folder.
(a catalog search for enddate more than N months ago, passing the resulting ids into the following command):
target.manage_pasteObjects( source.manage_cutObjects(idsToArchive) )
One or two years ago moving about 800 or even more objects was no problem.
Today I need to limit the catalog search to around 80 items, otherwise I get:
Module OFS.CopySupport, line 193, in manage_pasteObjects
CopyError:
The data in the clipboard could not be read, possibly due to cookie data being truncated by your web browser. Try copying fewer objects.
Running Plone 4.1.6 / Zope2-2.13.15.
I have already tried deactivating the beaker session data manager (still the same problem).

You have installed the latest Plone hotfix, 20130618. It includes a DDoS-prevention measure limiting the size of the __cp cookie data to 8 KB (decompressed).
Future Zope versions will also include this fix.
To work around this temporarily, your only option is to increase the maximum-size default. Doing so will allow other threads to use larger cookies as well, until you restore the default:
from OFS.CopySupport import _cb_decode

# remember the hardened default so it can be restored later
_default_maxsize = _cb_decode.func_defaults[0]

def _increase_maxsize(newsize):
    # Patch the maxsize default so larger clipboard cookies can be decoded
    _cb_decode.func_defaults = (newsize,)

def _restore_maxsize():
    # Restore the original maxsize default
    _cb_decode.func_defaults = (_default_maxsize,)
The cookie data consists almost entirely of object paths (absolute paths as tuples) as marshal dumps; you'll have to estimate a suitable maximum size from that.
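For example, a minimal usage sketch of how the archive view could wrap the move with the two helpers above (the 65536 value is just an illustrative guess, not a recommendation; estimate a real maximum from your path sizes):
_increase_maxsize(65536)  # illustrative value only
try:
    target.manage_pasteObjects(source.manage_cutObjects(idsToArchive))
finally:
    # always put the hardened default back, even if the paste fails
    _restore_maxsize()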

Related

Getting data from a direct mapped cache in prolog

The predicate getDataFromCache(StringAddress,Cache,Data,HopsNum,directMap,BitsNum)
should succeed when the Data is successfully retrieved from the Cache (cache hit)
and the HopsNum represents the number of hops required to access the data from
the cache, which can differ according to the direct-mapped cache mapping technique, such that:
• StringAddress is a string of the binary number that represents the address of the data you are required to retrieve; it is six binary bits.
• Cache is the cache, using the representation discussed previously.
• Data is the data retrieved from the cache when a cache hit occurs.
• HopsNum is the number of hops required to access the data from the cache.
• BitsNum is the number of bits the index needs.
getDataFromCache always gives me false although everything seems to be working, so I would like someone to fix it:
convertAddress(Binary,N,Tag,Idx,directMap):-
    Idx is mod(Binary,10**N),
    Tag is Binary // 10**N.

getDataFromCache(SA,[item(tag(T),data(D),V,_)|T],Data,HopsNum,directMap,BitsNum):-
    convertAddress(SA,BitsNum,Tag,Idx,directMap),
    number_string(Tag,Z),
    Z==T,
    V==1,
    Data is D.

getDataFromCache(SA,[item(tag(T),data(D),V,_)|T],Data,HopsNum,directMap,BitsNum):-
    convertAddress(SA,BitsNum,Tag,Idx,directMap),
    number_string(Tag,Z),
    (Z\=T;V==0),
    getDataFromCache(SA,T,Data,HopsNum,directMap,BitsNum).
Simply put, HopsNum is always zero here, and you don't have to traverse the list, since the mapping is direct: you can access the entry with the nth0 predicate.
Also, you are using the variable T twice, as both the tag and the tail of the list.

Understanding elasticsearch circuit_breaking_exception

I am trying to figure out why I am getting this error when indexing a document from a python web app.
The document in this case is a base64 encoded string of a file of size 10877 KB.
I post it to my web app, which then posts it via elasticsearch.py to my elastic instance.
My elastic instance throws an error:
TransportError(429, 'circuit_breaking_exception', '[parent] Data
too large, data for [<http_request>] would be
[1031753160/983.9mb], which is larger than the limit of
[986932838/941.2mb], real usage: [1002052432/955.6mb], new bytes
reserved: [29700728/28.3mb], usages [request=0/0b,
fielddata=0/0b, in_flight_requests=29700728/28.3mb,
accounting=202042/197.3kb]')
I am trying to understand why my 10877 KB file ends up at a size of 983mb as reported by elastic.
I understand that increasing the JVM max heap size may allow me to send bigger files, but I am mainly wondering why the request size appears to be 10x the size of what I am expecting.
Let us see what we have here, step by step:
[parent] Data too large, data for [<http_request>]
gives the name of the circuit breaker
would be [1031753160/983.9mb],
shows what the heap usage would be if the request were executed
which is larger than the limit of [986932838/941.2mb],
tells us the current setting of the circuit breaker above
real usage: [1002052432/955.6mb],
this is the real usage of the heap
new bytes reserved: [29700728/28.3mb],
actually an estimation of the impact the request will have (the size of the data structures that need to be created in order to process the request). Your ~10 MB file will probably consume 28.3 MB.
usages [
request=0/0b,
fielddata=0/0b,
in_flight_requests=29700728/28.3mb,
accounting=202042/197.3kb
]
This last block tells us how the estimation is being calculated. In other words, the 983.9 MB figure is not the size of your request; it is the current real heap usage (955.6 MB) plus the estimated overhead of this request (28.3 MB), and that sum exceeds the parent breaker's limit (941.2 MB).
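As a quick sanity check on those figures (plain Python arithmetic with the byte counts from the error message):
# the "would be" figure is simply current real heap usage plus the new reservation
real_usage = 1002052432   # bytes: "real usage"
new_bytes  = 29700728     # bytes: "new bytes reserved"
limit      = 986932838    # bytes: the parent breaker limit

would_be = real_usage + new_bytes
print(would_be)           # 1031753160 bytes, i.e. the ~983.9 MB "would be" figure
print(would_be > limit)   # True -> the parent circuit breaker trips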

Blazor preview 9/mono-wasm memory access out of bounds: max string size for DotNet.invokeMethod?

Since dotnet core 3 preview 9, I am facing an issue invoking a dotnet method passing a large string from JavaScript.
Code is worth more than a thousand words, so the snippet below reproduces the issue. It works when length = 1 * mb but fails when length = 2 * mb.
#page "/repro"
<button onclick="const mb = 1024 * 1024; const length = 2 * mb;console.log(`Attempting length ${length}`); DotNet.invokeMethod('#GetType().Assembly.GetName().Name', 'ProcessString', 'a'.repeat(length));">Click Me</button>
#functions {
[JSInvokable] public static void ProcessString(string stringFromJavaScript) { }
}
The error message is:
Uncaught RuntimeError: memory access out of bounds
at wasm-function[2639]:18
at wasm-function[6239]:10
at Module._mono_wasm_string_from_js (http://localhost:52349/_framework/wasm/mono.js:1:202444)
at ccall (http://localhost:52349/_framework/wasm/mono.js:1:7888)
at http://localhost:52349/_framework/wasm/mono.js:1:8238
at Object.toDotNetString (http://localhost:52349/_framework/blazor.webassembly.js:1:39050)
at Object.invokeDotNetFromJS (http://localhost:52349/_framework/blazor.webassembly.js:1:37750)
at u (http://localhost:52349/_framework/blazor.webassembly.js:1:5228)
at Object.e.invokeMethod (http://localhost:52349/_framework/blazor.webassembly.js:1:6578)
at HTMLButtonElement.onclick (<anonymous>:2:98)
I need to process large strings, which represent the content of a file.
Is there a way to increase this limit?
Apart from breaking down the string into multiple segments and performing multiple calls, is there any other way to process a large string?
Is there any other approach for processing large files?
This used to work in preview 8.
Is there a way to increase this limit?
No (unless you modify and recompile blazor and mono/wasm that is).
Apart from breaking down the string into multiple segments and performing multiple calls, is there any other way to process a large string?
Yes, since you are on the client side, you can use shared-memory techniques: you basically map a .NET byte[] onto a JavaScript ArrayBuffer. See this (disclaimer: my library) or this library for reference on how to do it. Those examples use the binary content of actual JavaScript Files, but the approach is applicable to strings as well. There is no reference documentation on these APIs yet; mostly just examples and the Blazor source code.
Is there any other approach for processing large files?
See point 2 above.
I recreated your issue in a netcore 3.2 Blazor app (somewhere between 1 and 2 MB of data kills it, just as you described). I updated the application to netcore 5.0 and the problem was fixed (it was still working when I threw 50 MB at it).

Java API to query CommonCrawl to populate Digital Object Identifier (DOI) Database

I am attempting to create a database of Digital Object Identifier (DOI) found on the internet.
By manually searching the CommonCrawl Index Server I have obtained some promising results.
However I wish to develop a programmatic solution.
This may mean that my process only needs to read the index files and not the underlying WARC data files.
The manual steps I wish to automate are these:
1) For each currently available CommonCrawl index collection:
2) I search "Search a url in this collection: (Wildcards -- Prefix: http://example.com/* Domain: *.example.com)", e.g. link.springer.com/*
3) This returns almost 6 MB of JSON data that contains approximately 22K unique DOIs.
How can I browse all available CommonCrawl indexes instead of searching for specific URLs?
From reading the API documentation for CommonCrawl I cannot see how I can browse all the indexes to extract all DOIs for all domains.
UPDATE
I found this example java code https://github.com/Smerity/cc-warc-examples/blob/master/src/org/commoncrawl/examples/S3ReaderTest.java
that shows how to access a common crawl dataset.
However, when I run it I receive this exception:
Exception in thread "main" org.jets3t.service.S3ServiceException: Service Error Message. -- ResponseCode: 404, ResponseStatus: Not Found, XML Error Message: <?xml version="1.0" encoding="UTF-8"?><Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>common-crawl/crawl-data/CC-MAIN-2016-26/segments/1466783399106.96/warc/CC-MAIN-20160624154959-00160-ip-10-164-35-72.ec2.internal.warc.gz</Key><RequestId>1FEFC14E80D871DE</RequestId><HostId>yfmhUAwkdNeGpYPWZHakSyb5rdtrlSMjuT5tVW/Pfu440jvufLuuTBPC25vIPDr4Cd5x4ruSCHQ=</HostId></Error>
In fact, every file I try to read results in the same error. Why is that?
What are the correct Common Crawl URIs for their datasets?
The data set location changed more than a year ago; see the announcement. However, many examples and libraries still contain the old pointers. You can access the index files for all crawls back to 2013 at s3://commoncrawl/cc-index/collections/CC-MAIN-YYYY-WW/indexes/cdx-00xxx.gz - replace YYYY-WW with the year and week of the crawl, and expand xxx to 000-299 to get all 300 index parts. New crawl data is announced on the Common Crawl group, or you can read more about how to access the data.
To get the example code to work replace lines 24 and 25 with:
String fn = "crawl-data/CC-MAIN-2013-48/segments/1386163035819/warc/CC-MAIN-20131204131715-00000-ip-10-33-133-15.ec2.internal.warc.gz";
S3Object f = s3s.getObject("commoncrawl", fn, null, null, null, null, null, null);
Also note that the commoncrawl group has an updated example.
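If you only need the index files rather than the WARC data, here is a minimal sketch of reading one index part, written in Python with boto3 (assumptions: boto3 is installed, anonymous access to the public commoncrawl bucket is sufficient, and the crawl id and domain prefix are placeholders you would replace):
import gzip
import io

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# unsigned (anonymous) client for the public commoncrawl bucket
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

crawl = "CC-MAIN-2013-48"   # placeholder crawl id (YYYY-WW)
part = 0                    # index parts run from cdx-00000.gz to cdx-00299.gz
key = "cc-index/collections/%s/indexes/cdx-%05d.gz" % (crawl, part)

# each part is large (hundreds of MB); for real use you would stream it rather
# than read it fully into memory as done here for simplicity
body = s3.get_object(Bucket="commoncrawl", Key=key)["Body"].read()
with gzip.GzipFile(fileobj=io.BytesIO(body)) as lines:
    for line in lines:
        # each line looks like: "<SURT key> <timestamp> <JSON record>"
        if line.startswith(b"com,springer,link)"):  # SURT form of link.springer.com
            print(line.decode("utf-8").rstrip())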

Google Analytics Realtime Sandbox Environment

I am looking for a way to set up a Google Analytics sandbox environment that will allow me to test my custom JS code in near real time.
My app will be using custom variables for advanced segmentation, and I would like to test multiple scenarios quickly, as opposed to setting up a dummy GA account and waiting for a whole day to confirm each test.
Thanks
Great question.
For GA, server updates occur every four hours, and after every sixth such update, the entire set is recalculated, which means a 24-hour lag from code change to reliable feedback. This delay also applies to most customizations to the GA Browser (e.g., "custom filters").
So if you are going to use GA as your web metrics system, and you expect to actually rely on those data then a test rig is essential.
For me, it's useful to group test systems for client-side analytics under two rubrics: (i) complete, self-contained (closed-loop) systems; or (ii) simpler automated data pulls from the production system (by "production system" here I mean GA's system, not the Site whose pages the GA code is tracking).
For the latter, just add this line to each page of your Site that contains the GA tracking code, just below '__trackPageview()':
pageTracker._setLocalRemoteServerMode();
That line will cause a copy of each transaction line to be logged to your server's activity log--so, in essence, you get the data captured by GA in real time. That's all you need to do to capture the data; to parse it, you can use, for instance, any of the excellent open-source web log analyzers like AWStats, or roll your own.
This is simple and reliable--but all it can do is tell you (in real time) "does the analytics code I just implemented on pages served by my production server actually work?"
Usually, that's not good enough--you would rather know if your code will work before it's on your production server. To do that, you need to simulate the production environment and find a way to access in real-time the data GA collects.
This kind of test rig is a little more involved, but still not difficult.
In sum, it requires these steps:
host/serve the ga.js and the tracking pixel locally;
log the __utm.gif requests (in the GA data flow, each request corresponds to one logged transaction); and
parse the headers into some convenient human-readable form.
If you want more detail than that (ie, a step-by-step implementation), here it is:
I. Hosting/Serving the GA Script (& Automating Updates)
To do that, you can create a small shell script like this one to wget the latest ga.js version into your local directory (replacing the extant version it finds there).
#!/bin/sh
rm /My_Sites/sitename.com/analytics/ga.js
cd /My_Sites/sitename.com/analytics/
wget http://www.google-analytics.com/ga.js
chmod 644 /My_Sites/sitename.com/analytics/ga.js
cd ${OLDPWD}
exit 0;
(Thanks to AskApache.com, which provided the original motivation and config details to do this in a production context.)
II. Create __utm.gif file
This is just a transparent 1x1-pixel GIF image, which you will place somewhere in your Site directory (it doesn't matter where; it just needs to match the location referenced in your pages).
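If you don't already have a pixel handy, one quick way to generate it is from a widely used base64-encoded 1x1 transparent GIF (a small Python sketch; the filename just needs to match whatever your pages request):
import base64

# a commonly used 1x1 transparent GIF, base64-encoded
PIXEL_B64 = "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"

with open("__utm.gif", "wb") as f:
    f.write(base64.b64decode(PIXEL_B64))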
III. Log the __utm.gif Requests
For a testing protocol in which you are the source of the client-side activity (e.g., you want to verify the cross-browser fidelity of some event-tracking code you've added to a page on your Site, so you automate 5000 clicks on the button you just wired up, serving the page from a dev server set up for this purpose), it's probably simplest to just log the Request Headers. It's in those headers that the GA script directs the client to gather various data from the DOM, from the location bar (URL), and from prior HTTP headers, and append them to a request for a resource on the GA server (__utm.gif, which is just a 1x1 transparent pixel).
For this type of protocol, I use the Firefox addon LiveHTTPHeaders. You install it like any other Firefox addon: a few mouse clicks is all. Next, open it and click the "Generator" tab. From this window, you can see the actual requests in real time. At the bottom of the window is a 'Save' button to store the log. I find it easier to configure LiveHTTPHeaders to log only the __utm.gif requests; to do that, just click the 'Edit' tab and create a simple filter to exclude everything except these particular GIF images (using the check boxes and the large text box on the right).
Other kinds of test protocols require you to work from your Server Activity Logs; in that case just add this line to each page of your Site, just below __trackPageview():
pageTracker._setLocalRemoteServerMode();
IV. Parse those logged requests so you can actually read them
So now your log will contain individual transaction lines, each one of which is a string appended to an HTTP request for the GA tracking pixel. This string is just a concatenation of key-value pairs; each key begins with the letters "utm" (probably for "urchin tracker"). Each of these parameters corresponds to a variable that you see in the GA Dashboard (here's a complete list and description of them). This is all you need to know to build a parser. In more detail:
First, here's a sanitized __utm.gif request (the entries in your LiveHTTPHeaders log):
http://www.google-analytics.com/__utm.gif?utmwv=1&utmn=1669045322&utmcs=UTF-8&utmsr=1280x800&utmsc=24-bit&utmul=en-us&utmje=1&utmfl=10.0%20r45&utmcn=1&utmdt=Position%20Listings%20%7C%20Linden%20Lab&utmhn=lindenlab.hrmdirect.com&utmr=http://lindenlab.com/employment&utmp=/employment/openings.php?sort=da&&utmac=UA-XXXXXX-X&utmcc=__utma%3D87045125.1669045322.1274256051.1274256051.1274256051.1%3B%2B__utmb%3D87045125%3B%2B__utmc%3D87045125%3B%2B__utmz%3D87045125.1274256051.1.1.utmccn%3D(referral)%7Cutmcsr%3Dlindenlab.com%7Cutmcct%3D%2Femployment%7Cutmcmd%3Dreferral%3B%2B
This is my parser (in Python):
# regular expression module imported
import re

# 'gfx' is assumed to be bound to the raw __utm.gif request string
# (the sanitized example shown above)
pattern = r'\&{1,2}'
pat_obj = re.compile(pattern)

# splitting the gif request on the '&' character
# (which GA originally used to concatenate each piece to build the request)
gfx1 = pat_obj.split(gfx)

# create a look-up table to map a descriptive name to each gif request parameter
# (note, this isn't the entire list, which I've linked to above)
keys = "utmje utmsc utmsr utmac utmcc utmcn utmcr utmcs utmdt utme utmfl utmhn utmn utmp utmr utmul utmwv"
values = "java_enabled screen_color_depth screen_resolution account_string cookies campaign_session_new repeat_campaign_visit language_encoding page_title event_tracking_data flash_version host_name GIF_req_unique_id page_request referral_url browser_language gatc_version"
keys = keys.strip().split()
values = values.strip().split()

# create the look-up table
GIF_REQUEST_PARAMS = dict(zip(keys, values))

# parse each request parameter and map the parameter name to a descriptive name:
pattern = r'(utm\w{1,2})=(.*?)$'
pat_obj = re.compile(pattern)
fmt = '{0:25} {1:10}'
for itm in gfx1:
    m = pat_obj.search(itm)
    if m:
        print(fmt.format(GIF_REQUEST_PARAMS[m.group(1)], m.group(2)))
The result looks like this:
gatc_version              1         
GIF_req_unique_id         1669045322
language_encoding         UTF-8     
screen_resolution         1280x800  
screen_color_depth        24-bit    
browser_language          en-us     
java_enabled              1         
flash_version             10.0%20r45
campaign_session_new      1         
page_title                Position%20Listings%20%7C%20Linden%20Lab
host_name                 lindenlab.hrmdirect.com
referral_url              http://lindenlab.com/employment
page_request              /employment/openings.php?sort=da
account_string            UA-XXXXXX-X
cookies
To avoid making this longer still, I left out the cookies' value. The cookies obviously require a separate parsing step, though it's virtually identical to the step I just showed. Again, each request represents a single transaction, so you can store them as you need to.
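For completeness, here is a minimal sketch of that extra cookie-parsing step (Python 3; the utmcc sample below is shortened and purely illustrative -- in practice you would feed in the value captured by the parser above):
from urllib.parse import unquote

# shortened, illustrative utmcc value
utmcc = "__utma%3D87045125.1669045322.1274256051.1%3B%2B__utmb%3D87045125%3B%2B__utmc%3D87045125%3B%2B"

decoded = unquote(utmcc)       # "__utma=...;+__utmb=...;+__utmc=...;+"
cookies = {}
for chunk in decoded.split(";+"):
    name, sep, value = chunk.partition("=")
    if sep:                    # skip the trailing empty chunk
        cookies[name] = value

print(cookies)   # {'__utma': '87045125.1669045322.1274256051.1', '__utmb': '87045125', '__utmc': '87045125'}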
