How to determine file sizes from Plone's ZMI script?

I have a ZMI script that walks the site recursively to a set depth and prints the URL of the files I'm interested in (jpg, gif, png, pdf, etc.) via obj.absolute_url(). I'd like to figure out the size of these files, but I'm having trouble with that: the object I get back doesn't seem to have any file-size method.
My end goal is to figure out where the larger files in my Plone site live and how large they are.

In Plone you can use portal_catalog to search for "File" type objects.
Each element in the result set is a brain object.
Among the standard metadata returned for every brain there is getObjSize,
so a call like this may work for you:
for b in portal_catalog(portal_type="File"):
    print "{} - Kb:{}".format(b.getPath(), b.getObjSize / 1024)
Note that getPath() is a method while getObjSize is an attribute.
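Since your end goal is to find the largest files, here is a rough, untested Script (Python) sketch along the same lines that collects the sizes and sorts them. It assumes getObjSize holds a byte count; depending on your Plone version it may instead be an already formatted string (e.g. "24.5 kB"), in which case you would need to wake the object with brain.getObject() and read the size from the content item itself.

# Rough sketch for a ZMI Script (Python): list the 20 biggest File objects.
# Assumes the getObjSize metadata column is a byte count (see the note above).
catalog = context.portal_catalog
results = []
for brain in catalog(portal_type="File"):
    results.append((brain.getObjSize, brain.getPath()))

results.sort(reverse=True)
for size, path in results[:20]:
    print "%s - Kb: %s" % (path, size / 1024)
return printed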
If you have other questions, please use the official community forum:
http://community.plone.org

Related

How to extract info from a .db file to create a .csv or any viable "bookmark" file?

I am using a fairly obscure bookmark manager on Android. I picked this one after trying others because it could import and export, classify bookmarks by folders, the design was good, and it was easy to search my bookmarks.
After importing all my bookmarks from other browsers and also from files, I started classifying them into folders, subfolders, etc.
I spent many days classifying them all the way I wanted.
After classifying them, I tried to export them.
The problem is that the only option offered is to export them to a .html file containing all the bookmarks but without any folders.
The .html file contains all my bookmarks but in complete disorder, and it doesn't mention the folders.
The app also has a "backup" function, so I tried it, and it creates a .db file.
I opened this .db file with an SQLite viewer app and found inside, among other things I don't understand, a list of all my bookmarks with a number next to each one, and also a list of my folders with the corresponding number next to each of them.
When I open the .db file, I have a choice between:
- sqlite_master
- android_metadata
- bookmarks
- folders
- sqlite_sequence
If I click on "bookmarks", all my bookmarks appear in a kind of spreadsheet with rows and columns. In another column next to them, for example, every bookmark related to "Kitchen recipes" has the number 1 written next to it.
And in the "folders" table, next to the folder called "Recipes", the number 1 is also written.
So I'm happy, because it seems that my classification is stored in this file.
But I don't know how to easily extract all that data and create from it a "bookmark" file that can be imported into another bookmark app or browser (for example .csv or .xbel, or .html but with folders).
I guess I need some "script" working like this:
if the first row in "folders" has the number 8 next to it,
then take all the bookmarks in the "bookmarks" table that also have an 8 written next to them, and put them inside this folder.
I'm a complete noob at coding; I don't know what SQLite is, nor anything else.
So I know that maybe I'm asking for too much information at the same time.
But if some kind person could point me in the right direction, by explaining:
whether that's possible
what the easiest way would be
whether some solution already exists
whether someone like me can do it, and what I would have to learn to be able to do it myself some day
Thanks
Here are screenshots so it's easier to understand: the SQLite table list, the "folders" table and the "bookmarks" table.
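What you describe is essentially a join between the two tables, and it is very doable. Below is a minimal sketch in Python (which ships with SQLite support); the table and column names (folder_id, title, url, _id, name) are guesses based on the screenshots, so check the real names in your SQLite viewer and adjust them. It writes a Netscape-style bookmarks .html file with one folder per group, which most browsers can import.

# Hedged sketch: export bookmarks grouped by folder to an importable HTML file.
# Column names (folder_id, title, url, _id, name) are guesses - adjust them to
# whatever your SQLite viewer shows for the "bookmarks" and "folders" tables.
import sqlite3

con = sqlite3.connect("backup.db")  # the .db file created by the app's backup
folders = con.execute("SELECT _id, name FROM folders").fetchall()
bookmarks = con.execute("SELECT folder_id, title, url FROM bookmarks").fetchall()
con.close()

with open("bookmarks.html", "w", encoding="utf-8") as out:
    out.write("<!DOCTYPE NETSCAPE-Bookmark-file-1>\n<DL><p>\n")
    for folder_id, folder_name in folders:
        out.write("<DT><H3>%s</H3>\n<DL><p>\n" % folder_name)
        for fid, title, url in bookmarks:
            if fid == folder_id:
                out.write('<DT><A HREF="%s">%s</A>\n' % (url, title))
        out.write("</DL><p>\n")
    out.write("</DL><p>\n")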

Recipes PDF batch extraction

I am now working with 500 PDF recipe files, which I want to display on my website. How can I batch extract them and display the information from the PDFs on my website? The PDFs have all the information for the recipes. For each recipe, I need to display its description, image, ingredients, instructions, nutrition label and so on. Is there any way to do this so that I don't have to work through them manually?
Do these all have the same basic template for how the information is structured? This isn't really a WordPress-specific issue. One thing you can do is use Go to loop through and process all the files. I've played with Go and it's incredibly fast at parsing large amounts of information. Maybe you can fiddle with this library: https://github.com/unidoc/unidoc.
There are also a lot of library options to try in PHP. Here's just one example: https://www.pdfparser.org/. There's documentation, and you can install it via Composer: https://www.pdfparser.org/documentation
If every recipe follows the same sort of template and you want to extract specific details from specific sections of the PDF, it should be easy enough. If you don't mind extracting all the text from a PDF and just displaying that on your website, it should also be easy enough using one of these libraries. If you go the Golang route, you could parse all the text of each PDF, save it to a file, then upload the files using PHP and have the PHP code insert everything into custom post types or something.
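Purely to illustrate the batch loop described above, here is a rough sketch in Python using the pypdf library (a substitute for the Go and PHP libraries named in the answer, not what it recommends). It dumps the raw text of every PDF in a folder into a matching .txt file; splitting that text into description, ingredients, instructions and so on still depends on the recipes sharing a consistent template.

# Rough sketch: extract the raw text of every PDF in a folder into .txt files.
# pypdf is used only as an illustration; the answer itself suggests Go (unidoc)
# or PHP (pdfparser) instead. "recipes" is a hypothetical folder name.
import os
from pypdf import PdfReader

pdf_dir = "recipes"

for name in os.listdir(pdf_dir):
    if not name.lower().endswith(".pdf"):
        continue
    reader = PdfReader(os.path.join(pdf_dir, name))
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    with open(os.path.join(pdf_dir, name[:-4] + ".txt"), "w", encoding="utf-8") as out:
        out.write(text)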

Schema file does not exist in XBRL Parse file

I have downloaded a zip file containing around 200,000 html files from Companies House.
Each file is in one of two formats: 1) inline XBRL format (.html file extension) or 2) XBRL format (.xml file extension). Looking at the most recent download available (6 December 2018) all the files seem to be the former format (.html file extensions).
I'm using the XBRL package in R to try and parse these files.
Question 1: is the XBRL package meant to parse inline XBRL format (.html) files, or is it only supposed to work on the XBRL (.xml) formats? If not, can anyone tell me where to look to parse inline XBRL format files? I'm not entirely sure what the difference is between inline and not inline.
Assuming the XBRL package is meant to be able to parse inline XBRL format files, I'm hitting an error telling me that the xbrl.frc.org.uk/FRS-102/2014-09-01/FRS-102-2014-09-01.xsd file does not exist. Here's my code:
install.packages("XBRL")
library(XBRL)
inst <- "./rawdata/Prod224_0060_00000295_20171130.html" # manually unzipped
options(stringsAsFactors = FALSE)
xbrl.vars <- xbrlDoAll(inst, cache.dir = "XBRLcache", prefix.out = NULL, verbose = TRUE)
and the error:
Schema: ./rawdata/https://xbrl.frc.org.uk/FRS-102/2014-09-01/FRS-102-2014-09-01.xsd
Level: 1 ==> ./rawdata/https://xbrl.frc.org.uk/FRS-102/2014-09-01/FRS-102-2014-09-01.xsd
Error in XBRL::xbrlParse(file) :
./rawdata/https://xbrl.frc.org.uk/FRS-102/2014-09-01/FRS-102-2014-09-01.xsd does not exists. Aborting.
Question 2. Can someone explain what this means in basic terms for me? I'm new to XBRL. Do I need to go and find this xsd file and put it somewhere? It seems to be located here, but I have no idea what to do with it or where to put it.
Here's a similar question that doesn't seem to be fully answered, and its links are all in Spanish, which I don't know.
Once I've been able to parse one single html XBRL file, my plan is to figure out how to parse all the XBRL files inside the multiple zip files from that website.
I had exactly the same problem with the US SEC data, and I just followed pdw's guidance exactly and it worked!
FYI, in place of the line
if (substr(file.name, 1, 5) != "http:") {
I used
if (!(substr(file.name, 1, 5) %in% c("http:", "https"))) {
and I hacked it in using trace('XBRL', edit=TRUE).
I'm not familiar with the XBRL package that you're using, but it seems clear that it's erroneously trying to resolve an absolute URL (https://...) as a local file.
A quick browse of the source code reveals the problem:
XBRL.R line 305:
fixFileName <- function(dname, file.name) {
if (substr(file.name, 1, 5) != "http:") {
[...]
i.e. it decides whether or not a URL is absolute by whether it starts with "http:", and your URL starts with "https:". It's easy enough to hack in a fix so that https URLs pass this test too, and I suspect that would fix your immediate problem, although it would be far better if this code used a URL library to decide whether a URL is absolute rather than guessing based on the protocol.
I'm not sure what the status is with respect to iXBRL documents. There's a note in the changelog saying "reported to work with inline XBRL documents", which I'm suspicious of. Whilst it might correctly find the taxonomy for an inline document, I can't see how it would correctly extract the facts without significant additional code, of which I can't see any sign.
You might want to take a look at the Arelle project as an alternative open source processor that definitely does support Inline XBRL.
As pdw stated, the issue is that the package is hard coded to look for "http:" and erroneously treats "https" paths as local paths. This happens because XBRL files can refer to external files for standard definitions of schemas, etc. In your example, this happens on line 116 of Prod224_0081_00005017_20191231.html
Several people have forked the XBRL package on github and fixed this behavior. You can install one of the versions from https://github.com/cran/XBRL/network/members with devtools::install_git() and that should work out.
For example, using this fork, the example Companies House statement is parsed:
# remotes::install_github("adamp83/XBRL")
library(XBRL)
x <- xbrlDoAll("https://raw.githubusercontent.com/stackoverQs/stackxbrlQ/main/Prod224_0081_00005017_20191231.html", cache.dir = "cache", verbose = TRUE)
Here are a few more general explanations to give some context.
Inline XBRL vs. XBRL
An XBRL file, put simply, is just a flat list of facts.
Inline XBRL is a more modern kind of XBRL instance that, instead of storing these facts as a flat list, stores them within a human-readable document, "stamping" the values onto it. From an abstract XBRL-processing perspective, both an XBRL file and an inline XBRL file are XBRL instances, i.e. simply sets of facts.
DTS
An XBRL instance (either inline or not) is furthermore linked to a few, or a lot of, taxonomy files, known to XBRL users as the DTS (Discoverable Taxonomy Set). These files are either XML Schema files (.xsd) containing the report elements (concepts, dimensions, etc.) or XML Link files (.xml) containing the linkbases (graphs of report elements, labels, etc.).
The machinery linking an XBRL instance to a DTS is a bit complex and heterogeneous: schema imports, schema includes, simple links pointing to other files, etc. As a user it suffices to understand that the DTS is made of all the files in the transitive closure of the instance via these links. It is the job of an XBRL processor (including the R package) to resolve the entire DTS.
Storage of DTS files
Typically, an XBRL instance points to a file (called entry point) located on the server of the taxonomy provider, and that file may itself point to further files on the same, and other servers.
However, many XBRL processors automatically cache these files locally in order to avoid overloading the servers, as is established practice. Normally you do not need to do this yourself; it is very cumbersome to resolve the links yourself and download all the files manually.
An alternate way is to download the entire DTS (as a zip file following a packaging standard) from the taxonomy provider's servers and use it locally. However, this also requires an XBRL processor to figure out the mapping between remote URLs and local files.

Capture many .png files over HTTP that are named as .htm?

I want to download a large number of .png files that have .htm file extensions. I've tried some WinPcap-based utilities, but none of them pick up the files I need. The utilities I have tried are called York, EtherWatch and Pikachu2. I've also tried using a Firefox extension called Save Images, which was too buggy to be useful, and I've tried looking in the browser cache. This last approach works, but it has a problem...
...I need at least the last 30 characters of the file names to be maintained so that I know which image is which.
Does anyone know how I can get this done?
You can use DownThemAll to download all the images, and then rename the file extensions programmatically.
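For the renaming step, here is a minimal sketch, assuming the images have already been saved into a local folder (for example by DownThemAll or copied out of the browser cache) with their .htm names. It keeps the original base name, so the last 30 characters survive, and only swaps the extension.

# Minimal sketch: rename every *.htm file in a folder to *.png, keeping the
# base name intact. "downloads" is a hypothetical folder name.
import os

download_dir = "downloads"

for name in os.listdir(download_dir):
    base, ext = os.path.splitext(name)
    if ext.lower() == ".htm":
        src = os.path.join(download_dir, name)
        dst = os.path.join(download_dir, base + ".png")
        os.rename(src, dst)
        print("renamed", name, "->", base + ".png")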

Drupal - Attach files automatically by name to nodes

I need a better file attachment function. Ideally, if you upload files via FTP and a file has a name similar to the name of a node (containing the same word), it would automatically appear under that node, so you don't have to add each file separately when you have many nodes. Can you think of a solution? Alternatively, something that would not be as tedious as always adding each file manually.
Dan.
This would take a fair bit of coding. Basically you want to implement hook_cron() and run a function that loops through every file in your FTP folder. The function would look for files that have not already been added to any node and then decide which node to add them to.
Bear in mind there will be a delay between uploading your files and having them attached to a node, until the next cron job runs.
This is not a good solution, and if I could give you any advice it would be not to do it. The reason you upload files through the Drupal interface is so that they are tracked in the files table and can be re-used.
Also, the approach you're proposing leaves a massive amount of ambiguity as to which file will go where. Consider this:
You have two nodes, one about cars and one about motorcycle sidecars. Your code will have to be extremely complex to decide which node to attach to if the file you've uploaded is called 'my-favourite-sidecar.jpg'.
