Why can't a conforming PDF writer overwrite a PDF over HTTP?

I'm trying to understand the purpose of the generation numbers in the cross-reference table of a PDF. As far as I understand, if someone makes changes or annotations to a PDF, the document is edited by appending updated xref tables that include incremented generation numbers, so that a PDF reader can reconstruct the document using the most recent versions of the objects. The PDF 1.7 specification says in section 7.5.6 that this incremental feature is useful since
In certain contexts, such as when editing a document across an HTTP connection ... a conforming writer cannot overwrite the contents of the original file.
Incremental updates may be used to save changes to documents in these contexts.
How is it that a conforming PDF writer can append changes to the document but not overwrite the document itself? As far as I understand file write permissions allow you to modify the whole file -- not necessarily just the end.

Related

Schema file does not exist in XBRL Parse file

I have downloaded a zip file containing around 200,000 html files from Companies House.
Each file is in one of two formats: 1) inline XBRL format (.html file extension) or 2) XBRL format (.xml file extension). Looking at the most recent download available (6 December 2018), all the files seem to be in the former format (.html file extension).
I'm using the XBRL package in R to try and parse these files.
Question 1: is the XBRL package meant to parse inline XBRL format (.html) files, or is it only supposed to work on the XBRL (.xml) formats? If not, can anyone tell me where to look to parse inline XBRL format files? I'm not entirely sure what the difference is between inline and not inline.
Assuming the XBRL package is meant to be able to parse inline XBRL format files, I'm hitting an error telling me that the xbrl.frc.org.uk/FRS-102/2014-09-01/FRS-102-2014-09-01.xsd file does not exist. Here's my code:
install.packages("XBRL")
library(XBRL)
inst <- "./rawdata/Prod224_0060_00000295_20171130.html" # manually unzipped
options(stringsAsFactors = FALSE)
xbrl.vars <- xbrlDoAll(inst, cache.dir = "XBRLcache", prefix.out = NULL, verbose = TRUE)
and the error:
Schema: ./rawdata/https://xbrl.frc.org.uk/FRS-102/2014-09-01/FRS-102-2014-09-01.xsd
Level: 1 ==> ./rawdata/https://xbrl.frc.org.uk/FRS-102/2014-09-01/FRS-102-2014-09-01.xsd
Error in XBRL::xbrlParse(file) :
./rawdata/https://xbrl.frc.org.uk/FRS-102/2014-09-01/FRS-102-2014-09-01.xsd does not exists. Aborting.
Question 2. Can someone explain what this means in basic terms for me? I'm new to XBRL. Do I need to go and find this xsd file and put it somewhere? It seems to be located here, but I have no idea what to do with it or where to put it.
Here's a similar question that doesn't seem fully answered and the links are all in Spanish and I don't know Spanish.
Once I've been able to parse a single html XBRL file, my plan is to figure out how to parse all XBRL files inside multiple zip files from that website.
I had exactly the same problem with the US SEC data, and following pdw's guidance exactly fixed it for me.
For reference, I replaced
if (substr(file.name, 1, 5) != "http:") {
with
if (!(substr(file.name, 1, 5) %in% c("http:", "https"))) {
and I hacked the change in using trace('XBRL', edit=TRUE).
I'm not familiar with the XBRL package that you're using, but it seems clear that it's erroneously trying to resolve an absolute URL (https://...) as a local file.
A quick browse of the source code reveals the problem:
XBRL.R line 305:
fixFileName <- function(dname, file.name) {
if (substr(file.name, 1, 5) != "http:") {
[...]
i.e. it decides whether or not a URL is absolute by whether it starts with "http:", and your URL starts with "https:". It's easy enough to hack in a fix to allow https URLs to pass this test too, and I suspect that would fix your immediate problem, although it would be far better if this code used a URL library to decide whether a URL is absolute rather than guessing based on the protocol prefix.
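To illustrate that last point, here is a minimal sketch (not part of the XBRL package) of testing for an absolute URL with a URL parser instead of a string prefix; it assumes the httr package is installed:
is_absolute_url <- function(x) {
  # any URL with a scheme (http, https, ftp, ...) is treated as absolute
  !is.null(httr::parse_url(x)$scheme)
}
is_absolute_url("https://xbrl.frc.org.uk/FRS-102/2014-09-01/FRS-102-2014-09-01.xsd")  # TRUE
is_absolute_url("./rawdata/Prod224_0060_00000295_20171130.html")                      # FALSE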
I'm not sure what the status is with respect to iXBRL documents. There's a note in the changelog saying "reported to work with inline XBRL documents" which I'm suspicious of. Whilst it might correctly find the taxonomy for an inline document, I can't see how it would correctly extract the facts without significant additional code, of which I can't see any sign.
You might want to take a look at the Arelle project as an alternative open source processor that definitely does support Inline XBRL.
As pdw stated, the issue is that the package is hard-coded to look for "http:" and erroneously treats "https:" paths as local paths. This happens because XBRL files can refer to external files for standard definitions of schemas, etc. In your example, this happens on line 116 of Prod224_0081_00005017_20191231.html.
Several people have forked the XBRL package on github and fixed this behavior. You can install one of the versions from https://github.com/cran/XBRL/network/members with devtools::install_git() and that should work out.
For example, using this fork, the example Companies House statement parses successfully:
# remotes::install_github("adamp83/XBRL")
library(XBRL)
x <- xbrlDoAll("https://raw.githubusercontent.com/stackoverQs/stackxbrlQ/main/Prod224_0081_00005017_20191231.html", cache.dir = "cache", verbose = TRUE)
Here are a few more general explanations to give some context.
Inline XBRL vs. XBRL
An XBRL file, put simply, is just a flat list of facts.
Inline XBRL is a more modern version of an XBRL instance that, instead of storing these facts as a flat list, stores the facts within a human-readable document, "stamping" the values onto it. From an abstract XBRL-processing perspective, both an XBRL file and an inline XBRL file are XBRL instances and are simply sets of facts.
DTS
An XBRL instance (either inline or not) is furthermore linked to a few, or a lot of, taxonomy files, known to XBRL users as the DTS (Discoverable Taxonomy Set). These files are either XML Schema files (.xsd) containing the report elements (concepts, dimensions, etc.) or XML Link files (.xml) containing the linkbases (graphs of report elements, labels, etc.).
The machinery linking an XBRL instance to a DTS is a bit complex and heterogeneous: schema imports, schema includes, simple links pointing to other files, etc. It suffices to understand as a user that the DTS is made of all the files in the transitive closure of the instance via these links. It is the job of an XBRL processor (including the R package) to resolve the entire DTS.
Storage of DTS files
Typically, an XBRL instance points to a file (called entry point) located on the server of the taxonomy provider, and that file may itself point to further files on the same, and other servers.
However, many XBRL processors automatically cache these files locally in order to avoid overloading the servers, as is established practice. Normally, you do not need to do this yourself. It is very cumbersome to resolve the links oneself to download all files manually.
An alternate way is to download the entire DTS (as a zip file following a packaging standard) from the taxonomy provider's servers and use it locally. However, this also requires an XBRL processor to figure out the mapping between remote URLs and local files.

MS Word track changes and RMarkDown

I try to write all data analysis reports using R Markdown, because I can have a reproducible document that I can share in several output formats (PDF, HTML and MS Word).
However, most of my colleagues use MS Word and they have no idea about R, Markdown, etc.
One advantage of using R Markdown is that I can generate my report in MS Word and directly share it with my colleagues.
The disadvantage is that collaboration becomes cumbersome for me, because I receive feedback on MS Word as well (typically using track changes) and I have to manually introduce those changes back into the .rmd file.
So, my question is: how can I simplify the process (i.e. make it as automatic as possible) of getting the changes in the MS Word document into the .Rmd?
Are there any tools out there that can help me out?
P.S. Getting my colleagues to become R-literate is not an option :(
I haven't yet tried what I'm proposing, but here is how I plan to handle this, since I have exactly the same need. First, there are two distinct scenarios:
I am the lead author, or I am responsible for the statistical analysis: I will require all collaborators to learn and use markdown (not R Markdown, just generic markdown) and I'll instruct them not to touch any R code. I believe markdown is easy enough that anyone who is competent enough to collaborate on an article with data analysis is more than competent to learn markdown. For teaching them, the key features for people familiar with working with Microsoft Word and track changes are the following:
Basic markdown references: I would give them the core R Markdown references, which are the Pandoc Markdown documentation and the R Markdown cheat sheet.
Track changes: Collaborators would simply edit the markdown in plain text and submit their edited version. To view and reconcile differences, I would simply use a diff tool; I would find a good online one to teach my collaborators how to diff changes.
Comments between authors: I would select one of the options for markdown comments and teach my collaborators to use that when needed. The modified HTML comment (<!--- Pandoc-enhanced HTML comment -->) is the one I would probably use.
Reference management: I use Zotero, so I would use Better BibTeX for Zotero to handle references. The nice thing about this is that although I would have to handle the references myself, collaborators can directly add references to the Zotero group library. In fact, using citation keys, it should be simple for collaborators to learn how to insert references themselves into the markdown text.
I am NOT the lead author and I am NOT responsible for the statistical analysis: I would use whatever workflow the lead author uses (e.g. if the lead author uses Word with tracked changes, I'll use the same things).
I want to note that the only part that does not seem so easy (compared to Microsoft Word's normal workflow) is replacing track changes with diffs. I'm not aware of a tool that makes incorporating diff files as easy as how Word reconciles changes, but if such a tool exists, the process should be more seamless.
I believe we would need to work on several packages in order to make true collaboration possible between users of Word and RMarkdown. I would be happy to collaborate with anyone interested in making this happen.
Adding a CriticMarkup plugin for RStudio. https://github.com/CriticMarkup/CriticMarkup-toolkit/
Having an R package that can scrape Word documents along with tracked changes. The officer package can already read Word documents, but not the tracked changes (a minimal read sketch follows this list). It would also be extremely useful if this package could add simple RMarkdown formatting to the scrapes, e.g. for bold, subscripts and perhaps even tables, to facilitate the subsequent matching of Word text to the RMarkdown file.
https://github.com/davidgohel/officer/issues/132
Write a package that can translate the scraped Tracked changes to CriticMarkup into the RMarkdown file.
Generate a key (paragraph)->(lines) that matches paragraphs scraped from Word (without any of the tracked changes) to lines in the RMarkdown. The problem is that we don't know what was generated by code and what was written directly as Rmd. The first step would be to find lines in the RMarkdown file that should form paragraphs (exclude R chunks, but not inline R). Then, ensuring the order remains the same, compare these lines (with newlines removed) to paragraphs scraped from the Word document, using a regexp symbol for "any char, any length" in place of inline R chunks. Next, split paragraphs with inline chunks into sub-paragraphs in order to be able to apply tracked changes and comments to the inline code, or to the text before or after the inline chunk, more easily. Finally, the paragraphs that could not be matched were likely generated within code chunks and should be matched to the appropriate code chunks, determined from the order of the paragraphs.
Using the generated key, apply the tracked changes (as CriticMarkup) to the RMarkdown file. Any changes made to code chunks should be reported as a CriticMarkup comment around that code chunk (or group of code chunks if there is no markdown in between chunks).
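For the officer-based scraping step mentioned above, here is a minimal sketch, assuming the officer package and a hypothetical reviewed file report-reviewed.docx; as noted, officer exposes the document text but not the tracked changes themselves, so this only recovers the accepted paragraph text for matching against the .Rmd:
library(officer)
doc   <- read_docx("report-reviewed.docx")     # placeholder file name
elems <- docx_summary(doc)                     # one row per document element
paras <- elems[elems$content_type == "paragraph", c("doc_index", "text")]
head(paras$text)                               # paragraphs to match against .Rmd lines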
I suggest you try trackdown https://claudiozandonella.github.io/trackdown/
trackdown offers a simple answer to collaborative writing and editing of R Markdown (or Sweave) documents. Using trackdown, the local .Rmd (or .Rnw) file is uploaded as plain-text in Google Drive where, thanks to the easily readable Markdown (or LaTeX) syntax and the well-known online interface offered by Google Docs, collaborators can easily contribute to the writing and editing of the narrative part of the document. After integrating all authors’ contributions, the final document can be downloaded and rendered locally.
Using Google Docs, anyone can collaborate on the document as no programming experience is required, they only have to focus on the narrative text ignoring code jargon.
Moreover, you can hide code chunks by setting hide_code = TRUE (they will be automatically restored when downloaded). This prevents collaborators from inadvertently making changes to the code that might corrupt the file and allows them to focus only on the narrative text, ignoring code jargon.
You can also upload the actual output (i.e., the resulting compiled document) to Google Drive together with the .Rmd (or .Rnw) document. This helps collaborators evaluate the overall layout, figures and tables, and allows them to use comments on the PDF to propose and discuss suggestions.
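As a rough sketch of that round trip (file names are placeholders, and the argument names follow my reading of the package documentation, so check ?upload_file before relying on them):
# install.packages("trackdown")
library(trackdown)
# push the local .Rmd to Google Drive, hiding code chunks and attaching the rendered output
upload_file(file = "report.Rmd", hide_code = TRUE, path_output = "report.pdf")
# ... collaborators edit the narrative text in Google Docs ...
# pull their edits back into the local .Rmd (hidden chunks are restored automatically)
download_file(file = "report.Rmd")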
I know this is an old post, but for future askers, there is now a package available that can do (mostly) this:
The {redoc} package can output to Word, and by storing the R code internally within the Word document, it can also dedoc() a Word file back into RMarkdown. It uses the CriticMarkup syntax discussed in another answer.
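A rough sketch of the {redoc} round trip (the package is experimental and not on CRAN, so treat the exact calls as approximate and check its documentation; file names are placeholders):
# remotes::install_github("noamross/redoc")
library(redoc)
# render the .Rmd to Word using redoc's output format, which embeds the original code in the .docx
rmarkdown::render("report.Rmd", output_format = redoc::redoc())
# after collaborators edit report.docx with tracked changes, convert it back to R Markdown
dedoc("report.docx")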

QTextDocument serialization

I read old topics about QTextDocument serialization: here and here.
As I understand it, the only real method for serialization and deserialization without additional code is saving and reading documents as HTML files.
But I think this method is not fast, because parsing an HTML string is a heavy and slow operation.
Alternatively, I could save the document in a binary format and deserialize it by calling QTextCursor methods in sequence, which I think would be faster than HTML parsing.
Are there any code samples for QTextDocument binary serialization?
There is QTextDocumentWriter, but there is no corresponding reader.
Check this answer if you need to read the document.
I didn't find a pure binary serialization, but I found a working sample for reading ODF files into a QTextDocument. This is also a string (XML) parsing method, but the ODF format avoids having to store several files per document if it contains images.
The source code can be viewed in the Okular git repo here.

Generating keywords from a pdf automatically

My application allows user to upload pdf files and store them on the webserver for later viewing. I store the name of the file, location, size, upload date, user name etc in an SQL server database.
I'd like to be able to programatically, just after a file is uploaded, generate a list of keywords (maybe everything except common words) and store them in the sql database as well so that subsequent users can do keyword searches...
Suggestions on how to approach this task? Do these types of routines already exist?
EDIT: Just to clarify my requirements, I wouldn't be concerned with doing OCR. I don't know the insides of PDFs, but I understand that if a PDF was generated by an app, such as Word -> PDF print, the text of the document is searchable... so really my first task, and the intent of my question, is: how do I access the text of a PDF file from an ASP.NET app? OCR on scanned PDFs is probably beyond my requirements at this point.
As a first step you should extract all text from the PDF.
Ghostscript and pdftotext can do this; PDFBox is another option.
There are certainly other tools as well.
Then you can remove all stopwords and duplicates and write it to the database.
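To make the extract-then-filter step concrete, here is a small sketch in R rather than ASP.NET (the file name and the tiny stopword list are just stand-ins), using the pdftools package:
library(pdftools)
text  <- paste(pdf_text("report.pdf"), collapse = " ")            # one string for the whole PDF
words <- tolower(unlist(strsplit(text, "[^[:alpha:]]+")))         # crude tokenization on non-letters
stop  <- c("the", "and", "of", "to", "a", "in", "is", "for", "on", "that")
words <- words[nchar(words) > 2 & !(words %in% stop)]             # drop stopwords and short tokens
head(sort(table(words), decreasing = TRUE), 25)                   # 25 most frequent candidate keywords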
It has been mentioned that this does not work for scanned PDF documents, but this is only half the truth. On the one hand, there are lots of scanned PDFs which also have text embedded, because that is what some scanner drivers do (Canon CanoScan drivers perform OCR and generate searchable PDFs). On the other hand, documents generated with LaTeX that contain non-ASCII characters return garbage in my experience (even when I copy and paste in Acrobat).
The only problem I foresee with grabbing every non-common word is that you'll dilute your search results and have to query the DB for more PDFs. One website to look at is Scribd, which does something similar to what you're describing: users upload files and other people can view them online via a Flash app.
That is a very interesting topic. The question is how many keywords you need to define one PDF. If you say:
3 to 10 - I would check text categorization methods such as a Bayesian classifier or k-NN (which will group similar PDF files into clusters). I know that similar algorithms are used to filter spam. But it is a system that needs training input; for example, if you add keywords to 100 PDFs, the system will learn the patterns. I am not an expert, but this is one way to do it.
more than 10 - then I would suggest brute force -> filter common words -> get the most frequent words for a specific document.
I would explore the first option. You should definitely look into methods such as "text categorization", "auto tagging", "text mining" and "automatic keyword extraction".
Some links :
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
Keyword Extraction Using Naive Bayes
If you are planning on indexing PDF documents, you should consider using a dedicated text search engine like Lucene. Lucene provides features that will be difficult to implement using only SQL and a relational database. You will still need to extract the text from the PDF documents, but won't have to worry about filtering out common words. By filtering out common words, you will completely lose the ability to do phrase searches.

How to keep original dates after uploading to Alfresco?

I am uploading several files to an Alfresco repository via WebDAV. The batch process works fine, but after the upload, all dates in the repository are changed to the current date.
How can I make it keep and show the original file dates (creation and modification)?
Thanks.
You can leverage metadata extractors. Their main purpose is to extract metadata from binary files during upload. There are lots of built-in metadata extractors; just look for implementers of the interface org.alfresco.repo.content.metadata.MetadataExtracter. There are different extractors that can extract the creation date and set it as cm:created on the Alfresco node.
You can enable metadata extraction by applying it as a rule on a space; look for the action named Extract Common Metadata in the actions drop-down box while creating the rule.
I don't believe it's possible without the importing code explicitly turning off the default behaviour of the "cm:auditable" policy, and I suspect the WebDAV driver doesn't do this (since it has no way of knowing whether that's appropriate or not - there are cases where forcing the creation and modification dates to today is the correct thing to do).
This behaviour is discussed in some detail here - it might be worth evaluating whether the bulk filesystem import tool is a more appropriate way to import the content into Alfresco, particularly since it can preserve the creation and modification dates if you tell it to (i.e. by specifying the values of those properties).
