Schema file does not exist when parsing an XBRL file in R

I have downloaded a zip file containing around 200,000 html files from Companies House.
Each file is in one of two formats: 1) inline XBRL (.html file extension) or 2) XBRL (.xml file extension). Looking at the most recent download available (6 December 2018), all the files seem to be in the former format (.html extension).
I'm using the XBRL package in R to try and parse these files.
Question 1: is the XBRL package meant to parse inline XBRL (.html) files, or is it only supposed to work on plain XBRL (.xml) files? If it only handles the latter, can anyone tell me where to look for parsing inline XBRL files? I'm not entirely sure what the difference is between inline and non-inline XBRL.
Assuming the XBRL package is meant to be able to parse inline XBRL format files, I'm hitting an error telling me that the xbrl.frc.org.uk/FRS-102/2014-09-01/FRS-102-2014-09-01.xsd file does not exist. Here's my code:
install.packages("XBRL")
library(XBRL)
inst <- "./rawdata/Prod224_0060_00000295_20171130.html" # manually unzipped
options(stringsAsFactors = FALSE)
xbrl.vars <- xbrlDoAll(inst, cache.dir = "XBRLcache", prefix.out = NULL, verbose = TRUE)
and the error:
Schema: ./rawdata/https://xbrl.frc.org.uk/FRS-102/2014-09-01/FRS-102-2014-09-01.xsd
Level: 1 ==> ./rawdata/https://xbrl.frc.org.uk/FRS-102/2014-09-01/FRS-102-2014-09-01.xsd
Error in XBRL::xbrlParse(file) :
./rawdata/https://xbrl.frc.org.uk/FRS-102/2014-09-01/FRS-102-2014-09-01.xsd does not exists. Aborting.
Question 2. Can someone explain what this means in basic terms for me? I'm new to XBRL. Do I need to go and find this xsd file and put it somewhere? It seems to be located here, but I have no idea what to do with it or where to put it.
Here's a similar question that doesn't seem to be fully answered, and its links are all in Spanish, which I don't know.
Once I've been able to parse one single html XBRL file, my plan is to figure out how to parse all the XBRL files inside the multiple zip files from that website.

I had exactly the same problem with the US SEC data.
I just followed pdw's guidance exactly and it worked!
FYI, where the original code has
if (substr(file.name, 1, 5) != "http:") {
I used
if (!(substr(file.name, 1, 5) %in% c("http:", "https"))) {
And I applied the change using trace('XBRL', edit=TRUE).
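For anyone wanting to reproduce the hack, here is a minimal sketch of that session (trace() only patches the copy of the function loaded in the current R session):
library(XBRL)
# Opens the source of XBRL::XBRL() in an editor. Inside the internal
# fixFileName() helper, replace
#   if (substr(file.name, 1, 5) != "http:") {
# with
#   if (!(substr(file.name, 1, 5) %in% c("http:", "https"))) {
# then save and close the editor.
trace("XBRL", edit = TRUE)
# Re-run the parse as before; untrace("XBRL") discards the edit.
xbrl.vars <- xbrlDoAll(inst, cache.dir = "XBRLcache", prefix.out = NULL, verbose = TRUE)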

I'm not familiar with the XBRL package that you're using, but it seems clear that it's erroneously trying to resolve an absolute URL (https://...) as a local file.
A quick browse of the source code reveals the problem:
XBRL.R line 305:
fixFileName <- function(dname, file.name) {
if (substr(file.name, 1, 5) != "http:") {
[...]
i.e. it decides whether or not a URL is absolute by checking whether it starts with "http:", and your URL starts with "https:". It's easy enough to hack in a fix so that https URLs pass this test too, and I suspect that would fix your immediate problem, although it would be far better if this code used a URL library to decide whether a URL is absolute rather than guessing based on the protocol.
I'm not sure what the status is with respect to iXBRL documents. There's a note in the changelog saying "reported to work with inline XBRL documents", which I'm suspicious of. Whilst it might correctly find the taxonomy for an inline document, I can't see how it would correctly extract the facts without significant additional code, and I can't see any sign of such code.
You might want to take a look at the Arelle project as an alternative open source processor that definitely does support Inline XBRL.

As pdw stated, the issue is that the package is hard coded to look for "http:" and erroneously treats "https" paths as local paths. This happens because XBRL files can refer to external files for standard definitions of schemas, etc. In your example, this happens on line 116 of Prod224_0081_00005017_20191231.html
Several people have forked the XBRL package on github and fixed this behavior. You can install one of the versions from https://github.com/cran/XBRL/network/members with devtools::install_git() and that should work out.
For example, using this fork the example Companies House statement parses correctly:
# remotes::install_github("adamp83/XBRL")
library(XBRL)
x <- xbrlDoAll("https://raw.githubusercontent.com/stackoverQs/stackxbrlQ/main/Prod224_0081_00005017_20191231.html",
               cache.dir = "cache", verbose = TRUE)
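Once that runs, the result is a list of data frames whose names follow the XBRL package's documentation, e.g.:
str(x$fact)     # the reported facts
str(x$context)  # the reporting contexts (entity, periods)
str(x$element)  # the taxonomy elements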

Here are a few more general explanations to give some context.
Inline XBRL vs. XBRL
An XBRL file, put simply, is just a flat list of facts.
Inline XBRL is a more modern version of an XBRL instance that, instead of storing these facts as a flat list, stores them within a human-readable document, "stamping" the values in place. From an abstract XBRL-processing perspective, both an XBRL file and an inline XBRL file are XBRL instances and are simply sets of facts.
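As an illustration, a fact "stamped" into the HTML looks roughly like this (ix:nonFraction is the Inline XBRL element for numeric facts; the concept name and figures here are made up):
<span>Turnover for the year was
  <ix:nonFraction name="uk-core:Turnover" contextRef="c1"
                  unitRef="GBP" decimals="0">1,200,000</ix:nonFraction>
</span>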
DTS
An XBRL instance (either inline or not) is furthermore linked to a few, or a lot of, taxonomy files, known to XBRL users as the DTS (Discoverable Taxonomy Set). These files are either XML Schema files (.xsd) containing the report elements (concepts, dimensions, etc.) or XML link files (.xml) containing the linkbases (graphs of report elements, labels, etc.).
The machinery linking an XBRL instance to a DTS is a bit complex and heterogeneous: schema imports, schema includes, simple links pointing to other files, and so on. As a user it suffices to understand that the DTS is made of all the files in the transitive closure of the instance via these links. It is the job of an XBRL processor (including the R package) to resolve the entire DTS.
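For example, the instance in the question points to its entry point with the standard schemaRef link (the URL is the one from the error message):
<link:schemaRef xlink:type="simple"
                xlink:href="https://xbrl.frc.org.uk/FRS-102/2014-09-01/FRS-102-2014-09-01.xsd"/>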
Storage of DTS files
Typically, an XBRL instance points to a file (called entry point) located on the server of the taxonomy provider, and that file may itself point to further files on the same, and other servers.
However, many XBRL processors automatically cache these files locally in order to avoid overloading the servers, as is established practice. Normally you do not need to do this yourself; resolving the links yourself and downloading all the files manually would be very cumbersome.
An alternate way is to download the entire DTS (as a zip file following a packaging standard) from the taxonomy provider's servers and use it locally. However, this also requires an XBRL processor to figure out the mapping between remote URLs and local files.

Related

Trying to import data into R in a way that will allow anyone to access it when opening the markdown file / accessing the HTML knit

I am currently working on a coding project and I am running into trouble with how I should import the data set. We are supposed to read it in in a way that lets our instructor open our markdown file, import the data, and run the code without changing file paths. I know about using relative file paths to make it accessible to anyone; however, I don't know how to get around the /users/owner part of the file path. Any help would be greatly appreciated, and if you have any further questions feel free to ask.
I've tried changing the working directory to a folder that both I and my instructor have given the same name. However, as I said above, when I use read.csv to import the data frame I am still forced to use the /users/owner file path, which is obviously specific to my computer.
I can understand your supervisor; I request the same from my students. My recommended solution is to put both the data and the R script (or the .Rmd file) in the same folder. Then there is no need to add a path in the read.csv (or similar) call.
If you use RStudio, move to the folder in the Files pane and then use the gear icon and select "Set as Working Directory".
Then send both the .R or .Rmd file and the data to the supervisor, ideally as a zip file. The supervisor can then unpack it to an arbitrary folder and just double-click the .R/.Rmd file. The containing folder will then automatically become the working directory.
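As a minimal sketch (the file name is illustrative), once the script and the data sit in the same folder the import needs no machine-specific path at all:
# my_data.csv sits next to the .R/.Rmd file, so a bare relative path is enough.
dat <- read.csv("my_data.csv")
head(dat)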
Other options are to use a subfolder for the data, or to put the data in a publicly readable internet location, e.g. GitHub, and read it directly from there (as sketched below).
The last option of course requires that the data have a free license.
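A minimal sketch of the GitHub option (the URL is a placeholder; use the file's "raw" URL):
# Read the CSV straight from a public GitHub repository.
dat <- read.csv("https://raw.githubusercontent.com/USER/REPO/main/my_data.csv")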

Ada `gprbuild`: Shorter File Names, Organized into Directories

Over the past few weeks I have been getting into Ada for various reasons, but my personal reasons for using Ada are out of scope for this question.
The other day I started using the gprbuild command that comes with the Windows version of GNAT, in order to get the benefits of managing my applications in a project-oriented manner; that is, being able to define certain attributes on a per-project basis rather than setting up the compile phase manually myself.
Currently I name my files according to what seems to be the gprbuild standard (although I could very much be wrong): a period in the package structure becomes a - in the file name, and underscores remain _. As such, a package named App.Test.File_Utils would live in app-test-file_utils.ads and app-test-file_utils.adb.
In the .gpr project file I have specified:
for Source_Dirs use ("app/src/**");
so that I am allowed to use multiple directories for storing my files, rather than needing to have them all in the same directory.
The Problem
The problem that arises, however, is that file names tend to get very long. As I am already putting the files in a directory based on the package name contained by the file, I was wondering if there is a way to somehow make the compiler understand that the package name can be retrieved from the file's directory name.
That is, rather than having to name the App.Test.File_Utils' file name app-test-file_utils, I would like it to reside under the app/test directory by the name file_utils.
Is this doable, or will I be stuck with the horrors of eventually having to name my files along the lines of app-test-some-then-one-has-more_files-another_package-knew-test-more-important_package.ads? That is assuming I haven't missed something about how an Ada application should actually be structured.
What I have tried
I tried looking at the Naming package configuration for gpr files in the documentation, but to no avail. Furthermore, I have been browsing the web for information, but decided it might be better to get help through Stack Overflow, so that other people who might struggle with this problem in the future (granted it is a problem in the first place) might also get help.
Any pointers in the right direction would be very helpful!
In the top-secret GNAT documentation there is a description of how to use non-default file names. It's a great deal of effort. You will probably give up, use the default names, and put them all in a single directory.
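For reference, the mechanism described there is the Naming package of the project file, with one attribute pair per compilation unit, which is why it scales so badly. A sketch for a single unit, assuming the file really does live under a directory already listed in Source_Dirs (note that the file name must still be unique across all source directories):
package Naming is
   for Spec ("App.Test.File_Utils") use "file_utils.ads";
   for Body ("App.Test.File_Utils") use "file_utils.adb";
end Naming;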
You can also simplify much of the effort by using GPS and letting it build your project file as you add files to your source directories.

Visualization of xsd Dependencies?

I have a bunch of XSD Files which I did not write myself. The files sometimes import each other:
<xs:import namespace="http://www.mysite.com/xmlns/xXX-YYYY/V" schemaLocation="http://www.mysite.com/xmlns/xXX-YYYY/V/schema_A.xsd"/>
and I would like to get an overview of the dependencies without having to read through all of them.
The URI specified by schemaLocation does not exist; instead, a catalog.xml file is used to resolve the schema locations:
http://de.wikipedia.org/wiki/XML_Catalogs
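For reference, such a catalog typically rewrites the remote URIs to local copies along these lines (a minimal sketch; the local path is illustrative):
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <rewriteURI uriStartString="http://www.mysite.com/xmlns/"
              rewritePrefix="./schemas/"/>
</catalog>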
Can anybody recommend a tool that can visualize the dependencies of my schemas by also processing the information given in the catalog.xml file?
Thanks
Mischa
To follow up on my comment...
I am not aware of any tool that takes into account OASIS catalog files. Have a look at this response and see if it supports what you need (and your platform).
Strictly speaking, there are a number of issues with dependency diagrams, which is why such a question should be qualified with why you want one.
Some think that such a diagram truly shows the dependencies between XSD files; that is not true. It may show what the author thinks the dependencies are, but not what a processor actually agrees to. schemaLocation is just a hint that processors may or may not use: they "may not" use it if they're instructed otherwise (well-known XSDs could be cached internally, through catalog entries or any other proprietary "catalogs"), or because the processor decides there is no need to load an external reference when there's no use for it anyway (which can happen in some corner cases).
A diagram built from explicit schema locations is definitely easier to produce. It only shows what the author intended; it doesn't mean it is the "real one" (content may be pulled in indirectly, making the whole XSD set valid even though individual XSDs, opened independently of the set, would be invalid).
Trying to build a diagram where dangling or non-existent schemaLocation values are overridden through a catalog is much harder, due to the multitude of ways to structure the content and the resolution mechanism. It would have the same shortcoming as the one above (except that now the author is the author of the catalog file, rather than whoever authored the XSDs).
The "true" dependencies can only be established by traversing a schema set that has already been loaded and compiled. Even then, you would still need to define criteria for dependencies arising from substitutable components (elements in substitution groups, or derived types used through the xsi:type attribute). That is even harder.
Take a look at this tool: DocFlex/XML XSDDoc.
It is an XML schema documentation generator.
It doesn't visualize xsd dependencies, but it does work with XML catalogs.
The overview of each XSD file lists all other XSD files referenced from it
(i.e. imported, included or redefined).
There is also an opposite list of those schemas that reference the given one.
So, you can use it to figure out which XSD files depend on which.
At least, that will be easier than reading raw XSD files.
As an example, here is documentation generated with that tool:
XML Schemas for DITA 1.1. It was generated essentially from two files:
http://docs.oasis-open.org/dita/v1.1/OS/schema/ditaarch.xsd
http://docs.oasis-open.org/dita/v1.1/OS/schema/catalog.xml
ditaarch.xsd is the schema driver that pulls all other schemas (25 in total); catalog.xml is the XML catalog, via which all file references are resolved.
The schemaLocation attributes in those schemas themselves contain just opaque URIs.

Verifying an uploaded File's content type using ASP.Net

How can I verify a file's content type without using the file's extension or MIME type, using ASP.NET?
I don't want to use the MIME type because it appears to be determined by the file extension.
You could use the FindMimeFromData() function in UrlMon.dll (using pinvoke).
See this page for an example and this MSDN page for the documentation of the function.
It really depends on the file type. For many file types you can examine the header of the file, which generally appears before the first zero byte in the file. I used to have some code that examined picture types, so I might be able to find it somewhere.
But there are file types that will not have this form of header, like XML (yeah, this is a cheap example, but it was easy to think of ;->). I believe all graphics types will have such a header, as will other binary file types.
As Andrew has mentioned, the header check is not 100% reliable. But if a file is "malformed", it is unlikely to be a hack attack; it is more likely a corrupt upload, or the upload of a corrupt file.
There is no generic way to verify that a file really is of the type its extension claims.
You could create a whitelist of formats (png, jpg, zip, etc.) and examine the file header to determine whether it conforms to the expected format, as in the sketch below.
Even that is not foolproof, though, since the file content itself may be malformed in a way that only becomes apparent when an attempt is made to load it.
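A minimal sketch of that whitelist idea (the signatures shown are the standard PNG/JPEG/ZIP prefixes; the class and method names are just for illustration):
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class ContentSniffer
{
    // Well-known "magic number" prefixes; extend the whitelist as needed.
    private static readonly Dictionary<string, byte[]> Signatures =
        new Dictionary<string, byte[]>
        {
            { "png", new byte[] { 0x89, 0x50, 0x4E, 0x47 } },
            { "jpg", new byte[] { 0xFF, 0xD8, 0xFF } },
            { "zip", new byte[] { 0x50, 0x4B, 0x03, 0x04 } },
        };

    // Returns true if the first bytes of the stream match the expected signature.
    public static bool LooksLike(Stream file, string format)
    {
        byte[] signature = Signatures[format];
        byte[] header = new byte[signature.Length];
        file.Position = 0;
        int read = file.Read(header, 0, header.Length);
        return read == header.Length && header.SequenceEqual(signature);
    }
}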

How to keep original dates after uploading to Alfresco?

I am uploading several files to an Alfresco repository via WebDAV. The batch process works fine, but after the upload all dates in the repository are changed to the current date.
How can I make it keep and show the original file dates (created and modified)?
Thanks.
You can leverage metadata extractors, whose main purpose is to extract metadata from binary files during upload. There are lots of built-in metadata extractors; just look for implementers of the interface org.alfresco.repo.content.metadata.MetadataExtracter. There are extractors that can extract the creation date and set it as cm:created on the Alfresco node.
You can enable metadata extraction by applying it as a rule on a space: look for the action named "Extract Common Metadata" in the actions drop-down box while creating the rule.
I don't believe it's possible without the importing code explicitly turning off the default behaviour of the "cm:auditable" policy, and I suspect the WebDAV driver doesn't do this (since it has no way of knowing whether that's appropriate or not - there are cases where forcing the creation and modification dates to today is the correct thing to do).
This behaviour is discussed in some detail here - it might be worth evaluating whether the bulk filesystem import tool is a more appropriate way to import the content into Alfresco, particularly since it can preserve the creation and modification dates if you tell it to (i.e. by specifying the values of those properties).
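If you go down the bulk filesystem import route, the original dates are supplied through a shadow metadata file placed next to each content file (named something like mydocument.pdf.metadata.properties.xml). The sketch below uses illustrative values; check the tool's documentation for the exact conventions:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
  <entry key="type">cm:content</entry>
  <entry key="cm:created">2012-03-01T09:30:00.000+01:00</entry>
  <entry key="cm:modified">2012-03-02T17:45:00.000+01:00</entry>
</properties>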
