Change the last update date of a file in R

I've written a small function that downloads a file from my S3 data repository only if the size of the local version of the file is different, to save bandwidth and time.
I would like to improve it to download if and only if the last update datetime is different. I can make the check using HEAD (from the httr package) to get the datetime for the remote file and file.info for the local one.
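For reference, the check I have in mind looks roughly like this (an untested sketch; url and path are placeholders, and parsing the Last-Modified header assumes an English locale):
library(httr)
# Compare the remote Last-Modified header with the local file's mtime.
resp <- HEAD(url)
remote_time <- as.POSIXct(headers(resp)[["last-modified"]],
                          format = "%a, %d %b %Y %H:%M:%S", tz = "GMT")
local_time <- file.info(path)$mtime   # NA if the file doesn't exist yet
needs_download <- is.na(local_time) || remote_time != local_time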
But (as expected) when I download a fresh copy of the file, it gets the system date as its creation/last update time. I need a way to update the datetime of the fresh local copy with the one from the server, including handling the potential issue of different time zones.
file.info doesn't seem to be able to write file properties.
Any idea how I can do that?

I don't think you can, and even if you could, that approach seems a bit unreliable to me (you mentioned time zones, for example). Instead, I would suggest relying on a file's md5sum (a fingerprint of its contents) to tell when it has changed:
library(tools)
if (md5sum(remote) != md5sum(local)) file.copy(remote, local)
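A slightly more defensive variant (just a sketch): md5sum() returns NA for a file that doesn't exist, and if (NA) is an error, so it helps to treat a missing local copy as "changed":
library(tools)
changed <- !file.exists(local) || md5sum(remote) != md5sum(local)
if (isTRUE(changed)) file.copy(remote, local, overwrite = TRUE)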

Schema file does not exist in XBRL Parse file

I have downloaded a zip file containing around 200,000 html files from Companies House.
Each file is in one of two formats: 1) inline XBRL format (.html file extension) or 2) XBRL format (.xml file extension). Looking at the most recent download available (6 December 2018) all the files seem to be the former format (.html file extensions).
I'm using the XBRL package in R to try and parse these files.
Question 1: is the XBRL package meant to parse inline XBRL format (.html) files, or is it only supposed to work on the XBRL (.xml) formats? If not, can anyone tell me where to look to parse inline XBRL format files? I'm not entirely sure what the difference is between inline and not inline.
Assuming the XBRL package is meant to be able to parse inline XBRL format files, I'm hitting an error telling me that the xbrl.frc.org.uk/FRS-102/2014-09-01/FRS-102-2014-09-01.xsd file does not exist. Here's my code:
install.packages("XBRL")
library(XBRL)
inst <- "./rawdata/Prod224_0060_00000295_20171130.html" # manually unzipped
options(stringsAsFactors = FALSE)
xbrl.vars <- xbrlDoAll(inst, cache.dir = "XBRLcache", prefix.out = NULL, verbose = TRUE)
and the error:
Schema: ./rawdata/https://xbrl.frc.org.uk/FRS-102/2014-09-01/FRS-102-2014-09-01.xsd
Level: 1 ==> ./rawdata/https://xbrl.frc.org.uk/FRS-102/2014-09-01/FRS-102-2014-09-01.xsd
Error in XBRL::xbrlParse(file) :
./rawdata/https://xbrl.frc.org.uk/FRS-102/2014-09-01/FRS-102-2014-09-01.xsd does not exists. Aborting.
Question 2. Can someone explain what this means in basic terms for me? I'm new to XBRL. Do I need to go and find this xsd file and put it somewhere? It seems to be located here, but I have no idea what to do with it or where to put it.
Here's a similar question that doesn't seem fully answered, and its links are all in Spanish, which I don't know.
Once I've been able to parse a single HTML XBRL file, my plan is to figure out how to parse all the XBRL files inside the multiple zip files from that website.
I had exactly the same problem with the US SEC data.
I just followed pdw's guidance exactly and it worked!
FYI, I replaced the line
if (substr(file.name, 1, 5) != "http:") {
with
if (!(substr(file.name, 1, 5) %in% c("http:", "https"))) {
and I made the edit using trace('XBRL', edit=TRUE).
I'm not familiar with the XBRL package that you're using, but it seems clear that it's erroneously trying to resolve an absolute URL (https://...) as a local file.
A quick browse of the source code reveals the problem:
XBRL.R line 305:
fixFileName <- function(dname, file.name) {
if (substr(file.name, 1, 5) != "http:") {
[...]
i.e. it decides whether or not a URL is absolute by checking whether it starts with "http:", and your URL starts with "https:". It's easy enough to hack in a fix so that https URLs pass this test too, and I suspect that would fix your immediate problem, although it would be far better if this code used a URL library to decide whether a URL is absolute rather than guessing based on the protocol.
I'm not sure what the status is with respect to iXBRL documents. There's a note in the changelog saying "reported to work with inline XBRL documents", which I'm suspicious of. Whilst it might correctly find the taxonomy for an inline document, I can't see how it would correctly extract the facts without significant additional code, of which I can't see any sign.
You might want to take a look at the Arelle project as an alternative open source processor that definitely does support Inline XBRL.
As pdw stated, the issue is that the package is hard-coded to look for "http:" and erroneously treats "https:" paths as local paths. This happens because XBRL files can refer to external files for standard definitions of schemas, etc. In your example, this happens on line 116 of Prod224_0081_00005017_20191231.html
Several people have forked the XBRL package on github and fixed this behavior. You can install one of the versions from https://github.com/cran/XBRL/network/members with devtools::install_git() and that should work out.
For example, using this fork the example Companies House statement can be parsed:
# remotes::install_github("adamp83/XBRL")
library(XBRL)
x <- xbrlDoAll("https://raw.githubusercontent.com/stackoverQs/stackxbrlQ/main/Prod224_0081_00005017_20191231.html", cache.dir = "cache", verbose = TRUE)
Here are a few more general explanations to give some context.
Inline XBRL vs. XBRL
An XBRL file, put simply, is just a flat list of facts.
Inline XBRL is a more modern version of an XBRL instance that, instead of storing these facts as a flat list, stores them within a human-readable document, "stamping" the values into it. From an abstract XBRL-processing perspective, both an XBRL file and an inline XBRL file are XBRL instances and are simply sets of facts.
DTS
An XBRL instance (either inline or not) is furthermore linked to a few, or a lot of, taxonomy files known to XBRL users as the DTS (Discoverable Taxonomy Set). These files are either XML Schema files (.xsd) containing the report elements (concepts, dimensions, etc.) or XML link files (.xml) containing the linkbases (graphs of report elements, labels, etc.).
The machinery linking an XBRL instance to a DTS is a bit complex and heterogeneous: schema imports, schema includes, simple links pointing to other files, etc. As a user, it suffices to understand that the DTS is made of all the files in the transitive closure of the instance via these links. It is the job of an XBRL processor (including the R package) to resolve the entire DTS.
Storage of DTS files
Typically, an XBRL instance points to a file (called entry point) located on the server of the taxonomy provider, and that file may itself point to further files on the same, and other servers.
However, many XBRL processors automatically cache these files locally in order to avoid overloading the servers, as is established practice. Normally, you do not need to do this yourself. It is very cumbersome to resolve the links oneself to download all files manually.
An alternate way is to download the entire DTS (as a zip file following a packaging standard) from the taxonomy provider's servers and use it locally. However, this also requires an XBRL processor to figure out the mapping between remote URLs and local files.

How to directly work with a data.frame in physical RData?

Do I have to
1) load a data.frame from the physical RData to the memory,
2) make changes,
3) save it back to the physical RData,
4) remove it from the memory to avoid conflicts?
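In code, that cycle looks roughly like this (a sketch; the object and file names are made up):
load("mydata.RData")                # 1) bring the data.frame (say, dat) into memory
dat$new_col <- 1                    # 2) make changes
save(dat, file = "mydata.RData")    # 3) write it back to the physical RData
rm(dat)                             # 4) remove it from memory to avoid conflicts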
Is there any way I can skip the load/save steps and make permanent changes to the physical RData directly? Is there a way to work with a data.frame the way one works with a SQLite/MySQL database? Or should I just use SQLite/MySQL (instead of a data.frame) as the data storage?
More thoughts: I think the major difference is that to work with SQLite/MySQL you establish a connection to the database, but to work with a data.frame from RData you make a copy in memory. The latter approach can create conflicts in complex programs. To avoid potential conflicts you have to save the data.frame and immediately remove it from memory every time you change it.
Thanks!
Instead of using load you may want to consider using attach. This can attach the saved data object to the search path without loading all the objects in it into the global environment. The data frame would then be available to use.
If you want to change the data frame then you would need to copy it to the global environment (will happen automatically for most editing) and then you would need to save it again (there would not be a simple way to save it into a .Rdata file that contains other objects).
When you are done you can use detach (but if you have made a copy in the global environment then you will still need to delete that copy).
If you don't like typing the load/save commands (or attach/detach) each time then you could write your own function that goes through all the steps for you (and if the copy is only in the environment of the function then you don't need to worry about deleting it).
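A minimal sketch of such a helper, assuming the .RData file stores a single data frame named dat (the names here are illustrative):
update_rdata <- function(path, change) {
  e <- new.env()
  load(path, envir = e)                        # load into a private environment
  e$dat <- change(e$dat)                       # apply the supplied modification
  save(list = ls(e), file = path, envir = e)   # write everything back
  invisible(NULL)                              # nothing is left in the global environment
}
# Usage: add a column to the stored data frame
# update_rdata("mydata.RData", function(d) { d$flag <- TRUE; d })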
You may also want to consider different ways of storing your data. The typical .RData file works well for an all-or-nothing approach. The saveRDS and readRDS functions save and read a single object (and do not force you to use the same name when reading it back in). The database-interface approach is probably best if you are making frequent changes to tables and want them stored outside of R.
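For example (file and object names are made up):
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
saveRDS(df, "df.rds")       # stores exactly one object
df2 <- readRDS("df.rds")    # read it back under any name you like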

Unix touch command usage

I know you can use touch to create a new empty file.
I just learned that touch can also be used to update the access and modification times of a file. I don't quite know in what situations, and why, you would need to update the access and modification times of a file, i.e. what the usefulness of this particular function is.
Thanks!
Some utilities depend on the timestamps of files.
For example, make uses timestamps to decide whether it needs to do something (usually a build), based on the timestamps of the source code and of the output (executables, object files, ...).
By touching a source file and then running make, you can force a rebuild.
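As a rough illustration of the rule make applies (sketched in R, the language used elsewhere in this thread; the function name is made up):
# A target needs rebuilding if it is missing or older than any of its sources.
needs_rebuild <- function(target, sources) {
  if (!file.exists(target)) return(TRUE)
  any(file.mtime(sources) > file.mtime(target))
}
# Touching a source pushes its mtime past the target's, so the next make re-runs the rule.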
In addition, touch has a -d option that can fake the modification time.
If you "know what you're doing", you can use it to avoid long build times caused by unnecessary recompilation.
For example, when adding a declaration to a common header file that does not change any existing API, you can fake the header's real modification time and bypass the Makefile's dependencies.

Under what conditions on Unix can gtk_file_chooser_get_filename() return NULL signifying a non-local filename?

From the documentation for gtk_file_chooser_get_filename():
The currently selected filename, or NULL if no file is selected, or the selected file can't be represented with a local filename. Free with g_free().
Is there at least one situation where the last condition (the selected file can't be represented with a local filename) is true on a Unix system (Linux, the various BSDs, etc.)? I tried reading through the source code but got lost/confused. I'd like to know so I can determine whether I need to handle it in some special way; I don't need to know every possibility for this.
Thanks.
I haven't yet read through the source either, but I would guess that gtk_file_chooser_get_filename() essentially returns g_file_get_path (gtk_file_chooser_get_file (...)). Probably the only case in which you would need to care about the filename being NULL is if your file chooser is enabled to pick files from a network share, for example. It's probably not something you need to worry about if you set the local-only property on your file chooser.
However, it's probably good practice to use gtk_file_chooser_get_file() anyway, since you will transparently be able to handle non-local files if you have the proper GVFS modules installed.

How to keep original dates after uploading to Alfresco?

I am uploading several files to an Alfresco repository via WebDAV. The batch process works fine, but after the upload, all dates in the repository are changed to the current date.
How can I make it keep and show the original file dates (creation and modified)?
Thanks.
You can leverage metadata extractors. Their main purpose is to extract metadata from binary files during upload. There are lots of built-in metadata extractors; just look for implementers of the interface org.alfresco.repo.content.metadata.MetadataExtracter. There are different extractors that can extract the creation date and set it as cm:created on the Alfresco node.
You can enable metadata extraction by applying it as a rule on a space; look for the action named Extract Common Metadata in the actions drop-down box when creating the rule.
I don't believe it's possible without the importing code explicitly turning off the default behaviour of the "cm:auditable" policy, and I suspect the WebDAV driver doesn't do this (since it has no way of knowing whether that's appropriate or not - there are cases where forcing the creation and modification dates to today is the correct thing to do).
This behaviour is discussed in some detail here - it might be worth evaluating whether the bulk filesystem import tool is a more appropriate way to import the content into Alfresco, particularly since it can preserve the creation and modification dates if you tell it to (i.e. by specifying the values of those properties).
