How to ensure source code matches an application? - software-design

Preword:
I have come across this issue in a made-up scenario. Imagine company X wishes to work with potentially sensitive data of their customers. How would they prove they only do what they claim to do?
Example: Company X wishes to match users with similar habits, without tampering with the data any further.
My partial solution:
Company X would release the source code of their application, which would confirm that the company only matches the data (and does not search for patterns, create personalised ads, etc.).
The remaining problem:
How does Company X prove that the released source code matches the code they are running? My requirement is that nobody should have to trust any party, neither Company X nor any third party. Simply hiring a third party to "certify" Company X's practices is not a proof, merely another claim.
Side Note: Does it make a significant difference if the application does not need to be compiled before usage (e.g. PHP)?
Are there any solutions to this? Any "provable" method to ensure a certain source code is being run?

The only way to ensure the binary matches the source is to compile it yourself or at least compare it to a binary you compiled yourself under the exact same circumstances.
But then again you have to do the same to every piece of software involved (the compiler could change the code, libraries could do something bad etc.).
Even if the software is not compiled, the interpreter would have to be validated in this way, because it interprets and runs the source (i.e. it can modify it in any way it wants).
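As a minimal sketch of the comparison described above, assuming you already hold both the published binary and one you compiled yourself under the same circumstances (the function names are my own, purely for illustration), hashing both files shows whether they are byte-for-byte identical:

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def binaries_match(published: Path, self_built: Path) -> bool:
    """True only if both files have the same digest, i.e. identical bytes."""
    return sha256_of(published) == sha256_of(self_built)
```

Matching digests only show that the two files are the same. They say nothing about the compiler or libraries involved, nor about what is actually running on somebody else's server.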
For your scenario the data could also be used and processed outside the specific software, so their whole system would have to be audited and built in this way and then locked down. Choose your level of paranoia.
So the answer is: not realistically, at least not without trusting someone. That's the idea behind signed packages on several Linux systems (including Android), where some party such as the developer or a repository maintainer signs the binary to vouch that it is what they compiled (and that it matches the published source).
Also in the previous step with verifying the source: it's pretty easy to show that a program has a certain functionality but (most often) impossible to show that it doesn't have it.
So basically choose your own level of paranoia but if they are really after you, you are screwed.
Great now I'll go find some tin foil...

Related

Run R script and hide the actual code from user

I have created an R code script that:
Reads some data from a database
Makes some transformations and..
exports into a csv the modified table.
This code needs to run in a client's machine, but we need to "hide" the actual code from the user.
Is there any useful suggestions on how we can achieve that?
Up front
... it will be nearly impossible to deploy an R <something> to another computer in a way that prevents curious users from accessing the source code.
From a mailing list conversation in 2011, in response to "I would not like anyone to be able to read the code.",
R is an open source project, so providing ways for you to do this is not
one of our goals.
Duncan Murdoch https://stat.ethz.ch/pipermail/r-help/2011-July/282755.html
(Prof Murdoch was on the R Core Team and R Foundation for many years.)
Background
Several (many?) programming languages provide the ability to compile a script or program into an executable, the .exe you reference. For example, Python has tools like py2exe and PyInstaller. The tools range from merely compacting the script into a zip-ball, perhaps obfuscating the script, to actually creating an exe with the script either tightly embedded or such. (This part could use some more citations/research.)
This is usually good enough for many people, by keeping the honest out. I say it that way because all you need to do is google phrases like decompile py2exe and you'll find tools, howtos, tutorials, etc, whose intent might be honestly trying to help somebody recover lost code. Regardless of the intentions, they will only slow curious users.
Unfortunately, there are no tools that do this easily for R.
There are tools with the intent of making it easy for non-R-users to use R-based tools. For instance, RInno and DesktopDeployR are two tools with the intent of creating Windows (no mac/linux) installers that support R or R/shiny tools. But the intent of tools like this is to facilitate the IT tasks involved with getting a user/client to install and maintain R on their computer, not with protecting the code that it runs.
Constrain R.exe?
There have been questions (elsewhere?) that ask if they can modify the R interpreter itself so that it does not do everything it is intended to do. For instance, one could redefine base::print in such a way that functions' contents cannot be dumped, and debug doesn't show the code it's about to execute, and perhaps several other protective steps.
There are a few problems with this approach:
There is always another way to get at a function's contents. Even if you stop print.default and the debugger from doing this, there are other ways to get to the functions (body(.), for one). How many of these rabbit holes do you feel you will accurately traverse, get them all ... with no adverse effect on normal R code?
Even if you feel you can get to them all, are you encrypting the source .R files that contain your proprietary content? Okay, encrypting is good, except you need to decrypt the contents somehow. Many tools that have encrypted contents do so to thwart reverse-engineering, so they also embed (obfuscatedly, of course) the decryption key in the application itself. Just give it time, somebody will find and extract it.
You might think that you can download the key on start-up (not stored within the app), so that the code is decrypted in real-time. Sorry, network sniffers will get the key. Even if you retrieve it over https://, tools such as https://mitmproxy.org/ will render this step much less effective.
Let's say you have recompiled R to mask print and such, have a way to distribute source code encrypted, and are able to decrypt it in a way that does not easily reveal the key (for full decryption of the source code files). While it takes a dedicated user to wade through everything above to get to the source code, none of the above steps are required: they may legally compel you to release your changes to the R interpreter itself (that you put in place to prevent printing function contents). This doesn't reveal your source code, but it will reveal many of your methods, which might be sufficient. (Or just the risk of legal costs.)
R is GPL, and that means that anything that links to it is also "tainted" with the GPL. This means that anything compiled with Rcpp, for instance, will also be constrained/liberated (your choice) by the GPL. This includes thoughts of using RInside: it is also GPL (>= 2).
To do it without touching the GPL, you'd need to write your interpreter (relatively from scratch, likely) without code from the R project.
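The point about embedded decryption keys can be illustrated with a toy sketch. It is written in Python rather than R, the XOR "cipher" is a deliberately weak stand-in for real encryption, and the key and function names are hypothetical. What it shows is structural: whatever the application does to recover the script at run time, a user holding the same application can do too.

```python
import base64
from itertools import cycle

# Hypothetical key "hidden" inside the shipped application.
EMBEDDED_KEY = b"not-really-secret"


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR data with a repeating key (toy cipher, NOT real encryption)."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


def pack(source: str) -> str:
    """What the vendor does before shipping: obscure the script text."""
    obscured = xor_bytes(source.encode("utf-8"), EMBEDDED_KEY)
    return base64.b64encode(obscured).decode("ascii")


def unpack(blob: str) -> str:
    """What the application must do at run time, and so can any user."""
    return xor_bytes(base64.b64decode(blob), EMBEDDED_KEY).decode("utf-8")


shipped = pack("result <- transform(data)")
recovered = unpack(shipped)  # the original script text, recovered client-side
```

A stronger cipher changes how long the extraction takes, not whether it is possible, since the key still has to reach the client's machine.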
Alternatives
Ultimately, if you want to release R-based utilities/apps/functionality to clients, the only sure-fire way to allow them to use your code without seeing it is to ... control the computers on which R will run (and source code will reside). I'll add more links supporting this claim as I find them, but a small start:
https://stat.ethz.ch/pipermail/r-help/2011-July/282717.html
https://www.researchgate.net/post/How_to_make_invisible_the_R_code
Options include anything that keeps the R code and R interpreter completely under your control. Simple examples:
Shiny apps, self-hosted (or on shinyapps.io if you trust their security); servers include Shiny Server (both free and commercial versions), RStudio Connect (commercial only), and ShinyProxy. (The list is not known to be exhaustive.)
Rplumber is an API server, not a shiny server. The intent is for single HTTP(s) endpoint calls, possibly authenticated, supporting whatever HTTP supports (post, get, etc). This can be served in various ways, see its hosting page for options.
Rserve. I know less about this, but from what I've experienced with it, I've not had as much luck integrating with enterprise systems (where, e.g., authentication and fine-control over authorization is important). This does allow near-raw access to R, so it might not be what you want (especially when the intent is to give to clients who may not be strong R users themselves).
OpenCPU should be discussed, but not as a viable candidate for "protect your code". It is very similar to rplumber in that it provides HTTP endpoints, but it supports endpoints for every exported function in every package installed in its R library. This includes the base package, so it is not at all difficult to get the source code of any function that you could get on the R console. I believe this is a design feature, even if it is perfectly at odds with your intent to protect your code.
Anything that can call R or Rscript. This might be PHP or mod_python or similar. Any web-page serving language that can exec("/usr/bin/Rscript",...) can take its output and turn it around to the calling agent. (It might also be possible, for example, for a PHP front-end to call an opencpu endpoint that only permits connections from the PHP-serving host.)
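The last option, a front-end that executes Rscript and hands its output back to the caller, might look roughly like this in Python. This is a sketch only: the helper names are mine, the /usr/bin/Rscript path is the one quoted above, and a real deployment would sit behind a web framework with input validation and authentication.

```python
import subprocess
from typing import List


def run_script(command: List[str], timeout: float = 30.0) -> str:
    """Run an external interpreter and return its standard output as text."""
    completed = subprocess.run(
        command,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout


def rscript_command(script_path: str, *args: str) -> List[str]:
    """Build the argument list for Rscript (path assumed, as quoted above)."""
    return ["/usr/bin/Rscript", script_path, *args]
```

A call such as run_script(rscript_command("job.R", "2020")) (job.R being a hypothetical script on the server) returns only what the script prints; the script itself never leaves the machine you control.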

Should a commercial Web Application that uses a number of Open Source Licences have an about/notices page? [closed]

Should a web application have a notices/about page that details the components/libraries that we are using and their relevant licences?
Some Licences such as LGPL (Used by SevenZipSharp) have a combined works clause:
LGPL v3
Combined Works. You may convey a Combined Work under terms of your choice that, taken together, effectively do not restrict modification of the portions of the Library contained in the Combined Work and reverse engineering for debugging such modifications, if you also do each of the following:
a) Give prominent notice with each copy of the Combined Work that the Library is used in it and that the Library and its use are covered by this License.
b) Accompany the Combined Work with a copy of the GNU GPL and this license document.
c) For a Combined Work that displays copyright notices during execution, include the copyright notice for the Library among these notices, as well as a reference directing the user to the copies of the GNU GPL and this license document.
d) Do one of the following:
0) Convey the Minimal Corresponding Source under the terms of this License, and the Corresponding Application Code in a form suitable for, and under terms that permit, the user to recombine or relink the Application with a modified version of the Linked Version to produce a modified Combined Work, in the manner specified by section 6 of the GNU GPL for conveying Corresponding Source.
1) Use a suitable shared library mechanism for linking with the Library. A suitable mechanism is one that (a) uses at run time a copy of the Library already present on the user's computer system, and (b) will operate properly with a modified version of the Library that is interface-compatible with the Linked Version.
e) Provide Installation Information, but only if you would otherwise be required to provide such information under section 6 of the GNU GPL, and only to the extent that such information is necessary to install and execute a modified version of the Combined Work produced by recombining or relinking the Application with a modified version of the Linked Version. (If you use option 4d0, the Installation Information must accompany the Minimal Corresponding Source and Corresponding Application Code. If you use option 4d1, you must provide the Installation Information in the manner specified by section 6 of the GNU GPL for conveying Corresponding Source.)
LGPL v2.1
You must give prominent notice with each copy of the work that the Library is used in it and that the Library and its use are covered by this License. You must supply a copy of this License. If the work during execution displays copyright notices, you must include the copyright notice for the Library among them, as well as a reference directing the user to the copy of this License. Also, you must do one of these things:
a) Accompany the work with the complete corresponding machine-readable source code for the Library including whatever changes were used in the work (which must be distributed under Sections 1 and 2 above); and, if the work is an executable linked with the Library, with the complete machine-readable "work that uses the Library", as object code and/or source code, so that the user can modify the Library and then relink to produce a modified executable containing the modified Library. (It is understood that the user who changes the contents of definitions files in the Library will not necessarily be able to recompile the application to use the modified definitions.)
b) Use a suitable shared library mechanism for linking with the Library. A suitable mechanism is one that (1) uses at run time a copy of the library already present on the user's computer system, rather than copying library functions into the executable, and (2) will operate properly with a modified version of the library, if the user installs one, as long as the modified version is interface-compatible with the version that the work was made with.
c) Accompany the work with a written offer, valid for at least three years, to give the same user the materials specified in Subsection 6a, above, for a charge no more than the cost of performing this distribution.
d) If distribution of the work is made by offering access to copy from a designated place, offer equivalent access to copy the above specified materials from the same place.
e) Verify that the user has already received a copy of these materials or that you have already sent this user a copy.
A number of other licences have similar clauses about displaying a notice in your application or combined works.
Does this mean that web applications need an about page detailing the components used? I can't find any examples of websites doing this, although lots of sites will often be using open-source libraries.
I guess the get-out clause that websites/applications currently rely on is: who would know whether an open-source component was being used in a combined work/web application on the server side? But that isn't an excuse, especially if the web app were ever audited.
Just because others do it wrong doesn't give you the right to do it wrong.
If the license for a library says that your application should include a page with the appropriate credits, then that is what you should do.
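If you do decide to add such a page, generating it from a single list of components keeps it from drifting out of date. A minimal sketch, assuming a hand-maintained component list; the Component type, the wording of each entry, and the licence URL field are illustrative choices, not something any licence prescribes:

```python
from typing import NamedTuple, Sequence


class Component(NamedTuple):
    name: str
    license_name: str
    license_url: str


def render_notices(components: Sequence[Component]) -> str:
    """Render a plain-text notices page, one line per third-party component."""
    lines = ["Third-party components used in this application:", ""]
    for item in sorted(components, key=lambda c: c.name.lower()):
        lines.append(
            f"- {item.name} is used under the {item.license_name}; "
            f"see {item.license_url}"
        )
    return "\n".join(lines)


# Example entry taken from the question; the URL is a generic pointer.
page = render_notices(
    [Component("SevenZipSharp", "LGPL", "https://www.gnu.org/licenses/")]
)
```

Whether this text satisfies a particular licence's "prominent notice" wording is a legal question the sketch does not answer.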
Edit:
(This is my interpretation of the license agreement, I may be wrong).
I believe there's a (quite big) difference between a library and a web application. The license agreement posted above seems to be for other libraries, not for end-user web applications. I believe that if you make your own library from the SevenZipSharp library, the quote from the license agreement above applies, and forces you to distribute your library under the LGPL as well. But if you use the SevenZipSharp library in an application that's meant for the user and not for developers, the above quote from the license does not apply.
Sorry for reopening this rather old question, but apparently there was no accepted conclusion. As a disclaimer, IANAL, so everything that follows is a layman's interpretation of how these licenses might be applied to this topic. I'm really looking for clarifications, confirmations or for someone proving me wrong!
My current interpretation is that a running Web Application does not "distribute" itself to every surfer, it merely generates output which is then rendered in the surfer's browser.
So the Web Application would be required to contain copyright / licensing notices in its own documentation / installation instructions / administrator's or operator's manual, which is targeted at the Web Application's users, who are in fact the people who install, configure and operate it.
The licenses usually do not cover OUTPUT generated by an application, which is what the surfers visiting the web pages generated by the application consume. Thus, no copyright notice has to be included in the generated output, i.e. the web pages.
In interactive Web Applications, the line between "the application" and "generated output" is not that clear and sharp, however, and might not even exist any more if parts of the application code are contained in the shipped web pages (as Javascript applications). So I'm not sure if my interpretation will hold or if it even is valid at all.

Encrypting R script under MS-Windows

I have a bunch of R scripts which I am running on a Windows machine and want to ensure that the code remains unread by those not intended to see it. On a Linux box, I could wrap the R code in a bash script #! and make an encrypted (and perhaps even a limited-life) executable shell script. What are my options to do something on similar lines under Windows?
My answer is a bit late, but I believe this is a good question. Unfortunately, I don't believe that there is a solution, or at least an easy one, at the present time.
The difficulty is common because, for most interpreted languages, including R, it is often possible to turn on logging and inspection of all commands being run. This can negate many tricks to obfuscate the code.
For those who prefer to think of code being open == good, one should know that a common reason to obfuscate the code is if one is consulting with a client that hires multiple vendors. It is not uncommon for a client to take scripts from vendor A and ask vendor B why it doesn't work with their system. (This may be done by a low-level IT flunkie, rather than someone responsible for the NDA contracts.) If A & B are competitors, A's code has just been handed to B. When scripts == serious programs, then serious code has been given away.
The ways I've seen this addressed are:
Make a call to a compiled language, and use standard protections available there.
Host the executable on a different server, and use calls to the server to execute the calculations. (In R, there are multiple server-side options.)
Use compiled (preprocessed / bytecode) code within the language.
Option 2 is actually easier and better when the code may be widely distributed, not just for IP reasons. A major advantage is that it lets you upgrade the code without having to go through the pain of a site-wide release process. If new libraries are needed, no problem - update the server.
Option 3 is done in Matlab with .p files, and can be done with py2exe for Python on Windows. In R, the new bytecode compilation may be analogous, but I am not familiar enough with it to address any differences between .Rc files in the R context and .p files in the Matlab context. For more info on the compiler, see: http://www.inside-r.org/r-doc/compiler/compile
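How much protection bytecode gives depends on the format. As a hedged illustration in Python only (I have not reproduced this for R's compiler package or Matlab's .p files), the standard library alone is enough to pull constants and identifiers back out of compiled code; the "hidden" source and the helper name are made up for the example:

```python
import types

# Hypothetical proprietary source that is shipped only in compiled form.
HIDDEN_SOURCE = (
    "def price(amount):\n"
    "    secret_rate = 0.175\n"
    "    return amount * secret_rate\n"
)


def inspect_code(code: types.CodeType):
    """Collect constants and names from a code object and all nested ones."""
    constants = []
    names = set(code.co_names) | set(code.co_varnames)
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            nested_constants, nested_names = inspect_code(const)
            constants.extend(nested_constants)
            names |= nested_names
        else:
            constants.append(const)
    return constants, names


compiled = compile(HIDDEN_SOURCE, "<shipped>", "exec")
constants, names = inspect_code(compiled)
# constants now includes 0.175; names includes "price" and "secret_rate"
```

So at least in Python, compiling raises the effort required but does not hide the logic from a determined reader.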
Hosting computations on the server is great for working with unsophisticated users, because it is easier to iterate quickly in response to bugs or feature requests. The IP protection is simply a benefit.
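A rough sketch of the server-hosted option, using only Python's standard library: the client posts data and receives a result, while the calculation stays on the server. The endpoint, the JSON shape, and proprietary_calculation are all hypothetical, and a real service would add authentication and TLS.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, Request, build_opener


def proprietary_calculation(values):
    """Stand-in for the code that must stay on the server."""
    return sum(values) / len(values)


class ComputeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        result = proprietary_calculation(payload["values"])
        body = json.dumps({"result": result}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def call_server(url, values):
    """Client side: send data, get back only the result."""
    request = Request(
        url,
        data=json.dumps({"values": values}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    opener = build_opener(ProxyHandler({}))  # ignore any system proxy
    with opener.open(request, timeout=10) as response:
        return json.loads(response.read())["result"]


def demo():
    """Start a local server on a free port, make one call, shut down."""
    server = HTTPServer(("127.0.0.1", 0), ComputeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        return call_server(f"http://127.0.0.1:{port}/", [1, 2, 3, 6])
    finally:
        server.shutdown()
        server.server_close()
```

Updating proprietary_calculation on the server changes the behaviour for every client at once, which is the upgrade advantage mentioned above.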
This is not a specifically R-oriented strategy. (And it's a bit unclear what your constraints or goals really are anyway.) If you want a cross-platform encryption method, you should look into the open-source program TrueCrypt. It supports creating encrypted files that can be mounted as volumes on any machine that supports the volume formatting method. I have tested this across the Mac-PC divide, since the Mac can read FAT files, but have no experience with how it might work across the Linux-PC chasm.
(Their TODO list for Windows includes: "Command line options for volume creation (already implemented in Linux and Mac OS X versions)". So I don't see any clear way to use this from within R without you running the program from the OS.)
I don't think this is possible because the R interpreter has to be able to decrypt and read the code in order to execute it which means that whoever is using that interpreter will also be able to decrypt and read the code.
I am by no means an expert, so I reserve the right to be 100% wrong about that statement.
I believe the best solution is to ensure value comes from the expertise and services provided by your company and its employees, not from keeping secrets.
Failing that, you could try separating the code into a client/server model. That way the client just sends data and receives results---they never have access to the code that runs on the server.
However, the scientist in me just said "that solution sucks and I would never trust results provided under such conditions".

Which Publish method is most efficient at maintaining a large website?

I'm using VS2010 and TFS to build a complex medium sized website.
Which protocol is most efficient and secure? I think I can enable any protocol I need since I own the IIS server and control all aspects of it.
My choices are:
Webdeploy
FTP
FileSystem
FPSE
There is also a hint at something called "one click"... not sure what that is, or if it relates to any of the above.
OK.. I'm sorry, but I'm not sure where to even start, and I'm not sure the question is answerable as-is. I'd probably put this as a note if there weren't a limit on the number of characters.
So much depends on the type of data in this app, your financial resources, etc. This is one of those subjects that seems like a simple question, but the more you learn, the more you realize you don't know. What you're talking about is release management, which is just one piece of the puzzle in an overall Application Life-cycle Management strategy.
(hint, start at the link posted, and be prepared to spend months learning).
Some of the factors you may need to be aware of are regulatory factors that you may not even have thought of. Certain data is protected, and different standards require you to have formalized risk and release management built into your processes. For example, credit card data, medical records, etc., all have different regulations (some actual laws, some imposed by the Payment Card Industry) that you need to be aware of.
If your site contains ANY sensitive data, you need to first find out whether any of these rules apply to you, and if so, which ones? Do any of them require audit trails for how code goes from development to deployment? (PCI does, for example. That's because we take credit card payments, and in order to do that, you need to be PCI Certified or face heavy fines.)
If your site contains NO sensitive information at all, then your question could be answered as-is, and the question becomes a matter of what you're comfortable with.
If your application DOES contain sensitive info that makes it subject to rules that mandate a documented, secure ALM process, then the question becomes more complex, because doing deployments manually in such a situation is a PAIN IN THE BUTT. It doesn't take too long before you start looking at tools to help automate some of the processes. (Build servers, tools such as Aldon for deployment, etc. There is a whole host of commercial and open source software to choose from.)
(we're using Atlassian for most of our ALM, but Team Foundation Server is also excellent, and there are a TON of other options.)

Drupal development workflow for teams

In my last Drupal project we were 5 people coding and installing new modules, while at the same time our client was putting up content. Since we chose to have only one server for simplicity, there were times when many people needed to write to the same files, like style.css or page.tpl.php, or when someone's broken code would prevent others from working.
Are there any best practises for a team that works with Drupal? How can we leverage code repositories or sandboxes?
A single server may appear to give you "simplicity", but what it gives you, as you've experienced, is utter chaos -- and you were lucky if it didn't result in unpleasant and hard-to-reproduce, harder-to-fix crashes. Don't settle for anything less than a "production" server (where your client can be working -- on content only -- if they like minor risks;-) and a "staging" one (where anything from the development team goes to get tested and tried for a while before promotion to production, which is done at a quiet and ideally prearranged time).
Second, use a version control system of some kind. Which one matters less than using one at all: svn is popular and simple, the latest fashion (for excellent reasons) is distributed ones such as hg and git, Microsoft and others have commercial offerings in the field, etc.
The point is, whenever somebody's updating a file, they're doing so on their own client of the VCS. When a coherent set of changes is right, it's pushed to the VCS, and the VCS diagnoses and points out any "conflicts" (places where two developers may have made contradictory changes) so the developer who's currently pushing is responsible for editing the files and fixing the conflicts before their pushes are allowed to go through. Only then are "current versions" allowed to even go on the staging system for more thorough (and ideally automated!-) testing (or, better yet, a "continuous build" system).
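To make the idea of a "conflict" concrete, here is a toy model in Python. It is not how svn, hg or git actually merge (they use proper diff and three-way merge algorithms that handle inserted and deleted lines); it only handles in-place edits to files with the same number of lines, and the function name is my own.

```python
from typing import List


def conflicting_lines(base: List[str], ours: List[str], theirs: List[str]) -> List[int]:
    """Return 1-based line numbers that both sides changed, but differently.

    Toy model: all three versions must have the same number of lines.
    """
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("toy model only handles in-place line edits")
    conflicts = []
    for number, (b, o, t) in enumerate(zip(base, ours, theirs), start=1):
        # A conflict needs both sides to differ from base AND from each other.
        if o != b and t != b and o != t:
            conflicts.append(number)
    return conflicts


base = ["body { color: black; }", "h1 { margin: 0; }", "p { margin: 1em; }"]
ours = ["body { color: black; }", "h1 { margin: 4px; }", "p { margin: 1em; }"]
theirs = ["body { color: black; }", "h1 { margin: 8px; }", "p { margin: 2em; }"]
clashes = conflicting_lines(base, ours, theirs)  # only line 2 clashes
```

Line 3 was changed by one developer only, so it merges cleanly; line 2 was changed by both in different ways, which is the case the VCS hands back to the person pushing.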
Basically, there should be two layers of defense against such conflicts as you observed, and you seem to have deployed neither. They're both essential, though, if forced under duress to pick just one, I guess I'd reluctantly pick the distinction between production and staging servers -- development will still be chaotic (intolerably so compared to the simple solidity of any VCS!) but at least it won't directly hurt the actual serving system;-).
Here's a great writeup about development workflow in Drupal. It sums up everything said here so far and adds "Features", "Strongarm" and a few more tricks to the equation. http://www.lullabot.com/articles/site-development-workflow-keep-it-code

Resources