I am well aware that this might not be the typical SO question, but since this is the strongest R programming community I know, and the author of OpenCPU explicitly encourages posting here, I'll give it a try:
What role does data play in the OpenCPU approach? I mean, cloud computing is nice, but you need some data to calculate on. Uploading an example .csv or .xls table might be straightforward, but what does OpenCPU have in mind for real-world data?
What about several hundred MB (or even GB) of data? How would you a) transfer it to your user folder, b) share it among a group of authenticated users, and c) hide it from the public?
I read the license part, and from what I understand, for safety it should be possible to run the calculations behind the scenes as long as the source code is publicly available. Still, the short document leaves open questions and a lot of guessing.
Thanks for trying OpenCPU. OpenCPU is still an evolving project at this point, so we are open to interesting suggestions or use cases.
About the data... you are asking many things at once. Some thoughts:
At this point, OpenCPU does not solve the 'big data' problem. It does not scale beyond what R itself scales to. It is mostly meant as infrastructure for small to medium-sized data, e.g. a typical research paper or project.
OpenCPU is an API. It is not limited to browser clients. It is designed to be called from other clients as well.
OpenCPU has a store that you can use to keep R objects on the server. E.g. you upload a CSV (or whatever) once and store the resulting data frame; in any subsequent call you can then include this object as an argument to function calls.
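As a rough sketch of how this looks against a current OpenCPU server (the server URL and session key are illustrative; check the API docs for your version):

    # upload a CSV once; the server parses it and returns a session key
    curl -F "file=@mydata.csv" https://your.server/ocpu/library/utils/R/read.csv
    # -> ... /ocpu/tmp/x067b4172/ ...

    # reuse the stored data frame by passing the session key as an argument
    curl https://your.server/ocpu/library/stats/R/lm \
      -d "formula=y~x" -d "data=x067b4172"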
Another approach would be to combine it with an external database (e.g. MySQL) and dynamically pull the data in your R code (e.g. using RMySQL).
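A minimal sketch of that second approach with RMySQL (host, credentials, and table name are illustrative):

    # connect, pull only what the current call needs, disconnect
    library(RMySQL)
    con <- dbConnect(MySQL(), host = "db.example.com", dbname = "mydb",
                     user = "reader", password = Sys.getenv("DB_PASS"))
    df <- dbGetQuery(con, "SELECT * FROM measurements WHERE year = 2013")
    dbDisconnect(con)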
Afaik, the legal aspects of open data are not completely clear at this point. I don't think there is consensus on how copyright applies to data, or on what a good license would be. However, a key feature in the design of OpenCPU is making sure things are easily reproducible, which can of course only be done when the data is actually public.
Matt,
I'm dealing with a real-life use case that involves transforming and processing data from a 3GB (but growing) dataset. Here is the approach I am using (mostly based on suggestions from Gergely Daróczi):
As long as the source data can fit into the server's memory, I'd choose to load the data with my R package and persist it across user sessions (e.g. preloading data packages with OpenCPU).
If that's not an option on your server, an alternative is to copy your data to a ramdisk (Linux tmpfs) as .rds (or .rda, .RData, etc.) files, record the paths via options() so your package can read them with getOption("path_to_my_persistent_data_files"), and load/unload the file(s) as needed in your package functions (see the sketch at the end of this answer).
When your data no longer fits into memory, I'd look into using a MongoDB backend together with the R interface rmongodb, as that would likely be faster and easier to maintain than an RDBMS.
Currently OpenCPU does not provide any support for large persistent datasets; it's up to you to find the approach that best suits your needs and resources.
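A minimal sketch of the ramdisk approach above, assuming the .rds file already sits on a tmpfs mount (path and option name are illustrative):

    # set once, e.g. in the package's .onLoad()
    options(path_to_my_persistent_data_files = "/mnt/ramdisk/mydata.rds")

    load_big_data <- function() {
      path <- getOption("path_to_my_persistent_data_files")
      readRDS(path)  # fast, since the file lives on RAM-backed tmpfs
    }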
You can also install a local instance of OpenCPU; you don't have to use the existing one on the Internet. Instructions are on the site.
I have close to 10K JSON files (all very small). I would like to provide search functionality. Since these JSON files are fixed for a specific release, I am thinking of pre-indexing the files and loading the index during startup of the website. I don't want to use an external search engine.
I am searching for libraries to support this. Lucene.NET is one popular library, but I am not sure whether it supports loading pre-built index data.
Index the JSON documents and store the index results (probably in a single file), then save them to a file storage service like S3 - console app.
Load the index file and respond to queries - ASP.NET Core app.
I am not sure whether this is possible or not. What options are available?
Since S3 is not a .NET-specific technology and Lucene.NET is a line-by-line port of Lucene, you can expand your search to include Lucene-related questions. There is an answer here that points to an S3 implementation meant for Lucene that could be ported to .NET. But, by the author's own admission, performance of the implementation is not great.
NOTE: I don't consider this to be a duplicate question due to the fact that the answer most appropriate to you is not the accepted answer, since you explicitly stated you don't want to use an external solution.
There are a couple of implementations for Lucene.NET that use Azure instead of AWS here and here. You may be able to get some ideas that help you to create a more optimal solution for S3, but creating your own Directory implementation is a non-trivial task.
Can IndexReader read an index file from an in-memory string?
It is possible to use a RAMDirectory, which has a copy constructor that moves an entire index from disk into memory. The copy constructor is only useful if your files are on disk, though; you could first fetch the files from S3 and then load them into a RAMDirectory. This option is fast for small indexes but will not scale if your index grows over time, and it is not optimized for high-traffic websites where multiple concurrent threads perform searches.
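As a rough sketch (Lucene.NET 4.8 style; the index path is illustrative and the exact constructor signature may differ in your version):

    using Lucene.Net.Index;
    using Lucene.Net.Search;
    using Lucene.Net.Store;

    // copy an on-disk index entirely into memory, then search against it
    var diskDir = FSDirectory.Open("/path/to/index"); // files fetched from S3 beforehand
    var ramDir = new RAMDirectory(diskDir, IOContext.READ_ONCE);
    using (var reader = DirectoryReader.Open(ramDir))
    {
        var searcher = new IndexSearcher(reader);
        // run queries against searcher...
    }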
From the documentation:
Warning: This class is not intended to work with huge indexes. Everything beyond several hundred megabytes will waste resources (GC cycles), because it uses an internal buffer size of 1024 bytes, producing millions of byte[1024] arrays. This class is optimized for small memory-resident indexes. It also has bad concurrency on multithreaded environments.
It is recommended to materialize large indexes on disk and use MMapDirectory, which is a high-performance directory implementation working directly on the file system cache of the operating system, so copying data to heap space is not useful.
When you call the FSDirectory.Open() method, it chooses a directory implementation that is optimized for the current operating system. In most cases it returns an MMapDirectory, which uses the System.IO.MemoryMappedFiles.MemoryMappedFile class under the hood with multiple views. This option will scale much better if the index is large or there are many concurrent users.
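For example, a minimal sketch (the path, field name, and query term are illustrative):

    using Lucene.Net.Index;
    using Lucene.Net.Search;
    using Lucene.Net.Store;

    // FSDirectory.Open picks the best implementation for the OS,
    // usually MMapDirectory on 64-bit systems
    var dir = FSDirectory.Open("/path/to/index");
    using (var reader = DirectoryReader.Open(dir))
    {
        var searcher = new IndexSearcher(reader);
        var hits = searcher.Search(new TermQuery(new Term("title", "foo")), 10);
        // inspect hits.TotalHits, hits.ScoreDocs ...
    }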
To use Lucene.NET's built-in index file optimizations, you must put the index files on a medium that can be read like a normal file system. Rather than trying to roll a Lucene.NET solution that talks to S3's APIs directly, you might want to look into mounting S3 as a file system instead, though I am not sure how that would perform compared to a local file system.
Generally, I am excited by the Temporal Database feature.
However, mysqldump does not support exporting and restoring such a database.
I can find no resource in the documentation (linked to above) that indicates which methods of backup and restore are safe to use for this type of database. Google searches do not seem to help.
Does anyone have any insights into using these MariaDB temporal databases in production environments? Or more specifically, in using them in development environments, and then transferring the database to a production environment and still keeping the history of the database intact?
I understand this is something of a dev-ops question, but it seems a pretty central issue to how to work with and around this new feature. Does anyone have insights into moving these databases around and relying on that process in production? I'm just wondering how mature this technology is, given that this issue (which seems pretty central) is not covered in the documentation.
Unfortunately, as the documentation states, while mysqldump will dump these tables, the invisible temporal columns are not included; the tool will only back up the current state of the tables.
Luckily, there are a couple of options here:
You can use mariadb-enterprise-backup or mariabackup, which should support the new format of the temporal data and correctly back it up (these tools do binary backups instead of table dumps); see the sketch after the links below:
https://mariadb.com/docs/usage/mariadb-enterprise-backup/#mariadb-enterprise-backup
https://mariadb.com/kb/en/library/full-backup-and-restore-with-mariabackup/
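A minimal sketch of a full backup/restore cycle with mariabackup (paths and credentials are illustrative):

    # take a full binary backup, then prepare it for restore
    mariabackup --backup --target-dir=/backups/full --user=root --password=secret
    mariabackup --prepare --target-dir=/backups/full

    # restore into an empty datadir (stop the server first)
    mariabackup --copy-back --target-dir=/backups/full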
Unfortunately, we have found the tool to be somewhat unreliable - especially when using the MyRocks storage engine. However, it is constantly improving.
To get around this, on our production servers we take advantage of slave replication, which keeps the temporal data (and everything else) intact across all our nodes. We then do secondary backups by taking a slave node down and making a straight copy of the database data files. For more information on how to set up replication, please refer to the documentation:
https://mariadb.com/kb/en/library/setting-up-replication/
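Setting up the slave boils down to something like this (host, credentials, and binlog coordinates are illustrative):

    -- on the slave, point it at the master and start replicating
    CHANGE MASTER TO
      MASTER_HOST='master.example.com',
      MASTER_USER='repl',
      MASTER_PASSWORD='secret',
      MASTER_LOG_FILE='mariadb-bin.000001',
      MASTER_LOG_POS=4;
    START SLAVE;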
So you could potentially set up a dev copy of the database with replication and just copy the data from there. In your case, however, mariabackup might also do the trick.
Regardless of how you do it, be wary of the system clock when setting up replication or when moving these files between systems. You can run into problems when the clocks are not in sync (or when the systems are in different time zones). There is some official documentation (and mitigation advice) on this topic as well:
https://mariadb.com/kb/en/library/temporal-data-tables/#use-in-replication-and-binary-logs
Looking at your additional comment: I am not aware of any way to get a complete image of a database as it looked on a given date (with temporal data included) directly from MariaDB itself; I don't think this information is stored in a way that makes that possible. However, there is a workaround even for this: you could use the above method in combination with incremental rdiff backups. What you would do is:
Back up the database with any of the above methods.
Use rdiff-backup (https://www.nongnu.org/rdiff-backup/) on those backup files, running it once per day.
This would allow you to fetch an exact copy of how the database looked on any given date of your choice. rdiff-backup also fully supports ssh, allowing you to do things like:
rdiff-backup -r 10D host.net::/var/lib/mariadb /my/tmp/mariadb
This would fetch a copy of those backup files as they looked 10 days ago.
For future planning, according to https://mariadb.com/kb/en/system-versioned-tables/#limitations:
Before MariaDB 10.11, mariadb-dump did not read historical rows from versioned tables, and so historical data would not be backed up. Also, a restore of the timestamps would not be possible as they cannot be defined by an insert/a user. From MariaDB 10.11, use the -H or --dump-history options to include the history.
10.11 is still in development as of writing this answer.
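For example (a hedged sketch; database and table names are illustrative):

    # MariaDB 10.11+: include history rows from system-versioned tables
    mariadb-dump --dump-history -u root -p mydb my_versioned_table > dump.sql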
This question is about using Dropbox to sync an sqlite Core Data store between multiple iOS devices. Consider this arrangement:
1. An app utilizes a Core Data store, call it local.sql, saved in the app's own NSDocumentDirectory.
2. The app uses the Dropbox Sync API to observe a certain file in the user's Dropbox, say, user/myapp/synced.sql.
3. The app observes NSManagedObjectContextDidSaveNotification, and on every save it copies local.sql to user/myapp/synced.sql, thereby replacing the latter.
4. When the Dropbox API notifies us that synced.sql changed, we do the opposite of part 3, more or less: tear down the Core Data stack, replace local.sql with synced.sql, and recreate the Core Data stack. The user sees "Syncing" or "Loading" in the UI in the meantime.
Questions:
A. Is this arrangement hugely inefficient, to the extent where it should be completely avoided? What if we can guarantee the database is not large in size?
B. Is this arrangement conducive to file corruption? More than syncing via deltas/changelogs? If so, will you please explain in detail why?
A. Is this arrangement hugely inefficient, to the extent where it should be completely avoided? What if we can guarantee the database is not large in size?
Irrelevant, because:
B. Is this arrangement conducive to file corruption? More than syncing via deltas/changelogs? If so, will you please explain in detail why?
Yes, very much so. Virtually guaranteed. I suggest reviewing How to Corrupt An SQLite Database File. Offhand you're likely to commit at least two of the problems described in section 1, including copying the file while a transaction is active and deleting (or failing to copy, or making a useless copy of) the journal file(s). In any serious testing, your scheme is likely to fall apart almost immediately.
If that's not bad enough, consider the scenario where two devices save changes simultaneously. What then? If you're lucky you'll just get one of Dropbox's notorious "conflicted copy" duplicates of the file, which "only" means losing some data. If not, you're into full database corruption again.
And of course, tearing down the Core Data stack to sync is an enormous inconvenience to the user.
If you'd like to consider syncing Core Data via Dropbox I suggest one of the following:
Ensembles, which can sync over Dropbox (while avoiding the problems described above) or iCloud (while avoiding the problems of iOS's built-in Core Data/iCloud sync).
TICoreDataSync, which uses Dropbox file sync but which avoids putting a SQLite file in the file store.
ParcelKit, which uses Dropbox's new data store API. (Note that this is quite new and that the data store API itself is still beta).
I need to access some data on my asp.net website. The data relates to around 50 loan providers.
I could simply build it into the web page for now; however, I know that I will need to re-use it soon, so it's probably better to make it more accessible.
The data will probably only change once in a while, maybe once a month at most. I was looking at the best method of storing the data (database? XML file?) and then how to persist it in my site (cache, perhaps).
I have very little experience so would appreciate any advice.
It's hard to beat a database, and by placing it there, you could easily access it from anywhere you wanted to reuse it. Depending on how you get the updates and what DBMS you are using, you could use something like SSIS (for MS SQL Server) to automate updating the data.
ASP.NET also has a robust API for interacting with a database and using it as a data source for many of its UI structures.
Relational databases are tools for storing data when access to the data needs to be carefully controlled to ensure that it is atomic, consistent, isolated, and durable (ACID). To accomplish this, databases include significant additional infrastructure overhead and processing logic. If you don't need this overhead, why subject your system to it? There is a broad range of other data-storage options at your disposal that might be more appropriate, and they should at least be considered in your decision process.
Using ASP.NET, you have access to several other options, including text files, custom configuration files (stored as XML), custom XML, and .NET classes serialized to binary or XML files. The fact that your data changes so infrequently may make one of these options more appropriate. Using one of these options also reduces system coupling: functions dependent on this data are no longer dependent on the existence of a functioning database.
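For instance, here is a minimal sketch of the serialized-class option combined with ASP.NET's cache; the LoanProvider type, file path, and cache key are illustrative assumptions:

    using System.Collections.Generic;
    using System.IO;
    using System.Web;
    using System.Web.Caching;
    using System.Xml.Serialization;

    public static class ProviderData
    {
        public static List<LoanProvider> GetProviders(string path)
        {
            // serve from cache until the underlying XML file changes
            var cached = (List<LoanProvider>)HttpRuntime.Cache["providers"];
            if (cached != null) return cached;

            var serializer = new XmlSerializer(typeof(List<LoanProvider>));
            List<LoanProvider> providers;
            using (var stream = File.OpenRead(path))
                providers = (List<LoanProvider>)serializer.Deserialize(stream);

            // the file dependency invalidates the cache on the monthly update
            HttpRuntime.Cache.Insert("providers", providers, new CacheDependency(path));
            return providers;
        }
    }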
A customer has a web-based inventory management system. The system is proprietary and complicated: it has around 100 tables in the DB with complex relationships between them, and ~1,500,000 items.
The customer is doing some reorganisation of his processes and now needs to make massive updates and manipulations to the data (data changes only, no structural changes). The online screens do not permit such work, since they were designed at the beginning without this requirement in mind.
The database is MS SQL Server 2005, and the application is ASP.NET running on IIS.
One solution is to build him new screens where he could visualize the data in grids and do the required job on a large number of records. This would permit us to use the already existing functions that deal with single items (we would just need to implement a loop). At this moment the customer is aware of two kinds of such massive manipulations he wants to do, but says there will be others. This approach will require design, coding, and testing every time we get a request.
However, the customer's needs are urgent because of some regulatory requirements, so I am wondering whether it would be more efficient to use some kind of mapping between MSSQL and Excel or Access to expose the needed information, make the changes in Excel or Access, and then save them to the DB, maybe using SSIS.
I am not familiar with SSIS or other technologies that do such things, which is why I cannot judge whether the second solution is indeed efficient and better than the first. Of course the second solution will also require some work and testing, but will it be quicker and less expensive?
The other question is: are there any other ways to do this?
Any ideas will be greatly appreciated.
Either way, you are going to need testing.
Say you export 40,000 products to Excel, he reorganizes them, and then you bring them back into a staging table (or tables) and apply the changes to your SQL tables. Since Excel is basically a freeform system, what happens if he introduces invalid situations? Your update will need to detect them, fail and roll back, or handle them in some specified way (see the sketch below).
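A hedged T-SQL sketch of validating a staging table before applying it; all table and column names are illustrative:

    BEGIN TRANSACTION;

    -- reject the whole batch if any staged row violates basic rules
    IF EXISTS (SELECT 1 FROM staging_products WHERE product_id IS NULL OR price < 0)
    BEGIN
        ROLLBACK TRANSACTION;
        RAISERROR('Invalid rows in staging data', 16, 1);
        RETURN;
    END

    -- apply the validated changes to the live table
    UPDATE p
    SET p.price = s.price,
        p.category = s.category
    FROM products p
    JOIN staging_products s ON s.product_id = p.product_id;

    COMMIT TRANSACTION;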
Anyway, both your suggestions can be made workable.
Personally, for large changes like this, I prefer to have an experienced database developer develop the changes in straight SQL (either hardcoded or table-driven), test it on production data in a test environment (doing a table compare between before and after), and deploy the same script to production. This also allows the use of existing stored procedures (you are using SPs to enforce a consistent interface to this complex database, right?), so that basic database logic already in place is simply re-used.
I doubt Excel will be able to deal with 1.5 million rows.
When you say the customer would visualize the data in grids: how will he make changes? Manually, or is there some automation behind it? I would strongly encourage automation (since you know about only two types of changes at the moment). Maybe even a simple standalone "converter" application; don't make it part of the main program, as it will be too tempting for them in the future to manually edit data straight in the DB tables.
Here is a strategy that I think will get you from A to B in the shortest amount of time.
one solution is to build him new screens where he could visualize the data in grids and do the required job on a large number of records.
It's rarely a good idea to build an interface into the main system that you will only use once or twice. It takes extra time and you'll probably spend more time maintaining it than using it.
This will permit us to use the already existing functions that deal with single items (we just need to implement a loop)
Hack together your own crappy little interface in a .NET application whose sole purpose is to fulfill this one task. Keep it around in your "stuff I might use later" folder.
Since you're dealing with such massive amounts of data, make sure you're not running your app from a remote location.
Obtain a copy of SQL Server 2005 and install it on a virtualization layer. Copy your production database over to this virtualized SQL Server. Take a snapshot of your virtualized copy before you begin testing. Write and test your app against this virtualized copy, rolling back to your original snapshot each time you test. Keep changing your code, testing, and rolling back until your app can flawlessly perform the desired changes.
Then, when the time comes to change the production database, you can sit back and relax while your app does all of the changes. Since the process will likely take a while, add some logging so you can check the status as it runs.
Oh yeah, make sure you have a fresh backup before you run your big update.