Requirements for Reading Freebase .json dumps

I have a machine with Windows 7, 8 GB of RAM, and a 500 GB hard drive. Everywhere I look, it seems I either have to "rent" a virtual machine using AWS or have a server. My question is: can I somehow work with the Freebase data locally on my machine? Can I access a topic without loading the dump into a database, just by parsing it?

The Freebase data dumps are in RDF format, not JSON. They can definitely be processed incrementally on a laptop, using zgrep, cut, etc. to filter them down to the subset you're interested in. See some of my other answers to questions about the Freebase data dumps for example commands.
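As a rough illustration of that approach (my sketch, not part of the original answer), the following Python snippet streams the still-compressed dump and keeps only the triples for a single predicate; the file names and the predicate are placeholders you would swap for whatever subset you care about:

    import gzip

    DUMP = "freebase-rdf-latest.gz"  # assumed name of the compressed dump
    # Hypothetical predicate of interest (object names); change to suit.
    WANTED = "<http://rdf.freebase.com/ns/type.object.name>"

    with gzip.open(DUMP, "rt", encoding="utf-8") as src, \
            open("names.nt", "w", encoding="utf-8") as out:
        for line in src:
            # Each dump line is a tab-separated triple ending in " .":
            # subject <TAB> predicate <TAB> object <TAB> .
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2 and fields[1] == WANTED:
                out.write(line)

Nothing here needs the whole dump in memory; it is a single pass over the compressed stream.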

SQLite backup using Windows DFS Replication [closed]

I have an application that uses SQLite for storage, and I'm wondering whether it is safe to use Windows DFS Replication to back up the database file to a second server, which has a cold-standby instance of the application installed.
Potentially relevant details:
Although DFS supports two-way replication, in this case it is only the master DB file that is written to, so the replication is effectively one-way.
The master DB file is located on the same server as the process that is writing to it.
Currently SQLite is configured to use the standard Rollback Journal, but I could switch to Write-Ahead Log if necessary.
If DFS locks the master DB file during replication, then I think this approach could work, as long as the lock isn't held for too long. However, I can't find sufficient information on how DFS is implemented.
UPDATE: I've implemented this in a test environment, and have had it running for several days. During that time I have not encountered any problems, so I am tempted to go with this solution.
Considering that DFS Replication is oriented towards files and folders:
DFS Replication is an efficient, multiple-master replication engine that you can use to keep folders synchronized between servers across limited bandwidth network connections.
I would probably try to avoid it if you care about consistency and keeping all your data, as stated by the SQLite backup documentation:
Historically, backups (copies) of SQLite databases have been created using the following method:
1. Establish a shared lock on the database file using the SQLite API (i.e. the shell tool).
2. Copy the database file using an external tool (for example the unix 'cp' utility or the DOS 'copy' command).
3. Relinquish the shared lock on the database file obtained in step 1.
This procedure works well in many scenarios and is usually very fast. However, this technique has the following shortcomings:
Any database clients wishing to write to the database file while a backup is being created must wait until the shared lock is relinquished.
It cannot be used to copy data to or from in-memory databases.
If a power failure or operating system failure occurs while copying the database file, the backup database may be corrupted following system recovery.
In the case of DFS, it wouldn't even lock the database prior to copying.
I think your best bet would be to use some kind of hot replication. You might want to use the SQLite Online Backup API; you could check this tutorial on creating a hot backup with the Online Backup API.
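For a sense of what the Online Backup API looks like from application code, here is a minimal sketch assuming Python's built-in sqlite3 module (Python 3.7+), which exposes the API as Connection.backup(); the file names are placeholders:

    import sqlite3

    src = sqlite3.connect("master.db")   # the live database (placeholder path)
    dst = sqlite3.connect("backup.db")   # the backup copy (placeholder path)

    # Copies the database a batch of pages at a time; other connections can
    # keep writing while the backup runs, and 'sleep' yields between batches.
    src.backup(dst, pages=100, sleep=0.25)

    dst.close()
    src.close()

Most SQLite bindings expose the same API, and the sqlite3 command-line shell offers it as the .backup command, so the approach is not tied to any particular language.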
Or, if you want something simpler, you might try SymmetricDS, an open-source database replication system that is compatible with SQLite.
There are other options (like litereplicator.io), but that one went closed source and is limited to old SQLite versions and databases of roughly 50 MB.
PS: I would probably move away from SQLite if you really need HA, replication, or features of this kind. Depending on your programming language of choice, you most probably already have the DB layer abstracted, and you could use MySQL or PostgreSQL instead.

Freebase - how to use the freebase-rdf-latest?

I downloaded freebase-rdf-latest from freebase.com. I uncompressed it, and now I have a file of 380.7 GB.
How can I read that data? Which program do you recommend?
Thanks for your help!
I'll disagree with Nandana and say that you definitely should not load it into a triple store for most uses. There's a ton of redundancy in it and, even without the redundancy, you're usually only interested in a small portion of it.
Also, for most applications, you probably want to leave the file compressed. You can probably decompress it quicker than you can read the uncompressed version from the file system. If you need to split it for processing in a MapReduce environment, the file is (or at least used to be) a series of concatenated compressed files which can be split apart without having to decompress them.
Nandana has a good suggestion about considering derivative data products. The tradeoff to consider is how often they are updated and how transparent their filtering/extraction pipeline is.
For simple tasks, you can get pretty far with the very latest data using zgrep, cut, and associated Unix command line tools.
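If you would rather stay in a scripting language, here is a rough equivalent in Python (my own sketch, not from the answer) that scans the compressed dump once and tallies the most common predicates, which is a cheap way to see what the data actually contains; the file name is an assumption:

    import gzip
    from collections import Counter

    counts = Counter()
    with gzip.open("freebase-rdf-latest.gz", "rt", encoding="utf-8") as dump:
        for line in dump:
            parts = line.split("\t")
            if len(parts) >= 2:
                counts[parts[1]] += 1  # the predicate is the second column

    # Print the 20 most frequent predicates and their counts.
    for predicate, n in counts.most_common(20):
        print(n, predicate)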
You have to load the data into a triple store such as Virtuoso. You can take a look at how to load the data in the following references (a small query sketch follows the list).
Virtuoso Freebase Setup
Load Freebase Dump into Virtuoso
Bulk Loading RDF Source Files into one or more Graph IRIs
Loading freebase to Jena Fuseki
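Once the data is loaded, you query it through the store's SPARQL endpoint. As a rough sketch (mine, not from the answer), assuming a local Virtuoso instance listening on its default endpoint at http://localhost:8890/sparql and the third-party SPARQLWrapper package; the mid fb:m.02mjmr is just an example topic:

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://localhost:8890/sparql")  # assumed endpoint
    sparql.setQuery("""
        PREFIX fb: <http://rdf.freebase.com/ns/>
        SELECT ?name WHERE {
            fb:m.02mjmr fb:type.object.name ?name .
        } LIMIT 10
    """)
    sparql.setReturnFormat(JSON)

    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
        print(row["name"]["value"])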
However, you might be interested in other projects that provide a cleaned version of Freebase pre-loaded into a triple store.
SindiceTech Freebase distribution: Freebase data is available for full download but, as of today, using it "as a whole" is all but simple. The SindiceTech Freebase distribution solves that by providing all the Freebase knowledge preloaded in an RDF-specific database (also called a triplestore) and equipped with a set of tools that make it much easier to compose queries and understand the data as a whole.
:BaseKB: :BaseKB is an RDF knowledge base derived from Freebase, a major source of the Google Knowledge Graph; :BaseKB contains about half as many facts as the Freebase dump because it removes trivial, ill-formed, and repetitive facts that make processing difficult. The most recent version of :BaseKB Gold can be downloaded via BitTorrent, or, if you wish to run SPARQL queries against it, you can run it in the AWS cloud, pre-loaded into OpenLink Virtuoso 7.

New Azure Server - CSV Reader takes much longer

We have an ASP.NET website that allows users to import data from a CSV file. Recently we moved from a dedicated server to an Azure Virtual Machine, and the import is taking much longer. The hardware specs of the two systems are similar.
It used to take less than a minute for the data to import; now it can take 10 to 15 minutes. The original file upload speed is fine; it is looping through the data and organizing it in the SQL database that takes the time.
Why is the Azure VM with similar specs taking so much longer and what can I do to fix it?
Our database is using Microsoft SQL Server 2012 installed on the same VM as the website.
It is very hard to make a comparison between the two environments. Was the previous environment virtualized? It might have to do with the speed of the hard disks, the placement of the SQL Server files, or some other infrastructural setup (or simply the iron). I would recommend having a look at the performance of the machine under load (Resource Monitor). This kind of operation is usually both processor- and I/O-intensive. It should ideally be done in parallel as well.
Hth
//Peter

How much data do I need to have to make use of Presto?

How much data do I need to have to make use of Presto? The web site states that it can query data sizes from gigabytes to petabytes. I understand how it is used to query very large datasets, but is anyone using it for hundreds of gigabytes?
Currently, Presto is most useful if you already have an existing Hive installation. If you are using Hive, you should definitely try Presto. If all your data fits in a relational database like PostgreSQL or MySQL on a single machine, and you are happy with the performance, then keep using that.
However, Presto should be much faster than either of those databases on a single machine for analytic queries, because it executes a query in parallel; neither of those databases parallelizes the execution of individual queries. At the moment, using Presto requires setting up HDFS and Hive (even on a single machine), so getting started will be more work than if you already have an existing Hive installation.
Or, you can take a look at Impala, which has been available as production-ready software for six months. Like Presto, Impala is a distributed SQL query engine for data in HDFS that circumvents MapReduce. Unlike Presto, there is a commercial vendor providing support (Cloudera).
That said, David's comments about data size still apply. Use the right tool for the job.

Berkeley Db platform migration

I have a large (several GB) Berkeley DB database that I am thinking of migrating from Windows (2000) to Linux (either Red Hat or Ubuntu). I am not sure how to go about this. Can I merely move the DB files across, or do I need a special conversion utility?
Database and log files are portable across systems with different endianness. Berkeley DB will recognize the kind of system it is on and swap bytes accordingly for the data structures it manages that make up the database itself. Berkeley DB's region files, which are memory mapped, are not portable. That's not such a big deal, because the region files only hold the cache and locks, and since your application will not be running during the transition, they will be re-created on the new system.
But be careful: Berkeley DB doesn't know anything about the byte order or types in your data (in your keys and values, stored as DBTs). Your application code is responsible for knowing what kind of system it is running on, how it stored the data (big- or little-endian), and how to transition it (or simply re-order it on access). Also, pay close attention to your btree comparison function. That too may differ depending on your system's architecture.
Database and log files are also portable across operating systems with the same caveat as with byte-ordering -- the application's data is the application's responsibility.
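To make that caveat concrete, here is a small sketch (mine, not from the answer) of encoding integer keys in an explicit, fixed byte order before handing them to Berkeley DB, so the stored bytes mean the same thing on both the Windows and the Linux machine; the key format is purely illustrative:

    import struct

    def encode_key(user_id: int) -> bytes:
        # ">Q" = big-endian unsigned 64-bit. The byte layout is identical on
        # every platform, and lexicographic byte order matches numeric order,
        # which keeps a default btree comparison well behaved.
        return struct.pack(">Q", user_id)

    def decode_key(raw: bytes) -> int:
        return struct.unpack(">Q", raw)[0]

    assert decode_key(encode_key(123456789)) == 123456789

If the existing database was written with native (little-endian) ordering instead, you would re-order on read during the migration, as described above.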
You might consider reviewing the following:
Selecting a Byte Order
DB->set_lorder()
Berkeley DB's Getting Started Guide for Transactional Applications
Berkeley DB's Reference Guide
Voice-over presentation about Berkeley DB/DS (Data Store)
Voice-over presentation about Berkeley DB/CDS (Concurrent Data Store)
Berkeley DB's Documentation
Disclosure: I work for Oracle as a product manager for Berkeley DB products. :)
There's a cross-platform file transfer utility described here.
You may also need to be concerned about the byte order on your machine, but that's discussed a little here.
If you're using the Java edition of Berkeley DB, though, it shouldn't matter.
