I'm trying to find Freebase co-types, that is, given a type, find its 'compatible' types:
Suppose we start with /people/person: it might be a musician (/music/group_member), but not a music album (/music/album). I don't know whether Freebase has something like OWL's disjointWith between types; in any case, the MQL cookbook suggests the following trick.
The query in the example gets all instances of a given type, then gets all types of those instances and makes them unique... This is clever, but the query times out. Is there another way? A static list/result would be fine for me; I don't need a live query, and I think the result would be the same...
EDIT:
The Incompatible Types base seems useful and similar to disjointWith; it may also be used with the suggest...
Thanks!
luca
Freebase doesn't have the concept of disjointWith at the graph or schema level. The Incompatible Types base that you found is a manually curated version of that which may be used in a future version of the UI, but isn't today.
If you want to find all co-types, as they exist in the graph today, you can do that using the query you mentioned, but you're probably better off using the data dumps. I'd also consider establishing a frequency threshold to eliminate low frequency co-types so that you filter out mistakes and other noise.
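For the data-dump route, the counting and thresholding step could be sketched like this (a minimal sketch; the parsed-dump shape and the sample data below are my assumptions, not real Freebase output):

```javascript
// Given a map of topic id -> list of its type ids (built beforehand from the
// data dumps), count how often each other type co-occurs with a base type,
// keeping only co-types above a frequency threshold to filter out noise.
function coTypes(topicTypes, baseType, minFreq) {
  const instances = Object.values(topicTypes).filter(ts => ts.includes(baseType));
  const counts = {};
  for (const types of instances) {
    for (const t of types) {
      if (t !== baseType) counts[t] = (counts[t] || 0) + 1;
    }
  }
  const result = {};
  for (const [t, n] of Object.entries(counts)) {
    const freq = n / instances.length;
    if (freq >= minFreq) result[t] = freq;
  }
  return result;
}

// Illustrative data: three /people/person topics, one also a /film/actor.
const topicTypes = {
  "/m/0x1": ["/people/person", "/film/actor"],
  "/m/0x2": ["/people/person", "/music/artist"],
  "/m/0x3": ["/people/person", "/music/artist"],
};
console.log(coTypes(topicTypes, "/people/person", 0.5));
// /music/artist (2 of 3) survives the 50% threshold; /film/actor (1 of 3) is filtered out
```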
As Tom mentioned, there are some thresholds that you might want to consider to filter out some of the less notable or experimental types that users have created in Freebase.
To give you an idea of what the data looks like I've run the query for all co-types for /people/person.
/common/topic (100.00%)
/book/author (23.88%)
/people/deceased_person (18.68%)
/sports/pro_athlete (12.49%)
/film/actor (9.72%)
/music/artist (7.60%)
/music/group_member (4.98%)
/soccer/football_player (4.53%)
/government/politician (3.92%)
/olympics/olympic_athlete (2.91%)
See the full list here.
You can also experiment with Freebase Co-Types using this app that I built (although it is prone to the same timeouts that you experienced).
Is there any way to list the kinds that are not being used in Google's Datastore by our App Engine app, without having to look into our code and/or logic? : )
I'm not talking about indexes, which I can list by issuing a
gcloud datastore indexes list
and then compare with the datastore-indexes.xml or index.yaml.
I tried to check datastore kinds statistics and other metadata but I could not find anything useful to help me on this matter.
Should I give up on finding ways for Datastore to provide me useful stats, and code something to keep collecting Datastore statistics (like data size) over a long period, to get at least a clue of which kinds are not being used, and only after this research look into our app code to see if the kind's Model was removed?
Example:
select bytes from __Stat_Kind__
Store it somewhere and keep updating it for a period. If a kind's byte size does not change, then that kind is probably not being used anymore.
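The comparison step could be sketched like this (assuming you've already snapshotted the __Stat_Kind__ results into maps of kind name to bytes; the kind names and numbers below are made up):

```javascript
// Given two snapshots of kind-name -> bytes taken at different times,
// return the kinds whose byte size did not change between them.
function unchangedKinds(earlier, later) {
  return Object.keys(earlier).filter(
    kind => kind in later && earlier[kind] === later[kind]
  );
}

// Illustrative snapshots from two __Stat_Kind__ queries months apart.
const jan = { Person: 1024, AuditLog: 9000, LegacyThing: 512 };
const jun = { Person: 2048, AuditLog: 9500, LegacyThing: 512 };
console.log(unchangedKinds(jan, jun)); // [ 'LegacyThing' ]
```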
The idea is to do some cleaning in datastore.
I would like to find which kinds are not being used anymore, maybe for a long time, or were created manually to be used once... You know, like a table in Oracle that no one knows the purpose of, where the statistics show it was only used once, five years ago. I'm trying to achieve the same in Datastore: I want to know which kinds are not being used anymore, or were last used a while ago, then ask around and back them up/delete them if no owner is found.
It's an interesting question.
I think you would be best placed to audit your code and instill an organizational practice that requires this documentation in future as a business/technical pre-production requirement.
IIRC, Datastore doesn't automatically timestamp Entities and keys (rightly) aren't incremental. So there appears no intrinsic mechanism to track changes short of taking a snapshot (expensive) and comparing your in-flight and backup copies for changes (also expensive and inconclusive).
One challenge with identifying a Kind that appears to be non-changing is that it could be referenced (rarely) by another Kind and so, while it does not change, it is required.
Auditing your code and documenting it for posterity should not only give you a definitive answer (and identify owners) but also pay off significant technical debt that has been incurred, heading off this problem and probably future (e.g. GDPR-like) requirements as well.
Assuming you are referring to records being created/updated, I can think of the following options:
Via the Cloud Console (Datastore > Dashboard) - This lists all your kinds and the number of records in each. Theoretically, you can take a screenshot and compare the counts later to see which kinds have grown and which have not.
Use of Created/LastModified date columns - I usually add these two columns to most of my Datastore tables. If you have them, you can write a query over them: for example, sort each kind in descending order of creation (or last-modified) date and pull only the first record. That tells you the last time a record in that kind was created or modified.
I would write a function as part of my App, put it behind a page which requires admin privilege (only app creator can run it) and then just clicking a link on my App would give me the information.
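The final staleness check behind that admin page could be sketched like this (assuming you've already collected the newest Created/LastModified timestamp per kind with the per-kind queries above; the kind names and cutoff are illustrative):

```javascript
// Given the newest timestamp seen in each kind, return the kinds whose
// latest record is older than a cutoff date, i.e. candidates for cleanup.
function staleKinds(newestByKind, cutoff) {
  return Object.entries(newestByKind)
    .filter(([, ts]) => ts < cutoff)
    .map(([kind]) => kind);
}

const newestByKind = {
  Person: new Date("2019-06-01"),
  OldImport: new Date("2014-03-15"),
};
console.log(staleKinds(newestByKind, new Date("2018-01-01"))); // [ 'OldImport' ]
```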
My table is (device, type, value, timestamp), where (device,type,timestamp) makes a unique combination ( a candidate for composite key in non-DynamoDB DBMS).
My queries can range between any of these three attributes, such as
GET (value)s from (device) with (type) having (timestamp) greater than <some-timestamp>
I'm using dynamoosejs/dynamoose, and from most of my searches I believe I'm supposed to use a combination of the three fields as a single id field (device-type-timestamp). However, the set function of Schema doesn't let me use the other object properties (such as this.device), and for some reasons I cannot do it externally.
The closest I got was (id:uuidv4:hashKey, device:string:GlobalSecIndex, type:string:LocalSecIndex, timestamp:Date:LocalSecIndex)
and
(id:uuidv4:rangeKey, device:string:hashKey, type:string:LocalSecIndex, timestamp:Date:LocalSecIndex)
and so on..
However, when running a Query it becomes difficult to fetch results for a particular device and type, because the id (the hashKey or rangeKey) is not known at query time.
So, the question: how would you model such a table?
Note that this table is meant to gather data from IoT devices; each device generates a record every 5 minutes on average.
I'm curious why you are choosing DynamoDB for this task. Advanced queries like this seem to be much better suited for a SQL based database as opposed to a NoSQL database. Due to the advanced nature of SQL queries, this task in my experience is a lot easier in SQL databases. So I would encourage you to think about if DynamoDB is truly the right system for what you are trying to do here.
If you determine it is, you might have to restructure your data a little bit. You could do something like having a property that is device-type and that will be the device and type values combined. Then set that as an index, and query based on that and sort by the timestamp, and filter out the results that are not greater than the value you want.
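That restructuring could be sketched like this (the field name deviceType and the "#" separator are my assumptions, not part of the question; the schema is shown as a plain object of the shape dynamoose's Schema constructor accepts):

```javascript
// Combine device and type into one value that serves as the hash key,
// keeping timestamp as the range key. "#" is an assumed separator that
// must not appear in device or type ids.
function deviceTypeKey(device, type) {
  return `${device}#${type}`;
}

// Illustrative dynamoose-style schema definition (in a real app this object
// would be passed to new dynamoose.Schema(...)):
const readingSchema = {
  deviceType: { type: String, hashKey: true },
  timestamp: { type: Date, rangeKey: true },
  device: String,
  type: String,
  value: Number,
};

console.log(deviceTypeKey("sensor-42", "temperature")); // "sensor-42#temperature"
```

A query can then pin down the hash key with the combined value and range over timestamp, with no leftover filtering on device or type.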
You are correct that currently, Dynamoose does not pass in the entire object into the set function. This is something that personally I'm open to exploring. I'm a member on the GitHub project, and if you would like to submit a PR adding that feature I would be more than happy to help explore that option with you and get that into the codebase.
The other thing you might want to explore is having a DynamoDB stream, that will set that device-type property whenever it gets added to your DynamoDB table. That would abstract that logic out of DynamoDB and your application. I'm not sure if it's necessary for what you are doing to decouple it to that level, but it might be something you want to explore.
Finally, depending on your setup, you could figure out which property will be more unique, device or type, and set up an index on that property. Then just query based on that, and filter out the results you don't want on the other property. I'm not sure if that is what you are looking for; it will of course work, but I'm not sure how many items you will have in your table, and questions about scalability arise at a certain level. One way to solve some of those scalability questions might be to set a TTL on your items, if you know that the timestamp you are querying for is constant or predictable ahead of time.
Overall there are a lot of ways to achieve what you are looking to do. Without more detail about how many items, what exactly those properties will be doing, the amount of scalability you require, which of those properties will be most unique, etc. it's hard to give a good solution. I would highly encourage you to think about if NoSQL is truly the best way to go. That query you are looking to do seems a LOT more like a SQL query. Not saying it's impossible in DynamoDB, but it will require some thought about how you want to structure your data model, and such.
Considering the opinion of #charlie-fish, I decided to jump into Dynamoose and modify the code to pass the model into the set function of the attribute. However, I discovered that the model is already being passed to the default parameter of the attribute. So I changed my Schema to the following:
id: hashKey; default: function (model) { return model.device + "" + model.type; }
timestamp: rangeKey
For anyone landing on this answer, please note that the default & set functions can access the attribute options & schema instance using this. However, both of those functions should be regular functions rather than arrow functions.
I'm keeping this here as an answer, but I won't accept it as the answer to my question for some time, as I want to wait for someone else to post a better approach.
I also want to make sure that if a value is passed for the id field, it isn't used. For this I could use set to ignore the actual incoming value, though I don't know how to do that as of yet.
I use OpenTSDB to store my time-series data. For each input data point, I must fetch the 20 data points that precede it. But I have a large number of metrics, so I can't call the OpenTSDB query API too many times. What can I do to reduce the number of queries to OpenTSDB?
As far as I know you can't aggregate different metrics into one single result. But I would suggest two solutions:
You can put multiple metric queries in one call. If you use the HTTP API endpoint you can do something like this:
http://otsdb:4242/api/query?start=15m-ago&m=avg:metric1{tag1=a}&m=avg:metric2{tag2=b}
You get the results for all queries with the same start (end) dates/times. But with multiple metrics, don't forget that it will take a longer time...
Redefine your time series. I don't know any details about your data, but if you're going to store and use data you should also think about usage: What queries am I going to run? How often? What would be the most common access pattern? And so on...
That's also what's advised from OpenTSDB documentation [1]:
Cardinality also affects query speed a great deal, so consider the queries you will be performing frequently and optimize your naming schema for those.
So, I would suggest to use tags to overcome this issue of multiple metrics. But as I mentioned I don't know your schema, but OpenTSDB is much more powerful with tags - there are many examples and also filtering options as well.
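For instance, if every device wrote to one shared metric distinguished by tags (the metric and tag names below are made up for illustration, since I don't know your schema), a single query could cover what previously took many per-metric calls:

```
http://otsdb:4242/api/query?start=15m-ago&m=avg:sensor.value{device=dev1,type=temperature}
```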
Edit 1:
From OpenTSDB 2.3 version there is also expression api: http://opentsdb.net/docs/build/html/api_http/query/exp.html
You should be able to handle multiple metric queries together (but I've never used that for any query).
On my knowledge base deployed on Virtuoso v7.2.1, I've two RDF graphs, namely http://example.com/A and http://example.com/B, populated by triples.
I'd like to create a virtual graph http://example.com/VG that should represent the union of http://example.com/A and http://example.com/B.
Therefore, any change to either http://example.com/A or http://example.com/B should be propagated to http://example.com/VG.
Digging into the documentation, I did not find pointers; the closest thing, the so-called RDF View, operates on tables rather than RDF graphs.
Does anyone have a suggestion on how to implement this? Thanks in advance.
Currently, the Virtuoso way to do this is a Graph Group.
You'll need to use DB.DBA.RDF_GRAPH_GROUP_CREATE(), DB.DBA.RDF_GRAPH_GROUP_INS(), and related functions.
The Graph Group is effectively a VIEW of a UNION of its Member Graphs, so changes to the Member Graphs are immediately reflected through the Graph Group. There is no change propagation, as such.
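A minimal sketch of that setup, using the graph IRIs from the question (to be run via iSQL; the second argument to RDF_GRAPH_GROUP_CREATE is the quiet flag, and the value shown is an assumption):

```sql
-- Create the graph group, then register the two member graphs.
DB.DBA.RDF_GRAPH_GROUP_CREATE ('http://example.com/VG', 1);
DB.DBA.RDF_GRAPH_GROUP_INS ('http://example.com/VG', 'http://example.com/A');
DB.DBA.RDF_GRAPH_GROUP_INS ('http://example.com/VG', 'http://example.com/B');
```

A SPARQL query with FROM <http://example.com/VG> should then see the union of the two member graphs.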
For the future -- Virtuoso-specific questions often get faster responses on the Virtuoso users mailing list, OpenLink support forums, etc.
I am working on building a database of timing and address information for restaurants, extracted from multiple websites. Since information about the same restaurant may be present on several websites, the database will contain some near-duplicate copies.
Since the number of restaurants is large, say 100,000, each new entry requires on the order of 100,000 comparisons (100,000^2 in total) to check whether restaurant information with a nearly similar name is already present. So I am asking whether there is a more efficient approach. Thank you.
Basically, you're looking for a record linkage tool. These tools can index records, then for each record quickly locate a small set of potential candidates, then do more detailed comparison on those. That avoids the O(n^2) problem. They also have support for cleaning your data before comparison, and more sophisticated comparators like Levenshtein and q-grams.
The record linkage page on Wikipedia used to have a list of tools on it, but it was deleted. It's still there in the version history if you want to go look for it.
I wrote my own tool for this, called Duke, which uses Lucene for the indexing, and has the detailed comparators built in. I've successfully used it to deduplicate 220,000 hotels. I can run that deduplication in a few minutes using four threads on my laptop.
One approach is to structure your similarity function such that you can look up a small set of existing restaurants to compare your new restaurant against. This lookup would use an index in your database and should be quick.
How to define the similarity function is the tricky part :) Usually you can translate each record to a series of tokens, each of which is looked up in the database to find the potentially similar records.
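One simple version of that token lookup could be sketched as follows (a sketch with a naive tokenizer and an in-memory inverted index; the names and data are illustrative, and a real system would use a database index instead):

```javascript
// Naive tokenizer: lowercase, strip punctuation, split on whitespace.
function tokens(name) {
  return name.toLowerCase().replace(/[^a-z0-9 ]/g, "").split(/\s+/).filter(Boolean);
}

// Build an inverted index: token -> set of restaurant ids containing it.
function buildIndex(restaurants) {
  const index = {};
  for (const [id, name] of Object.entries(restaurants)) {
    for (const t of tokens(name)) {
      (index[t] = index[t] || new Set()).add(id);
    }
  }
  return index;
}

// Candidates share at least one token with the new name, so only this small
// set needs a detailed comparison (e.g. Levenshtein) instead of all n records.
function candidates(index, newName) {
  const ids = new Set();
  for (const t of tokens(newName)) {
    for (const id of index[t] || []) ids.add(id);
  }
  return ids;
}

const restaurants = {
  r1: "Joe's Pizza",
  r2: "Joes Pizza & Pasta",
  r3: "Golden Dragon",
};
const index = buildIndex(restaurants);
console.log(candidates(index, "joe pizza")); // contains r1 and r2, but not r3
```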
Please see this blog post, which I wrote to describe a system I built to find near duplicates in crawled data. It sounds very similar to what you want to do and since your use case is smaller, I think your implementation should be simpler.