I would like to return the field names of a given MongoDB collection from R's mongolite package.
Starting with recent versions of mongolite (i.e. 1.5+), you can run a raw command on the MongoDB server. For instance, I can use the following to return all the collections:
library(mongolite)  # provides mongo()
m = mongo(db = 'dbname', url = 'urlofdb')
m$run('{"listCollections":1}')
This returns a list of collections:
$cursor
$cursor$id
[1] 0
$cursor$ns
[1] "db.$cmd.listCollections"
$cursor$firstBatch
name type readOnly idIndex.v idIndex._id idIndex.name idIndex.ns
1 collection-name collection FALSE 1 1 _id_ db.collection
Can you please advise how I could return the field names of a given collection using the run command?
Thanks!
I don't think you really can do it directly.
If you could, that would largely go against the entire philosophy of a NoSQL database (which Mongo is). The idea behind a NoSQL database is that you have a collection of documents, which can each have their own fields.
The analogy to paper documents really does work: the concept of 'columns' is replaced by 'fields', which don't pertain to the collection as a whole but to individual documents, and each document can contain anything. There is no overarching mandatory template into which everything must fit. In practice, a lot of documents will have a similar structure, but this is by no means guaranteed. This means it's entirely possible to have 100 million documents with 3 fields called "a", "b" and "c", while document 100,000,001 has 4 fields: a, b, c and d.
It could be that the database engine keeps track of which fields occur somewhere in a collection, but I doubt it. And if it doesn't, the only way to get all four names a, b, c and d is to go through all 100,000,001 documents (or more), which will take a while. Undoubtedly some optimisation is implemented, but it will always be a hard question.
If you just want an answer for a small DB, I think simply querying for all documents and taking the column-names of the resulting data.frame is easiest.
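For example, a minimal sketch of that approach, reusing the connection m from above (this pulls every document into memory, so it is only sensible for small collections):

# query all documents; mongolite flattens them into a data.frame,
# with nested fields dotted (e.g. idIndex.v, as in the output above)
all_docs <- m$find('{}')

# the union of field names across all fetched documents
names(all_docs)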
But if your database is large, this question is no longer about R or mongolite, and I'm not sufficiently experienced with Mongo to help you further.
I have a collection where the documents are uniquely identified by a date, and I want to get the n most recent documents. My first thought was to use the date as a document ID, and then my query would sort by ID in descending order. Something like .orderBy(FieldPath.documentId, descending: true).limit(n). This does not work, because it requires an index, which can't be created because __name__ only indexes are not supported.
My next attempt was to use .limitToLast(n) with the default sort, which is documented here.
By default, Cloud Firestore retrieves all documents that satisfy the query in ascending order by document ID
According to that snippet from the docs, .limitToLast(n) should work. However, because I didn't specify a sort, it says I can't limit the results. To fix this, I tried .orderBy(FieldPath.documentId).limitToLast(n), which should be equivalent. This, for some reason, gives me an error saying I need an index. I can't create it for the same reason I couldn't create the previous one, but I don't think I should need to because they must already have an index like that in order to implement the default ordering.
Should I just give up and copy the document ID into the document as a field, so I can sort that way? I know it should be easy from an algorithms perspective to do what I'm trying to do, but I haven't been able to figure out how to do it using the API. Am I missing something?
Edit: I didn't realize this was important, but I'm using the flutterfire firestore library.
A few points. It is ALWAYS good practice to use random, well-distributed documentIds in Firestore, for scale and efficiency. Related to that, there is effectively NO WAY to query by documentId, and the few circumstances where you can use it (especially for a range) are VERY tricky, since a range requires inequalities and you can only do inequalities on one field. IF there's a reason to search on an ID, then yes, it is PERFECTLY appropriate to store it in the document as well - in fact, my wrapper library always does this.
The correct notation, btw, would be FieldPath.documentId() (a method, not a constant) - alternatively, __name__ - but I believe this only works in queries. The reason it requested a new index is that without the () it assumed you had a field named FieldPath with a subfield named documentId.
Further: FieldPath.documentId() does NOT generate the documentId at the server - it generates the FULL PATH to the document - see Firestore collection group query on documentId for a more complete explanation.
So net:
=> documentIds should be as random as possible within a collection; it's generally best to let Firestore generate them for you.
=> a valid exception is when you have ONE AND ONLY ONE sub-document under another - for example, every "user" document might have one and only one "forms of Id" document in a subcollection. It is valid to use the SAME ID as the parent document in this exceptional case.
=> anything you want to query should be a FIELD in a document, and generally a simple field.
=> WORD TO THE WISE: Firestore "arrays" are ABSOLUTELY NOT ARRAYS. They are ORDERED LISTS, generally in the order they were added to the array. The SDK presents them to the CLIENT as arrays, but Firestore itself does not STORE them as ACTUAL ARRAYS - the number you see in the console is the order, not an index. Matching elements in an array (arrayContains, e.g.) requires matching the WHOLE element - if you store an ordered list of objects, you CANNOT query the "array" on sub-elements.
From what I've found:
FieldPath.documentId does not match on the documentId, but on the refPath (which it gets automatically if passed a document reference).
As such, since the documents are to be sorted by time, it is better to store a timestamp field value such as createdAt rather than a human-readable date string, which sorts lexicographically rather than chronologically (e.g. "9" sorts after "10").
From there, you can simply sort by date and limit to last. You can keep the document ID's as you intend.
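A minimal sketch of that approach with the flutterfire cloud_firestore API (the collection name 'documents' and the field name 'createdAt' are assumptions; FirebaseFirestore.instance assumes a recent version of the plugin):

import 'package:cloud_firestore/cloud_firestore.dart';

/// Fetches the n most recent documents, ordered by a 'createdAt'
/// timestamp field stored on each document.
Future<List<QueryDocumentSnapshot>> nMostRecent(int n) async {
  final snapshot = await FirebaseFirestore.instance
      .collection('documents')
      .orderBy('createdAt', descending: true) // newest first
      .limit(n)
      .get();
  return snapshot.docs;
}

Ordering descending with limit(n) is equivalent to ordering ascending with limitToLast(n), and a single-field index on createdAt exists automatically, so no composite index is needed.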
I have some code like this:
...
const snapshot = firestore().collection("orders").orderBy("deliveryDate")
...
I want to access only the 100th order in the returned documents. So far, the only way I can achieve this is to do firestore().collection("orders").orderBy("deliveryDate").limit(100), which returns the first 100 documents, and I can access the last order. But I end up fetching 99 unwanted documents, and this only gets slower if I want the 200th document or higher.
So, I basically want to know if there's a possible way of getting just the index I want after sorting.
As far as I know, startAt() and startAfter() only accept a doc reference or field values, not an index/offset.
Firestore does not offer any way to offset by some numeric amount to web and mobile clients (and doing so would end up having the exact same cost as what you're doing now).
If you need to impose some sort of offset into your collection, you will need to maintain that in the document itself for querying, or use some other type of storage that gives you fast cheap access by index.
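For example, if each order carried a hypothetical position field (0-based, maintained by your own write logic as the order's rank by deliveryDate), the 100th order could be fetched directly:

// 'position' is an assumed field kept in sync by your own writes,
// mirroring each order's rank when sorted by deliveryDate
const snapshot = await firestore()
  .collection("orders")
  .where("position", "==", 99) // 0-based, so 99 is the 100th order
  .limit(1)
  .get();
const order = snapshot.docs[0];

The trade-off is that inserting or deleting an order means rewriting the position of every document that comes after it.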
I'm currently trying to learn Symfony, and a big part of it is Doctrine. I've been reading the official documentation for Doctrine, and in the part about the Collections library I stumbled upon this thing called an "ordered map". I tried to search for it on Google, but I wasn't able to find any satisfying answer - there were just answers for specific languages (mostly Java and C++), but I want to understand it in general: how it works and what it is. In the Doctrine documentation they compare it to ArrayCollection, so I hope that if I can understand what an ordered map is, it will be easier for me to understand ArrayCollection as well.
I tried to search for things like "what is an ordered map" or "ordered map explained", but as I said earlier, I didn't find what I was looking for.
A map is sometimes called ordered when the entries remain in the same sequence in which they were inserted.
For example, arrays in PHP are ordered (preserve insertion order). So creating/modifying an array like this:
$array = [2 => 'a', 1 => 'b'];
$array[0] = 'c';
will indeed result in the PHP array [2 => 'a', 1 => 'b', 0 => 'c'] - it preserves the insertion order - while in some other languages it will be turned into [0 => 'c', 1 => 'b', 2 => 'a'].
This affects a few operations. Iterating over an array with foreach will return the entries in insertion order. You can do key-wise or value-wise sorting on PHP arrays; the default sorting function sort() will drop the original keys and reindex numerically. Serialization and deserialization with numeric keys may have unintended consequences. And there are other effects that are sometimes beneficial and sometimes surprising or annoying (or both). You can read lots about it on PHP's array doc page and the array function pages.
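A quick demonstration of both behaviours, continuing the example from above:

$array = [2 => 'a', 1 => 'b'];
$array[0] = 'c';

// foreach follows insertion order, not key order
foreach ($array as $key => $value) {
    echo "$key => $value\n"; // prints 2 => a, then 1 => b, then 0 => c
}

// sort() orders by value and reindexes the keys numerically
sort($array);
print_r($array); // [0 => 'a', 1 => 'b', 2 => 'c']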
In the context of Doctrine (since it's written in PHP) this means that a collection whose values are the entity objects can be sorted in any manner you want (including by id, of course), and if you iterate over that collection, you get the entity objects in the order they were added by Doctrine (the order of the SQL/DQL query). Doctrine also allows you to set the keys to the entities' ids, while still preserving the SQL/DQL query order. This can simplify code, since Doctrine's Collection implements PHP's ArrayAccess.
As a counter-example, maps can also be unordered or sorted. Unordered means that when you retrieve the pairs the order can be arbitrary (in Go, map iteration order is deliberately randomized); sorted means the entries are automatically kept in key order (like SortedMap in Java).
I have a data model like this:
Person node
Email node
OWNS relationship
LISTS relationship
KNOWS relationship
Each Person can OWN one Email and LIST multiple Emails (like a contact list; 200 contacts are assumed per Person).
The query I am trying to perform finds all the Persons that OWN an Email which a given Person LISTS, and creates a KNOWS relationship between them.
MATCH (n:Person {uid:'123'}) -[r1:LISTS]-> (m:Email) <-[r2:OWNS]- (l:Person)
CREATE UNIQUE (n)-[:KNOWS]->(l)
The counts of my current database is as follows:
Number of Person nodes: 10948
Number of Email nodes: 1951481
Number of OWNS rels: 21882
Number of LISTS rels: 4376340 (Each Person has 200 unique LISTS rels)
Now my problem is that running the said query on the current database takes between 4.3 and 4.8 seconds, which is unacceptable for my needs. I wanted to know if this timing is normal considering my data model, or if I am doing something wrong with the query (or even the model).
Any help would be much appreciated. Also if this is normal for Neo4j please feel free to suggest other graph databases that can handle this kind of model better.
Thank you very much in advance
UPDATE:
My query is: profile match (n: {uid: '4692'}) -[:LISTS]-> (:Email) <-[:OWNS]- (l) create unique (n)-[r:KNOWS]->(l)
The PROFILE command on my query returns this:
Cypher version: CYPHER 2.2, planner: RULE. 3919222 total db hits in 2713 ms.
Yes, 4.5 seconds to match one person from an index along with its <=100 listed email addresses, and to merge a relationship from the user to the single owner of each email, is slow.
The first thing is to make sure you have an index for the uid property on nodes with the :Person label. Check your indexes with the SCHEMA command and, if it is missing, create such an index with CREATE INDEX ON :Person(uid). (Note also that the profiled query in your update omits the :Person label on n; without a label, the index cannot be used at all.)
Secondly, CREATE UNIQUE may or may not do the work fine, but you will want to use MERGE instead. CREATE UNIQUE is deprecated, and though the two are sometimes equivalent, the operation you want performed should be expressed with MERGE.
Thirdly, to find out why the query is slow you can profile it:
PROFILE
MATCH (n:Person {uid:'123'})-[:LISTS]->(m:Email)<-[:OWNS]-(l:Person)
MERGE (n)-[:KNOWS]->(l)
See 1, 2 for details. You may also want to profile your query while forcing the use of one or the other of the cost- and rule-based query planners, to compare their plans.
CYPHER planner=cost
PROFILE
MATCH (n:Person {uid:'123'})-[:LISTS]->(m:Email)<-[:OWNS]-(l:Person)
MERGE (n)-[:KNOWS]->(l)
With these you can hopefully find and correct the problem, or update your question with the information to help others help you find it.
I'm having a problem updating a disconnected POCO model in an ASP.NET application.
Let's say we have the following model:
Users
Districts
Orders
A user can be responsible for 0 or more districts, an order belongs to a district and a user can be the owner of an order.
When the user logs in, the user and the related districts are loaded. Later the user loads an order and sets himself as the owner of the order. The user (and related districts) and the order (and related district) are loaded in two different calls with two different DbContexts. When I save the order after the user has assigned himself to it, I get an exception saying that AcceptChanges cannot continue because the object's key values conflict with another object.
Which is not strange, since the same district can appear both in the list of districts the user is responsible for and on the order.
I've searched high and low for a solution to this problem, but the answers I have found seem to be either:
Don't load the related entities of one of the objects; in my case that would be the districts of the user.
Don't assign the user to the order by using the objects; just set the foreign key id on the order object.
Use NHibernate, since it apparently handles this.
I tried 1 and that works, but I feel this is wrong because I then either have to load the user without its districts before relating it to the order, or do a shallow clone. This is fine for this simple case, but the problem is that in my case a district might appear several more times in the graph. It also seems pointless: I have the objects, so why not let me connect them and update the graph? The reason I need the entire graph for the order is that I need to display all the information to the user. So since I've got all the objects, why should I need to either reload or shallow-clone them to get this to work?
I tried using STEs (self-tracking entities) but ran into the same problem, since I cannot attach an object to a graph loaded by another context. So I am back at square one.
I would assume that this is a common problem in anything but tutorial code. Yet, I cannot seem to find any good solution to this. Which makes me think that either I do not under any circumstance understand using POCOs/EF or I suck at using google to find an answer to this problem.
I've bought both of the "Programming Entity Framework" books from O'Reilly by Julia Lerman but cannot seem to find anything to solve my problem in those books either.
Is there anyone out there who can shed some light on how to handle graphs where some objects might be repeated and not necessarily loaded from the same context?
The reason why EF does not allow two entities with the same key to be attached to a context is that EF cannot know which one is "valid". For example: you could have two District objects in your object graph, both with key Id = 1, but with different Name property values. Which one represents the data that has to be saved to the database?
Now, you could say that it doesn't matter if both objects haven't changed, you just want to attach them to a context in state Unchanged, maybe to establish a relationship to another entity. It is true in this special case that duplicates might not be a problem. But I think, it is simply too complex to deal with all situations and different states the objects could have to decide if duplicate objects are causing ambiguities or not.
Anyway, EF implements a strict identity mapping between object reference identity and key property values, and simply doesn't allow more than one entity with a given key to be attached to a context.
I don't think there is a general solution for this kind of problem. I can only add a few more ideas in addition to the solutions in your question:
Attach the User to the context you are loading the order in:
context.Users.Attach(user); // attaches user AND user.Districts
var order = context.Orders.Include("Districts")
.Single(o => o.Id == someOrderId);
// because the user's Districts are attached, no District with the same key
// will be loaded again, EF will use the already attached Districts to
// populate the order.Districts collection, thus avoiding duplicate Districts
order.Owner = user;
context.SaveChanges();
// it should work without exception
Attach only the entities to the context you need in order to perform a special update:
using (var context = new MyContext())
{
    // Stub entities: only the primary key values are needed to set up the
    // relationship. detachedOrder/detachedUser are the detached objects.
    var order = new Order { Id = detachedOrder.Id };
    context.Orders.Attach(order);

    var user = new User { Id = detachedUser.Id };
    context.Users.Attach(user);

    order.Owner = user;
    context.SaveChanges();
}
This would be enough to update the Owner relationship. You would not need the whole object graph for this procedure; you only need the correct primary key values of the entities the relationship has to be created between. Of course, it isn't that easy if you have more changes to save or don't know exactly what could have changed.
Don't attach the object graph to the context at all. Instead, load fresh entities from the database that represent the object graph currently stored there. Then update the loaded graph with your detached object graph and save the changes applied to the loaded (=attached) graph. An example of this procedure is shown here. It is safe and a very general pattern (though not generic), but it can get very complex for complex object graphs.
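A rough sketch of that pattern, limited to the Owner relationship (detachedOrder stands for the detached graph coming from the UI; the names are illustrative):

using (var context = new MyContext())
{
    // load the attached counterpart of the detached order
    var orderInDb = context.Orders
        .Include("Owner")
        .Single(o => o.Id == detachedOrder.Id);

    // apply the detached change to the attached graph
    if (detachedOrder.Owner != null)
    {
        // Find returns the already attached user if the context knows it,
        // otherwise it loads the user from the database
        orderInDb.Owner = context.Users.Find(detachedOrder.Owner.Id);
    }

    context.SaveChanges();
}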
Traverse the object graph and replace the duplicate objects with a single unique one, for example just the first one with that type and key that you found. You could build a dictionary of unique objects to look up when replacing the duplicates. An example is here.
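A sketch of that dictionary approach for District (assuming an int key; the Canonical helper, the Order.District property and the settable Districts collection are illustrative):

using System.Collections.Generic;
using System.Linq;

var uniqueDistricts = new Dictionary<int, District>();

// returns one canonical District instance per key, so that EF
// never sees two different objects with the same primary key
District Canonical(District district)
{
    if (district == null) return null;
    if (uniqueDistricts.TryGetValue(district.Id, out var existing))
        return existing; // reuse the first instance seen for this key
    uniqueDistricts[district.Id] = district;
    return district;
}

// replace duplicates everywhere a District occurs in the graph
order.District = Canonical(order.District);
user.Districts = user.Districts.Select(Canonical).ToList();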