What architecture does Google use to store the index in its database, and how many tables does Google maintain for indexing?
Have a look at this article at High Scalability:
http://highscalability.com/google-architecture
There is also a similar question posted here:
How can Google be so fast?
I am making a search engine, and I want to know how Google scrapes all the data on Stack Overflow.
My intuition:
Do they save all the Stack Overflow data in a CSV file, and when a user types a coding question, run some algorithm over it to recommend answers?
Or is it something else entirely?
Thank you for the help.
Storing all the data in a CSV file and running a search over it would probably cost you hours to retrieve a result.
Google Search works in three stages: crawling, indexing, and serving.
The algorithms at work in these three stages are what make Google so powerful. Fine-tuned over years of optimization and learning, they can accurately index and analyze each web page without plainly storing everything, as you suggested.
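For intuition, here is a toy inverted index in Python. This is a drastic simplification of the indexing stage, not Google's actual data structure, but it shows why a query never has to scan every stored document:

```python
# Toy inverted index: each term maps to the set of document ids that
# contain it, so a query only touches documents that can possibly match.
from collections import defaultdict

docs = {
    1: "how to reverse a list in python",
    2: "google search architecture",
    3: "reverse engineering a search index",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(query):
    terms = query.split()
    # Intersect the posting sets: docs containing every query term.
    results = set.intersection(*(index[t] for t in terms)) if terms else set()
    return sorted(results)

print(search("reverse search"))  # -> [3]
```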
Reference: How Search Works
Hello everyone,
I have a rather unusual topic here, and I hope that I am addressing the right people.
I'm working on a personal project. I recently became a G Suite customer and would like to handle my document and media management via Google Drive. The document management works well so far, and with the help of Google Cloud Search I can easily find my documents across platforms.
Since I personally take a lot of pictures, I was wondering whether I could use Google products to classify my pictures automatically. My approach was to use the label detection of the Vision API to store the five most likely labels as metadata. Using that metadata, when I search for, say, architecture or animal, I can find all images tagged with one of those terms in a single search. The concept should of course be extendable to location and text detection.
I have already tried to build an automation via services like integromat.com that labels the photos, but unfortunately without success.
And now we come to the current situation. Since I realized that active interaction with the Google Cloud is essential, I am looking for help from an experienced community. I hope someone here has a good or inspiring idea.
One more hint before proposals are made: Google Photos is great and can do something like this, but it doesn't integrate with Google Cloud Search, and managing RAW files in it would be terrible.
You can achieve what you want using the following approach:
Build a web/mobile app to upload photos to Google Drive or Cloud Storage.
Use the Google Vision API to fetch metadata from your image before uploading to Drive/Cloud Storage.
Use Google Cloud Search Rest API to index the extracted metadata along with the image URL to Cloud Search.
Create a custom Search Interface to search and display your indexed images.
The steps above should point you in the right direction for implementing the solution; a minimal sketch of steps 2 and 3 follows. Let me know if you need further help with it.
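Here is a rough sketch of steps 2 and 3, assuming the google-cloud-vision client library and the Cloud Search indexing REST API via google-api-python-client. The datasource id, item id, and field choices are placeholders, and ACLs, credentials, and error handling are omitted; check the Cloud Search items.index documentation for the full item schema:

```python
# Sketch: label a photo with the Vision API, then index the labels into
# Cloud Search. DATASOURCE and the item fields below are placeholders.
import base64

from google.cloud import vision
from googleapiclient.discovery import build

DATASOURCE = "datasources/YOUR_DATASOURCE_ID"  # placeholder

def top_labels(path, n=5):
    """Return the n most likely Vision API labels for an image file."""
    client = vision.ImageAnnotatorClient()
    with open(path, "rb") as f:
        image = vision.Image(content=f.read())
    response = client.label_detection(image=image, max_results=n)
    return [label.description for label in response.label_annotations]

def index_photo(item_id, drive_url, labels):
    """Push the photo's labels and URL into a Cloud Search datasource."""
    service = build("cloudsearch", "v1")  # application default credentials
    name = f"{DATASOURCE}/items/{item_id}"
    service.indexing().datasources().items().index(
        name=name,
        body={
            "item": {
                "name": name,
                "itemType": "CONTENT_ITEM",
                "metadata": {"sourceRepositoryUrl": drive_url},
                "content": {
                    # inlineContent is a bytes field, so base64 in JSON
                    "inlineContent": base64.b64encode(
                        " ".join(labels).encode()).decode(),
                    "contentFormat": "TEXT",
                },
                "version": base64.b64encode(b"1").decode(),
            },
            "mode": "SYNCHRONOUS",
        },
    ).execute()

labels = top_labels("photo.jpg")  # e.g. ["architecture", "building", ...]
index_photo("photo-001", "https://drive.google.com/file/d/YOUR_FILE_ID", labels)
```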
I found this question How to store and retrieve different types of Vertices with the Tinkerpop/Blueprints graph API?
But it's not a clear answer for me.
How do I query, say, the 10 most recent articles versus the 10 most recently registered users for an API? Put graphing aside for a moment: this stack must handle both a collection/document paradigm and a graphing paradigm (relations between elements of these types of collections). I've read everything, including the source, and am getting closer but not quite there.
Closest document I've read is the multitenant article here: http://architects.dzone.com/articles/multitenant-graph-applications
But it focuses on using Gremlin for graph segregation and querying, which I'm hoping to avoid until I require analysis on the graphs.
I'm considering using Cassandra/Hadoop at this point with a reference to the graph id, but I can see this biting me down the road.
Indexes in Tinkerpop/Blueprints only support simple lookups based on exact matches of property keys.
If you want to find the 10 most recent articles, or articles with the phrase 'Foo Bar', you may have to maintain an external index using Lucene or Elasticsearch. These technologies use inverted indexes, which support term/phrase lookups, range queries, wildcard searches, and more. You can store vertex.getId() as a field in the indexed document to link back to the vertex in the graph database.
I have implemented something like this for my application which uses a Blueprints database (Bitsy) and a fancy index (Lucene). I have documented the high-level design to (a) keep the fancy index up-to-date using batch updates every few seconds and (b) ensure transactional consistency across the graph database and the fancy index. Hope this helps.
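For concreteness, here is a minimal sketch of the external-index side using Elasticsearch's Python client (Elasticsearch rather than Lucene, since either works here). The index name and fields are illustrative, and the batch-update/consistency machinery from point (a) and (b) above is left out:

```python
# Mirror article vertices into Elasticsearch so that sorted/range queries
# like "10 most recent articles" become cheap. vertex_id links each
# document back to its vertex in the graph database.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# On article creation: index a document mirroring the vertex's properties.
es.index(index="articles", id="v42", document={
    "vertex_id": "v42",  # vertex.getId() from the graph side
    "title": "Foo Bar explained",
    "created_at": "2013-05-01T12:00:00Z",
})

# Query: the 10 most recent articles, newest first.
resp = es.search(
    index="articles",
    query={"match_all": {}},
    sort=[{"created_at": {"order": "desc"}}],
    size=10,
)
vertex_ids = [hit["_source"]["vertex_id"] for hit in resp["hits"]["hits"]]
# Use vertex_ids to fetch the full vertices from the graph database.
```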
Answering my own question: the answer is indices.
https://github.com/tinkerpop/blueprints/wiki/Graph-Indices
I am trying to find a list of which ISBNs are in use. I suppose I could scrape a website like Amazon, but that would waste a lot of bandwidth. Is there a better (free) way?
Maybe you could use the remote API of isbndb.com; see the sketch below.
Trying to keep an enormous ISBN list up to date yourself is a huge task, if you ask me.
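For example, a hypothetical sketch of a single-ISBN lookup. The endpoint and the Authorization header format are assumptions about isbndb's REST API, so check their current documentation before relying on them:

```python
# Hypothetical isbndb.com lookup. The endpoint URL and Authorization
# header are assumptions -- verify against isbndb's current API docs.
import json
import urllib.request

API_KEY = "YOUR_ISBNDB_KEY"  # placeholder
isbn = "9780131103627"

req = urllib.request.Request(
    f"https://api2.isbndb.com/book/{isbn}",
    headers={"Authorization": API_KEY},
)
with urllib.request.urlopen(req) as resp:
    book = json.load(resp)["book"]

print(book["title"], book.get("publisher"))
```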
Just for the record: note that if you actually want an ISBN for your publication, you need to go to the official agency in your country. In the US this is http://www.isbn.org/ , but it varies by country; Australia, for example, has its own agency.
This might help: What is the most complete (free) ISBN API?
As the accepted answer states, there is also an API to search Amazon, but it isn't really meant to be used the way you intend.
I ended up using a partial list from http://my.linkbaton.com/isbn/
Yes, try isbndb.com
Is there any programmatic way to read the data stored in a Google Code Project Hosting "Issues" tracker? Ideally it'd be nice to update/add issues too, but first I need some structured form of data from the tracker. I only see the HTML pages, and an Atom feed for issue updates.
Now there is one.
http://googlecode.blogspot.com/2009/10/issue-tracker-data-api-for-project.html
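The API exposed issues as Atom feeds, so reading it reduces to fetching and parsing XML. A minimal sketch follows; the feed URL pattern is taken from the API announcement and should be treated as an assumption (and Google Code has since been retired, so this is of historical interest):

```python
# Sketch of reading the Issue Tracker Data API's Atom feed.
# The feed URL pattern is an assumption based on the API announcement.
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
project = "my-project"  # hypothetical project name
url = f"https://code.google.com/feeds/issues/p/{project}/issues/full"

with urllib.request.urlopen(url) as resp:
    feed = ET.parse(resp)

for entry in feed.getroot().findall(f"{ATOM}entry"):
    title = entry.find(f"{ATOM}title").text
    updated = entry.find(f"{ATOM}updated").text
    print(updated, title)
```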
There doesn't seem to be one yet, but during the Google Wave introduction they promised it's coming. The demo is at minute 61, and your question is answered at minute 78.