I'd like to get the number of Wikipedia pages matching a condition.
e.g.
"house" --> 1,200 pages
"man" --> 13,000 pages
"university college" --> 360 pages
Among many other ways, I can do this by indexing Wikipedia with Lucene, but that's pretty time consuming.
Is there a way to perform this type of query on the MediaWiki API?
What is the query limit on the Wikipedia API?
Cheers,
Mulone
Try the list=search query, passing each term as the srsearch parameter. For example:
"house"
"man"
"university college"
(Since you said you're only interested in the number of matching pages, I included srlimit=1 and srprop= in the query to minimize the extra information returned. Apparently there's no way to keep the API from returning at least the title of the first match, though; srlimit=0 just gives an error message.)
As for query limits, there are limits on the number of results per query, but I don't think MediaWiki enforces any hard limits on the rate at which you query the API. MediaWiki does limit editing rates, but I don't think any such limits are currently applied for searching.
I believe the recommendation is that you run your queries serially — that is, wait for the previous query to finish before sending the next one. This provides a sort of automatic rate limiting, since if the servers are busy, your queries will take longer to complete. If you want to play nice, you could also include a maxlag parameter in your queries (preferably with exponential backoff if it fails); the maxlag mechanism is really designed more for automatic edits than for searching, but it does at least ensure that your code will not hit Wikimedia's servers at times when they're particularly overloaded.
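If it helps, here's a minimal sketch of that pattern in Python: serial queries, srlimit=1 and srprop= to keep responses small, and maxlag with exponential backoff. It assumes the requests library and the English Wikipedia endpoint:

    import time
    import requests  # assumed available; any HTTP client works

    API = "https://en.wikipedia.org/w/api.php"

    def page_count(term):
        params = {
            "action": "query", "list": "search", "srsearch": term,
            "srlimit": 1, "srprop": "", "maxlag": 5, "format": "json",
        }
        delay = 1
        while True:
            data = requests.get(API, params=params).json()
            if "error" in data and data["error"].get("code") == "maxlag":
                time.sleep(delay)   # servers are lagging: back off and retry
                delay *= 2
                continue
            return data["query"]["searchinfo"]["totalhits"]

    # queries run serially, one at a time
    for term in ("house", "man", "university college"):
        print(term, page_count(term))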
Also, if you want to do a lot of these kinds of queries, you might want to consider downloading a Wikipedia database dump and either indexing it yourself (as you mentioned in your question) or just reading it in a single pass and counting matching pages as you encounter them.
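If you do go the dump route, a rough single-pass counter could look like this; the dump filename is just an example, the substring test is a stand-in for whatever "matching" means in your case, and the XML namespace varies with the dump's export version:

    import bz2
    import xml.etree.ElementTree as ET

    NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # check your dump's <mediawiki> tag
    term = "university college"
    count = 0

    with bz2.open("enwiki-latest-pages-articles.xml.bz2", "rb") as f:
        for _, elem in ET.iterparse(f):
            if elem.tag == NS + "page":
                text = elem.findtext(f"{NS}revision/{NS}text") or ""
                if term in text.lower():
                    count += 1
                elem.clear()  # free memory as we go

    print(count)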
I'm new to firebase and I'm currently trying to understand how to properly index frequently updating counters.
Let's say I have a list of articles on a news website. Every article is stored in my collection 'articles', and the documents inside have a like counter, a date when it was published, and an id representing a certain news category. I would like to be able to retrieve the most liked and the latest articles for every category. Therefore I'm thinking about creating two composite indexes: one on the category type (ASC) and likes (DESC), and one on the category type and the published date (DESC).
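Concretely, the two reads I have in mind would look roughly like this with the Python client (the field names are my shorthand for the attributes described above; the category value is just an example):

    from google.cloud import firestore

    db = firestore.Client()
    articles = db.collection("articles")

    # most liked articles in a category -> composite index on (category ASC, likes DESC)
    most_liked = (articles
                  .where("category", "==", "politics")
                  .order_by("likes", direction=firestore.Query.DESCENDING)
                  .limit(20)
                  .stream())

    # latest articles in a category -> composite index on (category ASC, published DESC)
    latest = (articles
              .where("category", "==", "politics")
              .order_by("published", direction=firestore.Query.DESCENDING)
              .limit(20)
              .stream())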
I tried researching limitations, and on the best practices page I found this warning about creating hotspots with indexes:
Creates new documents with a monotonically increasing field, like a timestamp, at a very high rate.
In my example I'm using articles, which are not created too frequently, so I'm pretty sure this wouldn't create an issue (please correct me if I'm wrong). But I do still wonder whether I could run into limitations or high costs with my approach, especially regarding likes, which can change frequently while the timestamp stays constant.
Is my approach of indexing likes and timestamps by category sound, or am I overlooking something?
If you are not adding documents at a high rate, then you will not trigger the limit that you cited in your question.
From the documentation:
Maximum write rate to a collection in which documents contain sequential values in an indexed field: 500 per second
If you are changing a single document frequently, then you will possibly trigger the limitation that a single document can't be updated more than once per second (this applies to sustained bursts of updates and is not a hard limit).
From the documentation on distributed counters:
In Cloud Firestore, you can only update a single document about once per second, which might be too low for some high-traffic applications.
That limit now seems to be missing from the formal documentation; I'm not sure why that is. But I'm told that particular rate limit has been dropped. You might want to start a discussion on firebase-talk to get an official answer from Google staff.
Whether or not your approach is "sound" depends entirely on your expected traffic. We can't predict that for you, but you are at least aware of when things will go poorly.
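If the like counter on a single hot article ever does become a problem, the distributed counter pattern from that documentation is the usual workaround. A minimal sketch with the Python client, where the shard count and field names are illustrative:

    import random
    from google.cloud import firestore

    db = firestore.Client()
    NUM_SHARDS = 10  # size this to the expected peak write rate

    def init_counter(article_ref):
        # one sub-document per shard, each holding a partial count
        for i in range(NUM_SHARDS):
            article_ref.collection("shards").document(str(i)).set({"likes": 0})

    def add_like(article_ref):
        # spread writes across shards so no single document is written too often
        shard_id = str(random.randrange(NUM_SHARDS))
        article_ref.collection("shards").document(shard_id).update({"likes": firestore.Increment(1)})

    def total_likes(article_ref):
        # sum the partial counts when reading
        return sum(doc.to_dict()["likes"] for doc in article_ref.collection("shards").stream())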
I've been trying to figure out how to best model data for a complex feed in Cloud Firestore without returning unnecessary documents.
Here's the challenge --
Content is created for specific topics, for example: Architecture, Bridges, Dams, Roads, etc. The topic options can expand to include as many as needed at any time. This means it is a growing and evolving list.
When the content is created it is also tagged to specific industries. For example, I may want to create a post in Architecture and I want it to be seen within the Construction, Steel, and Concrete industries.
Here is where the tricky part comes in. If I am a person interested in the Steel and Construction industries, I would like to have a feed that includes posts from both of those industries with the specific topics of Bridges and Dams. Since it's a feed the results will need to be in time order. How would I possibly create this feed?
I've considered these options:
Query for each individual topic selected that includes tags for Steel and Construction, then aggregate and sort the results (a sketch of this approach appears after this list). The problem I have with this one is that it can return too many posts, which means I'm reading documents unnecessarily. If I select 5 topics within a specific time range, that's 5 queries, which is OK. However, each can return any number of results, which is problematic. I could add a limit, but then I run the risk of posts being omitted from topics even though they fall within the time range.
Create a post "index" table in Cloud SQL and perform queries on it to get the post IDs, then retrieve the Firestore documents as needed. Then the question is, why not just use Cloud SQL for everything? Well, it's a scaling, cost, and maintenance issue. The whole point of Firestore is not having to worry so much about DBAs, load, and scale.
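For reference, option 1 would look roughly like this with the Python client (collection, field, and value names are all just illustrative):

    from google.cloud import firestore

    db = firestore.Client()
    posts = db.collection("posts")

    topics = ["Bridges", "Dams"]              # topics the user follows
    industries = ["Steel", "Construction"]    # industries the user follows

    results = []
    for topic in topics:
        # one query per topic: equality on topic, array membership on industries,
        # newest first; each query shape needs its own composite index
        q = (posts
             .where("topic", "==", topic)
             .where("industries", "array_contains_any", industries)
             .order_by("published", direction=firestore.Query.DESCENDING)
             .limit(20))
        results.extend(q.stream())

    # merge and re-sort the per-topic results into a single feed
    feed = sorted(results, key=lambda d: d.to_dict()["published"], reverse=True)[:20]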
I've not been able to come up with any other ideas and I'm hoping someone has dealt with such a challenge and can shed some light on the matter. Perhaps Firestore is just completely the wrong solution and I'm trying to fit a square peg into a round hole, but it seems like a workable solution can be found.
The ideal structure is to have a separate node for posts, and for each post store a reference to its parent categories, e.g. Steel and Construction. Also store a timestamp on each post so they can be ordered. If you think the database will be too massive for Firebase's queries, you can connect your Firebase database to Elasticsearch and do the search from there.
I hope it helps.
I am trying to calculate physical distances between geographic locations (addresses) with the ggmap/mapdist function in R. Apart from the uncomfortable fact that Google Maps allows only 2500 queries/session, I have to cope with misspelled or otherwise imperfect "addresses". The most typical problem is that the exact address strings have extra information appended (floor, door, etc.), but it is very hard to detect any pattern in these additions that would allow applying a regular expression.
My goal is:
Check if the address string is recognizable to Google Maps;
If not, find a way to truncate it to an acceptable form, perhaps by removing words from the string step by step.
Has anybody coped with this kind of problem?
Thanks.
There are a couple of factors running into each other here. One factor is the misspellings and other complexities related to addresses and the other is pinpointing (geocoding) a given address. Although they are related problems, each must be handled to accomplish your objectives.
There are numerous service providers out there that can do either or both with minimal cost involved. This can be found with a simple Google search. You can then investigate each to see if they match your use case and licensing requirements.
All of that considered, you'll want to get your address list cleaned up at a minimum. Doing that will enable you to utilize any number of geocoding providers.
Depending upon the size of your list, you can get your list cleaned up and geocoded for perhaps $20.
In the interest of full disclosure, I'm the founder of SmartyStreets. We provide a web interface (to help clean up the address list) as well as an API (which can be used on a continual basis to keep addresses clean). We also geocode your list at no extra charge. Further, we don't have any licensing restrictions on the number of lookups that can be performed during a given timeframe. (We have customers that hit us hundreds of millions of times per day.) The entire process of signing up and cleaning up your list takes just a few minutes.
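If you'd rather first script the truncation idea from your question yourself, a rough sketch of that loop is below. It's shown in Python against the Google Geocoding web API; the same logic should work in R with ggmap::geocode, and the endpoint/key handling here is an assumption:

    import requests  # assumed available

    GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"
    API_KEY = "YOUR_KEY"  # placeholder

    def geocodes(address):
        """Return True if Google Maps recognizes the address."""
        resp = requests.get(GEOCODE_URL, params={"address": address, "key": API_KEY}).json()
        return resp.get("status") == "OK"

    def best_effort_address(address):
        """Drop trailing words (floor, door, etc.) until the address geocodes."""
        words = address.split()
        while words:
            candidate = " ".join(words)
            if geocodes(candidate):
                return candidate
            words.pop()  # truncate one word from the end and retry
        return None      # nothing recognizable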
I am wondering if anyone has ever tried to work on the following issue. I need to execute a series of tests on a remote DICOM Q/R server. This would allow some easy DICOM Conformance Statement checking.
As an implementation detail of the test suite, I am running the following (DCMTK style command):
$ findscu --study --cancel 1 --key 0020,0010=* --key 0008,0052=STUDY --aetitle MINE --call THEIR dicom.example.com 11112
The goal here is to find a valid StudyID (later on I'll use that StudyID to execute lower key level C-FIND, and some related C-MOVE queries). Of course it would be much easier if I could upload my own dataset and try to fetch it back, but I cannot do that against a running PACS in a clinical environment. I need to define with a minimal number of queries how to find a valid StudyID.
However I fear that some DICOM implementations may have policies where querying the entire database is forbidden.
So I was wondering if anyone has written a list of those policies, and maybe describe a way to retrieve a valid StudyID from a remote server with a minimal number of C-FIND queries.
I think I may simply go with:
TODAY=`date +"%Y%m%d"`
findscu --study --key 0008,0020="$TODAY-" --key 0020,0010=* --key 0008,0052=STUDY --aetitle MINE --call THEIR dicom.example.com 11112
If this does not work (returns empty), I'll check yesterday's results.
Welcome to DICOM-wonderland.
You are right that you should be very, very, very, very careful about running random queries on a clinical PACS. I've seen commercial PACSes send their whole(!) database as the result of a query they did not understand. Not a pretty sight. This (and privacy) is one of the reasons that in a lot of hospitals around the world PACS admins are very afraid of giving direct access to their PACS via DICOM.
In general I would say that standardization is not going to help you. So you have to find something which works for you and which will not bring the PACS down. No guarantees here.
Just a list of observations from querying PACSes in hospitals:
Some are case sensitive in their matching, some are not.
Most support some kind of wildcard. This is normally '*', but I've also seen '%' (since that is a SQL wildcard, and the query is just passed as a SQL string). This is not well-defined, I think.
The list you will get back might be limited to say the first 500 entries. Or 1000. Or random number between 500 and 1000. Or the whole PACS. You just don't know.
DICOM and cancellation do not play well together. Cancelling a query is not implemented well. Normally a PACS sees it as a failed transfer and will retry after some time. And the retry queue is limited in size, so it might ignore new queries. So always keep your STORE-SCP server running to drain this queue.
Sometimes queries take minutes, especially for retrieve. The next time it might have been retrieved (from tape?) and be fast for a while.
A DICOM query may take a lot of resources from the PACS, depending on the PACS. Don't be surprised if a PACS admin shows up if you experiment a little too much.
The queries supported differ very much. Only basic queries are supported by all: list of patients, list of StudyID/StudyInstanceUID for a patient, list of series per study, retrieve study or series. Unless you get a funky research department which uses OsiriX, which does not support patient-level queries but only study-level queries.
So what I would advise if you want to have something working on any random PACS:
Use empty-return-key instead of '*'. This is the DICOM way to retrieve information.
Do not use '--cancel'. If you really need to cancel, just close the TCP connection (not supported in DCMTK).
Use a query on PatientID, PatientName, BirthDate, and StudyDate to get a list of StudyIDs/StudyInstanceUIDs (see the sketch after this list).
The simplest approach is to just use a fixed StudyID, assuming that it stays in the PACS long enough. If not, think of a limiting query so as not to overload the PACS (your 'TODAY' suggestion fits that description).
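If you'd rather script this than shell out to findscu, a rough equivalent of that advice with pynetdicom is sketched below (empty return keys, an open-ended StudyDate range, no cancel). The host, port, and AE titles are the ones from your question; the date is just an example:

    from pydicom.dataset import Dataset
    from pynetdicom import AE
    from pynetdicom.sop_class import StudyRootQueryRetrieveInformationModelFind

    ae = AE(ae_title="MINE")
    ae.add_requested_context(StudyRootQueryRetrieveInformationModelFind)

    query = Dataset()
    query.QueryRetrieveLevel = "STUDY"
    query.StudyDate = "20240101-"  # open-ended date range keeps the result set small
    query.StudyID = ""             # empty return keys instead of '*'
    query.StudyInstanceUID = ""

    assoc = ae.associate("dicom.example.com", 11112, ae_title="THEIR")
    if assoc.is_established:
        for status, identifier in assoc.send_c_find(query, StudyRootQueryRetrieveInformationModelFind):
            # 0xFF00 / 0xFF01 are the 'pending' statuses that carry a matching identifier
            if status and status.Status in (0xFF00, 0xFF01) and identifier is not None:
                print(identifier.StudyInstanceUID, getattr(identifier, "StudyID", ""))
        assoc.release()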
Good luck!
Background:
I'm using SQL Server 2008 and ASP.NET 4 on Windows 2008
I have one table with about 10 million rows of products that I make available online for users to browse -- not search. Each of the 10 million products has extra attributes -- like categories -- that I keep in lookup tables -- there are three or four lookup tables.
Problem
When someone browses and starts using filters (shipping location, price, quality, brand), I need to join the tables, apply all the filters, and return the results. It's very slow and I want to make it faster. Sometimes users will apply a very broad filter, resulting in 800,000 results, and though I only return the first 10 of those for browsing, I still need to run the query for the full 800,000.
What I've Tried Already
I've joined all the information from the various tables into one physical table and then created a covering index for the table.
The queries are much faster, but there is a good bit of maintenance I have to do on the table behind the scenes with jobs to make sure that if something goes out of stock, I take it out within a reasonable time frame (5 minutes or so).
I don't use materialized/indexed views b/c I've got aggregates in the results which SQL Server doesn't seem to like.
Question
How can I speed up browse results beyond the indexing and table optimization that I've already done? I'm not doing any full-text searches -- I'm filtering with exact parameters.
Possible Solutions I've Thought Of
Large caching solution -- AppFabric or Memcached. I know next to nothing about these and don't know whether they are appropriate.
Small caching solution -- Maybe leveraging ASP.NET caching -- but every person is going to apply different filters so I'm not sure how much this will give me.
SSDs -- as a larger-scale solution I've thought about getting SSDs but that will be down the road
CDN -- I don't think a CDN will help b/c the bottleneck here is my database's search capabilities, not the bandwidth/distance to the requester.
I had a similar problem with a complex join query causing horrible response times. I was able to solve it using Lucene.NET. It's a .NET implementation of the Lucene search index. Basically, you build indexes on data fields (your categories) and then you can search via those categories and return thousands of rows very quickly. It takes the join operation out of the equation because it already knows, via the indexes, which records fit your criteria.
The following is a very good article on Lucene.NET. I highly recommend it. It took a search result that was taking 20 seconds using standard joins and reduced the response time to less than a second.
http://www.codeproject.com/Articles/29755/Introducing-Lucene-Net
Also, feel free to ping me if you have specific Lucene.NET implementation questions. I just got through a lot of research/learning in order to implement it properly on my site, so if you have specific questions on how to make it work I may be able to help with that as well.
"I perform the full query b/c I need to populate the new filters and
the number of results along with the search results. For example, if
someeone filters on category of "Shoes", and location of TX, some of
the other filters are going to be restricted based on the previous
filter."
Try executing two queries: One to count all results and one to select the top N. Maybe your bottleneck is copying 800,000 rows to the client. Doing two queries would fix this at the cost of an additional query. The cost is likely to be less than 2x though due to optimizations for few rows and for count-only queries.
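As a sketch of that two-query approach (shown with pyodbc; the connection string, table, and column names are made up):

    import pyodbc  # assumes the SQL Server ODBC driver is installed

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=Shop;Trusted_Connection=yes"
    )
    cur = conn.cursor()

    where = "WHERE Category = ? AND ShipLocation = ?"   # hypothetical filter columns
    params = ("Shoes", "TX")

    # query 1: just the count, so no wide rows cross the wire
    total = cur.execute(f"SELECT COUNT(*) FROM ProductsFlat {where}", params).fetchone()[0]

    # query 2: only the first page of results
    page = cur.execute(
        f"SELECT TOP 10 ProductId, Name, Price FROM ProductsFlat {where} ORDER BY Price",
        params,
    ).fetchall()

    print(total, len(page))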