Interpreting Search Results - information-retrieval

I am tasked with writing a program that, given a search term and the HTML source of a page representing search results of some unknown search engine (it can really be anything, a blog, a shop, Google, eBay, ...), needs to build a data structure of the results containing "what's in the results": a title for earch result, the "details" link, the position within the results etc. It is not known whether the results page contains any of the data at all, and whether there are any search results. The goal is to feed the data structure into another program that extracts meaning.
What I am looking for is not BeautifulSoup or a RegExp but rather some clever ideas or algorithms on how to interpret the HTML source. What do I do to find out what part of the page constitutes a single result item? How do I filter the markup noise to extract the important bits? What would you do? Pointers to fields of research covering what I try to to are aly greatly appreciated.
Thanks, Simon

I doubt that there exist a silver-bullet algorithm that without any training will just work on any arbitrary search query output.
However, this task can be solved and is actually solved in many applications, but with different approach. First you have to define general structure of single search result item based on what you actually going to do with it (it could be name, date, link, description snippet, etc.), and then write number of html parsers that will extract necessary necessary fields from search result output of particular web sites.
I know it is not super sexy solution, but it probably the only one that works. And it is not rocket science. Writing parsers is actually extremly simple, you can make dozen per day. If you will look into html source of search result, you will notice that output results are typically very structured and marked with specific div sections or class atributes, so it is very easy to find it in the document. You dont have even use any complicated HTML parsing library for that, something grep-like will be enough.
For example, on this particular page your question starts with <div class="post-text"> and ends with </div>. Everything in between is actually a post text with some HTML formatting that you may want to remove along with extra spaces and "\n". And this <div class="post-text"> appears on the page only once.
Once you go at large scale with your retrieval applicaiton, you will find out that there is not that big variety of different search engines on different sites, and you will be able to re-use already created parsers for sties using similar search engines.
The only thing you have to remember is built-in self-testing. Sites tend to upgrade and change design from time to time. If your application is going to live for some time, you will need to include into your parsers some logic that will check validity of their results and notify you every time search output has changed and is not compatible anymore with your parser. Then you will have to modify particular parser or write new one.
Hope this helps.

Related

Wordpress search results by keyword

Is there a way to prioritise the search results by keywords or tags.
For example, if a user types in Admissions then domain/admissions page comes first and then all the rest. If the word is apply then domain/apply page is first on the search result page and etc
Was trying to find the examples, but no luck. Never used the extended search before, so even don't know where to dig
What you are talking about here is returning search results by "relevance" and there are a few examples out there including plugins such as Relevanssi.
I have no example of how to do this, however, I thought it might be useful to know the term you need to dig deeper.

How to generate PDF via IText Template

we want to realize the following:
Generate PDF with Template,which means set value in AcroFields
Create a big details table (structure of table is also in template). In this progress, the details will occupy more than one page.
If the detail table is on multi pages, the header of the table should be also on top of the new page.
We found some examples on following website:
http://kuujinbo.info/cs/itext_template1.aspx
http://kuujinbo.info/cs/itext_template2.aspx
But the details the founction is omitted there.
Add content; the code for _do_form_fields(), _get_transaction_details(), and _transaction_summary() are omitted, since they only return strings to add to ColumnText. ColumnText is smart; each call to Go() renders as much text that will fit on the current page and returns a status code that tells you: (1) how much text (to write) is remaining, and/or (2) how much space is still available on the page. On each iteration you add text to the current page, call ColumnText.HasMoreText() to inspect the status, and then Document.NewPage() if necessary.
Is there anyone who had same situation before? We are appreciated that you could offer some tips or suggestions.
Thank you.
best regards,
Cheng Gong
You are already making a mistake in the first step of your requirements.
You say "Generate PDF with Template,which means set value in AcroFields."
The first part is OK: you want to generate a PDF with a template. However, this doesn't mean setting values in AcroFields. That's only one option. It's the option you take if you consider PDF being the digital equivalent of paper. The form is static: every coordinate is fixed. You just fill out data at the appropriate places. If the data doesn't fit the designated areas, you're out of luck. I already referred to chapter 6 of my book in a comment. You can also see how AcroForms work in a longer tutorial: https://www.youtube.com/watch?v=6YwDME0Fl1c (This tutorial is almost completely dedicated to creating a report from a data set.)
Another way to create PDF from a template is by using the XML Forms Architecture. In this case (if you have a pure XFA form), your PDF is a container for XML. You can then inject XML data into this form and the form will adapt itself depending on the data. A one-page form can easily grow into a 20-page document when filled out. This is explained in this video: https://www.youtube.com/watch?v=h0wzj84tnmw (Note that the video dates from 2012. The product I present has been finished and the results are much better now.)
Alternatives to this approach could be to create a template in HTML. I often refer to this solution as a poor man's XFA solution. This solution requires XML Worker. You can see an example in this video: https://www.youtube.com/watch?v=clWoDrEEl50
This is a general answer. I couldn't be more specific because your question isn't clear. You first need to make your mind up regarding the approach. Right now, you talk about AcroFields and at the same time about ColumnText. In the long tutorial, this is described as the hard way. See also the corresponding online samples. It is very confusing why you're asking a very difficult question before asking the simple questions. Unless of course, you already have the answer to those simple questions. If so, please share these answers.

Drupal WebForms Data Structure Reference?

I am relatively new to Drupal. We have a site and I've been asked to jump in and make some changes. I'm working on customizing the output of the Webforms module. I'm having trouble doing so because I can't seem to find a reference to the various data structures Webforms uses.
For example, I need to change something in a preprocess hook. Passed into the hook is a structure called $variables. I can see that attributes are being added to the piece I want to change, so I know I'm in the right hook. What I want to do is add something to the text. But I can't figure out where in $variables the text is so I can change it.
I'm sure what I need to change is in there, but I can't seem to get at it. All the documentation I've found on the web is either "paste this code in" or assumes you know the data structures.
So:
1. Is there a reference anywhere to these structures? $variables is one. $submission, $components are others. There are probably more. I know their contents vary widely with the specific webform, but looking for a general reference.
2. How can I see the contents of one of the structures from inside a hook? I've tried a lot of things, but no luck. Would be great to either have it output to the Apache log, or show up on the screen, something...
Any help would be greatly appreciated. It feels like there's real power here, but I can't get at it because I'm missing some basics.
I would say you need to install 2 modules to figure out what is going on...
First Devel, allowing you to use the dmp function. This will output a whole array to the message area.
And then my new favorite module, Search Krumo.
A webform is generated from large array of data and finding the bit that is relevant to you can often be difficult just looking though the dmp output. Search Krumo puts a search box in the message area allowing you to search for any instances of a string in the whole array structure. When you've found the bit that is relevant it also lets you copy the path to that array element so you can easily modify values buried deep in multi-arrays.
EDIT:
If you don't want the output on the screen but would rather log it then use Devel Debug Log. Very useful for debugging ajax requests etc.
If you just need to log simple strings not whole arrays then the dd function is useful combined with: tail -f /tmp/drupal_debug.txt assuming you have SSH access.

Converting words to hashtags when posting to Twitter

I currently use LinqToTwitter to send posts to Twitter. I'd like to convert words in the title of the post to hashtags when it gets fired off as tweet so something like - "Firefox is cool" is the blog post and becomes #Firefox is cool http://myshortu.rl/dhsgeh on Twitter.
So far the way i see it is i need a database table with the words i want to convert to hashtags. I'd have to parse out the title and compare the words to those in the db and add on the pound sign. Is the best way to use a db table? Or can I do it with an in memory collection or keep the words in web.config? Thanks....
The decision on whether to use a database or file (such as web.config) might depend on whether you want to write code that allows you to maintain the list. e.g. Add, Modify, Remove. If so, then a DB sounds like the easiest option. If the list is small and doesn't change, then adding a delimited list to web.config would work fine.
Since you're using ASP.NET you can't hold it in a memory variable, but you can hold the list in Cache. This can make for some very fast lookups, rather than multiple file or DB queries.
Just to put this into perspective though, it's tough to recommend a proper design in a forum because there might be details that aren't known. So, it's best to take my answer as something that helps think about what the tradeoffs are, rather than a definitive recommendation on what you should do.

Aggregating from various sources

It could be a project well beyond my skills right now but I've got around one full month to spend on it so I think I can do it. What I want to build is this: Gather news about a specific subject from various sources. Easy, right? Just get the rss feeds and display them on a page. Well, I want something more advanced: Duplicates removed and customized presentation (that is, be able to define/change the format in which the news headlines are displayed).
I've played a bit with Yahoo Pipes and some other tools and I am facing two big problems:
Some sources don't provide rss feeds. How do I create one?
What's the best method to find and remove duplicates. I thought about comparing the headlines and checking if there is a matching bigger than, say, 50%. Is that a good practice though?
Please add any other things (problems, suggestions, whatever) I might not have considered.
Duplication is a nasty issue. What I eventually ended up doing:
1. Strip out all HTML tags except for links (Although I started using regex, I was burned. I eventually moved to custom parsing to remove tags)
2. Strip out all whitespace
3. Case-desensitize
4. Hash all that with MD5.
Here's why you leave the link in:
A comment might be as simple as "Yes, this sucks". "Yes, this sucks" could be a common comment. BUT if the text "this sucks" is linked to different things, then it is not a duplicate comment.
Additionally, you will find that HTML tag escaping is weird with RSS feeds. You would think that a stray < would be double-encoded: (I think)&<;
But it is not. It is encoded <
But so too are HTML tags! :<p>
I eventually copied all the known HTML tags as parsed by Mozilla Firefox and manually recognized those tags.
Creating an RSS feed from HTML is quite nasty and I can only point you to services such as Spinn3r, which are fantastic at de-duplication and content extraction. These services typically use probability-based algorithms that are above me. I know of one provider that got away with regexing pages (They had to know that a certain page was MySpace-based or Blogger-based) but they did not perform admirably.
You might want to try to use the YQL module to scrape a webpage that doesn't provide RSS. Here's a sample of a YQL statement to scrape HTML.
About duplicates, take a look at this pipe.
Customized presentation: if you want it truly customized you'll have to manipulate the pipe results yourself, e.g. get it as JSON an manipulate it with Javascript, or process it server-side.

Resources