Finding related keywords - web-scraping

I want to find some data sources from where I can scrape many keywords related to a specified technology.
Suppose, I want to search for java,
I want to get technology keywords related to java. e.g. spring, maven, hibernate etc.
Can anybody help me with that?

Related

Possible to search through all JSPs in Adobe CQ5 repository with CRXDE?

We have a couple of relatively simple websites running on Adobe CQ 5.5 that were developed by a third party. I'm pretty familiar with how CQ works, but I'm working with somebody else's code here and I need to be able to search through all components in the system for a particular string.
The issue is that I can't seem to find a way to search across all of the various .jsp files stored with the various system components. I would have figured that the query tool in CRXDE Lite would have done the trick with something like this:
/jcr:root//*[jcr:contains(., 'Find this exact string in a JSP')] order by #jcr:score
But I've had no luck.
What I am looking for is some sort of global search that includes JSP files. Is that possible? Were I using a regular Java system, any IDE worth the download would be able to do this.
Thanks.
Might not be easiest way, but you can use the VLT tool to checkout the repository into your filesystem. Then you can lookup using whatever tool you prefer. It might even be faster in the long run
I don't have the actual answer but I suppose the JSPs are indexed via a filter that strips out some of their content.
It should be possible to configure the repository to index them as is instead, based on the info at http://wiki.apache.org/jackrabbit/IndexingConfiguration and http://jackrabbit.apache.org/jackrabbit-text-extractors.html
Sorry about the vagueness of this answer - I know the basic principles but to provide the details I would need more time than I can afford now ;-)

Are there any preset I18n word lists / resource files?

I'm creating a web application that uses I18n. As I don't want to translate very common basic strings like "forgot password?" on my own I'm asking you if there are already any resource files or word lists containing these strings. One option is to download an existing framework and extract somehow these strings but this might be a hassle?
Especially I'm looking for translation regarding user authentication and translations from English to Italian, French and German. The file format doesn't matter.
Professional translators use a tool, TMX is the generic term i think, Translation Memory Exchange, that does what you are talking about by building up standard phrase lists in other languages so when they translate they can bring these phrases in to speed up their job and reduce the repetitive tedium. So these lists exist.
There is a free plugin for MS Word that does this and may come with lists (sorry cannot remember the name although Rosetta rings a bell).
There is an FOSS TMX tool called Okapi at Sourceforge. It may come with the dictionaries but if not it is a point where you can investigate.
You could also approach a site called Proz which is a site for translators and might be able to point you in the right direction
Take care over MT like Google API as it can give some weird results but you could use it to build you list and then double check. Remember that when you check a language that you need to do it with a native speaker who can pick up on the nuances and colloquialisms.
You can use google translator api. and your custom resource bundle

I need a language dictionary

I need to add dictionary facilities to an Asp.net MVC app. Does anyone know a library that I could use? Where can I get word definitions from?
Any help is appreciated.
it depends on what you want to do.
if you're trying to add a reference of an actual dictionary of words to use for your application then you'll have to create your own to define the words that you want to use.
if you're trying to find the library that contains the data type dictionary then you should try
System.Collections.Generic, or System.Collections.Specialized
If its still relevant for you, It's not a 'library' but on Dicts.info site you can find raw dictionary files you could use inside your application.
You can also find bilingual dictionaries for various languages or even categorized vocabulary, and even more...
Watch out to respect their respective licences.

How to scrape websites such as Hype Machine?

I'm curious about website scraping (i.e. how it's done etc..), specifically that I'd like to write a script to perform the task for the site Hype Machine.
I'm actually a Software Engineering Undergraduate (4th year) however we don't really cover any web programming so my understanding of Javascript/RESTFul API/All things Web are pretty limited as we're mainly focused around theory and client side applications.
Any help or directions greatly appreciated.
The first thing to look for is whether the site already offers some sort of structured data, or if you need to parse through the HTML yourself. Looks like there is an RSS feed of latest songs. If that's what you're looking for, it would be good to start there.
You can use a scripting language to download the feed and parse it. I use python, but you could pick a different scripting language if you like. Here's some docs on how you might download a url in python and parse XML in python.
Another thing to be conscious of when you write a program that downloads a site or RSS feed is how often your scraping script runs. If you have it run constantly so that you'll get the new data the second it becomes available, you'll put a lot of load on the site, and there's a good chance they'll block you. Try not to run your script more often than you need to.
You may want to check the following books:
"Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL"
http://www.amazon.com/Webbots-Spiders-Screen-Scrapers-Developing/dp/1593271204
"HTTP Programming Recipes for C# Bots"
http://www.amazon.com/HTTP-Programming-Recipes-C-Bots/dp/0977320677
"HTTP Programming Recipes for Java Bots"
http://www.amazon.com/HTTP-Programming-Recipes-Java-Bots/dp/0977320669
I believe that the most important thing you must analyze is which kind of information do you want to extract. If you want to extract entire websites like google does probably your best option is to analyze tools like nutch from Apache.org or flaptor solution http://ww.hounder.org If you need to extract particular areas on unstructured data documents - websites, docs, pdf - probably you can extend nutch plugins to fit particular needs. nutch.apache.org
On the other hand if you need to extract particular text or clipping areas of a website where you set rules using DOM of the page probably what you need to check is more related to tools like mozenda.com. with those tools you will be able to set up extraction rules in order to scrap particular information on a website. You must take into consideration that any change on a webpage will give you an error on your robot.
Finally, If you are planning to develop a website using information sources you could purchase information from companies such as spinn3r.com were they sell particular niches of information ready to be consume. You will be able to save lots of money on infrastructure.
hope it helps!.
sebastian.
Python has the feedparser module, located at feedparser.org that actually handles RSS in its various flavours and ATOM in its various flavours. No reason to reinvent the wheel.

Where can I obtain dictionary files for use in checking spelling?

I thought this was asked before, but 15 minutes of searching on Google and the site search didn't turn anything up...so:
Where can I obtain free (as in beer and/or as in speech) dictionary files? I'm mainly interested in English, but if you know of any dictionary files, please point them out.
Note: This question doesn't have a right/wrong answer, so I made it community-wiki. However, I feel that it might be valuable to not only myself, but anyone who wishes to implement or use a spell checker with various dictionary files.
I have found a SourceForge project called Word List, which appears to have a number of dictionaries. I have downloaded a couple and am currently checking them out.
On Linux you can look in places like /usr/share/dict/words
I would presume that OpenOffice contains dictionaries for several languages.
I don't know what your target platform is but here is a solution that is for VB.NET. It uses the Office libraries which Office in itself isn't free but if your users are all internal and have Office then you could leverage these libs. There is a zip file with the example source code you can download as well.
Check spelling and grammar
There is what appears to be a half-decent dictionary available for free here on CodeProject.com (registration required unfortunately).

Resources