Wikipedia article names (no content) [closed] - web-scraping

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 7 years ago.
I am working on a project for which I need all the Wikipedia article names (I don't need the content). Is there a place where I can download this data?

Check out this page here on Wikipedia - there is an option to just download an archive with the names of the articles. Here's the actual path to the download page:
All Titles (gzipped) - 32+ MB at the time of posting.
Edit:
You may notice non-English titles (and some profanity, be advised) in the list contained in enwiki-latest-all-titles-in-ns0.gz. This is because by default most people create content on the main English wiki (language code en). If you investigate other language dumps, you will see that they contain different sets of articles.
Reading on the main download page, there are references to being able to use the Wikipedia API to perform some types of querying on Wikipedia, but I'm not sure this will resolve your problem (taxonomy of the pages doesn't seem to provide a simple way to differentiate "English" content vs "content on English wiki").
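Once the archive is downloaded, reading it is straightforward. Below is a minimal Python sketch, assuming the file holds one title per line, uses underscores instead of spaces, and starts with a header line reading page_title. Those format details are assumptions to verify against the file you actually download, and iter_titles is a hypothetical helper name.

```python
import gzip
from typing import Iterator


def iter_titles(path: str, header: str = "page_title") -> Iterator[str]:
    """Yield article titles from a gzipped, one-title-per-line dump."""
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle):
            title = line.rstrip("\n")
            # Assumption: the first line may be a column header, not a title.
            if line_number == 0 and title == header:
                continue
            if not title:
                continue
            # Assumption: titles are stored with underscores for spaces.
            yield title.replace("_", " ")
```

Because the function is a generator, something like `for title in iter_titles("enwiki-latest-all-titles-in-ns0.gz"): ...` streams the titles without loading the whole list into memory.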

I'm not aware of any central list of articles, but if you just need a large number of them rather than a complete list (bearing in mind that any complete list will always be out of date anyway), you could probably put something together with wget to recursively follow links within Wikipedia from the main page and store the URLs you get.


Complete list of all vimrc configuration options? [closed]

Closed 7 years ago.
I'm not quite sure if Stack Overflow is the correct site to post this on, but I don't see a better fit among the Stack Exchange sites.
Vim has a lot of documentation, everything from free books to interactive learning, but there seems to be a piece missing, at least from what I can see.
Despite all the documentation, I'm unable to find a complete list of all options that can be specified in a .vimrc file. Does anyone know where this is documented? It is not documented in :help vimrc or any other documentation I've seen, not even the free books I've looked at. The Vim tricks wiki gives an intro like so many other pages on the web, but that's about it. No page or documentation seems to list all available options for the vimrc file. The man page doesn't even list a single option, only usage and command-line options.
The books and other documentation are good at explaining how to use Vim, but not how to write the configuration file.
So, does someone know where I can find a complete list of all available options in the vimrc file?
:help 'option' will take you to the documentation of any option (for example, :help 'number'). All of those are contained in a single documentation file named options.txt, which you can open with :help options.txt.
Additionally, you can obtain a special report that shows all options, a short help, and the current values via
:options

Dictionary text file [closed]

Closed 7 years ago.
This post was edited and submitted for review 6 months ago and failed to reopen the post:
Original close reason(s) were not resolved
I am writing a program that needs a list of English words as a source file. I realise that such files are available for students writing games such as Hangman or crossword solvers, but I am having trouble locating one and wonder if anyone knows how I can obtain one without slowly scraping websites and building up a dictionary manually.
What about /usr/share/dict/words on any Unix system? How many words are we talking about? Like OED-Unabridged?
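As a rough illustration of using that file, here is a minimal Python sketch that loads it into a set. It assumes one word per line; the helper name load_words, the min_length parameter, and the choice to drop entries containing apostrophes or other non-letters are assumptions made for this example. The file is not present on every system and its contents vary by distribution.

```python
from typing import Set


def load_words(path: str = "/usr/share/dict/words", min_length: int = 1) -> Set[str]:
    """Load a one-word-per-line file into a lowercase set of words."""
    words: Set[str] = set()
    # errors="replace" keeps the loop going if the file is not valid UTF-8.
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            word = line.strip()
            # Keep purely alphabetic entries of at least min_length letters.
            if len(word) >= min_length and word.isalpha():
                words.add(word.lower())
    return words
```

A set gives constant-time membership checks, which suits Hangman or crossword-style lookups such as `"apple" in load_words()`.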
For an English dictionary .txt file, you can use Custom Dictionary.
You can also generate a list with aspell or wordlist using your own settings.
Also you can take a look at http://wordlist.sourceforge.net/
Only English words: http://www.math.sjsu.edu/~foster/dictionary.txt
Also take a look at:
http://wordlist.sourceforge.net/
http://www.math.sjsu.edu/~foster/dictionary.txt
350,000 words
Very late, but might be useful for others.
There's also WordNet. Its data file formats are well documented.
I used it for building an embeddable dictionary library for iOS developers (www.lexicontext.com) and also in one of my apps.
Future searchers: you can use aspell to do the dictionary checks; it has bindings for Ruby and Python. It would make your job much simpler.

How to handle flagged content in a community? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 8 years ago.
On a multi-lingual community with almost only user-generated content, is there a commonly used way to treat flagged content (profanity, racism, general illegal stuff etc)?
As there will be a lot of non-English content, the only way to handle the flagging itself is crowdsourcing by the community and somehow automatically hiding or deleting the flagged content at a threshold. But what method could be used to stop abuse, e.g. "I don't like him, let's all report this and get it deleted"?
First of all, it depends on your content.
But in general, I would start by hiding or deleting the flagged content at a threshold.
When the community grows, I would add crowdsourcing and strike a balance between the two.
I would also run a general scan on all posts to search for keywords that might indicate bad content.
Also, you will need to build in some tolerance, as some posts might refer to illegal things but for good reasons.
Example: "don't take drugs".
If the community builds well, I would mostly rely on it.
Another option you might consider is to allow your users to "hide" other users, i.e. not see the content of hidden users.
This allows people to "remove" other users that they don't feel contribute to the community.
You could also allow users to report bad posts, and allow a human to decide whether or not to hide or delete the post. You would have to have community rules for this to be effective.
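The threshold and human-review ideas from these answers can be combined. The following is a minimal Python sketch under stated assumptions: the threshold value, the class and field names, and the rule that a moderator-cleared post is no longer auto-hidden are all hypothetical choices for illustration, not an established scheme.

```python
from dataclasses import dataclass, field
from typing import List, Set

HIDE_THRESHOLD = 3  # hypothetical value; tune it to the size of the community


@dataclass
class Post:
    post_id: int
    reporters: Set[str] = field(default_factory=set)
    hidden: bool = False
    cleared: bool = False  # set once a moderator has reviewed and restored it


@dataclass
class Moderation:
    threshold: int = HIDE_THRESHOLD
    review_queue: List[int] = field(default_factory=list)

    def flag(self, post: Post, user_id: str) -> bool:
        """Record a flag; hide and queue the post once enough distinct users flag it."""
        post.reporters.add(user_id)  # a set, so repeat flags from one user count once
        if post.cleared or post.hidden:
            return post.hidden
        if len(post.reporters) >= self.threshold:
            post.hidden = True  # hide rather than delete, so review can undo it
            self.review_queue.append(post.post_id)
        return post.hidden

    def review(self, post: Post, uphold: bool) -> None:
        """Apply a human moderator's decision to a queued post."""
        if post.post_id in self.review_queue:
            self.review_queue.remove(post.post_id)
        if not uphold:
            post.hidden = False
            post.cleared = True  # later flags no longer auto-hide this post
```

Hiding instead of deleting keeps a coordinated "let's all report this" campaign reversible, and the cleared marker stops the same group from hiding the post again after a moderator has restored it.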

What is the best Drupal Survey module [closed]

Closed 7 years ago.
We're after a replacement for a DotNetNuke installation with a DynamicForms module by DataSprings.
Currently the problems are mainly performance related, but the fact that DynamicForms uses Postbacks on ASP.Net all the time renders it also highly susceptible to slow server response time.
We're after a Drupal module which would allow us to present the CMS user with a control panel where they could:
- create new surveys
- assign a target group for the surveys
- manage the questions:
- checkbox/radiobutton/combobox/open questions
- variations of the above - e.g. a combobox with a text field when "other" was chosen
- the support for data lists, e.g. "what state do you live in" with values stored in the database and managed separately.
- conditional questions (show/hide) further questions when a certain option is chosen
- grouping questions (hiding sets of questions at a time)
- scrapbook function (storing frequent questions and being able to easily copy them into the new poll)
- exporting the poll data along with selected attributes from the user profile
As you can see the requirements are huge, and we're looking for an Open Source alternative to the current solution, which would allow us to extend the module if necessary.
Drupal would be the platform of choice, but we're flexible in that respect.
I'd appreciate your suggestions of alternatives.
There is a similar discussion going on at Drupal.org. IMHO, Drupal just isn't well suited to complex surveys. LimeSurvey is much better when it comes to different types of questions, conditional blocks, reusable question types, etc. However, in LimeSurvey 1 the admin interface is awkward and the theming/templating system is not great. LimeSurvey 2 looks very promising, but it's in beta.
Your best choice may be a new Drupal module that integrates the LimeSurvey software:
http://drupal.org/project/limesurvey_sync
Have a look at Webform

How to search inside PDF files [closed]

Closed 5 years ago.
I have to search inside PDF files for an upcoming ASP.NET MVC project in a shared hosting environment. What is the best solution? Is there a third-party product?
Lucene is a popular choice. See Lucene FAQ on searching pdfs.
Lucene is a good choice - for ASP.NET, using Lucene.NET is the best bet. Lucene is an indexing engine only, meaning you'll have to provide it with the text from the PDF. If you have access to the web server, you can install an IFilter for this (I recommend Foxit's PDF filter). Otherwise you'll have to get hold of some code to use on your website to parse and filter the PDF.
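To make the "indexing engine only" point concrete, here is a minimal Python sketch of an inverted index built over text that has already been extracted from the PDFs. It is not Lucene's API or its algorithm; the function names, the simple word tokenizer, and the all-terms-must-match query rule are assumptions for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, Set

TOKEN = re.compile(r"\w+")


def build_index(documents: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each lowercase word to the names of the documents containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for name, text in documents.items():
        for token in TOKEN.findall(text.lower()):
            index[token].add(name)
    return dict(index)


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the documents that contain every word in the query."""
    terms = TOKEN.findall(query.lower())
    if not terms:
        return set()
    # Copy the first posting set so the index itself is never mutated.
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results
```

The documents argument maps a file name to its extracted text; producing that text from the PDF is the separate step the answer describes.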
Docotic.Pdf library can help with such task.
The library could be used to extract text (with or without formatting). The extracted text can be used to create an index. You can even use String.IndexOf method if you just want to know if a PDF file contains a given text.
The library can also retrieve a collection of words with their bounding rectangles from PDFs. This might be useful if you need to know exact position of a text in a file.
Disclaimer: I work for the vendor of the library.
