Question about web scraping (as a beginner) - web-scraping

I have a hobby of reading news. The problem is, there are quite a lot of websites I often go to, and this gives me an idea: building my own database of news. The idea is similar to newspaper clippings. For example, if I read something interesting in the German economic news, I can use this software to save all the text and images from that site (onto my computer), and I can add tags such as "Germany" and "econ" so I can find it and read it later. I shared this idea with my friend and he said web scraping is not easy because not every site allows you to do it. So my question is, how should I begin? I study computer engineering, so I have some programming understanding, but obviously not enough. Any clues or shared experience (with the web scraping and tagging) would be helpful, thank you!

Python has several good web-scraping tools that work well: Beautiful Soup 4, Scrapy, Selenium, and requests, to name a few.
Before web scraping I would recommend learning the basics of Python and how the web works.
Note that most websites don't mind if you scrape them on a small scale. It is hard for them to track you doing it, and if you only download a few specific pages it shouldn't be something they complain about - it is not much more than pressing CTRL+C or saving the page as HTML yourself.
Don't share what you scrape and don't spam requests - be a fair player. If you want to be on the safe side, check the website's Terms of Service.
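To make that concrete, here is a minimal sketch of what saving a single "clipping" could look like with requests and Beautiful Soup 4 (both installable with pip). The URL, tag names, and output file are made up for illustration; real sites will need extra handling for images, relative links, and pages rendered with JavaScript (which is where Selenium comes in).

    # Minimal sketch: fetch one article, parse it, and save the text
    # plus a couple of tags as JSON on disk. URL/tags/paths are illustrative.
    import json
    import pathlib

    import requests
    from bs4 import BeautifulSoup

    url = "https://example.com/some-news-article"  # hypothetical article URL
    response = requests.get(url, headers={"User-Agent": "my-news-clipper"}, timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else url
    text = "\n".join(p.get_text(strip=True) for p in soup.find_all("p"))
    image_urls = [img["src"] for img in soup.find_all("img") if img.get("src")]

    clipping = {"url": url, "title": title, "text": text,
                "images": image_urls, "tags": ["Germany", "econ"]}

    out_dir = pathlib.Path("clippings")
    out_dir.mkdir(exist_ok=True)
    (out_dir / "germany-econ-001.json").write_text(json.dumps(clipping, indent=2))

Searching the saved clippings by tag is then just a matter of reading the JSON files back and filtering on the "tags" list.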

Related

How to collect contact information from websites?

Does anyone know a web crawler tool for collecting contact details from a website? Say I have www.website/contact.. I want to pull out the address, phone number, etc. There are 2 tools I've been looking at: crawler4j, an open-source jar for Java, and Scrapy, open source in Python. But I am finding them a bit hard to use for my scenario.
Any suggestions would be great. Thanks
You might google for "simple web crawler" to find a solution that fits you best. On the net there are plenty of "pure Python" web crawlers. Starting from the skeleton code, you add the database wrap-up yourself. I think the biggest problem would be setting up the database and saving the data into it.
What if there are 1,000,000s of websites to crawl... Is there a way to crawl all the websites in my area?
No problem for scripting. Just put the millions of addresses in a file (or files) and open it for reading in Python or another scripting language. Then take the links from it one by one and crawl/scrape to your heart's content. You might also want to save the results to a file (CSV, JSON).
I'd also recommend a ready-made simple Python crawler.
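As a rough illustration of that file-driven approach (the file names, the one-second delay, and the fields saved are assumptions, not a recommendation of a specific tool), here is a sketch that reads URLs line by line from a text file, fetches each page with requests, and appends the page title to a CSV:

    # Minimal sketch of the "addresses in a file" approach described above.
    import csv
    import time

    import requests
    from bs4 import BeautifulSoup

    with open("urls.txt") as url_file, open("results.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["url", "title"])
        for line in url_file:
            url = line.strip()
            if not url:
                continue
            try:
                resp = requests.get(url, timeout=10)
                resp.raise_for_status()
            except requests.RequestException as exc:
                print(f"skipping {url}: {exc}")
                continue
            soup = BeautifulSoup(resp.text, "html.parser")
            title = soup.title.get_text(strip=True) if soup.title else ""
            writer.writerow([url, title])
            time.sleep(1)  # be a fair player: at most one request per second

For millions of URLs you would want to add retries, deduplication, and probably parallelism, which is exactly where a ready-made crawler starts to pay off.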

Does any open, simply extensible web crawler exist?

I am searching for a web crawler solution which is mature enough and can be simply extended. I am interested in the following features... or in the possibility of extending the crawler to meet them:
partly just to read the feeds of several sites
to scrape the content of these sites
if the site has an archive I would like to crawl and index it as well
the crawler should be capable of exploring part of the Web for me, and it should be able to decide which sites match the given criteria
it should be able to notify me if it finds things possibly matching my interests
the crawler should not kill servers by hitting them with too many requests; it should crawl smartly
the crawler should be robust against freak sites and servers
The things above can be done one by one without any big effort, but I am interested in a solution which provides a customisable, extensible crawler. I have heard of Apache Nutch, but I am very unsure about the project so far. Do you have experience with it? Can you recommend alternatives?
I've used Nutch extensively when I was building the open source project index for my Krugle startup. It's hard to customize, as it has a fairly monolithic design. There is a plug-in architecture, but the interaction between plug-ins and the system is tricky and fragile.
As a result of that experience, and needing something with more flexibility, I started the Bixo project - a web mining toolkit. http://openbixo.org.
Whether it's right for you depends on the weighting of factors such as:
How much flexibility you need (+)
How mature it should be (-)
Whether you need the ability to scale (+)
If you're comfortable with Java/Hadoop (+)
A quick search on GitHub threw up Anemone, a web spider framework which seems to fit your requirements - particularly extensibility. Written in Ruby.
Hope it goes well!
I heartily recommend Heritrix. It is VERY flexible, and I'd argue it is the most battle-tested freely available open source crawler, as it's the one the Internet Archive uses.
You should be able to find something that fits your needs here.

iWeb and MobileMe for a group content management system versus open source CMS

I work at a non profit and we are looking for a web solution to do the following:
External facing web site
Internal posting board for news, updates, pictures
Entitlements around user content
One of the folks at the non-profit is a Mac person and suggests using iWeb and MobileMe for this functionality. I have no experience with these tools, but it seems like the following are more appropriate:
TikiWiki: http://info.tikiwiki.org/tiki-index.php
Drupal: http://drupal.org/
Joomla: http://www.joomla.org/about-joomla.html
I am a Windows .NET guy, so I would also prefer an ASP.NET solution here, but I want to avoid getting religious about it, as any solution that does the job should be fine.
My question is: is there anything to be concerned about with using the iWeb and MobileMe solutions, or any brick walls we are going to run into?
Also, are there PC-based solutions that will allow you to use these tools, or does everyone need a Mac?
This is only a partial answer to a multi-part question, but:
Drupal and Joomla are platform-independent. The software itself runs on PHP (presumably on a server, rather than a workstation), but you interact with the systems via a web interface. Drupal in particular lets you choose from many different editing options via its WYSIWYG module.
Personally, I think Drupal is an outstanding choice for nonprofit orgs (this being my own background) that have tech-skilled staff, and Joomla is an outstanding fit for nonprofits that don't have much in-house web expertise.
As for iWeb and MobileMe:
Compare them to Adobe Contribute. They're good software for what they do, but building organizational websites is not what they do.
What you've got is basically a souped-up MS Word that writes W3C-compliant HTML. Things like members-only content, interactivity, etc. are going to be pretty difficult to manage, and you'll be looking for another solution soon anyway if your site grows larger than a few dozen pages.
In short - avoid iWeb and MobileMe for this type of implementation. You may have a "Mac person" in the office (for now), but these products are designed more for individual/home use and not businesses/organizations. Eventually you'll run into any number of "brick walls".
A few other options (amongst many), if you don't have a web designer on staff and want a hosted solution, would be to look at WordPress or Squarespace.
Thanks

How to utilize my learning power in ASP.NET studies?

I've recently tried to switch to ASP.NET. Did I write switch? I meant to learn it; however, I am not really sure how to proceed. I've opened several videos - and really watched them with enthusiasm - however they seem to be very general. It's not like PHP, where there are tons of learning sources.
Do you know some great learning procedure including the websites and sources to learn from so I can learn it ASAP?
I have one project waiting here -> the website is a kinda simple online Flash games site. The graphics and HTML are finished, but I want to try to do it in ASP.NET with MS SQL. I'm already experienced in C#, thus I won't need a lot of insight into that, although I'm absolutely unaware of how to build the website itself, cute URLs, what the basic principles in coding are, etc. etc. :)
Since you have a PHP background, I'd recommend that you try out ASP.NET MVC - if you are familiar with the MVC design pattern, it should be a rather painless 'switch'. The "Learn ASP.NET MVC" section is very nice. There's also an RSS feed (on the site above) that contains many great blog posts regarding the technology; furthermore, there's the NerdDinner sample website with a complete tutorial. If you follow the last one, you should be ready with the site in no time :)
I have found these Microsoft videos to be very useful as study material.

Playing video on a dynamic website

Hi, I am currently designing a website for a client - the site will be written in ASP.NET with a CMS built in. My client has come back saying he wants to play MP4s on the site, plus be able to embed some other videos from YouTube, Vimeo, etc. in his blog. I have managed to convince my client that playing .flv would be better for obvious reasons (which he has agreed is OK), but when I went back to my coder, he said that because it's a dynamic site it will take 2 days to get this working (in terms of creating the mechanics to allow my client to upload his movies, etc.).
Is this correct? My client is under the impression that it should be a simple thing to do, while my coder tells me that it's not that simple.
I am in the middle of all of this - can you help please!!!!
At the end of the day, only the coder you are using knows exactly how much effort is required here. You have to trust them. This is almost certainly not trivial. Make sure you and the coder understand exactly what's being asked for here, and that neither of you is assuming anything about how the client expects it to work.
Is your client a programmer? Non-programmers should never dictate how long a programming task should take.
If you're cowboy coding without testing, "today" would probably suffice, but any sane and professional development shop would never let this happen.
Now let's clarify what your client really told you to do:
Your dev seems to be assuming that he has to support adding/uploading videos from your CMS.
If your dev is going to use a 3rd-party API like YouTube, 2 days sounds reasonable. If you're going to serve the videos on your own site, it'd take at least a week's worth of programming to make sure your site can take such a heavy load of streaming data -- it's stupid, not to mention highly irresponsible, to assume it could be worked out in a day.
Now, if your client is only really talking about embedding videos in blog entries or articles, that's a very trivial task: YouTube, Vimeo, and other video sharing sites already supply the HTML embed code that's needed to display a video on a page. In fact it's a zero-effort task, assuming your blog entry editor properly parses the embed code or has an Edit HTML feature.
So, which one is which?
This might be a good occasion to use the <video> tag. It might simplify things, at the cost of only supporting users with recent browsers.
Two days is quite an optimistic estimate for all that you've mentioned. Maybe for embedding YouTube videos only, but uploading, storing, and streaming videos on the local server is a different thing entirely.
But if you don't understand programming yourself, then you have to trust the expert you've hired to do the job for you, and you have to tell the client that that is how long it will take. The fact is that these things aren't trivial to write: there's the front-end website management interface that needs creating, and the back-end server software that manages what to do with the uploaded file. Never mind integration, and making sure it's easy for the client to run a workflow of uploading a file, incorporating that video inside some content in the CMS, and so on.
I just recently did this; you need to get VideoLAN: http://www.videolan.org/
It streams almost anything; after you set up a streaming site, it's easy!