Mass Web Scraping Software/AI? - web-scraping

I'm looking to create, write, or use software to scrape about 140 different websites, with a grand total of about 3,500 links spread across those sites. What methods would you recommend for building a web scraper of this size? What I have now looks like this:
website -> website url
I basically just need to go to the specified URL and then grab the price of the item on screen. I would love to hear how the people of Stack Overflow would tackle this.
I've thought of using BeautifulSoup in Python, as it is the library I am most familiar with (probably with a small mix of Selenium for websites whose content does not load automatically). Other than these options, I'm not sure whether there are already better methods of achieving this.
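One way to keep 140 sites manageable is to separate the per-site knowledge (where the price lives in the page) from the shared code that walks the 3,500 URLs. Below is a minimal sketch of that idea using only the Python standard library, so it runs without extra installs; the same rule table would work just as well with BeautifulSoup selectors. The domains, the `PRICE_RULES` table, and all function names are hypothetical, and `parse_price` assumes prices written like `1,299.00`.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical per-site rules: domain -> (attribute name, attribute value)
# identifying the element that holds the price on that site.
PRICE_RULES = {
    "shop-a.example": ("class", "price"),
    "shop-b.example": ("id", "product-price"),
}

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link", "wbr"}


class PriceFinder(HTMLParser):
    """Collects the text of the first element whose attribute matches the rule."""

    def __init__(self, attr, value):
        super().__init__()
        self.attr, self.value = attr, value
        self.depth = 0      # > 0 while inside the matching element
        self.chunks = []
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif self.value in (dict(attrs).get(self.attr) or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def parse_price(text):
    """Return the first number in text as a float, or None if there is none."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    return float(match.group().replace(",", "")) if match else None


def extract_price(url, html):
    """Look up the rule for the URL's domain and pull the price out of html."""
    rule = PRICE_RULES.get(urlparse(url).netloc)
    if rule is None:
        return None
    finder = PriceFinder(*rule)
    finder.feed(html)
    return parse_price("".join(finder.chunks))


def group_by_domain(urls):
    """Group URLs by domain so each site can be crawled at its own pace."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlparse(url).netloc].append(url)
    return dict(groups)
```

Fetching the HTML is deliberately left out of the sketch; pages that build the price in the browser would still need something like Selenium to produce the `html` string.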

Related

Migrating from Squarespace

We have a Squarespace site (nataal.com) that has been steadily growing over the past 4 years. It now has in the region of 670 pages and is getting quite unwieldy, particularly when trying to scroll through the various page and link menus. According to its documentation, Squarespace allows a maximum of 1,000 pages but recommends fewer than 400. Creating a user index for the pages is also a problem; ours now runs to about 50 pages in its own right (12 entries per page, including thumbnails and captions). That's the way the creative people want it, and who am I to argue!
Has anybody had experience, good or bad, of migrating such a site to a better platform? I have some exposure to Drupal and I think it would have worked well had it been used from the start. I have also heard good things about Wagtail, but I've never seen it in operation. Or is there some other platform I should consider?
So, what I'm looking for is a CMS platform that can:
Easily handle more pages than Squarespace.
Migrate from Squarespace, keeping most if not all of the structure of each page.
Automate the construction of page indexes.
Make it fairly easy to tweak the layout of any given page to suit the topic.
Both Wagtail CMS and Drupal can support thousands of pages quite easily. In my opinion, Wagtail is MUCH easier to work with than Drupal - and a lot of websites that end up migrating to Wagtail have historically been from WordPress and Drupal (not all, but many!).
I can only give you info on ways to help guide your decision because ultimately the CMS you pick is your decision.
Drupal is a PHP-based CMS that typically uses Apache and MySQL. A few pros of using it are the popularity of the tech stack and easy deployment. The downside is that the code gets messy, unruly, and eventually very difficult to maintain due to the structure of PHP as a language (not in all cases, but in most cases this ends up happening).
Wagtail is a Python-based CMS that typically sits on top of a different database called Postgres, but this can be swapped out for any other database that Django supports (Postgres is well known as the "enterprise version" of open source databases). Wagtail also sits on a massively popular framework called Django, which has SO MANY great features (too many to list here), and among those great features is security. With a Django/Wagtail site you'll have to do more developer work. There isn't really a "plugin" system like in WordPress, but that also means extending the longevity of your codebase, and it's easier to maintain your code as it grows (due to the nature of Python, Django and then Wagtail).
I think the biggest downside to migrating such a large site is going to be moving all of your content over. In Wagtail you can structure all your page slugs to be exactly the same as on your Squarespace site, which is nice. But there's no "easy" solution for migrating that much data from Squarespace to another CMS. (Please do make the migration, though, even if it's painful, because it'll only get more painful as time goes on and your site gets bigger.)
Regardless of which CMS you end up choosing, any dynamic website can create index pages for you very quickly and easily.
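To make the index-page point concrete, here is a minimal, CMS-agnostic sketch of how an index can be generated from the page records rather than maintained by hand. The record fields (`title`, `slug`) and the function name `build_index` are assumptions for illustration; the page count and the 12 entries per index page come from the question.

```python
from math import ceil


def build_index(pages, per_page=12):
    """Split page records into index pages of per_page entries, sorted by title."""
    ordered = sorted(pages, key=lambda p: p["title"].lower())
    total = ceil(len(ordered) / per_page)
    return [
        {
            "number": n + 1,
            "of": total,
            "entries": ordered[n * per_page:(n + 1) * per_page],
        }
        for n in range(total)
    ]


# Hypothetical records standing in for whatever the CMS stores per page.
pages = [{"title": f"Story {i:03d}", "slug": f"/stories/{i}"} for i in range(670)]
index = build_index(pages)
print(len(index))  # 56 index pages for 670 entries at 12 per page
```

In Wagtail or Drupal the same idea is usually expressed as a query over child pages plus the framework's paginator, so the index updates itself whenever a page is published.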
With all that said, should you choose to take the Wagtail route, I have a full series on YouTube that can take you from "zero to hero" at wagtail.io/course. We also have a great community, and you can get support on the Wagtail Slack.
Good luck with the migration!
Wagtail can certainly support those requirements.
If you're in Oxfordshire, UK, you should come and see Torchbox (the creators of Wagtail) to talk about it!
You can also switch your Squarespace account to developer mode and work on the site with Angular; the downside is that you can't go back to the normal mode afterwards. I made www.rudagt.squarespace.com this way, and I will do this too because of the huge amount of content. That said, I have clients, and for them Squarespace is a better interface than many others.
Good luck!

Is it easier to scrape the AMP versions of webpages?

I'm working on a web scraper that aggregates newspaper articles. I know the AMP protocol mandates a stripped-down version of JavaScript, and I also know that JavaScript (in part) enables website administrators to detect and prevent scraping. So logically, I figured it would be easier to scrape AMP websites. However, on the other hand, if this is true, I presume Stack Overflow would be on top of it, but I haven't found a single thread confirming my inference. Am I correct, or am I overlooking something?
I would say that AMP pages are definitely easier to scrape because there is virtually no custom JS code. Many sites insert content with JS or AJAX. AMP limits the number of libraries you can use, so an AMP page has fewer of them than a regular site.
Furthermore, if you want to scrape content rendered by JavaScript, you can use Selenium. If not, PHP is the way to go (IMHO), or BeautifulSoup in Python.
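If you go the AMP route, the regular article page normally advertises its AMP version through a `<link rel="amphtml" href="...">` tag in the head, so a scraper can discover the AMP URL instead of guessing it. A minimal sketch with the standard library follows; the class and function names and the example URL are hypothetical, and fetching the page is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AmpLinkFinder(HTMLParser):
    """Records the href of the first <link rel="amphtml"> tag, if any."""

    def __init__(self):
        super().__init__()
        self.amp_href = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.amp_href is not None:
            return
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        if "amphtml" in rel and attrs.get("href"):
            self.amp_href = attrs["href"]


def find_amp_url(page_url, html):
    """Return the absolute AMP URL advertised by the page, or None."""
    finder = AmpLinkFinder()
    finder.feed(html)
    if finder.amp_href is None:
        return None
    # The href may be relative, so resolve it against the page's own URL.
    return urljoin(page_url, finder.amp_href)
```

Pages without the tag simply return `None`, in which case the scraper falls back to the regular version of the article.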
Happy scraping!

how to hide myself while web scraping by html-agility-pack

I am trying to scrape content from some pages of a site. I tried html-agility-pack with C#, which does a good job of scraping HTML. I need to go through a number of pages while scraping. My question is: how can I hide the fact that I am a web scraper? I do not want the other side to know that I am scraping their content. Please let me know if there is any way to do this. Looking forward to your responses.
Thanks
Use a Tor proxy:
Tor Project
You can reset the proxy after every page or after every site. Keep in mind that some sites look for certain patterns and can tell that you're scraping them. With html agility pack the web is one big data repository; just make sure you're not using someone else's data in a way that would get you in trouble.

Suggestions for deciding on a WCMS for a hockey website?

I need to make a website for my hockey club. My main purpose for this site is to allow people to sign in and post articles and training schedules in their section, e.g. Men's, Women's, Juniors and Masters. I want to have some kind of upload manager that will allow them to choose where they post the info to (e.g. Men's, Masters and Homepage).
This is the main functionality I'm looking for at the moment.
The club's previous website used Joomla, which I have hated. I found it to be way too restrictive. It's on an old version, so there are probably many improvements in the new version, but from what I've read it seems like it still has a lot of restrictions in how content is managed. I am open to trying it again, though.
I've used WordPress before and liked it, but that was on small-scale projects, and I'm not sure it really fits what I'll be trying to do here, since it mostly deals with blog posts and I'll need functionality to upload and display files.
I've had a look around at some other ones like Squarespace and Silverstripe. I'm really liking the simplicity of Silverstripe (one thing I hate about Joomla is the clutter on the opening page) and am leaning towards it right now, if I can find a nice way to have people post news to multiple pages at once.
If anyone has any suggestions, they'd be very welcome. I know HTML, CSS, JavaScript and a bit of PHP. I'm learning Ruby atm, so I wouldn't be against using it so I could learn more, but it might be a bit much for a sports website.
First off, it's nice to see someone else who likes hockey :) You can't use Squarespace; you'll need an Apache server for what you want. You will need some way to store information, so you'll need a MySQL database and probably some advanced knowledge of PHP (I'm assuming you don't know how to connect to databases and do some other functions). WordPress is too limited, so you can't use that. I have never used Silverstripe personally, but it seems like the best of your options here. You'll probably need some more knowledge of PHP before you attempt to make a members system.

Should Wordpress be used to create a real estate listing site?

I have a real estate agent client who wants a website to list the properties he's selling. Although there are great 3rd party web apps out there that do this, he adamantly demands that I recreate a simple and custom website for him.
I can do this quickly with a PHP framework like CodeIgniter, which comes with MVC, data access objects and data bind controllers. The database would be straightforward:
t_page: generic content pages
t_property: one row for each property on the market, with fields like address, price, number of bedrooms, etc.
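For illustration, here is a rough sketch of what the `t_property` table and an attribute search could look like, written in Python with SQLite rather than PHP so it can be run as-is. Only `address`, `price` and the bedroom count come from the description above; the other columns, the sample rows and the `search` helper are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE t_property (
        id INTEGER PRIMARY KEY,
        address TEXT NOT NULL,
        price INTEGER NOT NULL,
        bedrooms INTEGER NOT NULL,
        square_feet INTEGER,                           -- assumed column
        basement_finished INTEGER NOT NULL DEFAULT 0   -- assumed column
    )
    """
)
# Made-up sample listings, purely for demonstration.
conn.executemany(
    "INSERT INTO t_property (address, price, bedrooms, square_feet, basement_finished)"
    " VALUES (?, ?, ?, ?, ?)",
    [
        ("12 Oak St", 350000, 3, 1800, 1),
        ("48 Elm Ave", 275000, 2, 1200, 0),
        ("7 Pine Rd", 520000, 4, 2600, 1),
    ],
)


def search(conn, min_bedrooms=0, max_price=None, basement_finished=None):
    """Return addresses matching the given filters, cheapest first."""
    sql = "SELECT address FROM t_property WHERE bedrooms >= ?"
    params = [min_bedrooms]
    if max_price is not None:
        sql += " AND price <= ?"
        params.append(max_price)
    if basement_finished is not None:
        sql += " AND basement_finished = ?"
        params.append(int(basement_finished))
    sql += " ORDER BY price"
    return [row[0] for row in conn.execute(sql, params)]


print(search(conn, min_bedrooms=3, max_price=400000))  # ['12 Oak St']
```

Because every searchable attribute is a real column, each filter is a plain `WHERE` clause with a bound parameter, which is the kind of query a framework's data access layer generates.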
However, the client has heard many great things about WordPress, and strongly advises that I build his real estate site with it. I've only used WordPress to create blogs and relatively straightforward websites, so I don't know how effective it is as a real estate property content management system, or how well it lets users search for properties based on attributes such as number of bedrooms, square footage, whether the basement is finished, etc.
So my question is: is it a good idea to build a real estate agent website with WordPress? Or should I try harder to convince him to build it with a web framework like CodeIgniter?
Rather than argue with your client about the future platform or CMS, or listen to people for or opposed to WP out of principle, sit down with your client and map out exactly what he/she wants to do in terms of the site. How do they want to add material or blog posts? How easy should it be? How do they want users to be able to search: by price range, location, etc.? Get them to show you on other sites how they want things to work.
Then look at the capabilities of various CMSs, frameworks and the like. Investigate search and MLS plugins, property XML feeds, maps. Determine what other real estate sites use (esp. his/her competitors).
Then explain your decision with evidence as to what they want to do compared to what's possible with different systems. They may talk themselves in or out of systems without your help.
It's called working with a client so they get what they want in terms of usability and end-user functions, not imposing what you want on their project. Sure, you know what you are talking about in terms of getting things to work, but they don't care; they want it to work in a certain way: their way.
(And see what's already out there in terms of Real Estate WordPress Plugins and WordPress Real Estate Themes).
I've developed several real estate sites using Joomla and openRealty, and I have tried to create a decent real estate site for my wife using WordPress due to its ease of use for end-users, but unfortunately programming a real estate site in WordPress is tricky. It's a blogging engine and not terribly good at "directory" based information. So I find that the ease of use goes out the window as you try to hack together real estate functionality. Then you are asking your end-users to create custom fields, etc., and it becomes a pain, and you end up spending too much time managing your end-users.
I love WP. But a directory-style site is not its highest and best use.
If the client is so adamant that you use WP for his site, then let him do it. Then wait till he comes crawling back to you when he can't get it to do what he wants, and you can build it properly in CI.
You wouldn't tell a plumber to fix your toilet with a socket set...
Check out ExpressionEngine, it's perfect for this as you can create custom fields (# bedrooms, square footage etc.) and retrieve content by any of these custom fields using the {exp:channel:entries} tag.
So basically you'd create a channel for these listings and then use "custom fields" for the data about each of these listings (specified by the needs of your client).
If you need design for this site "City Guide" from WooThemes will be available for EE as of tonight ;-)
And since you mention CodeIgniter - EE 2.0 is built on CI and if you need some custom functionality it's all CI so that should feel like home.
WordPress custom post types would work well for this sort of site. A custom page template and a modified WP_Query would provide the basis of the site.
As mentioned by everyone else, WP probably isn't the absolute best tool for the job, but it would not be a bad choice. I've done weirder things with it.
Old question but still relevant. My opinion is that WordPress is not a good option for creating real estate listing sites. The main reason is that it is designed primarily as a blogging engine so it requires a lot of work to set up and is susceptible to getting hacked. More detailed explanation here:
https://smallbusinessforum.co/why-an-alternative-to-wordpress-is-needed-for-real-estate-websites-ff82de096d93#.j2cduk4xs
I think that using Wordpress is a plus, not because it is the best program to use, but if you make the site properly, and he wants to add/change something, you (and many other people out there) can mold it to his needs.
There are a lot of plugins you could extract some PHP code from to make a good listing. You also have the option of using post_types (which are saved as posts), custom fields (all of which are saved in one table but indexed), or creating your own tables (adding tables function or using a plugin like PODS).
I think you will save time on coding if you go with WordPress, and customization is pretty okay (not anywhere near decent, but I am pretty sure this site will be the next craigslist). WordPress is the 1995 Toyota Tercel of CMSs: it won't be great, but it gets the job done, and almost everyone has worked on it at some point in their life.
If the money is good, then try to wow him with a CI demo. But with WP, you could probably accomplish your task in a few hours. There are ways to set up CI around WordPress, but that is beyond me.

Resources