I'm creating a web scraper with golang and I just wanted to ask some questions about how most of them work. For example, how does Googlebot avoid using a lot of bandwidth when crawling? You have to visit each URL to get its data, and that can be thousands of URLs, so not only does that take bandwidth, it also takes a lot of time. I'm building a web scraper, I'm running into these issues, and I wanted to ask what the best way of fixing them is. https://github.com/hackermondev/cosmic
Related
I have a hobby of reading news. The problem is, there are quite a lot of websites I often go to, and this gave me an idea: building my own database of news. The idea is similar to newspaper clippings. For example, if I read something interesting in German economics news, I could use this software to save all the text and images from that site (onto my computer), and I could add tags such as "Germany" and "econ" so I can find it and read it later. I shared this idea with my friend and he said web scraping is not easy, because not every site allows you to do that. So my question is, how should I begin? I study computer engineering, so I have some programming understanding, but obviously not enough. Any clues or shared experience (with the web scraping and tagging) would be helpful, thank you!
Python has a couple of good web-scraping tools that work well: Beautiful Soup 4, Scrapy, Selenium, and requests, to name a few.
Before web scraping, I would recommend learning the basics of Python and how the web works.
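To give a feel for how those libraries fit together, here is a minimal sketch using requests and Beautiful Soup 4. The URL, the User-Agent string, and the assumption that the title sits in an <h1> are placeholders for whatever site you actually want to save.

```python
# A minimal scraping sketch with requests + Beautiful Soup 4.
# The URL and the element choices are placeholders; adjust them to the real page.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/some-article"  # hypothetical article URL
resp = requests.get(url, headers={"User-Agent": "my-news-clipper/0.1"}, timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
title = soup.find("h1")  # assumes the article title sits in an <h1>
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]

print(title.get_text(strip=True) if title else "(no title found)")
print("\n".join(paragraphs[:5]))  # first few paragraphs as a sanity check
```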
Note that most websites disregard it if you scrape them. It is hard for them to track you doing it, and if you only download a few specific sites it shouldn't be something they complain about, as it is not much more than pressing Ctrl+C and downloading the whole site as HTML.
Don't share it and don't spam requests; be a fair player. If you want to be on the safe side, check the website's TOS.
I am trying to learn about web scraping tools.
If anyone can help me get started, some tutorial links would help.
When should one go for web scraping?
What are the benefits over an RSS feed?
What are the best tools available for web scraping?
Thanks!
To keep the essentials short: "That depends on what you're trying to achieve."
If you have an RSS-feed available with all the information that you need, you don't need to scrape a web page.
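If you're not sure what a feed actually contains, a few lines with the feedparser package will show you. feedparser is just one common choice (an assumption on my part), and the feed URL below is a placeholder.

```python
# Quick check of what an RSS feed already gives you, using feedparser.
import feedparser

feed = feedparser.parse("https://example.com/rss.xml")  # hypothetical feed URL
for entry in feed.entries[:10]:
    print(entry.get("title", "(no title)"))
    print(entry.get("link", ""))
    print(entry.get("summary", "(no summary)"))
    print("-" * 40)
```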
If you're trying to extract data from a website that does not provide an API to access the data directly, you can use scraping to extract the information that you want from the page in a structured way. You can save the data into a database and work from there.
For example: in the early Web 2.0 days, there were sites that scraped all the other flight-search pages to extract the cheapest flight for a given origin and destination.
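As a rough illustration of that scrape-then-store pattern (and of the tagging idea from the news-clipping question above), here is a small sketch that saves a page's text into SQLite with some tags. The table layout, the URL, and the tag values are assumptions for the example, not a recommended schema.

```python
# Sketch of "scrape, then store in a database": save a page's text with tags
# into SQLite so it can be searched later. Layout and extraction are illustrative.
import sqlite3

import requests
from bs4 import BeautifulSoup

conn = sqlite3.connect("clippings.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS clippings (
        id    INTEGER PRIMARY KEY,
        url   TEXT UNIQUE,
        title TEXT,
        body  TEXT,
        tags  TEXT   -- comma-separated tags, kept deliberately simple
    )
""")

def save_clipping(url, tags):
    """Download a page, pull out its paragraph text and store it with tags."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else url
    body = " ".join(p.get_text(strip=True) for p in soup.find_all("p"))
    conn.execute(
        "INSERT OR REPLACE INTO clippings (url, title, body, tags) VALUES (?, ?, ?, ?)",
        (url, title, body, ",".join(tags)),
    )
    conn.commit()

# Hypothetical usage: tag an article so it can be found again later.
save_clipping("https://example.com/germany-economy", ["Germany", "econ"])
for row in conn.execute("SELECT title, tags FROM clippings WHERE tags LIKE ?", ("%econ%",)):
    print(row)
```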
How would I go about gathering data from Instagram for a web scraping project? In particular, I am interested in getting the caption as well as the number of likes.
1. Check if there is an API; that makes things 100x easier.
2. Check the terms of use to make sure you are using it legally.
3. Write the web scraper / API scraper (see the sketch after this list).
4. Run it.
5. Process the data.
6. Be disappointed that it was not very interesting data.
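For step 3, here is a deliberately generic sketch of what an API scraper for a caption and like count tends to look like. The endpoint, the auth scheme, and the JSON field names are all invented placeholders; whatever official API you are allowed to use will define its own, so check its documentation and terms first.

```python
# Generic API-scraper sketch; endpoint and field names are made up for illustration.
import requests

API_URL = "https://api.example.com/posts/{post_id}"  # hypothetical endpoint

def fetch_post(post_id, token):
    resp = requests.get(
        API_URL.format(post_id=post_id),
        headers={"Authorization": f"Bearer {token}"},  # most real APIs require some auth
        timeout=10,
    )
    resp.raise_for_status()
    data = resp.json()
    # "caption" and "like_count" are assumed field names for this sketch.
    return {"caption": data.get("caption"), "likes": data.get("like_count")}

if __name__ == "__main__":
    post = fetch_post("some-post-id", token="YOUR_TOKEN")
    print(post["caption"], post["likes"])
```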
My last project is a medium-size ASP.NET Web Forms application. It is built using:
ASP.NET 3.5
LINQ to SQL (DBML) --> SQL Server database (9 tables)
Ext.NET 1.6 (www.ext.net)
StructureMap 2.5.3.0
This time I believed I did my best in terms of architectural design, code, and data transfer optimizations. I followed all the advice I could find to work with the database efficiently through LINQ to SQL, and I built layers (model, repository, service, presentation) to separate concerns and keep the code in the .aspx code-behind files lightweight.
The problem is: I've deployed the application on various web hosting servers with the same pitiful result: the application is struggling to work... pages load as if in slow motion...
In the past I would say, "OK, I didn't do all I could to speed things up," but in this case I really tried to apply the best practices...
Is there anything else I can do about it? Or is ASP.NET only suitable for really small projects?
Thank you.
ASP.NET is fine for building large scale websites. As Brad mentioned, StackExchange sites are built using it, and StackOverflow is a very busy site indeed.
What you need to do first is measure performance; until you do that, you're just guessing at where the problem areas are.
So start with the browser: use a tool such as Firebug, YSlow, or the Google Chrome dev tools, whatever takes your fancy, and run your site with the tool enabled. These tools can show you how long things take to process, e.g. requests, and how long content takes to download.
YSlow will also give you tips on anything it finds to be a bit slow, e.g. you're making too many HTTP requests, or you should consider minifying your CSS/JS files. You will get a general overview of how the site is performing and where the problems could be.
To dig a bit deeper, use a tool like RedGate's ANTS Profiler; use the trial version and measure your website, and your server-side code, with it. There are other tools, though I'm not aware of any free ones.
My first question is: when is it slow? Did you try your project on a local area network? Please check there first. If it is slow there too, then you need to improve things a bit.
Slow performance can depend on many things, such as loading large amounts of data, too much logic on one page, etc.
Please let me know.
Thanks
Basit.
Hi, I am currently designing a website for a client. The site will be written in ASP.NET with a CMS built in. My client has come back saying he wants to play MP4s on the site, plus be able to embed some other videos from YouTube, Vimeo, etc. in his blog. I have managed to convince my client that playing .flv would be better for obvious reasons (which he has agreed is OK), but when I went back to my coder, he said that because it is a dynamic site it will take 2 days to get this working (in terms of creating the mechanics to allow my client to upload his movies, etc.).
Is this correct? My client is under the impression that it should be a simple thing to do, while my coder tells me that it's not that simple.
I am in the middle of all of this - can you help, please?
At the end of the day, only the coder you are using knows exactly how much effort is required here. You have to trust them. This is almost certainly not trivial. Make sure you and the coder understand exactly what's being asked for here and that neither of you is assuming anything about how the client expects it to work.
Is your client a programmer? Non-programmers should never dictate how long a programming task should take.
If you're cowboy coding without testing, "today" would probably suffice, but any sane and professional development shop would never let this happen.
Now let's clarify what your client really told you to do:
Your dev seems to be assuming that he has to support adding/uploading videos from your CMS.
If your dev is going to use a 3rd-party API like YouTube, 2 days sounds reasonable. If you're going to serve the videos on your own site, it'd take at least a week's worth of programming to make sure your site can take such a heavy load of streaming data -- it's stupid, not to mention highly irresponsible, to assume it could be worked out in a day.
Now, if your client is only really talking about embedding videos in blog entries or articles, that's a very trivial task: YouTube, Vimeo and other video-sharing sites already supply the HTML embed code needed to display a video on a page. In fact, that's a zero-effort task, assuming that your blog entry editor properly parses the embed code or has an Edit HTML feature.
So, which one is which?
This might be a good occasion to use the <video> tags. It might simplify things at the cost of only supporting users with recent browsers.
Two days is a quite optimistic estimate for all that you've mentioned. Maybe for embedding YouTube videos only, but for upload/storage/streaming of videos on the local server it's a different thing entirely.
But if you don't understand programming yourself, then you have to trust the expert you've hired to do the job for you, and you have to tell the client that that is how long it will take. The fact is that these things aren't trivial to write: there's the front-end website management interface that needs creating, and the back-end server software that manages what to do with the uploaded file. Never mind integration and making sure it's easy for the client to run the workflow of uploading a file, incorporating that video into some content in the CMS, and so on.
I just recently did this; you need to get VideoLAN: http://www.videolan.org/
It streams almost anything; once you set up a streaming site, it's easy!