How to hide myself while web scraping with Html Agility Pack - asp.net

I am trying to scrape content from some pages of a site. I tried Html Agility Pack with C#, which works well for scraping HTML. I need to go through a number of pages while scraping. Now my question is: how can I hide myself as a web scraper? I do not want the other side to find out that I am scraping their content. Please let me know if there is any way to do this. Looking forward to your responses.
Thanks

Use a Tor proxy:
Tor Project
You can reset the proxy after every page or after every site. Keep in mind that some sites look for certain request patterns and can tell that you are scraping them. With Html Agility Pack the web is one big data repository; just make sure you are not using someone else's data in a way that would get you into trouble.
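
For illustration, here is a minimal C# sketch of that idea: route the requests through Tor's local SOCKS proxy, send a browser-like User-Agent, and pause between pages. It assumes .NET 6 or later (which added SOCKS5 support to HttpClient), Tor listening on its default port 9050, and placeholder URL/XPath values; treat it as a sketch of the approach rather than a drop-in solution.

    using System;
    using System.Net;
    using System.Net.Http;
    using System.Threading.Tasks;
    using HtmlAgilityPack;

    class TorScraper
    {
        static async Task Main()
        {
            var handler = new HttpClientHandler
            {
                // Tor exposes a SOCKS5 proxy on 127.0.0.1:9050 by default.
                // SOCKS5 proxies need .NET 6+; older runtimes would need an
                // HTTP front-end such as Privoxy instead.
                Proxy = new WebProxy("socks5://127.0.0.1:9050"),
                UseProxy = true
            };

            using var client = new HttpClient(handler);
            // A browser-like User-Agent so the requests do not advertise
            // themselves as coming from a bot.
            client.DefaultRequestHeaders.UserAgent.ParseAdd(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");

            // Placeholder URL: replace with the pages you actually need.
            string html = await client.GetStringAsync("https://example.com/page1");

            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            var headings = doc.DocumentNode.SelectNodes("//h2");
            if (headings != null)
                foreach (var node in headings)
                    Console.WriteLine(node.InnerText.Trim());

            // Pause between pages so the request pattern looks less automated.
            await Task.Delay(TimeSpan.FromSeconds(5));
        }
    }

Note that rotating the exit IP ("resetting the proxy") happens in Tor itself, not in Html Agility Pack: you send the NEWNYM signal on Tor's control port or restart the Tor service between batches of pages.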

Related

Can't scrape a website which uses Java Server Faces (JSF)

I am trying to scrape data, for my work, from a website that uses JSF (JSF also appears in the URL, like https://xxxx/xxx/x.jsf).
I have tried a couple of scraping tools like Parsehub & Octoparse, but I noticed that they reload the page in order to extract data to a .csv file. The problem is that after the reload all the results are gone and I have to re-filter the data I need on the website.
Is there a scraping tool that can help me with that? I know I could probably do it with Java or Python, but my programming skills are not enough for such a thing.

How to scrape locations from a list of websites

I have a list of URLs in a CSV file and I would like to scrape the location from each website. I am really new to scraping, so I do not know which tool or language is best. Is there a way to do this? Any help would be appreciated.
Web scraping can be done in several ways. There are many tools online, and it also depends on which language suits you. I have worked with Python and can suggest trying Beautiful Soup, Requests and other libraries. You also need to understand the DOM structure of the webpage you want to scrape.
You may like to see the Beautiful Soup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Note that to find where the location appears in a webpage, you need to inspect its DOM structure and extract the location data accordingly.
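
If you would rather stay in the C#/Html Agility Pack stack from the main question instead of Python, the same loop-over-a-CSV-of-URLs idea looks roughly like the sketch below; the file name urls.csv, its one-URL-per-line layout, and the XPath selector are assumptions you would adapt after inspecting each page's DOM.

    using System;
    using System.IO;
    using System.Linq;
    using HtmlAgilityPack;

    class LocationScraper
    {
        static void Main()
        {
            var web = new HtmlWeb();

            // Assumed input: urls.csv with one URL per line.
            foreach (string url in File.ReadLines("urls.csv")
                                       .Where(l => !string.IsNullOrWhiteSpace(l)))
            {
                HtmlDocument doc = web.Load(url.Trim());

                // Hypothetical XPath: inspect each page's DOM to find where the
                // location actually appears and adjust the selector to match.
                var node = doc.DocumentNode.SelectSingleNode("//span[@class='location']");
                Console.WriteLine($"{url} -> {(node?.InnerText.Trim() ?? "not found")}");
            }
        }
    }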

How To Extract Data From a Login Site

I'm trying to figure out how to take live data from one site and have it displayed on my site, in such a way that the data updates on my site as it updates on the original site. My theme is sports gaming and my site is structured like ESPN. I would like to grab all the team standings and player stats.
Sorry if I am unclear.
So basically you want to scrape a website and display its data on yours, possibly in a better way.
I would recommend KIMONO. It is a web scraping service that provides you with an API to get the data back as a proper model.
Check it out; it should get the job done.
If not, you can create your own scraper in PHP (for example with PHP Simple HTML DOM Parser) or in JavaScript, which also has scraping libraries.
Hope it helps!
Happy coding!

Techniques for making a site easily copyable to MS Word

This is kind of an odd question and I didn't know where to post it, but here it is.
I have an ASP.NET website used by internal company employees. The site pages are pretty basic and have various tables, divs, CSS and some sprinkles of JavaScript/jQuery.
Some of the site pages are often used for presentations. And sometimes the users need to copy the content offline.
I got a request that when copying certain pages from IE/Firefox into Word/PowerPoint, the layout does not carry over correctly. I know why this is a problem, but the users don't, and they are asking us to make it work.
I'm assuming that the easiest way to do this is to have a "printable" view. But as some of these pages are still being developed, are there some techniques we could follow that would make these pages reasonably copyable to Word/PowerPoint?
There are online guides to doing this, like this one.

Using the right Web Scraper

I need to make a web scraper that takes an input address from the client and then retrieves data for that address from a specific site. I downloaded Webharvest; is that the right thing to begin with to learn how to write such a program?
Also, if possible, can someone direct me to a good tutorial on how to do it?
Here is a good web-scraper comparison table. It may help you to choose the right scraper.
