How to scrape location from a list of websites

I have a list of URLs in a CSV file and I would like to scrape a location for each website. I am really new to scraping, so I do not know which tool or language would suit this best. Is there a way to do it? Any help would be appreciated.

Web scraping can be done in several ways; there are many tools online, and the choice also depends on which language suits you. I work in Python and suggest trying Beautiful Soup, Requests, and similar libraries. You also need to understand the DOM structure of each page you want to scrape, so you can find where the location appears and extract it accordingly.
You may like to see the Beautiful Soup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
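For instance, here is a minimal sketch of that workflow with Requests and Beautiful Soup. It assumes a file named urls.csv with one URL per row, and that the pages mark their location in an element you can identify; the "address" class below is a placeholder you would adapt per site after inspecting its DOM.

```python
# Minimal sketch: read URLs from a CSV and pull a location string from each page.
# Assumes urls.csv has one URL per row in the first column, and that each page
# exposes its location in an element you have identified in the DOM -- the
# "address" class used here is a placeholder you must adapt per site.
import csv

import requests
from bs4 import BeautifulSoup

with open("urls.csv", newline="") as f:
    urls = [row[0] for row in csv.reader(f) if row]

for url in urls:
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
    except requests.RequestException as exc:
        print(f"{url}: request failed ({exc})")
        continue
    soup = BeautifulSoup(resp.text, "html.parser")
    # Adapt this selector after inspecting the page's DOM structure.
    node = soup.find(class_="address")
    print(url, "->", node.get_text(strip=True) if node else "no location found")
```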

Related

Can't scrape a website which uses Java Server Faces (JSF)

I am trying to scrape data for my work from a website that uses JSF (JSF even appears in the URL, like https://xxxx/xxx/x.jsf).
I have tried a couple of scraping tools such as Parsehub and Octoparse, but I noticed that they reload the page before extracting data to a .csv file. After the reload, all the results are gone and I have to re-apply the filters to get the data I need from the website.
Is there a scraping tool that can handle this? I know I could probably do it in Java or Python, but my programming skills are not up to such a thing.
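For what it's worth, JSF pages typically carry their state in a hidden javax.faces.ViewState field, which is why a plain reload loses the filtered results: the tool never echoes that state back. A rough Python sketch of scripting such a page with a session follows; the form id and input field names ("searchForm", "searchForm:query") are placeholders that would have to be read off the real page.

```python
# Rough sketch of scripting a JSF page with requests: JSF keeps page state in a
# hidden javax.faces.ViewState field, so you must fetch the page, read that
# field, and echo it back in the POST that applies your filter. The form and
# field names below ("searchForm", "searchForm:query") are placeholders --
# inspect the real page to find yours.
import requests
from bs4 import BeautifulSoup

session = requests.Session()
page = session.get("https://xxxx/xxx/x.jsf")
soup = BeautifulSoup(page.text, "html.parser")
view_state = soup.find("input", {"name": "javax.faces.ViewState"})["value"]

resp = session.post(
    "https://xxxx/xxx/x.jsf",
    data={
        "searchForm": "searchForm",             # hypothetical form id
        "searchForm:query": "my filter value",  # hypothetical input field
        "javax.faces.ViewState": view_state,
    },
)
print(resp.status_code)
```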

Crawling a list of URLS for specific links and Javascript

I have little experience with crawling and need help:
I have a list of URLs and I want to find out whether a certain tool is used on those websites.
The tool works via an iframe that is loaded when a link with a specific URL is clicked.
So I am searching the websites for this link. The problem is that sometimes the link is in an anchor element, but sometimes it is inside a JavaScript function (an onclick on a button).
I can find the anchor elements (I tried different scraping frameworks like Scrapy), but how do I find the link when it is inside a function?
Is there an easier approach than looking for the <a> elements? E.g., downloading all the HTML and JavaScript and searching those files for the link (sketched below)? Unlike classic crawling, I do not want to extract structured data; I just want to know whether a specific link appears somewhere on the pages.
Thanks so much for any help or ideas!
Best
Martin
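The brute-force idea in the question, downloading the HTML plus the linked JavaScript and searching the raw text for the link, is straightforward to sketch in Python. The target URL below is a placeholder for the tool's link:

```python
# Sketch of the brute-force check: fetch a page's HTML plus its external
# scripts and search the raw text for the tool's URL, so the link is found
# whether it sits in an <a> href or inside an onclick/JavaScript function.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

TARGET = "https://tool.example.com/widget"  # placeholder for the link you seek

def page_uses_tool(url):
    html = requests.get(url, timeout=10).text
    if TARGET in html:            # covers <a> tags and inline scripts
        return True
    soup = BeautifulSoup(html, "html.parser")
    for script in soup.find_all("script", src=True):
        js = requests.get(urljoin(url, script["src"]), timeout=10).text
        if TARGET in js:          # covers externally loaded JavaScript
            return True
    return False

print(page_uses_tool("https://example.com"))
```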

Is it possible to see all publicly-accessible files on a website?

I would like to query a website that provides files for download and see all the files it makes available.
For example: the page https://download.website.com/path/to/file has a file of interest to me, but I would also like to see the other files that are publicly available on the system.
Essentially I would like to view a hierarchy of all the publicly facing files given some parent link. So if I know I want all files stored under https://download.website.com/path/, the query would return a recursive list of the available files from https://download.website.com/path/*.
Is this even possible for most websites? Would allowing this behavior be too compromising for web frameworks in general, so that it might not exist? Am I XYing out of control?
Any help here greatly appreciated.
This method isn't perfect, but you can try it: do a Google search for publicly available and indexed paths using search operators.
For example, to find all indexed pages on a website/domain:
site:download.website.com
To find all PDF files on the site:
site:download.website.com filetype:pdf
To find all links under the path download.website.com/wp-content/:
site:download.website.com inurl:/wp-content/
I hope it helps a little.
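Separately, if the server happens to have directory listing enabled (many deliberately do not), you can walk the auto-generated index pages directly. A hedged sketch in Python, assuming an Apache/nginx-style autoindex at the parent path:

```python
# Sketch: recursively walk a server-generated directory index (e.g. Apache
# autoindex) and list the files under a parent path. This only works when the
# server has directory listing enabled, which many sites deliberately disable.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def list_files(index_url, seen=None):
    seen = set() if seen is None else seen
    resp = requests.get(index_url, timeout=10)
    if resp.status_code != 200:
        return
    soup = BeautifulSoup(resp.text, "html.parser")
    for a in soup.find_all("a", href=True):
        href = urljoin(index_url, a["href"])
        if not href.startswith(index_url) or href in seen:
            continue  # skip parent-directory and external links
        seen.add(href)
        if href.endswith("/"):
            yield from list_files(href, seen)  # recurse into subdirectory
        else:
            yield href

for f in list_files("https://download.website.com/path/"):
    print(f)
```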

How To Extract Data From a Login Site

I'm trying to figure out how to take live data from one site and display it on mine, updating as it updates on the original site. My theme is sports gaming and my site is structured like ESPN; I would like to grab all the team standings and player stats.
Sorry if I am unclear.
So basically you want to scrape a website and display its data on yours, possibly in a better way.
I would recommend Kimono, a web scraping service that provides an API to get the data in a proper model.
Check it out; it should get the job done.
If not, you can write your own scraper in PHP (e.g., PHP Simple HTML DOM Parser) or JavaScript; there are libraries for JavaScript as well.
Hope it helps!
Happy coding!
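If you do write your own scraper, the usual pattern for a login-protected site is a session that carries the auth cookie across requests. Here is a minimal Python sketch of the idea; the login endpoint, form field names, and standings selector are all hypothetical and would have to be read off the real site:

```python
# Minimal sketch of scraping behind a login: a requests.Session keeps the
# auth cookie between the login POST and later page fetches. The login URL,
# form field names, and the "standings" selector are all placeholders --
# inspect the real site's login form and pages to find the actual values.
import requests
from bs4 import BeautifulSoup

session = requests.Session()
session.post(
    "https://sports.example.com/login",  # hypothetical login endpoint
    data={"username": "me", "password": "secret"},
)

resp = session.get("https://sports.example.com/standings")  # hypothetical page
soup = BeautifulSoup(resp.text, "html.parser")
for row in soup.select("table.standings tr"):  # placeholder selector
    print(row.get_text(" ", strip=True))
```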

How to hide myself while web scraping with Html Agility Pack

I am trying to scrape content from several pages of a site. I tried Html Agility Pack with C#, which works well for scraping HTML. I need to go through a number of pages while scraping. My question is: how can I hide the fact that I am a web scraper? I do not want the other side to find out that I am scraping their content. Please let me know if there is any way that can help. Looking forward to your responses.
Thanks
Use a Tor proxy:
Tor Project
You can reset the proxy after every page or after every site. Keep in mind that some sites look for certain patterns and can tell you're scraping them. With Html Agility Pack the web becomes one big data repository; just make sure you're not using someone else's data in a way that would get you in trouble.
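As a rough illustration of that rotation (in Python rather than C#, but the idea carries over), you can route requests through a local Tor SOCKS proxy and ask the control port for a new circuit between pages. This assumes Tor is running locally with SocksPort 9050 and ControlPort 9051, and uses the requests[socks] and stem packages:

```python
# Sketch: route scraping traffic through a local Tor proxy and request a new
# identity (exit circuit) between pages. Assumes Tor is running with SocksPort
# 9050 and ControlPort 9051 enabled; requires requests[socks] and stem.
import time

import requests
from stem import Signal
from stem.control import Controller

PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

def new_identity():
    # Ask the Tor control port for a fresh circuit.
    with Controller.from_port(port=9051) as controller:
        controller.authenticate()
        controller.signal(Signal.NEWNYM)
    time.sleep(5)  # give Tor a moment to build the new circuit

for url in ["https://example.com/page1", "https://example.com/page2"]:
    resp = requests.get(url, proxies=PROXIES, timeout=30)
    print(url, resp.status_code)
    new_identity()  # reset the proxy identity after every page
```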
