I've been using the Scraper extension to scrape a website called Flippa and find websites for sale. For example, I'll go to this page with several websites on it and find all the domains for sale:
https://flippa.com/search?sort_alias=most_recent&filter%5Bproperty_type%5D=website,established_website,starter_site&filter%5Bsitetype%5D=content,blog,directory,review,forum-community
I've been using the following XPath code to gather the domains (e.g. blasterpiece.com), but it no longer works:
//div[1]/div[2]/div[1]/a[2]/text()
Any idea what I need to tweak? I'm new to scraping, so I'm pretty stuck.
Thanks!
This XPath should work (note @class, not #class, which is invalid XPath syntax): //a[contains(@class, "GTM-search-result-card ng-binding")]/text()
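If you want to sanity-check an XPath outside the extension, the same expression can be tried from R with rvest. A sketch, not tested against the live site, and with one caveat: the ng-binding class suggests the listings are rendered client-side by Angular, in which case a plain read_html will not see them. The trailing /text() step is replaced here by html_text().

library(rvest)

# fetch the search page (static HTML only; JavaScript-rendered content will be missing)
page <- read_html("https://flippa.com/search?sort_alias=most_recent&filter%5Bproperty_type%5D=website,established_website,starter_site&filter%5Bsitetype%5D=content,blog,directory,review,forum-community")

# pull the link text using the XPath from the answer above
domains <- html_text(html_nodes(page, xpath = '//a[contains(@class, "GTM-search-result-card ng-binding")]'))
head(domains)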
I'm trying to write an R script to check prices on a popular Swiss website.
Following the methodology explained here: https://www.analyticsvidhya.com/blog/2017/03/beginners-guide-on-web-scraping-in-r-using-rvest-with-hands-on-knowledge/ I tried to use rvest for that:
library(rvest)
url <- "https://www.galaxus.ch/fr/s8/product/quiksilver-everyday-stretch-l-shorts-de-bain-10246344"
webpage <- read_html(url)
Unfortunately, I have limited HTML/CSS knowledge and the content of webpage is very obscure to me.
I tried inspecting the page with Google Chrome, and it looks like the price is located in something named priceEnergyWrapper--2ZNIJ, but I cannot find any trace of it in webpage. I did not have any more luck using SelectorGadget.
Can anybody help me get the price out of webpage?
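One quick way to confirm what is going on (a sketch; webpage is the variable from the question's code): search the static HTML that read_html returned for the class name.

# if this returns FALSE, the price markup is not in the static HTML
# and is being injected by JavaScript after the page loads
grepl("priceEnergyWrapper", as.character(webpage))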
Since the price is dynamically generated by JavaScript, you will need RSelenium.
Your code should be something like:
library(RSelenium)

# start a Selenium server plus a Chrome session; the client comes back already open,
# so there is no need to call open() on it again
driver <- rsDriver(browser = "chrome")
rem_driver <- driver[["client"]]

# load the page in a real browser so its JavaScript runs
rem_driver$navigate("https://www.galaxus.ch/fr/s8/product/quiksilver-everyday-stretch-l-shorts-de-bain-10246344")
This asks Selenium to open the page in a real browser and wait for it to load fully, so the rendered HTML, including the JavaScript-generated parts, becomes available.
Now do:
rem_driver$findElement(using = 'class name', value = 'priceEnergyWrapper--2ZNIJ')
You should now see the HTML needed to get the price out of it, which at the time of checking was 25 CHF.
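To actually read the text out of that element, the handle returned by findElement exposes getElementText(). A sketch in the same untested spirit as the rest of this answer; note that the 2ZNIJ suffix looks like a build-generated hash and may change whenever the site is redeployed:

# grab the element and extract its visible text (something like "25 CHF")
price_elem <- rem_driver$findElement(using = 'class name', value = 'priceEnergyWrapper--2ZNIJ')
price_elem$getElementText()[[1]]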
PS: I do not scrape websites for others unless I am sure the owners do not object to crawlers/scrapers/bots, so this code is meant to show how to go about it with Selenium; I have not tested it personally. You should still get the general idea and the reason for using a tool like Selenium. You should also find out whether you are legally allowed to scrape this website, and any others, before doing so.
Additional resources to read about RSelenium:
https://ropensci.org/tutorials/rselenium_tutorial/
I am trying to scrape data from a few websites using Selenium. I'm scraping the standard way, by CSS selectors, XPath, IDs, etc., but for every new website I have to write a new Selenium script after inspecting that website's elements.
Now I have a new requirement from a client: he wants a scraper that will scrape data from "any website", no matter what the website's HTML structure is.
It should extract the data and put it into columns such as postdate, description, category, and location, plus a few other headers like department, region, etc.
All of the above headers might not be present on a given website, but the scraper should collect whatever data is available and put it into the respective column in a database.
And I need to do it without the standard Selenium approach; the way I picture it, it's like a search-engine robot that visits websites and fetches h1, h2, h3, and so on (a rough sketch of that idea follows this question).
I am confused about how to achieve this, and I am looking for hints, ideas, and suggestions to complete the task. I hope my question is valid for SO.
Any programming language will do.
Thanks in advance
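A rough sketch of that structure-agnostic idea in R with rvest. To be clear about what is assumed: the tags pulled here (headings, title, and common meta tags) are generic guesses that exist on most pages, and mapping them onto fields like postdate or location would still need per-site heuristics.

library(rvest)

# pull the generic, structure-independent parts of any page:
# headings, title, and standard meta tags
scrape_generic <- function(url) {
  page <- read_html(url)
  list(
    title       = html_text(html_node(page, "title")),
    headings    = html_text(html_nodes(page, "h1, h2, h3"), trim = TRUE),
    description = html_attr(html_node(page, "meta[name='description']"), "content"),
    # Open Graph tags, present on many sites, often carry clean metadata
    og_title    = html_attr(html_node(page, "meta[property='og:title']"), "content")
  )
}

# each result could then be written as one row into the DB,
# with NA in the columns a given site does not provide
scrape_generic("https://example.com")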
I would like help with saving Facebook data from search results.
I have 1000 query URLs like:
https://www.facebook.com/search/people/?q=name
https://www.facebook.com/search/people/?q=mobile
How can I quickly scrape data from the resulting web pages?
I have tried to scrape with some scraper programs but could not get them to work. Does anyone have a faster way?
Use the Python requests library. It is a lean and fast library. Scraping speed does not depend only on your code; it also depends on the website you are scraping.
I am new to web scraping, and I use the following tools and methods:
I use R (with packages curl, XML, etc.) to read web pages (from a URL), and the htmlTreeParse function to parse the HTML.
Then, in order to find the data I want, I first use the developer tools in Chrome to inspect the code.
Once I know which node the data is in, I use xpathApply to get it.
Usually this works well, but I had an issue with this site: http://www.sephora.fr/Parfum/Parfum-Femme/C309/2
When you click the link, the page loads, but it is in fact page 1 (of the products).
You have to load the URL again (by entering it a second time) in order to get page 2.
When I use my usual process to read the data, htmlTreeParse always gives me page 1.
I tried to understand the site better:
It seems to be built with Oracle Commerce (ATG Commerce).
The "real" URL is hidden, and when you click a filter (for instance, selecting a brand), you get a URL with a requestid: http://www.sephora.fr/Parfum/Parfum-Femme/C309?_requestid=285099
This doesn't reveal which selection was made.
Could you please help:
How can I access more products?
Thank you
I found the solution: Selenium! I think it is the ultimate tool for web scraping. I have posted several questions about web scraping; now, with RSelenium, almost everything is possible.
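For anyone landing here, a minimal sketch of the RSelenium approach for this case. Untested, and the "a.next" pagination selector below is hypothetical: inspect the live Sephora page in DevTools for the real one.

library(RSelenium)
library(rvest)

# drive a real browser so the site's session and JavaScript logic run
driver <- rsDriver(browser = "chrome")
remDr <- driver$client
remDr$navigate("http://www.sephora.fr/Parfum/Parfum-Femme/C309")
Sys.sleep(5)  # crude wait for the page to finish rendering

# click the pagination control to reach page 2
# ("a.next" is a placeholder selector, not the site's real markup)
next_btn <- remDr$findElement(using = "css selector", value = "a.next")
next_btn$clickElement()
Sys.sleep(5)

# hand the rendered HTML to rvest for the usual XPath/CSS parsing
page2 <- read_html(remDr$getPageSource()[[1]])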
I am new to Nutch and am using Nutch 1.9. Right now I am doing a POC on a sample site (shaadi.com). I have a few questions; can somebody help me out?
I can't access the URLs that require login authentication (form-based), even though I set up the configuration in httpclient-auth.xml, nutch-site.xml, and so on.
I know Nutch fetches only the whole content of a website, but is it possible to get just a piece of information, like first name, address, etc., from a page using Nutch? (I think that's more like scraping; it's what Python's Scrapy does.)
Thanks in advance.
You will need to use a plugin to extract the specific data and add it to the Nutch document at indexing time.
This plugin can be used to extract data:
www.atlantbh.com/precise-data-extraction-with-apache-nutch/