I want to scrape all the URLs from this page:
http://www.domainia.nl/QuarantaineList.aspx
I am able to scrape the first page, but I cannot move to the next page, because the page number is not part of the URL. So how can I change pages while scraping? I've been looking into RSelenium, but could not get it working.
I'm running the following code to get at least the first page:
library(rvest)     # read_html(), html_nodes(), html_text(), and the %>% pipe
library(stringr)   # str_subset()

# Construct the URL to scrape: base URL plus today's date as yyyy/mm/dd
baseURL <- "http://www.domainia.nl/quarantaine/"
date <- gsub("-", "/", Sys.Date())
URL <- paste0(baseURL, date)

# Scrape the page: grab all table cells and keep the entries that look like .nl domains
page <- read_html(URL) %>% html_nodes("td") %>% html_text()
links <- str_subset(page, pattern = "^\r\n.*.nl$")
links <- gsub(pattern = "\r\n", "", links) %>% trimws
I've looked at the site; it uses a JavaScript POST to refresh its contents.
Originally an HTTP POST was meant to send information to a server, for example the contents of a form somebody filled in. As such, it often includes information about the page you are coming from, which means you will probably need more information than just "page n".
If you want to get another page the way your browser would show it to you, you need to send a similar request. The httr package includes a POST function; I think you should take a look at that.
For knowing what to post, I think it's most useful to capture what your browser does and copy that. In Chrome, you can open the developer tools (Inspect) and watch the Network tab to see what is sent and received; I bet other browsers have similar tools.
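A rough sketch of what replaying that request could look like with httr (the field names and values below are placeholders; ASP.NET pages typically expect fields like __EVENTTARGET and __VIEWSTATE, and you would copy the real ones from the captured request):

library(httr)
library(rvest)

# Placeholder form fields: copy the real names and values from the
# request you captured in the browser's Network tab.
resp <- POST(
  "http://www.domainia.nl/QuarantaineList.aspx",
  body = list(
    "__EVENTTARGET"   = "ctl00$SomePagerControl",  # hypothetical control name
    "__EVENTARGUMENT" = "Page$2",                  # hypothetical: ask for page 2
    "__VIEWSTATE"     = "copied-from-the-browser"
  ),
  encode = "form"
)
page2 <- read_html(content(resp, as = "text"))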
However, it looks like that website makes its money by showing that information, and if some other source showed the same things, they'd lose money. Therefore I doubt it's that easy to emulate; I think some part of the request differs every time, yet needs to be exactly right. For example, they could build in checks to see whether the entire page was rendered, instead of discarded like you do. So I wouldn't be surprised if they intentionally make it very hard to do what you are trying to do.
Which brings me to an entirely different solution: ask them!
When I tried scraping a website with dynamically generated content for the first time, I was struggling as well. Until I explored the website some more, and saw that they had a link where you could download the entire thing, tidied up, in a nice csv-format.
And for the people running a web server, scraping is often an inconvenience: it demands resources from the server, a lot more than someone simply downloading a file.
It's quite possible they'll tell you "no", but if they really don't want you to get their data, I bet they've made it difficult to scrape. Maybe you'll just get banned if you make too many requests from the same IP, maybe some other method.
And it's also entirely possible that they don't want their data in the hands of a competitor, but that they'll give it to you if you only use it for a particular purpose.
(too big for a comment, but not an answer, per se)
Emil is spot on, except that this is an ASP.NET/SharePoint-esque site with binary "view states" and other really daft web practices that make it nigh impossible to scrape with just httr.
When you do use the Network tab (again, as Emil astutely suggests) you can also use curlconverter to automatically build httr VERB functions out of requests "Copied as cURL".
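If I recall the curlconverter workflow correctly, it looks roughly like this (the cURL command is a placeholder; paste the one you actually copied from the Network tab, and treat straighten()/make_req() as the assumed helper names):

library(curlconverter)
library(magrittr)

# Placeholder: replace with the command you got via "Copy as cURL"
curl_cmd <- "curl 'http://www.domainia.nl/QuarantaineList.aspx' -H 'User-Agent: Mozilla/5.0' --data '__EVENTTARGET=...'"

req  <- straighten(curl_cmd) %>% make_req()  # builds httr request function(s)
resp <- req[[1]]()                           # performs the request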
For this site, assuming it's legal to scrape (it has no robots.txt, and since I am not fluent in Dutch I did not see an obvious "terms and conditions"-like link), you can use something like splashr or Selenium to navigate, click and scrape, since they act like a real browser.
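As an illustration, a hedged RSelenium sketch of that navigate-click-scrape loop (the pager link text is a guess; inspect the page for the real element):

library(RSelenium)
library(rvest)

driver <- rsDriver(browser = "chrome")
remDr  <- driver$client
remDr$navigate("http://www.domainia.nl/QuarantaineList.aspx")

# Hypothetical locator for the "page 2" link in the pager; adjust after inspecting the page
next_link <- remDr$findElement(using = "link text", value = "2")
next_link$clickElement()
Sys.sleep(2)  # crude wait for the postback to finish rendering

# Hand the rendered HTML to rvest and reuse the original extraction logic
page2   <- read_html(remDr$getPageSource()[[1]])
domains <- page2 %>% html_nodes("td") %>% html_text()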
Related
I have been web scraping for about three months now, and I have noticed that many of my spiders need to be constantly babysat because websites change. I use Scrapy, Python, and Crawlera to scrape my sites. For example, two weeks ago I created a spider and just had to rebuild it because the website changed its meta tags from singular to plural (so location became locations). Such a small change shouldn't really be able to mess with my spiders, so I would like to take a more defensive approach to my collections moving forward. Does anyone have any advice for web scraping that allows for less babysitting? Thank you in advance!
Since you didn't post any code I can only give general advice.
Look whether there's a hidden API that retrieves the data you're looking for (a sketch of this idea follows after these points).
Load the page in Chrome, open the developer tools with F12 and look under the Network tab. Press Ctrl+F and you can search for the text you see on screen and want to collect. If you find a file under the Network tab that contains the data as JSON, that is more reliable, since the backend of a web page changes less frequently than the frontend.
Be less specific with selectors. Instead of doing body > .content > #datatable > .row::text you can change to #datatable > .row::text. Then your spider will be less likely to break on small changes.
Handle errors with try/except so that a single piece of inconsistent or missing data doesn't make the whole parse function fail.
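On the first point, if you do find such a hidden endpoint, requesting it directly is usually far more robust than parsing HTML. A minimal sketch of the idea in R (the URL, query parameter and field name are all made up; the same pattern applies with Python's requests inside a Scrapy project):

library(httr)
library(jsonlite)

# Hypothetical hidden JSON endpoint spotted in the Network tab
resp <- GET("https://example.com/api/listings", query = list(page = 1))
data <- fromJSON(content(resp, as = "text"))

# "locations" is a placeholder field; use whatever the real response contains
head(data$locations)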
I'm trying to write an R script that checks prices on a popular Swiss website.
Following the methodology explained here: https://www.analyticsvidhya.com/blog/2017/03/beginners-guide-on-web-scraping-in-r-using-rvest-with-hands-on-knowledge/ I tried to use rvest for that:
library(rvest)
url <- "https://www.galaxus.ch/fr/s8/product/quiksilver-everyday-stretch-l-shorts-de-bain-10246344"
webpage <- read_html(url)
Unfortunately, I have limited html/css knowledge and the content of webpage is very obscure to me.
I tried inspecting the page with Google Chrome and it looks like the price is located in an element with a class named priceEnergyWrapper--2ZNIJ, but I cannot find any trace of it in webpage. I did not have any more luck using SelectorGadget.
Can anybody help me get the price out of webpage?
Since it is dynamically generated, you will need RSelenium.
Your code should be something like:
library(RSelenium)

# rsDriver() starts a Selenium server plus a Chrome browser and already opens the client,
# so there is no need to call open() again
driver <- rsDriver(browser = "chrome")
rem_driver <- driver[["client"]]
rem_driver$navigate("https://www.galaxus.ch/fr/s8/product/quiksilver-everyday-stretch-l-shorts-de-bain-10246344")
This asks Selenium to open the page in a real browser and wait for it to load, so all of the HTML, including the parts rendered by JavaScript, should be available.
Now do:
price_element <- rem_driver$findElement(using = "class name", value = "priceEnergyWrapper--2ZNIJ")
You should now have the element containing the HTML you need to get the price value out of, which at the time of checking the website was 25 CHF.
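A hedged sketch of reading the value out (untested, in the same spirit as the rest of this answer; the class name looks auto-generated, so it may well change):

# Read the visible text of the element directly...
price_text <- price_element$getElementText()[[1]]

# ...or hand the fully rendered page over to rvest
library(rvest)
page  <- read_html(rem_driver$getPageSource()[[1]])
price <- page %>% html_node(".priceEnergyWrapper--2ZNIJ") %>% html_text()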
PS: I do not scrape websites for others unless I am sure that the owners of the websites do not object to crawlers/scrapers/bots. Hence, my code is based on the general idea of how to go about it with Selenium; I have not tested it personally. However, you should more or less get the general idea and the reason for using a tool like Selenium. You should also find out whether you are legally allowed to scrape this website, and do the same for any others in the future.
Additional resources to read about RSelenium:
https://ropensci.org/tutorials/rselenium_tutorial/
I am new to web scraping, and I use the following tools and method to scrape:
I use R (with the packages RCurl, XML, etc.) to read the web pages (from a URL), and the htmlTreeParse function to parse the HTML page.
Then, in order to get the data I want, I first use the developer tools in Chrome to inspect the code.
When I know in which node the data are, I use xpathApply to get them.
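For context, a minimal sketch of that usual workflow (the XPath is a placeholder; the real node path comes from inspecting the page):

library(RCurl)
library(XML)

html <- getURL("http://www.sephora.fr/Parfum/Parfum-Femme/C309")
doc  <- htmlTreeParse(html, useInternalNodes = TRUE)

# Placeholder XPath; substitute the node that actually holds the data
products <- xpathApply(doc, "//div[contains(@class, 'product')]", xmlValue)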
Usually, it works well. But I had an issue with this site: http://www.sephora.fr/Parfum/Parfum-Femme/C309/2
When you click on the link, you will load the page, and it is in fact page 1 (of the products).
You have to load the URL again (by entering it a second time) in order to get page 2.
When I use the usual process to read the data, the htmlTreeParse function always gives me page 1.
I tried to understand this website a bit more:
It seems that it is built with Oracle commerce (ATG commerce).
The "real" url is hidden, and when you click on the filter (for instance, you select a brand), you will get url with requestid: http://www.sephora.fr/Parfum/Parfum-Femme/C309?_requestid=285099
This doesn't help to know which selection I made.
Could you please help:
How can I access more products?
Thank you
I found the solution: Selenium! I think it is the ultimate tool for web scraping. I posted several questions concerning web scraping; now, with RSelenium, almost everything is possible.
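For the record, a rough sketch of how RSelenium could handle the paging here (the pager locator is a guess; inspect the page for the real one):

library(RSelenium)
library(XML)

driver <- rsDriver(browser = "chrome")
remDr  <- driver$client
remDr$navigate("http://www.sephora.fr/Parfum/Parfum-Femme/C309")

# Hypothetical locator for the "page 2" link; adjust after inspecting the pager
page2_link <- remDr$findElement(using = "link text", value = "2")
page2_link$clickElement()
Sys.sleep(2)  # crude wait for the new products to render

# Parse the rendered page with the same XML/xpathApply workflow as before
doc      <- htmlTreeParse(remDr$getPageSource()[[1]], useInternalNodes = TRUE)
products <- xpathApply(doc, "//div[contains(@class, 'product')]", xmlValue)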
I'm fairly new to web development and have never done any screen scraping or web crawling before, but yesterday a friend of mine asked me if I would be able to grab some data from this website, which is not mine, nor his, but the data is publicly available, even for download.
The problem with the data is that it's only available as one file per date or company, rather than one file for multiple dates or companies, which involves a lot of tedious clicking through the calendar. So he thought it would be nice if I could create some app that could grab all the data with one click and output it in one single file, or something similar.
The website uses an ASPX WebForm with __doPostBack to retrieve the data for different dates; even the links to download the data in XLS aren't the usual "href=…" links, they are, I assume, references to some ASP script…
To be honest, the only thing I tried was PHP cURL, which didn't work; but since it was my first time using cURL, I don't even know whether it failed because it's not possible with cURL or just because I don't know how to work with it.
I am only somewhat proficient in PHP and JavaScript, but not in ASP, though I wouldn't mind learning something new.
So my question is..
Is it at all possible to grab the data from a website like this? And if it is, would you be so kind as to give me some hints on how to approach this kind of problem?
The website, again, is here: http://extranet.net4gas.cz/capacity_ee.aspx
Thanks
C# has a nice WebClient class to do the job:
// Create web client.
WebClient client = new WebClient();
// Download string.
string value = client.DownloadString("http://www.microsoft.com/");
Once you have the page HTML in a string, you can use regular expressions to scrape the content you are looking for.
Here is a very basic regular expression to give you a hint:
Regex regex = new Regex(@"\d+");
Match match = regex.Match("hello here 10 values");
if (match.Success)
{
Console.WriteLine(match.Value);
}
Marosko, as you said, the data on the website is open to the public, so you can certainly scrape it; the question is how to reduce the manual clicking through dates while extracting the data. I personally don't have much idea of how cURL works, but I am sure it will involve a lot of coding. I would rather suggest you automate the entire process using some automation tool, like a software application. Try Automation Anywhere; I bought it a few months back for some data extraction and it worked very well. It is automated, and you can check out the screen scraping capabilities it offers. It's my favorite :)
Charles
It could be a project well beyond my skills right now, but I've got around one full month to spend on it, so I think I can do it. What I want to build is this: gather news about a specific subject from various sources. Easy, right? Just get the RSS feeds and display them on a page. Well, I want something more advanced: duplicates removed and a customized presentation (that is, being able to define/change the format in which the news headlines are displayed).
I've played a bit with Yahoo Pipes and some other tools and I am facing two big problems:
Some sources don't provide rss feeds. How do I create one?
What's the best method to find and remove duplicates? I thought about comparing the headlines and checking whether they match by more than, say, 50%. Is that good practice, though?
Please add any other things (problems, suggestions, whatever) I might not have considered.
Duplication is a nasty issue. What I eventually ended up doing:
1. Strip out all HTML tags except for links (Although I started using regex, I was burned. I eventually moved to custom parsing to remove tags)
2. Strip out all whitespace
3. Case-desensitize
4. Hash all that with MD5.
Here's why you leave the link in:
A comment might be as simple as "Yes, this sucks". "Yes, this sucks" could be a common comment. BUT if the text "this sucks" is linked to different things, then it is not a duplicate comment.
Additionally, you will find that HTML tag escaping is weird with RSS feeds. You would think that a stray < would be double-encoded as (I think) &amp;lt;
But it is not. It is encoded as &lt;
But so too are HTML tags: a <p> arrives as &lt;p&gt;
I eventually copied all the known HTML tags as parsed by Mozilla Firefox and manually recognized those tags.
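As a rough illustration of steps 1 to 4 above (a simplified sketch in R that keeps the link targets rather than doing full custom tag parsing; it assumes the xml2 and digest packages):

library(xml2)    # parse the HTML fragment
library(digest)  # MD5 hashing

dedup_key <- function(html_fragment) {
  doc <- read_html(html_fragment)
  # Keep the link targets so "same text, different link" is not a duplicate
  links <- xml_attr(xml_find_all(doc, ".//a"), "href")
  # Strip tags (xml_text), strip whitespace, lowercase, then hash
  text <- tolower(gsub("\\s+", "", xml_text(doc)))
  digest(paste(c(text, links), collapse = "|"), algo = "md5")
}

dedup_key("<p>Yes, this <a href='http://a.example'>sucks</a></p>")
dedup_key("<p>Yes, this <a href='http://b.example'>sucks</a></p>")  # different hash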
Creating an RSS feed from HTML is quite nasty and I can only point you to services such as Spinn3r, which are fantastic at de-duplication and content extraction. These services typically use probability-based algorithms that are above me. I know of one provider that got away with regexing pages (They had to know that a certain page was MySpace-based or Blogger-based) but they did not perform admirably.
You might want to try the YQL module to scrape a web page that doesn't provide RSS; its html table lets you pull pieces out of a page with an XPath expression.
About duplicates, take a look at this pipe.
Customized presentation: if you want it truly customized, you'll have to manipulate the pipe results yourself, e.g. get them as JSON and manipulate them with JavaScript, or process them server-side.