Web scraping Oracle (ATG) Commerce - web-scraping

I am new to web scraping, and I use the following tools and method to scrape:
I use R (with packages RCurl, XML, etc.) to read web pages (from a URL), and the htmlTreeParse function to parse the HTML.
Then, to find the data I want, I first use the developer tools in Chrome to inspect the code.
Once I know which node the data are in, I use xpathApply to extract them.
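For context, a minimal sketch of that workflow; the XPath expression is only a placeholder and has to be adapted to the node found in the developer tools:
library(RCurl)
library(XML)
# download the raw HTML and parse it into a queryable tree
page_source <- getURL("http://www.sephora.fr/Parfum/Parfum-Femme/C309/2")
doc <- htmlTreeParse(page_source, useInternalNodes = TRUE)
# placeholder XPath: replace with the node identified in Chrome's developer tools
products <- xpathApply(doc, "//div[@class='product-name']", xmlValue)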
Usually this works well, but I ran into an issue with this site: http://www.sephora.fr/Parfum/Parfum-Femme/C309/2
When you click on the link, the page loads, but it is in fact page 1 (of the products).
You have to load the URL again (by entering it a second time) in order to get page 2.
When I use the usual process to read the data, the htmlTreeParse function always gives me page 1.
I tried to understand this website better:
It seems to be built with Oracle Commerce (ATG Commerce).
The "real" URL is hidden, and when you click on a filter (for instance, selecting a brand), you get a URL with a request ID: http://www.sephora.fr/Parfum/Parfum-Femme/C309?_requestid=285099
This doesn't help in knowing which selection I made.
Could you please help:
How can I access more products?
Thank you

I found the solution: Selenium! I think it is the ultimate tool for web scraping. I had posted several questions concerning web scraping; now, with RSelenium, almost everything is possible.
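For reference, a rough sketch of the kind of RSelenium code that makes this possible (untested here; the idea is that navigating with a real browser executes the site's JavaScript, so page 2 actually loads):
library(RSelenium)
driver <- rsDriver(browser = "chrome")
remote <- driver[["client"]]
# navigating with a real browser runs the site's JavaScript, so page 2 is actually rendered
remote$navigate("http://www.sephora.fr/Parfum/Parfum-Femme/C309/2")
page_source <- remote$getPageSource()[[1]]
# page_source can then be parsed as before, e.g. with XML::htmlParse and xpathApply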

Related

What is "dlta" and "ridlist" parameters used in Requests library (Python)

I am working on getting data as tables from Quickbase using the Requests library (Python). I found somebody doing it using the URL of the report, but they added two parameters to the URL like this:
&dlta=xs%xx&ridlist=xxxx
Can anybody please tell me what these two parameters are? I searched for them on the internet but found nothing related to them.
I've been using Quickbase for over ten years and haven't seen documentation for either of these parameters. I have noticed that ridList seems to be used by Quickbase's grid edit view of reports (I suspect it's an ID for a server-side cached list of record IDs to display especially when using the type-ahead search of a report before choosing to grid edit) and dlta is used in the "Download report as CSV" button.
That example you're following may have simply copied and pasted a link generated by Quickbase as a hack to get a CSV instead of an XML response. I recommend following the Quickbase HTTP API Reference instead. If you don't want an XML response, Quickbase also has a JSON RESTful API which may be easier to work with.
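For illustration only, here is a minimal sketch (in R, to stay consistent with the other snippets in this thread collection) of what a query against Quickbase's JSON RESTful API roughly looks like; the realm hostname, user token, table ID, and field IDs below are all placeholders you would replace with your own values:
library(httr)
library(jsonlite)
# placeholders: realm hostname, user token, and table ID come from your own Quickbase account
resp <- POST(
  "https://api.quickbase.com/v1/records/query",
  add_headers(
    "QB-Realm-Hostname" = "yourrealm.quickbase.com",
    "Authorization" = "QB-USER-TOKEN your_token_here",
    "Content-Type" = "application/json"
  ),
  body = toJSON(list(from = "your_table_id", select = c(3, 6, 7)), auto_unbox = TRUE)
)
records <- fromJSON(content(resp, as = "text"))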

R - Web scraping item price

I'm trying to write an R script checking prices on a popular Swiss website.
Following the methodology explained here: https://www.analyticsvidhya.com/blog/2017/03/beginners-guide-on-web-scraping-in-r-using-rvest-with-hands-on-knowledge/ I tried to use rvest for that:
library(rvest)
url <- "https://www.galaxus.ch/fr/s8/product/quiksilver-everyday-stretch-l-shorts-de-bain-10246344"
webpage <- read_html(url)
Unfortunately, I have limited HTML/CSS knowledge and the content of webpage is very obscure to me.
I tried inspecting the page with Google Chrome, and it looks like the price is located in something named priceEnergyWrapper--2ZNIJ, but I cannot find any trace of that in webpage. I did not have any more luck using SelectorGadget.
Can anybody help me get the price out of webpage?
Since it is dynamically generated, you will need RSelenium.
Your code should be something like:
library(RSelenium)
# start a local Selenium server and a Chrome client
driver <- rsDriver(browser = "chrome")
rem_driver <- driver[["client"]]
# rsDriver() usually opens a browser session already; call open() only if it did not
# rem_driver$open()
rem_driver$navigate("https://www.galaxus.ch/fr/s8/product/quiksilver-everyday-stretch-l-shorts-de-bain-10246344")
This asks Selenium to open the page and wait for it to fully load, so all of the rendered HTML, including the dynamically generated content, should be available.
Now do:
rem_driver$findElement(using = 'class name', value = 'priceEnergyWrapper--2ZNIJ')
You should now see the necessary HTML to get the price value out of it, which at the time of checking the website is 25 CHF.
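To actually pull the value out, a short (untested) follow-up along the same lines:
# store the matched element and read its rendered text (getElementText() returns a list)
price_elem <- rem_driver$findElement(using = 'class name', value = 'priceEnergyWrapper--2ZNIJ')
price_text <- price_elem$getElementText()[[1]]
print(price_text)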
PS: I do not scrape websites for others unless I am sure that the owners of the websites do not object to crawlers/scrapers/bots. Hence, my code is based on the general idea of how to go about it with Selenium; I have not tested it personally. However, you should more or less get the general idea and the reason for using a tool like Selenium. You should also find out whether you are legally allowed to scrape this website, and do the same for any others in the near future.
Additional resources to read about RSelenium:
https://ropensci.org/tutorials/rselenium_tutorial/

aikau - implement export search results functionality

I have recently started developing with Aikau in Alfresco Share.
I want to achieve a functionality wherein I can export search results to a CSV file.
For that, I can change the back-end repository web script to return csv data.
Now, at the Alfresco Share end, I was able to show the export link by adding a new widget to FCTSRCH_TOP_MENU_BAR. I used alfresco/renderers/PropertyLink to display this link. The missing part for me is how to invoke the search web script passing the additional param format=csv and, along with that, pass all the query parameters used to retrieve the results.
I am stuck with that. If I use ALF_CRUD_GET_ALL as the publishTopic and provide the URL there, it invokes the sample web script (which I created to return a sample CSV response) and returns the response. However, the CSV doesn't come back as a downloadable response. I am stuck on how to achieve the export-CSV functionality for search results.
It would be great if any of you can help me here and provide your guidance/suggestions.
This blog post provides an example of how you can customize the search page in Share. Although it specifically addresses changing the search queries, the basic extension approach is more or less the same in that you will want to change the data that is used to send an XHR request. I think the major difference here is that you may need to do more in-depth updates to the service, in particular with regard to the switch statement that is used to build the advanced search query object.
If you have extended or replaced the default search REST API then I would expect that you will need to call the same URL, but if you have provided an entirely new REST API to return the CSV data then you'll also need to change the URL that is used by the service.
In terms of providing a link for downloading the content we have previously implemented something in the DragAndDropModelCreationService (see the generateDownload function) but this only works with Chrome due to security limitations and the generation of files to download.
Your best bet may be to temporarily store the CSV content in a hidden location on the repository and then use the standard download links to allow it to be downloaded - this would be more complex but would provide better cross-browser support. Something similar is done for the "Download as ZIP" action.
OK, with the extra information provided I would do the following...
The information on the process of adding widgets to the search page is quite well detailed here (although you're not adding a view, you can follow the approach to add a new PropertyLink after the widget with the id "FCTSRCH_RESULTS_COUNT_LABEL").
The approach I would take would be to include an additional custom service on the page that subscribes to the "ALF_RETRIEVE_DOCUMENTS_REQUEST_SUCCESS" topic (which is published on a completed search). It should save the search response in a variable in preparation for users clicking on the PropertyLink.
This custom service should also subscribe to a topic that is published by the PropertyLink (called say "DOWNLOAD_CSV"). This custom service could then generate a file download using the approach described in my previous answer using the CSV data that will have been provided in the payload. As I said though, this may only work with some browsers due to security reasons.
If your custom search WebScript were able to store the CSV data as a node on the repository, then you could just provide the NodeRef of the CSV data in the search response, and the PropertyLink could just publish the "ALF_DOWNLOAD" topic for the DocumentService to handle the download.
Trying to generate a file to download on the client side is going to be an issue for most browsers I think.

Import.io - Can it replace Kimonolabs

I use Kimonolabs right now for scraping data from websites that have the same goal. To make it easy, let's say these websites are online shops selling stuff (actually they are job websites with online application possibilities, but technically they look a lot like a webshop).
This works great. For each website, a scraper API is created that goes through the available advanced search page to crawl all product URLs. Let's call this API the 'URL list'. Then a 'product API' is created for the product-detail page that scrapes all necessary elements, e.g. the title, product text and specs like the brand, category, etc. The product API is set to crawl daily using all the URLs gathered in the 'URL list'.
Then the gathered information for all products is fetched using the Kimonolabs JSON endpoint by our own service.
However, Kimonolabs will quit its service at the end of February 2016 :-(. So I'm looking for an easy alternative. I've been looking at import.io, but I'm wondering:
Does it support automatic updates (letting the API scrape hourly/daily/etc)?
Does it support fetching all product-URL's from a paginated advanced search page?
I'm tinkering around with the service. Basically, it seems to extract data via the same easy process as Kimonolabs. It's just unclear to me whether paginating the URLs necessary for the product API, and automatically keeping it up to date, are supported.
Are there any import.io users here who can advise whether import.io is a useful alternative for this? Maybe even give some pointers in the right direction?
Look into Portia. It's an open source visual scraping tool that works like Kimono.
Portia is also available as a service and it fulfills the requirements you have for import.io:
automatic updates, by scheduling periodic jobs to crawl the pages you want, keeping your data up-to-date.
navigation through pagination links, based on URL patterns that you can define.
Full disclosure: I work at Scrapinghub, the lead maintainer of Portia.
Maybe you want to give Extracty a try. It's a free web scraping tool that allows you to create endpoints that extract any information and return it in JSON. It can easily handle paginated searches.
If you know a bit of JS, you can write CasperJS endpoints and integrate any logic that you need to extract your data. It has a similar goal to Kimonolabs and can solve the same problems (if not more, since it's programmable).
If Extracty does not solve your needs you can checkout these other market players that aim for similar goals:
Import.io (as you already mentioned)
Mozenda
Cloudscrape
TrooclickAPI
FiveFilters
Disclaimer: I am a co-founder of the company behind Extracty.
I'm not that fond of Import.io, but it seems to me that it allows pagination through bulk input URLs. Read here.
So far there has not been much progress in getting a whole website through the API:
"Chain more than one API/Dataset: It is currently not possible to fully automate the extraction of a whole website with Chain API."
For example, if I want data that is found within category pages or paginated lists, I first have to create a list of URLs, run Bulk Extract, save the result as an import data set, and then chain it to another Extractor. Once set up, I would like to be able to do this more automatically, in one click.
P.S. If you are somehow familiar with JS you might find this useful.
Regarding automatic updates:
This is a beta feature right now. I'm testing it for myself after migrating from Kimonolabs... You can enable it for your own APIs by appending &bulkSchedule=1 to your API URL. Then you will see a "Schedule" tab. In the "Configure" tab, select "Bulk Extract" and add your URLs; after this, the scheduler will run daily or weekly.

Crawling Wikipedia

I'm crawling Wikipedia using a website downloader for Windows. I was looking through all the options in this tool to find an option to download Wikipedia pages for a specific period, for example from 2005 until now.
Does anyone have any idea about crawling the website for a specific period of time?
Why not download the SQL database containing all of Wikipedia?
You can then query it using SQL.
Give the Wikipedia API and your programming skills a try.
There should be no need to do web scraping; use the MediaWiki API to directly request the information you want. I'm not sure what you mean by "wikipedia pages for a specific period" - do you mean last edited at a certain time? If so, while skimming, I noticed an API call that lets you get a look at the last n revisions; just ask for the last revision and see what its date is.
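As a rough illustration (in R, assuming the httr and jsonlite packages; the page title is just an example), asking the MediaWiki API for the timestamp of a page's latest revision looks roughly like this:
library(httr)
library(jsonlite)
# request the most recent revision of a page and its timestamp
resp <- GET(
  "https://en.wikipedia.org/w/api.php",
  query = list(
    action = "query",
    prop = "revisions",
    titles = "Web scraping",
    rvprop = "timestamp",
    rvlimit = 1,
    format = "json"
  )
)
result <- fromJSON(content(resp, as = "text"))
# the timestamp is nested under result$query$pages[[<page id>]]$revisions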
It depends on whether the website in question offers an archive, and most don't, so it's not possible in a straightforward way to crawl a sample starting from a specific date. But you can implement some intelligence in your crawler to read the page creation date or something like that.
But you can also look at Wikipedia API at http://en.wikipedia.org/w/api.php
