I could really use some help with a problem I'm facing. I have a project where I'm supposed to fetch the names and prices of some products, and I must retrieve data from the first 5 pages of a given category. I'm trying to implement it in R with the rvest package, using the SelectorGadget extension to choose the appropriate CSS selectors. I've written a function to do that:
library(rvest)

readDataProject2 <- function() {
  url       <- readline(prompt = "Enter url: ")
  nameTags  <- readline(prompt = "Enter name tags: ")
  priceTags <- readline(prompt = "Enter price tags: ")
  page <- read_html(url)  # parse the page once and reuse it
  itemNames  <- page %>% html_nodes(nameTags)  %>% html_text()
  itemPrices <- page %>% html_nodes(priceTags) %>% html_text()
  itemPrices <- itemPrices[-c(1, 2)]  # drop the two non-product price nodes
  cbind(itemNames, itemPrices)  # return the name/price matrix
}
and here's the page anesishome.gr. From this specific page I can go to the next and so on, to fetch a total of 240 products. But even when I provide the URL for the next (second) page, I keep getting the data of the first page. Needless to say, choosing the option to show all 240 products on a single page didn't do any good either. Can anybody point me to what I'm doing wrong?
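For reference, the looping part might look something like the sketch below, assuming the category URL exposes the page number as a query parameter such as ?page= (that parameter name is a guess and needs checking against the site's real URLs):

# A sketch, assuming a hypothetical "?page=N" query parameter; if the
# site paginates via JavaScript or POST requests instead, read_html on
# these URLs will keep returning the first page, as described above.
library(rvest)

scrape_pages <- function(base_url, nameTags, priceTags, n_pages = 5) {
  results <- lapply(seq_len(n_pages), function(i) {
    page <- read_html(paste0(base_url, "?page=", i))
    data.frame(
      name  = page %>% html_nodes(nameTags)  %>% html_text(),
      price = page %>% html_nodes(priceTags) %>% html_text()
    )
  })
  do.call(rbind, results)  # stack the five pages into one data frame
}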
I'm looking for advice on methods to scrape the gender of clothing items on a website that doesn't specify the gender on the product page.
The website I'm crawling is www.very.co.uk and an example of a product page would be this - https://www.very.co.uk/berghaus-combust-reflect-long-jacket-red/1600352465.prd
Looking at that page, there seems to be no easy way to write a script that could identify this item as womenswear. Other websites might have breadcrumbs to use, or the gender might be in the title or URL, but this one has nothing.
As I'm using Scrapy, with the crawl template and Rules to build a hierarchy of links to scrape, I was wondering if it's possible to pass a variable in one of the rules or in start_urls, so that all items scraped by following that rule or start URL would carry a variable marking them as womenswear. I could then feed that variable into a method / loader statement to tag the item as womenswear before putting it into a database.
If not, does anyone have any other ideas on how to categorise this item as womenswear? I saw an example where you could use an Excel spreadsheet to create the start_urls, tagging each row as womenswear, mens, etc. However, I feel that method might cause issues further down the line, and I'd prefer to avoid it if possible. I'll spare the details of why I think it would be problematic unless anyone asks.
Thanks in advance
There does seem to be a breadcrumb in your example; however, as an alternative, you can usually check the page source by simply searching for your term. Maybe there's some embedded JavaScript/JSON that can be extracted?
Here you can see some JavaScript with a subcategory field indicating that it's a "womens_everyday_sports_jacket".
You can parse it quite easily with some regex:
import re

re.findall(r'subcategory: "(.+?)"', response.body_as_unicode())
# ['womens_everyday_sports_jacket']
I am looking to scrape some data from a chemical database using R, mainly name, CAS Number, and molecular weight for now. However, I am having trouble getting rvest to extract the information I'm looking for. This is the code I have so far:
library(rvest)
library(magrittr)
# Read HTML code from website
# I am using this format because I ultimately hope to pull specific items from several different websites
webpage <- read_html(paste0("https://pubchem.ncbi.nlm.nih.gov/compound/", 1))
# Use CSS selectors to scrape the chemical name
chem_name_html <- webpage %>%
  html_nodes(".short .breakword")

# Convert the nodes to text
chem_name_data <- html_text(chem_name_html)
However, when I try to create chem_name_html, R only returns character(0) (empty). I am using SelectorGadget to get the CSS selector, but I noticed that SelectorGadget gives me a different node than the Inspector in Google Chrome does. I have tried both ".short .breakword" and ".summary-title short .breakword" in that line of code, but neither gives me what I am looking for.
I have recently run into the same issue using rvest to scrape PubChem. The problem is that the information on the page is rendered with JavaScript as you scroll down, so rvest only gets minimal information from the page.
There are a few workarounds though. The simplest way to get the information that you need into R is using an R package called webchem.
If you are looking up name, CAS number, and molecular weight then you can do something like:
library(webchem)
chem_properties <- pc_prop(1, properties = c('IUPACName', 'MolecularWeight'))
The full list of compound properties that can be extracted using this API can be found here. Unfortunately there isn't a property through this API for the CAS number, but webchem gives us another way to query it, using the Chemical Translation Service.
chem_cas <- cts_convert(query = '1', from = 'CID', to = 'CAS')
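If you want the three values side by side, the two results can be stitched together; a minimal sketch, assuming pc_prop returns a one-row data frame and cts_convert returns a list keyed by the query (the behavior of current webchem releases):

# A minimal sketch; the column names here are my own, not webchem's
chem_summary <- data.frame(
  name = chem_properties$IUPACName,
  mw   = chem_properties$MolecularWeight,
  cas  = unlist(chem_cas)[1]
)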
The second way to get information from the page, which is a bit more robust but not quite as easy to work with, is grabbing it from the JSON API.
library(jsonlite)
chem_json <-
  read_json(paste0("https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/", "1",
                   "/JSON/?response_type=save&response_basename=CID_", "1"))
With that command you'll get a list of lists, which I had to parse with a fairly convoluted script to pull out the information I needed. If you are familiar with JSON, you can parse far more information from the page, but not quite everything; for example, the information in sections like Literature, Patents, and Biomolecular Interactions and Pathways will not fully show up in the JSON.
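As one small illustration of digging into that nested list, here is a sketch assuming the PUG View layout, where the compound's display name is stored under Record$RecordTitle:

# A sketch, assuming the PUG View JSON layout
chem_title <- chem_json$Record$RecordTitle  # the record's display name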
The final and most comprehensive way to get all the information from the page is to use something like Scrapy or PhantomJS to render the full HTML output of the PubChem page, then scrape it with rvest as you originally intended. This is something I'm still working on, as it is my first time using web scrapers as well.
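One way to do that rendering step from R would be RSelenium; this is an alternative I haven't verified against PubChem, and it assumes rsDriver() can start a local Selenium-driven browser:

# A sketch of the headless-browser route using RSelenium; the selector
# and wait time are carried over from the question, not guaranteed.
library(RSelenium)
library(rvest)

driver <- rsDriver(browser = "firefox", verbose = FALSE)
remDr  <- driver$client
remDr$navigate("https://pubchem.ncbi.nlm.nih.gov/compound/1")
Sys.sleep(5)  # give the JavaScript time to render the page

rendered  <- read_html(remDr$getPageSource()[[1]])
chem_name <- rendered %>% html_nodes(".short .breakword") %>% html_text()

remDr$close()
driver$server$stop()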
I'm still a beginner in this realm, but hopefully this helps you a bit.
I'm doing a project for college that involves web scraping. I'm trying to get all the links to the player profiles on this website (http://www.atpworldtour.com/en/rankings/singles?rankDate=2015-11-02&rankRange=1-5001). I've tried to grab the links with the following code:
library(XML)
doc_parsed <- htmlTreeParse("ranking.html", useInternalNodes = TRUE)
root <- xmlRoot(doc_parsed)
hrefs1 <- xpathSApply(root, path = "//a", fun = xmlGetAttr, "href")
"ranking.html" is the saved link. When I run the code, it gives me a list with 6887 instead of the 5000 links of the players profiles.What should I do?
To narrow down to the links you want, you must include in your XPath expression attributes that are unique to the elements you are after. The best and fastest way is to use ids (which should be unique). Next best is using paths under elements with specific classes. For example:
hrefs1 <- xpathSApply(root, path = '//td[@class="player-cell"]/a', fun = xmlGetAttr, "href")
By the way, the page you link to has at the moment exactly 2252 links, not 5000.
The main data type used by Yahoo Pipes is the [Item], which is RSS feed content. I want to take an RSS feed's content or a sub-element, turn it into [Text] (or a number might work), and then use it as an input into a [Module] to build an RSS URL with specific parameters. I will then use the new RSS URL to pull more content.
I could possibly use the [URL Builder Module] or some workaround.
The key here is using "dynamic" data from an RSS feed (not user input or static data) and getting that data into a data type that is compatible with (and/or accessible as) an input into a module.
It seems like vital functionality, but I cannot figure it out. I have tried many, many workarounds, with no success.
The Specific API and Methods (if you are interested)
Using the LastFM API.
1st method: user.getWeeklyChartList. Then pick the "from" (start) and "to" (end) Unix timestamps from one year ago today.
2nd method: user.getWeeklyAlbumChart, using those specific (and "dynamic") timestamps to pull my top albums for that week.
tl;dr: build an RSS URL using specific parameters from another RSS feed's content (a rough sketch of the API flow follows below).
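Outside of Pipes, the same two-step flow can be sketched in R, just to make the data flow concrete; this assumes a valid Last.fm API key, and the user name and key below are placeholders:

library(jsonlite)

api_key <- "YOUR_API_KEY"   # placeholder
user    <- "some_user"      # placeholder
base    <- "https://ws.audioscrobbler.com/2.0/"

# 1st method: list the available weekly chart ranges (from/to timestamps)
charts <- fromJSON(paste0(base, "?method=user.getweeklychartlist",
                          "&user=", user, "&api_key=", api_key, "&format=json"))
week <- charts$weeklychartlist$chart[1, ]  # pick one range, e.g. the first

# 2nd method: pull the album chart for that range
albums <- fromJSON(paste0(base, "?method=user.getweeklyalbumchart",
                          "&user=", user, "&api_key=", api_key,
                          "&from=", week$from, "&to=", week$to, "&format=json"))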
I think I may have figured it out. I doubt it is the best way, but it works. The problem was that the module I needed to use didn't have an input node. But the Loop module has one, so if I embed the URL Builder into the Loop module, I can access sub-element content from the 1st feed and use it as parameters to build the URL for the 2nd feed! Then I can just strip out all the extra items generated by the Loop by using Truncate.
I have the following filter on one of my profiles:
filter type: Include Pattern Only
filter field: user_defined_variable (AUTO)
filter pattern: \[53\]
case sensitive: no
In my content, I have the following JavaScript:
_userv=0;
urchinTracker();
__utmSetVar("various string in here");
Now, the issue is that in this profile, there are files showing up in the report that shouldn't. For instance, in the Webmaster View > Content By Title for a specific profile, a page with the following variable (as seen from the source) shows up:
__utmSetVar("[3][345]")
I have no idea why this is happening. The filter pattern doesn't match, so it shouldn't show up.
It turns out that it is supposed to include files that may have a different pattern. The reason is that the report covers all the files that were seen during a single visit, which includes other files with different custom variables.
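For what it's worth, the pattern itself really doesn't match the other variable; a quick regex check in R (outside Google Analytics, purely to illustrate the matching):

# The filter pattern \[53\] matches a literal "[53]" substring
grepl("\\[53\\]", '__utmSetVar("[53]")')      # TRUE: would be included
grepl("\\[53\\]", '__utmSetVar("[3][345]")')  # FALSE: no direct match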
To see the report on the custom vars:
Marketing Optimization > Visitor segment performance > User defined