I'm looking for advice on methods to scrape the gender of clothing items from a website that doesn't specify the gender on the product page.
The website I'm crawling is www.very.co.uk and an example of a product page would be this - https://www.very.co.uk/berghaus-combust-reflect-long-jacket-red/1600352465.prd
Looking at that page, there seems to be no easy way to write a script that could identify this item as womenswear. Other websites might have breadcrumbs to use, or the gender might be in the title or URL, but this one has neither.
I'm using Scrapy with the crawl template and Rules to build a hierarchy of links to scrape. Is it possible to pass a variable in one of the rules, or in a start URL, so that every item scraped by following that rule / start URL is marked as womenswear? I could then feed this variable into a method / loader statement to tag the item as womenswear before putting it into a database.
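Something like this rough, untested sketch is what I have in mind (assuming a Rule can pass cb_kwargs to its callback; the /women/ and /men/ URL patterns are made up for illustration):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class VerySpider(CrawlSpider):
    name = "very"
    allowed_domains = ["very.co.uk"]
    # hypothetical section start URLs
    start_urls = ["https://www.very.co.uk/women", "https://www.very.co.uk/men"]

    rules = (
        # each rule carries its own gender tag via cb_kwargs
        Rule(LinkExtractor(allow=r"/women/"), callback="parse_item",
             cb_kwargs={"gender": "womenswear"}),
        Rule(LinkExtractor(allow=r"/men/"), callback="parse_item",
             cb_kwargs={"gender": "menswear"}),
    )

    def parse_item(self, response, gender):
        # every item reached through a rule inherits that rule's tag
        yield {"url": response.url, "gender": gender}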
If not, would anyone have any other ideas on how to categorise this item as womenswear? I saw an example where you could use an Excel spreadsheet to create the start_urls and tag each row in that spreadsheet as womenswear, menswear, etc. However, I feel this method might cause issues further down the line and would prefer to avoid it if possible. I'll spare the details of why I think this would be problematic unless anyone asks.
Thanks in advance
There does seem to be a breadcrumb in your example; as an alternative, you can usually check the page source by simply searching for your term - maybe there's some embedded JavaScript/JSON that can be extracted.
Here you can see some JavaScript with a subcategory field indicating that the item is a "womens_everyday_sports_jacket".
You can parse it quite easily with some regex:
import re
# response.body_as_unicode() is response.text on newer Scrapy versions
re.findall(r'subcategory: "(.+?)"', response.body_as_unicode())
# ['womens_everyday_sports_jacket']
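If it helps, a minimal way to wire that into a spider callback (the "womens" prefix check is just a guess at the site's naming convention):

import re

def parse_item(self, response):
    match = re.search(r'subcategory: "(.+?)"', response.text)
    subcategory = match.group(1) if match else ""
    # assumption: womenswear subcategories start with "womens"
    gender = "womenswear" if subcategory.startswith("womens") else "unknown"
    yield {"url": response.url, "subcategory": subcategory, "gender": gender}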
I'm in the process of developing my small project and I'm not sure if it's even possible to do this the way I describe below, so... I have dynamic routes like "/[community]/[townName]". How can I generate static paths where [townName] is constrained to [community]?
In other words - let's say we have the townName "abc1". This town is in community "xyz1", so the page /xyz1/abc1 should be accessible and NOT throw a 404. But there is also a town with the same name "abc1" in "xyz2", so the path /xyz2/abc1/ should also be accessible.
However, there is no town with that name in community xyz3, so I do not want to generate a page for /xyz3/abc1/ - the user should see a 404 error.
Of course, each town has its own unique ID in the database and I could use that to generate pages, but I want my URLs to be SEO-friendly.
All help and tips are appreciated. Thanks!
You should check out getStaticPaths and dynamic routes on the official Next.js website. There is more than one way to do what you want, along with additional customization options.
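The path-generation logic itself is the easy part; here's a language-agnostic sketch (written in Python, with made-up data) of expanding a community-to-towns mapping into exactly the valid paths, which is the shape of what getStaticPaths would return:

# Made-up community -> towns mapping; in practice this comes from the database
communities = {
    "xyz1": ["abc1"],
    "xyz2": ["abc1", "abc2"],  # "abc1" also exists here, so /xyz2/abc1 is valid
    "xyz3": ["def5"],          # no "abc1" here, so /xyz3/abc1 is never generated
}

paths = [
    {"community": community, "townName": town}
    for community, towns in communities.items()
    for town in towns
]
# getStaticPaths would return these as the `params` objects; any path not
# in the list (e.g. /xyz3/abc1) falls through to a 404.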
I could really use some help with a problem I'm facing. I have a project where I'm supposed to fetch the names and prices of some products, retrieving the data from the first 5 pages of a given category. I'm trying to implement it in R with the rvest package and the SelectorGadget extension to choose the appropriate CSS selectors. I've written a function to do that:
library(rvest)

readDataProject2 <- function() {
  url <- readline(prompt = "Enter url: ")
  nameTags <- readline(prompt = "Enter name tags: ")
  priceTags <- readline(prompt = "Enter price tags: ")
  page <- read_html(url)  # parse the page once and reuse it
  itemNames <- page %>% html_nodes(nameTags) %>% html_text()
  itemPrices <- page %>% html_nodes(priceTags) %>% html_text()
  itemPrices <- itemPrices[-c(1, 2)]  # drop the two non-product price nodes
  cbind(itemNames, itemPrices)  # return the name/price matrix
}
And here's the page: anesishome.gr. From this specific page I can go to the next and so on, to fetch a total of 240 products. But even when I provide the URL for the next (second) page, I keep getting the data of the first page. Needless to say, choosing the option to show all 240 products on a single page didn't do any good. Can anybody point out what I'm doing wrong?
I'm doing a project for college that involves web scraping. I'm trying to get all the links to the players' profiles on this website (http://www.atpworldtour.com/en/rankings/singles?rankDate=2015-11-02&rankRange=1-5001). I've tried to grab the links with the following code:
library(XML)
doc_parsed <- htmlTreeParse("ranking.html", useInternalNodes = TRUE)
root <- xmlRoot(doc_parsed)
hrefs1 <- xpathSApply(root, fun = xmlGetAttr, "href", path = '//a')
"ranking.html" is the saved link. When I run the code, it gives me a list with 6887 instead of the 5000 links of the players profiles.What should I do?
To narrow down to the links you want, you must include in your XPath expression attributes that are unique to the elements you are after. The best and fastest option is using ids (which should be unique). Next best is using paths under elements with specific classes. For example:
hrefs1 <- xpathSApply(root, fun = xmlGetAttr, "href", path = '//td[@class="player-cell"]/a')
By the way, the page you link to has at the moment exactly 2252 links, not 5000.
The database that I am scraping is located here: https://www2.cslb.ca.gov/OnlineServices/CheckLicenseII/CheckLicense.aspx
What I would like to do is:
Use a wildcard search to find companies with "roof" or "roofing" in the name.
I can perform the search "roof%". However, "%roof" or "* roof%" doesn't work. I am more interested in figuring out how to make the latter query work.
Example:
xyz roofing co
cal roofers inc
Can someone help me with this?
There's no such thing as an aspx database.
There's no way for you to perform a substring search, as you are attempting to do, if the site doesn't support it. The 'search tips' page tells you exactly what is valid:
https://www2.cslb.ca.gov/OnlineServices/CheckLicenseII/CheckLicense.aspx
We have no 'magical' way of making the site do more than it was designed to do.
I have a Drupal 7 website that is running apachesolr search and is using faceting through the facetapi module.
When I use the facets to narrow my searches, everything works perfectly and I can see the filters being added to the search URL, so I can copy them as links (ready-made narrowed searches) elsewhere on the site.
Here is an example of how the apachesolr URL looks after I select several facets/filters:
search_url/search_keyword?f[0]=im_field_tag_term1%3A1&f[1]=im_field_tag_term2%3A100
Where the 'search_keyword' portion is the text I'm searching for and '%3A' is just the URL-encoded ':' (colon).
Knowing this format, I can create any number of ready-made searches by creating the correct format for the URL. Perfect!
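As an aside, here's a quick sketch (in Python, purely illustrative) of how such a filter URL can be percent-encoded programmatically:

from urllib.parse import urlencode

# the two filter values from the example URL above
params = {"f[0]": "im_field_tag_term1:1", "f[1]": "im_field_tag_term2:100"}
url = "search_url/search_keyword?" + urlencode(params)
# urlencode turns ':' into %3A (and also encodes the brackets as f%5B0%5D,
# which decodes back to the same parameters)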
However, these filters are always ANDed, the same way they are when using the facet interface. Does anyone know if there is a syntax I can use, specifically in the search URL, to OR my filters/facets? Meaning, to make the result contain all entries that match EITHER of the two filters?
Thanks in advance for any help or pointers in the right direction!
New edit:
I do know how to OR terms within the same vocabulary through the URL; I'm just wondering how to do it for terms in different vocabularies. ;-)
You can write a filter query that looks like:
fq=field1:value1 OR field2:value2
Alternatively you can use localparams to specify the query operator:
fq={!q.op=OR}field1:value1 field2:value2
As far as I know, there's no easier way to do this. There is, in fact, a rather old bug report asking for a way to OR the fq parameters...
I finally found a way to do this in Drupal:

Enable the fq parameter setting:
1. Go to admin/config/search/apachesolr/[your_search_page]/core_search/edit, or just navigate to the settings of the search page you're trying to modify.
2. Check the 'Allow user input using the URL' setting.

URL syntax:
Add the following at the end of the URL: ?fq=tid:(16 OR 38), where 16 and 38 are the term IDs.