I have a script that scrapes items by class with requests-html.
reviewtext = r.html.find(
'strong.reviews__item-title', first=True).text
However, although that class is assigned to multiple elements (reviews) on the page, only one item (the first review) gets scraped.
How do I implement a for loop or something similar so that my program scrapes the first 3, or some other fixed number, of product reviews instead of just the first?
An example product URL I am trying to scrape: https://www.coolblue.de/produkt/832192/eufy-by-anker-robovac-35c.html#product-reviews
Try:
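# find() without first=True returns a list of all matching elements
reviewtexts = r.html.find('strong.reviews__item-title')
for reviewtext in reviewtexts[:3]:  # slice to keep only the first 3 reviews
    print(reviewtext.text)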
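To cap the number of reviews in one place, the loop can be wrapped in a small helper. This is only a sketch: `first_n_texts` is a made-up name, and the stand-in objects below merely mimic the `.text` attribute of the elements that `find()` returns, so the snippet runs without touching the network.

```python
from itertools import islice
from types import SimpleNamespace


def first_n_texts(elements, n=3):
    """Return the .text of at most n elements, in page order."""
    return [element.text for element in islice(elements, n)]


# Stand-in objects (made up) that mimic the .text attribute of scraped elements.
fake_reviews = [SimpleNamespace(text=f"Review {i}") for i in range(1, 6)]
top_three = first_n_texts(fake_reviews)
print(top_three)
```

In the real script the call would presumably look like `first_n_texts(r.html.find('strong.reviews__item-title'), 3)`.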
I need a single row for Engineer with a count of 2 in Head Count.
I have attached a screenshot. The Category fields are fetched from subform rows, and several rows can contain the same category. I want to show each category only once in this table, with its count in Head Count. For example, Engineer - 2, and not the way it is currently shown.
Use the distinct option from Zoho Creator data access task to get the unique list of Category. Then use this list to fetch the count of each item.
This also depends on how the subform was included in the main form and whether you are looking to fetch the count per each record of the main form or for all records in the main form.
If the subform was created directly inside the main form, you will have to follow the above method. If the subform was created separately and then embedded into the main form, you can directly apply the Count and Distinct operations instead of iterating through all values.
Alternatively, store the categories in a list: while looping over the rows, check whether the category already exists in the list, and add it only if it does not.
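The loop-based idea is not specific to Zoho Creator, so here is a rough sketch of the logic in Python rather than Deluge. The row data and the `Category` key are made up for illustration.

```python
# Hypothetical subform rows; only the Category value matters here.
rows = [
    {"Category": "Engineer"},
    {"Category": "Designer"},
    {"Category": "Engineer"},
]

head_count = {}
for row in rows:
    category = row["Category"]
    if category not in head_count:
        # First time this category is seen: start its count at zero.
        head_count[category] = 0
    head_count[category] += 1

for category, count in head_count.items():
    print(f"{category} - {count}")
```

Each category ends up as a single key, so it is displayed once together with its count.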
I am relatively new to WordPress but have a lot of coding experience so I am hoping to pick it up quite quickly.
I am looking to create a website on WordPress that uses a database.
For example, I have made a table of animals consisting of 100 rows, one for each animal, and this is accessible on phpMyAdmin. I want my website to have a page for each row entry in this database.
Ideally, I would like a page to be shown to the user which contains filters. So the user can select maybe "Small" animals which are "Nocturnal". This will then use the database to find all animals which are "Small" and "Nocturnal", and then the page will display the links to the animals which match these filters. Say only "Hamster" matches these filters in my database, then the link to the Hamster page would display on the website.
Then the user can click the Hamster page to find out more about the animal. So I need a page for each animal, but wondered if I could somehow link this to a database I have to help with the filtering options.
Thanks.
Step 1: Create the animals page and display all of them
Set up a page template to show all your animals first, without filters. That should get you started.
To retrieve them, modify the page template file you created above to fetch all the rows in the animals table, like so:
global $wpdb;
$feeds = $wpdb->get_results("SELECT * FROM animals", ARRAY_A);
Iterate through the rows to print them out into table form.
Step 2: Create the filters
Create the links on the front end using query parameters, e.g. for a link to all nocturnal animals, use something like:
echo '<a href="/animals/?animal_type=nocturnal">Nocturnal</a>';
Step 3: Handle a specific query variable
At the top of your animals template page, use the following code to get a query variable:
$animal_type = get_query_var('animal_type');
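So if you accessed the URL http://www.example.com/animals/?animal_type=nocturnal then the value of $animal_type would now be nocturnal. Note that get_query_var() only returns custom variables that have been registered through the query_vars filter.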
Now create a new function in functions.php to query for animals using query parameters. Something like this:
$args = array('animal_type' => $animal_type, 'animal_size' => $animal_size);
// get_animals() is the function you define in functions.php
$animals = get_animals($args);
And iterate through them to print them out.
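The filtering step itself can be sketched independently of WordPress. The following Python example uses an in-memory SQLite table to show one way a `get_animals` function might turn optional filters into a parameterized query. The column names (`animal_type`, `animal_size`) and the sample rows are assumptions for illustration, not taken from your database.

```python
import sqlite3

# Columns a visitor is allowed to filter on (hypothetical names).
ALLOWED_FILTERS = {"animal_type", "animal_size"}


def get_animals(conn, filters):
    """Return names of animals matching every non-empty filter."""
    clauses, values = [], []
    for column, value in filters.items():
        # Only whitelisted columns reach the SQL text; values are bound.
        if column in ALLOWED_FILTERS and value:
            clauses.append(f"{column} = ?")
            values.append(value)
    sql = "SELECT name FROM animals"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY name"
    return [row[0] for row in conn.execute(sql, values)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE animals (name TEXT, animal_type TEXT, animal_size TEXT)")
conn.executemany(
    "INSERT INTO animals VALUES (?, ?, ?)",
    [
        ("Hamster", "nocturnal", "small"),
        ("Owl", "nocturnal", "medium"),
        ("Rabbit", "diurnal", "small"),
    ],
)

matches = get_animals(conn, {"animal_type": "nocturnal", "animal_size": "small"})
print(matches)
```

In the PHP version the same idea would typically go through $wpdb->prepare(), so that the filter values are never concatenated into the query string directly.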
I am trying to scrape a web forum using Scrapy for the href link info and when I do so, I get the href link with many letters and numbers where the question mark should be.
This is a sample of the html document that I am scraping:
I am scraping the html data for the href link using the following code:
response.xpath('.//*[contains(@id, "thread_title")]/@href').extract()
When I run this, I get the following results:
[u'showthread.php?s=f969fe6ed424b22d8fddf605a9effe90&t=2676278']
What should be returned is:
[u'showthread.php?t=2676278']
I have run other tests scraping for href data with question marks elsewhere in the document, and I also get the "s=f969fe6ed424b22d8fddf605a9effe90&" returned.
Why am I getting this data returned with the "s=f969fe6ed424b22d8fddf605a9effe90&" instead of just the question mark?
Thanks!
It seems that the site I am scraping uses a unique identifier in order to update the number of views per thread more accurately. I was not able to get the scraped data back without the unique ID, and the ID changed over time. So I scraped a different HTML tag for the thread ID and then joined it to the web address (showthread.php?t=) to create the link I was looking for.
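A different option from the one described above is to post-process the extracted href and drop the `s` parameter with the standard library. This is a sketch, not what the poster did: `strip_query_param` is a made-up helper, and it assumes `s` is the only parameter that needs to go.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_query_param(url, name="s"):
    """Return url with every occurrence of the query parameter `name` removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != name
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


cleaned = strip_query_param(
    "showthread.php?s=f969fe6ed424b22d8fddf605a9effe90&t=2676278"
)
print(cleaned)
```

The helper could be applied to each item of the list returned by `.extract()`.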
Is there a way to hide a specific object from my catalog results?
I have a configuration file that I don't want to show.
I'm filtering by id, but that seems ugly.
from Products.CMFCore.utils import getToolByName

def search(context):
    catalog = getToolByName(context, 'portal_catalog')
    items = catalog()
    for item in items:
        if item.id != "config_file":
            'do something'
If you are already hiding the object from the navigation tree, you can filter on the same property by testing for exclude_from_nav:
items = catalog()
for item in items:
    if item.exclude_from_nav:
        continue
    # do something with all objects *not* excluded from navigation.
It is harder to filter out things that don't match a criterion. Using a test on the brain object like the above is a perfectly fine way to remove a small subset from your result set.
If you need to handle a larger percentage of 'exceptions', you may need to rethink your architecture.
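If the same test is needed in several places, it can be pulled into a small generator. The sketch below uses stand-in objects instead of real catalog brains so that it runs outside Plone; `visible_brains` and `hidden_ids` are made-up names.

```python
from types import SimpleNamespace


def visible_brains(brains, hidden_ids=frozenset({"config_file"})):
    """Yield results, skipping hidden ids and items excluded from navigation."""
    for brain in brains:
        if brain.id in hidden_ids or getattr(brain, "exclude_from_nav", False):
            continue
        yield brain


# Stand-ins (made up) for the brains a catalog query would return.
results = [
    SimpleNamespace(id="front-page", exclude_from_nav=False),
    SimpleNamespace(id="config_file", exclude_from_nav=False),
    SimpleNamespace(id="hidden-folder", exclude_from_nav=True),
]

shown_ids = [brain.id for brain in visible_brains(results)]
print(shown_ids)
```

Inside the original `search(context)` function the call would presumably be `visible_brains(catalog())`.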
With Products.AdvancedQuery you can create advanced queries and filtering on catalog results. See also this how to.
In the general case, setting a content item's expiration date to some past date hides it from search results (so long as the user does not have the Access inactive portal content permission).
It's an easy way to hide pieces of content that should be visible to all, but that you don't want cluttering up search results e.g. a Document that serves as the homepage.
I always use 1st Jan 2001 as the date so I can easily recognise when I've used this little 'hack'.
I am new to the HtmlAgilityPack and it's a bit unclear to me how exactly it works. Let's say a piece of code like this is written:
Dim url1 As String = "http://www.bing.com/search?q=Verizon"
Dim hw As New HtmlWeb()
Dim doc As HtmlDocument = hw.Load(url1)
For Each link As HtmlNode In doc.DocumentNode.SelectNodes("//a[@href]")
    Dim att As HtmlAttribute = link.Attributes("href")
    Response.Write(att.Value)
Next
So when the SelectNodes argument is //a[@href], does that mean that it will only look at a tags with an href?
If so, how can I make it consider other tags within the loop, like <li>, <h3>, <div>?
Is something like //li[@class='wrap']|//div[@class='last'] correct?
How can the data between those tags be fetched and presented?
One other issue: let's say I need to scrape a telephone number from that URL. The number might be unavailable, or might not be in any of the tags defined. Is there a reliable method I can use to obtain a telephone number related to a search term? Any suggestions or thoughts?
Indeed, the current XPath looks at anchor tags that have an href attribute. I suggest you read up on XPath syntax (for instance at http://www.w3schools.com/xpath/xpath_syntax.asp).
To select other nodes you need to change the xpath to select those tags, for instance:
doc.DocumentNode.SelectNodes("//li")
to get all li nodes etc.
The data in the tags can be reached using the InnerHtml of the selected document nodes (link.InnerHtml in your example), or InnerText if you only want the text without markup.
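To get a feel for what these XPath predicates select, here is a small sketch using Python's standard library rather than HtmlAgilityPack. The HTML snippet is made up, and ElementTree only supports a subset of XPath (no `|` union), so each expression is run separately.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed snippet standing in for a downloaded page.
SNIPPET = """
<html><body>
  <a href="http://example.com/one">One</a>
  <a name="no-href">Anchor without href</a>
  <ul><li class="wrap">First item</li><li>Second item</li></ul>
  <div class="last">Closing block</div>
</body></html>
"""

root = ET.fromstring(SNIPPET.strip())

# a[@href] matches only anchors that carry an href attribute.
links = [a.get("href") for a in root.findall(".//a[@href]")]
# [@class='wrap'] matches elements whose class attribute equals 'wrap'.
wrapped = [li.text for li in root.findall(".//li[@class='wrap']")]
last_divs = [div.text for div in root.findall(".//div[@class='last']")]

print(links, wrapped, last_divs)
```

The anchor without an href and the second list item are skipped, which is the same filtering the predicates perform in SelectNodes.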
Automatically scraping telephone numbers is a real pain. Every country uses different lengths, and there are many different formats for writing down a number: +12(0)3456 +123456 00123456 +12(0)34-56 are all the same valid phone number... See Check if there is phone number in string C# for a simple solution.
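To illustrate why the formats above are equivalent, here is a rough Python sketch that normalizes them to one canonical string. The rules (drop a "(0)" trunk prefix, treat a leading "00" as "+", discard all other punctuation) are simplifying assumptions and will not hold for every country.

```python
import re


def normalize_phone(raw):
    """Reduce a phone number to '+' followed by digits (simplified rules)."""
    number = raw.strip().replace("(0)", "")
    if number.startswith("00"):
        # Treat the international call prefix 00 like a leading plus sign.
        number = "+" + number[2:]
    digits = re.sub(r"\D", "", number)
    return "+" + digits if number.startswith("+") else digits


samples = ["+12(0)3456", "+123456", "00123456", "+12(0)34-56"]
normalized = {normalize_phone(sample) for sample in samples}
print(normalized)
```

Comparing normalized strings is a cheap way to deduplicate candidates. Finding the candidates in free text in the first place is the harder problem.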
GL&HF!