Trying to search through text on a website in PlayWright API - css

The text I'm searching for is all contained within a CSS class called "content-center", and within that is a series of CSS classes, all with the same name, that hold similar but different information. It seems to only return [<JSHandle preview=JSHandle#node>] rather than the text itself, as if saying "yes, this text is on the page X times".
page.wait_for_selector('.content-center')
print(page.query_selector_all(".content-center:has-text('Bob Johnson')"))

page.query_selector_all returns a list of ElementHandle objects for the elements that were found. You can loop over these and call the text_content() method to get the text of each specific element.
Also, in most cases it's enough to use text selectors to verify that something is on the page or that an element has text; see here for reference.
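For illustration, a minimal sketch of that loop using the sync API; the URL is a placeholder:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")  # placeholder URL
    page.wait_for_selector(".content-center")
    # query_selector_all returns a list of ElementHandle objects
    for handle in page.query_selector_all(".content-center:has-text('Bob Johnson')"):
        print(handle.text_content())  # the text of this specific element
    browser.close()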

Related

Scrapy response returns empty list

I'm new to Scrapy and I'm trying to extract data about sports bets from sportsbooks.
I am currently trying to extract data from the upcoming matches in the Premier League: https://sport.mrgreen.com/da-DK/filter/football/england/premier_league
(The site is in Danish)
First I used the "fetch" command on the website, and I am able to get something back using the "response" command with both CSS and XPath from the body of the HTML code. However, when I want to extract data beyond a certain point in the HTML code ("div data-ui-view"), response just returns an empty list. (See picture)
[Screenshot: the relevant XPath is encircled in red]
I get something back when I run the following:
response.xpath('/html/body/div[1]/div')
I have tried using both a CSS selector on the innermost class I could find around the data I want to extract, and the direct XPath as well. Still only an empty list.
response.xpath('/html/body/div[1]/div/div')
(The above code returns "[]")
response.xpath('response.xpath('/html/body/div[1]/div/div/div[2]/div/div/div[1]/div/div[3]/div[2]/div/div/div/div/div/div[4]/div/div[2]/div/div/ul/li[1]/a/div/div[2]/div/div/div/div/button[1]/div/div[1]/div'))
(The above xpath is to a football club name)
Does anybody know what the problem might be? Thanks
You can't do response.xpath(response.xpath()); one response is enough. Also, I always use "" instead of '', and I avoid the full XPath - that rarely works. Instead, try with .//div and see what it returns, and for better results, use the search options that XPath has, like response.xpath(".//div[contains(text(), 'Chelsea Wolves')]//text()"). Make sure your response.url matches the URL you want to scrape.
Remember, a short and specific XPath is better than a long and ambiguous one.
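As a minimal sketch of that advice in the scrapy shell (the team name is just an example string from the page):
scrapy shell "https://sport.mrgreen.com/da-DK/filter/football/england/premier_league"
# a short, relative XPath with a text condition instead of a brittle absolute path
response.xpath(".//div[contains(text(), 'Chelsea')]//text()").getall()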

scrapy link extractor by value of html tag

I'm using scrapy to scrape privacy policies by crawling a website from its homepage. As such, I want to intelligently crawl specific links within pages containing specific keywords (privacy, data, protection, etc.).
I saw that scrapy's CrawlSpider and the LinkExtractor object allow for just that; however, I would like the LinkExtractor to apply a regex not only to the discovered links, but also to the text within the <a></a> tags, in order to better identify cases like this:
Check out our privacy policy
Here, the URL might not be a perfect match, but the text within the HTML tags is more helpful.
I saw that scrapy's LinkExtractor object already has an argument called process_value, which can run an operation on the extracted value, but I'm unsure how I could "return a positive link match" (like the regex given in the allow parameter would) and thus "add this link to the list of things for the CrawlSpider object to parse".
You’ll be able to do this in Scrapy 1.7.0 or later. See #3635.
The changes add a restrict_text parameter to LinkExtractor. From the master branch of the Scrapy documentation on LinkExtractor:
restrict_text (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the link’s text must match in order to be extracted. If not given (or empty), it will match all links. If a list of regular expressions is given, the link will be extracted if it matches at least one.
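For illustration, a minimal CrawlSpider sketch using restrict_text; the start URL and keyword list are placeholders:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class PrivacySpider(CrawlSpider):
    name = "privacy"
    start_urls = ["https://example.com"]  # placeholder homepage

    rules = (
        # follow only links whose anchor text matches privacy-related keywords
        Rule(
            LinkExtractor(restrict_text=[r"privacy", r"data protection"]),
            callback="parse_policy",
        ),
    )

    def parse_policy(self, response):
        yield {"url": response.url}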

How to verify a text is present on a webpage for 'n' times

I want to verify that a text on a webpage exists 2 times, or 'n' times. I have used the "Page Should Contain" keyword, but it passes when it finds a single occurrence. I don't want to verify using a locator.
Ex: I want to verify that the text "Success" appears on the current webpage 3 times, using Robot Framework.
Any inputs/suggestions would be helpful.
Too bad you don't want to use a locator, as robotframework has a keyword just for that:
Xpath Should Match X Times    //*[contains(., "Success")]    2
The caveat is that the locator should not be prefixed with xpath= - just the plain XPath expression.
The library keyword Page Should Contain does pretty much exactly that, by the way.
And if you want to find how many times the string is present in the page - easy:
${count}=    Get Matching Xpath Count    //*[contains(., "Success")]
And then do any kind of checks on the result, e.g.
Should Be Equal    ${count}    2
I thought the problem of not using a locator sounded fun (the rationale behind the requirement is still unclear), so here's another solution - look in the source yourself:
${source}=    Get Source    # you have the whole html of the page here
${matches}=    Get Regexp Matches    ${source}    >.*\\b(Success)\\b.*<
${count}=    Get Length    ${matches}
The first line gets the source, the second gets all non-overlapping (separate) occurrences of the target string when it is (hopefully) inside a tag, and the third returns the count.
Disclaimer - please don't actually do that, unless you're 100% sure of the source and the structure. Use a locator.
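For comparison, the same count-from-the-source idea as a Python sketch (using Playwright's sync API purely as an example; the URL and expected count are placeholders):
import re
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")  # placeholder URL
    source = page.content()  # the whole HTML of the page
    # non-overlapping occurrences of "Success" appearing as element text
    count = len(re.findall(r">[^<]*\bSuccess\b[^<]*<", source))
    assert count == 2, f"expected 2 occurrences, found {count}"
    browser.close()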

Get Plain Text List of Top '#x' Search Terms

Is there a place I can find the top x number of search terms (preferably from Google) and import them via something such as an HTTP Request/Post?
For that you'll need to use Google's Custom Search API. Specifically, you can use the num parameter to tell it how many results to return. Also, items[].title will give you the title in plain text, and items[].snippet will give you the snippet in plain text.
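A minimal Python sketch of such a request; the API key and search engine ID are placeholders you would create in the Google developer console:
import requests

url = "https://www.googleapis.com/customsearch/v1"  # Custom Search JSON API
params = {
    "key": "YOUR_API_KEY",   # placeholder credential
    "cx": "YOUR_ENGINE_ID",  # placeholder search engine ID
    "q": "example query",
    "num": 10,               # how many results to return (max 10 per request)
}
data = requests.get(url, params=params).json()
for item in data.get("items", []):
    print(item["title"], "-", item["snippet"])  # plain-text title and snippet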

How to extract element id attribute values from HTML

I am trying to work out the overhead of the ASP.NET auto-naming of server controls. I have a page which contains 7,000 lines of HTML rendered from hundreds of nested ASP.NET controls, many of which have id / name attributes that are hundreds of characters in length.
What I would ideally like is something that would extract every HTML attribute value that begins with "ctl00" into a list. The regex Find function in Notepad++ would be perfect, if only I knew what the regex should be?
As an example, if the HTML is:
<input name="ctl00$Header$Search$Keywords" type="text" maxlength="50" class="search" />
I would like the output to be something like:
name="ctl00$Header$Search$Keywords"
A more advanced search might include the element name as well (e.g. control type):
input|name="ctl00$Header$Search$Keywords"
In order to cope with both Id and Name attributes I will simply rerun the search looking for Id instead of Name (i.e. I don't need something that will search for both at the same time).
The final output will be an Excel report that lists the number of server controls on the page and the length of the name of each, possibly sorted by control type.
Quick and dirty:
Search for
\w+\s*=\s*"ctl00[^"]*"
This will match any text that looks like an attribute, e.g. name="ctl00test" or attr = "ctl00longer text". It will not check whether this really occurs within an HTML tag - that's a little more difficult to do and perhaps unnecessary? It will also not check for escaped quotes within the tag's name. As usual with regexes, the complexity required depends on what exactly you want to match and what your input looks like...
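As a quick sanity check, the same pattern applied in Python to the example input (a stand-in for Notepad++'s Find dialog):
import re

html = '<input name="ctl00$Header$Search$Keywords" type="text" maxlength="50" class="search" />'
# attribute-like text whose quoted value starts with "ctl00"
for match in re.findall(r'\w+\s*=\s*"ctl00[^"]*"', html):
    print(match)  # name="ctl00$Header$Search$Keywords"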
"7000"? "Hundreds"? Dear god.
Since you're just looking at source in a text editor, try this... /(id|name)="ct[^"]*"/
Answering my own question, the easiest way to do this is to use BeautifulSoup, the 'dirty HTML' Python parser whose tagline is:
"You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like. Neither does this parser."
It works, and it's available from here - http://crummy.com/software/BeautifulSoup
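A sketch of that approach with BeautifulSoup (bs4); the attribute walk below is an illustration, not code from the original answer:
from bs4 import BeautifulSoup

html = '<input name="ctl00$Header$Search$Keywords" type="text" maxlength="50" class="search" />'
soup = BeautifulSoup(html, "html.parser")

# every attribute value that begins with "ctl00", with the element name and length
for tag in soup.find_all(True):
    for attr, value in tag.attrs.items():
        if isinstance(value, str) and value.startswith("ctl00"):
            print(f'{tag.name}|{attr}="{value}" (length {len(value)})')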
I suggest xpath, as in this question
