From a list of links like these:
3
Next >
How can I only get the one that has the attribute title="Next"?
Using:
//@href
The value " title="Next">Next" is lose, so it can't be used the filter the urls.
response.xpath("//a[@title='Next']/@href").extract_first()
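The `[@title='Next']` predicate can be checked outside Scrapy with the standard library's ElementTree, which supports this form of attribute predicate. A minimal sketch, assuming a made-up fragment shaped like the page's pager links:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment standing in for the page's list of links
html = """
<div>
  <a href="/page/3" title="3">3</a>
  <a href="/page/4" title="Next">Next &gt;</a>
</div>
"""

root = ET.fromstring(html)
# ElementTree supports the [@attr='value'] predicate used here
link = root.find(".//a[@title='Next']")
print(link.get("href"))  # /page/4
```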
Currently the yield in my Scrapy spider looks as follows:
yield {
'hreflink':mylink,
'Parentlink':response.url
}
This returns me a dict
{
'hreflink':"https://www.southeasthealth.org/wp-content/uploads/Southeast-Health-Standard-Charges-2022.xlsx",
'Parentlink': "https://www.southeasthealth.org/financial-information-price-transparency/"
}
Now, I also want the 'text' that is associated with this particular hreflink, in that particular Parentlink. So my final output should look like
{
'hreflink':"https://www.southeasthealth.org/wp-content/uploads/Southeast-Health-Standard-Charges-2022.xlsx",
'Parentlink': "https://www.southeasthealth.org/financial-information-price-transparency/",
'Yourtext' : "Download Pricing Info"
}
What would be the simplest way to achieve that? I want to use XPath expressions to get the "text" in a Parentlink where the href attribute equals the given @href value.
So far, here is what I tried:
Yourtext = response.xpath('//a[@href=' + json.dumps(each) + ']//text()').get()
but it's not printing anything. I tried printing my response, and it returns the right page: 'https://www.southeasthealth.org/financial-information-price-transparency/'
If I understand you correctly you want to get the text belonging to the link Download Pricing Info.
I suggest you try using:
response.xpath("//span[#class='fusion-button-text']//text()").get()
I found the answer to my question.
'//a[@href=' + json.dumps(each) + ']//text()'
This is the correct expression; however, the href value 'each' is case-sensitive and needs to match exactly for this XPath to work.
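The role `json.dumps` plays here can be seen in isolation: it wraps the URL in double quotes, producing a valid XPath string literal. A minimal sketch (the helper name is illustrative, not from the spider):

```python
import json

def xpath_for_href(url):
    # json.dumps adds the surrounding double quotes that XPath
    # needs for a string literal; the comparison it produces is
    # an exact, case-sensitive match on the href attribute
    return '//a[@href=' + json.dumps(url) + ']//text()'

expr = xpath_for_href("https://www.southeasthealth.org/wp-content/uploads/Southeast-Health-Standard-Charges-2022.xlsx")
print(expr)
```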
Using Robot Framework's Playwright I am trying to get elements from a page, their href value, like href='http://www.mywebpage.com/customer/create/signup.html' and put them into a list.
From there I will ultimately try to compare if each element is similar to the substring of my base URL (which should be contained in each href link value), and then if each link returns a 200 response.
So far my code does get a list of elements(this is good), but each element looks something like this, "element=5cf0c691-41b1-4a3e-9347-d1a2bf0a48b8" instead of a URL ... I will want to query if each list element is similar to a base URL, so I want each element to be the href URL.
So, how can I evaluate what gets put into the list ${element_href} as the equivalent href URL, instead of something like "element=5cf0c691..." which I am currently getting?
Broken links test
${element_list}= browser.get elements xpath=//*[starts-with(@href, 'http://')] #same as above that finds 45 elements.
Log ${element_list}
Create Session testing ${BASE_URL}
FOR ${element_href} IN @{element_list} #get text gets same as without it.
${uri}= Remove String ${element_href} ${BASE_URL}
${contains_base_url}= Evaluate "${BASE_URL}" in "${element_href}"
${response}= Run Keyword If ${contains_base_url} Get Request testing ${uri}
Run Keyword If ${contains_base_url} Should Be Equal As Strings ${response.status_code} 200
END
I found an answer to getting the page's href URLs into a list. I'm still struggling to compare each list item with the base URL and then check that each navigates with a 200 response, but this progress answers what I specifically asked above: how to gather elements on the page, namely their href attribute values, into a list. The 4 commented lines before "END" are what I'm trying to get to work next, so far unsuccessfully. Here is the code:
Broken links test
${element_list}= get elements xpath=//a[@href] #ORIGINAL
Create Session testing ${BASE_URL}
@{element_attribute_list}= Create List
FOR ${element_href} IN @{element_list} #get text gets same as without it.
${element_attribute}= get attribute ${element_href} href
Append To List ${element_attribute_list} ${element_attribute}
# ${uri}= Remove String ${element_attribute_list} ${BASE_URL}
# ${contains_base_url}= Evaluate "${BASE_URL}" in "${element_attribute_list}"
# ${response}= Run Keyword If ${contains_base_url} Get Request testing ${uri}
# Run Keyword If ${contains_base_url} Should Be Equal As Strings ${response.status_code} 200
END
Results: they're not perfect, but the code is appending each href to the list, or so it appears:
KEYWORD Collections . Append To List ${element_attribute_list}, ${element_attribute}
Documentation:
Adds values to the end of list.
Start / End / Elapsed: 20220130 22:41:18.685 / 20220130 22:41:18.685 / 00:00:00.000
22:41:18.685 TRACE Arguments: [ ['http://dev.mysite.net/PAP-4414/',
'#',
'http://dev.mysite.net/PAP-4414/customer/account/create/',
'http://dev.mysite.net/PAP-4414/customer/account/login/',
'http://dev.mysite.net/PAP-4414/checkout/cart/',
'/courses.html',
'#',
'http://dev.mysite.net/PAP-4414/certification-courses/#swm'] | 'http://dev.mysite.net/PAP-4414/certification-courses/#dwc' ]
22:41:18.685 TRACE Return: None
Any additional thoughts are welcome! I appreciate the 10 Views, although no comments so far.
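The comparison step in the commented-out lines amounts to: keep only the hrefs that contain the base URL, strip the base off to get a URI, then request each one. A rough sketch of that filtering logic in Python terms (the BASE_URL value is assumed from the trace above, and no HTTP requests are made here):

```python
BASE_URL = "http://dev.mysite.net/PAP-4414"  # assumed from the trace above

def internal_uris(hrefs, base_url):
    # Keep only links containing the base URL, reduced to their path --
    # mirroring the Evaluate "in" check plus Remove String in the Robot code
    return [h.replace(base_url, "") for h in hrefs if base_url in h]

hrefs = [
    "http://dev.mysite.net/PAP-4414/customer/account/create/",
    "#",
    "/courses.html",
]
print(internal_uris(hrefs, BASE_URL))  # ['/customer/account/create/']
```

Each URI this returns would then be passed to Get Request and checked for a 200 status.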
I have a dataframe with a column containing a URL.
test = data.frame(id = 1, url = "https://www.georisques.gouv.fr/risques/installations/donnees/details/0030.12015")
Using this, I would like to retrieve an element in the web page. Specifically, I would like to retrieve the value of the activity state.
https://zupimages.net/viewer.php?id=20/51/t1fx.png
Thanks to my research, I was able to find a code which allows to select the element thanks to its "XPath".
library(rvest)
page = read_html("https://www.georisques.gouv.fr/risques/installations/donnees/details/0030.12015")
page %>% html_nodes(xpath = '//*[@id="detailAttributFiche"]/div/p') %>% html_text() %>% as.character()
character(0)
As you can see, I always get character(0), as if it couldn't read the whole page. I suspect some JavaScript part is not loading properly...
What can I do?
Thank you.
The data is from this link (the etatActiviteInst parameter): https://www.georisques.gouv.fr/webappReport/ws/installations/etablissement/0030-12015
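Since the value comes from that JSON web service rather than from the rendered page, it can be read directly out of the response body instead of scraping the HTML. A sketch of the extraction step in Python (the sample payload and its value are illustrative; only the etatActiviteInst field name comes from the service):

```python
import json

# Illustrative payload shaped like the web service's response;
# only the etatActiviteInst field name is taken from the service,
# the value here is an assumption
sample = '{"etatActiviteInst": "En fonctionnement"}'

record = json.loads(sample)
print(record["etatActiviteInst"])  # En fonctionnement
```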
I need to extract an activation link to the clipboard; the link changes with every registration.
HTML Code:
Try something like this code:
SEARCH SOURCE=REGEXP:"(http://mctop.me/approve/\w+)" EXTRACT=$1
SET !CLIPBOARD {{!EXTRACT}}
Error -1200: parses "(http://mctop.me/approve/\w+)" - Unrecognized esc-sequence \w.
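The error indicates this regex engine does not recognize the `\w` escape; its spelled-out equivalent is the character class `[A-Za-z0-9_]`. Checked here in Python purely to show the rewritten pattern still captures the link (the sample HTML is made up):

```python
import re

# Made-up registration page fragment containing an activation link
sample = '<a href="http://mctop.me/approve/abc_123">Activate</a>'

# [A-Za-z0-9_]+ is the literal equivalent of \w+ (in ASCII mode)
pattern = re.compile(r"(http://mctop\.me/approve/[A-Za-z0-9_]+)")
match = pattern.search(sample)
print(match.group(1))  # http://mctop.me/approve/abc_123
```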
I often use http://www.regexper.com to view pictorial representations of regular expressions.
send a regular expression to the site
open the site with that expression displayed
For example, let's use the regex "\\s*foo[A-Z]\\d{2,3}". I'd go to the site and paste \s*foo[A-Z]\d{2,3} (note the removal of the double backslashes), and it returns the diagram.
I'd like to do this process from within R. Creating a wrapper function like view_regex("\\s*foo[A-Z]\\d{2,3}") and the page (http://www.regexper.com/#%5Cs*foo%5BA-Z%5D%5Cd%7B2%2C3%7D) with the visual diagram would be opened with the default browser.
I think RCurl may be appropriate, but this is new territory for me. I also see the double backslash as a problem, because http://www.regexper.com expects single backslashes while R needs doubles. I can get R to print a single backslash to the console using cat, as follows, so this may be how to approach it.
x <- "\\s*foo[A-Z]\\d{2,3}"
cat(x)
\s*foo[A-Z]\d{2,3}
Try something like this:
Query <- function(searchPattern, browse = TRUE) {
finalURL <- paste0("http://www.regexper.com/#",
URLencode(searchPattern))
if (isTRUE(browse)) browseURL(finalURL)
else finalURL
}
x <- "\\s*foo[A-Z]\\d{2,3}"
Query(x) ## Will open in the browser
Query(x, FALSE) ## Will return the URL expected
# [1] "http://www.regexper.com/#%5cs*foo[A-Z]%5cd%7b2,3%7d"
The above function simply pastes together the web URL prefix ("http://www.regexper.com/#") and the encoded form of the search pattern you want to query.
After that, there are two options:
Open the result in the browser
Just return the full encoded URL
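For comparison, the same two steps (percent-encode the pattern, prepend the site prefix) look like this in Python. The exact escapes differ slightly from R's URLencode output shown above, but the site should decode either form:

```python
from urllib.parse import quote

x = r"\s*foo[A-Z]\d{2,3}"
# quote with safe="" percent-encodes every reserved character,
# including the backslashes and braces in the pattern
url = "http://www.regexper.com/#" + quote(x, safe="")
print(url)  # http://www.regexper.com/#%5Cs%2Afoo%5BA-Z%5D%5Cd%7B2%2C3%7D
```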