With some help, I am able to extract the landing/main image of a URL. However, I would also like to be able to extract the subsequent images.
require(rvest)
url <-"https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-
Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-
spons&keywords=lunch+bag&psc=1"
webpage <- read_html(url)
r <- webpage %>%
  html_nodes("#landingImage") %>%
  html_attr("data-a-dynamic-image")
imglink <- strsplit(r, '"')[[1]][2]
print(imglink)
This gives the correct output for the main image. However, I would also like to extract the links that appear when I roll over to the other images of the same product. Essentially, I would like the output to contain the following links:
1. https://images-na.ssl-images-amazon.com/images/I/81bF%2Ba21WLL.UY500.jpg
2. https://images-na.ssl-images-amazon.com/images/I/81HVwttGJAL.UY500.jpg
3. https://images-na.ssl-images-amazon.com/images/I/81Z1wxLn-uL.UY500.jpg
4. https://images-na.ssl-images-amazon.com/images/I/91iKg%2BKqKML.UY500.jpg
5. https://images-na.ssl-images-amazon.com/images/I/91zhpH7%2B8gL.UY500.jpg
Many thanks
As requested, a Python script is at the bottom. To make this applicable across languages, the answer is in two parts: 1) a high-level pseudo-code description of the steps, which can be carried out in R, Python, and many other languages; 2) a Python example.
An R script to obtain the required string is shown at the end (steps 1-3 of the process).
1) Process:
Obtain the HTML via a GET request.
Regex out a substring from one of the script tags; this substring is in fact what the jQuery on the page uses to provide the image links from JSON.
The regex pattern is
jQuery\.parseJSON\(\'(.*)\'\);
The explanation is:
Basically, the contained JSON object is gathered starting at the { before "dataInJson" and ending just before the characters '). That extracts the JSON object as a string. The first capturing group (.*) gathers everything between the start string and the end string (excluding both).
The first match is the only one wanted, so of the matches returned, the first must be extracted. It is then handled with a JSON-parsing library that can take a string and return a JSON object.
That JSON object is looped over by key, accessing colorImages (in Python the structure is a dictionary; R will be slightly different) to get the colours of the product, which are in turn used to access the actual image urls themselves.
In that JSON, the colours sit at the top level under colorImages, with a nested level below each colour holding the image entries and their urls.
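For orientation, a rough sketch of the shape of that object; the colour keys and urls below are placeholders, only the colorImages and large names are taken from the page:
{
  "colorImages": {
    "ColourName1": [ { "large": "https://.../image1.jpg" }, { "large": "https://.../image2.jpg" } ],
    "ColourName2": [ { "large": "https://.../image3.jpg" } ]
  }
}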
2) Those steps shown in Python
import requests #library to handle xhr GET
import re #library to handle regex
import json
headers = {'User-Agent' : 'Mozilla/5.0', 'Referer':'https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-%20Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-%20spons&keywords=lunch+bag&psc='}
r = requests.get('https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-%20Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-%20spons&keywords=lunch+bag&psc=1', headers = headers)
p1 = re.compile(r'jQuery\.parseJSON\(\'(.*)\'\);')
data = p1.findall(r.text)[0]
json_source = json.loads(data)
for colour in json_source['colorImages']:
    for image in json_source['colorImages'][colour]:
        print(image['large'])
Output:
All the links for the product in all colours, but large image links only (so the urls appear slightly different and more numerous, but they are the same images).
R script to regex out required string and generate JSON:
library(rvest)
library(jsonlite)
library(stringr)
con <- url('https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-%20Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-%20spons&keywords=lunch+bag&psc=1', "rb")
page = read_html(con)
page %>%
html_nodes(xpath=".//script[contains(., 'colorImages')]")%>%
html_text() %>% as.character %>% str_match(.,"jQuery\\.parseJSON\\(\\'(.*)\\'\\);") -> res
json = fromJSON(res[,2][2])
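To complete the process in R as well (step 4), a minimal sketch of walking the parsed object by colour; this assumes jsonlite's default simplification turns each colour's array of image objects into a data frame with a large column:
# Sketch only: loop over json$colorImages by colour and print the 'large' urls
for (colour in names(json$colorImages)) {
  print(json$colorImages[[colour]]$large)
}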
They've updated the page so now just use:
Python:
import requests #library to handle xhr GET
import re #library to handle regex
headers = {'User-Agent' : 'Mozilla/5.0', 'Referer':'https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-%20Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-%20spons&keywords=lunch+bag&psc='}
r = requests.get('https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-%20Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-%20spons&keywords=lunch+bag&psc=1', headers = headers)
p1 = re.compile(r'"large":"(.*?)"')
links = p1.findall(r.text)
print(links)
R:
library(rvest)
library(stringr)
con <- url('https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-%20Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-%20spons&keywords=lunch+bag&psc=1', "rb")
page = read_html(con)
res <- page %>%
html_nodes(xpath=".//script[contains(., 'var data')]")%>%
html_text() %>% as.character %>%
str_match_all(.,'"large":"(.*?)"')
print(res[[1]][,2])
Related
I am trying to scrape information from multiple collapsible tables from a website called APIS.
An example of what I'm trying to collect is here http://www.apis.ac.uk/select-feature?site=1001814&SiteType=SSSI&submit=Next
Ideally I'd like to be able to have the drop-down heading followed by the information underneath, though when using rvest I can't seem to get it to select the correct section from the HTML.
I'm reasonably new to R; this is what I have from watching some videos about scraping:
link = "http://www.apis.ac.uk/select-feature?site=1001814&SiteType=SSSI&submit=Next"
page = read_html(link)
name = page %>% html_nodes(".tab-tables :nth-child(1)") %>% html_text()
the "name" value displays "Character (empty)"
It may be because I'm new to this and there's a really obvious answer but any help would be appreciated
The data for each tab comes from additional requests you can find in the browser network tab when pressing F5 to refresh the page. For example, the nutrients info comes from:
http://www.apis.ac.uk/sites/default/files/AJAX/srcl_2019/apis_tab_nnut.php?ajax=true&site=1001814&BH=&populateBH=true
Which you can think of more generally as:
scheme='http'
netloc='www.apis.ac.uk'
path='/sites/default/files/AJAX/srcl_2019/apis_tab_nnut.php'
params=''
query='ajax=true&site=1001814&BH=&populateBH=true'
fragment=''
So, you would make your request to those urls you see in the network tab.
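For example, a minimal sketch of requesting the nutrients endpoint shown above directly and reading whatever tables it returns (an assumption: the returned fragment contains an HTML table that html_table() can parse):
library(rvest)

# Request the AJAX endpoint directly and parse any tables in the returned fragment
nnut_url <- "http://www.apis.ac.uk/sites/default/files/AJAX/srcl_2019/apis_tab_nnut.php?ajax=true&site=1001814&BH=&populateBH=true"
read_html(nnut_url) %>% html_table()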
If you want to determine these urls dynamically, make a request to the landing page (as you did), then regex the paths of the urls (see above) out of the response text. This can be done using the pattern url: "(\\/sites\\/default\\/files\\/.*?)".
You then need to add the protocol + domain (scheme and netloc) to the returned matches based on landing page protocol and domain.
There are some additional query string parameters (the part after the ?) which can also be retrieved dynamically if you are reconstructing the urls from the response text; you can see them within the page source.
You probably want to extract each of those data param specs for the Ajax requests, e.g. with data:\\s\\((.*?)\\), and then have a custom function which turns the matches into the required query string suffix to add to the previously retrieved urls.
Something like the following:
library(rvest)
library(magrittr)
library(stringr)
get_query_string <- function(match, site_code) {
  string <- paste0(
    "?",
    gsub("siteCode", site_code, gsub('["{}]', "", gsub(",\\s+", "&", gsub(":\\s+", "=", match))))
  )
  return(string)
}
link <- "http://www.apis.ac.uk/select-feature?site=1001814&SiteType=SSSI&submit=Next"
page <- read_html(link) %>% toString()
links <- paste0("http://www.apis.ac.uk", stringr::str_match_all(page, 'url: "(\\/sites\\/default\\/files\\/.*?)"')[[1]][, 2])
params <- stringr::str_match_all(page, "data:\\s\\((.*?)\\),")[[1]][, 2]
site_code <- stringr::str_match_all(page, 'var siteCode = "(.*?)"')[[1]][, 2]
params <- lapply(params, get_query_string, site_code)
urls <- paste0(links, params)
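From here, a hedged sketch of actually pulling the tab content: request each reconstructed url and collect any tables in the returned fragment (an assumption: each endpoint returns an HTML fragment that html_table() can read):
# Sketch: fetch each tab's fragment and extract its tables
tab_data <- lapply(urls, function(u) {
  read_html(u) %>% html_table()
})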
I have a data frame which contains, in one of its columns, a URL.
test = data.frame(id = 1, url = "https://www.georisques.gouv.fr/risques/installations/donnees/details/0030.12015")
Using this, I would like to retrieve an element in the web page. Specifically, I would like to retrieve the value of the activity state.
https://zupimages.net/viewer.php?id=20/51/t1fx.png
Thanks to my research, I was able to find some code which selects the element via its XPath.
library(rvest)
page = read_html("https://www.georisques.gouv.fr/risques/installations/donnees/details/0030.12015")
page %>% html_nodes(xpath = '//*[@id="detailAttributFiche"]/div/p') %>% html_text() %>% as.character()
character(0)
As you can see, I always get character(0), as if it couldn't read that part of the page. I suspect some JavaScript-rendered part is not loading properly.
How can I do this?
Thank you.
The data is from this link (the etatActiviteInst parameter): https://www.georisques.gouv.fr/webappReport/ws/installations/etablissement/0030-12015
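A minimal sketch of reading that endpoint directly with jsonlite (an assumption: the web service still returns JSON containing an etatActiviteInst field):
library(jsonlite)

# Query the JSON web service behind the page instead of the rendered HTML
res <- fromJSON("https://www.georisques.gouv.fr/webappReport/ws/installations/etablissement/0030-12015")
res$etatActiviteInst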
I have a big string and I want to match/extract a pattern with start and end search pattern. How can this be done in R?
An example of the string:
big_string <- "read.csv(\"http://company.com/students.csv\", header = TRUE)","solution":"# Preview students with str()\nstr(students)\n\n# Coerce Grades to character\nstudents$Grades <- read.csv(\"http://company.com/students_grades.csv\", header = TRUE)"
And I want to extract the url components in this instance; the pattern starts with http and ends with .csv (or, if possible, any extension).
http://company.com/students.csv
http://company.com/students_grades.csv
I have had no luck with many attempts using gregexpr to extract the pattern. Can someone help with a way to do this in R?
The stringr package works very well for this type of application:
library(stringr)
big_string <- 'read.csv(\"http://company.com/students.csv\", header = TRUE)","solution":"# Preview students with str()\nstr(students)\n\n# Coerce Grades to character\nstudents$Grades <- read.csv(\"http://company.com/students_grades.csv\", header = TRUE)'
results <- unlist(str_extract_all(big_string, "http:.+?csv"))
The search pattern matches strings starting with "http:", followed by at least one character, and ending with "csv"; the non-greedy .+? makes each match stop at the first "csv" so the two urls are returned separately.
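Since the question mentions gregexpr, here is the same idea as a base-R sketch with gregexpr/regmatches (no extra packages; perl = TRUE so the non-greedy quantifier is supported):
# Base-R equivalent: find every non-greedy "http:...csv" match in the string
m <- gregexpr("http:.+?csv", big_string, perl = TRUE)
unlist(regmatches(big_string, m))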
I am using R to scrape some webpages. One of these pages is a redirect to a new page. When I used readLines with this page like so
test <- readLines('http://zfin.org/cgi-bin/webdriver?MIval=aa-markerselect.apg&marker_type=GENE&query_results=t&input_name=anxa5b&compare=contains&WINSIZE=25')
I get the still-redirecting page instead of the final page http://zfin.org/ZDB-GENE-030131-9076. I want to use this redirection page because its URL contains input_name=anxa, which makes it easy to grab pages for different input names.
How can I get the HTML of the final page?
The redirection page: http://zfin.org/cgi-bin/webdriver?MIval=aa-markerselect.apg&marker_type=GENE&query_results=t&input_name=anxa5b&compare=contains&WINSIZE=25
The final page: http://zfin.org/ZDB-GENE-030131-9076
I don't know how to wait until the redirection, but in the source code of the web page before the redirection you can see (in a script tag) a JavaScript function call replaceLocation which contains the path you are redirected to: replaceLocation(\"/ZDB-GENE-030131-9076\").
So I suggest parsing the code to get this path.
Here is my solution:
library(RCurl)
library(XML)
url <- "http://zfin.org/cgi-bin/webdriver?MIval=aa-markerselect.apg&marker_type=GENE&query_results=t&input_name=anxa5b&compare=contains&WINSIZE=25"
domain <- "http://zfin.org"
doc <- htmlParse(getURL(url, useragent='R'))
scripts <- xpathSApply(doc, "//script", xmlValue)
script <- scripts[which(lapply(lapply(scripts, grep, pattern = "replaceLocation\\([^url]"), length) > 0)]
# > script
# [1] "\n \n\t \n\t replaceLocation(\"/ZDB-GENE-030131-9076\")\n \n \n\t"
new.url <- paste0(domain, gsub('.*\\"(.*)\\".*', '\\1', script))
readLines(new.url)
xpathSApply(doc, "//script", xmlValue) gets all the scripts in the source code.
script <- scripts[which(lapply(lapply(scripts, grep, pattern = "replaceLocation\\([^url]"), length) > 0)] gets the script containing the call with the redirect path.
(In "replaceLocation\\([^url]" you need to exclude "url" because there are two replaceLocation occurrences: one taking the object url and one taking the evaluated object, a string.)
And finally, gsub('.*\\"(.*)\\".*', '\\1', script) keeps only what you need from the script: the argument of the function, i.e. the path.
Hope this helps!
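For completeness, the same idea with rvest/stringr instead of RCurl/XML; this is only a sketch and assumes the page still embeds the path in a replaceLocation("...") call with a quoted string argument:
library(rvest)
library(stringr)

url <- "http://zfin.org/cgi-bin/webdriver?MIval=aa-markerselect.apg&marker_type=GENE&query_results=t&input_name=anxa5b&compare=contains&WINSIZE=25"
scripts <- read_html(url) %>% html_nodes("script") %>% html_text()
# Keep the call whose argument is a quoted string (not the 'url' variable) and extract the path
path <- str_match(scripts[str_detect(scripts, 'replaceLocation\\("')], 'replaceLocation\\("(.*?)"\\)')[1, 2]
final_page <- read_html(paste0("http://zfin.org", path))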
I'm all new to scraping and I'm trying to understand XPath using R. My objective is to create a vector of people from this website. I'm able to do it using:
r <- htmlTreeParse(e)  ## e is after getURL
g.k <- (r[[3]][[1]][[2]][[3]][[2]][[2]][[2]][[1]][[4]])
l <- g.k[names(g.k) == "text"]
u <- ldply(l, function(x) {
  w <- xmlValue(x)
  return(w)
})
However, this is cumbersome and I'd prefer to use XPath. How do I go about referencing the path detailed above? Is there a function for this, or can I submit my path somehow, referenced as above?
I've come to
xpathApply(htmlTreeParse(e, useInt = T), "//body//text//div//div//p//text()", function(k) xmlValue(k)) -> kk
But this leaves me a lot of cleaning up to do, and I assume it can be done better.
Regards,
//M
EDIT: Sorry for the lack of clarity, but I'm all new to this and rather confused. The XML document is unfortunately too large to paste. I guess my question is whether there is some easy way to find the names of these nodes / the structure of the document, besides using view source? I've come a little closer to what I'd like:
getNodeSet(htmlTreeParse(e, useInt=T), "//p")[[5]]->e2
gives me the list of what I want, but still as XML with br tags. I thought running
xpathApply(e2, "//text()", function(k) xmlValue(k))->kk
would provide a list that could later be unlisted. However, it provides a list with more garbage than e2 displays.
Is there a way to do this directly:
xpathApply(htmlTreeParse(e, useInt=T), "//p[5]//text()", function(k) xmlValue(k))->kk
Link to the web page: I'm trying to get the names, and only the names, from the page.
getURL("http://legeforeningen.no/id/1712")
I ended up with
xml = htmlTreeParse("http://legeforeningen.no/id/1712", useInternalNodes=TRUE)
(no need for RCurl) and then
sub(",.*$", "", unlist(xpathApply(xml, "//p[4]/text()", xmlValue)))
(subset in xpath) which leaves a final line that is not a name. One could do the text processing in XML, too, but then one would iterate at the R level.
n <- xpathApply(xml, "count(//p[4]/text())") - 1L
sapply(seq_len(n), function(i) {
  xpathApply(xml, sprintf('substring-before(//p[4]/text()[%d], ",")', i))
})
Unfortunately, this does not pick up names that do not contain a comma.
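A hedged tweak of the XPath version for such cases: fall back to the full node text when substring-before finds no comma (this sketch reuses the xml and n objects from above and only standard XPath string functions):
# Sketch: keep the whole entry when there is no comma to split on
sapply(seq_len(n), function(i) {
  nm <- as.character(xpathApply(xml, sprintf('substring-before(//p[4]/text()[%d], ",")', i)))
  if (!nzchar(nm)) {
    nm <- as.character(xpathApply(xml, sprintf('normalize-space(//p[4]/text()[%d])', i)))
  }
  nm
})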
Use a mixture of xpath and string manipulation.
#Retrieve and parse the page.
library(XML)
library(RCurl)
page <- getURL("http://legeforeningen.no/id/1712")
parsed <- htmlTreeParse(page, useInternalNodes = TRUE)
Inspecting the parsed variable, which contains the page's source, tells us that instead of sensibly using a list tag (like <ul>), the author just put a paragraph (<p>) of text split with line breaks (<br />). We use XPath to retrieve the <p> elements.
#Inspection tells us we want the fifth paragraph.
name_nodes <- xpathApply(parsed, "//p")[[5]]
Now we convert to character, split on the <br /> tags, and remove empty lines.
all_names <- as(name_nodes, "character")
all_names <- gsub("</?p>", "", all_names)
all_names <- strsplit(all_names, "<br />")[[1]]
all_names <- all_names[nzchar(all_names)]
all_names
Optionally, separate the names of people and their locations.
strsplit(all_names, ", ")
Or, more prettily, with stringr:
library(stringr)
str_split_fixed(all_names, ", ", 2)