The following are the URLs I wish to extract:
> links
[1] "https://www.makemytrip.com/holidays-india/"
[2] "https://www.makemytrip.com/holidays-india/"
[3] "https://www.yatra.com/india-tour-packages"
[4] "http://www.thomascook.in/tcportal/international-holidays"
[5] "https://www.yatra.com/holidays"
[6] "https://www.travelguru.com/holiday-packages/domestic-packages.shtml"
[7] "https://www.chanbrothers.com/package"
[8] "https://www.tourmyindia.com/packagetours.html"
[9] "http://traveltriangle.com/tour-packages"
[10] "http://www.coxandkings.com/bharatdeko/"
[11] "https://www.sotc.in/india-tour-packages"
I have managed to do it using:
for (i in 1:10){
  html <- getURL(links[i], followlocation = TRUE)
  # parse html
  doc = htmlParse(html, asText=TRUE)
  plain.text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
}
But all the extracted data is saved in the single variable "plain.text". How do I get a separate "plain.text" for each link?
Thank you.
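One way to keep a separate result per link is to collect the results in a list instead of assigning to the same variable on every pass. The following is a minimal sketch, assuming the XML package and reusing the XPath from the question. `extract_text` is a hypothetical helper name, and the two inline HTML strings stand in for the pages that getURL() would return.

```r
library(XML)

# Hypothetical helper: parse one HTML string and return its visible text nodes.
extract_text <- function(html) {
  doc <- htmlParse(html, asText = TRUE)
  xpathSApply(
    doc,
    "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]",
    xmlValue
  )
}

# Stand-ins for the downloaded pages; in practice each element would be
# getURL(links[i], followlocation = TRUE).
pages <- c(
  first  = "<html><body><p>Hello</p><script>var x = 1;</script></body></html>",
  second = "<html><body><p>World</p><style>p {color: red;}</style></body></html>"
)

# lapply() returns one list element per page, so nothing is overwritten.
plain.text <- lapply(pages, extract_text)
```

With the real links, something like `lapply(links, function(u) extract_text(getURL(u, followlocation = TRUE)))` should give one element per URL, accessible as `plain.text[[3]]` and so on. Note also that the loop in the question runs over 1:10 while `links` has 11 entries; `seq_along(links)` would cover all of them.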
I have some URL links I would like to apply a while loop over.
###################################
To clarify here a page is given as:
https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/3?sortType=publicationDate
Where l/3? denotes page 3, l/4? denotes page 4, l/5? denotes page 5 etc.
Each page contains approx 30 unique links. I want to map over these in order to find a "special" unique link, such as /es/comprar/vivienda/mataro/calefaccion-terraza-trastero-ascensor/164160264/d?from=list
###################################
I am trying to find the page on which a given URL is found. As I write this, the link I would like to stop on is on page 3 ( url2 = "https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/3?sortType=publicationDate" ). The unique "special" link I want to stop at is given as:
linkToStopAt = "/es/comprar/vivienda/mataro/calefaccion-terraza-trastero-ascensor/164160264/d?from=list"
It is found under the HTML tag:
<a title="Piso en venta en Via Europa - Parc Central" class="re-CardPackMinimal-info-container" href="/es/comprar/vivienda/mataro/calefaccion-terraza-trastero-ascensor/164160264/d?from=list"><h3 class="re-CardHeader"><span class="re-CardTitle"><strong>Piso</strong> en Via Europa - Parc Central </span><span class="re-CardPriceContainer"><span class="re-CardPriceComposite"><span class="re-CardPrice">289.000 €</span></span></span></h3><ul class="re-CardFeatures-wrapper"><li class="re-CardFeatures-feature">4 habs.</li><li class="re-CardFeatures-feature">2 baños</li><li class="re-CardFeatures-feature">132 m²</li><li class="re-CardFeatures-feature">2ª Planta</li><li class="re-CardFeatures-feature">Ascensor</li></ul><p class="re-CardDescription"><span class="re-CardDescription-text">Muchas veces se habla de que ya no se hacen pisos como los de antes. Pero sólo cuando visitas viviendas como la que hoy ofrecemos se entiende bien esa expresión.
I want to search over the following links:
myLinksToSearchOver = list(
url0 = "https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/1?sortType=publicationDate",
url1 = "https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/2?sortType=publicationDate",
url2 = "https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/3?sortType=publicationDate", # LINK IS ON THIS PAGE
url3 = "https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/4?sortType=publicationDate"
)
Since the page is dynamic, it requires a function to open and scroll the page (this takes 30 seconds per page). Scrolling loads all of the information for each of the properties, including the link I want to stop at (linkToStopAt).
openAndScrollPage <- function(link, sortPage = FALSE){
driver = rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
remDr$navigate(link)
remDr$maxWindowSize()
#accept cookies
remDr$findElement(using = "xpath", '/html/body/div[1]/div[4]/div/div/div/footer/div/button[2]')$clickElement()
if(sortPage == TRUE){
# order by most recent
remDr$findElement(using = 'xpath', '//*[@id="App"]/div[2]/div[1]/main/div/div[3]/div/div/select/option[3]')$clickElement()
}
#after navigating and accepting cookie, we shall scroll bit by bit
for(i in 1:30){
print(paste("Scrolling for: ", i, " seconds", sep = ""))
remDr$executeScript("window.scrollBy(0,500);")
Sys.sleep(1)
}
#get page sources of all houses
html_full_page = remDr$getPageSource()[[1]] %>%
read_html()
return(html_full_page)
}
I can run the following to open and scroll through the 4 links:
scrappedLinks = map(myLinksToSearchOver, ~ openAndScrollPage(link = .x))
Now, I can use rvest to extract the links (which gives each of the 4 pages and under the 4 pages we have the unique links):
map(scrappedLinks, ~html_elements(.x, ".re-CardPackMinimal-info-container") %>%
html_attr('href')
) %>%
set_names(myLinksToSearchOver)
Which gives me the following output:
$`https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/1?sortType=publicationDate`
[1] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion/164205518/d?from=list"
[2] "/es/comprar/vivienda/mataro/calefaccion/164203226/d?from=list"
[3] "/es/comprar/vivienda/mataro/cerdanyola-sud/164203093/d?from=list"
[4] "/es/comprar/vivienda/mataro/terraza-zona-comunitaria-internet/164203070/d?from=list"
[5] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-no-amueblado/164202986/d?from=list"
[6] "/es/comprar/vivienda/mataro/calefaccion/164202959/d?from=list"
[7] "/es/comprar/vivienda/mataro/calefaccion-parking-terraza-trastero-ascensor-piscina-internet/164202809/d?from=list"
[8] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-terraza-ascensor/164202700/d?from=list"
[9] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion/164202653/d?from=list"
[10] "/es/comprar/vivienda/mataro/calefaccion-ascensor/164201939/d?from=list"
[11] "/es/comprar/vivienda/mataro/calefaccion-ascensor/164199828/d?from=list"
[12] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-ascensor-parking/164199826/d?from=list"
[13] "/es/comprar/vivienda/mataro/terraza/164199186/d?from=list"
[14] "/es/comprar/vivienda/mataro/calefaccion-ascensor/164197503/d?from=list"
[15] "/es/comprar/vivienda/mataro/cerdanyola-nord/164195419/d?from=list"
[16] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-ascensor/164193661/d?from=list"
[17] "/es/comprar/vivienda/mataro/el-palau-escorxador/164190253/d?from=list"
[18] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-ascensor/164189543/d?from=list"
[19] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-parking-terraza-trastero/164189439/d?from=list"
[20] "/es/comprar/vivienda/mataro/cerdanyola-sud/164189433/d?from=list"
[21] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-terraza-ascensor/164188699/d?from=list"
[22] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-terraza-ascensor/164187684/d?from=list"
[23] "/es/comprar/vivienda/mataro/ascensor/164186978/d?from=list"
[24] "/es/comprar/vivienda/mataro/cerdanyola-sud/164185834/d?from=list"
[25] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-parking-terraza-trastero-ascensor/164185476/d?from=list"
[26] "/es/comprar/vivienda/mataro/calefaccion/164184593/d?from=list"
[27] "/es/comprar/vivienda/mataro/aire-acondicionado-parking-trastero-no-amueblado/164184260/d?from=list"
$`https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/2?sortType=publicationDate`
[1] "/es/comprar/vivienda/mataro/ascensor/164182171/d?from=list"
[2] "/es/comprar/vivienda/mataro/parking-ascensor/164182147/d?from=list"
[3] "/es/comprar/vivienda/mataro/ascensor/164182060/d?from=list"
[4] "/es/comprar/vivienda/mataro/terraza/164181970/d?from=list"
[5] "/es/comprar/vivienda/mataro/aire-acondicionado-terraza-no-amueblado/164179457/d?from=list"
[6] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-parking-terraza-trastero-ascensor/164178147/d?from=list"
[7] "/es/comprar/vivienda/mataro/rocafonda/164177331/d?from=list"
[8] "/es/comprar/vivienda/mataro/aire-acondicionado-zona-comunitaria-ascensor/164177300/d?from=list"
[9] "/es/comprar/vivienda/mataro/aire-acondicionado-terraza/164177041/d?from=list"
[10] "/es/comprar/vivienda/mataro/aire-acondicionado-ascensor/164174523/d?from=list"
[11] "/es/comprar/vivienda/obra-nueva/mataro/19838850/164174457?from=list"
[12] "/es/comprar/vivienda/obra-nueva/mataro/19838850/164174445?from=list"
[13] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-parking-piscina-parking-no-amueblado/164173609/d?from=list"
[14] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-parking-jardin-piscina-parking-no-amueblado/164173605/d?from=list"
[15] "/es/comprar/vivienda/mataro/terraza/164173351/d?from=list"
[16] "/es/comprar/vivienda/mataro/ascensor-amueblado-parking-no-amueblado/164172246/d?from=list"
[17] "/es/comprar/vivienda/mataro/calefaccion-ascensor-television/164171855/d?from=list"
[18] "/es/comprar/vivienda/mataro/aire-acondicionado/164171580/d?from=list"
[19] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-ascensor-television-internet/164171206/d?from=list"
[20] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-terraza-ascensor-internet/164170890/d?from=list"
[21] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-parking-ascensor-patio-internet/164169905/d?from=list"
[22] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-terraza-no-amueblado/164166947/d?from=list"
[23] "/es/comprar/vivienda/mataro/rocafonda/164166363/d?from=list"
[24] "/es/comprar/vivienda/mataro/zona-comunitaria-ascensor-no-amueblado/164164638/d?from=list"
[25] "/es/comprar/vivienda/mataro/calefaccion-terraza-ascensor/164164595/d?from=list"
[26] "/es/comprar/vivienda/mataro/calefaccion-terraza-ascensor-patio/164164590/d?from=list"
[27] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-terraza-no-amueblado/164161735/d?from=list"
$`https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/3?sortType=publicationDate`
[1] "/es/comprar/vivienda/mataro/calefaccion-terraza-trastero-ascensor/164160264/d?from=list" # <-- STOP HERE
[2] "/es/comprar/vivienda/mataro/calefaccion-terraza-trastero-ascensor/164160241/d?from=list"
[3] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-zona-comunitaria-ascensor-parking-no-amueblado/164159121/d?from=list"
[4] "/es/comprar/vivienda/mataro/terraza-trastero/164158840/d?from=list"
[5] "/es/comprar/vivienda/mataro/calefaccion-parking-jardin-piscina/164158722/d?from=list"
[6] "/es/comprar/vivienda/mataro/trastero-ascensor-patio/164157956/d?from=list"
[7] "/es/comprar/vivienda/mataro/calefaccion-parking-jardin-terraza-ascensor-piscina-amueblado/164156790/d?from=list"
[8] "/es/comprar/vivienda/mataro/patio/164156495/d?from=list"
[9] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-ascensor-se-aceptan-mascotas-no-amueblado/164156307/d?from=list"
[10] "/es/comprar/vivienda/mataro/aire-acondicionado-parking-terraza-trastero/164151811/d?from=list"
[11] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-parking-terraza-ascensor/164145742/d?from=list"
[12] "/es/comprar/vivienda/mataro/terraza-no-amueblado/164144159/d?from=list"
[13] "/es/comprar/vivienda/mataro/terraza-no-amueblado/164144156/d?from=list"
[14] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-parking-no-amueblado/164140982/d?from=list"
[15] "/es/comprar/vivienda/mataro/cerdanyola-nord/164137269/d?from=list"
[16] "/es/comprar/vivienda/mataro/calefaccion-parking-jardin-terraza-ascensor/164136742/d?from=list"
[17] "/es/comprar/vivienda/mataro/ascensor/164134835/d?from=list"
[18] "/es/comprar/vivienda/mataro/calefaccion/164134674/d?from=list"
[19] "/es/comprar/vivienda/mataro/ascensor/164134234/d?from=list"
[20] "/es/comprar/vivienda/mataro/terraza-ascensor/164133959/d?from=list"
[21] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-parking-trastero-zona-comunitaria-ascensor-piscina-parking-no-amueblado/164130914/d?from=list"
[22] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-no-amueblado/164130911/d?from=list"
$`https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/4?sortType=publicationDate`
[1] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-terraza-no-amueblado/164130912/d?from=list"
[2] "/es/comprar/vivienda/mataro/parking-parking-no-amueblado/164130910/d?from=list"
[3] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-terraza-trastero-ascensor/164130398/d?from=list"
[4] "/es/comprar/vivienda/mataro/calefaccion-terraza-ascensor/164130376/d?from=list"
[5] "/es/comprar/vivienda/mataro/peramas/164130277/d?from=list"
[6] "/es/comprar/vivienda/mataro/calefaccion-terraza-ascensor/164129442/d?from=list"
[7] "/es/comprar/vivienda/mataro/calefaccion-terraza-ascensor/164128789/d?from=list"
[8] "/es/comprar/vivienda/mataro/ascensor/164128152/d?from=list"
[9] "/es/comprar/vivienda/mataro/ascensor/164127856/d?from=list"
[10] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-ascensor/164127269/d?from=list"
[11] "/es/comprar/vivienda/mataro/calefaccion-parking-terraza-trastero/164126765/d?from=list"
[12] "/es/comprar/vivienda/mataro/calefaccion-ascensor/164125679/d?from=list"
[13] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-parking-terraza-zona-comunitaria-piscina-parking-piscina-no-amueblado/164124624/d?from=list"
[14] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-parking-terraza-zona-comunitaria-piscina-parking-piscina-no-amueblado/164124623/d?from=list"
[15] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-terraza-ascensor/164123678/d?from=list"
[16] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-terraza-trastero/164123451/d?from=list"
[17] "/es/comprar/vivienda/mataro/calefaccion-jardin-terraza-trastero-zona-comunitaria-ascensor-parking-piscina/164122646/d?from=list"
[18] "/es/comprar/vivienda/mataro/calefaccion-parking-jardin-terraza-trastero-ascensor-piscina/164122457/d?from=list"
[19] "/es/comprar/vivienda/mataro/calefaccion-parking-zona-comunitaria-ascensor-piscina/164121936/d?from=list"
[20] "/es/comprar/vivienda/mataro/calefaccion-parking-jardin-terraza-ascensor/164121742/d?from=list"
[21] "/es/comprar/vivienda/mataro/aire-acondicionado-parking-no-amueblado/164121338/d?from=list"
[22] "/es/comprar/vivienda/mataro/cirera/164120991/d?from=list"
[23] "/es/comprar/vivienda/mataro/calefaccion-parking-jardin-terraza-trastero-ascensor-patio-piscina/164120969/d?from=list"
[24] "/es/comprar/vivienda/mataro/calefaccion-parking-terraza-ascensor-piscina/164120512/d?from=list"
[25] "/es/comprar/vivienda/mataro/aire-acondicionado-calefaccion-terraza-trastero-ascensor-patio-amueblado/164120349/d?from=list"
So, in the above output I want to end at list item 3:
$`https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/3?sortType=publicationDate`
[1] "/es/comprar/vivienda/mataro/calefaccion-terraza-trastero-ascensor/164160264/d?from=list" # <-- STOP HERE
The example I have shown doesn't stop automatically when it finds the URL of interest. It continues on to page 4 (I want it to stop at page 3, since this is where linkToStopAt first occurs).
I would like to try to wrap the above in a while loop.
while(TRUE){
map(myLinksToSearchOver, ~ openAndScrollPage(.x))
"return PAGE 3 URL since this is where the 'linkToStopAt' is located - at the time of posting"
print(paste("we stopped at", PAGE_URL))
}
How can I incorporate into the while loop a stopping URL, in order to get the page URL that this URL falls on? I don't want to go through 500 pages when the "unique" URL is on page 3 for example.
All code:
library(rvest)
library(tidyverse)
library(xml2)
library(stringr)
library(RSelenium)
myLinksToSearchOver = list(
url0 = "https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/1?sortType=publicationDate",
url1 = "https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/2?sortType=publicationDate",
url2 = "https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/3?sortType=publicationDate",
url3 = "https://www.fotocasa.es/es/comprar/viviendas/mataro/todas-las-zonas/l/4?sortType=publicationDate"
)
openAndScrollPage <- function(link, sortPage = FALSE){
driver = rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
remDr$navigate(link)
remDr$maxWindowSize()
#accept cookies
remDr$findElement(using = "xpath", '/html/body/div[1]/div[4]/div/div/div/footer/div/button[2]')$clickElement()
if(sortPage == TRUE){
# order by most recent
remDr$findElement(using = 'xpath', '//*[@id="App"]/div[2]/div[1]/main/div/div[3]/div/div/select/option[3]')$clickElement()
}
#after navigating and accepting cookie, we shall scroll bit by bit
for(i in 1:30){
print(paste("Scrolling for: ", i, " seconds", sep = ""))
remDr$executeScript("window.scrollBy(0,500);")
Sys.sleep(1)
}
#get page sources of all houses
html_full_page = remDr$getPageSource()[[1]] %>%
read_html()
return(html_full_page)
}
linkToStopAt = "/es/comprar/vivienda/mataro/calefaccion-terraza-trastero-ascensor/164160264/d?from=list"
scrappedLinks = map(myLinksToSearchOver, ~ openAndScrollPage(link = .x))
map(scrappedLinks, ~html_elements(.x, ".re-CardPackMinimal-info-container") %>%
html_attr('href')
) %>%
set_names(myLinksToSearchOver)
while(TRUE){
map(myLinksToSearchOver, ~ openAndScrollPage(.x))
"return PAGE 3 URL"
print(paste("we stopped at", PAGE_URL))
}
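A possible way to stop early is to visit the pages one at a time in a plain loop and return as soon as the target link shows up, rather than mapping over every page first. The sketch below is only an illustration: `find_page_with_link`, `get_links`, and the stubbed pages are hypothetical names, and the stub replaces the Selenium-based scraper so the example runs offline.

```r
# Hypothetical helper: visit pages in order and stop at the first page whose
# links contain `target`. `get_links` is any function that takes a page URL
# and returns a character vector of hrefs.
find_page_with_link <- function(page_urls, target, get_links) {
  for (page_url in page_urls) {
    hrefs <- get_links(page_url)
    if (target %in% hrefs) {
      return(page_url)  # stop here; later pages are never opened
    }
  }
  NA_character_  # target was not found on any page
}

# Stub standing in for the Selenium-based scraper, so the sketch runs offline.
fake_pages <- list(
  "page1" = c("/a", "/b"),
  "page2" = c("/c", "/d"),
  "page3" = c("/target", "/e"),
  "page4" = c("/f")
)
visited <- character(0)
fake_get_links <- function(page_url) {
  visited <<- c(visited, page_url)  # record which pages were actually opened
  fake_pages[[page_url]]
}

stopped_at <- find_page_with_link(names(fake_pages), "/target", fake_get_links)
print(paste("we stopped at", stopped_at))
```

With the real scraper, `get_links` could presumably be a function that calls `openAndScrollPage()` and then extracts the `href` values with `html_elements(".re-CardPackMinimal-info-container")` and `html_attr("href")`, with `linkToStopAt` passed as `target`. Since `openAndScrollPage()` starts a new browser through `rsDriver()` on every call, it may also be worth starting the driver once and reusing it.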
I am using R to try to download images from the Reptile-database by filling in their form to search for specific images. For that, I am following previous suggestions for filling in an online form from R, such as:
library(httr)
library(tidyverse)
POST(
url = "http://reptile-database.reptarium.cz/advanced_search",
encode = "json",
body = list(
genus = "Chamaeleo",
species = "dilepis"
)) -> res
out <- content(res)[1]
This seems to work smoothly, but my problem now is to identify the link with the correct species name in the resulting out object.
This object should contain the following page:
https://reptile-database.reptarium.cz/species?genus=Chamaeleo&species=dilepis&search_param=%28%28genus%3D%27Chamaeleo%27%29%28species%3D%27dilepis%27%29%29
This contains names with links. Thus, I would like to identify the link that takes me to the page with the correct species' table. However, I am unable to find the link, or even the name of the species, within the generated out object.
Here I only extract the links to the pictures. To download them, simply map or apply download.file() over the result.
library(tidyverse)
library(rvest)
genus <- "Chamaeleo"
species <- "dilepis"
pics <- paste0(
"http://reptile-database.reptarium.cz/species?genus=", genus,
"&species=", species) %>%
read_html() %>%
html_elements("#gallery img") %>%
html_attr("src")
[1] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000034021_01_t.jpg"
[2] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000033342_01_t.jpg"
[3] "https://www.reptarium.cz/content/photo_rd_02/Chamaeleo-dilepis-03000029987_01_t.jpg"
[4] "https://www.reptarium.cz/content/photo_rd_02/Chamaeleo-dilepis-03000029988_01_t.jpg"
[5] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035130_01_t.jpg"
[6] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035131_01_t.jpg"
[7] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035132_01_t.jpg"
[8] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035133_01_t.jpg"
[9] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036237_01_t.jpg"
[10] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036238_01_t.jpg"
[11] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036239_01_t.jpg"
[12] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041048_01_t.jpg"
[13] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041049_01_t.jpg"
[14] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041050_01_t.jpg"
[15] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041051_01_t.jpg"
[16] "https://www.reptarium.cz/content/photo_rd_12/Chamaeleo-dilepis-03000042287_01_t.jpg"
[17] "https://www.reptarium.cz/content/photo_rd_12/Chamaeleo-dilepis-03000042288_01_t.jpg"
[18] "https://calphotos.berkeley.edu/imgs/128x192/9121_3261/2921/0070.jpeg"
[19] "https://calphotos.berkeley.edu/imgs/128x192/1338_3161/0662/0074.jpeg"
[20] "https://calphotos.berkeley.edu/imgs/128x192/9121_3261/2921/0082.jpeg"
[21] "https://calphotos.berkeley.edu/imgs/128x192/1338_3152/3386/0125.jpeg"
[22] "https://calphotos.berkeley.edu/imgs/128x192/6666_6666/1009/0136.jpeg"
[23] "https://calphotos.berkeley.edu/imgs/128x192/6666_6666/0210/0057.jpeg"
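The download step mentioned above could look roughly like this. It is a sketch under assumptions: `download_all` and `stub_downloader` are hypothetical names, files are named after the last path component of each URL (which may collide if two URLs share a file name), and the stub replaces download.file() so the example runs without network access.

```r
# Hypothetical helper: download every URL into `dest_dir`, naming each file
# after the last path component of its URL. `downloader` defaults to
# download.file() but can be swapped for a stub when testing offline.
download_all <- function(urls, dest_dir, downloader = download.file) {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  dest_files <- file.path(dest_dir, basename(urls))
  for (i in seq_along(urls)) {
    # mode = "wb" keeps binary files such as JPEGs intact on Windows
    downloader(urls[i], destfile = dest_files[i], mode = "wb")
  }
  invisible(dest_files)
}

# Offline demonstration with a stub that only creates an empty file.
stub_downloader <- function(url, destfile, mode) file.create(destfile)
example_urls <- c(
  "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000034021_01_t.jpg",
  "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000033342_01_t.jpg"
)
saved <- download_all(example_urls, file.path(tempdir(), "pics"), stub_downloader)
```

For a real run, calling `download_all(pics, "pics")` without the stub should fetch each image with download.file().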
Each div.grpl-grp clearfix (each club element) on this page has its own id:
https://uws-community.symplicity.com/index.php?s=student_group
I am trying to scrape each of these ids; however, my current method, shown below, does not work. What am I doing wrong?
url <- 'https://uws-community.symplicity.com/index.php?s=student_group'
page <- html_session(url)
id_nodes <- html_nodes(page, "div.grpl-grp clearfix") %>% html_attrs("id")
Try XPath instead:
library(magrittr)
library(rvest)
doc <- read_html("https://uws-community.symplicity.com/index.php?s=student_group")
html_nodes(doc, xpath=".//div[contains(@class, 'grpl-grp') and contains(@class, 'clearfix')]") %>%
html_attr("id")
## [1] "grpl_5bf9ea61bc46eaeff075cf8043c27c92" "grpl_17e4ea613be85fe019efcf728fb6361d"
## [3] "grpl_d593eb48fe26d58f616515366a1e677b" "grpl_5b445690da34b7cff962ee2bf254db9e"
## [5] "grpl_cd1ebcef22852bdb5301a243803a2909" "grpl_0a7da33f968a919ecfa06486f0787bc7"
## [7] "grpl_a6a6cbf50b45d1ef05f8965c69f462de" "grpl_3fed7efb36173632ae2eef14393f37fc"
## [9] "grpl_f4e1e263109725bd4f99db9f70552b65" "grpl_2be038a5d159bf753fceb26cfdf596c2"
## [11] "grpl_918f9dec53fe5d36c1f98f5136f2ae7d" "grpl_f365b501f1e9833ca0cf8c504e37d11c"
## [13] "grpl_2f302fcce440ec1463beb73c6d7af070" "grpl_26b6771768df4a002e44ad6ec01fa36d"
## [15] "grpl_5e260344fd093628f3326a162996513a" "grpl_3604e5b44c0428dfc982c1bfc852fef2"
## [17] "grpl_9ab9bced3514bd8b2e0e18da8a3c7977" "grpl_6364bed0a4d3f45cd5b1fc929e320cb3"
## [19] "grpl_ba21e3c819afe6a32110585ac379f5d9" "grpl_9964a3732044fceffb4dc9b5645856ba"
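For comparison, a CSS selector can also express "a div with both classes" if the two class names are chained with dots rather than separated by a space. A small sketch on an inline HTML fragment follows; the ids `grpl_a` and `grpl_b` are made up for illustration and the fragment only imitates the real page.

```r
library(rvest)

# Inline stand-in for the real page; the ids are made up for illustration.
doc <- read_html('
  <div class="grpl-grp clearfix" id="grpl_a"></div>
  <div class="grpl-grp clearfix" id="grpl_b"></div>
  <div class="other" id="not_a_group"></div>
')

# "div.grpl-grp.clearfix" means: a div carrying BOTH classes.
ids <- doc %>%
  html_elements("div.grpl-grp.clearfix") %>%
  html_attr("id")

# "div.grpl-grp clearfix" (with a space) means: a <clearfix> element nested
# inside div.grpl-grp, which matches nothing here.
no_match <- html_elements(doc, "div.grpl-grp clearfix")
```

Note also that `html_attrs()` (plural) returns all attributes of each node and takes no attribute name, whereas `html_attr("id")` returns the single named attribute.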
I have an XML page that I need to parse using xml2.
However, with this code I cannot get the list under the subcellularLocation XPath:
library(xml2)
xmlfile <- "https://www.uniprot.org/uniprot/P09429.xml"
doc <- xmlfile %>%
xml2::read_xml()
xml_name(doc)
xml_children(doc)
x <- xml_find_all(doc, "//subcellularLocation")
xml_path(x)
# character(0)
What is the right way to do it?
Update
The desired output is a vector:
[1] "Nucleus"
[2] "Chromosome"
[3] "Cytoplasm"
[4] "Secreted"
[5] "Cell membrane"
[6] "Peripheral membrane protein"
[7] "Extracellular side"
[8] "Endosome"
[9] "Endoplasmic reticulum-Golgi intermediate compartment"
Use x <- xml_find_all(doc, "//d1:subcellularLocation")
Whenever you run into a troublesome problem, checking the documentation is the first thing to do. Run ?xml_find_all and you will see this (at the end of the page):
# Namespaces ---------------------------------------------------------------
# If the document uses namespaces, you'll need use xml_ns to form
# a unique mapping between full namespace url and a short prefix
x <- read_xml('
<root xmlns:f = "http://foo.com" xmlns:g = "http://bar.com">
<f:doc><g:baz /></f:doc>
<f:doc><g:baz /></f:doc>
</root>
')
xml_find_all(x, ".//f:doc")
xml_find_all(x, ".//f:doc", xml_ns(x))
So you then go to check xml_ns(doc) and find
d1 <-> http://uniprot.org/uniprot
xsi <-> http://www.w3.org/2001/XMLSchema-instance
Update
xml_find_all(doc, "//d1:subcellularLocation") %>%
xml_children() %>%
xml_text()
## [1] "Nucleus"
## [2] "Chromosome"
## [3] "Cytoplasm"
## [4] "Secreted"
## [5] "Cell membrane"
## [6] "Peripheral membrane protein"
## [7] "Extracellular side"
## [8] "Endosome"
## [9] "Endoplasmic reticulum-Golgi intermediate compartment"
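The effect of the default namespace can be reproduced on a small inline document. The sketch below assumes a simplified structure that only imitates the UniProt layout (the element nesting is an assumption, not the real schema). It also shows `xml_ns_strip()` as an alternative that removes the default namespace so unprefixed XPath works.

```r
library(xml2)

# Small inline document with a default namespace, mimicking the UniProt layout.
doc <- read_xml('
<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <subcellularLocation><location>Nucleus</location></subcellularLocation>
    <subcellularLocation><location>Cytoplasm</location></subcellularLocation>
  </entry>
</uniprot>
')

# Without a prefix, XPath looks for elements in no namespace and finds nothing.
unprefixed <- xml_find_all(doc, "//subcellularLocation")

# xml2 exposes the default namespace under the prefix d1.
prefixed <- xml_find_all(doc, "//d1:subcellularLocation/d1:location")
locations <- xml_text(prefixed)

# Alternative: strip the default namespace (modifies doc in place), after
# which the unprefixed XPath works.
xml_ns_strip(doc)
stripped <- xml_text(xml_find_all(doc, "//subcellularLocation/location"))
```

Stripping is convenient for quick exploration, while keeping the `d1:` prefix leaves the document untouched.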
If you don't mind, you can use the rvest package:
library(rvest)
a=read_html(xmlfile)%>%
html_nodes("subcellularlocation")
a%>%html_children()%>%html_text()
[1] "Nucleus" "Chromosome"
[3] "Cytoplasm" "Secreted"
[5] "Cell membrane" "Peripheral membrane protein"
[7] "Extracellular side" "Endosome"
[9] "Endoplasmic reticulum-Golgi intermediate compartment"
I've a link where I need to download data which is in ".iqy" file and I need to read that for further cleaning.
I'm able to do it manually by copying the link on the 3rd line of the file, which I read using:
con <- file("ABC1.iqy", "r", blocking = FALSE)
readLines(con=con,n=-1L,ok=TRUE, warn=FALSE,encoding='unknown')
Output:
[1] "WEB"
[2] "1"
[3] "https:abc.../excel/execution/EPnx?view=vrs"
[4] ""
[5] ""
[6] "Selection=AllTables"
[7] "Formatting=None"
[8] "PreFormattedTextToColumns=True"
[9] "ConsecutiveDelimitersAsOne=True"
[10] "SingleBlockTextImport=False"
[11] "DisableDateRecognition=False"
[12] "DisableRedirections=False"
[13] ""
I need to automate this instead of doing it manually. Is there any option in R that I can use?
Simply use download.file :)
# read the query file once; the URL is on the third line
iqy_lines <- readLines("ABC1.iqy", warn = FALSE)
url <- iqy_lines[3]
dest_path <- "ABC.file"
download.file(url, destfile = dest_path)
If you can't read the file you get, try binary mode:
download.file(url, destfile = dest_path, mode = "wb")
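If the URL is not guaranteed to sit on the third line, it could instead be located by pattern. This is a minimal sketch: `iqy_url` is a hypothetical helper, the example.com URL is a made-up placeholder, and the temporary file only imitates the structure of the .iqy file shown in the question.

```r
# Write a small stand-in .iqy file; the URL is hypothetical.
iqy_path <- tempfile(fileext = ".iqy")
writeLines(
  c("WEB", "1", "https://example.com/excel/execution/EPnx?view=vrs", "",
    "Selection=AllTables", "Formatting=None"),
  iqy_path
)

# Hypothetical helper: return the first line that looks like an http(s) URL.
iqy_url <- function(path) {
  lines <- readLines(path, warn = FALSE)
  urls <- grep("^https?://", lines, value = TRUE)
  if (length(urls) == 0) stop("no URL found in ", path)
  urls[1]
}

query_url <- iqy_url(iqy_path)
# download.file(query_url, destfile = "ABC.file", mode = "wb")  # needs network
```

The download call is left commented out because it needs network access and a real query URL.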