I'm scraping data from an IMDB movie list.
I would like to scrape the link for each movie, but I am not able to identify where it is stored on the page.
This is how the part of the link is stored:
[screenshot of the link's HTML]
What I have tried:
link<-html_nodes(strona_int, '.lister-item-header+ a href')
link<-html_text(link)
Whole code
install.packages("rvest")
install.packages("RSelenium")
library(rvest)
library(RSelenium)
#open web browser (in my case Firefox, but can be Chrome or Internet Explorer)
rD <- rsDriver(browser=c("firefox"))
remDr <- rD[["client"]]
#set the start number for page link
ile<-seq(from=1, by=250, length.out = 1)
#empty frame for data
filmy_df=data.frame()
#loop reading the data
for (j in ile){
#set the link for webpage
newURL<-"https://www.imdb.com/search/title/?title_type=feature&release_date=,2018-12-31&count=250&start="
startNumberURL<-paste0(newURL,j)
#open webpage
remDr$navigate(startNumberURL)
#read html code of the page
strona_int<-read_html(startNumberURL)
#rank section
rank_data<-html_nodes(strona_int,'.text-primary')
#convert rank to text
rank_data<-html_text(rank_data)
#convert to numeric
rank_data<-as.numeric(rank_data)
link<-html_nodes(strona_int, '.lister-item-header+ a href')
link<-html_text(link)
#release date
year<-html_nodes(strona_int,'.lister-item-year')
#convert to text
year<-html_text(year)
#remove non-numeric characters
year<-gsub("\\D","",year)
#set as factor
year<-as.factor(year)
#title
title_data<-html_nodes(strona_int,'.lister-item-header a')
#convert title to text
title_data<-html_text(title_data)
#temporary data frame
filmy_df_temp<-data.frame(Rank=rank_data,Title=title_data,Release.Year=year)
#temp df to target df
filmy_df<-rbind(filmy_df,filmy_df_temp)
}
#close browser
remDr$close()
#stop Selenium
rD[["server"]]$stop()
Expected solution:
A scraped link for each film, which could be used later if required.
Selenium is not required for gathering the links.
The links are a (anchor) tags housed within a parent with class lister-item-header. You can match on those, then extract the href attribute. The hrefs are relative, so you need to prepend the protocol and domain, i.e. "https://www.imdb.com".
In the CSS selector:
.lister-item-header a
The dot is a class selector for the parent class; the space between is a descendant combinator; the final a is a type selector for the child a tags.
library(rvest)
library(magrittr)
url <- "https://www.imdb.com/search/title/?title_type=feature&release_date=,2018-12-31&count=250&start="
links <- read_html(url) %>% html_nodes(".lister-item-header a") %>% html_attr("href")
One way of adding protocol and domain:
library(rvest)
library(magrittr)
library(xml2)
base <- 'https://www.imdb.com'
url <- "https://www.imdb.com/search/title/?title_type=feature&release_date=,2018-12-31&count=250&start="
links <- read_html(url) %>%
  html_nodes(".lister-item-header a") %>%
  html_attr("href") %>%
  url_absolute(base)
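If you want to fold the links back into the original loop, a minimal sketch (reusing the question's variable names; the Link column name is my own choice, not from the question) could be:
#read the page (no Selenium needed)
strona_int <- read_html(startNumberURL)
#absolute links for every film on the page
link <- strona_int %>%
  html_nodes(".lister-item-header a") %>%
  html_attr("href") %>%
  url_absolute("https://www.imdb.com")
#add them as a column of the temporary data frame
filmy_df_temp <- data.frame(Rank = rank_data, Title = title_data,
                            Release.Year = year, Link = link)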
Reference:
https://www.rdocumentation.org/packages/xml2/versions/1.2.0/topics/url_absolute
https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors
Related
I am trying to collect some area names from a website, and to do so I want to click the drop-down box to expand the downwards-pointing arrow.
i.e., on the following page, if I click on the "distritos" drop-down I can see further drop-down availability:
https://www.fotocasa.es/es/comprar/viviendas/barcelona-capital/todas-las-zonas/l
For Ciutat Vella I see I have 4 additional items: Barri Gòtic, EL Raval, La Barceloneta and Sant Pare, Sta...
I would like to collect these names also. I have the following code, which collects the top-level district names:
library(RSelenium)
library(rvest)
library(tidyverse)
# 1.a) Open URL, click on provincias
rD <- rsDriver(browser="firefox", port=4536L)
remDr <- rD[["client"]]
url2 = "https://www.fotocasa.es/es/comprar/viviendas/barcelona-capital/todas-las-zonas/l"
remDr$navigate(url2)
remDr$maxWindowSize()
# accept cookies
remDr$findElement(using = "xpath",'/html/body/div[1]/div[4]/div/div/div/footer/div/button[2]')$clickElement()
#click on Distrito
remDr$findElement(using = "xpath", '/html/body/div[1]/div[2]/div[1]/div[3]/div/div[1]/div')$clickElement()
html_distrito_full_page = remDr$getPageSource()[[1]] %>%
read_html()
Distritos_Names = html_distrito_full_page %>%
html_nodes('.re-GeographicSearchNext-checkboxItem') %>%
html_nodes('.re-GeographicSearchNext-checkboxItem-literal') %>%
html_text()
Distritos_Names
Which gives:
[1] "Ciutat Vella" "Eixample" "Gràcia" "Horta - Guinardó" "Les Corts" "Nou Barris" "Sant Andreu" "Sant Martí"
[9] "Sants - Montjuïc" "Sarrià - Sant Gervasi"
However, this is missing the names of the regions in the drop-down boxes.
How can I collect these drop-down names also? i.e., use RSelenium to navigate to the page, expand all the downwards-facing arrows, then use rvest to scrape the whole page once they have been expanded.
You could just use rvest to get the mappings by extracting the JavaScript variable housing the mappings + some other data. Use jsonlite to deserialize the extracted string into a JSON object, then apply a custom function to extract the actual mappings for each dropdown. Wrap that function in a map_dfr() call to get a final combined dataframe of all dropdown mappings.
TODO: Review the JSON to see if the magic number 4 can be removed and the correct item to retrieve from the parent list determined dynamically.
library(tidyverse)
library(rvest)
library(jsonlite)
extract_data <- function(x) {
  tibble(
    location = x$literal,
    sub_location = map(x$subLocations, "literal")
  )
}
p <- read_html("https://www.fotocasa.es/es/comprar/viviendas/barcelona-capital/todas-las-zonas/l") %>% html_text()
s <- str_match(p, 'window\\.__INITIAL_PROPS__ = JSON\\.parse\\("(.*)"')[, 2]
data <- jsonlite::parse_json(gsub('\\\\\\"', '\\\"', gsub('\\\\"', '"', s)))
location_data <- data$initialSearch$result$geographicSearch[4]
df <- map_dfr(location_data, extract_data)
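If you then want one row per (location, sub-location) pair, you can unnest the list column. A usage sketch, assuming sub_location comes back as a list column (unnest() is from tidyr, which the tidyverse attaches):
#one row per sub-location; keep_empty retains locations with no sub-locations
df_long <- df %>% unnest(sub_location, keep_empty = TRUE)
df_long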
I'm trying to scrape only one value from this site, but I cannot get it.
Here is my code:
library(RSelenium)
rD <- rsDriver(browser="chrome",port=0999L,verbose = F,chromever = "95.0.4638.54")
remDr <- rD[["client"]]
remDr$navigate("https://www.dailyfx.com/eur-usd")
html <- remDr$getPageSource()[[1]]
library(rvest)
page <- read_html(html)
nodes <- html_nodes(page, css = ".mt-2.text-black")
html_text(nodes)
My result is
html_text(nodes)
[1] "\n\nEUR/USD\nMixed\n\n\n\n\n\n\n\n\n\nNet Long\n\n\n\nNet Short\n\n\n\n\n\nDaily change in\n\n\n\nLongs\n5%\n\n\nShorts\n1%\n\n\nOI\n4%\n\n\n\n\n\nWeekly change in\n\n\n\nLongs\n13%\n\n\nShorts\n23%\n\n\nOI\n17%\n\n\n\n"
What do I need to do to get the value of Net Long?
I would use a more targeted CSS selector list to match just the node of interest, then extract the data-value attribute from the single matched node to get the percentage:
webElem <- remDr$findElement(using = 'css selector', '.dfx-technicalSentimentCard__netLongContainer [data-type="long-value-info"]')
var <- webElem$getElementAttribute("data-value")[[1]]
Or,
page %>% html_element('.dfx-technicalSentimentCard__netLongContainer [data-type="long-value-info"]') %>% html_attr('data-value')
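Either way the value comes back as a string, so you may want a final numeric conversion, e.g. (a small sketch using the same selector as above):
net_long <- page %>%
  html_element('.dfx-technicalSentimentCard__netLongContainer [data-type="long-value-info"]') %>%
  html_attr('data-value') %>%
  as.numeric()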
I'm scraping the reviews of a Google Play app in R, but I can't get the number of votes. This is the code I use: likes <- html_obj %>% html_nodes(".xjKiLb") %>% html_attr("aria-label") and I get no value. How can it be done?
Goal: scrape the votes.
FULL CODE
#Loading the rvest package
library(rvest)
library(magrittr) # for the '%>%' pipe symbols
library(RSelenium) # to get the loaded html of the page
url <- 'https://play.google.com/store/apps/details?id=com.gospace.parenteral&showAllReviews=true'
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "firefox")
remDr$open()
# go to website
remDr$navigate(url)
# get page source and save it as an html object with rvest
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
likes <- html_obj %>% html_nodes(".xjKiLb") %>% html_attr("aria-label")
What it returns:
NA NA NA
What I want to be returned
3 3 2
Maybe you are using SelectorGadget to get the CSS selector. Like you, I tried to do that, but the selector SelectorGadget returns is not the correct one.
Inspecting the html code of the page, I realized that the correct element is contained in the tag with class = "jUL89d y92BAb".
So the code you should use is this one:
html_obj %>% html_nodes('.jUL89d') %>% html_text()
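If you want the counts as numbers rather than text, a small follow-up sketch (assuming each matched node contains only the vote count) is:
likes <- html_obj %>% html_nodes('.jUL89d') %>% html_text() %>% as.numeric()
likes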
My personal recommendation is to always check the page source to confirm the output of SelectorGadget.
I want to scrape review data from the Google Play Store for several apps. For each review I want:
the name field
how many stars they gave
the review they wrote
[screenshot of the scenario]
#Loading the rvest package
library('rvest')
#Specifying the url for the desired website to be scraped
url <- 'https://play.google.com/store/apps/details?id=com.phonegap.rxpal&hl=en_IN'
#Reading the HTML code from the website
webpage <- read_html(url)
#Using a CSS selector to scrape the name section
Name_data_html <- html_nodes(webpage,'.kx8XBd .X43Kjb')
#Converting the Name data to text
Name_data <- html_text(Name_data_html)
#Look at the Name
head(Name_data)
but it results in
> head(Name_data)
character(0)
Later, trying to dig deeper, I found that Name_data_html contains
> Name_data_html
{xml_nodeset (0)}
I am new to web scraping; can anyone help me out with this?
You should use XPath to select the object on the web page:
#Loading the rvest package
library('rvest')
#Specifying the url for the desired website to be scraped
url <- 'https://play.google.com/store/apps/details?id=com.phonegap.rxpal&hl=en_IN'
#Reading the HTML code from the website
webpage <- read_html(url)
# Using Xpath
Name_data_html <- webpage %>% html_nodes(xpath='/html/body/div[1]/div[4]/c-wiz/div/div[2]/div/div[1]/div/c-wiz[1]/c-wiz[1]/div/div[2]/div/div[1]/c-wiz[1]/h1/span')
#Converting the Name data to text
Name_data <- html_text(Name_data_html)
#Look at the Name
head(Name_data)
You can get such a path from the browser's developer tools: right-click the element, choose Inspect, then Copy > Copy full XPath.
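Note that a full XPath like this is brittle: any change in the page layout breaks it. A shorter relative XPath, assuming the app name is the only h1 on the page, might be:
Name_data_html <- webpage %>% html_nodes(xpath = '//h1//span')
Name_data <- html_text(Name_data_html)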
After analyzing your code and the source page of the URL you posted, I think that the reason you are unable to scrape anything is that the content is generated dynamically, so rvest cannot get it.
Here is my solution:
#Loading the rvest package
library(rvest)
library(magrittr) # for the '%>%' pipe symbols
library(RSelenium) # to get the loaded html of the page
#Specifying the url for the desired website to be scraped
url <- 'https://play.google.com/store/apps/details?id=com.phonegap.rxpal&hl=en_IN'
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "chrome")
remDr$open()
# go to website
remDr$navigate(url)
# get page source and save it as an html object with rvest
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
# 1) name field (assuming that with 'name' you refer to the name of the reviewer)
names <- html_obj %>% html_nodes(".kx8XBd .X43Kjb") %>% html_text()
# 2) How much star they got
stars <- html_obj %>% html_nodes(".kx8XBd .nt2C1d [role='img']") %>% html_attr("aria-label")
# 3) review they wrote
reviews <- html_obj %>% html_nodes(".UD7Dzf") %>% html_text()
# create the df with all the info
review_data <- data.frame(names = names, stars = stars, reviews = reviews, stringsAsFactors = F)
In my solution, I'm using RSelenium, which loads the webpage as if you were navigating to it (instead of just downloading it like rvest does). This way, all the dynamically generated content is loaded, and once it is loaded you can retrieve it with rvest and scrape it.
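One caveat: dynamically generated content needs a moment to render, so it can help to pause between navigating and grabbing the source. A simple sketch (a more robust version would poll for a specific element instead of sleeping a fixed time):
remDr$navigate(url)
Sys.sleep(5) # give the page time to render its dynamic content
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()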
If you have any doubts about my solution, just tell me!
Hope it helped!
I have a list of hospital names for which I need to extract the first google search URL. Here is the code I'm using
library(rvest)
library(urltools)
library(RCurl)
library(httr)
getWebsite <- function(name) {
  url <- URLencode(paste0("https://www.google.com/search?q=", name))
  page <- read_html(url)
  results <- page %>%
    html_nodes("cite") %>%
    html_text()
  result <- results[1]
  return(as.character(result))
}
#'c' here is the vector of hospital names (an unlucky name, since it masks base::c)
websites <- data.frame(Website = sapply(c, getWebsite))
View(websites)
For short URLs this code works fine, but when the link is long it shows up in R with "..." (ex. www.medicine.northwestern.edu/divisions/allergy-immunology/.../fellowship.html) and lands in the dataframe the same way. How can I extract the actual URLs without the "..."? Appreciate your help!
This is a working example, tested on my computer:
library("rvest")
# Load the page
main.page <- read_html(x = "https://www.google.com/search?q=software%20programming")
links <- main.page %>%
html_nodes(".r a") %>% # get the a nodes with an r class
html_attr("href") # get the href attributes
#clean the text: keep only the redirect links, strip the '/url?q=' prefix and the trailing '&...' parameters
links <- gsub('/url\\?q=', '', sapply(strsplit(links[as.vector(grep('url', links))], split = '&'), '[', 1))
# as a dataframe
websites <- data.frame(links = links, stringsAsFactors = FALSE)
View(websites)
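To fold this back into the getWebsite() function from the question, a hedged sketch (same selector and clean-up as above; no error handling, and Google may change its markup or rate-limit requests) could be:
library(rvest)
getWebsite <- function(name) {
  url <- URLencode(paste0("https://www.google.com/search?q=", name))
  links <- read_html(url) %>%
    html_nodes(".r a") %>%
    html_attr("href")
  #keep the real target URL hidden inside Google's '/url?q=...&' redirect links
  links <- gsub('/url\\?q=', '', sapply(strsplit(links[grep('url', links)], split = '&'), '[', 1))
  links[1]
}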