I'm trying to scrape data from the IMDB list https://www.imdb.com/list/ls041125816/ to get the directors' names. I tried html_nodes("p.text-mutated + a") and also html_nodes(".text-mutated + p a"), but neither works.
Note that this is my first time doing web scraping.
Your help will be much appreciated.
Thank you!
Your CSS selector is not matching anything. This code gets you the directors:
library(rvest)
library(dplyr)    # %>%, tibble(), filter()
library(stringr)  # str_split(), str_detect()
library(purrr)    # map()

url <- "https://www.imdb.com/list/ls041125816/"
webpage <- read_html(url)

# the director/stars line is the sixth .text-small block of each list item
directors_data_html <- html_nodes(webpage, ".text-small:nth-child(6)")
directors_data <- html_text(directors_data_html)

# each string looks like "Director: Name | Stars: ...",
# so keep the part before the first "|"
directors <- directors_data %>%
  str_split("\\|") %>%
  map(., 1) %>%
  unlist()

# keep only the entries that actually mention a director
directors %>%
  tibble("directors" = .) %>%
  filter(str_detect(directors, "Director"))
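If you want just the names rather than the full "Director: Name" strings, a small follow-up sketch could strip the label; the regex assumes the scraped text keeps a "Director:"/"Directors:" prefix, which would need checking against the page:
# strip the "Director:"/"Directors:" label and trim whitespace
director_names <- directors %>%
  str_subset("Director") %>%               # same filter as above
  str_remove("^\\s*Directors?:\\s*") %>%
  str_trim()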
I am trying to scrape Table 1 from the following website using rvest:
https://www.kff.org/coronavirus-covid-19/issue-brief/u-s-international-covid-19-vaccine-donations-tracker/
Here is the code I have written:
link <- "https://www.kff.org/coronavirus-covid-19/issue-brief/u-s-international-covid-19-vaccine-donations-tracker/"
page <- read_html(link)
page %>% html_nodes("iframe") %>% html_attr("src") %>% .[11] %>% read_html() %>%
html_nodes("table.medium datawrapper-g2oKP-6idse1 svelte-1vspmnh resortable")
But I get {xml_nodeset (0)} as the result. I am struggling to figure out the correct tag to select in html_nodes() from the Datawrapper page to extract Table 1.
I would be really grateful if someone could point out the mistake I am making, or suggest a solution to scrape this table.
Many thanks.
The data is present in the iframe but needs a little manipulation. It is easier, for me at least, to construct the CSV download URL from the iframe page and then request that CSV:
library(rvest)
library(magrittr)  # %>%
library(vroom)     # fast CSV reader
library(stringr)

page <- read_html('https://www.kff.org/coronavirus-covid-19/issue-brief/u-s-international-covid-19-vaccine-donations-tracker/')

# find the iframe whose title starts with "Table 1" and grab its src
iframe <- page %>% html_element('iframe[title^="Table 1"]') %>% html_attr('src')

# the chart id is embedded in the iframe page's <meta> content; extract the digits
id <- read_html(iframe) %>% html_element('meta') %>% html_attr('content') %>% str_match('/(\\d+)/') %>% .[, 2]

# Datawrapper exposes the underlying data at <iframe>/<id>/dataset.csv
csv_url <- paste(iframe, id, 'dataset.csv', sep = '/')
data <- vroom(csv_url, show_col_types = FALSE)
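A quick sanity check on the result, just illustrative:
dim(data)   # rows and columns of the downloaded table
head(data)  # first few rows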
I would like to scrape all the seasons from 2003-2004 to 2019-2020 of the Dutch football league, including all 34 playing rounds (I am using this website: https://www.voetbal.com/wedstrijdgegevens/ned-eredivisie-2003-2004-spieltag/). As you can see from my code, it only shows me the results of the last season; I think it's overwriting the other seasons. What am I doing wrong, and what do I have to add to my code? Can anybody help me?
Here is the code I use:
library(tidyverse)  # includes dplyr, ggplot2, tidyr, tibble, stringr, purrr
library(caret)
library(rvest)
library(devtools)
library(httr)
library(xml2)
url <- sprintf("https://www.voetbal.com/wedstrijdgegevens/ned-eredivisie-%d-%d-spieltag/", 2003:2019, 2004:2020)
basis<-function(url){
website <- read_html(url)
Sys.sleep(2)
datum <- website %>%
html_nodes(".data .standard_tabelle td[nowrap]:nth-of-type(1)") %>%
html_text()
tijdstip <- website %>%
html_nodes(".data .standard_tabelle td[nowrap]:nth-of-type(2)") %>%
html_text()
thuisclub <- website %>%
html_nodes(".data .standard_tabelle [align='right'] a") %>%
html_text()
uitclub <- website %>%
html_nodes(".standard_tabelle td:nth-of-type(5) a") %>%
html_text()
uitslag <- website %>%
html_nodes(".data .standard_tabelle td[nowrap]:nth-of-type(6)") %>%
html_text()
return(tibble(datum=datum, tijdstip=tijdstip, thuisclub=thuisclub, uitclub=uitclub, uitslag=uitslag))
}
overige_seizoenen<-function(url){
for (i in 1:17){
list_of_pages<-str_c(url[[i]], 1:34)
table <-list_of_pages%>%
map(basis)%>%
bind_rows()
}
return(table)
}
jochem <- overige_seizoenen(url)
I suspect there is an error in the for loop: table is assigned anew on every iteration, so each season overwrites the previous one and only the last season is returned. Collect each season's result in a list and bind everything together after the loop. So try this now:
overige_seizoenen <- function(url){
  seizoenen <- vector("list", length(url))  # one slot per season
  for (i in seq_along(url)){
    list_of_pages <- str_c(url[[i]], 1:34)  # the 34 round pages of season i
    seizoenen[[i]] <- list_of_pages %>%
      map(basis) %>%
      bind_rows()
  }
  # combine all seasons into one tibble
  return(bind_rows(seizoenen))
}
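As an alternative, both loops can be collapsed into a single purrr pipeline. This is only a sketch built on the same basis() function and url vector from the question:
library(purrr)    # map(), map_dfr()
library(stringr)  # str_c()

# build all season-round URLs, scrape each page, and row-bind everything
alle_wedstrijden <- url %>%
  map(~ str_c(.x, 1:34)) %>%  # the 34 round pages of each season
  unlist() %>%
  map_dfr(basis)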
I'm trying to pull out a few pages of reviews from TripAdvisor for an academic project.
Here's my attempt using R:
#Load libraries
library(rvest)
library(RSelenium)
# main url for stadium
urlmainlist=c(
hampdenpark="http://www.tripadvisor.com.ph/Attraction_Review-g186534-d214132-Reviews-Hampden_Park-Glasgow_Scotland.html"
)
# Specify how many search pages and counter
morepglist=list(
hampdenpark=seq(10,360,10)
)
#----------------------------------------------------------------------------------------------------------
# create pickstadium variable
pickstadium="hampdenpark"
# get list of urllinks corresponding to different pages
# url link for first search page
urllinkmain=urlmainlist[pickstadium]
# counter for additional pages
morepg=as.numeric(morepglist[[pickstadium]])
urllinkpre=paste(strsplit(urllinkmain,"Reviews-")[[1]][1],"Reviews",sep="")
urllinkpost=strsplit(urllinkmain,"Reviews-")[[1]][2]
urllink=rep(NA,length(morepg)+1)
urllink[1]=urllinkmain
for(i in 1:length(morepg)){
urllink[i+1]=paste(urllinkpre,"-or",morepg[i],"-",urllinkpost,sep="")
}
head(urllink)
write.csv(urllink,'urllink.csv')
##########
#SCRAPING#
##########
library(rvest)
library(RSelenium)
#install.packages('RSelenium')
testurl <- read.csv("urllink.csv", header=FALSE, quote="'", stringsAsFactors = F)
testurl=testurl[-1,]
testurl=testurl[,-1]
testurl=as.data.frame(testurl)
testurl=gsub('"',"",testurl$testurl)
list<-unlist(testurl)
tripadvisor <- NULL
#Scrape
for(i in 1:length(list)){
reviews <- list[i] %>%
read_html() %>%
html_nodes("#REVIEWS .innerBubble")
id <- reviews %>%
html_node(".quote a") %>%
html_attr("id")
rating <- reviews %>%
html_node(".rating .rating_s_fill") %>%
html_attr("alt") %>%
gsub(" of 5 stars", "", .) %>%
as.integer()
date <- reviews %>%
html_node(".rating .ratingDate") %>%
html_attr("title") %>%
strptime("%b %d, %Y") %>%
as.POSIXct()
review <- reviews %>%
html_node(".entry .partial_entry") %>%
html_text()%>%
as.character()
rowthing <- data.frame(id, review,rating, date, stringsAsFactors = FALSE)
tripadvisor<-rbind(rowthing, tripadvisor)
}
However, this results in an empty tripadvisor data frame. Any help fixing this would be appreciated.
Additional Question
I'd like to capture the full reviews, as my code currently captures partial entries only. For each review, I'd like to automatically click on the 'More' link and then extract the full review.
Here too, any help would be greatly appreciated.
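For the 'More' links, a minimal RSelenium sketch could look like the following. It assumes a Selenium server/driver can be started locally and that the links match a hypothetical .moreLink selector; both would need checking against the live page:
library(RSelenium)
library(rvest)

# start a browser session (assumes a local driver is available)
rD <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- rD$client

remDr$navigate(urllink[1])

# click every 'More' link so the partial reviews expand;
# ".moreLink" is a hypothetical selector, adjust to the live markup
more_links <- remDr$findElements(using = "css selector", ".moreLink")
for (el in more_links) el$clickElement()
Sys.sleep(2)  # give the page time to render the expanded text

# hand the rendered page source to rvest and proceed as before
page <- read_html(remDr$getPageSource()[[1]])
reviews <- page %>% html_nodes("#REVIEWS .innerBubble")

remDr$close()
rD$server$stop()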
I'm trying to create a database of crime data by zip code based on Trulia.com's data. I have the code below, but so far it only produces one line of data. In the code below, Zipcodes is just a list of US zip codes. Can anyone tell me what I need to add to make this loop run through my entire list?
Here is a link to one of the Trulia pages for reference: https://www.trulia.com/real_estate/20004-Washington/crime/
UPDATE:
Here are zip codes for download: https://www.dropbox.com/s/uxukqpu0v88d7tf/Zip%20Code%20Database%20wo%20Boston.xlsx?dl=0
I also changed the code a bit this time, after realizing the crime stats appear in different orders depending on the zip code. Is it possible to have the loop produce four lines per zip code? The code currently works but only keeps the last zip code in the dataset. I can't figure out how to record each zip code's data on separate lines, so the loop doesn't overwrite everything and leave only one line for the last zip code.
Please help!!
library(rvest)
data=data.frame(Zipcodes)
for(i in data$Zip.Code)
{
site <- paste("https://www.trulia.com/real_estate/",i,"-Boston/crime/", sep="")
site <- html(site)
crime<- data.frame(zip =i,
type =site %>% html_nodes(".brs") %>% html_text() ,
stringsAsFactors=FALSE)
}
View(crime)
If that code doesn't work, try this:
data=data.frame(Zillow_Data_for_R_Test)
for(i in data$Zip.Code)
site <- paste("https://www.trulia.com/real_estate/",i,"-Boston/crime/", sep="")
site <- read_html(site)
crime<- data.frame(zip =i,
theft =site %>% html_nodes(".crime-text-0") %>% html_text() ,
assault =site %>% html_nodes(".crime-text-1") %>% html_text() ,
arrest =site %>% html_nodes(".crime-text-2") %>% html_text() ,
vandalism =site %>% html_nodes(".crime-text-3") %>% html_text() ,
robbery =site %>% html_nodes(".crime-text-4") %>% html_text() ,
type =site %>% html_nodes(".clearfix") %>% html_text() ,
stringsAsFactors=FALSE)
View(crime)
The comment by @r2evans already provides an answer. Since @ShanCham asked how to actually implement it, I wanted to help with the following code, which is just more verbose than the comment and could therefore not be posted as an additional comment.
library(rvest)
# only two exemplary zip codes; could be more, of course
zipcodes <- c("02110", "02125")
crime <- lapply(zipcodes, function(z) {
site <- read_html(paste0("https://www.trulia.com/real_estate/", z, "-Boston/crime/"))
# for illustrative purposes:
# introduced as.numeric for the numeric columns,
# excluded some of your other columns, and shortened the text in type
data.frame(zip = z,
theft = site %>% html_nodes(".crime-text-0") %>% html_text() %>% as.numeric(),
assault = site %>% html_nodes(".crime-text-1") %>% html_text() %>% as.numeric(),
type = site %>% html_nodes(".clearfix") %>% html_text() %>% paste(collapse = " ") %>% substr(1, 50),
stringsAsFactors = FALSE)
})
class(crime)
# [1] "list"
# the output is a list of data.frames that can be bound together into one data.frame
crime <- do.call(rbind, crime)
# crime is now a data.frame, hence classes/types are kept
class(crime$type)
# [1] "character"
class(crime$assault)
# [1] "numeric"
I am trying to get a list of Companies and jobs in a table from indeed.com's job board.
I am using the rvest package using a URL Base of http://www.indeed.com/jobs?q=proprietary+trader&
install.packages("gtools")
install.packages("rvest")
library(rvest)
library(gtools)
mydata = read.csv("setup.csv", header=TRUE)
url_base <- "http://www.indeed.com/jobs?q=proprietary+trader&"
names <- mydata$Page
results<-data.frame()
for (name in names){
url <-paste0(url_base,name)
title.results <- url %>%
html() %>%
html_nodes(".jobtitle") %>%
html_text()
company.results <- url %>%
html() %>%
html_nodes(".company") %>%
html_text()
results <- smartbind(company.results, title.results)
results3<-data.frame(company=company.results, title=title.results)
}
new <- results(Company=company, Title=title)
and then looping over a concatenation. For some reason it is not grabbing all of the jobs, and it is mixing up the companies and jobs.
It might be because you make two separate requests to the page, so the job titles and company names can come from different responses. You should change the middle part of your code to:
# request the page once and reuse the parsed document for both selectors
page <- url %>%
read_html()
title.results <- page %>%
html_nodes(".jobtitle") %>%
html_text()
company.results <- page %>%
html_nodes(".company") %>%
html_text()
When I do that, it gives me 10 jobs and companies that match. Otherwise, can you give an example of a query URL that doesn't work?
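For completeness, here is a sketch of the full loop with that fix, accumulating one data.frame per search page instead of overwriting results; the setup.csv input and the .jobtitle/.company selectors are taken from the question as-is and not re-verified:
library(rvest)

url_base <- "http://www.indeed.com/jobs?q=proprietary+trader&"
mydata <- read.csv("setup.csv", header = TRUE)

# one data.frame per search page, bound together at the end
pages <- lapply(mydata$Page, function(name) {
  page <- read_html(paste0(url_base, name))  # single request per page
  data.frame(
    company = page %>% html_nodes(".company") %>% html_text(),
    title   = page %>% html_nodes(".jobtitle") %>% html_text(),
    stringsAsFactors = FALSE
  )
})
results <- do.call(rbind, pages)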