I am trying to scrape data from a website using R. I am using rvest in an attempt to mimic an example scraping the IMDB page for the Lego Movie. The example advocates use of a tool called Selector Gadget to help easily identify the html_node associated with the data you are seeking to pull.
I am ultimately interested in building a data frame that has the following schema/columns:
rank, blog_name, facebook_fans, twitter_followers, alexa_rank.
I was able to use Selector Gadget to correctly identify the html tag used in the Lego example. However, following the same process and the same code structure as the Lego example, all I get back is NA, along with the warning "NAs introduced by coercion":
[1] NA
My code is below:
data2_html = read_html("http://blog.feedspot.com/video_game_news/")
data2_html %>%
html_node(".stats") %>%
html_text() %>%
as.numeric()
I have also experimented with html_node(".stats , .stats span"), which seems to match the "Facebook fans" column since it reports 714 matches, but only 1 number is returned:
714 matches for .//*[@class and contains(concat(' ', normalize-space(@class), ' '), ' stats ')] | .//*[@class and contains(concat(' ', normalize-space(@class), ' '), ' stats ')]/descendant-or-self::*/span: using first{xml_node}
<td>
[1] <span>997,669</span>
This may help you:
library(rvest)
d1 <- read_html("http://blog.feedspot.com/video_game_news/")
stats <- d1 %>%
html_nodes(".stats") %>%
html_text()
blogname <- d1%>%
html_nodes(".tlink") %>%
html_text()
Note that it is html_nodes (plural)
Result:
> head(blogname)
[1] "Kotaku - The Gamer's Guide" "IGN | Video Games" "Xbox Wire" "Official PlayStation Blog"
[5] "Nintendo Life " "Game Informer"
> head(stats,12)
[1] "997,669" "1,209,029" "873" "4,070,476" "4,493,805" "399" "23,141,452" "10,210,993" "879"
[10] "38,019,811" "12,059,607" "500"
blogname returns the list of blog names, which is easy to manage. The stats info, on the other hand, comes out mixed, because the .stats class is the same for the Facebook, Twitter and Alexa figures, so they are indistinguishable from one another. The output vector repeats the information every three numbers, that is stats = c(fb, tw, alx, fb, tw, alx, ...), so you need to separate each series out of this one. For example, for the Facebook fans:
FBstats = stats[seq(1,length(stats),3)]
> head(stats[seq(1,length(stats),3)])
[1] "997,669" "4,070,476" "23,141,452" "38,019,811" "35,977" "603,681"
You can use html_table to extract the whole table with minimal work:
library(rvest)
library(tidyverse)
# scrape html
h <- 'http://blog.feedspot.com/video_game_news/' %>% read_html()
game_blogs <- h %>%
html_node('table') %>% # select enclosing table node
html_table() %>% # turn table into data.frame
set_names(make.names) %>% # make names syntactic
mutate(Blog.Name = sub('\\s?\\+.*', '', Blog.Name)) %>% # extract title from name info
mutate_at(3:5, parse_number) %>% # make numbers actually numbers
tbl_df() # for printing
game_blogs
#> # A tibble: 119 x 5
#> Rank Blog.Name Facebook.Fans Twitter.Followers Alexa.Rank
#> <int> <chr> <dbl> <dbl> <dbl>
#> 1 1 Kotaku - The Gamer's Guide 997669 1209029 873
#> 2 2 IGN | Video Games 4070476 4493805 399
#> 3 3 Xbox Wire 23141452 10210993 879
#> 4 4 Official PlayStation Blog 38019811 12059607 500
#> 5 5 Nintendo Life 35977 95044 17727
#> 6 6 Game Informer 603681 1770812 10057
#> 7 7 Reddit | Gamers 1003705 430017 25
#> 8 8 Polygon 623808 485827 1594
#> 9 9 Xbox Live's Major Nelson 65905 993481 23114
#> 10 10 VG247 397798 202084 3960
#> # ... with 109 more rows
It's worth checking that everything is parsed like you want, but it should be usable at this point.
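For example, a quick check along these lines (my own addition) would reveal any values that parse_number failed to convert:
# count NAs in the numeric columns and eyeball the structure
game_blogs %>% summarise_at(3:5, ~ sum(is.na(.)))
glimpse(game_blogs)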
This uses html_nodes (plural) and str_replace_all to remove the commas in the numbers. Not sure if these are all the stats you need.
library(rvest)
library(stringr)
data2_html = read_html("http://blog.feedspot.com/video_game_news/")
data2_html %>%
html_nodes(".stats") %>%
html_text() %>%
str_replace_all(',', '') %>%
as.numeric()
Related
I'm working on scraping the Lord of the Rings movie scripts from the Age of the Ring website (www.ageofthering.com). Each script is broken up across multiple pages with the same layout.
I can get the info I need for a single page with this code:
library(dplyr)
library(rvest)
url_success <- "http://www.ageofthering.com/atthemovies/scripts/fellowshipofthering1to4.php"
success <- read_html(url_success) %>%
html_elements("#AutoNumber1") %>%
html_table()
summary(success)
Length Class Mode
[1,] 2 tbl_df list
This works for all Fellowship of the Ring pages, and all Return of the King pages. It also works for Two Towers pages covering scenes 57 to 66. However, any other Two Towers page (scenes 1-56) does not return the same result
url_fail <- "http://www.ageofthering.com/atthemovies/scripts/thetwotowers1to4.php"
fail <- read_html(url_fail) %>%
html_elements("#AutoNumber1") %>%
html_table()
summary(fail)
Length Class Mode
0 list list
I've inspected the pages in Chrome, and the failing pages appear to have the same structure as the succeeding ones, including the 'AutoNumber1' table. Can anyone help with this?
It works with XPath; perhaps the html is ill-formed (the page doesn't seem too spec compliant):
library(rvest)
url_fail <- "http://www.ageofthering.com/atthemovies/scripts/thetwotowers1to4.php"
fail <- read_html(url_fail) %>%
html_elements(xpath = '//*[@id="AutoNumber1"]') %>%
html_table()
fail
#> [[1]]
#> # A tibble: 139 × 2
#> X1 X2
#> <chr> <chr>
#> 1 "Scene 1 ~ The Foundations of Stone\r\n\r\n\r\nThe movie opens as the … "Sce…
#> 2 "GANDALF VOICE OVER:" "You…
#> 3 "FRODO VOICE OVER:" "Gan…
#> 4 "GANDALF VOICE OVER:" "I a…
#> 5 "The scene changes to \r\n inside Moria. Gandalf is on the Bridge … "The…
#> 6 "GANDALF:" "You…
#> 7 "Gandalf slams down his staff onto the Bridge, \r\ncausing it to crack… "Gan…
#> 8 "BOROMIR :" "(ho…
#> 9 "FRODO:" "Gan…
#> 10 "GANDALF:" "Fly…
#> # … with 129 more rows
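Since each script is spread over several such pages, here is a minimal sketch (my own addition; the URL vector is illustrative, build it from the real list of script pages) of mapping the same XPath approach over multiple URLs:
library(rvest)
library(purrr)
# illustrative subset of script pages; replace with the full list of URLs
urls <- c(
  "http://www.ageofthering.com/atthemovies/scripts/thetwotowers1to4.php",
  "http://www.ageofthering.com/atthemovies/scripts/fellowshipofthering1to4.php"
)
# one list element per page, each containing the extracted table(s)
scripts <- map(urls, function(u) {
  read_html(u) %>%
    html_elements(xpath = '//*[@id="AutoNumber1"]') %>%
    html_table()
})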
I have been practicing web scraping from Wikipedia with the rvest library, and I would like to solve a problem that I found when using the str_replace_all() function. Here is the code:
library(tidyverse)
library(rvest)
pagina <- read_html("https://es.wikipedia.org/wiki/Anexo:Premio_Grammy_al_mejor_%C3%A1lbum_de_rap") %>%
# list all tables on the page
html_nodes(css = "table") %>%
# convert to a table
html_table()
rap <- pagina[[2]]
rap <- rap[, -c(5)]
rap$Artista <- str_replace_all(rap$Artista, '\\[[^\\]]*\\]', '')
rap$Trabajo <- str_replace_all(rap$Trabajo, '\\[[^\\]]*\\]', '')
table(rap$Artista)
The problem is that after I remove the elements between brackets (hyperlinks in Wikipedia) from the Artista variable and tabulate the count by artist, Eminem is repeated three times as if it were three different artists; the same happens with Kanye West, which is repeated twice. I appreciate any solutions in advance.
There are some hidden characters still attached to the strings, and trimws() does not remove them. You can use nchar(sort(test)) to see the number of characters associated with each entry.
Here is a messy regular expression that extracts the letters, spaces, commas, & and - and skips everything else at the end:
rap <- pagina[[2]]
rap <- rap[, -c(5)]
rap$Artista<-gsub("([a-zA-Z -,&]+).*", "\\1", rap$Artista)
rap$Trabajo <- stringr::str_replace_all(rap$Trabajo, '\\[[^\\]]*\\]', '')
table(rap$Artista)
Cardi B Chance the Rapper Drake Eminem Jay Kanye West Kendrick Lamar
1 1 1 6 1 4 2
Lil Wayne Ludacris Macklemore & Ryan Lewis Nas Naughty by Nature Outkast Puff Daddy
1 1 1 1 1 2 1
The Fugees Tyler, the Creator
1 2
Here is another regular expression that seems a bit clearer:
gsub("[^[:alpha:]]*$", "", rap$Artista)
From the end of the string, it replaces zero or more characters which are not a to z or A to Z.
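As a quick illustration of what is going on (a sketch with a made-up string; the real hidden characters on the Wikipedia page may differ):
# a string with an invisible trailing character that trimws() leaves alone
x <- "Eminem\u200b"                          # zero-width space appended for illustration
nchar("Eminem")                              # 6
nchar(x)                                     # 7 - the hidden character is still counted
trimws(x) == "Eminem"                        # FALSE
gsub("[^[:alpha:]]*$", "", x) == "Eminem"    # TRUE once it is stripped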
I need to extract text from a PDF, I have a list of keywords which tell me what text part I need to extract.
PDF looks something like this:
Schema element: Keyword1   - this is my keyword
Fontsize: 14   - I don't need this
Guide to complete schema element: Text text. This is the text I need; it can be between 2 and 3 lines long and can even contain multiple sentences.
Schema element: Keyword2   - this is my keyword
Fontsize: 18   - I don't need this
Guide to complete schema element: Text text, this is the text I need; it can be between 2 and 3 lines long and can even contain multiple sentences. This text is different from the text above.
This is my code so far:
library(pdftools)
library(pdfsearch)
library(tidyverse)
pdf <- pdf_text(dir(pattern = "*.pdf")) %>%
read_lines()
Keyword_list <- c("swDisproportionateCost", "swDisproportionateCostOtherEULegislation", "swExemptionsTransboundary", "swDisproportionateCostAlternativeFinancing", "swDisproportionateCostAnalysis", "swDisproportionateCostScale")
Then I tried using keyword_search but it only told me which line the keyword was in.
I would like to extract the text in italics (the guide text) into a new column alongside my keyword_list. I think it can be done with regex, using the keyword and the text in bold as start and stop markers.
Here is a link to the pdf.
https://www.dropbox.com/s/kyyzr5wnh8z87if/FINAL%20Draft4_WFD_Reporting_Guidance_2022_resource_page.pdf?dl=0
This is just a rather pedestrian text extraction job. There are many ways to do it, and I'm sure some are more elegant than this, but this one does the job:
library(pdftools)
library(dplyr)
# keywords: the word that follows "Schema element:" on the first line of each block
keywords <- pdf_text("mypdf.pdf") %>%
  strsplit("Schema element:") %>%                                  # split each page at the keyword marker
  lapply(function(x) x[-1]) %>%                                    # drop the text before the first marker
  lapply(function(x) sapply(strsplit(x, "\r\n"), `[`, 1)) %>%      # keep only the first line of each block
  unlist() %>%
  trimws()
# text: everything after "Guidance on completion of schema element:" up to the next labelled field
text <- pdf_text("mypdf.pdf") %>%
  strsplit("Guidance on completion of schema element:") %>%
  lapply(function(x) x[-1]) %>%                                    # drop the text before the first marker
  lapply(function(x) sapply(strsplit(x, ":"), `[`, 1)) %>%         # cut at the next "field:" label
  lapply(function(x) sapply(strsplit(x, "\r\n"),
                            function(y) paste(y[-length(y)], collapse = ""))) %>%  # drop the trailing line, rejoin the rest
  unlist() %>%
  {gsub(" {2,}", " ", .)} %>%                                      # collapse runs of spaces left by the fixed-width layout
  trimws() %>%
  strsplit("Guidance on contents") %>%                             # cut off the following section heading
  sapply(`[`, 1)
df <- tibble(keywords, text)
So the result looks like this:
df
#> # A tibble: 15 x 2
#> keywords text
#> <chr> <chr>
#> 1 swExemption44Driver "Required. Select from the enumeration list the driver~
#> 2 swExemption45Impact "Required. Select from the enumeration list the impact~
#> 3 swExemption45Driver "Required. Select from the enumeration list the driver~
#> 4 swDisproportionateCost "Required. Indicate if disproportionate costs have bee~
#> 5 swDisproportionateCostScale "Conditional. Select from the enumeration list the sc~
#> 6 swDisproportionateCostAnalysis "Conditional. Select from the enumeration list the an~
#> 7 swDisproportionateCostAlterna~ "Conditional. Select from the enumeration list the al~
#> 8 swDisproportionateCostOtherEU~ "Conditional. Indicate whether the costs of basic mea~
#> 9 swTechnicalInfeasibility "Required. Report how ‘technical infeasibility’ has be~
#> 10 swNaturalConditions "Required. Select from the enumeration list the eleme~
#> 11 swExemption46 "Required. Select from the enumeration list the reason~
#> 12 swExemption47 "Required. Select from the enumeration list the modif~
#> 13 swExemptionsTransboundary "Required. Indicate whether the application of exempt~
#> 14 swExemptionsReference "Required. Provide references or hyperlinks to the re~
#> 15 driversSWExemptionsReference "Required. Provide references or hyperlinks to the re~
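If you then only need the rows matching the Keyword_list from the question, a simple filter (my own addition) narrows it down:
# keep only the schema elements listed in Keyword_list
df %>% filter(keywords %in% Keyword_list)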
I'm trying to use R to fetch all the links to data files on the Eurostat website. While my code currently "works", I seem to get a duplicate result for every link.
Note, the use of download.file is to get around my company's firewall, per this answer
library(dplyr)
library(rvest)
myurl <- "http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?dir=data&sort=1&sort=2&start=all"
download.file(myurl, destfile = "eurofull.html")
content <- read_html("eurofull.html")
links <- content %>%
html_nodes("a") %>% #Note that I dont know the significance of "a", this was trial and error
html_attr("href") %>%
data.frame()
# filter to only get the ".tsv.gz" links
files <- filter(links, grepl("tsv.gz", .))
Looking at the top of the dataframe
files$.[1:6]
[1] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?
sort=1&file=data%2Faact_ali01.tsv.gz
[2] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?
sort=1&downfile=data%2Faact_ali01.tsv.gz
[3] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?
sort=1&file=data%2Faact_ali02.tsv.gz
[4] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?
sort=1&downfile=data%2Faact_ali02.tsv.gz
[5] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?
sort=1&file=data%2Faact_eaa01.tsv.gz
[6] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?
sort=1&downfile=data%2Faact_eaa01.tsv.gz
The only difference between 1 and 2 is that 1 says "...file=data..." while 2 says "...downfile=data...". This pattern continues for all pairs down the dataframe.
If I download 1 and 2 and read the files into R, an identical check confirms they are the same.
Why are two links to the same data being returned? Is there a way (other than filtering for "downfile") to only return one of the links?
As noted, you can just do some better node targeting. This uses XPath vs CSS selectors and picks the links with downfile in the href:
html_nodes(content, xpath = ".//a[contains(@href, 'downfile')]") %>%
html_attr("href") %>%
sprintf("http://ec.europa.eu/%s", .) %>%
head()
## [1] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_ali01.tsv.gz"
## [2] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_ali02.tsv.gz"
## [3] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa01.tsv.gz"
## [4] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa02.tsv.gz"
## [5] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa03.tsv.gz"
## [6] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa04.tsv.gz"
I'm trying to scrape some tennis stats with R from multiple links using rvest and selectorgadget. The page I scrape from is http://www.atpworldtour.com/en/scores/archive/stockholm/429/2017/results and there are 29 links that look like this: "http://www.atpworldtour.com/en/scores/2017/429/MS001/match-stats". All the links look the same but change from MS001 to MS029. Using the code below I get the desired result with only the first 9 links. I see the problem but don't know how to correct it: the first 9 links have a double 0 while the rest have a single 0, so the 10th link should be MS010. Any help with this is much appreciated.
library(xml2)
library(rvest)
library(stringr)
round <- 1:29
urls <- paste0("http://www.atpworldtour.com/en/scores/2017/429/MS00", round,
"/match-stats")
aces <- function(url) {
url %>%
read_html() %>%
html_nodes(".percent-on:nth-child(3) .match-stats-number-left span") %>%
html_text() %>%
as.numeric()
}
results <- sapply(urls, aces)
results
$`http://www.atpworldtour.com/en/scores/2017/429/MS001/match-stats`
[1] 9
$`http://www.atpworldtour.com/en/scores/2017/429/MS002/match-stats`
[1] 8
$`http://www.atpworldtour.com/en/scores/2017/429/MS003/match-stats`
[1] 5
$`http://www.atpworldtour.com/en/scores/2017/429/MS004/match-stats`
[1] 4
$`http://www.atpworldtour.com/en/scores/2017/429/MS005/match-stats`
[1] 8
$`http://www.atpworldtour.com/en/scores/2017/429/MS006/match-stats`
[1] 9
$`http://www.atpworldtour.com/en/scores/2017/429/MS007/match-stats`
[1] 2
$`http://www.atpworldtour.com/en/scores/2017/429/MS008/match-stats`
[1] 9
$`http://www.atpworldtour.com/en/scores/2017/429/MS009/match-stats`
[1] 5
$`http://www.atpworldtour.com/en/scores/2017/429/MS0010/match-stats`
numeric(0)
One can generate leading zeroes in a formatted string via the sprintf() function.
ids <- 1:29
urlList <- sapply(ids,function(x){
sprintf("%s%03d%s","http://www.atpworldtour.com/en/scores/2017/429/MS",
x,"/match-stats")
})
# print a few items
urlList[c(1,9,10,29)]
...and the output:
> urlList[c(1,9,10,29)]
[1] "http://www.atpworldtour.com/en/scores/2017/429/MS001/match-stats"
[2] "http://www.atpworldtour.com/en/scores/2017/429/MS009/match-stats"
[3] "http://www.atpworldtour.com/en/scores/2017/429/MS010/match-stats"
[4] "http://www.atpworldtour.com/en/scores/2017/429/MS029/match-stats"
>
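The corrected urlList can then be passed straight to the aces() function from the question:
# aces per match, now hitting MS001 through MS029 correctly
results <- sapply(urlList, aces)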