I am trying to get the "Team Offense" table into R. I have tried multiple techniques and I cannot seem to get it to work. It looks like R is only reading the first two tables. The link is below.
https://www.pro-football-reference.com/years/2018/index.htm
This is what I have tried...
library(XML)
library(RCurl)  # getURL() comes from RCurl
TeamData = 'https://www.pro-football-reference.com/years/2018/index.htm'
URL = TeamData
URLdata = getURL(URL)
table = readHTMLTable(URLdata, stringsAsFactors = F, which = 5)
Scraping Sports Reference sites can be tricky but they are great sources:
library(rvest)
library(httr)
link <- "https://www.pro-football-reference.com/years/2018/index.htm"
doc <- GET(link)
cont <- content(doc, "text") %>%
  gsub(pattern = "<!--\n", "", ., fixed = TRUE) %>%  # strip the comment markers that hide most of the tables
  read_html() %>%
  html_nodes(".table_outer_container table") %>%
  html_table()
# Team Offense table is the fifth one
df <- cont[[5]]
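If counting tables feels fragile, a hedged alternative is to select the Team Offense table by id instead of position; the id "team_stats" below is an assumption about the page's markup, so verify it in your browser's inspector:
page <- content(doc, "text") %>%
  gsub(pattern = "<!--\n", "", ., fixed = TRUE) %>%   # same comment-stripping step as above
  read_html()
team_offense <- page %>%
  html_node("table#team_stats") %>%   # assumed id -- check it with the inspector
  html_table()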
I've been web scraping articles in R from the Oxford journals and want to grab the full text of specific articles. All articles have a PDF link, so I've been trying to pull the PDF link and scrape the entire text into a CSV. The full text should fit into one row; however, the output in the CSV file shows one article spread across 11 rows. How can I fix this issue?
The code is below:
####install.packages("rvest")
library(rvest)
library(RCurl)
library(XML)
library(stringr)
#for Fulltext to read pdf
####install.packages("pdftools")
library(pdftools)
fullText <- function(parsedDocument){
  endLink <- parsedDocument %>%
    html_node('.article-pdfLink') %>% html_attr('href')
  frontLink <- "https://academic.oup.com"
  #link of pdf
  pdfLink <- paste(frontLink, endLink, sep = "")
  #extract full text from pdfLink
  pdfFullText <- pdf_text(pdfLink)
  fulltext <- paste(pdfFullText, sep = "\n")
  return(fulltext)
}
#############################################
#main function with input as parameter year
testFullText <- function(DOIurl){
  parsedDocument <- read_html(DOIurl)
  DNAresearch <- data.frame()
  allData <- data.frame("Full Text" = fullText(parsedDocument), stringsAsFactors = FALSE)
  DNAresearch <- rbind(DNAresearch, allData)
  write.csv(DNAresearch, "DNAresearch.csv", row.names = FALSE)
}
testFullText("https://doi.org/10.1093/dnares/dsm026")
Looking at your last function, if I understand correctly, you want to take the URL, scrape all the text into a data frame/tibble and then export it to a CSV. Here is how you can do it with just one article, and you should be able to loop through some links with a little manipulation (apologies if I am misunderstanding):
library(tidyverse)
library(rvest)
# read in html link
document_link <- read_html("https://doi.org/10.1093/dnares/dsm026")
# get the text, and put it into a tibble with only 1 row
text_tibble <- document_link %>%
  html_nodes('.chapter-para') %>%
  html_text() %>%
  as_tibble() %>%
  summarize(full_text = paste(value, collapse = " ")) ## this will collapse to 1 row
# now write to csv
## write_csv(text_tibble, file = "")
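To extend this to several articles, here is a hedged sketch that wraps the same steps in a helper and stacks one row per link with purrr's map_dfr (loaded via tidyverse); the extra entries you add to doi_links are up to you:
scrape_article <- function(link) {
  read_html(link) %>%
    html_nodes(".chapter-para") %>%
    html_text() %>%
    as_tibble() %>%
    summarize(full_text = paste(value, collapse = " ")) %>%
    mutate(url = link)   # keep the source link next to its text
}
doi_links <- c("https://doi.org/10.1093/dnares/dsm026")  # add your other article links here
all_articles <- map_dfr(doi_links, scrape_article)        # one row per article
## write_csv(all_articles, file = "DNAresearch.csv")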
I have tried to look at other questions, but they do not seem pertinent to my question. I am trying to scrape multiple .png plots with R from the 'Indicator' section of https://tradingeconomics.com/
For any indicator, there are multiple countries data and each country page includes a plot. I would like to find a way to scrape png files for each country through a single routine.
I have tried the first indicator ('growth rate'), and so far my code is the following:
library(stringr)
library(dplyr)
library(rvest)
tradeec <- read_html("https://tradingeconomics.com/country-list/gdp-growth-rate")
tradeec_countries <- tradeec %>%
  html_nodes("td:nth-child(1)") %>%
  html_text()
tradeec_countries <- str_replace_all(tradeec_countries, "[\r\n]" , "")
tradeec_countries <- as.data.frame(tradeec_countries)
tradeec_countries <- tradeec_countries[-c(91:95), ]
tradeec_plots <- paste0("https://d3fy651gv2fhd3.cloudfront.net/charts", tradeec_countries, "-gdp-growth.png?s=", i)
Nonetheless I am not reaching my goal.
Any hint?
updated answer
For example, all the figures in the World column of the link can be obtained using the following code. Other columns, such as Europe, America, Asia, Australia, and G20, can be obtained similarly.
page <- read_html("https://tradingeconomics.com/country-list/gdp-growth-rate")
url_init <- "https://tradingeconomics.com"
country_list <- html_nodes(page,"td a") %>% html_attr("href")
world_list <- paste(url_init,country_list,sep = "")
page_list <- vector(mode = "list")
for (page_index in 1:length(world_list)) {
  page_list[[page_index]] = read_html(world_list[page_index])
}
for (i in 1:length(page_list)) {
  figure_link <- html_nodes(page_list[[i]], "#ImageChart") %>% html_attr("src")
  figure_name <- gsub(".*charts/(.*png).*", "\\1", figure_link, perl = TRUE)
  figure_name <- paste0(i, "_", figure_name)  # prefix with the index so file names stay unique and space-free
  download.file(figure_link, figure_name)
}
original answer
The following code can get the figure's link and name.
tradeec <- read_html("https://tradingeconomics.com/south-africa/gdp-growth")
figure_link <- html_nodes(tradeec, "#ImageChart") %>% html_attr("src")
figure_name <- gsub(".*charts/(.*png).*", "\\1", figure_link, perl = T)
download.file(figure_link,figure_name)
Then you can replace south-africa in the link with any of the countries you want.
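For instance, a small sketch looping the same steps over a handful of country slugs; the slugs below are illustrative, and it assumes the https://tradingeconomics.com/<country>/gdp-growth URL pattern holds for each of them:
countries <- c("south-africa", "germany", "japan")  # illustrative slugs
for (country in countries) {
  page <- read_html(paste0("https://tradingeconomics.com/", country, "/gdp-growth"))
  figure_link <- html_nodes(page, "#ImageChart") %>% html_attr("src")
  figure_name <- gsub(".*charts/(.*png).*", "\\1", figure_link, perl = TRUE)
  download.file(figure_link, figure_name, mode = "wb")  # "wb" so the png is written as binary
}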
I'm trying to scrape zip codes from "https://www.zipcodestogo.com/county-zip-code-list.htm", where states and counties will be provided in a dataset. Take Alabama, Dale as an example (shown below). However, when I use Selector Gadget to extract the table it does not appear, and when I look at the source code I also don't find this table. I'm not sure how to solve this. I'm very new to web-scraping so I apologize in advance if this is a stupid question. Thank you.
library(httr)   # POST(), content()
library(rvest)  # html_nodes()
zipurl = 'https://www.zipcodestogo.com/county-zip-code-list.htm'
query = list('State:' = "Alabama",
             'Counties:' = "Dale")
website = POST(zipurl, body = query, encode = "form")
tables <- html_nodes(content(website), css = 'table')
Same idea, but grabbing the table and removing the header:
library(rvest)
library(dplyr)  # for slice()
state = "ALABAMA"
county = "DALE"
url = paste0("https://www.zipcodestogo.com/lookups/countyZipCodes.php?state=", state, "&county=", county)
r <- read_html(url) %>%
  html_node("table table") %>%
  html_table() %>%
  slice(-1)   # drop the header row
print(r)
The zip-codes-only column is then:
r$X1
You could also limit to first table column and remove first row:
r <- read_html(url) %>%
  html_nodes("table table td:nth-of-type(1)") %>%
  html_text() %>%
  as.character
print(r[-1])
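Since the states and counties come from a dataset, here is a hedged sketch that wraps the same lookup in a function and applies it to each state/county pair; the lookups data frame and its column names are illustrative:
library(rvest)
library(dplyr)
get_zips <- function(state, county) {
  url <- paste0("https://www.zipcodestogo.com/lookups/countyZipCodes.php?state=",
                toupper(state), "&county=", toupper(county))
  read_html(url) %>%
    html_node("table table") %>%
    html_table() %>%
    slice(-1) %>%   # drop the header row, as above
    pull(X1)        # keep only the zip code column
}
lookups <- data.frame(state = c("Alabama"), county = c("Dale"))  # illustrative input
zips <- Map(get_zips, lookups$state, lookups$county)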
You can use the links that you can find with your browser under Inspect > Network tab.
Here is a solution:
state = "ALABAMA"
county = "DALE"
url_scrape = paste0("https://www.zipcodestogo.com/lookups/countyZipCodes.php?state=",state,"&county=",county) # Inspect > Network > XHR links
# function => First letter Capital (needed for regexp)
capwords <- function(s, strict = T) { # You can find this function on the forum
  cap <- function(s) paste(toupper(substring(s, 1, 1)),
                           {s <- substring(s, 2); if (strict) tolower(s) else s},
                           sep = "", collapse = " ")
  sapply(strsplit(s, split = " "), cap, USE.NAMES = !is.null(names(s)))
}
zip_codes = read_html(url_scrape) %>% html_nodes("td") %>% html_text()
zip_codes = zip_codes[-c(1:6)] # Delete header
string_regexp = paste0(capwords(state),"|View") # pattern as var
zip_codes = zip_codes[-grep(pattern = string_regexp,zip_codes)]
df = data.frame(zip = zip_codes[grep("\\d",zip_codes)], label = zip_codes[-grep("\\d",zip_codes)])
I have tried scraping data from a real estate site and arranging the data in a way that can then easily be filtered and checked using a spreadsheet. I'm actually a little embarrassed that I can't move this R code forward.
Now that I have all the links to the posts, I cannot figure out how to loop through the previously compiled data frame and get the details from all the URLs.
Could you please help me with it? Thanks a lot.
#Loading the rvest package
library(rvest)
library(magrittr) # for the '%>%' pipe symbols
library(RSelenium) # to get the loaded html of the page
library(xml2)
complete <- data.frame()
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "chrome")
remDr$open()
URL.base <- "https://www.sreality.cz/hledani/prodej/byty?strana="
#"https://www.sreality.cz/hledani/prodej/byty/praha?strana="
#"https://www.sreality.cz/hledani/prodej/byty/praha?stari=dnes&strana="
#"https://www.sreality.cz/hledani/prodej/byty/praha?stari=tyden&strana="
for (i in 1:10000) {
  # Specifying the url for the desired website to be scraped
  main_link <- paste0(URL.base, i)
  # go to website
  remDr$navigate(main_link)
  # get page source and save it as an html object with rvest
  main_page <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
  # get the data
  name <- html_nodes(main_page, css = ".name.ng-binding") %>% html_text()
  locality <- html_nodes(main_page, css = ".locality.ng-binding") %>% html_text()
  norm_price <- html_nodes(main_page, css = ".norm-price.ng-binding") %>% html_text()
  sreality_url <- main_page %>% html_nodes(".title") %>% html_attr("href")
  sreality_url2 <- sreality_url[c(4:24)]
  name2 <- name[c(4:24)]
  record <- data.frame(cbind(name2, locality, norm_price, sreality_url2))
  complete <- rbind(complete, record)
}
# Write CSV in R
write.csv(complete, file = "MyData.csv")
I would do this differently:
I would create a function, say 'scraper', that groups together all the scraping functions you have already defined. Then I would create a list with str_c() of all the possible links (say 30), and after that use a simple lapply() call. That said, I will not use RSelenium. (Libraries: rvest, stringr, tibble, dplyr.)
url = 'https://www.sreality.cz/hledani/prodej/byty?strana='
Here is the URL base; starting from it you should be able to replicate the URL strings for all the pages (1 to however many) you are interested in (and for all the possible URLs: praha, olomouc, ostrava, etc.).
main_page = read_html('https://www.sreality.cz/hledani/prodej/byty?strana=')
Here you create all the links according to the number of pages you want:
list.of.pages = str_c(url, 1:30)
Then define a single function for each piece of data you are interested in; this way you are more precise, error debugging is easier, and so is checking data quality. (I assume your CSS selectors are right, otherwise you will obtain empty objects.)
for names
name = function(url) {
  data = html_nodes(url, css = ".name.ng-binding") %>%
    html_text()
  return(data)
}
for locality
locality = function(url) {
  data = html_nodes(url, css = ".locality.ng-binding") %>%
    html_text()
  return(data)
}
for normprice
normprice = function(url) {
  data = html_nodes(url, css = ".norm-price.ng-binding") %>%
    html_text()
  return(data)
}
for hrefs
sreality_url = function(url) {
  data = html_nodes(url, css = ".title") %>%
    html_attr("href")
  return(data)
}
Those are the single functions (the CSS selectors, even though I didn't test them, don't look quite right to me, but this gives you the right framework to work on). After that, combine them into a tibble object:
get.data.table = function(html){
  name = name(html)
  locality = locality(html)
  normprice = normprice(html)
  hrefs = sreality_url(html)
  combine = tibble(adtext = name,
                   loc = locality,
                   price = normprice,
                   URL = hrefs)   # use the extracted hrefs, not the function itself
  combine = combine %>%
    select(adtext, loc, price, URL)
  return(combine)
}
then the final scraper:
scrape.all = function(urls){
  urls %>%
    lapply(read_html) %>%       # parse each page first; the helpers above expect parsed html
    lapply(get.data.table) %>%
    bind_rows() %>%
    write.csv(file = 'MyData.csv')
}
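Illustrative usage of the sketch above, reusing the list built earlier; this writes MyData.csv to the working directory:
scrape.all(list.of.pages)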
I would like to get the information of href from below.
http://www.mitbbs.com/bbsdoc1/USANews_101_0.html
I prefer to get someting from each topic like this
/USANews/31587637.html
/USANews/31587633.html
/USANews/31587631.html
...
The code I used is below, but it doesn't work.
library("XML")
library("httr")
library("stringr")
data <- list()
for (i in 101:201){
  url <- paste('bbsdoc1/USANews_', i, '_0.html', sep = '')
  html <- content(GET("http://www.mitbbs.com/", path = url), as = 'parsed')
  url.list <- xpathSApply(html, "//td[#align='left' height=26]/[#class='news1' href]", xmlAttrs)
  data <- rbind(data, url.list)
}
Your suggestions are really appreciated!
You should look into the rvest package, which simplifies things a lot:
library(rvest); library(dplyr)
myList <- read_html("http://www.mitbbs.com/bbsdoc1/USANews_101_0.html") %>%
  html_nodes(".news1") %>% xml_attr("href")
myList
myList %>% gsub("/article_t", "", .)
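A hedged sketch extending this to the whole page range from the question (101 to 201), collecting every cleaned href into one vector:
library(rvest)
pages <- paste0("http://www.mitbbs.com/bbsdoc1/USANews_", 101:201, "_0.html")
all_links <- unlist(lapply(pages, function(p) {
  read_html(p) %>%
    html_nodes(".news1") %>%
    html_attr("href") %>%
    gsub("/article_t", "", .)
}))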
Retrieve the document
library(XML)
html = htmlParse("http://www.mitbbs.com/bbsdoc1/USANews_101_0.html")
and extract the links and text you're interested in using the appropriate xpath query
href = "//a[./#class='news1']/#href"
text = "//a[./#class='news1']/text()"
df = data.frame(
url=sub("article_t/", "", sapply(html[href], as.character)),
text=trimws(sapply(html[text], xmlValue)))
trimws() is a function in recent versions of R.