I have some code that scrapes data off this link (http://stats.ncaa.org/team/stats?org_id=575&sport_year_ctl_id=12280) and runs some calculations.
What I want to do is cycle through every team, scraping the data and running the same manipulations for each one. I have a dataframe with every team link, like the one above.
Pseudocode:
for (link in teamlist)
{scrape, manipulate, put into a table}
However, I can't figure out how to loop through the links.
I've tried doing URL = teamlist$link[i], but I get an error when using readHTMLTable(). I have no trouble manually pasting each team's individual URL into the script; the problem only appears when I try to pull the URL from the table.
Current code:
library(XML)
library(gsubfn)
URL= 'http://stats.ncaa.org/team/stats?org_id=575&sport_year_ctl_id=12280'
tx<- readLines(URL)
tx2<-gsub("</tbody>","",tx)
tx2<-gsub("<tfoot>","",tx2)
tx2<-gsub("</tfoot>","</tbody>",tx2)
Player_Stats = readHTMLTable(tx2,asText=TRUE, header = T, which = 2,stringsAsFactors = F)
Thanks.
I agree with @ialm that you should check out the rvest package, which makes it very fun and straightforward to loop through links. I will create some example code here using similar subject matter for you to check out.
Here I am generating a list of links that I will iterate through:
rm(list=ls())
library(rvest)
mainweb="http://www.basketball-reference.com/"
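# read_html() replaces the deprecated html(); html_attr("href") returns the links as a character vector
urls=read_html("http://www.basketball-reference.com/teams") %>%
  html_nodes("#active a") %>%
  html_attr("href")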
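Now that the list of links is complete, I iterate through each link and pull a table from each:
teamdata <- vector("list", length(urls))
for (j in seq_along(urls)) {
  # mainweb ends in "/" and each link starts with "/", so drop one of the two slashes
  bball <- read_html(paste0(mainweb, sub("^/", "", urls[[j]])))
  # select the node whose id is the team code in the link, e.g. /teams/ATL/ gives #ATL
  team_id <- gsub("/teams/([A-Z]+)/$", "\\1", urls[[j]], perl = TRUE)
  teamdata[[j]] <- bball %>%
    html_nodes(paste0("#", team_id)) %>%
    html_table()
}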
Please see the code below, which basically builds off your code and loops through two different team pages as identified by the vector team_codes. The tables are returned in a list where each list element corresponds to a team's table. However, the tables look like they will need more cleaning.
library(XML)
library(gsubfn)
Player_Stats <- list()
j <- 1
team_codes <- c(575, 580)
for(code in team_codes) {
URL <- paste0('http://stats.ncaa.org/team/stats?org_id=', code, '&sport_year_ctl_id=12280')
tx<- readLines(URL)
tx2<-gsub("</tbody>","",tx)
tx2<-gsub("<tfoot>","",tx2)
tx2<-gsub("</tfoot>","</tbody>",tx2)
Player_Stats[[j]] = readHTMLTable(tx2,asText=TRUE, header = T, which = 2,stringsAsFactors = F)
j <- j + 1
}
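A possible reason the original loop over teamlist$link fails, sketched under assumptions: if the dataframe was built with stringsAsFactors = TRUE (the default before R 4.0), teamlist$link[i] is a factor rather than a string, and readLines() only accepts a character string or a connection. The teamlist below is a made-up stand-in (only the column name link comes from the question), and scrape_team() is a hypothetical wrapper around the steps shown above.

```r
# Hypothetical stand-in for the asker's teamlist; only the column name `link` is from the question
teamlist <- data.frame(
  link = paste0("http://stats.ncaa.org/team/stats?org_id=", c(575, 580),
                "&sport_year_ctl_id=12280"),
  stringsAsFactors = TRUE  # mimics the pre-R-4.0 default
)

# Convert once, up front, so every element is a plain character string
links <- as.character(teamlist$link)

# Hypothetical wrapper around the readLines/gsub/readHTMLTable steps above
scrape_team <- function(url) {
  tx <- readLines(url)
  tx2 <- gsub("</tbody>", "", tx)
  tx2 <- gsub("<tfoot>", "", tx2)
  tx2 <- gsub("</tfoot>", "</tbody>", tx2)
  XML::readHTMLTable(tx2, asText = TRUE, header = TRUE, which = 2,
                     stringsAsFactors = FALSE)
}

# Not run here because it needs network access:
# Player_Stats <- lapply(links, scrape_team)
# names(Player_Stats) <- links
```

If the column is already character, the as.character() call is harmless, so it is cheap to keep as a guard.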
Related
I want to import multiple PDF files into R, but each page has 4 columns and a header/footer line, and the files contain a table of contents.
For the purpose of text mining I want to remove these from my file or character vector.
Right now I am using two functions to read in the files. The first is pdf_text, which keeps the page structure but can't deal with the 4 columns. The second is extract_text, which on its own doesn't keep the pages but can deal with the column structure (and copes decently with the tables that occur).
But neither of them is able to remove the table of contents (as far as I have tried).
My data set is not exactly minimal, but with anything smaller I had problems with the data structures. Here is working code:
################ relevant code ##############
library(pdftools)
library(tidyverse)
library(tabulizer)
library(tidytext) # provides unnest_tokens(), used further down
files_name <- "Nachhaltigkeit 2021.pdf"
file_url <- c("https://www.allianz.com/content/dam/onemarketing/azcom/Allianz_com/sustainability/documents/Allianz_Group_Sustainability_Report_2021-web.pdf", "https://www.allianz.com/content/dam/onemarketing/azcom/Allianz_com/investor-relations/en/results-reports/annual-report/ar-2021/en-Allianz-Group-Annual-Report-2021.pdf")
reports_list <- lapply(file_url, pdf_text)
createTibble <- function(){
tibble_together <- NULL
#for all files
for(i in 1:length(files_name)){
page_nr <- length(reports_list[[i]])
tib <- tibble(report = rep(files_name[i], page_nr), page = 1:page_nr, text = gsub("\r\n", " ",
extract_text(files_name[[i]], pages = 1:page_nr)))
tibble_together <- rbind(tibble_together, tib)
}
return(tibble_together)
}
reports_df <- createTibble()
############ code for problem visualization ###############
reports_df <- reports_df %>% unnest_tokens(output = word, input = text, token = "words")
#e.g this part contains the table of contents which is not intended
(reports_df %>% filter(page == 34, report == "Nachhaltigkeit 2021.pdf"))$word[832:885]
Thanks in advance for your help.
PS: This is my first question, so if you need anything else, let me know.
I also know that the function createTibble probably isn't optimal, but that's not my primary concern.
For this website, https://www.coinopsy.com/dead-coins/, I'm using R and the rvest package to scrape the names, summaries and similar information to make my own form. I've done this successfully with other websites, but this one is odd.
In previous jobs I used SelectorGadget to figure out the CSS selectors, but here html_nodes and html_text return an empty character vector. I don't know whether that's because the website is structured in a totally different way.
An example of the HTML:
<td class="all sorting_1"><a class="coin_name" href="007coin">007Coin </a></td>
<a class="coin_name" href="007coin">007Coin </a>
library(rvest)
url <- "https://www.coinopsy.com/dead-coins/"
webpage <- read_html(url)
Item_html <- html_nodes(webpage,'.coin_name')
Item <- html_text(Item_html)
> Item
character(0)
Can someone help me out on this issue?
If you disable JavaScript in the browser, you will see that the content is not loaded. If you then inspect the HTML, you will see the data is stored in a script tag and is presumably loaded into the table when JavaScript runs in the browser. JavaScript doesn't run with the method you are using. You can instead extract the JavaScript array of arrays from the response HTML and parse it into a dataframe. I am new to R, so I am still looking into how this can be done in this case; I include a full example in Python at the end and will update if my research yields something. Otherwise, you can regex out the contents from the string returned in data.
library(rvest)
library(stringr)
library(magrittr)
url = 'https://www.coinopsy.com/dead-coins/'
r <- read_html(url) %>%
html_node('body') %>%
html_text() %>%
toString()
data <- str_match_all(r,'var table_data = (.*?);')
data <- data[[1]][,2] # string representation of list of lists
#step to convert string to object
#step to convert object to dataframe
In Python there is the ast library, which makes the conversion easy; the result of the code below is the table you see on the page.
import requests
import re
import ast
import pandas as pd
r = requests.get('https://www.coinopsy.com/dead-coins/')
p = re.compile(r'var table_data = (.*?);') #p1 = re.compile(r'(\[".*?"\])')
data = p.findall(r.text)[0]
listings = ast.literal_eval(data)
df = pd.DataFrame(listings)
print(df)
Edit:
Currently I can't find a library that does the conversion I mentioned. Below is an ugly way of combining the rows, and it feels inefficient. I would welcome suggestions for improvements (though that may be for code review later). I'm still looking at this, so I will update.
library(rvest)
library(stringr)
library(magrittr)
url = 'https://www.coinopsy.com/dead-coins/'
headers <- c("Column To Drop","Name","Summary","Project Start Date","Project End Date","Founder","urlId")
# https://www.coinopsy.com/dead-coins/bigone-token/ where bigone-token is urlId
r <- read_html(url) %>%
html_node('body') %>%
html_text() %>%
toString()
data <- str_match_all(r,'var table_data = (.*?);')
data <- data[[1]][,2]
z <- substr(data, start = 2, stop = nchar(data)-1) %>% str_match_all(., "\\[(.*?)\\]")
z <- z[[1]][,2]
for(i in seq(1,length(z))){
if(i==1){
df <- rapply(as.list(strsplit(z[i], ",")[[1]][2:7]), function(x) trimws(sub("'(.*?)'", "\\1", x)))
}else{
df <- rbind(df,rapply(as.list(strsplit(z[i], ",")[[1]][2:7]), function(x) trimws(sub("'(.*?)'", "\\1", x))))
}
}
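One alternative to splitting each row on commas, which breaks as soon as a summary itself contains a comma, is to pull out the quoted fields with a regular expression. This is only a sketch: the sample string below is invented to imitate the shape of table_data (single-quoted fields, seven per row), the real data may be quoted or escaped differently, and a field containing a square bracket or an apostrophe would need a more careful pattern.

```r
# Invented sample in the assumed shape of table_data; not real site data
sample_data <- paste0(
  "[['0', '007Coin', 'Abandoned, no volume', '2014', '2016', 'Unknown', '007coin'], ",
  "['1', 'BigONE Token', 'Dead', '2017', '2019', 'Unknown', 'bigone-token']]"
)
headers <- c("Column To Drop", "Name", "Summary", "Project Start Date",
             "Project End Date", "Founder", "urlId")

# Each inner [...] with no nested brackets is one table row
rows <- regmatches(sample_data,
                   gregexpr("\\[[^\\[\\]]+\\]", sample_data, perl = TRUE))[[1]]

# Within a row, every '...' run is one field; strip the surrounding quotes
fields <- lapply(rows, function(row) {
  quoted <- regmatches(row, gregexpr("'[^']*'", row))[[1]]
  gsub("^'|'$", "", quoted)
})

# Bind the rows into a character matrix, then into a data frame
df <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
names(df) <- headers
df <- df[, -1]  # drop the first column, as the headers vector suggests
```

Building the list of rows first and calling rbind once also avoids growing the data frame inside a loop.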
Maybe this will help someone. I had the same problem, and the solution was to put the tag name in front of the "." in the selector. In your case you want to select the class coin_name, but when you pass that class to html_nodes you don't specify the tag, which is the same thing I did. To solve it, I only had to add the tag, which in your case is "a", so it would look like this:
Item_html <- html_nodes(webpage,'a.coin_name')
That way the html_nodes function would not return an empty result.
I know you already solved it, but I hope this helps someone.
I'm unable to scrape the table at the link below. I've inspected the source code and noted that the table has the class name tablesaw-sortable.
I tested the method below on a Wikipedia page and it was able to extract the table. Is there any way to read this particular table?
url <- read_html("https://www.wunderground.com/history/airport/KNYC/2015/01/01/DailyHistory.html?HideSpecis=0")
weather_hourly <- url %>%
html_nodes(xpath='//*[@class="tablesaw-sortable"]') %>%
html_table()
Ok, something like this should get you pretty close to where you want to be.
library("httr")
URL <- "https://www.timeanddate.com/weather/usa/new-york/historic?month=8&year=2018"
temp <- tempfile(fileext = ".html")
GET(url = URL, user_agent("Mozilla/5.0"), write_disk(temp))
library("XML")
df <- readHTMLTable(temp)
df <- df[[2]]
df
Create a small loop if you want to iterate through a bunch of URLs and import data from each.
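The small loop could look roughly like this. It is a sketch that assumes the other months follow the same URL pattern and that the wanted table is again the second one on each page; fetch_table() and the choice of months are made up for illustration.

```r
# Assumed: only the month number changes between pages
months <- 6:8
urls <- sprintf(
  "https://www.timeanddate.com/weather/usa/new-york/historic?month=%d&year=2018",
  months
)

# Hypothetical helper wrapping the download-then-parse steps shown above
fetch_table <- function(u) {
  temp <- tempfile(fileext = ".html")
  httr::GET(url = u, httr::user_agent("Mozilla/5.0"), httr::write_disk(temp))
  XML::readHTMLTable(temp)[[2]]
}

# Not run here because it needs network access:
# tables <- lapply(urls, fetch_table)
# names(tables) <- months
```

Keeping the results in a named list makes it easy to bind them together afterwards or to inspect a single month.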
I am working on a web scraping project, which aims to extract Google + reviews from a set of children's hospitals. My methodology is as follows:
1) Define a list of Google + urls to navigate to for review scraping. The urls are in a dataframe along with other variables defining the hospital.
2) Scrape reviews, number of stars, and post time for all reviews related to a given url.
3) Save these elements in a dataframe, and name the dataframe after another variable in the dataframe corresponding to the url.
4) Move on to the next url ... and so on till all urls are scraped.
Currently, the code is able to scrape from a single url. I have tried to create a function using map from the purrr package, but it doesn't seem to be working, so I must be doing something wrong.
Here is my attempt, with comments on the purpose of each step
#Load the necessary libraries
devtools::install_github("ropensci/RSelenium")
library(purrr)
library(dplyr)
library(stringr)
library(rvest)
library(xml2)
library(RSelenium)
#To avoid any SSL error messages
library(httr)
set_config( config( ssl_verifypeer = 0L ) )
Defining the URL dataframe
#Now to define the dataframe with the urls
urls_df =data.frame(Name=c("CHKD","AIDHC")
,ID=c("AAWZ12","AAWZ13")
,GooglePlus_URL=c("https://www.google.co.uk/search?ei=fJUKW9DcJuqSgAbPsZ3gDQ&q=Childrens+Hospital+of+the+Kings+Daughter+&oq=Childrens+Hospital+of+the+Kings+Daughter+&gs_l=psy-ab.3..0i13k1j0i22i10i30k1j0i22i30k1l7.8445.8445.0.9118.1.1.0.0.0.0.144.144.0j1.1.0....0...1c.1.64.psy-ab..0.1.143....0.qDMr7IDA-uA#lrd=0x89ba9869b87f1a69:0x384861b1e3a4efd3,1,,,",
"https://www.google.co.uk/search?q=Alfred+I+DuPont+Hospital+for+Children&oq=Alfred+I+DuPont+Hospital+for+Children&aqs=chrome..69i57.341j0j8&sourceid=chrome&ie=UTF-8#lrd=0x89c6fce9425c92bd:0x80e502f2175fb19c,1,,,"
))
Creating the function
extract_google_review=function(googleplus_urls) {
#Opens a Chrome session
rmDr=rsDriver(browser = "chrome",check = F)
myclient= rmDr$client
#Creates a sub-dataframe for the filtered hospital, which I will later use to name the dataframe
urls_df_sub=urls_df %>% filter(GooglePlus_URL %in% googleplus_urls)
#Navigate to the url
myclient$navigate(googleplus_urls)
#click on the snippet to switch focus----------
webEle <- myclient$findElement(using = "css",value = ".review-snippet")
webEle$clickElement()
# Save page source
pagesource= myclient$getPageSource()[[1]]
#simulate scroll down for several times-------------
count=read_html(pagesource) %>%
html_nodes(".p13zmc") %>%
html_text()
#Stores the number of reviews for the url, so we know how many times to scroll down
scroll_down_times=count %>%
str_sub(1,nchar(count)-5) %>%
as.numeric()
for(i in 1 :scroll_down_times){
webEle$sendKeysToActiveElement(sendKeys = list(key="page_down"))
#the content needs time to load,wait 1.2 second every 5 scroll downs
if(i%%5==0){
Sys.sleep(1.2)
}
}
#loop and simulate clicking on all "click on more" elements-------------
webEles <- myclient$findElements(using = "css",value = ".review-more-link")
for(webEle in webEles){
tryCatch(webEle$clickElement(),error=function(e){print(e)})
}
pagesource= myclient$getPageSource()[[1]]
#this should get the full review, including translation and original text
reviews=read_html(pagesource) %>%
html_nodes(".review-full-text") %>%
html_text()
#number of stars
stars <- read_html(pagesource) %>%
html_node(".review-dialog-list") %>%
html_nodes("g-review-stars > span") %>%
html_attr("aria-label")
#time posted
post_time <- read_html(pagesource) %>%
html_node(".review-dialog-list") %>%
html_nodes(".dehysf") %>%
html_text()
#Consolidating everything into a dataframe
reviews=head(reviews,min(length(reviews),length(stars),length(post_time)))
stars=head(stars,min(length(reviews),length(stars),length(post_time)))
post_time=head(post_time,min(length(reviews),length(stars),length(post_time)))
reviews_df=data.frame(review=reviews,rating=stars,time=post_time)
#Assign the dataframe a name based on the value in column 'Name' of the dataframe urls_df, defined above
df_name <- tolower(urls_df_sub$Name)
if(exists(df_name)) {
assign(df_name, unique(rbind(get(df_name), reviews_df)))
} else {
assign(df_name, reviews_df)
}
} #End function
Feeding the urls into the function
#Now that the function is defined, it is time to create a vector of urls and feed this vector into the function
googleplus_urls=urls_df$GooglePlus_URL
googleplus_urls %>% map(extract_google_review)
There seems to be an error in the function, which is preventing it from scraping and storing the data in separate dataframes as intended.
My Intended Output
2 dataframes, each with 3 columns
Any pointers on how this can be improved will be greatly appreciated.
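One likely contributor, sketched with stand-ins: assign() called inside a function creates the variable in that function's own environment, so the dataframe disappears when the function returns. Returning the dataframe and collecting the results in a named list avoids this. Below, scrape_reviews() is a hypothetical placeholder for the Selenium steps, and the URLs and review values are invented; purrr::map would behave like lapply here.

```r
# Hypothetical stand-in for the Selenium scraping steps; the values are placeholders
scrape_reviews <- function(url) {
  data.frame(review = paste("review scraped from", url),
             rating = "placeholder rating",
             time = "placeholder time",
             stringsAsFactors = FALSE)
}

# Shortened placeholder URLs; the real ones are in urls_df above
urls_demo <- data.frame(
  Name = c("CHKD", "AIDHC"),
  GooglePlus_URL = c("https://example.com/chkd", "https://example.com/aidhc"),
  stringsAsFactors = FALSE
)

# assign() here writes into the function's local environment only
assign_inside <- function(url, name) {
  assign(name, scrape_reviews(url))
  invisible(NULL)
}
assign_inside(urls_demo$GooglePlus_URL[1], "chkd_demo")
exists("chkd_demo")  # FALSE in a fresh session: the variable did not survive the call

# Returning the dataframe and naming the list elements keeps every result
results <- lapply(urls_demo$GooglePlus_URL, scrape_reviews)
names(results) <- tolower(urls_demo$Name)
str(results$chkd)
```

Each element of results is then one dataframe with 3 columns, which matches the intended output. Starting the browser session once, outside the function, would also avoid opening a new Chrome instance for every url.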
I am working on a web scraping program to collect data from multiple pages. The code below is an example of what I am working with. I am able to get only the first page with it. It would be a great help if someone could point out where I am going wrong in my syntax.
jump <- seq(1, 10, by = 1)
site <- paste0("https://stackoverflow.com/search?page=",jump,"&tab=Relevance&q=%5bazure%5d%20free%20tier")
dflist <- lapply(site, function(i) {
webpage <- read_html(i)
draft_table <- html_nodes(webpage,'.excerpt')
draft <- html_text(draft_table)
})
finaldf <- do.call(cbind, dflist)
finaldf_10<-data.frame(finaldf)
View(finaldf_10)
Below is the link I need to scrape the data from, which has 127 pages:
https://stackoverflow.com/search?q=%5Bazure%5D+free+tier
With the above code I am able to get data only from the first page and not from the rest of the pages. There is no syntax error either. Could you please help me find out where I am going wrong?
Some websites put security measures in place to prevent bulk scraping, and I guess SO is one of them. More on that: https://github.com/JonasCz/How-To-Prevent-Scraping/blob/master/README.md
In fact, if you delay your calls a little, this will work. I tried a 5-second Sys.sleep. You can probably reduce it, but too short a delay may not work (I tried a 1-second Sys.sleep, and that didn't work).
Here is working code:
library(rvest)
library(purrr)
dflist <- map(.x = 1:10, .f = function(x) {
Sys.sleep(5)
url <- paste0("https://stackoverflow.com/search?page=",x,"&q=%5bazure%5d%20free%20tier")
read_html(url) %>%
html_nodes('.excerpt') %>%
html_text() %>%
as.data.frame()
}) %>% do.call(rbind, .)
Best,
Colin
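A variation on the fixed delay, which is not part of the answer above: retry a failed request a few times and wait longer after each failure. The helper fetch_with_retry() and its arguments are hypothetical, and fetch can be any function that takes a URL, for example one that wraps read_html and the html_nodes/html_text steps.

```r
# Hypothetical helper: try fetch(url) up to `tries` times, pausing longer after each failure
fetch_with_retry <- function(url, fetch, tries = 3, wait = 5) {
  for (attempt in seq_len(tries)) {
    # An error (e.g. the site refusing the request) becomes NULL instead of stopping the loop
    result <- tryCatch(fetch(url), error = function(e) NULL)
    if (!is.null(result)) {
      return(result)
    }
    Sys.sleep(wait * attempt)  # wait, 2 * wait, 3 * wait, ... seconds
  }
  NULL  # every attempt failed
}

# Not run here because it needs network access and the rvest package:
# get_excerpts <- function(u) rvest::html_text(rvest::html_nodes(rvest::read_html(u), ".excerpt"))
# excerpts <- fetch_with_retry("https://stackoverflow.com/search?page=1&q=%5bazure%5d%20free%20tier", get_excerpts)
```

A NULL result then marks a page that could not be fetched, so the remaining pages can still be processed.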