I'm trying to scrape some data using this code.
require(XML)

tables <- readHTMLTable('http://fantasynba.movistarplus.es/basketball/reports/player_rankings.asp')
str(tables, max.level = 1)  # inspect which tables were parsed
df <- tables$searchResults  # the player rankings table
It works fine, but the problem is that it only gives me data for the first 188 observations, which correspond to the players whose position is "Base". Whenever I try to get the data for "Pivot" or "Alero" players, it returns the same information. Since the URL never changes, I don't know how to get that data.
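Since the position filter does not change the URL, the site presumably submits it as a form field or request parameter behind the scenes. As a rough, hypothetical sketch (the field name "position" and its values are guesses; the real request would need to be confirmed in the browser's Network tab), one could try posting the filter and handing the returned HTML to readHTMLTable:

library(httr)
library(XML)

# Hypothetical: the form field name and values below are guesses, not the site's confirmed interface.
res <- POST("http://fantasynba.movistarplus.es/basketball/reports/player_rankings.asp",
            body = list(position = "Pivot"),
            encode = "form")
doc <- htmlParse(content(res, as = "text"), asText = TRUE)
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
df <- tables$searchResults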
I am trying to scrape data from the web, specifically from a table that has different filters and pages, and I have the following code:
library(rvest)

url.colombia.compra <- "https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Microsoft&date_to=&date_from="
tmp <- read_html(url.colombia.compra)
tmp_2 <- html_nodes(tmp, ".active")
The problem is that the code gives me a list, but I need it formatted as a table and I have not managed to do that. It also only shows me data from the first page of the table. How could I extend the code so that it reads the data from all the pages of the table and formats it as a table?
This is what the table that shows the data looks like.
I would split this problem into two parts. The first is how to programmatically access each of the 11 pages of this online table.
Since this is a simple HTML table, clicking the "Next" button ("Siguiente") takes us to a new page. If we look at the URL of that next page, we can see the page number in the query parameters.
...tienda-virtual-del-estado-colombiano/ordenes-compra?page=1&number_order=&state...
We know that the pages are numbered starting with 0 (because "Next" takes us to page=1), and from the navigation bar we can see that there are 11 pages.
We can use the query parameters to construct a series of rvest::read_html() calls, one per page number, simply by using lapply and paste0 to fill in the page= value. This lets us access all the pages of the table.
The second part is making use of rvest::html_table(), which parses a tibble from the result of each read_html() call.
pages <- lapply(0:10, function(x) {
  # pages are numbered 0 to 10 in the page= query parameter
  url <- paste0("https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?page=",
                x,
                "&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Microsoft&date_to_=%20&date_from_=")
  as.data.frame(html_table(read_html(url))[[1]])
})
The result is a list of data frames, which we can combine with do.call().
do.call(rbind, pages)
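For what it's worth, dplyr::bind_rows() does the same job and is a little more forgiving if the columns of the individual pages don't line up exactly; this is just an optional alternative to the do.call() line above, not part of the original answer:

library(dplyr)

all_orders <- bind_rows(pages)  # equivalent to do.call(rbind, pages)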
There are two parts to my question, as I explored two methods in this exercise; however, I succeeded with neither. It would be greatly appreciated if someone could help me out.
[PART 1:]
I am attempting to scrape data from a webpage on the Singapore Stock Exchange, https://www2.sgx.com/derivatives/negotiated-large-trade, containing data stored in a table. I have some basic knowledge of scraping data using rvest. However, using the Inspector in Chrome, the HTML hierarchy is much more complex than I expected. I can see that the data I want is hidden under <div class="table-container">, and here's what I've tried:
library(rvest)
library(httr)
library(XML)

SGXurl <- "https://www2.sgx.com/derivatives/negotiated-large-trade"
SGXdata <- read_html(SGXurl, stringsAsFactors = FALSE)
html_nodes(SGXdata, ".table-container")
However, the code picks up nothing, and I doubt I'm using it correctly.
[PART 2:]
Then I realized that there's a small "Download" button on the page which downloads exactly the data file I want, in .csv format. So I was thinking of writing some code to mimic the download button, and I found this question: Using R to "click" a download file button on a webpage. However, even with some modifications to that code, I'm unable to get it to work.
There are a few filters on the webpage; mostly I'm interested in downloading data for a particular business day while leaving the other filters blank, so I've tried writing the following function:
library(httr)
library(rvest)
library(purrr)
library(dplyr)

crawlSGXdata <- function(date){
  POST("https://www2.sgx.com/derivatives/negotiated-large-trade",
       body = NULL,
       encode = "form",
       write_disk("SGXdata.csv")) -> resfile
  res <- read.csv("SGXdata.csv")
  return(res)
}
I intended to put the function input "date" into the body argument; however, I was unable to figure out how to do that, so I started with body = NULL, assuming it would not apply any filtering. However, the result is still unsatisfactory: the downloaded file is basically empty apart from the following error:
Request Rejected
The requested URL was rejected. Please consult with your administrator.
Your support ID is: 16783946804070790400
The content is loaded dynamically from an API call that returns JSON. You can find this call in the Network tab of your browser's dev tools.
The following returns that content. I find the total number of pages of results, then loop over the remaining pages, combining the data frame returned from each call into one final data frame containing all results.
library(jsonlite)

url <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=0&pageSize=250'
r <- jsonlite::fromJSON(url)
num_pages <- r$meta$totalPages
df <- r$data

# template URL with a placeholder for the page number
url2 <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=placeholder&pageSize=250'

if(num_pages > 1){
  for(i in seq(1, num_pages - 1)){  # page 0 was already fetched above
    newUrl <- gsub("placeholder", i, url2)
    newdf <- jsonlite::fromJSON(newUrl)$data
    df <- rbind(df, newdf)
  }
}
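Coming back to the original goal of a function that takes a date: since the business date appears directly in the API URL (businessdatestart and businessdateend, in YYYYMMDD format), a wrapper could be sketched as follows. This is untested and assumes the API accepts any valid business date in that format:

library(jsonlite)

# date should be a string in YYYYMMDD format, e.g. "20190708"
crawlSGXdata <- function(date){
  base <- paste0("https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode",
                 "&category=futures&businessdatestart=", date,
                 "&businessdateend=", date,
                 "&pagestart=%d&pageSize=250")
  first <- fromJSON(sprintf(base, 0))
  num_pages <- first$meta$totalPages
  df <- first$data
  if(num_pages > 1){
    for(i in seq(1, num_pages - 1)){
      df <- rbind(df, fromJSON(sprintf(base, i))$data)
    }
  }
  df
}

# usage: sgx_20190708 <- crawlSGXdata("20190708")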
I'm trying to scrape data into R from a table on a website but am having some trouble finding the table. The table does not seem to have any distinguishable class or id, as it is labeled as . I have scraped data from many tables but have never encountered this type of situation. I'm still a novice and have done some searching, but nothing has worked so far.
I have tried the following R commands, but they only produce a result of "" with no data scraped. I did get some results when selecting the article-template class, but it is jumbled, so I know the data exists in the read_html() result. I am just not sure of the proper way to program for this type of website.
Here is what I have so far in R:
library(xml2)
library(rvest)

website <- read_html("https://baseballsavant.mlb.com/daily_matchups")

dailymatchup <- website %>%
  html_nodes(xpath = '//*[@id="leaderboard"]') %>%
  html_text()
Essentially, pulling the data from the table should give me a quick, organizable data frame. That is the ultimate target I am seeking.
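A useful first check, in the same spirit as the SGX answer above, is whether the leaderboard node exists in the static HTML at all; if it does not, the table is rendered by JavaScript and the underlying data has to be found as a separate JSON request in the browser's Network tab. A small sketch of that check (a diagnostic, not a full solution):

library(rvest)

website <- read_html("https://baseballsavant.mlb.com/daily_matchups")

# An empty node set here means the table is not in the static HTML,
# so rvest alone cannot see it and the data is loaded dynamically.
html_nodes(website, xpath = '//*[@id="leaderboard"]//table')

# If a table node is found, html_table() turns it into a data frame:
# html_nodes(website, xpath = '//*[@id="leaderboard"]//table') %>% html_table()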
My goal is to obtain a time series, from week 1 of 1996 to week 46 of 2016, of legionellosis cases from this website supported by the Centers for Disease Control and Prevention (CDC) of the United States. A coworker attempted to scrape only the tables that contain legionellosis cases with the code below:
#install.packages('rvest')
library(rvest)
## Code to get all URLS
getUrls <- function(y1, y2, clist){
  root  <- "https://wonder.cdc.gov/mmwr/mmwr_1995_2014.asp?mmwr_year="
  root1 <- "&mmwr_week="
  root2 <- "&mmwr_table=2"
  root3 <- "&request=Submit&mmwr_location="
  urls <- NULL
  for (year in y1:y2){
    for (week in 1:53){
      for (part in clist) {
        urls <- c(urls, paste(root, year, root1, week, root2, part, root3, sep = ""))
      }
    }
  }
  return(urls)
}
TabList<-c("A","B") ## can change to get not just 2 parts of the table but as many as needed.
WEB <- as.data.frame(getUrls(1996,2014,TabList)) # Only applies from 1996-2014. After 2014, the root url changes.
head(WEB)
#Example of how to extract data from a single webpage.
url <- 'https://wonder.cdc.gov/mmwr/mmwr_1995_2014.asp?mmwr_year=1996&mmwr_week=20&mmwr_table=2A&request=Submit&mmwr_location='
webpage <- read_html(url)
sb_table <- html_nodes(webpage, 'table')
sb <- html_table(sb_table, fill = TRUE)[[2]]
# Test whether Legionellosis is in the table. Returns a vector of the column indices where the text is found.
# This can be used to keep only the pages you need and select only those columns.
test <- grep("Leg", sb)
sb <- sb[,c(1,test)]
### This code only works if the table has 3 rows of headings. It needs to be adapted to be more general for all tables.
#Get Column names
colnames(sb) <- paste(sb[2,], sb[3,], sep="_")
colnames(sb)[1] <- "Area"
sb <- sb[-c(1:3),]
#Remove commas from numbers so that you can then convert columns to numerical values. Only important if numbers above 1000
Dat <- sapply(sb, FUN= function(x)
as.character(gsub(",", "", as.character(x), fixed = TRUE)))
Dat<-as.data.frame(Dat, stringsAsFactors = FALSE)
However, the code is not finished, and I thought it might be best to use the API instead, since the structure and layout of the tables on these webpages change. That way we wouldn't have to comb through the tables to figure out when the layout changes and adjust the web-scraping code accordingly. Thus I attempted to pull the data from the API.
Now, I found two help documents from the CDC that provide the data. One appears to provide data from 2014 onward and uses RSocrata (which can be seen here), while the other appears to be more general and uses XML-format requests over HTTP (which can be seen here). The XML request over HTTP requires a database ID, which I could not find. I then stumbled onto RSocrata and decided to try that instead, but the code snippet provided, together with the app token I set up, did not work.
install.packages("RSocrata")
library("RSocrata")
df <- read.socrata("https://data.cdc.gov/resource/cmap-p7au?$$app_token=tdWMkm9ddsLc6QKHvBP6aCiOA")
How can I fix this? My end goal is a table of legionellosis cases from 1996 to 2016 on a weekly basis by state.
I'd recommend checking out this issue thread in the RSocrata GitHub repo where they're discussing a similar issue with passing tokens into the RSocrata library.
In the meantime, you can actually leave off the $$app_token parameter, and as long as you're not flooding us with requests, it'll work just fine. There's a throttling limit you can sneak under without using an app token.
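For example, dropping the token from the call in the question, a minimal attempt (assuming the cmap-p7au resource is the right dataset) would be:

library(RSocrata)

# same resource as in the question, but without the $$app_token parameter
df <- read.socrata("https://data.cdc.gov/resource/cmap-p7au")
head(df)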
For a little project for myself I'm trying to get the results from some races.
I can access the pages with the results and download the data from the table on each page. However, there are only 20 results per page. Luckily the web addresses are built logically, so I can generate them and, in a loop, access these pages and download the data. However, each category has a different number of racers and thus can have a different number of pages. I want to avoid manually checking how many racers there are in each category.
My first thought was to just generate a lot of links, making sure there are enough (based on the total number of racers) to get all the data.
library(XML)  # for readHTMLTable

nrs <- rep(seq(1, 5, 1), 2)
sex <- c("M","M","M","M","M","F","F","F","F","F")
links <- NULL

# Loop to create 10 links: 5 for the male age group 18-24, 5 for the female age group 18-24.
# However, there are only 3 pages with a table in the male age group.
for (i in 1:length(nrs)) {
  links[i] <- paste("http://www.ironman.com/triathlon/events/americas/ironman/texas/results.aspx?p=", nrs[i], "&race=texas&rd=20160514&sex=", sex[i], "&agegroup=18-24&loc=", sep = "")
}
resultlist <- list()  # create an empty list to store the results

for (i in 1:length(links)) {
  results <- readHTMLTable(links[i],
                           as.data.frame = TRUE,
                           which = 1,
                           stringsAsFactors = FALSE,
                           header = TRUE)  # get the data
  resultlist[[i]] <- results  # collect the results in one big list
}

results <- do.call(rbind, resultlist)  # combine the results into one data frame
As you can see from this code, readHTMLTable throws an error as soon as it encounters a page with no table, and the loop then stops.
I thought of two possible solutions.
1) Somehow check whether each link exists. I tried url.exists from the RCurl package, but this doesn't work: it returns TRUE for all pages, because the page itself exists; it just doesn't have a table in it (so for me that is a false positive). I would need some code to check whether a table exists in the page, but I don't know how to go about that.
2) Suppress the error from readHTMLTable so the loop continues, but I'm not sure whether that's possible.
Any suggestions for these two methods, or any other suggestions?
I think that method #2 is easier. I modified your code with tryCatch, one of R's built-in exception-handling mechanisms. It works for me.
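A minimal sketch of what that tryCatch modification might look like, applied to the loop above (pages without a table simply yield NULL, which rbind ignores):

library(XML)

resultlist <- list()
for (i in 1:length(links)) {
  resultlist[[i]] <- tryCatch(
    readHTMLTable(links[i],
                  as.data.frame = TRUE,
                  which = 1,
                  stringsAsFactors = FALSE,
                  header = TRUE),
    error = function(e) NULL  # skip pages without a table
  )
}
results <- do.call(rbind, resultlist)  # NULL entries are dropped by rbind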
PS I would recommend using rvest for web scraping like this.