Scrape table in R using rvest

I'm unable to scrape the table at the link mentioned below. I've inspected the source code and noted that the table has the class name tablesaw-sortable.
I tested the method below on a Wikipedia page and it was able to extract the table there. Is there any way to read this particular table?
url <- read_html("https://www.wunderground.com/history/airport/KNYC/2015/01/01/DailyHistory.html?HideSpecis=0")
weather_hourly <- url %>%
  html_nodes(xpath = '//*[@class="tablesaw-sortable"]') %>%
  html_table()

Ok, something like this should get you pretty close to where you want to be.
library("httr")
URL <- "https://www.timeanddate.com/weather/usa/new-york/historic?month=8&year=2018"
temp <- tempfile(fileext = ".html")
GET(url = URL, user_agent("Mozilla/5.0"), write_disk(temp))
library("XML")
df <- readHTMLTable(temp)
df <- df[[2]]
df
Create a small loop if you want to iterate through a bunch of URLs and import data from each.
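For example, here is a minimal sketch of such a loop (the month values are just illustrative; it reuses the same approach as above):
library(httr)
library(XML)
months <- 1:3
results <- lapply(months, function(m) {
  u <- paste0("https://www.timeanddate.com/weather/usa/new-york/historic?month=", m, "&year=2018")
  tmp <- tempfile(fileext = ".html")
  GET(url = u, user_agent("Mozilla/5.0"), write_disk(tmp))
  # second table on the page, as in the snippet above
  readHTMLTable(tmp)[[2]]
})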

Related

Scraping 13F filings from SEC using R

I'm trying to scrape the data in the SEC FORM 13-F Information Table from the following link:
https://sec.report/Document/0001567619-21-010281/
I tried the below script:
library(timetk)
library(tidyverse)
library(rvest)
url <- "https://sec.report/Document/0001567619-21-010281/"
url <- read_html(url)
raw_data <- url %>%
  html_nodes("#table td") %>%
  html_text()
However, I'm unable to get the data components; raw_data comes back empty (character(0)). Any help would be appreciated.
The data is present in the response. You can use a CSS attribute = value selector to target the nested table. You will then need to decide what to do with the initial three rows, which most likely need to be transformed into a single header (or not!); one possible way to do that is sketched after the code below.
library(rvest)
library(magrittr)
page <- read_html("https://sec.report/Document/0001567619-21-010281/")
table <- page %>%
  html_node('[summary="Form 13F-NT Header Information"]') %>%
  html_table(fill = TRUE)
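A possible way to collapse those first three rows into a single header row (a sketch; it assumes the first three rows really are header fragments, as noted above):
hdr <- apply(table[1:3, ], 2, function(x) paste(trimws(x[nzchar(x)]), collapse = " "))
cleaned <- table[-(1:3), ]
names(cleaned) <- make.unique(hdr)
head(cleaned)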
Using the 13F information table rendered as an HTML page is much easier; here is an example in Python:
import pandas as pd
import requests

# Request the rendered information table with a browser user agent
url = "https://www.sec.gov/Archives/edgar/data/1541617/000154161721000009/xslForm13F_X01/altcap13f3q21infotable.xml"
request = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})

# Pass the HTML response into read_html; the holdings are in the fourth table
tables = pd.read_html(request.text)
df = tables[3]
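For reference, a rough R equivalent of that Python snippet (the table index [[4]] mirrors the tables[3] assumption above and is not guaranteed):
library(httr)
library(rvest)
url <- "https://www.sec.gov/Archives/edgar/data/1541617/000154161721000009/xslForm13F_X01/altcap13f3q21infotable.xml"
resp <- GET(url, user_agent("Mozilla/5.0"))
# Parse the returned HTML and pull every table on the page
tables <- read_html(content(resp, as = "text")) %>% html_table(fill = TRUE)
df <- tables[[4]]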

Web Scraping in R | Unable to extract information under a certain node using rvest

I'm trying to extract a bit of information under the node /html/head/script[16] from the website linked below but am unable to do so.
nykaa <- "https://www.nykaa.com/biotique-bio-kelp-protein-shampoo-for-falling-hair-intensive-hair-growth-treatment-conf/p/357142?categoryId=1292&productId=357142&ptype=product&skuId=39934"
obj <- read_html(nykaa)
extracted_json <- obj %>%
  html_nodes(xpath = "/html/head/script[16]") %>%
  html_text(trim = TRUE)
Currently the output of the above code is empty, but I would like to extract the data under the above-mentioned node in an organized manner.
You can use regex to grab the JavaScript object inside that script tag and then pass it to jsonlite to parse. You will need to root around a bit to get what you want from it, but it is all there:
library(rvest)
library(magrittr)
library(stringr)
library(jsonlite)
p <- read_html('https://www.nykaa.com/biotique-bio-kelp-protein-shampoo-for-falling-hair-intensive-hair-growth-treatment-conf/p/357142?categoryId=1292&productId=357142&ptype=product&skuId=39934') %>% html_text()
all_data <- jsonlite::parse_json(str_match_all(p,'window\\.__PRELOADED_STATE__ = (.*)')[[1]][,2])
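The parsed object is a large nested list, so a quick way to start rooting around is to inspect its top level (the element names are whatever the site ships and are not something to rely on):
str(all_data, max.level = 1)
names(all_data)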

Rvest, html_nodes return empty list and string, weird website

For this website, https://www.coinopsy.com/dead-coins/, I'm using R and the rvest package to scrape names, summaries and similar info to build my own dataset. I've done this with other websites and it was really successful, but this one is odd.
I used SelectorGadget, which was useful in my previous jobs for figuring out the CSS node names, but here html_nodes and html_text return an empty character vector. I don't know if it's because the website is structured in a totally different format!
An example of the css code:
td class="all sorting_1">a class="coin_name" href="007coin">007Coin /a>/td>
a class="coin_name" href="007coin">007Coin /a>
url <- "https://www.coinopsy.com/dead-coins/"
webpage <- read_html(url)
Item_html <- html_nodes(webpage,'.coin_name')
Item <- html_text(Item_html)
> Item
character(0)
Can someone help me out on this issue?
If you disable JavaScript in the browser you will see that that content is not loaded. If you then inspect the HTML you will see the data is stored in a script tag, presumably loaded into the table when JavaScript runs in the browser. JavaScript doesn't run with the method you are using. You can, however, extract the JavaScript array of arrays from the response HTML and then parse it into a dataframe. I am new to R, so I am still looking into how that conversion can be done in this case; I will include a full example in Python at the end and will update if my research yields something. Otherwise, you can regex out the contents from the returned string in data.
library(rvest)
library(stringr)
library(magrittr)
url = 'https://www.coinopsy.com/dead-coins/'
r <- read_html(url) %>%
  html_node('body') %>%
  html_text() %>%
  toString()
data <- str_match_all(r,'var table_data = (.*?);')
data <- data[[1]][,2] # string representation of list of lists
#step to convert string to object
#step to convert object to dataframe
In Python there is the ast library, which makes the conversion easy; the result of the code below is the table you see on the page.
import requests
import re
import ast
import pandas as pd
r = requests.get('https://www.coinopsy.com/dead-coins/')
p = re.compile(r'var table_data = (.*?);') #p1 = re.compile(r'(\[".*?"\])')
data = p.findall(r.text)[0]
listings = ast.literal_eval(data)
df = pd.DataFrame(listings)
print(df)
Edit:
Currently I can't find a library which does the conversion I mentioned. Below is an ugly way of combining the pieces, and it feels inefficient; I would welcome suggestions on improvements (though that may be a question for Code Review). I'm still looking at this and will update.
library(rvest)
library(stringr)
library(magrittr)
url = 'https://www.coinopsy.com/dead-coins/'
headers <- c("Column To Drop","Name","Summary","Project Start Date","Project End Date","Founder","urlId")
# https://www.coinopsy.com/dead-coins/bigone-token/ where bigone-token is urlId
r <- read_html(url) %>%
  html_node('body') %>%
  html_text() %>%
  toString()
data <- str_match_all(r,'var table_data = (.*?);')
data <- data[[1]][,2]
z <- substr(data, start = 2, stop = nchar(data)-1) %>% str_match_all(., "\\[(.*?)\\]")
z <- z[[1]][,2]
for(i in seq(1, length(z))){
  if(i == 1){
    df <- rapply(as.list(strsplit(z[i], ",")[[1]][2:7]), function(x) trimws(sub("'(.*?)'", "\\1", x)))
  } else {
    df <- rbind(df, rapply(as.list(strsplit(z[i], ",")[[1]][2:7]), function(x) trimws(sub("'(.*?)'", "\\1", x))))
  }
}
Maybe this will help someone: I had the same problem, and the solution was that you have to specify the tag the selector is aimed at, followed by the ".". In your case you want to address a class named coin_name, but when specifying that class in the html_nodes function you didn't specify the tag, just as I didn't. To solve it, I only had to specify the tag, which in your case is the "a" tag, so it would look like this:
Item_html <- html_nodes(webpage,'a.coin_name')
That way the html_nodes function will not return an empty result.
I know you already solved it, but I hope this helps someone.

How to use read_html reading a character vector of url

I am using the rvest package, and below is my code:
library(rvest)
url <- 'https://www.zhihu.com/people/excited-vczh'
webpage <- read_html(url)
profile_data <- html_nodes(webpage, '.Profile-sideColumnItemLink')
profile_data_text <- html_text(profile_data)
This code reads and parses a single URL. What if I have a character vector storing multiple URLs; how should I feed those URLs into the code above?
For instance, if urlist is a character vector storing 1000 URLs, how can I change my code to scrape the specific content from every URL in urlist?
You could just use lapply to run through each URL to grab the text you need:
library(rvest)
urlist <- rep('https://www.zhihu.com/people/excited-vczh', 100)
profile_data_list <- lapply(urlist, function(x) {
  webpage <- read_html(x)
  profile_data <- html_nodes(webpage, '.Profile-sideColumnItemLink')
  html_text(profile_data)
})
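If urlist really holds around 1000 URLs, it may be worth wrapping each request in tryCatch() so a single failing page does not abort the whole run; a sketch along the same lines:
safe_profile <- function(x) {
  tryCatch({
    webpage <- read_html(x)
    html_text(html_nodes(webpage, '.Profile-sideColumnItemLink'))
  }, error = function(e) NA_character_)
}
profile_data_list <- lapply(urlist, safe_profile)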

R: looping through a list of links

I have some code that scrapes data off this link (http://stats.ncaa.org/team/stats?org_id=575&sport_year_ctl_id=12280) and runs some calculations.
What I want to do is cycle through every team and collect and run the manipulations on every team. I have a dataframe with every team link, like the one above.
Pseudo code:
for (link in teamlist)
{scrape, manipulate, put into a table}
However, I can't figure out how to run loop through the links.
I've tried doing URL = teamlist$link[i], but I get an error when using readHTMLTable(). I have no trouble manually pasting each team's individual URL into the script, only when trying to pull it from a table.
Current code:
library(XML)
library(gsubfn)
URL <- 'http://stats.ncaa.org/team/stats?org_id=575&sport_year_ctl_id=12280'
tx <- readLines(URL)
tx2 <- gsub("</tbody>", "", tx)
tx2 <- gsub("<tfoot>", "", tx2)
tx2 <- gsub("</tfoot>", "</tbody>", tx2)
Player_Stats <- readHTMLTable(tx2, asText = TRUE, header = T, which = 2, stringsAsFactors = F)
Thanks.
I agree with @ialm that you should check out the rvest package, which makes it very fun and straightforward to loop through links. I will create some example code here using similar subject matter for you to check out.
Here I am generating a list of links that I will iterate through
rm(list=ls())
library(rvest)
mainweb <- "http://www.basketball-reference.com/"
urls <- read_html("http://www.basketball-reference.com/teams") %>%
  html_nodes("#active a") %>%
  html_attrs()
Now that the list of links is complete I iterate through each link and pull a table from each
teamdata <- c()
j <- 1
for(i in urls){
  bball <- read_html(paste(mainweb, i, sep = ""))
  teamdata[j] <- bball %>%
    html_nodes(paste0("#", gsub("/teams/([A-Z]+)/$", "\\1", urls[j], perl = TRUE))) %>%
    html_table()
  j <- j + 1
}
Please see the code below, which basically builds off your code and loops through two different team pages as identified by the vector team_codes. The tables are returned in a list where each list element corresponds to a team's table. However, the tables look like they will need more cleaning.
library(XML)
library(gsubfn)
Player_Stats <- list()
j <- 1
team_codes <- c(575, 580)
for(code in team_codes) {
  URL <- paste0('http://stats.ncaa.org/team/stats?org_id=', code, '&sport_year_ctl_id=12280')
  tx <- readLines(URL)
  tx2 <- gsub("</tbody>", "", tx)
  tx2 <- gsub("<tfoot>", "", tx2)
  tx2 <- gsub("</tfoot>", "</tbody>", tx2)
  Player_Stats[[j]] <- readHTMLTable(tx2, asText = TRUE, header = T, which = 2, stringsAsFactors = F)
  j <- j + 1
}
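As a small follow-up, you could label each list element by its team code so results are easy to look up; once the tables have been cleaned to a common column layout they could then be stacked into one data frame (the rbind line is left commented out because the raw tables may not line up yet):
names(Player_Stats) <- team_codes
# all_stats <- do.call(rbind, Player_Stats)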
