Data scraping from coinmarketcap in R

Hello, I'm trying to scrape the markets table at the bottom of this page: "https://coinmarketcap.com/currencies/bitcoin/markets/"
This is what I tried:
crpyto_url <- read_html("https://coinmarketcap.com/currencies/bitcoin/markets/")
Exchanges <- crpyto_url %>%
html_node(xpath = '//*[@id="__next"]/div/div[2]/div/div[3]/div[2]/div[2]/div/table') %>%
html_text() %>%
jsonlite::fromJSON()
This is the error
Error in if (is.character(txt) && length(txt) == 1 && nchar(txt, type = "bytes") < : missing value where TRUE/FALSE needed
I don't think the error itself is the issue; I think the real problem is that I don't know how to find the XPath for the table.
If someone manages to find the XPath, can you please explain the process you used to find it, or link some resources?
Thanks
I.

This can be done with the CoinGecko API.
library(httr)
library(jsonlite)
library(dplyr)

url <- "https://api.coingecko.com/api/v3/coins/bitcoin/tickers"
Exchanges <- GET(url)
raw_data <- fromJSON(content(Exchanges, as = "text", encoding = "UTF-8"))

# names of the exchanges that list BTC
raw_data$tickers$market %>% select(name) %>% pull()
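If you also want the traded pair, price, and volume (roughly what the coinmarketcap markets table shows), the same response can be reshaped into a data frame. A minimal sketch, assuming the usual CoinGecko ticker fields (base, target, last, volume) are present in the response:

library(httr)
library(jsonlite)
library(dplyr)

url <- "https://api.coingecko.com/api/v3/coins/bitcoin/tickers"
raw_data <- fromJSON(content(GET(url), as = "text", encoding = "UTF-8"))

# tickers comes back as a data frame with a nested `market` data frame column
markets <- tibble(
  exchange = raw_data$tickers$market$name,
  pair     = paste(raw_data$tickers$base, raw_data$tickers$target, sep = "/"),
  last     = raw_data$tickers$last,
  volume   = raw_data$tickers$volume
)
head(markets)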

Related

How to skip error Error in open.connection(x, "rb") : HTTP error 504

I am new to the world of R, and I have not been able to skip the URLs for which the website shows: "504 error: That content doesn't seem to exist…"
There is a list of people on the website; I need to get the table and also the information in the nested link for each of those people.
But the webpage gives a 504 error for only 1 person (the 84th), so I would like to know how I can skip that page so that, in my data frame, the webpage for that specific person is marked as non-existent.
Thanks for your help.
Here is my code:
library(rvest)
library(dplyr)
library(stringr)
library(jsonlite)
library(readr)

url = "https://www.barrons.com/advisor/report/top-financial-advisors/100?id=/100/2022&type=ranking_tables"
doc = fromJSON(txt = url)
result = doc$data$data
print(result)

link = str_split_fixed(doc$data$data$Advisor, "\'", n = Inf)
advisor_links = link[, 4]

for (i in 1:length(advisor_links)) {
  name_link = advisor_links[i]
  advisor_page = read_html(name_link)
  position = advisor_page %>%
    html_nodes(".BarronsTheme--lg--18rTokdG p:nth-child(1)") %>%
    html_text() %>%
    paste(collapse = ",")
  print(position)
}
If you know the index of the person you want to remove, you can simply omit it from advisor_links before you run your for loop.
advisor_links <- advisor_links[-84]
If there are multiple websites that error out, I would suggest using the tryCatch() function (see: How to write trycatch in R) and putting it inside your for loop, like so:
for (i in 1:length(advisor_links)) {
  name_link = advisor_links[i]
  tryCatch({
    advisor_page = read_html(name_link)
    position = advisor_page %>%
      html_nodes(".BarronsTheme--lg--18rTokdG p:nth-child(1)") %>%
      html_text() %>%
      paste(collapse = ",")
    if (position == "") print("Non-existent")
    else print(position)
  }, error = function(e) NULL)
}
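If you want the failed page recorded as "Non-existent" in a data frame rather than just printed, you can return a one-row result from the error handler and bind everything together. A minimal sketch, reusing advisor_links from the code above (the column names link and position are just illustrative):

library(rvest)
library(dplyr)

# one row per advisor; failures are marked instead of stopping the loop
results <- tibble(link = character(), position = character())
for (i in 1:length(advisor_links)) {
  name_link <- advisor_links[i]
  row <- tryCatch({
    advisor_page <- read_html(name_link)
    position <- advisor_page %>%
      html_nodes(".BarronsTheme--lg--18rTokdG p:nth-child(1)") %>%
      html_text() %>%
      paste(collapse = ",")
    tibble(link = name_link, position = position)
  }, error = function(e) tibble(link = name_link, position = "Non-existent"))
  results <- bind_rows(results, row)
}
results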

R: Errors when web scraping across multiple tables with the same URL

I'm fairly new to web scraping and having issues troubleshooting my code. At the moment I'm getting different errors every time and don't really know where to continue. I'm currently looking into utilizing RSelenium, but would greatly appreciate some advice and feedback on the code below.
I based my initial code on the following: R: How to web scrape a table across multiple pages with the same URL
library(xml2)
library(RCurl)
library(dplyr)
library(rvest)
i=1
table = list()
for (i in 1:15) {
  data=("https://www.forsvarsbygg.no/no/salg-av-eiendom/solgte-eiendommer/","?page=",i))
  page <- read_html(data)
  table1 <- page %>%
    html_nodes(xpath = "(//table)[2]") %>%
    html_table(header=T)
  i=i+1
  table1[[1]][[7]]=as.integer(gsub(",", "",table1[[1]][[7]]))
  table=bind_rows(table, table1)
  print(i)
}
table$`ÅR`=as.Date(table$`ÅR`,format ="%Y")
Below are the errors I am receiving at the moment. I know it's a lot, but I assume some of them are a result of previous errors. Any help would be greatly appreciated!
i=1
table = list()
for (i in 1:15) {
data=("https://www.forsvarsbygg.no/no/salg-av-eiendom/solgte-eiendommer/","?page=",i))
Error: unexpected ',' in:
"for (i in 1:15) {
data=("https://www.forsvarsbygg.no/no/salg-av-eiendom/solgte-eiendommer/","
page <- read_html(data)
Error in UseMethod("read_xml") :
no applicable method for 'read_xml' applied to an object of class "function"
table1 <- page %>%
html_nodes(xpath = "(//table)[2]") %>%
html_table(header=T)
Error in UseMethod("xml_find_all") :
no applicable method for 'xml_find_all' applied to an object of class "function"
i=i+1
table1[[1]][[7]]=as.integer(gsub(",", "",table1[[1]][[7]]))
Error in is.factor(x) : object 'table1' not found
table=bind_rows(table, table1)
Error in list2(...) : object 'table1' not found
print(i)}
Error: unexpected '}' in " print(i)}"
table$`ÅR`=as.Date(table$`ÅR`,format ="%Y")
The following code produces a data frame containing all the data you are seeking. Rather than using RSelenium, the code below fetches the data directly from the same API that the site uses to populate the table, so you do not need to combine multiple pages:
library(tidyverse)
library(rvest)
library(jsonlite)
####GET NUMBER OF ITEMS#####
url <- "https://www.forsvarsbygg.no/ListApi/ListContent/78635/SoldEstates/0/10/"
data <- jsonlite::fromJSON(url, flatten = TRUE)
totalItems <- data$TotalNumberOfItems
####GET ALL OF THE ITEMS#####
allData <- paste0('https://www.forsvarsbygg.no/ListApi/ListContent/78635/SoldEstates/0/', totalItems, '/') %>%
  jsonlite::fromJSON(., flatten = TRUE) %>%
  .[1] %>%
  as.data.frame() %>%
  rename_with(~str_replace(., "ListItems.", ""), everything())
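As an aside, the cascade of errors in your console output all starts from a plain syntax error: the data=( ... ) line has no function name before the opening parenthesis (and an extra closing one), which is what triggers "Error: unexpected ','", and every later error follows from data never being created. The intended call was presumably paste0(); a minimal correction of just that line is sketched below, though note the table itself is populated from the API above, so read_html() on the page may still not find it:

# assumption: the original intent was to build the paged URL with paste0()
data <- paste0("https://www.forsvarsbygg.no/no/salg-av-eiendom/solgte-eiendommer/",
               "?page=", i)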

Scraping from a website using getURL() returns a string of URLs, not website content. How do I get the contents of the site? (RStudio, Windows 10)

I am completely new to scraping and am using a Windows 10 PC. I am trying to run this code from class to scrape the content of the party platforms from the URLs below:
library(RCurl)

years = c(1968, 1972, 1976)
urlsR = paste("https://maineanencyclopedia.com/republican-party-platform-",
              years, "/", sep = '')
urlsD = paste("https://maineanencyclopedia.com/democratic-party-platform-",
              years, "/", sep = '')
urls = c(urlsR, urlsD)
scraped_platforms <- getURL(urls)
When I run "scraped_platforms" the result is what is shown below rather than the content of the party platforms from the website.
https://maineanencyclopedia.com/republican-party-platform-1968/
""
https://maineanencyclopedia.com/republican-party-platform-1972/
""
https://maineanencyclopedia.com/republican-party-platform-1976/
""
https://maineanencyclopedia.com/democratic-party-platform-1968/
""
https://maineanencyclopedia.com/democratic-party-platform-1972/
""
https://maineanencyclopedia.com/democratic-party-platform-1976/
""
I've seen that Windows 10 might be incompatible with getURL() (re: How to get getURL to work on R on Windows 10? [tlsv1 alert protocol version]). Even after looking online, though, I'm still unclear on how to fix my specific code.
List of links used here:
https://maineanencyclopedia.com/republican-party-platform-1968/
https://maineanencyclopedia.com/republican-party-platform-1972/
https://maineanencyclopedia.com/republican-party-platform-1976/
https://maineanencyclopedia.com/democratic-party-platform-1968/
https://maineanencyclopedia.com/democratic-party-platform-1972/
https://maineanencyclopedia.com/democratic-party-platform-1976/
I don't know the getURL() function, but in R there is one very handy package for scraping: rvest.
You can just use your urls object, which holds all the URLs, and loop over them:
library(rvest)
library(dplyr)

df <- tibble(Title = NULL,
             Text = NULL)

for (url in urls) {
  t <- read_html(url) %>% html_nodes(".entry-title") %>% html_text2()
  p <- read_html(url) %>% html_nodes("p") %>% html_text2()
  tp <- tibble(Title = t,
               Text = p)
  df <- rbind(df, tp)
}
df
The output is a bit unorganized, but you can adjust the for loop to get it a bit nicer.
Here is also a slightly nicer presentation of the data:
df2 <- df %>%
  group_by(Title) %>%
  slice(-1) %>%
  mutate(Text_all = paste0(Text, collapse = "\n")) %>%
  dplyr::select(-Text) %>%
  distinct()
df2
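If you specifically need the raw page source (what getURL() was meant to return) rather than parsed nodes, httr is a commonly used alternative to RCurl that sidesteps the Windows TLS issue mentioned above. A minimal sketch, not tested against these exact pages:

library(httr)

# fetch each page and keep its raw HTML as a single string per URL
scraped_platforms <- vapply(
  urls,
  function(u) content(GET(u), as = "text", encoding = "UTF-8"),
  character(1)
)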

extracting table with htmltab R

I'm attempting to scrape the second table from
https://fbref.com/en/comps/9/passing/Premier-League-Stats
I have used
URLPL <- "https://fbref.com/en/comps/9/passing/Premier-League-Stats"
Tab <- htmltab(doc = URLPL, which = 2)
which returns
"Error: Couldn't find the table. Try passing (a different) information
to the which argument"
and also
URLPL <- "https://fbref.com/en/comps/9/passing/Premier-League-Stats"
Tab <- htmltab(doc = URLPL, which = "//table[2]")
which returns
"Error in Node[1] : subscript out of bounds"
There are 2 tables on the webpage. If anyone can point me down the right path here, thanks.
Edit: I've now realised that there's only 1 table on the webpage, and what I thought was a table is not. Now I'm even more confused about where to go with this.
Answering my own question here, for anyone who may have the same problem: on any of the Sports Reference sites (Hockey/Basketball/Baseball), anything other than the top table is stored inside HTML comments.
library(rvest)

PremLeague = "https://fbref.com/en/comps/12/stats/La-Liga-Stats"
Prem = PremLeague %>%
  read_html() %>%
  html_nodes(xpath = '//comment()') %>%
  html_text() %>%
  paste(collapse = '') %>%
  read_html() %>%
  html_node("#stats_standard") %>%
  html_table()
This worked for me.
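If you don't know the id of the table you're after (the passing page uses a different id from #stats_standard, and I haven't checked what it is), you can pull every table out of the comments and pick the right one by inspection. A sketch along the same lines:

library(rvest)

url <- "https://fbref.com/en/comps/9/passing/Premier-League-Stats"
commented_tables <- url %>%
  read_html() %>%
  html_nodes(xpath = "//comment()") %>%
  html_text() %>%
  paste(collapse = "") %>%
  read_html() %>%
  html_nodes("table") %>%
  html_table()

length(commented_tables)    # how many tables were hidden inside comments
str(commented_tables[[1]])  # inspect each one and pick the table you need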

Error in web scraping in R from Wikipedia

I'm having trouble web scraping information from Wikipedia and get the following error message:
Error in if (length(p) > 1 & maxp * n != sum(unlist(nrows)) & maxp * n != :
missing value where TRUE/FALSE needed
I'm not sure how to fix this problem; please help me out.
library(rvest)

url <- 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
wiki <- read_html(url) %>% html_nodes('table') %>% html_table(fill = TRUE)
names(wiki[[1]])
Output error:
Error in if (length(p) > 1 & maxp * n != sum(unlist(nrows)) & maxp * n != :
missing value where TRUE/FALSE needed
Assuming you want the big table, you can use its id. An id should be the fastest selector method for an element:
require(rvest)

r <- read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies") %>%
  html_nodes("#constituents") %>%
  html_table()
print(r)
The problem is that there are two tables on this webpage, and you should specify which one you want to scrape. Assuming you want the first one, you could do something like:
read_html(url) %>%
  html_nodes('table') %>%
  `[[`(1) %>%   ## extract first table
  html_table(fill = TRUE)
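Either way, the result is the constituents table: the first answer returns a list with one data frame in it, while the second pipes out the data frame directly. A quick check of what you get, where the Symbol column name is an assumption based on the current page layout:

constituents <- r[[1]]     # first answer: html_nodes() + html_table() give a list
dim(constituents)          # rows x columns of the S&P 500 table
head(constituents$Symbol)  # ticker symbols, assuming the column is named Symbol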
