Error in web scraping in R from Wikipedia

I'm having trouble web scraping information from Wikipedia and get the following error message:
Error in if (length(p) > 1 & maxp * n != sum(unlist(nrows)) & maxp * n != :
missing value where TRUE/FALSE needed
I'm not sure how to fix this problem; please help me out.
library(rvest)

url <- 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
wiki <- read_html(url) %>% html_nodes('table') %>% html_table(fill = TRUE)
names(wiki[[1]])

Assuming you want the big table, you can use its id; an id should be the fastest way to select an element.
library(rvest)

r <- read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies") %>%
  html_nodes("#constituents") %>%
  html_table()
print(r)
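Note that html_table() returns a list of tibbles even when only one node matches, so index into it to get the table itself (sp500 below is just an illustrative name):

sp500 <- r[[1]]   # first (and only) parsed table
names(sp500)
head(sp500)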

The problem is that there are two tables on this webpage, and you should specify which one you want to scrape. Assuming you want the first one, you could do something like:
read_html(url) %>%
  html_nodes('table') %>%
  `[[`(1) %>%               # extract the first table
  html_table(fill = TRUE)

Related

Error when using a function: Error in df1[[1]] : subscript out of bounds

I'm trying to scrape the gun laws from https://www.statefirearmlaws.org/. However, I keep getting the following error:
Error in df1[[1]] : subscript out of bounds
I used SelectorGadget to copy the nodes for the table.
What can I do to fix it?
library(rvest)
library(tidyverse)

years <- lapply(2006:2018, function(x) {
  link <- paste0('https://www.statefirearmlaws.org/national-data/', x)
  df1 <- link %>%
    read_html() %>%
    html_nodes('.js-view-dom-id-cc833ef0290cd127457401b760770f1411daa41fc70df5f12d07744fab0a173c > div > div') %>%
    html_text(trim = TRUE)
  df <- df1[[1]]
  return(df)
})
The culprit is this part:
df1 <- link %>%
  read_html() %>%
  html_nodes('.js-view-dom-id-cc833ef0290cd127457401b760770f1411daa41fc70df5f12d07744fab0a173c > div > div')
It results in {xml_nodeset (0)}, which later produces an empty list(), so df1[[1]] is out of bounds.
Are you selecting the correct thing in html_nodes? SelectorGadget can be helpful for choosing what you need.
Also, html_text() expects a node as its input, while html_table() outputs a list of tibbles, so html_text() fails to parse that.
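More generally, you can guard against an empty node set before indexing, which turns the cryptic subscript error into a clear message; a minimal sketch (the selector is a placeholder to swap for one that actually matches):

library(rvest)

link <- 'https://www.statefirearmlaws.org/national-data/2018'
nodes <- link %>%
  read_html() %>%
  html_nodes('.your-selector-here')   # placeholder -- replace with a selector that matches

# an empty node set means the selector matched nothing
# (often because the content is rendered by JavaScript)
if (length(nodes) == 0) {
  stop('Selector matched no nodes')
}

txt <- html_text(nodes, trim = TRUE)
txt[[1]]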

Data scraping from coinmarketcap

Hello, I'm trying to scrape the market table at the end of this page: https://coinmarketcap.com/currencies/bitcoin/markets/
This is what I tried:
crypto_url <- read_html("https://coinmarketcap.com/currencies/bitcoin/markets/")
Exchanges <- crypto_url %>%
  html_node(xpath = '//*[@id="__next"]/div/div[2]/div/div[3]/div[2]/div[2]/div/table') %>%
  html_text() %>%
  jsonlite::fromJSON()
This is the error:
Error in if (is.character(txt) && length(txt) == 1 && nchar(txt, type = "bytes") < : missing value where TRUE/FALSE needed
I don't think the error itself is what matters; the real problem is that I don't know how to find the XPath for the table.
If someone manages to find the XPath, can you please explain the process you used to find it, or link some resources?
Thanks
This can be done with the CoinGecko API instead; the table on the page is rendered by JavaScript, so it isn't in the raw HTML that read_html() retrieves.
url <- "https://api.coingecko.com/api/v3/coins/bitcoin/tickers"
Exchanges <- GET(url)
araw_data <- fromJSON(content(Exchanges, as = "text",encoding = "UTF-8"))
araw_data$tickers$market %>% select(name) %>% pull
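Each ticker also carries the traded pair and the last price; a small sketch pulling a few more fields, assuming the response keeps its current shape (base, target, and last are field names in this endpoint's ticker objects):

tickers <- araw_data$tickers
head(data.frame(
  exchange = tickers$market$name,                             # exchange name
  pair     = paste(tickers$base, tickers$target, sep = "/"),  # traded pair
  last     = tickers$last                                     # last traded price
))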

extracting table with htmltab R

I'm attempting to scrape the second table from
https://fbref.com/en/comps/9/passing/Premier-League-Stats
I have used:
library(htmltab)

URLPL <- "https://fbref.com/en/comps/9/passing/Premier-League-Stats"
Tab <- htmltab(doc = URLPL, which = 2)
which returns
"Error: Couldn't find the table. Try passing (a different) information
to the which argument"
and also
URLPL <- "https://fbref.com/en/comps/9/passing/Premier-League-Stats"
Tab <- htmltab(doc = URLPL, which = "//table[2]")
which returns
"Error in Node[1] : subscript out of bounds"
There are two tables on the webpage. If anyone can point me down the right path here, thanks.
Edit: I've now realised that there's only one table on the webpage, and what I thought was a table is not. Now I'm even more confused about where to go with this.
Answering my own question here, for anyone who may have the same problem: on any of the sports-reference websites (hockey/basketball/baseball), anything other than the top table is embedded inside HTML comments.
PremLeague <- "https://fbref.com/en/comps/12/stats/La-Liga-Stats"

Prem <- PremLeague %>%
  read_html() %>%
  html_nodes(xpath = '//comment()') %>%   # grab the comments that hold the hidden tables
  html_text() %>%
  paste(collapse = '') %>%
  read_html() %>%
  html_node("#stats_standard") %>%
  html_table()
This worked for me.
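The same comment-extraction pattern should work for the passing table from the original question; a sketch assuming the hidden table's id is #stats_passing (inspect the page source to confirm the real id):

Passing <- "https://fbref.com/en/comps/9/passing/Premier-League-Stats" %>%
  read_html() %>%
  html_nodes(xpath = '//comment()') %>%
  html_text() %>%
  paste(collapse = '') %>%
  read_html() %>%
  html_node("#stats_passing") %>%   # assumed id -- verify in the page source
  html_table()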

rvest, following a link present on each node to get more data?

So I'm trying to scrape data from a site that contains club data from clubs at my school. I've got a good script going that scrapes the surface-level data from the site; however, I can get more data by clicking the "more information" link at each club, which leads to the club's profile page. I would like to scrape the data from that page (specifically the Facebook link).
Below you'll see my current attempt at this.
url <- 'https://uws-community.symplicity.com/index.php?s=student_group'
page <- html_session(url)

get_more_info <- function(position) {
  page <- follow_link(page, css = ".grpl-moreinfo > a:nth-child(" + position + ")")
  html_node(sub_page, xpath = '//*[@id="dnf_class_values_student_group__facebook__widget"]') %>% html_text()
  page <- page %>% back()
}

get_table <- function(page, count) {
  # find group names
  name_text <- html_nodes(page, ".grpl-name a") %>% html_text()
  df <- data.frame(name_text, stringsAsFactors = FALSE)

  # find text description
  desc_text <- html_nodes(page, ".grpl-purpose") %>% html_text()
  df$desc_text <- trimws(desc_text)

  # find emails:
  # find the parent nodes with html_nodes,
  # then find the contact information from each parent using html_node
  email_nodes <- html_nodes(page, "div.grpl-grp") %>% html_node(".grpl-contact a") %>% html_text()
  df$emails <- email_nodes

  category_nodes <- html_nodes(page, "div.grpl-grp") %>% html_node(".grpl-type") %>% html_text()
  df$category <- category_nodes

  pic_nodes <- html_nodes(page, "div.grpl-grp") %>% html_node(".grpl-logo img") %>% html_attr("src")
  df$logo <- paste0("https://uws-community.symplicity.com/", pic_nodes)

  more_info_nodes <- html_nodes(page, ".grpl-moreinfo a") %>% html_attr("href")
  df$more_info <- paste0("https://uws-community.symplicity.com/", more_info_nodes)

  df$fb <- lapply(1:nrow(df), get_more_info)

  if (count != 44) {
    return(rbind(df, get_table(page %>% follow_link(css = ".paging_nav a:last-child"), count + 1)))
  } else {
    return(df)
  }
}

RSO_data <- get_table(page, 0)
As of now I'm getting this error:
Error in ".grpl-moreinfo > a:nth-child(" + position :
non-numeric argument to binary operator
As you can see, I'm attempting to follow the link at each element by using the get_more_info function and applying it to each element on the page with lapply.
Is there a better way to do this? What am I doing wrong?
I think the solution is easier than you think. In your get_more_info() function you used
page <- follow_link(page, css = ".grpl-moreinfo > a:nth-child(" + position + ")")
where
css = ".grpl-moreinfo > a:nth-child(" + position + ")"
In R you do not concatenate character strings with "+", i.e. it does not work to write
"He" + "llo"
Try it again using paste('He', 'llo', sep = '') or paste0('He', 'llo').
And next time, try looking at the error message itself; it very often tells you exactly where the error comes from.
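Applied to the line from the question, the corrected call would be:

page <- follow_link(page, css = paste0(".grpl-moreinfo > a:nth-child(", position, ")"))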
Edit: if you want to use + like in Python, you can define your own function for it:
`+` <- function(x, y) {
  return(paste0(x, y))
}
I wouldn't recommend it, but it's possible.
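The reason to avoid it is that the override hijacks every use of +, including numeric addition:

"He" + "llo"   # "Hello"
1 + 2          # "12" -- arithmetic is now broken too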

Rvest, looping through elements on a page in order to follow a link at each element?

So I'm trying to scrape data from a site that contains club data from clubs at my school. I've got a good script going that scrapes the surface-level data from the site; however, I can get more data by clicking the "more information" link at each club, which leads to the club's profile page. I would like to scrape the data from that page (specifically the Facebook link).
Below you'll see my current attempt at this.
url <- 'https://uws-community.symplicity.com/index.php?s=student_group'
page <- html_session(url)

get_table <- function(page, count) {
  # find group names
  name_text <- html_nodes(page, ".grpl-name a") %>% html_text()
  df <- data.frame(name_text, stringsAsFactors = FALSE)

  # find text description
  desc_text <- html_nodes(page, ".grpl-purpose") %>% html_text()
  df$desc_text <- trimws(desc_text)

  # find emails:
  # find the parent nodes with html_nodes,
  # then find the contact information from each parent using html_node
  email_nodes <- html_nodes(page, "div.grpl-grp") %>% html_node(".grpl-contact a") %>% html_text()
  df$emails <- email_nodes

  category_nodes <- html_nodes(page, "div.grpl-grp") %>% html_node(".grpl-type") %>% html_text()
  df$category <- category_nodes

  pic_nodes <- html_nodes(page, "div.grpl-grp") %>% html_node(".grpl-logo img") %>% html_attr("src")
  df$logo <- paste0("https://uws-community.symplicity.com/", pic_nodes)

  more_info_nodes <- html_nodes(page, ".grpl-moreinfo a") %>% html_attr("href")
  df$more_info <- paste0("https://uws-community.symplicity.com/", more_info_nodes)

  sub_page <- page %>% follow_link(css = ".grpl-moreinfo a")
  df$fb <- html_node(sub_page, xpath = '//*[@id="dnf_class_values_student_group__facebook__widget"]') %>% html_text()

  if (count != 44) {
    return(rbind(df, get_table(page %>% follow_link(css = ".paging_nav a:last-child"), count + 1)))
  } else {
    return(df)
  }
}

RSO_data <- get_table(page, 0)
The current error I'm getting is:
Error in `$<-.data.frame`(`*tmp*`, "logo", value = "https://uws-community.symplicity.com/") :
replacement has 1 row, data has 0
I know I need to make a function that goes through each element and follows the link, then mapply that function over the data frame df. However, I don't know how to write that function so that it works correctly.
Your error says that you are trying to combine two different dimensions: the replacement value has one row, but your data frame has zero. Try adding page <- html_session(url) inside your function.
This is a reproducible example of your error message:
x <- data.frame()
x[1] <- c(1)
I haven't checked your code in detail, but the error is in there; go through it step by step and you will find the place where you created an empty data.frame and then tried to assign a value to it.
Good luck
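One way to avoid the error altogether is to stop building a zero-row data frame in the first place; a minimal sketch of a guard at the top of get_table() (returning NULL early is my choice, not from the question):

get_table <- function(page, count) {
  name_text <- html_nodes(page, ".grpl-name a") %>% html_text()

  # if the selector matched nothing, df would have 0 rows and every later
  # df$... <- assignment would fail with "replacement has 1 row, data has 0"
  if (length(name_text) == 0) return(NULL)

  df <- data.frame(name_text, stringsAsFactors = FALSE)
  # ... rest of the scraping as in the question ...
  df
}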
