I can scrape every topic on this page, but when I try to do the same with the content hidden behind an expand button, it doesn't work.
I assume I need to simulate a click on the button, but I don't know how to do that.
My other question: if I do manage to scrape that hidden content, building my data.frame will raise an error, because it adds one more piece of information.
library(rvest)
library(dplyr)
concat <- data.frame()
n_paginas <- 2
for (i in 1:n_paginas) {
  # Note: url_number is computed but never used, so every iteration requests the same URL
  url_number <- 2 - i
  url1 <- paste0('https://www.qconcursos.com/questoes-de-concursos/questoes?')
  p1 <- read_html(url1)
  an1 <- p1 %>% html_nodes(".q-question-info") %>% html_text()
  di1 <- p1 %>% html_nodes(".q-question-enunciation") %>% html_text()
  concat <- rbind(concat, data.frame(an1, di1))
  print(paste("Página:", i))
  Sys.sleep(3)
}
To my knowledge, rvest's read_html() has no way to "click" a button on a webpage before scraping, since it only downloads the HTML and does not run JavaScript. More sophisticated scraping strategies, such as driving a real browser, might help. See this related post: R - How to make a click on webpage using rvest or rcurl
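On the second question, one possible way to avoid the data.frame error is to loop over one container per question and look up each field inside it, so that a missing field becomes NA instead of shortening a column. This is only a minimal sketch on inline HTML: the .q-question container and the .q-hidden class are hypothetical names, not taken from the real page.

```r
library(rvest)

# Inline HTML standing in for the real page (structure is an assumption)
page <- read_html('
  <div class="q-question">
    <span class="q-question-info">Q1</span>
    <div class="q-hidden">extra 1</div>
  </div>
  <div class="q-question">
    <span class="q-question-info">Q2</span>
  </div>')

# One node per question; html_node() returns one result per container,
# and html_text() yields NA where the field is missing
items <- html_nodes(page, ".q-question")
df <- data.frame(
  info   = items %>% html_node(".q-question-info") %>% html_text(),
  hidden = items %>% html_node(".q-hidden") %>% html_text(),
  stringsAsFactors = FALSE
)
```

Because every column has exactly one entry per container, rbind() and data.frame() no longer fail when some questions lack the extra field.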
I'm trying to get the complete data set of Bitcoin historical data from Yahoo Finance via web scraping. This is my first attempt:
library(rvest)
library(tidyverse)
crypto_url <- read_html("https://finance.yahoo.com/quote/BTC-USD/history?period1=1480464000&period2=1638230400&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true")
cryp_table <- html_nodes(crypto_url, css = "table")
cryp_table <- html_table(cryp_table, fill = TRUE) %>%
  as.data.frame()
In the link that I pass to read_html() a long period is already selected; however, it only returns the first 101 rows, and the last row is the loading message that you get when you keep scrolling. This is my second attempt, but I get the same result:
col_page <- read_html("https://finance.yahoo.com/quote/BTC-USD/history?period1=1480464000&period2=1638230400&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true")
cryp_table <-
  col_page %>%
  html_nodes(xpath = '//*[@id="Col1-1-HistoricalDataTable-Proxy"]/section/div[2]/table') %>%
  html_table(fill = TRUE)
cryp_final <- cryp_table[[1]]
How can I get the whole dataset?
I think you can use the download link instead. If you look at the Network tab, you can see the download link, which in this case is:
"https://query1.finance.yahoo.com/v7/finance/download/BTC-USD?period1=1480464000&period2=1638230400&interval=1d&events=history&includeAdjustedClose=true"
This link closely resembles the URL of the page, i.e., we can modify the page URL to get the download link and read the CSV. See the code:
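library(stringr)
library(magrittr)
site <- "https://finance.yahoo.com/quote/BTC-USD/history?period1=1480464000&period2=1638230400&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true"
base_download <- "https://query1.finance.yahoo.com/v7/finance/download/"
download_link <- site %>%
  # Drop everything up to "quote/", the "/history" segment, and the frequency parameter.
  # In "/history?" the "?" is a regex quantifier, so the literal "?" of the query string is kept.
  stringr::str_remove_all(".+(?<=quote/)|/history?|&frequency=1d") %>%
  # The download link uses "events" where the page URL uses "filter"
  stringr::str_replace("filter", "events") %>%
  stringr::str_c(base_download, .)
readr::read_csv(download_link)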
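To check the URL rewriting without touching the network, the same transformation can be sketched with base R and fixed-string replacements. The helper name history_to_download is hypothetical; the two URLs are the ones quoted above.

```r
# Hypothetical helper: turn the history-page URL into the CSV download URL
history_to_download <- function(site,
                                base_download = "https://query1.finance.yahoo.com/v7/finance/download/") {
  # Drop everything up to and including "quote/"
  tail <- sub("^.*quote/", "", site)
  # Remove the "/history" path segment but keep the "?" that starts the query
  tail <- sub("/history?", "?", tail, fixed = TRUE)
  # Drop the frequency parameter, which the download link does not carry
  tail <- sub("&frequency=1d", "", tail, fixed = TRUE)
  # The download link uses "events" where the page URL uses "filter"
  tail <- sub("filter=", "events=", tail, fixed = TRUE)
  paste0(base_download, tail)
}

site <- "https://finance.yahoo.com/quote/BTC-USD/history?period1=1480464000&period2=1638230400&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true"
download_link <- history_to_download(site)
```

Using fixed = TRUE avoids relying on regex quantifiers, so each step removes or replaces exactly the literal text named in its comment.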
I'm using rvest to scrape the .txt files of a blog page, and I have a script that triggers every day, and scrapes the newest post. The base of that script is an lapply function that simply scrapes all of the posts, and I later sort out duplicates using Apache NiFi.
That's not an efficient way to sort duplicates, so I was wondering if there's a way to use the same script, and only scrape the newest posts?
The posts are labelled with numbers that count up, such as BLOG001, BLOG002, etc. I want to add a line of code that makes sure only the newest posts are scraped (they may post several in any given day). How do I make sure that I only get BLOG002, and on the next run only BLOG003, and so on?
library(tidyverse)
library(rvest)
library(httr)  # provides config()
# URL set up
url <- "https://www.example-blog/posts.aspx"
page <- html_session(url, config(ssl_verifypeer = FALSE))
# Picking elements
links <- page %>%
  html_nodes("td") %>%
  html_nodes("a") %>%
  html_attr("href")
# Function: download every linked post and save it to disk
out <- Map(function(ln) {
  fun1 <- html_session(URLencode(
    paste0("https://www.example-blog", ln)),
    config(ssl_verifypeer = FALSE))
  # writeBin() needs a destination; using the link's base name is an assumption
  writeBin(fun1$response$content, basename(ln))
  return(fun1$response$content)
}, links)
Assuming that all of the links you want start with 'BLOG', as in your post, and that you only want to download the one with the highest number each time the code is run, you could try something like this:
library(tidyverse)
library(rvest)
library(httr)  # provides config()
# URL set up
url <- "https://www.example-blog/posts.aspx"
page <- html_session(url, config(ssl_verifypeer = FALSE))
# Picking elements
links <- page %>%
  html_nodes("td") %>%
  html_nodes("a") %>%
  html_attr("href")
# Make sure only 'BLOG' links are checked
links <- links[substr(links, 1, 4) == 'BLOG']
# Get numeric value from link
blog.nums <- as.numeric(substr(links, 5, nchar(links)))
# Get the link with the maximum value
max.link <- links[which.max(blog.nums)]
fun1 <- html_session(URLencode(
  paste0("https://www.example-blog", max.link)),
  config(ssl_verifypeer = FALSE))
# writeBin() needs a destination; using the link's base name is an assumption
writeBin(fun1$response$content, basename(max.link))
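Since several posts may appear in one day, a variation on the idea above is to remember the highest number already downloaded and select every link above it, rather than only the maximum. The sketch below runs on a made-up vector of links; new_links is a hypothetical helper, and it assumes each link starts with 'BLOG' followed by digits.

```r
# Hypothetical helper: keep the links whose number is greater than last_seen
new_links <- function(links, last_seen) {
  # Ignore missing hrefs and anything that is not BLOG followed by digits
  links <- links[!is.na(links) & grepl("^BLOG[0-9]+", links)]
  # Extract the numeric part, tolerating a trailing extension
  nums <- as.numeric(sub("^BLOG([0-9]+).*$", "\\1", links))
  links[nums > last_seen]
}

# Made-up example input: three posts, one unrelated link, one missing href
links <- c("BLOG001", "BLOG002", "BLOG003", "about.aspx", NA)
to_fetch <- new_links(links, last_seen = 1)
```

In a daily script, last_seen would have to be stored between runs, for example in a small text file that is read at the start and rewritten after a successful download.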
I am web scraping a page at
http://catalog.ihsn.org/index.php/catalog#_r=&collection=&country=&dtype=&from=1890&page=1&ps=100&sid=&sk=&sort_by=nation&sort_order=&to=2017&topic=&view=s&vk=
From this url, I have built up a dataframe through the following code:
library(tidyverse)
library(rvest)
dflist <- map(.x = 1:417, .f = function(x) {
  Sys.sleep(5)
  # Insert the page number so that each iteration requests a different page
  url <- sprintf("http://catalog.ihsn.org/index.php/catalog#_r=&collection=&country=&dtype=&from=1890&page=%s&ps=100&sid=&sk=&sort_by=nation&sort_order=&to=2017&topic=&view=s&vk=", x)
  read_html(url) %>%
    html_nodes(".title a") %>%
    html_text() %>%
    as.data.frame()
}) %>% do.call(rbind, .)
I have repeated the same code to get all the data I was interested in, and it seems to work perfectly, although it is of course a little slow because of Sys.sleep().
My issue arose once I tried to scrape the individual project descriptions that should be included in the dataframe.
For instance, the first project description is at
http://catalog.ihsn.org/index.php/catalog/7118/study-description
the second project description is at
http://catalog.ihsn.org/index.php/catalog/6606/study-description
and so forth.
My problem is that I can't find a dynamic way to scrape all the projects' pages and insert them in the data frame, since the number in the URLs is neither sequential nor at the end of the link.
To make things clearer, this is the structure of the website I am scraping:
1. http://catalog.ihsn.org/index.php/catalog#_r=&collection=&country=&dtype=&from=1890&page=1&ps=100&sid=&sk=&sort_by=nation&sort_order=&to=2017&topic=&view=s&vk=
1.1. http://catalog.ihsn.org/index.php/catalog/7118
1.1.a. http://catalog.ihsn.org/index.php/catalog/7118/related_materials
1.1.b. http://catalog.ihsn.org/index.php/catalog/7118/study-description
1.1.c. http://catalog.ihsn.org/index.php/catalog/7118/data_dictionary
I have successfully scraped level 1 but not level 1.1.b (study-description), the one I am interested in, because the dynamic element of the URL (in this case, 7118) is not consistent across the website's more than 6000 pages at that level.
You have to extract the deeper URLs from the .title a elements and then scrape those as well. Here's a small example of how to do that using rvest and the tidyverse:
library(tidyverse)
library(rvest)
scraper <- function(x) {
  Sys.sleep(5)
  url <- sprintf("http://catalog.ihsn.org/index.php/catalog#_r=&collection=&country=&dtype=&from=1890&page=%s&ps=100&sid=&sk=&sort_by=nation&sort_order=&to=2017&topic=&view=s&vk=", x)
  html <- read_html(url)
  tibble(title = html_nodes(html, ".title a") %>% html_text(trim = TRUE),
         project_url = html_nodes(html, ".title a") %>% html_attr("href"))
}
result <- map_df(1:2, scraper) %>%
  mutate(study_description = map(project_url, ~ read_html(sprintf("%s/study-description", .x)) %>%
    html_node(".xsl-block") %>%
    html_text()))
This isn't complete as to all the things you want to do, but should show you an approach.
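The key step, building each study-description URL from the scraped project URL instead of guessing the number, can be tried offline. A minimal sketch using the two project URLs quoted in the question; the variable names are illustrative only.

```r
# Project URLs as they would come out of html_attr("href") on ".title a"
project_url <- c(
  "http://catalog.ihsn.org/index.php/catalog/7118",
  "http://catalog.ihsn.org/index.php/catalog/6606"
)

# Append the sub-page name; sprintf() is vectorised over project_url
description_url <- sprintf("%s/study-description", project_url)

# The non-sequential ID can still be recovered if it is needed as a key
project_id <- as.integer(sub("^.*/catalog/([0-9]+)$", "\\1", project_url))
```

Because the IDs are taken from the links on the listing pages, it does not matter that they are not sequential.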
I am working on a web scraping program that collects data from multiple pages. The code below is an example of what I am working with. I am only able to get the first page. It would be a great help if someone could point out where I am going wrong in my syntax.
library(rvest)
jump <- seq(1, 10, by = 1)
site <- paste0("https://stackoverflow.com/search?page=", jump, "&tab=Relevance&q=%5bazure%5d%20free%20tier")
dflist <- lapply(site, function(i) {
  webpage <- read_html(i)
  draft_table <- html_nodes(webpage, '.excerpt')
  draft <- html_text(draft_table)
})
finaldf <- do.call(cbind, dflist)
finaldf_10 <- data.frame(finaldf)
View(finaldf_10)
Below is the link I need to scrape, which has 127 pages:
https://stackoverflow.com/search?q=%5Bazure%5D+free+tier
With the code above I only get data from the first page, not from the rest. There is no syntax error either. Could you please help me find out where I am going wrong?
Some websites have protections against bulk scraping, and I guess SO is one of them. More on that: https://github.com/JonasCz/How-To-Prevent-Scraping/blob/master/README.md
In fact, if you delay your calls a little, this works. I tried a 5-second Sys.sleep(). You can probably reduce it, but too short a delay may fail (a 1-second Sys.sleep() didn't work for me).
Here is working code:
library(rvest)
library(purrr)
dflist <- map(.x = 1:10, .f = function(x) {
  Sys.sleep(5)
  url <- paste0("https://stackoverflow.com/search?page=", x, "&q=%5bazure%5d%20free%20tier")
  read_html(url) %>%
    html_nodes('.excerpt') %>%
    html_text() %>%
    as.data.frame()
}) %>% do.call(rbind, .)
Best,
Colin
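If a fixed delay is still not enough on some requests, the delay idea can be combined with a small retry loop that waits a little longer after each failure. This is a sketch only: fetch_with_retry is a hypothetical helper, and flaky_fetch is a stand-in that fails once so the example runs without network access.

```r
# Hypothetical helper: call fetch(url), retrying with a growing pause on error
fetch_with_retry <- function(fetch, url, tries = 3, wait = 5) {
  for (attempt in seq_len(tries)) {
    result <- tryCatch(fetch(url), error = function(e) NULL)
    if (!is.null(result)) return(result)
    # Wait longer after each failed attempt
    Sys.sleep(wait * attempt)
  }
  stop("all attempts failed for ", url)
}

# Stand-in fetcher that fails on the first call and succeeds afterwards
attempts <- 0
flaky_fetch <- function(url) {
  attempts <<- attempts + 1
  if (attempts < 2) stop("temporarily blocked")
  paste("content of", url)
}

page <- fetch_with_retry(flaky_fetch, "https://example.com/page", wait = 0)
```

With rvest loaded, read_html could be passed as the fetch argument in place of the stand-in.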
I have some code that scrapes data from this link (http://stats.ncaa.org/team/stats?org_id=575&sport_year_ctl_id=12280) and runs some calculations.
What I want to do is cycle through every team, collecting the data and running the same manipulations for each one. I have a dataframe with every team link, like the one above.
Pseudocode:
for (link in teamlist) {
  # scrape, manipulate, put into a table
}
However, I can't figure out how to loop through the links.
I've tried URL = teamlist$link[i], but I get an error when using readHTMLTable(). I have no trouble manually pasting each team's individual URL into the script; the error only occurs when I pull it from a table.
Current code:
library(XML)
library(gsubfn)
URL <- 'http://stats.ncaa.org/team/stats?org_id=575&sport_year_ctl_id=12280'
tx <- readLines(URL)
tx2 <- gsub("</tbody>", "", tx)
tx2 <- gsub("<tfoot>", "", tx2)
tx2 <- gsub("</tfoot>", "</tbody>", tx2)
Player_Stats <- readHTMLTable(tx2, asText = TRUE, header = TRUE, which = 2, stringsAsFactors = FALSE)
Thanks.
I agree with @ialm that you should check out the rvest package, which makes it straightforward to loop through links. Here is some example code on similar subject matter for you to check out.
Here I generate a list of links to iterate through:
library(rvest)
mainweb <- "http://www.basketball-reference.com/"
# read_html() replaces the html() function from early rvest versions
urls <- read_html("http://www.basketball-reference.com/teams") %>%
  html_nodes("#active a") %>%
  html_attr("href")
Now that the list of links is complete, I iterate through each link and pull a table from each:
teamdata <- list()
j <- 1
for (i in urls) {
  bball <- read_html(paste(mainweb, i, sep = ""))
  teamdata[j] <- bball %>%
    html_nodes(paste0("#", gsub("/teams/([A-Z]+)/$", "\\1", urls[j], perl = TRUE))) %>%
    html_table()
  j <- j + 1
}
Please see the code below, which basically builds off your code and loops through two different team pages as identified by the vector team_codes. The tables are returned in a list where each list element corresponds to a team's table. However, the tables look like they will need more cleaning.
library(XML)
library(gsubfn)
Player_Stats <- list()
j <- 1
team_codes <- c(575, 580)
for (code in team_codes) {
  URL <- paste0('http://stats.ncaa.org/team/stats?org_id=', code, '&sport_year_ctl_id=12280')
  tx <- readLines(URL)
  tx2 <- gsub("</tbody>", "", tx)
  tx2 <- gsub("<tfoot>", "", tx2)
  tx2 <- gsub("</tfoot>", "</tbody>", tx2)
  Player_Stats[[j]] <- readHTMLTable(tx2, asText = TRUE, header = TRUE, which = 2, stringsAsFactors = FALSE)
  j <- j + 1
}
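If the links already sit in a dataframe column, as in the question, the same loop can run over that column directly. One possible cause of the reported error, and this is only an assumption, is that the column is stored as a factor, so converting it with as.character() first is a cheap safeguard. In the sketch below scrape_team is a placeholder for the real readLines/readHTMLTable steps, and the two URLs follow the pattern used above.

```r
# Example dataframe of team links; stringsAsFactors = TRUE mimics a factor column
teamlist <- data.frame(
  link = c("http://stats.ncaa.org/team/stats?org_id=575&sport_year_ctl_id=12280",
           "http://stats.ncaa.org/team/stats?org_id=580&sport_year_ctl_id=12280"),
  stringsAsFactors = TRUE
)

# Placeholder for the real scraping and cleaning steps (no network access here)
scrape_team <- function(url) {
  data.frame(url = url, n_chars = nchar(url), stringsAsFactors = FALSE)
}

# Convert the column to plain strings before using each value as a URL
links <- as.character(teamlist$link)
results <- lapply(links, scrape_team)
names(results) <- links

# Stack the per-team tables into one dataframe
combined <- do.call(rbind, results)
```

lapply() returns one list element per link, which removes the need for a manual counter such as j.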