I've got a simple Map function that scrapes text files from a blog site. It's pretty easy to get a scraper that gets all of the text files and downloads them to my working directory. My goal: use an ifelse() or a plain if statement to only scrape a file based on a certain date.
E.g., if four files were posted on 1/31/19 and I pointed my ifelse at that date, the function would return those four files. Code:
library(tidyverse)
library(rvest)
# URL set up
url <- "https://www.example-blog/posts.aspx"
page <- html_session(url, config(ssl_verifypeer = FALSE))
# Picking elements
links <- page %>%
html_nodes("td") %>%
html_nodes("a") %>%
html_attr("href")
# Getting date elements
dates <- page %>%
html_nodes("node.dates") %>%
html_text()
dates <- parse_date_time(dates, "%m/%d/%Y", tz = "EST",
locale = Sys.getlocale("LC_TIME"))
# Function
out <- Map(function(ln) {
fun1 <- html_session(URLencode(
paste0("https://www.example-blog", ln)),
config(ssl_verifypeer = FALSE))
write <- writeBin(fun1$response$content)
ifelse(dates == '2019-01-31', write, "He's dead, Jim")
}, links)
I've tried various ways to get that if statement in there, and I've also tried moving the writeBin around. (Normally writeBin wouldn't be used in a vectorized way like this - I did it for easy viewing inside my ifelse.) Error:
Error in ans[test & ok] <- rep(yes, length.out = length(ans))[test & ok] :
replacement has length zero
If I leave out the if code, everything works great; it just returns every text file, when I only want the ones from the specified date.
Based on the description, it seems like we need to check the corresponding 'dates' value for each of the 'links' and then apply the if/else. If that is the case, then we can pass two arguments to Map:
Map(function(ln, y) {
fun1 <- html_session(URLencode(
paste0("https://www.example-blog", ln)),
config(ssl_verifypeer = FALSE))
write <- writeBin(fun1$response$content)
if(y == '2019-01-31') {
write
} else "He's dead, Jim"
},
links, dates)
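As a design note, since the goal is to keep only the files from one date, it can be simpler (and it avoids requesting files you will throw away) to filter links by dates up front and download just that subset. A minimal sketch of that idea - the basename(ln) output file name is an assumption about how you want to save the files, and note that writeBin() also needs a destination file (con) to actually write anything to disk:
keep <- as.Date(dates) == as.Date("2019-01-31")
out <- Map(function(ln) {
  sess <- html_session(URLencode(paste0("https://www.example-blog", ln)),
                       config(ssl_verifypeer = FALSE))
  # write the raw response body to a file named after the link (assumed naming scheme)
  writeBin(sess$response$content, basename(ln))
}, links[keep])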
Related
I'm using rvest to scrape the title, date and nested link for Danish parliamentary committee agendas. In general it works fine and I get the data I want, but I have two issues that I hope you can help with. As an example I'm scraping this committee website for the information in the table and the nested links. https://www.ft.dk/da/udvalg/udvalgene/liu/dokumenter/udvalgsdagsordner?committeeAbbreviation=LIU&session=20211
First problem - Missing newest data:
The scraper does not get the newest data although it is available on the website. For example, on the particular page linked above there are two entries from June that are not detected. The problem is consistent across the other committee pages, where the newest entries are also not picked up.
Q: Does anybody know why the data is not showing up in R even though it is present on the website, and is there a solution for getting it?
Second problem - Missing links:
For the particular committee (LIU) linked to above, I'm not able to get the full nested links to the agendas, even though it works for all the other committees. Instead it just returns www.ft.dk as the nested link. Up until now I have solved it by manually adding every nested link to the dataset, but it is rather time consuming. Does anybody know why this is not working and can help solve it?
Q: How do I get the nested link for the individual committee agenda?
I'm using loops to go through all the different committee pages, but here's the basic code:
library(tidyverse)
library(rvest)
library(httr)
library(dplyr)
library(purrr)
library(stringr)
# base url of Folketinget for committee agendas
base.url <- "https://www.ft.dk/da/udvalg/udvalgene/"
#List of all committees
committee <- c("§71","BEU", "BUU", "UPN", "EPI", "ERU", "EUU", "FIU", "FOU", "FÆU", "GRA", "GRU", "BOU", "IFU", "KIU", "KEF", "KUU", "LIU", "MOF", "REU", "SAU", "SOU", "SUU", "TRU", "UFU", "URU", "UUI", "UFO", "ULØ", "UFS", "UPV", "UER", "UET", "UUF")
## Set up search archives
if (!dir.exists("./DO2011-2022/")) {
dir.create("./DO2011-2022/")
}
search.archive <- "./DO2011-2022/dagsorden_search/"
if (!dir.exists(search.archive)) {
dir.create(search.archive)
}
# empty data set
# empty data set to collect results
df <- tibble(date = character(), title = character(),
             cmte = character(), link = character())
## Set up main date parameters
first.yr <- 2011
last.yr <- 2022
session <- 1:2
start <- Sys.time() # start the timer used for the scraping-time calculation after the loops
# main loop over committees
for (i in committee) {
for(current.yr in first.yr:last.yr) {
for(j in session) {
print(paste("Working on committee:", i, "Year", current.yr, "session", j))
result.page <- 1
## INTERIOR LOOP OVER SEARCH PAGES
repeat {
# build archive file name
file.name <- paste0(search.archive, i,
current.yr, "session", j,
"-page-",
result.page,
".html")
# construct url to pull
final.url <- paste0(base.url,i, "/dokumenter/udvalgsdagsordner?committeeAbbreviation=", i,
"&session=", current.yr, j, "&pageSize=200&pageNumber=", result.page)
# check archive / pull in page
# Fix problem with missing data on 2021 pages - it's because newly posted entries are not in the previously downloaded (cached) pages.
if (current.yr != 2021) {
if (file.exists(file.name)) {
page <- read_html(x = file.name)
} else {
page <- read_html(final.url)
tmp <- page %>% as.character
#Sys.sleep(3 + rpois(lambda = 2, n = 1))
write(x = tmp, file = file.name)
}
}
else{
page <- read_html(final.url)
tmp <- page %>% as.character
Sys.sleep(5)
write(x = tmp, file = file.name)
}
# only grab length of results once
if (result.page == 1) {
# get total # search results
total.results <- page %>%
html_nodes('.pagination-text-container-top .results') %>%
html_text(trim = T) %>%
str_extract("[[:digit:]]*") %>%
as.numeric
# break out of loop if no results on page (typical for session=2)
if (length(total.results) == 0) break
# count search pages to visit (NB: 200 = number of results per page)
count.pages <- ceiling(total.results / 200)
# print total results to console
print(paste("Total of", total.results, "for committee", i))
}
if(i == "FOU"|i == "GRU"){
titles <- page %>% html_nodes('.column-documents:nth-child(1) .column-documents__icon-text') %>% html_text(trim = T)
}
else{
titles <- page %>% html_nodes('.highlighted+ .column-documents .column-documents__icon-text') %>% html_text(trim = T) }
dates <- page %>% html_nodes('.highlighted .column-documents__icon-text') %>% html_text(trim = T)
# Solution to problem with links for LIU
if(i == "LIU"){
links <- page %>% html_nodes(".column-documents__link") %>% html_attr('href') %>% unique()
}
else{
links <- page %>% html_nodes(xpath = "//td[@data-title = 'Titel']/a[@class = 'column-documents__link']") %>% html_attr('href')
}
links <- paste0("https://www.ft.dk", links)
# build data frame from data
df <- df %>% add_row(
date = dates,
title = titles,
cmte = i,
link = links)
## BREAK LOOP when result.page == length of search result pages by year
if (result.page == count.pages) break
## iterate search page by ONE
result.page <- result.page + 1
} #END PAGE LOOP
} #END SESSION LOOP
} #END YEAR LOOP
} #END COMMITTEE LOOP
end <- Sys.time()
#Scraping time
end - start
If I alternatively use selectorgadget instead of xpath to get the links, I get the following error:
Error in tokenize(css) : Unclosed string at 42
links <- page %>% html_nodes(".highlighted .column-documents__icon-text']") %>% html_attr('href')
Thanks in advance.
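For what it's worth, the tokenize() error comes from the stray '] at the end of that CSS selector: the single quote at position 42 opens a string that is never closed, so rvest's CSS parser fails. Dropping it fixes the parse, and because html_attr('href') is only meaningful on the <a> elements, the .column-documents__link class already used in the LIU branch is probably what you want to combine with it - though whether those links actually sit inside the .highlighted cells is an assumption that needs checking against the page markup:
links <- page %>%
  html_nodes(".highlighted .column-documents__link") %>%
  html_attr("href")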
Related to the question asked here: R - Using SelectorGadget to grab a dataset
library(rvest)
library(jsonlite)
library(magrittr)
library(stringr)
library(purrr)
library(dplyr)
get_state_index <- function(states, state) {
return(match(T, map(states, ~ {
.x$name == state
})))
}
s <- read_html("https://www.opentable.com/state-of-industry") %>% html_text()
all_data <- jsonlite::parse_json(stringr::str_match(s, "__INITIAL_STATE__ = (.*?\\});w\\.")[, 2])
fullbook <- all_data$covidDataCenter$fullbook
hawaii_dataset <- tibble(
date = fullbook$headers %>% unlist() %>% as.Date(),
yoy = fullbook$states[get_state_index(fullbook$states, "Hawaii")][[1]]$yoy %>% unlist()
)
I am trying to grab the Hawaii dataset from the State tab. The code was working before but now it is throwing an error with this part of the code:
all_data <- jsonlite::parse_json(stringr::str_match(s, "__INITIAL_STATE__ = (.*?\\});w\\.")[, 2])
I am getting the error:
Error: lexical error: invalid char in json text. NA (right here) ------^
Any proposed solutions? The website seems to have stayed the same over the year, so what kind of change is causing the code to break?
EDIT: The solution proposed by @QHarr:
all_data <- jsonlite::parse_json(stringr::str_match(s, "__INITIAL_STATE__ = ([\\s\\S]+\\});")[, 2])
This was working for a while, but then it seems their website changed the underlying HTML again.
Change the regex pattern as shown below so that it correctly captures the desired string within the response text, i.e. the JavaScript object used for all_data:
all_data <- jsonlite::parse_json(stringr::str_match(s, "__INITIAL_STATE__ = ([\\s\\S]+\\});")[, 2])
Note: in R the backslash escape is doubled, e.g. \\s in the code rather than the plain-regex \s.
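Since the lexical error above ("invalid char in json text. NA") is what parse_json() reports when str_match() finds no match and returns NA, it can also be worth guarding the capture before parsing, so the next markup change fails with a clearer message. A small defensive sketch, not part of the original answer:
library(rvest)
library(jsonlite)
library(stringr)
s <- read_html("https://www.opentable.com/state-of-industry") %>% html_text()
# capture the serialized state object; NA means the pattern no longer matches the page
raw_json <- str_match(s, "__INITIAL_STATE__ = ([\\s\\S]+\\});")[, 2]
if (is.na(raw_json)) {
  stop("__INITIAL_STATE__ not found - the page markup has probably changed again")
}
all_data <- parse_json(raw_json)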
I have an excel file that contains certain keywords that need to be searched in google through R.
The output to be created is a data frame which contains the following variables:
Keyword; Position (position of the URL in the search results); Title (title of the i-th search result); Text (text in that search result); URL; Domain
The keywords and some example of the output are given in the link below:
https://drive.google.com/file/d/1AM3d5Hbf5nBpbRG1ydnZM7ZG2AdUyy-6/view?usp=sharing
(Sheet 1 has the keywords and sheet 2 has the sample output)
I tried to create a similar output but there seems to be an error.
Code:
# Web Scraping in R
library(XML)
library(RCurl)
library(dplyr)
library(rvest)
library(urltools)
library(htm2txt)
library(readxl)
data <- read_excel(file.choose()) # Importing the data
output <- data.frame(matrix(ncol=6,nrow=0))
colnames(output) <- c("Name","Position","Title","Text","URL","Domain")
for (i in 1:nrow(data)) {
search.term <- data[i,1]
getGoogleURL <- function(search.term, domain = '.com', quotes=TRUE)
{
search.term <- gsub(' ', '%20', search.term) # Cleaning the Search Term
if(quotes) search.term <- paste('%22', search.term, '%22', sep='')
getGoogleURL <- paste('http://www.google', domain, '/search?q=',
search.term, sep='')
}
quotes <- "False"
search.url <- getGoogleURL(search.term=search.term, quotes=quotes)
page <- read_html(search.url)
links <- page %>% html_nodes("a") %>% html_attr("href")
link <- links[startsWith(links, "/url?q=")]
link <- sub("^/url\\?q\\=(.*?)\\&sa.*$","\\1", link)
for (j in 1:length(link)) {
page1 <- read_html(link[j])
name <- data[i,1]
position <- j
title <- page1 %>% html_node("title") %>% html_text()
text <- gettxt(link[j])
url <- link[j]
domain <- suffix_extract(domain(link[j]))$host
vect <- c(name,position,title,text,url,domain)
output <- rbind(output,vect)
}
}
The error being shown is:
Error in match.names(clabs, nmi) : names do not match previous names
Please help, I'm new to R.
That error comes from rbind when the columns don't line up perfectly, for instance when there is a missing or extra column. In this case, it is likely because one of the elements going into your vect is empty/NULL or has length greater than 1.
rbind(data.frame(a=1,b=2), data.frame(b=3))
# Error in rbind(deparse.level, ...) :
# numbers of columns of arguments do not match
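One way a row can come up short is that c() silently drops NULL elements, so the combined vector (and therefore the row being bound) ends up with fewer values than the six columns expected. For example (the values here are just illustrative):
# a NULL element simply disappears when combined with c()
vect <- c("keyword", 1, "title", NULL, "http://example.com", "example.com")
length(vect)
# [1] 5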
Since iteratively adding rows to a frame gets expensive (it makes a complete copy of the frame every time even a single row is added, which is grossly inefficient), it's generally better to append to a list and convert it into a frame in one call.
out <- list()
for (i in seq_len(nrow(data))) {
# ...
for (j in seq_along(link)) {
# ...
vect <- c(name, position, title, text, url, domain)
stopifnot(length(vect) == 6L)
out <- c(out, list(vect))
}
}
output <- do.call(rbind.data.frame, out)
colnames(output) <- c("Name", "Position", "Title", "Text", "URL", "Domain")
(In reality, instead of stopifnot, one might record the url and data retrieved into a different list for forensic purposes. Or find the missing element and NA it before adding to the list. Either way, stopifnot is intended here as a placeholder for something more contextually relevant to you and your process.)
I have seen other posts which show how to extract data from multiple webpages.
But the problem is that when I scroll the website to check how many pages the data is divided into, the page automatically loads the next batch of data, which makes it impossible to identify the number of webpages. I don't have good enough knowledge of HTML and JavaScript to identify the attribute that triggers that loading, so I have worked out another way to get the number of pages.
When loaded in a browser, the website shows the number of records present; I take that number and divide it by 30 (the number of records shown per page). For example, if 90 records are present, then 90/30 = 3 pages.
Here is the code to get the number of records found on that page:
active_name_data1 <- html_nodes(webpage,'.active')
active1 <- html_text(active_name_data1)
as.numeric(gsub("[^\\d]+", "", word(active1[1],start = 1,end =1), perl=TRUE))
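Completing that calculation in code (a small sketch; it assumes 30 results per page is correct for this site):
library(stringr)   # for word(), as used in the snippet above
# number of records reported on the page (from the snippet above)
total_records <- as.numeric(gsub("[^\\d]+", "",
                                 word(active1[1], start = 1, end = 1), perl = TRUE))
# 30 records are shown per page, so round up to get the number of pages
n_pages <- ceiling(total_records / 30)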
Another approach is to get the attribute for the number of pages, i.e.
url='http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=10-Lacs'
webpage <- read_html(url)
active_data_html <- html_nodes(webpage,'a.act')
active <- html_text(active_data_html)
Here active gives me the page numbers, i.e. "1", " 2", " 3", " 4".
So I'm unable to work out how to get the data from the active page and then iterate over the remaining pages to collect the entire data set.
Here is what I have tried (uuu_df2 is the data frame with the multiple links for which I want to crawl data):
library(rvest)
library(xml2)   # xml_find_first(), xml_find_all(), xml_text()
library(plyr)   # llply(), ldply(); load before dplyr to avoid masking issues
library(dplyr)  # bind_rows()
uuu_df2 <- data.frame(x = c('http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=5-Lacs',
  'http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=10-Lacs',
  'http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=10-Lacs'))
urlList <- llply(uuu_df2[,1], function(url){
this_pg <- read_html(url)
results_count <- this_pg %>%
xml_find_first(".//span[@id='resultCount']") %>%
xml_text() %>%
as.integer()
if(!is.na(results_count) & (results_count > 0)){
cards <- this_pg %>%
xml_find_all('//div[@class="SRCard"]')
df <- ldply(cards, .fun=function(x){
y <- data.frame(wine = x %>% xml_find_first('.//span[@class="agentNameh"]') %>% xml_text(),
excerpt = x %>% xml_find_first('.//div[@class="postedOn"]') %>% xml_text(),
locality = x %>% xml_find_first('.//span[@class="localityFirst"]') %>% xml_text(),
society = x %>% xml_find_first('.//div[@class="labValu"]') %>% xml_text() %>% gsub('\\n', '', .))
return(y)
})
} else {
df <- NULL
}
return(df)
}, .progress = 'text')
names(urlList) <- uuu_df2[,1]
a=bind_rows(urlList)
But this code just gives me the data from the active page and does not iterate through the other pages of the given link.
P.S.: If a link doesn't have any records, the code skips that link and moves on to the next link in the list.
Any suggestion on what changes should be made to the code will be helpful. Thanks in advance.
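To act on the page-count idea above, one option is to compute the number of pages for each link and request each page explicitly instead of relying on the infinite scroll. The sketch below only illustrates that pattern and reuses the card-parsing code from the llply() call above; the "&page=" query parameter is a placeholder assumed purely for illustration - the site's real paging parameter (or AJAX endpoint) would need to be confirmed in the browser's developer tools, since it loads further results by scrolling:
library(rvest)
library(xml2)   # xml_find_first(), xml_find_all(), xml_text()
library(plyr)   # ldply()
scrape_listing <- function(url, per_page = 30) {
  first_pg <- read_html(url)
  # total number of results reported on the page
  n_results <- first_pg %>%
    xml_find_first(".//span[@id='resultCount']") %>%
    xml_text() %>%
    as.integer()
  if (is.na(n_results) || n_results == 0) return(NULL)
  n_pages <- ceiling(n_results / per_page)
  # hypothetical "&page=" parameter: confirm the real paging mechanism before relying on this
  page_urls <- paste0(url, "&page=", seq_len(n_pages))
  ldply(page_urls, function(u) {
    cards <- read_html(u) %>% xml_find_all('//div[@class="SRCard"]')
    ldply(cards, function(x) {
      data.frame(wine = x %>% xml_find_first('.//span[@class="agentNameh"]') %>% xml_text(),
                 excerpt = x %>% xml_find_first('.//div[@class="postedOn"]') %>% xml_text(),
                 locality = x %>% xml_find_first('.//span[@class="localityFirst"]') %>% xml_text(),
                 society = x %>% xml_find_first('.//div[@class="labValu"]') %>% xml_text() %>% gsub('\\n', '', .))
    })
  })
}
# one combined data frame across all links and all of their pages
a <- ldply(as.character(uuu_df2[, 1]), scrape_listing)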
I've been trying to scrape a list of companies off of the Fortune 500 archive site (the company-list pages such as 401.html). I can scrape the single table off of one of these pages with this code:
fileurl = read_html("http://archive.fortune.com/magazines/fortune/fortune500_archive/full/2005/1")
content = fileurl %>%
  html_nodes(xpath = '//*[@id="MagListDataTable"]/table[2]') %>%
  html_table()
contentframe = data.frame(content)
View(contentframe)
However, I need all of the data going back from 2005 to 1955, as well as the full list of companies 1 through 500, whereas this table only shows 100 companies and a single year at a time. I've recognized that the only changes to the URL are "...fortune500_archive/full/", then the YEAR, then "/" followed by 1, 101, 201, 301, or 401 (one page per range of 100 companies shown).
I also understand that I have to create a loop that will automatically collect this data for me as opposed to me manually replacing the url after saving each table. I've tried a few variations of sapply functions from reading other posts and watching videos, but none will work for me and I'm lost.
A few suggestions to get you started. First, it may be useful to write a function to download and parse each page, e.g.
getData <- function(year, start) {
url <- sprintf("http://archive.fortune.com/magazines/fortune/fortune500_archive/full/%d/%d.html",
year, start)
fileurl <- read_html(url)
content <- fileurl %>%
html_nodes(xpath = '//*[@id="MagListDataTable"]/table[2]') %>%
html_table()
contentframe <- data.frame(content)
}
We can then loop through the years and pages using lapply (as well as do.call(rbind, ...) to rbind all 5 dataframes from each year together). E.g.:
D <- lapply(2000:2005, function(year) {
do.call(rbind, lapply(seq(1, 500, 100), function(start) {
cat(paste("Retrieving", year, ":", start, "\n"))
getData(year, start)
}))
})
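If a single combined data frame is wanted, the yearly blocks in D can then be tagged and stacked; a short follow-up sketch, assuming the table layout (column names) is the same across years - and to cover the full archive mentioned in the question, replace 2000:2005 above with 1955:2005:
names(D) <- 2000:2005   # label each element of D with its year
# add a year column to each yearly block and stack them into one data frame
fortune_all <- do.call(rbind, Map(function(df, yr) transform(df, year = yr), D, names(D)))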