URL with # - how to import data from an HTML table in R

url1 <- "http://www.nationmaster.com/country-info/stats/Economy/Inequality/GINI-index#1994"
url2 <- "http://www.nationmaster.com/country-info/stats/Economy/Inequality/GINI-index#1986"
tables1 <- readHTMLTable(url1)
tables2 <- readHTMLTable(url1)
View(tables1[1])
View(tables2[1])
The results are the same as for the URL without #1994 or #1986 appended.
In other words: I would like to read all the data shown under the HISTORY column.
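The fragment (everything after the #) is interpreted by the browser and never sent to the server, so both requests fetch exactly the same document; the HISTORY values are filled in client-side by JavaScript. A minimal check of that (a sketch, using RCurl):
library(RCurl)
# The HTTP client strips the fragment before the request goes out,
# so both URLs ask the server for the same document
html1 <- getURL("http://www.nationmaster.com/country-info/stats/Economy/Inequality/GINI-index#1994")
html2 <- getURL("http://www.nationmaster.com/country-info/stats/Economy/Inequality/GINI-index#1986")
identical(html1, html2)  # TRUE, barring dynamic content in the page
To get the historical series you would need to request whatever endpoint the page's JavaScript loads it from, which you can find in the browser's network tab (see the CBOE answer further down for the same pattern).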

Related

scrape multiple urls from a csv file with R

I have a CSV file that contains information about a set of articles; the 9th column holds the URLs. I have successfully scraped the title and abstract from a single URL with the following code:
library(rvest)
url <- 'https://link.springer.com/article/10.1007/s10734-019-00404-5'
webpage <- read_html(url)
# The title sits in an element with class "u-h1"
title_data_html <- html_nodes(webpage, '.u-h1')
title_data <- html_text(title_data_html)
head(title_data)
# The abstract paragraphs live under the #Abs1-content container
abstract_data_html <- html_nodes(webpage, '#Abs1-content p')
abstract_data <- html_text(abstract_data_html)
head(abstract_data)
myTable <- data.frame(Title = title_data, Abstract = abstract_data)
View(myTable)
Now I want to use R to scrape the title and abstract of each article. My question is how to import the URLs contained in the CSV file and how to write a for loop to scrape the data I need. I'm quite new to R, so thanks in advance for your help.
Try this:
library(rvest)
# Read the URLs (first column of the CSV) into a character vector
URLs <- read.csv("urls.csv", stringsAsFactors = FALSE)
n <- nrow(URLs)
URLs2 <- as.character(URLs[, 1])
df <- data.frame(Row = integer(), Title = character(), Abstract = character(), stringsAsFactors = FALSE)
for (i in 1:n) {
  # read_html() errors on a broken URL; catch that and skip the page
  webpage <- tryCatch(read_html(URLs2[i]), error = function(e) 'empty page')
  if (!identical(webpage, 'empty page')) {
    title_data <- html_text(html_nodes(webpage, '.u-h1'))
    abstract_data <- html_text(html_nodes(webpage, '#Abs1-content p'))
    # Only keep pages where both the title and the abstract were found
    if (length(title_data) > 0 && length(abstract_data) > 0) {
      temp <- data.frame(Row = i, Title = title_data, Abstract = abstract_data, stringsAsFactors = FALSE)
      df <- rbind(df, temp)
    }
  }
}
View(df)
Edit: the code has been edited so that it still works when some of the URLs are broken (those are skipped). The output rows are numbered with each entry's corresponding row number in the CSV.

R rvest retrieve empty table

I'm trying two strategies to get data from a web table:
library(tidyverse)
library(rvest)
webpage <- read_html('https://markets.cboe.com/us/equities/market_statistics/book/')
data <- html_table(webpage, fill=TRUE)
data[[2]]
and:
library("httr")
library("XML")
URL <- 'https://markets.cboe.com/us/equities/market_statistics/book/'
temp <- tempfile(fileext = ".html")
GET(url = URL, user_agent("Mozilla/5.0"), write_disk(temp))
df <- readHTMLTable(temp)
df <- df[[2]]
Both of them are returning an empty table.
Values are retrieved dynamically from another endpoint, which you can find in the browser's network tab when refreshing your URL. You need to add a Referer header for the server to return the JSON containing the table data.
library(httr)
# The server checks the Referer header before returning the JSON
headers = c('Referer' = 'https://markets.cboe.com/us/equities/market_statistics/book/')
d <- content(httr::GET('https://markets.cboe.com/json/bzx/book/FIT', httr::add_headers(.headers = headers)))
print(d$data)
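If you would rather work with a data frame than nested lists, one option (a sketch; the exact fields under $data depend on what the endpoint actually returns) is to parse the raw JSON text with jsonlite, which simplifies arrays of records into data frames where it can:
library(httr)
library(jsonlite)
headers <- c('Referer' = 'https://markets.cboe.com/us/equities/market_statistics/book/')
res <- GET('https://markets.cboe.com/json/bzx/book/FIT', add_headers(.headers = headers))
# Parse the response body as text and let fromJSON() simplify it
parsed <- fromJSON(content(res, as = "text", encoding = "UTF-8"))
str(parsed$data, max.level = 1)  # inspect the structure before using it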

How to scrape tables from raster files (like .jpeg, .jpg, .png, .gif) and save in excel format?

I have been trying to extract a table from a .jpg into Excel format. I know how to do it when the source is a .pdf or an HTML file; please find my script below. I would be grateful if someone could help me figure this out.
Thanks,
library(httr)
library(magick)
library(tidyverse)
url_template <- "https://www.environment.co.za/wp-content/uploads/2016/05/worst-air-pollution-in-south-africa-table-graph-statistics-1024x864.jpg"
pb <- progress_estimated(n = length(url_template))
# Download the image, sending the headers the site expects
sprintf(url_template) %>%
  map(~{
    pb$tick()$print()
    GET(url = .x,
        add_headers(
          accept = "image/webp,image/apng,image/*,*/*;q=0.8",
          referer = "https://www.environment.co.za/pollution/worst-air-pollution-south-africa.html/attachment/worst-air-pollution-in-south-africa-table-graph-statistics",
          authority = "environment.co.za"))
  }) -> store_list_pages
# Wrap the downloaded image(s) in a single PDF
map(store_list_pages, content) %>%
  map(image_read) %>%
  reduce(image_join) %>%
  image_write("SApollution.pdf", format = "pdf")
library(tabulizer)
library(tabulizerjars)
library(XLConnect)
wbk <- loadWorkbook("~/crap_exercise/img2pdf/randomdata.xlsx", create = TRUE)
# Extract the table from the document
out <- extract_tables("SApollution.pdf") # check if which = "the table number" is needed
# Combine the pieces into a single data matrix containing all of the data
final <- do.call(rbind, out[-length(out)])
final <- as.data.frame(final)
# Apply custom column names
headers <- c('#', 'Urban area', 'Province', 'PM2.5 (mg/m3)')
names(final) <- headers
createSheet(wbk, "pollution")
writeWorksheet(wbk, final, sheet = 'pollution', header = TRUE)
saveWorkbook(wbk)
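Note that tabulizer can only read PDFs with a text layer, so a PDF that merely wraps a raster image gives extract_tables() nothing to work with. One route that can work is OCR (a sketch, assuming the tesseract engine is installed alongside magick):
library(magick)
# OCR the table image directly; image_ocr() wraps the tesseract engine
img <- image_read("https://www.environment.co.za/wp-content/uploads/2016/05/worst-air-pollution-in-south-africa-table-graph-statistics-1024x864.jpg")
txt <- img %>%
  image_convert(type = "grayscale") %>%  # grayscale often improves OCR accuracy
  image_ocr()
cat(txt)
# The result is plain text, roughly one line per table row; split it
# into columns (e.g. strsplit() on runs of whitespace) before writing to Excel.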

Looping through names in a csv file

I am trying to loop through all the names in a CSV file with the following loop to retrieve Twitter data:
require(twitteR)
require(data.table)
consumer_key <- 'KEY'
consumer_secret <- 'CON_SECRET'
access_token <- 'TOKEN'
access_secret <- 'ACC_SECRET'
setup_twitter_oauth(consumer_key,consumer_secret,access_token,access_secret)
options(httr_oauth_cache=T)
accounts <- read.csv(file="FILE.CSV", header=FALSE, sep="")
Sample data in the CSV file (one name per row, first column only):
timberghmans
alyssabereznak
JoshuaLenon
names <- lookupUsers(c(accounts))
for(name in names){
a <- getUser(name)
print(a)
b <- a$getFollowers()
print(b)
b_df <- rbindlist(lapply(b, as.data.frame))
print(b_df)
c <- subset(b_df, location!="")
d <- c$location
print(d)
}
However, it does not work. Each row contains a Twitter screen name. When I type them in like this:
names <- lookupUsers(c("USER1","USER2","USER3"))
it works perfectly. I also tried looping through the accounts, but to no avail. Does someone have a general example, or could anyone give a hint, please?
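A likely culprit (a sketch, not tested against your setup): read.csv() returns a data frame, so c(accounts) hands lookupUsers() a list rather than a character vector; and inside the loop, getUser() is called on user objects that lookupUsers() has already fetched. Extracting the first column as characters and using the returned objects directly sidesteps both problems:
accounts <- read.csv(file = "FILE.CSV", header = FALSE, stringsAsFactors = FALSE)
screen_names <- accounts[[1]]  # first column as a character vector
users <- lookupUsers(screen_names)
for (u in users) {  # each u is already a user object
  print(u)
  followers <- u$getFollowers()
  followers_df <- rbindlist(lapply(followers, as.data.frame))
  locations <- subset(followers_df, location != "")$location
  print(locations)
}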

Feeding R a list of webpages through a CSV

I have created a function which scrapes information and adds it to a data frame. I want to feed this function a list of URLs from a .csv, but it does not seem to work when I wrap it in a function.
IMDB <- function(wp){
  for (i in wp){
    raw_data <- getURL(i)
    data <- fromJSON(raw_data)
    data <- as.list(data)
    length(data)
    final_data <- do.call(rbind, data)
    Title <- final_data[c("Title"),]
    ScreenWriter <- final_data[c("Writer"),]
    Fdata <- cbind(Title, ScreenWriter)
    Authors <- rbind(Authors, Fdata)
  }
}
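Two problems stand out: Authors is never initialised inside the function, and the function never returns anything, so the loop's work is lost when it exits. A sketch of a working version (assuming RCurl and jsonlite, and that each URL returns one JSON record with Title and Writer fields, as the original code implies):
library(RCurl)
library(jsonlite)
IMDB <- function(wp) {
  rows <- lapply(wp, function(u) {
    data <- fromJSON(getURL(u))
    data.frame(Title = data$Title, ScreenWriter = data$Writer, stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)  # return the combined data frame
}
# urls.csv is a hypothetical file with one URL per row and no header
wp <- read.csv("urls.csv", header = FALSE, stringsAsFactors = FALSE)[[1]]
Authors <- IMDB(wp)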
