url1 <- "http://www.nationmaster.com/country-info/stats/Economy/Inequality/GINI-index#1994"
url2 <- "http://www.nationmaster.com/country-info/stats/Economy/Inequality/GINI-index#1986"
tables1 <- readHTMLTable(url1)
tables2 <- readHTMLTable(url2)
View(tables1[[1]])
View(tables2[[1]])
The results are the same as for the URL without #1986 or #1994. In other words: I would like to read all of the data from the HISTORY column.
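This is expected: everything after # in a URL is a fragment that the browser handles locally (the year tabs are presumably switched in client-side), so it is never sent to the server and both requests download the same page. A quick check:

identical(tables1, tables2)  # TRUE - the server returned identical HTML for both URLs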
I have a CSV file that contains information about a set of articles, and the 9th column contains the URLs. I have successfully scraped the title and abstract from a single URL with the following code:
library('rvest')
url <- 'https://link.springer.com/article/10.1007/s10734-019-00404-5'
webpage <- read_html(url)
title_data_html <- html_nodes(webpage,'.u-h1')
title_data <- html_text(title_data_html)
head(title_data)
abstract_data_html <- html_nodes(webpage,'#Abs1-content p')
abstract_data <- html_text(abstract_data_html)
head(abstract_data)
myTable = data.frame(Title = title_data, Abstract = abstract_data)
View(myTable)
Now I want to use R to scrape the title and abstract of each article. My question is how to import the URLs contained in the CSV file and how to write a for loop to scrape the data I need. I'm quite new to R, so thanks in advance for your help.
Try This:
library(rvest)

# Read the URLs (first column of the csv) into a character vector
URLs <- read.csv("urls.csv")
n <- nrow(URLs)
URLs2 <- character(n)
for (i in 1:n) {
  URLs2[i] <- as.character(URLs[i, 1])
}

# Accumulator for the scraped rows
df <- data.frame(Row = integer(), Title = character(), Abstract = character(),
                 stringsAsFactors = FALSE)

for (i in 1:n) {
  # If a URL is broken, read_html() errors; swap in a sentinel string and skip it
  webpage <- tryCatch(read_html(URLs2[i]), error = function(e) "empty page")
  if (!identical(webpage, "empty page")) {
    title_data <- html_text(html_nodes(webpage, '.u-h1'))
    abstract_data <- html_text(html_nodes(webpage, '#Abs1-content p'))
    # Keep only pages where both a title and an abstract were found
    if (length(title_data) > 0 && length(abstract_data) > 0) {
      temp <- data.frame(Row = i, Title = title_data, Abstract = abstract_data,
                         stringsAsFactors = FALSE)
      df <- rbind(df, temp)
    }
  }
}
View(df)
Edit: The code has been edited so that it still works even if some of the URLs are broken (they are simply skipped). The output rows are numbered with each entry's corresponding row number in the CSV.
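The skipping works because tryCatch() replaces the error with a sentinel string, which the if check then filters out. A quick way to see it, using a deliberately unreachable host:

bad <- tryCatch(read_html("https://example.invalid/"), error = function(e) "empty page")
identical(bad, "empty page")  # TRUE, so the loop just moves on to the next URL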
I'm trying two strategies to get data from a web table:
library(tidyverse)
library(rvest)
webpage <- read_html('https://markets.cboe.com/us/equities/market_statistics/book/')
data <- html_table(webpage, fill=TRUE)
data[[2]]
and the second:
library("httr")
library("XML")
URL <- 'https://markets.cboe.com/us/equities/market_statistics/book/'
temp <- tempfile(fileext = ".html")
GET(url = URL, user_agent("Mozilla/5.0"), write_disk(temp))
df <- readHTMLTable(temp)
df <- df[[2]]
Both of them are returning an empty table.
Values are retrieved dynamically from another endpoint, which you can find in the browser's network tab when refreshing your url. You need to add a Referer header for the server to return the JSON containing the table data.
library(httr)
headers = c('Referer'='https://markets.cboe.com/us/equities/market_statistics/book/')
d <- content(httr::GET('https://markets.cboe.com/json/bzx/book/FIT', httr::add_headers(.headers=headers)))
print(d$data)
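Here d is the parsed JSON, so d$data is a nested list. The field names below are an assumption (inspect the structure first); something along these lines turns one component into a data frame:

str(d$data, max.level = 1)  # see which components hold the table rows
# hypothetical: if d$data$asks is a list of (price, size) pairs
asks <- do.call(rbind, lapply(d$data$asks, function(x)
  data.frame(price = x[[1]], size = x[[2]])))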
I have been trying to extract a table from a .jpg image into Excel. I'm aware of how to do it if it's a .pdf or an html file; please find the script below. I would be grateful if someone could help me figure this out.
Thanks,
library(httr)
library(magick)
library(tidyverse)
url_template <- "https://www.environment.co.za/wp-content/uploads/2016/05/worst-air-pollution-in-south-africa-table-graph-statistics-1024x864.jpg"
pb <- progress_estimated(n=length(url_template))
sprintf(url_template) %>%
map(~{
pb$tick()$print()
GET(url = .x,
add_headers(
accept = "image/webp,image/apng,image/*,*/*;q=0.8",
referer = "https://www.environment.co.za/pollution/worst-air-pollution-south-africa.html/attachment/worst-air-pollution-in-south-africa-table-graph-statistics",
authority = "environment.co.za"))
}) -> store_list_pages
map(store_list_pages, content) %>%
map(image_read) %>%
reduce(image_join) %>%
image_write("SApollution.pdf", format = "pdf")
library(tabulizer)
library(tabulizerjars)
library(XLConnect)  # provides loadWorkbook/createSheet/writeWorksheet/saveWorkbook

wbk <- loadWorkbook("~/crap_exercise/img2pdf/randomdata.xlsx", create = TRUE)

# Extract the table from the document
out <- extract_tables("SApollution.pdf")  # check whether which = "the table number" is needed

# Combine these into a single data matrix containing all of the data
final <- do.call(rbind, out[-length(out)])

# Table headers get extracted as rows with bad formatting, so drop that first row
final <- as.data.frame(final[-1, ])

# Apply custom column names
headers <- c('#', 'Urban area', 'Province', 'PM2.5 (mg/m3)')
names(final) <- headers

createSheet(wbk, "pollution")
writeWorksheet(wbk, final, sheet = 'pollution', header = TRUE)
saveWorkbook(wbk)
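One thing to be aware of: tabulizer can only extract text from PDFs that have an actual text layer, and a PDF assembled from a .jpg is still just a picture, so an OCR step is needed in between. A minimal alternative sketch with the tesseract package (the raw OCR text will likely need manual cleanup before it resembles a table):

library(tesseract)
txt <- ocr(url_template)  # OCR the jpg directly from its URL
cat(txt)                  # raw text; split into columns before writing to Excel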
I am trying to loop through all the names in a csv file to retrieve twitter data with the following loop:
require(twitteR)
require(data.table)
consumer_key <- 'KEY'
consumer_secret <- 'CON_SECRET'
access_token <- 'TOKEN'
access_secret <- 'ACC_SECRET'
setup_twitter_oauth(consumer_key,consumer_secret,access_token,access_secret)
options(httr_oauth_cache=T)
accounts <- read.csv(file="FILE.CSV", header=FALSE, sep="")
Sample data in the CSV file (each name in its own row, first column):
timberghmans
alyssabereznak
JoshuaLenon
names <- lookupUsers(c(accounts))
for (name in names){
  a <- getUser(name)
  print(a)
  b <- a$getFollowers()
  print(b)
  b_df <- rbindlist(lapply(b, as.data.frame))
  print(b_df)
  c <- subset(b_df, location != "")
  d <- c$location
  print(d)
}
However, it does not work, even though every row of the file contains a Twitter screen name. When I type the names in directly like this:
names <- lookupUsers(c("USER1","USER2","USER3"))
it works perfectly. I also tried to loop through the accounts directly, but to no avail. Does someone have a general example, or could anyone give me a hint, please?
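A likely culprit: read.csv() returns a data.frame, so c(accounts) hands lookupUsers() a list of columns rather than a character vector of screen names. A minimal sketch of the fix (same hypothetical file name as above, untested against the live API):

accounts <- read.csv(file = "FILE.CSV", header = FALSE, stringsAsFactors = FALSE)
screen_names <- accounts[[1]]       # first column as a plain character vector
names <- lookupUsers(screen_names)  # one lookup per screen name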
I have created a function which scrapes information and adds it to a data.frame. I want to feed this function a list of urls from a .csv, but it does not seem to work when I wrap the code in a function.
IMDB <- function(wp){
  for (i in wp){
    raw_data <- getURL(i)
    data <- fromJSON(raw_data)
    data <- as.list(data)
    length(data)
    final_data <- do.call(rbind, data)
    Title <- final_data[c("Title"), ]
    ScreenWriter <- final_data[c("Writer"), ]
    Fdata <- cbind(Title, ScreenWriter)
    Authors <- rbind(Authors, Fdata)
  }
}
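Two things keep this from working: Authors is never initialized, and everything assigned inside the function stays in its local scope, so nothing is returned. A minimal sketch of a fix, assuming getURL() comes from RCurl, fromJSON() from jsonlite, and that each URL returns JSON with Title and Writer fields as in the original:

library(RCurl)
library(jsonlite)

IMDB <- function(wp){
  rows <- list()
  for (i in seq_along(wp)){
    raw_data <- getURL(wp[i])
    data <- fromJSON(raw_data)
    # one row per URL; field names taken from the original code
    rows[[i]] <- data.frame(Title = data$Title, ScreenWriter = data$Writer,
                            stringsAsFactors = FALSE)
  }
  do.call(rbind, rows)  # return the accumulated data.frame
}

# Usage: urls <- as.character(read.csv("films.csv")[[1]]); result <- IMDB(urls)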