I scraped a wikipedia table using r
library(rvest)
url <- "https://en.wikipedia.org/wiki/New_York_City"
nyc <- url %>%
read_html() %>%
html_node(xpath = '//*[#id="mw-content-text"]/div/table[1]') %>%
html_table(fill = TRUE)
And want to save the values into a new dataframe.
Output
Area population
468.484 sq mi 8,336,817
What is the best way to do this?
You need to choose which table. From the table select needed columns and rows. Assign column names using setNames and reset rownames by setting them to NULL. I'm sure you want population column as.integer, just use gsub before to clean out the non-digits.
I'm not sure about the html_node line and left it out.
library(rvest)
url <- "https://en.wikipedia.org/wiki/New_York_City"
nyc <- read_html(url)
# nyc <- html_node(nyc, xpath = '//*[#id="mw-content-text"]/div/table[1]')
nyc <- html_table(nyc, header=TRUE, fill = TRUE)
nyc <- `rownames<-`(
setNames(nyc[[3]][-c(1:2, 10), 2:3], c("area", "population")),
NULL)
nyc <- transform(nyc, population=as.integer(gsub("\\D", "", population)))
nyc
# area population
# 1 Bronx 1418207
# 2 Kings 2559903
# 3 New York 1628706
# 4 Queens 2253858
# 5 Richmond 476143
# 6 City of New York 8336817
# 7 State of New York 19453561
Judging from OP's example output they want the table given at a different xpath to that which they provided in the question. Please see the following workflow, note: names have been set manually to save the hassle of formatting the strings from rows:
# Initialise package in session: rvest => .GlobalEnv()
library(rvest)
# Store the url scalar: url => character vector
url <- "https://en.wikipedia.org/wiki/New_York_City"
# Scrape the table and store it memory: nyc => data.frame
nyc <-
url %>%
read_html() %>%
html_node(xpath = '/html/body/div[3]/div[3]/div[4]/div/table[3]') %>%
html_table(fill = TRUE) %>%
data.frame()
# Set the names appropriately: names(nyc) character vector
names(nyc) <- c("borough", "county", "pop_est_2019",
"gdp_bill_usd", "gdp_per_cap",
"land_area_sq_mi", "land_area_sq_km",
"density_pop_sq_mi", "density_pop_sq_km")
# Coerce the vectors to the appropriate type: cleaned => data.frame
cleaned <- data.frame(lapply(nyc[4:nrow(nyc)-1,], function(x){
if(length(grep("\\d+\\,\\d+$|^\\d+\\.\\d+$", x)) > 0){
as.numeric(trimws(gsub("\\,", "", as.character(x)), "both"))
}else{
as.factor(x)
}
}
)
)
Related
I have a script that works fine for one file. It takes the information from a json file, extracts a list and a sublist of it (A), and then another list B with the third element of list A. It creates a data frame with list B and compares it with a master file. Finally, it provides two numbers: the number of elements in the list B and the number of matching elements of that list when comparing with the master file.
However, I have 180 different json files in a folder and I need to run the script for all of them, and build a data frame with the results for each file. So the final result should be something like this (note that the last line's figures are correct, the first two are fictitious):
The code I have so far is the following:
library(rjson)
library(dplyr)
library(tidyverse)
#load data from file
file <- "./raw_data/whf.json"
json_data <- fromJSON(file = file)
org_name <- json_data$id
# extract lists and the sublist
usernames <- json_data$twitter
following <- usernames$following
# create empty vector to populate
longitud = length(following)
names <- vector(length = longitud)
# loop to populate the empty vector with third element of the sub-list
for(i in 1:longitud){
names[i] <- following[[i]][3]
}
# create a data frame and change column name
names_list <- data.frame(sapply(names, c))
colnames(names_list) <- "usernames"
# create a data frame with the correct formatting ready to comparison
org_handles <- data.frame(paste("#", names_list$usernames, sep=""))
colnames(org_handles) <- "Twitter"
# load master file and select the needed columns
psa_handles <- read_csv(file = "./raw_data/psa_handles.csv") %>%
select(Name, AKA, Twitter)
# merge data frames and present the results
org_list <- inner_join(psa_handles, org_handles)
length(org_list$Twitter)
length(usernames$following)
My first attempt is to include this code at the beginning:
files <- list.files()
for(f in files){
json_data <- fromJSON(file = f)
# the rest of the script for one file here
}
but I do not know how to write the code for the data frame or even how to integrate both ideas -the working script and the loop for the file names. I took the idea from here.
The new code after Alvaro Morales' answer is the following
library(rjson)
library(dplyr)
library(tidyverse)
archivos <- list.files("./raw_data/")
calculate_accounts <- function(archivos){
#load data from file
path <- paste("./raw_data/", archivos, sep = "")
json_data <- fromJSON(file = path)
org_name <- json_data$id
# extract lists and the sublist
usernames <- json_data$twitter
following <- usernames$following
# create empty vector to populate
longitud = length(following)
names <- vector(length = longitud)
# loop to populate the empty vector with third element of the sub-list
for(i in 1:longitud){
names[i] <- following[[i]][3]
}
# create a data frame and change column name
names_list <- data.frame(sapply(names, c))
colnames(names_list) <- "usernames"
# create a data frame with the correct formatting ready to comparison
org_handles <- data.frame(paste("#", names_list$usernames, sep=""))
colnames(org_handles) <- "Twitter"
# load master file and select the needed columns
psa_handles <- read_csv(file = "./psa_handles.csv") %>%
select(Name, AKA, Twitter)
# merge data frames and present the results
org_list <- inner_join(psa_handles, org_handles)
accounts_db_org <- length(org_list$Twitter)
accounts_total_org <- length(usernames$following)
}
table_psa <- map_dfr(archivos, calculate_accounts)
However, now there is an error when Joining, by = "Twitter", it says subindex out of limits.
Links to 3 test files to put together in raw_data folder:
https://drive.google.com/file/d/1ilUHwLjgtZCzh0LneIJEhTryrGumDF1V/view?usp=sharing
https://drive.google.com/file/d/1KM3hRZ8DzgPMEsMFmwBdmMNHrPCttuaB/view?usp=sharing
https://drive.google.com/file/d/17cWXJ9ltGXZ6izkgJv0uyNwStrE95_OA/view?usp=sharing
Link to the master file to compare:
https://drive.google.com/file/d/11fOpYFFfHijhZl_CuWHKvkrI7edkpUNQ/view?usp=sharing
<<<<< UPDATE >>>>>>
I am trying to find the solution and I did the code work and provide a valide output (a 180x3 data frame), but the columns that should be filled with the values of the objects accounts_db_org and accounts_total_org are showing NA. When checking the value stored in those objects, the values are correct (for the last iteration). So the output now is in its right format, but with NA instead of numbers.
I am really close, but I am not being able to make the code to show the right numbers. My last attempt is:
library(rjson)
library(dplyr)
library(tidyverse)
archivos <- list.files("./raw_data", pattern = "json", full.names = TRUE)
psa_handles <- read_csv(file = "./raw_data/psa_handles.csv", show_col_types = FALSE) %>%
select(Name, AKA, Twitter)
nr_archivos <- length(archivos)
psa_result <- matrix(nrow = nr_archivos, ncol = 3)
# loop for working with all files, one by one
for(f in 1:nr_archivos){
# load file
json_data <- fromJSON(file = archivos[f])
org_name <- json_data$id
# extract lists and the sublist
usernames <- json_data$twitter
following <- usernames$following
# empty vector
longitud = length(following)
names <- vector(length = longitud)
# loop to populate with the third element of each i item of the sublist
for(i in 1:longitud){
names[i] <- following[[i]][3]
}
# convert the list into a data frame
names_list <- data.frame(sapply(names, c))
colnames(names_list) <- "usernames"
# applying some format prior to comparison
org_handles <- data.frame(paste("#", names_list$usernames, sep=""))
colnames(org_handles) <- "Twitter"
# merge tables and calculate the results for each iteration
org_list <- inner_join(psa_handles, org_handles)
accounts_db_org <- length(org_list$Twitter)
accounts_total_org <- length(usernames$following)
# populate the matrix row by row
psa_result[f] <- c(org_name, accounts_db_org, accounts_total_org)
}
# create a data frame from the matrix and save the result
psa_result <- data.frame(psa_result)
write_csv(psa_result, file = "./outputs/cuentas_seguidas_en_psa.csv")
The subscript out of bounds error was caused by a json file with 0 records. That was fixed deleting the file.
You can do it with purrr::map or purrr::map_dfr.
Is this what you looking for?
archivos <- list.files("./raw_data", pattern = "json", full.names = TRUE)
# load master file and select the needed columns. This needs to be out of "calculate_accounts" because you only read it once.
psa_handles <- read_csv(file = "./raw_data/psa_handles.csv") %>%
select(Name, AKA, Twitter)
# calculate accounts
calculate_accounts <- function(archivo){
json_data <- rjson::fromJSON(file = archivo)
org_handles <- json_data %>%
pluck("twitter", "following") %>%
map_chr("username") %>%
as_tibble() %>%
rename(usernames = value) %>%
mutate(Twitter = str_c("#", usernames)) %>%
select(Twitter)
org_list <- inner_join(psa_handles, org_handles)
org_list %>%
mutate(accounts_db_org = length(Twitter),
accounts_total_org = nrow(org_handles)) %>%
select(-Twitter)
}
table_psa <- map_dfr(archivos, calculate_accounts)
#output:
# A tibble: 53 x 4
Name AKA accounts_db_org accounts_total_org
<chr> <chr> <int> <int>
1 Association of American Medical Colleges AAMC 20 2924
2 American College of Cardiology ACC 20 2924
3 American Heart Association AHA 20 2924
4 British Association of Dermatologists BAD 20 2924
5 Canadian Psoriasis Network CPN 20 2924
6 Canadian Skin Patient Alliance CSPA 20 2924
7 European Academy of Dermatology and Venereology EADV 20 2924
8 European Society for Dermatological Research ESDR 20 2924
9 US Department of Health and Human Service HHS 20 2924
10 International Alliance of Dermatology Patients Organisations (Global Skin) IADPO 20 2924
# ... with 43 more rows
Unfortunately, the answer provided by Álvaro does not work as expected, since the output repeats the same number with different organisation names, making it really difficult to read. Actually, the number 20 is repeated 20 times, the number 11, 11 times, and so on. The information is there, but it is not accessible without further data treatment.
I was doing my own research in the meantime and I got to the following code. Finally I made it to work, but the data format was "matrix" "array", really confusing. Fortunately, I wrote the last lines to transpose the data, unlist the array and convert in a matrix, which is able to be converted in a data frame and manipulated as usual.
Maybe my explanation is not very useful, and since I am a newbie, I am sure the code is far from being elegant and optimised. Anyway, please review the code below:
library(purrr)
library(rjson)
library(dplyr)
library(tidyverse)
setwd("~/documentos/varios/proyectos/programacion/R/psa_twitter")
# Load data from files.
archivos <- list.files("./raw_data/json_files",
pattern = ".json",
full.names = TRUE)
psa_handles <- read_csv(file = "./raw_data/psa_handles.csv") %>%
select(Name, AKA, Twitter)
nr_archivos <- length(archivos)
calcula_cuentas <- function(a){
# Extract lists
json_data <- fromJSON(file = a)
org_aka <- json_data$id
org_meta <- json_data$metadata
org_name <- org_meta$company
twitter <- json_data$twitter
following <- twitter$following
# create an empty vector to populate
longitud = length(following)
names <- vector(length = longitud)
# loop to populate the empty vector with third element of the sub-list
for(i in 1:longitud){
names[i] <- following[[i]][3]
}
# create a data frame and change column name
names_list <- data.frame(sapply(names, c))
colnames(names_list) <- "usernames"
# Create a data frame with the correct formatting ready to comparison
org_handles <- data.frame(paste("#",
names_list$usernames,
sep="")
)
colnames(org_handles) <- "Twitter"
# merge tables
org_list <- inner_join(psa_handles, org_handles)
cuentas_db_org <- length(org_list$Twitter)
cuentas_total_org <- length(twitter$following)
results <- data.frame(Name = org_name,
AKA = org_aka,
Cuentas_db = cuentas_db_org,
Total = cuentas_total_org)
results
}
# apply function to list of files and unlist the result
psa <- sapply(archivos, calcula_cuentas)
psa1 <- t(as.data.frame(psa))
psa2 <- matrix(unlist(psa1), ncol = 4) %>%
as.data.frame()
colnames(psa2) <- c("Name", "AKA", "tw_int_outbound", "tw_ext_outbound")
# Save the results.
saveRDS(psa2, file = "rda/psa.RDS")
Hi i'm new to R and try to fetch the tickers/symbols of Yahoo Finance from a text file which contains company names like Adidas, BMW etc. in order to run an event study later. This file contains about 800 names. Some of them can be found in yahoo and some not. (Thats ok)
My loop work so far but missing results won't be displayed. Further it only creates a table with numbers and results which could be found.But i would like to create a list which displayed the variable i ("firmen") and the results that's has been found or an NA in case there was no result.
Hope you guys can help me. Thank you !!!
my code:
library(rvest)
# company_names
firmen <- c(read.table("Mappe1.txt"))
# init
df <- NULL
# loop for search names in Yahoo Ticker Lookup
for(i in firmen){
# find url
url <- paste0("https://finance.yahoo.com/lookup/all?s=", i, "/")
page <- read_html(url,as="text")
# grab table
table <- page %>%
html_nodes(xpath = "//*[#id='lookup-page']/section/div/div/div/div[1]/table/tbody/tr[1]/td[1]") %>%
html_text() %>%
as.data.frame()
# bind to dataframe
df <- rbind(df, table)
}
I solved the first problem and now empty nodes (if "i" has not been found on the yahoo page) will be displayed as "NA"
here is the code:
library(rvest)
# teams
firmen <- c(read.table("Mappe1.txt"))
# init
df <- NULL
table <- NULL
# loop
for(i in firmen){
# find url
url <- paste0("https://finance.yahoo.com/lookup/all?s=", i, "/")
page <- read_html(url,as="text")
# grab ticker from yahoo finance
table <- page %>%
html_nodes(xpath = "//*[#id='lookup-page']/section/div/div/div/div[1]/table/tbody/tr[1]/td[1]") %>%
html_text(trim=TRUE) %>% replace(!nzchar(table), NA) %>%
as.data.frame()
# bind to dataframe
df <- rbind(df,table)
}
Now there is just one question left
How can i merge "df" and "firmen" into one table which has the columns:
"tickers" = df and "firmen" = firmen
because df has just one column named "." with the results and the list firmen contains a number of companies placed in many colums but with just one row.
basically i need to transform the list "firmen" but i don't know how
Thank you for the help
I am using rvest to (try to) scrape all the author affiliation data from a database of academic publications called RePEc. I have the authors' short IDs (author_reg), which I'm using to scrape affiliation data. However, I have several columns indicating multiple authors (each of which I need the affiliation data for). When there aren't multiple authors, the cell has an NA value. Some of the columns are mostly NA values so how do I alter my code so it skips the NA values but doesn't delete them?
Here is the code I'm using:
library(rvest)
library(purrr)
df$author_reg <- c("paa6","paa2","paa1", "paa8", "pve266", "pya500", "NA", "NA")
http1 <- "https://ideas.repec.org/e/"
http2 <- "https://ideas.repec.org/f/"
df$affiliation_author_1 <- sapply(df$author_reg_1, function(x) {
links = c(paste0(http1, x, ".html"),paste0(http2, x, ".html"))
# here we try both links and store under attempts
attempts = links %>% map(function(i){
try(read_html(i) %>% html_nodes("#affiliation h3") %>% html_text())
})
# the good ones will have "character" class, the failed ones, try-error
gdlink = which(sapply(attempts,class) != "try-error")
if(length(gdlink)>0){
return(attempts[[gdlink[1]]])
}
else{
return("True 404 error")
}
})
Thanks in advance for your help!
As far as I see the target links, you can try the following way. First, you want to scrape all links from https://ideas.repec.org/e/ and create all links. Then, check if each link exists or not. (There are about 26000 links with this URL, and I do not have time to check all. So I just used 100 URLs in the following demonstration.) Extract all existing links.
library(rvest)
library(httr)
library(tidyverse)
# Get all possible links from this webpage. There are 26665 links.
read_html("https://ideas.repec.org/e/") %>%
html_nodes("td") %>%
html_nodes("a") %>%
html_attr("href") %>%
.[grepl(x = ., pattern = "html")] -> x
# Create complete URLs.
mylinks1 <- paste("https://ideas.repec.org/e/", x, sep = "")
# For this demonstration I created a subset.
mylinks_samples <- mylinks1[1:100]
# Check if each URL exists or not. If FALSE, a link exists.
foo <- sapply(mylinks_sample, http_error)
# Using the logical vector, foo, extract existing links.
urls <- mylinks_samples[!foo]
Then, for each link, I tried to extract affiliation information. There are several spots with h3. So I tried to specifically target h3 that stays in xpath containing id = "affiliation". If there is no affiliation information, R returns character(0). When enframe() is applied, these elements are removed. For instance, pab127 does not have any affiliation information, so there is no entry for this link.
lapply(urls, function(x){
read_html(x, encoding = "UTF-8") %>%
html_nodes(xpath = '//*[#id="affiliation"]') %>%
html_nodes("h3") %>%
html_text() %>%
trimws() -> foo
return(foo)}) -> mylist
Then, I assigned names to mylist with the links and created a data frame.
names(mylist) <- sub(x = basename(urls), pattern = ".html", replacement = "")
enframe(mylist) %>%
unnest(value)
name value
<chr> <chr>
1 paa1 "(80%) Institutt for ØkonomiUniversitetet i Bergen"
2 paa1 "(20%) Gruppe for trygdeøkonomiInstitutt for ØkonomiUniversitetet i Bergen"
3 paa2 "Department of EconomicsCollege of BusinessUniversity of Wyoming"
4 paa6 "Statistisk SentralbyråGovernment of Norway"
5 paa8 "Centraal Planbureau (CPB)Government of the Netherlands"
6 paa9 "(79%) Economic StudiesBrookings Institution"
7 paa9 "(21%) Brookings Institution"
8 paa10 "Helseøkonomisk Forskningsprogram (HERO) (Health Economics Research Programme)\nUniversitetet i Oslo (Unive~
9 paa10 "Institutt for Helseledelse og Helseökonomi (Institute of Health Management and Health Economics)\nUniversi~
10 paa11 "\"Carlo F. Dondena\" Centre for Research on Social Dynamics (DONDENA)\nUniversità Commerciale Luigi Boccon~
I am trying to scrape data from a database that doesn't allow downloading directly. I have been able to scrape data from a single species but I am trying to do it for 159 species. This is why I wanted to create a loop that could be helpful
test <- data.frame(site = c("url=1",
"url=2"),
html.node = "td.DataText", stringsAsFactors = F)
library(rvest)
# an empty list, to fill with the scraped data
empty_list <- list()
for (i in 1:nrow(test)){
datatext <- pubs[i, 1]
datatext2 <- pubs[i, 2]
# scrape it!
empty_list[[i]] <- read_html(datatext) %>% html_nodes(datatext2) %>% html_text()
}
names(empty_list) <- test$site
empty <- as.data.frame(empty_list)
This is what I've tried so far. This is only for 2 species as indicated by FID=1 and FID=2 in the URL. There are 159 species. This is why I wanted a for loop that goes from 1:159 and populates the dataframe as it does with this current code.
I was able to figure it out!
url="url=1"
webpage <- read_html(url)
Data.Label <- webpage %>%
html_nodes("td.DataLabel") %>%
html_text()
Label <- as.data.frame(t(Data.Label))
#Obtains the data labels in a dataframe that is tranposed.
Data.Text <- lapply(paste0('url=', 1:159),
function(url){
url %>% read_html() %>%
html_nodes("td.DataText") %>%
html_text()
})
#Creates a list of all the data text needed to populate the table
Eco.Table <- as.data.frame(Data.Text)
#Convert list into dataframe.
Eco.Table <- Eco.Table[-c(39:42), ]
#Remove irrelevant rows
Eco.Table <- as.data.frame(t(Eco.Table))
#Transpose the dataframe into rows
rownames(Eco.Table) <- NULL
colnames(Eco.Table) <- as.character(unlist(Label))
#Reset row names and add column labels
Hi I am trying scrape the data from ebay in R, I used the code mentioned below but I encountered with a problem wherein there were missing values for a particular selector elements, to get round it I used a for loop as shown(inspecting each listing and giving the number for which there was data missing) since the data scraped was less it was possible to inspect but how to do it when there's large amounts of data to be scraped.
Thanks in advance
library(rvest)
url<-"https://www.ebay.in/sch/i.html_from=R40&_sacat=0&LH_ItemCondition=4&_ipg=100&_nkw=samsung+j7"
web<- read_html(url)
subdescp<- html_nodes(web, ".lvsubtitle+ .lvsubtitle")
subdescp1<-html_text(subdescp)
head(subdescp1)
library(stringr)
subdescp1<- str_replace_all(subdescp1, "[\t\n\r]" , "")
head(subdescp1)
for (i in c(5,6,10,19,33,34,35)){
a<-subdescp1[1:(i-1)]
b<-subdescp1[i:length(subdescp1)]
subdescp1<-append(a,list("NA"))
subdescp1<-append(subdescp1,b)
}
Z<-as.character(subdescp1)
Z
webpage <- read_html(url)
Descp_data_html <- html_nodes(webpage,'.vip')
Descp_data <- html_text(Descp_data_html)
head(Descp_data)
price_data_html <- html_nodes(web,'.prc .bold')
price_data <- html_text(price_data_html)
head(price_data)
library(stringr)
price_data<-str_replace_all(price_data, "[\t\n]" , "")
price_data<-gsub("Rs. ","",price_data)
price_data<-gsub(",","",price_data)
price_data<- as.numeric(price_data)
price_data
Desc_data_html <- html_nodes(webpage,'.lvtitle+ .lvsubtitle')
Desc_data <- html_text(Desc_data_html, trim = TRUE)
head(Desc_data)
j7_f2<-data.frame(Title = Descp_data, Description= Desc_data, Sub_Description= Z, Pirce = price_data)
For instance you can use something like this.
data <- read_html("url.xml")
var <- data %>% html_nodes("//node") %>% xml_text()
# observations that don´t have certain nodes - fill them with NA
var_pair <- data %>% html_nodes("node_var_pair")
var_missing_clean = sapply(var_pair, function(x) {
tryCatch(xml_text(html_nodes(x, "./var_missing")),
error=function(err) NA)
})
df = data.frame(var, var_pair, var_missing)
Here there are three types of nodes that you may consider. var gathers the nodes that do not have missing data. var_pair includes the nodes that you want to pair with the nodes that contain missing observation and var_missing refers to the nodes with missing information. You can create variables and aggregate them in a data data frame (df)
The process here is simple and in two steps -- First extract all nodes at the block level (not each element and don't convert to text). This is a list of length equal to the number of blocks. Second from this extracted list extract each element as text and clean it. Since this is being done from a list, NA's where applicable are automatically coerced in the right places. See an example from the same ebay India site:
library(rvest)
library(stringr)
# specify the url
url <-"https://www.ebay.in/sch/Mobile-Phones"
# read the page
web <- read_html(url)
# define the supernode that has the entire block of information
super_node <- '.li'
# read as vector of all blocks of supernode (imp: use html_nodes function)
super_node_read <- html_nodes(web, super_node)
# define each node element that you want
node_model_details <- '.lvtitle'
node_description_1 <- '.lvtitle+ .lvsubtitle'
node_description_2 <- '.lvsubtitle+ .lvsubtitle'
node_model_price <- '.prc .bold'
node_shipping_info <- '.bfsp'
# extract the output for each as cleaned text (imp: use html_node function)
model_details <- html_node(super_node_read, node_model_details) %>%
html_text() %>%
str_replace_all("[\t\n\r]" , "")
description_1 <- html_node(super_node_read, node_description_1) %>%
html_text() %>%
str_replace_all("[\t\n\r]" , "")
description_2 <- html_node(super_node_read, node_description_2) %>%
html_text() %>%
str_replace_all("[\t\n\r]" , "")
model_price <- html_node(super_node_read, node_model_price) %>%
html_text() %>%
str_replace_all("[\t\n\r]" , "")
shipping_info <- html_node(super_node_read, node_shipping_info) %>%
html_text() %>%
str_replace_all("[\t\n\r]" , "")
# create the data.frame
mobile_phone_data <- data.frame(
model_details,
description_1,
description_2,
model_price,
shipping_info
)