JSON applied over a dataframe in R

I used the code below on one website and it returned exactly the result I wanted. The keyword Emaar is pasted at the end of the query:
library(httr)
library(jsonlite)
query<-"https://www.googleapis.com/customsearch/v1?key=AIzaSyA0KdZHRkAjmoxKL14eEXp2vnI4Yg_po38&cx=006431301429107149113:as7yqcm2qc8&q=Emaar"
result11 <- content(GET(query))
print(result11)
result11_JSON <- toJSON(result11)
result11_JSON <- fromJSON(result11_JSON)
result11_df <- as.data.frame(result11_JSON)
Now I want to apply the same steps over a data.frame containing keywords, so I made the following test .csv file:
Company Name
[1] ADES International Holding Ltd
[2] Emirates REIT (CEIC) Limited
[3] POLARCUS LIMITED
I called it Testing Website Extraction.csv. Code used:
test_companies <- read.csv("... \\Testing Website Extraction.csv")
# replace spaces with "+" and paste the query in front of each keyword (the query already has my unique Google key and search engine ID)
test_companies$plus <- gsub(" ", "+", test_companies$Company.Name)
query <- "https://www.googleapis.com/customsearch/v1?key=AIzaSyCmD6FRaonSmZWrjwX6JJgYMfDSwlR1z0Y&cx=006431301429107149113:as7yqcm2qc8&q="
test_companies$plus <- paste0(query, test_companies$plus)
a <- test_companies$plus
length(a)
function_webs_search <- function(web_search) {content(GET(web_search))}
result <- lapply(as.character(a), function_webs_search)
The result is a list of length 3 (one element per search term); each element contains the sublists url (list of 2), queries (list of 2), ..., items (list of 10), and the structure is identical for every search term. My issue is applying the remainder of the code.
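For example, the search hits for the first keyword sit in result[[1]]$items (assuming the element is literally named items, as the structure above suggests):
str(result[[1]]$items, max.level = 1)  # a list of 10 search results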
When I run:
result_JSON <- toJSON(result)
result_JSON <- as.list(fromJSON(result_JSON))
I get a list of 6 lists, each with its own sublists, and putting that into a tidy data frame where the results are stacked under each other (not kept separate) is proving difficult.
Note also that I tried working through the three lists inside "result" one by one, but that is a lot of manual labour once the list of keywords gets longer.
The expected end result should have 30 observations of 37 variables (10 observations of 37 variables per search term, all stacked underneath each other).
Things I have tried (they do flatten the list, but don't get me to a tidy result):
#do.call(c , result)
#all.equal(listofvectors, res, check.attributes = FALSE)
#unlist(result, recursive = FALSE)
# for (i in 1:length(result)) {listofvectors <- c(listofvectors, result[[i]])}
#rbind()
#rbind.fill()
Even after flattening, I don't know how to organize the results into a tidy final output that a non-R user can interact with.
Any help here would be greatly appreciated. I am around in case anything about my question is unclear.
I am always happy to learn more about R, so please bear with me as I am just starting to catch up. All the best and thanks in advance!

Basically, what I did was extract only the columns I need from the list of data frames; below is the final code:
library(httr)
library(jsonlite)
library(tidyr)
library(stringr)
library(purrr)
library(plyr)
test_companies <- read.csv("c:\\users\\... Companies Without Websites List.csv")
test_companies$plus <- gsub(" ", "+", test_companies$Company.Name)
query <- "https://www.googleapis.com/customsearch/v1?key=AIzaSyCmD6FRaonSmZWrjwX6JJgYMfDSwlR1z0Y&cx=006431301429107149113:as7yqcm2qc8&q="
test_companies$plus <- paste0(query, test_companies$plus)
a <- test_companies$plus
length(a)
function_webs_search <- function(web_search) {content(GET(web_search))}
result <- lapply(as.character(a), function_webs_search)
function_toJSONall <- function(all) {toJSON(all)}
a <- lapply(result, function_toJSONall)
function_fromJSONall <- function(all) {fromJSON(all)}
b <- lapply(a, function_fromJSONall)
function_dataframe <- function(all) {as.data.frame(all)}
c <- lapply(b, function_dataframe)
function_column <- function(all) {all[ , 15:30]}  # keep only the columns of interest
result_final <- lapply(c, function_column)
results_df <- rbind.fill(result_final)  # stack the per-keyword data frames under each other
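The same result can be reached with fewer intermediate objects by wrapping the whole per-keyword pipeline in one function and stacking at the end. This is only a sketch mirroring the steps above (the column subsetting is left out, and search_to_df is my own name):
library(httr)
library(jsonlite)
library(plyr)
search_to_df <- function(url) {
  parsed <- fromJSON(toJSON(content(GET(url))))  # same JSON round-trip as above
  as.data.frame(parsed)
}
results_df <- rbind.fill(lapply(as.character(test_companies$plus), search_to_df))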

Related

Fetching data from OECD into R via SDMX(XML)

I want to extract data from the OECD website, particularly the dataset "REGION_ECONOM" with the dimensions "GDP" (GDP of the respective regions) and "POP_AVG" (the average population of the respective region).
This is the first time I am doing this:
I picked all the required dimensions on the OECD website and copied the SDMX (XML) link.
I tried to load them into R and convert them to a data frame with the following code:
(in the link I replaced the list of all regions with "ALL" as otherwise the link would have been six pages long)
if (!require(rsdmx)) install.packages('rsdmx')
library(rsdmx)
url2 <- "https://stats.oecd.org/restsdmx/sdmx.ashx/GetData/REGION_ECONOM/1+2.ALL.SNA_2008.GDP+POP_AVG.REAL_PPP.ALL.1990+1991+1992+1993+1994+1995+1996+1997+1998+1999+2000+2001+2002+2003+2004+2005+2006+2007+2008+2009+2010+2011+2012+2013+2014+2015+2016+2017+2018/all?"
sdmx2 <- readSDMX(url2)
stats2 <- as.data.frame(sdmx2)
head(stats2)
Unfortunately, this returns a "400 Bad request" error.
When just selecting a couple of regions the error does not appear:
if (!require(rsdmx)) install.packages('rsdmx')
library(rsdmx)
url1 <- "https://stats.oecd.org/restsdmx/sdmx.ashx/GetData/REGION_ECONOM/1+2.AUS+AU1+AU101+AU103+AU104+AU105.SNA_2008.GDP+POP_AVG.REAL_PPP.ALL.1990+1991+1992+1993+1994+1995+1996+1997+1998+1999+2000+2001+2002+2003+2004+2005+2006+2007+2008+2009+2010+2011+2012+2013+2014+2015+2016+2017+2018/all?"
sdmx1 <- readSDMX(url1)
stats1 <- as.data.frame(sdmx1)
head(stats1)
I also tried to use the "OECD" package to get the data. There I had the same problem. ("400 Bad Request")
if (!require(OECD)) install.packages('OECD')
library(OECD)
df1 <- get_dataset("REGION_ECONOM", filter = "GDP+POP_AVG",
                   start_time = 2008, end_time = 2009, pre_formatted = TRUE)
However, when I use the package for other data sets it does work:
df <- get_dataset("FTPTC_D", filter = "FRA+USA", pre_formatted = TRUE)
Does anyone know where my mistake could lie?
The SDMX-ML API does not seem to work as explained (using the ALL parameter), whereas the JSON API works just fine. The following query returns the values for all countries as JSON; I simply replaced ALL by an empty field.
query <- "https://stats.oecd.org/sdmx-json/data/REGION_ECONOM/1+2..SNA_2008.GDP+POP_AVG.REAL_PPP.ALL.1990+1991+1992+1993+1994+1995+1996+1997+1998+1999+2000+2001+2002+2003+2004+2005+2006+2007+2008+2009+2010+2011+2012+2013+2014+2015+2016+2017+2018/all?"
Transforming it to a readable format is not so trivial. I played around a bit to find the following work-around:
# send a GET request using httr and parse the JSON body
library(httr)
library(jsonlite)  # for parse_json()
query <- "https://stats.oecd.org/sdmx-json/data/REGION_ECONOM/1+2..SNA_2008.GDP+POP_AVG.REAL_PPP.ALL.1990+1991+1992+1993+1994+1995+1996+1997+1998+1999+2000+2001+2002+2003+2004+2005+2006+2007+2008+2009+2010+2011+2012+2013+2014+2015+2016+2017+2018/all?"
dat_raw <- GET(query)
dat_parsed <- parse_json(content(dat_raw, "text")) # parse the content
Next, access the observations from the nested list and transform them to a matrix. Also extract the features from the keys:
dat_obs <- dat_parsed[["dataSets"]][[1]][["observations"]]
dat0 <- do.call(rbind, dat_obs) # get a matrix
new_features <- matrix(as.numeric(do.call(rbind, strsplit(rownames(dat0), ":"))), nrow = nrow(dat0))
dat1 <- cbind(new_features, dat0) # add feature columns
dat1_df <- as.data.frame(dat1) # optionally transform to data frame
Finally, you want to find out what the keys mean. Those are hidden in the "structure" element, which also needs to be parsed correctly, so I wrote a function to make it easier to extract the values and ids:
## Get keys of features
keys <- dat_parsed[["structure"]][["dimensions"]][["observation"]]
for (i in 1:length(keys)) print(paste("id position:", i, "is feature", keys[[i]]$id))
# apply keys
get_features <- function(data_input, keys_input, feature_index, value = FALSE) {
  keys_temp <- keys_input[[feature_index]]$values
  keys_temp_matrix <- do.call(rbind, keys_temp)
  keys_temp_out <- keys_temp_matrix[, value + 1][unlist(data_input[, feature_index]) + 1] # column 1 is id, 2 is value
  return(unlist(keys_temp_out))
}
head(get_features(dat1_df, keys, 7))
head(get_features(dat1_df, keys, 2, value = FALSE))
head(get_features(dat1_df, keys, 2, value = TRUE))
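As a usage sketch, the decoded labels can be attached back onto the data frame; the dimension indices below (2 and 7, the ones used in the examples above) depend on what the loop over keys prints for your particular query:
dat1_df$dim2_label <- get_features(dat1_df, keys, 2, value = TRUE)
dat1_df$dim7_label <- get_features(dat1_df, keys, 7, value = TRUE)
head(dat1_df)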
I hope that helps you in your project.
Best, Tobias

Create data frame from lists that are uneven

My code below scrapes data from multiple IMDB pages; however, when I try to combine the data into one data frame I get an error about differing numbers of rows for gross and metascore. How would I go about inserting NA values in the empty places so that the vectors are equal in length? (Note: I had to remove some links because I need a certain reputation to post more links.)
urls <- c("https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=51&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=101&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=151&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=201&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=251&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=301&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=351&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=401&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=451&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=501&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=551&ref_=adv_nxt",
"https://www.imdb.com/search/title?
)
library(rvest)  # read_html(), html_nodes(), html_text()
results_list <- list()
for (.page in seq_along(urls)) {
  webpage <- read_html(urls[[.page]])
  titlehtml <- html_nodes(webpage, '.lister-item-header a')
  title <- html_text(titlehtml)
  runtimehtml <- html_nodes(webpage, '.text-muted .runtime')
  runtime <- html_text(runtimehtml)
  runtime <- gsub(" min", "", runtime)
  ratinghtml <- html_nodes(webpage, '.ratings-imdb-rating strong')
  rating <- html_text(ratinghtml)
  voteshtml <- html_nodes(webpage, '.sort-num_votes-visible span:nth-child(2)')
  votes <- html_text(voteshtml)
  votes <- gsub(",", "", votes) # removing commas
  metascorehtml <- html_nodes(webpage, '.metascore')
  metascore <- html_text(metascorehtml)
  metascore <- gsub(" ", "", metascore) # removing extra space in metascore
  grosshtml <- html_nodes(webpage, '.ghost~ .text-muted+ span')
  gross <- html_text(grosshtml)
  gross <- gsub("M", "", gross) # removing the 'M' sign
  gross <- substring(gross, 2, 6) # dropping the '$' sign
  results_list[[.page]] <- data.frame(Title = title,
                                      Runtime = as.numeric(runtime),
                                      Rating = as.numeric(rating),
                                      Metascore = as.numeric(metascore),
                                      Votes = as.numeric(votes),
                                      Gross_Earning_in_Mil = as.numeric(unlist(gross)))
}
final_results <- plyr::ldply(results_list)
Error in data.frame(Title = title, Runtime = as.numeric(runtime), Rating = as.numeric(rating), :
arguments imply differing number of rows: 50, 49, 48
You need to know where your data is missing, so you need to know which items belong together. Right now you just have separate vectors of values, so you don't know which belong together.
Looking at the page, the entries are neatly organized into "lister-item-content" nodes, so the clean thing to do is to first extract those nodes, and only then pull more info out of each item separately. Something like this works for me:
items <- html_nodes(webpage,'.lister-item-content')
gross <- sapply(items, function(i) {html_text(html_node(i, '.ghost~ .text-muted+ span'))})
This inserts NA at every place where an item does not contain the node you're looking for.
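A sketch of how the rest of the fields could follow the same pattern (node_text is my own helper name; the selectors are the ones from the question):
# text of the first matching node inside one item, or NA if it is absent
node_text <- function(item, css) html_text(html_node(item, css))
items     <- html_nodes(webpage, '.lister-item-content')
title     <- sapply(items, node_text, css = '.lister-item-header a')
runtime   <- sapply(items, node_text, css = '.text-muted .runtime')
metascore <- sapply(items, node_text, css = '.metascore')
gross     <- sapply(items, node_text, css = '.ghost~ .text-muted+ span')
length(title) == length(gross)  # TRUE: one entry per item, with NA where data is missing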

change a for loop to a function to scrape a website

I am trying to scrape a website using the following:
industryurl <- "https://finance.yahoo.com/industries"
library(rvest)
read <- read_html(industryurl) %>%
  html_table()
library(plyr)
industries <- ldply(read, data.frame)
industries <- industries[-1, ]
read <- read_html(industryurl)
industryurls <- html_attr(html_nodes(read, "a"), "href")
links <- industryurls[grep("/industry/", industryurls)]
industryurl <- "https://finance.yahoo.com"
links <- paste0(industryurl, links)
links
##############################################################################################
store <- NULL
tbl <- NULL
for (i in links) {
  store[[i]] <- read_html(i)
  tbl[[i]] <- html_table(store[[i]])
}
#################################################################################################
I am mostly interested in the code between the ########## markers. I want to apply a function instead of a for loop, since I am running into timeout issues with Yahoo and I want the extraction to behave more like a human (it is not too much data).
My question is: how can I take links, apply a function, and set some sort of delay timer while reading in the contents that the for loop currently collects?
I can paste my own version of the for loop, which does not work.
This is the function I came up with
## First argument is the link you need
## The second argument is the total time for Sys.sleep
extract_function <- function(define_link, define_time) {
  print(paste0("The system will stop for: ", define_time, " seconds"))
  Sys.sleep(define_time)
  first <- read_html(define_link)
  print(paste0("It will now return the table for link", define_link))
  return(html_table(first))
}
## I added the following tryCatch function
link_try_catch <- function(define_link, define_time) {
  out <- tryCatch(extract_function(define_link, define_time),
                  error = function(e) NA)
  return(out)
}
##You can now retrieve the data using the links vector in two ways
##Picking the first ten, so it should not crash on link 5
p <- lapply(1:10, function(i)link_try_catch(links[i],1))
##OR (I subset the vector just for demo purposes)
p2 <- lapply(links[1:10], function(i)extract_function(i,1))
Hope it helps
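If the goal is to look less mechanical, a small variation on the same idea is to randomize the pause rather than pass a fixed define_time. A sketch (the delay bounds are arbitrary and extract_function_random is my own name):
extract_function_random <- function(define_link, min_wait = 2, max_wait = 6) {
  wait <- runif(1, min_wait, max_wait)   # random pause between min_wait and max_wait seconds
  print(paste0("Sleeping for ", round(wait, 1), " seconds"))
  Sys.sleep(wait)
  return(html_table(read_html(define_link)))
}
p3 <- lapply(links[1:10], function(i) tryCatch(extract_function_random(i), error = function(e) NA))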

Efficient way to split strings in separate rows (creating an edgelist)

I currently have the following problem. I work with Web of Science publication and citation data, which has the following structure: a variable "SR" is a string with the name of a publication, and "CR" is a string containing all references cited in that article, separated by ";".
My task is to create an edge list between all publications and their corresponding citations, where every publication-citation combination sits in its own row. I currently do it with the following code:
# Some minimal data for example
pub <- c("pub1", "pub2", "pub3")
cit <- c("cit1;cit2;cit3;cit4","cit1;cit4;cit5","cit5;cit1")
M <- cbind(pub,cit)
colnames(M) <- c("SR","CR")
# Create an edgelist
cit_el <- data.frame()
for (i in seq(1, nrow(M), 1)) {
  cit <- data.frame(strsplit(as.character(M[i, "CR"]), ";", fixed = TRUE), stringsAsFactors = FALSE)
  colnames(cit)[1] <- c("SR")
  cit$SR_source <- M[i, "SR"]
  cit <- unique(cit)
  cit_el <- rbind(cit_el, cit)
}
However, for large datasets of 10k+ publications (which tend to have 50+ citations each), the script runs for 15+ minutes. I know that loops are usually an inefficient way of coding in R, yet I haven't found an alternative that produces what I want.
Anyone knows some trick to make this faster?
This is my attempt. I haven't compared the speeds of different approaches yet.
First, the artificial data: 10k pubs, 100k possible citations, and at most 80 citations per pub.
library(data.table)
library(stringr)
pubCount = 10000
citCount = 100000
maxCitPerPub = 80
pubList <- paste0("pub", seq(pubCount))
citList <- paste0("cit", seq(citCount))
cit <- sapply(sample(seq(maxCitPerPub), pubCount, replace = TRUE),
              function(x) str_c(sample(citList, x), collapse = ";"))
data <- data.table(pub = pubList,
                   cit = cit)
For processing, I use stringr::str_split_fixed to split the citations into columns and use data.table::melt to collapse the columns.
temp <- data.table(pub = pubList, str_split_fixed(data$cit, ";", maxCitPerPub))
result <- melt(temp, id.vars = "pub")[, variable:= NULL][value!='']
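A one-liner in the same data.table idiom should give the long format directly as well (a sketch, not benchmarked here):
# split each cit string and stack the pieces, one row per pub-citation pair
result2 <- data[, .(cit = unlist(strsplit(cit, ";", fixed = TRUE))), by = pub]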
Not sure if this is any quicker, but if I'm understanding correctly, this should give the desired result:
rbindlist(lapply(1:nrow(M), function(i) {  # rbindlist() from data.table
  data.frame(SR_source = M[i, 'SR'], SR = strsplit(M[i, 'CR'], ';')[[1]])
}))

R html scrape with redirect links, word searches, and counts

I am trying to streamline a tedious online data-collection process with R scraping code. The website I am currently interested in is here: Wisconsin Bills - Author index.
The website features a redirect link for each legislator, and under each legislator there is a list of bills introduced, each with a link to the bill's major-action summary. My end goal is to create a data frame that includes a column for legislator name, the number of assembly bills introduced (only links that include "AB"), the number of bills that passed the assembly, and the number of bills signed into law.
Scraping the website, I have successfully created a data frame with each legislator's first name, last name, district, state (always WI) and year (always 1999; t-1 is when the session ended). Below is my code:
#specify the URL
url <- "https://docs.legis.wisconsin.gov/1997/related/author_index/assembly"
library(RCurl)  # getURL()
library(XML)    # htmlTreeParse(), xpathSApply()
#download the HTML code
html <- getURL(url, ssl.verifypeer = FALSE, followlocation = TRUE)
#parse the HTML code
html.parsed <- htmlTreeParse(html, useInternalNodes = T)
# Get list of legislator names:
names <- xpathSApply(html.parsed, path = "//a[contains(@href, 'authorindex')]", xmlValue)
# get all links into a list:
links <- xpathSApply(html.parsed, "//a/@href")
# see what I have:
head(links) # still have hrefs in there
links <- as.vector(links)
head(links) # good, hrefs are dropped.
# I only need the links that begin with /document/authorindex/1997.
typeof(links) # confirming its character
links # looking to see which ones to keep (only ones with "authorindex" and "A__", where the number that follows A is the district)
links <- links[14:114] # now the links only have the legislator redirects!!!
# Lets begin to build the final data frame needed:
# first, take a look at names- there are 104, but there are only 100 legislators...
names # elements 3-103 are leg names
names <- names[3:103]
# split up by first name, last name, etc.
names <- as.vector(names)
names1 <- strsplit(names, ",")
last.names <- sapply(names1, "[[", 1) # good- create a data frame
id = c(1:101)
df <- data.frame(ID= id)
df$last.name = last.names # now have an ID and their last name.
# now need district, party, and first names.
first_names <- strsplit(names, "p.")
first_names # now republicans have 3 elements, dems have 2, first word of 2nd element is first name
# do another strsplit
first_names <- as.character(first_names)
first_names <- strsplit(first_names, " ()")
first_names # 4th element is almost always their name! do it that way, correct those that messed up by hand
first_names <- sapply(first_names, "[[", 4)
first_names # 10 (Timothy), 90 (William), 80 (Joan H.), 81 (Tom), 47 (John)
# 25 (Jose) 17 (Stephen) 5 (Spencer)
first_names[5] <- "Spencer"
first_names[10] <- "Timothy"
first_names[90] <- "William"
first_names[80] <- "Joan H."
first_names[81] <- "Tom"
first_names[47] <- "John"
first_names[25] <- "Jose"
first_names[17] <- "Stephen"
df$first.name <- first_names # first names- done.
# district:
district <- regmatches(names, gregexpr("[[:digit:]]+", names))
df$district <- district
df$state <- "WI"
df$year <- 1999
Now, I'm stumped. I need to follow each redirect link and count the number of AB links under that legislator's name only, then follow those AB links and count, for each legislator, how many of the AB pages contain the word "passed" and how many contain "Sen.". I would thus like to add the following columns to the existing df:
Bills Introduced    Bills Passed Assembly    Bills Signed into Law
4                   3                        2
39                  18                       14
Etc. I get the sense I need to use loops, but I don't know how to approach it.
Any help would be incredibly appreciated.
Thank you!
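A rough sketch of the counting step could look like the following. It assumes (this is a guess at the page structure, not verified against the site) that each legislator page in links lists its bills as anchor elements whose visible text starts with "AB", and that each bill page can be fetched and searched as plain text for "passed" and "Sen."; count_bills is my own name:
count_bills <- function(leg_url) {
  page <- htmlTreeParse(getURL(leg_url, ssl.verifypeer = FALSE, followlocation = TRUE),
                        useInternalNodes = TRUE)
  # assumption: bill links are anchors whose text starts with "AB"
  ab_links <- xpathSApply(page, "//a[starts-with(normalize-space(.), 'AB')]/@href")
  # if the hrefs turn out to be relative, prefix the site root before fetching them
  ab_text <- sapply(ab_links, function(u) getURL(u, ssl.verifypeer = FALSE, followlocation = TRUE))
  c(introduced = length(ab_links),
    passed = sum(grepl("passed", ab_text, ignore.case = TRUE)),  # AB pages mentioning "passed"
    signed = sum(grepl("Sen\\.", ab_text)))                      # AB pages mentioning "Sen."
}
counts <- t(sapply(links, count_bills))  # one row of counts per legislator link
df <- cbind(df, counts)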
