How do I finish this R function for movie scraping?

I'm writing an R function. I would like it to take a list of movies, download info about each one, and then put everything into a data frame.
So far, I have:
rottenrate <- function(movie){
link <- paste("http://www.omdbapi.com/?t=", movie, "&y=&plot=short&r=json&tomatoes=true", sep = "")
jsonData <- fromJSON(link)
return(jsonData)
}
This will return info for one movie and won't convert to a data.frame.
Thanks for any help.

You could do it like this:
# First, vectorize function
rottenrate <- function(movie){
require(RJSONIO)
link <- paste("http://www.omdbapi.com/?t=", movie, "&y=&plot=short&r=json&tomatoes=true", sep = "")
jsonData <- fromJSON(link)
return(jsonData)
}
vrottenrate <- Vectorize(rottenrate, "movie", SIMPLIFY = FALSE)
# Now, query and combine
movies <- c("inception", "toy story")
df <- do.call(rbind, lapply(vrottenrate(movies), function(x) as.data.frame(t(x), stringsAsFactors = FALSE)))
dplyr::glimpse(df)
# Observations: 2
# Variables:
# $ Title (chr) "Inception", "Toy Story"
# $ Year (chr) "2010", "1995"
# $ Rated (chr) "PG-13", "G"
# $ Released (chr) "16 Jul 2010", "22 Nov 1995"
# $ Runtime (chr) "148 min", "81 min"
# $ Genre (chr) "Action, Mystery, Sci-Fi", "Animation,
# ...
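If you prefer to skip Vectorize(), a more compact variant is a one-liner with purrr (a sketch, assuming each OMDb result flattens cleanly to a single row, as above):
df <- purrr::map_dfr(movies, ~as.data.frame(t(rottenrate(.x)), stringsAsFactors = FALSE))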
Interesting database btw... :-)

Related

Importing multiple invoices (.PDF) in R. Turning them from strings to a tibble

I'm doing a project where I need to load a large number of PDFs into R. That part is mostly covered. The problem is that when importing the PDFs into R, every line is a string. Not all the information in the string is relevant, and in some cases information is missing. So I want to select the info I need and place it into a tibble for further analysis.
Importing the PDFs is done with pdftools. It's working, though hints or tips are welcome:
invoice_pdfs = list.files(pattern="*.pdf") # gather all the .pdf in current wd.
invoice_list <- map(invoice_pdfs, .f = function(invoices){ # Using the purrr::map function .
pdf_text(invoices) %>% # extracting text from listed pdf file(s)
readr::read_lines() %>% # split the extracted text into lines
str_squish() %>% # collapse repeated whitespace
str_to_lower # convert string to lower case
})
reproducible example:
invoice_example <- c("invoice",
"to: rade ris",
"cane nompany",
"kakber street 23d",
"nork wey",
"+223 (0)56 015 6542",
"invoice id: 85600023",
"date reference product product reference weigth amount",
"01-02-2016 840000023 product a 24.45.6 de6583621 14.900 kg a 50 per tonne 745,00",
"07-02-2016 840000048 product b 24.45.7 qf8463641 19.000 kg a 50 per tonne 950,00",
"03-02-2016 840000032 product b 24.34.2 qf8463641 4.000 kg per tonne 250,00",
"02-02-2016 840000027 ke7801465 1.780 kg per tonne 89,00",
"subtotal 2.034,00",
"sales tax 183,06",
"total 2.217,06")
So here is where the problem starts.
What I've tried is using stringr and rebus to select specific parts of the text. I've made the following function to search the document for a specific string; it returns the row number:
word_finder <- function(x, findWord){
word_hit <- x %>% # temp for storing TRUE or FALSE
str_detect(pattern = fixed(findWord))
which(word_hit == TRUE) # give rownumber if TRUE
}
And the following search patterns:
detect_date <- dgt(2) %R% "-" %R% dgt(2) %R% "-" %R% dgt(2)
detect_money <- optional(DIGIT) %R% optional(".") %R% one_or_more(DIGIT) %R% "," %R% dgt(2)
detect_invoice_num <- str_trim(SPC %R% dgt(8) %R% optional(SPC))
The next step should be to make a tibble (or data frame) with the column names c("date", "reference", "product", "product reference", "weight", "amount"). I've also tried making a tibble of the whole invoice_example; the problem is the missing info in some fields, so the column names don't match the corresponding values.
So I would like to make some function that uses the search patterns and places each specific value into a predefined column. I've got no clue how to get this done. Or maybe I should handle this completely differently?
The final result should be something like this.
Reproducible example:
invoice_nr <- c("85600023", "85600023", "85600023", "85600023" )
date <- c( "01-02-2016", "07-02-2016", "03-02-2016", "02-02-2016")
reference <- c( "840000023", "840000048", "840000032", "840000027")
product_id <- c( "de6583621", "qf8463641", "qf8463641", "ke7801465")
weight <- c("14.900", "19.000", "4.000", "1.780")
amount <- c("745.00", "950.00", "250.00", "89.00")
example_tibble <- tibble(invoice_nr, date, reference, product_id, weight, amount)
Result:
# A tibble: 4 x 6
invoice_nr date reference product_id weight amount
<chr> <chr> <chr> <chr> <chr> <chr>
1 85600023 01-02-2016 840000023 de6583621 14.900 745.00
2 85600023 07-02-2016 840000048 qf8463641 19.000 950.00
3 85600023 03-02-2016 840000032 qf8463641 4.000 250.00
4 85600023 02-02-2016 840000027 ke7801465 1.780 89.00
Any suggested ways of dealing with this will be appreciated!
Actually you can use the functions of library(stringr) to achieve your goal (I skipped the rebus part as it seems to be anyway 'just' a helper for creating the regex, which I did by hand):
library(tidyverse)
parse_invoice <- function(in_text) {
## define regex, some assumptions:
## product id is 2 lower characters followed by 7 digits
## weight is some digits with a dot followed by kg
## amount is some digits at the end with a comma
all_regex <- list(date = "\\d{2}-\\d{2}-\\d{4}",
reference = "\\d{9}",
product_id = "[a-z]{2}\\d{7}",
weight = "\\d+\\.\\d+ kg",
amount = "\\d+,\\d+$")
## look only at lines where there is invoice data
rel_lines <- str_subset(in_text, all_regex$date)
## extract the pieces from the regex
ret <- as_tibble(map(all_regex, str_extract, string = rel_lines))
## clean up the data
ret %>%
mutate(invoice_nr = str_extract(str_subset(in_text, "invoice id:"), "\\d{8}"),
date = as.Date(date, "%d-%m-%Y"),
weight = as.numeric(str_replace(weight, "(\\d+.\\d+) kg", "\\1")),
amount = as.numeric(str_replace(amount, ",", "."))
) %>%
select(invoice_nr,
date,
reference,
product_id,
weight,
amount)
}
str(parse_invoice(invoice_example))
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 4 obs. of 6 variables:
# $ invoice_nr: chr "85600023" "85600023" "85600023" "85600023"
# $ date : Date, format: "2016-02-01" "2016-02-07" ...
# $ reference : chr "840000023" "840000048" "840000032" "840000027"
# $ product_id: chr "de6583621" "qf8463641" "qf8463641" "ke7801465"
# $ weight : num 14.9 19 4 1.78
# $ amount : num 745 950 250 89
Since I'm not familiar with rebus I've rewritten your code. Assuming the invoices are at least somewhat similarly structured, I could generate a tibble from your example. You would just have to apply this to your whole list and then purrr::reduce it into a big tibble:
df <- tibble(date=na.omit(str_extract(invoice_example,"\\d{2}-\\d{2}-\\d{4}")))
df %>% mutate(invoice_nr=na.omit(sub("invoice id: ","",str_extract(invoice_example,"invoice id: [0-9]+"))),
reference=na.omit(sub("\\d{2}-\\d{2}-\\d{4} ","",str_extract(invoice_example,"\\d{2}-\\d{2}-\\d{4} \\d{9}"))),
product_id=na.omit(str_extract(invoice_example,"[:lower:]{2}\\d{7}")),
weight=na.omit(sub(" kg","",str_extract(invoice_example,"[0-9\\.]+ kg"))),
amount=na.omit(sub("tonne ","",str_extract(invoice_example,"tonne [0-9,]+"))))

Values are not getting entered in dataframe from web scraping

My main aim is to extract the content from the website and save it locally. When the content on the website is updated, the local data should be updated as well.
I am able to read the data from the webpage used in the code; now I want to save the result into a data frame so that I can export it. I want the values of x6 to go into the data frame df, so that I can export the data frame to a text file or an Excel file (or you can suggest any other way to extract the data from the webpage used in the code). My for loop is not working, so please help me out.
library(rvest)
library(dplyr)
library(qdapRegex) # install.packages("qdapRegex")
google <- read_html("https://bidplus.gem.gov.in/bidresultlists")
(x <- google %>%
html_nodes(".block") %>%
html_text())
class(x)
(x1 <- gsub(" ", "", x))
(x2 <- gsub(" ", "", x1))
(x3 <- gsub(" ", "", x2))
(x4 <- gsub(" ", "", x3))
(x5 <- gsub(" ", "", x4))
(x6 <- gsub("\n", "", x5))
class(x6)
length(x6[i])
typeof(x6)
for (i in x6) {
BIDNO <- rm_between(x6[i], "BID NO:", "Status", extract = TRUE)
Status <- rm_between(x6[i], "Status:", "Quantity Required", extract = TRUE)
Quantity_Required <- rm_between(x6[i], "Quantity Required:", "Department Name And Address", extract = TRUE)
Department_Name_And_Address <- rm_between(x6[i], "Department Name And Address:", "Start Date", extract = TRUE)
Start_Date <- rm_between(x6[i], "Start Date:", "End Date", extract = TRUE)
# End_Date <- rm_between(x6[i], "End Date: ", "Technical Evaluation", extract=TRUE)
df <- data.frame("BID_NO", "Status", "Quantity_Required", "Department_Name_Address", "Start_Date")
}
df
View(df)
Targeting the desired elements with XPath is likely a path with less frustration & error:
library(rvest)
library(dplyr)
pg <- read_html("https://bidplus.gem.gov.in/bidresultlists")
Get all the bid blocks:
blocks <- html_nodes(pg, ".block")
Target items & quantity div:
items_and_quantity <- html_nodes(blocks, xpath=".//div[@class='col-block' and contains(., 'Item(s)')]")
Pull out items and quantities:
items <- html_nodes(items_and_quantity, xpath=".//strong[contains(., 'Item(s)')]/following-sibling::span") %>% html_text(trim=TRUE)
quantity <- html_nodes(items_and_quantity, xpath=".//strong[contains(., 'Quantity')]/following-sibling::span") %>% html_text(trim=TRUE) %>% as.numeric()
Get the department name and address. Modify it so the three lines are separated with pipes (|); this will enable splitting them apart later (a sketch of that split follows the code below). The pipe symbol is a pain for regexes since it has to be escaped, but it is highly unlikely to appear in the text, and tabs can often cause confusion later on.
department_name_and_address <- html_nodes(blocks, xpath=".//div[@class='col-block' and contains(., 'Department Name And Address')]") %>%
html_text(trim=TRUE) %>%
gsub("\n", "|", .) %>%
gsub("[[:space:]]*\\||\\|[[:space:]]*", "|", .)
Target the block header which has bid # and status:
block_header <- html_nodes(blocks, "div.block_header")
Pull out bid # (see note at the end of the answer):
html_nodes(block_header, xpath=".//p[contains(@class, 'bid_no')]") %>%
html_text(trim=TRUE) %>%
gsub("^.*: ", "", .) -> bid_no
Pull out status:
html_nodes(block_header, xpath=".//p/b[contains(., 'Status')]/following-sibling::span") %>%
html_text(trim=TRUE) -> status
Target & pull out start & end dates:
html_nodes(blocks, xpath=".//strong[contains(., 'Start Date')]/following-sibling::span") %>%
html_text(trim=TRUE) -> start_date
html_nodes(blocks, xpath=".//strong[contains(., 'End Date')]/following-sibling::span") %>%
html_text(trim=TRUE) -> end_date
Make a data frame:
data.frame(
bid_no,
status,
start_date,
end_date,
items,
quantity,
department_name_and_address,
stringsAsFactors=FALSE
) -> xdf
Some of the bids are "RA"s so we can also create a column letting us know which ones are which:
xdf$is_ra <- grepl("/RA/", bid_no)
The resultant data frame:
str(xdf)
## 'data.frame': 10 obs. of 8 variables:
## $ bid_no : chr "GEM/2018/B/93066" "GEM/2018/B/93082" "GEM/2018/B/93105" "GEM/2018/B/93999" ...
## $ status : chr "Not Evaluated" "Not Evaluated" "Not Evaluated" "Not Evaluated" ...
## $ start_date : chr "25-09-2018 03:53:pm" "27-09-2018 09:16:am" "25-09-2018 05:08:pm" "26-09-2018 05:21:pm" ...
## $ end_date : chr "18-10-2018 03:00:pm" "18-10-2018 03:00:pm" "18-10-2018 03:00:pm" "18-10-2018 03:00:pm" ...
## $ items : chr "automotive chassis fitted with engine" "automotive chassis fitted with engine" "automotive chassis fitted with engine" "Storage System" ...
## $ quantity : num 1 1 1 2 90 1 981 6 4 376
## $ department_name_and_address: chr "Department Name And Address:||Ministry Of Steel Na Kirandul Complex N/a" "Department Name And Address:||Ministry Of Steel Na Kirandul Complex N/a" "Department Name And Address:||Ministry Of Steel Na Kirandul Complex N/a" "Department Name And Address:||Maharashtra Energy Department Maharashtra Bhusawal Tps N/a" ...
## $ is_ra : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
I'll let you turn dates into POSIXct elements.
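A sketch of that conversion (the format string is assumed from the printed dates above; %p matching of "pm" can be locale-dependent, hence the toupper()):
xdf$start_date <- as.POSIXct(toupper(xdf$start_date), format="%d-%m-%Y %I:%M:%p")
xdf$end_date <- as.POSIXct(toupper(xdf$end_date), format="%d-%m-%Y %I:%M:%p")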
The contiguous code w/o explanation is here.
Also, this isn't Java. for loops are rarely the solution to a problem in R. And, you should read up on regexes since counting spaces for substitution is also a path fraught with peril and frustration.
The problem appears to be that what you created is a bunch of strings, with 'BID_NO' etc. in quotes. If you are trying to save values into a data frame, you need to pass the variables into which you saved the values instead:
df<-data.frame(BID_NO,Status,Quantity_Required,Department_Name_Address,Start_Date)
Provided all the code above which creates each field is correct and values are saved into those variables, you will still get a ONE-ROW data frame, because it is created inside a for loop, so each iteration writes over the last version.
If you hope to save multiple rows, create final_df prior to the loop. Then
data.frame(rbind(final_df, df)) will bind the row of data to the empty frame on the first pass then add a new row each time through.
But any data frame created in a loop will be created anew and written over on each pass... and save values from variables without ' ' around them...
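A minimal sketch of that pattern applied to the question's loop (only two fields shown; variable names follow the question, and rm_between() returns a list, hence the [[1]]):
final_df <- data.frame()
for (i in seq_along(x6)) {
  BID_NO <- rm_between(x6[i], "BID NO:", "Status", extract = TRUE)[[1]]
  Status <- rm_between(x6[i], "Status:", "Quantity Required", extract = TRUE)[[1]]
  df <- data.frame(BID_NO, Status, stringsAsFactors = FALSE)
  final_df <- rbind(final_df, df)
}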

R Webscraping from WFP website

I am using the WFP country website (http://www1.wfp.org/countries) and aim to web-scrape it in order to build a dataset containing the news issued periodically there, without clicking page after page each time.
Furthermore, I would like to add some columns, including a keyword count.
Leaving aside the part of the script containing the countries and the URLs, I would like to focus on the scraping itself.
I am using a bunch of packages:
library(rvest)
library(stringr)
library(tidyr)
library(data.table)
library(plyr)
library(xml2)
library(selectr)
library(tibble)
library(purrr)
library(datapasta)
library(jsonlite)
library(countrycode)
library(httr)
library(stringi)
library(tidyverse)
library(dplyr)
library(XML)
I have prepared the dataset for another website and it seems to work well.
A helper here suggested a quite elegant solution, and I have integrated it with my previous work on the country part; everything works well there. Nevertheless, the solution does not seem to fit my present need.
So far, I have this:
## 11. Creating a function in order to scrape data from a website (in this case, WFP's)
wfp_get_news <- function(iso3) { GET(
url = "http://www1.wfp.org/countries/common/allnews/en/",
query = list(iso3=iso3)
) -> res
warn_for_status(res)
if (status_code(res) > 399) return(NULL)
out <- content(res, as="text", encoding="UTF-8")
out <- jsonlite::fromJSON(out)
out$iso3 <- iso3
tbl_df(out)
}
## 12. Setting all the Country urls in order for them to be automatically scraped
pb <- progress_estimated(length(countrycode_data$iso3c[])) # THIS TAKES LONG TO BE PROCESSED
map_df(countrycode_data$iso3c[], ~{
pb$tick()$print()
Sys.sleep(5)
wfp_get_news(.x)
}) -> xdf
## 13. Setting keywords (of course, this process is arbitrary: one can choose any keyword s/he prefers)
keywords <- c("drought", "food security")
keyword_regex <- sprintf("(%s)", paste0(keywords, collapse="|"))
## 14. Setting the keywords search
bind_cols(
xdf,
stri_match_all_regex(tolower(xdf$bodytext), keyword_regex) %>%
map(~.x[,2]) %>%
map_df(~{
res <- table(.x, useNA="always")
nm <- names(res)
nm <- ifelse(is.na(nm), "NONE", stri_replace_all_regex(nm, "[ -]", "_"))
as.list(set_names(as.numeric(res), nm))
})
) %>%
select(-NONE) -> xdf_with_keyword_counts
In particular, when I run point 14 of the script, I get the following error message:
Error in overscope_eval_next(overscope, expr) :
object "NONE" not found
Furthermore: Warning message:
Unknown or uninitialised column: 'bodytext'.
The expected result should instead be, more or less:
> glimpse(xdf_with_keyword_counts)
Observations: 12,375
Variables: 12
$ uid <chr> "1071595", "1069933", "1069560", "1045264", "1044139", "1038339", "405003", "1052711", NA, "1062329", "1045248", "...
$ table <chr> "news", "news", "news", "news", "news", "news", "news", "news", NA, "news", "news", "news", "news", "news", NA, "n...
$ title <chr> "Conflicts and drought spur hunger despite strong global food supply", "FAO Calls for Stronger Collaboration on Tr...
$ date <chr> "1512640800", "1511823600", "1511737200", "1508191200", "1508104800", "1505980800", "1459461600", "1293836400", NA...
$ bodytext <chr> " 7 December 2017, Rome- Strong cereal harvests are keeping global food supplies buoyant, but localised drought, f...
$ date_format <chr> "07/12/2017", "28/11/2017", "27/11/2017", "17/10/2017", "16/10/2017", "21/09/2017", "01/04/2016", "01/01/2011", NA...
$ image <chr> "http://www.wfp.org...", "http://www.wfp.org...
$ pid <chr> "2330", "50840", "16275", "70992", "16275", "2330", "40990", "40990", NA, "53724", "53724", "2330", "53724", "5084...
$ detail_pid <chr> "/news/story/en/item/1071595/icode/", "/neareast/news/view/en/c/1069933/", "/asiapacific/news/detail-events/en/c/1...
$ iso3 <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "ALA", "ALB", "ALB", "ALB", "ALB", "DZA", "ASM", "AND", "A...
$ drought <dbl> 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ food_security <dbl> NA, NA, NA, 2, 1, NA, 1, NA, NA, NA, 1, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
I hope I made myself quite clear.
Any clue?
I think you hit one of the "gotchas" in web scraping: they removed this functionality/paths on the web site.
Try going to http://www1.wfp.org/countries/common/allnews/en/iso=SLV (El Salvador's news page from the URL scheme you were using a couple of days ago). It doesn't exist.
But, if you go to http://www1.wfp.org/countries/el-salvador there's a link for http://www.wfp.org/news/el-salvador-177 on that page which is the El Salvador news items.
I think it's the same content, just presented differently, so it's really a matter of attacking it differently:
library(rvest)
library(httr)
library(stringi)
library(tidyverse)
This is a helper so we can get their country IDs and name mappings:
get_countries <- function() {
pg <- read_html("http://www.wfp.org/news/news-releases?tid=All&tid_2=All")
# find the country popup
country_sel <- html_nodes(pg, "select[name='tid'] option")
# extract ids and name for each country, ignoring "All"
data_frame(
cid = html_attr(country_sel, "value"),
cname = html_text(country_sel)
) %>%
filter(stri_detect_regex(cid, "[[:digit:]]"))
}
This is a helper to get the news content on a page
get_news <- function(cid, tid) {
GET("http://www.wfp.org/news/news-releases",
query=list(tid=cid, tid_2=tid)) -> res
warn_for_status(res)
if (status_code(res) > 200) return(NULL)
res <- content(res, as="parsed")
# check for no stories by testing for the presence of the
# div that has the "no stories are found" text
if (length(html_node(res, "div.view-empty")) != 0) return(NULL)
# find the news item boxes on this page
items <- html_nodes(res, "div.list-page-item")
# extract the contents
data_frame(
cid = cid,
tid = tid,
# significant inconsistency in how they assign CSS classes to date boxes
date = html_text(html_nodes(items, xpath=".//div[contains(@class, 'box-date')]"), trim=TRUE),
title = html_text(html_nodes(items, "h3"), trim=TRUE),
# how & where they put summary text in the div is also inconsistent so we
# need to (unfortunately) include the date and title to ensure we capture it
# we could get just the text, but it's more complex code.
summary = html_text(items, trim=TRUE),
link = html_attr(html_nodes(items, "h3 a"), "href")
)
}
Now, we iterate over the countries and get all the stories:
country_df <- get_countries()
pb <- progress_estimated(length(country_df$cid))
map_df(country_df$cid, ~{
pb$tick()$print()
get_news(.x, "All")
}) -> news_df
# add in country names
mutate(news_df, cid = as.character(cid)) %>%
left_join(country_df) -> news_df
glimpse(news_df)
## Observations: 857
## Variables: 7
## $ cid <chr> "120", "120", "120", "120", "120", "120", "120", "120", "120", "120"...
## $ tid <chr> "All", "All", "All", "All", "All", "All", "All", "All", "All", "All"...
## $ date <chr> "26 October 2017", "16 October 2017", "2 October 2017", "10 July 201...
## $ title <chr> "US Contribution To Boost WFP Food Assistance And Local Economy In A...
## $ summary <chr> "26 October 2017\t\t\r\n\t\t\r\n\tUS Contribution To Boost WFP Food ...
## $ link <chr> "/news/news-release/us-contribution-boost-wfp-food-assistance-and-lo...
## $ cname <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghani...
You still need to try to classify this by adapting the other code you have, and you can use the link in the data frame to mine more text for said classification.
NOTE: this only gets the most recent news page for each country but that's pretty much what you want to do anyway (check for net-new & classify them).
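As a starting point for that classification, the keyword counting from steps 13–14 of the question can be re-used on the summary column; a minimal sketch (keywords as defined in the question):
news_df$drought <- stri_count_regex(tolower(news_df$summary), "drought")
news_df$food_security <- stri_count_regex(tolower(news_df$summary), "food security")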
Now, we can try to auto-classify stories by looping through country & pop-up topics list since those topics seem to be what you care about (some of them). You'll need to trust that they tagged things well.
NOTE: This is going to take a long time, especially with the "being kind" delay, hence why I only scaffolded it and didn't run it apart from a light test to ensure it worked:
# get topic ids
get_topics <- function() {
pg <- read_html("http://www.wfp.org/news/news-releases?tid=All&tid_2=All")
# find the topic popup
country_sel <- html_nodes(pg, "select[name='tid_2'] option")
# extract ids and name for each topic, ignoring "All" and sub-topics
# i.e. ignore ones that begin with "-"
data_frame(
tid = html_attr(country_sel, "value"),
tname = html_text(country_sel)
) %>%
filter(stri_detect_regex(tid, "[[:digit:]]")) %>%
filter(tid != "All") # exclude "All" since we're trying to auto-tag
}
topics_df <- get_topics()
pb <- progress_estimated(length(country_df$cid))
map_df(country_df$cid, ~{
pb$tick()$print()
cid <- .x
Sys.sleep(5) ## NOTE THIS SHOULD REALLY GO IN get_news() but I didn't want to mess with that function for this extra part of the example
map_df(topics_df$tid, ~get_news(cid, .x))
}) -> news_with_tagged_topics_df
mutate(news_with_tagged_topics_df, tid = as.character(tid), cid = as.character(cid)) %>%
left_join(topics_df) %>%
left_join(country_df) %>%
glimpse()
I ran it for a random sample of 3 countries:
## Observations: 11
## Variables: 8
## $ cid <chr> "4790", "4790", "4790", "4790", "4790", "4790", "4790", "152", "152"...
## $ tid <chr> "4488", "3929", "3929", "995", "999", "1005", "1005", "997", "995", ...
## $ date <chr> "16 December 2014", "2 September 2016", "1 October 2014", "1 October...
## $ title <chr> "Russia & WFP Seal Partnership To End Hunger; Kamaz Trucks Rolled Ou...
## $ summary <chr> "16 December 2014\t\t\r\n\t\t\r\n\tRussia & WFP Seal Partnership To ...
## $ link <chr> "/news/news-release/russia-wfp-seal-partnership-end-hunger-kamaz-tru...
## $ tname <chr> "Executive Director", "Centre of Excellence against Hunger", "Centre...
## $ cname <chr> "Brazil", "Brazil", "Brazil", "Brazil", "Brazil", "Brazil", "Brazil"...
and it did pick up a diversity of tags:
unique(news_with_tagged_topics_df$tname)
## [1] "Executive Director" "Centre of Excellence against Hunger"
## [3] "Nutrition" "Procurement"
## [5] "School Meals" "Logistics"

Rselenium scraping loop and list

I'm trying to use this code:
require(RSelenium)
checkForServer()
startServer()
remDr<-remoteDriver()
remDr$open()
appURL <- 'http://www.mtmis.excise-punjab.gov.pk'
remDr$navigate(appURL)
remDr$findElement("name", "vhlno")$sendKeysToElement(list("ria-07-777"))
I can't figure out the CSS selector:
remDr$findElements("class", "ent-button-div")[[1]]$clickElement()
After searching the query:
elem <- remDr$findElement(using="class", value="result-div")
elemtxt <- elem$getElementAttribute("outerHTML")[[1]]
elemxml <- htmlTreeParse(elemtxt, useInternalNodes=T)
final <- readHTMLTable(elemxml)
remDr$close()
rD[["server"]]$stop()
What I want is to create an automated for loop over different vehicles from a list and merge everything into one final table with a unique identifier, e.g. "ria-07-777".
list <- c("ria-07-776", "ria-07-777", "ria-07-778")
Why do you need Selenium?
library(httr)
library(rvest)
clean_cols <- function(x) {
x <- tolower(x)
x <- gsub("[[:punct:][:space:]]+", "_", x)
x <- gsub("_+", "_", x)
x <- gsub("(^_|_$)", "", x)
make.unique(x, sep = "_")
}
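For example:
clean_cols(c("Registration Number", "Owner Name"))
## [1] "registration_number" "owner_name"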
get_vehicle_info <- function(vhlno) {
POST(
url = 'http://www.mtmis.excise-punjab.gov.pk/',
set_cookies(has_js=1),
body = list(vhlno=vhlno)
) -> res
stop_for_status(res)
pg <- content(res)
rows <- html_nodes(pg, xpath=".//div[contains(@class, 'result-div')]/table/tr[td[not(@colspan)]]")
cbind.data.frame(
as.list(
setNames(
html_text(html_nodes(rows, xpath=".//td[2]")),
clean_cols(html_text(html_nodes(rows, xpath=".//td[1]")))
)
),
stringsAsFactors=FALSE
)
}
Now use ^^:
vehicles <- c("ria-07-776", "ria-07-777", "ria-07-778")
Reduce(
rbind.data.frame,
lapply(vehicles, function(v) {
Sys.sleep(5) # your desire to steal a bunch of vehicle info to make a sketch database does not give you the right to hammer the server, and you'll very likely remove this line anyway, but I had to try
get_vehicle_info(v)
})
) -> vehicle_df
str(vehicle_df)
## 'data.frame': 3 obs. of 12 variables:
## $ registration_number: chr "ria-07-776" "ria-07-777" "ria-07-778"
## $ chassis_number : chr "KZJ95-0019869" "NFBFD15746R101101" "NZE1206066278"
## $ engine_number : chr "1KZ-0375851" "R18A11981105" "X583994"
## $ make_name : chr "LAND - CRUISER" "HONDA - CIVIC" "TOYOTA - COROLLA"
## $ registration_date : chr "17-Dec-2007 12:00 AM" "01-Aug-2007 12:00 AM" "01-Jan-1970 12:00 AM"
## $ model : chr "1997" "2006" "2007"
## $ vehicle_price : chr "1,396,400" "1,465,500" "0"
## $ color : chr "MULTI" "GRENDA B.P" "SILVER"
## $ token_tax_paid_upto: chr "June 2015" "June 2011" "June 2016"
## $ owner_name : chr "FATEH DIN AWAN" "M BILAL YASIN" "MUHAMMAD ALTAF"
## $ father_name : chr "HAFIZ ABDUL HAKEEM AWAN" "CH M. YASIN" "NAZAR MUHAMMAD"
## $ owner_city : chr "RAWALPINDI" "ISLAMABAD" "SARGODHA"
You'll need to handle network and scraping errors on your own. I can't justify any more time for this likely unethical endeavour (the answer was more to help others with similar q's).
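A minimal sketch of one way to handle the error part (purrr::possibly() wraps the function so a failed lookup yields NULL instead of stopping the loop, and dplyr::bind_rows() skips the NULLs):
safe_vehicle_info <- purrr::possibly(get_vehicle_info, otherwise = NULL)
vehicle_df <- dplyr::bind_rows(lapply(vehicles, function(v) { Sys.sleep(5); safe_vehicle_info(v) }))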

Geocode IP addresses in R

I have made this short code to automate geocoding of IP addresses using freegeoip.net (15,000 queries per hour by default; excellent service!):
> library(RCurl)
Loading required package: bitops
> ip.lst =
c("193.198.38.10","91.93.52.105","134.76.194.180","46.183.103.8")
> q = do.call(rbind, lapply(ip.lst, function(x){
try( data.frame(t(strsplit(getURI(paste0("freegeoip.net/csv/", x)), ",")[[1]]), stringsAsFactors = FALSE) )
}))
> names(q) = c("ip","country_code","country_name","region_code","region_name","city","zip_code","time_zone","latitude","longitude","metro_code")
> str(q)
'data.frame': 4 obs. of 11 variables:
$ ip : chr "193.198.38.10" "91.93.52.105" "134.76.194.180" "46.183.103.8"
$ country_code: chr "HR" "TR" "DE" "DE"
$ country_name: chr "Croatia" "Turkey" "Germany" "Germany"
$ region_code : chr "" "06" "NI" ""
$ region_name : chr "" "Ankara" "Lower Saxony" ""
$ city : chr "" "Ankara" "Gottingen" ""
$ zip_code : chr "" "06450" "37079" ""
$ time_zone : chr "Europe/Zagreb" "Europe/Istanbul" "Europe/Berlin" ""
$ latitude : chr "45.1667" "39.9230" "51.5333" "51.2993"
$ longitude : chr "15.5000" "32.8378" "9.9333" "9.4910"
$ metro_code : chr "0\r\n" "0\r\n" "0\r\n" "0\r\n"
In three lines of code you get coordinates for all IPs, including city/country codes. I wonder if this could be parallelized so it runs even faster? Geocoding >10,000 IPs can otherwise take hours.
library(rgeolocate)
ip_lst = c("193.198.38.10", "91.93.52.105", "134.76.194.180", "46.183.103.8")
maxmind(ip_lst, "~/Data/GeoLite2-City.mmdb",
fields=c("country_code", "country_name", "region_name", "city_name",
"timezone", "latitude", "longitude"))
## country_code country_name region_name city_name timezone latitude longitude
## 1 HR Croatia <NA> <NA> Europe/Zagreb 45.1667 15.5000
## 2 TR Turkey Istanbul Istanbul Europe/Istanbul 41.0186 28.9647
## 3 DE Germany Lower Saxony Bilshausen Europe/Berlin 51.6167 10.1667
## 4 DE Germany North Rhine-Westphalia Aachen Europe/Berlin 50.7787 6.1085
There are instructions in the package for obtaining the necessary data files. Some of the fields you're pulling are woefully inaccurate (more so than any geoip vendor would like to admit). If you do need ones that aren't available, file an issue and we'll add them.
I've found multidplyr is a great package for making parallel server calls. This is the best guide I've found, and I highly recommend reading the whole thing to better understand how the package works: http://www.business-science.io/code-tools/2016/12/18/multidplyr.html
library("devtools")
devtools::install_github("hadley/multidplyr")
library(parallel)
library(multidplyr)
library(RCurl)
library(tidyverse)
# Convert your example into a function
get_ip <- function(ip) {
do.call(rbind, lapply(ip, function(x) {
try(data.frame(t(strsplit(getURI(
paste0("freegeoip.net/csv/", x)
), ",")[[1]]), stringsAsFactors = FALSE))
})) %>% nest(X1:X11)
}
# Made ip.lst into a Tibble to make it work better with dplyr
ip.lst =
tibble(
ip = c(
"193.198.38.10",
"91.93.52.105",
"134.76.194.180",
"46.183.103.8",
"193.198.38.10",
"91.93.52.105",
"134.76.194.180",
"46.183.103.8"
)
)
# Create a cluster based on how many cores your machine has
cl <- detectCores()
cluster <- create_cluster(cores = cl)
# Create a partitioned tibble
by_group <- partition(ip.lst, cluster = cluster)
# Send libraries and the function get_ip() to each cluster
by_group %>%
cluster_library("tidyverse") %>%
cluster_library("RCurl") %>%
cluster_assign_value("get_ip", get_ip)
# Send parallel requests to the website and parse the results
q <- by_group %>%
do(get_ip(.$ip)) %>%
collect() %>%
unnest() %>%
tbl_df() %>%
select(-PARTITION_ID)
# Set names of the results
names(q) = c(
"ip",
"country_code",
"country_name",
"region_code",
"region_name",
"city",
"zip_code",
"time_zone",
"latitude",
"longitude",
"metro_code"
)
