Scraping text from HTML table in R

I'm pretty new to R and I'm trying to solve some real-world challenges while taking the datacamp.com R course. The thing is: I'm trying to scrape address, name, phone, email and website from a webpage. The information is in a table. I have tried this code:
library(rvest)

# Read and parse the page
apel_url <- read_html("http://www.apel.pt/pageview.aspx?pageid=944&langid=1")
txt <- html_text(apel_url)
txt

# Each member entry is in a <p class="MsoNormal"> node
associados <- apel_url %>%
  html_nodes(css = "p.MsoNormal") %>%
  html_text()
print(associados)
As a result I get a chr [1:1481] character vector, but some of the lines were scraped joined together, although on the site they are separate lines. For instance:
associados[969]
results in:
[1] "PENUMBRA EDITORA, LDA.Rua da Marinha, 50 - Madalena4405-761 VILA NOVA DE GAIA Tel.: 22 375 04 52"
I wonder what I'm missing, and I would like to know the best way to transform this string into a data frame, separating each field into a column (phone, address, email, URL, etc.). Some of the entries have one or more phone numbers, others don't have a URL, etc., so the field has to be blank when there is no information.
Thanks for helping.
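A minimal sketch of one possible direction, assuming rvest >= 1.0.0 (whose html_text2() keeps the <br> line breaks that html_text() drops) and regular expressions that only approximate how the entries are formatted on that page:
library(rvest)
library(stringr)

page <- read_html("http://www.apel.pt/pageview.aspx?pageid=944&langid=1")

# html_text2() preserves <br> line breaks, so address, phone and email
# stay on separate lines instead of being pasted together
entries <- page %>%
  html_nodes("p.MsoNormal") %>%
  html_text2()

# pull individual fields with regular expressions; the patterns below are
# guesses about the page layout and will need adjusting
associados_df <- data.frame(
  raw   = entries,
  phone = str_extract(entries, "Tel\\.?:?\\s*[0-9 ]+"),
  email = str_extract(entries, "[[:alnum:]._-]+@[[:alnum:].-]+"),
  url   = str_extract(entries, "www\\.\\S+"),
  stringsAsFactors = FALSE
)
str_extract() returns NA wherever a field is missing, which gives you the blank cells for entries without a URL, email, etc.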

Related

How can I find the page number in a .pdf by searching its text?

I have a .pdf with 120 certificates; each page is a certificate and the only difference is the name of the participant.
I also have a .csv with the names and e-mails (I will also try to send them by e-mail with R later).
How can I split each certificate (page) and save it as a new .pdf named with the participant's name?
I saw functions like pdf_subset from library(pdftools), but how can I identify the page number by some text?
library(pdftools)

# extract some pages
pdf_subset('https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf',
           pages = 1:3, output = "subset.pdf")
Example of .pdf:
https://drive.google.com/file/d/1iwgW6kMT7C9Xee5SM65vz-D8B26bpavz/view?usp=sharing
in the .csv I have the column name
name,
Prof. Dr. Thiado Souza,
Prof. Dr. Marcelo José,
Ricado Augusto,
Carlos José,
pdf_text returns a character vector where each element represents an individual page.
library(pdftools)
data <- pdf_text('https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf')
data[1] is the first page of the PDF, data[2] is the second, and so on, so you can subset one page at a time, or multiple pages, e.g. data[1:10] for the first 10 pages.
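Putting the two together, a rough sketch (certificates.pdf and participants.csv are placeholder file names for your own files, and the loop assumes each participant's name appears on exactly one page):
library(pdftools)

pages        <- pdf_text("certificates.pdf")            # one element per page
participants <- read.csv("participants.csv", stringsAsFactors = FALSE)

for (name in participants$name) {
  page_no <- grep(name, pages, fixed = TRUE)            # page(s) containing this name
  if (length(page_no) == 1) {
    pdf_subset("certificates.pdf", pages = page_no,
               output = paste0(name, ".pdf"))
  } else {
    warning("No unique page found for: ", name)
  }
}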

xpath returning empty text when web-scraping in r

I'm trying to scrape information from https://www.kff.org/interactive/subsidy-calculator. For instance, set state = California, zip = 90001, income = 20000, no coverage, 1 person, 1 adult, no children, age = 21, no tobacco.
We get the following:
https://www.kff.org/interactive/subsidy-calculator/#state=ca&zip=94704&income-type=dollars&income=20000&employer-coverage=0&people=1&alternate-plan-family=individual&adult-count=1&adults%5B0%5D%5Bage%5D=21&adults%5B0%5D%5Btobacco%5D=0&child-count=0
I would like to get the numbers for "estimated financial help" and "your cost for a silver plan" (they are in bold blue in the "Results" grey box; for some reason I can't upload the screenshot). When I use the xpath for the numbers, I get back an empty string. This is not the case if I retrieve some other text (not in the grey box). I wonder what could be wrong with this. I have attached the code below. Please forgive me if this is a stupid question, since I'm very new to web scraping. Thank you!
library(rvest)

state = tolower('CA')
zip = 94704
income = 20000
people = 1
adult = 1
children = 0

url = paste0("https://www.kff.org/interactive/subsidy-calculator/#state=", state,
             "&zip=", zip, "&income-type=dollars&income=", income,
             "&employer-coverage=0&people=", people,
             "&alternate-plan-family=individual&adult-count=", adult,
             "&adults%5B0%5D%5Bage%5D=21&adults%5B0%5D%5Btobacco%5D=0&child-count=", children)

# This returns an empty string
r = read_html(url) %>%
  html_nodes(xpath = '//*[@id="subsidy-calculator-new"]/div[5]/div/div/dl/dd[1]/span') %>%
  html_text()

# This returns "Number of children (20 and younger) enrolling in Marketplace coverage",
# a line that's not in the grey box.
r = read_html(url) %>%
  html_nodes(xpath = '//*[@id="subsidy-form"]/div[2]/div[3]/div[3]/p') %>%
  html_text()
The values are generated by scripts that run on the page. Your current method doesn't execute those scripts, hence the empty result. You are likely better off using a method that allows scripts to run, such as RSelenium.
The form you complete, #subsidy-form, feeds values into a template in a script tag, #results-template. The associated calculations are covered in this script https://www.kff.org/wp-content/themes/kaiser-foundation-2016/interactives/subsidy-calculator/2019/calculator.js?ver=1.7.7 where you will find the logic and the pre-set values, such as the poverty lines per year.
The simplest quick view is probably to inspect the JavaScript variables when the new SubsidyCalculator object is created to process the form, i.e. the JS starting with var sc = new SubsidyCalculator. You could 'reverse engineer' those variables from your values plus the values returned by the JSON below, which I think (but haven't confirmed) feed the six variables that begin with kff_sc, according to zip code, into the calculator, e.g. silver: kff_sc.silver. You get an idea of the ballpark figures from the default values at the top of the script.
Figures related to the zip code are retrieved from this: https://www.kff.org/wp-content/themes/kaiser-foundation-2016/interactives/subsidy-calculator/2019/json/zips/94.json where the two digits before .json are the first two digits of the zip code. You can determine this from the input validation script: https://www.kff.org/wp-content/themes/kaiser-foundation-2016/interactives/subsidy-calculator/2019/shared.js?ver=1.7.7
var bucket = $( this ).val().substring( 0, 2 );
if ( kff_sc.buckets[bucket] ) return;
$.ajax( '/wp-content/themes/vip/kaiser-foundation-2016/interactives/subsidy-calculator/2019/json/zips/' + bucket + '.json',
The first two digits determine the bucket.
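To see what the calculator pulls in for a given zip code, you can fetch that bucket file directly; a small sketch with jsonlite (the contents of the JSON are not documented, so inspect them with str()):
library(jsonlite)

zip    <- "94704"
bucket <- substr(zip, 1, 2)   # the first two digits of the zip code pick the file

zip_json <- fromJSON(paste0(
  "https://www.kff.org/wp-content/themes/kaiser-foundation-2016/interactives/",
  "subsidy-calculator/2019/json/zips/", bucket, ".json"))
str(zip_json)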
All in all, you could likely implement your own calculator, but you would be re-inventing the wheel. It seems easier to just automate the browser and then extract the resulting values.
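If you do automate the browser, a minimal RSelenium sketch could look like this (the browser, port, wait time and the XPath, written with @id rather than #id, are assumptions and not tested against the live page):
library(RSelenium)

driver <- rsDriver(browser = "firefox", port = 4567L)
remDr  <- driver$client

remDr$navigate(url)   # 'url' built with paste0() as in the question
Sys.sleep(5)          # give the page's scripts time to render the results box

financial_help <- remDr$findElement(
  using = "xpath",
  value = '//*[@id="subsidy-calculator-new"]/div[5]/div/div/dl/dd[1]/span'
)$getElementText()[[1]]

remDr$close()
driver$server$stop()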

utf8 encoding for emoji in R showing wrong result

I have a set of Twitter emojis:
description r.encoding unicode width
shootingstar <f0><9f><8c><a0> U+1F320 16
wrappedgift <f0><9f><8e><81> U+1F381 16
yellowheart <f0><9f><92><9b> U+1F49B 16
femalesign <e2><99><80> U+2640 12
frowningface <e2><98><b9> U+2639 12
And a set of tweets:
[1] "Ring<f0><9f><9a><b4><e2><80><8d><e2><99><80><ef><b8><8f> Order today and have it within 3 days<e2><9d><a3><ef><b8><8f>\n"
[2] "Really I have been thinking <f0><9f><a4><94> about surfing <f0><9f><8f><84><e2><80><8d><e2><99><80><ef><b8><8f>"
When I try to get the names of the emojis in these texts, using:
library(stringr)
vec <- str_count(string, matchto)  # string is the text, matchto is r.encoding
matches <- which(vec != 0)
in some cases it shows the wrong result, specifically for the emojis which are not in my emoji set.
For example, the "femalesign" emoji is <e2><99><80>.
In both tweets my code reports the "female sign" emoji; however, when I checked the tweets, the users were actually using "woman biking" and "woman surfing", which are not in my emoji dataset:
woman biking: <f0><9f><9a><b4><e2><80><8d><e2><99><80><ef><b8><8f>
woman surfing: <f0><9f><8f><84><e2><80><8d><e2><99><80><ef><b8><8f>
So the output I was expecting was:
NA
NA
May I know if there's a solution? Is there a specific pattern which can help?
I was wondering if there is a pattern/regex that can recognise whether a sequence such as "<f0><9f><9a><b4><e2><80><8d><e2><99><80><ef><b8><8f>" belongs to an emoji, regardless of which emoji it is, because there are more than 2,000 emojis and it would be very time-consuming to gather information on all of them; I couldn't find a comprehensive file that includes emoji names and their UTF-8 encodings, and emojis frequently get updated.
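One rough approach is to first extract whole emoji sequences with a regex over the common emoji code-point ranges plus the zero-width joiner (U+200D) and variation selector (U+FE0F) that glue multi-part emoji together, and only then look each complete sequence up in your table, so anything not listed becomes NA. The ranges below are an approximation, not the full Unicode emoji list:
library(stringr)

emoji_seq <- "[\U0001F000-\U0001FAFF\u2600-\u27BF\u2B00-\u2BFF\u200D\uFE0F]+"

tweets <- c("Ring\U0001F6B4\u200D\u2640\uFE0F Order today and have it within 3 days\u2763\uFE0F",
            "Really I have been thinking \U0001F914 about surfing \U0001F3C4\u200D\u2640\uFE0F")

# a toy lookup in the same spirit as the question's emoji set; the real table
# would hold the emoji characters themselves, not the <..> byte notation
lookup <- data.frame(description = c("femalesign", "thinkingface"),
                     char        = c("\u2640", "\U0001F914"),
                     stringsAsFactors = FALSE)

found <- str_extract_all(tweets, emoji_seq)

# match each complete sequence against the table; sequences such as
# "woman biking" that are not listed come back as NA
lapply(found, function(x) lookup$description[match(x, lookup$char)])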

Web Crawler using R

I want to build a web crawler in R for the website "https://www.latlong.net/convert-address-to-lat-long.html" that visits the site with an address as the parameter and then fetches the generated latitude and longitude. This would be repeated for the length of the dataset I have.
Since I am new to the web-crawling domain, I would appreciate some guidance.
Thanks in advance.
In the past I have used an API called ipstack (ipstack.com).
Example: a data frame 'd' that contains a column of IP addresses called 'ipAddress':
for (i in 1:nrow(d)) {
  # get data from API and save the text to variable 'str'
  lookupPath <- paste("http://api.ipstack.com/", d$ipAddress[i],
                      "?access_key=INSERT YOUR API KEY HERE&format=1", sep = "")
  str <- readLines(lookupPath)

  # save all the data to a file
  f <- file(paste(i, ".txt", sep = ""))
  writeLines(str, f)
  close(f)

  # save data to main data frame 'd' as well:
  d$ipCountry[i] <- str[7]
  print(paste("Successfully saved ip #:", i))
}
In this example I was specifically after the country location of each IP, which appears on line 7 of the data returned by the API (hence the str[7]).
This API lets you lookup 10,000 addresses per month for free, which was enough for my purposes.
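For the original question (addresses rather than IPs), a similar pattern works against a geocoding endpoint. A sketch using the free OSM Nominatim API instead of latlong.net (this is an assumption, not what the answer above used; Nominatim has usage limits, so keep the Sys.sleep and check its usage policy):
library(jsonlite)

addresses <- c("1600 Pennsylvania Ave NW, Washington DC",
               "10 Downing Street, London")

geocode_one <- function(addr) {
  url <- paste0("https://nominatim.openstreetmap.org/search?format=json&q=",
                URLencode(addr, reserved = TRUE))
  res <- fromJSON(url)
  Sys.sleep(1)   # be polite: at most one request per second
  if (length(res) == 0) return(c(lat = NA, lon = NA))
  c(lat = as.numeric(res$lat[1]), lon = as.numeric(res$lon[1]))
}

coords <- t(sapply(addresses, geocode_one))
coords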

How to use rvest to scrape the same kind of datapoint labelled with different ids

If I want to use rvest to scrape a particular datapoint (name, address, phone, etc.) repeated in different sections of this page, where all the span ids start with a similar prefix but are not exactly the same, such as:
docs-internal-guid-049ac94a-f34e-5729-b053-30567fdf050a
docs-internal-guid-765e48e9-f34b-7c88-5d95-042a93fcfda3
what's the best approach? Finding and copying each id by hand is not viable. Thanks.
Edit:
You can use the following script to retrieve all starred restaurants:
library("rvest")
url_base <- "http://www.straitstimes.com/lifestyle/food/full-list-of-michelin-starred-restaurants-for-2017"
data <- read_html(url_base) %>%
html_nodes("h3") %>%
html_text()
This also gives you the headers ("ONE MICHELIN STAR", "TWO MICHELIN STARS", "THREE MICHELIN STARS"), but this might even be helpful.
Background to the script:
Fortunately, all of the relevant information (and only that) is within the h3 selector. The script gives you a character vector as output. Of course, you can further process this with e.g. %>% as.data.frame() or however you want to store / process the data.
------------------- old answer -------------------
Could you maybe provide the URL of that particular page? It sounds like you have to find the right CSS selector (nth-child(x)) that you can use in a loop.
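For the original question, one way around copying every id is to match on the shared prefix with an XPath starts-with() (the page URL was never given, so your_url below is a placeholder):
library(rvest)

page <- read_html(your_url)   # placeholder: the page from the question

datapoints <- page %>%
  html_nodes(xpath = '//span[starts-with(@id, "docs-internal-guid")]') %>%
  html_text()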
