Web scraping a password-protected website in R returns errors

I am trying to scrape data from the member directory of a website ("members.dublinchamber.ie"). I tried using 'rvest', but I get the data from the login page even after submitting the login details. The code is as follows:
library(rvest)

# Start a session on the login page and grab the login form
url <- "https://members.dublinchamber.ie/login.aspx"
pgsession <- html_session(url)
pgform <- html_form(pgsession)[[2]]

# Fill in the credentials and submit
filled_form <- set_values(pgform,
                          "Username" = "username",
                          "Password" = "password")
submit_form(pgsession, filled_form)

# Jump to a profile page in the same session and pull the nodes
memberlist <- jump_to(pgsession, 'https://members.dublinchamber.ie/directory/profile.aspx?compid=50333')
page <- read_html(memberlist)
usernames <- html_nodes(x = page, css = 'css of required data')
data_usernames <- data.frame(html_text(usernames, trim = TRUE), stringsAsFactors = FALSE)
I also used RCurl and again I'm getting data from the login page. The RCurl code is as follows:
library(RCurl)

curl = getCurlHandle()
curlSetOpt(cookiejar = 'cookies.txt', followlocation = TRUE, autoreferer = TRUE, curl = curl)

# Fetch the login page and pull out the ASP.NET __VIEWSTATE token
html <- getURL('http://members.dublinchamber.ie/login.aspx', curl = curl)
viewstate <- as.character(sub('.*id="__VIEWSTATE" value="([^"]*)".*', '\\1', html))

# Post the credentials together with the viewstate
params <- list(
  'ctl00$ContentPlaceHolder1$ExistingMembersLogin1$username' = 'username',
  'ctl00$ContentPlaceHolder1$ExistingMembersLogin1$password' = 'pass',
  'ctl00$ContentPlaceHolder1$ExistingMembersLogin1$btnSubmit' = 'login',
  '__VIEWSTATE' = viewstate
)
html = postForm('http://members.dublinchamber.ie/login.aspx', .params = params, curl = curl)

# 'Logout' should appear in the response if the login succeeded
grep('Logout', html)
There are actually 3 URLs:
1) members.dublinchamber.ie/directory/default.aspx (lists the names of all industries; you have to click on an industry)
2) members.dublinchamber.ie/directory/default.aspx?industryVal=AdvMarPubrel (AdvMarPubrel is a short string generated when I clicked that industry)
3) members.dublinchamber.ie/directory/profile.aspx?compid=19399 (this has the profile information of the specific company I clicked on the previous page)
I want to scrape the industry names, the list of companies in each industry, and their details, which are present as a table at the 3rd URL; a sketch of that flow is below.
I am new here and also to R and web scraping. Please don't mind if the question is lengthy or not that clear.
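A minimal sketch of the intended crawl with rvest sessions, assuming the form field names and the industryVal/compid parameters behave as described above; the profile-link CSS selector is hypothetical and has to be confirmed against the directory page's source. Note that submit_form() returns the logged-in session, so its result must be kept:
library(rvest)

# Log in once; the returned session carries the authentication cookies
s <- html_session("https://members.dublinchamber.ie/login.aspx")
form <- html_form(s)[[2]]
form <- set_values(form, Username = "username", Password = "password")
s <- submit_form(s, form)   # keep the returned session

# Open one industry page and collect the links to company profiles
industry <- jump_to(s, "https://members.dublinchamber.ie/directory/default.aspx?industryVal=AdvMarPubrel")
profile_links <- industry %>%
  read_html() %>%
  html_nodes("a[href*='profile.aspx?compid=']") %>%   # hypothetical selector
  html_attr("href")

# Visit each profile in the same session and pull its tables
profiles <- lapply(profile_links, function(link) {
  jump_to(s, link) %>% read_html() %>% html_table(fill = TRUE)
})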

Related

Scraping Tweets in R with httr, jsonlite, dplyr

This is my code:
library(httr)
library(jsonlite)
library(dplyr)

bearer_token <- Sys.getenv("BEARER_TOKEN")
headers <- c('Authorization' = sprintf('Bearer %s', bearer_token))
params <- list('expansions' = 'attachments.media_keys')
handle <- readline('BenDuBose')
url_handle <- sprintf('https://api.twitter.com/2/users/by?username=%s', handle)
response <- httr::GET(url = url_handle,
                      httr::add_headers(.headers = headers),
                      query = params)
obj <- httr::content(response, as = "text")
print(obj)
This is my error message:
[1] "{"errors":[{"parameters":{"ids":[""]},"message":"The number of values in the ids query parameter list [0] is not between 1 and 100"}],"title":"Invalid Request","detail":"One or more parameters to your request was invalid.","type":"https://api.twitter.com/2/problems/invalid-request"}"
My end goal is to scrape an image from a specific tweet ID/user. I already have a list of users and tweet IDs, along with attachments.media_keys. But I don't know how to use httr, and I am trying to copy the Twitter Developer example verbatim to learn, but it isn't working.
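The error is consistent with handle being empty: readline() only displays its argument as a prompt and returns whatever is typed (an empty string when run non-interactively), so the request went out with an empty user list. A minimal sketch of the corrected call, assuming the standard v2 user-lookup endpoint; the tweet ID below is a placeholder:
library(httr)

bearer_token <- Sys.getenv("BEARER_TOKEN")
headers <- c(Authorization = sprintf("Bearer %s", bearer_token))

# Assign the handle directly; readline("BenDuBose") is a prompt, not a value
handle <- "BenDuBose"

# The bulk user-lookup endpoint takes a plural `usernames` parameter
user_resp <- GET("https://api.twitter.com/2/users/by",
                 add_headers(.headers = headers),
                 query = list(usernames = handle))
httr::content(user_resp, as = "parsed")

# The attachments.media_keys expansion belongs on the tweets lookup, e.g.:
# GET("https://api.twitter.com/2/tweets",
#     add_headers(.headers = headers),
#     query = list(ids = "1234567890",  # placeholder tweet ID
#                  expansions = "attachments.media_keys",
#                  media.fields = "url"))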

How to set up POST request to add song to spotify playlist with R

First, let me say that I'm really new to R. Yesterday I listened to my favorite radio station. Because there was so much advertising in between, I decided to scrape the music they play every day from their webpage, so I can listen to it without any ads.
I wrote a script in R that takes the title and artist of every song the radio played from their website:
### Radio2 playlist scraper ###
#Loading packages#
install.packages("rvest")
library(rvest)
install.packages("dplyr")
library("dplyr")
install.packages("remotes")
remotes::install_github("charlie86/spotifyr")
library(spotifyr)
install.packages('knitr', dependencies = TRUE)
library(knitr)
#Get playlist url #
url <- "https://www.nporadio2.nl/playlist"
#Read HTML code from pagen#
webpage <- read_html(url)
#Get Artist and Title#
artist <- html_nodes(webpage, '.fn-artist')
title <- html_nodes(webpage, '.fn-song')
#Artist and Title to text#
artist_text <- html_text(artist)
title_text <- html_text(title)
#Artist and Title to dataframe#
artiest <- as.data.frame(artist_text)
titel_text <- as.data.frame(title_text)
#Make one dataframe#
radioplaylist <- cbind(artiest$artist_text, titel_text$title_text)
radioplaylist <- as.data.frame(radioplaylist)
radioplaylist
#Rename columns#
colnames(radioplaylist)[1] <- "Artiest"
colnames(radioplaylist)[2] <- "Titel"
radioplaylist
#Remove duplicate songs#
radioplaylistuniek <- radioplaylist %>% distinct(Artiest, Titel, .keep_all = TRUE)
#Write to csv#
date <- Sys.Date()
date
write.csv(radioplaylistuniek, paste0("C://Users//Kantoor//Radio2playlists//playlist - ", date, ".csv"))
#Set spotify API#
Sys.setenv(SPOTIFY_CLIENT_ID = 'caxxxxxxxxxxxxxxxxxx')
Sys.setenv(SPOTIFY_CLIENT_SECRET = '7exxxxxxxxxxxxx')
access_token <- get_spotify_access_token()
clientID <- "xxxxxxxxxxxxxxx"
secret <- "xxxxxxxxxxxxxx"
library(httr)
library(magrittr)
library(rvest)
library(ggplot2)
response <- POST(
  'https://accounts.spotify.com/api/token',
  accept_json(),
  authenticate(clientID, secret),
  body = list(grant_type = 'client_credentials'),
  encode = 'form',
  verbose()
)
token = content(response)$access_token
authorization.header = paste0("Bearer ", token)
#Get track info#
call1 <- GET(url = paste("https://api.spotify.com/v1/search?q=track:Ready%20To%20Go%20artist:Republica&type=track&limit=1"), config = add_headers(authorization = authorization.header))
call1
# JSON to TXT#
jsonResponseParsed <- content(call1, as="parsed") #JSON response structured into parsed data
jsonResponseParsed
# Extract track uri#
uri <- jsonResponseParsed$tracks$items[[1]]$uri
uri
# Add track to playlist #
POST(url = "https://api.spotify.com/v1/playlists/29fotSbWUGP1NmWbtGRaG6/tracks?uris=spotify%3Atrack%3A5Qt8U8Suu7MFH1VcJr17Td",
     config = add_headers(c('Accept="application/json"',
                            'Content-type= "application/JSON"',
                            'Authorization="Bearer BQDX9jbz99bCt6TXd7OSaaj12CgCh3s5F6KBwb-ATnv7AFkSnjuEASS9FOW0zx-xxxxxxxxxxxxxx"')))
What do I want?
I want to automatically add every song I picked up to my Spotify playlist.
What have I got so far?
I created an app via developer.spotify.com. For each song I can get the unique URI that is needed to add the song to my playlist.
Where do I get stuck?
I am unable to add the song to my playlist with a POST request; I get the message "No token provided".
I have created a sample POST request via https://developer.spotify.com/console/post-playlist-tracks/?playlist_id=&position=&uris= which adds the song neatly to my playlist. The code is:
POST https://api.spotify.com/v1/playlists/{playlist_id}/tracks
curl -X "POST" "https://api.spotify.com/v1/playlists/29fotSbWUGP1NmWbtGRaG6/tracks?uris=spotify%3Atrack%3A5Qt8U8Suu7MFH1VcJr17Td" -H "Accept: application/json" -H "Content-Type: application/json" -H "Authorization: Bearer BQDX9jbz99bCt6TXd7OSaaj12CgCh3s5F6KBwb-ATxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
Can someone help me set up the correct POST request for this last part?
@webb, thank you, it is working now. Two things changed: the Authorization header is passed to add_headers() as a proper name = value pair, and the token is a user-authorized one with the playlist-modify-public scope (a client-credentials token cannot modify a user's playlist). The final code:
# Get user authorization code
code <- get_spotify_authorization_code(client_id = Sys.getenv("SPOTIFY_CLIENT_ID"),
                                       client_secret = Sys.getenv("SPOTIFY_CLIENT_SECRET"),
                                       scope = "playlist-modify-public")

# Save the access token
code2 <- code[["credentials"]][["access_token"]]
usercode <- paste0("Bearer ", code2)

# Add track to playlist
POST("https://api.spotify.com/v1/playlists/29fotSbWUGP1NmWbtGRaG6/tracks?uris=spotify%3Atrack%3A5Qt8U8Suu7MFH1VcJr17Td",
     encode = "json",
     add_headers(Authorization = usercode),
     body = "{\"texts\":[\"A simple string\"]}")
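For reference, spotifyr wraps the same endpoint, so once the user-scoped token is in place the manual POST can be replaced by one call. A sketch using the playlist ID and track URI from the question:
library(spotifyr)

# User-scoped authorization, as above (client ID/secret in env vars)
auth <- get_spotify_authorization_code(scope = "playlist-modify-public")

# add_tracks_to_playlist() issues the POST to /v1/playlists/{id}/tracks for you
add_tracks_to_playlist(playlist_id = "29fotSbWUGP1NmWbtGRaG6",
                       uris = "spotify:track:5Qt8U8Suu7MFH1VcJr17Td",
                       authorization = auth)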

Fill a form in R without RSelenium

I need to fill in the month and year fields of the page:
http://www.svs.cl/institucional/mercados/entidad.php?mercado=S&rut=99588060&grupo=&tipoentidad=CSVID&row=AABaHEAAaAAAB7uAAT&vig=VI&control=svs&pestania=3
I have programmed the following in RSelenium, and it works:
# Library
library(RSelenium)

# Browser parameters
mybrowser <- remoteDriver(browserName = "chrome")
mybrowser$open(silent = TRUE)
mybrowser$setTimeout(type = "page load", milliseconds = 1000000)
mybrowser$setImplicitWaitTimeout(milliseconds = 1000000)
url <- "http://www.svs.cl/institucional/mercados/entidad.php?mercado=S&rut=99588060&grupo=&tipoentidad=CSVID&row=AABaHEAAaAAAB7uAAT&vig=VI&control=svs&pestania=3"

# Start navigation
mybrowser$navigate(url)

# Fill in month and year, then click the submit button
wxbox <- mybrowser$findElement(using = "class", "bordeInput2")
wxbox$sendKeysToElement(list("09"))
wxbox <- mybrowser$findElement(using = "id", "aa")
wxbox$sendKeysToElement(list("2016"))
wxbutton <- mybrowser$findElement('xpath', "//*[@id='fm']/div[2]/input")
wxbutton$clickElement()
However, I'd like to see a solution using rvest or RCurl; I've tried, and it does not work for me. If anyone can help me with that, I would appreciate it.
An attempt I made was:
library(RCurl)
library(XML)

form <- postForm("http://www.svs.cl/institucional/mercados/entidad.php?mercado=S&rut=99588060&grupo=&tipoentidad=CSVID&row=AABaHEAAaAAAB7uAAT&vig=VI&control=svs&pestania=3",
                 Year = 2010, Month = 2)
doc <- htmlParse(form)
pkids <- xpathSApply(doc, xmlAttrs)
pkids
data <- lapply(pkids)
tab <- readHTMLTable(data[[1]], which = 1)
First of all, thanks.
You can simply POST to the URL as follows:
require(rvest)
require(httr)

a <- POST("http://www.svs.cl/institucional/mercados/entidad.php",
          # body = what you fill in on the form
          body = list(mm = 09, aa = 2016),
          # query = the long URL broken into parameters
          query = list(mercado = "S",
                       rut = "99588060",
                       grupo = "",
                       tipoentidad = "CSVID",
                       row = "AABaHEAAaAAAB7uAAT",
                       vig = "VI",
                       control = "svs",
                       pestania = "3"))
read_html(a) %>% html_nodes("dd") %>% html_text %>%
  setNames(c("Business name", "RUT"))
Which gives you:
Business name RUT
"ACE SEGUROS DE VIDA S.A." "99588060-1"

Error in trying to pull information off Instagram

I've been working on a project in the hope of pulling post and comment information from Instagram over the past year.
I am starting right now with a simple script just to pull information on a single user.
Here is the code:
require(httr)

full_url <- oauth_callback()
full_url <- gsub("(.*localhost:[0-9]{1,5}/).*", x = full_url, replacement = "\\1")
print(full_url)

app_name <- "Cognitive Model of the Customer"
client_id <- "b03d4a910f0442b9bd1cd79fc06a086f"
client_secret <- "c35f785784fa45cd9eaf786742ae9b3f"
scope = "basic"

instagram <- oauth_endpoint(
  authorize = "https://api.instagram.com/oauth/authorize",
  access = "https://api.instagram.com/oauth/access_token")
myapp <- oauth_app(app_name, client_id, client_secret)
ig_oauth <- oauth2.0_token(instagram, myapp, scope = "basic",
                           type = "application/x-www-form-urlencoded", cache = FALSE)
tmp <- strsplit(toString(names(ig_oauth$credentials)), '"')
token <- tmp[[1]][4]

library(jsonlite)
library(RCurl)
user_info <- fromJSON(getURL(paste('https://api.instagram.com/v1/users/search?q=',
                                   "newbalance", '&access_token=', token, sep = "")),
                      unexpected.escape = "keep")
The error I am receiving is
Error in simplify(obj, simplifyVector = simplifyVector, simplifyDataFrame = simplifyDataFrame, :
unused argument (unexpected.escape = "keep")
I'm not sure I understand where this error comes from though.
The unexpected.escape argument belongs to rjson::fromJSON, not jsonlite::fromJSON, so your call is reaching the wrong parser. Load this package before running your code (or namespace the call explicitly, as sketched below):
library(rjson)
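A sketch of the namespaced call, which works regardless of package load order; `token` is the value extracted above:
library(RCurl)

# Call rjson's parser explicitly so jsonlite::fromJSON cannot mask it
user_info <- rjson::fromJSON(
  getURL(paste0('https://api.instagram.com/v1/users/search?q=', 'newbalance',
                '&access_token=', token)),
  unexpected.escape = "keep"
)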

Characters different in RCurl/getURL than in browser

I am looking to extract foreign-language text from a website. The following code (hopefully self-contained) will demonstrate the problem:
require(RCurl)
require(XML)

options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
agent = "Chrome 39.0.2171.71 (64-bit)"
curl = getCurlHandle()
curlSetOpt(cookiejar = 'cookies.txt', useragent = agent, followlocation = TRUE, autoreferer = TRUE, curl = curl)
html <- getURL('http://164.100.47.132/LssNew/psearch/result12.aspx?dbsl=1008', maxredirs = as.integer(20), followlocation = TRUE, curl = curl)
work <- htmlTreeParse(html, useInternal = TRUE)
table <- xpathApply(work, "//table[@id = 'ctl00_ContPlaceHolderMain_DataList1']//font|//table[@id = 'ctl00_ContPlaceHolderMain_DataList1']//p", xmlValue) # this one captured some mess in 13
table[[2]]
The first bunch of characters in the console printout appear for me as ¸Ã\u0089Ã\u0092 iÃ\u0089{Ã\u0089xÃ\u0089 Ã\u008aºÃ\u0089Eònù®ú.
Note that if I go to the actual page (http://bit.ly/1AcE9Gs), view the page source, and find the second opening <font tag (corresponding to the second list item in my table; inspect the element near the first Hindi characters), what renders in the page source looks something like this: ¸ÉÒ iÉ{ÉxÉ ÊºÉEònù®ú (nù¨Énù¨É):, which is what I want.
Anyone know why this might occur, and/or how to fix it? Something to do with encodings in R, or RCurl? The characters are already different at the initial getURL call, so it does not come from passing the html text to xpathApply.
I am using Mac OS X 10.9.3, the Chrome browser (for viewing the actual page), and R 3.1.1.
If interested, see a related question on xpathApply here: R and xpathApply -- removing duplicates from nested html tags
Thanks!
Add encoding options to htmlParse and getURL:
require(RCurl)
require(XML)

options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
agent = "Chrome 39.0.2171.71 (64-bit)"
curl = getCurlHandle()
curlSetOpt(cookiejar = 'cookies.txt', useragent = agent, followlocation = TRUE, autoreferer = TRUE, curl = curl)
html <- getURL('http://164.100.47.132/LssNew/psearch/result12.aspx?dbsl=1008',
               maxredirs = as.integer(20), followlocation = TRUE, curl = curl,
               .encoding = 'UTF-8')
work <- htmlParse(html, encoding = 'UTF-8')
table <- xpathApply(work, "//table[@id = 'ctl00_ContPlaceHolderMain_DataList1']//font|//table[@id = 'ctl00_ContPlaceHolderMain_DataList1']//p", xmlValue)
> table[[2]]
[1] "¸ÉÒ iÉ{ÉxÉ ÊºÉEònù®ú (nù¨Énù¨É):\r\nºÉ¦ÉÉ{ÉÊiÉ ¨É½þÉänùªÉ, {ɽþ±Éä ÊnùxÉ ¨ÉèÆ ¤ÉÉä±É\r\n®ú½þÉ lÉÉ iÉÉä ¨ÉèÆxÉä =iiÉ®ú {ÉÚ´ÉÒÇ ¦ÉÉ®úiÉ Eòä\r\n+ÉiÉÆEò´ÉÉnù {É®ú =ºÉ ÊnùxÉ nùÉä {ɽþ±ÉÖ+ÉäÆ EòÉ =±±ÉäJÉ\r\nÊEòªÉÉ lÉÉ* +ÉVÉ ¦ÉÒ ¨ÉèÆ, ÊVÉºÉ EòÉ®úhÉ ºÉä +ÉiÉÆEò´ÉÉnù\r\n{ÉènùÉ ½þÖ+É, =ºÉEòä Ê´É¹ÉªÉ ¨ÉäÆ lÉÉäc÷É ºÉÉ =±±ÉäJÉ\r\nEò°üÆMÉÉ*"
Here's an alternative implementation using rvest. Not only is the code simpler, but you don't have to do anything with the encoding; rvest figures it out for you.
library("rvest")
url <- "http://164.100.47.132/LssNew/psearch/result12.aspx?dbsl=1008"
search <- html(url)
search %>%
html_node("#ctl00_ContPlaceHolderMain_DataList1") %>%
html_nodes("font, p") %>%
html_text() %>%
.[[2]]
#> [1] "¸ÉÒ iÉ{ÉxÉ ÊºÉEònù®ú (nù¨Énù¨É):\r\nºÉ¦ÉÉ{ÉÊiÉ ¨É½þÉänùªÉ, ...
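If autodetection ever guesses wrong, current rvest/xml2 also lets you force the encoding yourself via read_html()'s encoding argument; a short sketch:
library(rvest)

# Explicitly declare the page encoding when the server mis-declares or omits it
page <- read_html("http://164.100.47.132/LssNew/psearch/result12.aspx?dbsl=1008",
                  encoding = "UTF-8")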
