Error in open.connection(x, "rb") : http error 403 - r

My code returns the error message "Error in open.connection(x, "rb") : HTTP error 403." when scraping indeed.com:
library(rvest)

for (i in 0:(count - 1)) {
  # progress$inc(i / count, detail = paste0("https://www.indeed.com/jobs?q=", URLencode(job), "&start=", i * 15))
  print(paste0("https://www.indeed.com/jobs?q=", job, "&start=", i * 15))
  page <- read_html(paste0("https://www.indeed.com/jobs?q=", URLencode(job), "&start=", i * 15))
  jobcards <- html_node(page, "#mosaic-provider-jobcards")
  job_links <- html_nodes(jobcards, 'a[id^="job"]')
}
This code worked well half a year ago. Is the error caused by an anti-crawler system? Is there anything I can do to fix it?
My program scrapes indeed.com for a given list of job titles and technologies in order to analyze how often each technology is mentioned.
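The word-frequency step described above could be sketched roughly as follows, independent of how the pages are fetched. The helper names `build_search_urls` and `count_technologies` and the sample postings are made up for illustration; only the URL pattern and the step of 15 come from the code in the question.

```r
# Hypothetical helpers; only the URL pattern is taken from the question.

# Build one search URL per result page (the question steps `start` by 15).
build_search_urls <- function(job, count, page_size = 15) {
  if (count < 1) return(character(0))
  starts <- (seq_len(count) - 1) * page_size
  paste0("https://www.indeed.com/jobs?q=", URLencode(job, reserved = TRUE),
         "&start=", starts)
}

# Count how many postings mention each technology. Matching is
# case-insensitive and literal, so names like "C++" need no regex escaping.
count_technologies <- function(postings, technologies) {
  postings <- tolower(postings)
  counts <- vapply(
    tolower(technologies),
    function(tech) sum(grepl(tech, postings, fixed = TRUE)),
    integer(1)
  )
  names(counts) <- technologies
  sort(counts, decreasing = TRUE)
}

# Made-up sample postings, for illustration only.
postings <- c("We use Python and SQL daily", "SQL and Spark", "SQL Server DBA")
counts <- count_technologies(postings, c("SQL", "Python", "Spark"))
print(counts)
```

Because this is plain substring matching, very short names such as "R" or "C" would match inside other words; a word-boundary regular expression would probably be needed for those.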

Related

Downloading Financial Statements in R with finstr

I'm trying to download financial statements in R using the finstr package (Financial statements in R).
I'm trying to modify the example in its README for other companies. I have tried to download the last two Tesla 10-Qs.
The code I modified so far is:
xbrl_url2017Q3 <- "https://www.sec.gov/Archives/edgar/data/1318605/000156459018026353/tsla-20180930.xml"
xbrl_url2017Q2 <- "https://www.sec.gov/Archives/edgar/data/1318605/000156459018019254/tsla-20180630.xml"
old_o <- options(stringsAsFactors = FALSE)
xbrl_data_tsla2017Q3 <- xbrlDoAll(xbrl_url2017Q3)
The error from the line above is:
Error in fileFromCache(file) :
Error in download.file(file, cached.file, quiet = !verbose) :
cannot open URL 'https://www.sec.gov/Archives/edgar/data/1318605/000156459018026353/https://xbrl.sec.gov/dei/2018/dei-2018-01-31.xsd'
In addition: Warning message:
In download.file(file, cached.file, quiet = !verbose) :
cannot open URL 'https://www.sec.gov/Archives/edgar/data/1318605/000156459018026353/https://xbrl.sec.gov/dei/2018/dei-2018-01-31.xsd': HTTP status was '403 Forbidden'
xbrl_data_tsla2017Q2 <- xbrlDoAll(xbrl_url2017Q2)
options(old_o)
tsla2017Q3 <- xbrl_get_statements(xbrl_data_tsla2017Q3)
tsla2017Q2 <- xbrl_get_statements(xbrl_data_tsla2017Q2)
tsla2017Q2
balance_sheet2017Q2 <- tsla2017Q2$StatementOfFinancialPositionClassified
balance_sheet2017Q3 <- tsla2017Q3$StatementOfFinancialPositionClassified
income2017Q2 <- tsla2017Q2$StatementOfIncome
income2017Q3 <- tsla2017Q3$StatementOfIncome
balance_sheet2017Q3
Returns "NULL"
See the 10-Q among Tesla's SEC filings.
The last 10-Q.
Any recommendations on how I can go about this?
I want to download the financial data to play around with it, and would like it in a tidy format.
This is a common problem with the XBRL package: for some SEC filings, not all XML schemas are downloaded into the cache. Download the missing schema into your cache folder and retry the xbrlDoAll call; it should work this time.
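A minimal sketch of that manual download, using the schema URL from the error message above. `cached_schema_path` and `fetch_schema` are hypothetical helper names. It also assumes that the XBRL package looks up cached schemas by file name inside its `cache.dir` folder, and that this folder is "xbrl.Cache"; check both against your installed version.

```r
# Sketch only: both helpers are hypothetical, not part of the XBRL package.
# Assumption: cached schemas are looked up by file name inside `cache_dir`.

# Where a schema with this URL would be expected in the cache folder.
cached_schema_path <- function(schema_url, cache_dir = "xbrl.Cache") {
  file.path(cache_dir, basename(schema_url))
}

# Download the schema into the cache folder unless it is already there.
fetch_schema <- function(schema_url, cache_dir = "xbrl.Cache") {
  dest <- cached_schema_path(schema_url, cache_dir)
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  if (!file.exists(dest)) {
    download.file(schema_url, dest, mode = "wb")
  }
  invisible(dest)
}

# Usage (needs network access and the XBRL package, so left commented out):
# fetch_schema("https://xbrl.sec.gov/dei/2018/dei-2018-01-31.xsd")
# xbrl_data_tsla2017Q3 <- XBRL::xbrlDoAll(xbrl_url2017Q3, cache.dir = "xbrl.Cache")
```

If the retry then fails on a different schema, the same helper can be called again with the URL from the new error message.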

HTTP 403 error when using lookupUsers for a list of twitter handles

I have a list of Twitter handles in a CSV file and I am trying to extract data for all of them. My CSV contains around 200 Twitter handles.
users <- read.csv("Twitter.csv")
users1 <- lookupUsers(users[1:nrow(users),1])
However, I am getting the following error:
Error in twInterfaceObj$doAPICall(paste("users", "lookup", sep = "/"), :
Forbidden (HTTP 403).
Does anybody know why I am getting this error and how I can fix it?
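Two things might be worth ruling out before looking at authentication. First, read.csv may have returned the handles as a factor, or with stray whitespace or a leading "@". Second, the request may cover more users than a single lookup allows. The sketch below cleans the handles and splits them into batches. `clean_handles` and `batch_handles` are hypothetical helpers, and the batch size of 100 is an assumption based on the per-request limit of Twitter's users/lookup endpoint. Whether either issue is the actual cause of this 403 is not established by the question.

```r
# Hypothetical helpers, not part of twitteR.

# Turn a CSV column into a clean character vector of handles.
clean_handles <- function(x) {
  x <- trimws(as.character(x))      # factors become plain strings
  x <- sub("^@", "", x)             # drop a leading "@"
  unique(x[!is.na(x) & nzchar(x)])  # remove NA, empty, and duplicate entries
}

# Split handles into batches of at most `size` (100 is an assumed limit).
batch_handles <- function(handles, size = 100) {
  split(handles, ceiling(seq_along(handles) / size))
}

# Usage (needs twitteR and an authenticated session, so left commented out):
# users <- read.csv("Twitter.csv", stringsAsFactors = FALSE)
# batches <- batch_handles(clean_handles(users[[1]]))
# users1 <- do.call(c, lapply(batches, twitteR::lookupUsers))
```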

quantmod - getQuote() - '403 Forbidden'

I found an answer to my own question (see below). Still need help.
In the same package, quantmod, there is a function called getSymbols.google.
If I use it to get the Microsoft price, for example, it works fine:
getSymbols.google('MSFT', environment() , src="google", from = (Sys.Date() - 1))
[1] "MSFT"
But I can't make it work on a currency pair:
getSymbols.google("GBPUSD", environment() , src="google", from = (Sys.Date() - 1))
Error in download.file(paste(google.URL, "q=", Symbols.name, "&startdate=", :
cannot open URL 'http://finance.google.com/finance/historical?q=GBPUSD&startdate=Nov+02,+2017&enddate=Nov+03,+2017&output=csv'
In addition: Warning message:
In download.file(paste(google.URL, "q=", Symbols.name, "&startdate=", :
cannot open URL 'http://finance.google.com/finance/historical?q=GBPUSD&startdate=Nov+02,+2017&enddate=Nov+03,+2017&output=csv': HTTP status was '400 Bad Request'
Any ideas?
Good morning,
Since the 1st of November I've been having trouble with the getQuote function. It is a function in the "quantmod" package that uses the Yahoo API to request the information.
The description of the function is as follows: "Fetch current stock quote(s) from specified source. At present this only handles sourcing quotes from Yahoo Finance, but it will be extended to additional sources over time."
In R, I'm getting the following error: "HTTP status was '403 Forbidden'"
I've looked in my browser, and the error comes from the following message on the Yahoo web page: "Fetch current stock quote(s) from specified source. At present this only handles sourcing quotes from Yahoo Finance, but it will be extended to additional sources over time."
Does anybody know how to solve it, or know of any alternatives to the function getQuote()?
Here is an example from RStudio
getQuote("AAPL")
Error in download.file(paste("https://finance.yahoo.com/d/quotes.csv?s=", :
cannot open URL 'https://finance.yahoo.com/d/quotes.csv?s=AAPL&f=d1t1l1c1p2ohgv'
In addition: Warning message:
In download.file(paste("https://finance.yahoo.com/d/quotes.csv?s=", :
cannot open URL 'https://finance.yahoo.com/d/quotes.csv?s=AAPL&f=d1t1l1c1p2ohgv': HTTP status was '403 Forbidden'
Thanks
It seems that Yahoo has discontinued this service. Is anyone aware of an alternative to Yahoo? (I'd rather not have to web-scrape Yahoo for this.)
rob
I ran into the same problem... it's kludgey but as a workaround to get the end-of-day value, I have found this to work for now:
Instead of getQuote() to get the Last price (which doesn't seem to work from Yahoo anymore):
underlying<-"AAPL"
quote.last <-getQuote(underlying)$Last
I use "getSymbols" which still works-- throws it into a new data frame, and I pull out the value I want from that:
Hx<-getSymbols(underlying,from=Sys.Date()-1) # allows me to not have to retain the ticker name if I do this across many tickers
quote.last<-as.double(tail(Cl(get(Hx)),1)) # Closing price value from last row of data
rm(list=Hx) # throw away the temporary data frame with quote history
I'm sure there's a more elegant way to do it, but this is what fell out of my brain as a quick workaround that got it done. Sadly, it doesn't get things like the Bid and Ask that getQuote does.
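One possible tidier variant of the same workaround is sketched below. It asks getSymbols to return the data (`auto.assign = FALSE`) rather than creating a variable that has to be removed afterwards. `fetch_history` and `last_close` are hypothetical helper names, and the seven-day look-back is an assumption intended to cover weekends and holidays.

```r
# Sketch only: both helpers are hypothetical, not part of quantmod.

# Download recent daily history as a return value instead of assigning it
# into the workspace. Needs quantmod and network access when called.
fetch_history <- function(ticker, days_back = 7) {
  quantmod::getSymbols(ticker, from = Sys.Date() - days_back,
                       auto.assign = FALSE)
}

# Most recent closing price: last row of the column whose name ends in "Close".
# `fetch` can be swapped out, e.g. for a stub when working offline.
last_close <- function(ticker, fetch = fetch_history) {
  history <- fetch(ticker)
  close_col <- grep("Close$", colnames(history))[1]
  as.numeric(tail(history[, close_col], 1))
}

# Usage across many tickers (needs network access, so left commented out):
# sapply(c("AAPL", "MSFT"), last_close)
```

Like the original workaround, this only yields the latest daily close, not the Bid and Ask.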

url_absolute error "not compatible with STRSXP" when using submit_form

I am trying to scrape the http://www.emedexpert.com/lists/brand-generic.shtml web page for brand and generic drug names:
library(httr)
library(rvest)
session <- read_html("http://www.emedexpert.com/lists/brand-generic.shtml")
form1 <- html_form(session)[[2]]
form2 <- set_values(form1, brand = "tylenol")
submit_form(session, form2)
However, this results in the error message:
Error in xml2::url_absolute(form$url, session$url) :
not compatible with STRSXP
Therefore, based on this answer to the same error message ("Error: not compatible with STRSXP" on submit_form with rvest) I added a session$url as follows:
session$url <- "http://www.emedexpert.com/lists/brand-generic.shtml" # added from S.Ov
but I still get the same error message. So I also tried setting form2$url to various values, such as these:
form2$url <- "http://www.emedexpert.com/lists/brand-generic.shtml"
form2$url <- ""
form2$url <- "/"
submit_form(session, form2)
At this point, the error message goes away and I obtain a web page that contains most of the desired page. However, it seems to completely lack the table of brand and generic names.
Any suggestions?
Yes @hackR, RSelenium is not always the answer.
library(rvest)
url<-"http://www.emedexpert.com/lists/bg.php?myc"
page<-html_session(url)
table<-html_table(read_html(page))[[1]]
I hope this helps.

Error in open.connection(x, "rb") : HTTP error 405

While trying to extract data from Glassdoor, I got the following error.
Error in open.connection(x, "rb") : HTTP error 405.
Here is the code:
rm(list=ls())
library("rvest")
htmlpage <- read_html("https://www.glassdoor.co.uk/Reviews/Google-Reviews-E9079.htm")
forecasthtml <- html_nodes(htmlpage, ".summary")
SelectorGadget was used to select just the headline of each review, which is given by .summary in the code above.
Is it because extracting data is not allowed, or is there a fundamental mistake in my code?
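One way to tell those two cases apart is to look at the HTTP response itself instead of letting read_html() abort. The sketch below is a rough illustration using httr. `describe_status` and `fetch_page` are hypothetical helpers, and the user-agent string is a placeholder. Sending a user agent is not guaranteed to change the server's answer, since a site may reject automated requests regardless.

```r
# Sketch only: both helpers are hypothetical.

# Describe an HTTP status code in words.
describe_status <- function(code) {
  if (code == 403) {
    "403 Forbidden: the server refused the request"
  } else if (code == 405) {
    "405 Method Not Allowed: the server rejected the request method"
  } else if (code >= 400) {
    paste(code, "error response")
  } else {
    paste(code, "ok")
  }
}

# Fetch a page with httr and parse it only if the server answered
# successfully. Needs httr, xml2, and network access when called.
fetch_page <- function(url, agent = "my-r-script (placeholder contact)") {
  resp <- httr::GET(url, httr::user_agent(agent))
  code <- httr::status_code(resp)
  if (code >= 400) {
    message(describe_status(code), " for ", url)
    return(NULL)
  }
  xml2::read_html(httr::content(resp, as = "text", encoding = "UTF-8"))
}

# Usage (needs network access, so left commented out):
# htmlpage <- fetch_page("https://www.glassdoor.co.uk/Reviews/Google-Reviews-E9079.htm")
# if (!is.null(htmlpage)) forecasthtml <- rvest::html_nodes(htmlpage, ".summary")
```

A 4xx status here means the server rejected the request before any selector was applied, which would point away from a mistake in the html_nodes() call.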

Resources