How to scrape from investing.com using the 'rusquant' package in R

I posted a similar question before (it is now closed and I deleted it). From that question I learned about the 'rusquant' package, and I thank the person who pointed me to it. I tried the following code in several unsuccessful attempts to scrape stock data from investing.com:
library(rusquant)
all_stocks <- getSymbolList(src = "Investing", country = "Bangladesh")
head(all_stocks, 4)
from_date <- as.Date("2021-01-01")
grameenphone <- getSymbols('GRAE', src = 'Investing', from = from_date, auto.assign = F)
grameenphone <- getSymbols.Investing('GRAE', from = from_date, auto.assign = F)
Now, the getSymbolList function works. But when I try to scrape data for a particular stock, following the method from https://github.com/arbuzovv/rusquant, I get an error, as follows:
grameenphone <- getSymbols('GRAE', src = 'Investing', from = from_date, auto.assign = F)
‘getSymbols’ currently uses auto.assign=TRUE by default, but will
use auto.assign=FALSE in 0.5-0. You will still be able to use
‘loadSymbols’ to automatically load data. getOption("getSymbols.env")
and getOption("getSymbols.auto.assign") will still be checked for
alternate defaults.
This message is shown once per session and may be disabled by setting
options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.
Error in curl::curl_fetch_memory(url, handle = handle) :
Unrecognized content encoding type. libcurl understands deflate, gzip content encodings.
Then I tried getSymbols.Investing function. But I get following error:
grameenphone <- getSymbols.Investing('GRAE', from = from_date, auto.assign = F)
Error in missing(verbose) : 'missing' can only be used for arguments
Please help me out here. I'm new to coding, so I apologize if I've done anything silly. Thanks in advance.

Related

Handle "500" server error type during API request using rentrez

I am trying to recover some IDs linked to names using the rentrez package, an R wrapper around the Entrez API, with this code (a short list of queries as an example):
vect_names <- c("Theileria sergenti","Dipodascus ambrosiae","Dipodascus armillariae","Dipodascus macrosporus")
idseq <- lapply(vect_names, function(x){
query <- entrez_search(db = "taxonomy", term = x)
return(query$ids)
})
Now, this code works for me as long as I get no server errors (type 500), which stop my requests. For small numbers of queries this is not a problem, but I have around 40k queries to send, so the error is bound to occur.
This is the error:
Error: HTTP failure: 500
{"error":"error forwarding request","api-key":"xxx.xx.xx.xxx","type":"ip",
"status":"ok"}
I did some research and I think I need to wrap this code in a try/except construct. However, the documentation is pretty intimidating to me, and I don't see how I can replicate the server error in order to build a reproducible example. Also, because my full request will run for several hours, testing multiple versions of a try/except until I am sure my code handles the error could take a long time.
So what I am looking for here is a version of this first piece of code that will keep requesting the same vector element until it gets the result (until the HTTP failure resolves, which should be a matter of seconds).
Thanks!
The sample list you provided doesn't give an error for me, but you can use tryCatch like this:
idseq <- lapply(vect_names, function(x) {
  query <- tryCatch(entrez_search(db = "taxonomy", term = x),
                    error = function(e) NA)
  if (is.list(query)) query$ids else NA
})
If there's an error, query will be NA, so I changed the returned value to check for that.
After some research, I found I needed to use the tryCatch function coupled with Sys.sleep:
idseq <- lapply(vect_names, function(x) {
  tryCatch(
    {
      query <- entrez_search(db = "taxonomy", term = x)
      query$ids
    },
    error = function(e) {
      Sys.sleep(60)  # if error (most probably a type 500 server error), sleep 60 s, then retry
      query <- entrez_search(db = "taxonomy", term = x)
      query$ids
    }
  )
})
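If the service stays down longer than one sleep, the single retry above will still abort the lapply. One way to keep requesting until success, with a cap so the loop cannot run forever, is a small base-R helper; a minimal sketch, where retry_or_na and its defaults are made-up names and the rentrez call in the comment is only the intended use:

```r
# Sketch: call a function that may throw, pausing between attempts,
# and return NA once max_attempts is exhausted.
retry_or_na <- function(f, max_attempts = 5, wait = 60) {
  for (attempt in seq_len(max_attempts)) {
    out <- tryCatch(f(), error = function(e) e)
    if (!inherits(out, "error")) return(out)     # success: return the result
    if (attempt < max_attempts) Sys.sleep(wait)  # transient error: pause, retry
  }
  NA  # gave up after max_attempts failures
}

# Intended use with rentrez (network call, shown for context only):
# idseq <- lapply(vect_names, function(x)
#   retry_or_na(function() entrez_search(db = "taxonomy", term = x)$ids))
```

Returning NA instead of re-throwing means a name that fails repeatedly is skipped rather than killing a multi-hour run; you can scan the result for NAs afterwards and re-submit just those.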

How to correctly use the try function to ignore errors in a loop in R [duplicate]

I have a process in R in which I import a number of files.
Occasionally there are issues with some of the files; for example, there isn't an EOF character present in a file I'm reading in, so the read.table statement errors.
As there are a lot of files to process, this is difficult to manage manually, so I would like to use some error trapping to alert the user to the issue and carry on with the other files.
I have tried using try, and referenced the SO post What is the R equivalent for Excel IFERROR?
Below, I would like to test the import and then, depending on the result, either give a message to the user or actually import the file.
mtry <- try(read.table("~/file_location/test_file.csv",
                       fill = TRUE,
                       stringsAsFactors = FALSE))
if (!inherits(mtry, "try-error")) {
  read.table("~/file_location/test_file.csv",
             fill = TRUE,
             stringsAsFactors = FALSE)
} else {
  message("File doesn't exist, please check")
}
The issue is that the try() statement is producing an error in the log which is what I'm trying to avoid.
Thanks
Try suppressing the error report by specifying try(..., silent = TRUE) (see also ?try). I tested the code below with a non-existent dummy file, checking the result with inherits(mtry, "try-error") (safer than comparing class(mtry) directly, since an object can carry more than one class), and it works fine.
some_dummy_file <- "data/dummy.csv"
mtry <- try(read.table(some_dummy_file, sep = ",", header = TRUE),
            silent = TRUE)
if (!inherits(mtry, "try-error")) {
  read.table(some_dummy_file, sep = ",", header = TRUE)
} else {
  message("File doesn't exist, please check")
}
And here is the console output.
> File doesn't exist, please check
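The same try() pattern extends to a whole batch of files; a minimal base-R sketch (read_all_csvs is a made-up name) that imports what it can and reports the rest:

```r
# Sketch: import every CSV in a vector of paths, skipping (and reporting)
# the ones that fail to parse instead of aborting the whole loop.
read_all_csvs <- function(files) {
  results <- list()
  for (f in files) {
    mtry <- try(read.table(f, sep = ",", header = TRUE, fill = TRUE,
                           stringsAsFactors = FALSE), silent = TRUE)
    if (inherits(mtry, "try-error")) {
      message("Could not read ", f, ", skipping")  # alert the user, carry on
    } else {
      results[[f]] <- mtry
    }
  }
  results  # named list of the successfully imported data frames
}
```

Afterwards, setdiff(files, names(results)) gives the list of files that need manual attention.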

R - form web scraping with rvest

First I'd like to take a moment and thank the SO community,
You helped me many times in the past without me needing to even create an account.
My current problem involves web scraping with R. Not my strong point.
I would like to scrape http://www.cbs.dtu.dk/services/SignalP/
What I have tried:
library(rvest)
url <- "http://www.cbs.dtu.dk/services/SignalP/"
seq <- "MTSKTCLVFFFSSLILTNFALAQDRAPHGLAYETPVAFSPSAFDFFHTQPENPDPTFNPCSESGCSPLPVAAKVQGASAKAQESDIVSISTGTRSGIEEHGVVGIIFGLAFAVMM"
session <- rvest::html_session(url)
form <- rvest::html_form(session)[[2]]
form <- rvest::set_values(form, `SEQPASTE` = seq)
form_res_cbs <- rvest::submit_form(session, form)
#rvest prints out:
Submitting with 'trunc'
rvest::html_text(rvest::html_nodes(form_res_cbs, "head"))
#output:
"Configuration error"
rvest::html_text(rvest::html_nodes(form_res_cbs, "body"))
#output:
"Exception:WebfaceConfigErrorPackage:Webface::service : 358Message:Unhandled #parameter 'NULL' in form "
I am unsure what the unhandled parameter is.
Is the problem in the submit button? I cannot seem to force:
form_res_cbs <- rvest::submit_form(session, form, submit = "submit")
#rvest prints out
Error: Unknown submission name 'submit'.
Possible values: trunc
Is the problem that submit$name is NULL?
form[["fields"]][[23]]
I tried defining the fake submit button as suggested in Submit form with no submit button in rvest, with no luck.
I am open to solutions using rvest or RCurl/httr; I would like to avoid RSelenium.
EDIT: thanks to hrbrmstr's awesome answer, I was able to build a function for this task. It is available in the ragp package: https://github.com/missuse/ragp
Well, this is doable. But it's going to require elbow grease.
This part:
library(rvest)
library(httr)
library(tidyverse)
POST(
  url = "http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi",
  encode = "form",
  body = list(
    `configfile` = "/usr/opt/www/pub/CBS/services/SignalP-4.1/SignalP.cf",
    `SEQPASTE` = "MTSKTCLVFFFSSLILTNFALAQDRAPHGLAYETPVAFSPSAFDFFHTQPENPDPTFNPCSESGCSPLPVAAKVQGASAKAQESDIVSISTGTRSGIEEHGVVGIIFGLAFAVMM",
    `orgtype` = "euk",
    `Dcut-type` = "default",
    `Dcut-noTM` = "0.45",
    `Dcut-TM` = "0.50",
    `graphmode` = "png",
    `format` = "summary",
    `minlen` = "",
    `method` = "best",
    `trunc` = ""
  ),
  verbose()
) -> res
Makes the request you made. I left verbose() in so you can watch what happens. It's missing the "filename" field, but you specified the string, so it's a good mimic of what you did.
Now, the tricky part is that it uses an intermediary redirect page that gives you a chance to enter an e-mail address for notification when the query is done. It does do a regular (every ~10s or so) check to see if the query is finished and will redirect quickly if so.
That page has the query id which can be extracted via:
content(res, as = "parsed") %>%
  html_nodes("input[name='jobid']") %>%
  html_attr("value") -> jobid
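If you only need the jobid, it can also be pulled out of the raw HTML with a base-R regex rather than rvest; a rough sketch, where the page string is a fabricated stand-in for the real intermediate page:

```r
# The HTML below is a made-up example of the hidden jobid field on the
# intermediate wait page; the real page has more markup around it.
page <- '<form><input type="hidden" name="jobid" value="ABC123DEF456"></form>'
# Grab the name="jobid" ... value="..." fragment, then keep just the value.
frag  <- regmatches(page, regexpr('name="jobid" value="[^"]+"', page))
jobid <- sub('.*value="([^"]+)".*', "\\1", frag)
```

This trades robustness for one less dependency; for anything beyond a single attribute, proper parsing with rvest/xml2 is the better choice.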
Now, we can mimic the final request, but I'd add in a Sys.sleep(20) before doing so to ensure the report is done.
GET(
  url = "http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi",
  query = list(
    jobid = jobid,
    wait = "20"
  ),
  verbose()
) -> res2
That grabs the final results page:
html_print(HTML(content(res2, as="text")))
You can see images are missing because GET only retrieves the HTML content. You can use functions from rvest/xml2 to parse through the page and scrape out the tables and the URLs that you can then use to get new content.
To do all this, I used Burp Suite to intercept a browser session and then my burrp R package to inspect the results. You can also inspect things visually in Burp Suite and build the requests more manually.

R tm package readPDF error in strptime(d, fmt) : input string too long

I would like to do text mining on the files on this website using the tm package. I am using the following code to download one of the files (e.g., abell.pdf) to my working directory and attempt to store its contents:
library("tm")
url <- "https://baltimore2006to2010acsprofiles.files.wordpress.com/2014/07/abell.pdf"
filename <- "abell.pdf"
download.file(url = url, destfile = filename, method = "curl")
doc <- readPDF(control = list(text = "-layout"))(elem = list(uri = filename),
language = "en", id = "id1")
But I receive the following error and warnings:
Error in strptime(d, fmt) : input string is too long
In addition: Warning messages:
1: In grepl(re, lines) : input string 1 is invalid in this locale
2: In grepl(re, lines) : input string 2 is invalid in this locale
The pdfs aren't particularly long (5 pages, 978 KB), and I have been able to successfully use the readPDF function to read in other pdf files on my Mac OSX. The information I want most (the total population for the 2010 census) is on the first page of each pdf, so I've tried shortening the pdf to just the first page, but I get the same message.
I am new to the tm package, so I apologize if I am missing something obvious. Any help is greatly appreciated!
Based on what I've read, this error has to do with the way readPDF tries to build metadata for the file you're importing. You can change how the metadata are read using the info option. For example, I usually circumvent this error by modifying the command as follows (using your code):
doc <- readPDF(control = list(info = "-f", text = "-layout"))(elem = list(uri = filename),
                                                              language = "en", id = "id1")
The addition of info = "-f" is the only change. This doesn't really "fix" the problem, but it bypasses the error. Cheers :)

Adding rows to a Google Sheet using the R Package googlesheets

I'm using the googlesheets package (CRAN version, but available here: https://github.com/jennybc/googlesheets) to read data from a Google Sheet in R, but would now like to add rows. Unfortunately, every time I use gs_add_row on an existing sheet, I get the following error:
Error in gsheets_POST(lf_post_link, XML::toString.XMLNode(new_row)) :
client error: (405) Method Not Allowed
I followed the tutorial on Github to create a sheet and add rows as follows:
library(googlesheets)
library(dplyr)
df.colnames <- c("Project Short Name","Project Start Date","Proj Stuff")
my.df <- data.frame(a = "cannot be empty", b = "cannot be empty", c = "cannot be empty")
colnames(my.df) <- df.colnames
## Create a new workbook populated by this data.frame:
mynewSheet <- gs_new("mynewsheet", input = my.df, trim = TRUE)
## Append Element
mynewSheet <- mynewSheet %>% gs_add_row(input = c("a","b","c"))
mynewKey <- mynewSheet$sheet_key
Rows are added successfully; I even get the cheery message Row successfully appended.
I now provide mynewKey to gs_key, as I would if this were an existing sheet I was working with, and attempt to add a new row using gs_add_row (note: before evaluating these lines, I navigate to the Google Sheet and make it public on the web):
myExistingWorkbook <- gs_key(mynewKey, visibility = "public")
## Attempt to gs_add_row
myExistingWorkbook <- myExistingWorkbook %>% gs_add_row(input = c("a","b","c"), ws="Sheet1", verbose = TRUE)
Error in gsheets_POST(lf_post_link, XML::toString.XMLNode(new_row)) :
client error: (405) Method Not Allowed
Things that I have tried:
1) Published the Google Sheet to the web (as per https://github.com/jennybc/googlesheets/issues/126#issuecomment-118751652)
2) Enabled the sheet as editable to the public
Notes
In my actual use case, I have an existing Google Sheet with many worksheets that I would like to add rows to. I have tried to use a minimal example here to understand my error; I can also provide a link to the specific worksheet I would like to update.
I have raised an issue on the package's GitHub page: https://github.com/jennybc/googlesheets/issues/168
googlesheets::gs_add_row() and googlesheets::gs_edit_cells() make POST requests to the Sheets API. This requires that the visibility be set to "private".
Above, when you register the Sheet by key, please do so like this:
gs_key(mynewKey, visibility = "private")
If you want this to work even for Sheets you've never visited in the browser, then add lookup = FALSE as well:
gs_key(mynewKey, lookup = FALSE, visibility = "private")
