Getting FTP data with RCurl in R

I can access an FTP site with Chrome but not with Internet Explorer, I think because of a company restriction. Perhaps for that reason, I cannot download the FTP data with RCurl in R either. Is there any way to make R use the same setup as Chrome to download FTP data?
Thanks
library(RCurl)

url <- c("myUrl")
x <- getURL(url, userpwd = "user:password", connecttimeout = 60)
writeLines(x, "Append.txt")

The package RCurl does not use a web browser to access FTP sites. It uses libcurl, as it says in the documentation. The problem you encounter should therefore be solved within the constraints of libcurl.
Also, if one web browser on your computer can access a site and another can't, it need not be a problem with the web browser per se. The most common problem is the way files or paths are referenced, such as whether or not one includes a trailing / with a pathname (never with a filename, of course). Perhaps this is the case for you?
Otherwise there may be a problem with your FTP settings: libcurl is pretty smart about guessing things right, but it is possible to tweak all sorts of settings in case the defaults do not work, for example (from the manual):
# List the directory contents first (ftp.use.epsv = FALSE avoids extended passive mode)
filenames = getURL(url, ftp.use.epsv = FALSE, ftplistonly = TRUE)
# Deal with newlines as \n or \r\n. (BDR)
# Alternatively, instruct libcurl to change \n's to \r\n's for us with crlf = TRUE:
# filenames = getURL(url, ftp.use.epsv = FALSE, ftplistonly = TRUE, crlf = TRUE)
filenames = paste(url, strsplit(filenames, "\r*\n")[[1]], sep = "")
con = getCurlHandle(ftp.use.epsv = FALSE)
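For a straightforward file download, something along these lines should work; this is only a sketch, with a placeholder host, path and credentials:
library(RCurl)

# Placeholder URL and credentials -- substitute your own values.
ftp_url <- "ftp://ftp.example.com/path/to/file.txt"

# userpwd hands the credentials to libcurl; ftp.use.epsv = FALSE helps behind
# firewalls or proxies that block extended passive mode.
contents <- getURL(ftp_url,
                   userpwd = "user:password",
                   ftp.use.epsv = FALSE,
                   connecttimeout = 60)

writeLines(contents, "Append.txt")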
If this doesn't help, it might help us help you if you give us more complete information. What is this myUrl in url <- c("myUrl"), for example? Is it a filename? A pathname?

Related

How to avoid "too many redirects" error when using readLines(url) in R?

I am trying to mine news articles from various sources by doing:
site = readLines(link)
with link being the URL of the site I am trying to download. Most of the time this works, but with some specific sources I get the error:
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") : too many redirects, aborting ...
I'd like to avoid this, but so far I have had no success in doing so.
Replicating the problem is quite easy, as virtually none of the New York Times links work, e.g.
http://www.nytimes.com/2014/08/01/us/politics/african-leaders-coming-to-talk-business-may-also-be-pressed-on-rights.html
It seems like the NYT site forces redirects for cookie and tracking purposes, and the built-in URL reader isn't able to deal with them correctly (I'm not sure it supports cookies, which is probably the problem).
Anyway, you might consider using the RCurl package to access the file instead. Try
library(RCurl)

link <- "http://www.nytimes.com/2014/08/01/us/politics/african-leaders-coming-to-talk-business-may-also-be-pressed-on-rights.html?_r=0"
site <- getURL(link, .opts = curlOptions(
  cookiejar = "",              # turn on libcurl's in-memory cookie engine
  useragent = "Mozilla/5.0",   # some sites reject the default libcurl user agent
  followlocation = TRUE        # follow the redirect chain to the final page
))
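If a site still bounces the request around too many times, you can also cap the number of redirects libcurl will follow; a sketch, assuming the maxredirs curl option is available in your RCurl build:
site <- getURL(link, .opts = curlOptions(
  cookiejar = "",
  useragent = "Mozilla/5.0",
  followlocation = TRUE,
  maxredirs = 10    # give up after 10 redirects instead of looping indefinitely
))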

OAuth2 GitHub API token

I have a Client ID and Client Secret after setting up an application on GitHub, but I'm not sure what the URL or the callback URL is meant to be for that, which I think is causing me problems.
I also have a private repo that I would like the application to access.
I would like to access the private repo from R, and I have found some packages that might help, including ROAuth and oauth, but I'm not sure how to use them to get the tokens: they tend to require a bunch of URLs to make the requests from, and I don't know which URLs to use to request the tokens.
Looking at http://developer.github.com/v3/oauth/ doesn't seem to be amazingly helpful in terms of what I should feed into the oauth or Oauth2Authorize functions of the respective packages.
The end goal is to source files from the private repo, since source_url('private.repo.file.url') doesn't work.
I tried the basic authentication using curl through bash, but wasn't able to find a token.
Any walkthrough examples of how to go about doing this would be greatly appreciated.
P.S. this is a follow up question from r sourcing private repos from github
You just need to create an OAuth token at https://github.com/settings/tokens
and fetch the required file via the GitHub API using code like the following:
library(RCurl)
library(jsonlite)   # for fromJSON()
library(devtools)   # for source_url()

jsonRawFile <- fromJSON(getURL(
  "https://api.github.com/repos/USERNAME/REPONAME/contents/filename.R",
  httpheader = c(Authorization = "token 38ebb0393fe1757ffde9c45d81adzzzzzzzzz",
                 "User-Agent" = "RCurl"),
  .opts = list(ssl.verifypeer = FALSE)))

source_url(jsonRawFile$download_url)
The format of the Authorization header must be exactly "token " followed by the token from your account.
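If you prefer not to assemble the header and JSON handling by hand, the httr package can make the same request; this is only a sketch, and the token and repository path are placeholders:
library(httr)
library(devtools)

# Placeholder token and repository path -- substitute your own values.
token <- "YOUR_PERSONAL_ACCESS_TOKEN"
api_url <- "https://api.github.com/repos/USERNAME/REPONAME/contents/filename.R"

resp <- GET(api_url, add_headers(
  Authorization = paste("token", token),   # "token " + personal access token
  `User-Agent` = "httr"                    # GitHub requires a User-Agent header
))
stop_for_status(resp)

file_info <- content(resp, as = "parsed")  # parsed JSON as an R list
source_url(file_info$download_url)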

Reading information from a password protected site

I have been using readLines() to scrape information from a website in an R tutorial. I now wish to extract data from my own website (specifically the awstats data); however, the domain is password protected.
Is there a way to pass the URL for the specific awstats data I require along with a username and password?
The format of the URL is:
http://domain.name:port/awstats.pl?month=02&year=2011&config=domain.name&lang=en&framename=mainright&output=alldomains
Thanks.
If it is indeed HTTP basic access authentication, the documentation on connections provides some help:
URLs
Note that https:// connections are only supported if --internet2 or setInternet2(TRUE) was used (to make use of Internet Explorer internals), and then only if the certificate is considered to be valid. With that option only, the http://user:pass@site notation for sites requiring authentication is also accepted.
So your URL string should look like this:
http://username:password@domain.name:port/awstats.pl?month=02&year=2011&config=domain.name&lang=en&framename=mainright&output=alldomains
This might be Windows-only though.
Hope this helps!
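A more portable route is to let libcurl handle the authentication through RCurl, which behaves the same on any platform; a sketch, with the URL and credentials as placeholders:
library(RCurl)

# Placeholder URL and credentials for the awstats page.
awstats_url <- "http://domain.name:port/awstats.pl?month=02&year=2011&config=domain.name&lang=en&framename=mainright&output=alldomains"

page <- getURL(awstats_url,
               userpwd = "username:password",
               httpauth = 1L)   # 1L = CURLAUTH_BASIC, i.e. basic authentication

awstats_lines <- strsplit(page, "\r*\n")[[1]]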
You can embed the username and password in the URL like:
http://userid:passw@domain.name:port/...
This you can try to use with readLines(). If that doesn't work, you can always try a workaround using url() to open the connection:
zz <- url("http://userid:passw@domain.name:port/...")
readLines(zz)
close(zz)
You can also download the file and save it somewhere using download.file():
download.file("theurl", "/path/to/file/filename", method = "wget")
This saves the file to the local path that is specified.
EDIT:
As csgillespie said, you shouldn't include your username and password in the script. If you run scripts with source() or interactively, you could add, e.g.:
user <- readline("Give the username : ")
passw <- readline("Give the password : ")
Url <- paste0("http://", user, ":", passw, "@domain.name...")
readLines(Url, ...)
When running from the command line, you could pass the arguments after --args and access them using commandArgs() (see ?commandArgs).
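For example, a script invoked as Rscript myscript.R username password (myscript.R is a placeholder name) could pick the credentials up like this; only a sketch:
# Arguments given after the script name (or after --args with R CMD BATCH)
args <- commandArgs(trailingOnly = TRUE)
user <- args[1]
passw <- args[2]

Url <- paste0("http://", user, ":", passw, "@domain.name...")
con <- url(Url)
dat <- readLines(con)
close(con)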
If you have access to the box, you could always just read the awstats log files. If you can ssh into the box, then you could easily sync the latest file using rsync.
The slight snag with using
http://username:password@domain...
is that you are putting your password in an R script, which is best avoided. Of course you can secure the script, but it only takes one slip. For example:
Someone asks you a similar question and you publish your script.
The URL http://username:password@domain... will(?) now show up in your server logs.
...
Formatting the URL as http://username:password@domain... for use with download.file didn't work for me, but R.utils provides the function downloadFile, which works perfectly:
require(R.utils)
downloadFile(myurl, myfile, username = "myusername", password = "mypassword")
See Joris Meys' answer for a way to avoid including your username and password in plain text in your script.
EDIT: Except it looks like downloadFile just reformats the URL to http://username:password@domain...? Hmm...
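If embedding the credentials in the URL is the sticking point, the httr package sends them as a proper Authorization header instead; a sketch, with a shortened placeholder URL and placeholder credentials:
library(httr)

resp <- GET("http://domain.name:port/awstats.pl?output=alldomains",
            authenticate("myusername", "mypassword", type = "basic"))
stop_for_status(resp)

awstats_html <- content(resp, as = "text")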

401 Unauthorised errors when attempting to download ASP page to file

Issue
Msxml2.ServerXMLHTTP keeps returning 401 - Unauthorised errors each time we attempt to read the contents of a file (ASP) from a web server.
Source server is running IIS6, using NTLM integrated login.
This process has been used successfully before, but only insofar as extracting XML files from external websites, not internal ones.
The proxy settings in the registry of the server on which the script runs have also been updated to bypass the website in question, but to no avail.
All paths identified in the VBScript have been checked and tested, and are correct.
User running the script has correct read/write permissions for all locations referenced in the script.
Solution needed
To identify the cause of the HTTP 401 Unauthorised messages, so that the script will work as intended.
Description
Our organisation operates an intranet, where the content is replicated to servers at each of our remote sites. This ensures these sites have continued fast access to important information, documentation and data, even in the event of losing connectivity.
We are in the middle of improving the listing and management of Forms (those pesky pieces of paper that have to be filled in for specific tasks). This involves establishing a database of all our forms.
However, as the organisation hasn't been smart enough to invest in MSSQL Server instances at each site, replicating the database and accessing it from the local SQL server isn't an option.
To work around this, I have constructed a series of views (ASP pages) which display the required data. I then intend to use Msxml2.ServerXMLHTTP from VBScript, so I can read the resulting pages and save the output to a static file back on the server.
From there, the existing replication process can stream these files out to the site, with users having no idea that they're looking at a static page that just happened to be generated from database output.
Code
' Forms - Static Page Generator
' Implemented 2011-02-15 by Michael Harris
' Purpose: To download the contents of a page, and save that page to a static file.

' Target category: 1 (Contracts)
' Target Page:
'   http://sharename.fpc.wa.gov.au/corporate/forms/generator/index.asp
' Target path: \\servername\sharename\corporate\forms\index.asp
' Resulting URL: http://sharename.fpc.wa.gov.au/corporate/forms/index.asp

' If the file has been set to read only by an automated process,
' turn off the read-only flag so it can be edited.
Const READ_ONLY = 1

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.GetFile("\\server\sharename\corporate\forms\index.asp")

If objFile.Attributes AND READ_ONLY Then
    objFile.Attributes = objFile.Attributes XOR READ_ONLY
End If

Dim webObj, strURL
Set webObj = CreateObject("Msxml2.ServerXMLHTTP")
strURL = "http://sharename.fpc.wa.gov.au/corporate/forms/generator/index.asp"

webObj.Open "GET", strURL
webObj.send

If webObj.Status = 200 Then
    Set objFso = CreateObject("Scripting.FileSystemObject")
    Set txtFile = objFso.OpenTextFile("file:\\servername.fpc.wa.gov.au\sharename\corporate\forms\index.asp", 2, True)
    txtFile.WriteLine webObj.responseText
    txtFile.Close
ElseIf webObj.Status >= 400 And webObj.Status <= 599 Then
    MsgBox "Error Occurred : " & webObj.Status & " - " & webObj.statusText
Else
    MsgBox webObj.responseText
End If
Replace your line:
webObj.Open "GET", strURL
With:
webObj.Open "GET", strURL, False, "username", "password"
In most cases 401 Unauthorized means you haven't supplied credentials. Also, you should specify False to indicate you don't want async mode.
It sounds like the O.P. got this working with the correct proxy settings in the registry (http://support.microsoft.com/kb/291008 explains why proxy configuration will fix this). Newer versions of ServerXMLHTTP have a setProxy method that can be used to set the necessary proxy configuration in your code instead.
In the O.P.'s code above, after webObj is created, the following line of code would set up the proxy correctly:
webObj.setProxy 2, "0.0.0.0:80", "*.fpc.wa.gov.au"
ServerXMLHTTP will pass on the credentials of the user running the code if it is configured with a proxy, and if the target URL bypasses that proxy. Since you are bypassing the proxy anyway, you can make the proxy a dummy value ("0.0.0.0:80") and make sure your target URL is covered by what you specify in the bypass list ("*.fpc.wa.gov.au").
I would first test whether you can reach your URL through a normal browser on the same server X that you run your code on (A). I would then try to reach the URL from another PC, one never used to reach that URL before but on the same network as server X (B).
If B works but A doesn't, I would suspect that your source server (i.e. the one that serves the URL) blocks server X for some reason. Check the security settings of IIS6 and of NTLM.
If neither A nor B works, there is something more generally wrong with your source server (i.e. it blocks everything, or NTLM doesn't let you in).
If A works (B doesn't matter then), the problem has to be somewhere in your code. In that case, I would recommend Fiddler. This tool can show you the HTTP requests of both your browser and your code in real time. You can then compare the two, which should give you at least a very strong hint about (if not immediately give you) the solution.

Reading a remote URL in Domino LotusScript

I have a remote RSS feed which has to be transformed into Notes documents using LotusScript.
I've looked through the documentation, but I can't find how to open a remote URL in order to retrieve its contents. In other words, some sort of wget- or curl-like functionality. Can anyone shed some light on how to do this? Using Java is not an option.
Thanks.
Check out the NotesDOMParser class (available in LotusScript), which lets you (indirectly) pull XML from a remote URL and process it in an XML DOM object.
You can pull the XML into a string using the MSXMLHTTP COM object, then use a NotesStream to send the XML to the NotesDOMParser.
I have not tested this, but the code would look something like the following:
...
Set objXML = CreateObject("Microsoft.XMLHTTP")
objXML.open "GET", sURL, False, "", ""
objXML.send("")
sXMLAsText = Trim$(objXML.responseText)

' Write the XML text into a NotesStream and hand it to the DOM parser
Set inputStream = session.CreateStream
Call inputStream.WriteText(sXMLAsText)
inputStream.Position = 0
Set outputStream = session.CreateStream
Set domParser = session.CreateDOMParser(inputStream, outputStream)
domParser.Process
...
Documentation: http://publib.boulder.ibm.com/infocenter/domhelp/v8r0/index.jsp?topic=/com.ibm.designer.domino.main.doc/H_NOTESDOMPARSER_CLASS.html
You can't open a remote URL (whether it's HTTP or some other protocol) using native LotusScript: the object library simply doesn't support it. If you're running on a Windows server, you should be able to use the MS XMLHttp DLLs to get a handle on your remote file via a URL, as described in the previous answer. (Alternatively, this link specifies how to parse and open a UNC path with LotusScript; again, Windows only.)
All that said, if I understand you correctly, you're not using HTTP to access the remote file at all. If the RSS file is just on a simple path, why can't you open the file for parsing in the normal way with LotusScript?
