How can I check if a https url is valid?
Using:
RCurl::url.exists("https://github.com/")
gives [1] FALSE.
I prefer base R for my needs but am not married to it. Plus, additional answers make this question more generalizable.
I would use httr instead. I'm not sure which of url_ok and url_success is preferable, but they both work at this level.
library(httr)
url_ok("http://github.com/")
#[1] TRUE
url_ok("https://github.com/")
#[1] TRUE
url_ok("https://github.com/nonworking")
#[1] FALSE
url_success("http://github.com/")
#[1] TRUE
url_success("https://github.com/")
#[1] TRUE
url_success("https://github.com/nonworking")
#[1] FALSE
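(Note: url_ok and url_success were deprecated in later httr releases; assuming a current httr, a roughly equivalent check is to test whether a GET request returns an error status:)
library(httr)
!http_error(GET("https://github.com/"))
#[1] TRUE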
For some reason, RCurl doesn't like GitHub even in plain HTTP mode. I suspect it's because of a redirect.
library(RCurl)
url.exists("http://github.com/")
#[1] FALSE
url.exists("https://github.com/")
#[1] FALSE
Edit: Some commenters have mentioned they get TRUE as an answer, but I also get FALSE using RCurl. I'm on Windows.
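If you still want to make RCurl work here, it accepts libcurl options; a sketch worth trying, assuming the redirect/user-agent theory is right (the option values are an assumption, not verified):
url.exists("https://github.com/", .opts = list(followlocation = TRUE, useragent = "R"))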
Related
I have read several questions listed below:
Set path to miktex for pdflatex in R
How can I set the latex path for sweave in R?
https://tex.stackexchange.com/questions/267299/how-to-fix-the-sorry-but-c-miktex-pdftex-exe-did-not-succeed-error
https://tex.stackexchange.com/questions/429706/rstudio-not-detecting-miktex
https://tex.stackexchange.com/questions/231595/rstudio-cant-find-pdflatex-on-windows-7
The above list does not exhaust everything I have tried, which also includes reinstalling RStudio, R and MiKTeX.
I then thought that I could edit the path to delete the MiKTeX 1.9 entry that R keeps calling, but I don't know how to do that.
I found this function, which shows that I have in fact set the correct path to MiKTeX, yet R keeps calling MiKTeX 1.9:
Sys.which2 <- function(cmd) {
  stopifnot(length(cmd) == 1)
  if (.Platform$OS.type == "windows") {
    suppressWarnings({
      # ask the Windows 'where' command directly instead of relying on Sys.which()
      pathname <- shell(sprintf("where %s 2> NUL", cmd), intern = TRUE)[1]
    })
    if (!is.na(pathname)) return(setNames(pathname, cmd))
  }
  Sys.which(cmd)
}
Different output between Sys.which and Sys.which2:
Sys.which2("pdflatex")
pdflatex
"C:\\Program Files\\MiKTeX 2.9\\miktex\\bin\\x64\\pdflatex.exe"
Sys.which("pdflatex")
pdflatex
"C:\\PROGRA~1\\MIKTEX~1.9\\miktex\\bin\\x64\\pdflatex.exe"
How can I best solve this issue?
My idea was to somehow locate where R is finding this MiKTeX 1.9 and replace it, but I can't find it on my system and don't quite know what Sys.which is doing behind the scenes.
EDIT
An attempt at locating where 1.9 is:
stringr::str_detect(unlist(strsplit(Sys.getenv("PATH"),";")),"latex")
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Output of Sys.getenv("PATH"):
"C:/Program Files/MiKTeX 2.9/miktex/bin/x64:C:\Program Files\R\R-3.6.2\bin\x64;C:\Program Files (x86)\Common Files\Oracle\Java\javapath;C:\ProgramData\Oracle\Java\javapath;C:\Program Files\copasi.org\COPASI 4.22.170\bin;C:\Program Files (x86)\Intel\TXE Components\iCLS\;C:\Program Files\Intel\TXE Components\iCLS\;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\Users\Administrator\AppData\Local\Microsoft\WindowsApps;C:\Recovery\OEM\Backup\;C:\Program Files\Intel\TXE Components\DAL\;C:\Program Files (x86)\Intel\TXE Components\DAL\;C:\Program Files\Intel\TXE Components\IPT\;C:\Program Files (x86)\Intel\TXE Components\IPT\;C:\Program Files\Intel\WiFi\bin\;C:\Program Files\Common Files\Intel\WirelessCommon\;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\WINDOWS\System32\OpenSSH\;E:\MATLAB\runtime\win64;E:\MATLAB\bin;C:\Program Files\Git\cmd;C:\Program Files\Microsoft SQL Server\Client SDK\ODBC\130\Tools\Binn\;C:\Program Files (x86)\Microsoft SQL Server\140\Tools\Binn\;C:\Program Files\Microsoft SQL Server\140\Tools\Binn\;C:\Program Files\Microsoft SQL Server\140\DTS\Binn\;C:\ProgramData\chocolatey\bin;C:\Program Files\MiKTeX 2.9\miktex\bin\x64\;C:\Users\my name\AppData\Local\Programs\Python\Python38\Scripts\;C:\Users\my name\AppData\Local\Programs\Python\Python38\;C:\Users\my name\AppData\Local\Programs\Python\Python36\Scripts\;C:\Users\my name\AppData\Local\Programs\Python\Python36\;C:\Users\my name\Desktop\wget-1.20.3-win64;C:\Users\my name\AppData\Local\Programs\Python\Python37\Scripts\;C:\Users\my name\AppData\Local\Programs\Python\Python37\;C:\Users\my name\AppData\Local\Microsoft\WindowsApps;C:\Users\my name\AppData\Local\Programs\Python\Python37-32;E:\jdk-12_windows-x64_bin;C:\Users\my name\AppData\Local\Microsoft\WindowsApps;C:\Users\my name\Desktop\adb+-+platform+tools+v28.0.1"
C:\\PROGRA~1\\MIKTEX~1.9 doesn't literally mean MiKTeX v1.9. It is an 8.3 filename: because the string "MiKTeX 2" contains a "special character" (i.e. a space), it is converted to MIKTEX~1, and the ".9" part remains as the "extension". So "MiKTeX 2.9" became MIKTEX~1.9, which is indeed confusing in this case.
I feel the problem you are actually trying to solve might be a different one. If that's the case, you may want to ask that question directly. There isn't anything wrong with your environment variables, as far as I can see.
If you really need the long name, you can call normalizePath() to convert the short 8.3 name to a long name.
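For example, with the short path from the question (the output assumes MiKTeX 2.9 is actually installed at that location):
normalizePath("C:\\PROGRA~1\\MIKTEX~1.9\\miktex\\bin\\x64\\pdflatex.exe")
#[1] "C:\\Program Files\\MiKTeX 2.9\\miktex\\bin\\x64\\pdflatex.exe"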
I am trying to figure out a way to find all the keywords that come from the same root word (in some sense the opposite action of stemming). Currently, I am using R for coding, but I am open to switching to a different language if it helps.
For instance, I have the root word "rent" and I would like to be able to find "renting", "renter", "rental", "rents" and so on.
Try this code in Python. First, install the dependencies:
pip install pattern
pip install nltk
Then open a terminal, type python and run the lines below to download the required NLTK corpora:
import nltk
nltk.download(["wordnet", "wordnet_ic", "sentiwordnet"])
After the installation is done, run the pattern code:
from pattern.en import lexeme
print(lexeme("rent"))
The output generated is the list of inflected forms of "rent" (e.g. "rent", "rents", "renting" and "rented").
You want to find the opposite of stemming, but stemming can be your way in.
Look at this example in Python:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
words = ["renting", "renter", "rental", "rents", "apple"]

all_rents = {}
for word in words:
    stem = stemmer.stem(word)      # e.g. "renting" -> "rent", "apple" -> "appl"
    if stem not in all_rents:
        all_rents[stem] = []
    all_rents[stem].append(word)   # group each word under its stem

print(all_rents)
Result:
{'rent': ['renting', 'rents'], 'renter': ['renter'], 'rental': ['rental'], 'appl': ['apple']}
There are several other algorithms to use. However, keep in mind that stemmers are rule-based and are not "smart" to the point where they will select all related words (as seen above). You can even implement your own rules (extend the Stem API from NLTK).
Read more about all available stemmers in NLTK (the module that was used in the above example) here: https://www.nltk.org/api/nltk.stem.html
You can implement your own algorithm as well. For example, you can implement Levenshtein Distance (as proposed in #noski's comment) to compute the smallest common prefix. However, you have to do your own research on this one, since it is a complex process.
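In R, for example, base adist() computes generalized Levenshtein distances. A rough sketch of grouping by edit distance from the root (the threshold of 3 is an arbitrary choice, and note that "brent" slips in as a false positive):
words <- c("renting", "renter", "rental", "rents", "apple", "brent")
adist("rent", words)             # 3 2 2 1 5 1
words[adist("rent", words) <= 3] # "renting" "renter" "rental" "rents" "brent"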
For an R answer, you can try these functions as a starting point. d.b gives grepl as an example; here are a few more:
words = c("renting", "renter", "rental", "rents", "apple", "brent")
grepl("rent", words) # TRUE TRUE TRUE TRUE FALSE TRUE
startsWith(words, "rent") # TRUE TRUE TRUE TRUE FALSE FALSE
endsWith(words, "rent") # FALSE FALSE FALSE FALSE FALSE TRUE
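The stemming approach from the Python answer above also works in R; a minimal sketch, assuming the SnowballC package is installed:
library(SnowballC)
words <- c("renting", "renter", "rental", "rents", "apple")
split(words, wordStem(words)) # groups each word under its Porter stem, e.g. $rent = c("renting", "rents")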
I'm running Kibana 6.7.1 in Elastic Cloud. I'm open to upgrading it if will help.
I would like to hide all of the links (plugins? applications?) in Kibana's left navigation bar, except for "Discover", "Visualize", "Dashboard", and maybe "Canvas". Ideally configured by space or by role.
I've read in a few places that Timelion can be disabled by setting timelion.enabled: false in kibana.yml. However, that setting is not documented for 6.7. And there are ten other links to hide.
Is this what Application Roles are for? I did not get anywhere with trying to set them, and I don't think they do this, as I assume the documentation would list the roles for each default application if that were the case.
I've tried Dashboard Only mode, but it's more restrictive than I would prefer.
Are there settings in Kibana to disable these links, or do I have to add CSS or use a proxy which edits the HTML to remove them?
I am using the latest ELK stack [ELK 7.5.2] to develop a custom plugin and have found the following settings helpful in removing clutter from Kibana.
You can add these to your kibana.yml file and set their values according to your needs:
xpack.canvas.enabled: false
xpack.reporting.enabled: false
xpack.actions.enabled: false
xpack.alerting.enabled: false
xpack.maps.enabled: false
xpack.security.enabled: false
xpack.uptime.enabled: false
xpack.watcher.enabled: false
xpack.spaces.enabled: false
xpack.license_management.enabled: false
xpack.upgrade_assistant.enabled: false
xpack.index_management.enabled: false
xpack.apm.enabled: false
xpack.beats.enabled: false
xpack.ccr.enabled: false
xpack.cloud.enabled: false
xpack.code.enabled: false
xpack.graph.enabled: false
xpack.grokdebugger.enabled: false
xpack.ilm.enabled: false
xpack.infra.enabled: false
xpack.logstash.enabled: false
xpack.ml.enabled: false
xpack.monitoring.enabled: false
xpack.remote_clusters.enabled: false
xpack.rollup.enabled: false
xpack.searchprofiler.enabled: false
xpack.siem.enabled: false
xpack.snapshot_restore.enabled: false
xpack.tilemap.enabled: false
xpack.transform.enabled: false
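For the Timelion link mentioned in the question, the same pattern applied in versions that still shipped it (worth verifying against your exact version):
timelion.enabled: false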
Hope it helps.
For example:
if (url.exists("http://www.google.com")) {
  # Two ways to submit a query to Google. Searching for RCurl
  getURL("http://www.google.com/search?hl=en&lr=&ie=ISO-8859-1&q=RCurl&btnG=Search")
  # Here we let getForm do the hard work of combining the names and values.
  getForm("http://www.google.com/search", hl = "en", lr = "", ie = "ISO-8859-1", q = "RCurl", btnG = "Search")
  # And here if we already have the parameters as a list/vector.
  getForm("http://www.google.com/search", .params = c(hl = "en", lr = "", ie = "ISO-8859-1", q = "RCurl", btnG = "Search"))
}
This is an example from the RCurl package manual. However, it does not work:
> url.exists("http://www.google.com")
[1] FALSE
I found there is an answer to this here: Rcurl: url.exists returns false when url does exists. It said this is because the default user agent is not useful. But I do not understand what a user agent is or how to use one.
Also, this error happened when I worked at my company. I tried the same code at home, and it worked fine. So I am guessing this is because of a proxy, or some other reason that I did not realize.
I need to use RCurl to submit my queries to Google, and then extract information such as titles and descriptions from the results. In this case, how do I use a user agent? Or can the httr package do this?
Thanks a lot for the help, everyone. I think I just figured out how to do it. The important thing is the proxy. If I use:
> opts <- list(
+   proxy = "http://*******",
+   proxyusername = "*****",
+   proxypassword = "*****",
+   proxyport = 8080
+ )
> url.exists("http://www.google.com",.opts = opts)
[1] TRUE
Then it's all done! You can find your proxy settings under Settings --> Proxy if you use Windows 10. At the same time:
> site <- getForm("http://www.google.com.au", hl = "en",
+   lr = "", q = "r-project", btnG = "Search", .opts = opts)
> htmlTreeParse(site)
$file
[1] "<buffer>"
.........
In getForm, .opts needs to be passed in as well. Two posts here (RCurl default proxy settings and Proxy setting for R) answer the same question. I have not yet tried extracting information from the results.
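As for the user agent part of the question: RCurl exposes the libcurl useragent option through the same .opts mechanism; a minimal sketch (the exact string is an arbitrary choice):
url.exists("http://www.google.com", .opts = list(useragent = "Mozilla/5.0"))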
I'm trying to use url_ok with pbsapply to test a large number of URLs:
pbsapply(foo$URL, function(x) try(url_ok(x)))
but the program keeps getting stuck on certain bad URLs, like url_ok("www.isdnet.net"). This URL returns 403 Forbidden in the browser but hangs R. There are other bad-URL situations, and I don't know how many bad URLs are in the big data set.
I tried to create a timeout so that if a URL can't return anything after a few seconds, it records FALSE and moves on to the next URL.
I tried this, but it didn't work; it still got stuck:
evalWithTimeout(url_ok("www.isdnet.net"), timeout=1.08, onTimeout="warning");
Unfortunately HEAD requests (which is what url_ok uses) do not work on all web servers and can send you into a timeout death-spiral or give you inaccurate results (amongst other issues). The only way to avoid this is to use a GET request, which will result in downloading more payload.
But then you probably have to worry about malformed URLs (I've rarely come across a large, clean URL dataset), which will throw actual R errors. Your best bet is to write a more robust "URL OK" routine (UPDATED to add timeout):
library(httr)
library(pbapply)

# this is a dangerous setting for normal operations
set_config(config(ssl_verifypeer = 0L, ssl_verifyhost = 0L), override = TRUE)

url_ok_via_get <- function(...) {
  ret <- FALSE
  tryCatch({
    # GET with a browser-like user agent and a 5-second timeout
    x <- GET(..., timeout(5), user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.61 Safari/537.36"))
    ret <- identical(status_code(x), 200L)
  }, error = function(e) {
    ret <- FALSE  # any error (bad URL, DNS failure, timeout) counts as "not OK"
  }, finally = return(ret))
}
url_ok_via_get("aladslsdf")
## [1] FALSE
url_ok_via_get("http://www.isdnet.net")
## [1] FALSE
url_ok_via_get("http://dds.ec/guaranteed_to_return_404")
## [1] FALSE
url_ok_via_get("http://rud.is/")
## [1] TRUE
url_ok_via_get("http://www.gn.dk")
## [1] FALSE
pbsapply(c("aladslsdf", "http://www.isdnet.net",
"http://dds.ec/guaranteed_to_return_404", "http://rud.is/",
"http://www.gn.dk"), url_ok_via_get)
## aladslsdf
## FALSE
## http://www.isdnet.net
## FALSE
## http://dds.ec/guaranteed_to_return_404
## FALSE
## http://rud.is/
## TRUE
## http://www.gn.dk
## FALSE