I am struggling with writing a script that would scrape https://www.rstudio.com/products/rstudio/download/ for the latest RStudio version number, download it, and install it.
Since I am an R programmer, I started writing an R script using the rvest package. I managed to scrape the download link for RStudio Server, but I still cannot get the link for RStudio Desktop itself.
Here is the R code for getting a download link for the 64-bit RStudio Server for Ubuntu.
if(!require('stringr')) install.packages('stringr', Ncpus=8, repos='http://cran.us.r-project.org')
if(!require('rvest')) install.packages('rvest', Ncpus=8, repos='http://cran.us.r-project.org')

# XPath for <code> elements that are the third element child of their parent
xpath <- '//code[(((count(preceding-sibling::*) + 1) = 3) and parent::*)]'
url <- 'https://www.rstudio.com/products/rstudio/download-server/'
thepage <- xml2::read_html(url)
the_links_html <- rvest::html_nodes(thepage, xpath = xpath)
the_links <- rvest::html_text(the_links_html)
# Keep only the link for the 64-bit .deb package
the_link <- the_links[stringr::str_detect(the_links, '-amd64\\.deb')]
the_r_uri <- stringr::str_match(the_link, 'https://.*$')
cat(the_r_uri)
Unfortunately, the RStudio Desktop download page has a completely different layout, and the same approach doesn't work there.
Can someone help me with this? I can't believe that all the data scientists in the world upgrade their RStudio manually!
There is an even simpler version of the script that reads the version of RStudio Server. Bash version:
RSTUDIO_LATEST=$(wget --no-check-certificate -qO- https://s3.amazonaws.com/rstudio-server/current.ver)
or R version:
scan('https://s3.amazonaws.com/rstudio-server/current.ver', what = character(0))
But the version of RStudio Desktop still eludes me.
It seems that you can get the latest stable version number from the URL http://download1.rstudio.org/current.ver, and (for some unknown reason) it is more up to date, at least at the time of writing this answer.
$ curl -s http://download1.rstudio.org/current.ver
1.1.447
$ curl -s https://www.rstudio.org/links/check_for_update?version=1.0.0 | grep -oEi 'update-version=([0-9]+\.[0-9]+\.[0-9]+)' | awk -F= '{print $2}'
1.1.423
Found that here: https://github.com/yutannihilation/ansible-playbook-r/blob/master/tasks/install-rstudio-server.yml
If you query RStudio's check_for_update endpoint with a version string, you'll get back the update version and the URL to get it from:
https://www.rstudio.org/links/check_for_update?version=1.0.0
update-version=1.0.153&update-url=https%3A%2F%2Fwww.rstudio.com%2Fproducts%2Frstudio%2Fdownload%2F&update-message=RStudio%201.0.153%20is%20now%20available%20%28you%27re%20using%201.0.0%29&update-urgent=0
See here:
https://github.com/rstudio/rstudio/blob/54cd3abcfc58837b433464c793fe9b03a87f0bb4/src/cpp/session/modules/SessionUpdates.R
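If you only need those two fields in a shell script, here is a minimal sketch along the same lines (it assumes the response keeps the key=value&key=value form shown above):
response=$(curl -s 'https://www.rstudio.org/links/check_for_update?version=1.0.0')
# Split the URL-encoded response on '&' and pick out the two interesting fields
update_version=$(printf '%s' "$response" | tr '&' '\n' | awk -F= '$1 == "update-version" {print $2}')
update_url=$(printf '%s' "$response" | tr '&' '\n' | awk -F= '$1 == "update-url" {print $2}')
echo "latest version: $update_version"
echo "download page (still URL-encoded): $update_url"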
If you really want to scrape it from the download page, then I'd get the href of the <a> in the first <td> of the first <table> of class "downloads", and then parse out the three dot-separated numbers between "RStudio-" and ".exe". RStudio releases the same versions across all platforms, so getting it from the Windows download should be sufficient.
> library(rvest)
> url <- "https://www.rstudio.com/products/rstudio/download/"
> thepage <- xml2::read_html(url)
> html_node(thepage, ".downloads td a") %>% html_attr("href")
[1] "https://download1.rstudio.org/RStudio-1.0.153.exe"
There's a near-solution here:
https://hub.docker.com/r/rocker/rstudio-daily/~/dockerfile/
In this script, which scrapes for the latest builds:
https://raw.githubusercontent.com/rocker-org/rstudio-daily/master/latest.R
You'll want to modify that script to be stricter about what it accepts; e.g. I would want rstudio-server-1.1.355-amd64.deb and not the stretch variant.
(But you can modify it to target whichever kind of build you want; as written it targets the daily builds of RStudio Server for Ubuntu.)
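For the stricter filter, a hedged sketch (the file-name pattern is an assumption based on the example name above); it reads candidate .deb names on stdin and keeps only the plain amd64 server builds:
# Accept rstudio-server-1.1.355-amd64.deb, reject e.g. the stretch variants
grep -E '^rstudio-server-[0-9]+\.[0-9]+\.[0-9]+-amd64\.deb$'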
If anyone is interested, here is my ultimate RStudio-Desktop-on-Ubuntu update script. It installs the 64-bit RStudio Desktop and then, if the Fira Code font is available, applies the patch from https://github.com/tonsky/FiraCode/wiki/RStudio-instructions to RStudio so the ligatures start working.
#!/bin/bash
# Update RStudio Desktop on Ubuntu if a newer version is available.
if dpkg -s rstudio >/dev/null 2>/dev/null; then
    ver=$(apt show rstudio 2>/dev/null | grep Version)
    pattern='^Version: ([0-9.]+)[[:space:]]*$'
    if [[ $ver =~ $pattern ]]; then
        ourversion=${BASH_REMATCH[1]}
        # Ask RStudio's update service for the latest released version
        netversion=$(Rscript -e 'cat(stringr::str_match(scan("https://www.rstudio.org/links/check_for_update?version=1.0.0", what = character(0), quiet=TRUE), "^[^=]+=([^\\&]+)\\&.*")[[2]])')
        if [[ $ourversion != $netversion ]]; then
            # Helper R script that scrapes the download page for the .deb link
            tee /tmp/get_rstudio_uri.R <<EOF
if(!require('rvest')) install.packages('rvest', repos='http://cran.us.r-project.org')
css <- '.downloads:nth-child(2) tr:nth-child(5) a'
url <- 'https://www.rstudio.com/products/rstudio/download/'
thepage <- xml2::read_html(url)
cat(rvest::html_attr(rvest::html_node(thepage, css), 'href'))
EOF
            RSTUDIO_URI=$(Rscript /tmp/get_rstudio_uri.R)
            wget -c --output-document /tmp/rstudio.deb "$RSTUDIO_URI"
            sudo dpkg -i /tmp/rstudio.deb
            rm /tmp/rstudio.deb /tmp/get_rstudio_uri.R
        fi
        # If the Fira Code font is installed, patch RStudio so ligatures render
        if fc-list | grep -q FiraCode; then
            if ! grep -q "text-rendering:" /usr/lib/rstudio/www/index.htm; then
                sudo sed -i '/<head>/a<style>*{text-rendering: optimizeLegibility;}<\/style>' /usr/lib/rstudio/www/index.htm
            fi
        fi
    fi
fi
Related
I have created the below .sh file to run R code saved in a separate .R file.
cat EE.sh
#!/bin/bash
VARIABLES=( 20190719 20190718 )
for i in "${VARIABLES[@]}"; do
    # Export the date so the sourced R script can read it, e.g. via Sys.getenv("VARIABLENAME")
    export VARIABLENAME=$i
    /usr/lib/R/bin/Rscript -e 'source("/home/EER.R")'
done
Basically, it is expected to take the dates from VARIABLES, pass them to /home/EER.R, and R will do the execution based on the passed date (after correct formatting).
Then I ran the commands below:
sudo chmod a+rx EE.sh
and
sudo bash EE.sh
But I then get the below error message:
sudo bash EE.sh
EE.sh: line 2: $'\r': command not found
EE.sh: line 3: $'\r': command not found
EE.sh: line 4: $'\r': command not found
Can anyone help me resolve this issue?
I am using Ubuntu 18 with R version 3.4.4 (2018-03-15)
This problem looks to be related to carriage returns (which appear when we copy text from a Windows machine to a Unix machine). To identify them, use:
cat -v Input_file
If you see carriage returns in your file then try:
tr -d '\r' < Input_file > temp && mv temp Input_file
Once they are removed then try to run your program.
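Putting the check and the fix together, a small sketch (assuming the script is named EE.sh as above):
# Strip carriage returns from EE.sh only if the file actually contains any
if grep -q $'\r' EE.sh; then
    tr -d '\r' < EE.sh > EE.sh.tmp && mv EE.sh.tmp EE.sh
fi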
I'm using wget to download all jpegs from a website.
I searched a lot and this should be the way:
wget -r -nd -A jpg "http://www.hotelninfea.com"
This should recursively (-r) download JPEG files (-A jpg) and store all of them in a single directory, without recreating the website's directory tree (-nd).
Running this command downloads only the JPEGs from the homepage of the website, not the JPEGs from the whole website.
I know that a JPEG file could have different extensions (jpg, jpeg) and so on, but that is not the case here, and there also aren't any robots.txt restrictions in play.
If I remove the filter from the previous command, it works as expected:
wget -r -nd "http://www.hotelninfea.com"
This is happening on Lubuntu 16.04 64bit, wget 1.17.1
Is this a bug, or am I misunderstanding something?
I suspect that this is happening because the main page you mention contains links to the other pages in the form http://.../something.php, i.e., there is an explicit extension. The option -A jpg then has the "side effect" of removing those pages from the traversal process.
A perhaps somewhat dirty workaround in this particular case would be something like this:
wget -r -nd -A jpg,jpeg,php "http://www.hotelninfea.com" && rm -f *.php
i.e., to download only the necessary extra pages and then delete them if wget successfully terminates.
ewcz's answer pointed me in the right direction: the --accept acclist parameter has a dual role; it defines both the rules for saving files and the rules for following links.
Reading the manual more deeply, I found this:
If ‘--adjust-extension’ was specified, the local filename might have ‘.html’ appended to it. If Wget is invoked with ‘-E -A.php’, a filename such as ‘index.php’ will match be accepted, but upon download will be named ‘index.php.html’, which no longer matches, and so the file will be deleted.
So you can do this
wget -r -nd -E -A jpg,php,asp "http://www.hotelninfea.com"
But of course a webmaster could be using custom extensions, so I think the most robust solution would be a bash script, something like this:
WEBSITE="http://www.hotelninfea.com"
DEST_DIR="."

# Let wget crawl the site without downloading anything (--spider) and
# collect the image URLs from its log output
image_urls=$(wget -nd --spider -r "$WEBSITE" 2>&1 | grep '^--' | awk '{ print $3 }' | grep -i '\.\(jpeg\|jpg\)')

for image_url in $image_urls; do
    DESTFILE="$DEST_DIR/$RANDOM.jpg"
    wget "$image_url" -O "$DESTFILE"
done
--spider: wget will not download the pages, just check that they are there
$RANDOM: asks bash for a random number
The following did not work.
wget -r -A .pdf home_page_url
It stops with the following message:
....
Removing site.com/index.html.tmp since it should be rejected.
FINISHED
I don't know why it only stops at the starting URL and does not follow its links to search for the given file type.
Is there any other way to recursively download all PDF files from a website?
It may be because of robots.txt. Try adding -e robots=off.
Other possible problems are cookie based authentication or agent rejection for wget.
See these examples.
EDIT: The dot in ".pdf" is wrong according to sunsite.univie.ac.at
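Putting those two suggestions together, a sketch of the corrected command (home_page_url stands in for the real start URL, as in the question):
# Drop the leading dot from the accept pattern and ignore robots.txt
wget -r -A pdf -e robots=off home_page_url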
The following command works for me; it will download pictures of a site:
wget -A pdf,jpg,png -m -p -E -k -K -np http://site/path/
This is certainly because the links in the HTML don't end with /.
Wget will not follow this, as it thinks it's a file (but it doesn't match your filter):
<a href="http://example.com/page">page</a>
But it will follow this:
<a href="http://example.com/page/">page</a>
You can use the --debug option to see if that is the actual problem.
I don't know of any good solution for this. In my opinion, this is a bug.
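For what it's worth, a minimal sketch of that check, assuming your wget build supports --debug (it just saves the decision log for inspection):
# Re-run the failing crawl with --debug and keep the full log
wget -r -nd -A jpg --debug "http://www.hotelninfea.com" 2> wget-debug.log
# Then read wget-debug.log to see which links were skipped by the accept rules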
In my version of wget (GNU Wget 1.21.3), the -A/--accept and -r/--recursive flags don't play nicely with each other.
Here's my script for scraping a domain for PDFs (or any other filetype):
wget --no-verbose --mirror --spider https://example.com -o - | while read -r line
do
    # Only look at successfully fetched URLs
    [[ $line == *'200 OK' ]] || continue
    # ...and only those that point at a PDF
    [[ $line == *'.pdf'* ]] || continue
    # Strip the leading timestamp (25 chars) and trailing request info (7 chars),
    # leaving just the URL, then download it into scraped-files/
    echo "$line" | cut -c25- | rev | cut -c7- | rev | xargs wget --no-verbose -P scraped-files
done
Explanation: Recursively crawl https://example.com and pipe the log output (containing all scraped URLs) to a while read block. When a line from the log output contains a PDF URL, strip the leading timestamp (25 characters) and the trailing request info (7 characters) and use wget to download the PDF.
I have a URL in my custom module which runs a long script. If I call the URL via wget, it downloads the page content; it doesn't run the script. How do I do that?
I would have thought that even though it downloaded the page it would still run the script.
To run it without saving the output to a file, use:
wget -O - -q -t 1 http://example.com/path/to/file.php
From memory:
-O with the hyphen redirects the output to standard output, so it isn't saved to a file.
-q is for quiet
-t is the number of attempts.
You can use man wget to look up any other options.
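This pattern is often used from cron; a sketch of such a crontab entry (the URL is just the placeholder from above):
# Hit the URL every hour, discard the output, allow a single attempt
0 * * * * wget -O - -q -t 1 http://example.com/path/to/file.php > /dev/null 2>&1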
I created a batch file for Windows that executes some xmlstarlet commands. I want to write it as a .sh file so that I can run it on a Mac. The problem is that some commands work fine on Windows but not on the Mac, and there's no error shown either. E.g.
xml ed -L -d //intent-filter//category[@android:name='android.intent.category.LAUNCHER'] my_folder\AndroidManifest.xml
On Windows, the above command deletes the mentioned XML tag, but it does nothing on the Mac.
But the command
xml sel -t -m //manifest -v //manifest/@package mim_apk_proj\AndroidManifest.xml
works fine on both Mac and Windows.
I have installed the xml tool and checked /usr/local/bin; it has libxslt.dylib and libxml2.dylib. I don't know where the problem lies.
Can someone help?
The quoting rules for bash (that's the shell on your Mac, right?) are different from those of cmd.exe (the Windows shell). In particular, cmd.exe treats ' as a normal character, while to bash it is a quoting character and so isn't passed to the program. In bash you therefore need to quote the 's as well:
xml ed -L -d //intent-filter//category[@android:name='android.intent.category.LAUNCHER'] my_folder\AndroidManifest.xml
# becomes
xml ed -L -d "//intent-filter//category[@android:name='android.intent.category.LAUNCHER']" my_folder\AndroidManifest.xml
# or, since XPath treats both kinds of quotes identically, you can also use
xml ed -L -d '//intent-filter//category[@android:name="android.intent.category.LAUNCHER"]' my_folder\AndroidManifest.xml
The second fix is safer because it also prevents bash from doing any variable expansion if you use $, but the first fix has the advantage of working in Windows as well.
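A quick way to see the difference the second fix makes (the variable is just an example):
echo "path: $HOME"   # double quotes: bash expands $HOME
echo 'path: $HOME'   # single quotes: the literal string $HOME is printed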