How to install libs for the R arrow package on Ubuntu without internet?

I am working on Azure Databricks, whose compute server runs Ubuntu 18.04. I want to install the arrow R package, but without internet access, for security reasons. I downloaded the arrow tar file on my MacBook, which has internet access, and made it available on the Ubuntu machine for manual installation. I performed the following steps:
Re-installed build-essential by downloading it from this link, uploading it to the Ubuntu machine, and running the following bash command to make it available: sudo dpkg -i /dbfs/FileStore/tables/build_essential_12_4ubuntu1_amd64.deb
Installed cpp11, as it is a dependency listed on CRAN: R CMD INSTALL /dbfs/FileStore/tables/arrow_dir/cpp11_0_3_1.tar.gz
Downloaded arrow_4.0.1.tar.gz from here and made it available on the Ubuntu machine.
At this point I see that several C++ dependencies are required to be available on Ubuntu before installing the arrow package. How can I install these dependencies without access to the internet?
Thanks for reading my question.
Note: A solution is suggested below; after executing ./thirdparty/download_dependencies.sh $HOME/arrow-thirdparty I get:
# Environment variables for offline Arrow build
export ARROW_ABSL_URL=/root/arrow-thirdparty/absl-0f3bb466b868b523cf1dc9b2aaaed65c77b28862.tar.gz
export ARROW_AWSSDK_URL=/root/arrow-thirdparty/aws-sdk-cpp-1.8.133.tar.gz
export ARROW_AWS_CHECKSUMS_URL=/root/arrow-thirdparty/aws-checksums-v0.1.10
export ARROW_AWS_C_COMMON_URL=/root/arrow-thirdparty/aws-c-common-v0.5.10.tar.gz
export ARROW_AWS_C_EVENT_STREAM_URL=/root/arrow-thirdparty/aws-c-event-stream-v0.1.5
export ARROW_BOOST_URL=/root/arrow-thirdparty/boost-1.75.0.tar.gz
export ARROW_BROTLI_URL=/root/arrow-thirdparty/brotli-v1.0.9.tar.gz
export ARROW_BZIP2_URL=/root/arrow-thirdparty/bzip2-1.0.8.tar.gz
export ARROW_CARES_URL=/root/arrow-thirdparty/cares-1.17.1.tar.gz
export ARROW_GBENCHMARK_URL=/root/arrow-thirdparty/gbenchmark-v1.5.2.tar.gz
export ARROW_GFLAGS_URL=/root/arrow-thirdparty/gflags-v2.2.2.tar.gz
export ARROW_GLOG_URL=/root/arrow-thirdparty/glog-v0.4.0.tar.gz
export ARROW_GRPC_URL=/root/arrow-thirdparty/grpc-v1.35.0.tar.gz
export ARROW_GTEST_URL=/root/arrow-thirdparty/gtest-1.10.0.tar.gz
export ARROW_JEMALLOC_URL=/root/arrow-thirdparty/jemalloc-5.2.1.tar.bz2
export ARROW_LZ4_URL=/root/arrow-thirdparty/lz4-v1.9.3.tar.gz
export ARROW_MIMALLOC_URL=/root/arrow-thirdparty/mimalloc-v1.7.2.tar.gz
export ARROW_ORC_URL=/root/arrow-thirdparty/orc-1.6.6.tar.gz
Failed downloading https://github.com/google/protobuf/releases/download/v3.14.0/protobuf-all-3.14.0.tar.gz

Would it help to use the script mentioned in the link below to download the dependencies and put them somewhere you can then install them from?
There are some instructions here: https://arrow.apache.org/docs/developers/cpp/building.html#offline-builds
I've pasted them below in case the link expires, but you may want to check it for the most up-to-date version of these instructions.
To enable offline builds, you can download the source artifacts yourself and use environment variables of the form ARROW_$LIBRARY_URL to direct the build system to read from a local file rather than accessing the internet.
To make this easier for you, we have prepared a script thirdparty/download_dependencies.sh which will download the correct version of each dependency to a directory of your choosing. It will print a list of bash-style environment variable statements at the end to use for your build script.
# Download tarballs into $HOME/arrow-thirdparty
$ ./thirdparty/download_dependencies.sh $HOME/arrow-thirdparty
You can then invoke CMake to create the build directory, and it will use the declared environment variables pointing to the downloaded archives instead of downloading them (you need to set them for each build directory!).
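For concreteness, here's a minimal sketch of wiring that together, assuming you run it from the arrow/cpp source directory; the install prefix, -j4, and the protobuf workaround are illustrative, not from the question. Since the script printed everything except protobuf, you can fetch that one tarball on a connected machine and point the matching ARROW_PROTOBUF_URL variable at the local copy:
# Capture the export statements printed by the script, then source them
./thirdparty/download_dependencies.sh $HOME/arrow-thirdparty > $HOME/arrow-env.sh
source $HOME/arrow-env.sh
# One artifact failed to download above; fetch it elsewhere and point its variable at the local copy
export ARROW_PROTOBUF_URL=$HOME/arrow-thirdparty/protobuf-all-3.14.0.tar.gz
# Configure and build; CMake reads the ARROW_*_URL variables instead of downloading
mkdir build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX=$HOME/arrow-install
make -j4
make install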

Starting in arrow 6.0.0, the package should successfully install from source when offline. It will have only basic features: you'll be able to work with Arrow data and feather files, but features like Parquet reading, S3, and compression libraries won't be available. There is also a new utility function, create_package_with_all_dependencies(), that you can run on a machine connected to the internet in order to produce a "fat" source package containing all third-party C++ dependencies. You can then copy this to your airgapped server. See https://arrow.apache.org/docs/r/reference/create_package_with_all_dependencies.html for details.
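For reference, a hedged sketch of that workflow in R (the file name is illustrative):
# On a machine with internet access:
arrow::create_package_with_all_dependencies("arrow_with_deps.tar.gz")
# Copy the file to the airgapped server, then install from the local file:
install.packages("arrow_with_deps.tar.gz", repos = NULL, type = "source")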

Related

How to create local cache for choco package manager

I am building a Windows Docker image and use a package manager called choco to automate package installation. I would like to avoid downloading packages from the internet every time I modify and rebuild the Dockerfile. Locally, we run an Artifactory server for that purpose, but the package I am installing has a hardcoded URL value pointing to microsoft.com.
In the package installation script here https://community.chocolatey.org/packages/visualcpp-build-tools#files
$packageName = 'visualcpp-build-tools'
$installerType = 'EXE'
$url = 'https://download.microsoft.com/download/5/A/8/5A8B8314-CA70-4225-9AF0-9E957C9771F7/vs_BuildTools.exe'
$checksum = 'E77D433C44F3D0CBF7A3EFA497101DE93918C492C2EBCAEC79A1FAF593D419BC'
How can I overcome this hardcoded URL value and force the install script to download the package from my local Artifactory server?
Before discovering the hardcoded URL value inside the package installation script, I tried using the --source option with choco install, as described in the guide here: https://jfrog.com/blog/artifactory-as-a-caching-mechanism-for-package-managers/, but I noticed that the package is still downloaded from the internet.
Packages on the Chocolatey Community Repository often can't include the original binaries due to licensing and distribution rights, leading to each user downloading the file from the source URL.
However, you can manually re-create the package with the downloaded file inside. To do so, you should (see the sketch after this list):
Extract the relevant content of the nupkg file (e.g. the tools directory and the nuspec file)
Download the file specified in $url (the file being downloaded repeatedly during package installation) to the tools directory (or somewhere else within the package)
Replace the value of $url with the relative path to the downloaded file
Run choco pack $PathToNuspecFile (in the root of the extracted files)
Upload the resultant nupkg file to your Artifactory instance
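A hedged sketch of those steps in PowerShell (the nupkg file name is illustrative, and the $url edit is shown as a comment because each package's install script differs):
# A nupkg (downloaded from the community repository) is a zip archive;
# copy it with a .zip extension so Expand-Archive accepts it
Copy-Item .\visualcpp-build-tools.nupkg .\pkg.zip
Expand-Archive .\pkg.zip -DestinationPath .\pkg
# Fetch the hardcoded installer into the package's tools directory
Invoke-WebRequest -Uri 'https://download.microsoft.com/download/5/A/8/5A8B8314-CA70-4225-9AF0-9E957C9771F7/vs_BuildTools.exe' -OutFile .\pkg\tools\vs_BuildTools.exe
# Edit .\pkg\tools\chocolateyInstall.ps1 so $url points at the bundled file, e.g.:
#   $url = Join-Path (Split-Path -Parent $MyInvocation.MyCommand.Definition) 'vs_BuildTools.exe'
# Repack and push the resulting nupkg to Artifactory
Set-Location .\pkg
choco pack .\visualcpp-build-tools.nuspec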
Alternatively, you could host the binary file on your Artifactory server and update $url to point to that. This has the potential benefit of not downloading the full-size file if the package contains logic that short-circuits the installation before the actual install step, e.g. a gigantic MSU file that you only want to download if you need to apply it.
Finally, one part of a licensed copy of Chocolatey for Business is the Package Internalizer.
This automates the process described above, allowing you to simply call choco download visualcpp-build-tools --internalize.

Install an R package permanently in Google Colab

I am using the idefix R package and I do not want to install it every time I log into Google Colab. Is there any way of installing it permanently? Will it also be installed for other people if I share the notebook?
Thank you :)
As you would on a local computer, you can copy the local R library to the target location and restore it later. See some instructions in this blog (atusy.net).
Here are two Colab notebooks that reproduce the export and import of the R library.
Colab notebook: export local library
Colab notebook: import local library
Here are some minimal snippets from this I/O process.
Open a Colab notebook in Python:
# activate R magic
%load_ext rpy2.ipython
This makes R available in the notebook.
%%R
install.packages('tidymodels')
tar("library.tar.gz", "/usr/local/lib/R/site-library")
Install the package tidymodels, and archive your library of installed packages.
from google.colab import drive
drive.mount('/content/drive')
Connect your Google Drive and save a copy for future use.
%cp library.tar.gz drive/MyDrive/src/
drive/MyDrive/src/ is the path I chose; you can use another.
Next, you use this library in another or new notebook.
from google.colab import drive
drive.mount('/content/drive')
Connect your Google Drive.
%cp drive/MyDrive/src/library.tar.gz .
Copy it into your working directory.
!tar xf library.tar.gz
Extract the installed packages from the zipped file.
.libPaths('usr/local/lib/R/site-library/')
Update the library path so the restored library comes first in the search list. (The path is relative: tar strips the leading slash when archiving, so the library extracts under the working directory.)
library(tidymodels)
Check that the package can be reused.
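Putting the import side together, here's a minimal sketch as two Colab cells (paths match the export above; tidymodels is the example package):
# Cell 1 (Python): restore the saved library
from google.colab import drive
drive.mount('/content/drive')
!cp drive/MyDrive/src/library.tar.gz .
!tar xf library.tar.gz
%load_ext rpy2.ipython

# Cell 2 (R, via the rpy2 cell magic)
%%R
.libPaths('usr/local/lib/R/site-library/')  # relative path: tar stripped the leading slash
library(tidymodels)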
As far as I understand it, each virtual machine is recycled after you close the browser window or once the session runs longer than 12 hours. To the best of my knowledge, there is no way to install packages so that you can access them later without installing them again.

jpeg R package installation does not find jpeglib.h in a non-standard location

I'm trying to install the jpeg package in R on a Linux server (on which I don't have sudo access), and the jpeg installation does not find the jpeglib.h I installed locally. How do I tell R where to look for it, given that configure.args='--with-libjpeg-include=/path' failed?
Server OS version is CentOS Linux 7 (Core).
In R I ran:
>install.packages('jpeg', lib="/shared/mybossusr/R3.5.0/lib", repos="https://mirrors.nic.cz/R/", destdir="/shared/mybossusr/usr/tmp")
And I got this error:
rjcommon.h:11:21: fatal error: jpeglib.h: No such file or directory
 #include <jpeglib.h>
So I installed jpeg-turbo
wget https://downloads.sourceforge.net/libjpeg-turbo/libjpeg-turbo-2.0.2.tar.gz
mkdir libjpeg-turbo-2
cd libjpeg-turbo-2
cmake -G"Unix Makefiles" -DCMAKE_INSTALL_PREFIX:PATH=/shared/mybossusr/bin/libjpeg-turbo-2 /shared/mybossusr/download/libjpeg-turbo-2.0.2
make
make install
I checked and jpeglib.h is at /shared/mybossusr/bin/libjpeg-turbo-2/include
I added this at the end of my ~/.bashrc:
export CFLAGS="-I/usr/include -I=/shared/mybossusr/bin/libjpeg-turbo-2"
I logged out and in, and I got the same error when trying to install jpeg in R.
I also added the location of the library to my path in ~/.bashrc:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/shared/mybossusr/bin/libjpeg-turbo-2/include
export PATH=$PATH:/shared/mybossusr/bin/libjpeg-turbo-2/include
just in case, because I don't fully understand where each program looks for what. I ran source ~/.bashrc, logged out and in, but nothing changed.
Afterwards, in R, I tried some arguments I came up with:
install.packages('jpeg', lib="/shared/mybossusr/R3.5.0/lib", repos="https://mirrors.nic.cz/R/", destdir="/shared/mybossusr/R3.5.0/tmp", configure.args='--with-libjpeg-include=/shared/mybossuser/bin/jpeg/include')
and:
install.packages('jpeg', lib="/shared/mybossusr/R3.5.0/lib", repos="https://mirrors.nic.cz/R/", destdir="/shared/mybossusr/R3.5.0/tmp", configure.args='--with-libjpeg=/shared/mybossuser/bin/jpeg')
or:
install.packages('jpeg', lib="/shared/mybossusr/R3.5.0/lib", repos="https://mirrors.nic.cz/R/", destdir="/shared/mybossusr/R3.5.0/tmp", configure.args='--with-libjpeg-lib=/shared/mybossuser/bin/jpeg/include')
to try to tell R where libjpeg was installed, but nothing worked.
Is there any configure.args that will do the trick? So far with other packages it was quite straightforward to use --with-package_name-lib, but I'm clueless with this one...
Thanks in advance!
Try installing the libjpeg-turbo-devel package. That's what did it for me on RHEL 7. According to this page, on CentOS 7 the package name is the same.
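On a machine where you do have root access (the asker notes they don't), that would be a one-liner; a sketch for CentOS/RHEL 7:
sudo yum install libjpeg-turbo-devel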
For me this is what did the trick:
Install jpeg-turbo in a non-standard location, say $HOME/local, from:
https://github.com/libjpeg-turbo/libjpeg-turbo/releases
cmake -G"Unix Makefiles" -DCMAKE_INSTALL_PREFIX=$HOME/local
make
make install
Then point these environment variables to the install location in your .bashrc:
export LIBRARY_PATH=$HOME/local/lib64:$HOME/local/lib:$LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/local/lib64:$HOME/local/lib:$LD_LIBRARY_PATH
export C_INCLUDE_PATH=$HOME/local/include:$C_INCLUDE_PATH
Then try installing the R package again.
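With those variables exported (and a fresh shell), a plain install from R should then find both the headers and the library; a sketch reusing the lib path and mirror from the question:
install.packages('jpeg', lib="/shared/mybossusr/R3.5.0/lib", repos="https://mirrors.nic.cz/R/")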

Correct workflow? - Distributable environment including Jupyter Notebook

I am developing applications that use Jupyter Notebook and ipywidgets as a GUI frontend to a backend codebase. I have run into issues distributing/installing packages in the normal way, such as:
unexpected differences between required library versions (e.g. pandas)
requirements.txt forcing an update to a more recent version of a library when the user maintains and uses their own codebase on an older version of that library.
I think pipenv might be able to solve this problem, but I want to check I have a correct usage before going too far down this path.
Requirements:
The user needs to be able to restart Jupyter Notebook in the same env multiple times, running the program from scratch, until a new version is available.
Users are all on Mac.
Any installation should not alter site-packages etc., and should have no effect on the Python setup any user currently has.
Workflow concept
Development:
Develop within a pipenv environment (I use Pycharm so this is relatively straightforward).
Include jupyter in the Pipfile [packages] section, even though jupyter is not imported anywhere in my source.
Use pipenv install new_package as and when new packages are required by my codebase, and maintain the Pipfile (respecting --dev for testing packages etc).
User installation
Produce a zip file containing source code, setup.py etc plus Pipfile and Pipfile.lock.
User extracts the zip file to a known location on their machine.
In terminal, navigate to the unzipped folder location, and run pipenv install.
Use:
In terminal, navigate to the folder location, and run pipenv shell
Run pipenv run jupyter notebook to reload the env and notebook.
When finished, close out of notebook and run exit to close the env.
Uninstall env and upgrade to newer version
In terminal, navigate to the folder location, and run pipenv --rm.
Download new source zip and follow steps above.
If I've understood correctly, this should ensure anyone can use the distribution in a tightly controlled environment, without making any alterations to their existing Python install? Have I overcomplicated things?
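For concreteness, a minimal Pipfile matching this workflow might look like the following (package names and version pins are illustrative):
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
jupyter = "*"
ipywidgets = "*"
pandas = "==1.0.5"

[dev-packages]
pytest = "*"

[requires]
python_version = "3.8"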

How to install stringi from local file (ABSOLUTELY no Internet Access)

I am working on a remote server using RStudio. This server has no access to the internet. I would like to install the package stringi. I have looked at this Stack Overflow article, but whenever I use the command
install.packages("stringi_0.5-5.tar.gz",
configure.vars="ICUDT_DIR=/my/directory/for/icudt.zip")
It simply tries to access the Internet, which it cannot do. Up until now I have been using Tools -> Install Packages -> Install from Packaged Archive File. However, due to this error, I can no longer use this method.
How can I install this package?
If you have no internet access on local machines, you can build a distributable source package that includes all the required ICU data files (for off-line use) by omitting some relevant lines in the .Rbuildignore file. The following command sequence should do the trick:
wget https://github.com/gagolews/stringi/archive/master.zip -O stringi.zip
unzip stringi.zip
sed -i '/\/icu..\/data/d' stringi-master/.Rbuildignore
R CMD build stringi-master
Assuming the most recent development version is 1.3.1, a file named stringi_1.3.1.tar.gz is created in the current working directory. The package can now be installed (the source bundle may be propagated via scp etc.) by executing:
R CMD INSTALL stringi_1.3.1.tar.gz
or by calling install.packages("stringi_1.3.1.tar.gz", repos=NULL) from within an R session.
For a Linux machine, the easiest way, from my point of view, is:
Download the release you need from Rexamine in tar.gz format to your local PC. Unlike the version on CRAN, it already contains the icu55\data\ folder.
Move the archive to your target Linux machine without internet access.
Run R CMD INSTALL stringi-1.0-1.tar.gz (in the case of release 1.0-1).
You provided the wrong value of configure.vars: it has to be the directory's name, not the final file name.
Correct your code to the following:
install.packages("stringi_0.5-5.tar.gz",
configure.vars="ICUDT_DIR=/my/directory/for/")
Follow the steps below:
Download icudt55l.zip separately, on a server where you have internet access, with:
wget http://www.mini.pw.edu.pl/~gagolews/stringi/icudt55l.zip
Copy the downloaded packages to the server where you want to install stringi.
Execute the following command:
R CMD INSTALL --configure-vars='ICUDT_DIR=/tmp/ALL' stringi_1.1.6.tar.gz
Here, icudt55l.zip has been copied to /tmp/ALL.
The suggestion from @gagolews almost worked for me. Here's what actually did the trick with RStudio.
Download the master.zip file, which will save as stringi-master.zip.
Unzip the file onto your desktop. The unzipped folder should be stringi-master.
Edit the .Rbuildignore file by removing ^src/icu55/data and ^src/icu61/data or similar lines.
Move the folder from your desktop to the home directory of your server.
Create a New Project in RStudio with ~/stringi-master as the Existing Directory
From RStudio's menu, select Build and Build Source Package. (You may need to first select Configure Build Tools. For Project build tools choose Package then select OK.)
It should create a tar.gz file, in the following format: stringi_x.x.(x+1).tar.gz. For example, if the current version of stringi is 1.5.3, it will create version 1.5.4. (I received a few warnings that didn't seem to affect the outcome.)
Move the newly created package to your local repository, update the repository index, and install the package (a sketch follows below).
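A hedged sketch of that last step, assuming a local CRAN-style repository under /srv/local-cran and the version number produced above:
# Add the built package to the repository and regenerate the PACKAGES index
mkdir -p /srv/local-cran/src/contrib
cp stringi_1.5.4.tar.gz /srv/local-cran/src/contrib/
Rscript -e 'tools::write_PACKAGES("/srv/local-cran/src/contrib", type = "source")'
# Install from the local repository
Rscript -e 'install.packages("stringi", repos = "file:///srv/local-cran")'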
