Google Dataproc with Jupyter - Downloading files generated by notebook - jupyter-notebook

We're using Google Cloud Dataproc for quick data analysis, and we use Jupyter notebooks a lot. A common case for us is to generate a report which we then want to download as a csv.
In a local Jupyter environment this is possible with FileLink, for example:
from IPython.display import FileLink
df.to_csv(path)
FileLink(path)
This doesn't work with Dataproc because the notebooks are kept on a Google Storage bucket and the links generated are relative to that prefix, for example http://my-cluster-m:8123/notebooks/my-notebooks-bucket/notebooks/my_csv.csv
Does anyone know how to overcome this? Of course we can scp the file from the machine but we're looking for something more convenient.

To share the report, you can save it to Google Cloud Storage (GCS) instead of a local file.
To do so, convert your pandas DataFrame to a Spark DataFrame and write it to GCS:
from pyspark import SparkContext
from pyspark.sql import SQLContext
sparkDf = SQLContext(SparkContext.getOrCreate()).createDataFrame(df)
sparkDf.write.csv("gs://<BUCKET>/<path>")
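If going through Spark feels heavy for a small report, a possible alternative is to upload the CSV text directly with the google-cloud-storage client. The sketch below is only an illustration under assumptions: the helper names upload_csv_text and get_bucket are hypothetical, the bucket and object names are placeholders, and it assumes the google-cloud-storage package and suitable credentials are available on the cluster.

```python
def upload_csv_text(bucket, blob_name, csv_text):
    """Upload CSV text to an object in a GCS bucket and return its gs:// URI.

    `bucket` is expected to behave like google.cloud.storage.Bucket
    (it needs a `name` attribute and a `blob()` method).
    """
    blob = bucket.blob(blob_name)
    blob.upload_from_string(csv_text, content_type="text/csv")
    return "gs://{}/{}".format(bucket.name, blob_name)


def get_bucket(bucket_name):
    """Open a bucket with the google-cloud-storage client (imported lazily)."""
    # Assumption: google-cloud-storage is installed and credentials are configured.
    from google.cloud import storage
    return storage.Client().bucket(bucket_name)


# Example usage in a notebook (bucket and object names are placeholders):
# csv_text = df.to_csv(index=False)  # pandas returns a string when no path is given
# uri = upload_csv_text(get_bucket("my-reports-bucket"), "reports/my_csv.csv", csv_text)
```

The uploaded object can then be downloaded from the Cloud Storage browser or with gsutil, without copying anything off the master node.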

Related

Google Colab does not import modules

I was trying to find a way to install modules permanently. I came across this post, which explains how to install packages on Google Drive, mount the drive, and then use "sys.path.append" to tell Python where to look for the new package.
This method works as expected when a module is imported directly in code written in the notebook itself.
However, when I tried to run an existing project from its .py file (using "!python myCode.py"), the path added with "sys.path.append" had no effect, and the modules installed on Google Drive could not be found.
In short, with the approach in the link above, I can only import the packages from code written in the notebook itself. It does not work for .py files run with "!python myCode.py".
Any suggestions on how to solve this problem? Do you have the same problem as well?
Thanks,
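A likely explanation is that "!python myCode.py" starts a separate Python process, and changes made to sys.path inside the notebook process are not inherited by it, whereas the PYTHONPATH environment variable is. The sketch below illustrates this with a temporary directory standing in for the Drive folder; the module name my_drive_module and the helper run_child are hypothetical names used only for the demonstration.

```python
import os
import subprocess
import sys
import tempfile


def run_child(code, extra_path=None):
    """Run `code` in a fresh Python process, optionally adding extra_path to PYTHONPATH."""
    env = dict(os.environ)
    if extra_path is not None:
        existing = env.get("PYTHONPATH")
        env["PYTHONPATH"] = extra_path if not existing else extra_path + os.pathsep + existing
    result = subprocess.run(
        [sys.executable, "-c", code], env=env, capture_output=True, text=True
    )
    return result.returncode, result.stdout.strip()


# Temporary directory standing in for the package folder on Google Drive.
pkg_dir = tempfile.mkdtemp()
with open(os.path.join(pkg_dir, "my_drive_module.py"), "w") as f:
    f.write("VALUE = 42\n")

# This changes the import path of the current process only.
sys.path.append(pkg_dir)

# The child process does not see the sys.path change, so the import fails.
rc_without, _ = run_child("import my_drive_module")

# With PYTHONPATH set, the child process can import the module.
rc_with, out_with = run_child(
    "import my_drive_module; print(my_drive_module.VALUE)", extra_path=pkg_dir
)
print(rc_without != 0, rc_with, out_with)
```

In a notebook, the same idea would be to set PYTHONPATH on the command line before "python myCode.py", or to put the sys.path.append call at the top of myCode.py itself.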

Saving R output in Google Colab

How do I save R output to a file in Google Colab? It could be saved on Google Drive or on my local drive; either would work.
For example, if I wanted to save a list of R objects in an RDS file, I would normally use something like this in RStudio.
saveRDS(list(a, b, c, d), file = "C:\\sim1.rds")
I am looking to do something similar in Google Colab.
I recently found the answer, so I am writing it here in case it is useful for others.
To save output to Google Drive, we first need to mount the drive using the following.
from google.colab import drive
drive.mount('/content/drive')
Then we can navigate to MyDrive using the following.
cd /content/drive/MyDrive
Now that we are in MyDrive, we can run the code and save the outputs there. Then we can download them to our laptop.
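As an alternative to changing the working directory, the output file can be addressed by an absolute path under the mount point. The following is a minimal sketch, assuming the drive is mounted at /content/drive as above; the helper output_path and its fallback behaviour are hypothetical and not part of Colab.

```python
import os


def output_path(filename, drive_root="/content/drive/MyDrive", fallback_dir="."):
    """Return a path under the mounted Drive folder, or under fallback_dir if Drive is not mounted."""
    base = drive_root if os.path.isdir(drive_root) else fallback_dir
    return os.path.join(base, filename)


# The resulting string can be passed as the file argument when saving output.
print(output_path("sim1.rds"))
```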

Order of loaded data files differs between a locally run Jupyter notebook and Google Colab

To load data located in Google Drive, I am using the following code.
import glob
dataAD = glob.glob('ADNI_komplett/AD/*.nii.gz')
dataLMCI = glob.glob('ADNI_komplett/LMCI/*.nii.gz')
dataCN = glob.glob('ADNI_komplett/CN/*.nii.gz')
dataFiles = dataAD + dataLMCI + dataCN
I need to access the same data in a Jupyter notebook on my local machine, so I downloaded the data from Google Drive and am loading the files with the same code as above.
But I noticed that the order in which the files are loaded differs between Colab and Jupyter.
The screenshots show the difference.
On the left side of the screenshot is the code run in Jupyter on my machine; on the right side is the code run in Colab.
As the highlighted region shows, in Jupyter the first file name loaded is AD\mwp1ADNI_002_S_0729_MR_MT1, whereas in Colab the first file loaded is AD/mwp1AD_4001_037_MR_MT1. The second screenshot shows that the numeric ordering also differs.
I need the same ordering in both Colab and Jupyter.
Any suggestions for this problem are appreciated.
glob returns files in whatever order the filesystem lists them, which is not guaranteed to be sorted (see How is Pythons glob.glob ordered?). Colaboratory runs on a different filesystem than your local Jupyter runtime, so it's not surprising that the orders are different.
If you require files to be listed in the same order across platforms, I'd suggest sorting the outputs in Python, for example:
dataAD = sorted(glob.glob('ADNI_komplett/AD/*.nii.gz'))
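As a small self-contained illustration of the same idea, the sketch below creates a few empty files in a temporary folder and lists them sorted by file name. The helper list_sorted and the scan_* file names are made up for the example; sorting on the base name is a choice made here so that the path separator (\ on Windows, / on Linux) does not influence the order.

```python
import glob
import os
import tempfile


def list_sorted(pattern):
    """Return paths matching pattern, sorted by file name so the order is the same on every platform."""
    return sorted(glob.glob(pattern), key=os.path.basename)


# Temporary folder standing in for ADNI_komplett/AD.
root = tempfile.mkdtemp()
ad_dir = os.path.join(root, "AD")
os.makedirs(ad_dir)
for name in ["scan_b.nii.gz", "scan_a.nii.gz", "scan_c.nii.gz"]:
    open(os.path.join(ad_dir, name), "w").close()

names = [os.path.basename(p) for p in list_sorted(os.path.join(ad_dir, "*.nii.gz"))]
print(names)  # ['scan_a.nii.gz', 'scan_b.nii.gz', 'scan_c.nii.gz']
```

Note that this is a plain string sort, so names containing numbers of different lengths are ordered character by character rather than numerically.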

How to load CSV files in Google Colab for R?

How can I load CSV files in Google Colab for R?
For Python there are many answers, but can someone explain how a file can be imported in R on Google Colab?
Since your question wasn't very clear, I'm assuming you mean "get a CSV file from my local system into the Colaboratory environment" and not just importing it from inside the Colab file paths as per Korakot's suggestion. In that case, I think you have two main options:
1. Upload a file directly through the file browser in the side menu.
Just click the upload icon there and upload your file to the Colab session. Then you can run the normal R import functions using the internal path, as Korakot showed in this answer.
2. Connect your Google Drive
Assuming you're using a notebook like the one created by Thong Nguyen, you can use a Python call to mount your own Google Drive, like this one:
cat(system('python3 -c "from google.colab import drive\ndrive.mount(\'/content/drive\')"', intern=TRUE, wait=TRUE), sep='\n')
... which will initiate the login process for Google Drive and allow you to access your Drive files as if they were folders in Colab. There's more info about this process here.
If you use Colab with R as the runtime type (so Python code will not work), you can also simply upload the file as MAIAkoVSky suggested in step 1 and then import it with
data <- read.csv('/content/your-file-name-here.csv')
The file path can also be copied by right-clicking on the file in the interface.
Please be aware that the files will disappear once you disconnect from Colab. You will need to upload them again for the next session.
You can call the read.csv function like
data = read.csv('sample_data/mnist_test.csv')

Saving Google Colab Notebook on Github

Checkpoints in Google Colab
One of the answers to the above question mentions that to save checkpoints in Google Colab we should push the notebook to GitHub. I am unsure whether pushing to GitHub will also save all the files created in the VM environment of the Google Colab notebook. If not, please suggest an alternative solution. Thanks in advance.
Files in the VM environment will not be saved to GitHub. To save specific files, you'll need to write a script that saves them either to your local machine via:
https://gist.github.com/korakot/e7f04fa7bd3a8a67b729da279ab1713a
Or you can save the files using the Colab Drive integrations:
https://datascience.stackexchange.com/questions/27964/how-to-download-dynamic-files-created-during-work-on-google-colab
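As a rough sketch of the first option, the files created in the VM can be bundled into a single zip archive and then downloaded. The helper archive_directory and the file names are hypothetical; the final download call is shown only as a comment because it works inside Colab only.

```python
import os
import shutil
import tempfile


def archive_directory(src_dir, archive_base):
    """Zip the contents of src_dir into archive_base + '.zip' and return the archive path."""
    return shutil.make_archive(archive_base, "zip", root_dir=src_dir)


# Temporary folder standing in for the directory that holds the generated files.
work_dir = tempfile.mkdtemp()
with open(os.path.join(work_dir, "checkpoint.txt"), "w") as f:
    f.write("epoch=1\n")

archive_path = archive_directory(work_dir, os.path.join(tempfile.mkdtemp(), "outputs"))
print(archive_path)

# In Colab, the archive could then be sent to the browser:
# from google.colab import files
# files.download(archive_path)
```

Writing the archive to a folder on a mounted Google Drive instead would correspond to the second option.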
