How do I move a dataset from Huggingface to Google Cloud? - firebase

I am trying to use the Hugging Face multi_nli dataset to train a multi-class text classification AI in Google Cloud. I want to call the AI from a Firebase web app eventually. But when I try this code in Colab:
!pip install datasets
from datasets import load_dataset
# Load only train set
dataset = load_dataset(path="multi_nli", split="train")
It says it is saved in /root/.cache/huggingface/datasets/multi_nli/default/0.0.0/591f72e... but I can't find the file, only a variable version, so I can't move it to Google Cloud. What is missing for the download to work? Is there some other workaround to get it into Google Cloud?

This is easy to do with the Dataset.save_to_disk method and the help of the gcsfs package. First, install gcsfs:
pip install gcsfs
Then you can use the Dataset.save_to_disk and Dataset.load_from_disk methods to save and load the dataset to and from a Google Cloud Storage bucket. To save it:
from datasets import load_dataset
from gcsfs import GCSFileSystem
fs = GCSFileSystem()
dataset = load_dataset(path="multi_nli", split="train")
dataset.save_to_disk("gs://YOUR_BUCKET_NAME_HERE/multi_nli/train", fs=fs)
This will create a directory in the Google Cloud Storage bucket YOUR_BUCKET_NAME_HERE with the content of the dataset. To load it back, you only need to execute the following:
from datasets import Dataset
from gcsfs import GCSFileSystem
fs = GCSFileSystem()
dataset = Dataset.load_from_disk("gs://YOUR_BUCKET_NAME_HERE/multi_nli/train", fs=fs)
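Note that GCSFileSystem() with no arguments relies on whatever Google credentials the environment already provides. If nothing is picked up automatically (for example in a fresh Colab runtime), you can point gcsfs at credentials explicitly; a minimal sketch, where the project ID and key-file path are placeholders:
from gcsfs import GCSFileSystem
# Use application-default credentials (e.g. after running
# google.colab.auth.authenticate_user() in Colab) ...
fs = GCSFileSystem(project="YOUR_PROJECT_ID", token="google_default")
# ... or point gcsfs at a service-account key file (hypothetical path)
fs = GCSFileSystem(project="YOUR_PROJECT_ID", token="/path/to/service-account.json")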
For more information, please refer to:
Datasets - Cloud Storage
gcsfs
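As for finding the cached copy on the Colab VM itself: load_dataset stores the data as Arrow files under the cache directory it reports, and the Dataset object can list their exact paths via its cache_files property; a small sketch:
from datasets import load_dataset
dataset = load_dataset(path="multi_nli", split="train")
# Each entry is a dict with a "filename" key pointing at an Arrow file on disk
for cache_file in dataset.cache_files:
    print(cache_file["filename"])
That said, copying the directory produced by save_to_disk (as above) is the more portable way to move the dataset around.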

Related

Import GTFS realtime file into R

I'm trying to import a GTFS realtime data file into R using the RProtoBuf package, but can't get it to work. This is what I've tried, but I think it's way off track.
library(RProtoBuf)
setwd("c:\\temp\\")
proto <- readProtoFiles("seq")
The gtfsway package reads the data from the site, as in this question, but the authors say the package is outdated.

Is it possible to work locally, training a TensorFlow model using data from Google Cloud Storage, without downloading it?

I am trying to use a TensorFlow model locally in R with tfdatasets and cloudml, using training data available in Google Cloud Storage without downloading it. As far as I know, the tfdatasets package should accept gs:// URLs directly via gs_data_dir().
If I specify in TSScript.R:
data_dir <- gs_data_dir("gs://my-gcp-project/data/")
When I run cloud_train(TSScript.R) I get the error:
Error: 'gs://my-gpc-project/data/train.data.csv' does not exist in current working directory ('/root/.local/lib/python2.7/site-packages/cloudml-model')
Here are my questions:
1. Is it somehow possible to do this, and am I just making a mistake in my script?
2. If not, do I need to install R in the cloud and work from there directly?
3. Would it be possible to train with data from BigTable without downloading it locally?
Thanks
For 1) I think you might be looking for tf.gfile() (tf.io.gfile in current TensorFlow); see the Python sketch after these points.
https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile
Example of use: https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/census/keras/trainer/model.py#L154
Hope this helps!
For 2) If you want to do this, you should look at Custom Containers. https://cloud.google.com/ml-engine/docs/custom-containers-training
For 3) I'm not familiar with BigTable, but my guess is you would have to query the data you need and manually pull it locally. I don't think tf.gfile supports BigTable, only GCS.
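To make point 1) concrete, here is a minimal Python sketch of reading a file straight from a gs:// path with tf.io.gfile; the bucket and file names are placeholders taken from the question:
import tensorflow as tf
# tf.io.gfile understands gs:// paths, so the file is streamed from GCS
# rather than being copied into the local working directory first.
with tf.io.gfile.GFile("gs://my-gcp-project/data/train.data.csv", "r") as f:
    header = f.readline()
    first_row = f.readline()
print(header, first_row)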

Google Dataproc with Jupyter - Downloading files generated by notebook

We're using Google Cloud Dataproc for quick data analysis, and we use Jupyter notebooks a lot. A common case for us is to generate a report which we then want to download as a csv.
In a local Jupyter env this is possible using FileLinks, for example:
from IPython.display import FileLinks
df.to_csv(path)
FileLinks(path)
This doesn't work with Dataproc because the notebooks are kept on a Google Storage bucket and the links generated are relative to that prefix, for example http://my-cluster-m:8123/notebooks/my-notebooks-bucket/notebooks/my_csv.csv
Does anyone know how to overcome this? Of course we can scp the file from the machine but we're looking for something more convenient.
To share the report you can save it to Google Cloud Storage (GCS) instead of a local file.
To do so, you need to convert your Pandas DataFrame to a Spark DataFrame and write it to GCS:
from pyspark import SparkContext
from pyspark.sql import SQLContext
sparkDf = SQLContext(SparkContext.getOrCreate()).createDataFrame(df)
sparkDf.write.csv("gs://<BUCKET>/<path>")
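Alternatively, if gcsfs is installed on the cluster (as in the answer to the first question above), pandas can usually write to a gs:// path directly, which avoids the round trip through Spark; a small sketch with a placeholder bucket and a stand-in DataFrame:
import pandas as pd
# With gcsfs installed, pandas resolves gs:// URLs through fsspec,
# so the CSV lands directly in the bucket.
df = pd.DataFrame({"metric": ["clicks", "views"], "value": [42, 1000]})
df.to_csv("gs://<BUCKET>/<path>/report.csv", index=False)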

Retrieve data from Google Drive file which is not shared by link in R

I would like to import a file into R from Google Drive. I am trying this with gsheet::gsheet2text, but am unable to retrieve the information because the file is not shared by link. How can I log in to get the data through R?

How to download google docs spreadsheet through R

I would like to download a Google Docs spreadsheet using R and then import it as a CSV file in a small Shiny server app.
How could I do that?
Try the new googlesheets package, which is an R API for Google Sheets:
https://github.com/jennybc/googlesheets
This snippet will install the package, copy a Sheet to your Google Drive, register it for access, and import data from one tab or worksheet into a local data.frame:
devtools::install_github("jennybc/googlesheets")
gap_key <- "1HT5B8SgkKqHdqHJmn5xiuaC04Ngb7dG9Tv94004vezA"
copy_ss(key = gap_key, to = "Gapminder")
gap <- register_ss("Gapminder")
oceania_csv <- get_via_csv(gap, ws = "Oceania")
As for integration with Shiny, see the shinyga package which recently incorporated support for googlesheets:
https://github.com/MarkEdmondson1234/shinyga
You can use the RGoogleDocs package to access Google Docs content. Another option worth considering is the RGoogleData package, which provides access to Google services.
