google-cloud-vision: how to read a PDF file

I am using the Google OCR (Cloud Vision) API to read both images and PDF files. I am able to read and process image files; however, for PDF files the documentation says the document needs to be stored in Google Cloud Storage.
Due to data confidentiality I can't store my data in Google Cloud, and I want to upload my PDF from my local system in order to read the text from it. Is it possible to process a PDF from local disk instead of uploading the file to Google Cloud?

As you said, it's not possible to do that locally. I filed a Feature Request [1] on your behalf; you can follow updates there.
That said, here is a possible workaround that might satisfy your data confidentiality concerns. It consists of using the Cloud Storage client libraries [2] to both upload and delete those files:
You have the PDF file locally and no bucket containing it.
Upload it to a bucket [3]
Use that bucket+file URI to read it through the Cloud Vision API and store the result in a bucket
Download the result file to your local machine [4]
Delete both the PDF file and the result file from the bucket(s) [5]
This should work as long as you don't mind having those files in buckets for a brief period of time.
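If that trade-off is acceptable, a minimal sketch of the upload/download/cleanup steps with the google-cloud-storage Python client could look like this (the bucket name, file names and the "ocr-output/" prefix are placeholders of my own, and the Vision call itself is only indicated as a comment):
from google.cloud import storage
storage_client = storage.Client()
bucket = storage_client.bucket("my-temp-bucket")
# 1. Upload the local PDF to the bucket
pdf_blob = bucket.blob("example.pdf")
pdf_blob.upload_from_filename("example.pdf")
# 2. Run Cloud Vision text detection on gs://my-temp-bucket/example.pdf and
#    have it write its JSON results under gs://my-temp-bucket/ocr-output/
#    (elided here)
# 3. Download the result file(s) to the local machine
for result_blob in storage_client.list_blobs("my-temp-bucket", prefix="ocr-output/"):
    result_blob.download_to_filename(result_blob.name.split("/")[-1])
# 4. Delete both the PDF and the result file(s) from the bucket
pdf_blob.delete()
for result_blob in storage_client.list_blobs("my-temp-bucket", prefix="ocr-output/"):
    result_blob.delete()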

The code for a locally stored file is not under the document-specific section of the docs, but here: https://cloud.google.com/vision/docs/file-small-batch
I have summarized the code for both the GCP and local options below.
# imports
import io
from google.cloud import vision
from google.cloud.vision_v1 import enums
# Set up the Vision API client
client = vision.ImageAnnotatorClient()
features = [{"type": enums.Feature.Type.DOCUMENT_TEXT_DETECTION}]
mime_type = 'application/pdf'
# from GCP (Cloud Storage)
gcs_source_uri = "gs://bk-bucketname/example.pdf"
gcs_source = vision.types.GcsSource(uri=gcs_source_uri)
input_gcp = vision.types.InputConfig(gcs_source=gcs_source, mime_type=mime_type)
# from local disk
file_path = "./example.pdf"
with io.open(file_path, "rb") as f:
    content = f.read()
input_local = {"mime_type": mime_type, "content": content}
# send the API request (swap input_local for input_gcp to use the GCS file)
pages = [1]  # list of page numbers; max 5 for online, 2000 for offline/async
requests = [{"input_config": input_local, "features": features, "pages": pages}]
response = client.batch_annotate_files(requests)
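The detected text can then be pulled out of the nested response objects; a small sketch, assuming the v1 response shape (one entry per input file, each holding one result per requested page):
# Each entry in response.responses corresponds to one input file, and its
# .responses list holds one result per requested page.
for file_response in response.responses:
    for page_response in file_response.responses:
        print(page_response.full_text_annotation.text)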

You may split the PDF into pages, send them individually to the online OCR API, and merge the results in order. Alternatively, you can rely on an OCR service that can do this for you, such as https://base64.ai/demo/document-processing/ocr
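For the splitting approach, something like the pypdf package can produce one single-page PDF per page in memory, each of which can then be sent as the content of an input_config like input_local above (a sketch, assuming pypdf is installed; the variable names are my own):
import io
from pypdf import PdfReader, PdfWriter
reader = PdfReader("./example.pdf")
page_payloads = []
for page in reader.pages:
    writer = PdfWriter()
    writer.add_page(page)
    buffer = io.BytesIO()
    writer.write(buffer)  # serialize the single-page PDF
    page_payloads.append(buffer.getvalue())
# Each element of page_payloads can now be used as the "content" of an
# input_config (see input_local above) and sent in its own request.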

Related

Preserving or parsing file structure from GCP Cloud Storage in an R Server Posit Workbench Standard for GCP instance

I am working on a project that involves a large .las file (a 3D point cloud). I process this file in a variety of ways using the lidR package in R. Because of the size of the file and my hardware limitations, I decided to give cloud computing a try, so this is my first foray into this world; apologies for my lack of experience. I have successfully set up an RStudio Server instance on GCP and added the file for processing to a Google Cloud Storage bucket.
With the code below I'm able to call the file I need and load it into memory. I am stuck here because the file doesn't have its original structure anymore. I understand that Google stores the file in some sort of "blob" structure, and that this new structure is a stream of bytes that can be parsed. The R environment lists it as "Large raw (52136558042 elements, 52.1 GB)".
I have been searching online for how to parse this but haven't yet found an answer. Any help that could point me in the right direction would be much appreciated.
# Setup
library(googleCloudStorageR)
options(googleAuthR.scopes.selected = "https://www.googleapis.com/auth/devstorage.full_control")
# Authenticate client ID
googleAuthR::gar_set_client("auth/client_ID.json")
# Authenticate service account
Sys.setenv("GCS_AUTH_FILE" = "auth/service_key.json")
googleCloudStorageR::gcs_auth()
# Get bucket info
bucket <- gcs_list_buckets(projectId = "my_project")
bucket <- gcs_global_bucket(bucket$name)
objects <- gcs_list_objects("my_lidar_project123")
# Download and parse the object
wholeasslas <- gcs_get_object(objects$name[[1]],
                              parseObject = TRUE)
httr::content(wholeasslas, type = "application/vnd.las")
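As an aside, the usual way to keep the original byte layout is to save the object straight to a local file instead of parsing it in memory (googleCloudStorageR can save an object to disk as well). For illustration only, the same idea in Python, the language used by the other examples on this page, with placeholder object names:
from google.cloud import storage
client = storage.Client()
bucket = client.bucket("my_lidar_project123")  # bucket name from the question
blob = bucket.blob("pointcloud.las")           # placeholder object name
# Writing the object to disk keeps the original .las byte layout intact,
# so a normal LAS reader can open it instead of a raw in-memory blob.
blob.download_to_filename("local_copy.las")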

Struggling to use my own API key with googlesheets4 in shinyapps.io

I've got googlesheets4 working in a shinyapps.io with the following code:
gs4_auth(
  email = "me@email.com",
  path = NULL,
  scopes = "https://www.googleapis.com/auth/drive",
  cache = "path_to_cache",
  use_oob = FALSE,
  token = NULL)
I run this locally, which requires initial browser authentication and downloads a file of some sort.
As long as I upload that file with my app to shinyapps.io, then it works (i.e. refreshes the token whenever it needs).
However, as I understand it, this is using googlesheets4's own Google API settings, which were set up to make it easy for everyone to use.
The disadvantage is that, since a lot of people share this API project, they sometimes (myself included) hit the data limits and get a 429 RESOURCE_EXHAUSTED error. This is discussed here.
OK, so I've followed the instructions here and here and added the following code BEFORE the auth chunk already provided:
if (interactive()) {
  # Desktop client ID
  google_app <- httr::oauth_app(
    "my-awesome-google-api-wrapping-package",
    key = "mykey_for_desktop_app",
    secret = "mysecret"
  )
} else {
  # Web client ID
  google_app <- httr::oauth_app(
    "my-awesome-google-api-wrapping-package",
    key = "mykey_for_web_app",
    secret = "mysecret"
  )
}
# API key
google_key <- "My-API-KEY"
gs4_auth_configure(app = google_app, api_key = google_key)
# Also configure googledrive to use my API
drive_auth_configure(app = google_app, api_key = google_key)
So this seems to work locally (e.g. in RStudio) and I can see activity on my Google Cloud API dashboard.
However, whilst this works for a short period of time (e.g. 10 minutes) even when uploaded to shinyapps.io, the token refresh soon seems to fail and I get the dreaded:
"Can't get Google credentials. Are you running googlesheets4 in a non-interactive session?"
Is anyone able to point me towards what I'm doing wrong?
Again - it works fine as long as I'm not trying to use my own API settings (the second code chunk).
OK, pretty sure I've got this working...
It was the YouTube video here that really helped and made this clearer.
All I need is a service account, which comes with a JSON key file that I can upload with my app.
i.e. at around 1:03 the video shows the creation of this service account; adding that e-mail address (of the service account) to the Google Sheet(s) I want to access means I can download (using googledrive) and write (using googlesheets4).
The crazy part is that all I need to put in my code is the following:
drive_auth(path = ".secrets/client_secret.json")
gs4_auth(path = ".secrets/client_secret.json")
i.e. those two lines (plus the downloaded json file for the Service Account) replace ALL the code I posted in my OP!
If anyone is reading this, I was struggling with the last steps of Jimbo's (excellent) answer, i.e. how to upload the local json file to shinyapps.io.
My working solution: I created a subfolder called "secrets" inside the Shiny app folder, next to the app.R file, and placed the JSON file there. I made sure to set my working directory to the Shiny app folder when testing everything locally. (Note: don't include the setwd() call in your Shiny app code.) I'm not sure if this exposes the JSON file somehow, but it'll have to do.
When publishing to shinyapps.io, I checked all the boxes suggested by RStudio to upload the whole contents of the folder (the app.R file plus the subfolder with the JSON file in it). I used the following paths in the app.R file:
drive_auth(path = "secrets/clientsecret.json")
gs4_auth(path = "secrets/clientsecret.json")

Download image from url without saving it locally to then upload to firebase storage [duplicate]

I'm a Ruby dev trying my hand at Google Cloud Functions written in Python and have hit a wall with transferring a remote file from a given URL to Google Cloud Storage (GCS).
In an equivalent RoR app I download to the app's ephemeral storage and then upload to GCS.
I am hoping there's a way to simply 'download' the remote file to my GCS bucket via the Cloud Function.
Here's a simplified example of what I am doing with some comments, the real code fetches the URLs from a private API, but that works fine and isn't where the issue is.
from google.cloud import storage
project_id = 'my-project'
bucket_name = 'my-bucket'
destination_blob_name = 'upload.test'
storage_client = storage.Client.from_service_account_json('my_creds.json')
# This works fine
#source_file_name = 'localfile.txt'
# When using a remote URL I get 'IOError: [Errno 2] No such file or directory'
source_file_name = 'http://www.hospiceofmontezuma.org/wp-content/uploads/2017/10/confused-man.jpg'
def upload_blob(bucket_name, source_file_name, destination_blob_name):
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)
    blob.upload_from_filename(source_file_name)
upload_blob(bucket_name, source_file_name, destination_blob_name)
Thanks in advance.
It is not possible to upload a file to Google Cloud Storage directly from a URL. Since you are running the script from a local environment, the file contents that you want to upload need to be in that same environment. This means that the contents of the URL need to be stored either in memory or in a file.
An example showing how to do it, based on your code:
Option 1: You can use the wget module, which will fetch the URL and download its contents into a local file (similar to the wget CLI command). Note that this means the file will be stored locally and then uploaded from that file. I added the os.remove line to remove the file once the upload is done.
from google.cloud import storage
import wget
import os
project_id = 'my-project'
bucket_name = 'my-bucket'
destination_blob_name = 'upload.test'
storage_client = storage.Client.from_service_account_json('my_creds.json')
source_file_name = 'http://www.hospiceofmontezuma.org/wp-content/uploads/2017/10/confused-man.jpg'
def upload_blob(bucket_name, source_file_name, destination_blob_name):
    # download the URL contents to a temporary local file
    filename = wget.download(source_file_name)
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)
    blob.upload_from_filename(filename, content_type='image/jpg')
    # clean up the local copy once the upload is done
    os.remove(filename)
upload_blob(bucket_name, source_file_name, destination_blob_name)
Option 2: using the urllib module works similarly to the wget module, but instead of writing into a file it reads the contents into a variable. Note that I did this example in Python 3; there are some differences if you plan to run your script in Python 2.X.
from google.cloud import storage
import urllib.request
project_id = 'my-project'
bucket_name = 'my-bucket'
destination_blob_name = 'upload.test'
storage_client = storage.Client.from_service_account_json('my_creds.json')
source_file_name = 'http://www.hospiceofmontezuma.org/wp-content/uploads/2017/10/confused-man.jpg'
def upload_blob(bucket_name, source_file_name, destination_blob_name):
    # open the URL and read its contents into memory
    file = urllib.request.urlopen(source_file_name)
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)
    blob.upload_from_string(file.read(), content_type='image/jpg')
upload_blob(bucket_name, source_file_name, destination_blob_name)
Directly transferring URLs into GCS is possible through the Storage Transfer Service, but setting up a transfer job for a single URL is a lot of overhead. That sort of solution is targeted at situations where millions of URLs need to become GCS objects.
Instead, I recommend writing a job that pumps an incoming stream from reading a URL into a write stream to GCS, and running it somewhere in Google Cloud close to the bucket.
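A rough sketch of that stream-to-stream idea using the requests library and Blob.upload_from_file, reusing the bucket/object variables from the snippets above (buffering behaviour depends on library versions, so treat this as an outline rather than a drop-in solution):
import requests
from google.cloud import storage
def stream_url_to_gcs(url, bucket_name, destination_blob_name):
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(destination_blob_name)
    # stream=True keeps the HTTP body as a file-like object instead of
    # loading the whole payload into memory at once
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        blob.upload_from_file(resp.raw, content_type=resp.headers.get("Content-Type"))
stream_url_to_gcs(source_file_name, bucket_name, destination_blob_name)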

How do I delete an image with cloud functions with download url?

I want to delete an image, and all I have is the download URL.
In Flutter I am able to get the file path from the download URL and use that path to delete the file in Cloud Storage.
Is it possible to get the file path from the download URL and use that path to delete the image from Cloud Functions?
Or is there any better, faster, or more efficient way to delete an image from Cloud Storage with only the download URL?
A Google Cloud Storage object URL has the following parts:
https://storage.cloud.google.com/[bucket_name]/[path/and/the/object/name*]?[authentication_if_needed]
*The path in Cloud Storage is "virtual"; in fact it is an integral part of the object name/identification. The Cloud Console and gsutil simulate folders for the user interface output.
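For the case where only the download URL is available, the bucket and object name can usually be recovered from a URL of that form before deleting; a small sketch (it assumes the storage.cloud.google.com form shown above and ignores any query string):
from urllib.parse import urlparse, unquote
def bucket_and_object_from_url(url):
    # https://storage.cloud.google.com/[bucket_name]/[path/and/the/object/name]?...
    parsed = urlparse(url)
    bucket_name, _, object_name = parsed.path.lstrip("/").partition("/")
    return bucket_name, unquote(object_name)
print(bucket_and_object_from_url(
    "https://storage.cloud.google.com/my-bucket/images/photo.jpg?authuser=0"))
# ('my-bucket', 'images/photo.jpg')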
There are several ways to delete an object:
From the Cloud Console
Using the Cloud SDK command: gsutil rm gs://[BUCKET_NAME]/[OBJECT_NAME]
Using the client libraries, for example with Python:
from google.cloud import storage
def delete_blob(bucket_name, blob_name):
    """Deletes a blob from the bucket."""
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(blob_name)
    blob.delete()
    print('Blob {} deleted.'.format(blob_name))
Please keep in mind that the user or service account performing the operation needs the proper permissions to delete the object.

Size of json to upload (import) on Firebase database

I can upload a JSON file of around 10 KB to the Firebase database.
But when I try to upload a larger one, e.g. a JSON file of 30 KB or 70 KB, it shows the error "There was a problem contacting the server. Try uploading your file again".
Please first refer to the status dashboard:
https://status.firebase.google.com/
At the time of the question, the console was experiencing a service disruption, which is why you can read/write to the DB via your application but cannot perform admin tasks via the console.
Note that on the status dashboard the affected entry appears well above the Realtime Database (RTDB) row.
