Iterating through files in Google Cloud Storage - google-cloud-vision

I am currently trying to read PDF files stored in my Google Cloud Storage bucket. So far I have figured out how to read one file at a time, but I want to be able to loop through multiple files in the bucket without reading them one by one manually. How can I do this? I have attached my code below.

To iterate over all files in your bucket, move your code for downloading and parsing into the for loop. I also changed the loop to for blob in blob_list[1:]: since GCS always returns the top-level folder as the first element, and you do not want to parse that since it will result in an error. The folder structure I used for testing is "gs://my-bucket/output/file.json....file_n.json".
Output when looping through the folder (for blob in blob_list:):
Output files:
output/
output/copy_1_output-1-to-1.json
output/copy_2_output-1-to-1.json
output/output-1-to-1.json
Output when skipping the first element (for blob in blob_list[1:]:):
Output files:
output/copy_1_output-1-to-1.json
output/copy_2_output-1-to-1.json
output/output-1-to-1.json
Loop through files. Skip the first element:
from google.cloud import storage
import json

# bucket_name and prefix are the same values used when reading a single file
# (e.g. bucket_name = 'my-bucket', prefix = 'output/').
storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name)

blob_list = list(bucket.list_blobs(prefix=prefix))
print('Output files:')
for blob in blob_list[1:]:  # skip the folder placeholder returned as the first element
    json_string = blob.download_as_string()
    response = json.loads(json_string)
    first_page_response = response['responses'][0]
    annotation = first_page_response['fullTextAnnotation']
    print('Full text:\n')
    print(annotation['text'])
    print('END OF FILE')
    print('##########################')
NOTE: If your folder structure differs from the one used in this test, just adjust the index in the for loop.
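If you would rather not depend on the listing order at all, a more robust variant (a sketch, not part of the original answer) is to filter blobs by name instead of by position:

# Sketch: select the Vision output files by name rather than by list position.
blob_list = list(bucket.list_blobs(prefix=prefix))
print('Output files:')
for blob in blob_list:
    if not blob.name.endswith('.json'):
        continue  # skips the 'output/' folder placeholder and any non-JSON objects
    response = json.loads(blob.download_as_string())
    print(blob.name)
    print(response['responses'][0]['fullTextAnnotation']['text'])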

Related

Can I use PowerBI to access SharePoint files, and R to write those files to a local directory (without opening them)?

I have a couple of large .xlsb files in 2FA-protected SharePoint. They refresh periodically, and I'd like to automate the process of pulling them across to a local directory. I can do this in PowerBI already by polling the folder list, filtering to the folder/files that I want, importing them and using an R script to write that to an .rds (it doesn't need to be .rds - any compressed format would do). Here's the code:
let
    #"~ Query ~"="",
    //Address for the SP folder
    SPAddress="https://....sharepoint.com/sites/...",
    //Poll the content
    Source15 = SharePoint.Files(SPAddress, [ApiVersion=15]),
    //... some code to filter the content list down to the 2 .xlsb files I'm interested in - they're listed as nested 'binary' items under column 'Content' within table 'xlsbList'
    //R export within an arbitrary 'add column' instruction
    ExportRDS = Table.AddColumn(xlsbList, "Export", each R.Execute(
        "saveRDS(dataset, file = ""C:/Users/current.user/Desktop/XLSBs/" & [Label] & ".rds"")",[dataset=Excel.Workbook([Content])[Data]{0}]))
However, the files are so large that my login times out before the refresh can complete. I've tried using R's file.copy command instead of saveRDS, to pick up the files as binaries (so PowerBI never has to import them):
R.Execute("file.copy(dataset, ""C:/Users/current.user/Desktop/XLSBs/"")",[dataset=[Content]])
with dataset=[Content] instead of dataset=Excel.Workbook([Content])[Data]{0} (which gives me a different error, but in any event would result in the same runtime issues as before), but it tells me "The Parameter 'dataset' isn't a Table". Is there a way to reference what PowerBI sees as binary objects from within nested R (or Python) code, so that I can copy them to a local directory without PowerBI importing them as data?
Unfortunately I don't have permissions to set the SharePoint site up for direct access from R/Python, or I'd leave PowerBI out entirely.
Thanks in advance for your help

How to use googledrive::drive_upload() without changing the Google ID of file?

When I upload a pptx file to Drive (or any file) I'd like to maintain the Google ID for the file, but every time I execute this function, a new Google ID is created even when overwrite=TRUE. This breaks the hyperlink that stakeholders were using to find the file in Drive. Is there a way to maintain the Google ID when overwriting during upload?
googledrive::drive_upload(
  my_pres,
  name = "My Presentation",
  type = 'presentation', # converts pptx to Google Slides
  overwrite = TRUE
)
According to the documentation, googledrive::drive_upload() wraps the Files.create method of the Drive API. That is the wrong function to use for updating a file. Setting the overwrite argument to TRUE means:
"[...] Check for a pre-existing file at the filepath. If there is zero or one, move a pre-existing file to the trash[...]"
You should use googledrive::drive_update(), which wraps the Files.update method of the Drive API. In the R docs it is described as:
"[...] Update an existing Drive file id with new content ("media" in Drive API-speak), new metadata, or both. To create a new file or update existing, depending on whether the Drive file already exists, see drive_put(). [...]"

Kusto ingestion error "BadRequest_EmptyArchive: Empty zip archive"

I have a bunch of .csv files in Azure blob storage, and an ingestion rule to pull them into Kusto (Azure Data Explorer). This used to work, but I've been getting a lot of ingestion failures lately. ".show ingestion failures" gives me:
Details: BadRequest_EmptyArchive: Empty zip archive
FailureKind: Permanent
OperationKind: DataIngestPull
ErrorCode: BadRequest_EmptyArchive
ShouldRetry: 0
IngestionProperties: "[Format=Csv/mandatory, IngestionMapping=[{""column"":""CrashSource"",""datatype"":""string"",""Ordinal"":""0""},{""column"":""CrashType"",""datatype"":""string"",""Ordinal"":""1""},{""column"":""ReportId"",""datatype"":""string"",""Ordinal"":""2""},{""column"":""DeviceId"",""datatype"":""string"",""Ordinal"":""3""},{""column"":""DeviceSerialNumber"",""datatype"":""string"",""Ordinal"":""4""},{""column"":""DumpFilePath"",""datatype"":""string"",""Ordinal"":""5""},{""column"":""FailureXmlPath"",""datatype"":""string"",""Ordinal"":""6""},{""column"":""PROCESS_NAME"",""datatype"":""string"",""Ordinal"":""7""},{""column"":""BUILD_VERSION_STRING"",""datatype"":""string"",""Ordinal"":""8""},{""column"":""DUMP_TYPE"",""datatype"":""string"",""Ordinal"":""9""},{""column"":""PRIMARY_PROBLEM_CLASS"",""datatype"":""string"",""Ordinal"":""10""},{""column"":""IMAGE_NAME"",""datatype"":""string"",""Ordinal"":""11""},{""column"":""FAILURE_BUCKET_ID"",""datatype"":""string"",""Ordinal"":""12""},{""column"":""OS_VERSION"",""datatype"":""string"",""Ordinal"":""13""},{""column"":""TARGET_TIME"",""datatype"":""string"",""Ordinal"":""14""},{""column"":""FAILURE_ID_HASH_STRING"",""datatype"":""string"",""Ordinal"":""15""},{""column"":""FAILURE_ID_HASH"",""datatype"":""string"",""Ordinal"":""16""},{""column"":""FAILURE_ID_REPORT_LINK"",""datatype"":""string"",""Ordinal"":""17""}], ValidationPolicy=[Options=ValidateCsvInputConstantColumns, Implications=BestEffort], Tags=[ToStringEmpty], IngestIfNotExists=[ToStringEmpty], ZipPattern=[null]]"
IngestionSourcePath: https://crashanalysisresults.blob.core.usgovcloudapi.net/datacontainer/Telemetry.37c92f1a-a951-4047-b839-e685bd11758f.zip.crashanalysis.csv
My CSV files are not zipped in blob storage. Do I need to do something with ZipPattern to say so?
Here's what this CSV contains (many strings simplified):
CrashSource,CrashType,ReportId,DeviceId,DeviceSerialNumber,DumpFilePath,FailureXmlPath,PROCESS_NAME,BUILD_VERSION_STRING,DUMP_TYPE,PRIMARY_PROBLEM_CLASS,IMAGE_NAME,FAILURE_BUCKET_ID,OS_VERSION,TARGET_TIME,FAILURE_ID_HASH_STRING,FAILURE_ID_HASH,FAILURE_ID_REPORT_LINK
"source","type","reportid","deviceid","","dumpfilepath","failurexmlpath","process","version","1","problem class","image","bucket","version","2020-07-27T22:36:44.000Z","hash string","{b27ad816-2cb5-c004-d164-516c7a32dcef}","link"
As often happens, I seem to have found my own answer just by asking. While reviewing my question, I realized that the string ".zip" is in the middle of my CSV file name (Telemetry.37c92f1a-a951-4047-b839-e685bd11758f.zip.crashanalysis.csv). That made me wonder if Kusto treats it differently because of that. I tested by taking the exact same file, renaming it to "Telemetry.37c92f1a-a951-4047-b839-e685bd11758f.crashanalysis.csv", and uploading it. This was ingested successfully.
<forehead smack> I guess I will rename my files to get past this, but this sounds like a Kusto ingestion bug to me.
Thanks for reporting this, Sue.
Yes, Kusto automatically attempts to check files with ".zip" in them. We'll check why this is "tripping" if the string is in the middle of the file name instead of just its end.
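If many blobs are affected, the rename workaround can be applied in bulk with a small script. The sketch below assumes the azure-storage-blob Python package; the connection string and container name are placeholders, and in real use you would verify each copy has completed before deleting the source:

from azure.storage.blob import BlobServiceClient

# Placeholders: supply your own connection string and container name.
service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("datacontainer")

for blob in container.list_blobs():
    # Only touch CSV blobs whose names contain a misleading ".zip." segment.
    if ".zip." in blob.name and blob.name.endswith(".csv"):
        new_name = blob.name.replace(".zip.", ".")
        source = container.get_blob_client(blob.name)
        target = container.get_blob_client(new_name)
        target.start_copy_from_url(source.url)  # server-side copy under the new name
        source.delete_blob()  # delete the original only after confirming the copy succeeded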

How to save images stored in assets to Firebase?

I am trying to figure out the best way to save an image (png) that is stored in my assets folder to my firebase database so I can retrieve that image/path for later use. Essentially this is what I want:
// savedPath was stored and then retrieved from database
let savedPath = '../assets/images/example.png'
return (<Image source={require(savedPath)} />)
However, if I just save the path as a string, I cannot insert it into the image source like this; it throws an error. What alternative do I have to achieve the same thing as the example above?
You can use a Firebase Storage bucket (https://firebase.google.com/docs/storage/) to store all your assets.
Then store the file name/path in your database and access it from there.
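As a rough sketch of that flow with the Firebase Web SDK (the storage path and function name here are illustrative assumptions, not from the original answer), note that a remote image is rendered with source={{ uri: ... }}, since require() only works with static paths:

import { getStorage, ref, getDownloadURL } from 'firebase/storage';

// savedPath (e.g. 'images/example.png') is the Storage path you saved in your database;
// the file itself is assumed to have been uploaded to the Storage bucket already.
async function loadImageUrl(savedPath) {
  const storage = getStorage();
  return await getDownloadURL(ref(storage, savedPath));
}

// In the component, use a URI source instead of require():
// <Image source={{ uri: imageUrl }} />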

Converting RTF to PDF from System

I've created a rule to transform any file to PDF and copy it to another folder.
So I can add a file named "test.rtf" and the rule creates a test.pdf in the "PDF" folder.
Up to here it's OK. If I add a file through Alfresco (the Add Content button) it works perfectly.
However, when I add a file from the system I've developed, the .rtf file arrives in the folder correctly, but the converted and copied PDF has no content.
If I send an RTF file with a table of 10 rows, each containing the word "testing", the created PDF has the table, but with 10 empty rows.
Does anyone know the reason for this?
I'm not sure, but maybe when I send the file from the system, Alfresco starts to convert and copy before the RTF has finished being created... Has anyone run into a problem like this?
The problem you're running into is that Alfresco first creates an empty node with all the metadata and then updates it with the associated content.
So you can do 2 things:
1: create a rule which is triggered on update, instead of on create/inbound
2: create a rule which triggers a JavaScript file, which does the transformation and checks the content size.
Of the two, it's better to create a rule which checks the content size.
Create a JavaScript file in Data Dictionary/Scripts.
Check the JavaScript API
Something like this:
// Only transform once the node actually has content.
if (document != null && document.size > 0) {
    // transformDocument() takes the target mimetype; transform the RTF content to PDF.
    document.transformDocument("application/pdf");
}
