Kusto ingestion error "BadRequest_EmptyArchive: Empty zip archive" - azure-data-explorer

I have a bunch of .csv files in Azure blob storage, and an ingestion rule to pull them into Kusto (Azure Data Explorer). This used to work, but I've been getting a lot of ingestion failures lately. ".show ingestion failures" gives me:
Details: BadRequest_EmptyArchive: Empty zip archive
FailureKind: Permanent
OperationKind: DataIngestPull
ErrorCode: BadRequest_EmptyArchive
ShouldRetry: 0
IngestionProperties: "[Format=Csv/mandatory, IngestionMapping=[{""column"":""CrashSource"",""datatype"":""string"",""Ordinal"":""0""},{""column"":""CrashType"",""datatype"":""string"",""Ordinal"":""1""},{""column"":""ReportId"",""datatype"":""string"",""Ordinal"":""2""},{""column"":""DeviceId"",""datatype"":""string"",""Ordinal"":""3""},{""column"":""DeviceSerialNumber"",""datatype"":""string"",""Ordinal"":""4""},{""column"":""DumpFilePath"",""datatype"":""string"",""Ordinal"":""5""},{""column"":""FailureXmlPath"",""datatype"":""string"",""Ordinal"":""6""},{""column"":""PROCESS_NAME"",""datatype"":""string"",""Ordinal"":""7""},{""column"":""BUILD_VERSION_STRING"",""datatype"":""string"",""Ordinal"":""8""},{""column"":""DUMP_TYPE"",""datatype"":""string"",""Ordinal"":""9""},{""column"":""PRIMARY_PROBLEM_CLASS"",""datatype"":""string"",""Ordinal"":""10""},{""column"":""IMAGE_NAME"",""datatype"":""string"",""Ordinal"":""11""},{""column"":""FAILURE_BUCKET_ID"",""datatype"":""string"",""Ordinal"":""12""},{""column"":""OS_VERSION"",""datatype"":""string"",""Ordinal"":""13""},{""column"":""TARGET_TIME"",""datatype"":""string"",""Ordinal"":""14""},{""column"":""FAILURE_ID_HASH_STRING"",""datatype"":""string"",""Ordinal"":""15""},{""column"":""FAILURE_ID_HASH"",""datatype"":""string"",""Ordinal"":""16""},{""column"":""FAILURE_ID_REPORT_LINK"",""datatype"":""string"",""Ordinal"":""17""}], ValidationPolicy=[Options=ValidateCsvInputConstantColumns, Implications=BestEffort], Tags=[ToStringEmpty], IngestIfNotExists=[ToStringEmpty], ZipPattern=[null]]"
IngestionSourcePath: https://crashanalysisresults.blob.core.usgovcloudapi.net/datacontainer/Telemetry.37c92f1a-a951-4047-b839-e685bd11758f.zip.crashanalysis.csv
My CSV files are not zipped in blob storage. Do I need to do something with ZipPattern to say so?
Here's what this CSV contains (many strings simplified):
CrashSource,CrashType,ReportId,DeviceId,DeviceSerialNumber,DumpFilePath,FailureXmlPath,PROCESS_NAME,BUILD_VERSION_STRING,DUMP_TYPE,PRIMARY_PROBLEM_CLASS,IMAGE_NAME,FAILURE_BUCKET_ID,OS_VERSION,TARGET_TIME,FAILURE_ID_HASH_STRING,FAILURE_ID_HASH,FAILURE_ID_REPORT_LINK
"source","type","reportid","deviceid","","dumpfilepath","failurexmlpath","process","version","1","problem class","image","bucket","version","2020-07-27T22:36:44.000Z","hash string","{b27ad816-2cb5-c004-d164-516c7a32dcef}","link"

As often happens, I seem to have found my own answer just by asking. While reviewing my question, I realized that the string ".zip" appears in the middle of my CSV file name (Telemetry.37c92f1a-a951-4047-b839-e685bd11758f.zip.crashanalysis.csv). That made me wonder whether Kusto treats the file differently because of it. I tested by taking the exact same file, renaming it "Telemetry.37c92f1a-a951-4047-b839-e685bd11758f.crashanalysis.csv", and uploading it. It was ingested successfully.
<forehead smack> I guess I will rename my files to get past this, but this sounds like a Kusto ingestion bug to me.
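In case it helps anyone else hitting this, here's roughly how I plan to script the rename. This is just a minimal sketch assuming the azure-storage-blob Python package and a placeholder connection string; since Blob Storage has no rename operation, it copies each blob to a name without ".zip" and then deletes the original:

# Minimal sketch (assumptions: azure-storage-blob package, placeholder connection
# string). Strips ".zip." out of CSV blob names so Kusto's archive detection
# is not triggered.
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    conn_str="<connection-string>",       # placeholder
    container_name="datacontainer")

for blob in container.list_blobs(name_starts_with="Telemetry."):
    if ".zip." in blob.name and blob.name.endswith(".csv"):
        new_name = blob.name.replace(".zip.", ".")
        source = container.get_blob_client(blob.name)
        target = container.get_blob_client(new_name)
        # Blob Storage has no rename: copy to the new name, then delete the original.
        # (A production version should poll the copy status before deleting.)
        target.start_copy_from_url(source.url)
        source.delete_blob()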

Thanks for reporting this, Sue.
Yes, Kusto automatically attempts to treat files with ".zip" in their names as archives. We'll look into why this "trips" when the string appears in the middle of the file name instead of only at its end.

Related

Can I use PowerBI to access SharePoint files, and R to write those files to a local directory (without opening them)?

I have a couple of large .xlsb files in 2FA-protected SharePoint. They refresh periodically, and I'd like to automate the process of pulling them across to a local directory. I can do this in PowerBI already by polling the folder list, filtering to the folder/files that I want, importing them and using an R script to write that to an .rds (it doesn't need to be .rds - any compressed format would do). Here's the code:
let
    #"~ Query ~" = "",
    //Address for the SP folder
    SPAddress = "https://....sharepoint.com/sites/...",
    //Poll the content
    Source15 = SharePoint.Files(SPAddress, [ApiVersion=15]),
    //... some code to filter the content list down to the 2 .xlsb files I'm interested in - they're listed as nested 'binary' items under column 'Content' within table 'xlsbList'
    //R export within an arbitrary 'add column' instruction
    ExportRDS = Table.AddColumn(xlsbList, "Export", each R.Execute(
        "saveRDS(dataset, file = ""C:/Users/current.user/Desktop/XLSBs/" & [Label] & ".rds"")",
        [dataset = Excel.Workbook([Content])[Data]{0}]))
However, the files are so large that my login times out before the refresh can complete. I've tried using R's file.copy command instead of saveRDS, to pick up the files as binaries (so PowerBI never has to import them):
R.Execute("file.copy(dataset, ""C:/Users/current.user/Desktop/XLSBs/"")", [dataset=[Content]])
with dataset=[Content] instead of dataset=Excel.Workbook([Content])[Data]{0} (which gives me a different error, and would in any event run into the same runtime issues as before), but it tells me "The Parameter 'dataset' isn't a Table". Is there a way to reference what PowerBI sees as binary objects from within nested R (or Python) code, so that I can copy them to a local directory without PowerBI importing them as data?
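For what it's worth, the kind of thing I'm imagining is purely a sketch like the following: if each binary could first be turned into a Base64 text column (e.g. Binary.ToText([Content], BinaryEncoding.Base64)) so that it passes as a table, a Python script step could decode it and write the bytes back out. The column names here ('Label', 'ContentBase64') are hypothetical:

# Sketch only, not working code. Assumes the script step receives a pandas
# DataFrame named 'dataset' with hypothetical columns 'Label' and 'ContentBase64'
# (the file bytes encoded via Binary.ToText(..., BinaryEncoding.Base64)).
import base64
import os

target_dir = r"C:/Users/current.user/Desktop/XLSBs"
for _, row in dataset.iterrows():          # 'dataset' is injected by the script step
    raw_bytes = base64.b64decode(row["ContentBase64"])
    with open(os.path.join(target_dir, row["Label"] + ".xlsb"), "wb") as handle:
        handle.write(raw_bytes)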
Unfortunately I don't have permissions to set the SharePoint site up for direct access from R/Python, or I'd leave PowerBI out entirely.
Thanks in advance for your help

Iterating through files in google cloud storage

I am currently trying to read PDF files stored in my Google Cloud Storage bucket. So far I have figured out how to read one file at a time, but I want to be able to loop through multiple files in my bucket without manually reading them one by one. How can I do this? I have attached my code below.
To iterate over all the files in your bucket, move your code for downloading and parsing into the for loop. I also changed the for loop to for blob in blob_list[1:]: since GCS always returns the top-level folder as the first element, and you do not want to parse that since it will result in an error. My folder structure used for testing is "gs://my-bucket/output/file.json....file_n.json".
Output when looping through the folder (for blob in blob_list:):
Output files:
output/
output/copy_1_output-1-to-1.json
output/copy_2_output-1-to-1.json
output/output-1-to-1.json
Output when skipping the first element (for blob in blob_list[1:]:):
Output files:
output/copy_1_output-1-to-1.json
output/copy_2_output-1-to-1.json
output/output-1-to-1.json
Loop through the files, skipping the first element:
import json  # needed for json.loads below

# 'bucket' and 'prefix' come from your existing Storage client setup
blob_list = list(bucket.list_blobs(prefix=prefix))
print('Output files:')
for blob in blob_list[1:]:
    json_string = blob.download_as_string()
    response = json.loads(json_string)
    first_page_response = response['responses'][0]
    annotation = first_page_response['fullTextAnnotation']
    print('Full text:\n')
    print(annotation['text'])
    print('END OF FILE')
    print('##########################')
NOTE: If your folder structure differs from the one used in this test, just adjust the index in the for loop.
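If you prefer not to depend on the placeholder being the first element at all, here is an alternative sketch (bucket name and prefix are placeholders) that filters blobs by name instead of by position:

# Alternative sketch: skip folder placeholders and non-JSON blobs by name.
# Bucket name and prefix are placeholders; adjust to your environment.
from google.cloud import storage
import json

client = storage.Client()
bucket = client.bucket("my-bucket")
for blob in bucket.list_blobs(prefix="output/"):
    if blob.name.endswith("/") or not blob.name.endswith(".json"):
        continue  # skip the folder placeholder and anything that is not JSON output
    response = json.loads(blob.download_as_string())
    annotation = response["responses"][0]["fullTextAnnotation"]
    print(annotation["text"])
    print("END OF FILE")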

how to fix my sqlite? database disk image is malformed

When I ran 'vacuum;', I got the message below:
"database disk image is malformed."
So I did the following:
C>sqlite3 malformed.db
sqlite3>.mode insert
sqlite3>.output a.sql
sqlite3>.dump
and then continued with:
C>sqlite3 new.db
sqlite3>.read a.sql
Finally, I found that new.db has a file size of 0 bytes.
Any ideas?
For some reason, when the database is malformed, this very process you described adds one last line to the .sql file, which says "rollback" ("-- due to errors", it elaborates).
Reminds you of those anecdotes about genies, doesn't it?
While it is intended to prevent disasters when importing data into existing databases, it is a disaster if all you are trying to do is reconstruct your database from scratch.
So remove this line, replace it with "end transaction;", re-run the import, and you are hopefully OK.
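If you'd rather not edit the dump by hand, a minimal Python sketch of that fix (file names are placeholders) could look like this:

# Minimal sketch: swap the trailing ROLLBACK that .dump emits for a corrupted
# database with "end transaction;" so the dump can be replayed into a new database.
with open("a.sql", "r", encoding="utf-8") as dump_file:
    lines = dump_file.readlines()

# The offending last line usually reads something like: ROLLBACK; -- due to errors
if lines and lines[-1].strip().upper().startswith("ROLLBACK"):
    lines[-1] = "end transaction;\n"

with open("a_fixed.sql", "w", encoding="utf-8") as fixed_file:
    fixed_file.writelines(lines)

Then rebuild with sqlite3 new.db ".read a_fixed.sql" as before.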

tFlowToIterate with BigDataBatch job

I currently have a simple Standard job in Talend doing this:
It simply reads a file of several lines (tHDFSInput), and for each line of this file (tFlowToIterate), I create an INSERT query "INSERT ... SELECT ... FROM" based on what I read in my file (tHiveRow). And it works well; it's just a bit slow.
I now need to turn my "Standard" job into a "Big Data Batch" job in order to make it faster, and also because I was asked to only make Big Data Batch jobs from now on.
The thing is that there is no tFlowToIterate and no tHiveRow component in Big Data Batch...
How can I do this?
Thanks a lot.
Though I haven't tried this solution, I think this can help you.
Create the Hive table upfront. Then:
1. Place a tHDFSConfiguration component in the job and provide the cluster details.
2. Use a tFileInputDelimited component. With its storage configuration set to the tHDFSConfiguration defined in step 1, it will read from HDFS.
3. Use a tHiveOutput component. Connect tFileInputDelimited to tHiveOutput. In tHiveOutput, you can provide the table, the format, and the save mode.
In order to load data from HDFS into Hive without modifying it, maybe you can use just one component: tHiveLoad.
Insert your HDFS path inside the component.
tHiveLoad documentation: https://help.talend.com/reader/hCrOzogIwKfuR3mPf~LydA/ILvaWaTQF60ovIN6jpZpzg

Dynamic output issue when rowset is empty

I'm running a U-SQL script similar to this:
@output =
    SELECT Tag,
           Json
    FROM @table;

OUTPUT @output
TO @"/Dir/{Tag}/filename.txt"
USING Outputters.Text(quoting : false);
The problem is that @output is empty and the execution crashes. I already checked that if I don't use {Tag} in the output path, the script works fine (it writes an empty file, but that's to be expected).
Is there a way to avoid the crash and simply not output anything?
Thank you
The form you are using is not yet publicly supported. Output to Files (U-SQL) documents the only supported version right now.
That said, depending on the runtime that you are using, and the flags that you have set in the script, you might be running the private preview feature of outputting to a set of files. In that case, I would expect that it would work properly.
Are you able to share a job link?
