Databricks Auto Loader - why is new data not written to the table when the original CSV file is deleted and a new CSV file is uploaded?

I have a question about Auto Loader's writeStream.
I have the following use case:
A few days ago I uploaded 2 CSV files into the Databricks file system, then read them and wrote them to a table with Auto Loader.
Today I found that the files uploaded earlier contained wrong, faked data, so I deleted those old CSV files and uploaded 2 new, correct CSV files.
Then I read and wrote the new files with an Auto Loader stream.
The stream could read the data from the new files successfully, but failed to write it to the table with writeStream.
Then I deleted the checkpoint folder (with all its subfolders and files), re-created the checkpoint folder, and ran the read and write stream again; this time the data was written to the table successfully.
Question:
Since Auto Loader has detected the new files, why can't it write them to the table successfully until I delete the checkpoint folder and create a new one?

Auto Loader works best when new files are ingested into a directory; overwriting existing files might give unexpected results. I haven't worked with the option cloudFiles.allowOverwrites set to true yet, but it might help you (see the documentation linked below).
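As a rough sketch of what that could look like (the paths are placeholders, I'm assuming a CSV source, and this runs in a Databricks notebook where spark is already defined; I haven't verified this option end to end):
# Sketch only: "<path_to_checkpoint>" and "<input_directory>" are placeholders.
# cloudFiles.allowOverwrites asks Auto Loader to reprocess a file when it is
# overwritten in place instead of treating it as already ingested.
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.allowOverwrites", "true")
      .option("cloudFiles.schemaLocation", "<path_to_checkpoint>")
      .load("<input_directory>"))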
On the question of readStream detecting the overwritten files while writeStream does not write them: this is because of the checkpoint. The checkpoint is always linked to the writeStream operation. If you do
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.schemaLocation", "<path_to_checkpoint>")
      .load("<input_directory>"))
display(df)
then you will always see the data from all the files in the directory. If you use writeStream, you need to add .option("checkpointLocation", "<path_to_checkpoint>"). This checkpoint remembers that the files (which were later overwritten) have already been processed. That is why the overwritten files are only processed again after you delete the checkpoint.
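For illustration only, the write side could look roughly like this, reusing df from the readStream above; the target table name is a made-up placeholder and Delta is assumed as the sink:
# Sketch only: "<target_table>" is a placeholder; the checkpointLocation is what
# records which input files have already been processed.
(df.writeStream
   .format("delta")
   .option("checkpointLocation", "<path_to_checkpoint>")
   .trigger(availableNow=True)  # process what is currently available, then stop (newer Spark/DBR versions)
   .toTable("<target_table>"))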
Here is some more documentation about the topic:
https://learn.microsoft.com/en-us/azure/databricks/ingestion/auto-loader/faq#does-auto-loader-process-the-file-again-when-the-file-gets-appended-or-overwritten
cloudFiles.allowOverwrites https://docs.databricks.com/ingestion/auto-loader/options.html#common-auto-loader-options

Related

How to upload CSV files to GitHub repo and use them as data for my R scripts

I'm currently doing a project that uses R to process some large csv files that are saved in my local directory linked to my repo.
So far, I managed to create the R project and commit and push R scripts into the repo with no problem.
However, the scripts read the data from the csv files saved in my local directory, so the code takes the form
df <- read.csv("mylocaldirectorylink")
However, this is not helpful when my partner and I, working on the same project, have to change that path to our own local directories every time we pull from the repo. So I was thinking that maybe we could upload the csv files to the GitHub repo and let the R script refer directly to the csv files online.
So my questions are:
Why can't I upload csv files to GitHub? It keeps saying that my file is too large.
If I can upload the csv files, how do I read the data from them?
Firstly, it's generally a bad idea to store data on GitHub, especially if it's large. If you want to keep it somewhere on the Internet, you can use, say, Dataverse, and then access your data via a URL (through the API), or Google Drive, as Jake Kaupp suggested.
Now back to your question. If your data doesn't change, I would use relative paths to the CSV rather than absolute ones. In other words, instead of
df <- read.csv("C:/folder/subfolder/data.csv")
I would use
df <- read.csv("../data.csv")
If you are working in an R project, the initial working directory is the project folder. You can check it with getwd(). This working directory moves with the R project. Just agree with your colleague that the data file should sit in the same folder that contains the R project folder.
This is for a Python script.
You can track csv files by editing your .gitignore file.
OR
You can add csv files to your GitHub repo so they can be used by others.
I did so with the following steps:
Check out the branch on github.com.
Go to the folder where you want to keep the csv files.
There you will see an "Add file" option in the top-right area.
There you can upload csv files and commit the changes to the same branch or to a new branch.

Mapping Data Type for CSV files

Is it possible to save a mapping file which SSIS can use to decide the data type based on the column names, rather than going and updating it in the Advanced section of the 'Flat File Connection Manager Editor'? Thank you.
This is a common problem faced by every SSIS developer: whenever you change a flat file connection, you lose all the data type mappings and have to re-enter them manually in the advanced editor.
But the following practices can save you a lot of rework:
Practice #1
When you work with an existing connection, make sure the flat file is at the location referenced by the flat file connection, with the same name. If you forgot to save it or can't find it, try the second practice.
Practice #2
Follow the steps below before using the SSIS package:
Open the package in XML format.
Find the flat file connection.
Note the file name and path of the flat file connection.
Get a copy of the final output file (usually you can find where SSIS exports the final output file).
Copy that output file, rename it to the connection's file name, and place it at the flat file connection's location.
Remove all the data from the file except the column list (make sure you keep the file format as it is, e.g., CSV or Excel).
Close the XML view of the SSIS package and close the package.
Reopen the SSIS package and your data type mappings are preserved.
This trick has worked for me in all my cases.

Purpose of the .RDataTmp temporary file? [R]

What is the purpose of the R temporary file that is created in every directory where a workspace is saved? What data does it contain, and is it safe to delete?
That file is a holding file used by save.image() while R waits for the save to its file argument to succeed. From help(save.image) and its safe argument:
safe - logical. If TRUE, a temporary file is used for creating the saved workspace. The temporary file is renamed to file if the save succeeds. This preserves an existing workspace file if the save fails, but at the cost of using extra disk space during the save.
So the file contains the entire workspace image and it is probably best to just leave it there in case R fails to save the workspace normally.
I'm also guessing that if you see this file, R has already failed to rename it, so you may want to look for the target file (the file argument) and check its contents before deleting the temporary file.

OutOfMemory issue while creating an XSSFWorkbook instance to read an XLSX file

As per business requirements, we need to read multiple Excel files (both .xls and .xlsx formats) at different locations in a multi-threaded environment. Each thread is responsible for reading one file. To test the performance, we created 2 file sets in both .xls and .xlsx formats. One file set has just 20 rows of data while the other contains 300,000 rows. We are able to successfully read both files in .xls format and load the data into the table. Even for the 20-row .xlsx file, our code works fine.
But when the execution flow starts reading the .xlsx file, the application server terminates abruptly. While tracing down the issue, I found a strange problem while creating the XSSFWorkbook instance. Refer to the code snippet below:
OPCPackage opcPackage = OPCPackage.open(FILE);
System.out.println("Created OPCPackage instance.");
XSSFWorkbook workbook = new XSSFWorkbook(opcPackage);
System.out.println("Created XSSFWorkbook instance.");
SXSSFWorkbook sxssfWorkbook = new SXSSFWorkbook(workbook, 1000);
System.out.println("Created SXSSFWorkbook instance.");
Output
Process XLSX file EXCEL_300K.xlsx start.
Process XLSX file EXCEL.xlsx start.
Created OPCPackage instance.
Created OPCPackage instance.
Created XSSFWorkbook instance.
Created SXSSFWorkbook instance.
Process XLSX file EXCEL.xlsx end.
For the larger file set, the execution hangs at
XSSFWorkbook workbook = new XSSFWorkbook(opcPackage);
causing a heap space issue. Please help me fix this issue.
Thanks in advance.
After trying a lot of solutions I found that processing XLSX files requires a lot of memory. But using the POI 3.12 library has multiple advantages:
It processes Excel files faster.
It has more APIs to handle Excel files, such as closing a workbook and opening an Excel file from a File instance, etc.

Acting on changes to a log file in R as they happen

Suppose a log file is being written to disk with one extra line appended to it every so often (by a process I have no control over).
I would like to know a clean way to have an R program "watch" the log file and process each new line as it is written.
Any advice would be much appreciated.
You can use file.info to get the modification time of a file; just check every so often and take action if the modification time changes. Keeping track of how many lines have already been read will let you use scan or read.table to read only the new lines.
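The question asks about R, but the polling idea itself is language-agnostic. Purely as an illustration of "check the modification time, then read only the new lines" (the file path, interval, and process function below are made up), here is a sketch in Python:
import os
import time

LOG_PATH = "app.log"    # hypothetical log file path
POLL_SECONDS = 5        # hypothetical polling interval

def process(line):
    # Placeholder for whatever should happen with each new log line.
    print(line, end="")

last_mtime = 0.0
lines_read = 0
while True:
    mtime = os.path.getmtime(LOG_PATH)
    if mtime != last_mtime:              # the file changed since the last check
        last_mtime = mtime
        with open(LOG_PATH) as f:
            lines = f.readlines()
        for line in lines[lines_read:]:  # only the lines appended since last time
            process(line)
        lines_read = len(lines)
    time.sleep(POLL_SECONDS)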
You could also delete or move the log file after it is read by your program. The external program will then create a new log file, I assume. Using file.exists you can check if the file has been recreated, and read it when needed. You then add the new data to the already existing data.
I would move the log file to an archive subfolder and read the logfiles as they are created.

Resources