purpose of .RDataTmp temporary file? [R]

What is the purpose of the R temporary file that is created in every directory where a workspace is saved? What data does it contain, and is it safe to delete?

That file is a holding file for save.image(): R writes the workspace image there first and renames it to the target file only once the save succeeds. From help(save.image) and its safe argument:
safe - logical. If TRUE, a temporary file is used for creating the saved workspace. The temporary file is renamed to file if the save succeeds. This preserves an existing workspace file if the save fails, but at the cost of using extra disk space during the save.
So the file contains the entire workspace image, and it is probably best to just leave it there in case R fails to save the workspace normally.
I'm also guessing that if you see this file, R has already failed to rename it, so you may want to look for the target file and check its contents before deleting the temporary one.
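As a rough illustration of the mechanism (the leftover file name is an assumption based on the question's title):
# with safe = TRUE (the default), R writes the workspace to a temporary
# file first and renames it to .RData only if the save succeeds
save.image(file = ".RData", safe = TRUE)

# a leftover temporary file is itself a saved workspace, so you can
# inspect it in a separate environment before deciding to delete it
tmp <- new.env()
load(".RDataTmp", envir = tmp)   # assumes the leftover file is named .RDataTmp
ls(tmp)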

Related

Databricks Auto Loader - why is new data not written to the table when the original CSV file is deleted and a new CSV file is uploaded?

I have a question about Auto Loader and writeStream.
My use case:
A few days ago I uploaded 2 CSV files into the Databricks file system, then read them and wrote them to a table with Auto Loader.
Today I found that the files uploaded earlier contained wrong (faked) data, so I deleted the old CSV files and uploaded 2 new, correct CSV files.
Then I read and wrote the new files with Auto Loader streaming.
I found that the stream could read the data from the new files successfully, but failed to write it to the table with writeStream.
When I deleted the checkpoint folder with all its subfolders and files, re-created the checkpoint folder, and ran the read and write stream again, the data was written to the table successfully.
Question:
Since Auto Loader has detected the new files, why can't it write them to the table successfully until I delete the checkpoint folder and create a new one?
Auto Loader works best when new files are ingested into a directory; overwriting existing files might give unexpected results. I haven't worked with the option cloudFiles.allowOverwrites set to true yet, but it might help you (see the documentation linked below).
On the question of why readStream detects the overwritten files but writeStream does not: this is because of the checkpoint. The checkpoint is always linked to the writeStream operation. If you do
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")                  # file format of the source files
      .option("cloudFiles.schemaLocation", "<path_to_checkpoint>")
      .load("<filepath>"))
display(df)
then you will always see the data of all the files in the directory. If you use writeStream, you need to add .option("checkpointLocation", "<path_to_checkpoint>"). This checkpoint will remember that the files (which were overwritten) have already been processed. That is why the overwritten files are only processed again after you delete the checkpoint.
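For example, a minimal writeStream sketch using DataStreamWriter.toTable (available on recent Spark versions; the table name is hypothetical):
(df.writeStream
   .option("checkpointLocation", "<path_to_checkpoint>")   # records which files were already processed
   .toTable("target_table"))                               # hypothetical target table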
Here is some more documentation about the topic:
https://learn.microsoft.com/en-us/azure/databricks/ingestion/auto-loader/faq#does-auto-loader-process-the-file-again-when-the-file-gets-appended-or-overwritten
cloudFiles.allowOverwrites: https://docs.databricks.com/ingestion/auto-loader/options.html#common-auto-loader-options

How to load the actual .RData file that is just called .RData (the compressed file that gets saved from a session)

Similar questions, but not the question I have, were around loading a file that someone saved as somefilename.RData. I am trying to do something different.
What I am trying to do is load the actual .RData file that gets saved from an R session. The context is that I am using 2 different computers and am trying to download the .RData file from one computer and then load this same .RData file on a different computer in RStudio.
When I download the .RData file it shows up without the “.” (e.g., it shows up as RData). When I try to rename it to “.RData”, Windows will not allow me to do so.
Is there a way to do what I am trying to do?
Thanks!
After playing around with this, I was able to load the file (even though it was called “RData” and not “.RData”) by using RStudio: go to Session > Load Workspace... and navigate to that file. I had used File > Open File..., which did not work.
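For anyone who prefers the console, load() does not care about the file's name, only its contents, so something like this should also work (the download path is hypothetical):
# load() restores a saved workspace regardless of what the file is called
load("C:/Users/me/Downloads/RData")   # hypothetical path to the downloaded file
ls()                                  # list the objects that were restored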

Mapping Data Type for CSV files

Is it possible to save a mapping file which SSIS can use to decide the data type based on the column names, rather than updating it in the Advanced section of the 'Flat File Connection Manager Editor'? Thank you.
This is a common problem faced by every SSIS developer: whenever you make changes in a flat file connection, you lose all data type mappings and have to re-enter them manually in the advanced editor.
But the following practices can save you a lot of rework:
Practice #1
When you work with an existing connection, make sure the flat file referenced by the flat file connection exists at the referenced location with the same name. If you forgot to save it or can't find it, try the second practice.
Practice #2
Follow these steps before using the SSIS package:
1. Open the package in XML format.
2. Find the flat file connection.
3. Read the file name and path from the flat file connection.
4. Get a copy of the final output file (usually you can find where SSIS exports it).
5. Copy that output file, rename it to the connection's file name, and place it at the flat file connection's location.
6. Remove all data from the file except the column list, keeping the file format as it is (e.g., CSV); see the sketch after this list.
7. Close the package XML and the package itself.
Reopen the SSIS package, and your mappings are preserved.
This trick works for me in all my cases.
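If you want to script step 6, a minimal R sketch that keeps only the header row (both file paths are hypothetical):
# read just the first line (the column list) of the exported output file
header <- readLines("C:/output/final_output.csv", n = 1)
# write it back as a header-only file at the flat file connection's location
writeLines(header, "C:/connections/input.csv")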

What is the philosophy behind the workspaces in R?

When I start an R session from some directory, R automatically loads the corresponding workspace (if it exists). When I finish working in this workspace, I can decide whether I want to save the current workspace. This logic is simple and clear.
What I do not understand is what happens if I start R from some directory and then change the working directory with setwd(). As far as I understand, the workspace corresponding to the new working directory is not loaded; I still see the variables and history from the previous working directory. Why?
Second, when I quit() R, I replace the workspace image corresponding to the "new" working directory with the workspace from the "old" directory. Do I interpret the behavior correctly? What is the logic behind it? Can I switch to another workspace from within an R session?
Workspaces are stored in .RData files and are automatically loaded from the current working directory when you start R. But the working directory itself (and the setwd() function that sets it) has nothing to do with workspaces. You can load any workspace by explicitly specifying its .RData file:
load("c:/project/myfile.RData")
or
setwd("c:/project/")
load(".RData")
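As for switching workspaces, one possibility is to load a second workspace into its own environment so it does not overwrite your current objects (the path is illustrative):
# load another project's workspace into a separate environment
other <- new.env()
load("c:/other_project/.RData", envir = other)
ls(other)   # inspect that workspace's objects without touching your own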

Acting on changes to a log file in R as they happen

Suppose a log file is being written to disk, with one extra line appended to it every so often (by a process I have no control over).
I would like to know a clean way to have an R program "watch" the log file and process each new line as it is written.
Any advice would be much appreciated.
You can use file.info to get the modification date of a file; just check every so often and take action if the modification date changes. Keeping track of how many lines have already been read will let you use scan or read.table to read only the new lines.
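A minimal polling sketch along those lines (the log path and interval are hypothetical):
log_file <- "/var/log/app.log"            # hypothetical log path
last_mtime <- file.info(log_file)$mtime   # last known modification time
n_read <- 0L                              # lines processed so far

repeat {
  Sys.sleep(5)                            # poll every 5 seconds
  mtime <- file.info(log_file)$mtime
  if (!is.na(mtime) && mtime > last_mtime) {
    last_mtime <- mtime
    new_lines <- scan(log_file, what = character(), sep = "\n",
                      skip = n_read, quiet = TRUE)
    n_read <- n_read + length(new_lines)
    for (line in new_lines) {
      message("processing: ", line)       # replace with real handling
    }
  }
}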
You could also delete or move the log file after it is read by your program. The external program will then create a new log file, I assume. Using file.exists you can check if the file has been recreated, and read it when needed. You then add the new data to the already existing data.
I would move the log file to an archive subfolder and read the logfiles as they are created.
