I have 48 scripts used to clean data for 48 different tests. The cleaning protocol for each test used to be unique and test-specific, but the final project guideline now allows all tests to use the same cleaning protocol, provided all output files are saved to the appropriate directory (each test's own folder of results). I'm trying to combine these scripts into one master cleaning script that any team member can use to clean data as more is collected, or to make small changes, given they have the raw data files and a folder for each test (which I would give to them).
Currently I have tried two approaches:
The first is to load all necessary libraries in the body of a master cleaning script and then source() each individual cleaning script. Inside each script, the libraries are require()d, the appropriate files are read in, and the cleaned files are saved to their correct destinations. This method seems to work best, but when the whole master script is run, only some subtests are successfully cleaned and saved to their correct locations; the rest have to be run and saved individually, and I'm not sure why.
library(readr)
library(dplyr)
library(data.table)
library(lubridate)
source("~/SF_Cleaning_Protocol.R")
# ... and so on for the remaining cleaning scripts
The second is to save the body of the general cleaning script as a function, and then call that function in a series of if statements based on the test one wants to clean.
For example:
if (testname == "SF"){
  setwd("~/SF")
  # read in the csv files (file paths omitted here)
  subtest  <- read_csv()
  path_map <- read_csv()
  SpecIDs  <- read_csv()
  cleaned  <- CleaningProtocol(subtest, path_map, SpecIDs)
  # write.csv() needs the object to write as its first argument; assuming
  # CleaningProtocol() returns a list of four cleaned tables (hypothetical names):
  write.csv(cleaned$output1, "output1.csv")
  write.csv(cleaned$output2, "output2.csv")
  write.csv(cleaned$output3, "output3.csv")
  write.csv(cleaned$output4, "output4.csv")
} else if (testname == "EV"){
  # ... and so on for the other tests
}
The code reads in and writes out files fine when a test is selected and run individually, but when testname is specified and the script is run as a whole, it ignores the if statements, runs all tests, and fails to write results for any of them.
Is there a better option I haven't tried, or can anyone help me diagnose my issues?
Many thanks.
Background
I'm doing some data manipulation (joins, etc.) on a very large dataset in R, so I decided to use a local installation of Apache Spark and sparklyr to be able to use my dplyr code to manipulate it all. (I'm running Windows 10 Pro; R is 64-bit.) I've done the work needed, and now want to output the sparklyr table to a .csv file.
The Problem
Here's the code I'm using to output a .csv file to a folder on my hard drive:
spark_write_csv(d1, "C:/d1.csv")
When I navigate to the directory in question, though, I don't see a single csv file named d1.csv. Instead I see a newly created folder called d1, and inside it are ~10 .csv files whose names all begin with "part".
The folder also contains the same number of .csv.crc files, which I see from Googling are "used to store CRC code for a split file archive".
What's going on here? Is there a way to put these files back together, or to get spark_write_csv to output a single file like write.csv?
Edit
A user below suggested that this post may answer the question, and it nearly does, but it seems like the asker is looking for Scala code that does what I want, while I'm looking for R code that does what I want.
I had the exact same issue.
In simple terms, the partitioning is done for computational efficiency. With multiple partitions, several workers/executors can write the table in parallel, one part per partition. In contrast, with only one partition the csv file can only be written by a single worker/executor, which makes the task much slower. The same principle applies not only to writing tables but to parallel computation in general.
For more details on partitioning, you can check this link.
Suppose I want to save the table as a single file at the path path/to/table.csv. I would do it as follows (note that the repartitioned result has to be piped into the write, otherwise it is discarded):
table %>%
  sdf_repartition(partitions = 1) %>%
  spark_write_csv("path/to/table.csv", ...)
You can check full details of sdf_repartition in the official documentation.
The data will be divided into multiple partitions, and when you save the dataframe to CSV you get one file per partition. Before calling spark_write_csv you need to bring all the data into a single partition to get a single file.
You can use Spark's coalesce operation to achieve this; in sparklyr it is exposed as sdf_coalesce():
sdf_coalesce(df, 1)
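Putting that together with the d1 table from the question, a minimal sketch (the output path below is a placeholder, not from the original post):
library(sparklyr)
library(dplyr)
# d1 is the Spark table from the question; "C:/d1_out" is a hypothetical output path
d1 %>%
  sdf_coalesce(partitions = 1) %>%                  # collapse to a single partition
  spark_write_csv("C:/d1_out", mode = "overwrite")
# C:/d1_out is still written as a directory, but with one partition it should
# contain a single part-*.csv file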
My problem seems to be two-fold. I am using code that has worked before. I re-ran my scripts and got similar outputs, but saved to a new location. I have changed all of my setwd lines accordingly. But, there may be an error with either setwd or the do.call function.
In R, I want to merge 25 csv's that are located in a folder, keeping only certain columns.
My path is
/Documents/CODE/merge_file/2sp
So, I do:
setwd("/Documents/CODE")
but then I get an error saying it cannot change the working directory (this usually works fine). So I manually set the working directory via the Session menu in RStudio.
The next script seems to run fine:
myMergedData2 <-
  do.call(rbind,
          lapply(list.files(path = "/Documents/CODE/merge_file/2sp",
                            full.names = TRUE),  # full.names so read.csv gets complete paths
                 read.csv))
myMergedData2 ends up in the global environment, but it is NULL (empty), even though the console output makes it look like everything is ok.
I would then like to save just these columns of information but I can't even get to this point.
myMergedData2 <- myMergedData2[, c(2:5, 10:12)]  # keep response columns 2-5 and predictor columns 10-12
And then add this
myMergedData2 <- myMergedData2 %>%
  mutate(richness = 2) %>%
  select(richness, everything())
And then I would like to save
setwd("/Documents/CODE/merge_file/allsp")
write.csv(myMergedData2, "/Documents/CODE/merge_file/allsp/2sp.csv")
I am trying to merge these data so I can use ggplot2 to show how my response variables (columns 2-5) vary according to my independent variables (columns 10-12). I have 25 different parameter sets with 50 observations in each csv.
Ok, so the issue was that my Dropbox didn't have enough space, and, weirdly, I don't have permission to do what I was trying on my university's H drive. Bizarre, but an easy fix: increasing the space on Dropbox allowed the csv's to sync completely.
Sometimes the issue is minor!
This is an environment design question. I have a number of analysis/forecasting scripts I run each week, and each one relies on a number of files, with most files used by more than one script. I just had to change the name of one of the files, which was a real pain because I had to search through all my scripts and change the path declared in each one.
I would like to use a single .csv master file with file names and their paths, and create a centralized function that takes a list of file names, looks up their file paths, and then imports them all into the global environment. I could use this function in every script I run. Something like:
files_needed <- c("File_1", "File_2", "File_4", "File_6")
import_files(files_needed)
But then the function would require indirect variable assignment and declaring global variables, which I know are frowned upon, and I don't even know how to do both at once. I know I could write the path-importing logic manually in every script, but there must be a better option where I only have to write the import logic once.
Currently I have a master file that I source at the beginning of every script which loads my most commonly used packages and declares some helper functions I use frequently. I'd love to add this importing functionality in some capacity, but I'm open to solutions that look completely different to what I described. How do people generally solve this problem?
As a final note, many files have another twist: they incorporate e.g. a date into the file name, so I need to be able to pass additional parameters in order to get the exact one I need.
Without a worked example this is untested code, but why not just make a list of imported files using those names?
files_needed <- c("File_1", "File_2", "File_4", "File_6")
my_imported_files <- setNames(lapply(files_needed, read.csv),
                              paste0(files_needed, "_df"))
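For the lookup-file idea described in the question, here is a minimal hedged sketch. It assumes a master csv (hypothetically named file_paths.csv, with columns file_name and path) and returns a named list rather than assigning into the global environment:
# file_paths.csv and its column names are assumptions for illustration
import_files <- function(files_needed, lookup_csv = "file_paths.csv") {
  lookup <- read.csv(lookup_csv, stringsAsFactors = FALSE)
  paths  <- lookup$path[match(files_needed, lookup$file_name)]  # look up each file's path
  setNames(lapply(paths, read.csv), files_needed)               # read and name the results
}
files_needed <- c("File_1", "File_2", "File_4", "File_6")
my_files <- import_files(files_needed)
Returning a list avoids global assignment; if the objects really are wanted in the global environment, list2env(my_files, globalenv()) would put them there.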
I have two R scripts. The first reads csv files, cleans the data, checks for mathematical errors and corrects them ("errorcheck.R"). The second script gets the clean data from the first, combines column names, expressions and values and creates csv files ("createTables.R").
Originally, the first script was created to import 5 csv files, but for some projects I might have only 4 or 3 csv files to import, which is fine for the final output. That, however, throws an error, and when I try to source the first script from the second script I don't get the clean csv files. How can I source the clean datasets from the first script even when there are errors? The errors come only from trying to read csv files that don't exist.
I'm not sure if this is the same question as:
Is there a way to `source()` and continue after an error?
Can I have some ideas on this please?
Thanks in advance
I am not sure if this answers your question or not:
Situation:
1. According to your description, your first script is written for a static input of length 5 (i.e. five .csv files as input).
Solution:
I don't know how you take the .csv files as input in the first script. I suggest creating a vector of file-name strings, passing it to the first script, and using the length of that vector to decide how many times your operation should run. The input can then be of any length.
That way you can handle any number of .csv files rather than exactly 5. Try to avoid hard-coding.
Please let me know if this answers your question. If you face any difficulty, just let me know.
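As a rough sketch of that idea (the file names below are placeholders), the first script could keep only the files that actually exist and loop over however many there are:
# hypothetical file names; pass whichever subset the project actually has
csv_files <- c("data1.csv", "data2.csv", "data3.csv", "data4.csv", "data5.csv")
csv_files <- csv_files[file.exists(csv_files)]   # drop missing files instead of erroring
clean_list <- lapply(csv_files, function(f) {
  raw <- read.csv(f)
  # ... the cleaning and error checks from errorcheck.R would go here ...
  raw
})
names(clean_list) <- csv_files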
Is there a way in R to pass the values of some variables, say strings, defined in a script to another script that is being sourced, so that the latter can use them without having to declare them? E.g.:
# some R code
# ...
# ...
var1 <- "some string"
var2 <- "some param"
source("header.r")
Within header.r, a list has elements named by the strings stored in var1 and var2:
tabl <- alldata.list[["some string"]][["some param"]]
Such that, when I run the original script and it sources the header, tabl is assigned correctly?
Additionally, is there a restriction on the number and type of elements that can be passed?
When you use source to load a .R file, this sequentially runs the lines in that script, merging everything in that script into your running R session. All variables and functions are available from that moment onwards.
To make your code more readable/maintainable/debuggable, though, I would recommend not using variables to communicate between source files. Instead, I would use functions. In practice, for me this means having one or more files that contain helper functions (a sort of package-light). These helper functions abstract away some of the functionality you need in the main script, making it shorter and more to the point. The goal is a main script that roughly fills one screen: you can easily grasp the main idea of the script, and any details can be found in the helper functions.
Using functions makes the main script self-contained and not dependent on what happens in executable code in other source files. This requires less reasoning by you and others to work out exactly what the script is doing, as you basically just have to read 40-50 lines of code.
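As a minimal sketch of that function-based style applied to this example (the function name and argument handling are assumptions, not the asker's actual code), header.r could expose a function instead of code that relies on variables defined by the caller:
# header.r (hypothetical): no free variables, everything comes in as arguments
get_table <- function(data_list, var1, var2) {
  data_list[[var1]][[var2]]
}
# main script
source("header.r")
# alldata.list comes from wherever it is built in the real project
tabl <- get_table(alldata.list, "some string", "some param")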