Closed. This question is opinion-based. It is not currently accepting answers.
Closed 3 days ago.
Let's say you've just finished writing a series of custom functions in an RMarkdown book to take your dataset all the way through reading, tidying, analysis, visualization and export. You now want to deploy these functions on a folder full of CSV datasets, one after another. The functions can't be used as standalones, because each one requires objects that are output by the function before it; essentially, they need to be run in a linear order.
Of the two approaches I can imagine, which is the more efficient way of combining these functions?
Should you create an individual R script file for each function, source them all into another R script, and run each function as a standalone line of code, one after the other? e.g.,
x <- read_csv(data_sets)
clean_output <- func1(x)
results_output <- func2(clean_output)
table_plots_output <- func3(results_output)
export_csv <- func4(table_plots_output)
OR
Or should you write a sort of master function that wraps all the functions you've created previously, so that all of your processes (cleaning, analysis, visualization and export of results) run in a single line of code?
master_funct <- function(x) {
  clean_output <- func1(x)
  results_output <- func2(clean_output)
  table_plots_output <- func3(results_output)
  func4(table_plots_output)
}

x <- read_csv(data_sets)
export_csv <- master_funct(x)
I try to follow tidyverse approaches, so if there is a tidyverse way to do this, that would be great.
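For what it's worth, a minimal tidyverse-flavoured sketch of the second approach, assuming func1() through func4() behave as described and that the CSVs live in a hypothetical data/ folder (the folder path and the run_pipeline() name are illustrative, not from the question):

library(readr)
library(purrr)

# One wrapper that runs the whole pipeline on a single file path
run_pipeline <- function(path) {
  read_csv(path) |>   # native pipe, R >= 4.1; magrittr's %>% works the same way
    func1() |>        # clean
    func2() |>        # analyse
    func3() |>        # tables/plots
    func4()           # export
}

# Apply the pipeline to every CSV in the (hypothetical) data/ folder
csv_files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
results <- map(csv_files, run_pipeline)

Wrapping the steps in a single function keeps the linear order explicit, while map() takes care of iterating over the files.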
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 1 year ago.
As I'm dealing with a huge dataset, I had to split my data into different buckets, and I want to save some interim results in a CSV to recall later. However, my data file contains some columns with lists, which according to R cannot be exported (see snapshot). Do you know a simple way for an R newbie to make this work?
Thank you so much!
I guess the best way to solve your problem is switching to a more appropriate file format. I recommend using write_rds() from the readr package, which creates .rds files. A file you create with readr::write_rds(your_object, 'your_file_path') can be read back in with readr::read_rds('your_file_path').
The base R equivalents are saveRDS() and readRDS(); the readr functions mentioned above are just wrappers around them with some convenience features.
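As a minimal illustration that list columns survive the round trip (the tibble and file name below are made up for the example):

library(readr)
library(tibble)

# A small data frame with a list column, made up for illustration
df <- tibble(
  id    = 1:3,
  items = list(c("a", "b"), "c", c("d", "e", "f"))
)

write_rds(df, "interim_results.rds")          # list column is preserved
df_again <- read_rds("interim_results.rds")   # identical structure comes back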
Alternatively, right-click in the folder where you want to save your work and create a new CSV file, then set the CSV's separator to a comma.
Enter all the data in column form. You can later make it a matrix in your R program.
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I have more than 400 image files in my local directory. I want to read these images into R to pass them through an XGBoost algorithm. My two attempts are given below.
library("EBImage")
img <- readImage("/home/vishnu/Documents/XG_boost_R/Data_folder/*.jpg")
and
library(jpeg)
library(biOps)
myjpg <- readJpeg("/home/vishnu/Documents/XG_boost_R/Data_folder/*.jpg")
It is a bit hard to guess what you want to do exactly, but one way to accomplish loading a lot of files and processing them is via a for-loop like this:
files <- list.files()           # create a vector with file names
for (i in 1:length(files)) {    # loop over file names
  load(files[i])                # load .rda file
  # do some processing and save results
}
This structure is generalizable to other cases. Depending on what kind of files you want to load, you will have to replace load(files[i]) with the appropriate command, for instance load.image() from the imager package.
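Applied to the JPEGs in the question, a hedged sketch (the directory path comes from the question; the pattern matching and readJPEG() usage are assumptions about what is needed) could look like this:

library(jpeg)

# List only the .jpg files, keeping their full paths
img_dir   <- "/home/vishnu/Documents/XG_boost_R/Data_folder"
img_files <- list.files(img_dir, pattern = "\\.jpg$", full.names = TRUE)

# Read each image into a list of pixel arrays
images <- lapply(img_files, readJPEG)

Each element of images is then a numeric array that you would typically flatten into a vector before building the feature matrix for xgboost.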
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I have several data frames in my environment whose names begin with SPECIALTY.
I would like to be able to call the data frames only once in my self-defined functions (possibly with an apply function), instead of having to run a separate line of code for each data frame.
I was thinking of combining the data frames into a list, but I'm not sure how I would go about doing this, or whether that would be the most efficient method.
Storing them in a list is an excellent idea; you can do it this way:
new_list <- mget(ls(pattern="^SPECIALTY"))
And then use lapply on it with the function of your choice.
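For example, with base summary() standing in for whichever of your own functions you want to run:

# Run the same function on every SPECIALTY data frame at once
results <- lapply(new_list, summary)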
If you want to clean up your workspace after you've put them in a list, run:
rm(list = ls(pattern = "^SPECIALTY"))
To go further, you might want to ask why you ended up with separate tables in the first place; maybe it's because you've done something like:
SPECIALTY2014_Q1 <- read.csv("SPECIALTY2014_Q1.csv")
SPECIALTY2014_Q2 <- read.csv("SPECIALTY2014_Q2.csv")
...
In this case you could have done the following to store everything in a list from the start:
specialty_list <- lapply(paste0("SPECIALTY", c("2014_Q1", "2014_Q2"), ".csv"), read.csv)
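A small extension of that idea (the quarters vector is just the example from above) keeps the original names on the list elements, so each table can still be referred to by name:

quarters <- c("2014_Q1", "2014_Q2")
files    <- paste0("SPECIALTY", quarters, ".csv")

# Read every file into one named list instead of separate objects
specialty_list <- setNames(lapply(files, read.csv), paste0("SPECIALTY", quarters))
specialty_list$SPECIALTY2014_Q1   # access an individual table by name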
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I am working with a large data set with more than 1 million entries. When I run scripts, it sometimes takes a while until I get an output; sometimes there seems to be no output whatsoever, even if I let it run for hours. Is there a way to track the progress of the computation (or at least to see that it is not stuck)?
1. Start small
Write your analysis script and then test it using trivially small amounts of data. Gradually scale up and see how the runtime increases. The microbenchmark package is great at this. In the example below, I compare the amount of time it takes to run the same function with three different sized chunks of data.
library(microbenchmark)

long_running_function <- function(x) {
  for (i in 1:nrow(x)) {
    Sys.sleep(0.01)
  }
}

microbenchmark(long_running_function(mtcars[1:5, ]),
               long_running_function(mtcars[1:10, ]),
               long_running_function(mtcars[1:15, ]))
2. Look for functions that provide progress bars
I'm not sure what kind of analysis you're performing, but some packages already have this functionality. For example, ranger gives you more progress updates than the equivalent randomForest functions.
3. Write your own progress updates
I regularly add print() or cat() statements to large code blocks to tell me when R has finished running a particular part of my analysis. Functions like txtProgressBar() let you add your own progress bars to functions as well.
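As a minimal sketch of that last point (the loop body is just a stand-in for your real computation):

n  <- 100
pb <- txtProgressBar(min = 0, max = n, style = 3)   # style 3 draws a percentage bar

for (i in 1:n) {
  Sys.sleep(0.05)             # stand-in for one chunk of the real work
  setTxtProgressBar(pb, i)    # update the bar after each iteration
}
close(pb)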
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
I want to pass a data frame to a function as an argument, and then, inside the function, work on different combinations of its columns for graphical presentation. Basically, I want to do graphical presentation on different data files: I pass a data file as an argument and get the graphs back. How can I do this in R?
You are not giving us much info but here is a very basic starting point:
library(ggplot2) # if you don't have this library run install.packages('ggplot2')

myAmazingFunction <- function(myDF) {
  ggplot(myDF, aes(X, Y)) + geom_line()
}

df <- data.frame(X = 1:30, Y = runif(30), Z = 1.3 * runif(30))
myAmazingFunction(df)
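Since the question mentions different combinations of columns, a hedged extension of this idea (using tidy evaluation's {{ }} operator from recent rlang/ggplot2 versions; the function name is made up) lets you choose the columns at call time:

library(ggplot2)

# Pass the columns to plot as unquoted names
myFlexibleFunction <- function(myDF, xcol, ycol) {
  ggplot(myDF, aes({{ xcol }}, {{ ycol }})) + geom_line()
}

myFlexibleFunction(df, X, Y)   # same plot as above
myFlexibleFunction(df, X, Z)   # a different column combination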