Save data.frame objects into .Rds files within a loop - r

I have data.frame objects with normalized names in my global environment and I want to save them into .Rda files.
My first question is: should I save them all into one big .Rda file, or create one file per data frame? (Each df has 14 columns and ~260,000 rows.)
Assuming that I'll save them into different files, I was thinking about a function like this (all my data.frame names begin with "errDatas"):
sapply(ls(pattern = "errDatas"), function(x) save(as.name(x), file = paste0(x, ".Rda")))
But I have this error :
Error in save(as.name(x), file = paste0(x, ".Rda")) :
object ‘as.name(x)’ not found
Seems like save can't parse as.name(x) and evaluate it as is. I also tried eval(parse(text = x)), but I get the same error.
Do you have an idea about how I can manage to save my data frames within a loop ? Thanks.
And I have a bonus question to know if what I'm trying to do is useful and legit :
These data frames come from csv files (one data frame per csv file, which I import with read.csv). Each day I get one new csv file, and I want to run some analysis on all the csv files. I realized that reading from csv is much slower than saving and loading an Rda file. So instead of reading all the csv files each time I run my program, I actually want to read each csv file only once, save it into an Rda file and then load that. Is this a good idea? Are there best practices for that in R?

Use the list= parameter of the save function. This allows you to specify the name of the object as a character vector rather than passing the object itself. For example:
lapply(ls(pattern = "errDatas"), function(x) {
  save(list = x, file = paste0(x, ".Rda"))
})
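For the bonus question: caching parsed csv files this way is a common pattern, and saveRDS/readRDS (which store a single object without its name) are usually a better fit than save/load for it. A minimal sketch, where read_cached is a made-up helper name:

```r
# Hypothetical cache helper: parse each csv once, then reuse the .Rds copy.
read_cached <- function(csv_file) {
  rds_file <- sub("\\.csv$", ".Rds", csv_file)
  if (file.exists(rds_file)) {
    readRDS(rds_file)            # fast path: load the serialized copy
  } else {
    df <- read.csv(csv_file)     # slow path: parse the csv once
    saveRDS(df, rds_file)        # cache it for the next run
    df
  }
}
```

Each new daily csv file then gets parsed exactly once; subsequent runs hit the .Rds fast path.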

Related

R - query all large text files in a folder without reading each file into Rstudio using vroom

I have a list of 15 txt files in a folder that are each 1.5 - 2 GB. Is there a way to query each file for specific rows based on a condition and then load the results into a list of data frames in R? Currently, I am loading a few files at a time using
temp = list.files(pattern="*.txt");
data_list = lapply(temp, read.delim); names(data_list) = temp
and then applying a custom filter function to each of the data frames in the list using lapply.
Due to RAM limitations, I cannot load entire files into my R environment and then query. I'm looking for some code to perhaps automatically read in one file, perform the query, add the data frame result to a list, free up the memory, and then repeat. Thank you!
Edit:
It seems like I should use vroom instead of read.delim:
library(vroom)
temp = list.files(pattern="*.txt");
data_list = lapply(temp, vroom); names(data_list) = temp
I get a few warning messages and when I run problems(),
I get:
Error in vroom_materialize(x, replace = FALSE) : argument "x" is missing, with no default
Is this an issue?
Each of the files has a different number of columns, so do I need to use map as described here?
Lastly, on each of the data frames in the list, I would like to run
filtered_list = lapply(data_list, filter, COL1 == "ABC")
Does doing so essentially read in each of the files, negating the benefits of vroom? When I run this lapply, R takes a very long time.
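No answer is shown for this question, but the read-filter-release loop the question describes can be sketched in base R (COL1 == "ABC" is the question's example condition; read.delim stands in for vroom, and filter_files is a made-up name):

```r
# Sketch: process one file at a time, keeping only the filtered rows
# so the full tables never accumulate in memory.
filter_files <- function(paths, keep = "ABC") {
  out <- vector("list", length(paths))
  names(out) <- paths
  for (f in paths) {
    df <- read.delim(f)               # read one file at a time
    out[[f]] <- df[df$COL1 == keep, ] # keep only the matching rows
    rm(df)                            # drop the full table...
    gc()                              # ...and reclaim memory before the next file
  }
  out
}
# filtered_list <- filter_files(list.files(pattern = "*.txt"))
```

Only the filtered subsets stay in the result list, which is the memory behaviour the question asks for.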

Can convert a string to an object but can't save() it -- why? [duplicate]

I am repeatedly applying a function to read and process a bunch of csv files. Each time it runs, the function creates a data frame (this.csv.data) and uses save() to write it to a .RData file with a unique name. Problem is, later when I read these .RData files using load(), the loaded variable names are not unique, because each one loads with the name this.csv.data....
I'd like to save them with unique tags so that they come out properly named when I load() them. I've created the following code to illustrate:
this.csv.data = list(data=c(1:9), unique_tag = "some_unique_tag")
assign(this.csv.data$unique_tag,this.csv.data$data)
# I want to save the data,
# with variable name of <unique_tag>,
# at a file named <unique_tag>.dat
saved_file_name <- paste(this.csv.data$unique_tag,"RData",sep=".")
save(get(this.csv.data$unique_tag), file = saved_file_name)
but the last line returns:
"Error in save(get(this.csv.data$unique_tag), file = saved_file_name) :
object ‘get(this.csv.data$unique_tag)’ not found"
even though the following returns the data just fine:
get(this.csv.data$unique_tag)
Just name the arguments you use. With your code the following works fine:
save(list = this.csv.data$unique_tag, file=saved_file_name)
My preference is to ignore the name stored in the RData file when loading:
obj = local(get(load('myfile.RData')))
This way you can load various RData files and name the objects whatever you want, or store them in a list etc.
You really should use saveRDS/readRDS to serialize your objects.
save and load are meant for saving whole sets of objects, which they restore under their original names.
saveRDS(this.csv.data, saved_file_name)
# later
mydata <- readRDS(saved_file_name)
You can use
save.image("myfile.RData")
This worked for me:
env <- new.env()
env[[varname]] <- object_to_save
save(list=c(varname), envir=env, file='out.Rda')
You could probably do it without a new env (but I didn't try this):
.GlobalEnv[[varname]] <- object_to_save
save(list=c(varname), envir=.GlobalEnv, file='out.Rda')
You might even be able to remove the envir variable.
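Putting the list= and local(get(load(...))) pieces from these answers together, a quick round trip with the question's made-up tag looks like this:

```r
this.csv.data <- list(data = 1:9, unique_tag = "some_unique_tag")
saved_file_name <- paste(this.csv.data$unique_tag, "RData", sep = ".")

# Save under the tag's name via list=, then reload without
# needing to know what the stored name was.
assign(this.csv.data$unique_tag, this.csv.data$data)
save(list = this.csv.data$unique_tag, file = saved_file_name)
obj <- local(get(load(saved_file_name)))
```

obj now holds the data regardless of the name it was saved under, so the loaded variable can be called whatever you like.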

Reading and saving many large rds files into a single rds file

I have a list that contains many large files. All the files have the same column names. I want to combine them into an rds file and save it.
list.nam<- list.files(pattern="*.I S")
list.fil <- lapply (list.nam, readRDs)
Error in match.fun(FUN) : object 'readRDs' not found
You have entered an incorrect function name: replace readRDs with readRDS and it works:
list.fil <- lapply (list.nam, readRDS)
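The question also asks about combining the files into a single rds file; since they all share column names, row-binding the list and re-serializing it is one way (a sketch not taken from the answer; combine_rds is a made-up name):

```r
# Stack a set of rds files with identical columns into one rds file.
combine_rds <- function(paths, out_file) {
  pieces <- lapply(paths, readRDS)    # read each rds file into a list
  combined <- do.call(rbind, pieces)  # row-bind data frames with the same columns
  saveRDS(combined, out_file)         # write the single combined rds file
  combined
}
```

For many large pieces, dplyr::bind_rows or data.table::rbindlist would do the same stacking faster, if those packages are available.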

Read, process and export analysis results from multiple .csv files in R

I have a bunch of CSV files and I would like to perform the same analysis (in R) on the data within each file. Firstly, I assume each file must be read into R (as opposed to running a function on the CSV and providing output, like a sed script).
What is the best way to input numerous CSV files to R, in order to perform the analysis and then output separate results for each input?
Thanks (btw I'm a complete R newbie)
You could go for Sean's option, but it's going to lead to several problems:
You'll end up with a lot of unrelated objects in the environment, with the same name as the file they belong to. This is a problem because...
For loops can be pretty slow, and because you've got this big pile of unrelated objects, you're going to have to rely on for loops over the filenames for each subsequent piece of analysis - otherwise, how the heck are you going to remember what the objects are named so that you can call them?
Calling objects by pasting their names in as strings - which you'll have to do, because, again, your only record of what the object is called is in this list of strings - is a real pain. Have you ever tried to call an object when you can't write its name in the code? I have, and it's horrifying.
A better way of doing it might be with lapply().
# List files
filelist <- list.files(pattern = "*.csv")
# Now we use lapply to perform a set of operations
# on each entry in the list of filenames.
to_dispose_of <- lapply(filelist, function(x) {
    # Read in the file specified by 'x' - an entry in filelist
    data.df <- read.csv(x, skip = 1, header = TRUE)
    # Store the filename, minus .csv. This will be important later.
    filename <- substr(x = x, start = 1, stop = (nchar(x) - 4))
    # Your analysis work goes here. You only have to write it out once
    # to perform it on each individual file.
    ...
    # Eventually you'll end up with a data frame or a vector of analysis
    # to write out. Great! Since you've kept the value of x around,
    # you can do that trivially
    write.table(x = data_to_output,
                file = paste0(filename, "_analysis.csv"),
                sep = ",")
})
And done.
You can try the following code after putting all the csv files in the same directory.
names = list.files(pattern="*.csv")  # csv file names
for(i in 1:length(names)){
  assign(names[i], read.csv(names[i], skip=1, header=TRUE))
}
Hope this helps!
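A middle ground between these two answers is the same read loop, but collecting the frames into one named list instead of assign()-ing them into the workspace, which keeps them easy to lapply over later (read_all is a made-up name):

```r
# Read every csv into one named list, keyed by file name.
read_all <- function(paths) {
  setNames(lapply(paths, read.csv, skip = 1, header = TRUE), paths)
}
data_list <- read_all(list.files(pattern = "*.csv"))
# Any analysis step then applies to all of them at once:
# results <- lapply(data_list, my_analysis)  # my_analysis is your own function
```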

Which is the best method to apply a script repetitively to n .csv files in R?

My situation:
I have a number of csv files, all with the same suffix .csv, but the first two characters of the file name are different (i.e. AA01.csv, AB01.csv, AC01.csv, etc.)
I have an R script which I would like to run on each file. This file essentially extracts the data from the .csv and assigns them to vectors / converts them into timeseries objects. (For example, AA01 xts timeseries object, AB01 xts object)
What I would like to achieve:
Embed the script within a larger loop (or as appropriate) to sequentially run over each file and apply the script
Remove the intermediate objects created (see code snippet below)
Leave me with the final xts objects created from each raw data file (ie AA01 to AC01 etc as Values / Vectors etc)
What would be the right way to embed this script in R? Sorry, but I am a programming noob!
My script code below...heading of each column in each CSV is DATE, TIME, VALUE
# Pull in Data from the FileSystem and attach it
AA01raw<-read.csv("AA01.csv")
attach(AA01raw)
#format the data for timeseries work
cdt<-as.character(Date)
ctm<-as.character(Time)
tfrm<-timeDate(paste(cdt,ctm),format ="%Y/%m/%d %H:%M:%S")
val<-as.matrix(Value)
aa01tsobj<-timeSeries(val,tfrm)
#convert the timeSeries object to an xts Object
aa01xtsobj<-as.xts(aa01tsobj)
#remove all the intermediate objects to leave the final xts object
rm(cdt)
rm(ctm)
rm(aa01tsobj)
rm(tfrm)
gc()
and then repeat on each .csv file til all xts objects are extracted.
ie, what we would end up within R, ready for further applications are:
aa01xtsobj, ab01xtsobj, ac01xtsobj....etc
any help on how to do this would be very much appreciated.
Be sure to use R's dir() function to produce the list of filenames instead of manually entering them.
filenames = dir(pattern="*01.csv")
for( i in 1:length(filenames) )
{
...
}
I find a for loop and lists work well enough for stuff like this. Once you have a working set of code it's easy enough to move from a loop into a function which can be sapply-ed or similar, but that kind of vectorization is idiosyncratic anyway and probably not useful outside of private one-liners.
You probably want to avoid assigning to multiple objects with different names in the workspace (this a FAQ which usually comes up as "how do I assign() . . .").
Please beware my untested code.
A vector of file names, and a list with a named element for each file.
files <- c("AA01.csv", "AA02.csv")
lst <- vector("list", length(files))
names(lst) <- files
Loop over each file.
library(timeSeries)
for (i in 1:length(files)) {
## read strings as character
tmp <- read.csv(files[i], stringsAsFactors = FALSE)
## convert to 'timeDate'
tmp$tfrm <- timeDate(paste(tmp$Date, tmp$Time), format = "%Y/%m/%d %H:%M:%S")
## create timeSeries object
obj <- timeSeries(as.matrix(tmp$Value), tmp$tfrm)
## store object in the list, by name
lst[[files[i]]] <- as.xts(obj)
}
## clean up
rm(tmp, files, obj)
Now all the read objects are in lst, but you'll want to test that the file is available, that it was read correctly, and you may want to modify the names to be more sensible than just the file name.
Print out the first object by name index from the list:
lst[[files[1]]]
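The "test that the file is available" step mentioned above can be a one-line guard before the loop, or a tryCatch around each read; a minimal sketch (read_safely is a made-up helper name):

```r
# Drop file names that don't exist on disk before looping.
files <- c("AA01.csv", "AA02.csv")
files <- files[file.exists(files)]

# Or guard each read so one bad file warns and yields NULL
# instead of stopping the whole loop.
read_safely <- function(f) {
  tryCatch(read.csv(f, stringsAsFactors = FALSE),
           error = function(e) {
             warning("skipping ", f, ": ", conditionMessage(e))
             NULL
           })
}
```

NULL entries can then be dropped from lst afterwards with Filter(Negate(is.null), lst).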
