I have made a data frame from a set of tweets in the following form:
rdmTweets <- userTimeline("rdatamining", n=200)
df <- do.call("rbind", lapply(rdmTweets, as.data.frame))
Now I am saving the data frame with save in this way:
save(df, file="data")
How can I load that saved data frame for future use? When I use:
df2 <- load("data")
and apply dim(df2), it should return the number of tweets in the data frame, but it only shows 1.
As @mrdwab points out, save saves the names as well as the data/structure (and in fact can save a number of different R objects in a single file). There is another pair of storage functions that behave more as you expect. Try this:
saveRDS(df, file="mytweets.rds")
df2 <- readRDS("mytweets.rds")
These functions can only handle a single object at a time.
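If you do need to keep several objects together, one common workaround (a sketch, not from the original answer; the object names are made up) is to bundle them into a named list and saveRDS the list:

```r
# Bundle several objects into one named list, save the list as a
# single .rds file, then unpack the pieces after reading it back.
bundle <- list(df = data.frame(x = 1:3), note = "metadata")
saveRDS(bundle, file = "bundle.rds")

restored <- readRDS("bundle.rds")
restored$note     # "metadata"
nrow(restored$df) # 3
```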
Another option is to save your data frame as a csv file. The benefit of this option is that it provides long term storage, i.e. you will (likely) be able to open your csv file on any platform in ten years time. With an RData file, you can only open it with R and I wouldn't like to bet money on opening it between versions.
To save and reload the data as a csv, use write.csv and read.csv:
write.csv(df, file="out.csv", row.names=FALSE)
df <- read.csv("out.csv", header=TRUE)
Gavin's comment below raised a couple of points:
The CSV route only works for tabular-style data.
Completely correct. But if you are saving a data frame (as the OP is), then your data is in tabular form.
With R you'll always have the ability to fire up an old version to read the data and export it, if for some reason they change save format and don't allow the old format to be loaded by another function.
To play devil's advocate, you could use this argument with Excel and save your data as an xls. However, saving your data in a csv format means we never need to worry about this.
R's file format is documented, so one could reasonably easily read the binary data in another system using that open info.
I completely agree - although "easily" is a bit strong. This is why saving as an RData file isn't such a big deal. But if you are saving tabular data, why not use a csv file?
For the record, there are some reasons for saving tabular data as an RData file. For example, the speed in reading/writing the file or file size.
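As a rough illustration (file names are arbitrary, and exact sizes vary with the data), writing the same data frame both ways shows the difference: the compressed binary .rds is typically much smaller on disk than the csv:

```r
# Write the same data frame as csv and as (gzip-compressed) rds,
# then compare the sizes on disk.
big <- data.frame(x = rnorm(1e4), y = rnorm(1e4))
write.csv(big, "big.csv", row.names = FALSE)
saveRDS(big, "big.rds")  # compressed by default
file.size("big.csv")     # full-precision decimal text
file.size("big.rds")     # usually noticeably smaller
```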
save saves the name of the dataset as well as the data. Thus, you should not assign the result of load("data") to a name, and you should be fine. In other words, simply use:
load("data")
and it will load an object named df (or whatever is contained in the file "data") into your current workspace.
I would suggest a more original name for your file though, and consider adding an extension to help you remember which files are scripts, which are data, and so on.
Work your way through this simple example:
rm(list = ls()) # Remove everything from your current workspace
ls() # Anything there? Nope.
# character(0)
a <- 1:10 # Create an object "a"
save(a, file="myData.Rdata") # Save object "a"
ls() # Anything there? Yep.
# [1] "a"
rm(a) # Remove "a" from your workspace
ls() # Anything there? Nope.
# character(0)
load("myData.Rdata") # Load your "myData.Rdata" file
ls() # Anything there? Yep. Object "a".
# [1] "a"
str(a) # Is "a" what we expect it to be? Yep.
# int [1:10] 1 2 3 4 5 6 7 8 9 10
a2 <- load("myData.Rdata") # What about your approach?
ls() # Now we have 2 objects
# [1] "a" "a2"
str(a2) # "a2" stores the object names from your data file.
# chr "a"
As you can see, save allows you to save and load multiple objects at once, which can be convenient when working on projects with multiple sets of data that you want to keep together.
On the other hand, saveRDS (from the accepted answer) only lets you save a single object at a time. In some ways this is more "transparent": with readRDS you assign the result to whatever name you choose, whereas load() drops objects into your workspace under their stored names and doesn't let you preview the contents of the file without first loading it.
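A small sketch of the multi-object case (the object names are invented for the example):

```r
# save() stores several named objects in one file; load() restores
# them all under their original names and returns those names.
scores <- c(90, 85)
labels <- c("a", "b")
save(scores, labels, file = "project.RData")
rm(scores, labels)
loaded <- load("project.RData")
loaded  # the names of the restored objects
scores  # back in the workspace
```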
I'm trying to read multiple .csv files from a URL starting with http. All files can be found on the same website. Generally, the structure of a file's name is: yyyy_mm_dd_location_XX.csv
Now, there are three different locations (let's say locA, locB, locC), each of which has a file for every day of the month. So the file names would be e.g. "2009_10_01_locA_XX.csv", "2009_10_02_locA_XX.csv" and so forth.
The structure, meaning the number of columns of all csv files is the same, however the length is not.
I'd like to combine all these files into one csv file but have problems reading them from the website due to the changing names.
Thanks a lot for any ideas!
Here is a way to programmatically generate the names of the files, and then run download.file() to download them. Since no reproducible example was given with the question, one needs to change the code to the correct HTTP location to access the files.
startDate <- as.Date("2019-10-01", "%Y-%m-%d")
dateVec <- startDate + 0:4 # create additional dates by adding integers
downloadFileNames <- unlist(lapply(dateVec, function(x) {
  locs <- c("locA", "locB", "locC")
  paste(format(x, "%Y_%m_%d"), locs, "XX", sep = "_")
}))
head(downloadFileNames)
We print the head() of the vector to show that it matches the yyyy_mm_dd_location_XX naming pattern.
[1] "2019_10_01_locA_XX" "2019_10_01_locB_XX" "2019_10_01_locC_XX"
[4] "2019_10_02_locA_XX" "2019_10_02_locB_XX" "2019_10_02_locC_XX"
Next, we'll create a directory to store the files, and download them.
# create a subdirectory to store the files
if(!dir.exists("./data")) dir.create("./data")
# download files, as https://www.example.com/2019_10_01_locA_XX.csv
# to ./data/2019_10_01_locA_XX.csv, etc.
result <- lapply(downloadFileNames,function(x){
download.file(paste0("https://www.example.com/",x,".csv"),
paste0("./data/",x,".csv"))
})
Once the files are downloaded, we can use list.files() to retrieve the path names, read the data with read.csv(), and combine them into a single data frame with do.call().
theFiles <- list.files("./data", pattern = "\\.csv$", full.names = TRUE)
dataList <- lapply(theFiles,read.csv)
data <- do.call(rbind,dataList)
I want to save within a function, using the input object's name as the file name
saveNew <- function(dat){
# Collect the original name
originalName <- deparse(substitute(dat))
#Do lots of Fun and Interesting Things!
#Now lets save it, First i have to get it
newToSave <- get(originalName, envir = .GlobalEnv)
save(newToSave, file = paste0(originalName, '.Rdata') )
}
But the problem is that when I go to save it, it saves the newly created data as newToSave. This is apparent when loading the newly created object with
load('funData.Rdata'): the object is no longer funData but newToSave.
How can I get this function to save the object as, in the example below, funData, and load it back as funData, not newToSave?
Example:
funData <- sample(seq(1,1000,.01))
saveNew(funData)
load("funData.Rdata")
You can use assign to assign dat to originalName
saveNew <- function(dat){
# Collect the original name
originalName <- deparse(substitute(dat))
#Do lots of Fun and Interesting Things!
assign(originalName, dat)
save(list = originalName, file = paste0(originalName, '.Rdata') )
}
# Sample data
funData <- 1:10
# Save
saveNew(funData)
# Remove funData from the current environment
remove(funData)
# Load the RData object
load("funData.Rdata")
# Confirm that funData is in our current environment
funData
# [1] 1 2 3 4 5 6 7 8 9 10
Note that we need to use save with the list argument to enforce that save writes the value that has been assigned to originalName.
Disclaimer: This isn't really an answer, but as the OP wanted more clarification on the pros and cons of saveRDS, I thought I could put those under an answer. If you consider it should be deleted, please state so in a comment (before downvoting) and I'll be happy to withdraw it.
From ?saveRDS:
Details:
These functions provide the means to save a single R object to a connection (typically a file) and to restore the object, quite possibly under a different name. This differs from ‘save’ and ‘load’, which save and restore one or more named objects into an environment. They are widely used by R itself, for example to store metadata for a package and to store the ‘help.search’ databases: the ‘".rds"’ file extension is most often used.
saveRDS is specifically aimed at saving one object, while save can save one or more, but for me the main difference is that save and load bring back the object to life with the same name it had when saved, so one of its potential drawbacks is that it could rewrite an object already in the environment, whilst saveRDS and its companion readRDS can save and load objects to different objects.
From ?load:
Warning:
...
‘load()’ replaces all existing objects with the same names in the current environment (typically your workspace, ‘.GlobalEnv’) and hence potentially overwrites important data. It is considerably safer to use ‘envir = ’ to load into a different environment, or to ‘attach(file)’ which ‘load()’s into a new entry in the ‘search’ path.
Consider this:
save(iris, file = "save_file.rdat")
iris[1, 2] <- 20000 # implement a change to iris
load("save_file.rdat") # overwrites iris
saveRDS(iris, "my_file.RDS")
iris[1, 2] <- 20000 # introduce a change to iris
new_iris <- readRDS("my_file.RDS") # modified-iris is kept. New object is created
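Following that warning, a minimal sketch of the envir = route, reusing the save_file.rdat name from the snippet above, so the iris in your workspace is left untouched:

```r
# Load into a scratch environment instead of the global workspace.
save(iris, file = "save_file.rdat")
iris[1, 2] <- 20000                # modify the workspace copy
e <- new.env()
load("save_file.rdat", envir = e)  # restores iris inside e only
e$iris[1, 2]                       # the saved value, 3.5
iris[1, 2]                         # still 20000 in the workspace
```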
I realize this is a pretty basic question, but I want to make sure I do it right, so I wanted to ask just to confirm. I have a vector in one project that I want to be able to use in another project, and I was wondering if there was a simple way to export the vector in a form that I can easily import it to another project.
The way that I figured out how to do it so far is to convert it to a df, then export the df as a csv, then import and unpack it to vector form, but that seems needlessly complicated. It's just a simple numeric vector.
There are a number of ways to read and write data/files in R. For reading, you may want to look at: read.table, read.csv, readLines, source, dget, load, unserialize, and readRDS. For writing, you will want to look at write.table, writeLines, dump, dput, save, serialize, and saveRDS.
x <- 1:3
x
# [1] 1 2 3
save(x, file = "myvector.rda")
# Change x to prove a point.
x <- 4:6
x
# [1] 4 5 6
# Better yet, we could remove it entirely
rm(x)
x
# Error: object 'x' not found
# Now load what we saved to get us back to where we started.
load("myvector.rda")
x
# [1] 1 2 3
Alternatively, you can use saveRDS and readRDS -- best practice/convention is to use the .rds extension; note, however, that loading the object is slightly different as saveRDS does not save the object name:
saveRDS(x, file = "myvector_serialized.rds")
x <- readRDS("myvector_serialized.rds")
Finally, saveRDS is a lower-level function and therefore can only save one object at a time. The traditional save approach allows you to save multiple objects at the same time, but can become a nightmare if you re-use the same names in different projects/files/scripts...
I agree that saveRDS is a good way to go, but I also recommend the save and save.image functions, which I will demonstrate below.
# save.image
x <- c(5,6,8)
y <- c(8,9,11)
save.image(file="~/vectors.Rdata") # saves all workspace objects
Or alternatively choose which objects you want to save
x <- c(5,6,8)
y <- c(8,9,11)
save(x, y, file="~/vectors.Rdata") # saves only the selected objects
One (minor) advantage of using .Rdata over .Rda is that you can click on the object in the file explorer (e.g. in Windows) and it will be loaded into the R environment. This doesn't work with .Rda objects in, say, RStudio on Windows.
I've got a function that has a list output. Every time I run it, I want to export the results with save. After a couple of runs I want to read the files in and compare the results. I do this, because I don't know how many tasks there will be, and maybe I'll use different computers to calculate each task. So how should I name the archived objects, so later I can read them all in?
My best guess would be to dynamically name the variables before saving, and keep track of the object names, but I've read everywhere that this is a big no-no.
So how should I approach this problem?
You might want to use the saveRDS and readRDS functions instead of save and load. The RDS version functions will save and read single objects without the attached name. You would create your object and save it to a file (using paste0 or sprintf to create unique names), then when processing the results you can read in one object at a time, or read several into a list to work with them.
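A sketch of that workflow, with made-up task numbers and file names:

```r
# Save each run's result under a unique, predictable file name ...
for (task in 1:3) {
  result <- list(task = task, value = task^2)  # stand-in for real output
  saveRDS(result, file = sprintf("result_%03d.rds", task))
}

# ... then read them all back into a list for comparison.
files   <- sprintf("result_%03d.rds", 1:3)
results <- lapply(files, readRDS)
sapply(results, function(r) r$value)
# [1] 1 4 9
```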
You can use scope to hide the retrieved name inside a function, so first you might save a list to a file:
mybiglist <- list(fred=1, john='dum di dum', mary=3)
save(mybiglist, file='mybiglist1.RData')
Then you can load it back in through a function and give it whatever name you like, be it inside another list or just a plain object:
# Use the fact that load returns the name of the object loaded
# and that scope will hide this object
myspecialload <- function(RD.fnam) {
return(eval(parse(text=load(RD.fnam))))
}
# now lets reload that file but put it in another object
mynewbiglist <- myspecialload('mybiglist1.RData')
mynewbiglist
$fred
[1] 1
$john
[1] "dum di dum"
$mary
[1] 3
Note that this is not really a generic 'use it anywhere' type function, as for an RData file with multiple objects it appears to return the last object saved... so best stick with one list object per file for now!
One time I was given several RData files, and they all contained only one variable, called x. In order to read all of them into my workspace, I sequentially loaded each file into its own environment and used get() to read its value.
tenv <- new.env()
load("file_1.RData", envir = tenv)
ls(tenv) # x
myvar1 <- get(ls(tenv), tenv)
rm(tenv)
....
This code can be repeated for each file.
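The same idea, wrapped in a loop over all the files (the example creates its own file_1.RData-style files so it is self-contained):

```r
# Create three example files, each holding a single object named x.
for (i in 1:3) {
  x <- i * 10
  save(x, file = sprintf("file_%d.RData", i))
}

# Load each file into its own scratch environment and collect
# the single object it contains into one list.
rdFiles <- sprintf("file_%d.RData", 1:3)
myvars <- lapply(rdFiles, function(f) {
  tenv <- new.env()
  load(f, envir = tenv)
  get(ls(tenv)[1], envir = tenv)  # each file holds exactly one object
})
unlist(myvars)
# [1] 10 20 30
```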
My situation:
I have a number of csv files whose names are identical apart from the first two characters (i.e. AA01.csv, AB01.csv, AC01.csv etc)
I have an R script which I would like to run on each file. This file essentially extracts the data from the .csv and assigns them to vectors / converts them into timeseries objects. (For example, AA01 xts timeseries object, AB01 xts object)
What I would like to achieve:
Embed the script within a larger loop (or as appropriate) to sequentially run over each file and apply the script
Remove the intermediate objects created (see code snippet below)
Leave me with the final xts objects created from each raw data file (ie AA01 to AC01 etc as Values / Vectors etc)
What would be the right way to embed this script in R? Sorry, but I am a programming noob!
My script code below...heading of each column in each CSV is DATE, TIME, VALUE
# Pull in Data from the FileSystem and attach it
AA01raw<-read.csv("AA01.csv")
attach(AA01raw)
#format the data for timeseries work
cdt<-as.character(Date)
ctm<-as.character(Time)
tfrm<-timeDate(paste(cdt,ctm),format ="%Y/%m/%d %H:%M:%S")
val<-as.matrix(Value)
aa01tsobj<-timeSeries(val,tfrm)
#convert the timeSeries object to an xts Object
aa01xtsobj<-as.xts(aa01tsobj)
#remove all the intermediate objects to leave the final xts object
rm(cdt)
rm(ctm)
rm(aa01tsobj)
rm(tfrm)
gc()
and then repeat on each .csv file til all xts objects are extracted.
ie, what we would end up within R, ready for further applications are:
aa01xtsobj, ab01xtsobj, ac01xtsobj....etc
any help on how to do this would be very much appreciated.
Be sure to use R's dir() function to produce the list of filenames instead of entering them manually.
filenames <- dir(pattern = "01\\.csv$")
for (i in 1:length(filenames))
{
  ...
}
I find a for loop and lists are good enough for stuff like this. Once you have a working set of code it's easy enough to move from a loop into a function which can be applied with sapply or similar, but that kind of vectorization is idiosyncratic anyway and probably not useful outside of private one-liners.
You probably want to avoid assigning to multiple objects with different names in the workspace (this a FAQ which usually comes up as "how do I assign() . . .").
Please beware my untested code.
A vector of file names, and a list with a named element for each file.
files <- c("AA01.csv", "AA02.csv")
lst <- vector("list", length(files))
names(lst) <- files
Loop over each file.
library(timeSeries)
for (i in 1:length(files)) {
## read strings as character
tmp <- read.csv(files[i], stringsAsFactors = FALSE)
## convert to 'timeDate'
tmp$tfrm <- timeDate(paste(tmp$Date, tmp$Time), format = "%Y/%m/%d %H:%M:%S")
## create timeSeries object
obj <- timeSeries(as.matrix(tmp$Value), tmp$tfrm)
## store object in the list, by name
lst[[files[i]]] <- as.xts(obj)
}
## clean up
rm(tmp, files, obj)
Now all the read objects are in lst, but you'll want to test that the file is available, that it was read correctly, and you may want to modify the names to be more sensible than just the file name.
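A hedged sketch of that availability check (the function name readSafely is made up), skipping files that are missing or that fail to parse:

```r
# Read a csv defensively: return NULL (with a warning) for missing
# or unreadable files instead of stopping the whole loop.
readSafely <- function(path) {
  if (!file.exists(path)) {
    warning("missing: ", path)
    return(NULL)
  }
  tryCatch(read.csv(path, stringsAsFactors = FALSE),
           error = function(e) {
             warning("bad file: ", path)
             NULL
           })
}

got <- readSafely("no_such_file.csv")
is.null(got)
# [1] TRUE
```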
Print out the first object by name index from the list:
lst[[files[1]]]