I have to load in many files and tansform their data. Each file contains only one data.table, however the tables have various names.
I would like to run a single script over all of the files -- to do so, i must assign the unknown data.table to a common name ... say blob.
What is the R way of doing this? At present, my best guess (which seems like a hack, but works) is to load the data.table into a new environment, and then: assign('blob', get(objects(envir=newEnv)[1], env=newEnv).
In a reproducible context this is:
newEnv <- new.env()
assign('a', 1:10, envir = newEnv)
assign('blob', get(objects(envir=newEnv)[1], env=newEnv))
Is there a better way?
The R way is to create a single object, i.e. a single list of data tables.
Here is some pseudocode that contains three steps:
Use list.files() to create a list of all files in a folder.
Use lapply() and read.csv() to read your files and create a list of data frames. Replace read.csv() with read.table() or whatever is appropriate for your data.
Use lapply() again, this time with as.data.table() to convert the data frames to data tables.
The pseudocode:
filenames <- list.files("path/to/files")
dat <- lapply(files, read.csv)
dat <- lapply(dat, as.data.table)
Your result should be a single list, called dat, containing a data table for each of your original files.
I assume that you saved the data.tables using save() somewhat like this:
d1 <- data.table(value=1:10)
save(d1, file="data1.rdata")
and your problem is that when you load the file you don't know the name (here: d1) that you used when saving the file. Correct?
I suggest you use instead saveRDS() and readRDS() for saving/loading single objects:
d1 <- data.table(value=1:10)
saveRDS(d1, file="data1.rds")
blob <- readRDS("data1.rds")
Related
I am writing a generic procedure and I don't understand how to handle names of objects that are unknown. In this case I am loading all *.Rda files in a directory and doing rbind to make a data frame. The names and number of Rda files can vary. My question is how best to handle this situation?
library(data.table)
# Load all data frames in wd
my_files <- list.files(pattern='*.Rda',full.names = TRUE)
# Names of files without .Rda suffix
my_files_names <- gsub(".Rda$","",list.files(pattern='*.Rda'))
# load each data frame, creates objects with names in my_files_names
for(i in 1:length(my_files)){
load(my_files[i])
}
# make large data frame from all loaded data frames
combined_df <- rbindlist(my_files_names)
I am getting the error
Input is character but should be a plain list of items to be stacked
combined_df <- rbindlist(as.list(my_files_names)) doesn't work.
The example works using rbind with each object as an argument, but for some reason a character vector can't be used to refer to objects with names not known at run-time. What am I missing?
The solution was a two-liner:
library(dplyr)
my_files <- list.files(pattern='*.rds',full.names = TRUE)
combined_df <- bind_rows(lapply(my_files, readRDS))
First, the names of the objects were not important so I could this different approach. Second, the use of .Rda files was causing problems. This file type can contain more than one object. Although my files only had a single data frame per file, the code about would not run with load as an argument in lapply. I converted my files to .rds files, which only allow one data frame per file and the code ran fine.
I am using a for loop to read in multiple csv files and naming the datasets import1, import2, etc. For example:
assign(paste("import",i,sep=""), read.csv(files[i], header=FALSE))
However, I now want to rename the variables in each dataset. I have tried the following:
names(as.name(paste("import",i,sep=""))) <- c("xxxx", "yyyy")
But get the error "target of assignment expands to non-language object". (I need to change the name of variables in each dataset within the loop as the variable names need to be different in each dataset).
Any suggestions on how to do this would be much appreciated.
Thanks.
While I do agree it would be much better to keep your data.frames in a list rather than creating a bunch of variables in your global environment, you can also set names when you read the files in
assign(paste("import",i,sep=""),
read.csv(files[i], header=FALSE, col.names=c("xxxx", "yyyy")))
Using assign() isn't very "R-like".
A better approach would be to read the files into a list of data.frames, instead of one data.frame object per file. Assuming files is the vector of file names (as you imply above):
import <- lapply(files, read.csv, header=FALSE)
Then if you want to operate on each data.frame in the list using a loop, you easily can:
for (i in seq_along(import)) names(import[[i]]) <- c('xxx', 'yyy')
I have a folder with multiple files to load:
Every file is a list. And I want to combine all the lists loaded in a single list. I am using the following code (the variable loaded every time from a file is called TotalData) :
Filenames <- paste0('DATA3_',as.character(1:18))
Data <- list()
for (ii in Filenames){
load(ii)
Data <- append(Data,TotalData)
}
Is there a more elegant way to write it? For example using apply functions?
You can use lapply. I assume that your files have been stored using save, because you use load to get them. I create two files to use in my example as follows:
TotalData<-list(1:10)
save(TotalData,file="DATA3_1")
TotalData<-list(11:20)
save(TotalData,file="DATA3_2")
And then I read them in by
Filenames <- paste0('DATA3_',as.character(1:2))
Data <- lapply(Filenames,function(fn) {
load(fn)
return (TotalData)
})
After this, Data will be a list that contains the lists from the files as its elements. Since you are using append in your example, I assume this is not what you want. I remove one level of nesting with
Data <- unlist(Data,recursive=FALSE)
For my two example files, this gave the same result as your code. Whether it is more elegant can be debated, but I would claim that it is more R-ish than the for-loop.
I would like to add two blank rows to the top of an xlsx file that I create.
so far I have tried:
library("xlsx")
fn1 <- 'test1.xlsx'
fn2 <- 'test2.xlsx'
write.xlsx(matrix(rnorm(25),5),fn1)
wb <- loadWorkbook(fn1)
rows <- getRows(getSheets(wb)[[1]])
for(i in 1:length(rows))
rows[[i]]$setRowNum(as.integer(i+1))
saveWorkbook(wb,fn2)
But test2.xlsx is empty!
I'm a bit confused by what you're trying to do with the for loop, but:
You could create a dummy object with the same number of columns as wb, then use rbind() to join the dummy and wb to create fn2.
fn1 <- 'test1.xlsx'
wb <- loadWorkbook(fn1)
dummy <- wb[c(1,2),]
# set all values of dummy to whatever you want, e.g. "NA" or 0
fn2 <- rbind(dummy, wb)
saveWorkbook( fn2)
Hope that helps
So the xlsx package actually interfaces with a java library in order to pull in and modify the workbooks. The read.xlsx and write.xlsx functions are convenience wrappers to read and write data frames within R without having to manually write code to parse the individual cells and rows yourself using the java objects. The loadWorkbook and getRows functions give you access to the actual java objects, which you can then use to modify things such as cell styles.
However, if all you want to do is add a blank row to your spreadsheet before you output it, the easiest way is to add a blank row to your data frame before you export it (as Chris mentioned). You can accomplish this with the following variant on your code:
library("xlsx")
fn1 <- 'test1.xlsx'
fn2 <- 'test2.xlsx'
# I have added row.names=FALSE because the read.xlsx function
# does not have a parameter to include row names
write.xlsx(matrix(rnorm(25),5),fn1,row.names=FALSE)
# If you read your data back in using the read.xlsx function
# you will have a data frame rather than a series of java object references
wb <- read.xlsx(fn1,1)
# Now that we have a data frame we can add a blank row at the top
wb<-rbind.data.frame(rep("",5),wb)
# Then we write to the file using the same function as before
write.xlsx(wb,fn2,row.names=FALSE)
If you desire to use the advanced functionality in the java libraries, some of it is not currently implemented in the R package, and thus you will have to call it directly using .jcall. If you decide to pursue this line of action, I definitely recommend using .jmethods on the objects produced (i.e. .jmethods(rows[[1]]) which will list the available functions which you can use on the object (at the cellular level these are quite extensive).
My situation:
I have a number of csv files all with the same suffix pre .csv, but the first two characters of the file name are different (ie AA01.csv, AB01.csv, AC01.csv etc)
I have an R script which I would like to run on each file. This file essentially extracts the data from the .csv and assigns them to vectors / converts them into timeseries objects. (For example, AA01 xts timeseries object, AB01 xts object)
What I would like to achieve:
Embed the script within a larger loop (or as appropriate) to sequentially run over each file and apply the script
Remove the intermediate objects created (see code snippet below)
Leave me with the final xts objects created from each raw data file (ie AA01 to AC01 etc as Values / Vectors etc)
What would be the right way to embed this script in R? Sorry, but I am a programming noob!
My script code below...heading of each column in each CSV is DATE, TIME, VALUE
# Pull in Data from the FileSystem and attach it
AA01raw<-read.csv("AA01.csv")
attach(AA01raw)
#format the data for timeseries work
cdt<-as.character(Date)
ctm<-as.character(Time)
tfrm<-timeDate(paste(cdt,ctm),format ="%Y/%m/%d %H:%M:%S")
val<-as.matrix(Value)
aa01tsobj<-timeSeries(val,tfrm)
#convert the timeSeries object to an xts Object
aa01xtsobj<-as.xts(tsobj)
#remove all the intermediate objects to leave the final xts object
rm(cdt)
rm(ctm)
rm(aa01tsobj)
rm(tfrm)
gc()
and then repeat on each .csv file til all xts objects are extracted.
ie, what we would end up within R, ready for further applications are:
aa01xtsobj, ab01xtsobj, ac01xtsobj....etc
any help on how to do this would be very much appreciated.
Be sure to use Rs dir command to produce the list of filenames instead of manually entering them in.
filenames = dir(pattern="*01.csv")
for( i in 1:length(filenames) )
{
...
I find a for loop and lists is well enough for stuff like this. Once you have a working set of code it's easy enough to move from a loop into a function which can be sapplyied or similar, but that kind of vectorization is idiosyncratic anyway and probably not useful outside of private one-liners.
You probably want to avoid assigning to multiple objects with different names in the workspace (this a FAQ which usually comes up as "how do I assign() . . .").
Please beware my untested code.
A vector of file names, and a list with a named element for each file.
files <- c("AA01.csv", "AA02.csv")
lst <- vector("list", length(files))
names(lst) <- files
Loop over each file.
library(timeSeries)
for (i in 1:length(files)) {
## read strings as character
tmp <- read.csv(files[i], stringsAsFactors = FALSE)
## convert to 'timeDate'
tmp$tfrm <- timeDate(paste(tmp$cdt, tmp$ctm),format ="%Y/%m/%d %H:%M:%S"))
## create timeSeries object
obj <- timeSeries(as.matrix(tmp$Value), tmp$tfrm)
## store object in the list, by name
lst[[files[i]]] <- as.xts(obj)
}
## clean up
rm(tmp, files, obj)
Now all the read objects are in lst, but you'll want to test that the file is available, that it was read correctly, and you may want to modify the names to be more sensible than just the file name.
Print out the first object by name index from the list:
lst[[files[1]]]