I am writing a generic procedure and I don't understand how to handle names of objects that are unknown. In this case I am loading all *.Rda files in a directory and doing rbind to make a data frame. The names and number of Rda files can vary. My question is how best to handle this situation?
library(data.table)
# Load all data frames in wd
my_files <- list.files(pattern='*.Rda',full.names = TRUE)
# Names of files without .Rda suffix
my_files_names <- gsub(".Rda$","",list.files(pattern='*.Rda'))
# load each data frame, creates objects with names in my_files_names
for (i in 1:length(my_files)) {
  load(my_files[i])
}
# make large data frame from all loaded data frames
combined_df <- rbindlist(my_files_names)
I am getting the error
Input is character but should be a plain list of items to be stacked
combined_df <- rbindlist(as.list(my_files_names)) doesn't work.
The example works using rbind() with each object passed as a separate argument, but apparently a character vector can't be used to refer to objects whose names aren't known until run-time. What am I missing?
The solution was a two-liner:
library(dplyr)
my_files <- list.files(pattern = "\\.rds$", full.names = TRUE)
combined_df <- bind_rows(lapply(my_files, readRDS))
First, the names of the objects were not important, so I could take this different approach. Second, the use of .Rda files was causing problems. This file type can contain more than one object. Although my files only had a single data frame per file, the code above would not run with load as an argument in lapply. I converted my files to .rds files, which store only a single object per file, and the code ran fine.
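For completeness, the original .Rda workflow can also be made to work: rbindlist() wants a list of objects rather than a character vector of their names, and mget() performs exactly that name-to-object lookup. A sketch, assuming each .Rda file contains a single data frame named after the file:

```r
library(data.table)

# Find the .Rda files and strip the suffix to recover the object names
my_files <- list.files(pattern = "\\.Rda$", full.names = TRUE)
my_files_names <- gsub("\\.Rda$", "", basename(my_files))

# load() each file, then look the loaded objects up by name with mget(),
# which returns the named list that rbindlist() expects
for (f in my_files) load(f)
combined_df <- rbindlist(mget(my_files_names))
```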
Related
I am importing multiple excel workbooks, processing them, and appending them subsequently. I want to create a temporary dataframe (tempfile?) that holds nothing in the beginning, and after each successive workbook processing, append it. How do I create such temporary dataframe in the beginning?
I am coming from Stata and I use tempfile a lot. Is there a counterpart to tempfile from Stata to R?
As #James said you do not need an empty data frame or tempfile, simply add newly processed data frames to the first data frame. Here is an example (based on csv but the logic is the same):
list_of_files <- c('1.csv','2.csv',...)
pre_processor <- function(dataframe){
# do stuff
}
library(dplyr)
dataframe <- pre_processor(read.csv('1.csv')) %>%
  rbind(pre_processor(read.csv('2.csv'))) %>%
  ...
Now if you have a lot of files or very complicated pre-processing then you might have other questions (e.g. how to loop over the list of files, or how to write the right pre_processor function), but those should be separate questions, and we would really need more specifics (example data, code so far, etc.).
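With more than a handful of files, the same pattern generalizes without writing one rbind() per file; a sketch along the lines of the answer above (pre_processor here is a stand-in for whatever processing you need):

```r
library(dplyr)

list_of_files <- list.files(pattern = "\\.csv$", full.names = TRUE)

pre_processor <- function(dataframe) {
  # do stuff
  dataframe
}

# read every file, pre-process it, and stack the results in one pass
dataframe <- list_of_files %>%
  lapply(read.csv) %>%
  lapply(pre_processor) %>%
  bind_rows()
```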
I have written a program in R that takes all of the .csv files in a folder and imports them as data frames with the naming convention "main1," "main2," "main3" and so on for each data frame. The number of files in the folder may vary, so I was hoping the convention would make it easier to join the files later by being able to paste together the number of records. I successfully coded a way to find the folder and identify all of the files, as well as the total number of files.
agencyloc <- dirname(file.choose())
setwd(agencyloc)
listagencyfiles <- list.files(pattern = "*.csv")
numagencies <- 1:length(listagencyfiles)
I then created the individual dataframes without issue. I am not including this because it is long and does not relate to my problem. The problem is when I try to rbind these dataframes into one large dataframe, it says "Input to rbindlist must be a list of data.tables." Since there will be varying numbers of files, I can't just hard code this in, it has to be something similar to this. I tried the following, but it creates a list of strings and not a list of objects:
allfiles <- paste0("main", 1:length(numagencies))
However, this outputs a vector of strings that can't be used to bind the files. Is there a way to change the data type from character strings to objects so that this will work when executed:
finaltable <- rbindlist(allfiles)
What I am looking for would almost be the opposite of as.character(objectname) if that makes any sense. I need to go from character to object instead of object to character.
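The lookup in that direction, from character strings to the objects they name, is what mget() does: it returns a named list of the objects, which is exactly the input rbindlist() accepts. A minimal sketch with two stand-in data tables in place of the imported files:

```r
library(data.table)

# stand-ins for the data frames created by the import loop
main1 <- data.table(value = 1)
main2 <- data.table(value = 2)

allfiles <- paste0("main", 1:2)   # c("main1", "main2")

# mget() looks each name up and returns the objects in a named list
finaltable <- rbindlist(mget(allfiles))
```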
I am using a for loop to read in multiple csv files and naming the datasets import1, import2, etc. For example:
assign(paste("import",i,sep=""), read.csv(files[i], header=FALSE))
However, I now want to rename the variables in each dataset. I have tried the following:
names(as.name(paste("import",i,sep=""))) <- c("xxxx", "yyyy")
But get the error "target of assignment expands to non-language object". (I need to change the name of variables in each dataset within the loop as the variable names need to be different in each dataset).
Any suggestions on how to do this would be much appreciated.
Thanks.
While I do agree it would be much better to keep your data.frames in a list rather than creating a bunch of variables in your global environment, you can also set names when you read the files in
assign(paste("import",i,sep=""),
read.csv(files[i], header=FALSE, col.names=c("xxxx", "yyyy")))
Using assign() isn't very "R-like".
A better approach would be to read the files into a list of data.frames, instead of one data.frame object per file. Assuming files is the vector of file names (as you imply above):
import <- lapply(files, read.csv, header=FALSE)
Then if you want to operate on each data.frame in the list using a loop, you easily can:
for (i in seq_along(import)) names(import[[i]]) <- c('xxx', 'yyy')
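Since the question says the variable names need to differ between datasets, the loop index can feed into the names as well; a small sketch (the `_i` suffix scheme is just illustrative):

```r
# two stand-in data frames, as read.csv(header=FALSE) would produce
import <- list(data.frame(V1 = 1, V2 = 2),
               data.frame(V1 = 3, V2 = 4))

# give each dataset its own column names using the loop index
for (i in seq_along(import)) {
  names(import[[i]]) <- paste0(c("xxxx", "yyyy"), "_", i)
}
```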
I have to load in many files and tansform their data. Each file contains only one data.table, however the tables have various names.
I would like to run a single script over all of the files -- to do so, I must assign the unknown data.table to a common name ... say blob.
What is the R way of doing this? At present, my best guess (which seems like a hack, but works) is to load the data.table into a new environment, and then: assign('blob', get(objects(envir=newEnv)[1], env=newEnv)).
In a reproducible context this is:
newEnv <- new.env()
assign('a', 1:10, envir = newEnv)
assign('blob', get(objects(envir=newEnv)[1], env=newEnv))
Is there a better way?
The R way is to create a single object, i.e. a single list of data tables.
Here is some pseudocode that contains three steps:
Use list.files() to create a list of all files in a folder.
Use lapply() and read.csv() to read your files and create a list of data frames. Replace read.csv() with read.table() or whatever is appropriate for your data.
Use lapply() again, this time with as.data.table() to convert the data frames to data tables.
The pseudocode:
filenames <- list.files("path/to/files", full.names = TRUE)
dat <- lapply(filenames, read.csv)
dat <- lapply(dat, as.data.table)
Your result should be a single list, called dat, containing a data table for each of your original files.
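If the end goal is one stacked table rather than a list, rbindlist() collapses the list in a final step; a sketch with two stand-in data frames in place of the files:

```r
library(data.table)

# stand-ins for the data frames read from disk
dat <- list(data.frame(x = 1), data.frame(x = 2))
dat <- lapply(dat, as.data.table)

# stack the whole list into a single data.table
combined <- rbindlist(dat)
```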
I assume that you saved the data.tables using save() somewhat like this:
d1 <- data.table(value=1:10)
save(d1, file="data1.rdata")
and your problem is that when you load the file you don't know the name (here: d1) that you used when saving the file. Correct?
I suggest you use instead saveRDS() and readRDS() for saving/loading single objects:
d1 <- data.table(value=1:10)
saveRDS(d1, file="data1.rds")
blob <- readRDS("data1.rds")
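The same idea scales to a whole directory of .rds files without ever needing the original object names; a sketch, assuming the files sit in the working directory:

```r
# read every .rds file into a named list, keyed by file name
rds_files <- list.files(pattern = "\\.rds$", full.names = TRUE)
tables <- lapply(rds_files, readRDS)
names(tables) <- gsub("\\.rds$", "", basename(rds_files))
```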
My situation:
I have a number of csv files all with the same suffix pre .csv, but the first two characters of the file name are different (ie AA01.csv, AB01.csv, AC01.csv etc)
I have an R script which I would like to run on each file. This file essentially extracts the data from the .csv and assigns them to vectors / converts them into timeseries objects. (For example, AA01 xts timeseries object, AB01 xts object)
What I would like to achieve:
Embed the script within a larger loop (or as appropriate) to sequentially run over each file and apply the script
Remove the intermediate objects created (see code snippet below)
Leave me with the final xts objects created from each raw data file (ie AA01 to AC01 etc as Values / Vectors etc)
What would be the right way to embed this script in R? Sorry, but I am a programming noob!
My script code below...heading of each column in each CSV is DATE, TIME, VALUE
# Pull in Data from the FileSystem and attach it
library(timeDate)
library(timeSeries)
library(xts)
AA01raw <- read.csv("AA01.csv")
attach(AA01raw)
# format the data for timeseries work
cdt <- as.character(Date)
ctm <- as.character(Time)
tfrm <- timeDate(paste(cdt, ctm), format = "%Y/%m/%d %H:%M:%S")
val <- as.matrix(Value)
aa01tsobj <- timeSeries(val, tfrm)
# convert the timeSeries object to an xts Object
aa01xtsobj <- as.xts(aa01tsobj)
#remove all the intermediate objects to leave the final xts object
rm(cdt)
rm(ctm)
rm(aa01tsobj)
rm(tfrm)
gc()
and then repeat on each .csv file til all xts objects are extracted.
ie, what we would end up within R, ready for further applications are:
aa01xtsobj, ab01xtsobj, ac01xtsobj....etc
any help on how to do this would be very much appreciated.
Be sure to use R's dir() function to produce the list of filenames instead of entering them manually.
filenames <- dir(pattern = "*01.csv")
for (i in 1:length(filenames)) {
  ...
}
I find a for loop and lists are good enough for stuff like this. Once you have a working set of code it's easy enough to move from a loop into a function that can be passed to sapply() or similar, but that kind of vectorization is idiosyncratic anyway and probably not useful outside of private one-liners.
You probably want to avoid assigning to multiple objects with different names in the workspace (this a FAQ which usually comes up as "how do I assign() . . .").
Please beware my untested code.
A vector of file names, and a list with a named element for each file.
files <- c("AA01.csv", "AA02.csv")
lst <- vector("list", length(files))
names(lst) <- files
Loop over each file.
library(timeSeries)
library(xts)
for (i in seq_along(files)) {
  ## read strings as character
  tmp <- read.csv(files[i], stringsAsFactors = FALSE)
  ## convert the DATE and TIME columns to 'timeDate'
  tmp$tfrm <- timeDate(paste(tmp$DATE, tmp$TIME), format = "%Y/%m/%d %H:%M:%S")
  ## create a timeSeries object from the VALUE column
  obj <- timeSeries(as.matrix(tmp$VALUE), tmp$tfrm)
  ## store the object in the list, by name
  lst[[files[i]]] <- as.xts(obj)
}
## clean up
rm(tmp, files, obj)
Now all the read objects are in lst, but you'll want to test that each file is available and was read correctly, and you may want to change the names to something more sensible than the raw file names.
Print out the first object by name index from the list:
lst[[files[1]]]