Is there a way to run commands over multiple R datasets?

I am very new to R and would like to run the same batch of code over every dataset I have available. This should make clear what I'm trying to do:
# library(PerformanceAnalytics)
# mydata <- mtcars[, c('mpg', 'cyl', 'disp', 'hp', 'carb')];
# chart.Correlation(mydata, histogram=TRUE, pch=19)
library(MASS)
M_names = data(package = "MASS")$result[, "Item"]
for (i in 1:length(M_names)) {
eval(paste("MASS::", M_names[i], sep=""));
}
The commented-out part is some code I found that I haven't been able to integrate yet. chart.Correlation produces a very nice correlation matrix, and I'm trying to funnel every dataset I have access to through it so I can review them quickly instead of doing it all manually. I expect I will need to save the plots as PNGs to make that workflow practical, since the X11 windows won't appear or stay open when R is run as a script.
The behavior I observe as I execute this on my Mac is:
> library(MASS)
> M_names = data(package = "MASS")$result[, "Item"]
> for (i in 1:length(M_names)) {
+ eval(paste("MASS::", M_names[i], sep=""));
+ }
>
>
I don't know exactly what the silent + prompt means, but I'm fairly sure it just indicates that the line is inside the for loop. Either way, eval seems to swallow the command I assembled. For now I'm just trying to get it to print the contents of each dataset on each iteration of the loop.
I also noticed this:
> eval("MASS::ships")
[1] "MASS::ships"
It just prints the string when I try to eval it.
I also hope there is a way to print individual datasets programmatically. I'm clearly hacking hard at this, and I doubt that what I am doing here is a good idea.

If you have the package's dataset names in a character vector, the key to accessing them by name is the get function:
library(MASS)
M_names = data(package = "MASS")$result[, "Item"]
head(get(M_names[1]), 1)
# state sex diag death status T.categ age
# 1 NSW M 10905 11081 D hs 35
You can then loop through the vector of names
for (DATA in M_names) print(summary(get(DATA)))
Another option is to use the envir argument of the data function to load the datasets into a specific environment. It may be worth adding the data to a new environment instead of polluting your workspace. You can do that with:
data(list = M_names, package = "MASS", envir = list_of_dataframes <- new.env())
You can then loop over list_of_dataframes much as you would any other list-like object:
lapply(list_of_dataframes, summary)
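To tie this back to the original goal of feeding every dataset into chart.Correlation, here is a minimal, untested sketch. It assumes PerformanceAnalytics is installed, keeps only data frames with at least two numeric columns, and the output file names are illustrative:
library(MASS)
library(PerformanceAnalytics)

M_names <- data(package = "MASS")$results[, "Item"]

for (nm in M_names) {
  d <- tryCatch(get(nm), error = function(e) NULL)
  if (!is.data.frame(d)) next            # skip matrices, tables, time series, ...
  num <- d[sapply(d, is.numeric)]        # keep only numeric columns
  if (ncol(num) < 2) next                # need at least two columns to correlate
  png(paste0(nm, ".png"), width = 800, height = 800)
  try(chart.Correlation(num, histogram = TRUE, pch = 19))
  dev.off()
}
Because each plot goes to a PNG device, this also sidesteps the problem of X11 windows not staying open when the code is run as a script.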

Related

R code automation

I am doing PCA. Here is my code:
### Read .csv file #####
data<-read.csv(file.choose(),header=T,sep=",")
names(data)
data$qcountry
#### for the country-ARGENTINA#######
ar_data<-data[which(data$qcountry=="ar"),]
ar_data$qcountry<-NULL
names(ar_data)
names(ar_data)<-c("01_insufficient_efficacy","02_safety_issues","03_inconvenient_dosage_regimen","04_price_issues"
,"05_not_reimbursed","06_not_inculed_govt","07_insuficient_clinicaldata","08_previously_used","09_prescription_opted_for_some_patients","10_scientific_info_NA","12_involved_in_diff_clinical_trial"
,"13_patient_inappropriate_for_TT","14_patient_inappropriate_Erb","16_patient_over_65","17_Erbitux_alternative","95_Others")
names(ar_data)
ar_data_wdt_zero_columns<-ar_data[, colSums(ar_data != 0) > 0]
####Testing multicollinearity####
vif(ar_data_wdt_zero_columns)
#### Testing appropriatness of PCA ####
KMO(ar_data_wdt_zero_columns)
cortest.bartlett(ar_data_wdt_zero_columns)
#### Run PCA ####
pca<-prcomp(ar_data_wdt_zero_columns,center=F,scale=F)
summary(pca)
#### Compute the loadings for deciding the top4 most correlated variables###
load<-pca$rotation
write.csv(load,"loadings_argentina_2015_Q4.csv")
I have shown the code for one country; I have done this for 9 countries, rerunning the code for each one. I am sure there must be an easier way to automate this. Please suggest!
Thanks!
Yes, this is doable for every country. You can write a custom function that takes appropriate parameters, e.g. a country name and a dataset. You do the magic inside and return an appropriate object (or nothing). You then apply this function to data that you import and clean up only once. The code below is not tested but should get you started.
A few comments.
Don't use file.choose() as it breaks your code 3 days down the line. How do you know what file to choose? Why click every time you run the script when you can make the script work for you? Be lazy in that sense.
You have a lot of clutter in your script. Adhere to some style and don't leave in random lines you try out for "shits and giggles". Use spaces in your code, at least.
Be more imaginative when choosing object names. Check first whether a name is already taken by a base function, e.g. load.
myPCA <- function(my.country, my.data) {
  # vif() here must be a version that accepts a data frame (e.g. from usdm or faraway);
  # KMO() and cortest.bartlett() come from the psych package
  country_data <- my.data[my.data$qcountry %in% my.country, ]
  country_data$qcountry <- NULL
  country_data <- country_data[, colSums(country_data != 0) > 0]

  #### Run PCA ####
  pca <- prcomp(country_data, center = FALSE, scale = FALSE)

  #### Write out the loadings for deciding the top 4 most correlated variables ####
  write.csv(pca$rotation, paste("loadings_", my.country, ".csv", sep = "")) # may need tweaking

  list(pca = pca,
       vif = vif(country_data),
       kmo = KMO(country_data),
       correlation = cortest.bartlett(country_data))
}

data <- read.csv("relative_link_to_file", header = TRUE, sep = ",")

# rename only the answer columns, leaving qcountry untouched
names(data)[names(data) != "qcountry"] <- c("01_insufficient_efficacy", "02_safety_issues",
  "03_inconvenient_dosage_regimen", "04_price_issues", "05_not_reimbursed", "06_not_inculed_govt",
  "07_insuficient_clinicaldata", "08_previously_used", "09_prescription_opted_for_some_patients",
  "10_scientific_info_NA", "12_involved_in_diff_clinical_trial", "13_patient_inappropriate_for_TT",
  "14_patient_inappropriate_Erb", "16_patient_over_65", "17_Erbitux_alternative", "95_Others")

results <- lapply(unique(data$qcountry), myPCA, my.data = data)
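If you want each country's output labelled, a small follow-up sketch (the country code "ar" is illustrative) is to name the result list by the country values:
countries <- unique(data$qcountry)
results <- setNames(lapply(countries, myPCA, my.data = data), countries)

results[["ar"]]$pca   # e.g. the prcomp object for Argentina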

Read, manipulate and export multiple .dta files using a for loop in R

I have multiple time series (each in a separate file) that I need to adjust seasonally using the seasonal package in R, and then store each adjusted series in a separate file in a different directory.
The code works for a single county.
So I tried to use a for loop, but read.dta cannot take a wildcard.
I'm new to R and usually work in Stata, so the question may be quite basic and my code quite messy.
Sorry and Thanks in advance
Nathan
for(i in 1:402)
{
alo[i] <- read.dta("/Users/nathanrhauke/Desktop/MA_NH/Data/ALO/SEASONAL_ADJUSTMENT/SINGLE_SERIES/County[i]")
alo_ts[i] <-ts(alo[i], freq = 12, start = 2007)
m[i] <- seas(alo_ts[i])
original[i]<-as.data.frame(original(m[i]))
adjusted[i]<-as.data.frame(final(m[i]))
trend[i]<-as.data.frame(trend(m[i]))
irregular[i]<-as.data.frame(irregular(m[i]))
County[i] <- data.frame(cbind(adjusted[i],original[i],trend[i],irregular[i], deparse.level =1))
write.dta(County[i], "/Users/nathanrhauke/Desktop/MA_NH/Data/ALO/SEASONAL_ADJUSTMENT/ADJUSTED_SERIES/County[i].dta")
}
This is a good place to use a function and the *apply family. As noted in a comment, your main problem is likely that you're using Stata-like character string construction that will not work in R. You need paste (or paste0, as here) to build the file names rather than embedding the indexing variable directly in the string as you would in Stata. Here's some code:
library(foreign)   # read.dta() and write.dta()
library(seasonal)  # seas(), original(), final(), trend(), irregular()

f <- function(i) {
  d <- read.dta(paste0("/Users/nathanrhauke/Desktop/MA_NH/Data/ALO/SEASONAL_ADJUSTMENT/SINGLE_SERIES/County", i, ".dta"))
  alo_ts <- ts(d, freq = 12, start = 2007)
  m <- seas(alo_ts)
  original <- as.data.frame(original(m))
  adjusted <- as.data.frame(final(m))
  trend <- as.data.frame(trend(m))
  irregular <- as.data.frame(irregular(m))
  County <- cbind(adjusted, original, trend, irregular, deparse.level = 1)
  write.dta(County, paste0("/Users/nathanrhauke/Desktop/MA_NH/Data/ALO/SEASONAL_ADJUSTMENT/ADJUSTED_SERIES/County", i, ".dta"))
  invisible(County)
}
# return a list of all of the resulting datasets
lapply(1:402, f)
It would probably also be a good idea to take advantage of relative directories by first setting your working directory:
setwd("/Users/nathanrhauke/Desktop/MA_NH/Data/ALO/SEASONAL_ADJUSTMENT/")
Then you can simplify the above paths to:
d <- read.dta(paste0("./SINGLE_SERIES/County",i,".dta"))
and
write.dta(County, paste0("./ADJUSTED_SERIES/County",i,".dta"))
which will make your code more readable and reproducible should, for example, someone ever run it on another computer.
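If the county files are not strictly numbered 1 to 402, a hedged alternative (untested; the file name pattern is an assumption) is to let list.files() discover them and drive the loop:
setwd("/Users/nathanrhauke/Desktop/MA_NH/Data/ALO/SEASONAL_ADJUSTMENT/")

# every .dta file in the input directory, with its relative path
infiles <- list.files("SINGLE_SERIES", pattern = "\\.dta$", full.names = TRUE)

g <- function(infile) {
  d <- read.dta(infile)
  m <- seas(ts(d, freq = 12, start = 2007))
  County <- cbind(as.data.frame(final(m)), as.data.frame(original(m)),
                  as.data.frame(trend(m)), as.data.frame(irregular(m)))
  # write the adjusted series under the same file name in the output directory
  write.dta(County, file.path("ADJUSTED_SERIES", basename(infile)))
  invisible(County)
}

results <- lapply(infiles, g)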

Run external R script n times, with different input parameters, and save outputs in a data frame

My question ties in to the following problem:
Run external R script n times and save outputs in a data frame
The difference is that I don't generate different results via randomization; instead I would like to run the chunk of code each time with a different set of input values (e.g. for a range of latitudes, lat = c(50, 60, 70, 80)).
Does anyone have a hint for me?
Thanks a lot!
Wrap the script into a function by putting:
my_function <- function(latitude) {
at the top and
}
at the bottom.
That way, you can source it once and then use ldply from the plyr package:
results <- ldply(10 * 5:8, my_function)
If you wanted a column identifying which latitude was used, you could either add it inside your function's data.frame or use:
results <- ldply(10 * 5:8, function(lat) data.frame(latitude = lat, my_function(lat)))
If for some reason you didn't want to modify your script, you could create a wrapper function:
my_wrapper <- function(a) {
  latitude <- a
  source("script.R", local = TRUE)$value
}
or even use eval and parse:
my_function <- eval(parse(text = c("function(latitude) {",
                                   readLines("script.R"), "}")))

Including R scripts in R packages

I want to share some software as a package, but some of my scripts do not seem to translate very naturally into functions. For example, consider the following chunk of code, where raw.df is a data frame containing both discrete and continuous variables. The functions count.unique and squash will be defined in the package. The script splits the data frame into two frames: cat.df, to be treated as categorical data, and cts.df, to be treated as continuous data.
My idea of how this would be used is that the user would read in the data frame raw.df, source the script, then interactively edit cat.df and cts.df, perhaps combining some categories and transforming some variables.
dcutoff <- 9
tail(raw.df)
(nvals <- apply(raw.df, 2, count.unique))
p <- dim(raw.df)[2]
(catvar <- (1:p)[nvals <= dcutoff])
p.cat <- length(catvar)
(ctsvar <- (1:p)[nvals > dcutoff])
p.cts <- length(ctsvar)
cat.df <- raw.df[ ,catvar]
for (i in 1:p.cat) cat.df[ ,i] <- squash(cat.df[ ,i])
head(cat.df)
for(i in 1:p.cat) {
cat(as.vector(table(cat.df[ ,i])), "\n")
}
cts.df <- raw.df[ ,ctsvar]
for(i in 1:p.cts) {
cat( quantile(cts.df[ ,i], probs = seq(0, 1, 0.1)), "\n")
}
Now this could, of course, be made into a function returning a list containing nvals, p, p.cat, cat.df, etc.; however, that seems rather ugly to me. And the only provision for including scripts in a package seems to be the demo folder, which does not seem to be the right way to go. Advice on how to proceed would be gratefully received.
(But the gratitude would not be formally expressed as it seems that using a comment to express thanks is deprecated.)
It is better to encapsulate your code in a function. Returning a list is not ugly: S3 objects, for example, are just lists with a class attribute.
object <- list(attribute.name = something, ..)
class(object) <- "cname"
return (object)
You can also use the inst folder (as mentioned in Dirk's comment), since the contents of the inst subdirectory are copied recursively to the installation directory.
You create an inst folder:
inst/
  scripts/
    some_scripts.R
You can then call it from a function in your package, using the system.file() mechanism to locate and source the script.
load_myscript <- function() {
  source(system.file("scripts/some_scripts.R", package = "your_pkg_name"))
}
You call it as any other function in your package:
load_myscript()
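Applied to the script in the question, a minimal sketch of the encapsulated-function approach might look like the following (the function and class names are illustrative; count.unique and squash are assumed to be exported by the package):
split_by_type <- function(raw.df, dcutoff = 9) {
  nvals <- apply(raw.df, 2, count.unique)

  # few distinct values -> categorical, otherwise continuous
  catvar <- which(nvals <= dcutoff)
  ctsvar <- which(nvals > dcutoff)

  cat.df <- raw.df[, catvar, drop = FALSE]
  cat.df[] <- lapply(cat.df, squash)
  cts.df <- raw.df[, ctsvar, drop = FALSE]

  out <- list(nvals = nvals, cat.df = cat.df, cts.df = cts.df)
  class(out) <- "typesplit"   # hypothetical S3 class, so you can add print/summary methods later
  out
}
The user can then edit result$cat.df and result$cts.df interactively, much as with the sourced script.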

What ways are there for cleaning an R environment from objects?

I know I can use ls() and rm() to see and remove objects that exist in my environment.
However, when dealing with an "old" .RData file, one sometimes needs to pick an environment apart to find what to keep and what to leave out.
What I would like is a GUI-like interface that lets me see the objects, sort them (for example, by their size), and remove the ones I don't need (for example, via check-boxes). Since I imagine such a system is not currently implemented in R, what ways do exist? What do you use for cleaning old .RData files?
Thanks,
Tal
I never create .RData files. If you are practicing reproducible research (and you should be!), you should be able to source R files to go from input data files to all outputs.
When you have operations that take a long time, it makes sense to cache them. I often use a construct like:
if (file.exists("cache.rdata")) {
load("cache.rdata")
} else {
# do stuff ...
save(..., file = "cache.rdata")
}
This allows you to work quickly from cached files, and when you need to recalculate from scratch you can just delete all the rdata files in your working directory.
The basic solution is to load your data, remove what you don't want, and save the result as new, clean data.
Another way to handle this situation is to control the loaded RData by loading it into its own environment:
sandbox <- new.env()
load("some_old.RData", sandbox)
Now you can see what is inside
ls(sandbox)
sapply(ls(sandbox), function(x) object.size(get(x,sandbox)))
Then you have several possibilities:
Write what you want to a new RData file: save(A, B, file = "clean.RData", envir = sandbox)
Remove what you don't want from the environment: rm(x, z, u, envir = sandbox)
Make a copy of the variables you want in the global workspace and remove the sandbox
I usually do something similar to the third option: load my data, do some checks and transformations, copy the final data to the global workspace, and remove the environment.
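A minimal sketch of that third option (the object names A and B are illustrative):
sandbox <- new.env()
load("some_old.RData", envir = sandbox)

# copy the objects you still need into the global workspace, then drop the rest
A <- sandbox$A
B <- sandbox$B
rm(sandbox)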
You could also implement what you want yourself. So:
Load the data
vars <- load("some_old.RData")
Get sizes
vars_size <- sapply(vars, function(x) object.size(get(x)))
Order them
vars <- vars[order(vars_size, decreasing=TRUE)]
vars_size <- vars_size[order(vars_size, decreasing = TRUE)]
Make a dialog box (the graphical dialog depends on your OS; this is on Windows)
vars_with_size <- paste(vars,vars_size)
vars_to_save <- select.list(vars_with_size, multiple=TRUE)
Remove what you don't want
rm(list = vars[!vars_with_size %in% vars_to_save])
For a nicer display of the object sizes I use a solution based on getAnywhere(print.object_size):
pretty_size <- function(x) {
  ifelse(x >= 1024^3, paste(round(x / 1024^3, 1L), "Gb"),
    ifelse(x >= 1024^2, paste(round(x / 1024^2, 1L), "Mb"),
      ifelse(x >= 1024, paste(round(x / 1024, 1L), "Kb"),
        paste(x, "bytes"))))
}
Then, in step 4, one can use paste(vars, pretty_size(vars_size)).
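The steps above can be wrapped into a small helper; here is an untested sketch (the function name is illustrative, and it relies on pretty_size() defined above and the interactive select.list() dialog):
clean_rdata <- function(file) {
  # load into the global workspace, as in step 1, and capture the object names
  vars <- load(file, envir = globalenv())
  sizes <- sapply(vars, function(x) object.size(get(x, envir = globalenv())))

  ord <- order(sizes, decreasing = TRUE)
  vars <- vars[ord]
  sizes <- sizes[ord]

  labels <- paste(vars, pretty_size(sizes))
  keep <- select.list(labels, multiple = TRUE, title = "Objects to keep")

  # remove everything that was not ticked in the dialog
  rm(list = vars[!labels %in% keep], envir = globalenv())
  invisible(vars[labels %in% keep])
}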
You may want to check out the RGtk2 package.
You can very easily create an interface with Glade Interface Designer and then attach whatever R commands you want to it.
If you want a good starting point from which to "steal" ideas on how to use RGtk2, install the rattle package and run rattle(). Then look at the source code and start building your own interface :)
I may have a go at it and see if I can come up with something simple.
EDIT: this is a quick and dirty piece of code that you can play with. The one catch in my first attempt was deleting the selected object: rm(obj) only removes the local variable obj, so the handler below uses rm(list = obj, envir = globalenv()) to remove the object that obj names. At least the interface works! :D
TODO:
I put all the variables in the remObjEnv environment. It should not be listed among the current objects, and it should be removed when the window is closed
The list only shows objects in the global environment; anything inside other environments won't be shown, but that's easy enough to implement
Probably there are other bugs I haven't thought of :D
Enjoy
# Our environment
remObjEnv <<- new.env()
# Various required libraries
require("RGtk2")
remObjEnv$createModel <- function()
{
  # create the array of data and fill it in
  remObjEnv$objList <- NULL
  objs <- objects(globalenv())
  for (o in objs)
    remObjEnv$objList[[length(remObjEnv$objList) + 1]] <- list(object = o,
                                                               type = typeof(get(o)),
                                                               size = object.size(get(o)))
  # create list store
  model <- gtkListStoreNew("gchararray", "gchararray", "gint")

  # add items
  for (i in 1:length(remObjEnv$objList))
  {
    iter <- model$append()$iter
    model$set(iter,
              0, remObjEnv$objList[[i]]$object,
              1, remObjEnv$objList[[i]]$type,
              2, remObjEnv$objList[[i]]$size)
  }
  return(model)
}
remObjEnv$addColumns <- function(treeview)
{
  colNames <- c("Name", "Type", "Size (bytes)")
  model <- treeview$getModel()
  for (n in 1:length(colNames))
  {
    renderer <- gtkCellRendererTextNew()
    renderer$setData("column", n - 1)
    treeview$insertColumnWithAttributes(-1, colNames[n], renderer, text = n - 1)
  }
}
# Builds the list.
# I seem to have some problems correctly building treeviews from glade files,
# so we'll just do it by hand :)
remObjEnv$buildTreeView <- function()
{
  # create model
  model <- remObjEnv$createModel()

  # create tree view
  remObjEnv$treeview <- gtkTreeViewNewWithModel(model)
  remObjEnv$treeview$setRulesHint(TRUE)
  remObjEnv$treeview$getSelection()$setMode("single")
  remObjEnv$addColumns(remObjEnv$treeview)
  remObjEnv$vbox$packStart(remObjEnv$treeview, TRUE, TRUE, 0)
}
remObjEnv$delObj <- function(widget, treeview)
{
  model <- treeview$getModel()
  selection <- treeview$getSelection()
  selected <- selection$getSelected()

  if (selected[[1]])
  {
    iter <- selected$iter
    path <- model$getPath(iter)
    i <- path$getIndices()[[1]]
    model$remove(iter)

    # rm(obj) would only delete the local variable 'obj';
    # rm(list = obj) removes the object it names from the global environment
    obj <- as.character(remObjEnv$objList[[i + 1]]$object)
    rm(list = obj, envir = globalenv())
  }
}
# The list of the current objects
remObjEnv$objList <- NULL
# Create the GUI.
remObjEnv$window <- gtkWindowNew("toplevel", show = FALSE)
gtkWindowSetTitle(remObjEnv$window, "R Object Remover")
gtkWindowSetDefaultSize(remObjEnv$window, 500, 300)
remObjEnv$vbox <- gtkVBoxNew(FALSE, 5)
remObjEnv$window$add(remObjEnv$vbox)
# Build the treeview
remObjEnv$buildTreeView()
remObjEnv$button <- gtkButtonNewWithLabel("Delete selected object")
gSignalConnect(remObjEnv$button, "clicked", remObjEnv$delObj, remObjEnv$treeview)
remObjEnv$vbox$packStart(remObjEnv$button, TRUE, TRUE, 0)
remObjEnv$window$showAll()
Once you've figured out what you want to keep, you can use the keep function from the gdata package, which does what its name suggests.
a <- 1
b <- 2
library(gdata)
keep(a, all = TRUE, sure = TRUE)
See help(keep) for details on the all and sure options.
all: whether hidden objects (beginning with a .) should be removed, unless explicitly kept.
sure: whether to perform the removal, otherwise return names of objects that would have been removed.
This function is so useful that I'm surprised it isn't part of R itself.
The OS X GUI does have such a thing; it's called the Workspace Browser. Quite handy.
I've also wished for an interface that shows the dependencies between objects in a session, i.e. starting from a plot() and working backwards to find all the objects that were used to create it. This would require parsing the history.
It doesn't have checkboxes for deletion; rather, you select the file(s) and then click delete. Still, the solution below is pretty easy to implement:
library(gWidgets)
options(guiToolkit="RGtk2")
## make data frame with files
out <- lapply((x <- list.files()), file.info)
out <- do.call("rbind", out)
out <- data.frame(name = x, size = as.integer(out$size), ## more attributes?
                  stringsAsFactors = FALSE)
## set up GUI
w <- gwindow("Browse directory")
g <- ggroup(cont=w, horizontal=FALSE)
tbl <- gtable(out, cont=g, multiple=TRUE)
size(tbl) <- c(400,400)
deleteThem <- gbutton("delete", cont=g)
enabled(deleteThem) <- FALSE
## add handlers
addHandlerClicked(tbl, handler = function(h, ...) {
  enabled(deleteThem) <- (length(svalue(h$obj, index = TRUE)) > 0)
})
addHandlerClicked(deleteThem, handler = function(h, ...) {
  inds <- svalue(tbl, index = TRUE)
  files <- tbl[inds, 1]
  print(files) # replace with rm?
})
The poor man's answer could be:
ls()
# spot the positions of the variables you want to remove, for example 10 to 25
rm(list = ls()[10:25])
# repeat until satisfied
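A related base-R idiom (a sketch; the names in keep_these are illustrative) removes everything except a whitelist:
keep_these <- c("mydata", "results")   # objects you want to survive the clean-up
rm(list = setdiff(ls(), keep_these))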
To clean the complete environment you can try:
rm(list = ls())
