Retaining sample information after processing with a function in R

I won't pretend that this code is even remotely optimal, but here is the problem I have. I have a list of files with multiple columns read in with sapply(), such that if I call file.list[[1]] I get a summary of that data.frame, and summary(file.list) is a list of files.
I am fitting curves to the data using the mgcv package as follows:
library(mgcv)  # gam() and s() come from the mgcv package

gam_data <- function(curves)
{
  out <- gam(curves[, 15] ~ s(curves[, 23]))
  pd <- plot(out)  # plot() on a gam object returns the plotting data
  return(pd)
}
out <- lapply(file.list, gam_data)
split_curves <- function(splitting)
{
  pd_2 <- c(splitting[[1]]$fit)
  pd_3 <- c(splitting[[1]]$x)
  pd_4 <- c(splitting[[1]]$se)
  curveg <- cbind(pd_2, pd_3, pd_4)
  colnames(curveg) <- c("fitted", "sphro", "se")
  return(curveg)
}
out2 <- lapply(out, split_curves)
The first block fits the GAM and the second extracts the fitted curve. However, after all of that, the original information in file.list such as replicate, genotype, etc. is lost, and the data frames are no longer the same length. This is probably a trivial question, but how does one retain that information through processing? I'm applying this to hundreds of data frames, so I cannot just manually recreate the columns.
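One way to keep the identifying information is to iterate over the names of file.list rather than its elements, and attach the metadata to each extracted curve. The sketch below reuses gam_data() and split_curves() from above; it assumes file.list has names (sapply() over a character vector of file names usually provides these), and the replicate and genotype columns are only stand-ins for whatever identifiers the original data frames actually contain.
out2 <- lapply(names(file.list), function(nm) {
  curves <- file.list[[nm]]
  pd <- gam_data(curves)       # fit the GAM and grab the plot data
  curveg <- split_curves(pd)   # extract fitted values, x, and se
  # tag every row of the extracted curve with that file's identifiers
  data.frame(file = nm,
             replicate = curves$replicate[1],
             genotype = curves$genotype[1],
             curveg)
})
names(out2) <- names(file.list)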

Related

Saving output of for-loop for every iteration

I am currently working on an imputation project where I need to evaluate my imputation methods. I have an incomplete data frame with NAs, from which I calculate the missing rate for every column/variable. My second data frame contains the complete cases, which I extracted from the first data frame. I now want to simulate the missingness structure of the real data in the frame containing the complete cases. The data frame with the generated NAs gets stored in the object "result", as you can see in the code. If I now want to replicate this code and thus generate 100 different data frames like "result", how do I replicate and save them separately?
I'm a beginner and would be really thankful for your answers!
I tried putting the loop that generates the NAs inside another loop that uses the replicate() command and counts from 1 to 100, saving the 100 replicated data frames, but it didn't work at all.
result = data.frame(res0=rep(NA, dim(comp_cas)[1]))
for (i in 1:length(Z32_miss_item$miss_per_item)) {
  dat = comp_cas[,i]
  missRate = Z32_miss_item$miss_per_item[i]
  cat(i, " ", paste0(dat, collapse=","), " ", missRate, "!\n")
  df <- data.frame("res" = GenMiss(x=dat, missrate = missRate), stringsAsFactors = FALSE)
  colnames(df) = gsub("res", paste0("Var", i), colnames(df))
  result = cbind(result, df)
}
result = result[,-1]
I expect every data frame from the 100 runs to be saved in a separate .rda file in my project folder.
Also, is imputation and evaluating how well it works beginner-level material in R, or what level of proficiency does the code I posted suggest?
It is difficult to guess what exactly you are doing without some dummy data, but it is fine to have loops within loops and to save data frames. Firstly, I would avoid the replicate function here, as it has a strange syntax, and stick with plain loops. Secondly, make sure the loops use different index variables (i.e. for(i ... should be surrounded by, say, for(j ...), since loop variables in R are not confined to the loop's own scope, so an inner and outer loop sharing an index will interfere with each other. Finally, use saveRDS rather than save, as you can then save each object (data.frame) in a separate .rds file. The save function is designed for saving your whole workspace so that you can pick up where you left off.
fun <- function(i){
  df <- data.frame(x=rnorm(5))
  names(df) <- paste0("x", i)
  df
}

for(j in 1:100){
  res <- data.frame(id=1:5)
  for(i in 1:10){
    res <- cbind(res, fun(i))
  }
  saveRDS(res, sprintf("replication_%s.rds", j))
}
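To load the replications back in later, readRDS() reads one file at a time; a minimal sketch, assuming the .rds files were written to the working directory as above:
# read every saved replication back into a named list
rds_files <- sprintf("replication_%s.rds", 1:100)
replications <- lapply(rds_files, readRDS)
names(replications) <- rds_files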

Creating a pipeline in R that serially processes multiple csv files

My pipeline reads a csv into a dataframe, assigns rownames, removes a column, performs a PCA, plots the PCA, and extracts the meaningful variables from the PCA, which are also plotted.
Here is my current code, which only goes as far as the first plot:
library(ggplot2)
library(ggrepel)

tsv  = read.csv('matrix.tsv', sep='\t')
bell = read.csv('bell.tsv', sep='\t')
tail = read.csv('tail.tsv', sep='\t')
dfList = list(tail, tsv, bell)

# process csv's
dfList = lapply(dfList, function(dum){
  rownames(dum) = dum[,1]
  dum[,1] = NULL
  dum$X = NULL
  dum = dum[, -grep('un', colnames(dum))]
})

# create pca's of dataframes
pcaList = lapply(dfList, function(pca){
  prin_comp = prcomp(pca, scale. = T)
})

# plot top 2 principal components in the pca
plotList = lapply(pcaList, function(prin_comp){
  t = qplot(x=prin_comp$rotation[,1], y=prin_comp$rotation[,2]) +
    geom_text_repel(aes(label=row.names(prin_comp$rotation)))
})

# this plots the 3 plots, one for each pca, but they are un-named
plotList
The problem is that the plots don't have meaningful names/titles. I don't know how to keep that information present as it is passed from function to function.
I know there must be a more elegant way of doing this. I have spent a day reading similar and not-so-similar questions about processing multiple csv files, but they either weren't applicable or didn't work for my case.
And as the title of this question implies, I would prefer to process one csv at a time, not all 3 at once, as the csv's in question are very large, over 5GB each, so keeping every dataframe and PCA in memory at the same time is impossible.
You just need to keep a string you want to use as the title somewhere and add ggtitle(YOUR_TITLE) to your plot, but this is not so easy with your current code. Instead of performing each step of the analysis for each CSV before going to the next step, why don't you just perform all steps for one CSV at a time?
Your code could look like:
library(ggplot2)
library(ggrepel)

csvs <- c("matrix.tsv", "bell.tsv", "tail.tsv")

for (i in csvs) {
  # read file
  df <- read.csv(i, sep='\t')
  # process file
  rownames(df) <- df[,1]
  df[,1] <- NULL
  df$X = NULL
  df = df[, -grep('un', colnames(df))]
  # create pca
  pca <- prcomp(df, scale. = T)
  # plot pca
  pcaPlot <- qplot(x=pca$rotation[,1], y=pca$rotation[,2]) +
    geom_text_repel(aes(label=row.names(pca$rotation))) +
    ggtitle(i)
  print(pcaPlot)
  # extract and plot meaningful variables
  # ...
}
Basically, I just moved everything you were doing in the lapply calls into a single for loop; this approach also processes one CSV at a time.
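If you would rather title each plot with the bare file name instead of the full "matrix.tsv" string, one option (my own addition, not part of the answer above) is to strip the extension with base R's tools::file_path_sans_ext() and pass that to ggtitle():
# inside the loop, compute a cleaner title before building the plot
plotTitle <- tools::file_path_sans_ext(basename(i))  # e.g. "matrix" instead of "matrix.tsv"
# ... then use ggtitle(plotTitle) instead of ggtitle(i)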

Adding a line of data of different types to a row of a data frame

I am experimenting with different regression models. My end goal is to have a nice, easy-to-read dataframe with 3 columns:
model_results <- data.frame(name = character(),
                            rmse = numeric(),
                            r2 = numeric())
Then, after running each model, add the corresponding output to the dataframe and, at the end, review it and make some decisions on which model to use.
I tried this:
mod.spend_transactions.results <- list("mod.spend_transactions",
                                       rsme(residuals(mod.spend_transactions)),
                                       summary(mod.spend_transactions)$r.squared)
I tried using a list because I know vectors can only store one datatype (right?).
Output:
rbind(model_results, mod.spend_transactions.results)
X.mod.spend_transactions. X12.6029444519635 X0.912505643567096
1 mod.spend_transactions 12.60294 0.9125056
Close, but not what I wanted, since the column names have been changed and I did not expect that.
So I tried vectors, which works but seems "clunky" in that I'm sure I could do this by writing less code:
vect_modname <- vector()
vect_rsme <- vector()
vect_r2 <- vector()
Then after running a model
vect_modname <- c(vect_modname, "mod.spend_transactions")
vect_rsme <- c(vect_rsme, rsme(residuals(mod.spend_transactions)))
vect_r2 <- c(vect_r2, summary(mod.spend_transactions)$r.squared)
Then at the end of running all the models I'm testing out
data.frame(vect_modname, vect_rsme, vect_r2)
Again, the vector method does work. But is there a "better", more elegant way of doing this?
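No answer is shown here, but one common pattern (a sketch of my own, not from the thread, which assumes rsme() is the helper already used above) is to collect each model's metrics as a one-row data frame in a list and bind them together once at the end:
results <- list()

# after fitting each model, add one row to the list
results[["mod.spend_transactions"]] <- data.frame(
  name = "mod.spend_transactions",
  rmse = rsme(residuals(mod.spend_transactions)),
  r2   = summary(mod.spend_transactions)$r.squared,
  stringsAsFactors = FALSE
)

# ... repeat for the other models, then combine everything in one step
model_results <- do.call(rbind, results)
Because every element is already a data frame with the same column names, rbind preserves both the column names and the column types.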

Export accuracy of multiple timeseries forecasts in r into csv-document

I am using the fpp package to forecast multiple time series of different customers at the same time. I am already able to extract the point forecasts of different easy forecast methods (snaive, meanf, etc.) into a csv document. However, I am still trying to figure out how to extract the measures of the accuracy() command of every time series into a csv file at the same time.
I constructed an example:
# loading of the "fpp"-package into R
install.packages("fpp")
require("fpp")
# Example customers
customer1 <- c(0,3,1,3,0,5,1,4,8,9,1,0,1,2,6,0)
customer2 <- c(1,3,0,1,7,8,2,0,1,3,6,8,2,5,0,0)
customer3 <- c(1,6,9,9,3,1,5,0,5,2,0,3,2,6,4,2)
customer4 <- c(1,4,8,0,3,5,2,3,0,0,0,0,3,2,4,5)
customer5 <- c(0,0,0,0,4,9,0,1,3,0,0,2,0,0,1,3)
#constructing the timeseries
all <- ts(data.frame(customer1, customer2, customer3, customer4, customer5),
          f=12, start=2015)
train <- window(all, start=2015, end=2016-0.01)
test <- window(all, start=2016)
CustomerQuantity <- ncol(train)
# Example of extracting easy forecast method into csv-document
horizon <- 4
fc_snaive <- matrix(NA, nrow=horizon, ncol=CustomerQuantity)
for(i in 1:CustomerQuantity){
  fc_snaive[,i] <- snaive(train[,i], h=horizon)$mean
}
write.csv2(fc_snaive, file ="fc_snaive.csv")
The following part is exactly where I need some help: I would like to extract the accuracy measures into a csv file all at once. In my real dataset I have 4000 customers, not only 5! I tried to use loops and lapply(), but unfortunately my code didn't work.
accuracy(fc_snaive[,1], test[,1])
accuracy(fc_snaive[,2], test[,2])
accuracy(fc_snaive[,3], test[,3])
accuracy(fc_snaive[,4], test[,4])
accuracy(fc_snaive[,5], test[,5])
The following uses lapply to run accuracy for each element from 1 to the number of columns in fc_snaive with the corresponding element in test.
Then, with do.call, we bind the results by row (rbind), so we end up with a matrix that we can, in turn, export using write.csv.
new_matrix <- do.call(what = rbind,
                      args = lapply(1:ncol(fc_snaive), function(x){
                        accuracy(fc_snaive[, x], test[, x])
                      }))

write.csv(x = new_matrix,
          file = "a_filename.csv")
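One small addition of my own (not part of the answer above): accuracy() returns a one-row matrix per series here, so after rbind every row carries the same "Test set" label. Assuming the customer columns keep their names from the original data frame, you can label the rows before exporting so the csv stays traceable:
# label each row with the customer it belongs to before writing the file
rownames(new_matrix) <- colnames(test)
write.csv(x = new_matrix, file = "a_filename.csv")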

Looping Over a Set of Files

I cooked up some code that is supposed to find all my .txt files (they're outputs of ODE simulations), open them all up as data frames with "read.table" and then perform some calculations on them.
files <- list.files(path="/Users/redheadmammoth/Desktop/Ultimate_Aging_F2016",
pattern=".txt",full.names=TRUE)
ldf <- lapply(files, read.table)
tuse <- seq(from=0,to=100,by=0.1)
for(files in ldf)
  findR <- function(r){
    with(files, (sum(exp(-r*age)*fecund*surv*0.1)-1)^2)
  }
{
  R0 <- with(files, (sum(fecund*surv*age)))
  GenTime <- with(files, (sum(tuse*fecund*surv*0.1))/R0)
  r <- optimize(f=findR, seq(-5,5,.0001), tol=0.00000001)$minimum
  RV <- with(files, (exp(r*tuse)/surv)*(exp(-r*tuse)*(fecund*surv)))
  plot(log(surv) ~ age, files, type="l")
  tmp.lm <- lm(log(surv) ~ age + I(age^2), files)  # fit log surv to a quadratic
  lines(files$age, predict(tmp.lm), col="red")
}
However, the problem is that it seems to only be performing the calculations contained in my "for" loop on one file, rather than all of them. I'd like it to perform the calculations on all of my files, then save all the files together as one big data frame so I can access the results of any particular set of my simulations. I suspect the error is that I'm not indexing the files correctly in order to loop over all of them.
How about using plyr::ldply() for this? It takes a list (in your case, your list of files), applies the same function to each element, and then returns a data frame.
The main thing to remember is to create an ID column for each file you read in, so you know which data came from which file. The simplest approach is to use the file name and edit it from there.
If your function takes additional arguments, they go after the function name in the ldply call.
# create file list
files <- list.files(path="/Users/redheadmammoth/Desktop/Ultimate_Aging_F2016",
                    pattern=".txt", full.names=TRUE)

tuse <- seq(from=0, to=100, by=0.1)

load_and_edit <- function(file, tuse){
  temp <- read.table(file)
  # here put all the calculations you want to do on each file
  temp$R0 <- sum(temp$fecund*temp$surv*temp$age)
  # make a column with the file name so you know which data comes from which file
  temp$id <- file
  return(temp)
}

new_data <- plyr::ldply(files, load_and_edit, tuse)
This is the easiest way I have found to read in and wrangle multiple files in batch.
You can then plot each one really easily.
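Since every row now carries an id column, per-file plots follow almost for free; a sketch (my own addition, using the log-survival-versus-age plot from the question) with ggplot2 faceting:
library(ggplot2)

# one panel per input file, keyed on the id column added in load_and_edit()
ggplot(new_data, aes(x = age, y = log(surv))) +
  geom_line() +
  facet_wrap(~ id)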
