Extract r^2 from multiple models in plyr - r

I am hoping to efficiently combine my regressions using plyr functions. I have data frames with monthly data for multiple years in format yDDDD (so y2014, y2013, etc.)
Right now, I have the below code for one of those dfs, y2014. I am running the regressions by month, as desired within each year.
modelsm2= by(y2014,y2014$Date,function(x) lm(y~.,data=x))
summarym2=lapply(modelsm2,summary)
coefficientsm2=lapply(modelsm2,coef)
coefsm2v2=ldply(modelsm2,coef) #to get the coefficients into an exportable df
I have several things I'd like to do and I would really appreciate your help!
A. Extract the r^2 for each model. I know that for one model, you can do summary(model)$r.squared to get it, but I have not had luck with my construct.
B. Apply the same methodology in a loop-type structure to get the models to run for all of my data frames (y2013 and backwards)
C. Get the summary into an easily exportable (to Excel) format --> the ldply function does not work for the summaries.
Thanks again.

A. You need to subset out the r.squared values from your summaries:
lapply(summarym2,"[[","r.squared")
B. Put all your data into a list, and put another lapply around it, eg:
lmlist <- lapply(list(y2014,y2013,y2012), function(dat)
by(dat,dat$Date, function(x) lm(y~.,data=x))
)
You will then have a list of lists so for instance to extract the summaries, you would use:
lapply(lmlist,lapply,summary)
C. summary returns a fairly complex data structure that cannot be coerced into a data.frame. The result you see is a consequence of its print method. You can use capture.output to get a character vector with one element per line of the printed output, which you can then write to a file.
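For A to C together, here is a minimal sketch of one way to get R^2 and the coefficients into a flat table that can be exported. The data are made-up stand-ins for y2013 and y2014 (the columns Date, x1, x2, y and the helpers make_year and fit_year are my assumptions, not from the question). Date is dropped before fitting because it is constant within a month.

```r
# Hypothetical stand-ins for the yearly data frames (three months each).
set.seed(42)
make_year <- function() {
  data.frame(Date = rep(month.abb[1:3], each = 10),
             x1 = rnorm(30), x2 = rnorm(30), y = rnorm(30))
}
years <- list(y2013 = make_year(), y2014 = make_year())

# Fit one model per month and return one row per model:
# the month, its R^2, and its coefficients.
fit_year <- function(dat) {
  per_month <- split(dat, dat$Date)
  rows <- lapply(names(per_month), function(m) {
    d <- per_month[[m]]
    fit <- lm(y ~ ., data = d[, setdiff(names(d), "Date")])
    data.frame(Date = m,
               r.squared = summary(fit)$r.squared,
               t(coef(fit)),
               check.names = FALSE)
  })
  do.call(rbind, rows)
}

# Repeat for every year and stack the results into one data frame.
all_years <- do.call(rbind, lapply(names(years), function(nm) {
  cbind(year = nm, fit_year(years[[nm]]))
}))

# A CSV file opens directly in Excel.
out_file <- file.path(tempdir(), "r_squared_by_month.csv")
write.csv(all_years, out_file, row.names = FALSE)
```

Keeping the years in a named list means the year label travels with the results, so nothing has to be matched up by position afterwards.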

Related

Separating data frame based on column values

I am having a bit of trouble with trying to script a code in R so that it separates a data frame based on the character in a data frame column without manually specifying a subset command. Below is the script for reproduction in R:
a=c("Model_A","R1",358723.0,171704.0,1.0,36.818500,4.0222700,1.38895000)
b=c("Model_A","R2",358723.0,171704.0,2.6,36.447300,4.0116100,1.37479000)
c=c("Model_A","R3",358723.0,171704.0,5.0,35.615400,3.8092600,1.34301000)
d=c("Model_B","R1",358723.0,171704.0,1.0,39.818300,2.4475600,1.50384000)
e=c("Model_B","R2",358723.0,171704.0,2.6,39.391600,2.4209900,1.48754000)
f=c("Model_B","R3",358723.0,171704.0,5.0,38.442700,2.3618400,1.45126000)
g=c("Model_C","R1",358723.0,171704.0,1.0,31.246400,2.2388000,1.30652000)
h=c("Model_C","R2",358723.0,171704.0,2.6,30.911600,2.2144800,1.29234000)
i=c("Model_C","R3",358723.0,171704.0,5.0,30.166700,2.1603000,1.26077000)
df=data.frame(a,b,c,d,e,f,g,h,i)
df=t(df)
df=data.frame(df)
col_list=list("Model","Receptor.name","X(m.)","Y(m.)","Z(m.)",
"nox","PM10","PM2.5")
colnames(df)=col_list
Essentially what I am trying is to separate the data frame (df) by the Model names ("Model_A", "Model_B", and "Model_C") and store them in new and different data frames. I have been trying to use the following command
df_test=split(df,with(df,interaction(Model,Model)), drop = TRUE)
This command separates the data frame but stores them in lists, and I don't know how to extract the lists individually and store them as data frames. Is there a simpler solution (avoiding the subset command if possible as I need the script to be dynamic and relative) or does anyone know how to use the last command shown above to separate the lists into individual data frames? Also if possible, is it possible to name the data frame after the model?
I apologize if these are a lot of questions but any help would be hugely appreciated! Thank you!
list2env(split(df, df$Model), envir = .GlobalEnv) will give you three dataframes in your global environment, named after the models, containing the relevant rows.
> Model_A
    Model Receptor.name  X(m.)  Y(m.) Z(m.)     nox    PM10   PM2.5
a Model_A            R1 358723 171704     1 36.8185 4.02227 1.38895
b Model_A            R2 358723 171704   2.6 36.4473 4.01161 1.37479
c Model_A            R3 358723 171704     5 35.6154 3.80926 1.34301
Although I would just keep the list of three dataframes by only using dflist <- split(df, df$Model).
Why a list? Lists allow you the use of lapply - a looping function that applies an operation over every list element. A quick example: Let's say you'd want to get a frequency table for both PM variables in your data for all three datasets.
For single elements in your global environment this would be
table(Model_A$PM10)
table(Model_A$PM2.5)
...
table(Model_C$PM2.5)
With a list, it would be
lapply(dflist, function(x) table(x["PM10"]))
lapply(dflist, function(x) table(x["PM2.5"]))
Right now, it seems to only save some lines of code, but better yet, the output of lapply is again a list, which you can store in an object and further use for different operations. Due to this, you can have a global environment with only a few objects in it, each being lists which contain certain similar objects, like dataframes, tables, summaries or even plots.
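A small sketch of the list workflow, using the nox values for Model_A and Model_B from the question. Note that in the question's df every column is character, because c() mixes strings and numbers, so I build the numeric column directly here. That shortcut and the names model_a, nox_means and nox_range are mine.

```r
# nox values copied from the question; built as numeric from the start.
df <- data.frame(
  Model = rep(c("Model_A", "Model_B"), each = 3),
  Receptor.name = rep(c("R1", "R2", "R3"), times = 2),
  nox = c(36.8185, 36.4473, 35.6154, 39.8183, 39.3916, 38.4427)
)

# One data frame per model, named after the model.
dflist <- split(df, df$Model)

# Pull out a single data frame by name when you need it.
model_a <- dflist[["Model_A"]]

# The same operation applied to every model in one call.
nox_means <- sapply(dflist, function(x) mean(x$nox))
nox_range <- lapply(dflist, function(x) range(x$nox))
```

sapply simplifies the result to a named vector when every element has length one, while lapply always returns a list.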

Error ("variables with different types from the fit") when using the predict() function in R

I have fitted a multivariate polynomial using the lm() and step() functions in R. My data has dependent variable Y and some independent variables X1 till Xn. I formatted the formula to fit as follows: Y ~ I(X1^1)+I(X1^2)+I(X2^1)+... etc. When I use the predict() function on the original data everything works, even on the validation points which weren't used for the fit. But, I have to use the predict() function on some simulated data I produced. I made sure the simulated data is in a data.frame and all the elements are of type double like the original data. I copied the column names from the original data (X1, ... ,Xn) to the simulated data. Now when I use the predict() function I get the following error:
Error: variables ‘I(X1^1)’, ‘I(X1^2)’, ‘I(X2^1)’ were specified with different types from the fit
I really don't get it. The column names are the same, the types are the same and both original and simulated data are in a data.frame. What is happening here?
Thanks in advance!!
Sorry for not providing a reproducible example. But I've found a solution. It's not very elegant but here it is. When I coerce the data.frame with the original data to a matrix and then straight back to a data frame again, some attributes and other stuff are stripped from the original data. If I now use this data.frame for the fitting process, the predict() function also works on the simulated data. The simulated data was in matrix format first and was converted to a data.frame. It's still not clear to me whether there is a more elegant way to get rid of the attr, dimensions and other stuff in the data.frame of the original data. I've tried unname() but that didn't do the job.
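One possible cause, which is only my assumption since there is no reproducible example, is a column that is really a one-column matrix, for instance the result of scale(). A sketch of how to check for that and strip the extra attributes column by column, without the round trip through a matrix (all data here is simulated):

```r
set.seed(1)
orig <- data.frame(X1 = rnorm(20), X2 = rnorm(20))
orig$Y <- 1 + 2 * orig$X1 - orig$X2 + rnorm(20, sd = 0.1)

# Assumed cause, not confirmed by the question: scale() returns a one-column
# matrix with extra attributes, and the data frame stores it as-is.
orig$X1 <- scale(orig$X1)

# Inspect what each column really is before fitting.
col_is_matrix <- sapply(orig, is.matrix)

# as.numeric() drops dim and other attributes; assigning into clean[] keeps
# the data frame structure and the column names.
clean <- orig
clean[] <- lapply(orig, as.numeric)

fit <- lm(Y ~ I(X1^1) + I(X1^2) + I(X2^1), data = clean)

# Simulated new data with plain numeric columns.
sim <- data.frame(X1 = rnorm(5), X2 = rnorm(5))
pred <- predict(fit, newdata = sim)
```

Running str() on both data frames before fitting is a quick way to see whether the column types really match.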

converting mixed dataframe with list to pure dataframe

What is the easiest way to extract information from a list embedded within a dataframe?
a<-data.frame(cyl=c(4,6,8),k=c("A","B","C"))
j<-by(data=mtcars,INDICES=mtcars$cyl,function(x) lm(mpg~disp,data=x))
a$l<-j
t(sapply(a$l,coef))->a$t
But this results in a matrix embedded within the dataframe and it needs some massaging in order to have it as two columns in a with their associated column names.
What I'd like is an easier method to extract this information and have it stored in dataframe a with the associated column names.
EDIT_ This is what I had in mind, but I just found the procedure somewhat cumbersome.
t(sapply(a$l,coef))->a$t
as.data.frame(a$t)->g
g$cyl<-as.numeric(rownames(g))
merge(x = a,y = g)->a2
a2[,-c(3,4)]->a3
Any simpler ways of doing this?
Now, to complicate matters: what if I'd like to get the residuals from a$l by cylinder?
sapply(a$l,function(x) x[['residuals']])->a$t
How can I generate a new dataframe in a long format with two columns: cyl and residual that later can be merged with the original dataframe a?
Well--see my previous edit for the first answer. This is for my second problem:
It does solve my problem, but I'm sure there must be a quicker and more intuitive way of solving this.
flat.list.df<-function(list,sublist){
nm<-names(list)
i<-do.call(rbind,lapply(nm,function(x){
u<-list[[x]][[sublist]]
g<-length(u)
j<-rep(x,g)
m<-data.frame(var=j,val=u)
m
})
)
return(i)
}
flat.list.df(a$l,"residuals")->w
w
merge(w,a,by.x="var",by.y="cyl")
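A shorter route, as a sketch: keep the models in a plain named list (split plus lapply instead of by) and build the two tables from that. The names models, coefs, res_long, a_coefs and a_res are mine.

```r
a <- data.frame(cyl = c(4, 6, 8), k = c("A", "B", "C"))

# One model per cylinder group, in a list named "4", "6", "8".
models <- lapply(split(mtcars, mtcars$cyl), function(d) lm(mpg ~ disp, data = d))

# Coefficients: one row per model, one column per coefficient.
coefs <- as.data.frame(t(sapply(models, coef)))
coefs$cyl <- as.numeric(rownames(coefs))
a_coefs <- merge(a, coefs, by = "cyl")

# Residuals in long format: stack() turns a named list of vectors into
# a two-column data frame (values, ind).
res_long <- stack(lapply(models, function(m) unname(residuals(m))))
names(res_long) <- c("residual", "cyl")
res_long$cyl <- as.numeric(as.character(res_long$cyl))
a_res <- merge(a, res_long, by = "cyl")
```

The models never need to live inside the data frame; the list names carry the grouping, and merge brings everything back together at the end.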

Applying a function to a dataframe to trim empty columns within a list environment R

I am a naive user of R and am attempting to come to terms with the 'apply' series of functions which I now need to use due to the complexity of the data sets.
I have a large, ragged data frame that I wish to reshape before conducting a sequence of regression analyses. It is further complicated by having interlaced rows of descriptive data (characters).
My approach to date has been to use a factor to split the data frame into sets with equal row lengths (i.e. a list), then attempt to remove the trailing empty columns, make two new, matching lists, one of data and one of chars and then use reshape to produce a common column number, then recombine the sets in each list. e.g. a simplified example:
myDF <- as.data.frame(rbind(c("v1",as.character(1:10)),
c("v1",letters[1:10]),
c("v2",c(as.character(1:6),rep("",4))),
c("v2",c(letters[1:6], rep("",4)))))
myDF[,1] <- as.factor(myDF[,1])
myList <- split(myDF, myDF[,1])
myList[[1]]
I can remove the empty columns for an individual set and can split the data frame into two sets from the interlacing rows but have been stumped with the syntax in writing a function to apply the following function to the list - though 'lapply' with 'seq_along' should do it?
Thus for the individual set:
DF <- myList[[2]]
DF <- DF[,!sapply(DF, function(x) all(x==""))]
DF
(from an earlier answer to a similar, but simpler example on this site). I have a large data set and would like an elegant solution (I could use a loop but that would not use the capabilities of R effectively). Once I have done that I ought to be able to use the same rationale to reshape the frames and then recombine them.
regards
jac
Try
lapply(split(myDF, myDF$V1), function(x) x[!colSums(x=='')])
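Note that the one-liner above drops a column if any of its cells is empty. If you only want to drop columns that are empty in every row of a set, as in the sapply version from the question, a sketch along the same lines (keep_nonempty is a helper name I made up):

```r
myDF <- as.data.frame(rbind(c("v1", as.character(1:10)),
                            c("v1", letters[1:10]),
                            c("v2", c(as.character(1:6), rep("", 4))),
                            c("v2", c(letters[1:6], rep("", 4)))))
myDF[, 1] <- as.factor(myDF[, 1])

# Keep a column unless every one of its cells is the empty string.
keep_nonempty <- function(d) {
  all_empty <- vapply(d, function(col) all(col == ""), logical(1))
  d[, !all_empty, drop = FALSE]
}

# Split by the grouping column, then trim each set independently.
trimmed <- lapply(split(myDF, myDF$V1), keep_nonempty)
```

For the example data both versions give the same result, because the empty cells in the v2 set fill whole columns.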

how to make groups of variables from a data frame in R?

Dear Friends I would appreciate if someone can help me in some question in R.
I have a data frame with 8 variables, let's say (v1,v2,...,v8). I would like to produce groups of datasets based on all possible combinations of these variables. That is, with a set of 8 variables I am able to produce 2^8-1=255 subsets of variables like {v1},{v2},...,{v8}, {v1,v2},....,{v1,v2,v3},....,{v1,v2,...,v8}
my goal is to produce specific statistic based on these groupings and then compare which subset produces a better statistic. my problem is how can I produce these combinations.
thanks in advance
You need the function combn. It creates all the combinations of a vector that you provide it. For instance, in your example:
names(yourdataframe) <- c("V1","V2","V3","V4","V5","V6","V7","V8")
varnames <- names(yourdataframe)
combn(x = varnames,m = 3)
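This gives you all combinations (not permutations, since order is ignored) of V1-V8 taken 3 at a time. To cover every subset size, repeat the call for m = 1 to 8.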
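To get every non-empty subset rather than only the triples, one option is to loop combn over all sizes. A sketch, where the data and the statistic (the mean of the row sums) are placeholders I made up for illustration:

```r
varnames <- paste0("V", 1:8)

# All subsets of size 1, 2, ..., 8, flattened into one list of name vectors.
subsets <- unlist(
  lapply(seq_along(varnames), function(m) combn(varnames, m, simplify = FALSE)),
  recursive = FALSE
)

# Hypothetical data and statistic; swap in your own.
set.seed(1)
yourdataframe <- as.data.frame(
  matrix(rnorm(20 * 8), ncol = 8, dimnames = list(NULL, varnames))
)
stat <- sapply(subsets, function(v) mean(rowSums(yourdataframe[v])))

# The subset with the largest value of the statistic.
best <- subsets[[which.max(stat)]]
```

With simplify = FALSE, combn returns a list of character vectors, which is easier to use for column subsetting than the default matrix.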
I'll use data.table instead of data.frame;
I'll include an extraneous variable for robustness.
This will get you your subsetted data frames:
library(data.table) #needed for as.data.table and setnames
nn<-8L
dt<-setnames(as.data.table(cbind(1:100,matrix(rnorm(100*nn),ncol=nn))),
c("id",paste0("V",1:nn)))
#should be a smarter (read: more easily generalized) way to produce this,
# but it's eluding me for now...
#basically, this generates the indices to include when subsetting
x<-cbind(rep(c(0,1),each=128),
rep(rep(c(0,1),each=64),2),
rep(rep(c(0,1),each=32),4),
rep(rep(c(0,1),each=16),8),
rep(rep(c(0,1),each=8),16),
rep(rep(c(0,1),each=4),32),
rep(rep(c(0,1),each=2),64),
rep(c(0,1),128)) *
t(matrix(rep(1:nn),2^nn,nrow=nn))
#now get the correct column names for each subset
# by subscripting the nonzero elements
incl<-lapply(1:(2^nn),function(y){paste0("V",1:nn)[x[y,][x[y,]!=0]]})
#now subset the data.table for each subset
ans<-lapply(1:(2^nn),function(y){dt[,incl[[y]],with=F]})
You said you wanted some statistics from each subset, in which case it may be more useful to instead specify the last line as:
ans2<-lapply(1:(2^nn),function(y){unlist(dt[,incl[[y]],with=F])})
#exclude the first row, which is null
means<-lapply(2:(2^nn),function(y){mean(ans2[[y]])})
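On the "smarter way" comment in the code above: expand.grid can generate the same set of inclusion patterns for any nn, although in a different row order. A base R sketch (vars and grid are names I chose):

```r
nn <- 8L
vars <- paste0("V", seq_len(nn))

# One row per subset, one logical column per variable: TRUE means "include".
grid <- expand.grid(rep(list(c(FALSE, TRUE)), nn))

# Column names for each subset; the first row is the empty subset.
incl <- lapply(seq_len(nrow(grid)), function(i) vars[unlist(grid[i, ])])
```

Each element of incl can then be used for column subsetting in the same way as in the hand-built version.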
