converting mixed dataframe with list to pure dataframe - r

What is the easiest way to extract information from a list embedded within a dataframe?
a<-data.frame(cyl=c(4,6,8),k=c("A","B","C"))
j<-by(data=mtcars,INDICES=mtcars$cyl,function(x) lm(mpg~disp,data=x))
a$l<-j
t(sapply(a$l,coef))->a$t
But this results in a matrix embedded within the dataframe and it needs some massaging in order to have it as two columns in a with their associated column names.
What I'd like is an easier method to extract this information and have it stored in dataframe a with the associated column names.
EDIT: This is what I had in mind, but I find the procedure somewhat cumbersome.
t(sapply(a$l,coef))->a$t
as.data.frame(a$t)->g
g$cyl<-as.numeric(rownames(g))
merge(x = a,y = g)->a2
a2[,-c(3,4)]->a3
Any simpler ways of doing this?
Now, to complicate matters: what if I'd like to get the residuals from a$l by cylinder?
sapply(a$l,function(x) x[['residuals']])->a$t
How can I generate a new dataframe in a long format with two columns: cyl and residual that later can be merged with the original dataframe a?

Well, see my previous edit for the first answer. This is for my second problem:
It does solve my problem, but I'm sure there must be a quicker and more intuitive way of solving this.
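flat.list.df<-function(list,sublist){
  nm<-names(list)
  # build one small data frame per list element, then stack them with rbind
  i<-do.call(rbind,lapply(nm,function(x){
    u<-list[[x]][[sublist]]   # e.g. the residuals of model x
    g<-length(u)
    j<-rep(x,g)               # repeat the element name once per value
    m<-data.frame(var=j,val=u)
    m
  }))
  return(i)
}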
flat.list.df(a$l,"residuals")->w
w
merge(w,a,by.x="var",by.y="cyl")
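A possibly shorter route, sketched in base R only. It assumes the models are kept in a plain named list rather than as a column of a; the names fits, coefs, long, intercept and slope are made up for this illustration and are not from the post above.

```r
# One regression per cylinder group, kept in a plain named list.
fits <- lapply(split(mtcars, mtcars$cyl), function(d) lm(mpg ~ disp, data = d))

a <- data.frame(cyl = c(4, 6, 8), k = c("A", "B", "C"))

# Coefficients: sapply() gives a 2 x 3 matrix, so transpose to one row per group.
coefs <- as.data.frame(t(sapply(fits, coef)))
names(coefs) <- c("intercept", "slope")   # hypothetical column names
coefs$cyl <- as.numeric(rownames(coefs))
a.coef <- merge(a, coefs, by = "cyl")

# Residuals in long format: repeat each group label once per residual.
res <- lapply(fits, residuals)
long <- data.frame(cyl = as.numeric(rep(names(res), lengths(res))),
                   residual = unlist(res, use.names = FALSE))
a.long <- merge(long, a, by = "cyl")
```

Keeping the models outside the data frame avoids the embedded matrix column altogether; whether that is simpler than the function above is a matter of taste.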

Related

Making list of duples, triples, etc. from a series of vectors (or a data.frame) in R

I hope I phrased the question right. I'm not even sure how to word it, which is probably part of why I'm having trouble finding the answer.
Consider a data.frame that has multiple string vectors. I would like to construct another variable that pair-wise combines the two vectors together, agnostic of their order.
For example, consider the following data.frame
df <- data.frame(var1 = c('string1', 'string2', 'string3'),
var2 = c('string3', 'string4', 'string1')
)
I'd like to have a variable that is identical for the first and 3rd element, like:
c('string1, string3', 'string2, string4', 'string1, string3')
I'm imagining that it might be best to make a variable/vector that's a list of the two component variables, but I'm obviously open to any solution. I tried to make a list variable that does what I want based on this question but with no luck:
Create a data.frame where a column is a list
If possible, I'd like to do this in a way that could extend to more than 2 columns and could efficiently run over millions of rows, especially if there is a data.table method.
Thanks for your help!
Edit: A crappy example of how I could do it with a for loop that doesn't quite work, but you get the idea:
for (i in 1:nrow(df)) {
df$var.new[i] <- paste(sort( c(df$var1[i], df$var2[i])))
}
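For what it's worth, the loop above seems to fall short mainly because paste() without collapse returns two strings per row. Here is a minimal sketch of two order-agnostic variants in base R, assuming the columns are character (hence stringsAsFactors = FALSE); the column names var.new and var.new2 are only for illustration. It has not been timed on millions of rows.

```r
df <- data.frame(var1 = c("string1", "string2", "string3"),
                 var2 = c("string3", "string4", "string1"),
                 stringsAsFactors = FALSE)

# Two columns: pmin()/pmax() pick the smaller/larger string element-wise,
# so the pair comes out in the same order regardless of which column held it.
df$var.new <- paste(pmin(df$var1, df$var2), pmax(df$var1, df$var2), sep = ", ")

# Any number of columns: sort each row, then collapse it into one string.
cols <- c("var1", "var2")
df$var.new2 <- apply(df[cols], 1, function(r) paste(sort(r), collapse = ", "))
```

The pmin()/pmax() version is vectorised and only handles two columns; the apply() version extends to more columns by adding names to cols, at the cost of a row-wise loop.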

R friendly way to convert lots of R data frame columns to lots of vectors

I looked at this solution: R-friendly way to convert R data.frame column to a vector?
but each solution seems to involve manually declaring the name of the vector being created.
I have a large dataframe with about 224 column names. I would like to break up the data frame and turn it into 224 different vectors which preserve their label without typing them all manually. Is there a way to step through the columns in the data frame and produce a vector which has the same name as the column or am I dreaming?
I think it's a bad idea but this would work (using mtcars data set):
list2env(mtcars, .GlobalEnv)
attach is another dangerous command that people use to be able to access the columns of a data frame directly with their names. If you don't know why it's dangerous, though, don't do it.
Here's another bad idea:
for(i in names(mtcars)) assign(i, mtcars[,i])
Just for Richard:
for (x in names(mtcars))
eval(parse(text=paste(x, '<- c(', paste(mtcars[[x]], collapse=',') ,')')))
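If the goal is only to refer to columns by name, a less invasive sketch is to leave the global environment alone. The name env below is made up for illustration.

```r
# Put the columns into a fresh environment instead of .GlobalEnv.
env <- list2env(mtcars, envir = new.env())
mpg.mean <- mean(env$mpg)

# Or evaluate an expression with the columns visible, creating nothing.
mpg.mean2 <- with(mtcars, mean(mpg))
```

Either way the 224 names never land in the workspace, so nothing there gets overwritten.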

how to make groups of variables from a data frame in R?

Dear friends, I would appreciate it if someone could help me with a question in R.
I have a data frame with 8 variables, let's say (v1,v2,...,v8). I would like to produce groups of datasets based on all possible combinations of these variables. That is, with a set of 8 variables I am able to produce 2^8-1=255 subsets of variables like {v1},{v2},...,{v8}, {v1,v2},....,{v1,v2,v3},....,{v1,v2,...,v8}
My goal is to produce a specific statistic based on these groupings and then compare which subset produces a better statistic. My problem is how to produce these combinations.
Thanks in advance.
You need the function combn. It creates all the combinations of a vector that you provide it. For instance, in your example:
names(yourdataframe) <- c("V1","V2","V3","V4","V5","V6","V7","V8")
varnames <- names(yourdataframe)
combn(x = varnames,m = 3)
This gives you all combinations of V1-V8 taken 3 at a time, one combination per column of the result.
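To cover every subset size rather than just 3, combn can be called once per size. A minimal sketch, where the names triples and subsets are only for illustration:

```r
varnames <- paste0("V", 1:8)

# All subsets of size 3: a 3 x 56 character matrix, one subset per column.
triples <- combn(varnames, 3)

# All non-empty subsets: call combn() for sizes 1 to 8 and flatten the
# per-size lists into a single list of character vectors.
subsets <- unlist(lapply(seq_along(varnames),
                         function(m) combn(varnames, m, simplify = FALSE)),
                  recursive = FALSE)
length(subsets)   # 2^8 - 1 = 255
```

Each element of subsets can then be used to index the data frame, for example yourdataframe[subsets[[10]]].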
I'll use data.table instead of data.frame;
I'll include an extraneous variable for robustness.
This will get you your subsetted data frames:
library(data.table)
nn<-8L
dt<-setnames(as.data.table(cbind(1:100,matrix(rnorm(100*nn),ncol=nn))),
c("id",paste0("V",1:nn)))
#should be a smarter (read: more easily generalized) way to produce this,
# but it's eluding me for now...
#basically, this generates the indices to include when subsetting
x<-cbind(rep(c(0,1),each=128),
rep(rep(c(0,1),each=64),2),
rep(rep(c(0,1),each=32),4),
rep(rep(c(0,1),each=16),8),
rep(rep(c(0,1),each=8),16),
rep(rep(c(0,1),each=4),32),
rep(rep(c(0,1),each=2),64),
rep(c(0,1),128)) *
t(matrix(rep(1:nn,2^nn),nrow=nn))
#now get the correct column names for each subset
# by subscripting the nonzero elements
incl<-lapply(1:(2^nn),function(y){paste0("V",1:nn)[x[y,][x[y,]!=0]]})
#now subset the data.table for each subset
ans<-lapply(1:(2^nn),function(y){dt[,incl[[y]],with=FALSE]})
You said you wanted some statistics from each subset, in which case it may be more useful to instead specify the last line as:
ans2<-lapply(1:(2^nn),function(y){unlist(dt[,incl[[y]],with=FALSE])})
#exclude the first row, which is null
means<-lapply(2:(2^nn),function(y){mean(ans2[[y]])})
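One candidate for the more easily generalized way of producing the indices mentioned in the comments above is expand.grid. This is only a sketch in base R: it uses a plain data frame called dat with made-up random data, and the subsets come out in a different order than in the code above.

```r
nn <- 8L
vars <- paste0("V", 1:nn)

# One row per subset: TRUE means "include this variable". 2^nn rows in total.
ind <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), nn)))

# Translate each row into variable names and drop the empty subset.
incl <- lapply(seq_len(nrow(ind)), function(i) vars[ind[i, ]])
incl <- incl[lengths(incl) > 0]

# Illustrative data and one statistic (the pooled mean) per subset.
set.seed(1)
dat <- as.data.frame(matrix(rnorm(100 * nn), ncol = nn,
                            dimnames = list(NULL, vars)))
means <- vapply(incl, function(v) mean(unlist(dat[v])), numeric(1))
```

Changing nn is enough to handle a different number of variables, since nothing is hard-coded to 8.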

Specifying names of columns to be used in a loop R

I have a df with over 30 columns and over 200 rows, but for simplicity will use an example with 8 columns.
X1<-c(sample(100,25))
B<-c(sample(4,25,replace=TRUE))
C<-c(sample(2,25,replace =TRUE))
Y1<-c(sample(100,25))
Y2<-c(sample(100,25))
Y3<-c(sample(100,25))
Y4<-c(sample(100,25))
Y5<-c(sample(100,25))
df<-cbind(X1,B,C,Y1,Y2,Y3,Y4,Y5)
df<-as.data.frame(df)
I wrote a function that melts the data and generates a plot with X1 giving the x-axis values and faceted using the values in B and C.
library(reshape2)  # melt(); assumed here, the reshape package also provides it
library(ggplot2)   # ggplot(), geom_point(), facet_grid(), ggsave()
plotdata<-function(l){
  melt<-melt(df,id.vars=c("X1","B","C"),measure.vars=l)
  plot<-ggplot(melt,aes(x=X1,y=value))+geom_point()
  plot2<-plot+facet_grid(B ~ C)
  ggsave(filename=paste("X_vs_",l,"_faceted.jpeg",sep=""),plot=plot2)
}
I can then manually input the required Y variable
plotdata("Y1")
I don't want to generate plots for all columns. I could just type the column of interest into plotdata and then get the result, but this seems quite inelegant (and time consuming). I would prefer to be able to manually specify the columns of interest e.g. "Y1","Y3","Y4" and then write a loop function to do all those specified.
However, I am new to writing for loops and can't find a way to loop over the specific column names that are required for my function to work. A standard for(i in 1:length(df)) wouldn't be appropriate because I only want to loop over the user-specified columns.
Apologies if there is already an answer to this on Stack Overflow. I couldn't find it if there was.
Thanks to Roland for providing the following answer:
Try
for (x in c("Y1","Y3","Y4")) {plotdata(x)}
The index variable doesn't have to be numeric
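The same pattern, sketched with a stand-in function so it runs without ggplot2. Here summarise_col is a hypothetical placeholder for plotdata, and the data is made up.

```r
set.seed(1)
df <- data.frame(X1 = sample(100, 25), Y1 = sample(100, 25),
                 Y3 = sample(100, 25), Y4 = sample(100, 25))

# Hypothetical stand-in for plotdata(): takes a column name as a string.
summarise_col <- function(l) mean(df[[l]])

cols <- c("Y1", "Y3", "Y4")   # the user-specified columns

# for() walks over the character vector itself, not over positions.
results <- list()
for (x in cols) {
  results[[x]] <- summarise_col(x)
}

# The same thing without an explicit loop.
results2 <- sapply(cols, summarise_col)
```

With the real function, for (x in cols) plotdata(x) or invisible(lapply(cols, plotdata)) would follow the same shape.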

How do you select multiple variables from a matrix using a randomly selected vector of column indices?

Hopefully this has an easy answer I just haven't been able to find:
I am trying to write a simulation that will compare a number of statistical procedures on different subsets of rows (subjects) and columns (variables) of a large matrix.
Subsets of rows was fairly easy using a sample() of the subject ID numbers, but I am running into a little more trouble with columns.
Essentially, what I'd like to be able to do is create a random sample of column index numbers which will then be used to create a new matrix. What's got me the closest so far is:
testmat <- matrix(rnorm(100000),nrow=1000,ncol=100)
column.ind <- sample(3:100,20)
teststr <- paste("testmat[,",column.ind,"]",sep="",collapse=",")
which gives me a string that has a testmat[,column.ind] for every sampled index number. Is there any way to easily plug that into a cbind() function to make a new matrix? Is there any other obvious way I'm missing?
I've been able to do it using a loop (i.e. cbind(matrix,newcolumn) over and over), but that's fairly slow as the matrix I'm using is quite large and I will be doing this many times. I'm hoping there's a couple-line solution that's more elegant and quicker.
Have you tried testmat[, column.ind]?
Rows and columns can be indexed in the same way with logical vectors, a set of names, or numbers for indexes.
See here for an example: http://ideone.com/EtuUN.
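A short sketch of that indexing in action, with made-up data. The names submat, row.ind and sub2 are only for illustration.

```r
set.seed(42)
testmat <- matrix(rnorm(100000), nrow = 1000, ncol = 100)

# Sample 20 column indices (columns 1 and 2 are left out, as in the question).
column.ind <- sample(3:100, 20)

# Index with the vector directly; no paste() or cbind() needed.
submat <- testmat[, column.ind]

# Rows and columns can be subset in one step the same way.
row.ind <- sample(nrow(testmat), 50)
sub2 <- testmat[row.ind, column.ind]
```

The columns of submat appear in the order given by column.ind, so sort(column.ind) can be used if the original column order matters.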
