Function over vectors collected in a list in R - r

I have looked long and hard for a solution to the following problem, but I couldn't find it. I apologize in advance if this is a duplicate, and I will delete this question if you direct me to an answer.
I have a list (Mylist) where each element holds many different fields. I'm interested in the numeric vector called "coefficients". I can thus select the coefficients of the i-th element of the list as
Mylist[[i]]$coefficients
but how do I get the average of coefficients over all i? The average is just meant as an example. What I'm generally interested in is how to compute a function over a list where each field of the list holds more than one data.frame/matrix/string etc.
UPDATE: As kindly supplied by Thomas below, here are some fake data for the problem:
Mylist <- replicate(10,data.frame(coefficients=rnorm(20),
something=rnorm(20)), simplify=FALSE)
I have tried looking at lapply, but since Mylist has other fields than coefficients I don't see how to do it.
Thanks!

You might need to provide more details on the exact structure of your data, but here's a simple example:
# some fake data:
mylist <- replicate(10,data.frame(coefficients=rnorm(20),
something=rnorm(20)), simplify=FALSE)
# take the grand mean:
mean(sapply(mylist,function(x) x$coefficients))
But perhaps you want the mean for each set of corresponding coefficients across all the list entries, which you could get with something like either of the following (which are identical):
colMeans(do.call(rbind,lapply(mylist,function(x) x$coefficients)))
rowMeans(do.call(cbind,lapply(mylist,function(x) x$coefficients)))
Which @SimonO101 rightly points out simplifies to:
rowMeans(sapply(mylist, function(x) x$coefficients))
because sapply is just a wrapper for lapply that does the simplification for you.
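For the more general question (computing some function over one field of every list element), a minimal sketch is to pull the field out once into a matrix and then apply whatever function you need. The names coef_mat, per_element, per_position and grand are only illustrative, and the example assumes every coefficients vector has the same length:

```r
# Fake data in the same shape as in the question
set.seed(1)
Mylist <- replicate(10, data.frame(coefficients = rnorm(20),
                                   something = rnorm(20)), simplify = FALSE)

# Extract the one field of interest from every element.
# `[[` with a name works for data frames and plain lists alike.
coef_mat <- sapply(Mylist, `[[`, "coefficients")  # 20 x 10 matrix, one column per element

per_element  <- colMeans(coef_mat)      # one mean per list element
per_position <- rowMeans(coef_mat)      # mean of each coefficient across elements
grand        <- mean(coef_mat)          # single overall mean

# Any other function can be swapped in, e.g. the max of each coefficient
per_position_max <- apply(coef_mat, 1, max)
```

If the vectors have different lengths, sapply returns a list instead of a matrix, so lapply followed by your own combining step is the safer choice in that case.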

If you want the mean for all coefficients across all lists try...
mean( unlist( sapply( Mylist , function(x) `[`(x , 'coefficients') ) ) )
However, you should clarify what you want because it is unclear if you want...
# A mean for each set of coefficients
sapply( Mylist , function(x) mean( x$coefficients ) )
# The mean for each coefficient across all lists
rowMeans( sapply( Mylist , function(x) x$coefficients ) )

Related

Conditional frequencies calculation for more than one variable

I want to calculate conditional probabilities in my data. Therefore I coded the following:
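library(dplyr)  # provides glimpse()
creditrisks <- read.table("kredit.asc", header=TRUE)
glimpse(creditrisks)
# refer to the columns through the data frame, since it is not attached yet
creditrisks$moral1 <- as.integer(creditrisks$moral>1)
creditrisks$konto1 <- as.integer(creditrisks$laufkont==1)
creditrisks$konto2 <- as.integer(creditrisks$laufkont==4)
creditrisks$zweck <- as.integer(0<creditrisks$verw & creditrisks$verw<9)
attach(creditrisks)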
prop.table(table(kredit,konto1),2)
prop.table(table(kredit,konto2),2)
prop.table(table(kredit,moral1),2)
prop.table(table(kredit,zweck),2)
The results look like this:
This works well for me, the only thing I want to change is that I can calculate all conditional frequencies at once, so the table should look like this:
With cbind I lose all the variable names, so I'm searching for a more elegant way.
The dataset can be found here: dataset
Thanks for your help!
Try this.
lapply(creditrisks[, c("konto1", "konto2", "moral1", "zweck")],
  function(x) prop.table(table(creditrisks$kredit, x), 2)
)
You can also cbind them together by
do.call(cbind,
  lapply(
    creditrisks[, c("konto1", "konto2", "moral1", "zweck")],
    function(x) prop.table(table(creditrisks$kredit, x), 2)
  )
)
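Since cbind on a list of tables keeps only the level labels ("0", "1") as column names, one possible way to keep the variable names is to relabel each table's columns before binding. This is only a sketch: the data below are made up (the real kredit.asc file is not used), and the "variable=level" naming scheme is an arbitrary choice.

```r
# Made-up stand-in for the real data: four binary predictors and the outcome kredit
creditrisks <- data.frame(
  kredit = rep(c(0, 1, 1, 1), 10),
  konto1 = rep(c(0, 1), each = 20),
  konto2 = rep(c(0, 1), 20),
  moral1 = rep(c(1, 0, 0, 1, 1), 8),
  zweck  = rep(c(0, 0, 1, 1), 10)
)

vars <- c("konto1", "konto2", "moral1", "zweck")

tabs <- lapply(vars, function(v) {
  tab <- prop.table(table(creditrisks$kredit, creditrisks[[v]]), 2)
  # prefix each column with the variable it came from, e.g. "konto1=0"
  colnames(tab) <- paste(v, colnames(tab), sep = "=")
  tab
})

combined <- do.call(cbind, tabs)
combined
```

Each column of combined is still a conditional distribution of kredit, so every column sums to 1.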

Understanding Vectorized Code In R

I'm trying to understand the answer to this question using R and I'm struggling a lot.
The dataset for the R code can be found with this code
library(devtools)
install_github("genomicsclass/GSE5859Subset")
library(GSE5859Subset)
data(GSE5859Subset) ##this loads the three tables you need
Here is the question
Write a function that takes a vector of values e and a binary vector group coding two groups, and returns the p-value from a t-test: t.test( e[group==1], e[group==0])$p.value.
Now define g to code cases (1) and controls (0) like this g <- factor(sampleInfo$group)
Next use the function apply to run a t-test for each row of geneExpression and obtain the p-value. What is smallest p-value among all these t-tests?
The answer provided is
myttest <- function(e,group){
x <- e[group==1]
y <- e[group==0]
return( t.test(x,y)$p.value )
}
g <- factor(sampleInfo$group)
pvals <- apply(geneExpression,1,myttest, group=g)
min( pvals )
Which gives you the answer of 1.406803e-21.
What exactly is the input of the "e" argument of the myttest function when you run this? Is it possible to write this function as a formula like
t.test(DV ~ sampleInfo$group)
The t test is comparing the gene expression values of the 24 people (the values of which I believe are in the "geneExpression" matrix) by the group they were in, which you can find in sampleInfo's "group" column. I've run t tests so many times in R, but for some reason I can't wrap my mind around what's going on in this code.
Your question seems to be about understanding the function apply().
For the technical description, see ?apply.
My quick explanation: the apply() line of code in your question applies the following function to each of the rows of geneExpression
myttest(e=x, group=g)
where x is a placeholder for each row.
To help make sense of it, a for loop version of that apply() line would look something like:
N <- nrow(geneExpression) #so we don't have to type this twice
pvals <- numeric(N) #empty vector to store results
# what 'apply' does (with less typing from us)
for(i in 1:N) {
  # e is row i of the matrix; the whole grouping vector g is passed every time
  pvals[i] <- myttest(geneExpression[i,], group=g)
}
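On the formula part of the question: a formula version should give the same p-values, as long as the response is one row at a time rather than the whole matrix. Here is a small sketch on made-up data (the matrix expr and its dimensions are invented for illustration; they stand in for geneExpression and sampleInfo$group):

```r
set.seed(42)
expr <- matrix(rnorm(5 * 12), nrow = 5)   # 5 hypothetical genes, 12 samples
g <- factor(rep(c(0, 1), each = 6))       # controls (0) and cases (1)

myttest <- function(e, group) {
  # e is a plain numeric vector: one row of the matrix
  t.test(e[group == 1], e[group == 0])$p.value
}

# apply() hands each row to myttest as e
pvals <- apply(expr, 1, myttest, group = g)

# the same test written with a formula, still one row at a time
pvals_formula <- apply(expr, 1, function(e) t.test(e ~ g)$p.value)

all.equal(pvals, pvals_formula)
```

The formula version compares the groups in the order of the factor levels (0 first), so the sign of the t statistic flips, but the two-sided p-value is unchanged.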

apply fisher test in a large dataset that join all contingency tables

I have a dataset like this:
contingency_table<-tibble::tibble(
x1_not_happy = c(1,4),
x1_happy = c(19,31),
x2_not_happy = c(1,4),
x2_happy= c(19,28),
x3_not_happy=c(14,21),
X3_happy=c(0,9),
x4_not_happy=c(3,13),
X4_happy=c(17,22)
)
In fact, there are many other variables that come from a poll applied in two different years.
Then, I apply a Fisher test in each 2X2 contingency matrix, using this code:
matrix1_prueba <- contingency_table[1:2,1:2]
matrix2_prueba<- contingency_table[1:2,3:4]
fisher1<-fisher.test(matrix1_prueba,alternative="two.sided",conf.level=0.9)
fisher2<-fisher.test(matrix2_prueba,alternative="two.sided",conf.level=0.9)
I would like to run this task with shorter code, by means of a function or a loop. The output must be a vector with the p-values for each question.
Thanks,
Frederick
So this was a bit of fun to do. The main thing that you need to recognize is that you want combinations of your data. There are a number of functions in R that can do that for you. The main workhorse is combn().
So in the language of the problem, we want all combinations of the columns of your tibble taken 2 at a time.
From there, you just need to do some looping structure to get your tests to work, and extract the p-values from the object.
list_tables <- lapply(combn(contingency_table,2,simplify=F), fisher.test)
unlist(lapply(list_tables, `[`, 'p.value'))
This should produce your answer.
EDIT
Given the updated requirement to test only adjacent data.frame columns, the following modifications should work.
full_list <- combn(contingency_table,2,simplify=F)
full_list <- full_list[sapply(
full_list, function(x) all(startsWith(names(x), substr(names(x)[1], 1,2))))]
full_list <- lapply(full_list, fisher.test)
unlist(lapply(full_list, `[`, 'p.value'))
This is approximately the same code as before, but now we have to find the subsets of the data that share the same question prefix; the sapply call does just that. This only works if the prefixes are exactly the same (X3 != x3). I think this is a better solution than working with column indexes, because it does not rely on the two columns of a question always being next to one another. The final output should be what you need for the problem.
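If the mixed-case prefixes (X3 vs x3) are a problem, one alternative is to group the columns by a lower-cased prefix directly instead of filtering all pairs. A rough sketch, assuming every question has exactly two columns named prefix_something, and using a plain data.frame in place of the tibble to avoid extra packages:

```r
contingency_table <- data.frame(
  x1_not_happy = c(1, 4),   x1_happy = c(19, 31),
  x2_not_happy = c(1, 4),   x2_happy = c(19, 28),
  x3_not_happy = c(14, 21), X3_happy = c(0, 9),
  x4_not_happy = c(3, 13),  X4_happy = c(17, 22)
)

# question id = everything before the first underscore, case-insensitive
question <- tolower(sub("_.*$", "", names(contingency_table)))

# split the columns (not the rows) into one 2-column data frame per question
tables <- split.default(contingency_table, question)

# one Fisher test per question, keeping only the p-value
p_values <- sapply(tables, function(tab) {
  fisher.test(as.matrix(tab), alternative = "two.sided")$p.value
})
p_values
```

The result is a named numeric vector with one p-value per question.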

How to repeat a simple command on multiple objects?

I have 20 unique linear models created from 1 dataset. Each one was created by:
mymodel1 <- lm(y ~ x1 + etc, data=mydata)
Now all I want to do is create a list of the output of a command on all 20 models, e.g. something like:
summary(mymodel[i])$adj
for i=1,2,...,20
It's probably obvious, but I'm not finding anything on this.
Is this the best way to act on 20 variable names that change by a positive integer?
for (i in 1:20) print(somefunction(eval(parse(text=paste0("model", i))))$adj)
This should return a character vector with the names of the objects in your workspace that inherit from class 'lm':
lm.names <- ls()[ sapply( ls(), function(x) 'lm' %in% class(get(x) ))]
This will return a list of summary items from all of them.
sapply( lm.names, function(x) summary( get(x) ) )
Notice the use of get (twice). The ls function returns the names of objects, not the objects themselves and not true R names, but a character vector. You might want to look carefully at the "Value" section of ?summary.lm, because the result is a list and perhaps you only want a few items from that list.
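If you can, it is usually easier to keep the models in a list from the start and pull out just the one item you need. A small sketch using the built-in mtcars data; the model formulas here are invented purely for illustration:

```r
# Three example models stored in a named list instead of mymodel1, mymodel2, ...
models <- list(
  m1 = lm(mpg ~ wt, data = mtcars),
  m2 = lm(mpg ~ wt + hp, data = mtcars),
  m3 = lm(mpg ~ wt + hp + qsec, data = mtcars)
)

# adj.r.squared is one of the components documented in ?summary.lm
adj_r2 <- sapply(models, function(m) summary(m)$adj.r.squared)
adj_r2
```

For models that already exist as separate objects, mget(paste0("mymodel", 1:20)) should collect them into a list of the same shape.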

What's the shortest way of creating a load of R objects with consecutive names?

This is what I've got at the moment:
weights0 <- array(dim=c(nrow(ind),nrow(all.msim)))
weights1 <- array(dim=c(nrow(ind),nrow(all.msim)))
weights2 <- array(dim=c(nrow(ind),nrow(all.msim)))
weights3 <- array(dim=c(nrow(ind),nrow(all.msim)))
weights4 <- array(dim=c(nrow(ind),nrow(all.msim)))
weights5 <- array(dim=c(nrow(ind),nrow(all.msim)))
weights0 <- 1 # sets initial weights to 1
Nice and clear, but not nice and short!
Would experienced R programmers write this in a different way?
EDIT:
Also, is there an established way of creating a number of weights that depends on a pre-existing variable to make this generalisable? For example, the parameter num.cons would equal 5: the number of constraints (and hence weights) that we need. Imagine this is a common programming problem, so sure there is a solution.
Option 1
If you want to create the different elements in your environment, you can do it with a for loop and assign. Other options are sapply and the envir argument of assign
for (i in 0:5)
assign(paste0("weights", i), array(dim=c(nrow(ind),nrow(all.msim))))
Option 2
However, as @Axolotl9250 points out, depending on your application, more often than not it makes sense to have these all in a single list
weights <- lapply(rep(NA, 6), array, dim=c(nrow(ind),nrow(all.msim)))
Then to assign to weights0 as you have above, you would use
weights[[1]][ ] <- 1
Note the empty [ ], which is important to assign to ALL elements of weights[[1]] instead of replacing the array with a single number.
Option 3
As per @flodel's suggestion, if all of your arrays are of the same dim, you can create one big array with an extra dim of length equal to the number of objects you have (i.e., 6).
weights <- array(dim=c(nrow(ind),nrow(all.msim), 6))
Note that for any of the options:
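If you want to assign to all elements of an existing array, you have to index it rather than overwrite the object; an empty [ ] selects every element. In option 3, the index [,,1] already selects the whole first slice, so to assign to the 1st array you can use:
weights[,,1] <- 1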
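To make the list option depend on num.cons, as asked in the edit to the question, something along these lines might work. The dimensions n_ind and n_zones are hypothetical stand-ins for nrow(ind) and nrow(all.msim), which are not available here:

```r
num.cons <- 5   # number of constraints, as in the question
n_ind    <- 4   # hypothetical stand-in for nrow(ind)
n_zones  <- 3   # hypothetical stand-in for nrow(all.msim)

# one empty array per weight, weights0 ... weights<num.cons>
weights <- replicate(num.cons + 1,
                     array(NA_real_, dim = c(n_ind, n_zones)),
                     simplify = FALSE)
names(weights) <- paste0("weights", 0:num.cons)

# set the initial weights to 1; the empty [] keeps the array shape
weights[["weights0"]][] <- 1
```

Naming the elements means weights$weights3 can still be used where weights3 was used before.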
I've just tried to have a go at achieving this but with no joy, maybe someone else is better than I (most likely!!). However I can't help but feel maybe it's easier to have all the arrays in a single object, a list; that way a single lapply line would do, and instead of referring to weights1 weights2 weights3 weights4 it would be weights[[1]] weights[[2]] weights[[3]] weights[[4]]. Future operations on those arrays would then also be achieved by the apply family of functions. Sorry I can't get it exactly as you describe.
Given what you're doing, just using a for loop is quick and intuitive.
# create a character vector containing all the variable names you want..
variable.names <- paste0( 'weights' , 0:5 )
# look at it.
variable.names
# create the value to provide _each_ of those variable names
variable.value <- array( dim=c( nrow(ind) , nrow(all.msim) ) )
# assign them all
for ( i in variable.names ) assign( i , variable.value )
# look at what's now in memory
ls()
# look at any of them
weights4
