I'm trying to understand the answer to this question using R and I'm struggling a lot.
The dataset for the R code can be found with this code
library(devtools)
install_github("genomicsclass/GSE5859Subset")
library(GSE5859Subset)
data(GSE5859Subset) ##this loads the three tables you need
Here is the question
Write a function that takes a vector of values e and a binary vector group coding two groups, and returns the p-value from a t-test: t.test( e[group==1], e[group==0])$p.value.
Now define g to code cases (1) and controls (0) like this g <- factor(sampleInfo$group)
Next use the function apply to run a t-test for each row of geneExpression and obtain the p-value. What is smallest p-value among all these t-tests?
The answer provided is
myttest <- function(e,group){
x <- e[group==1]
y <- e[group==0]
return( t.test(x,y)$p.value )
}
g <- factor(sampleInfo$group)
pvals <- apply(geneExpression,1,myttest, group=g)
min( pvals )
Which gives you the answer of 1.406803e-21.
What exactly is the input of the "e" argument of the myttest function when you run this? Is it possible to write this function as a formula like
t.test(DV ~ sampleInfo$group)
The t test is comparing the gene expression values of the 24 people (the values of which I believe are in the "geneExpression" matrix) by what group they were
in which you can find in sampleInfo's "group" column. I've run t tests so many times in R, but for some reason I can't wrap my mind around what's going on in this code.
You question seems to be about understanding the function apply().
For the technical description, see ?apply.
My quick explanation: the apply() line of code in your question applies the following function to each of the rows of geneExpression
myttest(e=x, group=g)
where x is a placeholder for each row.
To help make sense of it, a for loop version of that apply() line would look something like:
N <- nrows(geneExpression) #so we don't have to type this twice
pvals <- numeric(N) #empty vector to store results
# what 'apply' does (but it does it very quickly and with less typing from us)
for(i in 1:N) {
pvals[i] <- myttest(geneExpression[i,], group=g[i])
}
Related
I have a dataset "res.sav" that I read in via haven. It contains 20 columns, called "Genes1_Acc4", "Genes2_Acc4" etc. I am trying to find a correlation coefficient between those and another column called "Condition". I want to separately list all coefficients.
I created two functions, cor.condition.cols and cor.func to do that. The first iterates through the filenames and works just fine. The second was supposed to give me my correlations which didn't work at all. I also created a new "cor.condition.Genes" which I would like to fill with the correlations, ideally as a matrix or dataframe.
I have tried to iterate through the columns with two functions. However, when I try to pass it, I get the error: "NAs introduced by conversion". This wouldn't be the end of the world (I tried also suppressWarning()). But the bigger problem I have that it seems like my function does not convert said columns into the numeric type I need for my cor() function. I receive the "y must be numeric" error when trying to run the cor() function. I tried to put several arguments within and without '' or "" without success.
When I ran str(cor.condition.cols) I only receive character strings, which makes me think that my function somehow messes up with the as.numeric function. Any suggestions of how else I could iter through these columns and transfer them?
Thanks guys :)
cor.condition.cols <- lapply(1:20, function(x){paste0("res$Genes", x, "_Acc4")})
#save acc_4 columns as numeric columns and calculate correlations
res <- (as.numeric("cor.condition.cols"))
cor.func <- function(x){
cor(res$Condition, x, use="complete.obs", method="pearson")
}
cor.condition.Genes <- cor.func(cor.condition.cols)
You can do:
cor.condition.cols <- paste0("Genes", 1:20, "_Acc4")
res2 <- as.numeric(as.matrix(res[cor.condition.cols]))
cor.condition.Genes <- cor(res2, res$Condition, use="complete.obs", method="pearson")
eventually the short variant:
cor.condition.cols <- paste0("Genes", 1:20, "_Acc4")
cor.condition.Genes <- cor(res[cor.condition.cols], res$Condition, use="complete.obs")
Here is an example with other data:
cor(iris[-(4:5)], iris[[4]])
I have a dataset like this:
contingency_table<-tibble::tibble(
x1_not_happy = c(1,4),
x1_happy = c(19,31),
x2_not_happy = c(1,4),
x2_happy= c(19,28),
x3_not_happy=c(14,21),
X3_happy=c(0,9),
x4_not_happy=c(3,13),
X4_happy=c(17,22)
)
in fact, there are many other variables that come from a poll aplied in two different years.
Then, I apply a Fisher test in each 2X2 contingency matrix, using this code:
matrix1_prueba <- contingency_table[1:2,1:2]
matrix2_prueba<- contingency_table[1:2,3:4]
fisher1<-fisher.test(matrix1_prueba,alternative="two.sided",conf.level=0.9)
fisher2<-fisher.test(matrix2_prueba,alternative="two.sided",conf.level=0.9)
I would like to run this task using a short code by mean of a function or a loop. The output must be a vector with the p_values of each questions.
Thanks,
Frederick
So this was a bit of fun to do. The main thing that you need to recognize is that you want combinations of your data. There are a number of functions in R that can do that for you. The main workhorse is combn() Link
So in the language of the problem, we want all combinations of your tibble taken 2 at a time link2
From there, you just need to do some looping structure to get your tests to work, and extract the p-values from the object.
list_tables <- lapply(combn(contingency_table,2,simplify=F), fisher.test)
unlist(lapply(list_tables, `[`, 'p.value'))
This should produce your answer.
EDIT
Given the updated requirements for just adjacement data.frame columns, the following modifications should work.
full_list <- combn(contingency_table,2,simplify=F)
full_list <- full_list[sapply(
full_list, function(x) all(startsWith(names(x), substr(names(x)[1], 1,2))))]
full_list <- lapply(full_list, fisher.test)
unlist(lapply(full_list, `[`, 'p.value'))
This is approximately the same code as before, but now we have to find the subsets of the data that have the same question prefix name. This only works if the prefixes are exactly the same (X3 != x3). I think this is a better solution than trying to work with column indexes, and without the guarantee of always being next to one another. The sapply code does just that. The final output should be what you need for the problem.
I am normally a maple user currently working with R, and I have a problem with correctly indexing variables.
Say I want to define 2 vectors, v1 and v2, and I want to call the nth element in v1. In maple this is easily done:
v[1]:=some vector,
and the nth element is then called by the command
v[1][n].
How can this be done in R? The actual problem is as follows:
I have a sequence M (say of length 10, indexed by k) of simulated negbin variables. For each of these simulated variables I want to construct a vector X of length M[k] with entries given by some formula. So I should end up with 10 different vectors, each of different length. My incorrect code looks like this
sims<-10
M<-rnegbin(sims, eks_2016_kasko*exp(-2.17173), 840.1746)
for(k in 1:sims){
x[k]<-rep(NA,M[k])
X[k]<-rep(NA,M[k])
for(i in 1:M[k]){x[k][i]<-runif(1,min=0,max=1)
if(x[k][i]>=0 & x[i]<=0.1056379){
X[k][i]<-rlnorm(1, 6.228244, 0.3565041)}
else{
X[k][i]<-rlnorm(1, 8.910837, 1.1890874)
}
}
}
The error appears to be that x[k] is not a valid name for a variable. Any way to make this work?
Thanks a lot :)
I've edited your R script slightly to get it working and make it reproducible. To do this I had to assume that eks_2016_kasko was an integer value of 10.
require(MASS)
sims<-10
# Because you R is not zero indexed add one
M<-rnegbin(sims, 10*exp(-2.17173), 840.1746) + 1
# Create a list
x <- list()
X <- list()
for(k in 1:sims){
x[[k]]<-rep(NA,M[k])
X[[k]]<-rep(NA,M[k])
for(i in 1:M[k]){
x[[k]][i]<-runif(1,min=0,max=1)
if(x[[k]][i]>=0 & x[[k]][i]<=0.1056379){
X[[k]][i]<-rlnorm(1, 6.228244, 0.3565041)}
else{
X[[k]][i]<-rlnorm(1, 8.910837, 1.1890874)
}
}
This will work and I think is what you were trying to do, BUT is not great R code. I strongly recommend using the lapply family instead of for loops, learning to use data.table and parallelisation if you need to get things to scale. Additionally if you want to read more about indexing in R and subsetting Hadley Wickham has a comprehensive break down here.
Hope this helps!
Let me start with a few remarks and then show you, how your problem can be solved using R.
In R, there is most of the time no need to use a for loop in order to assign several values to a vector. So, for example, to fill a vector of length 100 with uniformly distributed random variables, you do something like:
set.seed(1234)
x1 <- rep(NA, 100)
for (i in 1:100) {
x1[i] <- runif(1, 0, 1)
}
(set.seed() is used to set the random seed, such that you get the same result each time.) It is much simpler (and also much faster) to do this instead:
x2 <- runif(100, 0, 1)
identical(x1, x2)
## [1] TRUE
As you see, results are identical.
The reason that x[k]<-rep(NA,M[k]) does not work is that indeed x[k] is not a valid variable name in R. [ is used for indexing, so x[k] extracts the element k from a vector x. Since you try to assign a vector of length larger than 1 to a single element, you get an error. What you probably want to use is a list, as you will see in the example below.
So here comes the code that I would use instead of what you proposed in your post. Note that I am not sure that I correctly understood what you intend to do, so I will also describe below what the code does. Let me know if this fits your intentions.
# define M
library(MASS)
eks_2016_kasko <- 486689.1
sims<-10
M<-rnegbin(sims, eks_2016_kasko*exp(-2.17173), 840.1746)
# define the function that calculates X for a single value from M
calculate_X <- function(m) {
x <- runif(m, min=0,max=1)
X <- ifelse(x > 0.1056379, rlnorm(m, 6.228244, 0.3565041),
rlnorm(m, 8.910837, 1.1890874))
}
# apply that function to each element of M
X <- lapply(M, calculate_X)
As you can see, there are no loops in that solution. I'll start to explain at the end:
lapply is used to apply a function (calculate_X) to each element of a list or vector (here it is the vector M). It returns a list. So, you can get, e.g. the third of the vectors with X[[3]] (note that [[ is used to extract elements from a list). And the contents of X[[3]] will be the result of calculate_X(M[3]).
The function calculate_X() does the following: It creates a vector of m uniformly distributed random values (remember that m runs over the elements of M) and stores that in x. Then it creates a vector X that contains log normally distributed random variables. The parameters of the distribution depend on the value x.
I'm attempting to create sigma/summation function with the variables in my dataset that looks like this:
paste0("(choose(",zipdistrib$Leads[1],",",zipdistrib$Starts[1],")*beta(a+",zipdistrib$Starts[1],",b+",zipdistrib$Leads[1],"-",zipdistrib$Starts[1],")/beta(a,b))")
When I enter that code, I get
[1] "(choose(9,6)*beta(a+6,b+9-6)/beta(a,b))"
I want to create a sigma/summation function where a and b are unknown free-floating variables and the values of Leads[i] and Starts[i] are determined by the values for Leads and Starts for observation i in my dataset. I have tried using a sum function in conjunction with mapply and sapply to no avail. Currently, I am taking the tack of creating the function as a string using a for loop in conjunction with a paste0 command so that the only things that change are the values of the variables Leads and Starts. Then, I try coercing the result into a function. To my surprise, I can actually enter this code without creating a syntax error, but when I try optimize the function for variables a and b, I'm not having success.
Here's my attempt to create the function out of a string.
betafcn <- function (a,b) {
abfcnstring <-
for (i in 1:length(zipdistrib$Zip5))
toString(
paste0(" (choose(",zipdistrib$Leads[i],",",zipdistrib$Starts[i],")*beta(a+",zipdistrib$Starts[i],",b+",zipdistrib$Leads[i],"-",zipdistrib$Starts[i],")/beta(a,b))+")
)
as.function(
as.list(
substr(abfcnstring, 1, nchar(abfcnstring)-1)
)
)
}
Then when I try to optimize the function for a and b, I get the following:
optim(c(a=.03, b=100), betafcn(a,b))
## Error in as.function.default(x, envir) :
argument must have length at least 1
Is there a better way for me to compile a sigma from i=1 to length of dataset with mapply or lapply or some other *apply function? Or am I stuck using a dreaded for loop? And then once I create the function, how do I make sure that I can optimize for a and b?
Update
This is what my dataset would look like:
leads <-c(7,4,2)
sales <-c(3,1,0)
zipcodes <-factor(c("11111", "22222", "33333"))
zipleads <-data.frame(ZipCode=zipcodes, Leads=leads, Sales=sales)
zipleads
## ZipCode Leads Sales
# 1 11111 7 3
# 2 22222 4 1
# 3 33333 2 0
My goal is to create a function that would look something like this:
betafcn <-function (a,b) {
(choose(7,3)*beta(a+3,b+7-3)/beta(a,b))+
(choose(4,1)*beta(a+4,b+4-1)/beta(a,b))+
(choose(2,0)*beta(a+0,b+2-0)/beta(a,b))
}
The difference is that I would ideally like to replace the dataset values with any other possible vectors for Leads and Sales.
Since R vectorizes most of its operations by default, you can write an expression in terms of single values of a and b (which will automatically be recycled to the length of the data) and vectors of x and y (i.e., Leads and Sales); if you compute on the log scale, then you can use sum() (rather than prod()) to combine the results. Thus I think you're looking for something like:
betafcn <- function(a,b,x,y,log=FALSE) {
r <- lchoose(x,y)+lbeta(a+x,b+x-y)-lbeta(a,b)
if (log) r else exp(r)
}
Note that (1) optim() minimizes by default (2) if you're trying to optimize a likelihood you're better off optimizing the log-likelihood instead ...
Since all of the internal functions (+, lchoose, lbeta) are vectorized, you should be able to apply this across the whole data set via:
zipleads <- data.frame(Leads=c(7,4,2),Sales=c(3,1,0))
objfun <- function(p) { ## negative log-likelihood
-sum(betafcn(p[1],p[2],zipleads$Leads,zipleads$Sales,
log=TRUE))
}
objfun(c(1,1))
optim(fn=objfun,par=c(1,1))
I got crazy answers for this example (extremely large values of both shape parameters), but I think that's because it's awfully hard to fit a two-parameter model to three data points!
Since the shape parameters of the beta-binomial (which is what this appears to be) have to be positive, you might run into trouble with unconstrained optimization. You can use method="L-BFGS-B", lower=c(0,0) or optimize the parameters on the log scale ...
I thought your example was hopelessly complex. If you are going to attemp making a function by pasting character values, you first need to understand how to make a function body with an unevaluated expression, and after that basic task is understood, then you can elaborate ... if in fact it is necessary, noting BenBolker's suggestions.
choosefcn <- function (a,b) {}
txtxpr <- paste0("choose(",9,",",6,")" )
body(choosefcn) <- parse(text= txtxpr)
#----------
> betafcn
function (a, b)
choose(9, 6)
val1 <- "a"
val2 <- "b"
txtxpr <- paste0("choose(", val1, ",", val2, ")" )
body(choosefcn) <- parse(text= txtxpr)
#
choosefcn
#function (a, b)
#choose(a, b)
It also possible to configure the formal arguments separately with the formals<- function. See each of these help pages:
?formals
?body
?'function' # needs to be quoted
I have written a userdefined function:
epf <- function(z,x,noise=std_noise){
z_dims <- length(z)
std_noise <- 0.5*matrix(1,1,z_dims)
std_noise <- as.data.frame(std_noise)
obs_prob <- dnorm(z,x[1:z_dims],noise)
error <- prod(cbind(1,obs_prob))
return(error)
}
This function is called in a for-loop in another function:
w <- matrix(0,N,1)
for (i in 1:N){
w[i] <- epf(z,p[i,],R_noise)
}
where z is a 2-dimensional vector, N=1000, p is a dataframe of 1000 observations and 4 variables and R_noise is a dataframe og 1 observation and 4 variables.
Here I get the error: "Non-numeric argument to mathematical function", for the line obs_prob <- dnorm(z,x[1:z_dims],noise)
Can anyone help me with finding the error?
I have looked through questions similar to mine, but I still can't find the error in my code.
Edit:
Added definition of N
dnorm(as.matrix(z), x[1:a_dims], noise) may work better.
And more broadly speaking, a data frame with one row and two columns may be better expressed as a vector. Data frames look like matrices and as you put it 'two-dimensional vectors', but they are different in important aspects.
The same error may be occurring because you are feeding dnorm a second data frame in its last argument noise by passing R_noise.
Also, consider that p[i, ] has four values. It is being subsetted by obs_prob with x[1:z_dims]. In this case, z_dims will equal 2 since length(z) is 2. So you are evaluating dnorm(data.frame(z), p[1, ][1:2], data.frame(R_noise)).