R: Non-numeric argument to mathematical function - r

I have written a user-defined function:
epf <- function(z,x,noise=std_noise){
z_dims <- length(z)
std_noise <- 0.5*matrix(1,1,z_dims)
std_noise <- as.data.frame(std_noise)
obs_prob <- dnorm(z,x[1:z_dims],noise)
error <- prod(cbind(1,obs_prob))
return(error)
}
This function is called in a for-loop in another function:
w <- matrix(0,N,1)
for (i in 1:N){
w[i] <- epf(z,p[i,],R_noise)
}
where z is a 2-dimensional vector, N=1000, p is a dataframe of 1000 observations and 4 variables and R_noise is a dataframe of 1 observation and 4 variables.
Here I get the error: "Non-numeric argument to mathematical function", for the line obs_prob <- dnorm(z,x[1:z_dims],noise)
Can anyone help me with finding the error?
I have looked through questions similar to mine, but I still can't find the error in my code.
Edit:
Added definition of N

dnorm(as.matrix(z), x[1:z_dims], noise) may work better.
And more broadly speaking, a data frame with one row and two columns may be better expressed as a vector. Data frames look like matrices and as you put it 'two-dimensional vectors', but they are different in important aspects.
The same error may be occurring because you are feeding dnorm a second data frame in its last argument noise by passing R_noise.
Also, consider that p[i, ] has four values. Inside epf it is subsetted with x[1:z_dims] when obs_prob is computed. In this case, z_dims will equal 2 since length(z) is 2. So you are evaluating dnorm(data.frame(z), p[1, ][1:2], data.frame(R_noise)).
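A minimal sketch of that fix, assuming z is a plain numeric vector and that the noise should be cut to the same length as z (the original call recycles all four noise values instead). The data below are made up for illustration; only epf, z, p and R_noise are names from the question.

```r
# Hypothetical stand-ins for the objects described in the question
set.seed(1)
z       <- c(0.2, -0.1)
p       <- as.data.frame(matrix(rnorm(4000), ncol = 4))
R_noise <- as.data.frame(0.5 * matrix(1, 1, 4))

epf <- function(z, x, noise) {
  z_dims <- length(z)
  # A one-row data frame is a list; turn it into a plain numeric vector first
  x      <- as.numeric(unlist(x))[seq_len(z_dims)]
  noise  <- as.numeric(unlist(noise))[seq_len(z_dims)]   # assumption: one sd per element of z
  prod(dnorm(z, mean = x, sd = noise))
}

# One weight per row of p
w <- vapply(seq_len(nrow(p)), function(i) epf(z, p[i, ], R_noise), numeric(1))
```

Converting p to a matrix once with as.matrix(p) before the loop would avoid the repeated unlist calls.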

Related

How can I avoid replacement has length zero error

I am trying to generate the term frequency matrix of a document and subsequently look up the frequency of a certain word in a given query in that matrix. In the end I want to sum the frequencies found of the words in the query.
However, I am running into the error message: Error in feature[i] <- x : replacement has length zero
I do not have a lot of coding experience in general, and this is my first time working with R, thus I am having difficulties solving this error. I presume it has something to do with a null-value. I already tried to avoid the nested for-loop with an apply function because I thought that might help (not sure though), but I could not quite get the hang of how to convert the for-loop into an apply function.
termfreqname <- function(queries,docs){
n <- length(queries)
feature <- vector(length=n)
for(i in 1:n){
query <- queries[i]
documentcorpus <- c(docs[i])
tdm <- TermDocumentMatrix(tm_corpus) #creates the term frequency matrix per document
m <- sapply(strsplit(query, " "), length) #length of the query in words
totalfreq <- list(0) #initialize list
freq_counter <- rowSums(as.matrix(tdm)) #counts the occurrence of a given word in the tdm matrix
for(j in 1:m){
freq <- freq_counter[word(query,j)] #finds frequency of each word in the given query, in the term frequency matrix
totalfreq[[j]] <- freq #adds this frequency to position j in the list
}
x <- reduce(totalfreq,'+') #sums all the numbers in the list
feature[i] <- x #adds this number to feature list
feature
}
}
It depends on your needs, but bottom line you need to add some if statement. How you use it depends on whether you want the default value of the vector to persist. In your code, while feature starts as a logical vector, it is likely coerced to integer or numeric once you overwrite its first value with a number. In that case, the default value in all positions of the vector will be 0 (or 0L, if integer). That's going to influence your decision on how to use the if statement.
if (length(x)) feature[i] <- x
This will only attempt to overwrite the ith value of feature if the x object has non-zero length (that's equivalent to if (length(x) > 0)). In this case, since the default value in the vector will be zero, this means when you are done that you will not be able to distinguish between an element known to be 0 and an element that failed to find anything.
The alternative (and my preference/recommendation):
feature[i] <- if (length(x)) x else NA
In this case, when you are done, you can clearly distinguish between known-zero (0) and uncertain/unknown values (NA). When doing math operations on that vector, you might want/need na.rm=TRUE ... but it all depends on your use.
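To make the difference concrete, here is a small sketch with made-up per-query results (the list `results` is hypothetical and stands in for whatever x ends up being on each iteration):

```r
# Hypothetical results: the second query found nothing (length zero)
results <- list(4, numeric(0), 7)

feature <- numeric(length(results))
for (i in seq_along(results)) {
  x <- results[[i]]
  # Guard against zero-length values instead of assigning them directly
  feature[i] <- if (length(x)) x else NA
}
feature
```

Without the guard, the second iteration would stop with "replacement has length zero".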
BTW, as MartinGal noted, your use of reduce(totalfreq, '+') is a little flawed: the string '+' may not be (is not?) recognized as a known function. The first fix to this is to use backticks around the function instead of quotes, so
totalfreq <- 5:7
reduce(totalfreq, '+')
# NULL
reduce(totalfreq, `+`)
# [1] 18
sum(totalfreq)
# [1] 18
The last of these is the much-preferred method. Why? With a vector of length 4, for instance, reduce takes the first two elements and adds them, then takes that result and adds it to the third, then takes that result and adds it to the fourth. Three operations. When you have 100 elements, it will make 99 individual additions. sum does it in a single call, and this does have an effect on performance (asymptotically).
However, if totalfreq is instead a list, then this changes slightly:
totalfreq <- as.list(5:7)
reduce(totalfreq, `+`)
# [1] 18
sum(totalfreq)
# Error in sum(totalfreq) : invalid 'type' (list) of argument
# x
sum(unlist(totalfreq))
# [1] 18
The reduce code still works, and the sum by itself fails, but we can unlist the list first, effectively creating a vector, and then call sum on that. Much much faster asymptotically. And perhaps clearer, more declarative.
(I'm assuming purrr::reduce, btw ...)
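For what it's worth, the inner loop can probably be dropped altogether. A rough sketch, assuming freq_counter is a named numeric vector as returned by rowSums (the counts and the query below are invented):

```r
# Hypothetical term counts, named by term as rowSums(as.matrix(tdm)) would be
freq_counter <- c(apple = 3, banana = 1, cherry = 5)
query <- "apple cherry durian"

# Split the query into words and look them all up at once
words <- strsplit(query, " ", fixed = TRUE)[[1]]
freqs <- freq_counter[words]          # words missing from the matrix come back as NA

total <- sum(freqs, na.rm = TRUE)     # treat missing words as contributing nothing
```

Whether a missing word should count as 0 or make the whole total NA is a design choice; drop na.rm = TRUE for the latter.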

How to transfer multiple columns into numeric & find correlation coefficients

I have a dataset "res.sav" that I read in via haven. It contains 20 columns, called "Genes1_Acc4", "Genes2_Acc4" etc. I am trying to find a correlation coefficient between those and another column called "Condition". I want to separately list all coefficients.
I created two functions, cor.condition.cols and cor.func to do that. The first iterates through the filenames and works just fine. The second was supposed to give me my correlations which didn't work at all. I also created a new "cor.condition.Genes" which I would like to fill with the correlations, ideally as a matrix or dataframe.
I have tried to iterate through the columns with two functions. However, when I try to pass it, I get the warning: "NAs introduced by coercion". This wouldn't be the end of the world (I also tried suppressWarnings()). But the bigger problem is that my function does not seem to convert said columns into the numeric type I need for my cor() function. I receive the "y must be numeric" error when trying to run the cor() function. I tried to put several arguments within and without '' or "" without success.
When I ran str(cor.condition.cols) I only receive character strings, which makes me think that my function somehow messes up with the as.numeric function. Any suggestions of how else I could iterate through these columns and convert them?
Thanks guys :)
cor.condition.cols <- lapply(1:20, function(x){paste0("res$Genes", x, "_Acc4")})
#save acc_4 columns as numeric columns and calculate correlations
res <- (as.numeric("cor.condition.cols"))
cor.func <- function(x){
cor(res$Condition, x, use="complete.obs", method="pearson")
}
cor.condition.Genes <- cor.func(cor.condition.cols)
You can do:
cor.condition.cols <- paste0("Genes", 1:20, "_Acc4")
res2 <- sapply(res[cor.condition.cols], as.numeric) # numeric matrix, one column per gene (plain as.numeric() on a matrix would drop the dimensions)
cor.condition.Genes <- cor(res2, res$Condition, use="complete.obs", method="pearson")
or, if the columns are already numeric, the shorter variant:
cor.condition.cols <- paste0("Genes", 1:20, "_Acc4")
cor.condition.Genes <- cor(res[cor.condition.cols], res$Condition, use="complete.obs")
Here is an example with other data:
cor(iris[-(4:5)], iris[[4]])
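If the columns really arrive as character strings, a sketch along these lines might help. The data frame is simulated and only mimics the column names from the question; two gene columns are used instead of twenty to keep it short.

```r
# Simulated stand-in for res: gene columns stored as character strings
set.seed(1)
res <- data.frame(
  Condition   = rnorm(30),
  Genes1_Acc4 = as.character(rnorm(30)),
  Genes2_Acc4 = as.character(rnorm(30)),
  stringsAsFactors = FALSE
)

cols <- paste0("Genes", 1:2, "_Acc4")

# Convert the selected columns in place, keeping the data frame structure
res[cols] <- lapply(res[cols], as.numeric)

# One correlation per gene column, returned as a named numeric vector
cors <- vapply(res[cols],
               function(col) cor(col, res$Condition, use = "complete.obs"),
               numeric(1))
```

The result is named after the columns, so cors["Genes1_Acc4"] picks out a single coefficient.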

Understanding Vectorized Code In R

I'm trying to understand the answer to this question using R and I'm struggling a lot.
The dataset for the R code can be found with this code
library(devtools)
install_github("genomicsclass/GSE5859Subset")
library(GSE5859Subset)
data(GSE5859Subset) ##this loads the three tables you need
Here is the question
Write a function that takes a vector of values e and a binary vector group coding two groups, and returns the p-value from a t-test: t.test( e[group==1], e[group==0])$p.value.
Now define g to code cases (1) and controls (0) like this g <- factor(sampleInfo$group)
Next use the function apply to run a t-test for each row of geneExpression and obtain the p-value. What is smallest p-value among all these t-tests?
The answer provided is
myttest <- function(e,group){
x <- e[group==1]
y <- e[group==0]
return( t.test(x,y)$p.value )
}
g <- factor(sampleInfo$group)
pvals <- apply(geneExpression,1,myttest, group=g)
min( pvals )
Which gives you the answer of 1.406803e-21.
What exactly is the input of the "e" argument of the myttest function when you run this? Is it possible to write this function as a formula like
t.test(DV ~ sampleInfo$group)
The t test is comparing the gene expression values of the 24 people (the values of which I believe are in the "geneExpression" matrix) by the group they were in, which you can find in sampleInfo's "group" column. I've run t tests so many times in R, but for some reason I can't wrap my mind around what's going on in this code.
Your question seems to be about understanding the function apply().
For the technical description, see ?apply.
My quick explanation: the apply() line of code in your question applies the following function to each of the rows of geneExpression
myttest(e=x, group=g)
where x is a placeholder for each row.
To help make sense of it, a for loop version of that apply() line would look something like:
N <- nrow(geneExpression) #so we don't have to type this twice
pvals <- numeric(N) #empty vector to store results
# what 'apply' does for us (with less typing)
for(i in 1:N) {
pvals[i] <- myttest(geneExpression[i,], group=g) #each row gets the full group vector
}
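A small self-contained check of this, on a simulated matrix rather than the GSE5859Subset data (the names expr, p_apply, p_loop and p_formula are made up for the example). It also shows the formula interface asked about in the question, with each row playing the role of DV:

```r
set.seed(42)
expr <- matrix(rnorm(5 * 24), nrow = 5)   # 5 simulated "genes", 24 samples
g    <- factor(rep(c(0, 1), each = 12))   # controls (0) and cases (1)

myttest <- function(e, group) {
  t.test(e[group == 1], e[group == 0])$p.value
}

# apply() hands each row of expr to myttest() as the argument e
p_apply <- apply(expr, 1, myttest, group = g)

# The same thing written as an explicit loop
p_loop <- numeric(nrow(expr))
for (i in seq_len(nrow(expr))) {
  p_loop[i] <- myttest(expr[i, ], group = g)
}

# Formula interface: the row is the response, g is the grouping variable
p_formula <- apply(expr, 1, function(e) t.test(e ~ g)$p.value)

all.equal(p_apply, p_loop)
all.equal(p_apply, p_formula)
```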

Indexing variables in R

I am normally a maple user currently working with R, and I have a problem with correctly indexing variables.
Say I want to define 2 vectors, v1 and v2, and I want to call the nth element in v1. In maple this is easily done:
v[1]:=some vector,
and the nth element is then called by the command
v[1][n].
How can this be done in R? The actual problem is as follows:
I have a sequence M (say of length 10, indexed by k) of simulated negbin variables. For each of these simulated variables I want to construct a vector X of length M[k] with entries given by some formula. So I should end up with 10 different vectors, each of different length. My incorrect code looks like this
sims<-10
M<-rnegbin(sims, eks_2016_kasko*exp(-2.17173), 840.1746)
for(k in 1:sims){
x[k]<-rep(NA,M[k])
X[k]<-rep(NA,M[k])
for(i in 1:M[k]){x[k][i]<-runif(1,min=0,max=1)
if(x[k][i]>=0 & x[i]<=0.1056379){
X[k][i]<-rlnorm(1, 6.228244, 0.3565041)}
else{
X[k][i]<-rlnorm(1, 8.910837, 1.1890874)
}
}
}
The error appears to be that x[k] is not a valid name for a variable. Any way to make this work?
Thanks a lot :)
I've edited your R script slightly to get it working and make it reproducible. To do this I had to assume that eks_2016_kasko was an integer value of 10.
require(MASS)
sims<-10
# rnegbin can return 0, and 1:0 would loop over c(1, 0) because R is not zero-indexed, so add one
M<-rnegbin(sims, 10*exp(-2.17173), 840.1746) + 1
# Create a list
x <- list()
X <- list()
for(k in 1:sims){
x[[k]]<-rep(NA,M[k])
X[[k]]<-rep(NA,M[k])
for(i in 1:M[k]){
x[[k]][i]<-runif(1,min=0,max=1)
if(x[[k]][i]>=0 & x[[k]][i]<=0.1056379){
X[[k]][i]<-rlnorm(1, 6.228244, 0.3565041)}
else{
X[[k]][i]<-rlnorm(1, 8.910837, 1.1890874)
}
}
}
This will work and I think is what you were trying to do, BUT is not great R code. I strongly recommend using the lapply family instead of for loops, learning to use data.table and parallelisation if you need to get things to scale. Additionally if you want to read more about indexing in R and subsetting Hadley Wickham has a comprehensive break down here.
Hope this helps!
Let me start with a few remarks and then show you how your problem can be solved using R.
In R, there is most of the time no need to use a for loop in order to assign several values to a vector. So, for example, to fill a vector of length 100 with uniformly distributed random variables, you do something like:
set.seed(1234)
x1 <- rep(NA, 100)
for (i in 1:100) {
x1[i] <- runif(1, 0, 1)
}
(set.seed() is used to set the random seed, such that you get the same result each time.) It is much simpler (and also much faster) to do this instead:
set.seed(1234)
x2 <- runif(100, 0, 1)
identical(x1, x2)
## [1] TRUE
As you see, results are identical.
The reason that x[k]<-rep(NA,M[k]) does not work is that indeed x[k] is not a valid variable name in R. [ is used for indexing, so x[k] extracts the element k from a vector x. Since you try to assign a vector of length larger than 1 to a single element, this fails: you get an error if x does not exist yet, and otherwise a warning, with only the first value being stored. What you probably want to use is a list, as you will see in the example below.
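As a quick illustration of the Maple-style v[1][n] in R terms (the values are arbitrary):

```r
v <- list()                 # a list can hold vectors of different lengths
v[[1]] <- c(10, 20, 30)
v[[2]] <- c(1, 2)

v[[1]][2]                   # second element of the first vector
lengths(v)                  # length of each stored vector
```

Here [[ picks the vector out of the list and [ then indexes into that vector.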
So here comes the code that I would use instead of what you proposed in your post. Note that I am not sure that I correctly understood what you intend to do, so I will also describe below what the code does. Let me know if this fits your intentions.
# define M
library(MASS)
eks_2016_kasko <- 486689.1
sims<-10
M<-rnegbin(sims, eks_2016_kasko*exp(-2.17173), 840.1746)
# define the function that calculates X for a single value from M
calculate_X <- function(m) {
x <- runif(m, min=0,max=1)
X <- ifelse(x <= 0.1056379, rlnorm(m, 6.228244, 0.3565041),
rlnorm(m, 8.910837, 1.1890874))
}
# apply that function to each element of M
X <- lapply(M, calculate_X)
As you can see, there are no loops in that solution. I'll start to explain at the end:
lapply is used to apply a function (calculate_X) to each element of a list or vector (here it is the vector M). It returns a list. So, you can get, e.g. the third of the vectors with X[[3]] (note that [[ is used to extract elements from a list). And the contents of X[[3]] will be the result of calculate_X(M[3]).
The function calculate_X() does the following: It creates a vector of m uniformly distributed random values (remember that m runs over the elements of M) and stores that in x. Then it creates a vector X that contains log normally distributed random variables. The parameters of the distribution depend on the value x.
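To check that each vector really has the length given by M, here is a scaled-down sketch. It uses base R's rnbinom with invented parameters as a stand-in for MASS::rnegbin, and the direction of the threshold comparison follows the code in the question.

```r
set.seed(1)
# Hypothetical counts; rnbinom() replaces MASS::rnegbin() only to avoid the dependency
M <- rnbinom(10, size = 5, mu = 8)

calculate_X <- function(m) {
  x <- runif(m)
  # ifelse() picks, element by element, from two full vectors of candidates
  ifelse(x <= 0.1056379,
         rlnorm(m, 6.228244, 0.3565041),
         rlnorm(m, 8.910837, 1.1890874))
}

X <- lapply(M, calculate_X)

lengths(X)   # one entry per element of M
```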

vector-matrix multiplication in r

I want to multiply 1000 random variables to a matrix so as to get 1000 different resultant matrices.
I'm running the following code :
Threshold <- runif(1000,min=0,max=1) #Generating 1000 random variables so that we can see 1000 multiple results of Burstscore
Burstscore <- matrix(data=0,nrow=nrow(Fm2),ncol=ncol(Fpre2))
#Calculating the final burst score
for (i in 1:nrow(Fm2)){
for (j in 1:ncol(Fpre)){ #Dimentions of all the matrices (Fpre,Fm,Growth,TD,Burstscore) are 432,24
{
Burstscore[i,j]= ((as.numeric(Threshold))*(as.numeric(Growth[i,j]))) + ((1-(as.numeric(Threshold)))*(as.numeric(TD[i,j])))
}
}
}
I'm getting the following error -
'Error in Burstscore[i, j] = ((as.numeric(Threshold)) * (as.numeric(Growth[i, :
number of items to replace is not a multiple of replacement length'
You are trying to put 1000 values into one cell of the Burstscore matrix (as you are multiplying each [i,j] element by the entire "Threshold" vector). Apart from this, your code contains unnecessary elements (brackets or as.numeric() statements). And, of course, as said above, it is not fully reproducible, and I had to "invent" several matrices.
I guess that what you want to do is the following:
Threshold <- runif(1000,min=0,max=1)
Growth <- matrix(runif(432*24), ncol=24)
TD <- matrix(runif(432*24), ncol=24)
Burstscore <- vector("list", length(Threshold))
for (i in 1:length(Threshold)) {
Burstscore[[i]]= (Threshold[i]*Growth) + ((1-Threshold[i])*TD)
}
In R, it would be even more elegant to use a lapply() function:
Burstscore <- lapply(Threshold, function(x) (x*Growth)+((1-x)*TD))
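A tiny version with made-up 2 x 3 matrices and three thresholds makes it easy to verify by eye: a threshold of 0 should give back TD and a threshold of 1 should give back Growth.

```r
set.seed(7)
Growth <- matrix(runif(6), nrow = 2)   # invented data, same shape as TD
TD     <- matrix(runif(6), nrow = 2)
Threshold <- c(0, 0.5, 1)

# One blended matrix per threshold value
Burstscore <- lapply(Threshold, function(x) x * Growth + (1 - x) * TD)

length(Burstscore)       # one matrix per threshold
dim(Burstscore[[2]])     # same dimensions as Growth and TD
```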
Finally, I suggest you also put a more meaningful title to your question, so it could potentially be helpful to others also.
