I was trying to write and apply a seemingly easy function that would center my continuous regression predictors, because I want to deal with multicollinearity.
So instead of writing x-mean(x,na.rm=T) each time, I'm looking for something handier that does the job for me, not least because I wanted to practice writing functions in R. ;)
So here is what I tried:
fun <- function(data.frame, x){
data.frame$x - mean(data.frame$x, na.rm=T)
}
Apparently this is not too wrong. At least defining it doesn't return an error message.
However, applying fun to, say, the built-in mtcars dataset and, say, the variable disp yields this warning:
#Loading the data:
data("mtcars")
fun(mtcars,x=disp) #I tried several ways, e.g. w and w/o "mtcars" in front
Warning message:
In mean.default(mtcars$x, na.rm = T) :
argument is not numeric or logical: returning NA
My guess is that it is about how I applied the function, because when I do manually what the function is supposed to do, it works perfectly.
Also, I was looking for similar questions on writing and applying such a function (also beyond the Stack Exchange universe), but I didn't find anything helpful.
Hope I didn't make a blunder due to my novice R-skills.
There is already a function in R which does what you want to do: scale().
You can just write scale(mtcars$hp, center = TRUE, scale = FALSE) which then subtracts the mean of the vector from the vector itself.
In combination with apply this is powerful: you can, for example, center every column of your data frame by writing:
apply(dataframe, MARGIN = 2, FUN = scale, center = TRUE, scale = FALSE)
Before you do that you have to make sure that this is a valid function for your column. You cannot scale factors or characters, for example.
Regarding your question: your function should look like this:
fun <- function(data.frame, x){
data.frame[[x]] - mean(data.frame[[x]], na.rm=T)
}
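and then, when calling the function, you would write fun(mtcars, "hp"), with the variable name in quotation marks. This is because the $ operator does not evaluate what follows it: data.frame$x looks for a column literally named x, which mtcars does not have, so it returns NULL. The [[ operator accepts a character string, so the column name can be passed in as an argument.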
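As a minimal sketch of the point above, the following uses [[ with a quoted column name and checks the result against scale(). The name center_col is made up for illustration and is not part of the original question.

```r
# Hypothetical helper: center one column, selected by its name as a string.
center_col <- function(df, col) {
  df[[col]] - mean(df[[col]], na.rm = TRUE)
}

centered <- center_col(mtcars, "disp")

# Same values as scale() with centering only (scale() returns a one-column matrix).
reference <- as.numeric(scale(mtcars$disp, center = TRUE, scale = FALSE))

# Centering all numeric columns at once: scale() also accepts a data frame.
num_cols <- vapply(mtcars, is.numeric, logical(1))
centered_all <- scale(mtcars[num_cols], center = TRUE, scale = FALSE)
```

Passing the column name as a string keeps the function free of any non-standard evaluation, which is usually the easiest thing to debug.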
I am trying to extract regression coefficients using the function below;
## customized function to return coef as matrix
cust_lm<- function(varname, data){
y<-data[,varname]
coefOLS<- as.matrix(coef(summary(lm(y~x))));
}
I want to run regression using different dependent variables (independent variable remain the same) each time with this function. I am using lapply for the same.
## artificial data
x<-rnorm(100,5,3)
ydata<-data.frame(y1=rnorm(100), y2=rnorm(100))
## running regressions together and storing as list
list<-lapply(names(ydata)[1:2], function(x) cust_lm(x, ydata))
I'm getting the desired result where list[[1]] is nothing but coef(summary(lm(ydata[,1]~x))) and list[[2]] is equal to coef(summary(lm(ydata[,2]~x))).
I wrote this some time ago with the help of several SO posts. Now I want to understand how my custom function works, and I'm also not very clear about lapply.
I have already created the custom function with the arguments (varname, data), and I'm again giving cust_lm(x, data) as an argument in lapply. Is that the right thing to do?
Would it also be right to write list<-lapply(names(ydata)[1:2], function(z) cust_lm(z, data)) instead? I'm quite confused about this. Any help or resources are appreciated.
You can always try to break it down bit by bit.
The first iteration of lapply would call cust_lm('y1', ydata). Let's take a look:
cust_lm('y1', ydata)
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.006170844 0.22234415 0.02775357 0.9779151
# x -0.004470560 0.03960525 -0.11287797 0.9103582
In your code, the name data exists only inside cust_lm, as the name of its second argument. So when you write list<-lapply(names(ydata)[1:2], function(z) cust_lm(z, data)), R looks for an object named data in the calling environment when the line is run, not inside cust_lm. So this is wrong. Calling it with list<-lapply(names(ydata)[1:2], function(x) cust_lm(x, ydata)) is the correct answer (whether the anonymous function's argument is called x or z does not matter). You can further simplify it as:
list <- lapply(names(ydata)[1:2], cust_lm, data=ydata)
This breaks down to "call cust_lm with each element of names(ydata)[1:2] in turn as first argument; use ydata for the argument named data".
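A small sketch of the same pattern, with two changes that are assumptions of mine rather than part of the original code: the predictor is passed in explicitly instead of being picked up from the global environment, and the result list is named after the response columns. cust_lm2 and predictor are hypothetical names.

```r
set.seed(1)
x <- rnorm(100, 5, 3)
ydata <- data.frame(y1 = rnorm(100), y2 = rnorm(100))

# Hypothetical variant of cust_lm: the predictor is an explicit argument.
cust_lm2 <- function(varname, data, predictor) {
  y <- data[[varname]]
  coef(summary(lm(y ~ predictor)))  # coefficient table as a matrix
}

# lapply passes each name as the first argument; the rest are matched by name.
fits <- lapply(names(ydata), cust_lm2, data = ydata, predictor = x)
names(fits) <- names(ydata)

fits$y1  # coefficient table for the first response
```

Storing the result under a name other than list also avoids masking the base function of that name.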
testing<-function(formula=NULL,data=NULL){
if(with(data,formula)==T){
print('YESSSS')
}
}
A<-matrix(1:16,4,4)
colnames(A)<-c('x','y','z','gg')
A<-as.data.frame(A)
testing(data=A,formula=(2*x+y==Z))
Error in eval(expr, envir, enclos) : object 'x' not found
##or I can put formula=(x=1)
##reason that I use formula is because my dataset had different location and I would want
##to 'subset' my data into different set
This is the main flow of my code. I did some searching, and it seems either no one has asked this kind of stupid question or it is not possible to pass a formula into an if statement. Thank you in advance.
If you just want a subset of your data.frame, create a character object representing the formula like this:
formula="2*x+y==z"
testing<-function(data,formula){with(data = data,expr = eval(parse(text = formula)))}
subset(A,testing(A,formula=formula))
#x y z gg
#2 2 6 10 14
You can change the formula as per your need.
If we need to evaluate it, one option is eval(parse(...)):
testing<-function(formula=NULL,data=NULL){
data <- deparse(substitute(data))
if(any(eval(parse(text=paste("with(", data, ",",
deparse(substitute(formula)), ")")))))
print("YESSS")
}
testing(data=A,formula=(2*x+y==z))
#[1] "YESSS"
When you call a function in R, its arguments are evaluated in the environment the call was made from, not inside the function and not inside data. (Strictly speaking, R evaluates them lazily, the first time the function uses them, but the lookup still happens in the caller's environment.)
For example, prod(2+2, 3) is effectively turned into prod(4, 3).
Thus, in your code, R tries to solve (2*x+y==Z) where you called testing(). It fails because there is no x object outside of the data frame, so the condition is never evaluated against the columns of A.
To use your function correctly you should make it clear to R that it is not supposed to calculate (2*x+y==Z). Instead it should pass this information as is. You could do that using the functions expression() and eval().
testing<-function(formula=NULL,data=NULL){
if(with(data,eval(formula))==T){
print('YESSSS')
}
}
A<-matrix(1:16,4,4)
colnames(A)<-c('x','y','z','gg')
A<-as.data.frame(A)
testing(data=A,formula=expression(2*x+y==Z))
However, you will notice that there are other problems with your code.
First, Z is different from z: in colnames you use z, but in the formula you use Z.
Second, if() only works with a single TRUE or FALSE value. In your case, you will have one value for each row in A. In older versions of R, if() then checks only whether the first row fits the criteria (with a warning); since R 4.2.0 it is an error.
If your purpose is subsetting, it is much easier to do:
A.subset <- subset(A, 2*A$x+A$y == A$z)
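If the goal is a function that takes the condition unquoted, one possible sketch captures it with substitute() and evaluates it inside the data frame. This is the same mechanism subset() relies on; rows_where is a made-up name, and this is an illustration under those assumptions rather than the only way to do it.

```r
# Hypothetical helper: keep the rows of `data` where `condition` is TRUE.
rows_where <- function(data, condition) {
  cond <- substitute(condition)             # capture the expression unevaluated
  keep <- eval(cond, data, parent.frame())  # columns first, then the caller's variables
  data[which(keep), , drop = FALSE]         # which() drops NA results
}

A <- as.data.frame(matrix(1:16, 4, 4,
                          dimnames = list(NULL, c("x", "y", "z", "gg"))))

res <- rows_where(A, 2 * x + y == z)
res
```

Only the second row satisfies 2*x + y == z (2*2 + 6 == 10), so res has a single row.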
After a discussion with my colleague,
here is a kind of solution
testing<-function(cx,cy,px,py,z,data=NULL){
list<-NULL
for(m in 1:nrow(data)){
if(cx*data$x[m]^px+cy*data$y[m]^py+data$z[m]==0){
print(m)}
}
}
but this can deal with polynomials only, and it needs a lot of arguments in the function. I am thinking of a way to generalize it to an arbitrary equation, or maybe this is already the easiest approach.
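For what it is worth, the row-by-row loop can be written without a loop, because arithmetic and comparison in R are vectorized. The sketch below keeps the same polynomial form and returns the matching row numbers instead of printing them; poly_rows is a hypothetical name, and the coefficients in the call are chosen only so that one row of A matches.

```r
# Hypothetical vectorized version: row numbers where cx*x^px + cy*y^py + z == 0.
poly_rows <- function(data, cx, cy, px, py) {
  which(cx * data$x^px + cy * data$y^py + data$z == 0)
}

A <- as.data.frame(matrix(1:16, 4, 4,
                          dimnames = list(NULL, c("x", "y", "z", "gg"))))

# -2*x - y + z == 0 is the same condition as 2*x + y == z.
hits <- poly_rows(A, cx = -2, cy = -1, px = 1, py = 1)
```

With non-integer data an exact comparison to 0 is fragile; comparing abs(...) against a small tolerance would be the usual workaround.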
So, I have a set of data, and what I'm trying to do is find all the local maxima on the resulting curve. I read in a CSV file, which has x-values in the first column and y-values in the second, first step done, easy.
To find the maxima, I tried to use the findpeaks() function from the pracma package. However, each time I tried to run it, I got the same error:
Error: is.vector(x, mode = "numeric") is not TRUE
So, I first tried just converting this to a vector. Still got the same issue, however is.vector(x, mode = "any") was now returning true. I found some other help threads (which I can no longer find, so I can't share them, sorry!), and decided to try using lapply to coerce each entry in the new vector using as.numeric. Didn't work. Looked into ?as.numeric, and it mentioned that as.double might be better suited. Didn't work. Now I'm at a loss and not sure what to do - current working code is shown below.
plot <- read_csv("AFGP60 UV-05-04-16.csv",
col_names = FALSE, na = "null", skip = 2,n_max = numrow)
diffplot <- c(plot[1:601,2])
diffplot <- lapply(diffplot,as.double)
findpeaks(diffplot)
Try diffplot <- as.numeric(plot[[2]][1:600]).
The problem is most likely that the data was read as character or as factor, and that c(plot[1:601,2]) produces a list rather than an atomic vector, which is why is.vector(x, mode = "numeric") is not TRUE. Extracting the column with [[ and converting it with as.numeric should change that. However, there are other issues with your code. First, plot is a base function used for plotting. Naming a variable with such a name is bad practice.
Second, once the column has been extracted this way, diffplot is a plain vector (the first 600 rows of the second column), so there is no need to change each element separately with the lapply function.
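The difference between the two ways of extracting a column can be checked on a toy data frame. The data below is invented, and the last step is a base-R stand-in for peak finding (a sign change in the first differences), not pracma's findpeaks(), so that the sketch runs without extra packages.

```r
# Toy data: the second column was read as character, as can happen with CSV files.
df <- data.frame(V1 = 1:5,
                 V2 = c("0.1", "0.5", "0.2", "0.9", "0.3"),
                 stringsAsFactors = FALSE)

as_list <- c(df[1:5, 2, drop = FALSE])  # still a list holding one column
y <- as.numeric(df[[2]][1:5])           # atomic numeric vector

is.vector(as_list, mode = "numeric")    # FALSE
is.vector(y, mode = "numeric")          # TRUE

# Simple local maxima: positions where the slope goes from positive to negative.
peaks <- which(diff(sign(diff(y))) == -2) + 1
```

Here peaks contains positions 2 and 4, the two interior points that are higher than both neighbours. Plateaus and end points are not handled by this shortcut.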
I am struggling to add an na.rm command to a custom function (just a percentage) seen below on a dataframe where each column is a point in time filled with prices of the securities that are identified in the rows. This df contains quite a bit of NAs. Here is the function:
pctabovepx=function(x) {
count_above_px=x>pxcutoff
100*(sum(count_above_px)/nrow(count_above_px))
}
I then want to run this function within an sapply on all columns of my df with price data, as specified in the range below. Without adding an na command, it returns nothing ("numeric(0)"), but when I add an na.rm command as I would with a function like mean, it returns "Error in FUN(X[[1L]], ...) : unused argument (na.rm = TRUE)".
abovepar=sapply(master[min_range:max_range], pctabovepx)
abovepar=sapply(master[min_range:max_range], pctabovepx, na.rm=TRUE)
I also tried to simplify and just do a count before doing a percentage. The following command did not return an error but just returned all values that were not NA, instead of the subset with prices above the cutoff.
countsabovepx=as.data.frame(sapply(master[min_range:max_range],function(x) sum(!is.na(x>pxcutoff))))
I am wondering how to avoid this issue, both with this function and generally with self-written functions that aren't mean or median.
You need to add it to your function as an argument and pass it to the sum. You also need to take account of the effect on the nrow part too. However, in the context of the rest of the function, I expect count_above_px to be a vector and nrow not to make sense here. I presume you meant to do length, and you are actually computing the mean, which has the na.rm argument anyway. You might also want to look at pxcutoff as it is not defined in the function - should this be passed as an argument too?
pctabovepx=function(x, na.rm=FALSE) {
count_above_px=x>pxcutoff
100*mean(count_above_px, na.rm=na.rm)
}
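A possible variant that also takes the cutoff as an argument, as suggested above, might look like the sketch below. The names pct_above and cutoff and the toy prices are assumptions for illustration only.

```r
# Hypothetical variant: the cutoff is an argument instead of a global variable.
pct_above <- function(x, cutoff, na.rm = FALSE) {
  100 * mean(x > cutoff, na.rm = na.rm)
}

# Toy price data with missing values, one column per point in time.
prices <- data.frame(t1 = c(90, 105, NA, 110),
                     t2 = c(NA, 101, 99, 120))

# Extra arguments after the function are forwarded by sapply to every call.
with_na_rm <- sapply(prices, pct_above, cutoff = 100, na.rm = TRUE)
without_na_rm <- sapply(prices, pct_above, cutoff = 100)
```

With na.rm = TRUE each column gives two of three non-missing prices above 100; without it, both results are NA, which is the usual behaviour of mean().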
I have two lists of lists, humanSplit and ratSplit. Elements of humanSplit have the form:
> humanSplit[1]
$Fetal_Brain_408_AGTCAA_L001_R1_report.txt
humanGene humanReplicate alignment RNAtype
66 DGKI Fetal_Brain_408_AGTCAA_L001_R1_report.txt 6 reg
68 ARFGEF2 Fetal_Brain_408_AGTCAA_L001_R1_report.txt 5 reg
If you type humanSplit[[1]], it gives the data without the name $Fetal_Brain_408_AGTCAA_L001_R1_report.txt
ratSplit is essentially similar to humanSplit, differing only in column order. I want to apply Fisher's test to every possible pairing of replicates from humanSplit and ratSplit. I defined the following empty vectors, which I will use to store the results of my Fisher's tests:
humanReplicate <- vector(mode = 'character', length = 0)
ratReplicate <- vector(mode = 'character', length = 0)
pvalue <- vector(mode = 'numeric', length = 0)
For Fisher's test between two replicates of humanSplit and ratSplit, I define the following function. In the function I use geneList, which is a data.frame made by reading a file and has the form:
> head(geneList)
human rat
1 5S_rRNA 5S_rRNA
2 5S_rRNA 5S_rRNA
Now here is the main function, where I use a function getGenetype which I already defined in another part of the code. Also, x and y are integers:
fishertest <-function(x,y) {
ratReplicateName <- names(ratSplit[x])
humanReplicateName <- names(humanSplit[y])
## merging above two based on the one-to-one gene mapping as in geneList
## defined above.
mergedHumanData <-merge(geneList,humanSplit[[y]], by.x = "human", by.y = "humanGene")
mergedRatData <- merge(geneList, ratSplit[[x]], by.x = "rat", by.y = "ratGene")
## [here i do other manipulation with using already defined function
## getGenetype that is defined outside of this function and make things
## necessary to define following contingency table]
contingencyTable <- matrix(c(HnRn,HnRy,HyRn,HyRy), nrow = 2)
fisherTest <- fisher.test(contingencyTable)
humanReplicate <- c(humanReplicate,humanReplicateName )
ratReplicate <- c(ratReplicate,ratReplicateName )
pvalue <- c(pvalue , fisherTest$p)
}
After doing all this, I make the matrix eg to use in apply. Here I am basically trying to do something similar to a double for loop and then run the Fisher test on each pair:
eg <- expand.grid(i = 1:length(ratSplit),j = 1:length(humanSplit))
junk = apply(eg, 1, fishertest(eg$i,eg$j))
Now the problem is, when I try to run, it gives the following error when it tries to use function fishertest in apply
Error in humanSplit[[y]] : recursive indexing failed at level 3
Rstudio points out problem in following line:
mergedHumanData <-merge(geneList,humanSplit[[y]], by.x = "human", by.y = "humanGene")
Ultimately, I want to do the following:
result <- data.frame(humanReplicate,ratReplicate, pvalue ,alternative, Conf.int1, Conf.int2, oddratio)
I am struggling with these questions:
In defining fishertest function, how should I pass ratSplit and humanSplit and already defined function getGenetype?
And how I should use apply here?
Any help would be much appreciated.
Up front: read ?apply. Additionally, the first three hits on Google when searching for "R apply tutorial" are helpful.
Errors in fishertest()
The error message itself has nothing to do with apply. The reason it got as far as it did is because the arguments you provided actually resolved. Try to do eg$i by itself, and you'll see that it is returning a vector: the corresponding column in the eg data.frame. You are passing this whole vector as the index argument x (and eg$j as y). The primary reason your function erred out is because double-bracket indexing ([[) only works with single values, not vectors of length greater than 1. This is a great example of where production/deployed functions would need type-checking to ensure that each argument is a numeric of length 1; often not required for quick code, but it would have caught this mistake. Had it not been for the [[ limit, your function may have returned incorrect results. (I've been bitten by that many times!)
BTW: your code is also incorrect in its scoped access to pvalue, et al. If you make your function return just the numbers you need and then aggregate them outside of the function, your life will simplify. (pvalue <- c(pvalue, ...) will find pvalue assigned outside the function but will not update it as you want.) You are defeating one purpose of writing this into a function. When thinking about writing this function, try to answer only this question: "how do I compare a single rat record with a single human record?" Only after that works correctly and simply, without having to overwrite variables in the parent environment, should you try to answer the question "how do I apply this function to all pairs and aggregate it?" Try very hard to have your function not change anything outside of its own environment.
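To make the "return what you need" idea concrete, here is a rough sketch of a function that handles exactly one pair and returns a one-row data.frame. The counts are invented and passed in directly, because the real contingency table depends on getGenetype and the merge steps, which are not shown; pair_test is a hypothetical name.

```r
# Hypothetical single-pair function: takes the four cell counts, returns one row.
pair_test <- function(human_name, rat_name, counts) {
  ft <- fisher.test(matrix(counts, nrow = 2))
  data.frame(humanReplicate = human_name,
             ratReplicate   = rat_name,
             pvalue         = ft$p.value,
             oddsratio      = unname(ft$estimate),
             stringsAsFactors = FALSE)
}

# Made-up counts in the order HnRn, HnRy, HyRn, HyRy.
row1 <- pair_test("H1", "R1", c(12, 5, 3, 10))
row1
```

Nothing outside the function is modified, so the rows can be collected afterwards with something like do.call(rbind, ...).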
Errors in apply()
Had your function worked properly despite these errors, you would have received the following error from apply:
apply(eg, 1, fishertest(eg$i, eg$j))
## Error in match.fun(FUN) :
## 'fishertest(eg$i, eg$j)' is not a function, character or symbol
When you call apply this way, R evaluates the third argument before apply can use it. Since it is simply a call to fishertest(eg$i, eg$j), which is intended to return a data.frame row (inferred from your previous question), it resolves to such, and apply then sees something akin to:
apply(eg, 1, data.frame(...))
Now you can see that apply is being handed a data.frame and not a function.
The third argument (FUN) needs to be a function itself that takes as its first argument a vector containing the elements of the row (1) or column (2) of the matrix/data.frame. As an example, consider the following contrived example:
eg <- data.frame(aa = 1:5, bb = 11:15)
apply(eg, 1, mean)
## [1] 6 7 8 9 10
# similar to your use, will not work; this error comes from mean not getting
# any arguments, your error above is because the value returned by
# fishertest(...) is not a function
apply(eg, 1, mean())
## Error in mean.default() : argument "x" is missing, with no default
Realize that mean is a function itself, not the return value from a function (there is more to it, but this definition works). Because we're iterating over the rows of eg (because of the 1), the first iteration takes the first row and calls mean(c(1, 11)), which returns 6. The equivalent of your code here, mean()(c(1, 11)), will fail for a couple of reasons: (1) because mean requires an argument and is not getting one, and (2) regardless, it does not return a function itself (functions that return functions belong to a "functional programming" paradigm, easy in R but uncommon for most programmers).
In the example here, mean will accept a single argument which is typically a vector of numerics. In your case, your function fishertest requires two arguments (templated by my previous answer to your question), which does not work. You have two options here:
1. Change your fishertest function to accept a single vector as an argument and parse the index numbers from it. Both of the following options do this:
fishertest <- function(v) {
x <- v[1]
y <- v[2]
ratReplicateName <- names(ratSplit[x])
## ...
}
or
fishertest <- function(x, y) {
if (missing(y)) {
y <- x[2]
x <- x[1]
}
ratReplicateName <- names(ratSplit[x])
## ...
}
The second version allows you to continue using the manual form of fishertest(1, 57) while also allowing you to do apply(eg, 1, fishertest) verbatim. Very readable, IMHO. (Better error checking and reporting can be used here, I'm just providing a MWE.)
2. Write an anonymous function to take the vector and split it up appropriately. This anonymous function could look something like function(ii) fishertest(ii[1], ii[2]). This is typically how it is done for functions that either do not transform as easily as in #1 above, or for functions you cannot or do not want to modify. You can either assign this intermediary function to a variable (which makes it no longer anonymous, figure that) and pass that intermediary to apply, or just pass it directly to apply, a la:
.func <- function(ii) fishertest(ii[1], ii[2])
apply(eg, 1, .func)
## equivalently
apply(eg, 1, function(ii) fishertest(ii[1], ii[2]))
There are two reasons why many people opt to name the function: (1) if the function is used multiple times, better to define once and reuse; (2) it makes the apply line easier to read than if it contained a complex multi-line function definition.
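Another way to feed a two-argument function with every pair, offered here only as a sketch, is Map(), which walks the two index columns in parallel and leaves the function signature untouched. pair_fun below is a hypothetical stand-in for fishertest that returns a one-row data.frame.

```r
eg <- expand.grid(i = 1:2, j = 1:3)

# Hypothetical stand-in for fishertest(x, y): one row of results per pair.
pair_fun <- function(x, y) {
  data.frame(rat = x, human = y, score = x * 10 + y)
}

# Map calls pair_fun(eg$i[1], eg$j[1]), pair_fun(eg$i[2], eg$j[2]), and so on.
rows <- Map(pair_fun, eg$i, eg$j)

# Stack the one-row data.frames into a single result.
result <- do.call(rbind, rows)
```

Because each call returns a data.frame, no transposing is needed and mixed column types (names and numbers) survive intact.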
As a side note, there are some gotchas with using apply and family that, if you don't understand them, will be confusing. Not the least of which is that when your function returns vectors, the matrix returned from apply will need to be transposed (with t()), after which you'll still need to rbind or otherwise aggregate.
This is one area where using ddply may provide a more readable solution. There are several tutorials showing it off. For a quick intro, read this; for a more in depth discussion on the bigger picture in which ddply plays a part, read Hadley's Split, Apply, Combine Strategy for Data Analysis paper from JSS.
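The transposition gotcha mentioned in the side note can be seen in a tiny sketch. The function f is invented for illustration and returns two named values per row.

```r
eg <- expand.grid(i = 1:2, j = 1:3)

# Hypothetical row function returning a named vector of length 2.
f <- function(v) c(sum = v[["i"]] + v[["j"]], prod = v[["i"]] * v[["j"]])

res <- apply(eg, 1, f)  # one COLUMN per input row: a 2 x 6 matrix
dim(res)

out <- cbind(eg, t(res))  # transpose so each input row lines up with its results
out
```

apply puts the result of each row into a column, which is why t() is needed before binding the results back onto eg.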