Correlation matrix problems in R

Good afternoon,
I am a beginner with RStudio, and I need to make a correlation matrix. For this I have used:
cor(final_base, method = "pearson")
but since I get many NA values in the matrix, I have tried:
cor(final_base, method = "pearson", na.rm = TRUE)
so that such data is not counted, but then I get an error that says:
unused argument (na.rm = TRUE)
How can I make the matrix without taking the NAs in my database into account?
Thank you

That's because na.rm is not a valid argument for the cor function. Use the argument use instead, which is "an optional character string giving a method for computing covariances in the presence of missing values." In your case, you can set this to "complete.obs", as in:
cor(final_base, method = "pearson", use = "complete.obs")
See the documentation under ?cor.
Also, check out this note on the confusing (and therefore discouraged) "pairwise.complete.obs" option and whether to exclude missing values from correlation matrices in the first place. Instead, consider being explicit about these missings:
If you want to run correlations on lots of vectors with missing values, consider simply using the R default of use = "everything" and propagating missing values into the correlation matrix. This makes it clear what you don’t know.
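For instance, a small data frame invented here illustrates the difference between the two settings:
df <- data.frame(a = c(1, 2, NA, 4, 5),
                 b = c(2, 4, 6, 8, 10),
                 c = c(5, NA, 3, 2, 1))
cor(df, method = "pearson")                        # default use = "everything": entries involving a column with NAs are NA
cor(df, method = "pearson", use = "complete.obs")  # rows 2 and 3 are dropped before computing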

Related

Error when trying to make confusion matrix

cm = table(obs = test[,14], pred)
Error in if (xi > xj) 1L else -1L: missing value where TRUE/FALSE needed
I am trying to output the confusion matrix of my random forest model on the testing data, but I'm getting this error. Any ideas what the issue might be?
Thank you in advance!
The error message tells us that one of the items in test[,14] or pred is missing (NA), and the table() function you are using cannot handle missing values. I expect you can get a confusion matrix by first eliminating the elements of both vectors where either vector is NA.
Note that the table() function you are using does not seem to be the base R table() function. I expect it is part of a package you have loaded.
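For example, a minimal sketch of that filtering, using the names from your code:
ok <- !is.na(test[, 14]) & !is.na(pred)  # positions where both vectors are observed
cm <- table(obs = test[ok, 14], pred = pred[ok])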

Is there a way in R to ignore a "." in my data when calculating mean/sd/etc

I have a large data set for which I need to calculate mean/std dev/min/max on several columns. The data set uses a "." to denote when a value is missing for a subject. When running the mean or sd function, this causes R to return NA. Is there a simple way around this?
My code is just this:
xCAL<-mean(longdata$CAL)
sdCAL<-sd(longdata$CAL)
minCAL<-min(longdata$CAL)
maxCAL<-max(longdata$CAL)
but R returns NA for all of these variables, and I get the following warning:
Warning message:
In mean.default(longdata$CAL) :
argument is not numeric or logical: returning NA
You need to convert your data to numeric to be able to do any calculations on it. When you run as.numeric, your . will be converted to NA, which is what R uses for missing values. Then, all of the functions you mention take an argument na.rm that can be set to TRUE to remove (rm) missing values (na).
If your data is a factor, you need to convert it to character first to avoid loss of information as explained in this FAQ.
Overall, to be safe, try this:
longdata$CAL <- as.numeric(as.character(longdata$CAL))
xCAL <- mean(longdata$CAL, na.rm = TRUE)
sdCAL <- sd(longdata$CAL, na.rm = TRUE)
# etc
Do note that na.rm is an argument of each individual function - it's not magic that works everywhere. If you look at the help pages for ?mean, ?sd, ?min, etc., you'll see the na.rm argument documented. If you want to remove missing values in general, the na.omit() function works well.
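For instance, with a toy vector invented here:
x <- c(1, 2, NA, 4)
mean(x)                # NA: the missing value propagates
mean(x, na.rm = TRUE)  # 2.333..., computed over the observed values only
na.omit(x)             # 1 2 4, with an attribute recording the dropped position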

Writing/applying "subtract the mean"-function to standardize regression parameters

I was trying to write and apply a seemingly easy function that would standardize my continuous regression parameters/predictors. The reason is that I want to deal with multicollinearity.
So instead of writing x - mean(x, na.rm = T) each time, I'm looking for something handier that does the job for me - not least because I wanted to exercise writing functions in R. ;)
So here is what I tried:
fun <- function(data.frame, x){
  data.frame$x - mean(data.frame$x, na.rm=T)
}
Apparently this is not too wrong. At least it doesn't return an error message.
However, applying fun to, say, the built-in mtcars dataset and, say, the variable disp yields this warning:
# Loading the data:
data("mtcars")
fun(mtcars, x = disp)  # I tried several ways, e.g. with and without "mtcars" in front
Warning message:
In mean.default(mtcars$x, na.rm = T) :
argument is not numeric or logical: returning NA
My guess is that it is about how I applied the function, because when I do manually what the function is supposed to do, it works perfectly.
Also, I was looking for similar questions on writing and applying such a function (also beyond the Stack Exchange universe), but I didn't find anything helpful.
Hope I didn't make a blunder due to my novice R-skills.
There is already a function in R which does what you want to do: scale().
You can just write scale(mtcars$hp, center = TRUE, scale = FALSE) which then subtracts the mean of the vector from the vector itself.
In combination with apply() this is powerful; you can, for example, center every column of your data frame by writing:
apply(dataframe, MARGIN = 2, FUN = scale, center = TRUE, scale = FALSE)
Before you do that, you have to make sure that scale() is a valid function for every column; you cannot scale factors or characters, for example.
Regarding your question: your function would have to look like this:
fun <- function(data.frame, x){
  data.frame[[x]] - mean(data.frame[[x]], na.rm=T)
}
and then when calling the function you would write fun(mtcars, "hp"), specifying the variable name in quotation marks. This is because of the special way the $ operator works: it does not evaluate its right-hand side, so data.frame$x looks for a column literally named x instead of using the value stored in x. The [[ operator, by contrast, accepts a character string.
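A quick sanity check that the corrected function matches scale():
centered <- fun(mtcars, "hp")
mean(centered)  # effectively zero, up to floating-point error
all.equal(centered, as.vector(scale(mtcars$hp, center = TRUE, scale = FALSE)))  # TRUE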

Error in family$linkinv(eta) : Argument eta must be a nonempty numeric vector

The title of this question is the error I am getting, because I simply do not know how to interpret it no matter how much I research. Whenever I run a logistic regression with bigglm() (from the biglm package, designed to run regressions over large amounts of data), I get:
Error in family$linkinv(eta) : Argument eta must be a nonempty numeric vector
This is what my bigglm() call looks like:
fit <- bigglm(f, data = df, family=binomial(link="logit"), chunksize=100, maxit=10)
where f is the formula and df is the data frame (a little over a million rows and about 210 variables).
So far I have tried changing my dependent variable to a numeric class but that didn't work. My dependent variable has no missing values.
Judging from the error message I wonder if this might have to do anything with the family argument in the bigglm() function. I have found numerous other websites with people asking about the same error and most of them are either unanswered, or for a completely different case.
The error Argument eta must be a nonempty numeric vector suggests to me that your data has either empty values or NAs, so please check your data. Whatever advice we provide here cannot be tested until we see your code or the steps that produce the error.
Try this:
is.na(df)          # if TRUE anywhere, the data contains NAs
df[is.na(df)] <- 0 # not sure replacing NA with 0 will be harmless for your model
or pass the na.rm = TRUE argument to whatever line of code is generating the NAs.
Again, we can only speculate. Hope it helps.
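If it helps, a quick way to locate the missing values first (df being the data frame from your question):
colSums(is.na(df))        # count of NAs per column
sum(!complete.cases(df))  # number of rows containing at least one NA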

Adding na.rm to Custom R Function with sapply

I am struggling to add an na.rm option to a custom function (just a percentage), seen below, on a data frame where each column is a point in time filled with prices of the securities identified in the rows. This data frame contains quite a few NAs. Here is the function:
pctabovepx = function(x) {
  count_above_px = x > pxcutoff
  100 * (sum(count_above_px) / nrow(count_above_px))
}
I then want to run this function within an sapply over all columns of my df with price data, as specified in the range below. Without adding an na argument it returns nothing ("numeric(0)"), but when I add an na.rm argument, as I would with a function like mean, it returns "Error in FUN(X[[1L]], ...) : unused argument (na.rm = TRUE)".
abovepar=sapply(master[min_range:max_range], pctabovepx)
abovepar=sapply(master[min_range:max_range], pctabovepx, na.rm=TRUE)
I also tried to simplify and just do a count before taking a percentage. The following command did not return an error, but it simply counted all values that were not NA instead of the subset with prices above the cutoff.
countsabovepx=as.data.frame(sapply(master[min_range:max_range],function(x) sum(!is.na(x>pxcutoff))))
I am wondering how to avoid this issue, both with this function and generally with self-written functions that aren't mean or median.
You need to add it to your function as an argument and pass it on to sum(). You also need to account for its effect on the nrow part. However, in the context of the rest of the function, I expect count_above_px to be a vector, and nrow does not make sense there: nrow() of a vector returns NULL, and dividing by NULL yields numeric(0), which is presumably why your first call returned nothing. I presume you meant length, in which case you are actually computing a mean, and mean() has the na.rm argument anyway. You might also want to look at pxcutoff, as it is not defined in the function - should it be passed as an argument too?
pctabovepx = function(x, na.rm = FALSE) {
  count_above_px = x > pxcutoff
  100 * mean(count_above_px, na.rm = na.rm)
}
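A quick check on invented toy values (pxcutoff is not defined inside the function, so it must exist in the calling environment):
pxcutoff <- 100
pctabovepx(c(90, 110, NA, 120), na.rm = TRUE)  # 66.67: two of the three observed prices exceed the cutoff
With that change, your second sapply call - sapply(master[min_range:max_range], pctabovepx, na.rm = TRUE) - runs as intended.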
