How to compute RMSE with missing values? - r

I have a dataset with 679 rows and 16 columns, about 30% of which are missing values. So I decided to impute these missing values with the impute.knn function from the impute package, and I got a 679 × 16 dataset without the missing values.
But now I want to check the accuracy using the RMSE, and I tried two options:
load the hydroGOF package and apply the rmse function
sqrt(mean((obs - sim)^2, na.rm = TRUE))
In both cases I get the error: Error in sim - obs : non-numeric argument to binary operator.
This is happening because the original data set contains NA values (some values are missing).
How can I calculate the RMSE if I remove the missing values? Then obs and sim will have different lengths.

How about simply...
sqrt(sum((df$model - df$measure)^2, na.rm = TRUE) / nrow(df))
This obviously assumes your data frame is called df. You also have to decide on your N: nrow(df) includes the rows with missing data, so if you want to exclude those from the N observations (I'd guess yes), use sum(!is.na(df$measure)) instead of nrow(df). Or, following @Joshua, just
sqrt(mean((df$model - df$measure)^2, na.rm = TRUE))
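A minimal sketch showing how the choice of N plays out (the toy data frame and its model/measure columns are made up for illustration):

```r
# Toy data: one measurement is missing
df <- data.frame(model   = c(1.0, 2.0, 3.0, 4.0),
                 measure = c(1.1, NA,  2.7, 4.2))

# N = all rows (includes the row with the missing measurement)
rmse_all <- sqrt(sum((df$model - df$measure)^2, na.rm = TRUE) / nrow(df))

# N = complete pairs only -- usually what you want
n_complete <- sum(!is.na(df$measure))
rmse_cc <- sqrt(sum((df$model - df$measure)^2, na.rm = TRUE) / n_complete)

# mean() with na.rm = TRUE uses the complete-pair N automatically
rmse_mean <- sqrt(mean((df$model - df$measure)^2, na.rm = TRUE))
```

The first version silently shrinks the RMSE because the NA row still counts toward N; the last two agree.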

The rmse() function in R package hydroGOF has an NA-remove parameter:
# require(hydroGOF)
rmse(sim, obs, na.rm=TRUE, ...)
which, according to the documentation, does the expected when na.rm is TRUE:
"When an ’NA’ value is found at the i-th position in obs OR sim, the i-th value
of obs AND sim are removed before the computation."
Without a minimal reproducible example, it's hard to say why that didn't work for you.
If you want to eliminate the missing values before you input to the hydroGOF::rmse() function, you could do:
my.rmse <- rmse(df.sim[rownames(df.obs[!is.na(df.obs$col_with_missing_data), ]), ],
                df.obs[!is.na(df.obs$col_with_missing_data), ])
assuming you have the "simulated" (imputed) and "observed" (original) data sets in different data frames named df.sim and df.obs, respectively, that were created from the same original data frame and therefore have the same dimensions and row names.
Here is one way to do the same thing if you have more than one column with missing data:
rows.wout.missing.values <- with(df.obs,
  rownames(df.obs[!is.na(col_with_missing_data1) &
                  !is.na(col_with_missing_data2) &
                  !is.na(col_with_missing_data3), ]))
my.rmse <- rmse(df.sim[rows.wout.missing.values,], df.obs[rows.wout.missing.values,])
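A shorter route for the multi-column case is complete.cases(). A minimal sketch with made-up df.obs/df.sim stand-ins (the RMSE is computed by hand here so the snippet does not depend on hydroGOF being installed):

```r
# Fake stand-ins: df.obs has NAs in one column, df.sim is fully observed
set.seed(42)
df.obs <- data.frame(a = rnorm(10), b = rnorm(10))
df.obs$a[c(2, 5)] <- NA
df.sim <- data.frame(a = rnorm(10), b = rnorm(10))

# complete.cases() is TRUE for rows of df.obs with no NA in any column,
# so it replaces a hand-written chain of !is.na(...) & !is.na(...)
keep <- complete.cases(df.obs)

obs <- as.matrix(df.obs[keep, ])
sim <- as.matrix(df.sim[keep, ])
my.rmse <- sqrt(mean((sim - obs)^2))  # hand-rolled RMSE over the kept rows
```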

Related

Divide specific values in a column by 1000

I need to divide certain values in a column by 1000 but do not know how to go about it.
I attempted to use this function initially:
test <- Updins(weight,)
test$weight <- as.numeric(test$weight) / 1000
head(test)
with Updins being the data frame and weight the column, just to see if it would at least divide the entire column by 1000, but no such luck: it did not recognise 'test' as a variable.
Can anyone provide any guidance? I'm very new to R :)
If Updins is the dataset object name, we can select columns with [ and not with (, as ( is used for function invocation:
test <- Updins['weight']
test$weight <- as.numeric(test$weight) / 1000
Here is a fake data set with code to divide all rows by 1000. I also included for-loops as one potential way to only do this for certain rows. Since you didn't specify how you were selecting those rows, I did it for any rows with a value greater than 1,005, and a second version that only divides by 1,000 if the ID is an odd number. If you have NAs you may need an additional if statement to deal with them; the third/last for-loop shows an example of that.
ID <- 1:10
grams <- 1000:1009
df <- data.frame(ID, grams)
df$kg <- as.numeric(df$grams) / 1000
df[, "kg"] <- as.numeric(df[, "grams"]) / 1000  # will do the same thing as the line above
for (i in 1:nrow(df)) {
  if (df[i, "grams"] > 1005) { df[i, "kg3"] <- as.numeric(df[i, "grams"]) / 1000 }
}  # if the weight is greater than 1,005 grams
for (i in 1:nrow(df)) {
  if (df[i, "ID"] %in% seq(1, 101, by = 2)) { df[i, "kg4"] <- as.numeric(df[i, "grams"]) / 1000 }
}  # if the ID is an odd number
df[3, "grams"] <- NA  # add an NA to the weight data to test the next loop
for (i in 1:nrow(df)) {
  if (is.na(df[i, "grams"]) & (df[i, "ID"] %in% seq(1, 101, by = 2))) {
    df[i, "kg4"] <- NA
  } else if (df[i, "ID"] %in% seq(1, 101, by = 2)) {
    df[i, "kg4"] <- as.numeric(df[i, "grams"]) / 1000
  }
}  # same as above, but works with NAs
Hard without data to work with or expected output, but here's a skeleton that you could probably use:
library(dplyr)  # the package you'll need, for the pipe (%>%), which passes objects from one line to the next
test <- Updins %>%  # using the dataset Updins
  mutate(weight = ifelse(as.numeric(weight) > 199,  # CHANGING the weight variable: where weight > 199...
                         as.character(as.numeric(weight) / 1000),  # ...divide a numeric version of weight by 1000, but keep it as a character...
                         weight))  # OTHERWISE, keep the weight variable as is
head(test)
I kept the new value as a character, because it seems that your weight variable is a character variable based on some of the warnings ('NAs introduced by coercion') that you're getting.
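For what it's worth, the row-wise loops above can usually be collapsed into a single vectorized ifelse(). A minimal sketch using the same made-up grams/kg columns (the 1,005-gram cutoff is just the example threshold from the loop answer):

```r
# Fake data matching the loop examples above, with a planted NA
df <- data.frame(ID = 1:10, grams = 1000:1009)
df$grams[3] <- NA

# One vectorized line replaces the loop: rows over the cutoff get a kg value,
# everything else (including NAs) gets NA
df$kg <- ifelse(!is.na(df$grams) & df$grams > 1005,
                as.numeric(df$grams) / 1000,
                NA)
```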

'R', 'mice', missing variable imputation - how to only do one column in sparse matrix

I have a matrix that is half-sparse: half of all cells are blank (NA), so when I try to run mice it tries to work on all of them. I'm only interested in a subset.
Question: in the following code, how do I make mice only operate on the first two columns? Is there a clean way to do this using row-lag or row-lead, so that the content of the previous row can help patch holes in the current row?
set.seed(1)
#domain
x <- seq(from=0,to=10,length.out=1000)
#ranges
y <- sin(x) +sin(x/2) + rnorm(n = length(x))
y2 <- sin(x) +sin(x/2) + rnorm(n = length(x))
#kill 50% of cells
idx_na1 <- sample(x=1:length(x),size = length(x)/2)
y[idx_na1] <- NA
#kill more cells
idx_na2 <- sample(x=1:length(x),size = length(x)/2)
y2[idx_na2] <- NA
#assemble base data
my_data <- data.frame(x,y,y2)
#make the rest of the data
for (i in 3:50) {
  my_data[, i] <- rnorm(n = length(x))
  idx_na2 <- sample(x = 1:length(x), size = length(x) / 2)
  my_data[idx_na2, i] <- NA
}
#imputation
library(mice)  # needed for mice() and complete()
est <- mice(my_data)
data2 <- complete(est)
str(data2[,1:3])
Places that I have looked for answers:
help document (link)
google of course...
https://stats.stackexchange.com/questions/99334/fast-missing-data-imputation-in-r-for-big-data-that-is-more-sophisticated-than-s
I think what you are looking for can be done by modifying the where parameter of the mice function. The where parameter takes a matrix (or data frame) with the same size as the dataset on which you are carrying out the imputation. By default it is equal to is.na(data): a matrix with cells equal to TRUE where the value is missing in your dataset and FALSE otherwise. This means that by default, every missing value in your dataset will be imputed. If you want to change this and only impute the values in a specific column (in my example, column 2), you can do this:
# Define arbitrary matrix with TRUE values when data is missing and FALSE otherwise
A <- is.na(data)
# Replace all the other columns which are not the one you want to impute (let say column 2)
A[,-2] <- FALSE
# Run the mice function
imputed_data <- mice(data, where = A)
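To make the mechanics of that mask concrete, here is a tiny runnable sketch of the where matrix itself (toy data frame, no mice call needed; the x/y columns are invented):

```r
# Toy data frame with NAs in both columns
data <- data.frame(x = c(1, NA, 3), y = c(NA, 2, 3))

A <- is.na(data)   # TRUE wherever a value is missing
A[, -2] <- FALSE   # keep TRUE only in column 2 -> only y would be imputed
A
```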
Instead of the where argument, a faster way might be to use the method argument. You can set this argument to "" for the columns/variables you want to skip. The downside is that automatic determination of the method will not work. So:
imp <- mice(data,
            method = ifelse(colnames(data) == "your_var", "logreg", ""))
But you can get the default methods from the documentation:
defaultMethod
... By default, the method uses:
pmm, predictive mean matching (numeric data)
logreg, logistic regression imputation (binary data, factor with 2 levels)
polyreg, polytomous regression imputation for unordered categorical data (factor > 2 levels)
polr, proportional odds model (ordered, > 2 levels)
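To see the shape the method argument expects, a tiny sketch of building such a method vector by hand (the column names are invented for illustration; mice also ships a make.method() helper that produces the defaults for a real dataset):

```r
# A named character vector, one entry per column; "" means "skip this column"
cols <- c("age", "bmi", "chl")
meth <- setNames(ifelse(cols == "bmi", "pmm", ""), cols)
meth  # only bmi would be imputed, with predictive mean matching
```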
Your question isn't entirely clear to me. Are you saying you wish to only operate on two columns? In that case mice(my_data[,1:2]) will work. Or you want to use all the data but only fill in missing values for some columns? To do this, I'd just create an indicator matrix along the following lines:
isNA <- data.frame(apply(my_data, 2, is.na))
est <- mice(my_data)
mapply(function(x, isna) {
  x[isna] <- NA
  return(x)
}, <each MI mice return object column-wise>, isNA)
For your final question, "can I use mice for rolling data imputation?" I believe the answer is no. But you should double check the documentation.

Removing matrix rows when outliers outside a given limit are found in columns

I'm trying to figure out how to remove a whole row when I find an outlier, outside a given limit, in a column of the same matrix. I have a data set with labeled columns (B, C, D, etc.) from which I want to remove outliers greater than 3 standard deviations. When an outlier is found, the whole row is to be removed; when done with one column, the same procedure is repeated for the next one.
I found this post: Removing matrix rows if values of a cloumn are outliers but the code there removes all outliers outside 1.5 standard deviations, not outside your own limit, right?
(I'm sorry if this is a basic question, I'm relatively new to R. I've only been coding with MatLab before.)
In this case you have to define your own function to identify the outliers. Try the following:
remove_outliers2 <- function(x, limit = 3) {
  mn <- mean(x, na.rm = TRUE)
  out <- limit * sd(x, na.rm = TRUE)
  x < (mn - out) | x > (mn + out)
}
This function will return a TRUE or FALSE vector of the same length as x. It will return TRUE when the element is an outlier.
To apply this function to all columns do:
apply(x, 2, remove_outliers2, limit = 3)
And then proceed to remove those rows that contain a TRUE.
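That last step can be sketched like this (fake data; remove_outliers2 is repeated here so the snippet runs on its own):

```r
# Same function as defined in the answer above
remove_outliers2 <- function(x, limit = 3) {
  mn <- mean(x, na.rm = TRUE)
  out <- limit * sd(x, na.rm = TRUE)
  x < (mn - out) | x > (mn + out)
}

set.seed(1)
m <- matrix(rnorm(100), ncol = 4)  # fake data: 25 rows x 4 columns
m[7, 2] <- 50                      # plant an obvious outlier

flags <- apply(m, 2, remove_outliers2, limit = 3)  # TRUE marks an outlier
clean <- m[!apply(flags, 1, any), ]                # drop rows with any TRUE
```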

Covariance matrices by group, lots of NA

This is a follow up question to my earlier post (covariance matrix by group) regarding a large data set. I have 6 variables (HML, RML, FML, TML, HFD, and BIB) and I am trying to create group specific covariance matrices for them (based on variable Group). However, I have a lot of missing data in these 6 variables (not in Group) and I need to be able to use that data in the analysis - removing or omitting by row is not a good option for this research.
I narrowed the data set down into a matrix of the actual variables of interest with:
MMatrix = MMatrix2[1:2187, 4:10]
This worked fine for calculating an overall covariance matrix with:
cov(MMatrix, use = "pairwise.complete.obs", method = "pearson")
So to get this to list the covariance matrices by group, I turned the original data matrix into a data frame (so I could use the $ indicator) with:
CovDataM <- as.data.frame(MMatrix)
I then used the following suggested code to get covariances by group, but it keeps returning NULL:
cov.list <- lapply(unique(CovDataM$group), function(x) cov(CovDataM[CovDataM$group == x, -1]))
I figured this was because of my NAs, so I tried adding use = "pairwise.complete.obs" as well as use = "na.or.complete" (when desperate) to the end of the code, and it only returned NULLs. I read somewhere that "pairwise.complete.obs" could only be used if method = "pearson", but adding that at the end didn't make a difference either. I need to get covariance matrices of these variables by group, with all the available data included if possible, and I am thoroughly stuck.
Here is an example that should get you going:
# Create some fake data
m <- matrix(runif(6000), ncol = 6,
            dimnames = list(NULL, c('HML', 'RML', 'FML', 'TML', 'HFD', 'BIB')))
# Insert random NAs
m[sample(6000, 500)] <- NA
# Create a factor indicating group levels
grp <- gl(4, 250, labels=paste('group', 1:4))
# Covariance matrices by group
covmats <- by(m, grp, cov, use='pairwise')
The resulting object, covmats, is a list with four elements (in this case), which correspond to the covariance matrices for each of the four groups.
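To show how to pull one group's matrix out of the by() result, here is the same sketch in miniature (smaller fake data, no NAs, so it runs standalone):

```r
# Small fake data: 40 rows, 6 columns, 4 groups of 10 rows each
set.seed(3)
m <- matrix(runif(240), ncol = 6,
            dimnames = list(NULL, c('HML', 'RML', 'FML', 'TML', 'HFD', 'BIB')))
grp <- gl(4, 10, labels = paste('group', 1:4))

covmats <- by(m, grp, cov, use = 'pairwise')

covmats[['group 2']]  # the 6 x 6 covariance matrix for group 2
length(covmats)       # one matrix per group
```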
Your problem is that lapply is treating your list oddly. If you run this code (which I hope is pretty much analogous to yours):
CovData <- matrix(1:75, 15)
CovData[3,4] <- NA
CovData[1,3] <- NA
CovData[4,2] <- NA
CovDataM <- data.frame(CovData, "group" = c(rep("a",5),rep("b",5),rep("c",5)))
colnames(CovDataM) <- c("a","b","c","d","e", "group")
lapply(unique(as.character(CovDataM$group)), function(x) print(x))
You can see that lapply is evaluating the list in a different manner than you intend. The NAs don't appear to be the problem. When I run:
by(CovDataM[ ,1:5], CovDataM$group, cov, use = "pairwise.complete.obs", method = "pearson")
It seems to work fine. Hopefully that generalizes to your problem.

Using apply function to obtain only those rows that passes threshold in R

I am trying to apply a filter to my data (which is in the form of a matrix) with, say, 10 columns and 200 rows.
I want to retain only those rows where the coefficient of variation is greater than a threshold. But with the code I have, it seems to print the coefficient of variation for the rows passing the threshold; I want it to just test whether a row passes the threshold but print the original data points from the matrix.
covar <- function(x) sd(x) / mean(x)
evar <- apply(myMatrix, 1, covar)
myMatrix_filt_var <- myMatrix[evar > 2, ]
The threshold I set here is 2.
What am I doing wrong ? Sorry just learning R.
Thanks!
If m is your matrix, then,
m[apply(m, 1, function(x) sd(x)/mean(x) > 2), ]
should give you the filtered matrix. The idea is to obtain the coefficient of variation for every row and check whether it is > 2. This returns a logical vector; by indexing with it directly, as in m[logical_vector, ], we get those rows where the condition is TRUE.
You can use na.rm = TRUE if you want to remove NA values while calculating sd and mean.
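As a sketch of that, using fake data (cv here is just the question's covar function with na.rm added; drop = FALSE keeps the result a matrix even if only one row passes):

```r
# Fake matrix with one NA, to show the na.rm version of the filter
set.seed(7)
m <- matrix(rnorm(50, mean = 1), ncol = 5)  # 10 rows x 5 columns
m[2, 3] <- NA

cv <- function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
filtered <- m[apply(m, 1, cv) > 2, , drop = FALSE]  # keep only high-CV rows
```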
