How to replace values with NA based on a condition - r

I am trying to use Chauvenet's criterion to remove outliers in R. Chauvenet's criterion detects outliers using a probability band based on the mean and SD. Some information about the criterion and the source of the code I'm using:
https://influentialpoints.com/Training/chauvenets_outlier-id-criterion.htm
https://github.com/tillrohrmann/edu-polytechnique-map553/blob/master/map553/Chauvenet.R
I have a large dataset of around 100,000 observations. I'm hoping the code will return all the values with a probability above 0.5, i.e. the data without outliers.
I am new to R. To ease the data cleaning process, I want to use this code (but with the condition < 0.5) to replace every value that does not fulfil the criterion with NA, either in the same data frame or in a new one, so that I can verify the outliers myself before removing them. This is my code:
Chauvenet <- function(datapoints, loop = TRUE){
  repeat {
    numdatapoints <- nrow(datapoints)
    # normalised distance of each value from its column mean, in SD units
    dist <- abs(scale(datapoints))
    # probability of seeing such a point, assuming the distribution is normal
    prob <- apply(dist, c(1, 2), function(x) numdatapoints * dnorm(x))
    # keep only the rows whose values all have a probability >= 0.5
    sel <- apply(prob, c(1, 2), function(x) x >= 0.5)
    idx <- rowSums(sel) == ncol(datapoints)
    datapoints <- datapoints[idx, ]
    # with loop = TRUE, reapply the criterion until no further rows are removed
    if (!loop || all(idx)) break
  }
  return(datapoints)
}
I would appreciate some help on how to modify this code.
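There is no posted answer for this one, but here is a minimal sketch of the NA replacement described above, assuming a data frame of numeric columns and the same numdatapoints * dnorm(z) probability used in the code; the function name chauvenet_na and the mtcars example are illustrative only, not from the original post.
# Sketch only: flag values failing Chauvenet's criterion as NA instead of
# dropping whole rows, so the flagged outliers can be inspected first
chauvenet_na <- function(df) {
  n <- nrow(df)
  z <- abs(scale(df))                                 # distance from the column mean, in SD units
  prob <- n * dnorm(z)                                # the same probability measure as above: n * dnorm(z)
  out <- as.data.frame(ifelse(prob < 0.5, NA, as.matrix(df)))
  names(out) <- names(df)
  out
}
flagged <- chauvenet_na(mtcars[, c("mpg", "hp")])     # illustrative data only
which(is.na(flagged), arr.ind = TRUE)                 # rows/columns of the flagged outliers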

Related

Is there a way to get the index of a list in R without match or which

I am trying to detect anomalies in the iris dataset by normalising the data into iris_norm, splitting that into a training and testing set, and then using the knn function to find anomalies. I can extract those anomalies from the normalised iris_test set but not from the actual iris dataset. Is there a way for me to use the indexes of the values in 'actual' as the indexes in iris? Here is my code:
library(gmodels)
library(class)
library(tidyverse)
# STEP 1: Import your dataset, look at a summary
summary(iris)
# STEP 2: Generate random indices to split the dataset.
ran <- sample(1:nrow(iris), 0.9 * nrow(iris))
# The normalization function is created
nor <-function(x) {(x -min(x))/(max(x)-min(x))}
# Run normalisation on the predictor columns
iris_norm <- as.data.frame(lapply(iris[,c(1,2,3,4)], nor))
##extract training set
iris_train <- iris_norm[ran,]
##extract testing set
iris_test <- iris_norm[-ran,]
# Extract the 5th column of the train dataset because it will be used as
# the 'cl' argument in the knn function.
iris_target_category <- iris[ran,5]
##extract the 5th column of the test dataset to measure the accuracy
iris_test_category <- iris[-ran,5]
##run knn function
pr <- knn(iris_train,iris_test,cl=iris_target_category,k=15)
##create confusion matrix
tab <- table(pr,iris_test_category)
##this function divides the correct predictions by the total number of predictions,
#which tells us how accurate the model is.
accuracy <- function(x){sum(diag(x)/(sum(rowSums(x)))) * 100}
accuracy(tab)
#create a cross table to see where the wrong predictions are
mytab <- CrossTable(iris_test_category, pr, FALSE)
#anomaly indexes
anomalies_index <- which(iris_test_category != pr)
# get the anomaly values
anomaly_value1 <- iris_test[iris_test_category != pr, "Sepal.Length"]
anomaly_value2 <- iris_test[iris_test_category != pr, "Sepal.Width"]
anomaly_value3 <- iris_test[iris_test_category != pr, "Petal.Length"]
anomaly_value4 <- iris_test[iris_test_category != pr, "Petal.Width"]
anomalies <- data.frame(anomaly_value1, anomaly_value2,
anomaly_value3, anomaly_value4)
actual <- iris_test[anomalies_index,]
print(anomalies)
print(actual)
I found the solution a few minutes later. All I had to do was:
actual_index <- as.numeric(rownames(actual))
iris[actual_index,]
and I was able to extract the correct values.
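An equivalent approach, sketched here purely as an illustration rather than as part of the original post, is to map the test-set positions back to the original row numbers directly:
# Illustrative alternative: recover the original iris row numbers of the test set
test_rows <- setdiff(1:nrow(iris), ran)    # rows of iris that went into iris_test
iris[test_rows[anomalies_index], ]         # same rows as the rownames() approach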

Creating an R function using randomization test of differences

Write a function that will allow the user to input a vector of numerical values with no missing values ("the data"), and a vector of 1's and 2's representing two different groups that you want to compare ("the treatments"). The number of 1's and 2's does not need to be equal. You may assume, for now, that treatment 2 has a higher mean than treatment 1.
The function will create the randomization distribution of differences and plot them in a histogram. It will use the distribution to calculate the p-value -- the chance that the observed difference (or higher) could have occurred by chance. It will print the observed difference and the p-value, both rounded to 4 digits, using the text:
"The observed difference is xxxx and the p-value is xxxx"
Using these two vectors I have determined how to get the differences but do not know how to put it into a function and implement a randomization test.
dat<- c(1,4,2,5,2,4,8,6,9,7)
trt <- c(1,1,1,1,1,2,2,2,2,2)
How to find the observed difference:
obsdiff <- mean(dat[trt == 2]) - mean(dat[trt == 1])
How to 'shuffle the treatments':
trtsh <- sample(trt, size = length(trt))
How to find a difference simulated under the null hypothesis, i.e., the difference in means for shuffled treatment 2 minus treatment 1:
simdiff <- mean(dat[trtsh == 2]) - mean(dat[trtsh == 1])
The p-value using these vectors should be .011
dat<- c(1,4,2,5,2,4,8,6,9,7)
trt <- c(1,1,1,1,1,2,2,2,2,2)
In general, it is a good idea to coerce all of your data into a data frame, a la
data.frame(dat, trt) -> mydata
Now you can calculate your obsdiff as
obsdiff <- mean(mydata$dat[mydata$trt == 2]) - mean(mydata$dat[mydata$trt == 1])
Here's one way you can shuffle your treatment values using a for-loop:
simdiff <- numeric(10000)
for(j in 1:10000){
  cat(paste(j, '\n'))    # optional: print loop progress
  trtsh <- sample(trt)
  mydatash <- data.frame(dat, trtsh)
  simdiff[j] <- mean(mydatash$dat[mydatash$trtsh == 2]) - mean(mydatash$dat[mydatash$trtsh == 1])
}
For help with plotting, see ?hist (e.g. hist(simdiff)).
Now you just need to wrap the pieces above into a function that calculates the proportion of simdiff values greater than or equal to obsdiff (the p-value) and prints the required text.
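A minimal sketch of such a wrapper, built only from the pieces above; the function name randtest and the nsim argument are illustrative, not part of the original answer.
randtest <- function(dat, trt, nsim = 10000) {
  obsdiff <- mean(dat[trt == 2]) - mean(dat[trt == 1])
  simdiff <- replicate(nsim, {
    trtsh <- sample(trt)                               # shuffle the treatment labels
    mean(dat[trtsh == 2]) - mean(dat[trtsh == 1])      # difference under the null
  })
  hist(simdiff, main = "Randomization distribution of differences")
  pval <- mean(simdiff >= obsdiff)                     # chance of a difference at least this large
  cat("The observed difference is", round(obsdiff, 4),
      "and the p-value is", round(pval, 4), "\n")
  invisible(list(obsdiff = obsdiff, pval = pval))
}
randtest(c(1,4,2,5,2,4,8,6,9,7), c(1,1,1,1,1,2,2,2,2,2))   # p-value close to 0.011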

R: looped variable assignment, augmenting variable calculation each time

I am trying to calculate a regression variable based on a range of variables in my dataset. I would like the regression variable (e.g. Threshold1) to be calculated using a different variable in each iteration of running the regression.
The aim is to collect SSR values for each threshold range, and thus identify the ideal threshold based on the data.
Data (df) variables: Yield, Prec, Price, 0C, 1C, 2C, 3C, 4C, 5C, 6C, 7C, 8C, 9C, 10C
Each loop calculates "thresholds" by selecting a different "b" each time.
a <- df$0C
b <- df$1C
Threshold1 <- (a-b)
Threshold2 <- (b)
Where "b" would be changing in each loop, ranging from 1C to 9C.
Each individual threshold set (1 and 2) should be used to run a regression, and the SSR saved for comparison with the subsequent regression using thresholds based on a new "b" value (again ranging from 1C to 9C).
Regression:
reg <- lm(log(Yield)~Threshold1+Threshold2+log(Price)+prec+I(prec^2),data=df)
For each loop of the regression, I vary the components used to calculate the thresholds. My current approach is centered around the following code:
df <- read.csv("Data.csv",header=TRUE)
names(df)
0C-9Cvarlist <- names(df)[9:19]
ssr.vec <- matrix(,21,1)
for(i in 1:length(varlist)){
a <- df$0C
b <- df$[i]
Threshold1 <- (a-b)
Threshold2 <- (b)
reg <- lm(log(Yield)~Threshold1+Threshold2+log(Price)+prec+I(prec^2),data=df)
r2 <- summary(reg)$r.squared
ssr.vec[i,] <- c(varlist,r2)
}
colnames(ssr.vec) <- c("varlist","r2")
I am failing to achieve the desired result with the above approach.
Thank you.
I can spot quite a few mistakes...
You need to add the variables of interest (Threshold1 and Threshold2) to the data used in the regression. Also, I think you need to select varlist[i], not varlist, when filling ssr.vec. ssr.vec is a matrix and needs 2 columns, so declare it with 2 columns. You also cannot use something like df$[i] to extract a column! Why is the matrix of length 21?! And change the column names to C0,..,C9 rather than 0C,..,9C, since names that start with a digit are not valid R names (df$0C does not work).
For future reference, solve the simple errors before asking a question, and include error messages in your post!
This should do the job:
df <- read.csv("Data.csv", header = TRUE)
names(df)[9:19] <- paste0("C", 0:10)        # rename 0C..10C to syntactic names C0..C10
varlist <- names(df)[10:19]                 # C1..C10: the columns that b loops over
ssr.vec <- matrix(NA, length(varlist), 2)
for(i in seq_along(varlist)){
  a <- df$C0
  b <- df[[varlist[i]]]
  df$Threshold1 <- (a - b)
  df$Threshold2 <- (b)
  reg <- lm(log(Yield) ~ Threshold1 + Threshold2 + log(Price) + prec + I(prec^2), data = df)
  r2 <- summary(reg)$r.squared
  ssr.vec[i, ] <- c(varlist[i], r2)
}
colnames(ssr.vec) <- c("varlist", "r2")
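As a small follow-up (not part of the posted answer), once the loop has run you can pick out the "b" column that gives the best fit from ssr.vec, for example:
# Illustrative only: the threshold variable with the highest R^2
ssr.df <- as.data.frame(ssr.vec, stringsAsFactors = FALSE)
ssr.df$r2 <- as.numeric(ssr.df$r2)
ssr.df$varlist[which.max(ssr.df$r2)]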

Custom AUC in R with different thresholds and binary predictions

I am looking to plot a FPR vs TPR point on an AUC graph for different thresholds.
For example, data$C2 is the true response column of my data frame (either 0 or 1). I want to make a vector of predicted values (0 or 1) depending on whether data$C1 (a different measurement column) is above or below a specified threshold. Here is the function I've attempted with the ROCR package.
library(ROCR)
fun <- function(data, col1, col2){
  d <- seq(0.10, 0.30, 0.01)                       # various thresholds to be tested
  plot(NULL, xlim = c(0, 1), ylim = c(0, 1),
       xlab = "False positive rate", ylab = "True positive rate")
  for (i in seq_along(d)) {
    temp <- ifelse(data[, col1] > d[i], 1, 0)      # predicted responses at threshold d[i]
    pred <- prediction(temp, data[, col2])         # compare predictions with the true values
    perf <- performance(pred, "tpr", "fpr")        # TPR/FPR for this threshold
    plot(perf, add = TRUE)                         # add this threshold's point to the plot
  }
}
Is the problem because temp is a list vector and the true labels are from a matrix? Am I applying this for loop incorrectly?
Thanks in advance!
Edit: Here's my attempt to do this manually without the ROCR package.
plot(NULL, xlim = c(0, 1), ylim = c(0, 1), xlab = "FPR", ylab = "TPR")
for(t in seq(0.40, 0.60, 0.01)){                       # I want to do this for every t in the sequence
  p <- sum(data$C2 == 1, na.rm = TRUE)                 # total number of actual positives
  n <- sum(data$C2 == 0, na.rm = TRUE)                 # total number of actual negatives
  test <- ifelse(data$C1 > t, 1, 0)                    # make the prediction vector
  TP <- sum(test == 1 & data$C2 == 1, na.rm = TRUE)    # count correct positive predictions
  FP <- sum(test == 1 & data$C2 == 0, na.rm = TRUE)    # count false positives
  points(x = FP/n, y = TP/p)                           # add every (FPR, TPR) pair to the plot
}
I hope I understand your question correctly, but I think that by AUC graph you mean the ROC curve. The ROC curve already takes the different thresholds into account when making those classification decisions. See this Wikipedia page; I found this picture particularly helpful.
If the above is right, then all you need to do in your code is:
pred <- prediction(data[,col1], data[,col2])
perf <- performance(pred, "tpr","fpr")
plot(perf)
If you would like to add a different curve to that plot, perhaps because you used a different classification technique (e.g. a decision tree instead of logistic regression), then use plot(perf2, add=TRUE), where perf2 is created in the same way as perf. See the documentation.
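A self-contained illustration of the answer's three lines, using simulated scores and labels; the simulated data and variable names here are purely for demonstration and are not from the original post.
library(ROCR)
set.seed(1)
scores <- runif(200)                         # hypothetical continuous measurements (like C1)
labels <- rbinom(200, 1, prob = scores)      # hypothetical 0/1 true responses (like C2)
pred <- prediction(scores, labels)           # ROCR handles all thresholds internally
perf <- performance(pred, "tpr", "fpr")
plot(perf)                                   # ROC curve
performance(pred, "auc")@y.values[[1]]       # area under the curve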

How to Generate Normal Random Samples within Mean±3Sigma

I want to draw normal random numbers into an array of order (100 x 8) x 5000 with a specific mean (M) and standard deviation (S), but I want them to lie only within the range M ± 3S, so that the array contains no outliers exceeding those limits.
Any suggestions? I want to write a program in R based on this array for some simulation studies. I am using the following R code to generate my dataset:
for(i in 1:5000){
for(j in 1:8){
Dat[,j,i]=rnorm(100,mean=muu[j],sd=sigma[j])
}
}
Now we want to get rid of the values that fall outside muu ± 3*sigma in the above data. The discarded values definitely have to be replaced with fresh values so that the dimensions of the Dat array stay intact.
First Solution
Here is a start but I bet there is a more elegant solution.
First generate a sample; the next step is to subset it to your desired range (with a normal distribution, roughly 99.7% of draws already fall within mean ± 3 SD, so little is lost). Of course, you have to adjust the values to your needs.
set.seed(123)
rs <- rnorm(10000, mean = 10, sd = 3)
rs1 <- rs[rs >= 10 - 3*3 & rs <= 10 + 3*3]   # keep only values within mean ± 3*SD, i.e. 1 to 19
Second (better) solution
I think my first solution didn't work so well. I have just written some code that might be perfect for your purposes. Here are the steps:
create an array of NAs with the required dimensions
fill it with random numbers
create a logical array that is TRUE wherever a value falls outside the desired bounds
index the data with that array and replace those out-of-bounds values with the mean used to generate the samples
data <- array(NA, dim = c(100, 8, 5000))
for(i in 1:5000){
  data[ , , i] <- rnorm(800, 3, 1)             # fill each 100 x 8 slice with N(3, 1) draws
}
bound <- 3 + c(-1, 1) * 3 * 1                  # mean ± 3 * sd, i.e. c(0, 6)
pr <- data <= bound[1] | data >= bound[2]      # TRUE wherever a value is out of bounds
data[pr] <- 3                                  # replace out-of-bounds values with the mean
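Since the question asks for the discarded values to be replaced with fresh draws rather than with the mean, here is a sketch of a rejection-sampling variant; the helper rnorm_trunc and the example muu/sigma values are illustrative and not from the posted answer.
# Redraw out-of-range values until everything lies within mean ± 3*sd
rnorm_trunc <- function(n, mean, sd) {
  x <- rnorm(n, mean, sd)
  bad <- abs(x - mean) > 3 * sd
  while (any(bad)) {
    x[bad] <- rnorm(sum(bad), mean, sd)   # resample only the out-of-range values
    bad <- abs(x - mean) > 3 * sd
  }
  x
}
Dat <- array(NA, dim = c(100, 8, 5000))
muu <- rep(10, 8); sigma <- rep(3, 8)     # illustrative means and SDs
for (i in 1:5000) {
  for (j in 1:8) {
    Dat[, j, i] <- rnorm_trunc(100, muu[j], sigma[j])
  }
}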
