I want to find outliers and eliminate them from my data (named "df"):
> head(df)
cluster machine.code age Good.Times repair.price
1 1 13010132 23 58.54 198170000
2 1 13010129 23 105.25 390847500
3 1 13010131 23 20.50 20701747
4 1 13010072 18 14.30 22340000
5 1 13010101 18 57.63 13220000
6 1 13010106 27 49.96 254450000
My data has 65 clusters and I want to run the outlier detection within each cluster separately.
I used the code below for outlier detection on a single cluster before, and it worked fine:
library("ggstatsplot")
df<- read.csv("C:/Users/gadmin/Desktop/dataE.csv",header = TRUE)
ggbetweenstats(df,cluster, repair.price , outlier.tagging = TRUE)
Q <- quantile(df$repair.price, probs=c(.25, .75), na.rm = FALSE)
iqr <- IQR(df$repair.price)
up <- Q[2]+1.5*iqr # Upper Range
low<- Q[1]-1.5*iqr # Lower Range
eliminated<- subset(df, df$repair.price > (Q[1] - 1.5*iqr) & df$repair.price < (Q[2]+1.5*iqr))
ggbetweenstats(eliminated, cluster, repair.price, outlier.tagging = TRUE)
Now I want to do the same thing for all 65 clusters using a "for" loop, something like this:
for(i in 1:length(unique(df$cluster))) {
...
}
but I don't know how (I mean the part where, after detecting the outliers in the first cluster, the subset should be replaced and the process continued for the next cluster).
Core question
There are various ways to detect outliers. As for the core of your question, I understand it as "How do I subset the data so I can apply a for-loop to remove the outliers for each cluster?"
# insert a column "id" (identical to the row number) to identify individual entries
df$id <- seq_len(nrow(df))
# make a list to store the outlier ids for each cluster
outlrs <- list()
# loop through the clusters
for (clust in unique(df$cluster)) {
  sub <- df[df$cluster == clust, ]
  # outlier detection step*: here the IQR rule from your question
  Q <- quantile(sub$repair.price, probs = c(.25, .75), na.rm = TRUE)
  iqr <- IQR(sub$repair.price, na.rm = TRUE)
  outlrs[[as.character(clust)]] <- sub$id[sub$repair.price < (Q[1] - 1.5*iqr) |
                                          sub$repair.price > (Q[2] + 1.5*iqr)]
}
# remove the outliers by id
outlier_ids <- unlist(outlrs)
df <- df[!df$id %in% outlier_ids, ]
* the outlier detection step should ultimately output the ids of the rows containing the outliers; above, the IQR rule from your question is filled in, and this part would have to be adapted if you use another method of outlier identification.
I didn't test it since I don't have enough of your data. You could use e.g. dput(df) to output a version of your data that you can copy and paste, making it accessible to people who want to test their proposed solutions.
Edit: one (of many) alternative approaches
Alternatively, you could apply the functions you included in your question on a subset of the data within the loop, store the cleaned-up output e.g. as a list and subsequently apply do.call(rbind.data.frame, your_list) to the list.
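A minimal, untested sketch of that alternative, reusing the IQR code from your question (cleaned and sub are just illustrative names):
cleaned <- list()
for (clust in unique(df$cluster)) {
  sub <- df[df$cluster == clust, ]
  Q <- quantile(sub$repair.price, probs = c(.25, .75), na.rm = TRUE)
  iqr <- IQR(sub$repair.price, na.rm = TRUE)
  # keep only the rows inside the IQR fences for this cluster
  cleaned[[as.character(clust)]] <- subset(sub, repair.price > (Q[1] - 1.5*iqr) &
                                                repair.price < (Q[2] + 1.5*iqr))
}
df_clean <- do.call(rbind.data.frame, cleaned)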
Note
As Phil pointed out, it is questionable whether outliers should be removed at all, especially when you're just applying a loop that "takes care of them". While we can provide the means by which "outliers" can be removed programmatically, whether you should actually remove those outliers in a given situation is another question (probably more appropriate for CrossValidated). It should also be noted that there are many algorithms to determine which values differ "significantly" from the bulk of values, and the boundary between "significant" and "not significant" is arbitrary.
Related
I'm using the first principal component from a PCA analysis as an explanatory variable in a forecasting model that forecasts recursively using Kalman filtering. In other words, at each point in time, the model updates and produces a new forecast based on the new observation included in the model. Since PCA uses data from all observations included in the model for its calculations, I also need to run the PCAs recursively, using only the observations prior to the point in time that I am forecasting (otherwise, the PCA result could reveal information about the future and help the model produce a more accurate forecast than it would have otherwise). I think a loop might be the solution, but I am struggling with how to formulate the code.
As a more specific example, consider the following data.frame:
data <- as.data.frame(rbind(c(6,15,23),c(9,11,22), c(7,13,23), c(6,12,25),c(7,13,23)))
names(data) <- c("V1","V2","V3")
> data
V1 V2 V3
1 6 15 23
2 9 11 22
3 7 13 23
4 6 12 25
5 7 13 23
At each observation date, I wish to run a PCA (function prcomp() from the stats package) on all observations up to, and including, that observation. So I first want to run PCA on the first two observations:
pca2 <- prcomp(data[1:2,], scale = TRUE)
next I want to run PCA with the first, second and third observation as input
pca3 <- prcomp(data[1:3,], scale = TRUE)
next I want to run PCA with the first, second, third and fourth observation as input
pca4 <- prcomp(data[1:4,], scale = TRUE)
and so on, until the last run of the PCA, which includes all observations in the data frame. For each of these "runs" of the PCA, I wish to extract the last value of the first principal component (PC1) (though for pca2 I use both the first and second values), and merge these into a final data frame, where each monthly observation is the last value of PC1 from the corresponding run.
The principal component outputs are:
> my_pca2 <- as.data.frame(pca2$x)
> my_pca2
PC1 PC2
1 -1.224745 -5.551115e-17
2 1.224745 5.551115e-17
> my_pca3 <- as.data.frame(pca3$x)
> my_pca3
PC1 PC2 PC3
1 -1.4172321 -0.2944338 6.106227e-16
2 1.8732448 -0.1215046 3.330669e-16
3 -0.4560127 0.4159384 4.163336e-16
> my_pca4 <- as.data.frame(pca4$x)
> my_pca4
PC1 PC2 PC3
1 -1.03030993 -1.10154914 0.015457199
2 2.00769890 0.07649216 0.011670433
3 0.03301806 -0.24226508 -0.033461874
4 -1.01040702 1.26732205 0.006334242
So I want my final output to be a data frame that looks like
>final.output
PC1
1 -1.224745
2 1.224745
3 -0.4560127
4 -1.01040702
Comment: yes, it looks a bit weird with the first two values, but please don't pay too much attention to that. My point is that I wish to build a data frame consisting of the last calculated value of the first principal component for each of the PCA runs.
I am thinking that a for loop might be the best solution here, but I have not found any threads that guide me toward a coding solution. How can I make the loop use an increasing amount of the data frame in the calculations? Does anyone have any suggestions/tips/links? Any help is much appreciated!
I had a very similar approach.
PCA <- vector("list", length = nrow(data) - 1)
for (i in 1:(nrow(data) - 1)) {
  # for the first run keep both scores, afterwards only the last one
  if (i == 1) j <- 1:2 else j <- i + 1
  PCA[[i]] <- as.data.frame(prcomp(data[1:(1 + i), ], scale = TRUE)$x)[j, 1]
}
unlist(PCA)
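To get this into the shape of your desired final.output, you could wrap the result in a data frame (untested):
final.output <- data.frame(PC1 = unlist(PCA))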
You can use a for loop. It's maybe not the most efficient solution, but it will work.
First, you create an empty list to store your results:
all_results <- list()
Next, you iterate from 2 to the number of rows of data with a loop. For each iteration of the loop, run prcomp on data[1:i,]. You can directly create your PCA data frame and extract PC1 from it as a vector, then store it in the list at index i - 1:
for (i in 2:nrow(data)) {
  all_results[[i - 1]] <- as.data.frame(prcomp(data[1:i, ], scale = TRUE)$x)$PC1
}
Now to extract all the results, you use lapply (list apply) to extract only the last element from each PC1 vector:
PC1 <- lapply(all_results, function(pca) pca[length(pca)] )
Now you convert these from a list of single elements to a vector:
PC1 <- do.call("c", PC1)
Finally, you want to stick the first value of the first analysis back on to the front of this vector:
PC1 <- c(all_results[[1]][1], PC1)
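If you want the result as a data frame matching your desired final.output, a last (untested) step could be:
final.output <- data.frame(PC1 = PC1)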
Consider the following data:
library(Benchmarking)
d <- data.frame(x1=c(200,200,3000), x2=c(200,200,1000), y=c(100,100,3))
So I have 3 observations.
Now I want to select 2 observations randomly out of d three times (without repetition; there are three combinations in total). For each of these three times I want to calculate the following:
e <- dea(d[c('x1', 'x2')], d$y)
weighted.mean(eff(e), d$y)
That is, I will get three numbers, of which I want to calculate the average. Can someone show how to do this with a loop in R?
Example:
There are three combinations in total, so I can only get the same results in this case. If I do the calculation manually, I get the following three results:
0.977 0.977 1
(The results could of course be in another order).
And the mean of these three numbers is:
0.984
This is a simple example. In my case I have a lot of combinations, and I don't select all of them (e.g. there could be, say, 1,000,000 combinations, of which I only select 1,000).
I think it's better if you use sample.int and replicate instead of going through all the combinations; see my example:
nsample <- 2 # Number of selected observations
nboot <- 10 # Number of times you repeat the process
replicate(nboot, with(d[sample.int(nrow(d), nsample), ],
weighted.mean(eff(dea(data.frame(x1, x2), y)), y)))
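Since you ultimately want the average of those numbers, you can take the mean of the replicate result (res is just a holder name; untested):
res <- replicate(nboot, with(d[sample.int(nrow(d), nsample), ],
                             weighted.mean(eff(dea(data.frame(x1, x2), y)), y)))
mean(res)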
I have also checked the link you brought up regarding this issue. If I got it right, you want to extract two rows (observations) each time without replacement; you can use sample:
SelObs <- sample(1:nrow(d),2)
# for getting the selected observations just
dSel <- d[SelObs,]
And then do your calculations
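That is, re-using the expressions from your question on the selected rows (untested):
e <- dea(dSel[c('x1', 'x2')], dSel$y)
weighted.mean(eff(e), dSel$y)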
If you want the already selected observations not to be selected in the next random selection, it is similar, but you need an index:
Obs <- 1:nrow(d)
SelObs <- sample(Obs, 2)
dSel <- d[SelObs, ]
# and now, remove the already selected observations from the index
Obs <- setdiff(Obs, SelObs)
# and keep going with the next random selections using the code above
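Putting that together, a rough untested sketch of a loop that repeats the selection, applies your calculation from the question, and averages the results at the end (results and dSel are just illustrative names; with the 3-row example d the index is exhausted after one draw, but on a larger data set it keeps drawing pairs):
Obs <- 1:nrow(d)
results <- c()
while (length(Obs) >= 2) {
  SelObs <- sample(Obs, 2)                # draw 2 of the remaining observations
  dSel <- d[SelObs, ]
  e <- dea(dSel[c('x1', 'x2')], dSel$y)   # your calculation from the question
  results <- c(results, weighted.mean(eff(e), dSel$y))
  Obs <- setdiff(Obs, SelObs)             # don't re-select these observations
}
mean(results)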
So I am working with biological data at a hospital (I won't disclose anything here, but I won't need to in order to ask this question). We are looking at concentrations of antibodies taken over a certain amount of time. There are, for one reason or another, missing data points all over our data set. What I am doing is trying to remove the missing data points along with their corresponding times. Right now the basic goal is just to get some basic graphs and charts up and running, but eventually we're going to want to create some logistic models and nonlinear dynamics models, which we'll do in another language.
1) First I put my data into a vector along with its corresponding time:
data <- read.csv("blablabla.csv", header = TRUE)
Biomarker <- data[,2]
time <- data[,1]
2) Then I sort the data:
Biomarker <- Biomarker[order(time)]
time <- sort(time, decreasing = F)
3) Then I put the indexes of the NA values into a vector:
NA_Index <- which(is.na(Biomarker))
4) Then I try to remove the data points at those indexes from both the biomarker and time vectors:
i <- 1
n <- length(NA_Index)
for(i in 1:n){
Biomarker[[NA_Index[i]]] <- NULL
time[[NA_Index[i]]] <- NULL
}
I have also tried a few things different from the one above:
1)
Biomarker <- Biomarker[-NA_Index[i]]
2)
Biomarker <- Biomarker[!= "NA"]
My question is: "How do I remove NA values from my vectors and remove the time with the same index?"
So obviously I am very new to R and might be going about this in a completely wrong way. I just ask that you explain what the functions do if you post some code. Thanks for the help.
First I'd recommend storing your data in a data.frame instead of two vectors; since the entries in the vectors correspond to cases, this is a more appropriate data structure.
my_table <- data.frame(time=time, Biomarker=Biomarker)
Then you can simply subset the whole data.frame; the first dimension is rows, the second is columns, as usual. Leave the second dimension empty to keep all columns.
my_table <- my_table[!is.na(my_table$Biomarker), ]
> BioMarker
[1] 1 2 NA 3 NA 5
> is.na(BioMarker)
[1] FALSE FALSE TRUE FALSE TRUE FALSE
> BioMarker[is.na(BioMarker)]
[1] NA NA
> BioMarker[! is.na(BioMarker)]
[1] 1 2 3 5
> BioMarker <- BioMarker[! is.na(BioMarker)]
> BioMarker
[1] 1 2 3 5
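If you prefer to keep the two separate vectors from your question, you can apply the same logical index to both so that the biomarker values and their times stay aligned (a minimal sketch; keep is just a helper name):
keep <- !is.na(Biomarker)
Biomarker <- Biomarker[keep]
time <- time[keep]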
I was wondering if there is a simple way to check whether my zero values are excluded from my ANOVA.
I first changed all my zero values to NA with
BFL$logDecomposers[which(BFL$logDecomposers==0)] = NA
I'm not sure whether 'na.action=na.exclude' makes sure my values are being ignored (like I want them to be)?
standard<-lm(logDecomposers~1, data=BFL) #null model
ANOVAlnDeco.lm<-lm(logDecomposers~Species_Number,data=BFL,na.action=na.exclude)
anova(standard,ANOVAlnDeco.lm)
P.S.:I've just been using R for a few weeks, and this website has been of tremendous help to me :)
You haven't given a reproducible example, but I'll make one up.
set.seed(101)
mydata <- data.frame(x=rnorm(100),y=rlnorm(100))
## add some zeros
mydata$y[1:5] <- 0
As pointed out by @Henrik, you can use the subset argument to exclude these values:
nullmodel <- lm(y~1,data=mydata,subset=y>0)
fullmodel <- update(nullmodel,.~x)
It's a little confusing, but na.exclude and na.omit (the default) actually lead to the same fitted model; the difference is in whether NA values are included when you ask for residuals or predicted values. You can try it out:
mydata2 <- within(mydata,y[y==0] <- NA)
fullmodel2 <- update(fullmodel,subset=TRUE,data=mydata2)
(subset=TRUE turns off the previous subset argument, by specifying that all the data should be included).
You can compare the fits (coefficients etc.). One shortcut is to use the nobs method, which counts the number of observations used in the model:
nrow(mydata) ## 100
nobs(nullmodel) ## 95
nobs(fullmodel) ## 95
nobs(fullmodel2) ## 95
nobs(update(fullmodel,subset=TRUE)) ## 100
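To see the residual/prediction difference mentioned above, a small untested sketch using the mydata2 object defined earlier (fit_omit and fit_exclude are just illustrative names):
fit_omit <- lm(y ~ x, data = mydata2, na.action = na.omit)
fit_exclude <- lm(y ~ x, data = mydata2, na.action = na.exclude)
length(residuals(fit_omit))     ## 95: the rows with NA are dropped
length(residuals(fit_exclude))  ## 100: the rows with NA are padded with NA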
I have a dataset consisting of monthly observations for returns of US companies. I am trying to exclude from my sample all companies which have less than a certain number of non NA observations.
I managed to do what I want using foreach, but my dataset is very large and this takes a long time. Here is a working example which shows how I accomplished what I wanted and hopefully makes my goal clear
#load required packages
library(data.table)
library(foreach)
#example data
myseries <- data.table(
X = sample(letters[1:6],30,replace=TRUE),
Y = sample(c(NA,1,2,3),30,replace=TRUE))
setkey(myseries,"X") #so X is the company identifier
#here I create another data table with each company identifier and its number
#of non NA observations
nobsmyseries <- myseries[,list(NOBSnona = length(Y[complete.cases(Y)])),by=X]
# then I select the companies which have less than 3 non NA observations
comps <- nobsmyseries[NOBSnona <3,]
#finally I exclude all companies which are in the list "comps",
#that is, I exclude companies which have less than 3 non NA observations
#but I do it for each of the companies in the list, one by one,
#and this is what makes it slow.
for (i in 1:dim(comps)[1]){
  myseries <- myseries[X != comps$X[i],]
}
How can I do this more efficiently? Is there a data.table way of getting the same result?
If you have more than one column you wish to consider for NA values, you can use complete.cases(.SD); however, as you want to test a single column, I would suggest something like
naCases <- myseries[, list(nonNA = sum(!is.na(Y))), by = X]
You can then join, given a threshold for the number of non-NA values,
e.g.
threshold <- 3
myseries[naCases[nonNA > threshold]]
You could also select using a not-join to get the cases you have excluded:
myseries[!naCases[nonNA > threshold]]
As noted in the comments, something like
myseries[, nonNA := sum(!is.na(Y)), by = X][nonNA > 3]
would work; however, in this case you are performing a vector scan on the entire data.table, whereas the previous solution performs the vector scan on a data.table with only length(unique(myseries[['X']])) rows.
Given that this is a single vector scan it will be efficient regardless (and perhaps a binary join plus a small vector scan may be slower than one larger vector scan); however, I doubt there will be much difference either way.
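If you want to check the difference on your own data, a rough untested sketch using the microbenchmark package (assuming it is installed; copy() keeps the := step from modifying myseries by reference):
library(microbenchmark)
microbenchmark(
  join = myseries[naCases[nonNA > threshold]],
  scan = copy(myseries)[, nonNA := sum(!is.na(Y)), by = X][nonNA > 3],
  times = 100
)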
How about aggregating the number of non-NA values in Y over X, and then subsetting?
# Aggregate the number of non-NA observations per company
num_obs <- as.data.table(aggregate(Y ~ X, data = myseries,
                                   FUN = function(x) sum(!is.na(x))))
# Subset: keep only companies with at least 3 non-NA observations
myseries[X %in% num_obs$X[num_obs$Y >= 3], ]