I want to find outliers in my data frame (named "df") and eliminate them:
> head(df)
cluster machine.code age Good.Times repair.price
1 1 13010132 23 58.54 198170000
2 1 13010129 23 105.25 390847500
3 1 13010131 23 20.50 20701747
4 1 13010072 18 14.30 22340000
5 1 13010101 18 57.63 13220000
6 1 13010106 27 49.96 254450000
My data has 65 clusters and I want to run the outlier detection within each cluster separately. I used the code below for outlier detection on a single cluster before, and it worked fine:
library("ggstatsplot")
df <- read.csv("C:/Users/gadmin/Desktop/dataE.csv", header = TRUE)
ggbetweenstats(df, cluster, repair.price, outlier.tagging = TRUE)
Q <- quantile(df$repair.price, probs = c(.25, .75), na.rm = FALSE)
iqr <- IQR(df$repair.price)
up <- Q[2] + 1.5 * iqr  # Upper Range
low <- Q[1] - 1.5 * iqr # Lower Range
eliminated <- subset(df, df$repair.price > (Q[1] - 1.5 * iqr) & df$repair.price < (Q[2] + 1.5 * iqr))
ggbetweenstats(eliminated, cluster, repair.price, outlier.tagging = TRUE)
Now I want to do the same thing for all 65 clusters using a "for" loop, something like this:
for(i in 1:length(unique(df$cluster))) {
...
}
but I don't know how. (I mean: after detecting the outliers in the first cluster, how should the data be subset so the process can continue to the next cluster?)
Core question
There are various ways to detect outliers. As for the core of your question, I understand it as "How do I subset the data so I can apply a for-loop to remove the outliers for each cluster?"
# maybe insert a column id that assigns an id (identical to the row number) to identify individual entries
df$id <- seq(1, nrow(df))
# make a list to store the outlier ids for each cluster
outlrs <- list()
# loop through the clusters
for(clust in unique(df$cluster)){
  subset <- df[df$cluster == clust, ]
  outlrs[[clust]] <- [INSERT YOUR OUTLIER DETECTION FUNCTION HERE*]
}
# remove the outliers by id (unlist collects the per-cluster ids; matching on
# the id column is safer than negative row indexing when cluster sizes differ)
outlier_ids <- unlist(outlrs)
df <- df[!df$id %in% outlier_ids, ]
* the outlier detection function you use should ultimately output the id of the row containing the outlier. This part would have to be adapted to your method of outlier identification.
I didn't test it since I have insufficient data. You could use e.g. dput(df) to output a version of your data you can copy and paste to make it accessible to people who want to test their proposed solutions.
Edit: one (of many) alternative approaches
Alternatively, you could apply the functions you included in your question on a subset of the data within the loop, store the cleaned-up output e.g. as a list and subsequently apply do.call(rbind.data.frame, your_list) to the list.
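For instance, a minimal sketch of that alternative, reusing the IQR fences from the question (the names cleaned and df_clean are illustrative, and na.rm = TRUE is assumed to be acceptable):
cleaned <- list()
for (clust in unique(df$cluster)) {
  sub <- df[df$cluster == clust, ]
  Q <- quantile(sub$repair.price, probs = c(.25, .75), na.rm = TRUE)
  iqr <- IQR(sub$repair.price, na.rm = TRUE)
  # keep only the rows inside the 1.5 * IQR fences for this cluster
  cleaned[[as.character(clust)]] <- subset(sub, repair.price > (Q[1] - 1.5 * iqr) & repair.price < (Q[2] + 1.5 * iqr))
}
df_clean <- do.call(rbind.data.frame, cleaned)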
Note
As Phil pointed out, it is questionable whether outliers should be removed at all, especially via a loop that just "takes care of them". While we can provide the means by which "outliers" can be removed programmatically, whether you should actually remove them in a given situation is a separate question (probably better suited to CrossValidated). It should also be noted that there are many algorithms for deciding which values differ "significantly" from the bulk, and the border between "significant" and not significant is ultimately arbitrary.
I've been practicing the basics in R (3.6.3) and I've been stuck trying to understand this problem for hours. This was the exercise:
Step 1: Generate a sequence of data between 1 and 3 of total length 100; use the jitter function (with a large factor) to add noise to your data.
Step 2: Compute the vector of rolling averages roll.mean with the average of 5 consecutive points. This vector has only 96 averages.
Step 3: add the vector of these averages to your plot
Step 4: generalize step 2 and step 3 by making a function with parameters consec (default=5) and y.
y88 = seq(1,3,0.02)
y = jitter(y88, 120, set.seed(1))
y = y[-99] # removed one guy so y can have 100 elements, as asked
roll.meanT = rep(0,96)
for (i in 1:length(roll.meanT)) # my 'reference i' is roll.mean[i], not y[i]
{
roll.meanT[i] = (y[i+4]+y[i+3]+y[i+2]+y[i+1]+y[i])/5
}
plot(y)
lines(roll.meanT, col=3, lwd=2)
This produced the expected plot (figure omitted).
Then I proceeded to generalize using a function (the exercise asks to generalize steps 2 and 3, so the data creation step was ignored and I consider y to remain constant):
fun50 = function(consec=5,y)
{
roll.mean <- rep(NA,96) # Apparently, we just leave NA's as NA's, since length(y) is always greater than length(roll.mean)
for (i in 1:96)
{
roll.mean[i] <- mean(y[i:i+consec-1]) # Using mean(), I'm able to generalize.
}
plot(y)
lines(roll.mean, col=3, lwd=2)
}
This gave me a completely different plot (figure omitted).
When I manually try to see if mean(y[1:5]) produces the right mean, it does. I know I could have already used the mean() function in the first part, but I would really like to get the same results using (y[i+4]+y[i+3]+y[i+2]+y[i+1]+y[i])/5 or mean(y[1:5],......).
You have the line
roll.mean[i] <- mean(y[i:i+consec-1]) # Using mean(), I'm able to generalize.
I believe your intention is to grab the values with indices i to (i+consec-1). Unfortunately for you, the : operator takes precedence over arithmetic operations.
> 1:1+5-1 #(this is what your code would do for i=1, consec=5)
[1] 5
> (1:1)+5-1 # this is what it's actually doing for you
[1] 5
> 2:2+5-1 #(this is what your code would do for i=2, consec=5)
[1] 6
> 3:3+5-1 #(this is what your code would do for i=3, consec=5)
[1] 7
> 3:(3+5-1) #(this is what you want your code to do for i=3, consec=5)
[1] 3 4 5 6 7
So, to fix it, just add parentheses:
roll.mean[i] <- mean(y[i:(i+consec-1)]) # Using mean(), I'm able to generalize.
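Putting it together, a sketch of the corrected, generalized function (one assumption beyond the original exercise: the hard-coded 96 is replaced by length(y) - consec + 1 so other values of consec also work):
fun50 <- function(consec = 5, y) {
  n.means <- length(y) - consec + 1  # 96 when length(y) == 100 and consec == 5
  roll.mean <- rep(NA, n.means)
  for (i in 1:n.means) {
    roll.mean[i] <- mean(y[i:(i + consec - 1)])  # parentheses force the intended range
  }
  plot(y)
  lines(roll.mean, col = 3, lwd = 2)
}
fun50(5, y)  # should now reproduce the original plot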
I have a list of data.frames which hold the data for each of the stages of a chemical process. Each of the data.frames has the same number of columns in the same order but the number of rows can vary for each of the data.frames.
See the example data below; fruits are standing in for chemical substances and reagents.
I've written a function to scale up the raw data and add the data to columns in the original data frames.
I have two problems. First, when I apply a scale factor, it is only correct for the last data.frame: the scale factor is derived from the last element of the last data.frame and then applied to the whole of that data.frame. I can generate the scale factor for the next-to-last data.frame by taking the weights of the common fruit (chemical) shared between the two data.frames (always in the last and first rows respectively) and dividing the weights in the same manner as we got the first scale factor, then multiplying throughout this data.frame, and repeating until I reach the first data.frame. The other problem is: when I use lapply to apply the scale_up function over the list, how can I feed it these scale factors so that each one is only applied to its particular data.frame?
example.data <- list(
stage1 <- data.frame(code=c("aaa", "ooo", "bbb"),
stuff=c("Apples","Oranges","Bananas"),
Mw=c(1,2,3),
Density=c(3,2,1),
Assay=c(8,9,1),
Wt=c(1,2,3), stringsAsFactors = FALSE),
stage2 <- data.frame(code=c("bbb","mmm","ccc","qqq","ggg"),
stuff=c("Bananas","Mango","Cherry","Quince","Gooseberry"),
Mw=c(8,9,10,1,2),
Density=c(23,32,55,5,4),
Assay=c(0.1,0.3,0.4,0.4,0.9),
Wt=c(45,23,56,99,2), stringsAsFactors = FALSE),
stage3 <- data.frame(code=c("ggg","bbb","ggg","bbb"),
stuff=c("Gooseberry","Bread","Grapes","Butter"),
Mw=c(9,8,9,10),
Density=c(34,45,67,88),
Assay=c(10,10,46,52),
Wt=c(24,56,31,84), stringsAsFactors = FALSE)
)
scale_up <- function(inventory,scale_factor,vessel_volume_L, NoBatches = 1) {
## This function accepts a data.frame with Molecule, Mw, Density,
## Assay and Wt columns
## It takes a scale factor and vessel volume and returns input
## charges and fill volumes
## rownames(inventory) <- inventory$smiles
inventory <- inventory[,-1] ## the rownames are given the smiles designation
## and the smiles column is removed
## volumes and moles are calculated for the given data
inventory$Vol <- round((inventory$Wt / inventory$Density) , 3)
inventory$Moles <- round((inventory$Wt / inventory$Mw) , 3)
inventory$Equivs <- round((inventory$Moles / inventory$Moles[1]) , 3)
inventory[,paste0(scale_factor,"xWt_kg")] <- round((((inventory$Wt * scale_factor) / 1000 ) / NoBatches) , 3)
inventory[,paste(scale_factor,"xVol_L",sep="")] <- round((((inventory$Vol * scale_factor) / 1000 ) / NoBatches) , 3)
inventory$PerCentFill <- round((100 * cumsum(inventory[,paste(scale_factor,"xVol_L",sep="")]) / vessel_volume_L) , 2)
inventory
## at which point everything is in place to scale up
}
new.example.data <- lapply(example.data, scale_up, 20e3, 454)
> new.example.data[[1]]
stuff Mw Density Assay Wt Vol Moles Equivs 20000xWt_kg 20000xVol_L PerCentFill
1 Apples 1 3 8 1 0.333 1 1 20 6.66 1.47
2 Oranges 2 2 9 2 1.000 1 1 40 20.00 5.87
3 Bananas 3 1 1 3 3.000 1 1 60 60.00 19.09
So, I've scaled my original data (laboratory scale, grams) to see if it will fit in a roughly 100-gallon plant vessel (454 L), but the only stage that is scaled properly is the last one; the other two need those 'fiddle factors', and I need to apply the 'fiddle factors' to each of the stages as I loop (presumably a for loop rather than lapply) through the list.
(P.S. I tried to ask this earlier, but I disguised my example too much and just confused the Stack Overflowers.)
Based on the details mentioned in this post and the linked question Chaining dataframes in a list, here's the solution I have come up with:
Extract the weights of the first and last fruit of each stage into a matrix like this:
wts <- sapply(example.data, function(t) c(t$Wt[1], t$Wt[nrow(t)]), simplify = TRUE)
Declare a global variable final.wt holding the target final weight you started from:
final.wt <<- 20000
Create a scales function to calculate the scaling factor for each corresponding stage:
scales <- function(x, final.wt) {
  n <- ncol(x)
  nscales <- numeric(n)
  for (i in n:1) {
    if (i == n) {
      # last stage: target weight divided by the stage's final Wt
      .GlobalEnv$final.wt <- final.wt / x[2, i]
      nscales[i] <- .GlobalEnv$final.wt
    } else {
      # earlier stages: chain backwards through the weight of the
      # component shared with the following stage
      .GlobalEnv$final.wt <- .GlobalEnv$final.wt * x[1, i + 1] / x[2, i]
      nscales[i] <- .GlobalEnv$final.wt
    }
  }
  return(nscales)
}
This gives you a vector of the desired scaling factors for each stage:
scale.fact <- scales(wts, final.wt)
Now you can call scale_up using mapply like this:
mapply(scale_up, example.data, scale.fact, 454)
The values in scale.fact are:
42858.0 2857.2 238.1
Each value will be passed by mapply as the scale_factor for its corresponding stage.
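If you'd rather not rely on mapply recycling the single 454 across all three stages, an equivalent call passes the constant vessel volume through MoreArgs:
mapply(scale_up, example.data, scale.fact, MoreArgs = list(vessel_volume_L = 454))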
This question has been asked before, but the solutions posed only partially solve my problem, and I've been working on this for days now. I felt it was time to seek help, even if the topic has been addressed previously. I apologize for any inconvenience.
I have a very large data.frame in R with 6288 observations of 11 variables. I want to run a Shapiro test by group on each variable, but grouped by two different factors (Number and Treatment). A much reduced sample data set with one variable is provided for example:
data <- data.frame(Number=c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2),
Treatment=c("High","High","High","High","High","High","Low",
"Low","Low","Low","Low","Low","High","High","High",
"High","High","High","Low","Low","Low","Low","Low",
"Low"),
FW=c(746,500,498,728,626,580,1462,738,1046,568,320,578,654,664,
660,596,1110,834,486,548,688,776,510,788))
I want to run a Shapiro test on FW by Number and by Treatment, so I'd have a test for 1High, 1Low, 2High, 2Low, etc. I'd like to have data for both the W statistic and the P-value. The original dataset contains 16 observations per group (1High,1Low,etc.; total groups=400), and an occasional NA; this sample dataset contains 6 observations per group (1High, 1Low, 2High, 2Low; groups=4).
The following code was previously posted as a solution to this problem of shapiro tests by groups:
res<-aggregate(cbind(P.value=data$FW)~data$Number+data$Treatment,data,FUN=shapiro.test)
I've also experimented with a number of other ways of grouping, but nothing seems to work. The above code comes closest.
The code above using aggregate groups my data appropriately and gives me the W statistic, but it won't give me the p-value (the column header says "P.value", but this is not the p-value; it's the W statistic, which I've confirmed several ways). It also gives me the following warning message:
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
corrupt data frame: columns will be truncated or padded with NAs
When I did a Google search for this warning, the results suggest it is a bug in the data.frame, but I can't figure out how to solve it. I'm not even sure it really is a bug in this case.
Can anyone help by providing some insight into the warning message, or another way to do the Shapiro test by group?
You're getting that warning because shapiro.test returns a list and aggregate expects the result of the aggregation to be a vector or a single number.
aggregate sees the list, takes the first element of the list by default, and tells you why it's unhappy (in admittedly vague terms). But it still gives you the Shapiro-Wilk statistic since that's the first element of the list returned from shapiro.test.
You can make a slight modification to your existing code that will get you what you want without issue:
aggregate(formula = FW ~ Number + Treatment,
data = data,
FUN = function(x) {y <- shapiro.test(x); c(y$statistic, y$p.value)})
# Number Treatment FW.W FW.V2
# 1 1 High 0.88995051 0.31792857
# 2 2 High 0.78604502 0.04385663
# 3 1 Low 0.93305840 0.60391888
# 4 2 Low 0.86456934 0.20540230
Note that the rightmost columns correspond to the statistic and p-value.
This is directly extracting the statistic and p-value from the list, thereby making the result of aggregation a single vector, which makes aggregate happy.
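One wrinkle: aggregate stores the two numbers in a single matrix column named FW, which is why the headers above read FW.W and FW.V2. If you prefer ordinary columns, a common follow-up (a sketch; naming the second element p.value is just for readability) is to rebuild the data frame:
res <- aggregate(formula = FW ~ Number + Treatment,
                 data = data,
                 FUN = function(x) { y <- shapiro.test(x); c(y$statistic, p.value = y$p.value) })
do.call(data.frame, res)  # flattens the FW matrix column into FW.W and FW.p.value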
Another option would be to use the data.table package, available from CRAN.
library(data.table)
DT <- data.table(data)
DT[,
.(W = shapiro.test(FW)$statistic, P.value = shapiro.test(FW)$p.value),
by = .(Number, Treatment)]
# Number Treatment W P.value
# 1: 1 High 0.8899505 0.31792857
# 2: 1 Low 0.9330584 0.60391888
# 3: 2 High 0.7860450 0.04385663
# 4: 2 Low 0.8645693 0.20540230
The dplyr package is handy for groupwise operations:
library(dplyr)
data %>%
group_by(Number, Treatment) %>%
summarise(statistic = shapiro.test(FW)$statistic,
p.value = shapiro.test(FW)$p.value)
Number Treatment statistic p.value
1 1 High 0.8899505 0.31792857
2 1 Low 0.9330584 0.60391888
3 2 High 0.7860450 0.04385663
4 2 Low 0.8645693 0.20540230
The simple dplyr answer didn't do it for me, as it did not run the Shapiro test on each grouped variable but only once, so here's my own solution using nesting (note that shapiro_test() comes from the rstatix package, and groupvar and quantvar are character strings naming the grouping column and the variable to test):
library(dplyr); library(tidyr); library(purrr)
library(rlang); library(rstatix)  # sym() and shapiro_test()
# e.g. groupvar <- "Treatment"; quantvar <- "FW"
shapiro <- data %>%
  group_by(!!sym(groupvar)) %>%
  group_nest() %>%
  mutate(shapiro = map(.data$data, ~ shapiro_test(.x, !!sym(quantvar)))) %>%
  select(-data) %>%
  unnest(cols = shapiro) %>%
  print()
Suppose I have a data frame in R where I would like to use two columns, "factor1" and "factor2", as factors, and I need to calculate the mean value of all other columns for each pair of those factors. After running the code below, the last line gives the following warnings:
Warning messages:
1: In split.default(seq_along(x), f, drop = drop, ...) :
data length is not a multiple of split variable
...
Why is it happening and what should I do to make it right?
Thanks.
Here is my code:
# Create data frame
myDataFrame <- data.frame(factor1=c(1,1,1,2,2,2,3,3,3), factor2=c(3,3,3,4,4,4,5,5,5), val1=c(1,2,3,4,5,6,7,8,9), val2=c(9,8,7,6,5,4,3,2,1))
# Split by 2 columns (factors)
splitDataFrame <- split(myDataFrame, list(myDataFrame$factor1, myDataFrame$factor2))
# Calculate mean value for each column per each pair of factors
splitMeanValues <- lapply(splitDataFrame, function(x) apply(x, 2, mean))
# Combine back to reduced table whereas there is only one value (mean) per each pair of factors
MeanValues <- unsplit(splitMeanValues, list(unique(myDataFrame$factor1), unique(myDataFrame$factor2)))
EDIT1: Added data frame creation (see above)
If you need to calculate the mean for all columns other than the factors, you can use the formula syntax of aggregate():
aggregate(.~factor1+factor2, myDataFrame, FUN=mean)
That returns
factor1 factor2 val1 val2
1 1 3 2 8
2 2 4 5 5
3 3 5 8 2
Your split() method didn't work because when you unsplit you must have the same number of rows as when you split your data, and you were reducing each group to just one row. Plus, unsplit really should be used with the exact same list of factors that was used to do the split, otherwise groups may get out of order. You could do a split, then lapply some collapsing function, and then rbind the list back into a single data.frame if you really wanted (see the sketch below), but for a simple mean, aggregate is probably best.
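For completeness, a sketch of that split/lapply/rbind route, assuming the numeric columns are val1 and val2 as in the example (drop = TRUE discards the empty factor combinations):
splitList <- split(myDataFrame, list(myDataFrame$factor1, myDataFrame$factor2), drop = TRUE)
meanList <- lapply(splitList, function(x) {
  data.frame(factor1 = x$factor1[1],
             factor2 = x$factor2[1],
             val1 = mean(x$val1),
             val2 = mean(x$val2))
})
do.call(rbind, meanList)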
The same result can be obtained with summaryBy() in the doBy package, although it's pretty much the same as aggregate() in this case.
> library(doBy)
> summaryBy( . ~ factor1+factor2, data = myDataFrame)
# factor1 factor2 val1.mean val2.mean
# 1 1 3 2 8
# 2 2 4 5 5
# 3 3 5 8 2
Have you tried aggregate?
aggregate(myDataFrame$valueColumn, by = list(myDataFrame$factor1), FUN = mean)
aggregate(myDataFrame$valueColumn, by = list(myDataFrame$factor2), FUN = mean)