Boxplot outlier removal causing problems with NbClust - r

I have a dataset from which I had to remove outliers. I used the boxplot method to remove them; however, I feel this method has changed the structure of my data from a table-like structure to just a list. I am trying to use NbClust to get a prediction of the number of clusters I should use. I also applied z-score scaling before attempting to use NbClust. I am really new to R, and I am not sure how to change the data back, or whether this is the reason the error is occurring with NbClust.
In the Global Environment panel, the data also changed from "846 obs. of 18 variables" before outlier removal to "List of 18" after outlier removal.
Error: Error in t(jeu) %*% jeu :
requires numeric/complex matrix/vector arguments
I think the correct thing is to change it into a data frame, but I am not sure how to do this correctly.
Data before outlier removal using boxplot method:
After outliers removed using boxplot method:
Reproducible example
library(reshape2)
library(NbClust)
vehData <-
structure(
list(
Samples = 1:6,
Comp = c(95L, 91L, 104L, 93L, 85L,
107L),
Circ = c(48L, 41L, 50L, 41L, 44L, 57L),
D.Circ = c(83L,
84L, 106L, 82L, 70L, 106L),
Rad.Ra = c(178L, 141L, 209L, 159L,
205L, 172L),
Pr.Axis.Ra = c(72L, 57L, 66L, 63L, 103L, 50L),
Max.L.Ra = c(10L,
9L, 10L, 9L, 52L, 6L),
Scat.Ra = c(162L, 149L, 207L, 144L, 149L,
255L),
Elong = c(42L, 45L, 32L, 46L, 45L, 26L),
Pr.Axis.Rect = c(20L,
19L, 23L, 19L, 19L, 28L),
Max.L.Rect = c(159L, 143L, 158L, 143L,
144L, 169L),
Sc.Var.Maxis = c(176L, 170L, 223L, 160L, 241L, 280L),
Sc.Var.maxis = c(379L, 330L, 635L, 309L, 325L, 957L),
Ra.Gyr = c(184L,
158L, 220L, 127L, 188L, 264L),
Skew.Maxis = c(70L, 72L, 73L,
63L, 127L, 85L),
Skew.maxis = c(6L, 9L, 14L, 6L, 9L, 5L),
Kurt.maxis = c(16L,
14L, 9L, 10L, 11L, 9L),
Kurt.Maxis = c(187L, 189L, 188L, 199L,
180L, 181L),
Holl.Ra = c(197L, 199L, 196L, 207L, 183L, 183L),
Class = c("van", "van", "saab", "van", "bus", "bus")
),
row.names = c(NA,
6L), class = "data.frame")
#Remove outliers
removeOutliers <- function(data) {
OutVals <- boxplot(data)$out
remOutliers <- sapply(data, function(x) x[!x %in% OutVals])
return (remOutliers)
}
# Scale data -> same as scale() function
z_score <- function(x){
return ((x - mean(x))/sd(x))
}
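vehClass <- vehData$Class # keep the class labels before the Class column is dropped
vehDataRemove1 <- vehData[, -1] # drop the Samples column
vehDataRemove2 <- vehDataRemove1[,-19] # drop the Class column
vehData <- vehDataRemove2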
#Begin removing outliers
removeOutliers1 <- removeOutliers(vehData)
removeOutliers2 <- removeOutliers(removeOutliers1)
removeOutliers3 <- removeOutliers(removeOutliers2)
removeOutliers4 <- removeOutliers(removeOutliers3)
cleanVehicleData <- removeOutliers4
cl_vehDataScale <- lapply(cleanVehicleData, z_score)
set.seed(26)
clusterNo <- NbClust(cl_vehDataScale, distance="euclidean", min.nc=2, max.nc=10,
method="kmeans", index="all")
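A likely cause of the "List of 18": sapply() can only simplify its result to a matrix when every column keeps the same number of values. Once different columns lose different numbers of outliers, it returns a list, lapply() keeps it a list, and NbClust then fails because it expects a numeric matrix or data frame. Note also that boxplot(data)$out pools the outliers of all columns together. The sketch below is one possible way around this, not the only one: it flags outliers per column with boxplot.stats() and drops whole rows, so the result stays rectangular. The helper name removeOutlierRows and the toy data are made up for illustration.

```r
# Hypothetical helper (not from the original post): drop every row that
# contains an outlier in any column, judged per column with the boxplot rule.
removeOutlierRows <- function(data) {
  isOutlier <- sapply(data, function(x) x %in% boxplot.stats(x)$out)
  data[rowSums(isOutlier) == 0, , drop = FALSE]
}

# Toy data, assumed for illustration only: 100 is an outlier in column a
toy <- data.frame(a = c(1, 2, 3, 4, 100), b = c(5, 6, 7, 8, 9))

cleanToy <- removeOutlierRows(toy)  # still a data frame
scaledToy <- scale(cleanToy)        # z-scores, returned as a numeric matrix

str(cleanToy)
str(scaledToy)
```

A matrix like scaledToy is the kind of object that can be passed to NbClust in place of the list. Dropping rows rather than single values also keeps the observations aligned with their class labels.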

Related

Bootstrapping/Monte Carlo Simulation in R

I am trying to follow this test:
Suppose I have the following data:
set.seed(123)
active_MJO <-c(6L, 2L, 11L, 20L, 62L, 15L, 2L, 51L, 58L, 100L, 45L, 44L, 49L,
86L, 28L, 1L, 1L, 40L, 79L, 99L, 86L, 50L, 9L, 78L, 45L, 100L,
77L, 44L, 45L, 93L)
inactive_MJO <-c(83L, 170L, 26L, 66L, 156L, 40L, 29L, 72L, 109L, 169L, 153L,
136L, 169L, 133L, 153L, 13L, 24L, 148L, 121L, 80L, 125L, 21L,
135L, 155L, 161L, 171L, 124L, 177L, 167L, 162L)
I don't know how to implement the above test in R.
I have tried the following but I am not sure if this is correct.
sig.test <- function (x){
a <- sample(active_MJO)
b <- sample(inactive_MJO)
sum(a > b)
}
runs <- 1000
sim <- sum(replicate(runs,sig.test(dat))+1)/(runs+1)
I think the above is not correct. Where can I put the 950/1000 condition?
Apologies, I am new to bootstrapping/Monte Carlo test.
I'll appreciate any help on this.
Sincerely,
Lyndz
First, it's important to note that they are sampling 30 frequency pairs. Since it's bootstrapping, those samples will be with replacement.
Then they compare the average active to average inactive. This is equivalent to:
1. comparing the sum of the active against the sum of the inactive from the 30 pairs, or
2. comparing the sum of the differences within each of the 30 pairs to zero.
They repeat the process 1000 times then compare the results of the 1000 comparisons to 950.
The following code performs #2:
set.seed(123)
active_MJO <-c(6L, 2L, 11L, 20L, 62L, 15L, 2L, 51L, 58L, 100L, 45L, 44L, 49L,
86L, 28L, 1L, 1L, 40L, 79L, 99L, 86L, 50L, 9L, 78L, 45L, 100L,
77L, 44L, 45L, 93L)
inactive_MJO <-c(83L, 170L, 26L, 66L, 156L, 40L, 29L, 72L, 109L, 169L, 153L,
136L, 169L, 133L, 153L, 13L, 24L, 148L, 121L, 80L, 125L, 21L,
135L, 155L, 161L, 171L, 124L, 177L, 167L, 162L)
diff_MJO <- active_MJO - inactive_MJO
sim <- sum(replicate(1e3, sum(sample(diff_MJO, 30, replace = TRUE)) > 0))
> sim
[1] 0
In this case, none of the 1000 replications resulted in an average active_MJO that was greater than the average inactive_MJO. This is unsurprising after plotting the histogram of sums of bootstrapped differences:
# Store the bootstrapped sums under a new name so diff_MJO is not overwritten
boot_sums <- replicate(1e5, sum(sample(diff_MJO, 30, replace = TRUE)))
hist(boot_sums)
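To answer the "where do I put the 950/1000 condition" part directly, here is a minimal sketch that counts the replications in each direction and compares the counts with 950. The variable names and the exact form of the cutoff (at least 950 out of 1000, checked in both directions) are assumptions, since the description of the test itself is not reproduced here.

```r
set.seed(123)
active_MJO <- c(6L, 2L, 11L, 20L, 62L, 15L, 2L, 51L, 58L, 100L, 45L, 44L, 49L,
                86L, 28L, 1L, 1L, 40L, 79L, 99L, 86L, 50L, 9L, 78L, 45L, 100L,
                77L, 44L, 45L, 93L)
inactive_MJO <- c(83L, 170L, 26L, 66L, 156L, 40L, 29L, 72L, 109L, 169L, 153L,
                  136L, 169L, 133L, 153L, 13L, 24L, 148L, 121L, 80L, 125L, 21L,
                  135L, 155L, 161L, 171L, 124L, 177L, 167L, 162L)
diff_MJO <- active_MJO - inactive_MJO

n_rep <- 1000     # number of bootstrap replications
threshold <- 950  # assumed cutoff: at least 950 of the 1000 replications

# Sum of the resampled pairwise differences for each replication
boot_sums <- replicate(n_rep, sum(sample(diff_MJO, length(diff_MJO), replace = TRUE)))

n_active_greater <- sum(boot_sums > 0)    # replications where active > inactive
n_inactive_greater <- sum(boot_sums < 0)  # replications where inactive > active

# The difference is called significant if either count reaches the cutoff
significant <- n_active_greater >= threshold || n_inactive_greater >= threshold
significant
```

Keeping the two counts separate shows which direction the difference goes in, which a single count compared with 950 would hide.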

calculate sum of values in dataframe based on values in other columns

I have a dataframe in R in which values correspond to value estimates and their margin of error (MoE).
Column names consist of a pattern, an indicator character (e = estimate, m = margin of error) and an ID that matches estimate and margin of error.
So, the column names look like "XXXe1, XXXm1, XXXe2, XXXm2, ...".
Goal
I am trying to create a function to (for each row):
1. Calculate the sum of the estimates. (That is pretty straightforward.)
2. Calculate the aggregated margin of error. This is the square root of the sum of the squares of each MoE.
Condition: the MoE of estimates marked as 0 should only be added once.
Examples:
In row 20, the aggregated MoE should only be sqrt(123^2).
In row 13, B01001e4 and B01001e5 are 0, so their MoE is only counted once.
So far, I have done the following to build a function that does this:
estimate_aggregator <- function(DF_to_write_on, New_column_name, source_df, pattern){
subset_df <- source_df[, grepl(pattern, names(source_df))] # I subset all the columns named with the pattern, regardless of whether they are estimate or margin of error
subset_df_e <- source_df[, grepl(paste0(pattern, "e"), names(source_df))] # I create a table with only the estimated values to perform the sum
DF_to_write_on[paste0(New_column_name, "_e")]<- rowSums(subset_df_e) # I write a new column in the new DF with the rowSums of the estimates values, having calculated the new estimate
return(DF_to_write_on)
}
What I am missing: a way to write in the new dataframe the result of selecting the XXXmYY values of those columns that have no 0 value in their corresponding estimate. If there is one or more 0 in the estimates, then I should include the MoE 123 in the calculation only once.
What would be the cleanest way to achieve this? I see that my struggle is on dealing with several columns at once and the fact that the values on the XXXeYY columns determine the selection of the XXXmYY ones.
Expected output
row 15: DF_to_write_on[paste0(New_column_name,"_m")] <- sqrt(176^2 + 117^2 + 22^2 + 123^2)
row 20: DF_to_write_on[paste0(New_column_name,"_m")] <- sqrt(123^2)
   B01001e1 B01001m1 B01001e2 B01001m2 B01001e3 B01001m3 B01001e4 B01001m4 B01001e5 B01001m5
15      566      176      371      117       14       22        0      123        0      123
20        0      123        0      123        0      123        0      123        0      123
Data
structure(list(B01001e1 = c(1691L, 2103L, 975L, 2404L, 866L,
2140L, 965L, 727L, 1602L, 1741L, 948L, 1771L, 1195L, 1072L, 566L,
1521L, 2950L, 770L, 1624L, 0L), B01001m1 = c(337L, 530L, 299L,
333L, 264L, 574L, 227L, 266L, 528L, 498L, 320L, 414L, 350L, 385L,
176L, 418L, 672L, 226L, 319L, 123L), B01001e2 = c(721L, 1191L,
487L, 1015L, 461L, 1059L, 485L, 346L, 777L, 857L, 390L, 809L,
599L, 601L, 371L, 783L, 1215L, 372L, 871L, 0L), B01001m2 = c(173L,
312L, 181L, 167L, 170L, 286L, 127L, 149L, 279L, 281L, 152L, 179L,
193L, 250L, 117L, 234L, 263L, 155L, 211L, 123L), B01001e3 = c(21L,
96L, 70L, 28L, 33L, 90L, 12L, 0L, 168L, 97L, 72L, 10L, 59L, 66L,
14L, 0L, 35L, 47L, 14L, 0L), B01001m3 = c(25L, 71L, 73L, 26L,
33L, 79L, 18L, 123L, 114L, 79L, 59L, 15L, 68L, 99L, 22L, 123L,
31L, 37L, 20L, 123L), B01001e4 = c(30L, 174L, 25L, 91L, 4L, 27L,
30L, 43L, 102L, 66L, 54L, 85L, 0L, 16L, 0L, 26L, 34L, 27L, 18L,
0L), B01001m4 = c(26L, 148L, 30L, 62L, 9L, 27L, 25L, 44L, 82L,
52L, 46L, 48L, 123L, 21L, 123L, 40L, 33L, 32L, 27L, 123L), B01001e5 = c(45L,
44L, 7L, 46L, 72L, 124L, 45L, 34L, 86L, 97L, 0L, 83L, 0L, 30L,
0L, 66L, 0L, 23L, 33L, 0L), B01001m5 = c(38L, 35L, 12L, 37L,
57L, 78L, 36L, 37L, 62L, 97L, 123L, 50L, 123L, 42L, 123L, 59L,
123L, 31L, 49L, 123L)), .Names = c("B01001e1", "B01001m1", "B01001e2",
"B01001m2", "B01001e3", "B01001m3", "B01001e4", "B01001m4", "B01001e5",
"B01001m5"), row.names = c(NA, 20L), class = "data.frame")
From your description it sounds like your desired output should have 2 columns, the row sum of the estimate, and the function of the row margins of errors using the logic you describe. Here is one (somewhat roundabout) solution to that problem.
I saved your data as df.
# Isolate estimate and MoE dataframes
df_e <- df[,grepl('e', names(df))]
df_m <- df[,grepl('m', names(df))]
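# Flag zero estimates and count them cumulatively along each row; the MoE of every zero estimate after the first one in a row is set to NA so that it is counted only once
is_zero <- df_e == 0
zero_count <- t(apply(is_zero, 1, cumsum))
df_m[is_zero & zero_count > 1] <- NA # only blank MoEs whose own estimate is 0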
# Combine with estimate row sum
output_df <- data.frame(
e = rowSums(df[,grepl('e', names(df))]),
m = apply(df_m, 1, function(x) sqrt(sum(x^2, na.rm = TRUE)))
)
head(output_df)
e m
1 2508 382.4173
2 3608 637.5061
3 1564 358.5178
4 3584 380.3512
5 1436 320.9595
6 3440 651.4031
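The same logic can be wrapped in a function that takes the column pattern, which is what the question was aiming for. This is only a sketch under a few assumptions: the function name aggregate_estimates and the toy data frame are made up, the estimate and MoE columns are assumed to appear in matching order, and when a row has several zero estimates the MoE of the first one is the one that gets counted.

```r
# Hypothetical helper (name and arguments are illustrative only)
aggregate_estimates <- function(source_df, pattern) {
  is_e <- grepl(paste0("^", pattern, "e"), names(source_df))
  is_m <- grepl(paste0("^", pattern, "m"), names(source_df))
  est <- as.matrix(source_df[, is_e, drop = FALSE])
  moe <- as.matrix(source_df[, is_m, drop = FALSE])
  is_zero <- est == 0

  # Squared MoEs of the non-zero estimates
  ss_nonzero <- rowSums((moe * (!is_zero))^2)

  # MoE of the first zero estimate in each row, counted once
  first_zero <- max.col(is_zero * 1, ties.method = "first")
  first_zero_moe <- moe[cbind(seq_len(nrow(moe)), first_zero)]
  ss_zero <- ifelse(rowSums(is_zero) > 0, first_zero_moe^2, 0)

  data.frame(e = rowSums(est), m = sqrt(ss_nonzero + ss_zero))
}

# Toy data, assumed for illustration: no zeros, all zeros, zeros followed by a non-zero
toy <- data.frame(
  B01001e1 = c(566L, 0L, 0L),  B01001m1 = c(176L, 123L, 123L),
  B01001e2 = c(371L, 0L, 0L),  B01001m2 = c(117L, 123L, 123L),
  B01001e3 = c(14L, 0L, 50L),  B01001m3 = c(22L, 123L, 30L)
)
res <- aggregate_estimates(toy, "B01001")
res
```

The third toy row is the case worth checking by hand: two zero estimates followed by a non-zero one should give sqrt(123^2 + 30^2), with the non-zero estimate's MoE kept.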

Exact number of every value next to the bar [duplicate]

Having a dataset like this:
df <- structure(list(word = structure(c(1L, 12L, 23L, 34L, 43L, 44L,
45L, 46L, 47L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 13L,
14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 24L, 25L, 26L, 27L,
28L, 29L, 30L, 31L, 32L, 33L, 35L, 36L, 37L, 38L, 39L, 40L, 41L,
42L), .Label = c("word1", "word10", "word11", "word12", "word13",
"word14", "word15", "word16", "word17", "word18", "word19", "word2",
"word20", "word21", "word22", "word23", "word24", "word25", "word26",
"word27", "word28", "word29", "word3", "word30", "word31", "word32",
"word33", "word34", "word35", "word36", "word37", "word38", "word39",
"word4", "word40", "word41", "word42", "word43", "word44", "word45",
"word46", "word47", "word5", "word6", "word7", "word8", "word9"
), class = "factor"), frq = c(1975L, 1665L, 1655L, 1469L, 1464L,
1451L, 1353L, 1309L, 1590L, 1545L, 1557L, 1556L, 1130L, 1153L,
1151L, 1150L, 1144L, 1141L, 1115L, 194L, 195L, 135L, 135L, 130L,
163L, 167L, 164L, 159L, 153L, 145L, 143L, 133L, 133L, 153L, 153L,
150L, 119L, 115L, 115L, 115L, 114L, 113L, 113L, 113L, 115L, 102L,
101L)), .Names = c("word", "frq"), class = "data.frame", row.names = c(NA,
-47L))
With these commands I produce a bar plot:
library(ggplot2)
dat2 = transform(df,word = reorder(word,frq))
df2 <- head(dat2, 10)
p = ggplot(df2, aes(x = word, y = frq)) + geom_bar(stat = "identity", fill = "yellow")
p2 <- p +coord_flip()
How can I show the value of frq at the end of every bar?
I would use annotate:
p2 + annotate(geom = "text",x = df2$word, y= df2$frq, label = df2$frq)
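An alternative that keeps the labels tied to the data is geom_text(). The sketch below uses a small made-up data frame in place of df2; the hjust value and the 10% headroom added with expand_limits() are arbitrary choices to keep the labels just past the bar ends and inside the plot area.

```r
library(ggplot2)

# Toy data standing in for df2 from the question (assumed for illustration)
toy <- data.frame(word = c("word1", "word2", "word3"),
                  frq = c(1975L, 1665L, 1655L))
toy$word <- reorder(toy$word, toy$frq)  # order the bars by frequency

p_labelled <- ggplot(toy, aes(x = word, y = frq)) +
  geom_bar(stat = "identity", fill = "yellow") +
  geom_text(aes(label = frq), hjust = -0.1) +  # after coord_flip, hjust < 0 moves text past the bar end
  expand_limits(y = max(toy$frq) * 1.1) +      # headroom so the labels are not clipped
  coord_flip()

p_labelled
```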

Add column based on header cell; then remove header cell

I have interesting data that is not uniform. A group of items are listed under the category name, but it is all in the same column. I need to add a column with the row corresponding to the item's category that it belongs to (then remove the category heading). The only way to distinguish a new category is determining whether the value under the year is empty.... My dputs should explain my issue more clearly.
Before:
structure(list(X = structure(c(13L, 1L, 19L, 16L, 5L, 17L, 11L,
8L, 2L, 10L, 4L, 6L, 18L, 15L, 21L, 12L, 14L, 9L, 3L, 20L, 7L
), .Label = c("-Burgers", "-Cameras", "-Shirts", "+Laptops",
"+Salads", "+TVs", "Caps", "Cell", "Clothes:", "Desktops", "Electronics",
"Flowers", "Food", "Garden Nomes", "Grills", "Hotdogs", "Nachoes",
"Outdoors:", "Pizza", "Shorts", "Swimming Gear"), class = "factor"),
X2000 = c(NA, 104L, 159L, 184L, 189L, 182L, NA, 49L, 28L,
46L, 34L, 43L, NA, 129L, 190L, 189L, 119L, NA, 45L, 80L,
80L), X2001 = c(NA, 147L, 192L, 164L, 174L, 196L, NA, 40L,
34L, 43L, 35L, 22L, NA, 114L, 130L, 120L, 145L, NA, 56L,
35L, 54L), X2002 = c(NA, 163L, 172L, 138L, 146L, 190L, NA,
38L, 40L, 21L, 22L, 33L, NA, 186L, 172L, 139L, 119L, NA,
88L, 78L, 91L), X2003 = c(NA, 125L, 152L, 182L, 148L, 125L,
NA, 36L, 44L, 34L, 27L, 50L, NA, 119L, 115L, 188L, 166L,
NA, 91L, 77L, 77L), X2004 = c(NA, 116L, 111L, 120L, 153L,
199L, NA, 49L, 48L, 43L, 37L, 32L, NA, 159L, 116L, 143L,
153L, NA, 18L, 53L, 51L)), .Names = c("X", "X2000", "X2001",
"X2002", "X2003", "X2004"), class = "data.frame", row.names = c(NA,
-21L))
After:
structure(list(X = structure(c(1L, 15L, 13L, 5L, 14L, 8L, 2L,
9L, 4L, 6L, 12L, 17L, 10L, 11L, 3L, 16L, 7L), .Label = c("-Burgers",
"-Cameras", "-Shirts", "+Laptops", "+Salads", "+TVs", "Caps",
"Cell", "Desktops", "Flowers", "Garden Nomes", "Grills", "Hotdogs",
"Nachoes", "Pizza", "Shorts", "Swimming Gear"), class = "factor"),
X.1 = structure(c(3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L,
4L, 4L, 4L, 4L, 1L, 1L, 1L), .Label = c("Clothes:", "Electronics",
"Food", "Outdoors:"), class = "factor"), X2000 = c(104L,
159L, 184L, 189L, 182L, 49L, 28L, 46L, 34L, 43L, 129L, 190L,
189L, 119L, 45L, 80L, 80L), X2001 = c(147L, 192L, 164L, 174L,
196L, 40L, 34L, 43L, 35L, 22L, 114L, 130L, 120L, 145L, 56L,
35L, 54L), X2002 = c(163L, 172L, 138L, 146L, 190L, 38L, 40L,
21L, 22L, 33L, 186L, 172L, 139L, 119L, 88L, 78L, 91L), X2003 = c(125L,
152L, 182L, 148L, 125L, 36L, 44L, 34L, 27L, 50L, 119L, 115L,
188L, 166L, 91L, 77L, 77L), X2004 = c(116L, 111L, 120L, 153L,
199L, 49L, 48L, 43L, 37L, 32L, 159L, 116L, 143L, 153L, 18L,
53L, 51L)), .Names = c("X", "X.1", "X2000", "X2001", "X2002",
"X2003", "X2004"), class = "data.frame", row.names = c(NA, -17L
))
The items arbitrarily have + or - signs... I need those to remain the same. Also, some category headers end with : while others do not.
We create an index based on the 'NA' values in columns other than the 1st ('indx'). We split the dataset using the 'indx', remove the first row i.e. NA values from columns 2nd to the last, cbind with the 1st row, 1st column value, rearrange the columns and rbind.
indx <- cumsum(!rowSums(!is.na(df1[-1])))
res <- do.call(rbind,lapply(split(df1, indx), function(x)
cbind(x, X.1= x[1,1])[-1,c(1,7,2:6)]))
row.names(res) <- NULL
all.equal(res, out, check.attributes=FALSE)
#[1] TRUE
where 'out' is the dput output of the expected result
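The same idea can be written without split() and rbind, by looking up the header text for each row directly. This is a variant sketch rather than the code above: the small data frame is made up, and it assumes the first row of the data is always a category header.

```r
# Toy data, assumed for illustration: header rows have NA in every year column
df1 <- data.frame(
  X = c("Food", "-Burgers", "Pizza", "Electronics", "Cell", "+Laptops"),
  X2000 = c(NA, 104L, 159L, NA, 49L, 34L),
  X2001 = c(NA, 147L, 192L, NA, 40L, 35L),
  stringsAsFactors = FALSE
)

is_header <- rowSums(!is.na(df1[-1])) == 0  # TRUE where all year columns are NA
group <- cumsum(is_header)                  # running category number per row

# Look up each row's category name, then drop the header rows themselves
df1$X.1 <- df1$X[is_header][group]
res <- df1[!is_header, c("X", "X.1", "X2000", "X2001")]
row.names(res) <- NULL
res
```

Item names such as "-Burgers" and "+Laptops" are carried through untouched, and headers with or without a trailing colon are handled the same way because only the year columns are inspected.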
Update
If the columns have '' instead of NA,
indx <- cumsum(!rowSums(df1[-1]!=''))
and do the rest as above. Having said that, when we have '' in a numeric column, the class will be either factor or character, depending on whether you specify stringsAsFactors=TRUE or =FALSE in read.table/read.csv. So, keeping the '' as such will leave the output columns as factor/character class too. I would convert the columns to their correct class first, which will also coerce the '' to NA, i.e.
df1[-1] <- lapply(df1[-1], function(x) as.numeric(as.character(x)))
The as.character is only needed if the columns are factor class.
Once we have done the conversion, the first approach should work fine as well.
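A small illustration of why the as.character step matters for factor columns, using a made-up vector: calling as.numeric directly on a factor returns the internal level codes, not the numbers that are printed.

```r
# Toy factor, assumed for illustration: an empty string next to two numbers
x <- factor(c("", "104", "159"))

codes <- as.numeric(x)                   # internal level codes, not the values
values <- as.numeric(as.character(x))    # '' becomes NA, the rest become numbers

codes
values
```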

Add a line from different result to boxplot graph in ggplot2

I have a data frame (df1) that contains 3 columns (Y1, Y2, x). I managed to plot boxplots of Y1 against x and Y2 against x. I have another data frame (df2) which contains two columns, A and x. I want to plot a line graph of A against x and add it to the boxplot. Note that x is the x-axis variable in both data frames, but it takes different values in each. I tried to combine and reshape both data frames and plot based on factor(x), but I got 3 boxplots in one graph. I need to plot df2 as a line and df1 as boxplots in one graph.
df1 <- structure(list(Y1 = c(905L, 941L, 744L, 590L, 533L, 345L, 202L,
369L, 200L, 80L, 200L, 80L, 50L, 30L, 60L, 20L, 30L, 30L), Y2 = c(774L,
823L, 687L, 545L, 423L, 375L, 249L, 134L, 45L, 58L, 160L, 60L,
20L, 40L, 20L, 26L, 19L, 27L), x = c(10L, 10L, 10L, 20L, 20L,
20L, 40L, 40L, 40L, 50L, 50L, 50L, 70L, 70L, 70L, 90L, 90L, 90L
)), .Names = c("Y1", "Y2", "x"), row.names = c(NA, -18L), class = "data.frame")
df2 <- structure(list(Y3Line = c(384L, 717L, 914L, 359L, 241L, 265L,
240L, 174L, 114L, 165L, 184L, 96L, 59L, 60L, 127L, 54L, 31L,
44L), x = c(36L, 36L, 36L, 56L, 56L, 56L, 65L, 65L, 65L, 75L,
75L, 75L, 85L, 85L, 85L, 99L, 99L, 99L)), .Names = c("A",
"x"), row.names = c(NA, -18L), class = "data.frame")
df_l <- melt(df1, id.vars = "x")
ggplot(df_l, aes(x = factor(x), y =value, fill=variable )) +
geom_boxplot()+
# here I'm trying to add the line graph from df2
geom_line(data = df2, aes(x = x, y=A))
Any suggestions?
In the second dataset you have three y values per x value. Do you want to draw separate lines per x value, or the mean per x value? Both are shown below. The trick is to first change the x variables in both datasets to factors that contain all the levels of both variables.
df1 <-structure(list(Y1 = c(905L, 941L, 744L, 590L, 533L, 345L, 202L,
369L, 200L, 80L, 200L, 80L, 50L, 30L, 60L, 20L, 30L, 30L), Y2 = c(774L,
823L, 687L, 545L, 423L, 375L, 249L, 134L, 45L, 58L, 160L, 60L,
20L, 40L, 20L, 26L, 19L, 27L), x = c(10L, 10L, 10L, 20L, 20L,
20L, 40L, 40L, 40L, 50L, 50L, 50L, 70L, 70L, 70L, 90L, 90L, 90L
)), .Names = c("Y1", "Y2", "x"), row.names = c(NA, -18L), class = "data.frame")
df2 <- structure(list(Y3Line = c(384L, 717L, 914L, 359L, 241L, 265L,
240L, 174L, 114L, 165L, 184L, 96L, 59L, 60L, 127L, 54L, 31L,
44L), x = c(36L, 36L, 36L, 56L, 56L, 56L, 65L, 65L, 65L, 75L,
75L, 75L, 85L, 85L, 85L, 99L, 99L, 99L)), .Names = c("A",
"x"), row.names = c(NA, -18L), class = "data.frame")
library(ggplot2)
library(reshape2)
df_l <- melt(df1, id.vars = "x")
allLevels <- levels(factor(c(df_l$x,df2$x)))
df_l$x <- factor(df_l$x,levels=(allLevels))
df2$x <- factor(df2$x,levels=(allLevels))
Line per x category:
ggplot(data=df_l,aes(x = x, y =value))+geom_line(data=df2,aes(x = factor(x), y =A)) +
geom_boxplot(aes(fill=variable ))
Connected means of x categories:
ggplot(data=df2,aes(x = factor(x), y =A)) +
stat_summary(fun=mean, geom="line", aes(group=1)) + # 'fun' replaces 'fun.y', which is deprecated since ggplot2 3.3.0
geom_boxplot(data=df_l,aes(x = x, y =value,fill=variable ))

Resources