Check which are the features that differentiate between clusters, using a boxplot - r

I applied the UMAP dimentionaity reduction over my data, and clustred it. I got three different clusters:
I have the data that specefices to which cluster does eahc sample belong, with the name of the sample and everything. Here is a subsample of it, let's call it df_cluster:
structure(list(X1 = c(17.6942795910888, 16.5328416912875, 15.0031683863395,
16.3550118351627, 17.6931159161312, 16.9869249394253, 16.3790173297882,
15.8964870189374, 17.1055608092973, 16.4568632337052), X2 = c(-1.64953541728691,
0.185674946464158, -1.38521677790428, -0.448487127519734, -1.63670327964466,
-0.456667476792068, -0.091689040488956, -1.77486494294163, -1.86407675524967,
0.14666260432486), cluster = c(1L, 2L, 2L, 1L, 2L, 1L, 3L, 3L,
1L, 3L)), row.names = c("Patient1", "Patient13", "Patient2", "Patient99",
"Patient10", "Patient43", "Patient167", "Patient8", "Patient17", "Patient16"
), class = "data.frame")
The samples of df_cluster are the same in the original data, data, which I used for the clustering. Which is basically just the samples you saw as rows, and features as columns, looks something like this:
structure(c(-0.0741098696855045, -0.094401270881699, 0.0410284948786532,
-0.163302950330185, -0.0942478217207681, -0.167314411991775,
-0.118272811489486, -0.0366277340916379, -0.0349008907108641,
-0.167823357941815, -0.178835447722468, -0.253897294559596, -0.0372301980787381,
-0.230579110769457, -0.224125346052727, -0.196933050675633, -0.344608041139497,
-0.0550538743643369, -0.157003425700701, -0.162295446209879,
-0.0384421660291032, -0.0275306107582565, 0.186447606591857,
-0.124972070102036, -0.15348122673842, -0.106812144494277, -0.104757782473888,
0.0686746776877563, -0.0662055287009653, 0.00388752358937872), dim = c(10L,
3L), dimnames = list(c("Patient1", "Patient13", "Patient2", "Patient99",
"Patient10", "Patient43", "Patient167", "Patient8", "Patient17", "Patient16"
), c("Feature1", "Feature2",
"Feature3")))
I just want to view each of those features (the columns of data), in each cluster, using a box plot or a violin plot. Kind of a comparison between the clusters.
So in the X-axis I'll have clusters 1, 2, and 3, the Y-axis would be the values. Each feature will get a plot. I've drawn an example by hand to make it more clear:

You could use facets.
But first you need to pivot the dataframe.
df_cluster <- structure(list(X1 = c(17.6942795910888, 16.5328416912875, 15.0031683863395,
16.3550118351627, 17.6931159161312, 16.9869249394253, 16.3790173297882,
15.8964870189374, 17.1055608092973, 16.4568632337052), X2 = c(-1.64953541728691,
0.185674946464158, -1.38521677790428, -0.448487127519734, -1.63670327964466,
-0.456667476792068, -0.091689040488956, -1.77486494294163, -1.86407675524967,
0.14666260432486), cluster = c(1L, 2L, 2L, 1L, 2L, 1L, 3L, 3L,
1L, 3L)), row.names = c("Patient1", "Patient13", "Patient2", "Patient99",
"Patient10", "Patient43", "Patient167", "Patient8", "Patient17", "Patient16"
), class = "data.frame")
data <- structure(c(-0.0741098696855045, -0.094401270881699, 0.0410284948786532,
-0.163302950330185, -0.0942478217207681, -0.167314411991775,
-0.118272811489486, -0.0366277340916379, -0.0349008907108641,
-0.167823357941815, -0.178835447722468, -0.253897294559596, -0.0372301980787381,
-0.230579110769457, -0.224125346052727, -0.196933050675633, -0.344608041139497,
-0.0550538743643369, -0.157003425700701, -0.162295446209879,
-0.0384421660291032, -0.0275306107582565, 0.186447606591857,
-0.124972070102036, -0.15348122673842, -0.106812144494277, -0.104757782473888,
0.0686746776877563, -0.0662055287009653, 0.00388752358937872), dim = c(10L,
3L), dimnames = list(c("Patient1", "Patient13", "Patient2", "Patient99",
"Patient10", "Patient43", "Patient167", "Patient8", "Patient17", "Patient16"
), c("Feature1", "Feature2",
"Feature3")))
library(tidyverse)
data %>%
as.data.frame() %>%
rownames_to_column("Patient") %>%
left_join(df_cluster %>% rownames_to_column("Patient") %>% select(Patient, cluster)) %>%
pivot_longer(- c(cluster, Patient)) %>% #Pivot the dataframe
ggplot(aes(as.factor(cluster), value)) +
geom_boxplot() +
facet_grid(~ name)

Related

How to Plot line chart using R for time-series analysis

I am trying to plot a line chart using Date-time and no of tweets at that period of date and time in R.
library(ggplot2)
df1 <- structure(list(Date = structure(c(1L, 1L, 2L, 1L, 1L, 1L), .Label = c("2020-03-12",
"2020-03-13"), class = "factor"), Time = structure(c(1L, 1L, 2L,
3L, 4L, 5L), .Label = c("00:00:00Z", "00:00:01Z", "00:10:04Z",
"00:25:12Z", "01:00:02Z"), class = "factor"), Text = structure(c(5L,
3L, 6L, 4L, 2L, 1L), .Label = c("The images of demonstrations and gathering", "Premium policy get activate by company abc",
"Launches of rocket", "Premium policy get activate by company abc",
"Technology makes trend", "The images of demonstrations and gatherings",
"Weather forecasting by xyz"), class = "factor")), class = "data.frame", row.names = c(NA,
-6L))
ggplot(df1, aes(x = Date, y = text(count)) + geom_line(aes(color = variable), size = 1)
I tried the above code to plot desired result but got an error. Dataset given like that in csv format.
Date Time Text
2020-03-12 00:00:00Z The images of demonstrations and gatherings
2020-03-12 00:00:00Z Premium policy get activate by company abc
2020-03-12 00:00:01Z Weather forecasting by xyz
2020-03-12 00:10:04Z Technology makes trend
2020-03-12 00:25:12Z Launches of rocket
2020-03-12 01:00:02Z Government launch new policy to different sector improvement
I have a dataset of nearly 15 days and want to plot the line chart to visualize the number of tweets (given in text column) to see the trend of tweets on different time and date.
df1 <- structure(list(Date = structure(c(1L, 1L, 2L, 1L, 1L, 1L), .Label = c("3/12/2020",
"3/13/2020"), class = "factor"), Time = structure(c(1L, 1L, 2L,
3L, 4L, 5L), .Label = c("00:00:00Z", "00:00:01Z", "00:10:04Z",
"00:25:12Z", "01:00:02Z"), class = "factor"), Text = structure(c(5L,
3L, 6L, 4L, 2L, 1L), .Label = c("Government launch new policy to different sector",
"Launches of rocket", "Premium policy get activate by company abc",
"Technology makes trend", "The images of demonstrations and gatherings",
"Weather forecasting by xyz"), class = "factor"), X = structure(c(1L,
1L, 1L, 1L, 1L, 2L), .Label = c("", "improvement"), class = "factor")), class = "data.frame", row.names = c(NA,
-6L))
Creating the dataset df1 as above then running this gives you required plot for hour
library(tidyverse)
library(lubridate)
df1 %>%
mutate(Time=hms(Time),
Date=mdy(Date),
hour=hour(Time)) %>%
count(hour) %>%
ggplot(aes(hour,n,group=1))+geom_line()+geom_point()
Is this what you are after?
library(dplyr)
library(lubridate)
library(stringr)
library(ggplot2)
Answer with your data
To demonstrate data wrangling.
# your data;
df1 <- structure(list(Date = structure(c(1L, 1L, 2L, 1L, 1L, 1L),
.Label = c("2020-03-12","2020-03-13"),
class = "factor"),
Time = structure(c(1L, 1L, 2L,3L, 4L, 5L),
.Label = c("00:00:00Z", "00:00:01Z", "00:10:04Z","00:25:12Z", "01:00:02Z"),
class = "factor"),
Text = structure(c(5L,3L, 6L, 4L, 2L, 1L),
.Label = c("The images of demonstrations and gathering", "Premium policy get activate by company abc",
"Launches of rocket", "Premium policy get activate by company abc",
"Technology makes trend", "The images of demonstrations and gatherings", "Weather forecasting by xyz"), class = "factor")),
class = "data.frame", row.names = c(NA,-6L))
# data wrangle
df2 <-
df1 %>%
# change all variables from factors to character
mutate_all(as.character) %>%
mutate(Time = str_remove(Time, "Z$"), #remove the trailing 'Z' from Time values
dt = ymd_hms(paste(Date, Time, sep = " ")), # change text into datetime format using lubridtate::ymd_hms
dt = ceiling_date(dt, unit="hour")) %>% # round to the end of the named hour, separated for clarity
group_by(dt) %>%
summarise(nr_tweets = n())
# plot
p1 <- ggplot(df2, aes(dt, nr_tweets))+
geom_line()+
scale_x_datetime(date_breaks = "1 day", date_labels = "%d/%m")+
ggtitle("Data from question `df1`")
Answer with made up large dataset
tib <- tibble(dt = sample(seq(ISOdate(2020,05,01), ISOdate(2020,05,15), by = "sec"), 10000, replace = TRUE),
text = sample(c(letters[1:26], LETTERS[1:26]), 10000, replace = TRUE))
tib1 <-
tib %>%
mutate(dt = round_date(dt, unit="hour"))%>%
group_by(dt) %>%
summarise(nr_tweets = n())
p2 <- ggplot(tib1, aes(dt, nr_tweets))+
geom_line()+
scale_x_datetime(date_breaks = "1 day", date_labels = "%d/%m")+
ggtitle("Result using `tib` data made up to answer the question")
p1/p2
Created on 2020-05-13 by the reprex package (v0.3.0)

Grouped box plot

Here is what it looks like after those edits - lines but no boxes.
Reproducible code:
df <- data.frame(SampleID = structure(c(5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L),
.Label = c("C004", "C005", "C007", "C009", "C010",
"C011", "C013", "C027", "C028", "C029",
"C030", "C031", "C032", "C033", "C034",
"C035", "C036", "C042", "C043", "C044",
"C045", "C046", "C047", "C048", "C049",
"C058", "C086"), class = "factor"),
Sequencing.Depth = c(1L, 2612L, 5223L, 7834L, 10445L, 13056L, 15667L, 18278L,
20889L, 23500L),
Observed.OTUs = c(1, 213, 289.5, 338, 377.8, 408.9, 434.4, 453.8, 472.1, NA),
Mange = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
.Label = c("N", "Y"), class = "factor"),
SpeciesCode = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L),
.Label = c("Cla", "Ucin", "Vvu"), class = "factor"))
In your aes, you can use interaction of your x values and your categorical values for plotting boxplot on a continuous x axis and pass position = "identity" in order to place them on the precise x values and not to be dodged.
Here to add the line connecting each boxplot, I calculate mean per Species per x values using dplyr directly inggplot but you can calculate outside and generate a second dataframe.
So, as your x values are pretty spread from 1 to 23500, you will have to modify the width of the geom_boxplot in order to see a box and not a single line:
library(ggplot2)
library(dplyr)
ggplot(df,aes(x = Xvalues, y = Yvalues, color = Species,
group = interaction(Species, Xvalues)))+
geom_boxplot(position = "identity", width = 1000)+
geom_line(data = df %>%
group_by(Xvalues, Species) %>%
summarise(Mean = mean(Yvalues)),
aes(x = Xvalues, y = Mean,
color = Species, group = Species))
So, apply to your dataset (based on informations you provided in your code), you should try something like:
library(ggplot2)
library(dplyr)
ggplot(observedotusrare,
aes(x=Sequencing.Depth, y=Observed.OTUs,
color=SpeciesCode,
group = interaction(Sequencing.Depth, SpeciesCode))) +
geom_boxplot(position = "identity", width = 1000) +
geom_line(data = observedotusrare %>%
group_by(Sequencing.Depth, SpeciesCode) %>%
summarise(Mean = mean(Observed.OTUs, na.rm = TRUE)),
aes(x = Sequencing.Depth, y = Mean,
color = SpeciesCode, group = SpeciesCode))
Does it answer your question ?
Reproducible example
df <- data.frame(Xvalues = rep(c(10,2000,23500), each = 30),
Species = rep(rep(LETTERS[1:3], each = 10),3),
Yvalues = c(rnorm(10,1,1),
rnorm(10,5,1),
rnorm(10,8,1),
rnorm(10,5,1),
rnorm(10,8,1),
rnorm(10,12,1),
rnorm(10,20,1),
rnorm(10,30,1),
rnorm(10,50,1)))

Conditional updating coordinate column in dataframe

I am attempting to populate two newly empty columns in a data frame with data from other columns in the same data frame in different ways depending on if they are populated.
I am trying to populate the values of HIGH_PRCN_LAT and HIGH_PRCN_LON (previously called F_Lat and F_Lon) which represent the final latitudes and londitudes for those rows this will be based off the values of the other columns in the table.
Case 1: Lat/Lon2 are populated (like in IDs 1 & 2), using the great
circle algorithm a midpoint between them should be calculated and
then placed into F_Lat & F_Lon.
Case 2: Lat/Lon2 are empty, then the values of Lat/Lon1 should be put
into F_Lat and F_Lon (like with IDs 3 & 4).
My code is as follows but doesn't work (see previous versions, removed in an edit).
The preperatory code I am using is as follows:
incidents <- structure(list(id = 1:9, StartDate = structure(c(1L, 3L, 2L,
2L, 2L, 3L, 1L, 3L, 1L), .Label = c("02/02/2000 00:34", "02/09/2000 22:13",
"20/01/2000 14:11"), class = "factor"), EndDate = structure(1:9, .Label = c("02/04/2006 20:46",
"02/04/2006 22:38", "02/04/2006 23:21", "02/04/2006 23:59", "03/04/2006 20:12",
"03/04/2006 23:56", "04/04/2006 00:31", "07/04/2006 06:19", "07/04/2006 07:45"
), class = "factor"), Yr.Period = structure(c(1L, 1L, 2L, 2L,
2L, 3L, 3L, 3L, 3L), .Label = c("2000 / 1", "2000 / 2", "2000 /3"
), class = "factor"), Description = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), .Label = "ENGLISH TEXT", class = "factor"),
Location = structure(c(2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L
), .Label = c("Location 1", "Location 1 : Location 2"), class = "factor"),
Location.1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = "Location 1", class = "factor"), Postcode.1 = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "Postcode 1", class = "factor"),
Location.2 = structure(c(2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L,
1L), .Label = c("", "Location 2"), class = "factor"), Postcode.2 = structure(c(2L,
2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L), .Label = c("", "Postcode 2"
), class = "factor"), Section = structure(c(2L, 2L, 3L, 1L,
4L, 4L, 2L, 1L, 4L), .Label = c("East", "North", "South",
"West"), class = "factor"), Weather.Category = structure(c(1L,
2L, 4L, 2L, 2L, 2L, 4L, 1L, 3L), .Label = c("Animals", "Food",
"Humans", "Weather"), class = "factor"), Minutes = c(13L,
55L, 5L, 5L, 5L, 522L, 1L, 11L, 22L), Cost = c(150L, 150L,
150L, 20L, 23L, 32L, 21L, 11L, 23L), Location.1.Lat = c(53.0506727,
53.8721035, 51.0233529, 53.8721035, 53.6988355, 53.4768766,
52.6874562, 51.6638245, 51.4301359), Location.1.Lon = c(-2.9991256,
-2.4004125, -3.0988341, -2.4004125, -1.3031529, -2.2298073,
-1.8023421, -0.3964916, 0.0213837), Location.2.Lat = c(52.7116187,
53.746791, NA, 53.746791, 53.6787167, 53.4527824, 52.5264907,
NA, NA), Location.2.Lon = c(-2.7493169, -2.4777984, NA, -2.4777984,
-1.489026, -2.1247029, -1.4645023, NA, NA)), class = "data.frame", row.names = c(NA, -9L))
#gpsColumns is used as the following line of code is used for several data frames.
gpsColumns <- c("HIGH_PRCN_LAT", "HIGH_PRCN_LON")
incidents [ , gpsColumns] <- NA
#create separate variable(?) containing a list of which rows are complete
ind <- complete.cases(incidents [,17])
#populate rows with a two Lat/Lons with great circle middle of both values
incidents [ind, c("HIGH_PRCN_LON_2","HIGH_PRCN_LAT_2")] <-
with(incidents [ind,,drop=FALSE],
do.call(rbind, geosphere::midPoint(cbind.data.frame(Location.1.Lon, Location.1.Lat), cbind.data.frame(Location.2.Lon, Location.2.Lat))))
#populate rows with one Lat/Lon with those values
incidents[!ind, c("HIGH_PRCN_LAT","HIGH_PRCN_LON")] <- incidents[!ind, c("Location.1.Lat","Location.1.Lon")]
I will use the geosphere::midPoint function based off a recommendation here: http://r.789695.n4.nabble.com/Midpoint-between-coordinates-td2299999.html.
Unfortunately, it doesn't appear that this way of populating the column will work when there are several cases.
The current error that is thrown is:
Error in `$<-.data.frame`(`*tmp*`, F_Lat, value = integer(0)) :
replacement has 0 rows, data has 178012
Edit: also posted to reddit: https://www.reddit.com/r/Rlanguage/comments/bdvavx/conditional_updating_column_in_dataframe/
Edit: Added clarity on the parts of the code I do not understand.
#replaces the F_Lat2/F_Lon2 columns in rows with a both sets of input coordinates
dataframe[ind, c("F_Lat2","F_Lon2")] <-
#I am unclear on what this means, specifically what the "with" function does and what "drop=FALSE" does and also why they were used in this case.
with(dataframe[ind,,drop=FALSE],
#I am unclear on what do.call and rbind are doing here, but the second half (geosphere onwards) is binding the Lats and Lons to make coordinates as inputs for the gcIntermediate function.
do.call(rbind, geosphere::gcIntermediate(cbind.data.frame(Lat1, Lon1),
cbind.data.frame(Lat2, Lon2), n = 1)))
Though your code doesn't work as-written for me, and I cannot calculate the same precise values your expect, I suspect the error your seeing can be fixed with these steps. (Data is down at the bottom here.)
Pre-populate the empty columns.
Pre-calculate the complete.cases step, it'll save time.
Use cbind.data.frame for inside gcIntermediate.
I'm inferring from
gcIntermediate([dataframe...
^
this is an error in R
that you are binding those columns together, so I'll use cbind.data.frame. (Using cbind itself produced some ignorable warnings from geosphere, so you can use it instead and perhaps suppressWarnings, but that function is a little strong in that it'll mask other warnings as well.)
Also, since it appears you want one intermediate value for each pair of coordinates, I added the gcIntermediate(..., n=1) argument.
The use of do.call(rbind, ...) is because gcIntermediate returns a list, so we need to bring them together.
dataframe$F_Lon2 <- dataframe$F_Lat2 <- NA_real_
ind <- complete.cases(dataframe[,4])
dataframe[ind, c("F_Lat2","F_Lon2")] <-
with(dataframe[ind,,drop=FALSE],
do.call(rbind, geosphere::gcIntermediate(cbind.data.frame(Lat1, Lon1),
cbind.data.frame(Lat2, Lon2), n = 1)))
dataframe[!ind, c("F_Lat2","F_Lon2")] <- dataframe[!ind, c("Lat1","Lon1")]
dataframe
# ID Lat1 Lon1 Lat2 Lon2 F_Lat F_Lon F_Lat2 F_Lon2
# 1 1 19.05067 -3.999126 92.71332 -6.759169 55.88200 -5.379147 55.78466 -6.709509
# 2 2 58.87210 -1.400413 54.74679 -4.479840 56.80945 -2.940126 56.81230 -2.942029
# 3 3 33.02335 -5.098834 NA NA 33.02335 -5.098834 33.02335 -5.098834
# 4 4 54.87210 -4.400412 NA NA 54.87210 -4.400412 54.87210 -4.400412
Update, using your new incidents data and switching to geosphere::midPoint.
Try this:
incidents$F_Lon2 <- incidents$F_Lat2 <- NA_real_
ind <- complete.cases(incidents[,4])
incidents[ind, c("F_Lat2","F_Lon2")] <-
with(incidents[ind,,drop=FALSE],
geosphere::midPoint(cbind.data.frame(Location.1.Lat,Location.1.Lon),
cbind.data.frame(Location.2.Lat,Location.2.Lon)))
incidents[!ind, c("F_Lat2","F_Lon2")] <- dataframe[!ind, c("Lat1","Lon1")]
One (big) difference is that geosphere::gcIntermediate(..., n=1) returns a list of results, whereas geosphere::midPoint(...) (no n=) returns just a matrix, so no rbinding required.
Data:
dataframe <- read.table(header=T, stringsAsFactors=F, text="
ID Lat1 Lon1 Lat2 Lon2 F_Lat F_Lon
1 19.0506727 -3.9991256 92.713318 -6.759169 55.88199535 -5.3791473
2 58.8721035 -1.4004125 54.746791 -4.47984 56.80944725 -2.94012625
3 33.0233529 -5.0988341 NA NA 33.0233529 -5.0988341
4 54.8721035 -4.4004125 NA NA 54.8721035 -4.4004125")

Casting dataframe gives error R

Here is the dataframe df on which I'm trying to do a pivot using cast function
dput(df)
structure(list(Val = c(1L, 2L, 2L, 5L, 2L, 5L), `Perm 1` = structure(c(1L,
2L, 3L, 3L, 3L, 3L), .Label = c("Blue", "green", "yellow"
), class = "factor"), `Perm 2` = structure(c(1L, 2L, 2L, 3L,
3L, 3L), .Label = c("Blue", "green", "yellow"), class = "factor"),
`Perm 3` = structure(c(1L, 2L, 2L, 2L, 3L, 3L), .Label = c("Blue",
"green", "yellow"), class = "factor")), .Names = c("Val",
"Perm 1", "Perm 2", "Perm 3"), row.names = c(NA, 6L), class = "data.frame")
And expecting the data after pivot
Blue 1 1 1
green 2 4 9
yellow 14 12 7
I tried doing
cast(df, df$Val ~ df$`Perm 1`+df$`Perm 2`+df$`Perm 3`, sum, value = 'Val')
But this gives error
Error: Casting formula contains variables not found in molten data: df$Val, df$`Perm1`, df$`Perm2`
How can I be able to do pivot so that I'll be able to get the desired O/P
P.S- The dataframe DF has around 36 column but for simplicity I took only 3 columns.
Any suggestion will be appreciated.
Thank you
Domnick
It appears you want to sum, grouped by each permutation in your dataset. Although hacky, I think this works for your problem. First we create a function to perform that summation using tidyeval syntax. Link for more information: Group by multiple columns in dplyr, using string vector input
sum_f <- function(col, df) {
library(tidyverse)
df <- df %>%
group_by_at(col) %>%
summarise(Val = sum(Val)) %>%
ungroup()
df[,2]
}
We then apply it to your dataset using lapply, and binding together the summations.
bind_cols(lapply(c('Perm1', 'Perm2', 'Perm3'), sum_f, df))
This gets us the above answer.
Caveats: Need to know the name of the columns you have to sum over for this to work. Also, each column needs to have the same levels of your permutations i.e. blue, green, yellow. The code will respect this ordering.

Outlier detection for multi column data frame in R

I have a data frame with 18 columns and about 12000 rows. I want to find the outliers for the first 17 columns and compare the results with the column 18. The column 18 is a factor and contains data which can be used as indicator of outlier.
My data frame is ufo and I remove the column 18 as follow:
ufo2 <- ufo[,1:17]
and then convert 3 non0numeric columns to numeric values:
ufo2$Weight <- as.numeric(ufo2$Weight)
ufo2$InvoiceValue <- as.numeric(ufo2$InvoiceValue)
ufo2$Score <- as.numeric(ufo2$Score)
and then use the following command for outlier detection:
outlier.scores <- lofactor(ufo2, k=5)
But all of the elements of the outlier.scores are NA!!!
Do I have any mistake in this code?
Is there another way to find outlier for such a data frame?
All of my code:
setwd(datadirectory)
library(doMC)
registerDoMC(cores=8)
library(DMwR)
# load data
load("data_9802-f2.RData")
ufo2 <- ufo[,2:17]
ufo2$Weight <- as.numeric(ufo2$Weight)
ufo2$InvoiceValue <- as.numeric(ufo2$InvoiceValue)
ufo2$Score <- as.numeric(ufo2$Score)
outlier.scores <- lofactor(ufo2, k=5)
The output of the dput(head(ufo2)) is:
structure(list(Origin = c(2L, 2L, 2L, 2L, 2L, 2L), IO = c(2L,
2L, 2L, 2L, 2L, 2L), Lot = c(1003L, 1003L, 1003L, 1012L, 1012L,
1013L), DocNumber = c(10069L, 10069L, 10087L, 10355L, 10355L,
10382L), OperatorID = c(5698L, 5698L, 2015L, 246L, 246L, 4135L
), Month = c(1L, 1L, 1L, 1L, 1L, 1L), LineNo = c(1L, 2L, 1L,
1L, 2L, 1L), Country = c(1L, 1L, 1L, 1L, 11L, 1L), ProduceCode = c(63456227L,
63455714L, 33687427L, 32686627L, 32686627L, 791614L), Weight = c(900,
850, 483, 110000, 5900, 1000), InvoiceValue = c(637, 775, 2896,
48812, 1459, 77), InvoiceValueWeight = c(707L, 912L, 5995L, 444L,
247L, 77L), AvgWeightMonth = c(1194.53, 1175.53, 7607.17, 311.667,
311.667, 363.526), SDWeightMonth = c(864.931, 780.247, 3442.93,
93.5818, 93.5818, 326.238), Score = c(0.56366535234262, 0.33775439984787,
0.46825476121676, 1.414092583904, 0.69101737288291, 0.87827342721894
), TransactionNo = c(47L, 47L, 6L, 3L, 3L, 57L)), .Names = c("Origin",
"IO", "Lot", "DocNumber", "OperatorID", "Month", "LineNo", "Country",
"ProduceCode", "Weight", "InvoiceValue", "InvoiceValueWeight",
"AvgWeightMonth", "SDWeightMonth", "Score", "TransactionNo"), row.names = c(NA,
6L), class = "data.frame")
First of all, you need to spend a lot more time preprocessing your data.
Your axes have completely different meaning and scale. Without care, the outlier detection results will be meaningless, because they are based on a meaningless distance.
For example produceCode. Are you sure, this should be part of your similarity?
Also note that I found the lofactor implementation of the R DMwR package to be really slow. Plus, it seems to be hard-wired to Euclidean distance!
Instead, I recommend using ELKI for outlier detection. First of all, it comes with a much wider choice of algorithms, secondly it is much faster than R, and third, it is very modular and flexible. For your use case, you may need to implement a custom distance function instead of using Euclidean distance.
Here's the link to the ELKI tutorial on implementing a custom distance function.

Resources