regression on subsets for unique factor combinations using lm


I would like to automate a simple multiple regression for the subsets defined by the unique combinations of the grouping variables. I have a dataframe with several grouping variables df1[,1:6] and some independent variables df1[,8:10] and a response df1[,7].
This is an excerpt from the data.
structure(list(Surface = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("NiAu", "Sn"), class = "factor"), Supplier = structure(c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"), ParticleSize = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("3", "5"), class = "factor"), T1 = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L), .Label = c("130", "144"), class = "factor"), T2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "200", class = "factor"), O2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "1300", class = "factor"), Shear = c(56.83, 67.73, 78.51, 62.61, 66.78, 60.89, 62.94, 76.34, 70.56, 70.4, 54.15), Gap = c(373, 450, 417, 450, 406, 439, 439, 417, 439, 441, 417), Clearance = c(500.13, 509.85, 495.97, 499.55, 502.66, 505.33, 500.32, 503.28, 507.44, 500.5, 498.39), Void = c(316, 343, 89, 247, 271, 326, 304, 282, 437, 243, 116)), .Names = c("Surface", "Supplier", "ParticleSize","T1", "T2", "O2", "Shear", "Gap", "Clearance", "Void"), class = "data.frame", row.names = c(NA, -11L))
Using unique(df1[,1:6]) returns 5 factor combinations of the grouping variables, so there should be 5 subsets to which I apply the lm() function.
My call looks like this:
df1.fit.by<-with(df1,by(df1,df1[,1:6], function(x) lm(Shear~Gap+Clearance+Void,data=x)))
sapply(df1.fit.by,coef)
Problem 1: it returns a list with 16 entries. Apparently it evaluates all possible combinations of the levels of the six grouping variables (columns 5 and 6 have only one level, but columns 1 to 4 have two levels each in the excerpt, giving 2^4 = 16). It should only use the factor combinations that actually occur in the data. So I suppose by() is not the right function for this. Any suggestions?
Problem 2: I find it easier to refer to column indices than to variable names, so I initially tried lm(df1[,7]~df1[,8]+df1[,9]). That did not work, because it always uses the entire df1 data frame instead of the subsets. So I should probably pass the row indices for each factor combination to lm() rather than the complete data frame.
I think the solutions to problems 1 and 2 are related and involve a different subsetting function. It would be nice if someone could explain where my mistake is. If possible I would like to stick to the standard packages, simply because I want to improve my understanding of R. Thanks
EDIT: a minor mistake in the variable assignment

You could use the plyr package:
require(plyr)
list_reg <- dlply(df1, .(Surface, Supplier, ParticleSize, T1, T2), function(df)
{lm(Shear~Gap+Clearance+Void,data=df)})
#We have indeed five different results
length(list_reg)
#That's how you check out one particular regression, in this case the first
summary(list_reg[[1]])
The function dlply takes a data.frame (that's what the d... stands for), in your case df1, and returns a list (that's what the .l... stands for), in your case consisting of five elements, each containing the results of one regression.
Internally, your df1 is split into five sub-data.frames according to the columns specified by .(Surface, Supplier, ParticleSize, T1, T2), and the function lm(Shear~Gap+Clearance+Void,data=df) is applied to each of these sub-data.frames. (O2 is left out of the grouping here; it has a single level in the excerpt, so the five groups are the same.)
To get a better feeling of what dlply really does, just call
list_sub_df <- dlply(df1, .(Surface, Supplier, ParticleSize, T1, T2))
and you can look at each sub-data.frame to which lm will be applied.
And just a general note at the end: the paper by the package author Hadley Wickham is really great. Even if you don't end up using his package, it is still well worth reading to get a feeling for the split-apply-combine approach.
EDIT:
I just did a quick search and as expected, this was already explained better before, so also make sure to read this SO post.
EDIT2:
If you want to use the column numbers directly, try this (taken from this SO post):
list_reg <- dlply(df1, names(df1[, 1:5]), function(df)
{lm(Shear~Gap+Clearance+Void,data=df)})
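Since the question asks for a solution with the standard packages, here is a minimal base R sketch of the same split-apply idea. It uses a small made-up data frame (toy, with hypothetical columns g1, g2, x1, x2, y), not the asker's data. split() with drop = TRUE keeps only the factor combinations that actually occur, and reformulate() builds the model formula from column positions, which addresses the wish to work with indices.

```r
# Made-up data: two grouping factors, of which only 3 of the 4 possible
# combinations occur (there is no "b"/"x" group).
set.seed(1)
toy <- data.frame(
  g1 = factor(rep(c("a", "a", "b"), each = 5)),
  g2 = factor(rep(c("x", "y", "y"), each = 5)),
  x1 = rnorm(15),
  x2 = rnorm(15)
)
toy$y <- 1 + 2 * toy$x1 - toy$x2 + rnorm(15, sd = 0.1)

# Build the formula from column positions instead of typing the names:
# columns 3:4 are the predictors, column 5 is the response.
form <- reformulate(names(toy)[3:4], response = names(toy)[5])

# drop = TRUE discards combinations of levels that have no rows.
pieces <- split(toy, toy[, c("g1", "g2")], drop = TRUE)

# Fit one model per existing combination.
fits <- lapply(pieces, function(d) lm(form, data = d))

length(fits)        # number of fitted models
sapply(fits, coef)  # one column of coefficients per group
```

With the asker's data the grouping columns would be df1[, 1:6] and the formula would be built from names(df1)[8:10] and names(df1)[7]; that adaptation is untested here.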

Related

R How to tell a (t-test) function the needed column in an indirect way?

This is the data with the two columns 'weight' and 'group':
genderweight <- structure(list(weight = c(95.0626365041014, 65.9189881179415,
64.1289176345525, 66.1688823533661, 81.6245374434498, 85.1845386418439,
81.0348729928744, 92.161156464954, 86.3842380662202, 64.8582493776221,
62.3256566394621, 85.0980797936812, 80.0399859200671, 83.3698935236987,
62.8710960018134, 77.0097819307823, 62.9067362884316, 62.8505200797307,
62.2199243419118, 86.2430806667288, 83.8522826935738, 59.3086045947413,
82.578094058482, 62.9779809883867), group = structure(c(2L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 1L,
1L, 2L, 2L, 1L, 2L, 1L), levels = c("F", "M"), class = "factor")), row.names = c(NA,
-24L), class = c("tbl_df", "tbl", "data.frame"))
Package and library needed:
install.packages("rstatix")
library(rstatix)
I would like to use a placeholder in the following function:
t_test(genderweight, weight ~ group, detailed = TRUE)
My placeholder could be named i, for example, and afterwards I would like to run:
i <- "weight"
t_test(genderweight, i ~ group, detailed = TRUE)
Or alternatively, i could be a number, e.g. i = 1 and then I would like to run:
t_test(genderweight,genderweight[,i] ~ group, detailed = TRUE)
For both ways, I get an error message of the following type:
Error in `vec_as_location2_result()`:
! Can't extract columns that don't exist.
✖ Column `genderweight[, 1]` doesn't exist.
Run `rlang::last_error()` to see where the error occurred.
Is there a way to tell the function in an indirect way which column you want for the t-test?
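One possible approach, sketched here with base R's t.test() and made-up data rather than the genderweight tibble, is to build the formula from the column name with reformulate(). The data frame toy and the variables i and j are illustrative assumptions. The same formula object could presumably be passed to rstatix's t_test() as well, but that is not verified here.

```r
# Made-up data with the same two column names as in the question.
set.seed(42)
toy <- data.frame(
  weight = c(rnorm(10, mean = 63, sd = 3), rnorm(10, mean = 85, sd = 4)),
  group  = factor(rep(c("F", "M"), each = 10))
)

# Placeholder holding the column name as a string.
i <- "weight"
f <- reformulate("group", response = i)   # builds weight ~ group
res_indirect <- t.test(f, data = toy)

# Placeholder holding the column position instead.
j <- 1
f2 <- reformulate("group", response = names(toy)[j])

# For comparison: the formula typed out directly.
res_direct <- t.test(weight ~ group, data = toy)
```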

Conditional updating coordinate column in dataframe

I am attempting to populate two new, empty columns in a data frame with data from other columns in the same data frame, in different ways depending on which of those columns are populated.
I am trying to populate HIGH_PRCN_LAT and HIGH_PRCN_LON (previously called F_Lat and F_Lon), which represent the final latitudes and longitudes for those rows, based on the values of the other columns in the table.
Case 1: Lat/Lon2 are populated (like in IDs 1 & 2), using the great
circle algorithm a midpoint between them should be calculated and
then placed into F_Lat & F_Lon.
Case 2: Lat/Lon2 are empty, then the values of Lat/Lon1 should be put
into F_Lat and F_Lon (like with IDs 3 & 4).
My code is as follows, but it doesn't work (see previous versions, removed in an edit).
The preparatory code I am using is as follows:
incidents <- structure(list(id = 1:9, StartDate = structure(c(1L, 3L, 2L,
2L, 2L, 3L, 1L, 3L, 1L), .Label = c("02/02/2000 00:34", "02/09/2000 22:13",
"20/01/2000 14:11"), class = "factor"), EndDate = structure(1:9, .Label = c("02/04/2006 20:46",
"02/04/2006 22:38", "02/04/2006 23:21", "02/04/2006 23:59", "03/04/2006 20:12",
"03/04/2006 23:56", "04/04/2006 00:31", "07/04/2006 06:19", "07/04/2006 07:45"
), class = "factor"), Yr.Period = structure(c(1L, 1L, 2L, 2L,
2L, 3L, 3L, 3L, 3L), .Label = c("2000 / 1", "2000 / 2", "2000 /3"
), class = "factor"), Description = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), .Label = "ENGLISH TEXT", class = "factor"),
Location = structure(c(2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L
), .Label = c("Location 1", "Location 1 : Location 2"), class = "factor"),
Location.1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = "Location 1", class = "factor"), Postcode.1 = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "Postcode 1", class = "factor"),
Location.2 = structure(c(2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L,
1L), .Label = c("", "Location 2"), class = "factor"), Postcode.2 = structure(c(2L,
2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L), .Label = c("", "Postcode 2"
), class = "factor"), Section = structure(c(2L, 2L, 3L, 1L,
4L, 4L, 2L, 1L, 4L), .Label = c("East", "North", "South",
"West"), class = "factor"), Weather.Category = structure(c(1L,
2L, 4L, 2L, 2L, 2L, 4L, 1L, 3L), .Label = c("Animals", "Food",
"Humans", "Weather"), class = "factor"), Minutes = c(13L,
55L, 5L, 5L, 5L, 522L, 1L, 11L, 22L), Cost = c(150L, 150L,
150L, 20L, 23L, 32L, 21L, 11L, 23L), Location.1.Lat = c(53.0506727,
53.8721035, 51.0233529, 53.8721035, 53.6988355, 53.4768766,
52.6874562, 51.6638245, 51.4301359), Location.1.Lon = c(-2.9991256,
-2.4004125, -3.0988341, -2.4004125, -1.3031529, -2.2298073,
-1.8023421, -0.3964916, 0.0213837), Location.2.Lat = c(52.7116187,
53.746791, NA, 53.746791, 53.6787167, 53.4527824, 52.5264907,
NA, NA), Location.2.Lon = c(-2.7493169, -2.4777984, NA, -2.4777984,
-1.489026, -2.1247029, -1.4645023, NA, NA)), class = "data.frame", row.names = c(NA, -9L))
#gpsColumns is used as the following line of code is used for several data frames.
gpsColumns <- c("HIGH_PRCN_LAT", "HIGH_PRCN_LON")
incidents [ , gpsColumns] <- NA
#create separate variable(?) containing a list of which rows are complete
ind <- complete.cases(incidents [,17])
#populate rows with a two Lat/Lons with great circle middle of both values
incidents [ind, c("HIGH_PRCN_LON_2","HIGH_PRCN_LAT_2")] <-
with(incidents [ind,,drop=FALSE],
do.call(rbind, geosphere::midPoint(cbind.data.frame(Location.1.Lon, Location.1.Lat), cbind.data.frame(Location.2.Lon, Location.2.Lat))))
#populate rows with one Lat/Lon with those values
incidents[!ind, c("HIGH_PRCN_LAT","HIGH_PRCN_LON")] <- incidents[!ind, c("Location.1.Lat","Location.1.Lon")]
I will use the geosphere::midPoint function based off a recommendation here: http://r.789695.n4.nabble.com/Midpoint-between-coordinates-td2299999.html.
Unfortunately, it doesn't appear that this way of populating the column will work when there are several cases.
The current error that is thrown is:
Error in `$<-.data.frame`(`*tmp*`, F_Lat, value = integer(0)) :
replacement has 0 rows, data has 178012
Edit: also posted to reddit: https://www.reddit.com/r/Rlanguage/comments/bdvavx/conditional_updating_column_in_dataframe/
Edit: Added clarity on the parts of the code I do not understand.
#replaces the F_Lat2/F_Lon2 columns in rows with a both sets of input coordinates
dataframe[ind, c("F_Lat2","F_Lon2")] <-
#I am unclear on what this means, specifically what the "with" function does and what "drop=FALSE" does and also why they were used in this case.
with(dataframe[ind,,drop=FALSE],
#I am unclear on what do.call and rbind are doing here, but the second half (geosphere onwards) is binding the Lats and Lons to make coordinates as inputs for the gcIntermediate function.
do.call(rbind, geosphere::gcIntermediate(cbind.data.frame(Lat1, Lon1),
cbind.data.frame(Lat2, Lon2), n = 1)))

Though your code doesn't work as written for me, and I cannot calculate precisely the values you expect, I suspect the error you're seeing can be fixed with these steps. (Data is down at the bottom here.)
Pre-populate the empty columns.
Pre-calculate the complete.cases step, it'll save time.
Use cbind.data.frame for inside gcIntermediate.
I'm inferring from
gcIntermediate([dataframe...
^
this is an error in R
that you are binding those columns together, so I'll use cbind.data.frame. (Using cbind itself produced some ignorable warnings from geosphere, so you can use it instead and perhaps suppressWarnings, but that function is a little strong in that it'll mask other warnings as well.)
Also, since it appears you want one intermediate value for each pair of coordinates, I added the gcIntermediate(..., n=1) argument.
The use of do.call(rbind, ...) is because gcIntermediate returns a list, so we need to bring them together.
dataframe$F_Lon2 <- dataframe$F_Lat2 <- NA_real_
ind <- complete.cases(dataframe[,4])
dataframe[ind, c("F_Lat2","F_Lon2")] <-
with(dataframe[ind,,drop=FALSE],
do.call(rbind, geosphere::gcIntermediate(cbind.data.frame(Lat1, Lon1),
cbind.data.frame(Lat2, Lon2), n = 1)))
dataframe[!ind, c("F_Lat2","F_Lon2")] <- dataframe[!ind, c("Lat1","Lon1")]
dataframe
# ID Lat1 Lon1 Lat2 Lon2 F_Lat F_Lon F_Lat2 F_Lon2
# 1 1 19.05067 -3.999126 92.71332 -6.759169 55.88200 -5.379147 55.78466 -6.709509
# 2 2 58.87210 -1.400413 54.74679 -4.479840 56.80945 -2.940126 56.81230 -2.942029
# 3 3 33.02335 -5.098834 NA NA 33.02335 -5.098834 33.02335 -5.098834
# 4 4 54.87210 -4.400412 NA NA 54.87210 -4.400412 54.87210 -4.400412
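The question also asks what with(), drop = FALSE and do.call(rbind, ...) are doing. A small base R sketch with made-up objects (toy and pieces are illustrative names, unrelated to the coordinate data) may help:

```r
toy <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# with() evaluates an expression using the data frame's columns as
# variables, so this is the same as toy$a + toy$b.
sums <- with(toy, a + b)

# Selecting a single column normally "drops" the result to a vector;
# drop = FALSE keeps it as a data frame.
as_vector <- toy[, "a"]
as_frame  <- toy[, "a", drop = FALSE]

# do.call(rbind, lst) is the same as rbind(lst[[1]], lst[[2]], ...):
# it stacks all elements of a list on top of each other.
pieces <- list(matrix(1:2, nrow = 1),
               matrix(3:4, nrow = 1),
               matrix(5:6, nrow = 1))
stacked <- do.call(rbind, pieces)
dim(stacked)
```

In dataframe[ind,,drop=FALSE] the drop = FALSE is mostly defensive: with several columns, a row selection stays a data frame anyway, and the argument only matters if a single column would remain.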
Update, using your new incidents data and switching to geosphere::midPoint.
Try this:
incidents$F_Lon2 <- incidents$F_Lat2 <- NA_real_
# rows where the second location has coordinates
ind <- complete.cases(incidents[, c("Location.2.Lat","Location.2.Lon")])
# geosphere expects longitude first and returns (lon, lat) columns
incidents[ind, c("F_Lon2","F_Lat2")] <-
with(incidents[ind,,drop=FALSE],
geosphere::midPoint(cbind.data.frame(Location.1.Lon,Location.1.Lat),
cbind.data.frame(Location.2.Lon,Location.2.Lat)))
incidents[!ind, c("F_Lat2","F_Lon2")] <- incidents[!ind, c("Location.1.Lat","Location.1.Lon")]
One (big) difference is that geosphere::gcIntermediate(..., n=1) returns a list of results, whereas geosphere::midPoint(...) (no n=) returns just a matrix, so no rbinding required.
Data:
dataframe <- read.table(header=T, stringsAsFactors=F, text="
ID Lat1 Lon1 Lat2 Lon2 F_Lat F_Lon
1 19.0506727 -3.9991256 92.713318 -6.759169 55.88199535 -5.3791473
2 58.8721035 -1.4004125 54.746791 -4.47984 56.80944725 -2.94012625
3 33.0233529 -5.0988341 NA NA 33.0233529 -5.0988341
4 54.8721035 -4.4004125 NA NA 54.8721035 -4.4004125")

<<- not working - R - call object from outside function [closed]

Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.
This question was caused by a typo or a problem that can no longer be reproduced. While similar questions may be on-topic here, this one was resolved in a way less likely to help future readers.
Closed 5 years ago.
I've looked through the forums here and figured out that <<- assigns a variable inside a function to a global variable (so that it is accessible outside the function).
I've done so below, but to no avail - any thoughts?
> Billeddata_import <- function(burl="C:\\Users\\mcantwell\\Desktop\\Projects\\M & V Analysis\\Final_Bills.csv"){
+ billeddata<-read.csv(burl,header=TRUE, sep=",",stringsAsFactors = FALSE) %>%
+ mutate(Usage=as.numeric(Usage)) %>%
+ #Service.Begin.Date=as.Date(Service.Begin.Date,format='%m/%d/%Y'),
+ #Service.End.Date=as.Date(Service.End.Date,format='%m/%d/%Y')) %>%
+
+ filter(UOM=="Kw",
+ !is.na(Usage),
+ Service.Description %in% c("Demand","Demand On Peak", "Demand Off Peak", "Dmd Partial Pk")) %>%
+ group_by(Location..,Service.Begin.Date,Service.End.Date) %>%
+ summarise(monthly_peak=max(Usage))
+ out<<-billdata
+ }
> out
Error: object 'out' not found
>
The object billdata is a data table that I cleaned up in Billeddata_import(), and I'm hoping to use it in later functions.
Running the function alone yields:
> Billeddata_import()
Error in Billeddata_import() : object 'billdata' not found
Without the out<<-billdata line, Billeddata_import() runs fine.

NOTE: Using <<- is bad practice. You can read this thread to know more about this.
You need to run the function. Here, you only define it. Take one step further and run it before looking for out.
Since we do not have your data, look at the example below:
#This is an example:
myfun <- function(xdat=df) {
billeddata <- xdat %>% select(-var3) %>%
filter(var1=="treatment5")
out<<-billeddata
}
myfun(df) #You need to run the function!!!
out
# var1 var2 value
# 1 treatment5 group_2 0.005349631
# 2 treatment5 group_2 0.005349631
# 3 treatment5 group_1 0.005349631
Data:
df <- structure(list(var1 = structure(c(1L, 2L, 3L, 4L, 5L, 1L, 2L,
3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L), .Label = c("treatment1", "treatment2",
"treatment3", "treatment4", "treatment5"), class = "factor"),
var2 = structure(c(1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 1L, 1L, 1L), .Label = c("group_1", "group_2"), class = "factor"),
var3 = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 3L, 2L, 2L, 3L,
2L, 3L, 2L, 2L, 3L), .Label = c("C8.0", "C8.1", "C8.2"), class = "factor"),
value = c(0.010056478, 0.009382918, 0.003014983, 0.005349631,
0.005349631, 0.010056478, 0.009382918, 0.003014983, 0.005349631,
0.005349631, 0.010056478, 0.009382918, 0.003014983, 0.005349631,
0.005349631)), .Names = c("var1", "var2", "var3", "value"
), class = "data.frame", row.names = c(NA, -15L))
P.S.
Even if you want to use return(out), you still need to run the function after defining it.
Moreover, using return() will not add a variable to your global environment. You need to assign the result when calling the function, like this:
out <- myfun(df)
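Note also that in the question's function the pipeline result is stored in billeddata while the last line refers to billdata, which would explain the "object 'billdata' not found" error. The return-and-assign pattern can be sketched without any global assignment as follows; make_summary and its input are made-up names for illustration.

```r
# A function that returns its result instead of writing to the
# global environment with <<-.
make_summary <- function(x) {
  cleaned <- x[!is.na(x)]   # local variable, not visible outside
  max(cleaned)              # the last evaluated value is returned
}

# Defining the function does nothing yet; calling it and assigning
# the result is what creates the object.
peak <- make_summary(c(3, NA, 7))
peak
```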

You can just use return(out) as the last line in your function, and then call your function every time you need to access your variable.

In R - How do you make transition charts with the Gmisc package?

I've been trying to make a graph that looks like this (but nicer)
based on what I found in this discussion using the transitionPlot() function from the Gmisc package.
However, I can't get my transition_matrix right, and I also can't seem to plot the different state classes in a separate third column.
My data is based on the symptomatic improvement of patients following surgery. The numbers in the boxes are the number of patients in each "state" pre vs. post surgery. Please note the (LVAD) is not a necessity.
The data for this plot is called df and is as follows
dput(df)
structure(list(StudyID = structure(c(1L, 2L, 3L, 4L, 5L, 6L,
7L, 1L, 2L, 3L, 4L, 5L, 6L, 7L), .Label = c("P1", "P2", "P3",
"P4", "P5", "P6", "P7"), class = "factor"), MeasureTime = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Postoperative",
"Preoperative"), class = "factor"), NYHA = c(3L, 3L, 3L, 3L,
3L, 2L, 3L, 1L, 3L, 1L, 3L, 3L, 1L, 1L)), .Names = c("StudyID",
"MeasureTime", "NYHA"), row.names = c(NA, -14L), class = "data.frame")
I've made a plot in ggplot2 that looked like this
but my supervisor didn't like it, because I had to jitter the lines so that they didn't overlap and one could see what was happening with each patient, and thus the points/lines aren't exactly lined up with the y-axis.
So I was wondering if anyone had an idea, how I'd be able to do this using the Gmisc package making what seems to me to be a transitionPlot.
Your help and time is much appreciated.
Thanks.

Using your sample df data, here are some pretty low-level plotting functions that can re-create your sample image. It should be straightforward to customize however you like.
First, make sure pre comes before post
df$MeasureTime<-factor(df$MeasureTime, levels=c("Preoperative","Postoperative"))
then define some plot helper functions
textrect<-function(x,y,text,width=.2) {
rect(x-width, y-width, x+width, y+width)
text(x,y,text)
}
connect<-function(x1,y1,x2,y2, width=.2) {
segments(x1+width,y1,x2-width,y2)
}
now draw the plot
plot.new()
par(mar=c(0,0,0,0))
plot.window(c(0,4), c(0,4))
with(unique(reshape(df, idvar="StudyID", timevar="MeasureTime", v.names="NYHA", direction="wide")[,-1]),
connect(2,NYHA.Preoperative,3,NYHA.Postoperative)
)
with(as.data.frame(with(df, table(NYHA, MeasureTime))),
textrect(as.numeric(MeasureTime)+1,as.numeric(as.character(NYHA)), Freq)
)
text(1, 1:3, c("I","II","III"))
text(1:3, 3.75, c("NYHA","Pre-Op","Post-Op"))
text(3.75, 2, "(LVAD)")
which results in
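For the transition_matrix mentioned in the question, a from/to count table can be derived from the same data with reshape() and table(). The sketch below retypes the NYHA values from the question into a data frame called nyha (a name chosen here for illustration). Whether the resulting table can be passed to transitionPlot() unchanged is not verified here.

```r
# Same values as in the question's dput() output.
nyha <- data.frame(
  StudyID = rep(paste0("P", 1:7), times = 2),
  MeasureTime = factor(rep(c("Preoperative", "Postoperative"), each = 7),
                       levels = c("Preoperative", "Postoperative")),
  NYHA = c(3, 3, 3, 3, 3, 2, 3, 1, 3, 1, 3, 3, 1, 1)
)

# One row per patient, with a pre and a post column.
wide <- reshape(nyha, idvar = "StudyID", timevar = "MeasureTime",
                v.names = "NYHA", direction = "wide")

# Rows: class before surgery; columns: class after surgery.
trans <- table(pre = wide$NYHA.Preoperative, post = wide$NYHA.Postoperative)
trans
```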

Outlier detection for multi column data frame in R

I have a data frame with 18 columns and about 12000 rows. I want to find the outliers for the first 17 columns and compare the results with the column 18. The column 18 is a factor and contains data which can be used as indicator of outlier.
My data frame is ufo and I remove the column 18 as follow:
ufo2 <- ufo[,1:17]
and then convert 3 non-numeric columns to numeric values:
ufo2$Weight <- as.numeric(ufo2$Weight)
ufo2$InvoiceValue <- as.numeric(ufo2$InvoiceValue)
ufo2$Score <- as.numeric(ufo2$Score)
and then use the following command for outlier detection:
outlier.scores <- lofactor(ufo2, k=5)
But all of the elements of outlier.scores are NA!
Is there a mistake in this code?
Is there another way to find outliers in such a data frame?
All of my code:
setwd(datadirectory)
library(doMC)
registerDoMC(cores=8)
library(DMwR)
# load data
load("data_9802-f2.RData")
ufo2 <- ufo[,2:17]
ufo2$Weight <- as.numeric(ufo2$Weight)
ufo2$InvoiceValue <- as.numeric(ufo2$InvoiceValue)
ufo2$Score <- as.numeric(ufo2$Score)
outlier.scores <- lofactor(ufo2, k=5)
The output of the dput(head(ufo2)) is:
structure(list(Origin = c(2L, 2L, 2L, 2L, 2L, 2L), IO = c(2L,
2L, 2L, 2L, 2L, 2L), Lot = c(1003L, 1003L, 1003L, 1012L, 1012L,
1013L), DocNumber = c(10069L, 10069L, 10087L, 10355L, 10355L,
10382L), OperatorID = c(5698L, 5698L, 2015L, 246L, 246L, 4135L
), Month = c(1L, 1L, 1L, 1L, 1L, 1L), LineNo = c(1L, 2L, 1L,
1L, 2L, 1L), Country = c(1L, 1L, 1L, 1L, 11L, 1L), ProduceCode = c(63456227L,
63455714L, 33687427L, 32686627L, 32686627L, 791614L), Weight = c(900,
850, 483, 110000, 5900, 1000), InvoiceValue = c(637, 775, 2896,
48812, 1459, 77), InvoiceValueWeight = c(707L, 912L, 5995L, 444L,
247L, 77L), AvgWeightMonth = c(1194.53, 1175.53, 7607.17, 311.667,
311.667, 363.526), SDWeightMonth = c(864.931, 780.247, 3442.93,
93.5818, 93.5818, 326.238), Score = c(0.56366535234262, 0.33775439984787,
0.46825476121676, 1.414092583904, 0.69101737288291, 0.87827342721894
), TransactionNo = c(47L, 47L, 6L, 3L, 3L, 57L)), .Names = c("Origin",
"IO", "Lot", "DocNumber", "OperatorID", "Month", "LineNo", "Country",
"ProduceCode", "Weight", "InvoiceValue", "InvoiceValueWeight",
"AvgWeightMonth", "SDWeightMonth", "Score", "TransactionNo"), row.names = c(NA,
6L), class = "data.frame")

First of all, you need to spend a lot more time preprocessing your data.
Your axes have completely different meaning and scale. Without care, the outlier detection results will be meaningless, because they are based on a meaningless distance.
Take ProduceCode, for example: are you sure it should be part of your similarity measure?
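As a rough illustration of the preprocessing meant here, the sketch below standardizes two made-up columns with base R's scale() so that neither dominates a Euclidean distance. It also shows a common pitfall with as.numeric(): if a column is a factor (the question does not say whether that is the case), the conversion returns the internal level codes rather than the printed numbers. The names toy, scaled and f are assumptions for the example.

```r
# Two columns on very different scales.
toy <- data.frame(
  Weight = c(900, 850, 483, 110000, 5900, 1000),
  Score  = c(0.56, 0.34, 0.47, 1.41, 0.69, 0.88)
)

# Centre each column to mean 0 and rescale to standard deviation 1.
scaled <- scale(toy)
colMeans(scaled)
apply(scaled, 2, sd)

# Pitfall: as.numeric() on a factor gives level codes, not values.
f <- factor(c("900", "850"))
codes  <- as.numeric(f)                 # level codes
values <- as.numeric(as.character(f))   # the numbers as printed
```

Note that a constant column has standard deviation 0 and turns into NaN under scale(), so such columns need to be dropped or handled separately first.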
Also note that I found the lofactor implementation of the R DMwR package to be really slow. Plus, it seems to be hard-wired to Euclidean distance!
Instead, I recommend using ELKI for outlier detection. First of all, it comes with a much wider choice of algorithms, secondly it is much faster than R, and third, it is very modular and flexible. For your use case, you may need to implement a custom distance function instead of using Euclidean distance.
Here's the link to the ELKI tutorial on implementing a custom distance function.
