I'm busy with a small project where a large amount of samples have been taken from a manufacturing process (2700 samples of 11 items). A specified Upper and Lower Specification Limit has been set, and items under the LSL are said to cost $3 to fix, while items above the USL are said to cost $5 to fix. The data is spread with a uniform distribution.
How would I go about deciding where to centre the process (given that the distribution would stay the same along the centre line) to minimize total cost? I know how to do it iteratively, but I'd like a more optimal way to solve this problem.
EDIT: Here is an example of the data I'm working with.
One sample would be, for instance
45.62565379
47.06496942
46.39000538
46.44387364
45.81911053
45.25935862
48.75357907
46.50918593
46.87072887
46.60195194
48.09000017
There are 2701 more samples like the one above (albeit with different values) making up my population. The population mean is 47.66 and population standard deviation is 1.425. The UCL is 48.98 and the LCL is 46.34. The USL has been set to 50 and the LSL to 45.
Currently the process is centered around the population mean, but the amount of samples with means above 50 is proportionally larger than that of the amount of samples with means under 45, meaning that the process is more expensive, as it costs $5 to fix a batch above the USL and only $3 to fix it under the LSL. How do I decide where to centre the process if its distribution around the centre line will remain the same to minimize cost?
Related
The following code comes from this book, Statistics and Data Analysis For Financial Engineering, which describes how to generate simulation data of ARCH(1) model.
library(TSA)
library(tseries)
n = 10200
set.seed("7484")
e = rnorm(n)
a = e
y = e
sig2 = e^2
omega = 1
alpha = 0.55
phi = 0.8
mu = 0.1
omega/(1-alpha) ; sqrt(omega/(1-alpha))
for (t in 2:n){
a[t] = sqrt(sig2[t])*e[t]
y[t] = mu + phi*(y[t-1]-mu) + a[t]
sig2[t+1] = omega + alpha * a[t]^2
}
plot(e[10001:n],type="l",xlab="t",ylab=expression(epsilon),main="(a) white noise")
My question is that why we need to discard the first 10000 simulation?
========================================================
Bottom Line Up Front
Truncation is needed to deal with sampling bias introduced by the simulation model's initialization when the simulation output is a time series.
Details
Not all simulations require truncation of initial data. If a simulation produces independent observations, then no truncation is needed. The problem arises when the simulation output is a time series. Time series differ from independent data because their observations are serially correlated (also known as autocorrelated). For positive correlations, the result is similar to having inertia—observations which are near neighbors tend to be similar to each other. This characteristic interacts with the reality that computer simulations are programs, and all state variables need to be initialized to something. The initialization is usually to a convenient state, such as "empty and idle" for a queueing service model where nobody is in line and the server is available to immediately help the first customer. As a result, that first customer experiences zero wait time with probability 1, which is certainly not the case for the wait time of some customer k where k > 1. Here's where serial correlation kicks us in the pants. If the first customer always has a zero wait time, that affects some unknown quantity of subsequent customer's experiences. On average they tend to be below the long term average wait time, but gravitate more towards that long term average as k, the customer number, increases. How long this "initialization bias" lingers depends on both how atypical the initialization is relative to the long term behavior, and the magnitude and duration of the serial correlation structure of the time series.
The average of a set of values yields an unbiased estimate of the population mean only if they belong to the same population, i.e., if E[Xi] = μ, a constant, for all i. In the previous paragraph, we argued that this is not the case for time series with serial correlation that are generated starting from a convenient but atypical state. The solution is to remove some (unknown) quantity of observations from the beginning of the data so that the remaining data all have the same expected value. This issue was first identified by Richard Conway in a RAND Corporation memo in 1961, and published in refereed journals in 1963 - [R.W. Conway, "Some tactical problems on digital simulation", Manag. Sci. 10(1963)47–61]. How to determine an optimal truncation amount has been and remains an active area of research in the field of simulation. My personal preference is for a technique called MSER, developed by Prof. Pres White (University of Virginia). It treats the end of the data set as the most reliable in terms of unbiasedness, and works its way towards the front using a fairly simple measure to detect when adding observations closer to the front produces a significant deviation. You can find more details in this 2011 Winter Simulation Conference paper if you're interested. Note that the 10,000 you used may be overkill, or it may be insufficient, depending on the magnitude and duration of serial correlation effects for your particular model.
It turns out that serial correlation causes other problems in addition to the issue of initialization bias. It also has a significant effect on the standard error of estimates, as pointed out at the bottom of page 489 of the WSC2011 paper, so people who calculate the i.i.d. estimator s2/n can be off by orders of magnitude on the estimated width of confidence intervals for their simulation output.
I'm having trouble interpreting the results I got from relrisk. My data is a multiple point process containing two marks (two rodents species AA and RE), I want to know if they are spatially segregated or not.
> summary(REkm)
Marked planar point pattern: 46 points
Average intensity 0.08101444 points per square unit
*Pattern contains duplicated points*
Coordinates are given to 3 decimal places
i.e. rounded to the nearest multiple of 0.001 units
Multitype:
frequency proportion intensity
AA 15 0.326087 0.02641775
RE 31 0.673913 0.05459669
Window: rectangle = [4, 38] x [0.3, 17] units
x 16.7 units)
Window area = 567.8 square units
relkm <- relrisk(REkm)
plot(relkm, main="Relrisk default")
The bandwidth of this relrisk estimation is automatically selection by default(bw.relrisk), but when I tried other numeric number using sigma= 0.5 or 1, the results are somehow kind of weird.
How did this happened? Was it because the large proportion of blank space of my ppp?
According to chapter.14 of Spatial Point Patterns books and the previous discussion, I assume the default of relrisk will show the ratio of intensities (case divided by control, in my case: RE divided by AA), but if I set casecontrol=FALSE, I can get the spatially-varying probability of each type.
Then why the image of type RE in the Casecontrol=False looks exactly same as the relrisk estimation by default? Or they both estimate p(RE)=λRE/ λRE+λAA for each sites?
Any help will be appreciated! Thanks a lot!
That's two questions.
Why does the image for RE when casecontrol=FALSE look the same as the default output from relrisk?
The definitive source of information about spatstat functions is the online documentation in the help files. The help file for relrisk.ppp gives full details of the behaviour of this function. It says that the calculation of probabilities and risks is controlled by the argument relative. If relative=FALSE (the default), the code calculates the spatially varying probability of each type. If relative=TRUE it calculates the relative risk of each type i, defined as the ratio of the probability of type i to the probability of type c where c is the type designated as the control. If you wanted the relative risk then you should set relative=TRUE.
Very different results obtained when setting sigma=0.5 compared to the automatically selected bandwidth.
Your example output says that the window is 34 by 17 units. A smoothing bandwidth of sigma=0.5 is very small for this region. Imagine each data point being replaced by a blurry circle of radius 0.5; there would be a lot of empty space. The smoothing procedure is encountering numerical problems which are causing the funky artefacts.
You could try a range of different values of sigma, say from 1 to 15, and decide which value produces the most satisfactory result.
The plot of relrisk(REkm, casecontrol=FALSE) suggests that the automatic bandwidth selector bw.relriskppp chose a much larger value of sigma, perhaps about 10. You can investigate this by
b <- bw.relriskppp(REkm)
print(b)
plot(b)
The print command will print the chosen value of sigma that was used in the default calculation. The plot command will show the cross-validation criterion which was maximised to select the bandwidth. This gives you an idea of the range of values of sigma that are acceptable according to the automatic selector.
Read the help file for bw.relriskppp about the different options available for bandwidth selection method. Maybe a different choice of method would give you a more acceptable result from your viewpoint.
I'm fitting a logarithmic curve to 20+ data sets using the equation
y = intercept + coefficient * ln(x)
Generated in R via
output$curvePlot <- renderPlot ({
x=medianX
y=medianY
Estimate = lad(formula = y~log(x),method = "EM")
logEstimate = lad(formula = y~log(x),method = "EM")
plot(x,predict(Estimate),type='l',col='white')
lines(x,predict(logEstimate),col='red')
points(x,y)
cf <- round(coef(logEstimate),1)
eq <- paste0("y = ", cf[1],
ifelse(sign(cf[2])==1, " + ", " - "), abs(cf[2]), " * ln(x) from 0 to ",xmax)
mtext(eq,3,line=-2,col = "red")
output$summary <- renderPrint(summary(logEstimate))
output$calcCurve <-
renderPrint(round(cf[2]*log(input$calcFeet)+cf[1]))
})
The curve consistently "crosses twice" on the data; fitting too low at low/high points on the X axis, fitting too high at the middle of the X axis.
I don't really understand where to go from here. Am I missing a factor or using the wrong curve?
The dataset is about 60,000 rows long, but I condensed it into medians. Medians were selected due to unavoidable outliers in the data, particularly a thick left tail, caused by our instrumentation.
x,y
2,6.42
4,5.57
6,4.46
8,3.55
10,2.72
12,2.24
14,1.84
16,1.56
18,1.33
20,1.11
22,0.92
24,0.79
26,0.65
28,0.58
30,0.34
32,0.43
34,0.48
36,0.38
38,0.37
40,0.35
42,0.32
44,0.21
46,0.25
48,0.24
50,0.25
52,0.23
Full methodology for context:
Samples of dependent variable, velocity (ft/min), were collected at
various distances from fan nozzle with a NIST-calibrated hot wire
anemometer. We controlled for instrumentation accuracy by subjecting
the anemometer to a weekly test against a known environment, a
pressure tube with a known aperture diameter, ensuring that
calibration was maintained within +/- 1%, the anemometer’s published
accuracy rating.
We controlled for fan alignment with the anemometer down the entire
length of the track using a laser from the center of the fan, which
aimed no more than one inch from the center of the anemometer at any
distance.
While we did not explicitly control for environmental factors, such as
outdoor air temperature, barometric pressure, we believe that these
factors will have minimal influence on the test results. To ensure
that data was collected evenly in a number of environmental
conditions, we built a robot that drove the anemometer down the track
to a different distance every five minutes. This meant that data would
be collected at every independent variable position repeatedly, over
the course of hours, rather than at one position over the course of
hours. As a result, a 24 hour test would measure the air velocity at
each distance over 200 times, allowing changes in temperature as the
room warmed or cooled throughout the day to address any confounding
environmental factors by introducing randomization.
The data was collected via Serial port on the hot wire anemometer,
saving a timestamped CSV that included fields: Date, Time, Distance
from Fan, Measured Temperature, and Measured Velocity. Analysis on the
data was performed in R.
Testing: To gather an initial set of hypotheses, we took the median of
air velocity at each distance. The median was selected, rather than
the mean, as outliers are common in data sets measuring physical
quantities. As air moves around the room, it can cause the airflow to
temporarily curve away from the anemometer. This results in outliers
on the low end that do not reflect the actual variable we were trying
to measure. It’s also the case that, sometimes, the air velocity at a
measured distance appears to “puff,” or surge and fall. This is
perceptible by simply standing in front of the fan, and it happens on
all fans at all distances, to some degree. We believe the most likely
cause of this puffing is due to eddy currents and entrainment of the
surrounding air, temporarily increasing airflow. The median result
absolves us from worrying about how strong or weak a “puff” may feel,
and it helps limit the effects on air speed of the air curving away
from the anemometer, which does not affect actual air velocity, but
only measured air velocity. With our initial dataset of medians, we
used logarithmic regression to calculate a curve to match the data and
generated our initial velocity profiles at set distances. To validate
that the initial data was accurate, we ran 10 monte carlo folding
simulations at 25% of the data set and ensured that the generated
medians were within a reasonable value of each other.
Validation: Fans were run every three months and the monte carlo
folding simulations were observed. If the error rate was <5% from our
previous test, we validated the previous test.
There is no problem with the code itself, you found the best possible fit using a logarithmic curve. I double-checked using Mathematica, and I obtain the same results.
The problem seems to reside in your model. From the data you provided and the description of the origin of the data, the logarithmic function might not the best model for your measurements. The description indicates that the velocity must be a finite value at x=0, and slowly tends towards 0 while going to infinity. However, the negative logarithmic function will be infinite at x=0 and negative after a while.
I am not a physicist, but my intuition would tend towards using the inverse-square law or using the exponential function. I tested both, and the exponential function gives way better results:
I have some data sampled at regular intervals that looks sinusoidal and I would like to determine the frequency of the wave, to that end I obtained R and loaded the TSA package that contains a function named 'periodogram'.
In an attempt to understand how it works I created some data as follows:
x<-.0001*1:260
This could be interpreted to be 260 samples with an interval of .0001 seconds
Frequency=80
The frequency could be interpreted to be 80Hz so there should be about 125 points per wave period
y<-sin(2*pi*Frequency*x)
I then do:
foo=TSA::periodogram(y)
In the resulting periodogram I would expect to see a sharp spike at the frequency that corresponds to my data - I do see a sharp spike but the maximum 'spec' value has a frequency of 0.007407407, how does this relate to my frequency of 80Hz?
I note that there is variable foo$bandwidth with a value of 0.001069167 which I also have difficulty interpreting.
If there are better ways of determining the frequency of my data I would be interested - my experience with R is limited to one day.
The periodogram is computed from the time series without knowledge of your actual sampling interval. This result in frequencies which are limited to the normalized [0,0.5] range. To obtain a frequency in Hertz that takes into account the sampling interval, you simply need to multiply by the sampling rate. In your case, the spike you get at a normalized frequency of 0.007407407 and a sampling rate of 10,000Hz, this correspond to a frequency of ~74Hz.
Now, that's not quite 80Hz (the original tone frequency), but you have to keep in mind that a periodogram is a frequency spectrum estimate, and its frequency resolution is limited by the number of input samples. In your case you are using 260 samples, so the frequency resolution is on the order of 10,000Hz/260 or ~38Hz. Since 74Hz is well within 80 +/- 38Hz, it is a reasonable result. To get a better frequency estimate you would have to increase the number of samples.
Note that the periodogram of a sinusoidal tone will typically spike near the tone frequency and decay on either side (a phenomenon caused by the limited number of samples used for the estimation, often called spectral leakage) until the value can be considered comparatively 'negligeable'. The foo$bandwidth variable then indicates that the input signal starts to contain less energy for frequencies above 0.001069167*10000Hz ~ 107Hz, which is consistent with the tone's decay.
I can't find the type of problem I have and I was wondering if someone knew the type of statistics it involves. I'm not sure it's even a type that can be optimized.
I'd like to optimize three variables, or more precisely the combination of 2. The first is a likert scale average the other is the frequency of that item being rated on that likert scale, and the third is the item ID. The likert is [1,2,3,4]
So:
3.25, 200, item1. Would mean that item1 was rated 200 times and got an average of 3.25 in rating.
I have a bunch of items and I'd like to find the high value items. For instance, an item that is 4,1 would suck because while it is rated highest, it is rated only once. And a 1,1000 would also suck for the inverse reason.
Is there a way to optimize with a simple heuristic? Someone told me to look into confidence bands but I am not sure how that would work. Thanks!
Basically you want to ignore scores with fewer than x ratings, where x is a threshold that can be estimated based on the variance in your data.
I would recommend estimating the variance (standard deviation) of your data and putting a threshold on your standard error, then translating that error into the minimum number of samples required to produce that bound with 95% confidence. See: http://en.wikipedia.org/wiki/Standard_error_(statistics)
For example, if your data has standard deviation 0.5 and you want to be 95% sure your score is within 0.1 of the current estimate, then you need (0.5/0.1)^2 = 25 ratings.