Given a vector of specified values, for example:
x = c(4.0, 3.7, 3.3, 3.0, 2.7, 2.3, 2.0, 1.7, 1.3, 1.0)
I would like to create a new vector of any length made up only of values from x, sampled randomly, such that the combined mean is 3.15. I have tried the rnorm() function, but it only generates arbitrary random numbers averaging 3.15, not numbers restricted to the specified values. Could anyone point me in the right direction?
The problem with your question is that there are an infinite number of ways to sample from
x = c(4.0, 3.7, 3.3, 3.0, 2.7, 2.3, 2.0, 1.7, 1.3, 1.0)
to get a mean of roughly 3.15; you just have to specify a probability for each value.
Doing
n = 20
sample(x, n, replace = TRUE)
assumes each value is equally likely, and you would get a mean close to 2.5 (the mean of x). But if you re-weight the probabilities, you can get closer to what you want. One way of doing this might be
p = 1/(x - 3.15)^2 # or try p = 1/abs(x - 3.15)
sample(x, n, replace = TRUE, prob = p)
where p weights values closer to 3.15 more heavily, so they are more likely to be drawn. It isn't perfect (the true expected value is about 3.12, and most draws are 2.7, 3.0 and 3.3), but then again there isn't a single solution.
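For instance, a quick empirical check of that expected value (a sketch assuming the x and p defined above):
set.seed(1)
# with a large sample, the sample mean is close to the true expected value
mean(sample(x, 1e5, replace = TRUE, prob = p))
# comes out near 3.12 rather than 3.15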
Here's my brute force method:
library(dplyr) # for between()

samp315 <- function(n = 20, desmean = 3.15, distance = 0.001) { # defaults: n = 20, accepted range 3.149-3.151
  x <- c(4.0, 3.7, 3.3, 3.0, 2.7, 2.3, 2.0, 1.7, 1.3, 1.0)
  samp <- 0 # reset samp to 0
  i <- 0    # reset the attempt counter to zero
  # run until a sample (samp) with a mean inside the specified range is found,
  # OR until 1 million attempts have been made
  while (!between(mean(samp), desmean - distance, desmean + distance) & i < 1000000) {
    samp <- sample(x, n, replace = TRUE) # draw a sample of size n from x
    i <- i + 1 # add to the counter towards 1 million
  }
  # if the while loop ended because the counter reached a million, exit with a
  # warning; otherwise return samp. Note if()/else rather than ifelse(), which
  # would return only the first element of samp.
  if (i < 1000000) samp
  else warning("Couldn't find an appropriate sample; try a lower n, a desired mean closer to 2.5, or a greater distance")
}
Now, every time you do samp315():
eg<-samp315()
mean(eg)
[1] 3.15
eg
[1] 3.0 3.7 3.0 3.7 3.3 3.7 3.3 3.3 4.0 1.0 1.7 3.0 2.0 4.0 3.7 3.7 2.3 3.3 4.0 3.3
If you want a sample of a different length, just pass the number you want as n to samp315(). The larger n is, however, the longer it will take to find a sample that hits your desired mean.
You can also change the desired mean by setting desmean, and play around with the tolerance by changing distance to whatever deviation (+/-) from the desired mean you can accept. The defaults are n = 20 and a range from 3.149 to 3.151.
To avoid an infinite loop for highly unlikely combinations of n and distance, I set a maximum of 1 million attempts, after which the function quits with a warning.
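For example, a longer sample with a looser tolerance (using the function defined above):
eg50 <- samp315(n = 50, desmean = 3.15, distance = 0.01) # accept any mean between 3.14 and 3.16
mean(eg50)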
As @mickey pointed out, we can weight the probability of each item according to how far it is from the target mean. However, that alone does not quite work, because there are more elements in x below the desired mean than above it, which skews the sampling towards the low values. We can account for this by adjusting the probabilities according to how many elements lie above or below the desired mean:
x = c(4.0, 3.7, 3.3, 3.0, 2.7, 2.3, 2.0, 1.7, 1.3, 1.0)
xbar = 3.15
xhi = x[which(x > xbar)] # elements above the desired mean
xlo = x[which(x < xbar)] # elements below the desired mean
probhi = 1/(xhi - xbar)
problo = 1/(xbar - xlo)
# rescale the high side so the expected deviations above and below xbar cancel
probhi = probhi * length(problo) / length(probhi)
n = 1e5
set.seed(1)
# c(probhi, problo) lines up with x because x is sorted in decreasing order
y = sample(x, size = n, replace = TRUE, prob = c(probhi, problo))
mean(y)
# [1] 3.150216
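For other targets, here is a sketch wrapping that reweighting into a reusable function (weighted_sample is a made-up name; unlike the code above, it does not require x to be sorted, and it assumes no element of x equals the target exactly):
weighted_sample <- function(x, n, xbar) {
  hi <- x > xbar
  # inverse-distance weights on each side of the target mean
  p <- ifelse(hi, 1 / (x - xbar), 1 / (xbar - x))
  # rescale the high side so the expected deviations above and below xbar cancel
  p[hi] <- p[hi] * sum(!hi) / sum(hi)
  sample(x, size = n, replace = TRUE, prob = p)
}
mean(weighted_sample(x, 1e5, 3.15)) # close to 3.15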
I would like to round the min and max of a vector outward to integers, so that the interval defined by the two values is an actual superset of the original data.
For instance, for a vector between 1.1 and 2.9, round(range()) returns 1 and 3, which is the desirable outcome:
x <- seq(1.1, 2.9, 0.1)
oldrange_x <- range(x)
newrange_x <- round(oldrange_x)
newrange_x
On the other hand, for 1.8 and 2.9 round(range()) returns 2 and 3, which is not a superset of the initial vector:
y <- seq(1.8, 2.9, 0.1)
oldrange_y <- range(y)
newrange_y <- round(oldrange_y)
newrange_y
which leads me to use a combination of floor() and ceiling():
newrange_y2 <- c(floor(min(oldrange_y)), ceiling(max(oldrange_y)))
newrange_y2
Is there a ready-made function that does this (essentially a roundrange() function), to avoid the ugly solution and make the code a bit more readable?
Your use of floor and ceiling is perfectly fine, not ugly at all. Here's how you would wrap it in a function.
superset <- function(x) {
  c(floor(min(x)), ceiling(max(x)))
}
superset(seq(1.8, 2.9, 0.1))
[1] 1 3
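If you ever need the same idea at finer granularity, a hedged generalization (roundrange is a made-up name) scales before applying floor() and ceiling():
roundrange <- function(x, digits = 0) {
  m <- 10^digits
  c(floor(min(x) * m), ceiling(max(x) * m)) / m
}
roundrange(seq(1.83, 2.91, 0.01), digits = 1) # [1] 1.8 3.0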
I am using the baysout function for outlier detection from the 'dprep' package in R. The returned value is supposed to be a 2 column matrix according to the R documentation. The first column contains the indexes of the top num.out (user defined number of outliers to return) and the second, the outlyingness measure for each index.
The problem is that I want to access the index number separately, but I am not able to do this. The function is actually returning a num.out x 1 matrix as opposed to a num.out x 2 matrix. The index value and the outlyingness measure are there, but I cannot access them separately. Please see sample code below:
# Install and load the dprep library
install.packages("dprep")
library(dprep)
# Create 5x3 matrix for input to baysout function
A = matrix(c(0.8, 0.4, 1.2, 0.4, 1.2, 1.1, 0.3,
0.1, 1.9, 1.1, 0.9, 1.4, 0.3, 1.5, 0.5), nrow=5, ncol=3)
# Run the baysout function on matrix A and store result in outliers
outliers <- baysout(A, blocks = 3, nclass=0, k = 3, num.out = 3)
# print out result
print(outliers)
# attempt to access the index
print(outliers[1,1])
Output is as follows:

# print out result
print(outliers)
      [,1]
4 3.625798
3 2.901654
2 2.850419

# attempt to access the index
print(outliers[1,1])
       4
3.625798
This is not the real data I am using, which is much larger, but I would like to gain access to the index. In the example above I would like to be able to access the number 4 on its own; it is coupled with the 3.625798 and I am not able to access each figure separately. Would anyone have any advice on how I could do this?
Solution by ekstroem:
Use:
index <- as.numeric(rownames(outliers))
The documentation may not be entirely correct. In any case the index is stored in the row names.
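For instance, a short sketch combining both pieces (out_df is just an illustrative name), assuming outliers is the num.out x 1 matrix shown above:
out_df <- data.frame(index = as.numeric(rownames(outliers)), # the row-name indices
                     score = outliers[, 1])                  # the outlyingness measures
out_df$index # the indices on their own: 4 3 2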
I am trying to use match() in R to find any matching values within a certain interval. For example:
v <- c(2.2, 2.4, 4.3, 1.3, 4.5, 6.8, 0.9)
match(2.4, v)
gives me the position where 2.4 occurs in v (match() returns only the first match), but what if I wanted to allow a range for possible matches, for example 2.4 +/- 0.2?
Any help is greatly appreciated, thanks in advance!
In that case, I would use subsetting:
v[v>2.2 & v<2.6]
or
which(v>2.2 & v<2.6)
depending on whether you want the values or the indices.
This is another option:
which(findInterval(v, c(-.2, .2) + 2.4) == 1)
[1] 1 2
findInterval(v, c(-.2, .2) + 2.4) gives you 1 1 2 0 2 2 0, where 1 means the element is inside the interval, 0 means it's to the left, and 2 means to the right.
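If you need this often, a small helper is easy to write (match_near is a made-up name; note it includes the endpoints, unlike the strict inequalities above):
match_near <- function(target, v, tol) {
  # indices of all elements within tol of target (watch floating-point edge cases at the boundary)
  which(abs(v - target) <= tol)
}
match_near(2.4, v, 0.2)
# [1] 1 2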
I have a list of points I want to check for autocorrelation using Moran's I, dividing the area of interest into 4 x 4 quadrats.
Now every example I found on Google (e.g. http://www.ats.ucla.edu/stat/r/faq/morans_i.htm) uses some kind of measured value as the first input for the Moran's I function, no matter which library is used (I looked into the ape and spdep packages).
However, all I have are the points themselves I want to check the correlation for.
The problem is, as funny (or sad) as this might sound, that I have no idea what I'm doing here. I'm not much of a (spatial) statistics guy; all I want to find out is whether a collection of points is dispersed, clustered or random using Moran's I.
Is my approach correct? If not, where and what am I doing wrong?
Thanks
This is what I have so far:
# download, install and load the spatstat package (http://www.spatstat.org/)
install.packages("spatstat")
library(spatstat)
# Download, install and run the ape package (http://cran.r-project.org/web/packages/ape/)
install.packages("ape")
library(ape)
# Define points
x <- c(3.4, 7.3, 6.3, 7.7, 5.2, 0.3, 6.8, 7.5, 5.4, 6.1, 5.9, 3.1, 5.2, 1.4, 5.6, 0.3)
y <- c(2.2, 0.4, 0.8, 6.6, 5.6, 2.5, 7.6, 0.3, 3.5, 3.1, 6.1, 6.4, 1.5, 3.9, 3.6, 5.2)
# Store the coordinates as a matrix
coords <- as.matrix(cbind(x, y))
# Store the points as a two-dimensional point pattern (ppp) object (ranging from 0 to 8 on both axes)
coords.ppp <- as.ppp(coords, c(0, 8, 0, 8))
# Quadrat count
coords.quadrat <- quadratcount(coords.ppp, 4)
# Store the Quadrat counts as vector
coords.quadrat.vector <- as.vector(coords.quadrat)
# Replace any value > 1 with 1
coords.quadrat.binary <- ifelse(coords.quadrat.vector > 1, 1, coords.quadrat.vector)
# Moran's I
# Generate the distance matrix (Euclidean distances between points)
coords.dists <- as.matrix(dist(coords))
# Take the inverse of the matrix
coords.dists.inv <- 1/coords.dists
# replace the diagonal entries (Inf) with zeroes
diag(coords.dists.inv) <- 0
writeLines("Moran's I:")
print(Moran.I(coords.quadrat.binary, coords.dists.inv))
writeLines("")
There are a few ways of doing this. I took a great (free) course in analysing spatial data with R by Roger Bivand, who is very active on the r-sig-geo mailing list (where you may want to direct this query). You basically want to assess whether or not your point pattern is completely spatially random.
You can plot the empirical cumulative distribution of nearest neighbour distances of your observed points, and then compare this to the ecdf of randomly generated sets of completely spatially random point patterns within your observation window:
# The data
coords.ppp <- ppp(x, y, xrange = c(0, 8), yrange = c(0, 8))
# Number of points
n <- coords.ppp$n
# We want to generate completely spatially random point patterns to compare against the observed one
ex <- expression(runifpoint(n, win = owin(c(0, 8), c(0, 8))))
# Reproducible simulation
set.seed(1)
# Compute a simulation envelope using Gest, which estimates the nearest-neighbour distance distribution function G(r)
res <- envelope(coords.ppp, Gest, nsim = 99, simulate = ex, verbose = FALSE, savefuns = TRUE)
# Plot
plot(res)
The observed nearest neighbour distribution is completely contained within the grey envelope of the ecdf of randomly generated point patterns. My conclusion would be that you have a completely spatially random point pattern, with the caveat that you don't have many points.
As an aside, where the black observed line falls below the grey envelope we may infer that points are further apart than would be expected by chance and vice versa above the envelope.
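Since the question started from 4 x 4 quadrat counts, it's also worth noting that spatstat can run a chi-squared test of complete spatial randomness directly on those counts (a quick sketch, assuming the coords.ppp object from above):
# chi-squared quadrat test of CSR on a 4 x 4 grid; with only 16 points the
# expected count per quadrat is 1, so expect a warning that the chi-squared
# approximation may be inaccurate
quadrat.test(coords.ppp, nx = 4, ny = 4)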
I would like to take many values from interpolation at once.
For example, from my data file('int.txt'), I have each "conc1" corresponding to each "depth1" (e.g., 1.1 m, 2.1 m, 3.1 m, 4.1 m, 5.1 m, 6.1 m).
Here, after interpolating my concentration data, I want to obtain "conc" values at depths of 1.2, 2.2, 3.2, 4.2 and 5.2 m.
Following the comments below (I'm editing my question), I wrote code like this:
f <- approxfun(depth1, conc1, rule = 1, method = "linear") # approxfun() has no xout argument; it returns a function to evaluate later
i <- approx(depth1, conc1, rule = 1, method = "linear", xout = seq(1.2, 5.2, 1.0))
It works well. Here, I have two more questions.
1. How can I make two columns from the data in i? Can I add these two columns to my data, 'int'? In this case, the last rows of the new columns will have no values.
2. I have one more x, y pair of vectors (y = conc2, x = depth2). I have each "conc2" at each "depth2", and "depth2" does not have regular intervals (it is like 1.3, 2.7, 3.2...). Here, after interpolating as above, I want to extract the interpolated "conc1" values corresponding to "depth2".
Please let me know how to do these things. Thank you very much for your help :)
approxfun() generates a function that interpolates between the given x and y vectors. You can call that function on a vector to take many approximations at once. There are several customizations you can make (such as the method of interpolation and what to do outside of the data range), but this should get you started until you need something more complicated.
?approxfun
f <- approxfun(x = c(1.1, 2.1, 3.1, 4.1, 5.1), y = c(1, 3, 5, 2, 4), rule = 1, method = "constant") # step-function interpolation
plot(y = f(seq(1.1, 5.1, 0.1)), x = seq(1.1, 5.1, 0.1))
f <- approxfun(x = c(1.1, 2.1, 3.1, 4.1, 5.1), y = c(1, 3, 5, 2, 4), rule = 1, method = "linear") # linear interpolation
plot(y = f(seq(1.1, 5.1, 0.1)), x = seq(1.1, 5.1, 0.1))
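To tie this back to the two numbered questions, a minimal sketch (depth1, conc1 and depth2 are hypothetical stand-ins for the vectors described in the question):
depth1 <- c(1.1, 2.1, 3.1, 4.1, 5.1, 6.1)
conc1 <- c(1, 3, 5, 2, 4, 6)
f <- approxfun(depth1, conc1, rule = 1, method = "linear")
# 1. approx() returns a list with components x and y, so two columns are easy:
i <- approx(depth1, conc1, rule = 1, method = "linear", xout = seq(1.2, 5.2, 1.0))
interp <- as.data.frame(i) # columns x (depth) and y (interpolated conc)
# 2. evaluate the interpolating function at the irregular depth2 locations:
depth2 <- c(1.3, 2.7, 3.2)
conc1_at_depth2 <- f(depth2)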