I am trying to run the following algorithm in R:
generate a uniform random variable, u
find i such that F(x(i-1)) < u <= F(x(i))
return x=x(i)
In my case, I segmented my F(x) so that it is given by:
cdf:
[1] 0.0000000000 0.0001524158 0.0025910684 0.0196616369 0.0879439110
0.2586495961 0.5317786923 0.8049077884 0.9609815577 1.0000000000
So for example F(x(1)) = cdf[2]
Then I generate a vector of random uniforms:
u<-c(runif(10000,0,1))
But I am having trouble assigning each element in that vector to a specific range in the 'cdf'. I've tried a for loop with many if statements, but this is tedious and error prone.
I've also tried the following using a while statement:
x<-u
for(i in (1:length(u))){
for(j in (1:length(cdf)))
while(x[i]<cdf[j]){x[i]==which(cdf[j]>=x[i])}
}
Any suggestions?
I think you want to use cut(), as in:
cutPoints <- c(0.0000000000,# could set to -1. See comment below.
0.0001524158,
0.0025910684,
0.0196616369,
0.0879439110,
0.2586495961,
0.5317786923,
0.8049077884,
0.9609815577,
1.0000000000)
u <- runif(1000)
cut(u,
cutPoints,
labels = seq.int(length(cutPoints)-1))
Notice that the length of the (optional) labels argument is one less than the number of cut points, because the labels name the intervals between the cut points. See ?cut for details.
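As a possible alternative sketch (not part of the answer above), base R's findInterval() returns the interval index directly as an integer. With left.open = TRUE it follows the rule F(x(i-1)) < u <= F(x(i)). The names cdf, u and idx follow the question; idx_cut and the fixed seed are assumptions added for illustration.

```r
cdf <- c(0.0000000000, 0.0001524158, 0.0025910684, 0.0196616369, 0.0879439110,
         0.2586495961, 0.5317786923, 0.8049077884, 0.9609815577, 1.0000000000)

set.seed(1)          # assumption: fixed seed, only for reproducibility
u <- runif(10000)

# idx[k] is the i for which cdf[i] < u[k] <= cdf[i + 1]
idx <- findInterval(u, cdf, left.open = TRUE)

# The same bins from cut(), returned as integer codes instead of a factor
idx_cut <- cut(u, cdf, labels = FALSE)
```

If a vector x held the support values (x is hypothetical here, since the question does not show it), x[idx] would give the sampled values.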
Consider this two‐dimensional random walk:
where Zt, Wt, t = 1,2,3, … are independent and identically distributed standard normal random variables.
I am having trouble finding a way to simulate and plot the sample path of (X,Y) for t = 0,1, … ,100. I was given a sample:
The following code is an example of how I usually plot random walks in R:
set.seed(13579)
r<-sample(c(-1,1),size=100,replace=T,prob=c(0.5,0.5))
r<-c(10,r)
(w<-cumsum(r))
w<-as.ts(w)
plot(w,main="random walk")
I am not very sure how to achieve this.
The problem is that this kind of code gives a "simpler" result, with a line that goes either up or down, -1 or +1:
while the plot I need to create also goes from left to right and vice versa.
Would you help me correct the code I know so that it fits my task, or suggest a smarter way to go about it? It would be greatly appreciated.
Cheers!
Instead of using sample, you need to use rnorm(100) to draw 100 samples from a standard normal distribution. Since the walk starts at [0, 0], we need to append a 0 at the start and do a cumsum on the result, i.e. cumsum(c(0, rnorm(100))).
We want to do this for both the x and y variables, then plot. The whole thing can be done in a single line of code in base R:
plot(x = cumsum(c(0, rnorm(100))), y = cumsum(c(0, rnorm(100))), type = 'l')
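A slightly longer sketch of the same idea keeps the simulated path in a data frame so it can be inspected or reused. The names n and walk are assumptions for illustration, and the seed is borrowed from the question only to make the run reproducible.

```r
set.seed(13579)   # seed taken from the question, only for reproducibility
n <- 100          # number of steps, so t runs from 0 to 100

walk <- data.frame(
  t = 0:n,
  x = cumsum(c(0, rnorm(n))),   # start at 0, then add standard normal steps
  y = cumsum(c(0, rnorm(n)))    # independent standard normal steps for y
)
```

Calling plot(walk$x, walk$y, type = "l") then draws the path, as in the one-liner above.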
I'm still trying to find the best way to classify bivariate point patterns:
Point pattern classification with spatstat: what am I doing wrong?
I have now analysed 110 samples of my dataset using @Adrian's suggestion with sigma=bw.diggle (as I wanted an automatic bandwidth selection). f is a "resource selection function" (RSF) which describes the relationship between the intensity of the Cancer point process and the covariate (here the kernel density of Immune):
Cancer <- split(cells)[["tumor"]]
Immune <- split(cells)[["bcell"]]
Dimmune <- density(Immune,sigma=bw.diggle)
f <- rhohat(Cancer, Dimmune)
I have doubts about some of the results. About a dozen rho functions looked weird (disrupted, single peak). After changing to the default sigma=NULL or to sigma=bw.scott (which are smoother), the functions became "better" (see examples below). I also experimented with the following manipulations:
cells # bivariate point pattern with marks "tumor" and "bcell"
o.marks<-cells$marks # original marks
#A) randomly re-assign original marks
a.marks <- sample(cells$marks)
#B) replace marks randomly with a 50/50 proportion
b.marks<-as.factor(sample(c("tumor","bcell"), replace=TRUE, size=length(o.marks)))
#C) random (homogeneous?) pattern with the original number of points
randt<-runifpoint(npoints(subset(cells,marks=="tumor")),win=cells$window)
randb<-runifpoint(npoints(subset(cells,marks=="bcell")),win=cells$window)
cells<-superimpose(tumor=randt,bcell=randb)
#D) tumor points are associated with bcell points (is "clustered" a right term?)
Cancer<-rpoint(npoints(subset(cells,marks=="tumor")),Dimmune,win=cells$window)
#E) tumor points are segregated from bcell points
reversedD<-Dimmune
density.scale.v<-sort(unique((as.vector(Dimmune$v)[!is.na(as.vector(Dimmune$v))]))) # density scale
density.scale.v.rev<-rev(density.scale.v)# reversed density scale
new.image.v<-Dimmune$v
# Loop over matrix
for(row in 1:nrow(Dimmune$v)) {
for(col in 1:ncol(Dimmune$v)) {
if (is.na(Dimmune$v[row, col])==TRUE){next}
number<-which(density.scale.v==Dimmune$v[row, col])
new.image.v[row, col]<-density.scale.v.rev[number]}
}
reversedD$v<-new.image.v # reversed density
Cancer<-rpoint(npoints(subset(cells,marks=="tumor")),reversedD,win=cells$window)
A better way to generate inverse density heatmaps is given by @Adrian in his post below.
I could not generate rpoint patterns for the bw.diggle density because it produced negative numbers. I therefore replaced the negatives with Dimmune$v[which(Dimmune$v<0)]<-0 and could then run rpoint. As @Adrian explained in the post below, this is normal and is more easily solved with the density.ppp option positive=TRUE.
I first used bw.diggle because hopskel.test indicated "clustering" for all my patterns. Now I'm going to use bw.scott for my analysis, but can this decision be justified somehow? Is there a better criterion than "the RSF function looks weird"?
Some examples (plots for sample10, sample20 and sample110; images not included).
That is a lot of questions!
Please try to ask only one question per post.
But here are some answers to your technical questions about spatstat.
Negative values:
The help for density.ppp explains that small negative values can occur because of numerical effects. To force the density values to be non-negative, use the argument positive=TRUE in the call to density.ppp. For example density(Immune, bw.diggle, positive=TRUE).
Reversed image: to reverse the ordering of values in an image Z you can use the following code:
V <- Z
A <- order(Z[])
V[][A] <- Z[][rev(A)]
Then V is the order-reversed image.
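The same order-reversal idea can be checked on a plain numeric vector in base R. This is only an illustrative sketch with made-up values (z, a and v are hypothetical names), not spatstat code: the position holding the smallest value receives the largest value, and so on.

```r
z <- c(3, 1, 4, 1.5, 9)   # made-up values standing in for pixel values

a <- order(z)             # positions of z from smallest to largest value
v <- z
v[a] <- z[rev(a)]         # smallest position gets the largest value, and so on

v                         # 3.0 9.0 1.5 4.0 1.0
```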
Tips for your code:
To generate a random point pattern with the same number of points and in the same window as an existing point pattern X, use Y <- runifpoint(ex=X).
To extract the marks of a point pattern X, use a <- marks(X). To assign new marks to a point pattern X, use marks(X) <- b.
To randomly permute the marks attached to the points in a point pattern X, use Y <- rlabel(X).
To assign new marks to a point pattern X where the new marks are drawn randomly with replacement from a given vector of values m, use Y <- rlabel(X, m, permute=FALSE).
Here is a reproducible example :
set.seed(10)
pick <- sample(nrow(iris),nrow(iris)/2)
iris.training <- iris[pick,]
iris.testing <- iris[-pick,]
pca.training <- prcomp(iris.training[-5])
pca.testing <- prcomp(iris.testing[-5])
autoplot(pca.training,loadings.label=T,loadings=T)
autoplot(pca.testing,loadings.label=T,loadings=T)
Which produces the following output :
As one can see, PCA on iris.training and on iris.testing produces very similar biplots, but the first principal component has reversed its sign, so the plots are mirrored. Is it possible to force a 180 degree rotation on the two components?
You are not returning the rotated variables. The changed code is below; notice retx=TRUE.
set.seed(10)
pick <- sample(nrow(iris),nrow(iris)/2)
iris.training <- iris[pick,]
iris.testing <- iris[-pick,]
pca.training <- prcomp(iris.training[-5], retx=TRUE)
pca.testing <- prcomp(iris.testing[-5], retx=TRUE)
autoplot(pca.training,loadings.label=TRUE,loadings=TRUE)
autoplot(pca.testing,loadings.label=TRUE,loadings=TRUE)
It produced the following outputs for training and testing.
I'm assuming autoplot is the function from the ggfortify package. There are probably two ways to do this. The easiest is to just ask to reverse the x axis, by writing
autoplot(pca.testing,loadings.label=TRUE,loadings=TRUE) + scale_x_reverse()
Notice that this didn't change any values: the X axis now runs from positive to negative instead of the usual direction.
The second is to modify the pca.testing object to swap the signs on the x axis.
This is statistically valid: PCA doesn't determine the signs of any components, but it's a bit tricky, because the signs show up in two places: component x for the data points, and component rotation for the arrows:
pca.testing$x[,1] <- - pca.testing$x[,1]
pca.testing$rotation[,1] <- -pca.testing$rotation[,1]
autoplot(pca.testing,loadings.label=TRUE,loadings=TRUE)
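The manual flip above can also be automated. The following is a minimal sketch in base R, assuming the two PCAs have the same variables and comparable components; align_signs is a hypothetical helper name, not a function from prcomp or ggfortify. It flips every component of one PCA whose loadings point the opposite way to the matching component of a reference PCA.

```r
set.seed(10)
pick <- sample(nrow(iris), nrow(iris) / 2)
pca.training <- prcomp(iris[pick, -5])
pca.testing  <- prcomp(iris[-pick, -5])

# Hypothetical helper: make each component of `pca` point the same way
# as the matching component of `reference`.
align_signs <- function(pca, reference) {
  # +1 where the loading vectors agree in direction, -1 where they oppose
  flip <- sign(colSums(pca$rotation * reference$rotation))
  flip[flip == 0] <- 1                                   # leave orthogonal cases alone
  pca$rotation <- sweep(pca$rotation, 2, flip, `*`)      # flip the arrows
  pca$x        <- sweep(pca$x, 2, flip, `*`)             # flip the scores
  pca
}

pca.aligned <- align_signs(pca.testing, pca.training)
```

Because scores and loadings are flipped together, the product of pca.aligned$x and the transposed rotation is unchanged, so the aligned object describes the same data.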
Not related to your question, but some advice: don't use T, use TRUE. Otherwise, the next time you have temperature data you may inadvertently assign to T and wreak havoc on your analysis.
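A small sketch of how that can go wrong; the value 23.5 and the names masked and literal are made up for illustration.

```r
T <- 23.5                  # e.g. a temperature reading stored in a variable named T
masked  <- isTRUE(T)       # FALSE: T is now just a number
literal <- isTRUE(TRUE)    # TRUE: the reserved word TRUE cannot be reassigned
rm(T)                      # remove the mask so T refers to base R's T again
```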
I am learning to plot histograms in R, but I have some problem with parameter "breaks" for a single number. In the help, it says:
breaks: a single number giving the number of cells for the histogram
I did the following experiment:
data("women")
hist(women$weight, breaks = 7)
I expect it should give me 7 bins, but the result is not what I expected! It gives me 10 bins.
What does breaks = 7 mean? What does "number of cells" mean in the help?
If you read the help page entry for the breaks argument carefully to the end, it says:
breaks
one of:
a vector giving the breakpoints between histogram cells,
a function to compute the vector of breakpoints,
a single number giving the number of cells for the histogram,
a character string naming an algorithm to compute the number of cells (see ‘Details’),
a function to compute the number of cells.
In the last three cases the number is a suggestion only; the breakpoints will be set to pretty values. If breaks is a function, the x vector is supplied to it as the only argument.
So, as you can see, the number is treated only as a "suggestion": hist tries to get near that value, but the result depends on the input values and on whether they can be split nicely into that many buckets (it uses the function pretty to compute the breakpoints).
Hence, the only way to force the number of bins is to provide the vector of breakpoints between the cells.
e.g.
data("women")
n <- 7
minv <- min(women$weight)
maxv <- max(women$weight)
breaks <- c(minv, minv + cumsum(rep.int((maxv - minv) / n, n-1)), maxv)
hist(women$weight, breaks = breaks)
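An equivalent way to build equally spaced breakpoints is seq() with length.out, sketched below. Passing plot = FALSE makes hist return the computed object, so the number of bins can be checked; h is a name introduced here for illustration.

```r
data("women")
n <- 7

# n bins need n + 1 breakpoints, from the minimum to the maximum
breaks <- seq(min(women$weight), max(women$weight), length.out = n + 1)

h <- hist(women$weight, breaks = breaks, plot = FALSE)
length(h$counts)   # 7
```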
I am stuck on a simple problem. I have a scatter plot.
I have plotted confidence lines around it using a custom formula. Now I want only the names outside the cutoff lines to be displayed, and nothing inside. But I can't figure out how to subset my data based on the line coordinates.
The line is plotted using the lines function, from vectors of 128 x and y values. How do I subset my data (x, y points) based on these two vectors? I can apply a static limit of a single number, like 1, 2 or 3, but how to use a vector to subset the data has me stuck.
For a reproducible example, consider:
df=data.frame(x=seq(2,16,by=2),y=seq(2,16,by=2),lab=paste("label",seq(2,16,by=2),sep=''))
plot(df[,1],df[,2])
# adding lines
lines(seq(1,15),seq(15,1),lwd=1, lty=2)
# adding labels
text(df[,1],df[,2],labels=df[,3],pos=3,col="red",cex=0.75)
Now, I need just the labels that are outside or intersecting the line.
I tried to subset my data frame with the values used for the lines, but I can't get it right.
Static subsetting can be done for single values, like
df[which(df[,1]>8 & df[,2]>8),] but how do I do it for the whole list?
I also tried sapply to cycle over all the x and y values used for the lines, testing df iteratively, but most values come out TRUE for one limit and FALSE for others.
Thanks
I will address your initial volcano-type-graph problem and not the made-up one, because they are totally different.
I have thought about this a lot and I believe I have reached a solid conclusion. There are two options:
1. You know the equations of the lines, which would be really easy to work with.
2. You do not know the equation of the lines which means we need to work with an approximation.
Some geometry:
A straight line has the equation y = a + bx. For a given pair of coordinates (x, y), if y is greater than the right-hand side of the equation evaluated at x, the point is above the line; otherwise it is on or below it. The same concept holds if you have a curve (as in your case).
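As a minimal sketch of this test on the made-up example from the question: the dashed line drawn with lines(seq(1,15), seq(15,1)) is y = 16 - x, so a = 16 and b = -1 here. The name outside is an assumption for illustration, and >= is used so that points lying on the line count as intersecting it.

```r
df <- data.frame(x = seq(2, 16, by = 2),
                 y = seq(2, 16, by = 2),
                 lab = paste0("label", seq(2, 16, by = 2)),
                 stringsAsFactors = FALSE)

a <- 16    # intercept of the example line y = 16 - x
b <- -1    # slope of the example line

# TRUE for points on or above the line
outside <- df$y >= a + b * df$x

df$lab[outside]   # "label8" "label10" "label12" "label14" "label16"
```

Passing only df$x[outside], df$y[outside] and df$lab[outside] to text() would then label just those points.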
If you have the equations, it is easy to apply the above in my code below and you are set. If not, you need to approximate the curve. To do that you will need the following code:
df=data.frame(x=seq(2,16,by=2),y=seq(2,16,by=2),lab=paste("label",seq(2,16,by=2),sep=''))
make_vector <- function(df) {
lab <- vector()
for (i in 1:nrow(df)) {
this_row <- df[i,] #this will contain the three elements per row
if ( (this_row[1] < max(line1x) & this_row[2] > max(line1y) & this_row[2] < a + b*this_row[1])
|
(this_row[1] > min(line2x) & this_row[2] > max(line2y) & this_row[2] > a + b*this_row[1]) ) {
lab[i] <- as.character(this_row[[3]]) # [[3]] extracts the label itself, not a one-cell data frame
} else {
lab[i] <- NA
}
}
return(lab)
}
#this_row[1] = your x
#this_row[2] = your y
#this_row[3] = your label
df$labels <- make_vector(df)
plot(df[,1],df[,2])
# adding lines
lines(seq(1,15),seq(15,1),lwd=1, lty=2)
# adding labels
text(df[,1],df[,2],labels=df[,4],pos=3,col="red",cex=0.75)
The important bit is the function. Imagine that you have df as you created it, with x, y and lab. You will also have vectors with the x, y coordinates for line1 (line1x, line1y) and for line2 (line2x, line2y), plus the coefficients a and b described below.
Let's look at the condition for line1 only (the same logic applies to line2 and is implemented in the code above):
this_row[1] < max(line1x) & this_row[2] > max(line1y) & this_row[2] < a + b*this_row[1]
#translates to:
#this_row[1] < max(line1x) = your x needs to be less than the max x (vertical line in the graph below)
#this_row[2] > max(line1y) = your y needs to be greater than the max y (horizontal line in the graph below)
#this_row[2] < a + b*this_row[1] = your y needs to be less than the right hand side of the equation (to have a point above, i.e. left of, the line)
#check below what the line is
This gives something like the graph below (it is rough and magnified, but it is only a reference; picture it approximating your lines):
The above code would pick all the points in the area above the triangle and within the y=1 and x=1 lines.
Finally the equation:
Given the coordinates of two points, you can work out a line's equation by solving a system of two equations in the two parameters a and b (substitute each point's x and y into y = a + bx).
The two points to pick are the two points closest to the tangent of the first line (line1). Choose them by eye according to your data. The closer to the tangent, the better. Just plot the points and eyeball it.
Having done all the above you have your points with your labels (approximately at least).
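Solving that two-equation system by hand gives b = (y2 - y1) / (x2 - x1) and a = y1 - b*x1. A minimal sketch follows; line_through is a hypothetical helper name, and the two points are taken from the dashed line in the question's made-up example rather than from real data.

```r
# Hypothetical helper: intercept a and slope b of the line through two points
line_through <- function(x1, y1, x2, y2) {
  stopifnot(x1 != x2)          # a vertical line has no finite slope
  b <- (y2 - y1) / (x2 - x1)
  a <- y1 - b * x1
  c(a = a, b = b)
}

# Two points on the question's example line: (1, 15) and (15, 1)
ab <- line_through(1, 15, 15, 1)
ab   # a = 16, b = -1
```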
And that is the only thing you can do!
Long talk but hope it helps.
P.S. I haven't tested the code because I have no data.