R: Draw lines which "fit" the same shape as a reference line
I'm currently working on a project where I need to identify lines that have the same kind of shape, e.g.:
yrefer = c(0.2900,0.3189,0.4097,0.3609,0.3762,0.5849,0.7144)
For example, take a look at the following plot: I want R to recognize these two red lines as having a fitting shape, even if there is a small deviation (say 0.05) from the reference line (in blue).
So I want to write code that checks, based on a list of y coordinates, whether these y coordinates fit the yrefer line, where a deviation of 0.05 is permitted.
I'm not sure whether this is possible in R, but if it is, I know there are people here who can help me out.
Note: what I mean by a deviation of 0.05:
Let's say we have a line which is:
1.2900 1.3189 1.4097 1.3609 1.3762 1.5849 1.7144
This would be exactly the same line, just shifted 1 higher than the yrefer line. By a deviation of 0.05 I mean that some y coordinates may differ by up to 0.05 from what you would expect them to be. In this example I expect every value to be 1 higher than the corresponding yrefer coordinate, but if one of them is only 0.98 higher I would still accept it as a "fitting" line, because its deviation is < 0.05.
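A small numeric illustration of the example above (the 0.02 offset on the first point is just made up for illustration):
yrefer    <- c(0.2900, 0.3189, 0.4097, 0.3609, 0.3762, 0.5849, 0.7144)
candidate <- yrefer + 1               # exactly the same shape, shifted up by 1
candidate[1] <- candidate[1] - 0.02   # first point is only 0.98 higher
abs((candidate - yrefer) - 1)         # deviations from the expected shift of 1
all(abs((candidate - yrefer) - 1) <= 0.05)   # TRUE, so this line should "fit"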
To clarify, I drew some possibilities (there are many more, of course!) that should be accepted as correct for the first y-value:
I hope it's clear; if not, let me know!
I don't think Johannes' answer generalizes, e.g.:
y_ref = c(0, 0, 0)
y_test = c(.03, .03, -.06)   # test_line() reports "Failed" here, even though, if we shift it:
y_test = y_test + .011
abs(y_test - y_ref)          # never outside the .05 range
test_line(y_test)            # still "Failed"
I think you want something like:
n = length(y_test)
d1 = y_test[-1] - y_test[-n]   # successive differences of the test line
d2 = y_ref[-1] - y_ref[-n]     # successive differences of the reference line
max(cumsum(d2 - d1)) - min(cumsum(d2 - d1))   # shouldn't be >= .1
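To make that reusable, here is a minimal sketch wrapping the same check in a function (shape_fits() and the tol argument are my own names; I also prepend the implicit leading zero so the first point takes part in the comparison):
shape_fits <- function(y_test, y_ref, tol = 0.05) {
  n  <- length(y_test)
  d1 <- y_test[-1] - y_test[-n]
  d2 <- y_ref[-1]  - y_ref[-n]
  cs <- c(0, cumsum(d2 - d1))        # include the first point as well
  # accept when some vertical shift puts every point within +/- tol of the reference
  (max(cs) - min(cs)) <= 2 * tol
}
shape_fits(c(.03, .03, -.06), c(0, 0, 0))   # TRUE: same shape within the tolerance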
Just account for the different y mean.
newline <- c(1.25, 1.3189, 1.4097, 1.4609, 1.3762, 1.5249, 1.7144)
newline2 <- newline + mean(yrefer) - mean(newline)
sd(newline2 - yrefer)   # can use var or whatever measure you want here
This can all be packed into a function like:
lindev <- function(x){
  newline2 <- x + mean(yrefer) - mean(x)
  return(sd(newline2 - yrefer))
}
lindev(c(1.25, 1.3189, 1.4097, 1.4609, 1.3762, 1.5249, 1.7144))
Note this will only work if the x coordinates are the same.
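A quick sanity check of lindev() (a sketch of my own on top of the answer above): a pure vertical shift of yrefer gives exactly 0, since only the shape matters after centering, while a distorted line gives a larger value.
yrefer <- c(0.2900, 0.3189, 0.4097, 0.3609, 0.3762, 0.5849, 0.7144)   # as in the question
lindev(yrefer + 1)                                # 0: same shape, just shifted
lindev(yrefer + 1 + c(0.2, 0, 0, 0, 0, 0, 0))     # clearly larger: the shape differs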
y_ref <- c(0.2900,0.3189,0.4097,0.3609,0.3762,0.5849,0.7144)
y_test_1 <- c(1.2900, 1.3187, 1.4097, 1.3609, 1.3762, 1.5849, 1.7144)
y_test_2 <- c(1.2900, 1.2189, 1.4097, 1.3609, 1.3762, 1.5849, 1.7144)
test_line <- function(y_test) {
  # estimate the vertical shift as the mean difference from the reference
  overall_deviation <- mean(y_test - y_ref)
  # residuals after removing that shift
  residuals <- y_test - y_ref - overall_deviation
  if (any(abs(residuals) > 0.05)) message("Failed")
  else message("Passed")
}
test_line(y_test_1)   # Passed
test_line(y_test_2)   # Failed
Related
Data frames using conditional probabilities to extract a certain range of values
I would like some help answering the following question:
Dr Barchan makes 600 independent recordings of Eric's coordinates (X, Y, Z), selects the cases where X ∈ (0.45, 0.55), and draws a histogram of the Y values for these cases. By construction, these values of Y follow the conditional distribution of Y given X ∈ (0.45, 0.55). Use your function sample3d to mimic this process and draw the resulting histogram. How many samples of Y are displayed in this histogram?
We can argue that the conditional distribution of Y given X ∈ (0.45, 0.55) approximates the conditional distribution of Y given X = 0.5, and this approximation is improved if we make the interval of X values smaller. Repeat the above simulations selecting cases where X ∈ (0.5 − δ, 0.5 + δ), using a suitably chosen δ and a large enough sample size to give a reliable picture of the conditional distribution of Y given X = 0.5.
I know that for the first paragraph we want the values of x, y, z generated by sample3d(600) and then restrict the x's to the range 0.45-0.55. Is there a way to code this (maybe an if statement) that would let me keep the values of x in this range but discard all the x's from the 600 generated that are not in the range? Also, does anyone have any hints for the conditional probability part in the third paragraph?
sample3d = function(n) {
  df = data.frame()
  while (n > 0) {
    X = runif(1, -1, 1)
    Y = runif(1, -1, 1)
    Z = runif(1, -1, 1)
    a = X^2 + Y^2 + Z^2
    if (a < 1) {
      b = (X^2 + Y^2 + Z^2)^(0.5)
      vector = data.frame(X = X/b, Y = Y/b, Z = Z/b)
      df = rbind(vector, df)
      n = n - 1
    }
  }
  df
}
sample3d(600)
Any help would be appreciated, thank you.
Your function produces a data frame. The part of the question that asks you to find the values in a data frame that fall in a given range can be solved by filtering the data frame. Notice that you're looking for an open interval (the endpoints are not included).
df <- sample3d(600)
df[df$X > 0.45 & df$X < 0.55, ]
Pay attention to the comma. You can use a dplyr solution as well, but don't use the helper between(), since it checks a closed interval (endpoints included) and you need an open one.
library(dplyr)
filter(df, X > 0.45 & X < 0.55)
For the remainder of your assignment, see what you can figure out, and if you run into a specific problem, Stack Overflow can help you.
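For the first paragraph of the exercise, the filtered rows can then be fed straight into hist(); a minimal sketch of that step, building on sample3d() and the filtering above (the plot title is my own choice):
df  <- sample3d(600)
sel <- df[df$X > 0.45 & df$X < 0.55, ]   # open interval, endpoints excluded
hist(sel$Y, main = "Y given X in (0.45, 0.55)")
nrow(sel)                                # how many Y values the histogram shows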
Simulating realistic noise for a calcium baseline
Thanks to the truly amazing community, my project group is one step closer to mimicking realistic calcium baseline noise. I simulated a typical calcium movement in a mathematical model, and thanks to the community I could add random noise to the unrealistic baseline. However, the noise dynamic is actually too fast. Is there a way to slow down the noise and create broader noise peaks instead of these sharp spikes? I have added an actual measurement to show you what I mean. If this question is too specific, I apologize and will delete the post. Best wishes and many thanks!
Please make your question and examples reproducible so that others can help. That being said, it looks like the baseline is just a random normal, probably created with something like x <- rnorm(500). One way to make this less jumpy is to calculate a moving average. You could use a package like TTR or zoo to do this, or you can create your own function. For example:
x <- rnorm(500)
plot(x, type = "l")
ma <- function(x, n = 5){
  filter(x, rep(1/n, n), sides = 2)   # stats::filter here, a centered moving average
}
plot(ma(x), type = "l")
plot(ma(x, 10), type = "l")
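If you would rather use one of the packages mentioned above, the same idea looks roughly like this (a sketch, assuming zoo and TTR are installed; rollmean() and SMA() are their moving-average helpers):
library(zoo)
library(TTR)
x <- rnorm(500)
plot(rollmean(x, k = 10), type = "l")   # zoo: centered rolling mean
plot(SMA(x, n = 10), type = "l")        # TTR: simple moving average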
I see your point now. I have two suggestions for this case, maybe they will be of help. First, try adding noise to only a subset of your baseline (the following uses 10%):
baseline.index = which(App[,2] == min(App[,2]))
baseline.index.subset = sample(x = baseline.index, size = 0.1 * length(baseline.index), replace = F)
noise = rnorm(length(baseline.index.subset))
App[baseline.index.subset, 2] = App[baseline.index.subset, 2] + noise
Second, try to play a bit with the mean and standard deviation of the noise, i.e.:
noise = rnorm(length(baseline.index.subset), mean = 0, sd = 0.1)
Let us know if this helps.
Local linear regression in R -- locfit() vs locpoly()
I am trying to understand the different behaviors of these two smoothing functions when given apparently equivalent inputs. My understanding was that locpoly just takes a fixed bandwidth argument, while locfit can also include a varying part in its smoothing parameter (a nearest-neighbors fraction, "nn"). I thought setting this varying part to zero in locfit should make the "h" component act like the fixed bandwidth used in locpoly, but this is evidently not the case. A working example:
library(KernSmooth)
library(locfit)
set.seed(314)
n <- 100
x <- runif(n, 0, 1)
eps <- rnorm(n, 0, 1)
y <- sin(2 * pi * x) + eps
plot(x, y)
lines(locpoly(x, y, bandwidth=0.05, degree=1), col=3)
lines(locfit(y ~ lp(x, nn=0, h=0.05, deg=1)), col=4)
This produces a plot in which locpoly gives the smooth green line and locfit gives the wiggly blue line. Clearly, locfit has a smaller "effective" bandwidth here, even though the supposed bandwidth parameter has the same value for each. What are these functions doing differently?
The two parameters both represent smoothing, but they do so in two different ways. locpoly's bandwidth parameter is relative to the scale of the x-axis here. For example, if you changed the line x <- runif(n, 0, 1) to x <- runif(n, 0, 10), you will see that the green locpoly line becomes much more squiggly despite the fact that you still have the same number of points (100). locfit's smoothing parameter, h, is independent of the scale, and instead is based on a proportion of the data. The value 0.05 means 5% of the data that is closest to that position is used to fit the curve. So changing the scale would not alter the line. This also explains the observation made in the comment that changing the value of h to 0.1 makes the two look nearly identical. This makes sense, because we can expect that a bandwidth of 0.05 will contain about 10% of the data if we have 100 points distributed uniformly from 0 to 1. My sources include the documentation for the locfit package and the documentation for the locpoly function.
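If you want to reproduce the comparison mentioned above (the observation from the comments that h = 0.1 brings the two curves close together), a quick sketch using the data from the question:
library(KernSmooth)
library(locfit)
set.seed(314)
n <- 100
x <- runif(n, 0, 1)
eps <- rnorm(n, 0, 1)
y <- sin(2 * pi * x) + eps
plot(x, y)
lines(locpoly(x, y, bandwidth = 0.05, degree = 1), col = 3)   # fixed bandwidth 0.05
lines(locfit(y ~ lp(x, nn = 0, h = 0.1, deg = 1)), col = 4)   # h doubled to 0.1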
I changed your code a bit so we can see more clearly what the actual window widths are:
library(KernSmooth)
library(locfit)
x <- seq(.1, .9, length.out = 80)
y <- rep(0:1, each = 40)
plot(x, y)
lines(locpoly(x, y, bandwidth=0.1, degree=1), col=3)
lines(locfit(y ~ lp(x, nn=0, h=0.1, deg=1)), col=4)
The argument h from locfit appears to be a half-window width. locpoly's bandwidth is clearly doing something else. KernSmooth's documentation is very ambiguous, but judging from the source code (here and here), it looks like the bandwidth is the standard deviation of a normal density function. Hopefully this is explained in the Kernel Smoothing book they cite.
How to use the sum function in a for loop in R?
We want to calculate the value of an integral of a linear plot. For a better understanding, look at the photo. Let's say the overall area is 1. We want to find the value in a certain part. For instance, we want to know how much of the overall 100% lies within the 10th and 11th month, if everything refers to months and the maximum A is 24. We can calculate an integral and then should be able to get the searched area by F(x) - F(x-1). I thought about the following code:
a <- 24
tab <- matrix(0, a, 1)
tab <- cbind(seq(1, a), tab)
tab <- data.frame(tab)
# initialization for the first point
tab[1,2] <- (2*tab[1,1] / a - tab[1,1]^2 / a^2)
# for loop calculating the integral for each point, up to that point
for(i in 2:nrow(tab)) {
  tab[i,2] <- (2*tab[i,1] / a - tab[i,1]^2 / a^2) - sum(tab[1,2]:tab[i-1,2])
}
# plotting
plot(tab[,2], type="l")
If you look at the plot, it's confusing. Any ideas how to handle this correctly?
The base R function integrate() can do this for you. Note that the integrand should be the derivative of F(x) = 2x/A - x^2/A^2, i.e. f(x) = 2/A - 2x/A^2:
f <- function(x, A) 2/A - 2*x / A^2
integrate(function(x) f(x, 24), lower=10, upper=11)
This gives 0.046875, which matches F(11) - F(10).
Using the formulas directly:
a <- 24                                               # number of divisions
x <- c(seq(1, a))
y <- x*2/a - x^2/a^2                                  # F(x)
z <- (x*2/a - x^2/a^2) - ((x-1)*2/a - (x-1)^2/a^2)    # F(x) - F(x-1)
Then do the binding afterward.
> sum(z)
[1] 1
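For the specific interval asked about (month 10 to 11), you can read the value straight off the z vector above (assuming, as above, that x indexes the months 1 to 24):
z[11]   # F(11) - F(10) = 0.046875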
Detecting dips in a 2D plot
I need to automatically detect dips in a 2D plot, like the regions marked with red circles in the figure below. I'm only interested in the "main" dips, meaning the dips have to span a minimum length in the x axis. The number of dips is unknown, i.e., different plots will contain different numbers of dips. Any ideas? Update: As requested, here's the sample data, together with an attempt to smooth it using median filtering, as suggested by vines. Looks like I need now a robust way to approximate the derivative at each point that would ignore the little blips that remain in the data. Is there any standard approach? y <- c(0.9943,0.9917,0.9879,0.9831,0.9553,0.9316,0.9208,0.9119,0.8857,0.7951,0.7605,0.8074,0.7342,0.6374,0.6035,0.5331,0.4781,0.4825,0.4825,0.4879,0.5374,0.4600,0.3668,0.3456,0.4282,0.3578,0.3630,0.3399,0.3578,0.4116,0.3762,0.3668,0.4420,0.4749,0.4556,0.4458,0.5084,0.5043,0.5043,0.5331,0.4781,0.5623,0.6604,0.5900,0.5084,0.5802,0.5802,0.6174,0.6124,0.6374,0.6827,0.6906,0.7034,0.7418,0.7817,0.8311,0.8001,0.7912,0.7912,0.7540,0.7951,0.7817,0.7644,0.7912,0.8311,0.8311,0.7912,0.7688,0.7418,0.7232,0.7147,0.6906,0.6715,0.6681,0.6374,0.6516,0.6650,0.6604,0.6124,0.6334,0.6374,0.5514,0.5514,0.5412,0.5514,0.5374,0.5473,0.4825,0.5084,0.5126,0.5229,0.5126,0.5043,0.4379,0.4781,0.4600,0.4781,0.3806,0.4078,0.3096,0.3263,0.3399,0.3184,0.2820,0.2167,0.2122,0.2080,0.2558,0.2255,0.1921,0.1766,0.1732,0.1205,0.1732,0.0723,0.0701,0.0405,0.0643,0.0771,0.1018,0.0587,0.0884,0.0884,0.1240,0.1088,0.0554,0.0607,0.0441,0.0387,0.0490,0.0478,0.0231,0.0414,0.0297,0.0701,0.0502,0.0567,0.0405,0.0363,0.0464,0.0701,0.0832,0.0991,0.1322,0.1998,0.3146,0.3146,0.3184,0.3578,0.3311,0.3184,0.4203,0.3578,0.3578,0.3578,0.4282,0.5084,0.5802,0.5667,0.5473,0.5514,0.5331,0.4749,0.4037,0.4116,0.4203,0.3184,0.4037,0.4037,0.4282,0.4513,0.4749,0.4116,0.4825,0.4918,0.4879,0.4918,0.4825,0.4245,0.4333,0.4651,0.4879,0.5412,0.5802,0.5126,0.4458,0.5374,0.4600,0.4600,0.4600,0.4600,0.3992,0.4879,0.4282,0.4333,0.3668,0.3005,0.3096,0.3847,0.3939,0.3630,0.3359,0.2292,0.2292,0.2748,0.3399,0.2963,0.2963,0.2385,0.2531,0.1805,0.2531,0.2786,0.3456,0.3399,0.3491,0.4037,0.3885,0.3806,0.2748,0.2700,0.2657,0.2963,0.2865,0.2167,0.2080,0.1844,0.2041,0.1602,0.1416,0.2041,0.1958,0.1018,0.0744,0.0677,0.0909,0.0789,0.0723,0.0660,0.1322,0.1532,0.1060,0.1018,0.1060,0.1150,0.0789,0.1266,0.0965,0.1732,0.1766,0.1766,0.1805,0.2820,0.3096,0.2602,0.2080,0.2333,0.2385,0.2385,0.2432,0.1602,0.2122,0.2385,0.2333,0.2558,0.2432,0.2292,0.2209,0.2483,0.2531,0.2432,0.2432,0.2432,0.2432,0.3053,0.3630,0.3578,0.3630,0.3668,0.3263,0.3992,0.4037,0.4556,0.4703,0.5173,0.6219,0.6412,0.7275,0.6984,0.6756,0.7079,0.7192,0.7342,0.7458,0.7501,0.7540,0.7605,0.7605,0.7342,0.7912,0.7951,0.8036,0.8074,0.8074,0.8118,0.7951,0.8118,0.8242,0.8488,0.8650,0.8488,0.8311,0.8424,0.7912,0.7951,0.8001,0.8001,0.7458,0.7192,0.6984,0.6412,0.6516,0.5900,0.5802,0.5802,0.5762,0.5623,0.5374,0.4556,0.4556,0.4333,0.3762,0.3456,0.4037,0.3311,0.3263,0.3311,0.3717,0.3762,0.3717,0.3668,0.3491,0.4203,0.4037,0.4149,0.4037,0.3992,0.4078,0.4651,0.4967,0.5229,0.5802,0.5802,0.5846,0.6293,0.6412,0.6374,0.6604,0.7317,0.7034,0.7573,0.7573,0.7573,0.7772,0.7605,0.8036,0.7951,0.7817,0.7869,0.7724,0.7869,0.7869,0.7951,0.7644,0.7912,0.7275,0.7342,0.7275,0.6984,0.7342,0.7605,0.7418,0.7418,0.7275,0.7573,0.7724,0.8118,0.8521,0.8823,0.8984,0.9119,0.9316,0.9512) yy <- runmed(y, 41) plot(y, type="l", ylim=c(0,1), ylab="", xlab="", lwd=0.5) points(yy, col="blue", type="l", lwd=2)
EDIT: the function now strips the regions so that they contain nothing but the lowest part, if wanted.
Actually, using the mean is easier than using the median. It allows you to find regions where the real values are continuously below the mean. The median is not smooth enough for an easy application. One example function to do this would be:
FindLowRegion <- function(x, n=length(x)/4, tol=length(x)/20, p=0.5){
  nx <- length(x)
  n <- 2*(n %/% 2) + 1
  # smooth out based on means
  sx <- rowMeans(embed(c(rep(NA,n/2), x, rep(NA,n/2)), n), na.rm=T)
  # find which series are far from the mean
  rlesx <- rle((sx-x) > 0)
  # construct start and end of regions
  int <- embed(cumsum(c(1, rlesx$lengths)), 2)
  # which regions fulfill requirements
  id <- rlesx$value & rlesx$length > tol
  # cut regions to be in general smaller than median
  regions <- apply(int[id,], 1, function(i){
    i <- min(i):max(i)
    tmp <- x[i]
    id <- which(tmp < quantile(tmp, p))
    id <- min(id):max(id)
    i[id]
  })
  # return
  unlist(regions)
}
where n determines how many values are used to calculate the running mean, tol determines how many consecutive values should be lower than the running mean to count as a low region, and p determines the cutoff (as a quantile) used for stripping the regions to their lowest part. When p=1, the complete lower region is shown. The function is tweaked to work on data as you presented it, but the numbers might need to be adjusted a bit to work with other data.
This function returns a set of indices, which allows you to find the low regions. Illustrated with your y vector:
Lows <- FindLowRegion(y)
newx <- seq_along(y)
newy <- ifelse(newx %in% Lows, y, NA)
plot(y, col="blue", type="l", lwd=2)
lines(newx, newy, col="red", lwd=3)
This highlights the detected low regions in red.
You have to smooth the graph in some way. Median filtering is quite useful for that purpose (see http://en.wikipedia.org/wiki/Median_filter). After smoothing, you will simply have to search for the minima, just as usual (i.e. search for the points where the 1st derivative switches from negative to positive). A short sketch of this idea is shown below.
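A minimal sketch of that approach (my own illustration, not code from the original answer; it uses runmed() for the median filter and a sign change in the first difference to locate the minima, applied to the y vector from the question):
yy  <- runmed(y, 41)                 # median-filtered version of the data
d   <- diff(yy)                      # approximate first derivative
idx <- which(d != 0)                 # drop the flat stretches the filter creates
s   <- sign(d[idx])
# a local minimum sits where the slope switches from negative to positive
minima <- idx[which(s[-length(s)] < 0 & s[-1] > 0) + 1]
plot(y, type = "l")
lines(yy, col = "blue", lwd = 2)
abline(v = minima, col = "red", lty = 2)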
A simpler answer (which also does not require smoothing) could be provided by adapting the maxdrawdown() function from the tseries package. A drawdown is commonly defined as the retreat from the most-recent maximum; here we want the opposite. Such a function could then be used in a sliding window over the data, or over segmented data.
maxdrawdown <- function(x) {
  if(NCOL(x) > 1)
    stop("x is not a vector or univariate time series")
  if(any(is.na(x)))
    stop("NAs in x")
  cmaxx <- cummax(x) - x
  mdd <- max(cmaxx)
  to <- which(mdd == cmaxx)
  from <- double(NROW(to))
  for (i in 1:NROW(to))
    from[i] <- max(which(cmaxx[1:to[i]] == 0))
  return(list(maxdrawdown = mdd, from = from, to = to))
}
So instead of using cummax(), one would have to switch to cummin() etc., as sketched below.
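A possible adaptation along those lines (a sketch of my own; the name maxrunup() is made up, it simply mirrors maxdrawdown() with cummin() so that it measures the rise from the most-recent minimum):
maxrunup <- function(x) {
  if(NCOL(x) > 1) stop("x is not a vector or univariate time series")
  if(any(is.na(x))) stop("NAs in x")
  cminx <- x - cummin(x)          # height above the running minimum
  mru <- max(cminx)
  to <- which(mru == cminx)
  from <- double(NROW(to))
  for (i in 1:NROW(to)) from[i] <- max(which(cminx[1:to[i]] == 0))
  list(maxrunup = mru, from = from, to = to)
}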
My first thought was something much cruder than filtering: why not look for big drops followed by long enough stable periods?
span.b <- 20
threshold.b <- 0.2
dy.b <- c(rep(NA, span.b), diff(y, lag = span.b))
span.f <- 10
threshold.f <- 0.05
dy.f <- c(diff(y, lag = span.f), rep(NA, span.f))
down <- which(dy.b < -1 * threshold.b & abs(dy.f) < threshold.f)
abline(v = down)
The plot shows that it's not perfect, but it doesn't discard the outliers (I guess it depends on your take on the data).