Calculate furthest distance in given time or best time for given distance - r

I have imported data from my GPS tracker and I am trying to figure out how to best calculate furthest distance ran in given time (e.g. 12-minutes) or best time for given distence (e.g. 5 miles). Given the observations are taken in unequal intervals and my speed is also not constant, I will have data like the table below:
x <- read.table(header=T, sep="", stringsAsFactors = FALSE,text="
time dist
4 3
5 4
5 6
3 2
5 5
4 5
4 3
4 2
5 6")
My best attempt so far is to generate new dataset where times go by one time unit. It is then relatively easy to calculate furthest distance in given time. The downside of this is that a) I would need to repeat the same logic for best time (generate data with unit distance), b) it seems to be quite sub-optimal solution for data with thousands data points.
# Generate data frame where each row represents one unit of time
z_adj <- data.frame(
time = unlist(sapply(x$time, function(s) rep(s, each = s))),
dist = unlist(sapply(seq_along(x$dist), function(s) rep(x$dist[s], each = x$time[s])))
)
z_adj$seq_time <- seq_along(z_adj$time)
z_adj$time_dist <- z_adj$dist / z_adj$time
# Furthest distance given time
# Time 10
z_adj$in_t10 <- sapply(z_adj$seq_time, function(s) sum(z_adj$dist[s:(s+9)]))
z_adj$in_t10[which(z_adj$in_t10 == max(z_adj$in_t10, na.rm = T))]
# Fastest time given distance
# ... would need to do the above again with constant distance :/
Is there a more straightforward approach to accomplish this?

You could use something like this:
x <- read.table(header=T, sep="", stringsAsFactors = FALSE,text="
time dist
4 3
5 4
5 6
3 2
5 5
4 5
4 3
4 2
5 6")
# Add starting point and cumulatice time/distance
x <- rbind(c(0,0), x)
x$total_time <- cumsum(x$time)
x$total_dist <- cumsum(x$dist)
# function to interpolate and calculate lagging differences
foo <- function(x, y, n) {
interpolation <- approx(x, y, xout = seq(min(x), max(x)))
diff(interpolation$y, lag = n)
}
# Max distance in ten units of time
max(foo(x$total_time, x$total_dist, 10))
# Min time for ten units of distance
min(foo(x$total_dist, x$total_time, 10))
BTW, in your code you should sum over z_adj$time_dist instead of z_adj$distto get the correct result.

Related

R - polynomial regression issue - model limited to finite number of output values

I'm trying to calculate point slopes from a series of x,y data. Because some of the x data repeats (...8, 12, 12, 16...) there will be a division by zero issue when using slope = (y2-y1/x2-x1).
My solution is to create a polynomial regression equation of the data, then plug a new set of x values (xx) into the equation that monotonically increase between the limits of x. This eliminates the problem of equal x data points. As a result, (x) and (xx) have the same limits, but (xx) is always longer in length.
The problem I am having is that the fitted values for xx are limited to the length of x. When I try to use the polynomial equation with (xx) that is 20 in length, the fitted yy results provide data for the first 10 points then gives NA for the next 10 points. What is wrong here?
x <- c(1,2,2,5,8,12,12,16,17,20)
y <- c(2,4,5,6,8,11,12,15,16,20)
df <- data.frame(x,y)
my_mod <- lm(y ~ poly(x,2,raw=T), data=df) # This creates the polynomial equation
xx <- x[1]:x[length(x)] # Creates montonically increasing x using boundaries of original x
yy <- fitted(my_mod)[order(xx)]
plot(x,y)
lines(xx,yy)
tag-name
If you look at
fitted(my_mod)
It outputs:
# 1 2 3 4 5 6 7 8 9 10
#3.241032 3.846112 3.846112 5.831986 8.073808 11.461047 11.461047 15.303305 16.334967 19.600584
Meaning the name of the output matches the position of x, not the value of x, so fitted(my_mod)[order(xx)] doesn't quite make sense.
You want to use predict here:
yy <- predict(my_mod, newdata = data.frame(x = xx))
plot(xx, yy)
# 1 2 3 4 5 6 7 8 9 10
# 3.241032 3.846112 4.479631 5.141589 5.831986 6.550821 7.298095 8.073808 8.877959 9.710550
# 11 12 13 14 15 16 17 18 19 20
# 10.571579 11.461047 12.378953 13.325299 14.300083 15.303305 16.334967 17.395067 18.483606 19.600584

Run DBSCAN against grouped coordinates

I'm attempting to run DBSCAN against some grouped coordinates in order to get sub-clusters. I've clustered some spacial data and I'd now like to further divide these regions according to the density of points within them. I think DBSCAN is probably the best way to do this.
My issue is that I can't figure out how to run DBSCAN against each cluster seperately and then output the cluster assignment as a new column. Here's some sample data:
library(dplyr)
library(dbscan)
# Create sample data
df <- data.frame(
"ID"=1:200,
"X"=c(1.0083,1.3166,1.3072,1.1311,1.2984,1.2842,1.1856,1.3451,1.1932,1.0926,1.2464,1.3197,1.2331,1.2996,1.3482,
1.1944,1.2800,1.3051,1.4471,0.9068,1.3150,1.1846,1.0232,1.0005,1.0640,1.3177,1.1015,0.9598,1.0354,1.2203,
0.8388,0.8655,1.3387,1.0133,1.0106,1.1753,1.3200,1.0139,1.1511,1.3508,1.2747,1.3681,1.1074,1.2735,1.2245,
0.9695,1.3250,1.0537,1.2020,1.3093,0.9268,1.3244,1.2626,1.3123,1.2819,1.1063,0.8759,1.0063,1.0173,1.0187,
1.2396,1.0241,1.2619,1.2682,1.0008,1.0827,1.3639,1.3099,1.0004,0.8886,1.2359,1.1370,1.2783,1.0803,1.1918,
1.1156,1.3313,1.1205,1.0776,1.3895,1.3559,0.8518,1.1315,1.3521,1.2281,1.2589,0.9974,1.1487,1.4204,0.9998,
1.0154,1.0098,0.8851,1.0252,0.9331,1.2197,1.0084,1.2303,1.2808,1.3125,0.5500,0.6694,0.3301,0.3787,0.6492,
0.6568,0.6773,0.3769,0.6237,0.7265,0.5509,0.3579,0.7201,0.2631,0.3881,0.7596,0.3343,0.7049,0.3430,0.2951,
0.5483,0.7699,0.3806,0.6555,0.2524,0.4030,0.6329,0.5006,0.2701,0.0822,0.5442,0.5233,0.7105,0.5660,0.3962,
0.3187,0.3143,0.5673,0.3731,0.7310,0.6376,0.4864,0.8865,0.3352,0.7540,0.0690,0.7983,0.6990,0.4090,0.5658,
0.5636,0.5420,0.7223,0.6146,0.5648,0.2711,0.3422,0.7214,0.2196,0.2848,0.6496,0.7907,0.7418,0.7825,0.4550,
0.4361,0.7417,0.2661,0.8978,0.7875,0.2343,0.3853,0.6874,0.7761,0.2905,0.6092,0.5329,0.6189,0.0684,0.5726,
0.5740,0.7060,0.4609,0.3568,0.7037,0.2874,0.6200,0.7149,0.5100,0.7059,0.2520,0.3105,0.6870,0.7888,0.3674,
0.6514,0.7271,0.6679,0.3752,0.7067),
"Y"=c(-1.2547,-1.1499,-1.1803,-1.0626,-1.2877,-1.1151,-1.0958,-1.1339,-1.0808,-1.5461,-1.0775,-1.1431,-1.0499,
-1.1521,-1.1675,-1.0963,-1.1407,-1.1916,-1.1229,-1.2297,-1.1308,-1.0341,-1.3071,-1.2370,-1.5043,-1.1154,
-1.5452,-1.0349,-1.5412,-1.0348,-1.3620,-1.3776,-1.1830,-1.2552,-1.2354,-1.0838,-1.1214,-1.2396,-1.4937,
-1.0793,-1.1857,-1.0679,-1.5425,-1.1633,-1.1620,-1.0838,-1.0750,-1.3493,-1.4155,-1.1354,-1.0615,-1.1494,
-1.1620,-1.1582,-1.1800,-1.5230,-1.3019,-1.2484,-1.5490,-1.2435,-1.0487,-1.2330,-1.1234,-1.0924,-1.0702,
-1.0446,-1.1077,-1.1144,-1.2170,-1.2715,-1.1537,-1.5077,-1.1305,-1.3396,-1.2107,-1.5458,-1.1482,-1.1224,
-1.3690,-1.2058,-1.1685,-1.3400,-1.5033,-1.2152,-1.3805,-1.1439,-1.5183,-1.4288,-1.1252,-1.2330,-1.2511,
-1.5429,-1.3333,-1.1851,-1.1367,-1.3952,-1.1240,-1.2113,-1.1632,-1.1965,-0.9917,-0.7416,-0.7729,-1.1279,
-0.9323,-0.9372,-0.7013,-1.1746,-0.9191,-0.9356,-0.7873,-1.1957,-0.9838,-0.5825,-1.0738,-0.9302,-0.7713,
-0.9407,-0.7774,-0.8160,-0.9861,-1.0440,-0.9896,-0.6478,-0.8865,-1.0601,-1.0640,-0.9898,-0.5989,-0.7375,
-0.7689,-0.9799,-0.9147,-1.1048,-0.9735,-0.8591,-0.7913,-1.0085,-0.7231,-0.9688,-0.9272,-0.9395,-0.9494,
-0.7859,-1.0817,-0.7262,-0.9915,-0.9329,-1.0953,-1.0425,-1.0806,-1.0132,-0.8514,-1.0785,-1.1109,-0.8542,
-1.0849,-0.9665,-0.5940,-0.6145,-0.7830,-0.9601,-0.8996,-0.7717,-0.7447,-1.0406,-1.0067,-0.5710,-0.9839,
-1.0594,-0.7069,-1.1202,-0.9705,-1.0100,-0.6377,-1.0632,-0.9450,-0.9163,-0.7865,-1.0090,-1.1005,-1.0049,
-0.8042,-1.0781,-0.6829,-0.5962,-1.0759,-0.7918,-0.9732,-0.7353,-0.5615,-1.2002,-0.9295,-0.9944,-1.1570,
-0.9524,-0.9257,-0.9360,-1.1328,-0.7661),
"cluster"=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2))
# How do you run DBSCAN against the points within each cluster?
I first thought I'd try to use the group_by function in dplyr but DBSCAN requires a data matrix input and group_by doesn't work for matrices.
matrix <- as.matrix(df[, -1])
set.seed(1234)
db = matrix %>%
group_by(cluster) %>%
dbscan(matrix, 0.4, 4)
#Error in UseMethod("group_by_") :
# no applicable method for 'group_by_' applied to an object of class "c('matrix', 'double', 'numeric')"
I've also tried using by() but get duplicate results for each cluster grouping, which isn't right:
by(data = df, INDICES = df$cluster, FUN = function(x) {
out <- dbscan(as.matrix(df[, c(2:3)]),eps=.0215,minPts=4)
})
#df$cluster: 1
#DBSCAN clustering for 200 objects.
#Parameters: eps = 0.0215, minPts = 4
#The clustering contains 10 cluster(s) and 138 noise points.
#
# 0 1 2 3 4 5 6 7 8 9 10
#138 11 12 4 5 8 2 4 8 4 4
#
#Available fields: cluster, eps, minPts
#--------------------------------------------------------------------------
#df$cluster: 2
#DBSCAN clustering for 200 objects.
#Parameters: eps = 0.0215, minPts = 4
#The clustering contains 10 cluster(s) and 138 noise points.
#
# 0 1 2 3 4 5 6 7 8 9 10
#138 11 12 4 5 8 2 4 8 4 4
#
#Available fields: cluster, eps, minPts
Can anyone point me in the right direction?
To be clear, dbscan::dbscan works fine on data.frame objects. You do not need to convert to matrix. It returns an object that includes a vector with the same dimension as the number of records in your input. The issue is that dplyr exposes variables to other functions as individual vectors, rather than as data.frame or matrix objects. You are free to do something like:
df %>%
group_by(cluster) %>%
mutate(
dbscan_cluster = dbscan::dbscan(
data.frame(X, Y),
eps = 0.0215,
minPts = 4
)[["cluster"]]
)
dplyr is not necessary, by also works, you just need to supply a generic function rather than one that directly references the source object directly. Your data must already be ordered by cluster.
df$dbscan_cluster <- unlist(
by(
df,
INDICES = df$cluster,
function(x) dbscan::dbscan(x[,c(2,3)], eps = 0.0215, minPts = 4)[["cluster"]]
)
)
However, you can still get garbage results if you don't have a good way to pick your epsilon. You might consider using dbscan::optics instead.

looping through large rasterbrick in raster R

How can i perform for-loop in a large daily rasterbrick to get annual stacks and calculate the maximum value (annual maxima ) for each year( each stack of 365 files).
Basically, i have same question like this. So taking same question as sample, how i can conduct a for-loop that would calculate maximum value for each 46 stacks ( each stack with 8 layers).
I tried using only stackApply but it gives all black/zero value when i run for whole period, however it gives max values if i run for individual years (tested separately for 10 years, i have more than 100 years data).
library(raster)
# example data
sca <- brick(nrow=108,ncol=132,nl=365)
values(sca) <- runif(ncell(sca)*nlayers(sca))
# indices grouping sets of 8
i <- rep(1:ceiling(365/8), each=8)
# the last period is not a complete set of 8 days
i <- i[1:nlayers(sca)]
# This does not work for me, gives output as zero.
x <- stackApply(sca, i, max)
for (i in 1:nlayers(sca)) {
x <- sca[[i]]
xx<-stackApply(sca, i, max)
plot(xx)
# etc.
}
You could loop like this:
library(raster)
sca <- brick(nrow=108,ncol=132,nl=365)
values(sca) <- runif(ncell(sca)*nlayers(sca))
i <- rep(1:ceiling(365/8), each=8)
i <- i[1:nlayers(sca)]
for (j in unique(i)) {
x <- sca[[which(j==i)]]
xx <- max(x, na.rm=TRUE)
# or
# xx <- calc(x, fun=max, na.rm=TRUE, filename = patste0(i, '.tif'))
}

How to import a distance matrix for clustering in R

I have got a text file containing 200 models all compared to eachother and a molecular distance for each 2 models compared. It looks like this:
1 2 1.2323
1 3 6.4862
1 4 4.4789
1 5 3.6476
.
.
All the way down to 200, where the first number is the first model, the second number is the second model, and the third number the corresponding molecular distance when these two models are compared.
I can think of a way to import this into R and create a nice 200x200 matrix to perform some clustering analyses on. I am still new to Stack and R but thanks in advance!
Since you don't have the distance between model1 and itself, you would need to insert that yourself, using the answer from this question:
(you can ignore the wrong numbering of the models compared to your input data, it doesn't serve a purpose, really)
# Create some dummy data that has the same shape as your data:
df <- expand.grid(model1 = 1:120, model2 = 2:120)
df$distance <- runif(n = 119*120, min = 1, max = 10)
head(df)
# model1 model2 distance
# 1 2 7.958746
# 2 2 1.083700
# 3 2 9.211113
# 4 2 5.544380
# 5 2 5.498215
# 6 2 1.520450
inds <- seq(0, 200*119, by = 200)
val <- c(df$distance, rep(0, length(inds)))
inds <- c(seq_along(df$distance), inds + 0.5)
val <- val[order(inds)]
Once that's in place, you can use matrix() with the ncol and nrow to "reshape" your vector of distance in the appropriate way:
matrix(val, ncol = 200, nrow = 200)
Edit:
When your data only contains the distance for one direction, so only between e.g. model1 - model5 and not model5 - model1 , you will have to fill the values in the upper triangular part of a matrix, like they do here. Forget about the data I generated in the first part of this answer. Also, forget about adding the ones to your distance column.
dist_mat <- diag(200)
dist_mat[upper.tri(dist_mat)] <- your_data$distance
To copy the upper-triangular entries to below the diagonal, use:
dist_mat[lower.tri(dist_mat)] <- t(dist_mat)[lower.tri(dist_mat)]
As I do not know from your question what format is your file in, I will assume the most general file format, i.e., CSV.
Then you should look at the reading files, read.csv, or fread.
Example code:
dt <- read.csv(file, sep = "", header = TRUE)
I suggest using data.table package. Then:
setDT(dt)
dt[, id := paste0(as.character(col1), "-", as.character(col2))]
This creates a new variable out of the first and the second model and serves as a unique id.
What I do is then removing this id and scale the numerical input.
After scaling, run clustering algorithms.
Merge the result with the id to analyse your results.
Is that what you are looking for?

calculate lag from phase arrows with biwavelet in r

I'm trying to understand the cross wavelet function in R, but can't figure out how to convert the phase lag arrows to a time lag with the biwavelet package. For example:
require(gamair)
data(cairo)
data_1 <- within(cairo, Date <- as.Date(paste(year, month, day.of.month, sep = "-")))
data_1 <- data_1[,c('Date','temp')]
data_2 <- data_1
# add a lag
n <- nrow(data_1)
nn <- n - 49
data_1 <- data_1[1:nn,]
data_2 <- data_2[50:nrow(data_2),]
data_2[,1] <- data_1[,1]
require(biwavelet)
d1 <- data_1[,c('Date','temp')]
d2 <- data_2[,c('Date','temp')]
xt1 <- xwt(d1,d2)
plot(xt1, plot.phase = TRUE)
These are my two time series. Both are identical but one is lagging the other. The arrows suggest a phase angle of 45 degrees - apparently pointing down or up means 90 degrees (in or out of phase) so my interpretation is that I'm looking at a lag of 45 degrees.
How would I now convert this to a time lag i.e. how would I calculate the time lag between these signals?
I've read online that this can only be done for a specific wavelength (which I presume means for a certain period?). So, given that we're interested in a period of 365, and the time step between the signals is one day, how would one alculate the time lag?
So I believe you're asking how you can determine what the lag time is given two time series (in this case you artificially added in a lag of 49 days).
I'm not aware of any packages that make this a one-step process, but since we are essentially dealing with sin waves, one option would be to "zero out" the waves and then find the zero crossing points. You could then calculate the average distance between zero crossing points of wave 1 and wave 2. If you know the time step between measurements, you can easy calculate the lag time (in this case the time between measurement steps is one day).
Here is the code I used to accomplish this:
#smooth the data to get rid of the noise that would introduce excess zero crossings)
#subtracted 70 from the temp to introduce a "zero" approximately in the middle of the wave
spline1 <- smooth.spline(data_1$Date, y = (data_1$temp - 70), df = 30)
plot(spline1)
#add the smoothed y back into the original data just in case you need it
data_1$temp_smoothed <- spline1$y
#do the same for wave 2
spline2 <- smooth.spline(data_2$Date, y = (data_2$temp - 70), df = 30)
plot(spline2)
data_2$temp_smoothed <- spline2$y
#function for finding zero crossing points, borrowed from the msProcess package
zeroCross <- function(x, slope="positive")
{
checkVectorType(x,"numeric")
checkScalarType(slope,"character")
slope <- match.arg(slope,c("positive","negative"))
slope <- match.arg(lowerCase(slope), c("positive","negative"))
ipost <- ifelse1(slope == "negative", sort(which(c(x, 0) < 0 & c(0, x) > 0)),
sort(which(c(x, 0) > 0 & c(0, x) < 0)))
offset <- apply(matrix(abs(x[c(ipost-1, ipost)]), nrow=2, byrow=TRUE), MARGIN=2, order)[1,] - 2
ipost + offset
}
#find zero crossing points for the two waves
zcross1 <- zeroCross(data_1$temp_smoothed, slope = 'positive')
length(zcross1)
[1] 10
zcross2 <- zeroCross(data_2$temp_smoothed, slope = 'positive')
length(zcross2)
[1] 11
#join the two vectors as a data.frame (using only the first 10 crossing points for wave2 to avoid any issues of mismatched lengths)
zcrossings <- as.data.frame(cbind(zcross1, zcross2[1:10]))
#calculate the mean of the crossing point differences
mean(zcrossings$zcross1 - zcrossings$V2)
[1] 49
I'm sure there are more eloquent ways of going about this, but it should get you the information that you need.
In my case, for the tidal wave in semidiurnal, 90 degree equal to 3 hours (90*12.5 hours/360 = 3.125 hours). 12.5 hours is the period of semidiurnal. So, for 45 degree equal to -> 45*12.5/360 = 1.56 hours.
Thus in your case:
90 degree -> 90*365/360 = 91.25 hours.
45 degree -> 45*365/360= 45.625 hours.
My understanding is as follows:
For there to be a simple cause-and-effect relationship between the phenomena recorded in the time series, we would expect that the oscillations are phase-locked (Grinsted 2004); so, the period where you find the "in phase" arrow (--->) indicates the lag between the signals.
See the simulated examples with different distances between cause-and-effect phenomena; observe that greater the distance, greater is the period of occurrence of the "in phase arrow" in the Cross wavelet transform.
Nonlinear Processes in Geophysics (2004) 11: 561–566 SRef-ID: 1607-7946/npg/2004-11-561
See the example here

Resources