Determine if a sequence has "gaps" in R

I would like to determine whether a sequence contains any gaps or irregular steps. I'm not sure this is the right way to phrase it, and there's a good chance that this is a duplicate (but I was unable to find a good question).
The following has_gap function gives me the correct results, but seems a bit clunky? Perhaps there is something built-in that I haven't discovered?
x1 <- c(1:5, 7:10)
x2 <- 1:10
x3 <- seq(1, 10, by = 2)
x4 <- c(seq(1, 6, by = 2), 6, seq(7, 10, by = 2))
has_gap <- function(vec) length(unique(diff(vec))) != 1
vecs <- list(x1, x2, x3, x4)
sapply(vecs, has_gap)
# [1] TRUE FALSE FALSE TRUE

library(zoo)
is.regular(x3, strict=TRUE)
is.regular(x3, strict=FALSE)

As noted by G. Grothendieck in the comments, one approach is:
has_gaps <- \(x)!!diff(range(diff(x)))
Another approach might be:
has_gaps2 <- \(x)var(diff(x))>0
If performance is an issue, rawr suggested:
has_gaps3 <- \(x)!isTRUE(all.equal(cor(x,seq_along(x)),1))
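The first two checks above compare step sizes exactly, which can misfire on non-integer steps because of floating-point rounding. A minimal sketch of a tolerance-based variant follows; the name has_gap_tol and its default tolerance are illustrative assumptions, not something taken from the answers above.

```r
# Hypothetical helper: treat steps as equal when they differ by at most `tol`.
has_gap_tol <- function(vec, tol = sqrt(.Machine$double.eps)) {
  d <- diff(vec)
  if (length(d) < 2) return(FALSE)   # zero or one step cannot be irregular
  diff(range(d)) > tol
}

x5 <- seq(0, 1, by = 0.1)            # regular in intent, inexact in binary
has_gap_tol(x5)                      # steps agree within the tolerance
has_gap_tol(c(1:5, 7:10))            # a genuine gap is still reported
```

The tolerance should be chosen relative to the step size of the data; the default here is only a reasonable starting point for steps of order 1.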

R Loop: Perform a Function for Every 3 Rows

I have 2000 wheat plants, growing over the course of 40 days.
I'd like to perform the coef function on each plant to find the coefficients (a, b, and c) of the quadratic equation defined by its 3 time points.
(1) The coef(lm(y~poly(x,2,raw=TRUE))) function works exactly the way I want it to.
(2) However, the way my data is presented requires me to manually set x and y.
(3) Thus, I melted my data, and ordered it.
(4) I'd like to make a loop that will take the first three in column "Day" and set that as x. Then I'd like it to take the first three in column "Height" and set that as y.
Then I'd like to perform the coef function.
Last, I'd like it to present the coefficient outputs I need, preferably in a new data table.
Then repeat for every three rows, which represent each wheat ID, for all wheat plants.
1) This function works, giving me coefficients: a, b, c
x<-c(1,2,3)
y<-c(1,10,4)
coef(lm(y~poly(x,2,raw=TRUE)))
2) This is what my data originally looked like
A = matrix(c(5, 4, 2, 10, 10, 4, 5, 15, 6),nrow=3, ncol=3)
colnames(A)<-c("10", "25", "40")
rownames(A)<-c("Wheat 1", "Wheat 2", "Wheat 3")
A
3) This is my melted format
library(reshape2)  # provides melt(); a matrix melts to columns Var1, Var2, value
A.melted<-as.data.frame(melt(A, id.vars="ID"))
A.melted<-A.melted[with(A.melted,order(Var1)),]
colnames(A.melted) <- c("WheatID", "Day", "Height")
A.melted$Day<-as.numeric(as.character(A.melted$Day))
A.melted
#
4) This is what I am trying to do with my loop....
for every 3 rows,
x<-A.melted[,2]
y<-A.melted[,3]
coef(lm(y~poly(x,2,raw=TRUE)))
something to compile the coefficients: a, b, c
I am just not familiar with the syntax of loops, and I'd love any tips and suggestions. Perusing Google tells me that one should not use loops unless absolutely required, since I may run into more problems, so I am open to non-loop techniques as well.
If you want to do it in a loop try this. The crucial part is to use seq together with a by = argument to let the index take the steps you need.
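library(tibble)
# Pre-allocate one result row per plant
df <- tibble(
  WheatID = rep(NA_character_, nrow(A)),
  Intercept = rep(NA_real_, nrow(A)),
  poly1 = rep(NA_real_, nrow(A)),
  poly2 = rep(NA_real_, nrow(A))
)
cnt <- 1
for (i in seq(1, nrow(A.melted), by = 3)) {
  x <- A.melted$Day[i + 0:2]
  y <- A.melted$Height[i + 0:2]
  df$WheatID[cnt] <- as.character(A.melted$WheatID[i])
  # as.list() spreads the three coefficients across the three columns;
  # recent tibble versions reject a bare length-3 vector for a single row
  df[cnt, 2:4] <- as.list(unname(coef(lm(y ~ poly(x, 2, raw = TRUE)))))
  cnt <- cnt + 1
}
df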
Note: I am not a data.table guy. Therefore, I present you with a tibble.
We can do this with the help of data.table, see ?data.table:
library(data.table)
setDT(A.melted)  # convert the data.frame to a data.table by reference
# One model per plant, stored in a list column; .SD holds the current group's rows
A.models = A.melted[, model := list(list(lm(Height ~ poly(Day, 2), data = .SD))),
                    by = WheatID]
A.models[, coefs := list(list(coefficients(model[[1]]))),
         by = WheatID]
You can access each model like this:
A.models[WheatID == "Wheat 1", model[[1]]]
and even
A.models[WheatID == "Wheat 1", summary(model[[1]])]
The magic here happens because data.table takes in J expressions, not only functions.
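This is something you can do with the data.table package.
library(data.table)
# Split into consecutive chunks of 3 rows (one chunk per plant)
data.list <- split(A.melted, f = (1:nrow(A.melted) - 1) %/% 3)
coefs <- lapply(data.list, function(x) {
  # Height as a quadratic in Day, as in the question
  coefs <- coef(lm(Height ~ poly(Day, 2, raw = TRUE), data = x))
  data.table(
    WheatID = as.character(x$WheatID[1]),
    intercept = coefs[1],
    poly1 = coefs[2],
    poly2 = coefs[3]
  )
})
coefs <- rbindlist(coefs)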
Or you could perform apply() directly on the original matrix:
x <- as.numeric(colnames(A))
apply(A, 1, function(y) coef(lm(y~poly(x,2,raw=TRUE))))
#                             Wheat 1      Wheat 2       Wheat 3
# (Intercept)             -3.88888889 -0.555555556  6.666667e-01
# poly(x, 2, raw = TRUE)1  1.11111111  0.477777778  1.333333e-01
# poly(x, 2, raw = TRUE)2 -0.02222222 -0.002222222 -2.417315e-18
Or you could transpose the data and use the coef(...) call directly:
x <- as.numeric(colnames(A))
coef(lm(t(A) ~ poly(x, 2, raw = TRUE)))
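The loop and split() answers above rely on the rows being ordered in blocks of three. A minimal base-R sketch that groups by plant ID instead is shown here; it rebuilds the question's example data in long format without reshape2, and the names long, fit_one and coef_mat are illustrative choices rather than anything from the answers.

```r
# Same example data as in the question, built directly in long format
A <- matrix(c(5, 4, 2, 10, 10, 4, 5, 15, 6), nrow = 3, ncol = 3,
            dimnames = list(c("Wheat 1", "Wheat 2", "Wheat 3"),
                            c("10", "25", "40")))
long <- data.frame(
  WheatID = rep(rownames(A), times = ncol(A)),
  Day     = rep(as.numeric(colnames(A)), each = nrow(A)),
  Height  = as.vector(A)
)

# Fit one quadratic per plant, grouping by ID instead of by row position
fit_one <- function(d) coef(lm(Height ~ poly(Day, 2, raw = TRUE), data = d))
coef_mat <- t(sapply(split(long, long$WheatID), fit_one))
colnames(coef_mat) <- c("intercept", "linear", "quadratic")
coef_mat
```

Because the grouping comes from WheatID, this keeps working if a plant has more than three measurements or the rows arrive unsorted.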

unable to loop over data frame row index in R

I have a data frame in R which is of size nx4. I am attempting to loop through it and perform a computation to add to the "distances" vector. x0 is a vector of length 3. I attempt to run the following code
trainData = data.frame(x1,x2,x3,y)
for (j in 1:n) {
distances[j] = sqrt(sum((x0 - trainData[j,1:3])^2))
}
I get the following error:
Error in Ops.data.frame(x0, trainData[j, 1:3]) :
‘-’ only defined for equally-sized data frames
However, the 2 values being subtracted are the same length, and I can run it without looping, ie
sqrt(sum((x0 - trainData[1,1:3])^2))
I'm unable to find the reason for this, any help is appreciated.
You want to use the dist() function to calculate your distance. Also, avoid using loops and look at the apply family of functions.
library(dplyr)
set.seed(1724)
trainData <- data.frame(x1 = runif(4, 1, 10), x2 = runif(4, 1, 10), x3 = runif(4, 1, 10), y = runif(4, 1, 10))
mutate(trainData,
dist = apply(trainData,
1,
function(x, y = runif(3, 1, 10)) {
dist(rbind(x[1:3], y), method = "euclidean")
}))
# x1 x2 x3 y dist
# 1 5.890667 7.156956 6.946917 6.580706 6.188533
# 2 3.060810 1.117295 7.676836 7.965404 5.193822
# 3 8.058110 5.518819 2.687567 3.832825 10.520283
# 4 8.405847 1.326119 3.533277 6.804517 8.390918
I'm not sure what the original issue was, but I've got things working by taking Paul's advice and replacing the loop with:
# use only the three coordinate columns, so each row has the same length as x0
distances = apply(trainData[, 1:3], 1, function(x) dist(rbind(x0, x)))
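On the original error: that message is raised by Ops.data.frame when both operands are data frames of different dimensions, and trainData[j, 1:3] is a one-row data frame. One plausible cause is that x0 was itself a data frame rather than a plain vector, but the question does not show how x0 was built, so this is a guess. A minimal sketch that sidesteps the issue by working on a numeric matrix, using made-up data:

```r
set.seed(1)                               # made-up data for illustration
trainData <- data.frame(x1 = runif(5), x2 = runif(5), x3 = runif(5), y = runif(5))
x0 <- c(0.5, 0.5, 0.5)                    # plain numeric vector of length 3

X <- as.matrix(trainData[, 1:3])          # numeric matrix, one row per observation
# Subtract x0 from every row, then take the Euclidean norm of each row
distances <- sqrt(rowSums(sweep(X, 2, x0)^2))

# Loop version for comparison: unlist() turns the one-row data frame into a vector
loop_distances <- numeric(nrow(trainData))
for (j in seq_len(nrow(trainData))) {
  loop_distances[j] <- sqrt(sum((x0 - unlist(trainData[j, 1:3]))^2))
}
```

Both versions should give the same numbers; the matrix version avoids data-frame arithmetic altogether.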

Should attach be avoided in this situation?

Although there are some questions about this topic (e.g. this question), none of them answer my particular questions (as far as I could tell anyway).
Suppose I have a function which depends on a lot of parameters. For demonstration purposes I chose 3 parameters:
myfun <- function(x1, x2, x3){
  # some code containing x1, x2, x3
}
Often the input parameters are already contained in a list:
xlist <- list(x1 = 1, x2= 2, x3 = 3)
I want to run myfun with the inputs contained in xlist like this:
myfun(xlist$x1, xlist$x2, xlist$x3)
However this seems like too big of an effort (because of the high number of parameters).
So I decided to modify myfun: instead of taking all the input parameters separately, it now takes the whole list as a single input. At the beginning of the code I use attach so that the rest of the code can stay the same as above.
myfun2 <- function(xlist){
  attach(xlist)
  # same code as in myfun containing x1, x2, x3
  detach(xlist)
}
I thought that this would be quite a neat solution, but a lot of users advise to not use attach.
What do you think? Are there any arguments to prefer myfun over myfun2?
Thanks in advance.
I think you'd be better off using do.call. do.call accepts a list and passes its elements to the function as arguments.
myfun <- function(x1, x2, x3){
x1 + x2 + x3
}
xlist <- list(x1 = 1, x2= 2, x3 = 3)
do.call(myfun, xlist)
This has the benefit of being explicit about what the arguments are, which makes it much easier to reason with the code, maintain it, and debug it.
The place where this gets tricky is if xlist has more values in it than just those required by the function. For example, the following throws an error:
xlist <- list(x1 = 1, x2 = 2, x3 = 3, x4 = 4)
do.call(myfun, xlist)
You can circumvent this by matching arguments with the formals
do.call(myfun, xlist[names(xlist) %in% names(formals(myfun))])
It's still a bit of typing, but if you're talking about 10+ arguments, it's still a lot easier than xlist$x1, xlist$x2, xlist$x3, etc.
LAP gives a useful solution as well, but I think with is better used outside the function, wrapping the call:
with(xlist, myfun(x1, x2, x3))
You could just use with():
xlist <- list(x1 = 1, x2= 2, x3 = 3)
FOO <- function(mylist){
with(mylist,
x1+x2+x3
)
}
> FOO(xlist)
[1] 6
I'm not convinced of this approach, though. The function would depend on the correctly named elements within the list.
My approach would be something like this:
testfun <- function (a_list)
{
args = a_list
print(args$x1)
print(args$x2)
print(args$x3)
}
my_list <- list(x1=2, x2=3, x3=4)
testfun(my_list)
However, you would need to know the names of the parameters within the function.
Perhaps the do.call() function can come into play here.
do.call('fun', list)
You could assign the list to the environment of the function:
myfun <- function(xlist) {
for (i in seq_along(xlist)) {
assign(names(xlist)[i], xlist[[i]], envir = environment())
}
# or if you dislike for-loops
# lapply(seq_along(xlist), function(i) assign(names(xlist)[i], xlist[[i]], envir = parent.env(environment())))
print(paste0(x2, x3)) # do something with x2 and x3
print(x1 * x3) # do something with x1 and x3
}
myfun(list(x1 = 4, x2 = "dc", x3 = c(3,45,21)))
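Base R also ships list2env(), which does the same job as the assign() loop in one call. A minimal sketch follows; the function name myfun3 is made up for illustration.

```r
myfun3 <- function(xlist) {
  # Copy every named element of xlist into this call's own environment
  list2env(xlist, envir = environment())
  x1 + x2 + x3
}

myfun3(list(x1 = 1, x2 = 2, x3 = 3))
```

Unlike attach(), this leaves nothing on the search path, so there is nothing to detach if the body stops with an error. It shares the drawback mentioned above: the function still depends on the list elements being named correctly.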

Strange behaviour with lapply, lazy evaluation?

Can someone explain to me what is going on here?
I have a list of lists that I need to match with a table, and I am using lapply with fmatch (package fastmatch http://cran.r-project.org/web/packages/fastmatch/index.html) for that (which I think uses matching by hashing the table to be matched with, in contrast to match).
However, this is rather slow if table values have to be evaluated in the function (at least that's what I suspect), but I am not entirely sure.
I have found a workaround which speeds up the computation from 5.5 to 0.01s, but would like a more elegant solution.
Here is a reproducible example:
library(fastmatch)  # provides fmatch()
set.seed(10)
matchFeatures <- replicate(n = 1000, paste0("a", sample(x = 1:10000, size = sample(x = 1:10, size = 1))))
matchTable <- 1:10000
system.time(m1 <- lapply(matchFeatures, function(features) fmatch(features, paste0("a", 1:10000))))
system.time(m2 <- lapply(matchFeatures, function(features) force(fmatch(features, paste0("a", 1:10000)))))
system.time({tempTable <- paste0("a", 1:10000); m3 <- lapply(matchFeatures, function(features) fmatch(features, tempTable))})
identical(m1, m3)
Thanks Justin, just to follow up, I was looking for something like this:
system.time(m4 <- lapply(matchFeatures, fmatch, table = paste0("a", 1:10000)))
In the first two functions, you're running the paste0 command once for each iteration (i.e. 1000 times, once per element of matchFeatures). In the third, it only happens once. If you use matchTable <- paste0('a', 1:10000) and pass matchTable to all three versions you get a substantial speed up as expected.
matchFeatures <- replicate(n = 1000,
                           paste0("a",
                                  sample(x = 1:10000,
                                         size = sample(x = 1:10, size = 1))))
# paste0, not paste: paste() would insert a space ("a 1") and nothing would match
matchTable <- paste0('a', 1:10000)
system.time(m1 <- lapply(matchFeatures,
function(features) fmatch(features, matchTable)))
system.time(m2 <- lapply(matchFeatures,
function(features) force(fmatch(features, matchTable))))
system.time(m3 <- lapply(matchFeatures,
function(features) fmatch(features, matchTable)))
identical(m1, m3)
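The difference between building the table inside the function body and passing it as an extra argument to lapply can be made visible with a call counter. This is a minimal sketch that uses base match() instead of fmatch so that it runs without extra packages; build_table and n_calls are hypothetical names introduced only for the demonstration.

```r
n_calls <- 0
build_table <- function() {       # hypothetical stand-in for paste0("a", 1:10000)
  n_calls <<- n_calls + 1         # count how often the table is rebuilt
  paste0("a", 1:100)
}
features <- list("a1", c("a2", "a3"), "a50")

# Table expression inside the function body: re-evaluated on every call
r1 <- lapply(features, function(f) match(f, build_table()))
calls_in_body <- n_calls

# Table passed through lapply's `...`: a promise that is forced only once
n_calls <- 0
r2 <- lapply(features, match, table = build_table())
calls_via_dots <- n_calls

identical(r1, r2)
c(calls_in_body, calls_via_dots)
```

With three list elements the first counter should read 3 and the second 1, which is consistent with why the m4 version in the follow-up is fast: the argument is evaluated lazily, but only once.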

Avoiding Loop with R using Apply (?)

I'm trying to apply a function to each row of a dataset. The function looks up matching rows in a second dataset and computes a similarity score for the product details passed to it.
The function works if I just call it with test numbers but I can't figure out how to run it on all rows of my dataset. I've tried using apply but can't get it working.
I'm going to be iterating different parameter settings to find those that best fit historical data so speed is important... meaning that a loop is out. Any help you can provide would be hugely appreciated.
Thanks! Alan
GetDistanceTest <- function(SnapshotDate, Cand_Type, Cand_Height, Cand_Age) {
HeightParam <- 1/5000
AgeParam <- 1
Stock_SameType <- HistoricalStock[!is.na(HistoricalStock$date) & !is.na(HistoricalStock$Type) & as.character(HistoricalStock$date)==as.character(SnapshotDate) & HistoricalStock$Type==Cand_Type,]
Stock_SameType$ED <- (HeightParam*(Stock_SameType$Height - Cand_Height))^2 + (AgeParam*(Stock_SameType$Age - Cand_Age))^2
return(sqrt(sum(Stock_SameType$ED)))
}
HistoricalStock <- HistoricalAQStock[,c(1, 3, 4, 5)]
colnames(HistoricalStock) <- c("date", "Age", "Height", "Type")
Sales <- AllSales[,c(2,10,11,25)]
colnames(Sales) <- c("date", "Age", "Height", "Type")
GetDistanceTest("2010-04-01", 5261, 12, 7523) #works and returns a single number
res1 <- transform(Sales, ClusterScore=GetDistanceTest(date, Type, Height, Age))
# returns Error in `$<-.data.frame`(`*tmp*`, "ED", value = c(419776714.528591, 22321257.0276852, : replacement has 4060 rows, data has 54
# also 4 warnings, one for each variable. e.g. 1: In as.character(HistoricalStock$date) == as.character(SnapshotDate) : longer object length is not a multiple of shorter object length
res2 <- apply(Sales, 1, GetDistanceTest, Sales$Type, Sales$Height, Sales$Age)
# `$<-.data.frame`(`*tmp*`, "ED", value = c(419648071.041523, 22325941.2704261, : replacement has 4060 rows, data has 13
# also same 4 warnings as res1
I took some liberties with your code because I try to vectorize rather than use loops whenever I can. With the merge function, you merge the two data frames and operate on the columns, which lets you use the vectorization built into R. I think this will do what you want (in the second line I'm just making sure that A and B don't have the same values for height and age so that your distance isn't always zero):
A <- B <- data.frame(date=Sys.Date()-9:0, stock=letters[1:10], type=1:10, height=1:10, age=1:10)
B$height <- B$age <- 10:1
AB <- merge(x=A, y=B, by=c("date", "type"), suffixes=c(".A", ".B"))
height.param <- 1/5000
age.param <- 1
temp <- sqrt( height.param * (AB$height.A - AB$height.B)^2 + age.param * (AB$age.A - AB$age.B)^2 )
Use mapply, the multivariate version of sapply:
res1 <- mapply(GetDistanceTest, Sales$date, Sales$Type, Sales$Height, Sales$Age)
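The transform and apply attempts in the question fail because they hand whole columns to a function written for one candidate at a time. A minimal sketch of how mapply feeds such a function row by row is given here; score_one and the two small data frames are made up for illustration and only loosely mirror GetDistanceTest.

```r
# Hypothetical scalar function: expects ONE candidate per call
score_one <- function(type, height, age, stock) {
  same <- stock[stock$Type == type, ]
  sqrt(sum((same$Height - height)^2 + (same$Age - age)^2))
}

stock <- data.frame(Type = c(1, 1, 2), Height = c(10, 12, 20), Age = c(3, 4, 5))
sales <- data.frame(Type = c(1, 2),    Height = c(11, 20),     Age = c(3, 6))

# mapply walks the three columns in parallel; MoreArgs is passed unchanged each time
scores <- mapply(score_one, sales$Type, sales$Height, sales$Age,
                 MoreArgs = list(stock = stock))
scores
```

Passing the lookup table through MoreArgs, instead of reading it from the global environment as GetDistanceTest does, is a design choice that makes the function easier to test. Note that mapply still loops internally, so the merge-based answer is likely to be faster on large data.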
Code as per above comment:
A <- data.frame(date=rep(Sys.Date()-9:0,100), id=letters[1:10], type=floor(runif(1000, 1, 10)), height=runif(1000, 1, 100), age=runif(1000, 1, 100))
B <- data.frame(date=rep(Sys.Date()-9:0,1000), type=floor(runif(10000, 1, 10)), height=runif(10000, 1, 10), age=runif(10000, 1, 10))
AB <- merge(x=A, y=B, by=c("date", "type"), suffixes=c(".A", ".B"))
height.param <- 1
age.param <- 1
AB$ClusterScore <- sqrt( height.param * (AB$height.A - AB$height.B)^2 + age.param * (AB$age.A - AB$age.B)^2 )
library(plyr)  # provides ddply()
Scores <- ddply(AB, c("id"), function(df) sum(df$ClusterScore))
