How can I find the first minimum value in a vector? E.g., in y I'd like to return 3 because it is the first position where the values on either side of y[3] are greater than y[3]. I wrote a function to do this, but I'm wondering if there is an easier way. I also need to account for the case when the first value is less than the second value, e.g., in z.
y <- c(2448,2442,2438,2440,2438,2444,2431,2433,2434)
plot(y)
getFirstMin <- function(x){
if(x[1] < x[2]) res <- 1
else res <- min(which(diff(x) > 0))
return(res)
}
getFirstMin(y)
z <- c(2408,2442,2438,2440,2438,2444,2431,2433,2434)
plot(z)
getFirstMin(z)
We can use first(), lag() and lead() from dplyr together with which():
getFirstMin <- function(x) { dplyr::first(which(dplyr::lag(x, default = Inf) > x & dplyr::lead(x, default = Inf) > x)) }
Easier than @onyambu's answer: a local minimum simply requires that the first-order difference turns positive at some point, so with a nicely behaved sequence:
y <- c(2448,2442,2438,2440,2438,2444,2431,2433,2434)
which(diff(y)>0)[1]
gives 3. We could, of course, build on this to handle more complicated cases, but your particular example demands nothing else.
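One way to build on this for less well-behaved input is sketched below in base R. It assumes a minimum must be strictly lower than both neighbours, and it pads the vector with Inf so that either end can qualify. The name first_local_min is made up for illustration and does not come from any of the answers.

```r
# Hypothetical helper: index of the first strict local minimum.
first_local_min <- function(x) {
  d <- diff(c(Inf, x, Inf))      # steps into and out of every position
  into <- d[-length(d)]          # step arriving at position i
  out  <- d[-1]                  # step leaving position i
  which(into < 0 & out > 0)[1]   # NA if no strict local minimum exists
}

y <- c(2448, 2442, 2438, 2440, 2438, 2444, 2431, 2433, 2434)
z <- c(2408, 2442, 2438, 2440, 2438, 2444, 2431, 2433, 2434)
first_local_min(y)  # 3
first_local_min(z)  # 1
```

Because the comparison is strict, a flat stretch of equal values is not reported as a minimum; whether that is the desired behaviour depends on the data.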
I'm trying to create a function to solve this puzzle:
An Arithmetic Progression is defined as one in which there is a constant difference between the consecutive terms of a given series of numbers. You are provided with consecutive elements of an Arithmetic Progression. There is however one hitch: exactly one term from the original series is missing from the set of numbers which have been given to you. The rest of the given series is the same as the original AP. Find the missing term.
You have to write the function findMissing(list), list will always be at least 3 numbers. The missing term will never be the first or last one.
The next section of code shows my attempt at this function. The site I'm on runs tests against the function, all of which passed, in that they output the correct missing integer.
The problem I'm facing is a timeout error, because it takes too long to run all the tests. There are 102 tests and the site says they take over 12 seconds to complete. Taking more than 12 seconds means the function isn't efficient enough.
After running my own timing tests in RStudio, it seems the function takes considerably less than 12 seconds to run, but regardless I need to make it more efficient to be able to complete the puzzle.
I asked on the site forum and someone said "Sorting is expensive, think of another way of doing it without it." I took this to mean I shouldn't be using the sort() function. Is this what they mean?
I've since found a few different ways of getting my_diff, which is currently calculated using the sort() function. All of these ways are even less efficient than the original.
Can anyone give me a more efficient way than the sort to find my_diff, or maybe make other parts of the code more efficient? It's the sort() part which is apparently the inefficient part of the code, though.
find_missing <- function(sequence){
len <- length(sequence)
if(len > 3){
my_diff <- as.integer(names(sort(table(diff(sequence)), decreasing = TRUE))[1])
complete_seq <- seq(sequence[1], sequence[len], my_diff)
}else{
differences <- diff(sequence)
complete_seq_1 <- seq(sequence[1],sequence[len],differences[1])
complete_seq_2 <- seq(sequence[1],sequence[len],differences[2])
if(length(complete_seq_1) == 4){
complete_seq <- complete_seq_1
}else{
complete_seq <- complete_seq_2
}
}
complete_seq[!complete_seq %in% sequence]
}
Here are a couple of sample sequences to check the code works:
find_missing(c(1,3,5,9,11))
find_missing(c(1,5,7))
Here are some of the other things I tried instead of sort:
1:
library(pracma)
Mode(diff(sequence))
2:
library(dplyr)
(data.frame(diff_1 = diff(sequence)) %>%
group_by(diff_1) %>%
summarise(count = n()) %>%
ungroup() %>%
filter(count==max(count)))[1]
3:
MaxTable <- function(sequence, mult = FALSE) {
differences <- diff(sequence)
if (!is.factor(differences)) differences <- factor(differences)
A <- tabulate(differences)
if (isTRUE(mult)) {
as.integer(levels(differences)[A == max(A)])
}
else as.integer(levels(differences)[which.max(A)])
}
Here is one way to do this using seq. Because exactly one term is missing, the complete sequence runs from the minimum to the maximum of x and has length(x) + 1 elements. We can generate that sequence and return the element that is not in x.
find_missing <- function(x) {
setdiff(seq(min(x), max(x), length.out = length(x) + 1), x)
}
find_missing(c(1,3,5,9,11))
#[1] 7
find_missing(c(1,5,7))
#[1] 3
This approach takes the diff() of the vector. For an increasing sequence there will always be one difference larger than the others, at the position of the gap.
find_missing <- function(x) {
diffs <- diff(x)
x[which.max(diffs)] + min(diffs)
}
find_missing(c(1,3,5,9,11))
#[1] 7
find_missing(c(1,5,7))
#[1] 3
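The which.max()/min() pairing in the diff() answer relies on the common difference being positive. For a decreasing progression such as c(11, 9, 7, 3, 1) it picks the wrong step. A minimal sketch of a sign-independent variant follows, assuming a non-zero common difference; the name find_missing_diff is hypothetical.

```r
# Hypothetical variant of the diff() idea that also handles decreasing input.
find_missing_diff <- function(x) {
  diffs <- diff(x)
  gap  <- which.max(abs(diffs))          # the one step that is twice as large
  step <- diffs[which.min(abs(diffs))]   # the common difference, sign included
  x[gap] + step
}

find_missing_diff(c(1, 3, 5, 9, 11))   # 7
find_missing_diff(c(11, 9, 7, 3, 1))   # 5
find_missing_diff(c(1, 5, 7))          # 3
```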
There is actually a simple formula for this, which will work even if your vector is not sorted...
find_missing <- function(x) {
(length(x) + 1) * (min(x) + max(x))/2 - sum(x)
}
find_missing(c(1,5,7))
#[1] 3
find_missing(c(1,3,5,9,11,13,15))
#[1] 7
find_missing(c(2,8,6))
#[1] 4
It is based on the fact that the sum of the full series should be the average value times the length.
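As a quick sanity check of that reasoning, the sketch below builds a complete progression, removes each interior term in turn, and confirms the sum formula recovers it. The function is renamed find_missing_sum here only to avoid clashing with the other definitions; the test progression is an arbitrary choice.

```r
# Same formula as above, under a hypothetical name.
find_missing_sum <- function(x) {
  (length(x) + 1) * (min(x) + max(x)) / 2 - sum(x)
}

full <- seq(3, by = 4, length.out = 10)   # 3, 7, ..., 39
interior <- 2:(length(full) - 1)          # the first and last terms are never missing

# Drop one interior term at a time and try to recover it.
recovered <- sapply(interior, function(i) find_missing_sum(full[-i]))
all(recovered == full[interior])          # TRUE
```

The full series sums to 10 * (3 + 39) / 2 = 210, so removing any term t leaves 210 - t, and the formula returns 210 - (210 - t) = t.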
I want to make a function in terms of x and coef for multiple values of x so that the output is a vector, like I've tried here:
directpoly<-function(x,coef) {
for(n in length(coef)) {
total<-sum(coef*x^(0:(n-1)))
}
total
}
This works when I input one value for x and any vector for the coefficient values, but I want more than that. I want to input a certain amount of values for the coefficients, say c(5,9,-2), and have the function produce three different values, one for each input of x for, say, x<-2:4. So in that case I'd want output 15, 14, 9. Any ideas? I am new so all help is appreciated.
Edit: I took out an "<-" that I accidentally put in there. Sorry if that was any cause for confusion. Also what I want in the end is a function
P(x) = c1 + c2*x + ... + cn*x^(n-1)
Does this work?
directpoly <- function(x, coef) {
seqcoef <- seq_along(coef) - 1
sapply(x, function(z) sum(coef*z^seqcoef))
}
directpoly(2:4, c(5,9,-2))
# [1] 15 14 9
If so, the trick to solving this takes two steps:
Determine what you want to do with each single value of x (not a vector). In this case, it's simply one from among:
sum(coef*x^(1:length(coef)-1))
sum(coef*x^(0:(length(coef)-1)))
sum(coef*x^(seq_along(coef)-1))
Because I'm eventually putting this into some loop/apply formulation, I don't need to recalculate the sequence each time, so I break it out:
seqcoef <- seq_along(coef) - 1
sum(coef*x^seqcoef)
Now that you know what to do with each x, map or apply over it:
sapply(x, function(z) ...)
where ... is what we determined above. For clear coding, many prefer to define this inner step as a named function of its own, so something like:
directpoly1 <- function(x, coef, seqcoef = seq_along(coef) - 1) {
sum(coef*x^seqcoef)
}
directpoly <- function(x, coef) {
seqcoef <- seq_along(coef) - 1
sapply(x, directpoly1, coef, seqcoef)
}
(I took a little more liberty with this version to enable running it explicitly with a scalar argument, primarily for unit-testing. It is not strictly necessary, so the function at the top of this answer should suffice.)
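An alternative that avoids sapply altogether is Horner's rule, which evaluates the polynomial for all values of x at once using only vectorised multiplication and addition. This is a sketch of a different technique, not the method used in the answer above, and the name directpoly_horner is hypothetical.

```r
# Hypothetical alternative: Horner's rule, vectorised over x.
# coef is ordered from the constant term upwards, as in the question.
directpoly_horner <- function(x, coef) {
  total <- numeric(length(x))
  for (cf in rev(coef)) {
    total <- total * x + cf   # fold in one coefficient, highest power first
  }
  total
}

directpoly_horner(2:4, c(5, 9, -2))
# [1] 15 14  9
```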
So I have a homework problem that I am really struggling to code in R.
This is the problem: Write a function difference() that takes a vector X as a parameter and returns a vector of the
difference between each element and the next element:
X[2]-X[1], X[3]-X[2], X[4]-X[3], etc.
Thus difference(c(5,2,9,4,8)) would return c(-3,7,-5,4)
And so far I have this:
difference<-function(X) {
for (i in X)
X.val<-X[i]-X[i-1]
return(X.val)
}
difference(c(5,2,9,4,8))
I can't seem to get the function to subtract X[2]-X[1], and it is returning one more number than it should when I run the function. Can anyone help me?
You're having a couple of problems with your code. Since this is homework, I'm not going to provide the correct code, but I'll help highlight where you're going wrong to help you get closer. The only reason I'm not providing the answer is because these are good learning experiences. If you comment with updated attempts, I'll continue to update my answer to guide you.
The issue is that you're using for (i in X), which will actually loop through the values of X and not its index. So, in your example, i will equal 5 and then 2 and then 9 and then 4 and then 8. If we start with i == 5, the code is doing this: X.val <- X[5] - X[5 - 1]. At this point you'd assign X.val to be 4 because X[5] is equal to 8 and X[4] is equal to 4. At the next iteration, i == 2. So this will set X.val to -3 because X[2] is 2 and X[1] is 5.
To fix this issue, you'd want to loop through the index of X instead. You can do this by using for (i in 1:length(X)) where length(X) will give you a number equal to the number of elements in X.
The next issue you've found is that you're getting one extra number. It's important to think about how many numbers you should have in your output and what this means in terms of where i should start. Hint: should you really be starting at 1?
Lastly, you overwrite X.val in each iteration. It surprises me that you were getting an extra number in your results given that you should have only received NA given that the last number is 8 and there are not 8 elements in X. Nevertheless, you'll need to rewrite your code so that you don't overwrite X.val, but instead append to it for each iteration.
I hope that helps.
UPDATE #1
As noted in the comments below, your code now looks like this:
difference <- function(X) {
for (i in 2:length(X)) {
X[i] <- X[i] - X[i-1]
}
return(X)
}
difference(c(5, 2, 9, 4, 8))
We are now very, very close to a final solution. We just need to address a quick problem.
The problem is that we're now overwriting our value of X, which is bad. Since our numbers, c(5,2,9,4,8), are passed into the function as the variable X, the line X[i] <- X[i] - X[i-1] will start to overwrite our values. So, stepping through one iteration at a time, we get the following:
Step 1:
i gets set to 2
X[2] is currently equal to 2
We then run the line X[i] <- X[i] - X[i-1], which gets evaluated like this: X[2] <- X[2] - X[1] --> X[2] <- 2 - 5 --> X[2] <- -3
X[2] is now set to -3
Step 2:
i gets set to 3
X[3] is currently equal to 9
We then run the line X[i] <- X[i] - X[i-1], which gets evaluated like this: X[3] <- X[3] - X[2] --> X[3] <- 9 - -3 --> X[3] <- 12
X[3] is now set to 12
As you can see from the first two iterations, we're overwriting our X variable, which is directly impacting the differences we get when we run our function.
To solve this, we simply go back to using X.val, like we were before. Since this variable has no values, there's nothing to be overwritten. Our function now looks like this:
difference <- function(X) {
for (i in 2:length(X)) {
X.val[i] <- X[i] - X[i-1]
}
return(X.val)
}
Now, for each iteration, nothing is overwritten and our values of X stay intact. There are two problems that we're going to have, though. If we run this new code, we'll end up with an error telling us that X.val doesn't exist. Earlier, I told you that you can index a variable that you're making, which is true. We just have to tell R that the variable we're making is a variable first. There are several ways to do this, but the second-best way is to create a variable with the same class as our expected output. Since we know we want our output to be a vector of numbers, we can just make X.val a numeric vector. Our code now looks like this:
difference <- function(X) {
X.val <- numeric()
for (i in 2:length(X)) {
X.val[i] <- X[i] - X[i-1]
}
return(X.val)
}
Notice that the assignment of X.val happens before we enter the for loop. As an exercise, you should think about why that's the case and then try moving it inside of the for loop and seeing what happens.
So this solves our first problem. Try running the code and seeing what you get. You'll notice that the first element of the output is NA. Why might this be the case, and how can we fix it? Hint: it has to do with the value of i.
UPDATE #2
So now that we have the correct answer, let's look at a couple of tips and tricks that are available thanks to R. R has some built-in features that work on whole vectors. To see this in action, run the following example:
a <- 1:10
b <- 11:20
a + b
a - b
a * b
a / b
As you can see, R will automatically perform what is called "element wise" operations for vectors. You'll notice that a - b is pretty similar to what we were trying to do here. The difference is that a and b are two different vectors and we were dealing with one vector at a time. So how do we set up our problem to work like this? Simple: we create two vectors.
x <- c(5, 2, 9, 4, 8)
y <- x[2:length(x)]
z <- x[1:(length(x)-1)]
y - z
You should notice that y - z now gives us the answer that we wanted from our function. We can apply that to our difference function like so:
difference <- function(X) {
y <- X[2:length(X)]
z <- X[1:(length(X)-1)]
return(y-z)
}
Using this trick, we no longer need a for loop, which can be incredibly slow in R, and can instead use the vectorized operation, which is incredibly fast in R. As was stated in the comments, we can actually skip the step of assigning those values to y and z and instead directly return what we want:
difference <- function(X) {
return(X[2:length(X)] - X[1:(length(X)-1)])
}
We've now just successfully created a one-line function that does what we were hoping to do. Let's see if we can make it even cleaner. R comes with two functions that are very handy for looking at data: head() and tail(). head allows you to look at the first n number of elements and tail allows you to look at the last n number of elements. Let's see an example.
a <- 1:50
head(a) # defaults to 6 elements
tail(a) # defaults to 6 elements
head(a, n=20) # we can change how many elements to return
tail(a, n=20)
head(a, n=-1) # returns all but the last element
tail(a, n=-1) # returns all but the first element
Those last two are the most important for what we want to do. In our newest version of difference we were looking at X[2:length(X)], which is another way of saying "all elements in X except the first element". We were also looking at X[1:(length(X)-1)], which is another way of saying "all elements in X except the last element". Let's clean that up:
difference <- function(X) {
return(tail(X, -1) - head(X, -1))
}
As you can see, that's a much cleaner way of defining our function.
So those are the tricks. Let's look at a couple of tips. The first is to drop the return from simple functions like this. R automatically returns the value of the last expression evaluated in a function; if that expression is an assignment, the value is still returned, but invisibly, so nothing is printed. To see this in action, try running these two functions:
difference_1 <- function(X) {
x.diff <- tail(X, -1) - head(X, -1)
}
difference_1(1:10)
difference_2 <- function(X) {
tail(X, -1) - head(X, -1)
}
difference_2(1:10)
In difference_1 you'll notice that nothing is printed. This is because the last command is an assignment, whose value is returned invisibly. You could force it to return a visible value by using the return command.
The next tip is something you won't need for a while, but it's important. Going back to the current version of difference that we have (the code you're using now, not anything I've mentioned in this update), we assign values to X.val, which causes it to "grow" over time. To see what this means, run the following code:
x.val <- numeric()
length(x.val)
x.val[1] <- 1
length(x.val)
x.val[2] <- 2
length(x.val)
You'll see that the length keeps growing. This is often a point of huge slowdowns in R code. The proper way to do this is to create x.val with a length equal to how big we need it. This is much, much faster and will save you some pains in the future. Here's how it would work:
difference <- function(X) {
x.val <- numeric(length=(length(X) - 1))
for (i in 2:length(X)) {
x.val[i-1] <- X[i] - X[i-1]
}
return(x.val)
}
In our current code, this doesn't make a real difference. But if you're dealing with very large data in the future, this can save you hours or even days of computing time.
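To see the effect for yourself, one option is to time a growing vector against a preallocated one. The sketch below uses two made-up helper functions (grow and prealloc) that fill a vector with squares. No timings are quoted because they depend on the machine and the R version, and newer versions of R narrow the gap considerably.

```r
# Hypothetical helpers for comparing the two allocation strategies.
grow <- function(n) {
  out <- numeric()              # starts empty and is extended on every assignment
  for (i in seq_len(n)) out[i] <- i^2
  out
}

prealloc <- function(n) {
  out <- numeric(n)             # allocated once at its final length
  for (i in seq_len(n)) out[i] <- i^2
  out
}

identical(grow(1e4), prealloc(1e4))   # TRUE: same result either way

system.time(grow(1e5))
system.time(prealloc(1e5))
```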
I hope this all helps you better understand some of the functionality in R. Good luck with everything!
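For checking a finished homework solution, base R's diff() computes the same quantity, so it makes a handy reference. The sketch below compares the head()/tail() version against it (under the hypothetical name difference_vec) and also shows one edge case where the index-based version X[2:length(X)] behaves unexpectedly: a vector of length 1, where 2:length(X) counts down as 2, 1.

```r
# head()/tail() version from the answer, under a hypothetical name.
difference_vec <- function(X) tail(X, -1) - head(X, -1)

x <- c(5, 2, 9, 4, 8)
difference_vec(x)                       # -3  7 -5  4
identical(difference_vec(x), diff(x))   # TRUE

# Edge case: a single element has no differences.
length(difference_vec(5))               # 0, same as diff(5)

# The index-based version does not cope with this input:
X <- 5
X[2:length(X)] - X[1:(length(X) - 1)]   # NA 0
```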
For example, I have a matrix k
> k
d e
a 1 3
b 2 4
I want to apply a function on k
> apply(k,MARGIN=1,function(p) {p+1})
a b
d 2 3
e 4 5
However, I also want to print the row name of the row being processed, so that I know which row the function is being applied to at that time.
It might look like this:
apply(k,MARGIN=1,function(p) {print(rowname(p)); p+1})
But I really don't know how to do that in R.
Does anyone have any idea?
Here's a neat solution to what I think you're asking. (I've called the input matrix mat rather than k for clarity - in this example, mat has 2 columns and 10 rows, and the rows are named abc1 through to abc10.)
In the code below, the result out1 is the thing you wanted to calculate (the outcome of the apply command). The result out2 comes out identically to out1 except that it prints out the rownames that it is working on (I put in a delay of 0.3 seconds per row so you can see it really does do this - take this out when you want the code to run full speed obviously!)
The trick I came up with was to cbind the row numbers (1 to n) onto the left of mat (to create a matrix with one additional column), and then use this to refer back to the rownames of mat. Note the line x = y[-1] which means that the actual calculation within the function (here, adding 1) ignores the first column of row numbers, which means it's the same as the calculation done for out1. Whatever sort of calculation you want to perform on the rows can be done this way - just pretend that y never existed, and formulate your desired calculation using x. Hope this helps.
set.seed(1234)
mat = as.matrix(data.frame(x = rpois(10,4), y = rpois(10,4)))
rownames(mat) = paste("abc", 1:nrow(mat), sep="")
out1 = apply(mat,1,function(x) {x+1})
out2 = apply(cbind(seq_len(nrow(mat)),mat),1,
function(y) {
x = y[-1]
cat("Doing row:",rownames(mat)[y[1]],"\n")
Sys.sleep(0.3)
x+1
}
)
identical(out1,out2)
You can use a variable outside of the apply call to keep track of the row index and pass the row names as an extra argument to your function:
idx <- 1
apply(k, 1, function(p, rn) {print(rn[idx]); idx <<- idx + 1; p + 1}, rownames(k))
This should work. The cat() function is what you want to use when printing results during evaluation of a function. paste(), conversely, just returns a character vector but doesn't send it to the command window.
The solution below uses a counter created as a closure, allowing it to "remember" how many times the function has been run before. Note the use of the superassignment operator <<-. If you really want to understand what's going on here, I recommend reading through this wiki https://github.com/hadley/devtools/wiki/
Note there may be an easier way to do this; my solution assumes that there is no way to access the rownumber or rowname of a current row using typical means within an apply function. As previously mentioned, this would be no problem in a loop.
k <- matrix(c(1,2,3,4),ncol=2)
rownames(k) <- c("a","b")
colnames(k) <- c("d","e")
make.counter <- function(x){
i <- 0
function(){
i <<- i+1
i
}
}
counter1 <- make.counter()
apply(k,MARGIN=1,function(p){
current.row <- rownames(k)[counter1()]
cat(current.row,"\n")
return(p+1)
})
As far as I know you cannot do that with apply, but you could loop through the rownames of your data frame. Lame example:
lapply(rownames(mtcars), function(x) sprintf('The mpg of %s is %s.', x, mtcars[x, 1]))
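Applied to the matrix k from the question, the same idea of iterating over the row names rather than the rows might look like the sketch below. It uses sapply() so that the result has the same shape as the apply() output shown in the question; the variable name out is arbitrary.

```r
k <- matrix(c(1, 2, 3, 4), ncol = 2,
            dimnames = list(c("a", "b"), c("d", "e")))

# Iterate over row names, so the name is available inside the function.
out <- sapply(rownames(k), function(rn) {
  cat("Doing row:", rn, "\n")   # report which row is being processed
  k[rn, ] + 1                   # the actual per-row calculation
})
out
#   a b
# d 2 3
# e 4 5
```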
I'm trying just to calculate the Hamming distance between two vectors in R. I'm currently attempting to use the "e1071" package, and the hamming.distance function, as follows:
library(e1071)
H <- hamming.distance(X)
Where X is a data.frame with 2 rows and (in my particular data) 667 columns, and every observation is 0 or 1.
Initially I got the error:
Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
After some research, it appeared that one fix might be increasing the basic option in R. This I did via options(expressions=5000), and then tried varying values in place of the 5000. But this only produced the error:
Error: C stack usage is too close to the limit
I'm not much of a programmer, and the fixes for this most recent error appear to have to do with something inside the package e1071 possibly not being called correctly (or at the right time).
Any ideas on what I'm doing wrong? I eventually want the Hamming distances between a large number of vectors, and this was just a starting point. If this has to do with memory allocation, any suggestions for how to deal with it?
I don't know how hamming.distance works internally, but a simple way to calculate the distance for 2 vectors is just
sum(x1 != x2)
or, in this case,
sum(X[1,] != X[2,])
If the total number of vectors is not too large (up to, say, a few thousand), you could implement this in a nested loop:
n <- nrow(X)
m <- matrix(nrow=n, ncol=n)
for(i in seq_len(n))  # run i up to n so the last diagonal entry m[n, n] is filled too
for(j in seq(i, n))
m[j, i] <- m[i, j] <- sum(X[i,] != X[j,])
Caveat: untested.
WARNING ABOUT USING HAMMING.DISTANCE FROM PACKAGE e1071!
This package's implementation forces the objects being compared to booleans with as.logical. This means that values of 0 will be FALSE and any non-zero values will be TRUE. This means that for the sequence: 0 1 2 compared to 0 1 1 the hamming distance will be reported as 0 instead of the correct value of 1 -- this package will treat 1 and 2 as equal since as.logical(1) == as.logical(2).
Here is the faulty (in my view) implementation:
> library("e1071", lib.loc="C:/Program Files/R/R-2.15.3/library")
Loading required package: class
> hamming.distance
function (x, y)
{
z <- NULL
if (is.vector(x) && is.vector(y)) {
z <- sum(as.logical(x) != as.logical(y))
}
else {
z <- matrix(0, nrow = nrow(x), ncol = nrow(x))
for (k in 1:(nrow(x) - 1)) {
for (l in (k + 1):nrow(x)) {
z[k, l] <- hamming.distance(x[k, ], x[l, ])
z[l, k] <- z[k, l]
}
}
dimnames(z) <- list(dimnames(x)[[1]], dimnames(x)[[1]])
}
z
}
<environment: namespace:e1071>
My recommendation: DO NOT USE. Hamming distance is trivial to implement as noted several times above.
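The coercion issue described in this answer can be reproduced in base R without loading the package, using the same two sequences as the example above:

```r
a <- c(0, 1, 2)
b <- c(0, 1, 1)

sum(a != b)                           # 1: the third elements differ
sum(as.logical(a) != as.logical(b))   # 0: both 1 and 2 become TRUE
```

For strictly 0/1 data the two expressions agree, so the coercion only matters when the vectors can hold other values.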
hamming.distance takes two vectors or a matrix, but not a data frame, so what you want is probably either
m = as.matrix(X)
hamming.distance(m[1,], m[2,])
or
hamming.distance(as.matrix(X))
but as was pointed out this is in your particular case the same as
sum(m[1,] != m[2,])
(In general, avoid data.frames if what you have is not a heterogeneous structure, since they are much, much slower than matrices.)
As an addition to all that was mentioned above: Although the Hamming distance is trivial to implement as an ordinary nested loop, in terms of execution time things can quickly get out of hand for larger matrices. In R, it is far more efficient to instead use matrix multiplication for computing the Hamming distance between all columns of large matrices. This is extremely fast compared to an R-level nested loop. An example implementation can be found here.
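A minimal sketch of the matrix-multiplication idea follows, assuming the data are strictly 0/1 and that the vectors to compare are the columns of the matrix. The function name hamming_cols is hypothetical, and this is not necessarily the implementation the answer refers to. For two binary columns, the number of mismatches is the count of positions where one is 1 and the other is 0, which is what the two cross products add up.

```r
# Hypothetical helper: pairwise Hamming distances between columns of a 0/1 matrix.
hamming_cols <- function(X) {
  X <- as.matrix(X)
  stopifnot(all(X %in% c(0, 1)))              # only valid for binary data
  crossprod(X, 1 - X) + crossprod(1 - X, X)   # t(X) %*% (1 - X) + t(1 - X) %*% X
}

X <- cbind(c(0, 1, 1, 0),
           c(1, 1, 0, 0),
           c(0, 1, 1, 0))
D <- hamming_cols(X)
D[1, 2]   # 2, same as sum(X[, 1] != X[, 2])
D[1, 3]   # 0, the columns are identical
```

If the vectors are stored as rows, as in the question, pass t(X) instead.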
sum(xor(x[1,],x[2,]))
I don't know the relative efficiency of 'xor' to '!='
Just adding to @HongOoi's answer: in R, != and == return NA when one of the values is missing, which can give misleading results:
> c(1, NA) == 1:2
[1] TRUE NA
However, %in% never returns NA: for example, 1 %in% NA is FALSE, and so is NA %in% FALSE. So if you want to count missing values as "different" when comparing vectors, use sum(!((x != y) %in% FALSE)):
> x <- c(1, 8, 5, NA, 5)
> y <- 1:5
> sum(!((x != y) %in% FALSE))
[1] 3
Note also that x and y may have different lengths, which leads to missing values in the shorter vector once both are indexed to a common length. You can do one of two things: truncate the longer vector, or treat values absent from the shorter vector as "different". This can be wrapped in a standalone function with familiar R parameters:
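hamming <- function(x, y, na.rm = TRUE) {
  # Index both vectors to the longer length; the shorter one is padded with NA
  size <- seq_len(max(length(x), length(y)))
  x <- x[size]
  y <- y[size]
  if (na.rm) {
    # Keep only positions where both values are present
    keep <- !is.na(x) & !is.na(y)
    x <- x[keep]
    y <- y[keep]
  }
  # Count positions that differ; any remaining NA counts as "different"
  sum(!((x != y) %in% FALSE))
}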
This function enables you to choose if you want to count missing values as "different" (na.rm = FALSE) or ignore them. With na.rm = TRUE if vectors differ in their length, the longer one gets truncated.