Count "changes in direction" in a vector in R - r

I need to count how many times a variable inverts its growth pattern - from increasing values to decreasing values (as well as from decreasing values to increasing values). In the following example, I should be able to find 4 such inversions. How can I create a new dummy variable that shows such inversions?
x <- c(1:20,19:5,6:15,12:9,10:11)
plot(x)

You're effectively asking "when is the second derivative of x not equal to zero?", so you could just do a double diff:
x <- c(1:20,19:5,6:15,12:9,10:11)
plot(seq_along(x), x)
changes <- c(0, diff(diff(x)), 0) != 0
To show it picks the right points, colour them red.
points(seq_along(x)[changes], x[changes], col = "red")
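To count the flagged points you can simply sum the dummy variable; note that a non-zero second difference marks every point where the slope changes, which can include kinks that are not full reversals of direction:
sum(changes) # number of flagged change points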

This function will return the indices at which the direction changed:
get_change_indices <- function(x){
  # return NULL if x contains at most one unique element (or is empty)
  if(length(unique(x)) <= 1) return(NULL)
  # make x named, so we can recapture its indices later
  x <- setNames(x, paste0("a", seq_along(x)))
  # calculate the difference between successive elements
  diff_x <- diff(x)
  # remove differences that are equal to zero (flat stretches)
  diff_x <- diff_x[!diff_x == 0]
  # identify indices of changepoints
  diff_x <- c(diff_x[1], diff_x)
  change_ind <- NULL
  for(i in 2:length(diff_x)){
    if(sign(diff_x[i]) != sign(diff_x[i - 1])){
      change_ind_curr <- as.numeric(gsub("a", "", names(diff_x[i]))) - 1
      change_ind <- c(change_ind, change_ind_curr)
    }
  }
  change_ind
}
The length of its output is the number of changes.
Note that it also works when the change in x is non-linear, e.g. if x <- c(1, 4, 9, 1).
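A quick check on the example from the question, assuming the function is defined as above:
x <- c(1:20,19:5,6:15,12:9,10:11)
get_change_indices(x)
# [1] 20 35 45 49
length(get_change_indices(x))
# [1] 4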

Related

Sampling from a subset of data

I have the following problem.
I have multiple subarrays (say 2) that I have populated with character labels (1, 2, 3, 4, 5). My algorithm selects labels at random based on occurrence probabilities.
How can I get R to instead select labels 1:3 for subarray 1 and 4:5 for subarray 2, say, without using subsetting (i.e., [])? That is, I want a random subset of labels to be selected for each subarray, instead of all labels being assigned to each subarray manually using [].
I know sample() should help.
Using subsetting (which I don't want) one would do
x <- 1:5
sample(x[1:3], size, prob = probs[1:3])
but this assigns labels 1:3 to ALL subarrays.
Would
sample(sample(x), size, replace = TRUE, prob = probs)
work?
Any ideas? Please let me know if this is unclear.
Here is a small example, which selects labels from 1:5 for each of 10 subarrays.
set.seed(1)
N <- 10
K <- 2
Hstar <- 5
probs <- rep(1/Hstar, Hstar)
perms <- 5
## Set up container(s) to hold the identity of each individual from each permutation ##
num.specs <- ceiling(N / K)
## Create an ID for each haplotype ##
haps <- 1:Hstar
## Assign individuals (N) to each subpopulation (K) ##
specs <- 1:num.specs
## Generate permutations, assume each permutation has N individuals, and sample those individuals' haplotypes from the probabilities ##
gen.perms <- function() {
  sample(haps, size = num.specs, replace = TRUE, prob = probs) # I would like each subarray to contain a random subset of 1:5.
}
pop <- array(dim = c(perms, num.specs, K))
for (i in 1:K) {
  pop[,, i] <- replicate(perms, gen.perms())
}
pop
Hopefully this helps.
I think what you actually want is something like this:
num.specs <- 3
haps[sample(seq(haps), size = num.specs, replace = FALSE)]
[1] 3 5 4
That is a random subset of your vector haps.
Not quite what you want (it returns a list of matrices instead of a 3D array), but this might help:
lapply(split(1:5, cut(1:5, breaks = c(0, 2, 5))), function(i) matrix(sample(i, 25, replace = TRUE), ncol = 5))
Use cut and split to partition your vector of character labels before sampling them. Here I split your character labels at the value 2. Also, rather than sampling 5 numbers 5 times, you can sample 25 numbers once, and convert to matrix.
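If you do need the three-dimensional array layout from the question rather than a list, one possible follow-up (a sketch; res is just a name introduced here for the list returned by the lapply call above) is to stack the matrices with simplify2array:
res <- lapply(split(1:5, cut(1:5, breaks = c(0, 2, 5))),
              function(i) matrix(sample(i, 25, replace = TRUE), ncol = 5))
pop <- simplify2array(res) # 5 x 5 x 2 array, one slice per subarray
dim(pop)
# [1] 5 5 2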

Extend a vector by randomly increasing values in R

In this example I'm trying to generate a random time series for 3 individuals at 4 time points (below, x contains the values at the first time point for each individual). I want the values to randomly increase rather than decrease over time. Below is my current solution.
set.seed(0)
x <- rnorm(3)
x
[1] 1.2629543 -0.3262334 1.3297993
y <- c(x,
       x * runif(1, .8, 1.2),
       x * runif(1, .9, 1.3),
       x * runif(1, 1, 1.4))
y
[1] 1.2629543 -0.3262334 1.3297993 1.4642135 -0.3782206 1.5417106 1.6138915 -0.4168839 1.6993107 1.5967772
[11] -0.4124631 1.6812906
This has some problems.
For each individual the same coefficient is used for calculating the values at a given time point, resulting in identical trends. How could I get a random coefficient for each multiplication? I could use lapply, but then the vector would be "grouped" by individuals rather than by time points.
I don't want to write the formula for each time point separately and be so precise. The exact coefficients are not important; I just need the values to tend to increase slightly, while occasional decreases should also be allowed. How could I extend the vector more "effectively"?
How can I make negative values increase as well?
I managed to solve this thanks to Federico Manigrasso. The solution is below.
TimeSer <- function(num.id, years, init.val) {
  df <- data.frame(id = factor(rep(1:num.id, length(years))),
                   year = rep(years, each = num.id))
  yrs <- length(years) - 1
  minim <- seq(-.1, by = -.1, len = yrs)
  maxim <- seq(.4, by = .4, len = yrs)
  val <- list(init.val)
  for (i in 1:yrs) {
    val[[i + 1]] <- unlist(lapply(init.val, function(x) {
      x + (x * runif(1, minim[i], maxim[i]))
    }))
  }
  df$val <- unlist(val)
  df
}
df <- TimeSer(num.id = 3, years = 2006:2016, init.val = rnorm(3,1e5, 1e5))
Visual representation of the results:
num.id <- length(unique(df$id))
par(mfrow = c(1, num.id))
lapply(1:num.id, function(x) {
  plot(unique(df$year), df$val[df$id == x], type = 'l', col = x)
})
I suggest putting the output in a list; it's a lot less messy, and you can transform it into a vector later (using unlist).
This is how I would rewrite your code:
x <- rnorm(3)
time <- 3
output <- list(x)  # init output list with initial data
par1 <- c(0.8, 0.9, 1)
par2 <- c(1.2, 1.3, 1.4)
for (i in 1:time) {
  a <- unlist(lapply(x, function(x) { x + runif(1, par1[i], par2[i]) }))
  output[[i + 1]] <- a
  x <- a
}
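If you then want the result as a single vector or a small matrix, a quick follow-up sketch (with 3 individuals, the initial values plus 3 steps give 12 values):
y <- unlist(output)  # first 3 values are the initial x, the next 3 are time point 2, and so on
matrix(y, nrow = 3)  # rows = individuals, columns = time points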
Let me know if this solves all your problems.

Use an 'apply' function to perform code with conditional statements in R

I have been working on a project for which I need to find peaks and valleys in a dataset (not just the highest numbers per column, but all of the peaks and valleys).
I did manage to get it to work on one column, but I used a for-loop for that, and I need to do this for about 50 columns, so I think I should use an 'apply' function. I just don't know how to do so. Can I put 'if' statements and such in an 'apply' function?
Here is what I used for checking one column:
('First' is the name of the dataset and 'Seq1' is the first column)
Lowest = 0
Highest = 0
Summits = vector('numeric')
Valleys = vector('numeric')
for (i in 1:length(First$Seq1)) {
  if (!is.na(First$Seq1[i+1])) {
    if (First$Seq1[i] < Lowest) { Lowest = First$Seq1[i] }
    if (First$Seq1[i] > Highest) { Highest = First$Seq1[i] }
    if (First$Seq1[i] > 0 && First$Seq1[i+1] < 0) {
      Summits <- append(Summits, Highest, after = length(Summits))
    }
    if (First$Seq1[i] < 0 && First$Seq1[i+1] > 0) {
      Valleys <- append(Valleys, Lowest, after = length(Summits))
    }
  }
}
Sure you can! I would first define a helper function that specifies what is to be done with one specific column, and then call that function within apply:
HelperFun <- function(x) {
  # your code from above, replacing 'Seq1' by x
}
apply(First, 2, HelperFun)
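For illustration, here is one possible shape such a helper could take (a sketch only, not the original poster's loop; find_turns is a hypothetical name, and it assumes the columns contain no NAs and no consecutive repeated values):
# hypothetical helper: indices of local peaks and valleys in one column
find_turns <- function(x) {
  d <- sign(diff(x))       # +1 where the series rises, -1 where it falls
  which(diff(d) != 0) + 1  # positions where the direction flips
}
# lapply visits each column of a data frame in turn
turning_points <- lapply(First, find_turns)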
An *apply function is not better for this than a for loop, provided you don't grow an object in the for loop. You must never use append in a loop. Pre-allocate your results object and fill it.
This finds all local minima on a grid:
#an example
set.seed(42)
plane <- matrix(rnorm(100, sd = 5), 10)
#plot
library(raster)
plot(raster(plane))
#initialize a logical matrix
res <- matrix(TRUE, ncol = ncol(plane), nrow = nrow(plane))
#check for each subgrid of 2 times 2 cells which of the cells is the minimum
for (i in 1:(nrow(plane) - 1)) {
  for (j in 1:(ncol(plane) - 1)) {
    inds <- as.matrix(expand.grid(r = i + 0:1, c = j + 0:1))
    #cell must be a minimum of all 4 subgrids it is part of
    res[inds] <- res[inds] & plane[inds] == min(plane[inds])
  }
}
print(res)
plane[res]
#[1] -13.282277 -8.906542 -8.585043 -12.071038 -3.919195 -14.965450 -5.215595 -5.498904 -5.971644 -2.380870 -7.296070
#highlight local minima
plot(rasterToPolygons(raster(res)), border = t(res), add = TRUE)
library(reshape2)
res1 <- melt(res)
res1 <- res1[res1$value,]
text(x = res1$Var2 / 10 - 0.05,
     y = 1 - res1$Var1 / 10 + 0.05,
     labels = round(plane[res], 1))
I've assumed here that diagonal neighbors are counted as neighbors and not only neighbors in the same column or row. But this would be trivial to change.
I know that this is not the solution you want, since you have a one-dimensional time series, but here is a (more direct) variation on Roland's solution.
#example data
set.seed(42)
plane <- matrix(rnorm(100, sd = 5), 10)
library(raster)
r <- raster(plane)
f <- focal(r, matrix(1,3,3), min, pad=TRUE, na.rm=TRUE)
x <- r == f
mins <- mask(r, x, maskvalue=FALSE)
pts <- rasterToPoints(mins)
cells <- cellFromXY(x, pts)
r[cells]
plot(r)
text(mins, digits=1)
plot(rasterToPolygons(mins), add=TRUE)

R filter() dealing with NAs

I am trying to implement Chebyshev filter to smooth a time series but, unfortunately, there are NAs in the data series.
For example,
t <- seq(0, 1, len = 100)
x <- c(sin(2*pi*t*2.3) + 0.25*rnorm(length(t)),NA, cos(2*pi*t*2.3) + 0.25*rnorm(length(t)))
I am using a Chebyshev filter: cf1 = cheby1(5, 3, 1/44, type = "low")
I am trying to filter the time series while excluding the NAs, without messing up the order/positions. I have already tried na.rm=T, but it seems there is no such argument.
Then
z <- filter(cf1, x) # apply filter
Thank you guys.
Try using x <- x[!is.na(x)] to remove the NAs, then run the filter.
You can remove the NAs beforehand using the complete.cases function. You also might consider imputing the missing data. Check out the mtsdi or Amelia II packages.
EDIT:
Here's a solution with Rcpp. This might be helpful if speed is important:
require(inline)
require(Rcpp)
t <- seq(0, 1, len = 100)
set.seed(7337)
x <- c(sin(2*pi*t*2.3) + 0.25*rnorm(length(t)),NA, cos(2*pi*t*2.3) + 0.25*rnorm(length(t)))
NAs <- x
x2 <- x[!is.na(x)]
#do something to x2
src <- '
Rcpp::NumericVector vecX(vx);
Rcpp::NumericVector vecNA(vNA);
int j = 0; //counter for vx
for (int i = 0; i < vecNA.size(); i++) {
  if (!(R_IsNA(vecNA[i]))) {
    //replace and update j
    vecNA[i] = vecX[j];
    j++;
  }
}
return Rcpp::wrap(vecNA);
'
fun <- cxxfunction(signature(vx = "numeric",
                             vNA = "numeric"),
                   src, plugin = "Rcpp")
if (identical(x,fun(x2,NAs)))
print("worked")
# [1] "worked"
I don't know if ts objects can have missing values, but if you just want to re-insert the NA values, you can use ?insert from R.utils. There might be a better way to do this.
install.packages(c('R.utils', 'signal'))
require(R.utils)
require(signal)
t <- seq(0, 1, len = 100)
set.seed(7337)
x <- c(sin(2*pi*t*2.3) + 0.25*rnorm(length(t)), NA, NA, cos(2*pi*t*2.3) + 0.25*rnorm(length(t)), NA)
cf1 = cheby1(5, 3, 1/44, type = "low")
xex <- na.omit(x)
z <- filter(cf1, xex) # apply
z <- as.numeric(z)
for (m in attributes(xex)$na.action) {
  z <- insert(z, ats = m, values = NA)
}
all.equal(is.na(z), is.na(x))
?insert
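For completeness, the same re-insertion can also be done directly in base R (a sketch, assuming the signal package is attached as above so that filter() dispatches to signal::filter): filter only the non-NA values and write them back into an NA-filled vector of the original length.
z2 <- rep(NA_real_, length(x))
z2[!is.na(x)] <- as.numeric(filter(cf1, x[!is.na(x)]))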
Here is a function you can use to filter a signal with NAs in it.
The NAs are ignored rather than replaced by zero.
You can then specify a maximum percentage of weight which the NAs may take at any point of the filtered signal. If there are too many NAs (and too few actual data) at a specific point, the filtered signal itself will be set to NA.
# This function applies a filter to a time series with potentially missing data
filter_with_NA <- function(x,
                           window_length = 12,  # will be applied centrally
                           myfilter = rep(1/window_length, window_length),  # a boxcar filter by default
                           max_percentage_NA = 25)  # which percentage of weight created by NA should not be exceeded
{
  # make the signal longer at both sides
  signal <- c(rep(NA, window_length), x, rep(NA, window_length))
  # see where data are present and not NA
  present <- is.finite(signal)
  # replace the NA values by zero
  signal[!is.finite(signal)] <- 0
  # apply the filter
  filtered_signal <- as.numeric(filter(signal, myfilter, sides = 2))
  # find out which percentage of the filtered signal was created by non-NA values
  # this is easy because the filter is linear
  original_weight <- as.numeric(filter(present, myfilter, sides = 2))
  # where this is lower than one, the signal is now artificially smaller
  # because we added zeros - compensate for that
  filtered_signal <- filtered_signal / original_weight
  # but where there are too few values present, discard the signal
  filtered_signal[100 * (1 - original_weight) > max_percentage_NA] <- NA
  # cut away the padding to left and right which we previously inserted
  filtered_signal <- filtered_signal[(window_length + 1):(window_length + length(x))]
  return(filtered_signal)
}
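A quick usage sketch on a series like the one in the question (using the defaults above; note the function relies on stats::filter, so it assumes the signal package is not masking filter()):
t <- seq(0, 1, len = 100)
set.seed(7337)
x <- c(sin(2*pi*t*2.3) + 0.25*rnorm(length(t)), NA, cos(2*pi*t*2.3) + 0.25*rnorm(length(t)))
smoothed <- filter_with_NA(x)
plot(x, type = "l")
lines(smoothed, col = "red", lwd = 2)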

min() error in function with while loop in R - a debugging challenge for the R enthusiast

I have a function (bobB) that seems to be getting stuck in a while loop. When I hit escape and look at warnings() I get the following error:
Warning message:
In min(xf, yf) : no non-missing arguments to min; returning Inf
Some example code:
#I have the data:
x<-"A03"
y<-"A24"
sitex<-c("Sp1","Sp1","Sp3","Sp3")
sitey<-c("Sp2","Sp4","Sp2","Sp4")
gsim<-c(0.2,0.3,0.4,0.1)
gsim<-data.frame(sitex,sitey,gsim)
site<-c("A03","A03","A03","A03","A24","A24","A24","A24")
species<-c("Sp1","Sp1","Sp3","Sp4","Sp1","Sp1","Sp3","Sp4")
freq<-c(0.2,0.3,0.4,0.1,0.3,0.3,0,0.4)
ssf<-data.frame(site,species,freq,stringsAsFactors=FALSE)
#My function:
bobB <- function(x, y, ssf, gsim) {
  #*Step 1.* Create an empty matrix 'specfreq' to fill
  #Select the species frequency data greater than 0 for the two sites being compared
  ssfx <- ssf[ssf$site == x & ssf$freq > 0, ]
  ssfy <- ssf[ssf$site == y & ssf$freq > 0, ]
  #pull out the species that are present at site x and/or site y using a merge,
  #this is needed to create the initial empty matrix
  m <- merge(ssfx, ssfy, all = TRUE)
  species <- unique(m$species)
  #Create an empty matrix of the frequency of each species at site x and y
  specfreq <- matrix(0, length(species), 2, dimnames = list(species, c(x, y)))
  #*Step 2.* Fill the empty matrix 'specfreq' with data from data.frame 'ssf'
  for(i in 1:nrow(ssf)) {
    specfreq[rownames(specfreq) == ssf[i, "species"], colnames(specfreq) == ssf[i, "site"]] <- ssf[i, "freq"]
  }
  #*Step 3.* For species present at site x and y, remove the minimum of the two from both
  #find the minimum frequency for each species for site x and y
  a <- pmin(specfreq[, x], specfreq[, y])
  #subtract 'a' from the current 'specfreq'
  specfreq <- specfreq - a
  #*Step 4.* Calculate variable B
  #Set answer to 0
  answer <- 0
  #while 'specfreq' contains data (i.e. is > 0) keep doing the following
  while(sum(specfreq) > 1e-10) {
    #Find species remaining at site x
    sx <- species[specfreq[, 1] > 0]
    #Find species remaining at site y
    sy <- species[specfreq[, 2] > 0]
    #Pull out the gsim values for sx and sy
    gsimre <- gsim[gsim$sitex %in% sx & gsim$sitey %in% sy, ]
    #determine the row containing the maximum remaining gsim value and assign it to i
    i <- which.max(gsimre$gsim)
    #once the max gsim value has been found (i), go back to the specfreq matrix and pull
    #out the frequency of the site x species associated with i
    xf <- specfreq[(gsimre$sitex[i]), x]
    #and the frequency of the site y species associated with i
    yf <- specfreq[(gsimre$sitey[i]), y]
    #The frequency of the site x species associated with the maximum gsim is multiplied by that gsim value
    answer <- answer + xf * gsimre$gsim[i]
    #Then take the minimum common frequency for the two species associated with i
    f <- min(xf, yf)
    #Subtract the minimum common frequency from both the site x and the site y column
    specfreq[gsimre$sitex[i], x] <- specfreq[gsimre$sitex[i], x] - f
    specfreq[gsimre$sitey[i], y] <- specfreq[gsimre$sitey[i], y] - f
  }
  answer
}
bobB(x, y, ssf, gsim)
Ok so here is your problem (I think):
xf <- specfreq[(gsimre$sitex[i]),x]
yf <- specfreq[(gsimre$sitey[i]),y]
(gsimre$sitex[i]) and (gsimre$sitey[i]) are factors, which means that when you use them to index something they are interpreted as numeric values and not as characters.
And indeed:
gsimre$sitey
[1] Sp4
Levels: Sp2 Sp4
Sp4, being the second factor level, has a numeric value of 2, hence specfreq[(gsimre$sitey[i]), y] is specfreq[2, 2], which is 0.
Hence, this should do the trick:
xf <- specfreq[as.character(gsimre$sitex[i]),x]
yf <- specfreq[as.character(gsimre$sitey[i]),y]
Or you declare them as being strings and not factors at the beginning, as you did for ssf but not for gsim:
gsim<-data.frame(sitex,sitey,gsim, stringsAsFactors=FALSE)
Because of this mistake, yf was always equal to 0 and therefore f = min(xf, yf) was always equal to 0, hence your endless loop.
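The underlying pitfall is easy to reproduce in isolation (a hypothetical mini-example, not taken from the question, in which the level order differs from the row order):
fac <- factor("Sp4", levels = c("Sp4", "Sp2"))  # "Sp4" is level 1
m <- matrix(1:4, 2, dimnames = list(c("Sp2", "Sp4"), c("A", "B")))
m[fac, "A"]                # indexes by the integer code 1, i.e. row "Sp2"
m[as.character(fac), "A"]  # indexes by name, i.e. row "Sp4" as intended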
P.S.: answer should be outside the while loop if you want the function to return it, i.e.
}
answer
}
instead of
answer
}
}
