Merging two vectors at random in R

I have two vectors x and y, where x is longer than y. For example (x is set to all zeros here, but that need not be the case):
x = rep(0,20)
y = c(2,3,-1,-1)
What I want to accomplish is to overlay copies of y onto x at random positions. So in the above example, x might end up looking like
0,0,2,3,-1,-1,0,0,0,0,2,3,-1,-1,...
Basically, I'll step through each value in x, pull a random number, and if that random number is less than some threshold, I want to overlay y for the next 4 places in x unless I've reached the end of x. Would any of the apply functions help? Thanks much in advance.

A simple way of doing it would be to choose points at random (the same length as x) from the two vectors combined:
sample(c(x, y), length(x), replace = TRUE)
If you want to introduce some probability into it, you could do something like:
p <- c(rep(2, each = length(x)), rep(1, each = length(y)))
sample(c(x, y), length(x), prob = p, replace = TRUE)
This is saying that an x point is twice as likely to be chosen over a y point (change the 2 and 1 in p accordingly for different probabilities).

Short answer: yes :-) . Write some function like
ranx <- runif(length(x)-length(y)+1)
# some loop or apply func...
if (ranx[j] < threshold) x[j:(j+length(y)-1)] <- y
# and make sure to stop the loop at length(x)-length(y)+1

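Something like the following worked for me.
i = 1
while(i <= length(x) - length(y) + 1){ # stop while a full copy of y still fits
  p.rand = runif(1,0,1)
  if(p.rand < prob[i]){
    x[i:(i+length(y)-1)] = y # overlay y on the next length(y) places of x
    i = i+length(y) # continue right after the overlaid block
  } else {
    i = i + 1
  }
}
where prob is a vector of probabilities, one per position in x.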
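
For reference, here is a self-contained sketch of the loop above wrapped in a function. The function name overlay_at_random and the single scalar threshold (as in the question, rather than a per-position prob vector) are assumptions made for illustration only.

```r
# Hypothetical helper: overlay copies of y onto x at random,
# non-overlapping positions. `threshold` is the chance of starting
# an overlay at any given position.
overlay_at_random <- function(x, y, threshold = 0.2) {
  i <- 1
  last_start <- length(x) - length(y) + 1  # last index where all of y still fits
  while (i <= last_start) {
    if (runif(1) < threshold) {
      x[i:(i + length(y) - 1)] <- y  # overlay y starting at position i
      i <- i + length(y)             # skip past the block just written
    } else {
      i <- i + 1
    }
  }
  x
}

set.seed(1)
x <- rep(0, 20)
y <- c(2, 3, -1, -1)
res <- overlay_at_random(x, y, threshold = 0.3)
res
```

With threshold = 0 the input comes back unchanged, and with a threshold above 1 every slot is filled with back-to-back copies of y.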

Related

In R, discretizing floating point coordinates to nearest coordinate

I am trying to bucket coordinates into their nearest coordinate. In a sense, I am doing one iteration of kmeans clustering, with 1222 centroids. Below I have a function that does this, imperfectly, and too slowly as well. I am looking for help on improving this function:
discretizeCourt <- function(x_loc, y_loc) {
# create the dataframe of points that I want to round coordinates to
y <- seq(0, 50, by = 2)
x1 <- seq(1, 93, by = 2)
x2 <- seq(2, 94, by = 2)
x <- c(x1, x2)
coordinates <- data.frame(
x = rep(x, 13),
y = rep(y, each = length(x1)),
count = 0
)
# loop over each point in x_loc and y_loc
# increment the count column whenever a point is 'near' that column
for(i in 1:length(x_loc)) {
this_x = x_loc[i]
this_y = y_loc[i]
coordinates[coordinates$x > this_x-1 &
coordinates$x < this_x+1 &
coordinates$y > this_y-1 &
coordinates$y < this_y+1, ]$count =
coordinates[coordinates$x > this_x-1 &
coordinates$x < this_x+1 &
coordinates$y > this_y-1 &
coordinates$y < this_y+1, ]$count + 1
}
}
Here is some test data that I'm working with:
> dput(head(x_loc, n = 50))
c(13.57165, 13.61702, 13.66478, 13.70833, 13.75272, 13.7946,
13.83851, 13.86792, 13.8973, 13.93906, 13.98099, 14.02396, 14.06338,
14.10872, 14.15412, 14.2015, 14.26116, 14.30871, 14.35056, 14.39536,
14.43964, 14.48442, 14.5324, 14.57675, 14.62267, 14.66972, 14.71443,
14.75383, 14.79012, 14.82455, 14.85587, 14.87557, 14.90737, 14.9446,
14.97763, 15.01079, 15.04086, 15.06752, 15.09516, 15.12394, 15.15191,
15.18061, 15.20413, 15.22896, 15.25411, 15.28108, 15.3077, 15.33578,
15.36507, 15.39272)
> dput(head(y_loc, n = 50))
c(25.18298, 25.17431, 25.17784, 25.18865, 25.20188, 25.22865,
25.26254, 25.22778, 25.20162, 25.25191, 25.3044, 25.35787, 25.40347,
25.46049, 25.5199, 25.57132, 25.6773, 25.69842, 25.73877, 25.78383,
25.82168, 25.86067, 25.89984, 25.93067, 25.96943, 26.01083, 26.05861,
26.11965, 26.18428, 26.25347, 26.3352, 26.35756, 26.4682, 26.55412,
26.63745, 26.72157, 26.80021, 26.8691, 26.93522, 26.98879, 27.03783,
27.07818, 27.03786, 26.9909, 26.93697, 26.87916, 26.81606, 26.74908,
26.67815, 26.60898)
My actual x_loc and y_loc files are ~60000 coordinates, and I have thousands of files each with ~60000 coordinates, so it's a lot of work. I am pretty certain that the reason the function runs slow is the way I am indexing / incrementing.
The counting is imperfect. A technically better approach would be to loop over all 60000 points (only 50 points are shown above for the example) and, for each point, calculate the distance between that point and each point in the coordinates dataframe (1222 points). However, that's 60000 * 1222 calculations just for this one set of points, which is too many.
Would greatly appreciate any help on this!
Thanks,
EDIT: I'm working on converting my dataframes / vectors to 2 matrices, and vectorizing the whole approach, will let you know if it works.
If you want to process your data faster than your current solution, consider using the data.table library. Please see the example below:
library(data.table)
df <- data.table(x_loc, y_loc) # Your vectors are combined into a data.table
df$row.idx <- 1:nrow(df) # This column is used as ID for each sample point.
Now, we can find the right coordinate for each point. Later we can calculate how many points belong to a certain coordinate. We keep the coordinates data frame first:
y <- seq(0, 50, by = 2)
x1 <- seq(1, 93, by = 2)
x2 <- seq(2, 94, by = 2)
x <- c(x1, x2)
coordinates <- data.frame(
x = rep(x, 13),
y = rep(y, each = length(x1)),
count = 0
)
coordinates$row <- 1:nrow(coordinates) # Similar to yours. However, this time we are interested in seeing which points belong to this coordinate.
Now, we define a function which checks the coordinates and returns the one within one unit distance of the point in question.
f <- function(this_x, this_y, coordinates) {
res <- coordinates[coordinates$x > this_x-1 &
coordinates$x < this_x+1 &
coordinates$y > this_y-1 &
coordinates$y < this_y+1, ]$row
res
}
For each point, we find its right coordinate:
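# f() needs the coordinates table as its third argument;
# this assumes exactly one coordinate matches each point
df[, coordinate.idx := f(x_loc, y_loc, coordinates), by = row.idx]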
df[, row.idx := NULL]
df contains the following variables: (x_loc, y_loc, coordinate.idx). You can populate coordinates$count using this. Even for 60000 points, it should not take more than 1 second.
for(i in 1:nrow(coordinates)) {
  # assign to element i only, not to the whole column
  coordinates$count[i] = length(which(df$coordinate.idx == i))
}
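
As an aside, the counting loop can probably be replaced by a single call to base R's tabulate(), which counts how often each integer from 1 to nbins occurs. The sketch below uses toy stand-ins (n_cells and a short coordinate.idx vector are assumptions for illustration, not the real data):

```r
# Toy stand-ins (assumptions): 5 grid cells and the cell index of 8 points.
n_cells <- 5
coordinate.idx <- c(2, 2, 5, 1, 2, 5, 3, 2)

# tabulate() returns one count per cell, including zeros for empty cells.
counts <- tabulate(coordinate.idx, nbins = n_cells)
counts
# [1] 1 4 1 0 2
```

On the real data this would read coordinates$count <- tabulate(df$coordinate.idx, nbins = nrow(coordinates)).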

Simulating random walk on the n-cycle in R

I have been trying to implement a random walk on the n-cycle algorithm in R.
By n-cycle I mean the set of integers Zn, i.e., the integers modulo n. Basically, it’s example 5.3.1 from the book “Markov Chains and Mixing Times”, by Levin, Peres and Wilmer. The intention is as follows: consider two chains modeling the movement of two particles X and Y on Zn with starting points X1 and Y1. By the flip of a fair coin we decide which particle will move (the particles cannot move simultaneously unless they have coupled); the direction is decided by another flip of a fair coin.
Once the two particles collide, they move together thereafter. It is part of a study project to implement a CFTP algorithm, so the length of the chains should have a pre-defined value, say T.
The code does not run and an error message appears. The error is “object ‘res’ not found”. However, I had previously defined “res” as a list to store the output of the function. Why does this happen and how could it be fixed?
I have two scripts: in the first one the code is split in smaller helper functions; the second one may be messier, as I tried to put all the helper functions within one single function.
Any help will be much appreciated.
This one is script 2.
# X1 - initial state of chain X
# Y1 - initial state of chain Y
# T - "length" of a chain, number of steps the chains will run for.
# n - length of the n-cycle, i.e., Zn.
Main_Function <- function (X1 = 8, Y1 = 4 , T = 20, n = 6){
X <- rep( X1, T) %% n # X, Y and res will store the results
Y <- rep( Y1, T) %% n
res <- list(X,Y) # Here I defined the object res. Later on R encounters an error "object 'res' not found".
ps <- TakeOneStep() # TakeOneStep is a function defined below
return(ps)
}
TakeOneStep <- function(){
incr_same <- sample(c(-1, 0, 1), size = 1, prob = c(1/4, 1/2, 1/4)) #direction of the particles after they have coupled
incr_dif <- sample(c(-1,1), size = 1, prob = c(1/2, 1/2)) # direction of the particles before coupling occurred.
choice <- runif(T) # determines which chain moves, before coupling occurred.
for(t in 2:T){
if(res[[1]][t-1]%%n == res[[2]][t-1]%%n){
res[[1]][t] <- (res[[1]][t-1] + incr_same) %% n
res[[2]][t] <- (res[[2]][t-1] + incr_same) %% n
}else{ if(choice[t] < 0.5) {
res[[1]][t] <- (res[[1]][t-1] + incr_dif) %% n
}else{res[[2]][t] <- (res[[2]][t-1] + incr_dif)%%n}
}
}
return(res)
}
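
The error most likely arises because res is created inside Main_Function and is therefore local to it. R looks up free variables where a function was defined, not where it was called, so TakeOneStep cannot see res (nor X1, Y1 or n). Below is a minimal sketch of the coupling described above that keeps all state inside one function. The names coupled_walk and n_steps are hypothetical, and the sketch draws a fresh increment at every step, which is an assumption about the intended behaviour.

```r
# Hypothetical helper: two coupled random walks on the n-cycle Z_n.
coupled_walk <- function(X1 = 8, Y1 = 4, n_steps = 20, n = 6) {
  X <- rep(X1 %% n, n_steps)
  Y <- rep(Y1 %% n, n_steps)
  for (t in seq_len(n_steps)[-1]) {
    # By default both particles stay where they were.
    X[t] <- X[t - 1]
    Y[t] <- Y[t - 1]
    if (X[t - 1] == Y[t - 1]) {
      # Coupled: both particles take the same lazy step.
      step <- sample(c(-1, 0, 1), size = 1, prob = c(1/4, 1/2, 1/4))
      X[t] <- (X[t - 1] + step) %% n
      Y[t] <- X[t]
    } else {
      # Not yet coupled: one coin picks the particle, another the direction.
      step <- sample(c(-1, 1), size = 1)
      if (runif(1) < 0.5) {
        X[t] <- (X[t - 1] + step) %% n
      } else {
        Y[t] <- (Y[t - 1] + step) %% n
      }
    }
  }
  list(X = X, Y = Y)
}

set.seed(1)
w <- coupled_walk(X1 = 8, Y1 = 4, n_steps = 200, n = 6)
```

Passing the state in as arguments and returning it as a list avoids relying on variables from the caller's environment. Using n_steps instead of T also avoids masking the shorthand for TRUE.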

Summing elements of a vector in R

I'm trying to figure out how to add each respective component in a vector and store that in another vector. This is what I have so far:
# Create a function to roll a die n times.
RollDie = function(n) sample(1:6, n, rep=T)
die1 = RollDie(500)
die2 = RollDie(500)
die3 = RollDie(500)
die4 = RollDie(500)
die5 = RollDie(500)
die6 = RollDie(500)
# Sum the values of the first component of each vector which represent the values
# of the six die rolled.
X = sum(die1[1], die2[1], die3[1], die4[1], die5[1], die6[1])
X
What I'm trying to do is sum the first, second, etc components of die 1 through 6.
So, the first component of X will be
sum(die1[1], die2[1], die3[1], die4[1], die5[1], die6[1])
the second component of X will be
sum(die1[2], die2[2], die3[2], die4[2], die5[2], die6[2])
the third component of X will be
sum(die1[3], die2[3], die3[3], die4[3], die5[3], die6[3])
and so on. X will have a length of 500.
I'm trying to find the appropriate command, but not having any luck. Please help. Thanks!
A possible solution with a vectorized approach:
rowSums(replicate(6, RollDie(500)))
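
If you would rather keep the six separate vectors from the question, here is a small sketch of the same idea. It relies only on the fact that + and rowSums work element-wise; the seed is arbitrary and X_alt is just an illustrative name.

```r
RollDie <- function(n) sample(1:6, n, replace = TRUE)

set.seed(123)
die1 <- RollDie(500); die2 <- RollDie(500); die3 <- RollDie(500)
die4 <- RollDie(500); die5 <- RollDie(500); die6 <- RollDie(500)

# Arithmetic on vectors is element-wise, so this sums position by position.
X <- die1 + die2 + die3 + die4 + die5 + die6

# Equivalent: bind the vectors as columns and sum each row.
X_alt <- rowSums(cbind(die1, die2, die3, die4, die5, die6))

length(X)
```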

Multiply unique pairs of values in a vector and sum the result

I want to multiply and then sum the unique pairs of a vector, excluding pairs made of the same element, such that for c(1:4):
(1*2) + (1*3) + (1*4) + (2*3) + (2*4) + (3*4) == 35
The following code works for the example above:
x <- c(1:4)
bar <- NULL
for( i in 1:length(x)) { bar <- c( bar, i * c((i+1) : length(x)))}
sum(bar[ 1 : (length(bar) - 2)])
However, my actual data is a vector of rational numbers, not integers, so the (i+1) portion of the loop will not work. Is there a way to look at the next element of the set after i, e.g. j, so that I could write i * c(j : length(x))?
I understand that for loops are usually not the most efficient approach, but I could not think of how to accomplish this via apply etc. Examples of that would be welcome, too. Thanks for your help.
An alternative to a loop would be to use combn and multiply the combinations using the FUN argument. Then sum the result:
sum(combn(x = 1:4, m = 2, FUN = function(x) x[1] * x[2]))
# [1] 35
Even better to use prod in FUN, as suggested by @bgoldst:
sum(combn(x = 1:4, m = 2, FUN = prod))
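
For long vectors, a closed form avoids enumerating the pairs at all. Expanding (sum of x)^2 gives the sum of squares plus twice the sum over unique pairs, so the pair sum is (sum(x)^2 - sum(x^2)) / 2. A minimal sketch, where the function name sum_pairwise_products is made up for illustration:

```r
# Hypothetical helper: sum of x[i] * x[j] over all unique pairs i < j.
sum_pairwise_products <- function(x) (sum(x)^2 - sum(x^2)) / 2

sum_pairwise_products(1:4)
# [1] 35

# It also works for non-integer values; compare against combn().
x <- c(0.5, 1.25, 2, 3.75)
all.equal(sum_pairwise_products(x), sum(combn(x, 2, FUN = prod)))
```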

R filter() dealing with NAs

I am trying to implement a Chebyshev filter to smooth a time series but, unfortunately, there are NAs in the data series.
For example,
t <- seq(0, 1, len = 100)
x <- c(sin(2*pi*t*2.3) + 0.25*rnorm(length(t)),NA, cos(2*pi*t*2.3) + 0.25*rnorm(length(t)))
I am using a Chebyshev filter: cf1 = cheby1(5, 3, 1/44, type = "low")
I am trying to filter the time series while excluding the NAs, but without messing up the order/positions of the values. I have already tried na.rm=T, but it seems there's no such argument.
Then
z <- filter(cf1, x) # apply filter
Thank you guys.
Try using x <- x[!is.na(x)] to remove the NAs, then run the filter.
You can remove the NAs beforehand using the complete.cases function. You also might consider imputing the missing data. Check out the mtsdi or Amelia II packages.
EDIT:
Here's a solution with Rcpp. This might be helpful if speed is important:
require(inline)
require(Rcpp)
t <- seq(0, 1, len = 100)
set.seed(7337)
x <- c(sin(2*pi*t*2.3) + 0.25*rnorm(length(t)),NA, cos(2*pi*t*2.3) + 0.25*rnorm(length(t)))
NAs <- x
x2 <- x[!is.na(x)]
#do something to x2
src <- '
Rcpp::NumericVector vecX(vx);
Rcpp::NumericVector vecNA(vNA);
int j = 0; //counter for vx
for (int i=0;i<vecNA.size();i++) {
if (!(R_IsNA(vecNA[i]))) {
//replace and update j
vecNA[i] = vecX[j];
j++;
}
}
return Rcpp::wrap(vecNA);
'
fun <- cxxfunction(signature(vx="numeric",
vNA="numeric"),
src,plugin="Rcpp")
if (identical(x,fun(x2,NAs)))
print("worked")
# [1] "worked"
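
The same re-insertion can also be sketched in base R with logical indexing, without compiling any C++. To keep the example free of extra packages, a simple recursive smoother from stats::filter stands in for the Chebyshev filter. That stand-in, and the names ok and smoothed, are assumptions for illustration only.

```r
set.seed(7337)
t <- seq(0, 1, len = 100)
x <- c(sin(2*pi*t*2.3) + 0.25*rnorm(length(t)), NA,
       cos(2*pi*t*2.3) + 0.25*rnorm(length(t)))

ok <- !is.na(x)  # TRUE where data are present

# Stand-in for the real filter (assumption): y[i] = x[i] + 0.5 * y[i-1]
smoothed <- as.numeric(stats::filter(x[ok], 0.5, method = "recursive"))

# Put the filtered values back at their original positions; NAs stay NA.
z <- rep(NA_real_, length(x))
z[ok] <- smoothed

identical(is.na(z), is.na(x))
```

Note that, as with the other approaches here, the filter runs straight across the gap as if the values on either side of the NA were adjacent.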
I don't know if ts objects can have missing values, but if you just want to re-insert the NA values, you can use ?insert from R.utils. There might be a better way to do this.
install.packages(c('R.utils', 'signal'))
require(R.utils)
require(signal)
t <- seq(0, 1, len = 100)
set.seed(7337)
x <- c(sin(2*pi*t*2.3) + 0.25*rnorm(length(t)), NA, NA, cos(2*pi*t*2.3) + 0.25*rnorm(length(t)), NA)
cf1 = cheby1(5, 3, 1/44, type = "low")
xex <- na.omit(x)
z <- filter(cf1, xex) # apply
z <- as.numeric(z)
for (m in attributes(xex)$na.action) {
z <- insert(z, ats = m, values = NA)
}
all.equal(is.na(z), is.na(x))
?insert
Here is a function you can use to filter a signal with NAs in it.
The NAs are ignored rather than replaced by zero.
You can then specify a maximum percentage of weight which the NAs may take at any point of the filtered signal. If there are too many NAs (and too few actual data) at a specific point, the filtered signal itself will be set to NA.
# This function applies a filter to a time series with potentially missing data
filter_with_NA <- function(x,
window_length=12, # will be applied centrally
myfilter=rep(1/window_length,window_length), # a boxcar filter by default
max_percentage_NA=25) # which percentage of weight created by NA should not be exceeded
{
# make the signal longer at both sides
signal <- c(rep(NA,window_length),x,rep(NA,window_length))
# see where data are present and not NA
present <- is.finite(signal)
# replace the NA values by zero
signal[!is.finite(signal)] <- 0
# apply the filter
filtered_signal <- as.numeric(stats::filter(signal,myfilter, sides=2)) # stats::filter explicitly, in case another package (e.g. signal) masks filter()
# find out which percentage of the filtered signal was created by non-NA values
# this is easy because the filter is linear
original_weight <- as.numeric(stats::filter(present,myfilter, sides=2))
# where this is lower than one, the signal is now artificially smaller
# because we added zeros - compensate that
filtered_signal <- filtered_signal / original_weight
# but where there are too few values present, discard the signal
filtered_signal[100*(1-original_weight) > max_percentage_NA] <- NA
# cut away the padding to left and right which we previously inserted
filtered_signal <- filtered_signal[((window_length+1):(window_length+length(x)))]
return(filtered_signal)
}
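
To see why dividing by the filtered presence mask works, here is a tiny sketch of just that normalization step on a toy vector, using a 3-point boxcar. The variable names are illustrative and not part of the function above.

```r
x <- c(1, 2, NA, 4, 5)
w <- rep(1/3, 3)                      # 3-point boxcar filter

present <- !is.na(x)
x0 <- ifelse(present, x, 0)           # NAs replaced by zero

num <- as.numeric(stats::filter(x0, w, sides = 2))                   # weighted sum of present values
den <- as.numeric(stats::filter(as.numeric(present), w, sides = 2))  # weight carried by present values

smoothed <- num / den
smoothed
```

At positions 2 to 4 this gives (up to floating point) 1.5, 3 and 4.5, i.e. the mean of the values actually present in each window. The two ends are NA because the centered window runs off the data there.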
