How to select a specific amount of rows before and after predefined values - r

I am trying to select relevant rows from a large time-series data set. The tricky bit is that the needed rows lie before and after certain values in a column.
# example data
x <- rnorm(100)
y <- rep(0,100)
y[c(13,44,80)] <- 1
y[c(20,34,92)] <- 2
df <- data.frame(x,y)
In this case the critical values are 1 and 2 in the df$y column. If, e.g., I want to select 2 rows before and 4 after df$y==1 I can do:
ones<-which(df$y==1)
selection <- NULL
for (i in ones) {
jj <- (i-2):(i+4)
selection <- c(selection,jj)
}
df$selection <- 0
df$selection[selection] <- 1
This, arguably, scales poorly for more values. For df$y==2 I would have to repeat with:
twos<-which(df$y==2)
selection <- NULL
for (i in twos) {
jj <- (i-2):(i+4)
selection <- c(selection,jj)
}
df$selection[selection] <- 2
The ideal scenario would be a function similar to this imaginary one: selector(data=df$y, values=c(1,2), before=2, after=5, afterafter = FALSE, beforebefore=FALSE), where values takes the critical values, before the number of rows to select before each match, and after the number of rows to select after it.
In addition, afterafter would allow selecting from a certain row until a certain later row after the value, e.g. after=5, afterafter=10 (and likewise in the other direction with beforebefore).
Any tips and suggestions are very welcome!
Thanks!

This is easy enough with rep and its each argument.
df$y[rep(which(df$y == 2), each=7L) + -2:4] <- 2
Here, rep repeats each row index that matches your criterion 7 times (two rows before, the matching row itself, and four after; the L suffix makes 7 an integer literal). Adding -2:4, which is recycled across the repeated indices, yields the rows to select. Now, replace. Note that this line writes into df$y itself; assign to df$selection instead if the original marker column should be kept.
Note that for some comparisons, == will not be adequate due to numerical precision. See the SO post "Why are these numbers not equal?" for a detailed discussion of this topic. In these cases, you could use something like
which(abs(df$y - 2) < 0.001)
or whatever precision measure will work for your problem.
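As a rough sketch of how the rep approach could be wrapped into the kind of helper the question asks for: the function name select_window and its arguments are hypothetical (they are not part of the answer above), and the afterafter/beforebefore options are not covered. It writes into a separate selection vector and drops window positions that fall outside the data, which the one-liner above does not guard against.

```r
# Hypothetical helper: mark a window of rows around each occurrence
# of the given values. 0 means "row not selected".
select_window <- function(y, values, before = 2L, after = 4L) {
  selection <- numeric(length(y))
  offsets <- seq(-before, after)
  for (v in values) {
    hits <- which(y == v)
    # repeat every hit once per offset, then add the (recycled) offsets
    idx <- rep(hits, each = length(offsets)) + offsets
    # drop positions before the first or after the last row
    idx <- idx[idx >= 1 & idx <= length(y)]
    selection[idx] <- v
  }
  selection
}

# Example data from the question (x is not needed for the selection)
y <- rep(0, 100)
y[c(13, 44, 80)] <- 1
y[c(20, 34, 92)] <- 2
sel <- select_window(y, values = c(1, 2), before = 2, after = 4)
```

If windows for different values overlap, the value processed last wins in this sketch; whether that is acceptable depends on the data.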

Related

R: How to sample values under conditions from previous samples

I have tried to wrap my head around this for a few hours now and my head just blanked eventually...
My end goal is a data frame with 26 rows and 4 columns, containing the values 1-6, distributed at semi-random, meeting certain conditions.
Conditions are:
Each number can only appear once within a row.
Neighbouring rows of the same column can never contain the same value.
For this, column 1 and 2 must be seen as the same column containing 2 values, of which neither can repeat in a neighbouring row. So if I have "1 & 2" in one row of column 1 & 2, I can only have a combination between "3-6" in the next and previous row.
Values should be about equally distributed within any subsection of the table.
My original thought was that I would sample a first row, then create a for loop to add the other rows one by one, changing the probability for a number to be drawn based on the previous samples. Before adding this sample row I could check that the second condition was met and, if not, resample.
I realised eventually that this is so nested that I just cannot wrap my head around it... I assume that I need a while loop, which I have no experience with. As in: while the first 2 conditions are not met, resample at a probability based on previous appearances. The problem, however, is that each column gets its own probability for each value based on the previous appearances, so I cannot just sample a row.
However, if I sample each column individually based on the previous appearances, I will likely get the same values in a row...
So, these are the conditions I would want to be FALSE (I tried it with a function that would just repeat if the statement is TRUE):
temp[1] == table[i-1,1] || temp[1] == table[i-1,2] || temp[2] == table[i-1,1] || temp[2] == table[i-1,2]
temp[3]== table[i-1,3]
temp[4]== table[i-1,4]
Here is how I could calculate the probabilities for sampling (I realise there is a problem if values have not appeared yet, as this would mean division by 0):
probAB <- rep(1,6) /table(table[,c(1,2)])
probC <- rep(1,6) /table(table[,3])
probD <- rep(1,6) /table(table[,4])
If you want to know what it is supposed to be: it is a non-repeating chores rota between 6 people, where 2 people take up 1 chore while the other 2 chores are each done by one person. I am open to alternative suggestions to achieve this^^
We can use rejection sampling. Just generate a sample for each row and if it meets the conditions accept it and go to the next row; otherwise, repeat.
nr <- 26
nc <- 4
k <- 6
set.seed(123)
is_ok <- function(x, y) all(x != y) && x[1] != y[2] && x[2] != y[1]
tab <- matrix(NA, nr, nc)
tab[1, ] <- sample(k, nc)
for(i in 2:nr) repeat if (is_ok(tab[i, ] <- sample(k, nc), tab[i-1, ])) break
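The same rejection loop can be spelled out more explicitly and followed by a check of the first two conditions. This is only a sketch: the names candidate, rows_distinct and neighbours_ok are illustrative, and the third condition (values roughly evenly distributed) is not enforced here, only left to chance.

```r
set.seed(123)
nr <- 26; nc <- 4; k <- 6

# TRUE if row x may follow row y: no column repeats its value, and the
# pair in columns 1-2 shares nothing with the previous pair
is_ok <- function(x, y) all(x != y) && x[1] != y[2] && x[2] != y[1]

tab <- matrix(NA_integer_, nr, nc)
tab[1, ] <- sample(k, nc)
for (i in 2:nr) {
  repeat {
    candidate <- sample(k, nc)   # sampling without replacement: distinct within the row
    if (is_ok(candidate, tab[i - 1, ])) break
  }
  tab[i, ] <- candidate
}

# Verify conditions 1 and 2 on the finished table
rows_distinct <- all(apply(tab, 1, function(r) length(unique(r)) == nc))
neighbours_ok <- all(sapply(2:nr, function(i) is_ok(tab[i, ], tab[i - 1, ])))
```

Calling table(tab) afterwards gives a quick impression of how evenly the six values ended up being used.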

find all unique combinations of n numbers between 1 and k

I want a list of all possible sets of five (or n) numbers between 1 and 63 (or more generalizably 1 and k)
If computing time wasn't an issue, I could do something like
#Get all combinations of numbers between 1 and 63
indexCombinations <- expand.grid(1:63, 1:63, 1:63, 1:63, 1:63)
#Throw out the rows that have more than one of the same number in them
allDifferent <- apply(indexCombinations, 1, function(x){
length(x) == length(unique(x))
} # function
) # apply
indexCombinationsValid <- indexCombinations[allDifferent,]
# And then just take the unique values
indexCombinationsValidUnique <- unique(indexCombinationsValid)
Finding the unique values, I am concerned, is going to be prohibitively slow. Furthermore, I end up having to make a bunch of rows in the first place that I never use. I was wondering if anyone has a more elegant and efficient way of getting a data frame or matrix of unique combinations of each of five numbers (or n numbers) between one and some range of values.
Credit to @SymbolixAU for a very elegant solution, which I re-post here as an answer:
n <- 1:63; x <- combn(n, m = 5)
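To see what combn returns, a smaller case is easier to inspect. The sketch below uses k = 6 and n = 3 (values chosen purely for illustration): combn puts one combination in each column, so transposing gives one combination per row, and the number of combinations equals choose(k, n).

```r
k <- 6   # upper end of the range 1..k (illustrative value)
n <- 3   # size of each combination (illustrative value)

combos <- t(combn(k, n))   # one combination per row, each row in increasing order
n_combos <- nrow(combos)

# For the original problem the result has choose(63, 5) columns
n_full <- choose(63, 5)
```

Because each unordered set appears exactly once, no filtering with unique is needed afterwards.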

simulate x percentage of missing and error in data in r

I would like to do two things to my fairly large data set (about 10K x 50K). The following is a smaller set of 200 x 10000.
First, I want to generate 5% missing values, which is perhaps simple and can be done with a simple trick:
# dummy data
set.seed(123)
# matrix of X variable
xmat <- matrix(sample(0:4, 2000000, replace = TRUE), ncol = 10000)
colnames(xmat) <- paste ("M", 1:10000, sep ="")
rownames(xmat) <- paste("sample", 1:200, sep = "")
Generate missing values at 5% random places in the data.
N <- 2000000*0.05 # 5% random missing values
inds_miss <- round ( runif(N, 1, length(xmat)) )
xmat[inds_miss] <- NA
Now I would like to generate errors, meaning values different from those in the matrix above. The matrix holds values from 0 to 4, so what I would like to do is:
(1) I would like to replace a value x with another value that is not x (for example, 0 can be replaced by a random draw from 1, 2, 3 or 4; similarly, 1 can be replaced by 0, 2, 3 or 4). The indices where values are to be replaced can simply be generated with:
inds_err <- round ( runif(N, 1, length(xmat)) )
If I randomly sample values from 0:4 and write them to those indices, this will sometimes replace a value with the same value (0 with 0, 1 with 1 and so on) without creating an error.
errorg <- sample(0:4, length(inds_err), replace = TRUE)
xmat[inds_err] <- errorg
(2) So what I would like to do is introduce errors into xmat, which already contains missing values. However, I do not want the NAs generated in the step above to be replaced with a value (0 to 4). So inds_err should not contain any member of inds_miss.
So, the rules in summary:
(1) The missing values should not be replaced with error values.
(2) An existing value must be replaced with a different value (which is the definition of error here); with plain random sampling there is a 1/5 chance that the value stays the same.
How can this be done? I need a fast solution that can be used on my large data set.
You can try this:
inds_err <- setdiff(round ( runif(2*N, 1, length(xmat)) ),inds_miss)[1:N]
xmat[inds_err]<-(xmat[inds_err]+sample(4,N,replace=TRUE))%%5
With the first line you generate 2*N possible error indices, then you remove the ones belonging to inds_miss and take the first N. With the second line you add a random number between 1 and 4 to the values you want to change and then take the result mod 5. In this way you can be sure that the new value will be different from the original and still in the 0-4 range.
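The modular shift can be checked on a small matrix. The sketch below is a variant, not the answer's exact code: it draws indices with sample (so exactly N distinct cells are hit, whereas round(runif(...)) can produce duplicates), and the names n_cells, original and candidates are illustrative. For a very large matrix, building the full candidate index vector may cost too much memory, so treat this as a demonstration of the idea.

```r
set.seed(1)
xmat <- matrix(sample(0:4, 200 * 50, replace = TRUE), ncol = 50)
original <- xmat                      # kept only to verify the result
n_cells <- length(xmat)
N <- round(0.05 * n_cells)            # 5% of all cells

# Missing values: exactly N distinct cells
inds_miss <- sample(n_cells, N)
xmat[inds_miss] <- NA

# Error cells: drawn only from cells that are not missing
candidates <- setdiff(seq_len(n_cells), inds_miss)
inds_err <- sample(candidates, N)

# Adding 1..4 and taking mod 5 always lands on a different value in 0..4
xmat[inds_err] <- (xmat[inds_err] + sample(4, N, replace = TRUE)) %% 5
```

Since a shift of 1 to 4 can never be a multiple of 5, the new value cannot equal the old one, so no resampling loop is needed.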
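Here's an if/else solution that could work for you. It is a for loop, so I am not sure whether it will be fast enough for you. It could possibly be vectorized in some way to make it faster.
# vector of options
vec <- 0:4
# simple logic-based solution: leave NA cells alone,
# otherwise draw a value different from the current one
for (i in inds_err) {
if (is.na(xmat[i])) {
next
} else {
# exclude the current value by comparison; negative indexing
# would fail for 0 and drop the wrong element otherwise
xmat[i] <- sample(vec[vec != xmat[i]], 1)
}
}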

Aligning sequences with missing values

The language I'm using is R, but you don't necessarily need to know about R to answer the question.
Question:
I have a sequence that can be considered the ground truth, and another sequence that is a shifted version of the first, with some missing values. I'd like to know how to align the two.
setup
I have a sequence ground.truth that is basically a set of times:
ground.truth <- rep( seq(1,by=4,length.out=10), 5 ) +
rep( seq(0,length.out=5,by=4*10+30), each=10 )
Think of ground.truth as times where I'm doing the following:
{take a sample every 4 seconds for 10 times, then wait 30 seconds} x 5
I have a second sequence observations, which is ground.truth shifted with 20% of the values missing:
nSamples <- length(ground.truth)
idx_to_keep <- sort(sample( 1:nSamples, .8*nSamples ))
theLag <- runif(1)*100
observations <- ground.truth[idx_to_keep] + theLag
nObs <- length(observations)
If I plot these vectors this is what it looks like (remember, think of these as times):
What I've tried. I want to:
calculate the shift (theLag in my example above)
calculate a vector idx such that ground.truth[idx] == observations - theLag
First, assume we know theLag. Note that ground.truth[1] is not necessarily observations[1]-theLag. In fact, we have ground.truth[1] == observations[1+lagI]-theLag for some lagI.
To calculate this, I thought I'd use cross-correlation (ccf function).
However, whenever I do this I get a lag with a max. cross-correlation of 0, meaning ground.truth[1] == observations[1] - theLag. But I've tried this in examples where I've explicitly made sure that observations[1] - theLag is not ground.truth[1] (i.e. modify idx_to_keep to make sure it doesn't have 1 in it).
The shift theLag shouldn't affect the cross-correlation (isn't ccf(x,y) == ccf(x,y-constant)?) so I was going to work it out later.
Perhaps I'm misunderstanding though, because observations doesn't have as many values in it as ground.truth? Even in the simpler case where I set theLag==0, the cross correlation function still fails to identify the correct lag, which leads me to believe I'm thinking about this wrong.
Does anyone have a general methodology for me to go about this, or know of some R functions/packages that could help?
Thanks a lot.
For the lag, you can compute all the differences (distances) between your two sets of points:
diffs <- outer(observations, ground.truth, '-')
Your lag should be the value that appears length(observations) times:
which(table(diffs) == length(observations))
# 55.715382960625
# 86
Double check:
theLag
# [1] 55.71538
The second part of your question is easy once you have found theLag:
idx <- which(ground.truth %in% (observations - theLag))
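Because observations - theLag is computed in floating point, an exact %in% comparison may miss some matches. A tolerance-free alternative, sketched below under the assumption that theLag is already known, is to map each shifted observation to its nearest ground-truth time. The variable names shifted and max_err are illustrative, and the seed is arbitrary.

```r
set.seed(42)

# Setup as in the question
ground.truth <- rep(seq(1, by = 4, length.out = 10), 5) +
  rep(seq(0, length.out = 5, by = 4 * 10 + 30), each = 10)
nSamples <- length(ground.truth)
idx_to_keep <- sort(sample(1:nSamples, .8 * nSamples))
theLag <- runif(1) * 100
observations <- ground.truth[idx_to_keep] + theLag

# Undo the shift, then pick the closest ground-truth time for each observation
shifted <- observations - theLag
idx <- sapply(shifted, function(v) which.min(abs(ground.truth - v)))

# Largest remaining mismatch; should be on the order of rounding error
max_err <- max(abs(ground.truth[idx] - shifted))
```

This works here because neighbouring ground-truth times are at least 4 apart, far more than any rounding error; with an estimated rather than exact lag, max_err also gives a quick sanity check on the estimate.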
The following should work if your time series are not too long.
You have two vectors of time-stamps,
the second one being a shifted and incomplete copy of the first,
and you want to find by how much it was shifted.
# Sample data
n <- 10
x <- cumsum(rexp(n,.1))
theLag <- rnorm(1)
y <- theLag + x[sort(sample(1:n, floor(.8*n)))]
We can try all possible lags and, for each one,
compute how bad the alignment is,
by matching each observed timestamp with the closest
"truth" timestamp.
# Loss function
library(sqldf)
f <- function(u) {
# Put all the values in a data.frame
d1 <- data.frame(g="truth", value=x)
d2 <- data.frame(g="observed", value=y+u)
d <- rbind(d1,d2)
# For each observed value, find the next truth value
# (we could take the nearest, on either side,
# but it would be more complicated)
d <- sqldf("
SELECT A.g, A.value,
( SELECT MIN(B.value)
FROM d AS B
WHERE B.g='truth'
AND B.value >= A.value
) AS next
FROM d AS A
WHERE A.g = 'observed'
")
# If u is greater than the lag, there are missing values.
# If u is smaller, the differences decrease
# as we approach the lag.
if(any(is.na(d))) {
return(Inf)
} else {
return( sum(d$`next` - d$value, na.rm=TRUE) )
}
}
We can now search for the best lag.
# Look at the loss function
sapply( seq(-2,2,by=.1), f )
# Minimize the loss function.
# Change the interval if it does not converge,
# i.e., if it seems in contradiction with the values above
# or if the minimum is Inf
(r <- optimize(f, c(-3,3)))
-r$minimum
theLag # Same value, most of the time
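If the sqldf dependency is unwanted, the same loss can be sketched in base R. This is an assumed equivalent of f above, not the answer's code: the name f_base is hypothetical, and it keeps the same "next truth value at or after the shifted observation" rule, returning Inf when some observation has no such value.

```r
set.seed(1)

# Sample data as in the answer
n <- 10
x <- cumsum(rexp(n, .1))
theLag <- rnorm(1)
y <- theLag + x[sort(sample(1:n, floor(.8 * n)))]

# Base-R loss: for each shifted observation, distance to the next truth value
f_base <- function(u) {
  shifted <- y + u
  nxt <- vapply(shifted, function(v) {
    later <- x[x >= v]
    if (length(later) == 0) NA_real_ else min(later)
  }, numeric(1))
  if (anyNA(nxt)) Inf else sum(nxt - shifted)
}

# Inspect the loss on a grid, as done with f above
losses <- sapply(seq(-2, 2, by = .1), f_base)
```

It can be passed to optimize in the same way as f, with the same caveat about choosing the search interval.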

Matching elements in a list

Just starting to program in R... Got stumped on this one, perhaps because I don't know where to begin.
Define a random variable to be equal to the number of trials before there is a match. So if you have a list of numbers, (4,5,7,11,3,11,12,8,8,1....), the first value of the random variable is 6 because by then there are two 11's.(4,5,7,11,3,11) The second value is 3 because then you have 2 8's..12,8,8.
The code below creates the list of numbers, u, by simulating from a uniform distribution.
Thank-you for any help or pointers. I've included a full description of the problem I am solving below if anyone is interested (trying to learn by coding a statistics text).
set.seed(1); u = matrix(runif(1000), nrow=1000)
u[u > 0 & u <= 1/12] <- 1
u[u > 1/12 & u <= 2/12] <- 2
u[u > 2/12 & u <= 3/12] <- 3
u[u > 3/12 & u <= 4/12] <- 4
u[u > 4/12 & u <= 5/12] <- 5
u[u > 5/12 & u <= 6/12] <- 6
u[u > 6/12 & u <= 7/12] <- 7
u[u > 7/12 & u <= 8/12] <- 8
u[u > 8/12 & u <= 9/12] <- 9
u[u > 9/12 & u <= 10/12] <- 10
u[u > 10/12 & u <= 11/12] <- 11
u[u > 11/12 & u < 12/12] <- 12
table(u); u[1:10,]
Example 2.6-3 Concepts in Probability and Stochastic Modeling, Higgins
Suppose we were to ask people at random in which month they were born. Let the random variable X denote the number of people we would need to ask before we found two people born in the same month. The possible values for X are 2,3,...13. That is, at least two people must be asked in order to have a match and no more than 13 need to be asked. With the simplifying assumption that every month is an equally likely candidate for a response, a computer simulation was used to estimate the probability mass function of X. The simulation generated birth months until a match was found. Based on 1000 repetitions of this experiment, the following empirical distribution and sample statistics were obtained...
R has a steep initial learning curve. I don't think it's fair to assume this is your homework, and yes, it's possible to find solutions if you know what you're looking for. However, I remember it being difficult at times to research problems online simply because I didn't know what to search for (I wasn't familiar enough with the terminology).
Below is an explanation of one approach to solving the problem in R. Read the commented code and try and figure out exactly what it's doing. Still, I would recommend working through a good beginner resource. From memory, a good one to get up and running is icebreakeR, but there are many out there...
# set the number of simulations
nsim <- 10000
# Create a matrix, with nsim columns, and fill it with something.
# The something with which you'll populate it is a random sample,
# with replacement, of month names (held in a built-in vector called
# 'month.abb'). We're telling the sample function that it should take
# 13*nsim samples, and these will be used to fill the matrix, which
# has nsim columns (and hence 13 rows). We've chosen to take samples
# of length 13, because as your textbook states, 13 is the maximum
# number of month names necessary for a month name to be duplicated.
mat <- matrix(sample(month.abb, 13*nsim, replace=TRUE), ncol=nsim)
# If you like, take a look at the first 10 columns
mat[, 1:10]
# We want to find the position of the first duplicated value for each column.
# Here's one way to do this, but it might be a bit confusing if you're just
# starting out. The 'apply' family of functions is very useful for
# repeatedly applying a function to columns/rows/elements of an object.
# Here, 'apply(mat, 2, foo)' means that for each column (2 represents columns,
# 1 would apply to rows, and 1:2 would apply to every cell), do 'foo' to that
# column. Our function below extends this a little with a custom function. It
# says: for each column of mat in turn, call that column 'x' and perform
# 'match(1, duplicated(x))'. This match function will return the position
# of the first '1' in the vector 'duplicated(x)'. The vector 'duplicated(x)'
# is a logical (boolean) vector that indicates, for each element of x,
# whether that element has already occurred earlier in the vector (i.e. if
# the month name has already occurred earlier in x, the corresponding element
# of duplicated(x) will be TRUE (which equals 1), else it will be FALSE (0)).
# So the match function returns the position of the first duplicated month
# name (well, actually the second instance of that month name). e.g. if
# x consists of 'Jan', 'Feb', 'Jan', 'Mar', then duplicated(x) will be
# FALSE, FALSE, TRUE, FALSE, and match(1, duplicated(x)) will return 3.
# Referring back to your textbook problem, this is x, a realisation of the
# random variable X.
# Because we've used the apply function, the object 'res' will end up with
# nsim realisations of X, and these can be plotted as a histogram.
res <- apply(mat, 2, function(x) match(1, duplicated(x)))
hist(res, breaks=seq(0.5, 13.5, 1))
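For the literal version of the question, where one long sequence is scanned and counting restarts after every match, a plain loop is perhaps the easiest thing to read. The sketch below uses a hypothetical function name, trials_until_match, and is checked against the example sequence from the question.

```r
# Walk through v, counting trials until a value repeats within the
# current run; record the count, then start a fresh run.
trials_until_match <- function(v) {
  out <- integer(0)
  seen <- c()
  count <- 0L
  for (val in v) {
    count <- count + 1L
    if (val %in% seen) {
      out <- c(out, count)   # match found: store the run length
      seen <- c()            # and reset for the next run
      count <- 0L
    } else {
      seen <- c(seen, val)
    }
  }
  out
}

# Example from the question: two 11s after 6 trials, then two 8s after 3
example <- trials_until_match(c(4, 5, 7, 11, 3, 11, 12, 8, 8, 1))

# Applied to simulated birth months (1..12, equally likely)
set.seed(1)
months <- sample(12, 1000, replace = TRUE)
x_values <- trials_until_match(months)
```

Any trailing values after the last match are simply ignored, which is why the final 1 in the example contributes nothing.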

Resources