simulate x percentage of missing and error in data in r - r

I would like to perform two things to my fairly large data set about 10 K x 50 K . The following is smaller set of 200 x 10000.
First I want to generate 5% missing values, which perhaps simple and can be done with simple trick:
# dummy data
set.seed(123)
# matrix of X variable
xmat <- matrix(sample(0:4, 2000000, replace = TRUE), ncol = 10000)
colnames(xmat) <- paste ("M", 1:10000, sep ="")
rownames(xmat) <- paste("sample", 1:200, sep = "")
Generate missing values at 5% random places in the data.
N <- 2000000*0.05 # 5% random missing values
inds_miss <- round ( runif(N, 1, length(xmat)) )
xmat[inds_miss] <- NA
Now I would like to generate error (means that different value than what I have in above matrix. The above matrix have values of 0 to 4. So what I would like to do:
(1) I would like to replace x value with another value that is not x (for example 0 can be replaced by a random sample of that is not 0 (i.e. 1 or 2 or 3 or 4), similarly 1 can be replaced by that is not 1 (i.e. 0 or 2 or 3 or 4). Indicies where random value can be replaced can be simply done with:
inds_err <- round ( runif(N, 1, length(xmat)) )
If I randomly sample 0:4 values and replace with the indices, this will sometime replace same value with same value ( 0 with 0, 1 with 1 and so on) without creating error.
errorg <- sample(0:4, length(inds_err), replace = TRUE)
xmat[inds_err] <- errorg
(2) So what I would like to do is introduce error in xmat with missing values, However I do not want NA generated in above step be replaced with a value (0 to 4). So ind_err should not be member of vector inds_miss.
So summary rules :
(1) The missing values should not be replaced with error values
(2) The existing value must be replaced with different value (which is definition of error here)- in random sampling this 1/5 probability of doing this.
How can it be done ? I need faster solution that can be used in my large dataset.

You can try this:
inds_err <- setdiff(round ( runif(2*N, 1, length(xmat)) ),inds_miss)[1:N]
xmat[inds_err]<-(xmat[inds_err]+sample(4,N,replace=TRUE))%%5
With the first line you generate 2*N possible error indices, than you subtract the ones belonging to inds_miss and then take the first N. With the second line you add to the values you want to change a random number between 1 and 4 and than take the mod 5. In this way you are sure that the new value will be different from the original and stil in the 0-4 range.

Here's an if/else solution that could work for you. It is a for loop so not sure if that will be okay for you. Possibly vectorize it is some way to make it faster.
# vector of options
vec <- 0:4
# simple logic based solution if just don't want NA changed
for(i in 1:length(inds_err){
if(is.na(xmat[i])){
next
}else{
xmat[i] <- sample(vec[-xmat[i]], 1)
}
}

Related

nested loop matrix index in R

I am learning matrix multiplication in R and following is what I want to achieve. I am doing this purely to upscale my skills in R.
Following is the kind of matrix I am working with:
m <- matrix(1, 100, 10)
I have matrix with only element 1 with 100 rows and 10 columns. Now I want to replace for column 1 with 0 from row1 to row10. Then for the second column, I want to replace 1 with zeros from row 11 to row 20. Similarly for for the third column, I want to replace 1 with zeros from row 21 to row 30 and similarly for the rest up too column 10. Following my my example
m <- matrix(1, 100, 10)
for(j in 1:10){
for(i in (j-1)*10+1: j*10){
m[i,j] <-0
}
}
I was quite confident that my logic was correct but every time I run my code, I get following error message Subscripts out of bounds Call. I tried couple days now and I could not resolve this problem. I would highly appreciate for any hints or direct solutions to fix this. Many thanks in advance.
You could use just one variable, which I think would be easier. For each column, j, get the lower and upper range of row indices and assign as 0.
for(j in 1:10){
row_lower <- (j-1)*10+1
row_upper <- j*10
m[row_lower: row_upper, j] <- 0
}
Returns 0's in your specified range.

Replacing NAs with random decimals in a particular column in R

I am trying to replace NAs with random decimals in a particular column in R. However, R generates random decimals with the same trailing fraction and just changes the part before the decimal. The following are the methods I tried:
df_LT$ATC[is.na(df_LT$ATC)] <- sample(seq(10.2354897,23.78954214), size=sum(is.na(df_LT$ATC)), replace=T)
dplyr
df_LT <- df_LT %>%mutate_at(vars(df_LT$ATC), ~replace_na(., sample(10.2354897:23.78954214, size=sum(is.na(ATC)), replace=T)))
Data looks as below
A ATC
1 11.2356879
2 42.58974164
3 NA
4 34.25382343
5 NA
Now, wherever there is a NA in the ATC column I want to add a decimal like the others but in the range 10:23. Hope this explanation will help.
I may be missing something very obvious. Thanks for the help in advance.
You are using seq or the colon operator : to create your samples, which means you are sampling from following sequence:
seq(10.2354897, 23.78954214)
# [1] 10.23549 11.23549 12.23549 13.23549 14.23549 ....
So the starting value is increased by 1 in each step, leaving the numbers after the decimal points fixed.
If you want to sample random number within the range of these two limits you can do:
runif(n = 1, min = 10.2354897, max = 23.78954214)
So for your example this translates to:
df_LT$ATC[is.na(df_LT$ATC)] <-
runif(n = sum(is.na(df_LT$ATC)), 10.2354897, 23.78954214)
If you want to add a condition you can do:
df_LT$ATC <-
ifelse(is.na(df_LT$ATC) & df_LT$A == 3,
runif(n = nrow(df_LT), 10.2354897, 23.78954214),
df_LT$ATC)
This checks whether ATC is missing and also whether A is equal to 3. If this is fulfille the missing value is replaced with a random number, otherwise the original value (missin or not) is returned.

How to select a specific amount of rows before and after predefined values

I am trying to select relevant rows from a large time-series data set. The tricky bit is, that the needed rows are before and after certain values in a column.
# example data
x <- rnorm(100)
y <- rep(0,100)
y[c(13,44,80)] <- 1
y[c(20,34,92)] <- 2
df <- data.frame(x,y)
In this case the critical values are 1 and 2 in the df$y column. If, e.g., I want to select 2 rows before and 4 after df$y==1 I can do:
ones<-which(df$y==1)
selection <- NULL
for (i in ones) {
jj <- (i-2):(i+4)
selection <- c(selection,jj)
}
df$selection <- 0
df$selection[selection] <- 1
This, arguably, scales poorly for more values. For df$y==2 I would have to repeat with:
twos<-which(df$y==2)
selection <- NULL
for (i in twos) {
jj <- (i-2):(i+4)
selection <- c(selection,jj)
}
df$selection[selection] <- 2
Ideal scenario would be a function doing something similar to this imaginary function selector(data=df$y, values=c(1,2), before=2, after=5, afterafter = FALSE, beforebefore=FALSE), where values is fed with the critical values, before with the amount of rows to select before and correspondingly after.
Whereas, afterafter would allow for the possibility to go from certain rows until certain rows after the value, e.g. after=5,afterafter=10 (same but going into the other direction with afterafter).
Any tips and suggestions are very welcome!
Thanks!
This is easy enough with rep and its each argument.
df$y[rep(which(df$y == 2), each=7L) + -2:4] <- 2
Here, rep repeats the row indices that your criterion 7 times each (two before, the value, and four after, the L indicates that the argument should be an integer). Add values -2 through 4 to get these indices. Now, replace.
Note that for some comparisons, == will not be adequate due to numerical precision. See the SO post why are these numbers not equal for a detailed discussion of this topic. In these cases, you could use something like
which(abs(df$y - 2) < 0.001)
or whatever precision measure will work for your problem.

For loop issue with a negative values in R

I am new to R and I have an R code that uses the for loop to calculate the y = m*i + b. The starting value is a negative and I want to use that in my calculation and store it in the first occurrence of the Trend.Line and so forth.
I am not getting the results that I'm expecting. If my starting is a positive number no matter what number it is, I still want to store it in the first occurrence.
For example, if start= -5, I would like to store this calculated value in Y <= m*i + b in Trend.Line[1], the -4 calculated value to Trend.Line[2]. Now if start = 6, I would like to store calculated value in Trend.Line[1], the 7 calculated value to Trend.Line[2]
Thanks for looking into this.
Here is my code:
Trend.Line <- numeric(0)
start <- -5
end <- 12
m <- 345.72
b <- 54454
for(i in start:end){
y <- m*(i) + b
Trend.Line[i] <- y
}
Trend.Line
How about just doing
Trend.Line <- start:end
m * Trend.Line + b
It returns a numeric vector with everything at the index you want. It also makes use of the vectorization of functions in R. So multiplication and addition work on all elements of the vector Trend.Line.

Aligning sequences with missing values

The language I'm using is R, but you don't necessarily need to know about R to answer the question.
Question:
I have a sequence that can be considered the ground truth, and another sequence that is a shifted version of the first, with some missing values. I'd like to know how to align the two.
setup
I have a sequence ground.truth that is basically a set of times:
ground.truth <- rep( seq(1,by=4,length.out=10), 5 ) +
rep( seq(0,length.out=5,by=4*10+30), each=10 )
Think of ground.truth as times where I'm doing the following:
{take a sample every 4 seconds for 10 times, then wait 30 seconds} x 5
I have a second sequence observations, which is ground.truth shifted with 20% of the values missing:
nSamples <- length(ground.truth)
idx_to_keep <- sort(sample( 1:nSamples, .8*nSamples ))
theLag <- runif(1)*100
observations <- ground.truth[idx_to_keep] + theLag
nObs <- length(observations)
If I plot these vectors this is what it looks like (remember, think of these as times):
What I've tried. I want to:
calculate the shift (theLag in my example above)
calculate a vector idx such that ground.truth[idx] == observations - theLag
First, assume we know theLag. Note that ground.truth[1] is not necessarily observations[1]-theLag. In fact, we have ground.truth[1] == observations[1+lagI]-theLag for some lagI.
To calculate this, I thought I'd use cross-correlation (ccf function).
However, whenever I do this I get a lag with a max. cross-correlation of 0, meaning ground.truth[1] == observations[1] - theLag. But I've tried this in examples where I've explicitly made sure that observations[1] - theLag is not ground.truth[1] (i.e. modify idx_to_keep to make sure it doesn't have 1 in it).
The shift theLag shouldn't affect the cross-correlation (isn't ccf(x,y) == ccf(x,y-constant)?) so I was going to work it out later.
Perhaps I'm misunderstanding though, because observations doesn't have as many values in it as ground.truth? Even in the simpler case where I set theLag==0, the cross correlation function still fails to identify the correct lag, which leads me to believe I'm thinking about this wrong.
Does anyone have a general methodology for me to go about this, or know of some R functions/packages that could help?
Thanks a lot.
For the lag, you can compute all the differences (distances) between your two sets of points:
diffs <- outer(observations, ground.truth, '-')
Your lag should be the value that appears length(observations) times:
which(table(diffs) == length(observations))
# 55.715382960625
# 86
Double check:
theLag
# [1] 55.71538
The second part of your question is easy once you have found theLag:
idx <- which(ground.truth %in% (observations - theLag))
The following should work if your time series are not too long.
You have two vectors of time-stamps,
the second one being a shifted and incomplete copy of the first,
and you want to find by how much it was shifted.
# Sample data
n <- 10
x <- cumsum(rexp(n,.1))
theLag <- rnorm(1)
y <- theLag + x[sort(sample(1:n, floor(.8*n)))]
We can try all possible lags and, for each one,
compute how bad the alignment is,
by matching each observed timestamp with the closest
"truth" timestamp.
# Loss function
library(sqldf)
f <- function(u) {
# Put all the values in a data.frame
d1 <- data.frame(g="truth", value=x)
d2 <- data.frame(g="observed", value=y+u)
d <- rbind(d1,d2)
# For each observed value, find the next truth value
# (we could take the nearest, on either side,
# but it would be more complicated)
d <- sqldf("
SELECT A.g, A.value,
( SELECT MIN(B.value)
FROM d AS B
WHERE B.g='truth'
AND B.value >= A.value
) AS next
FROM d AS A
WHERE A.g = 'observed'
")
# If u is greater than the lag, there are missing values.
# If u is smaller, the differences decrease
# as we approach the lag.
if(any(is.na(d))) {
return(Inf)
} else {
return( sum(d$`next` - d$value, na.rm=TRUE) )
}
}
We can now search for the best lag.
# Look at the loss function
sapply( seq(-2,2,by=.1), f )
# Minimize the loss function.
# Change the interval if it does not converge,
# i.e., if it seems in contradiction with the values above
# or if the minimum is Inf
(r <- optimize(f, c(-3,3)))
-r$minimum
theLag # Same value, most of the time

Resources