Add an exact proportion of random missing values to a data.frame in R

I would like to add random NA to a data.frame in R. So far I've looked into these questions:
R: Randomly insert NAs into dataframe proportionaly
How do I add random NAs into a data frame
add random missing values to a complete data frame (in R)
Many solutions were provided there, but I couldn't find one that complies with these 5 conditions:
Add truly random NAs, not the same number per row or per column
Work with every class of variable one can encounter in a data.frame (numeric, character, factor, logical, ts, ...), so the output must have the same format as the input data.frame or matrix
Guarantee an exact number or proportion [note] of NA in the output (many solutions result in fewer NA, since several are generated at the same location)
Be computationally efficient for big datasets
Add the proportion/number of NA independently of any NA already present in the input
Does anyone have an idea?
I have already tried to write a function to do this (in an answer to the first link), but it doesn't comply with points 3 and 4.
Thanks.
[note] the exact proportion, rounded to within ±1 NA of course.

This is the way I do it for my paper on library(imputeMulti), which is currently in review at JSS. This inserts NAs into a random percentage of the whole dataset and scales well. It doesn't guarantee an exact number because of the case where n * p * pctNA %% 1 != 0.
createNAs <- function(x, pctNA = 0.1) {
  n <- nrow(x)
  p <- ncol(x)
  # sample which cells (as linear indices) will be overwritten
  NAloc <- rep(FALSE, n * p)
  NAloc[sample.int(n * p, floor(n * p * pctNA))] <- TRUE
  # a logical matrix index preserves each column's class
  x[matrix(NAloc, nrow = n, ncol = p)] <- NA
  return(x)
}
Obviously you should set a random seed for reproducibility, which can be done before the function call.
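For example, on a toy data frame (my own example, not the asker's data), createNAs inserts exactly floor(n * p * pctNA) missing values:
set.seed(1234)                    # any fixed seed, for reproducibility
df <- data.frame(a = 1:10,
                 b = letters[1:10],
                 c = factor(rep(c("x", "y"), 5)))
df_na <- createNAs(df, pctNA = 0.2)
mean(is.na(df_na))                # 0.2 exactly: 6 NAs out of 30 cells
sapply(df_na, class)              # column classes are preserved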
This works as a general strategy for creating baseline datasets for comparison across imputation methods. I believe this is what you want, although your question (as noted in the comments) isn't clearly stated.
Edit: I do assume that x is complete, so I'm not sure how it would handle existing missing data. You could certainly modify the code if you want, though that would probably increase the runtime by at least O(n*p).

Some users reported that Alex's answer did not address condition 5 of my question. Indeed, when adding random NAs to a data.frame that already contains missing values, some of the new ones will fall on the existing ones, and the final proportion will end up somewhere between the initial and the desired proportion... So I expanded Alex's function to comply with all 5 conditions:
I modified his createNAs function so that it offers one of 3 options:
option complement: complete with NAs up to the desired %
option add: add a % of NAs on top of those already present
option none: add a % of NAs regardless of those already present
For options complement and add, the function works recursively until the desired proportion of NA is reached:
createNAs <- function(x, pctNA = 0.0, option = "add") {
  prop.NA <- function(x) sum(is.na(x)) / prod(dim(x))
  initial.pctNA <- prop.NA(x)
  if ((option == "complement") & (initial.pctNA > pctNA)) {
    message("The data already had more NA than the target percentage. Returning original data")
    return(x)
  }
  if ((option == "none") || (initial.pctNA == 0)) {
    n <- nrow(x)
    p <- ncol(x)
    NAloc <- rep(FALSE, n * p)
    NAloc[sample.int(n * p, floor(n * p * pctNA))] <- TRUE
    x[matrix(NAloc, nrow = n, ncol = p)] <- NA
    return(x)
  } else { # option "complement" or "add":
    target <- ifelse(option == "complement", pctNA, pctNA + initial.pctNA)
    while (prop.NA(x) < target) {
      prop.remaining.to.add <- target - prop.NA(x)
      x <- createNAs(x, prop.remaining.to.add, option = "none")
    }
    return(x)
  }
}
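A quick check of the complement option, on a made-up data.frame that already contains some NAs (not the asker's data):
set.seed(42)
df <- data.frame(a = rnorm(100), b = sample(letters, 100, replace = TRUE))
df$a[1:10] <- NA                             # 10 / 200 cells = 5% already missing
df2 <- createNAs(df, pctNA = 0.10, option = "complement")
mean(is.na(df2))                             # ~0.10, regardless of the initial 5%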


R - Looping with while always results in missing value where TRUE/FALSE is expected

EDIT: I implemented the solutions offered so far, and the code looks way cleaner now. This was the key to finally finding my error: it was a logical condition that I didn't check within the while loop. It could happen that the iterator exceeded the number of elements in the vector and thus passed an NA to the while condition! Thanks.
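To illustrate that failure mode for anyone who finds this later (a toy example, not my actual data):
v <- c(10, 20, 30)
v[4]                   # indexing past the end returns NA, not an error
k <- 1
while (v[k] < 100) {   # when k reaches 4, v[k] is NA, and while(NA) fails with
  k <- k + 1           # "Error: missing value where TRUE/FALSE is expected"
}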
I also changed the solution to store the results in vectors and recombine them after the for loop, as vector indexing seems to be way faster than data.table indexing and value assignment within the loop.
Please let me apologize first for any errors and lack of information for troubleshooting my problem, as this is my first post. I have already read that this error can happen accidentally whenever there is an error in a computation and the value of a condition itself results in an error, such as
if (TRUE & sqrt(-1))
It's been days and I am still receiving this error. It really gives me a headache, as the logic behind the code is actually pretty straightforward, yet I still can't properly formalize it. It goes as follows: for each unique bond ID contained in a vector of size N (looped through with i), compare the static value of its maturity to 7 periods' end dates for distinct sets of rules (looped through with k), to determine which periods with unique rules the respective issue falls into. Then loop through the periods' size thresholds (with l) to find out whether a particular issue has violated these minimum size requirements. If a violation is determined, I can assign the date of the violation. If (l == k), I can reckon that the issue has passed the size requirement checks for all periods its maturity falls into, and as such hasn't violated any rules. I then assign the result of the conditional checks as binary values in a new data.table column, along with the violation date. So far, I really can't determine what is causing this error.
My data looks like the following. I have a pretty large data.table containing bond issue identifiers and various other columns that describe those issues. It was initially imported with the read_dta() function and then converted to a data.table with setDT().
I extract 3 columns out of this data.table, using
issue_IDs.vec <- as.numeric(issues.dt[[2]])
maturity.vec <- as.Date(issues.dt[[8]], "%Y-%m-%d")
offerings_atm.vec <- as.numeric(issues.dt[[33]])
Next, I transform the eligibility criteria of an index as follows.
# (1) Creating size requirement end periods (valid thru) ----
size_req_per_1 <- as.Date("1992-01-01", "%Y-%m-%d")
size_req_per_2 <- as.Date("1994-01-01", "%Y-%m-%d")
size_req_per_3 <- as.Date("1999-07-01", "%Y-%m-%d")
size_req_per_4 <- as.Date("2003-10-01", "%Y-%m-%d")
size_req_per_5 <- as.Date("2004-07-01", "%Y-%m-%d")
size_req_per_6 <- as.Date("2017-02-01", "%Y-%m-%d")
size_req_per_7 <- as.Date("2021-02-01", "%Y-%m-%d")
size_req_val_per.vec <- c(size_req_per_1, size_req_per_2, size_req_per_3, size_req_per_4,
size_req_per_5, size_req_per_6, size_req_per_7)
# (2) Create a size requirement threshold per rules' validity period ----
size_req_thresh_1 <- 25000
size_req_thresh_2 <- 50000
size_req_thresh_3 <- 100000
size_req_thresh_4 <- 150000
size_req_thresh_5 <- 200000
size_req_thresh_6 <- 250000
size_req_thresh_7 <- 300000
size_req_thresh.vec <- c(size_req_thresh_1, size_req_thresh_2, size_req_thresh_3,
size_req_thresh_4, size_req_thresh_5, size_req_thresh_6,
size_req_thresh_7)
Next, I write a loop that performs conditional checks to find, for each issue ID stored in issue_IDs.vec, whether it violates the index eligibility criterion of the minimum issuance size during its maturity. I do this by passing the value of the iterator variable i as a position to issue_IDs.vec.
# (3) Loop through a set of conditional checks to find out if, and if so when, a particular issue violated the size requirement ----
# Iterator variables ----
# Length of issues.dt
j <- issues.dt[, .N]
# Main iterator looping through all entries of issues.dt extracted as vectors
i <- 1
# Looping through vector elements of issue rules (vec. 1: validity periods)
k <- 1
# Looping through vector elements of issue rules (vec. 2: size thresholds)
l <- 1
# Loop
for (i in 1:j) {
  id <- issue_IDs.vec[i]
  maturity <- maturity.vec[i]
  offering_atm <- issue_IDs.vec[i]
  k <- 1
  maturity_comp <- size_req_val_per.vec[k]
  while (maturity >= maturity_comp) {
    if (k < 7) {
      k <- k + 1
      maturity_comp <- size_req_val_per.vec[k]
    } else {
      break
    }
  }
  l <- 1
  offering_size_comp <- size_req_thresh.vec[l]
  for (l in 1:k) {
    if (offering_atm >= offering_size_comp) {
      offering_size_comp <- size_req_thresh.vec[l]
      next
    } else {}
  }
  if (l == k) {
    issues.dt[ISSUE_ID == id,
              `:=`(SIZE_REQ_VIOLATION = 0,
                   SIZE_REQ_VIOLATION_DATE = NA)]
  } else {
    issues.dt[ISSUE_ID == id,
              `:=`(SIZE_REQ_VIOLATION = 1,
                   SIZE_REQ_VIOLATION_DATE = size_req_val_per.vec[l])]
  }
  i <- i + 1
}
Whenever I try running the code in a simplified version, such as
k <- 1
for (k in 1:7) {
  print(maturity >= maturity_comp)
  k <- k + 1
  maturity_comp <- format(as.Date(size_req_val_per.vec[k]), "%Y-%m-%d")
}
the code runs smoothly and always prints the evaluations TRUE or FALSE, depending on which ID I initially use to create the corresponding static maturity of the particular bond issue. At this stage, I have already exhausted my troubleshooting skills.
I'd appreciate any input from you guys, and if you need any additional information, explanations etc. just let me know.
I think the answer lies in Gregor's comment. The way you are formatting your dates converts them to character variables. Here's a quick example:
Exmpl <- as.Date("08-25-2020", "%m-%d-%Y")
class(Exmpl)
[1] "Date"
## Not your preferred format, but it is a Date variable
Exmpl
[1] "2020-08-25"
## Formatting changes it to a character
Exmpl2 <- format(as.Date(Exmpl), "%m-%d-%Y")
class(Exmpl2)
[1] "character"
When you call them in the while() condition, R tries to make a comparison to decide whether the condition (i.e., maturity is greater than or equal to maturity_comp) is TRUE or FALSE (a logical value). Because you have character variables, R cannot make this comparison.
I think your code will work if you don't format the dates, but simply read them in and leave them in the YYYY-mm-dd format.

How do you create a function that row reduces a matrix in R?

So far I've tried the following code, but it didn't work in RStudio; it just hangs there.
Am I doing something wrong? This is my first real R code project so I'd love suggestions!
new.rref <- function(M, fractions = FALSE)
{
  # M is a matrix.
  # Require numeric matrices.
  if ((!is.matrix(M)) || (!is.numeric(M)))
    stop("Sorry pal! Data not a numeric matrix.")
  # Specify and differentiate between rows and columns.
  r = nrow(M)
  c = ncol(M)
  # Now establish a continuous loop (*needed help on this one).
  # According to the help documents I've read, this has to do with a
  # computerized version of the Gaussian reducing algorithm.
  # While 1 <= r and 1 <= c, must set first column entries in which
  # 1:r < 1 equal to zero. This while loop is used to loop the
  # algorithm until a specific condition is met -- in this case,
  # until the elements in the first column to which 1:r < 1
  # are set to zero.
  while ((1 <= r) & (1 <= c))
    new <- M[, 1]
  new[1:r < y.position] <- 0
  # Now here's the fun part :)
  # We need to find the maximum leading coefficient that lies
  # at or below the current row.
  new1 <- which.max(abs(new))
  # We will assign these values to the vector "LC".
  LC <- col[which]
  # Now we need to allow for row exchange!
  # Basically tells R that M[c(A,B),] = M[c(B,A),].
  if (which > 1) { M[c(1, which), ] <- A[c(which, 1), ] }
  # Now we have to allow for the pivot, "sweep", and restoration
  # of the current row. I totally didn't know how to do this, so I
  # used and changed some code from different documentations.
  # PIVOT (friends reference)
  M[1, ] <- M[1, ] / LC
  new2 <- M[1, ]
  # CLEAN
  M <- M - outer(M[, x.position], new2)
  # RESTORE
  A[1, ] <- new2
  # Last, but certainly not least, we're going to round the matrix
  # off to a certain value. I might have done this wrong.
  round(M)
  return(M)
  print(M)
}
Edit: I added the first line; for some reason it got deleted.
Edit 2: Say you have a matrix M = matrix(c(2,3,4,7), nrow=2, ncol=2, byrow=TRUE); new.rref(M) needs to produce the reduced row echelon form of M. I already did the math; new.rref(M) should be equal to matrix(c(1,0,0,1), nrow=2, ncol=2, byrow=TRUE).
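For reference, a minimal sketch of Gauss-Jordan elimination with partial pivoting (one possible way to write such a function, not a fix of the code above) could look like:
rref <- function(M, tol = 1e-10) {
  stopifnot(is.matrix(M), is.numeric(M))
  r <- nrow(M)
  pivot <- 1
  for (j in seq_len(ncol(M))) {
    if (pivot > r) break
    # partial pivoting: take the largest entry in column j at or below `pivot`
    p <- which.max(abs(M[pivot:r, j])) + pivot - 1
    if (abs(M[p, j]) < tol) next                # no usable pivot in this column
    M[c(pivot, p), ] <- M[c(p, pivot), ]        # row exchange
    M[pivot, ] <- M[pivot, ] / M[pivot, j]      # scale the leading entry to 1
    others <- setdiff(seq_len(r), pivot)
    # sweep: clear column j in every other row
    M[others, ] <- M[others, , drop = FALSE] - outer(M[others, j], M[pivot, ])
    pivot <- pivot + 1
  }
  M
}
M <- matrix(c(2, 3, 4, 7), nrow = 2, ncol = 2, byrow = TRUE)
rref(M)   # the 2x2 identity, as expected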

Getting only NAs in the dataframe?

I am trying to make a dataframe with two columns and 10 rows, where the first column contains the weight (denoted by w in the code) and the second column contains the error rate (denoted by cv.error). However, I get a dataframe with only NA in it. I don't know what I am doing wrong. Help would be appreciated.
I want a dataframe in which the first column has "w" and the other has cv.error.
Following is my code
l <- data.frame(matrix(NA, nrow = 10, ncol = 2))
k_fun <- function(combined_distance, n, j)
{
  glm_fit <- glm(gcms$train$response ~ combined_distance, family = binomial,
                 data = gcms$train, control = list(maxit = 50))
  cv.error = cv.glm(gcms$train, glm_fit, K = 5)$delta[1]
  l[j, 1] = n
  l[j, 2] = cv.error
}
w = c(0.1,0.2,0.25,0.3,0.35,0.4,0.45,0.50,0.7,0.9)
for (j in 1:10)
{
  combined_distance <- alkoloiddistance + (1 - alkoloiddistance^w[j]) * solventdistance
  k_fun(combined_distance, w[j], j)
}
Don't know why my answer was deleted; it answered the question and explained the reason.
You need l[j,1] <<- n and l[j,2] <<- cv.error. You hope to update l inside the function, but actually it is only a local copy of l inside the function that gets updated, so after running your loop, the l in your R session is not changed at all. You set up l as a data frame of NA, thus you still get a data frame of all NA.
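A self-contained illustration of the difference (toy objects, not the asker's data):
l <- data.frame(a = rep(NA_real_, 3))
f_local  <- function(j, val) { l[j, 1] <-  val }  # modifies a local copy only
f_global <- function(j, val) { l[j, 1] <<- val }  # modifies l in the enclosing scope
f_local(1, 10)
l$a[1]   # still NA
f_global(1, 10)
l$a[1]   # 10
A more idiomatic alternative is to have k_fun return c(n, cv.error) and assign l[j, ] <- k_fun(...) inside the loop, avoiding <<- entirely.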
The problem is probably related to how you instantiated the data frame. If you run str(l) after running the first line of your code you may see that the data type assigned by R is either logical or factor. If you try to assign numeric values to columns of these types you will get an NA instead. Try running options(stringsAsFactors = F) before the rest of your code. Alternatively, you can assign the class of NA you want in the first line of your code (e.g., NA_integer_).

Finding a value in an interval

Sorry if this is a basic question. I have been trying to figure this out but haven't been able to.
I have a vector of values called sym.
> head(sym)
[,1]
val 3.652166e-05
val -2.094026e-05
val 4.583950e-05
val 6.570184e-06
val -1.431486e-05
val -5.339604e-06
I put these into intervals by using factor() on the output of cut() applied to sym.
factorx<-factor(cut(sym,breaks=nclass.Sturges(sym)))
[1] (2.82e-05,5.28e-05] (-2.11e-05,3.55e-06] (2.82e-05,5.28e-05] (3.55e-06,2.82e-05] (-2.11e-05,3.55e-06] (-2.11e-05,3.55e-06]
[7] (-2.11e-05,3.55e-06] (2.82e-05,5.28e-05] (3.55e-06,2.82e-05] (7.74e-05,0.000102]
Levels: (-2.11e-05,3.55e-06] (3.55e-06,2.82e-05] (2.82e-05,5.28e-05] (7.74e-05,0.000102]
So clearly, four intervals were created in factorx. Now I have a new value tmp = 3.7e-06.
My question is: how can I find which of the above intervals it belongs to? I tried to use findInterval(), but it seems it does not work on factors like factorx.
Thanks
If you plan to re-classify new values, it's best to explicitly set the breaks= parameter with a vector rather than a size. Note that had those values been in the set originally, you may have had different breaks, and it is possible that your new values fall outside all the levels of your existing data, which can be troublesome.
So first, I will generate some sample data.
set.seed(18)
x <- runif(50)
Now I will show two different ways to calculate breaks. Here are b1() and b2():
b1 <- function(x, n = nclass.Sturges(x)) {
  # like default cut()
  nb <- as.integer(n + 1)
  dx <- diff(rx <- range(x, na.rm = TRUE))
  if (dx == 0)
    dx <- abs(rx[1L])
  seq.int(rx[1L] - dx/1000, rx[2L] + dx/1000, length.out = nb)
}
b2 <- function(x, n = nclass.Sturges(x)) {
  # like default hist()
  pretty(range(x), n = n)
}
So each of these functions will give break points similar to either the default behaviors of cut() or hist(). Rather than just a single number of breaks, they each return a vector with all the break points explicitly stated. This allows you to use cut() to create your factor
mybreaks <- b1(x)
factorx <- cut(x, breaks = mybreaks)
(Note that you don't have to wrap cut() in factor(), as cut() already returns a factor.) Now, if you get new values, you can classify them using findInterval() and the special breaks vector you've already prepared:
nv <- runif(5)
grp <- findInterval(nv,mybreaks)
And we can check the results with
data.frame(grp=levels(factorx)[grp], x=nv)
# grp x
# 1 (0.831,0.969] 0.8769438
# 2 (0.00131,0.14] 0.1188054
# 3 (0.416,0.554] 0.5467373
# 4 (0.14,0.278] 0.2327532
# 5 (0.554,0.693] 0.6022678
and everything looks pretty good. In this case, findInterval() will tell you which level of the factor you created earlier each item belongs to. It will return 0 if the number is smaller than all your previous observations, and it will return the largest category for anything greater than the largest value of mybreaks. This behavior is somewhat different from cut(), which returns NA. Also, the last group in cut() is right-closed, whereas findInterval() leaves the right end open.
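To see that edge behavior concretely (using the mybreaks from above, which span roughly 0 to 1):
findInterval(c(-1, 2), mybreaks)   # 0 below the range, length(mybreaks) above it
cut(c(-1, 2), breaks = mybreaks)   # NA for both out-of-range values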

Calculating Hamming distance for two vectors in R?

I'm trying just to calculate the Hamming distance between two vectors in R. I'm currently attempting to use the "e1071" package, and the hamming.distance function, as follows:
library(e1071)
H <- hamming.distance(X)
Where X is a data.frame with 2 rows and (in my particular data) 667 columns, and every observation is 0 or 1.
Initially I got the error:
Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
After some research, it appeared that one fix might be increasing the expressions option in R. This I did via options(expressions = 5000), and I then tried varying values in place of the 5000. But this only produced the error:
Error: C stack usage is too close to the limit
I'm not much of a programmer, and the fixes for this most recent error appear to have to do with something inside the package e1071 possibly not being called correctly (or at the right time).
Any ideas on what I'm doing wrong? I eventually want the Hamming distances between a large number of vectors, and this was just a starting point. If this has to do with memory allocation, any suggestions for how to deal with it?
I don't know how hamming.distance works internally, but a simple way to calculate the distance for 2 vectors is just
sum(x1 != x2)
or, in this case,
sum(X[1,] != X[2,])
If the total number of vectors is not too large (up to, say, a few thousand), you could implement this in a nested loop:
n <- nrow(X)
m <- matrix(nrow = n, ncol = n)
for (i in seq_len(n - 1))
  for (j in seq(i, n))
    m[j, i] <- m[i, j] <- sum(X[i, ] != X[j, ])
Caveat: untested.
WARNING ABOUT USING HAMMING.DISTANCE FROM PACKAGE e1071!
This package's implementation coerces the objects being compared to booleans with as.logical, so values of 0 become FALSE and any non-zero values become TRUE. This means that for the sequence 0 1 2 compared to 0 1 1, the Hamming distance will be reported as 0 instead of the correct value of 1 -- this package treats 1 and 2 as equal, since as.logical(1) == as.logical(2).
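You can check this directly:
library(e1071)
hamming.distance(c(0, 1, 2), c(0, 1, 1))   # 0, but the true Hamming distance is 1
sum(c(0, 1, 2) != c(0, 1, 1))              # 1, the correct value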
Here is the faulty (in my view) implementation:
> library("e1071", lib.loc="C:/Program Files/R/R-2.15.3/library")
Loading required package: class
> hamming.distance
function (x, y)
{
    z <- NULL
    if (is.vector(x) && is.vector(y)) {
        z <- sum(as.logical(x) != as.logical(y))
    }
    else {
        z <- matrix(0, nrow = nrow(x), ncol = nrow(x))
        for (k in 1:(nrow(x) - 1)) {
            for (l in (k + 1):nrow(x)) {
                z[k, l] <- hamming.distance(x[k, ], x[l, ])
                z[l, k] <- z[k, l]
            }
        }
        dimnames(z) <- list(dimnames(x)[[1]], dimnames(x)[[1]])
    }
    z
}
<environment: namespace:e1071>
My recommendation: DO NOT USE. Hamming distance is trivial to implement as noted several times above.
hamming.distance takes two vectors or a matrix, but not a data frame, so what you want is probably either
m = as.matrix(X)
hamming.distance(m[1,], m[2,])
or
hamming.distance(as.matrix(X))
but, as was pointed out, in your particular case this is the same as
sum(m[1,] != m[2,])
(In general, avoid data.frames if what you have is not a heterogeneous structure, since they are much, much slower than matrices.)
As an addition to all that was mentioned above: Although the Hamming distance is trivial to implement as an ordinary nested loop, in terms of execution time things can quickly get out of hand for larger matrices. In R, it is far more efficient to instead use matrix multiplication for computing the Hamming distance between all columns of large matrices. This is extremely fast compared to an R-level nested loop. An example implementation can be found here.
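For a 0/1 matrix X whose columns are the vectors being compared, a minimal sketch of the idea (my own, assuming binary data) is:
# Hamming distances between all pairs of columns of a binary (0/1) matrix:
# count positions where one column is 1 and the other is 0, in both directions.
hamming_cols <- function(X) {
  D <- t(X) %*% (1 - X)
  D + t(D)
}
set.seed(1)
X <- matrix(rbinom(5000, 1, 0.5), nrow = 100)  # 100 rows, 50 binary columns
H <- hamming_cols(X)
H[1, 2] == sum(X[, 1] != X[, 2])               # TRUE: matches the naive count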
sum(xor(x[1,],x[2,]))
I don't know the relative efficiency of xor() compared to !=, though.
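One way to check (assuming the microbenchmark package is installed):
library(microbenchmark)
x <- rbinom(1e6, 1, 0.5)
y <- rbinom(1e6, 1, 0.5)
microbenchmark(sum(x != y), sum(xor(x, y)))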
Just adding to @HongOoi: I want to point out that in R, != and == return NA when one of the values is missing, so they can give misleading results:
> c(1, NA) == 1:2
[1] TRUE NA
However, %in% returns FALSE for the comparison 1 %in% NA. Because of that, if you want to count missing values as "different" when comparing vectors, you have to use sum(!((x != y) %in% FALSE)):
> x <- c(1, 8, 5, NA, 5)
> y <- 1:5
> sum(!((x != y) %in% FALSE))
[1] 3
Notice also that x and y could have different lengths, which would lead to missing values in the (padded) shorter vector; you can do one of two things: truncate the longer vector, or treat values absent from the shorter vector as "different". This can be translated into a standalone function with familiar R parameters:
hamming <- function(x, y, na.rm = TRUE) {
  size <- 1:max(length(x), length(y))   # pad the shorter vector with NA
  x <- x[size]
  y <- y[size]
  if (na.rm) {
    keep <- !(is.na(x) | is.na(y))      # drop positions where either value is missing
    x <- x[keep]
    y <- y[keep]
  }
  sum(!((x != y) %in% FALSE))
}
This function lets you choose whether to count missing values as "different" (na.rm = FALSE) or to ignore them. With na.rm = TRUE, if the vectors differ in length, the longer one is effectively truncated.
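For example, with the vectors from above:
hamming(c(1, 8, 5, NA, 5), 1:5)                  # 2: the NA position is ignored
hamming(c(1, 8, 5, NA, 5), 1:5, na.rm = FALSE)   # 3: the NA counts as "different"
hamming(1:3, 1:5)                                # 0: positions 4 and 5 are dropped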
