Generate random number of missing values in R

I have a data frame like this:
df<-data.frame(time1=rbinom(100,1,0.3),
time2=rbinom(100,1,0.4),
time3=rbinom(100,1,0.5),
time4=rbinom(100,1,0.6))
How could I generate random missing values for each time variable, with up to 20% missing? That is, in this case the total number of missing values should be less than 20 in each column, and they should be missing at random across subjects (rows).

You could do:
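insert_nas <- function(x) {
  len <- length(x)
  # number of NAs for this column: between 1 and 20% of its length
  n <- sample(1:floor(0.2*len), 1)
  # positions that will be set to NA
  i <- sample(1:len, n)
  x[i] <- NA
  x
}
# lapply() plus as.data.frame() keeps the result a data frame;
# sapply() would return a matrix instead
df2 <- as.data.frame(lapply(df, insert_nas))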
df2
This gives you at most 20% missing values per column:
colSums(is.na(df2)) / nrow(df2)
time1 time2 time3 time4
0.09 0.16 0.19 0.14

Here's one way:
as.data.frame(lapply(df, function(x)
"is.na<-"(x, sample(seq(x), floor(length(x) * runif(1, 0, .2))))))

Something like this, you mean?
nomissing <- sample(1:20,1)
testnos <- rbinom(100 - nomissing,1,0.3)
testnas <- rep(NA,nomissing)
testmix <- sample(x = c(testnos,testnas),100)
Output:
> testmix
[1] 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0
[37] 1 0 0 0 1 1 0 1 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 1 0 1 0 0 1 1 1 NA 0 1 0 0
[73] 0 0 1 1 0 0 1 0 0 1 1 0 0 NA 1 0 0 0 0 0 1 0 NA NA 1 0 0 0
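For a variant that keeps the result a data frame and guarantees strictly fewer than 20 missing values per column (zero included), a minimal sketch could look like the following. The helper name add_nas, its max_n argument, and the seed are assumptions made for illustration; they do not come from any of the answers.

```r
set.seed(42)  # assumed seed, only so the example is reproducible

df <- data.frame(time1 = rbinom(100, 1, 0.3),
                 time2 = rbinom(100, 1, 0.4),
                 time3 = rbinom(100, 1, 0.5),
                 time4 = rbinom(100, 1, 0.6))

# Hypothetical helper: set between 0 and max_n randomly chosen entries to NA
add_nas <- function(x, max_n) {
  n <- sample.int(max_n + 1L, 1) - 1L   # uniform on 0, 1, ..., max_n
  x[sample.int(length(x), n)] <- NA     # n distinct random positions
  x
}

df_na <- as.data.frame(lapply(df, add_nas, max_n = 19L))
colSums(is.na(df_na))  # every count is at most 19
```

Passing the cap as a count rather than a proportion avoids rounding questions when the column length is not a multiple of 5.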

Related

Drawing conditional combinations of a binary vector one by one

I am trying to write a routine that conditionally finds combinations of a binary vector. For example, consider the following vector:
> A <- rep(c(1,0,0),3)
> A
[1] 1 0 0 1 0 0 1 0 0
Note that the length of the vector A is always a multiple of 3, so the following condition always holds:
length(A) %% 3 == 0
The main condition is that there must be exactly one 1 in each consecutive group of 3 elements. In this example, one element of A[1:3] will be 1, one element of A[4:6] will be 1, one element of A[7:9] will be 1, and the rest are all 0. For this example there are therefore 3^3 = 27 possible combinations.
The objective is to write a routine that returns the next valid combination until all legal combinations have been returned.
Note that I am not looking for a table of all possible combinations. That solution is already available in my other question on StackOverflow. With that method, however, I run into memory problems once A is longer than 45 elements, because it returns the full matrix, which is huge. So instead of storing the full matrix, I want to retrieve one combination at a time and decide later whether to store it.
What the OP is after is an iterator. If we were to do this properly, we would write a class in C++ with a get_next method, and expose this to R. As it stands, with base R, since everything is passed by value, we must call a function on our object-to-be-updated and reassign the object-to-be-updated every time.
Here is a very crude implementation:
get_next <- function(comb, v, m) {
    s <- seq(1L, length(comb), length(v))
    e <- seq(length(v), length(comb), length(v))
    last_comb <- rev(v)
    can_be_incr <- sapply(seq_len(m), function(x) {
        !identical(comb[s[x]:e[x]], last_comb)
    })
    if (all(!can_be_incr)) {
        return(FALSE)
    } else {
        idx <- which(can_be_incr)[1L]
        span <- s[idx]:e[idx]
        j <- which(comb[span] == 1L)
        comb[span[j]] <- 0L
        comb[span[j + 1L]] <- 1L
        if (idx > 1L) {
            ## Reset previous maxed out sections
            for (i in 1:(idx - 1L)) {
                comb[s[i]:e[i]] <- v
            }
        }
    }
    return(comb)
}
And here is a simple usage:
m <- 3L
v <- as.integer(c(1,0,0))
comb <- rep(v, m)
count <- 1L
while (!is.logical(comb)) {
    cat(count, ": ", comb, "\n")
    comb <- get_next(comb, v, m)
    count <- count + 1L
}
1 : 1 0 0 1 0 0 1 0 0
2 : 0 1 0 1 0 0 1 0 0
3 : 0 0 1 1 0 0 1 0 0
4 : 1 0 0 0 1 0 1 0 0
5 : 0 1 0 0 1 0 1 0 0
6 : 0 0 1 0 1 0 1 0 0
7 : 1 0 0 0 0 1 1 0 0
8 : 0 1 0 0 0 1 1 0 0
9 : 0 0 1 0 0 1 1 0 0
10 : 1 0 0 1 0 0 0 1 0
11 : 0 1 0 1 0 0 0 1 0
12 : 0 0 1 1 0 0 0 1 0
13 : 1 0 0 0 1 0 0 1 0
14 : 0 1 0 0 1 0 0 1 0
15 : 0 0 1 0 1 0 0 1 0
16 : 1 0 0 0 0 1 0 1 0
17 : 0 1 0 0 0 1 0 1 0
18 : 0 0 1 0 0 1 0 1 0
19 : 1 0 0 1 0 0 0 0 1
20 : 0 1 0 1 0 0 0 0 1
21 : 0 0 1 1 0 0 0 0 1
22 : 1 0 0 0 1 0 0 0 1
23 : 0 1 0 0 1 0 0 0 1
24 : 0 0 1 0 1 0 0 0 1
25 : 1 0 0 0 0 1 0 0 1
26 : 0 1 0 0 0 1 0 0 1
27 : 0 0 1 0 0 1 0 0 1
Note that this implementation is memory efficient, but it will be very slow.
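Another way to avoid storing the full matrix is to compute the k-th combination directly from its index, treating the position of the 1 in each group as one digit of a base-3 number. The sketch below is only an illustration under that assumption; kth_comb and its width argument are hypothetical names, not part of the answer above, and the ordering is chosen so that it matches the listing shown.

```r
# Hypothetical helper: build the k-th combination (k = 1, ..., width^m)
# without looking at any previous combination.
kth_comb <- function(k, m, width = 3L) {
  # base-`width` digits of k - 1, least significant digit first;
  # digit g gives the position of the 1 inside group g
  digits <- ((k - 1) %/% width^(seq_len(m) - 1L)) %% width
  comb <- integer(m * width)
  comb[(seq_len(m) - 1L) * width + digits + 1L] <- 1L
  comb
}

kth_comb(1, 3)   # 1 0 0 1 0 0 1 0 0
kth_comb(4, 3)   # 1 0 0 0 1 0 1 0 0
kth_comb(27, 3)  # 0 0 1 0 0 1 0 0 1
```

Because each call depends only on k, combinations can be drawn in any order or in parallel. The index arithmetic uses doubles, so it is exact only while width^m stays within the range where doubles represent integers exactly.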

How might I generalize this code to fit larger sets of data?

I want to extend the usability of the code I've written. Even better, I would like to generalize it for future use.
I am using RStudio. I have recoded a 100-dimensional vector. Values 1-10 have been converted to rows of an identity matrix. For instance, all values of 1 are now vectors that read 1 0 0 0 0 0 0 0 0 0, all values of 2 now read 0 1 0 0 0 0 0 0 0 0, and so on. Here is the code:
tens <- seq(from=1, to=10, by=1)
y <- sample(tens, size=100, replace=TRUE)
y
num.its <- 100
Y <- rep(0,num.its*10)
dim(Y) <- c(num.its,10)
I <- diag(10)
for(i in 1:100){
  if(y[i]==1){
    Y[i,] <- I[1,]
  } else if (y[i]==2){
    Y[i,] <- I[2,]
  } else if (y[i]==3){
    Y[i,] <- I[3,]
  } else if (y[i]==4){
    Y[i,] <- I[4,]
  } else if (y[i]==5){
    Y[i,] <- I[5,]
  } else if (y[i]==6){
    Y[i,] <- I[6,]
  } else if (y[i]==7){
    Y[i,] <- I[7,]
  } else if (y[i]==8){
    Y[i,] <- I[8,]
  } else if (y[i]==9){
    Y[i,] <- I[9,]
  } else {
    Y[i,] <- I[10,]
  }
}
The code works as planned. However, if I had to recode values of 1-2000, I would rather not write 2000 else if statements. Any help would be appreciated. Thank you!
A pretty decent one-liner is the following:
# sample data
set.seed(1234)
x <- c(1:5, sample(10L, 6))
Our vector is
x
[1] 1 2 3 4 5 10 6 5 4 1 8
Then, convert x to a factor variable, specifying the desired levels, and use model.matrix to get a matrix of your desired vectors.
model.matrix(~ . + 0, data.frame(x=factor(x, levels=1:10)))
This returns
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
1 1 0 0 0 0 0 0 0 0 0
2 0 1 0 0 0 0 0 0 0 0
3 0 0 1 0 0 0 0 0 0 0
4 0 0 0 1 0 0 0 0 0 0
5 0 0 0 0 1 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 1
7 0 0 0 0 0 1 0 0 0 0
8 0 0 0 0 1 0 0 0 0 0
9 0 0 0 1 0 0 0 0 0 0
10 1 0 0 0 0 0 0 0 0 0
11 0 0 0 0 0 0 0 1 0 0
attr(,"assign")
[1] 1 1 1 1 1 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$x
[1] "contr.treatment"
Here, the rows represent what you want. You can use t to convert this to columns if desired. Note also that even though 7 is missing in x, that column is present in the matrix.
You could use the function dummy from the package dummy:
dummy::dummy(data.frame(x=factor(x)))
x_1 x_2 x_3 x_4 x_5 x_6 x_8 x_9
1 1 0 0 0 0 0 0 0
2 0 1 0 0 0 0 0 0
3 0 0 1 0 0 0 0 0
4 0 0 0 1 0 0 0 0
5 0 0 0 0 1 0 0 0
6 0 1 0 0 0 0 0 0
7 0 0 0 0 0 1 0 0
8 0 0 0 0 1 0 0 0
9 0 0 0 0 0 0 1 0
10 0 0 0 0 0 0 0 1
11 0 0 0 1 0 0 0 0
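A base-R alternative that needs no factor or extra package is matrix indexing: fill a zero matrix and set entry (i, y[i]) to 1 for every row i. This is a minimal sketch, assuming the values are integers between 1 and k; the function name one_hot and its k argument are hypothetical.

```r
# Hypothetical helper: one row per element of y, one column per value 1..k
one_hot <- function(y, k = max(y)) {
  Y <- matrix(0L, nrow = length(y), ncol = k)
  Y[cbind(seq_along(y), y)] <- 1L   # a two-column index matrix addresses (row, column) pairs
  Y
}

y <- c(3L, 1L, 2L, 3L)
Y <- one_hot(y, k = 4L)
Y
```

The same matrix can be obtained as diag(k)[y, ], which simply picks row y[i] of the identity matrix for each i; either form scales to 2000 levels without any if/else chain.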

R - Creating a new column within a data frame when two or more columns are a match in a row

I'm currently stuck on a part of my code that feels intuitive, but I can't figure out a way to do it. I have a very big data frame (nrows = 34036, ncol = 43) in which I want to create a continuous sequence of the variables where the value of the row is 1 (without having multiple columns with 1). It consists only of zeros and ones, similar to the following:
A B C D
1 0 0 0
0 0 0 1
0 0 0 1
0 0 0 0
0 0 0 0
1 0 1 0
1 0 1 0
0 1 0 0
0 1 0 0
1 0 0 1
I was able to remove the zeroes using:
#find the sum of each row
placeholderData <- transform(placeholderData, sum=rowSums(placeholderData))
placeholderData <- placeholderData[!(placeholderData$sum <= 0),]
And the data frame now looks like:
A B C D sum
1 0 0 0 1
0 0 0 1 1
0 0 0 1 1
1 0 1 0 2
1 0 1 0 2
0 1 0 0 1
0 1 0 0 1
1 0 0 1 2
My main problem comes when there are two or more 1's in a row. To try to solve this, I used the following code to identify the columns that have a sum of 2 or more:
placeholderData$Matches <- lapply(apply(placeholderData == 1, 1, which), names)
Which added the following column to the data frame:
A B C D sum Matches
1 0 0 0 1 A
0 0 0 1 1 D
0 0 0 1 1 D
1 0 1 0 2 c("A","C")
1 0 1 0 2 c("A","C")
0 1 0 0 1 B
0 1 0 0 1 B
1 0 0 1 2 c("A", "D")
I added the Matches column as an approach to solving the problem, but I'm not sure how I would do it without using a lot of logical operators (I don't know in advance which columns have matches). What I would like to do is move the rows that have two or more 1's into a new column, to get a data frame like this:
A B C D AC AD sum Matches
1 0 0 0 0 0 1 A
0 0 0 1 0 0 1 D
0 0 0 1 0 0 1 D
0 0 0 0 1 0 1 c("A","C")
0 0 0 0 1 0 1 c("A","C")
0 1 0 0 0 0 1 B
0 1 0 0 0 0 1 B
0 0 0 0 0 1 1 c("A", "D")
Then, I would be able to use my code as normal (It works just fine when there are no repeated values in rows). I tried searching to find similar questions, but I'm not sure if I was even asking the right question. I was wondering if anyone could provide some help or some ideas that I could try.
Thank you very much!
This seems a lot like making dummy variables, so I would use the model.matrix function commonly used for dummy variables (one-hot encoding):
m = read.table(header = TRUE, text = "A B C D
1 0 0 0
0 0 0 1
0 0 0 1
0 0 0 0
0 0 0 0
1 0 1 0
1 0 1 0
0 1 0 0
0 1 0 0
1 0 0 1")
m = m[rowSums(m) > 0, ]
d = factor(sapply(apply(m == 1, 1, which), function(x) paste(names(m)[x], collapse = "")))
result = data.frame(model.matrix(~ d + 0))
names(result) = levels(d)
# A AC AD B D
# 1 1 0 0 0 0
# 2 0 0 0 0 1
# 3 0 0 0 0 1
# 4 0 1 0 0 0
# 5 0 1 0 0 0
# 6 0 0 0 1 0
# 7 0 0 0 1 0
# 8 0 0 1 0 0
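One thing to watch with apply(m == 1, 1, which) is that apply only returns a list when the rows contain differing numbers of 1's; if every row happened to contain the same number, the result would be simplified to a vector or matrix. A sketch that builds the combination label inside a single apply call sidesteps this. The names key, lev and onehot are illustrative, and the data are the example rows after removing the all-zero ones.

```r
m <- data.frame(A = c(1, 0, 0, 1, 1, 0, 0, 1),
                B = c(0, 0, 0, 0, 0, 1, 1, 0),
                C = c(0, 0, 0, 1, 1, 0, 0, 0),
                D = c(0, 1, 1, 0, 0, 0, 0, 1))

# One label per row: the names of the columns that contain a 1, pasted together
key <- unname(apply(m == 1, 1, function(r) paste(names(m)[r], collapse = "")))

# One indicator column per distinct label
lev <- sort(unique(key))
onehot <- +outer(key, lev, "==")   # unary plus turns TRUE/FALSE into 1/0
colnames(onehot) <- lev
onehot
```

Each row of onehot then contains exactly one 1, which is the shape the question asks for.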

Building a symmetric binary matrix

I have a matrix like this, for example:
rownames V1
a 1
c 3
b 2
d 4
y 2
q 4
i 1
j 1
r 3
I want to build a symmetric binary matrix whose dimnames are the row names of the matrix above. An entry should be 1 when the two variables have the same number next to them, and 0 otherwise. The matrix would look like this:
dimnames
a c b d y q i j r
a 1 0 0 0 0 0 1 1 0
c 0 1 0 0 0 0 0 0 1
b 0 0 1 0 1 0 0 0 0
d 0 0 0 1 0 1 0 0 0
y 0 0 1 0 1 0 0 0 0
q 0 0 0 1 0 1 0 0 0
i 1 0 0 0 0 0 1 1 0
j 1 0 0 0 0 0 1 1 0
r 0 1 0 0 0 0 0 0 1
Does anybody know how I can do that?
Use dist:
DF <- read.table(text = "rownames V1
a 1
c 3
b 2
d 4
y 2
q 4
i 1
j 1
r 3", header = TRUE)
res <- as.matrix(dist(DF$V1)) == 0L
#alternatively:
#res <- !as.matrix(dist(DF$V1))
#diag(res) <- 0L #for the first version of the question, i.e. a zero diagonal
res <- +(res) #for the second version, i.e. to coerce to an integer matrix
dimnames(res) <- list(DF$rownames, DF$rownames)
# 1 2 3 4 5 6 7 8 9
#1 1 0 0 0 0 0 1 1 0
#2 0 1 0 0 0 0 0 0 1
#3 0 0 1 0 1 0 0 0 0
#4 0 0 0 1 0 1 0 0 0
#5 0 0 1 0 1 0 0 0 0
#6 0 0 0 1 0 1 0 0 0
#7 1 0 0 0 0 0 1 1 0
#8 1 0 0 0 0 0 1 1 0
#9 0 1 0 0 0 0 0 0 1
You can do this using table and tcrossprod.
tcrossprod(table(DF))
# rownames
# rownames a b c d i j q r y
# a 1 0 0 0 1 1 0 0 0
# b 0 1 0 0 0 0 0 0 1
# c 0 0 1 0 0 0 0 1 0
# d 0 0 0 1 0 0 1 0 0
# i 1 0 0 0 1 1 0 0 0
# j 1 0 0 0 1 1 0 0 0
# q 0 0 0 1 0 0 1 0 0
# r 0 0 1 0 0 0 0 1 0
# y 0 1 0 0 0 0 0 0 1
If you want the row and column order as they are found in the data, rather than alphanumerically, you can subset
tcrossprod(table(DF))[DF$rownames, DF$rownames]
or use factor
tcrossprod(table(factor(DF$rownames, levels=unique(DF$rownames)), DF$V1))
If your data is large or sparse, you can use the sparse matrix algebra in xtabs, with similar ways to change the order of the resulting table as before.
Matrix::tcrossprod(xtabs(data=DF, ~ rownames + V1, sparse=TRUE))
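For small inputs, the same matrix can also be sketched with outer, which compares every pair of values directly. This is an illustration rather than a replacement for the answers above: the named vector v is an assumed re-encoding of the example data, and the approach builds a dense n-by-n matrix, so it does not suit large inputs.

```r
# Example data as a named vector (names play the role of the row names)
v <- c(a = 1, c = 3, b = 2, d = 4, y = 2, q = 4, i = 1, j = 1, r = 3)

# TRUE where two entries share the same value; unary plus converts to 1/0
res <- +outer(v, v, "==")
dimnames(res) <- list(names(v), names(v))   # set explicitly, in data order
res
```

The rows and columns come out in the order of the data, so no reordering step is needed.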

R: column reference to itself

Please, help!
I have w:
x y
0 0
0 0
0 0
0 1
0 0
0 0
0 -1
0 0
0 0
0 1
0 0
0 -1
0 0
0 0
I would like to get:
x y
0 0
0 0
0 0
1 1
1 0
1 0
0 -1
0 0
0 0
1 1
1 0
0 -1
0 0
0 0
I use R:
for (i in 2:length(w$x)) { w$x[i] = w$x[i-1] + w$y[i]}
Is it possible to do this without a loop?
Thank you!
This assumes that you want to start with the initial value of 0 in the x column:
transform(w, x = cumsum(y))
## x y
## 1 0 0
## 2 0 0
## 3 0 0
## 4 1 1
## 5 1 0
## 6 1 0
## 7 0 -1
## 8 0 0
## 9 0 0
## 10 1 1
## 11 1 0
## 12 0 -1
## 13 0 0
## 14 0 0
Otherwise you can include the initial value:
transform(w, x = x[1] + cumsum(y))
The result here is the same.
Both of these assume that either y[1] is zero, or that you want to use the actual value if it is nonzero (your code ignores y[1]).
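To reproduce the loop exactly, including the fact that it never reads y[1], the first element of y can be replaced by 0 before taking the cumulative sum. The following sketch compares the two on the example data; loop_x and vec_x are illustrative names.

```r
w <- data.frame(x = 0,
                y = c(0, 0, 0, 1, 0, 0, -1, 0, 0, 1, 0, -1, 0, 0))

# Loop version from the question
loop_x <- w$x
for (i in 2:length(loop_x)) {
  loop_x[i] <- loop_x[i - 1] + w$y[i]
}

# Vectorised version: start from x[1] and skip y[1], as the loop does
vec_x <- w$x[1] + cumsum(c(0, w$y[-1]))

identical(loop_x, vec_x)
```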

Resources