Sum observations from two columns, looping over many columns in R - r

I have searched high and low, but am stuck on how to approach this. I have two sets of columns that I want to sum, row by row, and I want to do this across many columns. If I were to do this manually, I would want:
df1[1,1]+df2[1,1]
df1[2,1]+df2[2,1]
etc... I've found many helpful examples on how to do something like:
apply(df[,c("a","d")], 1, sum)
though I want to do this over lots of columns. Also, while it's not entirely relevant, I want to phrase my question as close to my reality as possible, so my example below includes NA's, since my actual data contains many missing values.
# make a data frame, df1, with three columns
a <- sample(1:100, 50, replace = T)
b <- sample(100:300, 50, replace = T)
c <- sample(2:50, 50, replace = T)
df1 <- cbind(a,b,c)
# make another data frame, df2, with three columns
x <- sample(1:100, 50, replace = T)
y <- sample(100:300, 50, replace = T)
z <- sample(2:50, 50, replace = T)
df2 <- cbind(x,y,z)
Make it possible to randomly throw a few NAs in, function from http://www.r-bloggers.com/function-to-generate-a-random-data-set/
NAins <- NAinsert <- function(df, prop = .1){
n <- nrow(df)
m <- ncol(df)
num.to.na <- ceiling(prop*n*m)
id <- sample(0:(m*n-1), num.to.na, replace = FALSE)
rows <- id %/% m + 1
cols <- id %% m + 1
sapply(seq(num.to.na), function(x){
df[rows[x], cols[x]] <<- NA
}
)
return(df)
}
Add the NAs to the frames
df1 <- NAins(df1, .2)
df2 <- NAins(df2, .14)
Then, I tried to seq along the columns in each data frame, and used apply setting the index to 1, meaning to sum each row entry. This doesn't work.
for(i in seq_along(df1)){
for(j in seq_along(df2)){
apply(c(df1[,i], col2[j]), 1, function(x) sum(x, na.rm = T))}}
Thanks for any help!

You should be able to just replace NA with 0, and then add with "+":
replace(df1, is.na(df1), 0) + replace(df2, is.na(df2), 0)
# X Y Z
# 1 7 19 6
# 2 11 12 1
# 3 16 14 11
# 4 13 7 13
# 5 10 2 11
Alternatively, if you have more than just two data.frames, you can collect them in a list and use Reduce:
Reduce("+", lapply(mget(c("df1", "df2", "df3")), function(x) replace(x, is.na(x), 0)))
Here's some sample data (and what I think is an easier way to create it):
set.seed(1) ## Set a seed so others can reproduce your sample data
dfmaker <- function() {
setNames(
data.frame(
replicate(3, sample(c(NA, 1:10), 5, TRUE), FALSE)),
c("X", "Y", "Z"))
}
df1 <- dfmaker()
df1
# X Y Z
# 1 2 9 2
# 2 4 10 1
# 3 6 7 7
# 4 9 6 4
# 5 2 NA 8
df2 <- dfmaker()
df2
# X Y Z
# 1 5 10 4
# 2 7 2 NA
# 3 10 7 4
# 4 4 1 9
# 5 8 2 3
df3 <- dfmaker()
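As a small worked check of the replace-and-add idea above, here is a sketch on toy matrices (the question builds its inputs with cbind, which yields matrices rather than data frames). The inputs m1 and m2 and the helper zero_na are made up for illustration and are not from the question.

```r
# Hypothetical toy inputs with a few NAs (not the data from the question)
m1 <- cbind(a = c(1, NA, 3), b = c(10, 20, NA))
m2 <- cbind(x = c(NA, 5, 6), y = c(1, NA, NA))

# Illustrative helper: treat missing values as 0 before adding
zero_na <- function(m) replace(m, is.na(m), 0)

# Element-wise sum, row by row and column by column
total <- zero_na(m1) + zero_na(m2)
total
```

Keep in mind that a cell that is NA in both inputs comes out as 0 with this approach, not NA.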

You can combine the data frames into an array and sum over it using the apply function.
install.packages('abind')
library(abind)
df <- abind(list(df1,df2), along = 3)
results <- apply(df, MARGIN = c(1,2), FUN = function(x) sum(x, na.rm = TRUE))
results
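If installing abind is not an option, the same stacking idea can probably be sketched in base R with array() and rowSums(). This swaps abind for base functions and is only an illustration; m1 and m2 are made-up inputs, and both are assumed to have the same dimensions.

```r
# Hypothetical toy inputs of identical shape (assumption)
m1 <- cbind(a = c(1, NA, 3), b = c(10, 20, NA))
m2 <- cbind(x = c(NA, 5, 6), y = c(1, NA, NA))

# Stack the two matrices along a third dimension
arr <- array(c(m1, m2), dim = c(dim(m1), 2))

# dims = 2 keeps the first two dimensions and sums over the third
results <- rowSums(arr, na.rm = TRUE, dims = 2)
results
```

For data frames, wrapping each input in as.matrix() before stacking should work the same way.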


Add/match rows with NA to matrix based on missing unique IDs

I am using a panel data set and intend to model this as a dynamic affiliation network using SAOMs. The data is unfortunately very messy and a pain to deal with.
I have managed to create adjacency matrices for each panel wave. However, over time the panel grew in size / people left. I need the number of rows in each matrix to be the same and in the same order according to the unique IDs, which are present when inspecting the objects in R. All "added IDs" should show 10s across the whole row.
Here is a reproducible example that should make the issue clear and also shows what I aim for. I assume this can be solved by smart use of the merge() function, but I could not get it to work:
wave1 <- matrix(c(0,0,1,1,0,1,1,0,1,1), nrow = 5, ncol = 2, dimnames = list(c("1","2","4","5","9"), c("group1","group2")))
wave2 <- matrix(c(0,1,1,0,1,0,1,1), nrow = 4, ncol = 2, dimnames = list(c("1","4","8","9"), c("group1","group2")))
wave1_c <- matrix(c(0,0,1,1,10,0,1,1,0,0,10,1), nrow = 6, ncol = 2, dimnames = list(c("1","2","4","5","8","9"), c("group1","group2")))
wave2_c <- matrix(c(0,10,1,10,1,0,1,10,0,10,1,1), nrow = 6, ncol = 2, dimnames = list(c("1","2","4","5","8","9"), c("group1","group2")))
Thanks in advance. Numbers in the matrices are arbitrary except for the 10s.
Solution in base R using dataframes and merge.
Merge and outer join.
dwave1_c <- merge(wave1, wave2, by = 'row.names', all = TRUE, suffixes="")[2:3]
dwave2_c <- merge(wave2, wave1, by = 'row.names', all = TRUE, suffixes="")[2:3]
dwave1_c[is.na(dwave1_c)] <- 10
dwave2_c[is.na(dwave2_c)] <- 10
as.matrix(dwave1_c)
as.matrix(dwave2_c)
Update.
both <- merge(wave1, wave2, by = 'row.names', all = TRUE)
Output.
Row.names group1.x group2.x group1.y group2.y
1 1 0 1 0 1
2 2 0 1 NA NA
3 4 1 0 1 0
4 5 1 1 NA NA
5 8 NA NA 1 1
6 9 0 1 0 1
dwave1_c <- both[,2:3]; colnames(dwave1_c) <- colnames(wave1)
dwave2_c <- both[,4:5]; colnames(dwave2_c) <- colnames(wave2)
dwave1_c[is.na(dwave1_c)] <- 10
dwave2_c[is.na(dwave2_c)] <- 10
Show result.
as.matrix(dwave1_c)
as.matrix(dwave2_c)
First try.
## Convert matrix to dataframe.
df1 <- as.data.frame(wave1)
df2 <- as.data.frame(wave2)
## Merge df1 and df2 by row name.
m_df1_df2 <- merge(df1, df2, by = 'row.names', all = TRUE)
rownames(m_df1_df2) <- m_df1_df2$Row.names
# Rows not in df1, but in df2,
# rows not in df2, but in df1
not1_2 <- m_df1_df2[is.na(m_df1_df2$group1.x),][c("group1.x", "group2.x")] # not in df1, in df2
not2_1 <- m_df1_df2[is.na(m_df1_df2$group1.y),][c("group1.y", "group2.y")] # not in df2, in df1
## Same column names.
colnames(not1_2) <- colnames(df1)
colnames(not2_1) <- colnames(df2)
## append
df1_c <- rbind(df1, not1_2)
df2_c <- rbind(df2, not2_1)
## order by row name
df1_c <- df1_c[order(row.names(df1_c)), ]
df2_c <- df2_c[order(row.names(df2_c)), ]
## replace NA by 10
df1_c[is.na(df1_c)] <- 10
df2_c[is.na(df2_c)] <- 10
as.matrix(df1_c)
as.matrix(df2_c)
The conversion of wave1 and wave2 to data frames in my first attempt is redundant and can be omitted, though at the expense of implicit coercions.
## merge wave1 and wave2 by row name.
m_df1_df2 <- merge(wave1, wave2, by = 0, all = TRUE)
rownames(m_df1_df2) <- m_df1_df2$Row.names
# rows not in set 1, but in set 2,
# rows not in set 2, but in set 1.
not1_2 <- m_df1_df2[is.na(m_df1_df2$group1.x),][c("group1.x", "group2.x")]
not2_1 <- m_df1_df2[is.na(m_df1_df2$group1.y),][c("group1.y", "group2.y")]
## Same column names.
colnames(not1_2) <- colnames(wave1)
colnames(not2_1) <- colnames(wave2)
## append.
wave1_c <- rbind(wave1, not1_2)
wave2_c <- rbind(wave2, not2_1)
## order by row name.
wave1_c <- wave1_c[order(row.names(wave1_c)), ]
wave2_c <- wave2_c[order(row.names(wave2_c)), ]
## replace NA by 10.
wave1_c[is.na(wave1_c)] <- 10
wave2_c[is.na(wave2_c)] <- 10
## show result.
wave1_c
wave2_c
Solution using setdiff.
## rownames not in set 1, but in set 2,
## rownames not in set 2, but in set 1.
rn_not2_1 <- setdiff(rownames(wave1), rownames(wave2))
rn_not1_2 <- setdiff(rownames(wave2), rownames(wave1))
## missing rows to add.
add_to_1 <- wave2[rn_not1_2,,drop=FALSE]
add_to_2 <- wave1[rn_not2_1,,drop=FALSE]
add_to_1[,] <- 10
add_to_2[,] <- 10
## append.
wave1_c <- rbind(wave1, add_to_1)
wave2_c <- rbind(wave2, add_to_2)
## order by row name.
wave1_c <- wave1_c[order(row.names(wave1_c)), ]
wave2_c <- wave2_c[order(row.names(wave2_c)), ]
## show result.
wave1_c
wave2_c
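With more than two waves, the setdiff idea can be generalised by padding every wave to the union of all IDs. The following is only a sketch: pad_wave is a hypothetical helper name, and it assumes the row names are numeric IDs stored as strings and that all waves share the same columns.

```r
wave1 <- matrix(c(0,0,1,1,0,1,1,0,1,1), nrow = 5, ncol = 2,
                dimnames = list(c("1","2","4","5","9"), c("group1","group2")))
wave2 <- matrix(c(0,1,1,0,1,0,1,1), nrow = 4, ncol = 2,
                dimnames = list(c("1","4","8","9"), c("group1","group2")))

# Hypothetical helper: start from a matrix full of the fill value,
# then copy the observed rows in by name
pad_wave <- function(w, ids, fill = 10) {
  out <- matrix(fill, nrow = length(ids), ncol = ncol(w),
                dimnames = list(ids, colnames(w)))
  out[rownames(w), ] <- w
  out
}

# Union of all IDs, ordered numerically (assumes numeric-looking IDs)
all_ids <- union(rownames(wave1), rownames(wave2))
all_ids <- all_ids[order(as.numeric(all_ids))]

padded <- lapply(list(wave1 = wave1, wave2 = wave2), pad_wave, ids = all_ids)
padded$wave1
padded$wave2
```

Ordering by as.numeric() rather than by the raw strings avoids "10" sorting before "2" once the IDs have more than one digit.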

Assign() to specific indices of vectors, vectors specified by string names

I'm trying to assign values to specific indices of a long list of vectors (in a loop), where each vector is specified by a string name. The naive approach
testVector1 <- c(0, 0, 0)
vectorName <- "testVector1"
indexOfInterest <- 3
assign(x = paste0(vectorName, "[", indexOfInterest, "]"), value = 1)
doesn't work, instead it creates a new vector "testVector1[3]" (the goal was to change the value of testVector1 to c(0, 0, 1)).
I know the problem is solvable by overwriting the whole vector:
temporaryVector <- get(x = vectorName)
temporaryVector[indexOfInterest] <- 1
assign(x = vectorName, value = temporaryVector)
but I was hoping for a more direct approach.
Is there some alternative to assign() that solves this?
Similarly, is there a way to assign values to specific elements of columns in data frames, where both the data frames and columns are specified by string names?
If you must do this you can do it with eval(parse()):
valueToAssign <- 1
stringToParse <- paste0(
vectorName, "[", indexOfInterest, "] <- ", valueToAssign
)
eval(parse(text = stringToParse))
testVector1
# [1] 0 0 1
But this is not recommended. Better to put the desired objects in a named list, e.g.:
testVector1 <- c(0, 0, 0)
dat <- data.frame(a = 1:5, b = 2:6)
l <- list(
testVector1 = testVector1,
dat = dat
)
Then you can assign to them by name or index:
vectorName <- "testVector1"
indexOfInterest <- 3
dfName <- "dat"
colName <- "a"
rowNum <- 3
valueToAssign <- 1
l[[vectorName]][indexOfInterest] <- valueToAssign
l[[dfName]][rowNum, colName] <- valueToAssign
l
# $testVector1
# [1] 0 0 1
# $dat
# a b
# 1 1 2
# 2 2 3
# 3 1 4
# 4 4 5
# 5 5 6
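If the vectors really have to stay as separate objects, the get/modify/assign steps from the question can at least be wrapped in a small function. This is a sketch only: set_by_name is a made-up helper, and it assumes the target object is a plain vector living in the calling environment.

```r
# Hypothetical helper wrapping the get -> modify -> assign pattern
set_by_name <- function(name, index, value, envir = parent.frame()) {
  obj <- get(name, envir = envir)   # fetch the object by its string name
  obj[index] <- value               # change only the requested element(s)
  assign(name, obj, envir = envir)  # write the whole object back
  invisible(obj)
}

testVector1 <- c(0, 0, 0)
set_by_name("testVector1", 3, 1)
testVector1
```

This still overwrites the whole vector under the hood; it only hides the temporary variable. The named-list approach above remains the more idiomatic choice.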

R. Create a column that is the min() value of my row [duplicate]

I'm trying to calculate the minimum across multiple columns (row-wise min) in a data frame, but the min function returns a single minimum over all the values in the columns rather than one for each row separately. I'm sure I'm missing something really simple here? Any ideas much appreciated.
x <- c(1,2,7)
y <- c(1,5,4)
minIwant <- c(1,2,4)
df <- data.frame(x, y, minIwant)
df$minIget <- min(df$x,df$y)
df
x y minIwant minIget
1 1 1 1 1
2 2 5 2 1
3 7 4 4 1
You can use apply to go through each row
apply(df, 1, FUN = min)
Where 1 means to apply FUN to each row of df, 2 would mean to apply FUN to columns.
To remove missing values, use:
apply(df, 1, FUN = min, na.rm = TRUE)
We could use pmin, which finds the parallel minima of sets of values. Since our df is technically a list, we will need to run it via do.call.
df$min <- do.call(pmin, df)
which gives
df
# x y min
# 1 1 1 1
# 2 2 5 2
# 3 7 4 4
Data:
df <- data.frame(x = c(1, 2, 7), y = c(1, 5, 4))
Furthermore, if na.rm = TRUE is needed, you can do
df$min <- do.call(pmin, c(df, na.rm = TRUE))
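When the data frame also holds non-numeric columns, it is probably safest to restrict both the apply and the pmin approach to the columns of interest. A minimal sketch, where the id column and the num_cols vector are assumptions added for illustration:

```r
# Toy data with an extra character column and one missing value (assumption)
df <- data.frame(id = c("a", "b", "c"),
                 x = c(1, 2, 7),
                 y = c(1, 5, NA),
                 stringsAsFactors = FALSE)

num_cols <- c("x", "y")  # columns to take the row-wise minimum over

# apply over rows of the numeric subset only
df$min_apply <- apply(df[, num_cols], 1, min, na.rm = TRUE)

# pmin over the same subset; a data frame subset is a list of columns
df$min_pmin <- do.call(pmin, c(df[num_cols], na.rm = TRUE))
df
```

Without the subsetting, apply would coerce the whole data frame to a character matrix and compare strings instead of numbers.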
Just want to add on how you can also do this with dplyr.
library(dplyr)
x<-c(1,2,7)
y<-c(1,5,4)
df <- data.frame(x,y)
df %>% rowwise() %>% mutate(minIget = min(x, y))
# A tibble: 3 x 3
x y minIget
<dbl> <dbl> <dbl>
1 1. 1. 1.
2 2. 5. 2.
3 7. 4. 4.
We could also use rowMins from library(matrixStats)
library(matrixStats)
df$minIwant <- rowMins(as.matrix(df))

How to merge several columns of the same dataframe?

I have one big data frame containing different measurements performed by several probes.
The timing of the measurements is not exactly the same. As I want to compare both measurements at a given time and plot them in an animation, I need my data to be "synchronized".
Here is an example of the dataframe I get (in real life I have way more columns that I read directly from a text file):
time1.in.s <- seq(0.010, 100, length.out = 100)
time2.in.s <- seq(0.022, 100, length.out = 100)
data1 <- seq(-10, 100, length.out = 100)
data2 <- seq(-25, 80, length.out = 100)
my.df <- data.frame(time1.in.s, data1, time2.in.s, data2)
Which gives:
time1.in.s data1 time2.in.s data2
1 0.01 -10.000000 0.022000 -25.0000000
2 1.02 -8.888889 1.031879 -23.9393939
3 2.03 -7.777778 2.041758 -22.8787879
4 3.04 -6.666667 3.051636 -21.8181818
5 4.05 -5.555556 4.061515 -20.7575758
6 5.06 -4.444444 5.071394 -19.6969697
What I want to do is merge the two timeX.in.s columns into a single "time" column. Where data is not available, I would have NAs that I could fill in with something like na.approx(my.df$data1, x = my.df$time) from the zoo package.
This code is given so that you can reproduce the problem, but in real life, time1.in.s, time2.in.s, data1 and data2 are not available separately. What I actually do is my.df <- read.table(my.file, header = TRUE) and I get the same result. I thus don't have the possibility to build the separate data frames directly, I need to split the one big data frame in several manually:
df.list <- list()
for (i in seq(1, ncol(my.df), 2)) {
df.list[[ceiling(i/2)]] <- data.frame(time = my.df[, i], data = my.df[, i+1])
}
Then merge the dataframes one by one:
merged.df <- data.frame(time = as.numeric(NA), data = as.numeric(NA))
for (i in 1:length(df.list)) {
merged.df <- merge(merged.df, df.list[[i]], by = "time", all = TRUE)
}
And finally fill in the gaps:
merged.df$data.y <- na.approx(merged.df$data.y, x = merged.df$time, na.rm = FALSE)
That definitely works (except the names of the columns are a big mess). But it is cumbersome and doesn't look very R to me. Is there a simpler way to do this?
Here is the result obtained with the above commands:
> head(merged.df)
time data.x data.y data
1 0.010000 NA -10.000000 NA
2 0.022000 NA -9.986799 -25.00000
3 1.020000 NA -8.888889 NA
4 1.031879 NA -8.875821 -23.93939
5 2.030000 NA -7.777778 NA
6 2.041758 NA -7.764843 -22.87879
Column data.x comes from the initial empty merged.df. It can be dumped.
Column data.y is the my.df$data1 column.
In the above dataframe, I did not use the na.approx command on column data (which corresponds to my.df$data2 column)
Additional note on OmaymaS' proposed solution:
To make this work in the general case (i.e. with any number of columns), what I have done is the following. First, I defined a 6-column data frame:
time1.in.s <- seq(0.010, 100, length.out = 100)
time2.in.s <- seq(0.022, 100, length.out = 100)
time3.in.s <- seq(0.017, 99.8, length.out = 100)
data1 <- seq(-10, 100, length.out = 100)
data2 <- seq(-25, 80, length.out = 100)
data3 <- seq(-15, 70, length.out = 100)
my.df <- data.frame(time1.in.s, data1, time2.in.s, data2, time3.in.s, data3)
This leads to:
head(my.df)
time1.in.s data1 time2.in.s data2 time3.in.s data3
1 0.01 -10.000000 0.022000 -25.00000 0.017000 -15.00000
2 1.02 -8.888889 1.031879 -23.93939 1.024909 -14.14141
3 2.03 -7.777778 2.041758 -22.87879 2.032818 -13.28283
4 3.04 -6.666667 3.051636 -21.81818 3.040727 -12.42424
5 4.05 -5.555556 4.061515 -20.75758 4.048636 -11.56566
6 5.06 -4.444444 5.071394 -19.69697 5.056545 -10.70707
I changed the name of all columns containing the time to the same name (this way I don't have to tell the merge function which column to merge by):
colnames(my.df)[seq(1, ncol(my.df), 2)] <- "Time"
Then I loop on a slightly modified Reduce function:
df.merged <- my.df[, 1:2]
for (i in seq(3, ncol(my.df), 2)) {
df.merged <- Reduce(function(x,y) merge(x,y,
all = TRUE),
list(df.merged,
my.df[, i:(i+1)])
)
}
This gives:
> head(df.merged)
Time data1 data2 data3
1 0.010000 -10.000000 NA NA
2 0.017000 NA NA -15.00000
3 0.022000 NA -25.00000 NA
4 1.020000 -8.888889 NA NA
5 1.024909 NA NA -14.14141
6 1.031879 NA -23.93939 NA
Finally, I apply the na.approx function:
df.interp <- df.merged
df.interp[, 2:ncol(df.interp)] <- na.approx(df.interp[, 2:ncol(df.interp)],
x = df.interp$Time,
na.rm = FALSE)
Here is the final result:
> head(df.interp)
Time data1 data2 data3
1 0.010000 -10.000000 NA NA
2 0.017000 -9.992299 NA -15.00000
3 0.022000 -9.986799 -25.00000 -14.99574
4 1.020000 -8.888889 -23.95187 -14.14560
5 1.024909 -8.883488 -23.94671 -14.14141
6 1.031879 -8.875821 -23.93939 -14.13548
I still have NAs at the beginning of some data columns, but I can get rid of them with the na.omit function.
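The split, merge, and interpolate steps above can probably be condensed by building the list of (time, data) pairs with lapply and calling Reduce once. The sketch below uses a tiny made-up my.df and swaps zoo's na.approx for base R's approx() so that it runs without extra packages; it assumes the columns always alternate time, data, time, data.

```r
# Tiny stand-in for the real data (assumption: columns alternate time/data)
my.df <- data.frame(
  time1.in.s = c(0.010, 1.020),
  data1      = c(-10, -8.9),
  time2.in.s = c(0.022, 1.030),
  data2      = c(-25, -23.9)
)

# One two-column data frame per probe, with a common "Time" key
starts <- seq(1, ncol(my.df), by = 2)
pairs <- lapply(starts, function(i) {
  d <- my.df[, c(i, i + 1)]
  names(d)[1] <- "Time"
  d
})

# Full outer join of all pairs on Time
df.merged <- Reduce(function(x, y) merge(x, y, by = "Time", all = TRUE), pairs)

# Linear interpolation per data column; points outside the measured
# range stay NA (approx's default rule = 1)
df.interp <- df.merged
for (nm in setdiff(names(df.interp), "Time")) {
  ok <- !is.na(df.interp[[nm]])
  df.interp[[nm]] <- approx(df.interp$Time[ok], df.interp[[nm]][ok],
                            xout = df.interp$Time)$y
}
df.interp
```

As in the zoo version, leading or trailing NAs remain where a probe has no measurement on both sides of a time point.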
Try merge, it should help you accomplish what you need:
First: create two data frames with the data and corresponding time:
df1 <- data.frame(time1.in.s, data1)
df2 <- data.frame(time2.in.s, data2)
Second: merge the two dataframes, specifying the column to use using by.x and by.y, and include all values:
df.merged <- merge(df1,df2,
by.x = "time1.in.s",
by.y = "time2.in.s",
all.x = TRUE,
all.y = TRUE)
Note: to clarify as per Sotos recommendation:
all.x = TRUE,
all.y = TRUE
is similar to
all = TRUE
So if you want to exclude values from either data frame that do not exist in the other, you can set all.x or all.y to FALSE.
Now you will have time in one column, and you can rename the columns as you like.
> head(df.merged)
time1.in.s data1 data2
1 0.010000 -10.000000 NA
2 0.022000 NA -25.00000
3 1.020000 -8.888889 NA
4 1.031879 NA -23.93939
5 2.030000 -7.777778 NA
6 2.041758 NA -22.87879
EDIT: If you want to apply this to multiple column pairs (timeN.in.s, dataN), you can try Reduce as follows. You can add multiple selections to the list, and all will be merged on the time column, assuming it is always the first column in each selection (select comes from dplyr).
df.merged <- Reduce(function(x,y) merge(x,y,
by.x = names(x)[1],
by.y = names(y)[1],
all = TRUE),
list(select(my.df,time1.in.s, data1),
select(my.df,time2.in.s, data2))
)
> head(df.merged)
time1.in.s data1 data2
1 0.010000 -10.000000 NA
2 0.022000 NA -25.00000
3 1.020000 -8.888889 NA
4 1.031879 NA -23.93939
5 2.030000 -7.777778 NA
6 2.041758 NA -22.87879
Additional NOTE:
If you want to use the columns' indices, you can use:
df.merged <- Reduce(function(x,y) merge(x,y,
by.x = names(x)[1],
by.y = names(y)[1],
all = TRUE),
list(select(my.df,1,2),
select(my.df,3,4))
)
Also, if your column names are consistent and you want to build the list automatically, you can create a function that takes an integer and returns the column names you want to select:
getDF <- function(x)
{
c1 <- paste0("time",x,".in.s")
c2 <- paste0("data",x)
return(c(c1,c2))
}
For example:
> getDF(1)
[1] "time1.in.s" "data1"
Then you can use this in Reduce:
df.merged <- Reduce(function(x,y) merge(x,y,
by.x = names(x)[1],
by.y = names(y)[1],
all = TRUE),
list(my.df[,getDF(1)],
my.df[,getDF(2)])
)
A bit of code.
I am assuming that you would like to split your data.frame every two columns
library(magrittr)
library(dplyr)
...
my.df <- data.frame(time1.in.s, data1, time2.in.s, data2)
my.df %<>% t %>% data.frame %>%
mutate(x=(mod(seq_along(row.names(.)), 2) +
seq_along(row.names(.)))/2) %>% split(., .$x) %>% lapply(t)
for (i in 1:length(my.df)) colnames(my.df[[i]]) <- c("time", paste0("data",i))
my.df %<>% lapply(function(x) x[-dim(x), ])
final = Reduce(function(...) merge(..., all=T), my.df)

How do I add random `NA`s into a data frame

I created a data frame with random values
n <- 50
df <- data.frame(id = seq (1:n),
age = sample(c(20:90), n, rep = TRUE),
sex = sample(c("m", "f"), n, rep = TRUE, prob = c(0.55, 0.45))
)
and would like to introduce a few NA values to simulate real world data. I am trying to use apply but cannot get there. The line
apply(subset(df,select=-id), 2, function(x) {x[sample(c(1:n),floor(n/10))]})
will retrieve random values alright, but
apply(subset(df,select=-id), 2, function(x) {x[sample(c(1:n),floor(n/10))]<-NA})
will not set them to NA. Have tried with and within, too.
Brute force works:
for (i in (1:floor(n/10))) {
df[sample(c(1:n), 1), sample(c(2:ncol(df)), 1)] <- NA
}
But I'd prefer to use the apply family.
Return x within your function:
> df <- apply (df, 2, function(x) {x[sample( c(1:n), floor(n/10))] <- NA; x} )
> tail(df)
id age sex
[45,] "45" "41" NA
[46,] "46" NA "f"
[47,] "47" "38" "f"
[48,] "48" "32" "f"
[49,] "49" "53" NA
[50,] "50" "74" "f"
Apply returns an array, thereby converting all columns to the same type. You could use this instead:
df[,-1] <- do.call(cbind.data.frame,
lapply(df[,-1], function(x) {
x[sample(c(1:n),floor(n/10))]<-NA
x
})
)
Or use a for loop:
for (i in seq_along(df[,-1])+1) {
is.na(df[sample(seq_len(n), floor(n/10)),i]) <- TRUE
}
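A slightly shorter variant of the lapply idea assigns the modified columns straight back with df[-1] <- ..., which keeps each column's type. This is a sketch under the assumption that id is the first column and that roughly 10% NAs per column are wanted; the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
n <- 50
df <- data.frame(
  id  = seq_len(n),
  age = sample(20:90, n, replace = TRUE),
  sex = sample(c("m", "f"), n, replace = TRUE, prob = c(0.55, 0.45)),
  stringsAsFactors = FALSE
)

# Replace floor(n/10) randomly chosen entries per non-id column with NA
df[-1] <- lapply(df[-1], function(x) {
  x[sample(length(x), floor(length(x) / 10))] <- NA
  x  # return the modified column, not the result of the assignment
})

colSums(is.na(df))
```

Because lapply works column by column, age stays integer and sex stays character, unlike with apply.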
Using dplyr [1], you could arrive at the desired solution using the following compact syntax:
set.seed(123)
library("tidyverse")
n <- 50
df <- data.frame(
id = seq (1:n),
age = sample(c(20:90), n, replace = TRUE),
sex = sample(c("m", "f"), n, replace = TRUE, prob = c(0.55, 0.45))
)
mutate(.data = as_tibble(df),
across(
.cols = all_of(c("age", "sex")),
.fns = ~ ifelse(row_number(.x) %in% sample(1:n(), size = (10 * n() / 100)), NA, .x)
))
Results
Approximately 10% of the values are replaced with NA per column. This follows from sample(1:n(), size = (10 * n() / 100))
count(.Last.value, sex)
# A tibble: 3 x 2
# sex n
# <chr> <int>
# 1 f 21
# 2 m 24
# 3 NA 5
# A tibble: 50 x 3
# id age sex
# <int> <int> <chr>
# 1 1 50 m
# 2 2 70 m
[1] I'm loading tidyverse as replace_na is available via tidyr.
I think you need to return the x value from the function:
apply(subset(df,select=-id), 2, function(x)
{x[sample(c(1:n),floor(n/10))]<-NA; x})
but you also need to assign this back to the relevant subset of the data frame (and subset(...) <- ... doesn't work)
idCol <- names(df)=="id"
df[,!idCol] <- apply(df[,!idCol], 2, function(x)
{x[sample(1:n,floor(n/10))] <- NA; x})
(if you have only a single non-ID column you'll need df[,!idCol,drop=FALSE])
Here is another simple way to go about it.
Your data frame:
df <- mtcars
Number of missing values required:
nbr_missing <- 20
Sample row and column indices:
y <- data.frame(row = sample(nrow(df), size = nbr_missing, replace = TRUE),
col = sample(ncol(df), size = nbr_missing, replace = TRUE))
Remove duplicates:
y <- y[!duplicated(y), ]
Use matrix indexing:
df[as.matrix(y)] <- NA
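Because duplicated (row, col) pairs are dropped, the approach above can end up with fewer NAs than requested. One possible way to get an exact count is to sample distinct cell positions without replacement and use a logical mask. This is a sketch of that variation rather than the method above; the seed and variable names are arbitrary.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
df <- mtcars
nbr_missing <- 20

# Sample distinct cell positions from all nrow * ncol cells
cells <- sample(nrow(df) * ncol(df), nbr_missing)

# Logical mask with the same shape as df, TRUE where an NA should go
mask <- matrix(FALSE, nrow = nrow(df), ncol = ncol(df))
mask[cells] <- TRUE

df[mask] <- NA
sum(is.na(df))
```

Sampling without replacement guarantees that exactly nbr_missing cells are blanked, provided nbr_missing does not exceed the number of cells.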
To introduce a certain percentage of NAs in your data frame you could use this:
while(sum(is.na(df) == TRUE) < (nrow(df) * ncol(df) * percentage/100)){
df[sample(nrow(df),1), sample(ncol(df),1)] <- NA
}
You could also change "(nrow(df) * ncol(df) * percentage/100)" to a fixed number of NAs.
You can also use prodNA from the missForest package.
library(missForest)
library(dplyr)
> bind_cols(df[1],missForest::prodNA(df[-1],noNA=0.1))
# A tibble: 50 x 3
id age sex
<int> <int> <fct>
1 1 NA m
2 2 84 NA
3 3 82 f
4 4 42 f
5 5 35 m
6 6 80 m
7 7 90 f
8 8 NA NA
9 9 89 f
10 10 42 m
# … with 40 more rows
Simply pass your dataframe into the following function. The only arguments are the frame you want to add NAs to and the number of features (columns) you want to have with NAs.
add_random_nas_to_frame <- function(frame, num_features) {
col_order <- names(frame)
rand_cols <- sample(ncol(frame), num_features)
left_overs <- which(!names(frame) %in% names(frame[,rand_cols]))
other_frame <- frame[,left_overs]
nas_added <- data.frame(lapply(frame[,rand_cols], function(x) x[sample(c(TRUE, NA), prob = c(sample(100, 1)/100, 0.15), size = length(x), replace = TRUE)]))
final_frame <- cbind(other_frame, nas_added)
final_frame <- final_frame[,col_order]
return(final_frame)
}
For example, using the full banking dataset from UCI:
https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
bank <- read.table(file='path_to_data', sep =";", stringsAsFactors = F, header = T)
And viewing the original missing data:
We can see there is no missing data in the original frame.
Now applying our function:
bank_nas <- add_random_nas_to_frame(bank, 5)
