I created a data frame with random values:
n <- 50
df <- data.frame(id = seq(1:n),
                 age = sample(c(20:90), n, rep = TRUE),
                 sex = sample(c("m", "f"), n, rep = TRUE, prob = c(0.55, 0.45)))
and would like to introduce a few NA values to simulate real-world data. I am trying to use apply but cannot get there. The line
apply(subset(df,select=-id), 2, function(x) {x[sample(c(1:n),floor(n/10))]})
will retrieve random values alright, but
apply(subset(df,select=-id), 2, function(x) {x[sample(c(1:n),floor(n/10))]<-NA})
will not set them to NA. I have tried with and within, too.
Brute force works:
for (i in (1:floor(n/10))) {
  df[sample(c(1:n), 1), sample(c(2:ncol(df)), 1)] <- NA
}
But I'd prefer to use the apply family.
Return x within your function:
> df <- apply(df, 2, function(x) {x[sample(c(1:n), floor(n/10))] <- NA; x})
> tail(df)
id age sex
[45,] "45" "41" NA
[46,] "46" NA "f"
[47,] "47" "38" "f"
[48,] "48" "32" "f"
[49,] "49" "53" NA
[50,] "50" "74" "f"
apply() returns a matrix, thereby converting all columns to the same type (note the quoted values above). You could use this instead:
df[,-1] <- do.call(cbind.data.frame,
                   lapply(df[,-1], function(x) {
                     x[sample(c(1:n), floor(n/10))] <- NA
                     x
                   })
)
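A shorter base-R variant (my sketch, assuming the df and n from the question): assigning the lapply() result back as a list keeps each column's class without needing do.call():
# assumes df and n as defined in the question
df[-1] <- lapply(df[-1], function(x) {
  # blank floor(n/10) random positions in each column
  x[sample(seq_len(n), floor(n/10))] <- NA
  x
})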
Or use a for loop:
for (i in seq_along(df[,-1]) + 1) {
  is.na(df[sample(seq_len(n), floor(n/10)), i]) <- TRUE
}
Using dplyr¹ you could arrive at the desired solution using the following compact syntax:
set.seed(123)
library("tidyverse")
n <- 50
df <- data.frame(
  id = seq(1:n),
  age = sample(c(20:90), n, replace = TRUE),
  sex = sample(c("m", "f"), n, replace = TRUE, prob = c(0.55, 0.45))
)
mutate(.data = as_tibble(df),
       across(
         .cols = all_of(c("age", "sex")),
         .fns = ~ ifelse(row_number(.x) %in% sample(1:n(), size = (10 * n() / 100)), NA, .x)
       ))
Results
Approximately 10% of the values are replaced with NA per column. This follows from sample(1:n(), size = (10 * n() / 100)).
count(.Last.value, sex)
# A tibble: 3 x 2
# sex n
# <chr> <int>
# 1 f 21
# 2 m 24
# 3 NA 5
# A tibble: 50 x 3
# id age sex
# <int> <int> <chr>
# 1 1 50 m
# 2 2 70 m
¹ I'm loading tidyverse as replace_na is available via tidyr.
I think you need to return the x value from the function:
apply(subset(df, select = -id), 2, function(x) {
  x[sample(c(1:n), floor(n/10))] <- NA; x
})
but you also need to assign this back to the relevant subset of the data frame (and subset(...) <- ... doesn't work):
idCol <- names(df)=="id"
df[,!idCol] <- apply(df[,!idCol], 2, function(x)
{x[sample(1:n,floor(n/10))] <- NA; x})
(if you have only a single non-ID column you'll need df[,!idCol,drop=FALSE])
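To illustrate why drop = FALSE matters (an aside of mine, using a hypothetical two-column frame):
# hypothetical frame: one id column plus a single data column
df2 <- data.frame(id = 1:5, age = c(23, 45, 31, 60, 52))
idCol <- names(df2) == "id"
class(df2[, !idCol])                # "numeric": the dimension was dropped to a vector
class(df2[, !idCol, drop = FALSE])  # "data.frame": still safe to pass to apply()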
Here is another simple way to go at it.
Your data frame:
df <- mtcars
Number of missing values required:
nbr_missing <- 20
Sample row and column indices (duplicates allowed for now):
y <- data.frame(row = sample(nrow(df), size = nbr_missing, replace = TRUE),
                col = sample(ncol(df), size = nbr_missing, replace = TRUE))
Remove duplicated pairs:
y <- y[!duplicated(y), ]
Use matrix indexing to set the sampled cells to NA:
df[as.matrix(y)] <- NA
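As a quick sanity check (my addition, assuming the objects above): since mtcars starts with no NAs, the number of NAs introduced should equal the number of de-duplicated index pairs:
sum(is.na(df)) == nrow(y)  # TRUE: one NA per unique (row, col) pair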
To introduce a certain percentage of NAs into your data frame you could use this:
percentage <- 10  # percent of cells that should end up NA
while (sum(is.na(df)) < (nrow(df) * ncol(df) * percentage / 100)) {
  df[sample(nrow(df), 1), sample(ncol(df), 1)] <- NA
}
You could also replace (nrow(df) * ncol(df) * percentage/100) with a fixed number of NAs.
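The same loop wrapped as a function taking a fixed count (a sketch of mine; it assumes n_na is no larger than the number of cells, otherwise the loop never terminates):
add_nas <- function(df, n_na) {
  # keep drawing random cells until exactly n_na distinct cells are NA
  while (sum(is.na(df)) < n_na) {
    df[sample(nrow(df), 1), sample(ncol(df), 1)] <- NA
  }
  df
}
df_na <- add_nas(df, 15)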
You can also use prodNA from the missForest package.
library(missForest)
library(dplyr)
> bind_cols(df[1],missForest::prodNA(df[-1],noNA=0.1))
# A tibble: 50 x 3
id age sex
<int> <int> <fct>
1 1 NA m
2 2 84 NA
3 3 82 f
4 4 42 f
5 5 35 m
6 6 80 m
7 7 90 f
8 8 NA NA
9 9 89 f
10 10 42 m
# … with 40 more rows
Simply pass your dataframe into the following function. The only arguments are the frame you want to add NAs to and the number of features (columns) you want to have with NAs.
add_random_nas_to_frame <- function(frame, num_features) {
  col_order <- names(frame)
  rand_cols <- sample(ncol(frame), num_features)
  left_overs <- which(!names(frame) %in% names(frame[, rand_cols]))
  other_frame <- frame[, left_overs]
  # index each chosen column with a random TRUE/NA mask; NA indices produce NA values
  nas_added <- data.frame(lapply(frame[, rand_cols], function(x)
    x[sample(c(TRUE, NA), prob = c(sample(100, 1)/100, 0.15),
             size = length(x), replace = TRUE)]))
  final_frame <- cbind(other_frame, nas_added)
  final_frame <- final_frame[, col_order]
  return(final_frame)
}
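Applied to the question's 50-row frame, a usage sketch of mine (note the function samples among all columns, so id itself may be picked):
set.seed(1)  # assumed seed, for reproducibility only
df_nas <- add_random_nas_to_frame(df, 2)
colSums(is.na(df_nas))  # NA counts per column; two columns are affected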
For example, using the full bank marketing dataset from UCI:
https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
bank <- read.table(file='path_to_data', sep =";", stringsAsFactors = F, header = T)
Viewing the original data, we can see there is no missing data in the original frame.
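The original post showed this as a screenshot; an equivalent check in code (my substitution) is:
colSums(is.na(bank))  # all zeros: no missing values in any column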
Now applying our function:
bank_nas <- add_random_nas_to_frame(bank, 5)
Related
I usually have to perform equivalent calculations on a series of variables/columns that can be identified by their suffix (ranging, let's say from _a to _i) and save the result in new variables/columns. The calculations are equivalent, but vary between the variables used in the calculations. These again can be identified by the same suffix (_a to _i). So what I basically want to achieve is the following:
newvar_a = (oldvar1_a + oldvar2_a) - z
...
newvar_i = (oldvar1_i + oldvar2_i) - z
This is the furthest I got:
mutate(across(c(oldvar1_a:oldvar1_i), ~ . - z, .names = "{col}_new"))
Thus, I'm able to "loop" over oldvar1_a to oldvar1_i, subtract z from them, and save the results in new columns named oldvar1_a_new to oldvar1_i_new. However, I'm not able to include oldvar2_a to oldvar2_i in the calculations, as R won't loop over them. (Additionally, I'd still need to rename the new columns.)
I found a way to achieve the result using a for-loop. However, this definitely doesn't look like the most efficient and straightforward way to do it:
for (i in letters[1:9]) {
oldvar1_x <- paste0("oldvar1_", i)
oldvar2_x <- paste0("oldvar2_", i)
newvar_x <- paste0("newvar_", i)
df <- df %>%
mutate(!!sym(newvar_x) := (!!sym(oldvar1_x) + !!sym(oldvar2_x)) - z)
}
Thus, I'd like to know whether/how to make mutate(across) loop over multiple columns that can be identified by suffixes (as in the example above).
In this case, you can use cur_data() and cur_column() to take advantage of the fact that we want to sum together columns that share a suffix and only need to swap out the number in the name.
library(dplyr)
df <- data.frame(
oldvar1_a = 1:3,
oldvar2_a = 4:6,
oldvar1_i = 7:9,
oldvar2_i = 10:12,
z = c(1,10,20)
)
mutate(
  df,
  across(
    starts_with("oldvar1"),
    ~ (.x + cur_data()[gsub("1", "2", cur_column())]) - z,
    .names = "{col}_new"
  )
)
#> oldvar1_a oldvar2_a oldvar1_i oldvar2_i z oldvar2_a oldvar2_i
#> 1 1 4 7 10 1 4 16
#> 2 2 5 8 11 10 -3 9
#> 3 3 6 9 12 20 -11 1
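Note (my addition): cur_data() has been deprecated since dplyr 1.1.0; on newer versions the same lookup can be written with pick(), and extracting with [[ returns a vector, so the new columns get the intended {col}_new names rather than the duplicated names visible above:
mutate(
  df,
  across(
    starts_with("oldvar1"),
    ~ (.x + pick(everything())[[gsub("1", "2", cur_column())]]) - z,
    .names = "{col}_new"
  )
)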
If you want to use this with case_when, just make sure to index using [[ so that a vector, rather than a one-column data frame, is returned.
df <- data.frame(
oldvar1_a = 1:3,
oldvar2_a = 4:6,
oldvar1_i = 7:9,
oldvar2_i = 10:12,
z = c(1,2,0)
)
mutate(
  df,
  across(
    starts_with("oldvar1"),
    ~ case_when(
      z == 1 ~ .x,
      z == 2 ~ cur_data()[[gsub("1", "2", cur_column())]],
      TRUE ~ NA_integer_
    ),
    .names = "{col}_new"
  )
)
#> oldvar1_a oldvar2_a oldvar1_i oldvar2_i z oldvar1_a_new oldvar1_i_new
#> 1 1 4 7 10 1 1 7
#> 2 2 5 8 11 2 5 11
#> 3 3 6 9 12 0 NA NA
There is a fairly straightforward way to do what I believe you are attempting to do.
# first let's create data
library(dplyr)
df <- data.frame(var1_a = runif(10, min = 128, max = 131),
                 var2_a = runif(10, min = 128, max = 131),
                 var1_b = runif(10, min = 128, max = 131),
                 var2_b = runif(10, min = 128, max = 131),
                 var1_c = runif(10, min = 128, max = 131),
                 var2_c = runif(10, min = 128, max = 131))
# taking a wild guess at what your z is
z <- 4
# initialize a list
fnl <- list()
# iterate over all your combos, put the results in the list
for (i in letters[1:3]) {
  dc <- df %>% select(ends_with(i))
  res <- dc %>% mutate(a = rowSums(dc[1:ncol(dc)]) - z)
  fnl <- append(fnl, res)
}
# convert to a dataframe/tibble
final <- bind_cols(fnl)
I left the column names sloppy assuming you had specific requirements here. You can convert this loop into a function and do the whole thing in a single step using purrr.
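A hedged sketch of that purrr version (my addition, assuming the df and z defined above):
library(purrr)
library(dplyr)
# one data frame per suffix, each with its row-sum column appended
final <- map(letters[1:3], function(sfx) {
  dc <- select(df, ends_with(sfx))
  mutate(dc, !!paste0("newvar_", sfx) := rowSums(dc) - z)
}) %>% bind_cols()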
I have a list of functions, for example:
myFunctions = list(
calculateMean = function(x) {mean(x)},
calculateMedian = function(x) {median(x)}
)
I need to call stored functions in myFunctions based on some criteria. For example, I have a table (myTable) with prices and I need to calculate means and medians (I also need to do more things, like standardize names, join a specific value with a table of codes, etc.).
If a value in a column in myTable is == "a" I want to use function calculateMean, if == "b" I want to use function calculateMedian, if == "c" use function calculateMean.
What is the best way to do this? I am saving functions as a list as I will have a lot of functions. And how can I call a function in myFunctions based on a specific criterion?
Thanks!
Maybe the following does what the question asks for.
Depending on ID, function priceStat determines which function from myFunctions to apply to column price.
priceStat <- function(x, funlist) {
  type <- unique(as.character(x[["ID"]]))
  f <- switch(type,
              pear = funlist[[1]],
              orange = funlist[[2]])
  f(x[["price"]])
}
myFunctions = list(
calculateMean = function(x) {mean(x)},
calculateMedian = function(x) {median(x)}
)
set.seed(1234)
df1 <- data.frame(ID = sample(c("pear", "orange"), 20, TRUE),
price = runif(20),
stringsAsFactors = FALSE)
sapply(split(df1, df1$ID), priceStat, myFunctions)
# orange pear
#0.3036828 0.5427695
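Since the question maps the values "a", "b", "c" to functions, another option (a sketch of mine) is to skip switch() and index the named list through a lookup vector:
# lookup: criterion value -> name of the function in myFunctions
fun_for <- c(a = "calculateMean", b = "calculateMedian", c = "calculateMean")
apply_by_criterion <- function(value, x) {
  myFunctions[[fun_for[[value]]]](x)
}
apply_by_criterion("b", df1$price)  # median of all prices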
Here is something that I think does what you are suggesting.
library(dplyr)
Create some data.
set.seed(1234)
data <- tibble(id = rep(letters[1:2], each = 3), price = rnorm(6, 100, 5))
data
# # A tibble: 6 x 2
# id price
# <chr> <dbl>
# 1 a 94.0
# 2 a 101.
# 3 a 105.
# 4 b 88.3
# 5 b 102.
# 6 b 103.
Create a list of functions. Note we named the list item for the id we want to apply it to.
myFunctions <- list(
a = mean,
b = median
)
Group the data on id, then use group_modify() to iterate over the groups, calling summarize() on each. For each group (the subset of the data for that given id), look up and call the matching function from the myFunctions list.
data %>%
group_by(id) %>%
group_modify(~ summarize(.x, calc = myFunctions[[pull(.y[1])]](.x$price)))
# # A tibble: 2 x 2
# id calc
# <chr> <dbl>
# 1 a 100.
# 2 b 102.
Testing it out.
> mean(data$price[data$id == "a"])
[1] 100.258
> median(data$price[data$id == "b"])
[1] 102.1456
I have a 37x21 matrix in R which contains many NAs. For my analysis, I need to get rid of all the NAs. I could remove all rows containing an NA, all columns containing an NA, or some combination of the two.
I want to remove specific rows and columns in such a way that I remove all NAs but retain the highest number of data cells possible.
E.g. Removing all ROWS with an NA results in a 10x21 matrix (10*21 = 210 cells of data). Removing all COLUMNS with an NA results in a 37x12 matrix (37*12 = 444 cells of data). But instead of doing either of these extremes, I want to remove the combination of rows and columns that results in the highest number of cells of data being retained. How would I go about this?
Here is one way using the first algorithm that I could think of. The approach is simply to remove, at each iteration, a row or column that contains at least one NA and has the fewest non-NA values in the matrix (so you lose the fewest cells when removing it). To do this, I make a dataframe of the rows and columns with their counts of NA and non-NA values, along with dimension and index. At the moment, ties are resolved by deleting rows before columns and earlier indexes before later ones.
I am not sure that this will give the global maximum (it only takes one branch at ties), but it should do better than just deleting rows or columns. In this example we get 210 for deleting rows and 74 for deleting columns, but 272 with the new approach. The code could also probably be optimised if you need to use this for much larger matrices or for many more NAs.
set.seed(1)
mat <- matrix(sample(x = c(1:10, NA), size = 37 * 21, replace = TRUE), ncol = 21)
# filter rows
prod(dim(mat[apply(mat, 1, function(x) all(!is.na(x))), ]))
#> [1] 210
# filter cols
prod(dim(mat[, apply(mat, 2, function(x) all(!is.na(x)))]))
#> [1] 74
delete_row_col <- function(m) {
  to_delete <- rbind(
    data.frame(
      dim = "row",
      index = seq_len(nrow(m)),
      nas = rowSums(is.na(m)),
      non_nas = rowSums(!is.na(m)),
      stringsAsFactors = FALSE
    ),
    data.frame(
      dim = "col",
      index = seq_len(ncol(m)),
      nas = colSums(is.na(m)),
      non_nas = colSums(!is.na(m)),
      stringsAsFactors = FALSE
    )
  )
  to_delete <- to_delete[to_delete$nas > 0, ]
  to_delete <- to_delete[to_delete$non_nas == min(to_delete$non_nas), ]
  if (nrow(to_delete) == 0) {
    return(m)
  } else if (to_delete$dim[1] == "row") {
    m <- m[-to_delete$index[1], ]
  } else {
    m <- m[, -to_delete$index[1]]
  }
  return(m)
}
remove_matrix_na <- function(m) {
  while (any(is.na(m))) {
    m <- delete_row_col(m)
  }
  return(m)
}
prod(dim(remove_matrix_na(mat)))
#> [1] 272
Created on 2019-07-06 by the reprex package (v0.3.0)
Here's a way using mixed integer programming (MIP). I have used the ompr package for mathematical modeling and the open-source GLPK solver. I have added the model explanation as comments in the code. MIP approaches, when successful, guarantee an optimal solution, as indicated by solver_status(model) shown in the code.
This approach will easily scale up to handle large matrices.
library(dplyr)
library(ROI)
library(ROI.plugin.glpk)
library(ompr)
library(ompr.roi)
set.seed(1)
mat <- matrix(sample(x = c(1:10, NA), size = 37 * 21, replace = TRUE), ncol = 21)
# filtering all rows with NA retains 126 cells
prod(dim(mat[apply(mat, 1, function(x) all(!is.na(x))), , drop = F]))
# [1] 126
# filtering all cols with NA retains 37 cells
prod(dim(mat[, apply(mat, 2, function(x) all(!is.na(x))), drop = F]))
# [1] 37
m <- +!is.na(mat) # 0/1 indicator matrix: 0 if NA, else 1
nr <- nrow(m)
nc <- ncol(m)
model <- MIPModel() %>%
  # keep[i,j] is 1 if matrix cell [i,j] is to be kept else 0
  add_variable(keep[i, j], i = 1:nr, j = 1:nc, type = "binary") %>%
  # rm_row[i] is 1 if row i is selected for removal else 0
  add_variable(rm_row[i], i = 1:nr, type = "binary") %>%
  # rm_col[j] is 1 if column j is selected for removal else 0
  add_variable(rm_col[j], j = 1:nc, type = "binary") %>%
  # maximize good cells kept
  set_objective(sum_expr(keep[i, j], i = 1:nr, j = 1:nc), "max") %>%
  # a cell can be kept only when its row is not selected for removal
  add_constraint(keep[i, j] <= 1 - rm_row[i], i = 1:nr, j = 1:nc) %>%
  # a cell can be kept only when its column is not selected for removal
  add_constraint(keep[i, j] <= 1 - rm_col[j], i = 1:nr, j = 1:nc) %>%
  # only non-NA values can be kept: an NA cell forces its row or column out
  add_constraint(m[i, j] + rm_row[i] + rm_col[j] >= 1, i = 1:nr, j = 1:nc) %>%
  # solve using free glpk solver
  solve_model(with_ROI(solver = "glpk"))
Get the solution:
solver_status(model)
# [1] "optimal"  ("optimal" guarantees optimality)
# get rows to remove
rm_rows <- model %>%
get_solution(rm_row[i]) %>%
filter(value > 0) %>%
pull(i)
# [1] 1 3 4 6 7 8 10 14 18 19 20 21 22 23 24 28 30 33 34 35 37
# get columns to remove
rm_cols <- model %>%
get_solution(rm_col[j]) %>%
filter(value > 0) %>%
pull(j)
# [1] 6 14 15 16 17
result <- mat[-rm_rows, -rm_cols]
# result has retained more cells as compared to
# removing just rows (126) or just columns (37)
prod(dim(result))
# [1] 256
This approach should be possible with lpSolve package as well but I think it involves building constraint matrix manually which is very cumbersome.
I want to count individual and combined occurrences of variables (1 represents presence and 0 represents absence). This can be obtained by multiple uses of the table function (see MWE below). Is it possible to use a more efficient approach to get the required output given below?
set.seed(12345)
A <- rbinom(n = 100, size = 1, prob = 0.5)
B <- rbinom(n = 100, size = 1, prob = 0.6)
C <- rbinom(n = 100, size = 1, prob = 0.7)
df <- data.frame(A, B, C)
table(A)
A
0 1
48 52
table(B)
B
0 1
53 47
table(C)
C
0 1
34 66
table(A, B)
B
A 0 1
0 25 23
1 28 24
table(A, C)
C
A 0 1
0 12 36
1 22 30
table(B, C)
C
B 0 1
0 21 32
1 13 34
table(A, B, C)
, , C = 0
B
A 0 1
0 8 4
1 13 9
, , C = 1
B
A 0 1
0 17 19
1 15 15
Required Output
I need something like the following:
A = 52
B = 47
C = 66
A + B = 24
A + C = 30
B + C = 34
A + B + C = 15
Expanding on Sumedh's answer, you can also do this dynamically without having to specify the filter every time. This will be useful if you have more than just 3 columns to combine.
You can do something like this:
lapply(seq_len(ncol(df)), function(i) {
  # generate all the combinations of i elements from the column names
  tmp_i = utils::combn(names(df), i)
  # each column of tmp_i holds the elements of one combination
  apply(tmp_i, 2, function(x) {
    dynamic_formula = as.formula(paste("~", paste(x, "== 1", collapse = " & ")))
    df %>%
      filter_(.dots = dynamic_formula) %>%
      summarize(Count = n()) %>%
      mutate(type = paste0(sort(x), collapse = ""))
  }) %>%
    bind_rows()
}) %>%
  bind_rows()
This will:
1) Generate all the combinations of the columns of df: first the combinations with one element (A, B, C), then the ones with two elements (AB, AC, BC), etc. This is the external lapply.
2) For every combination, build a dynamic formula. For AB, for instance, the formula will be A == 1 & B == 1, exactly as Sumedh suggested. This is the dynamic_formula bit.
3) Filter the dataframe with the dynamically generated formula and count the number of rows.
4) Bind everything together (the two bind_rows).
The output will be
Count type
1 52 A
2 47 B
3 66 C
4 24 AB
5 30 AC
6 34 BC
7 15 ABC
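Note (my addition): filter_() has since been deprecated; on current dplyr the same dynamic filter can be written with if_all(), avoiding the formula construction entirely:
library(dplyr)
lapply(seq_len(ncol(df)), function(i) {
  combos <- utils::combn(names(df), i)
  apply(combos, 2, function(x) {
    df %>%
      filter(if_all(all_of(x), ~ .x == 1)) %>%  # all selected columns equal 1
      summarize(Count = n()) %>%
      mutate(type = paste0(sort(x), collapse = ""))
  }) %>% bind_rows()
}) %>% bind_rows()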
EDITED TO ADD: I see now that you don't want to get the exclusive counts (i.e. A and AB should both include all As).
I got more than a little nerd-sniped by this today, particularly as I wanted to solve it using base R with no packages. The below should do that.
There is a very easy (in principle) solution that simply uses xtabs(), which I've illustrated below. However, to generalize it for any potential number of dimensions, and then to apply it to a variety of combinations, actually was harder. I strove to avoid using the dreaded eval(parse()).
set.seed(12345)
A <- rbinom(n = 100, size = 1, prob = 0.5)
B <- rbinom(n = 100, size = 1, prob = 0.6)
C <- rbinom(n = 100, size = 1, prob = 0.7)
df <- data.frame(A, B, C)
# Keep strings as characters rather than factors
options(stringsAsFactors = FALSE)
# Obtain the n-way frequency table
# This table can be directly subset using []
# It is a little tricky to pass the arguments
# I'm trying to avoid eval(parse())
# But still give a solution that isn't bound to a specific size
xtab_freq <- xtabs(formula = formula(x = paste("~", paste(names(df), collapse = " + "))),
                   data = df)
# Demonstrating what I mean
# All A
sum(xtab_freq["1",,])
# [1] 52
# AC
sum(xtab_freq["1",,"1"])
# [1] 30
# Using lapply(), we pass names(df) to combn() with m values of 1, 2, and 3
# The output of combn() goes through list(), then is unlisted with recursive FALSE
# This gives us a list of vectors
# Each one being a combination in which we are interested
lst_combs <- unlist(lapply(X = 1:3,FUN = combn,x = names(df),list),recursive = FALSE)
# For nice output naming, I just paste the values together
names(lst_combs) <- sapply(X = lst_combs,FUN = paste,collapse = "")
# This is a function I put together
# Generalizes process of extracting values from a crosstab
# It does it in this fashion to avoid eval(parse())
uFunc_GetMargins <- function(crosstab, varvector, success) {
  # Obtain the dimname-names (the names within each dimension)
  # From that, get the regular dimnames
  xtab_dnn <- dimnames(crosstab)
  xtab_dn <- names(xtab_dnn)
  # Use match() to get a numeric vector for the margins
  # This can be used in margin.table()
  tgt_margins <- match(x = varvector, table = xtab_dn)
  # Obtain a margin table
  marginal <- margin.table(x = crosstab, margin = tgt_margins)
  # To extract the value, figure out which marginal cell contains
  # all variables of interest set to success
  # sapply() goes over all the elements of the dimname names
  # Finds the numeric index in that dimension where the name == success
  # We subset the resulting vector by tgt_margins
  # (to only get the cells in our marginal table)
  # Then, use prod() to multiply them together and get the location
  tgt_cell <- prod(sapply(X = xtab_dnn,
                          FUN = match,
                          x = success)[tgt_margins])
  # Return as named list for ease of stacking
  return(list(count = marginal[tgt_cell]))
}
# Doing a call of mapply() lets us get the results
do.call(what = rbind.data.frame,
        args = mapply(FUN = uFunc_GetMargins,
                      varvector = lst_combs,
                      MoreArgs = list(crosstab = xtab_freq,
                                      success = "1"),
                      SIMPLIFY = FALSE,
                      USE.NAMES = TRUE))
# count
# A 52
# B 47
# C 66
# AB 24
# AC 30
# BC 34
# ABC 15
I ditched the prior solution that used aggregate.
Using dplyr:
Occurrence of only A:
library(dplyr)
df %>% filter(A == 1) %>% summarise(Total = nrow(.))
Occurrence of A and B:
df %>% filter(A == 1, B == 1) %>% summarise(Total = nrow(.))
Occurence of A, B, and C
df %>% filter(A == 1, B == 1, C == 1) %>% summarise(Total = nrow(.))
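If you want all seven counts at once rather than one filter per combination, a compact alternative (a sketch of mine) is a single summarise() over logical conditions:
library(dplyr)
df %>% summarise(
  A   = sum(A == 1),
  B   = sum(B == 1),
  C   = sum(C == 1),
  AB  = sum(A == 1 & B == 1),
  AC  = sum(A == 1 & C == 1),
  BC  = sum(B == 1 & C == 1),
  ABC = sum(A == 1 & B == 1 & C == 1)
)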
I have searched high and low, but am stuck on how to approach this. I have two sets of columns that I want to sum, row by row, but which I want to loop over many columns. If I were to do this manually, I would want:
df1[1,1]+df2[1,1]
df1[2,1]+df2[2,1]
etc... I've found many helpful examples on how to do something like:
apply(df[,c("a","d")], 1, sum)
though I want to do this over lots of columns. Also, while it's not entirely relevant, I want to phrase my question as close to my reality as possible, so my example below includes NAs, since my actual data contains many missing values.
# make a data frame, df1, with three columns
a <- sample(1:100, 50, replace = T)
b <- sample(100:300, 50, replace = T)
c <- sample(2:50, 50, replace = T)
df1 <- cbind(a,b,c)
# make another data frame, df2, with three columns
x <- sample(1:100, 50, replace = T)
y <- sample(100:300, 50, replace = T)
z <- sample(2:50, 50, replace = T)
df2 <- cbind(x,y,z)
To make it possible to randomly throw a few NAs in, here is a function from http://www.r-bloggers.com/function-to-generate-a-random-data-set/:
NAins <- NAinsert <- function(df, prop = .1) {
  n <- nrow(df)
  m <- ncol(df)
  num.to.na <- ceiling(prop * n * m)
  id <- sample(0:(m * n - 1), num.to.na, replace = FALSE)
  rows <- id %/% m + 1
  cols <- id %% m + 1
  sapply(seq(num.to.na), function(x) {
    df[rows[x], cols[x]] <<- NA
  })
  return(df)
}
Add the NAs to the frames (note that the result must be assigned back, since the function does not modify its argument in place):
df1 <- NAins(df1, .2)
df2 <- NAins(df2, .14)
Then, I tried to seq along the columns in each data frame and used apply with the margin set to 1, meaning to sum each row entry. This doesn't work:
for(i in seq_along(df1)){
for(j in seq_along(df2)){
apply(c(df1[,i], col2[j]), 1, function(x) sum(x, na.rm = T))}}
Thanks for any help!
You should be able to just replace NA with 0, and then add with "+":
replace(df1, is.na(df1), 0) + replace(df2, is.na(df2), 0)
# X Y Z
# 1 7 19 6
# 2 11 12 1
# 3 16 14 11
# 4 13 7 13
# 5 10 2 11
Alternatively, if you have more than just two data.frames, you can collect them in a list and use Reduce:
Reduce("+", lapply(mget(c("df1", "df2", "df3")), function(x) replace(x, is.na(x), 0)))
Here's some sample data (and what I think is an easier way to create it):
set.seed(1) ## Set a seed so others can reproduce your sample data
dfmaker <- function() {
  setNames(
    data.frame(
      replicate(3, sample(c(NA, 1:10), 5, TRUE), FALSE)),
    c("X", "Y", "Z"))
}
df1 <- dfmaker()
df1
# X Y Z
# 1 2 9 2
# 2 4 10 1
# 3 6 7 7
# 4 9 6 4
# 5 2 NA 8
df2 <- dfmaker()
df2
# X Y Z
# 1 5 10 4
# 2 7 2 NA
# 3 10 7 4
# 4 4 1 9
# 5 8 2 3
df3 <- dfmaker()
You can transform the data frames into a 3-D array and sum across it using the apply() function.
install.packages("abind")
library(abind)
df <- abind(list(df1, df2), along = 3)
results <- apply(df, MARGIN = c(1, 2), FUN = function(x) sum(x, na.rm = TRUE))
results