aggregate on multiple columns - keeping the original column names and structure - r

Please consider the following example, which uses aggregate twice.
library(dplyr)
set.seed(5)
x <- data.frame(
name = sample(c('NM01', 'NM02', 'NM03', 'NM04', 'NM05'), 400, replace = TRUE),
strand = sample(c('+', '-'), 400, replace = TRUE),
value = sample(6, 400, replace = TRUE)
)
x_agg_hist <- aggregate( x$value,
by = list(strand = x$strand,
transcript = x$name
),
function(v) hist( v,
breaks = seq(0.5, 6.5),
plot= FALSE
)$counts
)
y <- data.frame(
name = c('NM01', 'NM02', 'NM03', 'NM04', 'NM05'),
value = runif(5)
)
x_agg_hist$value <- y$value[match(x_agg_hist$transcript, y$name)]
x_agg_hist$division <- ifelse(x_agg_hist$value > 0.5, 1, 2) %>% as.factor()
x_agg_hist
strand transcript x.1 x.2 x.3 x.4 x.5 x.6 value division
1 - NM01 6 9 8 5 5 8 0.5661267 1
2 + NM01 4 2 8 8 8 6 0.5661267 1
3 - NM02 8 4 6 5 3 11 0.1178577 2
4 + NM02 7 6 9 8 7 7 0.1178577 2
5 - NM03 4 5 10 4 6 3 0.2572855 2
6 + NM03 6 10 5 9 5 9 0.2572855 2
7 - NM04 7 4 5 7 4 9 0.9678125 1
8 + NM04 4 3 4 10 8 9 0.9678125 1
9 - NM05 4 6 10 5 5 5 0.8891210 1
10 + NM05 11 13 5 8 12 8 0.8891210 1
So far, everything is fine. Specifically, I notice that I can select the columns of the histograms created by aggregate "collectively" using
x_agg_hist$x
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 6 9 8 5 5 8
[2,] 4 2 8 8 8 6
[3,] 8 4 6 5 3 11
[4,] 7 6 9 8 7 7
[5,] 4 5 10 4 6 3
[6,] 6 10 5 9 5 9
Next, I would like to sum the histograms by 'division' and 'strand' (and normalise by the number of observations in each group).
x_agg_hist_agg_sum <- aggregate( x_agg_hist$x,
by = list(division = x_agg_hist$division,
strand = x_agg_hist$strand
),
function(v) sum(v)/length(v)
)
Note that using x_agg_hist$x to select all the columns of the histograms seems a lot more convenient than what has been proposed here (Aggregate / summarize multiple variables per group (e.g. sum, mean)).
This still works as expected.
x_agg_hist_agg_sum
division strand V1 V2 V3 V4 V5 V6
1 1 - 5.666667 6.333333 7.666667 5.666667 4.666667 7.333333
2 2 - 6.000000 4.500000 8.000000 4.500000 4.500000 7.000000
3 1 + 6.333333 6.000000 5.666667 8.666667 9.333333 7.666667
4 2 + 6.500000 8.000000 7.000000 8.500000 6.000000 8.000000
However, now aggregate has renamed the columns of the (summed) histograms in a way that does not allow selecting them collectively any more. Therefore, I was wondering if it was possible to tell aggregate to keep the original column names and structure or if there is any other method that can do so. (Of course I know that I can use x_agg_hist_agg_sum[, -c(1, 2)], but with my real data (after a lot of further processing) this would at least be a lot more difficult.)
Cheers,
mce1

I would suggest using dplyr for such long chained operations; it has a lot of benefits.
You can do all the transformation, manipulation and reshaping in a single pipe without creating intermediate variables like x_agg_hist and x_agg_hist_agg_sum, so you don't have to remember or manage them.
The first few steps of your code can be translated as:
library(dplyr)
x %>%
group_by(strand, name) %>%
summarise(res = hist(value, breaks = seq(0.5, 6.5),plot= FALSE)$counts) %>%
left_join(y, by = 'name') %>%
mutate(division = factor(ifelse(value > 0.5, 1, 2))) %>%
ungroup
Use pivot_wider to cast the data into wide format which will maintain the names of the data.
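If you would rather stay in base R, one possible workaround is to let aggregate flatten the histogram columns and then re-pack them into a single matrix column, so that they can be selected collectively again. This is only a sketch on toy data: agg1, agg2, group_cols and repacked are made-up names standing in for the objects in the question.

```r
# Toy stand-in for x_agg_hist: two grouping columns plus a matrix column `x`.
set.seed(1)
agg1 <- data.frame(
  division = factor(c(1, 1, 2, 2, 1, 1, 2, 2)),
  strand   = rep(c("-", "+"), times = 4)
)
agg1$x <- matrix(sample(10, 8 * 6, replace = TRUE), nrow = 8)

# aggregate() flattens the matrix column into V1..V6 ...
agg2 <- aggregate(agg1$x,
                  by = list(division = agg1$division, strand = agg1$strand),
                  FUN = mean)

# ... so re-pack everything that is not a grouping column into one matrix column.
group_cols <- c("division", "strand")
repacked <- agg2[group_cols]
repacked$x <- as.matrix(agg2[setdiff(names(agg2), group_cols)])

repacked$x  # selects all histogram columns at once again
```

The re-packing step only relies on knowing the grouping columns, so it should not matter how many histogram bins there are.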

Related

How to recursively compute average over time in R

Consider the following dataset
period<-c(1,2,3,4,5)
value<-c(3,6,7,4,6)
cum_average<-c((3)/1,(3+6)/2,(3+6+7)/3,(3+6+7+4)/4,(3+6+7+4+6)/5)
df_test<-data.frame(period,value,cum_average)
df_test
period value cum_average
1 3 3
2 6 4.5
3 7 5.3
4 4 5.0
5 6 5.2
Assume that the 5 observations in the 'value' column represent the values taken by a variable in periods 1 to 5, respectively. How can I produce the column 'cum_average'?
I believe that this could be done using zoo::timeAverage, but when I try to load the package on my relatively old machine I run into a conflict and cannot use it.
Any help would be much appreciated!
Solution
new_df <- df_test %>% mutate(avgT = cumsum(value)/period)
did the trick.
Thank you so much for your answers!
Maybe you are looking for this. You can first compute the cumulative sum, as mentioned by @tmfmnk, and then divide by the row number, which tracks the number of observations, if the mean is required. Here is the code using dplyr:
library(dplyr)
#Code
newdf <- df_test %>% mutate(AvgTime=cumsum(x)/row_number())
Output:
period x AvgTime
1 1 3 3.000000
2 2 6 4.500000
3 3 7 5.333333
4 4 4 5.000000
5 5 6 5.200000
If only cumulative sum is needed:
#Code2
newdf <- df_test %>% mutate(CumTime=cumsum(x))
Output:
period x CumTime
1 1 3 3
2 2 6 9
3 3 7 16
4 4 4 20
5 5 6 26
Or only base R:
#Base R
df_test$Cumsum <- cumsum(df_test$x)
Output:
period x Cumsum
1 1 3 3
2 2 6 9
3 3 7 16
4 4 4 20
5 5 6 26
Using standard R:
period<-c(1,2,3,4,5)
value<-c(3,6,7,4,6)
recursive_average<-cumsum(value) / (1:length(value))
df_test<-data.frame(value, recursive_average)
df_test
value recursive_average
1 3 3.000000
2 6 4.500000
3 7 5.333333
4 4 5.000000
5 6 5.200000
If your period vector is the one you wish to use to calculate the average, simply replace 1:length(value) with period.
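Since the question asks for a recursive computation, here is a small base R sketch that compares the cumsum shortcut with an explicit running update of the mean. The names cum_average and running are just illustrative; the data are the five values from the question.

```r
value <- c(3, 6, 7, 4, 6)

# Shortcut: running total divided by the running count of observations.
cum_average <- cumsum(value) / seq_along(value)

# Recursive form: m_k = m_{k-1} + (x_k - m_{k-1}) / k
running <- numeric(length(value))
m <- 0
for (k in seq_along(value)) {
  m <- m + (value[k] - m) / k
  running[k] <- m
}

cum_average
running
```

Both vectors should agree; the loop is only there to show the recursion, the cumsum version is the one you would normally use.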
We can use cummean
library(dplyr)
df_test %>%
mutate(AvgTime=cummean(value))
-output
# period value AvgTime
#1 1 3 3.000000
#2 2 6 4.500000
#3 3 7 5.333333
#4 4 4 5.000000
#5 5 6 5.200000
data
df_test <- structure(list(period = c(1, 2, 3, 4, 5), value = c(3, 6, 7,
4, 6)), class = "data.frame", row.names = c(NA, -5L))

Iterate through columns to sum the previous 2 numbers of each row

In R, I have a dataframe, with columns 'A', 'B', 'C', 'D'. The columns have 100 rows.
I need to iterate through the columns to perform a calculation for all rows in the dataframe which sums the previous 2 rows of that column, and then set in new columns ('AA', 'AB', etc) what that sum is:
A B C D
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
5 6 7 8
6 7 8 9
to
A B C D AA AB AC AD
1 2 3 4 NA NA NA NA
2 3 4 5 3 5 7 9
3 4 5 6 5 7 9 11
4 5 6 7 7 9 11 13
5 6 7 8 9 11 13 15
6 7 8 9 11 13 15 17
Can someone explain how to create a function/loop that allows me to set the columns I want to iterate over (selected columns, not all columns) and the columns I want to set?
A base one-liner:
cbind(df, setNames(df + df[c(NA, 1:(nrow(df)-1)), ], paste0("A", names(df))))
If your data is large, this one might be the fastest because it manipulates the entire data.frame.
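If you only want some of the columns, as the question asks, the same idea can be wrapped in a small base R helper. This is a sketch under my own naming: add_pair_sums, cols and prefix are hypothetical, not part of any package.

```r
# Add, for each selected column, a new column holding the sum of the
# current row and the previous row (NA for the first row).
add_pair_sums <- function(df, cols, prefix = "A") {
  sums <- lapply(df[cols], function(v) v + c(NA, head(v, -1)))
  names(sums) <- paste0(prefix, cols)
  cbind(df, as.data.frame(sums))
}

df <- data.frame(A = 1:6, B = 2:7, C = 3:8, D = 4:9)
res <- add_pair_sums(df, c("A", "D"))
res
```

Here cols picks the columns to iterate over and prefix controls how the new columns are named, so add_pair_sums(df, names(df)) would reproduce the full table from the question.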
A dplyr solution using mutate() with across().
library(dplyr)
df %>%
mutate(across(A:D,
~ .x + lag(.x),
.names = "A{col}"))
# A B C D AA AB AC AD
# 1 1 2 3 4 NA NA NA NA
# 2 2 3 4 5 3 5 7 9
# 3 3 4 5 6 5 7 9 11
# 4 4 5 6 7 7 9 11 13
# 5 5 6 7 8 9 11 13 15
# 6 6 7 8 9 11 13 15 17
If you want to sum the previous 3 rows, the second argument of across(), i.e. .fns, should be
~ .x + lag(.x) + lag(.x, 2)
which is equivalent to the use of rollsum() in zoo:
~ zoo::rollsum(.x, k = 3, fill = NA, align = 'right')
Benchmark
A benchmark with the microbenchmark package on a new data.frame with 10000 rows and 100 columns, evaluating each expression 10 times.
# Unit: milliseconds
# expr min lq mean median uq max neval
# darren_base 18.58418 20.88498 35.51341 33.64953 39.31909 80.24725 10
# darren_dplyr_lag 39.49278 40.27038 47.26449 42.89170 43.20267 76.72435 10
# arg0naut91_dplyr_rollsum 436.22503 482.03199 524.54800 516.81706 534.94317 677.64242 10
# Grothendieck_rollsumr 3423.92097 3611.01573 3650.16656 3622.50895 3689.26404 4060.98054 10
You can use dplyr's across (and set optional names) with rolling sum (as implemented e.g. in zoo):
library(dplyr)
library(zoo)
df %>%
mutate(
across(
A:D,
~ rollsum(., k = 2, fill = NA, align = 'right'),
.names = 'A{col}'
)
)
Output:
A B C D AA AB AC AD
1 1 2 3 4 NA NA NA NA
2 2 3 4 5 3 5 7 9
3 3 4 5 6 5 7 9 11
4 4 5 6 7 7 9 11 13
5 5 6 7 8 9 11 13 15
6 6 7 8 9 11 13 15 17
With A:D we've specified the range of column names we want to apply the function to. The assumption above in .names argument is that you want to paste together A as prefix and the column name ({col}).
Here's a data.table solution. As you ask for, it allows you to select which columns you want to apply it to rather than just for all columns.
library(data.table)
x <- data.table(A=1:6, B=2:7, C=3:8, D=4:9)
selected_cols <- c('A','B','D')
new_cols <- paste0("A",selected_cols)
x[, (new_cols) := lapply(.SD, function(col) col+shift(col, 1)), .SDcols = selected_cols]
x[]
NB This is 2 or 3 times faster than the fastest other answer.
This is a naive approach with nested for loops. Beware: it is very slow if you are going to iterate over hundreds of thousands of rows.
i <- 1
n <- 5
df <- data.frame(A=i:(i+n), B=(i+1):(i+n+1), C=(i+2):(i+n+2), D=(i+3):(i+n+3))
for (col in colnames(df)) {
for (ind in 1:nrow(df)) {
if (ind-1==0) {next}
s <- sum(df[c(ind-1, ind), col])
df[ind, paste0('S', col)] <- s
}
}
That is a cumsum method:
na.df <- data.frame(matrix(NA, 2, ncol(df)))
colnames(na.df) <- colnames(df)
cs1 <- cumsum(df)
cs2 <- rbind(cs1[-1:-2,], na.df)
sum.diff <- cs2-cs1
cbind(df, rbind(na.df[1,], cs1[2,], sum.diff[1:(nrow(sum.diff)-2),]))
Benchmark:
# Unit: milliseconds
# expr min lq mean median uq max neval
# darrentsai.rbind 11.5623 12.28025 23.38038 16.78240 20.83420 91.9135 100
# darrentsai.rbind.rev1 8.8267 9.10945 15.63652 9.54215 14.25090 62.6949 100
# pseudopsin.dt 7.2696 7.52080 20.26473 12.61465 17.61465 69.0110 100
# ivan866.cumsum 25.3706 30.98860 43.11623 33.78775 37.36950 91.6032 100
I believe the cumsum method spends most of its time on data frame allocations. If correctly adapted to a data.table backend, it could be the fastest.
Specify the columns we want. We show several different ways to do that. Then use rollsumr to get the desired columns, set the column names and cbind DF with it.
library(zoo)
# jx <- names(DF) # if all columns wanted
# jx <- sapply(DF, is.numeric) # if all numeric columns
# jx <- c("A", "B", "C", "D") # specify columns by name
jx <- 1:4 # specify columns by position
r <- rollsumr(DF[jx], 2, fill = NA)
colnames(r) <- paste0("A", colnames(r))
cbind(DF, r)
giving:
A B C D AA AB AC AD
1 1 2 3 4 NA NA NA NA
2 2 3 4 5 3 5 7 9
3 3 4 5 6 5 7 9 11
4 4 5 6 7 7 9 11 13
5 5 6 7 8 9 11 13 15
6 6 7 8 9 11 13 15 17
Note
The input in reproducible form:
DF <- structure(list(A = 1:6, B = 2:7, C = 3:8, D = 4:9),
class = "data.frame", row.names = c(NA, -6L))

Averaging row and column cells from multiple data frames

I have multiple data frames, like:
DG = data.frame(y=c(1,3), v=3:8, x=c(4,6))
DF = data.frame(y=c(1,3), v=3:8, x=c(12,14))
DT = data.frame(y=c(1,3), v=3:8, x=c(4,5))
head(DG)
y v x
1 1 3 4
2 3 4 6
3 1 5 4
4 3 6 6
5 1 7 4
6 3 8 6
head(DT)
y v x
1 1 3 4
2 3 4 5
3 1 5 4
4 3 6 5
5 1 7 4
6 3 8 5
head(DF)
y v x
1 1 3 12
2 3 4 14
3 1 5 12
4 3 6 14
5 1 7 12
6 3 8 14
I want to calculate means of each 'row' but from each column of each data frame, i.e. the resulting data frame I need looks like:
y v x
1 'mean(DG(y1)DT(y1),DF(y1))' 'mean(DG(v1)DT(v1),DF(v1))' 'mean(DG(x1)DT(x1),DF(x1))'
2 'mean(DG(y2)DT(y2),DF(y2))' 'mean(DG(v2)DT(v2),DF(v2))' 'mean(DG(x2)DT(x2),DF(x2))'
3 'mean(DG(y3)DT(y3),DF(y3))' 'mean(DG(v3)DT(v3),DF(v3))' 'mean(DG(x3)DT(x3),DF(x3))'
....
In reality, y, v and x are different locations and 1 - 6 time steps. I want to average my data for each time step and location. Eventually, I need one data set, that looks like one of the example data sets, but with averaged values in each cell.
I have a working example with loops, but for large datasets it is very slow, so I tried various combinations with apply and rowSums, but neither worked out.
If I understand correctly, there are many data frames which all have the same structure (number, name and type of columns) as well as the same number of rows (time steps). Some data points may contain NA.
The code below creates a large data.table from the single data frames and computes the mean values for each time step and location across the different data frames:
library(data.table)
rbindlist(list(DG, DF, DT), idcol = TRUE)[
, lapply(.SD, mean, na.rm = TRUE), by = .(time_step = rowid(.id))]
time_step y v x
1: 1 1 3 6.666667
2: 2 3 4 8.333333
3: 3 1 5 6.666667
4: 4 3 6 8.333333
5: 5 1 7 6.666667
6: 6 3 8 8.333333
This will work also with NAs, e.g.,
DG = data.frame(y=c(1,3), v=3:8, x=c(4,6))
DF = data.frame(y=c(1,3), v=3:8, x=c(12,14))
DT = data.frame(y=c(1,3), v=3:8, x=c(4,5,NA))
Note that column x of DT has been modified
rbindlist(list(DG, DF, DT), idcol = TRUE)[
, lapply(.SD, mean, na.rm = TRUE), by = .(time_step = rowid(.id))]
time_step y v x
1: 1 1 3 6.666667
2: 2 3 4 8.333333
3: 3 1 5 8.000000
4: 4 3 6 8.000000
5: 5 1 7 7.000000
6: 6 3 8 10.000000
Note that x in rows 3 to 6 has changed.
If you only have the three data frames, I would recommend
result = (DG + DT + DF) / 3
result
# y v x
# 1 1 3 6.666667
# 2 3 4 8.333333
# 3 1 5 6.666667
# 4 3 6 8.333333
# 5 1 7 6.666667
# 6 3 8 8.333333
This assumes that your rows and columns are already in the correct order.
If you have more data frames, put them in a list (see here for help with that) and then you can do this:
result = Reduce("+", list_of_data) / length(list_of_data)
If you need advanced features of mean, like ignoring NAs or trimming, this won't work. Instead, I would recommend converting your data frames to matrices, stacking them into a 3-d array, and applying mean.
library(abind)
stack = abind(DG, DF, DT, along = 3)
# if you have data frames in a list, do this instead:
# stack = do.call(abind, c(list_of_data, along = 3))
apply(stack, MARGIN = 1:2, FUN = mean, na.rm = TRUE)
# y v x
# [1,] 1 3 6.666667
# [2,] 3 4 8.333333
# [3,] 1 5 6.666667
# [4,] 3 6 8.333333
# [5,] 1 7 6.666667
# [6,] 3 8 8.333333
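If you would rather not depend on abind, the stacking can also be sketched with base R's array(). This assumes all data frames have the same dimensions and column order; frames, mats, stack and cell_means are names I made up, and DT here is the variant with an NA in x.

```r
DG <- data.frame(y = c(1, 3), v = 3:8, x = c(4, 6))
DF <- data.frame(y = c(1, 3), v = 3:8, x = c(12, 14))
DT <- data.frame(y = c(1, 3), v = 3:8, x = c(4, 5, NA))

frames <- list(DG, DF, DT)
mats <- lapply(frames, as.matrix)

# Build a rows x columns x frames array from the stacked matrices.
stack <- array(unlist(mats),
               dim = c(dim(mats[[1]]), length(mats)),
               dimnames = list(NULL, colnames(mats[[1]]), NULL))

# Average over the third dimension, skipping NAs cell by cell.
cell_means <- apply(stack, c(1, 2), mean, na.rm = TRUE)
cell_means
```

The result is a matrix; wrap it in as.data.frame() if you need a data frame back.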
The final method I'll recommend is a "tidy" method - combine your data into one data frame and use grouped operations to produce the result. This can be done easily with data.table or dplyr. See Uwe's answer for a nice data.table implementation.
library(dplyr)
bind_rows(list(DG, DF, DT), .id = ".id") %>%
group_by(.id) %>%
mutate(rn = row_number()) %>%
ungroup() %>%
select(-.id) %>%
group_by(rn) %>%
summarize_all(mean, na.rm = TRUE) %>%
select(-rn)
# # A tibble: 6 x 3
# y v x
# <dbl> <dbl> <dbl>
# 1 1 3 6.67
# 2 3 4 8.33
# 3 1 5 6.67
# 4 3 6 8.33
# 5 1 7 6.67
# 6 3 8 8.33

R data.table with variable number of columns

For each student in a data set, a certain set of scores may have been collected. We want to calculate the mean for each student, but using only the scores in the columns that were germane to that student.
The columns required in a calculation are different for each row. I've figured how to write this in R using the usual tools, but am trying to rewrite with data.table, partly for fun, but also partly in anticipation of success in this small project which might lead to the need to make calculations for lots and lots of rows.
Here is a small working example of "choose a specific column set for each row problem."
set.seed(123234)
## Suppose these are 10 students in various grades
dat <- data.frame(id = 1:10, grade = rep(3:7, times = 2),
A = sample(c(1:5, 9), 10, replace = TRUE),
B = sample(c(1:5, 9), 10, replace = TRUE),
C = sample(c(1:5, 9), 10, replace = TRUE),
D = sample(c(1:5, 9), 10, replace = TRUE))
## 9 is a marker for missing value, there might also be
## NAs in real data, and those are supposed to be regarded
## differently in some exercises
## Students in various grades are administered different
## tests. A data structure gives the grade to test linkage.
## The letters are column names in dat
lookup <- list("3" = c("A", "B"),
"4" = c("A", "C"),
"5" = c("B", "C", "D"),
"6" = c("A", "B", "C", "D"),
"7" = c("C", "D"),
"8" = c("C"))
## wrapper around that lookup because I kept getting confused
getLookup <- function(grade){
lookup[[as.character(grade)]]
}
## Function that receives one row (named vector)
## from data frame and chooses columns and makes calculation
getMean <- function(arow, lookup){
scores <- arow[getLookup(arow["grade"])]
mean(scores[scores != 9], na.rm = TRUE)
}
stuscores <- apply(dat, 1, function(x) getMean(x, lookup))
result <- data.frame(dat, stuscores)
result
## If the data is 1000s of thousands of rows,
## I will wish I could use data.table to do that.
## Client will want students sorted by state, district, classroom,
## etc.
## However, am stumped on how to specify the adjustable
## column-name chooser
library(data.table)
DT <- data.table(dat)
## How to write call to getMean correctly?
## Want to do this for each participant (no grouping)
setkey(DT, id)
The desired output is the student average for the appropriate columns, like so:
> result
id grade A B C D stuscores
1 1 3 9 9 1 4 NaN
2 2 4 5 4 1 5 3.0
3 3 5 1 3 5 9 4.0
4 4 6 5 2 4 5 4.0
5 5 7 9 1 1 3 2.0
6 6 3 3 3 4 3 3.0
7 7 4 9 2 9 2 NaN
8 8 5 3 9 2 9 2.0
9 9 6 2 3 2 5 3.0
10 10 7 3 2 4 1 2.5
Then what? I've written a lot of mistakes so far...
I did not find any examples among the data.table examples in which the columns to be used in the calculations for each row were themselves variable. Thank you for your advice.
I was not asking anybody to write code for me, I'm asking for advice on how to get started with this problem.
First of all, when creating a reproducible example using functions such as sample (which set a random seed each time you run it), you should use set.seed.
Second of all, instead of looping over each row, you could just loop over the lookup list which will always be smaller than the data (many times significantly smaller) and combine it with rowMeans. You can also do it with base R, but you asked for a data.table solution so here goes (for the purposes of this solution I've converted all 9 to NAs, but you can try to generalize this to your specific case too)
So using set.seed(123), your function gives
apply(dat, 1, function(x) getMean(x, lookup))
# [1] 2.000000 5.000000 4.666667 4.500000 2.500000 1.000000 4.000000 2.333333 2.500000 1.500000
And here's a possible data.table application which runs only over the lookup list (for loops on lists are very efficient in R btw, see here)
## convert the 9 markers to NAs, in the score columns only
## (applying this to the whole data frame would also turn id 9 into NA)
score_cols <- c("A", "B", "C", "D")
dat[score_cols] <- lapply(dat[score_cols], function(v) replace(v, v == 9L, NA))
## convert your original data to `data.table`,
## there is no need in additional copy of the data if the data is huge
setDT(dat)
## loop only over the list
for(i in names(lookup)) {
dat[grade == i, res := rowMeans(as.matrix(.SD[, lookup[[i]], with = FALSE]), na.rm = TRUE)]
}
dat
# id grade A B C D res
# 1: 1 3 2 NA NA NA 2.000000
# 2: 2 4 5 3 5 NA 5.000000
# 3: 3 5 3 5 4 5 4.666667
# 4: 4 6 NA 4 NA 5 4.500000
# 5: 5 7 NA 1 4 1 2.500000
# 6: 6 3 1 NA 5 3 1.000000
# 7: 7 4 4 2 4 5 4.000000
# 8: 8 5 NA 1 4 2 2.333333
# 9: 9 6 4 2 2 2 2.500000
# 10: 10 7 3 NA 1 2 1.500000
Possibly, this could be improved utilizing set, but I can't think of a good way currently.
P.S.
As suggested by @Arun, please take a look at the vignettes he wrote here to get familiar with the := operator, .SD, with = FALSE, etc.
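The idea of looping over the lookup list rather than over the rows does not depend on data.table. Here is a minimal base R sketch on a made-up four-row data set (the values, the two-entry lookup and the res column are assumptions for illustration only), with 9 treated as the missing marker:

```r
dat <- data.frame(id = 1:4, grade = c(3, 4, 3, 4),
                  A = c(1, 9, 3, 5), B = c(2, 4, 9, 1), C = c(5, 3, 2, 9))
lookup <- list("3" = c("A", "B"), "4" = c("A", "C"))

# Work on a copy of the score columns so the 9 markers become NA there only.
scores <- dat[c("A", "B", "C")]
scores[scores == 9] <- NA

dat$res <- NA_real_
for (g in names(lookup)) {
  rows <- dat$grade == as.numeric(g)
  # One vectorised rowMeans() call per grade, restricted to that grade's tests.
  dat$res[rows] <- rowMeans(scores[rows, lookup[[g]], drop = FALSE], na.rm = TRUE)
}
dat
```

The loop runs once per grade, not once per student, which is why this pattern scales much better than apply over rows.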
Here's another data.table approach using melt.data.table (needs data.table 1.9.5+) and then joins between data.tables:
DT_m <- setkey(melt.data.table(DT, c("id", "grade"), value.name = "score"), grade, variable)
lookup_dt <- data.table(grade = rep(as.integer(names(lookup)), lengths(lookup)),
variable = unlist(lookup), key = "grade,variable")
score_summary <- setkey(DT_m[lookup_dt, nomatch = 0L,
.(res = mean(score[score != 9], na.rm = TRUE)), by = id], id)
setkey(DT, id)[score_summary, res := res]
# id grade A B C D res
# 1: 1 3 9 9 1 4 NaN
# 2: 2 4 5 4 1 5 3.0
# 3: 3 5 1 3 5 9 4.0
# 4: 4 6 5 2 4 5 4.0
# 5: 5 7 9 1 1 3 2.0
# 6: 6 3 3 3 4 3 3.0
# 7: 7 4 9 2 9 2 NaN
# 8: 8 5 3 9 2 9 2.0
# 9: 9 6 2 3 2 5 3.0
#10: 10 7 3 2 4 1 2.5
It's more verbose, but just over twice as fast:
microbenchmark(da_method(), nk_method(), times = 1000)
#Unit: milliseconds
# expr min lq mean median uq max neval
# da_method() 17.465893 17.845689 19.249615 18.079206 18.337346 181.76369 1000
# nk_method() 7.047405 7.282276 7.757005 7.489351 7.667614 20.30658 1000

How do I replace NA values with zeros in an R dataframe?

I have a data frame and some columns have NA values.
How do I replace these NA values with zeroes?
See my comment on @gsk3's answer. A simple example:
> m <- matrix(sample(c(NA, 1:10), 100, replace = TRUE), 10)
> d <- as.data.frame(m)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 4 3 NA 3 7 6 6 10 6 5
2 9 8 9 5 10 NA 2 1 7 2
3 1 1 6 3 6 NA 1 4 1 6
4 NA 4 NA 7 10 2 NA 4 1 8
5 1 2 4 NA 2 6 2 6 7 4
6 NA 3 NA NA 10 2 1 10 8 4
7 4 4 9 10 9 8 9 4 10 NA
8 5 8 3 2 1 4 5 9 4 7
9 3 9 10 1 9 9 10 5 3 3
10 4 2 2 5 NA 9 7 2 5 5
> d[is.na(d)] <- 0
> d
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 4 3 0 3 7 6 6 10 6 5
2 9 8 9 5 10 0 2 1 7 2
3 1 1 6 3 6 0 1 4 1 6
4 0 4 0 7 10 2 0 4 1 8
5 1 2 4 0 2 6 2 6 7 4
6 0 3 0 0 10 2 1 10 8 4
7 4 4 9 10 9 8 9 4 10 0
8 5 8 3 2 1 4 5 9 4 7
9 3 9 10 1 9 9 10 5 3 3
10 4 2 2 5 0 9 7 2 5 5
There's no need to apply apply. =)
EDIT
You should also take a look at norm package. It has a lot of nice features for missing data analysis. =)
The dplyr hybridized options are now around 30% faster than the Base R subset reassigns. On a 100M datapoint dataframe mutate_all(~replace(., is.na(.), 0)) runs a half a second faster than the base R d[is.na(d)] <- 0 option. What one wants to avoid specifically is using an ifelse() or an if_else(). (The complete 600 trial analysis ran to over 4.5 hours mostly due to including these approaches.) Please see benchmark analyses below for the complete results.
If you are struggling with massive dataframes, data.table is the fastest option of all: 40% faster than the standard Base R approach. It also modifies the data in place, effectively allowing you to work with nearly twice as much of the data at once.
A clustering of other helpful tidyverse replacement approaches
Locationally:
index: mutate_at(c(5:10), ~replace(., is.na(.), 0))
direct reference: mutate_at(vars(var5:var10), ~replace(., is.na(.), 0))
fixed match: mutate_at(vars(contains("1")), ~replace(., is.na(.), 0))
(or in place of contains(), try ends_with() or starts_with())
pattern match: mutate_at(vars(matches("\\d{2}")), ~replace(., is.na(.), 0))
Conditionally:
(change just a single type and leave the other types alone)
integers: mutate_if(is.integer, ~replace(., is.na(.), 0))
numbers: mutate_if(is.numeric, ~replace(., is.na(.), 0))
strings: mutate_if(is.character, ~replace(., is.na(.), 0))
The Complete Analysis
Updated for dplyr 0.8.0: functions use purrr-style ~ formulas, replacing the deprecated funs() arguments.
Approaches tested:
# Base R:
baseR.sbst.rssgn <- function(x) { x[is.na(x)] <- 0; x }
baseR.replace <- function(x) { replace(x, is.na(x), 0) }
baseR.for <- function(x) { for(j in 1:ncol(x))
x[[j]][is.na(x[[j]])] = 0; x }
# tidyverse
## dplyr
dplyr_if_else <- function(x) { mutate_all(x, ~if_else(is.na(.), 0, .)) }
dplyr_coalesce <- function(x) { mutate_all(x, ~coalesce(., 0)) }
## tidyr
tidyr_replace_na <- function(x) { replace_na(x, as.list(setNames(rep(0, 10), as.list(c(paste0("var", 1:10)))))) }
## hybrid
hybrd.ifelse <- function(x) { mutate_all(x, ~ifelse(is.na(.), 0, .)) }
hybrd.replace_na <- function(x) { mutate_all(x, ~replace_na(., 0)) }
hybrd.replace <- function(x) { mutate_all(x, ~replace(., is.na(.), 0)) }
hybrd.rplc_at.idx<- function(x) { mutate_at(x, c(1:10), ~replace(., is.na(.), 0)) }
hybrd.rplc_at.nse<- function(x) { mutate_at(x, vars(var1:var10), ~replace(., is.na(.), 0)) }
hybrd.rplc_at.stw<- function(x) { mutate_at(x, vars(starts_with("var")), ~replace(., is.na(.), 0)) }
hybrd.rplc_at.ctn<- function(x) { mutate_at(x, vars(contains("var")), ~replace(., is.na(.), 0)) }
hybrd.rplc_at.mtc<- function(x) { mutate_at(x, vars(matches("\\d+")), ~replace(., is.na(.), 0)) }
hybrd.rplc_if <- function(x) { mutate_if(x, is.numeric, ~replace(., is.na(.), 0)) }
# data.table
library(data.table)
DT.for.set.nms <- function(x) { for (j in names(x))
set(x,which(is.na(x[[j]])),j,0) }
DT.for.set.sqln <- function(x) { for (j in seq_len(ncol(x)))
set(x,which(is.na(x[[j]])),j,0) }
DT.nafill <- function(x) { nafill(x, fill=0)}
DT.setnafill <- function(x) { setnafill(x, fill=0)}
The code for this analysis:
library(microbenchmark)
# 20% NA filled dataframe of 10 Million rows and 10 columns
set.seed(42) # to recreate the exact dataframe
dfN <- as.data.frame(matrix(sample(c(NA, as.numeric(1:4)), 1e7*10, replace = TRUE),
dimnames = list(NULL, paste0("var", 1:10)),
ncol = 10))
# Running 600 trials with each replacement method
# (the functions are executed locally - so that the original dataframe remains unmodified in all cases)
perf_results <- microbenchmark(
hybrd.ifelse = hybrd.ifelse(copy(dfN)),
dplyr_if_else = dplyr_if_else(copy(dfN)),
hybrd.replace_na = hybrd.replace_na(copy(dfN)),
baseR.sbst.rssgn = baseR.sbst.rssgn(copy(dfN)),
baseR.replace = baseR.replace(copy(dfN)),
dplyr_coalesce = dplyr_coalesce(copy(dfN)),
tidyr_replace_na = tidyr_replace_na(copy(dfN)),
hybrd.replace = hybrd.replace(copy(dfN)),
hybrd.rplc_at.ctn= hybrd.rplc_at.ctn(copy(dfN)),
hybrd.rplc_at.nse= hybrd.rplc_at.nse(copy(dfN)),
baseR.for = baseR.for(copy(dfN)),
hybrd.rplc_at.idx= hybrd.rplc_at.idx(copy(dfN)),
DT.for.set.nms = DT.for.set.nms(copy(dfN)),
DT.for.set.sqln = DT.for.set.sqln(copy(dfN)),
times = 600L
)
Summary of Results
> print(perf_results)
Unit: milliseconds
expr min lq mean median uq max neval
hybrd.ifelse 6171.0439 6339.7046 6425.221 6407.397 6496.992 7052.851 600
dplyr_if_else 3737.4954 3877.0983 3953.857 3946.024 4023.301 4539.428 600
hybrd.replace_na 1497.8653 1706.1119 1748.464 1745.282 1789.804 2127.166 600
baseR.sbst.rssgn 1480.5098 1686.1581 1730.006 1728.477 1772.951 2010.215 600
baseR.replace 1457.4016 1681.5583 1725.481 1722.069 1766.916 2089.627 600
dplyr_coalesce 1227.6150 1483.3520 1524.245 1519.454 1561.488 1996.859 600
tidyr_replace_na 1248.3292 1473.1707 1521.889 1520.108 1570.382 1995.768 600
hybrd.replace 913.1865 1197.3133 1233.336 1238.747 1276.141 1438.646 600
hybrd.rplc_at.ctn 916.9339 1192.9885 1224.733 1227.628 1268.644 1466.085 600
hybrd.rplc_at.nse 919.0270 1191.0541 1228.749 1228.635 1275.103 2882.040 600
baseR.for 869.3169 1180.8311 1216.958 1224.407 1264.737 1459.726 600
hybrd.rplc_at.idx 839.8915 1189.7465 1223.326 1228.329 1266.375 1565.794 600
DT.for.set.nms 761.6086 915.8166 1015.457 1001.772 1106.315 1363.044 600
DT.for.set.sqln 787.3535 918.8733 1017.812 1002.042 1122.474 1321.860 600
Boxplot of Results
ggplot(perf_results, aes(x=expr, y=time/10^9)) +
geom_boxplot() +
xlab('Expression') +
ylab('Elapsed Time (Seconds)') +
scale_y_continuous(breaks = seq(0,7,1)) +
coord_flip()
Color-coded Scatterplot of Trials (with y-axis on a log scale)
qplot(y=time/10^9, data=perf_results, colour=expr) +
labs(y = "log10 Scaled Elapsed Time per Trial (secs)", x = "Trial Number") +
coord_cartesian(ylim = c(0.75, 7.5)) +
scale_y_log10(breaks=c(0.75, 0.875, 1, 1.25, 1.5, 1.75, seq(2, 7.5)))
A note on the other high performers
When the datasets get larger, tidyr's replace_na had historically pulled out in front. With the current collection of 100M data points to run through, it performs almost exactly as well as a base R for loop. I am curious to see what happens for different sized dataframes.
Additional examples for the mutate and summarize _at and _all function variants can be found here: https://rdrr.io/cran/dplyr/man/summarise_all.html
Additionally, I found helpful demonstrations and collections of examples here: https://blog.exploratory.io/dplyr-0-5-is-awesome-heres-why-be095fd4eb8a
Attributions and Appreciations
With special thanks to:
Tyler Rinker and Akrun for demonstrating microbenchmark.
alexis_laz for working on helping me understand the use of local(), and (with Frank's patient help, too) the role that silent coercion plays in speeding up many of these approaches.
ArthurYip for the poke to add the newer coalesce() function in and update the analysis.
Gregor for the nudge to figure out the data.table functions well enough to finally include them in the lineup.
Base R For loop: alexis_laz
data.table For Loops: Matt_Dowle
Roman for explaining what is.numeric() really tests.
(Of course, please reach over and give them upvotes, too if you find those approaches useful.)
Note on my use of Numerics: If you do have a pure integer dataset, all of your functions will run faster. Please see alexis_laz's work for more information. IRL, I can't recall encountering a data set containing more than 10-15% integers, so I am running these tests on fully numeric dataframes.
Hardware Used
3.9 GHz CPU with 24 GB RAM
For a single vector:
x <- c(1,2,NA,4,5)
x[is.na(x)] <- 0
For a data.frame, make a function out of the above, then apply it to the columns.
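A minimal sketch of what that could look like, restricted to numeric columns so that character columns are left alone. The helper name na_to_zero and the toy data frame are my own choices for illustration:

```r
# Replace NAs in a single vector with 0.
na_to_zero <- function(v) {
  v[is.na(v)] <- 0
  v
}

df <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6), label = c("u", NA, "w"))

# Apply the helper to the numeric columns only.
num_cols <- vapply(df, is.numeric, logical(1))
df[num_cols] <- lapply(df[num_cols], na_to_zero)
df
```

Assigning back with df[num_cols] <- keeps the result a data frame, whereas a bare apply() call would return a matrix.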
Please provide a reproducible example next time as detailed here:
How to make a great R reproducible example?
dplyr example:
library(dplyr)
df1 <- df1 %>%
mutate(myCol1 = if_else(is.na(myCol1), 0, myCol1))
Note: This works per selected column; if we need to do this for all columns, see @reidjax's answer using mutate_each.
If we are trying to replace NAs when exporting, for example when writing to csv, then we can use:
write.csv(data, "data.csv", na = "0")
It is also possible to use tidyr::replace_na.
library(tidyr)
df <- df %>% mutate_all(funs(replace_na(.,0)))
Edit (dplyr > 1.0.0):
df %>% mutate(across(everything(), .fns = ~replace_na(.,0)))
I know the question is already answered, but doing it this way might be more useful to some:
Define this function:
na.zero <- function (x) {
x[is.na(x)] <- 0
return(x)
}
Now whenever you need to convert NAs in a vector to zeros you can do:
na.zero(some.vector)
A more general approach is to use replace() on a matrix or vector to replace NA with 0.
For example:
> x <- c(1,2,NA,NA,1,1)
> x1 <- replace(x,is.na(x),0)
> x1
[1] 1 2 0 0 1 1
This is also an alternative to using ifelse() in dplyr
df = data.frame(col = c(1,2,NA,NA,1,1))
df <- df %>%
mutate(col = replace(col,is.na(col),0))
With dplyr 0.5.0, you can use coalesce function which can be easily integrated into %>% pipeline by doing coalesce(vec, 0). This replaces all NAs in vec with 0:
Say we have a data frame with NAs:
library(dplyr)
df <- data.frame(v = c(1, 2, 3, NA, 5, 6, 8))
df
# v
# 1 1
# 2 2
# 3 3
# 4 NA
# 5 5
# 6 6
# 7 8
df %>% mutate(v = coalesce(v, 0))
# v
# 1 1
# 2 2
# 3 3
# 4 0
# 5 5
# 6 6
# 7 8
To replace all NAs in a dataframe you can use:
df %>% replace(is.na(.), 0)
Would've commented on @ianmunoz's post but I don't have enough reputation. You can combine dplyr's mutate_each and replace to take care of the NA to 0 replacement. Using the dataframe from @aL3xa's answer...
> m <- matrix(sample(c(NA, 1:10), 100, replace = TRUE), 10)
> d <- as.data.frame(m)
> d
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 4 8 1 9 6 9 NA 8 9 8
2 8 3 6 8 2 1 NA NA 6 3
3 6 6 3 NA 2 NA NA 5 7 7
4 10 6 1 1 7 9 1 10 3 10
5 10 6 7 10 10 3 2 5 4 6
6 2 4 1 5 7 NA NA 8 4 4
7 7 2 3 1 4 10 NA 8 7 7
8 9 5 8 10 5 3 5 8 3 2
9 9 1 8 7 6 5 NA NA 6 7
10 6 10 8 7 1 1 2 2 5 7
> d %>% mutate_each( funs_( interp( ~replace(., is.na(.),0) ) ) )
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 4 8 1 9 6 9 0 8 9 8
2 8 3 6 8 2 1 0 0 6 3
3 6 6 3 0 2 0 0 5 7 7
4 10 6 1 1 7 9 1 10 3 10
5 10 6 7 10 10 3 2 5 4 6
6 2 4 1 5 7 0 0 8 4 4
7 7 2 3 1 4 10 0 8 7 7
8 9 5 8 10 5 3 5 8 3 2
9 9 1 8 7 6 5 0 0 6 7
10 6 10 8 7 1 1 2 2 5 7
We're using standard evaluation (SE) here which is why we need the underscore on "funs_." We also use lazyeval's interp/~ and the . references "everything we are working with", i.e. the data frame. Now there are zeros!
Another example using imputeTS package:
library(imputeTS)
na.replace(yourDataframe, 0)
data.table has dedicated functions for this purpose, nafill and setnafill.
Whenever available, they distribute columns to be computed on multiple threads.
library(data.table)
ans_df <- nafill(df, fill=0)
# or even faster, in-place
setnafill(df, fill=0)
If you want to replace NAs in factor variables, this might be useful:
n <- length(levels(data.vector))+1
data.vector <- as.numeric(data.vector)
data.vector[is.na(data.vector)] <- n
data.vector <- as.factor(data.vector)
levels(data.vector) <- c("level1","level2",...,"leveln", "NAlevel")
It converts the factor into a numeric vector, assigns the NAs an artificial extra numeric code, and then converts the result back into a factor with one additional "NA level" named as you choose.
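A shorter route to the same result, sketched here under the assumption that you only need one extra level, is to add the level first and then assign it to the NA positions. The factor f and the label "missing" are illustrative, not part of the answer above:

```r
# Hypothetical factor with one missing value
f <- factor(c("a", NA, "b"))

# Extend the set of levels so the new label becomes a legal value
f <- factor(f, levels = c(levels(f), "missing"))

# Now the NA positions can be assigned the new level directly
f[is.na(f)] <- "missing"
f
```

This avoids retyping the existing level names by hand.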
dplyr >= 1.0.0
In newer versions of dplyr:
across() supersedes the family of "scoped variants" like summarise_at(), summarise_if(), and summarise_all().
df <- data.frame(a = c(LETTERS[1:3], NA), b = c(NA, 1:3))
library(tidyverse)
df %>%
mutate(across(where(anyNA), ~ replace_na(., 0)))
a b
1 A 0
2 B 1
3 C 2
4 0 3
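This code will coerce 0 to character in the first column (recent tidyr versions may instead raise a type error, because replace_na() no longer changes the type of the column). To replace NA based on column type, you can use a purrr-like formula in where: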
df %>%
mutate(across(where(~ anyNA(.) & is.character(.)), ~ replace_na(., "0")))
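The same type-dependent idea can be sketched in base R by looping over the columns. This is an illustration only; the replacement values 0 and "0" are just the ones used in this answer:

```r
df <- data.frame(a = c(LETTERS[1:3], NA), b = c(NA, 1:3),
                 stringsAsFactors = FALSE)

# df[] <- keeps the data frame structure while replacing each column
df[] <- lapply(df, function(col) {
  if (is.numeric(col)) {
    col[is.na(col)] <- 0        # numeric columns get a numeric zero
  } else if (is.character(col)) {
    col[is.na(col)] <- "0"      # character columns get the string "0"
  }
  col
})
df
```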
No need to use any library.
df <- data.frame(a=c(1,3,5,NA))
df$a[is.na(df$a)] <- 0
df
You can use replace()
For example:
> x <- c(-1,0,1,0,NA,0,1,1)
> x1 <- replace(x,5,1)
> x1
[1] -1 0 1 0 1 0 1 1
> x1 <- replace(x,5,mean(x,na.rm=T))
> x1
[1] -1.00 0.00 1.00 0.00 0.29 0.00 1.00 1.00
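The example above hard-codes position 5. When the positions of the NAs are not known in advance, a logical index from is.na() can be passed instead. A small sketch with the same vector (x_zero and x_mean are names chosen for the example):

```r
x <- c(-1, 0, 1, 0, NA, 0, 1, 1)

# is.na(x) marks every missing position, wherever it occurs
x_zero <- replace(x, is.na(x), 0)

# The same pattern with the mean of the observed values
x_mean <- replace(x, is.na(x), mean(x, na.rm = TRUE))
```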
The cleaner package has an na_replace() generic that by default replaces missing numeric values with zeroes, logicals with FALSE, dates with today, etc.:
library(dplyr)
library(cleaner)
starwars %>% na_replace()
na_replace(starwars)
It even supports vectorised replacements:
mtcars[1:6, c("mpg", "hp")] <- NA
na_replace(mtcars, mpg, hp, replacement = c(999, 123))
Documentation: https://msberends.github.io/cleaner/reference/na_replace.html
Another dplyr pipe-compatible option, using the tidyr function replace_na, which works for several columns:
require(dplyr)
require(tidyr)
m <- matrix(sample(c(NA, 1:10), 100, replace = TRUE), 10)
d <- as.data.frame(m)
myList <- setNames(lapply(vector("list", ncol(d)), function(x) x <- 0), names(d))
df <- d %>% replace_na(myList)
You can easily restrict to e.g. numeric columns:
d$str <- c("string", NA)
myList <- myList[sapply(d, is.numeric)]
df <- d %>% replace_na(myList)
This simple function, taken from DataCamp, could help:
replace_missings <- function(x, replacement) {
is_miss <- is.na(x)
x[is_miss] <- replacement
message(sum(is_miss), " missings replaced by the value ", replacement)
x
}
Then
replace_missings(df, replacement = 0)
An easy way to write it is with if_na from hablar:
library(dplyr)
library(hablar)
df <- tibble(a = c(1, 2, 3, NA, 5, 6, 8))
df %>%
mutate(a = if_na(a, 0))
which returns:
a
<dbl>
1 1
2 2
3 3
4 0
5 5
6 6
7 8
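Replacing NA and NULL in a data frame.
For a single column:
A$name[is.na(A$name)] <- 0
or, to use the string "NA" instead:
A$name[is.na(A$name)] <- "NA"
For the whole data frame:
df[is.na(df)] <- 0
To replace NA with a blank string throughout the data frame (this converts the affected columns to character):
df[is.na(df)] <- ""
Replacing NULL with NA: note that is.null(df) returns a single FALSE for a data frame, so the line below leaves df unchanged. NULL entries can only occur as list elements, which have to be tested one by one.
df[is.null(df)] <- NA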
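Because is.null() looks at the object as a whole and returns a single value, NULL entries have to be found element by element. A minimal sketch for a plain list (the list records is made up for the example):

```r
# Hypothetical list in which one entry is NULL
records <- list(a = 1, b = NULL, c = 3)

# Test each element separately; vapply guarantees a logical vector
is_null <- vapply(records, is.null, logical(1))

# Replace the NULL entries with NA so the list can be flattened safely
records[is_null] <- NA
values <- unlist(records)
values
```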
If you want to store the result under a new column name after replacing the NAs in a specific column (in this case column V3), you can also do it like this:
my.data.frame$the.new.column.name <- ifelse(is.na(my.data.frame$V3), 0, my.data.frame$V3)
I want to add another solution, which uses the popular Hmisc package.
library(Hmisc)
data(airquality)
# imputing with 0 - all columns
# although my favorite one for simple imputations is Hmisc::impute(x, "random")
> dd <- data.frame(Map(function(x) Hmisc::impute(x, 0), airquality))
> str(dd[[1]])
'impute' Named num [1:153] 41 36 12 18 0 28 23 19 8 0 ...
- attr(*, "names")= chr [1:153] "1" "2" "3" "4" ...
- attr(*, "imputed")= int [1:37] 5 10 25 26 27 32 33 34 35 36 ...
> dd[[1]][1:10]
1 2 3 4 5 6 7 8 9 10
41 36 12 18 0* 28 23 19 8 0*
As can be seen, all the imputation metadata is stored as attributes, so it can be used later.
This is not exactly a new solution, but I like to write inline lambdas that handle things that I can't quite get packages to do. In this case,
df %>%
(function(x) { x[is.na(x)] <- 0; return(x) })
Because R does not ever "pass by object" like you might see in Python, this solution does not modify the original variable df, and so will do quite the same as most of the other solutions, but with much less need for intricate knowledge of particular packages.
Note the parens around the function definition! Though it seems a bit redundant to me, since the function definition is surrounded in curly braces, it is required that inline functions are defined within parens for magrittr.
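With the native pipe, the same inline-function idea can be sketched as follows. This assumes R >= 4.1.0; note that the native pipe additionally needs the trailing () to call the function. The names df_demo and filled are illustrative:

```r
df_demo <- data.frame(v = c(1, NA, 3))

# The anonymous function receives a copy, fills it, and returns it
filled <- df_demo |> (\(x) { x[is.na(x)] <- 0; x })()

filled
df_demo  # the original still contains the NA
```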
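This is a flexible solution for the reverse direction. Note that na_if() converts a given value to NA rather than replacing NA, so the code below turns every 0 into NA. It works no matter how large your data frame is, and whatever value marks a zero (0, zero, or anything else).
library(dplyr) # make sure the dplyr version is >= 1.0.0
df %>%
mutate(across(everything(), ~ na_if(.x, 0))) # if zero is indicated by `zero` then replace `0` with `zero`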
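Since na_if() and coalesce() work in opposite directions, a side-by-side sketch may help keep them apart. The vector x is made up for the example:

```r
library(dplyr)

x <- c(1, 0, NA, 4)

zeros_to_na <- na_if(x, 0)     # every 0 becomes NA
na_to_zeros <- coalesce(x, 0)  # every NA becomes 0
```

For replacing NA with 0, coalesce() (or tidyr's replace_na()) is the one to reach for.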
Another option using sapply to replace all NA with zeros. Here is some reproducible code (data from @aL3xa):
set.seed(7) # for reproducibility
m <- matrix(sample(c(NA, 1:10), 100, replace = TRUE), 10)
d <- as.data.frame(m)
d
#> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
#> 1 9 7 5 5 7 7 4 6 6 7
#> 2 2 5 10 7 8 9 8 8 1 8
#> 3 6 7 4 10 4 9 6 8 NA 10
#> 4 1 10 3 7 5 7 7 7 NA 8
#> 5 9 9 10 NA 7 10 1 5 NA 5
#> 6 5 2 5 10 8 1 1 5 10 3
#> 7 7 3 9 3 1 6 7 3 1 10
#> 8 7 7 6 8 4 4 5 NA 8 7
#> 9 2 1 1 2 7 5 9 10 9 3
#> 10 7 5 3 4 9 2 7 6 NA 5
d[sapply(d, \(x) is.na(x))] <- 0
d
#> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
#> 1 9 7 5 5 7 7 4 6 6 7
#> 2 2 5 10 7 8 9 8 8 1 8
#> 3 6 7 4 10 4 9 6 8 0 10
#> 4 1 10 3 7 5 7 7 7 0 8
#> 5 9 9 10 0 7 10 1 5 0 5
#> 6 5 2 5 10 8 1 1 5 10 3
#> 7 7 3 9 3 1 6 7 3 1 10
#> 8 7 7 6 8 4 4 5 0 8 7
#> 9 2 1 1 2 7 5 9 10 9 3
#> 10 7 5 3 4 9 2 7 6 0 5
Created on 2023-01-15 with reprex v2.0.2
Please note: Since R 4.1.0 you can use \(x) instead of function(x).
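Since is.na() already works on a whole data frame and returns a logical matrix, the sapply() step can probably be dropped. A quick sketch comparing the two on the same data (d1 and d2 are names chosen for the comparison):

```r
set.seed(7)
m <- matrix(sample(c(NA, 1:10), 100, replace = TRUE), 10)
d1 <- as.data.frame(m)
d2 <- d1

# Column-wise test via sapply(), as in the answer above
d1[sapply(d1, function(x) is.na(x))] <- 0

# Direct test on the data frame
d2[is.na(d2)] <- 0

identical(d1, d2)
```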
With a data.frame it is not necessary to create a new column with mutate.
library(tidyverse)
k <- c(1,2,80,NA,NA,51)
j <- c(NA,NA,3,31,12,NA)
df <- data.frame(k,j)%>%
replace_na(list(j=0))#convert only column j, for example
result
k j
1 0
2 0
80 3
NA 31
NA 12
51 0
I used this personally and it works fine:
players_wd$APPROVED_WD[is.na(players_wd$APPROVED_WD)] <- 0
