Drawing multiple barplots on a graph using data with different sizes - r

I can plot multiple bar plots on one plot with the following code (taken from this question):
mydata <- data.frame(Barplot1=rbinom(5,16,0.6), Barplot2=rbinom(5,16,0.25),
Barplot3=rbinom(5,5,0.25), Barplot4=rbinom(5,16,0.7))
barplot(as.matrix(mydata), main="Interesting", ylab="Total", beside=TRUE,
col=terrain.colors(5))
legend(13, 12, c("Label1","Label2","Label3","Label4","Label5"), cex=0.6,
fill=terrain.colors(5))
But my scenario is a bit different: I have data stored in 3 data.frames (sorted according to V2 column) where V1 column is the Y axis and V2 column is the X axis:
> tail(hist1)
V1 V2
67 2 70
68 2 72
69 1 73
70 2 74
71 1 76
72 1 84
> tail(hist2)
V1 V2
87 1 92
88 3 94
89 1 95
90 2 96
91 1 104
92 1 112
> tail(hist3)
V1 V2
103 3 110
104 1 111
105 2 112
106 2 118
107 2 120
108 1 138
For plotting one single plot it is just simple as:
barplot(hist3$V1, main="plot title", names.arg = hist3$V2)
But I cannot construct the matrix needed for the plot because of several problems I can see right now (and maybe others):
The data frames have different sizes:
> nrow(hist1)
[1] 72
> nrow(hist2)
[1] 92
> nrow(hist3)
[1] 108
There are X values (and therefore Y values) that appear in one data frame but not in another, e.g.:
> hist3$V2[which(hist3$V2==138)]
[1] 138
> hist1$V2[which(hist1$V2==138)]
integer(0)
What I need (I guess) is something that will create the appropriate V2 (x axis) entries with a Y value of 0 in the appropriate data.frame, so that all data frames have the same length and I can combine them as in the example above. See the following example with only 2 data.frames (v2 and v1 are reversed compared to the previous example):
> # missing v2 for 3,4,5
> df1
v2 v1
1 1 1
2 2 2
3 6 3
4 7 4
5 8 5
6 9 6
7 10 7
> # missing v2 for 1,2,9,10
> df2
v2 v1
1 3 1
2 4 2
3 5 3
4 6 4
5 7 5
6 8 6
> # some_magic_goes_here ...
> df1
v2 v1
1 1 1
2 2 2
3 3 0 # created
4 4 0 # created
5 5 0 # created
6 6 3
7 7 4
8 8 5
9 9 6
10 10 7
> df2
v2 v1
1 1 0 # created
2 2 0 # created
3 3 1
4 4 2
5 5 3
6 6 4
7 7 5
8 8 6
9 9 0 # created
10 10 0 # created
Thanks

Probably, you can do this by 1) retrieving all possible x-axis values (v2 values) from all data.frames, and 2) using this information to retrieve existing values and/or fill missing ones with zeroes.
set.seed(111)
df1 <- data.frame(v2 = sample(1:10, size = 7),
                  v1 = sample(1:100, size = 7))
df2 <- data.frame(v2 = sample(1:10, size = 7),
                  v1 = sample(1:100, size = 7))
df3 <- data.frame(v2 = sample(1:10, size = 7),
                  v1 = sample(1:100, size = 7))
First, retrieve your categories / x-axis values / v2
Note that if class(df1$v2) == "factor", then you should use levels() instead of unique()
my.x <- unique(c(df1$v2, df2$v2, df3$v2))
Likely, you want it sorted
my.x <- sort(my.x)
Now, use my.x to re-order/fill your data.frames, starting with df1. Specifically, you check each value of my.x: if that value is included in df1$v2, then the corresponding v1 is returned, otherwise 0.
my.df1 <- data.frame(v2 = my.x,
v1 = sapply(my.x, (function(i){
ifelse (i %in% df1$v2, df1$v1[df1$v2 == i], 0)
})))
my.df1
A simple way to apply this operation to all your data.frames is to list them together and then use lapply()
dfs <- list(df1 = df1, df2 = df2, df3 = df3)
dfs <- lapply(dfs, (function(df){
data.frame(v2 = my.x,
v1 = sapply(my.x, (function(i){
ifelse (i %in% df$v2, df$v1[df$v2 == i], 0)
})))
}))
# show all data.frames
dfs
# show df1
dfs$df1
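To get back to the original barplot goal, here is a minimal sketch of the final plotting step, mirroring the matrix-based call from the question (the colors and legend are just placeholders):
# rbind the aligned v1 vectors into a matrix; columns are the shared v2 values
mat <- do.call(rbind, lapply(dfs, function(df) df$v1))
colnames(mat) <- my.x
barplot(mat, beside = TRUE, main = "Interesting", ylab = "Total",
        col = terrain.colors(3), legend.text = names(dfs))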

Related

R - how to select elements from sublists of a list by their name

I have a list of lists that looks like this:
list(list("A[1]" = data.frame(W = 1:5),
"A[2]" = data.frame(X = 6:10),
B = data.frame(Y = 11:15),
C = data.frame(Z = 16:20)),
list("A[1]" = data.frame(W = 21:25),
"A[2]" = data.frame(X = 26:30),
B = data.frame(Y = 31:35),
C = data.frame(Z = 36:40)),
list("A[1]" = data.frame(W = 41:45),
"A[2]" = data.frame(X = 46:50),
B = data.frame(Y = 51:55),
C = data.frame(Z = 56:60))) -> dflist
I need my output to also be a list of lists with length 3, so that each sublist retains the elements whose names start with A[ while dropping the other elements.
Based on some previous questions, I am trying to use this:
dflist %>%
map(keep, names(.) %in% "A[")
but that gives the following error:
Error in probe(.x, .p, ...) : length(.p) == length(.x) is not TRUE
Trying to select a single element, for example just A[1] like this:
dflist %>%
map(keep, names(.) %in% "A[1]")
also doesn't work. How can I achieve the desired output?
I think you want:
purrr::map(dflist, ~.[stringr::str_starts(names(.), "A\\[")])
What this does is:
For each sublist (purrr::map)
Select all elements of that sublist (.[], where . is the sublist)
Whose names start with A[ (stringr::str_starts(names(.), "A\\["))
You got the top level map correct, since you want to modify the sublists. However, map(keep, names(.) %in% "A[") has some issues:
names(.) %in% "A[" should be a function or a formula (starting with ~)
purrr::keep applies the filtering function to each element of the sublist, i.e. to the data frames directly; it never "sees" the names of each data frame. Actually, I don't think you can use keep for this problem at all.
Anyway this produces:
[[1]]
[[1]]$`A[1]`
W
1 1
2 2
3 3
4 4
5 5
[[1]]$`A[2]`
X
1 6
2 7
3 8
4 9
5 10
[[2]]
[[2]]$`A[1]`
W
1 21
2 22
3 23
4 24
5 25
[[2]]$`A[2]`
X
1 26
2 27
3 28
4 29
5 30
[[3]]
[[3]]$`A[1]`
W
1 41
2 42
3 43
4 44
5 45
[[3]]$`A[2]`
X
1 46
2 47
3 48
4 49
5 50
If we want to use keep, use
library(dplyr)
library(purrr)
library(stringr)
map(dflist, ~ keep(.x, str_detect(names(.x), fixed("A["))))
Here is a base R solution:
lapply(dflist, function(x) x[grep("A\\[",names(x))] )
(The output is identical to that of the purrr solution above.)

randomly select rows based on limited random numbers

Seems simple but I can't figure it out.
I have a bunch of animal location data (217 individuals) as a single dataframe. I'm trying to randomly select X locations per individual for further analysis with the caveat that X is within the range of 6-156.
So I'm trying to set up a loop that first randomly selects a value within the range of 6-156 then use that value (say 56) to randomly extract 56 locations from the first individual animal and so on.
for(i in unique(ANIMALS$ID)){
sub<-sample(6:156,1)
sub2<-i([sample(nrow(i),sub),])
}
This approach didn't seem to work so I tried tweaking it...
for(i in unique(ANIMALS$ID)){
sub<-sample(6:156,1)
rand<-i[sample(1:nrow(i),sub,replace=FALSE),]
}
This did not work either. Any suggestions or previous postings would be helpful!
Head of the data file... ANIMALS is the name of the df; ID indicates unique individuals:
FID X Y MONTH DAY YEAR HOUR MINUTE SECOND ELKYR SOURCE ID animalid
1 0 510313 4813290 9 5 2008 22 30 0 342008 FG 1 1
2 1 510382 4813296 9 6 2008 1 30 0 342008 FG 1 1
3 2 510385 4813311 9 6 2008 2 0 0 342008 FG 1 1
4 3 510385 4813394 9 6 2008 3 30 0 342008 FG 1 1
5 4 510386 4813292 9 6 2008 2 30 0 342008 FG 1 1
6 5 510386 4813431 9 6 2008 4 1 0 342008 FG 1 1
Here's one way using mapply. This function takes two lists (or something that can be coerced into a list) and applies function FUN to corresponding elements.
# simulate some data
xy <- data.frame(animal = rep(1:10, each = 10), loc = runif(100))
# calculate number of samples for individual animal
num.samples.per.animal <- sample(3:6, length(unique(xy$animal)), replace = TRUE)
num.samples.per.animal
[1] 6 3 4 4 6 3 3 6 3 5
# subset random x number of rows from each animal
result <- do.call("rbind",
mapply(num.samples.per.animal, split(xy, f = xy$animal), FUN = function(x, y) {
y[sample(1:nrow(y), x),]
}, SIMPLIFY = FALSE)
)
result
animal loc
7 1 0.99483999
1 1 0.50951321
10 1 0.36505294
6 1 0.34058842
8 1 0.26489107
9 1 0.47418823
13 2 0.27213396
12 2 0.28087775
15 2 0.22130069
23 3 0.33646632
21 3 0.02395097
28 3 0.53079981
29 3 0.85287600
35 4 0.84534073
33 4 0.87370167
31 4 0.85646813
34 4 0.11642335
46 5 0.59624723
48 5 0.15379729
45 5 0.57046122
42 5 0.88799675
44 5 0.62171858
49 5 0.75014593
60 6 0.86915983
54 6 0.03152932
56 6 0.66128549
64 7 0.85420774
70 7 0.89262455
68 7 0.40829671
78 8 0.19073661
72 8 0.20648832
80 8 0.71778913
73 8 0.77883677
75 8 0.37647108
74 8 0.65339300
82 9 0.39957202
85 9 0.31188471
88 9 0.10900795
100 10 0.55282999
95 10 0.10145296
96 10 0.09713218
93 10 0.64900866
94 10 0.76099256
EDIT
Here is another (more straightforward) approach that also handles cases where the number of rows is less than the number of samples to be drawn.
set.seed(357)
result <- do.call("rbind",
by(xy, INDICES = xy$animal, FUN = function(x) {
avail.obs <- nrow(x)
num.rows <- sample(3:15, 1)
while (num.rows > avail.obs) {
message("Sample to be larger than available data points, repeating sampling.")
num.rows <- sample(3:15, 1)
}
x[sample(1:avail.obs, num.rows), ]
}))
result
I like Stack Overflow because I learn so much. @RomanLustrik provided a simple solution; mine is straightforward as well:
# simulate some data
xy <- data.frame(animal = rep(1:10, each = 10), loc = runif(100))
newVec <- NULL #Create an empty object to collect the sampled rows
for(i in unique(xy$animal)){
#Sample a number between 1 and 10 (or 6 and 156, if you need)
samp <- sample(1:10, 1)
#Determine which rows of dataFrame xy correspond with unique(xy$animal)[i]
rows <- which(xy$animal == unique(xy$animal)[i])
#From xy, sample samp times from the rows associated with unique(xy$animal)[i]
newVec1 <- xy[sample(rows, samp, replace = TRUE), ]
#append everything to the same new dataFrame
newVec <- rbind(newVec, newVec1)
}
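For completeness, here is a hedged sketch of the same idea with current dplyr (>= 1.0.0); ANIMALS and ID are from the question, and the per-group sample size is capped at the rows actually available:
library(dplyr)
sampled <- ANIMALS %>%
  group_by(ID) %>%
  group_modify(~ {
    n <- min(sample(6:156, 1), nrow(.x))  # draw X, but never more than available
    slice_sample(.x, n = n)
  }) %>%
  ungroup()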

remove rows based on subtraction results

I have a large data set like this:
df <- data.frame(group = c(rep(1, 6), rep(5, 6)), score = c(30, 10, 22, 44, 6, 5, 20, 35, 2, 60, 14,5))
group score
1 1 30
2 1 10
3 1 22
4 1 44
5 1 6
6 1 5
7 5 20
8 5 35
9 5 2
10 5 60
11 5 14
12 5 5
...
I want to do a subtraction for each neighboring score within each group, if the difference is greater than 30, remove the smaller score. For example, within group 1, 30-10=20<30, 10-22=-12<30, 22-44=-22<30, 44-6=38>30 (remove 6), 44-5=39>30 (remove 5)... The expected output should look like this:
group score
1 1 30
2 1 10
3 1 22
4 1 44
5 5 20
6 5 35
7 5 60
...
Does anyone have an idea how to achieve this?
Like this?
repeat {
df$diff=unlist(by(df$score,df$group,function(x)c(0,-diff(x))))
if (all(df$diff<30)) break
df <- df[df$diff<30,]
}
df$diff <- NULL
df
# group score
# 1 1 30
# 2 1 10
# 3 1 22
# 4 1 44
# 7 5 20
# 8 5 35
# 10 5 60
This (seems...) to require an iterative approach, because the "neighboring score" changes after removal of a row. So before you remove 6, the difference 44 - 6 > 30, but 6 - 5 < 30. After you remove 6, the difference 44 - 5 > 30.
So this calculates difference between successive rows by group (using by(...) and diff(...)), and removes the appropriate rows, then repeats the process until all differences are < 30.
It's not elegant but it should work:
out = data.frame(group = numeric(), score=numeric())
#cycle through the groups
for(g in levels(as.factor(df$group))){
temp = subset(df, df$group==g)
#now go through the scores
left = temp$score[1]
for(s in seq(2, length(temp$score))){
if(left - temp$score[s] > 30){#Test the condition
temp$score[s] = NA
}else{
left = temp$score[s] #if condition not met, update the reference value
}
}
#Add only the rows without NAs to the out
out = rbind(out, temp[which(!is.na(temp$score)),])
}
There should be a way to do this using ave, but carrying the last kept value forward when the next one is removed (diff > 30) is tricky! I'd appreciate a more elegant solution if there is one.
You can try
df
## group score
## 1 1 30
## 2 1 10
## 3 1 22
## 4 1 44
## 5 1 6
## 6 1 5
## 7 5 20
## 8 5 35
## 9 5 2
## 10 5 60
## 11 5 14
## 12 5 5
tmp <- df[!unlist(tapply(df$score, df$group, FUN = function(x) c(F, -diff(x) > 30), simplify = T)), ]
while (!identical(df, tmp)) {
df <- tmp
tmp <- df[!unlist(tapply(df$score, df$group, FUN = function(x) c(F, -diff(x) > 30), simplify = T)), ]
}
tmp
## group score
## 1 1 30
## 2 1 10
## 3 1 22
## 4 1 44
## 7 5 20
## 8 5 35
## 10 5 60
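If you prefer to avoid the repeat/while loops, here is a hedged single-pass sketch of the same carry-the-last-kept-value rule, written as a reusable helper (drop_small is a made-up name; it assumes rows are already ordered by group):
drop_small <- function(score, gap = 30) {
  keep <- logical(length(score))
  last <- -Inf
  for (i in seq_along(score)) {
    if (last - score[i] > gap) {
      keep[i] <- FALSE          # more than `gap` below the last kept score: drop
    } else {
      keep[i] <- TRUE
      last <- score[i]          # this score becomes the new reference
    }
  }
  keep
}
df[unlist(lapply(split(df$score, df$group), drop_small)), ]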

automating a normal transformation function in R over multiple columns

I have a data frame m with:
>m
id w y z
1 2 5 8
2 18 5 98
3 1 25 5
4 52 25 8
5 5 5 4
6 3 3 5
Below is a general function for normally transforming a variable that I need to apply to columns w,y,z.
y <- qnorm((rank(x, na.last="keep") - 0.5) / sum(!is.na(x)))
For example, if I wanted to run this function on "column w" to get the output column appended to dataframe "m" then:
m$w_n <- qnorm((rank(m$w, na.last="keep") - 0.5) / sum(!is.na(m$w)))
Can someone help me automate this to run on multiple columns in data frame m?
Ideally, I would want an output data frame with the following columns:
id w y z w_n y_n z_n
Note this is a sample data frame, the one I have is much larger and I have more letter columns to run this function on other than w, y,z.
Thanks!
There's probably a way to do it in a single step, but what about:
df <- data.frame(id = 1:6, w = sample(50, 6), z = sample(50, 6) )
df
id w z
1 1 39 40
2 2 20 26
3 3 43 11
4 4 4 37
5 5 36 24
6 6 27 14
transCols <- function(x) qnorm((rank(x,na.last="keep")-0.5)/sum(!is.na(x)))
tmpdf <- lapply(df[, -1], transCols)
names(tmpdf) <- paste0(names(tmpdf), "_n")
df_final <- cbind(df, tmpdf)
df_final
id w z w_n z_n
1 1 39 40 -0.2104284 -1.3829941
2 2 20 26 1.3829941 1.3829941
3 3 43 11 0.2104284 0.6744898
4 4 4 37 -1.3829941 0.2104284
5 5 36 24 0.6744898 -0.6744898
6 6 27 14 -0.6744898 -0.2104284
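Using the transCols function from above, here is a hedged dplyr (>= 1.0.0) variant that appends the transformed columns with a _n suffix in one step:
library(dplyr)
m_final <- m %>%
  mutate(across(c(w, y, z), transCols, .names = "{.col}_n"))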

How do I replace NA values with zeros in an R dataframe?

I have a data frame and some columns have NA values.
How do I replace these NA values with zeroes?
See my comment on @gsk3's answer. A simple example:
> m <- matrix(sample(c(NA, 1:10), 100, replace = TRUE), 10)
> d <- as.data.frame(m)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 4 3 NA 3 7 6 6 10 6 5
2 9 8 9 5 10 NA 2 1 7 2
3 1 1 6 3 6 NA 1 4 1 6
4 NA 4 NA 7 10 2 NA 4 1 8
5 1 2 4 NA 2 6 2 6 7 4
6 NA 3 NA NA 10 2 1 10 8 4
7 4 4 9 10 9 8 9 4 10 NA
8 5 8 3 2 1 4 5 9 4 7
9 3 9 10 1 9 9 10 5 3 3
10 4 2 2 5 NA 9 7 2 5 5
> d[is.na(d)] <- 0
> d
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 4 3 0 3 7 6 6 10 6 5
2 9 8 9 5 10 0 2 1 7 2
3 1 1 6 3 6 0 1 4 1 6
4 0 4 0 7 10 2 0 4 1 8
5 1 2 4 0 2 6 2 6 7 4
6 0 3 0 0 10 2 1 10 8 4
7 4 4 9 10 9 8 9 4 10 0
8 5 8 3 2 1 4 5 9 4 7
9 3 9 10 1 9 9 10 5 3 3
10 4 2 2 5 0 9 7 2 5 5
There's no need to apply apply — is.na() on a data frame returns a logical matrix that can index the whole frame at once. =)
EDIT
You should also take a look at the norm package. It has a lot of nice features for missing data analysis. =)
The dplyr hybridized options are now around 30% faster than the Base R subset reassigns. On a 100M datapoint dataframe, mutate_all(~replace(., is.na(.), 0)) runs about half a second faster than the base R d[is.na(d)] <- 0 option. What you specifically want to avoid is using ifelse() or if_else(). (The complete 600-trial analysis ran to over 4.5 hours, mostly due to including these approaches.) Please see the benchmark analyses below for the complete results.
If you are struggling with massive dataframes, data.table is the fastest option of all: 40% faster than the standard Base R approach. It also modifies the data in place, effectively allowing you to work with nearly twice as much of the data at once.
A clustering of other helpful tidyverse replacement approaches
Locationally:
index mutate_at(c(5:10), ~replace(., is.na(.), 0))
direct reference mutate_at(vars(var5:var10), ~replace(., is.na(.), 0))
fixed match mutate_at(vars(contains("1")), ~replace(., is.na(.), 0))
or in place of contains(), try ends_with(), starts_with()
pattern match mutate_at(vars(matches("\\d{2}")), ~replace(., is.na(.), 0))
Conditionally:
(change just single type and leave other types alone.)
integers mutate_if(is.integer, ~replace(., is.na(.), 0))
numbers mutate_if(is.numeric, ~replace(., is.na(.), 0))
strings mutate_if(is.character, ~replace(., is.na(.), 0))
The Complete Analysis
Updated for dplyr 0.8.0: the functions use purrr-style ~ lambdas, replacing the deprecated funs() arguments.
Approaches tested:
# Base R:
baseR.sbst.rssgn <- function(x) { x[is.na(x)] <- 0; x }
baseR.replace <- function(x) { replace(x, is.na(x), 0) }
baseR.for <- function(x) { for(j in 1:ncol(x))
x[[j]][is.na(x[[j]])] = 0 }
# tidyverse
## dplyr
dplyr_if_else <- function(x) { mutate_all(x, ~if_else(is.na(.), 0, .)) }
dplyr_coalesce <- function(x) { mutate_all(x, ~coalesce(., 0)) }
## tidyr
tidyr_replace_na <- function(x) { replace_na(x, as.list(setNames(rep(0, 10), as.list(c(paste0("var", 1:10)))))) }
## hybrid
hybrd.ifelse <- function(x) { mutate_all(x, ~ifelse(is.na(.), 0, .)) }
hybrd.replace_na <- function(x) { mutate_all(x, ~replace_na(., 0)) }
hybrd.replace <- function(x) { mutate_all(x, ~replace(., is.na(.), 0)) }
hybrd.rplc_at.idx<- function(x) { mutate_at(x, c(1:10), ~replace(., is.na(.), 0)) }
hybrd.rplc_at.nse<- function(x) { mutate_at(x, vars(var1:var10), ~replace(., is.na(.), 0)) }
hybrd.rplc_at.stw<- function(x) { mutate_at(x, vars(starts_with("var")), ~replace(., is.na(.), 0)) }
hybrd.rplc_at.ctn<- function(x) { mutate_at(x, vars(contains("var")), ~replace(., is.na(.), 0)) }
hybrd.rplc_at.mtc<- function(x) { mutate_at(x, vars(matches("\\d+")), ~replace(., is.na(.), 0)) }
hybrd.rplc_if <- function(x) { mutate_if(x, is.numeric, ~replace(., is.na(.), 0)) }
# data.table
library(data.table)
DT.for.set.nms <- function(x) { for (j in names(x))
set(x,which(is.na(x[[j]])),j,0) }
DT.for.set.sqln <- function(x) { for (j in seq_len(ncol(x)))
set(x,which(is.na(x[[j]])),j,0) }
DT.nafill <- function(x) { nafill(x, fill=0) }
DT.setnafill <- function(x) { setnafill(x, fill=0) }
The code for this analysis:
library(microbenchmark)
# 20% NA filled dataframe of 10 Million rows and 10 columns
set.seed(42) # to recreate the exact dataframe
dfN <- as.data.frame(matrix(sample(c(NA, as.numeric(1:4)), 1e7*10, replace = TRUE),
dimnames = list(NULL, paste0("var", 1:10)),
ncol = 10))
# Running 600 trials with each replacement method
# (the functions are executed locally - so that the original dataframe remains unmodified in all cases)
perf_results <- microbenchmark(
hybrd.ifelse = hybrd.ifelse(copy(dfN)),
dplyr_if_else = dplyr_if_else(copy(dfN)),
hybrd.replace_na = hybrd.replace_na(copy(dfN)),
baseR.sbst.rssgn = baseR.sbst.rssgn(copy(dfN)),
baseR.replace = baseR.replace(copy(dfN)),
dplyr_coalesce = dplyr_coalesce(copy(dfN)),
tidyr_replace_na = tidyr_replace_na(copy(dfN)),
hybrd.replace = hybrd.replace(copy(dfN)),
hybrd.rplc_at.ctn= hybrd.rplc_at.ctn(copy(dfN)),
hybrd.rplc_at.nse= hybrd.rplc_at.nse(copy(dfN)),
baseR.for = baseR.for(copy(dfN)),
hybrd.rplc_at.idx= hybrd.rplc_at.idx(copy(dfN)),
DT.for.set.nms = DT.for.set.nms(copy(dfN)),
DT.for.set.sqln = DT.for.set.sqln(copy(dfN)),
times = 600L
)
Summary of Results
> print(perf_results)
Unit: milliseconds
expr min lq mean median uq max neval
hybrd.ifelse 6171.0439 6339.7046 6425.221 6407.397 6496.992 7052.851 600
dplyr_if_else 3737.4954 3877.0983 3953.857 3946.024 4023.301 4539.428 600
hybrd.replace_na 1497.8653 1706.1119 1748.464 1745.282 1789.804 2127.166 600
baseR.sbst.rssgn 1480.5098 1686.1581 1730.006 1728.477 1772.951 2010.215 600
baseR.replace 1457.4016 1681.5583 1725.481 1722.069 1766.916 2089.627 600
dplyr_coalesce 1227.6150 1483.3520 1524.245 1519.454 1561.488 1996.859 600
tidyr_replace_na 1248.3292 1473.1707 1521.889 1520.108 1570.382 1995.768 600
hybrd.replace 913.1865 1197.3133 1233.336 1238.747 1276.141 1438.646 600
hybrd.rplc_at.ctn 916.9339 1192.9885 1224.733 1227.628 1268.644 1466.085 600
hybrd.rplc_at.nse 919.0270 1191.0541 1228.749 1228.635 1275.103 2882.040 600
baseR.for 869.3169 1180.8311 1216.958 1224.407 1264.737 1459.726 600
hybrd.rplc_at.idx 839.8915 1189.7465 1223.326 1228.329 1266.375 1565.794 600
DT.for.set.nms 761.6086 915.8166 1015.457 1001.772 1106.315 1363.044 600
DT.for.set.sqln 787.3535 918.8733 1017.812 1002.042 1122.474 1321.860 600
Boxplot of Results
library(ggplot2)
ggplot(perf_results, aes(x=expr, y=time/10^9)) +
geom_boxplot() +
xlab('Expression') +
ylab('Elapsed Time (Seconds)') +
scale_y_continuous(breaks = seq(0,7,1)) +
coord_flip()
Color-coded Scatterplot of Trials (with y-axis on a log scale)
qplot(y=time/10^9, data=perf_results, colour=expr) +
labs(y = "log10 Scaled Elapsed Time per Trial (secs)", x = "Trial Number") +
coord_cartesian(ylim = c(0.75, 7.5)) +
scale_y_log10(breaks=c(0.75, 0.875, 1, 1.25, 1.5, 1.75, seq(2, 7.5)))
A note on the other high performers
When the datasets get larger, tidyr's replace_na had historically pulled out in front. With the current collection of 100M data points to run through, it performs almost exactly as well as a Base R for loop. I am curious to see what happens for different sized dataframes.
Additional examples for the mutate and summarize _at and _all function variants can be found here: https://rdrr.io/cran/dplyr/man/summarise_all.html
Additionally, I found helpful demonstrations and collections of examples here: https://blog.exploratory.io/dplyr-0-5-is-awesome-heres-why-be095fd4eb8a
Attributions and Appreciations
With special thanks to:
Tyler Rinker and Akrun for demonstrating microbenchmark.
alexis_laz for working on helping me understand the use of local(), and (with Frank's patient help, too) the role that silent coercion plays in speeding up many of these approaches.
ArthurYip for the poke to add the newer coalesce() function in and update the analysis.
Gregor for the nudge to figure out the data.table functions well enough to finally include them in the lineup.
Base R For loop: alexis_laz
data.table For Loops: Matt_Dowle
Roman for explaining what is.numeric() really tests.
(Of course, please reach over and give them upvotes, too if you find those approaches useful.)
Note on my use of Numerics: If you do have a pure integer dataset, all of your functions will run faster. Please see alexis_laz's work for more information. IRL, I can't recall encountering a data set containing more than 10-15% integers, so I am running these tests on fully numeric dataframes.
Hardware Used
3.9 GHz CPU with 24 GB RAM
For a single vector:
x <- c(1,2,NA,4,5)
x[is.na(x)] <- 0
For a data.frame, make a function out of the above, then apply it to the columns.
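For instance, a minimal sketch of that last step (fill_zero is a hypothetical helper name):
fill_zero <- function(col) { col[is.na(col)] <- 0; col }
df[] <- lapply(df, fill_zero)   # df[] keeps the data.frame structure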
Please provide a reproducible example next time as detailed here:
How to make a great R reproducible example?
dplyr example:
library(dplyr)
df1 <- df1 %>%
mutate(myCol1 = if_else(is.na(myCol1), 0, myCol1))
Note: This works per selected column; if we need to do this for all columns, see @reidjax's answer using mutate_each.
If we are trying to replace NAs when exporting, for example when writing to csv, then we can use:
write.csv(data, "data.csv", na = "0")
It is also possible to use tidyr::replace_na.
library(tidyr)
df <- df %>% mutate_all(funs(replace_na(.,0)))
Edit (dplyr > 1.0.0):
df %>% mutate(across(everything(), .fns = ~replace_na(.,0)))
I know the question is already answered, but doing it this way might be more useful to some:
Define this function:
na.zero <- function (x) {
x[is.na(x)] <- 0
return(x)
}
Now whenever you need to convert NA's in a vector to zero's you can do:
na.zero(some.vector)
A more general approach is to use replace() on a matrix or vector to replace NA with 0.
For example:
> x <- c(1,2,NA,NA,1,1)
> x1 <- replace(x,is.na(x),0)
> x1
[1] 1 2 0 0 1 1
This is also an alternative to using ifelse() in dplyr
df = data.frame(col = c(1,2,NA,NA,1,1))
df <- df %>%
mutate(col = replace(col,is.na(col),0))
With dplyr 0.5.0, you can use the coalesce() function, which integrates easily into a %>% pipeline: coalesce(vec, 0) replaces all NAs in vec with 0.
Say we have a data frame with NAs:
library(dplyr)
df <- data.frame(v = c(1, 2, 3, NA, 5, 6, 8))
df
# v
# 1 1
# 2 2
# 3 3
# 4 NA
# 5 5
# 6 6
# 7 8
df %>% mutate(v = coalesce(v, 0))
# v
# 1 1
# 2 2
# 3 3
# 4 0
# 5 5
# 6 6
# 7 8
To replace all NAs in a dataframe you can use:
df %>% replace(is.na(.), 0)
Would've commented on @ianmunoz's post but I don't have enough reputation. You can combine dplyr's mutate_each and replace to take care of the NA to 0 replacement. Using the dataframe from @aL3xa's answer...
> m <- matrix(sample(c(NA, 1:10), 100, replace = TRUE), 10)
> d <- as.data.frame(m)
> d
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 4 8 1 9 6 9 NA 8 9 8
2 8 3 6 8 2 1 NA NA 6 3
3 6 6 3 NA 2 NA NA 5 7 7
4 10 6 1 1 7 9 1 10 3 10
5 10 6 7 10 10 3 2 5 4 6
6 2 4 1 5 7 NA NA 8 4 4
7 7 2 3 1 4 10 NA 8 7 7
8 9 5 8 10 5 3 5 8 3 2
9 9 1 8 7 6 5 NA NA 6 7
10 6 10 8 7 1 1 2 2 5 7
> library(lazyeval)
> d %>% mutate_each( funs_( interp( ~replace(., is.na(.), 0) ) ) )
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 4 8 1 9 6 9 0 8 9 8
2 8 3 6 8 2 1 0 0 6 3
3 6 6 3 0 2 0 0 5 7 7
4 10 6 1 1 7 9 1 10 3 10
5 10 6 7 10 10 3 2 5 4 6
6 2 4 1 5 7 0 0 8 4 4
7 7 2 3 1 4 10 0 8 7 7
8 9 5 8 10 5 3 5 8 3 2
9 9 1 8 7 6 5 0 0 6 7
10 6 10 8 7 1 1 2 2 5 7
We're using standard evaluation (SE) here which is why we need the underscore on "funs_." We also use lazyeval's interp/~ and the . references "everything we are working with", i.e. the data frame. Now there are zeros!
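With current dplyr (>= 1.0.0) the same replacement no longer needs lazyeval; a hedged equivalent:
library(dplyr)
d %>% mutate(across(everything(), ~ replace(., is.na(.), 0)))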
Another example, using the imputeTS package:
library(imputeTS)
na.replace(yourDataframe, 0)
data.table has dedicated functions, nafill and setnafill, for this purpose.
Where possible, they distribute the columns to be computed across multiple threads.
library(data.table)
ans_df <- nafill(df, fill=0)
# or even faster, in-place
setnafill(df, fill=0)
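These functions only handle numeric-type columns, so on mixed data frames you may want to restrict them with the cols argument; a hedged sketch:
num_cols <- names(df)[sapply(df, is.numeric)]   # pick out the numeric columns
setnafill(df, fill = 0, cols = num_cols)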
If you want to replace NAs in factor variables, this might be useful:
n <- length(levels(data.vector))+1
data.vector <- as.numeric(data.vector)
data.vector[is.na(data.vector)] <- n
data.vector <- as.factor(data.vector)
levels(data.vector) <- c("level1","level2",...,"leveln", "NAlevel")
It transforms the factor vector into a numeric vector, adds another artificial numeric factor level, and then transforms it back to a factor vector with one extra "NA level" of your choice.
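A hedged base R alternative for the same factor case: addNA() makes NA an explicit level in one step, which you can then rename:
f <- factor(c("a", "b", NA))
f <- addNA(f)                             # NA becomes a real level
levels(f)[is.na(levels(f))] <- "NAlevel"  # give the NA level a readable name
f
# [1] a       b       NAlevel
# Levels: a b NAlevel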
dplyr >= 1.0.0
In newer versions of dplyr:
across() supersedes the family of "scoped variants" like summarise_at(), summarise_if(), and summarise_all().
df <- data.frame(a = c(LETTERS[1:3], NA), b = c(NA, 1:3))
library(tidyverse)
df %>%
mutate(across(where(anyNA), ~ replace_na(., 0)))
a b
1 A 0
2 B 1
3 C 2
4 0 3
This code will coerce 0 to be character in the first column. To replace NA based on column type you can use a purrr-like formula in where:
df %>%
mutate(across(where(~ anyNA(.) & is.character(.)), ~ replace_na(., "0")))
No need to use any library.
df <- data.frame(a=c(1,3,5,NA))
df$a[is.na(df$a)] <- 0
df
You can use replace()
For example:
> x <- c(-1,0,1,0,NA,0,1,1)
> x1 <- replace(x,5,1)
> x1
[1] -1 0 1 0 1 0 1 1
> x1 <- replace(x,5,mean(x,na.rm=T))
> x1
[1] -1.00 0.00 1.00 0.00 0.29 0.00 1.00 1.00
The cleaner package has an na_replace() generic, that at default replaces numeric values with zeroes, logicals with FALSE, dates with today, etc.:
library(dplyr)
library(cleaner)
starwars %>% na_replace()
na_replace(starwars)
It even supports vectorised replacements:
mtcars[1:6, c("mpg", "hp")] <- NA
na_replace(mtcars, mpg, hp, replacement = c(999, 123))
Documentation: https://msberends.github.io/cleaner/reference/na_replace.html
Another dplyr-pipe-compatible option is the tidyr method replace_na, which works for several columns:
require(dplyr)
require(tidyr)
m <- matrix(sample(c(NA, 1:10), 100, replace = TRUE), 10)
d <- as.data.frame(m)
myList <- setNames(lapply(vector("list", ncol(d)), function(x) x <- 0), names(d))
df <- d %>% replace_na(myList)
You can easily restrict to e.g. numeric columns:
d$str <- c("string", NA)
myList <- myList[sapply(d, is.numeric)]
df <- d %>% replace_na(myList)
This simple function extracted from Datacamp could help:
replace_missings <- function(x, replacement) {
is_miss <- is.na(x)
x[is_miss] <- replacement
message(sum(is_miss), " missings replaced by the value ", replacement)
x
}
Then
replace_missings(df, replacement = 0)
An easy way to write it is with if_na from hablar:
library(dplyr)
library(hablar)
df <- tibble(a = c(1, 2, 3, NA, 5, 6, 8))
df %>%
mutate(a = if_na(a, 0))
which returns:
a
<dbl>
1 1
2 2
3 3
4 0
5 5
6 6
7 8
Replace NA (and NULL) values in a data frame.
For a single column:
A$name[is.na(A$name)] <- 0
or
A$name[is.na(A$name)] <- "NA"
For the whole data frame:
df[is.na(df)] <- 0
To replace NA with blanks instead:
df[is.na(df)] <- ""
To replace NULL with NA:
df[is.null(df)] <- NA
If you want a new column that encodes the NAs of a specific column (here V3) as 0 and everything else as 1, you can do it like this:
my.data.frame$the.new.column.name <- ifelse(is.na(my.data.frame$V3),0,1)
I want to add one more solution, using the popular Hmisc package.
library(Hmisc)
data(airquality)
# imputing with 0 - all columns
# although my favorite one for simple imputations is Hmisc::impute(x, "random")
> dd <- data.frame(Map(function(x) Hmisc::impute(x, 0), airquality))
> str(dd[[1]])
'impute' Named num [1:153] 41 36 12 18 0 28 23 19 8 0 ...
- attr(*, "names")= chr [1:153] "1" "2" "3" "4" ...
- attr(*, "imputed")= int [1:37] 5 10 25 26 27 32 33 34 35 36 ...
> dd[[1]][1:10]
1 2 3 4 5 6 7 8 9 10
41 36 12 18 0* 28 23 19 8 0*
As you can see, all the imputation metadata is stored in attributes, so it can be used later.
This is not exactly a new solution, but I like to write inline lambdas that handle things that I can't quite get packages to do. In this case,
df %>%
(function(x) { x[is.na(x)] <- 0; return(x) })
Because R does not ever "pass by object" like you might see in Python, this solution does not modify the original variable df, and so will do quite the same as most of the other solutions, but with much less need for intricate knowledge of particular packages.
Note the parens around the function definition! Though it seems a bit redundant to me, since the function definition is surrounded in curly braces, it is required that inline functions are defined within parens for magrittr.
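Equivalently, naming the function first avoids the extra parentheses entirely (zero_na is a made-up name):
zero_na <- function(x) { x[is.na(x)] <- 0; x }
df %>% zero_na()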
Note that this solution goes in the opposite direction: na_if() converts a given value to NA. It is flexible when missing values in your raw data are encoded as 0, "zero", or whatever else, and you want proper NAs first, no matter how large your data frame is.
library(dplyr) # make sure dplyr ver is >= 1.00
df %>%
mutate(across(everything(), na_if, 0)) # if missing values are written as `zero`, pass `"zero"` instead of `0`
Another option, using sapply to replace all NA with zeros. Here is some reproducible code (data from @aL3xa):
set.seed(7) # for reproducibility
m <- matrix(sample(c(NA, 1:10), 100, replace = TRUE), 10)
d <- as.data.frame(m)
d
#> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
#> 1 9 7 5 5 7 7 4 6 6 7
#> 2 2 5 10 7 8 9 8 8 1 8
#> 3 6 7 4 10 4 9 6 8 NA 10
#> 4 1 10 3 7 5 7 7 7 NA 8
#> 5 9 9 10 NA 7 10 1 5 NA 5
#> 6 5 2 5 10 8 1 1 5 10 3
#> 7 7 3 9 3 1 6 7 3 1 10
#> 8 7 7 6 8 4 4 5 NA 8 7
#> 9 2 1 1 2 7 5 9 10 9 3
#> 10 7 5 3 4 9 2 7 6 NA 5
d[sapply(d, \(x) is.na(x))] <- 0
d
#> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
#> 1 9 7 5 5 7 7 4 6 6 7
#> 2 2 5 10 7 8 9 8 8 1 8
#> 3 6 7 4 10 4 9 6 8 0 10
#> 4 1 10 3 7 5 7 7 7 0 8
#> 5 9 9 10 0 7 10 1 5 0 5
#> 6 5 2 5 10 8 1 1 5 10 3
#> 7 7 3 9 3 1 6 7 3 1 10
#> 8 7 7 6 8 4 4 5 0 8 7
#> 9 2 1 1 2 7 5 9 10 9 3
#> 10 7 5 3 4 9 2 7 6 0 5
Created on 2023-01-15 with reprex v2.0.2
Please note: Since R 4.1.0 you can use \(x) instead of function(x).
With a data.frame it is not necessary to create a new column via mutate(); replace_na() can target specific columns directly.
library(tidyverse)
k <- c(1,2,80,NA,NA,51)
j <- c(NA,NA,3,31,12,NA)
df <- data.frame(k,j)%>%
replace_na(list(j=0))#convert only column j, for example
Result:
k j
1 0
2 0
80 3
NA 31
NA 12
51 0
I used this personally and it works fine:
players_wd$APPROVED_WD[is.na(players_wd$APPROVED_WD)] <- 0
