Speed up fill function in R

I have a data frame with missing values, and I've written a function (using R 3.3.2) to fill them.
pkgs <- c("dplyr", "ggplot2", "tidyr", "data.table", "lazyeval")
lapply(pkgs, require, character.only = TRUE)
UID <- c('A', 'A', 'A', 'B', 'B', 'B', 'C', 'C')
Col1 <- c(1, 0, 0, 0, 1, 0, 0, 0)
df <- data.frame(UID, Col1)
Function to fill in Col1:
AggregatedColumns <- function(DF, columnToUse, NewCol1) {
# Setting up column names to use
columnToUse <- deparse(substitute(columnToUse))
NewCol1 <- deparse(substitute(NewCol1))
# Creating new columns
DF[[NewCol1]] <- ifelse(DF[[columnToUse]] == 1, 1, NA)
# NOTE: sort() cannot be used inside a pipe like this; these calls are what the edit below fixes
DF <- DF %>% group_by_("UID") %>% sort(DF[[columnToUse]], decreasing = TRUE) %>% fill_(NewCol1)
DF <- DF %>% group_by_("UID") %>% sort(DF$columnToUse, decreasing = TRUE) %>% fill_(NewCol1, .direction = 'up')
DF[[NewCol1]] <- ifelse(is.na(DF[[NewCol1]]), 0, DF[[NewCol1]])
DF
}
I've pulled out this part of the function because it is the piece that slows everything down; I've isolated the speed issue to the fill_ calls. I'm very new to writing functions, and any advice on how (or whether) this can be sped up would be appreciated.
What I am trying to do is take a dummy value from Col1, put it in New_Column, and then fill it forward and backward to the other rows with the same UID. For example:
UID Col1
John Smith 1
John Smith 0
Should become
UID Col1 New_Column
John Smith 1 1
John Smith 0 1
EDITED FUNCTION
I edited the function to follow @HubertL's suggestion. The function is still fairly slow, but hopefully with these edits the example is reproducible.
AggregatedColumns <- function(DF, columnToUse, NewCol1) {
# Setting up column names to use
columnToUse <- deparse(substitute(columnToUse))
NewCol1 <- deparse(substitute(NewCol1))
# Creating new columns
DF[[NewCol1]] <- ifelse(DF[[columnToUse]] == 1, 1, NA)
DF <- DF %>% group_by_("UID") %>% fill_(NewCol1) %>% fill_(NewCol1, .direction = 'up')
DF[[NewCol1]] <- ifelse(is.na(DF[[NewCol1]]), 0, DF[[NewCol1]])
DF
}
Desired output:
UID Col1 New
A 1 1
A 0 1
A 0 1
B 0 1
B 1 1
B 0 1
C 0 0
C 0 0

First of all, here are a few points:
You are needlessly calling ifelse() (twice), and ifelse() is a very inefficient function.
You are needlessly using an inefficient by-group function from an external package (also twice) when you could simply vectorize the process with base R.
Here's a simple one-liner, using no external packages, that improves performance by a factor of about 72 on a 5e7-row data set (and probably by much more on bigger ones):
AggregatedColumns2 <- function(DF, columnToUse, NewCol1) {
# Setting up column names to use
columnToUse <- deparse(substitute(columnToUse))
NewCol1 <- deparse(substitute(NewCol1))
# Creating the new column (one simple line)
DF[[NewCol1]] <- as.integer(DF$UID %in% DF$UID[DF[[columnToUse]] == 1])
# returning new data set back
DF
}
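As a quick sanity check, calling the rewritten function on the small example df from the question reproduces the desired output:
res <- AggregatedColumns2(df, Col1, New)
res
#   UID Col1 New
# 1   A    1   1
# 2   A    0   1
# 3   A    0   1
# 4   B    0   1
# 5   B    1   1
# 6   B    0   1
# 7   C    0   0
# 8   C    0   0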
Benchmark
set.seed(123)
library(stringi)
N <- 5e7
UID <- stri_rand_strings(N, 2)
Col1 <- sample(0:1, N, replace = TRUE)
df <- data.frame(UID, Col1)
system.time(res <- AggregatedColumns(df, Col1, NewCol1))
# user system elapsed
# 198.67 3.94 203.07
system.time(res2 <- AggregatedColumns2(df, Col1, NewCol1))
# user system elapsed
# 2.82 0.00 2.82
Now, in order to compare the results, I will reorder both and convert them to matrices, because the Hadleyverse packages add tons of unnecessary attributes (compare the mess in str(res) vs. the simple structure in str(res2)):
identical(arrange(res, UID) %>% as.matrix, arrange(res2, UID) %>% as.matrix)
## [1] TRUE

If speed is a concern, you may try this with data.table and na.locf() from the zoo package (LOCF means last observation carried forward; here it is applied downward and then upward within each UID).
library(data.table)
setDT(df)[Col1 != 0, New := Col1
          ][, New := zoo::na.locf(New, na.rm = FALSE), UID
          ][, New := zoo::na.locf(New, na.rm = FALSE, fromLast = TRUE), UID
          ][is.na(New), New := 0][]
# UID Col1 New
#1: A 1 1
#2: A 0 1
#3: A 0 1
#4: B 0 1
#5: B 1 1
#6: B 0 1
#7: C 0 0
#8: C 0 0
This is just to give an idea. It still needs to be wrapped in a function call.
It assumes that value 0 in Col1 is considered as missing.
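A minimal sketch of such a wrapper, hard-coding the UID and Col1 names for brevity (FillByGroup is a hypothetical name, not from the original answer):
library(data.table)

FillByGroup <- function(DF) {
  DT <- as.data.table(DF)
  DT[Col1 != 0, New := Col1]
  # fill down, then up, within each UID
  DT[, New := zoo::na.locf(New, na.rm = FALSE), by = UID]
  DT[, New := zoo::na.locf(New, na.rm = FALSE, fromLast = TRUE), by = UID]
  # any UID with no 1 at all stays NA; treat it as 0
  DT[is.na(New), New := 0]
  DT[]
}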

Related

Merging two dataframes in R and arranging rows based on certain conditions

I have two dataframes, df1 and df2, which I have merged into another dataframe, df3:
df1 <- data.frame(
Name = c("A", "B", "C"),
Value = c(1, 2, 3),
Method = c("Indirect"))
df2 <- data.frame(
Name = c("A", "B"),
Value = c(4, 5),
Method = c("Direct"))
df3 <- rbind(df1, df2)
So df3 looks something like this:
  Name Value   Method
1    A     1 Indirect
2    B     2 Indirect
3    C     3 Indirect
4    A     4   Direct
5    B     5   Direct
Now I need to identify all the entries in the Name column that appear only once (C in this case), and for each of them a row is to be added with the same "Name", a "Value" of 0, and the opposite "Method". The output should look like this:
  Name Value   Method
1    A     1 Indirect
2    B     2 Indirect
3    C     3 Indirect
4    A     4   Direct
5    B     5   Direct
6    C     0   Direct
Finally the rows with similar "Name" are to be arranged one below the other.
I have a huge dataframe and I need to achieve the above mentioned outcome in the most efficient way in R. How do I proceed?
One way
tmp <- df3[!(df3$Name %in% df3$Name[duplicated(df3$Name)]), ]
tmp$Value <- 0
tmp$Method <- ifelse(tmp$Method == "Direct", "Indirect", "Direct")
Name Value Method
3 C 0 Direct
You can now rbind this to your original data and sort it:
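A sketch of that final step (the row name "31" comes from rbind de-duplicating row names):
out <- rbind(df3, tmp)
out[order(out$Name), ]
#    Name Value   Method
# 1     A     1 Indirect
# 4     A     4   Direct
# 2     B     2 Indirect
# 5     B     5   Direct
# 3     C     3 Indirect
# 31    C     0   Direct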
Here is another solution using data.table.
Reprex
Code
library(data.table)
library(magrittr) # for the pipe!
setDT(df3)
df3 <- rbindlist(list(df3,
df3[!(df3$Name %in% df3[duplicated(Name)]$Name)
][, `:=` (Value = 0, Method = fifelse(Method == "Indirect", "Direct", "Indirect"))])) %>%
setorder(., Name)
Output
df3
#> Name Value Method
#> 1: A 1 Indirect
#> 2: A 4 Direct
#> 3: B 2 Indirect
#> 4: B 5 Direct
#> 5: C 3 Indirect
#> 6: C 0 Direct
Created on 2021-12-15 by the reprex package (v2.0.1)
I think that with 10,000 rows you will barely notice it:
library(dplyr)
df3 |>
add_count(Name) |>
filter(n == 1) |>
mutate(
Value = 0,
Method = c(Indirect = 'Direct', Direct = 'Indirect')[Method],
n = NULL
) |>
bind_rows(df3) |>
arrange(Name, Value, Method)
# Name Value Method
# 1 A 1 Indirect
# 2 A 4 Direct
# 3 B 2 Indirect
# 4 B 5 Direct
# 5 C 0 Direct
# 6 C 3 Indirect

Subset data.frame based on lag between two columns

Suppose you want to subset a data.frame where the rule for keeping rows is based on a lagged relationship between columns 'a' and 'b':
# input
df <- data.frame(a = c(1,0,0,0,1,0,0,0,0,0,0,0),
b = c(0,1,1,0,0,1,1,0,0,0,1,1))
#output
a b
1 1 0
2 0 1
3 0 1
4 1 0
5 0 1
6 0 1
Essentially, if 'a' = 1 you want to keep that row as well as the subsequent run of rows in
'b' that have a value of 1. This capture continues until the next row with a = 0 & b = 0.
I've tried using nested ifelse() statements, but I am stuck on how to incorporate logical tests based on lagged values.
Suggestions?
This is how I would do it. There are probably options out there that need one or two fewer lines.
df <- data.frame(a = c(1,0,0,0,1,0,0,0,0,0,0,0),
b = c(0,1,1,0,0,1,1,0,0,0,1,1))
library(dplyr)
df %>%
mutate(grp = cumsum(a==1|a+b==0)) %>%
group_by(grp) %>%
filter(any(a == 1)) %>%
ungroup() %>%
select(a, b)
A solution without dplyr, working with a flag:
# input
df <- data.frame(a = c(1,0,0,0,1,0,0,0,0,0,0,0),
b = c(0,1,1,0,0,1,1,0,0,0,1,1))
# create new empty df
new_df <- read.table(text = "", col.names = c("a", "b"))
a_okay = FALSE # initialize the flag
for (row_number in seq_len(nrow(df))) { # loop over each row of the original df
# if a is 1, we add the row to the new df and set the flag to TRUE
if (df[row_number, "a"] == 1) {
a_okay = TRUE
new_df[nrow(new_df) + 1, ] = c(df[row_number, "a"], df[row_number, "b"])
}
# now we consider the rows where a is not 1
else {
# if b is 1 and we are still following an a == 1: add the row
if (df[row_number, "b"] == 1 & a_okay) {
new_df[nrow(new_df) + 1, ] = c(df[row_number, "a"], df[row_number, "b"])
}
# if b is 0, we reset the flag
else {
a_okay = FALSE
}
}
}
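For the example input, new_df reproduces the desired output:
new_df
#   a b
# 1 1 0
# 2 0 1
# 3 0 1
# 4 1 0
# 5 0 1
# 6 0 1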
Another base solution inspired by this post, @Wietse de Vries's answer and @Ben's comment.
# input
df <- data.frame(a = c(1,0,0,0,1,0,0,0,0,0,0,0),
b = c(0,1,1,0,0,1,1,0,0,0,1,1))
# identify groups
df$grp <- cumsum(df$a == 1 | df$b == 0)
# subset df by groups with first element of a == 1
df <- do.call(rbind, split(df, df$grp)[by(df, df$grp, function(x) {x$a[1] == 1})])
# remove grp
df$grp <- NULL

Combining tables in R with some value replacement [duplicate]

I have two data.frames, one with only characters and the other one with characters and values.
df1 = data.frame(x=c('a', 'b', 'c', 'd', 'e'))
df2 = data.frame(x=c('a', 'b', 'c'),y = c(0,1,0))
merge(df1, df2)
x y
1 a 0
2 b 1
3 c 0
I want to merge df1 and df2. The characters a, b and c merged fine and have 0, 1, 0, but d and e get nothing. I want d and e in the merged table as well, each with a 0. Thus for every row of df1 that has no match in df2, a 0 must be filled in, like:
x y
1 a 0
2 b 1
3 c 0
4 d 0
5 e 0
Take a look at the help page for merge. The all parameter lets you specify different types of merges. Here we want to set all = TRUE. This will make merge return NA for the values that don't match, which we can update to 0 with is.na():
zz <- merge(df1, df2, all = TRUE)
zz[is.na(zz)] <- 0
> zz
x y
1 a 0
2 b 1
3 c 0
4 d 0
5 e 0
Updated many years later to address a follow-up question
You need to identify the variable names in the second data table that you aren't merging on - I use setdiff() for this. Check out the following:
df1 = data.frame(x=c('a', 'b', 'c', 'd', 'e', NA))
df2 = data.frame(x=c('a', 'b', 'c'),y1 = c(0,1,0), y2 = c(0,1,0))
#merge as before
df3 <- merge(df1, df2, all = TRUE)
#columns in df2 not in df1
unique_df2_names <- setdiff(names(df2), names(df1))
df3[unique_df2_names][is.na(df3[, unique_df2_names])] <- 0
Created on 2019-01-03 by the reprex package (v0.2.1)
Or, as an alternative to @Chase's code, being a recent plyr fan with a background in databases:
require(plyr)
zz<-join(df1, df2, type="left")
zz[is.na(zz)] <- 0
Another alternative with data.table.
EXAMPLE DATA
library(data.table)
dt1 <- data.table(df1)
dt2 <- data.table(df2)
setkey(dt1,x)
setkey(dt2,x)
CODE
dt2[dt1,list(y=ifelse(is.na(y),0,y))]
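Note that as written this returns only the y column. A sketch of the same join that also keeps x:
dt2[dt1, list(x, y = ifelse(is.na(y), 0, y))]
#    x y
# 1: a 0
# 2: b 1
# 3: c 0
# 4: d 0
# 5: e 0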
Assuming df1 has all the values of x of interest, you could use dplyr::left_join() to merge and then either base::replace() or tidyr::replace_na() to replace the NAs with 0s:
library(tidyverse)
# dplyr only:
df_new <-
left_join(df1, df2, by = 'x') %>%
mutate(y = replace(y, is.na(y), 0))
# dplyr and tidyr:
df_new <-
left_join(df1, df2, by = 'x') %>%
mutate(y = replace_na(y, 0))
# In the sample data column `x` is a factor, which will give a warning with the join. This can be prevented by converting to a character before the join:
df_new <-
left_join(df1 %>% mutate(x = as.character(x)),
df2 %>% mutate(x = as.character(x)),
by = 'x') %>%
mutate(y = replace(y, is.na(y), 0))
I used the answer given by Chase (answered May 11 '11 at 14:21), but I added a bit of code to apply that solution to my particular problem.
I had a frame of rates (user, download) and a frame of totals (user, download) to be merged by user, and I wanted to include every rate, even if there were no corresponding total. However, there might be no missing totals at all, in which case selecting rows to replace NA with zero would fail.
The first line of code does the merge. The next two lines change the column names in the merged frame. The if statement replaces NA by zero, but only if there are rows with NA.
# merge rates and totals, replacing absent totals by zero
graphdata <- merge(rates, totals, by=c("user"),all.x=T)
colnames(graphdata)[colnames(graphdata)=="download.x"] = "download.rate"
colnames(graphdata)[colnames(graphdata)=="download.y"] = "download.total"
if(any(is.na(graphdata$download.total))) {
graphdata[is.na(graphdata$download.total),]$download.total <- 0
}
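As an aside, indexing the column vector directly makes the if() guard unnecessary, since assigning to an empty selection is a no-op; a sketch:
graphdata$download.total[is.na(graphdata$download.total)] <- 0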
Here is a data.table answer. It can be applied to selected columns by varying the definition of cols_added_df2.
df1 = data.frame(x=c('a', 'b', 'c', 'd', 'e'))
df2 = data.frame(x=c('a', 'b', 'c'),y = c(0,1,0))
setDT(df1)
setDT(df2)
df3 <- merge(df1, df2, by = "x", all.x = TRUE)
cols_added_df2 <- setdiff(names(df2), names(df1))
df3[,
  (cols_added_df2) := lapply(.SD, function(col) {
    fifelse(is.na(col), 0, col) # replace NA in the added columns with 0
  }),
  .SDcols = cols_added_df2
]
With {powerjoin} we can do:
df1 = data.frame(x=c('a', 'b', 'c', 'd', 'e'))
df2 = data.frame(x=c('a', 'b', 'c'),y = c(0,1,0))
powerjoin::power_full_join(df1, df2, fill = 0)
#> Joining, by = "x"
#> x y
#> 1 a 0
#> 2 b 1
#> 3 c 0
#> 4 d 0
#> 5 e 0
Created on 2022-04-28 by the reprex package (v2.0.1)

Fastest way to filter a data.frame list column contents in R / Rcpp

I have a data.frame:
df <- structure(list(id = 1:3, vars = list("a", c("a", "b", "c"), c("b",
"c"))), .Names = c("id", "vars"), row.names = c(NA, -3L), class = "data.frame")
with a list column (each with a character vector):
> str(df)
'data.frame': 3 obs. of 2 variables:
$ id : int 1 2 3
$ vars:List of 3
..$ : chr "a"
..$ : chr "a" "b" "c"
..$ : chr "b" "c"
I want to filter the data.frame according to setdiff(vars,remove_this)
library(dplyr)
library(tidyr)
res <- df %>% mutate(vars = lapply(df$vars, setdiff, "a"))
which gets me this:
> res
id vars
1 1
2 2 b, c
3 3 b, c
But to drop the character(0) vars I have to do something like:
res %>% unnest(vars) # and then do the equivalent of nest(vars) again after...
Actual datasets:
560K rows and 3800K rows, which also have 10 more columns (to carry along).
(This is quite slow, which leads to the question...)
What is the fastest way to do this in R?
Is there a dplyr/ data.table/ other faster method?
How to do this with Rcpp?
UPDATE/EXTENSION:
Can the column modification be done in place rather than by copying the lapply(vars, setdiff, ...) result?
What's the most efficient way to filter out vars == character(0) if it must be a separate step?
Setting aside any algorithmic improvements, the analogous data.table solution is automatically going to be faster because you won't have to copy the entire thing just to add a column:
library(data.table)
dt = as.data.table(df) # or use setDT to convert in place
dt[, newcol := lapply(vars, setdiff, 'a')][sapply(newcol, length) != 0]
# id vars newcol
#1: 2 a,b,c b,c
#2: 3 b,c b,c
You can also delete the original column (at basically zero cost) by adding [, vars := NULL] at the end. Or you can simply overwrite the initial column if you don't need that info, i.e. dt[, vars := lapply(vars, setdiff, 'a')].
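A sketch combining the overwrite with the filtering step, assuming a fresh dt = as.data.table(df) and using base R's lengths() instead of the sapply() call:
dt[, vars := lapply(vars, setdiff, 'a')][lengths(vars) > 0]
#    id vars
# 1:  2  b,c
# 2:  3  b,c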
Now as far as algorithmic improvements go, assuming your id values are unique for each vars (and if not, add a new unique identifier), I think this is much faster and automatically takes care of the filtering:
dt[, unlist(vars), by = id][!V1 %in% 'a', .(vars = list(V1)), by = id]
# id vars
#1: 2 b,c
#2: 3 b,c
To carry along the other columns, I think it's easiest to simply merge back:
dt[, othercol := 5:7]
# notice the keyby
dt[, unlist(vars), by = id][!V1 %in% 'a', .(vars = list(V1)), keyby = id][dt, nomatch = 0]
# id vars i.vars othercol
#1: 2 b,c a,b,c 6
#2: 3 b,c b,c 7
Here's another way:
# prep
DT <- data.table(df)
DT[,vstr:=paste0(sort(unlist(vars)),collapse="_"),by=1:nrow(DT)]
setkey(DT,vstr)
get_badkeys <- function(x)
unlist(sapply(1:length(x),function(n) combn(sort(x),n,paste0,collapse="_")))
# choose values to exclude
baduns <- c("a","b")
# subset
DT[!J(get_badkeys(baduns))]
This is fairly fast, but it takes up your key.
Benchmarks. Here's a made-up example:
Candidates:
hannahh <- function(df, baduns){
  df %>%
    mutate(vars = lapply(.$vars, setdiff, baduns)) %>%
    filter(lengths(vars) > 0)
}
eddi <- function(df,baduns){
dt = as.data.table(df)
dt[,
unlist(vars)
, by = id][!V1 %in% baduns,
.(vars = list(V1))
, keyby = id][dt, nomatch = 0]
}
stevenb <- function(df,baduns){
df %>%
rowwise() %>%
do(id = .$id, vars = .$vars, newcol = setdiff(.$vars, baduns)) %>%
mutate(length = length(newcol)) %>%
ungroup() %>%
filter(length > 0)
}
frank <- function(df,baduns){
DT <- data.table(df)
DT[,vstr:=paste0(sort(unlist(vars)),collapse="_"),by=1:nrow(DT)]
setkey(DT,vstr)
DT[!J(get_badkeys(baduns))]
}
Simulation:
nvals <- 4
nbads <- 2
maxlen <- 4
nobs <- 1e4
valset <- letters[seq_len(nvals)] # NB: valset was not defined in the original post; assumed here
exdf <- data.table(
id=1:nobs,
vars=replicate(nobs,list(sample(valset,sample(maxlen,1))))
)
setDF(exdf)
baduns <- valset[1:nbads]
Results:
system.time(frank_res <- frank(exdf,baduns))
# user system elapsed
# 0.24 0.00 0.28
system.time(hannahh_res <- hannahh(exdf,baduns))
# 0.42 0.00 0.42
system.time(eddi_res <- eddi(exdf,baduns))
# 0.05 0.00 0.04
system.time(stevenb_res <- stevenb(exdf,baduns))
# 36.27 55.36 93.98
Checks:
identical(sort(frank_res$id),eddi_res$id) # TRUE
identical(unlist(stevenb_res$id),eddi_res$id) # TRUE
identical(unlist(hannahh_res$id),eddi_res$id) # TRUE
Discussion:
For eddi() and hannahh(), the results scarcely change with nvals, nbads and maxlen. In contrast, when the number of baduns goes over 20, frank() becomes incredibly slow (20+ sec); it also scales a little worse with nbads and maxlen than the other two.
Scaling up nobs, eddi()'s lead over hannahh() stays the same, at about 10x. Against frank(), it sometimes shrinks and sometimes stays the same. In the best nobs = 1e5 case for frank(), eddi() is still 3x faster.
If we switch from a valset of characters to something that frank() must coerce to a character for its by-row paste0 operation, both eddi() and hannahh() beat it as nobs grows.
Benchmarks for doing this repeatedly. This is probably obvious, but if you have to do this "many" times (...how many is hard to say), it's better to create the key column once than to go through the subsetting for each set of baduns. In the single-run simulation above, eddi() is about 5x as fast as frank(), but frank() amortizes its key-building cost, so I'd go for frank() if I were doing this subsetting 10+ times.
maxbadlen <- 2
set_o_baduns <- replicate(10,sample(valset,size=sample(maxbadlen,1)))
system.time({
DT <- data.table(exdf)
DT[,vstr:=paste0(sort(unlist(vars)),collapse="_"),by=1:nrow(DT)]
setkey(DT,vstr)
for (i in 1:10) DT[!J(get_badkeys(set_o_baduns[[i]]))]
})
# user system elapsed
# 0.29 0.00 0.29
system.time({
dt = as.data.table(exdf)
for (i in 1:10) dt[,
unlist(vars), by = id][!V1 %in% set_o_baduns[[i]],
.(vars = list(V1)), keyby = id][dt, nomatch = 0]
})
# user system elapsed
# 0.39 0.00 0.39
system.time({
for (i in 1:10) hannahh(exdf,set_o_baduns[[i]])
})
# user system elapsed
# 4.10 0.00 4.13
So, as expected, frank() takes very little time for additional evaluations, while eddi() and hannahh() grow linearly.
Here's another idea:
df %>%
rowwise() %>%
do(id = .$id, vars = .$vars, newcol = setdiff(.$vars, "a")) %>%
mutate(length = length(newcol)) %>%
ungroup()
Which gives:
# id vars newcol length
#1 1 a 0
#2 2 a, b, c b, c 2
#3 3 b, c b, c 2
You could then filter on length > 0 to keep only the non-empty newcol rows:
df %>%
rowwise() %>%
do(id = .$id, vars = .$vars, newcol = setdiff(.$vars, "a")) %>%
mutate(length = length(newcol)) %>%
ungroup() %>%
filter(length > 0)
Which gives:
# id vars newcol length
#1 2 a, b, c b, c 2
#2 3 b, c b, c 2
Note: As mentioned by @Arun in the comments, this approach is quite slow. You are better off with the data.table solutions.
