Different results from dplyr and data.table

Reproducible dataset:
library(data.table)
library(dplyr)
library(zoo)
df = expand.grid(ID = sample(LETTERS[1:5]),
                 Date = seq.Date(as.Date("2012-01-01"), as.Date("2012-12-01"), by = "1 month"))
df = df[order(as.character(df$ID)),]
df = data.table(df, V1 = runif(nrow(df),0,1), V2 = runif(nrow(df),0,1), V3 = runif(nrow(df),0,1))
ind = sample(nrow(df), nrow(df)*.5)
na.gen <- function(x) {x[] <- NA; x}  # blank out an entire column (fixed to actually return x)
df1 <- df %>% slice(., ind) %>% mutate_each(funs(na.gen), starts_with("V"))
df2 = df[!ind]
df <- rbind(df1, df2)
df <- df[order(as.character(df$ID), df$Date),]
df$ID = as.character(df$ID)
Given the above dataset, my idea was to impute data using the Last Observation Carried Forward (LOCF) method. My real problem is a very large dataset, so I tested both dplyr and data.table solutions.
final_dplyr <- df %>% group_by(ID) %>% mutate_each(funs(na.locf), starts_with("V"))
final_data.table <- df[, na.locf(.SD), by = ID]
data.table gives me the right solution; however, dplyr messes up the groups that begin with NA. I get the following warning using dplyr:
Warning messages:
1: In `[.data.table`(`_dt`, , `:=`(V1, na.locf(V1)), by = `_vars`) :
Supplied 11 items to be assigned to group 1 of size 12 in column 'V1' (recycled leaving remainder of 1 items).
Can someone help me understand what I am doing wrong with dplyr?

Okay, a lot of things going on here. First, as @Frank noted, the two commands operate on different objects: na.locf(.SD) runs on the subset data.table for each ID, whereas dplyr's version runs on each column separately for each ID.
To identify where the issue is, I'll use data.table equivalent of your dplyr syntax.
df[, lapply(.SD, na.locf), by=ID]
# warning
We get the same warning message. It seems the number of rows returned per column isn't identical for one or more groups. Let's check that.
df[, lapply(.SD, function(x) length(na.locf(x))), by=ID]
# ID Date V1 V2 V3
# 1: A 12 12 12 12
# 2: B 12 12 12 12
# 3: C 12 11 11 11 # <~~~ we've a winner!
# 4: D 12 12 12 12
# 5: E 12 12 12 12
Why is this happening?
head(df[ID == "C"])
# ID Date V1 V2 V3
# 1: C 2012-01-01 NA NA NA
# 2: C 2012-02-01 0.7475075 0.8917311 0.7601174
# 3: C 2012-03-01 0.4922747 0.7749479 0.3995417
# 4: C 2012-04-01 0.9013631 0.3388313 0.8873779
# 5: C 2012-05-01 NA NA NA
# 6: C 2012-06-01 NA NA NA
nrow(df[ID == "C", na.locf(.SD), .SDcols= -c("ID")])
# 12 as expected
nrow(df[ID == "C", lapply(.SD, na.locf), .SDcols= -c("ID")])
# 12, but with warnings
Using na.locf() on the columns separately returns 11 rows for V1:V3. Why? It seems to be because of the NA at the beginning: ?na.locf has a na.rm argument which defaults to TRUE, removing NAs from the beginning. So let's set it to FALSE and try again:
nrow(df[ID == "C", lapply(.SD, na.locf, na.rm=FALSE), .SDcols = -c("ID")])
# 12, no warnings
It worked with na.locf(.SD) because it also ran na.locf on the Date column, which returned 12 rows, I think.
In essence, you need to set na.rm=FALSE in dplyr somehow, or get dplyr to work on the entire object. I've no idea how to do the latter, but the first can be done by passing the argument through funs().
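A sketch (same packages as above; funs() forwards the extra argument on to na.locf):
final_dplyr <- df %>%
  group_by(ID) %>%
  mutate_each(funs(na.locf(., na.rm = FALSE)), starts_with("V"))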
PS: Note that you can use := to update the data.table by reference instead of returning a new object with data.table syntax.
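For example, a sketch of the by-reference version, assuming the V column names from the data above:
cols <- paste0("V", 1:3)
df[, (cols) := lapply(.SD, na.locf, na.rm = FALSE), by = ID, .SDcols = cols]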


Extract first non-NA value over multiple columns

I'm still learning R and was wondering if there is an elegant way of manipulating the below df to achieve df2.
I'm not sure whether a loop is the right tool, but basically I want to extract the first non-NA X_No value in each ID/No column pair whenever the X_No value in the first row is NA. This is perhaps best described through an example, going from df to the desired df2.
A_ID <- c('A','B','I','N')
A_No <- c(11,NA,15,NA)
B_ID <- c('B','C','D','J')
B_No <- c(NA,NA,12,NA)
C_ID <- c('E','F','G','P')
C_No <- c(NA,13,14,20)
D_ID <- c('J','K','L','M')
D_No <- c(NA,NA,NA,40)
E_ID <- c('W','X','Y','Z')
E_No <- c(50,32,48,40)
df <- data.frame(A_ID,A_No,B_ID,B_No,C_ID,C_No,D_ID,D_No,E_ID,E_No)
ID <- c('A','D','F','M','W')
No <- c(11,12,13,40,50)
df2 <- data.frame(ID,No)
I'm hoping for an elegant solution to this, as there are over 1,000 similar columns in the real data.
I've looked all over the web for a similar example, to no avail.
Your help is very much appreciated. Thank you.
I don't know if I'd call it "elegant", but here is a potential solution:
library(tidyverse)
A_ID <- c('A','B','I','N')
A_No <- c(11,NA,15,NA)
B_ID <- c('B','C','D','J')
B_No <- c(NA,NA,12,NA)
C_ID <- c('E','F','G','P')
C_No <- c(NA,13,14,20)
D_ID <- c('J','K','L','M')
D_No <- c(NA,NA,NA,40)
E_ID <- c('W','X','Y','Z')
E_No <- c(50,32,48,40)
df <- data.frame(A_ID,A_No,B_ID,B_No,C_ID,C_No,D_ID,D_No,E_ID,E_No)
ID <- c('A','D','F','M','W')
No <- c(11,12,13,40,50)
df2 <- data.frame(ID,No)
output <- df %>%
  pivot_longer(everything(),
               names_sep = "_",
               names_to = c("Col", ".value")) %>%
  drop_na() %>%
  group_by(Col) %>%
  slice_head(n = 1) %>%
  ungroup() %>%
  select(-Col)
df2
#> ID No
#> 1 A 11
#> 2 D 12
#> 3 F 13
#> 4 M 40
#> 5 W 50
output
#> # A tibble: 5 × 2
#> ID No
#> <chr> <dbl>
#> 1 A 11
#> 2 D 12
#> 3 F 13
#> 4 M 40
#> 5 W 50
all_equal(df2, output)
#> [1] TRUE
Created on 2023-02-08 with reprex v2.0.2
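The trick above is the ".value" sentinel in names_to: it tells pivot_longer that the part of each column name after the underscore (ID or No) should supply the output column names, so every X_ID/X_No pair lines up in the same rows, and drop_na() plus slice_head(n = 1) can then take the first complete observation per column group.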
Using base R with max.col (assuming the columns are alternating with ID, No)
# transpose so each *_No column becomes a row, then find the first non-NA position per row
ind <- max.col(!is.na(t(df[c(FALSE, TRUE)])), "first")
# (row, column) coordinate matrix for matrix indexing
m1 <- cbind(seq_along(ind), ind)
# pull the matching ID and No using the same coordinates
data.frame(ID = t(df[c(TRUE, FALSE)])[m1], No = t(df[c(FALSE, TRUE)])[m1])
ID No
1 A 11
2 D 12
3 F 13
4 M 40
5 W 50
Here is a data.table solution that should scale well to a (very) large dataset.
Functionally:
- Split the data.frame into a list of chunks of columns, based on their names, so that all columns starting with A_ go to the first element, all columns starting with B_ to the second, and so on.
- Put these list elements on top of each other using data.table::rbindlist, ignoring the column names (this only works if each prefix A_, B_, ... spans the same number of columns).
- Then take the first non-NA No value for each ID.
Code:
library(data.table)
# split based on what comes after the underscore
L <- split.default(df, f = gsub("(.*)_.*", "\\1", names(df)))
# bind together again
DT <- rbindlist(L, use.names = FALSE)
# extract the first value of the non-NA
DT[!is.na(A_No), .(No = A_No[1]), keyby = .(ID = A_ID)]
# ID No
# 1: A 11
# 2: D 12
# 3: F 13
# 4: G 14
# 5: I 15
# 6: M 40
# 7: P 20
# 8: W 50
# 9: X 32
#10: Y 48
#11: Z 40

pmap purrr error: Argument 1 must have names

I plan to sum a data.table row-wise and add a constant to it. What is wrong with this code? I am specifically looking for a pmap_dfr solution:
library(data.table)
library(tidyverse)
temp.dt <- data.table(a = 1:3, b = 1:3, c = 1:3)
d <- 10
temp.dt %>% pmap_dfr(., sum, d) # add columns a b and c and add variable d to it
The expected output is a single-column tibble with the following rows:
13
16
19
Error thrown: Argument 1 must have names.
I have been able to get it to work with pmap and pmap_dbl, but it fails with pmap_dfr. Also, the example above is a toy example: I want to pass the d variable as an input argument to the sum function instead of adding d to the row-wise sum afterwards.
For example, I know the below works:
temp.dt %>% pmap_dbl(., sum) + d
The problem occurs for regular data frames too, so to reduce this to the essentials, start a new R session, drop the data.table part, and use the input shown below: a 3x4 data.frame, so that we don't confuse rows and columns. Also note that pmap_dfr(sum, d) is the same as pmap(sum, d) %>% bind_rows, and it is in the bind_rows step that the problem occurs.
library(dplyr)
library(purrr)
# test input
temp.df <- data.frame(a = 1:3, b = 1:3, c = 1:3, z = 1:3)
rownames(temp.df) <- LETTERS[1:3]
d <- 10
out <- temp.df %>% pmap(sum, d) # this works
out %>% bind_rows
## Error: Argument 1 must have names
The problem, as the error states, is that out has no names, and bind_rows will not provide default names for the result. For example, this will work -- I am not suggesting that you necessarily do this, just illustrating why it fails by showing the minimal change that makes it work:
temp.df %>% pmap(sum, d) %>% set_names(rownames(temp.df)) %>% bind_rows
## # A tibble: 1 x 3
## A B C
## <dbl> <dbl> <dbl>
## 1 14 18 22
or, to avoid writing temp.df twice:
temp.df %>% { set_names(pmap(., sum, d), rownames(.)) } %>% bind_rows
I think we can conclude that pmap_dfr is just not the right function to use here.
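That said, if you really want pmap_dfr, one workaround is to have each call return a named one-row tibble, so that bind_rows has names to work with. A sketch (the column name total is just illustrative):
temp.df %>% pmap_dfr(function(...) tibble(total = sum(..., d)))
## # A tibble: 3 x 1
##   total
##   <dbl>
## 1    14
## 2    18
## 3    22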
Base R
Of course, this is all trivial in base R as you can do this:
rowSums(temp.df) + d
## A B C
## 14 18 22
or more generally:
as.data.frame.list(apply(temp.df, 1, sum, d))
## A B C
## 14 18 22
or
as.data.frame.list(Reduce("+", temp.df) + d)
## X14 X18 X22
##1 14 18 22
data.table
In data.table we can write:
library(data.table)
DT <- as.data.table(temp.df)
DT[, as.list(rowSums(.SD) + d)]
## V1 V2 V3
## 1: 14 18 22
DT[, as.list(apply(.SD, 1, sum, d))]
## V1 V2 V3
## 1: 14 18 22
Also note that using data.table directly tends to be faster than layering another framework on top of it, so if you thought you were getting data.table's speed by driving it through dplyr and purrr, you likely aren't.
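That claim is easy to check on your own data; a sketch using the microbenchmark package (assumed installed; timings are omitted since they depend on data size):
library(microbenchmark)
# compare the purrr, data.table and base approaches on the same input
microbenchmark(
  purrr = temp.df %>% pmap_dbl(sum) + d,
  dt    = DT[, rowSums(.SD) + d],
  base  = rowSums(temp.df) + d
)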
A pmap_dfr solution is to first transpose the dataset. We can later rename the columns as desired:
temp.dt %>%
  t() %>%
  as.data.frame() -> tmp_dt
pmap_dfr(list(tmp_dt, 10), sum)
# A tibble: 1 x 3
V1 V2 V3
<dbl> <dbl> <dbl>
1 13 16 19
A possible dplyr-base alternative:
temp.dt %>%
mutate(Sum = rowSums(.) + d) %>%
pull(Sum)
[1] 13 16 19
Or using pmap_dbl:
temp.dt %>%
pmap_dbl(.,sum) + d
[1] 13 16 19

How to combine data.table's by= with shift() without having to create new variables?

I'm trying to generate row sums of a variable and its lag(s). Say I have:
library(data.table)
data <- data.table(id = rep(c("AT", "DE"), each = 3),
                   time = rep(2001:2003, 2),
                   var1 = c(1:6),
                   var2 = c(NA, 1:3, NA, 8))
And I want to create a variable which adds 'var1' and the first lag of 'var2', by 'id'. If I create the lag first and then the sum, I know how to:
data[ , lag := shift(var2, 1), by = id]
data[ , goalmessy := sum(var1, lag, na.rm = TRUE), by = 1:NROW(data)]
But is there a way to use shift inside sum, or something similar (like applying sum some other way)? The intuitive problem I have is that, as far as I know, the by command is evaluated first, so we end up inside a single row, which makes the shifting unfeasible. Any hints?
I think this will do what you want in one line:
data[, myVals := rowSums(cbind(var1, shift(var2)), na.rm=TRUE), by=id]
data
id time var1 var2 myVals
1: AT 2001 1 NA 1
2: AT 2002 2 1 2
3: AT 2003 3 2 4
4: DE 2001 4 3 4
5: DE 2002 5 NA 8
6: DE 2003 6 8 6
The two variables of interest are put into cbind which is used to feed rowSums and NAs are dropped as in your code.
We can also use rowSums, wrapping the variable and its shift into a table on the fly:
data[, goalmessy := rowSums(setDT(.(var1, shift(var2))), na.rm = TRUE), by = id]
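A related sketch that avoids building an intermediate table, using data.table::fcoalesce (available in newer data.table versions) to treat the lagged NA as zero; this assumes var1 itself contains no NAs:
data[, goalmessy := var1 + fcoalesce(shift(var2), 0), by = id]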

Labeling each value in a column by grouping from another column in R

I have an unusual data set that I need to work with and I've created a small scale, reproducible example.
library(data.table)
DT <- data.table(Type = c("A", rep("", 4), "B", rep("", 3), "C", rep("", 5)),
                 Cohort = c(NA, 1:4, NA, 5:7, NA, 8:12))
dt <- data.table(Type = c(rep("A", 4), rep("B", 3), rep("C", 5)), Cohort = 1:12)
I need DT to look like dt, and the actual dataset has 6.8 million rows. I realize it might be a simple issue but I can't seem to figure it out; maybe setkey? Any help is appreciated, thanks.
You can replace "" with NA and use na.locf from the zoo package:
library(zoo)
DT[Type=="",Type:=NA][,Type:=na.locf(Type)][!is.na(Cohort)]
Here is another option without using na.locf. Grouped by the cumulative sum of the logical vector (Type != ""), we select the first 'Type' and the lead value of 'Cohort', assign (:=) them to the names of 'DT' to replace the original columns, and use na.omit to drop the NA rows.
na.omit(DT[, names(DT) := .(Type[1L], shift(Cohort, type="lead")), cumsum(Type!="")])
# Type Cohort
# 1: A 1
# 2: A 2
# 3: A 3
# 4: A 4
# 5: B 5
# 6: B 6
# 7: B 7
# 8: C 8
# 9: C 9
#10: C 10
#11: C 11
#12: C 12

How to replace NA values in a table for selected columns

There are a lot of posts about replacing NA values. I am aware that one could replace NAs in the following table/frame with the following:
x[is.na(x)]<-0
But, what if I want to restrict it to only certain columns? Let me show you an example.
First, let's start with a dataset.
set.seed(1234)
x <- data.frame(a=sample(c(1,2,NA), 10, replace=T),
                b=sample(c(1,2,NA), 10, replace=T),
                c=sample(c(1:5,NA), 10, replace=T))
Which gives:
a b c
1 1 NA 2
2 2 2 2
3 2 1 1
4 2 NA 1
5 NA 1 2
6 2 NA 5
7 1 1 4
8 1 1 NA
9 2 1 5
10 2 1 1
Ok, so I only want to restrict the replacement to columns 'a' and 'b'. My attempt was:
x[is.na(x), 1:2]<-0
and:
x[is.na(x[1:2])]<-0
Which does not work.
My data.table attempt, where y<-data.table(x), was obviously never going to work:
y[is.na(y[,list(a,b)]), ]
I want to pass columns inside the is.na argument but that obviously wouldn't work.
I would like to do this in a data.frame and a data.table. My end goal is to recode the 1:2 to 0:1 in 'a' and 'b' while keeping 'c' the way it is, since it is not a logical variable. I have a bunch of columns so I don't want to do it one by one. And, I'd just like to know how to do this.
Do you have any suggestions?
You can do:
x[, 1:2][is.na(x[, 1:2])] <- 0
or better (IMHO), use the variable names:
x[c("a", "b")][is.na(x[c("a", "b")])] <- 0
In both cases, 1:2 or c("a", "b") can be replaced by a pre-defined vector.
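For example:
cols <- c("a", "b")
x[cols][is.na(x[cols])] <- 0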
Building on @Robert McDonald's tidyr::replace_na() answer, here are some dplyr options for controlling which columns have their NAs replaced:
library(tidyverse)
# by column type:
x %>%
mutate_if(is.numeric, ~replace_na(., 0))
# select columns defined in vars(col1, col2, ...):
x %>%
mutate_at(vars(a, b, c), ~replace_na(., 0))
# all columns:
x %>%
mutate_all(~replace_na(., 0))
Edit 2020-06-15
Since data.table 1.12.4 (Oct 2019), data.table has had two functions to facilitate this: nafill and setnafill.
nafill operates on columns:
cols = c('a', 'b')
y[ , (cols) := lapply(.SD, nafill, fill=0), .SDcols = cols]
setnafill operates on tables (the replacements happen by-reference/in-place)
setnafill(y, cols=cols, fill=0)
# print y to show the effect
y[]
This will also be more efficient than the other options; see ?nafill for more, including the last-observation-carried-forward (LOCF) and next-observation-carried-backward (NOCB) fill types for time series.
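For instance, an LOCF fill on the same columns would be (note that nafill/setnafill only handle numeric-like columns):
setnafill(y, type = "locf", cols = cols)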
This will work for your data.table version:
for (col in c("a", "b")) y[is.na(get(col)), (col) := 0]
Alternatively, as David Arenburg points out below, you can use set (side benefit - you can use it either on data.frame or data.table):
for (col in 1:2) set(x, which(is.na(x[[col]])), col, 0)
This is now trivial in tidyr with replace_na(). The function appears to work for data.tables as well as data.frames:
tidyr::replace_na(x, list(a=0, b=0))
Not sure if this is more concise, but this function will also find and allow replacement of NAs (or any value you like) in selected columns of a data.table:
update.mat <- function(dt, cols, criteria) {
  require(data.table)
  # coordinates (row, col) of every cell meeting the criteria
  x <- as.data.frame(which(criteria == TRUE, arr.ind = TRUE))
  # keep only the coordinates that fall in the selected columns
  y <- as.matrix(subset(x, x$col %in% which(names(dt) %in% cols)))
  y
}
To apply it:
y[update.mat(y, c("a", "b"), is.na(y))] <- 0
The function builds a matrix of cell coordinates (rows and columns), restricted to the selected columns, for the cells that meet the input criterion (in this case is.na == TRUE).
We can solve it the data.table way with the tidyr::replace_na function and lapply:
library(data.table)
library(tidyr)
setDT(df)
df[,c("a","b","c"):=lapply(.SD,function(x) replace_na(x,0)),.SDcols=c("a","b","c")]
In the same way, we can also handle pasting columns that contain NA: first replace_na(x, ""), then use stringr::str_c to combine the columns!
Starting from the data.table y, you can just write:
y[, (cols):=lapply(.SD, function(i){i[is.na(i)] <- 0; i}), .SDcols = cols]
Don't forget to library(data.table) before creating y and running this command.
This needed a bit extra for dealing with NAs in factors.
Found a useful function here, which you can then use with mutate_at or mutate_if:
replace_factor_na <- function(x){
  x <- as.character(x)
  x <- if_else(is.na(x), 'NONE', x)
  as.factor(x)
}
df <- df %>%
mutate_at(
vars(vector_of_column_names),
replace_factor_na
)
Or apply to all factor columns:
df <- df %>%
mutate_if(is.factor, replace_factor_na)
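If forcats is available, roughly the same effect comes in one line; a sketch where na_level = "NONE" mirrors the function above (fct_explicit_na was later superseded by fct_na_value_to_level):
df <- df %>% mutate_if(is.factor, forcats::fct_explicit_na, na_level = "NONE")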
For a specific column, there is an alternative with sapply:
DF <- data.frame(A = letters[1:5],
B = letters[6:10],
C = c(2, 5, NA, 8, NA))
DF_NEW <- sapply(seq_len(nrow(DF)),
                 function(i) ifelse(is.na(DF[i, 3]), 0, DF[i, 3]))
DF[,3] <- DF_NEW
DF
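For comparison, the same replacement is a one-liner with vectorised indexing, no loop over rows needed:
DF[is.na(DF[, 3]), 3] <- 0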
For completeness, building on @sbha's answer, here is the tidyverse version with the across() function that's available in dplyr since version 1.0 (which supersedes the *_at() variants, and others):
# random data
set.seed(1234)
x <- data.frame(a = sample(c(1, 2, NA), 10, replace = T),
                b = sample(c(1, 2, NA), 10, replace = T),
                c = sample(c(1:5, NA), 10, replace = T))
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
# with the magrittr pipe
x %>% mutate(across(1:2, ~ replace_na(.x, 0)))
#> a b c
#> 1 2 2 5
#> 2 2 2 2
#> 3 1 0 5
#> 4 0 2 2
#> 5 1 2 NA
#> 6 1 2 3
#> 7 2 2 4
#> 8 2 1 4
#> 9 0 0 3
#> 10 2 0 1
# with the native pipe (since R 4.1)
x |> mutate(across(1:2, ~ replace_na(.x, 0)))
#> a b c
#> 1 2 2 5
#> 2 2 2 2
#> 3 1 0 5
#> 4 0 2 2
#> 5 1 2 NA
#> 6 1 2 3
#> 7 2 2 4
#> 8 2 1 4
#> 9 0 0 3
#> 10 2 0 1
Created on 2021-12-08 by the reprex package (v2.0.1)
It's quite handy with data.table and stringr, though note that str_replace_na() returns character vectors, so every column is converted to character:
library(data.table)
library(stringr)
x[, lapply(.SD, function(xx) {str_replace_na(xx, 0)})]
