How to keep NA values with dcast() function? - r

df <- data.frame(x = c(1,1,1,2,2,3,3,3,4,5,5),
y = c("A","B","C","A","B","A","B","D","B","C","D"),
z = c(3,2,1,4,2,3,2,1,2,3,4))
df_new <- dcast(df, x ~ y, value.var = "z")
With the sample data above, dcast() keeps the NA values. With my own dataset, however, it converts NA to zero. Why?
How can I keep the NA values?
ml-latest-small.zip
ratings <- read.csv("ratings.csv")
movies <- read.csv("movies.csv")
rm <- merge(ratings, movies, by="movieId")
umr <- dcast(rm, userId ~ title, value.var = "rating", fun.aggregate= sum)
Thanks in advance.

In the first example, fun.aggregate is not used; the only change in the second case is that fun.aggregate is supplied. According to ?dcast
library(reshape2)
fill - value with which to fill in structural missings, defaults to value from applying fun.aggregate to 0 length vector
dcast(df, x ~ y, value.var = "z", fun.aggregate = NULL)
# x A B C D
#1 1 3 2 1 NA
#2 2 4 2 NA NA
#3 3 3 2 NA 1
#4 4 NA 2 NA NA
#5 5 NA NA 3 4
dcast(df, x ~ y, value.var = "z", fun.aggregate = sum)
# x A B C D
#1 1 3 2 1 0
#2 2 4 2 0 0
#3 3 3 2 0 1
#4 4 0 2 0 0
#5 5 0 0 3 4
Note that here there is only one element per combination, so sum returns that same value; when a particular combination is not present, it returns 0. This follows from the behavior of sum on an empty vector
length(integer(0))
#[1] 0
sum(integer(0))
#[1] 0
sum(NULL)
#[1] 0
Or, when all the elements are NA and we use na.rm = TRUE, there is nothing left to sum, so it again reduces to the integer(0) case
sum(c(NA, NA), na.rm = TRUE)
#[1] 0
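The default fill described in ?dcast can be previewed in base R by applying a candidate aggregator to a zero-length vector. This is only a minimal sketch; safe_sum is a hypothetical helper name, not part of reshape2.

```r
# Values of z from the sample data; z[0] is a zero-length numeric vector,
# which is what an absent x/y combination hands to fun.aggregate.
z <- c(3, 2, 1, 4, 2, 3, 2, 1, 2, 3, 4)
empty <- z[0]

fill_sum    <- sum(empty)     # sum of an empty set is 0
fill_mean   <- mean(empty)    # mean of an empty set is NaN
fill_length <- length(empty)  # length of an empty set is 0

# Hypothetical helper: return NA for an empty group, otherwise the NA-free sum.
safe_sum <- function(x) if (length(x) == 0) NA_real_ else sum(x, na.rm = TRUE)
fill_safe <- safe_sum(empty)

c(sum = fill_sum, mean = fill_mean, length = fill_length, safe = fill_safe)
```

Whatever the aggregator returns for the empty vector is what ends up in the cells with no data, which is why sum produces 0 there.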
If we use sum_ from hablar, this behavior is changed to return NA
library(hablar)
sum_(c(NA, NA))
#[1] NA
An option is to create a condition in the fun.aggregate to return NA
dcast(df, x ~ y, value.var = "z",
fun.aggregate = function(x) if(length(x) == 0) NA_real_ else sum(x, na.rm = TRUE))
# x A B C D
#1 1 3 2 1 NA
#2 2 4 2 NA NA
#3 3 3 2 NA 1
#4 4 NA 2 NA NA
#5 5 NA NA 3 4
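Going by the fill description quoted from ?dcast, another route may be to keep sum as the aggregator and pass fill explicitly. The sketch below assumes reshape2 is installed; the result name filled is illustrative only.

```r
library(reshape2)

df <- data.frame(x = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5),
                 y = c("A", "B", "C", "A", "B", "A", "B", "D", "B", "C", "D"),
                 z = c(3, 2, 1, 4, 2, 3, 2, 1, 2, 3, 4))

# Absent x/y combinations receive the explicit fill value
# instead of the default sum(numeric(0)), which is 0.
filled <- dcast(df, x ~ y, value.var = "z",
                fun.aggregate = sum, fill = NA_real_)
filled
```

Note that fill only affects combinations that are absent from the data; combinations that are present are still passed to sum as usual.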
For more info about how the sum (primitive function) is created, check the source code here

Related

aggregate function in R, sum of NAs are 0

I have seen several questions on Stack Overflow about the following, but never found a satisfactory answer. This is a follow-up to the question Blend of na.omit and na.pass using aggregate?
> test <- data.frame(name = rep(c("A", "B", "C"), each = 4),
var1 = rep(c(1:3, NA), 3),
var2 = 1:12,
var3 = c(rep(NA, 4), 1:8))
> test
name var1 var2 var3
1 A 1 1 NA
2 A 2 2 NA
3 A 3 3 NA
4 A NA 4 NA
5 B 1 5 1
6 B 2 6 2
7 B 3 7 3
8 B NA 8 4
9 C 1 9 5
10 C 2 10 6
11 C 3 11 7
12 C NA 12 8
When I try the given solution with sum instead of mean
aggregate(. ~ name, test, FUN = sum, na.action=na.pass, na.rm=TRUE)
the solution no longer works as expected. The sum of a group that contains only NAs is displayed as 0 instead of NaN.
Why doesn't this work for FUN = sum, and how can I make it work?
Create a lambda function with a condition to return NaN when all elements are NA
aggregate(. ~ name, test, FUN = function(x) if(all(is.na(x))) NaN
else sum(x, na.rm = TRUE), na.action=na.pass)
-output
name var1 var2 var3
1 A 6 10 NaN
2 B 6 26 10
3 C 6 42 26
It is an expected behavior with sum and na.rm = TRUE. According to ?sum
the sum of an empty set is zero, by definition.
> sum(c(NA, NA), na.rm = TRUE)
[1] 0
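To see why na.action = na.pass matters here, the sketch below compares it with the default na.action of the formula method (na.omit), which drops every row that has an NA in any column before FUN is called. The helper name sum_or_nan is hypothetical.

```r
test <- data.frame(name = rep(c("A", "B", "C"), each = 4),
                   var1 = rep(c(1:3, NA), 3),
                   var2 = 1:12,
                   var3 = c(rep(NA, 4), 1:8))

# Hypothetical helper: NaN when every value is NA, otherwise the NA-free sum.
sum_or_nan <- function(x) if (all(is.na(x))) NaN else sum(x, na.rm = TRUE)

# Default na.action (na.omit): incomplete rows are removed first,
# so group A (all var3 values NA) disappears from the result.
res_omit <- aggregate(. ~ name, test, FUN = sum_or_nan)

# na.pass keeps the incomplete rows and lets the helper deal with the NAs.
res_pass <- aggregate(. ~ name, test, FUN = sum_or_nan, na.action = na.pass)

res_omit
res_pass
```

With na.omit the sums for var2 also change, because the rows where var1 is NA are discarded as well.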

Removing columns based on a vector of names in R

I have a data.frame called DATA. Using base R, how can I remove any variables in DATA that are named any of the following: ar = c("out", "Name", "mdif" , "stder" , "mpre")?
Currently, I use DATA[ , !names(DATA) %in% ar]. This removes the unwanted variables, but it also creates new nuisance variable names suffixed .1.
After extraction, is it possible to remove just the suffixes?
Note1: We have NO ACCESS to r, the only input is DATA.
Note2: This is toy data, a functional solution is appreciated.
r <- list(
data.frame(Name = rep("Jacob", 6),
X = c(2,2,1,1,NA, NA),
Y = c(1,1,1,2,1,NA),
Z = rep(3, 6),
out = rep(1, 6)),
data.frame(Name = rep("Jon", 6),
X = c(1,NA,3,1,NA,NA),
Y = c(1,1,1,2,NA,NA),
Z = rep(2, 6),
out = rep(1, 6)))
DATA <- do.call(cbind, r) ## DATA
ar = c("out", "Name", "mdif" , "stder" , "mpre") # The names for exclusion
DATA[ , !names(DATA) %in% ar] ## Current solution
#>
# X Y Z X.1 Y.1 Z.1 ## X.1 Y.1 Z.1 are automatically created but not needed
# 1 2 1 3 1 1 2
# 2 2 1 3 NA 1 2
# 3 1 1 3 3 1 2
# 4 1 2 3 1 2 2
# 5 NA 1 3 NA NA 2
# 6 NA NA 3 NA NA 2
Ideally column names should be unique but if you want to keep duplicated column names, we can remove suffixes using sub after extraction
DATA1 <- DATA[ , !names(DATA) %in% ar]
names(DATA1) <- sub("\\.\\d+$", "", names(DATA1))
DATA1
# X Y Z X Y Z
#1 2 1 3 1 1 2
#2 2 1 3 NA 1 2
#3 1 1 3 3 1 2
#4 1 2 3 1 2 2
#5 NA 1 3 NA NA 2
#6 NA NA 3 NA NA 2
In base R, if we create an object with the index, we can reuse it later instead of doing additional manipulations on the column name
i1 <- !names(DATA) %in% ar
DATA1 <- setNames(DATA[i1], names(DATA)[i1])
DATA1
# X Y Z X Y Z
#1 2 1 3 1 1 2
#2 2 1 3 NA 1 2
#3 1 1 3 3 1 2
#4 1 2 3 1 2 2
#5 NA 1 3 NA NA 2
#6 NA NA 3 NA NA 2
For reusability, we can create a function
f1 <- function(dat, vec) {
i1 <- !names(dat) %in% vec
setNames(dat[i1], names(dat)[i1])
}
f1(DATA, ar)
If the datasets are stored in a list, use lapply to loop over the list and apply f1
lst1 <- list(DATA, DATA)
lapply(lst1, f1, vec = ar)
If the 'ar' elements also differ between list elements, store them in a list and use Map
ar1 <- c("out", "Name")
ar2 <- c("mdif" , "stder" , "mpre")
arLst <- list(ar1, ar2)
Map(f1, lst1, vec = arLst)
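Putting the pieces of this answer together, a self-contained base R sketch of the Map variant might look as follows. The object name res is illustrative, and the two exclusion vectors are the ar1 and ar2 values from the answer.

```r
r <- list(
  data.frame(Name = rep("Jacob", 6), X = c(2, 2, 1, 1, NA, NA),
             Y = c(1, 1, 1, 2, 1, NA), Z = rep(3, 6), out = rep(1, 6)),
  data.frame(Name = rep("Jon", 6), X = c(1, NA, 3, 1, NA, NA),
             Y = c(1, 1, 1, 2, NA, NA), Z = rep(2, 6), out = rep(1, 6)))
DATA <- do.call(cbind, r)  # keeps the duplicated column names

# Drop the columns named in vec, then restore the original (duplicated) names
# that subsetting would otherwise make unique with .1 suffixes.
f1 <- function(dat, vec) {
  i1 <- !names(dat) %in% vec
  setNames(dat[i1], names(dat)[i1])
}

lst1  <- list(DATA, DATA)
arLst <- list(c("out", "Name"), c("mdif", "stder", "mpre"))

# Each data frame is paired with its own exclusion vector.
res <- Map(f1, lst1, vec = arLst)
lapply(res, names)
```

The second exclusion vector matches no column in DATA, so the second element comes back with all of its original names.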
Here is another option using the tidyverse
library(dplyr)
library(stringr)
library(purrr) # provides set_names()
DATA %>%
set_names(make.unique(names(.))) %>%
select(-matches(str_c(ar, collapse="|"))) %>%
set_names(str_remove(names(.), "\\.\\d+$"))
# X Y Z X Y Z
#1 2 1 3 1 1 2
#2 2 1 3 NA 1 2
#3 1 1 3 3 1 2
#4 1 2 3 1 2 2
#5 NA 1 3 NA NA 2
#6 NA NA 3 NA NA 2
NOTE: It is not recommended to have duplicate column names

find rows that are the same as a vector in R

I want to search row by row and, if a row matches a predefined vector, assign a value to a variable in that row. I would prefer a dplyr solution to stay in the pipeline.
A simplified example:
a=c(1,2,NA)
b=c(1,NA,NA)
c=c(1,2,3)
d=c(1,2,NA)
D= data.frame(a,b,c,d)
My attempt is:
D %>% mutate(
i= case_when(
identical(c(a,b,c),c(1,1,1)) ~ 1,
identical(c(a,b,c),c(NA,NA,3)) ~ 2
)
)
I hope it gives me:
a b c d i
1 1 1 1 1 1
2 2 NA 2 2 NA
3 NA NA 3 NA 2
but my code doesn't work. I guess that is because it isn't comparing a row to a vector.
I do not want to simply type a==1 & b==1 & c==1 ~ 1 inside case_when because there would be too many variables to type in my dataset.
Thank you for your advice.
For this example, the following code would work
library(dplyr)
a=c(1,2,NA)
b=c(1,NA,NA)
c=c(1,2,3)
d=c(1,2,NA)
D= data.frame(a,b,c,d)
D %>% mutate(
i= case_when(
paste(a,b,c, sep=',') == paste(1,1,1, sep=",") ~ 1,
paste(a,b,c, sep=',') == paste(NA,NA,3, sep=",") ~ 2
)
)
a b c d i
1 1 1 1 1 1
2 2 NA 2 2 NA
3 NA NA 3 NA 2
If we have multiple conditions, create a key/value dataset and then do a join
library(dplyr)
keydat <- data.frame(a =c(1, NA), b = c(1, NA), c = c(1, 3), i = c(1, 2))
left_join(D, keydat)
# a b c d i
#1 1 1 1 1 1
#2 2 NA 2 2 NA
#3 NA NA 3 NA 2
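The key/value idea can also be sketched in base R by combining it with the paste approach from the other answer: build one string key per row and look it up with match(). The names key_cols and make_key are hypothetical, chosen for illustration.

```r
D <- data.frame(a = c(1, 2, NA),
                b = c(1, NA, NA),
                c = c(1, 2, 3),
                d = c(1, 2, NA))
keydat <- data.frame(a = c(1, NA), b = c(1, NA), c = c(1, 3), i = c(1, 2))

key_cols <- c("a", "b", "c")

# Collapse the key columns of each row into a single string such as "NA,NA,3".
make_key <- function(dat) {
  do.call(paste, c(unname(as.list(dat[key_cols])), sep = ","))
}

# match() gives the matching row of keydat, or NA when the row matches nothing.
D$i <- keydat$i[match(make_key(D), make_key(keydat))]
D
```

Because NA is pasted as the string "NA", rows match on missing values as well, which is the behavior the question asks for.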

R: Combine columns ignoring NAs

I have a dataframe with a few columns, where for each row only one column can have a non-NA value. I want to combine the columns into one, keeping only the non-NA value, similar to this post:
Combine column to remove NA's
However, in my case, some rows may contain only NAs, so in the combined column, we should keep an NA, like this (adapted from the post I mentioned):
data <- data.frame('a' = c('A','B','C','D','E','F'),
'x' = c(1,2,NA,NA,NA,NA),
'y' = c(NA,NA,3,NA,NA,NA),
'z' = c(NA,NA,NA,4,5,NA))
So I would have
a x y z
1 A 1 NA NA
2 B 2 NA NA
3 C NA 3 NA
4 D NA NA 4
5 E NA NA 5
6 F NA NA NA
And I would like to get
'a' 'mycol'
A 1
B 2
C 3
D 4
E 5
F NA
The solution from the post mentioned above does not work in my case because of row F. It was:
cbind(data[1], mycol = na.omit(unlist(data[-1])))
Thanks!
Using base R...
data$mycol <- apply(data[,2:4], 1, function(x) x[!is.na(x)][1])
data
a x y z mycol
1 A 1 NA NA 1
2 B 2 NA NA 2
3 C NA 3 NA 3
4 D NA NA 4 4
5 E NA NA 5 5
6 F NA NA NA NA
One option is coalesce from dplyr
library(tidyverse)
data %>%
transmute(a, mycol = coalesce(!!! rlang::syms(names(.)[-1])))
# a mycol
#1 A 1
#2 B 2
#3 C 3
#4 D 4
#5 E 5
#6 F NA
Or we can use max.col from base R
cbind(data[1], mycol= data[-1][cbind(1:nrow(data),
max.col(!is.na(data[-1])) * NA^!rowSums(!is.na(data[-1]))+1)])
# a mycol
#1 A 1
#2 B 2
#3 C 3
#4 D 4
#5 E 5
#6 F NA
Or only with rowSums
v1 <- rowSums(data[-1], na.rm = TRUE)
cbind(data[1], mycol = v1 * NA^!v1)
Or another option is pmax
cbind(data[1], mycol = do.call(pmax, c(data[-1], na.rm = TRUE)))
or pmin
cbind(data[1], mycol = do.call(pmin, c(data[-1], na.rm = TRUE)))
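For comparison, a coalesce-style combination can also be sketched in base R with Reduce, taking the first non-NA value column by column. The helper name coalesce_cols is hypothetical.

```r
data <- data.frame(a = c("A", "B", "C", "D", "E", "F"),
                   x = c(1, 2, NA, NA, NA, NA),
                   y = c(NA, NA, 3, NA, NA, NA),
                   z = c(NA, NA, NA, 4, 5, NA))

# Walk through the columns left to right; wherever the running result is
# still NA, take the value from the next column.
coalesce_cols <- function(cols) {
  Reduce(function(left, right) ifelse(is.na(left), right, left), cols)
}

res <- cbind(data[1], mycol = coalesce_cols(data[-1]))
res
```

Unlike the rowSums variant, this keeps a genuine 0 as 0 rather than turning it into NA, and a row with only NAs stays NA.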

Allow grouping with NA in aggregate function

Here is dummy data
temp.df <- data.frame(count = rep(1,6), x = c(1,1,NA,NA,3,10), y=c("A","A","A","A","B","B"))
When I apply aggregate as given below:
aggregate(count ~ x + y, data=temp.df, FUN=sum, na.rm=FALSE, na.action=na.pass)
I get:
x y count
1 1 A 2
2 3 B 1
3 10 B 1
However, I would like the following output:
x y count
1 NA A 2
2 1 A 2
3 3 B 1
4 10 B 1
Hope it makes sense. Thanks in advance.
Use addNA to treat NA as a distinct level of x.
> temp.df$x <- addNA(temp.df$x)
> aggregate(count ~ x + y, data=temp.df, FUN=sum, na.rm=FALSE, na.action=na.pass)
x y count
1 1 A 2
2 <NA> A 2
3 3 B 1
4 10 B 1
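Since addNA turns x into a factor, one may want x back as a number after aggregating. The following is a sketch under that assumption; it works on a copy (grouped, an illustrative name) so that temp.df itself is left unchanged.

```r
temp.df <- data.frame(count = rep(1, 6),
                      x = c(1, 1, NA, NA, 3, 10),
                      y = c("A", "A", "A", "A", "B", "B"))

# Treat NA as its own level of x, on a copy of the data.
grouped <- transform(temp.df, x = addNA(x))

res <- aggregate(count ~ x + y, data = grouped, FUN = sum, na.action = na.pass)

# Convert the factor back to numeric; the NA level becomes a real NA again.
res$x <- as.numeric(as.character(res$x))
res
```

Going through as.character first matters, because as.numeric on a factor alone would return the internal level codes rather than the original values.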
One option may be to convert the NA to character "NA" (but I am not sure why you need the missing values)
temp.df$x[is.na(temp.df$x)] <- 'NA'
aggregate(count ~ x + y, data=temp.df, FUN=sum, na.rm=FALSE, na.action=na.pass)
# x y count
#1 1 A 2
#2 NA A 2
#3 10 B 1
#4 3 B 1
