Allow grouping with NA in aggregate function - r

Here is dummy data
temp.df <- data.frame(count = rep(1,6), x = c(1,1,NA,NA,3,10), y=c("A","A","A","A","B","B"))
When I apply aggregate as given below:
aggregate(count ~ x + y, data=temp.df, FUN=sum, na.rm=FALSE, na.action=na.pass)
I get:
x y count
1 1 A 2
2 3 B 1
3 10 B 1
However, I would like the following output:
x y count
1 NA A 2
2 1 A 2
3 3 B 1
4 10 B 1
I hope this makes sense. Thanks in advance.

Use addNA to treat NA as a distinct level of x.
> temp.df$x <- addNA(temp.df$x)
> aggregate(count ~ x + y, data=temp.df, FUN=sum, na.rm=FALSE, na.action=na.pass)
x y count
1 1 A 2
2 <NA> A 2
3 3 B 1
4 10 B 1
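If you would rather not overwrite temp.df$x, a minimal sketch (assuming the same toy data as in the question) is to apply addNA only inside the call via transform. The name res is just a placeholder for illustration.

```r
# Toy data from the question.
temp.df <- data.frame(count = rep(1, 6),
                      x = c(1, 1, NA, NA, 3, 10),
                      y = c("A", "A", "A", "A", "B", "B"))

# addNA() is applied to a temporary copy, so temp.df itself is untouched.
res <- aggregate(count ~ x + y,
                 data = transform(temp.df, x = addNA(x)),
                 FUN = sum)
res

# The original column is still numeric, with real NA values.
is.numeric(temp.df$x)
```

In the result, x is a factor with NA as one of its levels, so convert it back if you need numbers downstream.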

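Another option is to convert the NA values to the character string "NA" (though I am not sure why you need the missing values). Note that this turns x into a character column, so the groups sort alphabetically: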
temp.df$x[is.na(temp.df$x)] <- 'NA'
aggregate(count ~ x + y, data=temp.df, FUN=sum, na.rm=FALSE, na.action=na.pass)
# x y count
#1 1 A 2
#2 NA A 2
#3 10 B 1
#4 3 B 1

Related

Removing columns based on a vector of names in R

I have a data.frame called DATA. Using BASE R, I was wondering how I could remove any variables in DATA that are named any of the following: ar = c("out", "Name", "mdif" , "stder" , "mpre")?
Currently, I use DATA[ , !names(DATA) %in% ar]. While this removes the unwanted variables, it also creates new nuisance variable names suffixed with .1.
After extraction, is it possible to remove just the suffixes?
Note1: We have NO ACCESS to r, the only input is DATA.
Note2: This is toy data, a functional solution is appreciated.
r <- list(
data.frame(Name = rep("Jacob", 6),
X = c(2,2,1,1,NA, NA),
Y = c(1,1,1,2,1,NA),
Z = rep(3, 6),
out = rep(1, 6)),
data.frame(Name = rep("Jon", 6),
X = c(1,NA,3,1,NA,NA),
Y = c(1,1,1,2,NA,NA),
Z = rep(2, 6),
out = rep(1, 6)))
DATA <- do.call(cbind, r) ## DATA
ar = c("out", "Name", "mdif" , "stder" , "mpre") # The names for exclusion
DATA[ , !names(DATA) %in% ar] ## Current solution
#>
# X Y Z X.1 Y.1 Z.1 ## X.1 Y.1 Z.1 are automatically created but not needed
# 1 2 1 3 1 1 2
# 2 2 1 3 NA 1 2
# 3 1 1 3 3 1 2
# 4 1 2 3 1 2 2
# 5 NA 1 3 NA NA 2
# 6 NA NA 3 NA NA 2
Ideally column names should be unique but if you want to keep duplicated column names, we can remove suffixes using sub after extraction
DATA1 <- DATA[ , !names(DATA) %in% ar]
names(DATA1) <- sub("\\.\\d+$", "", names(DATA1)) # "$" limits the match to a trailing suffix
DATA1
# X Y Z X Y Z
#1 2 1 3 1 1 2
#2 2 1 3 NA 1 2
#3 1 1 3 3 1 2
#4 1 2 3 1 2 2
#5 NA 1 3 NA NA 2
#6 NA NA 3 NA NA 2
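As a small illustration of the pattern used with sub, the sketch below compares an anchored and an unanchored version on a few made-up names (the vector nms is hypothetical and not part of the question's data).

```r
# Hypothetical column names, including one with dots in the middle.
nms <- c("X", "X.1", "Y.10", "v1.2.3")

# Anchored with "$": only a trailing ".<digits>" is removed.
anchored <- sub("\\.\\d+$", "", nms)
anchored

# Unanchored: the first ".<digits>" anywhere in the name is removed.
unanchored <- sub("\\.\\d+", "", nms)
unanchored
```

Either way, a name that legitimately ends in a dot followed by digits is also changed, so this is only safe when such names do not occur in the data.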
In base R, if we create an object with the index, we can reuse it later instead of doing additional manipulations on the column name
i1 <- !names(DATA) %in% ar
DATA1 <- setNames(DATA[i1], names(DATA)[i1])
DATA1
# X Y Z X Y Z
#1 2 1 3 1 1 2
#2 2 1 3 NA 1 2
#3 1 1 3 3 1 2
#4 1 2 3 1 2 2
#5 NA 1 3 NA NA 2
#6 NA NA 3 NA NA 2
For reusability, we can create a function
f1 <- function(dat, vec) {
i1 <- !names(dat) %in% vec
setNames(dat[i1], names(dat)[i1])
}
f1(DATA, ar)
If the datasets are stored in a list, use lapply to loop over the list and apply f1
lst1 <- list(DATA, DATA)
lapply(lst1, f1, vec = ar)
If the 'ar' elements also differ between list elements, for example
ar1 <- c("out", "Name")
ar2 <- c("mdif" , "stder" , "mpre")
then put them in a list and use Map
arLst <- list(ar1, ar2)
Map(f1, lst1, vec = arLst)
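To make the Map call concrete, here is a self-contained sketch. It repeats the f1 helper from above and uses two tiny made-up data frames (d1 and d2 are hypothetical, not the question's data), each paired with its own exclusion vector.

```r
# Same helper as above: drop the columns named in `vec`, keep original names.
f1 <- function(dat, vec) {
  i1 <- !names(dat) %in% vec
  setNames(dat[i1], names(dat)[i1])
}

# Hypothetical toy inputs, one exclusion vector per data frame.
d1 <- data.frame(Name = "a", X = 1, out = 0)
d2 <- data.frame(X = 2, mdif = 0.5, Y = 3)
excl <- list(c("out", "Name"), c("mdif", "stder", "mpre"))

# Map pairs the i-th data frame with the i-th exclusion vector.
res <- Map(f1, list(d1, d2), vec = excl)
lapply(res, names)
```

Names listed in an exclusion vector that are absent from a data frame (such as "stder" here) are simply ignored.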
Here is another option using the tidyverse (set_names comes from purrr)
library(dplyr)
library(purrr)
library(stringr)
DATA %>%
set_names(make.unique(names(.))) %>%
select(-matches(str_c(ar, collapse="|"))) %>%
set_names(str_remove(names(.), "\\.\\d+$"))
# X Y Z X Y Z
#1 2 1 3 1 1 2
#2 2 1 3 NA 1 2
#3 1 1 3 3 1 2
#4 1 2 3 1 2 2
#5 NA 1 3 NA NA 2
#6 NA NA 3 NA NA 2
NOTE: It is not recommended to have duplicate column names

How to keep NA values with dcast() function?

df <- data.frame(x = c(1,1,1,2,2,3,3,3,4,5,5),
y = c("A","B","C","A","B","A","B","D","B","C","D"),
z = c(3,2,1,4,2,3,2,1,2,3,4))
df_new <- dcast(df, x ~ y, value.var = "z")
With the sample data above, dcast() keeps NA values. With my dataset, however, the function converts NA to zero. Why?
How can I keep the NA values?
My dataset is ml-latest-small.zip:
ratings <- read.csv("ratings.csv")
movies <- read.csv("movies.csv")
rm <- merge(ratings, movies, by = "movieId")
umr <- dcast(rm, userId ~ title, value.var = "rating", fun.aggregate = sum)
Thanks in advance.
In the first example, fun.aggregate is not called; in the second, it is. According to ?dcast:
fill - value with which to fill in structural missings, defaults to value from applying fun.aggregate to 0 length vector
library(reshape2)
dcast(df, x ~ y, value.var = "z", fun.aggregate = NULL)
# x A B C D
#1 1 3 2 1 NA
#2 2 4 2 NA NA
#3 3 3 2 NA 1
#4 4 NA 2 NA NA
#5 5 NA NA 3 4
dcast(df, x ~ y, value.var = "z", fun.aggregate = sum)
# x A B C D
#1 1 3 2 1 0
#2 2 4 2 0 0
#3 3 3 2 0 1
#4 4 0 2 0 0
#5 5 0 0 3 4
Note that here there is only one element per combination, so sum returns the same value, except that it returns 0 when a particular combination is not present. This follows from the behavior of sum:
length(integer(0))
#[1] 0
sum(integer(0))
#[1] 0
sum(NULL)
#[1] 0
Similarly, when all the elements are NA and we use na.rm = TRUE, there is nothing left to sum, so the result is the same as for integer(0):
sum(c(NA, NA), na.rm = TRUE)
#[1] 0
If we use sum_ from hablar, this behavior is changed to return NA
library(hablar)
sum_(c(NA, NA))
#[1] NA
An option is to create a condition in the fun.aggregate to return NA
dcast(df, x ~ y, value.var = "z",
fun.aggregate = function(x) if(length(x) == 0) NA_real_ else sum(x, na.rm = TRUE))
# x A B C D
#1 1 3 2 1 NA
#2 2 4 2 NA NA
#3 3 3 2 NA 1
#4 4 NA 2 NA NA
#5 5 NA NA 3 4
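Since the documentation quoted above says fill is what populates structural missings, another possible sketch is to keep fun.aggregate = sum and set fill explicitly. This assumes reshape2 is installed and uses the sample df from the question. The name res is only illustrative.

```r
library(reshape2)

df <- data.frame(x = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5),
                 y = c("A", "B", "C", "A", "B", "A", "B", "D", "B", "C", "D"),
                 z = c(3, 2, 1, 4, 2, 3, 2, 1, 2, 3, 4))

# fill = NA_real_ overrides the default fill value, which would otherwise be
# the result of sum() on a zero-length vector, i.e. 0.
res <- dcast(df, x ~ y, value.var = "z",
             fun.aggregate = sum, fill = NA_real_)
res
```

Note that fill only affects combinations that are absent from the data. A cell whose values are all NA is still handled by the aggregation function itself.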
For more information about how sum (a primitive function) is implemented, check the R source code.

R - subset and include calculated column

Let's say I have this simple data frame:
df <- data.frame(x=c(1,3,3,1,3,1), y = c(2,2,2,2,2,2),z = c('a','b','c','d','e','f'))
> df
x y z
1 1 2 a
2 3 2 b
3 3 2 c
4 1 2 d
5 3 2 e
6 1 2 f
I would like to subset the rows where x == 3, return only columns x and y, and include a calculated column x + y.
I can get the first two things done, but I can't get the calculated column to appear as well.
df[df$x==3,c("x","y")]
How can I do that using base R only?
Staying in base, just do a rowSums before your subset.
df$xy <- rowSums(df[, c("x", "y")])
df[df$x == 3, c("x", "y", "xy")]
# x y xy
# 2 3 2 5
# 3 3 2 5
# 5 3 2 5
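If you would rather not add a column to df itself, one possible base R sketch is to combine subset and transform so the calculated column exists only in the result (res is a placeholder name).

```r
df <- data.frame(x = c(1, 3, 3, 1, 3, 1),
                 y = c(2, 2, 2, 2, 2, 2),
                 z = c("a", "b", "c", "d", "e", "f"))

# subset() filters rows and picks columns; transform() then adds x + y.
res <- transform(subset(df, x == 3, select = c(x, y)), xy = x + y)
res

# df is unchanged: it still has only x, y and z.
names(df)
```

subset and transform use non-standard evaluation, which is convenient interactively but can be awkward inside functions, where the explicit indexing shown above may be the safer choice.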
Personally, I prefer the dplyr approach that @akrun suggested in a comment on your question.
You can also do it like this, which overwrites z with x + y where x is 3 and with y otherwise:
df <- data.frame(x=c(1,3,3,1,3,1), y = c(2,2,2,2,2,2),z = c('a','b','c','d','e','f'))
df$z <- ifelse(df$x == 3, (df$x + df$y), df$y)
df
x y z
1 1 2 2
2 3 2 5
3 3 2 5
4 1 2 2
5 3 2 5
6 1 2 2

How to find the number of unique values in a vector for each value from another vector

I have two vectors:
x <- c(1,5,3,2,3, 4,1,2,3,4, 10,5,2,10,12)
y <- c(1,1,2,2,2, 3,3,1,4,4, 4,5,5,4,4)
How can I find the number of unique values of x for each value of y?
I know how to find the number of (not necessarily unique) values of x for each value of y:
r <- aggregate(x ~ y, data = data.frame(x, y), FUN = length)
Using data.table, this is pretty easy:
require(data.table)
DT = data.table(x,y)
unique(DT, by=c("x", "y"))[, .N, by=y]
# y N
# 1: 1 3
# 2: 2 2
# 3: 3 2
# 4: 4 4
# 5: 5 2
You can do it with dplyr this way:
library(dplyr)
data.frame(x,y) %>%
group_by(y) %>%
summarize(nb=length(unique(x)))
Which gives :
y nb
1 1 3
2 2 2
3 3 2
4 4 4
5 5 2
You could do:
rowSums(!!table(y,x))
# 1 2 3 4 5
# 3 2 2 4 2
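For completeness, a base R sketch that stays close to the aggregate call in the question is to swap length for a function that counts distinct values. The names res and counts are only illustrative.

```r
x <- c(1, 5, 3, 2, 3,  4, 1, 2, 3, 4,  10, 5, 2, 10, 12)
y <- c(1, 1, 2, 2, 2,  3, 3, 1, 4, 4,  4, 5, 5, 4, 4)

# Count distinct x values within each y group.
res <- aggregate(x ~ y, data = data.frame(x, y),
                 FUN = function(v) length(unique(v)))
res

# The same counts as a named vector, using tapply.
counts <- tapply(x, y, function(v) length(unique(v)))
counts
```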

How do I convert/reshape a data frame in long format to a wide format without aggregating the records?

From this:
> test <- data.frame(x = c("a","a","a"), y = c("b","b","c"), z = c(1,2,1))
> test
x y z
1 a b 1
2 a b 2
3 a c 1
To this:
x b c
1 a 1 NA
2 a 2 NA
3 a NA 1
Since the x column in the test data-frame doesn't uniquely identify the rows, and yet you don't want to do any aggregation, you need to augment the data-frame with a unique id column, and then use dcast() from the reshape2 package:
require(reshape2)
test$id <- 1:nrow(test)
> dcast(test, id + x ~ y, value.var = 'z')[,-1]
x b c
1 a 1 NA
2 a 2 NA
3 a NA 1
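The same idea (add a unique id, then spread) can also be sketched in base R with reshape, in case reshape2 is not available. This is an alternative to the dcast call above rather than the answer's method. The name wide and the renaming step are illustrative choices.

```r
test <- data.frame(x = c("a", "a", "a"),
                   y = c("b", "b", "c"),
                   z = c(1, 2, 1))

# A unique id keeps each original row separate, so nothing is aggregated.
test$id <- seq_len(nrow(test))

# Spread z across the values of y; new columns are named z.b and z.c.
wide <- reshape(test, idvar = "id", timevar = "y", v.names = "z",
                direction = "wide")

# Drop the helper id and strip the "z." prefix from the new columns.
wide <- wide[setdiff(names(wide), "id")]
names(wide) <- sub("^z\\.", "", names(wide))
wide
```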
