aggregate function in R: sum of NAs is 0

I have seen several questions on Stack Overflow about this, but none with a satisfactory answer. I am following up on the question Blend of na.omit and na.pass using aggregate?
> test <- data.frame(name = rep(c("A", "B", "C"), each = 4),
var1 = rep(c(1:3, NA), 3),
var2 = 1:12,
var3 = c(rep(NA, 4), 1:8))
> test
name var1 var2 var3
1 A 1 1 NA
2 A 2 2 NA
3 A 3 3 NA
4 A NA 4 NA
5 B 1 5 1
6 B 2 6 2
7 B 3 7 3
8 B NA 8 4
9 C 1 9 5
10 C 2 10 6
11 C 3 11 7
12 C NA 12 8
When I try the given solution, but compute the sum instead of the mean:
aggregate(. ~ name, test, FUN = sum, na.action = na.pass, na.rm = TRUE)
it does not work as expected. The NAs are effectively converted to 0, so the sum of a group of NAs is displayed as 0 instead of NaN.
Why doesn't this work for FUN = sum, and how can I make it work?

Create an anonymous function with a condition to return NaN when all elements are NA:
aggregate(. ~ name, test, FUN = function(x) if(all(is.na(x))) NaN
else sum(x, na.rm = TRUE), na.action=na.pass)
-output
name var1 var2 var3
1 A 6 10 NaN
2 B 6 26 10
3 C 6 42 26
This is the expected behavior of sum with na.rm = TRUE. According to ?sum,
the sum of an empty set is zero, by definition.
> sum(c(NA, NA), na.rm = TRUE)
[1] 0
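For comparison, a minimal dplyr sketch of the same workaround (assuming dplyr is installed): per group and per column, return NaN when everything is NA, otherwise the NA-removed sum.
library(dplyr)
test %>%
  group_by(name) %>%
  summarise(across(everything(),
                   ~ if (all(is.na(.x))) NaN else sum(.x, na.rm = TRUE)))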

Related

How to call column names from an object in dplyr?

I am trying to replace all zeros in multiple columns with NA using dplyr.
However, since I have many variables, I do not want to list them one by one, but rather store them in an object that I can refer to afterwards.
This is a minimal example of what I did:
library(dplyr)
Data <- data.frame(var1=c(1:10), var2=rep(c(0,4),5), var3 = rep(c(2,0,3,4,5),2), var4 = rep(c(7,0),5))
col <- Data[,c(2:4)]
Data <- Data %>%
mutate(across(col , na_if, 0))
However, if I do this, I get the following error message:
Error: Problem with 'mutate()' input '..1'.
x Must subset columns with a valid subscript vector.
x Subscript has the wrong type 'data.frame<
var2: double
var3: double
var4: double>'.
i It must be numeric or character.
i Input '..1' is '(function (.cols = everything(), .fns = NULL, ..., .names = NULL) ...'.
I have tried to change the format of col to a tibble, but that did not help.
Could anyone tell me how to make this work?
If you want to target columns of a particular type only, try helper functions like where(), which selects any variable for which the supplied function returns TRUE. The main benefit here is that you can target a specific type of variable.
library(dplyr)
# The where(is.double) call will select var2, var3, and var4
# Note: var1 is an integer, so is.double() returns FALSE for it
# Useful when you want to completely ignore a specific type of variable
Data <- data.frame(
var1 = c(1:10),
var2 = rep(c(0, 4),5),
var3 = rep(c(2, 0 ,3, 4, 5), 2),
var4 = rep(c(7, 0), 5)
)
Data %>%
mutate(across(where(is.double), ~ na_if(.x, 0)))
Here is the output:
var1 var2 var3 var4
1 1 NA 2 7
2 2 4 NA NA
3 3 NA 3 7
4 4 4 4 NA
5 5 NA 5 7
6 6 4 2 NA
7 7 NA NA 7
8 8 4 3 NA
9 9 NA 4 7
10 10 4 5 NA
The other answer you'll find here is great and allows you to select any arbitrary number of columns.
Here, col should hold the names of the columns of Data rather than the columns themselves. Since col is also the name of a base R function, we give the object a different name, wrap it in all_of(), and replace 0 with NA inside across():
library(dplyr)
col1 <- names(Data)[2:4]
Data <- Data %>%
mutate(across(all_of(col1), ~ na_if(.x, 0)))
-output
Data
# var1 var2 var3 var4
#1 1 NA 2 7
#2 2 4 NA NA
#3 3 NA 3 7
#4 4 4 4 NA
#5 5 NA 5 7
#6 6 4 2 NA
#7 7 NA NA 7
#8 8 4 3 NA
#9 9 NA 4 7
#10 10 4 5 NA
NOTE: here the OP asked about selecting the columns based on either the index or the column names.
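For the index-based variant, a minimal sketch (the positions 2:4 are assumed for illustration) passes the column positions directly to across(), which tidyselect also accepts:
library(dplyr)
Data %>%
  mutate(across(2:4, ~ na_if(.x, 0)))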

Order columns based on max value in columns - R dataframe arranging

I have the following dataframe:
x <- data.frame("A" = c(NA, NA, 3:10, NA), "B" = c(NA,2:11), "C" = c(2:12))
How do I reorder the columns in R based on the max value in each column? Here the column order should be
C, B, A
as the max value is in col C, the next max is in col B and the last max is in col A.
I have a huge dataframe and need to do this automatically.
Thanks
Does this work, using base R:
x[names(sort(sapply(x, max, na.rm = T), decreasing = T))]
C B A
1 2 NA NA
2 3 2 NA
3 4 3 3
4 5 4 4
5 6 5 5
6 7 6 6
7 8 7 7
8 9 8 8
9 10 9 9
10 11 10 10
11 12 11 NA
I think this is what you want.
x <- data.frame("A" = c(NA, NA, 3:10, NA), "B" = c(NA,2:11), "C" = c(2:12))
maxx <- sapply(x, function(x) max(x,na.rm = TRUE))
result <- x[,order(-maxx)]
result
C B A
1 2 NA NA
2 3 2 NA
3 4 3 3
4 5 4 4
Will such a solution work?
x %>% dplyr::arrange(-C,-B,-A)
or
x %>% dplyr::arrange(desc(C), desc(B), desc(A))
Please also see the question: dplyr arrange() function sort by missing values.
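Note that arrange() sorts rows rather than reorders columns. If the goal is to reorder the columns with dplyr, a rough sketch along the lines of the base R answer would be (all_of() also accepts numeric positions):
library(dplyr)
ord <- order(sapply(x, max, na.rm = TRUE), decreasing = TRUE)
x %>% select(all_of(ord))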

How to keep NA values with dcast() function?

df <- data.frame(x = c(1,1,1,2,2,3,3,3,4,5,5),
y = c("A","B","C","A","B","A","B","D","B","C","D"),
z = c(3,2,1,4,2,3,2,1,2,3,4))
df_new <- dcast(df, x ~ y, value.var = "z")
With the sample data given above, the dcast() function keeps the NA values. But it doesn't work that way with my dataset: the function converts NA to zero. Why?
How can I keep the NA values?
ml-latest-small.zip
ratings <- read.csv("ratings.csv")
movies <- read.csv("movies.csv")
rm <- merge(ratings, movies, by = "movieId")
umr <- dcast(rm, userId ~ title, value.var = "rating", fun.aggregate = sum)
Thanks in advance.
In the first example, fun.aggregate is not called; the change in the second case is that fun.aggregate is called. According to ?dcast:
library(reshape2)
fill - value with which to fill in structural missings, defaults to value from applying fun.aggregate to 0 length vector
dcast(df, x ~ y, value.var = "z", fun.aggregate = NULL)
# x A B C D
#1 1 3 2 1 NA
#2 2 4 2 NA NA
#3 3 3 2 NA 1
#4 4 NA 2 NA NA
#5 5 NA NA 3 4
dcast(df, x ~ y, value.var = "z", fun.aggregate = sum)
# x A B C D
#1 1 3 2 1 0
#2 2 4 2 0 0
#3 3 3 2 0 1
#4 4 0 2 0 0
#5 5 0 0 3 4
Note that here there is only one element per combination, so the sum returns the same values, except that when a particular combination is not present at all it returns 0. This is based on the behavior of sum:
length(integer(0))
#[1] 0
sum(integer(0))
#[1] 0
sum(NULL)
#[1] 0
Similarly, when all the elements are NA and we use na.rm = TRUE, there is no element left to sum, so it again falls back to the integer(0) case:
sum(c(NA, NA), na.rm = TRUE)
#[1] 0
If we use sum_ from hablar, this behavior is changed to return NA
library(hablar)
sum_(c(NA, NA))
#[1] NA
An option is to create a condition in the fun.aggregate to return NA
dcast(df, x ~ y, value.var = "z",
fun.aggregate = function(x) if(length(x) == 0) NA_real_ else sum(x, na.rm = TRUE))
# x A B C D
#1 1 3 2 1 NA
#2 2 4 2 NA NA
#3 3 3 2 NA 1
#4 4 NA 2 NA NA
#5 5 NA NA 3 4
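Given the fill documentation quoted above, another sketch of a fix is to set fill explicitly instead of wrapping fun.aggregate; this should fill only the structural missings (combinations that are not present at all) with NA:
dcast(df, x ~ y, value.var = "z", fun.aggregate = sum, fill = NA_real_)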
For more information about how sum() (a primitive function) is implemented, check the R source code.

R - delete consecutive (ONLY) duplicates

I need to eliminate rows from a data frame based on the repetition of values in a given column, but only those that are consecutive.
For example, for the following data frame:
df = data.frame(x=c(1,1,1,2,2,4,2,2,1))
df$y <- c(10,11,30,12,49,13,12,49,30)
df$z <- c(1,2,3,4,5,6,7,8,9)
x y z
1 10 1
1 11 2
1 30 3
2 12 4
2 49 5
4 13 6
2 12 7
2 49 8
1 30 9
I would need to eliminate rows with consecutive repeated values in the x column, keep the last repeated row, and maintain the structure of the data frame:
x y z
1 30 3
2 49 5
4 13 6
2 49 8
1 30 9
Following directions from help and some other posts, I have tried using the duplicated function:
df[ !duplicated(x,fromLast=TRUE), ] # which gives me this:
x y z
1 1 10 1
6 4 13 6
7 2 12 7
9 1 30 9
NA NA NA NA
NA.1 NA NA NA
NA.2 NA NA NA
NA.3 NA NA NA
NA.4 NA NA NA
NA.5 NA NA NA
NA.6 NA NA NA
NA.7 NA NA NA
NA.8 NA NA NA
I am not sure why I get the NA rows at the end (this wasn't happening with a similar table I was testing), and it only partially works on the values.
I have also tried using the data.table package as follows:
library(data.table)
dt <- as.data.table(df)
setkey(dt, x)
dt[J(unique(x)), mult ='last']
Works great, but it eliminates ALL duplicates from the data frame, not just those that are consecutive, giving something like this:
x y z
1 30 9
2 49 8
4 13 6
Please forgive me if this is cross-posting. I tried some of the suggestions, but none worked for eliminating only the consecutive duplicates.
I would appreciate any help.
Thanks
How about:
df[cumsum(rle(df$x)$lengths),]
Explanation:
rle(df$x)
gives you the run lengths and values of consecutive duplicates in the x variable. Then:
rle(df$x)$lengths
extracts the lengths. Finally:
cumsum(rle(df$x)$lengths)
gives the row indices which you can select using [.
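To make this concrete, here is what those pieces return for the example data (output shown as comments):
rle(df$x)
# Run Length Encoding
#   lengths: int [1:5] 3 2 1 2 1
#   values : num [1:5] 1 2 4 2 1
cumsum(rle(df$x)$lengths)
# [1] 3 5 6 8 9
So df[c(3, 5, 6, 8, 9), ] is exactly the last row of each run.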
EDIT: for fun, here's a microbenchmark of the answers given so far, with rle being mine, consec being what I think is the most fundamentally direct answer (given by @James, and the one I would "accept"), and dp being the dplyr answer given by @Nik.
#> Unit: microseconds
#> expr min lq mean median uq max
#> rle 134.389 145.4220 162.6967 154.4180 172.8370 375.109
#> consec 111.411 118.9235 136.1893 123.6285 145.5765 314.249
#> dp 20478.898 20968.8010 23536.1306 21167.1200 22360.8605 179301.213
rle performs better than I thought it would.
You just need to check that there is no duplicate following a value, i.e. x[i+1] != x[i], and note that the last value will always be kept.
df[c(df$x[-1] != df$x[-nrow(df)],TRUE),]
x y z
3 1 30 3
5 2 49 5
6 4 13 6
8 2 49 8
9 1 30 9
A cheap solution with dplyr that I could come up with:
Method:
library(dplyr)
df %>%
mutate(id = lag(x, 1),
decision = if_else(x != id, 1, 0),
final = lead(decision, 1, default = 1)) %>%
filter(final == 1) %>%
select(-id, -decision, -final)
Output:
x y z
1 1 30 3
2 2 49 5
3 4 13 6
4 2 49 8
5 1 30 9
This will even work if your data has the same x value repeated at the bottom.
New Input:
df2 <- df %>% add_row(x = 1, y = 10, z = 12)
df2
x y z
1 1 10 1
2 1 11 2
3 1 30 3
4 2 12 4
5 2 49 5
6 4 13 6
7 2 12 7
8 2 49 8
9 1 30 9
10 1 10 12
Use same method:
df2 %>%
mutate(id = lag(x, 1),
decision = if_else(x != id, 1, 0),
final = lead(decision, 1, default = 1)) %>%
filter(final == 1) %>%
select(-id, -decision, -final)
New Output:
x y z
1 1 30 3
2 2 49 5
3 4 13 6
4 2 49 8
5 1 10 12
Here is a data.table solution. The trick is to create a shifted version of x with the shift function and compare it with x
library(data.table)
dattab <- as.data.table(df)
dattab[x != shift(x = x, n = 1, fill = -999, type = "lead")]
This way you compare each value of x with its immediately following value and throw out where they match. Make sure to set fill to something that is not in x in order for correct handling of the last value.
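To see what the comparison is doing, here is the intermediate shifted vector for the example data (a quick sketch, output shown as comments):
library(data.table)
shift(df$x, n = 1, fill = -999, type = "lead")
# [1]    1    1    2    2    4    2    2    1 -999
df$x != shift(df$x, n = 1, fill = -999, type = "lead")
# [1] FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
This keeps rows 3, 5, 6, 8 and 9, matching the expected output.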

R building a subset based on value in previous row

I have a problem figuring this out.
Suppose this is what my data looks like:
Num condition y
1 a 1
2 a 2
3 a 3
4 b 4
5 b 5
6 b 6
7 c 7
8 c 8
9 c 9
10 b 10
11 b 11
12 b 12
I now want to perform a calculation (e.g., the mean) on the b rows, depending on which value was in the row before the run of b, in this example a or c.
Thanks for any help!!!
Angelika
Is this what you want?
# in order to separate between different runs of condition 'b',
# get length and value of runs of equal values of 'condition'
rl <- rle(x = df$condition)
df$run <- rep(x = seq_len(length(rl$lengths)), times = rl$lengths)
# calculate sum of y, on data grouped by condition and run, and where condition is 'b'
aggregate(y ~ condition + run, data = df, subset = condition == "b", sum)
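A rough dplyr equivalent of the same idea (a sketch, assuming dplyr is loaded and condition is a character column) numbers the runs with cumsum() and then summarises within the 'b' runs:
library(dplyr)
df %>%
  mutate(run = cumsum(condition != lag(condition, default = condition[1]))) %>%
  filter(condition == "b") %>%
  group_by(run) %>%
  summarise(sum_y = sum(y), mean_y = mean(y))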
You can add a "lagged" condition column to your dataframe (assumed to be named DF) using:
> DF <- within(DF, lag_cond <- c(NA, head(as.character(condition), -1)))
Result:
Num condition y lag_cond
1 a 1 <NA>
2 a 2 a
3 a 3 a
4 b 4 a
5 b 5 b
6 b 6 b
7 c 7 b
8 c 8 c
9 c 9 c
10 b 10 c
11 b 11 b
12 b 12 b
Now you can identify rows you want like this:
> DF[with(DF, condition=="b" & lag_cond %in% c("a","c")),]
Num condition y lag_cond
4 b 4 a
10 b 10 c
