Order columns based on max value in columns - R dataframe arranging - r

I have the following dataframe:
x <- data.frame("A" = c(NA, NA, 3:10, NA), "B" = c(NA,2:11), "C" = c(2:12))
How do I reorder the columns in R based on the max value in each row. So here the column order should be
C, B, A
as the max value is in col C, the next max is in col B and the last max is in col A.
I've a huge dataframe and need to do this automatically.
Thanks

Does this work, using base R:
x[names(sort(sapply(x, max, na.rm = T), decreasing = T))]
C B A
1 2 NA NA
2 3 2 NA
3 4 3 3
4 5 4 4
5 6 5 5
6 7 6 6
7 8 7 7
8 9 8 8
9 10 9 9
10 11 10 10
11 12 11 NA

I think this is what you want.
x <- data.frame("A" = c(NA, NA, 3:10, NA), "B" = c(NA,2:11), "C" = c(2:12))
maxx <- sapply(x, function(x) max(x,na.rm = TRUE))
result <- x[,order(-maxx)]
result
C B A
1 2 NA NA
2 3 2 NA
3 4 3 3
4 5 4 4

Will such a solution work?
x %>% dplyr::arrange(-C,-B,-A)
or
x %>% dplyr::arrange(desc(C,B,A))
Please also see the question: [dplyr arrange() function sort by missing values] (dplyr arrange() function sort by missing values)

Related

aggregate function in R, sum of NAs are 0

I saw a list of questions asked in stack overflow, regarding the following, but never got a satisfactory answer. I will follow up on the following question Blend of na.omit and na.pass using aggregate?
> test <- data.frame(name = rep(c("A", "B", "C"), each = 4),
var1 = rep(c(1:3, NA), 3),
var2 = 1:12,
var3 = c(rep(NA, 4), 1:8))
> test
name var1 var2 var3
1 A 1 1 NA
2 A 2 2 NA
3 A 3 3 NA
4 A NA 4 NA
5 B 1 5 1
6 B 2 6 2
7 B 3 7 3
8 B NA 8 4
9 C 1 9 5
10 C 2 10 6
11 C 3 11 7
12 C NA 12 8
When I try out the given solution, instead of mean I try to find out the sum
aggregate(. ~ name, test, FUN = sum, na.action=na.pass, na.rm=TRUE)
the solution doesn't work as usual. Accordingly, it converts NA to 0, So the sum of NAs is 0. It displays it as 0 instead of NaN.
Why doesn't the following work for FUN=sum.And how to make it work?
Create a lambda function with a condition to return NaN when all elements are NA
aggregate(. ~ name, test, FUN = function(x) if(all(is.na(x))) NaN
else sum(x, na.rm = TRUE), na.action=na.pass)
-output
name var1 var2 var3
1 A 6 10 NaN
2 B 6 26 10
3 C 6 42 26
It is an expected behavior with sum and na.rm = TRUE. According to ?sum
the sum of an empty set is zero, by definition.
> sum(c(NA, NA), na.rm = TRUE)
[1] 0

Selecting the top X values of each individual column in R

Say you have the following data set.
df1<-matrix(data = 1:10,
nrow = 5,
ncol = 5)
colnames(df1)=c("a","b","c","d","e")
How would you extract the top X values from each individual column as a new data frame?
The expected output would be something like this (for the top 3 values in each column
a
b
c
d
e
5
10
5
10
5
4
9
4
9
4
3
8
3
8
3
You can use apply to apply a function to each column (MARGIN = 2). Here, the function is \(x) head(sort(x, decreasing = T), 3), which sorts the column by decreasing order, and select the top three values (head(x, 3)).
apply(df1, 2, \(x) head(sort(x, decreasing = T), 3))
a b c d e
[1,] 5 10 5 10 5
[2,] 4 9 4 9 4
[3,] 3 8 3 8 3
Note: \(x) is a shorthand for function(x) in lambda-like functions since 4.1.0.
We can sort, then use head:
head(apply(df1, 2, sort, decreasing = TRUE), 3)
# a b c d e
# [1,] 5 10 5 10 5
# [2,] 4 9 4 9 4
# [3,] 3 8 3 8 3

Iterate through columns to sum the previous 2 numbers of each row

In R, I have a dataframe, with columns 'A', 'B', 'C', 'D'. The columns have 100 rows.
I need to iterate through the columns to perform a calculation for all rows in the dataframe which sums the previous 2 rows of that column, and then set in new columns ('AA', 'AB', etc) what that sum is:
A B C D
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
5 6 7 8
6 7 8 9
to
A B C D AA AB AC AD
1 2 3 4 NA NA NA NA
2 3 4 5 3 5 7 9
3 4 5 6 5 7 9 11
4 5 6 7 7 9 11 13
5 6 7 8 9 11 13 15
6 7 8 9 11 13 15 17
Can someone explain how to create a function/loop that allows me to set the columns I want to iterate over (selected columns, not all columns) and the columns I want to set?
A base one-liner:
cbind(df, setNames(df + df[c(NA, 1:(nrow(df)-1)), ], paste0("A", names(df))))
If your data is large, this one might be the fastest because it manipulates the entire data.frame.
A dplyr solution using mutate() with across().
library(dplyr)
df %>%
mutate(across(A:D,
~ .x + lag(.x),
.names = "A{col}"))
# A B C D AA AB AC AD
# 1 1 2 3 4 NA NA NA NA
# 2 2 3 4 5 3 5 7 9
# 3 3 4 5 6 5 7 9 11
# 4 4 5 6 7 7 9 11 13
# 5 5 6 7 8 9 11 13 15
# 6 6 7 8 9 11 13 15 17
If you want to sum the previous 3 rows, the second argument of across(), i.e. .fns, should be
~ .x + lag(.x) + lag(.x, 2)
which is equivalent to the use of rollsum() in zoo:
~ zoo::rollsum(.x, k = 3, fill = NA, align = 'right')
Benchmark
A benchmark test with microbenchmark package on a new data.frame with 10000 rows and 100 columns and evaluate each expression for 10 times.
# Unit: milliseconds
# expr min lq mean median uq max neval
# darren_base 18.58418 20.88498 35.51341 33.64953 39.31909 80.24725 10
# darren_dplyr_lag 39.49278 40.27038 47.26449 42.89170 43.20267 76.72435 10
# arg0naut91_dplyr_rollsum 436.22503 482.03199 524.54800 516.81706 534.94317 677.64242 10
# Grothendieck_rollsumr 3423.92097 3611.01573 3650.16656 3622.50895 3689.26404 4060.98054 10
You can use dplyr's across (and set optional names) with rolling sum (as implemented e.g. in zoo):
library(dplyr)
library(zoo)
df %>%
mutate(
across(
A:D,
~ rollsum(., k = 2, fill = NA, align = 'right'),
.names = 'A{col}'
)
)
Output:
A B C D AA AB AC AD
1 1 2 3 4 NA NA NA NA
2 2 3 4 5 3 5 7 9
3 3 4 5 6 5 7 9 11
4 4 5 6 7 7 9 11 13
5 5 6 7 8 9 11 13 15
6 6 7 8 9 11 13 15 17
With A:D we've specified the range of column names we want to apply the function to. The assumption above in .names argument is that you want to paste together A as prefix and the column name ({col}).
Here's a data.table solution. As you ask for, it allows you to select which columns you want to apply it to rather than just for all columns.
library(data.table)
x <- data.table(A=1:6, B=2:7, C=3:8, D=4:9)
selected_cols <- c('A','B','D')
new_cols <- paste0("A",selected_cols)
x[, (new_cols) := lapply(.SD, function(col) col+shift(col, 1)), .SDcols = selected_cols]
x[]
NB This is 2 or 3 times faster than the fastest other answer.
That is a naive approach with nested for loops. Beware it is damn slow if you gonna iterate over hundreds thousand rows.
i <- 1
n <- 5
df <- data.frame(A=i:(i+n), B=(i+1):(i+n+1), C=(i+2):(i+n+2), D=(i+3):(i+n+3))
for (col in colnames(df)) {
for (ind in 1:nrow(df)) {
if (ind-1==0) {next}
s <- sum(df[c(ind-1, ind), col])
df[ind, paste0('S', col)] <- s
}
}
That is a cumsum method:
na.df <- data.frame(matrix(NA, 2, ncol(df)))
colnames(na.df) <- colnames(df)
cs1 <- cumsum(df)
cs2 <- rbind(cs1[-1:-2,], na.df)
sum.diff <- cs2-cs1
cbind(df, rbind(na.df[1,], cs1[2,], sum.diff[1:(nrow(sum.diff)-2),]))
Benchmark:
# Unit: milliseconds
# expr min lq mean median uq max neval
# darrentsai.rbind 11.5623 12.28025 23.38038 16.78240 20.83420 91.9135 100
# darrentsai.rbind.rev1 8.8267 9.10945 15.63652 9.54215 14.25090 62.6949 100
# pseudopsin.dt 7.2696 7.52080 20.26473 12.61465 17.61465 69.0110 100
# ivan866.cumsum 25.3706 30.98860 43.11623 33.78775 37.36950 91.6032 100
I believe, most of the time the cumsum method wastes on df allocations. If correctly adapted to data.table backend, it could be the fastest.
Specify the columns we want. We show several different ways to do that. Then use rollsumr to get the desired columns, set the column names and cbind DF with it.
library(zoo)
# jx <- names(DF) # if all columns wanted
# jx <- sapply(DF, is.numeric) # if all numeric columns
# jx <- c("A", "B", "C", "D") # specify columns by name
jx <- 1:4 # specify columns by position
r <- rollsumr(DF[jx], 2, fill = NA)
colnames(r) <- paste0("A", colnames(r))
cbind(DF, r)
giving:
A B C D AA AB AC AD
1 1 2 3 4 NA NA NA NA
2 2 3 4 5 3 5 7 9
3 3 4 5 6 5 7 9 11
4 4 5 6 7 7 9 11 13
5 5 6 7 8 9 11 13 15
6 6 7 8 9 11 13 15 17
Note
The input in reproducible form:
DF <- structure(list(A = 1:6, B = 2:7, C = 3:8, D = 4:9),
class = "data.frame", row.names = c(NA, -6L))

Combine two vectors of different length based on common values and expand the shorter

I have two vectors with some values in common, but of different length:
x <- 1:10
# [1] 1 2 3 4 5 6 7 8 9 10
y <- c(3, 5, 8)
# [1] 3 5 8
I'd like to combine these two vectors into a dataframe and produce the following result:
data.frame(big = x,
small = c(NA, NA, 3, NA, 5, NA, NA, 8, NA, NA))
# big small
# 1 1 NA
# 2 2 NA
# 3 3 3
# 4 4 NA
# 5 5 5
# 6 6 NA
# 7 7 NA
# 8 8 8
# 9 9 NA
# 10 10 NA
One possibility is to index the short vector using the match between the long and short, with the nomatch argument set to NA ("the value to be returned in the case when no match is found").
data.frame(big = x,
small = y[match(x, y, nomatch = NA)])

Removing rows with NA in R [duplicate]

This question already has answers here:
Remove rows with all or some NAs (missing values) in data.frame
(18 answers)
Closed 5 years ago.
I have a dataframe with 2500 rows. A few of the rows have NAs (an excessive number of NAs), and I want to remove those rows.
I've searched the SO archives, and come up with this as the most likely solution:
df2 <- df[df[, 12] != NA,]
But when I run it and look at df2, all I see is a screen full of NAs (and s).
Any suggestions?
Depending on what you're looking for, one of the following should help you on your way:
Some sample data to start with:
mydf <- data.frame(A = c(1, 2, NA, 4), B = c(1, NA, 3, 4),
C = c(1, NA, 3, 4), D = c(NA, 2, 3, 4),
E = c(NA, 2, 3, 4))
mydf
# A B C D E
# 1 1 1 1 NA NA
# 2 2 NA NA 2 2
# 3 NA 3 3 3 3
# 4 4 4 4 4 4
If you wanted to remove rows just according to a few specific columns, you can use complete.cases or the solution suggested by #SimonO101 in the comments. Here, I'm removing rows which have an NA in the first column.
mydf[complete.cases(mydf$A), ]
# A B C D E
# 1 1 1 1 NA NA
# 2 2 NA NA 2 2
# 4 4 4 4 4 4
mydf[!is.na(mydf[, 1]), ]
# A B C D E
# 1 1 1 1 NA NA
# 2 2 NA NA 2 2
# 4 4 4 4 4 4
If, instead, you wanted to set a threshold--as in "keep only the rows that have fewer than 2 NA values" (but you don't care which columns the NA values are in--you can try something like this:
mydf[rowSums(is.na(mydf)) < 2, ]
# A B C D E
# 3 NA 3 3 3 3
# 4 4 4 4 4 4
On the other extreme, if you want to delete all rows that have any NA values, just use complete.cases:
mydf[complete.cases(mydf), ]
# A B C D E
# 4 4 4 4 4 4

Resources