Ordering data.frame by multiple variables in mixed direction [duplicate] - r

This question already has answers here:
How to order a data frame by one descending and one ascending column?
(12 answers)
Closed 4 years ago.
For this sample data.frame,
df <- data.frame(var1=c("b","a","b","a","a","b"),
var2=c("l","l","k","k","l","k"),
var3=c("t","t","x","t","x","x"),
var4=c(5,3,3,5,5,3),
stringsAsFactors=FALSE)
Unsorted
var1 var2 var3 var4
1 b l t 5
2 a l t 3
3 b k x 3
4 a k t 5
5 a l x 5
6 b k x 3
I would like to sort on three columns, 'var2', 'var3' and 'var4', in that order of priority and in a single sort: 'var2' ascending and the other two descending. The column names to sort on are stored in variables.
sort_asc <- "var2"
sort_desc <- c("var3","var4")
What's the best way to do this in base R?
Updated details
This is the output if sorted ascending by 'var2' first (step 1) and then descending by 'var3' and 'var4' (as step 2).
var1 var2 var3 var4
a l x 5
b k x 3
b k x 3
a k t 5
b l t 5
a l t 3
But what I am looking for is to apply all three sort keys at the same time to get this:
var1 var2 var3 var4
b k x 3
b k x 3
a k t 5
a l x 5
b l t 5
a l t 3
'var2' is ascending (k, l); within k and within l, 'var3' is descending; and within ties on 'var3', 'var4' is descending.
To clarify, how this question is different from other data.frame ordering questions...
ordering on multiple columns
column names to order on are stored in variables
different ordering directions (asc,desc)
ordering is not step-wise (one sort after another) but rather simultaneous (all selected columns at same time)
using base R, not dplyr

Step-wise ordering (sort ascending first and then descending).
dplyr solution:
library(dplyr)
df %>%
arrange_at(sort_asc) %>%
arrange_at(sort_desc, desc)
var1 var2 var3 var4
1 a l x 5
2 b k x 3
3 b k x 3
4 a k t 5
5 b l t 5
6 a l t 3
base R solution:
With base R, if there are multiple columns (in general), use order within do.call. Here, we create the index for ascending order first, then re-sort the result in descending order by the second set of columns ('sort_desc').
i1 <- do.call(order, df[sort_asc])
df1 <- df[i1,]
i2 <- do.call(order, c(df1[sort_desc], list(decreasing = TRUE)))
df1[i2,]
var1 var2 var3 var4
5 a l x 5
3 b k x 3
6 b k x 3
4 a k t 5
1 b l t 5
2 a l t 3
Simultaneous/Sequential ordering (all ordering variables are used in one ordering step):
dplyr solution:
df %>%
arrange_(.dots = c(sort_asc, paste0("desc(", sort_desc, ")")))
# var1 var2 var3 var4
#1 b k x 3
#2 b k x 3
#3 a k t 5
#4 a l x 5
#5 b l t 5
#6 a l t 3
base R solution:
With base R, to get the same output as with arrange_, negate the xtfrm() values of each descending column so that a single order call handles both directions:
df[do.call(order, c(as.list(df[sort_asc]), lapply(df[sort_desc],
function(x) -xtfrm(x)))),]
# var1 var2 var3 var4
#3 b k x 3
#6 b k x 3
#4 a k t 5
#5 a l x 5
#1 b l t 5
#2 a l t 3
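Another base R option, given here as a sketch rather than as part of the original answer: with method = "radix", order accepts a vector for decreasing, one entry per sort key, so the negation through xtfrm is not needed. The names cols, dec, idx and sorted are introduced only for this illustration. Note that the radix method compares character columns in C-locale order, which can differ from the locale-aware default for mixed-case or accented strings.

```r
df <- data.frame(var1 = c("b","a","b","a","a","b"),
                 var2 = c("l","l","k","k","l","k"),
                 var3 = c("t","t","x","t","x","x"),
                 var4 = c(5,3,3,5,5,3),
                 stringsAsFactors = FALSE)
sort_asc  <- "var2"
sort_desc <- c("var3", "var4")

# All sort keys in priority order, with one direction flag per key
cols <- c(sort_asc, sort_desc)
dec  <- rep(c(FALSE, TRUE), c(length(sort_asc), length(sort_desc)))

# unname() keeps column names from clashing with order()'s own arguments
idx <- do.call(order, c(unname(as.list(df[cols])),
                        list(decreasing = dec, method = "radix")))
sorted <- df[idx, ]
sorted
```

Because the direction flags are built from the lengths of sort_asc and sort_desc, the same call works when either vector holds more than one column name.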

Related

How to Use Select() function in a Data Frame by passing the Column name in a variable in R [duplicate]

This question already has answers here:
R dplyr subset with missing columns
(3 answers)
Closed 1 year ago.
Purpose
Can I use dplyr to select only those columns whose names appear in an external vector? I have found posts that explain how to subset a data frame using a vector of names, but none that cover the case where some of the names in the vector do not exist in the data frame.
Example dataset
library(tidyverse)
library(tibble)
library(data.table)
col_names <- c('a', 'b', 'e')
rename <- dplyr::rename
select <- dplyr::select
set.seed(10002)
a <- sample(1:20, 1000, replace=TRUE)
set.seed(10003)
b <- sample(letters, 1000, replace=TRUE)
set.seed(10004)
c <- sample(letters, 1000, replace=TRUE)
data <-
data.frame(a, b, c)
# I would like to choose a, b that are in col_names vector.
We could use any_of with select
library(dplyr)
data %>%
select(any_of(col_names))
-output
a b
1 1 e
2 4 e
3 13 f
4 8 m
5 10 z
6 3 y
...
Here is one way to solve your problem:
data[names(data) %in% col_names]
# a b
# 1 1 e
# 2 4 e
# 3 13 f
# 4 8 m
# 5 10 z
# 6 3 y
# ...
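A variation on the logical index above, as a small sketch: intersect(col_names, names(data)) keeps only the names that actually exist and returns the columns in the order given in col_names rather than in the order of the data frame. The three-row data frame and the reordered col_names here are made-up stand-ins for the question's data, and keep and res are illustrative names.

```r
# Small stand-in for the question's data (assumption, not the original sample)
data <- data.frame(a = 1:3,
                   b = c("x", "y", "z"),
                   c = c("p", "q", "r"))
col_names <- c("b", "a", "e")   # "e" does not exist in data

# Names present in both, in the order they appear in col_names
keep <- intersect(col_names, names(data))
res  <- data[keep]
names(res)
```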
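We may also use matches. Note that it treats each element of col_names as a regular expression, so it selects every column whose name contains a match for any of them:
library(dplyr)
data %>%
select(matches(col_names))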
Output:
a b
<int> <chr>
1 1 e
2 4 e
3 13 f
4 8 m
5 10 z
6 3 y
7 19 g
8 7 f
9 12 f
10 15 k
# … with 990 more rows

Vector to dataframe with variable row length [duplicate]

This question already has answers here:
Convert Rows into Columns by matching string in R
(3 answers)
Closed 4 years ago.
Given a vector, I want to convert it to a dataframe using a 'key' value which is randomly distributed throughout the vector at the start of what is to be a row. In this case, "z" would be the first value in each column.
vd <- c("z","a","b","c","z","a","b","c","z","a","b","c","d")
The resultant data should look like:
#using magrittr
data.frame(x1 = c("z","a","b","c", NA), x2 = c("z","a","b","c", NA), x3 = c("z","a","b","c","d")) %>%
transpose()
One solution would be to find the largest distance between 'keys' in the vector and then pad the 'sections' that are shorter than the longest one with NA values at the end, so that matrix() could be used.
What would be the best way to do this?
plyr::ldply(split(vd, cumsum(vd == "z")), rbind)[-1]
(copied from here)
result:
1 2 3 4 5
1 z a b c <NA>
2 z a b c <NA>
3 z a b c d
We can use cumsum to identify the groups and split the vector on them. Then we pad each piece with NA up to the longest length and combine the pieces into a data.frame.
x <- split(vd,cumsum("z"==vd))
maxl <- max(lengths(x))
as.data.frame(lapply(x,function(y) c(y,rep(NA,maxl-length(y)))))
# X1 X2 X3
# 1 z z z
# 2 a a a
# 3 b b b
# 4 c c c
# 5 <NA> <NA> d
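If the pieces should end up as rows, as in the question, one possible sketch pads each piece with `length<-` and transposes the result. This yields a character matrix, which as.data.frame can convert if a data frame is needed. The names pieces, maxl and res are illustrative.

```r
vd <- c("z","a","b","c","z","a","b","c","z","a","b","c","d")

# A new piece starts at every "z"
pieces <- split(vd, cumsum(vd == "z"))
maxl   <- max(lengths(pieces))

# `length<-` extends a vector to the given length, filling with NA;
# sapply() returns one column per piece, so t() turns pieces into rows
res <- t(sapply(pieces, `length<-`, maxl))
res
```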

put duplicated rows in different data.frame(s)

Let
x=c(1,2,2,3,4,1)
y=c("A","B","C","D","E","F")
df=data.frame(x,y)
df
x y
1 1 A
2 2 B
3 2 C
4 3 D
5 4 E
6 1 F
How can I put the rows with duplicated x values into separate data frames, like this:
df1
x y
1 A
1 F
df2
x y
2 B
2 C
Thank you for help
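You could use split:
split(df, f = df$x)
f = df$x specifies the grouping column; check ?split for more details.
To remove the non-duplicated rows, index the result by name with the duplicated values (a numeric index would select list elements by position, not by value of x):
dups <- sort(unique(df$x[duplicated(df$x)]))
mylist <- split(df, f = df$x)[as.character(dups)]
names(mylist) <- paste0("df", seq_along(mylist))
list2env(mylist, envir = .GlobalEnv) # to separate the data frames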
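An alternative sketch for the filtering step: split on x and keep only the groups with more than one row, using Filter from base R. The name groups and the generated names df1, df2 are illustrative.

```r
x <- c(1, 2, 2, 3, 4, 1)
y <- c("A", "B", "C", "D", "E", "F")
df <- data.frame(x, y)

# One data frame per value of x, then drop the groups with a single row
groups <- Filter(function(d) nrow(d) > 1, split(df, df$x))

# Name the survivors df1, df2, ... in increasing order of x
names(groups) <- paste0("df", seq_along(groups))
groups
```

Keeping the result as a named list is often easier to work with than separate variables; list2env(groups, envir = .GlobalEnv) can still be applied afterwards if individual objects are wanted.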

Filtering a R DataFrame with repeated values in columns

I have an R data frame and I want to build another one from it, keeping only the rows whose value in a given column appears more than X times.
>DataFrame
Value Column
1 a
4 a
2 b
6 c
3 c
4 c
9 a
1 d
For example, I want a new data frame with only the rows whose value in Column appears more than 2 times, to get something like this:
>NewDataFrame
Value Column
1 a
4 a
6 c
3 c
4 c
9 a
Thank you very much for your time.
We can use table to count the values in 'Column' and subset the dataset ('DataFrame') to the rows whose 'Column' value has a count greater than 'n' in 'tbl':
n <- 2
tbl <- table(DataFrame$Column) > n
NewDataFrame <- subset(DataFrame, Column %in% names(tbl)[tbl])
# Value Column
#1 1 a
#2 4 a
#4 6 c
#5 3 c
#6 4 c
#7 9 a
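Or using ave from base R (seq_along(Column) is passed as the first argument so that the group sizes come back as integers whatever the type of Column)
NewDataFrame <- DataFrame[with(DataFrame, ave(seq_along(Column), Column, FUN=length)>n),]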
Or using data.table
library(data.table)
NewDataFrame <- setDT(DataFrame)[, .SD[.N>n] , by = Column]
Or
NewDataFrame <- setDT(DataFrame)[, if(.N > n) .SD, by = Column]
Or dplyr
library(dplyr)
NewDataFrame <- DataFrame %>%
group_by(Column) %>%
filter(n()>2)
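To try the base R variants on the sample from the question, the data frame can be rebuilt by hand. The construction below is an assumption, since the question only shows the printed data, and counts is an illustrative name. The sketch computes the group sizes with ave over seq_along(Column), so the counts are integers regardless of the type of Column.

```r
# Sample data rebuilt from the printed table in the question (assumed construction)
DataFrame <- data.frame(Value  = c(1, 4, 2, 6, 3, 4, 9, 1),
                        Column = c("a", "a", "b", "c", "c", "c", "a", "d"),
                        stringsAsFactors = FALSE)
n <- 2

# For each row, the number of rows sharing its Column value
counts <- ave(seq_along(DataFrame$Column), DataFrame$Column, FUN = length)

# Keep the rows whose Column value occurs more than n times
NewDataFrame <- DataFrame[counts > n, ]
NewDataFrame
```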
