Finding the index for 2nd Min value in a data frame - r

I have a data frame df1. I would like to find the index for the second smallest value from this dataframe. With the function which.min I was able to get the row index for the smallest value but is there a way to get the index for the second smallest value?
> df1
structure(list(x = c(1, 2, 3, 4, 3), y = c(2, 3, 2, 4, 6), z = c(1,
4, 2, 3, 11)), row.names = c(NA, -5L), class = c("tbl_df", "tbl",
"data.frame"))
>df1
x y z
1 2 1
2 3 4
3 2 2
4 4 3
3 6 11
This is my desired output. For example, in x, the value 2 in row 2 is the second smallest value. Thank you.
>df2
x 2
y 2
z 3

Updated answer
You can write a function like the following, using factor:
which_min <- function(x, pos) {
sapply(x, function(y) {
which(as.numeric(factor(y, sort(unique(y)))) == pos)[1]
})
}
which_min(df1, 2)
# x y z
# 2 2 3
Testing it out with other data:
df2 <- df1
df2$new <- c(1, 1, 1, 2, 3)
which_min(df2, 2)
# x y z new
# 2 2 3 4
Original answer
Instead of sort, you can use order:
sapply(df1, function(x) order(unique(x))[2])
# x y z
# 2 2 3
Or you can make use of the index.return argument in sort:
sapply(df1, function(x) sort(unique(x), index.return = TRUE)$ix[2])
# x y z
# 2 2 3

You can do :
sapply(df1, function(x) which.max(x == sort(unique(x))[2]))
#x y z
#2 2 3
Or with dplyr :
library(dplyr)
df1 %>%
summarise(across(everything(), ~which.max(. == sort(unique(.))[2])))
# x y z
# <int> <int> <int>
#1 2 2 3
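The same idea can be generalised to the n-th smallest value. The sketch below is only one possible way to write it: nth_min_index is a hypothetical helper name, and it uses match to return the first row at which the n-th smallest distinct value occurs.

```r
# Data from the question, rebuilt as a plain data frame
df1 <- data.frame(x = c(1, 2, 3, 4, 3),
                  y = c(2, 3, 2, 4, 6),
                  z = c(1, 4, 2, 3, 11))

# Hypothetical helper: row index of the first occurrence of the
# n-th smallest distinct value in a vector
nth_min_index <- function(x, n) {
  match(sort(unique(x))[n], x)
}

res <- sapply(df1, nth_min_index, n = 2)
res
# x y z
# 2 2 3
```

If a column has fewer than n distinct values, sort(unique(x))[n] is NA and the helper returns NA for that column.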

Another base R version using rank
> sapply(df1, function(x) which(rank(unique(x)) == 2))
x y z
2 2 3

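You could try something like the following, which returns the second smallest value across the whole data frame (a value, not a per-column index):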
sort(unique(unlist(df1)))[2]

Related

Initialise a dataframe where a column references another column

I wonder if there is a way to do:
df <- data.frame(x = 1:3)
df$y = df$x + 5
yielding:
x y
1 1 6
2 2 7
3 3 8
in one line of code where the y column refers to the x column? For example:
data.frame(x = 1:3, y = self$x + 5) # doesn't work
(I won't accept answers that ignore the x column, for example, data.frame(x = 1:3, y = 6:8) :-)
This is possible using tibble from the tibble package. Credit to @DaveArmstrong from the comments.
library(tibble)
tibble(x = 1:3, y = x + 5)
# A tibble: 3 × 2
x y
<int> <dbl>
1 1 6
2 2 7
3 3 8
Here's a base R method that does not need an external package (e.g. tibble).
We can use outer to add 5 to each element in df$x, then cbind the result with df.
setNames(data.frame(cbind(1:3, outer(1:3, 5, `+`))), c("x", "y"))
# or to expand your code
setNames(cbind(data.frame(x = 1:3), outer(1:3, 5, `+`)), c("x", "y"))
x y
1 1 6
2 2 7
3 3 8
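Another base R option, offered here as a sketch rather than as part of the answers above, is transform, which evaluates its arguments inside the data frame so that y can refer to x in a single expression:

```r
# transform() evaluates `x + 5` using the columns of the data frame
df <- transform(data.frame(x = 1:3), y = x + 5)
df
#   x y
# 1 1 6
# 2 2 7
# 3 3 8
```

Note that transform only sees columns that already exist in its first argument, so a column created in the same call cannot be referenced by another new column.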

In R, How to make a function that finds if there is a matching pairs?

I want to make a function that can detect if there is a matching pair of numbers. I want to simulate x and y many times to see the # of matches occurring using a function.
x<-sample(1:6,6)
y<-sample(1:6,6)
x;y
For example, I have x <- c(2, 5, 6, 4, 3, 1) and y <- c(2, 1, 6, 5, 4, 3). The numbers 2 and 6 match at the same positions, so there are 2 pairs. If there is no match between x and y, the result should be 0. I can use sum(x == y) to count the matches for a single x and y.
How can I make a function that finds number of identical pairs for many x and y?
You can just use
f<-function(n,k) {
sapply(1:k, \(i) sum(sample(n) == sample(n)))
}
where k is the number of iterations and n is the range (in your case 6)
Example Usage:
f(n=6, k=100)
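If the x and y vectors already exist rather than being sampled inside the function, the same sum(x == y) count can be applied pairwise. This is a minimal sketch; count_matches, xs and ys are hypothetical names, and the second pair of vectors is made up for illustration.

```r
# Hypothetical helper: number of positions at which x and y agree
count_matches <- function(x, y) sum(x == y)

# Lists of paired vectors; the first pair is the example from the question
xs <- list(c(2, 5, 6, 4, 3, 1), 1:6)
ys <- list(c(2, 1, 6, 5, 4, 3), 6:1)

# Apply the count to each (x, y) pair in turn
matches <- mapply(count_matches, xs, ys)
matches
# [1] 2 0
```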
In base R the following function would do the trick. The length of each vector is given by the size argument, and the number of trials is given by n:
n_pairs <- function(size, n) {
colSums(replicate(n, sample(size)) == replicate(n, sample(size)))
}
So, for example we can see:
set.seed(1)
n_pairs(size = 6, n = 5)
#> [1] 2 0 1 1 1
hist(n_pairs(6, 100), breaks = 0:6)
mean(n_pairs(6, 1000))
#> [1] 1.013
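Note though that R already has the function rbinom, which gives an approximately similar result (the mean is the same, but the positions of a permutation do not match independently, so the distribution is not exactly binomial) with: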
rbinom(n, size, 1/size)
Created on 2022-04-26 by the reprex package (v2.0.1)
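A quick simulation can show how the simulated counts and rbinom differ. This sketch reuses n_pairs from above with an arbitrary seed and checks that exactly size - 1 matches never occur between two permutations (if all but one position match, the last one must match too), whereas the binomial distribution gives that count a positive probability.

```r
n_pairs <- function(size, n) {
  colSums(replicate(n, sample(size)) == replicate(n, sample(size)))
}

set.seed(42)  # arbitrary seed, chosen only for reproducibility
sim <- n_pairs(size = 6, n = 2000)

# Exactly 5 matches out of 6 is impossible for permutations
any(sim == 5)

# The binomial model assigns it a small but positive probability
dbinom(5, size = 6, prob = 1/6)
```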
Maybe this one (removed first answer):
x<- c(2, 5, 6, 4, 3, 1)
y<- c(2, 1, 6, 5, 4, 3)
lst = list(x,y)
pairs <- outer(lst,lst,Vectorize(function(x,y){x[x==y]}))
pairs[1,2]
[[1]]
[1] 2 6
A possible solution with the dplyr package
require(tidyverse)
x <- c(2, 5, 6, 4, 3, 1)
y <- c(2, 1, 6, 5, 4, 3)
df <- tibble(x = x,
y = y) %>%
mutate(pair = case_when(x == y ~ "PAIR",
TRUE ~ "NOT"))
The dataset:
# A tibble: 6 x 3
x y pair
<dbl> <dbl> <chr>
1 2 2 PAIR
2 5 1 NOT
3 6 6 PAIR
4 4 5 NOT
5 3 4 NOT
6 1 3 NOT
Filtering:
df %>%
filter(pair == "PAIR")
Output:
# A tibble: 2 x 3
x y pair
<dbl> <dbl> <chr>
1 2 2 PAIR
2 6 6 PAIR
Will this give you what you want? Make a table out of the values that are paired.
table(x[x==y])
x <- sample(1:6,1000, TRUE)
y <- sample(1:6,1000, TRUE)
table(x[x==y])
# 1 2 3 4 5 6
# 37 26 32 28 30 33

Finding min turning point in a data frame

I have a data frame df1. I would like to find the minimum turning point in each column, that is, a point where the values before and after it are both larger. For example, in x = c(2,5,3,6,1,1,1), I would like to determine that the minimum turning point is 3, but with the min function I can only find the overall minimum, which is 1. If there is no minimum turning point, I would like to get NA. Thanks.
> df
structure(list(x = c(2, 5, 3, 6, 1, 1, 1), y = c(6, 9, 3, 6,
3, 1, 1), z = c(9, 3, 5, 1, 4, 6, 2)), row.names = c(NA, -7L), class = c("tbl_df",
"tbl", "data.frame"))
df1>
x y z
2 6 9
5 9 3
3 3 5
6 6 1
1 3 4
1 1 6
1 1 2
Desired result as shown below.
df2>
x y z
3 3 1
You can use lead and lag to compare current value with previous and next value.
library(dplyr)
df %>% summarise(across(everything(), ~min(.x[which(lag(.x) > .x & lead(.x) > .x)])))
# x y z
# <dbl> <dbl> <dbl>
#1 3 3 1
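The question also asks for NA when a column has no turning point. A base R sketch of the same neighbour comparison with that case handled explicitly is shown below; min_turning_point is a hypothetical helper name, and it only detects valleys that are one position wide.

```r
# Hypothetical helper: smallest value that is strictly lower than both
# of its neighbours, or NA if no such value exists
min_turning_point <- function(v) {
  n <- length(v)
  if (n < 3) return(NA_real_)
  mid  <- v[2:(n - 1)]   # interior points
  prev <- v[1:(n - 2)]   # their left neighbours
  nxt  <- v[3:n]         # their right neighbours
  valleys <- mid[mid < prev & mid < nxt]
  if (length(valleys) == 0) NA_real_ else min(valleys)
}

df <- data.frame(x = c(2, 5, 3, 6, 1, 1, 1),
                 y = c(6, 9, 3, 6, 3, 1, 1),
                 z = c(9, 3, 5, 1, 4, 6, 2))

res <- sapply(df, min_turning_point)
res
# x y z
# 3 3 1

min_turning_point(c(1, 2, 3, 4))  # no valley, so NA
```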
You can use diff, take the sign, then diff again to find the valleys. Use min to get the lowest valley.
#Value
sapply(df, function(x) min(x[1+which(diff(sign(diff(x))) == 2)]))
#x y z
#3 3 1
#Position
sapply(df, function(x) {
tt <- 1+which(diff(sign(diff(x))) == 2)
tt[which.min(x[tt])] })
#x y z
#3 3 4
But this works only when the valley is one position wide.
A more robust solution uses the function from Finding local maxima and minima:
peakPosition <- function(x, inclBorders=TRUE) {
if(inclBorders) {y <- c(min(x), x, min(x))
} else {y <- c(x[1], x)}
y <- data.frame(x=sign(diff(y)), i=1:(length(y)-1))
y <- y[y$x!=0,]
idx <- diff(y$x)<0
(y$i[c(idx,F)] + y$i[c(F,idx)] - 1)/2
}
#Value
sapply(df, function(x) min(x[ceiling(peakPosition(-x, FALSE))]))
#x y z
#3 3 1
#Position
sapply(df, function(x) {
tt <- peakPosition(-x, FALSE)
tt[which.min(x[floor(tt)])] })
#x y z
#3 3 4
An alternative would be to use rle:
x <- c(8,9,3,3,8,1,1)
y <- rle(x)
i <- 1 + which(diff(sign(diff(y$values))) == 2)
min(y$values[i]) #Value
#[1] 3
j <- which.min(y$values[i])
1+sum(y$lengths[seq(i[j])-1]) #First Position
#[1] 3
sum(y$lengths[seq(i[j])]) #Last Position
#[1] 4
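The rle steps above can be wrapped in a function and applied per column. This is a sketch under the assumption that only the value is needed; plateau_min is a hypothetical name, and it returns NA when no valley is found.

```r
# Hypothetical helper: lowest valley value, allowing flat valleys,
# or NA when the vector has no valley
plateau_min <- function(v) {
  r <- rle(v)                                    # collapse runs of equal values
  i <- 1 + which(diff(sign(diff(r$values))) == 2)  # runs that are valleys
  if (length(i) == 0) NA_real_ else min(r$values[i])
}

plateau_min(c(8, 9, 3, 3, 8, 1, 1))
# [1] 3

df <- data.frame(x = c(2, 5, 3, 6, 1, 1, 1),
                 y = c(6, 9, 3, 6, 3, 1, 1),
                 z = c(9, 3, 5, 1, 4, 6, 2))
res <- sapply(df, plateau_min)
res
# x y z
# 3 3 1
```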
Alternate approach
df %>% summarise_all(~ifelse(min(.)==last(.) | min(.) == first(.), min(.[. != last(.) & . != first(.)]), min(.)))
x y z
1 3 3 1
For returning the row_nums
df %>% mutate_all(~ifelse(min(.)==last(.) | min(.) == first(.), min(.[. != last(.) & . != first(.)]), min(.))) %>%
mutate(id = row_number()) %>% left_join(df %>% mutate(id = row_number()), by = "id") %>%
mutate(x_r = ifelse(x.x == x.y, row_number(), 0),
y_r = ifelse(y.x == y.y, row_number(), 0),
z_r = ifelse(z.x == z.y, row_number(), 0)) %>%
select(ends_with("r")) %>% summarise_all(~min(.[. != 0]))
x_r y_r z_r
1 3 3 4

Using a column as a column index to extract value from a data frame in R

I am trying to use the values from a column to extract column numbers in a data frame. My problem is similar to this topic in r-bloggers. Copying the script here:
df <- data.frame(x = c(1, 2, 3, 4),
y = c(5, 6, 7, 8),
choice = c("x", "y", "x", "z"),
stringsAsFactors = FALSE)
However, instead of having column names in choice, I have column index number, such that my data frame looks like this:
df <- data.frame(x = c(1, 2, 3, 4),
y = c(5, 6, 7, 8),
choice = c(1, 2, 1, 3),
stringsAsFactors = FALSE)
I tried using this solution:
df$newValue <-
df[cbind(
seq_len(nrow(df)),
match(df$choice, colnames(df))
)]
Instead of giving me an output that looks like this:
# x y choice newValue
# 1 1 5 1 1
# 2 2 6 2 6
# 3 3 7 1 3
# 4 4 8 3 NA
My newValue column returns all NAs.
# x y choice newValue
# 1 1 5 1 NA
# 2 2 6 2 NA
# 3 3 7 1 NA
# 4 4 8 3 NA
What should I modify in the code so that it would read my choice column as column index?
Since choice already holds the column numbers to extract, match is not needed here. However, the data frame also contains the choice column itself, which you don't want to extract from, so any value outside the range of the data columns has to be set to NA before subsetting the data frame.
mat <- cbind(seq_len(nrow(df)), df$choice)
mat[mat[, 2] > (ncol(df) -1), ] <- NA
df$newValue <- df[mat]
df
# x y choice newValue
#1 1 5 1 1
#2 2 6 2 6
#3 3 7 1 3
#4 4 8 3 NA
data
df <- data.frame(x = c(1, 2, 3, 4),
y = c(5, 6, 7, 8),
choice = c(1, 2, 1, 3))
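A variant of the same matrix-indexing idea, sketched here under the assumption that the value columns are known by name, restricts the lookup to those columns so that the choice column can never be picked. value_cols and idx are hypothetical names.

```r
df <- data.frame(x = c(1, 2, 3, 4),
                 y = c(5, 6, 7, 8),
                 choice = c(1, 2, 1, 3))

# Columns that choice is allowed to point at (an assumption for this sketch)
value_cols <- c("x", "y")

# Two-column index matrix: one (row, column) pair per row of df
idx <- cbind(seq_len(nrow(df)), df$choice)

# Column numbers beyond the value columns become NA
idx[idx[, 2] > length(value_cols), 2] <- NA

# Rows of the index matrix that contain NA yield NA in the result
df$newValue <- as.matrix(df[value_cols])[idx]
df
#   x y choice newValue
# 1 1 5      1        1
# 2 2 6      2        6
# 3 3 7      1        3
# 4 4 8      3       NA
```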

Filter by group max combination of values in a given order

I would like to filter, by group, the maximal combination of values based on a given column order.
A vector of column names should specify the order in which to look for maximal values.
For example :
x <- data.frame(id = c("a", "a", "b", "b"),
x = c(1, 1, 1, 2),
y = c(1, 2, 2, 1),
z = c(1, 1, 2, 1))
> x
id x y z
1 a 1 1 1
2 a 1 2 1
3 b 1 2 2
4 b 2 1 1
In this example I would like to group by id and set the 'priority' to x, y, z, which means that I want to find the maximal x value, then its associated maximal y value, and then the maximal z value for that maximal x, y pair.
I'm not aware of a vectorized function that does this, so I recursively group to find the maximum of each following column. The expected result is:
> x
id x y z
1 a 1 2 1
2 b 2 1 1
I can do it with base R, with a loop :
group <- "id"
cols <- c("x", "y", "z")
for (i in seq_along(cols)) {
tmp <- aggregate(setNames(list(x[[cols[i]]]), cols[i]), by = as.list(x[group]), FUN = max)
x <- merge(x, tmp, by = c(group, cols[i]))
group <- c(group, cols[i])
}
x <- x[!duplicated(x), ]
> x
id x y z
1 a 1 2 1
2 b 2 1 1
I would like to apply this to a larger amount of data, so this code will struggle at some point. Do you have any ideas to improve this?
Thank you for any help !
We can try with dplyr
library(dplyr)
x %>%
group_by(id) %>%
arrange(desc(y),desc(z)) %>%
slice(which.max(x))
# id x y z
# <fctr> <dbl> <dbl> <dbl>
#1 a 1 2 1
#2 b 2 1 1
Here is a base R solution using the split-apply-combine methodology.
dfNew <- do.call(rbind, lapply(split(x, x$id),
function(x) x[with(x, order(x, y, z, decreasing=TRUE))[1],]))
which returns
dfNew
id x y z
a a 1 2 1
b b 2 1 1
split splits the data frame by id and returns a list. This list is fed to lapply, which applies an anonymous function that returns the row with the maximum values according to order. Finally, the list of single-row data.frames is combined with rbind and do.call.
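Since the question asks for the priority to come from a vector of column names, a single sort followed by duplicated may be enough and avoids the per-group split. This is a sketch, assuming the priority columns are numeric (they are negated to sort in decreasing order); keys, sorted and result are hypothetical names.

```r
x <- data.frame(id = c("a", "a", "b", "b"),
                x = c(1, 1, 1, 2),
                y = c(1, 2, 2, 1),
                z = c(1, 1, 2, 1))
group <- "id"
cols <- c("x", "y", "z")   # priority order

# Sort keys: the group first, then each priority column negated so that
# larger values come first (assumes numeric columns)
keys <- c(list(x[[group]]), lapply(x[cols], function(v) -v))
sorted <- x[do.call(order, unname(keys)), ]

# After sorting, the first row of each group is its maximal combination
result <- sorted[!duplicated(sorted[[group]]), ]
result
#   id x y z
# 2  a 1 2 1
# 4  b 2 1 1
```

Changing the order of the names in cols changes the priority without touching the rest of the code.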
