Calculating the mode or 2nd/3rd/4th most common value - r

Surely there has to be a function out there in some package for this?
I've searched and I've found this function to calculate the mode:
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
But I'd like a function that lets me easily calculate the 2nd/3rd/4th/nth most common value in a column of data.
Ultimately I will apply this function to a large number of dplyr::group_by()s.
Thank you for your help!

Maybe you could try
f <- function (x) with(rle(sort(x)), values[order(lengths, decreasing = TRUE)])
This gives the unique values of the vector sorted by decreasing frequency. The first is the mode, the 2nd is the 2nd most common, etc.
Another method is based on table():
g <- function (x) as.numeric(names(sort(table(x), decreasing = TRUE)))
But this is not recommended, as the input vector x will be coerced to a factor first, which is very slow for a large vector. Also, on exit we have to extract the character names of the table and coerce them back to numeric.
Example
set.seed(0); x <- rpois(100, 10)
f(x)
# [1] 11 12 7 9 8 13 10 14 5 15 6 2 3 16
Let's compare with the contingency table from table:
tab <- sort(table(x), decreasing = TRUE)
# 11 12  7  9  8 13 10 14  5 15  6  2  3 16
# 14 14 11 11 10 10  9  7  5  4  2  1  1  1
as.numeric(names(tab))
# [1] 11 12 7 9 8 13 10 14 5 15 6 2 3 16
So the results are the same.
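Since the goal is to apply this inside dplyr::group_by(), here is a minimal sketch of that use (the data frame df, with grouping column g and value column v, is invented for illustration):
library(dplyr)
# Toy data, invented for illustration: two groups with different modes
df <- data.frame(g = rep(c("a", "b"), each = 6),
                 v = c(1, 1, 1, 2, 2, 3, 5, 5, 6, 6, 6, 7))
df %>%
  group_by(g) %>%
  summarise(mode   = f(v)[1],   # most common value
            second = f(v)[2])   # 2nd most common (NA if there is none)
# # A tibble: 2 x 3
#   g      mode second
#   <chr> <dbl>  <dbl>
# 1 a         1      2
# 2 b         6      5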

Here is an R function that I made (inspired by several other SO posts), which may work for your goal (and I use a local dataset on religious affiliation to illustrate it):
It's simple; only R base functions are involved: length, match, sort, tabulate, table, unique, which, as.character.
Find_Nth_Mode = function(d, N = 2) {
  # helper: the N-th largest value of x
  maxN = function(x, N) {
    len = length(x)
    if (N > len) {
      warning('N greater than length(x). Setting N=length(x)')
      N = length(x)
    }
    sort(x, partial = len - N + 1)[len - N + 1]
  }
  (ux = unique(as.character(d)))  # unique values
  (match(d, ux))                  # index of each element among the unique values
  (a1 = tabulate(match(d, ux)))   # frequency of each unique value
  (a2 = maxN(a1, N))              # N-th highest frequency
  (a3 = which(a1 == a2))          # position(s) with that frequency
  (ux[a3])                        # the N-th most common value(s)
}
Sample Output
> table(religion_data$relig11)
0.None 1.Protestant_Conservative 2.Protestant_Liberal 3.Catholic
 34486                      6134                19678      36880
4.Orthodox 5.Islam_Sunni 6.Islam_Shia 7.Hindu
     20702         28170          668    4653
8.Buddhism 9.Jewish 10.Other
      9983      381     6851
> Find_Nth_Mode(religion_data$relig11, 1)
[1] "3.Catholic"
> Find_Nth_Mode(religion_data$relig11, 2)
[1] "0.None"
> Find_Nth_Mode(religion_data$relig11, 3)
[1] "5.Islam_Sunni"
Reference:
I want to express my gratitude to these posts, from which I got the two functions that I integrated into one:
function to find the N th largest value: Fastest way to find second (third...) highest/lowest value in vector or column
how to find the second largest mode value?
Calculating the mode or 2nd/3rd/4th most common value

How to extract outstanding values from an object returned by waldo::compare()?

I'm trying to use a new R package called waldo (see also the tidyverse blog) that is designed to compare data objects to find differences. The waldo::compare() function returns an object that is, according to the documentation:
a character vector with class "waldo_compare"
The main purpose of this function is to be used within the console, leveraging coloring features to highlight outstanding values that are not equal between data objects. However, while just examining in console is useful, I do want to take those values and act on them (filter them out from the data, etc.). Therefore, I want to programmatically extract the outstanding values. I don't know how.
Example
Generate a vector of length 10:
set.seed(2020)
vec_a <- sample(0:20, size = 10)
## [1] 3 15 13 0 16 11 10 12 6 18
Create a duplicate vector, and add an additional value (4) as an 11th element.
vec_b <- vec_a
vec_b[11] <- 4
vec_b <- as.integer(vec_b)
## [1] 3 15 13 0 16 11 10 12 6 18 4
Use waldo::compare() to test the differences between the two vectors
waldo::compare(vec_a, vec_b)
## `old[8:10]`: 12 6 18
## `new[8:11]`: 12 6 18 4
The beauty is that the differing value is highlighted in the console.
But now, how do I extract the different value?
I can try to assign waldo::compare() to an object:
waldo_diff <- waldo::compare(vec_a, vec_b)
and then what? When I try to do waldo_diff[[1]] I get:
[1] "`old[8:10]`: \033[90m12\033[39m \033[90m6\033[39m \033[90m18\033[39m \n`new[8:11]`: \033[90m12\033[39m \033[90m6\033[39m \033[90m18\033[39m \033[34m4\033[39m"
and for waldo_diff[[2]] it's even worse:
Error in waldo_diff[[2]] : subscript out of bounds
Any idea how I could programmatically extract the outstanding values that appear in the "new" vector but not in the "old"?
As a disclaimer, I didn't know anything about this package until you posted, so this is far from an authoritative answer. You can't easily extract the differing values using the compare() function, as it returns an ANSI-formatted string ready for pretty printing. Instead, the workhorses for vectors seem to be the internal functions ses() and ses_context(), which return the indices of the differences between the two objects. The difference between them is that ses_context() splits the result into a list of non-contiguous differences.
waldo:::ses(vec_a, vec_b)
# A tibble: 1 x 5
     x1    x2 t        y1    y2
  <int> <int> <chr> <int> <int>
1    10    10 a        11    11
The results show that there is an addition in the new vector beginning and ending at position 11.
The following simple function is very limited in scope and assumes that only additions in the new vector are of interest:
new_diff_additions <- function(x, y) {
  res <- waldo:::ses(x, y)
  res <- res[res$t == "a", ]  # keep only additions
  if (nrow(res) == 0) {
    return(NULL)
  } else {
    Map(function(start, end) {
      d <- y[start:end]
      `attributes<-`(d, list(start = start, end = end))
    },
    res[["y1"]], res[["y2"]])
  }
}
new_diff_additions(vec_a, vec_b)
[[1]]
[1] 4
attr(,"start")
[1] 11
attr(,"end")
[1] 11
At least for the simple case of comparing two vectors, you’ll be better off
using diffobj::ses_dat() (which is from the package that waldo uses
under the hood) directly:
waldo::compare(1:3, 2:4)
#> `old`: 1 2 3
#> `new`: 2 3 4
diffobj::ses_dat(1:3, 2:4)
#>       op val id.a id.b
#> 1 Delete   1    1   NA
#> 2  Match   2    2   NA
#> 3  Match   3    3   NA
#> 4 Insert   4   NA    3
For completeness, to extract additions you could do e.g.:
extract_additions <- function(x, y) {
  ses <- diffobj::ses_dat(x, y)
  y[ses$id.b[ses$op == "Insert"]]
}
old <- 1:3
new <- 2:4
extract_additions(old, new)
#> [1] 4
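By symmetry, values deleted from the old vector can be extracted the same way (a sketch under the same assumptions, reading the Delete rows of the ses_dat output shown above):
extract_deletions <- function(x, y) {
  ses <- diffobj::ses_dat(x, y)
  x[ses$id.a[ses$op == "Delete"]]
}
extract_deletions(old, new)
#> [1] 1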

r - find maximum length "chain" of numerically increasing pairs of numbers

I have a two column dataframe of number pairs:
ODD <- c(1,1,1,3,3,3,5,7,7,9,9)
EVEN <- c(10,8,2,2,6,4,2,6,8,4,8)
dfPairs <- data.frame(ODD, EVEN)
> dfPairs
   ODD EVEN
1    1   10
2    1    8
3    1    2
4    3    2
5    3    6
6    3    4
7    5    2
8    7    6
9    7    8
10   9    4
11   9    8
Each row of this dataframe is a pair of numbers, and I would like to find the longest possible numerically increasing combination of pairs. Conceptually, this is analogous to making a chain link of number pairs, with the added conditions that 1) links can only be formed using the same number and 2) the final chain must increase numerically.
For instance, row three is pair (1,2), which increases left to right. The next link in the chain would need to have a 2 in the EVEN column and increase right to left, such as row four (3,2). Then the pattern repeats, so the next link would need to have a 3 in the ODD column and increase left to right, such as rows 5 or 6. The chain doesn't have to start at 1 or end at 9 - this was simply a convenient example.
If you try to make all possible linked pairs, you will find that many unique chains of various lengths are possible. I would like to find the longest possible chain. In my real data, I will likely encounter a situation in which more than one chain ties for the longest, in which case I would like all of these returned.
The final result should return the longest possible chain that meets these requirements as a dataframe, or a list of dataframes if more than one solution is possible, containing only the rows in the chain.
Thanks in advance. This one has been perplexing me all morning.
Edited to deal with df that does not start at 1 and returns maximum chains rather than chain lengths
Take advantage of graph data structure using igraph
Your data, dfPairs
ODD <- c(1,1,1,3,3,3,5,7,7,9,9)
EVEN <- c(10,8,2,2,6,4,2,6,8,4,8)
dfPairs <- data.frame(ODD, EVEN)
New data, dfTest
ODD <- c(3,3,3,5,7,7,9,9)
EVEN <- c(2,6,4,2,6,8,4,8)
dfTest <- data.frame(ODD, EVEN)
Make a graph of your data. A key to my solution is to rbind the reverse (rev(dfPairs)) of the data frame to the original data frame. This allows for building directional edges from odd numbers to even numbers and back. Graphs can be used to construct directional paths fairly easily.
library(igraph)
library(dplyr)
GPairs <- graph_from_data_frame(dplyr::arrange(rbind(setNames(dfPairs, c("X1", "X2")), setNames(rev(dfPairs), c("X1", "X2"))), X1))
GTest <- graph_from_data_frame(dplyr::arrange(rbind(setNames(dfTest, c("X1", "X2")), setNames(rev(dfTest), c("X1", "X2"))), X1))
Here are the first three elements of all_simple_paths(GPairs, 1) (starting at 1):
[[1]]
+ 2/10 vertices, named, from f8e4f01:
[1] 1 2
[[2]]
+ 3/10 vertices, named, from f8e4f01:
[1] 1 2 3
[[3]]
+ 4/10 vertices, named, from f8e4f01:
[1] 1 2 3 4
I create a function to 1) convert all simple paths to a list of numeric vectors, 2) filter each numeric vector to only the elements that satisfy the left-to-right increasing condition, and 3) return the maximum-length increasing chains:
max_chain_only_increasing <- function(gpath) {
  # 1) convert paths to a list of numeric vectors
  list_vec <- lapply(gpath, function(v) as.numeric(names(unclass(v))))
  # 2) keep only the leading elements that are left-to-right increasing
  only_increasing <- lapply(list_vec, function(v) v[1:min(which(v >= dplyr::lead(v, default = tail(v, 1))))])
  # 3) return the unique chains of maximum length
  unique(only_increasing[lengths(only_increasing) == max(lengths(only_increasing))])
}
This is the output of the above function using all paths that start from 1
max_chain_only_increasing(all_simple_paths(GPairs, 1))
# [[1]]
# [1] 1 2 3 6 7 8 9
Now I'll output the (head of the) max chains starting from each unique element in dfPairs, your original data:
start_vals <- sort(unique(unlist(dfPairs)))
# [1] 1 2 3 4 5 6 7 8 9 10
max_chains <- sapply(seq_len(length(start_vals)), function(i) max_chain_only_increasing(all_simple_paths(GPairs, i)))
names(max_chains) <- start_vals
# $`1`
# [1] 1 2 3 6 7 8 9
# $`2`
# [1] 2 3 6 7 8 9
# $`3`
# [1] 3 6 7 8 9
# $`4`
# [1] 4 9
# $`5`
# [1] 5
# etc
And finally with dfTest, the newer data
start_vals <- sort(unique(unlist(dfTest)))
max_chains <- sapply(seq_len(length(start_vals)), function(i) max_chain_only_increasing(all_simple_paths(GTest, i)))
names(max_chains) <- start_vals
# $`2`
# [1] 2 3 6 7 8 9
# $`3`
# [1] 3 6 7 8 9
# $`4`
# [1] 4 9
# $`5`
# [1] 5
# $`6`
# [1] 6 7 8 9
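Since the question ultimately wants the chain returned as rows of the original dataframe, here is a hedged sketch on top of the answer above (chain_to_rows is my own helper, not part of igraph): it maps each consecutive pair of chain values back to the first matching row, in either column order.
chain_to_rows <- function(chain, df) {
  # consecutive chain values correspond to one original pair
  pairs <- cbind(head(chain, -1), tail(chain, -1))
  apply(pairs, 1, function(p)
    which((df$ODD == p[1] & df$EVEN == p[2]) |
          (df$ODD == p[2] & df$EVEN == p[1]))[1])
}
dfPairs[chain_to_rows(c(1, 2, 3, 6, 7, 8, 9), dfPairs), ]
#    ODD EVEN
# 3    1    2
# 4    3    2
# 5    3    6
# 8    7    6
# 9    7    8
# 11   9    8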
In spite of Cpak's efforts I ended up writing my own function to solve this. In essence, I realized I could make the right-to-left chain links left-to-right by using this section of code from Cpak's answer:
output <- arrange(rbind(setNames(dfPairs, c("X1", "X2")), setNames(rev(dfPairs), c("X1", "X2"))), X1)
To ensure the resulting chains were sequential, I deleted all decreasing links:
output$increase <- with(output, ifelse(X2>X1, "Greater", "Less"))
output <- filter(output, increase == "Greater")
output <- select(output, -increase)
I realized that if I split the dataframe output by unique values in X1, I could join each of these dataframes sequentially by joining the last column of the first dataframe to the first column of the next dataframe, which would create rows of sequentially increasing chains. The only problem I needed to resolve was the issue of NAs in the last column of the merged dataframe, so I ended up splitting the joined dataframe after each merge, shifting the dataframe to remove the NAs, and rbinding the result back together.
This is the actual code:
out_split <- split(output, output$X1)
df_final <- Reduce(join_shift, out_split)
The function, join_shift, is this:
join_shift <- function(dtf1, dtf2) {
  abcd <- full_join(dtf1, dtf2, setNames(colnames(dtf2)[1], colnames(dtf1)[ncol(dtf1)]))
  abcd[is.na(abcd)] <- 0
  colnames(abcd)[ncol(abcd)] <- "end"
  abcd_na <- filter(abcd, end == 0)
  abcd <- filter(abcd, end != 0)
  abcd_na <- abcd_na[moveme(names(abcd_na), "end first")]
  names(abcd_na) <- names(abcd)
  abcd <- rbind(abcd, abcd_na)
  z <- length(colnames(abcd))
  colnames(abcd) <- paste0("X", 1:z)
  return(abcd)
}
Finally, I found there were a lot of columns that contained only zeros, so I wrote this to delete them and trim the final dataframe:
df_final_trim = df_final[,colSums(df_final) > 0]
Overall I'm happy with this. I imagine it could be a little more elegant, but it works, even on some rather huge and complicated data. This will produce ~241,700 solutions from a dataset of 700 pairs.
I also used a moveme function that I found on Stack Overflow (see below). I employed it to move NA values around to achieve the shift aspect of the join_shift function.
moveme <- function(invec, movecommand) {
  movecommand <- lapply(strsplit(strsplit(movecommand, ";")[[1]], ",|\\s+"),
                        function(x) x[x != ""])
  movelist <- lapply(movecommand, function(x) {
    Where <- x[which(x %in% c("before", "after", "first", "last")):length(x)]
    ToMove <- setdiff(x, Where)
    list(ToMove, Where)
  })
  myVec <- invec
  for (i in seq_along(movelist)) {
    temp <- setdiff(myVec, movelist[[i]][[1]])
    A <- movelist[[i]][[2]][1]
    if (A %in% c("before", "after")) {
      ba <- movelist[[i]][[2]][2]
      if (A == "before") {
        after <- match(ba, temp) - 1
      } else if (A == "after") {
        after <- match(ba, temp)
      }
    } else if (A == "first") {
      after <- 0
    } else if (A == "last") {
      after <- length(myVec)
    }
    myVec <- append(temp, values = movelist[[i]][[1]], after = after)
  }
  myVec
}
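To illustrate what moveme does with the "end first" command used inside join_shift (toy column names, invented for the example):
moveme(c("X1", "X2", "end"), "end first")
# [1] "end" "X1"  "X2"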

R expand.grid with row restrictions

I have a numeric vector x of length N and would like to create a vector of the within-set sums of all of the following sets: any possible combination of the x elements with at most M elements in each combination. I put together a slow iterative approach; what I am looking for here is a way without using any loops.
Consider the approach I have been taking, in the following example with N=5 and M=4
M <- 4
x <- 11:15
y <- as.matrix(expand.grid(rep(list(0:1), length(x))))
result <- y[rowSums(y) <= M, ] %*% x
However, as N gets large (above 22 for me), the expand.grid output becomes too big and gives an error (replace x above with x <- 11:55 to observe this). Ideally there would be an expand.grid function that permits restrictions on the rows before constructing the full matrix, which (at least for what I want) would keep the matrix size within memory limits.
Is there a way to achieve this without causing problems for large N?
Your problem has to do with the sheer number of combinations.
What you appear to be doing is listing all different combinations of 0's and 1's in a sequence of the length of x.
In your example x has length 5 and you have 2^5 = 32 combinations.
When x has length 22 you have 2^22 = 4194304 combinations.
Couldn't you use a binary encoding instead?
In your case that would mean
0 stands for 00000
1 stands for 00001
2 stands for 00010
3 stands for 00011
...
It will not solve your problem completely, but you should be able to get a bit further than now.
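Here is a minimal sketch of that encoding in base R (combo_from_code is an invented name): it recovers the 0/1 pattern for a single integer code, so combinations can be processed one at a time instead of materializing the full expand.grid matrix.
combo_from_code <- function(code, n) {
  # bit i of code (counted from the left) becomes element i of the 0/1 vector
  as.integer(bitwAnd(bitwShiftR(code, (n - 1):0), 1L))
}
combo_from_code(3, 5)
# [1] 0 0 0 1 1
sum(x[as.logical(combo_from_code(3, 5))])  # within-set sum for code 3, with x <- 11:15
# [1] 29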
Try this:
c(0, unlist(lapply(1:M, function(k) colSums(combn(x, k)))))
It generates the same result as with your expand.grid approach, shown below for the test data.
M <- 4
x <- 11:15
# expand.grid approach
y <- as.matrix(expand.grid(rep(list(0:1), length(x))))
result <- y[rowSums(y) <= M, ] %*% x
# combn approach
result1 <- c(0, unlist(lapply(1:M, function(k) colSums(combn(x, k)))))
all(sort(result[,1]) == sort(result1))
# [1] TRUE
This should be fast (it takes 0.227577 secs on my machine, with N=22, M=4):
x <- 1:22 # N = 22
M <- 4
c(0, unlist(lapply(1:M, function(k) colSums(combn(x, k)))))
# [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 3 4 5 6 7
You may want to keep only the unique values of the sums with
unique(c(0, unlist(lapply(1:M, function(k) colSums(combn(x, k))))))
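For intuition on why the combn approach scales: with at most M of N elements chosen, only sum(choose(N, 0:M)) subsets exist, far fewer than the 2^N rows that expand.grid materializes. A quick check:
N <- 22; M <- 4
sum(choose(N, 0:M))  # [1] 9109      subset sums to compute
2^N                  # [1] 4194304   rows in the expand.grid approach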

Binary Search like concept to create subset data in R

I have the dataset w below and a key variable x, for two cases.
Case 1:
x = 4
w = c(1,2,4,4,4,4,6,7,8,9,10,11,12,14,15)
Case2:
x = 12
w = c(1,2,4,4,4,4,6,7,8,9,10,11,12,14,15)
I want to create a function which will search for x through the dataset w and subset the original dataset into a smaller one based on x's location in w. The output will be a smaller dataset whose upper-bound value is the search key. Below is the function I am trying to write in R:
create_chunk <- function(val, tab, L = 1L, H = length(tab))
{
  if (H >= L)
  {
    mid = L + ((H - L) / 2)
    ## If the element is present within the middle length
    if (tab[mid] > val)
    {
      ## subset the original data to a reduced size, check the mid
      ## position value again, then subset the data
    } else
    {
      mid = mid + (mid / 2)
      ## Increase the mid position to check the right side
    }
  }
}
The output I am looking for is below:
Output for Case 1:
Dataset containing: 1,2,4,4,4,4
Output for Case 2:
Dataset containing: 1,2,4,4,4,4,6,7,8,9,10,11,12
Please note:
1. The dataset may contain duplicate values for the search key, and all the duplicate values are expected in the output dataset.
2. I have huge datasets (around 2M rows) from which I am trying to subset a smaller dataset as per my requirement of the search key.
New Update: Case 3
Input Data:
                 date    value size     stockName
1 2016-08-12 12:44:43 10093.40    4 HWA IS Equity
2 2016-08-12 12:44:38 10093.35    2 HWA IS Equity
3 2016-08-12 12:44:47 10088.00    2 HWA IS Equity
4 2016-08-12 12:44:52 10089.95    1 HWA IS Equity
5 2016-08-12 12:44:53 10089.95    1 HWA IS Equity
6 2016-08-12 12:44:54 10088.95    1 HWA IS Equity
Search Key is: 10089.95 in value column.
Expected Output is:
                 date    value size     stockName
1 2016-08-12 12:44:47 10088.00    2 HWA IS Equity
2 2016-08-12 12:44:54 10088.95    1 HWA IS Equity
3 2016-08-12 12:44:52 10089.95    1 HWA IS Equity
4 2016-08-12 12:44:53 10089.95    1 HWA IS Equity
You could do this, which takes care of duplicate values; in case of duplicates, the highest position is returned. Please note that A should be in non-decreasing order.
binSearch <- function(A, value, left = 1, right = length(A)) {
  if (left > right)
    return(-1)
  middle <- (left + right) %/% 2
  if (A[middle] == value) {
    # walk right over duplicates; the bounds check stops the scan
    # from running past the end of the vector
    while (middle <= length(A) && A[middle] == value)
      middle <- middle + 1
    return(middle - 1)
  } else {
    if (A[middle] > value)
      return(binSearch(A, value, left, middle - 1))
    else
      return(binSearch(A, value, middle + 1, right))
  }
}
w[1:binSearch(w,x1)]
# [1] 1 2 4 4 4 4
w[1:binSearch(w,x2)]
# [1] 1 2 4 4 4 4 6 7 8 9 10 11 12
However, as mentioned in the comments, you could simply use findInterval to achieve the same:
w[1:findInterval(x1,w)]
As you know, binary search has order log(n); as stated in ?findInterval, findInterval also benefits from this since the length of its first argument is one:
The function findInterval finds the index of one vector x in another, vec, where the latter must be non-decreasing. This is trivially equivalent to apply(outer(x, vec, ">="), 1, sum); as a matter of fact, the internal algorithm uses interval search ensuring O(n * log(N)) complexity where n <- length(x) (and N <- length(vec)). For (almost) sorted x, it will be even faster, basically O(n).
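As a quick sanity check that findInterval, like binSearch above, returns the highest position when the search key is duplicated (using x1, x2 and w from the data section below):
findInterval(x1, w)
# [1] 6
findInterval(x2, w)
# [1] 13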
EDIT
As per your edit and your new setting, you could do this (suppose your data is in df):
o <- order(df$value)
rows <- o[1:findInterval(key, df$value[o])]
df[rows,]
Or equivalently, using the proposed binSearch function:
o <- order(df$value)
rows <- o[1:binSearch(df$value[o], key)]
df[rows,]
data
x1 <- 4
x2 <- 12
w <- c(1,2,4,4,4,4,6,7,8,9,10,11,12,14,15)
key <- 10089.95
Here is a very simple solution, and you can build your function out of these commands. Of course you have to check if x is in w, but that's your part :-)
x <- 12
w <- c(1,2,4,4,4,4,6,7,8,9,10,11,12,14,15)
index <- which(x == w)
w_new <- w[1:index[length(index)]]
print(w_new)
#[1] 1 2 4 4 4 4 6 7 8 9 10 11 12
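One hedged way to add that missing check (subset_upto is an invented name; I assume a key absent from w should fall back to all values below it):
subset_upto <- function(w, x) {
  idx <- which(w == x)
  if (length(idx) == 0)
    return(w[w < x])  # key absent: keep everything below it
  w[1:max(idx)]
}
subset_upto(w, 4)
# [1] 1 2 4 4 4 4
subset_upto(w, 5)  # 5 is not in w
# [1] 1 2 4 4 4 4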

R: recursive function to give groups of consecutive numbers

Given a sorted vector x:
x <- c(1,2,4,6,7,10,11,12,15)
I am trying to write a small function that will yield a similarly sized vector y giving, for each element, the last integer of its consecutive run, in order to group consecutive numbers. In my case it is (defining groups 2, 4, 7, 12 and 15):
> y
[1] 2 2 4 7 7 12 12 12 15
I tried this recursive idea (where x is the vector, and i an index that would start at 1 in most cases): if the content at the next index is one larger than the content at the current index i, then call the function with i+1; else return the content:
fun <- function(x, i) {
  ifelse(x[i] + 1 == x[i+1],
         fun(x, i+1),
         return(x[i]))
}
However:
> sapply(x,fun,1)
[1] NA NA NA NA NA NA NA NA NA
How can I get this to work?
Your sapply call is applying fun across all values of x, when you really want it to apply across all values of i. To get sapply to do what I assume you want, you can do the following:
sapply(X = 1:length(x), FUN = fun, x = x)
[1] 2 2 4 7 7 12 12 12 NA
Although it returns NA as the last value instead of 15. This is because your function is not set up to handle the last value of a vector (there is no x[10], so the comparison yields NA). You can probably edit your function to handle this fairly easily; one possible edit is sketched below.
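One hedged version of that edit (fun2 is an invented name) guards the recursion so it stops at the end of the vector:
fun2 <- function(x, i) {
  # recurse only while a next element exists and is consecutive
  if (i < length(x) && x[i] + 1 == x[i + 1]) fun2(x, i + 1) else x[i]
}
sapply(seq_along(x), fun2, x = x)
# [1]  2  2  4  7  7 12 12 12 15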
Maybe this helps:
find_non_consec <- function(x) {
  c(x[which(as.logical(diff(x) - 1))], x[length(x)])
}
x <- c(1,2,4,6,7,10,11,12,15)
res <- find_non_consec(x)
The result is:
> res
[1] 2 4 7 12 15
This function identifies the numbers where the series ceases to be consecutive.
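To also recover the full-length vector y from the question, one sketch building on this result is to repeat each endpoint by the length of its consecutive run:
run_lengths <- diff(c(0, which(c(diff(x) != 1, TRUE))))  # sizes of the consecutive runs
rep(res, run_lengths)
# [1]  2  2  4  7  7 12 12 12 15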
