Create the frequency count from a vector in R [duplicate]

This question already has answers here: Numbering rows within groups in a data frame (10 answers). Closed 5 years ago.
Suppose there is a numeric vector that may contain duplicated values:
x <- c(1, 2, 3, 4, 5, 1, 2, 2, 3)
I want to create another vector of counts as follows.
It has the same length as x.
For each unique value in x, the first appearance is 1, the second appearance is 2, and so on.
The new vector I want is
1, 1, 1, 1, 1, 2, 2, 3, 2
I need a fast way of doing this since x can be really long.

Use ave and seq_along:
> x <- c(1, 2, 3, 4, 5, 1, 2, 2, 3)
> ave(x, x, FUN = seq_along)
[1] 1 1 1 1 1 2 2 3 2
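For intuition: ave() splits the vector by the grouping variable (here, x itself), applies FUN within each group, and puts the results back in their original positions. A minimal base-R sketch of that mechanism (purely illustrative; ave() does essentially this internally):
out <- numeric(length(x))
split(out, x) <- lapply(split(x, x), seq_along)  # per-group running counts
out
# [1] 1 1 1 1 1 2 2 3 2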
Another option to consider is data.table. Although it is a little bit more work, it might pay off on very long vectors.
Here it is on your example; at this size it's definitely overkill!
library(data.table)
x <- c(1, 2, 3, 4, 5, 1, 2, 2, 3)
DT <- data.table(id = sequence(length(x)), x, key = "id")
DT[, y := sequence(.N), by = x][, y]
# [1] 1 1 1 1 1 2 2 3 2
But how about on a vector 10,000,000 items long?
set.seed(1)
x2 <- sample(100, 1e7, replace = TRUE)
funAve <- function() {
  ave(x2, x2, FUN = seq_along)
}
funDT <- function() {
  DT <- data.table(id = sequence(length(x2)), x2, key = "id")
  DT[, y := sequence(.N), by = x2][, y]
}
identical(funAve(), funDT())
# [1] TRUE
library(microbenchmark)
microbenchmark(funAve(), funDT(), times = 20)
# Unit: seconds
# expr min lq median uq max neval
# funAve() 6.727557 6.792743 6.827117 6.992609 7.352666 20
# funDT() 1.967795 2.029697 2.053886 2.070462 2.123531 20


Is there an R function for returning sorted indexes of any values of a vector?

I'm not fluent in R's data.table, and any help resolving the following problem would be greatly appreciated!
I have a big data.table (~1,000,000 rows) with columns of numeric values, and I want to output a data.table of the same dimensions containing the sorted index positions of the values in each row.
A short example.
Input:
dt = data.frame(ack = 1:7)
dt$A1 = c( 1, 6, 9, 10, 3, 5, NA)
dt$A2 = c( 25, 12, 30, 10, 50, 1, 30)
dt$A3 = c( 100, 63, 91, 110, 1, 4, 10)
dt$A4 = c( 51, 65, 2, 1, 0, 200, 1)
First row: 1 (1) <= 25 (2) <= 51 (3) <= 100 (4).
The row-wise sorted index positions for (1, 25, 100, 51) are (1, 2, 4, 3), so the output should start:
dt$PosA1 = c(1, ...
dt$PosA2 = c(2, ...
dt$PosA3 = c(4, ...
dt$PosA4 = c(3, ...
Third row: 2 (1) <= 9 (2) <= 30 (3) <= 91 (4), which must output:
dt$PosA1 = c( 1,1,2,...)
dt$PosA2 = c( 2,2,3,...)
dt$PosA3 = c( 4,3,4,...)
dt$PosA4 = c( 3,4,1,...)
The output is a data.table with the same dimensions as the input, filled with the row-wise sorted index positions.
dt$PosA1 = c( 1, 1, 2, 2, 3, 1, NA)
dt$PosA2 = c( 2, 2, 3, 3, 4, 2, 3)
dt$PosA3 = c( 4, 3, 4, 4, 2, 2, 2)
dt$PosA4 = c( 3, 4, 1, 1, 1, 4, 1)
I'm thinking of perhaps something like this:
library(data.table)
setDT(dt)
# pseudocode
dt[, `:=`(PosA1 = rowPosition(.SD, 1, na.rm = TRUE),
          PosA2 = rowPosition(.SD, 2, na.rm = TRUE),
          PosA3 = rowPosition(.SD, 3, na.rm = TRUE),
          PosA4 = rowPosition(.SD, 4, na.rm = TRUE)),
   .SDcols = c("A1", "A2", "A3", "A4")]
I'm not sure of the syntax, and I'm missing a rowPosition function. Does such a function exist? (I named it rowPosition here.)
A little help coding an efficient one, or another approach to the problem, would be great!
Regards.
You can convert to long form and use rank. Or, since you're using data.table, frank:
library(data.table)
setDT(dt)
melt(dt, id="ack")[, f := frank(value, na.last="keep", ties.method="dense"), by=ack][,
dcast(.SD, ack ~ variable, value.var="f")]
ack A1 A2 A3 A4
1: 1 1 2 4 3
2: 2 1 2 3 4
3: 3 2 3 4 1
4: 4 2 2 3 1
5: 5 3 4 2 1
6: 6 3 1 2 4
7: 7 NA 3 2 1
melt switches to long form, while dcast converts back to wide form.
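For intuition, here is the intermediate long form for the first row of the example data (a sketch; f holds the within-row dense rank):
long <- melt(dt, id = "ack")  # one row per (ack, column) pair
long[, f := frank(value, na.last = "keep", ties.method = "dense"), by = ack]
long[ack == 1]
#    ack variable value f
# 1:   1       A1     1 1
# 2:   1       A2    25 2
# 3:   1       A3   100 4
# 4:   1       A4    51 3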
Since you are looking for speed, you might want to consider using Rcpp. An Rcpp rank that takes care of NAs and ties can be found in nrussell's adapted version of René Richter's code.
nr <- 811e3
nc <- 16
DT <- as.data.table(matrix(sample(c(1:200, NA), nr*nc, replace=TRUE), nrow=nr))[,
ack := .I]
# assuming that you have saved nrussell's code in avg_rank.cpp
library(Rcpp)
system.time(sourceCpp("rcpp/avg_rank.cpp"))
# user system elapsed
# 0.00 0.13 6.21
nruss_rcpp <- function() {
  DT[, as.list(avg_rank(unlist(.SD))), by = ack]
}
data.table.frank <- function() {
  melt(DT, id = "ack")[, f := frank(value, na.last = "keep", ties.method = "dense"), by = ack][,
    dcast(.SD, ack ~ variable, value.var = "f")]
}
library(microbenchmark)
microbenchmark(nruss_rcpp(), data.table.frank(), times=3L)
Timings:
Unit: seconds
expr min lq mean median uq max neval cld
nruss_rcpp() 10.33032 10.33251 10.3697 10.3347 10.38939 10.44408 3 a
data.table.frank() 610.44869 612.82685 613.9362 615.2050 615.68001 616.15501 3 b
Edit: addressing comments.
1) Set column names for the rank columns by updating by reference:
DT[, (paste0("Rank", 1L:nc)) := as.list(avg_rank(unlist(.SD))), by=ack]
2) Keeping NAs as they are.
Option A: change to NA in R after getting the output from avg_rank:
for (j in 1:nc) {
  DT[is.na(get(paste0("V", j))), (paste0("Rank", j)) := NA_real_]
}
Option B: amend the avg_rank code in Rcpp as follows:
Rcpp::NumericVector avg_rank(Rcpp::NumericVector x)
{
    R_xlen_t sz = x.size();
    Rcpp::IntegerVector w = Rcpp::seq(0, sz - 1);
    std::sort(w.begin(), w.end(), Comparator(x));
    Rcpp::NumericVector r = Rcpp::no_init_vector(sz);
    for (R_xlen_t n, i = 0; i < sz; i += n) {
        n = 1;
        while (i + n < sz && x[w[i]] == x[w[i + n]]) ++n;
        for (R_xlen_t k = 0; k < n; k++) {
            if (Rcpp::traits::is_na<REALSXP>(x[w[i + k]])) {  // additional code
                r[w[i + k]] = NA_REAL;                        // additional code
            } else {
                r[w[i + k]] = i + (n + 1) / 2.;
            }
        }
    }
    return r;
}

Make a variable based on a cumulative sum with reset based on condition

I want a variable such as desired_output, based on a cumulative sum over cumsumover, where the cumsum function resets every time it reaches the next number in thresh.
cumsumover <- c(1, 2, 7, 4, 2, 5)
thresh <- c(3, 7, 11)
desired_output <- c(3, 3, 7, 11, 11, 11) # same length as cumsumover
This question is similar, but I can't wrap my head around the code.
dplyr / R cumulative sum with reset
Compared to similar questions, my condition is specified in a vector with a different length than cumsumover.
Any help would be greatly appreciated. Bonus if both a base R and a tidyverse approach are provided.
In base R, we can use cut with breaks based on thresh and labels as letters of the same length as thresh.
cut(cumsum(cumsumover), breaks = c(0, thresh[-1], max(cumsum(cumsumover))),
    labels = letters[seq_along(thresh)])
#[1] a a b c c c
The last element of thresh is replaced with max(cumsum(cumsumover)) in the breaks so that anything beyond the last element of thresh is assigned the last label.
If we want the labels to be thresh instead of letters:
cut(cumsum(cumsumover), breaks = c(0, thresh[-1], max(cumsum(cumsumover))), labels = thresh)
#[1] 3 3 7 11 11 11
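Note that cut() returns a factor; if a numeric result is needed, convert via the labels (a quick sketch):
out <- cut(cumsum(cumsumover), breaks = c(0, thresh[-1], max(cumsum(cumsumover))),
           labels = thresh)
as.numeric(as.character(out))  # factor -> character -> numeric
# [1]  3  3  7 11 11 11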
Here is another solution:
Data:
cumsumover <- c(1, 2, 7, 4, 2, 5)
thresh <- c(3, 7, 11)
Code:
outp <- letters[1:3]  # to make the solution more general
cumsumover_copy <- cumsumover  # <<- inside sapply modifies this, so work on a copy to stay safe
unlist(
  sapply(seq_along(thresh), function(x) {
    cs_over <- cumsum(cumsumover_copy)
    ntimes <- sum(cs_over <= thresh[x])
    cumsumover_copy <<- cumsumover_copy[-(1:ntimes)]
    return(rep(outp[x], ntimes))
  })
)
Result:
#[1] "a" "a" "b" "c" "c" "c"
Using .bincode you can do this:
thresh[.bincode(cumsum(cumsumover), c(-Inf,thresh[-1],Inf))]
[1] 3 3 7 11 11 11
.bincode is used internally by cut, which adds labels and sanity checks on top, so calling .bincode directly is more efficient:
x <- rep(cumsum(cumsumover), 10000)
microbenchmark::microbenchmark(
  bincode = thresh[.bincode(x, c(-Inf, thresh[-1], Inf))],
  cut = cut(x, breaks = c(-Inf, thresh[-1], Inf), labels = thresh))
# Unit: microseconds
# expr min lq mean median uq max neval
# bincode 450.2 459.75 654.794 482.10 642.20 5028.4 100
# cut 1739.3 1864.90 2622.593 2215.15 2713.25 12194.8 100

Check rows from all columns that match a specific value

If I have a data.table:
d <- data.table("ID" = c(1, 2, 2, 4, 6, 6),
"TYPE" = c(1, 1, 2, 2, 3, 3),
"CLASS" = c(1, 2, 3, 4, 5, 6))
I know I can remove values greater than a specific value like this:
r <- d[!(d$TYPE > 2), ]
However, if I want to apply this to all of the columns in the entire table instead of just TYPE (basically drop any rows that have a value > 2 anywhere), how would I generalize the above statement (avoiding a for loop if possible)?
I know I can do d > 2, which yields a logical matrix, but if I put that into the above line of code it gives me an error:
d[!d > 2, ]
This results in an "invalid matrix type" error.
Note
It was brought up that this question is similar to Return an entire row if the value in any specific set of columns meets a certain criteria.
However, that question works with a data.frame while I am working with a data.table, and the notation differs, so it is not a duplicate.
Using apply with any:
d[!apply(d>2,1,any)]
ID TYPE CLASS
1: 1 1 1
2: 2 1 2
Or rowSums:
d[rowSums(d>2)==0,]
ID TYPE CLASS
1: 1 1 1
2: 2 1 2
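This works because d > 2 yields a logical matrix and rowSums() counts the TRUEs per row, so rows summing to zero have no value above 2. The intermediate matrix for the example (shown purely for illustration):
d > 2
#         ID  TYPE CLASS
# [1,] FALSE FALSE FALSE
# [2,] FALSE FALSE FALSE
# [3,] FALSE FALSE  TRUE
# [4,]  TRUE FALSE  TRUE
# [5,]  TRUE  TRUE  TRUE
# [6,]  TRUE  TRUE  TRUE
rowSums(d > 2)
# [1] 0 0 1 2 3 3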
I was wondering what the fastest approach would be for a varying number of rows and columns.
So, here is a benchmark.
It excludes the ID column from the check, which is not exactly in line with the OP's question but is a sensible decision, IMHO.
library(data.table)
library(bench)
bm <- press(
  n_row = c(1E1, 1E3, 1E5),
  n_col = c(2, 10, 50),
  {
    set.seed(1L)
    d <- data.table(
      ID = seq_len(n_row),
      matrix(sample(10, n_row * n_col, TRUE), ncol = n_col)
    )
    mark(
      m1 = d[d[, !apply(.SD > 2, 1, any), .SDcols = -"ID"]],
      m2 = d[!d[, apply(.SD > 2, 1, any), .SDcols = -"ID"]],
      m3 = d[!d[, which(apply(.SD > 2, 1, any)), .SDcols = -"ID"]],
      m4 = d[d[, rowSums(.SD > 2) == 0, .SDcols = -"ID"]],
      m5 = d[!d[, Reduce(any, lapply(.SD, `>`, y = 2)), by = 1:nrow(d), .SDcols = -"ID"]$V1]
    )
  })
ggplot2::autoplot(bm)
Apparently, the rowSums() approach is almost always the fastest method.

How can I reverse numbers in a vector ONLY if they are sequential?

So, if the input is a vector such as
v <- c(1, 2, 3, 7, 2, 8, 9)
The output would be
c(3, 2, 1, 7, 2, 9, 8)
I have tried nested for loops and if statements with an is.sorted-style condition, but have not had any success.
With the sample data
x <- c(1, 2, 3, 7, 2, 8, 9)
You can partition into "sequential groups" with
grp <- cumsum(!c(TRUE, diff(x)==1))
grp
# [1] 0 0 0 1 2 3 3
Basically we look at the diff from the previous element and track changes anytime that value isn't equal to 1.
You can use this group information to re-order those values with ave (this does a within-group transformation).
revsort <- function(...) sort(..., decreasing = TRUE)
ave(x, grp, FUN=revsort)
# [1] 3 2 1 7 2 9 8
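Since each group is an ascending run, rev() within groups gives the same result (a sketch):
ave(x, grp, FUN = rev)
# [1] 3 2 1 7 2 9 8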
We could also do:
x <- c(0, which(diff(vec) != 1), length(vec))
unlist(sapply(seq(length(x) - 1), function(i) rev(vec[(x[i]+1):x[i+1]])))
#[1] 3 2 1 7 2 9 8
The idea is to first cut the vector based on the positions of consecutive numbers. We then traverse these cuts and apply rev.
Data
vec <- c(1, 2, 3, 7, 2, 8, 9)
Benchmarking
library(microbenchmark)
vec1 <- c(1, 2, 3, 7, 2, 8, 9)
vec2 <- c(1:1000, sample.int(100, 10), 5:500, sample.int(100, 10), 100:125)
f_MrFlick <- function(x) {
  revsort <- function(...) sort(..., decreasing = TRUE)
  grp <- cumsum(!c(TRUE, diff(x) == 1))
  ave(x, grp, FUN = revsort)
}
f_989 <- function(vec) {
  x <- c(0, which(diff(vec) != 1), length(vec))
  unlist(sapply(seq(length(x) - 1), function(i) rev(vec[(x[i] + 1):x[i + 1]])))
}
all(f_MrFlick(vec1)==f_989(vec1))
# [1] TRUE
all(f_MrFlick(vec2)==f_989(vec2))
# [1] TRUE
length(vec1)
# [1] 7
microbenchmark(f_MrFlick(vec1), f_989(vec1))
# Unit: microseconds
# expr min lq mean median uq max neval
# f_MrFlick(vec1) 287.006 297.0585 305.6340 302.833 312.6695 479.912 100
# f_989(vec1) 113.348 119.7645 124.7901 121.903 127.0360 268.186 100
# --------------------------------------------------------------------------
length(vec2)
# [1] 1542
microbenchmark(f_MrFlick(vec2), f_989(vec2))
# Unit: microseconds
# expr min lq mean median uq max neval
# f_MrFlick(vec2) 1532.553 1565.2745 1725.7540 1627.937 1698.084 3914.149 100
# f_989(vec2) 426.874 454.6765 546.9115 466.439 489.322 2634.383 100

Finding the index of first changes in the elements of a vector

I have a vector v and I would like to find the indices of the first changes in its elements in R. How can I do this? For example:
v = c(1, 1, 1, 1, 1, 1, 1, 1.5, 1.5, 2, 2, 2, 2, 2)
rle is a good idea, but if you only want the indices of the changepoints you can just do:
c(1,1+which(diff(v)!=0))
## 1 8 10
You're looking for rle:
rle(v)
## Run Length Encoding
## lengths: int [1:3] 7 2 5
## values : num [1:3] 1 1.5 2
This says that the value changes at locations 7+1 and 7+2+1 (and 7+2+5+1 would be the index of the element "one off the end").
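To turn those run lengths into the change indices directly (essentially what the first.changes() answer below does), a quick sketch:
r <- rle(v)
head(cumsum(r$lengths) + 1, -1)  # drop the "one off the end" index
# [1]  8 10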
The data.table package internally (meaning: not yet exported) uses a function uniqlist (in dev 1.8.11), or alternatively duplist (in 1.8.10, currently on CRAN), that does exactly what you're after. It should be quite fast. Here's a benchmark:
require(data.table)
set.seed(45)
# prepare a huge vector (sorted)
x <- sort(as.numeric(sample(1e5, 1e7, TRUE)))
require(microbenchmark)
ben <- function(v) c(1,1+which(diff(v)!=0))
matthew <- function(v) rle(v)
matteo <- function(v) firstDiff(v)
exegetic <- function(v) first.changes(v)
# if you use 1.8.10, replace uniqlist with duplist
dt <- function(v) data.table:::uniqlist(list(v))
microbenchmark( ans1 <- ben(x), matthew(x), matteo(x),
exegetic(x), ans2 <- dt(x), times=10)
# Unit: milliseconds
# expr min lq median uq max neval
# ans1 <- ben(x) 1181.808 1229.8197 1313.2646 1357.5026 1553.9296 10
# matthew(x) 1456.918 1496.0300 1581.0062 1660.4067 2148.1691 10
# matteo(x) 28609.890 29305.1117 30698.5843 32706.3147 34290.9864 10
# exegetic(x) 1365.243 1546.0652 1576.8699 1659.5488 1886.6058 10
# ans2 <- dt(x) 113.712 114.7794 143.9938 180.3743 221.8386 10
identical(as.integer(ans1), ans2) # [1] TRUE
I'm not that familiar with Rcpp, but it seems like the solution could be improved quite a bit.
Edit: Refer to Matteo's updated answer for Rcpp timings.
> v <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 5, 5, 6, 6, 6, 6)
first.changes <- function(d) {
  p <- cumsum(rle(d)$lengths) + 1
  p[-length(p)]
}
> first.changes(v)
[1] 4 6 11 12 14
Or, with your data,
> v = c(1, 1, 1, 1, 1, 1, 1, 1.5, 1.5, 2, 2, 2, 2, 2)
> first.changes(v)
[1] 8 10
If you need the operation to be fast you can use the Rcpp package to call C++ from R:
library(Rcpp)
library(data.table)
library(microbenchmark)
# Rcpp solution
cppFunction('
NumericVector firstDiff(NumericVector & vett)
{
    const int N = vett.size();
    std::list<double> changes;
    changes.push_back(1.0);
    NumericVector::iterator iterH = vett.begin() + 1;
    NumericVector::iterator iterB = vett.begin();
    int count = 2;
    for(iterH = vett.begin() + 1; iterH != vett.end(); iterH++, iterB++)
    {
        if(*iterH != *iterB) changes.push_back(count);
        count++;
    }
    return wrap(changes);
}
')
# Data table
dt <- function(input) data.table:::uniqlist(list(input))
# Comparison
set.seed(543)
x <- sort(as.numeric(sample(1e5, 1e7, TRUE)))
microbenchmark(ans1 <- firstDiff(x), which(diff(x) != 0)[1], rle(x),
ans2 <- dt(x), times = 10)
Unit: milliseconds
expr min lq median uq max neval
ans1 <- firstDiff(x) 50.10679 50.12327 50.14164 50.16343 50.28475 10
which(diff(x) != 0)[1] 545.66478 547.58388 556.15550 561.78275 617.40281 10
rle(x) 664.53262 687.04316 709.84949 714.91528 721.37204 10
dt(x) 60.60317 82.30181 82.70207 86.13330 94.07739 10
identical(as.integer(ans1), ans2)
#[1] TRUE
Rcpp is slightly faster than data.table and much faster than the other alternatives in this example.
