Check rows from all columns that match a specific value - R

If I have a data.table:
d <- data.table("ID" = c(1, 2, 2, 4, 6, 6),
                "TYPE" = c(1, 1, 2, 2, 3, 3),
                "CLASS" = c(1, 2, 3, 4, 5, 6))
I know I can remove rows whose value in a column is greater than a specific value like this:
r <- d[!(d$TYPE > 2), ]
However, if I want to apply this to all of the columns in the entire table instead of just TYPE (basically drop any rows that have a value > 2 anywhere in the table), how would I generalize the above statement (avoiding a for loop if possible)?
I know I can do d > 2, which results in a boolean index table, but if I put that into the above line of code it gives me an error:
d[!d>2, ]
This results in an 'invalid matrix type' error.
Note
It was brought up that this question is similar to Return an entire row if the value in any specific set of columns meets a certain criteria.
However, that question works with a data.frame while I am working with a data.table, so the notation is different. It is not a duplicate for that reason.

Using apply with any
d[!apply(d>2,1,any)]
ID TYPE CLASS
1: 1 1 1
2: 2 1 2
Or rowSums, which counts how many values in each row exceed 2 and keeps the rows where that count is zero:
d[rowSums(d>2)==0,]
ID TYPE CLASS
1: 1 1 1
2: 2 1 2

I was wondering what the fastest approach would be for a varying number of rows and columns.
So, here is a benchmark.
It excludes the ID column from the check, which is not exactly in line with the OP's question but is a sensible decision, IMHO.
library(data.table)
library(bench)
bm <- press(
  n_row = c(1E1, 1E3, 1E5),
  n_col = c(2, 10, 50),
  {
    set.seed(1L)
    d <- data.table(
      ID = seq_len(n_row),
      matrix(sample(10, n_row*n_col, TRUE), ncol = n_col)
    )
    mark(
      m1 = d[d[, !apply(.SD > 2, 1, any), .SDcols = -"ID"]],
      m2 = d[!d[, apply(.SD > 2, 1, any), .SDcols = -"ID"]],
      m3 = d[!d[, which(apply(.SD > 2, 1, any)), .SDcols = -"ID"]],
      m4 = d[d[, rowSums(.SD > 2) == 0, .SDcols = -"ID"]],
      m5 = d[!d[, Reduce(any, lapply(.SD, `>`, y = 2)), by = 1:nrow(d), .SDcols = -"ID"]$V1]
    )
  })
ggplot2::autoplot(bm)
Apparently, the rowSums() approach is almost always the fastest method.
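For reference, applied to the original d and keeping ID out of the comparison (this is the benchmark's m4), the rowSums() idiom is:
d[d[, rowSums(.SD > 2) == 0, .SDcols = -"ID"]]
#    ID TYPE CLASS
# 1:  1    1     1
# 2:  2    1     2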

Related

Generate double entry table with different data

I am using R-exams and I want to generate a double-entry table, each time with a different set of values (I provide 5 different ones: datos1, datos2, datos3, datos4, datos5), but it does not work, neither with a counter nor with sample().
How could I do it? Thanks
datos1 = c(0, 2, 4, 2, 4, 8, 13, 6, 12)
datos2 = c(11, 2, 4, 2, 4, 8, 3, 6, 12)
datos3 = c(12, 2, 14, 2, 4, 28, 3, 6, 12)
datos4 = c(13, 2, 4, 2, 4, 8, 3, 6, 12)
datos5 = c(1, 2, 4, 22, 4, 8, 3, 6, 12)
w9 <- sample(c(datos1, datos2, datos3, datos4, datos5), 1)
tabla = cbind(expand.grid(list(Y = c("3", "5", "6"),
                               X = c("6", "8", "9"))), count = w9)
ftable(xtabs(count ~ Y + X, tabla))
Thanks for the clarification, Amelia :)
c(datos1, datos2, datos3, datos4, datos5) combines them into one long vector, and sample then takes just 1 single value from it. Instead you want to put them in a list and then sample. This keeps them separate, so sample looks over the list elements and takes one at random. See below:
w9 <- sample(list(datos1, datos2, datos3, datos4, datos5), 1)
# to extract it from the list back into the same format as datos1, datos2, etc.
w9 <- unlist(w9)
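Putting it together with the table code from the question (a sketch; the Y/X levels are the ones you already use):
w9 <- unlist(sample(list(datos1, datos2, datos3, datos4, datos5), 1))
tabla <- cbind(expand.grid(list(Y = c("3", "5", "6"),
                                X = c("6", "8", "9"))), count = w9)
ftable(xtabs(count ~ Y + X, tabla))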
Edit: Covariance
Following the Example section of the Wikipedia article on covariance. I'm assuming yours is the same style as this?
# package for data exploration
# install.packages("data.table") # if not installed, run this line
library(data.table)
dt <- as.data.table(tabla)
# X & Y stored as factors, converting to numbers
dt[, X := as.integer(as.character(X))]
dt[, Y := as.integer(as.character(Y))]
# weighted means of X and Y
dt[, mux := weighted.mean(X, count)]
dt[, muy := weighted.mean(Y, count)]
# need probabilities of each combo
dt[, p := count/sum(count)]
# covariance
cv <- dt[, sum(p*(X-mux)*(Y-muy))]
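As a sanity check (a sketch; it assumes count are plain frequencies), the same number can be obtained by expanding the table to one row per observation and rescaling cov() from the sample to the population denominator:
# expand to individual observations and compare
expanded <- dt[rep(seq_len(.N), count)]
cov(expanded$X, expanded$Y) * (nrow(expanded) - 1) / nrow(expanded)  # should equal cv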

Remove all rows containing the same elements from dataframe, even though elements are in different order in R

Here is my data:
mymat <- structure(c(3, 6, 9, 9, 1, 4, 1, 5, 9, 6, 6, 4, 1, 4), .Dim = c(7L, 2L))
Some rows are duplicated, and several other rows contain the same elements although they are ordered differently. I wish to remove all rows that contain the same elements, whether these elements are in the same order (duplicated rows) or a different order. This will retain only the first row, c(3, 5).
I checked previous questions here and here. However, my requirement is that all such rows are removed rather than leaving one such row.
My question is also different from this one which removes all duplicated rows, in that I look for rows that are not just duplicated but that also contain the same set of elements ordered differently. For example, rows c(6, 9) and c(9, 6) should both be removed since they both contain the same set of elements.
I am looking for solutions that avoid a for loop, since my real data is large and a for loop may be slow.
Note: My full data has 40k rows and 2 columns.
You can sort the data row-wise and use duplicated:
tmp <- t(apply(mymat, 1, sort))
tmp[!(duplicated(tmp) | duplicated(tmp, fromLast = TRUE)), , drop = FALSE]
# [,1] [,2]
#[1,] 3 5
I added a little more data to show that the matrix format is preserved.
mymat <- structure(c(3, 6, 9, 9, 1, 4, 1, 10, 12, 13, 14, 5, 9, 6, 6, 4, 1, 4, 11, 13, 12, 15), .Dim = c(11L, 2L))
dup <- duplicated(rbind(mymat, mymat[, c(2, 1)]))
dup_fromLast <- duplicated(rbind(mymat, mymat[, c(2, 1)]), fromLast = TRUE)
mymat_duprm <- mymat[!(dup_fromLast | dup)[1:(length(dup) / 2)], ]
mymat_duprm
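For reference, mymat_duprm should contain only the rows whose pair of elements occurs exactly once (output written out by hand):
#      [,1] [,2]
# [1,]    3    5
# [2,]   10   11
# [3,]   14   15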
As a matrix:
tmp <- apply(mymat, 1, function(z) toString(sort(z)))
mymat[ave(tmp, tmp, FUN = length) == "1",, drop = FALSE]
# [,1] [,2]
# [1,] 3 5
The drop=FALSE is required only because (at least with this sample data) the filtering results in one row. While I doubt your real data (with 40k rows) would reduce to this, I recommend you keep it in there anyway ("just in case", and it's just defensive programming).
Benchmarking a couple new solutions along with a few already posted:
library(Rfast)
library(microbenchmark)
mymat <- matrix(sample(100, 4000, replace = TRUE), nrow = 2000)
noDup <- function(m) {
  return(!(duplicated(m) | duplicated(m, fromLast = TRUE)))
}
combounique1 <- function(m) {
  return(m[noDup(rowSort(m)), ])
}
combounique2 <- function(m) {
  msum <- rowsums(m)
  return(m[noDup(rowsums(m^2) + msum + (msum - 3)*abs(m[,1] - m[,2])), ])
}
combounique3 <- function(m) {
  return(m[noDup(rowsums(m + 1/m)), ])
}
combounique4 <- function(m) {
  # similar to Harrison Jones, but correct
  return(m[noDup(rbind(m, m[m[,1] != m[,2], 2:1]))[1:nrow(m)], ])
}
combounique5 <- function(m) {
  # similar to Ronak Shah, but maintains ordering within rows
  tmp <- t(apply(m, 1, sort))
  return(m[noDup(tmp), ])
}
r2evans <- function(m) {
  tmp <- apply(m, 1, function(z) toString(sort(z)))
  return(m[ave(tmp, tmp, FUN = length) == "1", , drop = FALSE])
}
microbenchmark(mymat1 <- combounique1(mymat),
mymat2 <- combounique2(mymat),
mymat3 <- combounique3(mymat),
mymat4 <- combounique4(mymat),
mymat5 <- combounique5(mymat),
mymat6 <- r2evans(mymat))
expr min lq mean median uq max neval
mymat1 <- combounique1(mymat) 7129.9 7642.30 9236.841 8205.45 9467.70 28363.7 100
mymat2 <- combounique2(mymat) 171.0 197.30 219.341 215.75 225.45 385.5 100
mymat3 <- combounique3(mymat) 144.2 166.95 187.340 182.50 192.30 306.7 100
mymat4 <- combounique4(mymat) 14263.1 15343.90 17938.061 16417.30 19043.30 34884.9 100
mymat5 <- combounique5(mymat) 48230.9 50773.75 57662.463 55041.90 60968.35 193804.2 100
mymat6 <- r2evans(mymat) 66180.3 70835.30 78642.552 77299.85 81992.60 161034.5 100
> all(sapply(list(mymat1, mymat2, mymat3, mymat4, mymat5, mymat6), FUN = identical, mymat1))
[1] TRUE
Note that combounique2 and combounique3 are only strictly correct for integer values. The idea is to use a symmetric pairing function to get a unique value for each pair of integers, then use duplicated on that.
(see https://math.stackexchange.com/questions/3162166/what-function-symmetric-and-has-unique-solution)
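A quick way to see the pairing idea in action (a small sketch, using base rowSums rather than Rfast::rowsums): x + 1/x summed over a row gives the same key regardless of the order of the pair, so duplicated() treats c(6, 9) and c(9, 6) as equal:
m <- rbind(c(6, 9), c(9, 6), c(3, 5))
key <- rowSums(m + 1/m)   # identical key for (6, 9) and (9, 6)
duplicated(key) | duplicated(key, fromLast = TRUE)
# [1]  TRUE  TRUE FALSE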
You can just use the following line of code:
mymat <- mymat[!mymat[,1] %in% mymat[,2], , drop = FALSE]
output:
mymat
#> [,1] [,2]
#> [1,] 3 5
Created on 2021-09-24 by the reprex package (v0.3.0)

Linear Programming - Unique number of categories

TL;DR: I am trying to find the "cheapest" set of items in a collection that satisfies certain linear constraints. However, every element can be part of multiple "categories", and I also want to have a mix of those unique categories. I'm not quite sure whether this can be implemented in an LP way and, if so, how to approach it.
Example - Part 1
Let's say I have 7 items that have different costs and different values associated to them.
library(tidyverse)
library(lpSolve)
# Fake data
kd = tibble(
  Item = 1:7,
  Cost = c(1, 1, 1, 1, 2, 3, 4),
  Value = c(1, 1, 3, 4, 6, 3, 2),
  Type = c("A", "A", "A", "B", "C", "D", "E")
)
I want to pick 3 of those elements so that Cost is minimized and their Value is >= 5. I can easily do this with lp with the following code:
# Objective function
knapsack.obj = kd$Cost
# Constraints
knapsack.con = matrix(
  c(
    rep(1, nrow(kd)),
    kd$Value
  ),
  nrow = 2, byrow = TRUE
)
knapsack.dir = c("==", ">=")
knapsack.rhs = c(3, 5)
# Solve
knapsackSolution = lp("min", knapsack.obj, knapsack.con, knapsack.dir, knapsack.rhs, all.bin = TRUE)
# Results
kd[knapsackSolution$solution == 1, ]
As expected, this returns Items 1, 2 and 3, which have a combined Value of 5 and obviously minimize the cost.
Example - Part 2
The extra complication I don't quite know how to solve is making sure the items picked come from at least 2 unique categories.
The solution I'd now expect is Items 1, 2 and 4 (or 1, 3 and 4), which still have a combined cost of 3 and a value of 6 (or 8), i.e. >= 5, but are not all "A" elements: they also include Item 4, which is a "B" element.
Any idea on how to implement this in an LP framework?
Mathematical Model
If we introduce a zero-one (data) matrix
Category[i,j] = 1 if item i has type j
                0 otherwise
and a binary variable:
y[j] = 1 if an item with type j is selected
       0 otherwise
we can develop a simple mathematical model:
The blue colored symbols represent data while the red ones are decision variables.
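The model itself is shown as an image in the original answer; reconstructed from the R implementation below, it reads roughly:
min   sum(i, Cost[i]*x[i])
s.t.  sum(i, x[i])                       = numItems
      sum(i, Value[i]*x[i])             >= MinValue
      y[j] <= sum(i, Category[i,j]*x[i])          for all j
      sum(j, y[j])                      >= MinItems
      x[i], y[j] binary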
Note that the variable y[j] can be relaxed to be continuous between 0 and 1.
The advantage of first writing down a mathematical model is that it is easier to reason about than a bunch of R code (at least for me).
Implementation
I use OMPR here for two reasons:
1. It is a direct way to implement the model in an equation-based fashion, so we stay closer to the mathematical model.
2. It gives access to better solvers than lpSolve.
Here is the R code:
library(tidyverse)
library(ROI)
library(ROI.plugin.symphony)
library(ompr)
library(ompr.roi)
# Fake data
kd = tibble(
  Item = 1:7,
  Cost = c(1, 1, 1, 1, 2, 3, 4),
  Value = c(1, 1, 3, 4, 6, 3, 2),
  Type = c("A", "A", "A", "B", "C", "D", "E")
)
Types <- c("A","B","C","D","E")
Category <- 1*outer(kd$Type,Types,FUN="==")
Type <- 1:length(Types)
numItems <- 3
MinValue <- 5
MinItems <- 2
m <- MIPModel() %>%
  add_variable(x[i], i=kd$Item, type="binary") %>%
  add_variable(y[j], j=Type, type="binary") %>%
  add_constraint(sum_expr(x[i], i=kd$Item) == numItems) %>%
  add_constraint(sum_expr(kd$Value[i]*x[i], i=kd$Item) >= MinValue) %>%
  add_constraint(y[j] <= sum_expr(Category[i,j]*x[i], i=kd$Item), j=Type) %>%
  add_constraint(sum_expr(y[j], j=Type) >= MinItems) %>%
  set_objective(sum_expr(kd$Cost[i]*x[i], i=kd$Item), "min") %>%
  solve_model(with_ROI(solver = "symphony", verbosity=1))
cat("Status:",solver_status(m),"\n")
cat("Objective:",objective_value(m),"\n")
m$solution
Probably the most complex part here is the calculation of the Category matrix.
Solution
The solution looks like:
Status: optimal
Objective: 3
x[1] x[2] x[3] x[4] x[5] x[6] x[7] y[1] y[2] y[3] y[4] y[5]
1 1 0 1 0 0 0 1 1 0 0 0
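To read off the selected items with ompr's accessor (a sketch; filter() comes from dplyr, which the tidyverse call above already attaches):
get_solution(m, x[i]) %>% filter(value > 0.5)
# the rows for Items 1, 2 and 4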
Actually, we do not have to force the solution to have k-1 or fewer elements from each group. We can instead force each group to have at most g_i-1 elements where g_i is the number of elements in each group.
Here is the implementation:
library(purrr)
library(tibble)
library(lpSolve)
library(fastmatch)
# Fake data
kd = tibble(
  Item = 1:7,
  Cost = c(1, 1, 1, 1, 2, 3, 4),
  Value = c(1, 1, 3, 4, 6, 3, 2),
  Type = c("A", "A", "A", "B", "C", "D", "E")
)
# number of elements to choose
k = 3
type_match <- fmatch(kd$Type, unique(kd$Type))
unique_cat <- unique(type_match)
add_con <- map(unique_cat, function(x) {
  type_match[type_match != x] = 0
  type_match[type_match > 0] = 1
  return(type_match)}) %>%
  do.call(rbind, .)
knapsack.obj = kd$Cost
knapsack.con =
  rbind(
    rep(1, nrow(kd)),
    kd$Value,
    add_con
  )
rhs_add <- apply(add_con, 1, function(x) ifelse(sum(x) > 1, sum(x) - 1, 1))
knapsack.dir = c("==", ">=", rep("<=", length(rhs_add)))
knapsack.rhs = c(k, 5, rhs_add)
knapsackSolution = lp("min",
                      knapsack.obj,
                      knapsack.con,
                      knapsack.dir,
                      knapsack.rhs,
                      all.bin = TRUE)
knapsackSolution$solution
# [1] 1 1 0 1 0 0 0
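To map the solution vector back to the items (same indexing as in the question's first example):
kd[knapsackSolution$solution == 1, ]
# Items 1, 2 and 4: combined cost 3, value 6, types A and B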
Since we know that the solution must have k = 3 elements, we can restrict each group to have k-1 or fewer elements, which forces at least 2 groups to be used.
incid <- +outer(unique(kd$Type), kd$Type, "==")
ntypes <- nrow(incid)
knapsack.con = rbind(
rep(1, nrow(kd)),
kd$Value,
incid)
k <- 3
knapsack.dir = c("==", ">=", rep("<=", ntypes))
knapsack.rhs = c(k, 5, rep(k-1, ntypes))
res <- lp("min", knapsack.obj, knapsack.con, knapsack.dir, knapsack.rhs, all.bin = TRUE)
res$status
## [1] 0
res$solution
## [1] 1 1 0 1 0 0 0
Simplification
As we discussed in the comments, for this particular data we can omit the last 4 constraints since they are always satisfied, as there is only one element in each of the last 4 groups.
res2 <- lp("min", knapsack.obj, knapsack.con[1:3, ], knapsack.dir[1:3],
           knapsack.rhs[1:3], all.bin = TRUE)
res2$status
## [1] 0
res2$solution
## [1] 1 1 0 1 0 0 0
Generalizing
As we discussed in the comments, to generalize, let us suppose we want at least 3 different categories in the solution rather than 2. For this particular data we could simply require the solution to have no more than 1 item of each category, but in general that will not work, so let us instead take all combinations of the groups 2 at a time and produce the constraints shown below. Here 5 is the total number of categories in the problem and 2 is one less than the number of categories required to be in the solution.
combos <- combn(5, 2, function(x) colSums(incid[x, ]))
For each of those constraints, i.e. each column of combos (each becomes a row of the constraint matrix), we require that it be less than or equal to 2 in order to exclude any solution having only 1 or 2 categories. We then construct the LP in a similar manner as before, adding the remaining constraints, as sketched below.
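A sketch of that construction, reusing kd, incid and knapsack.obj from above (the combn columns become rows of the constraint matrix):
k <- 3
combos <- combn(5, 2, function(x) colSums(incid[x, ]))  # one constraint per pair of categories
knapsack.con <- rbind(rep(1, nrow(kd)), kd$Value, t(combos))
knapsack.dir <- c("==", ">=", rep("<=", ncol(combos)))
knapsack.rhs <- c(k, 5, rep(2, ncol(combos)))
res3 <- lp("min", knapsack.obj, knapsack.con, knapsack.dir, knapsack.rhs, all.bin = TRUE)
res3$solution
# with at least 3 categories required, the cheapest choice now costs 4
# (e.g. one of the "A" items plus Items 4 and 5)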

Is there an R function for returning sorted indexes of any values of a vector?

I'm not fluent in R data.table, and any help to resolve the following problem will be greatly appreciated!
I have a big data.table (~1,000,000 rows) with columns of numeric values, and I want to output a data.table of the same dimensions containing, for each row, the sorted index position of each value.
A short example.
Input:
dt = data.frame(ack = 1:7)
dt$A1 = c( 1, 6, 9, 10, 3, 5, NA)
dt$A2 = c( 25, 12, 30, 10, 50, 1, 30)
dt$A3 = c( 100, 63, 91, 110, 1, 4, 10)
dt$A4 = c( 51, 65, 2, 1, 0, 200, 1)
First row: 1 (1st) <= 25 (2nd) <= 51 (3rd) <= 100 (4th),
so the sorted index positions for (1, 25, 100, 51) are (1, 2, 4, 3) and the output should start with:
dt$PosA1 = c(1, ...
dt$PosA2 = c(2, ...
dt$PosA3 = c(4, ...
dt$PosA4 = c(3, ...
Third row: 2 (1st) <= 9 (2nd) <= 30 (3rd) <= 91 (4th), which must output:
dt$PosA1 = c( 1,1,2,...)
dt$PosA2 = c( 2,2,3,...)
dt$PosA3 = c( 4,3,4,...)
dt$PosA4 = c( 3,4,1,...)
The full output is a data.table with the same dimensions as the input, filled row-wise with the sorted index positions:
dt$PosA1 = c( 1, 1, 2, 2, 3, 1, NA)
dt$PosA2 = c( 2, 2, 3, 3, 4, 2, 3)
dt$PosA3 = c( 4, 3, 4, 4, 2, 2, 2)
dt$PosA4 = c( 3, 4, 1, 1, 1, 4, 1)
I am thinking of perhaps something like this:
library(data.table)
setDT(dt)
# pseudocode
dt[, PosA1 := rowPosition(.SD, 1, na.rm=T),
PosA2 := rowPosition(.SD, 2, na.rm=T),
PosA3 := rowPosition(.SD, 3, na.rm=T),
PosA4 := rowPosition(.SD, 4, na.rm=T),
.SDcols=c(A1, A2, A3, A4)]
I'm not sure of the syntax, and I am missing a rowPosition function (that is what I named it here). Does such a function exist?
A little help to code an efficient one, or another approach to solve the problem, would be great!
Regards.
You can convert to long form and use rank. Or, since you're using data.table, frank:
library(data.table)
setDT(dt)
melt(dt, id="ack")[, f := frank(value, na.last="keep", ties.method="dense"), by=ack][,
dcast(.SD, ack ~ variable, value.var="f")]
ack A1 A2 A3 A4
1: 1 1 2 4 3
2: 2 1 2 3 4
3: 3 2 3 4 1
4: 4 2 2 3 1
5: 5 3 4 2 1
6: 6 3 1 2 4
7: 7 NA 3 2 1
melt switches to long form, while dcast converts back to wide form.
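If you want the ranks back in dt as PosA1..PosA4 columns, as in the question, one way (a sketch building on the answer above) is to rename the cast result and join it back on ack:
ranks <- melt(dt, id="ack")[, f := frank(value, na.last="keep", ties.method="dense"), by=ack][,
  dcast(.SD, ack ~ variable, value.var="f")]
setnames(ranks, paste0("A", 1:4), paste0("PosA", 1:4))
dt <- dt[ranks, on = "ack"]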
Since you are looking for speed, you might want to consider using Rcpp. An Rcpp rank that takes care of NAs and ties can be found in nrussell's adapted version of René Richter's code.
nr <- 811e3
nc <- 16
DT <- as.data.table(matrix(sample(c(1:200, NA), nr*nc, replace=TRUE), nrow=nr))[,
ack := .I]
#assuming that you have saved nrussell code in avg_rank.cpp
library(Rcpp)
system.time(sourceCpp("rcpp/avg_rank.cpp"))
# user system elapsed
# 0.00 0.13 6.21
nruss_rcpp <- function() {
DT[, as.list(avg_rank(unlist(.SD))), by=ack]
}
data.table.frank <- function() {
melt(DT, id="ack")[, f := frank(value, na.last="keep", ties.method="dense"), by=ack][,
dcast(.SD, ack ~ variable, value.var="f")]
}
library(microbenchmark)
microbenchmark(nruss_rcpp(), data.table.frank(), times=3L)
timings:
Unit: seconds
expr min lq mean median uq max neval cld
nruss_rcpp() 10.33032 10.33251 10.3697 10.3347 10.38939 10.44408 3 a
data.table.frank() 610.44869 612.82685 613.9362 615.2050 615.68001 616.15501 3 b
Edit: addressing comments
1) Set the column names for the rank columns while updating by reference:
DT[, (paste0("Rank", 1L:nc)) := as.list(avg_rank(unlist(.SD))), by=ack]
2) Keeping NAs as they are
Option A) change the values to NA in R after getting the output from avg_rank:
for (j in 1:nc) {
DT[is.na(get(paste0("V", j))), (paste0("Rank", j)) := NA_real_]
}
Option B) amend the avg_rank code in Rcpp as follows:
Rcpp::NumericVector avg_rank(Rcpp::NumericVector x)
{
    R_xlen_t sz = x.size();
    Rcpp::IntegerVector w = Rcpp::seq(0, sz - 1);
    std::sort(w.begin(), w.end(), Comparator(x));
    Rcpp::NumericVector r = Rcpp::no_init_vector(sz);
    for (R_xlen_t n, i = 0; i < sz; i += n) {
        n = 1;
        while (i + n < sz && x[w[i]] == x[w[i + n]]) ++n;
        for (R_xlen_t k = 0; k < n; k++) {
            if (Rcpp::traits::is_na<REALSXP>(x[w[i + k]])) {  // additional code
                r[w[i + k]] = NA_REAL;                        // additional code
            } else {
                r[w[i + k]] = i + (n + 1) / 2.;
            }
        }
    }
    return r;
}

Create the frequency count from a vector in R [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 5 years ago.
Suppose there is a vector of numerical values, possibly with duplicates:
x <- c(1, 2, 3, 4, 5, 1, 2, 2, 3)
I want to create another vector of counts as follows.
It has the same length as x.
For each unique value in x, the first appearance is 1, the second appearance is 2, and so on.
The new vector I want is
1, 1, 1, 1, 1, 2, 2, 3, 2
I need a fast way of doing this since x can be really long.
Use ave and seq_along:
> x <- c(1, 2, 3, 4, 5, 1, 2, 2, 3)
> ave(x, x, FUN = seq_along)
[1] 1 1 1 1 1 2 2 3 2
Another option to consider is data.table. Although it is a little bit more work, it might pay off on very long vectors.
Here it is on your example--definitely seems like overkill!
library(data.table)
x <- c(1, 2, 3, 4, 5, 1, 2, 2, 3)
DT <- data.table(id = sequence(length(x)), x, key = "id")
DT[, y := sequence(.N), by = x][, y]
# [1] 1 1 1 1 1 2 2 3 2
But how about on a vector 10,000,000 items long?
set.seed(1)
x2 <- sample(100, 1e7, replace = TRUE)
funAve <- function() {
  ave(x2, x2, FUN = seq_along)
}
funDT <- function() {
  DT <- data.table(id = sequence(length(x2)), x2, key = "id")
  DT[, y := sequence(.N), by = x2][, y]
}
identical(funAve(), funDT())
# [1] TRUE
library(microbenchmark)
microbenchmark(funAve(), funDT(), times = 20)
# Unit: seconds
# expr min lq median uq max neval
# funAve() 6.727557 6.792743 6.827117 6.992609 7.352666 20
# funDT() 1.967795 2.029697 2.053886 2.070462 2.123531 20
