GUI to subset a data frame in R

I am building a custom GUI in R for work, and I need a component that can select a subset of a data frame based on variable values (e.g. select all females above 50). I am building the GUI with gWidgets, but I am stuck on how this filter can be implemented, specifically how to create a widget that lets the user select one or more filters and then returns the filtered data frame.
Here is a small sample from the data I am working with:
structure(list(kunde = c(3, 3, 3, 3, 3, 3, 3, 1, 3, 3),
               bank = c(7, 98, 3, 3, 98, 2, 2, 1, 7, 2)),
          .Names = c("kunde", "bank"), row.names = c(NA, 10L),
          class = "data.frame")
Any help is greatly appreciated!!

There are some examples of similar things in the ProgGUIinR package. Here is one of them:
library(gWidgets)
options(guiToolkit="RGtk2")

options(repos="http://streaming.stat.iastate.edu/CRAN")
d <- available.packages()   # pick a CRAN site

w <- gwindow("test of filter")
g <- ggroup(cont=w, horizontal=FALSE)
ed <- gedit("", cont=g)
tbl <- gtable(d, cont=g, filter.FUN="manual", expand=TRUE)

ourMatch <- function(curVal, vals) {
  grepl(curVal, vals)
}

id <- addHandlerKeystroke(ed, handler=function(h, ...) {
  vals <- tbl[, 1, drop=TRUE]
  curVal <- svalue(h$obj)
  vis <- ourMatch(curVal, vals)
  visible(tbl) <- vis
})
For your purpose, you might want to use gcheckboxgroup or gcombobox to select factor levels (or a single level) and filter by that. The key is that the visible<- method of the gtable object is used to filter the displayed items.
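For instance, here is a minimal sketch of that idea, assuming the same RGtk2 toolkit and a small hypothetical data frame df; the value chosen in the gcombobox drives visible<-:

df <- data.frame(kunde = c(3, 3, 1), bank = c(7, 98, 1))
w <- gwindow("filter by level")
g <- ggroup(cont=w, horizontal=FALSE)
cb <- gcombobox(c("<all>", unique(df$bank)), cont=g)
tbl <- gtable(df, cont=g, filter.FUN="manual", expand=TRUE)
addHandlerChanged(cb, handler=function(h, ...) {
  val <- svalue(h$obj)
  if (val == "<all>") {
    visible(tbl) <- rep(TRUE, nrow(df))   # show every row
  } else {
    visible(tbl) <- df$bank == val        # only rows matching the chosen level
  }
})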
If you are game, you can try the gfilter widget in gWidgets2, which as of now is just on my GitHub site (use install_github("gWidgets2", "jverzani") from devtools; also gWidgets2RGtk2). This may be just what you are trying to do.

With your data object and testing one of the variables, this is a simplified version of subset.data.frame:
tmp <- structure(list(kunde = c(3, 3, 3, 3, 3, 3, 3, 1, 3, 3),
                      bank = c(7, 98, 3, 3, 98, 2, 2, 1, 7, 2)),
                 .Names = c("kunde", "bank"), row.names = c(NA, 10L),
                 class = "data.frame")
getsub <- function(obj, logexpr) {
  if (missing(logexpr)) {
    return(obj)
  } else {
    e <- substitute(logexpr)
    r <- eval(e, obj, parent.frame())
    if (!is.logical(r))
      stop("'subset' must evaluate to logical")
    r <- r & !is.na(r)
    obj[r, ]
  }
}
getsub(tmp, bank < 50)
#--------------
   kunde bank
1      3    7
3      3    3
4      3    3
6      3    2
7      3    2
8      1    1
9      3    7
10     3    2
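Since the expression is evaluated within the data frame, compound conditions work the same way, e.g.:

getsub(tmp, bank < 50 & kunde == 3)
#    kunde bank
# 1      3    7
# 3      3    3
# 4      3    3
# 6      3    2
# 7      3    2
# 9      3    7
# 10     3    2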

Related

Need help understanding the output of DescTools:::.DoCount

As the documentation of DescTools:::ConDisPairs() says, the function DescTools:::.DoCount() returns "only concordant and discordant pairs". However, there are 4 non-NA list elements returned by the function, named C, D, T, and N. I understand that C and D stand for "Concordant" and "Discordant", and I guess that T stands for "Ties", but I do not know what N stands for.
I've tested this:
library(DescTools)
df <- data.frame(a = c(1, 2, 3, 4),
                 b = c(1, 1, 3, 4))
DescTools:::.DoCount(df$a, df$b)
returns 1 T (= tie). The following code, however, returns no tie but 1 N:
df <- data.frame(a = c(1, 1, 3, 4),
                 b = c(1, 1, 3, 4))
DescTools:::.DoCount(df$a, df$b)
Can someone explain to me what's going on and what the 'N' stands for?

Is there an R function for finding the index of an element in an array?

Which function can I use to get the index of an element in a vector or array?

You can find the index of an element with the functions which() or match().
Example for using which():
# create a vector
v <- c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
# which() returns the indices of elements that satisfy the condition
which(v == 5)
# [1] 6
Example for using match():
# match() returns the index of the first match
match(5, v)
# [1] 6
See ?which and ?match for more information.
For a matrix or an array, set the argument arr.ind = TRUE:
which(myarray == 5, arr.ind = TRUE)
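For example, with a small 2-by-5 matrix (a hypothetical myarray), arr.ind = TRUE returns the row and column of each match instead of the linear index:

myarray <- matrix(0:9, nrow = 2)   # 5 ends up in row 2, column 3
which(myarray == 5)                # [1] 6  (linear index)
which(myarray == 5, arr.ind = TRUE)
#      row col
# [1,]   2   3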

Remove all rows containing the same elements from dataframe, even though elements are in different order in R

Here is my data:
mymat <- structure(c(3, 6, 9, 9, 1, 4, 1, 5, 9, 6, 6, 4, 1, 4), .Dim = c(7L, 2L))
Some rows are duplicated, and several other rows contain the same elements in a different order. I wish to remove all rows that contain the same set of elements, whether those elements are in the same order (duplicated rows) or a different order. This will retain only the row c(3, 5).
I checked previous questions here and here. However, my requirement is that all such rows are removed rather than leaving one such row.
My question also differs from this one, which removes all duplicated rows, in that I am looking not only for duplicated rows but also for rows that contain the same set of elements ordered differently. For example, rows c(6, 9) and c(9, 6) should both be removed since they contain the same set of elements.
I am looking for solutions that do not use a for loop, since my real data is large and a for loop may be slow.
Note: My full data has 40k rows and 2 columns.
You can sort the data rowwise and use duplicated -
tmp <- t(apply(mymat, 1, sort))
tmp[!(duplicated(tmp) | duplicated(tmp, fromLast = TRUE)), , drop = FALSE]
#      [,1] [,2]
# [1,]    3    5
I added a little data to show that the matrix format remains
mymat <- structure(c(3, 6, 9, 9, 1, 4, 1, 10, 12, 13, 14, 5, 9, 6, 6, 4, 1, 4, 11, 13, 12, 15), .Dim = c(11L, 2L))
dup <- duplicated(rbind(mymat, mymat[, c(2, 1)]))
dup_fromLast <- duplicated(rbind(mymat, mymat[, c(2, 1)]), fromLast = TRUE)
mymat_duprm <- mymat[!(dup_fromLast | dup)[1:(length(dup) / 2)], ]
mymat_duprm
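Here the unordered pairs {6, 9}, {1, 4} and {12, 13} each occur more than once, so this should leave:

#      [,1] [,2]
# [1,]    3    5
# [2,]   10   11
# [3,]   14   15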
As a matrix:
tmp <- apply(mymat, 1, function(z) toString(sort(z)))
mymat[ave(tmp, tmp, FUN = length) == "1",, drop = FALSE]
#      [,1] [,2]
# [1,]    3    5
The drop=FALSE is required only because (at least with this sample data) the filtering results in one row. While I doubt your real data (with 40k rows) would reduce to this, I recommend you keep it in there anyway ("just in case", and it's just defensive programming).
Benchmarking a couple of new solutions along with a few already posted:
library(Rfast)
library(microbenchmark)
mymat <- matrix(sample(100, 4000, replace = TRUE), nrow = 2000)
noDup <- function(m) {
  return(!(duplicated(m) | duplicated(m, fromLast = TRUE)))
}

combounique1 <- function(m) {
  return(m[noDup(rowSort(m)), ])
}

combounique2 <- function(m) {
  msum <- rowsums(m)
  return(m[noDup(rowsums(m^2) + msum + (msum - 3) * abs(m[, 1] - m[, 2])), ])
}

combounique3 <- function(m) {
  return(m[noDup(rowsums(m + 1/m)), ])
}

combounique4 <- function(m) {
  # similar to Harrison Jones' answer, but correct
  return(m[noDup(rbind(m, m[m[, 1] != m[, 2], 2:1]))[1:nrow(m)], ])
}

combounique5 <- function(m) {
  # similar to Ronak Shah's answer, but maintains ordering within rows
  tmp <- t(apply(m, 1, sort))
  return(m[noDup(tmp), ])
}

r2evans <- function(m) {
  tmp <- apply(m, 1, function(z) toString(sort(z)))
  return(m[ave(tmp, tmp, FUN = length) == "1", , drop = FALSE])
}
microbenchmark(
  mymat1 <- combounique1(mymat),
  mymat2 <- combounique2(mymat),
  mymat3 <- combounique3(mymat),
  mymat4 <- combounique4(mymat),
  mymat5 <- combounique5(mymat),
  mymat6 <- r2evans(mymat)
)
                          expr     min       lq      mean   median       uq      max neval
 mymat1 <- combounique1(mymat)  7129.9  7642.30  9236.841  8205.45  9467.70  28363.7   100
 mymat2 <- combounique2(mymat)   171.0   197.30   219.341   215.75   225.45    385.5   100
 mymat3 <- combounique3(mymat)   144.2   166.95   187.340   182.50   192.30    306.7   100
 mymat4 <- combounique4(mymat) 14263.1 15343.90 17938.061 16417.30 19043.30  34884.9   100
 mymat5 <- combounique5(mymat) 48230.9 50773.75 57662.463 55041.90 60968.35 193804.2   100
      mymat6 <- r2evans(mymat) 66180.3 70835.30 78642.552 77299.85 81992.60 161034.5   100
> all(sapply(list(mymat1, mymat2, mymat3, mymat4, mymat5, mymat6), FUN = identical, mymat1))
[1] TRUE
Note that combounique2 and combounique3 are only strictly correct for integer values. The idea is to use a symmetric pairing function to get a unique value for each pair of integers, then use duplicated on that.
(see https://math.stackexchange.com/questions/3162166/what-function-symmetric-and-has-unique-solution)
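As a quick empirical sanity check of that claim (a sketch, assuming positive integer entries as in the benchmark data), you can confirm the key used by combounique3 has no collisions over a small range:

g <- expand.grid(a = 1:100, b = 1:100)
g <- g[g$a <= g$b, ]                   # one representative per unordered pair
key <- with(g, (a + 1/a) + (b + 1/b))  # the symmetric pairing function
anyDuplicated(key)                     # 0 means no two pairs share a key
# [1] 0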
You can just use the following line of code:
mymat <- mymat[!mymat[,1] %in% mymat[,2], , drop = FALSE]
output:
mymat
#> [,1] [,2]
#> [1,] 3 5
Created on 2021-09-24 by the reprex package (v0.3.0)

Linear Programming - Unique number of categories

TL;DR: I am trying to find the "cheapest" set of items in a collection that satisfies certain linear constraints. However, every element can be part of multiple "categories", and I also want a mix of those unique categories. I'm not quite sure whether this can be implemented in an LP way and, if so, how to approach it.
Example - Part 1
Let's say I have 7 items that have different costs and different values associated to them.
library(tidyverse)
library(lpSolve)
# Fake data
kd = tibble(
  Item  = 1:7,
  Cost  = c(1, 1, 1, 1, 2, 3, 4),
  Value = c(1, 1, 3, 4, 6, 3, 2),
  Type  = c("A", "A", "A", "B", "C", "D", "E")
)
I want to pick 3 of those elements so that Cost is minimized and their Value is >= 5. I can easily do this with lp with the following code:
# Objective function
knapsack.obj = kd$Cost
# Constraints
knapsack.con = matrix(
  c(
    rep(1, nrow(kd)),
    kd$Value
  ),
  nrow = 2, byrow = TRUE
)
knapsack.dir = c("==", ">=")
knapsack.rhs = c(3, 5)
# Solve
knapsackSolution = lp("min", knapsack.obj, knapsack.con, knapsack.dir, knapsack.rhs, all.bin = TRUE)
# Results
kd[knapsackSolution$solution == 1, ]
As expected this returns Items 1, 2 and 3, which have a combined Value = 5 and obviously minimize the price.
Example - Part 2
The extra complication I don't quite know how to solve is adding code to make sure the picked items come from at least 2 unique categories.
Now the solution I'd expect is Items 1, 2 and 4 (or 1, 3 and 4), which still have a combined cost of 3 and a value of 6 (or 8), i.e. >= 5, but are not all "A" elements: they also include Item 4, which is a "B" element.
Any idea on how to implement this in an LP framework?
Mathematical Model
If we introduce a zero-one (data) matrix
    Category[i,j] = 1 if item i has type j
                    0 otherwise

and a binary variable:

    y[j] = 1 if an item with type j is selected
           0 otherwise

we can develop a simple mathematical model:

    min   sum(i, Cost[i] * x[i])
    s.t.  sum(i, x[i]) = numItems
          sum(i, Value[i] * x[i]) >= MinValue
          y[j] <= sum(i, Category[i,j] * x[i])   for all j
          sum(j, y[j]) >= MinItems
          x[i], y[j] in {0, 1}

Note that the variable y[j] can be relaxed to be continuous between 0 and 1.
The advantage of first writing down a mathematical model is that it is easier to reason about than a bunch of R code (at least for me).
Implementation
I use OMPR here for two reasons:
1. A direct way to implement the model in an equation-based fashion; we stay closer to the mathematical model.
2. Access to better solvers than lpSolve.
Here is the R code:
library(tidyverse)
library(ROI)
library(ROI.plugin.symphony)
library(ompr)
library(ompr.roi)
# Fake data
kd = tibble(
  Item  = 1:7,
  Cost  = c(1, 1, 1, 1, 2, 3, 4),
  Value = c(1, 1, 3, 4, 6, 3, 2),
  Type  = c("A", "A", "A", "B", "C", "D", "E")
)
Types <- c("A", "B", "C", "D", "E")
Category <- 1 * outer(kd$Type, Types, FUN = "==")
Type <- 1:length(Types)

numItems <- 3
MinValue <- 5
MinItems <- 2

m <- MIPModel() %>%
  add_variable(x[i], i = kd$Item, type = "binary") %>%
  add_variable(y[j], j = Type, type = "binary") %>%
  add_constraint(sum_expr(x[i], i = kd$Item) == numItems) %>%
  add_constraint(sum_expr(kd$Value[i] * x[i], i = kd$Item) >= MinValue) %>%
  add_constraint(y[j] <= sum_expr(Category[i, j] * x[i], i = kd$Item), j = Type) %>%
  add_constraint(sum_expr(y[j], j = Type) >= MinItems) %>%
  set_objective(sum_expr(kd$Cost[i] * x[i], i = kd$Item), "min") %>%
  solve_model(with_ROI(solver = "symphony", verbosity = 1))

cat("Status:", solver_status(m), "\n")
cat("Objective:", objective_value(m), "\n")
m$solution
Probably the most complex part here is the calculation of the Category matrix.
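For intuition, outer() compares each item's Type against every category label, giving a 7 x 5 zero-one incidence matrix for the sample data:

Category
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    1    0    0    0    0
# [2,]    1    0    0    0    0
# [3,]    1    0    0    0    0
# [4,]    0    1    0    0    0
# [5,]    0    0    1    0    0
# [6,]    0    0    0    1    0
# [7,]    0    0    0    0    1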
Solution
The solution looks like:
Status: optimal
Objective: 3
x[1] x[2] x[3] x[4] x[5] x[6] x[7] y[1] y[2] y[3] y[4] y[5]
   1    1    0    1    0    0    0    1    1    0    0    0
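To map the x part of the solution back to the items, a small convenience sketch using ompr's get_solution:

sel <- get_solution(m, x[i])                 # data frame with columns variable, i, value
kd[kd$Item %in% sel$i[sel$value > 0.5], ]    # Items 1, 2 and 4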
Actually, we do not have to force the solution to have k-1 or fewer elements from each group. We can instead force each group to have at most g_i - 1 elements, where g_i is the number of elements in group i.
Here is the implementation:
library(tibble)   # for tibble(); not loaded by purrr
library(purrr)
library(lpSolve)
library(fastmatch)

# Fake data
kd = tibble(
  Item  = 1:7,
  Cost  = c(1, 1, 1, 1, 2, 3, 4),
  Value = c(1, 1, 3, 4, 6, 3, 2),
  Type  = c("A", "A", "A", "B", "C", "D", "E")
)
# number of elements to choose
k = 3
type_match <- fmatch(kd$Type, unique(kd$Type))
unique_cat <- unique(type_match)
add_con <- map(unique_cat, function(x) {
  type_match[type_match != x] = 0
  type_match[type_match > 0] = 1
  return(type_match)
}) %>%
  do.call(rbind, .)
knapsack.obj = kd$Cost
knapsack.con = rbind(
  rep(1, nrow(kd)),
  kd$Value,
  add_con
)
rhs_add <- apply(add_con, 1, function(x) ifelse(sum(x) > 1, sum(x) - 1, 1))
knapsack.dir = c("==", ">=", rep("<=", length(rhs_add)))
knapsack.rhs = c(k, 5, rhs_add)
knapsackSolution = lp("min",
                      knapsack.obj,
                      knapsack.con,
                      knapsack.dir,
                      knapsack.rhs,
                      all.bin = TRUE)
knapsackSolution$solution
> knapsackSolution$solution
[1] 1 1 0 1 0 0 0
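To see the chosen items rather than the raw 0/1 vector:

kd[knapsackSolution$solution == 1, ]   # Items 1, 2 and 4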
Since we know that the solution must have k = 3 elements, restricting each group to k - 1 or fewer elements forces at least 2 groups to be used.
incid <- +outer(unique(kd$Type), kd$Type, "==")
ntypes <- nrow(incid)
knapsack.con = rbind(
rep(1, nrow(kd)),
kd$Value,
incid)
k <- 3
knapsack.dir = c("==", ">=", rep("<=", ntypes))
knapsack.rhs = c(k, 5, rep(k-1, ntypes))
res <- lp("min", knapsack.obj, knapsack.con, knapsack.dir, knapsack.rhs, all.bin = TRUE)
res$status
## [1] 0
res$solution
## [1] 1 1 0 1 0 0 0
Simplification
As we discussed in the comments, for this particular data we can omit the last 4 constraints since they are always satisfied, as there is only one element in each of the last 4 groups.
res2 <- lp("min", knapsack.obj, knapsack.con[1:3, ], knapsack.dir[1:3],
           knapsack.rhs[1:3], all.bin = TRUE)
res2$status
## [1] 0
res2$solution
## [1] 1 1 0 1 0 0 0
Generalizing
As we discussed in the comments, to generalize, suppose we want at least 3 different categories in the solution rather than 2. For this particular data we could simply require that the solution have no more than 1 of each category, but in general that will not work, so let us take all combinations of the groups 2 at a time and produce the constraints shown below. Here 5 is the total number of categories in the problem and 2 is one less than the number of categories required to be in the solution.
combos <- combn(5, 2, function(x) colSums(incid[x, ]))
For each of those constraints, i.e. each column of combos, we require that it be less than or equal to 2 in order to exclude any solution having only 1 or 2 categories. We then construct the LP in a similar manner as before, adding the remaining constraints, as sketched below.
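A sketch of that construction, reusing knapsack.obj, incid, k and combos from above (res3 is a hypothetical name; the <= 2 bound is valid because k = 3):

# one "<=" row per pair of categories: at most 2 of the k items may come
# from any pair of categories, so at least 3 categories must be used
knapsack.con3 <- rbind(rep(1, nrow(kd)),  # pick exactly k items
                       kd$Value,          # minimum total value
                       t(combos))
knapsack.dir3 <- c("==", ">=", rep("<=", ncol(combos)))
knapsack.rhs3 <- c(k, 5, rep(2, ncol(combos)))
res3 <- lp("min", knapsack.obj, knapsack.con3, knapsack.dir3, knapsack.rhs3,
           all.bin = TRUE)
res3$solution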

Variable sample upper value in R

I have the following matrix
m <- matrix(c(2, 4, 3, 5, 1, 5, 7, 9, 3, 7), nrow = 5, ncol = 2)
colnames(m) <- c("Y", "Z")
m <- data.frame(m)
I am trying to create a random number in each row where the upper limit of the sample is based on a variable value (in this case each row's value of Z).
I currently have:
samp <- function(x) {
  sample(0:1, 1, replace = TRUE)
}
m$randoms <- apply(m, 1, samp)
which works well, applying the sample function independently to each row, but I always get an error when I try to make the upper value in sample depend on the data. I thought I could do something like this:
samp <- function(x) {
  sample(0:m$Z, 1, replace = TRUE)
}
m$randoms <- apply(m, 1, samp)
but I guess that was wishful thinking.
Ultimately I want the result:
Y Z randoms
2 5       4
4 7       7
3 9       3
5 3       1
1 7       6
Any ideas?
The following will sample from 0 to m$Y for each row, and store the result in randoms:
m$randoms <- sapply(m$Y + 1, sample, 1) - 1
Explanation:
sapply takes each value in m$Y separately (let's call this y) and calls sample(y + 1, 1) on it.
Note that sample(y + 1, 1) samples 1 random integer from the range 1:(y + 1). Since you want a number from 0 to y rather than 1 to y + 1, we subtract 1 at the end.
Also, just pointing out: there is no need for replace = TRUE here, because you are only sampling one value anyway, so it doesn't matter whether it gets replaced or not.
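An equivalent spelling (a sketch, not from the original answer) samples directly from 0:y for each row, avoiding the +1/-1 shift:

# note: would fail if a value were 0, since sample(0, 1) errors; fine for this data
m$randoms <- vapply(m$Y, function(y) sample(0:y, 1), integer(1))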
Based on #mathematical.coffee's suggestion and my edited example, this is the slick final result:
m <- matrix(c(2, 4, 3, 5, 1, 5, 7, 9, 3, 7), nrow = 5, ncol = 2)
colnames(m) <- c("Y", "Z")
m <- data.frame(m)
m$randoms <- sapply(m$Z + 1, sample, 1) - 1
