I have a function taking four arguments,
h(a, b, c, d)
where a and b are the i-th and (i+1)-th rows of df1, and c and d are the i-th and (i+1)-th rows of df2. The output has four variables and i-1 results.
The idea is the following: I want to apply the function h to each combination of these four arguments that shares the same i, so:
- for the first iteration it will take the 1st and 2nd row of df1 and 1st and 2nd row of df2
- for the second iteration it will take the 2nd and 3rd row of df1 and 2nd and 3rd row of df2
...
Ideally, the results would then be stored in a separate data frame with 4 columns and i-1 rows.
I tried using the apply function and a for loop, but my attempts failed. I don't necessarily need a ready-made solution; a hint would be nice. Thanks!
EDIT: reproducible example:
df1 <- data.frame(a = c(1, 2, 3, 4), b = c(5, 6, 7, 8))
df2 <- data.frame(c = c(4, 3, 2, 1), d = c(8, 7, 6, 5))
h <- function(a, b, c, d) {
  vector <- (a + b) / (c - d)
  vector
}
I would like to get a function that uses h until b and d reach the last row of df1/df2 (they have the same number of rows), and for each such combination generate vector and add it to some new data frame as a next row.
With apply you could do something like this:
df1 <- data.frame(a = c(1, 2, 3, 4), b = c(5, 6, 7, 8))
df2 <- data.frame(c = c(4, 3, 2, 1), d = c(8, 7, 6, 5))
h <- function(a, b, c, d) {
  (a + b) / (c - d)
}
apply(cbind(df1, df2), 1, function(x) h(x["a"], x["b"], x["c"], x["d"]))
[1] -1.5 -2.0 -2.5 -3.0
If h is a vectorized function (as in your example), it would be better to call it once on whole columns:
do.call(h, cbind(df1, df2))
Of course, I am assuming that h is not really that simple; if it were, (df1$a + df1$b) / (df2$c - df2$d) would suffice.
I also advise learning about the purrr package. It is great for this kind of situation, mainly because you can declare the type of output you expect (with purrr::map_*), which ensures consistency and avoids unexpected results.
To pass several columns of a data frame as arguments, use purrr::pmap_*:
# `pmap` returns a list
purrr::pmap(cbind(df1, df2), h)
[[1]]
[1] -1.5
[[2]]
[1] -2
[[3]]
[1] -2.5
[[4]]
[1] -3
# `pmap_dbl` returns a double vector or throws an error otherwise
purrr::pmap_dbl(cbind(df1, df2), h)
[1] -1.5 -2.0 -2.5 -3.0
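The calls above apply h to one row at a time. The question also asks for row i to be paired with row i+1, which can be done by iterating over the row indices. This is only a minimal sketch: h_rows() is a hypothetical stand-in for the real h, and its body is made up purely so that it returns four values.

```r
df1 <- data.frame(a = c(1, 2, 3, 4), b = c(5, 6, 7, 8))
df2 <- data.frame(c = c(4, 3, 2, 1), d = c(8, 7, 6, 5))

# Hypothetical stand-in for h: receives rows i and i + 1 of each data frame
# (as one-row data frames) and returns a named vector of four values.
h_rows <- function(r1, r1_next, r2, r2_next) {
  c(a_sum  = r1$a + r1_next$a,
    b_sum  = r1$b + r1_next$b,
    c_diff = r2$c - r2_next$c,
    d_diff = r2$d - r2_next$d)
}

# i runs from 1 to n - 1, so that row i + 1 always exists
idx <- seq_len(nrow(df1) - 1)

rows <- lapply(idx, function(i) {
  h_rows(df1[i, ], df1[i + 1, ], df2[i, ], df2[i + 1, ])
})

# Stack the per-iteration vectors into a data frame: one row per i
result <- as.data.frame(do.call(rbind, rows))
```

With the four-row example data, result has three rows and four columns. For large inputs, a vectorized h applied to head(df1, -1) and tail(df1, -1) would likely be faster than looping over indices.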
If I have a data.table:
library(data.table)
d <- data.table("ID" = c(1, 2, 2, 4, 6, 6),
                "TYPE" = c(1, 1, 2, 2, 3, 3),
                "CLASS" = c(1, 2, 3, 4, 5, 6))
I know I can remove values greater than a specific value like this:
r <- d[!(d$TYPE > 2), ]
However, if I want to apply this to all of the columns in the table instead of just TYPE (basically drop any row that has a value > 2 anywhere in the table), how would I generalize the above statement, avoiding a for loop if possible?
I know I can do d > 2, resulting in a boolean index table, but if I put that into the above line of code it gives me an error:
d[!d>2, ]
This results in an "invalid matrix type" error.
Note
It was brought up that this question is similar to "Return an entire row if the value in any specific set of columns meets a certain criteria".
However, that question works with a data.frame and I am working with a data.table, where the notation is different, so this is not a duplicate.
Using apply with any
d[!apply(d>2,1,any)]
   ID TYPE CLASS
1:  1    1     1
2:  2    1     2
Or rowSums
d[rowSums(d>2)==0,]
   ID TYPE CLASS
1:  1    1     1
2:  2    1     2
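Both variants work because they collapse the logical matrix d > 2 into a single TRUE/FALSE per row, which is what row subsetting expects. A step-by-step sketch of the rowSums variant (the names exceeds and keep are only illustrative):

```r
library(data.table)

d <- data.table(ID    = c(1, 2, 2, 4, 6, 6),
                TYPE  = c(1, 1, 2, 2, 3, 3),
                CLASS = c(1, 2, 3, 4, 5, 6))

# Comparing the whole table yields a logical matrix, one cell per value
exceeds <- d > 2

# A row is kept only if none of its cells exceeds 2
keep <- rowSums(exceeds) == 0

# A plain logical vector is a valid row index
r <- d[keep]
```

Here keep is TRUE only for the first two rows, which matches the output shown above.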
I was wondering what the fastest approach would be for a varying number of rows and columns.
So, here is a benchmark.
It excludes the ID column from the check, which is not exactly in line with OP's question but is a sensible decision, IMHO.
library(data.table)
library(bench)
bm <- press(
  n_row = c(1E1, 1E3, 1E5),
  n_col = c(2, 10, 50),
  {
    set.seed(1L)
    d <- data.table(
      ID = seq_len(n_row),
      matrix(sample(10, n_row*n_col, TRUE), ncol = n_col)
    )
    mark(
      m1 = d[d[, !apply(.SD > 2, 1, any), .SDcols = -"ID"]],
      m2 = d[!d[, apply(.SD > 2, 1, any), .SDcols = -"ID"]],
      m3 = d[!d[, which(apply(.SD > 2, 1, any)), .SDcols = -"ID"]],
      m4 = d[d[, rowSums(.SD > 2) == 0, .SDcols = -"ID"]],
      m5 = d[!d[, Reduce(any, lapply(.SD, `>`, y = 2)), by = 1:nrow(d), .SDcols = -"ID"]$V1]
    )
  })
ggplot2::autoplot(bm)
Apparently, the rowSums() approach is almost always the fastest method.
I have the following matrix
m <- matrix(c(2, 4, 3, 5, 1, 5, 7, 9, 3, 7), nrow = 5, ncol = 2)
colnames(m) <- c("Y", "Z")
m <- data.frame(m)
I am trying to create a random number in each row where the upper limit is a number based on a variable value (in this case 1*Y, based on each row's value for Z).
I currently have:
samp <- function(x) {
  sample(0:1, 1, replace = TRUE)
}
m$randoms <- apply(m, 1, samp)
which works well, applying the sample function independently to each row, but I always get an error when I try to alter the x in sample. I thought I could do something like this:
samp <- function(x) {
  sample(0:m$Z, 1, replace = TRUE)
}
m$randoms <- apply(m, 1, samp)
but I guess that was wishful thinking.
Ultimately I want the result:
Y Z randoms
2 5 4
4 7 7
3 9 3
5 3 1
1 7 6
Any ideas?
The following will sample from 0 to x$Y for each row, and store the result in randoms:
x$randoms <- sapply(x$Y + 1, sample, 1) - 1
Explanation:
The sapply takes each value in x$Y separately (let's call this y), and calls sample(y + 1, 1) on it.
Note that (e.g.) sample(y+1, 1) will sample 1 random integer from the range 1:(y+1). Since you want a number from 0 to y rather than 1 to y + 1, we subtract 1 at the end.
Also, just pointing out - no need for replace=T here because you are only sampling one value anyway, so it doesn't matter whether it gets replaced or not.
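To convince yourself that the shift by one gives the intended range, here is a small sketch. It uses sample.int(), which always draws from 1:n, instead of sample(); that substitution and the vector upper (the Z values from the question) are my own choices for illustration.

```r
set.seed(1)  # only to make the sketch reproducible

# Per-row upper limits (the Z column from the question's example)
upper <- c(5, 7, 9, 3, 7)

# sample.int(u + 1, 1) draws one integer from 1:(u + 1);
# subtracting 1 shifts the draw into the range 0:u
randoms <- vapply(upper, function(u) sample.int(u + 1, 1) - 1L, integer(1))

# Every draw lies between 0 and its own row's upper limit
all(randoms >= 0 & randoms <= upper)
# [1] TRUE
```

vapply() is used instead of sapply() only so that the result is guaranteed to be an integer vector with one value per row.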
Based on @mathematical.coffee's suggestion and my edited example, this is the slick final result:
m <- matrix(c(2, 4, 3, 5, 1, 5, 7, 9, 3, 7), nrow = 5, ncol = 2)
colnames(m) <- c("Y", "Z")
m <- data.frame(m)
m$randoms <- sapply(m$Z + 1, sample, 1) - 1
I'm looking at a gene in 10 people. And this gene has two alleles, say a and b. And each allele has 3 forms: type 2, 3 or 4.
a <- c(2, 2, 2, 2, 3, 3, 3, 2, 4, 3)
b <- c(4, 2, 3, 2, 4, 2, 3, 4, 4, 4)
I wish to code a variable that tells me how many type 4 alleles the person has: 0, 1, or 2.
var <- ifelse(a==4 & b==4, 2, 0)
The code above doesn't work since I didn't account for the individuals who have just one copy of the type 4 allele. I feel like I might need 2 ifelse statements that work simultaneously?
EDIT: You don't actually need ifelse or any fancy operations other than plus and equal to.
var <- (a == 4) + (b == 4)
If you're set on ifelse, this can be done with
var <- ifelse(a == 4, 1, 0) + ifelse(b == 4, 1, 0)
However, I prefer the following solution using apply. The following will give you three cases, the result being the number of 4's the person has (assuming each row is a person).
a = c(2, 2, 2, 2, 3, 3, 3, 2, 4, 3)
b = c(4, 2, 3, 2, 4, 2, 3, 4, 4, 4)
d <- cbind(a,b)
apply(d, 1, function(x) {sum(x == 4)})
For this operation, I first combined the two vectors into a matrix since it makes applying the function easier. In R, generally if data are the same type it is easier (and faster for the computer) to combine the data into a matrix/data frame/etc., then create a function to be performed on each row/column/etc.
To understand the output, consider what happens to the first row of d.
> d[1, ]
a b
2 4
> d[1, ] == 4
    a     b
FALSE  TRUE
Booleans are interpreted as integers under addition, so
> FALSE + TRUE
[1] 1
It doesn't seem to matter whether the 4 came from a or b, so we end up with three cases: 0, 1, and 2, depending on the number of 4's.
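As a quick cross-check, the three approaches can be run side by side. The rowSums() call is my own addition, a vectorized equivalent of the apply() version, and the variable names are only illustrative.

```r
a <- c(2, 2, 2, 2, 3, 3, 3, 2, 4, 3)
b <- c(4, 2, 3, 2, 4, 2, 3, 4, 4, 4)
d <- cbind(a, b)

n4_plus    <- (a == 4) + (b == 4)                   # logical addition
n4_apply   <- apply(d, 1, function(x) sum(x == 4))  # row-wise apply
n4_rowsums <- rowSums(d == 4)                       # vectorized row count

# All three give the same count of type 4 alleles per person
stopifnot(all(n4_plus == n4_apply), all(n4_apply == n4_rowsums))

n4_plus
# [1] 1 0 0 0 1 0 0 1 2 1
```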
I am building a custom GUI in R for work, and I need a part that can select a subset of a data frame based on variable values (e.g. select all females who are above 50). I am building the GUI with gWidgets, but I am stuck on how this filter can be implemented, specifically on how to create a widget that allows the user to select one or more filters and then returns the filtered data frame.
Here is a small sample from the data I am working with:
structure(list(kunde = c(3, 3, 3, 3, 3, 3, 3, 1, 3, 3),
bank = c(7,98, 3, 3, 98, 2, 2, 1, 7, 2)),
.Names = c("kunde", "bank"), row.names = c(NA, 10L), class = "data.frame")
Any help is greatly appreciated!!
There are some examples of similar things in the ProgGUIinR package. Here is one of them:
library(gWidgets)
options(guiToolkit="RGtk2")
options(repos="http://streaming.stat.iastate.edu/CRAN")
d <- available.packages() # pick a cran site
w <- gwindow("test of filter")
g <- ggroup(cont=w, horizontal=FALSE)
ed <- gedit("", cont=g)
tbl <- gtable(d, cont=g, filter.FUN="manual", expand=TRUE)
ourMatch <- function(curVal, vals) {
  grepl(curVal, vals)
}
id <- addHandlerKeystroke(ed, handler=function(h, ...) {
  vals <- tbl[, 1, drop=TRUE]
  curVal <- svalue(h$obj)
  vis <- ourMatch(curVal, vals)
  visible(tbl) <- vis
})
For your purpose, you might want to use gcheckboxgroup or gcombobox to select one or more factor levels and filter by that. The key is that the visible<- method of the gtable object is used to filter the displayed items.
If you are game, you can try the gfilter widget in gWidgets2, which as of now is only on my GitHub site (use install_github("jverzani/gWidgets2") from devtools; you will also need gWidgets2RGtk2). This may be just what you are trying to do.
With your data object and testing one of the variables, this is a simplified version of subset.data.frame:
tmp <-
structure(list(kunde = c(3, 3, 3, 3, 3, 3, 3, 1, 3, 3), bank = c(7,
98, 3, 3, 98, 2, 2, 1, 7, 2)), .Names = c("kunde", "bank"), row.names = c(NA,
10L), class = "data.frame")
getsub <- function(obj, logexpr) {
  if (missing(logexpr)) {
    return(obj)
  }
  e <- substitute(logexpr)
  r <- eval(e, obj, parent.frame())
  if (!is.logical(r))
    stop("'subset' must evaluate to logical")
  r <- r & !is.na(r)
  obj[r, ]
}
getsub(tmp, bank < 50)
#--------------
   kunde bank
1      3    7
3      3    3
4      3    3
6      3    2
7      3    2
8      1    1
9      3    7
10     3    2
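Since the question asks for one or more filters at a time, the same idea can be extended independently of the GUI toolkit. In this sketch each active filter is a function that returns one logical value per row, and the masks are combined with &. The helper apply_filters() and the list selected are hypothetical names of my own, not part of gWidgets; in a GUI, the list would be built from whatever the user ticked.

```r
tmp <- data.frame(kunde = c(3, 3, 3, 3, 3, 3, 3, 1, 3, 3),
                  bank  = c(7, 98, 3, 3, 98, 2, 2, 1, 7, 2))

# Hypothetical helper: `filters` is a list of functions, each taking the
# data frame and returning one logical value per row.
apply_filters <- function(obj, filters) {
  if (length(filters) == 0) return(obj)   # no filter selected: keep all rows
  masks <- lapply(filters, function(f) f(obj))
  keep <- Reduce(`&`, masks)              # a row must pass every filter
  obj[keep & !is.na(keep), , drop = FALSE]
}

# Filters the user might have selected (illustrative only)
selected <- list(
  function(x) x$bank < 50,
  function(x) x$kunde == 3
)

filtered <- apply_filters(tmp, selected)
```

With the sample data, this keeps the rows where bank is below 50 and kunde equals 3, i.e. everything getsub(tmp, bank < 50) returned except row 8.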