I have data on covariates for several units. Additionally, I have access to a scoring rule that ranks my observations according to a score.
I decided to divide my training sample X according to the quantiles of the score, which I achieved using the quantile_group function from the GenericML package.
## Generate data.
set.seed(1986)
n <- 1000
n_val <- 10000
k <- 3
X <- matrix(rnorm(n * k), ncol = k)
X_val <- matrix(rnorm(n_val * k), ncol = k)
score <- rexp(n)
score_val <- rexp(n_val)
## Quantiles of score.
library(GenericML)
groups <- quantile_group(score)
head(groups)
#>      [-Inf, 0.277) [0.277, 0.678) [0.678, 1.34) [1.34, Inf]
#> [1,]          TRUE          FALSE         FALSE       FALSE
#> [2,]         FALSE          FALSE         FALSE        TRUE
#> [3,]         FALSE          FALSE          TRUE       FALSE
#> [4,]         FALSE           TRUE         FALSE       FALSE
#> [5,]         FALSE           TRUE         FALSE       FALSE
#> [6,]         FALSE          FALSE          TRUE       FALSE
The g-th column of groups consists of TRUEs and FALSEs denoting membership in the g-th quantile group of score. My next step is to divide the units in the validation sample X_val using the same partition. To clarify, I want to divide score_val into the four groups defined by the intervals in colnames(groups):
colnames(groups)
#> [1] "[-Inf, 0.277)" "[0.277, 0.678)" "[0.678, 1.34)" "[1.34, Inf]"
I need to automate this.
I think this can be an approach to get what you are looking for. I don't use the GenericML package because, if I understood correctly, you only want to divide X_val into subsets.
# Load library
library(dplyr)
# Generate data
set.seed(1986)
n <- 1000
n_val <- 10000
k <- 3
X <- matrix(rnorm(n * k), ncol = k)
# Here I use "as.data.frame.matrx" in order to add the group (according to the interval)
X_val <- as.data.frame.matrix(matrix(rnorm(n_val * k), ncol = k))
score <- rexp(n)
score_val <- rexp(n_val)
# Get the quantiles of score
q.score <- quantile(score)
# Divide score_val according to the quantiles in q.score
group.var <- cut(score_val, breaks = c(-Inf, q.score[2:4], Inf))
# Add "group.var" to X_val matrix
X_val$group.var <- group.var
# Divide the information according to "group.var"
new_X_val <- X_val %>%
  group_split(group.var)
At the end, what you get is new_X_val, a list with 4 elements, one for each quantile.
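As a quick sanity check (using the objects created above), you can count how many validation units land in each quantile group and peek at one of the subsets:
# Number of validation units per quantile group
sapply(new_X_val, nrow)
# Units whose score_val falls in the lowest interval
head(new_X_val[[1]])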
Using the R package "effsize", I am trying to calculate Cohen's d between all pairs of groups in my data, outputting all the pairwise d estimates as a matrix. I have provided some test data to illustrate this. I want a matrix of d estimates for all pairs of groups 1, 2, and 3.
I am struggling to find where to start with this. I know it could be done using loops, but since my real data contains 1000 groups, each with 6000 data points, I think that would be slow.
library("effsize")
test <- data.frame(
  score = c(2, 3, 42, 1, 2, 3, 4, 5, 5, 6, 8, 2),
  group = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
)
This would be similar functionality to what is provided for wilcox rank sum using pairwise.wilcox.test().
All you have to do is note that the function combn outputs the combinations of n elements taken k at a time, and can also apply a function to each resulting combination. In this case the question asks for combinations of 2 groups at a time, and the function fun is applied to each one.
fun <- function(x) {
  cohen.d(x[[1]]$score, x[[2]]$score)
}
sp <- split(test, test$group)
cmb <- combn(sp, 2, fun)
cmb[, 1]
#[[1]]
#[1] "Cohen's d"
#
#[[2]]
#[1] "d"
#
#[[3]]
#[1] 0.5992954
#
#[[4]]
# lower upper
#-1.169345 2.367936
#
#[[5]]
#[1] 0.95
#
#[[6]]
#[1] medium
#Levels: negligible < small < medium < large
The code above can be written as a function that does all the work and returns a matrix.
cohen.d.pairwise.test <- function(DF, scoreCol, groupCol) {
  fun <- function(x) {
    eff <- cohen.d(x[[1]][[scoreCol]], x[[2]][[scoreCol]])
    c(eff[["estimate"]],
      eff[["conf.int"]][1],
      eff[["conf.int"]][2],
      eff[["conf.level"]])
  }
  sp <- split(DF, DF[[groupCol]])
  cmb <- combn(sp, 2, fun)
  rownames(cmb) <- c("estimate", "lower", "upper", "conf.level")
  t(cmb)
}
cohen.d.pairwise.test(test, scoreCol = "score", groupCol = "group")
#       estimate     lower     upper conf.level
#[1,]  0.5992954 -1.169345 2.3679357       0.95
#[2,]  0.4732232 -1.281054 2.2275008       0.95
#[3,] -0.8795932 -2.691556 0.9323698       0.95
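The rows come out in the order in which combn generates the pairs: (1, 2), (1, 3), (2, 3). If you want that made explicit, one way (reusing the same test data) is to build matching row names with another combn call:
sp <- split(test, test$group)
res <- cohen.d.pairwise.test(test, scoreCol = "score", groupCol = "group")
rownames(res) <- combn(names(sp), 2, FUN = paste, collapse = " vs ")
res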
I'm working with the popbio package on a population model. It looks something like this:
library(popbio)
babies <- 0.3
kids <- 0.5
teens <- 0.75
adults <- 0.98
A <- c(0, 0, 0, 0, teens*0.5, adults*0.8,
       babies, 0, 0, 0, 0, 0,
       0, kids, 0, 0, 0, 0,
       0, 0, kids, 0, 0, 0,
       0, 0, 0, teens, 0, 0,
       0, 0, 0, 0, teens, adults)
A <- matrix(A, ncol = 6, byrow = TRUE)
N <- c(10, 10, 10, 10, 10, 10)
N <- matrix(N, ncol = 1)
model <- pop.projection(A,N,iterations=10)
model
I'd like to know how I can randomise the input so that at each iteration, which represents a year in this case, I get a different input for the matrix elements. So, for instance, my model runs for 10 years, and I'd like the baby survival rate to change each year. babies <- rnorm(1, 0.3, 0.1) doesn't do it, because that still leaves me with a single value, just randomly selected.
Update: This is distinct from running 10 separate models with different initial, random values. I'd like the update to occur within a single model run, which itself has 10 iterations in the pop.projection function.
Hope you can help.
I know this answer is very late, but here's one approach using expressions. First, use an expression to create the matrix.
vr <- list( babies=0.3, kids=0.5, teens=0.75, adults=0.98 )
Ax <- expression(matrix(c(
  0, 0, 0, 0, teens*0.5, adults*0.8,
  babies, 0, 0, 0, 0, 0,
  0, kids, 0, 0, 0, 0,
  0, 0, kids, 0, 0, 0,
  0, 0, 0, teens, 0, 0,
  0, 0, 0, 0, teens, adults), ncol = 6, byrow = TRUE))
A1 <- eval(Ax, vr)
lambda(A1)
[1] 1.011821
Next, use an expression to create the vital rates with rnorm or other functions.
vr2 <- expression( list( babies=rnorm(1,0.3,0.1), kids=0.5, teens=0.75, adults=0.98 ))
A2 <- eval(Ax, eval( vr2))
lambda(A2)
[1] 1.014586
Apply the expression to 100 matrices.
x <- sapply(1:100, function(x) lambda(eval(Ax, eval(vr2))))
quantile(x, c(.05,.95))
      5%      95%
0.996523 1.025900
Finally, make two small changes to pop.projection by adding the vr option and a line to evaluate A at each time step.
pop.projection2 <- function(Ax, vr, n, iterations = 20) {
  x <- length(n)
  t <- iterations
  stage <- matrix(numeric(x * t), nrow = x)
  pop <- numeric(t)
  change <- numeric(t - 1)
  for (i in 1:t) {
    stage[, i] <- n
    pop[i] <- sum(n)
    if (i > 1) {
      change[i - 1] <- pop[i]/pop[i - 1]
    }
    ## evaluate Ax with freshly drawn vital rates at each time step
    A <- eval(Ax, eval(vr))
    n <- A %*% n
  }
  colnames(stage) <- 0:(t - 1)
  w <- stage[, t]
  pop.proj <- list(lambda = pop[t]/pop[t - 1], stable.stage = w/sum(w),
                   stage.vectors = stage, pop.sizes = pop, pop.changes = change)
  pop.proj
}
n <-c(10,10,10,10,10,10)
pop.projection2(Ax, vr2, n, 10)
$lambda
[1] 0.9874586
$stable.stage
[1] 0.33673579 0.11242588 0.08552367 0.02189786 0.02086656 0.42255023
$stage.vectors
0 1 2 3 4 5 6 7 8 9
[1,] 10 11.590000 16.375700 19.108186 20.2560223 20.5559445 20.5506251 20.5898222 20.7603581 20.713271
[2,] 10 4.147274 3.332772 4.443311 5.6693931 1.9018887 6.8455597 5.3879202 10.5214540 6.915534
[3,] 10 5.000000 2.073637 1.666386 2.2216556 2.8346965 0.9509443 3.4227799 2.6939601 5.260727
[4,] 10 5.000000 2.500000 1.036819 0.8331931 1.1108278 1.4173483 0.4754722 1.7113899 1.346980
[5,] 10 7.500000 3.750000 1.875000 0.7776139 0.6248948 0.8331209 1.0630112 0.3566041 1.283542
[6,] 10 17.300000 22.579000 24.939920 25.8473716 25.9136346 25.8640330 25.9715930 26.2494195 25.991884
$pop.sizes
[1] 60.00000 50.53727 50.61111 53.06962 55.60525 52.94189 56.46163 56.91060 62.29319 61.51194
$pop.changes
[1] 0.8422879 1.0014610 1.0485765 1.0477793 0.9521023 1.0664832 1.0079517 1.0945797 0.9874586
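Since the vital rates are redrawn at every time step, each call to pop.projection2 gives a different trajectory. If useful, the spread of the realized growth rate can be summarised the same way as the 100-matrix example above (this just reuses Ax, vr2, and n from this answer):
lambdas <- replicate(500, pop.projection2(Ax, vr2, n, 10)$lambda)
quantile(lambdas, c(.05, .95))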
I have two logical vectors x and y and weighted values z corresponding to each index. For each x value that is TRUE, I'd like to find the nearest y index that is also TRUE, then take the sum of z over the indices between the two, inclusive. If two y indices are equally near, the smaller sum of z is used.
       x     y          z
1  FALSE  TRUE 0.05647057
2  FALSE FALSE 0.09577802
3   TRUE FALSE 0.04150954
4  FALSE FALSE 0.07242995
5  FALSE  TRUE 0.06220041
6  FALSE FALSE 0.01861535
7  FALSE FALSE 0.05056971
8   TRUE FALSE 0.07726933
9  FALSE  TRUE 0.04669694
10  TRUE  TRUE 0.02312497
There are 3 x values that are TRUE, so we'll call them {x_1, x_2, x_3}. Here I demonstrate the summing over the indices between each x_i and its nearest TRUE y neighbor. What is the most efficient base R way to accomplish this? I have a method at the end that uses two nested apply calls, which tells me it's probably not efficient. I don't have a math background, but usually there's some algebraic way to vectorize these sorts of tasks instead of relying on brute computational power.
## x_1
sum(z[3:5]) ## This one is smaller so use it
sum(z[1:3])
## x_2
sum(z[8:9])
## x_3
sum(z[10])
c(sum(z[3:5]), sum(z[8:9]), sum(z[10]))
[1] 0.17613990 0.12396627 0.02312497
MWE:
x <- y <- rep(FALSE, 10)
x[c(3, 8, 10)] <- TRUE
y[c(1, 5, 9, 10)] <- TRUE
set.seed(15)
z <- rnorm(10, .5, .25)/10
data.frame(x=x, y=y, z=z)
Here is an approach that is less than optimal:
dat <- data.frame(x=x, y=y, z=z)
sapply(which(dat[, "x"]), function(x) {
  ylocs <- which(dat[, "y"])
  dists <- abs(x - ylocs)
  min.ylocs <- ylocs[min(dists) == dists]
  min(sapply(min.ylocs, function(y, x2 = x) {
    sum(dat[, "z"][x2:y])
  }))
})
## [1] 0.17613990 0.12396627 0.02312497
I'd prefer to keep the solution within base.
This uses no loops or apply functions. We use na.locf from zoo to carry the index of the last TRUE y forward, giving fwd, and the index of the next TRUE y backward, giving bck. Since cs <- cumsum(z) satisfies cs[i] = sum(z[1:i]), the sum over an index range a:b is cs[b] - cs[a] + z[a]; the last line computes both candidate sums and keeps the smaller of the two with pmin. This depends on na.locf from the zoo package, but at the end we extract the core code from zoo to avoid the dependence:
library(zoo) # na.locf
x <- dat$x
y <- dat$y
z <- dat$z
yy <- ifelse(y, TRUE, NA) * seq_along(y)
fwd <- na.locf(yy, fromLast = FALSE)[x]
bck <- na.locf(yy, fromLast = TRUE)[x]
cs <- cumsum(z)
pmin(cs[x] - cs[fwd] + z[fwd], cs[bck] - cs[x] + z[x])
The last line gives:
[1] 0.17613990 0.12396627 0.02312497
Here is a mini version of na.locf; the library call above could be replaced with this. Note that it returns the position of the most recent non-NA entry rather than its value, which is equivalent here because each non-NA value of yy equals its own index.
# code extracted from the zoo package
na.locf <- function(x, fromLast = FALSE) {
  L <- !is.na(x)
  if (fromLast) rev(c(NA, rev(which(L)))[cumsum(rev(L)) + 1])
  else c(NA, which(L))[cumsum(L) + 1L]
}
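A quick check on the yy vector from above (y is TRUE at positions 1, 5, 9, and 10, so yy is 1 NA NA NA 5 NA NA NA 9 10):
na.locf(yy)
#  [1]  1  1  1  1  5  5  5  5  9 10
na.locf(yy, fromLast = TRUE)
#  [1]  1  5  5  5  5  9  9  9  9 10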
REVISED: some improvements.
I have a setup that looks like below
for (V in seq(1, 250, by = 5)) {
  for (n in seq(1, 250, by = 5)) {
    # 1) Working algorithm creating a probability,
    #    i.e. a vector with values in the range [0, 1]
    # 2) Take the natural log of this probability
    a <- log(lag(Probability), base = exp(1))
    # 3) Calculate price differences
    b <- abs(diff(Price) - 1)
    # 4) Then compute the correlation between a and b
    cor(a, b)
    # 5) Here I'd like to save this in the corresponding index of a matrix
  }
}
So that I get a matrix as output, indexed by V and n, that collects the correlation from each pass through the loop.
I have a few problems with this.
The first problem is that my correlation is not computable, because the probability is often 0, creating ln(0) = -Inf entries in the ln(Probability) vector. Is there a way to compute the standard deviation or correlation of a log vector with -Inf entries?
My second question is how I save this correlation output into a matrix generated for each loop?
Thanks for your help. I hope this is clear enough.
For your second question (how to save the correlation output into a matrix), you could initialise a matrix before the loop and store each computed correlation at the corresponding index, like:
sz <- seq(1, 250, by = 5)
out_mat <- matrix(0, nrow = length(sz), ncol = length(sz))
# then continue with your for-loop
for (V in seq_along(sz)) {
  for (n in seq_along(sz)) {
    # here, instead of using V and n when computing the probability,
    # use sz[V] and sz[n]
    ...
    ...
    # after computing the correlation, index with V and n (not sz[V] or sz[n])
    out_mat[V, n] <- cor(a, b)
  }
}
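For a self-contained illustration of the matrix-filling pattern, here is a toy version in which Probability and Price are purely made-up stand-ins for whatever your algorithm actually produces (in your code they would depend on sz[V] and sz[n]):
set.seed(1)
sz <- seq(1, 250, by = 5)
out_mat <- matrix(NA_real_, nrow = length(sz), ncol = length(sz))
for (V in seq_along(sz)) {
  for (n in seq_along(sz)) {
    Probability <- runif(100)                    # made-up stand-in
    Price <- cumsum(rnorm(100)) + 100            # made-up stand-in
    a <- log(Probability[-length(Probability)])  # aligned with diff() below
    b <- abs(diff(Price) - 1)
    out_mat[V, n] <- cor(a, b, use = "complete.obs")
  }
}
dim(out_mat)  # length(sz) x length(sz), i.e. 50 x 50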
What you can do with -Inf is replace it with NA, for example:
x = runif(10)
x[3] = 1/0
> is.infinite(x)
[1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
x[is.infinite(x)] <- NA
> x
[1] 0.09936348 0.66624531 NA 0.90689357 0.71578917 0.14655174
[7] 0.59561047 0.41944552 0.67203026 0.03263173
And use the na.rm argument for sd:
> sd(x, na.rm = TRUE)
[1] 0.3126829
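Note that cor() has no na.rm argument; the analogous option there is use. For example, continuing with the x that now contains an NA:
y <- runif(10)
cor(x, y, use = "complete.obs")  # drops the pairs where x (or y) is NA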
I have two data frames with a different number of rows but the same number of columns. In the example below, data frame 1 is 4 x 2 and data frame 2 is 3 x 2. I need a 4 x 3 logical matrix where entry (i, j) is TRUE if row i of data frame 1 matches row j of data frame 2. The example below works, but takes a very long time to run with larger data frames (I'm trying two data frames with about 5,000 rows each, but still just two columns). Is there a more efficient way of doing this?
> df1 <- data.frame(row.names=1:4, var1=c(TRUE, TRUE, FALSE, FALSE), var2=c(1,2,3,4))
> df2 <- data.frame(row.names=5:7, var1=c(FALSE, TRUE, FALSE), var2=c(5,2,3))
>
> m1 <- t(as.matrix(df1))
> m2 <- as.matrix(df2)
>
> apply(m2, 1, FUN=function(x) { apply(m1, 2, FUN=function(y) { all(x==y) } ) })
      5     6     7
1 FALSE FALSE FALSE
2 FALSE  TRUE FALSE
3 FALSE FALSE  TRUE
4 FALSE FALSE FALSE
Thanks in advance for any help.
I was drawn here by your post on R-bloggers: http://jason.bryer.org/posts/2013-01-24/Comparing_Two_Data_Frames.html
If, as you say, your data contains no numeric vectors, then I think I can suggest a faster approach. It consists of two steps:
turn your two data.frames into two matrices of integers
compute the Euclidean distance between the rows of your two data sets
Quick example using your data:
mat1 <- as.matrix(sapply(df1, as.integer))
mat2 <- as.matrix(sapply(df2, as.integer))
library(fields)
rdist(mat1, mat2) < 1e-9
#       [,1]  [,2]  [,3]
# [1,] FALSE FALSE FALSE
# [2,] FALSE  TRUE FALSE
# [3,] FALSE FALSE  TRUE
# [4,] FALSE FALSE FALSE
A few comments:
if your data contained vectors of characters, you would have to convert them into factors and make sure that they share the same factor levels (see the sketch after these comments).
I used the fields package to compute the Euclidean distance. It uses a Fortran implementation and is as far as I know the fastest R package around for the task (and I have tested many, trust me.)
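A possible sketch of the factor-level alignment mentioned above, with var1 standing in for a character column:
# align levels across both data frames before taking integer codes
lev <- sort(unique(c(as.character(df1$var1), as.character(df2$var1))))
int1 <- as.integer(factor(as.character(df1$var1), levels = lev))
int2 <- as.integer(factor(as.character(df2$var1), levels = lev))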
I'm honestly not sure if this will be faster, but you might try:
foo <- Vectorize(function(x,y) {all(df1[x,] == df2[y,])})
> outer(1:4,1:3,FUN = foo)
      [,1]  [,2]  [,3]
[1,] FALSE FALSE FALSE
[2,] FALSE  TRUE FALSE
[3,] FALSE FALSE  TRUE
[4,] FALSE FALSE FALSE
I feel compelled to at least mention the danger in using == for comparisons, as opposed to all.equal or identical. I'm presuming that you're comfortable enough with the data types involved that this won't be a problem.
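The classic floating-point example of why that caveat matters:
0.1 + 0.2 == 0.3                   # FALSE, due to floating-point representation
isTRUE(all.equal(0.1 + 0.2, 0.3))  # TRUE, comparison within a tolerance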
I suspect that the optimal solution depends on how many unique rows and how many total rows you have.
For the example on your blog, where there are 1000-1500 rows but only 20 unique values (for the seed you set there), I think it's faster to do this:
assign ids to each unique row and then
run outer on the vector of ids seen in each data.frame.
Here's the performance I got. @flodel's approach does about the same on my computer; it's the third one below. Disclaimer: I don't know much about running these kinds of tests.
> set.seed(2112)
> df1 <- data.frame(row.names=1:1000,
+ var1=sample(c(TRUE,FALSE), 1000, replace=TRUE),
+ var2=sample(1:10, 1000, replace=TRUE) )
> df2 <- data.frame(row.names=1001:2500,
+ var1=sample(c(TRUE,FALSE), 1500, replace=TRUE),
+ var2=sample(1:10, 1500, replace=TRUE))
>
> # candidate method on blog
> system.time({
+ df1$var3 <- apply(df1, 1, paste, collapse='.')
+ df2$var3 <- apply(df2, 1, paste, collapse='.')
+ df6 <- sapply(df2$var3, FUN=function(x) { x == df1$var3 })
+ dimnames(df6) <- list(row.names(df1), row.names(df2))
+ })
user system elapsed
1.13 0.00 1.14
>
> rownames(df1) <- NULL # in case something weird happens to rownames on merge
> rownames(df2) <- NULL
> # id method
> system.time({
+ df12 <- unique(rbind(df1,df2))
+ df12$id <- rownames(df12)
+
+ id1 <- merge(df12,df1)$id
+ id2 <- merge(df12,df2)$id
+
+ x <- outer(id1,id2,`==`)
+ })
user system elapsed
0.11 0.02 0.13
>
> library(fields)
> # rdist from fields method
> system.time({
+ mat1 <- as.matrix(sapply(df1, as.integer))
+ mat2 <- as.matrix(sapply(df2, as.integer))
+ rdist(mat1, mat2) < 1e-9
+ })
user system elapsed
0.15 0.00 0.16
I guess the rbind and the merges would make this solution relatively more costly with different data.
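If the merges ever become the bottleneck, a possible merge-free variant assigns the ids by matching on a pasted row key instead, in the spirit of the paste-based candidate method above (same output, just a different way of computing id1 and id2):
key1 <- do.call(paste, c(df1, sep = "\r"))
key2 <- do.call(paste, c(df2, sep = "\r"))
keys <- unique(c(key1, key2))
x <- outer(match(key1, keys), match(key2, keys), `==`)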