Lift curve is swapped - r

For the lift curve example, I run
library(caret)
set.seed(1)
simulated <- data.frame(obs = factor(rep(letters[1:2], each = 100)),
perfect = sort(runif(200), decreasing = TRUE),
random = runif(200))
lift2 <- lift(obs ~ random + perfect, data = simulated)
xyplot(lift2, plot = "lift", auto.key = list(columns = 2))
and get
as a result. I expected the image to be swapped horizontally, something along the lines of
What am I doing wrong?
Btw: this is a lift chart, not a cumulative gains chart.
Update:
The plot that I expected, now produced by my own code
mylift <- caret::lift(Class ~ cforest_prob + perfect_prob + guess_prob, data = data_test)
ggplot(mylift$data) +
geom_line(aes(CumTestedPct, lift, color = liftModelVar))
is
I noticed that the data.frame mylift$data contains the following columns:
names(mylift$data)
[1] "liftModelVar" "cuts" "events" "n" "Sn" "Sp" "EventPct"
[8] "CumEventPct" "lift" "CumTestedPct"
So I printed the following plot
ggplot(mylift$data) +
geom_line(aes(cuts, lift, color = liftModelVar))
So I guess that the different plots are just different ways of examining lift? I wasn't aware that there are different lift charts; I thought they were standardized across the industry.
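For what it's worth, caret can draw both chart types from the same lift object; a minimal sketch using lift2 from above (that "gain" is the other accepted value of the plot argument, and its default, is my reading of caret's xyplot.lift help):
# Same lift object, two views: cumulative gains vs. lift
xyplot(lift2, plot = "gain", auto.key = list(columns = 2))  # cumulative gains chart
xyplot(lift2, plot = "lift", auto.key = list(columns = 2))  # lift chart, as in the question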

Edit by the question author, for late readers: I accepted this answer for a large part because of the helpful discussion in the comments to this answer. Please consider reading the discussion!
Let's reproduce the graph and find the baseline. Let
cutoffs <- seq(0, 1, length = 1000)
be our cutoffs. Now the main computations are done by
aux <- sapply(cutoffs, function(ct) {
perf <- simulated$obs[simulated$perfect > ct]
rand <- simulated$obs[simulated$random > ct]
c(mean(perf == "a"), mean(rand == "a"))
})
where we go over the vector of cutoffs and do the following. Take the perfect case. We say that whenever perfect > ct, we are going to predict "a". Then simulated$obs[simulated$perfect > ct] are the true values, while mean(perf == "a") is our accuracy with a given cutoff. The same happens with random.
As for the baseline, it is just a constant defined by the share of "a" in the sample:
baseline <- mean(simulated$obs == "a")
When plotting the lifts, we divide our accuracy by that of the baseline method and get the same graph along with the baseline curve:
plot(x = cutoffs, y = aux[1, ] / baseline, type = 'l', ylim = c(0, 2), xlab = "Cutoff", ylab = "Lift")
lines(x = cutoffs, y = aux[2, ] / baseline, col = 'blue')
abline(a = baseline / baseline, b = 0, col = 'magenta')
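As a quick sanity check on that (a small worked example; na.rm is needed because cutoffs at or above the maximum score select no observations and give NaN):
# Half of the observations are "a", so the baseline is 0.5; wherever the perfect
# score selects only true "a" cases, the lift therefore reaches 1 / 0.5 = 2
baseline                                 # 0.5
max(aux[1, ] / baseline, na.rm = TRUE)   # 2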
Update:
Here's an illustration that, at least when plotted manually, the lift curve of the "expected" type can be manipulated and gives non-unique results.
Your example graph is from here, which also has this data:
# contacted response
# 1 10000 6000
# 2 20000 10000
# 3 30000 13000
# 4 40000 15800
# 5 50000 17000
# 6 60000 18000
# 7 70000 18800
# 8 80000 19400
# 9 90000 19800
# 10 100000 20000
Now suppose that we do not know this cumulative evolution but only the 10 individual blocks:
# contacted response
# 1 10000 6000
# 2 10000 4000
# 3 10000 3000
# 4 10000 2800
# 5 10000 1200
# 6 10000 1000
# 7 10000 800
# 8 10000 600
# 9 10000 400
# 10 10000 200
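The code below assumes these per-block counts have been collected into a data frame df with the column names from the listing above; a minimal construction would be:
# Per-block data from the table above
df <- data.frame(contacted = rep(10000, 10),
                 response = c(6000, 4000, 3000, 2800, 1200, 1000, 800, 600, 400, 200))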
In that case it depends on how we order the observations when putting "% Contacted" on the x-axis:
set.seed(1)
baseline <- sum(df$response) / sum(df$contacted) * cumsum(df$contacted)
lift1 <- cumsum(df$response)
lift2 <- cumsum(sample(df$response))
x <- 1:10 * 10
plot(x = x, y = lift1 / baseline, col = 'red', type = 'l', ylim = c(0, 3), xlab = "% Customers contacted", ylab = "Lift")
lines(x = x, y = lift2 / baseline, col = 'blue')
abline(h = 1, col = 'magenta')  # the baseline lift is 1 by construction

Related

calculate new random number considering distribution of already existing numbers in r

I have a dataframe with participants and I want to randomly assign them to a group (0,1). Each group should have approximately the same amount of participants.
My problem: I will keep adding participants. So, when I calculate a new random number for that participant, it should take into account the distribution of the random numbers I already have.
This is my code:
groupData <- data.frame(participant = c(1), Group = floor(runif(1, min=0, max=2)))
groupData[nrow(groupData) + 1,] = c(2,floor(runif(1, min=0, max=2))) # with this I will be adding participants
I think what you're saying is that when iteratively adding participants to groupData, you want to randomly assign them to a group such that over time, the groups will be evenly distributed.
N.B., iteratively adding rows to a frame scales horribly, so if you're doing this with a lot of data, it will slow down a lot. See "Growing Objects" in The R Inferno.
We can weight the different groups inversely in proportion to their relative size, so that a new participant has a slightly higher likelihood of being assigned to an under-populated group.
For instance, if we already have 100 participants with unbalanced groups:
set.seed(42)
groupData <- data.frame(participant = 1:100, Group = sample(c(rep(0, 70), rep(1, 30))))
head(groupData)
# participant Group
# 1 1 0
# 2 2 0
# 3 3 0
# 4 4 1
# 5 5 0
# 6 6 1
table(groupData$Group)
# 0 1
# 70 30
then we can prioritize the under-filled group using
100 / (table(c(0:1, groupData$Group))-1)
# 0 1
# 1.428571 3.333333
which can be used with sample as in
sample(0:1, size = 1, prob = 100 / (table(c(0:1, groupData$Group)) - 1) )
I use table(c(0:1, ..)) - 1 because I want this to work when there may not yet be participants in one of the groups; by concatenating 0:1 to it, I ensure each group has at least one, and the "minus one" compensates for this artificiality, trying to keep the ratios unbiased.
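A quick illustration of the edge case this guards against, when nobody has been assigned to group 1 yet:
# Five participants so far, all in group 0: the padded table still has both levels,
# and subtracting 1 recovers the true counts, including the zero
table(c(0:1, rep(0, 5))) - 1
# 0 1
# 5 0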
To "prove" that this eventually rounds out ...
for (pa in 101:400) {
newgroup <- sample(0:1, size = 1, prob = 100 / (table(c(0:1, groupData$Group))-1))
groupData <- rbind(groupData, data.frame(participant=pa, Group=newgroup))
}
library(ggplot2)
transform(groupData, GroupDiff = cumsum(Group == 0) - cumsum(Group == 1)) |>
ggplot(aes(participant, y = GroupDiff)) +
geom_point() +
geom_hline(yintercept=0) +
geom_vline(xintercept = 100) +
geom_text(data=data.frame(participant=101, GroupDiff=c(-Inf, -1, 1), vjust=c(-0.5, 0.5, -0.5), label=c("Start of group-balancing", "Group0-heavy", "Group1-heavy")), hjust=0, aes(label=label, vjust=vjust))
It is possible (even likely) that the balance will sway from side-to-side, but in general (asymptotically) it should stay balanced.
It occurs to me that the simplest method is just to assign people in pairs. Draw a random number (0 or 1), assign person N to the group associated with that value, and assign person N+1 to the other group. That guarantees random assignment as well as perfectly equal group sizes.
Whether this properly simulates the situation you want to analyze is a separate issue.
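A minimal sketch of that pairwise scheme (the helper name assign_pairs is mine, not something from the question):
# Assign participants in pairs: one coin flip decides person N's group,
# person N + 1 automatically gets the other group
assign_pairs <- function(n) {
  first  <- sample(0:1, size = ceiling(n / 2), replace = TRUE)  # group of each odd-numbered person
  groups <- as.vector(rbind(first, 1 - first))                  # interleave with the complements
  groups[seq_len(n)]                                            # drop the extra slot if n is odd
}
set.seed(1)
table(assign_pairs(100))
#  0  1
# 50 50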

optimize R code for min() and sample() by group

I generate a network with npeople(=80), ncomp(=4) components and I want each component to have density equal to dens(=0.2).
I want to optimize 2 lines of the code which take most of the time (especially if I want to have 5k people in the network).
The 2 lines are:
# adjust probability to keep density
nodes[,p:= as.numeric(min(c(1, p * (1/(mean(nodes$p) / c.dens))))), by = c("ID","ALTERID")]
# simulate edges
nodes[, edge := sample(c(0,1),1, prob = c(1-p,p)), by = c("ID","ALTERID")]
I have tried using the lapply() function, but the execution time increased; see the line of code below:
nodes[,lapply(.SD, function(p) min(c(1, p * (1/(mean(nodes$p) / c.dens))))), by = c("ID","ALTERID")]
rm(list=ls())
library(data.table)
library(intergraph)
library(igraph)
library(Matrix)
library(profvis)
library(ggplot2)
draw.var <- function(n, var1, rho, mean){
C <- matrix(rho, nrow = 2, ncol = 2)
diag(C) <- 1
C <- chol(C)
S <- rnorm(n, mean = mean)
S <- cbind(scale(var1)[1:n],S)
ZS <- S %*% C
return(ZS[,2])
}
set.seed(1123)
profvis({
# create empty list to store data
dt.list <- list()
npeople <- 500
dens <- .2
OC.impact <- FALSE
cor_iv_si <- .6
cor_iv_uc <- 0
cor_uc_oc <- 0.6
ncomp <- 4
beta_oc <- 2 # observed characteristics
beta_uc <- 2 # unobserved characteristics
beta_si <- 1
# create data.table
dt.people <- data.table(ego = 1:npeople)
# draw observed characteristics
dt.people[, OC := abs(rt(npeople,2))]
# draw unobserved variable
dt.people[, UC := draw.var(npeople, dt.people$OC, rho = cor_uc_oc,mean = 5)]
# set component identifier
dt.people$group <- cut_number(dt.people$UC, ncomp,labels = F)
for(q in 1:ncomp){
# subset comp
dt.sub <- dt.people[group == q]
# create undirected graph
nodes <- as.data.table(t(combn(dt.sub$ego, 2)))
setnames(nodes,c("ID","ALTERID"))
# add attributes
nodes <- merge(nodes,dt.people[,list(ID = ego, ID.UC = UC, ID.OC = OC)], by = "ID")
nodes <- merge(nodes,dt.people[,list(ALTERID = ego, ALTERID.UC = UC, ALTERID.OC = OC)], by = "ALTERID")
# calculate distance
nodes[,d := abs(ID.UC - ALTERID.UC)]
# estimate the appropriate density per component
n.edges <- (dens * (npeople * (npeople - 1)))/ncomp
n.nodes <- npeople/ncomp
c.dens <- n.edges/(n.nodes * (n.nodes - 1))
# estimate initial probability of tie based on distance
coefficient <- log(c.dens / (1 - c.dens))
alpha <- coefficient / mean(nodes$d)
nodes[,p := exp(alpha * d) / (1 + exp(alpha * d))]
# adjust probability to keep density
nodes[,p:= as.numeric(min(c(1, p * (1/(mean(nodes$p) / c.dens))))), by = c("ID","ALTERID")]
# simulate edges
nodes[, edge := sample(c(0,1),1, prob = c(1-p,p)), by = c("ID","ALTERID")]
# keep the edges
nodes <- nodes[edge == 1,list(ID,ALTERID)]
# bind the networks
if(q == 1){
net <- copy(nodes)
} else{
net <- rbind(net,nodes)
}
}
# create opposite direction
net <- rbind(net,net[,list(ID = ALTERID, ALTERID = ID)])
})
This incorporates @BenBolker's and @DavidArenburg's suggestions and also makes use of some of data.table's tools.
Non-Equi joins
The OP code loops through each group. One part of the code also uses combn and multiple joins to get the data in the right format. Using non-equi joins, we can combine all of those steps in one data.table call
dt_non_sub <- dt.people[dt.people,
on = .(ego < ego, group = group),
allow.cartesian = T,
nomatch = 0L,
.(group,
ALTERID = i.ego, ID = x.ego,
ID.UC = UC, ID.OC = OC,
ALTERID.OC = i.OC, ALTERID.UC = i.UC,
d = abs(UC - i.UC)) #added to be more efficient
]
# dt_non_sub[, d:= abs(ID.UC - ALTERID.UC)]
Vectorization
The original code was mostly slow because of the two calls with by groupings. Since each call split the data.table into around 8,000 individual groups, there were 8,000 function calls each time. This eliminates those by using pmin, as suggested by @DavidArenburg, and then uses runif(.N) < p, as suggested by @BenBolker. My addition was that, since your final result doesn't seem to care about p, I only assigned edge, using {} to return only the last thing calculated in the call.
# alpha <- coefficient / mean(nodes$d)
dt_non_sub[,
edge := {
alpha = coefficient / mean(d)
p = exp(alpha * d) / (1 + exp(alpha * d))
p_mean = mean(p)
p = pmin(1, p * (1/(p_mean / c.dens)))
as.numeric(runif(.N)<p)
}
, by = .(group)]
net2 <- rbindlist(list(dt_non_sub[edge == 1, .(group, ALTERID, ID)],
                       dt_non_sub[edge == 1, .(group, ID = ALTERID, ALTERID = ID)]))
One thing to note is that the vectorization is not 100% identical. Your code was recursive: each by-group step updated mean(nodes$p) for the next ID, ALTERID group. If you need that recursive part of the call, there is not much that can be done to make it faster.
In the end, the modified code runs in 20 ms vs. the 810 ms of your original function. The results, while different, are somewhat similar in the total number of results:
Original:
net
ID ALTERID
1: 5 10
2: 10 14
3: 5 25
4: 10 25
5: 14 25
---
48646: 498 458
48647: 498 477
48648: 498 486
48649: 498 487
48650: 498 493
Modified:
net2
group ALTERID ID
1: 2 4 3
2: 2 6 4
3: 4 7 1
4: 4 8 7
5: 2 9 4
---
49512: 3 460 500
49513: 3 465 500
49514: 3 478 500
49515: 3 482 500
49516: 3 497 500
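Since the whole point of the simulation is the target density dens = 0.2, a quick hedged check on both results is to compare the realized edge counts against the number of ordered pairs (both nets store each edge in both directions):
# Realized density vs. the target dens = 0.2; both should come out close to 0.2
nrow(net) / (npeople * (npeople - 1))   # original loop
nrow(net2) / (npeople * (npeople - 1))  # non-equi join + vectorized version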

Best function for modelling diminishing returns

I am visiting a bird sanctuary that has many different species of birds. Some species are more numerous while other species are less numerous. I came back to the sanctuary 9 times, and after every visit I calculate the cumulative number of species I have observed. Unsurprisingly, there are diminishing returns across my visits, since I observe the most numerous species on every visit, but that does not increase the count of observed species. What is the best function in R to predict how many species I will have observed by my 20th visit?
Here is the data.frame
d <- structure(list(visit = 1:9,
totalNumSpeciesObserved = c(200.903, 296.329, 370.018, 431.59, 485.14, 533.233, 576.595, 616.536, 654)),
class = "data.frame", row.names = c(NA, 9L))
I expect to see a model that fits the data well and behaves in a "log-like" fashion, predicting diminishing returns.
For advice on how to best ask a question, Stack Overflow has some good guidance: https://stackoverflow.com/help/how-to-ask
If you're trying to model this, I might take the approach of a regression on the square root of the independent variable, based on the data. It is kind of strange to think about it as a function of visits, though; maybe if they were evenly spaced time periods it would make more sense.
d <- structure(list(visit = 1:9,
totalNumSpeciesObserved = c(200.903, 296.329, 370.018, 431.59, 485.14, 533.233, 576.595, 616.536, 654)),
class = "data.frame", row.names = c(NA, 9L))
mod <- lm(totalNumSpeciesObserved ~ I(sqrt(visit)), d)
new.df <- data.frame(visit=1:13)
out <- predict(mod, newdata = new.df)
plot(d, type = 'o',pch = 16, xlim = c(1,13), ylim = c(200,800), lwd = 2, cex = 2)
points(out, type= 'o', pch = 21, col = "blue", cex = 2)
The I() wrapper allows you to transform the independent variable on the fly, hence the use of sqrt() without needing to save a new variable.
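Since the question describes the expected behaviour as "log-like", a log-transformed predictor is another natural candidate; a minimal sketch comparing it to the square-root fit above (mod and d as defined there, new.df20 is just a new prediction grid):
# Fit a log(visit) model and line the two predictions up through visit 20
mod_log <- lm(totalNumSpeciesObserved ~ log(visit), d)
new.df20 <- data.frame(visit = 1:20)
cbind(new.df20,
      sqrt_fit = predict(mod, newdata = new.df20),
      log_fit = predict(mod_log, newdata = new.df20))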
I also don't know if this helps, but you could build a simulator to test for asymptotic behaviour. For example, you could build a population:
population <- sample(size = 1e6, LETTERS[1:20],
replace = TRUE, prob = 1/(2:21)^2)
This would say there are 20 species with decreasing abundance in your population (expand as you wish).
Then you could simulate visits and information about each visit. For example, how large is the sample on each visit? During a visit you only see 1% of the rainforest, etc.
sim_visits <- function(visits, percent_obs, population){
species_viewed <- vector()
unique_views <- vector()
for(i in 1:visits){
my_samp <- sample(x = population, size = round(percent_obs*length(population),0),
replace = FALSE)
species_viewed <- c(species_viewed, my_samp)
unique_views[i] <- length(unique(species_viewed))
}
new_observed <- unique_views - dplyr::lag(unique_views, 1, 0)
df <- data.frame(unique_views = unique_views, new_observed)
df$cummulative <- cumsum(unique_views)
df
}
And then you could draw from the simulation many times and see what distribution of values you get.
sim_visits(9, percent_obs = .001, population = population)
unique_views new_observed cummulative
1 13 13 13
2 15 2 28
3 15 0 43
4 17 2 60
5 17 0 77
6 17 0 94
7 17 0 111
8 17 0 128
9 17 0 145
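To actually draw from the simulation many times, a minimal sketch is to replicate it and look at the spread of the species count after 9 visits:
# Repeat the 9-visit simulation 200 times and summarise the final species count
many_runs <- replicate(200, sim_visits(9, percent_obs = .001, population = population)$unique_views[9])
summary(many_runs)
hist(many_runs, main = "Unique species after 9 visits", xlab = "Number of species")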
And I don't know if this is helpful, but I find simulation a good way to conceptualise problems like these.

I do not know how to plot the probability distribution of outcomes of some code in R

I have created a program that simulates the throwing of dice 100 times. I need help with adding up the results of the individual dice and also how to plot the probability distribution of outcomes.
This is the code I have:
sample(1:6, size=100, replace = TRUE)
So far, what you've done is sample the dice throws (note I've added a line setting the seed for reproducibility):
set.seed(123)
x <- sample(1:6, size=100, replace = TRUE)
The simple command to "add[] up the results of the individual dice" is table():
table(x)
# x
# 1 2 3 4 5 6
# 17 16 20 14 18 15
Then, to "plot the probability distribution of outcomes," we must first get that distribution; luckily R provides the handy prop.table() function, which works for this sort of discrete distribution:
prop.table(table(x))
# x
# 1 2 3 4 5 6
# 0.17 0.16 0.20 0.14 0.18 0.15
Then we can easily plot it; for plotting PMFs, my preferred plot type is "h":
y <- prop.table(table(x))
plot(y, type = "h", xlab = "Dice Result", ylab = "Probability")
Update: Weighted die
sample() can easily be used to simulate a weighted die using its prob argument. From help("sample"):
Usage
sample(x, size, replace = FALSE, prob = NULL)
Arguments
[some content omitted]
prob a vector of probability weights for obtaining the elements of the vector being sampled.
So, we just add your preferred weights to the prob argument and proceed as usual (note I've also upped your sample size from 100 to 10000):
set.seed(123)
die_weights <- c(4/37, rep(6/37, 4), 9/37)
x <- sample(1:6, size = 10000, replace = TRUE, prob = die_weights)
(y <- prop.table(table(x)))
# x
# 1 2 3 4 5 6
# 0.1021 0.1641 0.1619 0.1691 0.1616 0.2412
plot(y, type = "h", xlab = "Dice Result", ylab = "Probability")
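As a quick check that the weighting worked, the empirical proportions can be set next to the weights the die was given (using y and die_weights from above):
# Empirical proportions from the simulation vs. the probabilities passed to sample()
round(rbind(simulated = c(y), specified = die_weights), 3)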

Generate multiple plots in base R with loop function then concatenate by matching group variables

I have a data frame (below; my apologies for the verbose code, this is my first attempt at generating reproducible random data) that I'd like to loop through to generate individual plots in base R (specifically, ethograms) for each subject's day and video clip (e.g. subj-1/day1/clipB). After generating n graphs, I'd like to concatenate a PDF for each subject that includes all days + clips, with each row corresponding to a single day. I haven't been able to get past generating the individual graphs, however, so any help would be greatly appreciated!
Data frame
n <- 20000
library(stringi)
test <- as.data.frame(sprintf("%s", stri_rand_strings(n, 2, '[A-Z]')))
colnames(test)<-c("Subj")
test$Day <- sample(1:3, size=length(test$Subj), replace=TRUE)
test$Time <- sample(0:600, size=length(test$Subj), replace=TRUE)
test$Behavior <- as.factor(sample(c("peck", "eat", "drink", "fly", "sleep"), size = length(test$Time), replace=TRUE))
test$Vid_Clip <- sample(c("Clip_A", "Clip_B", "Clip_C"), size = length(test$Time), replace=TRUE)
Sample data from data frame:
> head(test)
Subj Day Time Behavior Vid_Clip
1 BX 1 257 drink Clip_B
2 NP 2 206 sleep Clip_B
3 ZF 1 278 peck Clip_B
4 MF 2 391 sleep Clip_A
5 VE 1 253 fly Clip_C
6 ID 2 359 eat Clip_C
After adapting this code, I am able to successfully generate a single plot (one at a time):
Subset single subj/day/clip:
single_subj_day_clip <- test[test$Vid_Clip == "Clip_B" & test$Subj == "AA" & test$Day == 1,]
After which, I can generate the graph I'm after by running the following lines:
beh_numb <- nlevels(single_subj_day_clip$Behavior)
mar.default <- c(5,4,4,2) + 0.1
par(mar = mar.default + c(0, 4, 0, 0))
plot(single_subj_day_clip$Time,
xlim=c(0,max(single_subj_day_clip$Time)), ylim=c(0, beh_numb), type="n",
ann=F, yaxt="n", frame.plot=F)
for (i in 1:length(single_subj_day_clip$Behavior)) {
ytop <- as.numeric(single_subj_day_clip$Behavior[i])
ybottom <- ytop - 0.5
rect(xleft=single_subj_day_clip$Subj[i], xright=single_subj_day_clip$Time[i+1],
ybottom=ybottom, ytop=ytop, col = ybottom)}
axis(side=2, at = (1:beh_numb -0.25), labels=levels(single_subj_day_clip$Behavior), las = 1)
mtext(text="Time (sec)", side=1, line=3, las=1)
Example graph from randomly generated data (sorry for the link - as a new SO user, I can't embed an image directly until I'm at 10 reputation points)
Example graph from actual data
Ideal per subject graph
Thank you all in advance for your input.
Cheers,
Dan
New and hopefully correct answer
The code is too long to post here, so there is a link to the Dropbox folder with the data and code. You can check this HTML document or run this .Rmd file on your machine. Please check that all required packages are installed. The output of the script is there as well.
There is an additional problem in the analysis: some events are registered only once, at a single time point between other events, so such bars have no "width". I assigned a width of 1000 ms to such events, so some of them (around 100 per 20,000 observations) are out of scale if they are at the beginning or at the end of the experiment (and if the width for such events is set to zero). You can play with the code to fix this behavior.
Another problem is that the same factor levels get different colors on different plots. I need some fresh air to fix that as well.
Looking at the graphs, you may notice that some observations with a very short duration seem to overlap other observations. But if you zoom the PDF in to the maximum, you will see that they do not, and that there are 'holes' in the underlying intervals where they are supposed to sit.
Lines connecting the intervals for the different kinds of behavior help to follow the time course of the experiment. You can uncomment the corresponding parts of the code if you wish.
Please let me know if it works.
Old answer
I am not sure it is the best way to do it, but you can probably use split() and then lapply over your tables:
Split your data.frame by Subj, Day, and Vid_clip:
testl <- split(test, test[, c(1, 2, 5)], drop = T)
testl[[1123]]
# Subj Day Time Behavior Vid_Clip
#8220 ST 2 303 fly Clip_A
#9466 ST 2 463 fly Clip_A
#9604 ST 2 32 peck Clip_A
#10659 ST 2 136 peck Clip_A
#13126 ST 2 47 fly Clip_A
#14458 ST 2 544 peck Clip_A
Loop through the list with your data and plot to .pdf:
mar.default <- c(5,4,4,2) + 0.1
par(mar = mar.default + c(0, 4, 0, 0))
nbeh = nlevels(test$Behavior)
pdf("plots.pdf")
invisible(
lapply(testl, function(l){
plot(x = l$Time, xlim = c(0, max(l$Time)), ylim = c(0, nbeh),
type = "n", ann = F, yaxt = "n", frame.plot = F)
lapply(1:nbeh, function(i){
ytop <- as.numeric(l$Behavior[i]); ybot <- ytop - .5
rect(l$Subj[i], ybot, l$Time[i + 1], ytop, col = ybot)
})
axis(side = 2, at = 1:nbeh - .25, labels = levels(l$Behavior), las = 1)
mtext(text = "Time (sec)", side = 1, line = 3, las = 1)
})
)
dev.off()
You should probably check the output here before you run the code on your PC. I didn't edit your plot code much, so please double-check it.
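If you need one PDF per subject rather than a single file for everything, a hedged sketch is to filter the split list by its names (split() builds them as "Subj.Day.Clip" here) and open a separate device per subject; the rect() drawing from the plot code above would go inside the same function:
# One PDF per subject: reuse testl, nbeh, and the plotting code, filtered by subject prefix
for (s in unique(test$Subj)) {
  subj_list <- testl[startsWith(names(testl), paste0(s, "."))]
  if (length(subj_list) == 0) next
  pdf(paste0("plots_", s, ".pdf"))
  invisible(
    lapply(subj_list, function(l) {
      plot(x = l$Time, xlim = c(0, max(l$Time)), ylim = c(0, nbeh),
           type = "n", ann = FALSE, yaxt = "n", frame.plot = FALSE)
      # ... rect() calls for the behaviour bars go here, as in the answer above ...
      axis(side = 2, at = 1:nbeh - .25, labels = levels(l$Behavior), las = 1)
      mtext(text = "Time (sec)", side = 1, line = 3, las = 1)
    })
  )
  dev.off()
}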
