I need to simulate a vote cast ([0] = Reject, [1] = Approve) by a fictitious population according to each individual's "probability to approve" (i.e., their probability of casting a vote of [1]). Each individual (id) has a home state (uf), which supposedly has a different probability to approve (prob_approve) and which is considered known in this toy example.
Here is an example of the data:
pop_tab <- read.table(header=TRUE,sep=',',text = "id,uf,prob_approve
1,SC,0.528788386
2,AM,0.391834279
3,RJ,0.805862415
4,SP,0.762671162
5,CE,0.168054353
6,MG,0.78433876
7,PR,0.529794529
8,PA,0.334581091
9,SP,0.762671162
10,PA,0.334581091")
I tried:
x <- list(prob = pop_tab$prob_approve)
vote <- lapply(x, runif)
... but I don't think runif() was actually driven by the probabilities in the "prob_approve" column.
How could I simulate the vote cast by the population, according to their home-state probabilities, in a single command, without having to process it row by row in a for loop?
Thank you in advance.
Use rbinom():
pop_tab <- read.table(header=TRUE,sep=',',text = "id,uf,prob_approve
1,SC,0.528788386
2,AM,0.391834279
3,RJ,0.805862415
4,SP,0.762671162
5,CE,0.168054353
6,MG,0.78433876
7,PR,0.529794529
8,PA,0.334581091
9,SP,0.762671162
10,PA,0.334581091")
rbinom(n = nrow(pop_tab),
       size = 1,
       prob = pop_tab$prob_approve)
## [1] 0 0 1 0 0 1 1 1 1 1
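As a small follow-up sketch (the vote column name and the set.seed() call are just my choices), you can attach the draws to the table and check that, for a larger population, the approval rate per state lands near prob_approve:
set.seed(42)  # for reproducibility
pop_tab$vote <- rbinom(n = nrow(pop_tab), size = 1, prob = pop_tab$prob_approve)

# With many individuals per state, the observed approval rate
# should be close to the assumed prob_approve.
aggregate(vote ~ uf, data = pop_tab, FUN = mean)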
I am visiting a bird sanctuary that has many different species of birds. Some species are more numerous, while others are less so. I have come back to the sanctuary 9 times, and after every visit I calculate the total number of species I have observed so far. Unsurprisingly, there are diminishing returns across visits, since I see the most numerous species on every visit, but that does not increase the count of observed species. What is the best function in R to predict how many species I will have observed by my 20th visit?
Here is the data.frame
d <- structure(list(visit = 1:9,
totalNumSpeciesObserved = c(200.903, 296.329, 370.018, 431.59, 485.14, 533.233, 576.595, 616.536, 654)),
class = "data.frame", row.names = c(NA, 9L))
I expect to see a model that fits the data well and behaves in a "log-like" fashion, predicting diminishing returns.
To help with asking questions, Stack Overflow has some good guidance: https://stackoverflow.com/help/how-to-ask
If you're trying to model this, I might take the approach of a regression on the square root of the independent variable, based on the data. It's kind of strange to think about it as a function of visits, though... maybe if they were evenly spaced time periods it would make more sense.
d <- structure(list(visit = 1:9,
totalNumSpeciesObserved = c(200.903, 296.329, 370.018, 431.59, 485.14, 533.233, 576.595, 616.536, 654)),
class = "data.frame", row.names = c(NA, 9L))
mod <- lm(totalNumSpeciesObserved ~ I(sqrt(visit)), d)
new.df <- data.frame(visit=1:13)
out <- predict(mod, newdata = new.df)
plot(d, type = 'o',pch = 16, xlim = c(1,13), ylim = c(200,800), lwd = 2, cex = 2)
points(out, type= 'o', pch = 21, col = "blue", cex = 2)
The I() wrapper allows you to transform the independent variable on the fly, hence the use of sqrt() without needing to save a new variable.
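Since you mentioned expecting "log-like" behaviour, a log term is another reasonable candidate; here is a sketch of that alternative, extended out to the 20th visit you asked about (the model choice is just a suggestion, not the only option):
# Alternative: regress on log(visit) instead of sqrt(visit)
mod_log <- lm(totalNumSpeciesObserved ~ log(visit), d)

# Predict out to visit 20
predict(mod_log, newdata = data.frame(visit = 1:20))

# Compare the in-sample fit of the two candidate models
AIC(mod, mod_log)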
I also don't know if this helps, but you could build a simulator to test for asymptotic behaviour. For example, you could build a population:
population <- sample(size = 1e6, LETTERS[1:20],
replace = TRUE, prob = 1/(2:21)^2)
This would say there are 20 species, with decreasing probabilities, in your population (expand as you wish).
Then you could simulate visits and some information about each visit. For example, how large is the sample on your visit? During a visit you only see 1% of the rainforest, etc.
sim_visits <- function(visits, percent_obs, population){
  species_viewed <- vector()
  unique_views <- vector()
  for(i in 1:visits){
    # sample a fixed fraction of the population on each visit
    my_samp <- sample(x = population, size = round(percent_obs*length(population), 0),
                      replace = FALSE)
    # pool this visit's sightings with everything seen so far
    species_viewed <- c(species_viewed, my_samp)
    unique_views[i] <- length(unique(species_viewed))
  }
  # species newly observed on each visit
  new_observed <- unique_views - dplyr::lag(unique_views, 1, 0)
  df <- data.frame(unique_views = unique_views, new_observed)
  df$cummulative <- cumsum(unique_views)
  df
}
And then you could draw from the simulation many times and see what distribution of values you get.
sim_visits(9, percent_obs = .001, population = population)
  unique_views new_observed cummulative
1           13           13          13
2           15            2          28
3           15            0          43
4           17            2          60
5           17            0          77
6           17            0          94
7           17            0         111
8           17            0         128
9           17            0         145
And I don't know if this is helpful, but I find simulation a good way to conceptualise problems like these.
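To actually get that distribution, you could repeat the simulation; here is a minimal sketch (the 1% sampling fraction and the 500 repetitions are arbitrary choices on my part):
# Repeat a 20-visit simulation many times and look at the spread of
# unique species seen by visit 20.
set.seed(1)
reps <- replicate(500, sim_visits(20, percent_obs = .001, population = population)$unique_views[20])
summary(reps)
hist(reps, main = "Unique species seen by visit 20", xlab = "species")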
I'm trying to replicate an Excel Solver model in R. It's a simple problem to start: maximize points, with the only constraint being a limit on the number of events that can be played. So I have a two-column data frame with a tournament number and projected points. In Excel we have a binary Play Yes/No column; we multiply it by the points, set the objective to maximize, and allow the solver to change the Play Yes/No column to 0 or 1. The constraint limits the sum of the Play Yes/No variable to the constraint value, for example 25.
library(lpSolve)
tournament<-rep(1:48,1)
mean<-c(12.2,30.4,30.9,44.1,31.3,27.6,31.5,25.0,31.2,24.0,28.0,23.9,14.1,9.5,17.2,37.8,30.5,43.0,32.1,30.7,30.2,37.0,32.1,28.9,23.7,4.6,29.0,29.1,30.7,31.6,49.5,25.1,30.2,10.3,30.3,21.8,88.5,31.0,30.9,2.9,31.1,30.3,29.7,63.7,31.6,91.6,30.6,31.0)
aggdata<-data.frame(tournament,mean)
maxevents <-25
obj<-aggdata$mean
con <- rep(1,nrow(aggdata))
dir <- c("==")
rhs <- maxevents
result <- lp("max", obj, con, dir, rhs, all.bin = TRUE)
The result selects only 3 rows of the data frame, when it should select the top 25. Eventually I'll add additional constraints; I know lp() is not required for this simple example, but I need to get past this roadblock first.
library(lpSolve)

# objective function: the projected points for each tournament
obj <- aggdata$point

# constraints: one row of the matrix per constraint
# (you can add more constraints here and set 'nrow' to the total number of constraints)
con <- matrix(c(rep(1, nrow(aggdata)),     # number of tournaments played
                as.vector(aggdata$point)), # total points
              nrow = 2, byrow = TRUE)
dir <- c("==", "<=")
rhs <- c(25,    # total number of tournaments
         1000)  # let's assume that total points can't exceed 1000

# optimization solution
result <- lp("max", obj, con, dir, rhs, all.bin = TRUE)
result$solution
Sample data:
aggdata <- data.frame(tournament = rep(1:48,1),
point = c(12.2,30.4,30.9,44.1,31.3,27.6,31.5,25.0,31.2,24.0,28.0,23.9,14.1,
9.5,17.2,37.8,30.5,43.0,32.1,30.7,30.2,37.0,32.1,28.9,23.7,4.6,
29.0,29.1,30.7,31.6,49.5,25.1,30.2,10.3,30.3,21.8,88.5,31.0,30.9,
2.9,31.1,30.3,29.7,63.7,31.6,91.6,30.6,31.0))
# tournament point
#1 1 12.2
#2 2 30.4
#3 3 30.9
#4 4 44.1
#5 5 31.3
#6 6 27.6
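As a follow-up, result$solution comes back as a 0/1 vector in the same order as aggdata, so you can read off which tournaments were picked, for example:
# tournaments selected by the solver, and the total points they carry
picked <- aggdata[result$solution == 1, ]
picked
sum(picked$point)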
For a medical study I would like to calculate the eGFR, a measure of renal function, with an equation that requires certain input values: Scr (serum creatinine), ScysC (serum cystatin C), age, and sex-dependent constants, all of which are available in my dataset.
Please see the attached image for the equations: eGFR equation
So I am mainly struggling with the ifelse statements and the min/max terms. How do I write code that produces the output of this equation?
My first thought was to write a loop, but I don't know exactly how. Any help and time is very much appreciated :)
-EDIT-
NOTICE: it is important that the min term is never greater than 1 and the max term is never less than 1, so the ratio min/max is always <= 1.
E.g., a female with Scr = 0.9 gives Scr/k = 0.9/0.7 = 1.28, which results in min = 1 and max = 1.28.
A female with Scr = 0.6 gives Scr/k = 0.6/0.7 = 0.86, which results in min = 0.86 and max = 1.
Here is a sample of my data:
df <- data.frame(ID = c(1,2,3), AGE = c(36,36, 36),
CYSC = c(0.757, 1.34, 1.34), SCR = c(0.58, 0.68, 0.68), SEX = c(1,1,0))
#Male = 1, Female 0
#equation:
eGFR = 135 * min(Scr/k, 1)^a * max(Scr/k, 1)^-0.601 *
       min(ScysC/0.8, 1)^-0.375 * max(ScysC/0.8, 1)^-0.711 *
       0.995^Age * (0.969 if female)
(with k = 0.7 if F and k = 0.9 if M, and a = -0.248 if F and a = -0.207 if M)
OK, so I'm guessing at the structure of your data.frame. I've shown how I created mine for the test, since there seem to be more numbers than row names. I also assumed that 1 is male and 0 is female. Finally, I added a third, female patient for the test, with the same clinical results as male #2.
df <- data.frame(ID = c(1,2,3), AGE = c(36,36, 36), CYSC = c(51.614, 47.669, 47.669), SCR = c(0.75776, 1.34, 1.34), SEX = c(1,1,0))
male.idx <- df$SEX == 1
k <- rep(0.7, nrow(df))
k[male.idx] <- 0.9
a <- rep(-0.248, nrow(df))
a[male.idx] <- -0.207
eGFR <- 135 * pmin(df$SCR/k, 1)**a * pmax(df$SCR/k, 1)**-0.601 *
  pmin(df$CYSC/0.8, 1)**-0.375 * pmax(df$CYSC/0.8, 1)**-0.711 *
  0.995**df$AGE * ifelse(male.idx, 1, 0.969)
[edited for more accurate answer]
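If you need to reuse this, one option is to wrap the same vectorized arithmetic in a small helper. This is just a sketch: the function name is mine, and it assumes SEX is coded 1 = male / 0 = female and that SCR and CYSC are in the units the equation expects.
calc_eGFR <- function(scr, cysc, age, sex) {
  male <- sex == 1
  k <- ifelse(male, 0.9, 0.7)        # sex-dependent constants from the question
  a <- ifelse(male, -0.207, -0.248)
  135 * pmin(scr / k, 1)^a * pmax(scr / k, 1)^-0.601 *
    pmin(cysc / 0.8, 1)^-0.375 * pmax(cysc / 0.8, 1)^-0.711 *
    0.995^age * ifelse(male, 1, 0.969)  # 0.969 applies once, for females
}

df$eGFR <- calc_eGFR(df$SCR, df$CYSC, df$AGE, df$SEX)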
Update:
The following code should be reproducible:
someFrameA = data.frame(label="A", amount=rnorm(10000, 100, 20))
someFrameB = data.frame(label="B", amount=rnorm(1000, 50000, 20))
wholeFrame = rbind(someFrameA, someFrameB)
fit <- e1071::naiveBayes(label ~ amount, wholeFrame)
wholeFrame$predicted = predict(fit, wholeFrame)
nrow(subset(wholeFrame, predicted != label))
In my case, this gave 243 misclassifications.
Note these rows:
(row num, label, amount, prediction)
10252 B 50024.81895 A
2955 A 100.55977 A
10678 B 50010.26213 B
While the input is only different by 12.6, the classification changes. It's curious that the posterior probabilities for rows like this are so close:
> predict(fit, wholeFrame[10683, ], type="raw")
A B
[1,] 0.5332296 0.4667704
Original Question:
I am trying to classify some bank transactions using the transaction amount. I had many other text based features in my original model, but noticed something fishy when using just the numeric one.
> head(trainingSet)
category amount
1 check 688.00
2 non-businesstransaction 2.50
3 non-businesstransaction 36.00
4 non-businesstransaction 243.22
5 payroll 302.22
6 non-businesstransaction 16.18
fit <- e1071::naiveBayes(category ~ amount, data=trainingSet)
fit
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
bankfee check creditcardpayment e-commercedeposit insurance
0.029798103 0.189613233 0.054001459 0.018973486 0.008270494
intrabanktransfer loanpayment mcapayment non-businesstransaction nsf
0.045001216 0.015689613 0.011432741 0.563853077 0.023351982
other payroll taxpayment utilitypayment
0.003405497 0.014838239 0.005716371 0.016054488
Conditional probabilities:
amount
Y [,1] [,2]
bankfee 103.58490 533.67098
check 803.44668 2172.12515
creditcardpayment 819.27502 2683.43571
e-commercedeposit 42.15026 59.24806
insurance 302.16500 727.52321
intrabanktransfer 1795.54065 11080.73658
loanpayment 308.43233 387.71165
mcapayment 356.62755 508.02412
non-businesstransaction 162.41626 951.65934
nsf 44.92198 78.70680
other 9374.81071 18074.36629
payroll 1192.79639 2155.32633
taxpayment 1170.74340 1164.08019
utilitypayment 362.13409 1064.16875
According to the e1071 docs, the first column for "conditional probabilities" is the mean of the numeric variable, and the other is the standard deviation. These means and stdevs are correct, as are the apriori probabilities.
So, it's troubling that this row:
> thatRow
category amount
40 other 11268.53
receives these posteriors:
> predict(fit, newdata=thatRow, type="raw")
bankfee check creditcardpayment e-commercedeposit insurance intrabanktransfer loanpayment mcapayment
[1,] 4.634535e-96 7.28883e-06 9.401975e-05 0.4358822 4.778703e-51 0.02582751 1.103762e-174 1.358662e-101
non-businesstransaction nsf other payroll taxpayment utilitypayment
[1,] 1.446923e-29 0.5364704 0.001717378 1.133719e-06 2.059156e-18 2.149142e-24
Note that "nsf" has about 300X the score than "other" does. Since this transaction has an amount of 11.2k dollars, if it were to follow that "nsf" distribution, it would be over 100 standard deviations from the mean. Meanwhile, since "other" transactions have a sample mean of about 9k dollars with a large standard deviation, I would think that this transaction is much more probable as an "other". While "nsf" is more likely wrt the prior probabilities, they aren't so different as to outweigh that tail observation, and there are plenty of other viable candidates besides "other" as well.
I was assuming that this package just evaluated the normal(mu = sample mean, sd = sample sd) pdf and used that value in the product, but is that not the case? I can't quite figure out how to see the source.
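To be concrete, this is the calculation I was assuming, done by hand against the fitted object (a sketch; I'm assuming the fitted means/sds sit in fit$tables$amount and the prior information in fit$apriori, which the printout above suggests, and normalizing in case apriori is stored as counts). getS3method() is one way to pull up the method's source, assuming e1071 is loaded:
# inspect the predict method's source
getS3method("predict", "naiveBayes")

# the "Gaussian likelihood times prior" calculation I had in mind
amt   <- 11268.53
lik   <- dnorm(amt, mean = fit$tables$amount[, 1], sd = fit$tables$amount[, 2])
prior <- fit$apriori / sum(fit$apriori)
lik * prior / sum(lik * prior)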
Datatypes seem to be fine too:
> class(trainingSet$amount)
[1] "numeric"
> class(trainingSet$category)
[1] "factor"
The "naive bayes classifier for discrete predictors" in the printout is maybe odd, since this is a continuous predictor, but I assume this package can handle continuous predictors.
I had similar results with the klaR package. Maybe I need to set the kernel option on that?
The threshold argument is a large part of this. The code in the package has a bit like this:
L <- sapply(1:nrow(newdata), function(i) {
    ndata <- newdata[i, ]
    L <- log(object$apriori) + apply(log(sapply(seq_along(attribs),
        function(v) {
            nd <- ndata[attribs[v]]
            if (is.na(nd)) rep(1, length(object$apriori)) else {
                prob <- if (isnumeric[attribs[v]]) {
                    msd <- object$tables[[v]]
                    msd[, 2][msd[, 2] <= eps] <- threshold
                    dnorm(nd, msd[, 1], msd[, 2])
                } else object$tables[[v]][, nd]
                prob[prob <= eps] <- threshold
                prob
            }
The threshold (and this is documented) will replace any probability that is at most eps. So, if the normal pdf for the continuous variable is 0.000000000, it becomes 0.001 by default.
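A quick sketch of why that matters here: the class-A density at a B-sized amount underflows to zero, so it is replaced by the default threshold of 0.001, which is then just large enough to win once the roughly 10:1 prior is applied (the 50024 below is one of the misclassified amounts from above).
dnorm(50024, mean = 100,   sd = 20)   # 0: underflows, so predict() substitutes 0.001
dnorm(50024, mean = 50000, sd = 20)   # ~0.0097: the honest class-B density

# posterior comparison, with priors of roughly 10/11 for A and 1/11 for B
0.001 * (10/11)                       # floored A likelihood x prior
dnorm(50024, 50000, 20) * (1/11)      # B likelihood x prior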
> wholeFrame$predicted = predict(fit, wholeFrame, threshold=0.001)
> nrow(subset(wholeFrame, predicted != label))
[1] 249
> wholeFrame$predicted = predict(fit, wholeFrame, threshold=0.0001)
> nrow(subset(wholeFrame, predicted != label))
[1] 17
> wholeFrame$predicted = predict(fit, wholeFrame, threshold=0.00001)
> nrow(subset(wholeFrame, predicted != label))
[1] 3
Now, I believe that the quantities returned by the sapply are incorrect: when "debugging" it, I got something like .012 for what should have been dnorm(49990, 100, 20), and I think something gets left out or mixed up with the mean and standard deviation matrix. In any case, setting the threshold will help with this.
With the default threshold, A ends up with the higher posterior whenever 0.001*(10/11) > pdfB*(1/11), i.e. whenever pdfB happens to be less than 0.01.
> dnorm(49977, 50000, 20)
[1] 0.01029681
> 2*pnorm(49977, 50000, 20)
[1] 0.2501439
And since there were 1000 observations in class B, we should expect about 250 misclassifications, which is pretty close to the original 243.
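A sketch of that back-of-envelope check: find the amount at which the class-B density drops to 0.01, then ask how often a class-B draw lands at least that far from its mean (uniroot() is just one way to locate the crossover).
# amount at which dnorm(x, 50000, 20) falls to 0.01 (upper side)
cutoff <- uniroot(function(x) dnorm(x, 50000, 20) - 0.01, c(50000, 50200))$root
cutoff                              # about 50023.5

# two-sided probability of a B draw being at least that far out
2 * pnorm(-(cutoff - 50000) / 20)   # about 0.24, i.e. ~240 of the 1000 B's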