creating a function to calculate auc in R

I am extremely new with R, I have an assignment that I'm working on that I am having a lot of trouble with. I have defined a discrete probability distribution:
s P(s)
0 1/9
1 4/9
2 1/9
3 0/9
4 1/9
5 0/9
6 0/9
7 1/9
8 0/9
9 1/9
Now I have to work on this question:
Consistent with other distributions available in R, create a
family of support functions for your probability distribution:
f = dsidp(d) # pmf - the height of the curve/bar for digit d
p = psidp(d) # cdf - the probability of a value being d or less
d = qsidp(p) # icdf - the digit corresponding to the given
# cumulative probability p
d[] = rsidp(n) # generate n random digits based on your probability distribution.
If someone could help me get started on writing these functions, it would be greatly appreciated!

Firstly, read the data:
dat <- read.table(text = "s P(s)
0 1/9
1 4/9
2 1/9
3 0/9
4 1/9
5 0/9
6 0/9
7 1/9
8 0/9
9 1/9", header = TRUE, stringsAsFactors = FALSE)
names(dat) <- c("s", "P")
Transform the fractions (represented as strings) to numeric values:
dat$P <- sapply(strsplit(dat$P, "/"), function(x) as.numeric(x[1]) / as.numeric(x[2]))
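As a quick sanity check, the parsed probabilities should sum to one:
stopifnot(isTRUE(all.equal(sum(dat$P), 1)))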
The functions:
# pmf - the height of the curve/bar for digit d
dsidp <- function(d) {
  with(dat, P[s == d])
}
# cdf - the probability of a value being d or less
psidp <- function(d) {
  with(dat, cumsum(P)[s == d])
}
# icdf - the digit corresponding to the given cumulative probability p
qsidp <- function(p) {
  ## all.equal() returns TRUE or a string describing the mismatch,
  ## hence the comparison against "TRUE"
  with(dat, s[sapply(cumsum(P), all.equal, p) == "TRUE"][1])
}
Note: since some probabilities are zero, some digits have identical cumulative probabilities. In these cases, qsidp returns the lowest such digit.
# generate n random digits based on your probability distribution.
rsidp <- function(n) {
  with(dat, sample(s, n, TRUE, P))
}
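A quick usage check, plus a sketch of a more conventional inverse CDF (qsidp2 is a hypothetical name, not part of the assignment's interface): qsidp above only matches exact jump points of the CDF, while the conventional definition returns the smallest digit whose cumulative probability is at least p.
dsidp(1)     # 4/9, the height of the bar at digit 1
psidp(4)     # 7/9, the probability of a digit being 4 or less

## sketch: smallest digit whose cumulative probability is >= p
qsidp2 <- function(p) {
  with(dat, s[which(cumsum(P) >= p)[1]])
}
qsidp2(0.5)  # 1, since P(s <= 1) = 5/9 >= 0.5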

Related

Generate totals of multinomial distribution directly

Let's assume we want to generate n samples from a multinomial distribution with given probabilities p. This works well with sample or rmultinom. The totals can then be counted with table. Now I wonder if there is a direct way (or another distribution) available to get the result of table directly, without generating complete sample vectors.
Here an example:
set.seed(123)
n <- 10000 # sample size
p <- c(0.1, 0.2, 0.7) # probabilities, sum up to 1.0
## 1) approach with sample
x <- sample(1:3, size = n, prob = p, replace = TRUE)
table(x)
# x
# 1 2 3
# 945 2007 7048
## 2) approach with rmultinom
x <- rmultinom(n, size = 1, prob = p) * 1:3
table(x[x != 0])
# 1 2 3
# 987 1967 7046
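For the direct route asked about, a sketch: a single multinomial draw of size n returns the per-category totals without materializing an n-long sample vector.
## 3) direct approach: one draw of size n yields the category counts
set.seed(123)
rmultinom(1, size = n, prob = p)
## a 3 x 1 matrix of counts that sum to n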

Prediction with lm

I have the following data frame:
lm meanresids sdresids resid1 resid2 resid3 intercept beta
1 0.000000e+00 6.2806844 -3.6261548 7.2523096 -3.6261548 103.62615 24.989340
2 -2.960595e-16 8.7515899 -5.0527328 10.1054656 -5.0527328 141.96786 -1.047323
3 -2.960595e-16 5.9138984 -3.4143908 6.8287817 -3.4143908 206.29046 -26.448694
4 3.700743e-17 0.5110845 0.2950748 -0.5901495 0.2950748 240.89801 -35.806642
5 7.401487e-16 6.6260504 3.8255520 -7.6511040 3.8255520 187.03479 -23.444762
6 5.921189e-16 8.7217431 5.0355007 -10.0710014 5.0355007 41.43239 3.138396
7 0.000000e+00 5.5269434 3.1909823 -6.3819645 3.1909823 -119.90628 27.817845
8 -1.480297e-16 1.0204260 -0.5891432 1.1782864 -0.5891432 -180.33773 35.623363
9 -5.921189e-16 6.9488186 -4.0119023 8.0238046 -4.0119023 -64.72245 21.820226
10 -8.881784e-16 8.6621512 -5.0010953 10.0021906 -5.0010953 191.65339 -5.218767
Each row represents an estimated linear model with window length 3. I used rollapply on a separate dataframe with the function lm(y~t) to extract the coefficients and intercepts into a new dataframe, which I have combined with the residuals from the same model and their corresponding means and standard deviations.
Since the window length is 3, it implies that there are 3 residuals as shown, per model, in resid 1, resid 2 and resid 3. The mean and sd of these are included accordingly.
I am seeking to predict the next observation, in essence time k+1, where k is the index of the last observation in the window, using the intercept and beta.
Recall that lm1 takes observations 1,2,3 to estimate the intercept and the beta, and lm2 takes 2,3,4, lm3 takes 3,4,5, etc. The function for the prediction should be:
predict_lm1 = intercept_lm1 + beta_lm1*(k+1)
Where k+1 = 4. For lm2:
predict_lm2 = intercept_lm2 + beta_lm2*(k+1)
Where k+1 = 5.
Clearly, k increases by 1 every time I move down one row in the dataset. This is because the explanatory variable is time, t, which is a sequence increasing by one per observation.
Should I use a for loop, or an apply function here?
How can I make a function that iterates down the rows and calculates the predictions accordingly with the information found in that row?
Thanks.
EDIT:
I managed to find a possible solution by writing the following:
n = nrow(dataset)
for (i in n) {
  predictions = dataset$Intercept + dataset$beta * (k + 1)
}
However, k does not increase by 1 per iteration. Thus, k+1 is always = 4.
How can I make sure k increases by 1 accordingly?
EDIT 2
I managed to add 1 to k by writing the following:
n = nrow(dataset)
for (i in n) {
  x = 0
  x[i] = k + 1
  preds = dataset$`(Intercept)` + dataset$t * (x[i])
}
However, the first prediction is overestimated. It should be 203, whereas it is estimated as 228, implying that it sets the explanatory variable as 1 too high.
Yet, the second prediction is correct. I am not sure what I am doing wrong. Any advice?
EDIT 3
I managed to find a solution as follows:
n = nrow(dataset)
for (i in n) {
  x = k + 1
  preds = dataset$`(Intercept)` + dataset$t * (x)
  x = x + 1
}
Your loop is not iterating: for(i in n) runs the body only once, with i equal to n. Use for(i in 1:n) so i steps through every row:
dataset <- read.table(text="lm meanresids sdresids resid1 resid2 resid3 intercept beta
1 0.000000e+00 6.2806844 -3.6261548 7.2523096 -3.6261548 103.62615 24.989340
2 -2.960595e-16 8.7515899 -5.0527328 10.1054656 -5.0527328 141.96786 -1.047323
3 -2.960595e-16 5.9138984 -3.4143908 6.8287817 -3.4143908 206.29046 -26.448694
4 3.700743e-17 0.5110845 0.2950748 -0.5901495 0.2950748 240.89801 -35.806642
5 7.401487e-16 6.6260504 3.8255520 -7.6511040 3.8255520 187.03479 -23.444762
6 5.921189e-16 8.7217431 5.0355007 -10.0710014 5.0355007 41.43239 3.138396
7 0.000000e+00 5.5269434 3.1909823 -6.3819645 3.1909823 -119.90628 27.817845
8 -1.480297e-16 1.0204260 -0.5891432 1.1782864 -0.5891432 -180.33773 35.623363
9 -5.921189e-16 6.9488186 -4.0119023 8.0238046 -4.0119023 -64.72245 21.820226
10 -8.881784e-16 8.6621512 -5.0010953 10.0021906 -5.0010953 191.65339 -5.218767", header=T)
n <- nrow(dataset)
k <- 3  ## the window length, per the question
predictions <- data.frame()
for (i in 1:n) {
  ## model i is fit on observations i .. i+k-1, so it predicts time i+k
  predictions <- rbind(predictions, dataset$intercept[i] + dataset$beta[i] * (i + k))
}
predictions
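For what it's worth, the loop can be avoided entirely. A vectorized sketch, assuming as above that the window length is k = 3, so model i predicts time i + k:
k <- 3  ## window length, as in the question
dataset$pred <- dataset$intercept + dataset$beta * (seq_len(nrow(dataset)) + k)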

Optimizing K-means clustering using Genetic Algorithm

I have the following dataset (obtained here):
item survivalpoints weight
1 pocketknife 10 1
2 beans 20 5
3 potatoes 15 10
4 unions 2 1
5 sleeping bag 30 7
6 rope 10 5
7 compass 30 1
I can cluster this dataset into three clusters with kmeans(), using a binary string as my initial choice of centers. For example:
## 1 represents the initial centers
chromosome = c(1,1,1,0,0,0,0)
## exclude the first column (kmeans only supports continuous data)
cl <- kmeans(dataset[, -1], dataset[chromosome == 1, -1])
## check the memberships
cl$cluster
# [1] 1 3 3 1 2 1 2
Using this fundamental concept, I tried it out with the GA package to conduct the search, where I am trying to optimize (minimize) the Davies-Bouldin (DB) index.
library(GA) ## for ga() function
library(clusterSim) ## for index.DB() function
## defining my fitness function (Davies-Bouldin)
DBI <- function(x) {
  ## converting matrix to vector to access each row
  binary_rep <- split(x, row(x))
  ## evaluate the fitness of each chromosome
  for (each in 1:nrow(x)) {
    cl <- kmeans(dataset, dataset[binary_rep[[each]] == 1, -1])
    dbi <- index.DB(dataset, cl$cluster, centrotypes = "centroids")
    ## minimizing db
    return(-dbi)
  }
}
g <- ga(type = "binary", fitness = DBI, popSize = 100, nBits = nrow(dataset))
Of course (I have no idea what's happening), I received the following error message:
Warning messages:
Error in row(x) : a matrix-like object is required as argument to 'row'
Here are my questions:
How can I correctly use the GA package to solve my problem?
How can I make sure the randomly generated chromosomes contains the same number of 1s which corresponds to k number of clusters (eg. if k=3 then the chromosome must contain exactly three 1s)?
I can't comment on the sense of combining k-means with a GA, but I can point out that you had an issue in your fitness function. Also, errors are produced when all genes are on or all are off, so the fitness is only calculated when that is not the case:
DBI <- function(x) {
  ## all genes on or all off gives no valid set of centers
  if (sum(x) == nrow(dataset) || sum(x) == 0) {
    score <- -Inf
  } else {
    cl <- kmeans(dataset[, -1], dataset[x == 1, -1])
    dbi <- index.DB(dataset[, -1], cl = cl$cluster, centrotypes = "centroids")
    ## ga() maximizes the fitness, so negate the DB index to minimize it
    score <- -dbi$DB
  }
  return(score)
}
g <- ga(type = "binary", fitness = DBI, popSize = 100, nBits = nrow(dataset))
plot(g)
g@solution
g@fitnessValue
Looks like several gene combinations produced the same "best" fitness value.
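The second question, forcing exactly k ones per chromosome, is not handled above. One common workaround is to reject infeasible chromosomes in the fitness function so the search never rewards them; a sketch, where DBI_k is a hypothetical name and k = 3 is assumed:
k <- 3  ## assumed number of clusters
DBI_k <- function(x) {
  ## reject chromosomes that do not select exactly k centers
  if (sum(x) != k) return(-Inf)
  cl <- kmeans(dataset[, -1], dataset[x == 1, -1])
  -index.DB(dataset[, -1], cl = cl$cluster, centrotypes = "centroids")$DB
}
g2 <- ga(type = "binary", fitness = DBI_k, popSize = 100, nBits = nrow(dataset))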

Randomize data between two columns in R

I have searched for an answer or a solution to this task with no success as of yet, so I do apologize if this is redundant.
I want to randomize the data between two columns. This is to simulate species misidentification in vegetation field data, so I want to assign some sort of probability of misidentification between the two columns as well. I would imagine that there is some way to do this using sample or the "permute" package.
I will select some readily available data for an example.
library(vegan)
data(dune)
If you type head(dune), you can see that this is a data frame with sites as rows and species as columns. For convenience's sake, we can presume some field tech has the potential to misidentify Poa pratensis and Poa trivialis.
poa = data.frame(Poaprat=dune$Poaprat,Poatriv=dune$Poatriv)
head(poa)
Poaprat Poatriv
1 4 2
2 4 7
3 5 6
4 4 5
5 2 6
6 3 4
What would be the best way to randomize the values between these two columns (transferring between each other and/or adding to one when both are present)? The resulting data may look like:
Poaprat Poatriv
1 6 0
2 4 7
3 5 6
4 5 4
5 0 7
6 4 3
P.S.
For the cringing ecologists out there: please realize that I have made this example in the interest of time and that I know relative cover values are not additive. I apologize for needing to do that.
*** Edit: For more clarity, the type of data being randomized would be percent cover estimates (so values between 0% and 100%). The data in this quick example are relative cover estimates, not counts.
You'll still need to replace the actual columns with the new ones, and there may be a more elegant way to do this (it's late in EDT land). You'll also have to decide what you want to use besides the equal-probability draw (i.e. how you'll replace sample()), but you get your swaps and adds with:
library(vegan)
library(purrr)
data(dune)
poa <- data.frame(
Poaprat=dune$Poaprat,
Poatriv=dune$Poatriv
)
map2_df(poa$Poaprat, poa$Poatriv, function(x, y) {
  for (i in 1:length(x)) {
    ## pick one of the three outcomes with equal probability
    what <- sample(c("left", "right", "swap"), 1)
    switch(
      what,
      left = {    # lump both covers into Poaprat
        x[i] <- x[i] + y[i]
        y[i] <- 0
      },
      right = {   # lump both covers into Poatriv
        y[i] <- x[i] + y[i]
        x[i] <- 0
      },
      swap = {    # exchange the two values
        tmp <- y[i]
        y[i] <- x[i]
        x[i] <- tmp
      }
    )
  }
  data.frame(Poaprat = x, Poatriv = y)
})
Here is my approach:
Let's define a function that takes a number of specimens (x) and a probability (p) that each specimen could be labeled incorrectly. The function samples a 1 with probability p and a 0 with probability 1-p; the sum of these draws gives how many of the x specimens were misidentified.
mislabel = function(x, p){
  N_mis = sample(c(1, 0), x, replace = TRUE, prob = c(p, 1 - p))
  sum(N_mis)
}
Once the function is defined, apply it to each column and store the results in two new columns:
p_miss = 0.3
poa$Poaprat_mislabeled = sapply(poa$Poaprat, mislabel, p_miss)
poa$Poatriv_mislabeled = sapply(poa$Poatriv, mislabel, p_miss)
The final number of specimens tagged for each species can be calculated by subtracting the misidentified specimens of that species and adding the misidentified specimens of the other species:
poa$Poaprat_final = poa$Poaprat - poa$Poaprat_mislabeled + poa$Poatriv_mislabeled
poa$Poatriv_final = poa$Poatriv - poa$Poatriv_mislabeled + poa$Poaprat_mislabeled
Result:
> head(poa)
Poaprat Poatriv Poaprat_mislabeled Poatriv_mislabeled Poaprat_final Poatriv_final
1 4 2 0 0 4 2
2 4 7 1 2 5 6
3 5 6 0 3 8 3
4 4 5 1 2 5 4
5 2 6 0 3 5 3
6 3 4 1 2 4 3
Complete procedure:
mislabel = function(x, p){
  N_mis = sample(c(1, 0), x, replace = TRUE, prob = c(p, 1 - p))
  sum(N_mis)
}
p_miss = 0.3
poa$Poaprat_mislabeled = sapply(poa$Poaprat, mislabel, p_miss)
poa$Poatriv_mislabeled = sapply(poa$Poatriv, mislabel, p_miss)
poa$Poaprat_final = poa$Poaprat - poa$Poaprat_mislabeled + poa$Poatriv_mislabeled
poa$Poatriv_final = poa$Poatriv - poa$Poatriv_mislabeled + poa$Poaprat_mislabeled
The p_miss variable is the probability of incorrectly labeling either species. You could also use a different value for each species to simulate an asymmetric chance of misidentification, i.e. that it is easier to mislabel one of them than the other.
I just wanted to check in after accepting the answer from hrbrmstr. Given a little bit of time today, I went ahead and made a function that does this task with some degree of flexibility. It allows for the inclusion of multiple species pairs, different probabilities between different species pairs (asymmetry in either direction), and explicitly includes the probability of the value staying the same.
misID = function(X, species, probs = c(0.1, 0.1, 0, 0.8)){
  library(purrr)
  X2 = X
  ## coerce the species pairs and probabilities to matrices if needed
  if (!is.matrix(species)){
    species = as.matrix(species)
  }
  if (!is.matrix(probs)){
    probs = matrix(probs, ncol = 4, byrow = TRUE)
  }
  ## recycle a single probability row across all species pairs
  if (nrow(probs) == 1){
    probs = matrix(rep(probs[1, ], nrow(species)), ncol = 4, byrow = TRUE)
  }
  for (i in 1:nrow(species)){
    Spp = data.frame(X[species[i, 1]], X[species[i, 2]])
    mis = map2_df(Spp[1], Spp[2], function(x, y) {
      for (n in 1:length(x)) {
        ## draw one of four outcomes using this pair's probabilities
        what = sample(c('left', 'right', 'swap', 'same'), size = 1, prob = probs[i, ])
        switch(
          what,
          left = {    # lump both values into the first species
            x[n] = x[n] + y[n]
            y[n] = 0
          },
          right = {   # lump both values into the second species
            y[n] = x[n] + y[n]
            x[n] = 0
          },
          swap = {    # exchange the two values
            tmp = y[n]
            y[n] = x[n]
            x[n] = tmp
          },
          same = {    # leave both values unchanged
            x[n] = x[n]
            y[n] = y[n]
          }
        )
      }
      misSpp = data.frame(x, y)
      colnames(misSpp) = c(names(Spp[1]), names(Spp[2]))
      return(misSpp)
    })
    ## write the shuffled pair back into the copy of the input data
    X2[names(mis[1])] = mis[1]
    X2[names(mis[2])] = mis[2]
  }
  return(X2)
}
There are probably a number of minor inefficiencies in here, but by and large it does what I need it to do, and I did figure out how to handle getting the shuffled data back into the data frame easily.
Thanks for pointing out the "purrr" package and the switch function.
Example:
library(vegan)
library(labdsv)
data(dune)
#First convert relative abundances to my best guess at the % values in Van der Maarel (1979)
code = c(1,2,3,4,5,6,7,8,9)
value = c(0.1,1,2.5,4.25,5.5,20,40,60.5,90)
veg = vegtrans(dune,code,value)
specpairs = matrix(c("Poaprat","Poatriv","Trifprat","Trifrepe"),ncol=2,byrow=T) #create matrix of species pairs
probmat = matrix(c(0.3,0,0,0.7,0,0.5,0,0.5),ncol=4,byrow=T) #create matrix of misclassification probabilities
veg2 = misID(veg,specpairs,probs = probmat)
print(veg2)

Running 'prop.test' multiple times in R

I have some data showing a long list of regions, the population of each region and the number of people in each region with a certain disease. I'm trying to show the confidence intervals for each proportion (but I'm not testing whether the proportions are statistically different).
One approach is to manually calculate the standard errors and confidence intervals but I'd like to use a built-in tool like prop.test, because it has some useful options. However, when I use prop.test with vectors, it runs a chi-square test across all the proportions.
I've solved this with a while loop (see dummy data below), but I sense there must be a better and simpler way to approach this problem. Would apply work here, and how? Thanks!
dat <- data.frame(1:5, c(10, 50, 20, 30, 35))
names(dat) <- c("X", "N")
dat$Prop <- dat$X / dat$N
ConfLower = 0
x = 1
while (x < 6) {
  a <- prop.test(dat$X[x], dat$N[x])$conf.int[1]
  ConfLower <- c(ConfLower, a)
  x <- x + 1
}

ConfUpper = 0
x = 1
while (x < 6) {
  a <- prop.test(dat$X[x], dat$N[x])$conf.int[2]
  ConfUpper <- c(ConfUpper, a)
  x <- x + 1
}

dat$ConfLower <- ConfLower[2:6]
dat$ConfUpper <- ConfUpper[2:6]
Here's an attempt using Map, essentially stolen from a previous answer here:
https://stackoverflow.com/a/15059327/496803
res <- Map(prop.test, dat$X, dat$N)
dat[c("lower", "upper")] <- t(sapply(res, "[[", "conf.int"))
# X N Prop lower upper
#1 1 10 0.1000000 0.005242302 0.4588460
#2 2 50 0.0400000 0.006958623 0.1485882
#3 3 20 0.1500000 0.039566272 0.3886251
#4 4 30 0.1333333 0.043597084 0.3164238
#5 5 35 0.1428571 0.053814457 0.3104216
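To answer the apply question directly, here is an equivalent sketch using sapply alone; prop.test(x, n)$conf.int holds both bounds, so one pass fills both columns:
ci <- sapply(seq_len(nrow(dat)), function(i) prop.test(dat$X[i], dat$N[i])$conf.int)
dat$ConfLower <- ci[1, ]
dat$ConfUpper <- ci[2, ]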
