While working on an Rcpp program, I used the sample() function, which gave me the following error: "NAs not allowed in probability." I traced the issue to the probability vector I passed in, which contained NA values, though I have no idea how they got there. Below is some R code that reproduces the error:
n.0 = 20
n.1 = 20
n.reps = 1
beta0.vals = rep(seq(-.3, .1, length.out = n.0), n.reps)
beta1.vals = rep(seq(-7, 0, length.out = n.1), n.reps)
beta.grd = as.matrix(expand.grid(beta0.vals, beta1.vals))
n.rnd = 200
beta.rnd.grd = cbind(runif(n.rnd, min(beta0.vals), max(beta0.vals)),
                     runif(n.rnd, min(beta1.vals), max(beta1.vals)))
beta.grd = rbind(beta.grd, beta.rnd.grd)
N = 22670
count = 0
for (i in 1:dim(beta.grd)[1]) {   # iterate through the 600 possible beta pairs in the grid
  beta.ind = 0                    # indicator for current pair of beta values
  for (j in 1:N) {                # iterate through all possible Nsums
    logit = beta.grd[i, 1]/N*(j - .1*N)^2 + beta.grd[i, 2]
    phi01 = exp(logit)/(1 + exp(logit))
    if (is.na(phi01)) {
      count = count + 1
    }
  }
}
cat("Total number of invalid probabilities: ", count)
Here, $\beta_0 \in (-0.3, 0.1), \beta_1 \in (-7, 0), N = 22670, N_\text{sum} \in (1, N)$. Note that $N$ and $N_\text{sum}$ are integers, whereas the beta values may not be.
Since mathematically $\phi_{01} \in (0,1)$, I'm assuming the NAs arise because R does not like extremely small values. I am also getting an overwhelming number of NAs, more NAs than actual numbers. Why would this code produce NAs?
Add print(logit) next to count = count + 1 and you will find lots of values with logit > 1000. exp(1000) is Inf, so you divide Inf by Inf, which yields NaN, and NaN counts as NA:
> exp(500)
[1] 1.403592e+217
> Inf/Inf
[1] NaN
> is.na(NaN)
[1] TRUE
So your problem is not numbers that are too small but numbers that are too large, coming out of the evaluation of exp(x) once x exceeds roughly 709:
> exp(709)
[1] 8.218407e+307
> exp(710)
[1] Inf
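To watch this happen with the expression from the question, a minimal sketch using one overflowing value:
logit <- 800                     # well past the ~709 overflow threshold
exp(logit)                       # Inf
exp(logit) / (1 + exp(logit))    # Inf / Inf -> NaN, and is.na(NaN) is TRUE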
Bernhard's answer correctly identifies the problem:
If logit is large, exp(logit) = Inf.
Here is a solution:
count = 0
for (i in 1:dim(beta.grd)[1]) {   # iterate through the 600 possible beta pairs in the grid
  beta.ind = 0                    # indicator for current pair of beta values
  for (j in 1:N) {                # iterate through all possible Nsums
    logit = beta.grd[i, 1]/N*(j - .1*N)^2 + beta.grd[i, 2]
    ## This one isn't great because exp(logit) can be very large:
    # phi01 = exp(logit)/(1 + exp(logit))
    ## So we say instead:
    ## phi01 = 1 / (1 + exp(-logit))
    phi01 = plogis(logit)
    if (is.na(phi01)) {
      count = count + 1
    }
  }
}
cat("Total number of invalid probabilities: ", count)
# Total number of invalid probabilities:  0
We can use the more stable form 1 / (1 + exp(-logit)). To convince yourself of this, multiply the original expression by exp(-logit) / exp(-logit):
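$$\frac{e^{x}}{1+e^{x}} \cdot \frac{e^{-x}}{e^{-x}} = \frac{1}{e^{-x}+1}$$
For large positive $x$, the right-hand side evaluates $e^{-x} \to 0$ rather than $e^{x} \to \infty$, so the division never becomes Inf/Inf.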
Luckily, either way, R has a built-in function, plogis(), that calculates these probabilities quickly and accurately.
You can see from the help file (?plogis) that this function evaluates the expression I gave, but you can also double-check to assure yourself:
x = rnorm(1000)
y = 1 / (1 + exp(-x))
z = plogis(x)
all.equal(y, z)
# [1] TRUE
I have two lengthy data sets with several columns and different lengths; for this example, let's subset to a few rows and just 3 columns:
Temp <- c(12.9423 ,12.9446 ,12.9412 ,12.9617 ,12.9742 ,12.9652 ,12.9463, 12.9847 ,12.9778,
12.9589, 12.9305, 12.9275 ,12.8569 ,12.8531 ,12.9092, 12.9471, 12.9298, 12.9266,
12.9374 ,12.9385, 12.9505, 12.9510, 12.9632 ,12.9621 ,12.9571, 12.9492 ,12.8988,
12.8895 ,12.8777, 12.8956, 12.8748 ,12.7850 ,12.7323, 12.7546 ,12.7375 ,12.7020,
12.7172, 12.7015, 12.6960, 12.6944, 12.6963, 12.6928, 12.6930 ,12.6883 ,12.6913)
Density <- c(26.38635 ,26.38531 ,26.38429, 26.38336, 26.38268 ,26.38242, 26.38265, 26.38343,
26.38486, 26.38697 ,26.38945, 26.39188, 26.39365, 26.39424 ,26.39376 ,26.39250,
26.39084 ,26.38912 ,26.38744 ,26.38587, 26.38456 ,26.38367, 26.38341 ,26.38398,
26.38547 ,26.38793 ,26.39120 ,26.39509, 26.39955 ,26.40455, 26.41002, 26.41578,
26.42126, 26.42593 ,26.42968, 26.43255 ,26.43463, 26.43603 ,26.43693 ,26.43750,
26.43787, 26.43815, 26.43841 ,26.43871 ,26.43904)
po4 <- c(0.4239840 ,0.4351156, 0.4456128, 0.4542392, 0.4608510, 0.4656445, 0.4690847,
0.4717291, 0.4742391 ,0.4774904 ,0.4831152, 0.4922122, 0.5029904, 0.5128720,
0.5190209, 0.5191368 ,0.5133212, 0.5027542 ,0.4905301 ,0.4796467 ,0.4708035,
0.4638879, 0.4578364 ,0.4519745, 0.4481336, 0.4483697, 0.4531310, 0.4622930,
0.4750474 ,0.4905152 ,0.5082183 ,0.5278212 ,0.5491580 ,0.5720519, 0.5961127,
0.6207716 ,0.6449603, 0.6675704 ,0.6878331 ,0.7051851,0.7195461, 0.7305200,
0.7359634 ,0.7343541, 0.7283988)
PP14 <- data.frame(Temp,Density,po4) ##df1
temp <- c(13.13875, 13.13477 ,13.12337 ,13.10662 ,13.09798 ,13.09542 ,13.08734 ,13.07616,
13.06671 ,13.05899, 13.05890 ,13.05293 ,13.03322, 13.01515, 13.02552 ,13.01668,
12.99829, 12.97075 ,12.95572 ,12.95045 ,12.94541 ,12.94365 ,12.94609 ,12.94256,
12.93565 ,12.93258 ,12.93489 ,12.93209 ,12.92219 ,12.90730 ,12.90416 ,12.89974,
12.89749 ,12.89626 ,12.89395, 12.89315 ,12.89274, 12.89276 ,12.89293 ,12.89302)
density <- c( 26.35897, 26.36274 ,26.36173 ,26.36401 ,26.36507 ,26.36662 ,26.36838,
26.36996,
26.37286 ,26.37452 ,26.37402, 26.37571 ,26.37776, 26.38008 ,26.37959 ,26.38178,
26.38642 ,26.39158 ,26.39350, 26.39467, 26.39601, 26.39601, 26.39596 ,26.39517,
26.39728 ,26.39766, 26.39774, 26.39699 ,26.40081 ,26.40328 ,26.40416, 26.40486,
26.40513 ,26.40474 ,26.40552 ,26.40584, 26.40613, 26.40602 ,26.40595 ,26.40498)
krho <- c( -9.999999e+06, -1.786843e+00, -9.142976e-01, -9.650734e-01, -2.532397e+00,
-3.760537e+00, -2.622484e+00, -1.776506e+00, -2.028391e+00, -2.225910e+00,
-3.486826e+00, -2.062341e-01, -3.010643e+00, -3.878437e+00, -3.796426e+00,
-3.227138e+00, -3.335446e+00, -3.738037e+00, -4.577778e+00, -3.818099e+00,
-3.891467e+00, -4.585045e+00 ,-3.150283e+00 ,-4.371089e+00 ,-3.902601e+00,
-4.546019e+00, -3.932538e+00, -4.331247e+00, -4.508137e+00, -4.789201e+00,
-4.383820e+00, -4.423486e+00, -4.334641e+00, -4.330544e+00, -4.838604e+00,
-4.729123e+00, -4.381797e+00, -4.207365e+00, -4.276804e+00, -4.001305e+00)
MS14 <- data.frame(temp,density,krho) ##df2
So now I would like to loop through both data sets and check whether MS14$density == PP14$Density; where it is true, I would like to take krho in that row and multiply it by the delta po4 corresponding to the same density, i.e. diff(po4) in that row or range. Something like:
#MS14$krho[i] * diff(PP14$po4)[i]
BUT when I run
PP14$Density == MS14$density
of course it is always FALSE because, with so many decimal places, no two values are exactly the same. I worked around that by rounding the numbers, but there should be a way to build a tolerance into the code, e.g. density ± 0.005, or just rounding to a couple of decimals like:
PP14$Density_round2 <- round(PP14$Density, digits = 2)
In any case, I am not sure whether I should use a nested loop to check both columns and perform the operations accordingly, or whether it would be better to create a new data.frame with the intersection of the two:
common <- intersect(PP14$Density, MS14$density)
and then make calculations....(??)
So I would probably need a nested loop like:
for (i in 1:nrow(PP14)) {
  for (j in 1:nrow(MS14)) {
    new = PP14$Density[i] == MS14$density[j]
    # if new is TRUE, then MS14$krho[j] * diff(PP14$po4)[i] for that particular row,
    # and collect the result into a new data.frame df3
  }
}
So please feel free to suggest the best way to proceed; there might be several ways to do it.
Thank you so much in advance!!
P.S.: suggestions using MATLAB are also welcome.
Something like this? outer() applies compareDec() to every combination of MS14$density and PP14$Density, and which(..., arr.ind = TRUE) then picks out the matching row/column pairs:
compareDec <- function(x, y, digits = NULL, tol = .Machine$double.eps^0.5) {
  if (is.null(digits)) {
    abs(x - y) < tol                                        # compare within a tolerance
  } else {
    round(x, digits = digits) == round(y, digits = digits)  # or compare after rounding
  }
}
icomp <- outer(MS14$density, PP14$Density, compareDec, digits = 2)  # all pairwise matches
m <- outer(MS14$krho, c(0, diff(PP14$po4)))                         # all pairwise products
new <- which(icomp, arr.ind = TRUE)                                 # indices of the matches
df3 <- cbind.data.frame(new, Prod = m[new])
head(df3)
# row col Prod
#1 17 1 0.00000000
#2 18 1 0.00000000
#3 19 1 0.00000000
#4 20 1 0.00000000
#5 17 2 -0.03712885
#6 18 2 -0.04161033
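To see compareDec's two modes side by side, using a pair of values taken from the data above:
compareDec(26.39188, 26.39158)              # FALSE: the difference exceeds the default tolerance
compareDec(26.39188, 26.39158, digits = 2)  # TRUE: both round to 26.39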
I have the following data frame:
lm meanresids sdresids resid1 resid2 resid3 intercept beta
1 0.000000e+00 6.2806844 -3.6261548 7.2523096 -3.6261548 103.62615 24.989340
2 -2.960595e-16 8.7515899 -5.0527328 10.1054656 -5.0527328 141.96786 -1.047323
3 -2.960595e-16 5.9138984 -3.4143908 6.8287817 -3.4143908 206.29046 -26.448694
4 3.700743e-17 0.5110845 0.2950748 -0.5901495 0.2950748 240.89801 -35.806642
5 7.401487e-16 6.6260504 3.8255520 -7.6511040 3.8255520 187.03479 -23.444762
6 5.921189e-16 8.7217431 5.0355007 -10.0710014 5.0355007 41.43239 3.138396
7 0.000000e+00 5.5269434 3.1909823 -6.3819645 3.1909823 -119.90628 27.817845
8 -1.480297e-16 1.0204260 -0.5891432 1.1782864 -0.5891432 -180.33773 35.623363
9 -5.921189e-16 6.9488186 -4.0119023 8.0238046 -4.0119023 -64.72245 21.820226
10 -8.881784e-16 8.6621512 -5.0010953 10.0021906 -5.0010953 191.65339 -5.218767
Each row represents an estimated linear model with window length 3. I used rollapply on a separate dataframe with the function lm(y~t) to extract the coefficients and intercepts into a new dataframe, which I have combined with the residuals from the same models and their corresponding means and standard deviations.
Since the window length is 3, there are 3 residuals per model, shown in resid1, resid2 and resid3; their mean and sd are included accordingly.
I am seeking to predict the next observation for each model, in essence observation k+1, where k is the index of the last observation in the window (so k = 3 for the first model), using the intercept and beta.
Recall that lm1 takes observations 1,2,3 to estimate the intercept and the beta, and lm2 takes 2,3,4, lm3 takes 3,4,5, etc. The function for the prediction should be:
predict_lm1 = intercept_lm1 + beta_lm1*(k+1)
Where k+1 = 4. For lm2:
predict_lm2 = intercept_lm2 + beta_lm2*(k+1)
Where k+1 = 5.
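As a worked check against row 1 of the table above:
# lm1: intercept = 103.62615, beta = 24.989340, predicting at t = k + 1 = 4
103.62615 + 24.989340 * 4
# [1] 203.58351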
Clearly, k increases by 1 every time I move down one row in the dataset. This is because the explanatory variable is time, t, which is a sequence increasing by one per observation.
Should I use a for loop, or an apply function here?
How can I make a function that iterates down the rows and calculates the predictions accordingly with the information found in that row?
Thanks.
EDIT:
I managed to find a possible solution by writing the following:
n = nrow(dataset)
for (i in n) {
  predictions = dataset$Intercept + dataset$beta*(k+1)
}
However, k does not increase by 1 per iteration, so k+1 is always 4.
How can I make sure k increases by 1 accordingly?
EDIT 2
I managed to add 1 to k by writing the following:
n = nrow(dataset)
for (i in n) {
  x = 0
  x[i] = k + 1
  preds = dataset$`(Intercept)` + dataset$t*(x[i])
}
However, the first prediction is overestimated. It should be 203, whereas it is estimated as 228, implying that it sets the explanatory variable as 1 too high.
Yet, the second prediction is correct. I am not sure what I am doing wrong. Any advice?
EDIT 3
I managed to find a solution as follows:
n = nrow(dataset)
for (i in n) {
  x = k + 1
  preds = dataset$`(Intercept)` + dataset$t*(x)
  x = x + 1
}
Your loop is not iterating: for(i in n) runs the body exactly once, with i equal to n. You want for(i in 1:n):
dataset <- read.table(text="lm meanresids sdresids resid1 resid2 resid3 intercept beta
1 0.000000e+00 6.2806844 -3.6261548 7.2523096 -3.6261548 103.62615 24.989340
2 -2.960595e-16 8.7515899 -5.0527328 10.1054656 -5.0527328 141.96786 -1.047323
3 -2.960595e-16 5.9138984 -3.4143908 6.8287817 -3.4143908 206.29046 -26.448694
4 3.700743e-17 0.5110845 0.2950748 -0.5901495 0.2950748 240.89801 -35.806642
5 7.401487e-16 6.6260504 3.8255520 -7.6511040 3.8255520 187.03479 -23.444762
6 5.921189e-16 8.7217431 5.0355007 -10.0710014 5.0355007 41.43239 3.138396
7 0.000000e+00 5.5269434 3.1909823 -6.3819645 3.1909823 -119.90628 27.817845
8 -1.480297e-16 1.0204260 -0.5891432 1.1782864 -0.5891432 -180.33773 35.623363
9 -5.921189e-16 6.9488186 -4.0119023 8.0238046 -4.0119023 -64.72245 21.820226
10 -8.881784e-16 8.6621512 -5.0010953 10.0021906 -5.0010953 191.65339 -5.218767", header=T)
n <- nrow(dataset)
predictions <- data.frame()
for (i in 1:n) {
  k <- i + 2  ## per the question, lm1 predicts at k + 1 = 4 and lm2 at k + 1 = 5, so k = i + 2
  predictions <- rbind(predictions, dataset$intercept[i] + dataset$beta[i]*(k + 1))
}
predictions
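For what it's worth, the loop isn't strictly necessary; a vectorized sketch of the same computation (following the same k = i + 2 convention as above):
k <- seq_len(nrow(dataset)) + 2                            # k = 3, 4, 5, ... down the rows
predictions <- dataset$intercept + dataset$beta * (k + 1)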
I have searched for an answer or a solution to this task with no success as of yet, so I do apologize if this is redundant.
I want to randomize the data between two columns. This is to simulate species misidentification in vegetation field data, so I want to assign some sort of probability of misidentification between the two columns as well. I would imagine that there is some way to do this using sample or the "permute" package.
I will select some readily available data for an example.
library (vegan)
data (dune)
If you type head(dune), you can see that this is a data frame with sites as rows and species as columns. For convenience's sake, we can presume some field tech might misidentify Poa pratensis and Poa trivialis.
poa = data.frame(Poaprat=dune$Poaprat,Poatriv=dune$Poatriv)
head(poa)
Poaprat Poatriv
1 4 2
2 4 7
3 5 6
4 4 5
5 2 6
6 3 4
What would be the best way to randomize the values between these two columns (swapping them with each other and/or adding both to one column when both are present)? The resulting data might look like:
Poaprat Poatriv
1 6 0
2 4 7
3 5 6
4 5 4
5 0 7
6 4 3
P.S.
For the cringing ecologist out there: please realize, I have made this example in the interest of time and that I know relative cover values are not additive. I apologize for needing to do that.
*** Edit: For more clarity, the type of data being randomized would be percent cover estimates (so values between 0% and 100%). The data in this quick example are relative cover estimates, not counts.
You'll still need to replace the actual columns with the new ones, and there may be a more elegant way to do this (it's late in EDT land). You'll also have to decide what you want to use besides the uniform choice among actions (i.e. how you'll replace sample()), but you get your swaps and adds with:
library(vegan)
library(purrr)
data(dune)
poa <- data.frame(
Poaprat=dune$Poaprat,
Poatriv=dune$Poatriv
)
map2_df(poa$Poaprat, poa$Poatriv, function(x, y) {
  for (i in 1:length(x)) {
    # pick one of three actions at random for this site
    what <- sample(c("left", "right", "swap"), 1)
    switch(
      what,
      left = {     # lump both values into the left column
        x[i] <- x[i] + y[i]
        y[i] <- 0
      },
      right = {    # lump both values into the right column
        y[i] <- x[i] + y[i]
        x[i] <- 0
      },
      swap = {     # exchange the two values
        tmp <- y[i]
        y[i] <- x[i]
        x[i] <- tmp
      }
    )
  }
  data.frame(Poaprat = x, Poatriv = y)
})
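Assign the result of the map2_df() call above to a variable, say shuffled (a hypothetical name), and writing it back over the original columns is then just two assignments:
poa$Poaprat <- shuffled$Poaprat  # shuffled: the data frame returned by map2_df() above
poa$Poatriv <- shuffled$Poatriv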
Here is my approach:
Let's define a function that takes a number of specimens (x) and a probability (p) that each one is labeled incorrectly. The function samples a 1 with probability p and a 0 with probability 1-p, once per specimen; the sum of these draws gives how many of the x specimens were mislabeled.
mislabel = function(x, p){
  # one Bernoulli draw per specimen: 1 (mislabeled) with probability p, 0 otherwise
  N_mis = sample(c(1,0), x, replace = T, prob = c(p, 1-p))
  sum(N_mis)   # number of mislabeled specimens
}
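As an aside, each specimen is an independent Bernoulli trial, so the same count can be drawn in a single step from a binomial distribution; an equivalent one-line sketch:
mislabel = function(x, p) rbinom(1, size = x, prob = p)   # same distribution as the sampling above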
Once the function is defined, apply it to each column and store the results in two new columns:
p_miss = 0.3
poa$Poaprat_mislabeled = sapply(poa$Poaprat, mislabel, p_miss)
poa$Poatriv_mislabeled = sapply(poa$Poatriv, mislabel, p_miss)
The final number of specimens tagged as each species can be calculated by subtracting the mislabeled specimens of that species and adding the mislabeled specimens of the other species:
poa$Poaprat_final = poa$Poaprat - poa$Poaprat_mislabeled + poa$Poatriv_mislabeled
poa$Poatriv_final = poa$Poatriv - poa$Poatriv_mislabeled + poa$Poaprat_mislabeled
Result:
> head(poa)
Poaprat Poatriv Poaprat_mislabeled Poatriv_mislabeled Poaprat_final Poatriv_final
1 4 2 0 0 4 2
2 4 7 1 2 5 6
3 5 6 0 3 8 3
4 4 5 1 2 5 4
5 2 6 0 3 5 3
6 3 4 1 2 4 3
Complete procedure:
mislabel = function(x, p){
  N_mis = sample(c(1,0), x, replace = T, prob = c(p, 1-p))
  sum(N_mis)
}
p_miss = 0.3
poa$Poaprat_mislabeled = sapply(poa$Poaprat, mislabel, p_miss)
poa$Poatriv_mislabeled = sapply(poa$Poatriv, mislabel, p_miss)
poa$Poaprat_final = poa$Poaprat - poa$Poaprat_mislabeled + poa$Poatriv_mislabeled
poa$Poatriv_final = poa$Poatriv - poa$Poatriv_mislabeled + poa$Poaprat_mislabeled
The p_miss variable is the mislabeling probability, applied here to both species. You could also use a different value for each species to simulate an asymmetric situation where one is easier to mislabel than the other.
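For instance, a sketch with asymmetric rates (the 0.3 and 0.1 values here are hypothetical):
p_prat = 0.3   # hypothetical chance a Poa pratensis specimen is recorded as P. trivialis
p_triv = 0.1   # hypothetical chance a Poa trivialis specimen is recorded as P. pratensis
poa$Poaprat_mislabeled = sapply(poa$Poaprat, mislabel, p_prat)
poa$Poatriv_mislabeled = sapply(poa$Poatriv, mislabel, p_triv)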
I just wanted to check in after accepting the answer from hrbrmstr. Given a little bit of time today, I went ahead and made a function that does this task with some flexibility. It allows for multiple species pairs, different probabilities for different species pairs (and asymmetry within a pair), and explicitly includes the probability of a value staying the same.
misID = function(X, species, probs = c(0.1, 0.1, 0, 0.8)){
  # probs columns give the probabilities of, in order: 'left' (lump both values
  # into column 1), 'right' (lump into column 2), 'swap' (exchange the two
  # values), and 'same' (leave the row unchanged)
  library(purrr)
  X2 = X
  if (!is.matrix(species)){
    species = as.matrix(species)
  }
  if (!is.matrix(probs)){
    probs = matrix(probs, ncol = 4, byrow = T)
  }
  if (nrow(probs) == 1){
    # recycle a single row of probabilities for every species pair
    probs = matrix(rep(probs[1, ], nrow(species)), ncol = 4, byrow = T)
  }
  for (i in 1:nrow(species)){
    Spp = data.frame(X[species[i, 1]], X[species[i, 2]])
    mis = map2_df(Spp[1], Spp[2], function(x, y) {
      for (n in 1:length(x)) {
        what = sample(c('left', 'right', 'swap', 'same'), size = 1, prob = probs[i, ])
        switch(
          what,
          left = {
            x[n] = x[n] + y[n]
            y[n] = 0
          },
          right = {
            y[n] = x[n] + y[n]
            x[n] = 0
          },
          swap = {
            tmp = y[n]
            y[n] = x[n]
            x[n] = tmp
          },
          same = {
            x[n] = x[n]
            y[n] = y[n]
          }
        )
      }
      misSpp = data.frame(x, y)
      colnames(misSpp) = c(names(Spp[1]), names(Spp[2]))
      return(misSpp)
    })
    # write the shuffled pair back into the copy of the original data frame
    X2[names(mis[1])] = mis[1]
    X2[names(mis[2])] = mis[2]
  }
  return(X2)
}
There are probably a number of minor inefficiencies in here, but by and large it does what I need it to do. I did figure out how to handle getting the shuffled data back into the data frame easily.
Thanks for pointing out the purrr package and also the switch function.
Example:
library(vegan)
library(labdsv)
data(dune)
# First, convert relative abundances to my best guess at the % values in Van der Maarel (1979)
code = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
value = c(0.1, 1, 2.5, 4.25, 5.5, 20, 40, 60.5, 90)
veg = vegtrans(dune, code, value)
specpairs = matrix(c("Poaprat", "Poatriv", "Trifprat", "Trifrepe"), ncol = 2, byrow = T)  # matrix of species pairs
probmat = matrix(c(0.3, 0, 0, 0.7, 0, 0.5, 0, 0.5), ncol = 4, byrow = T)  # misclassification probabilities (left, right, swap, same)
veg2 = misID(veg, specpairs, probs = probmat)
print(veg2)