An odd thing happens in R when I do set.seed(0) and set.seed(1):
set.seed(0)
sample(1:100,size=10,replace=TRUE)
#### [1] 90 27 38 58 91 21 90 95 67 63
set.seed(1)
sample(1:100,size=10,replace=TRUE)
#### [1] 27 38 58 91 21 90 95 67 63 7
When changing the seed from 0 to 1, I get the exact same sequence, but shifted over by 1 cell!
Note that if I do set.seed(2), I do get what appears to be a completely different (random?) vector.
set.seed(2)
sample(1:100,size=10,replace=TRUE)
#### [1] 19 71 58 17 95 95 13 84 47 55
Anyone know what's going on here?
This applies to the R implementation of the Mersenne-Twister RNG.
set.seed() takes the provided seed and scrambles it (in the C function RNG_Init):
for(j = 0; j < 50; j++)
seed = (69069 * seed + 1);
The scrambled seed is then advanced 625 more times, with each successive value filling one slot of the Mersenne-Twister's initial state:
for(j = 0; j < RNG_Table[kind].n_seed; j++) {
seed = (69069 * seed + 1);
RNG_Table[kind].i_seed[j] = seed;
}
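To see why seeds 0 and 1 land on nearly the same state, note that one step of this linear congruential map sends 0 to 1, so seed 1 is simply one step ahead of seed 0 on the same orbit. Here is a minimal sketch emulating the 32-bit arithmetic in R (the %% 2^32 reduction stands in for C unsigned overflow; the products stay below 2^53, so doubles remain exact):
scramble <- function(seed, n = 50) {
  for (j in 1:n) seed <- (69069 * seed + 1) %% 2^32
  seed
}
# After the same 50 steps, seed 1's value is exactly one step ahead:
(69069 * scramble(0) + 1) %% 2^32 == scramble(1)  # TRUE
Every subsequent fill value is likewise one step ahead, which shifts the whole i_seed array by one position.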
We can examine the initial state for the RNG using .Random.seed:
set.seed(0)
x <- .Random.seed
set.seed(1)
y <- .Random.seed
table(x %in% y)
You can see from the table that there is a lot of overlap. Compare this to seed = 3:
set.seed(3)
z <- .Random.seed
table(z %in% x)
table(z %in% y)
Going back to the case of 0 and 1: if we examine the state itself (ignoring the first two elements of the vector, which do not concern us here), you can see that the state is offset by one:
x[3:10]
# 1280795612 -169270483 -442010614 -603558397 -222347416 1489374793 865871222
# 1734802815
y[3:10]
# -169270483 -442010614 -603558397 -222347416 1489374793 865871222 1734802815
# 98005428
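A quick way to confirm the shift across the whole state (this assumes the Mersenne-Twister state occupies positions 3:626 of .Random.seed; the check should return TRUE):
all(y[3:625] == x[4:626])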
Since the values selected by sample() are based on these numbers, you get the odd behavior.
As you can see from the other answer, seeds 0 and 1 produce very similar initial states. In addition, the Mersenne Twister PRNG has a known weakness: very similar initial states take a long time to diverge.
It is therefore advisable to use an alternative such as the WELL PRNG (available in the randtoolbox package).
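For instance, a sketch of switching generators (the set.generator() arguments here are from memory; check ?set.generator in randtoolbox for the exact interface):
library(randtoolbox)
set.generator("WELL", order = 19937, version = "a", seed = 0)
runif(5)                     # draws now come from the WELL generator
RNGkind("Mersenne-Twister")  # restore R's default generator afterwards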
I'm trying to implement a check for decreasing average temperatures, to see when the temperature starts falling (chart of daily average temperatures omitted).
Here is the formula I'm trying to implement, a one-sided CUSUM for detecting a decrease: S_t = max(0, S_{t-1} + (mu - x_t - C)), where x_t is the average temperature at step t, mu is the mean when there is no change, and C is a slack constant.
Here is my code to implement that formula:
temps <- read.delim("temps.txt")
date_avgs <- rowMeans(temps[2:length(temps)], dims=1, na.rm=T)
mu <- 87
threshold <- 86
constant <- 3
date_avgs
S <- 0 * date_avgs
for (i in 2:length(date_avgs)) {
value <- S[i-1] + (mu - date_avgs[i] - constant)
cat("\nvalue", value, "si", date_avgs[i], i)
S[i] <- max(0, value)
if(S[i] >= threshold){
#Once I hit this for the first time, that indicates at this index the temp is decreasing
cat("\nDecreased past my threshold!!!", S[i] ,i)
}
}
But I'm not able to detect the change as I expect. My formula doesn't get over the threshold until index 108, when it should get there around index 60.
Here is the plot of my S (or CUSUM) values:
Any ideas what I'm doing wrong in my formula?
I think the problem is mu <- mean(date_avgs), i.e. the mean of all the observations. But mu should be the "mean of X if there is no change". Thus mu should be about 87, but according to your code and the plotted data it seems to be 80 or less.
# simulated data
set.seed(4422)
date_avgs <- c(runif(60, 84, 92), 88-(1:50)-rnorm(50,0,4))
plot(date_avgs)
# setting constants
mu <- 87
threshold <- 86
constant <- 3
# after running the for cycle from the question (reproduced below):
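# (this is the asker's loop from the question, unchanged, so the
#  snippet runs standalone)
S <- 0 * date_avgs
for (i in 2:length(date_avgs)) {
  S[i] <- max(0, S[i-1] + (mu - date_avgs[i] - constant))
}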
Index <- match(S[S >= threshold][1], S)
Index
[1] 75
# for data
> date_avgs[74]
[1] 73.41981
# Considering a lower threshold
# (as the maximum allowable difference to detect the trend, here 2 * C)
mu <- 87
threshold <- 6 # arbitrary
constant <- 3
# after running the same for cycle (S is unchanged; only the threshold differs)
Index <- match(S[S >= threshold][1], S)
Index
[1] 66
So I think the code is fine; maybe the interpretation is not.
I've got this data processing:
library(text2vec)
##Using perplexity for hold out set
t1 <- Sys.time()
perplex <- c()
for (i in 3:25){
set.seed(17)
lda_model2 <- LDA$new(n_topics = i)
doc_topic_distr2 <- lda_model2$fit_transform(x = dtm, progressbar = F)
set.seed(17)
sample.dtm2 <- itoken(rawsample$Abstract,
preprocessor = prep_fun,
tokenizer = tok_fun,
ids = rawsample$id,
progressbar = F) %>%
create_dtm(vectorizer,vtype = "dgTMatrix", progressbar = FALSE)
set.seed(17)
new_doc_topic_distr2 <- lda_model2$transform(sample.dtm2, n_iter = 1000,
convergence_tol = 0.001, n_check_convergence = 25,
progressbar = FALSE)
perplex[i] <- text2vec::perplexity(sample.dtm2, topic_word_distribution =
lda_model2$topic_word_distribution,
doc_topic_distribution = new_doc_topic_distr2)
}
print(difftime(Sys.time(), t1, units = 'sec'))
I know there are a lot of questions like this, but I haven't been able to find an answer for my exact situation. Above you see a perplexity calculation for topic numbers 3 to 25 for a Latent Dirichlet Allocation model. I want to find the most suitable value among them, meaning the elbow or knee of the curve, treating the values as a simple numeric vector. The outcome looks like this:
1 NA
2 NA
3 222.6229
4 210.3442
5 200.1335
6 190.3143
7 180.4195
8 174.2634
9 166.2670
10 159.7535
11 153.7785
12 148.1623
13 144.1554
14 141.8250
15 138.8301
16 134.4956
17 131.0745
18 128.8941
19 125.8468
20 123.8477
21 120.5155
22 118.4426
23 116.4619
24 113.2401
25 114.1233
plot(perplex)
This is what the plot looks like:
I would say the elbow is at 13 or 16, but I'm not completely sure and I want the exact number as an outcome. I saw in this paper that the knee is the point of maximum curvature, f''(x) / (1 + f'(x)^2)^1.5, which I tried like this; it says 18:
> d1 <- diff(perplex) # first derivative
> d2 <- diff(d1) / diff(perplex[-1]) # second derivative
> knee <- (d2)/((1+(d1)^2)^1.5)
Warning message:
In (d2)/((1 + (d1)^2)^1.5) :
longer object length is not a multiple of shorter object length
> which.min(knee)
[1] 18
I can't fully figure this out. Would someone like to share how I could get the exact ideal number of topics according to perplexity as an outcome?
Found this in this paper: "The LDA model with the optimal coherence score, obtained with an elbow method (the point with maximum absolute second derivative) (...)", so this code does the job:
d1 <- diff(perplex)
k <- which.max(abs(diff(d1) / diff(perplex[-1])))
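For reference, a minimal sketch of the plain maximum-absolute-second-derivative rule (my own variant, not taken from the paper); the +1 recentres each second difference on its middle point, and the result is the topic number directly, since perplex[i] holds the perplexity for i topics:
d1 <- diff(perplex)      # first differences (leading NAs are ignored by which.max)
d2 <- diff(d1)           # second differences
which.max(abs(d2)) + 1   # topic number with maximum curvature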
Consider the following dataset:
fictional.df <- data.frame(L1 = c(0,0,0,0,0,0,0,0),
L2 = c(0,1,0,0,0,1,1,0),
L3 = c(1,1,0,1,1,1,1,1),
L4=c(0,0,1,1,0,0,0,0))
I converted this to a phyDat object and then created a pairwise distance matrix as follows:
fictional.phydat <- as.phyDat(fictional.df,
type="USER",levels=c("1","0"),
names=names(fictional.df))
fictional.hamming <- dist.hamming(fictional.phydat)
From this distance matrix, I then estimated a UPGMA tree:
fictional.upgma <- upgma(fictional.hamming)
I then created bootstrap datasets:
set.seed(187)
fictional.upgma.bs <- bootstrap.phyDat(fictional.phydat, FUN =
function(xx) upgma(dist.hamming(xx)), bs=100)
I then calculated the proportion of partitions in the bootstrap set:
upgma.bs.part <- prop.part(fictional.upgma.bs)
So far so good. Here is where I would appreciate some help. When I call the function prop.clades, I do not understand the result:
prop.clades(fictional.upgma,fictional.upgma.bs)
[1] 100 NA 71
Why does this function return NA when there is evidence for that clade in the set of bootstrap trees?
A second question:
prop.clades(fictional.upgma,part=upgma.bs.part)
[1] 100 49 112
If there are only 100 bootstrap samples, why is the value for the final clade 112?
Your tree fictional.upgma is rooted, and prop.clades by default returns how often each bipartition occurs. In a rooted tree the two edges leading to the root both refer to the same bipartition or split:
prop.clades(unroot(fictional.upgma), fictional.upgma.bs)
[1] 100 71
For rooted trees you sometimes want to count the number of identical clades instead:
prop.clades(fictional.upgma, fictional.upgma.bs, rooted=TRUE)
[1] 100 49 71
This seems to be a bug, and you should report it to Emmanuel Paradis:
prop.clades(fictional.upgma,part=upgma.bs.part)
[1] 100 49 112
I have the R iris dataset, which I am using for a PNN. The 3 species have been recoded to levels 0 to 2 as follows: 0 is setosa, 1 is versicolor, 2 is virginica. The training set is 75% of the data.
Q1. I don't understand the function pred_pnn; if anyone is good in R, perhaps you can explain how it works.
Q2. The output for the test set (the prediction) is shown below. I don't understand it, because it is supposed to be something close to 0, 1, or 2.
data = read.csv("c:/iris-recoded.csv", header = T)
size = nrow(data)
length = ncol(data)
index <- 1:size
positions <- sample(index, trunc(size * 0.75))
training <- data[positions,]
testing <- data[-positions, 1:(length-1)]
result = data[-positions,]
result$actual = result[,length]
result$predict = -1
nn1 <- smooth(learn(training), sigma = 0.9)
pred_pnn <- function(x, nn){
xlst <- split(x, 1:nrow(x))
pred <- foreach(i = xlst, .combine = rbind) %dopar% {
data.frame(prob = guess(nn, as.matrix(i))$probabilities[1], row.names =NULL)
}
}
print(pred_pnn(testing, nn1))
prob
1 1.850818e-03
2 9.820653e-03
3 6.798603e-04
4 7.421435e-03
5 2.168817e-03
6 3.277354e-03
7 6.541173e-03
8 1.725332e-04
9 2.081845e-03
10 2.491388e-02
11 7.679823e-03
12 1.291811e-03
13 2.197234e-06
14 1.316366e-03
15 1.421219e-05
16 4.639239e-05
17 3.671907e-04
18 1.460001e-04
19 4.382849e-05
20 2.387543e-05
21 1.011196e-05
22 2.719982e-04
23 4.445472e-04
24 1.281762e-04
25 5.931106e-09
26 9.741870e-08
27 9.236434e-09
28 8.384690e-08
29 3.311667e-07
30 6.045306e-11
31 2.949265e-08
32 2.070014e-10
33 8.043735e-06
34 2.136666e-08
35 5.604398e-08
36 2.455841e-07
37 3.445977e-07
38 7.314647e-07
I'm assuming you're using the pnn package. The documentation in ?guess suggests it does for PNNs what predict does for other models: it predicts which class the observation belongs to. Everything else in the function is bookkeeping. Why do you get only the probabilities? Because the person who wrote the function made it that way, extracting guess(x)$probabilities and returning only that. If you look at the raw output of guess, you'll also find the predicted class tucked away in the $category list element.
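A minimal sketch of a variant that returns the class as well (hypothetical; it assumes guess() returns a list with $category and $probabilities as described above, and that a foreach backend is registered, otherwise %dopar% falls back to sequential execution with a warning):
library(pnn)
library(foreach)
pred_pnn2 <- function(x, nn){
  xlst <- split(x, 1:nrow(x))
  foreach(i = xlst, .combine = rbind) %dopar% {
    g <- guess(nn, as.matrix(i))
    data.frame(category = g$category,      # predicted class (0, 1, or 2 here)
               prob = g$probabilities[1],  # probability of the first class
               row.names = NULL)
  }
}
# print(pred_pnn2(testing, nn1))  # one row per test observation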
I asked this question a year ago and got code for this "probability heatmap":
numbet <- 32
numtri <- 1e5
prob=5/6
#Fill a matrix
xcum <- matrix(NA, nrow=numtri, ncol=numbet+1)
for (i in 1:numtri) {
x <- sample(c(0,1), numbet, prob=c(prob, 1-prob), replace = TRUE)
xcum[i, ] <- c(i, cumsum(x)/cumsum(1:numbet))
}
colnames(xcum) <- c("trial", paste("bet", 1:numbet, sep=""))
mxcum <- reshape(data.frame(xcum), varying=1+1:numbet,
idvar="trial", v.names="outcome", direction="long", timevar="bet")
library(plyr)
mxcum2 <- ddply(mxcum, .(bet, outcome), nrow)
mxcum3 <- ddply(mxcum2, .(bet), summarize,
ymin=c(0, head(seq_along(V1)/length(V1), -1)),
ymax=seq_along(V1)/length(V1),
fill=(V1/sum(V1)))
head(mxcum3)
library(ggplot2)
p <- ggplot(mxcum3, aes(xmin=bet-0.5, xmax=bet+0.5, ymin=ymin, ymax=ymax)) +
geom_rect(aes(fill=fill), colour="grey80") +
scale_fill_gradient("Outcome", formatter="percent", low="red", high="blue") +
scale_y_continuous(formatter="percent") +
xlab("Bet")
print(p)
(This code may need slight changes with current ggplot2, where formatter="percent" has been replaced by labels = scales::percent.)
This is almost exactly what I want, except each vertical shaft should have a different number of bins, i.e. the first should have 2, the second 3, the third 4 (N+1). In the graph, shafts 6 and 7 have the same number of bins (7), whereas shaft 7 should have 8 (N+1).
If I'm right, the reason the code does this is because it is the observed data and if I ran more trials we would get more bins. I don't want to rely on the number of trials to get the correct number of bins.
How can I adapt this code to give the correct number of bins?
I have used R's dbinom to generate the frequency of heads for n=1:32 trials and plotted the graph below. It is what you expect. I have read some of your earlier posts here on SO and on math.stackexchange. Still, I don't understand why you'd want to simulate the experiment rather than generate from a binomial R.V. If you could explain it, that would be great! I'll try to work on the simulated solution from @Andrie to check whether I can match the output shown below. For now, here's something you might be interested in.
set.seed(42)
numbet <- 32
numtri <- 1e5
prob=5/6
require(plyr)
out <- ldply(1:numbet, function(idx) {
outcome <- dbinom(idx:0, size=idx, prob=prob)
bet <- rep(idx, length(outcome))
N <- round(outcome * numtri)
ymin <- c(0, head(seq_along(N)/length(N), -1))
ymax <- seq_along(N)/length(N)
data.frame(bet, fill=outcome, ymin, ymax)
})
require(ggplot2)
p <- ggplot(out, aes(xmin=bet-0.5, xmax=bet+0.5, ymin=ymin, ymax=ymax)) +
geom_rect(aes(fill=fill), colour="grey80") +
scale_fill_gradient("Outcome", low="red", high="blue") +
xlab("Bet")
The plot:
Edit: Explanation of how your old code from Andrie works and why it doesn't give what you intend.
Basically, what Andrie did (or rather, one way to look at it) is to use the fact that if you have two binomial distributions, X ~ B(n, p) and Y ~ B(m, p), where n, m are the sizes and p is the probability of success, then their sum X + Y ~ B(n + m, p) (1). So the purpose of xcum is to obtain the outcome for all n = 1:32 tosses, but to explain it better, let me construct the code step by step. Along with the explanation, the code for xcum will also become obvious, and it can be constructed in no time (without any need for a for-loop or for computing a cumsum every time).
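As a quick sanity check of (1), not part of the original answer: the sum of two Bernoulli(1/6) draws should behave like a single B(2, 1/6) draw:
set.seed(1)
x <- rbinom(1e5, size = 1, prob = 1/6)
y <- rbinom(1e5, size = 1, prob = 1/6)
table(x + y) / 1e5                            # empirical proportions of 0, 1, 2
round(dbinom(0:2, size = 2, prob = 1/6), 3)   # 0.694 0.278 0.028, roughly matching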
If you have followed me so far, then our idea is first to create a numtri * numbet matrix, with each column (length = numtri) having 0's and 1's with probability 5/6 and 1/6 respectively. That is, if you have numtri = 1000, you'll have ~834 0's and ~166 1's in each of the numbet columns (= 32 here). Let's construct this and test it first.
numtri <- 1e3
numbet <- 32
set.seed(45)
xcum <- t(replicate(numtri, sample(0:1, numbet, prob=c(5/6,1/6), replace = TRUE)))
# check for count of 1's
> apply(xcum, 2, sum)
[1] 169 158 166 166 160 182 164 181 168 140 154 142 169 168 159 187 176 155 151 151 166
163 164 176 162 160 177 157 163 166 146 170
# So, the count of 1's are "approximately" what we expect (around 166).
Now, each of these columns is a sample from a binomial distribution with n = 1 and size = numtri. If we were to add the first two columns and replace the second column with this sum, then from (1), since the probabilities are equal, we'll end up with a binomial distribution with n = 2. Similarly, if instead you had added the first three columns and replaced the 3rd column by this sum, you would have obtained a binomial distribution with n = 3, and so on...
The concept is that if you cumulatively add each column, then you end up with numbet number of binomial distributions (1 to 32 here). So, let's do that.
xcum <- t(apply(xcum, 1, cumsum))
# you can verify that the second column has similar probabilities by this:
# calculate the frequency of all values in 2nd column.
> table(xcum[,2])
0 1 2
694 285 21
> round(numtri * dbinom(2:0, 2, prob=5/6))
[1] 694 278 28
# more or less identical, good!
If you divide the xcum we have generated thus far by cumsum(1:numbet) over each row, in this manner:
xcum <- xcum/matrix(rep(cumsum(1:numbet), each=numtri), ncol = numbet)
this will be identical to the xcum matrix that comes out of the for-loop (if you generate it with the same seed). However, I don't quite understand the reason for this division by Andrie, as it is not necessary to generate the graph you require. I suppose it has something to do with the frequency values you talked about in an earlier post on math.stackexchange.
Now on to why you have difficulties obtaining the graph I had attached (with n+1 bins):
For a binomial distribution with n = 1:32 trials, 5/6 as the probability of tails (failure) and 1/6 as the probability of heads (success), the probability of k heads is given by:
nCk * (1/6)^k * (5/6)^(n-k) # where nCk is n choose k
For the test data we've generated, the observed proportions of k heads for n=7 and n=8 (trials) are:
# n=7
0 1 2 3 4 5
.278 .394 .233 .077 .016 .002
# n=8
0 1 2 3 4 5
.229 .375 .254 .111 .025 .006
Why do they both have 6 bins, and not 8 and 9 bins? Of course, this has to do with the value of numtri = 1000. Let's compute the probabilities of each of these 8 and 9 bins directly from the binomial distribution using dbinom, to understand why this happens.
# n = 7
dbinom(7:0, 7, prob=5/6)
# output rounded to 3 decimal places
[1] 0.279 0.391 0.234 0.078 0.016 0.002 0.000 0.000
# n = 8
dbinom(8:0, 8, prob=5/6)
# output rounded to 3 decimal places
[1] 0.233 0.372 0.260 0.104 0.026 0.004 0.000 0.000 0.000
You see that the probabilities corresponding to k = 6,7 (for n = 7) and k = 6,7,8 (for n = 8) are ~0. They are very low in value; the minimum here is actually about 5.8 * 1e-7 (n = 8, k = 8). This means you'd expect to see one such value only if you simulated roughly 1/(5.8 * 1e-7), i.e. about 1.7 million, times. If you check the same for n = 32 and k = 32, the value is 1.256493 * 1e-25, so you'd have to simulate that many trials to get at least one result where all 32 outcomes are heads for n = 32.
This is why your results had no values for certain bins: the probability of landing in them is very low for the given numtri. For the same reason, generating the probabilities directly from the binomial distribution overcomes this problem/limitation.
I hope I've managed to write with enough clarity for you to follow. Let me know if you have trouble going through it.
Edit 2:
When I ran the code I've just edited above with numtri = 1e6, counting the number of heads for k=0:7 and k=0:8, I get this for n=7 and n=8:
# n = 7
0 1 2 3 4 5 6 7
279347 391386 233771 77698 15763 1915 117 3
# n = 8
0 1 2 3 4 5 6 7 8
232835 372466 259856 104116 26041 4271 392 22 1
Note that there are now counts for k=6 and k=7 for n=7 and n=8. Also, for n=8, you have one occurrence of k=8. With increasing numtri you'll fill in more of the missing bins, but it will require a huge amount of time/memory (if it is feasible at all).