Different dimensions of distributions of topics [closed] - r

I would like to divide all documents into 10 topics, and the model converges without problems, except for the dimensions of the topic distributions and the topic covariance matrix.
Why is each topic distribution a 9-dimensional vector instead of 10, and why is the covariance matrix 9x9 instead of 10x10?
I have used library(topicmodels) and the function CTM() to fit the topic model on Chinese text.
My code is below:
library(rJava)
library(Rwordseg)
library(NLP)
library(tm)
library(tmcn)
library(topicmodels)
installDict("C:\\Users\\Jeffy\\OneDrive\\Workplace\\R\\Law.scel", "Law")
installDict("C:\\Users\\Jeffy\\OneDrive\\Workplace\\R\\NationalInstitution.scel", "NationalInstitution")
installDict("C:\\Users\\Jeffy\\OneDrive\\Workplace\\R\\Place.scel", "Place")
installDict("C:\\Users\\Jeffy\\OneDrive\\Workplace\\R\\Psychology.scel", "Psychology")
installDict("C:\\Users\\Jeffy\\OneDrive\\Workplace\\R\\Politics.scel", "Politics")
listDict()
#read file
d.vec <- segmentCN("samgovWithoutID.csv", returnType = "tm")
samgov.segment <- read.table("samgovWithoutID.segment.csv", header = TRUE, fill = TRUE, stringsAsFactors = F, sep = ",",fileEncoding='utf-8')
fix(samgov.segment)
# create DTM(document term matrix)
d.corpus <- Corpus(VectorSource(samgov.segment$content))
inspect(d.corpus[1:10])
d.corpus <- tm_map(d.corpus, removeWords, stopwordsCN())
ctrl <- list(removePunctuation = TRUE, removeNumbers = TRUE, stopwords = stopwordsCN(), wordLengths = c(1, Inf))
d.dtm <- DocumentTermMatrix(d.corpus, control = ctrl)
inspect(d.dtm[1:10, 110:112])
# implement topic models
ctm10<-CTM(d.dtm,k=10, control=list(seed=2014012692))
Terms10 <- terms(ctm10, 10)
Terms10[,1:10]
ctm20<-CTM(d.dtm,k=20, control=list(seed=2014012692))
Terms20 <- terms(ctm20, 20)
Terms20[,1:20]
[Screenshots of the RStudio output (with the 9-dimensional results highlighted) and of the help page are omitted.]

A probability distribution over 10 values has 9 free parameters: once I tell you the probability of the first 9, the probability of the last value has to be one minus the sum of those probabilities.
A 10-dimensional logistic normal distribution is equivalent to sampling a 10-dimensional vector from a Gaussian distribution and then "squashing" that vector by exponentiating it and normalizing it to sum to 1.0. There are an infinite number of 10-dimensional vectors that will exponentiate and normalize to the same 10-dimensional probability distribution -- you just have to add an arbitrary constant c to each value. That's because the mean of the Gaussian has 10 free parameters, one more than the more constrained distribution.
There are several ways to make the Gaussian "identifiable". One is to fix one of the elements of the mean vector to be 0.0. That's why you see a 9-dimensional mean and covariance matrix: the 10th value is always 0 with no variance.
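A quick illustration of that shift invariance (an added sketch, not from the original answer): adding a constant to the Gaussian draw before exponentiating and normalizing leaves the resulting topic distribution unchanged.
set.seed(1)
eta <- rnorm(10)                         # a hypothetical 10-dimensional Gaussian draw
p1 <- exp(eta) / sum(exp(eta))           # "squash" into a probability distribution
p2 <- exp(eta + 5) / sum(exp(eta + 5))   # add an arbitrary constant first
all.equal(p1, p2)                        # TRUE: the same distribution over 10 topics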

Related

How does one calculate the correct amplitude with R's fft()-function? [closed]

In R, I am generating uncorrelated values in the time domain with rnorm(). I then apply fft() to these values; however, the mean amplitude I get is about 0.88 instead of 1. Is there anything I am not aware of?
Here is a MWE:
# dt <- 0.01 # time step
nSteps <- 100000 # Number of time steps
# df <- 1/(nSteps*dt) # frequency resolution
# t <- 0:(nSteps-1)*dt #
y <- rnorm(nSteps, mean=0, sd=1) # generate uncorrelated data. Should result in a white noise spectrum with sd=1
y_sq_sum <- sum(y^2)
# We ignore cutting to the Nyquist frequency.
# f <- 0:(nSteps-1)*df
fft_y <- abs(fft(y))/sqrt(length(y))
fft_y_sq_sum <- sum(fft_y^2)
print(paste("Check for Parseval's theorem: y_sq_sum = ", y_sq_sum, "; fft_y_sq_sum = ", fft_y_sq_sum, sep=""))
print(paste("Mean amplitude of my fft spectrum: ", mean(fft_y)))
print(paste("The above is typically around 0.88, why is it not 1?"))
This question doesn't really belong on Stack Overflow; it's more of a Cross Validated kind of thing. But here's an answer anyway:
Parseval's theorem says that the mean of fft_y^2 should be 1. The square root function is a concave function, so Jensen's inequality says the mean of sqrt(fft_y^2) will be less than 1. Since fft_y is positive in your definition, fft_y = sqrt(fft_y^2).
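To make this concrete (an added illustration, not part of the original answer): for white noise the FFT magnitudes are approximately Rayleigh distributed, so their mean settles near sqrt(pi)/2 ≈ 0.886 even though the mean of their squares is 1.
set.seed(42)
y <- rnorm(1e5)
fft_y <- abs(fft(y)) / sqrt(length(y))
mean(fft_y^2)   # ~1, as Parseval's theorem requires
mean(fft_y)     # ~0.886, noticeably below 1
sqrt(pi) / 2    # 0.8862..., the Rayleigh mean the magnitudes converge to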

How should I specify argument "prob" when using sample() for resampling?

In short
I'm trying to better understand the prob argument of the sample() function in R. In what follows, I both ask a question and provide a piece of R code in connection with it.
Question
Suppose I have generated 10,000 draws from a standard normal with rnorm(). I then want to draw a sample of size 5 from this mother set of 10,000 draws.
How should I set the prob argument of sample() so that the probability of drawing these 5 numbers reflects the fact that the middle of the mother distribution is denser while the tails are thinner (so the 5 numbers would be drawn from the denser areas more often than from the tails)?
x = rnorm(1e4)
sample( x = x, size = 5, replace = TRUE, prob = ? ) ## what should be "prob" here?
# OR I leave `prob` to be the default by not using it:
sample( x = x, size = 5, replace = TRUE )
You are overthinking this.
You want to resample these samples following their original, i.e. empirical, distribution. Think about how an empirical CDF is obtained:
plot(sort(x), 1:length(x)/length(x))
In other words, the empirical PDF is just
plot(sort(x), rep(1/length(x), length(x)))
So we want prob = rep(1/length(x), length(x)), or simply prob = rep(1, length(x)), as sample() normalizes prob internally. Or just leave it unspecified, since equal probability is the default.
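A quick sanity check (an added sketch, not part of the original answer): resampling with the default equal weights already reproduces the shape of the empirical distribution.
set.seed(1)
x <- rnorm(1e4)
s <- sample(x, size = 1e4, replace = TRUE)   # default: equal probability per element
round(quantile(x), 2)
round(quantile(s), 2)                        # nearly identical quantiles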

how to create a random loss sample in r using if function

I am currently working on generating some random data for a school project.
I have created a variable in R using a binomial distribution to determine whether an observation had a loss (yes = 1) or not (= 0).
Afterwards I am trying to generate the loss amount, using a random distribution, for all observations which had a loss (= 1).
As my loss amount is a percentage, it can be anywhere between 0 and 1, which is why I am looking at the beta distribution (see "What Is The Intuition Behind Beta Distribution" on stats.stackexchange).
In a third step I am looking for an if statement which combines my two variables.
Please find below my code (which only works for the Loss_Y_N variable):
Loss_Y_N = rbinom(1000000,1,0.01)
Loss_Amount = dbeta(x, 10, 990, ncp = 0, log = FALSE)
ideally I can combine the two into something like
if(Loss_Y_N=1 then Loss_Amount=dbeta(...) #... is meant to be a random variable with mean=0.15 and should be 0<x=<1
else Loss_Amount=0)
Any input highly appreciated!
Create a vector for your loss proportion. Fill up the elements corresponding to losses with draws from the beta. Tweak the parameters for the beta until you get the desired result.
N <- 100000
loss_indicator <- rbinom(N, 1, 0.1)
loss_prop <- numeric(N)
loss_prop[loss_indicator > 0] <- rbeta(sum(loss_indicator), 10, 990)
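One note on tweaking the parameters (added here, not part of the original answer): the mean of a Beta(a, b) distribution is a / (a + b), so the mean of 0.15 mentioned in the question could be obtained with, for example, a = 3 and b = 17.
loss_prop[loss_indicator > 0] <- rbeta(sum(loss_indicator), 3, 17)   # Beta(3, 17) has mean 0.15
mean(loss_prop[loss_indicator > 0])                                  # close to 0.15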

Create correlated variables from existing variable [closed]

Let's say I have a vector:
Q<-rnorm(10,mean=0,sd=20)
From this vector I would like to:
1. create 10 variables (a1...a10) that each have a correlation above .5 (i.e. between .5 and 1) with Q.
The first part can be done with:
t1 <- sapply(1:10, function(x) jitter(Q, factor = 100))
2. each of these variables (a1...a10) should have a pre-specified correlation with each other. For example some should be correlated .8 and some -.2.
Can these two things be done?
I create a correlation matrix:
cor.table <- matrix(sample(c(0.9, -0.9), 2500, prob = c(0.8, 0.2), replace = TRUE), 50, 50)
cor.table[1, ] <- 0.55
cor.table[, 1] <- 0.55
diag(cor.table) <- 1
However, when I apply the excellent solution by @SprengMeister I get the error:
Error in eigen(cor.table)$values > 0 :
invalid comparison with complex values
continued here: Eigenvalue decomposition of correlation matrix
As a pointer to a solution, use the noise function jitter() in R:
set.seed(100)
t = rnorm(10,mean=0,sd=20)
t1 = jitter(t, factor = 100)
cor(t,t1)
[1] 0.8719447
To generate data with a prescribed correlation (or variance), you can start with random data and rescale it using the Cholesky decomposition of the desired correlation matrix.
# Sample data
Q <- rnorm(10, mean=0, sd=20)
desired_correlations <- matrix(c(
1, .5, .6, .5,
.5, 1, .2, .8,
.6, .2, 1, .5,
.5, .8, .5, 1 ), 4, 4 )
stopifnot( eigen( desired_correlations )$values > 0 )
# Random data, with Q in the first column
n <- length(Q)
k <- ncol(desired_correlations)
x <- matrix( rnorm(n*k), nc=k )
x[,1] <- Q
# Rescale, first to make the variance equal to the identity matrix,
# then to get the desired correlation matrix.
y <- x %*% solve(chol(var(x))) %*% chol(desired_correlations)
var(y)
y[,1] <- Q # The first column was only rescaled: that does not affect the correlation
cor(y) # Desired correlation matrix
I answered a very similar question a little while ago: R: Constructing correlated variables.
I am not familiar with jitter, so maybe my solution is more verbose, but it lets you determine exactly what the intercorrelation of each of your variables with Q is supposed to be.
The F matrix referenced in that answer describes the intercorrelations that you want to impose on your data.
EDIT to answer question in comment:
If I am not mistaken, you are trying to create a multivariate correlated data set, so all the variables in the set are correlated to varying degrees. I assume Q is your criterion or DV, and a1-a10 are predictors or IVs.
In the F matrix you would reflect the relationships between these variables. For example
cor_Matrix <- matrix(c(1.00, 0.90, 0.20 ,
0.90, 1.00, 0.40 ,
0.20, 0.40, 1.00),
nrow=3,ncol=3,byrow=TRUE)
describes the relationships between three variables. The first one could be Q, the second a1, and the third a2. So in this scenario, Q is correlated with a1 (.90) and with a2 (.20),
and a1 is correlated with a2 (.40).
The rest of the matrix is redundant, since a correlation matrix is symmetric.
In the remainder of the code, you simply create your raw, uncorrelated variables and then impose on them the correlations specified in the F matrix.
I hope this helps. If there is a package in R that does all that, please let me know. I built this to help me understand how multivariate data sets are actually generated.
To generalize this to 10 variables plus Q, just change the parameters that are now set to 3 to 11 and create an 11x11 F matrix.
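For completeness, a minimal sketch of that idea (my own illustration using the Cholesky factor of the F matrix, not the exact code from the linked answer):
set.seed(1)
n <- 1000
k <- nrow(cor_Matrix)
raw <- matrix(rnorm(n * k), nrow = n, ncol = k)   # raw, uncorrelated standard normals
sim <- raw %*% chol(cor_Matrix)                   # impose the desired correlations
round(cor(sim), 2)                                # close to cor_Matrix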

In R, how do I find the optimal variable to minimise the correlation between two datasets [duplicate]

Possible Duplicate:
In R, how do I find the optimal variable to maximize or minimize correlation between several datasets
This can be done in Excel, but my dataset has gotten too large. In Excel, I would use Solver.
I have 5 variables and I want to create a weighted average of these 5 variables so that it has the lowest possible correlation with a 6th variable.
Column A,B,C,D,E = random numbers
Column F = random number (which I want to minimise the correlation to)
Column G = A*wi1 + B*wi2 + C*wi3 + D*wi4 + E*wi5
where wi1 to wi5 are the coefficients produced by Solver. In a separate cell, I would have correl(F, G).
This is all achieved with the following constraints in mind:
1. the weights wi1 to wi5 have to be between 0 and 1
2. wi1 + wi2 + wi3 + wi4 + wi5 = 1
I'd like to print the results of this so that I can have an efficient frontier type chart.
How can I do this in R? Thanks for the help.
I looked at the other thread mentioned by Vincent and I think I have a better solution. I hope it is correct. As Vincent points out, your biggest problem is that the optimization tools for such non-linear problems do not offer a lot of flexibility for dealing with your constraints. Here, you have two types of constraints: 1) all your weights must be >= 0, and 2) they must sum to 1.
The optim function has a lower option that can take care of your first constraint. For the second constraint, you have to be a bit creative: you can force your weights to sum to one by scaling them inside the function to be minimized, i.e. rewrite your correlation function as function(w) cor(X %*% w / sum(w), Y).
# create random data
n.obs <- 100
n.var <- 6
X <- matrix(runif(n.obs * n.var), nrow = n.obs, ncol = n.var)
Y <- matrix(runif(n.obs), nrow = n.obs, ncol = 1)
# function to minimize
correl <- function(w)cor(X %*% w / sum(w), Y)
# initial guess
w0 <- rep(1 / n.var, n.var)
# optimize
opt <- optim(par = w0, fn = correl, method = "L-BFGS-B", lower = 0)
optim.w <- opt$par / sum(opt$par)
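As a quick usage check (added here, not part of the original answer), compare the objective at the equal-weight starting point with the optimized, rescaled weights:
correl(w0)        # correlation using equal weights
correl(optim.w)   # should be lower after optimization
sum(optim.w)      # the rescaled weights sum to 1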
