What to use for ancestry adjustment using eigenstrat in R

I'm confused by the eigenstrat function. Do I use the output top-k eigenvectors (an n x k matrix) in the subsequent regressions for ancestry adjustment? The papers, however, describe adjusting with 4 principal components, and solving for those makes the matrices non-conformable (assuming I followed the input genofile layout where the rows are the p SNPs and the columns are the n subjects). Basically, I'm confused about how papers and documentation use the terms eigenvector and principal component; I think people use them interchangeably.
I tried following the documentation, so I used the output eigenvectors in the subsequent regressions (basically what's in the documentation):
write.table(t(genotype), file = "riskalleles.txt", quote = FALSE, sep = "", row.names = FALSE, col.names = FALSE)
pca.eg <- eigenstrat(genoFile = "riskalleles.txt", outFile.Robj = "eigenstrat.result.list",
                     outFile.txt = "riskalleles.result.txt", rm.marker.index = NULL,
                     rm.subject.index = NULL, miss.val = 9, num.splits = 10, topK = NULL,
                     signt.eigen.level = 0.01, signal.outlier = FALSE, iter.outlier = 5,
                     sigma.thresh = 6)
ev <- pca.eg$topK.eigenvectors
gamma <- t(ev)%*%genotype
adjusted <- t(genotype - ev%*%gamma)
(solving for PCs using formula from my multivariate class)
pc <- t(genotype)%*%ev
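For intuition (this is not eigenstrat's own code), here is a minimal sketch with toy data showing why the two terms usually coincide: for a p x n genotype matrix, the top-k eigenvectors of the subject-by-subject covariance matrix are n x k and match the subject-level principal component scores up to a column-wise scaling, and those n x k columns are what get added as ancestry covariates in the downstream regressions. The toy matrix G, and the phenotype and snp names in the commented line, are hypothetical.
# a hedged sketch, assuming the genofile layout above (p SNPs in rows, n subjects in columns)
set.seed(1)
p <- 500; n <- 60
G  <- matrix(rbinom(p * n, 2, 0.3), nrow = p)    # toy genotype matrix, p x n
Gc <- scale(t(G), center = TRUE, scale = FALSE)  # n x p, each SNP centred
ev  <- eigen(tcrossprod(Gc) / p)$vectors[, 1:4]  # n x 4 eigenvectors of the subject covariance
pcs <- prcomp(Gc)$x[, 1:4]                       # n x 4 principal component scores
round(abs(diag(cor(ev, pcs))), 2)                # each pair correlates ~1 (up to sign)
# fit <- lm(phenotype ~ snp + ev)                # hypothetical downstream adjustment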

Do you need to tokenize your text to visualize data from an LDA topic model?

I'm currently using the textmineR package to run LDA topic models on news articles from 2016-2019.
However, I am quite new to R and I don't know how to display my results from the model.
I want to show the prevalence of the 8 topics my model finds over the time period for which I have collected data. The data is structured in a data frame, with dates at the daily level in %y-%m-%d format.
My LDA model is made like this:
## get textmineR dtm
dtm <- CreateDtm(doc_vec = dat$fulltext,   # character vector of documents
                 ngram_window = c(1, 2),
                 doc_names = dat$names,
                 stopword_vec = c(stopwords::stopwords("da"), custom_stopwords),
                 lower = T,                # lowercase - this is the default value
                 remove_punctuation = T,   # punctuation - this is the default
                 remove_numbers = T,       # numbers - this is the default
                 verbose = T,
                 cpus = 4)
dtm <- dtm[, colSums(dtm) > 3]
dtm <- dtm[, str_length(colnames(dtm)) > 3]
############################################################
## RUN & EXAMINE TOPIC MODEL
############################################################
# Draw quasi-random sample from the pc
set.seed(34838)
model <- FitLdaModel(dtm = dtm,
                     k = 8,
                     iterations = 500,
                     burnin = 200,
                     alpha = 0.1,
                     beta = 0.05,
                     optimize_alpha = TRUE,
                     calc_likelihood = TRUE,
                     calc_coherence = TRUE,
                     calc_r2 = TRUE,
                     cpus = 4)
# model log-likelihood
plot(model$log_likelihood, type = "l")
# topic coherence
summary(model$coherence)
hist(model$coherence,
     col = "blue",
     main = "Histogram of probabilistic coherence")
# top terms by topic
model$top_terms1 <- GetTopTerms(phi = model$phi, M = 10)
t(model$top_terms1)
# topic prevalence
model$prevalence <- colSums(model$theta) / sum(model$theta) * 100
# prevalence should be proportional to alpha
plot(model$prevalence, model$alpha, xlab = "prevalence", ylab = "alpha")
Can anyone tell me how to plot the most prevalent topics the model finds over time?
Do I need to tokenize the text or something like that?
I hope this makes sense.
Best,
Tokenization happens in the CreateDtm function. So, it doesn't sound like that's your issue.
You can get the prevalence of topics over a set of documents by taking a mean over the columns of theta, a matrix that's part of the resulting model.
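In terms of the objects named in the question, that is roughly the following (assuming model is the object returned by FitLdaModel):
prevalence <- colMeans(model$theta)   # one value per topic, averaged over all documents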
I can't give you an exact answer with your data, but I can show you a similar example with the nih_sample data that ships with textmineR:
# load the NIH sample data
data(nih_sample)
# create a dtm and topic model
dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 doc_names = nih_sample$APPLICATION_ID)
m <- FitLdaModel(dtm = dtm, k = 20, iterations = 100, burnin = 75)
# aggregate theta by the year of the PROJECT_END variable
end_year <- stringr::str_split(string = nih_sample$PROJECT_END, pattern = "/")
end_year <- sapply(end_year, function(x) x[length(x)])
end_year <- as.numeric(end_year)
topic_by_year <- by(data = m$theta, INDICES = end_year, FUN = function(x){
  if (is.null(nrow(x))) {
    # if only one row, it gets converted to a vector
    # just return that vector
    return(x)
  } else { # if multiple rows, then aggregate
    return(colMeans(x))
  }
})

topic_by_year <- as.data.frame(do.call(rbind, topic_by_year))

# the years end up as row names; put them in a column for plotting
topic_by_year$year <- as.numeric(rownames(topic_by_year))

# plot topic 10's prevalence by year
plot(topic_by_year$year, topic_by_year$t_10, type = "l")
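From there, one way to plot every topic's prevalence over time is to reshape to long format and let ggplot2 draw one line per topic. This is a hedged sketch: it assumes a topic_by_year data frame shaped as above, with a year column and one t_* column per topic. For your own data you would aggregate theta by the parsed %y-%m-%d date (or by month) instead of by year.
library(tidyr)
library(ggplot2)

# one row per (year, topic) pair
topic_long <- pivot_longer(topic_by_year,
                           cols = starts_with("t_"),
                           names_to = "topic",
                           values_to = "prevalence")

ggplot(topic_long, aes(x = year, y = prevalence, colour = topic)) +
  geom_line()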

Can't use mppm on multitype point patterns

I'm trying to fit a MultiStraussHardcore interaction to one of the sample datasets in spatstat (flu). I'm maintaining the same interaction and hardcore radii for all types and point patterns. I'm running the following block:
library(spatstat)
library("optimbase")
flusubset <- flu[1:4]
typelist <- lapply(lapply(flusubset$pattern, marks), levels)
stopifnot(length(unique(typelist))==1)
num_marks <- length(typelist[[1]])
iradii <- 50*ones(num_marks)
hradii <- 3*ones(num_marks)
Int <- anylist()
for (i in 1:dim(flusubset)[1]) {
  Int[[i]] <- MultiStraussHard(iradii = iradii, hradii = hradii)
}
Int <- as.hyperframe(Int)
multmodel <- mppm(pattern ~ 1, data=flusubset, interaction=Int)
Each time I run mppm, I get the following error:
Error in (function (d, tx, tu, par) : data and model do not have the same possible levels of marks
I've included the traceback, too.
12. stop("data and model do not have the same possible levels of marks")
11. (function (d, tx, tu, par) { r <- par$iradii h <- par$hradii ...
10. do.call(fun, usedargs)
9. do.call.matched(pairpot, list(d = matrix(, 0, 0), tx = marks(X)[integer(0)], tu = marks(P)[integer(0)], par = potpars))
8. evalPairPotential(X, U, EqualPairs, pairpot, potpars, Reach)
7. evaluate(X, P, E, interaction$pot, interaction$par, correction = correction, splitInf = splitInf, ..., Reach = Reach, precomputed = precomputed, savecomputed = savecomputed)
6. evalInterEngine(X = X, P = P, E = E, interaction = interaction, correction = correction, splitInf = splitInf, ..., precomputed = precomputed, savecomputed = savecomputed)
5. evalInteraction(X, P, E, interaction, correction, ..., splitInf = splitInf, precomputed = precomputed, savecomputed = savecomputed)
4. mpl.prepare(Q, X, P, trend, interaction, covariates, want.trend, want.inter, correction, rbord, "quadrature points", callstring, subsetexpr = subsetexpr, allcovar = allcovar, precomputed = precomputed, savecomputed = savecomputed, covfunargs = covfunargs, weightfactor = weightfactor, ...
3. mpl.engine(Q, trend = trend, interaction = interaction, ..., covariates = covariates, correction = correction, rbord = rbord, use.gam = use.gam, allcovar = allcovar, preponly = TRUE, forcefit = TRUE)
2. bt.frame(Yi, ~1, inter, ..., covariates = covariates, allcovar = TRUE, use.gam = use.gam, vnamebase = itags[j], vnameprefix = itags[j])
1. mppm(pattern ~ 1, data = flusubset, interaction = Int)
I've tried fitting a MultiStraussHardcore model with ppm for each individual point pattern, and I have no issues. I've confirmed that the possible levels of each point pattern are identical. I've also verified that the interaction and hardcore radii matrices have the correct dimensions (2x2 for both) and that my hyperframe containing the interaction objects has the correct dimensions. Thanks!
Thank you for the well described problem and the reproducible example. It made things very easy for me.
Indeed you have found a minor bug (documentation inconsistency). Your code runs without errors if, in the for loop, you replace
MultiStraussHard(iradii=iradii, hradii=hradii)
by
typ <- levels(marks(flu$pattern[[1]]))
MultiStraussHard(iradii=iradii, hradii=hradii, types=typ)
The documentation for MultiStraussHard says that the argument types is optional, but that is only true for ppm calls at the moment. I will see if it is possible to extend the auto detection of types to mppm, so your original code would work in future versions of spatstat.
This has been fixed in the latest development version of spatstat, available from the GitHub repository.
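For completeness, here is a sketch of the question's loop with that workaround applied (same radii as in the question; nothing else changes):
typ <- levels(marks(flusubset$pattern[[1]]))

Int <- anylist()
for (i in 1:dim(flusubset)[1]) {
  # supply the mark levels explicitly so the interaction and the data agree on types
  Int[[i]] <- MultiStraussHard(iradii = iradii, hradii = hradii, types = typ)
}
Int <- as.hyperframe(Int)

multmodel <- mppm(pattern ~ 1, data = flusubset, interaction = Int)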

MXNET softmax output: label shape confusion

I have not got a clear idea of how labels for the softmax classifier should be shaped.
What I could understand from my experiments is that a scalar label indicating the index of the class-probability output is one option, while another is a 2D label where the rows are class probabilities, i.e. a one-hot encoded variable like c(1, 0, 0).
What puzzles me, though, is that:
I can use scalar label values that go beyond indexing, like 4 in my example below, without warning or error. Why is that?
When my label is a negative scalar or an array with a negative value, the model converges to a uniform probability distribution over classes. For example, is it expected that actor_train.y = matrix(c(0, -1, 0), ncol = 1) results in equal probabilities in the softmax output?
I am trying to use the MXNET softmax classifier for policy-gradient reinforcement learning, and my negative rewards lead to the issue above: uniform probabilities. Is that expected?
require(mxnet)
actor_initializer <- mx.init.Xavier(rnd_type = "gaussian",
                                    factor_type = "avg",
                                    magnitude = 0.0001)

actor_nn_data <- mx.symbol.Variable('data')
actor_nn_label <- mx.symbol.Variable('label')
device.cpu <- mx.cpu()
# NN architecture
actor_fc3 <- mx.symbol.FullyConnected(
  data = actor_nn_data,
  num_hidden = 3)

actor_output <- mx.symbol.SoftmaxOutput(
  data = actor_fc3,
  label = actor_nn_label,
  name = 'actor')

crossentfunc <- function(label, pred) {
  - sum(label * log(pred))
}

actor_loss <- mx.metric.custom(
  feval = crossentfunc,
  name = "log-loss")
# initialize NN
actor_train.x <- matrix(rnorm(11), nrow = 1)
actor_train.y = 0 #1 #2 #3 #-3 # matrix(c(0, 0, -1), ncol = 1)
rm(actor_model)
actor_model <- mx.model.FeedForward.create(
  symbol = actor_output,
  X = actor_train.x,
  y = actor_train.y,
  ctx = device.cpu,
  num.round = 100,
  array.batch.size = 1,
  optimizer = 'adam',
  eval.metric = actor_loss,
  clip_gradient = 1,
  wd = 0.01,
  initializer = actor_initializer,
  array.layout = "rowmajor")
predict(actor_model, actor_train.x, array.layout = "rowmajor")
It is quite strange to me, but I found a solution.
I changed the optimizer from optimizer = 'adam' to optimizer = 'rmsprop', and the NN started to converge as expected with negative targets. I reproduced the same result in R with a simple NN and the optim function.
It looks like adam (or SGD) may be buggy in the case of multinomial classification... I also used to get stuck on the fact that those optimizers did not converge to a perfect solution on even one example, while rmsprop does. Be aware!
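For clarity, this is the only change relative to the call in the question (all other arguments are unchanged):
actor_model <- mx.model.FeedForward.create(
  symbol = actor_output,
  X = actor_train.x,
  y = actor_train.y,
  ctx = device.cpu,
  num.round = 100,
  array.batch.size = 1,
  optimizer = 'rmsprop',   # changed from 'adam'
  eval.metric = actor_loss,
  clip_gradient = 1,
  wd = 0.01,
  initializer = actor_initializer,
  array.layout = "rowmajor")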

Learning hidden markov model in R

A hidden Markov model (HMM) is one in which you observe a sequence of observations, but do not know the sequence of states the model went through to generate the observations. Analyses of hidden Markov models seek to recover the sequence of hidden states from the observed data.
I have data with both observations and hidden states (the observations are continuous values), where the hidden states were tagged by an expert. I would like to train an HMM that would be able, based on a (previously unseen) sequence of observations, to recover the corresponding hidden states.
Is there any R package to do that? Studying the existing packages (depmixS4, HMM, seqHMM - for categorical data only), it seems they only let you specify the number of hidden states.
EDIT:
Example:
data.tagged.by.expert = data.frame(
  hidden.state = c("Wake", "REM", "REM", "NonREM1", "NonREM2", "REM", "REM", "Wake"),
  sensor1 = c(1, 1.2, 1.2, 1.3, 4, 2, 1.78, 0.65),
  sensor2 = c(7.2, 5.3, 5.1, 1.2, 2.3, 7.5, 7.8, 2.1),
  sensor3 = c(0.01, 0.02, 0.08, 0.8, 0.03, 0.01, 0.15, 0.45)
)

data.newly.measured = data.frame(
  sensor1 = c(2, 3, 4, 5, 2, 1, 2, 4, 5, 8, 4, 6, 1, 2, 5, 3, 2, 1, 4),
  sensor2 = c(2.1, 2.3, 2.2, 4.2, 4.2, 2.2, 2.2, 5.3, 2.4, 1.0, 2.5, 2.4, 1.2, 8.4, 5.2, 5.5, 5.2, 4.3, 7.8),
  sensor3 = c(0.23, 0.25, 0.23, 0.54, 0.36, 0.85, 0.01, 0.52, 0.09, 0.12, 0.85, 0.45, 0.26, 0.08, 0.01, 0.55, 0.67, 0.82, 0.35)
)
I would like to create an HMM with discrete time t where the random variable x(t) represents the hidden state at time t, x(t) ∈ {"Wake", "REM", "NonREM1", "NonREM2"}, and 3 continuous random variables sensor1(t), sensor2(t), sensor3(t) represent the observations at time t.
model.hmm = learn.model(data.tagged.by.expert)
Then I would like to use the created model to estimate hidden states responsible for newly measured observations
hidden.states = estimate.hidden.states(model.hmm, data.newly.measured)
Data (training/testing)
To be able to run learning methods for the Naive Bayes classifier, we need a longer data set:
states = c("NonREM1", "NonREM2", "NonREM3", "REM", "Wake")
artificial.hypnogram = rep(c(5,4,1,2,3,4,5), times = c(40,150,200,300,50,90,30))
data.tagged.by.expert = data.frame(
  hidden.state = states[artificial.hypnogram],
  sensor1 = log(artificial.hypnogram) + runif(n = length(artificial.hypnogram), min = 0.2, max = 0.5),
  sensor2 = 10*artificial.hypnogram + sample(c(-8:8), size = length(artificial.hypnogram), replace = T),
  sensor3 = sample(1:100, size = length(artificial.hypnogram), replace = T)
)

hidden.hypnogram = rep(c(5,4,1,2,4,5), times = c(10,10,15,10,10,3))

data.newly.measured = data.frame(
  sensor1 = log(hidden.hypnogram) + runif(n = length(hidden.hypnogram), min = 0.2, max = 0.5),
  sensor2 = 10*hidden.hypnogram + sample(c(-8:8), size = length(hidden.hypnogram), replace = T),
  sensor3 = sample(1:100, size = length(hidden.hypnogram), replace = T)
)
Solution
In the solution, we used the Viterbi algorithm combined with a Naive Bayes classifier.
At each clock time t, a hidden Markov model consists of:
an unobserved state (denoted as hidden.state in this case) taking a finite number of states
states = c("NonREM1", "NonREM2", "NonREM3", "REM", "Wake")
a set of observed variables (sensor1, sensor2, sensor3 in this case)
Transition matrix
A new state is entered based upon a transition probability distribution (the transition matrix). This can easily be estimated from data.tagged.by.expert, e.g. using:
library(markovchain)
trans_p <- markovchainFit(data.tagged.by.expert$hidden.state)$estimate
Emission matrix
After each transition is made, an observation (sensor_i) is produced according to a conditional probability distribution (emission matrix) which depends only on the current value of hidden.state. We will replace the emission matrices with a Naive Bayes classifier.
library(caret)
library(klaR)
library(e1071)
model = train(hidden.state ~ .,
              data = data.tagged.by.expert,
              method = 'nb',
              trControl = trainControl(method = 'cv', number = 10))
Viterbi algorithm
To solve the problem, we use the Viterbi algorithm with an initial probability of 1 for the "Wake" state and 0 otherwise (we expect the patient to be awake at the beginning of the experiment).
# we expect the patient to be awake in the beginning
start_p = c(NonREM1 = 0, NonREM2 = 0, NonREM3 = 0, REM = 0, Wake = 1)

# Naive Bayes model
model_nb = model$finalModel

# the observations
observations = data.newly.measured
nObs <- nrow(observations)    # number of observations
nStates <- length(states)     # number of states

# T1, T2 initialization
T1 <- matrix(0, nrow = nStates, ncol = nObs)  # define two 2-dimensional tables
row.names(T1) <- states
T2 <- T1

# init first column of T1
Byj <- predict(model_nb, newdata = observations[1, ])$posterior
for (s in states)
  T1[s, 1] = start_p[s] * Byj[1, s]

# fill T1 and T2 tables
for (j in 2:nObs) {
  Byj <- predict(model_nb, newdata = observations[j, ])$posterior
  for (s in states) {
    res <- (T1[, j-1] * trans_p[, s]) * Byj[1, s]
    T2[s, j] <- states[which.max(res)]
    T1[s, j] <- max(res)
  }
}

# backtrack the best path
result <- rep("", times = nObs)
result[nObs] <- names(which.max(T1[, nObs]))
for (j in nObs:2) {
  result[j-1] <- T2[result[j], j]
}

# show the result
result
# show the original artificial data
states[hidden.hypnogram]
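As an optional sanity check (using only the objects already defined above), the decoded path can be compared with the simulated truth:
# cross-tabulate decoded states against the states used to generate data.newly.measured
table(decoded = result, truth = states[hidden.hypnogram])
mean(result == states[hidden.hypnogram])   # rough decoding accuracy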
References
To read more about the problem, see Vomlel Jiří and Kratochvíl Václav: Dynamic Bayesian Networks for the Classification of Sleep Stages, Proceedings of the 11th Workshop on Uncertainty Processing (WUPES'18), pp. 205-215, Eds: Kratochvíl Václav, Vejnarová Jiřina, Třeboň, CZ, 2018.

Max-min Markov blanket feature selection: R code error

I am using Max-min Markov blanket algorithm for variable selection in R from MXM package. Following is my code:
library(MXM)
dataset = read.table('data.txt', na.strings = c("", "NA"), sep = '\t', header = FALSE)
dataset = dataset[, colSums(is.na(dataset)) == 0]
D = as.matrix(as.data.frame(lapply(dataset, as.numeric)))
target = read.table('class_num.txt')
target = c(target)
aa = mmmb(target, D, max_k = 3, threshold = 0.05, test = "testIndFisher", user_test = NULL, robust = FALSE, ncores = 2)
I am getting the following error:
Error in unique(as.numeric(target)) :
(list) object cannot be coerced to type 'double'
According to the mmmb manual page, my dataset D is a matrix of continuous values of dimension 95933 x 85, and my target is a vector of 0/1 values of length 95933.
Can someone help me understand the error?
Got the solution:
The target was a list instead of a numeric vector (read.table returns a data frame, and calling c() on it produces a list). The following line solved the issue:
target = array(as.numeric(unlist(target)))
Thanks!
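Two equivalent ways to end up with a numeric vector, assuming class_num.txt holds a single column of 0/1 labels (hedged, since I do not have the file):
target <- read.table('class_num.txt')[[1]]   # extract the single column as a vector
# or, starting from the list produced by c(target):
target <- as.numeric(unlist(target))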
