I'm using the TravelMode data (data("TravelMode", package = "AER")) and trying to follow the Heiss (2002) paper.
This is what my code looked like initially:
[Figure: nested structure]
library(mlogit)
data("TravelMode", package = "AER")
TravelMode_frame <- mlogit.data(TravelMode,choice = "choice",shape="long",chid.var="individual",alt.var = "mode")
ml_TM <- mlogit(choice ~travel|income,data=TravelMode_frame,nests = list(public = c('train','bus'), car=('car'),air = c('air')), un.nest.el = FALSE,unscaled=TRUE)
Then I wanted to separate the travel time variable for air from that of the other three modes, as in the picture below, so I wrote:
air <- idx(TravelMode_frame, 2) %in% c('air')
TravelMode_frame$travel_air <- 0
TravelMode_frame$travel_air[air] <- TravelMode_frame$travel[air]
TravelMode_frame$travel[TravelMode_frame$alt == "air"] <- "0"
Then my data looks like this:
  individual  mode choice wait vcost travel gcost income size     idx travel_air
1          1   air  FALSE   69    59      0    70     35    1   1:air        100
2          1 train  FALSE   34    31    372    71     35    1 1:train          0
3          1   bus  FALSE   35    25    417    70     35    1   1:bus          0
4          1   car   TRUE    0    10    180    30     35    1   1:car          0

~~~ indexes ~~~~
  chid   alt
1    1   air
2    1 train
3    1   bus
4    1   car
But when I estimate it with
ml_TM <- mlogit(choice ~travel+travel_air|income,data=TravelMode_frame,nests = list(public = c('train','bus'), car=('car'), air = c('air')), un.nest.el = FALSE,unscaled=TRUE)
it says: Error in solve.default(H, g[!fixed]) : system is computationally singular: reciprocal condition number = 2.32747e-23
I have no idea why this happens. Could someone please help me?
I tried dropping the variables from the formula one at a time, but that didn't help.
P.S. I rolled back to the data before I created travel_air and tried making travel time an alternative-specific variable (one coefficient per alternative) with
ml_TM <- mlogit(choice ~0|income|travel,data=TravelMode_frame,nests = list(public = c('train','bus'), car=('car'),
air = c('air')), un.nest.el = FALSE,unscaled=TRUE)
and it can't be estimated either (Error in solve.default(crossprod(attr(x, "gradi")[, !fixed])) : system is computationally singular: reciprocal condition number = 1.39039e-20)
I think I understand the idea behind this a little better now; tell me if I'm wrong. I think my mistakes are:
First, I forgot to rescale income and travel time, so I need to add
TravelMode$travel <- TravelMode$travel/60+TravelMode$wait/60
TravelMode$income <- TravelMode$income/10
As for the first question, this one:
ml_TM <- mlogit(choice ~travel+travel_air|income,data=TravelMode_frame,nests = list(public = c('train','bus'), car=('car'), air = c('air')), un.nest.el = FALSE,unscaled=TRUE)
My nesting structure has degenerate nests, so the IV parameter can no longer be interpreted as a dissimilarity parameter; instead it acts as a scaling parameter on the variables, as in models J and L in the table from Heiss (2002) below. Perhaps the error arises because I tried to estimate two variables at once, so the IV parameter has to scale both of them simultaneously.
For this problem
ml_TM <- mlogit(choice ~0|income|travel,data=TravelMode_frame,nests = list(public = c('train','bus'), car=('car'),
air = c('air')), un.nest.el = FALSE,unscaled=TRUE)
the same reasoning applies as in the case above, as with model L in the table.
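One more thing that may be worth checking, independently of the nesting argument: the line that zeroes out travel for air assigns the string "0" rather than the number 0. In base R, assigning a character value into a numeric vector coerces the whole vector to character, and a character regressor would then be treated as a categorical variable. Whether that contributes to the singularity here is only a guess; the sketch below uses small stand-in vectors (hypothetical values, not the real data) just to show the coercion.

```r
# Stand-in for the travel column and the air indicator (hypothetical values)
travel <- c(100, 372, 417, 180)
is_air <- c(TRUE, FALSE, FALSE, FALSE)

travel_chr <- travel
travel_chr[is_air] <- "0"   # string: the whole vector becomes character

travel_num <- travel
travel_num[is_air] <- 0     # number: the vector stays numeric

class(travel_chr)
class(travel_num)
```

If the column in TravelMode_frame turns out to be character after that step, assigning a numeric 0 instead would keep it numeric.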
I've got this data processing:
library(text2vec)
## Using perplexity on a hold-out set
t1 <- Sys.time()
perplex <- c()
for (i in 3:25){
  set.seed(17)
  lda_model2 <- LDA$new(n_topics = i)
  doc_topic_distr2 <- lda_model2$fit_transform(x = dtm, progressbar = F)
  set.seed(17)
  sample.dtm2 <- itoken(rawsample$Abstract,
                        preprocessor = prep_fun,
                        tokenizer = tok_fun,
                        ids = rawsample$id,
                        progressbar = F) %>%
    create_dtm(vectorizer,vtype = "dgTMatrix", progressbar = FALSE)
  set.seed(17)
  new_doc_topic_distr2 <- lda_model2$transform(sample.dtm2, n_iter = 1000,
                                               convergence_tol = 0.001, n_check_convergence = 25,
                                               progressbar = FALSE)
  perplex[i] <- text2vec::perplexity(sample.dtm2, topic_word_distribution =
                                       lda_model2$topic_word_distribution,
                                     doc_topic_distribution = new_doc_topic_distr2)
}
print(difftime(Sys.time(), t1, units = 'sec'))
I know there are a lot of questions like this, but I haven't been able to find an answer that fits my situation exactly. Above you see the perplexity calculation for 3 to 25 topics for a Latent Dirichlet Allocation model. I want to pick the best value among those, meaning that I want to find the elbow or knee of these values, which can be treated as a simple numeric vector that looks like this:
1 NA
2 NA
3 222.6229
4 210.3442
5 200.1335
6 190.3143
7 180.4195
8 174.2634
9 166.2670
10 159.7535
11 153.7785
12 148.1623
13 144.1554
14 141.8250
15 138.8301
16 134.4956
17 131.0745
18 128.8941
19 125.8468
20 123.8477
21 120.5155
22 118.4426
23 116.4619
24 113.2401
25 114.1233
plot(perplex)
This is what the plot looks like:
[Figure: plot of perplexity against number of topics]
I would say that the elbow is at 13 or 16, but I'm not completely sure and I want the exact number as an outcome. I saw in this paper that f''(x) / (1+f'(x)^2)^1.5 is the knee formula, which I tried like this, and it says the knee is at 18:
> d1 <- diff(perplex) # first derivative
> d2 <- diff(d1) / diff(perplex[-1]) # second derivative
> knee <- (d2)/((1+(d1)^2)^1.5)
Warning message:
In (d2)/((1 + (d1)^2)^1.5) :
longer object length is not a multiple of shorter object length
> which.min(knee)
[1] 18
I can't fully figure this out. Could someone show me how to get the exact ideal number of topics according to perplexity as an outcome?
Update: I found this in this paper: "The LDA model with the optimal coherence score, obtained with an elbow method (the point with maximum absolute second derivative) (...)", so this code does the job: d1 <- diff(perplex); k <- which.max(abs(diff(d1) / diff(perplex[-1])))
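For reference, here is a minimal sketch of the "maximum absolute second derivative" rule applied to the perplexity values listed in the question. It makes one simplifying assumption that differs from the one-liner quoted from the thread: because the topic counts are spaced one apart, the second derivative is approximated by the plain second difference, with no extra division. The names k, d2 and knee_k are illustrative.

```r
# Perplexity values for 3 to 25 topics, copied from the question
k <- 3:25
perplex <- c(222.6229, 210.3442, 200.1335, 190.3143, 180.4195, 174.2634,
             166.2670, 159.7535, 153.7785, 148.1623, 144.1554, 141.8250,
             138.8301, 134.4956, 131.0745, 128.8941, 125.8468, 123.8477,
             120.5155, 118.4426, 116.4619, 113.2401, 114.1233)

# Second difference: d2[j] is centred on the (j + 1)-th element of perplex
d2 <- diff(perplex, differences = 2)

# Topic count at which the absolute second difference is largest
knee_k <- k[which.max(abs(d2)) + 1]
knee_k
```

On these particular values the largest absolute second difference sits at the right edge of the range, where perplexity ticks up at 25 topics, so the criterion looks sensitive to noise near the boundary. It is probably better read as a rough guide, checked against the plot, than as an exact answer.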
I am visiting a bird sanctuary that has many different species of birds. Some species are more numerous while others are less numerous. I came back to the sanctuary 9 times, and after every visit I calculated the total number of species I had observed so far. Unsurprisingly, there are diminishing returns to my visits, since I observe the most numerous species on every visit, but that does not increase the count of observed species. What is the best function in R to predict how many species I will have observed by my 20th visit?
Here is the data.frame
d <- structure(list(visit = 1:9,
totalNumSpeciesObserved = c(200.903, 296.329, 370.018, 431.59, 485.14, 533.233, 576.595, 616.536, 654)),
class = "data.frame", row.names = c(NA, 9L))
I expect to see a model that fits the data well and behaves in a "log-like" fashion, predicting diminishing returns.
To ask the question in the best way, Stack Overflow has some good guidance: https://stackoverflow.com/help/how-to-ask
If you're trying to model this, I might take the approach of a regression on the square root of the independent variable, based on the data. It is kind of strange to think about it as a function of visits though... Maybe if the visits were evenly spaced in time it would make more sense.
d <- structure(list(visit = 1:9,
totalNumSpeciesObserved = c(200.903, 296.329, 370.018, 431.59, 485.14, 533.233, 576.595, 616.536, 654)),
class = "data.frame", row.names = c(NA, 9L))
mod <- lm(totalNumSpeciesObserved ~ I(sqrt(visit)), d)
new.df <- data.frame(visit=1:13)
out <- predict(mod, newdata = new.df)
plot(d, type = 'o',pch = 16, xlim = c(1,13), ylim = c(200,800), lwd = 2, cex = 2)
points(out, type= 'o', pch = 21, col = "blue", cex = 2)
The I() wrapper allows you to transform the independent variable on the fly, hence the use of sqrt() without needing to save a new variable.
I also don't know if this helps, but you could build a simulator to test for asymptotic behaviour. For example, you could build a population:
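Since the question asks about the 20th visit, the same model can be queried at visit = 20. The sketch below also fits a log model for comparison, purely as an illustration (the log model and the names mod_sqrt, mod_log and new_visit are not from the original answer). Both predictions extrapolate well beyond the observed range of 1 to 9 visits, so the intervals should be read with caution.

```r
d <- data.frame(
  visit = 1:9,
  totalNumSpeciesObserved = c(200.903, 296.329, 370.018, 431.59, 485.14,
                              533.233, 576.595, 616.536, 654)
)

# Square-root model from the answer, plus a log model for comparison
mod_sqrt <- lm(totalNumSpeciesObserved ~ I(sqrt(visit)), data = d)
mod_log  <- lm(totalNumSpeciesObserved ~ log(visit), data = d)

# Point prediction and prediction interval at the 20th visit
new_visit <- data.frame(visit = 20)
pred_sqrt <- predict(mod_sqrt, newdata = new_visit, interval = "prediction")
pred_log  <- predict(mod_log,  newdata = new_visit, interval = "prediction")
pred_sqrt
pred_log

# In-sample fit of the two candidate shapes
c(sqrt = summary(mod_sqrt)$r.squared, log = summary(mod_log)$r.squared)
```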
population <- sample(size = 1e6, LETTERS[1:20],
replace = TRUE, prob = 1/(2:21)^2)
This would say there are 20 species with decreasing probability in your population (expand as you wish).
Then you could simulate visits and information about each visit. For example, how large is the sample of your visit? During a visit you only see 1% of the rainforest, etc.
sim_visits <- function(visits, percent_obs, population){
  species_viewed <- vector()
  unique_views <- vector()
  for(i in 1:visits){
    my_samp <- sample(x = population, size = round(percent_obs*length(population),0),
                      replace = FALSE)
    species_viewed <- c(species_viewed, my_samp)
    unique_views[i] <- length(unique(species_viewed))
  }
  new_observed <- unique_views - dplyr::lag(unique_views, 1, 0)
  df <- data.frame(unique_views = unique_views, new_observed)
  df$cummulative <- cumsum(unique_views)
  df
}
And then you could draw from the simulation many times and see what distribution of values you get.
sim_visits(9, percent_obs = .001, population = population)
unique_views new_observed cummulative
1 13 13 13
2 15 2 28
3 15 0 43
4 17 2 60
5 17 0 77
6 17 0 94
7 17 0 111
8 17 0 128
9 17 0 145
I don't know if this is helpful, but I find simulation a good way to conceptualise problems like these.
I'm trying to use a genetic algorithm for a classification problem. However, I haven't managed to get a summary of the model or a prediction for a new data frame. How can I get the summary and the prediction for the new dataset?
Here is my toy example:
library(genalg)
dat <- read.table(text = " cats birds wolfs snakes
0 3 9 7
1 3 8 7
1 1 2 3
0 1 2 3
0 1 2 3
1 6 1 1
0 6 1 1
1 6 1 1 ", header = TRUE)
evalFunc <- function(x) {
if (dat$cats < 1)
return(0) else return(1)
}
iter = 100
GAmodel <- rbga.bin(size = 7, popSize = 200, iters = iter, mutationChance = 0.01,
elitism = T, evalFunc = evalFunc)
###########summary try#############
cat(summary.rbga(GAmodel))
# Error in cat(summary.rbga(GAmodel)) :
# could not find function "summary.rbga"
############# prediction try###########
dat$pred<-predict(GAmodel,newdata=dat)
# Error in UseMethod("predict") :
# no applicable method for 'predict' applied to an object of class "rbga"
Update:
After reading the answer given and reading this link:
Pattern prediction using Genetic Algorithm
I wonder how I can programmatically use the GA as part of a prediction mechanism. According to the link's text, one can use the GA to optimize a regression or a neural network and then use the predict function they provide.
Genetic Algorithms are for optimization, not for classification. Therefore, there is no prediction method. Your summary statement was close to working.
cat(summary(GAmodel))
GA Settings
Type = binary chromosome
Population size = 200
Number of Generations = 100
Elitism = TRUE
Mutation Chance = 0.01
Search Domain
Var 1 = [,]
Var 0 = [,]
GA Results
Best Solution : 1 1 0 0 0 0 1
Some additional information is available from Imperial College London.
Update in response to updated question:
I see from the paper you mentioned how this makes sense. The idea is to use the genetic algorithm to optimize the weights of a neural network, then use the neural network for classification. This would be a big task, too big to answer here.
I have just begun learning to code in R and I tried to do a classification with C5.0, but I ran into some problems that I don't understand. I would be grateful for any help. Below is the code I learned from someone and tried to run on my own data:
require(C50)
data.resultc50 <- c()
prematrixc50 <- c()
for(i in 3863:3993)
{
needdata$class <- as.factor(needdata$class)
trainc50 <- C5.0(class ~ ., needdata[1:3612,], trials=5, control=C5.0Control(noGlobalPruning = TRUE, CF = 0.25))
predc50 <- predict(trainc50, newdata=testdata[i, -1], trials=5, type="class")
data.resultc50[i-3862] <- sum(predc50==testdata$class[i])/length(predc50)
prematrixc50[i-3862] <- as.character.factor(predc50)
}
Below are the two objects, needdata and testdata, that I used in the code above, with part of their heads:
class Volume MA20 MA10 MA120 MA40 MA340 MA24 BIAS10
1 1 2800 8032.00 8190.9 7801.867 7902.325 7367.976 1751 7.96
2 1 2854 8071.40 8290.3 7812.225 7936.550 7373.624 1766 6.27
3 0 2501 8117.45 8389.3 7824.350 7973.250 7379.444 1811 5.49
4 1 2409 8165.40 8488.1 7835.600 8007.900 7385.294 1825 4.02
# the above is "needdata" and actually has 15 variables with 3862 obs.
class Volume MA20 MA10 MA120 MA40 MA340 MA24 BIAS10
1 1 2800 8032.00 8190.9 7801.867 7902.325 7367.976 1751 7.96
2 1 2854 8071.40 8290.3 7812.225 7936.550 7373.624 1766 6.27
3 0 2501 8117.45 8389.3 7824.350 7973.250 7379.444 1811 5.49
4 1 2409 8165.40 8488.1 7835.600 8007.900 7385.294 1825 4.02
# the above is "testdata" and has 15 variables with 4112 obs.
The data above contain the factor class with values 0 and 1. After running the code I got the warning below:
In predict.C5.0(trainc50, newdata = testdata[i, -1], trials = 5, ... : 'trials' should be <= 1 for this object. Predictions generated
using 1 trials
And when I looked at the object trainc50 I had just created, I noticed that the number of boosting iterations is 1 due to early stopping, as shown below:
# trainc50
Call:
C5.0.formula(formula = class ~ ., data = needdata[1:3612, ],
trials = 5, control = C5.0Control(noGlobalPruning = TRUE,
CF = 0.25), earlyStopping = FALSE)
Classification Tree
Number of samples: 3612
Number of predictors: 15
Number of boosting iterations: 5 requested; 1 used due to early stopping
Non-standard options: attempt to group attributes, no global pruning
I also tried to plot the decision tree and got the error below:
plot(trainc50)
Error in if (!n.cat[i]) { : argument is of length zero
In addition: Warning message:
In 1:which(out == "Decision tree:") : numerical expression has 2 elements: only the first used
Does that mean my code is too poor to perform further trials when running C5.0? What is wrong? Can someone please explain why I encounter early stopping and what the error and warning messages mean? How can I fix it? I would be very thankful for any help.
I used the approach from
http://r-project-thanos.blogspot.tw/2014/09/plot-c50-decision-trees-in-r.html
with the function
C5.0.graphviz(firandomf,
"a.txt",
fontname='Arial',
col.draw='black',
col.font='blue',
col.conclusion='lightpink',
col.question='grey78',
shape.conclusion='box3d',
shape.question='diamond',
bool.substitute=c('None', 'yesno', 'truefalse', 'TF'),
prefix=FALSE,
vertical=TRUE)
And on the command line:
pip install graphviz
dot -Tpng ~/plot/a.txt > ~/plot/a.png
Following up from this question (see it for a reproducible data frame), I want to run MCMCglmm n times, where n is the number of randomisations. I have tried to construct a loop which runs all the chains and saves them (to retrieve the posterior distributions of the randomised variable later), but I am encountering problems.
This is what the data frame looks like (when n = 5, hence R1-R5), A = response variable, L and V are random effect variables, B is a fixed effect, R1-R5 are random assignments of L with structure of V maintained:
ID L B V A R1 R2 R3 R4 R5
1 1_1_1 1 1 1 11.1 6 19 21 1 31
2 1_1_1 1 1 1 6.9 6 19 21 1 31
3 1_1_4 1 1 4 7.7 2 24 8 22 22
4 1_1_4 1 1 4 10.5 2 24 8 22 22
5 1_1_5 1 1 5 8.5 11 27 14 17 22
6 1_1_7 1 1 7 11.2 5 24 13 18 25
I can create the names I want to assign to my chains, and the names of the variable that changes with each run of the MCMC chain (R1-Rn):
n = 5
Rs = as.vector(rep(NA,n))
for(i in 1:n){
Rs[i] = paste("R",i, sep = "")
}
Rs
Output:
> Rs
[1] "R1" "R2" "R3" "R4" "R5"
I then tried this loop to produce 5 chains:
for(i in 1:n){
chains[i] = MCMCglmm(A ~1 + B,
random = as.formula(paste0("~" ,Rs[i], " + Vial")),
rcov = ~units,
nitt = 500,
thin = 2,
burnin = 50,
prior = prior2,
family = "gaussian",
start = list(QUASI = FALSE),
data = df)
}
Thanks to Roland for helping to get the random effect to be called properly; previously I was getting the error "Error in buildZ(rmodel.terms[r] ... object Rs[i] not found", which was fixed by as.formula.
But this stores all of the data in chains, and seemingly only the $Sol components, whereas I need to be able to access the values within the VCV, specifically the posterior distributions of the R variables (e.g. summary(chainR1$VCV)).
In summary: it seems I am making a mistake in how I assign the chain names. Does anyone have a suggestion for how to do this and save the posterior distributions, or even the whole chain?
Using assign was a key point:
n = 10 #Number of chains to run
chainVCVdf = matrix(rep(NA, times = ((nitt-burnin)/thin)*n), ncol = n)
colnames(chainVCVdf)=c(rep("X", times = n))
for(i in 1:n){
assign("chainX",paste0("chain",Rs[i]))
chainX = MCMCglmm(A ~1 + B,
random = as.formula(paste0("~" ,Rs[i], " + V")),
rcov = ~units,
nitt = nitt,
thin = thin,
burnin = burnin,
prior = prior1,
family = "gaussian",
start = list(QUASI = FALSE),
data = df)
assign("chainVCV", chainX$VCV[,1])
chainVCVdf[,i]=(chainVCV)
colnames(chainVCVdf)[i] = colnames(chainX$VCV)[1]
}
It then became possible to build a matrix of the VCV component that I am interested in (namely the randomised L assignment in columns R1-Rn).
It seems as though you want to run a number of different MCMCglmm formulas in a loop. @Roland has helped you find the solution to this (although I personally would create the formulas prior to the loop). @Roland also points out that in order to save the results of each model, you should store them in a list, rather than assigning to chains[i] as you are currently doing. You could also save each model as an .RData file, as seen at the end of the question. To formalize an answer to this question, I would do it in the following way:
Rs = paste0("~R", 1:5, " + V") ## Create all model formulae
chainNames = paste0("chainR", 1:5) ## Names for each model
chains = list() ## Initialize list
## Loop over models
for(i in seq_along(Rs)){
  chains[[i]] = MCMCglmm(A ~1 + B,
                         random = formula(Rs[i]),
                         rcov = ~units,
                         nitt = 500,
                         thin = 2,
                         burnin = 50,
                         prior = prior2,
                         family = "gaussian",
                         start = list(QUASI = FALSE),
                         data = df)
}
names(chains) = chainNames ## Name each model
save(chains, file = "chainsR1-R5.Rdata") ## Save all model output; the file name must be passed as file =
A side note: paste0 is the same as paste, but with the argument sep="" by default.
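Once the models sit in a named list, pulling out the first VCV column from every chain and saving the list are both short steps. The sketch below does not run MCMCglmm; it uses stand-in list elements that mimic only the $VCV slot (hypothetical random values, with 225 rows to match nitt = 500, burnin = 50, thin = 2), and the names vcv_first, out_file and restored are illustrative.

```r
# Stand-ins for fitted models: each element mimics only the $VCV slot
# (hypothetical values; a real MCMCglmm fit stores an mcmc matrix there)
set.seed(1)
chain_names <- paste0("chainR", 1:5)
chains <- lapply(1:5, function(i) {
  vcv <- matrix(rexp(225 * 3), ncol = 3,
                dimnames = list(NULL, c(paste0("R", i), "V", "units")))
  list(VCV = vcv)
})
names(chains) <- chain_names

# One column per chain: posterior draws of the first variance component
vcv_first <- sapply(chains, function(m) m$VCV[, 1])
dim(vcv_first)

# Save and restore the whole list; save() needs the named file argument
out_file <- file.path(tempdir(), "chainsR1-R5.RData")
save(chains, file = out_file)
restored <- new.env()
load(out_file, envir = restored)
identical(restored$chains, chains)
```

With real fits, summary(chains$chainR1$VCV) would then give the posterior summary the question asks for, assuming the list is named as above.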