I have a problem when I use the function CA() in R.
My data is:
data
row.names         Conscient NonConscient
MoinsSouvent            185          213
PlusieursfMois           98           56
PlusieursfSemaine        28           27
TLJ                       5            8
but when I use CA(data), I have :
test <- CA(data)
Error in res.ca$col$coord[, axes] : subscript out of bounds
Can someone help, please?
The problem is due to the fact that in correspondence analysis with a contingency table of size I x J, the number of factorial axes is min{(I - 1), (J - 1)}.
You have a 4 x 2 table, so you cannot get a factorial plane, only a single axis (because dim = min(4 - 1, 2 - 1) = 1).
One way to solve this problem is to use CA with the parameter graph set to FALSE.
require(FactoMineR)
data <- matrix(c(185, 213, 98, 56, 28, 27, 5, 8),
ncol = 2, byrow = TRUE)
dimnames(data) <- list(c("ms", "plfm", "plfs", "tlj"),
c("cs", "ncs"))
data <- as.table(data)
res <- CA(data, graph = FALSE)
You can also check the coordinates to see that plotting a plane here is not possible.
res$row$coord
## ms plfm plfs tlj
## -0.0897234 0.2534199 -0.0011732 -0.2501709
res$col$coord
## [,1]
## cs 0.1469
## ncs -0.1527
There is no point in doing a correspondence analysis on a 4 x 2 table. CA is designed to reduce the dimensionality of large contingency tables.
If your variables have so few possible values, you are better off interpreting the contingency table directly, using a chi-square or Fisher test if needed.
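For example, a minimal sketch of both tests, assuming the data table built in the previous answer:
# Test independence directly on the 4 x 2 table
chisq.test(data)   # Pearson's chi-squared test
fisher.test(data)  # exact test; can be preferable when some cell counts are small (e.g. 5 and 8)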
I've got this data processing:
library(text2vec)
##Using perplexity for hold out set
t1 <- Sys.time()
perplex <- c()
for (i in 3:25){
  set.seed(17)
  lda_model2 <- LDA$new(n_topics = i)
  doc_topic_distr2 <- lda_model2$fit_transform(x = dtm, progressbar = FALSE)
  set.seed(17)
  sample.dtm2 <- itoken(rawsample$Abstract,
                        preprocessor = prep_fun,
                        tokenizer = tok_fun,
                        ids = rawsample$id,
                        progressbar = FALSE) %>%
    create_dtm(vectorizer, vtype = "dgTMatrix", progressbar = FALSE)
  set.seed(17)
  new_doc_topic_distr2 <- lda_model2$transform(sample.dtm2, n_iter = 1000,
                                               convergence_tol = 0.001,
                                               n_check_convergence = 25,
                                               progressbar = FALSE)
  perplex[i] <- text2vec::perplexity(sample.dtm2,
                                     topic_word_distribution = lda_model2$topic_word_distribution,
                                     doc_topic_distribution = new_doc_topic_distr2)
}
print(difftime(Sys.time(), t1, units = 'sec'))
I know there are a lot of questions like this, but I haven't been able to find an answer for my exact situation. Above you see the perplexity calculation for topic numbers from 3 to 25 for a Latent Dirichlet Allocation model. I want to find the most suitable value among those, meaning that I want to find the elbow or knee. The values can be treated as a simple numeric vector, whose output looks like this:
1 NA
2 NA
3 222.6229
4 210.3442
5 200.1335
6 190.3143
7 180.4195
8 174.2634
9 166.2670
10 159.7535
11 153.7785
12 148.1623
13 144.1554
14 141.8250
15 138.8301
16 134.4956
17 131.0745
18 128.8941
19 125.8468
20 123.8477
21 120.5155
22 118.4426
23 116.4619
24 113.2401
25 114.1233
plot(perplex)
This is what the plot looks like.
I would say that the elbow is at 13 or 16, but I'm not completely sure and I want the exact number as an outcome. I saw in this paper that f''(x) / (1 + f'(x)^2)^1.5 is the knee (curvature) formula, which I tried like this, and it says it's 18:
> d1 <- diff(perplex) # first derivative
> d2 <- diff(d1) / diff(perplex[-1]) # second derivative
> knee <- (d2)/((1+(d1)^2)^1.5)
Warning message:
In (d2)/((1 + (d1)^2)^1.5) :
longer object length is not a multiple of shorter object length
> which.min(knee)
[1] 18
I can't fully figure this out. Would someone like to share how I could get the exact ideal number of topics according to perplexity as an outcome?
Found this: "The LDA model with the optimal coherence score, obtained with an elbow method (the point with maximum absolute second derivative) (...)" in this paper, so this code does the job:
d1 <- diff(perplex)
k <- which.max(abs(diff(d1) / diff(perplex[-1])))
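Since the topic counts are evenly spaced (step of 1), a hedged alternative sketch of the same "maximum absolute second derivative" rule uses diff(..., differences = 2) directly; note the index shift back to a topic number, and that the last point (where perplexity turns upward again) can dominate this criterion, so it is worth checking the result against the plot:
# Sketch: elbow = topic number with the largest absolute second difference
# perplex[i] is the perplexity for i topics (NA for i = 1 and 2; which.max() skips NAs)
d2 <- diff(perplex, differences = 2)  # second difference, centred at index j + 1
elbow <- which.max(abs(d2)) + 1       # shift the diff index back to a topic number
elbow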
This is my data frame, called mydata:
                 Denomination
Attendance   Protestant Catholic Jewish
Regular             182      213    203
Irregular           154      138    110
I want to test this with a chi-square test. So I set my hypotheses like this:
H0: There is no relation between denomination and attendance
H1: There is a relation between denomination and attendance.
I've tried calculating the observed and expected values and comparing them to compute the chi-square statistic, but I didn't get the output.
The code I've tried:
> rowSums(mydata,na.rm = FALSE,dims = 1L)
> colSums(mydata,na.rm = FALSE,dims = 1L)
> sum(mydata)
> e = rowSums(mydata) * colSums(mydata)/ 1000
> chisq.test(mydata) = sum((mydata-e)^2 / e)
But I didn't get the result. Any suggestions?
A couple of things to note here:
chisq.test can only take numeric (or integer) columns. The way you have displayed the data, it looks as though Attendance is its own column rather than the row names.
You don't need to calculate rowSums or colSums to use the test.
Here is the code I ran:
dfs <- data.frame(Attendance = c("Regular", "Irregular"),
                  Protestant = c(182, 154),
                  Catholic = c(213, 138),
                  Jewish = c(203, 110))
rownames(dfs) <- dfs[,1]
dfs$Attendance <- NULL
chisq.test(dfs)
        Pearson's Chi-squared test

data:  dfs
X-squared = 7.8782, df = 2, p-value = 0.01947
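If you still want to reproduce the statistic by hand, as the original attempt tried, note that the expected counts come from the outer product of the margins divided by the grand total, not an element-wise product; a minimal sketch using the dfs built above:
# Hand-computed chi-square statistic, matching chisq.test(dfs)$statistic
obs <- as.matrix(dfs)
e <- outer(rowSums(obs), colSums(obs)) / sum(obs)  # expected counts under independence
sum((obs - e)^2 / e)                               # about 7.8782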
I am visiting a bird sanctuary that has many different species of birds. Some species are more numerous than others. I have come back to the sanctuary 9 times, and after every visit I calculate the total number of species I have observed. Unsurprisingly, there are diminishing returns across my visits, since I see the most numerous species on every visit, but that does not increase the count of observed species. What is the best function in R to predict how many species I will have observed by my 20th visit?
Here is the data.frame
d <- structure(list(visit = 1:9,
totalNumSpeciesObserved = c(200.903, 296.329, 370.018, 431.59, 485.14, 533.233, 576.595, 616.536, 654)),
class = "data.frame", row.names = c(NA, 9L))
I expect to see a model that fits the data well and behaves in a "log-like" fashion, predicting diminishing returns.
To ask a good question, Stack Overflow has some helpful guidance: https://stackoverflow.com/help/how-to-ask
If you're trying to model this, I might take the approach of a regression on the square root of the independent variable, based on the data. It is kind of strange to think about it as a function of visits, though... maybe if they were evenly spaced time periods it would make more sense.
d <- structure(list(visit = 1:9,
totalNumSpeciesObserved = c(200.903, 296.329, 370.018, 431.59, 485.14, 533.233, 576.595, 616.536, 654)),
class = "data.frame", row.names = c(NA, 9L))
mod <- lm(totalNumSpeciesObserved ~ I(sqrt(visit)), d)
new.df <- data.frame(visit=1:13)
out <- predict(mod, newdata = new.df)
plot(d, type = 'o',pch = 16, xlim = c(1,13), ylim = c(200,800), lwd = 2, cex = 2)
points(out, type= 'o', pch = 21, col = "blue", cex = 2)
The I() wrapper allows you to transform the independent variable on the fly, hence the use of sqrt() without needing to save a new variable.
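If you specifically want a saturating curve with an asymptote rather than a square-root trend, a self-starting asymptotic regression is another option. This is only a sketch, assuming the nls fit converges on this small data set; it reuses d from above:
# Sketch: asymptotic model, totalNumSpeciesObserved ~ Asym + (R0 - Asym) * exp(-exp(lrc) * visit)
mod2 <- nls(totalNumSpeciesObserved ~ SSasymp(visit, Asym, R0, lrc), data = d)
predict(mod2, newdata = data.frame(visit = 20))  # predicted species total at visit 20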
I also don't know if this helps, but you could build a simulator to test for asymptotic behaviour. For example, you could build a population:
population <- sample(size = 1e6, LETTERS[1:20],
replace = TRUE, prob = 1/(2:21)^2)
This says there are 20 species, with decreasing probabilities for the rarer ones in your population (expand as you wish).
Then you could simulate visits and the information about each visit. For example, how large is the sample on each visit? Perhaps during a visit you only see 1% of the rainforest, etc.
sim_visits <- function(visits, percent_obs, population){
  species_viewed <- vector()
  unique_views <- vector()
  for (i in 1:visits){
    # sample a fraction of the population on each visit
    my_samp <- sample(x = population,
                      size = round(percent_obs * length(population), 0),
                      replace = FALSE)
    species_viewed <- c(species_viewed, my_samp)
    unique_views[i] <- length(unique(species_viewed))  # species seen so far
  }
  new_observed <- unique_views - dplyr::lag(unique_views, 1, 0)  # newly seen on each visit
  df <- data.frame(unique_views = unique_views, new_observed)
  df$cumulative <- cumsum(unique_views)
  df
}
And then you could draw from the simulation many times and see what distribution of values you get.
sim_visits(9, percent_obs = .001, population = population)
  unique_views new_observed cumulative
1           13           13         13
2           15            2         28
3           15            0         43
4           17            2         60
5           17            0         77
6           17            0         94
7           17            0        111
8           17            0        128
9           17            0        145
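A sketch of drawing from the simulation many times (reusing sim_visits and population from above) and looking at the spread of the species count after 9 visits:
# Repeat the 9-visit simulation 100 times and summarise the final unique-species count
finals <- replicate(100, tail(sim_visits(9, percent_obs = .001, population)$unique_views, 1))
summary(finals)
hist(finals, main = "Unique species after 9 visits", xlab = "species count")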
And I don't know if this is helpful, but I find simulation a good way to conceptualise problems like these.
I'm trying to replicate an Excel Solver model in R. It's a simple problem to start with: maximize points, with the only constraint being a limit on the number of events that can be played. So I have a two-column data frame with a tournament number and projected points. In Excel we have a Play Yes/No binary column, multiply it by the points, and set the objective to maximize, allowing the solver to change the Play Yes/No column to 0 or 1. The constraint limits the sum of the Play Yes/No variable to the constraint value, for example 25.
library(lpSolve)
tournament<-rep(1:48,1)
mean<-c(12.2,30.4,30.9,44.1,31.3,27.6,31.5,25.0,31.2,24.0,28.0,23.9,14.1,9.5,17.2,37.8,30.5,43.0,32.1,30.7,30.2,37.0,32.1,28.9,23.7,4.6,29.0,29.1,30.7,31.6,49.5,25.1,30.2,10.3,30.3,21.8,88.5,31.0,30.9,2.9,31.1,30.3,29.7,63.7,31.6,91.6,30.6,31.0)
aggdata<-data.frame(tournament,mean)
maxevents <-25
obj<-aggdata$mean
con <- rep(1,nrow(aggdata))
dir <- c("==")
rhs <- maxevents
result <- lp("max", obj, con, dir, rhs, all.bin = TRUE)
The result selects only 3 rows of the data frame when it should select the top 25. Eventually I'll add additional constraints; I know lp isn't strictly required for this simple example, but I need to get past this roadblock first.
library(lpSolve)
# objective function: the points of each tournament (this is what we maximize)
obj <- aggdata$point
# constraints: one row per constraint; make 'nrow' equal to the total number of constraints
con <- matrix(c(rep(1, nrow(aggdata)),  # each selected tournament counts once towards the limit
                aggdata$point),         # total points of the selected tournaments
              nrow = 2, byrow = TRUE)
dir <- c("==", "<=")
rhs <- c(25,    # total number of tournaments to play
         1000)  # let's assume that total points can't exceed 1000
# optimization solution
result <- lp("max", obj, con, dir, rhs, all.bin = TRUE)
result$solution
Sample data:
aggdata <- data.frame(tournament = rep(1:48,1),
point = c(12.2,30.4,30.9,44.1,31.3,27.6,31.5,25.0,31.2,24.0,28.0,23.9,14.1,
9.5,17.2,37.8,30.5,43.0,32.1,30.7,30.2,37.0,32.1,28.9,23.7,4.6,
29.0,29.1,30.7,31.6,49.5,25.1,30.2,10.3,30.3,21.8,88.5,31.0,30.9,
2.9,31.1,30.3,29.7,63.7,31.6,91.6,30.6,31.0))
# tournament point
#1 1 12.2
#2 2 30.4
#3 3 30.9
#4 4 44.1
#5 5 31.3
#6 6 27.6
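To see which tournaments the solver picked, a small usage sketch (assuming the code above has been run):
# Tournaments selected by the solver and their total projected points
aggdata[result$solution == 1, ]
sum(aggdata$point[result$solution == 1])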
For instance, if the number is 100 and the number of groups is 4, it should give any random list of 4 numbers that add up to 100:
input number = 100
number of groups = 4
Possible outputs:
25, 25, 25, 25
10, 20, 30, 40
15, 35, 2, 48
Only one list should be generated as output. A more application-oriented example would be: how would I split a probability of 1 into multiple groups, given the number of groups, using R?
rmultinom might be handy here:
x <- rmultinom(n = 1, size = 100, prob = rep(1/4, 4))
x
colSums(x)
Here I draw one vector with a total size of 100, which is split into 4 groups.
You can try the following:
total <- 100
n <- 4
as.vector(table(sample(1:n, size = total, replace = T)))
## [1] 23 27 24 26
as.vector(table(sample(1:n, size = total, replace = T)))
## [1] 25 26 28 21
as.vector(table(sample(1:n, size = total, replace = T)))
## [1] 24 20 28 28
When it comes to probabilities, I think this is a good idea:
generate.probabilities <- function(n){
  bordersR <- c(sort(runif(n - 1)), 1)   # right borders: n - 1 sorted random cut points, plus 1
  bordersL <- c(0, bordersR[1:(n - 1)])  # left borders: 0, plus the first n - 1 right borders
  bordersR - bordersL                    # the gap widths sum to 1
}
It gives you n numbers from a random split of the unit interval, which sum up to 1.
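For example, a small usage sketch (the exact values are random):
set.seed(1)
p <- generate.probabilities(4)
p        # four probabilities
sum(p)   # 1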
Define the parameters for generality
inN <- 100 # input number
nG <- 4 # number of groups
Following storaged's idea that we only need nG - 1 random numbers to split the space into nG regions, but requiring integers, the inner borders can be found as:
sort(sample(inN,nG-1, replace = TRUE))
The OP wanted the count in each group, which we can find by:
diff(c(0,sort(sample(inN,nG-1, replace = TRUE)), inN))
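Putting the two lines together as a small helper (the name split_integer is hypothetical), noting that because the cut points are sampled with replacement, some groups can come out with a count of zero:
# Hypothetical helper: split inN into nG non-negative integers that sum to inN
split_integer <- function(inN, nG) {
  borders <- sort(sample(inN, nG - 1, replace = TRUE))  # inner cut points
  diff(c(0, borders, inN))                              # group sizes between consecutive cut points
}
split_integer(100, 4)  # e.g. four counts summing to 100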