Poisson Table in R

I am trying to generate a Poisson Table in R for two events, one with mean 1.5 (lambda1) and the other with mean 1.25 (lambda2). I would like to generate the probabilities in both cases for x=0 to x=7+ (7 or more). This is probably quite simple but I can't seem to figure out how to do it! I've managed to create a data frame for the table but I don't really know how to input the parameters as I've never written a function before:
name <- c("0","1","2","3","4","5","6","7+")
zero <- mat.or.vec(8,1)
C <- data.frame(row.names=name,
"0"=zero,
"1"=zero,
"2"=zero,
"3"=zero,
"4"=zero,
"5"=zero,
"6"=zero,
"7+"=zero)
I am guessing I will need some "for" loops and that dpois(x, lambda1) will be involved at some point. Can somebody help, please?

I'm assuming these events are independent. Here's one way to generate a table of the joint PMF.
First, here are the names you've defined, along with the lambdas:
name <- c("0","1","2","3","4","5","6","7+")
lambda1 <- 1.5
lambda2 <- 1.25
We can get the marginal probabilities for 0-6 using dpois, and the marginal probability for 7+ using ppois with lower.tail=FALSE. Note that P(X >= 7) = P(X > 6), so the cutoff passed to ppois is 6, not 7:
p.x <- c(dpois(0:6, lambda1), ppois(6, lambda1, lower.tail=FALSE))
p.y <- c(dpois(0:6, lambda2), ppois(6, lambda2, lower.tail=FALSE))
An even better way might be to create a function that does this given any lambda.
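For instance (pois.probs is just an illustrative name, not an existing function), something along these lines would work for any lambda:
pois.probs <- function(lambda, kmax = 6) {
  # PMF at 0..kmax, plus the upper tail P(X > kmax) as the final "kmax+1 or more" cell
  c(dpois(0:kmax, lambda), ppois(kmax, lambda, lower.tail = FALSE))
}
p.x <- pois.probs(lambda1)
p.y <- pois.probs(lambda2)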
Then you just take the outer product (really, the same thing you would do by hand, outside of R) and set the names:
p.xy <- outer(p.x, p.y)
rownames(p.xy) <- colnames(p.xy) <- name
Now you're done:
0 1 2 3 4 5
0 6.392786e-02 7.990983e-02 4.994364e-02 2.080985e-02 6.503078e-03 1.625770e-03
1 9.589179e-02 1.198647e-01 7.491546e-02 3.121478e-02 9.754617e-03 2.438654e-03
2 7.191884e-02 8.989855e-02 5.618660e-02 2.341108e-02 7.315963e-03 1.828991e-03
3 3.595942e-02 4.494928e-02 2.809330e-02 1.170554e-02 3.657982e-03 9.144954e-04
4 1.348478e-02 1.685598e-02 1.053499e-02 4.389578e-03 1.371743e-03 3.429358e-04
5 4.045435e-03 5.056794e-03 3.160496e-03 1.316873e-03 4.115229e-04 1.028807e-04
6 1.011359e-03 1.264198e-03 7.901240e-04 3.292183e-04 1.028807e-04 2.572018e-05
7+ 2.653011e-04 3.316264e-04 2.072665e-04 8.636104e-05 2.698783e-05 6.746957e-06
6 7+
0 3.387020e-04 7.143030e-05
1 5.080530e-04 1.071455e-04
2 3.810397e-04 8.035909e-05
3 1.905199e-04 4.017955e-05
4 7.144495e-05 1.506733e-05
5 2.143349e-05 4.520199e-06
6 5.358371e-06 1.130050e-06
7+ 1.405616e-06 2.964363e-07
You could have also used a loop, as you originally suspected, but that's a more roundabout way to the same solution.
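For the record, a loop version might look like this (same result, just more typing):
p.xy <- matrix(0, 8, 8, dimnames = list(name, name))
for (i in 1:8) {
  for (j in 1:8) {
    p.xy[i, j] <- p.x[i] * p.y[j]  # joint probability under independence
  }
}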

Related

R - Function that conditionally calls other functions

I am working with the igraph package and I'm trying to build a function that calculates the number of intra-community edges for different community detection algorithms. I tried to put everything inside the function, even the community detection calls. Like this:
library("igraph")
intra.edges<-function(G,algorithm) {
if(algorithm==1){
Mod<-cluster_louvain(G)}
if(algoritmo==2){
Mod<-cluster_edge_betweenness(G)}
if(algoritmo==3){
Mod<-cluster_walktrap(G)}
Com<-as.data.frame(sizes(Mod))
NoCom<-as.vector(Com$Community.sizes)
vert<-NULL
for(i in 1:length(NoCom)){
M<-which(membership(Mod)==i)
sg<-induced.subgraph(G,M)
c.ec<-ecount(sg)
vert<-c.ec
}
intra<-data.frame(Com,vert)
print(intra)
}
When I try the function, it doesn't work correctly. For example:
When I run:
G <- graph.famous("Zachary")
intra.edges(G,1)
I get:
Community.sizes Freq vert
1 9 6
2 7 6
3 9 6
4 4 6
5 5 6
And when I run intra.edges(G,2) or intra.edges(G,3) I get the same output.
Also, not all of the network's communities have six intra-community edges; that is the count for just one of them.
You can either add your calculated value of vert to the data frame on each iteration of your for loop by changing your code to:
intra <- Com
for (i in 1:length(NoCom)) {
  M <- which(membership(Mod) == i)
  sg <- induced.subgraph(G, M)
  intra$vert[i] <- ecount(sg)  # store the count for community i
}
print(intra)
Or, as @dash2 suggested, create a vector called vert and fill it element by element, like this:
vert <- NULL
for (i in 1:length(NoCom)) {
  M <- which(membership(Mod) == i)
  sg <- induced.subgraph(G, M)
  c.ec <- ecount(sg)
  vert[i] <- c.ec  # vert[i], not vert, so earlier values are kept
}
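Folding that fix back into the full function gives something like this sketch (same algorithm switch as in the question):
intra.edges <- function(G, algorithm) {
  if (algorithm == 1) Mod <- cluster_louvain(G)
  if (algorithm == 2) Mod <- cluster_edge_betweenness(G)
  if (algorithm == 3) Mod <- cluster_walktrap(G)
  Com <- as.data.frame(sizes(Mod))
  vert <- numeric(nrow(Com))
  for (i in seq_along(vert)) {
    sg <- induced.subgraph(G, which(membership(Mod) == i))
    vert[i] <- ecount(sg)  # edges inside community i
  }
  data.frame(Com, vert)
}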

Problems with Naive Bayes

I'm trying to run Naive Bayes in R to make predictions from textual data (by building a document-term matrix).
I read several posts warning about terms that might appear in only one of the training and testing sets, so I decided to work with a single data frame and split it afterwards. The code I'm using is this:
data <- read.csv(file="path",header=TRUE)
########## NAIVE BAYES
library(e1071)
library(SparseM)
library(tm)
# CREATE DATA FRAME AND TRAINING AND
# TEST INCLUDING 'Text' AND 'InfoType' (columns 8 and 27)
traindata <- as.data.frame(data[13000:13999,c(8,27)])
testdata <- as.data.frame(data[14000:14999,c(8,27)])
complete <- as.data.frame(data[13000:14999,c(8,27)])
# SEPARATE TEXT VECTOR TO CREATE Source(),
# Corpus() CONSTRUCTOR FOR DOCUMENT TERM
# MATRIX TAKES Source()
completevector <- as.vector(complete$Text)
# CREATE SOURCE FOR VECTORS
completesource <- VectorSource(completevector)
# CREATE CORPUS FOR DATA
completecorpus <- Corpus(completesource)
# STEM WORDS, REMOVE STOPWORDS, TRIM WHITESPACE
completecorpus <- tm_map(completecorpus,tolower)
completecorpus <- tm_map(completecorpus,PlainTextDocument)
completecorpus <- tm_map(completecorpus, stemDocument)
completecorpus <- tm_map(completecorpus, removeWords,stopwords("english"))
completecorpus <- tm_map(completecorpus,removePunctuation)
completecorpus <- tm_map(completecorpus,removeNumbers)
completecorpus <- tm_map(completecorpus,stripWhitespace)
# CREATE DOCUMENT TERM MATRIX
completematrix<-DocumentTermMatrix(completecorpus)
trainmatrix <- completematrix[1:1000,]
testmatrix <- completematrix[1001:2000,]
# TRAIN NAIVE BAYES MODEL USING trainmatrix DATA AND traindata$InfoType CLASS VECTOR
model <- naiveBayes(as.matrix(trainmatrix),as.factor(traindata$InfoType),laplace=1)
# PREDICTION
results <- predict(model,as.matrix(testmatrix))
conf.matrix<-table(results, testdata$InfoType,dnn=list('predicted','actual'))
conf.matrix
The problem is that I'm getting weird results like this:
actual
predicted 1 2 3
1 60 833 107
2 0 0 0
3 0 0 0
Any idea why this is happening?
The raw data looks like this:
head(complete)
Text
13000 Milkshakes, milkshakes, whats not to love? Really like the durability and weight of the cup. Something about it sure makes good milkshakes.Works beautifully with the Cuisinart smart stick.
13001 excellent. shipped on time, is excellent for protein shakes with a cuisine art mixer. easy to clean and the mixer fits in perfectly
13002 Great cup. Simple and stainless steel great size cup for use with my cuisinart mixer. I can do milkshakes really easy and fast. Recommended. No problems with the shipping.
13003 Wife Loves This. Stainless steel....attractive and the best part is---it won't break. We are considering purchasing another one because they are really nice.
13004 Great! Stainless steel cup is great for smoothies, milkshakes and even chopping small amounts of vegetables for salads!Wish it had a top but still love it!
13005 Great with my. Stick mixer...the plastic mixing container cracked and became unusable as a result....the only downside is you can't see if the stuff you are mixing is mixed well
InfoType
13000 2
13001 2
13002 2
13003 3
13004 2
13005 2
Seemingly the problem was that the document-term matrix was far too sparse and needed the sparsest terms removed. So I added:
completematrix<-removeSparseTerms(completematrix, 0.95)
And it started working!!
actual
predicted 1 2 3
1 60 511 6
2 0 86 2
3 0 236 99
Thank you all for your ideas (thank you Chelsey Hill!!)
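For context, removeSparseTerms(completematrix, 0.95) drops every term that is missing from more than 95% of the documents, i.e. it keeps only terms that appear in roughly 5% or more of them. In this pipeline the call sits between building the DTM and splitting it; a sketch of the placement:
completematrix <- DocumentTermMatrix(completecorpus)
completematrix <- removeSparseTerms(completematrix, 0.95)  # keep terms present in >= ~5% of docs
trainmatrix <- completematrix[1:1000,]
testmatrix <- completematrix[1001:2000,]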

Gaussian filter with unevenly sampled data

I have a time series which is unevenly sampled and I want to downsample it to 20 Hz. I have made a moving average by binning the data points into 0.05 s time windows (20 Hz) and applying an arithmetic mean to each bin. The data frame looks like this:
Time Right Left
1 0.000000000 18.21980 30.98789
2 0.009222031 22.15157 37.18590
3 0.022511959 25.63218 42.49231
4 0.029854059 28.43851 46.57811
5 0.039320946 30.43885 49.29414
6 0.052499056 31.60561 50.67852
7 0.059612036 32.01045 50.92879
8 0.076606989 31.80335 50.34975
9 0.082647085 31.18134 49.29151
10 0.090698957 30.35415 48.09110
And the code I used for the moving average was this:
data$group_num <- floor(data$Time/0.05)
data2<-NULL
data2$Right = aggregate(data$Right,
list(group_num=data$group_num), mean)
data2$Left = aggregate(data$Left,
list(group_num=data$group_num), mean)
data2$Time = aggregate(data$Time,
list(group_num=data$group_num), mean)
However, to refine it I want to use a Gaussian filter instead, so that the data points in the middle of the bin get more weight. I couldn't find any function that could deal with uneven sampling, so I started writing a script where I managed to give the points weights:
data$weight <- ((data$Time-data$group_num*0.05)*((data$group_num+1)*0.05-data$Time))^5
I have to normalize these weights within their own bin (by their sum or mean, for instance). When I tried to normalize them by group, the functions I tried were too slow. Could anybody give me a hand with it?
I finally managed to do what I wanted. The key function that helped me overcome this was ave(). Thank god. So this is what I basically did:
data$weight <- (abs(data$Time - (data$group_num*0.05 + 0.025)))^(-1)  # inverse distance to the bin midpoint
data$Norm <- ave(data$weight, data$group_num, FUN=function(x) x/sum(x))  # weights now sum to 1 within each bin
data$Time2 <- data$Time*data$Norm
data$Right2 <- data$Right*data$Norm
data$Left2 <- data$Left*data$Norm
data2$Time <- tapply(data$Time2, data$group_num, sum)  # weighted average per bin
data2$Right <- tapply(data$Right2, data$group_num, sum)
data2$Left <- tapply(data$Left2, data$group_num, sum)
Thanks Marat Talipov for the help. From what I see in your code, that could also work, but since this is working fine and is quick enough, I will stay with it.
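Note that the weight above is an inverse distance to the bin midpoint rather than a Gaussian. If a true Gaussian kernel is wanted, it drops straight into the same ave()-based pipeline; sigma below is an assumed tuning constant, not a value from the question:
sigma <- 0.0125  # e.g. a quarter of the 0.05 s bin width; an assumption
mid <- data$group_num * 0.05 + 0.025
data$weight <- exp(-(data$Time - mid)^2 / (2 * sigma^2))  # Gaussian weight around the bin midpoint
data$Norm <- ave(data$weight, data$group_num, FUN = function(x) x / sum(x))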

Fisher test with more than 2 groups

Major Edit:
I decided to rewrite this question since my original was poorly put. I will leave the original question below to maintain a record. Basically, I need to do Fisher's test on tables as big as 4 x 5 with around 200 observations. It turns out that this is often a major computational challenge, as explained here (I think; I can't follow it completely). As I use both R and Stata, I will frame the question for both with some made-up data.
Stata:
tabi 1 13 3 27 46 \ 25 0 2 5 3 \ 22 2 0 3 0 \ 19 34 3 8 1 , exact(10)
You can increase exact() up to a maximum of 1000 (but it may take a day before returning an error).
R:
Job <- matrix(c(1,13,3,27,46, 25,0,2,5,3, 22,2,0,3,0, 19,34,3,8,1), 4, 5,
dimnames = list(income = c("< 15k", "15-25k", "25-40k", ">40k"),
satisfaction = c("VeryD", "LittleD", "ModerateS", "VeryS", "ecstatic")))
fisher.test(Job)
For me, at least, it errors out in both programs. So the question is: how can this calculation be done in either Stata or R?
Original Question:
I have Stata and R to play with.
I have a dataset with various categorical variables, some of which have multiple categories.
Therefore I'd like to do Fisher's exact test with more than 2 x 2 categories, i.e. apply it to a 2 x 6 table or a 4 x 4 table.
Can this be done with either R or Stata ?
Edit: whilst this can be done in Stata, it will not work for my dataset, as I have too many categories. Stata goes through endless iterations, and even being left for a day or more does not produce a solution.
My question is really: can R do this, and can it do it quickly?
Have you studied the documentation of the R function fisher.test? Quoting from help("fisher.test"):
For 2 by 2 cases, p-values are obtained directly using the (central or
non-central) hypergeometric distribution. Otherwise, computations are
based on a C version of the FORTRAN subroutine FEXACT which implements
the network developed by Mehta and Patel (1986) and improved by
Clarkson, Fan and Joe (1993).
This is an example given in the documentation:
Job <- matrix(c(1,2,1,0, 3,3,6,1, 10,10,14,9, 6,7,12,11), 4, 4,
dimnames = list(income = c("< 15k", "15-25k", "25-40k", "> 40k"),
satisfaction = c("VeryD", "LittleD", "ModerateS", "VeryS")))
fisher.test(Job)
# Fisher's Exact Test for Count Data
#
# data: Job
# p-value = 0.7827
# alternative hypothesis: two.sided
As far as Stata is concerned, your original statement was totally incorrect. Typing search fisher leads quickly to help tabulate twoway, where the help for the exact option explains that it may be applied to r x c as well as to 2 x 2 tables, and the very first example there is a Fisher's exact test on a larger table, underlining that Stata is not limited to 2 x 2 tables.
It's a minimal expectation anywhere on this site that you try to read basic documentation. Please!
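That said, if the exact computation does exhaust memory on a table this size, fisher.test has two documented escape hatches (the argument values below are only illustrative):
fisher.test(Job, workspace = 2e7)                   # give the network algorithm more memory
fisher.test(Job, simulate.p.value = TRUE, B = 1e5)  # Monte Carlo estimate of the p-value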

Testing recurrences and orders in strings

I have observed nurses during 400 episodes of care and recorded the sequence of surface contacts in each.
I categorised the surfaces into 5 groups, 1:5, and calculated the probability of touching each one of them (PDF):
PDF=[ 0.255202629 0.186199343 0.104052574 0.201533406 0.253012048]
I then predicted some 1000 sequences using:
for i=1:1000 % 1000 different nurses
seq(i,:) = randsample(1:5, max(observed_seq_length), true, PDF);
end
e.g.
seq = 1 5 2 3 4 2 5 5 2 5
stairs(1:max(observed_seq_length), seq)
hold all
I'd like to compare my empirical sequences with my predicted ones. What would you suggest as the best strategy, or which property should I look at?
EDIT: I added r as a tag, as the question may well fall more under that category because of its nature rather than the MATLAB code.
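Since the question is tagged r, here is one possible property to compare, sketched in R: a chi-squared goodness-of-fit test of the empirical surface-contact counts against the PDF used for simulation. The observed_counts vector is hypothetical; the real tallies would come from the 400 observed episodes:
PDF <- c(0.255202629, 0.186199343, 0.104052574, 0.201533406, 0.253012048)
observed_counts <- c(128, 93, 52, 101, 126)  # made-up counts for illustration
chisq.test(observed_counts, p = PDF)  # tests whether contacts follow the assumed PDF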
