Solving a binomial distribution question in R

Insurance policies are sold to 10 different people aged 25-30 years, all of whom are in good health. The probability that a person of this age and condition will live more than 25 years is 4/5. Calculate the probability that within 25 years at most 2 will die. Perform this calculation in R without using any direct built-in function.
n <- 10
p_live <- 4/5
p_notlive <- 0.2
Pzerodie <- combn(n,0)*(p_live^0)*(p_notlive^n-0)
print(Pzerodie)
I will do the same for P(one dies) and P(two die) and then add all three variables. The code above should print 1.024 * 10^-7 for Pzerodie, but instead it prints [,1]. Can anyone guide me? Thanks
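A likely explanation, offered as a hedged sketch rather than a definitive answer: combn(n, 0) returns a matrix of combinations rather than the count choose(n, 0), which is why the result prints with a matrix header like [,1]; also, p_notlive^n-0 is parsed as (p_notlive^n) - 0 because ^ binds tighter than -, which will give wrong answers for k = 1 and k = 2. Note too that with these numbers 0.2^10 = 1.024e-7 is the probability that nobody survives; the probability that nobody dies is 0.8^10 ≈ 0.107. A corrected version of the intended calculation:
n <- 10
p_live <- 4/5   # P(a person lives more than 25 years)
p_die <- 1/5    # P(a person dies within 25 years)

# P(exactly k deaths) for a binomial(n, p_die) count,
# using choose() instead of combn() for the binomial coefficient
p_k_deaths <- function(k) choose(n, k) * p_die^k * p_live^(n - k)

# P(at most 2 die) = P(0 die) + P(1 dies) + P(2 die)
print(p_k_deaths(0) + p_k_deaths(1) + p_k_deaths(2))  # approximately 0.678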

Related

Measure similarity of objects over a period of time

I've got a dataset that has monthly metrics for different stores. Each store has three monthly metrics (total sales, customers, and transaction count); my task is, over a year, to find the store that most closely matches a specific test store (e.g. Store 77).
That is, over the year both the test store and the most similar store need to have similar performance. My question is: how do I go about finding the most similar store? I've currently used Euclidean distance but would like to know if there's a better way to go about it.
Thanks in advance
STORE  month   Metric 1
22     Jan-18  10
23     Jan-18  20
Is correlation a better way to measure similarity in this case compared to distance? I'm fairly new to data analysis, so if there are any resources where I can learn more about this stuff it would be much appreciated!
In general, deciding the similarity of items is domain-specific, i.e. it depends on the problem you are trying to solve. Therefore, there is no one-size-fits-all solution. Nevertheless, there is a basic procedure one can follow when trying to solve this kind of problem.
Case 1 - only distance matters:
If you want to find the most similar items (stores in our case) using a distance measure, it's a good tactic to first scale your features in some way.
Example (min-max normalization):
Store  Month   Total sales  Total sales (normalized)
1      Jan-18  50           0.64
2      Jan-18  40           0.45
3      Jan-18  70           1
4      Jan-18  15           0
After you apply normalization to all attributes, you can calculate the Euclidean distance or any other metric that you think fits your data.
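As a minimal sketch of this case (the store IDs and metric values below are made up for illustration): min-max normalize each metric, then rank the stores by Euclidean distance to the test store.
# Hypothetical data: one row per store, columns are the metrics
stores <- data.frame(
  store     = c(22, 23, 77, 80),
  sales     = c(50, 40, 70, 15),
  customers = c(200, 180, 300, 90)
)

# Min-max normalization maps each column onto [0, 1]
minmax <- function(x) (x - min(x)) / (max(x) - min(x))
norm <- as.data.frame(lapply(stores[, -1], minmax))

# Euclidean distances between all stores; pick out the test store's row
test_row <- which(stores$store == 77)
d <- as.matrix(dist(norm))[test_row, ]
stores$store[order(d)]  # the test store itself first, then its closest matches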
Some resources:
Similarity measures
Feature scaling
Case 2 - Trend matters:
Now, say that you want to find similarity over the whole year. If your problem's definition of similarity is just the state of the stores at the end of the year, then distance will do the job.
But if you want to find similar trends of increase/decrease in the attributes of two stores, then distance measures conceal this information. You would have to use correlation metrics or some other technique more sophisticated than a plain distance.
Simple example:
To keep it simple, let's say we are interested in a 3-month analysis and that we use only the sales attribute (unscaled):
Store  Month   Total sales
1      Jan-18  20
1      Feb-18  20
1      Mar-18  20
2      Jan-18  5
2      Feb-18  15
2      Mar-18  40
3      Jan-18  10
3      Feb-18  30
3      Mar-18  78
At the end of March, in terms of distance Store 1 and Store 2 are identical, both having 60 total sales.
But as far as month-over-month growth is concerned, Store 2 and Store 3 are our match: in February both tripled their January sales, and in March their sales grew by factors of roughly 2.67 and 2.6 respectively.
Bottom line: It really depends on what you want to quantify.
Well-known correlation metrics:
Pearson correlation coefficient
Spearman correlation coefficient
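A small sketch of the trend comparison above, using the three example stores (note that Store 1's sales are constant, so its variance is zero and cor() returns NA for it, which here is itself informative):
store1 <- c(20, 20, 20)
store2 <- c(5, 15, 40)
store3 <- c(10, 30, 78)

# Pearson correlation of the monthly sales trajectories
cor(store2, store3)  # ~0.9999: nearly identical trends despite different scales
cor(store1, store2)  # NA with a warning: Store 1 has zero variance

# Spearman rank correlation, if only the direction of movement matters
cor(store2, store3, method = "spearman")  # 1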

Measures of dissimilarity (distance) between character vectors in R

I have a seemingly easy question which however is troubling me a bit.
I have pairs of vectors made up of nominal attributes. They can be of different lengths, and sometimes some of the attributes in one might not be present in the other. See a and b as two examples.
a
1 mathematician
2 engineer
3 mathematician
4 mathematician
5 mathematician
6 engineer
7 mathematician
8 mathematician
9 mathematician
10 mathematician
11 mathematician
12 engineer
13 mathematician
14 mathematician
15 engineer
b
1 physicist
2 surgeon
3 physicist
4 surgeon
5 physicist
6 physicist
7 surgeon
8 surgeon
9 physicist
10 physicist
11 mathematician
Do you have in mind a measure (an index) that could summarize the dissimilarity between them? The type of measure I am looking for is something like the Euclidean distance, but for qualitative vectors.
One option I thought of is to compute the Euclidean distance between the categorical vectors after transforming them into frequency vectors. In this way, they would become quantitative and would be of the same length. But my question is: do you find this a sound approach?
More generally, is there an R package that tackles this type of distance? Can you suggest other distances suitable for nominal variables?
Many thanks!
I've only come across the unalikeability coefficient.
http://www.amstat.org/publications/jse/v15n2/kader.html
Weird name, intuitive approach, and incredibly simple implementation. For example:
> table(a)
a
     engineer mathematician
            4            11
> unalike(table(a))
[1] 0.391
> table(b)
b
mathematician     physicist       surgeon
            1             6             4
> unalike(table(b))
[1] 0.562
It is clear just by eyeballing that the values within b are more unalike, and this coefficient makes that quantitative.
There are some examples in the paper which I will calculate for you here:
> unalike(3,7)
[1] 0.42
> unalike(5,5)
[1] 0.5
> unalike(1,9)
[1] 0.18
The formula in this function follows the paper linked above:
unalike <- function(...) {
  props <- c(...)
  # unalikeability: 1 minus the sum of squared category proportions
  zzz <- 1 - sum((props / sum(props))^2)
  return(round(zzz, 3))
}
Let me know how your thing goes since this is a small side project for me as well.
I am not sure this is a programming question, because you do not yet know exactly what you want to do, so we can't offer a concrete solution. I think the main question here is what you are going to use this measure for: you can measure dissimilarity in a lot of different ways, and some will suit your purpose while others will not.
But trying to answer anyway: there is the utils::adist function, and there is also a package called stringdist (these are the ones I have used before). It seems they are not quite what you want, based on your question, because they measure the distance between individual character strings, not between the whole vectors. But you could use them to get some ideas about how to measure the distance between the two vectors. For example, one measure could be how many changes you would have to make in vector a to turn it into vector b.
Thank you for keeping this open.
One option, which appears to have become available after this discussion took place, is R's qualvar package (Gombin). The package provides functions for each of Wilcox's (1967, 1973) Indices of Qualitative Variation, along with a useful vignette summarizing implementation and results. In my limited experience, index selection requires some brute-force testing with actual and simulated data.
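For completeness, a minimal sketch of the frequency-vector approach proposed in the question (the vectors below reconstruct a and b from the printout above): align both vectors on the union of their categories, convert counts to proportions, and take the Euclidean distance between the proportion vectors.
a <- c(rep("mathematician", 11), rep("engineer", 4))
b <- c(rep("physicist", 6), rep("surgeon", 4), "mathematician")

# A common category set so both frequency vectors have the same length
cats <- union(a, b)
prop_a <- table(factor(a, levels = cats)) / length(a)
prop_b <- table(factor(b, levels = cats)) / length(b)

# Euclidean distance between the two proportion vectors
sqrt(sum((prop_a - prop_b)^2))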

Fisher test more than 2 groups

Major Edit:
I decided to rewrite this question since my original was poorly put. I will leave the original question below to maintain a record. Basically, I need to do Fisher's exact test on tables as big as 4 x 5 with around 200 observations. It turns out that this is often a major computational challenge, as explained here (I think; I can't follow it completely). As I use both R and Stata, I will frame the question for both with some made-up data.
Stata:
tabi 1 13 3 27 46 \ 25 0 2 5 3 \ 22 2 0 3 0 \ 19 34 3 8 1 , exact(10)
You can increase exact() to 1000 max (but it will take maybe a day before returning an error).
R:
Job <- matrix(c(1,13,3,27,46, 25,0,2,5,3, 22,2,0,3,0, 19,34,3,8,1), 4, 5,
              dimnames = list(income = c("< 15k", "15-25k", "25-40k", ">40k"),
                              satisfaction = c("VeryD", "LittleD", "ModerateS", "VeryS", "exstatic")))
fisher.test(Job)
For me, at least, it errors out in both programs. So the question is: how can this calculation be done in either Stata or R?
Original Question:
I have Stata and R to play with.
I have a dataset with various categorical variables, some of which have multiple categories.
Therefore I'd like to do Fisher's exact test on tables bigger than 2 x 2,
i.e. apply Fisher's test to a 2 x 6 table or a 4 x 4 table.
Can this be done with either R or Stata ?
Edit: whilst this can be done in Stata - it will not work for my dataset as I have too many categories. Stata goes through endless iterations and even being left for a day or more does not produce a solution.
My question is really - can R do this, and can it do it quickly ?
Have you studied the documentation of R function fisher.test? Quoting from help("fisher.test"):
For 2 by 2 cases, p-values are obtained directly using the (central or
non-central) hypergeometric distribution. Otherwise, computations are
based on a C version of the FORTRAN subroutine FEXACT which implements
the network developed by Mehta and Patel (1986) and improved by
Clarkson, Fan and Joe (1993).
This is an example given in the documentation:
Job <- matrix(c(1,2,1,0, 3,3,6,1, 10,10,14,9, 6,7,12,11), 4, 4,
dimnames = list(income = c("< 15k", "15-25k", "25-40k", "> 40k"),
satisfaction = c("VeryD", "LittleD", "ModerateS", "VeryS")))
fisher.test(Job)
# Fisher's Exact Test for Count Data
#
# data: Job
# p-value = 0.7827
# alternative hypothesis: two.sided
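A practical note, hedged because neither the question nor this answer spells it out: when the exact network algorithm runs out of memory or time on a larger table, fisher.test() has two documented escape hatches, a larger workspace and a Monte Carlo approximation. Applied to the 4 x 5 matrix from the rewritten question:
# Enlarge the workspace available to the exact network algorithm
fisher.test(Job, workspace = 2e8)

# Or approximate the p-value by Monte Carlo simulation with B table draws
fisher.test(Job, simulate.p.value = TRUE, B = 1e5)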
As far as Stata is concerned, your original statement was totally incorrect: search fisher leads quickly to help tabulate twoway, where the help for the exact option explains that it may be applied to r x c as well as to 2 x 2 tables, and the very first example in the same place is of Fisher's exact test on a table larger than 2 x 2, underlining that Stata is not limited to 2 x 2 tables.
It's a minimal expectation anywhere on this site that you try to read basic documentation. Please!

Testing recurrences and orders in strings (MATLAB)

I have observed nurses during 400 episodes of care and recorded the sequence of surface contacts in each.
I categorised the surfaces into 5 groups (1:5) and calculated the probability density function of touching any one of them (PDF).
PDF=[ 0.255202629 0.186199343 0.104052574 0.201533406 0.253012048]
I then predicted some 1000 sequences using:
L = max(observed_seq_length);
seq = zeros(1000, L); % preallocate so the row assignment below is well-defined
for i = 1:1000 % 1000 different nurses
    seq(i,:) = randsample(1:5, L, true, PDF);
end
e.g.
seq = 1 5 2 3 4 2 5 5 2 5
stairs(1:L, seq)
hold all
I'd like to compare my empirical sequences with my predicted ones. What would you suggest as the best strategy, or which property should I look at?
Regards,
EDIT: I added r as a tag as this may well fall more naturally under that category due to the nature of the question rather than the MATLAB code.

Doing a one step Cox PH regression for 4 time intervals in R

I have 4 intervals of interest:
0 - 30 days
30 days - ½ year
½ - 2 years
2 years - 10 years
Right now I'm subsetting my dataset like this:
# Set time period
time_period.first <- 30/365.25
time_period.intermediate <- .5
....
# TREOP = Time in years
data.first = all_data
# Remove already censored data
data.intermediate = subset(data.first, data.first$TREOP > time_period.first)
# Set all outside as censored
data.first$RREOP[data.first$TREOP > time_period.first] = 0
data.first$TREOP[data.first$TREOP > time_period.first] = time_period.first
data.intermediate$RREOP[data.intermediate$TREOP > time_period.second] = 0
data.intermediate$TREOP[data.intermediate$TREOP > time_period.second] = time_period.second
....
I'm doing cox regression with the 'survival' package (I also use the cph in the 'Design' package for C-statistic calculations).
My question:
Is there a better way of performing this left-truncation & right-censoring?
Ideal would be:
# TREOP - time in years
# RREOP - event
surv <- Surv(TREOP, RREOP, start=30/365.25, stop=.5)
I've looked at the help, and the time, time2 & type arguments do seem to handle truncation, but I think that's meant for a more complex setting where subjects enter the study late (say, after 22 days), not for splitting data into intervals.
Edit
I've found the survSplit() function in the survival package, but although its description seems right I'm not sure how to tame it; the example doesn't really help me out. Does anyone have any experience with it?
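A minimal sketch of what survSplit() can do here, hedged because the interface depends on the version of the survival package (the formula method below exists in recent versions) and because x stands in for whatever covariates the model uses:
library(survival)

# Split each subject's follow-up at the interval boundaries (in years).
# survSplit adds a tstart column and caps TREOP/RREOP within each episode.
cuts <- c(30/365.25, 0.5, 2)
split_data <- survSplit(Surv(TREOP, RREOP) ~ ., data = all_data,
                        cut = cuts, episode = "interval")

# Then fit one Cox model per interval, e.g. for the first (0 - 30 days) episode:
fit1 <- coxph(Surv(tstart, TREOP, RREOP) ~ x,
              data = subset(split_data, interval == 1))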
I agree with the right-censoring, which looks simple and straightforward.
I'm not sure that you should left-truncate, though. I would feel more comfortable leaving the shorter survival times unchanged and just increasing the upper censoring limit. If the n'th time period is much longer than the (n-1)'th, it won't matter much; and if it is not much longer, the shorter survival times shouldn't be truncated.
