Issue in creating contingency table in R

I am using the ISLR package for my statistics practice, specifically the OJ dataset. I am trying to create a contingency table from the Purchase column and the special-price columns (SpecialCH and SpecialMM) for each brand.
I am trying to find the likelihood of CH being sold if there is a special price.
Here is my code so far.
library(ISLR)
CH <- table(OJ[OJ$Purchase == 'CH', "SpecialCH"])
MM <- table(OJ[OJ$Purchase == 'MM', "SpecialMM"])
table (MM, CH)
The output that I get is a bit weird.
     CH
MM    121 532
  101   1   0
  316   0   1
I am trying to find the odds ratio and eventually apply McNemar's test, but I am unable to generate the contingency table. I can do it by hand but need to do it in R.

You are trying to work with three variables, but a contingency table only uses two. I recommend xtabs, since its formula interface saves some typing and does a better job of labeling the table:
xtabs(~SpecialMM+SpecialCH, OJ) # Only 4 weeks are both on special
# SpecialCH
# SpecialMM 0 1
# 0 743 154
# 1 169 4
xtabs(~Purchase+SpecialCH, OJ) # When CH is on special ca 75% CH
# SpecialCH
# Purchase 0 1
# CH 532 121
# MM 380 37
xtabs(~Purchase+SpecialMM, OJ) # When MM is on special ca 58% MM
# SpecialMM
# Purchase 0 1
# CH 581 72
# MM 316 101
The first table asks the question: are specials for one brand associated with specials for the other brand? There are 1070 purchases of OJ represented. CH was on special 158 times and MM was on special 173 times, but only 4 times were both brands on special. This table suggests that MM and CH are not on special at the same time. You could use a chi-square or another test to see whether that is a significant deviation from random assignment of specials.
The second and third tables look at purchases of OJ to see whether one brand is more likely to be purchased, relative to the other, when it is on sale. Notice that most OJ purchases occur when neither brand is on sale, but it could be that specials boost purchases of the brand on sale. Again, statistical tests would tell you whether this could just be random chance or is unlikely to be chance.
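To follow up on the second table, here is a minimal sketch of how a chi-square test and an odds ratio could be obtained from it; the hand calculation is just one way to orient the odds ratio (odds of a CH special given a CH purchase versus an MM purchase) and is not part of the original answer:
library(ISLR)
tab <- xtabs(~Purchase + SpecialCH, OJ)
chisq.test(tab)                    # test of association between purchase and the CH special
(tab["CH", "1"] / tab["CH", "0"]) / (tab["MM", "1"] / tab["MM", "0"])   # odds ratio by hand
fisher.test(tab)$estimate          # conditional MLE of the same odds ratio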

Related

Grouping and building intervals of data in R and useful visualization

I have some data extracted via HIVE. In the end we are talking about a CSV with around 500,000 rows. I want to plot the data after grouping it into intervals.
Besides the grouping, it's not clear how to visualize the data. Since we are dealing with low spends and sometimes high frequencies, I'm not sure how to handle this problem.
Here is just an overview via head(data)
userid64 spend freq
575033023245123 0.00924205 489
12588968125440467 0.00037 2
13830962861053825 0.00168 1
18983461971805285 0.001500366 333
25159368164208149 0.00215 1
32284253673482883 0.001721303 222
33221593608613197 0.00298 709
39590145306822865 0.001785281 11
45831636009567401 0.00397 654
71526649454205197 0.000949978 1
78782620614743930 0.00552 5
I want to group the data into intervals, so I want an extra column indicating the group. The first group should contain all rows with a frequency (called freq) between 1 and 100, the second group all rows with a frequency between 101 and 200, and so on.
The result should look like
userid64 spend freq group
575033023245123 0.00924205 489 5
12588968125440467 0.00037 2 1
13830962861053825 0.00168 1 1
18983461971805285 0.001500366 333 3
25159368164208149 0.00215 1 1
32284253673482883 0.001721303 222 2
33221593608613197 0.00298 709 8
39590145306822865 0.001785281 11 1
45831636009567401 0.00397 654 7
71526649454205197 0.000949978 1 1
78782620614743930 0.00552 5 1
Is there a nice and gentle way to get this? I need this grouping for upcoming plots. I want to create a visualization for each interval to get an overview of the spend. If you have any ideas for the visualization, please let me know. I thought I should work with boxplots.
If you want to group freq in blocks of 100 units, you can try the ceiling function in base R:
ceiling(df$freq / 100)
#[1] 5 1 1 4 1 3 8 1 7 1 1
where df is your dataframe.
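If it helps, a sketch of how the grouping column and a first boxplot could be put together (assuming, as above, that the data frame is called df and has columns spend and freq):
df$group <- ceiling(df$freq / 100)   # 1-100 -> 1, 101-200 -> 2, ...
boxplot(spend ~ group, data = df,
        xlab = "freq interval (width 100)", ylab = "spend")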

'Forward' cumulative sum in dplyr

When examining datasets from longitudinal studies, I commonly get results like this from a dplyr analysis chain from the raw data:
df = data.frame(n_sessions=c(1,2,3,4,5), n_people=c(59,89,30,23,4))
i.e. a count of how many participants have completed a certain number of assessments at this point in time.
Although it is useful to know how many people have completed exactly n sessions, we more often need to know how many have completed at least n sessions. As per the table below, a standard cumulative sum isn't appropriate. What we want are the values in the n_total column, which is a sort of "forwards cumulative sum" of the values in the n_people column: the value in each row should be the sum of itself and all the values beyond it, rather than the standard cumulative sum, which is the sum of all values up to and including itself:
n_sessions n_people n_total cumsum
1 59 205 59
2 89 146 148
3 30 57 178
4 23 27 201
5 4 4 205
Generating the cumulative sum is simple:
mutate(df, cumsum = cumsum(n_people))
What would be an expression for generating a "forwards cumulative sum" that could be incorporated in a dplyr analysis chain? I'm guessing that cumsum would need to be applied to n_people after sorting by n_sessions descending, but can't quite get my head around how to get the answer while preserving the original order of the data frame.
You can take a cumulative sum of the reversed vector, then reverse that result. The built-in rev function is helpful here:
mutate(df, rev_cumsum = rev(cumsum(rev(n_people))))
For example, on your data this returns:
n_sessions n_people rev_cumsum
1 1 59 205
2 2 89 146
3 3 30 57
4 4 23 27
5 5 4 4
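For completeness, the sort-then-reverse idea guessed at in the question also works; a sketch, assuming the rows are ordered by n_sessions as in the example:
library(dplyr)
df %>%
  arrange(desc(n_sessions)) %>%          # sort by n_sessions descending
  mutate(n_total = cumsum(n_people)) %>% # ordinary cumsum on the reversed order
  arrange(n_sessions)                    # restore the original (ascending) order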

Using weights in R to consider the inverse of sampling probability [duplicate]

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
This is similar but not equal to Using weights in R to consider the inverse of sampling probability.
I have a long data frame and this is a part of the real data:
age gender labour_situation industry_code FACT FACT_2....
35 M unemployed 15 1510
21 F inactive 00 651
FACT is a variable that means, for the first row, that a male unemployed individual of 35 years represents 1510 individuals of the population.
I need to obtain some tables showing relevant information, like the % of employed and unemployed people, etc. In Stata there are options like tab labour_situation [w=FACT] that show the number of employed and unemployed people in the population, while tab labour_situation shows the number of employed and unemployed people in the sample.
A partial solution could be to repeat the 1st row of the data frame 1510 times and then the 2nd row 651 times. From what I've searched, one option is to run
longdata <- data[rep(1:nrow(data), data$FACT), ]
employment_table = with(longdata, addmargins(table(labour_situation, useNA = "ifany")))
The other thing I need to do is run a regression, keeping in mind that there was cluster sampling: the population was divided into regions. This creates a problem: an individual interviewed in one region represents a different number of people than an individual interviewed in another region, and those numbers are not in proportion to the total population of each region, so some regions will be overrepresented and other regions underrepresented. To take this into account, each observation should be weighted by the inverse of its probability of being sampled.
The last paragraph means that the model can still be estimated with the usual equations, BUT the variance-covariance matrix won't be valid unless I consider the inverse of the sampling probability.
In Stata it is possible to run the regression with reg y x1 x2 [pweight=n], which calculates the right variance-covariance matrix considering the inverse of the sampling probability. At the moment I have to use Stata for some parts of my work and R for others; I'd like to use just R.
You can do this by repeating the rownames:
df1 <- df[rep(row.names(df), df$FACT), 1:5]
> head(df1)
age gender labour_situation industry_code FACT
1 35 M unemployed 15 1510
1.1 35 M unemployed 15 1510
1.2 35 M unemployed 15 1510
1.3 35 M unemployed 15 1510
1.4 35 M unemployed 15 1510
1.5 35 M unemployed 15 1510
> tail(df1)
age gender labour_situation industry_code FACT
2.781 21 F inactive 0 787
2.782 21 F inactive 0 787
2.783 21 F inactive 0 787
2.784 21 F inactive 0 787
2.785 21 F inactive 0 787
2.786 21 F inactive 0 787
Here 1:5 refers to the columns to keep. If you leave that part blank, all columns will be returned.
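The repetition trick above covers the tabulation part. For the regression with sampling weights, one commonly used route (not part of the answer above) is the survey package; a minimal sketch, assuming the outcome and regressors are called y, x1 and x2 in the data frame df:
library(survey)
dsgn <- svydesign(ids = ~1, weights = ~FACT, data = df)  # FACT used as the expansion weight
fit <- svyglm(y ~ x1 + x2, design = dsgn)                # design-weighted regression
summary(fit)   # standard errors take the weights into account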

Using data table to run 100,000 Fisher's Exact Tests is slower than apply

Good morning,
I'm trying to use R to run 100,000 Fisher's exact tests on simulated genetic data very quickly, preferably in under 30 seconds (since I need to permute case-control labels and iterate the process 1,000 times, so it runs overnight).
I tried using data.table on melted, tidy data, which contains about 200,000,000 rows and four columns (subject ID, disease status, position, and 'value' [the number of wild-type alleles, a three-level factor]). The function groups by position, then performs a Fisher's exact test of value against disease.
> head(casecontrol3)
ident disease position value
1: 1 0 36044 2
2: 2 0 36044 2
3: 3 0 36044 1
4: 4 0 36044 1
5: 5 0 36044 2
6: 6 0 36044 1
> setkey(casecontrol3,position)
> system.time(casecontrol4 <- casecontrol3[,list(p=fisher.test(value,
+ factor(disease))$p.value), by=position])
user system elapsed
215.430 11.878 229.148
> head(casecontrol4)
position p
1: 36044 6.263228e-40
2: 36495 1.155289e-68
3: 38411 7.842216e-19
4: 41083 1.272841e-69
5: 41866 2.264452e-09
6: 41894 9.833324e-10
However, it's really slow compared to using a simple apply over flattened, messy case-control tables (100,000 rows; the columns contain information on disease status and number of wild-type alleles, so the apply function first converts each row into a 2x3 case-control table and uses the matrix syntax of Fisher's exact test). It takes about 20 seconds of running time to convert the data from a previous (unmelted) form into this form (not shown).
> head(cctab)
control_aa control_aA control_AA case_aa case_aA case_AA
[1,] 291 501 208 521 432 47
[2,] 213 518 269 23 392 585
[3,] 170 499 331 215 628 157
[4,] 657 308 35 269 619 112
[5,] 439 463 98 348 597 55
[6,] 410 480 110 323 616 61
> myfisher <- function(row){
+ contab <- matrix(as.integer(row),nrow=2,byrow=TRUE)
+ pval <- fisher.test(contab)$p.value
+ return(pval)
+ }
> system.time(tab <- apply(cctab,1,"myfisher"))
user system elapsed
28.846 10.989 40.173
> head(tab)
[1] 6.263228e-40 1.155289e-68 7.842216e-19 1.272841e-69 2.264452e-09 9.833324e-10
As you can see, using apply is much faster than data.table, which really surprises me. And the results are exactly the same:
> identical(casecontrol4$p,tab)
[1] TRUE
Does anyone who is an expert at using data.table know how I could speed up my code with it? Or is the data just too big for me to use it in the melted form (which rules out using data.table, dplyr, etc)? Note that I haven't tried dplyr, as I've heard that data.table is faster for big data sets like this.
Thanks.
I would suggest another route: adding an HPC element to your approach.
You can use multiple CPU or GPU cores, scale up a free cluster of computers on AWS EC2, connect to AWS EMR, or use any of a plethora of great HPC tools to facilitate your existing code.
Check out the CRAN HPC Task View and this tutorial.
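For the multi-core part of that suggestion, a minimal sketch using the parallel package, assuming a Unix-like system (mclapply forks) and the cctab matrix and myfisher() function from the question:
library(parallel)
n_cores <- max(1, detectCores() - 1)
pvals <- mclapply(seq_len(nrow(cctab)),
                  function(i) myfisher(cctab[i, ]),
                  mc.cores = n_cores)
tab_parallel <- unlist(pvals)   # p-values in the original row order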

Sentiment analysis with different numbers of documents

I am trying to do sentiment analysis on newspaper articles and track the sentiment level across time. To do that, basically I identify all the relevant news articles within a day, feed them into the polarity() function, and obtain the average polarity score of all the articles (more precisely, the average over all sentences from all the articles) within that day.
The problem is that some days have many more articles than others, and I think this might mask some of the information if we simply track the daily average polarity score. For example, a score of 0.1 from 30 news articles should carry more weight than a score of 0.1 generated from only 3 articles. And sure enough, some of the more extreme polarity scores I obtained came from days with only a few relevant articles.
Is there any way I can take the different number of articles each day into consideration?
library(qdap)
sentence = c("this is good","this is not good")
polarity(sentence)
I would warn that sometimes saying something strong with few words may pack the most punch. Make sure what you're doing makes sense in terms of your data and research questions.
One approach would be to use the number of words, as in the following example (I like the first approach more here):
library(qdap)  # provides polarity(), scores(), and the mraja1spl data
poldat2 <- with(mraja1spl, polarity(dialogue, list(sex, fam.aff, died)))
output <- scores(poldat2)
# Weight 1: dampened, log-based function of total words, rescaled to max out at 1
weight <- ((1 - (1/(1 + log(output[["total.words"]], base = exp(2))))) * 2) - 1
weight <- weight/max(weight)
# Weight 2: simple proportion of the largest word count
weight2 <- output[["total.words"]]/max(output[["total.words"]])
output[["weighted.polarity"]] <- output[["ave.polarity"]] * weight
output[["weighted.polarity2"]] <- output[["ave.polarity"]] * weight2
output[, -c(5:6)]
## sex&fam.aff&died total.sentences total.words ave.polarity weighted.polarity weighted.polarity2
## 1 f.cap.FALSE 158 1641 0.083 0.143583793 0.082504197
## 2 f.cap.TRUE 24 206 0.044 0.060969157 0.005564434
## 3 f.mont.TRUE 4 29 0.079 0.060996614 0.001397106
## 4 m.cap.FALSE 73 651 0.031 0.049163984 0.012191207
## 5 m.cap.TRUE 17 160 -0.176 -0.231357933 -0.017135804
## 6 m.escal.FALSE 9 170 -0.164 -0.218126656 -0.016977931
## 7 m.escal.TRUE 27 590 -0.067 -0.106080866 -0.024092720
## 8 m.mont.FALSE 70 868 -0.047 -0.078139272 -0.025099276
## 9 m.mont.TRUE 114 1175 -0.002 -0.003389105 -0.001433481
## 10 m.none.FALSE 7 71 0.066 0.072409049 0.002862997
## 11 none.none.FALSE 5 16 -0.300 -0.147087026 -0.002925046
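To carry the same idea over to the daily setting in the question, a hedged sketch (daily is a hypothetical data frame with one row per day and columns ave.polarity and n_articles; neither name comes from the answer above):
daily$weight <- daily$n_articles / max(daily$n_articles)      # 0-1 weight per day
daily$weighted.polarity <- daily$ave.polarity * daily$weight  # down-weight thin days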
