X2 Test for Independence


So we have
       13-17 18-24 25-34 35-44 45-54 55-64 65+
Female     1    45    15     6     2     3   2
Male       2   121    31     7     4     2   3
and the raw data has headers like F.13-20, F.21-35, M.13-20, etc.
How would you do this? It is hard to explain, and we can't find it explained anywhere.
tab <- matrix(as.numeric(WeekReach[158,3:16]), nrow=2, byrow=TRUE)
colnames(tab) <- c("13-17", "18-24", "25-34", "35-44", "45-54", "55-64", "65+")
rownames(tab) <- c("Female","Male")
Then after this is:
exInd = function() {
n = sum(tab)
p = rowSums(tab)/sum(tab)
q = colSums(tab)/sum(tab)
return(p %o% q * n)}
chiSquaredStatistic = function(tab, E) {
return(sum((tab - E)^2/E))}
E = exInd()
x2 = replicate(1000, {
ageShuffle = sample(age)
genderShuffle = sample(gender)
# rows must be gender and columns age so the table lines up with E
Xindep = table(genderShuffle, ageShuffle)
chiSquaredStatistic(Xindep, E)})
But we need something to make male and female their own thing. It is hard to explain, and the teacher won't even explain it to us #universityproblems
Right so this was the solution given by the teacher - note, they do not explain anything.
WeekReach = read.csv("http://staff.scm.uws.edu.au/~lapark/300958/labs/WeeklyReachDemog.csv", as.is=TRUE)
tab = matrix(as.numeric(WeekReach[158,3:16]), nrow=2, byrow=TRUE)
colnames(tab) <- c("13-17", "18-24", "25-34", "35-44",
                   "45-54", "55-64", "65+")
rownames(tab) <- c("Female","Male")
stretchTable = function(tab, variableNames) {
  tabx = rep(rownames(tab), rowSums(tab))
  l = ncol(tab)
  m = nrow(tab)
  cn = colnames(tab)
  taby = c()
  for (a in 1:m) {
    for (b in 1:l) {
      taby = c(taby, rep(cn[b], tab[a,b]))
    }
  }

  d = data.frame(x = tabx, y = taby)
  colnames(d) = variableNames
  return(d)
}
tab2 = stretchTable(tab, c("Gender","Age"))
Verify that we have the correct values
table(tab2)
This was the 'question'
We showed in the lectures that we can perform a test for independence for a given two way table (two way meaning, has more than one row and column). To perform the test, we need to:
compute the expected values of the table if the rows and columns are independent (code shown in the lecture slides).
shuffle the rows and columns of the table.
To perform the shuffling, we must first untabulate the table. For example, if we start with the table:
  A B C
X 2 1 1
Y 1 3 1
We must convert it to the form:
Column Row
A X
A X
A Y
B X
B Y
B Y
B Y
C X
C Y
Write the code to do this table conversion.
Hint: To compute the two column table, the two columns can be computed separately as gender and age, then combined using tab2 = data.frame(Gender = gender, Age = age). Also, the functions rowSums, colSums, rownames, colnames and rep may be useful (if you are unfamiliar with these functions, read the R documentation on them, e.g. help(rowSums)).
Once we have the data in two columns, we shuffle the columns and recompute the table and compute the χ2 value (as shown in the lecture).
Using the above table tab and the hypotheses H0: Gender and Age are independent, HA: Gender and Age are not independent:
Compute the χ2 randomisation distribution.
Compute the χ2 statistic for tab.
Compute the p value of the test.
Finally state the conclusion of the test.
HAVE FUN.

I assume you want to perform a chi-square test of independence, to establish if there are significant differences between the expected and observed frequencies in the male/female groups across the different age brackets.
The following should get you started
df <- read.table(text =
"13-17 18-24 25-34 35-44 45-54 55-64 65+
Female 1 45 15 6 2 3 2
Male 2 121 31 7 4 2 3", header = T)
chisq.test(df)
#
# Pearson's Chi-squared test
#
#data: df
#X-squared = 4.8117, df = 6, p-value = 0.5682
Based on the sample data and the chi-square test results, we fail to reject the null hypothesis, and conclude that there is not enough evidence to infer that there is a statistically significant difference between the male and female frequencies across the different age brackets.
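The question also asked for the randomisation version of the test. A minimal sketch of that follows, assuming the table from the question is typed in directly. The names chi_sq, long, x2_null and the seed are illustrative and not from the course material, and the untabulation here goes through as.data.frame(as.table(...)) rather than the teacher's stretchTable.

```r
# Observed table from the question (rows: gender, columns: age bracket)
tab <- matrix(c(1,  45, 15, 6, 2, 3, 2,
                2, 121, 31, 7, 4, 2, 3),
              nrow = 2, byrow = TRUE,
              dimnames = list(Gender = c("Female", "Male"),
                              Age = c("13-17", "18-24", "25-34", "35-44",
                                      "45-54", "55-64", "65+")))

# Expected counts under independence: outer product of the margins over n
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Chi-squared statistic of an observed table against the expected counts
chi_sq <- function(observed, expected) sum((observed - expected)^2 / expected)

# Untabulate: one (Gender, Age) row per counted observation
idx  <- as.data.frame(as.table(tab))            # columns Gender, Age, Freq
long <- idx[rep(seq_len(nrow(idx)), idx$Freq), c("Gender", "Age")]

# Randomisation distribution: shuffle one column, retabulate, recompute
set.seed(1)                                     # arbitrary seed, for repeatability
x2_null <- replicate(1000, {
  shuffled <- table(long$Gender, sample(long$Age))
  chi_sq(shuffled, expected)
})

x2_obs  <- chi_sq(tab, expected)                # statistic for the observed table
p_value <- mean(x2_null >= x2_obs)              # share of shuffles at least as extreme
```

With 1000 shuffles the p-value is only an estimate and changes with the seed. It can be compared with the p-value reported by chisq.test above.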

Related

R - Dplyr - How to mutate rows

I found that dplyr is speedy and simple for aggregating and summarising data. But I can't find out how to solve the following problem with dplyr.
Given these data frames:
df_2017 <- data.frame(expand.grid(1:195,1:65,1:39),
value = sample(1:1000000,(195*65*39)),
period = rep("2017",(195*65*39)),
stringsAsFactors = F)
df_2017 <- df_2017[sample(1:(195*65*39),450000),]
names(df_2017) <- c("company", "product", "acc_concept", "value", "period")
df_2017$company <- as.character(df_2017$company)
df_2017$product <- as.character(df_2017$product)
df_2017$acc_concept <- as.character(df_2017$acc_concept)
df_2017$value <- as.numeric(df_2017$value)
ratio_df <- data.frame(concept=c("numerator","numerator","numerator","denom", "denom", "denom","name"),
ratio1=c("1","","","4","","","Sales over Assets"),
ratio2=c("1","","","5","6","","Sales over Expenses A + B"), stringsAsFactors = F)
where the columns in df_2017 are:
company = This is a categorical variable with companies from 1 to 195
product = This is a categorical variable with home appliance products from 1 to 65. For example, 1 could be irons, 2 televisions, etc.
acc_concept = This is a categorical variable with accounting concepts from 1 to 39. For example, 1 would be "Sales", 2 "Total Expenses", 3 "Returns", 4 "Assets", etc.
value = This is a numeric variable, with USD from 1 to 100.000.000
period = Categorical variable. Always 2017
As the expand.grid implies, the combinations of company - product - acc_concept are never duplicated, but it could happen that certain subjects do not have every company - product - acc_concept combination. That's why the code includes the line "df_2017 <- df_2017[sample(1:(195*65*39),450000),]", and that's why the output could turn out to be NA (see below).
And where the columns in ratio_df are:
concept = which acc_concept corresponds to the numerator, which one to the denominator, and which is the name of the ratio
ratio1 = acc_concept and name for ratio1
ratio2 = acc_concept and name for ratio2
I want to calculate 2 ratios (ratio_df) between acc_concept, for each product within each company.
For example:
I take the first ratio "acc_concepts" and "name" from ratio_df:
num_acc_concept <- ratio_df[ratio_df$concept == "numerator", 2]
denom_acc_concept <- ratio_df[ratio_df$concept == "denom", 2]
ratio_name <- ratio_df[ratio_df$concept == "name", 2]
Then I calculate the ratio for one product of one company, just to show you what I want to do:
ratio1_value <- sum(df_2017[df_2017$company == 1 & df_2017$product == 1 & df_2017$acc_concept %in% num_acc_concept, 4]) / sum(df_2017[df_2017$company == 1 & df_2017$product == 1 & df_2017$acc_concept %in% denom_acc_concept, 4])
Output:
output <- data.frame(Company="1", Product="1", desc_ratio=ratio_name, ratio_value = ratio1_value, stringsAsFactors = F)
As I said before I want to do this for each product within each company
The output data.frame could be something like this (ratios aren't the true ones because I haven't done the calculations yet):
company product desc_ratio                ratio_value
1       1       Sales over Assets         0.9303675
1       3       Sales over Assets         1.30
1       7       Sales over Assets         NaN
1       1       Sales over Expenses A + B Inf
1       2       Sales over Expenses A + B 2.32
1       3       Sales over Expenses A + B NA
2
3
and so on...
NaN when ratio is 0 / 0
Inf when ratio is number / 0
NA when there is no data for certain company and product.
I hope I have made myself clear...
Is there any way to solve this row problem with dplyr? Should I cast the df_2017?
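One possible direction, sketched here on a tiny made-up data set rather than the full df_2017: group by company and product, and sum the values whose acc_concept is listed as numerator or denominator. The helper ratio_by_group and the toy data frame df are hypothetical names introduced for illustration. ratio_df is copied from the question, and the .groups argument assumes dplyr 1.0 or later.

```r
library(dplyr)

# Toy data in the same shape as df_2017 (values are made up)
df <- data.frame(
  company     = c("1", "1", "1", "1", "2", "2"),
  product     = c("1", "1", "1", "2", "1", "1"),
  acc_concept = c("1", "4", "5", "1", "1", "4"),
  value       = c(100, 50, 20, 30, 10, 0),
  stringsAsFactors = FALSE
)

# ratio_df as defined in the question
ratio_df <- data.frame(
  concept = c("numerator", "numerator", "numerator", "denom", "denom", "denom", "name"),
  ratio1  = c("1", "", "", "4", "", "", "Sales over Assets"),
  ratio2  = c("1", "", "", "5", "6", "", "Sales over Expenses A + B"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one ratio per company/product group
ratio_by_group <- function(data, num, denom, name) {
  data %>%
    group_by(company, product) %>%
    summarise(
      desc_ratio  = name,
      ratio_value = sum(value[acc_concept %in% num]) /
                    sum(value[acc_concept %in% denom]),
      .groups = "drop"
    )
}

# Apply the helper to every ratio column of ratio_df and stack the results
ratio_cols <- setdiff(names(ratio_df), "concept")
result <- bind_rows(lapply(ratio_cols, function(col) {
  spec <- ratio_df[[col]]
  ratio_by_group(
    df,
    num   = setdiff(spec[ratio_df$concept == "numerator"], ""),
    denom = setdiff(spec[ratio_df$concept == "denom"], ""),
    name  = spec[ratio_df$concept == "name"]
  )
}))
```

In this sketch a group with no denominator rows gives Inf (or NaN when the numerator is also empty), because the sum of an empty vector is 0. Company/product pairs that are entirely absent from the data simply produce no row, so getting explicit NA rows would need an extra step, for instance a join against the full company/product grid.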

Creating a contingency table by hypergeometric sampling with the Titanic's database

I created a contingency table from the Titanic passenger data by hypergeometric sampling, meaning that both marginal totals are preset and equal. It was created by crossing the Sex and Survived columns for 328 cases (164 men and 164 women). This is the code:
First, I ungrouped the data and deleted the unneeded columns
titanic = as.data.frame(Titanic)
titanic = titanic[rep(1:nrow(titanic),titanic$Freq),]
titanic = titanic[,c(2,4)]
Then I selected a sample of men
men = subset(titanic, titanic$Sex == 'Male')
men = men [sample(nrow(men),164), ]
table(men$Sex, men$Survived)
# No Yes
# Male 133 31
# Female 0 0
Now the row of women must be filled in with the appropriate values
n = summary.factor(men$Survived)
womenYes = subset(titanic, (titanic$Sex == 'Female' & titanic$Survived=='Yes'))
womenYes = subset(womenYes[1:n[1], ])
womenNo = subset(titanic, (titanic$Sex == 'Female' & titanic$Survived=='No'))
womenNo = subset(womenNo[1:n[2], ])
women = merge(womenYes, womenNo, all = TRUE)
hyperSample = merge(men, women, all = TRUE)
table(hyperSample$Sex, hyperSample$Survived)
# No Yes
# Male 133 31
# Female 31 133
It works, but it looks a bit ugly, and I think someone could find a much more elegant or efficient way to do it. Thanks.

You can sample in two stages, both using rhyper. First, determine the numbers of men and women, subject to sampling only 328 and assuming the population is sex-distributed as in the original sample. This is what you might do if you were trying to bootstrap a statistic like a rate ratio. Second, use rhyper twice more to determine the numbers of survivors, subject to the same probabilities as in the original sample rows.
MFmat <- apply(Titanic, c(2, 4), sum)
nMale <- rhyper(1, rowSums(MFmat)[1], rowSums(MFmat)[2], 328)
#[1] 262
nFemale <- 328 - nMale
DMale <- rhyper(1, MFmat[1,1], MFmat[1,2], nMale)
SurvMale = nMale-DMale
DFemale = rhyper(1, MFmat[2,1], MFmat[2,2], nFemale)
SurvFemale = nFemale - DFemale
matrix( c( DMale, DFemale, SurvMale, SurvFemale), ncol=2,
dimnames=dimnames(MFmat) )
#----
Survived
Sex No Yes
Male 223 42
Female 22 41
I suppose you could sample the two rows separately, and you should be able to use the logic above, if that is what you have decided to do. Which way is more appropriate will depend on the underlying problem.
# Fixed row marginals....
nMale <-164
nFemale <- 164
DMale <- rhyper(1, MFmat[1,1], MFmat[1,2], nMale)
SurvMale = nMale-DMale
DFemale = rhyper(1, MFmat[2,1], MFmat[2,2], nFemale)
SurvFemale = nFemale - DFemale
matrix( c( DMale, DFemale, SurvMale, SurvFemale), ncol=2,
dimnames=dimnames(MFmat) )
#----------------
Survived
Sex No Yes
Male 127 37
Female 39 125
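Note that the fixed-row version above does not force the column totals to 164. If both margins must equal 164, as in the question, a 2x2 table has only one free cell, so a single rhyper draw for the men determines everything else. The following is a minimal sketch of that reading of the question; d_male and fixed_tab are illustrative names, and the seed is arbitrary.

```r
# Sex x Survived counts from the built-in Titanic table
MFmat <- apply(Titanic, c(2, 4), sum)

set.seed(42)  # arbitrary seed, for repeatability

# Number of non-survivors among 164 men drawn from all men on board
d_male <- rhyper(1, MFmat["Male", "No"], MFmat["Male", "Yes"], 164)

# With all four margins fixed at 164, the other three cells follow from d_male
fixed_tab <- matrix(c(d_male,       164 - d_male,   # column "No":  Male, Female
                      164 - d_male, d_male),        # column "Yes": Male, Female
                    ncol = 2, dimnames = dimnames(MFmat))
fixed_tab
```

This mirrors the construction in the question (sample the men, then fill in the women to balance the columns) rather than the two-stage sampling described above.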

I want to calculate maximum variance according to the sales with storetype

I am a beginner in R. I have a data frame with 3000 observations and 11 columns. Six of the columns hold various sales figures, and I want to measure the variance of each sales column across store_Type:
table(s1$store_Type)
    Grocery Store Supermarket Type1 Supermarket Type2 Supermarket Type3
              242              1226               200               350
I am not sure how to start on this problem.

To calculate variance on a data set you can use var(). To calculate by columns, use apply(). For example:
# create fake data
set.seed(123) # for reproducibility
dat <- as.data.frame(matrix(runif(15,100,200), ncol = 3, nrow = 5))
colnames(dat) <-c("Store 1", "Store 2", "Store 3")
# generate variance
var.dat <-apply(dat, MARGIN=2, FUN=var) # by column
var.dat
Store 1 Store 2 Store 3
866.3951 914.2388 978.7129

I used tapply() to find the variance of each sales column by store type. Then I summed the variances for each store type. I thought of grouping by store type but couldn't get the result. Even though I got the answer, it felt mechanical. There must be another way of doing this!
v2 = tapply(store_train$sales1,store_train$store_Type, var)
v3 = tapply(store_train$sales2,store_train$store_Type, var)
v4 = tapply(store_train$sales3,store_train$store_Type, var)
v5 = tapply(store_train$sales4,store_train$store_Type, var)
v1[1] + v2[1] + v3[1] + v4[1] + v5[1]
v1[2] + v2[2] + v3[2] + v4[2] + v5[2]
v1[3] + v2[3] + v3[3] + v4[3] + v5[3]
v1[4] + v2[4] + v3[4] + v4[4] + v5[4]
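A less mechanical route is to let aggregate() apply var to every sales column within each store type in one call. This is a minimal sketch on made-up data: the data frame s1 below is a hypothetical stand-in with only two store types and two sales columns, and it assumes the sales columns share the prefix "sales".

```r
# Hypothetical stand-in for the real data
set.seed(123)  # for reproducibility
s1 <- data.frame(
  store_Type = rep(c("Grocery Store", "Supermarket Type1"), each = 5),
  sales1 = runif(10, 100, 200),
  sales2 = runif(10, 100, 200)
)

# Pick out the sales columns by name
sales_cols <- grep("^sales", names(s1), value = TRUE)

# One row per store type, one variance per sales column
var_by_type <- aggregate(s1[sales_cols],
                         by = list(store_Type = s1$store_Type),
                         FUN = var)

# Sum of the per-column variances for each store type, as in the tapply version
var_by_type$total <- rowSums(var_by_type[sales_cols])
var_by_type
```

The row with the largest total (or the largest value in any single sales column) can then be found with which.max().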

R: How to output subset calculations (n, %) using ddply

I hope you can help me with this problem. For my work I have to use R to analyze survey data. The data has a number of columns by which I want to group the data and then do some calculations, e.g. how many men and women work in a certain department? I then want to calculate the number and percentage for each group. For example: 42 people work in department A, of whom 30 are women and 12 are men; 70 people work in department B, of whom 26 are women and 44 are men.
I currently use the following code to output the data (using ddply):
percentage_median_per_group_multiple_columns <- function(data, column_name, column_name2){
  library(plyr)
  descriptive <- ddply( data, column_name,
    function(x){
      percentage_median_per_group(x, column_name)
      percentage_median_per_group(x, column_name2)
    }
  )
  print(data.frame(descriptive))
}
## give number, percentage and median per group_value in column
percentage_median_per_group <- function(data, column_name3){
  library(plyr)
  descriptive <- ddply( data, column_name3,
    function(x){
      c(
        N <- nrow(x[column_name3]), #number
        pct <- (N/nrow(data))*100 #percentage
        #TODO: median
      )
    }
  )
  return(descriptive)
}
#calculate
percentage_median_per_group_multiple_columns(users_surveys_full_responses, "department", "gender")
Now the data outputs like this:
Department Sex  N  % per sex
A          f   30  71,4
           m   12  28,6
B          f   26  37,1
           m   44  62,9
But, I want the output to look like this, so calculations take place and are printed in each substep:
Department  N  % per department Sex  N  % per sex
A          42  37,5             f   30  71,4
                                m   12  28,6
B          70  62,5             f   26  37,1
                                m   44  62,9
Does anyone have a suggestion for how I can do that? If possible, I would even like to build it dynamically so I can group by the variables in multiple columns (e.g. department + sex + type of software + ...), but I would already be happy to have it as in the example =)
thanks!
EDIT
You can use this to generate example data:
n=100
sample_data = data.frame(department=sample(1:20,n,replace=TRUE), gender=sample(1:2,n,replace=TRUE))
percentage_median_per_group_multiple_columns(sample_data, "department", "gender")
V1 in the output stands for N (number) and V2 for %
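For reference, the numbers in the desired output can be produced with table() and prop.table() alone. The sketch below is one possible base R route, not a ddply solution; it rebuilds the counts quoted in the question as a hypothetical data frame called survey, and the output column names are illustrative.

```r
# Hypothetical survey data matching the counts quoted in the question
survey <- data.frame(
  department = rep(c("A", "A", "B", "B"), times = c(30, 12, 26, 44)),
  gender     = rep(c("f", "m", "f", "m"), times = c(30, 12, 26, 44))
)

# Counts and percentages per department
dept_n   <- table(survey$department)
dept_pct <- 100 * prop.table(dept_n)

# Counts per department/gender cell and percentages within each department
cell_n   <- table(survey$department, survey$gender)
cell_pct <- 100 * prop.table(cell_n, margin = 1)

# Flatten to one row per cell and attach the department-level figures
counts <- as.data.frame(cell_n, stringsAsFactors = FALSE)
names(counts) <- c("department", "gender", "n")
counts$pct_within_dept <- round(as.vector(cell_pct), 1)
counts$dept_n   <- as.vector(dept_n[counts$department])
counts$dept_pct <- round(as.vector(dept_pct[counts$department]), 1)
counts <- counts[order(counts$department, counts$gender), ]
counts
```

The department figures are repeated on every row of that department instead of being printed once, which is usually easier to work with downstream. Grouping by further columns would mean passing more variables to table() and choosing the margin accordingly.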

Create variables from content of a row in R

I have hospital visit data that contains records for gender, age, main diagnosis, and hospital identifier. I intend to create separate variables for these entries. The data has some pattern: most observations start with a gender code (M or F) followed by age, then the diagnosis and usually the hospital identifier. But there are some exceptions. In some, the gender is coded 01 or 02, and in this case the gender identifier appears at the end.
I looked into the archives and found some examples of grep, but I could not implement them efficiently on my data. For example, the code
ndiag<-dat[grep("copd", dat[,1], fixed = TRUE),]
could extract each diagnosis individually, but not all at once. How can I do this task?
Sample data showing the current situation (column 1) and what I intend to have (the remaining columns) is shown below:
diagnosis      hospital diag  age   gender
m3034CVDA      A        cvd   30-34 M
m3034cardvA    A        cardv 30-34 M
f3034aceB      B        ace   30-34 F
m3034hfC       C        hf    30-34 M
m3034cereC     C        cere  30-34 M
m3034resPC     C        resp  30-34 M
3034copd_Z_01  Z        copd  30-34 M
3034copd_Z_01  Z        copd  30-34 M
fcereZ         Z        cere  NA    F
f3034respC     C        resp  30-34 F
3034copd_Z_02  Z        copd  30-34 F

There appear to be two key parts to this problem:
1. Dealing with the fact that strings are coded in two different ways
2. Splicing the string into the appropriate data columns
Note: as for applying a function over several values at once, many of the functions can handle vectors already. For example str_locate and substr.
Part 1 - Cleaning the strings for m/f // 01/02 coding
# We will be using this library later for str_detect, str_replace, etc
library(stringr)
# first, make sure diagnosis is character (strings) and not factor (category)
diagnosis <- as.character(diagnosis)
# We will use a temporary vector, to preserve the original, but this is not a necessary step.
diagnosisTmp <- diagnosis
males <- str_locate(diagnosisTmp, "_01")
females <- str_locate(diagnosisTmp, "_02")
# NOTE: All of this will work fine as long as '_01'/'_02' appears *__only__* as gender code.
# Therefore, we put in the next two lines to check for errors, make sure we didn't accidentally grab a "_01" from the middle of the string
#-------------------------
if (any(str_length(diagnosisTmp) != males[,2], na.rm=T)) stop ("Error in coding for males")
if (any(str_length(diagnosisTmp) != females[,2], na.rm=T)) stop ("Error in coding for females")
#------------------------
# remove all the '_01'/'_02' (replacing with "")
diagnosisTmp <- str_replace(diagnosisTmp, "_01", "")
diagnosisTmp <- str_replace(diagnosisTmp, "_02", "")
# append to front of string appropriate m/f code
diagnosisTmp[!is.na(males[,1])] <- paste0("m", diagnosisTmp[!is.na(males[,1])])
diagnosisTmp[!is.na(females[,1])] <- paste0("f", diagnosisTmp[!is.na(females[,1])])
# remove superfluous underscores
diagnosisTmp <- str_replace(diagnosisTmp, "_", "")
# display the original next to modified, for visual spot check
cbind(diagnosis, diagnosisTmp)
Part 2 - Splicing the string
# gender is the first char, hospital is the last.
gender <- toupper(str_sub(diagnosisTmp, 1,1))
hosp <- str_sub(diagnosisTmp, -1,-1)
# age, if present is char 2-5. A warning will be shown if values are missing. Age needs to be cleaned up
age <- as.numeric(str_sub(diagnosisTmp, 2,5)) # as.numeric will convert non-numbers to NA
age[!is.na(age)] <- paste(substr(age[!is.na(age)], 1, 2), substr(age[!is.na(age)], 3, 4), sep="-")
# diagnosis is variable length, so we have to find where to start
diagStart <- 2 + 4*(!is.na(age))
diag <- str_sub(diagnosisTmp, diagStart, -2)
# Put it all together into a data frame
dat <- data.frame(diagnosis, hosp, diag, age, gender)
## OR WITHOUT ORIGINAL DIAGNOSIS STRING ##
dat <- data.frame(hosp, diag, age, gender)
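The same two steps can also be done with base R regular expressions, without stringr. This is a sketch under the same assumptions as above (a trailing _01/_02 only ever means gender, the hospital is the last character, and age is four digits when present), run on a few of the sample strings from the question. The names x, norm, middle and parsed are illustrative.

```r
# A few of the sample strings from the question
x <- c("m3034CVDA", "f3034aceB", "3034copd_Z_01", "fcereZ", "3034copd_Z_02")

# Part 1: move a trailing _01/_02 gender code to the front as m/f,
# then drop the remaining underscores
norm <- sub("^(.*)_01$", "m\\1", x)
norm <- sub("^(.*)_02$", "f\\1", norm)
norm <- gsub("_", "", norm, fixed = TRUE)

# Part 2: gender is the first character, hospital the last
gender   <- toupper(substr(norm, 1, 1))
hospital <- substr(norm, nchar(norm), nchar(norm))
middle   <- substr(norm, 2, nchar(norm) - 1)   # age digits (if any) + diagnosis

# Age is the leading four digits, when present; the rest is the diagnosis
has_age <- grepl("^[0-9]{4}", middle)
age  <- ifelse(has_age, sub("^([0-9]{2})([0-9]{2}).*$", "\\1-\\2", middle), NA)
diag <- tolower(sub("^[0-9]{4}", "", middle))

parsed <- data.frame(diagnosis = x, hospital, diag, age, gender)
parsed
```

Lower-casing the diagnosis is a choice made here to match the desired output in the question (for example "cvd" from "m3034CVDA").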