R: mutate a new column based on a range of values in another column

I have an R dataframe in the following format:
+--------+---------------+--------------------+--------+
| time | Stress_ratio | shear_displacement | CX |
+--------+---------------+--------------------+--------+
| <dbl> | <dbl> | <dbl> | <dbl> |
| 50.1 | -0.224 | 4.9 | 0 |
| 50.2 | -0.219 | 4.98 | 0.0100 |
| . | . | . | . |
| . | . | . | . |
| 249.3 | -0.217 | 4.97 | 0.0200 |
| 250.4 | -0.214 | 4.96 | 0.0300 |
| 251.1 | -0.222 | 4.91 | 0.06 |
| 252.1 | -0.222 | 4.91 | 0.06 |
| 253.3 | -0.222 | 4.91 | 0.06 |
| 254.5 | -0.222 | 4.91 | 0.06 |
| 256.8 | -0.222 | 4.91 | 0.06 |
| . | . | . | . |
| . | . | . | . |
| 500.1 | -0.22 | 4.91 | 0.6 |
| 501.4 | -0.22 | 4.91 | 0.6 |
| 503.1 | -0.22 | 4.91 | 0.6 |
+--------+---------------+--------------------+--------+
and I want a new column whose value stays the same within each 250-unit chunk of the time column and increments when the next chunk starts. For example, new_column should be 1 for every row whose time falls in the first 250-unit range, then change to 2 when the next 250-unit chunk begins, and so on. So the new dataframe should look like
+--------+---------------+--------------------+--------+------------+
| time | Stress_ratio | shear_displacement | CX | new_column |
+--------+---------------+--------------------+--------+------------+
| <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 50.1 | -0.224 | 4.9 | 0 | 1 |
| 50.2 | -0.219 | 4.98 | 0.0100 | 1 |
| . | . | . | . | 1 |
| . | . | . | . | 1 |
| 249.3 | -0.217 | 4.97 | 0.0200 | 1 |
| 250.4 | -0.214 | 4.96 | 0.0300 | 2 |
| 251.1 | -0.222 | 4.91 | 0.06 | 2 |
| 252.1 | -0.222 | 4.91 | 0.06 | 2 |
| 253.3 | -0.222 | 4.91 | 0.06 | 2 |
| 254.5 | -0.222 | 4.91 | 0.06 | 2 |
| 256.8 | -0.222 | 4.91 | 0.06 | 2 |
| . | . | . | . | . |
| . | . | . | . | . |
| 499.1 | -0.22 | 4.91 | 0.6 | 2 |
| 501.4 | -0.22 | 4.91 | 0.6 | 3 |
| 503.1 | -0.22 | 4.91 | 0.6 | 3 |
+--------+---------------+--------------------+--------+------------+

If I understand what you're trying to do, a base R solution could be:
df$new_column <- df$time %/% 250 + 1
The %/% operator is integer division (the counterpart of the modulus operator %%): it tells you how many whole copies of 250 fit into your number; we add 1 to get the values you want.
The tidyverse version:
df <- df %>%
mutate(new_column = time %/% 250 + 1)

A data.table alternative, using rleid() to assign consecutive ids to each run of chunk values:
library(data.table)
setDT(df)[, new_column := rleid(time %/% 250)][]
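All of these snippets rely on the same arithmetic: divide time by the chunk width and discard the remainder. As a quick cross-language sanity check (a Python sketch; Python's // plays the role of R's %/%):

```python
# Bucket each time value into consecutive 250-unit chunks:
# chunk index = floor(time / 250) + 1, so [0, 250) -> 1, [250, 500) -> 2, ...
times = [50.1, 50.2, 249.3, 250.4, 499.1, 501.4, 503.1]
new_column = [int(t // 250) + 1 for t in times]
print(new_column)  # -> [1, 1, 1, 2, 2, 3, 3]
```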

Related

How to plot the following exponential function properly

Data
| x | Y |
| --------| --------|
| 26.88 | 3.16 |
| 28.57 | 4.21 |
| 30.94 | 2.97 |
| 33.90 | 3.06 |
| 37.24 | 2.87 |
| 39.76 | 2.95 |
| 41.89 | 2.70 |
| 44.37 | 1.25 |
| 27.20 | 5.04 |
| 26.54 | 6.69 |
| 29.21 | 4.42 |
| 33.26 | 3.15 |
| 34.80 | 3.20 |
| 37.87 | 3.11 |
| 41.88 | 2.95 |
| 44.13 | 2.26 |
| 26.42 | 7.07 |
| 24.02 | 8.72 |
| 29.73 | 6.38 |
| 31.10 | 3.85 |
| 33.16 | 3.00 |
| 36.76 | 3.28 |
| 43.26 | 3.18 |
| 42.06 | 2.73 |
| 26.73 | 9.44 |
| 23.03 | 9.72 |
| 27.07 | 6.98 |
| 29.04 | 4.67 |
| 31.83 | 3.55 |
| 36.29 | 3.89 |
| 39.45 | 3.55 |
| 42.17 | 3.37 |
| 23.51 | 10.44 |
| 21.98 | 10.90 |
| 27.21 | 8.13 |
| 28.63 | 5.76 |
| 30.92 | 3.96 |
| 35.57 | 3.94 |
| 38.33 | 3.88 |
| 40.91 | 3.58 |
| 25.15 | 13.05 |
| 19.44 | 15.91 |
| 25.94 | 10.37 |
| 28.03 | 5.17 |
| 31.25 | 4.04 |
| 35.31 | 4.24 |
| 37.02 | 4.31 |
| 38.89 | 3.99 |
| 25.12 | 15.66 |
| 18.36 | 19.86 |
| 25.05 | 12.82 |
| 27.58 | 6.07 |
| 28.83 | 4.11 |
| 33.76 | 4.17 |
| 34.48 | 4.30 |
| 37.32 | 3.97 |
| 21.27 | 20.49 |
| 16.61 | 25.53 |
| 22.68 | 16.58 |
| 25.63 | 6.34 |
| 28.15 | 4.40 |
| 32.80 | 3.99 |
| 35.27 | 4.59 |
| 36.75 | 4.35 |
Code
library(data.table)
library(readxl)
library(dplyr)
library(ggplot2)
library(patchwork)
library(ggpubr)
library(ggpmisc)
setwd("E:/")
Data_2 <- read_excel("Data_2.xlsx")
model.0 <- lm(log(Strength) ~ Theoritical, data= Data_2)
alpha.0 <- exp(coef(model.0)[1])
beta.0 <- coef(model.0)[2]
# Starting parameters
start <- list(alpha = alpha.0, beta = beta.0)
start
model <- nls(Strength ~ alpha * exp((1/beta) * Theoritical) , data = Data_2, start = start)
summary(model)
# Plot fitted curve
plot(Data_2$Theoritical, Data_2$Strength)
line(Data_2$Theoritical, predict(model, list(x = Data_2$Theoritical)), col = 'skyblue')
When I draw my plot I get the following image.
I need this kind of equation for my data:
y=a*e^(-x/b)
I also could not get the R^2 value shown in this picture.
Please correct my code and help me produce a good best-fit graph for that equation. I am new to R programming.
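No answer is shown for this question here, but the usual trick for y = a*e^(-x/b) is to linearize it: log y = log a - x/b, so an ordinary least-squares fit of log y against x gives log a as the intercept and -1/b as the slope (this is also how the starting values for nls are usually obtained). A minimal Python sketch of that idea on noise-free synthetic data, not the asker's Excel file:

```python
import math

def fit_exp_decay(xs, ys):
    """Fit y = a * exp(-x / b) by least squares on log(y).

    log(y) = log(a) - x/b is a straight line in x:
    intercept = log(a), slope = -1/b.
    """
    n = len(xs)
    ly = [math.log(y) for y in ys]
    mx = sum(xs) / n
    my = sum(ly) / n
    slope = (sum((x - mx) * (l - my) for x, l in zip(xs, ly))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return math.exp(intercept), -1.0 / slope  # a, b

# Synthetic data generated from a = 50, b = 10 with no noise,
# so the fit should recover the parameters almost exactly.
xs = [20, 25, 30, 35, 40, 45]
ys = [50 * math.exp(-x / 10) for x in xs]
a, b = fit_exp_decay(xs, ys)
print(round(a, 3), round(b, 3))  # -> 50.0 10.0
```

With real, noisy data these log-scale estimates are typically used only as starting values for a nonlinear fit, and R^2 can then be computed from the residuals of that fit.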

make a table in expss that shows both freq and cpct but only tests cpct on the cpct columns

Using this data set with a multiple dichotomy set and a group:
set.seed(14)
checkall <- data.frame(ID=1:200,
group=sample(c("A", "B", "C"), size=200, replace=TRUE),
q1a=sample(c(0,1), size=200, replace=TRUE),
q1b=sample(c(0,1), size=200, replace=TRUE),
q1c=sample(c(0,1), size=200, replace=TRUE),
q1d=sample(c(0,1), size=200, replace=TRUE),
q1e=sample(c(0,1), size=200, replace=TRUE),
q1f=sample(c(0,1), size=200, replace=TRUE),
q1g=sample(c(0,1), size=200, replace=TRUE),
q1h=sample(c(0,1), size=200, replace=TRUE))
#Doctor some to be related to group
checkall$q1c[checkall$group=="A"] <- sample(c(0,1,1,1), size=sum(checkall$group=="A"), replace=TRUE)
checkall$q1e[checkall$group=="A"] <- sample(c(0,0,0,1), size=sum(checkall$group=="A"), replace=TRUE)
I would like to make a table that shows frequencies and column percents like this:
library(dplyr)
if( !require(expss) ){ install.packages("expss", dependencies=TRUE); library(expss) }
checkall %>% tab_cells(mdset(q1a %to% q1h)) %>%
tab_cols(total(), group) %>%
tab_stat_cases(label = "freq") %>%
tab_stat_cpct(label = "col %") %>%
tab_pivot(stat_position = "inside_columns")
| | #Total | | group | | | | | |
| | freq | col % | A | | B | | C | |
| | | | freq | col % | freq | col % | freq | col % |
| ------------ | ------ | ----- | ----- | ----- | ---- | ----- | ---- | ----- |
| q1a | 101 | 50.8 | 33 | 47.8 | 36 | 51.4 | 32 | 53.3 |
| q1b | 92 | 46.2 | 34 | 49.3 | 29 | 41.4 | 29 | 48.3 |
| q1c | 111 | 55.8 | 53 | 76.8 | 30 | 42.9 | 28 | 46.7 |
| q1d | 89 | 44.7 | 35 | 50.7 | 30 | 42.9 | 24 | 40.0 |
| q1e | 100 | 50.3 | 19 | 27.5 | 43 | 61.4 | 38 | 63.3 |
| q1f | 89 | 44.7 | 34 | 49.3 | 36 | 51.4 | 19 | 31.7 |
| q1g | 97 | 48.7 | 29 | 42.0 | 33 | 47.1 | 35 | 58.3 |
| q1h | 113 | 56.8 | 40 | 58.0 | 36 | 51.4 | 37 | 61.7 |
| #Total cases | 199 | 199.0 | 69 | 69.0 | 70 | 70.0 | 60 | 60.0 |
But I would like to add the notations that compare the cpct values to that in the first column. I can get that on a table with just cpct values like this:
checkall %>% tab_cells(mdset(q1a %to% q1h)) %>%
tab_cols(total(), group) %>%
tab_stat_cpct(label = "col %")%>%
tab_pivot(stat_position = "inside_columns")%>%
significance_cpct(compare_type = "first_column")
| | #Total | group | | |
| | col % | A | B | C |
| | | col % | col % | col % |
| ------------ | ------ | ------ | ----- | ----- |
| q1a | 50.8 | 47.8 | 51.4 | 53.3 |
| q1b | 46.2 | 49.3 | 41.4 | 48.3 |
| q1c | 55.8 | 76.8 + | 42.9 | 46.7 |
| q1d | 44.7 | 50.7 | 42.9 | 40.0 |
| q1e | 50.3 | 27.5 - | 61.4 | 63.3 |
| q1f | 44.7 | 49.3 | 51.4 | 31.7 |
| q1g | 48.7 | 42.0 | 47.1 | 58.3 |
| q1h | 56.8 | 58.0 | 51.4 | 61.7 |
| #Total cases | 199 | 69 | 70 | 60 |
Is there a way to get the + and - notations onto the first table, in just the cpct columns? If I mix tab_stat_cases(label="freq") with significance_cpct(compare_type = "first_column"), I get a weird table that tries to compare both the freq and cpct columns to the first column:
checkall %>% tab_cells(mdset(q1a %to% q1h)) %>%
tab_cols(total(), group) %>%
tab_stat_cases(label = "freq") %>%
tab_stat_cpct(label = "col %")%>%
tab_pivot(stat_position = "inside_columns")%>%
significance_cpct(compare_type = "first_column")
| | #Total | | group | | | | | |
| | freq | col % | A | | B | | C | |
| | | | freq | col % | freq | col % | freq | col % |
| ------------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| q1a | 101.0 | 50.8 - | 33.0 - | 47.8 - | 36.0 - | 51.4 - | 32.0 - | 53.3 - |
| q1b | 92.0 | 46.2 - | 34.0 - | 49.3 - | 29.0 - | 41.4 - | 29.0 - | 48.3 - |
| q1c | 111.0 | 55.8 - | 53.0 - | 76.8 | 30.0 - | 42.9 - | 28.0 - | 46.7 - |
| q1d | 89.0 | 44.7 - | 35.0 - | 50.7 - | 30.0 - | 42.9 - | 24.0 - | 40.0 - |
| q1e | 100.0 | 50.3 - | 19.0 - | 27.5 - | 43.0 - | 61.4 - | 38.0 - | 63.3 - |
| q1f | 89.0 | 44.7 - | 34.0 - | 49.3 - | 36.0 - | 51.4 - | 19.0 - | 31.7 - |
| q1g | 97.0 | 48.7 - | 29.0 - | 42.0 - | 33.0 - | 47.1 - | 35.0 - | 58.3 - |
| q1h | 113.0 | 56.8 - | 40.0 - | 58.0 - | 36.0 - | 51.4 - | 37.0 - | 61.7 |
| #Total cases | 199 | 199 | 69 | 69 | 70 | 70 | 60 | 60 |
I'm looking for the top table with the + and - notation as below:
| | #Total | | group | | | | | |
| | freq | col % | A | | B | | C | |
| | | | freq | col % | freq | col % | freq | col % |
| ------------ | ------ | ----- | ----- | ----- | ---- | ----- | ---- | ----- |
| q1a | 101 | 50.8 | 33 | 47.8 | 36 | 51.4 | 32 | 53.3 |
| q1b | 92 | 46.2 | 34 | 49.3 | 29 | 41.4 | 29 | 48.3 |
| q1c | 111 | 55.8 | 53 | 76.8 +| 30 | 42.9 | 28 | 46.7 |
| q1d | 89 | 44.7 | 35 | 50.7 | 30 | 42.9 | 24 | 40.0 |
| q1e | 100 | 50.3 | 19 | 27.5 -| 43 | 61.4 | 38 | 63.3 |
| q1f | 89 | 44.7 | 34 | 49.3 | 36 | 51.4 | 19 | 31.7 |
| q1g | 97 | 48.7 | 29 | 42.0 | 33 | 47.1 | 35 | 58.3 |
| q1h | 113 | 56.8 | 40 | 58.0 | 36 | 51.4 | 37 | 61.7 |
| #Total cases | 199 | 199.0 | 69 | 69.0 | 70 | 70.0 | 60 | 60.0 |
There is a special function for such cases, tab_last_sig_cpct, which applies the significance marking only to the last calculated statistic:
checkall %>% tab_cells(mdset(q1a %to% q1h)) %>%
tab_cols(total(), group) %>%
tab_stat_cases(label = "freq") %>%
tab_stat_cpct(label = "col %") %>%
tab_last_sig_cpct(compare_type = "first_column") %>%
tab_pivot(stat_position = "inside_columns")
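The + on q1c and - on q1e come from comparing each group's column percent against the #Total column. Roughly speaking this is a two-proportion z-test, sketched below in Python for illustration (expss's exact procedure may differ, e.g. in corrections and in how the overlap between a group and the total is handled):

```python
import math

def two_prop_z(x1, n1, x2, n2):
    """z statistic for comparing two proportions with a pooled estimate."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# q1c: 53 of 69 in group A vs 111 of 199 overall -> marked "+"
# (z is about 3.09, well beyond the usual 1.96 cutoff)
print(two_prop_z(53, 69, 111, 199))

# q1a: 33 of 69 vs 101 of 199 -> no mark (|z| below 1.96)
print(two_prop_z(33, 69, 101, 199))
```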

AODE Machine Learning in R

I wanted to know whether AODE can really be better than Naive Bayes, as the description says:
https://cran.r-project.org/web/packages/AnDE/AnDE.pdf
--> "AODE achieves highly accurate classification by averaging over all of a small space."
https://www.quora.com/What-is-the-difference-between-a-Naive-Bayes-classifier-and-AODE
--> "AODE is a weird way of relaxing naive bayes' independence assumptions. It is no longer a generative model, but it relaxes the independence assumptions in a slightly different (and less principled) way than logistic regression does. It replaces the convex optimization problem used in training a logistic regression classifier by a quadratic (on the number of features) dependency on both training and test times."
But when I experimented with it, I found that the prediction results seem off. I implemented it with this code:
library(gmodels)
library(AnDE)
AODE_Model = aode(iris)
predict_aode = predict(AODE_Model, iris)
CrossTable(as.numeric(iris$Species), predict_aode)
Can anyone explain this, or suggest a good practical way to implement AODE? Thank you in advance.
If you check out the vignette for the function:
train: data.frame: training data. It should be a data frame. AODE works only on discretized data. It would be better to discretize the data frame before passing it to this function. However, aode discretizes the data if not done beforehand. It uses an R package called discretization for the purpose. It uses the well known MDL discretization technique. (It might fail sometimes.)
By default, the discretize function from arules cuts each variable into 3 intervals, which may not be enough for iris. So I first reproduce the result you have with the discretization by arules:
library(arules)
library(gmodels)
library(AnDE)
set.seed(111)
indata <- data.frame(lapply(iris[,1:4],discretize,breaks=3),Species=iris$Species)
trn = sample(1:nrow(indata),100)
test = setdiff(1:nrow(indata),trn)
AODE_Model = aode(indata[trn,])
predict_aode = predict(AODE_Model, indata[test,])
CrossTable(as.numeric(indata$Species)[test], predict_aode)
| predict_aode
as.numeric(indata$Species)[test] | 1 | 3 | Row Total |
---------------------------------|-----------|-----------|-----------|
1 | 15 | 5 | 20 |
| 0.500 | 4.500 | |
| 0.750 | 0.250 | 0.400 |
| 0.333 | 1.000 | |
| 0.300 | 0.100 | |
---------------------------------|-----------|-----------|-----------|
2 | 11 | 0 | 11 |
| 0.122 | 1.100 | |
| 1.000 | 0.000 | 0.220 |
| 0.244 | 0.000 | |
| 0.220 | 0.000 | |
---------------------------------|-----------|-----------|-----------|
3 | 19 | 0 | 19 |
| 0.211 | 1.900 | |
| 1.000 | 0.000 | 0.380 |
| 0.422 | 0.000 | |
| 0.380 | 0.000 | |
---------------------------------|-----------|-----------|-----------|
Column Total | 45 | 5 | 50 |
| 0.900 | 0.100 | |
---------------------------------|-----------|-----------|-----------|
You can see one of the classes is missing from the predictions. Let's increase the number of breaks to 4:
indata <- data.frame(lapply(iris[,1:4],discretize,breaks=4),Species=iris$Species)
AODE_Model = aode(indata[trn,])
predict_aode = predict(AODE_Model, indata[test,])
CrossTable(as.numeric(indata$Species)[test], predict_aode)
| predict_aode
as.numeric(indata$Species)[test] | 1 | 2 | 3 | Row Total |
---------------------------------|-----------|-----------|-----------|-----------|
1 | 20 | 0 | 0 | 20 |
| 18.000 | 4.800 | 7.200 | |
| 1.000 | 0.000 | 0.000 | 0.400 |
| 1.000 | 0.000 | 0.000 | |
| 0.400 | 0.000 | 0.000 | |
---------------------------------|-----------|-----------|-----------|-----------|
2 | 0 | 10 | 1 | 11 |
| 4.400 | 20.519 | 2.213 | |
| 0.000 | 0.909 | 0.091 | 0.220 |
| 0.000 | 0.833 | 0.056 | |
| 0.000 | 0.200 | 0.020 | |
---------------------------------|-----------|-----------|-----------|-----------|
3 | 0 | 2 | 17 | 19 |
| 7.600 | 1.437 | 15.091 | |
| 0.000 | 0.105 | 0.895 | 0.380 |
| 0.000 | 0.167 | 0.944 | |
| 0.000 | 0.040 | 0.340 | |
---------------------------------|-----------|-----------|-----------|-----------|
Column Total | 20 | 12 | 18 | 50 |
| 0.400 | 0.240 | 0.360 | |
---------------------------------|-----------|-----------|-----------|-----------|
It gets only 3 wrong. To me, it's a matter of tuning the discretization without overfitting, which can be tricky.
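The bin-count effect is easy to see outside R too. A Python sketch of equal-width binning (a stand-in for arules::discretize; the values below are purely illustrative, not iris measurements) shows how classes that are distinguishable with finer bins can collapse into the same bin under coarse ones:

```python
def equal_width_bins(values, breaks):
    """Assign each value a bin index 0..breaks-1 over equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / breaks
    return [min(int((v - lo) / width), breaks - 1) for v in values]

# Three classes of two values each: A = (1.0, 1.1), B = (1.25, 1.4), C = (1.9, 2.0)
values = [1.0, 1.1, 1.25, 1.4, 1.9, 2.0]

# With 3 bins, class B straddles a bin edge and partly merges with A:
print(equal_width_bins(values, 3))  # -> [0, 0, 0, 1, 2, 2]

# With 4 bins, each class lands in its own bin:
print(equal_width_bins(values, 4))  # -> [0, 0, 1, 1, 3, 3]
```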

Varargs is giving key error in Julia

Consider the following table:
julia> using RDatasets, DataFrames
julia> anscombe = dataset("datasets","anscombe")
11x8 DataFrame
| Row | X1 | X2 | X3 | X4 | Y1 | Y2 | Y3 | Y4 |
|-----|----|----|----|----|-------|------|-------|------|
| 1 | 10 | 10 | 10 | 8 | 8.04 | 9.14 | 7.46 | 6.58 |
| 2 | 8 | 8 | 8 | 8 | 6.95 | 8.14 | 6.77 | 5.76 |
| 3 | 13 | 13 | 13 | 8 | 7.58 | 8.74 | 12.74 | 7.71 |
| 4 | 9 | 9 | 9 | 8 | 8.81 | 8.77 | 7.11 | 8.84 |
| 5 | 11 | 11 | 11 | 8 | 8.33 | 9.26 | 7.81 | 8.47 |
| 6 | 14 | 14 | 14 | 8 | 9.96 | 8.1 | 8.84 | 7.04 |
| 7 | 6 | 6 | 6 | 8 | 7.24 | 6.13 | 6.08 | 5.25 |
| 8 | 4 | 4 | 4 | 19 | 4.26 | 3.1 | 5.39 | 12.5 |
| 9 | 12 | 12 | 12 | 8 | 10.84 | 9.13 | 8.15 | 5.56 |
| 10 | 7 | 7 | 7 | 8 | 4.82 | 7.26 | 6.42 | 7.91 |
| 11 | 5 | 5 | 5 | 8 | 5.68 | 4.74 | 5.73 | 6.89 |
I have defined a function as follows:
julia> f1(df, matchval, matchfield, qfields...) = isempty(qfields)
WARNING: Method definition f1(Any, Any, Any, Any...) in module Main at REPL[314]:1 overwritten at REPL[317]:1.
f1 (generic function with 3 methods)
Now below is the problem
julia> f1(anscombe, 11, "X1")
ERROR: KeyError: key :field not found
in getindex at ./dict.jl:697 [inlined]
in getindex(::DataFrames.Index, ::Symbol) at /home/arghya/.julia/v0.5/DataFrames/src/other/index.jl:114
in getindex at /home/arghya/.julia/v0.5/DataFrames/src/dataframe/dataframe.jl:228 [inlined]
in f1(::DataFrames.DataFrame, ::Int64, ::String) at ./REPL[249]:2
Where am I going wrong? FYI I'm using Julia version 0.5.2. How can I overcome this problem? Thanks in advance!
There is nothing wrong with your code: try running just what you've posted in a fresh session. You've probably defined another f1 method before (the warning about an overwritten method definition hints at that). If you come from R, you may assume the old definition is replaced by f1(df, matchval, matchfield, qfields...) = isempty(qfields), while in fact you're just defining an additional method for the f1 function. The error is probably thrown by a 3-argument method you defined earlier. See https://docs.julialang.org/en/stable/manual/methods/

Mimic tabulate command from Stata in R

I'm trying to get a two-way table in R similar to this one from Stata. I was trying to use CrossTable from the gmodels package, but the table is not the same. Do you know how this can be done in R?
I'd at least like to get the frequencies for
cursmoke1 == "Yes" & cursmoke2 == "No" and the reverse.
In R I'm only getting totals for Yes, No and NA.
Here is the output:
Stata
. tabulate cursmoke1 cursmoke2, cell column miss row
+-------------------+
| Key |
|-------------------|
| frequency |
| row percentage |
| column percentage |
| cell percentage |
+-------------------+
Current |
smoker, | Current smoker, exam 2
exam 1 | No Yes . | Total
-----------+---------------------------------+----------
No | 1,898 131 224 | 2,253
| 84.24 5.81 9.94 | 100.00
| 86.16 7.59 44.44 | 50.81
| 42.81 2.95 5.05 | 50.81
-----------+---------------------------------+----------
Yes | 305 1,596 280 | 2,181
| 13.98 73.18 12.84 | 100.00
| 13.84 92.41 55.56 | 49.19
| 6.88 35.99 6.31 | 49.19
-----------+---------------------------------+----------
Total | 2,203 1,727 504 | 4,434
| 49.68 38.95 11.37 | 100.00
| 100.00 100.00 100.00 | 100.00
| 49.68 38.95 11.37 | 100.00
R
> CrossTable(cursmoke2, cursmoke1, missing.include = T, format="SAS")
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 4434
| cursmoke1
cursmoke2 | No | Yes | NA | Row Total |
-------------|-----------|-----------|-----------|-----------|
No | 2203 | 0 | 0 | 2203 |
| 1122.544 | 858.047 | 250.409 | |
| 1.000 | 0.000 | 0.000 | 0.497 |
| 1.000 | 0.000 | 0.000 | |
| 0.497 | 0.000 | 0.000 | |
-------------|-----------|-----------|-----------|-----------|
Yes | 0 | 1727 | 0 | 1727 |
| 858.047 | 1652.650 | 196.303 | |
| 0.000 | 1.000 | 0.000 | 0.389 |
| 0.000 | 1.000 | 0.000 | |
| 0.000 | 0.389 | 0.000 | |
-------------|-----------|-----------|-----------|-----------|
NA | 0 | 0 | 504 | 504 |
| 250.409 | 196.303 | 3483.288 | |
| 0.000 | 0.000 | 1.000 | 0.114 |
| 0.000 | 0.000 | 1.000 | |
| 0.000 | 0.000 | 0.114 | |
-------------|-----------|-----------|-----------|-----------|
Column Total | 2203 | 1727 | 504 | 4434 |
| 0.497 | 0.389 | 0.114 | |
-------------|-----------|-----------|-----------|-----------|
Maybe I'm missing something here. The default settings for CrossTable seem to provide essentially what you are looking for.
Here is CrossTable with minimal arguments. (I've loaded the dataset as "temp".) Note that the results are the same as what you posted from the Stata output (you just need to multiply by 100 if you want the result as a percentage).
library(gmodels)
with(temp, CrossTable(cursmoke1, cursmoke2, missing.include=TRUE))
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 4434
| cursmoke2
cursmoke1 | No | Yes | NA | Row Total |
-------------|-----------|-----------|-----------|-----------|
No | 1898 | 131 | 224 | 2253 |
| 541.582 | 635.078 | 4.022 | |
| 0.842 | 0.058 | 0.099 | 0.508 |
| 0.862 | 0.076 | 0.444 | |
| 0.428 | 0.030 | 0.051 | |
-------------|-----------|-----------|-----------|-----------|
Yes | 305 | 1596 | 280 | 2181 |
| 559.461 | 656.043 | 4.154 | |
| 0.140 | 0.732 | 0.128 | 0.492 |
| 0.138 | 0.924 | 0.556 | |
| 0.069 | 0.360 | 0.063 | |
-------------|-----------|-----------|-----------|-----------|
Column Total | 2203 | 1727 | 504 | 4434 |
| 0.497 | 0.389 | 0.114 | |
-------------|-----------|-----------|-----------|-----------|
Alternatively, you can use format="SPSS" if you want the numbers displayed as percentages.
with(temp, CrossTable(cursmoke1, cursmoke2, missing.include=TRUE, format="SPSS"))
Cell Contents
|-------------------------|
| Count |
| Chi-square contribution |
| Row Percent |
| Column Percent |
| Total Percent |
|-------------------------|
Total Observations in Table: 4434
| cursmoke2
cursmoke1 | No | Yes | NA | Row Total |
-------------|-----------|-----------|-----------|-----------|
No | 1898 | 131 | 224 | 2253 |
| 541.582 | 635.078 | 4.022 | |
| 84.243% | 5.814% | 9.942% | 50.812% |
| 86.155% | 7.585% | 44.444% | |
| 42.806% | 2.954% | 5.052% | |
-------------|-----------|-----------|-----------|-----------|
Yes | 305 | 1596 | 280 | 2181 |
| 559.461 | 656.043 | 4.154 | |
| 13.984% | 73.177% | 12.838% | 49.188% |
| 13.845% | 92.415% | 55.556% | |
| 6.879% | 35.995% | 6.315% | |
-------------|-----------|-----------|-----------|-----------|
Column Total | 2203 | 1727 | 504 | 4434 |
| 49.684% | 38.949% | 11.367% | |
-------------|-----------|-----------|-----------|-----------|
Update: prop.table()
Just FYI (to save you the tedious work of building your own data.frame by hand), you may also be interested in the prop.table() function.
Again, using the data you linked to and assuming it is named "temp", the following gives you the underlying data from which you can construct your data.frame. You may also be interested in looking into the functions margin.table() or addmargins():
## Your basic table
CurSmoke <- with(temp, table(cursmoke1, cursmoke2, useNA = "ifany"))
CurSmoke
# cursmoke2
# cursmoke1 No Yes <NA>
# No 1898 131 224
# Yes 305 1596 280
## Row proportions
prop.table(CurSmoke, 1) # * 100 # If you so desire
# cursmoke2
# cursmoke1 No Yes <NA>
# No 0.84243231 0.05814470 0.09942299
# Yes 0.13984411 0.73177442 0.12838148
## Column proportions
prop.table(CurSmoke, 2) # * 100 # If you so desire
# cursmoke2
# cursmoke1 No Yes <NA>
# No 0.86155243 0.07585408 0.44444444
# Yes 0.13844757 0.92414592 0.55555556
## Cell proportions
prop.table(CurSmoke) # * 100 # If you so desire
# cursmoke2
# cursmoke1 No Yes <NA>
# No 0.42805593 0.02954443 0.05051872
# Yes 0.06878665 0.35994587 0.06314840
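prop.table's margin argument generalizes beyond R: the same three normalizations (by row total, by column total, by grand total) can be sketched in a few lines of Python. This uses the 2x2 counts from the table above with the NA column dropped, so the numbers differ slightly from the full output:

```python
# rows: cursmoke1 No/Yes; columns: cursmoke2 No/Yes (NA column omitted)
counts = [[1898, 131], [305, 1596]]
total = sum(sum(row) for row in counts)

# margin = 1 in prop.table: each cell divided by its row total
row_props = [[c / sum(row) for c in row] for row in counts]

# margin = 2: each cell divided by its column total
col_totals = [sum(row[j] for row in counts) for j in range(2)]
col_props = [[row[j] / col_totals[j] for j in range(2)] for row in counts]

# no margin: each cell divided by the grand total, so all cells sum to 1
cell_props = [[c / total for c in row] for row in counts]

print(round(row_props[0][0], 4))  # 1898 / (1898 + 131)
```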
