summary() not showing some level of data in R [duplicate]

I am trying to get a crosstab with percentages from this file using Hmisc. But why is summary() dropping a category ("OTHERS") from the variable OCCUPATION?
library(Hmisc)
summary(ID ~ OCCUPATION, data=df, method="reverse")
Output:
Descriptive Statistics by ID
+--------------------------+--------+--------+
|                          |HUSBAND |SELF    |
|                          |(N=28)  |(N=72)  |
+--------------------------+--------+--------+
|OCCUPATION : SELF EMPLOYED|93% (26)|31% (22)|
+--------------------------+--------+--------+
Compare this to the simple table()
         OCCUPATION
ID        OTHERS SELF EMPLOYED
  HUSBAND      2            26
  SELF        50            22

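This is for the benefit of whoever has faced this peculiar problem. I stumbled across the solution after going over the very, very long documentation that Hmisc has. By default, the print method for method="reverse" summaries removes redundant entries from percentage tables, so for a two-level variable such as OCCUPATION only one level is shown (the other is its complement). The solution is to call print() with the exclude1=FALSE option:
print(summary(ID ~ OCCUPATION, data=df, method="reverse"), exclude1=FALSE)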
Descriptive Statistics by ID
+-------------------+--------+--------+
|                   |HUSBAND |SELF    |
|                   |(N=28)  |(N=72)  |
+-------------------+--------+--------+
|OCCUPATION : OTHERS| 7% ( 2)|69% (50)|
+-------------------+--------+--------+
|    SELF EMPLOYED  |93% (26)|31% (22)|
+-------------------+--------+--------+
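
As a cross-check that does not depend on Hmisc, the row percentages in the full table can be recomputed from the table() counts shown in the question. This is only a sketch in base R; the `counts` matrix is typed in by hand from those counts rather than read from the original file.

```r
# Counts copied by hand from the table() output in the question.
counts <- matrix(
  c(2, 50, 26, 22), nrow = 2,
  dimnames = list(ID = c("HUSBAND", "SELF"),
                  OCCUPATION = c("OTHERS", "SELF EMPLOYED"))
)

# Row percentages: each ID group sums to 100.
pct <- round(100 * prop.table(counts, margin = 1))
print(pct)
```

Because the two columns in each row add up to 100, either one can be derived from the other, which is why the default print drops one of them.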

Related

Converting 1 row instance to a suitable format in R for repeated measures ANOVA

I'm really struggling with how to format my data to a suitable one in R.
At the moment, I have my data in the format of:
ParticipantNo | Sex | Age | IV1(0)_IV2(0)_DV1 | IV1(1)_IV2(0)_DV1 | etc
There are two levels for IV1, and 3 for IV2, so 6 columns per DV.
I've stacked them, so that I compare all IV1 results with each other, and the same for IV2 using a Friedman test.
However, I'd like to compare across groups like Sex and Age, and was told ANOVA is the best for this. I've used ANOVA directly before in SPSS, which accepts this data format.
The problem I have is getting this data into the correct format in R.
As I understand it, it should look like:
1 | M | 40 | IV1(0)_IV2(0)_DV1_Result
1 | M | 40 | IV1(1)_IV2(0)_DV1_Result
1 | M | 40 | IV1(0)_IV2(1)_DV1_Result
1 | M | 40 | IV1(1)_IV2(1)_DV1_Result
1 | M | 40 | IV1(0)_IV2(2)_DV1_Result
1 | M | 40 | IV1(1)_IV2(2)_DV1_Result
Then I can do
aov(sex~DV1_result, data=data)
Does this seem like the correct thing to do, and if so, how can I convert from the format I have to the one I need in R?
Figured it out!
I used stack on my data, and then separate from tidyr (i.e. s = separate(stack(data), "ind", c("IV1", "IV2"))).
Then I could do the ANOVA by aov(values ~ IV1 * IV2, data = s)
Hope this helps someone!
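
A minimal sketch of that reshaping, using base R only: strsplit() stands in here for tidyr's separate(), and the `wide` data frame with column names like "0_0" is a made-up example, not the asker's data. Note that stack() drops the participant identifier, so this fits an ordinary two-way ANOVA as in the answer above.

```r
# Hypothetical wide data: one row per participant, one column per IV1_IV2 cell.
wide <- data.frame(
  `0_0` = c(5.1, 4.8, 6.0),
  `1_0` = c(5.9, 5.2, 6.4),
  `0_1` = c(4.7, 4.9, 5.8),
  `1_1` = c(6.1, 5.5, 6.9),
  check.names = FALSE
)

# stack() returns two columns: values, and ind (the original column name).
long <- stack(wide)

# Split ind into the two factor codes (base R stand-in for tidyr::separate).
parts <- do.call(rbind, strsplit(as.character(long$ind), "_", fixed = TRUE))
long$IV1 <- factor(parts[, 1])
long$IV2 <- factor(parts[, 2])

fit <- aov(values ~ IV1 * IV2, data = long)
print(summary(fit))
```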

Formatting Categorical Variables for a linear regression

I am trying to build a linear regression model in R. I am working on converting a categorical variable into a numeric one for consumption by the model. I want to convert the name of a procedure to a number and have the following line of code to do so. It appears to be working successfully. I am also using the car package.
res$Procedure <- recode(res$Procedure, "'Primary Knee'='1'; 'Primary Hip'='2'; 'Revision Knee'='3'; 'Revision Knee'='4';
'Partial Knee'='5'; 'Revision Hip'='6'; 'Partial knee'='7'; 'Bilateral Hip'='8';
'Bilateral knee'='9'; 'Bilateral Knee'='9'; 'Resurfacing Hip'='10';'Resurfacing Hip '='10'; 'Revision knee'='3'")
I am then running the model -
lg1 = glm(BloodTransfusions~ Age+Hospital+Procedure+LenthOfStay,
family=binomial(link=probit), data=res)
Then I am looking at the results of my model, and this is where things look a little odd.
summary(lg1)
| Variable | P-Values |
| Age | |
|Hospital | |
|Procedure1 | |
|Procedure2 | |
|Procedure3 | |
Basically, the model is treating each value of the categorical variable that I converted to numbers as a distinct variable rather than as a single continuous one. Does anyone have any suggestions, or am I going about this the wrong way? I appreciate the help!
You can dummify your dataframe. This will create a binary variable out of every level of categorical variables.
library("dummy")
res.dummy <- dummy(res)
Then use res.dummy in glm.
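
For comparison, here is a sketch of the same idea in base R. It assumes a small made-up data frame `res_demo` in place of the asker's `res`; model.matrix() is the base R function that expands a factor into 0/1 indicator columns, which is also what glm() does internally for factor or character predictors.

```r
# Hypothetical stand-in for the asker's data frame.
res_demo <- data.frame(
  Age = c(61, 70, 55, 68),
  Procedure = factor(c("Primary Knee", "Primary Hip",
                       "Revision Knee", "Primary Hip"))
)

# "- 1" drops the intercept so that every level gets its own column.
dummies <- model.matrix(~ Procedure - 1, data = res_demo)

# Combine the numeric predictor with the indicator columns.
res_dummy <- cbind(res_demo["Age"], dummies)
print(res_dummy)
```

Each row has exactly one indicator set to 1, so the separate Procedure coefficients seen in summary() are the expected result of treating the variable as categorical.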

Apply a formula through a function in many columns with different column names of data frame in R

I want to use the pb2gen function from the WRS2 package. It runs a robust t-test against your data, and here is its documentation:
pb2gen(formula, data, est = "mom")
formula  an object of class formula.
data     an optional data frame for the input data.
est      estimate to be used for the group comparisons: either "onestep" for one-step M-estimator of location using Huber's Psi, "mom" for the modified one-step (MOM) estimator of location based on Huber's Psi, or "median", "mean".
Anyway. The thing is that I'm trying to apply this function on 5 columns of a data frame.
The data frame seems like this
| Ge/treat | Control_1 | Control_2 | Cancer_1 | Cancer_2 | Cancer_3 |
|----------|:-------------:|----------:|----------:|---------:|---------:|
| gene1 | 2.65 | 3.01 | 2.20 | 3.65 | 4.01 |
and I want to run the t-test between Controls and Cancer.
The formula I want to apply somehow is Control_{1,2} ~ Cancer_{1,2,3}.
I mean I want it to take into account both Control columns and all of the Cancer ones.
So far I can only run the t-test for the first Control column and the first Cancer column, by running pb2gen(Control_1 ~ Cancer_1, data= data, est="mom").
I'm wondering if it is possible to run the same command while also including the other columns. Any idea or hint is welcome.
Thank you.
EDIT:
I also tried
pb2gen(c(Control_1,Control_2) ~ c(Cancer_1,Cancer_2,Cancer_3) , data = data, est="mom")
but got
Error in model.frame.default(formula, data) : variable lengths differ
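
One possible direction, offered as a sketch rather than a confirmed answer: the error arises because the two sides of the formula have different lengths (2 values versus 3). Assuming pb2gen expects a formula of the form response ~ grouping factor, the columns could be stacked into one value column plus a group label. The `wide` data frame below is typed in from the single row shown above, and the names `long` and `group` are made up for illustration.

```r
# Hypothetical one-row data frame mirroring the table in the question.
wide <- data.frame(
  Control_1 = 2.65, Control_2 = 3.01,
  Cancer_1 = 2.20, Cancer_2 = 3.65, Cancer_3 = 4.01
)

# One row per measurement; ind holds the original column name.
long <- stack(wide)

# The group label is the column name without its numeric suffix.
long$group <- factor(sub("_[0-9]+$", "", as.character(long$ind)))
print(long)

# With WRS2 installed, the comparison would then be attempted as (not run here):
# WRS2::pb2gen(values ~ group, data = long, est = "mom")
```

With only two and three values per group, as in this single row, a bootstrap comparison is unlikely to be informative, so this mainly illustrates the data layout.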

Is it possible to combine separate boxplot summaries into one and create the combined graph?

I am working with rather large datasets (approx. 4 million rows per month with 25 numeric attributes and 4 factor attributes). I would like to create a graph that contains, per month (for the last 36 months), a boxplot for each numeric attribute per product (one of the 4 factor attributes).
So as an example for product A:
[ASCII sketch: one boxplot per month along the x-axis, from jan '10, feb '10, mar '10 through feb '13]
But since these are quite large datasets I will be working with I would like some advice to get started on how to approach. My idea (but I am not sure if this is possible) is to
a) extract the data per month per product
b) create a boxplot for that specific month (so let's say jan'10 for product A)
c) store the boxplot summary data somewhere
d) repeat a-c for all months until feb '13
e) combine all the stored boxplot summary data into one
f) plot the combined boxplot
g) repeat a-f for all other products
So my main question is: is it possible to combine separate boxplot summaries into one and create the combined graph as sketched above from this?
Any help would be appreciated,
Thank you
Here's a long-hand example that you can probably cook something up around:
Read in the individual datasets - you might want to overwrite the same data or wrap this step in a function given the large data you are using.
dset1 <- 1:10
dset2 <- 10:20
dset3 <- 20:30
Store some boxplot info, notice the plot=FALSE
result1 <- boxplot(dset1,plot=FALSE,names="month1")
result2 <- boxplot(dset2,plot=FALSE,names="month2")
result3 <- boxplot(dset3,plot=FALSE,names="month3")
Group up the data and plot with bxp
mylist <- list(result1, result2, result3)
groupbxp <- do.call(mapply, c(cbind, mylist))
bxp(groupbxp)
Result: [figure not preserved: the three boxplots drawn side by side]
You will not be able to predict with absolute precision what the "fivenum" values will be for the combined set of values. Think about the situation with two groups for which you have the 75th percentile of each group and the count of observations in each group. Suppose the percentiles are unequal. You cannot just take the weighted mean of the percentiles to get the 75th percentile of the aggregated values. See the help page ?boxplot.stats. I would think, however, that you might come very close by using the median values of the fivenum collections. This might be a place to start your examinations.
# split the values by month, then collect fivenum() and the count as one row per month
mo.mtx <- t(sapply(split(dat$values, dat$month), function(mo.dat) c(fivenum(mo.dat), length(mo.dat))))
matplot(mo.mtx[, 1:5], type="l")
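
The point about percentiles can be illustrated with a small made-up example. The vectors `a` and `b` below are hypothetical groups chosen so that the count-weighted mean of the two 75th percentiles clearly differs from the 75th percentile of the pooled values.

```r
a <- c(1, 2, 3, 4, 100)     # hypothetical group A
b <- c(10, 11, 12, 13, 14)  # hypothetical group B

# 75th percentile within each group.
q_a <- quantile(a, 0.75)
q_b <- quantile(b, 0.75)

# Count-weighted mean of the two group percentiles.
weighted <- unname((length(a) * q_a + length(b) * q_b) / (length(a) + length(b)))

# 75th percentile of the pooled values.
pooled <- unname(quantile(c(a, b), 0.75))

print(c(weighted = weighted, pooled = pooled))
```

The two numbers disagree, so stored per-month summaries can describe each month exactly but cannot be merged into an exact summary of several months combined.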
