Formatting Categorical Variables for a linear regression

I am trying to build a linear regression model in R and am converting a categorical variable into numbers for the model to consume. I want to map each procedure name to a number and use the following line of code to do so; it appears to work. I am also using the car package.
res$Procedure <- recode(res$Procedure, "'Primary Knee'='1'; 'Primary Hip'='2'; 'Revision Knee'='3'; 'Revision Knee'='4';
'Partial Knee'='5'; 'Revision Hip'='6'; 'Partial knee'='7'; 'Bilateral Hip'='8';
'Bilateral knee'='9'; 'Bilateral Knee'='9'; 'Resurfacing Hip'='10';'Resurfacing Hip '='10'; 'Revision knee'='3'")
I am then running the model:
lg1 = glm(BloodTransfusions~ Age+Hospital+Procedure+LenthOfStay,
family=binomial(link=probit), data=res)
Then I look at the results of my model, and this is where things look a little odd.
summary(lg1)
| Variable | P-Values |
| Age | |
|Hospital | |
|Procedure1 | |
|Procedure2 | |
|Procedure3 | |
Basically, the model is treating each level of the categorical variable that I recoded to numbers as a separate variable, rather than treating Procedure as a single continuous one. Does anyone have any suggestions, or am I going about this the wrong way? I appreciate the help!

You can dummify your data frame. This creates a binary (0/1) variable for every level of each categorical variable.
library("dummy")
res.dummy <- dummy(res)
Then use res.dummy in glm.
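As an aside, glm already expands a factor into indicator columns by itself, which is why the summary lists Procedure1, Procedure2 and so on instead of a single Procedure row. Here is a minimal sketch of inspecting that expansion with model.matrix. The data frame res_demo and its values are made up for illustration and are not the asker's data; the tolower/trimws step is just one assumed way to merge labels that differ only in case or trailing spaces.

```r
# Hypothetical toy data standing in for the asker's `res` (values are invented).
res_demo <- data.frame(
  Age = c(61, 70, 55, 66, 72, 58),
  Procedure = c("Primary Knee", "Primary Hip", "Revision Knee",
                "primary knee ", "Primary Hip", "Revision knee"),
  stringsAsFactors = FALSE
)

# Normalise case and stray whitespace so variant spellings share one level,
# then store the result as a factor rather than as numbers.
res_demo$Procedure <- factor(tolower(trimws(res_demo$Procedure)))

# model.matrix() shows the indicator (dummy) columns a model formula generates:
# one column per level except the first, which acts as the reference level.
X <- model.matrix(~ Age + Procedure, data = res_demo)
print(colnames(X))
```

Each Procedure column is compared against the reference level, so one row per level in the summary is the expected behaviour for a categorical predictor and not a sign that the recoding failed.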

Related

Converting 1 row instance to a suitable format in R for repeated measures ANOVA

I'm really struggling with how to format my data to a suitable one in R.
At the moment, I have my data in the format of:
ParticipantNo | Sex | Age | IV1(0)_IV2(0)_DV1 | IV1(1)_IV2(0)_DV1 | etc
There are two levels for IV1, and 3 for IV2, so 6 columns per DV.
I've stacked them, so that I compare all IV1 results with each other, and the same for IV2 using a Friedman test.
However, I'd like to compare across groups like Sex and Age, and was told ANOVA is the best for this. I've used ANOVA directly before in SPSS, which accepts this data format.
The problem I have is getting this data into the correct format in R.
As I understand it, it should look like:
1 | M | 40 | IV1(0)_IV2(0)_DV1_Result
1 | M | 40 | IV1(1)_IV2(0)_DV1_Result
1 | M | 40 | IV1(0)_IV2(1)_DV1_Result
1 | M | 40 | IV1(1)_IV2(1)_DV1_Result
1 | M | 40 | IV1(0)_IV2(2)_DV1_Result
1 | M | 40 | IV1(1)_IV2(2)_DV1_Result
Then I can do
aov(sex~DV1_result, data=data)
Does this seem like the correct thing to do, and if so, how can I convert from the format I have to the one I need in R?

Figured it out!
I used stack on my data and then separate from tidyr, i.e. s = separate(stack(data), "ind", c("IV1", "IV2")).
Then I could do the ANOVA by aov(values ~ IV1 * IV2, data = s)
Hope this helps someone!
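For anyone who wants to see the reshaping step in isolation, here is a rough base-R sketch of the same idea. It swaps tidyr's separate for strsplit so that it runs without extra packages. The column names (A0_B0 and so on) and the values are hypothetical, and it assumes the two factor labels in each column name are joined by an underscore.

```r
# Hypothetical wide data: one column per IV1/IV2 combination (values invented).
wide <- data.frame(
  A0_B0 = c(1.2, 2.3),
  A1_B0 = c(2.1, 3.0),
  A0_B1 = c(1.8, 2.2)
)

# stack() turns the columns into two columns: `values` and `ind` (the old column name).
long <- stack(wide)

# Split each old column name at the underscore to recover the two factors.
parts <- do.call(rbind, strsplit(as.character(long$ind), "_", fixed = TRUE))
long$IV1 <- factor(parts[, 1])
long$IV2 <- factor(parts[, 2])

print(long)
```

The resulting long data frame has one row per measurement, which is the shape that aov(values ~ IV1 * IV2, data = long) expects.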

ifelse Statement in R Not Working - Incorrect Column Result

I have a df:
Year | Stage | Home.Team.Name | Home.Team.Goals | Away.Team.Name | Away.Team.Goals
1998 | Group A | Brazil | 2 | Scotland | 1
and so on.
What I'm trying to do is create a new column based on the result of each game, so that the winner's name appears in the new column. The code I currently have is:
RecentWorldCups$Game.Winner <- ifelse(RecentWorldCups$Home.Team.Goals>RecentWorldCups$Away.Team.Goals, RecentWorldCups$Home.Team.Name,
ifelse(RecentWorldCups$Away.Team.Goals>RecentWorldCups$Home.Team.Goals, RecentWorldCups$Away.Team.Name,
"Draw"))
The result of this is that it gives me a number (perhaps a factor number?) instead of the name of the team.
Anyone able to help?
Cheers

You need to extract the character level value from your factor columns. Try this:
df <- RecentWorldCups # for readability of your code
df$Game.Winner <- ifelse(df$Home.Team.Goals > df$Away.Team.Goals,
levels(df$Home.Team.Name)[df$Home.Team.Name],
ifelse(df$Away.Team.Goals > df$Home.Team.Goals,
levels(df$Away.Team.Name)[df$Away.Team.Name],
"Draw")
)
If you find it cumbersome to do these factor conversions, then one workaround would be to create your data frame with all strings set to not be factors, e.g. something like this:
RecentWorldCups <- data.frame(Home.Team.Goals=c(...), ..., stringsAsFactors=FALSE)
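An equivalent and slightly shorter route is to convert the factor columns with as.character before calling ifelse. The sketch below uses a small made-up data frame called games (the rows are illustrative, not real results) with the same column names as in the question.

```r
# Hypothetical stand-in for RecentWorldCups; team columns are deliberately factors.
games <- data.frame(
  Home.Team.Name  = factor(c("Brazil", "Norway", "Italy")),
  Home.Team.Goals = c(2, 1, 0),
  Away.Team.Name  = factor(c("Scotland", "Brazil", "Chile")),
  Away.Team.Goals = c(1, 1, 2)
)

# Convert factors to plain strings so ifelse() returns names, not integer codes.
home <- as.character(games$Home.Team.Name)
away <- as.character(games$Away.Team.Name)

games$Game.Winner <- ifelse(games$Home.Team.Goals > games$Away.Team.Goals, home,
                     ifelse(games$Away.Team.Goals > games$Home.Team.Goals, away,
                            "Draw"))
print(games$Game.Winner)
```

Note that from R 4.0.0 onwards data.frame() and read.csv() default to stringsAsFactors = FALSE, so the problem mostly shows up with older R versions or with columns that were explicitly made factors.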

Apply a formula through a function in many columns with different column names of data frame in R

I want to use the pb2gen function from the WRS2 package. It runs a robust t-test on your data; here is its documentation:
pb2gen(formula, data, est = "mom")
formula: an object of class formula.
data: an optional data frame for the input data.
est: estimate to be used for the group comparisons: either "onestep" for one-step M-estimator of location using Huber's Psi, "mom" for the modified one-step (MOM) estimator of location based on Huber's Psi, or "median", "mean".
I'm trying to apply this function to 5 columns of a data frame.
The data frame looks like this:
| Ge/treat | Control_1 | Control_2 | Cancer_1 | Cancer_2 | Cancer_3 |
|----------|:-------------:|----------:|----------:|---------:|---------:|
| gene1 | 2.65 | 3.01 | 2.20 | 3.65 | 4.01 |
and I want to run the t-test between the Control and Cancer groups.
The formula I want to apply is something like Control_{1,2} ~ Cancer_{1,2,3}.
I mean I want it to take into account both Control columns and all of the Cancer ones.
So far I can only run the t-test on the first Control and Cancer columns, by running pb2gen(Control_1 ~ Cancer_1, data= data, est="mom").
I'm wondering whether it is possible to run the same command with the other columns included as well. Any idea or hint is welcome.
Thank you.
EDIT:
I also tried
pb2gen(c(Control_1,Control_2) ~ c(Cancer_1,Cancer_2,Cancer_3) , data = data, est="mom")
but got
Error in model.frame.default(formula, data) : variable lengths differ
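One possible direction, offered as a sketch and not a tested answer: formula interfaces of this kind usually expect a single response column and a single grouping column, so the wide row could first be reshaped to long format. The code below does that reshaping in base R for the gene1 row shown above. The names gene1, long and group are introduced here for illustration, and the pb2gen call is left as a comment because it assumes the response ~ group form and was not run.

```r
# The single row from the question, rebuilt as a data frame.
gene1 <- data.frame(Control_1 = 2.65, Control_2 = 3.01,
                    Cancer_1 = 2.20, Cancer_2 = 3.65, Cancer_3 = 4.01)

# One row per measurement: `values` holds the number, `ind` the old column name.
long <- stack(gene1)

# Drop the "_1", "_2", ... suffix to get a two-level grouping factor.
long$group <- factor(sub("_.*$", "", as.character(long$ind)))
print(long)

# Assumed call shape (not executed here): response on the left, group on the right.
# WRS2::pb2gen(values ~ group, data = long, est = "mom")
```

With only two control values and three cancer values per gene, any bootstrap-based test will have very little to work with, so the result should be read with caution.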

Summary() not showing some level of data in R [duplicate]

I am trying to get a crosstab with percentages from this file using Hmisc. But why is summary() dropping a category ("OTHERS") from the variable OCCUPATION?
library(Hmisc)
summary(ID ~ OCCUPATION, data=df, method="reverse")
Output:
Descriptive Statistics by ID
+--------------------------+--------+--------+
|                          |HUSBAND |SELF    |
|                          |(N=28)  |(N=72)  |
+--------------------------+--------+--------+
|OCCUPATION : SELF EMPLOYED|93% (26)|31% (22)|
+--------------------------+--------+--------+
Compare this to the simple table()
         OCCUPATION
ID        OTHERS SELF EMPLOYED
  HUSBAND      2            26
  SELF        50            22

This is for the benefit of anyone who has faced this peculiar problem. I stumbled across the solution after going through Hmisc's very long documentation. The solution is to call print() with the exclude1=FALSE option:
print(summary(ID ~ OCCUPATION, data=df, method="reverse"), exclude1=FALSE)
Descriptive Statistics by ID
+-------------------+--------+--------+
|                   |HUSBAND |SELF    |
|                   |(N=28)  |(N=72)  |
+-------------------+--------+--------+
|OCCUPATION : OTHERS| 7% ( 2)|69% (50)|
+-------------------+--------+--------+
|    SELF EMPLOYED  |93% (26)|31% (22)|
+-------------------+--------+--------+
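If you want to cross-check those percentages without Hmisc, a base-R sketch along these lines should reproduce them. The data frame df_demo is rebuilt here from the counts in the table() output above, so it is a stand-in for the original file and not the file itself.

```r
# Rebuild a data frame with the same cell counts as the table() output:
# HUSBAND: 2 OTHERS, 26 SELF EMPLOYED; SELF: 50 OTHERS, 22 SELF EMPLOYED.
df_demo <- data.frame(
  ID = rep(c("HUSBAND", "SELF"), times = c(28, 72)),
  OCCUPATION = c(rep(c("OTHERS", "SELF EMPLOYED"), times = c(2, 26)),
                 rep(c("OTHERS", "SELF EMPLOYED"), times = c(50, 22)))
)

# Counts with OCCUPATION in rows and ID in columns.
counts <- table(df_demo$OCCUPATION, df_demo$ID)

# Column percentages: each ID group sums to 100.
pct <- round(100 * prop.table(counts, margin = 2))
print(pct)
```

Both levels of OCCUPATION appear here because nothing is excluded; the Hmisc printout hides the first level of a two-level variable unless exclude1=FALSE is given.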

Prediction analysis (Time series model) in UNIX

I know this is not a code-level question, but I wanted your views.
I need to perform “Prediction Analysis” at the UNIX level using a time series model (like ARIMA).
We have implemented the same thing in R, but my work environment does not support R.
Data snapshot
Year | Month| Data1| Data2 | Data3
2012 | Jan | 1 |1 |3
2012 | Feb | 2 |21 | 4
So I want to implement an algorithm that will help me find the predicted values for future months.
Is there any other way of implementing “Time series Prediction Analysis” in UNIX (preferably Perl/Shell)?

Since you are interested in Perl and statistics, I'm sure you are aware of PDL. There are some specific time-series statistics modules available and, since Perl is involved, other CPAN modules can be used as well.
R is still king and has a lot of packages to choose from, and luckily R and Perl play nicely together through Statistics::R. I've not tried using Statistics::R from the PDL shell, but this too may be possible to some extent.
Here's a pdl example session using MVA:
/home/zombiepl % pdl
pdl> use Statistics::MVA::MultipleRegression;
pdl> $lol = [ [qw/745 36 66/],
[qw/895 37 68/],
[qw/442 47 64/],
[qw/440 32 53/],
[qw/1598 1 101/],];
pdl> linear_regression($lol);
The coefficients are: B[0] = -281.426985090045, B[1] = -7.61102966577879,
B[2] = 19.0102910918022.
R^2 is 0.943907302962818
Cheers and good luck with your project.
