I'm starting to learn R and I'm playing around with the Titanic dataset on Kaggle. When I fit a logistic regression on survival, shouldn't the predictions be 0 and 1?
PassengerId Survived Pclass Sex Age SibSp Parch Fare Embarked Title FamSize FamSizeD AgeRange
1 0 3 male 22 1 0 7.25 S Mr 2 small 21-30
2 1 1 female 38 1 0 71.2833 C Mrs 2 small 31-40
3 1 3 female 26 0 0 7.925 S Miss 1 single 21-30
4 1 1 female 35 1 0 53.1 S Mrs 2 small 31-40
5 0 3 male 35 0 0 8.05 S Mr 1 single 31-40
6 0 3 male 28 0 0 8.4583 Q Mr 1 single 21-30
7 0 1 male 54 0 0 51.8625 S Mr 1 single 51-60
8 0 3 male 2 3 1 21.075 S Master 5 large 0-20
9 1 3 female 27 0 2 11.1333 S Mrs 3 small 21-30
10 1 2 female 14 1 0 30.0708 C Mrs 2 small 0-20
logistix <- glm(Survived ~ Pclass + Sex + Embarked + Title + FamSizeD + AgeRange,
                data = train,
                family = 'binomial')
Prediction <- predict(logistix, test, type = "response")
Prediction
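With type = "response", predict() on a binomial glm returns fitted probabilities between 0 and 1 rather than class labels. Below is a minimal sketch of turning those probabilities into 0/1 predictions. It uses the built-in mtcars data as a stand-in for the Titanic data, and the 0.5 cutoff is an assumption, not something stated in the question:

```r
# Stand-in data: mtcars ships with R; `am` (0/1) plays the role of Survived.
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# type = "response" gives the fitted P(am == 1) for each row, a number in [0, 1].
prob <- predict(fit, newdata = mtcars, type = "response")

# Convert probabilities to class labels with an assumed 0.5 cutoff.
pred_class <- ifelse(prob > 0.5, 1, 0)

head(data.frame(prob = round(prob, 3), pred_class))
```

The same two steps should carry over to the Titanic model: keep type = "response", then apply a cutoff to Prediction.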
I'm fitting a latent variable model (IRT) to the BES2015 data.
BES2015$fscore <- as.vector(fscores(result, method = "EAP"))
ID age male vote voteyesno closing deposit righttovote
1 10102 45 Male Labour Voted 1 0 1
2 10104 49 Male Labour Voted 1 1 1
3 10107 73 Female Did not vote Did not vote 1 1 1
5 10202 63 Male Liberal Democrat Voted 1 1 1
6 10203 81 Female Liberal Democrat Voted 1 1 1
7 10205 18 Female Did not vote Did not vote NA NA 0
libdempr committees fscore
1 1 1 -0.3924927
2 1 1 0.5007364
3 NA 0 -0.1961401
5 1 1 0.5007364
6 0 1 -0.5804618
7 NA NA -0.1704896
t.test(fscore ~ age + male + vote, data = BES2015, var.equal=T)
Error in t.test.formula(fscore ~ age + male + vote, data = BES2015, var.equal = T) :
'formula' missing or incorrect
I've tried removing NA values using the code below, but I still receive the same error:
BES2015na <- na.omit(BES2015)
t.test(fscore ~ age + male + vote, data = BES2015na, var.equal=T)
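The formula method of t.test() accepts exactly one term on the right-hand side, and for a two-sample test that term must be a grouping variable with two levels, so `age + male + vote` is rejected regardless of NA handling. Below is a minimal sketch of two alternatives. The data frame is made up (only the column names follow the question), and using lm() for several predictors is a suggestion, not what the original analysis necessarily intended:

```r
set.seed(1)

# Made-up stand-in for BES2015; the values are random, not survey data.
dat <- data.frame(
  fscore = rnorm(40),
  age    = sample(18:80, 40, replace = TRUE),
  male   = factor(rep(c("Male", "Female"), each = 20))
)

# Two-sample t-test: exactly one grouping variable with two levels.
tt <- t.test(fscore ~ male, data = dat, var.equal = TRUE)
tt$p.value

# Several predictors at once: a linear model rather than a t-test.
fit <- lm(fscore ~ age + male, data = dat)
summary(fit)$coefficients
```

A variable such as vote, which has more than two levels, would also need lm() or aov() rather than t.test().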
I have a dataset with 50 columns and over 100,000 observations. I want to determine which columns have no variability (i.e., all rows contain the same value), print the names of the columns, and then remove those columns.
I tried using this code:
names(Filter(function(x) length(unique(x)) != 0, df))
but I am not sure it does what I want. I think it lists the columns with unique values, meaning no identical values?
Here is a sample from the data I am using:
135 24437208 1 2 1 Cardiology ? <30 8 8 None No No Ch No No Yes No No No No No Down No Steady None Steady No No No No No No No No No No 77 33 6 0 0 0 [50-60) Female Caucasian ? 401.00 997.00 560.00
2 135 26264286 7 1 1 Surgery-Cardiovascular/Thoracic ? >30 3 5 None No No Ch No No Yes No No No No No Steady No No None Steady No No No No No No No No No No 31 14 1 0 1 0 [50-60) Female Caucasian ? 998.00 41.00 250.00
3 378 29758806 1 3 1 Surgery-Neuro ? NO 2 3 None No No No No No No No No No No No No No No None No No No No No No No No No No No 49 11 1 0 0 0 [50-60) Female Caucasian ? 722.00 305.00 250.00
4 729 189899286 7 1 3 InternalMedicine MC NO 4 9 >7 No No No No No Yes No No No No No No No No None Steady No No No No No No No No No No 68 23 2 0 0 0 [80-90) Female Caucasian ? 820.00 493.00 880.00
5 774 64331490 7 1 1 InternalMedicine ? NO 3 9 >8 No No Ch No No Yes No No No No No Steady No No None Steady No No No No No No No No No No 46 20 0 0 0 0 [80-90) Female Caucasian ? 274.00 427.00 416.00
6 927 14824206 7 1 1 InternalMedicine ? NO 5 3 None No No No No No Yes No Steady No No No No No No None No No No No No No No No No No No 49 5 0 0 0 0 [30-40) Female AfricanAmerican ? 590.00 220.00 250.00
7 1152 8380170 7 1 1 Hematology/Oncology ? >30 6 2 None No No No No No Yes No No No No No No No Steady None No No No No No No No No No No No 43 13 2 0 1 0 [50-60) Female AfricanAmerican ? 282.00 250.01 NA
8 1152 30180318 7 1 1 Hematology/Oncology ? >30 6 6 None No No Ch No No Yes No No No No No No No Down None No No No No No No No No No No No 45 15 4 0 2 0 [50-60) Female AfricanAmerican ? 282.00 794.00 250.00
9 1152 55533660 7
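The test `length(unique(x)) != 0` is TRUE for every non-empty column, so the Filter() call above returns all column names. Below is a minimal sketch of the intended check on a small made-up data frame (the column names are hypothetical). It treats a column as constant when it has exactly one distinct value, and NA counts as a value here:

```r
# Made-up example data; `flag` and `status` never vary.
df <- data.frame(
  id     = 1:5,
  flag   = rep("No", 5),
  dose   = c(1, 2, 2, 3, 1),
  status = rep(0, 5)
)

# TRUE for columns whose rows all hold the same value.
is_constant <- vapply(df, function(x) length(unique(x)) == 1, logical(1))

# Print the names of the constant columns.
constant_cols <- names(df)[is_constant]
print(constant_cols)

# Drop them; drop = FALSE keeps a data frame even if one column remains.
df_reduced <- df[, !is_constant, drop = FALSE]
```

vapply() visits each column once, so this should be workable for 50 columns and 100,000 rows.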
I'm looking to plot the results of a support vector classification in which the predictor variables are categorical (Gender and School). School can be 1 of 20, Gender 1 of 2. The response variable is Attended school (y/n).
The attendance.df is as follows:
gender school attendance
1 male 1 1
2 male 2 0
3 male 1 1
4 female 2 1
5 male 1 1
6 female 2 0
7 female 3 1
8 male 4 0
9 female 5 1
10 female 6 1
11 female 7 1
12 male 8 1
13 male 9 1
14 male 10 1
15 male 10 1
16 male 11 0
17 male 12 1
18 female 13 1
19 male 14 1
20 female 15 0
21 female 16 1
22 male 17 0
23 female 18 1
24 female 19 0
25 female 4 1
26 male 5 1
27 male 5 1
28 male 20 0
The code for the SVM and plot is:
# Builds linear SVM model
svm.response = tune(svm,attendance~ ., data=attendance.df, kernel="linear", ranges=list(cost=c(0.001,0.01,0.1,1,10,100,1000)))
svm.response$best.parameters # Identifies best parameters to rebuild model below
svm.response = svm(attendance ~ ., data=attendance.df, kernel="linear", cost=10)
# Plots SVM classification plot
plot(svm.response, attendance.df, fill=TRUE)
I want the predictor variables (gender and school) on the x and y axes, with the linear SVM boundary separating attendance, shown as a black dot for yes and a red dot for no.
With the plot() call above, the error "‘min’ not meaningful for factors" appears because all variables are factors. I'm unsure how to plot this.
Note: It may help to expand the df; I deliberately shortened it in order to post.
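The plot method for svm objects expects numeric predictors, which is why it fails on factors. One possible workaround, sketched below, is to predict the class for every gender/school combination and draw that grid with base graphics, using the factors' integer codes as coordinates. This swaps in a hand-drawn plot for plot(svm.response, ...); it uses only the first ten rows of attendance.df, and the tile colours are arbitrary choices:

```r
library(e1071)  # provides svm(), as used in the question

# First ten rows of attendance.df from the question, all columns as factors.
attendance.df <- data.frame(
  gender     = factor(c("male", "male", "male", "female", "male",
                        "female", "female", "male", "female", "female")),
  school     = factor(c(1, 2, 1, 2, 1, 2, 3, 4, 5, 6)),
  attendance = factor(c(1, 0, 1, 1, 1, 0, 1, 0, 1, 1))
)

fit <- svm(attendance ~ ., data = attendance.df, kernel = "linear", cost = 10)

# Every gender x school combination, with the same factor levels as the data.
grid <- expand.grid(
  gender = factor(levels(attendance.df$gender), levels = levels(attendance.df$gender)),
  school = factor(levels(attendance.df$school), levels = levels(attendance.df$school))
)
grid$predicted <- predict(fit, newdata = grid)

# Background tiles show the predicted class for each combination.
plot(as.integer(grid$school), as.integer(grid$gender),
     col = ifelse(grid$predicted == "1", "grey80", "mistyrose"),
     pch = 15, cex = 3, xlab = "school", ylab = "gender", yaxt = "n")
axis(2, at = seq_along(levels(grid$gender)), labels = levels(grid$gender))

# Observed points on top: black for attended (1), red for not attended (0).
points(jitter(as.integer(attendance.df$school)),
       jitter(as.integer(attendance.df$gender)),
       col = ifelse(attendance.df$attendance == "1", "black", "red"), pch = 19)
```

With two categorical predictors there is no continuous line to draw, so the tile colours stand in for the decision boundary.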
I am attempting to use Random Forest. The training data has 7000 observations with 12 variables. These variables include both categorical and continuous variables. When I submit the code I receive the following warning:
Warning message: In randomForest.default(m, y, ...) : The response has five or fewer unique values. Are you sure you want to do regression?
The data is structured as such:
CustomerId CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited
15634602 619 France Female 42 2 0 1 1 1 101348.88 1
15647311 608 Spain Female 41 1 83807.86 1 0 1 112542.58 0
15619304 502 France Female 42 8 159660.8 3 1 0 113931.57 1
15701354 699 France Female 39 1 0 2 0 0 93826.63 0
15737888 850 Spain Female 43 2 125510.82 1 1 1 79084.1 0
15574012 645 Spain Male 44 8 113755.78 2 1 0 149756.71 1
15592531 822 France Male 50 7 0 2 1 1 10062.8 0
15656148 376 Germany Female 29 4 115046.74 4 1 0 119346.88 1
15792365 501 France Male 44 4 142051.07 2 0 1 74940.5 0
Based on research, I have attempted to change variables to factors, but this has not corrected the issue.
The random forest model code that I am using is as follows:
rfModel=randomForest(Exited~.,data=train)
I have been unable to proceed past the warning to this point.
Try converting the outcome variable to a factor. For example, if the outcome variable in train is y, run this before fitting the model:
train$y <- as.factor(train$y)
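Applied to the data in the question, the outcome column is Exited. Below is a minimal sketch using three columns from the first six rows shown above; the real train has more columns and rows. The model fit is wrapped in a requireNamespace() check only so the snippet also runs where randomForest is not installed:

```r
# Small stand-in for `train`, taken from the first six rows in the question.
train <- data.frame(
  CreditScore = c(619, 608, 502, 699, 850, 645),
  Age         = c(42, 41, 42, 39, 43, 44),
  Exited      = c(1, 0, 1, 0, 0, 1)
)

# A numeric 0/1 response makes randomForest() attempt regression.
is.numeric(train$Exited)

# Converting the response to a factor requests classification instead.
train$Exited <- as.factor(train$Exited)
is.factor(train$Exited)

if (requireNamespace("randomForest", quietly = TRUE)) {
  rfModel <- randomForest::randomForest(Exited ~ ., data = train)
  print(rfModel$type)  # model type reported by the fitted object
}
```

The conversion has to happen on the response itself; converting only the predictors to factors leaves the warning in place.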
Imagine you have the following data set:
df<-data.frame(read.table(header = TRUE, text = "
ID Wine Beer Water Age Gender
1 0 1 0 20 Male
2 1 0 1 38 Female
3 0 0 1 32 Female
4 1 0 1 30 Male
5 1 1 1 30 Male
6 1 1 1 26 Female
7 0 1 1 36 Female
8 0 1 1 29 Male
9 0 1 1 33 Female
10 0 1 1 20 Female"))
Further, imagine you want to compile summary tables that print out the frequencies of those that drink wine, beer, water.
I solved it this way:
con<-apply(df[,c(2:4)], 2, table)
con_P<-prop.table(con,2)
This allows me to complete my ultimate goal of compiling a bar chart in the way I want it:
barplot(con_P)
It works perfectly. No problem. Now, let us tweak the data set as follows: We set all entries for water to 1.
df<-data.frame(read.table(header = TRUE, text = "
ID Wine Beer Water Age Gender
1 0 1 1 20 Male
2 1 0 1 38 Female
3 0 0 1 32 Female
4 1 0 1 30 Male
5 1 1 1 30 Male
6 1 1 1 26 Female
7 0 1 1 36 Female
8 0 1 1 29 Male
9 0 1 1 33 Female
10 0 1 1 20 Female"))
If I now run the following commands:
con<-apply(df[,c(2:4)], 2, table)
con_P<-prop.table(con,2)
it gives me the following error message after the second line: Error in margin.table(x, margin) : 'x' is not an array!
Through another question here on this forum, I learned that the following will help me to overcome this issue:
con_P <- lapply(con, function(x) x/sum(x))
However, if I now run
barplot(con_P)
R does not create a barplot: Error in -0.01 * height : non-numeric argument to binary operator. I assume this is because con_P is not an array!
My question is what to do now (how would I transform con_P in the second example into an array?). Secondly, how can I make the entire step of creating prop.tables and then a bar chart more efficient? Any help is much appreciated.
We can do this by converting the columns to factors with the levels specified. In the second example, the columns can take the values 0 and 1, so we use 0:1 as the levels, then get the table, convert it to proportions with prop.table, and draw the barplot:
barplot(prop.table(sapply(df[2:4],
function(x) table(factor(x, levels=0:1))),2))
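To see why fixing the levels helps, here is a minimal sketch on the all-ones Water column alone. Without explicit levels, table() returns a single cell, so the per-column results differ in length and can no longer be combined into a matrix; with levels = 0:1, every column yields two cells:

```r
water <- rep(1, 10)  # the Water column from the second example

# Only the value 1 occurs, so the table has a single cell.
table(water)

# With explicit levels, the missing value 0 is kept with a zero count.
tab <- table(factor(water, levels = 0:1))
tab

# Proportions now have the same shape as for Wine and Beer.
prop.table(tab)
```

Because every column then produces a table of length two, sapply() returns a 2 x 3 matrix that prop.table() and barplot() accept.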
Reproducing your data:
df<-data.frame(read.table(header = TRUE, text = "
ID Wine Beer Water Age Gender
1 0 1 1 20 Male
2 1 0 1 38 Female
3 0 0 1 32 Female
4 1 0 1 30 Male
5 1 1 1 30 Male
6 1 1 1 26 Female
7 0 1 1 36 Female
8 0 1 1 29 Male
9 0 1 1 33 Female
10 0 1 1 20 Female"))
con <-lapply(df[,c(2:4)], table)
con_P <- lapply(con, function(x) x/sum(x))
You can use reshape2 to melt the data:
library(reshape2)
df <- melt(con_P)
Now, if you want to use ggplot2, you can use df to draw the bar plot:
library(ggplot2)
ggplot(df, aes(x = L1, y = value, fill = factor(Var1))) +
  geom_bar(stat = "identity") +
  theme_bw()
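If you want to use barplot, you can reshape the data.frame into an array:
arr <- acast(df, Var1 ~ L1, value.var = "value")  # named arr to avoid reusing the name of base::array
arr[is.na(arr)] <- 0  # combinations that never occur get a proportion of 0
barplot(arr)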