In cox regression categorical levels not shown in my output - r

In my data, Hemoglobin_group is a categorical variable with 6 levels. When I run the below code, I can't see Hemoglobin_group levels in the output. How can I solve this problem?
fit <- coxph(Surv(Time,Status)~Hemoglobin_group, data)
fit
My output:
coef exp(coef) se(coef) z p
Hemoglobin_group -0.06585 0.93627 0.01874 -3.514 0.000441

str(data)
tibble [2,021 x 21] (S3: tbl_df/tbl/data.frame)
$ status : num [1:2021] 1 1 1 1 1 1 0 0 0 0 ...
$ id : num [1:2021] 1 1 1 1 1 1 2 2 2 2 ...
$ Time : num [1:2021] 20.3 20.3 20.3 20.3 20.3 ...
$ t1 : num [1:2021] 0 1 2 3 4 5 0 1 2 3 ...
$ t2 : num [1:2021] 1 2 3 4 5 ...
$ sex_string : chr [1:2021] "MALE" "MALE" "MALE" "MALE"...
$ sex : num [1:2021] 1 1 1 1 1 1 1 1 1 1 ...
$ age : num [1:2021] 89 89 89 89 89 89 77 77 77 77 ...
$ hemoglobin : num [1:2021] 9.71 10.22 11.3 11.8 11.2 ...
$ Diabet : num [1:2021] 1 1 1 1 1 1 2 2 2 2 ...
$ Hemoglobin_group : num [1:2021] 4 4 4 4 4 4 5 5 5 5 ...
$ Kreatinin : num [1:2021] 7.19 7.19 7.19 7.19 7.19 ...
$ fosfor : num [1:2021] 4.14 4.14 4.14 4.14 4.14 ...
$ Kalsiyum : num [1:2021] 8.5 8.5 8.5 8.5 8.5 ...
$ CRP : num [1:2021] 1.33 1.33 1.33 1.33 1.33 ...
$ Albumin : num [1:2021] 4.19 4.19 4.19 4.19 4.19 ...
$ Ferritin : num [1:2021] 428 428 428 428 428 ...
$ months : num [1:2021] 1 2 3 4 5 6 1 2 3 4 ...
That is what str(data) shows. I thought I had converted these variables to factors with the code below, but apparently the conversion did not take effect, and I don't understand why.
The code I wrote to convert them to factors was as follows:
sex<-as.factor(sex)
is.factor(sex)
Diabet<-as.factor(Diabet)
is.factor(Diabet)
Status<-as.factor(Status)
is.factor(Status)
months<-as.factor(months)
is.factor(months)
Hemoglobin_group<-as.factor(Hemoglobin_group)
is.factor(Hemoglobin_group)
When I run this code, the R console shows:
> sex<-as.factor(sex)
> is.factor(sex)
[1] TRUE
>
> Diabet<-as.factor(Diabet)
> is.factor(Diabet)
[1] TRUE
>
>
> Status<-as.factor(Status)
> is.factor(Status)
[1] TRUE
>
> months<-as.factor(months)
> is.factor(months)
[1] TRUE
>
> Hemoglobin_group<-as.factor(Hemoglobin_group)
> is.factor(Hemoglobin_group)
[1] TRUE
In this case, don't the categorical variables in my data turn into factors?

Your variable Hemoglobin_group is probably being treated as a numeric value in the model (str(data) shows the column as num). Try:
Hemoglobin_groupF <- factor(Hemoglobin_group)
fit <- coxph(Surv(Time,Status) ~ Hemoglobin_groupF, data)
fit
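One likely explanation for the is.factor() results in the question, assuming the variables were available as free-standing vectors (for example after attach(data)), is that as.factor() converted a copy in the workspace while the column inside data stayed numeric. A minimal base-R sketch with invented values:

```r
# Toy data frame; the values are invented for illustration only
data <- data.frame(Hemoglobin_group = c(4, 4, 5, 5, 6, 1))

# Converting a free-standing copy leaves the data frame untouched
Hemoglobin_group <- as.factor(data$Hemoglobin_group)
copy_is_factor   <- is.factor(Hemoglobin_group)        # TRUE
column_is_factor <- is.factor(data$Hemoglobin_group)   # FALSE

# Assigning back into the data frame changes the column itself
data$Hemoglobin_group <- as.factor(data$Hemoglobin_group)
column_now_factor <- is.factor(data$Hemoglobin_group)  # TRUE
```

Because a formula with a data argument looks up variables in the data frame first, the model sees whatever type the column has there.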
The reference group will be the first factor level. You can change the reference level with the function relevel().
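As a rough sketch of the whole workflow, the example below converts the column inside the data frame, fits the model, and then changes the reference level. The data are simulated stand-ins, not the asker's data; only the column names Time, Status and Hemoglobin_group are taken from the question.

```r
library(survival)

set.seed(1)
n <- 300
# Simulated stand-in data (assumption: one row per subject, 0/1 event indicator)
data <- data.frame(
  Time   = rexp(n, rate = 0.1),
  Status = rbinom(n, 1, 0.7),
  Hemoglobin_group = rep(1:6, each = n / 6)
)

# Convert the column inside the data frame, not a copy in the workspace
data$Hemoglobin_group <- factor(data$Hemoglobin_group)

fit <- coxph(Surv(Time, Status) ~ Hemoglobin_group, data = data)
print(fit)  # one coefficient row per non-reference level

# Make group 4 the reference level and refit
data$Hemoglobin_group <- relevel(data$Hemoglobin_group, ref = "4")
fit_ref4 <- coxph(Surv(Time, Status) ~ Hemoglobin_group, data = data)
print(fit_ref4)
```

With six levels the model reports five coefficient rows, each comparing one level with the reference.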

Related

Error with grouped summary results when using likert()

I am trying to visualise some Likert data. I have been successful doing this across all respondents using likert()
df <- data[,11:14]
df[] <- lapply(df, factor,
levels=c(1,2,3,4,5),
labels = c("Strongly disagree", "Disagree", "Neutral","Agree", "Strongly agree"))
df <- data.frame(df)
LK <- likert(df)
LK <- likert(summary = LK$results)
plot(LK, include.center=FALSE)+ggtitle("Title")
To cut this by group I have used the following code (which works):
LK2 <- likert(df, grouping = data$group)
str(LK2$results)
'data.frame': 34 obs. of 7 variables:
$ Group : Factor w/ 9 levels "NSW","VIC", "QLD": 1 1 1 1 1 1 1 1 1 1 ...
$ Item : Factor w/ 4 levels "Question_1",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Strongly disagree: num 18.18 13.64 12.12 15.15 9.09 ...
$ Disagree : num 19.7 16.7 10.6 24.2 22.7 ...
$ Neutral : num 27.3 34.8 45.5 22.7 27.3 ...
$ Agree : num 24.2 22.7 18.2 24.2 24.2 ...
$ Strongly agree : num 10.6 12.1 13.6 13.6 16.7 ...
But then when I make a summary out of these results to plot:
LK3 <- likert(summary = LK2$results, grouping = LK2$results[,1])
str(LK3)
List of 5
$ results :'data.frame': 28 obs. of 8 variables:
..$ Group : Factor w/ 9 levels "NSW","VIC","QLD",..: 1 1 1 1 2 2 2 2 3 3 ...
..$ Group : Factor w/ 9 levels "NSW","VIC","QLD",..: 1 1 1 1 2 2 2 2 3 3 ...
..$ Item : Factor w/ 4 levels "Question_1",..: 1 2 3 4 1 2 3 4 1 2 ...
..$ Strongly disagree: num [1:28] 16.22 16.22 8.11 18.92 14.29 ...
..$ Disagree : num [1:28] 18.9 18.9 13.5 16.2 19 ...
..$ Neutral : num [1:28] 35.1 27 48.6 21.6 42.9 ...
..$ Agree : num [1:28] 16.2 24.3 13.5 29.7 19 ...
..$ Strongly agree : num [1:28] 13.51 13.51 16.22 13.51 4.76 ...
$ items : NULL
$ grouping: Factor w/ 9 levels "NSW","VIC","QLD",..: 1 1 1 1 2 2 2 2 3 3 ...
$ nlevels : num 6
$ levels : chr [1:6] "Item" "Strongly disagree" "Disagree" "Neutral" ...
- attr(*, "class")= chr "likert"
It seems to duplicate 'Group', and when I try to plot I get the following error message:
plot(LK3)
Error in FUN(newX[, i], ...) : invalid 'type' (character) of argument
No idea why this is happening - any help would be appreciated

issue with gbm.step() function in R

I'm trying to run cross-validation for boosted regression/classification trees using the function gbm.step() from the R package dismo, but it returns an empty output and I can't figure out why. This is the code I'm using:
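# Extract the attribute table and drop incomplete rows first,
# so that the column indices below refer to an existing data frame
DFbrt_df <- DFbrt@data
DFbrt_df2 <- na.omit(DFbrt_df)
ColIndexCov <- match(names(myRS), colnames(DFbrt_df2))
ColIndexResp <- match(c("HasRes"), colnames(DFbrt_df2))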
myBRT = gbm.step(data = DFbrt_df2,
                 gbm.x = ColIndexCov,
                 gbm.y = ColIndexResp,
                 tree.complexity = 3,
                 learning.rate = 10^(-8),
                 n.trees = 50,
                 family = "bernoulli",
                 n.folds = 4,
                 fold.vector = DFbrt_df2$Region.num,
                 step.size = 50,
                 verbose = FALSE,
                 silent = TRUE
)
str(DFbrt_df2)
'data.frame': 560845 obs. of 18 variables:
$ Nsamples : num 310 310 310 310 310 310 310 310 310 310 ...
$ cluster : num 39 39 39 39 39 39 39 39 39 39 ...
$ R : num 44.9 44.9 44.9 44.9 44.9 ...
$ P50 : num 0.565 0.544 0.609 0.605 0.593 ...
$ regions : Factor w/ 6 levels "China_east","China_middlesouth",..: 1 1 1 1 1 1 1 1 1 1 ...
$ HasRes : num 1 0 1 0 0 0 1 1 0 0 ...
$ use : num 10.02 9.75 0 9.38 8.77 ...
$ acc : num 0 0 0.4103 0.0769 0.0779 ...
$ tmp : num 2.46 2.46 2.46 2.46 2.45 ...
$ irg : num 1.788 0.399 1.205 1.836 1.841 ...
$ PgExt : num 3.11 0 3.7 3.11 3.18 ...
$ PgInt : num 4.69 2.76 0 3.99 2.22 ...
$ ChExt : num 3.74 0 4.33 3.74 3.81 ...
$ ChInt : num 5.01 5.99 5.35 4.88 4.97 ...
$ Ca : num 0 0 2.71 0 2.8 ...
$ veg : num 0 0 0 0 0 0 0 0 0 0 ...
$ Region.num: num 4 4 4 4 4 4 4 4 4 4 ...
$ Region : num 4 4 4 4 4 4 4 4 4 4 ...
- attr(*, "na.action")= 'omit' Named int 1 2 3 4 5 6 7 8 9 10 ...
..- attr(*, "names")= chr "1" "2" "3" "4" ...
The response variable is HasRes, and the covariates are use, acc, tmp, irg, PgExt, PgInt, ChExt, ChInt, Ca, and veg.
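For reference, the column indices that match() would produce for these variables can be checked in isolation. This is only a sketch: the vector cols copies the column names from the str() output above, and covariates is a hypothetical stand-in for names(myRS), whose contents are not shown in the question.

```r
# Column names in the order shown by str(DFbrt_df2)
cols <- c("Nsamples", "cluster", "R", "P50", "regions", "HasRes",
          "use", "acc", "tmp", "irg", "PgExt", "PgInt",
          "ChExt", "ChInt", "Ca", "veg", "Region.num", "Region")

# Hypothetical stand-in for names(myRS)
covariates <- c("use", "acc", "tmp", "irg", "PgExt", "PgInt",
                "ChExt", "ChInt", "Ca", "veg")

cov_idx  <- match(covariates, cols)  # positions of the covariate columns
resp_idx <- match("HasRes", cols)    # position of the response column

# match() is case-sensitive: a name that differs only in case gives NA
wrong_case <- match("ca", cols)
```

An NA among the indices would point to a name mismatch between the raster layers and the data frame columns, which is worth ruling out before looking at the model settings.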

Classify factor output with factors with >60 levels and numeric inputs

I'm a newbie working on a classification problem to identify the causes of coral diseases. The dataset contains 45 variables.
The output variable is a factor with 21 levels (21 diseases), and the inputs are numeric and factor variables. Some of those factors have as many as 94 levels (for example, the coral species), and I can't split those factors because I want to be as precise as possible: one species may be less resistant than another. The numeric variables include the population in the area, fishing trips, etc.
First problem: I tried genetic algorithms, random forests, etc. to select the most important variables, but the runs get aborted, so the variables I eliminated were chosen from correlograms alone. I want something stronger to decide which variables to select.
Second problem: I've tried everything I know and searched Google extensively for something that runs and produces a classification, but nothing works. I tried SVM, random forests, CART, GBM, bagging, and boosting, but none of them can handle this dataset.
This is the structure of the dataset
'data.frame': 136510 obs. of 45 variables:
$ SITE : Factor w/ 144 levels "TUT-1511","TUT-1513",..: 56 15 55 21 12 12 17 53 48 82 ...
$ Zone_Fine : Factor w/ 17 levels "Aunuu_E","Aunuu_W",..: 11 9 10 9 9 9 9 8 10 10 ...
$ TRANSECT : num 1 1 1 1 1 1 1 1 1 1 ...
$ SEGMENT : num 5 1 1 1 7 5 7 5 3 7 ...
$ Seg_WIDTH : num 1 1 1 1 1 1 1 1 1 1 ...
$ Seg_LENGTH : num 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
$ SPECIES : Factor w/ 156 levels "AAAA","AABR",..: 94 126 94 102 9 126 135 94 93 94 ...
$ COLONYLENGTH : num 11 45 10 5 12 10 8 30 20 14 ...
$ OLDDEAD : num 5 2 5 0 0 5 10 0 5 10 ...
$ RECENTDEAD : num 0 10 0 0 0 0 0 0 0 0 ...
$ DZCLASS : Factor w/ 21 levels "Acute Tissue Loss - White Syndrome",..: 14 14 14 14 14 14 14 14 14 14 ...
$ EXTENT : num 52.9 52.9 52.9 52.9 52.9 ...
$ SEVERITY : num 3.11 3.11 3.11 3.11 3.11 ...
$ TAXONNAME.x : Factor w/ 155 levels "Acanthastrea hemprichii",..: 95 132 95 107 7 132 133 95 89 95 ...
$ PHYLUM : Factor w/ 2 levels "Cnidaria","Rhodophyta": 1 1 1 1 1 1 1 1 1 1 ...
$ CLASS : Factor w/ 3 levels "Anthozoa","Florideophyceae",..: 1 1 1 1 1 1 1 1 1 1 ...
$ FAMILY : Factor w/ 20 levels "Acroporidae",..: 1 18 1 2 1 18 18 1 8 1 ...
$ GENUS : Factor w/ 55 levels "Acanthastrea",..: 35 44 35 39 2 44 44 35 34 35 ...
$ RANK : Factor w/ 2 levels "Genus","Species": 1 1 1 1 2 1 2 1 1 1 ...
$ DATE_ : Date, format: "0015-03-27" ...
$ OBS_YEAR : num 2015 2015 2015 2015 2015 ...
$ REEF_ZONE : Factor w/ 2 levels "Backreef","Forereef": 2 2 2 2 2 2 2 2 2 2 ...
$ DEPTH_BIN : Factor w/ 4 levels "Bank","Deep",..: 2 2 4 3 2 2 3 4 3 3 ...
$ LBSP : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
$ Zone_Fine_ReefZone_Depth: Factor w/ 41 levels "Aunuu_E_Deep",..: 30 24 29 25 24 24 25 23 28 28 ...
$ Area_km2.x : num 50.9 49.1 101.8 49.1 49.1 ...
$ Fishing.trips.per.km2 : num 719 1148 1431 1148 1148 ...
$ Area_km2.y : num 50.9 49.1 50.9 49.1 49.1 ...
$ Pop.km2 : num 167.5 49.1 561.9 49.1 49.1 ...
$ SHED_NAME : Factor w/ 35 levels "Aasu","Afao - Asili",..: 2 9 15 17 17 1 1 35 28 26 ...
$ Shed_Cond : Factor w/ 4 levels "Extensive","Intermediate",..: 3 4 2 4 4 3 3 3 1 2 ...
$ Shed_Area_Calc : num 30202 29422 458542 126361 32595 ...
$ Perc_Area : num 0.00128 0.00107 0.00993 0.00458 0.00118 ...
$ Cond_Scale : num 3 4 2 4 4 3 3 3 1 2 ...
$ Shoreline_m : num 23146 33046 45821 33046 33046 ...
$ Rank : num 5 9 3 9 9 9 9 6 3 3 ...
$ Comp.8 : num 0.826 0.814 0.838 0.814 0.814 ...
$ Ble : num 0.958 0.969 0.959 0.969 0.969 ...
$ DZ : num 0.647 0.837 0.732 0.837 0.837 ...
$ Herb : num 0.682 0.564 0.704 0.564 0.564 ...
$ Rec : num 0.375 0.477 0.467 0.477 0.477 ...
$ MA : num 0.965 0.975 0.907 0.975 0.975 ...
$ Dam : num 0.998 1 0.992 1 1 ...
$ TAXONNAME.y : Factor w/ 94 levels "Abudefduf sordidus",..: 94 94 94 94 94 94 94 94 94 94 ...
$ Dummy : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
I expected a classification of "DZCLASS".
Thanks, every recommendation is welcomed!

R Dataframe issue preventing normality test

I've read my .CSV and then converted the file to a data frame using several methods including:
df<-read.csv('cdSH2015Fall.csv', dec = ".", na.strings = c("na"), header=TRUE,
row.names=NULL, stringsAsFactors=F)
df<-as.data.frame(lapply(df, unlist)) # converted .csv to a data.frame
str(df) # provides the structure of df.
'data.frame': 72 obs. of 16 variables:
$ trtGroup : Factor w/ 68 levels "AANN","AAPN",..: 5 7 14 18 20 23 27 33 37 48 ...
$ cd : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ PreviousExp : Factor w/ 2 levels "Empty","Enriched": 2 1 2 2 2 2 1 1 1 1 ...
$ treatment : Factor w/ 2 levels "NN","PN": 1 1 1 1 1 1 1 1 1 1 ...
$ total.Area.DarkBlue.: num 827 1037 663 389 983 ...
$ numberOfGroups : int 1 1 1 1 1 1 1 1 1 1 ...
$ totalGroupArea : num 15.72 2.26 9.45 11.57 9.73 ...
$ averageGrpArea : num 15.72 2.26 9.45 11.57 9.73 ...
$ proximityToPlants : num 5.65 16.05 2.58 9.65 4.74 ...
$ latFeed : num 2 0.5 0 1 0 0 1 0.5 2 1 ...
$ latBalloon : num 6 2 2 NA 0 0.1 3 0.5 1 0.7 ...
$ countChases : int 5 8 16 4 16 21 18 11 14 28 ...
$ chases : int 95 87 67 923 636 96 1210 571 775 816 ...
$ grpDiameter : num 16.8 23.3 19.5 11.2 29.9 ...
$ grpActiv : num 4908 5164 4197 5263 5377 ...
$ NND : num 0 11.88 8.98 3.6 9.8 ...
I then run my model two ways:
First option.
fit = t.test(df$proximityToPlants[which(df$cd == 1 & df$treatment == 'PN')],
             df$proximityToPlants[which(df$cd == 0 & df$treatment == 'PN')])
Second option trying to ensure I have a proper data frame.
Subset the data and then create a matrix.
cdProximityToPlantsPN<-cdSH2015Fall$proximityToPlants[which (cdSH2015Fall$cd==1 & cdSH2015Fall$treatment == 'PN')]
H2ProximityToPlantsPN<-cdSH2015Fall$proximityToPlants[which (cdSH2015Fall$cd==0 & cdSH2015Fall$treatment == 'PN')]
cdProximityToPlantsNN<-cdSH2015Fall$proximityToPlants[which (cdSH2015Fall$cd==1 & cdSH2015Fall$treatment == 'NN')]
H2ProximityToPlantsNN<-cdSH2015Fall$proximityToPlants[which (cdSH2015Fall$cd==0 & cdSH2015Fall$treatment == 'NN')]
Creating a matrix
df <- cbind(cdProximityToPlantsPN, H2ProximityToPlantsPN,
            cdProximityToPlantsNN, H2ProximityToPlantsNN)
mat <- sapply(df,unlist)
fit=t.test(mat[,1],mat[,2], paired = F, var.equal = T)
Yet, I still get errors when assessing outliers using the following:
outlierTest(fit) # Bonferroni p-value for most extreme obs
Error in UseMethod("outlierTest") :
no applicable method for 'outlierTest' applied to an object of class "htest"
qqPlot(fit, main="QQ Plot") #qq plot for studentized resid 
Error in order(x[good]) : unimplemented type 'list' in 'orderVector1'
leveragePlots(fit) # leverage plots
Error in formula.default(model) : invalid formula
I know the issue must be with my data structure. Any ideas on how to fix it?
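The error messages suggest that these diagnostic functions expect a fitted model rather than the "htest" object returned by t.test(). As a hedged sketch using simulated values (not the data above), an equal-variance two-sample t-test can also be written as a linear model with a two-level factor, which yields an object of class "lm":

```r
set.seed(42)
# Simulated stand-ins for two groups; the values are invented
x <- rnorm(20, mean = 5)
y <- rnorm(20, mean = 6)

tt <- t.test(x, y, paired = FALSE, var.equal = TRUE)
tt_class <- class(tt)   # "htest": a test summary, not a fitted model

# Same comparison in long format, fitted as a linear model
long <- data.frame(value = c(x, y),
                   group = factor(rep(c("cd1", "cd0"), each = 20)))
m <- lm(value ~ group, data = long)
m_class <- class(m)     # "lm"

# The t statistic of the group coefficient matches the t-test (up to sign)
t_lm <- summary(m)$coefficients[2, "t value"]
```

Model-based diagnostics would then be applied to m rather than to the t-test result; whether that fits the design here (four groups, possible missing values) is a separate question.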

Calculate y-value of curve maximum of a smooth line in R and ggplot2

I'm following up an old question addressed here:
calculate x-value of curve maximum of a smooth line in R and ggplot2
How could I calculate the Y-value of curve maximum?
Cheers
It would seem to me that code changes of "x" to "y" and 'vline' to 'hline' and "xintercept" to "yintercept" would be all that were needed:
gb <- ggplot_build(p1)
exact_y_value_of_the_curve_maximum <- gb$data[[1]]$y[which(diff(sign(diff(gb$data[[1]]$y)))==-2)+1]
p1 + geom_hline( yintercept =exact_y_value_of_the_curve_maximum)
exact_y_value_of_the_curve_maximum
I don't think I would call these "exact" since they are only numerical estimates. If you want the overall (global) maximum rather than every local peak, the other way to get that value would be:
max(gb$data[[1]]$y)
The $data element of that build object can be examined with str():
> str(gb$data)
List of 2
$ :'data.frame': 80 obs. of 7 variables:
..$ x : num [1:80] 1 1.19 1.38 1.57 1.76 ...
..$ y : num [1:80] -123.3 -116.6 -109.9 -103.3 -96.6 ...
..$ ymin : num [1:80] -187 -177 -166 -156 -146 ...
..$ ymax : num [1:80] -59.4 -56.5 -53.5 -50.3 -46.9 ...
..$ se : num [1:80] 29.3 27.6 25.9 24.3 22.8 ...
..$ PANEL: int [1:80] 1 1 1 1 1 1 1 1 1 1 ...
..$ group: int [1:80] 1 1 1 1 1 1 1 1 1 1 ...
$ :'data.frame': 16 obs. of 4 variables:
..$ x : num [1:16] 1 2 3 4 5 6 7 8 9 10 ...
..$ y : num [1:16] -79.6 -84.7 -88.4 -74.1 -29.6 ...
..$ PANEL: int [1:16] 1 1 1 1 1 1 1 1 1 1 ...
..$ group: int [1:16] 1 1 1 1 1 1 1 1 1 1 ...
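The diff(sign(diff(y))) idiom used above can be tried on a plain numeric vector without ggplot2. The sketch below uses a sine curve as an assumed stand-in for the smoothed y values in gb$data[[1]]:

```r
# Stand-in for the smoothed curve: one full period of a sine wave
x <- seq(0, 2 * pi, length.out = 200)
y <- sin(x)

# diff(y) is positive while the curve rises and negative while it falls;
# a change of sign from +1 to -1 (difference of -2) marks a local maximum
peak_idx <- which(diff(sign(diff(y))) == -2) + 1

peak_y <- y[peak_idx]  # y-value at the local maximum
peak_x <- x[peak_idx]  # corresponding x-value
global_max <- max(y)   # same value here because the curve has a single peak
```

On a curve with several peaks, peak_idx would hold one index per local maximum, while max(y) would still return only the largest value.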
