R Dataframe issue preventing normality test - r

I've read my .CSV and then converted the file to a data frame using several methods including:
df<-read.csv('cdSH2015Fall.csv', dec = ".", na.strings = c("na"), header=TRUE,
row.names=NULL, stringsAsFactors=F)
df<-as.data.frame(lapply(df, unlist)) # converted .csv to a data.frame
str(df) # provides the structure of df.
'data.frame': 72 obs. of 16 variables:
$ trtGroup : Factor w/ 68 levels "AANN","AAPN",..: 5 7 14 18 20 23 27 33 37 48 ...
$ cd : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ PreviousExp : Factor w/ 2 levels "Empty","Enriched": 2 1 2 2 2 2 1 1 1 1 ...
$ treatment : Factor w/ 2 levels "NN","PN": 1 1 1 1 1 1 1 1 1 1 ...
$ total.Area.DarkBlue.: num 827 1037 663 389 983 ...
$ numberOfGroups : int 1 1 1 1 1 1 1 1 1 1 ...
$ totalGroupArea : num 15.72 2.26 9.45 11.57 9.73 ...
$ averageGrpArea : num 15.72 2.26 9.45 11.57 9.73 ...
$ proximityToPlants : num 5.65 16.05 2.58 9.65 4.74 ...
$ latFeed : num 2 0.5 0 1 0 0 1 0.5 2 1 ...
$ latBalloon : num 6 2 2 NA 0 0.1 3 0.5 1 0.7 ...
$ countChases : int 5 8 16 4 16 21 18 11 14 28 ...
$ chases : int 95 87 67 923 636 96 1210 571 775 816 ...
$ grpDiameter : num 16.8 23.3 19.5 11.2 29.9 ...
$ grpActiv : num 4908 5164 4197 5263 5377 ...
$ NND : num 0 11.88 8.98 3.6 9.8 ...
I then run my model two ways:
First option.
fit = t.test(df$proximityToPlants[which(df$cd==1 & df$treatment == 'PN')],
df$proximityToPlants[which(df$cd==0 & df$treatment == 'PN')])
Second option trying to ensure I have a proper data frame.
Subset the data and then create a matrix.
cdProximityToPlantsPN<-cdSH2015Fall$proximityToPlants[which (cdSH2015Fall$cd==1 & cdSH2015Fall$treatment == 'PN')]
H2ProximityToPlantsPN<-cdSH2015Fall$proximityToPlants[which (cdSH2015Fall$cd==0 & cdSH2015Fall$treatment == 'PN')]
cdProximityToPlantsNN<-cdSH2015Fall$proximityToPlants[which (cdSH2015Fall$cd==1 & cdSH2015Fall$treatment == 'NN')]
H2ProximityToPlantsNN<-cdSH2015Fall$proximityToPlants[which (cdSH2015Fall$cd==0 & cdSH2015Fall$treatment == 'NN')]
Creating a matrix
df<-
cbind(cdProximityToPlantsPN,H2ProximityToPlantsPN,cdProximityToPlantsNN,
H2ProximityToPlantsNN)
mat <- sapply(df,unlist)
fit=t.test(mat[,1],mat[,2], paired = F, var.equal = T)
Yet, I still get errors when assessing outliers using the following:
outlierTest(fit) # Bonferroni p-value for most extreme obs
Error in UseMethod("outlierTest") :
no applicable method for 'outlierTest' applied to an object of class
"htest"
qqPlot(fit, main="QQ Plot") #qq plot for studentized resid 
Error in order(x[good]) : unimplemented type 'list' in 'orderVector1'
leveragePlots(fit) # leverage plots
Error in formula.default(model) : invalid formula
I know the issue must be with my data structure. Any ideas on how to fix it?
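The error messages suggest the problem is the class of fit rather than the data frame: t.test() returns an "htest" object with no residuals, while outlierTest(), qqPlot() and leveragePlots() expect a fitted model such as an lm object. A minimal sketch of one way around this, assuming the pooled-variance comparison is what is wanted; the data below are simulated stand-ins, not the real cdSH2015Fall values, and the car calls only run if that package is installed:

```r
# Simulated stand-in for the asker's data (values are made up).
set.seed(1)
df <- data.frame(
  cd = factor(rep(c(0, 1), each = 10)),
  treatment = factor(rep("PN", 20), levels = c("NN", "PN")),
  proximityToPlants = c(rnorm(10, mean = 8, sd = 3),
                        rnorm(10, mean = 10, sd = 3))
)
pn <- subset(df, treatment == "PN")

# Pooled-variance two-sample t test: returns an "htest" object.
tt <- t.test(proximityToPlants ~ cd, data = pn, var.equal = TRUE)

# The same comparison as a linear model: returns an "lm" object,
# which carries residuals and can be passed to diagnostic functions.
fit <- lm(proximityToPlants ~ cd, data = pn)

# Both give the same t statistic in absolute value.
abs(unname(tt$statistic))
abs(coef(summary(fit))["cd1", "t value"])

# Base R diagnostics on the studentized residuals.
res <- rstudent(fit)
shapiro.test(res)
qqnorm(res); qqline(res)

# car's helpers accept the lm object (only run if car is available).
if (requireNamespace("car", quietly = TRUE)) {
  print(car::outlierTest(fit))
  car::qqPlot(fit, main = "QQ Plot")
}
```

With a single two-level predictor the leverage plot adds little, so it is left out of the sketch.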

Related

In cox regression categorical levels not shown in my output

In my data, Hemoglobin_group is a categorical variable with 6 levels. When I run the below code, I can't see Hemoglobin_group levels in the output. How can I solve this problem?
fit <- coxph(Surv(Time,Status)~Hemoglobin_group, data)
fit
My output:
coef exp(coef) se(coef) z p
Hemoglobin_group -0.06585 0.93627 0.01874 -3.514 0.000441
str(data)
tibble [2,021 x 21] (S3: tbl_df/tbl/data.frame)
$ status : num [1:2021] 1 1 1 1 1 1 0 0 0 0 ...
$ id : num [1:2021] 1 1 1 1 1 1 2 2 2 2 ...
$ Time : num [1:2021] 20.3 20.3 20.3 20.3 20.3 ...
$ t1 : num [1:2021] 0 1 2 3 4 5 0 1 2 3 ...
$ t2 : num [1:2021] 1 2 3 4 5 ...
$ sex_string : chr [1:2021] "MALE" "MALE" "MALE" "MALE"...
$ sex : num [1:2021] 1 1 1 1 1 1 1 1 1 1 ...
$ age : num [1:2021] 89 89 89 89 89 89 77 77 77 77 ...
$ hemoglobin : num [1:2021] 9.71 10.22 11.3 11.8 11.2 ...
$ Diabet : num [1:2021] 1 1 1 1 1 1 2 2 2 2 ...
$ Hemoglobin_group : num [1:2021] 4 4 4 4 4 4 5 5 5 5 ...
$ Kreatinin : num [1:2021] 7.19 7.19 7.19 7.19 7.19 ...
$ fosfor : num [1:2021] 4.14 4.14 4.14 4.14 4.14 ...
$ Kalsiyum : num [1:2021] 8.5 8.5 8.5 8.5 8.5 ...
$ CRP : num [1:2021] 1.33 1.33 1.33 1.33 1.33 ...
$ Albumin : num [1:2021] 4.19 4.19 4.19 4.19 4.19 ...
$ Ferritin : num [1:2021] 428 428 428 428 428 ...
$ months : num [1:2021] 1 2 3 4 5 6 1 2 3 4 ...
This is what str(data) shows. I thought I had converted these variables to factors with the code below, but apparently the conversion did not take effect in my data, and I don't understand why.
The code I wrote to convert them to factors was as follows:
sex<-as.factor(sex)
is.factor(sex)
Diabet<-as.factor(Diabet)
is.factor(Diabet)
Status<-as.factor(Status)
is.factor(Status)
months<-as.factor(months)
is.factor(months)
Hemoglobin_group<-as.factor(Hemoglobin_group)
is.factor(Hemoglobin_group)
When I run this code, the R console shows:
> sex<-as.factor(sex)
> is.factor(sex)
[1] TRUE
>
> Diabet<-as.factor(Diabet)
> is.factor(Diabet)
[1] TRUE
>
>
> Status<-as.factor(Status)
> is.factor(Status)
[1] TRUE
>
> months<-as.factor(months)
> is.factor(months)
[1] TRUE
>
> Hemoglobin_group<-as.factor(Hemoglobin_group)
> is.factor(Hemoglobin_group)
[1] TRUE
In this case, don't the categorical variables in my data turn into factors?
Your variable Hemoglobin_group is probably still being treated as numeric inside data. Try:
Hemoglobin_groupF <- factor(Hemoglobin_group)
fit <- coxph(Surv(Time,Status) ~ Hemoglobin_groupF, data)
fit
The reference group will be the first factor level. You can easily change the reference level with the function relevel.
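A likely explanation for the is.factor() results in the question: as.factor() was applied to free-standing copies of the variables, while coxph(..., data) looks the formula variables up in data first, where the column is still numeric. A minimal sketch of converting the column in place, using made-up toy data in place of the real tibble; the coxph call only runs if the survival package is available:

```r
# Toy data standing in for the asker's tibble (values are made up).
data <- data.frame(
  Time   = c(5, 8, 12, 3, 9, 14, 7, 11, 6, 10, 4, 13),
  Status = c(1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1),
  Hemoglobin_group = rep(1:3, times = 4)
)

# Converting a free-standing copy leaves the data frame column untouched.
Hemoglobin_group <- as.factor(data$Hemoglobin_group)
is.factor(Hemoglobin_group)       # TRUE for the copy
is.factor(data$Hemoglobin_group)  # FALSE: the column is still numeric

# Assign the factor back into the data frame instead.
data$Hemoglobin_group <- factor(data$Hemoglobin_group)

# Optionally pick a different reference level.
data$Hemoglobin_group <- relevel(data$Hemoglobin_group, ref = "2")
levels(data$Hemoglobin_group)

# The model now reports one coefficient per non-reference level.
if (requireNamespace("survival", quietly = TRUE)) {
  fit <- survival::coxph(survival::Surv(Time, Status) ~ Hemoglobin_group,
                         data = data)
  print(fit)
}
```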

Classify factor output with factors with >60 levels and numeric inputs

I'm a newbie working on a classification problem to identify the causes of coral diseases. The dataset contains 45 variables.
The output variable is a factor with 21 levels (21 diseases), and the inputs are numeric and factor variables. Some of those factors have as many as 94 levels, for example the species of coral, and I can't collapse or split them because I want to be as precise as possible: one species may be less resistant than another. The numeric variables are things such as the population in the area, fishing trips, etc.
First problem: I tried genetic algorithms, random forests, etc. to select the most important variables, but the run gets aborted, so the variables I eliminated were chosen from correlograms alone. I want a stronger basis for deciding which variables to select.
Second problem: I've tried everything I know and searched Google extensively for a method that runs and produces a classification, but nothing works. I tried SVM, random forests, CART, GBM, bagging and boosting, but none of them can handle this dataset.
This is the structure of the dataset
'data.frame': 136510 obs. of 45 variables:
$ SITE : Factor w/ 144 levels "TUT-1511","TUT-1513",..: 56 15 55 21 12 12 17 53 48 82 ...
$ Zone_Fine : Factor w/ 17 levels "Aunuu_E","Aunuu_W",..: 11 9 10 9 9 9 9 8 10 10 ...
$ TRANSECT : num 1 1 1 1 1 1 1 1 1 1 ...
$ SEGMENT : num 5 1 1 1 7 5 7 5 3 7 ...
$ Seg_WIDTH : num 1 1 1 1 1 1 1 1 1 1 ...
$ Seg_LENGTH : num 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
$ SPECIES : Factor w/ 156 levels "AAAA","AABR",..: 94 126 94 102 9 126 135 94 93 94 ...
$ COLONYLENGTH : num 11 45 10 5 12 10 8 30 20 14 ...
$ OLDDEAD : num 5 2 5 0 0 5 10 0 5 10 ...
$ RECENTDEAD : num 0 10 0 0 0 0 0 0 0 0 ...
$ DZCLASS : Factor w/ 21 levels "Acute Tissue Loss - White Syndrome",..: 14 14 14 14 14 14 14 14 14 14 ...
$ EXTENT : num 52.9 52.9 52.9 52.9 52.9 ...
$ SEVERITY : num 3.11 3.11 3.11 3.11 3.11 ...
$ TAXONNAME.x : Factor w/ 155 levels "Acanthastrea hemprichii",..: 95 132 95 107 7 132 133 95 89 95 ...
$ PHYLUM : Factor w/ 2 levels "Cnidaria","Rhodophyta": 1 1 1 1 1 1 1 1 1 1 ...
$ CLASS : Factor w/ 3 levels "Anthozoa","Florideophyceae",..: 1 1 1 1 1 1 1 1 1 1 ...
$ FAMILY : Factor w/ 20 levels "Acroporidae",..: 1 18 1 2 1 18 18 1 8 1 ...
$ GENUS : Factor w/ 55 levels "Acanthastrea",..: 35 44 35 39 2 44 44 35 34 35 ...
$ RANK : Factor w/ 2 levels "Genus","Species": 1 1 1 1 2 1 2 1 1 1 ...
$ DATE_ : Date, format: "0015-03-27" ...
$ OBS_YEAR : num 2015 2015 2015 2015 2015 ...
$ REEF_ZONE : Factor w/ 2 levels "Backreef","Forereef": 2 2 2 2 2 2 2 2 2 2 ...
$ DEPTH_BIN : Factor w/ 4 levels "Bank","Deep",..: 2 2 4 3 2 2 3 4 3 3 ...
$ LBSP : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
$ Zone_Fine_ReefZone_Depth: Factor w/ 41 levels "Aunuu_E_Deep",..: 30 24 29 25 24 24 25 23 28 28 ...
$ Area_km2.x : num 50.9 49.1 101.8 49.1 49.1 ...
$ Fishing.trips.per.km2 : num 719 1148 1431 1148 1148 ...
$ Area_km2.y : num 50.9 49.1 50.9 49.1 49.1 ...
$ Pop.km2 : num 167.5 49.1 561.9 49.1 49.1 ...
$ SHED_NAME : Factor w/ 35 levels "Aasu","Afao - Asili",..: 2 9 15 17 17 1 1 35 28 26 ...
$ Shed_Cond : Factor w/ 4 levels "Extensive","Intermediate",..: 3 4 2 4 4 3 3 3 1 2 ...
$ Shed_Area_Calc : num 30202 29422 458542 126361 32595 ...
$ Perc_Area : num 0.00128 0.00107 0.00993 0.00458 0.00118 ...
$ Cond_Scale : num 3 4 2 4 4 3 3 3 1 2 ...
$ Shoreline_m : num 23146 33046 45821 33046 33046 ...
$ Rank : num 5 9 3 9 9 9 9 6 3 3 ...
$ Comp.8 : num 0.826 0.814 0.838 0.814 0.814 ...
$ Ble : num 0.958 0.969 0.959 0.969 0.969 ...
$ DZ : num 0.647 0.837 0.732 0.837 0.837 ...
$ Herb : num 0.682 0.564 0.704 0.564 0.564 ...
$ Rec : num 0.375 0.477 0.467 0.477 0.477 ...
$ MA : num 0.965 0.975 0.907 0.975 0.975 ...
$ Dam : num 0.998 1 0.992 1 1 ...
$ TAXONNAME.y : Factor w/ 94 levels "Abudefduf sordidus",..: 94 94 94 94 94 94 94 94 94 94 ...
$ Dummy : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
I expected a classification of "DZCLASS".
Thanks, every recommendation is welcomed!
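One thing that may be behind the aborted runs is the high-cardinality factors: some implementations (the randomForest package, for example, rejects factor predictors with more than 53 levels) cannot take them directly. A rough sketch of two common workarounds on made-up toy data; lump_rare is a hypothetical helper written for this illustration, not a function from any package, and whether lumping is acceptable depends on how much species-level detail has to be kept:

```r
# Toy stand-in for two of the columns (values are made up).
set.seed(7)
toy <- data.frame(
  SPECIES = factor(sample(c("AAAA", "AABR", "ACSP", "POSP", "MOSP"),
                          size = 200, replace = TRUE,
                          prob = c(0.45, 0.30, 0.20, 0.03, 0.02))),
  COLONYLENGTH = round(runif(200, min = 5, max = 50))
)

# Hypothetical helper: merge levels seen fewer than min_n times into "Other".
lump_rare <- function(f, min_n = 10, other = "Other") {
  f <- factor(f)
  counts <- table(f)
  lev <- levels(f)
  lev[lev %in% names(counts)[counts < min_n]] <- other
  levels(f) <- lev  # duplicated level names are merged
  f
}
table(lump_rare(toy$SPECIES, min_n = 10))

# Alternative that keeps every level: one indicator column per species,
# giving a numeric matrix for learners that expect numeric input.
X <- model.matrix(~ SPECIES + COLONYLENGTH - 1, data = toy)
dim(X)
```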

chart.Correlation with continuous and categorical variables

I want to see if there is correlation between my variables. This is the structure of the dataset
'data.frame': 189 obs. of 20 variables:
$ age : num 24 31 32 35 36 26 31 24 35 36 ...
$ diplM2 : Factor w/ 3 levels "0","1","2": 3 2 1 3 2 2 3 2 2 1 ...
$ TimeDelcat : Factor w/ 4 levels "0","1","2","3": 1 1 3 3 3 4 2 1 4 4 ...
$ SeasonDel : Factor w/ 4 levels "1","2","3","4": 1 2 4 3 4 3 4 3 2 3 ...
$ BMIM2 : num 23.4 25.7 17 26.6 24.6 21.6 21 22.3 20.8 20.7 ...
$ WgtB2 : int 3740 3615 3705 3485 3420 2775 3365 3770 3075 3000 ...
$ sex : Factor w/ 2 levels "1","2": 2 2 1 2 2 2 1 1 1 1 ...
$ smoke : Factor w/ 3 levels "0","1","2": 1 1 1 2 1 1 1 1 1 3 ...
$ nRBC : num 0.1621 0.0604 0.1935 0.0527 0.1118 ...
$ CD4T : num 0.1427 0.2143 0.1432 0.0686 0.0979 ...
$ CD8T : num 0.1574 0.1549 0.1243 0.0804 0.0782 ...
$ NK : num 0.02817 0 0.04368 0.00641 0.02398 ...
$ Bcell : num 0.1033 0.1124 0.1468 0.0551 0.0696 ...
$ Mono : num 0.0633 0.0641 0.0773 0.0531 0.0656 ...
$ Gran : num 0.428 0.442 0.329 0.716 0.6 ...
$ chip : Factor w/ 92 levels "200251580021",..: 12 24 23 2 27 22 6 22 17 22 ...
$ pos : Factor w/ 12 levels "R01C01","R01C02",..: 11 12 1 6 9 2 12 1 7 11 ...
$ trim1PM25ifdmv4: num 9.45 13.81 15.59 7.13 15.43 ...
$ trim2PM25ifdmv4: num 13.27 15.53 10.69 13.56 9.27 ...
$ trim3PM25ifdmv4: num 16.72 16.21 12.17 6.47 10.66 ...
As you can see, there are both continuous and categorical variables.
When I run chart.Correlation(variables, histogram=TRUE, method = c("pearson"))
I get this error:
Error in pairs.default(x, gap = 0, lower.panel = panel.smooth, upper.panel = panel.cor, :
non-numeric argument to 'pairs'
How can I fix this?
Thank you.
I believe you want correlations between the numeric variables only. The code below computes them and outputs each pairwise correlation once.
library(reshape2)
data <- data.frame(x1=rnorm(10),
x2=rnorm(10),
x3=rnorm(10),
x4=c("a","b","c","d","e","f","g","h","i","j"),
x5=c("ab","sp","sp","dd","hg","hj","qw","dh","ko","jk"))
data
x1 x2 x3 x4 x5
1 -1.2169793 0.5397598 0.4981513 a ab
2 -0.7032631 -2.1262837 -1.0377371 b sp
3 0.8766831 -0.2326975 -0.1219613 c sp
4 0.3405332 2.4766225 -1.1960618 d dd
5 0.1889945 0.3444534 1.9659062 e hg
6 0.8086956 0.4654644 -1.2526696 f hj
7 -0.6850181 -1.7657241 0.5156620 g qw
8 0.8518034 0.9484547 1.4784063 h dh
9 0.5191793 1.2246566 1.3867829 i ko
10 0.4568953 -0.6881464 0.3548839 j jk
#finding correlation for all numerical values
corr=cor(data[sapply(data, is.numeric)])
#convert the correlation table to long format
res=melt(corr)
##keeping only one side of the correlations
res$type=apply(res,1,function(x)
paste(sort(c(as.character(x[1]),as.character(x[2]))),collapse="*"))
res=unique(res[,c("type","value")])
res
type value
x1*x1 1.00000000
x1*x2 0.44024939
x1*x3 0.04936654
x2*x2 1.00000000
x2*x3 0.08859169
x3*x3 1.00000000
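The same column selection can be applied before plotting, since the error comes from the factor columns being passed on to pairs(). A minimal sketch on made-up data with a few of the column names from the question; the chart.Correlation call assumes the PerformanceAnalytics package and only runs if it is installed:

```r
# Made-up data with a mix of numeric and factor columns.
set.seed(42)
variables <- data.frame(
  age   = rnorm(20, mean = 30, sd = 5),
  BMIM2 = rnorm(20, mean = 23, sd = 3),
  WgtB2 = as.integer(round(rnorm(20, mean = 3400, sd = 300))),
  sex   = factor(sample(c("1", "2"), 20, replace = TRUE)),
  smoke = factor(sample(c("0", "1", "2"), 20, replace = TRUE))
)

# Keep only numeric (including integer) columns.
num_vars <- variables[sapply(variables, is.numeric)]
names(num_vars)

# Correlation matrix of the numeric columns.
cor(num_vars, method = "pearson")

# The plot then works on the numeric subset.
if (requireNamespace("PerformanceAnalytics", quietly = TRUE)) {
  PerformanceAnalytics::chart.Correlation(num_vars, histogram = TRUE,
                                          method = "pearson")
}
```

Pearson correlation is not meaningful for unordered factors such as chip or pos, so leaving them out is usually the intent rather than a limitation.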

R write.table read.table change the format of some columns in dataframes

I am experiencing a problem when saving data with write.table and reading it back with read.table.
I wrote some code that collects data from thousands of files, does some calculations, and creates a data frame. This data frame has 8 columns and more than 11000 rows. The columns hold the 8 variables, 3 of which are ordered factors; the other variables are numeric.
When I look at the structure of my data before calling write.table, I get exactly what I expect:
str(data)
'data.frame': 11424 obs. of 8 variables:
$ a_KN : num 8.56e-09 1.11e-08 1.45e-08 1.88e-08 2.45e-08 ...
$ a_DTM : num 5.05e-08 5.12e-08 5.19e-08 5.26e-08 5.33e-08 ...
$ SF : num 5.89 4.6 3.58 2.79 2.18 ...
$ Energy : Ord.factor w/ 6 levels "160"<"800"<"1.4"<..: 1 1 1 1 1 1 1 1 1 1 ...
$ EnergyUnit: Ord.factor w/ 3 levels "MeV"<"GeV"<"TeV": 1 1 1 1 1 1 1 1 1 1 ...
$ Location : Ord.factor w/ 7 levels "BeamImpact"<"WithinBulky"<..: 5 5 5 5 5 5 5 5 5 5 ...
$ Ti : num 0.25 0.25 0.25 0.25 0.25 0.25 1 0.25 1 0.25 ...
$ Tc : num 30 28 26 24 22 20 30 18 28 16 ...
After that I use the usual write.table command to save my file:
write.table(data, file = "filename.txt")
Now, when I read this file back into R and look at the structure, I get this:
mydata <- read.table("filename.txt", header=TRUE)
> str(mydata)
'data.frame': 11424 obs. of 8 variables:
$ a_KN : num 8.56e-09 1.11e-08 1.45e-08 1.88e-08 2.45e-08 ...
$ a_DTM : num 5.05e-08 5.12e-08 5.19e-08 5.26e-08 5.33e-08 ...
$ SF : num 5.89 4.6 3.58 2.79 2.18 ...
$ Energy : num 160 160 160 160 160 160 160 160 160 160 ...
$ EnergyUnit: Factor w/ 3 levels "GeV","MeV","TeV": 2 2 2 2 2 2 2 2 2 2 ...
$ Location : Factor w/ 7 levels "10cmTarget","AdjBulky",..: 4 4 4 4 4 4 4 4 4 4 ...
$ Ti : num 0.25 0.25 0.25 0.25 0.25 0.25 1 0.25 1 0.25 ...
$ Tc : int 30 28 26 24 22 20 30 18 28 16 ...
Do you know how to solve this problem? This also bothers me because I am creating a Shiny app and the changed classes don't fit my purpose.
Thanks!
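A plain text file stores only the values, so the ordered-factor class and the level order are lost on the way back in. Two possible approaches, sketched on a small made-up data frame with the same column types: saving in R's binary format with saveRDS(), or reading the text file with explicit colClasses and rebuilding the factors from a known level order. File paths are temporary files created for the example.

```r
# Small stand-in with the same kinds of columns (values are made up).
data <- data.frame(
  SF = c(5.89, 4.60, 3.58),
  Energy = factor(c("160", "800", "1.4"),
                  levels = c("160", "800", "1.4"), ordered = TRUE),
  EnergyUnit = factor(c("MeV", "MeV", "GeV"),
                      levels = c("MeV", "GeV", "TeV"), ordered = TRUE)
)

# Option 1: RDS keeps classes, levels and ordering as they are.
rds_file <- tempfile(fileext = ".rds")
saveRDS(data, rds_file)
restored <- readRDS(rds_file)
str(restored)

# Option 2: keep the text file, but read the factor columns as character
# and rebuild the ordered factors with the intended level order.
txt_file <- tempfile(fileext = ".txt")
write.table(data, file = txt_file, row.names = FALSE)
mydata <- read.table(txt_file, header = TRUE,
                     colClasses = c(Energy = "character",
                                    EnergyUnit = "character"))
mydata$Energy <- factor(mydata$Energy,
                        levels = c("160", "800", "1.4"), ordered = TRUE)
mydata$EnergyUnit <- factor(mydata$EnergyUnit,
                            levels = c("MeV", "GeV", "TeV"), ordered = TRUE)
str(mydata)
```

For a Shiny app that only ever reads the file from R, the RDS route is the shorter of the two; the text route is useful when the file also has to be readable by other tools.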

Carc data from rda file to numeric matrix

I am trying to run KDA (kernel discriminant analysis) on the carc data, but when I call X<-data.frame(scale(X)); R shows this error:
"Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric"
I tried as.numeric(as.matrix(carc)) and carc<-na.omit(carc), but neither helps.
install.packages("klaR")
install.packages("FSelector")
library(ks);library(MASS);library(klaR);library(FSelector)
attach("carc.rda")
data<-load("carc.rda")
data
carc<-na.omit(carc)
head(carc)
class(carc) # check for its class
class(as.matrix(carc)) # change class, and
as.numeric(as.matrix(carc))
XX<-carc
X<-XX[,1:12];X.class<-XX[,13];
X<-data.frame(scale(X));
fit.pc<-princomp(X,scores=TRUE);
plot(fit.pc,type="line")
X.new<-fit.pc$scores[,1:5]; X.new<-data.frame(X.new);
cfs(X.class~.,cbind(X.new,X.class))
X.new<-fit.pc$scores[,c(1,4)]; X.new<-data.frame(X.new);
fit.kda1<-Hkda(x=X.new,x.group=X.class,pilot="samse",
bw="plugin",pre="sphere")
kda.fit1 <- kda(x=X.new, x.group=X.class, Hs=fit.kda1)
Can you help to resolve this problem and make this analysis?
Added: The car data set (Chambers, Cleveland, Kleiner & Tukey 1983)
> head(carc)
P M R78 R77 H R Tr W L T D G C
AMC_Concord 4099 22 3 2 2.5 27.5 11 2930 186 40 121 3.58 US
AMC_Pacer 4749 17 3 1 3.0 25.5 11 3350 173 40 258 2.53 US
AMC_Spirit 3799 22 . . 3.0 18.5 12 2640 168 35 121 3.08 US
Audi_5000 9690 17 5 2 3.0 27.0 15 2830 189 37 131 3.20 Europe
Audi_Fox 6295 23 3 3 2.5 28.0 11 2070 174 36 97 3.70 Europe
Here is a small dataset with similar characteristics to what you describe, in order to reproduce this error:
"Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric"
carc <- data.frame(type1=rep(c('1','2'), each=5),
type2=rep(c('5','6'), each=5),
x = rnorm(10,1,2)/10, y = rnorm(10),
stringsAsFactors = TRUE) # needed in R >= 4.0 to get factor columns
This should be similar to your data.frame
str(carc)
# 'data.frame': 10 obs. of 4 variables:
# $ type1: Factor w/ 2 levels "1","2": 1 1 1 1 1 2 2 2 2 2
# $ type2: Factor w/ 2 levels "5","6": 1 1 1 1 1 2 2 2 2 2
# $ x : num -0.1177 0.3443 0.1351 0.0443 0.4702 ...
# $ y : num -0.355 0.149 -0.208 -1.202 -1.495 ...
scale(carc)
# Similar error
# Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
Using set()
require(data.table)
DT <- data.table(carc)
cols_fix <- c("type1", "type2")
for (col in cols_fix) set(DT, j=col, value = as.numeric(as.character(DT[[col]])))
str(DT)
# Classes ‘data.table’ and 'data.frame': 10 obs. of 4 variables:
# $ type1: num 1 1 1 1 1 2 2 2 2 2
# $ type2: num 5 5 5 5 5 6 6 6 6 6
# $ x : num 0.0465 0.1712 0.1582 0.1684 0.1183 ...
# $ y : num 0.155 -0.977 -0.291 -0.766 -1.02 ...
# - attr(*, ".internal.selfref")=<externalptr>
The first column(s) of your data set may be factors. Taking the data from corrgram:
library(corrgram)
carc <- auto
str(carc)
# 'data.frame': 74 obs. of 14 variables:
# $ Model : Factor w/ 74 levels "AMC Concord ",..: 1 2 3 4 5 6 7 8 9 10 ...
# $ Origin: Factor w/ 3 levels "A","E","J": 1 1 1 2 2 2 1 1 1 1 ...
# $ Price : int 4099 4749 3799 9690 6295 9735 4816 7827 5788 4453 ...
# $ MPG : int 22 17 22 17 23 25 20 15 18 26 ...
# $ Rep78 : num 3 3 NA 5 3 4 3 4 3 NA ...
# $ Rep77 : num 2 1 NA 2 3 4 3 4 4 NA ...
# $ Hroom : num 2.5 3 3 3 2.5 2.5 4.5 4 4 3 ...
# $ Rseat : num 27.5 25.5 18.5 27 28 26 29 31.5 30.5 24 ...
# $ Trunk : int 11 11 12 15 11 12 16 20 21 10 ...
# $ Weight: int 2930 3350 2640 2830 2070 2650 3250 4080 3670 2230 ...
# $ Length: int 186 173 168 189 174 177 196 222 218 170 ...
# $ Turn : int 40 40 35 37 36 34 40 43 43 34 ...
# $ Displa: int 121 258 121 131 97 121 196 350 231 304 ...
# $ Gratio: num 3.58 2.53 3.08 3.2 3.7 3.64 2.93 2.41 2.73 2.87 ...
So exclude them by trying this:
X<-XX[,3:14]
or this
X<-XX[,-(1:2)]
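In the head(carc) output posted in the question, R78 and R77 contain "." for missing values and column C holds the origin labels, so those columns cannot be numeric, and na.omit() does not drop the "." rows because they are not NA. A base R sketch, on a few made-up rows modelled on that output, of recoding "." to NA and scaling only the numeric columns:

```r
# Toy rows modelled on head(carc); only a few columns are reproduced.
carc <- data.frame(
  P   = c(4099, 4749, 3799, 9690, 6295),
  M   = c(22, 17, 22, 17, 23),
  R78 = c("3", "3", ".", "5", "3"),
  C   = c("US", "US", "US", "Europe", "Europe"),
  stringsAsFactors = TRUE
)

# Recode "." as NA, then convert the column to numeric.
r78 <- as.character(carc$R78)
r78[r78 == "."] <- NA
carc$R78 <- as.numeric(r78)

# Keep the class label aside and scale the numeric columns only.
X.class <- carc$C
is_num <- sapply(carc, is.numeric)
X <- data.frame(scale(carc[is_num]))
str(X)
```

Rows that still contain NA would need to be dropped or imputed before princomp(), for example with na.omit() applied after the recoding.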
