I am currently running a repeated-measures ANOVA and can't get past this error message:
Error: Each row of output must be identified by a unique combination of keys.
Keys are shared for 2 rows:
* 82, 83
This is the formula I used:
res.aov <- anova_test(data = data1,
dv = AUC, wid = Observation,
within = c(SMIP, SMIB))
This is my data frame:
'data.frame': 410 obs. of 8 variables:
$ ID : Factor w/ 125 levels "1","4","6","7",..: 1 1 2 2 2 3 4 4 4 4 ...
$ Observation : Factor w/ 11 levels "1","2","3","4",..: 1 2 1 2 3 1 1 2 3 4 ...
$ IDobs : Factor w/ 407 levels "1.1","1.2","4.1",..: 1 2 3 4 5 6 7 8 9 10 ...
$ SMIP : num 2.26 2.26 1.05 1.05 1.05 1.11 1.23 1.23 1.23 1.23 ...
$ SMIB : num 4.84 4.84 1.95 1.95 1.95 4.78 4.34 4.34 4.34 4.34 ...
$ AUC : num 21.2 16.6 19.2 16.9 18.5 15.4 10.4 12.8 15.4 17.9 ...
My data frame has different IDs (patients 1, 4, 6, ...) with several observations for each one (obs. 1 and 2 for patient 1), and therefore I have no identical keys, even though rows can share the same values in other variables (such as AUC). BUT: not every patient has the same number of observations! They vary from 1 to 11.
As you can see, I tried combining the columns ID and Observation (IDobs), but this didn't help.
I also tried removing the offending rows in my Excel sheet, which led to the same error message, just with different row numbers.
Any help or suggestions will be greatly appreciated.
Many thanks in advance!
Nigina
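One way to track down the rows behind this message: anova_test() requires every combination of the wid and within variables to occur exactly once, so counting those combinations shows which rows share a key. A minimal diagnostic sketch (using dplyr; column names taken from the question above):
library(dplyr)
data1 %>%
  count(Observation, SMIP, SMIB) %>%
  filter(n > 1)   # key combinations that appear more than once
If duplicates show up here, the wid argument probably needs a variable that uniquely identifies each subject (e.g. ID or IDobs) rather than Observation. Note also that a classical repeated-measures ANOVA assumes a complete within-subject design, so the varying 1 to 11 observations per patient may cause further problems.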
I want to run an RF classification just as it is specified in 'randomForest' but still use the repeated k-fold cross-validation method (code below). How do I stop caret from creating dummy variables out of my categorical predictors? I read that this may be due to one-hot encoding, but I am not sure how to change this. I would be very grateful for some example lines on how to fix it!
database:
> str(river)
'data.frame': 121 obs. of 13 variables:
$ stat_bino : Factor w/ 2 levels "0","1": 2 2 1 1 2 2 2 2 2 2 ...
$ Subfamily : Factor w/ 14 levels "carettochelyinae",..: 14 14 14 14 8 8 8 8 8 8 ...
$ MAXCL : num 850 850 360 540 625 600 760 480 560 580 ...
$ CS : num 8 8 14 15 26 25.5 20 20 18 21.5 ...
$ CF : num 3.5 3.5 2.5 2 1.5 3 2 2 1 1 ...
$ size_mat : num 300 300 170 180 450 450 460 406 433 433 ...
$ incubat : num 97.5 97.5 71 72.5 91.5 67.5 73 55 83 80 ...
$ diet : Factor w/ 5 levels "omnivore leaning carnivore",..: 1 1 1 1 2 2 2 5 4 4 ...
$ HDI : num 721 627 878 885 704 ...
$ HF09M93 : num 23.19 9.96 -8.52 -5.67 27.3 ...
$ HF09 : num 116 121 110 110 152 ...
$ deg_reg : num 8.64 39.37 370.95 314.8 32.99 ...
$ protected_area: num 7.55 10.93 2.84 2.89 12.71 …
the rest:
> control <- trainControl(method='repeatedcv',
+ number=5,repeats = 3,
+ search='grid')
> tunegrid <- expand.grid(.mtry = (1:12))
> rf_gridsearch <- train(stat_bino ~ .,
+ data = river,
+ method = 'rf',
+ metric = 'Accuracy',
+ ntree = 600,
+ importance = TRUE,
+ tuneGrid = tunegrid, trControl = control)
> rf_gridsearch$finalModel[["xNames"]]
[1] "Subfamilychelinae" "Subfamilychelodininae" "Subfamilychelydrinae"
[4] "Subfamilycyclanorbinae" "Subfamilydeirochelyinae" "Subfamilydermatemydinae"
[7] "Subfamilygeoemydinae" "Subfamilykinosterninae" "Subfamilypelomedusinae"
...you get the picture. I now have 27 predictors instead of 12.
When you use the formula interface to train:
train(stat_bino ~ .,
...
it will convert factors using dummy coding. This makes sense because formulas in most traditional R functions work this way (for instance lm).
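A small illustration of the same idea (not caret's exact internals), using model.matrix() on the river data from the question:
head(model.matrix(stat_bino ~ ., data = river))
The 14-level Subfamily factor is expanded into one 0/1 indicator column per non-reference level, which is where the extra predictors come from.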
However, if you use the non-formula interface:
train(y = river$stat_bino,
x = river[,colnames(river) != "stat_bino"],
...
then caret will leave the variables as they are supplied. This is what you want with tree-based methods, but it will produce errors with algorithms that cannot handle factors internally, such as glmnet.
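For completeness, a sketch of the full non-formula call, reusing the control and tunegrid objects defined in the question:
rf_gridsearch <- train(x = river[, colnames(river) != "stat_bino"],
                       y = river$stat_bino,
                       method = 'rf',
                       metric = 'Accuracy',
                       ntree = 600,
                       importance = TRUE,
                       tuneGrid = tunegrid,
                       trControl = control)
rf_gridsearch$finalModel[["xNames"]] should then list the 12 original predictors, with Subfamily kept as a single factor.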
I've been trying to train a caret glmnet model for the past few hours, but it keeps throwing errors. My dataset has 15 variables: 3 are factors, 11 are numeric and 1 is an integer. I split the dataset into a 70/30 train/test split.
The dataset has some NA values in it, so I tried to impute the NAs in the recipe and then piped it into steps that centre and scale the numeric data.
I keep getting an error when I try to preprocess my data using the recipe below.
library(caret)
library(tidyverse)
library(recipes)
data <- read.csv("data.csv", stringsAsFactors = TRUE)   # placeholder for reading the actual file
'data.frame': 168 obs. of 15 variables:
$ COUNTRY : Factor w/ 190 levels "Country1","Country10",..: 1 103 114 125 136 147 158 169 180 2 ...
$ GOVERNMENT : Factor w/ 5 levels "AUTOCRATIC","LEFT"
$ POPULATION : num 45.4 45.1 80.2 7.8 37.5 ...
$ AGE25PROP : num 13.6 17.9 11.3 17 15.1 ...
$ AGE55PROP : num 33.5 36.5 34.4 32.5 33.1 ...
$ POPDENSITY : num 498 502 494 506 492 ...
$ GDP2019 : num 22.6 22.7 58 56.4 57.4 ...
$ INFANTMORT : num 16.3 14.2 17.7 NA 15.2 ...
$ DOC10 : num 22.6 24.1 24.7 NA 26.6 ...
$ VAXRATE : num 39.5 35.2 61.6 NA 60.6 ...
$ HEALTHCARE_BASIS : Factor w/ 4 levels "FREE","INSURANCE",
$ HEALTHCARE_COST : num 4759 15281 NA 5009 NA ...
$ DEATHRATE : num 21.7 27.3 17.3 16.7 25.2 ...
$ HEALTHCARE_COST_shadow: num 0 0 1 0 1 1 0 1 0 0 ...
$ na_count : int 0 0 1 3 1 2 0 1 4 0 ...
Test/Train split with "DEATHRATE" as the Y variable
subIndex <- caret::createDataPartition(y = data$DEATHRATE, p = 0.7, list = FALSE)
train <- data[subIndex, ]
test <- data[-subIndex, ]
Using recipes for preprocessing with "DEATHRATE" as the dependent variable and "COUNTRY" as the id
rec <- recipes::recipe("DEATHRATE" ~., data = train) %>%
update_role("COUNTRY", new_role = "id") %>%
step_knnimpute(all_predictors(), neighbours = 5) %>%
step_center(all_numeric(), -has_role("outcome")) %>%
step_scale(all_numeric(), -has_role("outcome"))
I always get the error
Error in terms.formula(formula, data = data) :
invalid model formula in ExtractVars
Training the model
model <- caret::train(rec, data = train, method = "glmnet")
Does anyone know what I'm doing wrong?
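A hedged sketch of the most likely fix: recipe() needs a bare column name on the left-hand side of the formula, not a quoted string; "DEATHRATE" ~ . is exactly the kind of formula that terms.formula() rejects with "invalid model formula in ExtractVars". Something along these lines (step_impute_knn() is the current name of step_knnimpute(), note the "neighbors" spelling; step_dummy() is added because glmnet needs numeric predictors):
rec <- recipes::recipe(DEATHRATE ~ ., data = train) %>%
  update_role(COUNTRY, new_role = "id") %>%
  step_impute_knn(all_predictors(), neighbors = 5) %>%
  step_dummy(all_nominal(), -has_role("id")) %>%
  step_center(all_numeric(), -has_role("outcome")) %>%
  step_scale(all_numeric(), -has_role("outcome"))

model <- caret::train(rec, data = train, method = "glmnet")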
I've read my .csv file and then converted it to a data frame using several methods, including:
df<-read.csv('cdSH2015Fall.csv', dec = ".", na.strings = c("na"), header=TRUE,
row.names=NULL, stringsAsFactors=F)
df <- as.data.frame(lapply(df, unlist)) # convert the .csv contents to a data.frame
str(df) # provides the structure of df.
'data.frame': 72 obs. of 16 variables:
$ trtGroup : Factor w/ 68 levels "AANN","AAPN",..: 5 7 14 18 20 23 27 33 37 48 ...
$ cd : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ PreviousExp : Factor w/ 2 levels "Empty","Enriched": 2 1 2 2 2 2 1 1 1 1 ...
$ treatment : Factor w/ 2 levels "NN","PN": 1 1 1 1 1 1 1 1 1 1 ...
$ total.Area.DarkBlue.: num 827 1037 663 389 983 ...
$ numberOfGroups : int 1 1 1 1 1 1 1 1 1 1 ...
$ totalGroupArea : num 15.72 2.26 9.45 11.57 9.73 ...
$ averageGrpArea : num 15.72 2.26 9.45 11.57 9.73 ...
$ proximityToPlants : num 5.65 16.05 2.58 9.65 4.74 ...
$ latFeed : num 2 0.5 0 1 0 0 1 0.5 2 1 ...
$ latBalloon : num 6 2 2 NA 0 0.1 3 0.5 1 0.7 ...
$ countChases : int 5 8 16 4 16 21 18 11 14 28 ...
$ chases : int 95 87 67 923 636 96 1210 571 775 816 ...
$ grpDiameter : num 16.8 23.3 19.5 11.2 29.9 ...
$ grpActiv : num 4908 5164 4197 5263 5377 ...
$ NND : num 0 11.88 8.98 3.6 9.8 ...
I then run my model two ways:
First option.
fit = t.test(df$proximityToPlants[which(df$cd == 1 & df$treatment == 'PN')],
             df$proximityToPlants[which(df$cd == 0 & df$treatment == 'PN')])
Second option, trying to ensure I have a proper data frame: subset the data, then create a matrix.
cdProximityToPlantsPN<-cdSH2015Fall$proximityToPlants[which (cdSH2015Fall$cd==1 & cdSH2015Fall$treatment == 'PN')]
H2ProximityToPlantsPN<-cdSH2015Fall$proximityToPlants[which (cdSH2015Fall$cd==0 & cdSH2015Fall$treatment == 'PN')]
cdProximityToPlantsNN<-cdSH2015Fall$proximityToPlants[which (cdSH2015Fall$cd==1 & cdSH2015Fall$treatment == 'NN')]
H2ProximityToPlantsNN<-cdSH2015Fall$proximityToPlants[which (cdSH2015Fall$cd==0 & cdSH2015Fall$treatment == 'NN')]
Creating a matrix
df <- cbind(cdProximityToPlantsPN, H2ProximityToPlantsPN,
            cdProximityToPlantsNN, H2ProximityToPlantsNN)
mat <- sapply(df,unlist)
fit=t.test(mat[,1],mat[,2], paired = F, var.equal = T)
Yet, I still get errors when assessing outliers using the following:
outlierTest(fit) # Bonferroni p-value for most extreme obs
Error in UseMethod("outlierTest") :
no applicable method for 'outlierTest' applied to an object of class
"htest"
qqPlot(fit, main="QQ Plot") # QQ plot of studentized residuals
Error in order(x[good]) : unimplemented type 'list' in 'orderVector1'
leveragePlots(fit) # leverage plots
Error in formula.default(model) : invalid formula
I know the issue must be with my data structure. Any ideas on how to fix it?
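One way around this, sketched with the column names from the question: car's outlierTest(), qqPlot() and leveragePlots() expect a fitted regression model such as an lm object, not the "htest" object that t.test() returns. A two-sample t-test with equal variances is equivalent to a linear model with a single two-level predictor, so the diagnostics can be run on that model instead (cdSH2015Fall is the data frame read from cdSH2015Fall.csv):
library(car)

pn <- subset(cdSH2015Fall, treatment == 'PN')    # same subset as the t-test
fit.lm <- lm(proximityToPlants ~ cd, data = pn)  # cd is the 2-level factor

summary(fit.lm)                   # coefficient t-value matches the equal-variance t-test
outlierTest(fit.lm)               # Bonferroni p-value for most extreme obs
qqPlot(fit.lm, main = "QQ Plot")  # QQ plot of studentized residuals
leveragePlots(fit.lm)             # leverage plots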
Using matplot, I'm trying to plot the 2nd, 3rd and 4th columns of the airquality data frame after dividing these three columns by its first column.
However I'm getting an error
Error in ncol(xj) : object 'xj' not found
Why are we getting this error? The code below will reproduce this problem.
attach(airquality)
airquality[2:4] <- apply(airquality[2:4], 2, function(x) x /airquality[1])
matplot(x= airquality[,1], y= as.matrix(airquality[-1]))
You have managed to mangle your data in an interesting way. Let's start with airquality before you mess with it. (And please don't use attach() - it's unnecessary and sometimes dangerous/confusing.)
str(airquality)
'data.frame': 153 obs. of 6 variables:
$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
After you do
airquality[2:4] <- apply(airquality[2:4], 2,
function(x) x /airquality[1])
you get
'data.frame': 153 obs. of 6 variables:
$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
$ Solar.R:'data.frame': 153 obs. of 1 variable:
..$ Ozone: num 4.63 3.28 12.42 17.39 NA ...
$ Wind :'data.frame': 153 obs. of 1 variable:
..$ Ozone: num 0.18 0.222 1.05 0.639 NA ...
$ Temp :'data.frame': 153 obs. of 1 variable:
..$ Ozone: num 1.63 2 6.17 3.44 NA ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
or
sapply(airquality,class)
## Ozone Solar.R Wind Temp Month Day
## "integer" "data.frame" "data.frame" "data.frame" "integer" "integer"
that is, you have data frames embedded within your data frame!
rm(airquality) ## clean up
Now change one character and divide by the column airquality[,1] rather than airquality[1] (divide by a vector, not a list of length one ...)
airquality[,2:4] <- apply(airquality[,2:4], 2,
function(x) x/airquality[,1])
matplot(x= airquality[,1], y= as.matrix(airquality[,-1]))
In general it's safer to use [, ...] indexing rather than [] indexing to refer to columns of a data frame unless you really know what you're doing ...
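As an aside, the same row-wise division can be written without apply() at all; a sketch using a fresh copy of the built-in data (the temporary name aq is mine):
aq <- datasets::airquality
aq[, 2:4] <- aq[, 2:4] / aq[, 1]     # the length-153 Ozone vector is recycled down each column
matplot(x = aq[, 1], y = as.matrix(aq[, -1]))
## sweep() is an equivalent, more explicit alternative:
## aq[, 2:4] <- sweep(datasets::airquality[, 2:4], 1, datasets::airquality[, 1], "/")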
I am experiencing a problem when saving data with write.table and reading it back with read.table.
I wrote some code that collects data from thousands of files, does some calculations, and creates a data frame. This data frame has 8 columns and more than 11,000 rows. The columns contain the 8 variables, 3 of which are ordered factors; the other variables are numeric.
When I look at the structure of my data before calling write.table, I get exactly what I expect:
str(data)
'data.frame': 11424 obs. of 8 variables:
$ a_KN : num 8.56e-09 1.11e-08 1.45e-08 1.88e-08 2.45e-08 ...
$ a_DTM : num 5.05e-08 5.12e-08 5.19e-08 5.26e-08 5.33e-08 ...
$ SF : num 5.89 4.6 3.58 2.79 2.18 ...
$ Energy : Ord.factor w/ 6 levels "160"<"800"<"1.4"<..: 1 1 1 1 1 1 1 1 1 1 ...
$ EnergyUnit: Ord.factor w/ 3 levels "MeV"<"GeV"<"TeV": 1 1 1 1 1 1 1 1 1 1 ...
$ Location : Ord.factor w/ 7 levels "BeamImpact"<"WithinBulky"<..: 5 5 5 5 5 5 5 5 5 5 ...
$ Ti : num 0.25 0.25 0.25 0.25 0.25 0.25 1 0.25 1 0.25 ...
$ Tc : num 30 28 26 24 22 20 30 18 28 16 ...
After that I use the usual write.table command to save my file:
write.table(data, file = "filename.txt")
Now, when I read this file back into R and look at the structure, I get this:
mydata <- read.table("filename.txt", header=TRUE)
> str(mydata)
'data.frame': 11424 obs. of 8 variables:
$ a_KN : num 8.56e-09 1.11e-08 1.45e-08 1.88e-08 2.45e-08 ...
$ a_DTM : num 5.05e-08 5.12e-08 5.19e-08 5.26e-08 5.33e-08 ...
$ SF : num 5.89 4.6 3.58 2.79 2.18 ...
$ Energy : num 160 160 160 160 160 160 160 160 160 160 ...
$ EnergyUnit: Factor w/ 3 levels "GeV","MeV","TeV": 2 2 2 2 2 2 2 2 2 2 ...
$ Location : Factor w/ 7 levels "10cmTarget","AdjBulky",..: 4 4 4 4 4 4 4 4 4 4 ...
$ Ti : num 0.25 0.25 0.25 0.25 0.25 0.25 1 0.25 1 0.25 ...
$ Tc : int 30 28 26 24 22 20 30 18 28 16 ...
Do you know how to solve this problem? This also bothers me because I am creating a Shiny app and the changed classes don't fit my purpose.
Thanks!
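The short version of what is happening: write.table() stores only the printed values, so the ordered-factor classes and their level ordering cannot survive the round trip through a plain text file. A hedged sketch of the usual workaround, saving the object in R's native serialized format, which preserves the classes exactly and works well for a Shiny app:
saveRDS(data, file = "filename.rds")
mydata <- readRDS("filename.rds")
str(mydata)   # Energy, EnergyUnit and Location come back as ordered factors
If the file has to stay a plain text table, the factors need to be rebuilt after read.table(), e.g. factor(mydata$EnergyUnit, levels = c("MeV", "GeV", "TeV"), ordered = TRUE), repeating this for each ordered column with its full level list.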