I am trying to run this code:
lm_height<-lmer(Height_cm_JUN~ENTRY+(1|REP), data=ASM_HEIGHT18_CL, REML=FALSE)
But I get this error:
Error in mkRespMod(fr, REML = REMLpass) : response must be numeric
I don't understand what part of my data is not "numeric". Here is a summary of its structure:
$ PLOT : int 1 2 3 4 5 6 7 8 9 10 ...
$ ROW : int 1 1 1 1 1 1 1 1 1 1 ...
$ RANGE : int 1 2 3 4 5 6 7 8 9 10 ...
$ REP : int 1 1 1 1 1 1 1 1 1 1 ...
$ ENTRY : int 989 965 931 936 983 926 969 883 911 897 ...
....
$ Height_cm_JUN: Factor w/ 30 levels "","55","56","58",..: 13 21 17 20 27 17 20 22 15 12 ...
Can someone advise me on what I am doing wrong and how to fix it?
I appreciate it ---many thanks!
Your response is the variable Height_cm_JUN, which has to be numeric (as indicated by the error message) but is a factor instead. You can convert it to numeric by combining as.numeric with as.character (since you want the labels of the factor, not its internal integer codes). Note that the empty-string level "" will become NA:
ASM_HEIGHT18_CL$Height_cm_JUN <- as.numeric(as.character(ASM_HEIGHT18_CL$Height_cm_JUN))
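To see why the as.character step matters, here is a minimal sketch on a toy factor. The vector `height` is made up for illustration and only mimics the shape of Height_cm_JUN (numeric labels plus an empty-string level); it is not the original data.

```r
# Toy factor: numeric labels stored as text, plus one empty-string entry
height <- factor(c("55", "", "58", "56"))

# as.numeric() on a factor returns the internal level codes, not the labels
codes <- as.numeric(height)

# Going through as.character() parses the labels themselves; "" becomes NA
values <- as.numeric(as.character(height))

print(codes)   # level codes: 2 1 4 3
print(values)  # heights: 55 NA 58 56
```

If the empty strings come from blank cells in the source file, reading the file with na.strings = "" in read.csv may avoid the factor in the first place, though that depends on how the data were imported.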
I've read through others who have had a similar issue, but my situation doesn't seem to be the same as the fixes that have been proposed for those other issues. I'm trying to recode a variable using a conditional statement. I want to take a character string & turn it into a numeric so I can subset those observations out into a new data frame. Here's what I have, so far:
blad_mor <- read.csv("blad_mor.csv", header = T)
str(blad_mor)
blad_mor_recode <- gsub(C670:C679, 29010, blad_mor$cod)
I get this output for the str() command:
> str(blad_mor)
'data.frame': 127073 obs. of 12 variables:
$ year : int 1999 1999 1999 1999 1999 1999 1999 1999 1999 1999 ...
$ sex : Factor w/ 4 levels "1","2","F","M": 1 1 1 2 1 2 2 2 2 2 ...
$ race : Factor w/ 17 levels "America","Asian &",..: 4 4 4 4 4 4 4 4 4 4 ...
$ county : Factor w/ 79 levels "COUNTY1","COUNTY2",..: 1 1 1 1 1 1 1 1 1 1 ...
$ cod : Factor w/ 327 levels "C001","C005",..: 89 108 108 294 63 42 172 74 85 269 ...
$ fips : int 1 1 1 1 1 1 1 1 1 1 ...
$ state : int 5 5 5 5 5 5 5 5 5 5 ...
$ race_code : int 2 2 2 2 2 2 2 2 2 2 ...
$ ethnicity : Factor w/ 4 levels "","Hispanic",..: 1 1 1 1 1 1 1 1 1 1 ...
$ ethnicity_code: int NA NA NA NA NA NA NA NA NA NA ...
But when I try the blad_mor_recode <- gsub(C670:C679, 29010, blad_mor$cod) code I get this error:
> blad_mor_recode <- gsub(C670:C679, 29010, blad_mor$cod)
Error in gsub(C670:C679, 29010, blad_mor$cod) : object 'C670' not found
So, I verify that there actually is that object by table(blad_mor$cod) with this being some of the output:
C578 C579 C58 C60 C601 C609 C61 C629 C631 C639 C64 C65 C66 C670 C672 C674 C675 C676
2 43 4 1 1 53 6162 62 1 14 2911 30 47 1 4 1 1 2
C677 C678 C679 C680 C689 C690 C692 C693 C694 C695 C696 C699 C700 C701 C709 C71 C710 C711
1 4 2776 35 77 1 4 5 1 1 8 45 7 3 11 1 29 34
The object 'C670' has one instance as per this output, yet R is telling me it is not there & doesn't run the command. What am I missing here? Should I change the class type from factor to something else? I'm quite confused.
Edit: I have tried quotes around the character strings (e.g. blad_mor_recode <- gsub('C670:C679', '29010', blad_mor$cod)) as well as ifelse(). I still get the same error message.
If you want to change all strings from C70 to C79, you have to use a regular expression (for the codes C670 to C679 in your question, the pattern would be "C67[0-9]"). Something like the following would work:
blad_mor_recode <- gsub("C7[0-9]", "29010", blad_mor$cod)
A simple example:
gsub("C7[0-9]","",c("C60","C70","C78"))
[1] "C60" "" ""
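An unanchored pattern replaces any matching substring, so it can also alter longer codes. The sketch below, on a made-up vector `cod`, shows the difference and uses ^ and $ to match whole codes only, here for the C670 to C679 range from the question.

```r
# Toy codes for illustration; not the original blad_mor data
cod <- factor(c("C670", "C679", "C680", "C700", "C61"))

# Unanchored: "C700" contains "C70", so only part of the code is replaced
partial <- gsub("C7[0-9]", "29010", "C700")
print(partial)  # "290100"

# Anchored: only codes that are exactly C670..C679 are recoded
recoded <- sub("^C67[0-9]$", "29010", as.character(cod))
print(recoded)

# The recoded values can then be used to subset the original vector
bladder <- cod[recoded == "29010"]
```

Converting the factor with as.character first makes the intent explicit; gsub would coerce it anyway, but the result is a character vector either way.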
I am using the package randomForest to produce habitat suitability models for species. I thought everything was working as it should until I started looking at individual trees with getTree(). The documentation (see page 4 of the randomForest vignette) states that for categorical variables, the split point will be an integer, which makes sense. However, in the trees I have looked at for my results, this is not the case.
The data frame I used to build the model was formatted with categorical variables as factors:
> str(df.full)
'data.frame': 27087 obs. of 23 variables:
$ sciname : Factor w/ 2 levels "Laterallus jamaicensis",..: 1 1 1 1 1 1 1 1 1 1 ...
$ estid : Factor w/ 2 levels "7694","psabs": 1 1 1 1 1 1 1 1 1 1 ...
$ pres : Factor w/ 2 levels "1","0": 1 1 1 1 1 1 1 1 1 1 ...
$ stratum : Factor w/ 89 levels "poly_0","poly_1",..: 1 1 1 1 1 1 1 1 1 1 ...
$ ra : Factor w/ 3 levels "high","low","medium": 3 3 3 3 3 3 3 3 3 3 ...
$ eoid : Factor w/ 2 levels "0","psabs": 1 1 1 1 1 1 1 1 1 1 ...
$ avd3200 : num 0.1167 0.0953 0.349 0.1024 0.3765 ...
$ biocl05 : num 330 330 330 330 330 ...
$ biocl06 : num 66 65.8 66 65.8 66 ...
$ biocl08 : num 277 277 277 277 277 ...
$ biocl09 : num 170 170 170 170 170 ...
$ biocl13 : num 186 186 185 186 185 ...
$ cti : num 19.7 19 10.4 16.4 14.7 ...
$ dtnhdwat : num 168 240 39 206 309 ...
$ dtwtlnd : num 0 0 0 0 0 0 0 0 0 0 ...
$ e2em1n99 : num 0 0 0 0 0 0 0 0 0 0 ...
$ ems30_53 : Factor w/ 53 levels "0","602","2206",..: 19 4 17 4 19 19 4 4 19 19 ...
$ ems5607_46: num 0 0 1 0 0.4 ...
$ ksat : num 0.21 0.21 0.21 0.21 0.21 ...
$ lfevh_53 : Factor w/ 53 levels "0","11","16",..: 38 38 38 38 38 38 38 38 38 38 ...
$ ned : num 1.46 1.48 1.54 1.48 1.47 ...
$ soilec : num 14.8 14.8 19.7 14.8 14.8 ...
$ wtlnd_53 : Factor w/ 50 levels "0","3","7","11",..: 4 31 7 31 7 31 7 7 31 31 ...
This was the function call:
# rfStratum and sampSizeVec were previously defined
> rf.full$call
randomForest(x = df.full[, c(7:23)], y = df.full[, 3],
ntree = 2000, mtry = 7, replace = TRUE, strata = rfStratum,
sampsize = sampSizeVec, importance = TRUE, norm.votes = TRUE)
Here are the first 15 lines of an example tree (note that the variables in lines 1, 5, and 15 should be categorical, i.e., they should have integer split values):
> tree100
left daughter right daughter split var split point status prediction
1 2 3 ems30_53 9.007198e+15 1 <NA>
2 4 5 biocl08 2.753206e+02 1 <NA>
3 6 7 biocl06 6.110518e+01 1 <NA>
4 8 9 biocl06 1.002722e+02 1 <NA>
5 10 11 lfevh_53 9.006718e+15 1 <NA>
6 0 0 <NA> 0.000000e+00 -1 0
7 12 13 biocl05 3.310025e+02 1 <NA>
8 14 15 ned 2.814818e+00 1 <NA>
9 0 0 <NA> 0.000000e+00 -1 1
10 16 17 avd3200 4.199712e-01 1 <NA>
11 18 19 e2em1n99 1.724138e-02 1 <NA>
12 20 21 biocl09 1.738916e+02 1 <NA>
13 22 23 ned 8.837864e-01 1 <NA>
14 24 25 biocl05 3.442437e+02 1 <NA>
15 26 27 lfevh_53 9.007199e+15 1 <NA>
Additional information: I encountered this because I was investigating an error I was getting when predicting the results back onto the study area, stating that the types of predictors in the new data did not match those of the training data. I have done 6 other iterations of this model using the same data frame and scripts (just with different subsets of predictors) and have never before gotten this message. The only difference I could find between the randomForest object in this run and those in the other runs is that the rf.full$forest$ncat components are stored as double instead of integer:
> for(i in 1:length(rf.full$forest$ncat)){
+ cat(names(rf.full$forest$ncat)[[i]], ": ", class(rf.full$forest$ncat[[i]]), "\n")
+ }
avd12800 : numeric
cti : numeric
dtnhdwat : numeric
dtwtlnd : numeric
ems2207_99 : numeric
ems30_53 : numeric
ems5807_99 : numeric
hydgrp : numeric
ksat : numeric
lfevh_53 : numeric
ned : numeric
soilec : numeric
wtlnd_53 : numeric
>
> rf.full$forest$ncat
avd12800 cti dtnhdwat dtwtlnd ems2207_99 ems30_53 ems5807_99 hydgrp ksat lfevh_53
1 1 1 1 1 53 1 1 1 53
ned soilec wtlnd_53
1 1 50
However, xlevels (which appears to be a list of the predictor variables used and their types) are all showing the correct datatype for each predictor.
> for(i in 1:length(rf.full$forest$xlevels)){
+ cat(names(rf.full$forest$xlevels)[[i]], ": ", class(rf.full$forest$xlevels[[i]]),"\n")
+ }
avd12800 : numeric
cti : numeric
dtnhdwat : numeric
dtwtlnd : numeric
ems2207_99 : numeric
ems30_53 : character
ems5807_99 : numeric
hydgrp : character
ksat : numeric
lfevh_53 : character
ned : numeric
soilec : numeric
wtlnd_53 : character
# example continuous predictor
> rf.full$forest$xlevels$avd12800
[1] 0
# example categorical predictor
> rf.full$forest$xlevels$ems30_53
[1] "0" "602" "2206" "2207" "4504" "4507" "4702" "4704" "4705" "4706" "4707" "4717" "5207" "5307" "5600"
[16] "5605" "5607" "5616" "5617" "5707" "5717" "5807" "5907" "6306" "6307" "6507" "6600" "7002" "7004" "9107"
[31] "9116" "9214" "9307" "9410" "9411" "9600" "4607" "4703" "6402" "6405" "6407" "6610" "7005" "7102" "7104"
[46] "7107" "9000" "9104" "9106" "9124" "9187" "9301" "9505"
The ncat component is simply a vector of the number of categories per variable with 1 for continuous variables (as noted here), so it doesn't seem like it should matter if that is stored as an integer or a double, but it seems like this might all be related.
Questions
1) Shouldn't the split point for categorical predictors in any given tree of a randomForest forest be an integer, and if yes, any thoughts as to why factors in the data frame used as input to the randomForest call here are not being used as such?
2) Does the number type (double vs integer) of the ncat component of a randomForest object matter in any way related to model building, and any thoughts as to what could cause this to switch from integer in the first 6 runs to double in this last run (with each run containing different subsets of the same data)?
The randomForest::randomForest algorithm encodes low-cardinality (up to 32 categories) and high-cardinality (32 to 64? categories) categorical splits differently. Note that all of your "problematic" features belong to the latter class and are encoded using 64-bit floating-point values.
While the console output doesn't make sense to a human observer, the randomForest model object itself is correct (i.e., it treats those variables as categorical) and makes correct predictions.
If you want to investigate the structure of decision tree and decision tree ensemble models, you might consider exporting them to the PMML data format. For example, you can use the r2pmml package for this:
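As I understand the getTree documentation, a categorical split point is a number whose binary expansion says which categories go to the left daughter. The helper below, `decode_split`, is a hypothetical name and not part of randomForest; it is a sketch that assumes the split value is small enough to be represented exactly as a double, which may not hold for the 53-level predictors shown above.

```r
# Hypothetical helper: list the category indices whose bit is set in split_point
decode_split <- function(split_point, n_levels) {
  # Bit k (0-based) of the split point corresponds to category k + 1
  bits <- floor(split_point / 2^(0:(n_levels - 1))) %% 2
  which(bits == 1)
}

# 13 = 1*2^0 + 0*2^1 + 1*2^2 + 1*2^3, so categories 1, 3 and 4 go left
left <- decode_split(13, 4)
print(left)
```

The indices refer to positions in the predictor's level vector (for example rf.full$forest$xlevels$ems30_53), so they can be mapped back to labels by indexing into it.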
library("r2pmml")
r2pmml(rf.full, "MyRandomForest.pmml")
Then open MyRandomForest.pmml in a text editor, and you will have a nice overview of the internals of your model (branches, split conditions, leaf values, etc.).
I have a large data set, which I reduced applying gsub multiple times, basically in this form:
levels(Orders$Im) <- gsub("Osp", "OsProf", levels(Orders$Im))
I also used rbind:
DI_Reduced <- rbind(CX, OsP)
I need to apply function "tree" to the resulting data.frame, but I get an error:
tree.model <- tree(line ~ CountryCode + OrderType + Support, data=train.set)
The error is:
Error in tree(line ~ CountryCode + OrderType + Support, :
factor predictors must have at most 32 levels
Strange thing: if I export the train.set with write.csv and then I re-import it with read.csv, the error disappears and the tree is built.
I investigated the structure of the train.set and this is the difference before and after exporting/importing it:
$ CustomerNumber: Factor w/ 4616 levels "0","101959","210285",..: 3070 3069 4539 3732 2573 3086 2973 3817 4056 2956 ...
$ CountryCode : Factor w/ 4 levels "OtherCountry",..: 3 3 4 4 3 3 3 4 4 3 ...
$ OrderType : Factor w/ 5 levels "Order","NewOrder",..: 5 5 5 5 5 5 5 5 5 5 ...
$ Support : Factor w/ 5 levels "#N/A","BN",..: 4 4 4 4 2 4 4 4 4 4 ...
$ Manuf : Factor w/ 163 levels "<Generic>","6gi",..: 52 52 52 52 52 52 52 52 52 52 ...
$ line : Factor w/ 623 levels "\"Generic\" Skews",..: 400 35 400 400 400 400 400 400 400 400 ...
________________________________________________________________
$ CustomerNumber: Factor w/ 692 levels "201500","20202",..: 361 360 680 499 138 367 315 523 592 304 ...
$ CountryCode : Factor w/ 2 levels "JP","US": 1 1 2 2 1 1 1 2 2 1 ...
$ OrderType : Factor w/ 1 level "Online": 1 1 1 1 1 1 1 1 1 1 ...
$ Support : Factor w/ 4 levels "BN","MC",..: 3 3 3 3 1 3 3 3 3 3 ...
$ Manuf : Factor w/ 1 level "DY": 1 1 1 1 1 1 1 1 1 1 ...
$ line : Factor w/ 2 levels "CX","OTX": 2 1 2 2 2 2 2 2 2 2 ...
It seems to me that gsub does not really subset the original data.frame, and the hidden values stay in train.set until I export and re-import it. Is there another way to do this operation and obtain a tree?
As the error says, your dependent variable line has more than 32 levels. As per your train.set structure, line is a Factor w/ 623 levels.
Try using other tree libraries like rpart.
Re-creating the factors after subsetting, which drops their unused levels, might help:
train.set[] <- lapply(train.set, function(x) if (is.factor(x)) factor(x) else x)
Also, gsub is not usually used for subsetting; it is a global substitution function. You should also share the pre-processing steps you followed so that others can help you better.
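The export/import behaviour in the question is consistent with unused factor levels surviving a subset: read.csv rebuilds factors from the values actually present. A minimal sketch with made-up data (`orders` and its columns are illustrative only) shows the effect and the base R droplevels function as an alternative to the CSV round trip.

```r
# Toy data: a factor with four levels
orders <- data.frame(line = factor(c("CX", "OTX", "AB", "CD")), n = 1:4)

# Subsetting rows keeps the full level set, even for values no longer present
train <- orders[orders$line %in% c("CX", "OTX"), ]
levels_before <- nlevels(train$line)

# droplevels() removes unused levels from every factor column
train <- droplevels(train)
levels_after <- nlevels(train$line)

print(c(before = levels_before, after = levels_after))
```

After dropping levels, str(train.set) should report only the levels that actually occur, which is what tree checks against its 32-level limit.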
I've used aregImpute to impute the missing values, then I used the impute.transcan function to try to get a complete dataset, using the following code.
impute_arg <- aregImpute(~ age + job + marital + education + default +
balance + housing + loan + contact + day + month + duration + campaign +
pdays + previous + poutcome + y , data = mov.miss, n.impute = 10 , nk =0)
imputed <- impute.transcan(impute_arg, imputation=1, data=mov.miss, list.out=TRUE, pr=FALSE, check=FALSE)
y <- completed[names(imputed)]
When I use str(y), it gives me a data frame, but one that still contains NAs, as if nothing had been imputed. My question is: how do I get a complete dataset without NAs after imputation?
str(y)
'data.frame': 4521 obs. of 17 variables:
$ age : int 30 NA 35 30 NA 35 36 39 41 43 ...
$ job : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
$ marital : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
$ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 NA 2 3 1 ...
$ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 NA 1 1 1 ...
$ balance : int NA 4789 1350 1476 0 747 307 147 NA -88 ...
$ housing : Factor w/ 2 levels "no","yes": NA 2 2 2 NA 1 2 2 2 2 ...
$ loan : Factor w/ 2 levels "no","yes": 1 2 1 2 NA 1 1 NA 1 2 ...
$ contact : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 NA 1 ...
$ day : int 19 NA 16 3 5 23 14 6 14 NA ...
$ month : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 NA 9 9 1 ...
$ duration : int 79 220 185 199 226 141 341 151 57 313 ...
$ campaign : int 1 1 1 4 1 2 1 2 2 NA ...
$ pdays : int -1 339 330 NA -1 176 330 -1 -1 NA ...
$ previous : int 0 4 NA 0 NA 3 2 0 0 2 ...
$ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
$ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
I have tested your code myself, and it works just fine, except for the last line:
y <- completed[names(imputed)]
I believe there's a typo in the above line. Plus, you do not even need completed.
Besides, if you want to get a data.frame from the impute.transcan function, then wrap it with as.data.frame:
imputed <- as.data.frame(impute.transcan(impute_arg, imputation=1, data=mov.miss, list.out=TRUE, pr=FALSE, check=FALSE))
Moreover, if you need to test your missing data pattern, you can also use the md.pattern function provided by the mice package.
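A quick base R check can confirm whether any NAs remain after imputation. The sketch below runs on a made-up data frame `toy`, which stands in for the imputed result and is not the output of impute.transcan.

```r
# Toy stand-in for a data frame that may still contain missing values
toy <- data.frame(
  age = c(30, NA, 35),
  job = factor(c("admin.", "services", NA))
)

# Number of missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Single TRUE/FALSE answer: is the whole data frame free of NAs?
all_complete <- !anyNA(toy)
print(all_complete)
```

Running the same two checks on the imputed data frame should give all zeros and TRUE if the imputation filled every gap.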
Here are the steps I'm following to do a Multinomial Linear Regression.
> z<-read.table("2008 Racedata.txt", header=TRUE, sep="\t", row.names=NULL)
> head(z)
datekey raceno horseno place winner draw winodds log_odds jwt hwt
1 2008091501 1 8 1 1 2 12.0 2.484907 128 1170
2 2008091501 1 11 2 0 3 8.6 2.151762 123 1135
3 2008091501 1 6 3 0 5 7.0 1.945910 127 1114
4 2008091501 1 12 4 0 10 23.0 3.135494 123 1018
5 2008091501 1 14 5 0 4 11.0 2.397895 113 1027
6 2008091501 1 5 6 0 14 50.0 3.912023 131 972
> x<-mlogit.data(z,choice="winner",shape="long",id.var="datekey",alt.var="horseno")
Error in `row.names<-.data.frame`(`*tmp*`, value = c("1.8", "1.11", "1.6", :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘10.2’, ‘10.4’, ‘10.8’,
‘100.7’, ‘101.12’, ‘102.1’, ‘102.3’, ‘103.2’, ‘103.4’,
‘103.6’, ‘104.12’, ‘104.3’, ‘104.9’, ‘105.1’, ‘105.5’,
‘105.6’, ‘105.8’, ‘106.11’, ‘106.12’, ‘106.13’, ‘106.7’,
‘107.10’, ‘107.14’, ‘107.3’, ‘108.12’, ‘108.2’, ‘108.6’,
‘108.9’, ‘109.1’, ‘109.14’, ‘109.7’, ‘11.12’, ‘11.5’,
‘11.9’, ‘110.2’, ‘110.3’, ‘110.4’, ‘110.9’, ‘111.1’,
‘111.7’, ‘112.12’, ‘112.3’, ‘112.6’, ‘112.8’, ‘113.10’,
‘113.13’, ‘113.7’, ‘114.12’, ‘114.2’, ‘114.9’, ‘115.10’,
‘115.13’, ‘115.5’, ‘116.11’, ‘116.6’, ‘117.14’, ‘117.3’,
‘117.7’, ‘118.1’, ‘118.13’, ‘118.2’, ‘118.9’, ‘119.10’,
‘119.5’, ‘119.6’, ‘119.8’, ‘12.1’, ‘12.10’, ‘12.3’,
‚Äò12.6‚Äô, ‚Äò120.2‚Äô, ‚Äò120.4‚Äô, ‚Äò120.7‚ [... truncated]
>
What step am I missing here? Why the duplicates in row.names?
Thanks,
Walt
Two problems.
You seem to have an encoding problem, since we are seeing lots of umlauts and accent marks in that error message. Furthermore, I am wondering whether the datekey column got converted into a factor.
In this case it is referring to an error in constructing the row.names attribute of the new object, x. If you do:
with( z, table( datekey, horseno) )
... you may see a horse with multiple entries on the same day.
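For a large table, scanning the cross-tabulation by eye is impractical. A sketch using base R's duplicated, on a made-up data frame `z` that only borrows the column names from the question, lists the offending rows directly.

```r
# Toy data: horse 8 appears twice under the same datekey
z <- data.frame(
  datekey = c(2008091501, 2008091501, 2008091501, 2008091502),
  horseno = c(8, 11, 8, 8)
)

# Any cell above 1 in the cross-tabulation signals a repeated combination
tab <- with(z, table(datekey, horseno))
has_dups <- any(tab > 1)

# Flag every row involved in a repeated datekey/horseno pair
key <- z[c("datekey", "horseno")]
dups <- z[duplicated(key) | duplicated(key, fromLast = TRUE), ]
print(dups)
```

An empty `dups` would mean the datekey and horseno pairs are unique, in which case the duplicate row names must come from somewhere else in how the index is built.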
Actually, there were no duplicate datekey x horseno combinations. Converting horseno and datekey to character and then switching the shape argument from "long" to "wide" runs without error and gives this result:
z$datekey <- as.character(z$datekey)
z$horseno <- as.character(z$horseno)
x<-mlogit.data(z,choice="winner",shape="wide",id.var="datekey",alt.var="horseno")
str(x)
#----------
Classes ‘mlogit.data’ and 'data.frame': 18312 obs. of 11 variables:
$ datekey : Factor w/ 733 levels "2008091501","2008091502",..: 1 1 1 1 1 1 1 1 1 1 ...
$ raceno : int 1 1 1 1 1 1 1 1 1 1 ...
$ horseno : chr "0" "1" "0" "1" ...
$ place : int 1 1 2 2 3 3 4 4 5 5 ...
$ winner : logi FALSE TRUE TRUE FALSE TRUE FALSE ...
$ draw : int 2 2 3 3 5 5 10 10 4 4 ...
$ winodds : num 12 12 8.6 8.6 7 7 23 23 11 11 ...
$ log_odds: num 2.48 2.48 2.15 2.15 1.95 ...
$ jwt : int 128 128 123 123 127 127 123 123 113 113 ...
$ hwt : int 1170 1170 1135 1135 1114 1114 1018 1018 1027 1027 ...
$ chid : num 1 1 2 2 3 3 4 4 5 5 ...
- attr(*, "index")='data.frame': 18312 obs. of 3 variables:
..$ chid: Factor w/ 9156 levels "1","2","3","4",..: 1 1 2 2 3 3 4 4 5 5 ...
..$ alt : Factor w/ 2 levels "0","1": 1 2 1 2 1 2 1 2 1 2 ...
..$ id : Factor w/ 733 levels "2008091501","2008091502",..: 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "choice")= chr "winner"