I ran a glmer model and got the following error message: "Error in model.frame.default(data = data.density.EM.gra, weights = number_of_nest.boxes, : variable lengths differ (found for 'year')". I don't understand what this means, despite reading a number of different posts about the same error.
Here is my model:
model.1.EM.gra<-glmer(cbind(data.density$number.nest.boxes.occupied.that.year,data.density$number_of_nest.boxes)~ caterpillar.sc +(1|year),data = data.density.EM.gra,weights = number_of_nest.boxes,family = binomial)
I appreciate any suggestions you may have.
setwd("~/Word/UQAM/Master's_Reale/DATA/Blue tits data and instructions/csv") # work station
install.packages("dplyr")
#calling libraries.
library(dplyr)
library (reprex)
library(lme4)
data.density<-read.csv ("nest_box_caterpillar_density.csv")
data.density$year<-factor (data.density$year)# making year a factor (categorical variable)
str(data.density) # now we see year as a factor in the data.
#> 'data.frame': 63 obs. of 16 variables:
#> $ year : Factor w/ 9 levels "2011","2012",..: 1 2 3 4 5 6 7 8 9 1 ...
#> $ number.nest.boxes.occupied.that.year: int 17 13 12 16 16 16 15 17 12 17 ...
#> $ number_of_nest.boxes : int 20 20 20 20 20 20 20 20 20 30 ...
#> $ failure : int 3 3 3 3 3 3 3 3 3 13 ...
#> $ proportion_occupied_boxes : num 0.85 0.65 0.6 0.8 0.8 0.8 0.75 0.85 0.6 0.57 ...
#> $ site : Factor w/ 7 levels "ari","ava","fel",..: 5 5 5 5 5 5 5 5 5 1 ...
#> $ population : Factor w/ 3 levels "D-Muro","E-Muro",..: 2 2 2 2 2 2 2 2 2 2 ...
#> $ mean_yearly_frass : num 295 231 437 263 426 ...
#> $ site_ID : Factor w/ 63 levels "2011_ari_","2011_ava_",..: 5 12 19 26 33 40 47 54 61 1 ...
#> $ exploration_avg : num 13.28 14.19 9.85 9.42 8.67 ...
#> $ X : logi NA NA NA NA NA NA ...
#> $ X.1 : logi NA NA NA NA NA NA ...
#> $ X.2 : Factor w/ 2 levels "","failure means the total number of nest boxes -the number of nest boxes occupied. ": 1 1 1 1 1 1 1 1 1 2 ...
#> $ X.3 : logi NA NA NA NA NA NA ...
#> $ X.4 : logi NA NA NA NA NA NA ...
#> $ X.5 : Factor w/ 5 levels "","1 column with number of nest boxes used. ",..: 1 1 4 3 1 2 5 1 1 1 ...
#making new objects
density<-data.density$proportion_occupied_boxes # making a new object called density
caterpillar<-data.density$mean_yearly_frass # making new object called caterpillar
caterpillar.sc<-scale(caterpillar)
data.density.EM<-filter(data.density,population=='E-Muro') # data for population 'E-Muro'
data.density.EM.gra<-filter(data.density.EM,site=='gra') # data for site gra in the E-Muro population.
View(data.density.EM.gra)
model.1.EM.gra<-glmer(cbind(data.density$number.nest.boxes.occupied.that.year,data.density$number_of_nest.boxes)~ caterpillar.sc +(1|year),
data = data.density.EM.gra,
weights = number_of_nest.boxes,
family = binomial)
#> Error in model.frame.default(data = data.density.EM.gra, weights = number_of_nest.boxes, : variable lengths differ (found for 'year')
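For what it's worth, this message generally means that the variables used by the model do not all have the same number of rows. In the call above, the response is built from data.density (63 rows) and caterpillar.sc is a standalone object scaled from the full data, while year is looked up in the filtered data.density.EM.gra. Below is a minimal sketch that reproduces the same kind of failure with base R's model.frame. The objects d, sub, x_outside and x_sc are made up for illustration and are not part of the original data.

```r
# Toy data: 10 rows, two groups of 5 (hypothetical, for illustration only)
d <- data.frame(y = c(1, 3, 2, 5, 4, 6, 8, 7, 9, 10),
                g = gl(2, 5))

# A standalone vector computed from the FULL data (length 10)
x_outside <- as.numeric(scale(d$y))

# A filtered subset (5 rows)
sub <- d[d$g == "1", ]

# y is found in `sub` (5 values) but x_outside comes from the workspace
# (10 values), so model.frame() stops with "variable lengths differ"
msg <- tryCatch(
  model.frame(y ~ x_outside, data = sub),
  error = function(e) conditionMessage(e)
)
print(msg)

# Computing the covariate inside the subset keeps all lengths equal
sub$x_sc <- as.numeric(scale(sub$y))
mf <- model.frame(y ~ x_sc, data = sub)
nrow(mf)
```

If that reading is right, the thing to try in the original call would be to store the scaled covariate as a column of the filtered data frame and refer to all columns by bare name rather than through data.density$.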
I am practicing with this dataset: http://archive.ics.uci.edu/ml/datasets/Census+Income
I loaded training & testing data.
# Downloading train and test data
trainFile = "adult.data"; testFile = "adult.test"
if (!file.exists (trainFile))
download.file (url = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
destfile = trainFile)
if (!file.exists (testFile))
download.file (url = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
destfile = testFile)
# Assigning column names
colNames = c ("age", "workclass", "fnlwgt", "education",
"educationnum", "maritalstatus", "occupation",
"relationship", "race", "sex", "capitalgain",
"capitalloss", "hoursperweek", "nativecountry",
"incomelevel")
# Reading training data
training = read.table (trainFile, header = FALSE, sep = ",",
strip.white = TRUE, col.names = colNames,
na.strings = "?", stringsAsFactors = TRUE)
# Load the testing data set
testing = read.table (testFile, header = FALSE, sep = ",",
strip.white = TRUE, col.names = colNames,
na.strings = "?", fill = TRUE, stringsAsFactors = TRUE)
I need to combine the two into one, but there is a problem: the structure of the two data frames is not the same.
Display structure of the training data
> str (training)
'data.frame': 32561 obs. of 15 variables:
$ age : int 39 50 38 53 28 37 49 52 31 42 ...
$ workclass : Factor w/ 8 levels "Federal-gov",..: 7 6 4 4 4 4 4 6 4 4 ...
$ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
$ education : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
$ educationnum : int 13 13 9 7 13 14 5 9 14 13 ...
$ maritalstatus: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
$ occupation : Factor w/ 14 levels "Adm-clerical",..: 1 4 6 6 10 4 8 4 10 4 ...
$ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
$ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
$ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
$ capitalgain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
$ capitalloss : int 0 0 0 0 0 0 0 0 0 0 ...
$ hoursperweek : int 40 13 40 40 40 40 16 45 50 40 ...
$ nativecountry: Factor w/ 41 levels "Cambodia","Canada",..: 39 39 39 39 5 39 23 39 39 39 ...
$ incomelevel : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
Display structure of the testing data
> str (testing)
'data.frame': 16282 obs. of 15 variables:
$ age : Factor w/ 74 levels "|1x3 Cross validator",..: 1 10 23 13 29 3 19 14 48 9 ...
$ workclass : Factor w/ 9 levels "","Federal-gov",..: 1 5 5 3 5 NA 5 NA 7 5 ...
$ fnlwgt : int NA 226802 89814 336951 160323 103497 198693 227026 104626 369667 ...
$ education : Factor w/ 17 levels "","10th","11th",..: 1 3 13 9 17 17 2 13 16 17 ...
$ educationnum : int NA 7 9 12 10 10 6 9 15 10 ...
$ maritalstatus: Factor w/ 8 levels "","Divorced",..: 1 6 4 4 4 6 6 6 4 6 ...
$ occupation : Factor w/ 15 levels "","Adm-clerical",..: 1 8 6 12 8 NA 9 NA 11 9 ...
$ relationship : Factor w/ 7 levels "","Husband","Not-in-family",..: 1 5 2 2 2 5 3 6 2 6 ...
$ race : Factor w/ 6 levels "","Amer-Indian-Eskimo",..: 1 4 6 6 4 6 6 4 6 6 ...
$ sex : Factor w/ 3 levels "","Female","Male": 1 3 3 3 3 2 3 3 3 2 ...
$ capitalgain : int NA 0 0 0 7688 0 0 0 3103 0 ...
$ capitalloss : int NA 0 0 0 0 0 0 0 0 0 ...
$ hoursperweek : int NA 40 50 40 40 30 30 40 32 40 ...
$ nativecountry: Factor w/ 41 levels "","Cambodia",..: 1 39 39 39 39 39 39 39 39 39 ...
$ incomelevel : Factor w/ 3 levels "","<=50K.",">50K.": 1 2 2 3 3 2 2 2 3 2 ...
Problem 1:
age has become a factor in testing, and every other factor in testing has one more level than the corresponding factor in training. This is because the first row of testing is an unnecessary line:
|1x3 Cross validator
I tried to get rid of it by re-assigning testing:
testing = testing[-1,]
but after running str() again, I don't see any change.
Problem 2:
As I said earlier, I need to combine those two data frames into one, so I ran this:
combined <- rbind(training , testing)
Besides Problem 1, I can see a new problem after running str():
> str(combined)
'data.frame': 48842 obs. of 15 variables:
$ age : chr "39" "50" "38" "53" ...
$ workclass : Factor w/ 9 levels "Federal-gov",..: 7 6 4 4 4 4 4 6 4 4 ...
$ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
$ education : Factor w/ 17 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
$ educationnum : int 13 13 9 7 13 14 5 9 14 13 ...
$ maritalstatus: Factor w/ 8 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
$ occupation : Factor w/ 15 levels "Adm-clerical",..: 1 4 6 6 10 4 8 4 10 4 ...
$ relationship : Factor w/ 7 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
$ race : Factor w/ 6 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
$ sex : Factor w/ 3 levels "Female","Male",..: 2 2 2 2 1 1 1 2 1 2 ...
$ capitalgain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
$ capitalloss : int 0 0 0 0 0 0 0 0 0 0 ...
$ hoursperweek : int 40 13 40 40 40 40 16 45 50 40 ...
$ nativecountry: Factor w/ 42 levels "Cambodia","Canada",..: 39 39 39 39 5 39 23 39 39 39 ...
$ incomelevel : Factor w/ 5 levels "<=50K",">50K",..: 1 1 1 1 1 1 1 2 2 2 ...
The target variable (incomelevel) has 5 factor levels in the combined data frame, whereas it has 2 (which is correct) in the training data frame and 3 (one extra because of Problem 1) in the testing data frame. This is because there is a . (dot) after each incomelevel value in the testing data frame (<=50K., <=50K., >50K., ...). So I need to remove that dot, but I have no idea how. Is there a function for this?
I am very new to data analysis and R, which is why I'm running into these basic issues. Can you please help me solve them?
I think you can skip the first line of the test file. It is not data (it looks like a header), and skipping it solves the issue of age being read as a factor:
head(readLines(testFile))
[1] "|1x3 Cross validator"
[2] "25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, Own-child, Black, Male, 0, 0, 40, United-States, <=50K."
[3] "38, Private, 89814, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K."
Keeping the rest of your code the same, we can use read.csv with skip = 1 for the test file:
colNames = c ("age", "workclass", "fnlwgt", "education",
"educationnum", "maritalstatus", "occupation",
"relationship", "race", "sex", "capitalgain",
"capitalloss", "hoursperweek", "nativecountry",
"incomelevel")
# Reading training data
training = read.csv (trainFile, header = FALSE, col.names = colNames,stringsAsFactors = TRUE,na.strings = "?",strip.white = TRUE)
testing = read.csv (testFile, header = FALSE, col.names = colNames,na.strings = "?",stringsAsFactors = TRUE,skip=1,strip.white = TRUE)
Now the income level. Unfortunately we have to correct it manually; it's a good thing you checked:
testing$incomelevel = factor(gsub("\\.","",as.character(testing$incomelevel)))
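As a small variation on the line above, the trailing dot can also be removed from the factor's levels directly, which avoids converting the whole column to character and back. This is only a sketch on a made-up vector (income is a hypothetical stand-in for testing$incomelevel):

```r
# Hypothetical stand-in for testing$incomelevel
income <- factor(c("<=50K.", ">50K.", "<=50K.", "<=50K."),
                 levels = c("<=50K.", ">50K."))

# Strip a single trailing dot from each level label;
# "\\.$" matches a literal dot at the end of the string
levels(income) <- sub("\\.$", "", levels(income))

levels(income)
```

Anchoring the pattern with $ means only a dot at the very end of a label is touched, so a value that contained a dot elsewhere would be left alone.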
We check the levels; the only difference is in native country:
all.equal(sapply(testing,levels) ,sapply(training,levels))
[1] "Component “nativecountry”: Lengths (40, 41) differ (string compare on first 40)"
[2] "Component “nativecountry”: 26 string mismatches"
I don't think there's much you can do about that; maybe you'll have to remove it before or after joining:
setdiff(levels(training$nativecountry),levels(testing$nativecountry))
[1] "Holand-Netherlands"
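If dropping the level is not desirable, one other option is to give the test factor the same set of levels as the training factor, so that the missing category simply has a count of zero. A minimal sketch on toy data follows; train and test here are small made-up data frames, not the census files:

```r
# Toy data frames (hypothetical); the test set lacks one country
train <- data.frame(country = factor(c("Canada", "Holand-Netherlands", "Cuba")))
test  <- data.frame(country = factor(c("Cuba", "Canada")))

# Re-declare the test factor using the training levels
test$country <- factor(test$country, levels = levels(train$country))

# The level sets now agree, and the unseen level has zero rows in test
identical(levels(test$country), levels(train$country))
table(test$country)

# rbind keeps a single, shared set of levels
combined <- rbind(train, test)
nlevels(combined$country)
```

This assumes every value in the test set already appears among the training levels; any value that does not would become NA.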
I've read through others who have had a similar issue, but my situation doesn't seem to be the same as the fixes that have been proposed for those other issues. I'm trying to recode a variable using a conditional statement. I want to take a character string & turn it into a numeric so I can subset those observations out into a new data frame. Here's what I have, so far:
blad_mor <- read.csv("blad_mor.csv", header = T)
str(blad_mor)
blad_mor_recode <- gsub(C670:C679, 29010, blad_mor$cod)
I get this output for the str() command:
> str(blad_mor)
'data.frame': 127073 obs. of 12 variables:
$ year : int 1999 1999 1999 1999 1999 1999 1999 1999 1999 1999 ...
$ sex : Factor w/ 4 levels "1","2","F","M": 1 1 1 2 1 2 2 2 2 2 ...
$ race : Factor w/ 17 levels "America","Asian &",..: 4 4 4 4 4 4 4 4 4 4 ...
$ county : Factor w/ 79 levels "COUNTY1","COUNTY2",..: 1 1 1 1 1 1 1 1 1 1 ...
$ cod : Factor w/ 327 levels "C001","C005",..: 89 108 108 294 63 42 172 74 85 269 ...
$ fips : int 1 1 1 1 1 1 1 1 1 1 ...
$ state : int 5 5 5 5 5 5 5 5 5 5 ...
$ race_code : int 2 2 2 2 2 2 2 2 2 2 ...
$ ethnicity : Factor w/ 4 levels "","Hispanic",..: 1 1 1 1 1 1 1 1 1 1 ...
$ ethnicity_code: int NA NA NA NA NA NA NA NA NA NA ...
But when I try the blad_mor_recode <- gsub(C670:C679, 29010, blad_mor$cod) code I get this error:
> blad_mor_recode <- gsub(C670:C679, 29010, blad_mor$cod)
Error in gsub(C670:C679, 29010, blad_mor$cod) : object 'C670' not found
So, I verify that there actually is that object by table(blad_mor$cod) with this being some of the output:
C578 C579 C58 C60 C601 C609 C61 C629 C631 C639 C64 C65 C66 C670 C672 C674 C675 C676
2 43 4 1 1 53 6162 62 1 14 2911 30 47 1 4 1 1 2
C677 C678 C679 C680 C689 C690 C692 C693 C694 C695 C696 C699 C700 C701 C709 C71 C710 C711
1 4 2776 35 77 1 4 5 1 1 8 45 7 3 11 1 29 34
The object 'C670' has one instance as per this output, yet R is telling me it is not there & doesn't run the command. What am I missing here? Should I change the class type from factor to something else? I'm quite confused.
Edit: I have tried quotes around the character strings (e.g. blad_mor_recode <- gsub('C670:C679', '29010', blad_mor$cod)) as well as ifelse(). I still get the same error message.
If you want to change all strings from C70 to C79, you have to use a regular expression. Something like the following would work:
blad_mor_recode <- gsub("C7[0-9]", "29010", blad_mor$cod)
A simple example:
gsub("C7[0-9]","",c("C60","C70","C78"))
[1] "C60" "" ""
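Some background on the original error: an unquoted C670 is looked up as a variable name, which is why R reports that the object is not found, and a quoted 'C670:C679' is treated as a literal pattern rather than a range. For the range the question actually asks about (C670 to C679), a pattern anchored at both ends may be closer to what is wanted. The following is a sketch on a made-up vector; cod, recoded and in_range are illustrative names only:

```r
# Hypothetical stand-in for blad_mor$cod
cod <- factor(c("C66", "C670", "C679", "C680", "C6791"))

# Work on the labels as character strings
cod_chr <- as.character(cod)

# ^ and $ force the whole code to match, so "C6791" is left untouched
recoded <- sub("^C67[0-9]$", "29010", cod_chr)
recoded

# The same pattern gives a logical index that can be used for subsetting
in_range <- grepl("^C67[0-9]$", cod_chr)
in_range
```

With the logical index, the matching rows could be pulled out directly (for example blad_mor[in_range, ] on the real data) without recoding first.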
I know this may be a potential duplicate question, but I found other answers didn't work in my situation.
I am using the following dataset:
> str(total_data)
'data.frame': 32260 obs. of 13 variables:
$ age : int 40 42 44 32 25 31 30 30 27 28 ...
$ workclass : Factor w/ 4 levels "Other-Unknown",..: 3 2 2 1 2 2 2 3 2 3 ...
$ education : Ord.factor w/ 7 levels "1"<"2"<"3"<"4"<..: 2 3 2 2 2 3 2 2 2 2 ...
$ marital.status : Factor w/ 5 levels "Divorced","Married",..: 2 1 2 3 3 3 3 2 2 3 ...
$ occupation : Factor w/ 6 levels "Blue-Collar",..: 5 3 6 2 1 6 6 1 1 6 ...
$ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 1 5 1 1 5 5 5 5 5 5 ...
$ sex : Factor w/ 2 levels "Female","Male": 2 2 2 1 2 2 2 2 1 1 ...
$ hours.per.week : int 84 40 40 38 40 38 48 70 35 38 ...
$ naitive.country: Factor w/ 41 levels "?","Cambodia",..: 39 39 39 39 39 39 39 12 39 39 ...
$ classifier : chr "<=50K" "<=50K" ">50K" "<=50K" ...
$ class_num : Factor w/ 2 levels "1","2": 1 1 2 1 1 1 1 2 1 1 ...
$ age_norm : num 0.315 0.342 0.37 0.205 0.11 ...
$ hours_norm : num 0.847 0.398 0.398 0.378 0.398 ...
I'm trying to encode the factors into binary using one_hot() but receive the following error message:
encoded_data <- one_hot(total_data, dropCols = FALSE)
ERROR MESSAGE:
Error in `[.data.frame`(dt, , cols, with = FALSE) :
unused argument (with = FALSE)
I'm not sure what the "with" argument is as I don't see it in the R documentation.
I also saw that someone suggested to use model.matrix. However, when I use that, my ordered factor gets encoded as well, which is what I'm trying to avoid.
This is what happens to my ordered factor variable:
education.L education.Q education.C education^4 education^5 education^6
-3.779645e-01 9.690821e-17 4.082483e-01 -0.5640761 4.364358e-01 -0.19738551
-1.889822e-01 -3.273268e-01 4.082483e-01 0.0805823 -5.455447e-01 0.49346377
I'm also not sure why there are sometimes letters or numbers after the attribute name, i.e. education.L vs. education^5.
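Convert the data.frame into a data.table and it should work fine. one_hot() (presumably the one from the mltools package) expects a data.table, and the with = FALSE in the error message is an argument of data.table's subsetting method that a plain data.frame does not accept.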
library(data.table)
dt = data.table(total_data)
one_hot(dt)
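On the side question about the column names: the .L, .Q, .C and ^4 suffixes are the linear, quadratic, cubic and higher-order polynomial contrasts that R applies to ordered factors by default (contr.poly), which is why model.matrix transforms education. If a base R route is preferred, the following is a rough sketch that one-hot encodes only the unordered factors and leaves ordered factors and numeric columns alone. The data frame df is made up, and the sketch assumes there are no missing values in the factor columns:

```r
# Toy data (hypothetical): one nominal factor, one ordered factor, one numeric
df <- data.frame(
  sex       = factor(c("Female", "Male", "Male")),
  education = factor(c("1", "2", "3"), levels = c("1", "2", "3"), ordered = TRUE),
  age       = c(40, 42, 44)
)

# Flag columns that are factors but NOT ordered
is_nominal <- vapply(df, function(col) is.factor(col) && !is.ordered(col),
                     logical(1))

# Build one indicator matrix per nominal factor;
# "- 1" drops the intercept so every level gets its own 0/1 column
dummies <- lapply(names(df)[is_nominal], function(nm) {
  m <- model.matrix(~ x - 1, data = data.frame(x = df[[nm]]))
  colnames(m) <- paste(nm, levels(df[[nm]]), sep = "_")
  m
})

# Untouched columns first, then the indicator columns
encoded <- cbind(df[!is_nominal], do.call(cbind, dummies))
encoded
```

Rows with NA in a factor would be dropped by model.matrix under the default NA handling, so missing values would need to be dealt with before using something like this.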
I've used aregImpute to impute the missing values, then I used the impute.transcan function to try to get a complete dataset, using the following code.
impute_arg <- aregImpute(~ age + job + marital + education + default +
balance + housing + loan + contact + day + month + duration + campaign +
pdays + previous + poutcome + y , data = mov.miss, n.impute = 10 , nk =0)
imputed <- impute.transcan(impute_arg, imputation=1, data=mov.miss, list.out=TRUE, pr=FALSE, check=FALSE)
y <- completed[names(imputed)]
When I run str(y), I get a data frame, but it still contains NAs, as if nothing had been imputed. My question is: how do I get a complete dataset without NAs after imputation?
str(y)
'data.frame': 4521 obs. of 17 variables:
$ age : int 30 NA 35 30 NA 35 36 39 41 43 ...
$ job : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
$ marital : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
$ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 NA 2 3 1 ...
$ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 NA 1 1 1 ...
$ balance : int NA 4789 1350 1476 0 747 307 147 NA -88 ...
$ housing : Factor w/ 2 levels "no","yes": NA 2 2 2 NA 1 2 2 2 2 ...
$ loan : Factor w/ 2 levels "no","yes": 1 2 1 2 NA 1 1 NA 1 2 ...
$ contact : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 NA 1 ...
$ day : int 19 NA 16 3 5 23 14 6 14 NA ...
$ month : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 NA 9 9 1 ...
$ duration : int 79 220 185 199 226 141 341 151 57 313 ...
$ campaign : int 1 1 1 4 1 2 1 2 2 NA ...
$ pdays : int -1 339 330 NA -1 176 330 -1 -1 NA ...
$ previous : int 0 4 NA 0 NA 3 2 0 0 2 ...
$ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
$ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
I have tested your code myself, and it works just fine, except for the last line:
y <- completed[names(imputed)]
I believe there's a typo in the above line. Plus, you do not even need completed at all.
Besides, if you want to get a data.frame from the impute.transcan function, then wrap it with as.data.frame:
imputed <- as.data.frame(impute.transcan(impute_arg, imputation=1, data=mov.miss, list.out=TRUE, pr=FALSE, check=FALSE))
Moreover, if you need to test your missing data pattern, you can also use the md.pattern function provided by the mice package.
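For a quick check that does not depend on any extra package, base R can also count the missing values per column and the number of fully observed rows, which is a simple way to confirm whether the imputed data frame is really complete. A minimal sketch on a made-up data frame (df is hypothetical, not mov.miss):

```r
# Toy data with a few missing values (hypothetical)
df <- data.frame(
  age     = c(30, NA, 35),
  balance = c(NA, 4789, 1350),
  y       = c("no", "no", "yes")
)

# Number of NAs in each column
na_per_column <- colSums(is.na(df))
na_per_column

# Number of rows with no NA at all
n_complete <- sum(complete.cases(df))
n_complete

# TRUE if any NA remains anywhere in the data frame
anyNA(df)
```

Running the same three checks on the data frame returned after imputation should show zero NAs per column if the imputed values were actually filled in.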