merge data frames "not a slot in class data.frame" - r

I am working through the book "A Practical Guide to Geostatistical Mapping" by T. Hengl, which also provides the code to reproduce its results. Unfortunately, much of that code is deprecated or even defunct. I was able to restore most of it, but now I'm stuck on something seemingly simple: merging two data frames. My error:
Error in (function (cl, name, valueClass) : ‘data’ is not a slot in class “data.frame”
Here is the code to reproduce that error:
library(gstat)
library(rgdal)
library(sp)
# load the data:
data(meuse)
coordinates(meuse) <- ~x+y
proj4string(meuse) <- CRS("+init=epsg:28992")
download.file("http://spatial-analyst.net/book/system/files/meuse.zip", destfile=paste(getwd(), "meuse.zip", sep="/"))
grid.list <- c("ahn.asc", "dist.asc", "ffreq.asc", "soil.asc")
# unzip the maps in a loop:
for(j in grid.list){
  fname <- unzip("meuse.zip", file=j)
  print(fname)
  file.copy(fname, paste("./", j, sep=""), overwrite=FALSE)
}
# load grids to R:
meuse.grid <- readGDAL(grid.list[1])
# fix the layer name:
names(meuse.grid)[1] <- sub(".asc", "", grid.list[1])
for(i in grid.list[-1]) {
  meuse.grid@data[sub(".asc", "", i[1])] <- readGDAL(paste(i))$band1
}
names(meuse.grid)
proj4string(meuse.grid) <- CRS("+init=epsg:28992")
meuse.ov <- over(meuse, meuse.grid)
str(meuse.ov)
meuse.data <- meuse[c("zinc", "lime")]@data
str(meuse.data)
meuse.ov@data <- merge(meuse.ov, meuse.data)
This is really confusing, as both data frames (meuse.ov and meuse.data) seem identical in their structure:
> str(meuse.ov)
'data.frame': 155 obs. of 4 variables:
$ ahn : int 3214 3402 3277 3563 3406 3355 3428 3476 3522 3525 ...
$ dist : num 0.00136 0.01222 0.10303 0.19009 0.27709 ...
$ ffreq: int 1 1 1 1 1 1 1 1 1 1 ...
$ soil : int 1 1 1 2 2 2 2 1 1 2 ...
and
> str(meuse.data)
'data.frame': 155 obs. of 2 variables:
$ zinc: num 1022 1141 640 257 269 ...
$ lime: Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
I tried to resolve this by looking things up on Stack Overflow, but nothing worked. For reference, the (no longer working) legacy code in the book suggested this:
meuse.ov <- overlay(meuse.grid, meuse)
meuse.ov@data <- cbind(meuse.ov@data, meuse[c("zinc", "lime")]@data)
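For reference, a minimal sketch of one way past the error, assuming the goal is simply a single data frame holding the grid values plus zinc and lime: over() returns a plain data.frame (not an S4 object, so it has no @data slot), and its rows line up with the points in meuse, so the columns can be bound directly.
# over() keeps the row order of 'meuse', so a column bind is enough;
# no merge() key is needed (sketch, not the book's original code):
meuse.ov <- cbind(meuse.ov, meuse@data[, c("zinc", "lime")])
str(meuse.ov)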

Related

R SVM Predict - Error in predict.svm: test data does not match model

I started with a data frame of 23,515 rows and 3 columns. I split the data 70/30 into training/testing. I am fitting a classification model with SVM from the e1071 package to predict variable MISSING. After I fit the model, I attempt to predict MISSING in my test set but I get the error below:
> ftplh_svm <- svm(MISSING ~ V1+V2, data=train_vars, type="C-classification", kernel="linear")
> p <- predict(ftplh_svm, test_vars, type="class")
Error in predict.svm(object, ...) : test data does not match model !
I tried removing the predicted class from the test set as recommended in another post:
> p <- predict(ftplh_svm, test_vars[-3], type="class")
Error in predict.svm(object, ...) : test data does not match model !
I also tried dropping empty levels as recommended by Brad, but no levels ended up being dropped and I got the same results:
> train_vars$V1 <- droplevels(as.factor(train_vars$V1))
> train_vars$V2 <- droplevels(as.factor(train_vars$V2))
> train_vars$MISSING <- droplevels(as.factor(train_vars$MISSING))
> test_vars$V1 <- droplevels(as.factor(test_vars$V1))
> test_vars$V2 <- droplevels(as.factor(test_vars$V2))
> test_vars$MISSING <- droplevels(as.factor(test_vars$MISSING))
> ftplh_svm <- svm(MISSING ~ V1+V2, data=train_vars, type="C-classification", kernel="linear")
> p <- predict(ftplh_svm, test_vars, type="class")
Error in predict.svm(object, ...) : test data does not match model !
Structure of my training set and test set:
> str(train_vars)
'data.frame': 16395 obs. of 3 variables:
$ V1: Factor w/ 148 levels "AAC","AAL","AGP",..: 1 1 2 2 2 2 2 2 2 2 ...
$ V2 : Factor w/ 284 levels "6AR","AAC","AAL",..: 79 42 180 180 180 180 180 180 180 180 ...
$ MISSING : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
> str(test_vars)
'data.frame': 7129 obs. of 3 variables:
$ V1: Factor w/ 111 levels "AAC","AAL","AGP",..: 1 2 2 2 2 2 2 2 2 2 ...
$ V2 : Factor w/ 265 levels "AAC","AAL","ABZ",..: 225 169 169 169 169 169 169 169 169 169 ...
$ MISSING : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
Test to see if there are new levels in my test set (I did this for each variable):
> train_lev <- levels(train_vars$V1)
> test_lev <- levels(test_vars$V1)
> # these levels only exist in the test set
> new_levels <- setdiff(test_lev,train_lev)
> new_levels
character(0)
> # how many observations is it?
> obs <- which(test_vars$V1 %in% new_levels)
> length(obs)
[1] 0
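For reference, a frequent cause of "test data does not match model !" is a mismatch in the factor level sets between the training and test data. A hedged sketch (not from the original post) that re-encodes the test columns with the training levels before predicting:
# Values unseen during training become NA under this re-encoding:
test_vars$V1 <- factor(test_vars$V1, levels = levels(train_vars$V1))
test_vars$V2 <- factor(test_vars$V2, levels = levels(train_vars$V2))
p <- predict(ftplh_svm, test_vars, type = "class")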

Why is dplyr collapsing my whole data frame and not grouping it

I have been researching this for a while and I can't seem to find the issue. I use dplyr regularly, but all of a sudden I am getting odd output from the group_by/summarise combination.
I have a large dataset and I am trying to summarize it using the following:
dataAgg <- dataRed %>%
  group_by(ClmNbr, SnapshotDay, Pre2016) %>%
  filter(SnapshotDay == '30' | SnapshotDay == '90') %>%
  summarise(
    NumFeat = sum(FeatureNbr),
    TotInc = sum(IncSnapshotDay),
    TotDelta = sum(InctoFinal),
    TotPaid = sum(FinalPaid)
  )
The setup of the data frame is below:
'data.frame': 123819 obs. of 8 variables:
$ ClmNbr : Factor w/ 33617 levels "14-00765132",..: 2162 2163 2163 2164 1842 2287 27 27 27 28 ...
$ SnapshotDay : Factor w/ 3 levels "7","30","90": 1 1 1 1 1 1 1 1 1 1 ...
$ Pre2016 : Factor w/ 2 levels "Post2016","Pre2016": 2 2 2 2 2 2 2 2 2 2 ...
$ FeatureNbr : int 6 2 3 3 6 2 4 5 6 5 ...
$ IncSnapshotDay: num 5000 77 5000 4500 77 2200 1800 1100 1800 25000 ...
$ FinalPaid : num 442 0 15000 5000 0 ...
$ InctoFinal : num -4558 -77 10000 500 -77 ...
$ TimeDelta : num 25.833 2.833 2.833 0.833 1.833 ...
When I execute the code, I get 1 obs. of 4 variables; there is no grouping applied.
'data.frame': 1 obs. of 4 variables:
$ NumFeat : int 287071
$ TotInc : num NA
$ TotDelta: num NA
$ TotPaid : num 924636433
I used to do this all the time without problems.
I could use aggregate, but sometimes, I am mixing and matching functions based on the column so it does not always work.
What am I doing wrong?
After a bit of research and some experimentation, it turns out that the order in which the libraries are loaded matters. The original order was the following:
library(RODBC)
library(dplyr)
library(DT)
library(reshape2)
library(ggplot2)
library(scales)
library(caret)
library(markovchain)
library(knitr)
library(Metrics)
library(RColorBrewer)
However, ggplot2 pulls in plyr as a dependency, so to make this work smoothly the load order should be revised so that dplyr is loaded last, which is what I used to do:
library(RODBC)
library(DT)
library(reshape2)
library(ggplot2)
library(scales)
library(caret)
library(markovchain)
library(knitr)
library(Metrics)
library(RColorBrewer)
library(dplyr)
Alternatively, it can be accomplished by specifying which package should execute each command, much like namespacing in Python. In Python, libraries are imported with:
import numpy as np
and numpy commands are then referenced with the np. prefix, e.g. np.array(). The R equivalent is the package::function() syntax. Adding dplyr:: to the verbs fixes the problem, as shown below.
dataAgg <- dataRed %>%
  dplyr::group_by(ClmNbr, SnapshotDay, Pre2016) %>%
  dplyr::filter(SnapshotDay == '30' | SnapshotDay == '90') %>%
  dplyr::summarise(
    NumFeat = sum(FeatureNbr),
    TotInc = sum(IncSnapshotDay),
    TotDelta = sum(InctoFinal),
    TotPaid = sum(FinalPaid)
  )
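If it is unclear whether plyr is actually attached and masking the dplyr verbs, a quick session check (illustrative, not from the original post) is:
# List every attached package that exports 'summarise'; the first entry
# on the search path is the one a bare call resolves to.
find("summarise")
# Confirm which namespace the bare name currently points at:
environmentName(environment(summarise))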

Paste / NoQuote - Not Working as Expected

I have a data frame c1 as below:
str(c1)
#'data.frame': 2312 obs. of 6 variables:
# $ dt : Date, format: "2014-04-01" "2014-04-01" "2014-04-01" ...
# $ base : Factor w/ 2 levels "AA","AB": 1 1 1 2 2 2 2 1 1 1 ...
# $ curr : Factor w/ 5 levels "BA","BB","BC",..: 2 3 5 1 2 3 4 2 3 5 ...
# $ trans: int 72 176 4365 234 144 352 16762 61 160 4276 ...
# $ amt : num 2.18e+09 5.55e+09 9.99e+09 3.75e+08 4.37e+09 ...
# $ rate : num 1.11e-04 1.22e-02 1.26 3.94 5.65e+03 ...
d = "c1"
d
# [1] "c1"
Now, when I use d instead of the actual data frame name, it does not work correctly:
i <- sapply( c1, is.factor)
i
# dt base curr trans amt rate
#FALSE TRUE TRUE FALSE FALSE FALSE
Correct!
i <- sapply( paste(d), is.factor)
i
# c1
#FALSE
Incorrect
i <- sapply( noquote(d), is.factor)
i
# c1
#FALSE
Incorrect
Is there a way to fix this?
Edit -
c1[i] <- lapply(c1[i], as.character)
Works
get(d)[i] <- lapply( get(d)[i], as.character)
Fails
for (j in 1:length(i)) { ifelse(is.factor(get(d)[j]),get(d)[i] <- as.character(get(d)[i])) }
Fails
Can get() be used everywhere, or are there only three or four specific ways to use get()?
Thanks Again
If I understand correctly, you're looking for
xy <- data.frame(a = runif(3), b = letters[1:3])
sapply(get("xy"), is.factor)
Mind you, this is bad practice. Note that get() only retrieves a value, so it cannot appear on the left-hand side of an assignment, which is why the replacement attempts above fail. If you're making up variable names on the fly, you should consider using another object, such as a list, to store your data frame(s).
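A minimal sketch of that list-based approach, assuming c1 is the data frame from the question (the list and its name are illustrative):
# Store the data frame(s) in a named list and index by string, instead of
# constructing variable names on the fly with get()/eval():
dfs <- list(c1 = c1)
d <- "c1"
i <- sapply(dfs[[d]], is.factor)
dfs[[d]][i] <- lapply(dfs[[d]][i], as.character)
str(dfs[[d]])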
This works for now, although it is exceptionally hard to make sense of.
.eval <- function(evaltext, envir = sys.frame()) {
  ## evaluate a string as R code
  eval(parse(text = evaltext), envir = envir)
}
.eval(paste( "i = sapply(",noquote(d),",is.factor)",sep=""))
.eval(paste( noquote(d),"[i] <- lapply(",noquote(d),"[i], as.character)",sep=""))
I am still looking for better alternatives. This is so bad that I cannot accept it as the answer :-(
Thanks, Manish

Plotting great circles from a subset in R

I have a data frame that, after some processing (geocoding, for example), has the following characteristics:
'data.frame': 13 obs. of 5 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ ciudad : Factor w/ 10 levels "Auch","Barcelona",..: 8 4 5 3 2 7 9 10 6 6 ...
$ proyecto: int 1 1 1 1 1 1 1 1 2 2 ...
$ lon : num -1.131 0.564 -9.139 0.627 2.173 ...
$ lat : num 38 44.2 38.7 44.5 41.4 ...
Every project (proyecto) has a list of cities, and I need to connect the first city of each project to the others in a radial way. This is what I have done so far:
# Capitalizing first letters
municipios <- read.csv("ciudades.csv", header=TRUE, sep=";")
stri_trans_totitle(as.character(municipios$ciudad))
write.csv(municipios, file = "municipios.csv")
# Obtaining latitude & longitude
lonlat <- geocode(as.character(municipios$ciudad))
municipios_lonlat <- cbind(municipios, lonlat)
write.csv(municipios_lonlat, file = "municipios_lonlat.csv")
str(municipios_lonlat)
# Plotting a simple map
xlim <- c(-13.08, 8.68)
ylim <- c(34.87, 49.50)
map("world", col="#191919", fill=TRUE, bg="#000000", lwd=0.05, xlim=xlim, ylim=ylim)
# Plotting cities
symbols(municipios_lonlat$lon, municipios_lonlat$lat, bg="#e2373f", fg="#ffffff", lwd=0.5, circles=rep(1, length(municipios_lonlat$lon)), inches=0.05, add=TRUE)
# Subsetting, splitting & connecting
uniq <- unique(unlist(municipios_lonlat$proyecto))
for (i in 1:length(uniq)){
  data_1 <- subset(municipios_lonlat, proyecto == uniq[i])
  # connect the first city of each project to the remaining ones
  for (j in 2:length(data_1$lon)){
    lngs <- c(data_1$lon[1], data_1$lon[j])
    lats <- c(data_1$lat[1], data_1$lat[j])
    lines(lngs, lats, col="#e2373f", lwd=2)
  }
}
But it does not look quite realistic, so I need to use great circles to improve the resulting map. I know I have to use the geosphere library with a loop similar to the one above, but the things I tried did not work. Could you please help me? You are my only hope, Obi-Wan Kenobi.
Note: here you can download my data.
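For reference, a hedged sketch of how the geosphere approach could look, built on the same loop structure as above: gcIntermediate() returns intermediate points along the great circle between two coordinates, and lines() can draw that matrix directly.
library(geosphere)
# Redraw the radial connections as great-circle arcs instead of straight lines:
for (p in unique(municipios_lonlat$proyecto)) {
  data_1 <- subset(municipios_lonlat, proyecto == p)
  if (nrow(data_1) < 2) next
  for (j in 2:nrow(data_1)) {
    arc <- gcIntermediate(c(data_1$lon[1], data_1$lat[1]),
                          c(data_1$lon[j], data_1$lat[j]),
                          n = 50, addStartEnd = TRUE)
    lines(arc, col = "#e2373f", lwd = 2)
  }
}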

C5.0 decision tree - c50 code called exit with value 1

I am getting the following error
c50 code called exit with value 1
I am doing this on the Titanic data available from Kaggle.
# Importing datasets
train <- read.csv("train.csv", sep=",")
# this is the structure
str(train)
Output:
'data.frame': 891 obs. of 12 variables:
$ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
$ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
$ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
Then I tried a C5.0 decision tree:
# Trying with C5.0 decision tree
library(C50)
#C5.0 models require a factor outcome otherwise error
train$Survived <- factor(train$Survived)
new_model <- C5.0(train[-2],train$Survived)
So running the above lines gives me this error
c50 code called exit with value 1
I'm not able to figure out what's going wrong. I was using similar code on a different dataset and it worked fine. Any ideas on how I can debug my code?
-Thanks
For anyone interested, the data can be found here: http://www.kaggle.com/c/titanic-gettingStarted/data. I think you need to be registered in order to download it.
Regarding your problem, first off, I think you meant to write
new_model <- C5.0(train[,-2],train$Survived)
Next, notice the structure of the Cabin and Embarked columns. These two factors have an empty string as a level name (check with levels(train$Embarked)). This is where C50 falls over. If you modify your data such that
levels(train$Cabin)[1] = "missing"
levels(train$Embarked)[1] = "missing"
your algorithm will now run without an error.
Just in case: you can take a look at the error with
summary(new_model)
This error also occurs when there are special characters in the name of a variable. For example, you will get this error if the character "я" (from the Russian alphabet) appears in a variable name.
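A small illustrative sketch for that case (the renaming step is an assumption, not part of the original answer): strip non-ASCII characters from the column names and make them syntactic before fitting.
# Drop non-ASCII characters, then enforce syntactic, unique column names:
names(train) <- make.names(iconv(names(train), to = "ASCII", sub = ""), unique = TRUE)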
Here is what finally worked (I got the idea after reading this post):
library(C50)
test$Survived <- NA
combinedData <- rbind(train,test)
combinedData$Survived <- factor(combinedData$Survived)
# fixing empty character level names
levels(combinedData$Cabin)[1] = "missing"
levels(combinedData$Embarked)[1] = "missing"
new_train <- combinedData[1:891,]
new_test <- combinedData[892:1309,]
new_model <- C5.0(new_train[,-2],new_train$Survived)
new_model_predict <- predict(new_model,new_test)
submitC50 <- data.frame(PassengerId=new_test$PassengerId, Survived=new_model_predict)
write.csv(submitC50, file="c50dtree.csv", row.names=FALSE)
The intuition behind this is that both the train and test data sets then have consistent factor levels.
I had the same error, but I was using a numeric dataset without missing values.
After a long time, I discovered that my dataset had a predictor column called "outcome"; C5.0Control uses this name internally, and that was the cause of the error :'(
My solution was to rename the column. Another way would be to create a C5.0Control object, change the value of its label attribute, and then pass this object as a parameter to the C5.0 method.
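A hedged sketch of that second option, using placeholder objects train_x (predictors) and train_y (response): C5.0Control() exposes a label argument, which defaults to "outcome" and can be renamed to avoid the clash.
library(C50)
# Rename the internal outcome label so it no longer collides with a
# predictor column called "outcome" (train_x / train_y are placeholders):
ctrl <- C5.0Control(label = "target")
model_alt <- C5.0(x = train_x, y = train_y, control = ctrl)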
I also struggled for some hours with the same problem (return code "1"), both when building a model and when predicting.
Following the hint in Marco's answer, I wrote a small function that removes all factor levels equal to "" in a data frame or vector; see the code below. However, since R does not pass arguments by reference, you have to use the return value of the function (it cannot change the original data frame):
removeBlankLevelsInDataFrame <- function(dataframe) {
  for (i in 1:ncol(dataframe)) {
    levels <- levels(dataframe[, i])
    if (!is.null(levels) && levels[1] == "") {
      levels(dataframe[, i])[1] <- "?"
    }
  }
  dataframe
}
removeBlankLevelsInVector <- function(vector) {
  levels <- levels(vector)
  if (!is.null(levels) && levels[1] == "") {
    levels(vector)[1] <- "?"
  }
  vector
}
A call to these functions may look like this:
trainX = removeBlankLevelsInDataFrame(trainX)
trainY = removeBlankLevelsInVector(trainY)
model = C50::C5.0.default(trainX,trainY)
However, it seems that C50 has a similar problem with character columns containing empty cells, so you will probably have to extend this to handle character attributes as well if you have any.
I also got the same error, but it was because of illegal characters in the factor levels of one of the columns.
I used the make.names function to correct the factor levels:
levels(FooData$BarColumn) <- make.names(levels(FooData$BarColumn))
Then the problem was resolved.
