Failing to implement SVM in R

I am trying to run an image classification using svm, but I am facing an error that, while already reported on this forum, has proposed solutions that do not fit my case.
The data I want to classify is a raster stack of 2 layers:
> S1_images
class : RasterStack
dimensions : 1000, 1414, 1414000, 2 (nrow, ncol, ncell, nlayers)
resolution : 10, 10 (x, y)
extent : 670860, 685000, 6163420, 6173420 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=utm +zone=32 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0
names : X20180415_VH, X20180415_VV
min values : 1.621079e-05, 1.929869e-04
max values : 24.6396, 159.7452
My training data is obtained using polygons as reference and extracting the pixel values at those locations:
training_S<-raster::extract(S_images_t, training, df=TRUE)
training_S$Class<-factor(training_S$Class)
> head(training_S)
ID X20180415_VH X20180415_VV Class
1 1 0.006463605 0.05813200 1
2 1 0.006663103 0.06266786 1
3 1 0.007048910 0.06308612 1
4 1 0.006351015 0.04774158 1
5 1 0.006822301 0.05248845 1
6 1 0.007194918 0.05911565 1
and
> str(training_S)
'data.frame': 33239 obs. of 4 variables:
$ ID : num 1 1 1 1 1 1 1 1 1 1 ...
$ X20180415_VH: num 0.00646 0.00666 0.00705 0.00635 0.00682 ...
$ X20180415_VV: num 0.0581 0.0627 0.0631 0.0477 0.0525 ...
$ Class : Factor w/ 9 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
After running tune.svm to choose the best parameters, I create the model (so far, so good):
SVM<-svm(x=training_S[ ,c(1:(length(training_S)-1))], y=training_S$Class, gamma = 0.1, cost = 10)
Next, I try to use predict to classify my input data:
LC<-predict(S1_images, model=SVM, na.rm=TRUE)
and here comes my error:
> LC<-predict(S1_images, model=SVM, na.rm=TRUE)
Error in newdata[, object$scaled, drop = FALSE] :
(subscript) logical subscript too long
Following an example from R-bloggers, I converted my raster stack into a data frame and renamed the columns:
S1_images_df <- data.frame(getValues(S1_images))
names(S1_images_df) <- c("X20180415_VH", "X20180415_VV")
When trying to run the classification again:
LC<-predict(SVM, S1_images_df)
> LC<-predict(SVM, S1_images_df)
Error in newdata[, object$scaled, drop = FALSE] :
(subscript) logical subscript too long
Some extra information on my data:
> str(S1_images_df)
'data.frame': 1414000 obs. of 2 variables:
$ X20180415_VH: num 0.005 0.00531 0.00514 0.0048 0.00461 ...
$ X20180415_VV: num 0.0954 0.0947 0.0933 0.0952 0.0951 ...
> dim(training_S)
[1] 33239 4
> dim(S1_images_df)
[1] 1414000 2
I have been checking these two older posts but am not sure how to implement their solution in my case:
Here and Here

It looks like you're including ID as a covariate when training the model. If ID is meaningful and you want to include it in the model, you need to add a corresponding ID field to S1_images_df. More likely, you should exclude it when passing your training data to svm:
SVM<-svm(x=training_S[, -c(1, ncol(training_S))], y=training_S$Class, gamma = 0.1, cost = 10)
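With ID dropped, the model's predictor names match the raster layer names, so predicting on the stack should then work. A sketch along those lines (the names come from the question; this is an illustration, not the poster's exact code):
library(e1071)   # provides svm()
library(raster)  # provides the predict() method for Raster* objects
# Retrain using only the two band columns so the predictor names match the
# layer names X20180415_VH and X20180415_VV, then predict on the stack.
SVM <- svm(x = training_S[, c("X20180415_VH", "X20180415_VV")],
           y = training_S$Class, gamma = 0.1, cost = 10)
LC  <- raster::predict(S1_images, model = SVM, na.rm = TRUE)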

Related

caret::predict giving Error: $ operator is invalid for atomic vectors

This has been driving me crazy and I've been looking through similar posts all day, but I can't seem to solve my problem. I have a naive Bayes model trained and stored as model. I'm attempting to predict with a newdata data frame, but I keep getting the error Error: $ operator is invalid for atomic vectors. Here is what I am running: stats::predict(model, newdata = newdata), where newdata is the first row of another data frame: newdata <- pbp[1, c("balls", "strikes", "outs_when_up", "stand", "pitcher", "p_throws", "inning")]
class(newdata) gives [1] "tbl_df" "tbl" "data.frame".
The issue is with the data used: it should match the levels used in training. For example, if we use one of the rows from trainingData to predict, it does work:
predict(model, head(model$trainingData, 1))
#[1] Curveball
#Levels: Changeup Curveball Fastball Sinker Slider
Checking the str of both datasets shows that some of the columns that are factors in the training data are character (or integer) class in newdata:
str(model$trainingData)
'data.frame': 1277525 obs. of 7 variables:
$ pitcher : Factor w/ 1390 levels "112526","115629",..: 277 277 277 277 277 277 277 277 277 277 ...
$ stand : Factor w/ 2 levels "L","R": 1 1 2 2 2 2 2 1 1 1 ...
$ p_throws : Factor w/ 2 levels "L","R": 2 2 2 2 2 2 2 2 2 2 ...
$ balls : num 0 1 0 1 2 2 2 0 0 0 ...
$ strikes : num 0 0 0 0 0 1 2 0 1 2 ...
$ outs_when_up: num 1 1 1 1 1 1 1 2 2 2 ...
$ .outcome : Factor w/ 5 levels "Changeup","Curveball",..: 3 4 1 4 1 5 5 1 1 5 ...
str(newdata)
tibble [1 × 6] (S3: tbl_df/tbl/data.frame)
$ balls : int 3
$ strikes : int 2
$ outs_when_up: int 1
$ stand : chr "R"
$ pitcher : int 605200
$ p_throws : chr "R"
One option is to convert those columns to factors with the same levels as in the training data:
# Columns present in both datasets
nm1 <- intersect(names(model$trainingData), names(newdata))
# Of those, the ones that are factors in the training data
nm2 <- names(which(sapply(model$trainingData[nm1], is.factor)))
# Convert the corresponding newdata columns to factors with the same levels
newdata[nm2] <- Map(function(x, y) factor(x, levels = levels(y)), newdata[nm2], model$trainingData[nm2])
Now do the prediction
predict(model, newdata)
#[1] Sinker
#Levels: Changeup Curveball Fastball Sinker Slider

Why does the clusthr function on my SpatialPointsDataFrame return a subscript out of bounds error?

I am trying to calculate MCP and KDE on VHF tracking data; my data frame has three columns: Individual, Lat and Long.
I am following the adehabitatHR vignette (https://cran.r-project.org/web/packages/adehabitatHR/vignettes/adehabitatHR.pdf) to create a SpatialPointsDataFrame that contains VHF data on multiple individuals, but using my own dataset.
I load my data
vhf <- read.csv("vhfdatacrom.csv")
str(vhf)
'data.frame': 186 obs. of 3 variables:
$ Individual: Factor w/ 12 levels "PM01","PM03",..: 2 1 2 2 1 1 1 1 1 4 ...
$ Lat : num 54.2 54.2 54.2 54.2 54.2 ...
$ Long : num -7.44 -7.42 -7.45 -7.44 -7.42 ...
> coords.data.xy <- vhf[c("Long", "Lat")]
> id <- vhf[c("Individual")]
> idsp <- data.frame(id)
> coordinates(idsp) <- coords.data.xy
> class(idsp)
[1] "SpatialPointsDataFrame"
attr(,"package")
[1] "sp"
> clusthr(idsp)
Error in tmp[1, ] : subscript out of bounds
I have tinkered all I can but cannot get this to run; any help would be appreciated.
Thank you
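Without the data it is hard to say for certain, but one common cause of this error in adehabitatHR is individuals with very few relocations. A quick diagnostic sketch (an assumption about the likely cause, not a confirmed fix):
# How many relocations does each individual have? Home-range estimators in
# adehabitatHR generally need several points per animal, so individuals with
# only a couple of fixes are a likely culprit.
table(vhf$Individual)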

R - Aggregate function creating sub-lists

I'm using the aggregate function to summarise some data. The data is loans data; I have ContractNum and LoanAmount. I want to aggregate the data by StartDate, count the number of loans and average the loan amount.
Here is a sample of the data and the function that I use:
ContractNum <- c("RHL-1","RHL-2","RHL-3","RHL-3")
StartDate <- c("2016-11-01","2016-11-01","2016-12-01","2016-12-01")
LoanPurpose <- c("Personal","Personal","HomeLoan","Investment")
LoanAmount <- c(200,500,600,150)
dat <- data.frame(ContractNum,StartDate,LoanPurpose,LoanAmount)
aggr.data <- aggregate(
cbind(LoanAmount,ContractNum) ~ StartDate + LoanPurpose
,data = dat
,FUN = function(x)c(count = mean(x),length(x))
)
When I look at the results of the aggregate function, it looks OK:
> aggr.data
StartDate LoanPurpose LoanAmount.count LoanAmount.V2 ContractNum.count ContractNum.V2
1 2016-12-01 HomeLoan 600 1 3.0 1.0
2 2016-12-01 Investment 150 1 3.0 1.0
3 2016-11-01 Personal 350 2 1.5 2.0
But when I look at the structure of it, it seems to have created a sub-list:
> str(aggr.data)
'data.frame': 3 obs. of 4 variables:
$ StartDate : Factor w/ 2 levels "2016-11-01","2016-12-01": 2 2 1
$ LoanPurpose: Factor w/ 3 levels "HomeLoan","Investment",..: 1 2 3
$ LoanAmount : num [1:3, 1:2] 600 150 350 1 1 2
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "count" ""
$ ContractNum: num [1:3, 1:2] 3 3 1.5 1 1 2
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "count" ""
How do I get rid of this sub-list so that I can access each column the way I would normally access a data frame? I understand that in the code I've asked for a mean of ContractNum, which is not meaningful, but I can just get rid of that column.
Thank you
Just do do.call(data.frame, ...) on aggr.data to unnest the matrices.
aggr.data <- do.call(data.frame, aggr.data);
str(aggr.data);
#'data.frame': 3 obs. of 6 variables:
# $ StartDate : Factor w/ 2 levels "2016-11-01","2016-12-01": 2 2 1
# $ LoanPurpose : Factor w/ 3 levels "HomeLoan","Investment",..: 1 2 3
# $ LoanAmount.count : num 600 150 350
# $ LoanAmount.V2 : num 1 1 2
# $ ContractNum.count: num 3 3 1.5
# $ ContractNum.V2 : num 1 1 2
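After flattening, each column is an ordinary vector again and can be accessed the usual way, e.g.:
aggr.data$LoanAmount.count
#[1] 600 150 350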

Calculate new column with growth rate based on two factors without splintering dataframe

Hi there,
I would like to calculate growth rates, storing them in a new column of my data frame e.g. named growth.per.day. I am - as always - looking for a way that doesn't include hundreds and hundreds of lines of manually edited code.
I have six levels of algae and 25 levels of nutrients.
This means I have 150 "subgroups" for which I want to calculate the rates. Those subsets differ in length based on the individual algae.
So, basically:
Algae A ->
Nutrient (1) -> C.mikro.gr.L (Day 2) - C.mikro.gr.L (Day 1),C.mikro.gr.L (Day 3) - C.mikro.gr.L (Day 2) ... ;
Nutrient (2) -> C.mikro.gr.L (Day 2) - C.mikro.gr.L (Day 1),C.mikro.gr.L (Day 3) - C.mikro.gr.L (Day 2) ... etc.
I already split the data frame by algae
X <- split(data, data$ALGAE)
names(X) <- c("ANKI", "CHLAMY", "MIX_A", "MIX_B", "SCENE", "STAURA")
list2env(X, envir = .GlobalEnv)
and I have also split those again, creating the aforementioned lovely 150 subsets. Then I applied
ratio1$growth.per.day <- c(NA,ratio1[2:nrow(ratio1), 16] - ratio1[1:(nrow(ratio1)-1), 16])
which is perfect and does what I want, BUT I would really appreciate a shorter, more elegant way that doesn't butcher my data frame.
'data.frame': 3550 obs. of 16 variables:
$ SAMPLE.ID : Factor w/ 150 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ COMMUNITY : chr "com.1" "com.1" "com.1" "com.1" ...
$ NUTRIENT : Factor w/ 25 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ RATIO : Factor w/ 23 levels "3.2","4","5.4",..: 11 9 6 4 1 14 10 8 5 2 ...
$ PHOS : Factor w/ 5 levels "0.09","0.195",..: 5 5 5 5 5 4 4 4 4 4 ...
$ NIT : Factor w/ 5 levels "1.5482","3.0964",..: 5 4 3 2 1 5 4 3 2 1 ...
$ DATUM : Factor w/ 35 levels "30.08.16","31.08.16",..: 1 1 1 1 1 1 1 1 1 1 ...
$ DAY : int 0 0 0 0 0 0 0 0 0 0 ...
$ TYPE : chr "mono" "mono" "mono" "mono" ...
$ ALGAE : Factor w/ 6 levels "ANK","CHLA","MIX A",..: 5 5 5 5 5 5 5 5 5 5 ...
$ MEAN : num 864 868 882 873 872 ...
$ GROW : num 0.00116 0.00115 0.00113 0.00115 0.00115 ...
$ FLUORO : num NA NA NA NA NA NA NA NA NA NA ...
$ MEAN.MQ : num 0.964 0.969 0.985 0.975 0.973 ...
$ GROW.MQ : num 1.04 1.03 1.02 1.03 1.03 ...
$ C.mikro.gr.L: num -764 -913 -1394 -1085 -1039 ...
I hope this sufficiently describes the problem,
Thanks so much!
I hope this is what you asked for:
df = data.frame(algae = sort(rep(LETTERS[1:6], 20)),
                nutrient = rep(letters[22:26], 24),
                day = rep(c(rep(1, 5),
                            rep(2, 5),
                            rep(3, 5),
                            rep(4, 5)), 6),
                growth = runif(120, 30, 60))
library(dplyr)
df = df %>% group_by(algae, nutrient) %>% mutate(rate = c(NA, diff(growth, lag = 1)))
And here is the table for alga A and nutrient v:
algae nutrient day growth rate
<fctr> <fctr> <dbl> <dbl> <dbl>
1 A v 1 48.68547 NA
2 A v 2 55.63570 6.950232
3 A v 3 53.28569 -2.350013
4 A v 4 44.83022 -8.455465
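Applied to the data frame described in the question, the same idea would look roughly like this (a sketch; the column names come from the str() output above, and the rows are ordered by DAY within each group before differencing):
library(dplyr)
data <- data %>%
  group_by(ALGAE, NUTRIENT) %>%
  arrange(DAY, .by_group = TRUE) %>%
  mutate(growth.per.day = c(NA, diff(C.mikro.gr.L))) %>%
  ungroup()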

Between/within standard deviations

When working on a hierarchical/multilevel/panel dataset, it may be very useful to adopt a package which returns the within- and between-group standard deviations of the available variables.
This is something that with the following data in Stata can be easily done through the command
xtsum, i(momid)
I did some research, but I cannot find any R package which can do that.
edit:
Just to fix ideas, an example of a hierarchical dataset could be this:
son_id mom_id hispanic mom_smoke son_birthweigth
1 1 1 1 3950
2 1 1 0 3890
3 1 1 0 3990
1 2 0 1 4200
2 2 0 1 4120
1 3 0 0 2975
2 3 0 1 2980
The "multilevel" structure is given by the fact that each mother (higher level) has two or more sons (lower level). Hence, each mother defines a group of observations.
Accordingly, each dataset variable can vary either both between and within mothers or only between mothers. son_birthweigth varies between mothers, but also within the same mother. By contrast, hispanic is fixed for a given mother.
For example, the within-mother variance of son_birthweigth is:
# mom1 means
bwt_mean1 <- (3950+3890+3990)/3
bwt_mean2 <- (4200+4120)/2
bwt_mean3 <- (2975+2980)/2
# Within-mother variance for birthweigth
((3950-bwt_mean1)^2 + (3890-bwt_mean1)^2 + (3990-bwt_mean1)^2 +
(4200-bwt_mean2)^2 + (4120-bwt_mean2)^2 +
(2975-bwt_mean3)^2 + (2980-bwt_mean3)^2)/(7-1)
While the between-mother variance is:
# overall mean of birthweigth:
# mean <- sum(data$son_birthweigth)/length(data$son_birthweigth)
mean <- (3950+3890+3990+4200+4120+2975+2980)/7
# between variance:
((bwt_mean1-mean)^2 + (bwt_mean2-mean)^2 + (bwt_mean3-mean)^2)/(3-1)
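The same two quantities can be computed programmatically; here is a minimal base-R sketch using the example values above (the data frame data is created here just for illustration):
# Example data taken from the table above
data <- data.frame(mom_id = c(1, 1, 1, 2, 2, 3, 3),
                   son_birthweigth = c(3950, 3890, 3990, 4200, 4120, 2975, 2980))
# Within-mother variance: squared deviations from each mother's mean,
# divided by (number of sons - 1), as above
grp_mean <- ave(data$son_birthweigth, data$mom_id)
sum((data$son_birthweigth - grp_mean)^2) / (nrow(data) - 1)
# Between-mother variance: squared deviations of the mother means from the
# grand mean, divided by (number of mothers - 1)
mom_means <- tapply(data$son_birthweigth, data$mom_id, mean)
sum((mom_means - mean(data$son_birthweigth))^2) / (length(mom_means) - 1)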
I don't know what your Stata command is supposed to reproduce, but to answer the second part of the question about hierarchical structure, it is easy to do this with a list.
For example, you define a structure like this:
tree = list(
"var1" = list(
"panel" = list(type ='p',mean = 1,sd=0)
,"cluster" = list(type = 'c',value = c(5,8,10)))
,"var2" = list(
"panel" = list(type ='p',mean = 2,sd=0.5)
,"cluster" = list(type="c",value =c(1,2)))
)
To create this, lapply is convenient for working with lists:
tree <- lapply(list('var1','var2'),function(x){
ll <- list(panel= list(type ='p',mean = rnorm(1),sd=0), ## I use symbol here not name
cluster= list(type = 'c',value = rnorm(3))) ## R prefer symbols
})
names(tree) <-c('var1','var2')
You can view the structure with str:
str(tree)
List of 2
$ var1:List of 2
..$ panel :List of 3
.. ..$ type: chr "p"
.. ..$ mean: num 0.284
.. ..$ sd : num 0
..$ cluster:List of 2
.. ..$ type : chr "c"
.. ..$ value: num [1:3] 0.0722 -0.9413 0.6649
$ var2:List of 2
..$ panel :List of 3
.. ..$ type: chr "p"
.. ..$ mean: num -0.144
.. ..$ sd : num 0
..$ cluster:List of 2
.. ..$ type : chr "c"
.. ..$ value: num [1:3] -0.595 -1.795 -0.439
Edit after OP clarification
I think that package reshape2 is what you want. I will demonstrate this here.
The idea here is that in order to do the multilevel analysis we need to reshape the data.
First we divide the variables into two groups: identifier variables and measured variables.
library(reshape2)
dat.m <- melt(dat,id.vars=c('son_id','mom_id')) ## other columns are measured
str(dat.m)
'data.frame': 21 obs. of 4 variables:
$ son_id : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 1 2 1 2 3 ...
$ mom_id : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 3 3 1 1 1 ...
$ variable: Factor w/ 3 levels "hispanic","mom_smoke",..: 1 1 1 1 1 1 1 2 2 2 ...
$ value : num 1 1 1 0 0 0 0 1 0 0 ..
Once you have the data in "molten" form, you can "cast" it to rearrange it into the shape that you want:
# mom1 means for all variable
acast(dat.m,variable~mom_id,mean)
1 2 3
hispanic 1.0000000 0 0.0
mom_smoke 0.3333333 1 0.5
son_birthweigth 3943.3333333 4160 2977.5
# Within-mother variance for birthweigth
acast(dat.m,variable~mom_id,function(x) sum((x-mean(x))^2))
1 2 3
hispanic 0.0000000 0 0.0
mom_smoke 0.6666667 0 0.5
son_birthweigth 5066.6666667 3200 12.5
## overall mean of each variable
acast(dat.m,variable~.,mean)
[,1]
hispanic 0.4285714
mom_smoke 0.5714286
son_birthweigth 3729.2857143
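As a small check (using only the numbers printed above), the within-mother variance of son_birthweigth computed manually at the start of the question can be recovered from these per-mother sums of squares:
# Per-mother sums of squares for son_birthweigth, from the acast output above
ss <- c(5066.6666667, 3200, 12.5)
sum(ss) / (7 - 1)
#[1] 1379.861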
I know this question is four years old, but recently I wanted to do the same in R and came up with the following function. It depends on dplyr and tibble, where df is the data frame, columns is a numerical vector used to subset the data frame, and individuals is the column with the individuals.
xtsumR <- function(df, columns, individuals){
  df <- dplyr::arrange_(df, individuals)
  panel <- tibble::tibble()
  for (i in columns){
    # Overall statistics for variable i
    v <- df %>% dplyr::group_by_() %>%
      dplyr::summarize_(
        mean = mean(df[[i]]),
        sd = sd(df[[i]]),
        min = min(df[[i]]),
        max = max(df[[i]])
      )
    v <- tibble::add_column(v, variacao = "overal", .before = -1)
    # Per-individual means, used for the between and within components
    v2 <- aggregate(df[[i]], list(df[[individuals]]), "mean")[[2]]
    sdB <- sd(v2)
    varW <- df[[i]] - rep(v2, each = 12)  # assumes 12 observations per individual
    varW <- varW + mean(df[[i]])
    sdW <- sd(varW)
    minB <- min(v2)
    maxB <- max(v2)
    minW <- min(varW)
    maxW <- max(varW)
    v <- rbind(v, c("between", NA, sdB, minB, maxB), c("within", NA, sdW, minW, maxW))
    panel <- rbind(panel, v)
  }
  var <- rep(names(df)[columns])
  n1 <- rep(NA, length(columns))
  n2 <- rep(NA, length(columns))
  var <- c(rbind(var, n1, n1))
  panel$var <- var
  panel <- panel[c(6, 1:5)]
  names(panel) <- c("variable", "variation", "mean", "standard.deviation", "min", "max")
  panel[3:6] <- as.numeric(unlist(panel[3:6]))
  panel[3:6] <- round(unlist(panel[3:6]), 2)
  return(panel)
}
