I am following the tutorial here: https://www.rpubs.com/loveb/som. It shows how to use a Kohonen network (also called a SOM, a type of machine learning algorithm) on the iris data.
I ran this code from the tutorial:
library(kohonen) #fitting SOMs
library(ggplot2) #plots
library(GGally) #plots
library(RColorBrewer) #colors, using predefined palettes
iris_complete <- iris[complete.cases(iris),]
iris_unique <- unique(iris_complete) # Remove duplicates
#scale data
iris.sc = scale(iris_unique[, 1:4]) # Levels/factors cannot be scaled, but they can be used in predictive SOMs via xyf (later)
#build grid
iris.grid = somgrid(xdim = 10, ydim=10, topo="hexagonal", toroidal = TRUE)
set.seed(33) #for reproducibility
iris.som <- som(iris.sc, grid=iris.grid, rlen=700, alpha=c(0.05,0.01), keep.data = TRUE)
#plot 1
plot(iris.som, type="count")
#plot2
var <- 1 #define the variable to plot
plot(iris.som, type = "property", property = getCodes(iris.som)[,var], main=colnames(getCodes(iris.som))[var], palette.name=terrain.colors)
The above code fits a Kohonen network to the iris data. Each observation from the data set is assigned to one of the "colorful circles" (also called "neurons") in the pictures below.
My question: in these plots, how would you identify which observations were assigned to which circles? Suppose I wanted to know which observations belong to the circles outlined with the black triangles below:
Is it possible to do this? Right now, I am trying to use iris.som$unit.classif to somehow trace which points are in which circle. Is there a better way to do this?
UPDATE: @Jonny Phelps showed me how to identify observations within a triangular form (see answer below). But I am still not sure whether it is possible to identify irregularly shaped regions, e.g.
In a previous post (Labelling Points on a Plot (R Language)), a user showed me how to assign arbitrary numbers to each circle on the grid:
Based on the above plot, how could you use unit.classif to find out which observations were in circles 92, 91, 82, 81, 72 and 71?
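For what it is worth, this is the kind of lookup I have been attempting (a rough sketch, assuming the numbers on the plot correspond directly to the values in iris.som$unit.classif, which I have not verified):
# Hypothetical sketch: find the observations mapped to circles 71, 72, 81, 82, 91, 92,
# assuming the plotted circle numbers match the values in iris.som$unit.classif
target_neurons <- c(71, 72, 81, 82, 91, 92)
idx <- which(iris.som$unit.classif %in% target_neurons)
iris_unique[idx, ]  # the original (unscaled) observations assigned to those circles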
Thanks
EDIT: Now with Shiny App!
A plotly solution is also possible, where you can mouse over individual neurons to display the associated iris rownames (called id here). Based on your iris.som data and Jonny Phelps' grid approach, you can just assign the row numbers as concatenated strings to the individual neurons and have these shown upon mouseover:
library(ggplot2)
library(plotly)
ga <- data.frame(g = iris.som$unit.classif,
                 sample = seq_len(dim(iris.som$data[[1]])[1]))
grid_pts <- as.data.frame(iris.som$grid$pts)
grid_pts$column <- rep(1:iris.som$grid$xdim, times = iris.som$grid$ydim)
grid_pts$row <- rep(1:iris.som$grid$ydim, each = iris.som$grid$xdim)
grid_pts$classif <- 1:nrow(grid_pts)
grid_pts$id <- sapply(seq_along(grid_pts$classif),
                      function(x) paste(ga$sample[ga$g == x], collapse = ", "))
grid_pts$count <- sapply(seq_along(grid_pts$classif),
                         function(x) length(ga$sample[ga$g == x]))
grid_pts$count <- factor(grid_pts$count, levels = 0:max(grid_pts$count))
p1 <- ggplot(grid_pts, aes(x = x, y = y, colour = count, row = row, column = column, id = id)) +
  geom_point(size = 8) +
  scale_colour_manual(values = c("grey50", heat.colors(length(unique(grid_pts$count))))) +
  theme_void() +
  theme(plot.margin = unit(c(1, rep(.3, 3)), "cm"))
ggplotly(p1)
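As a side note, ggplotly() puts every mapped aesthetic into the hover text by default; if that gets noisy, the tooltip argument can restrict it (optional, not required for the solution above):
ggplotly(p1, tooltip = c("id", "count"))  # only show the row numbers and the count on mouseover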
Here is a full Shiny app that allows lasso selection and shows a table with the data:
invisible(suppressPackageStartupMessages(
  lapply(c("shiny", "dplyr", "ggplot2", "plotly", "kohonen", "GGally", "DT"),
         require, character.only = TRUE)))
iris_complete <- iris[complete.cases(iris),]
iris_unique <- unique(iris_complete) # Remove duplicates
#scale data
iris.sc = scale(iris_unique[, 1:4]) # Levels/factors cannot be scaled, but they can be used in predictive SOMs via xyf (later)
#build grid
iris.grid = somgrid(xdim = 10, ydim=10, topo="hexagonal", toroidal = TRUE)
set.seed(33) #for reproducibility
iris.som <- som(iris.sc, grid=iris.grid, rlen=700, alpha=c(0.05,0.01), keep.data = TRUE)
ga <- data.frame(g = iris.som$unit.classif,
                 sample = seq_len(dim(iris.som$data[[1]])[1]))
grid_pts <- as.data.frame(iris.som$grid$pts)
grid_pts$column <- rep(1:iris.som$grid$xdim, times = iris.som$grid$ydim)
grid_pts$row <- rep(1:iris.som$grid$ydim, each = iris.som$grid$xdim)
grid_pts$classif <- 1:nrow(grid_pts)
grid_pts$id <- sapply(seq_along(grid_pts$classif),
                      function(x) paste(ga$sample[ga$g == x], collapse = ", "))
grid_pts$count <- sapply(seq_along(grid_pts$classif),
                         function(x) length(ga$sample[ga$g == x]))
grid_pts$count <- factor(grid_pts$count, levels = 0:max(grid_pts$count))
# Shiny app, adapted from https://gist.github.com/dgrapov/128e3be71965bf00495768e47f0428b9
ui <- fluidPage(
  fluidRow(
    column(12, plotlyOutput("plot", height = "600px")),
    column(12, DT::dataTableOutput('data_table'))
  )
)

server <- function(input, output){
  output$plot <- renderPlotly({
    req(data())
    p <- ggplot(data = data()$data,
                aes(x=x, y=y, classif=classif, colour=count, row=row, column=column, id=id)) +
      geom_point(size=8) +
      scale_colour_manual(
        values=c("grey50", heat.colors(length(unique(grid_pts$count))))
      ) +
      theme_void() +
      theme(plot.margin=unit(c(1, rep(.3, 3)), "cm"))
    obj <- data()$sel
    if(nrow(obj) != 0) {
      p <- p + geom_point(data=obj, mapping=aes(x=x, y=y, classif=classif,
                                                count=count, row=row, column=column, id=id),
                          color="blue", size=5, inherit.aes=FALSE)
    }
    ggplotly(p, source="p1") %>% layout(dragmode = "lasso")
  })
  selected <- reactive({
    event_data("plotly_selected", source = "p1")
  })
  output$data_table <- DT::renderDataTable(
    data()$sel, filter='top', options=list(
      pageLength=5, autoWidth=TRUE
    )
  )
  data <- reactive({
    tmp <- grid_pts
    sel <- tryCatch(filter(grid_pts, paste(x, y, sep="_") %in%
                             paste(selected()$x, selected()$y, sep="_")),
                    error=function(e){NULL})
    list(data=tmp, sel=sel)
  })
}
shinyApp(ui,server)
From what I can see, using iris.som$unit.classif & iris.som$grid is the way to go in isolating circles within the plotting grid. I have made an assumption that the classifier value matches the row index of iris.som$grid so this will need some more validation. Let me know if this helps your problem :)
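One quick way to check that assumption (a rough sketch, before the main function below) is to plot the per-neuron counts derived from unit.classif as a property map and compare it with plot(iris.som, type="count"):
# Counts per neuron computed from the classifier vector; if the indexing
# assumption holds, this property plot should look identical to the "count" plot
cnt <- as.numeric(table(factor(iris.som$unit.classif,
                               levels = seq_len(nrow(iris.som$grid$pts)))))
plot(iris.som, type = "property", property = cnt,
     main = "counts derived from unit.classif")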
findTriangle <- function(top_row, top_column, side_length, iris.som, reverse=FALSE){
  # top_row: row index of the topmost triangle value
  # top_column: column index...
  # side_length: how many rows does the triangle occupy?
  # iris.som: the som object
  # reverse: set to TRUE to flip the triangle
  # make the grid
  grid_pts <- as.data.frame(iris.som$grid$pts)
  grid_pts$column <- rep(1:iris.som$grid$xdim, times=iris.som$grid$ydim)
  grid_pts$row <- rep(1:iris.som$grid$ydim, each=iris.som$grid$xdim)
  grid_pts$classif <- 1:nrow(grid_pts)
  # starting point - topmost point of the triangle
  # use reverse for triangles the other way around
  grid_pts$triangle <- FALSE
  grid_pts[grid_pts$column == top_column & grid_pts$row == top_row, ][["triangle"]] <- TRUE
  # loop through the remaining rows and fill out the triangle
  value_row <- top_row
  value_start_column <- grid_pts[grid_pts$triangle == TRUE,]$x
  value_end_column <- grid_pts[grid_pts$triangle == TRUE,]$x
  if(reverse){
    row_move <- -1
  }else{
    row_move <- 1
  }
  # update triangle
  for(row in 1:(side_length-1)){
    value_row <- value_row + row_move
    value_start_column <- value_start_column - 0.5
    value_end_column <- value_end_column + 0.5
    grid_pts[grid_pts$row == value_row &
             grid_pts$x >= value_start_column &
             grid_pts$x <= value_end_column, ]$triangle <- TRUE
  }
  # visualise
  pl <- ggplot(grid_pts, aes(x=x, y=rev(row), col=as.factor(triangle))) +
    geom_point(size=7) +
    scale_color_manual(values=c("grey", "indianred")) +
    theme_void()
  print(pl)
  return(grid_pts)
}
# take the grid and pick out the triangle
top_row <- 2
top_column <- 6
side_length <- 4
reverse <- FALSE # set to TRUE to flip the triangle ie go from the bottom
grid_pts <- findTriangle(top_row, top_column, side_length, iris.som, reverse)
# now add the classifier and merge to get the co-ordinates
iris.sc2 <- as.data.frame(iris.sc)
iris.sc2$classif <- iris.som$unit.classif
iris.sc2 <- merge(iris.sc2, grid_pts, by=c("classif"), all.x=TRUE)
# filter to the points in the triangle
iris.sc2[iris.sc2$triangle==TRUE,]
Output data:
classif Sepal.Length Sepal.Width Petal.Length Petal.Width x y column row triangle
21 16 -1.01537328 0.5506423 -1.3287735 -1.3042249 6.0 1.732051 6 2 TRUE
22 16 -1.01537328 0.3214643 -1.4419091 -1.3042249 6.0 1.732051 6 2 TRUE
39 25 -0.89501479 1.0089981 -1.3287735 -1.3042249 5.5 2.598076 5 3 TRUE
40 25 -0.77465630 1.0089981 -1.2722057 -1.3042249 5.5 2.598076 5 3 TRUE
41 25 -0.77465630 0.7798202 -1.3287735 -1.3042249 5.5 2.598076 5 3 TRUE
42 25 -1.01537328 0.7798202 -1.2722057 -1.3042249 5.5 2.598076 5 3 TRUE
43 25 -0.89501479 0.7798202 -1.2722057 -1.3042249 5.5 2.598076 5 3 TRUE
44 26 -0.89501479 0.5506423 -1.1590702 -0.9108454 6.5 2.598076 6 3 TRUE
45 26 -1.01537328 0.7798202 -1.2156380 -1.0419719 6.5 2.598076 6 3 TRUE
58 36 -0.53393933 0.7798202 -1.2722057 -1.0419719 6.0 3.464102 6 4 TRUE
59 36 -0.41358084 1.0089981 -1.3853413 -1.3042249 6.0 3.464102 6 4 TRUE
60 36 -0.53393933 0.7798202 -1.1590702 -1.3042249 6.0 3.464102 6 4 TRUE
61 37 -1.01537328 1.0089981 -1.2156380 -0.7797188 7.0 3.464102 7 4 TRUE
62 37 -1.01537328 1.0089981 -1.3853413 -1.1730984 7.0 3.464102 7 4 TRUE
63 37 -0.89501479 1.0089981 -1.3287735 -1.1730984 7.0 3.464102 7 4 TRUE
74 44 0.06785311 0.3214643 0.5945312 0.7937995 4.5 4.330127 4 5 TRUE
75 46 -0.65429782 1.4673539 -1.2722057 -1.3042249 6.5 4.330127 6 5 TRUE
76 46 -0.53393933 1.4673539 -1.2722057 -1.3042249 6.5 4.330127 6 5 TRUE
77 47 -0.89501479 1.6965319 -1.0459346 -1.0419719 7.5 4.330127 7 5 TRUE
78 47 -0.89501479 1.6965319 -1.2156380 -1.3042249 7.5 4.330127 7 5 TRUE
79 47 -0.89501479 1.4673539 -1.2722057 -1.0419719 7.5 4.330127 7 5 TRUE
80 47 -0.89501479 1.6965319 -1.2722057 -1.1730984 7.5 4.330127 7 5 TRUE
Validation plotting on the grid:
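Something along these lines should do it (an untested sketch; it assumes grid_pts is the data frame returned by findTriangle() above and uses plot.kohonen's bgcol argument for type = "mapping"):
# Colour the background of the triangle neurons and overlay the mapped observations
bg <- ifelse(grid_pts$triangle, "indianred1", "grey90")
plot(iris.som, type = "mapping", bgcol = bg,
     main = "Observations mapped onto the selected triangle")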
I elaborated on the example in my post R, SOM, Kohonen Package, Outlier Detection (not on the iris data set, but I suppose that is no problem) and also added the code snippets you might need. They show:
How to generate data, add outliers and depict them on plots
How to train the SOM
How to do the clustering
How to use hierarchical clustering to add the cluster boundaries to the SOM plots (see the short sketch after this list)
Finally, how I added the clusters predicted by the SOM, to compare them with the real clusters from which I generated the data
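As a flavour of the clustering-boundary step, here is a minimal sketch on the iris.som object from the question (add.cluster.boundaries() is from the kohonen package; the choice of three clusters is arbitrary here):
# Cluster the codebook vectors hierarchically and draw the boundaries on the SOM plot
neuron_clusters <- cutree(hclust(dist(getCodes(iris.som))), k = 3)
plot(iris.som, type = "count")
add.cluster.boundaries(iris.som, neuron_clusters)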
I think this answers your questions. It would also be nice to compare the performance of SOM with t-SNE. I have only used SOM as an experiment on the data I generated and on the real wine data set. It would also be nice to prepare heat maps if you have more than 2 variables. All the best to your analysis!
Related
I have a data set of real estate data. I'm trying to create a new column of days on market groups (labeled DOM_Groups) and group them into 15-day intervals (i.e. 0-14, 15-29, etc.). Then I'm trying to summarize() these groupings by the count of observations and the average sale price for each 15-day group.
I'm using the cut() function attempting to break my DOM_Groups into these 15-day intervals. In the base spreadsheet that I imported, the column containing the days on market has a unique observation in each cell, and the data in that column are numeric whole numbers...no decimals, no negative numbers.
When I run the following code, the tibble output is not grouping correctly, and it is including a negative number with a decimal, which does not exist in my data set. I'm not sure what to do to correct this.
gibbsMkt %>%
  mutate(DOM_Groups = cut(DOM, breaks = 15, dig.lab = 2)) %>%
  filter(Status == "SOLD") %>%
  group_by(DOM_Groups) %>%
  summarize(numDOM = n(),
            avgSP = mean(`Sold Price`, na.rm = TRUE))
The tibble output I get is this:
DOM_Groups numDOM avgSP
<fct> <int> <dbl>
1 (-0.23,16] 74 561675.
2 (16,31] 18 632241.
3 (31,47] 11 561727.
4 (47,63] 8 545862.
5 (63,78] 7 729286.
6 (78,94] 6 624167.
7 (1.4e+02,1.6e+02] 2 541000
8 (1.6e+02,1.7e+02] 1 535395
Also, for rows 7 & 8 in the tibble, the largest number is 164, so I also don't understand why these rows are being converted to scientific notation.
When I use an Excel pivot table, I get the output that I want to reproduce in R, which is depicted below:
How can I reproduce this in R with the correct code?
cut(x, breaks = 15) means x will be cut into 15 intervals--it cannot guess that you want 15-unit intervals starting with 0 and ending with 150. This is in the docs for ?cut:
breaks either a numeric vector of two or more unique cut points or a single number (greater than or equal to 2) giving the number of intervals into which x is to be cut.
You will need to define your own start and end to each interval such as:
seq(0, max(x), 15)
# [1] 0 15 30 45 60 75 90 105 120 135 150
cut(x, seq(0, max(x), 15))
However, if you set it up correctly, you can define your intervals and make labels at the same time.
set.seed(1)
x <- floor(runif(500, 0, 164))
from <- seq(0, max(x), 15)
to <- from + 15 - 1
labs <- sprintf('%s-%s', from, to)
# [1] "0-14" "15-29" "30-44" "45-59" "60-74" "75-89" "90-104" "105-119" "120-134" "135-149" "150-164"
data.frame(table(cut(x, c(from, Inf), right = FALSE)), labels = labs)
# Var1 Freq labels
# 1 [0,15) 35 0-14
# 2 [15,30) 57 15-29
# 3 [30,45) 45 30-44
# 4 [45,60) 44 45-59
# 5 [60,75) 57 60-74
# 6 [75,90) 55 75-89
# 7 [90,105) 33 90-104
# 8 [105,120) 47 105-119
# 9 [120,135) 40 120-134
# 10 [135,150) 39 135-149
# 11 [150,Inf) 48 150-164
DOM_Groups <- cut(x, c(from, Inf), labs, right = FALSE)
data.frame(table(DOM_Groups))
# DOM_Groups Freq
# 1 0-14 35
# 2 15-29 57
# 3 30-44 45
# 4 45-59 44
# 5 60-74 57
# 6 75-89 55
# 7 90-104 33
# 8 105-119 47
# 9 120-134 40
# 10 135-149 39
# 11 150-164 48
As for your other question, "why am I getting negative numbers": as I mentioned, this does not mean that you have negatives in your data--these are just the labels generated by using breaks = 15 with your data. The scientific notation in rows 7 and 8 comes from dig.lab = 2: with only two significant digits, break points such as 141.3 cannot be printed as plain integers, so cut() switches to scientific notation; using dig.lab = 3 or more avoids this.
These are the relevant lines in cut.default:
if (length(breaks) == 1L) {
if (is.na(breaks) || breaks < 2L)
stop("invalid number of intervals")
nb <- as.integer(breaks + 1)
dx <- diff(rx <- range(x, na.rm = TRUE))
if (dx == 0) {
dx <- if (rx[1L] != 0)
abs(rx[1L])
else 1
breaks <- seq.int(rx[1L] - dx/1000, rx[2L] + dx/1000,
length.out = nb)
}
else {
breaks <- seq.int(rx[1L], rx[2L], length.out = nb)
breaks[c(1L, nb)] <- c(rx[1L] - dx/1000, rx[2L] +
dx/1000)
}
Using the x from before and breaks = 15, you can see how negatives are introduced:
breaks <- 15
nb <- as.integer(breaks + 1)
dx <- diff(rx <- range(x, na.rm = TRUE))
if (dx == 0) {
dx <- if (rx[1L] != 0)
abs(rx[1L])
else 1
breaks <- seq.int(rx[1L] - dx/1000, rx[2L] + dx/1000,
length.out = nb)
} else {
breaks <- seq.int(rx[1L], rx[2L], length.out = nb)
breaks[c(1L, nb)] <- c(rx[1L] - dx/1000, rx[2L] + dx/1000)
}
breaks
# [1] -0.16300 10.86667 21.73333 32.60000 43.46667 54.33333 65.20000 76.06667 86.93333 97.80000 108.66667 119.53333 130.40000
# [14] 141.26667 152.13333 163.16300
levels(cut(x, breaks = 15))
# [1] "(-0.163,10.9]" "(10.9,21.7]" "(21.7,32.6]" "(32.6,43.5]" "(43.5,54.3]" "(54.3,65.2]" "(65.2,76.1]" "(76.1,86.9]"
# [9] "(86.9,97.8]" "(97.8,109]" "(109,120]" "(120,130]" "(130,141]" "(141,152]" "(152,163]"
Here's a simple solution with my santoku package:
library(santoku)
gibbsMkt %>%
  mutate(DOM_Groups = chop_width(DOM, 15, labels = lbl_dash("-")))
# then proceed as before
You can use the start argument to chop_width if you want to start the intervals at a particular number.
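For example (a sketch; it assumes your DOM values start at 0, which is what the 0-14, 15-29, ... grouping implies):
gibbsMkt %>%
  mutate(DOM_Groups = chop_width(DOM, 15, start = 0, labels = lbl_dash("-")))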
I am trying to calculate DFFITS for a GLM where the responses follow a Beta distribution, using the betareg R package. But I think this package doesn't support influence.measures(), because running dffits() as below
Code
require(betareg)
data("ReadingSkills")
y <- ReadingSkills$accuracy
n <- length(y)
bfit <- betareg(accuracy ~ dyslexia + iq, data = ReadingSkills)
DFFITS <- dffits(bfit, infl = influence(bfit, do.coef = FALSE))
DFFITS
yields
Error in if (model$rank == 0) { : argument is of length zero
I am a newbie in R and don't know how to resolve this problem. Kindly help me solve it, or give me some tips (through R code) on how to calculate DFFITS manually.
Regards
DFFITS is not implemented for "betareg" objects, but you could try to calculate it manually. For reference, DFFITS for observation i is the leave-one-out change in its fitted value divided by the leave-one-out residual standard error times the square root of its leverage, i.e. DFFITS[i] = (yhat[i] - yhat.loo[i]) / (s.loo * sqrt(h[i])).
According to this Stack Overflow Q/A we could write this function:
dffits1 <- function(x1, bres.type="response") {
  stopifnot(class(x1) %in% c("lm", "betareg"))
  sapply(1:length(x1$fitted.values), function(i) {
    x2 <- update(x1, data=x1$model[-i, ])  # leave one out
    h <- hatvalues(x1)
    nm <- rownames(x1$model[i, ])
    num_dffits <- suppressWarnings(predict(x1, x1$model[i, ]) -
                                     predict(x2, x1$model[i, ]))
    residx <- if (class(x1) == "betareg") {
      betareg:::residuals.betareg(x2, type=bres.type)
    } else {
      x2$residuals
    }
    denom_dffits <- sqrt(c(crossprod(residx)) / x2$df.residual * h[i])
    return(num_dffits / denom_dffits)
  })
}
It works well for lm:
fit <- lm(mpg ~ hp, mtcars)
dffits1(fit)
stopifnot(all.equal(dffits1(fit), dffits(fit)))
Now let's try betareg:
library(betareg)
data("ReadingSkills")
bfit <- betareg(accuracy ~ dyslexia + iq, data=ReadingSkills)
dffits1(bfit)
# 1 2 3 4 5 6 7
# -0.07590185 -0.21862047 -0.03620530 0.07349169 -0.11344968 -0.39255172 -0.25739032
# 8 9 10 11 12 13 14
# 0.33722706 0.16606198 0.10427684 0.11949807 0.09932991 0.11545263 0.09889406
# 15 16 17 18 19 20 21
# 0.21732090 0.11545263 -0.34296030 0.09850239 -0.36810187 0.09824013 0.01513643
# 22 23 24 25 26 27 28
# 0.18635669 -0.31192106 -0.39038732 0.09862045 -0.10859676 0.04362528 -0.28811277
# 29 30 31 32 33 34 35
# 0.07951977 0.02734462 -0.08419156 -0.38471945 -0.43879762 0.28583882 -0.12650591
# 36 37 38 39 40 41 42
# -0.12072976 -0.01701615 0.38653773 -0.06440176 0.15768684 0.05629040 0.12134228
# 43 44
# 0.13347935 0.19670715
Not bad.
Notes:
Even if this works in code, you should check if it meets your statistical requirements!
I've used suppressWarnings() around the num_dffits calculation in dffits1. predict(bfit, ReadingSkills) drops the contrasts somehow, whereas predict(bfit) does not (they should practically be the same). However, the results are identical: all.equal(predict(bfit, ReadingSkills), predict(bfit)), thus ignoring the warnings should be safe.
I have vectors of different sizes, and I want to sample all of them equally (for example, 10 samples from each vector), in a way that these samples represent each vector.
suppose that one of my vectors is
y=c(2.5,1,0,1.2,2,3,2,1,0,-2,-1,.5,2,3,6,5,7,9,11,15,23)
what are the 10 representative points of this vector?
In case you are referring to retaining the shape of the curve, you can try preserving the local minima and maxima:
library(dplyr)

df <- as.data.frame(y)
y2 <- df %>%
  mutate(loc_minima = if_else(lag(y) > y & lead(y) > y, TRUE, FALSE)) %>%
  mutate(loc_maxima = if_else(lag(y) < y & lead(y) < y, TRUE, FALSE)) %>%
  filter(loc_minima == TRUE | loc_maxima == TRUE) %>%
  select(y)
Though this does not guarantee you'll have exactly 10 points.
Thanks to @minem, I got my answer. Perfect!
library(kmlShape)
Px=(1:length(y))
Py=y
par(mfrow=c(1,2))
plot(Px,Py,type="l",main="original points")
plot(DouglasPeuckerNbPoints(Px,Py,10),type="b",col=2,main="reduced points")
and the result is as below (using Ramer–Douglas–Peucker algorithm):
The best answer has already been given, but since I was working on it, I will post my naive heuristic solution:
Disclaimer:
this is certainly less efficient and more naive than the Ramer–Douglas–Peucker algorithm, but in this case it gives a similar result...
# Try to remove iteratively one element from the vector until we reach N elements only.
# At each iteration, the reduced vector is interpolated and completed again
# using a spline, then it's compared with the original one and the
# point leading to the smallest difference is selected for the removal.
heuristicDownSample <- function(x, y, n=10){
  idxReduced <- 1:length(x)
  while(length(idxReduced) > n){
    minDist <- NULL
    idxFinal <- NULL
    for(idxToRemove in 1:length(idxReduced)){
      newIdxs <- idxReduced[-idxToRemove]
      spf <- splinefun(x[newIdxs], y[newIdxs])
      full <- spf(x)
      dist <- sum((full - y)^2)
      if(is.null(minDist) || dist < minDist){
        minDist <- dist
        idxFinal <- newIdxs
      }
    }
    idxReduced <- idxFinal
  }
  return(list(x=x[idxReduced], y=y[idxReduced]))
}
Usage:
y=c(2.5,1,0,1.2,2,3,2,1,0,-2,-1,.5,2,3,6,5,7,9,11,15,23)
x <- 1:length(y)
reduced <- heuristicDownSample(x,y,10)
par(mfrow=c(1,2))
plot(x=x,y=y,type="b",main="original")
plot(x=reduced$x,y=reduced$y,type="b",main="reduced",col='red')
You could use cut to generate a factor that indicates which bin each of your values belongs to (the code below splits the range into four equal-width bins, called quintiles here, but you could use whatever quantiles you want), and then sample from there:
df <- data.frame(values = c(2.5,1,0,1.2,2,3,2,1,0,-2,-1,.5,2,3,6,5,7,9,11,15,23))
cutpoints <- seq(min(df$values), max(df$values), length.out = 5)
> cutpoints
[1] -2.00 4.25 10.50 16.75 23.00
df$quintiles <- cut(df$values, cutpoints, include.lowest = TRUE)
> df
values quintiles
1 2.5 [-2,4.25]
2 1.0 [-2,4.25]
3 0.0 [-2,4.25]
4 1.2 [-2,4.25]
5 2.0 [-2,4.25]
6 3.0 [-2,4.25]
7 2.0 [-2,4.25]
8 1.0 [-2,4.25]
9 0.0 [-2,4.25]
10 -2.0 [-2,4.25]
11 -1.0 [-2,4.25]
12 0.5 [-2,4.25]
13 2.0 [-2,4.25]
14 3.0 [-2,4.25]
15 6.0 (4.25,10.5]
16 5.0 (4.25,10.5]
17 7.0 (4.25,10.5]
18 9.0 (4.25,10.5]
19 11.0 (10.5,16.8]
20 15.0 (10.5,16.8]
21 23.0 (16.8,23]
Now you could split the data by quintiles, calculate the propensities and sample from the groups.
groups <- split(df$values, df$quintiles)  # split the values themselves (not the whole data frame) so resample() below samples observations
probs <- prop.table(table(df$quintiles))
nsample <- as.vector(ceiling(probs*10))
> nsample
[1] 7 2 1 1
resample <- function(x, ...) x[sample.int(length(x), ...)]
mysamples <- mapply(function(x, y) resample(x = x, size = y), groups, nsample)
z <- unname(unlist(mysamples))
> z
[1] 2.0 1.0 0.0 1.0 3.0 0.5 3.0 5.0 9.0 11.0 23.0
Due to ceiling(), this may lead to 11 cases being sampled instead of 10.
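If exactly 10 samples are required, one simple (hypothetical) workaround is to trim the pooled sample back down afterwards:
z10 <- if (length(z) > 10) sample(z, 10) else z  # drop one value at random if ceiling() gave 11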
Apparently you are interested in systematic sampling. If so, maybe the following can help.
set.seed(1234)
n <- 10
step <- floor(length(y)/n)              # spacing between sampled indices
first <- sample(step, 1)                # random starting offset within the first window
z <- y[step*(seq_len(n) - 1) + first]   # take every step-th element from there
I have the R iris dataset, which I am using for a PNN. The 3 species have been recoded as levels 0 to 2 as follows: 0 is setosa, 1 is versicolor, 2 is virginica. The training set is 75% of the data.
Q1. I don't understand the function pred_pnn; if anyone is good in R, perhaps you can explain how it works.
Q2. The output for the test set (the prediction) is shown below. I don't understand it, because it is supposed to be something close to 0, 1 or 2.
library(pnn)      # learn(), smooth(), guess()
library(foreach)  # %dopar%

data = read.csv("c:/iris-recoded.csv", header = T)
size = nrow(data)
length = ncol(data)
index <- 1:size
positions <- sample(index, trunc(size * 0.75))
training <- data[positions,]
testing <- data[-positions, 1:(length - 1)]
result = data[-positions,]
result$actual = result[,length]
result$predict = -1
nn1 <- smooth(learn(training), sigma = 0.9)
pred_pnn <- function(x, nn){
  xlst <- split(x, 1:nrow(x))
  pred <- foreach(i = xlst, .combine = rbind) %dopar% {
    data.frame(prob = guess(nn, as.matrix(i))$probabilities[1], row.names = NULL)
  }
}
print(pred_pnn(testing, nn1))
print(pred_pnn(testing, nn1))
prob
1 1.850818e-03
2 9.820653e-03
3 6.798603e-04
4 7.421435e-03
5 2.168817e-03
6 3.277354e-03
7 6.541173e-03
8 1.725332e-04
9 2.081845e-03
10 2.491388e-02
11 7.679823e-03
12 1.291811e-03
13 2.197234e-06
14 1.316366e-03
15 1.421219e-05
16 4.639239e-05
17 3.671907e-04
18 1.460001e-04
19 4.382849e-05
20 2.387543e-05
21 1.011196e-05
22 2.719982e-04
23 4.445472e-04
24 1.281762e-04
25 5.931106e-09
26 9.741870e-08
27 9.236434e-09
28 8.384690e-08
29 3.311667e-07
30 6.045306e-11
31 2.949265e-08
32 2.070014e-10
33 8.043735e-06
34 2.136666e-08
35 5.604398e-08
36 2.455841e-07
37 3.445977e-07
38 7.314647e-07
I'm assuming you're using the pnn package. The documentation for ?guess would lead us to believe that it does something similar to what predict does for other models. In other words, it predicts which class the observation belongs to. Everything else in there is for bookkeeping. Why do you get only the probabilities? Because the person who wrote the function made it that way, by extracting guess(x)$probabilities and returning only that. If you look at the raw output, you will also find the predicted class tucked away in the $category list element.
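For example, a minimal variation of pred_pnn (a sketch, assuming the pnn and foreach setup from your code) that returns the predicted class alongside the probability:
pred_pnn2 <- function(x, nn){
  xlst <- split(x, 1:nrow(x))
  foreach(i = xlst, .combine = rbind) %dopar% {
    g <- guess(nn, as.matrix(i))
    data.frame(category = g$category,        # predicted class (0, 1 or 2 here)
               prob = max(g$probabilities),  # probability of that class
               row.names = NULL)
  }
}
print(pred_pnn2(testing, nn1))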
I would like to eliminate the gap between the x and y axes in a barplot and extend the predicted line back to intersect the y axis, preferably in base R. Is this possible? Thank you for any advice or suggestions.
my.data <- read.table(text = '
band mid.point count
1 0.5 74
2 1.5 73
3 2.5 79
4 3.5 70
5 4.5 78
6 5.5 63
7 6.5 59
8 7.5 60
', header = TRUE)
my.data
x <- my.data$mid.point^2
my.model <- lm(count ~ x, data = my.data)
my.plot <- barplot(my.data$count, ylim=c(0,100), space=0, col=NA)
axis(1, at=my.plot+0.5, labels=my.data$band)
lines(predict(my.model, data.frame(x=x), type="resp"), col="black", lwd = 1.5)
EDIT November 26, 2014
I just realized the two plots are not the same (the plot in the original post and the plot in my answer below). Compare the two curved lines closely, particularly at the right-hand side of the plot. Clearly the two curved lines intersect the top of the 8th bar in different locations. However, I have not yet had time to figure out why the plots differ. (A likely explanation: in the original code, lines(predict(...)) plots the eight predicted values against the default x positions 1:8, i.e. at the right-hand edges of the bars when space = 0, whereas the code below evaluates the model along z and draws it at the bar midpoints 0.5, 1.5, ..., so the original curve sits about half a bar to the right.)
Here is one way to extrapolate the predicted line back to the y axis. I incorporate rawr's suggestion regarding eliminating the gap between the y axis and the x axis.
setwd('c:/users/markm/simple R programs/')
jpeg(filename = "barplot_and_line.jpeg")
my.data <- read.table(text = '
band mid.point count
1 0.5 74
2 1.5 73
3 2.5 79
4 3.5 70
5 4.5 78
6 5.5 63
7 6.5 59
8 7.5 60
', header = TRUE)
x <- my.data$mid.point^2
my.model <- lm(count ~ x, data = my.data)
z <- seq(0,8,0.01)
y <- my.model$coef[1] + my.model$coef[2] * z^2
barplot(my.data$count, ylim=c(0,100), space=0, col=NA, xaxs = 'i')
points(z, y, type='l', col=1)
dev.off()
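An equivalent, slightly more compact way to draw the curve (a sketch; it relies on the fact that with space = 0 the bar midpoints sit at 0.5, 1.5, ..., so the horizontal plot coordinate equals mid.point):
barplot(my.data$count, ylim = c(0, 100), space = 0, col = NA, xaxs = 'i')
curve(my.model$coef[1] + my.model$coef[2] * x^2, from = 0, to = 8, add = TRUE)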