stock price prediction by using nnet - r

stock<-structure(list(week = c(1L, 2L, 5L, 2L, 3L, 4L, 3L, 2L, 1L, 5L,
1L, 3L, 2L, 4L, 3L, 4L, 2L, 3L, 1L, 4L, 3L),
close_price = c(774000L,
852000L, 906000L, 870000L, 1049000L, 941000L, 876000L, 874000L,
909000L, 966000L, 977000L, 950000L, 990000L, 948000L, 1079000L,
NA, 913000L, 932000L, 1020000L, 872000L, 916000L),
vol = c(669L,
872L, 3115L, 2693L, 575L, 619L, 646L, 1760L, 419L, 587L, 8922L,
366L, 764L, 6628L, 1116L, NA, 572L, 592L, 971L, 1181L, 1148L),
obv = c(1344430L, 1304600L, 1325188L, 1322764L, 1365797L,
1355525L, 1308385L, 1308738L, 1353999L, 1364475L, 1326557L,
1357572L, 1362492L, 1322403L, 1364273L, NA, 1354571L, 1354804L,
1363256L, 1315441L, 1327927L)),
.Names = c("week", "close_price", "vol", "obv"),
row.names = c(16L, 337L, 245L, 277L, 193L, 109L, 323L, 342L, 106L,
170L, 226L, 133L, 72L, 234L, 208L, 329L, 107L, 103L, 71L, 284L, 253L),
class = "data.frame")
I have a data set in this form (called Nam) with 349 observations, and I want to use nnet to predict close_price.
obs<- sample(1:21, 20*0.5, replace=F)
tr.Nam<- stock[obs,]; st.Nam<- stock[-obs,]
# tr.Nam is a training data set while st.Nam is test data.
library(nnet)
Nam_nnet<-nnet(close_price~., data=tr.Nam, size=2, decay=5e-4)
With this statement, I think I have fitted a model that predicts close_price.
summary(Nam_nnet)
y<-tr.Nam$close_price
p<-predict(Nam_nnet, tr.Nam, type="raw")
I expected p to be the predicted value of close_price, but it contains only 1s. Why doesn't p contain continuous values of close_price?
tt<-table(y,p)
summary(tt)
tt

I could do a bit better with a reproducible example, but the problem is likely one (or more) of a few things. First, run str(data) to make sure each variable is of the correct type (factor, numeric, etc.). Also, neural nets usually respond better to standardized, scaled, and centered data; otherwise the larger numeric inputs oversaturate the units, which might be happening here if the 'week' variable sits alongside the much larger price and volume columns.
In summary: check the type of each variable to make sure you are feeding in the correct forms, and consider scaling your data so the inputs are of comparable magnitude.
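A minimal sketch of those two checks (the choice of columns to scale is an assumption based on the dput above); note too that nnet's default logistic output is bounded in [0, 1], which is why a large continuous target comes back as all 1s unless you pass linout = TRUE:

```r
str(stock)  # confirm each variable's type (week, vol, obv should be numeric)

# Drop incomplete rows, then centre and scale the predictors so no single
# input dominates the net
stock_cc <- na.omit(stock)
stock_cc[c("week", "vol", "obv")] <- scale(stock_cc[c("week", "vol", "obv")])

library(nnet)
# linout = TRUE gives a linear output unit, i.e. a continuous prediction
fit <- nnet(close_price ~ ., data = stock_cc, size = 2, decay = 5e-4,
            linout = TRUE)
```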

Related

How to make scatterplot with colors based on a column and add a mean line through stats_summary with grouping based on another column?

I have a data.frame (see below) and I would like to build a scatterplot where the colour of the dots is based on a factor column (replicate). At the same time, I want to add a line that represents the mean of y for each x. The problem is that stat_summary uses the colours I requested for grouping, and hence I get three mean lines (one per colour) instead of one. Trying to redefine groups in either ggplot() or stat_summary() did not work.
If I disable the colours, I get what I want (a single mean line).
How do I keep the colours (plot #1) yet still have a single mean line (plot #2)?
structure(list(conc = c(10L, 10L, 10L, 25L, 25L, 25L, 50L, 50L,
50L, 75L, 75L, 75L, 100L, 100L, 100L, 200L, 200L, 200L, 300L,
300L, 300L, 400L, 400L, 400L, 500L, 500L, 500L, 750L, 750L, 750L,
1000L, 1000L, 1000L), citric_acid = c(484009.63, 409245.09, 303193.26,
426427.47, 332657.35, 330875.96, 447093.71, 344837.39, 302873.98,
435321.69, 359146.09, 341760.28, 378298.37, 342970.87, 323146.92,
362396.98, 361246.41, 290638.14, 417357.82, 351927.66, 323611.37,
416280.3, 359430.65, 327950.99, 431167.14, 361429.91, 291901.43,
340166.41, 353640.91, 341839.08, 393392.69, 311375.19, 342103.54
), MICIT = c(20771.28, 18041.97, 12924.35, 49814.13, 38683.32,
38384.72, 106812.16, 82143.12, 72342.43, 156535.39, 128672.12,
119397.14, 187208.46, 167814.92, 159418.62, 350813.47, 357227.48,
295948.31, 505553.77, 523282.46, 489652.3, 803544.84, 704431.61,
654753.29, 1030485.41, 895451.64, 717698.52, 1246839.19, 1309712.63,
1212111.53, 1930503.38, 1499838.89, 1642091.64), replicate = structure(c(1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L,
3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L
), .Label = c("1", "2", "3"), class = "factor"), MICITNorm = c(0.0429150139016862,
0.0440859779160698, 0.0426274317575529, 0.116817357005636, 0.116285781751102,
0.116009395182412, 0.238903293897827, 0.238208275500519, 0.238853235263062,
0.359585551549246, 0.358272367659634, 0.34935932285636, 0.494869856298879,
0.489297881187402, 0.493331701877276, 0.968036405822146, 0.98887482369721,
1.01827072661558, 1.21131974956166, 1.48690347328766, 1.51308744189056,
1.93029754230503, 1.95985403582026, 1.99649737297637, 2.38999059622215,
2.47752500616233, 2.45870162403795, 3.6653801002868, 3.70350995307641,
3.54585417793659, 4.90731889298706, 4.81682207885606, 4.79998435561351
)), class = "data.frame", row.names = c(NA, -33L))
ggplot(xx, aes(conc, MICIT, colour = replicate)) +
  geom_point() +
  stat_summary(geom = "line", fun = mean)
Use aes(group = 1) so that stat_summary computes a single mean across all replicates:
ggplot(xx, aes(conc, MICIT, colour = replicate)) +
  geom_point() +
  stat_summary(aes(group = 1), geom = "line", fun = mean)
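Equivalently (a sketch, not part of the original answer), you can precompute the means yourself and add them as a separate layer, which makes the single grouping explicit:

```r
library(ggplot2)

# Precompute the per-concentration mean of MICIT across replicates
means <- aggregate(MICIT ~ conc, data = xx, FUN = mean)

# inherit.aes = FALSE keeps the mean line out of the colour mapping
ggplot(xx, aes(conc, MICIT, colour = replicate)) +
  geom_point() +
  geom_line(data = means, aes(conc, MICIT), inherit.aes = FALSE)
```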

ggpairs formatting for points only

I'm looking to increase the size of the points AND outline them in black while keeping the line weight the same across the remaining plots.
library(ggplot2)
library(GGally)
pp <- ggpairs(pp.sed, columns = c(1, 2), aes(color = pond.id, alpha = 0.5)) +
  theme_bw()
print(pp)
Which gives me the following figure:
Data for reproducibility, and TIA!
> dput(pp.sed)
structure(list(Fe.259.941 = c(905.2628883, 825.7883359, 6846.128702,
1032.932924, 997.8037721, 588.9599882, 6107.641947, 798.4493611,
1046.38376, 685.2485692, 6452.273486, 730.8656684, 902.8585447,
1039.886406, 7408.801001, 2512.089991, 911.2101809, 941.3712067,
659.1069185, 1070.090445, 1017.666402, 925.3221586, 645.0500668,
954.0009756, 1022.594904, 803.5865352, 7653.184537, 1082.714082,
1048.51115, 773.9070604, 6889.060748, 973.0971769, 1002.091143,
798.9670583, 5089.035978, 2361.713222, 970.8258109, 748.3574529,
3942.04816, 889.1760124), Mn.257.611 = c(17.24667962, 14.90488024,
14.39265671, 20.51133433, 19.92596564, 11.76690074, 19.76386229,
14.29779164, 20.23646264, 13.55374658, 16.8847698, 13.11784439,
15.91777975, 20.64068844, 16.78681661, 28.61732162, 15.88328987,
19.59750367, 13.09735943, 21.59458118, 17.680152, 19.87127449,
12.8082581, 20.12050221, 17.57143193, 18.72196029, 16.21525793,
22.0518966, 18.39642397, 18.32238508, 16.17696923, 20.69668404,
17.96018218, 18.71945309, 16.50162126, 30.60719123, 17.69058768,
14.99048753, 16.28302375, 18.32277507), pond.id = structure(c(6L,
5L, 2L, 1L, 3L, 5L, 2L, 1L, 3L, 5L, 2L, 1L, 6L, 3L, 2L, 4L, 6L,
3L, 4L, 4L, 6L, 3L, 4L, 1L, 6L, 3L, 2L, 1L, 6L, 3L, 2L, 1L, 6L,
3L, 2L, 1L, 6L, 5L, 2L, 1L), .Label = c("LIL", "RHM", "SCS",
"STN", "STS", "TS"), class = "factor")), class = "data.frame", row.names = c(11L,
12L, 13L, 15L, 26L, 27L, 28L, 30L, 36L, 37L, 38L, 40L, 101L,
102L, 103L, 105L, 127L, 128L, 129L, 131L, 142L, 143L, 144L, 146L,
157L, 158L, 159L, 161L, 172L, 173L, 174L, 176L, 184L, 185L, 186L,
188L, 199L, 200L, 201L, 203L))
The GGally package offers a family of wrap_xxx functions which can be used to set parameters that override the default behaviour; e.g. using wrap you can override the default point size with wrap(ggally_points, size = 5).
To use the wrapped function instead of the default, call
ggpairs(..., lower = list(continuous = wrap(ggally_points, size = 5))).
Adding the outline is a bit trickier. Using wrap we could switch the point shape to 21 and set the outline colour to "black". However, the points are then no longer coloured, and unfortunately I have found no way to override the mapping via wrap. While it is possible to add a global fill aes, a drawback of doing so is that we lose the black outline on the densities.
One option to fix that is to write a wrapper for ggally_points which adjusts the mapping so that the fill aes is used instead of color.
library(ggplot2)
library(GGally)
ggally_points_filled <- function(data, mapping, ...) {
  # Rename the colour aes to fill so shape 21 is filled with the group colour
  names(mapping)[grepl("^colour", names(mapping))] <- "fill"
  ggally_points(data, mapping, ..., shape = 21)
}
w_ggally_points_filled <- wrap(ggally_points_filled, size = 5, color = "black")
ggpairs(pp.sed, columns = c(1, 2), aes(color = pond.id, alpha = 0.5),
        lower = list(continuous = w_ggally_points_filled)) +
  theme_bw()

R line chart - removing vexing zero line not associated with data

I have a simple (yet very large) data set of counts made at different sites from Apr to Aug.
Between mid Apr and July there are no zero counts - yet a line at zero extends from the earliest to latest date.
Here is the part of the data used to make the above chart (columns are- Site.ID, DATE, Visible Number):
data=structure(list(Site.ID = c(302L, 302L, 302L, 302L, 302L, 302L,
302L, 302L, 302L, 302L, 302L, 302L, 304L, 304L, 304L, 304L, 304L,
304L, 304L, 304L, 304L, 304L, 304L, 304L), DATE = structure(c(1L,
2L, 5L, 3L, 4L, 6L, 8L, 7L, 9L, 10L, 11L, 12L, 1L, 2L, 5L, 3L,
4L, 6L, 8L, 7L, 9L, 10L, 11L, 12L), .Label = c("3/21/2014", "3/27/2014",
"4/17/2014", "4/28/2014", "4/8/2014", "5/13/2014", "6/17/2014",
"6/6/2014", "7/10/2014", "7/22/2014", "7/29/2014", "8/5/2014"
), class = "factor"), Visible.Number = c(0L, 0L, 5L, 14L, 20L,
21L, 6L, 8L, 0L, 0L, 0L, 0L, 0L, 0L, 2L, 7L, 7L, 7L, 7L, 5L,
0L, 0L, 0L, 0L)), .Names = c("Site.ID", "DATE", "Visible.Number"
), class = "data.frame", row.names = c(NA, -24L))
attach(data)
DATE<-as.Date(DATE,"%m/%d/%Y")
plot(data$Visible.Number~DATE, type="l", ylab="Visible Number")
I have two sites but there are three lines. How to make R not plot a line along zero?
Thank you for your help!
Your problem is the multiple site IDs. Base plot draws the first site's points, then goes back (drawing a line) to start the second site's. Essentially, base plotting tries to draw all the lines without "lifting the pen". With base plotting, your option is to plot each site separately with lines(), perhaps in a for loop. I think this kind of thing is easier with ggplot2:
library(ggplot2)
data$DATE <- as.Date(data$DATE, "%m/%d/%Y")  # convert inside the data frame, not a detached copy
ggplot(data, aes(x = DATE, y = Visible.Number, group = Site.ID)) + geom_line()
# if you prefer more base-like styling
ggplot(data, aes(x = DATE, y = Visible.Number, group = Site.ID)) +
geom_line() +
theme_bw()
In base:
data$DATE <- as.Date(data$DATE, "%m/%d/%Y")
plot(data$DATE, data$Visible.Number, type = "n",
     ylab = "Visible Number", xlab = "Date")
for (site in unique(data$Site.ID)) {
  with(subset(data, Site.ID == site),
       lines(Visible.Number ~ DATE))
}
N.B. I did not attach my data as you did, so I don't know if the subsetting in the base solution will work properly for you if you do attach. In general, avoid attach; with is a nice way to save typing without attaching, and is much less "risky" in that it doesn't copy your data columns into isolated vectors, thus making them more difficult to keep track of as you subset or otherwise work with your data.
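To illustrate that N.B., a minimal sketch of the with() idiom in place of attach() (using the same data frame as above):

```r
# Convert inside the data frame rather than via attach(); a global DATE made
# after attach() would not change data$DATE
data$DATE <- as.Date(data$DATE, "%m/%d/%Y")

# with() evaluates the expression using the data frame's columns, then forgets them
with(data, plot(DATE, Visible.Number, ylab = "Visible Number"))
```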

R: prediction of stock price by neural-network [duplicate]

This question already has an answer here:
stock price prediction by using nnet
(1 answer)
Closed 9 years ago.
stock<-structure(list(week = c(1L, 2L, 5L, 2L, 3L, 4L, 3L, 2L, 1L, 5L,
1L, 3L, 2L, 4L, 3L, 4L, 2L, 3L, 1L, 4L, 3L), close_price = c(774000L,
852000L, 906000L, 870000L, 1049000L, 941000L, 876000L, 874000L,
909000L, 966000L, 977000L, 950000L, 990000L, 948000L, 1079000L,
NA, 913000L, 932000L, 1020000L, 872000L, 916000L), vol = c(669L,
872L, 3115L, 2693L, 575L, 619L, 646L, 1760L, 419L, 587L, 8922L,
366L, 764L, 6628L, 1116L, NA, 572L, 592L, 971L, 1181L, 1148L),
obv = c(1344430L, 1304600L, 1325188L, 1322764L, 1365797L,
1355525L, 1308385L, 1308738L, 1353999L, 1364475L, 1326557L,
1357572L, 1362492L, 1322403L, 1364273L, NA, 1354571L, 1354804L,
1363256L, 1315441L, 1327927L)), .Names = c("week", "close_price",
"vol", "obv"), row.names = c(16L, 337L, 245L, 277L, 193L, 109L,
323L, 342L, 106L, 170L, 226L, 133L, 72L, 234L, 208L, 329L, 107L,
103L, 71L, 284L, 253L), class = "data.frame")
This is a subset of the data I have. I split it into a training set and a test set.
obs <- sample(1:21, floor(21 * 0.5), replace = FALSE)
tr.Nam <- stock[obs, ]; st.Nam <- stock[-obs, ]
library(nnet)
Nam_nnet<-nnet(close_price~., data=tr.Nam, size=4, decay=5e-4)
summary(Nam_nnet)
y<-tr.Nam$close_price
p<-predict(Nam_nnet, st.Nam, type="raw")
p
tt<-table(y,p)
summary(tt)
tt
By this nnet procedure, I expect "p" to predict close_price. However, the values of "p" are all "1"s or "NA"s.
What should I do to predict close_price properly with nnet?
By default, nnet uses logistic output units, i.e., it tries to predict a binary variable, and its predictions are bounded in [0, 1]; that is why all your predictions come out as 1.
You want linear output units:
Nam_nnet <- nnet(
  close_price ~ .,
  data = tr.Nam,
  size = 4, decay = 5e-4,
  linout = TRUE
)
p <- predict(Nam_nnet, st.Nam, type = "raw")
plot(p, st.Nam$close_price)
However, the internal nodes are still logistic (and you probably want that, if you are using a neural network in the first place): since the values of the variables are very large, those nodes saturate, output a constant value, and the optimizer gets stuck on a plateau. Scaling the inputs avoids this.
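Combining linout = TRUE with scaling, a sketch (the [0, 1] rescaling and its inverse are my additions, not part of the original answer):

```r
library(nnet)

stock_cc <- na.omit(stock)
rng <- sapply(stock_cc, range)  # per-column min and max

# Rescale every column to [0, 1] so the logistic hidden units do not saturate
stock01 <- as.data.frame(scale(stock_cc, center = rng[1, ],
                               scale = rng[2, ] - rng[1, ]))

fit <- nnet(close_price ~ ., data = stock01, size = 4, decay = 5e-4,
            linout = TRUE)

# Predictions come back on the [0, 1] scale; invert to actual prices
p01 <- predict(fit, stock01)
p <- p01 * (rng[2, "close_price"] - rng[1, "close_price"]) + rng[1, "close_price"]
plot(p, stock_cc$close_price)
```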

Scaling data in R data frame and fitting gaussian to geom_point

2 questions based on my data.frame
structure(list(Collimator = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L), .Label = c("n", "y"), class = "factor"), angle = c(0L,
15L, 30L, 45L, 60L, 75L, 90L, 105L, 120L, 135L, 150L, 165L, 180L,
0L, 15L, 30L, 45L, 60L, 75L, 90L, 105L, 120L, 135L, 150L, 165L,
180L), X1 = c(2099L, 11070L, 17273L, 21374L, 23555L, 23952L,
23811L, 21908L, 19747L, 17561L, 12668L, 6008L, 362L, 53L, 21L,
36L, 1418L, 6506L, 10922L, 12239L, 8727L, 4424L, 314L, 38L, 21L,
50L), X2 = c(2126L, 10934L, 17361L, 21301L, 23101L, 23968L, 23923L,
21940L, 19777L, 17458L, 12881L, 6051L, 323L, 40L, 34L, 46L, 1352L,
6569L, 10880L, 12534L, 8956L, 4418L, 344L, 58L, 24L, 68L), X3 = c(2074L,
11109L, 17377L, 21399L, 23159L, 23861L, 23739L, 21910L, 20088L,
17445L, 12733L, 6046L, 317L, 45L, 26L, 46L, 1432L, 6495L, 10862L,
12300L, 8720L, 4343L, 343L, 38L, 34L, 60L), average = c(2099.6666666667,
11037.6666666667, 17337, 21358, 23271.6666666667, 23927, 23824.3333333333,
21919.3333333333, 19870.6666666667, 17488, 12760.6666666667,
6035, 334, 46, 27, 42.6666666667, 1400.6666666667, 6523.3333333333,
10888, 12357.6666666667, 8801, 4395, 333.6666666667, 44.6666666667,
26.3333333333, 59.3333333333)), .Names = c("Collimator", "angle",
"X1", "X2", "X3", "average"), row.names = c(NA, -26L), class = "data.frame")
I wish to plot detector counts versus angle with and without a collimator attached to the device. I guess geom_point is probably the best way to summarise the data
p <- ggplot(df, aes(x=angle,y=average,col=Collimator)) + geom_point() + geom_line()
Instead of plotting average count in the y-axis, I would prefer to rescale the data so that the angle with max counts has a value 1 for both collimator Y and N. The way I have done this seems quite cumbersome
range01 <- function(x){(x-min(x))/(max(x)-min(x))}
coly = subset(df,Collimator=='y')
coly$norm_count = range01(coly$average)
coln = subset(df,Collimator=='n')
coln$norm_count = range01(coln$average)
df = rbind(coln,coly)
p <- ggplot(df, aes(x = angle, y = norm_count, col = Collimator)) + geom_point() + geom_line()
I'm sure this can be done in a more efficient manner, applying the function to the data.frame based on the variable 'Collimator'. How can I do this?
Also, I want to fit a function to the data rather than using geom_line. I think a Gaussian function may work in this case, but I have no idea how/if I can implement this in stat_smooth. And can I pull out the mean/standard deviation from such a fit?
ggplot2 goes hand in hand with the package plyr:
library(plyr)
df <- ddply(df, .(Collimator), transform,
            norm_count1 = (average - min(average)) / (max(average) - min(average)))
joran's answer scales the highest value to 1 and the lowest to 0; if you just want to scale to make the highest value 1 (and leaving 0 as 0), it is even simpler.
library("plyr")
df <- ddply(df, .(Collimator), transform,
norm.average = average / max(average))
Then the plot is
ggplot(df, aes(x=angle,y=norm.average,col=Collimator)) +
geom_point() + geom_line()
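For the Gaussian-fit part of the question (not covered above), a hedged sketch using nls rather than stat_smooth, fitting one curve per Collimator level; the starting values are guesses, and the fitted mean and standard deviation fall out as the mu and sigma coefficients:

```r
# Fit A * exp(-(angle - mu)^2 / (2 * sigma^2)) separately per Collimator level
fits <- lapply(split(df, df$Collimator), function(d)
  nls(norm.average ~ A * exp(-(angle - mu)^2 / (2 * sigma^2)),
      data = d,
      start = list(A = 1,
                   mu = d$angle[which.max(d$norm.average)],  # peak as initial mean
                   sigma = 40)))

lapply(fits, coef)  # mu is the fitted mean, sigma the standard deviation
```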