I am using the train() function from the caret package to train a neural network from the neuralnet package.
Before training, a few objects need to exist: the training data frame, the preProcess object passed to predict(), the activation function, and the tuning grid. Simply copy/paste the following code to set them up.
train_df <- structure(list(DC1 = c(4, 5, 6, 8, 11, 11, 11, 6, 8, 5, 9, 10,
6, 11, 15, 13, 17, 6, 7, 9, 9, 9, 10, 10, 13, 15), DC2 = c(10,
7, 11, 11, 12, 8, 10, 5, 8, 5, 9, 7, 6, 13, 16, 13, 16, 7, 8,
13, 12, 9, 13, 13, 15, 13), DC3 = c(9, 8, 10, 10, 9, 11, 9, 6,
9, 8, 9, 7, 7, 13, 14, 11, 9, 11, 12, 13, 12, 10, 12, 11, 11,
11), DC4 = c(12, 9, 11, 11, 8, 13, 12, 7, 9, 10, 8, 8, 10, 12,
12, 10, 10, 13, 11, 12, 11, 14, 14, 11, 13, 14), DC5 = c(14,
12, 9, 14, 10, 14, 13, 11, 10, 10, 10, 12, 14, 14, 12, 14, 16,
14, 12, 14, 11, 16, 15, 11, 19, 17), DC6 = c(10, 12, 10, 18,
11, 15, 16, 12, 12, 11, 13, 13, 14, 16, 19, 19, 22, 15, 14, 15,
14, 14, 16, 15, 20, 23), DC7 = c(20, 21, 12, 17, 15, 19, 17,
11, 16, 15, 17, 15, 15, 25, 22, 20, 18, 15, 13, 20, 20, 19, 24,
17, 29, 24), DC8 = c(21, 15, 20, 17, 21, 17, 19, 15, 21, 14,
20, 19, 15, 30, 26, 24, 22, 16, 19, 21, 32, 27, 23, 28, 33, 30
), DC9 = c(28, 26, 22, 17, 23, 16, 18, 18, 18, 22, 25, 25, 23,
32, 27, 32, 27, 22, 24, 22, 29, 33, 26, 21, 31, 30), DC10 = c(30,
33, 23, 43, 26, 30, 22, 34, 23, 23, 36, 37, 29, 29, 30, 50, 31,
25, 27, 23, 30, 36, 35, 38, 18, 33)), row.names = c(165L, 167L,
168L, 172L, 174L, 176L, 177L, 236L, 246L, 260L, 263L, 277L, 280L,
281L, 282L, 302L, 315L, 321L, 322L, 331L, 332L, 333L, 335L, 336L,
339L, 340L), class = "data.frame")
pre_mdl_df <- structure(list(dim = c(26L, 10L), bc = NULL, yj = NULL, et = NULL,
invHyperbolicSine = NULL, mean = NULL, std = NULL, ranges = structure(c(4,
17, 5, 16, 6, 14, 7, 14, 9, 19, 10, 23, 11, 29, 14, 33, 16,
33, 18, 50), .Dim = c(2L, 10L), .Dimnames = list(NULL, c("DC1",
"DC2", "DC3", "DC4", "DC5", "DC6", "DC7", "DC8", "DC9", "DC10"
))), rotation = NULL, method = list(range = c("DC1", "DC2",
"DC3", "DC4", "DC5", "DC6", "DC7", "DC8", "DC9", "DC10"),
ignore = character(0)), thresh = 0.95, pcaComp = NULL,
numComp = NULL, ica = NULL, wildcards = list(PCA = character(0),
ICA = character(0)), k = 5, knnSummary = function (x,
...)
UseMethod("mean"), bagImp = NULL, median = NULL, data = NULL,
rangeBounds = c(0, 1)), class = "preProcess")
softplus <- function(x) log(1+exp(x)) # activation function
tune.grid.neuralnet <- expand.grid(
layer1 = c(1:5), # first hidden layer, from 1 to 5 neurons
layer2 = c(0:5), # second hidden layer, from 0 to 5 neurons
layer3 = c(0:5) # third hidden layer, from 0 to 5 neurons
)
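As a quick sanity check on the grid above, here is a minimal base R sketch (the name grid_check is only illustrative) that counts how many layer combinations expand.grid() generates. Assuming caret fits one model per combination and per resample, this count times the number of folds gives a rough idea of how much work train() has to do.

```r
# Same layer ranges as tune.grid.neuralnet, under a separate (hypothetical) name
grid_check <- expand.grid(
  layer1 = 1:5,  # first hidden layer, 1 to 5 neurons
  layer2 = 0:5,  # second hidden layer, 0 to 5 neurons
  layer3 = 0:5   # third hidden layer, 0 to 5 neurons
)

n_combinations <- nrow(grid_check)  # 5 * 6 * 6 combinations
n_folds <- 2                        # matches number = 2 in trainControl
n_resample_fits <- n_combinations * n_folds

print(n_combinations)
print(n_resample_fits)
head(grid_check, 3)
```

If the run time becomes a concern, narrowing the ranges of layer2 and layer3 shrinks the grid multiplicatively.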
Now I run the training of the neural network:
model <- train(DC1 + DC2 + DC3 ~., data = predict(pre_mdl_df, train_df),
method = "neuralnet",
tuneGrid = tune.grid.neuralnet,
metric = "RMSE",
stepmax = 100000,
learningrate = 0.01,
threshold = 0.01,
act.fct = softplus,
trControl = caret::trainControl (
method = "repeatedcv",
number = 2, # Number of folds of the cv
repeats = 1, # Number of cv repetitions
verboseIter = TRUE,
savePredictions = TRUE,
allowParallel = TRUE))
It takes some time (around 1 min on my computer), but it eventually creates the model.
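For context, the preProcess object uses method "range" with rangeBounds c(0, 1), so predict(pre_mdl_df, train_df) should rescale every column to [0, 1] using the stored minimum and maximum. A minimal base R sketch of that arithmetic for DC1 (stored range 4 to 17), with illustrative variable names and only a few of the DC1 values:

```r
# A few DC1 values from train_df and the stored range for DC1
dc1 <- c(4, 5, 6, 8, 11, 17)
dc1_range <- c(min = 4, max = 17)

# Min-max scaling to [0, 1]: (x - min) / (max - min)
dc1_scaled <- (dc1 - dc1_range[["min"]]) / (dc1_range[["max"]] - dc1_range[["min"]])

print(round(dc1_scaled, 3))
```

This means all three outcomes (DC1, DC2, DC3) and all predictors enter the network on the same 0 to 1 scale.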
Problem: Error: wrong model type for classification
However, based on this question on SO, to get separate predictions for each dependent variable I should use cbind() in the formula (in the code below I only changed the first line):
model <- train(cbind(DC1, DC2, DC3) ~., data = predict(pre_mdl_df, train_df),
method = "neuralnet",
tuneGrid = tune.grid.neuralnet,
metric = "RMSE",
stepmax = 100000,
learningrate = 0.01,
threshold = 0.01,
act.fct = softplus,
trControl = caret::trainControl (
method = "repeatedcv",
number = 2, # Number of folds of the cv
repeats = 1, # Number of cv repetitions
verboseIter = TRUE,
savePredictions = TRUE,
allowParallel = TRUE))
I get this error:
Error: wrong model type for classification
Where is the problem?
How can I replicate the process explained in the previous linked question with a neural network?
I have longitudinal data for which I would like to reverse score a subset of items using corresponding predefined maximum scores that are stored in a separate data frame.
In the example data below (df) there are three scores, DST, SOS, and VR, at two timepoints (baseline and wave 1). neg_skew.vars contains the scores that are to be reversed across timepoints. I would like to reverse scores based on the maximum possible value for that score, as stored in df.CP1.vars$max.vars. I'd like this to work when multiple scores with different maximum values are included in neg_skew.vars.
For example, "SOS.Score" is stored in neg_skew.vars. Therefore I want all SOS.Score variables to be reversed (i.e., across timepoints); this would include 'SOS.Score.baseline' and 'SOS.Score.wave1' in the example data below. I want scores to be reversed using the corresponding maximum score for SOS. For each SOS variable, I want each value to be reversed like this: (20 + 1) - value. The 20 corresponds to the maximum value for SOS stored in df.CP1.vars. As DST is also negatively skewed, all DST scores (i.e., 'DST.Score.baseline' and 'DST.Score.wave1') should be reversed, but with 16 as the maximum value, per df.CP1.vars, so: (16 + 1) - value. This results in the desired data frame df_wanted below. VR.Score does not appear in neg_skew.vars and so no VR.Score variables are reversed (i.e., VR.Score.baseline and VR.Score.wave1).
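To make the reversal rule concrete, here is a small worked example in base R using a few of the SOS baseline values (the variable names are only for illustration):

```r
# Predefined maximum for SOS and a few baseline scores
sos_max <- 20
sos_baseline <- c(4, 11, 7, 0)

# Reverse using the fixed maximum, not the observed maximum
sos_reversed <- (sos_max + 1) - sos_baseline

print(sos_reversed)
```

Note that with the "+ 1" a raw score of 0 maps to 21 and a raw score of 20 maps to 1.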
So far I have the code listed below under # reverse scores. However, it produces two undesired outcomes in the resulting data frame (i.e., df2): A) the columns for other scores, such as VR, are not retained, and B) the maximum value used to reverse items is the observed maximum for that item at that timepoint; this is a problem as the data is longitudinal.
The desired data should look like df_wanted. I tried to set up a for-loop but ran into problems using it with the dplyr pipeline.
# required packages
library(dplyr)
# create relevant variables and data sets
CP1.vars <- c("DST.Score","SOS.Score", "VR.Score")
max.vars <- c(16,20,80)
df.CP1.vars <- data.frame(CP1.vars, max.vars)
df <- structure(list(
SOS.Score.baseline = c(4, 11, 7, 9, 10, 8, 6, 8, 7, 0, 9, 10),
SOS.Score.wave1 = c(NA, 7.5, 8.5, NA, NA, 6.66, NA, 6, 8, 8, 7, 8),
DST.Score.baseline = c(11, 10, 8, 8, 8, 8, 9, 9, 7, 6, 7, 6),
DST.Score.wave1 = c(NA, 10, 8.5, NA, NA, 8, NA, 9.33, 9, 7, 8, 8),
VR.Score.baseline = c(NA, 60, 38.5, 50, NA, 48, NA, 33, 49, 67, 78, 80),
VR.Score.wave1 = c(NA, 58, 38.5, NA, NA, 40, NA, 35, 49, 67, 78, 78)),
row.names = c(NA, 12L), class = "data.frame")
neg_skew.vars <- c("SOS.Score", "DST.Score")
# reverse scores
df2 <- df %>%
select(contains(neg_skew.vars)) %>%
mutate(across(everything(), ~ max(., na.rm = TRUE) + 1 - . , .names = "{.col}_r"))
# desired outcome (order of variables irrelevant)
df_wanted <- structure(list(
SOS.Score.baseline = c(4, 11, 7, 9, 10, 8, 6, 8, 7, 0, 9, 10),
SOS.Score.wave1 = c(NA, 7.5, 8.5, NA, NA, 6.66, NA, 6, 8, 8, 7, 8),
SOS.Score.baseline_r = c(17, 10, 14, 12, 11, 13, 15, 13, 14, 21, 12, 11),
SOS.Score.wave1_r = c(NA, 13.5, 12.5, NA, NA, 14.34, NA, 15, 13, 13, 14, 13),
DST.Score.baseline = c(11, 10, 8, 8, 8, 8, 9, 9, 7, 6, 7, 6),
DST.Score.wave1 = c(NA, 10, 8.5, NA, NA, 8, NA, 9.33, 9, 7, 8, 8),
DST.Score.baseline_r = c(6, 7, 9, 9, 9, 9, 8, 8, 10, 11, 10, 11),
DST.Score.wave1_r = c(NA, 7, 8.5, NA, NA, 9, NA, 7.67, 8, 10, 9, 9),
VR.Score.baseline = c(NA, 60, 38.5, 50, NA, 48, NA, 33, 49, 67, 78, 80),
VR.Score.wave1 = c(NA, 58, 38.5, NA, NA, 40, NA, 35, 49, 67, 78, 78)),
row.names = c(NA,12L), class = "data.frame")
You can use purrr::map_dfc to loop over neg_skew.vars, look up the maximum value for each score directly in df.CP1.vars, and then bind the resulting data frame with the columns that remained unchanged.
library(tidyverse)
library(purrr)
df2 <- neg_skew.vars %>%
map_dfc(function(a) df %>%
select(matches(a)) %>%
mutate(across(everything(), ~ df.CP1.vars$max.vars[df.CP1.vars$CP1.vars == a] + 1 - .,
.names = "{.col}_r"))) %>%
bind_cols(df %>%
select(!contains(neg_skew.vars)))
This indeed leads to the desired outcome:
identical(df2, df_wanted)
#[1] TRUE
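If you would rather avoid the purrr dependency, the same idea can be sketched in base R with a named lookup vector and a loop. This is only an illustration on a cut-down toy data frame; reverse_scores, toy_df and max_lookup are hypothetical names, not part of the original code.

```r
# Toy data with the same naming pattern as df
toy_df <- data.frame(
  SOS.Score.baseline = c(4, 11, 7),
  SOS.Score.wave1    = c(NA, 7.5, 8.5),
  DST.Score.baseline = c(11, 10, 8),
  VR.Score.baseline  = c(NA, 60, 38.5)
)

# Named lookup: score name -> predefined maximum
max_lookup <- c(DST.Score = 16, SOS.Score = 20, VR.Score = 80)
vars_to_reverse <- c("SOS.Score", "DST.Score")

reverse_scores <- function(data, vars, max_lookup) {
  for (v in vars) {
    # All timepoint columns of this score, found before any "_r" column is added
    cols <- names(data)[startsWith(names(data), v)]
    for (col in cols) {
      data[[paste0(col, "_r")]] <- max_lookup[[v]] + 1 - data[[col]]
    }
  }
  data
}

toy_reversed <- reverse_scores(toy_df, vars_to_reverse, max_lookup)
print(toy_reversed)
```

Unlike the map_dfc version, the reversed columns are appended at the end, which should be fine since the order of variables is irrelevant here.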
Data:
# create relevant variables and data sets
CP1.vars <- c("DST.Score","SOS.Score", "VR.Score")
max.vars <- c(16,20,80)
df.CP1.vars <- data.frame(CP1.vars, max.vars)
df <- structure(list(
SOS.Score.baseline = c(4, 11, 7, 9, 10, 8, 6, 8, 7, 0, 9, 10),
SOS.Score.wave1 = c(NA, 7.5, 8.5, NA, NA, 6.66, NA, 6, 8, 8, 7, 8),
DST.Score.baseline = c(11, 10, 8, 8, 8, 8, 9, 9, 7, 6, 7, 6),
DST.Score.wave1 = c(NA, 10, 8.5, NA, NA, 8, NA, 9.33, 9, 7, 8, 8),
VR.Score.baseline = c(NA, 60, 38.5, 50, NA, 48, NA, 33, 49, 67, 78, 80),
VR.Score.wave1 = c(NA, 58, 38.5, NA, NA, 40, NA, 35, 49, 67, 78, 78)),
row.names = c(NA, 12L), class = "data.frame")
neg_skew.vars <- c("SOS.Score", "DST.Score")
This is a small example of my data set, which contains weekly data for 52 weeks. You can see the data with the code below:
# CODE
#Data
library(tidyverse)
library(plotly)
ARTIFICIALDATA<-dput(structure(list(week = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
45, 46, 47, 48, 49, 50, 51, 52), `2019 Series_1` = c(534.771929824561,
350.385964912281, 644.736842105263, 366.561403508772, 455.649122807018,
533.614035087719, 829.964912280702, 466.035087719298, 304.421052631579,
549.473684210526, 649.719298245614, 537.964912280702, 484.982456140351,
785.929824561404, 576.736842105263, 685.508771929824, 514.842105263158,
464.491228070175, 608.245614035088, 756.701754385965, 431.859649122807,
524.315789473684, 739.40350877193, 604.736842105263, 669.684210526316,
570.491228070175, 641.649122807018, 649.298245614035, 664.210526315789,
530.385964912281, 754.315789473684, 646.80701754386, 764.070175438596,
421.333333333333, 470.842105263158, 774.245614035088, 752.842105263158,
575.368421052632, 538.315789473684, 735.578947368421, 522, 862.561403508772,
496.526315789474, 710.631578947368, 584.456140350877, 843.19298245614,
563.473684210526, 568.456140350877, 625.368421052632, 768.912280701754,
679.824561403509, 642.526315789474), `2020 Series_1` = c(294.350877192983,
239.824561403509, 709.614035087719, 569.824561403509, 489.438596491228,
561.964912280702, 808.456140350877, 545.157894736842, 589.649122807018,
500.877192982456, 584.421052631579, 524.771929824561, 367.438596491228,
275.228070175439, 166.736842105263, 58.2456140350878, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA)), row.names = c(NA, -52L), class = c("tbl_df", "tbl",
"data.frame")))
colnames(ARTIFICIALDATA) <- c('week', 'series1', 'series2')
The next step is to plot this data with the plotly package for R. I want a plot like the example below. The data is weekly: series1 has 52 observations, while series2 has only 16 (series1 is the mean data for 2019 and series2 the data for 2020). The comparison must therefore use only the 16 weeks that have no NA, as in the example below:
Can anybody help me plot this graph with plotly?
Try this:
colnames(ARTIFICIALDATA) <- c("week", "series1", "series2")
ARTIFICIALDATA %>%
# Drop rows with NA
drop_na() %>%
# Convert to long format
pivot_longer(-week, names_to = "series") %>%
# Set the labels for the plot. If you want other labels, simply adjust them
mutate(label = case_when(
series == "series1" ~ "2019 Series_1",
series == "series2" ~ "2020 Series_1")) %>%
# Compute the sum by series
group_by(label) %>%
summarise(sum = sum(value, na.rm = TRUE)) %>%
ungroup() %>%
# Plot
plot_ly(x = ~label, y = ~sum) %>%
add_bars() %>%
# Remove the title of the x-axis. You can label it as you like
layout(xaxis = list(title = ""))
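To double-check the bar heights, the same totals can be computed in base R. A minimal sketch on a toy data frame (toy_series and the values in it are made up for illustration), keeping only the weeks where both series are observed:

```r
# Toy weekly data: series2 is missing for the later weeks
toy_series <- data.frame(
  week    = 1:4,
  series1 = c(10, 20, 30, 40),
  series2 = c(1, 2, NA, NA)
)

# Keep rows without NA, then sum each series over those rows
complete_rows <- toy_series[complete.cases(toy_series), ]
totals <- colSums(complete_rows[, c("series1", "series2")])

print(totals)
```

Here series1 is summed over weeks 1 and 2 only, which mirrors what drop_na() does before the summarise() step above.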
This is a small example of my data set, which contains weekly data for 52 weeks. You can see the data with the code below:
# CODE
#Data
ARTIFICIALDATA<-dput(structure(list(week = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
45, 46, 47, 48, 49, 50, 51, 52), `2019 Series_1` = c(534.771929824561,
350.385964912281, 644.736842105263, 366.561403508772, 455.649122807018,
533.614035087719, 829.964912280702, 466.035087719298, 304.421052631579,
549.473684210526, 649.719298245614, 537.964912280702, 484.982456140351,
785.929824561404, 576.736842105263, 685.508771929824, 514.842105263158,
464.491228070175, 608.245614035088, 756.701754385965, 431.859649122807,
524.315789473684, 739.40350877193, 604.736842105263, 669.684210526316,
570.491228070175, 641.649122807018, 649.298245614035, 664.210526315789,
530.385964912281, 754.315789473684, 646.80701754386, 764.070175438596,
421.333333333333, 470.842105263158, 774.245614035088, 752.842105263158,
575.368421052632, 538.315789473684, 735.578947368421, 522, 862.561403508772,
496.526315789474, 710.631578947368, 584.456140350877, 843.19298245614,
563.473684210526, 568.456140350877, 625.368421052632, 768.912280701754,
679.824561403509, 642.526315789474), `2020 Series_1` = c(294.350877192983,
239.824561403509, 709.614035087719, 569.824561403509, 489.438596491228,
561.964912280702, 808.456140350877, 545.157894736842, 589.649122807018,
500.877192982456, 584.421052631579, 524.771929824561, 367.438596491228,
275.228070175439, 166.736842105263, 58.2456140350878, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA)), row.names = c(NA, -52L), class = c("tbl_df", "tbl",
"data.frame")))
# CODE WITH PLOTLY
library(tidyverse)
library(plotly)
colnames(ARTIFICIALDATA) <- c('week', 'series1', 'series2')
fig <- plot_ly(ARTIFICIALDATA, x = ~week, y = ~series2, name = "2020", type = 'scatter', mode = 'lines')
fig <- fig %>% add_trace(y = ~series1, name = "2019")
fig
The next step is to plot this data with plotly. You can see what my plot looks like below:
My intention is to make a plot like the one below, with a dashed line (size = 1, linetype = 2). Can anybody help me modify it?
You can add line = list(dash = "dash") to get the dashed line. Other values of the dash property include "dot", "longdash", "dashdot", and "longdashdot". The other line can be set to "solid".
library(tidyverse)
library(plotly)
colnames(ARTIFICIALDATA) <- c('week', 'series1', 'series2')
fig <- plot_ly(ARTIFICIALDATA, x = ~week, y = ~series2, name = "2020", type = 'scatter', mode = 'lines',
line = list(dash = "dash"))
fig <- fig %>% add_trace(y = ~series1, name = "2019", line = list(dash = "solid"))
fig
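If you also want to control the line thickness (the size = 1 part of the question), the line list accepts a width entry as well. A minimal sketch on made-up toy data (toy_lines, y2019 and y2020 are illustrative names), using add_lines() instead of add_trace():

```r
library(plotly)

# Toy weekly data; the 2020 series stops early
toy_lines <- data.frame(
  week  = 1:5,
  y2019 = c(5, 7, 6, 8, 9),
  y2020 = c(4, 6, 7, NA, NA)
)

# One solid and one dashed line, both with width 1
fig_sketch <- plot_ly(toy_lines, x = ~week) %>%
  add_lines(y = ~y2019, name = "2019", line = list(dash = "solid", width = 1)) %>%
  add_lines(y = ~y2020, name = "2020", line = list(dash = "dash", width = 1))

fig_sketch
```

The width is given in pixels, so it will not necessarily look identical to size = 1 in ggplot2.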
I'm having difficulty using the tidyr function complete() to fill in rows for absent weeks. While complete() does run, it loops through the entire year 35 times and produces 4,375 rows rather than just 125.
In short, when I use complete(), it does not just fill in the missing weeks but repeats the whole data frame 35 times.
I have tried several different approaches, both with and without the full_seq function.
First approach:
Df %>%
group_by(week = week(`Local Start Time`)) %>%
complete(week = 1:52)
Second approach:
Df %>%
group_by(week = week(`Local Start Time`)) %>%
complete(week = full_seq(week <- c(1:52), 1L))
I expected the data frame to stop at row 125, but instead complete() loops over the entire yearly data (35 times!) and continues until row 4375.
Any advice is appreciated, thanks!
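One thing that may be worth checking, as a sketch rather than a confirmed fix: according to the tidyr documentation, complete() operates within each group of a grouped data frame, so grouping by week first makes it work group by group. Calling it on the ungrouped data instead adds exactly one row per absent week. The toy data frame below (toy_weeks, with made-up values and a week column that already exists) is only for illustration:

```r
library(tidyr)

# Toy data: weeks 1, 4 and 6 are absent, week 2 appears twice
toy_weeks <- data.frame(
  week  = c(2, 2, 3, 5),
  value = c(10, 20, 30, 40)
)

# No group_by(): complete the week column over the full range
filled <- complete(toy_weeks, week = as.numeric(1:6))

print(filled)
print(nrow(filled))  # original rows plus one row per absent week
```

With the 108 rows posted here, which cover 35 distinct weeks, the same pattern with week = 1:52 would be expected to give 108 + 17 = 125 rows.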
The data I used is here...
structure(list(`Local Start Time` = structure(c(1483846399, 1483846519,
1483851979, 1484734742, 1485017522, 1485190862, 1486236902, 1486238462,
1486347422, 1486448822, 1487221742, 1487392502, 1487449502, 1487678750,
1487679111, 1487679411, 1487683370, 1488321651, 1488745130, 1489353950,
1489710710, 1491043550, 1492036467, 1492105535, 1492150284, 1492180823,
1492772358, 1493428578, 1493440398, 1493465717, 1493476518, 1493484558,
1493495837, 1493622558, 1493639598, 1493718078, 1493718858, 1493720778,
1495021772, 1495598357, 1495599017, 1496175677, 1496428517, 1496439678,
1496494637, 1496632757, 1496887457, 1496887757, 1496888117, 1497031577,
1497207557, 1497318797, 1497368657, 1497491178, 1497558017, 1497857478,
1498220117, 1498245977, 1498246577, 1498255277, 1498257797, 1499203517,
1499470577, 1500752057, 1500899837, 1502036477, 1502392277, 1502410817,
1502428157, 1502429957, 1503492618, 1503500417, 1503507318, 1503672677,
1503674057, 1503674370, 1503675077, 1503923478, 1503928037, 1503932777,
1503943037, 1503972019, 1503989537, 1504383497, 1504421837, 1504639337,
1504656977, 1504672937, 1504682418, 1504722677, 1506766878, 1507180518,
1507184597, 1507228877, 1507229657, 1507370717, 1508326217, 1508343977,
1508357297, 1508374397, 1508492838, 1508555177, 1508560158, 1508868737,
1509231244, 1509252184, 1509845644, 1510709818), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), week = c(2, 2, 2, 3, 3, 4, 5, 5, 6,
6, 7, 7, 7, 8, 8, 8, 8, 9, 10, 11, 11, 13, 15, 15, 15, 15, 16,
17, 17, 17, 17, 17, 17, 18, 18, 18, 18, 18, 20, 21, 21, 22, 22,
22, 22, 23, 23, 23, 23, 23, 24, 24, 24, 24, 24, 25, 25, 25, 25,
25, 25, 27, 27, 29, 30, 32, 32, 32, 32, 32, 34, 34, 34, 34, 34,
34, 34, 35, 35, 35, 35, 35, 35, 35, 36, 36, 36, 36, 36, 36, 39,
40, 40, 40, 40, 40, 42, 42, 42, 42, 42, 42, 42, 43, 43, 44, 45,
46)), class = "data.frame", row.names = c(NA, -108L), .Names = c("Local Start Time",
"week"))