Conditional statements for one row vs all rows - r

I don't know what's going on here. Do I have some logical flaw in the code?
I want to match two datasets by their time difference. One case is around 4hs different from the entry in the other set. I am calculating the difference, e.g.:
qnr$submitdate[10]-raw1$time[7]
Time difference of 4 hours
I am specifying a time window:
sum(qnr$submitdate[10]-raw1$time[7] <= 4 & qnr$submitdate[10]-raw1$time[7] > 3.995)
[1] 1
Perfect, 1 match!
Now when I am considering the whole data set, I get 0 matches, how can that be?
sum(qnr$submitdate[10]-raw1$time <= 4 & qnr$submitdate[10]-raw1$time > 3.995)
[1] 0
Specifically, I want to match an identifier:
for (i in 1:nrow(qnr)){
match <- raw1$subject[(qnr$submitdate[i]-raw1$time <= 4 & qnr$submitdate[i]-raw1$time > 3.995)]
if(length(match)>0) qnr$subject[i] <- match
}
this works, but only for some cases, not the one mentioned above. Can someone please help me and enlighten me?
Data:
qnr <- structure(list(submitdate = structure(c(1635427498, 1635427876,
1635428218, 1635429757, 1635430844, 1635432380, 1635435962, 1635453487,
1635464448, 1635508264, 1635509440, 1635509727, 1635510277, 1635511263,
1635511718, 1635514199, 1635514329, 1635517928, 1635519441, 1635519704,
1635520386, 1635521108, 1635522747, 1635525148, 1635526577), tzone = "UTC", class = c("POSIXct",
"POSIXt")), subject = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA,
-25L), class = c("tbl_df", "tbl", "data.frame"))
raw1 <- structure(list(time = structure(c(1635413099, 1635413819, 1635416446,
1635417980, 1635421563, 1635439088, 1635493864, 1635495041, 1635495326,
1635495876, 1635496863, 1635499803, 1635499932, 1635503528, 1635505042,
1635508347, 1635512177, 1635512850, 1635518752, 1635519382), class = c("POSIXct",
"POSIXt"), tzone = ""), subject = c("9wtd4kldpun6bhgq", "qbvhqxuw67x1eduw",
"k2dc9c88t3jcfssy", "vmvwfc6z7j236nhk", "7qo7ra1jj25ue3fb", "5xx9qkkb53nzxev5",
"o6zaaq469c7t2jps", "dfsj021ojphza6uc", "4k0l4a3yrb33hel1", "vf6usaa0cl8kz17t",
"f1wwfeoeekoru88z", "oe8e2u6w4a1f6f6m", "tnxxywtpsj8nejoa", "zht8w1bfhq4dk22l",
"atd314r9a4htlaal", "mwbh9eafxczk0x8u", "ke7m4qqp4aodd1fb", "v13fx76lsohsa1hh",
"8kvynhcvfs09g658", "5scqtdz8ha8cuxt1")), row.names = c(79226L,
26641L, 79425L, 79624L, 79823L, 26789L, 2961L, 3109L, 3257L,
47585L, 3405L, 3553L, 3701L, 47784L, 3849L, 47983L, 48182L, 48381L,
48580L, 48779L), class = c("data.table", "data.frame"))

The reason is that subtracting times uses the timediff-function which uses units="auto" as a standard.
It works, when changing
qnr$submitdate[i]-raw1$time
to
difftime(qnr$submitdate[i],raw1$time, units="hours")

Related

r Replace multiple strings in a data frame column with multiple strings from a column of another data frame

I have a dataframe (df1) with a column "PartcipantID". Some ParticipantIDs are wrong and should be replaced with the correct ParticipantID. I have another dataframe (df2) where all Participant IDs appear in columns Goal_ID to T4. The Participant IDs in column "Goal_ID" are the correct IDs.
Now I want to replace all ParticipantIDs in df1 with all Goal_ID ParticipantIDs from df2.
This is my original dataframe (df1):
structure(list(Partcipant_ID = c("AA_SH_RA_91", "AA_SH_RA_91",
"AB_BA_PR_93", "AB_BH_VI_90", "AB_BH_VI_90", "AB_SA_TA_91", "AJ_BO_RA_92",
"AJ_BO_RA_92", "AK_SH_HA_91", "AL_EN_RA_95", "AL_MA_RA_95", "AL_SH_BA_99",
"AM_BO_AB_49", "AM_BO_AB_94", "AM_BO_AB_94", "AM_BO_AB_94", "AN_JA_AN_91",
"AN_KL_GE_11", "AN_KL_WO_91", "AN_MA_DI_95", "AN_MA_DI_95", "AN_SE_RA_95",
"AN_SE_RA_95", "AN_SI_RA_97", "AN_SO_PU_94", "AN_SU_RA_91", "AR_BO_RA_92",
"AR_KA_VI_94", "AR_KA_VI_94", "AS_AR_SO_90", "AS_AR_SU_95", "AS_KU_SO_90",
"AS_MO_AS_97", "AW_SI_OJ_97", "AW_SI_OJ_97", "AY_CH_SU_97", "BH_BE_LD_84",
"BH_BE_LI_83", "BH_BE_LI_83", "BH_BE_LI_84", "BH_KO_SA_87", "BH_PE_AB_89",
"BH_YA_SA_87", "BI_CH_PR_94", "BI_CH_PR_94"), Start_T2 = structure(c(NA,
NA, NA, NA, 1579514871, 1576658745, NA, 1579098225, NA, NA, 1576663067,
1576844759, NA, 1577330639, NA, NA, 1576693930, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, 1577718380, 1577718380, 1577454467, NA,
NA, 1576352237, NA, NA, NA, NA, 1576420656, 1576420656, NA, NA,
1578031772, 1576872938, NA, NA), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), End_T2 = structure(c(NA, NA, NA, NA, 1579515709,
1576660469, NA, 1579098989, NA, NA, 1576693776, 1576845312, NA,
1577331721, NA, NA, 1576694799, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, 1577719049, 1577719049, 1577455167, NA, NA, 1576352397,
NA, NA, NA, NA, 1576421607, 1576421607, NA, NA, 1578032408, 1576873875,
NA, NA), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA,
45L), class = "data.frame")
And this is the reference data frame (df2):
structure(list(Goal_ID = c("AJ_BO_RA_92", "AL_EN_RA_95", "AM_BO_AB_49",
"AS_KU_SO_90", "BH_BE_LI_84", "BH_YA_SA_87", "BI_CH_PR_94", "BI_CH_PR_94"
), T2 = c("AJ_BO_RA_92", "AL_MA_RA_95", "AM_BO_AB_94", "AS_AR_SO_90",
"BH_BE_LI_83", "BH_YA_SA_87", "BI_NA_PR_94", "BI_NA_PR_94"),
T3 = c("AR_BO_RA_92", "AL_MA_RA_95", "AM_BO_AB_94", NA, "BH_BE_LI_83",
NA, "BI_CH_PR_94", "BI_CH_PR_94"), T4 = c("AJ_BO_RA_92",
"AL_MA_RA_95", "AM_BO_AB_94", NA, "BH_BE_LI_83", "BH_KO_SA_87",
"BI_CH_PR_94", "BI_CH_PR_94")), row.names = c(NA, -8L), class = c("tbl_df",
"tbl", "data.frame"))
For example, in my df1, I want
"AR_BO_RA_92" to be replaced by "AJ_BO_RA_92";
"AL_MA_RA_95" to be replaced by "AL_EN_RA_95";
"AM_BO_AB_94" to be replaced by "AM_BO_AB_49"
and so on...
I thought about using string_replace and I started with this:
df1$Partcipant_ID <- str_replace(df1$Partcipant_ID, "AR_BO_RA_92", "AJ_BO_RA_92")
But that is of course very unefficient because I have so many replacements and it would be nice to make use of my reference data frame. I just cannot figure it out myself.
I hope this is understandable. Please ask if you need additional information.
Thank you so much already!
You can use match to find where the string is located and excange those which have been found and are not NA like:
i <- match(df1$Partcipant_ID, unlist(df2[-1])) %% nrow(df2)
j <- !is.na(i)
df1$Partcipant_ID[j] <- df2$Goal_ID[i[j]]
df1$Partcipant_ID
# [1] "AA_SH_RA_91" "AA_SH_RA_91" "AB_BA_PR_93" "AB_BH_VI_90" "AB_BH_VI_90"
# [6] "AB_SA_TA_91" "AJ_BO_RA_92" "AJ_BO_RA_92" "AK_SH_HA_91" "AL_EN_RA_95"
#[11] "AL_MA_RA_95" "AL_SH_BA_99" "AM_BO_AB_49" "AM_BO_AB_94" "AM_BO_AB_94"
#[16] "AM_BO_AB_94" "AN_JA_AN_91" "AN_KL_GE_11" "AN_KL_WO_91" "AN_MA_DI_95"
#[21] "AN_MA_DI_95" "AN_SE_RA_95" "AN_SE_RA_95" "AN_SI_RA_97" "AN_SO_PU_94"
#[26] "AN_SU_RA_91" "AR_BO_RA_92" "AR_KA_VI_94" "AR_KA_VI_94" "AS_AR_SO_90"
#[31] "AS_AR_SU_95" "AS_KU_SO_90" "AS_MO_AS_97" "AW_SI_OJ_97" "AW_SI_OJ_97"
#[36] "AY_CH_SU_97" "BH_BE_LD_84" "BH_BE_LI_83" "BH_BE_LI_83" "BH_BE_LI_84"
#[41] "BH_KO_SA_87" "BH_PE_AB_89" "BH_YA_SA_87" "BI_CH_PR_94" "BI_CH_PR_94"
I think this might work. Create a true look up table with a column of correct and incorrect codes. I.e. stack the columns, then join the subsequent df3 to df1 and use coalesce to create a new part_id. You spelt participant wrong, which made me feel more human I always do that.
library(dplyr)
df3 <- df2[1:2] %>%
bind_rows(df2[c(1,3)] %>% rename(T2 = T3),
df2[c(1,4)] %>% rename(T2 = T4)) %>%
distinct()
df1 %>%
left_join(df3, by = c("Partcipant_ID" = "T2")) %>%
mutate(Goal_ID = coalesce(Goal_ID, Partcipant_ID)) %>%
select(Goal_ID, Partcipant_ID, Start_T2, End_T2)

Speed up/replace the loop for millions data:judge multi date range

Good evening guys,I have 6 millions data and they have four types.
z=structure(list(date = structure(c(11866, 16190, 14729, 11718), class = "Date"),
beg1 = structure(c(12264, 12264, 13970, 12264), class = "Date"),
end1 = structure(c(17621, 14760, 14760, 13298), class = "Date"),
ID1 = c(1003587, 1000396, 1010743, 1002113), beg2 = structure(c(NA,
14790, 14790, 13299), class = "Date"), end2 = structure(c(NA,
17621, 15217, 13969), class = "Date"), ID2 = c(NA, 1024488,
1027877, 1002824), beg3 = structure(c(NA, NA, 15218, 13970
), class = "Date"), end3 = structure(c(NA, NA, 17621, 14760
), class = "Date"), ID3 = c(NA, NA, 1031361, 1002113), beg4 = structure(c(NA,
NA, NA, 14790), class = "Date"), end4 = structure(c(NA, NA,
NA, 17621), class = "Date"), ID4 = c(NA, NA, NA, 1021290),
realID = c(NA, NA, NA, NA)), row.names = c(267365L, 193587L,
5294385L, 2039421L), class = "data.frame")
and I tried to judge and assign a suitalbe ID based on their date in which date ranges(use the loop).
for(i in 1:nrow(z)){tryCatch({print(i)
if(between(z$date[i],z$beg1[i],z$end1[i])==T){z$realID[i]=z$ID1[i]}
if(between(z$date[i],z$beg2[i],z$end2[i])==T){z$realID[i]=z$ID2[i]}
if(between(z$date[i],z$beg3[i],z$end3[i])==T){z$realID[i]=z$ID3[i]}
if(between(z$date[i],z$beg4[i],z$end4[i])==T){z$realID[i]=z$ID4[i]}},error=function(e){})}
The code works.
But,now the problem is I have too many datas,the loop is inefficiency,may be it will take almost one day to loop.
Does anyone know how can I improve or replace the code?
Thanks you so much.
Since R is a vectorized language, to speed up this code it is best to operate on the entire vector as oppose to looping through each element.
As simple solution is to use a series of ifelse statements.
z$realID <- ifelse(!is.na(z$beg1) & z$date> z$beg1 & z$date< z$end1, z$ID1, z$realID)
z$realID <- ifelse(!is.na(z$beg2) & z$date> z$beg2 & z$date< z$end2, z$ID2, z$realID)
z$realID <- ifelse(!is.na(z$beg3) & z$date> z$beg3 & z$date< z$end3, z$ID3, z$realID)
z$realID <- ifelse(!is.na(z$beg4) & z$date> z$beg4 & z$date< z$end4, z$ID4, z$realID)
When the if statement evaluates TRUE, the realID will update if not it will retain its prior value.

linear regression model with dplyr on sepcified columns by name

I have the following data frame, each row containing four dates ("y") and four measurements ("x"):
df = structure(list(x1 = c(69.772808673525, NA, 53.13125414839,
17.3033274666411,
NA, 38.6120670385487, 57.7229000792707, 40.7654208618078, 38.9010405201831,
65.7108936694177), y1 = c(0.765671296296296, NA, 1.37539351851852,
0.550277777777778, NA, 0.83037037037037, 0.0254398148148148,
0.380671296296296, 1.368125, 2.5250462962963), x2 = c(81.3285388496182,
NA, NA, 44.369872853302, NA, 61.0746827226573, 66.3965114460601,
41.4256874481852, 49.5461413070349, 47.0936997726146), y2 =
c(6.58287037037037,
NA, NA, 9.09377314814815, NA, 7.00127314814815, 6.46597222222222,
6.2462962962963, 6.76976851851852, 8.12449074074074), x3 = c(NA,
60.4976916064608, NA, 45.3575294731303, 45.159758146854, 71.8459173097114,
NA, 37.9485456227131, 44.6307631013742, 52.4523342186143), y3 = c(NA,
12.0026157407407, NA, 13.5601157407407, 16.1213657407407, 15.6431018518519,
NA, 15.8986805555556, 13.1395138888889, 17.9432638888889), x4 = c(NA,
NA, NA, 57.3383407228293, NA, 59.3921356160536, 67.4231673171527,
31.853845252547, NA, NA), y4 = c(NA, NA, NA, 18.258125, NA,
19.6074768518519,
20.9696527777778, 23.7176851851852, NA, NA)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -10L))
I would like to create an additional column containing the slope of all the y's versus all the x's, for each row (each row is a patient with these 4 measurements).
Here is what I have so far:
df <- df %>% mutate(Slope = lm(vars(starts_with("y") ~
vars(starts_with("x"), data = .)
I am getting an error:
invalid type (list) for variable 'vars(starts_with("y"))'...
What am I doing wrong, and how can I calculate the rowwise slope?
You are using a tidyverse syntax but your data is not tidy...
Maybe you should rearrange your data.frame and rethink the way you store your data.
Here is how to do it in a quick and dirty way (at least if I understood your explanations correctly):
df <- merge(reshape(df[,(1:4)*2-1], dir="long", varying = list(1:4), v.names = "x", idvar = "patient"),
reshape(df[,(1:4)*2], dir="long", varying = list(1:4), v.names = "y", idvar = "patient"))
df$patient <- factor(df$patient)
Then you could loop over the patients, perform a linear regression and get the slopes as a vector:
sapply(levels(df$patient), function(pat) {
coef(lm(y~x,df[df$patient==pat,],na.action = "na.omit"))[2]
})

How can I get highcharter to represent a forecast object?

This is a follow-on to this question.
I am trying to get the pipeline given in that question to accept a forecast object as input:
Again, using this data:
> dput(t)
structure(c(2, 2, 267822980, 325286564, 66697091, 239352431,
94380295, 1, 126621669, 158555699, 32951026, 23, 108000151, 132505189,
29587564, 120381505, 25106680, 117506099, 22868767, 115940080,
22878163, 119286731, 22881061), .Dim = c(23L, 1L), index = structure(c(1490990400,
1490994000, 1490997600, 1491001200, 1491004800, 1491008400, 1491012000,
1491026400, 1491033600, 1491037200, 1491040800, 1491058800, 1491062400,
1491066000, 1491069600, 1491073200, 1491076800, 1491109200, 1491112800,
1491120000, 1491123600, 1491156000, 1491159600), tzone = "US/Mountain", tclass = c("POSIXct",
"POSIXt")), class = c("xts", "zoo"), .indexCLASS = c("POSIXct",
"POSIXt"), tclass = c("POSIXct", "POSIXt"), .indexTZ = "US/Mountain", tzone = "US/Mountain", .CLASS = "double", .Dimnames = list(
NULL, "count"))
I use
highchart(type = 'stock') %>%
hc_add_series(t) %>%
hc_xAxis(type = 'datetime')
To create
But if I follow this same recipe using
require("forecast")
t.arima <- auto.arima(t)
x <- forecast(t.arima, level = c(95, 80))
highchart(type = 'stock') %>%
hc_add_series(x) %>%
hc_xAxis(type = 'datetime')
I get this error:
Error in as.Date.ts(.) : unable to convert ts time to Date class
How can I show the forecast series along with the historical? I've seen this in the documentation, but don't understand why I'd be getting this error.
JS CONSOLE OUTPUT FOR JK:
DF DATA AFTER RE-INDEXING:
dput(df)
structure(list(Index = structure(c(1490968800, 1490972400, 1490976000,
1490979600, 1490983200, 1490986800, 1490990400, 1491004800, 1491012000,
1491015600, 1491019200, 1491037200, 1491040800, 1491044400, 1491048000,
1491051600, 1491055200, 1491087600, 1491091200, 1491098400, 1491102000,
1491134400, 1491138000, 1491217200, 1491220800, 1491224400, 1491228000,
1491231600, 1491235200, 1491238800, 1491242400, 1491246000, 1491249600,
1491253200, 1491256800, 1491260400, 1491264000, 1491267600), class = c("POSIXct",
"POSIXt")), Data = c(2, 2, 259465771, 315866206, 64582553, 233440220,
91918347, 1, 126563786, 158555699, 32951026, 23, 108000151, 132505189,
29587564, 120381505, 25106680, 117506099, 22868767, 115898351,
22878163, 119285747, 22881061, 157925588, 32447780, 223096830,
281656273, 45406684, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
Fitted = c(102170573.857143, 102170573.857143, 102170573.857143,
102170573.857143, 102170573.857143, 102170573.857143, 102170573.857143,
102170573.857143, 102170573.857143, 102170573.857143, 102170573.857143,
102170573.857143, 102170573.857143, 102170573.857143, 102170573.857143,
102170573.857143, 102170573.857143, 102170573.857143, 102170573.857143,
102170573.857143, 102170573.857143, 102170573.857143, 102170573.857143,
102170573.857143, 102170573.857143, 102170573.857143, 102170573.857143,
102170573.857143, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
`Point Forecast` = c(NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, 102170573.857143, 102170573.857143, 102170573.857143,
102170573.857143, 102170573.857143, 102170573.857143, 102170573.857143,
102170573.857143, 102170573.857143, 102170573.857143), `Lo 80` = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, -16003477.5789723,
-16003477.5789723, -16003477.5789723, -16003477.5789723,
-16003477.5789723, -16003477.5789723, -16003477.5789723,
-16003477.5789723, -16003477.5789723, -16003477.5789723),
`Hi 80` = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, 220344625.293258, 220344625.293258, 220344625.293258,
220344625.293258, 220344625.293258, 220344625.293258, 220344625.293258,
220344625.293258, 220344625.293258, 220344625.293258), `Lo 95` = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, -78561041.5917782,
-78561041.5917782, -78561041.5917782, -78561041.5917782,
-78561041.5917782, -78561041.5917782, -78561041.5917782,
-78561041.5917782, -78561041.5917782, -78561041.5917782),
`Hi 95` = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, 282902189.306064, 282902189.306064, 282902189.306064,
282902189.306064, 282902189.306064, 282902189.306064, 282902189.306064,
282902189.306064, 282902189.306064, 282902189.306064)), .Names = c("Index",
"Data", "Fitted", "Point Forecast", "Lo 80", "Hi 80", "Lo 95",
"Hi 95"), row.names = c(NA, -38L), class = "data.frame")
Not sure this is due to the irregular time series.
Anyway, ggfortify:::fortify.forecast is your friend. Why? Because fortify (try to) transform all the R object in data frames. So:
library(highcharter)
library(forecast)
t.arima <- auto.arima(t)
x <- forecast(t, level = c(95, 80))
library(highcharter)
library(ggplot2)
library(ggfortify)
#>
#> Attaching package: 'ggfortify'
#> The following object is masked from 'package:forecast':
#>
#> gglagplot
class(x)
#> [1] "forecast"
df <- fortify(x)
head(df)
#> Index Data Fitted Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
#> 1 1 2 140658844 NA NA NA NA NA
#> 2 3601 2 121734145 NA NA NA NA NA
#> 3 7201 267822980 105355638 NA NA NA NA NA
#> 4 10801 325286564 127214522 NA NA NA NA NA
#> 5 14401 66697091 153863779 NA NA NA NA NA
#> 6 18001 239352431 142136089 NA NA NA NA NA
Now you can:
highchart(type = "stock") %>%
hc_add_series(df, "line", hcaes(Index, Data), name = "Original") %>%
hc_add_series(df, "line", hcaes(Index, Fitted), name = "Fitted") %>%
hc_add_series(df, "line", hcaes(Index, `Point Forecast`), name = "Forecast") %>%
hc_add_series(df, "arearange", hcaes(Index, low = `Lo 80`, high = `Hi 80`), name = "Interval")
As you can see, fortify can't detect the real time too. So you need to transform the Index in the time what you want.
The error
Error in as.Date.ts(.) : unable to convert ts time to Date class
is due to the fact that you have a ts object with a frequency that is not covered by the function as.Date.ts(.). When we see what this function does, this is what we get:
function (x, offset = 0, ...)
{
time.x <- unclass(time(x)) + offset
if (frequency(x) == 1)
as.Date(paste(time.x, 1, 1, sep = "-"))
else if (frequency(x) == 4)
as.Date(paste((time.x + 0.001)%/%1, 3 * (cycle(x) - 1) +
1, 1, sep = "-"))
else if (frequency(x) == 12)
as.Date(paste((time.x + 0.001)%/%1, cycle(x), 1, sep = "-"))
else stop("unable to convert ts time to Date class")
}
This function considers only 3 values for the frequency of a ts object: 1, 4, or 12. When we take a look at the frequency of your object x, we see that its frequency = 0.000277777777777778, so when highcharter calls the function using the ts objects in x it stops and gives you that error.
We have two options on how to "fix" it:
Transform t into a ts object (instead of a xts object) with frequency = 1 before running auto.arima and forecast;
After running auto.arima and forecast, we can create an index for the future dates and transform the ts objects in x into xts objects with the correct index.
I said "fix" because these solutions are not perfect, as we will see.
Option 1
t <- structure(
c(2, 2, 267822980, 325286564, 66697091, 239352431,
94380295, 1, 126621669, 158555699, 32951026, 23,
108000151, 132505189, 29587564, 120381505, 25106680,
117506099, 22868767, 115940080, 22878163, 119286731,
22881061),
.Dim = c(23L, 1L),
index = structure(c(1490990400, 1490994000, 1490997600,
1491001200, 1491004800, 1491008400,
1491012000, 1491026400, 1491033600,
1491037200, 1491040800, 1491058800,
1491062400, 1491066000, 1491069600,
1491073200, 1491076800, 1491109200,
1491112800, 1491120000, 1491123600,
1491156000, 1491159600),
tzone = "US/Mountain",
tclass = c("POSIXct","POSIXt")),
class = c("xts", "zoo"),
.indexCLASS = c("POSIXct","POSIXt"),
tclass = c("POSIXct", "POSIXt"),
.indexTZ = "US/Mountain",
tzone = "US/Mountain",
.CLASS = "double",
.Dimnames = list(NULL, "count"))
require("forecast")
library(highcharter)
# SOLUTION 1
t.tmp <- ts(t, start=1, end = length(t))
t.arima.1 <- auto.arima(t.tmp)
x.1 <- forecast(t.arima.1, level = c(95, 80))
highchart(type = 'stock') %>%
hc_add_series(x.1) %>%
hc_add_series(x.1$x, name = "Original") %>%
hc_add_series(x.1$fitted, name = "Fitted")
The problem with this approach is that we lose the dates (axis, tooltip, etc.).
Option 2, 1st try: Hourly Forecasts
I tried to create an hourly index for the future values, but for some reason Highcharter moves the intervals to the left (or there's some problem with the dates that I can't see/figure out).
Option 2, 2nd try: Daily Forecasts
When I changed it to a daily index for the future values it worked, but it's weird since we have hourly observations and the forecast part of our plot shows "daily forecasts".
Here is the full code:
t <- structure(
c(2, 2, 267822980, 325286564, 66697091, 239352431,
94380295, 1, 126621669, 158555699, 32951026, 23,
108000151, 132505189, 29587564, 120381505, 25106680,
117506099, 22868767, 115940080, 22878163, 119286731,
22881061),
.Dim = c(23L, 1L),
index = structure(c(1490990400, 1490994000, 1490997600,
1491001200, 1491004800, 1491008400,
1491012000, 1491026400, 1491033600,
1491037200, 1491040800, 1491058800,
1491062400, 1491066000, 1491069600,
1491073200, 1491076800, 1491109200,
1491112800, 1491120000, 1491123600,
1491156000, 1491159600),
tzone = "US/Mountain",
tclass = c("POSIXct","POSIXt")),
class = c("xts", "zoo"),
.indexCLASS = c("POSIXct","POSIXt"),
tclass = c("POSIXct", "POSIXt"),
.indexTZ = "US/Mountain",
tzone = "US/Mountain",
.CLASS = "double",
.Dimnames = list(NULL, "count"))
require("forecast")
library(highcharter)
library(xts)
t.arima <- auto.arima(t)
x <- forecast(t.arima, level = c(95, 80))
# Problem
## Time from 'forecast'
time.x <- time(x$mean) # ts variable
time.x # see that frequency = 0.000277777777777778
## Original time
time.t <- time(t) # POSIXct variable, use as.ts to see frequency
as.ts(time.t) # frequency = 1
## Try to transform back to formatted date
as.POSIXct(as.double(time.t), tz = "US/Mountain", origin = "1970-01-01")
as.POSIXct(as.double(time.x), tz = "US/Mountain", origin = "1970-01-01")
#--------------------------------------------------------#
# SOLUTION 1
t.tmp <- ts(t, start=1, end = length(t))
t.arima.1 <- auto.arima(t.tmp)
x.1 <- forecast(t.arima.1, level = c(95, 80))
highchart(type = 'stock') %>%
hc_add_series(x.1) %>%
hc_add_series(x.1$x, name = "Original") %>%
hc_add_series(x.1$fitted, name = "Fitted")
#------------------------------------------------------#
# SOLUTION 2 - With correct dates but wrong plot
## Create new forecast variable
x.2 <- forecast(t.arima.1, level = c(95, 80))
## Take forecast length
forecast.length <- length(time.x)
### Create New Forecast dates (HOUR)
### Since I don't know the exact forecast times, I'll add one HOUR
### for each obs starting from the last date in the original dataset
last.date <- time.t[length(time.t)]
new.forecast.time.hour <- as.POSIXct(last.date) + c((1:forecast.length)*3600)
## Insert date back
x.2$mean <- xts(x.1$mean, order.by = new.forecast.time.hour)
x.2$lower <- xts(x.1$lower, order.by = new.forecast.time.hour)
x.2$upper <- xts(x.1$upper, order.by = new.forecast.time.hour)
### Original Data
x.2$x <- xts(x.1$x, order.by = time.t)
### Fitted
x.2$fitted <- xts(x.1$fitted, order.by = time.t)
# Plot forecasts with correct date
highchart(type = 'stock') %>%
hc_add_series(x.2) %>%
hc_add_series(x.2$x, name = "Original") %>%
hc_add_series(x.2$fitted, name = "Fitted") %>%
hc_xAxis(type = 'datetime')
#------------------------------------------------------#
# SOLUTION 3 - Correct plot but only for daily forecasts
## Create new forecast variable
x.3 <- forecast(t.arima.1, level = c(95, 80))
## Take forecast length
forecast.length <- length(time.x)
### Create New Forecast dates (DAY)
### Since I don't know the exact forecast times, I'll add one DAY
### for each obs starting from the last date in the original dataset
last.date <- time.t[length(time.t)]
new.forecast.time.day <- as.POSIXct(last.date) + c((1:forecast.length)*3600*24)
## Add change from as.POSIXct to as.Date
new.forecast.time.day <- as.Date(new.forecast.time.day)
## Insert date back
x.3$mean <- xts(x.1$mean, order.by = new.forecast.time.day)
x.3$lower <- xts(x.1$lower, order.by = new.forecast.time.day)
x.3$upper <- xts(x.1$upper, order.by = new.forecast.time.day)
### Original Data
x.3$x <- xts(x.1$x, order.by = time.t)
### Fitted
x.3$fitted <- xts(x.1$fitted, order.by = time.t)
# Plot forecasts with correct date
highchart(type = 'stock') %>%
hc_add_series(x.3) %>%
hc_add_series(x.3$x, name = "Original") %>%
hc_add_series(x.3$fitted, name = "Fitted") %>%
hc_xAxis(type = 'datetime')
One other thing: the fitted values on my plots differ from the fitted values on jbkunst's plot because he used forecast directly on t, not on t.arima (just a typo, I believe). This way, my forecasts are based on an Arima model, while his are based on an ETS model.

How to create a proper dataset for boxplots

I'm having trouble to create a proper boxplot of my dataset. All of the solutions on this platform don't work because their dataset all look different with variables against each other.
So I want to ask: how do I need to format my dataset if it only contains 3 variables and their measured values in 3 columns. In the boxplot examples here, they plot a variable against another one but here this is not the case right?
Using boxplot(data) gives me 3 boxplots. But I want to show the MEAN and also the population size on each boxplot. I don't know how to use the solution as they are all about ggplot2 or boxplot with variables against each other.
I know that this must be simple, but I think I'm plotting the boxplots on a bad method and that's why the solutions on this site don't work?
Data:
structure(list(Rest = c(3.479386607, 3.478445796, 2.52227462,
1.726115552, 3.917693859, 2.300840122), Peat = c(16.79515746,
22.76673699, 24.43289941, 15.64168939, 31.60459098, 16.2369787
), Top.culture = c(8.288, 8.732, 5.199, 6.539, 3.248, 10.156)), .Names = c("Rest",
"Peat", "Top.culture"), row.names = c(NA, 6L), class = "data.frame")
If text annotation is what is meant by 'show the mean and also the population size' then:
boxplot(dat)
text(1:3, 12.5, paste( "Mean= ",round(sapply(dat,mean, na.rm=TRUE), 2),
"\n N= ",
sapply(dat, function(x) length( x[!is.na(x)] ) )
) )
This used your more complex data-object from the other (duplicated) question.
dat <- structure(list(Rest = c(3.479386607, 3.478445796, 2.52227462, 1.726115552, 3.917693859, 2.300840122, 2.326307503, 2.344828287, 4.654278623, 3.68669447, 3.343706863, 0.712228306, 2.735897248, 1.936723375, 2.724260325, 2.069633651, 1.741484154, 2.304391217, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Peat = c(16.79515746, 22.76673699, 24.43289941, 15.64168939, 31.60459098, 16.2369787, 32.63285246, 35.91852324, 19.27802839, 21.78974576, 30.39119451, 35.4846573, 42.21807817, 42.00913743, 40.96996704, 19.85075354, 17.247096, 22.81689524, 43.35990368, 37.57273508, 23.76889902, 38.34604591, 20.98376674, 16.44173119, 17.27639888, NA, NA, NA, NA, NA, NA), Top.culture = c(8.288, 8.732, 5.199, 6.539, 3.248, 10.156, 3.436, 5.584, 4.483, 2.087, 3.28, 2.71, 2.196, 4.971, 4.475, 6.361, 5.49, 9.085, 3.52, 5.772, 9.308, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("Rest", "Peat", "Top.culture" ), class = "data.frame", row.names = c(NA, -31L))

Resources