How to get a number from a Holt-Winters forecast in RStudio - r

Like the title says, is there any way to get an exact number from a Holt-Winters forecast? For example, say I have a time-series object like this:
Date Total
6/1/2014 150
7/1/2014 219
8/1/2014 214
9/1/2014 47
10/1/2014 311
11/1/2014 198
12/1/2014 169
1/1/2015 253
2/1/2015 167
3/1/2015 262
4/1/2015 290
5/1/2015 319
6/1/2015 405
7/1/2015 395
8/1/2015 391
9/1/2015 345
10/1/2015 401
11/1/2015 390
12/1/2015 417
1/1/2016 375
2/1/2016 397
3/1/2016 802
4/1/2016 466
After storing it in variable hp, I used Holt Winters to make a forecast:
hp.ts <- ts(hp$Total, frequency = 12, start = c(2014,4))
hp.ts.hw <- HoltWinters(hp.ts)
library(forecast)
hp.ts.hw.fc <- forecast.HoltWinters(hp.ts.hw, h = 5)
plot(hp.ts.hw.fc)
However, what I need to know is what exactly the Total in 2016/05 is predicted to be. Is there any way to get the exact value?
By the way, I noticed that the blue (forecast) line is NOT connected to the black line. Is that normal? Or should I fix my code?
Thank you for reading.

I don't know why you took such a roundabout route when you have already called library(forecast). Below are direct answers to your questions:
hp.ts.hw <- hw(hp.ts)
hp.ts.hw.fc <- forecast(hp.ts.hw, h = 5)
plot(hp.ts.hw.fc)
hp.ts.hw.fc
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
Mar 2016 546.5311 448.2997 644.7624 396.2992 696.7630
Apr 2016 623.7030 525.4716 721.9344 473.4711 773.9349
May 2016 671.8989 573.6675 770.1303 521.6670 822.1309
Jun 2016 667.3722 569.1408 765.6036 517.1402 817.6041
Jul 2016 500.0710 401.8396 598.3024 349.8390 650.3030

I'm not sure I understood your question, but you can get the forecasted values with:
hp.ts.hw.fc$mean
You can use the accuracy() function to measure how good your results are.
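If you prefer to stay in base R, stats::HoltWinters() plus predict() also gives you the exact numbers, without the forecast package. A minimal sketch, using a synthetic monthly series standing in for hp$Total (so the values printed are illustrative only):

```r
# Synthetic monthly series standing in for hp$Total
set.seed(1)
x <- ts(cumsum(rnorm(36, mean = 10)), frequency = 12, start = c(2014, 6))

fit <- HoltWinters(x)  # triple exponential smoothing (level, trend, season)
fc <- predict(fit, n.ahead = 5, prediction.interval = TRUE)

fc            # a ts matrix with columns fit, upr, lwr
fc[1, "fit"]  # the exact point forecast for the first month ahead
```

As with `$mean` in the forecast-based answer, the `fit` column holds the point forecasts as plain numbers you can index into.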

Related

How to implement the facet grid feature using the ggfortify library on a time series data?

I am using RStudio and I have time series data (a ts object) called data1.
Here is how data1 looks:
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2014 135 172 179 189 212 47 301 183 247 292 280 325
2015 471 243 386 235 388 257 344 526 363 261 189 173
2016 272 267 197 217 393 299 343 341 315 305 384 497
To plot the above, I have run this code:
plot(data1)
and I get the following plot:
I want to have a plot that is broken out by year, and I was thinking of using the facet_grid feature from ggplot2, but since my data is a ts object, I can't use ggplot2 on it directly.
After some research, I've found that the ggfortify library works with ts objects. However, I am having a hard time figuring out how to use the facet_grid feature with it.
My aim is to plot something like the example below from my ts data:
'Female' and 'Male' will be replaced by the years 2014, 2015 and 2016. The x-axis will be the months (Jan, Feb, Mar, and so on) and the y-axis will be the values in the ts object. I would prefer a line plot rather than a dot plot.
Am I on the right track here or is there another way of approaching this problem?
We can use autoplot() (ggplot2's generic, with the ts method supplied by ggfortify). I will use the AirPassengers data as an example.
library(ggplot2)
library(ggfortify)
library(lubridate)
autoplot(AirPassengers) +
facet_grid(. ~ year(Index), scales = "free_x") +
scale_x_date(date_labels = "%b")

hybridModel of Auto.arima and ANN produce point forecast outside of 95% CI

I have been working on time series forecasting and recently read about how a hybrid model of auto.arima and ANN provides better/more accurate forecasting results.
I have six time series data sets; the hybrid model works wonders for five of them, but it gives weird results for the other one.
I ran the model using the following two packages:
library(forecast)
library(forecastHybrid)
Here is the data:
ts.data
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2012           1  16  41  65  87 104 152 203 213 263
2013 299 325 388 412 409 442 447 421 435 448 447 443
2014 454 446 467 492 525
Model:
fit <- hybridModel(ts.data, model="an")
Forecast results for the next 5 periods:
forecast(fit, 5)
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
Jun 2014 594.6594 519.2914 571.0163 505.6007 584.7070
Jul 2014 702.1626 528.7327 601.8827 509.3710 621.2444
Aug 2014 738.5732 540.6665 630.2566 516.9534 653.9697
Sep 2014 752.1329 553.8905 657.3403 526.5090 684.7218
Oct 2014 762.7481 567.9391 683.5994 537.3256 714.2129
You see how the point forecasts are outside of the 95% confidence interval.
Does anybody know why this is happening and how I could fix it?
Any thoughts and insights are appreciated!
Thanks in advance.
See the description of this issue here.
tl;dr: nnetar models do not create prediction intervals, so these are not included in the ensemble prediction intervals. When the forecast package adds this behavior (on the roadmap for 2016), the prediction intervals and point forecasts will be consistent.
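A toy numeric illustration of the mismatch (the numbers are invented, and this is not forecastHybrid's exact internals): the ensemble point forecast averages both models, but the interval can only come from the model that produces one, so the mean can land outside it.

```r
arima_point <- 450           # hypothetical auto.arima point forecast
nnet_point  <- 740           # hypothetical nnetar point forecast
ensemble_point <- mean(c(arima_point, nnet_point))  # (450 + 740) / 2 = 595

arima_lo95 <- 400            # 95% bounds taken from auto.arima only,
arima_hi95 <- 500            # because nnetar contributes no interval
ensemble_point > arima_hi95  # TRUE: the point forecast escapes Hi 95
```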

unix telephone syntax with different ways of writing American phone numbers

OK, so I have a .txt file with names followed by their respective phone numbers, and I need to grab all the numbers matching the ###-###-#### syntax, which I have accomplished with this code:
grep -E "([0-9]{3})-[0-9]{3}-[0-9]{4}" telephonefile_P2
but my problem is that there are instances of
(###)-###-####
(###) ### ####
### ### ####
###-####
This is the file:
Sam Spade (212)-756-1045
Daffy Duck 312 450 2856
Mom 354-2015
Star Club 49 040–31 77 78 0
Lolita Spengler (816) 756 8657
Hoffman's Kleider 049 37 1836 027
Dr. Harold Kranzler 765-986-9987
Ralph Spoilsport's Motors 967 882 6534
Hermann's Speilhaus 49 25 8377 1765
Hal Kubrick 44 1289 332934
Sister Sue 978 0672
Auggie Keller 49 089/594 393
JCCC 913-469-8500
This is my desired output:
Sam Spade (212)-756-1045
Daffy Duck 312 450 2856
Mom 354-2015
Lolita Spengler (816) 756 8657
Dr. Harold Kranzler 765-986-9987
Ralph Spoilsport's Motors 967 882 6534
Sister Sue 978 0672
JCCC 913-469-8500
and I don't know how to account for these alternate forms...
obviously new to Unix, please be gentle!
$ awk '/(^|[[:space:]])(\(?[0-9]{3}\)?[- ])?[0-9]{3}[- ][0-9]{4}([[:space:]]|$)/' file
Sam Spade (212)-756-1045
Daffy Duck 312 450 2856
Mom 354-2015
Lolita Spengler (816) 756 8657
Dr. Harold Kranzler 765-986-9987
Ralph Spoilsport's Motors 967 882 6534
Sister Sue 978 0672
JCCC 913-469-8500
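For what it's worth, the same pattern works with plain grep -E, since it uses only standard ERE features. A sketch, recreating the sample file first (the Star Club line originally uses an en dash; a plain hyphen is used here and is rejected either way):

```shell
cat > telephonefile_P2 <<'EOF'
Sam Spade (212)-756-1045
Daffy Duck 312 450 2856
Mom 354-2015
Star Club 49 040-31 77 78 0
Lolita Spengler (816) 756 8657
Hoffman's Kleider 049 37 1836 027
Dr. Harold Kranzler 765-986-9987
Ralph Spoilsport's Motors 967 882 6534
Hermann's Speilhaus 49 25 8377 1765
Hal Kubrick 44 1289 332934
Sister Sue 978 0672
Auggie Keller 49 089/594 393
JCCC 913-469-8500
EOF

# Same ERE as the awk answer: an optional (###)/### prefix, then
# ###x#### where x is a hyphen or a space, bounded by whitespace
grep -E '(^|[[:space:]])(\(?[0-9]{3}\)?[- ])?[0-9]{3}[- ][0-9]{4}([[:space:]]|$)' telephonefile_P2
```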

Fuzzy string matching in r

I have two datasets with more than 100K rows each. I would like to merge them based on fuzzy string matching of one column ('movie title'), as well as using the release date. I am providing a sample from both datasets below.
dataset-1
itemid userid rating time title release_date
99991 1673 835 3 1998-03-27 mirage 1995
99992 1674 840 4 1998-03-29 mamma roma 1962
99993 1675 851 3 1998-01-08 sunchaser, the 1996
99994 1676 851 2 1997-10-01 war at home, the 1996
99995 1677 854 3 1997-12-22 sweet nothing 1995
99996 1678 863 1 1998-03-07 mat' i syn 1997
99997 1679 863 3 1998-03-07 b. monkey 1998
99998 1680 863 2 1998-03-07 sliding doors 1998
99999 1681 896 3 1998-02-11 you so crazy 1994
100000 1682 916 3 1997-11-29 scream of stone (schrei aus stein) 1991
dataset - 2
itemid userid rating time title release_date
1 2844 4477 3 2013-03-09 fantã´mas - 〠l'ombre de la guillotine 1913
2 4936 8871 4 2013-05-05 the bank 1915
3 4936 11628 3 2013-07-06 the bank 1915
4 4972 16885 4 2013-08-19 the birth of a nation 1915
5 5078 11628 2 2013-08-23 the cheat 1915
6 6684 4222 3 2013-08-24 the fireman 1916
7 6689 4222 3 2013-08-24 the floorwalker 1916
8 7264 2092 4 2013-03-17 the rink 1916
9 7264 5943 3 2013-05-12 the rink 1916
10 7880 11628 4 2013-07-19 easy street 1917
I have looked at 'agrep', but it only matches one string at a time. The 'stringdist' function is good, but you need to run it in a loop, find the minimum distance and then do further processing, which is very time-consuming given the size of the datasets. The strings can have typos and special characters, which is why fuzzy matching is required. I have looked around and found the 'Levenshtein' and 'Jaro-Winkler' methods. The latter, I read, is good for when you have typos in strings.
In this scenario, only fuzzy matching may not provide good results e.g., A movie title 'toy story' in one dataset can be matched to 'toy story 2' in the other which is not right. So I need to consider the release date to make sure the movies that are matched are unique.
I want to know if there is a way to achieve this task without using a loop. Worst case, if I have to use a loop, how can I make it work as efficiently and as fast as possible?
I have tried the following code, but it took an awfully long time to run.
for(i in 1:nrow(test))
for(j in 1:nrow(test1))
{
test$title.match <- ifelse(jarowinkler(test$x[i], test1$x[j]) > 0.85,
test$title, NA)
}
test - contains 1682 unique movie names converted to lower case
test1 - contains 11451 unique movie names converted to lower case
Is there a way to avoid the for loops and make it work faster?
What about this approach to move you forward? You can adjust the degree of match from 0.85 after you see the results. You could then use dplyr to group by the matched title and summarise by subtracting release dates; any zeros would mean the same release date.
library(RecordLinkage)  # provides jarowinkler()
dataset_1$title.match <- ifelse(jarowinkler(dataset_1$title, dataset_2$title) > 0.85,
                                dataset_1$title, NA)
Note that jarowinkler() compares the two title vectors element-wise (recycling the shorter one), so this assumes the rows are already paired up.
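If the double loop is the bottleneck, one loop-free alternative is base R's adist(), which computes the whole edit-distance matrix in a single vectorized call (generalized Levenshtein rather than Jaro-Winkler). A sketch with made-up titles and years, combining the title similarity with the release-date check from the question:

```r
titles1 <- c("toy story", "sliding doors", "mamma roma")
years1  <- c(1995, 1998, 1962)
titles2 <- c("toy story 2", "sliding doorz", "mama roma", "the bank")
years2  <- c(1999, 1998, 1962, 1915)

# Full edit-distance matrix in one call: rows = titles1, cols = titles2
d <- adist(titles1, titles2)

# Normalise by the longer title so 1 = identical, 0 = nothing shared
len <- outer(nchar(titles1), nchar(titles2), pmax)
sim <- 1 - d / len

# A pair matches only if titles are similar AND release years agree,
# so "toy story" (1995) never pairs with "toy story 2" (1999)
same_year <- outer(years1, years2, `==`)
matches <- which(sim > 0.85 & same_year, arr.ind = TRUE)
matches
```

The 0.85 threshold plays the same role as in the jarowinkler() answer and can be tuned after inspecting the results.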

Coding for the onset of an event in panel data in R

I was wondering if you could help me devise an effortless way to code this country-year event data that I'm using.
In the example below, each row corresponds with an ongoing event (that I will eventually fold into a broader panel data set, which is why it looks bare now). So, for example, country 29 had the onset of an event in 1920, which continued (and ended) in 1921. Country 23 had the onset of the event in 1921, which lasted until 1923. Country 35 had the onset of an event that occurred in 1921 and only in 1921, et cetera.
country year
29 1920
29 1921
23 1921
23 1922
23 1923
35 1921
64 1926
135 1928
135 1929
135 1930
135 1931
135 1932
135 1933
135 1934
120 1930
70 1932
What I want to do is create "onset" and "ongoing" variables. The "ongoing" variable in this sample data frame would be easy. Basically: Data$ongoing <- 1
I'm more interested in creating the "onset" variable. It would be coded as 1 if it marks the onset of the event for the given country. Basically, I want to create a variable that looks like this, given this example data.
country year onset
29 1920 1
29 1921 0
23 1921 1
23 1922 0
23 1923 0
35 1921 1
64 1926 1
135 1928 1
135 1929 0
135 1930 0
135 1931 0
135 1932 0
135 1933 0
135 1934 0
120 1930 1
70 1932 1
If you can think of effortless ways to do this in R (that minimizes the chances of human error when working with it in a spreadsheet program like Excel), I'd appreciate it. I did see this related question, but this person's data set doesn't look like mine and it may require a different approach.
Thanks. Reproducible code for this example data is below.
country <- c(29,29,23,23,23,35,64,135,135,135,135,135,135,135,120,70)
year <- c(1920,1921,1921,1922,1923,1921,1926,1928,1929,1930,1931,1932,1933,1934,1930,1932)
Data=data.frame(country=country,year=year)
summary(Data)
Data
This should work, even with multiple onsets per country:
Data$onset <- with(Data, ave(year, country, FUN = function(x)
as.integer(c(TRUE, tail(x, -1L) != head(x, -1L) + 1L))))
You could also do this:
library(data.table)
setDT(Data)[, onset := (min(country*year)/country == year) + 0L, country]
This could be very fast when you have a larger dataset.
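For data like this sample, where each country has a single continuous spell, a base-R sketch using duplicated() is even simpler: the first row seen for a country is its onset. (Unlike the gap-checking ave() approach, it would miss a second onset if a country re-entered after a break in years.)

```r
country <- c(29, 29, 23, 23, 23, 35, 64, 135, 135, 135,
             135, 135, 135, 135, 120, 70)
year <- c(1920, 1921, 1921, 1922, 1923, 1921, 1926, 1928, 1929, 1930,
          1931, 1932, 1933, 1934, 1930, 1932)
Data <- data.frame(country = country, year = year)

# First occurrence of each country is coded 1, repeats 0
Data$onset <- as.integer(!duplicated(Data$country))
Data
```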
