How to put parentheses on a knitr_kable object - r

I have a knitr_kable object. I want to put parentheses around every value in every even (2, 4, 6, ...) row. How can I do this?
out %>% knitr::kable(digits = 2, "pipe")
| | 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 10-1|
|:-------------|-----:|-----:|-----:|----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
|Excess Return | -0.20| 1.20| 0.97| 1.04| 0.77| 0.82| 0.79| 0.76| 0.76| 0.59| 0.79|
|NW t_Stat | -0.28| 2.04| 3.02| 3.17| 2.56| 3.16| 3.09| 3.45| 3.86| 3.48| 1.13|
|FF5 Alpha | -0.97| 0.43| -0.05| 0.02| -0.26| -0.17| -0.18| -0.19| -0.11| -0.11| 0.86|
|NW t_stats | -1.41| 1.03| -0.26| 0.14| -2.75| -2.10| -2.32| -2.85| -1.77| -5.01| 1.26|
|MktRF beta | 0.92| 1.00| 1.07| 1.08| 1.13| 1.11| 1.10| 1.08| 1.06| 0.99| 0.07|
|NW t_Stat | 1.34| 2.40| 5.44| 6.49| 11.98| 13.51| 14.58| 16.13| 16.43| 46.90| 0.10|
|SMB5 beta | 1.35| 1.34| 1.46| 1.26| 1.14| 1.03| 0.84| 0.69| 0.36| -0.16| -1.51|
|NW t_Stat | 1.97| 3.22| 7.41| 7.56| 12.11| 12.44| 11.05| 10.28| 5.61| -7.37| -2.20|
|HML beta | 0.52| 1.27| 0.48| 0.46| 0.55| 0.51| 0.36| 0.29| 0.22| 0.10| -0.42|
|NW t_Stat | 0.75| 3.05| 2.46| 2.75| 5.82| 6.23| 4.70| 4.38| 3.47| 4.53| -0.62|
|CMA beta | 0.43| -0.66| 0.15| 0.28| 0.11| -0.02| 0.10| 0.11| 0.13| 0.08| -0.34|
|NW t_stat | 0.62| -1.59| 0.77| 1.65| 1.14| -0.28| 1.38| 1.61| 1.96| 3.92| -0.50|
|RMW beta | -0.68| -0.25| 0.11| 0.08| 0.15| 0.24| 0.32| 0.35| 0.31| 0.17| 0.84|
|NW t_stat | -0.98| -0.60| 0.54| 0.47| 1.56| 2.96| 4.22| 5.18| 4.75| 7.87| 1.23|
|Adj_r_square | 0.14| 0.31| 0.64| 0.73| 0.87| 0.93| 0.94| 0.94| 0.93| 0.98| 0.09|
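One possible approach (an untested sketch, assuming out is the numeric matrix printed above): format the values as fixed-width character strings, wrap every even-numbered row in parentheses, and then pass the result to kable.
library(knitr)
# Sketch only: assumes `out` is a numeric matrix whose even rows (2, 4, 6, ...) hold t-statistics
out_chr <- format(round(out, 2), nsmall = 2)                   # character matrix, 2 decimals
even <- seq(2, nrow(out_chr), by = 2)                          # rows to wrap
out_chr[even, ] <- paste0("(", trimws(out_chr[even, ]), ")")   # add the parentheses
knitr::kable(out_chr, align = "r", format = "pipe")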


How to implement PySpark StandardScaler on subset of columns?

I want to use pyspark StandardScaler on 6 out of 10 columns in my dataframe. This will be part of a pipeline.
The inputCol parameter seems to expect a vector, which I can pass in after using VectorAssembler on all my features, but this scales all 10 features. I don’t want to scale the other 4 features because they are binary and I want unstandardized coefficients for them.
Am I supposed to use vector assembler on the 6 features, scale them, then use vector assembler again on this scaled features vector and the remaining 4 features? I would end up with a vector within a vector and I’m not sure this will work.
What’s the right way to do this? An example is appreciated.
You can do this by using VectorAssembler. The key is that you have to extract the columns from the assembler output. See the code below for a working example:
from pyspark.ml.feature import MinMaxScaler, StandardScaler
from pyspark.ml.feature import VectorAssembler
import pandas as pd
import numpy as np
import random
df = pd.DataFrame()
df['a'] = random.sample(range(100), 10)
df['b'] = random.sample(range(100), 10)
df['c'] = random.sample(range(100), 10)
df['d'] = random.sample(range(100), 10)
df['e'] = random.sample(range(100), 10)
sdf = spark.createDataFrame(df)
sdf.show()
+---+---+---+---+---+
| a| b| c| d| e|
+---+---+---+---+---+
| 51| 13| 6| 5| 26|
| 18| 29| 19| 81| 28|
| 34| 1| 36| 57| 87|
| 56| 86| 51| 52| 48|
| 36| 49| 33| 15| 54|
| 87| 53| 47| 89| 85|
| 7| 14| 55| 13| 98|
| 70| 50| 32| 39| 58|
| 80| 20| 25| 54| 37|
| 40| 33| 44| 83| 27|
+---+---+---+---+---+
cols_to_scale = ['c', 'd', 'e']
cols_to_keep_unscaled = ['a', 'b']
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
assembler = VectorAssembler().setInputCols(cols_to_scale).setOutputCol("features")
sdf_transformed = assembler.transform(sdf)
scaler_model = scaler.fit(sdf_transformed.select("features"))
sdf_scaled = scaler_model.transform(sdf_transformed)
sdf_scaled.show()
+---+---+---+---+---+----------------+--------------------+
| a| b| c| d| e| features| scaledFeatures|
+---+---+---+---+---+----------------+--------------------+
| 51| 13| 6| 5| 26| [6.0,5.0,26.0]|[0.39358015146628...|
| 18| 29| 19| 81| 28|[19.0,81.0,28.0]|[1.24633714630991...|
| 34| 1| 36| 57| 87|[36.0,57.0,87.0]|[2.36148090879773...|
| 56| 86| 51| 52| 48|[51.0,52.0,48.0]|[3.34543128746345...|
| 36| 49| 33| 15| 54|[33.0,15.0,54.0]|[2.16469083306459...|
| 87| 53| 47| 89| 85|[47.0,89.0,85.0]|[3.08304451981926...|
| 7| 14| 55| 13| 98|[55.0,13.0,98.0]|[3.60781805510765...|
| 70| 50| 32| 39| 58|[32.0,39.0,58.0]|[2.09909414115354...|
| 80| 20| 25| 54| 37|[25.0,54.0,37.0]|[1.63991729777620...|
| 40| 33| 44| 83| 27|[44.0,83.0,27.0]|[2.88625444408612...|
+---+---+---+---+---+----------------+--------------------+
# Helper to rebuild flat rows: keep the unscaled columns and unpack the scaled vector
def extract(row):
    return (row.a, row.b) + tuple(row.scaledFeatures.toArray().tolist())
sdf_scaled = sdf_scaled.select(*cols_to_keep_unscaled, "scaledFeatures").rdd \
    .map(extract).toDF(cols_to_keep_unscaled + cols_to_scale)
sdf_scaled.show()
+---+---+------------------+-------------------+------------------+
| a| b| c| d| e|
+---+---+------------------+-------------------+------------------+
| 51| 13|0.3935801514662892|0.16399957083190683|0.9667572801316145|
| 18| 29| 1.246337146309916| 2.656793047476891|1.0411232247571234|
| 34| 1|2.3614809087977355| 1.8695951074837378|3.2349185912096337|
| 56| 86|3.3454312874634584| 1.7055955366518312|1.7847826710122114|
| 36| 49| 2.164690833064591|0.49199871249572047| 2.007880504888738|
| 87| 53| 3.083044519819266| 2.9191923608079415|3.1605526465841245|
| 7| 14|3.6078180551076513| 0.4263988841629578| 3.643931286649932|
| 70| 50|2.0990941411535426| 1.2791966524888734|2.1566123941397555|
| 80| 20| 1.639917297776205| 1.7711953649845937| 1.375769975571913|
| 40| 33|2.8862544440861213| 2.7223928758096534| 1.003940252444369|
+---+---+------------------+-------------------+------------------+

Anti group by/R apply in Pyspark

I am an R programmer moving into the PySpark world. I have gotten a lot of the basic tricks down, but I am still struggling with the kinds of things I would normally handle with apply() or basic for loops.
In this case I am trying to calculate the "anti-groupby" for an ID. Basically, the idea is to look at the population for that ID and the population for everything that is not that ID, and have both values on the same row. Getting the population for that ID is easy using a groupby and then joining it to a dataset with new_id as the only column.
This is how I would do it in R:
anti_group <- function(id){
  tr <- sum(subset(df1, new_id != id)$total_1)
  to <- sum(subset(df1, new_id != id)$total_2)
  54 * tr / to
}
test$other.RP54 <- sapply(test$new_id, anti_group )
How would I do it in pyspark?
Thanks!
Edit:
#df.show()
#sample data
+---+-----+
| id|value|
+---+-----+
| 1| 40|
| 1| 30|
| 2| 10|
| 2| 90|
| 3| 20|
| 3| 10|
| 4| 2|
| 4| 5|
+---+-----+
Then some function that creates a final dataframe that looks like this:
+---+-------------+------------------+
| id|grouped_total|anti_grouped_total|
+---+-------------+------------------+
| 1| 70| 137|
| 2| 100| 107|
| 3| 30| 177|
| 4| 7| 200|
+---+-------------+------------------+
I think you can do that in two steps: first you sum by id, then you take the grand total and subtract this id's value from it.
My idea is a little bit like a group_by(id) %>% summarise(x = sum(x)) %>% mutate(y = sum(x) - x) in dplyr
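As a point of comparison, that two-step dplyr idea written out against the sample data above would look roughly like this (untested sketch):
library(dplyr)
# Sample data from the question
df <- tibble(id    = c(1, 1, 2, 2, 3, 3, 4, 4),
             value = c(40, 30, 10, 90, 20, 10, 2, 5))
df %>%
  group_by(id) %>%
  summarise(grouped_total = sum(value)) %>%
  mutate(anti_grouped_total = sum(grouped_total) - grouped_total)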
The solution I propose is based on a window function. It is untested:
Let's first create the data
import pyspark.sql.functions as psf
import pyspark.sql.window as psw
df = spark.createDataFrame([(1,40),(1,30),(2,10),(2,90),(3,20),(3,10),(4,2),(4,5)], ['id','value'])
df.show(2)
+---+-----+
| id|value|
+---+-----+
| 1| 40|
| 1| 30|
+---+-----+
only showing top 2 rows
and then apply that approach:
w = psw.Window.orderBy()
df_id = df.groupBy("id").agg(psf.sum("value").alias("grouped_total"))
df_id = (df_id
    .withColumn("anti_grouped_total", psf.sum("grouped_total").over(w))
    .withColumn('anti_grouped_total', psf.col('anti_grouped_total') - psf.col('grouped_total'))
)
df_id.show(2)
+---+-------------+------------------+
| id|grouped_total|anti_grouped_total|
+---+-------------+------------------+
| 3| 30| 177|
| 1| 70| 137|
+---+-------------+------------------+
only showing top 2 rows
So there's no built-in function that would replicate that anti-groupby, but you can easily do it by creating a new column with a case expression (when/otherwise) to mark your group and anti-group, and then grouping by that new column.
#df.show()
#sample data
+---+-----+
| id|value|
+---+-----+
| 1| 40|
| 1| 30|
| 2| 10|
| 2| 90|
| 3| 20|
| 3| 10|
| 4| 2|
| 4| 5|
+---+-----+
from pyspark.sql import functions as F
df.withColumn("anti_id_1", F.when(F.col("id")==1, F.lit('1')).otherwise(F.lit('Not_1')))\
.groupBy("anti_id_1").agg(F.sum("value").alias("sum")).show()
+---------+---+
|anti_id_1|sum|
+---------+---+
| 1| 70|
| Not_1|137|
+---------+---+
UPDATE:
from pyspark.sql.window import Window
from pyspark.sql import functions as F
w1 = Window().partitionBy("id")
w = Window().partitionBy()
df.withColumn("grouped_total", F.sum("value").over(w1))\
    .withColumn("anti_grouped_total", (F.sum("value").over(w)) - F.col("grouped_total"))\
    .groupBy("id").agg(F.first("grouped_total").alias("grouped_total"),
                       F.first("anti_grouped_total").alias("anti_grouped_total"))\
    .drop("value").orderBy("id").show()
+---+-------------+------------------+
| id|grouped_total|anti_grouped_total|
+---+-------------+------------------+
| 1| 70| 137|
| 2| 100| 107|
| 3| 30| 177|
| 4| 7| 200|
+---+-------------+------------------+
A less verbose/more concise way to achieve the same output:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w = Window().partitionBy()
df.groupBy("id").agg(F.sum("value").alias("grouped_total"))\
.withColumn("anti_grouped_total",F.sum("grouped_total").over(w)-F.col("grouped_total")).orderBy("id"),show()
For 2 value columns:
df.show()
+---+------+------+
| id|value1|value2|
+---+------+------+
| 1| 40| 50|
| 1| 30| 70|
| 2| 10| 91|
| 2| 90| 21|
| 3| 20| 42|
| 3| 10| 4|
| 4| 2| 23|
| 4| 5| 12|
+---+------+------+
from pyspark.sql.window import Window
from pyspark.sql import functions as F
w = Window().partitionBy()
df.groupBy("id").agg(F.sum("value1").alias("grouped_total_1"),F.sum("value2").alias("grouped_total_2"))\
.withColumn("anti_grouped_total_1",F.sum("grouped_total_1").over(w)-F.col("grouped_total_1"))\
.withColumn("anti_grouped_total_2",F.sum("grouped_total_2").over(w)-F.col("grouped_total_2")).orderBy("id").show()
+---+---------------+---------------+--------------------+--------------------+
| id|grouped_total_1|grouped_total_2|anti_grouped_total_1|anti_grouped_total_2|
+---+---------------+---------------+--------------------+--------------------+
| 1| 70| 120| 137| 193|
| 2| 100| 112| 107| 201|
| 3| 30| 46| 177| 267|
| 4| 7| 35| 200| 278|
+---+---------------+---------------+--------------------+--------------------+

manipulate multiple variables in a data frame

How can I shorten the following code? It feels so repetitive and lengthy, and it can surely be shortened. I'm not sure how to select those variables and do this kind of recoding in a succinct way. Any help is welcome!
data_France$X.1CTP2[data_France$X.1CTP2>7.01 | data_France$X.1CTP2<0.99]<-NA
data_France$X.1CTP3[data_France$X.1CTP3>7.01 | data_France$X.1CTP3<0.99]<-NA
data_France$X.1CTP4[data_France$X.1CTP4>7.01 | data_France$X.1CTP4<0.99]<-NA
data_France$X.1CTP5[data_France$X.1CTP5>7.01 | data_France$X.1CTP5<0.99]<-NA
data_France$X.1CTP6[data_France$X.1CTP6>7.01 | data_France$X.1CTP6<0.99]<-NA
data_France$X.1CTP7[data_France$X.1CTP7>7.01 | data_France$X.1CTP7<0.99]<-NA
data_France$X.1CTP8[data_France$X.1CTP8>7.01 | data_France$X.1CTP8<0.99]<-NA
data_France$X.1CTP9[data_France$X.1CTP9>7.01 | data_France$X.1CTP9<0.99]<-NA
data_France$X.1CTP10[data_France$X.1CTP10>7.01 | data_France$X.1CTP10<0.99]<-NA
data_France$X.1CTP11[data_France$X.1CTP11>7.01 | data_France$X.1CTP11<0.99]<-NA
data_France$X.1CTP12[data_France$X.1CTP12>7.01 | data_France$X.1CTP12<0.99]<-NA
data_France$X.1CTP13[data_France$X.1CTP13>7.01 | data_France$X.1CTP13<0.99]<-NA
data_France$X.1CTP14[data_France$X.1CTP14>7.01 | data_France$X.1CTP14<0.99]<-NA
data_France$X.1CTP15[data_France$X.1CTP15>7.01 | data_France$X.1CTP15<0.99]<-NA
data_France$X.2CTP1[data_France$X.2CTP1>7.01 | data_France$X.2CTP1<0.99]<-NA
data_France$X.2CTP3[data_France$X.2CTP3>7.01 | data_France$X.2CTP3<0.99]<-NA
data_France$X.2CTP4[data_France$X.2CTP4>7.01 | data_France$X.2CTP4<0.99]<-NA
data_France$X.2CTP5[data_France$X.2CTP5>7.01 | data_France$X.2CTP5<0.99]<-NA
data_France$X.2CTP6[data_France$X.2CTP6>7.01 | data_France$X.2CTP6<0.99]<-NA
data_France$X.2CTP7[data_France$X.2CTP7>7.01 | data_France$X.2CTP7<0.99]<-NA
data_France$X.2CTP8[data_France$X.2CTP8>7.01 | data_France$X.2CTP8<0.99]<-NA
data_France$X.2CTP9[data_France$X.2CTP9>7.01 | data_France$X.2CTP9<0.99]<-NA
data_France$X.2CTP10[data_France$X.2CTP10>7.01 | data_France$X.2CTP10<0.99]<-NA
data_France$X.2CTP11[data_France$X.2CTP11>7.01 | data_France$X.2CTP11<0.99]<-NA
data_France$X.2CTP12[data_France$X.2CTP12>7.01 | data_France$X.2CTP12<0.99]<-NA
data_France$X.2CTP13[data_France$X.2CTP13>7.01 | data_France$X.2CTP13<0.99]<-NA
data_France$X.2CTP14[data_France$X.2CTP14>7.01 | data_France$X.2CTP14<0.99]<-NA
data_France$X.2CTP15[data_France$X.2CTP15>7.01 | data_France$X.2CTP15<0.99]<-NA
data_France$X.3CTP1[data_France$X.3CTP1>7.01 | data_France$X.3CTP1<0.99]<-NA
data_France$X.3CTP2[data_France$X.3CTP2>7.01 | data_France$X.3CTP2<0.99]<-NA
data_France$X.3CTP4[data_France$X.3CTP4>7.01 | data_France$X.3CTP4<0.99]<-NA
data_France$X.3CTP5[data_France$X.3CTP5>7.01 | data_France$X.3CTP5<0.99]<-NA
data_France$X.3CTP6[data_France$X.3CTP6>7.01 | data_France$X.3CTP6<0.99]<-NA
data_France$X.3CTP7[data_France$X.3CTP7>7.01 | data_France$X.3CTP7<0.99]<-NA
data_France$X.3CTP8[data_France$X.3CTP8>7.01 | data_France$X.3CTP8<0.99]<-NA
data_France$X.3CTP9[data_France$X.3CTP9>7.01 | data_France$X.3CTP9<0.99]<-NA
data_France$X.3CTP10[data_France$X.3CTP10>7.01 | data_France$X.3CTP10<0.99]<-NA
data_France$X.3CTP11[data_France$X.3CTP11>7.01 | data_France$X.3CTP11<0.99]<-NA
data_France$X.3CTP12[data_France$X.3CTP12>7.01 | data_France$X.3CTP12<0.99]<-NA
data_France$X.3CTP13[data_France$X.3CTP13>7.01 | data_France$X.3CTP13<0.99]<-NA
data_France$X.3CTP14[data_France$X.3CTP14>7.01 | data_France$X.3CTP14<0.99]<-NA
data_France$X.3CTP15[data_France$X.3CTP15>7.01 | data_France$X.3CTP15<0.99]<-NA
data_France$X.4CTP1[data_France$X.4CTP1>7.01 | data_France$X.4CTP1<0.99]<-NA
data_France$X.4CTP2[data_France$X.4CTP2>7.01 | data_France$X.4CTP2<0.99]<-NA
data_France$X.4CTP3[data_France$X.4CTP3>7.01 | data_France$X.4CTP3<0.99]<-NA
data_France$X.4CTP5[data_France$X.4CTP5>7.01 | data_France$X.4CTP5<0.99]<-NA
data_France$X.4CTP6[data_France$X.4CTP6>7.01 | data_France$X.4CTP6<0.99]<-NA
data_France$X.4CTP7[data_France$X.4CTP7>7.01 | data_France$X.4CTP7<0.99]<-NA
data_France$X.4CTP8[data_France$X.4CTP8>7.01 | data_France$X.4CTP8<0.99]<-NA
data_France$X.4CTP9[data_France$X.4CTP9>7.01 | data_France$X.4CTP9<0.99]<-NA
data_France$X.4CTP10[data_France$X.4CTP10>7.01 | data_France$X.4CTP10<0.99]<-NA
data_France$X.4CTP11[data_France$X.4CTP11>7.01 | data_France$X.4CTP11<0.99]<-NA
data_France$X.4CTP12[data_France$X.4CTP12>7.01 | data_France$X.4CTP12<0.99]<-NA
data_France$X.4CTP13[data_France$X.4CTP13>7.01 | data_France$X.4CTP13<0.99]<-NA
data_France$X.4CTP14[data_France$X.4CTP14>7.01 | data_France$X.4CTP14<0.99]<-NA
data_France$X.4CTP15[data_France$X.4CTP15>7.01 | data_France$X.4CTP15<0.99]<-NA
data_France$X.5CTP1[data_France$X.5CTP1>7.01 | data_France$X.5CTP1<0.99]<-NA
data_France$X.5CTP2[data_France$X.5CTP2>7.01 | data_France$X.5CTP2<0.99]<-NA
data_France$X.5CTP3[data_France$X.5CTP3>7.01 | data_France$X.5CTP3<0.99]<-NA
data_France$X.5CTP4[data_France$X.5CTP4>7.01 | data_France$X.5CTP4<0.99]<-NA
data_France$X.5CTP6[data_France$X.5CTP6>7.01 | data_France$X.5CTP6<0.99]<-NA
data_France$X.5CTP7[data_France$X.5CTP7>7.01 | data_France$X.5CTP7<0.99]<-NA
data_France$X.5CTP8[data_France$X.5CTP8>7.01 | data_France$X.5CTP8<0.99]<-NA
data_France$X.5CTP9[data_France$X.5CTP9>7.01 | data_France$X.5CTP9<0.99]<-NA
data_France$X.5CTP10[data_France$X.5CTP10>7.01 | data_France$X.5CTP10<0.99]<-NA
data_France$X.5CTP11[data_France$X.5CTP11>7.01 | data_France$X.5CTP11<0.99]<-NA
data_France$X.5CTP12[data_France$X.5CTP12>7.01 | data_France$X.5CTP12<0.99]<-NA
data_France$X.5CTP13[data_France$X.5CTP13>7.01 | data_France$X.5CTP13<0.99]<-NA
data_France$X.5CTP14[data_France$X.5CTP14>7.01 | data_France$X.5CTP14<0.99]<-NA
data_France$X.5CTP15[data_France$X.5CTP15>7.01 | data_France$X.5CTP15<0.99]<-NA
data_France$X.6CTP1[data_France$X.6CTP1>7.01 | data_France$X.6CTP1<0.99]<-NA
data_France$X.6CTP2[data_France$X.6CTP2>7.01 | data_France$X.6CTP2<0.99]<-NA
data_France$X.6CTP3[data_France$X.6CTP3>7.01 | data_France$X.6CTP3<0.99]<-NA
data_France$X.6CTP4[data_France$X.6CTP4>7.01 | data_France$X.6CTP4<0.99]<-NA
data_France$X.6CTP5[data_France$X.6CTP5>7.01 | data_France$X.6CTP5<0.99]<-NA
data_France$X.6CTP7[data_France$X.6CTP7>7.01 | data_France$X.6CTP7<0.99]<-NA
data_France$X.6CTP8[data_France$X.6CTP8>7.01 | data_France$X.6CTP8<0.99]<-NA
data_France$X.6CTP9[data_France$X.6CTP9>7.01 | data_France$X.6CTP9<0.99]<-NA
data_France$X.6CTP10[data_France$X.6CTP10>7.01 | data_France$X.6CTP10<0.99]<-NA
data_France$X.6CTP11[data_France$X.6CTP11>7.01 | data_France$X.6CTP11<0.99]<-NA
data_France$X.6CTP12[data_France$X.6CTP12>7.01 | data_France$X.6CTP12<0.99]<-NA
data_France$X.6CTP13[data_France$X.6CTP13>7.01 | data_France$X.6CTP13<0.99]<-NA
data_France$X.6CTP14[data_France$X.6CTP14>7.01 | data_France$X.6CTP14<0.99]<-NA
data_France$X.6CTP15[data_France$X.6CTP15>7.01 | data_France$X.6CTP15<0.99]<-NA
data_France$X.7CTP1[data_France$X.7CTP1>7.01 | data_France$X.7CTP1<0.99]<-NA
data_France$X.7CTP2[data_France$X.7CTP2>7.01 | data_France$X.7CTP2<0.99]<-NA
data_France$X.7CTP3[data_France$X.7CTP3>7.01 | data_France$X.7CTP3<0.99]<-NA
data_France$X.7CTP4[data_France$X.7CTP4>7.01 | data_France$X.7CTP4<0.99]<-NA
data_France$X.7CTP5[data_France$X.7CTP5>7.01 | data_France$X.7CTP5<0.99]<-NA
data_France$X.7CTP6[data_France$X.7CTP6>7.01 | data_France$X.7CTP6<0.99]<-NA
data_France$X.7CTP8[data_France$X.7CTP8>7.01 | data_France$X.7CTP8<0.99]<-NA
data_France$X.7CTP9[data_France$X.7CTP9>7.01 | data_France$X.7CTP9<0.99]<-NA
data_France$X.7CTP10[data_France$X.7CTP10>7.01 | data_France$X.7CTP10<0.99]<-NA
data_France$X.7CTP11[data_France$X.7CTP11>7.01 | data_France$X.7CTP11<0.99]<-NA
data_France$X.7CTP12[data_France$X.7CTP12>7.01 | data_France$X.7CTP12<0.99]<-NA
data_France$X.7CTP13[data_France$X.7CTP13>7.01 | data_France$X.7CTP13<0.99]<-NA
data_France$X.7CTP14[data_France$X.7CTP14>7.01 | data_France$X.7CTP14<0.99]<-NA
data_France$X.7CTP15[data_France$X.7CTP15>7.01 | data_France$X.7CTP15<0.99]<-NA
data_France$X.8CTP1[data_France$X.8CTP1>7.01 | data_France$X.8CTP1<0.99]<-NA
data_France$X.8CTP2[data_France$X.8CTP2>7.01 | data_France$X.8CTP2<0.99]<-NA
data_France$X.8CTP3[data_France$X.8CTP3>7.01 | data_France$X.8CTP3<0.99]<-NA
data_France$X.8CTP4[data_France$X.8CTP4>7.01 | data_France$X.8CTP4<0.99]<-NA
data_France$X.8CTP5[data_France$X.8CTP5>7.01 | data_France$X.8CTP5<0.99]<-NA
data_France$X.8CTP6[data_France$X.8CTP6>7.01 | data_France$X.8CTP6<0.99]<-NA
data_France$X.8CTP7[data_France$X.8CTP7>7.01 | data_France$X.8CTP7<0.99]<-NA
data_France$X.8CTP9[data_France$X.8CTP9>7.01 | data_France$X.8CTP9<0.99]<-NA
data_France$X.8CTP10[data_France$X.8CTP10>7.01 | data_France$X.8CTP10<0.99]<-NA
data_France$X.8CTP11[data_France$X.8CTP11>7.01 | data_France$X.8CTP11<0.99]<-NA
data_France$X.8CTP12[data_France$X.8CTP12>7.01 | data_France$X.8CTP12<0.99]<-NA
data_France$X.8CTP13[data_France$X.8CTP13>7.01 | data_France$X.8CTP13<0.99]<-NA
data_France$X.8CTP14[data_France$X.8CTP14>7.01 | data_France$X.8CTP14<0.99]<-NA
data_France$X.8CTP15[data_France$X.8CTP15>7.01 | data_France$X.8CTP15<0.99]<-NA
data_France$X.9CTP1[data_France$X.9CTP1>7.01 | data_France$X.9CTP1<0.99]<-NA
data_France$X.9CTP2[data_France$X.9CTP2>7.01 | data_France$X.9CTP2<0.99]<-NA
data_France$X.9CTP3[data_France$X.9CTP3>7.01 | data_France$X.9CTP3<0.99]<-NA
data_France$X.9CTP4[data_France$X.9CTP4>7.01 | data_France$X.9CTP4<0.99]<-NA
data_France$X.9CTP5[data_France$X.9CTP5>7.01 | data_France$X.9CTP5<0.99]<-NA
data_France$X.9CTP6[data_France$X.9CTP6>7.01 | data_France$X.9CTP6<0.99]<-NA
data_France$X.9CTP7[data_France$X.9CTP7>7.01 | data_France$X.9CTP7<0.99]<-NA
data_France$X.9CTP8[data_France$X.9CTP8>7.01 | data_France$X.9CTP8<0.99]<-NA
data_France$X.9CTP10[data_France$X.9CTP10>7.01 | data_France$X.9CTP10<0.99]<-NA
data_France$X.9CTP11[data_France$X.9CTP11>7.01 | data_France$X.9CTP11<0.99]<-NA
data_France$X.9CTP12[data_France$X.9CTP12>7.01 | data_France$X.9CTP12<0.99]<-NA
data_France$X.9CTP13[data_France$X.9CTP13>7.01 | data_France$X.9CTP13<0.99]<-NA
data_France$X.9CTP14[data_France$X.9CTP14>7.01 | data_France$X.9CTP14<0.99]<-NA
data_France$X.9CTP15[data_France$X.9CTP15>7.01 | data_France$X.9CTP15<0.99]<-NA
data_France$X.10CTP1[data_France$X.10CTP1>7.01 | data_France$X.10CTP1<0.99]<-NA
data_France$X.10CTP2[data_France$X.10CTP2>7.01 | data_France$X.10CTP2<0.99]<-NA
data_France$X.10CTP3[data_France$X.10CTP3>7.01 | data_France$X.10CTP3<0.99]<-NA
data_France$X.10CTP4[data_France$X.10CTP4>7.01 | data_France$X.10CTP4<0.99]<-NA
data_France$X.10CTP5[data_France$X.10CTP5>7.01 | data_France$X.10CTP5<0.99]<-NA
data_France$X.10CTP6[data_France$X.10CTP6>7.01 | data_France$X.10CTP6<0.99]<-NA
data_France$X.10CTP7[data_France$X.10CTP7>7.01 | data_France$X.10CTP7<0.99]<-NA
data_France$X.10CTP8[data_France$X.10CTP8>7.01 | data_France$X.10CTP8<0.99]<-NA
data_France$X.10CTP9[data_France$X.10CTP9>7.01 | data_France$X.10CTP9<0.99]<-NA
data_France$X.10CTP11[data_France$X.10CTP11>7.01 | data_France$X.10CTP11<0.99]<-NA
data_France$X.10CTP12[data_France$X.10CTP12>7.01 | data_France$X.10CTP12<0.99]<-NA
data_France$X.10CTP13[data_France$X.10CTP13>7.01 | data_France$X.10CTP13<0.99]<-NA
data_France$X.10CTP14[data_France$X.10CTP14>7.01 | data_France$X.10CTP14<0.99]<-NA
data_France$X.10CTP15[data_France$X.10CTP15>7.01 | data_France$X.10CTP15<0.99]<-NA
data_France$X.11CTP1[data_France$X.11CTP1>7.01 | data_France$X.11CTP1<0.99]<-NA
data_France$X.11CTP2[data_France$X.11CTP2>7.01 | data_France$X.11CTP2<0.99]<-NA
data_France$X.11CTP3[data_France$X.11CTP3>7.01 | data_France$X.11CTP3<0.99]<-NA
data_France$X.11CTP4[data_France$X.11CTP4>7.01 | data_France$X.11CTP4<0.99]<-NA
data_France$X.11CTP5[data_France$X.11CTP5>7.01 | data_France$X.11CTP5<0.99]<-NA
data_France$X.11CTP6[data_France$X.11CTP6>7.01 | data_France$X.11CTP6<0.99]<-NA
data_France$X.11CTP7[data_France$X.11CTP7>7.01 | data_France$X.11CTP7<0.99]<-NA
data_France$X.11CTP8[data_France$X.11CTP8>7.01 | data_France$X.11CTP8<0.99]<-NA
data_France$X.11CTP9[data_France$X.11CTP9>7.01 | data_France$X.11CTP9<0.99]<-NA
data_France$X.11CTP10[data_France$X.11CTP10>7.01 | data_France$X.11CTP10<0.99]<-NA
data_France$X.11CTP12[data_France$X.11CTP12>7.01 | data_France$X.11CTP12<0.99]<-NA
data_France$X.11CTP13[data_France$X.11CTP13>7.01 | data_France$X.11CTP13<0.99]<-NA
data_France$X.11CTP14[data_France$X.11CTP14>7.01 | data_France$X.11CTP14<0.99]<-NA
data_France$X.11CTP15[data_France$X.11CTP15>7.01 | data_France$X.11CTP15<0.99]<-NA
data_France$X.12CTP1[data_France$X.12CTP1>7.01 | data_France$X.12CTP1<0.99]<-NA
data_France$X.12CTP2[data_France$X.12CTP2>7.01 | data_France$X.12CTP2<0.99]<-NA
data_France$X.12CTP3[data_France$X.12CTP3>7.01 | data_France$X.12CTP3<0.99]<-NA
data_France$X.12CTP4[data_France$X.12CTP4>7.01 | data_France$X.12CTP4<0.99]<-NA
data_France$X.12CTP5[data_France$X.12CTP5>7.01 | data_France$X.12CTP5<0.99]<-NA
data_France$X.12CTP6[data_France$X.12CTP6>7.01 | data_France$X.12CTP6<0.99]<-NA
data_France$X.12CTP7[data_France$X.12CTP7>7.01 | data_France$X.12CTP7<0.99]<-NA
data_France$X.12CTP8[data_France$X.12CTP8>7.01 | data_France$X.12CTP8<0.99]<-NA
data_France$X.12CTP9[data_France$X.12CTP9>7.01 | data_France$X.12CTP9<0.99]<-NA
data_France$X.12CTP10[data_France$X.12CTP10>7.01 | data_France$X.12CTP10<0.99]<-NA
data_France$X.12CTP11[data_France$X.12CTP11>7.01 | data_France$X.12CTP11<0.99]<-NA
data_France$X.12CTP13[data_France$X.12CTP13>7.01 | data_France$X.12CTP13<0.99]<-NA
data_France$X.12CTP14[data_France$X.12CTP14>7.01 | data_France$X.12CTP14<0.99]<-NA
data_France$X.12CTP15[data_France$X.12CTP15>7.01 | data_France$X.12CTP15<0.99]<-NA
data_France$X.13CTP1[data_France$X.13CTP1>7.01 | data_France$X.13CTP1<0.99]<-NA
data_France$X.13CTP2[data_France$X.13CTP2>7.01 | data_France$X.13CTP2<0.99]<-NA
data_France$X.13CTP3[data_France$X.13CTP3>7.01 | data_France$X.13CTP3<0.99]<-NA
data_France$X.13CTP4[data_France$X.13CTP4>7.01 | data_France$X.13CTP4<0.99]<-NA
data_France$X.13CTP5[data_France$X.13CTP5>7.01 | data_France$X.13CTP5<0.99]<-NA
data_France$X.13CTP6[data_France$X.13CTP6>7.01 | data_France$X.13CTP6<0.99]<-NA
data_France$X.13CTP7[data_France$X.13CTP7>7.01 | data_France$X.13CTP7<0.99]<-NA
data_France$X.13CTP8[data_France$X.13CTP8>7.01 | data_France$X.13CTP8<0.99]<-NA
data_France$X.13CTP9[data_France$X.13CTP9>7.01 | data_France$X.13CTP9<0.99]<-NA
data_France$X.13CTP10[data_France$X.13CTP10>7.01 | data_France$X.13CTP10<0.99]<-NA
data_France$X.13CTP11[data_France$X.13CTP11>7.01 | data_France$X.13CTP11<0.99]<-NA
data_France$X.13CTP12[data_France$X.13CTP12>7.01 | data_France$X.13CTP12<0.99]<-NA
data_France$X.13CTP14[data_France$X.13CTP14>7.01 | data_France$X.13CTP14<0.99]<-NA
data_France$X.13CTP15[data_France$X.13CTP15>7.01 | data_France$X.13CTP15<0.99]<-NA
data_France$X.14CTP1[data_France$X.14CTP1>7.01 | data_France$X.14CTP1<0.99]<-NA
data_France$X.14CTP2[data_France$X.14CTP2>7.01 | data_France$X.14CTP2<0.99]<-NA
data_France$X.14CTP3[data_France$X.14CTP3>7.01 | data_France$X.14CTP3<0.99]<-NA
data_France$X.14CTP4[data_France$X.14CTP4>7.01 | data_France$X.14CTP4<0.99]<-NA
data_France$X.14CTP5[data_France$X.14CTP5>7.01 | data_France$X.14CTP5<0.99]<-NA
data_France$X.14CTP6[data_France$X.14CTP6>7.01 | data_France$X.14CTP6<0.99]<-NA
data_France$X.14CTP7[data_France$X.14CTP7>7.01 | data_France$X.14CTP7<0.99]<-NA
data_France$X.14CTP8[data_France$X.14CTP8>7.01 | data_France$X.14CTP8<0.99]<-NA
data_France$X.14CTP9[data_France$X.14CTP9>7.01 | data_France$X.14CTP9<0.99]<-NA
data_France$X.14CTP10[data_France$X.14CTP10>7.01 | data_France$X.14CTP10<0.99]<-NA
data_France$X.14CTP11[data_France$X.14CTP11>7.01 | data_France$X.14CTP11<0.99]<-NA
data_France$X.14CTP12[data_France$X.14CTP12>7.01 | data_France$X.14CTP12<0.99]<-NA
data_France$X.14CTP13[data_France$X.14CTP13>7.01 | data_France$X.14CTP13<0.99]<-NA
data_France$X.14CTP15[data_France$X.14CTP15>7.01 | data_France$X.14CTP15<0.99]<-NA
data_France$X.15CTP1[data_France$X.15CTP1>7.01 | data_France$X.15CTP1<0.99]<-NA
data_France$X.15CTP2[data_France$X.15CTP2>7.01 | data_France$X.15CTP2<0.99]<-NA
data_France$X.15CTP3[data_France$X.15CTP3>7.01 | data_France$X.15CTP3<0.99]<-NA
data_France$X.15CTP4[data_France$X.15CTP4>7.01 | data_France$X.15CTP4<0.99]<-NA
data_France$X.15CTP5[data_France$X.15CTP5>7.01 | data_France$X.15CTP5<0.99]<-NA
data_France$X.15CTP6[data_France$X.15CTP6>7.01 | data_France$X.15CTP6<0.99]<-NA
data_France$X.15CTP7[data_France$X.15CTP7>7.01 | data_France$X.15CTP7<0.99]<-NA
data_France$X.15CTP8[data_France$X.15CTP8>7.01 | data_France$X.15CTP8<0.99]<-NA
data_France$X.15CTP9[data_France$X.15CTP9>7.01 | data_France$X.15CTP9<0.99]<-NA
data_France$X.15CTP10[data_France$X.15CTP10>7.01 | data_France$X.15CTP10<0.99]<-NA
data_France$X.15CTP11[data_France$X.15CTP11>7.01 | data_France$X.15CTP11<0.99]<-NA
data_France$X.15CTP12[data_France$X.15CTP12>7.01 | data_France$X.15CTP12<0.99]<-NA
data_France$X.15CTP13[data_France$X.15CTP13>7.01 | data_France$X.15CTP13<0.99]<-NA
data_France$X.15CTP14[data_France$X.15CTP14>7.01 | data_France$X.15CTP14<0.99]<-NA
Base R equivalent of #Cettt's answer:
## helper function to replace elements with NA
rfun <- function(x) replace(x, which(x<0.99 | x>7.01), NA)
## identify which columns need to be changed
cnm <- grep("^X.[0-9]+CTP[0-9]+", names(data_France))
for (i in cnm) {
  data_France[[i]] <- rfun(data_France[[i]])
}
You could also use lapply(), but sometimes the for loop is easier to understand and debug.
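For reference, the lapply() version is a one-liner (a sketch reusing rfun and cnm from above):
# Apply rfun to every selected column and write the results back
data_France[cnm] <- lapply(data_France[cnm], rfun)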
I would recommend the dplyr package which has the mutate_at function.
In your case you could use it like this:
library(dplyr)
data_France %>%
  as_tibble() %>%
  mutate_at(vars(matches("^X.[0-9]+CTP[0-9]+")), ~ ifelse(.x < 0.99 | .x > 7.01, NA_real_, .x))
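In dplyr 1.0.0 and later, mutate_at() is superseded by across(); a roughly equivalent call (untested sketch) would be:
library(dplyr)
data_France %>%
  mutate(across(matches("^X.[0-9]+CTP[0-9]+"),
                ~ ifelse(.x < 0.99 | .x > 7.01, NA_real_, .x)))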
# Create a vector of variable names. There may be other ways to do this, like using
# a regex or just taking the indices of the variable names (e.g., 1:225)
vars <- apply(expand.grid("X.", as.character(1:15), "CTP", as.character(1:15)),
              1, paste0, collapse = "")
for (i in vars) {
  data_France[[i]][data_France[[i]] > 7.01 | data_France[[i]] < 0.99] <- NA
}
If this is your entire data set (i.e., there are no other variables in the data), you can simply do
data_France[data_France > 7.01 | data_France < 0.99] <- NA

How can I calculate the correlation of my residuals? Spark-Scala

I need to know whether my residuals are correlated or not, and I haven't found a way to do it using Spark-Scala on Databricks.
I concluded that I should export my project to R and use the acf function there.
Does someone know a trick to do it using Spark-Scala on Databricks?
For those who need more information: I'm currently working on sales forecasting. I used a regression forest with different features, and now I need to evaluate the quality of my forecast. To check this, I read in this paper that residuals are a good way to see whether your forecasting model is good or bad. In any case, you can still improve it, but this is just to form an opinion of my forecast model and compare it with other models.
Currently, I have a dataframe like the one below. It's part of the testing/out-of-sample data. (I cast prediction and residuals to IntegerType, which is why in the 3rd row 40 - 17 = 22.)
I am using Spark 2.1.1.
You can find the correlation between columns using Spark's ML library functions.
Let's first import the required classes.
import org.apache.spark.sql.functions.corr
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.stat.Statistics
Now prepare the input DataFrame :
scala> val seqRow = Seq(
| ("2017-04-27",13,21),
| ("2017-04-26",7,16),
| ("2017-04-25",40,17),
| ("2017-04-24",17,17),
| ("2017-04-21",10,20),
| ("2017-04-20",9,19),
| ("2017-04-19",30,30),
| ("2017-04-18",18,25),
| ("2017-04-14",32,28),
| ("2017-04-13",39,18),
| ("2017-04-12",2,4),
| ("2017-04-11",8,24),
| ("2017-04-10",18,27),
| ("2017-04-07",6,17),
| ("2017-04-06",13,29),
| ("2017-04-05",10,17),
| ("2017-04-04",6,8),
| ("2017-04-03",20,32)
| )
seqRow: Seq[(String, Int, Int)] = List((2017-04-27,13,21), (2017-04-26,7,16), (2017-04-25,40,17), (2017-04-24,17,17), (2017-04-21,10,20), (2017-04-20,9,19), (2017-04-19,30,30), (2017-04-18,18,25), (2017-04-14,32,28), (2017-04-13,39,18), (2017-04-12,2,4), (2017-04-11,8,24), (2017-04-10,18,27), (2017-04-07,6,17), (2017-04-06,13,29), (2017-04-05,10,17), (2017-04-04,6,8), (2017-04-03,20,32))
scala> val rdd = sc.parallelize(seqRow)
rdd: org.apache.spark.rdd.RDD[(String, Int, Int)] = ParallelCollectionRDD[51] at parallelize at <console>:34
scala> val input_df = spark.createDataFrame(rdd).toDF("date","amount","prediction").withColumn("residuals",'amount - 'prediction)
input_df: org.apache.spark.sql.DataFrame = [date: string, amount: int ... 2 more fields]
scala> input_df.show(false)
+----------+------+----------+---------+
|date |amount|prediction|residuals|
+----------+------+----------+---------+
|2017-04-27|13 |21 |-8 |
|2017-04-26|7 |16 |-9 |
|2017-04-25|40 |17 |23 |
|2017-04-24|17 |17 |0 |
|2017-04-21|10 |20 |-10 |
|2017-04-20|9 |19 |-10 |
|2017-04-19|30 |30 |0 |
|2017-04-18|18 |25 |-7 |
|2017-04-14|32 |28 |4 |
|2017-04-13|39 |18 |21 |
|2017-04-12|2 |4 |-2 |
|2017-04-11|8 |24 |-16 |
|2017-04-10|18 |27 |-9 |
|2017-04-07|6 |17 |-11 |
|2017-04-06|13 |29 |-16 |
|2017-04-05|10 |17 |-7 |
|2017-04-04|6 |8 |-2 |
|2017-04-03|20 |32 |-12 |
+----------+------+----------+---------+
The residual values for rows 2017-04-14 and 2017-04-13 don't match yours because I am simply computing amount - prediction for the residuals.
Now let's proceed to calculate the correlation between all the columns.
This first method is useful when you have many columns and need the correlation of each column with every other one.
First we drop the column whose correlation is not to be calculated:
scala> val drop_date_df = input_df.drop('date)
drop_date_df: org.apache.spark.sql.DataFrame = [amount: int, prediction: int ... 1 more field]
scala> drop_date_df.show
+------+----------+---------+
|amount|prediction|residuals|
+------+----------+---------+
| 13| 21| -8|
| 7| 16| -9|
| 40| 17| 23|
| 17| 17| 0|
| 10| 20| -10|
| 9| 19| -10|
| 30| 30| 0|
| 18| 25| -7|
| 32| 28| 4|
| 39| 18| 21|
| 2| 4| -2|
| 8| 24| -16|
| 18| 27| -9|
| 6| 17| -11|
| 13| 29| -16|
| 10| 17| -7|
| 6| 8| -2|
| 20| 32| -12|
+------+----------+---------+
Since there are more than 2 columns, we need to compute a correlation matrix.
To calculate the correlation matrix we need an RDD[Vector], as you can see in the Spark example for correlation.
scala> val dense_rdd = drop_date_df.rdd.map{row =>
| val first = row.getAs[Integer]("amount")
| val second = row.getAs[Integer]("prediction")
| val third = row.getAs[Integer]("residuals")
| Vectors.dense(first.toDouble,second.toDouble,third.toDouble)}
dense_rdd: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MapPartitionsRDD[62] at map at <console>:40
scala> val correlMatrix: Matrix = Statistics.corr(dense_rdd, "pearson")
correlMatrix: org.apache.spark.mllib.linalg.Matrix =
1.0 0.40467032516705076 0.782939330961529
0.40467032516705076 1.0 -0.2520531290688281
0.782939330961529 -0.2520531290688281 1.0
The order of the columns remains the same, but you lose the column names.
You can find good resources on the structure of a correlation matrix.
Since you only want the correlation of the residuals with the other two columns, we can explore other options:
Hive corr UDAF
scala> drop_date_df.createOrReplaceTempView("temp_table")
scala> val corr_query_df = spark.sql("select corr(amount,residuals) as amount_residuals_corr,corr(prediction,residuals) as prediction_residual_corr from temp_table")
corr_query_df: org.apache.spark.sql.DataFrame = [amount_residuals_corr: double, prediction_residual_corr: double]
scala> corr_query_df.show
+---------------------+------------------------+
|amount_residuals_corr|prediction_residual_corr|
+---------------------+------------------------+
| 0.7829393309615287| -0.252053129068828|
+---------------------+------------------------+
Spark corr function:
scala> val corr_df = drop_date_df.select(
| corr('amount,'residuals).as("amount_residuals_corr"),
| corr('prediction,'residuals).as("prediction_residual_corr"))
corr_df: org.apache.spark.sql.DataFrame = [amount_residuals_corr: double, prediction_residual_corr: double]
scala> corr_df.show
+---------------------+------------------------+
|amount_residuals_corr|prediction_residual_corr|
+---------------------+------------------------+
| 0.7829393309615287| -0.252053129068828|
+---------------------+------------------------+

Unable to forecast linear model in R

I'm able to do forecasts with an ARIMA model, but when I try to do a forecast for a linear model, I do not get any actual forecasts - it stops at the end of the data set (which isn't useful for forecasting since I already know what's in the data set). I've found countless examples online where using this same code works just fine, but I haven't found anyone else having this same error.
library("stats")
library("forecast")
y <- data$Mfg.Shipments.Total..USA.
model_a1 <- auto.arima(y)
forecast_a1 <- forecast.Arima(model_a1, h = 12)
The above code works perfectly. However, when I try to do a linear model....
model1 <- lm(y ~ Mfg.NO.Total..USA. + Mfg.Inv.Total..USA., data = data )
f1 <- forecast.lm(model1, h = 12)
I get an error message saying that I MUST provide a new data set (which seems odd to me, since the documentation for the forecast package says that it is an optional argument).
f1 <- forecast.lm(model1, newdata = x, h = 12)
If I do this, I am able to get the function to work, but the forecast only predicts values for the existing data - it doesn't predict the next 12 periods. I have also tried using the append function to add additional rows to see if that would fix the issue, but when trying to forecast a linear model, it immediately stops at the most recent point in the time series.
Here's the data that I'm using:
+------------+---------------------------+--------------------+---------------------+
| | Mfg.Shipments.Total..USA. | Mfg.NO.Total..USA. | Mfg.Inv.Total..USA. |
+------------+---------------------------+--------------------+---------------------+
| 2110-01-01 | 3.59746e+11 | 3.58464e+11 | 5.01361e+11 |
| 2110-01-01 | 3.59746e+11 | 3.58464e+11 | 5.01361e+11 |
| 2110-02-01 | 3.62268e+11 | 3.63441e+11 | 5.10439e+11 |
| 2110-03-01 | 4.23748e+11 | 4.24527e+11 | 5.10792e+11 |
| 2110-04-01 | 4.08755e+11 | 4.02769e+11 | 5.16853e+11 |
| 2110-05-01 | 4.08187e+11 | 4.02869e+11 | 5.18180e+11 |
| 2110-06-01 | 4.27567e+11 | 4.21713e+11 | 5.15675e+11 |
| 2110-07-01 | 3.97590e+11 | 3.89916e+11 | 5.24785e+11 |
| 2110-08-01 | 4.24732e+11 | 4.16304e+11 | 5.27734e+11 |
| 2110-09-01 | 4.30974e+11 | 4.35043e+11 | 5.28797e+11 |
| 2110-10-01 | 4.24008e+11 | 4.17076e+11 | 5.38917e+11 |
| 2110-11-01 | 4.11930e+11 | 4.09440e+11 | 5.42618e+11 |
| 2110-12-01 | 4.25940e+11 | 4.34201e+11 | 5.35384e+11 |
| 2111-01-01 | 4.01629e+11 | 4.07748e+11 | 5.55057e+11 |
| 2111-02-01 | 4.06385e+11 | 4.06151e+11 | 5.66058e+11 |
| 2111-03-01 | 4.83827e+11 | 4.89904e+11 | 5.70990e+11 |
| 2111-04-01 | 4.54640e+11 | 4.46702e+11 | 5.84808e+11 |
| 2111-05-01 | 4.65124e+11 | 4.63155e+11 | 5.92456e+11 |
| 2111-06-01 | 4.83809e+11 | 4.75150e+11 | 5.86645e+11 |
| 2111-07-01 | 4.44437e+11 | 4.40452e+11 | 5.97201e+11 |
| 2111-08-01 | 4.83537e+11 | 4.79958e+11 | 5.99461e+11 |
| 2111-09-01 | 4.77130e+11 | 4.75580e+11 | 5.93065e+11 |
| 2111-10-01 | 4.69276e+11 | 4.59579e+11 | 6.03481e+11 |
| 2111-11-01 | 4.53706e+11 | 4.55029e+11 | 6.02577e+11 |
| 2111-12-01 | 4.57872e+11 | 4.81454e+11 | 5.86886e+11 |
| 2112-01-01 | 4.35834e+11 | 4.45037e+11 | 6.04042e+11 |
| 2112-02-01 | 4.55996e+11 | 4.70820e+11 | 6.12071e+11 |
| 2112-03-01 | 5.04869e+11 | 5.08818e+11 | 6.11717e+11 |
| 2112-04-01 | 4.76213e+11 | 4.70666e+11 | 6.16375e+11 |
| 2112-05-01 | 4.95789e+11 | 4.87730e+11 | 6.17639e+11 |
| 2112-06-01 | 4.91218e+11 | 4.87857e+11 | 6.09361e+11 |
| 2112-07-01 | 4.58087e+11 | 4.61037e+11 | 6.19166e+11 |
| 2112-08-01 | 4.97438e+11 | 4.74539e+11 | 6.22773e+11 |
| 2112-09-01 | 4.86994e+11 | 4.85560e+11 | 6.23067e+11 |
| 2112-10-01 | 4.96744e+11 | 4.92562e+11 | 6.26796e+11 |
| 2112-11-01 | 4.70810e+11 | 4.64944e+11 | 6.23999e+11 |
| 2112-12-01 | 4.66721e+11 | 4.88615e+11 | 6.08900e+11 |
| 2113-01-01 | 4.51585e+11 | 4.50763e+11 | 6.25881e+11 |
| 2113-02-01 | 4.56329e+11 | 4.69574e+11 | 6.33157e+11 |
| 2113-03-01 | 5.04023e+11 | 4.92978e+11 | 6.31055e+11 |
| 2113-04-01 | 4.84798e+11 | 4.76750e+11 | 6.35643e+11 |
| 2113-05-01 | 5.04478e+11 | 5.04488e+11 | 6.34376e+11 |
| 2113-06-01 | 4.99043e+11 | 5.13760e+11 | 6.25715e+11 |
| 2113-07-01 | 4.75700e+11 | 4.69012e+11 | 6.34892e+11 |
| 2113-08-01 | 5.05244e+11 | 4.90404e+11 | 6.37735e+11 |
| 2113-09-01 | 5.00087e+11 | 5.04849e+11 | 6.34665e+11 |
| 2113-10-01 | 5.05965e+11 | 4.99682e+11 | 6.38945e+11 |
| 2113-11-01 | 4.78876e+11 | 4.80784e+11 | 6.34442e+11 |
| 2113-12-01 | 4.80640e+11 | 4.98807e+11 | 6.19458e+11 |
| 2114-01-01 | 4.56779e+11 | 4.57684e+11 | 6.36568e+11 |
| 2114-02-01 | 4.62195e+11 | 4.70312e+11 | 6.48982e+11 |
| 2114-03-01 | 5.19472e+11 | 5.25900e+11 | 6.47038e+11 |
| 2114-04-01 | 5.04217e+11 | 5.06090e+11 | 6.52612e+11 |
| 2114-05-01 | 5.14186e+11 | 5.11149e+11 | 6.58990e+11 |
| 2114-06-01 | 5.25249e+11 | 5.33247e+11 | 6.49512e+11 |
| 2114-07-01 | 4.99198e+11 | 5.52506e+11 | 6.57645e+11 |
| 2114-08-01 | 5.17184e+11 | 5.07622e+11 | 6.59281e+11 |
| 2114-09-01 | 5.23682e+11 | 5.24051e+11 | 6.55582e+11 |
| 2114-10-01 | 5.17305e+11 | 5.09549e+11 | 6.59237e+11 |
| 2114-11-01 | 4.71921e+11 | 4.70093e+11 | 6.57044e+11 |
| 2114-12-01 | 4.84948e+11 | 4.86804e+11 | 6.34120e+11 |
+------------+---------------------------+--------------------+---------------------+
Edit - Here's the code I used for adding new datapoints for forecasting.
library(xts)
library(mondate)
d <- as.mondate("2115-01-01")
d11 <- d + 11
seq(d, d11)
newdates <- seq(d, d11)
new_xts <- xts(order.by = as.Date(newdates))
new_xts$Mfg.Shipments.Total..USA. <- NA
new_xts$Mfg.NO.Total..USA. <- NA
new_xts$Mfg.Inv.Total..USA. <- NA
x <- append(data, new_xts)
Not sure if you ever figured this out, but just in case I thought I'd point out what's going wrong.
The documentation for forecast.lm says:
An optional data frame in which to look for variables with which to predict. If omitted, it is assumed that the only variables are trend and season, and h forecasts are produced.
so it's optional if trend and season are your only predictors.
The ARIMA model works because it's using lagged values of the time series in the forecast. For the linear model, it uses the given predictors (Mfg.NO.Total..USA. and Mfg.Inv.Total..USA. in your case) and thus needs their corresponding future values; without these, there are no independent variables to predict from.
In the edit, you added those variables to your future dataset, but they still have values of NA for all future points, thus the forecasts are also NA.
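To illustrate, here is a minimal sketch of a usable newdata, assuming you simply carry the last observed predictor values forward for 12 months (in practice you would want genuine forecasts of the predictors instead):
library(forecast)
# Hypothetical future predictor values: the last observed values repeated 12 times
future_x <- data.frame(
  Mfg.NO.Total..USA.  = rep(tail(data$Mfg.NO.Total..USA., 1), 12),
  Mfg.Inv.Total..USA. = rep(tail(data$Mfg.Inv.Total..USA., 1), 12)
)
f1 <- forecast(model1, newdata = future_x)  # horizon is taken from nrow(future_x)
plot(f1)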
Gabe is correct. You need future values of your causals.
You should consider the Transfer Function modeling process instead of regression (which was developed for use with cross-sectional data). By prewhitening your X variables (i.e. building a model for each one), you can calculate the cross-correlation function to see any lead or lag relationship.
It is very apparent from the standardized graph of Y and the two X's that Inv.Total is a lead variable (b**-1). When Inv.Total moves down, so do shipments. In addition, there is also an AR seasonal component beyond the causals that is driving the data. There are a few outliers as well, so this is a robust solution. I am the developer of the software used here, but this can be run in any tool.
