Is there a way to take the CSV file as it is? Can we stop Spark from giving default names to the empty columns? - opencsv

I wanted to delete the empty column from the dataframe
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Test_Parquet").master("local[*]").getOrCreate()
names = spark.read.csv("name.csv", header="true", inferSchema="true")
names.show()
This is the dataframe created from the name.csv file:
+-------+---+---+---+-----+----+
| Name| 1|Age| 3|Class| _c5|
+-------+---+---+---+-----+----+
|Diwakar| | 25| | 12|null|
|Prabhat| | 27| | 15|null|
| Zyan| | 30| | 17|null|
| Jack| | 35| | 21|null|
+-------+---+---+---+-----+----+
Spark by default named the empty columns 1, 3, and _c5. Can we stop Spark from giving default names to these columns?
I wanted to have a dataframe like the one given below:
+-------+---+---+---+-----+----+
| Name| |Age| |Class| |
+-------+---+---+---+-----+----+
|Diwakar| | 25| | 12|null|
|Prabhat| | 27| | 15|null|
| Zyan| | 30| | 17|null|
| Jack| | 35| | 21|null|
+-------+---+---+---+-----+----+
and I wanted to remove the empty columns in one go, like:
temp = list(set(names.columns))
temp.remove(" ")
names = names.select(temp)
names.show()
+-------+---+-----+
| Name|Age|Class|
+-------+---+-----+
|Diwakar| 25| 12|
|Prabhat| 27| 15|
| Zyan| 30| 17|
| Jack| 35| 21|
+-------+---+-----+
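One workaround, sketched below (assuming the unwanted columns are exactly those with blank, purely numeric, or _c-prefixed headers), is to drop them right after reading rather than trying to change how Spark names them:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Test_Parquet").master("local[*]").getOrCreate()
names = spark.read.csv("name.csv", header=True, inferSchema=True)
# Columns whose header is blank or auto-generated (e.g. "1", "3", "_c5")
junk = [c for c in names.columns if c.strip() == "" or c.strip().isdigit() or c.startswith("_c")]
names = names.drop(*junk)
names.show()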

Related

How to create a variable based on character and number iteration in R?

I'm trying to create a dummy variable based on a character-type variable.
For example, I need to create a "newcat" variable covering the codes "I00" to "I99".
In the code I wrote, I spelled out all the codes from I00 to I99 by hand.
But is there any way to make this code more efficient, with a loop that iterates the number after the letter?
Thank you in advance!!
mort <- mort %>%
mutate(newcat = ifelse(ucod=="I00" |
ucod=="I01" | ucod=="I02" | ucod=="I03" | ucod=="I04" | ucod=="I05" |
ucod=="I06" | ucod=="I07" | ucod=="I08" | ucod=="I09" | ucod=="I10" |
ucod=="I11" | ucod=="I12" | ucod=="I13" | ucod=="I14" | ucod=="I15" |
ucod=="I16" | ucod=="I17" | ucod=="I18" | ucod=="I19" | ucod=="I20" |
ucod=="I21" | ucod=="I22" | ucod=="I23" | ucod=="I24" | ucod=="I25" |
ucod=="I26" | ucod=="I27" | ucod=="I28" | ucod=="I29" | ucod=="I30" |
ucod=="I31" | ucod=="I32" | ucod=="I33" | ucod=="I34" | ucod=="I35" |
ucod=="I36" | ucod=="I37" | ucod=="I38" | ucod=="I39" | ucod=="I40" |
ucod=="I41" | ucod=="I42" | ucod=="I43" | ucod=="I44" | ucod=="I45" |
ucod=="I46" | ucod=="I47" | ucod=="I48" | ucod=="I49" | ucod=="I50" |
ucod=="I51" | ucod=="I52" | ucod=="I53" | ucod=="I54" | ucod=="I55" |
ucod=="I56" | ucod=="I57" | ucod=="I58" | ucod=="I59" | ucod=="I60" |
ucod=="I61" | ucod=="I62" | ucod=="I63" | ucod=="I64" | ucod=="I65" |
ucod=="I66" | ucod=="I67" | ucod=="I68" | ucod=="I69" | ucod=="I70" |
ucod=="I71" | ucod=="I72" | ucod=="I73" | ucod=="I74" | ucod=="I75" |
ucod=="I76" | ucod=="I77" | ucod=="I78" | ucod=="I79" | ucod=="I80" |
ucod=="I81" | ucod=="I82" | ucod=="I83" | ucod=="I84" | ucod=="I85" |
ucod=="I86" | ucod=="I87" | ucod=="I88" | ucod=="I89" | ucod=="I90" |
ucod=="I91" | ucod=="I92" | ucod=="I93" | ucod=="I94" | ucod=="I95" |
ucod=="I96" | ucod=="I97" | ucod=="I98" | ucod=="I99", 1, 0))
Try %in% instead of == with |:
x <- c(paste0("I0", 0:9), paste0("I", 10:99))
mort %>%
  mutate(newcat = ifelse(ucod %in% x, 1, 0))
Another option is to use regex:
library(stringr)
mort <- mort %>%
  mutate(newcat = +str_detect(ucod, '^I[0-9]{2}$'))
where ^ is a metacharacter that marks the beginning of the string. Then I[0-9]{2} matches the letter I followed by any two digits 0-9. Finally, $ is another metacharacter that marks the end of the string. So a matching string must start with I, be followed by exactly two digits, and then end. Any string that does not match the pattern is flagged as FALSE.
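For comparison, here is a hedged sketch of the same two ideas (set membership and an anchored regex) in Python/pandas; the toy data frame is made up for illustration:
import pandas as pd
mort = pd.DataFrame({"ucod": ["I00", "I57", "J20", "I9", "I99"]})       # made-up sample codes
codes = [f"I{i:02d}" for i in range(100)]                               # "I00" ... "I99"
mort["newcat_in"] = mort["ucod"].isin(codes).astype(int)                # membership, like %in%
mort["newcat_re"] = mort["ucod"].str.match(r"^I[0-9]{2}$").astype(int)  # anchored regex
print(mort)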

How to implement PySpark StandardScaler on subset of columns?

I want to use pyspark StandardScaler on 6 out of 10 columns in my dataframe. This will be part of a pipeline.
The inputCol parameter seems to expect a vector, which I can pass in after using VectorAssembler on all my features, but this scales all 10 features. I don’t want to scale the other 4 features because they are binary and I want unstandardized coefficients for them.
Am I supposed to use vector assembler on the 6 features, scale them, then use vector assembler again on this scaled features vector and the remaining 4 features? I would end up with a vector within a vector and I’m not sure this will work.
What’s the right way to do this? An example is appreciated.
You can do this by using VectorAssembler. The key is that you have to extract the scaled columns from the assembler output afterwards. See the code below for a working example:
from pyspark.ml.feature import MinMaxScaler, StandardScaler
from pyspark.ml.feature import VectorAssembler
import pandas as pd
import numpy as np
import random
df = pd.DataFrame()
df['a'] = random.sample(range(100), 10)
df['b'] = random.sample(range(100), 10)
df['c'] = random.sample(range(100), 10)
df['d'] = random.sample(range(100), 10)
df['e'] = random.sample(range(100), 10)
sdf = spark.createDataFrame(df)  # assumes an active SparkSession named spark
sdf.show()
+---+---+---+---+---+
| a| b| c| d| e|
+---+---+---+---+---+
| 51| 13| 6| 5| 26|
| 18| 29| 19| 81| 28|
| 34| 1| 36| 57| 87|
| 56| 86| 51| 52| 48|
| 36| 49| 33| 15| 54|
| 87| 53| 47| 89| 85|
| 7| 14| 55| 13| 98|
| 70| 50| 32| 39| 58|
| 80| 20| 25| 54| 37|
| 40| 33| 44| 83| 27|
+---+---+---+---+---+
cols_to_scale = ['c', 'd', 'e']
cols_to_keep_unscaled = ['a', 'b']
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
assembler = VectorAssembler().setInputCols(cols_to_scale).setOutputCol("features")
sdf_transformed = assembler.transform(sdf)
scaler_model = scaler.fit(sdf_transformed.select("features"))
sdf_scaled = scaler_model.transform(sdf_transformed)
sdf_scaled.show()
+---+---+---+---+---+----------------+--------------------+
| a| b| c| d| e| features| scaledFeatures|
+---+---+---+---+---+----------------+--------------------+
| 51| 13| 6| 5| 26| [6.0,5.0,26.0]|[0.39358015146628...|
| 18| 29| 19| 81| 28|[19.0,81.0,28.0]|[1.24633714630991...|
| 34| 1| 36| 57| 87|[36.0,57.0,87.0]|[2.36148090879773...|
| 56| 86| 51| 52| 48|[51.0,52.0,48.0]|[3.34543128746345...|
| 36| 49| 33| 15| 54|[33.0,15.0,54.0]|[2.16469083306459...|
| 87| 53| 47| 89| 85|[47.0,89.0,85.0]|[3.08304451981926...|
| 7| 14| 55| 13| 98|[55.0,13.0,98.0]|[3.60781805510765...|
| 70| 50| 32| 39| 58|[32.0,39.0,58.0]|[2.09909414115354...|
| 80| 20| 25| 54| 37|[25.0,54.0,37.0]|[1.63991729777620...|
| 40| 33| 44| 83| 27|[44.0,83.0,27.0]|[2.88625444408612...|
+---+---+---+---+---+----------------+--------------------+
# Helper to flatten the scaled vector back into one column per scaled feature
def extract(row):
    return (row.a, row.b,) + tuple(row.scaledFeatures.toArray().tolist())
sdf_scaled = sdf_scaled.select(*cols_to_keep_unscaled, "scaledFeatures").rdd \
    .map(extract).toDF(cols_to_keep_unscaled + cols_to_scale)
sdf_scaled.show()
+---+---+------------------+-------------------+------------------+
|  a|  b|                 c|                  d|                 e|
+---+---+------------------+-------------------+------------------+
| 51| 13|0.3935801514662892|0.16399957083190683|0.9667572801316145|
| 18| 29| 1.246337146309916| 2.656793047476891|1.0411232247571234|
| 34| 1|2.3614809087977355| 1.8695951074837378|3.2349185912096337|
| 56| 86|3.3454312874634584| 1.7055955366518312|1.7847826710122114|
| 36| 49| 2.164690833064591|0.49199871249572047| 2.007880504888738|
| 87| 53| 3.083044519819266| 2.9191923608079415|3.1605526465841245|
| 7| 14|3.6078180551076513| 0.4263988841629578| 3.643931286649932|
| 70| 50|2.0990941411535426| 1.2791966524888734|2.1566123941397555|
| 80| 20| 1.639917297776205| 1.7711953649845937| 1.375769975571913|
| 40| 33|2.8862544440861213| 2.7223928758096534| 1.003940252444369|
+---+---+------------------+-------------------+------------------+
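Since the question mentions a pipeline, here is a hedged sketch of how the same assembler-plus-scaler pair could be chained in a pyspark.ml Pipeline, reusing the cols_to_scale / cols_to_keep_unscaled split from above (the column names are just the ones from this toy example):
from pyspark.ml import Pipeline
from pyspark.ml.feature import StandardScaler, VectorAssembler
# Assemble only the columns to scale, scale them, then assemble the scaled
# vector together with the untouched columns into a final feature vector.
assembler_scaled = VectorAssembler(inputCols=cols_to_scale, outputCol="to_scale")
scaler = StandardScaler(inputCol="to_scale", outputCol="scaled")
assembler_final = VectorAssembler(inputCols=["scaled"] + cols_to_keep_unscaled, outputCol="features")
pipeline = Pipeline(stages=[assembler_scaled, scaler, assembler_final])
model = pipeline.fit(sdf)
model.transform(sdf).select("features").show(truncate=False)
VectorAssembler accepts vector-typed input columns and flattens them, so the final features column is one flat vector rather than a vector nested inside another.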

R: Regex to match more than one pipe occurrence

I have a dataset in which I paste values in a dplyr chain and collapse with the pipe character (e.g. " | "). If any of the values in the dataset are blank, I just get recurring pipe characters in the pasted list.
Some of the values look like this, for example:
badstring = "| | | | | | GHOULSBY,SCROGGINS | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CAT,JOHNSON | | | | | | | | | | | | BURGLAR,PALA | | | | | | | | |"
I want to match all the pipes that occur more than once and delete them, so that just the names appear like so:
correctstring = "| GHOULSBY,SCROGGINS | CAT,JOHNSON | |BURGLAR,PALA |"
I tried the following, but to no avail:
mutate(names = gsub('[\\|]{2,}', '', name_list))
The difficulty in this question is in formulating a regex which can selectively remove every pipe, except the ones we want to remain as actual separators between terms. We can match on the following pattern:
\|\s+(?=\|)
and then replace with an empty string. This pattern removes any pipe (and any following whitespace) as long as what follows is another pipe. A removal does not occur when a pipe is followed by an actual term, or when it is followed by the end of the string.
badstring = "| | | | | | GHOULSBY,SCROGGINS | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CAT,JOHNSON | | | | | | | | | | | | BURGLAR,PALA | | | | | | | | |"
result <- gsub("\\|\\s+(?=\\|)", "", badstring, perl=TRUE)
result
[1] "| GHOULSBY,SCROGGINS | CAT,JOHNSON | BURGLAR,PALA |"
Edit:
If you expect to have inputs like | | | which are devoid of any terms, and you would expect empty string as the output, then my solution would fail. I don't see an obvious way to modify the above regex, but you can handle this case with one more call to sub:
result <- sub("^\\|$", "", result)
We also might be able to modify the original pattern to use an alternation covering all cases:
result <- gsub("\\|\\s+(?=\\|)|(?:^\\|$)", "", badstring, perl=TRUE)
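The same pattern carries over to other regex engines that support lookahead; for instance, a hedged Python sketch using a shortened, made-up string:
import re
badstring = "| | | | GHOULSBY,SCROGGINS | | | | CAT,JOHNSON | | | BURGLAR,PALA | | |"
result = re.sub(r"\|\s+(?=\|)", "", badstring)
print(result)  # "| GHOULSBY,SCROGGINS | CAT,JOHNSON | BURGLAR,PALA |"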

How can I calculate the correlation of my residuals? Spark-Scala

I need to know whether my residuals are correlated or not. I didn't find a way to do it using Spark-Scala on Databricks.
I concluded that I should export my project to R and use the acf function.
Does someone know a trick to do it using Spark-Scala on Databricks?
For those who need more information: I'm currently working on sales forecasting. I used a regression forest with different features, and now I need to evaluate the quality of my forecast. To check this, I read in this paper that residuals are a good way to see whether your forecasting model is good or bad. In any case, you can still improve it, but it's just to form my own opinion on my forecast model and compare it to other models.
Currently, I have a dataframe like the one below. It's part of the testing / out-of-sample data. (I cast prediction and residuals to IntegerType, which is why, in the 3rd row, 40 - 17 = 22.)
I am using Spark 2.1.1.
You can find the correlation between columns using the Spark ML library functions.
Let's first import the classes:
import org.apache.spark.sql.functions.corr
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.stat.Statistics
Now prepare the input DataFrame:
scala> val seqRow = Seq(
| ("2017-04-27",13,21),
| ("2017-04-26",7,16),
| ("2017-04-25",40,17),
| ("2017-04-24",17,17),
| ("2017-04-21",10,20),
| ("2017-04-20",9,19),
| ("2017-04-19",30,30),
| ("2017-04-18",18,25),
| ("2017-04-14",32,28),
| ("2017-04-13",39,18),
| ("2017-04-12",2,4),
| ("2017-04-11",8,24),
| ("2017-04-10",18,27),
| ("2017-04-07",6,17),
| ("2017-04-06",13,29),
| ("2017-04-05",10,17),
| ("2017-04-04",6,8),
| ("2017-04-03",20,32)
| )
seqRow: Seq[(String, Int, Int)] = List((2017-04-27,13,21), (2017-04-26,7,16), (2017-04-25,40,17), (2017-04-24,17,17), (2017-04-21,10,20), (2017-04-20,9,19), (2017-04-19,30,30), (2017-04-18,18,25), (2017-04-14,32,28), (2017-04-13,39,18), (2017-04-12,2,4), (2017-04-11,8,24), (2017-04-10,18,27), (2017-04-07,6,17), (2017-04-06,13,29), (2017-04-05,10,17), (2017-04-04,6,8), (2017-04-03,20,32))
scala> val rdd = sc.parallelize(seqRow)
rdd: org.apache.spark.rdd.RDD[(String, Int, Int)] = ParallelCollectionRDD[51] at parallelize at <console>:34
scala> val input_df = spark.createDataFrame(rdd).toDF("date","amount","prediction").withColumn("residuals",'amount - 'prediction)
input_df: org.apache.spark.sql.DataFrame = [date: string, amount: int ... 2 more fields]
scala> input_df.show(false)
+----------+------+----------+---------+
|date |amount|prediction|residuals|
+----------+------+----------+---------+
|2017-04-27|13 |21 |-8 |
|2017-04-26|7 |16 |-9 |
|2017-04-25|40 |17 |23 |
|2017-04-24|17 |17 |0 |
|2017-04-21|10 |20 |-10 |
|2017-04-20|9 |19 |-10 |
|2017-04-19|30 |30 |0 |
|2017-04-18|18 |25 |-7 |
|2017-04-14|32 |28 |4 |
|2017-04-13|39 |18 |21 |
|2017-04-12|2 |4 |-2 |
|2017-04-11|8 |24 |-16 |
|2017-04-10|18 |27 |-9 |
|2017-04-07|6 |17 |-11 |
|2017-04-06|13 |29 |-16 |
|2017-04-05|10 |17 |-7 |
|2017-04-04|6 |8 |-2 |
|2017-04-03|20 |32 |-12 |
+----------+------+----------+---------+
The residuals for the rows 2017-04-14 and 2017-04-13 don't match yours, as I am computing residuals as amount - prediction.
Now let's proceed to calculate the correlation between all the columns.
This method is useful when you have many columns and need the correlation of each column with every other one.
First we drop the column whose correlation is not to be calculated:
scala> val drop_date_df = input_df.drop('date)
drop_date_df: org.apache.spark.sql.DataFrame = [amount: int, prediction: int ... 1 more field]
scala> drop_date_df.show
+------+----------+---------+
|amount|prediction|residuals|
+------+----------+---------+
| 13| 21| -8|
| 7| 16| -9|
| 40| 17| 23|
| 17| 17| 0|
| 10| 20| -10|
| 9| 19| -10|
| 30| 30| 0|
| 18| 25| -7|
| 32| 28| 4|
| 39| 18| 21|
| 2| 4| -2|
| 8| 24| -16|
| 18| 27| -9|
| 6| 17| -11|
| 13| 29| -16|
| 10| 17| -7|
| 6| 8| -2|
| 20| 32| -12|
+------+----------+---------+
Since there are more than 2 columns, we need to compute a correlation matrix.
To calculate the correlation matrix we need an RDD[Vector], as you can see in the Spark example for correlation.
scala> val dense_rdd = drop_date_df.rdd.map{row =>
| val first = row.getAs[Integer]("amount")
| val second = row.getAs[Integer]("prediction")
| val third = row.getAs[Integer]("residuals")
| Vectors.dense(first.toDouble,second.toDouble,third.toDouble)}
dense_rdd: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MapPartitionsRDD[62] at map at <console>:40
scala> val correlMatrix: Matrix = Statistics.corr(dense_rdd, "pearson")
correlMatrix: org.apache.spark.mllib.linalg.Matrix =
1.0 0.40467032516705076 0.782939330961529
0.40467032516705076 1.0 -0.2520531290688281
0.782939330961529 -0.2520531290688281 1.0
The order of the columns remains the same, but you lose the column names.
You can find good resources about the structure of a correlation matrix.
Since you only want the correlation of the residuals with the other two columns, we can explore other options:
Hive corr UDAF
scala> drop_date_df.createOrReplaceTempView("temp_table")
scala> val corr_query_df = spark.sql("select corr(amount,residuals) as amount_residuals_corr,corr(prediction,residuals) as prediction_residual_corr from temp_table")
corr_query_df: org.apache.spark.sql.DataFrame = [amount_residuals_corr: double, prediction_residual_corr: double]
scala> corr_query_df.show
+---------------------+------------------------+
|amount_residuals_corr|prediction_residual_corr|
+---------------------+------------------------+
| 0.7829393309615287| -0.252053129068828|
+---------------------+------------------------+
Spark corr function:
scala> val corr_df = drop_date_df.select(
| corr('amount,'residuals).as("amount_residuals_corr"),
| corr('prediction,'residuals).as("prediction_residual_corr"))
corr_df: org.apache.spark.sql.DataFrame = [amount_residuals_corr: double, prediction_residual_corr: double]
scala> corr_df.show
+---------------------+------------------------+
|amount_residuals_corr|prediction_residual_corr|
+---------------------+------------------------+
| 0.7829393309615287| -0.252053129068828|
+---------------------+------------------------+
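For reference, the same pairwise correlations can also be computed from PySpark; a hedged sketch, assuming input_df is the equivalent Python DataFrame with the same columns:
from pyspark.sql import functions as F
# Pairwise Pearson correlations, equivalent to the Scala corr calls above
input_df.select(
    F.corr("amount", "residuals").alias("amount_residuals_corr"),
    F.corr("prediction", "residuals").alias("prediction_residual_corr"),
).show()
print(input_df.stat.corr("amount", "residuals"))  # or via DataFrameStatFunctions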

Unable to forecast linear model in R

I'm able to do forecasts with an ARIMA model, but when I try to do a forecast for a linear model, I do not get any actual forecasts - it stops at the end of the data set (which isn't useful for forecasting since I already know what's in the data set). I've found countless examples online where using this same code works just fine, but I haven't found anyone else having this same error.
library("stats")
library("forecast")
y <- data$Mfg.Shipments.Total..USA.
model_a1 <- auto.arima(y)
forecast_a1 <- forecast.Arima(model_a1, h = 12)
The above code works perfectly. However, when I try to do a linear model....
model1 <- lm(y ~ Mfg.NO.Total..USA. + Mfg.Inv.Total..USA., data = data )
f1 <- forecast.lm(model1, h = 12)
I get an error message saying that I MUST provide a new data set (which seems odd to me, since the documentation for the forecast package says that it is an optional argument).
f1 <- forecast.lm(model1, newdata = x, h = 12)
If I do this, I am able to get the function to work, but the forecast only predicts values for the existing data - it doesn't predict the next 12 periods. I have also tried using the append function to add additional rows to see if that would fix the issue, but when trying to forecast a linear model, it immediately stops at the most recent point in the time series.
Here's the data that I'm using:
+------------+---------------------------+--------------------+---------------------+
| | Mfg.Shipments.Total..USA. | Mfg.NO.Total..USA. | Mfg.Inv.Total..USA. |
+------------+---------------------------+--------------------+---------------------+
| 2110-01-01 | 3.59746e+11 | 3.58464e+11 | 5.01361e+11 |
| 2110-01-01 | 3.59746e+11 | 3.58464e+11 | 5.01361e+11 |
| 2110-02-01 | 3.62268e+11 | 3.63441e+11 | 5.10439e+11 |
| 2110-03-01 | 4.23748e+11 | 4.24527e+11 | 5.10792e+11 |
| 2110-04-01 | 4.08755e+11 | 4.02769e+11 | 5.16853e+11 |
| 2110-05-01 | 4.08187e+11 | 4.02869e+11 | 5.18180e+11 |
| 2110-06-01 | 4.27567e+11 | 4.21713e+11 | 5.15675e+11 |
| 2110-07-01 | 3.97590e+11 | 3.89916e+11 | 5.24785e+11 |
| 2110-08-01 | 4.24732e+11 | 4.16304e+11 | 5.27734e+11 |
| 2110-09-01 | 4.30974e+11 | 4.35043e+11 | 5.28797e+11 |
| 2110-10-01 | 4.24008e+11 | 4.17076e+11 | 5.38917e+11 |
| 2110-11-01 | 4.11930e+11 | 4.09440e+11 | 5.42618e+11 |
| 2110-12-01 | 4.25940e+11 | 4.34201e+11 | 5.35384e+11 |
| 2111-01-01 | 4.01629e+11 | 4.07748e+11 | 5.55057e+11 |
| 2111-02-01 | 4.06385e+11 | 4.06151e+11 | 5.66058e+11 |
| 2111-03-01 | 4.83827e+11 | 4.89904e+11 | 5.70990e+11 |
| 2111-04-01 | 4.54640e+11 | 4.46702e+11 | 5.84808e+11 |
| 2111-05-01 | 4.65124e+11 | 4.63155e+11 | 5.92456e+11 |
| 2111-06-01 | 4.83809e+11 | 4.75150e+11 | 5.86645e+11 |
| 2111-07-01 | 4.44437e+11 | 4.40452e+11 | 5.97201e+11 |
| 2111-08-01 | 4.83537e+11 | 4.79958e+11 | 5.99461e+11 |
| 2111-09-01 | 4.77130e+11 | 4.75580e+11 | 5.93065e+11 |
| 2111-10-01 | 4.69276e+11 | 4.59579e+11 | 6.03481e+11 |
| 2111-11-01 | 4.53706e+11 | 4.55029e+11 | 6.02577e+11 |
| 2111-12-01 | 4.57872e+11 | 4.81454e+11 | 5.86886e+11 |
| 2112-01-01 | 4.35834e+11 | 4.45037e+11 | 6.04042e+11 |
| 2112-02-01 | 4.55996e+11 | 4.70820e+11 | 6.12071e+11 |
| 2112-03-01 | 5.04869e+11 | 5.08818e+11 | 6.11717e+11 |
| 2112-04-01 | 4.76213e+11 | 4.70666e+11 | 6.16375e+11 |
| 2112-05-01 | 4.95789e+11 | 4.87730e+11 | 6.17639e+11 |
| 2112-06-01 | 4.91218e+11 | 4.87857e+11 | 6.09361e+11 |
| 2112-07-01 | 4.58087e+11 | 4.61037e+11 | 6.19166e+11 |
| 2112-08-01 | 4.97438e+11 | 4.74539e+11 | 6.22773e+11 |
| 2112-09-01 | 4.86994e+11 | 4.85560e+11 | 6.23067e+11 |
| 2112-10-01 | 4.96744e+11 | 4.92562e+11 | 6.26796e+11 |
| 2112-11-01 | 4.70810e+11 | 4.64944e+11 | 6.23999e+11 |
| 2112-12-01 | 4.66721e+11 | 4.88615e+11 | 6.08900e+11 |
| 2113-01-01 | 4.51585e+11 | 4.50763e+11 | 6.25881e+11 |
| 2113-02-01 | 4.56329e+11 | 4.69574e+11 | 6.33157e+11 |
| 2113-03-01 | 5.04023e+11 | 4.92978e+11 | 6.31055e+11 |
| 2113-04-01 | 4.84798e+11 | 4.76750e+11 | 6.35643e+11 |
| 2113-05-01 | 5.04478e+11 | 5.04488e+11 | 6.34376e+11 |
| 2113-06-01 | 4.99043e+11 | 5.13760e+11 | 6.25715e+11 |
| 2113-07-01 | 4.75700e+11 | 4.69012e+11 | 6.34892e+11 |
| 2113-08-01 | 5.05244e+11 | 4.90404e+11 | 6.37735e+11 |
| 2113-09-01 | 5.00087e+11 | 5.04849e+11 | 6.34665e+11 |
| 2113-10-01 | 5.05965e+11 | 4.99682e+11 | 6.38945e+11 |
| 2113-11-01 | 4.78876e+11 | 4.80784e+11 | 6.34442e+11 |
| 2113-12-01 | 4.80640e+11 | 4.98807e+11 | 6.19458e+11 |
| 2114-01-01 | 4.56779e+11 | 4.57684e+11 | 6.36568e+11 |
| 2114-02-01 | 4.62195e+11 | 4.70312e+11 | 6.48982e+11 |
| 2114-03-01 | 5.19472e+11 | 5.25900e+11 | 6.47038e+11 |
| 2114-04-01 | 5.04217e+11 | 5.06090e+11 | 6.52612e+11 |
| 2114-05-01 | 5.14186e+11 | 5.11149e+11 | 6.58990e+11 |
| 2114-06-01 | 5.25249e+11 | 5.33247e+11 | 6.49512e+11 |
| 2114-07-01 | 4.99198e+11 | 5.52506e+11 | 6.57645e+11 |
| 2114-08-01 | 5.17184e+11 | 5.07622e+11 | 6.59281e+11 |
| 2114-09-01 | 5.23682e+11 | 5.24051e+11 | 6.55582e+11 |
| 2114-10-01 | 5.17305e+11 | 5.09549e+11 | 6.59237e+11 |
| 2114-11-01 | 4.71921e+11 | 4.70093e+11 | 6.57044e+11 |
| 2114-12-01 | 4.84948e+11 | 4.86804e+11 | 6.34120e+11 |
+------------+---------------------------+--------------------+---------------------+
Edit - Here's the code I used for adding new datapoints for forecasting.
library(xts)
library(mondate)
d <- as.mondate("2115-01-01")
d11 <- d + 11
seq(d, d11)
newdates <- seq(d, d11)
new_xts <- xts(order.by = as.Date(newdates))
new_xts$Mfg.Shipments.Total..USA. <- NA
new_xts$Mfg.NO.Total..USA. <- NA
new_xts$Mfg.Inv.Total..USA. <- NA
x <- append(data, new_xts)
Not sure if you ever figured this out, but just in case I thought I'd point out what's going wrong.
The documentation for forecast.lm says:
An optional data frame in which to look for variables with which to predict. If omitted, it is assumed that the only variables are trend and season, and h forecasts are produced.
so it's optional if trend and season are your only predictors.
The ARIMA model works because it's using lagged values of the time series in the forecast. For the linear model, it uses the given predictors (Mfg.NO.Total..USA. and Mfg.Inv.Total..USA. in your case) and thus needs their corresponding future values; without these, there are no independent variables to predict from.
In the edit, you added those variables to your future dataset, but they still have values of NA for all future points, thus the forecasts are also NA.
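To make that concrete outside R, here is a hedged Python/statsmodels sketch of the same situation (all names and numbers are made up): a linear model on external predictors can only forecast if you also supply future values of those predictors.
import numpy as np
import pandas as pd
import statsmodels.api as sm
rng = np.random.default_rng(0)
# Made-up history: shipments driven by new orders and inventories
hist = pd.DataFrame({"orders": rng.uniform(100, 200, 24),
                     "inventory": rng.uniform(500, 600, 24)})
hist["shipments"] = 0.8 * hist["orders"] + 0.1 * hist["inventory"] + rng.normal(0, 5, 24)
X = sm.add_constant(hist[["orders", "inventory"]])
model = sm.OLS(hist["shipments"], X).fit()
# Forecasting ahead requires *future* predictor values; if these were NA,
# the predictions would be NA too, exactly as in the question.
future = pd.DataFrame({"orders": [160.0, 165.0, 170.0],
                       "inventory": [560.0, 565.0, 570.0]})
print(model.predict(sm.add_constant(future)))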
Gabe is correct. You need future values of your causals.
You should consider the transfer function modeling process instead of regression (which was developed for cross-sectional data). By prewhitening your X variables (i.e. building a model for each one), you can calculate the cross-correlation function to see any lead or lag relationship.
It is very apparent from the standardized graph of Y and the two X's that Inv.Total is a lead variable (b**-1): when Inv.Total moves down, so do shipments. In addition, there is an AR seasonal component beyond the causals that is driving the data. There are a few outliers as well, so this is a robust solution. I am the developer of the software used here, but this can be run in any tool.
