How to form linear model from two data frames? - r

MarriageLicen
Year Month Amount
1 2011 Jan 742
2 2011 Feb 796
3 2011 Mar 1210
4 2011 Apr 1376
BusinessLicen
Month Year MARRIAGE_LICENSES
1 Jan 2011 754
2 Feb 2011 2706
3 Mar 2011 2689
4 Apr 2011 738
My question is: how can we predict the number of Marriage Licenses (Y) issued by the city using the number of Business Licenses (X)?
And how can we join the two datasets together?
It says that you can join them using the combined key of Month and Year.
I have been struggling with this question for several days.

There are three options here.
The first is to just be direct. I'm going to assume you have the labels swapped around for the data frames in your example (it doesn't make a whole lot of sense to have a MARRIAGE_LICENSES variable in the BusinessLicen data frame, if I'm following what you are trying to do).
You can model the relationship between those two variables with:
my.model <- lm(MarriageLicen$MARRIAGE_LICENSES ~ BusinessLicen$Amount)
The second (not very rational) option would be to make a new data frame explicitly, since it looks like you have an exact match on each of your rows:
new.df <- data.frame(marriage.licenses=MarriageLicen$MARRIAGE_LICENSES, business.licenses=BusinessLicen$Amount)
my.model <- lm(marriage.licenses ~ business.licenses, data=new.df)
Finally, if you don't actually have the perfect alignment shown in your example, you can use merge.
my.df <- merge(BusinessLicen, MarriageLicen, by=c("Month", "Year"))
my.model <- lm(MARRIAGE_LICENSES ~ Amount, data=my.df)
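Once a model is fitted, prediction works the same way for any of the three options. A minimal sketch using the merged-data version above (the Amount values passed to newdata are made up purely for illustration):
summary(my.model)  # slope, intercept and fit statistics
predict(my.model, newdata = data.frame(Amount = c(800, 1200)))  # predicted licenses for hypothetical business-license counts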

Related

Merge two data frames - no unique identifier

I would like to combine two data frames. One is information for birds banded. The other is information on recovered banded birds. I would like to add the recovery data to the banding data, if the bird was recovered (not all birds were recovered). Unfortunately the full band number is not included in the banding data, only in the recovery data, so there is not a unique column to join them by.
One looks like this:
  GISBLong  GISBLat B Flyway B Month B Year Band Prefix Plus
 -85.41667 42.41667        8       5   2001            12456
 -85.41655  36.0833        9       6   2003            21548
The other looks like this:
  GISBLong  GISBLat B Flyway B Month B Year      Band R Month R Year
 -85.41667 42.41667        8       5   2001 124565482      12   2002
 -85.41655  36.0833        9       6   2003 215486256       1   2004
I have tried merge(), ifelse(), and dplyr joins with no luck. Any suggestions? Thanks in advance!
You should look up rbind(); that might do the trick. For it to work, the data frames have to have the same columns, so I'd suggest adding the missing columns to your first data frame with dplyr::mutate() and later eliminating the useless rows, as sketched below.
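A minimal sketch of that rbind() idea, built from the two rows shown in the question (the banding/recovery names and the underscore column names are my own simplification):
library(dplyr)
# Banding data: no full band number, only the prefix
banding <- data.frame(GISBLong = c(-85.41667, -85.41655),
                      GISBLat = c(42.41667, 36.0833),
                      B_Flyway = c(8, 9), B_Month = c(5, 6),
                      B_Year = c(2001, 2003),
                      Band_Prefix_Plus = c(12456, 21548))
# Recovery data: full band number plus recovery month/year
recovery <- data.frame(GISBLong = c(-85.41667, -85.41655),
                       GISBLat = c(42.41667, 36.0833),
                       B_Flyway = c(8, 9), B_Month = c(5, 6),
                       B_Year = c(2001, 2003),
                       Band = c(124565482, 215486256),
                       R_Month = c(12, 1), R_Year = c(2002, 2004))
# Give both data frames the same columns, then stack the rows
banding <- mutate(banding, Band = NA, R_Month = NA, R_Year = NA)
recovery <- mutate(recovery, Band_Prefix_Plus = NA)
combined <- rbind(banding, recovery[names(banding)])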

Simple Forecasting using Average method in R for Time series data for multiple groups

I've done forecasting and time series analysis for individual values, but not for a group of values in one go. I've got historical data (36 months, the 1st day of each month, which I created as required for a time series) for multiple groups (Model No.) in a data frame which looks like below:
ModelNo. Month_Year Quantity
a 2017-06-01 0
a 2017-07-01 5
a 2017-08-01 3
.. .......... ....
.. .......... ....
a 2020-05-01 6
b 2017-06-01 9
b 2017-07-01 0
b 2017-08-01 1
.. .......... ....
.. .......... ....
b 2020-05-01 4
c 2020-05-01 3
c 2017-06-01 1
c 2017-07-01 1
c 2017-08-01 0
.. .......... ....
.. .......... ....
c 2020-05-01 4
I then use the code below to subset my data frame to "one group", in order to generate a forecast using the simple average function:
Selected_data<-subset(data, ModelNo.=='a')
currentMonth<-month(Sys.Date())
currentYear<-year(Sys.Date())
I then create the time series object for 24 months, which I then input to my forecast function:
y_ts = ts(Selected_data$Quantity, start=c(currentYear-3, currentMonth), end=c(currentYear-1, currentMonth-1), frequency=12)
I then use the simple mean function to forecast the 12 monthly values (for which I already have "Quantity" values, June 2019 to May 2020):
meanf(y_ts, 12, level = c(95))
and I get output like this for my data (not the output for the sample data above, just a snapshot of my original data):
Point Forecast Lo 95 Hi 95
Jun 2019 1.875 -3.117887 6.867887
Jul 2019 1.875 -3.117887 6.867887
Aug 2019 1.875 -3.117887 6.867887
Sep 2019 1.875 -3.117887 6.867887
Oct 2019 1.875 -3.117887 6.867887
Nov 2019 1.875 -3.117887 6.867887
Dec 2019 1.875 -3.117887 6.867887
Jan 2020 1.875 -3.117887 6.867887
Feb 2020 1.875 -3.117887 6.867887
Mar 2020 1.875 -3.117887 6.867887
Apr 2020 1.875 -3.117887 6.867887
May 2020 1.875 -3.117887 6.867887
So I am able to successfully generate a forecast for "one" Model No. here. However, my questions are:
I have to generate this forecast for all groups in my data frame (a, b, c and so on). I don't know how to do this and store the results in a new data frame holding the forecast values along with the dates for each ModelNo.
I know that the call below returns just the forecasted values from meanf:
meanf(y_ts, 12, level = c(95))$mean
But how do I store these for each group against their dates in a data frame? I tried mutate() but it didn't work.
Following on from question 1, how should I then compare the forecast values with the actual values (as you can see, I only sliced 24 months of data to predict the 12 monthly values)? I know there are methods in R and time series analysis where I can use multiple train/test window slices of the history and then compare against actual values to measure forecast accuracy; I plan to expand this to try multiple forecasting methods.
I would appreciate any help with the above two questions.
I believe there is a learning curve here: I know part of the process, but I'm not sure how to systematically fill the knowledge gap of applying forecasting methods to multiple groups and testing them against actual values. Apart from answers to the two questions above, any link to a tutorial that would help me learn would be very welcome. Thank you very much.
Your question(s) is rather broad, so you can start with something like this to think about how to proceed. First of all, you did not provide reproducible data, so I used what you've posted, with some tweaks to your code to make it work. The idea is to build, for each model, a train and a test time series, create the forecast, and store it in a data.frame. Then you can calculate, for example, the RMSE to see the goodness of fit on the test set.
library(forecast)
library(lubridate)
# set date limits to train and test
train_start <- ymd("2017-06-01")
train_end <- ymd("2019-05-01")
test_start <- ymd("2019-06-01") # end not necessary
# create an empty list
listed <- list()
for (i in unique(data$ModelNo.))
{
  # subset one group
  Selected_data <- subset(data, ModelNo. == i)
  # as ts
  y_ts <- ts(Selected_data$Quantity,
             start = c(year(min(data$Month_Year)),
                       month(max(data$Month_Year))),
             frequency = 12)
  # create train
  train_ts <- window(y_ts,
                     start = c(year(train_start), month(train_start)),
                     end = c(year(train_end), month(train_end)), frequency = 12)
  # create test (note: parameters chosen to fit your sample data)
  test_ts <- window(y_ts,
                    start = c(year(test_start), month(test_start)), frequency = 12)
  listed[[i]] <- cbind(
    data.frame(meanf(train_ts, length(test_ts), level = c(95))),
    real = as.vector(test_ts))
}
Now for part 1, you can create a data.frame with the results:
res <- do.call(rbind,listed)
head(res) # only head to simplify output
Point.Forecast Lo.95 Hi.95 real
a.Jun 2019 49.29167 -22.57528 121.1586 95
a.Jul 2019 49.29167 -22.57528 121.1586 93
a.Aug 2019 49.29167 -22.57528 121.1586 5
a.Sep 2019 49.29167 -22.57528 121.1586 66
a.Oct 2019 49.29167 -22.57528 121.1586 47
a.Nov 2019 49.29167 -22.57528 121.1586 40
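If you would rather have the model and the date as ordinary columns instead of row names, a small post-processing step works (a sketch, assuming the "a.Jun 2019" row-name format shown above):
res$ModelNo. <- sub("\\..*$", "", rownames(res))    # part before the first "."
res$Date <- sub("^[^.]*\\.", "", rownames(res))     # part after the first "."
rownames(res) <- NULL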
For point 2, you can calculate the RMSE (there is a handy function in the Metrics package) for each time series:
library(Metrics)
goodness <- lapply(listed, function(x)rmse(x$real, x$Point.Forecast))
goodness
$a
[1] 31.8692
$b
[1] 30.69859
$c
[1] 30.28037
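If you prefer not to add another package, the same RMSE can be computed directly in base R (a quick sketch over the same listed object):
sapply(listed, function(x) sqrt(mean((x$real - x$Point.Forecast)^2)))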
With data:
set.seed(1234)
data <- data.frame(ModelNo. = c(rep("a", 36), rep("b", 36), rep("c", 36)),
                   Month_Year = lubridate::ymd(rep(seq(as.Date("2017/6/1"), by = "month", length.out = 36), 3)),
                   Quantity = sample(1:100, 108, replace = T))

R plots: simple statistics on data by year. Base package

How to apply simple statistics to data and plot them elegantly by year using the R base plotting system and default functions?
The dataset is quite large, so it would be preferable not to generate new variables.
I hope this is not a silly question, but I have been puzzling over this problem without finding a solution that avoids additional packages such as ggplot2, dplyr, or lubridate, like the ones I found on the site:
ggplot2: Group histogram data by year
R group by year
Split data by year
The use of R's default tools is for didactic purposes: I think it is important training before turning to the more "comfortable" specialised R packages.
Consider a simple dataset:
> prod_dat
lab year production(kg)
1 2010 0.3219
1 2011 0.3222
1 2012 0.3305
2 2010 0.3400
2 2011 0.3310
2 2012 0.3310
3 2010 0.3400
3 2011 0.3403
3 2012 0.3410
I would like to plot a histogram of, let's say, the total production of material during specific years.
> hist(sum(prod_dat$production[prod_dat$year == c(2010, 2013)]))
Unfortunately, this is my best attempt, and it throws an error:
in prod_dat$year == c(2010, 2012):
longer object length is not a multiple of shorter object length
I am really at a loss, so any suggestion would be useful.
Without ggplot I used to do it like this, but there are smarter ways, I think:
all <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "lab year production
1 2010 1
1 2011 0.3222
1 2012 0.3305
2 2010 0.3400
2 2011 0.3310
2 2012 0.3310
3 2010 0.3400
3 2011 0.3403
3 2012 0.3410")
ar <- data.frame(year = unique(all$year), prod = tapply(all$production, list(all$year), FUN = sum))
barplot(ar$prod)
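Two small additions, assuming the all and ar objects from above. The error in the question comes from comparing year with a length-2 vector using ==, which recycles; %in% is the element-wise membership test. And barplot() accepts names.arg and the usual axis labels, so the plot can be made self-explanatory:
# sum production only for selected years (use %in%, not ==)
sum(all$production[all$year %in% c(2010, 2012)])
# the same barplot, with year labels and axis titles
barplot(ar$prod, names.arg = ar$year,
        xlab = "Year", ylab = "Total production (kg)",
        main = "Total production by year")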

Differencing with respect to specific value of a column

I have a variable called Depression which has 40 observations and runs quarterly from 2004 to 2013 (e.g. 2004 Q1, 2004 Q2, etc.). I would like to make a new column which differences with respect to the 27th row/observation, which corresponds to 2010 Q3, so that that value is set to 0. Any help is greatly appreciated!
If I understand your question correctly, this would do it:
# generate sample data
dat <- data.frame(id=paste0("Obs.",1:40),depression=as.integer(runif(40,0,20)))
# Create new var that calculates difference with 27th observation on depression score
dat$diff <- dat$depression - dat$depression[27]
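Since row 27 is differenced against itself, a quick sanity check on the sample data is that the new column is exactly zero there:
dat$diff[27]
# [1] 0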

Eliminating Existing Observations in a Zoo Merge

I'm trying to do a zoo merge between stock prices from selected trading days and observations about those same stocks (we call these "Nx observations") made on the same days. Sometimes we do not have Nx observations on stock trading days, and sometimes we have Nx observations on non-trading days. We want to place an NA where we do not have any Nx observations on trading days, but eliminate Nx observations that fall on non-trading days, since without trading data for the same day the Nx observations are useless.
The following SO question is close to mine, but I would characterize that question as REPLACING missing data, whereas my objective is to truly eliminate observations made on non-trading days (if necessary, we can change the process by which Nx observations are taken, but it would be a much less expensive solution to leave it alone).
merge data frames to eliminate missing observations
The script I have prepared to illustrate follows (I'm new to R and SO; all suggestions welcome):
# create Stk_data data.frame for use in the Stack Overflow question
Date_Stk <- c("1/2/13", "1/3/13", "1/4/13", "1/7/13", "1/8/13") # dates for stock prices used in the example
ABC_Stk <- c(65.73, 66.85, 66.92, 66.60, 66.07) # stock prices for tkr ABC for Jan 1 2013 through Jan 8 2013
DEF_Stk <- c(42.98, 42.92, 43.47, 43.16, 43.71) # stock prices for tkr DEF for Jan 1 2013 through Jan 8 2013
GHI_Stk <- c(32.18, 31.73, 32.43, 32.13, 32.18) # stock prices for tkr GHI for Jan 1 2013 through Jan 8 2013
Stk_data <- data.frame(Date_Stk, ABC_Stk, DEF_Stk, GHI_Stk) # create the stock price data.frame
# create Nx_data data.frame for use in the Stack Overflow question
Date_Nx <- c("1/2/13", "1/4/13", "1/5/13", "1/6/13", "1/7/13", "1/8/13") # dates for Nx Observations used in the example
ABC_Nx <- c(51.42857, 51.67565, 57.61905, 57.78349, 58.57143, 58.99564) # Nx scores for stock ABC for Jan 1 2013 through Jan 8 2013
DEF_Nx <- c(35.23809, 36.66667, 28.57142, 28.51778, 27.23150, 26.94331) # Nx scores for stock DEF for Jan 1 2013 through Jan 8 2013
GHI_Nx <- c(7.14256, 8.44573, 6.25344, 6.00423, 5.99239, 6.10034) # Nx scores for stock GHI for Jan 1 2013 through Jan 8 2013
Nx_data <- data.frame(Date_Nx, ABC_Nx, DEF_Nx, GHI_Nx) # create the Nx scores data.frame
# create zoo objects & merge
z.Stk_data <- zoo(Stk_data, as.Date(as.character(Stk_data[, 1]), format = "%m/%d/%Y"))
z.Nx_data <- zoo(Nx_data, as.Date(as.character(Nx_data[, 1]), format = "%m/%d/%Y"))
z.data.outer <- merge(z.Stk_data, z.Nx_data)
The NAs on Jan 3 2013 for the Nx observations are fine (we'll use the na.locf) but we need to eliminate the Nx observations that appear on Jan 5 and 6 as well as the associated NAs in the Stock price section of the zoo objects.
I've read the R documentation for merge.zoo regarding the use of "all": that its use "allows intersection, union and left and right joins to be expressed". But trying all combinations of the following use of "all" yielded the same results (as to why would be a secondary question).
z.data.outer <- zoo(merge(x = Stk_data, y = Nx_data, all.x = FALSE)) # try using "all"
While I would appreciate comments on the secondary question, I'm primarily interested in learning how to eliminate the extraneous Nx observations on days when there is no trading of stocks. Thanks. (And thanks in general to the community for all the great explanations of R!)
The all argument of merge.zoo must be (quoting from the help file):
logical vector having the same length as the number of "zoo" objects to be merged (otherwise expanded)
and you want to keep all rows from the first argument but not the second, so its value should be c(TRUE, FALSE).
merge(z.Stk_data, z.Nx_data, all = c(TRUE, FALSE))
The reason for the change in the all syntax for merge.zoo relative to merge.data.frame is that merge.zoo can merge any number of arguments, whereas merge.data.frame only handles two, so the syntax had to be extended to handle that.
Also note that %Y should have been %y in the question's code.
I hope I have understood your desired output correctly ("NAs on Jan 3 2013 for the Nx observations are fine"; "eliminate [...] observations that appear on Jan 5 and 6"). I don't quite see the need for zoo in the merging step.
merge(Stk_data, Nx_data, by.x = "Date_Stk", by.y = "Date_Nx", all.x = TRUE)
# Date_Stk ABC_Stk DEF_Stk GHI_Stk ABC_Nx DEF_Nx GHI_Nx
# 1 1/2/13 65.73 42.98 32.18 51.42857 35.23809 7.14256
# 2 1/3/13 66.85 42.92 31.73 NA NA NA
# 3 1/4/13 66.92 43.47 32.43 51.67565 36.66667 8.44573
# 4 1/7/13 66.60 43.16 32.13 58.57143 27.23150 5.99239
# 5 1/8/13 66.07 43.71 32.18 58.99564 26.94331 6.10034
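If you do want to stay in zoo, for example to carry the Jan 3 Nx values forward with na.locf as the question mentions, here is a sketch building on the first answer's merge call (assuming the %Y/%y fix noted above has been applied when creating the zoo objects):
library(zoo)
z.merged <- merge(z.Stk_data, z.Nx_data, all = c(TRUE, FALSE))  # trading days only
z.filled <- na.locf(z.merged)  # carry the last available Nx observation forward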
