Correlation between two data frames in R

I have one data frame with sales values for the period Oct. 2000 to Dec. 2001 (15 months), and a second data frame with profit values for the same period. I want to find the correlation between the two, month by month, over these 15 months in R. My data frame sales is:
Month sales
Oct. 2000 24.1
Nov. 2000 23.3
Dec. 2000 43.9
Jan. 2001 53.8
Feb. 2001 74.9
Mar. 2001 25
Apr. 2001 48.5
May. 2001 18
Jun. 2001 68.1
Jul. 2001 78
Aug. 2001 48.8
Sep. 2001 48.9
Oct. 2001 34.3
Nov. 2001 54.1
Dec. 2001 29.3
My second data frame profit is:
period profit
Oct 2000 14.1
Nov 2000 3.3
Dec 2000 13.9
Jan 2001 23.8
Feb 2001 44.9
Mar 2001 15
Apr 2001 58.5
May 2001 18
Jun 2001 58.1
Jul 2001 38
Aug 2001 28.8
Sep 2001 18.9
Oct 2001 24.3
Nov 2001 24.1
Dec 2001 19.3
I know that for the first two months I cannot get a correlation, as there are not enough values, but from Dec. 2000 onwards I want to calculate the correlation using all months up to and including the current one. So for Dec. 2000 I will use the values of Oct. 2000, Nov. 2000 and Dec. 2000, which gives 3 sales values and 3 profit values. Similarly, for Jan. 2001 I will use Oct. 2000, Nov. 2000, Dec. 2000 and Jan. 2001, giving 4 sales values and 4 profit values. So for every month the previous months' values are included as well, and my output should look something like this:
Month Correlation
Oct. 2000 NA or Empty
Nov. 2000 NA or Empty
Dec. 2000 x
Jan. 2001 y
. .
. .
Dec. 2001 a
I know that in R there is a function cor(sales, profit) but how can I find out the correlation for my scenario?

Make some sample data:
> sales = c(1,4,3,2,3,4,5,6,7,6,7,5)
> profit = c(4,3,2,3,4,5,6,7,7,7,6,5)
> data = data.frame(sales=sales,profit=profit)
> head(data)
sales profit
1 1 4
2 4 3
3 3 2
4 2 3
5 3 4
6 4 5
Here's the beef:
> data$runcor = c(NA,NA,
sapply(3:nrow(data),
function(i){
cor(data$sales[1:i],data$profit[1:i])
}))
> data
sales profit runcor
1 1 4 NA
2 4 3 NA
3 3 2 -0.65465367
4 2 3 -0.63245553
5 3 4 -0.41931393
6 4 5 0.08155909
7 5 6 0.47368421
8 6 7 0.69388867
9 7 7 0.78317543
10 6 7 0.81256816
11 7 6 0.80386072
12 5 5 0.80155885
So now data$runcor[3] is the correlation of the first 3 sales and profit numbers.
Note that I call this runcor because it's a "running correlation", like a "running sum", which is the sum of all elements so far. This is the correlation of all pairs so far.
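The same expanding-window idea can be wrapped in a small helper. This is a minimal sketch rather than part of the original answer: the function name running_cor and its min_obs argument are made up for illustration, and only base R's cor() and vapply() are used.

```r
# Hypothetical helper: correlation of all (x, y) pairs seen so far.
running_cor <- function(x, y, min_obs = 3) {
  stopifnot(length(x) == length(y))
  vapply(seq_along(x), function(i) {
    # Too few pairs so far: report NA, as in the desired output
    if (i < min_obs) return(NA_real_)
    cor(x[seq_len(i)], y[seq_len(i)])
  }, numeric(1))
}

# Sample data from the answer above
sales  <- c(1, 4, 3, 2, 3, 4, 5, 6, 7, 6, 7, 5)
profit <- c(4, 3, 2, 3, 4, 5, 6, 7, 7, 7, 6, 5)
runcor <- running_cor(sales, profit)
```

The last element of runcor is simply cor(sales, profit) over the full vectors, which is a quick sanity check on the window logic.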

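Another possibility, if dat1 and dat2 are the initial data sets, is to strip the periods from the month labels so that the keys match, merge the two data frames, and compute the correlation over growing row windows: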
dat1$Month <- gsub("\\.", "", dat1$Month)
datN <- merge(dat1, dat2, sort=FALSE, by.x="Month", by.y="period")
indx <- sequence(3:nrow(datN)) #create index to replicate the rows
indx1 <- cumsum(c(TRUE,diff(indx) <0)) #create another index to group the rows
#calculate the correlation grouped by `indx1`
datN$runcor <- setNames(c(NA, NA,by(datN[indx,-1],
list(indx1), FUN=function(x) cor(x$sales, x$profit) )), NULL)
datN
# Month sales profit runcor
#1 Oct 2000 24.1 14.1 NA
#2 Nov 2000 23.3 3.3 NA
#3 Dec 2000 43.9 13.9 0.5155911
#4 Jan 2001 53.8 23.8 0.8148546
#5 Feb 2001 74.9 44.9 0.9345166
#6 Mar 2001 25.0 15.0 0.9119941
#7 Apr 2001 48.5 58.5 0.7056301
#8 May 2001 18.0 18.0 0.6879528
#9 Jun 2001 68.1 58.1 0.7647177
#10 Jul 2001 78.0 38.0 0.7357748
#11 Aug 2001 48.8 28.8 0.7351366
#12 Sep 2001 48.9 18.9 0.7190413
#13 Oct 2001 34.3 24.3 0.7175138
#14 Nov 2001 54.1 24.1 0.7041889
#15 Dec 2001 29.3 19.3 0.7094334
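If row order matters, the two tables can also be aligned with match() instead of merge(), which keeps the sales order untouched. The following is only a sketch under that assumption; the names sales_df, profit_df, key and idx are illustrative, and the data are copied from the question.

```r
# Data copied from the question
sales_df <- data.frame(
  Month = c("Oct. 2000", "Nov. 2000", "Dec. 2000", "Jan. 2001", "Feb. 2001",
            "Mar. 2001", "Apr. 2001", "May. 2001", "Jun. 2001", "Jul. 2001",
            "Aug. 2001", "Sep. 2001", "Oct. 2001", "Nov. 2001", "Dec. 2001"),
  sales = c(24.1, 23.3, 43.9, 53.8, 74.9, 25, 48.5, 18, 68.1, 78,
            48.8, 48.9, 34.3, 54.1, 29.3)
)
profit_df <- data.frame(
  period = c("Oct 2000", "Nov 2000", "Dec 2000", "Jan 2001", "Feb 2001",
             "Mar 2001", "Apr 2001", "May 2001", "Jun 2001", "Jul 2001",
             "Aug 2001", "Sep 2001", "Oct 2001", "Nov 2001", "Dec 2001"),
  profit = c(14.1, 3.3, 13.9, 23.8, 44.9, 15, 58.5, 18, 58.1, 38,
             28.8, 18.9, 24.3, 24.1, 19.3)
)

# Drop the literal "." so "Oct. 2000" matches "Oct 2000"
key <- gsub(".", "", sales_df$Month, fixed = TRUE)
idx <- match(key, profit_df$period)
stopifnot(!anyNA(idx))  # every sales month must have a profit row

sales_df$profit <- profit_df$profit[idx]

# Expanding-window correlation; the first two months stay NA
n <- nrow(sales_df)
sales_df$runcor <- c(NA, NA, sapply(3:n, function(i) {
  cor(sales_df$sales[1:i], sales_df$profit[1:i])
}))
```

With this data the Dec. 2000 value comes out at about 0.5156, in line with the merged output shown above.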

Related

Creating averages across time periods

I'm a beginner in R. I have the data frame below (with more observations than shown), in which each 'id' is observed in at most three years: 1991, 1999 and 2007.
I want to create a variable avg_ln_rd by 'id' that averages 'ln_rd' with the 'ln_rd' of the previous period: with the 1991 value if the observation is from 1999, and with the 1999 value if the observation is from 2007.
id year ln_rd
<dbl> <dbl> <dbl>
1 1013 1991 3.51
2 1013 1999 5.64
3 1013 2007 4.26
4 1021 1991 0.899
5 1021 1999 0.791
6 1021 2007 0.704
7 1034 1991 2.58
8 1034 1999 3.72
9 1034 2007 4.95
10 1037 1991 0.262
I also already dropped any observations of 'id' that only exist for one of the three years.
My first thought was to create a standalone ln_rd variable for each year, but then I would still need to filter by id, which I do not know how to do.
Then I tried using these standalone variables to form an if clause.
df$lagln_rd_99 <- ifelse(df$year == 1999, df$ln_rd_91, NA)
But again I do not know how to keep 'id' fixed.
Any help would be greatly appreciated.
EDIT:
I grouped by id using dplyr. Can I then just sort my df by id and create a new variable that is ln_rd but shifted by one row?
Still a bit unclear what to do if all years are present in a group but this might help.
-- edited -- to show the desired output.
library(dplyr)
df %>%
group_by(id) %>%
arrange(id, year) %>%
mutate(avg91 = mean(c(ln_rd[year == 1991], ln_rd[year == 1999])),
avg99 = mean(c(ln_rd[year == 1999], ln_rd[year == 2007])),
avg91 = ifelse(year == 1991, avg91, NA),
avg99 = ifelse(year == 2007, avg99, NA)) %>%
ungroup()
# A tibble: 15 × 5
year id ln_rd avg91 avg99
<int> <int> <dbl> <dbl> <dbl>
1 1991 3505 3.38 3.09 NA
2 1999 3505 2.80 NA NA
3 1991 4584 1.45 1.34 NA
4 1999 4584 1.22 NA NA
5 1991 5709 1.90 2.13 NA
6 1999 5709 2.36 NA NA
7 2007 5709 3.11 NA 2.74
8 2007 9777 2.36 NA 2.36
9 1991 18729 4.82 5.07 NA
10 1999 18729 5.32 NA NA
11 2007 18729 5.53 NA 5.42
12 1991 20054 0.588 0.307 NA
13 1999 20054 0.0266 NA NA
14 1999 62169 1.91 NA NA
15 2007 62169 1.45 NA 1.68
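The "shift by one row within id" idea from the question's edit can also be sketched in base R with ave(). This is an illustration only, assuming the rows are sorted by id and year; it uses the first nine rows of the question's sample, and the column name avg_ln_rd is taken from the question.

```r
# First nine rows of the question's sample (ids observed in all three years)
df <- data.frame(
  id    = rep(c(1013, 1021, 1034), each = 3),
  year  = rep(c(1991, 1999, 2007), times = 3),
  ln_rd = c(3.51, 5.64, 4.26, 0.899, 0.791, 0.704, 2.58, 3.72, 4.95)
)

# Sort so that "previous row" means "previous observed year" within an id
df <- df[order(df$id, df$year), ]

# Within each id, average each value with the one before it;
# the first observation of an id has no predecessor and stays NA
df$avg_ln_rd <- ave(df$ln_rd, df$id, FUN = function(v) {
  (v + c(NA, head(v, -1))) / 2
})
```

One caveat of this design: an id observed only in 1991 and 2007 would have those two years averaged together, whereas the dplyr code above pairs only 1991 with 1999 and 1999 with 2007.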

Add share column based on year and other categorical variable to dataframe in R

I have sales data by year, condition and products
Year <- c(2010,2010,2010,2010,2010,2010,2011,2011,2011,2011,2011,2011,2012,2012,2012,2012,2012,2012)
Sale <- c("30","45","23","33","24","11","56","19","45","56","33","32","89","33","12",18,10,17)
Condition <- c("New","New","New","Used","Used","Used","New","New","New","Used","Used","Used","New","New","New","Used","Used","Used")
Product <- c("a","b","c","a","b","c","a","b","c","a","b","c","a","b","c","a","b","c")
df <- data.frame(Year,Condition, Product, Sale)
Now I want to calculate the share of each product by condition within each year. I tried the following code, but it computes each sale's share of the overall total rather than within year and condition:
df$percentage <- df$Sale/sum(df$Sale)*100
First convert Sale from character to numeric with type.convert(as.is = TRUE), then group by the desired columns and apply summarise.
Note that with the data frame you provided every percentage comes out as 100, because each combination of Year, Product and Condition contains a single row.
With this fake data
set.seed(123)
Year <- sample(c(2010, 2011, 2012), 18, replace = TRUE)
Sale <- c("30","45","23","33","24","11","56","19","45","56","33","32","89","33","12",18,10,17)
Condition <- sample(c("Used","New"), 18, replace = TRUE)
Product <- sample(c("a","b","c"), 18, replace = TRUE)
df <- data.frame(Year,Condition, Product, Sale)
using this code
library(dplyr)
df %>%
type.convert(as.is=TRUE) %>%
group_by(Year, Product, Condition) %>%
summarise(percentage = Sale/sum(Sale)*100)
you will get:
Year Product Condition percentage
<int> <chr> <chr> <dbl>
1 2010 a Used 83.2
2 2010 a Used 16.8
3 2010 c New 100
4 2011 a New 100
5 2011 a Used 42.9
6 2011 a Used 14.3
7 2011 a Used 42.9
8 2011 b New 100
9 2011 c New 49.2
10 2011 c New 50.8
11 2012 a Used 63.8
12 2012 a Used 36.2
13 2012 b New 100
14 2012 b Used 69.7
15 2012 b Used 30.3
16 2012 c New 100
17 2012 c Used 34.8
18 2012 c Used 65.2
Update: to keep Sale column: replace summarise with mutate
df %>%
type.convert(as.is=TRUE) %>%
group_by(Year, Product, Condition) %>%
mutate(percentage = paste(round(Sale/sum(Sale)*100, 1), "%"))
Year Condition Product Sale percentage
<int> <chr> <chr> <int> <chr>
1 2012 Used a 30 63.8 %
2 2012 New c 45 100 %
3 2012 Used b 23 69.7 %
4 2011 Used a 33 42.9 %
5 2012 Used c 24 34.8 %
6 2011 Used a 11 14.3 %
7 2011 New a 56 100 %
8 2011 New b 19 100 %
9 2012 Used c 45 65.2 %
10 2010 New c 56 100 %
11 2011 Used a 33 42.9 %
12 2011 New c 32 49.2 %
13 2010 Used a 89 83.2 %
14 2011 New c 33 50.8 %
15 2012 New b 12 100 %
16 2010 Used a 18 16.8 %
17 2012 Used b 10 30.3 %
18 2012 Used a 17 36.2 %
Here is a base solution using ave(). You can replace grouping variables in ave with any others you want.
within(df, {
perc1 = ave(as.numeric(Sale), Year, Product, FUN = proportions) * 100
perc2 = sprintf("%.1f %%", perc1)
})
Year Condition Product Sale perc2 perc1
1 2010 New a 30 47.6 % 47.61905
2 2010 New b 45 65.2 % 65.21739
3 2010 New c 23 67.6 % 67.64706
4 2010 Used a 33 52.4 % 52.38095
5 2010 Used b 24 34.8 % 34.78261
6 2010 Used c 11 32.4 % 32.35294
7 2011 New a 56 50.0 % 50.00000
8 2011 New b 19 36.5 % 36.53846
9 2011 New c 45 58.4 % 58.44156
10 2011 Used a 56 50.0 % 50.00000
11 2011 Used b 33 63.5 % 63.46154
12 2011 Used c 32 41.6 % 41.55844
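For the question's own data, one reading of "share of each product by condition within each year" is the share within each Year and Condition group. A minimal base R sketch under that assumption follows; the column name share is illustrative, and Sale is entered as numeric here instead of being converted from character.

```r
# The question's data, with Sale written as numbers
Year      <- rep(c(2010, 2011, 2012), each = 6)
Sale      <- c(30, 45, 23, 33, 24, 11, 56, 19, 45, 56, 33, 32,
               89, 33, 12, 18, 10, 17)
Condition <- rep(rep(c("New", "Used"), each = 3), times = 3)
Product   <- rep(c("a", "b", "c"), times = 6)
df <- data.frame(Year, Condition, Product, Sale)

# Share of each product within its Year/Condition group, in percent
df$share <- ave(df$Sale, df$Year, df$Condition,
                FUN = function(x) x / sum(x) * 100)
```

Within every Year/Condition group the shares add up to 100. If the intended denominator is different, for example Year and Product, only the grouping arguments to ave() need to change.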

Seasonal package: Forecasts end date [...] must end on or before user-defined regression variables end date

I'm relatively new to R and had a question regarding time series format for forecasting and seasonal adjustment using the seasonal package. I'm working with import.spc to generate function calls based on spec files.
Currently, I have FORECAST{MAXLEAD=48}, with my time series ending in 2022-02. I'm getting this error:
- forecasts end date, 2026.Feb, must end on or before user-defined regression variables end date, 2022.Feb.
Is this because my time series ends earlier than 2026-02? I tried appending "NA"s to the end of my historicals but it didn't do much.
Alternatively, I also tried setting FORECAST{MAXLEAD=0}, but I ran into this error:
Error: X-13 has returned a non-zero exist status, which means that the current spec file cannot be processed for an unknown reason.
See my code below:
library("tidyverse")
library("seasonal")
fn<-import.spc("C:\\PATH\\TO\\SPEC\\FILE.spc")
x<-import.ts("C:\\PATH\\TO\\DATA\\FILE.dat")
x %>% (fn[1]$seas)
FILE.spc
SERIES{
TITLE = "Logging"
START = 2016.01
PERIOD = 12
SAVE = (A1 B1)
PRINT = BRIEF
NAME = '1011330000 - AE'
FILE = '"C:\\PATH\\TO\\DATA\\FILE.dat"}
TRANSFORM{FUNCTION = NONE
}
REGRESSION{
USER = (dum1 dum2 dum3 dum4 dum5 dum6 dum7 dum8 dum9 dum10 dum11)
START = 1986.01
USERTYPE = TD
FILE = 'C:\\PATH\\TO\\FILE\\FDUM8606.dat'
SAVE = (TD AO LS TC)
}
ARIMA{
MODEL = (0 1 1)(0 1 1)
}
ESTIMATE{
MAXITER = 3000
}
FORECAST{
MAXLEAD = 0
}
OUTLIER{
CRITICAL = 10.5
TYPES = AO
}
X11{
SEASONALMA = (s3x3)
MODE = ADD
PRINT = (BRIEF)
SAVE = (D10 D11 D16)
APPENDFCST = YES
FINAL = USER
SAVELOG = (Q Q2 M7 FB1 FD8 MSF)
}
FDUM8606.dat can be found here
FILE.dat
2016 2 51.1
2016 3 50.4
2016 4 47.9
2016 5 49.8
2016 6 52.0
2016 7 52.6
2016 8 52.6
2016 9 51.9
2016 10 52.1
2016 11 51.4
2016 12 49.9
2017 1 48.2
2017 2 49.6
2017 3 48.0
2017 4 47.6
2017 5 48.9
2017 6 50.4
2017 7 50.7
2017 8 50.6
2017 9 50.1
2017 10 49.7
2017 11 50.7
2017 12 50.2
2018 1 49.2
2018 2 49.8
2018 3 48.7
2018 4 47.8
2018 5 49.0
2018 6 49.2
2018 7 50.8
2018 8 50.6
2018 9 50.0
2018 10 49.6
2018 11 49.1
2018 12 49.7
2019 1 49.3
2019 2 48.1
2019 3 47.7
2019 4 45.4
2019 5 47.1
2019 6 48.8
2019 7 49.3
2019 8 50.5
2019 9 49.5
2019 10 51.6
2019 11 51.2
2019 12 49.1
2020 1 47.9
2020 2 47.9
2020 3 46.7
2020 4 42.0
2020 5 44.3
2020 6 45.7
2020 7 46.8
2020 8 46.7
2020 9 46.6
2020 10 47.5
2020 11 47.0
2020 12 48.1
2021 1 48.1
2021 2 48.0
2021 3 46.3
2021 4 43.4
2021 5 43.7
2021 6 46.8
2021 7 47.6
2021 8 48.0
2021 9 46.0
2021 10 45.5
2021 11 45.4
2021 12 44.7
2022 1 44.8
2022 2 45.1
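Read literally, the first error message says that the forecast horizon runs past the last date covered by the user-defined regression variables. The date arithmetic behind that message can be sketched in base R. This only reproduces the calculation and is not a fix for the spec file; the object names are illustrative and the values are placeholders.

```r
# Series described in the question: monthly, first value 2016-02, last 2022-02
x <- ts(rep(NA_real_, 73), start = c(2016, 2), frequency = 12)
maxlead <- 48  # FORECAST{MAXLEAD=48}

# A placeholder series that starts at the last observation and runs
# maxlead further months; its end is the forecast end date
horizon <- ts(rep(NA_real_, maxlead + 1), start = end(x), frequency = 12)
forecast_end <- end(horizon)
```

Here forecast_end is February 2026, the date quoted in the error, so regressors that stop in February 2022 fall 48 months short of that horizon.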

Forecasting one step ahead

I have one data.frame with three columns Year, Nominal_Revenue and COEFFICIENT. So I want to forecast with this data like example below
library(dplyr)
TEST<-data.frame(
Year= c(2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021),
Nominal_Revenue=c(8634,5798,6011,6002,6166,6478,6731,7114,6956,6968,7098,7610,7642,8203,9856,10328,11364,12211,13150,NA,NA,NA),
COEFFICIENT=c(NA,1.016,1.026,1.042,1.049,1.106,1.092,1.123,1.121,0.999,1.059,1.066,1.006,1.081,1.055,1.063,1.071,1.04,1.072,1.062,1.07, 1.075))
SIMULATION<-mutate(TEST,
FORECAST=lag(Nominal_Revenue)*COEFFICIENT
)
The result of this code is a forecast for only one of the missing years, namely 2019.
My intention is to get a forecast for every NA in column Nominal_Revenue (2019 to 2021).
Can anybody help me fix this code?
Because each step needs the previously computed value, we can loop once per NA in the variable and apply a dplyr mutate on each pass:
for (i in seq_len(sum(is.na(TEST$Nominal_Revenue)))) {
  # each pass fills the first remaining NA from the row above it
  TEST <- TEST %>% mutate(Nominal_Revenue = if_else(is.na(Nominal_Revenue), COEFFICIENT * lag(Nominal_Revenue), Nominal_Revenue))
}
> TEST
Year Nominal_Revenue COEFFICIENT
1 2000 8634.00 NA
2 2001 5798.00 1.016
3 2002 6011.00 1.026
4 2003 6002.00 1.042
5 2004 6166.00 1.049
6 2005 6478.00 1.106
7 2006 6731.00 1.092
8 2007 7114.00 1.123
9 2008 6956.00 1.121
10 2009 6968.00 0.999
11 2010 7098.00 1.059
12 2011 7610.00 1.066
13 2012 7642.00 1.006
14 2013 8203.00 1.081
15 2014 9856.00 1.055
16 2015 10328.00 1.063
17 2016 11364.00 1.071
18 2017 12211.00 1.040
19 2018 13150.00 1.072
20 2019 13965.30 1.062
21 2020 14942.87 1.070
22 2021 16063.59 1.075
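Since every forecast is the previous value times that year's coefficient, the chain can also be written without a loop as the last known value times a running product. This is a sketch using only the last five rows of the table, and it assumes the NAs form a single block at the end of the series; the variable names are illustrative.

```r
# Rows 2017 to 2021 from the table above
revenue <- c(12211, 13150, NA, NA, NA)
coef    <- c(1.040, 1.072, 1.062, 1.070, 1.075)

na_idx     <- which(is.na(revenue))
last_known <- revenue[min(na_idx) - 1]  # 2018 value, the last observed one

# Forecast k steps ahead = last known value * product of the first k coefficients
revenue[na_idx] <- last_known * cumprod(coef[na_idx])
```

Rounded to two decimals, the filled values agree with the 2019 to 2021 rows of the loop-based output.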

How to stop R Loop storing the same variable multiple times instead of just once

I have time-series daily temperature data which I've split into years and seasons (SU = Summer etc.). The Season.Year column allows for analysis of the climatological year (using December from the previous year as winter, keeping the seasonality of the trend).
Sample:
Day Month Year maxtp Season.Year Season
1 20 8 2007 19.1 2007 SU
2 21 8 2007 17.6 2007 SU
3 22 8 2007 21.8 2007 SU
4 23 8 2007 20.0 2007 SU
5 24 8 2007 22.4 2007 SU
6 25 8 2007 21.2 2007 SU
7 26 8 2007 19.3 2007 SU
8 27 8 2007 17.5 2007 SU
9 28 8 2007 18.9 2007 SU
10 29 8 2007 18.3 2007 SU
11 30 8 2007 19.5 2007 SU
12 1 9 2007 19.8 2007 A
13 2 9 2007 19.2 2007 A
14 3 9 2007 18.9 2007 A
15 4 9 2007 20.4 2007 A
16 5 9 2007 21.2 2007 A
I want to extract all the winters from each year, creating a subset (and new dataset) with all the temperature values from winter 2007 to 2014.
The R-Loop I created (below) does this, but repeats the data (i.e. there are 364 values for winter (W) 2008 where there should only be around 90)
for( i in 2008:2014) {
for(j in 1:4) {
j = 1
data.sub <- subset(data, data$Season.Year == i & data$Season == s[j])
Winter <- rbind(Winter, data.sub)
}
}
Can anyone see what's wrong with this loop? Why is the subset storing so much data, and not just giving me all winter values for 2008, followed by 2009, up to 2014?
There should be around 90 data points for each winter for each year (around 600 values overall, whereas I'm getting over 2500).
I think your problem comes from modifying j within the inner loop: j is reset to 1 on every pass, so for each of the four values of j you extract the s[1] data again.
Regardless, it is simpler to just do:
Winter <- subset(data, data$Season.Year %in% 2008:2014 & data$Season %in% s[1])
presuming that s[1] is winter.
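The fourfold duplication is easy to reproduce on toy data. In this sketch the season codes in s, the two-year range and the made-up temperature column are all assumptions chosen for illustration; only the loop structure is taken from the question.

```r
# Toy data: 2 years x 4 seasons x 3 days; the season codes in `s` are assumed
s <- c("W", "SP", "SU", "A")
data <- expand.grid(Day = 1:3, Season = s, Season.Year = 2008:2009,
                    stringsAsFactors = FALSE)
data$maxtp <- seq_len(nrow(data))  # placeholder temperatures

# Loop from the question: j is overwritten, so winter is appended 4 times per year
looped <- data[0, ]
for (i in 2008:2009) {
  for (j in 1:4) {
    j <- 1
    looped <- rbind(looped, subset(data, Season.Year == i & Season == s[j]))
  }
}

# Single vectorised subset: each winter row appears once
winter <- subset(data, Season.Year %in% 2008:2009 & Season %in% s[1])
```

On this toy data looped has four times as many rows as winter, which mirrors the 364 versus roughly 90 values reported in the question.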
