I hope this is basic; just need a nudge in the right direction.
I have read a database table from MS Access into a data frame using RODBC. The columns are:
PRODID PROD Year Week QTY SALES INVOICES
Here is the structure:
str(data)
'data.frame': 8270 obs. of 7 variables:
$ PRODID : int 20001 20001 20001 100001 100001 100001 100001 100001 100001 100001 ...
$ PROD : Factor w/ 1239 levels "1% 20qt Box",..: 335 335 335 128 128 128 128 128 128 128 ...
$ Year : int 2010 2010 2010 2009 2009 2009 2009 2009 2009 2010 ...
$ Week : int 12 18 19 14 15 16 17 18 19 9 ...
$ QTY : num 1 1 0 135 300 270 300 270 315 315 ...
$ SALES : num 15.5 0 -13.9 243 540 ...
$ INVOICES: num 1 1 2 5 11 11 10 11 11 12 ...
Here are the top few rows:
head(data, n=10)
PRODID PROD Year Week QTY SALES INVOICES
1 20001 Dolie 12" 2010 12 1 15.46 1
2 20001 Dolie 12" 2010 18 1 0.00 1
3 20001 Dolie 12" 2010 19 0 -13.88 2
4 100001 Cage Free Eggs 2009 14 135 243.00 5
5 100001 Cage Free Eggs 2009 15 300 540.00 11
6 100001 Cage Free Eggs 2009 16 270 486.00 11
7 100001 Cage Free Eggs 2009 17 300 540.00 10
8 100001 Cage Free Eggs 2009 18 270 486.00 11
9 100001 Cage Free Eggs 2009 19 315 567.00 11
10 100001 Cage Free Eggs 2010 9 315 569.25 12
I simply want to generate lags of QTY, SALES, and INVOICES for each product, but I am not sure where to start. I know R is great with time series, yet I don't know which functions apply here.
I have two questions:
I have the raw invoice data but have aggregated it for reporting purposes. Would it be easier if I didn't aggregate the data?
Regardless of aggregation or not, what functions will I need to loop over each product and generate the lags as I need them?
In short, I want to loop over a set of records, calculate lags for a product (if possible), append the lags (as they apply) to the current record for each product, and write the results back to a table in my database for my reporting software to use.
There is most likely a more elegant way to do this.
First read in the data:
h <- 'row PRODID PROD Year Week QTY SALES INVOICES
1 20001 Dolie12 2010 12 1 15.46 1
2 20001 Dolie12 2010 18 1 0.00 1
3 20001 Dolie12 2010 19 0 -13.88 2
4 100001 CageFreeEggs 2009 14 135 243.00 5
5 100001 CageFreeEggs 2009 15 300 540.00 11
6 100001 CageFreeEggs 2009 16 270 486.00 11
7 100001 CageFreeEggs 2009 17 300 540.00 10
8 100001 CageFreeEggs 2009 18 270 486.00 11
9 100001 CageFreeEggs 2009 19 315 567.00 11
10 100001 CageFreeEggs 2010 9 315 569.25 12'
dat <- read.table(textConnection(h), header = TRUE)
Next we calculate the lagged variables. This line of code splits the data by PROD, prepends a row of NAs to each piece, and drops each piece's last row, resulting in a lag of 1:
new_vars <- do.call(rbind, rev(by(dat[, c(6, 7, 8)], dat$PROD, function(x) rbind(NA, x[-nrow(x), ]))))
Show them in the console:
> cbind(dat,new_vars)
row PRODID PROD Year Week QTY SALES INVOICES QTY SALES INVOICES
Dolie12.1 1 20001 Dolie12 2010 12 1 15.46 1 NA NA NA
Dolie12.2 2 20001 Dolie12 2010 18 1 0.00 1 1 15.46 1
Dolie12.3 3 20001 Dolie12 2010 19 0 -13.88 2 1 0.00 1
CageFreeEggs.1 4 100001 CageFreeEggs 2009 14 135 243.00 5 NA NA NA
CageFreeEggs.4 5 100001 CageFreeEggs 2009 15 300 540.00 11 135 243.00 5
CageFreeEggs.5 6 100001 CageFreeEggs 2009 16 270 486.00 11 300 540.00 11
CageFreeEggs.6 7 100001 CageFreeEggs 2009 17 300 540.00 10 270 486.00 11
CageFreeEggs.7 8 100001 CageFreeEggs 2009 18 270 486.00 11 300 540.00 10
CageFreeEggs.8 9 100001 CageFreeEggs 2009 19 315 567.00 11 270 486.00 11
CageFreeEggs.9 10 100001 CageFreeEggs 2010 9 315 569.25 12 315 567.00 11
You can use the back() function from the PERregress library on a single variable, and you can also create multiple lags. I typically use this function for auto-correlation in regression analysis.
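For comparison, the same one-period lags can also be built with dplyr; this is only a sketch, assuming the dat data frame created above and that the rows of each product are already in chronological order:
library(dplyr)
lagged <- dat %>%
  group_by(PROD) %>%                          # lag within each product
  arrange(Year, Week, .by_group = TRUE) %>%   # keep weeks in time order
  mutate(QTY_lag1      = lag(QTY),            # one-period lags
         SALES_lag1    = lag(SALES),
         INVOICES_lag1 = lag(INVOICES)) %>%
  ungroup()
The lagged columns could then be written back to the database with, for example, RODBC's sqlSave().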
I have a longitudinal dataset in long form with around 2,800 rows and around 400 participants in total. Here is a sample of my data.
# ID wave score sex age edu
#1 1001 1 28 1 69 12
#2 1001 2 27 1 70 12
#3 1001 3 28 1 71 12
#4 1001 4 26 1 72 12
#5 1002 1 30 2 78 9
#6 1002 3 30 2 80 9
#7 1003 1 30 2 65 16
#8 1003 2 30 2 66 16
#9 1003 3 29 2 67 16
#10 1003 4 28 2 68 16
#11 1004 1 22 2 85 4
#12 1005 1 20 2 60 9
#13 1005 2 18 1 61 9
#14 1006 1 22 1 74 9
#15 1006 2 23 1 75 9
#16 1006 3 25 1 76 9
#17 1006 4 19 1 77 9
I want to create a new column "cutoff" with the values "Normal" or "Impaired", because my outcome variable "score" has a cutoff score indicating impairment according to a norm. The norm consists of different -1.5 SD cutoff points depending on Sex, Edu (years of education), and Age.
Below is what I am currently doing: looking up an Excel file myself and entering the corresponding cutoff score for each combination of the three conditions. First of all, I am not sure I am creating the column correctly.
data$cutoff <- ifelse(data$sex==1 & data$age<70
& data$edu<3
& data$score<19.91, "Impaired", "Normal")
data$cutoff <- ifelse(data$sex==2 & data$age<70
& data$edu<3
& data$score<18.39, "Impaired", "Normal")
Additionally, I am wondering whether I can import an Excel file containing the norm and create the column from the values in it.
The Excel file has the structure shown below.
# Sex Male Female
#60-69 Edu(yr) 0-3 4-6 7-12 13>= 0-3 4-6 7-12 13>=
#Age Number 22 51 119 72 130 138 106 51
# Mean 24.45 26.6 27.06 27.83 23.31 25.86 27.26 28.09
# SD 3.03 1.89 1.8 1.53 3.28 2.55 1.85 1.44
# -1.5SD' 19.92 23.27 23.76 24.8 18.53 21.81 23.91 25.15
#70-79 Edu(yr) 0-3 4-6 7-12 13>= 0-3 4-6 7-12 13>=
....
I have created new columns "agecat" and "educat", allocating each ID to the age and education groups used in the norm. Now I want to make use of these columns by matching them to the rows and columns of the Excel file above. One motivation is to produce code that can be reused in further research based on this test's scores.
I think your ifelse statements should work fine, but I would definitely import the Excel file rather than hardcoding it, though you may need to structure it a bit differently. I would structure it just like a dataset, with columns for Sex, Edu, Age, Mean, SD, -1.5SD, etc., import it into R, then do a left outer join on Sex+Edu+Age:
merge(x = long_df, y = norm_df, by = c("Sex", "Edu(yr)", "Age"), all.x = TRUE)
Then you can compare the columns directly.
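For example, a minimal sketch of that workflow (the file name, sheet layout, and column names below are assumptions, not the actual norm file):
library(readxl)
# assumed long layout: one row per Sex x Edu(yr) x Age group, with a cutoff column,
# where Edu(yr) and Age hold the same category labels used in long_df
norm_df <- read_excel("norms.xlsx")
merged <- merge(x = long_df, y = norm_df, by = c("Sex", "Edu(yr)", "Age"), all.x = TRUE)
# flag scores below the group-specific cutoff
merged$impairment <- ifelse(merged$score < merged$cutoff, "Impaired", "Normal")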
If I understand correctly, the OP wants to flag a certain type of outlier in the dataset. So there are two tasks here:
Compute the statistics mean(score), sd(score), and the cutoff value mean(score) - 1.5 * sd(score) for each group of sex, age category agecat, and education category educat.
Find all rows where score is lower than the cutoff value for the particular group.
As already mentioned by hannes101, the second step can be implemented by a non-equi join.
library(data.table)
# categorize age and edu (left closed intervals)
mydata[, c("agecat", "educat") := .(cut(age, c(seq(0, 90, 10), Inf), right = FALSE),
cut(edu, c(0, 4, 7, 13, Inf), right = FALSE))][]
# compute statistics
cutoffs <- mydata[, .(.N, Mean = mean(score), SD = sd(score),
m1.5SD = mean(score) - 1.5 * sd(score)),
by = .(sex, agecat, educat)]
# non-equi update join
mydata[, cutoff := "Normal"]
mydata[cutoffs, on = .(sex, agecat, educat, score < m1.5SD), cutoff := "Impaired"][]
mydata
ID wave score sex age edu agecat educat cutoff
1: 1001 1 28 1 69 12 [60,70) [7,13) Normal
2: 1001 2 27 1 70 12 [70,80) [7,13) Normal
3: 1001 3 28 1 71 12 [70,80) [7,13) Normal
4: 1001 4 26 1 72 12 [70,80) [7,13) Normal
5: 1002 1 30 2 78 9 [70,80) [7,13) Normal
6: 1002 3 30 2 80 9 [80,90) [7,13) Normal
7: 1003 1 33 2 65 16 [60,70) [13,Inf) Normal
8: 1003 2 32 2 66 16 [60,70) [13,Inf) Normal
9: 1003 3 31 2 67 16 [60,70) [13,Inf) Normal
10: 1003 4 24 2 68 16 [60,70) [13,Inf) Impaired
11: 1004 1 22 2 85 4 [80,90) [4,7) Normal
12: 1005 1 20 2 60 9 [60,70) [7,13) Normal
13: 1005 2 18 1 61 9 [60,70) [7,13) Normal
14: 1006 1 22 1 74 9 [70,80) [7,13) Normal
15: 1006 2 23 1 75 9 [70,80) [7,13) Normal
16: 1006 3 25 1 76 9 [70,80) [7,13) Normal
17: 1006 4 19 1 77 9 [70,80) [7,13) Normal
18: 1007 1 33 2 65 16 [60,70) [13,Inf) Normal
19: 1007 2 32 2 66 16 [60,70) [13,Inf) Normal
20: 1007 3 31 2 67 16 [60,70) [13,Inf) Normal
21: 1007 4 31 2 68 16 [60,70) [13,Inf) Normal
ID wave score sex age edu agecat educat cutoff
In this made-up example there is only one row which meets the "Impaired" conditions.
Likewise, the statistics are rather sparsely populated:
cutoffs
sex agecat educat N Mean SD m1.5SD
1: 1 [60,70) [7,13) 2 23.00000 7.071068 12.39340
2: 1 [70,80) [7,13) 7 24.28571 3.147183 19.56494
3: 2 [70,80) [7,13) 1 30.00000 NA NA
4: 2 [80,90) [7,13) 1 30.00000 NA NA
5: 2 [60,70) [13,Inf) 8 30.87500 2.900123 26.52482
6: 2 [80,90) [4,7) 1 22.00000 NA NA
7: 2 [60,70) [7,13) 1 20.00000 NA NA
Data
OP's sample dataset has been modified in one group for demonstration.
library(data.table)
mydata <- fread("
# ID wave score sex age edu
#1 1001 1 28 1 69 12
#2 1001 2 27 1 70 12
#3 1001 3 28 1 71 12
#4 1001 4 26 1 72 12
#5 1002 1 30 2 78 9
#6 1002 3 30 2 80 9
#7 1003 1 33 2 65 16
#8 1003 2 32 2 66 16
#9 1003 3 31 2 67 16
#10 1003 4 24 2 68 16
#11 1004 1 22 2 85 4
#12 1005 1 20 2 60 9
#13 1005 2 18 1 61 9
#14 1006 1 22 1 74 9
#15 1006 2 23 1 75 9
#16 1006 3 25 1 76 9
#17 1006 4 19 1 77 9
#18 1007 1 33 2 65 16
#19 1007 2 32 2 66 16
#20 1007 3 31 2 67 16
#21 1007 4 31 2 68 16
", drop = 1L)
Starting from this SO question.
Example data.frame:
df = read.table(text = 'ID Day Count Count_group
18 1933 6 15
33 1933 6 15
37 1933 6 15
18 1933 6 15
16 1933 6 15
11 1933 6 15
111 1932 5 9
34 1932 5 9
60 1932 5 9
88 1932 5 9
18 1932 5 9
33 1931 3 4
13 1931 3 4
56 1931 3 4
23 1930 1 1
6 1800 6 12
37 1800 6 12
98 1800 6 12
52 1800 6 12
18 1800 6 12
76 1800 6 12
55 1799 4 6
6 1799 4 6
52 1799 4 6
133 1799 4 6
112 1798 2 2
677 1798 2 2
778 888 4 8
111 888 4 8
88 888 4 8
10 888 4 8
37 887 2 4
26 887 2 4
8 886 1 2
56 885 1 1
22 120 2 6
34 120 2 6
88 119 1 6
99 118 2 5
12 118 2 5
90 117 1 3
22 115 2 2
99 115 2 2', header = TRUE)
The Count column shows the number of ID values for each Day, and the Count_group column shows the sum of Count over each Day, Day - 1, Day - 2, Day - 3 and Day - 4.
e.g. Day 1933 has Count_group 15 because Count 6 (1933) + Count 5 (1932) + Count 3 (1931) + Count 1 (1930) + Count 0 (1929) = 15.
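To make the definition concrete, here is a small sketch that recomputes Count_group from Count as a rolling 5-day sum (assuming one Count value per Day, as in the example df):
# one row per Day with its Count
daily <- unique(df[, c("Day", "Count")])
# Count_group = sum of Count over Day, Day-1, ..., Day-4
daily$Count_group_check <- sapply(daily$Day, function(d) sum(daily$Count[daily$Day <= d & daily$Day >= d - 4]))
daily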
What I need to do is to duplicate observations so that each Count_group also contains the rows for its Day, Day - 1, Day - 2, Day - 3 and Day - 4.
e.g. Count_group = 15 is made up of the Count values of Days 1933, 1932, 1931, 1930 (and 1929, not present in the df), so all five days need to appear under Count_group = 15. The next one will be Count_group = 9, made up of 1932, 1931, 1930, 1929 and 1928; and so on.
Desired output:
ID Day Count Count_group
18 1933 6 15
33 1933 6 15
37 1933 6 15
18 1933 6 15
16 1933 6 15
11 1933 6 15
111 1932 5 15
34 1932 5 15
60 1932 5 15
88 1932 5 15
18 1932 5 15
33 1931 3 15
13 1931 3 15
56 1931 3 15
23 1930 1 15
111 1932 5 9
34 1932 5 9
60 1932 5 9
88 1932 5 9
18 1932 5 9
33 1931 3 9
13 1931 3 9
56 1931 3 9
23 1930 1 9
33 1931 3 4
13 1931 3 4
56 1931 3 4
23 1930 1 4
23 1930 1 1
6 1800 6 12
37 1800 6 12
98 1800 6 12
52 1800 6 12
18 1800 6 12
76 1800 6 12
55 1799 4 12
6 1799 4 12
52 1799 4 12
133 1799 4 12
112 1798 2 12
677 1798 2 12
55 1799 4 6
6 1799 4 6
52 1799 4 6
133 1799 4 6
112 1798 2 6
677 1798 2 6
112 1798 2 2
677 1798 2 2
778 888 4 8
111 888 4 8
88 888 4 8
10 888 4 8
37 887 2 8
26 887 2 8
8 886 1 8
56 885 1 8
37 887 2 4
26 887 2 4
8 886 1 4
56 885 1 4
8 886 1 2
56 885 1 2
56 885 1 1
22 120 2 6
34 120 2 6
88 119 1 6
99 118 2 6
12 118 2 6
90 117 1 6
88 119 1 6
99 118 2 6
12 118 2 6
90 117 1 6
22 115 2 6
99 115 2 6
99 118 2 5
12 118 2 5
90 117 1 5
22 115 2 5
99 115 2 5
90 117 1 3
22 115 2 3
99 115 2 3
22 115 2 2
99 115 2 2
(each block of rows above corresponds to one Count_group, i.e. to one Day together with the four preceding days)
I have several data.frames that are grouped by different numbers of days, so I would like to be able to adapt the code (with small changes) to each of them.
Thanks
A generalised version of my previous answer...
#first add grouping variables
days <- 5 #grouping no of days
df$smalldaygroup <- c(0,cumsum(sapply(2:nrow(df),function(i) df$Day[i]!=df$Day[i-1]))) #individual days
df$bigdaygroup <- c(0,cumsum(sapply(2:nrow(df),function(i) df$Day[i]<df$Day[i-1]-days+1))) #blocks of linked days
#duplicate days in each big group
df2 <- lapply(split(df,df$bigdaygroup),function(x) {
n <- max(x$Day)-min(x$Day)+1 #number of consecutive days in big group
dayvec <- (max(x$Day):min(x$Day)) #possible days in range
daylog <- dayvec[dayvec %in% x$Day] #actual days in range
pattern <- data.frame(base=rep(dayvec,each=days))
pattern$rep <- sapply(1:nrow(pattern),function(i) pattern$base[i]+1-sum(pattern$base[1:i]==pattern$base[i])) #indices to repeat
pattern$offset <- match(pattern$rep,daylog)-match(pattern$base,daylog) #offsets (used later)
pattern <- pattern[(pattern$base %in% x$Day) & (pattern$rep %in% x$Day),] #remove invalid elements
#store pattern in list as offsets needed in next loop
return(list(df=split(x,x$smalldaygroup)[match(pattern$rep,daylog)],pat=pattern))
})
#change the Count_group to previous value in added entries
df2 <- lapply(df2,function(L) lapply(1:length(L$df),function(i) {
x <- L$df[[i]]
offset <- L$pat$offset #pointer to day to copy Count_group from
x$Count_group <- L$df[[i-offset[i]]]$Count_group[1]
return(x)
}))
df2 <- do.call(rbind,unlist(df2,recursive=FALSE)) #bind back together
df2[,5:6] <- NULL #remove grouping variables
head(df2,30) #ignore rownames!
ID Day Count Count_group
01.1 18 1933 6 15
01.2 33 1933 6 15
01.3 37 1933 6 15
01.4 18 1933 6 15
01.5 16 1933 6 15
01.6 11 1933 6 15
02.7 111 1932 5 15
02.8 34 1932 5 15
02.9 60 1932 5 15
02.10 88 1932 5 15
02.11 18 1932 5 15
03.12 33 1931 3 15
03.13 13 1931 3 15
03.14 56 1931 3 15
04 23 1930 1 15
05.7 111 1932 5 9
05.8 34 1932 5 9
05.9 60 1932 5 9
05.10 88 1932 5 9
05.11 18 1932 5 9
06.12 33 1931 3 9
06.13 13 1931 3 9
06.14 56 1931 3 9
07 23 1930 1 9
08.12 33 1931 3 4
08.13 13 1931 3 4
08.14 56 1931 3 4
09 23 1930 1 4
010 23 1930 1 1
11.16 6 1800 6 12
I attach a rather mechanical method, but I believe it is a good starting point.
I have noticed that in your original table the entry
ID Day Count Count_group
18 1933 6 14
is duplicated; I have left it untouched for the sake of clarity.
Structure of the approach:
Read original data
Generate list of data frames, for each Day
Generate the final data frame by collapsing the list from step 2.
1. Read original data
We start with
df = read.table(text = 'ID Day Count Count_group
18 1933 6 14
33 1933 6 14
37 1933 6 14
18 1933 6 14
16 1933 6 14
11 1933 6 14
111 1932 5 9
34 1932 5 9
60 1932 5 9
88 1932 5 9
18 1932 5 9
33 1931 3 4
13 1931 3 4
56 1931 3 4
23 1930 1 1
6 1800 6 12
37 1800 6 12
98 1800 6 12
52 1800 6 12
18 1800 6 12
76 1800 6 12
55 1799 4 6
6 1799 4 6
52 1799 4 6
133 1799 4 6
112 1798 2 2
677 1798 2 2
778 888 4 7
111 888 4 7
88 888 4 7
10 888 4 7
37 887 2 4
26 887 2 4
8 886 1 2
56 885 1 1', header = TRUE)
# ordered vector of unique values for "Day"
ord_day <- unique(df$Day[order(df$Day)])
ord_day
[1] 885 886 887 888 1798 1799 1800 1930 1931 1932 1933
2. Generate list of data frames, for each Day
For each element in ord_day we introduce a data.frame as element of a list called df_new_aug.
Such data frames are defined through a for loop over all values in ord_day except ord_day[1] and ord_day[2], which are treated separately.
Idea behind the loop: for each unique ord_day[i] with i > 2 we check whether the days ord_day[i-1] and ord_day[i-2] (one of them, both, or neither) contribute (through the variable "Count") to the value of "Count_group" at ord_day[i].
We therefore introduce if else statements in the loop.
Here we go
# Recursive generation of the list of data.frames (for days > 886)
#-----------------------------------------------------------------
df_new <- list()
df_new_aug <- list()
# we exclude cases i=1, 2: they are manually treated below
for ( i in 3: length(ord_day) ) {
# is "Count_Group" for ord_day[i] equal to the sum of "Count" at ord_day[i-1] and ord_day[i-2]?
if ( unique(df[df$Day == ord_day[i], "Count_group"]) == unique(df[df$Day == ord_day[i], "Count"]) +
unique(df[df$Day == ord_day[i-1], "Count"]) + unique(df[df$Day == ord_day[i-2], "Count"])
) {
# we create columns ID | Day | Count
df_new[[i]] <- data.frame(df[df$Day == ord_day[i] | df$Day == ord_day[i-1] | df$Day == ord_day[i-2],
c("ID", "Day", "Count")])
# we append the Count_Group of the Day in ord_day[i]
df_new_aug[[i]] <- data.frame( df_new[[i]],
Count_group = rep(unique(df[df$Day == ord_day[i], "Count_group"]), nrow(df_new[[i]]) ) )
} else if (unique(df[df$Day == ord_day[i], "Count_group"]) == unique(df[df$Day == ord_day[i], "Count"]) +
unique(df[df$Day == ord_day[i-1], "Count"]) ) #only "Count" at i and i-1 contribute to "Count_group" at i
{
df_new[[i]] <- data.frame(df[df$Day == ord_day[i] | df$Day == ord_day[i-1],
c("ID", "Day", "Count")])
# we append the Count_Group of the Day in ord_day[i]
df_new_aug[[i]] <- data.frame(df_new[[i]],
Count_group = rep(unique(df[df$Day == ord_day[i], "Count_group"]), nrow(df_new[[i]]) ) )
} else { #only "Count" at i contributes to "Count_group" at i
df_new[[i]] <- data.frame(df[df$Day == ord_day[i],
c("ID", "Day", "Count")])
# we append the Count_Group of the Day in ord_day[i]
df_new_aug[[i]] <- data.frame(df_new[[i]],
Count_group = rep(unique(df[df$Day == ord_day[i], "Count_group"]), nrow(df_new[[i]]) ) )
}
#closing the for loop
}
# for ord_day[2] = "886" (both "Count" at i =2 and i = 1 contribute to "Count_group" at i=2)
#-------------------------------------------------------------------------------------
df_new[[2]] <- data.frame(df[df$Day == ord_day[2] | df$Day == ord_day[1],
c("ID", "Day", "Count")])
# we append the Count_Group of the Day in ord_day[2]
df_new_aug[[2]] <- data.frame(df_new[[2]],
Count_group = rep(unique(df[df$Day == ord_day[2], "Count_group"]), nrow(df_new[[2]]) ) )
# for ord_day[1] = "885" (only "count" at i = 1 contributes to "Count_group" at i =1)
#------------------------------------------------------------------------------------
df_new[[1]] <- data.frame(df[df$Day == ord_day[1], c("ID", "Day", "Count")])
# we append the Count_Group of the Day in ord_day[i]
df_new_aug[[1]] <- data.frame(df_new[[1]], Count_group = rep(unique(df[df$Day == ord_day[1], "Count_group"]), nrow(df_new[[1]]) ) )
# produced list
df_new_aug
3. Generate the final data frame by collapsing the list from step 2.
We collapse df_new_aug through an ugly loop, but other solutions (for example with Reduce() and merge()) are possible:
# merging the list (mechanically): final result
df_result <- df_new_aug[[1]]
for (i in 1:10){
df_result <- rbind(df_result, df_new_aug[[i+1]])
}
This yields df_result, and the analysis is complete.
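As noted above, the same collapse can also be done in a single call; a minimal sketch using the df_new_aug list built in step 2:
# stack all list elements into one data frame at once
df_result <- do.call(rbind, df_new_aug)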
My friend is currently working on an assignment about estimating the parameters of a time series model, SARIMAX (Seasonal ARIMA with Exogenous variables), using the Maximum Likelihood Estimation (MLE) method. His data are monthly rainfall values from 2000 to 2012, with the Indian Ocean Dipole (IOD) index as the exogenous variable.
Here is the data:
MONTH YEAR RAINFALL IOD
1 1 2000 15.3720526 0.0624
2 2 2000 10.3440804 0.1784
3 3 2000 14.6116392 0.3135
4 4 2000 18.6842179 0.3495
5 5 2000 15.2937896 0.3374
6 6 2000 15.0233152 0.1946
7 7 2000 11.1803399 0.3948
8 8 2000 11.0589330 0.4391
9 9 2000 10.1488916 0.3020
10 10 2000 21.1187121 0.2373
11 11 2000 15.3980518 -0.0324
12 12 2000 18.9393770 -0.0148
13 1 2001 19.1075901 -0.2448
14 2 2001 14.9097284 0.1673
15 3 2001 19.2379833 0.1538
16 4 2001 19.6900990 0.3387
17 5 2001 8.0684571 0.3578
18 6 2001 14.0463518 0.3394
19 7 2001 5.9916609 0.1754
20 8 2001 8.4439327 0.0048
21 9 2001 11.8321596 0.1648
22 10 2001 24.3700636 -0.0653
23 11 2001 22.3584436 0.0291
24 12 2001 23.6114379 0.1731
25 1 2002 17.8409641 0.0404
26 2 2002 14.7377067 0.0914
27 3 2002 21.2226294 0.1766
28 4 2002 16.6403125 -0.1512
29 5 2002 10.8074049 -0.1072
30 6 2002 6.3796552 0.0244
31 7 2002 17.0704423 0.0542
32 8 2002 1.7606817 0.0898
33 9 2002 5.3665631 0.6736
34 10 2002 8.3246622 0.7780
35 11 2002 17.8044938 0.3616
36 12 2002 16.7062862 0.0673
37 1 2003 13.5572859 -0.0628
38 2 2003 17.1113997 0.2038
39 3 2003 14.9899967 0.1239
40 4 2003 14.0996454 0.0997
41 5 2003 11.4017542 0.0581
42 6 2003 6.7749539 0.3490
43 7 2003 7.1484264 0.4410
44 8 2003 10.3004854 0.4063
45 9 2003 10.6630202 0.3289
46 10 2003 20.6518764 0.1394
47 11 2003 20.8638443 0.1077
48 12 2003 20.5548048 0.4093
49 1 2004 16.0436903 0.2257
50 2 2004 17.2568827 0.2978
51 3 2004 20.2361063 0.2523
52 4 2004 11.6619038 0.1212
53 5 2004 12.8296532 -0.3395
54 6 2004 8.4202138 -0.1764
55 7 2004 15.5916644 0.0118
56 8 2004 0.9486833 0.1651
57 9 2004 7.2732386 0.2825
58 10 2004 18.0083314 0.3747
59 11 2004 14.4672043 0.1074
60 12 2004 17.3637554 0.0926
61 1 2005 18.9420168 0.0551
62 2 2005 17.0146995 -0.3716
63 3 2005 23.3002146 -0.2641
64 4 2005 17.8689675 0.2829
65 5 2005 17.2365890 0.1883
66 6 2005 14.0178458 0.0347
67 7 2005 12.6925175 -0.0680
68 8 2005 9.3861600 -0.0420
69 9 2005 11.7132404 -0.1425
70 10 2005 18.5768673 -0.0514
71 11 2005 19.6723156 -0.0008
72 12 2005 18.3248465 -0.0659
73 1 2006 18.6252517 0.0560
74 2 2006 18.7002674 -0.1151
75 3 2006 23.4882950 -0.0562
76 4 2006 19.5652754 0.1862
77 5 2006 13.6857590 0.0105
78 6 2006 11.1265448 0.1504
79 7 2006 11.0227038 0.3490
80 8 2006 7.6550637 0.5267
81 9 2006 1.8708287 0.8089
82 10 2006 5.4129474 0.9479
83 11 2006 15.2249795 0.7625
84 12 2006 14.1703917 0.3941
85 1 2007 22.8691932 0.4027
86 2 2007 14.3317829 0.3353
87 3 2007 13.0766968 0.2792
88 4 2007 23.2335964 0.2960
89 5 2007 12.2474487 0.4899
90 6 2007 11.3357840 0.2445
91 7 2007 9.3112835 0.3629
92 8 2007 1.6431677 0.5396
93 9 2007 6.8483575 0.6252
94 10 2007 13.1529464 0.4540
95 11 2007 14.5120639 0.2489
96 12 2007 18.7909553 0.0054
97 1 2008 17.6493626 0.3037
98 2 2008 13.3828248 0.1166
99 3 2008 19.0525589 0.2730
100 4 2008 17.3262806 0.0467
101 5 2008 5.2345009 0.4020
102 6 2008 3.3166248 0.4263
103 7 2008 10.1094016 0.5558
104 8 2008 11.7260394 0.4236
105 9 2008 10.7470926 0.4762
106 10 2008 15.1591557 0.4127
107 11 2008 25.5558213 0.1474
108 12 2008 18.2455474 0.1755
109 1 2009 14.5430396 0.2185
110 2 2009 12.8569048 0.3521
111 3 2009 24.0707291 0.2680
112 4 2009 16.0374562 0.3234
113 5 2009 7.2387844 0.4757
114 6 2009 13.8021737 0.3078
115 7 2009 7.5232972 0.1179
116 8 2009 6.3403470 0.1999
117 9 2009 4.6583259 0.2814
118 10 2009 13.0958008 0.3646
119 11 2009 15.3329710 0.1914
120 12 2009 19.0394328 0.3836
121 1 2010 15.5080624 0.4732
122 2 2010 17.1551742 0.2134
123 3 2010 23.9729014 0.6320
124 4 2010 18.2537667 0.5644
125 5 2010 18.2236111 0.1881
126 6 2010 14.6082169 0.0680
127 7 2010 13.6161669 0.3111
128 8 2010 11.1220502 0.2472
129 9 2010 20.7870152 0.1259
130 10 2010 19.5371441 -0.0529
131 11 2010 24.8837296 -0.2133
132 12 2010 15.5016128 0.0233
133 1 2011 17.3435867 0.3739
134 2 2011 17.6096564 0.4228
135 3 2011 19.0682983 0.5413
136 4 2011 20.4890214 0.3569
137 5 2011 12.0540450 0.1313
138 6 2011 12.5896783 0.2642
139 7 2011 5.0990195 0.5356
140 8 2011 6.5726707 0.6490
141 9 2011 2.5099801 0.5884
142 10 2011 17.6380271 0.7376
143 11 2011 17.5128524 0.6004
144 12 2011 17.2655727 0.0990
145 1 2012 16.6883193 0.2272
146 2 2012 20.8374663 0.1049
147 3 2012 16.7002994 0.1991
148 4 2012 18.7962762 -0.0596
149 5 2012 16.9292646 -0.1165
150 6 2012 11.6490343 0.2207
151 7 2012 6.2529993 0.8586
152 8 2012 5.8991525 0.9473
153 9 2012 7.8485667 0.8419
154 10 2012 12.5817328 0.4928
155 11 2012 24.7770055 0.1684
156 12 2012 23.2486559 0.4899
He works in R because it has a package for analysing SARIMAX models, and so far he has been doing well with the arimax() function from the TSA package, using seasonal ARIMA order (1,0,1).
Here is his code:
#Importing data
data=read.csv("C:/DATA.csv", header=TRUE)
rainfall=data$RAINFALL
exo=data$IOD
#Creating the suitable model of data that is able to be read by R with ts() function
library(forecast)
rainfall_ts=ts(rainfall, start=c(2000, 1), end=c(2012, 12), frequency = 12)
exo_ts=ts(exo, start=c(2000, 1), end=c(2012, 12), frequency = 12)
#Fitting SARIMAX model with seasonal ARIMA order (1,0,1) & estimation method is MLE (or ML)
library(TSA)
model_ts=arimax(log(rainfall_ts), order=c(1,0,1), seasonal=list(order=c(1,0,1), period=12), xreg=exo_ts, method='ML')
Below is the result:
> model_ts
Call:
arimax(x = log(rainfall_ts), order = c(1, 0, 1), seasonal = list(order = c(1,
0, 1), period = 12), xreg = exo_ts, method = "ML")
Coefficients:
ar1 ma1 sar1 sma1 intercept xreg
0.5730 -0.4342 0.9996 -0.9764 2.6757 -0.4894
s.e. 0.2348 0.2545 0.0018 0.0508 0.1334 0.1489
sigma^2 estimated as 0.1521: log likelihood = -86.49, aic = 184.99
He says the code works, but his lecturer expected more.
Theoretically, because he used MLE, he has shown that setting the first derivatives of the log-likelihood function to zero yields only implicit solutions. This means the estimation cannot be completed analytically, so a numerical method is needed to finish it.
This is the lecturer's expectation: my friend should at least be able to show that the estimation truly has to be done numerically, and, if so, identify which numerical method R uses (e.g. Newton-Raphson, BFGS, BHHH, etc.).
The problem is that the arimax() function does not let us choose the numerical method to be used when the estimation has to be carried out numerically, as in the call below:
model_ts=arimax(log(rainfall_ts), order=c(1,0,1), seasonal=list(order=c(1,0,1), period=12), xreg=exo_ts, method='ML')
The 'method' argument above selects the estimation criterion, and the available options are ML, CSS, and CSS-ML. Clearly, the call says nothing about the numerical optimization method, and that is the issue.
So is there any way to find out which numerical method R uses? Or does my friend have to write his own program instead of relying on the arimax() function?
If there are any errors in my code, please let me know. I also apologize for any grammatical or vocabulary mistakes. English is not my native language.
Some suggestions:
Estimate the model with each of the methods ML, CSS, and CSS-ML, and check whether the parameter estimates agree (see the sketch after these suggestions).
You can view the source code of the arimax() function by typing arimax, View(arimax) or getAnywhere(arimax) in the console.
Or you can debug by placing a breakpoint before the line model_ts=arimax(...) and then sourcing or debugSource()-ing your script. You can then step into the arimax function and see/verify for yourself which optimization method arimax uses.
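As a sketch of the first suggestion (assuming the rainfall_ts and exo_ts objects created earlier), you could fit the model under each estimation method and compare the coefficients. For what it's worth, stats::arima(), on which arimax() is modelled, documents an optim.method argument whose default is "BFGS"; whether arimax() exposes the same argument is something you can verify from its source as described above.
library(TSA)
methods <- c("ML", "CSS", "CSS-ML")
fits <- lapply(methods, function(m)
  arimax(log(rainfall_ts), order = c(1, 0, 1),
         seasonal = list(order = c(1, 0, 1), period = 12),
         xreg = exo_ts, method = m))
names(fits) <- methods
# parameter estimates side by side, one column per estimation method
sapply(fits, function(f) f$coef)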
I have a data frame with panel data: subjects' characteristics through time. I need to create a column with a sequence from 1 to the maximum number of years for every subject. For example, if subject 1 is in the data frame from 2000 to 2005, I need the sequence 1, 2, 3, 4, 5, 6.
Below is a small fraction of my data. The last column (exp) is what I am trying to get. Additionally, if you look at the first subject (13), you'll see that in 2008 the value of qtty is zero. In that case I just need an NA or a code (0, 1, -9999); it doesn't matter which.
Below the data is the code I tried in order to get that vector, but it didn't work.
Any help will be much appreciated.
subject season qtty exp
13 2000 29 1
13 2001 29 2
13 2002 29 3
13 2003 29 4
13 2004 29 5
13 2005 27 6
13 2006 27 7
13 2007 27 8
13 2008 0 NA
28 2000 18 1
28 2001 18 2
28 2002 18 3
28 2003 18 4
28 2004 18 5
28 2005 18 6
28 2006 18 7
28 2007 18 8
28 2008 18 9
28 2009 20 10
28 2010 20 11
28 2011 20 12
28 2012 20 13
35 2000 21 1
35 2001 21 2
35 2002 21 3
35 2003 21 4
35 2004 21 5
35 2005 21 6
35 2006 21 7
35 2007 21 8
35 2008 21 9
35 2009 14 10
35 2010 11 11
35 2011 11 12
35 2012 10 13
My code:
numbY<-aggregate(season ~ subject, data = toCountY,length)
colnames(numbY)<-c("subject","inFish")
toCountY$inFish<-numbY$inFish[match(toCountY$subject,numbY$subject)]
numbYbyFisher<-unique(numbY)
seqY<-aggregate(numbYbyFisher$inFish, by=list(numbYbyFisher$subject), function(x)seq(1,x,1))
I am using ddply and I distinguish 2 cases:
Either you generate a sequence along each subject and replace it with NA where qtty is zero:
library(plyr)
ddply(dat, .(subject), transform, new.exp = ifelse(qtty == 0, NA, seq_along(subject)))
Or you generate a sequence along the rows where qtty is non-zero, with a jump (an NA) where qtty is zero:
ddply(dat, .(subject), transform, new.exp = {
  hh <- seq_along(which(qtty != 0))
  if (length(which(qtty == 0)) > 0)
    hh <- append(hh, NA, which(qtty == 0) - 1)
  hh
})
EDITED
# assuming the data frame is called dat (columns subject and qtty),
# with rows ordered by subject as in the example
ind <- dat$qtty != 0                         # rows with a non-zero qtty
exp <- rep(NA_real_, nrow(dat))              # NA (or any other code) where qtty is zero
temp <- list()
for (i in seq_along(unique(dat$subject[ind]))) {
  temp[[i]] <- seq(from = 1, to = table(dat$subject[ind])[i])
}
exp[ind] <- unlist(temp)
dat$exp <- exp
This will provide what you need.
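For reference, the first case (a running index per subject, with NA where qtty is zero) can also be done in base R with ave(); a sketch assuming the data frame is called dat:
dat$exp <- ave(dat$qtty, dat$subject, FUN = seq_along)  # 1, 2, ... within each subject
dat$exp[dat$qtty == 0] <- NA                            # blank out the zero-qtty years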
My dataframe, df:
df
EffYr EffMo count dts
2 2012 1 1 2012-01-01
3 2012 2 3 2012-02-01
4 2012 3 1 2012-03-01
5 2012 5 1 2012-05-01
6 2012 6 1 2012-06-01
7 2012 7 2 2012-07-01
8 2012 8 11 2012-08-01
9 2012 9 84 2012-09-01
10 2012 10 184 2012-10-01
11 2012 11 165 2012-11-01
12 2012 12 246 2012-12-01
13 2013 1 414 2013-01-01
14 2013 2 130 2013-02-01
15 2013 3 182 2013-03-01
16 2013 4 261 2013-04-01
17 2013 5 229 2013-05-01
18 2013 6 249 2013-06-01
19 2013 7 330 2013-07-01
20 2013 8 135 2013-08-01
Each row of df represents a "month-year", the earliest being Jan 2012 and the latest being Aug 2013. I want to plot a bar graph (using ggplot2) where each bar represents a row of df, with the bar height equal to that row's count, so there should be one bar per row.
I want my x axis to be divided into 12 intervals, Jan through Dec, and bars that represent the same calendar month should sit in the same "month interval". For example, if df had a row for Jan 2011, Jan 2012, and Jan 2013, then the Jan portion of my graph should have 3 bars, so that I can compare my business's performance in January across years.
Thanks
Edit: I want something that looks like
ggplot(diamonds, aes(cut, fill=cut)) + geom_bar() +
facet_grid(. ~ clarity)
But broken down by month. I tried to modify that code to fit my data, but never could get it right.
@Ben, you're asking a number of ggplot2 questions. I would recommend you sit down with some good ggplot2 resources and work through the examples to become more skilled. Here are 2 excellent resources I use often:
http://docs.ggplot2.org/current/
http://www.cookbook-r.com/Graphs/
Now the solution I think you're after:
dat <- read.table(text=" EffYr EffMo count dts
2 2012 1 1 2012-01-01
3 2012 2 3 2012-02-01
4 2012 3 1 2012-03-01
5 2012 5 1 2012-05-01
6 2012 6 1 2012-06-01
7 2012 7 2 2012-07-01
8 2012 8 11 2012-08-01
9 2012 9 84 2012-09-01
10 2012 10 184 2012-10-01
11 2012 11 165 2012-11-01
12 2012 12 246 2012-12-01
13 2013 1 414 2013-01-01
14 2013 2 130 2013-02-01
15 2013 3 182 2013-03-01
16 2013 4 261 2013-04-01
17 2013 5 229 2013-05-01
18 2013 6 249 2013-06-01
19 2013 7 330 2013-07-01
20 2013 8 135 2013-08-01", header=TRUE)
library(ggplot2)
dat$month <- factor(month.name[dat$EffMo], levels = month.name)
dat$year <- as.factor(dat$EffYr)
ggplot(dat, aes(month, fill = year)) + geom_bar(aes(weight = count), position = "dodge")
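If you prefer the faceted layout from your diamonds example, a variant along these lines should also work (a sketch reusing the dat, month, and year columns created above; geom_col() needs ggplot2 >= 2.2.0, otherwise use geom_bar(stat = "identity")):
ggplot(dat, aes(year, count, fill = year)) +
  geom_col() +              # bar height taken directly from count
  facet_grid(. ~ month)     # one panel per calendar month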