Error with ARIMA in R

I'm trying to run an ARIMA on a temporal dataset that is in a .csv file. Here is my code so far:
Oil_all <- read.delim("/Users/Jkels/Documents/Introduction to Computational Statistics/Oil production.csv", sep="\t", header=TRUE, stringsAsFactors=FALSE)
Oil_all
The file looks like:
year.mbbl
1 1880,30
2 1890,77
3 1900,149
4 1905,215
5 1910,328
6 1915,432
7 1920,689
8 1925,1069
9 1930,1412
10 1935,1655
11 1940,2150
12 1945,2595
13 1950,3803
14 1955,5626
15 1960,7674
16 1962,8882
17 1964,10310
18 1966,12016
19 1968,14104
20 1970,16690
21 1972,18584
22 1974,20389
23 1976,20188
24 1978,21922
25 1980,21732
26 1982,19403
27 1984,19608
Code:
apply(Oil_all,1,function(x) sum(is.na(x)))
Results:
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
When I run ARIMA:
library(forecast)
auto.arima(Oil_all,xreg=year)
This is the error:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
0 (non-NA) cases
In addition: Warning message:
In data.matrix(data) : NAs introduced by coercion
So I was able to read in the data set, and it prints. However, when I check whether values are missing with the apply function, I see all 0's, so I suspect something is wrong, and that's probably why I'm getting the error. I'm just not sure what the error means or how to fix it in the code.
Any advice?

If I understood your question correctly, it should be something like:
Oil_all <- read.csv("myfolder/myfile.csv",header=TRUE)
## I don't have your source data, so I tried to reproduce it with the data you printed
Oil_all
year value
1 1880 30
2 1890 77
3 1900 149
4 1905 215
5 1910 328
6 1915 432
7 1920 689
8 1925 1069
9 1930 1412
10 1935 1655
11 1940 2150
12 1945 2595
13 1950 3803
14 1955 5626
15 1960 7674
16 1962 8882
17 1964 10310
18 1966 12016
19 1968 14104
20 1970 16690
21 1972 18584
22 1974 20389
23 1976 20188
24 1978 21922
25 1980 21732
26 1982 19403
27 1984 19608
library(forecast)
auto.arima(Oil_all$value,xreg=Oil_all$year)
Series: Oil_all$value
ARIMA(3,0,0) with non-zero mean
Coefficients:
ar1 ar2 ar3 intercept Oil_all$year
1.2877 0.0902 -0.4619 -271708.4 144.2727
s.e. 0.1972 0.3897 0.2275 107344.4 55.2108
sigma^2 estimated as 642315: log likelihood=-221.07
AIC=454.15 AICc=458.35 BIC=461.92
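If you also want to forecast ahead with this model, future values of the regressor have to be supplied. A minimal sketch along the same lines (the future years here are made up for illustration):
fit <- auto.arima(Oil_all$value, xreg = Oil_all$year)
forecast(fit, xreg = c(1986, 1988, 1990))  # one forecast step per supplied year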

Your import should be
Oil_all <- read.csv("/Users/Jkels/Documents/Introduction to Computational Statistics/Oil production.csv")
You are trying to import a comma-separated file as a tab-delimited one, and that is why your data look weird. I did the same as Nemesi and it worked. (Sorry, I do not have the reputation to comment.)
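To see what went wrong with the original import: with sep="\t" there are no tabs to split on, so every line is read as one character field, and auto.arima's coercion of that column to numeric is what produces the "NAs introduced by coercion" warning. A quick check along these lines (output paraphrased, not run against the actual file):
str(Oil_all)
# 'data.frame': 27 obs. of 1 variable:
#  $ year.mbbl: chr "1880,30" "1890,77" "1900,149" ...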

Related

Running Total with subtraction

I have a data set with closing and opening dates of public schools in California (available here, or as dput() output at the bottom of the question). The data also list what type of school each is and where it is located. I am trying to create a running-total column that also takes into account school closings as well as school type.
Here is the solution I've come up with, which basically entails encoding a lot of 1's and 0's for the different conditions using ifelse:
library(dplyr)     # filter(), select()
library(lubridate) # year()
# open charter schools
pubschls$open_chart <- ifelse(pubschls$Charter=="Y" & is.na(pubschls$ClosedDate)==TRUE, 1, 0)
# open public schools
pubschls$open_pub <- ifelse(pubschls$Charter=="N" & is.na(pubschls$ClosedDate)==TRUE, 1, 0)
# closed charters
pubschls$closed_chart <- ifelse(pubschls$Charter=="Y" & is.na(pubschls$ClosedDate)==FALSE, 1, 0)
# closed public schools
pubschls$closed_pub <- ifelse(pubschls$Charter=="N" & is.na(pubschls$ClosedDate)==FALSE, 1, 0)
lausd <- filter(pubschls, NCESDist=="0622710")
Then I subtract the columns from each other to get totals.
# count number open during each year
la_schools_count <- aggregate(lausd[c('open_chart','closed_chart','open_pub','closed_pub')],
by=list(year(lausd$OpenDate)), sum)
# find net charters by subtracting closed from open
la_schools_count$net_chart <- la_schools_count$open_chart - la_schools_count$closed_chart
# find net public schools by subtracting closed from open
la_schools_count$net_pub <- la_schools_count$open_pub - la_schools_count$closed_pub
# add running totals
la_schools_count$cum_chart <- cumsum(la_schools_count$net_chart)
la_schools_count$cum_pub <- cumsum(la_schools_count$net_pub)
# total totals
la_schools_count$total <- la_schools_count$cum_chart + la_schools_count$cum_pub
My output looks like this:
la_schools_count <- select(la_schools_count, "year", "cum_chart", "cum_pub", "pen_rate", "total")
year cum_chart cum_pub pen_rate total
1 1952 1 0 100.00000 1
2 1956 1 1 50.00000 2
3 1969 1 2 33.33333 3
4 1980 55 469 10.49618 524
5 1989 55 470 10.47619 525
6 1990 55 470 10.47619 525
7 1991 55 473 10.41667 528
8 1992 55 476 10.35782 531
9 1993 55 477 10.33835 532
10 1994 56 478 10.48689 534
11 1995 57 478 10.65421 535
12 1996 57 479 10.63433 536
13 1997 58 481 10.76067 539
14 1998 59 480 10.94620 539
15 1999 61 480 11.27542 541
16 2000 61 481 11.25461 542
17 2001 62 482 11.39706 544
18 2002 64 484 11.67883 548
19 2003 73 485 13.08244 558
20 2004 83 496 14.33506 579
21 2005 90 524 14.65798 614
22 2006 96 532 15.28662 628
23 2007 90 534 14.42308 624
24 2008 97 539 15.25157 636
25 2009 108 546 16.51376 654
26 2010 124 566 17.97101 690
27 2011 140 580 19.44444 720
28 2012 144 605 19.22563 749
29 2013 162 609 21.01167 771
30 2014 179 611 22.65823 790
31 2015 195 611 24.19355 806
32 2016 203 614 24.84700 817
33 2017 211 619 25.42169 830
I'm just wondering if this could be done in a better way, like an apply statement over all rows based on the conditions?
dput:
structure(list(CDSCode = c("19647330100289", "19647330100297",
"19647330100669", "19647330100677", "19647330100743", "19647330100750"
), OpenDate = structure(c(12324, 12297, 12240, 12299, 12634,
12310), class = "Date"), ClosedDate = structure(c(NA, 15176,
NA, NA, NA, NA), class = "Date"), Charter = c("Y", "Y", "Y",
"Y", "Y", "Y")), .Names = c("CDSCode", "OpenDate", "ClosedDate",
"Charter"), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
I followed your code and understood what you were doing, except for pen_rate. It seems that pen_rate is calculated by dividing cum_chart by total. I downloaded the original data set and did the following; I called the data set foo.
When creating the four groups (i.e., open_chart, closed_chart, open_pub, and closed_pub), I combined Charter and ClosedDate: I checked whether ClosedDate is NA or not and converted the logical output to numbers (1 = open, 0 = closed). I guess this asks you to do less typing. Since the dates are stored as characters, I extracted the year using substr(); if you have a Date object, you need to do something else. Once you have year, you group the data by it and count how many schools of each type exist using count(). This part is the equivalent of your aggregate() code. Then I converted the output to a wide format with spread() and did the rest of the calculation as you demonstrated in your code.
The final output looks different from what you have in your question, but my outcome was identical to the one I obtained by running your code. I hope this will help you.
library(dplyr)
library(tidyr)
library(readxl)
# Get the necessary data
foo <- read_xls("pubschls.xls") %>%
select(NCESDist, CDSCode, OpenDate, ClosedDate, Charter) %>%
filter(NCESDist == "0622710" & (!Charter %in% NA))
mutate(foo, group = paste(Charter, as.numeric(is.na(ClosedDate)), sep = "_"),
year = substr(OpenDate, start = nchar(OpenDate) - 3, stop = nchar(OpenDate))) %>%
count(year, group) %>%
spread(key = group, value = n, fill = 0) %>%
mutate(net_chart = Y_1 - Y_0,
net_pub = N_1 - N_0,
cum_chart = cumsum(net_chart),
cum_pub = cumsum(net_pub),
total = cum_chart + cum_pub,
pen_rate = cum_chart / total)
# A part of the outcome
# year N_0 N_1 Y_0 Y_1 net_chart net_pub cum_chart cum_pub total pen_rate
#1 1866 0 1 0 0 0 1 0 1 1 0.00000000
#2 1873 0 1 0 0 0 1 0 2 2 0.00000000
#3 1878 0 1 0 0 0 1 0 3 3 0.00000000
#4 1881 0 1 0 0 0 1 0 4 4 0.00000000
#5 1882 0 2 0 0 0 2 0 6 6 0.00000000
#110 2007 0 2 15 9 -6 2 87 393 480 0.18125000
#111 2008 2 8 9 15 6 6 93 399 492 0.18902439
#112 2009 1 9 4 15 11 8 104 407 511 0.20352250
#113 2010 5 26 5 21 16 21 120 428 548 0.21897810
#114 2011 2 16 2 18 16 14 136 442 578 0.23529412
#115 2012 2 27 3 7 4 25 140 467 607 0.23064250
#116 2013 1 5 1 19 18 4 158 471 629 0.25119237
#117 2014 1 3 1 18 17 2 175 473 648 0.27006173
#118 2015 0 0 2 18 16 0 191 473 664 0.28765060
#119 2016 0 3 0 8 8 3 199 476 675 0.29481481
#120 2017 0 5 0 9 9 5 208 481 689 0.30188679
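As a side note, spread() has since been superseded in tidyr by pivot_wider(); the equivalent reshaping step would be something like:
pivot_wider(names_from = group, values_from = n, values_fill = 0)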

How can I call for something in a data.frame when the distinction has to be made in two columns?

Sorry for the very specific question, but I have a file like this:
Adj Year man mt wm wmt by bytl gr grtl
3 careless 1802 0 126 0 54 0 13 0 51
4 careless 1803 0 166 0 72 0 1 0 18
5 careless 1804 0 167 0 58 0 2 0 25
6 careless 1805 0 117 0 5 0 5 0 7
7 careless 1806 0 408 0 88 0 15 0 27
8 careless 1807 0 214 0 71 0 9 0 32
...
560 mean 1939 21 5988 8 1961 0 1152 0 1512
561 mean 1940 20 5810 6 1965 1 914 0 1444
562 mean 1941 10 6062 4 2097 5 964 0 1550
563 mean 1942 8 5352 2 1660 2 947 2 1506
564 mean 1943 14 5145 5 1614 1 878 4 1196
565 mean 1944 42 5630 6 1939 1 902 0 1583
566 mean 1945 17 6140 7 2192 4 1004 0 1906
Now I have to look up specific values (e.g. [careless, 1804, man] or [mean, 1944, wmt]).
I have no clue how to do that; one possibility would be to split the data.frame and create an array, if I'm correct. But I'd love a simpler solution.
Thank you in advance!
Subsetting on the specific values in the Adj and Year columns and selecting the man column will give you the required output:
df[df$Adj == "careless" & df$Year == 1804, "man"]
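The second lookup from the question works the same way:
df[df$Adj == "mean" & df$Year == 1944, "wmt"]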

Spatial autocorrelation using Moran's I or other spatial overlap index

I have a file containing data from a bottom trawl survey. There are 102 haul points, each with associated coordinates (Lon, Lat); for every haul I calculated the density (DI, N/1km2) of the predator (merlmerDI N/1km2) and the densities of its preferred prey (the other columns) at the same coordinates.
I would like to derive a spatial overlap index that quantifies the affinity between the presence/absence of the predator and that of its prey (using the DI N/1km2 values), in order to justify the preferential choice of one prey over another (which is basically a choice of presence in the same shared area). I found Moran's I (precisely, the bivariate Moran's I), which returns a single number (1 when spatial correlation is high); comparing the data also produces a bivariate Moran scatter plot.
I should compare the hake (predator) separately with each of its prey, giving as many indices as there are prey.
Can someone help me? I don't know if it is right to use this index. Any ideas?
Year PrHN° Latitude Longitude Haul depth (m) Swept area (km2) merlmerDI N/1km2 tractraDI N/1km2 engrenDI N/1km2 sardpilDI N/1km2 papelonDI N/1km2
2004 1 37,5370 12,6067 51 0,044 137 69 0 891 0
2004 2 37,5433 12,8518 34 0,043 743 0 0 2067 0
2004 3 37,4757 12,9192 51 0,045 841 376 1350 5754 0
2004 4 37,3212 12,9258 310 0,076 4299 949 0 0 12223
2004 5 37,2868 12,8012 214 0,098 1729 366 0 0 4027
2004 6 37,1255 12,9703 331 0,103 77 29 0 0 2563
2004 7 37,0010 12,8058 391 0,099 192 0 0 0 6891
2004 8 37,0298 12,7738 388 0,103 156 0 0 0 5040
2004 9 37,2212 12,6082 158 0,049 2347 7000 0 0 3768
2004 10 37,3883 12,5287 151 0,045 2467 1102 0 0 5023
2004 11 37,2632 13,2430 130 0,049 2788 10298 0 0 66304
2004 12 37,1952 13,3478 136 0,048 952 16612 0 0 21412
2004 13 37,2642 13,4077 40 0,045 270 112 270 8764 0
2004 14 37,2472 13,4677 34 0,045 539 0 0 16854 0
2004 15 37,1348 13,6887 26 0,045 22 3461 0 12135 0
2004 16 36,9882 13,0683 337 0,101 99 50 0 0 3044
2004 17 37,0145 13,4638 619 0,102 10 10 0 0 79
2004 18 37,0800 13,5803 96 0,045 516 314 516 426 7063
2004 19 36,9162 13,6578 655 0,084 0 0 0 0 95
2004 20 36,8105 13,3932 413 0,102 108 0 0 0 2626
2004 22 36,5673 13,9302 586 0,103 29 0 0 0 652
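As a rough illustration of the statistic the question asks about, a bivariate Moran's I can be computed by hand: standardise the predator and prey densities, build a spatial weights matrix from the coordinates, and average the weighted cross-products. The sketch below is only that, a sketch: it assumes plain numeric vectors (the decimal commas in the table above would first need converting), uses inverse-distance, row-standardised weights, and the column names in the usage comment are shortened stand-ins for the ones in the data.
# Bivariate Moran's I: spatial cross-correlation of predator density x
# with prey density y observed at the same survey points.
bivariate_moran <- function(lon, lat, x, y) {
  zx <- scale(x)[, 1]                        # standardise both variables
  zy <- scale(y)[, 1]
  w <- 1 / as.matrix(dist(cbind(lon, lat)))  # inverse-distance weights
  diag(w) <- 0                               # no self-neighbours
  w <- w / rowSums(w)                        # row-standardise
  sum(w * outer(zx, zy)) / length(x)         # averaged weighted cross-product
}
# One index per predator-prey pair, e.g.:
# bivariate_moran(df$Longitude, df$Latitude, df$merlmerDI, df$tractraDI)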

Mutate with dplyr strange error

I'm trying to create new variables with mutate in dplyr, and I can't understand my error; I've tried everything and have not stumbled upon this issue in the past.
I have a large data set with over a million observations; I am only providing the first 20.
This is what my data looks like:
data1 <- read.table(header=TRUE, text="IDnr visit time year end event survival
7 1 04/09/06 2006 31/12/06 0 118
7 2 04/09/06 2007 31/12/07 0 483
7 3 04/09/06 2008 31/12/08 0 849
7 4 04/09/06 2009 31/12/09 0 1214
7 5 04/09/06 2010 31/12/10 0 1579
7 6 04/09/06 2011 31/12/11 0 1944
20 1 24/10/03 2003 31/12/03 0 68
20 2 24/10/03 2004 31/12/04 0 434
20 3 24/10/03 2005 31/12/05 0 799
20 4 24/10/03 2006 31/12/06 0 1164
20 5 24/10/03 2007 31/12/07 0 1529
20 6 24/10/03 2008 31/12/08 0 1895
20 7 24/10/03 2009 31/12/09 0 2260
20 8 24/10/03 2010 31/12/10 0 2625
20 9 24/10/03 2011 31/12/11 0 2990
87 1 17/01/06 2006 31/12/06 0 348
87 2 17/01/06 2007 31/12/07 0 713
87 3 17/01/06 2008 31/12/08 0 1079
87 4 17/01/06 2009 31/12/09 0 1444
87 5 17/01/06 2010 31/12/10 0 1809")
I must say that the date and time variables do not have this format in my data set; they are coded as POSIXct with the format "%Y-%m-%d". They somehow get reformatted when I paste them into the question and apply the code formatting.
Okay, the problem is that I'm trying to create new survival-time variables in the same data set. One is for a Cox regression model with start and stop times (survival is the stop time, and the new start variable should be called survcox).
I'm also trying to do a Poisson regression where the offset variable (i.e., the survival-time variable) should be called survpois. This is the code I'm trying to use:
data2 <- data1 %>%
group_by(IDnr) %>%
mutate(survcox = ifelse(visit==1, 0, lag(survival)),
year_aar = substr(data1$year, 1,4), first_day = as.POSIXct(paste0(year_aar, "-01-01-")),
survpois = as.numeric(data1$end - first_day)+1) %>%
mutate(survpois = ifelse(year_aar > first_day, as.numeric(end - year_aar),
survpois)) %>%
ungroup()
I receive an error in this step!
Error: incompatible size (1345000), expecting 6 (the group size) or 1
I have no idea why I get this error, what it means, or why my code doesn't work.
All the help I can get is appreciated, thanks in advance!
It's because you reference the variable as data1$year, which doesn't fit with grouped data (the same applies to data1$end).
I teased apart your code and found a few issues. One was the thing I mentioned in the comment above. The second was the class of end: if the data are as you provided, end is a factor; if this is the case in your own situation, you need to convert end to a date object. The other thing was year_aar > first_day: first_day is a date object, whereas year_aar is a character. Given those, I modified your code.
data1 %>%
group_by(IDnr) %>%
mutate(survcox = ifelse(visit == 1, 0, lag(survival)),
year_aar = substr(year, 1,4),
first_day = as.POSIXct(paste0(year_aar, "-01-01")),
survpois = as.numeric(as.POSIXct(end, format = "%d/%m/%y") - first_day) + 1) %>%
mutate(survpois = ifelse(as.numeric(year_aar) > as.numeric(format(first_day, "%Y")),
as.numeric(as.POSIXct(end, format = "%d/%m/%y") - first_day), survpois)) %>%
ungroup()
Here is a bit of the outcome.
# IDnr visit time year end event survival survcox year_aar first_day survpois
#1 7 1 04/09/06 2006 31/12/06 0 118 0 2006 2006-01-01 365
#2 7 2 04/09/06 2007 31/12/07 0 483 118 2007 2007-01-01 365
#3 7 3 04/09/06 2008 31/12/08 0 849 483 2008 2008-01-01 366
#4 7 4 04/09/06 2009 31/12/09 0 1214 849 2009 2009-01-01 365
#5 7 5 04/09/06 2010 31/12/10 0 1579 1214 2010 2010-01-01 365
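The sizing rule behind the original error can also be seen in isolation: inside a grouped mutate(), a bare column name yields only the current group's rows, while data1$year always yields the full column. A toy illustration (made-up data, not the asker's):
library(dplyr)
d <- data.frame(g = c(1, 1, 2), v = 1:3)
d %>% group_by(g) %>% mutate(ok = v * 2)       # fine: v has the group's length
# d %>% group_by(g) %>% mutate(bad = d$v * 2)  # fails: length 3 vs. group size 2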

R: High-frequency data statistical analysis

I'm working with tick data and would like some basic information about the distribution of changes in tick prices. My data set consists of tick data covering 10 trading days.
I've taken the first difference of the tick prices:
Tick spread
2010-02-02 08:00:04 -1
2010-02-02 08:00:04 1
2010-02-02 08:00:04 0
2010-02-02 08:00:04 0
2010-02-02 08:00:04 0
2010-02-02 08:00:04 -1
2010-02-02 08:00:05 1
2010-02-02 08:00:05 1
I've created an array which provides me with the first and last tick of each day:
Open Close
[1,] 1 59115
[2,] 59116 119303
[3,] 119304 207300
[4,] 207301 351379
[5,] 351380 426553
[6,] 426554 516742
[7,] 516743 594182
[8,] 594183 683840
[9,] 683841 754962
[10,] 754963 780725
I would like to know the empirical distribution of my tick spreads for each day.
I know that I can use the R function table(), but the problem is that it gives me a table object whose length varies from day to day. The second problem is that on some days I have spreads of 3 points, whereas on other days all spreads are less than 3 points.
First day's table() output:
-3 -2 -1 0 1 2 3
1 19 6262 46494 6321 16 2
Second day's table() output:
-2 -1 0 1 2 3 5
27 5636 48902 5588 33 1 1
What I would like is to create a data frame combining the table() outputs for my whole tick sample.
Any idea?
Thanks!
Just use a 2-dimensional table, using as.Date(index(x)) as the rows:
library(xts)  # for .xts() and index()
# create some example data
set.seed(21)
p <- sort(runif(6))*(1:6)^2
p <- c(p,rev(p)[-1])
p <- p/sum(p)
P <- sample(-5:5, 1e5, TRUE, p)
x <- .xts(P, (1:1e5)*5)
# create table
table(as.Date(index(x)), x)
# x
# -5 -4 -3 -2 -1 0 1 2 3 4 5
# 1970-01-01 22 141 527 1623 2968 6647 2953 1700 538 139 21
# 1970-01-02 31 142 548 1596 2937 6757 2874 1677 529 167 22
# 1970-01-03 26 172 547 1599 2858 6814 2896 1681 504 163 20
# 1970-01-04 23 178 537 1645 2855 6805 2891 1626 537 165 18
# 1970-01-05 23 139 490 1597 3028 6740 2848 1724 505 158 28
# 1970-01-06 21 134 400 1304 2266 5496 2232 1213 397 112 26
If you want the frequency distribution for the entire 10-day period, just concatenate the data and do the same. Is that what you want to do?
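To get the single data frame the question asks for, the two-way table converts directly; as.data.frame.matrix() keeps one row per day and one column per spread value:
tab <- table(as.Date(index(x)), x)
spread_df <- as.data.frame.matrix(tab)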
