I hope somebody can help me. I'm trying to conduct an analysis of the number of Hymenoptera specimens caught along an elevational gradient. I want to test for a unimodal relationship with elevation as well as a linear one, so I am including I(Altitude^2) as an explanatory variable in the analysis.
I am trying to run the following model, which uses a Poisson error structure (since these are count data) and includes Date and trap type (Trap) as random effects.
model7 <- glmer(No.Specimens~Altitude+I(Altitude^2)+(1|Date)+(1|Trap),
family="poisson",data=Santa.Lucia,na.action=na.omit)
However I keep receiving the following error message:
Error: (maxstephalfit) PIRLS step-halvings failed to reduce deviance in pwrssUpdate
In addition: Warning messages:
1: Some predictor variables are on very different scales: consider rescaling
2: In pwrssUpdate(pp, resp, tolPwrss, GQmat, compDev, fac, verbose) :
Cholmod warning 'not positive definite' at file:../Cholesky/t_cholmod_rowfac.c, line 431
3: In pwrssUpdate(pp, resp, tolPwrss, GQmat, compDev, fac, verbose) :
Cholmod warning 'not positive definite' at file:../Cholesky/t_cholmod_rowfac.c, line 431
Clearly I am making some big mistakes. Can anybody help me figure out where I am going wrong?
Here is the structure of the dataframe:
str(Santa.Lucia)
'data.frame': 97 obs. of 6 variables:
$ Date : Factor w/ 8 levels "01-Sep-2014",..: 6 6 6 6 6 6 6 6 6 6 ...
$ Trap.No : Factor w/ 85 levels "N1","N10","N11",..: 23 48 51 14 17 20 24 27 30 33 ...
$ Altitude : int 1558 1635 1703 1771 1840 1929 1990 2047 2112 2193 ...
$ Trail : Factor w/ 3 levels "Cascadas","Limones",..: 1 1 1 1 1 3 3 3 3 3 ...
$ No.Specimens: int 1 0 2 2 3 4 5 0 1 1 ...
$ Trap : Factor w/ 2 levels "Net","Pan": 2 2 2 2 2 2 2 2 2 2 ...
And here is the complete data set (these are only my preliminary analyses):
Date Trap.No Altitude Trail No.Specimens Trap
1 28-Aug-2014 W2 1558 Cascadas 1 Pan
2 28-Aug-2014 W5 1635 Cascadas 0 Pan
3 28-Aug-2014 W8 1703 Cascadas 2 Pan
4 28-Aug-2014 W11 1771 Cascadas 2 Pan
5 28-Aug-2014 W14 1840 Cascadas 3 Pan
6 28-Aug-2014 W17 1929 Tower 4 Pan
7 28-Aug-2014 W20 1990 Tower 5 Pan
8 28-Aug-2014 W23 2047 Tower 0 Pan
9 28-Aug-2014 W26 2112 Tower 1 Pan
10 28-Aug-2014 W29 2193 Tower 1 Pan
11 28-Aug-2014 W32 2255 Tower 0 Pan
12 30-Aug-2014 N1 1562 Cascadas 5 Net
13 30-Aug-2014 N2 1635 Cascadas 0 Net
14 30-Aug-2014 N3 1723 Cascadas 2 Net
15 30-Aug-2014 N4 1779 Cascadas 0 Net
16 30-Aug-2014 N5 1842 Cascadas 3 Net
17 30-Aug-2014 N6 1924 Tower 2 Net
18 30-Aug-2014 N7 1979 Tower 2 Net
19 30-Aug-2014 N8 2046 Tower 0 Net
20 30-Aug-2014 N9 2110 Tower 0 Net
21 30-Aug-2014 N10 2185 Tower 0 Net
22 30-Aug-2014 N11 2241 Tower 0 Net
23 31-Aug-2014 N1 1562 Cascadas 1 Net
24 31-Aug-2014 N2 1635 Cascadas 1 Net
25 31-Aug-2014 N3 1723 Cascadas 0 Net
26 31-Aug-2014 N4 1779 Cascadas 0 Net
27 31-Aug-2014 N5 1842 Cascadas 0 Net
28 31-Aug-2014 N6 1924 Tower 0 Net
29 31-Aug-2014 N7 1979 Tower 7 Net
30 31-Aug-2014 N8 2046 Tower 4 Net
31 31-Aug-2014 N9 2110 Tower 6 Net
32 31-Aug-2014 N10 2185 Tower 1 Net
33 31-Aug-2014 N11 2241 Tower 1 Net
34 01-Sep-2014 W1 1539 Cascadas 0 Pan
35 01-Sep-2014 W2 1558 Cascadas 0 Pan
36 01-Sep-2014 W3 1585 Cascadas 2 Pan
37 01-Sep-2014 W4 1604 Cascadas 0 Pan
38 01-Sep-2014 W5 1623 Cascadas 1 Pan
39 01-Sep-2014 W6 1666 Cascadas 4 Pan
40 01-Sep-2014 W7 1699 Cascadas 0 Pan
41 01-Sep-2014 W8 1703 Cascadas 0 Pan
42 01-Sep-2014 W9 1746 Cascadas 1 Pan
43 01-Sep-2014 W10 1762 Cascadas 0 Pan
44 01-Sep-2014 W11 1771 Cascadas 0 Pan
45 01-Sep-2014 W12 1796 Cascadas 1 Pan
46 01-Sep-2014 W13 1825 Cascadas 0 Pan
47 01-Sep-2014 W14 1840 Tower 4 Pan
48 01-Sep-2014 W15 1859 Tower 2 Pan
49 01-Sep-2014 W16 1889 Tower 2 Pan
50 01-Sep-2014 W17 1929 Tower 0 Pan
51 01-Sep-2014 W18 1956 Tower 0 Pan
52 01-Sep-2014 W19 1990 Tower 1 Pan
53 01-Sep-2014 W20 2002 Tower 3 Pan
54 01-Sep-2014 W21 2023 Tower 2 Pan
55 01-Sep-2014 W22 2047 Tower 0 Pan
56 01-Sep-2014 W23 2068 Tower 1 Pan
57 01-Sep-2014 W24 2084 Tower 0 Pan
58 01-Sep-2014 W25 2112 Tower 1 Pan
59 01-Sep-2014 W26 2136 Tower 0 Pan
60 01-Sep-2014 W27 2150 Tower 1 Pan
61 01-Sep-2014 W28 2193 Tower 1 Pan
62 01-Sep-2014 W29 2219 Tower 0 Pan
63 01-Sep-2014 W30 2227 Tower 1 Pan
64 01-Sep-2014 W31 2255 Tower 0 Pan
85 03/06/2015 WT47 1901 Tower 2 Pan
86 03/06/2015 WT48 1938 Tower 2 Pan
87 03/06/2015 WT49 1963 Tower 2 Pan
88 03/06/2015 WT50 1986 Tower 0 Pan
89 03/06/2015 WT51 2012 Tower 9 Pan
90 03/06/2015 WT52 2033 Tower 0 Pan
91 03/06/2015 WT53 2050 Tower 4 Pan
92 03/06/2015 WT54 2081 Tower 2 Pan
93 03/06/2015 WT55 2107 Tower 1 Pan
94 03/06/2015 WT56 2128 Tower 4 Pan
95 03/06/2015 WT57 2155 Tower 0 Pan
96 03/06/2015 WT58 2179 Tower 2 Pan
97 03/06/2015 WT59 2214 Tower 0 Pan
98 03/06/2015 WT60 2233 Tower 0 Pan
99 03/06/2015 WT61 2261 Tower 0 Pan
100 03/06/2015 WT62 2278 Tower 0 Pan
101 03/06/2015 WT63 2300 Tower 0 Pan
102 04/06/2015 WT31 1497 Cascadas 0 Pan
103 04/06/2015 WT32 1544 Cascadas 1 Pan
104 04/06/2015 WT33 1568 Cascadas 1 Pan
105 04/06/2015 WT34 1574 Cascadas 0 Pan
106 04/06/2015 WT35 1608 Cascadas 5 Pan
107 04/06/2015 WT36 1630 Cascadas 3 Pan
108 04/06/2015 WT37 1642 Cascadas 0 Pan
109 04/06/2015 WT38 1672 Cascadas 5 Pan
110 04/06/2015 WT39 1685 Cascadas 6 Pan
111 04/06/2015 WT40 1723 Cascadas 3 Pan
112 04/06/2015 WT41 1744 Cascadas 2 Pan
113 04/06/2015 WT42 1781 Cascadas 1 Pan
114 04/06/2015 WT43 1794 Cascadas 2 Pan
115 04/06/2015 WT44 1833 Cascadas 0 Pan
116 04/06/2015 WT45 1855 Cascadas 4 Pan
117 04/06/2015 WT46 1876 Cascadas 2 Pan
You're almost there. As @BondedDust suggests, it's not practical to use a two-level factor (Trap) as a random effect; in fact, it doesn't seem right in principle either (the levels of Trap are not arbitrary/randomly chosen/exchangeable). When I tried a model with quadratic altitude, a fixed effect of trap, and a random effect of Date, I was warned that I might want to rescale a parameter:
Some predictor variables are on very different scales: consider rescaling
(you saw this warning mixed in with your error messages). The only continuous (and hence worth rescaling) predictor is Altitude, so I centered and scaled it with scale(); the only disadvantage is that this changes the quantitative interpretation of the coefficients, but the model itself is practically identical. I also added an observation-level random effect to allow for overdispersion.
The results seem OK, and agree with the picture.
library(lme4)
Santa.Lucia <- transform(Santa.Lucia,
                         scAlt = scale(Altitude),                 # centered and scaled altitude
                         obs = factor(seq(nrow(Santa.Lucia))))    # observation-level random effect
model7 <- glmer(No.Specimens ~ scAlt + I(scAlt^2) + Trap + (1|Date) + (1|obs),
                family = "poisson", data = Santa.Lucia, na.action = na.omit)
summary(model7)
summary(model7)
## Random effects:
## Groups Name Variance Std.Dev.
## obs (Intercept) 0.64712 0.8044
## Date (Intercept) 0.02029 0.1425
## Number of obs: 97, groups: obs, 97; Date, 6
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.53166 0.31556 1.685 0.09202 .
## scAlt -0.22867 0.14898 -1.535 0.12480
## I(scAlt^2) -0.52840 0.16355 -3.231 0.00123 **
## TrapPan -0.01853 0.32487 -0.057 0.95451
Test the quadratic term by comparing with a model that lacks it ...
model7R <- update(model7, . ~ . - I(scAlt^2))
## convergence warning, but probably OK ...
anova(model7,model7R)
In principle it might be worth looking at the interaction between the quadratic altitude terms and Trap (allowing for different altitude trends by trap type), but the picture suggests it won't do much ...
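For example, a minimal sketch of such a model (using the scaled-altitude parameterization above; I haven't fit this to these data, so treat it as illustration only):
model7I <- glmer(No.Specimens ~ (scAlt + I(scAlt^2)) * Trap + (1|Date) + (1|obs),
                 family = "poisson", data = Santa.Lucia, na.action = na.omit)
anova(model7, model7I)   # does allowing trap-specific altitude trends improve the fit?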
library(ggplot2); theme_set(theme_bw())
ggplot(Santa.Lucia, aes(Altitude, No.Specimens, colour = Trap)) +
    stat_sum(aes(size = factor(..n..))) +
    scale_size_discrete(range = c(2, 4)) +
    geom_line(aes(group = Date), colour = "gray", alpha = 0.3) +
    geom_smooth(method = "gam", method.args = list(family = quasipoisson),
                formula = y ~ poly(x, 2)) +
    geom_smooth(method = "gam", method.args = list(family = quasipoisson),
                formula = y ~ poly(x, 2), se = FALSE,
                aes(group = 1), colour = "black")
The problem is almost surely that you are passing a character string to the data argument:
..., data = "Santa.Lucia", ...
?glmer says the data argument should be:
data: an optional data frame containing the variables named in
‘formula’. By default the variables are taken from the
environment from which ‘lmer’ is called. While ‘data’ is
optional, the package authors _strongly_ recommend its use,
especially when later applying methods such as ‘update’ and
‘drop1’ to the fitted model (_such methods are not guaranteed
to work properly if ‘data’ is omitted_). If ‘data’ is
omitted, variables will be taken from the environment of
‘formula’ (if specified as a formula) or from the parent
frame (if specified as a character vector).
The last part in parentheses, "if specified as a character vector", refers to what happens when formula is supplied as a character vector, not to passing data as a character string.
Correct your call to include data = Santa.Lucia and you should be good to go.
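In other words, something like this (simply the call from the question, with the data frame object itself rather than a string):
model7 <- glmer(No.Specimens ~ Altitude + I(Altitude^2) + (1|Date) + (1|Trap),
                family = "poisson", data = Santa.Lucia, na.action = na.omit)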
You've managed to use two different formats for Date. Here's a fix:
Santa.Lucia$Date2 <- ifelse(nchar(as.character(Santa.Lucia$Date)) > 10,
as.Date(Santa.Lucia$Date, format="%d-%b-%Y"),
as.Date(Santa.Lucia$Date, format="%d/%m/%Y") )
I tried a simpler model:
( model6 <-glmer(No.Specimens~Altitude+(1|Date2)+(1|Trap),family="poisson",data=Santa.Lucia,na.action=na.omit) )
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: poisson ( log )
Formula: No.Specimens ~ Altitude + (1 | Date2) + (1 | Trap)
Data: Santa.Lucia
AIC BIC logLik deviance df.resid
368.6522 378.9510 -180.3261 360.6522 93
Random effects:
Groups Name Std.Dev.
Date2 (Intercept) 0.2248
Trap (Intercept) 0.0000
Number of obs: 97, groups: Date2, 6; Trap, 2
Fixed Effects:
(Intercept) Altitude
1.3696125 -0.0004992
Warning messages:
1: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 0.0516296 (tol = 0.001, component 3)
2: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model is nearly unidentifiable: very large eigenvalue
- Rescale variables?;Model is nearly unidentifiable: large eigenvalue ratio
- Rescale variables?
I'm actually able to get my suggested modification to run without errors or warnings, but I don't think using those two grouping factors is right, because one predicts the other:
> table(Santa.Lucia$Date2, Santa.Lucia$Trap)
Net Pan
16310 0 11
16312 11 0
16313 11 0
16314 0 31
16589 0 17
16590 0 16
That's why you are getting non-convergence. It's not the error model that is at fault, but the pathology in your design and data collection. I question whether you really have sufficient data to support a mixed model:
( model5 <-glm(No.Specimens~Altitude,family="poisson",data=Santa.Lucia,na.action=na.omit) )
Call: glm(formula = No.Specimens ~ Altitude, family = "poisson", data = Santa.Lucia,
na.action = na.omit)
Coefficients:
(Intercept) Altitude
1.4218234 -0.0005391
Degrees of Freedom: 96 Total (i.e. Null); 95 Residual
Null Deviance: 215.3
Residual Deviance: 213.2 AIC: 368.6
To compare with a quadratic altitude model:
( model5.2 <-glm(No.Specimens~poly(Altitude,2),family="poisson",data=Santa.Lucia,na.action=na.omit) )
Call: glm(formula = No.Specimens ~ poly(Altitude, 2), family = "poisson",
data = Santa.Lucia, na.action = na.omit)
Coefficients:
(Intercept) poly(Altitude, 2)1 poly(Altitude, 2)2
0.3188 -1.7116 -3.9539
Degrees of Freedom: 96 Total (i.e. Null); 94 Residual
Null Deviance: 215.3
Residual Deviance: 194.6 AIC: 352
> anova(model5.2)
Analysis of Deviance Table
Model: poisson, link: log
Response: No.Specimens
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev
NULL 96 215.31
poly(Altitude, 2) 2 20.698 94 194.61
> anova(model5.2, model5)
Analysis of Deviance Table
Model 1: No.Specimens ~ poly(Altitude, 2)
Model 2: No.Specimens ~ Altitude
Resid. Df Resid. Dev Df Deviance
1 94 194.61
2 95 213.20 -1 -18.59
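If you want a p-value attached to that 18.59-unit difference in deviance on 1 df, you can ask anova() for a likelihood-ratio (chi-squared) test:
anova(model5, model5.2, test = "Chisq")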
Related
I'm unsure what I'm doing wrong. This is the data that I'm using:
dtf <- read.table(text=
"Litter Treatment Tube.L
1 Control 1641
2 Control 1290
3 Control 2411
4 Control 2527
5 Control 1930
6 Control 2158
1 GH 1829
2 GH 1811
3 GH 1897
4 GH 1506
5 GH 2060
6 GH 1207
1 FSH 3395
2 FSH 3113
3 FSH 2219
4 FSH 2667
5 FSH 2210
6 FSH 2625
1 GH+FSH 1537
2 GH+FSH 1991
3 GH+FSH 3639
4 GH+FSH 2246
5 GH+FSH 1840
6 GH+FSH 2217", header=TRUE)
What I did was:
BoarsMod1 <- aov(Tube.L ~ Litter + Treatment, data=dtf)
anova(BoarsMod1)
I'm getting an incorrect number of degrees of freedom for Litter: it should be 5 (since there are 6 litter blocks), but it is 1. Am I doing something wrong?
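A likely explanation (just a guess based on the read.table() call above): Litter is read in as an integer, so aov() treats it as a single numeric covariate (1 df) rather than a blocking factor (5 df). Converting it to a factor before fitting should give the expected degrees of freedom:
dtf$Litter <- factor(dtf$Litter)      # treat litter as a blocking factor, not a number
BoarsMod1 <- aov(Tube.L ~ Litter + Treatment, data = dtf)
anova(BoarsMod1)                      # Litter should now show 5 df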
I have some experimental data with a response variable (y) and multiple explanatory variables (x1 to x4). Zero values in the explanatory variable fields are the experimental controls.
x1 x2 x3 x4 y
0.082494269 1 16.43 328 -0.325
0.195673137 5 27.07 318 -0.3625
0.219937331 7 45.44 360 -0.525
0.059035684 4 32.68 203 -0.4125
0.381432485 8 71.38 167 -0.475
0.040377394 3 16.43 135 -0.425
0.055993298 1 21.88 154 -0.3875
0 0 0 0 -0.325
0.112635472 5 33.63 328 -0.3625
0.217039159 7 45.83 200 -0.475
0.035330022 2 17.78 117 -0.48
0.234216386 6 51.79 119 -0.45
0.085048722 6 39.71 98 -0.445
0.064759017 1 27.46 133 -0.3625
0.123863896 7 36.61 82 -0.4625
0.18932145 6 44.21 74 -0.425
0.409036425 8 62.38 154 -0.525
0 0 0 0 -0.275
0.493185115 8 120.3 103 -0.5625
0.132214199 5 26.61 222 -0.425
0 0 0 0 -0.3375
My regression models are usually bivariate (y~x1), but all of the x values here could be used as predictors. What I'm trying to do is find the point (if there is one) at which the y value changes significantly, using whichever combination of x variables is needed to detect that change.
I'm using R and have briefly looked at the 'chngpt' package, but it's a little beyond me at the moment. Has anyone had experience with this package, and is it appropriate for my purpose? Could anyone demonstrate how to do what I'm trying to do with the data provided?
Thanks
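For what it's worth, the basic idea behind a threshold/changepoint regression can be sketched in base R without the chngpt package: fit a broken-stick model at each candidate threshold and keep the one with the smallest residual sum of squares. The sketch below uses only y and x1, and assumes the data printed above sit in a data frame called dat (a hypothetical name):
cand <- quantile(dat$x1, probs = seq(0.1, 0.9, by = 0.05))   # candidate thresholds
rss <- sapply(cand, function(th) {
  fit <- lm(y ~ x1 + I(pmax(x1 - th, 0)), data = dat)        # hinge term: slope change at th
  sum(residuals(fit)^2)
})
best <- cand[which.min(rss)]                                 # estimated changepoint in x1
summary(lm(y ~ x1 + I(pmax(x1 - best, 0)), data = dat))      # is the slope change significant?
(A proper analysis would adjust the inference for having searched over thresholds, which is the kind of thing dedicated changepoint packages handle.)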
I am trying to calculate the residuals from a random forest cross validation. I am working with the response variable "Sales" in this data set. I want to put the residuals into a support vector machine. I am using the Carseats data set in R. Here is my code so far:
set.seed(1)
library(ISLR)
data(Carseats)
head(Carseats)
Sales CompPrice Income Advertising Population Price ShelveLoc
1 9.50 138 73 11 276 120 Bad
2 11.22 111 48 16 260 83 Good
3 10.06 113 35 10 269 80 Medium
4 7.40 117 100 4 466 97 Medium
5 4.15 141 64 3 340 128 Bad
6 10.81 124 113 13 501 72 Bad
Age Education Urban US sales
1 42 17 Yes Yes Yes
2 65 10 Yes Yes Yes
3 59 12 Yes Yes Yes
4 55 14 Yes Yes Yes
5 38 13 Yes No Yes
6 78 16 No Yes Yes
##Random forest
#cross validation to pick best mtry from 3,5,10
library(randomForest)
cv.carseats = rfcv(trainx=Carseats[,-1],trainy=Carseats[,1],cv.fold=5,step=0.9)
cv.carseats
with(cv.carseats,plot(n.var,error.cv,type="o"))
#from the graph it would appear mtry=5 produces the lowest error
##SVM
library(e1071)
#cross validation to pick best gamma
tune.out=tune(svm,Sales~.,data=Carseats,gamma=c(0.01,0.1,1,10),
tunecontrol = tune.control(cross=5))
I will replace "Sales" in the SVM with the residuals from the random forest cross validation. I am having a difficult time calculating the residuals from the random forest cross validation. Any help is greatly appreciated! Thank you!
I have two sets of panel data that I would like to merge. The problem is that, within each time interval, the variable that links the two data sets appears more often in the first data frame than in the second. My objective is to add each row from the second data set to its corresponding row in the first data set, even if that means copying the second data set's row multiple times within the same time interval.
Specifically, I am working with basketball data from the NBA. The first data set is a panel of Player and Date, while the second is one of Team (Tm) and Date. Thus, each team entry should be copied multiple times per date, once for each player on that team who played that day. I could do this easily in Excel, but the data frames are too large.
The merge below returns 0 observations of 52 variables. I've experimented with bind, match, and different versions of merge, and I've searched for everything I can think of, but nothing seems to address this issue specifically. Disclaimer: I am very new to R.
Here is my code up until my road block:
HGwd = "~/Documents/Fantasy/Basketball"
library(plm)
library(mice)
library(VIM)
library(nnet)
library(tseries)
library(foreign)
library(ggplot2)
library(truncreg)
library(boot)
Pdata = read.csv("2015-16PlayerData.csv", header = T)
attach(Pdata)
Pdata$Age = as.numeric(as.character(Pdata$Age))
Pdata$Date = as.Date(Pdata$Date, '%m/%e/%Y')
names(Pdata)[8] = "OppTm"
Pdata$GS = as.factor(as.character(Pdata$GS))
Pdata$MP = as.numeric(as.character(Pdata$MP))
Pdata$FG = as.numeric(as.character(Pdata$FG))
Pdata$FGA = as.numeric(as.character(Pdata$FGA))
Pdata$X2P = as.numeric(as.character(Pdata$X2P))
Pdata$X2PA = as.numeric(as.character(Pdata$X2PA))
Pdata$X3P = as.numeric(as.character(Pdata$X3P))
Pdata$X3PA = as.numeric(as.character(Pdata$X3PA))
Pdata$FT = as.numeric(as.character(Pdata$FT))
Pdata$FTA = as.numeric(as.character(Pdata$FTA))
Pdata$ORB = as.numeric(as.character(Pdata$ORB))
Pdata$DRB = as.numeric(as.character(Pdata$DRB))
Pdata$TRB = as.numeric(as.character(Pdata$TRB))
Pdata$AST = as.numeric(as.character(Pdata$AST))
Pdata$STL = as.numeric(as.character(Pdata$STL))
Pdata$BLK = as.numeric(as.character(Pdata$BLK))
Pdata$TOV = as.numeric(as.character(Pdata$TOV))
Pdata$PF = as.numeric(as.character(Pdata$PF))
Pdata$PTS = as.numeric(as.character(Pdata$PTS))
PdataPD = plm.data(Pdata, index = c("Player", "Date"))
attach(PdataPD)
Tdata = read.csv("2015-16TeamData.csv", header = T)
attach(Tdata)
Tdata$Date = as.Date(Tdata$Date, '%m/%e/%Y')
names(Tdata)[3] = "OppTm"
Tdata$MP = as.numeric(as.character(Tdata$MP))
Tdata$FG = as.numeric(as.character(Tdata$FG))
Tdata$FGA = as.numeric(as.character(Tdata$FGA))
Tdata$X2P = as.numeric(as.character(Tdata$X2P))
Tdata$X2PA = as.numeric(as.character(Tdata$X2PA))
Tdata$X3P = as.numeric(as.character(Tdata$X3P))
Tdata$X3PA = as.numeric(as.character(Tdata$X3PA))
Tdata$FT = as.numeric(as.character(Tdata$FT))
Tdata$FTA = as.numeric(as.character(Tdata$FTA))
Tdata$PTS = as.numeric(as.character(Tdata$PTS))
Tdata$Opp.FG = as.numeric(as.character(Tdata$Opp.FG))
Tdata$Opp.FGA = as.numeric(as.character(Tdata$Opp.FGA))
Tdata$Opp.2P = as.numeric(as.character(Tdata$Opp.2P))
Tdata$Opp.2PA = as.numeric(as.character(Tdata$Opp.2PA))
Tdata$Opp.3P = as.numeric(as.character(Tdata$Opp.3P))
Tdata$Opp.3PA = as.numeric(as.character(Tdata$Opp.3PA))
Tdata$Opp.FT = as.numeric(as.character(Tdata$Opp.FT))
Tdata$Opp.FTA = as.numeric(as.character(Tdata$Opp.FTA))
Tdata$Opp.PTS = as.numeric(as.character(Tdata$Opp.PTS))
TdataPD = plm.data(Tdata, index = c("OppTm", "Date"))
attach(TdataPD)
PD = merge(PdataPD, TdataPD, by = "OppTm", all.x = TRUE)
attach(PD)
Any help on how to do this would be greatly appreciated!
EDIT
I've tweaked it a little since last night, but still nothing seems to do the trick. See the updated code above for what I am currently using.
Here is the output for head(PdataPD):
Player Date Rk Pos Tm X..H OppTm W.L GS MP FG FGA FG. X2P
22408 Aaron Brooks 2015-10-27 817 G CHI CLE W 0 16 3 9 0.333 3
22144 Aaron Brooks 2015-10-28 553 G CHI # BRK W 0 16 5 9 0.556 3
21987 Aaron Brooks 2015-10-30 396 G CHI # DET L 0 18 2 6 0.333 1
21456 Aaron Brooks 2015-11-01 4687 G CHI ORL W 0 16 3 11 0.273 3
21152 Aaron Brooks 2015-11-03 4383 G CHI # CHO L 0 17 5 8 0.625 1
20805 Aaron Brooks 2015-11-05 4036 G CHI OKC W 0 13 4 8 0.500 3
X2PA X2P. X3P X3PA X3P. FT FTA FT. ORB DRB TRB AST STL BLK TOV PF PTS GmSc
22408 8 0.375 0 1 0.000 0 0 NA 0 2 2 0 0 0 2 1 6 -0.9
22144 3 1.000 2 6 0.333 0 0 NA 0 1 1 3 1 0 1 4 12 8.5
21987 2 0.500 1 4 0.250 0 0 NA 0 4 4 4 0 0 0 1 5 5.2
21456 6 0.500 0 5 0.000 0 0 NA 2 1 3 1 1 1 1 4 6 1.0
21152 3 0.333 4 5 0.800 0 0 NA 0 0 0 4 1 0 0 4 14 12.6
20805 5 0.600 1 3 0.333 0 0 NA 1 1 2 0 0 0 0 1 9 5.6
FPTS H.A
22408 7.50 H
22144 20.25 A
21987 16.50 A
21456 14.75 H
21152 24.00 A
20805 12.00 H
And for head(TdataPD):
OppTm Date Rk X Opp Result MP FG FGA FG. X2P X2PA X2P. X3P X3PA
2105 ATL 2015-10-27 71 DET L 94-106 240 37 82 0.451 29 55 0.527 8 27
2075 ATL 2015-10-29 41 # NYK W 112-101 240 42 83 0.506 32 59 0.542 10 24
2047 ATL 2015-10-30 13 CHO W 97-94 240 36 83 0.434 28 60 0.467 8 23
2025 ATL 2015-11-01 437 # CHO W 94-92 240 37 88 0.420 30 59 0.508 7 29
2001 ATL 2015-11-03 413 # MIA W 98-92 240 37 90 0.411 30 69 0.435 7 21
1973 ATL 2015-11-04 385 BRK W 101-87 240 37 76 0.487 29 54 0.537 8 22
X3P. FT FTA FT. PTS Opp.FG Opp.FGA Opp.FG. Opp.2P Opp.2PA Opp.2P. Opp.3P
2105 0.296 12 15 0.800 94 37 96 0.385 25 67 0.373 12
2075 0.417 18 26 0.692 112 38 93 0.409 32 64 0.500 6
2047 0.348 17 22 0.773 97 36 88 0.409 24 58 0.414 12
2025 0.241 13 14 0.929 94 32 86 0.372 18 49 0.367 14
2001 0.333 17 22 0.773 98 38 86 0.442 33 58 0.569 5
1973 0.364 19 24 0.792 101 36 83 0.434 31 62 0.500 5
Opp.3PA Opp.3P. Opp.FT Opp.FTA Opp.FT. Opp.PTS
2105 29 0.414 20 26 0.769 106
2075 29 0.207 19 21 0.905 101
2047 30 0.400 10 13 0.769 94
2025 37 0.378 14 15 0.933 92
2001 28 0.179 11 16 0.688 92
1973 21 0.238 10 13 0.769 87
If there is a way to truncate the output from dput(head(___)), I am not familiar with it. It appears that simply erasing the excess characters would remove entire variables from the dataset.
It would help if you posted your data (or a working subset of it) and a little more detail on how you are trying to merge. If I understand correctly, you want each final record to contain a player's individual stats for a particular date, followed by that player's team's stats for the same date. In that case, the Player table should have a team column identifying each player's team, and you then join the two tables on the composite key of Date and Team by setting the by= argument in merge:
merge(PData, TData, by=c("Date", "Team"))
The fact that the data frames have different numbers of rows doesn't matter; this is exactly what join/merge operations are for.
For an alternative to merge(), you might check out the dplyr package join functions at https://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html
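For example, a minimal sketch with dplyr (assuming, as in the merge() call above, that both tables carry Date and Team columns):
library(dplyr)
PD <- left_join(Pdata, Tdata, by = c("Date", "Team"))   # every player-date row picks up its team-date stats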
I'm trying to run an ARIMA on a temporal dataset that is in a .csv file. Here is my code so far:
Oil_all <- read.delim("/Users/Jkels/Documents/Introduction to Computational Statistics/Oil production.csv",
                      sep="\t", header=TRUE, stringsAsFactors=FALSE)
Oil_all
The file looks like:
year.mbbl
1 1880,30
2 1890,77
3 1900,149
4 1905,215
5 1910,328
6 1915,432
7 1920,689
8 1925,1069
9 1930,1412
10 1935,1655
11 1940,2150
12 1945,2595
13 1950,3803
14 1955,5626
15 1960,7674
16 1962,8882
17 1964,10310
18 1966,12016
19 1968,14104
20 1970,16690
21 1972,18584
22 1974,20389
23 1976,20188
24 1978,21922
25 1980,21732
26 1982,19403
27 1984,19608
Code:
apply(Oil_all,1,function(x) sum(is.na(x)))
Results:
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
When I run ARIMA:
library(forecast)
auto.arima(Oil_all,xreg=year)
This is the error:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
0 (non-NA) cases
In addition: Warning message:
In data.matrix(data) : NAs introduced by coercion
So I was able to read in the data set and it prints. However, when I check whether the values are present with the apply function, I see all 0's, so I know something's wrong, and that's probably why I'm getting the error. I'm just not sure what the error means or how to fix it in the code.
Any advice?
If I understood your question correctly, the import should look like this:
Oil_all <- read.csv("myfolder/myfile.csv",header=TRUE)
## I don't have your source data, so I tried to reproduce it with the data you printed
Oil_all
year value
1 1880 30
2 1890 77
3 1900 149
4 1905 215
5 1910 328
6 1915 432
7 1920 689
8 1925 1069
9 1930 1412
10 1935 1655
11 1940 2150
12 1945 2595
13 1950 3803
14 1955 5626
15 1960 7674
16 1962 8882
17 1964 10310
18 1966 12016
19 1968 14104
20 1970 16690
21 1972 18584
22 1974 20389
23 1976 20188
24 1978 21922
25 1980 21732
26 1982 19403
27 1984 19608
library(forecast)
auto.arima(Oil_all$value,xreg=Oil_all$year)
Series: Oil_all$value
ARIMA(3,0,0) with non-zero mean
Coefficients:
ar1 ar2 ar3 intercept Oil_all$year
1.2877 0.0902 -0.4619 -271708.4 144.2727
s.e. 0.1972 0.3897 0.2275 107344.4 55.2108
sigma^2 estimated as 642315: log likelihood=-221.07
AIC=454.15 AICc=458.35 BIC=461.92
Your import should be:
Oil_all<-read.csv("/Users/Jkels/Documents/Introduction to Computational Statistics/Oil production.csv")
That is why your data looks weird. (Sorry, I do not have the reputation to comment.) I did the same as Nemesi and it worked. I think you are trying to import a CSV as a tab-delimited file.