I have this data:
Year W L PTS GF GA S SA
1 2006 49 25 106 253 224 2380 2662
2 2007 51 23 110 266 207 2261 2553
3 2008 41 32 91 227 224 2425 2433
4 2009 40 34 88 207 228 2375 2398
5 2010 47 29 100 217 221 2508 2389
6 2011 44 27 99 213 190 2362 2506
7 2012 48 26 104 232 205 2261 2517
8 2014 38 32 88 214 233 2382 2365
9 2015 47 25 104 226 202 2614 2304
10 2016 41 27 96 224 213 2507 2231
11 2017 41 29 94 238 220 2557 2458
12 2018 53 18 117 261 204 2641 2650
I've built a VAR model from this data (it's hockey data for one team for the listed years). I converted the above into a time series with the ts() function and created this model:
VARselect(NSH_ts[, 3:5], lag.max = 8)
var1 <- VAR(NSH_ts[, 3:5], p = 2, type = "both", ic = c("AIC"))
serial.test(var1, type = "PT.adjusted")
forecast.var1 <- forecast(var1, h = 2)
autoplot(forecast.var1) +
scale_x_continuous(breaks = seq(2006, 2022))
I want to use the serial.test() function, but I get this error:
Error in t(Ci) %*% C0inv : non-conformable arguments
Why won't serial.test() work? (Overall, I'm trying to forecast PTS for the next two years based on the variables in the set.)
I've been using this as a guide: https://otexts.org/fpp2/VAR.html
I'm getting a different error, which may come from VARselect. My table is mostly -Inf entries, with one NaN, and the rest 0. Reducing lag.max gave me real numbers, and I had to adjust the other values as well.
VARselect(dfVAR[, 3:5], lag.max = 2)
var1 <- VAR(dfVAR[, 3:5], p = 1, type = "both", ic = c("AIC"))
serial.test(var1, lags.pt = 4, type = "PT.adjusted")
Portmanteau Test (adjusted)
data: Residuals of VAR object var1
Chi-squared = 35.117, df = 27, p-value = 0.1359
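A rough sketch of why a large lag.max can produce -Inf entries with only 12 yearly rows. It assumes (my reading of the vars implementation, not something stated above) that VARselect fits every candidate lag on the same n - lag.max rows, and that type = "both" adds a constant and a trend to each equation. lag_budget is a hypothetical helper, not part of any package.

```r
# Assumed setup from the question: 12 yearly rows, 3 series, type = "both".
n_obs <- 12
k_series <- 3
n_deterministic <- 2  # constant + trend (assumption for type = "both")

# Hypothetical helper: rows available vs. coefficients per equation
# for every candidate lag order up to lag_max.
lag_budget <- function(lag_max) {
  p <- seq_len(lag_max)
  data.frame(
    p = p,
    rows_used = n_obs - lag_max,
    coefs_per_equation = k_series * p + n_deterministic
  )
}

budget8 <- lag_budget(8)  # the original call: lag.max = 8
budget2 <- lag_budget(2)  # the adjusted call: lag.max = 2
print(budget8)
print(budget2)
```

When every candidate needs more coefficients than there are rows, the fit is exact, the residual covariance collapses to zero, and its log goes to -Inf, which would be consistent with the table described above.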
The non-conformable error means the matrix algebra isn't working: the number of columns in the first matrix has to match the number of rows in the second. Having no knowledge of VAR models, I can't offer help beyond this.
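A minimal illustration of that rule in base R, using made-up matrices (A and B are not taken from the VAR model):

```r
A <- matrix(1:6, nrow = 2)  # 2 rows, 3 columns
B <- matrix(1:6, nrow = 3)  # 3 rows, 2 columns

# Columns of A (3) match rows of B (3), so the product is defined.
AB <- A %*% B
print(dim(AB))

# Columns of A (3) do not match rows of A (2), so this fails.
# tryCatch captures the error message instead of stopping the script.
msg <- tryCatch(A %*% A, error = function(e) conditionMessage(e))
print(msg)
```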
These are the survey results. I have tried pairwise testing (pairwise.wilcox.test) on the results collected in Spring and Autumn for these sites, but I can't get a specific p-value showing which site has the most influence.
This is the error message I keep getting. My dataset isn't balanced, i.e. some of the sites were not surveyed in Spring, which I think may be the issue.
Error in wilcox.test.default(xi, xj, paired = paired, ...) :
'x' must be numeric
So I'm not sure whether I have laid the table out wrongly for seeing how much site influences the results between Spring and Autumn.
Site Autumn Spring
Stokes Bay 25 6
Stokes Bay 54 6
Stokes Bay 31 0
Gosport Wall 213 16
Gosport Wall 24 19
Gosport Wall 54 60
No Mans Land 76 25
No Mans Land 66 68
No Mans Land 229 103
Osbourne 1 77
Osbourne 1 92
Osbourne 1 92
Osbourne 2 114 33
Osbourne 2 217 114
Osbourne 2 117 64
Osbourne 3 204 131
Osbourne 3 165 85
Osbourne 3 150 81
Osbourne 4 124 15
Osbourne 4 79 64
Osbourne 4 176 65
Ryde Roads 217 165
Ryde Roads 182 63
Ryde Roads 112 53
Ryde Sands 386 44
Ryde Sands 375 25
Ryde Sands 147 45
Spit Bank 223 23
Spit Bank 78 29
Spit Bank 60 15
St Helen's 1 247 11
St Helen's 1 126 36
St Helen's 1 107 20
St Helen's 2 108 115
St Helen's 2 223 25
St Helen's 2 126 30
Sturbridge 58 43
Sturbridge 107 34
Sturbridge 156 0
Osbourne Deep 1 76 59
Osbourne Deep 1 64 52
Osbourne Deep 1 77 30
Osbourne Deep 2 153 60
Osbourne Deep 2 106 88
Osbourne Deep 2 74 35
Sturbridge Shoal 169 45
Sturbridge Shoal 19 84
Sturbridge Shoal 81 44
Mother's Bank 208
Mother's Bank 119
Mother's Bank 153
Ryde Middle 16
Ryde Middle 36
Ryde Middle 36
Stanswood 14 132
Stanswood 47 87
Stanswood 14 88
This is what I've done so far:
MWU <- read.csv(file.choose(), header = T)
#attach file to workspace
attach(MWU)
#Read column names of the data
colnames(MWU) # Site, Autumn, Spring
MWU.1 <- MWU[c(1,2,3)] #It included blank columns in the df
kruskal.test(MWU.1$Autumn ~ MWU.1$Site)
#Kruskal-Wallis rank sum test
#data: MWU.1$Autumn by MWU.1$Site
#Kruskal-Wallis chi-squared = 36.706, df = 24, p-value = 0.0468
kruskal.test(MWU.1$Spring ~ MWU.1$Site)
#Kruskal-Wallis rank sum test
#data: MWU.1$Spring by MWU.1$Site
#Kruskal-Wallis chi-squared = 35.134, df = 21, p-value = 0.02729
wilcox.test(MWU.1$Autumn, MWU.1$Spring, paired = T)
#Wilcoxon signed rank exact test
#data: MWU.1$Autumn and MWU.1$Spring
#V = 1066, p-value = 8.127e-08
#alternative hypothesis: true location shift is not equal to 0
#Tried this version too to see if it would give a summary of where the influence is.
pairwise.wilcox.test(MWU.1$Spring, MWU.1$Autumn)
#Error in wilcox.test.default(xi, xj, paired = paired, ...) : not enough (non-missing) 'x' observations
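For reference, pairwise.wilcox.test() takes a numeric response as its first argument and a grouping vector as its second, so passing two count columns makes it treat the Autumn counts as group labels. Below is a small sketch of the intended call shape, using the Autumn counts of three sites copied from the table above. The object names (autumn, res) and the choice of Holm adjustment are assumptions for illustration only.

```r
# Autumn counts for three of the sites listed in the table above.
autumn <- data.frame(
  Site  = rep(c("Stokes Bay", "Ryde Sands", "Spit Bank"), each = 3),
  Count = c(25, 54, 31, 386, 375, 147, 223, 78, 60)
)

# First argument: numeric response. Second argument: grouping vector.
res <- pairwise.wilcox.test(autumn$Count, autumn$Site,
                            p.adjust.method = "holm")
print(res$p.value)  # lower-triangular matrix of adjusted p-values
```

With only three observations per site, the smallest two-sided exact p-value for any single pair is 0.1, so a per-site comparison on this design has very little power regardless of how the table is laid out.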
I have three years of detection data. In each year there are 8 possible outcomes at a site. These are no, a, n, na, l, la, ln, lna. I've assigned the values below:
no = 0
a = 1
n = 1
na = 2
l = 100
la = 101
ln = 101
lna = 102
In year 2, I wish to calculate and label all outcomes, so any combination of 2 of the terms above, to describe a detection history numerically.
So essentially I'm trying to get a list of 64 terms ranging from no,no to lna,lna with their respective values.
For example, no,no = 0 and lna,lna = 204
In year 3, I wish for the same. I'd like to calculate and label all possibilities. This needs to be arranged in two columns, one with history text, and one with history value.
x1 x2
no,no,no 0
I'm sure this is possible, and possibly even basic. Though I have no idea where to begin.
Any help would be greatly appreciated.
Thanks in advance
I'm sure there are more elegant, concise ways to do it, but here's one approach:
Define the two lists of possibilities
poss = c("no", "a", "n", "na", "l", "la", "ln", "lna")
vals = c(0, 1, 1, 2, 100, 101, 101, 102)
Use expand.grid to enumerate the possibilities
output <- expand.grid(poss, poss, stringsAsFactors = FALSE)
comb_values <- expand.grid(vals, vals)
Write the output
output$names <- paste(output$Var1, output$Var2, sep = ",")
output$value <- comb_values$Var1 + comb_values$Var2
output$Var1 <- output$Var2 <- NULL
Result
names value
1 no,no 0
2 a,no 1
3 n,no 1
4 na,no 2
5 l,no 100
6 la,no 101
7 ln,no 101
8 lna,no 102
9 no,a 1
10 a,a 2
11 n,a 2
12 na,a 3
13 l,a 101
14 la,a 102
15 ln,a 102
16 lna,a 103
17 no,n 1
18 a,n 2
19 n,n 2
20 na,n 3
21 l,n 101
22 la,n 102
23 ln,n 102
24 lna,n 103
25 no,na 2
26 a,na 3
27 n,na 3
28 na,na 4
29 l,na 102
30 la,na 103
31 ln,na 103
32 lna,na 104
33 no,l 100
34 a,l 101
35 n,l 101
36 na,l 102
37 l,l 200
38 la,l 201
39 ln,l 201
40 lna,l 202
41 no,la 101
42 a,la 102
43 n,la 102
44 na,la 103
45 l,la 201
46 la,la 202
47 ln,la 202
48 lna,la 203
49 no,ln 101
50 a,ln 102
51 n,ln 102
52 na,ln 103
53 l,ln 201
54 la,ln 202
55 ln,ln 202
56 lna,ln 203
57 no,lna 102
58 a,lna 103
59 n,lna 103
60 na,lna 104
61 l,lna 202
62 la,lna 203
63 ln,lna 203
64 lna,lna 204
Same logic for three years: just replace poss, poss with poss, poss, poss (and vals, vals with vals, vals, vals), then include the third column in the pasted name and in the sum.
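A sketch of the three-year case, written out in full. It uses the values given in the question (no = 0, a = 1, n = 1, na = 2, l = 100, la = 101, ln = 101, lna = 102) and the x1/x2 column names the question asks for. The names y1, y2, y3, grid and histories are just illustrative.

```r
# Outcome labels and their values, as assigned in the question.
vals <- c(no = 0, a = 1, n = 1, na = 2, l = 100, la = 101, ln = 101, lna = 102)
poss <- names(vals)

# Every ordered combination of three yearly outcomes: 8^3 = 512 rows.
grid <- expand.grid(y1 = poss, y2 = poss, y3 = poss,
                    stringsAsFactors = FALSE)

# Look each label up in the named vector and add the three values.
histories <- data.frame(
  x1 = paste(grid$y1, grid$y2, grid$y3, sep = ","),
  x2 = unname(vals[grid$y1] + vals[grid$y2] + vals[grid$y3]),
  stringsAsFactors = FALSE
)

print(head(histories))
print(nrow(histories))
```

Looking the values up by name keeps labels and numbers in a single object, so the two cannot drift out of step the way two parallel vectors can.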
Data given are a sample of cholesterol levels taken from 24 hospital employees who were on a standard American diet and who agreed to adopt a vegetarian diet for 1 month. Serum-cholesterol measurements were made before adopting the diet and 1 month after.
Subject Before After Difference
1 1 195 146 49
2 2 145 155 -10
3 3 205 178 27
4 4 159 146 13
5 5 244 208 36
6 6 166 147 19
7 7 250 202 48
8 8 236 215 21
9 9 192 184 8
10 10 224 208 16
11 11 238 206 32
12 12 197 169 28
13 13 169 182 -13
14 14 158 127 31
15 15 151 149 2
16 16 197 178 19
17 17 180 161 19
18 18 222 187 35
19 19 168 176 -8
20 20 168 145 23
21 21 167 154 13
22 22 161 153 8
23 23 178 137 41
24 24 137 125 12
Now here is the question I am trying to answer. Some investigators believe that the effects of diet on cholesterol are more evident in people with high rather than low cholesterol levels. If you split the data according to whether baseline cholesterol is above or below the median, can you comment descriptively on this issue?
I am thinking of creating a boxplot based on two categories. I want to use dplyr for the data manipulation, so I will create a new column based on whether Before is below the median of Before. That gives a new character vector with "high" for high baseline cholesterol and "low" for low baseline cholesterol. Then I will draw a boxplot of Difference by this new column. Here is my code; the original data set is called df2.
df2 %>%
mutate(new_col = if_else(Before < median(Before), "low", "high")) %>%
group_by(new_col) %>%
ggplot(aes(x= new_col, y=Difference)) +
geom_boxplot()
And following is the boxplot I get
So, based on this, I conclude that investigators are right and effects of diet on cholesterol are more evident in people with high rather than low cholesterol levels. I want to know if this can be done more effectively.
This is more of a statistical planning question than a programming question, so it belongs on stats.stackexchange rather than Stack Overflow.
Anyway, categorizing a variable depending on the median is not the recommended way of visualizing associations, as you are suppressing a lot of information. You can read about this in this very good article by Peter Flom.
It is better to keep all the points and apply some spline or smoothing algorithm.
For instance, you could consider something like this:
ggplot(df2, aes(x= Before, y=Difference)) +
geom_point() +
geom_smooth()
Here, the relationship is clearly visible, and all the information is kept.
If you really have to generate subgroups, you could also try something like this:
df2 %>%
mutate(new_col = if_else(Before < median(Before), "low", "high")) %>%
ggplot(aes(x= Before, y=Difference, group=new_col, color=new_col)) +
geom_point() +
geom_smooth(span=3) #try some other values here
However, using the median is still not a very good idea, especially with that amount of data points. You might want to assess the functional form of the relationship, but that would need a specific question on stats.stackexchange.com.
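If a descriptive comparison of the two halves is still wanted alongside the plots, a numeric summary per group is one option. This is a base-R sketch that rebuilds the data from the table in the question; the names baseline and group_stats are made up for illustration, and the median split is kept only because the exercise asks for it.

```r
# Data copied from the table in the question.
df2 <- data.frame(
  Subject = 1:24,
  Before = c(195, 145, 205, 159, 244, 166, 250, 236, 192, 224, 238, 197,
             169, 158, 151, 197, 180, 222, 168, 168, 167, 161, 178, 137),
  After  = c(146, 155, 178, 146, 208, 147, 202, 215, 184, 208, 206, 169,
             182, 127, 149, 178, 161, 187, 176, 145, 154, 153, 137, 125)
)
df2$Difference <- df2$Before - df2$After

# Median split on the baseline value, as in the question.
df2$baseline <- ifelse(df2$Before < median(df2$Before), "low", "high")

# Group size, mean and median drop in cholesterol for each half.
group_stats <- do.call(rbind, lapply(
  split(df2$Difference, df2$baseline),
  function(d) data.frame(n = length(d),
                         mean_drop = mean(d),
                         median_drop = median(d))
))
print(group_stats)

# A split-free alternative: correlation between baseline and drop.
print(cor(df2$Before, df2$Difference))
```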
Not really an answer, but a different approach to visualising the data:
library( data.table )
library( ggplot2 )
DT.melt <- melt( DT, id.vars = "Subject", measure.vars = c( "Before", "After" ) )
ggplot() +
geom_line( data = DT.melt,
aes( x = variable, y = value, group = Subject ) ) +
geom_line( data = DT.melt[, .(mean = mean(value)), by = variable ],
aes( x = variable, y = mean, group = 1 ), color = "red", size = 2 ) +
labs( x = "", y = "" )
sample data used
DT <- fread(" Subject Before After Difference
1 195 146 49
2 145 155 -10
3 205 178 27
4 159 146 13
5 244 208 36
6 166 147 19
7 250 202 48
8 236 215 21
9 192 184 8
10 224 208 16
11 238 206 32
12 197 169 28
13 169 182 -13
14 158 127 31
15 151 149 2
16 197 178 19
17 180 161 19
18 222 187 35
19 168 176 -8
20 168 145 23
21 167 154 13
22 161 153 8
23 178 137 41
24 137 125 12")
When I run the command:
H <-length(table(data$Team))
n.h <- rep(5,H)
strata(data, stratanames=data$Team,size=n.h,method="srswor")
I get the error statement:
'Error in sort.list(y) : 'x' must be atomic for 'sort.list' Have you called 'sort' on a list?'
How can I get this stratified sample? The variable 'Team' is of type factor.
Data is as below:
zz <- "Team League.ID Player Salary POS G GS InnOuts PO A
ANA AL molinjo0 335000 C 73 57 1573 441 37
ANA AL percitr0 7833333 P 3 0 149 1 3
ARI NL bautida0 4000000 RF 141 135 3536 265 8
ARI NL estalbo0 550000 C 7 3 92 19 2
ARI NL finlest0 7000000 CF 104 102 2689 214 5
ARI NL koplomi0 330000 P 72 0 260 6 23
ARI NL sparkst0 500000 P 27 18 362 8 21
ARI NL villaos0 325000 P 17 0 54 0 4
ARI NL webbbr01 335000 P 33 35 624 13 41
ATL NL francju0 750000 1B 125 71 1894 627 48
ATL NL hamptmi0 14625000 P 35 29 517 13 37
ATL NL marreel0 3000000 LF 90 42 1125 80 4
ATL NL ortizru0 6200000 P 32 34 614 7 38
BAL AL surhobj0 800000 LF 100 31 805 69 0"
data <- read.table(text=zz, header=T)
This should work:
library(sampling)
H <- length(unique(data$Team))  # number of strata; works whether Team is a factor or a character column
n.h <- rep(5, H)
strata(data, stratanames=c("Team"), size=n.h, method="srswor")
stratanames should be a vector of column names, not a reference to the actual column data.
Update:
Now that example data is available, I see another problem: you are sampling without replacement (wor), but your sample sizes are bigger than the available data in some strata. You need to sample with replacement in this case:
smpl <- strata(data, stratanames=c("Team"), size=n.h, method="srswr")
BTW, you get the actual data with:
sampledData <- getdata(data, smpl)
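If sampling with replacement is not acceptable, another option is to cap the sample size at the size of each stratum. The following is a base-R sketch of that idea, not the sampling package's method. It uses only the Team and Player columns from the example data, and the names n_wanted, idx and sampled are made up.

```r
set.seed(42)  # only so the draw is reproducible

# Team and Player columns copied from the example data in the question.
data <- data.frame(
  Team = c("ANA", "ANA", rep("ARI", 7), rep("ATL", 4), "BAL"),
  Player = c("molinjo0", "percitr0", "bautida0", "estalbo0", "finlest0",
             "koplomi0", "sparkst0", "villaos0", "webbbr01", "francju0",
             "hamptmi0", "marreel0", "ortizru0", "surhobj0")
)

n_wanted <- 5

# Split the row numbers by team, then draw without replacement,
# taking at most as many rows as the team actually has.
idx <- unlist(lapply(split(seq_len(nrow(data)), data$Team), function(rows) {
  # sample.int on the length avoids sample()'s special case
  # when a stratum holds a single row number.
  rows[sample.int(length(rows), min(n_wanted, length(rows)))]
}))

sampled <- data[idx, ]
print(table(sampled$Team))
```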
This doesn't really answer your question, but a long time ago, I wrote a function called stratified that might be of use to you.
I've posted it here as a GitHub Gist.
Notice that when you have asked for samples that are bigger than your data, it just returns all of the relevant rows.
output <- stratified(data, "Team", 5)
# Some groups
# ---ANA, ATL, BAL---
# contain fewer observations than desired number of samples.
# All observations have been returned from those groups.
table(output$Team)
#
# ANA ARI ATL BAL
# 2 5 4 1
output
# Team League.ID Player Salary POS G GS InnOuts PO A
# 1 ANA AL molinjo0 335000 C 73 57 1573 441 37
# 2 ANA AL percitr0 7833333 P 3 0 149 1 3
# 9 ARI NL webbbr01 335000 P 33 35 624 13 41
# 7 ARI NL sparkst0 500000 P 27 18 362 8 21
# 8 ARI NL villaos0 325000 P 17 0 54 0 4
# 3 ARI NL bautida0 4000000 RF 141 135 3536 265 8
# 6 ARI NL koplomi0 330000 P 72 0 260 6 23
# 12 ATL NL marreel0 3000000 LF 90 42 1125 80 4
# 13 ATL NL ortizru0 6200000 P 32 34 614 7 38
# 10 ATL NL francju0 750000 1B 125 71 1894 627 48
# 11 ATL NL hamptmi0 14625000 P 35 29 517 13 37
# 14 BAL AL surhobj0 800000 LF 100 31 805 69 0
I'll add official documentation to the function at some point, but here's a summary to help you get the best use out of it:
The arguments to stratified are:
df: The input data.frame
group: A character vector of the column or columns that make up the "strata".
size: The desired sample size.
If size is a value less than 1, a proportionate sample is taken from each stratum.
If size is a single integer of 1 or more, that number of samples is taken from each stratum.
If size is a vector of integers, the specified number of samples is taken for each stratum. It is recommended that you use a named vector. For example, if you have two strata, "A" and "B", and you wanted 5 samples from "A" and 10 from "B", you would enter size = c(A = 5, B = 10).
select: This allows you to subset the groups in the sampling process. This is a list. For instance, if your group variable was "Group", and it contained three strata, "A", "B", and "C", but you only wanted to sample from "A" and "C", you can use select = list(Group = c("A", "C")).
replace: For sampling with replacement.
I am trying to take the following data and use it to create a table with the information broken down by state.
Here's the data:
> head(mydf2, 10)
lead_id buyer_account_id amount state
1 52055267 62 300 CA
2 52055267 64 264 CA
3 52055305 64 152 CA
4 52057682 62 75 NJ
5 52060519 62 750 OR
6 52060519 64 574 OR
15 52065951 64 152 TN
17 52066749 62 600 CO
18 52062751 64 167 OR
20 52071186 64 925 MN
I've already subset the data to just the states I'm interested in:
mydf2 = subset(mydf, state %in% c("NV","AL","OR","CO","TN","SC","MN","NJ","KY","CA"))
Here's an idea of what I'm looking for:
State Amount Count
NV 1 50
NV 2 35
NV 3 20
NV 4 15
AL 1 10
AL 2 6
AL 3 4
AL 4 1
...
For each state, I'm trying to find a count for each amount "level." I don't necessarily need to group the amount variable, but keep in mind that the amounts are not just 1, 2, 3, etc.:
> mydf$amount
[1] 300 264 152 75 750 574 113 152 750 152 675 489 188 263 152 152 600 167 34 925 375 156 675 152 488 204 152 152
[29] 600 489 488 75 152 152 489 222 563 215 452 152 152 75 100 113 152 150 152 150 152 452 150 152 152 225 600 620
[57] 113 152 150 152 152 152 152 152 152 152 640 236 152 480 152 152 200 152 560 152 240 222 152 152 120 257 152 400
Is there an elegant solution for this in R, or will I be stuck using Excel (yuck!)?
Here's my understanding of what you're trying to do:
Start with a simple data.frame with 26 states and amounts only ranging from 1 to 50 (which is much more restrictive than what you have in your example, where the range is much higher).
set.seed(1)
mydf <- data.frame(
state = sample(letters, 500, replace = TRUE),
amount = sample(1:50, 500, replace = TRUE)
)
head(mydf)
# state amount
# 1 g 28
# 2 j 35
# 3 o 33
# 4 x 34
# 5 f 24
# 6 x 49
Here's some straightforward tabulation. I've also removed any instances where frequency equals zero, and I've reordered the output by state.
temp1 <- data.frame(table(mydf$state, mydf$amount))
temp1 <- temp1[!temp1$Freq == 0, ]
head(temp1[order(temp1$Var1), ])
# Var1 Var2 Freq
# 79 a 4 1
# 157 a 7 2
# 391 a 16 1
# 417 a 17 1
# 521 a 21 1
# 1041 a 41 1
dim(temp1) # How many rows/cols
# [1] 410 3
Here's a little bit different tabulation. We are tabulating after grouping the "amount" values. Here, I've manually specified the breaks, but you could just as easily let R decide what it thinks is best.
temp2 <- data.frame(table(mydf$state,
cut(mydf$amount,
breaks = c(0, 12.5, 25, 37.5, 50),
include.lowest = TRUE)))
temp2 <- temp2[!temp2$Freq == 0, ]
head(temp2[order(temp2$Var1), ])
# Var1 Var2 Freq
# 1 a [0,12.5] 3
# 27 a (12.5,25] 3
# 79 a (37.5,50] 3
# 2 b [0,12.5] 2
# 28 b (12.5,25] 6
# 54 b (25,37.5] 5
dim(temp2)
# [1] 103 3
I am not sure if I understand correctly (you have two data.frames mydf and mydf2). I'll assume your data is in mydf. Using aggregate:
mydf$count <- 1:nrow(mydf)
aggregate(data = mydf, count ~ amount + state, length)
Is this what you are looking for?
Note: here count is a variable that is created just to get directly the output of the 3rd column as count.
Alternatives with ddply from plyr:
library(plyr)
# no need to create a variable called count
ddply(mydf, .(state, amount), summarise, count=length(lead_id))
Here, one could use any column that exists in one's data instead of lead_id, even state:
ddply(mydf, .(state, amount), summarise, count=length(state))
Or equivalently without using summarise:
ddply(mydf, .(state, amount), function(x) c(count=nrow(x)))
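For completeness, here is a base-R sketch of the same count on the ten rows shown by head(mydf2, 10) in the question, without a helper column and without plyr. The object name counts is made up; lead_id is used only as something to count.

```r
# The ten rows printed by head(mydf2, 10) in the question.
mydf2 <- data.frame(
  lead_id = c(52055267, 52055267, 52055305, 52057682, 52060519,
              52060519, 52065951, 52066749, 52062751, 52071186),
  buyer_account_id = c(62, 64, 64, 62, 62, 64, 64, 62, 64, 64),
  amount = c(300, 264, 152, 75, 750, 574, 152, 600, 167, 925),
  state = c("CA", "CA", "CA", "NJ", "OR", "OR", "TN", "CO", "OR", "MN")
)

# Count rows for every state/amount pair that actually occurs.
counts <- aggregate(lead_id ~ state + amount, data = mydf2, FUN = length)
names(counts)[3] <- "count"

# Order by state, then amount, to match the layout asked for.
counts <- counts[order(counts$state, counts$amount), ]
print(counts)
```

Unlike table(), aggregate() returns only the combinations that are present, so no zero-frequency rows need to be filtered out afterwards.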