Convert data frame to time series for prediction in R - r

I retrieve data from MySQL in the following format:
date newCustomers
2016-07-27 31
2016-07-26 3
The data starts from date 2015-02-25 and there is an entry for each day.
I want to convert this data frame to time series for forecasting purposes.
I tried the following:
dataTimeSeries <- ts(data, start=c(2015,2,25), frequency=365.25)
It gave me the warning "In data.matrix(data) : NAs introduced by coercion". When I checked the contents of dataTimeSeries, this is what I found:
date day
2016.000 NA 31
2016.003 NA 3
2016.005 NA 2
2016.008 NA 0
What am I doing wrong, please point me in the right direction?
UPDATE: As suggested, I tried
dataTimeSeries <- ts(data$newCustomers, start=c(2015,2,25), frequency=365.25)
and it gave me the following result:
Time Series:
Start = 2015.00273785079
End = 2015.9993155373
Frequency = 365.25
[1] 31 3 2 0 101 69 8 4 15 3 1 22 47 85 359 6 7 2 134 44 20 61 2 0 4 2373 4243 7 31 11 2 0 25 1689 24 74
[37] 22 0 1 336 373 14 11 145 7 0 1 19 49 522 19 1 39 1611 9 675 21 1 45 4 156 180 747 265 169 0 0 4 7 3 4 10
[73] 64 1 3 5 2 13 15 0 6 0 13 2 13 10 5 14 16 28 134 8 2 0 0 9 29 7 79 17 1 4 167 6 64 334 14 0
[109] 0 13 17 57 66 3 0 0 25 2 4 22 16 2 0 23 23 169 9912 24 8 3 154 3 2 29 29 243 0 6 2 72 66 7 1 0
[145] 24 208 13 6 7 10 4 54 79 72 9 29 31 208 224 18 50 65 152 50 10 55 107 249 178 3 0 0 627 19 220 20 285 0 1 11
[181] 26 25 88 9 2 7 64 54 212 295 37 49 19 144 30 78 29 97 210 143 4 294 2 34 642 24 0 0 1 4 0 0 0 0 0 0
[217] 2 3 9 0 0 62 6 16 0 12 0 21 3 6 5 8 1 1 0 3 40 16 1 0 0 66 0 0 1 8 6 1 14 26 4 4
[253] 285 4 0 0 0 3 1 0 28 0 0 24 360 0 0 2 3 0 11 294 578 1 4 0 0 19 2 7 10 0 0 1 20 1 59 19
[289] 2 0 0 9 19 12 4 10 5 4 5 5 7 38 10 5 6 9 18 22 30 28 13 14 22 22 35 12 6 3 3 15 3 3 28 1
[325] 0 0 7 45 21 14 21 0 0 22 14 17 799 7 0 3 8 20 21 107 75 3 3 39 36 137 42 39 6 16 113 11 6 10 8 6
[361] 6 8 21 12 81
which is not correct.

This should work, because ts() takes only the values, not the dates:
dataTimeSeries <- ts(data$newCustomers, ...)
It is also possible that your observations are not regularly spaced. ts() assumes equally spaced observations, so it is best suited to data sets with a constant interval between observation dates. See "Analyzing Daily/Weekly data using ts in R" for other ways to analyze data that is not necessarily equally spaced in time.
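A minimal sketch of the conversion, under a few assumptions: the rows arrive newest-first (as the sample rows suggest), every day is present, and a frequency of 365 (ignoring leap days) is acceptable. For a daily series, start is given to ts() as c(year, day of year), not c(year, month, day); 2015-02-25 is day 56 of 2015. The small data frame below is a hypothetical stand-in for the MySQL result.

```r
# Hypothetical stand-in for the MySQL result: newest row first, one row per day.
data <- data.frame(
  date = c("2015-02-28", "2015-02-27", "2015-02-26", "2015-02-25"),
  newCustomers = c(4, 0, 2, 7),
  stringsAsFactors = FALSE
)

# Sort oldest-first so the first value lines up with the start date.
data$date <- as.Date(data$date)
data <- data[order(data$date), ]

# Express the start as c(year, day of year).
first_day  <- data$date[1]
start_year <- as.integer(format(first_day, "%Y"))
start_doy  <- as.integer(format(first_day, "%j"))

# frequency = 365 is an assumption; it treats every year as 365 days long.
dataTimeSeries <- ts(data$newCustomers,
                     start = c(start_year, start_doy),
                     frequency = 365)
start(dataTimeSeries)
```

Sorting before the call matters because ts() simply attaches consecutive time stamps to the values in the order they are given.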

Related

Creating a table viable for t.test function

I am given a data frame with multiple variables, but I am only interested in two of them and need to split the observations into two groups (group 1: mean age at childbirth with 10+ years of education; group 2: mean age at childbirth with less than 10 years of education). I am trying to put this into a table, but I am having trouble grouping the rows by years of education. I currently have a table produced by the following code:
'''
means<-table(bfeed_df$ybirth,bfeed_df$yschool)
'''
giving me:
'''
3 6 7 8 9 10 11 12 13 14 15 16 17 18 19
78 0 0 2 2 5 8 8 26 1 2 1 0 0 0 0
79 1 2 2 3 6 12 12 38 10 5 0 0 0 0 0
80 0 0 0 5 10 11 13 38 10 5 2 0 0 0 0
.
.
'''
I want:
<10years +10years
78 9 46
79 14 77
80 15 88
. . .
. . .
# Let's generate some fake data that matches your input
temp = matrix(sample(60,60), ncol = 15)
colnames(temp) = c(3,6,7,8,9,10,11,12,13,14,15,16,17,18,19)
rownames(temp) = c(78, 79, 80, 81)
# 3 6 7 8 9 10 11 12 13 14 15 16 17 18 19
# 78 5 4 21 13 18 17 34 43 19 41 55 36 12 52 15
# 79 56 14 38 28 30 25 8 44 35 59 39 49 20 2 58
# 80 22 27 3 9 33 54 26 50 53 45 10 40 48 7 6
# 81 42 46 23 1 60 57 47 16 24 51 37 32 11 29 31
Now we can create the summations using apply. Columns 1 to 5 hold 3 to 9 years of education, and columns 6 to 15 hold 10 to 19 years:
sums = t(apply(temp, 1, function(x) c(sum(x[1:5]), sum(x[6:15])) ))
colnames(sums) = c("<10y","+10y")
sums
> sums
<10y +10y
78 61 324
79 166 339
80 94 339
81 172 335
Is this what you are looking for?
You can use cut to divide yschool in two categories and use it in table.
means <- table(bfeed_df$ybirth, cut(bfeed_df$yschool, c(-Inf, 10, Inf), right = FALSE))
colnames(means) <- c('<10years', '10+years')
means
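As a small illustration of the cut() approach: cut() closes intervals on the right by default, so passing right = FALSE is what keeps exactly 10 years of schooling in the upper group. The data frame bfeed_demo below is made up for the example; only the column names ybirth and yschool come from the question.

```r
# Made-up miniature of bfeed_df with the two columns of interest.
bfeed_demo <- data.frame(
  ybirth  = c(78, 78, 78, 79, 79, 80),
  yschool = c(9, 10, 12, 8, 10, 9)
)

# right = FALSE gives the intervals [-Inf, 10) and [10, Inf),
# so yschool == 10 is counted as 10+ years.
school_group <- cut(bfeed_demo$yschool,
                    breaks = c(-Inf, 10, Inf),
                    right  = FALSE,
                    labels = c("<10years", "10+years"))

counts <- table(bfeed_demo$ybirth, school_group)
counts
```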

How to test for p-value with groups/filters in dplyr

My data looks like the example below. (sorry if it's too long, not sure what's acceptable/needed).
I have used the following code to calculate the median and IQR of each time difference (tdif) between tests (testno):
data %>% group_by(testno) %>% filter(type ==1) %>%
summarise(Median = median(tdif), IQR= IQR(tdif), n= n(), .groups = 'keep') -> result
I have done this for each category of 'type' (coded as 1 - 10), which brought me to the added table (bottom).
My question is whether it is possible to:
1- Do this in an easier way (without the filters, so I can do it all in one run), and
2- Run a test for a p-value across all the groups/filters?
data <- read.table(header=TRUE, text= '
PID time tdif testno type
3 205 0 1 1
4 77 0 1 1
4 85 8 2 1
4 126 41 3 1
4 165 39 4 1
4 202 37 5 1
4 238 36 6 1
4 272 34 7 1
4 277 5 8 1
4 370 93 9 1
4 397 27 10 1
4 452 55 11 1
4 522 70 12 1
4 529 7 13 1
4 608 79 14 1
4 651 43 15 1
4 655 4 16 1
4 713 58 17 1
4 804 91 18 1
4 900 96 19 1
4 944 44 20 1
4 979 35 21 1
4 1015 36 22 1
4 1051 36 23 1
4 1077 26 24 1
4 1124 47 25 1
4 1162 38 26 1
4 1222 60 27 1
4 1334 112 28 1
4 1383 49 29 1
4 1457 74 30 1
4 1506 49 31 1
4 1590 84 32 1
4 1768 178 33 1
4 1838 70 34 1
4 1880 42 35 1
4 1915 35 36 1
4 1973 58 37 1
4 2017 44 38 1
4 2090 73 39 1
4 2314 224 40 1
4 2381 67 41 1
4 2433 52 42 1
4 2484 51 43 1
4 2694 210 44 1
4 2731 37 45 1
4 2792 61 46 1
4 2958 166 47 1
5 48 0 1 3
5 111 63 2 3
5 699 588 3 3
5 1077 378 4 3
6 -43 0 1 3
8 67 0 1 1
8 168 101 2 1
8 314 146 3 1
8 368 54 4 1
8 586 218 5 1
10 639 0 1 6
13 -454 0 1 3
13 -384 70 2 3
13 -185 199 3 3
13 193 378 4 3
13 375 182 5 3
13 564 189 6 3
13 652 88 7 3
13 669 17 8 3
13 718 49 9 3
14 704 0 1 8
15 -165 0 1 3
15 -138 27 2 3
15 1335 1473 3 3
16 168 0 1 6
18 -1329 0 1 3
18 -1177 152 2 3
18 -1071 106 3 3
18 -945 126 4 3
18 -834 111 5 3
18 -719 115 6 3
18 -631 88 7 3
18 -497 134 8 3
18 -376 121 9 3
18 -193 183 10 3
18 -78 115 11 3
18 -13 65 12 3
18 100 113 13 3
18 196 96 14 3
18 552 356 15 3
18 650 98 16 3
18 737 87 17 3
18 804 67 18 3
18 902 98 19 3
18 983 81 20 3
18 1119 136 21 3
19 802 0 1 1
19 1593 791 2 1
26 314 0 1 8
26 389 75 2 8
26 597 208 3 8
33 639 0 1 6
')
Added table (values differ from example data, because this isn't the complete set).
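One possible way to get all summaries in a single run is to group by type and testno together rather than filtering one type at a time. The sketch below uses a made-up miniature data frame with the same column names. The Kruskal-Wallis test is only an assumed choice of test for comparing tdif across types; it treats rows as independent and so ignores the repeated measurements per PID.

```r
library(dplyr)

# Made-up miniature of the posted data (same column names).
demo <- data.frame(
  type   = c(1, 1, 1, 3, 3, 3),
  testno = c(1, 2, 2, 1, 2, 2),
  tdif   = c(0, 8, 10, 0, 63, 70)
)

# Group by type and testno together instead of filtering per type.
result <- demo %>%
  group_by(type, testno) %>%
  summarise(Median = median(tdif), IQR = IQR(tdif), n = n(),
            .groups = "drop")

# Kruskal-Wallis rank-sum test of tdif across types (assumed choice of test).
kw <- kruskal.test(tdif ~ type, data = demo)
kw$p.value
```

The result has one row per combination of type and testno, so it can be filtered or reshaped afterwards to reproduce the per-type tables.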

how to fix “No appropriate likelihood could be inferred” for network meta-analysis in R?

I am currently learning network meta-analysis in R with "gemtc" and "netmeta".
When I try to fit the GLM model for the analysis, I get the error message "No appropriate likelihood could be inferred".
My code is:
gemtc_network_numbers <-mtc.network(data.ab=diabetes_data,treatments=treatments)
mtcmodel<-mtc.model(network=gemtc_network_numbers,type="consistency",factor=2.5, n.chain=4, linearModel="random")
mtcresults <- mtc.run(mtcmodel, n.adapt = 20000, n.iter=100000, thin=10, sampler="rjags")
# View results summary
print(summary(mtcresults))
My data are:
> diabetes_data
study treatment responder samplesize
1 1 1 45 410
2 1 3 70 405
3 1 4 32 202
4 2 1 119 4096
5 2 4 154 3954
6 2 5 302 6766
7 3 2 1 196
8 3 5 8 196
9 4 1 138 2800
10 4 5 200 2826
11 5 3 799 7040
12 5 4 567 7072
13 6 1 337 5183
14 6 3 380 5230
15 7 2 163 2715
16 7 6 202 2721
17 8 1 449 2623
18 8 6 489 2646
19 9 5 29 416
20 9 6 20 424
21 10 4 177 4841
22 10 6 154 4870
23 11 3 86 3297
24 11 5 75 3272
25 12 1 102 2837
26 12 6 155 2883
27 13 4 136 2508
28 13 5 176 2511
29 14 3 665 8078
30 14 4 569 8098
31 15 2 242 4020
32 15 3 320 3979
33 16 3 37 1102
34 16 5 43 1081
35 16 6 34 2213
36 17 3 251 5059
37 17 4 216 5095
38 18 1 335 3432
39 18 6 399 3472
40 19 2 93 2167
41 19 6 115 2175
42 20 5 140 1631
43 20 6 118 1578
44 21 1 93 1970
45 21 3 97 1960
46 21 4 95 1965
47 22 2 690 5087
48 22 4 845 5074
Thanks for your help.
Angel
You have two solutions:
1- Rename your responder variable to "responders" and your samplesize variable to "sampleSize".
or
2- Specify the likelihood and link explicitly, for example: mtc.model(..., likelihood="poisson", link="log").
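The first option can be sketched in base R as follows. Only the first six rows of the posted data are reproduced here, and the gemtc call itself is left out so that the snippet runs without the package.

```r
# First six rows of the posted data, with the original column names.
diabetes_data <- data.frame(
  study      = c(1, 1, 1, 2, 2, 2),
  treatment  = c(1, 3, 4, 1, 4, 5),
  responder  = c(45, 70, 32, 119, 154, 302),
  samplesize = c(410, 405, 202, 4096, 3954, 6766)
)

# Rename the two columns to the names suggested in the answer above.
names(diabetes_data)[names(diabetes_data) == "responder"]  <- "responders"
names(diabetes_data)[names(diabetes_data) == "samplesize"] <- "sampleSize"

names(diabetes_data)
```

After the rename, the data frame can be passed to mtc.network(data.ab = diabetes_data, ...) as in the question.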

Plotting Stacked bar plot of large dataset and setting bar limits of plot in r

I am trying to plot a stacked bar plot of my dataset, data.csv, which is shown below. Apologies for posting such a large dataset.
degree Freq.x Freq.y
1 2978 0
2 1779 33
3 1390 22
4 919 19
5 787 16
6 676 22
7 578 16
8 513 23
9 460 11
10 376 17
11 345 13
12 292 17
13 291 14
14 286 8
15 269 15
16 216 10
17 192 18
18 183 10
19 184 7
20 190 10
21 157 9
22 155 14
23 127 9
24 151 15
25 119 10
26 102 6
27 113 7
28 99 6
29 98 4
30 103 7
31 94 11
32 79 7
33 76 5
34 73 8
35 76 11
36 59 5
37 58 5
38 61 5
39 63 7
40 68 9
41 63 4
42 57 8
43 45 6
44 45 4
45 39 3
46 40 6
47 42 6
48 30 3
49 36 7
50 28 5
51 33 1
52 32 6
53 34 5
54 43 4
55 35 6
56 29 2
57 27 4
58 35 6
59 25 4
60 24 4
61 32 4
62 15 2
63 24 5
64 25 4
65 23 9
66 25 7
67 27 7
68 22 7
69 23 7
70 17 6
71 19 4
72 19 4
73 19 2
74 18 2
75 19 6
76 12 3
77 25 6
78 23 9
79 20 4
80 17 6
81 15 5
82 13 4
83 14 4
84 13 5
85 15 1
86 13 1
87 12 5
88 14 5
89 16 4
90 12 3
91 10 3
92 12 5
93 12 7
94 10 0
95 11 4
96 12 3
97 6 5
98 20 7
99 5 3
100 8 3
101 11 2
102 11 3
103 8 0
104 14 4
105 15 2
106 7 0
107 7 1
108 6 0
109 9 2
110 10 1
111 8 1
112 6 1
113 8 1
114 8 2
115 7 4
116 3 1
117 4 2
118 5 0
120 5 0
121 1 0
122 9 2
123 7 3
124 4 1
125 3 0
126 3 2
127 7 3
128 5 3
129 3 1
130 3 0
131 5 1
132 5 2
133 2 0
134 5 2
135 10 1
136 5 2
137 3 1
138 7 2
139 6 2
140 3 1
141 5 1
142 9 4
143 3 1
144 2 1
145 4 2
146 2 0
147 2 2
148 3 1
149 1 0
150 1 0
151 2 1
152 3 1
153 3 1
154 2 1
155 3 1
156 6 4
157 4 2
158 3 1
159 4 1
160 2 1
161 2 1
163 3 1
164 5 2
165 2 1
166 3 0
167 4 4
168 2 1
169 1 0
170 2 2
171 3 2
172 1 0
173 4 3
174 3 2
175 1 1
177 3 3
178 3 2
179 1 0
180 3 1
181 2 0
182 1 1
183 3 1
184 2 2
185 2 1
186 3 1
187 2 1
188 1 1
191 1 0
192 1 0
193 1 0
195 4 2
196 2 2
197 4 1
198 1 0
199 2 1
200 1 0
201 2 2
202 1 0
204 2 0
206 3 1
207 1 0
208 1 0
209 2 1
211 1 1
212 2 1
213 2 2
214 1 1
215 1 1
218 2 2
220 2 1
222 3 1
223 2 2
224 1 1
225 1 1
226 1 1
227 2 1
228 2 1
230 3 1
231 1 1
233 2 2
234 3 1
235 1 1
236 1 1
237 1 1
239 2 2
241 1 1
242 1 0
243 1 0
244 1 1
245 1 1
246 1 1
247 2 0
250 2 1
251 3 2
252 1 1
253 2 2
254 1 1
256 1 1
258 2 1
260 1 1
262 1 1
264 1 0
267 1 1
268 1 1
269 1 1
270 1 1
271 2 1
272 1 1
275 2 1
276 1 1
277 2 2
278 1 0
280 1 1
283 1 0
285 2 1
290 1 1
291 1 1
294 1 1
299 1 1
301 4 3
303 1 1
304 2 0
305 1 1
307 1 1
311 1 1
314 2 1
317 1 1
318 1 1
319 1 1
321 1 1
323 1 1
329 2 1
330 1 1
333 1 0
334 1 1
335 1 1
337 1 1
339 1 1
342 1 1
343 1 0
350 2 2
356 1 1
368 1 0
370 2 2
377 1 1
390 1 1
392 1 1
394 1 1
406 1 1
408 1 1
409 1 1
419 1 1
424 1 1
427 1 1
451 1 1
459 1 1
461 1 1
462 1 0
478 1 1
479 1 0
488 1 1
530 1 1
550 1 1
553 1 1
568 1 0
594 1 1
608 1 1
622 1 1
625 1 1
626 1 1
628 1 1
646 1 1
648 1 1
652 1 1
655 1 1
656 1 1
660 1 0
688 1 1
723 1 1
732 1 1
740 1 1
761 1 1
769 1 0
845 1 1
865 1 1
1063 1 1
1105 1 1
1242 1 1
1737 1 1
1989 1 1
2456 1 1
9588 1 1
I want to plot a stacked bar plot that compares Freq.x and Freq.y for each degree, with degree on the x-axis and frequency on the y-axis. I tried ggplot2 in R and got a stacked bar plot, but because my dataset is large I want to combine degrees into bins. The code I tried is as follows.
d_ap <- read.csv("data.csv")
l_nw <- data.frame(d_ap)
library(reshape2)
# the id column is named "degree" (lowercase) in data.csv
final_df <- melt(l_nw, id.vars = "degree")
library(ggplot2)
ggplot(final_df, aes(x = degree, y = value, fill = variable)) +
  geom_bar(stat = "identity")
This outputs a bar plot, but I want to set bin limits on the x-axis. In my desired output, degrees 1 to 10 are plotted as individual bars, and degrees 11 to 9588 are grouped into bars such as 11 to 20, 20 to 30, 30 to 50, and 50 to 9588. How can I set bin limits on the x-axis like this, so that I can better visualize my stacked bar plot?
Is that what you want?
final_df$cdegree=cut(final_df$degree,c(0,1,2,3,4,5,6,7,8,9,10,20,30,50,9590))
library(ggplot2)
ggplot(final_df, aes(x = cdegree, y = value, fill = variable)) +
geom_bar(stat = "identity")
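To see what the binning does to the numbers before plotting, the sketch below applies cut() to nine rows taken from the posted data and sums the frequencies per bin with aggregate(). The bin labels and the open-ended last break (Inf rather than 9590) are choices made for this example.

```r
# Nine rows taken from the posted data.
degree_counts <- data.frame(
  degree = c(1, 2, 11, 15, 20, 25, 40, 60, 9588),
  Freq.x = c(2978, 1779, 345, 269, 190, 119, 68, 24, 1),
  Freq.y = c(0, 33, 13, 15, 10, 10, 9, 4, 1)
)

# Degrees 1 to 10 keep their own bin; larger degrees are grouped.
breaks <- c(0:10, 20, 30, 50, Inf)
labels <- c(as.character(1:10), "11-20", "21-30", "31-50", "51+")
degree_counts$bin <- cut(degree_counts$degree, breaks = breaks, labels = labels)

# Sum both frequency columns within each bin.
binned <- aggregate(cbind(Freq.x, Freq.y) ~ bin, data = degree_counts, FUN = sum)
binned
```

With geom_bar(stat = "identity"), ggplot2 stacks all values that share an x value, so plotting against the binned factor gives the same totals as this table.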

Binning continuous data to stack histogram in R

I have a dataset that looks like this:
USER.ID avgfrequency orders group
1 3 3.7821782 101 3
2 7 14.7500000 8 3
3 9 13.4761905 21 3
4 13 5.1967213 61 3
5 16 6.7812500 64 3
6 26 41.7500000 4 2
7 49 13.6666667 3 2
8 50 7.0000000 1 1
9 51 1.0000000 1 1
10 52 17.7500000 4 2
11 69 4.5000000 2 1
12 75 9.9500000 20 3
13 91 84.2000000 5 2
14 98 8.0185185 54 3
15 138 14.2000000 5 2
16 139 34.7500000 4 2
17 149 7.6666667 21 3
18 155 35.3333333 9 3
19 167 24.0000000 1 1
20 170 7.3529412 34 3
21 171 4.4210526 76 3
22 174 4.5000000 2 1
23 175 6.5781250 64 3
24 176 19.2857143 21 3
25 177 10.4864865 37 3
26 178 28.0000000 15 3
27 180 4.8461538 39 3
28 183 25.5000000 2 1
29 184 13.0000000 1 1
30 210 32.0000000 1 1
31 215 13.4615385 13 3
32 220 11.3611111 36 3
33 223 26.2500000 8 3
34 224 40.5000000 8 3
35 230 15.4000000 10 3
36 232 14.6666667 3 2
37 234 34.5833333 12 3
38 238 138.5000000 2 1
39 240 7.0000000 3 2
40 243 35.0000000 3 2
41 246 6.7500000 4 2
42 247 8.5000000 50 3
43 258 17.6666667 3 2
44 283 23.5000000 2 1
45 295 19.5625000 16 3
46 300 81.6666667 3 2
47 311 34.4166667 12 3
48 338 64.0000000 1 1
49 342 113.3333333 3 2
50 343 197.0000000 1 1
51 347 3.6923077 13 3
52 350 4.6666667 3 2
53 360 177.5000000 2 1
54 361 39.0000000 10 3
55 362 1.4000000 5 2
56 365 15.0000000 24 3
57 366 59.2000000 5 2
58 367 5.0000000 4 2
59 369 27.9285714 14 3
60 372 63.6666667 3 2
61 375 9.3750000 8 3
62 377 13.3225806 31 3
63 380 169.5000000 2 1
64 383 23.2352941 17 3
65 391 0.0000000 1 1
I want to split avgfrequency into bins of width 10 and plot them on the x-axis, with the count of USER.ID on the y-axis as a histogram. Within each bar, I want to show the count of USER.ID for each group in a different color, so each bar would have up to three colors.
Is it possible to do it in R ?
It is possible. See below:
library(ggplot2) #load the ggplot2 graph package
data = data.frame(data) #make the dataset an R data frame object
head(data,2) #just showing part of the data here.
USER.ID avgfrequency orders group
3 3.782178 101 3
7 14.750000 8 3
#build graph
ggplot(data, aes(x=avgfrequency,fill=factor(group))) +
geom_histogram(breaks=seq(0,200,by=10),colour='black') +
xlab("Average Frequency") + ylab("Count of USER.ID") +
scale_fill_manual("Group", breaks = c("1","2","3"), values = c("grey30","grey50", "grey70")) +
theme_bw()
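The counts behind such a stacked histogram can be checked without plotting. The sketch below uses the first ten rows of the posted data and the same breaks as above; the interval convention (closed on the right, lowest value included) is an assumption chosen to mirror the usual histogram default.

```r
# First ten rows of the posted data (avgfrequency rounded to two decimals).
orders_demo <- data.frame(
  avgfrequency = c(3.78, 14.75, 13.48, 5.20, 6.78, 41.75, 13.67, 7.00, 1.00, 17.75),
  group        = c(3, 3, 3, 3, 3, 2, 2, 1, 1, 2)
)

# Bins of width 10 from 0 to 200.
bin <- cut(orders_demo$avgfrequency,
           breaks = seq(0, 200, by = 10),
           include.lowest = TRUE)

# Rows are bins, columns are groups; each cell is one colored segment of a bar.
counts <- table(bin, orders_demo$group)
counts[rowSums(counts) > 0, ]
```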
