Converting Data frame to time series in R - r

I currently have this data set below, but I am unsure as to how I can convert this into a time series from the data frame format that it is currently in.
I am also unsure as to how I can split this data up to create an in-sample and out-of-sample data set for forecasting.
Date Observations
1 1975/01 5172
2 1975/02 6162
3 1975/03 6979
4 1975/04 5418
5 1976/01 4801
6 1976/02 5849
7 1976/03 6292
8 1976/04 5261
9 1977/01 4461
10 1977/02 5322
11 1977/03 6153
12 1977/04 5377
13 1978/01 4808
14 1978/02 5845
15 1978/03 6023
16 1978/04 5691
17 1979/01 4683
18 1979/02 5663
19 1979/03 6068
20 1979/04 5429
21 1980/01 4897
22 1980/02 5685
23 1980/03 5862
24 1980/04 4663
25 1981/01 4566
26 1981/02 5118
27 1981/03 5261
28 1981/04 4459
29 1982/01 4352
30 1982/02 4995
31 1982/03 5559
32 1982/04 4823
33 1983/01 4462
34 1983/02 5228
35 1983/03 5997
36 1983/04 4725
37 1984/01 4223
38 1984/02 4940
39 1984/03 5780
40 1984/04 5232
41 1985/01 4723
42 1985/02 5219
43 1985/03 5855
44 1985/04 5613
45 1986/01 4987
46 1986/02 6117
47 1986/03 5777
48 1986/04 5803
49 1987/01 5113
50 1987/02 6298
51 1987/03 7152
52 1987/04 6591
53 1988/01 6337
54 1988/02 6672
55 1988/03 7224
56 1988/04 6296
57 1989/01 6957
58 1989/02 7538
59 1989/03 8022
60 1989/04 7216
61 1990/01 6633
62 1990/02 7355
63 1990/03 7897
64 1990/04 7159
65 1991/01 6637
66 1991/02 7629
67 1991/03 8080
68 1991/04 7077
69 1992/01 7190
70 1992/02 7396
71 1992/03 7795
72 1992/04 7147
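One possible approach, sketched with base R only. It assumes the `/01` to `/04` suffix in `Date` is the quarter, so the series has frequency 4; the names `df`, `y`, `in_sample` and `out_sample` and the split point are illustrative choices, and only the first three years of the data are typed in here.

```r
# First three years of the data from the question, typed in by hand.
df <- data.frame(
  Date = c("1975/01", "1975/02", "1975/03", "1975/04",
           "1976/01", "1976/02", "1976/03", "1976/04",
           "1977/01", "1977/02", "1977/03", "1977/04"),
  Observations = c(5172, 6162, 6979, 5418,
                   4801, 5849, 6292, 5261,
                   4461, 5322, 6153, 5377)
)

# Assumption: "/01".."/04" denotes the quarter, hence frequency = 4.
y <- ts(df$Observations, start = c(1975, 1), frequency = 4)

# window() subsets a ts object by time, which gives the two samples.
in_sample  <- window(y, end = c(1976, 4))    # 1975 Q1 to 1976 Q4
out_sample <- window(y, start = c(1977, 1))  # 1977 Q1 onwards

print(in_sample)
print(out_sample)
```

With the full data set the same two `window()` calls apply; only the cut-off year would change.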

Related

Why is the first run of microbenchmark always the slowest?

When using microbenchmark I have noticed that the first execution is always a lot slower than the rest. This effect was the same on different machines and with different functions. Does this have something to do with the library, or is this some kind of warm-up that is to be expected?
library(microbenchmark)
X <- matrix(rnorm(100), nrow = 10)
microbenchmark(solve(X))$time
#[1] 82700 23700 18300 17700 19700 19100 16900 17500 17300 16600 16700 16700 18500 16900 17700 16900 17000 16200 17400 17000 16800 16600 17000 16700 16800 17100
#[27] 17300 17100 16800 17800 17400 18100 17400 18100 18000 16700 17400 17300 17000 16800 16400 17300 16700 16900 16900 16700 17200 17800 16600 17100 16800 17800
#[53] 17000 17200 17500 17200 17200 17300 17800 17600 17600 17200 16600 16700 16800 16600 16400 16500 17300 17600 16800 17600 16300 16800 17100 16500 16800 16700
#[79] 16300 16700 16300 16700 16800 16700 16400 17100 16400 17100 17000 18000 16600 16600 16600 16800 16700 16500 17600 19100 17400 16900
It has to do with warm-up time; see help('microbenchmark'), section Details, argument control:
The control list can contain the following entries:
order
[omitted]
warmup
the number of warm-up iterations performed before the actual benchmark. These are used to estimate the timing overhead as well as spinning up the processor from any sleep or idle states it might be in. The default value is 2.
If you increase the number of warm-up iterations, the first run might not be the slowest, though it often still is.
library(microbenchmark)
set.seed(2020)
X <- matrix(rnorm(100), nrow = 10)
times <- microbenchmark(solve(X), control = list(warmup = 10))$time
times
# [1] 145229 72724 65333 65305 115715 63797 689113 72101 64830 66392
# [11] 65776 66619 65531 64765 65351 65605 65745 65106 64661 65790
# [21] 65435 64964 66138 65952 66893 65654 65585 75141 74666 69060
# [31] 72725 66650 65486 65894 66808 65381 66039 65959 64842 65029
# [41] 65673 66439 64394 70585 68899 73875 73180 67807 65891 65699
# [51] 64693 63679 65504 80190 66150 65048 64372 64842 65845 65144
# [61] 65543 65297 65485 64695 66580 64921 65453 64840 65559 65805
# [71] 64362 66098 65464 65227 64998 64007 65659 63919 64727 64796
# [81] 65231 64030 65871 65735 64217 65195 65181 65130 66015 63891
# [91] 63755 65274 65116 64573 64244 64214 64148 64457 65346 64228
Now use order to list the runs from slowest to fastest:
order(times, decreasing = TRUE)
# [1] 7 1 5 54 28 29 46 47 31 2 8 44 30 45 48 25 35 32 12
# [20] 65 42 10 55 23 72 37 89 38 24 34 49 83 59 70 20 11 17 84
# [39] 50 41 77 26 16 27 69 61 13 53 33 63 73 67 21 36 15 99 3
# [58] 4 62 92 81 74 86 87 60 88 93 18 56 40 75 22 66 39 58 68
# [77] 9 80 14 79 64 51 19 94 98 43 57 71 95 100 85 96 97 82 76
# [96] 78 90 6 91 52
In this case the slowest was the seventh run, not the first.
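If only the position of the slowest run matters, `which.max()` returns it directly without sorting the whole vector. A minimal sketch, using the first ten timings from the output above copied by hand (the name `times10` is just for illustration):

```r
# First ten timings (nanoseconds) from the benchmark output above.
times10 <- c(145229, 72724, 65333, 65305, 115715,
             63797, 689113, 72101, 64830, 66392)

# Index of the slowest run among these ten.
slowest <- which.max(times10)
print(slowest)
```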

R: How to plot negative and positive anomaly (for this data) with ggplot? [duplicate]

This question already has answers here:
How to fill with different colors between two lines? (originally: fill geom_polygon with different colors above and below y = 0 (or any other value)?)
I have this df
x acc
1 1902-01-01 0.782887804
2 1903-01-01 -0.003144199
3 1904-01-01 0.100006276
4 1905-01-01 0.326173392
5 1906-01-01 1.285114692
6 1907-01-01 2.844399973
7 1920-01-01 -0.300232190
8 1921-01-01 1.464389342
9 1922-01-01 0.142638653
10 1923-01-01 -0.020162385
11 1924-01-01 0.361928571
12 1925-01-01 0.616325588
13 1926-01-01 -0.108206003
14 1927-01-01 -0.318441954
15 1928-01-01 -0.267884586
16 1929-01-01 -0.022473777
17 1930-01-01 -0.294452983
18 1931-01-01 -0.654927109
19 1932-01-01 -0.263508341
20 1933-01-01 0.622530992
21 1934-01-01 1.009666043
22 1935-01-01 0.675484421
23 1936-01-01 1.209162008
24 1937-01-01 1.655280986
25 1948-01-01 2.080021785
26 1949-01-01 0.854572563
27 1950-01-01 0.997540963
28 1951-01-01 1.000244163
29 1952-01-01 0.958322941
30 1953-01-01 0.816259474
31 1954-01-01 0.814488644
32 1955-01-01 1.233694537
33 1958-01-01 0.460120970
34 1959-01-01 0.344201474
35 1960-01-01 1.601430139
36 1961-01-01 0.387850967
37 1962-01-01 -0.385954401
38 1963-01-01 0.699355708
39 1964-01-01 0.084519926
40 1965-01-01 0.708964572
41 1966-01-01 1.456280443
42 1967-01-01 1.479412638
43 1968-01-01 1.199000726
44 1969-01-01 0.282942042
45 1970-01-01 -0.181724504
46 1971-01-01 0.012170186
47 1972-01-01 -0.095891043
48 1973-01-01 -0.075384446
49 1974-01-01 -0.156668145
50 1975-01-01 -0.303023258
51 1976-01-01 -0.516027310
52 1977-01-01 -0.826791524
53 1980-01-01 -0.947112221
54 1981-01-01 -1.634878300
55 1982-01-01 -1.955298323
56 1987-01-01 -1.854447550
57 1988-01-01 -1.458955443
58 1989-01-01 -1.256102245
59 1990-01-01 -0.864108585
60 1991-01-01 -1.293373024
61 1992-01-01 -1.049530431
62 1993-01-01 -1.002526230
63 1994-01-01 -0.868783614
64 1995-01-01 -1.081858981
65 1996-01-01 -1.302103374
66 1997-01-01 -1.288048194
67 1998-01-01 -1.455750340
68 1999-01-01 -1.015467069
69 2000-01-01 -0.682789640
70 2001-01-01 -0.811058004
71 2002-01-01 -0.972374057
72 2003-01-01 -0.536505225
73 2004-01-01 -0.518686263
74 2005-01-01 -0.976298621
75 2006-01-01 -0.946429713
I would like to plot the data like this:
with column x of df on the x axis and column acc on the y axis.
Is it possible to plot this with ggplot?
I tried with this code:
ggplot(df,aes(x=x,y=acc))+
geom_linerange(data =df , aes(colour = ifelse(acc <0, "blue", "red")),ymin=min(df),ymax=max(cdf))
but the result is this:
How can I do this?
Is this what you want? I'm not sure.
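library(ggplot2); library(lubridate)  # lubridate provides year()
df$x <- year(as.Date(df$x))
df$ystart <- 0
df$col <- ifelse(df$acc >= 0, "blue", "red")
# Build the helper columns first, then draw one segment per year from 0 to acc
ggplot(df, aes(x, acc)) +
  geom_segment(aes(x = x, y = ystart, xend = x, yend = acc, color = col)) +
  scale_color_identity()  # treat the values in col as literal colour names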

R dcast duplicating first subject when creating wide-format data

I am trying to move from long format data to wide format in order to do some correlation analyses.
But dcast seems to create two rows for the first subject and splits the data across those two rows, filling the resulting empty cells with NA.
The first two subjects were being duplicated when I was using alphanumeric subject codes; after switching to numeric subject numbers, only the first subject is duplicated.
The first few lines of the long-format data frame:
Subject Age Gender R_PTA L_PTA BE_PTA Avg_PTA L_Aided_SII R_Aided_SII Best_Aided_SII L_Unaided_SII R_Unaided_SII Best_Unaided_SII L_SII_Diff R_SII_Diff
1 1 74 M 48.33 53.33 48.33 50.83 31 42 42 14 25 25 17 17
2 2 77 F 36.67 36.67 36.67 36.67 73 67 73 44 43 44 29 24
3 3 72 F 45.00 41.67 41.67 43.33 42 34 42 35 28 35 7 6
4 4 66 F 36.67 36.67 36.67 36.67 66 76 76 44 44 44 22 32
5 5 38 F 41.67 46.67 41.67 44.17 48 58 58 23 29 29 25 29
6 6 65 M 35.00 43.33 35.00 39.17 46 60 60 32 46 46 14 14
Best_SII_Diff rSII MoCA_Vis MoCA_Nam MoCA_Attn MoCA_Lang MoCA_Abst MoCA_Del_Rec MoCA_Ori MoCA_Tot PNT Semantic Aided PNT_Prop PNT_Prop_Mod
1 17 -0.4231157 5 3 6 2 2 2 6 26 0.971 0.029 Unaided 0.971 0.983
2 29 1.2739255 3 3 5 0 2 2 5 20 0.954 0.046 Unaided 0.960 0.966
3 7 -1.2777889 4 2 5 2 2 5 6 26 0.966 0.034 Unaided 0.960 0.982
4 32 1.5959701 5 3 6 3 2 5 6 30 0.983 0.017 Unaided 0.983 0.994
5 29 0.9492167 4 2 6 3 1 3 6 25 0.983 0.017 Unaided 0.983 0.994
6 14 -0.2936395 4 2 6 2 2 2 6 24 0.989 0.011 Unaided 0.989 0.994
PNT_S_Wt PNT_P_Wt
1 0.046 0.041
2 0.073 0.033
3 0.045 0.074
4 0.049 0.057
5 0.049 0.057
6 0.049 0.057
Creating varlist:
varlist <- list(colnames(subset(PNT_Data_All2, ,c(18:27,29:33))))
My dcast command:
Data_Wide <- dcast(as.data.table(PNT_Data_All2),Subject + Age + Gender + R_PTA + L_PTA + BE_PTA + Avg_PTA + L_Aided_SII + R_Aided_SII + Best_Aided_SII + L_Unaided_SII + R_Unaided_SII + Best_Unaided_SII + L_SII_Diff + R_SII_Diff + Best_SII_Diff + rSII ~ Aided, value.var=varlist)
The resulting first few lines of the wide format:
Subject Age Gender R_PTA L_PTA BE_PTA Avg_PTA L_Aided_SII R_Aided_SII Best_Aided_SII L_Unaided_SII R_Unaided_SII Best_Unaided_SII L_SII_Diff R_SII_Diff
1: 1 74 M 48.33 53.33 48.33 50.83 31 42 42 14 25 25 17 17
2: 1 74 M 48.33 53.33 48.33 50.83 31 42 42 14 25 25 17 17
3: 2 77 F 36.67 36.67 36.67 36.67 73 67 73 44 43 44 29 24
4: 3 72 F 45.00 41.67 41.67 43.33 42 34 42 35 28 35 7 6
5: 4 66 F 36.67 36.67 36.67 36.67 66 76 76 44 44 44 22 32
6: 5 38 F 41.67 46.67 41.67 44.17 48 58 58 23 29 29 25 29
Notice that Subject 1 has 2 entries. All of the other subjects seem correct.
Is this a problem with my command/arguments? A bug in dcast?
Edit 1: Through the process of elimination, the extra entries only appear when I include the "rSII" variable. This is a variable that is calculated from a previous step in the script:
PNT_Data_All$rSII <- stdres(lm(Best_Aided_SII ~ Best_Unaided_SII, data=PNT_Data_All))
PNT_Data_All <- PNT_Data_All[, colnames(PNT_Data_All)[c(1:17,34,18:33)]]
Is there something about that calculated variable that would mess up dcast for some subjects?
Edit 2 to add my workaround:
I ended up rounding the calculated variable to 3 digits after the decimal and that solved the problem. Everything is casting correctly now with no duplicates.
PNT_Data_All$rSII <- format(round(stdres(lm(Best_Aided_SII ~ Best_Unaided_SII, data=PNT_Data_All)),3),nsmall=3)
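A plausible explanation, which the thread itself does not confirm: every variable on the left-hand side of the dcast formula acts as part of the grouping key, and two computed doubles that print identically can still differ in their last bits. The sketch below shows that effect in base R with made-up values (`a` and `b` are not from the data) and why rounding collapses them into one key.

```r
# Two values that are mathematically equal but differ as doubles.
a <- 0.1 + 0.2
b <- 0.3

print(a == b)                                # exact comparison fails
n_keys_raw <- length(unique(c(a, b)))        # treated as two distinct keys

# Rounding before using the value as a key removes the difference.
n_keys_rounded <- length(unique(round(c(a, b), 3)))

print(c(raw = n_keys_raw, rounded = n_keys_rounded))
```

Note that `round()` alone keeps the column numeric, whereas the `format()` call in the workaround turns it into character strings. If rSII is needed for the correlation analyses, keeping it numeric may be preferable.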

Error while plotting a tree with some squirrels using trees package

I am using the package trees found here, by @jbaums and explained in this post.
My data are the following:
The tree is composed of
the trunk
Trunk
[1] 13.60415
and the branches
Tree
TreeBranchLength TreeBranchID
1 10.004269 1
2 7.994269 2
3 9.028834 11
4 10.817401 12
5 8.551311 111
6 10.599798 112
7 11.073243 121
8 13.367392 122
9 9.625431 1111
10 10.793569 1112
11 9.896499 11121
12 8.687741 11122
13 7.791180 1211
14 12.506105 1212
15 6.768478 1221
16 10.441796 1222
17 10.751892 1121
18 9.458651 1122
19 10.768509 11221
20 10.150673 11222
21 12.377448 111211
22 12.235136 111212
23 9.074079 11211
24 9.996334 11212
25 9.807019 112221
26 10.895809 112222
27 6.741274 1122211
28 15.841272 1122212
29 5.753920 11222111
30 8.846389 11222112
31 11.925961 112111
32 9.780776 112112
33 8.207965 12221
34 10.079375 12222
the 50 squirrel populations -
Populations
PopulationPositionOnBranch PopulationBranchID ID
1 10.6321655 112111 1
2 1.0644897 1 2
3 3.9315473 1 3
4 1.0310244 0 4
5 9.1768846 0 5
6 13.4267181 0 6
7 7.9461528 0 7
8 6.0533401 121 8
9 2.1227425 121 9
10 1.8256787 121 10
11 4.7332588 11222112 11
12 4.4837432 11222112 12
13 4.6200834 11222112 13
14 2.5622276 1221 14
15 1.2446683 1221 15
16 7.0674052 111 16
17 1.3854674 111 17
18 4.8735635 111 18
19 9.5007998 1222 19
20 6.6373468 1222 20
21 12.6757728 122 21
22 4.2685465 122 22
23 3.9806540 2 23
24 3.1025403 2 24
25 3.9119065 11122 25
26 1.5527653 11122 26
27 1.6687957 11122 27
28 8.0697456 1122 28
29 6.7871391 1122 29
30 9.8050713 111212 30
31 8.5226920 111212 31
32 3.6113379 111212 32
33 7.3184965 111211 33
34 8.6142984 111211 34
35 1.3550870 1211 35
36 8.3650639 12 36
37 4.6411446 112112 37
38 3.2985541 112112 38
39 12.2344148 1212 39
40 9.0290776 1212 40
41 1.3900249 1121 41
42 0.9261425 1122212 42
43 15.2522199 1122212 43
44 4.0253771 12222 44
45 8.7507678 11222 45
46 4.6289841 1122211 46
47 9.1799522 112 47
48 5.1293838 12221 48
49 1.1543080 12221 49
50 10.1014837 112222 50
the code to produce the plot
g <- germinate(list(trunk.height=Trunk,
branches=Tree$TreeBranchID,
lengths=Tree$TreeBranchLength),
left='1', right='2', angle=30)
xy <- squirrels(g, Populations$PopulationBranchID, pos=Populations$PopulationPositionOnBranch,
left='1', right='2', pch=21, bg='white', cex=3, lwd=2)
text(xy$x, xy$y, labels=seq_len(nrow(xy)), font=1)
which produces
As you can see on the plot below, population 43 (blue arrow) is outside the tree. It seems that the lengths of the branches on the plot do not correspond to the data. For example, the branch holding populations 37 and 38 (left green arrow) is drawn longer than the one holding population 43 (right green arrow), which is not the case in the data. What am I doing wrong? Have I understood correctly how to use trees?
From studying the germinate function, it seems that the Tree values passed to it need to be sorted by TreeBranchID in ascending order.
The branch on which population 43 is drawn is not actually branch 1122212.
Because of the order of the rows in Tree, the function somehow misplaces the branches.
Out of curiosity, I increased the length of branch 1122212 to see whether the branch holding population 43 would change. It didn't; the branch that actually grew was the one holding populations 37 and 38.
This suggested that germinate was mishandling the input order. After further debugging, the code below worked for me:
Tree<-read.csv("treeBranch.csv")
Tree<-Tree[order(Tree$TreeBranchID),]
g <- germinate(list(trunk.height=15,
branches=Tree$TreeBranchID,
lengths=Tree$TreeBranchLength),
left='1', right='2', angle=30)
xy <- squirrels(g, Populations$PopulationBranchID,pos=Populations$PopulationPositionOnBranch,
left='1', right='2', pch=21, bg='white', cex=3, lwd=2)
text(xy$x, xy$y, labels=seq_len(nrow(xy)), font=1)
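The sorting step can be checked on its own without the trees package. A small sketch using rows 13 to 18 of the Tree data from the question, typed in by hand (the name `Tree_sorted` is illustrative): the IDs are out of order as given, and `order()` fixes that while keeping each length paired with its ID.

```r
# Rows 13-18 of the Tree data frame from the question.
Tree <- data.frame(
  TreeBranchLength = c(7.791180, 12.506105, 6.768478,
                       10.441796, 10.751892, 9.458651),
  TreeBranchID     = c(1211, 1212, 1221, 1222, 1121, 1122)
)

# TRUE here: 1121 and 1122 appear after 1222 in the original input.
print(is.unsorted(Tree$TreeBranchID))

# Reorder whole rows so lengths stay attached to their branch IDs.
Tree_sorted <- Tree[order(Tree$TreeBranchID), ]
print(Tree_sorted)
```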

Create a column according to the levels of a vector

I have a data frame with a column (Species) that is a factor with 153 levels:
> out80[1:10,1:3]
Species Plots100 Plots80
1 02 901 2091
2 03 921 2094
3 04 29 60
4 05 1255 2145
5 06 563 850
6 07 38 53
7 08S 102 144
8 09 897 1734
9 10 503 1084
10 11 134 334
What I would like to do is look up each level of this factor in a column (code) of another data frame (species.tab2) and create a new column in out80 holding the name associated with that level, taken from the column French_name:
> head(species.tab2[,1:3])
var code French_name
1 ESPAR 2 CHENE PEDONCULE
2 ESPAR 3 CHENE SESSILE
3 ESPAR 3 CHENE SESSILE
4 ESPAR 3 CHENE SESSILE
5 ESPAR 4 CHENE ROUGE
6 ESPAR 5 CHENE PUBESCENT
I have tried doing it with ifelse or with a loop but I can't get it to work.
So the result would be something like this:
Species Plots100 Plots80 Name
1 02 901 2091 CHENE PEDONCULE
2 03 921 2094 CHENE SESSILE
etc...
EDIT: Here are the levels:
> out80$Species
[1] 02 03 04 05 06 07 08S 09 10 11 12P 12V 13B 13C 13G 14 15P 15S 16
[20] 17C 17F 17O 18C 18D 18M 19 20G 20P 20X 21C 21M 21O 22C 22G 22M 22S 23A 23AB
[39] 23AF 23AM 23C 23F 23PA 23PC 23PD 23PF 23PM 23SO 23SS 24 25B 25C 25FD 25FR 25M 25R 25V
[58] 26E 26OC 27C 27N 28 29AF 29AI 29CM 29EN 29LI 29MA 29MI 31 32 33B 33G 33N 34 36
[77] 37 38AL 38AU 39 40 41 42 49AA 49AE 49AM 49BO 49BS 49C 49CA 49CS 49EA 49EV 49FL 49IA
[96] 49LN 49MB 49PC 49PL 49PM 49PS 49PT 49RA 49RC 49RP 49RT 49SN 49TF 49TG 51 52 53CA 53CO 53S
[115] 54 55 56 57A 57B 58 59 61 62 63 64 65 66 67 68CC 68CE 68CJ 68CL 68CM
[134] 68EO 68PC 68PM 68SC 68SV 68TG 68TH 69 69JC 69JO 70SB 70SC 70SE 71 72V 73 74H 74J 76
[153] 77
> species.tab2$code
[1] 2 3 3 3 4 5 5 5 6 6 6 7 08S 9 10 10 11 12P 12V
[20] 12V 13B 13C 13G 14 14 14 15P 15S 15S 16 17C 17F 17O 17O 18C 18C 18D 18D
[39] 18M 19 19 20G 20P 20X 21C 21M 21O 22C 22G 22G 22M 22S 23A 23A 23AB 23AF 23AM
[58] 23C 23F 23PA 23PA 23PC 23PD 23PF 23PM 23SO 24 25B 25C 25D 25E3 25FR 25M 25R 25V 26E
[77] 26E 26OC 27C 27N 28 29AI 29CM 29EN 29MA 29MI 29LI 31 32 33B 33G 33N 34 36 37
[96] 38AU 38AL 39 40 41 42 49AA 49AE 49AM 49BO 49BO 49BS 49C 49CA 49CS 49EA 49EV 49FL 49IA
[115] 49LN 49MB 49PC 49PL 49PM 49PS 49PT 49RA 49RC 49RP 49RT 49SN 49TF 49TG 51 52 53CA 53CO 53S
[134] 54 55 56 57A 57B 58 59 61 62 63 64 65 66 67 68CC 68CJ 68CL 68CM 68EO
[153] 68PC 68PM 68SC 68SV 68TG 68TH 69 69JC 69JO 70SB 70SC 70SE 71 72V 73 74H 74J 76 77
There are some repetitions in code because two or three different French names exist for the same code. For these I just want one of the names; it doesn't matter which one.
Thank you for your help.
Using merge, after creating a new column code in out80:
out80$code <- gsub('^0|S$','',out80$Species)
merge(out80,species.tab2)
code Species Plots100 Plots80 var French_name
1 2 02 901 2091 ESPAR CHENE PEDONCULE
2 3 03 921 2094 ESPAR CHENE SESSILE
3 3 03 921 2094 ESPAR CHENE SESSILE
4 3 03 921 2094 ESPAR CHENE SESSILE
5 4 04 29 60 ESPAR CHENE ROUGE
6 5 05 1255 2145 ESPAR CHENE PUBESCENT
EDIT
code and Species don't match for levels 01, 02, ..., so I create a new column to match them:
out80$code <- gsub('^0([0-9])$','\\1',out80$Species)
A data.table solution:
require(data.table)
dt1 <- data.table(out80)
# positive look ahead
# match 0's at beginning followed by numbers
# if found, replace all beginning 0's with ""
dt1[, key := sub("^[0]+(?=[0-9]+$)", "", Species, perl=TRUE)]
setkey(dt1, "key")
dt2 <- data.table(species.tab2)
dt2[, code := as.character(code)]
dt2[, key := sub("^[0]+(?=[0-9]+$)", "", code, perl=TRUE)]
setkey(dt2, "key")
merge(dt1, dt2)
# key Species Plots100 Plots80 var code French_name
# 1: 2 02 901 2091 ESPAR 2 CHENE_PEDONCULE
# 2: 3 03 921 2094 ESPAR 3 CHENE_SESSILE
# 3: 3 03 921 2094 ESPAR 3 CHENE_SESSILE
# 4: 3 03 921 2094 ESPAR 3 CHENE_SESSILE
# 5: 4 04 29 60 ESPAR 4 CHENE_ROUGE
# 6: 5 05 1255 2145 ESPAR 5 CHENE_PUBESCENT
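Both merges above return one row per matching name, so codes with several names are repeated. Since the question only needs one name per species, a lookup with `match()` may be closer to the requested result: it returns the first matching position only, so out80 keeps its original number of rows. A minimal base-R sketch on the first rows shown in the question (the helper name `key` is illustrative):

```r
# First four rows of out80 and first six rows of species.tab2 from the question.
out80 <- data.frame(
  Species  = c("02", "03", "04", "05"),
  Plots100 = c(901, 921, 29, 1255),
  Plots80  = c(2091, 2094, 60, 2145),
  stringsAsFactors = FALSE
)
species.tab2 <- data.frame(
  var         = "ESPAR",
  code        = c("2", "3", "3", "3", "4", "5"),
  French_name = c("CHENE PEDONCULE", "CHENE SESSILE", "CHENE SESSILE",
                  "CHENE SESSILE", "CHENE ROUGE", "CHENE PUBESCENT"),
  stringsAsFactors = FALSE
)

# Strip the leading zero from purely numeric two-character codes ("02" -> "2").
key <- sub("^0([0-9])$", "\\1", as.character(out80$Species))

# match() gives the first row of species.tab2 whose code equals each key.
out80$Name <- species.tab2$French_name[match(key, as.character(species.tab2$code))]
print(out80)
```

Species with no corresponding code get NA in Name, which makes unmatched levels easy to spot.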
