Multivariate imputation of missing values in weather data - R

I need to get a weather dataset ready as input to Keras. I have 1096 entries of daily data over 3 years, of which the first month is missing. I obtained actual values for one of the columns (temperature) from a nearby weather station. To check which imputation method fits best, I deleted those 30 values, kept all columns as NA for the first month, and then tried various imputation packages (a sketch of this comparison appears after the code below):

1. mice - gave continuous values, but the average was too high;
2. kNN (VIM) - gave a constant value, too high;
3. missForest - gave a constant value, too high;
4. imputeTS interpolation - gave continuous values, slightly low;
5. imputeTS seasonal - gave a constant value, slightly low.

Based on this, I selected imputeTS interpolation and used it to impute the remaining columns after filling the temperature column with the actual values. However, I cannot seem to get the seasonality options in imputeTS working.
Any idea why? Please find below the data file and the code used:
Code:
# Packages used
library(mice)
library(missForest)
library(VIM)        # kNN()
library(imputeTS)   # na_interpolation(), na_seasplit()

# MICE did not match historical data: too high avg of 10.2
# impute <- mice(gf, method = "pmm")
# print(impute)
# xyplot(impute, Temp ~ Reco | .imp, pch = 20, cex = 1.4)
# mf <- complete(impute, 3)
# mf <- cbind(mf, Date = df$Date.Time)
# write.csv(mf, "Mice_imputed.csv", row.names = TRUE)
# View(mf)

# missForest: too high constant of 9.3
# gf_impo <- missForest(gf, maxiter = 100, ntree = 500)
# gf_impo$ximp
# View(gf_impo)
# gf_impo <- cbind(gf_impo, Date = df$Date.Time)
# write.csv(gf_impo$ximp, "Val_MissForest_imputed.csv", row.names = TRUE)
# class(gf_impo)

# kNN using VIM: too high constant of 13
imp_knn <- kNN(gf, k = 500)
aggr(imp_knn, delimiter = "imp")
View(imp_knn)
imp_knn <- cbind(imp_knn, Date = df$Date.Time)
write.csv(imp_knn, "Val_KNN_imputed.csv", row.names = TRUE)

# imputeTS interpolation: avg of 2.83, close to the real 4.1
imp_seas <- gf
imp_seas <- cbind(imp_seas, Date = df$Date.Time)
View(imp_seas)
imp_TS_intn <- na_interpolation(imp_seas, option = "spline")
View(imp_TS_intn)
write.csv(imp_TS_intn, "ML_impTS_interpolate.csv", row.names = TRUE)

# Seasonal split gave a constant of 2.7
# imp_TS_seas <- na_seasplit(imp_seas, algorithm = "interpolation", find_frequency = FALSE, maxgap = Inf)
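For context, here is a minimal sketch of the comparison described above - imputing the temperature column and checking it against the held-out station values. The object names temp_actual (the deleted station temperatures for January) and gf$Temp are assumptions for illustration, not from the original code:

library(imputeTS)

# Assumed names: gf$Temp = temperature with January set to NA,
# temp_actual = the held-out station values for the same days.
imp_temp <- na_interpolation(gf$Temp, option = "spline")

jan <- which(is.na(gf$Temp))            # indices of the held-out month
mean(imp_temp[jan])                     # imputed January average...
mean(temp_actual)                       # ...vs the actual average (about 4.1 in the post)
mean(abs(imp_temp[jan] - temp_actual))  # mean absolute error of the imputation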
DATA:
A B C D E Date
1 NA NA NA 5.4000000 NA 2018-01-01
2 NA NA NA 5.7500000 NA 2018-01-02
3 NA NA NA 6.8000000 NA 2018-01-03
4 NA NA NA 6.3500000 NA 2018-01-04
5 NA NA NA 3.3500000 NA 2018-01-05
6 NA NA NA 3.0500000 NA 2018-01-06
7 NA NA NA 2.2000000 NA 2018-01-07
8 NA NA NA 0.6500000 NA 2018-01-08
9 NA NA NA 2.8500000 NA 2018-01-09
10 NA NA NA 2.2000000 NA 2018-01-10
11 NA NA NA 2.3500000 NA 2018-01-11
12 NA NA NA 5.1000000 NA 2018-01-12
13 NA NA NA 6.5500000 NA 2018-01-13
14 NA NA NA 5.0000000 NA 2018-01-14
15 NA NA NA 5.7500000 NA 2018-01-15
16 NA NA NA 2.0000000 NA 2018-01-16
17 NA NA NA 5.0500000 NA 2018-01-17
18 NA NA NA 3.8500000 NA 2018-01-18
19 NA NA NA 2.4500000 NA 2018-01-19
20 NA NA NA 5.1500000 NA 2018-01-20
21 NA NA NA 6.7500000 NA 2018-01-21
22 NA NA NA 9.2500000 NA 2018-01-22
23 NA NA NA 9.5000000 NA 2018-01-23
24 NA NA NA 6.4500000 NA 2018-01-24
25 NA NA NA 5.4000000 NA 2018-01-25
26 NA NA NA 5.3500000 NA 2018-01-26
27 NA NA NA 6.5500000 NA 2018-01-27
28 NA NA NA 10.1000000 NA 2018-01-28
29 NA NA NA 6.6000000 NA 2018-01-29
30 NA NA NA 3.8500000 NA 2018-01-30
31 NA NA NA 2.9000000 NA 2018-01-31
32 0.05374951 0.041144312 0.0023696211 5.9902083 0.068784302 2018-02-01
33 0.07565470 0.012326176 0.0057481689 10.5280417 0.176209125 2018-02-02
34 0.04476314 0.113718139 0.0089845444 12.8125000 0.176408788 2018-02-03
35 0.01695546 0.060965133 -0.0034163682 16.9593750 0.000000000 2018-02-04
36 0.09910202 0.090170142 -0.0111946461 10.4867292 0.088337951 2018-02-05
37 0.08514839 0.026061013 -0.0029183210 7.1662500 0.085590326 2018-02-06
38 0.06724108 0.104761909 -0.0416036605 6.9130417 0.134828348 2018-02-07
39 0.07638534 0.097570813 -0.0192784571 3.3840000 0.029682717 2018-02-08
40 0.02568162 0.008244304 -0.0288903610 12.0282292 0.055817103 2018-02-09
41 0.02752688 0.088544666 -0.0172136911 6.8694792 0.098169954 2018-02-10
42 0.06643098 0.063321337 -0.0347752292 7.4539792 0.034110652 2018-02-11
43 0.09743445 0.057502178 0.0162851223 13.9365208 0.264168082 2018-02-12
44 0.09189575 0.034429904 0.0020940613 13.8687292 0.162341764 2018-02-13
45 0.07857244 0.009406862 0.0075904680 11.7800000 0.101283946 2018-02-14
46 0.01987263 0.024783795 -0.0088742973 4.4463750 0.063949011 2018-02-15
47 0.02332892 0.010138857 0.0091396448 5.6452292 0.034708981 2018-02-16
48 0.02022396 0.014207518 0.0036018714 14.2862500 0.043205299 2018-02-17
49 0.07043020 0.075317793 0.0036760070 5.5940208 0.171898590 2018-02-18
50 0.02120779 0.010461857 -0.0277470177 13.6131250 0.061486533 2018-02-19
51 0.06405819 0.034185344 0.0173606568 7.0551042 0.052148976 2018-02-20
52 0.09428869 0.026957653 0.0016863903 6.7955000 0.085888435 2018-02-21
53 0.04248937 0.048782786 0.0004039921 17.5706250 0.000000000 2018-02-22
54 0.02076763 0.038094949 -0.0003671638 14.8379167 0.000000000 2018-02-23
55 0.01343260 0.118003726 -0.0214988345 6.4564583 0.053353606 2018-02-24
56 0.05231647 0.054454132 -0.0098012290 7.8568333 0.183326943 2018-02-25
57 0.02476706 0.087501472 0.0031839472 15.7493750 0.210616272 2018-02-26
58 0.07358998 0.023558218 0.0031618607 10.8001250 0.241602571 2018-02-27
59 0.02042573 0.009268439 0.0088051496 7.2967500 0.251608940 2018-02-28
60 0.02107772 0.083567750 -0.0037223644 6.2674375 0.062221630 2018-03-01
61 0.05830801 0.029456683 0.0114978078 13.0810417 0.193765948 2018-03-02
62 0.02923587 0.070533843 0.0068299668 14.4095833 0.244310193 2018-03-03
63 0.02570283 0.058270093 0.0137174366 3.8527917 0.120846709 2018-03-04
64 0.01434395 0.014637405 0.0051951050 9.6877083 0.112579011 2018-03-05
65 0.06426214 0.078872579 0.0068664343 4.6763750 0.000000000 2018-03-06
66 0.04782772 0.011762501 0.0086182870 12.7027083 0.129606106 2018-03-07
67 0.01809136 0.105398844 0.0231671305 10.8052083 0.017683908 2018-03-08
68 0.04427582 0.020397435 -0.0009758693 6.5983333 0.041148864 2018-03-09
69 0.05123687 0.115984361 -0.0372104856 6.5021250 0.180013174 2018-03-10
70 0.01913266 0.005981014 -0.0159701842 8.9844375 0.095262921 2018-03-11
71 0.04407234 0.009142247 -0.0031640496 7.7638333 0.000000000 2018-03-12
72 0.09108709 0.038174205 0.0005654564 5.3772083 0.044105747 2018-03-13
73 0.05488394 0.115153937 0.0192819858 8.9182917 0.039993864 2018-03-14
74 0.03726892 0.067983475 -0.0311367032 2.4423333 0.066108171 2018-03-15
75 0.05563102 0.003831231 -0.0011148743 10.7100000 0.217461791 2018-03-16
76 0.04922930 0.055446609 0.0075246331 5.0829375 0.149530704 2018-03-17
77 0.02972858 0.061966039 -0.0392014211 12.3645625 0.060670492 2018-03-18
78 0.02812688 0.018183092 0.0134514770 9.0172292 0.158435250 2018-03-19
79 0.03066101 0.007622504 -0.0249482114 6.2709792 0.118487919 2018-03-20
80 0.06801767 0.083261012 0.0133423296 13.3683333 0.196053774 2018-03-21
81 0.04178157 0.093600914 0.0116253865 10.0024167 0.020835522 2018-03-22
82 0.04725052 0.018187748 -0.0115718535 10.3528333 0.097352796 2018-03-23
83 0.02042339 0.081504844 -0.0380958738 17.2006250 0.010500742 2018-03-24
84 0.06674396 0.098739090 -0.0108474961 17.5437500 0.119415595 2018-03-25
85 0.07049507 0.016286614 -0.0007817195 16.8800000 0.060452087 2018-03-26
86 0.01244906 0.018100693 -0.0266155999 8.8651458 0.018144668 2018-03-27
87 0.05271711 0.015368632 -0.0477885811 7.2415417 0.092797451 2018-03-28
88 0.01610886 0.014919094 0.0023487944 7.7914792 0.062818728 2018-03-29
89 0.08847253 0.059397043 0.0130362880 10.9732708 0.087451484 2018-03-30
90 0.02938725 0.044473745 0.0091253257 6.0241458 0.025488946 2018-03-31
91 0.08599249 0.043160908 0.0082536160 8.8211875 0.012975783 2018-04-01
92 0.05747667 0.017709243 -0.0090965038 6.3249375 0.065731818 2018-04-02
93 0.05772051 0.085210524 -0.0013533831 13.4166667 0.067790160 2018-04-03
94 0.01699834 0.020657341 0.0039885065 3.2999792 0.076302652 2018-04-04
95 0.03565076 0.110372607 -0.0313309140 12.7822083 0.184844707 2018-04-05
96 0.02050401 0.078943608 -0.0062322339 4.3233125 0.067820413 2018-04-06
97 0.06186790 0.013147512 0.0203249289 6.3953750 0.034104318 2018-04-07
98 0.06304988 0.012997642 0.0061171825 9.7322708 0.021220516 2018-04-08
99 0.03799006 0.012420760 0.0054724563 8.8472083 0.068664033 2018-04-09
100 0.01610225 0.061182804 0.0031002885 7.5622708 0.085766429 2018-04-10
101 0.05937683 0.008333173 -0.0053972689 7.8848542 0.058386726 2018-04-11
102 0.02190115 0.037843227 0.0089823372 8.3339792 0.055761391 2018-04-12
103 0.01179665 0.016899394 -0.0016533437 5.5101667 0.099133313 2018-04-13
104 0.02464707 0.021231270 -0.0212016846 15.5106250 0.126661378 2018-04-14
105 0.01906818 0.065273389 0.0081694393 7.6616667 0.032939519 2018-04-15
106 0.05418785 0.074619385 -0.0355680586 11.3618750 0.057768261 2018-04-16
107 0.06508988 0.014345229 0.0080423912 14.7137500 0.032709791 2018-04-17
108 0.06101126 0.060624597 -0.0399526978 17.2754167 0.230982139 2018-04-18
109 0.02226268 0.010230837 0.0001617419 2.9382083 0.000000000 2018-04-19
110 0.03884772 0.014218453 0.0039652960 10.7261875 0.179962834 2018-04-20
111 0.09054488 0.025711098 -0.0115944362 4.4734583 0.011442318 2018-04-21
112 0.03072171 0.076530730 0.0032123501 9.4128750 0.033174489 2018-04-22
113 0.04361276 0.101151670 0.0249408843 14.5804167 0.024238883 2018-04-23
114 0.03877568 0.049142846 0.0080689866 8.3168750 0.084570611 2018-04-24
115 0.05564027 0.076917047 0.0033447160 15.7308333 0.199762524 2018-04-25
116 0.04752544 0.019655228 -0.0063218138 15.7302083 0.020449908 2018-04-26
117 0.01718916 0.026132806 -0.0261027525 10.0887500 0.128898351 2018-04-27
118 0.04144832 0.034526516 0.0117868820 6.0784375 0.014449565 2018-04-28
119 0.03255833 0.113650910 -0.0123724759 11.8654167 0.085410171 2018-04-29
120 0.03656535 0.043333607 0.0230071368 7.0974167 0.035725321 2018-04-30
121 0.04570760 0.093595938 -0.0329915968 5.4016458 0.013467946 2018-05-01
122 0.07271528 0.061923504 0.0130002656 9.1602292 0.018299062 2018-05-02
123 0.02646133 0.007506529 -0.0276898846 0.2338125 0.246100834 2018-05-03
124 0.02379895 0.067273612 0.0112587565 19.1260417 0.120707266 2018-05-04
125 0.05925152 0.075768053 0.0050178925 16.2114583 0.162884739 2018-05-05
126 0.01858152 0.040845398 0.0164467420 12.9156250 0.028823967 2018-05-06
127 0.06994835 0.059457560 -0.0181926787 7.7316042 0.035106399 2018-05-07
128 0.05926409 0.038623605 0.0167222227 13.5464583 0.055665220 2018-05-08
129 0.03104010 0.006805893 -0.0141792029 14.5006250 0.012099383 2018-05-09
130 0.06631012 0.059314975 -0.0228020931 13.3711875 0.073114370 2018-05-10
131 0.03794480 0.015615642 0.0034917459 16.6675208 0.191141576 2018-05-11
132 0.03532917 0.050988581 0.0079455282 14.7375208 0.214172062 2018-05-12
133 0.08512617 0.063322454 0.0224309652 11.6861250 0.166425889 2018-05-13
134 0.04498265 0.012386160 -0.0051629339 7.2488333 0.280120908 2018-05-14
135 0.06383512 0.126840241 -0.0172296864 17.3852083 0.020363429 2018-05-15
136 0.06932861 0.026819550 -0.0109061610 20.9152083 0.099516538 2018-05-16
137 0.04020292 0.021831228 -0.0007211804 6.7122292 0.069831669 2018-05-17
138 0.02037474 0.020931810 0.0088341962 15.8758333 0.130548701 2018-05-18
139 0.01704143 0.105810563 -0.0243003529 10.7339583 0.038013440 2018-05-19
140 0.01266417 0.013985439 0.0091359503 6.5119375 0.196746897 2018-05-20
141 0.03623625 0.057182212 -0.0136101306 18.6637500 0.009431062 2018-05-21
142 0.03938695 0.054879146 0.0091277482 15.5393750 0.115389187 2018-05-22
143 0.05995812 0.061925644 -0.0029137774 11.8191667 0.015729774 2018-05-23
144 0.06548692 0.095240991 0.0055356839 4.3011875 0.081309326 2018-05-24
145 0.01582489 0.015264434 -0.0020079231 9.3315833 0.105132636 2018-05-25
146 0.06834050 0.028756388 -0.0512068435 13.6035417 0.212930829 2018-05-26
147 0.08354736 0.023524928 0.0041989465 4.5111250 0.227197329 2018-05-27
148 0.05738595 0.011159952 -0.0225834032 12.9385417 0.090503870 2018-05-28
149 0.07817132 0.103507587 -0.0222426051 13.4047292 0.034928812 2018-05-29
150 0.04773356 0.035856991 -0.0191600449 9.6657708 0.019893986 2018-05-30
Disclaimer: I am looking for a co-author to help validate my work with Keras / TensorFlow.

Maybe you can try the following things:
Using seasonal decomposition with na_seadec() instead of seasonal split
Manually setting the seasonality
Setting find_frequency = TRUE
You can set the seasonality manually as follows. Suppose you have a vector x with monthly values - this would mean frequency = 12:
# Create time series with seasonality information
x <- c(2,3,4,5,6,6,6,6,6,4,4,4,4,5,6,4,3,3,5,6,4,3,5,3,5,3,5,3,4,4,4,4,4,4,2,4,2,2,4,5,6,7,8,9,0,0,5,2,4)
x_with_freq <- ts(x, frequency = 12)
# Imputation for the series
na_seadec(x_with_freq)
If you have daily values, your frequency should be 365.
Other than this, you could also run na_seadec(x, find_frequency = TRUE); then imputeTS tries to find the seasonality for you automatically.
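Applied to the daily data in the question, a minimal sketch might look like this (gf$Temp is an assumed column name taken from the question's code):

library(imputeTS)

# Daily data: one seasonal cycle per year
temp_ts <- ts(gf$Temp, frequency = 365)
imp_seadec <- na_seadec(temp_ts, algorithm = "interpolation")

# Or let imputeTS estimate the frequency itself
imp_auto <- na_seadec(gf$Temp, find_frequency = TRUE)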
But in the end, I don't know your data; it could very well be that the seasonal patterns just aren't very strong.

Related

Quarterly year-to-year changes

I have a quarterly time series. I am trying to apply a function that is supposed to calculate the year-over-year growth, the year-over-year difference, and multiply one variable by (-1).
I already used a similar function to calculate quarter-over-quarter changes and it worked.
I modified that function for YoY changes, but it has no effect on my data frame, and no error pops up either.
Do you have any suggestion on how to modify the function, or how to apply the YoY change function to a time series?
Here is the code:
Date <- c("2004-01-01","2004-04-01", "2004-07-01","2004-10-01","2005-01-01","2005-04-01","2005-07-01","2005-10-01","2006-01-01","2006-04-01","2006-07-01","2006-10-01","2007-01-01","2007-04-01","2007-07-01","2007-10-01")
B1 <- c(3189.30,3482.05,3792.03,4128.66,4443.62,4876.54,5393.01,5885.01,6360.00,6930.00,7430.00,7901.00,8279.00,8867.00,9439.00,10101.00)
B2 <- c(7939.97,7950.58,7834.06,7746.23,7760.59,8209.00,8583.05,8930.74,9424.00,9992.00,10041.00,10900.00,11149.00,12022.00,12662.00,13470.00)
B3 <- as.numeric(c("","","","",140.20,140.30,147.30,151.20,159.60,165.60,173.20,177.30,185.30,199.30,217.10,234.90))
B4 <- as.numeric(c("","","","",-3.50,-14.60,-11.60,-10.20,-3.10,-16.00,-4.90,-17.60,-5.30,-10.90,-12.80,-8.40))
df <- data.frame(Date,B1,B2,B3,B4)
The code will produce the following data frame:
Date B1 B2 B3 B4
1 2004-01-01 3189.30 7939.97 NA NA
2 2004-04-01 3482.05 7950.58 NA NA
3 2004-07-01 3792.03 7834.06 NA NA
4 2004-10-01 4128.66 7746.23 NA NA
5 2005-01-01 4443.62 7760.59 140.2 -3.5
6 2005-04-01 4876.54 8209.00 140.3 -14.6
7 2005-07-01 5393.01 8583.05 147.3 -11.6
8 2005-10-01 5885.01 8930.74 151.2 -10.2
9 2006-01-01 6360.00 9424.00 159.6 -3.1
10 2006-04-01 6930.00 9992.00 165.6 -16.0
11 2006-07-01 7430.00 10041.00 173.2 -4.9
12 2006-10-01 7901.00 10900.00 177.3 -17.6
13 2007-01-01 8279.00 11149.00 185.3 -5.3
14 2007-04-01 8867.00 12022.00 199.3 -10.9
15 2007-07-01 9439.00 12662.00 217.1 -12.8
16 2007-10-01 10101.00 13470.00 234.9 -8.4
And I want to apply the following changes to the variables:
# yoy absolute difference change
abs.diff = c("B1","B2")
# yoy percentage change
percent.change = c("B3")
# make the variable negative
negative = c("B4")
This is the function that I am trying to use on my data frame:
transformation <- function(D, abs.diff, percent.change, negative) {
  TT <- dim(D)[1]
  DData <- D[-1, ]
  nms <- c()
  for (i in c(2:dim(D)[2])) {
    # yoy absolute difference change
    if (names(D)[i] %in% abs.diff) {
      DData[, i] <- D[5:TT, i] - D[1:(TT - 4), i]
      names(DData)[i] <- paste('a', names(D)[i], sep = '')
    }
    # yoy percent. change
    if (names(D)[i] %in% percent.change) {
      DData[, i] <- 100 * (D[5:TT, i] - D[1:(TT - 4), i]) / D[1:(TT - 4), i]
      names(DData)[i] <- paste('p', names(D)[i], sep = '')
    }
    # CA.deficit
    if (names(D)[i] %in% negative) {
      DData[, i] <- (-1) * D[1:TT, i]
    }
  }
  return(DData)
}
This is what I would like to get:
Date pB1 pB2 aB3 B4
1 2004-01-01 NA NA NA NA
2 2004-04-01 NA NA NA NA
3 2004-07-01 NA NA NA NA
4 2004-10-01 NA NA NA NA
5 2005-01-01 39.33 -2.26 NA 3.5
6 2005-04-01 40.05 3.25 NA 14.6
7 2005-07-01 42.22 9.56 NA 11.6
8 2005-10-01 42.54 15.29 11.0 10.2
9 2006-01-01 43.13 21.43 19.3 3.1
10 2006-04-01 42.11 21.72 18.3 16.0
11 2006-07-01 37.77 16.99 22.0 4.9
12 2006-10-01 34.26 22.05 17.7 17.6
13 2007-01-01 30.17 18.3 19.7 5.3
14 2007-04-01 27.95 20.32 26.1 10.9
15 2007-07-01 27.04 26.1 39.8 12.8
16 2007-10-01 27.84 23.58 49.6 8.4
Group by the month, i.e. the 6th and 7th substring of the date, using ave, and do the necessary calculations there. With sapply we can loop over the columns.
f <- function(x) {
  g <- substr(Date, 6, 7)
  l <- length(unique(g))
  o <- ave(x, g, FUN = function(x) 100 / x * c(x[-1], NA) - 100)
  c(rep(NA, l), head(o, -4))
}

cbind(df[1], sapply(df[-1], f))
# Date B1 B2 B3 B4
# 1 2004-01-01 NA NA NA NA
# 2 2004-04-01 NA NA NA NA
# 3 2004-07-01 NA NA NA NA
# 4 2004-10-01 NA NA NA NA
# 5 2005-01-01 39.32901 -2.259202 NA NA
# 6 2005-04-01 40.04796 3.250329 NA NA
# 7 2005-07-01 42.21960 9.560688 NA NA
# 8 2005-10-01 42.54044 15.291439 NA NA
# 9 2006-01-01 43.12655 21.434066 13.83738 -11.428571
# 10 2006-04-01 42.10895 21.720063 18.03279 9.589041
# 11 2006-07-01 37.77093 16.986386 17.58316 -57.758621
# 12 2006-10-01 34.25636 22.050356 17.26190 72.549020
# 13 2007-01-01 30.17296 18.304329 16.10276 70.967742
# 14 2007-04-01 27.95094 20.316253 20.35024 -31.875000
# 15 2007-07-01 27.03903 26.102978 25.34642 161.224490
# 16 2007-10-01 27.84458 23.577982 32.48731 -52.272727

Time Series Package that Replaces NA values as a Forecast [closed]

I have a dataset like below:
Date Metric1 Metric2 Metric3 Metric4
2017-01-01 NA 3 NA 7
2017-01-02 NA 4 NA 10
2017-01-03 NA 2 NA 18
2017-01-04 5 8 NA 20
2017-01-05 8 9 87 34
2017-01-06 10 2 45 12
. . . . .
. . . . .
. . . . .
2018-09-01 12 13 14 15
2018-09-02 34 12 28 19
2018-09-03 45 12 45 34
2018-09-04 NA 14 49 11
2018-09-05 NA 11 90 12
2018-09-06 NA 15 NA 32
2018-09-07 NA 23 NA 43
2018-09-08 NA 12 NA 22
My dataset has 100 columns. The NAs occur only at the start and end of their respective columns; there are no missing values in between. Does anyone know a package or a function that will forecast, or use a moving average, to fill the values before the first and after the last numeric value in each column?
I have done some research on this so far, and the best I can find is na.fill, but that just repeats values at the beginning and end of the columns.
You can use the imputeTS package to impute the missing values. For a moving average you can do something like:
library(imputeTS)
ts_df[,2:5] <- apply(ts_df[,2:5], 2, na_ma, k = 6) # k = width of moving average
ts_df
Date Metric1 Metric2 Metric3 Metric4
1 2017-01-01 6.933333 3 64.57143 7
2 2017-01-02 7.806452 4 62.13333 10
3 2017-01-03 8.396825 2 61.58065 18
4 2017-01-04 5.000000 8 61.38095 20
5 2017-01-05 8.000000 9 87.00000 34
6 2017-01-06 10.000000 2 45.00000 12
7 2018-09-01 12.000000 13 14.00000 15
8 2018-09-02 34.000000 12 28.00000 19
9 2018-09-03 45.000000 12 45.00000 34
10 2018-09-04 33.984127 14 49.00000 11
11 2018-09-05 34.451613 11 90.00000 12
12 2018-09-06 35.333333 15 66.80952 32
13 2018-09-07 37.142857 23 67.16129 43
14 2018-09-08 41.333333 12 68.93333 22
Refer to the R documentation for more time-series-related imputation techniques in the imputeTS package.
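For example, other imputeTS functions can be swapped in with the same column-wise call pattern; this is only a sketch (using the same ts_df as in the Data block below), not a recommendation of any particular method:

library(imputeTS)

# Kalman smoothing on a state-space model of each series
ts_df[, 2:5] <- apply(ts_df[, 2:5], 2, na_kalman)

# Alternatively: linear interpolation or last observation carried forward
# ts_df[, 2:5] <- apply(ts_df[, 2:5], 2, na_interpolation, option = "linear")
# ts_df[, 2:5] <- apply(ts_df[, 2:5], 2, na_locf)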
Data:
ts_df <- read.table(text = " Date Metric1 Metric2 Metric3 Metric4
2017-01-01 NA 3 NA 7
2017-01-02 NA 4 NA 10
2017-01-03 NA 2 NA 18
2017-01-04 5 8 NA 20
2017-01-05 8 9 87 34
2017-01-06 10 2 45 12
2018-09-01 12 13 14 15
2018-09-02 34 12 28 19
2018-09-03 45 12 45 34
2018-09-04 NA 14 49 11
2018-09-05 NA 11 90 12
2018-09-06 NA 15 NA 32
2018-09-07 NA 23 NA 43
2018-09-08 NA 12 NA 22" , header = T, colClasses = c("Date" = "Date"))

How to count rows in a logical vector

I have a data frame called source that looks something like this
185 2002-07-04 NA NA 20
186 2002-07-05 NA NA 20
187 2002-07-06 NA NA 20
188 2002-07-07 14.400 0.243 20
189 2002-07-08 NA NA 20
190 2002-07-09 NA NA 20
191 2002-07-10 NA NA 20
192 2002-07-11 NA NA 20
193 2002-07-12 NA NA 20
194 2002-07-13 4.550 0.296 20
195 2002-07-14 NA NA 20
196 2002-07-15 NA NA 20
197 2002-07-16 NA NA 20
198 2002-07-17 NA NA 20
199 2002-07-18 NA NA 20
200 2002-07-19 NA 0.237 20
and when I try
> nrow(complete.cases(source))
I only get NULL
Can someone explain why this is the case, and how I can count how many rows there are without NA or NaN values?
Instead use sum. Though the safest option would be NROW (because it can handle both data.frames and vectors):
sum(complete.cases(source))
#[1] 2
Or alternatively, if you insist on using nrow:
nrow(source[complete.cases(source), ])
#[1] 2
Explanation: complete.cases returns a logical vector indicating which cases (in your case rows) are complete.
Sample data
source <- read.table(text =
"185 2002-07-04 NA NA 20
186 2002-07-05 NA NA 20
187 2002-07-06 NA NA 20
188 2002-07-07 14.400 0.243 20
189 2002-07-08 NA NA 20
190 2002-07-09 NA NA 20
191 2002-07-10 NA NA 20
192 2002-07-11 NA NA 20
193 2002-07-12 NA NA 20
194 2002-07-13 4.550 0.296 20
195 2002-07-14 NA NA 20
196 2002-07-15 NA NA 20
197 2002-07-16 NA NA 20
198 2002-07-17 NA NA 20
199 2002-07-18 NA NA 20
200 2002-07-19 NA 0.237 20")
complete.cases returns a logical vector indicating which rows are complete. Since a vector doesn't have a row attribute, you cannot use nrow here; use sum instead, as suggested by others. With sum, TRUE and FALSE are coerced to 1 and 0 internally, so sum counts the TRUE values of your vector.
sum(complete.cases(source))
# [1] 2
If, however, you are more interested in the data.frame that is left after excluding all incomplete rows, you can use na.exclude. This returns a data.frame, on which you can use nrow.
nrow(na.exclude(source))
# [1] 2
na.exclude(source)
# V2 V3 V4 V5
# 188 2002-07-07 14.40 0.243 20
# 194 2002-07-13 4.55 0.296 20
You can even try:
source[rowSums(is.na(source))==0,]
# V1 V2 V3 V4 V5
# 4 188 2002-07-07 14.40 0.243 20
# 10 194 2002-07-13 4.55 0.296 20
nrow(source[rowSums(is.na(source))==0,])
#[1] 2
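A small side note, since the question mentions NaN as well: is.na() treats NaN as missing, so all of the approaches above already exclude rows containing NaN.
is.na(NaN)
# [1] TRUE
complete.cases(c(1, NaN, 3))
# [1]  TRUE FALSE  TRUE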

Calculating rates when data is in long form

A sample of my data is available here.
I am trying to calculate the growth rate (change in weight (wt) over time) for each squirrel.
When I have my data in wide format:
squirrel fieldBirthDate date1 date2 date3 date4 date5 date6 age1 age2 age3 age4 age5 age6 wt1 wt2 wt3 wt4 wt5 wt6 litterid
22922 2017-05-13 2017-05-14 2017-06-07 NA NA NA NA 1 25 NA NA NA NA 12 52.9 NA NA NA NA 7684
22976 2017-05-13 2017-05-16 2017-06-07 NA NA NA NA 3 25 NA NA NA NA 15.5 50.9 NA NA NA NA 7692
22926 2017-05-13 2017-05-16 2017-06-07 NA NA NA NA 0 25 NA NA NA NA 10.1 48 NA NA NA NA 7719
I am able to calculate growth rate with the following code:
library(dplyr)

# growth rate between weight 1 and weight 3, divided by age when weight 3 is recorded
growth <- growth %>%
  mutate(g.rate = (wt3 - wt1) / age3)

# growth rate between weight 1 and weight 2, divided by age when weight 2 is recorded
merge.growth <- merge.growth %>%
  mutate(g.rate = (wt2 - wt1) / age2)
However, when the data is in long format (a format needed for the analysis I am running afterwards):
squirrel litterid date age wt
22922 7684 2017-05-13 0 NA
22922 7684 2017-05-14 1 12
22922 7684 2017-06-07 25 52.9
22976 7692 2017-05-13 1 NA
22976 7692 2017-05-16 3 15.5
22976 7692 2017-06-07 25 50.9
22926 7719 2017-05-14 0 10.1
22926 7719 2017-06-08 25 48
I cannot use the mutate function I used above. I am hoping to create a new column that includes growth rate as follows:
squirrel litterid date age wt g.rate
22922 7684 2017-05-13 0 NA NA
22922 7684 2017-05-14 1 12 NA
22922 7684 2017-06-07 25 52.9 1.704
22976 7692 2017-05-13 1 NA NA
22976 7692 2017-05-16 3 15.5 NA
22976 7692 2017-06-07 25 50.9 1.609
22926 7719 2017-05-14 0 10.1 NA
22926 7719 2017-06-08 25 48 1.516
22758 7736 2017-05-03 0 8.8 NA
22758 7736 2017-05-28 25 43 1.368
22758 7736 2017-07-05 63 126 1.860
22758 7736 2017-07-23 81 161 1.879
22758 7736 2017-07-26 84 171 1.930
I have been calculating the growth rates (growth between each wt and the first time the squirrel was weighed) in Excel; however, I would like to do the calculations in R instead, since I have a large number of squirrels to work with. I suspect if/else loops might be the way to go here, but I am not well versed in that sort of coding. Any suggestions or ideas are welcome!
You can use group_by to calculate this for each squirrel:
group_by(df, squirrel) %>%
  mutate(g.rate = (wt - nth(wt, which.min(is.na(wt)))) /
                  (age - nth(age, which.min(is.na(wt)))))
That leaves NaNs where the age term is zero, but you can change those to NAs if you want with df$g.rate[is.nan(df$g.rate)] <- NA.
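As a quick check, here is a small usage sketch on a hand-copied subset of the rows from the question (the tribble below is a reconstruction, not the full dataset):

library(dplyr)

df <- tibble::tribble(
  ~squirrel, ~litterid, ~date,        ~age, ~wt,
  22922,     7684,      "2017-05-13",  0,   NA,
  22922,     7684,      "2017-05-14",  1,   12,
  22922,     7684,      "2017-06-07", 25,   52.9,
  22926,     7719,      "2017-05-14",  0,   10.1,
  22926,     7719,      "2017-06-08", 25,   48
)

df <- group_by(df, squirrel) %>%
  mutate(g.rate = (wt - nth(wt, which.min(is.na(wt)))) /
                  (age - nth(age, which.min(is.na(wt)))))

df$g.rate[is.nan(df$g.rate)] <- NA  # 0/0 at the first weighing becomes NA
# g.rate is 1.704 for squirrel 22922 at age 25 and 1.516 for 22926, as expected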
An alternative using data.table and its shift() function, which takes the value from the previous row:
library(data.table)
df <- data.table(df)
df[, growth := (wt - shift(wt, 1)) / age, by = .(squirrel)]

R - how to subset rows of data based on column values in a data frame

I would like to plot things like (where C stands for column):
C4 vs C2 for all rows with the same C1, and
C1 vs C4 for all rows with the same C2.
The data frame in question is:
C1 C2 C3 C4
1 2012-12-28 0 NA 10773
2 2012-12-28 5 NA 34112
3 2012-12-28 10 NA 30901
4 2012-12-28 0 NA 12421
5 2012-12-30 0 NA 3925
6 2012-12-30 5 NA 17436
7 2012-12-30 10 NA 13717
8 2012-12-30 15 NA 36708
9 2012-12-30 20 NA 28408
10 2012-12-30 NA NA 2880
11 2013-01-02 0 -13.89 9972
12 2013-01-02 5 -13.89 10576
13 2013-01-02 10 -13.89 33280
14 2013-01-02 15 -13.89 28667
15 2013-01-02 20 -13.89 21104
16 2013-01-02 25 -13.89 24771
17 2013-01-02 NA NA 22
18 2013-01-05 0 -3.80 20727
19 2013-01-05 5 -3.80 2033
20 2013-01-05 10 -3.80 16045
21 2013-01-05 15 -3.80 12074
22 2013-01-05 20 -3.80 10095
23 2013-01-05 NA NA 32693
24 2013-01-08 0 -1.70 19579
25 2013-01-08 5 -1.70 20200
26 2013-01-08 10 -1.70 12263
27 2013-01-08 15 -1.70 28797
28 2013-01-08 20 -1.70 23963
29 2013-01-11 0 -2.30 26525
30 2013-01-11 5 -2.30 21472
31 2013-01-11 10 -2.30 9633
32 2013-01-11 15 -2.30 27849
33 2013-01-11 20 -2.30 23950
34 2013-01-17 0 1.40 16271
35 2013-01-17 5 1.40 18581
36 2013-01-19 0 0.10 5910
37 2013-01-19 5 0.10 16890
38 2013-01-19 10 0.10 13078
39 2013-01-19 NA NA 55
40 2013-01-23 0 -9.20 15048
41 2013-01-23 6 -9.20 20792
42 2013-01-26 0 NA 21649
43 2013-01-26 6 NA 24655
44 2013-01-29 0 0.10 9100
45 2013-01-29 5 0.10 27514
46 2013-01-29 10 0.10 19392
47 2013-01-29 15 0.10 21720
48 2013-01-29 NA 0.10 112
49 2013-02-11 0 0.40 13619
50 2013-02-11 5 0.40 2748
51 2013-02-11 10 0.40 1290
52 2013-02-11 15 0.40 762
53 2013-02-11 20 0.40 1125
54 2013-02-11 25 0.40 1709
55 2013-02-11 30 0.40 29459
56 2013-02-11 35 0.40 106474
57 2013-02-13 0 1.30 3355
58 2013-02-13 5 1.30 970
59 2013-02-13 10 1.30 2240
60 2013-02-13 15 1.30 35871
61 2013-02-18 0 -0.60 8564
62 2013-02-20 0 -1.20 12399
63 2013-02-26 0 0.30 2985
64 2013-02-26 5 0.30 9891
65 2013-03-01 0 0.90 5221
66 2013-03-01 5 0.90 9736
67 2013-03-05 0 0.60 3192
68 2013-03-05 5 0.60 4243
69 2013-03-09 0 0.10 45138
70 2013-03-09 5 0.10 55534
71 2013-03-12 0 1.40 7278
72 2013-03-12 NA NA 45
73 2013-03-15 0 0.30 2447
74 2013-03-15 5 0.30 2690
75 2013-03-18 0 -2.30 3008
76 2013-03-22 0 -0.90 11411
77 2013-03-22 5 -0.90 NA
78 2013-03-22 10 -0.90 17675
79 2013-03-22 NA NA 47
80 2013-03-25 0 1.20 9802
81 2013-03-25 5 1.20 15790
There are other posts here about time series subsetting and merging/matching/pasting subsets, but I think I'm missing the point when I try to follow those instructions.
The end goal is to have a plot of C1 vs C4 for every C2 = 0, C2 = 5, and so on, and likewise C4 vs C2 for every distinct C1. I know there are some duplicate C1 and C2 values, but C4 can be averaged over those. I can figure the plots out; I just need to know how to subset the data in this way. Perhaps creating a new data.frame() with these subsets would be the easiest approach?
Thanks in advance,
It's relatively easy to plot subsets using ggplot2. First you need to reshape your data from "wide" to "long" format, creating a new categorical variable with possible values C4 and C5.
library(reshape2)
library(ggplot2)

# Starting with the data you posted in a data frame called "dat":
# Convert C2 to date format
dat$C2 = as.Date(dat$C2)

# Reshape data to long format
dat.m = melt(dat, id.var = c("C1", "C2", "C3"))

# Plot values of C4 and C5 vs. C2 with separate lines for each level of C3
ggplot(dat.m, aes(x = C2, y = value, group = C3, colour = as.factor(C3))) +
  geom_line() + geom_point() +
  facet_grid(variable ~ ., scales = "free_y")
The C4 lines are the same for every level of C3, so they all overlap each other.
You can also have a separate panel for each level of C3:
ggplot(dat.m, aes(x = C2, y = value, group = variable, colour = variable)) +
  geom_line() + geom_point() +
  facet_grid(variable ~ C3, scales = "free_y") +
  theme(axis.text.x = element_text(angle = -90)) +
  guides(colour = FALSE)
Here's a base graphics method for getting separate plots. I'm using your new column names below:
# Use lapply to create a separate plot for each level of C2
lapply(na.omit(unique(dat$C2)), function(x) {
  # The next line of code removes NA values so that there will be a line through
  # every point. You can remove this line if you don't care whether all points
  # are connected or not.
  dat = dat[complete.cases(dat[, c("C1", "C2", "C4")]), ]
  # Create a plot of C4 vs. C1 for the current value of C2
  plot(dat$C1[dat$C2 == x], dat$C4[dat$C2 == x],
       type = "o", pch = 16,
       xlab = paste0("C2=", x), ylab = "C4")
})
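The question also mentions averaging C4 over duplicate C1/C2 combinations; neither answer does that step, but a minimal sketch of the pre-processing (assuming the data frame is called dat, as above) could be:

# Average C4 over duplicate date/level combinations before plotting
dat_avg <- aggregate(C4 ~ C1 + C2, data = dat, FUN = mean)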
