Flagging duplicate (non-unique) values based on specifications - r

I'm working with a dataset, "Final.Export", that looks like this:
LakeID LakeName SourceVariableName SourceVariableDescription SourceFlags
47 390 Moosehead Acolor(PCU) Apparent color <NA>
48 390 Moosehead Acolor(PCU) Apparent color <NA>
49 390 Moosehead Acolor(PCU) Apparent color <NA>
50 390 Moosehead Acolor(PCU) Apparent color <NA>
51 390 Moosehead Acolor(PCU) Apparent color <NA>
52 390 Moosehead Acolor(PCU) Apparent color <NA>
53 390 Moosehead Acolor(PCU) Apparent color <NA>
54 390 Moosehead Acolor(PCU) Apparent color <NA>
55 390 Moosehead Acolor(PCU) Apparent color <NA>
56 390 Moosehead Acolor(PCU) Apparent color <NA>
LagosVariableID LagosVariableName Value Units CensorCode DetectionLimit Date
47 11 Color, apparent 22 PCU NC NA 2003-08-26
48 11 Color, apparent 17 PCU NC NA 2003-08-26
49 11 Color, apparent 16 PCU NC NA 2003-08-26
50 11 Color, apparent 14 PCU NC NA 2003-08-26
51 11 Color, apparent 14 PCU NC NA 2003-08-26
52 11 Color, apparent 17 PCU NC NA 2003-08-26
53 11 Color, apparent 16 PCU NC NA 2003-08-26
54 11 Color, apparent 17 PCU NC NA 2003-08-26
55 11 Color, apparent 14 PCU NC NA 2003-08-26
56 11 Color, apparent 17 PCU NC NA 2003-08-26
LabMethodName LabMethodInfo SampleType SamplePosition SampleDepth MethodInfo
47 <NA> <NA> INTEGRATED SPECIFIED 6 <NA>
48 <NA> <NA> INTEGRATED SPECIFIED 7 <NA>
49 <NA> <NA> INTEGRATED SPECIFIED 6 <NA>
50 <NA> <NA> INTEGRATED SPECIFIED 10 <NA>
51 <NA> <NA> INTEGRATED SPECIFIED 10 <NA>
52 <NA> <NA> INTEGRATED SPECIFIED 9 <NA>
53 <NA> <NA> INTEGRATED SPECIFIED 10 <NA>
54 <NA> <NA> INTEGRATED SPECIFIED 8 <NA>
55 <NA> <NA> INTEGRATED SPECIFIED 10 <NA>
56 <NA> <NA> INTEGRATED SPECIFIED 10 <NA>
BasinType Subprogram Comments Dup
47 UNKNOWN NA NA NA
48 UNKNOWN NA NA NA
49 UNKNOWN NA NA NA
50 UNKNOWN NA NA NA
51 UNKNOWN NA NA NA
52 UNKNOWN NA NA NA
53 UNKNOWN NA NA NA
54 UNKNOWN NA NA NA
55 UNKNOWN NA NA NA
56 UNKNOWN NA NA NA
I want to flag all duplicate values as 1. Duplicate rows are defined as those that have identical values in every one of the 'LakeID', 'Date', 'LagosVariableID', 'SampleDepth', and 'SamplePosition' columns.
To do this I have created a new data table "data1" using the following code:
library(data.table)
data1 <- data.table(Final.Export,
                    key = c('LakeID', 'Date', 'LagosVariableID', 'SampleDepth', 'SamplePosition', 'Value'))
data1 <- data1[, Dup := duplicated(.SD),
               .SDcols = c('LakeID', 'Date', 'LagosVariableID', 'SampleDepth', 'SamplePosition', 'Value')]
data1$Dup[which(data1$Dup == FALSE)] <- NA
data1$Dup[which(data1$Dup == TRUE)] <- 1
The problem with "data1" is that only the duplicate rows (by my definition of duplicate) that come after the first occurrence are flagged as 1; the first occurrence itself is left as NA. I need the first row and all of its associated duplicates to be flagged as 1. Any ideas how to do this?
If this is confusing let me know how I can clarify.

It's difficult to say without a reproducible example, but it seems you want something like this:
data1[, dup := duplicated(.SD),
      by = list(LakeID, LagosVariableID, Value, Date, SamplePosition, SampleDepth)]
Edit:
After the OP's clarification, it appears they simply want this:
data1[, dup := duplicated(.SD),
      .SDcols = c('LakeID', 'Date', 'LagosVariableID', 'SampleDepth', 'SamplePosition')]
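If the goal is to flag every member of a duplicated group, including its first occurrence (which duplicated() alone leaves as FALSE), one common data.table idiom is to flag groups with more than one row. A minimal sketch, using the same five key columns as in the edit above:
library(data.table)
data1 <- as.data.table(Final.Export)

# flag every row whose key-column combination occurs more than once,
# including the first occurrence of each duplicated group
data1[, Dup := as.integer(.N > 1L),
      by = .(LakeID, Date, LagosVariableID, SampleDepth, SamplePosition)]

# keep the original convention: NA for unique rows, 1 for duplicated ones
data1[Dup == 0L, Dup := NA_integer_]
An equivalent formulation is duplicated(.SD) | duplicated(.SD, fromLast = TRUE) with the same .SDcols.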

Related

Multivariate imputation of missing values in weather data

I need to get a weather dataset ready as input to keras. I have 1096 entries of daily data over 3 years, of which the first month is missing. I was able to fill in the temperature column for that month from a nearby weather station. To check which imputation fits best, I deleted those 30 values, kept all columns as NA for the first month, and tried several imputation packages:
1. mice: continuous values, but the average was too high;
2. kNN (VIM): a constant value, too high;
3. missForest: a constant value, too high;
4. imputeTS interpolation: continuous values, slightly low;
5. imputeTS seasonal: a constant value, slightly low.
I therefore selected imputeTS interpolation and used it to impute the remaining columns, after filling the temperature column with the actual values. However, I cannot seem to get the seasonality options in imputeTS working.
Any idea why? Please find the data and code below:
Code:
library(VIM)       # kNN(), aggr()
library(imputeTS)  # na_interpolation(), na_seasplit()
# library(mice); library(missForest)  # needed for the commented-out attempts below

# ## MICE: did not match historical data, average too high (10.2)
# impute <- mice(gf, method = "pmm")
# print(impute)
# xyplot(impute, Temp ~ Reco | .imp, pch = 20, cex = 1.4)
# mf <- complete(impute, 3)
# mf <- cbind(mf, Date = df$Date.Time)
# write.csv(mf, "Mice_imputed.csv", row.names = TRUE)
# View(mf)
#
# ## missForest: constant value, too high (9.3)
# gf_impo <- missForest(gf, maxiter = 100, ntree = 500)
# gf_impo$ximp
# gf_impo <- cbind(gf_impo, Date = df$Date.Time)
# write.csv(gf_impo$ximp, "Val_MissForest_imputed.csv", row.names = TRUE)
# class(gf_impo)
# View(gf_impo)

## kNN (VIM): constant value, too high (13)
imp_knn <- kNN(gf, k = 500)
aggr(imp_knn, delimiter = "imp")
imp_knn <- cbind(imp_knn, Date = df$Date.Time)
View(imp_knn)
write.csv(imp_knn, "Val_KNN_imputed.csv", row.names = TRUE)

## imputeTS: seasonal imputation
imp_seas <- gf
imp_seas <- cbind(imp_seas, Date = df$Date.Time)
View(imp_seas)

imp_TS_intn <- na_interpolation(imp_seas, option = "spline")  # avg of 2.83, close to real 4.1
View(imp_TS_intn)
# imp_TS_seas <- na_seasplit(imp_seas, algorithm = "interpolation", find_frequency = FALSE, maxgap = Inf)
# constant 2.7
write.csv(imp_TS_intn, "ML_impTS_interpolate.csv", row.names = TRUE)
DATA:
A B C D E Date
1 NA NA NA 5.4000000 NA 2018-01-01
2 NA NA NA 5.7500000 NA 2018-01-02
3 NA NA NA 6.8000000 NA 2018-01-03
4 NA NA NA 6.3500000 NA 2018-01-04
5 NA NA NA 3.3500000 NA 2018-01-05
6 NA NA NA 3.0500000 NA 2018-01-06
7 NA NA NA 2.2000000 NA 2018-01-07
8 NA NA NA 0.6500000 NA 2018-01-08
9 NA NA NA 2.8500000 NA 2018-01-09
10 NA NA NA 2.2000000 NA 2018-01-10
11 NA NA NA 2.3500000 NA 2018-01-11
12 NA NA NA 5.1000000 NA 2018-01-12
13 NA NA NA 6.5500000 NA 2018-01-13
14 NA NA NA 5.0000000 NA 2018-01-14
15 NA NA NA 5.7500000 NA 2018-01-15
16 NA NA NA 2.0000000 NA 2018-01-16
17 NA NA NA 5.0500000 NA 2018-01-17
18 NA NA NA 3.8500000 NA 2018-01-18
19 NA NA NA 2.4500000 NA 2018-01-19
20 NA NA NA 5.1500000 NA 2018-01-20
21 NA NA NA 6.7500000 NA 2018-01-21
22 NA NA NA 9.2500000 NA 2018-01-22
23 NA NA NA 9.5000000 NA 2018-01-23
24 NA NA NA 6.4500000 NA 2018-01-24
25 NA NA NA 5.4000000 NA 2018-01-25
26 NA NA NA 5.3500000 NA 2018-01-26
27 NA NA NA 6.5500000 NA 2018-01-27
28 NA NA NA 10.1000000 NA 2018-01-28
29 NA NA NA 6.6000000 NA 2018-01-29
30 NA NA NA 3.8500000 NA 2018-01-30
31 NA NA NA 2.9000000 NA 2018-01-31
32 0.05374951 0.041144312 0.0023696211 5.9902083 0.068784302 2018-02-01
33 0.07565470 0.012326176 0.0057481689 10.5280417 0.176209125 2018-02-02
34 0.04476314 0.113718139 0.0089845444 12.8125000 0.176408788 2018-02-03
35 0.01695546 0.060965133 -0.0034163682 16.9593750 0.000000000 2018-02-04
36 0.09910202 0.090170142 -0.0111946461 10.4867292 0.088337951 2018-02-05
37 0.08514839 0.026061013 -0.0029183210 7.1662500 0.085590326 2018-02-06
38 0.06724108 0.104761909 -0.0416036605 6.9130417 0.134828348 2018-02-07
39 0.07638534 0.097570813 -0.0192784571 3.3840000 0.029682717 2018-02-08
40 0.02568162 0.008244304 -0.0288903610 12.0282292 0.055817103 2018-02-09
41 0.02752688 0.088544666 -0.0172136911 6.8694792 0.098169954 2018-02-10
42 0.06643098 0.063321337 -0.0347752292 7.4539792 0.034110652 2018-02-11
43 0.09743445 0.057502178 0.0162851223 13.9365208 0.264168082 2018-02-12
44 0.09189575 0.034429904 0.0020940613 13.8687292 0.162341764 2018-02-13
45 0.07857244 0.009406862 0.0075904680 11.7800000 0.101283946 2018-02-14
46 0.01987263 0.024783795 -0.0088742973 4.4463750 0.063949011 2018-02-15
47 0.02332892 0.010138857 0.0091396448 5.6452292 0.034708981 2018-02-16
48 0.02022396 0.014207518 0.0036018714 14.2862500 0.043205299 2018-02-17
49 0.07043020 0.075317793 0.0036760070 5.5940208 0.171898590 2018-02-18
50 0.02120779 0.010461857 -0.0277470177 13.6131250 0.061486533 2018-02-19
51 0.06405819 0.034185344 0.0173606568 7.0551042 0.052148976 2018-02-20
52 0.09428869 0.026957653 0.0016863903 6.7955000 0.085888435 2018-02-21
53 0.04248937 0.048782786 0.0004039921 17.5706250 0.000000000 2018-02-22
54 0.02076763 0.038094949 -0.0003671638 14.8379167 0.000000000 2018-02-23
55 0.01343260 0.118003726 -0.0214988345 6.4564583 0.053353606 2018-02-24
56 0.05231647 0.054454132 -0.0098012290 7.8568333 0.183326943 2018-02-25
57 0.02476706 0.087501472 0.0031839472 15.7493750 0.210616272 2018-02-26
58 0.07358998 0.023558218 0.0031618607 10.8001250 0.241602571 2018-02-27
59 0.02042573 0.009268439 0.0088051496 7.2967500 0.251608940 2018-02-28
60 0.02107772 0.083567750 -0.0037223644 6.2674375 0.062221630 2018-03-01
61 0.05830801 0.029456683 0.0114978078 13.0810417 0.193765948 2018-03-02
62 0.02923587 0.070533843 0.0068299668 14.4095833 0.244310193 2018-03-03
63 0.02570283 0.058270093 0.0137174366 3.8527917 0.120846709 2018-03-04
64 0.01434395 0.014637405 0.0051951050 9.6877083 0.112579011 2018-03-05
65 0.06426214 0.078872579 0.0068664343 4.6763750 0.000000000 2018-03-06
66 0.04782772 0.011762501 0.0086182870 12.7027083 0.129606106 2018-03-07
67 0.01809136 0.105398844 0.0231671305 10.8052083 0.017683908 2018-03-08
68 0.04427582 0.020397435 -0.0009758693 6.5983333 0.041148864 2018-03-09
69 0.05123687 0.115984361 -0.0372104856 6.5021250 0.180013174 2018-03-10
70 0.01913266 0.005981014 -0.0159701842 8.9844375 0.095262921 2018-03-11
71 0.04407234 0.009142247 -0.0031640496 7.7638333 0.000000000 2018-03-12
72 0.09108709 0.038174205 0.0005654564 5.3772083 0.044105747 2018-03-13
73 0.05488394 0.115153937 0.0192819858 8.9182917 0.039993864 2018-03-14
74 0.03726892 0.067983475 -0.0311367032 2.4423333 0.066108171 2018-03-15
75 0.05563102 0.003831231 -0.0011148743 10.7100000 0.217461791 2018-03-16
76 0.04922930 0.055446609 0.0075246331 5.0829375 0.149530704 2018-03-17
77 0.02972858 0.061966039 -0.0392014211 12.3645625 0.060670492 2018-03-18
78 0.02812688 0.018183092 0.0134514770 9.0172292 0.158435250 2018-03-19
79 0.03066101 0.007622504 -0.0249482114 6.2709792 0.118487919 2018-03-20
80 0.06801767 0.083261012 0.0133423296 13.3683333 0.196053774 2018-03-21
81 0.04178157 0.093600914 0.0116253865 10.0024167 0.020835522 2018-03-22
82 0.04725052 0.018187748 -0.0115718535 10.3528333 0.097352796 2018-03-23
83 0.02042339 0.081504844 -0.0380958738 17.2006250 0.010500742 2018-03-24
84 0.06674396 0.098739090 -0.0108474961 17.5437500 0.119415595 2018-03-25
85 0.07049507 0.016286614 -0.0007817195 16.8800000 0.060452087 2018-03-26
86 0.01244906 0.018100693 -0.0266155999 8.8651458 0.018144668 2018-03-27
87 0.05271711 0.015368632 -0.0477885811 7.2415417 0.092797451 2018-03-28
88 0.01610886 0.014919094 0.0023487944 7.7914792 0.062818728 2018-03-29
89 0.08847253 0.059397043 0.0130362880 10.9732708 0.087451484 2018-03-30
90 0.02938725 0.044473745 0.0091253257 6.0241458 0.025488946 2018-03-31
91 0.08599249 0.043160908 0.0082536160 8.8211875 0.012975783 2018-04-01
92 0.05747667 0.017709243 -0.0090965038 6.3249375 0.065731818 2018-04-02
93 0.05772051 0.085210524 -0.0013533831 13.4166667 0.067790160 2018-04-03
94 0.01699834 0.020657341 0.0039885065 3.2999792 0.076302652 2018-04-04
95 0.03565076 0.110372607 -0.0313309140 12.7822083 0.184844707 2018-04-05
96 0.02050401 0.078943608 -0.0062322339 4.3233125 0.067820413 2018-04-06
97 0.06186790 0.013147512 0.0203249289 6.3953750 0.034104318 2018-04-07
98 0.06304988 0.012997642 0.0061171825 9.7322708 0.021220516 2018-04-08
99 0.03799006 0.012420760 0.0054724563 8.8472083 0.068664033 2018-04-09
100 0.01610225 0.061182804 0.0031002885 7.5622708 0.085766429 2018-04-10
101 0.05937683 0.008333173 -0.0053972689 7.8848542 0.058386726 2018-04-11
102 0.02190115 0.037843227 0.0089823372 8.3339792 0.055761391 2018-04-12
103 0.01179665 0.016899394 -0.0016533437 5.5101667 0.099133313 2018-04-13
104 0.02464707 0.021231270 -0.0212016846 15.5106250 0.126661378 2018-04-14
105 0.01906818 0.065273389 0.0081694393 7.6616667 0.032939519 2018-04-15
106 0.05418785 0.074619385 -0.0355680586 11.3618750 0.057768261 2018-04-16
107 0.06508988 0.014345229 0.0080423912 14.7137500 0.032709791 2018-04-17
108 0.06101126 0.060624597 -0.0399526978 17.2754167 0.230982139 2018-04-18
109 0.02226268 0.010230837 0.0001617419 2.9382083 0.000000000 2018-04-19
110 0.03884772 0.014218453 0.0039652960 10.7261875 0.179962834 2018-04-20
111 0.09054488 0.025711098 -0.0115944362 4.4734583 0.011442318 2018-04-21
112 0.03072171 0.076530730 0.0032123501 9.4128750 0.033174489 2018-04-22
113 0.04361276 0.101151670 0.0249408843 14.5804167 0.024238883 2018-04-23
114 0.03877568 0.049142846 0.0080689866 8.3168750 0.084570611 2018-04-24
115 0.05564027 0.076917047 0.0033447160 15.7308333 0.199762524 2018-04-25
116 0.04752544 0.019655228 -0.0063218138 15.7302083 0.020449908 2018-04-26
117 0.01718916 0.026132806 -0.0261027525 10.0887500 0.128898351 2018-04-27
118 0.04144832 0.034526516 0.0117868820 6.0784375 0.014449565 2018-04-28
119 0.03255833 0.113650910 -0.0123724759 11.8654167 0.085410171 2018-04-29
120 0.03656535 0.043333607 0.0230071368 7.0974167 0.035725321 2018-04-30
121 0.04570760 0.093595938 -0.0329915968 5.4016458 0.013467946 2018-05-01
122 0.07271528 0.061923504 0.0130002656 9.1602292 0.018299062 2018-05-02
123 0.02646133 0.007506529 -0.0276898846 0.2338125 0.246100834 2018-05-03
124 0.02379895 0.067273612 0.0112587565 19.1260417 0.120707266 2018-05-04
125 0.05925152 0.075768053 0.0050178925 16.2114583 0.162884739 2018-05-05
126 0.01858152 0.040845398 0.0164467420 12.9156250 0.028823967 2018-05-06
127 0.06994835 0.059457560 -0.0181926787 7.7316042 0.035106399 2018-05-07
128 0.05926409 0.038623605 0.0167222227 13.5464583 0.055665220 2018-05-08
129 0.03104010 0.006805893 -0.0141792029 14.5006250 0.012099383 2018-05-09
130 0.06631012 0.059314975 -0.0228020931 13.3711875 0.073114370 2018-05-10
131 0.03794480 0.015615642 0.0034917459 16.6675208 0.191141576 2018-05-11
132 0.03532917 0.050988581 0.0079455282 14.7375208 0.214172062 2018-05-12
133 0.08512617 0.063322454 0.0224309652 11.6861250 0.166425889 2018-05-13
134 0.04498265 0.012386160 -0.0051629339 7.2488333 0.280120908 2018-05-14
135 0.06383512 0.126840241 -0.0172296864 17.3852083 0.020363429 2018-05-15
136 0.06932861 0.026819550 -0.0109061610 20.9152083 0.099516538 2018-05-16
137 0.04020292 0.021831228 -0.0007211804 6.7122292 0.069831669 2018-05-17
138 0.02037474 0.020931810 0.0088341962 15.8758333 0.130548701 2018-05-18
139 0.01704143 0.105810563 -0.0243003529 10.7339583 0.038013440 2018-05-19
140 0.01266417 0.013985439 0.0091359503 6.5119375 0.196746897 2018-05-20
141 0.03623625 0.057182212 -0.0136101306 18.6637500 0.009431062 2018-05-21
142 0.03938695 0.054879146 0.0091277482 15.5393750 0.115389187 2018-05-22
143 0.05995812 0.061925644 -0.0029137774 11.8191667 0.015729774 2018-05-23
144 0.06548692 0.095240991 0.0055356839 4.3011875 0.081309326 2018-05-24
145 0.01582489 0.015264434 -0.0020079231 9.3315833 0.105132636 2018-05-25
146 0.06834050 0.028756388 -0.0512068435 13.6035417 0.212930829 2018-05-26
147 0.08354736 0.023524928 0.0041989465 4.5111250 0.227197329 2018-05-27
148 0.05738595 0.011159952 -0.0225834032 12.9385417 0.090503870 2018-05-28
149 0.07817132 0.103507587 -0.0222426051 13.4047292 0.034928812 2018-05-29
150 0.04773356 0.035856991 -0.0191600449 9.6657708 0.019893986 2018-05-30
Disclaimer: I am looking for a co-author to help validate my work with keras / TensorFlow.
Maybe you can try the following things:
Using seasonal decomposition, na_seadec(), instead of seasonal split (na_seasplit())
Manually setting the seasonality
Setting find_frequency = TRUE
You can set the seasonality manually as follows. Suppose you have a vector x of monthly values; this means frequency = 12:
# Create time series with seasonality information
x <-c(2,3,4,5,6,6,6,6,6,4,4,4,4,5,6,4,3,3,5,6,4,3,5,3,5,3,5,3,4,4,4,4,4,4,2,4,2,2,4,5,6,7,8,9,0,0,5,2,4)
x_with_freq <- ts(x, frequency = 12)
# imputation for the series
na_seadec(x_with_freq)
If you have daily values, your frequency should be 365.
Other than this, you could also run na_seadec(x, find_frequency = TRUE); imputeTS then tries to detect the seasonality for you automatically.
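For the posted data, a minimal sketch of both options might look like this. It assumes the column of interest is D in a data frame gf (as in the data shown in the question) and the full three-year daily series; seasonal decomposition needs at least two complete cycles, so it will not work on the short excerpt alone:
library(imputeTS)

# declare the daily seasonality explicitly: frequency = 365
d_ts <- ts(gf$D, frequency = 365)
gf$D_seadec <- as.numeric(na_seadec(d_ts))

# or let imputeTS try to detect the frequency itself
gf$D_auto <- as.numeric(na_seadec(ts(gf$D), find_frequency = TRUE))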
But in the end, I don't know your data; it could very well be that the seasonal patterns simply aren't very strong.

Error converting a time series with NA values to a data frame in R

I want to convert a time series into a data frame and keep the same format. The problem is that this ts has some NA values that come from a previous calculation step and that I can't fill by interpolation. I tried to remove the NAs from the time series before converting it, but I get an error that I don't even understand.
My time series is the following; it has only two NAs at the beginning (an excerpt):
>spi_ts_3
Jan Feb Mar Apr May Jun
1989 NA NA 0.765069346 1.565910141 1.461138946 1.372936681
1990 -0.157878028 0.097403112 0.099963471 0.729772909 0.569480219 -0.419761595
1991 -0.157878028 0.348524568 0.230534719 0.356331349 0.250889358 0.353116608
1992 1.662879078 2.178001602 1.379790538 1.367209519 1.367845061 1.451183431
1993 0.096554376 0.058881807 0.247172184 -0.085316621 0.020991171 -0.491276965
1994 0.258656104 0.716903968 0.847780489 0.440594371 0.474698780 -0.473765100
The code I'm using to convert it and handle the NAs is the following:
library(tseries)
na.remove(spi_ts_3)
df_fitted_3 <- as.data.frame(type.convert(.preformat.ts(spi_ts_3)))
When I look at the data frame produced, I don't understand what is happening. I get something like the following for each month, and a warning at the end:
type.convert(.preformat.ts(spi_ts_3)).Feb
1 NA
2 0.097403112
3 0.348524568
4 2.178001602
5 0.058881807
6 0.716903968
7 2.211192460
8 -1.123925787
9 0.395452064
10 -0.106514633
11 -1.637049815
12 -0.862751319
13 -0.010681104
14 -0.958173964
15 0.470583289
16 0.088061116
17 0.485598080
18 -0.661229419
19 1.323879689
20 -0.449031840
21 -1.867196593
22 0.598343928
23 -2.549778490
24 -0.174824280
25 0.892977124
26 -0.246675932
27 0.324195405
28 -0.296931389
29 0.356029416
30 0.171029515
31 <NA>
32 <NA>
... (rows 33 through 83 are likewise <NA>)
Warning message:
In format.data.frame(if (omit) x[seq_len(n0), , drop = FALSE] else x, :
corrupt data frame: columns will be truncated or padded with NAs
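One way to sidestep the error (a minimal sketch, not necessarily what was originally intended) is to skip na.remove()/.preformat.ts() entirely and rebuild the year-by-month layout from the ts attributes, keeping the NAs. It assumes spi_ts_3 is a monthly ts, as printed above:
# long format: one row per observation, NAs preserved
df_long <- data.frame(
  year  = as.integer(floor(time(spi_ts_3))),
  month = month.abb[cycle(spi_ts_3)],
  value = as.numeric(spi_ts_3)
)

# wide format: one row per year, one column per month
df_fitted_3 <- reshape(df_long, idvar = "year", timevar = "month", direction = "wide")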

How to get the same result every time when allocating values following a specific distribution

My data consists of postal codes and hospitals. Many records have a missing hospital, and I want to allocate a hospital to each such record following the distribution of all records in that postal code. Let's say that in postal code 2211 the distribution of hospitals A and B is 0.3 vs. 0.7. The records with missing hospitals in this postal code need to follow the same distribution and need to get the same result every time I run the code.
I already tried:
sample(c("A","B"), nrow(df), replace=TRUE, prob=c(0.3,0.7))
This gave the desired result, but when I run the code again the result at record level is different. I read about set.seed(), but that didn't give me the same output either.
Some of my data:
postal code hospital daydate
1 2211 NA 0
2 2211 NA 6
3 2211 NA 8
4 2211 NA 15
5 2211 NA 18
6 2211 NA 18
7 2211 NA 25
8 2211 NA 30
9 2211 NA 51
10 2211 NA 55
11 2211 NA 58
12 2211 NA 59
13 2211 NA 61
14 2211 NA 61
15 2211 NA 64
16 2211 NA 66
17 2211 NA 68
18 2211 NA 69
There are 18 records in this example, so with the 0.3/0.7 split roughly 5 records need to get hospital A and 13 records hospital B. And record 10, for example, always needs to end up as A, not A on one run and B the next.
I hope my question is clear (first time I've asked a question here) and that someone can help me out! Thank you in advance!
set.seed() should be the solution; it has to be called immediately before each sample() call:
set.seed(0)
s1 <- sample(c("A","B"), 18, replace=TRUE, prob=c(0.3,0.7))
set.seed(0)
s2 <- sample(c("A","B"), 18, replace=TRUE, prob=c(0.3,0.7))
identical(s1, s2)
#[1] TRUE
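If the allocation has to respect each postal code's own observed hospital distribution, one way to do it is sketched below. This assumes a data frame df with columns postal_code and hospital (names are hypothetical), where hospital is NA for the records to fill and every postal code has at least one observed hospital:
library(data.table)
set.seed(42)  # fix the seed once so reruns give identical allocations

dt <- as.data.table(df)
dt[, hospital_filled := {
  obs    <- hospital[!is.na(hospital)]
  shares <- prop.table(table(obs))              # observed hospital shares in this postal code
  filled <- hospital
  filled[is.na(filled)] <- sample(names(shares), sum(is.na(filled)),
                                  replace = TRUE, prob = shares)
  filled
}, by = postal_code]
Because the seed fixes the random-number stream and the groups are processed in a deterministic order, each record gets the same hospital on every run.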

Make a row NA starting from a cell in a column

I need to set a row to NA starting from a cell in a given column. Please see the example below.
How can I achieve this in R? Any help is appreciated.
When I use data <- data[!(data$DES6=="F001"),] it removes the 1st and 3rd rows in the example below, but I need to keep the 1st and 3rd rows, as shown in the desired output.
Thanks in advance.
data:
YEAR ID STATE CROP CTY DES1 DES2 DES3 DES4 DES5 DES6 DES7 DES8
1998 53 CA 11 25 LOO1 50 N 23 W F001 25 S
1998 54 CA 11 26 LOO1 61 N 25 W NA NA NA
1998 55 CO 11 17 LOO1 62 S 26 E F001 26 N
output:
YEAR ID STATE CROP CTY DES1 DES2 DES3 DES4 DES5 DES6 DES7 DES8
1998 53 CA 11 25 LOO1 50 N 23 W NA NA NA
1998 54 CA 11 26 LOO1 61 N 25 W NA NA NA
1998 55 CO 11 17 LOO1 62 S 26 E NA NA NA
This will set the matching rows to NA from the specified column through to the last column:
# for every row where DES6 == "F001", blank out DES6 and all columns after it
df1[df1$DES6 %in% "F001", seq(grep("^DES6$", colnames(df1)), ncol(df1))] <- NA
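Applied to the posted example (assuming the data frame is called data, with the columns shown above), the indexing works like this:
start_col <- grep("^DES6$", colnames(data))           # position of the DES6 column (11 here)
data[data$DES6 %in% "F001", start_col:ncol(data)] <- NA
data  # DES6, DES7 and DES8 become NA in rows 1 and 3, matching the desired output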

Loss of data when using merge

I have a df with states, and I am trying to add lat/long values for each state so I can plot percentage values for each state on a map. When I use merge I either get an empty df (if I don't use all=TRUE), or I get missing data for either my lat/long values or my data itself, depending on which data frame I make x or y.
Code to load my df and add column headers:
fileURL <- c("https://drive.google.com/open?id=0B-jAX5hT2D3hNnVtLVhROENKRGs")
suppressMessages(require(data.table))
ge.planted <- fread(fileURL, na.strings = "NA")
colnames(ge.planted) <- c("region", "type", "crop", "2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015")
Code to get state names with lat, long values for the center of each state
snames <- data.frame(region=tolower(state.name), long=state.center$x, lat=state.center$y)
When I merge the two df using:
snames <- merge(ge.planted, snames, by="region")
I get
[1] region long lat type crop 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
[17] 2011 2012 2013 2014 2015
Or if I use
snames <- merge( ge.planted, snames, by="region", all=TRUE)
And I get my values but no lat, long
region type crop 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
1: Alabama Insect-resistant (Bt) only Cotton - - - - - 10 10 10 18 13 11 18 17 12
2: Alabama Herbicide-tolerant only Cotton - - - - - 28 25 25 15 18 7 4 11 4
3: Alabama Stacked gene varieties Cotton - - - - - 54 60 60 65 60 76 75 70 82
4: Alabama All GE varieties Cotton - - - - - 92 95 95 98 91 94 97 98 98
5: Arkansas Herbicide-tolerant only Soybean 43 60 68 84 92 92 92 92 94 94 96 95 94 97
6: Arkansas All GE varieties Soybean 43 60 68 84 92 92 92 92 94 94 96 95 94 97
2014 2015 long lat
1: 9 4 NA NA
2: 6 3 NA NA
3: 83 90 NA NA
4: 98 97 NA NA
5: 99 97 NA NA
6: 99 97 NA NA
And finally with
snames <- merge(snames, ge.planted, by="region", all=TRUE)
I get lat, long but no values
region long lat type crop 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
1 alabama -87 33 <NA> <NA> <NA> <NA> <NA> <NA> <NA> NA NA NA NA NA NA NA NA NA NA NA
2 alaska -127 49 <NA> <NA> <NA> <NA> <NA> <NA> <NA> NA NA NA NA NA NA NA NA NA NA NA
3 arizona -112 34 <NA> <NA> <NA> <NA> <NA> <NA> <NA> NA NA NA NA NA NA NA NA NA NA NA
4 arkansas -92 35 <NA> <NA> <NA> <NA> <NA> <NA> <NA> NA NA NA NA NA NA NA NA NA NA NA
5 california -120 37 <NA> <NA> <NA> <NA> <NA> <NA> <NA> NA NA NA NA NA NA NA NA NA NA NA
6 colorado -106 39 <NA> <NA> <NA> <NA> <NA> <NA> <NA> NA NA NA NA NA NA NA NA NA NA NA
As best I can tell, instead of merging the data frames on 'region', it is appending the y columns onto the end of the data frame.
The problem is that you used tolower(), so the region names in one data frame differ from those in the other (ge.planted has capitalized names, snames does not), and merge will not recognize them as equivalent. Delete the tolower() call and it should work.
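Alternatively, a sketch for the case where you want to keep the lower-case names (handy if you later join to ggplot2::map_data("state"), which also uses lower-case region names): lower-case ge.planted$region instead of dropping the tolower() call.
# ge.planted is a data.table (it comes from fread), so := works here
ge.planted[, region := tolower(region)]
ge.planted.map <- merge(ge.planted, snames, by = "region")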
