How to filter a data.frame using a string in another table? - r

When I print the name of the first element of a table, I get the following:
>names(freq25)[1]
"japão"
When I type
>library(dplyr)
>filter(data, grepl("japão", HeadLine)) %>% select(Column)
Correct table
I get the correct table. But if I put both together I get:
>filter(data, grepl(names(freq25)[1], Headline)) %>% select(Column)
[1] Column
<0 rows> (or 0-length row.names)
I get an empty table.
I suspected that the class of the names() output was not suitable for grepl, but then I tried:
>class(names(freq)[1])
[1] "character"
I don't know what I am doing wrong.
Any advice?
P.S.: I want to filter using names(freq25)[i] because I want to filter a number of times, once per element in freq25. Maybe I could do that some other way.
Edit1: Content of freq25
>str(freq25)
Named num [1:25] 1260 304 215 193 192 167 164 151 150 149 ...
- attr(*, "names")= chr [1:25] "japão" "anos" "japonês" "tóquio" ...
>freq25
japão anos japonês tóquio preso homem tokyo japoneses polícia
1260 304 215 193 192 167 164 151 150
maior dia sobre previsão após parte japonesa mil mulher
149 136 134 131 129 128 117 113 112
brasil estrangeiros tempo pessoas governo novo pode
108 107 100 97 95 93 90
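The question does not establish why the second call returns no rows. One possible cause, offered only as an assumption, is an encoding mismatch between the accented name and the headline text. The sketch below uses hypothetical stand-ins for `data` and `freq25`, normalises both sides to UTF-8 with `enc2utf8()`, and filters once per name with `lapply()` in base R.

```r
# Hypothetical stand-ins for the `data` and `freq25` objects in the question
data <- data.frame(
  Headline = c("terremoto no jap\u00e3o", "dez anos depois", "jap\u00e3o e brasil"),
  Column   = c("a", "b", "c"),
  stringsAsFactors = FALSE
)
freq25 <- c(1260, 304)
names(freq25) <- c("jap\u00e3o", "anos")

# Normalise both sides to UTF-8 so accented keywords compare consistently
keys      <- enc2utf8(names(freq25))
headlines <- enc2utf8(data$Headline)

# One filtered data frame per keyword; fixed = TRUE matches the literal string
hits <- lapply(setNames(keys, keys), function(k) {
  data[grepl(k, headlines, fixed = TRUE), "Column", drop = FALSE]
})

sapply(hits, nrow)  # number of matching rows per keyword
```

Comparing `Encoding(names(freq25)[1])` with `Encoding()` of the headline column is a quick way to check whether the two sides are really marked differently.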

Related

Indexing through 'names' in a list and performing actions on contained values in R

I have a data set of counts from standard solutions passed through an instrument that analyses chemical concentrations (an ICPMS for those familiar). The data is over a range of different standards and for each standard I have four repeat measurements that I want to calculate the mean and variance of.
I'm importing the data from an Excel spreadsheet and then, following some housekeeping such as getting dates and times in the right format, I split the dataset up into a list identified by the name of the standard solution using Count11.sp<-split(Count11.raw, Count11.raw$Type). Count11.raw$Type then becomes the list element name and I have the four count results for each chemical element in that list element.
So far so good.
I find I can obtain an average (mean, median etc.) easily enough by identifying the list element specifically, i.e. mean(Count11.sp$'Ca40') or sapply(Count11.sp$'Ca40', median), but what I'm not able to do is automate that in a loop so that I can calculate the means for each standard and drop them into a numerical matrix for further manipulation. I can extract the list element names with names(), and I can even use a loop to make a vector of all the names and reference the specific list element using these in a for loop.
For instance, Count11.sp[names(Count11.sp[i])] will extract the full list element no problem:
$`Post Ca45t`
Type Run Date 7Li 9Be 24Mg 43Ca 52Cr 55Mn 59Co 60Ni
77 Post Ca45t 1 2011-02-08 00:13:08 114 26101 4191 453525 2632 520 714 2270
78 Post Ca45t 2 2011-02-08 00:13:24 114 26045 4179 454299 2822 524 704 2444
79 Post Ca45t 3 2011-02-08 00:13:41 96 26372 3961 456293 2898 520 762 2244
80 Post Ca45t 4 2011-02-08 00:13:58 112 26244 3799 454702 2630 510 792 2356
65Cu 66Zn 85Rb 86Sr 111Cd 115In 118Sn 137Ba 140Ce 141Pr 157Gd 185Re 208Pb
77 244 1036 56 3081 44 520625 78 166 724 10 0 388998 613
78 250 982 70 3103 46 526154 76 174 744 16 4 396496 644
79 246 1014 36 3183 56 524195 60 198 744 2 0 396024 612
80 270 932 60 3137 44 523366 70 180 824 2 4 390436 632
238U
77 24
78 20
79 14
80 6
but sapply(Count11.sp[names(Count11.sp[i])], median) produces an error message: Error in median.default(X[[i]], ...) : need numeric data
while sapply(Input$`Post Ca45t`, median) ('Post Ca45t' being the name of Count11.sp[i] for i = 4) does exactly what I want and produces the median values (I can clean up that vector later for medians that don't make sense), e.g.
Type Run Date 7Li 9Be 24Mg
NA 2.5 1297109612.5 113.0 26172.5 4070.0
43Ca 52Cr 55Mn 59Co 60Ni 65Cu
454500.5 2727.0 520.0 738.0 2313.0 248.0
66Zn 85Rb 86Sr 111Cd 115In 118Sn
998.0 58.0 3120.0 45.0 523780.5 73.0
137Ba 140Ce 141Pr 157Gd 185Re 208Pb
177.0 744.0 6.0 2.0 393230.0 622.5
238U
17.0
Can anyone give me any insight into how I can automate (i.e. loop through) these names to produce one median vector per list element? I'm sure there's just some simple disconnect in my logic here that may be easily solved.
Update: I've solved the problem. The way to do so is to use tapply on the original dataset without the need to split it. tapply allows functions to be applied to data based on user-defined grouping criteria. In my case I could group according to Count11.raw$Type and then take the mean of each data subset. tapply(Count11.raw$Type, Count11.raw[,3:ncol(Count11.raw)], mean), job done.
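A minimal sketch of both routes, using a hypothetical miniature of Count11.raw (the column names Li7 and Be9 are made up for illustration). Single-bracket indexing returns a one-element list, so sapply hands median the whole data frame, which would explain the "need numeric data" error; double brackets return the data frame itself. Since tapply takes a single vector as its first argument, aggregate is used here as the multi-column equivalent of the grouping described in the update.

```r
# Hypothetical miniature of Count11.raw: a grouping column plus numeric counts
Count11.raw <- data.frame(
  Type = rep(c("Ca40", "Post Ca45t"), each = 4),
  Run  = rep(1:4, times = 2),
  Li7  = c(100, 102, 98, 100, 114, 114, 96, 112),
  Be9  = c(26000, 26010, 25990, 26000, 26101, 26045, 26372, 26244)
)
num_cols <- c("Li7", "Be9")

Count11.sp <- split(Count11.raw, Count11.raw$Type)

# `[[` extracts the data frame itself; `[` would return a one-element list
medians <- t(sapply(names(Count11.sp), function(nm) {
  sapply(Count11.sp[[nm]][, num_cols], median)
}))

# The same grouping without splitting first: one row of means per Type
means <- aggregate(Count11.raw[, num_cols],
                   by = list(Type = Count11.raw$Type), FUN = mean)

medians
means
```

Here medians is a numeric matrix with one row per standard, which is the shape the question asks for.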

Create a dataframe in R

I would like to create a dataframe with 117 columns and 90 rows, the first ones being: ID, date1, date2, Category, DR1, DRM01, DRM02, DRM03 .... up to DRM111. For the first column, it would have values ranging from 1 to 3. In date1 it would have a fixed value, which would be "2022-01-05", in date2, it would have values between 2021-12-20 to the maximum that it gives. Category can be ABC or ERF, in DR1 would be values that would vary from 200 to 250, and finally, in DRM columns, would be values that would vary from 0 to 300. Is it possible to create a dataframe like this?
I wonder if this is an effort at simulation. The first few tasks seem blindingly obvious, but the last call to replicate with simplify=FALSE might have been a bit less than trivial.
test <- data.frame(ID = rep(1:3, length.out = 90),
                   date1 = as.Date("2022-01-05"),
                   date2 = seq(as.Date("2021-12-20"), length.out = 90, by = 1),
                   # Category = ???? so far not specified
                   DR1 = sample(200:250, 90, replace = TRUE),  # replace = TRUE is needed: 90 draws from only 51 values
                   setNames(replicate(111, sample(0:300, 90), simplify = FALSE),
                            nm = paste("DRM", 1:111)))
Snipped the last 105 rows of the output from str:
str(test)
'data.frame': 90 obs. of 115 variables:
$ ID : int 1 2 3 1 2 3 1 2 3 1 ...
$ date1 : Date, format: "2022-01-05" "2022-01-05" "2022-01-05" "2022-01-05" ...
$ date2 : Date, format: "2021-12-20" "2021-12-21" "2021-12-22" "2021-12-23" ...
$ DR1 : int 229 218 240 243 221 202 242 221 237 208 ...
$ DRM.1 : int 41 238 142 100 19 56 224 152 85 84 ...
$ DRM.2 : int 150 185 141 55 34 83 88 105 165 294 ...
$ DRM.3 : int 144 22 237 174 78 291 120 63 261 236 ...
$ DRM.4 : int 223 105 263 214 45 226 129 80 182 15 ...
$ DRM.5 : int 27 108 288 237 129 251 150 70 300 243 ...
# additional rows elided
The last item in that construction returns a list that has 111 "columns" with ascending numbered names. I admit to being puzzled about why there were periods in the DRM names but then realized that the data.frame function uses check.names to make sure they are legitimate, so the spaces from paste were converted to periods. If you don't like periods then use paste0.
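A variant of the construction above, sketched under two assumptions that the answer leaves open: Category is drawn at random from "ABC" and "ERF", and the DRM columns are zero-padded to two digits (DRM01 ... DRM111) as the question lists them. The seed and the name test2 are arbitrary.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
n <- 90

# 111 integer columns named DRM01, DRM02, ..., DRM111 (paste0 adds no space)
drm <- setNames(
  replicate(111, sample(0:300, n, replace = TRUE), simplify = FALSE),
  paste0("DRM", sprintf("%02d", 1:111))
)

test2 <- data.frame(
  ID       = rep(1:3, length.out = n),
  date1    = as.Date("2022-01-05"),
  date2    = seq(as.Date("2021-12-20"), by = "day", length.out = n),
  Category = sample(c("ABC", "ERF"), n, replace = TRUE),  # assumed random draw
  DR1      = sample(200:250, n, replace = TRUE),
  drm
)

dim(test2)
```

Because the names contain no spaces, check.names has nothing to repair and no periods appear.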

Converting different maximum scores to percentage out of 100

I have three different datasets, each with 3 students and 3 subjects, but with different maximum scores (125, 150, 200). How can I calculate the mean percentage (out of 100) per subject for the whole standard (not per section) when the three maximum scores differ and the raw scores are therefore not directly comparable?
Class2:
section1.csv
english maths science
name score(125) score(125) score(125)
sam 114 112 111
erm 89 91 97
asd 101 107 118
section2.csv
english maths science
name score(150) score(150) score(150)
wer 141 127 143
rahul 134 119 145
rohit 149 135 139
section3.csv
english maths science
name score(200) score(200) score(200)
vinod 178 186 176
manoj 189 191 185
deepak 191 178 187
P.S.: Expected columns in the output:
class1 englishavg mathsavg scienceavg (the values are the summation of the mean percentages of all three sections)
Here is the code I tried:
files <- list.files(pattern = ".csv") ## creates a vector with all file names in your folder
list_files <- lapply(files,read.csv,header=F,stringsAsFactors=F)
list_files <- lapply(list_files, function(x) x)
engav <- sapply(list_files,function(x) mean(as.numeric(x[,2]),na.rm=T)/2)
mathav <- sapply(list_files,function(x) mean(as.numeric(x[,3]),na.rm=T)/2)
scienceav <- sapply(list_files,function(x) mean(as.numeric(x[,4]),na.rm=T)/2)
result <- cbind(files,engav,mathav,scienceav)
Looking forward to any assistance.
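One way to make the sections comparable is to divide each section's mean by its own maximum score before combining. The sketch below is not taken from the thread: it builds the three sections in memory rather than reading the CSV files, and it assumes the class-level figure is the average of the three section percentages (reasonable here because every section has three students).

```r
# Section data copied from the tables in the question
sections <- list(
  section1 = data.frame(name = c("sam", "erm", "asd"),
                        english = c(114, 89, 101),
                        maths   = c(112, 91, 107),
                        science = c(111, 97, 118)),
  section2 = data.frame(name = c("wer", "rahul", "rohit"),
                        english = c(141, 134, 149),
                        maths   = c(127, 119, 135),
                        science = c(143, 145, 139)),
  section3 = data.frame(name = c("vinod", "manoj", "deepak"),
                        english = c(178, 189, 191),
                        maths   = c(186, 191, 178),
                        science = c(176, 185, 187))
)
max_score <- c(section1 = 125, section2 = 150, section3 = 200)
subjects  <- c("english", "maths", "science")

# Mean percentage per subject within each section (rows = sections)
pct <- t(sapply(names(sections), function(s) {
  colMeans(sections[[s]][, subjects]) / max_score[[s]] * 100
}))

# Class-level figure: average of the section percentages (assumption)
class_avg <- colMeans(pct)

pct
class_avg
```

If the sections had different numbers of students, a weighted mean of the per-student percentages would be the safer choice.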

R generate 2D histogram from raw data

I have some raw data in 2D, x, y as given below. I want to generate a 2D histogram from the data. Typically, dividing the x,y values into bins of size 0.5, and count the number of occurrences in each bin (for both x and y at the same time). Is there any way to do that?
> df
x y
1 4.2179611 5.7588577
2 5.3901279 5.8219784
3 4.1933089 6.4317645
4 5.8076411 5.8999598
5 5.5781166 5.9382342
6 4.5569735 6.7833469
7 4.4024492 5.8019719
8 4.1734975 6.0896355
9 5.1707871 5.5640962
10 5.6380258 6.9112775
11 4.6405353 5.2251746
12 4.1809004 6.1127144
13 4.2764079 5.4598799
14 5.4466446 6.0130047
15 5.2443804 5.5421851
16 5.7521515 5.4115965
17 4.9667564 5.3519795
18 4.5007141 6.8669231
19 5.0268273 5.7681888
20 4.4738948 6.4241168
21 4.4116357 5.9819519
22 4.5741988 6.4595129
23 4.0839075 6.8105259
24 4.7154364 6.5054761
25 4.8986785 5.5511226
26 5.6262397 6.8996480
27 4.9034275 5.6716375
28 4.1872928 5.8387641
29 4.0444855 5.2554446
30 4.8911393 5.8449165
31 5.7268887 6.7100432
32 5.9136374 6.5059128
33 4.9481286 6.4679917
34 4.6198987 5.7462047
35 5.7306916 6.0613158
36 5.5818586 6.4533566
37 5.9240267 6.7748290
38 4.8160926 6.4942865
39 5.5456258 5.7911897
40 4.3075173 6.8165520
41 4.9654533 5.8904734
42 5.9581820 5.7692468
43 4.2417172 5.7990554
44 5.3670112 5.8252479
45 5.2932098 5.3983672
46 5.7456521 6.2563828
47 4.9398795 5.2879065
48 4.8526884 6.9827555
49 5.6135753 6.5219431
50 4.0727956 5.2647714
51 6.9418969 5.2584325
52 5.4189039 5.9936456
53 3.9193741 6.7099562
54 5.5885252 5.9680734
55 5.9581279 5.1843804
56 4.5724421 6.6774004
57 4.7700303 6.6083613
58 5.5490254 6.2431170
59 4.1668548 5.1017475
60 5.8948947 6.7646917
61 6.5501872 5.2803433
62 5.6011444 4.2733087
63 5.1337226 6.5225780
64 5.3153358 6.6164809
65 3.3815056 6.4077659
66 3.8405670 5.3677008
67 6.7036350 4.3090214
68 3.2446588 4.0965275
69 4.6563593 7.6868628
70 5.2382914 7.0020874
71 6.0771605 6.6232541
72 3.5672511 6.9333691
73 5.0865233 4.0778233
74 5.6743559 5.5177734
75 4.5759146 7.2210012
76 5.8203140 4.9787148
77 3.1106176 6.3937707
78 4.6310679 4.4731806
79 6.8237641 6.2679791
80 3.7653803 5.9188107
81 5.6139040 5.8586176
82 6.2016662 5.3514293
83 3.9362048 5.3217560
84 6.8005236 7.9247371
85 5.8030101 7.7492432
86 6.0143418 6.0709249
87 6.5734089 7.6112815
88 4.0569383 5.8440535
89 4.6825752 7.7926235
90 4.8204027 6.3106798
91 3.5001675 6.3156079
92 3.6521280 7.5155810
93 5.0945236 4.8206873
94 3.8732946 5.6771599
95 6.4812309 5.6082170
96 5.0308355 7.6877289
97 5.2193389 7.7133717
98 6.2239631 5.5387684
99 4.6501488 7.8559335
100 3.5389389 5.4594034
101 5.7139486 4.5008182
102 3.5425132 7.3562487
103 6.9950663 6.1036549
104 5.3801845 5.8903123
105 4.7629191 5.3394552
106 4.4102815 7.2312852
107 5.8723641 4.1410996
108 3.4691208 4.6383708
109 4.6479362 5.8562699
110 3.0315732 6.8614265
111 5.9456145 4.7497545
112 4.8461189 4.4730002
113 4.9606723 5.1099093
114 4.7802659 7.8147864
115 5.0189229 6.9308301
116 6.4738074 5.0539666
117 5.3725075 5.3282273
118 6.5374505 7.0508875
119 4.0907139 5.0855075
120 5.0557532 5.6449829
121 6.5483249 7.5800015
122 3.1083616 7.3697234
123 3.6119548 7.7639486
124 6.5157691 7.7152933
125 4.0305622 7.0521419
126 3.2197769 6.5881246
127 4.7570419 6.4564400
128 4.0063007 6.3981942
129 4.4412649 7.6576221
130 5.7348769 6.7601804
131 3.1312551 5.6295996
132 3.8627964 7.5817083
133 5.2008281 5.1082509
134 6.4229161 6.2816475
135 2.5241894 6.0802138
136 7.3759753 5.1090478
137 3.7284166 5.2045976
138 3.4404286 6.9708127
139 6.4237399 5.1363851
140 4.1829368 5.1612791
141 5.9500285 5.4765621
142 3.3555182 6.2627360
143 7.7691356 5.1877095
144 4.0684189 7.1663495
145 7.3929140 7.3819058
146 2.1659981 7.9796005
147 4.8539955 7.3108966
148 5.3932658 4.7116979
149 3.5610560 4.6096759
150 5.1883331 6.8068501
151 6.4233558 7.2955388
152 7.3308739 6.1761356
153 3.0710449 4.5296235
154 7.5400128 5.1559900
155 3.5776389 5.2057676
156 4.0402288 7.1487121
157 2.3107258 6.9816127
158 7.2065591 7.7307439
159 5.7577620 5.6652052
160 2.0595554 7.4373547
161 7.5994468 4.6216856
162 4.8053745 3.9113634
163 7.5769460 7.6019067
164 5.5362034 8.9270974
165 3.6713241 3.9060205
166 6.0612046 7.3862080
167 6.9205755 7.0792392
168 6.0892821 6.3248315
169 2.0532905 4.1545875
170 3.4086310 3.5510909
171 5.2148895 5.3266145
172 4.7638780 7.9240988
173 6.4717329 5.1350172
174 7.8287022 4.3457324
175 6.0299681 3.0952274
176 3.2760103 5.2730464
177 2.5729991 7.6594251
178 3.9403251 7.8928014
179 6.0021556 7.5313493
180 7.8561727 4.5092728
181 3.5818174 4.1140876
182 7.4972295 5.5313987
183 6.0138287 6.9369784
184 3.9257191 7.6395296
185 3.0462106 3.1347680
186 6.0630447 4.1847229
187 7.4878528 5.1004141
188 4.5145570 4.6389011
189 6.2777996 4.2647980
190 3.0166336 7.5755042
191 2.8791041 6.4471746
192 7.1029767 7.0061048
193 2.4526181 6.3373793
194 5.8762775 7.0746223
195 7.0609100 8.1256569
196 4.7252400 8.4829780
197 3.3695501 8.8786640
198 3.8505741 6.8260398
199 5.3573846 6.3864944
200 3.7039072 8.9951078
201 4.6216933 6.7890198
202 7.0390643 5.9458624
203 5.7172605 6.9083246
204 2.3814644 8.3856125
205 2.4432566 3.2618192
206 4.3881965 6.7022219
207 5.2583749 7.2432485
208 5.8540367 8.5154705
209 6.4267791 4.9593757
210 5.0668461 3.1358129
211 2.6845736 8.9880143
212 7.3094761 5.4049133
213 4.2176252 5.5062193
214 5.2025716 4.0798478
215 6.5592571 8.1852765
216 2.0417939 7.0843906
217 7.6045374 7.4870940
218 6.5971789 8.8641329
219 5.3541694 7.2176914
220 2.8314803 6.4831720
221 2.4252467 4.0918736
222 6.6804732 6.3624739
223 6.0325285 6.2057468
224 2.2751047 5.1275412
225 5.5397481 5.9890834
226 4.6420585 4.6013327
227 7.6385642 5.1722194
228 6.7378078 5.8246169
229 5.0647686 7.9219705
230 2.8672731 6.6371082
231 7.5487359 4.5727898
232 1.0837662 7.1788146
233 5.4483746 6.8955122
234 9.3085746 4.8330044
235 3.8484225 6.0133789
236 2.8034987 3.0023096
237 2.8952626 8.2623788
238 5.7666136 3.2158710
239 6.4978214 5.7866574
240 1.5184268 5.9791716
241 2.3836147 8.2897188
242 4.7318649 6.1174515
243 5.8544588 7.5056688
244 9.6776416 6.5151695
245 0.4319531 4.2470331
246 0.9810053 8.6452087
247 7.0819634 3.2488110
248 1.9084265 6.1122130
249 7.5096342 3.3495096
250 8.9564496 3.4960564
251 5.7603943 6.9091760
252 0.8801204 7.2744429
253 1.2183581 6.4264214
254 1.7761613 7.1199729
255 3.2490662 7.9935963
256 3.5420375 8.4801333
257 8.7709382 3.8011487
258 8.4770868 3.4749692
259 0.9965042 6.7509705
260 7.5049457 5.4313474
261 9.7261151 6.5909553
262 5.3893371 4.0194548
263 9.6154510 7.3117416
264 1.0327841 6.2376586
265 4.0064715 3.7333634
266 6.6941050 3.9452152
267 4.1317951 9.3322756
268 9.6481471 7.5330023
269 7.3474233 1.0310166
270 3.7343864 4.9808341
271 9.1412231 2.6655861
272 5.8414100 0.1329439
273 2.4837309 7.4956203
274 2.7983337 1.3563719
275 0.6335727 7.9273816
276 7.5566740 0.4321263
277 8.6182079 0.6038505
278 0.8928523 8.0131172
279 5.7375090 8.5275545
280 0.7864533 3.3954255
281 8.7808839 1.7059789
282 9.6621659 0.9215045
283 8.4894688 8.7667948
284 1.0358920 7.2505891
285 0.7378660 0.1173287
286 9.5485481 3.3186128
287 6.8987508 9.5480887
288 7.4105831 5.8809522
289 6.6984457 5.9509037
290 1.7878216 9.1932955
291 0.8443295 5.1662902
292 0.4498266 8.9636923
293 2.5068754 5.3692908
294 9.2509052 2.4204235
295 4.1333742 6.2581851
296 6.5510938 7.2923688
297 4.3412873 3.5514825
298 4.2349765 9.3207514
299 2.8730785 7.2752405
300 2.0425362 6.6513146
301 6.4498432 7.2949259
302 5.7453188 6.3263712
303 7.0501276 8.2238207
304 4.1915008 1.5325379
305 8.1307954 7.7681944
306 7.3156552 6.3031412
307 4.0302052 0.3039900
308 3.3740358 2.1386235
309 8.2055657 2.9112215
310 1.8817856 7.0503046
311 7.0820523 6.8739097
312 5.0725238 6.9951556
313 1.6246224 5.4126084
314 3.8865553 7.6398192
315 6.6727672 8.9677947
316 9.6048687 7.6757966
317 2.2006018 9.6385351
318 9.6403802 7.6438900
319 0.1267512 0.9048408
320 1.8160829 7.3193066
321 9.9318386 9.6068456
322 2.1275892 7.8034724
323 1.2232242 1.0695030
324 3.0198057 3.8964732
325 3.3265773 8.5865587
326 5.1519605 7.5068253
327 0.4137485 5.9223826
328 1.6896445 0.6071874
329 1.8534083 2.3554291
330 1.7182264 9.3488597
331 6.4165456 9.8670765
332 7.6270001 2.1839607
333 8.9867227 5.9565743
334 6.9185079 0.2440980
335 6.7359209 7.1072908
336 3.8034763 5.8466404
337 3.4583027 6.9041502
338 1.7983897 1.7108336
339 6.9184406 6.3632716
340 1.3538600 6.8484462
341 3.6731748 4.9846946
342 5.6139620 8.0637827
343 9.0991782 2.3051189
344 1.1220448 8.9624365
345 2.5925265 8.3673795
346 9.9977377 8.5423564
347 5.1761187 5.1240824
348 5.9330451 9.4141322
349 6.3337224 6.8055697
350 2.7287418 5.7100024
351 6.1022411 2.9733360
352 2.7331869 3.7135612
353 6.7394034 8.2721572
354 2.1757932 9.0574057
355 5.5011486 6.0124142
356 4.5301911 2.5865048
357 5.3137001 0.7062267
358 0.6959286 3.2395043
359 5.3494169 6.5742589
360 7.1472046 6.3821916
361 0.1749855 0.3954287
362 6.7709760 6.5212015
363 7.2983482 3.0086604
364 0.6147726 9.3336870
365 7.4417342 2.6836695
366 1.2769881 4.0591093
367 9.5342317 5.3443613
368 0.9368862 1.1391497
369 8.4271193 8.6641296
370 6.2000851 8.2987486
371 2.1768279 6.0684896
372 5.2021222 6.9222675
373 0.6095874 8.4759464
374 2.0217473 9.5844241
375 4.8080163 6.5052801
376 3.6099334 0.3272768
377 6.0132712 7.9920535
378 4.0495344 8.8153621
379 6.9646704 7.0375214
380 3.9211171 2.5994333
381 4.4749268 1.0517360
382 1.1683429 3.8710614
383 1.7618115 0.3513996
384 1.1257639 5.7446745
385 3.7351688 8.7376011
386 4.9234662 7.1975462
387 7.4899861 7.3846309
388 7.4170082 2.2885060
389 0.8526702 3.8160722
390 4.5907512 8.9315418
391 7.6996179 9.8409051
392 0.2340987 4.2906009
393 2.2502736 1.7819172
394 3.5679969 1.7419479
395 5.4214908 5.6001803
396 3.9965213 9.2021549
397 3.8610336 2.0462740
398 5.9490575 4.4422382
399 9.8897791 5.6402915
400 6.1153192 4.1236797
401 5.8906384 2.6153750
402 8.0582664 2.7137804
403 7.2969209 2.9362187
404 3.8673527 1.0837191
405 3.5647339 6.2338014
406 9.6490210 0.8373270
407 0.8133243 6.3393130
408 2.8760565 9.9462423
409 3.3836457 7.4451869
410 4.7772609 2.9141127
411 8.6635971 5.7812494
412 5.6192160 1.4764255
413 9.1334625 8.9822399
414 0.4662385 6.6440937
415 3.4503559 4.2064800
416 0.6704780 2.8508758
417 0.5211872 4.3109175
418 7.5615411 9.2851454
419 7.5081906 4.0019450
420 8.8851669 9.7323717
421 7.3856288 8.6152906
422 9.5926351 0.3993818
423 1.4478981 1.4845263
424 5.0425560 1.3501638
425 0.8952120 7.9407680
426 6.4732584 7.1493210
427 9.6595225 5.2377876
428 7.2204625 2.0300222
429 3.5410601 7.3117738
430 6.7991771 3.6368291
Just for clarification, I want to get something like the plot below (the plot has nothing to do with my raw data; I am only showing it to explain the problem more clearly). If I use hist(df$x), it shows the distribution of x only.
The ggplot is elegant and fast and pretty, as usual. But if you want to use base graphics (image, contour, persp) and display your actual frequencies (instead of the smoothing 2D kernel), you have to first obtain the binnings yourself and create a matrix of frequencies. Here's some code (not necessarily elegant, but pretty robust) that does 2D binning and generates plots somewhat similar to the ones above:
require(mvtnorm)
xy <- rmvnorm(1000,c(5,10),sigma=rbind(c(3,-2),c(-2,3)))
nbins <- 20
x.bin <- seq(floor(min(xy[,1])), ceiling(max(xy[,1])), length=nbins)
y.bin <- seq(floor(min(xy[,2])), ceiling(max(xy[,2])), length=nbins)
freq <- as.data.frame(table(findInterval(xy[,1], x.bin),findInterval(xy[,2], y.bin)))
freq[,1] <- as.numeric(freq[,1])
freq[,2] <- as.numeric(freq[,2])
freq2D <- diag(nbins)*0
freq2D[cbind(freq[,1], freq[,2])] <- freq[,3]
par(mfrow=c(1,2))
image(x.bin, y.bin, freq2D, col=topo.colors(max(freq2D)))
contour(x.bin, y.bin, freq2D, add=TRUE, col=rgb(1,1,1,.7))
palette(rainbow(max(freq2D)))
cols <- (freq2D[-1,-1] + freq2D[-1,-(nbins-1)] + freq2D[-(nbins-1),-(nbins-1)] + freq2D[-(nbins-1),-1])/4
persp(freq2D, col=cols)
For a really fun time, try making an interactive, zoomable, 3D surface:
require(rgl)
surface3d(x.bin,y.bin,freq2D/10, col="red")
Bivariate density estimates can be done with MASS::kde2d, or KernSmooth::bkde2D (both supplied with the base R distribution). The latter uses an algorithm based on the fast Fourier transform over a grid of points, and is very fast. The result can be plotted with contour or persp or similar functions in other graphing packages.
Using your data:
require(KernSmooth)
z <- bkde2D(df, .5)
persp(z$fhat)
If you want it with a 2d contour, you can also use the package ggplot2. Some example code is shown in this question:
gradient breaks in a ggplot stat_bin2d plot
Adjusted slightly:
x <- rnorm(10000)+5
y <- rnorm(10000)+5
df <- data.frame(x,y)
require(ggplot2)
p <- ggplot(df, aes(x, y))
p <- p + stat_bin2d(bins = 20)
p
Here's the output of the code above:
For completeness, you can also use the hist2d{gplots} function. It seems to be the most straightforward for a 2D plot:
library(gplots)
# data is in variable df
# define bin sizes
bin_size <- 0.5
xbins <- (max(df$x) - min(df$x))/bin_size
ybins <- (max(df$y) - min(df$y))/bin_size
# create plot
hist2d(df, same.scale=TRUE, nbins=c(xbins, ybins))
# if you want to retrieve the data for other purposes
df.hist2d <- hist2d(df, same.scale=TRUE, nbins=c(xbins, ybins), show=FALSE)
df.hist2d$counts
I came to this page from http://www.r-bloggers.com/5-ways-to-do-2d-histograms-in-r/ which lists one of the answers above.
It provides code samples for a total of 5 methods:
hist2d from the library gplots
hexbin,hexbinplot from the library hexbin
stat_bin2d from the library ggplot2
kde2d from the library MASS
the "hard way" solution listed above.
freq <- as.data.frame(table(findInterval(xy[,1], x.bin),findInterval(xy[,2], y.bin)))
freq[,1] <- as.numeric(freq[,1])
freq[,2] <- as.numeric(freq[,2])
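This is probably wrong since it destroys the original indices: as.numeric() applied to a factor returns the internal level codes rather than the bin numbers, so the two differ whenever a bin index is missing from the table. Converting with as.numeric(as.character(freq[,1])), and likewise for freq[,2], keeps the original bin indices.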
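For the binning the question actually asks for (bins of size 0.5 counted jointly in x and y), a base-R sketch with cut() and table() avoids the index conversion altogether. The data below are random stand-ins for the question's df, and the names x_breaks, y_breaks and counts are made up for illustration.

```r
set.seed(42)  # arbitrary; stand-in data since the full `df` is long
df <- data.frame(x = runif(200, 0, 10), y = runif(200, 0, 10))

bin_size <- 0.5
x_breaks <- seq(floor(min(df$x)), ceiling(max(df$x)), by = bin_size)
y_breaks <- seq(floor(min(df$y)), ceiling(max(df$y)), by = bin_size)

# cut() keeps empty bins as factor levels, so the table is a complete grid
counts <- table(cut(df$x, x_breaks, include.lowest = TRUE),
                cut(df$y, y_breaks, include.lowest = TRUE))

# image() accepts bin boundaries (one more than the matrix dimension)
image(x_breaks, y_breaks, unclass(counts), xlab = "x", ylab = "y")
```

The counts matrix holds the raw frequencies, so it can also be passed to contour or persp in the same way as freq2D above.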

grep: How can I search through my data using a wildcard in R

I have recently started using R and am trying to extract some data, but the results I get are confusing. I have daily data from 1961 to 1963, with dates in the format 1961-04-25, stored in a vector called date.
To display only the dates between April 10 and May 21, I tried grep with this command:
date[date >= grep("196.-04-10", date, value = TRUE) &
date <= grep("196.-05-21", date, value = TRUE)]
The result is confusing: it steps through the dates three days at a time instead of giving me every single day (see below).
[1] "1961-04-10" "1961-04-13" "1961-04-16" "1961-04-19" "1961-04-22" "1961-04-25" "1961-04-28" "1961-05-01" "1961-05-04" "1961-05-07" "1961-05-10"
[12] "1961-05-13" "1961-05-16" "1961-05-19" "1962-04-12" "1962-04-15" "1962-04-18" "1962-04-21" "1962-04-24" "1962-04-27" "1962-04-30" "1962-05-03"
[23] "1962-05-06" "1962-05-09" "1962-05-12" "1962-05-15" "1962-05-18" "1962-05-21" "1963-04-11" "1963-04-14" "1963-04-17" "1963-04-20" "1963-04-23"
[34] "1963-04-26" "1963-04-29" "1963-05-02" "1963-05-05" "1963-05-08" "1963-05-11" "1963-05-14" "1963-05-17" "1963-05-20"
I think the grep strategy is misguided, but maybe something like this will work ... basically, I'm computing the day-of-year (Julian date, yday()) and using that for comparison.
z <- as.Date(c("1961-04-10","1961-04-11","1961-04-12",
"1961-05-21","1961-05-22","1961-05-23",
"1963-04-09","1963-04-12","1963-05-21","1963-05-22"))
library(lubridate)
z[yday(z)>=yday(as.Date("1961-04-10")) & yday(z)<=yday(as.Date("1961-05-21"))]
## [1] "1961-04-10" "1961-04-11" "1961-04-12" "1961-05-21" "1963-04-12"
## [6] "1963-05-21"
Actually, this solution is fragile to leap-years ...
Better (?):
yz <- year(z)
z[z>=as.Date(paste0(yz,"-04-10")) & z<=as.Date(paste0(yz,"-05-21"))]
(You should definitely test this for yourself, I haven't tested carefully!)
Using a date format for your variable would be the best bet here.
## set up some test data
datevar <- seq.Date(as.Date("1961-01-01"),as.Date("1963-12-31"),by="day")
test <- data.frame(date=datevar,id=1:(length(datevar)))
head(test)
## which looks like:
> head(test)
date id
1 1961-01-01 1
2 1961-01-02 2
3 1961-01-03 3
4 1961-01-04 4
5 1961-01-05 5
6 1961-01-06 6
## find the date ranges you want
selectdates <-
(format(test$date,"%m") == "04" & as.numeric(format(test$date,"%d")) >= 10) |
(format(test$date,"%m") == "05" & as.numeric(format(test$date,"%d")) <= 21)
## subset the original data
result <- test[selectdates,]
## which looks as expected:
> result
date id
100 1961-04-10 100
101 1961-04-11 101
102 1961-04-12 102
103 1961-04-13 103
104 1961-04-14 104
105 1961-04-15 105
106 1961-04-16 106
107 1961-04-17 107
108 1961-04-18 108
109 1961-04-19 109
110 1961-04-20 110
111 1961-04-21 111
112 1961-04-22 112
113 1961-04-23 113
114 1961-04-24 114
115 1961-04-25 115
116 1961-04-26 116
117 1961-04-27 117
118 1961-04-28 118
119 1961-04-29 119
120 1961-04-30 120
121 1961-05-01 121
122 1961-05-02 122
123 1961-05-03 123
124 1961-05-04 124
125 1961-05-05 125
126 1961-05-06 126
127 1961-05-07 127
128 1961-05-08 128
129 1961-05-09 129
130 1961-05-10 130
131 1961-05-11 131
132 1961-05-12 132
133 1961-05-13 133
134 1961-05-14 134
135 1961-05-15 135
136 1961-05-16 136
137 1961-05-17 137
138 1961-05-18 138
139 1961-05-19 139
140 1961-05-20 140
141 1961-05-21 141
465 1962-04-10 465
...
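The three-day steps in the question have a simple explanation: each grep call returns one bound per year (a vector of length 3), and the comparison operators recycle that vector along date, so each element is tested against a different year's bound. A minimal sketch of that effect, together with a month-day string comparison that does not depend on leap years, assuming date holds every day of 1961 to 1963:

```r
# Stand-in for the `date` vector in the question: every day of 1961-1963
date <- seq(as.Date("1961-01-01"), as.Date("1963-12-31"), by = "day")

# grep() returns one match per year, so the bound has length 3 and
# `>=` recycles it, comparing each date with a different year's bound
lower <- grep("196.-04-10", as.character(date), value = TRUE)
length(lower)

# Zero-padded "mm-dd" strings sort in calendar order, so a plain
# string comparison selects April 10 to May 21 in every year
md <- format(date, "%m-%d")
in_window <- date[md >= "04-10" & md <= "05-21"]

head(in_window)
length(in_window)
```

This is the same idea as the format()-based answer above, collapsed into a single comparison.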
