sample() not sampling randomly? [duplicate] - r

Can someone explain what's going wrong here? I wanted to simulate 10,000 rolls of a 20-sided die. I used this code:
x <- sample(1:20,10000,replace=T)
but that gives me this:
hist(x)
It seems to be a problem above 12:
What am I not understanding here?
Thanks

The problem isn't actually with sample; it's with hist.
If you do this
set.seed(1)
x <- sample(1:20,10000,replace=T)
table(x)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
513 522 482 495 459 549 506 505 518 498 495 492 440 490 459 509 496 528 511 533
you'll notice the counts are roughly uniform, as expected from a random sample. However, hist reproduces your graph. If you count the bars, you'll notice there are 19 and not 20: the first bar covers both 1 and 2.
Trying this instead:
bins <- seq(0, 20, by=1)
hist(x, breaks=bins)
gives a graph with even bar heights because all 20 bars are shown (i.e. 1 and 2 are not collapsed together).
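The bar count can also be checked without looking at the plot. The following is a minimal sketch, assuming R's default hist settings; the names h_default and h_fixed are illustrative only, and breaks centred on the integers are one possible alternative to the bins above.

```r
set.seed(1)
x <- sample(1:20, 10000, replace = TRUE)

# Compute the histogram without drawing it
h_default <- hist(x, plot = FALSE)
h_default$breaks            # the cut points hist chose by default
length(h_default$counts)    # number of bars

# The first default cell is closed on both sides, so it holds the 1s and the 2s
h_default$counts[1] == sum(x <= 2)

# One cell per face: breaks at 0.5, 1.5, ..., 20.5
h_fixed <- hist(x, breaks = seq(0.5, 20.5, by = 1), plot = FALSE)
all(h_fixed$counts == as.vector(table(x)))
```

For discrete data like dice rolls, barplot(table(x)) avoids the binning question entirely.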

Related

R: How to compare values in a column with later values in the same column

I am attempting to work with a large dataset in R where I need to create a column that compares the value in an existing column to all values that follow it (ex: row 1 needs to compare rows 1-10,000, row 2 needs to compare rows 2-10,000, row 3 needs to compare rows 3-10,000, etc.), but cannot figure out how to write the range.
I currently have a column of raw numeric values and a column of row values generated by:
samples$row = seq.int(nrow(samples))
I have attempted to generate the column with the following command:
samples$processed = min(samples$raw[samples$row:10000])
but get the error "numerical expression has 10000 elements: only the first used" and the generated column only has the value for row 1 repeated for each of the 10,000 rows.
How do I need to write this command so that the lower bound of the range is the row currently being calculated instead of 1?
Any help would be appreciated, as I have minimal programming experience.
If all you need is the min of the specific row and all following rows, then
rev(cummin(rev(samples$val)))
# [1] 24 24 24 24 24 24 24 24 24 24 24 24 165 165 165 165 410 410 410 882
If you have some other function that doesn't have a cumulative variant (and your use of min is just a placeholder), then one of:
mapply(function(a, b) min(samples$val[a:b]), seq.int(nrow(samples)), nrow(samples))
# [1] 24 24 24 24 24 24 24 24 24 24 24 24 165 165 165 165 410 410 410 882
sapply(seq.int(nrow(samples)), function(a) min(samples$val[a:nrow(samples)]))
The only reason to use mapply over sapply is if, for some reason, you want window-like operations instead of always going to the bottom of the frame. (Though if you wanted windows, I'd suggest either the zoo or slider packages.)
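As a rough illustration of that difference, the sketch below uses a short made-up vector (val) and a hypothetical window width k, neither of which comes from the question. It first checks that the cumulative and sapply versions agree, then uses mapply with a separate end index for each row.

```r
val <- c(561, 997, 321, 153, 74, 228, 146, 634, 49, 128)
n <- length(val)

# Minimum of each element and everything after it, two ways
suffix_min <- rev(cummin(rev(val)))
loop_min <- sapply(seq_len(n), function(a) min(val[a:n]))
identical(suffix_min, loop_min)

# Window-like variant: look at most k rows ahead (k is an assumption)
k <- 3
starts <- seq_len(n)
ends <- pmin(starts + k - 1, n)   # do not run past the last row
window_min <- mapply(function(a, b) min(val[a:b]), starts, ends)
window_min
```

Here mapply earns its place because both the start and the end of the range change from row to row.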
Data
set.seed(42)
samples <- data.frame(val = sample(1000, size=20))
samples
# val
# 1 561
# 2 997
# 3 321
# 4 153
# 5 74
# 6 228
# 7 146
# 8 634
# 9 49
# 10 128
# 11 303
# 12 24
# 13 839
# 14 356
# 15 601
# 16 165
# 17 622
# 18 532
# 19 410
# 20 882

Divide a data-frame into x roughly equal groups -- sequentially

I want to divide a df into x roughly equal groups, sequentially.
I was basically doing it like this:
df_1 <- df[1:10,]
df_2 <- df[11:21,]
df_3..
Is there a simpler way to do this, using split or slice? The important thing is, I want to maintain the order of the df, not sample from it.
Imagine I had 7000 observations, and I wanted 19 roughly equal groups.
Best!
I don't know if it counts as roughly equal, but you can do this:
nobs <- 7000
ngroups <- 17
df <- data.frame(x = sample(nobs))
set.seed(1)
df$grp <- sort(sample(1:ngroups,nobs,T)) # added the sort so the order of your df is maintained
table(df$grp)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
# 436 407 410 369 417 411 440 401 431 411 356 398 390 414 443 418 448
then split(df,df$grp)
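If the group sizes should differ by at most one row, a deterministic assignment is another option. This is a sketch rather than part of the answer above; it assumes the 7000 observations and 19 groups mentioned in the question and derives the group number from the row position.

```r
nobs <- 7000
ngroups <- 19
df <- data.frame(x = seq_len(nobs))

# Map row i to group ceiling(i * ngroups / nobs); row order is preserved
df$grp <- ceiling(seq_len(nobs) * ngroups / nobs)

sizes <- table(df$grp)
range(sizes)                  # smallest and largest group size

groups <- split(df, df$grp)   # list of data frames, in original order
length(groups)
```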

how to discretize R data.frame column in a given width?

Say, I have a data.frame() like this
> head(Acquisition)
  original_date first_payment_date LTV DTI FICO
1       01/2007            03/2007  56  37  734
2       02/2007            04/2007  80  11  762
3       12/2006            02/2007  80  28  656
4       12/2006            03/2007  70  50  700
I want to discretize the Acquisition$LTV and Acquisition$DTI by the step size 0.05 and Acquisition$FICO by the step size 10.
I have found the answer: the cut function does the job.
dis.LTV=cut(Acquisition$LTV,(max(Acquisition$LTV)-min(Acquisition$LTV))/0.05)
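When cut is given a single number as its second argument, it treats it as the number of intervals, so the bin width is only approximately the requested step. A possible alternative, sketched here with the four FICO values shown in the question and a step of 10, is to pass explicit break points. The names step, lo, hi and breaks are illustrative.

```r
FICO <- c(734, 762, 656, 700)
step <- 10

# Round the data range outwards to multiples of the step
lo <- floor(min(FICO) / step) * step
hi <- ceiling(max(FICO) / step) * step
breaks <- seq(lo, hi, by = step)

# Each bin is exactly `step` wide; include.lowest keeps a value equal to lo
dis.FICO <- cut(FICO, breaks = breaks, include.lowest = TRUE)
dis.FICO
table(dis.FICO)
```

The same pattern should work for LTV and DTI with step set to 0.05.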

Sample() in R returning non-random sample after population vector length > 13. Why? [duplicate]

The following code will return a perfectly sound sample:
b <- sample(c(0,1,2,3,4,5,6,7,8,9,10,11,12), 100000, replace=TRUE)
hist(b)
Increasing the number of elements by 1, to 14, results in this:
b <- sample(c(0,1,2,3,4,5,6,7,8,9,10,11,12,13), 100000, replace=TRUE)
hist(b)
That's clearly not correct. Zero occurs more often than it should. Is there a reason for this?
The problem lies in hist, not in sample.
You can check that by running:
> table(sample(0:15, 10000, replace=T))
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
634 642 664 654 628 598 633 642 647 625 587 577 618 645 615 591
From the hist help:
If right = TRUE (default), the histogram cells are intervals of the
form (a, b], i.e., they include their right-hand endpoint, but not
their left one, with the exception of the first cell when
include.lowest is TRUE.
For right = FALSE, the intervals are of the form [a, b), and
include.lowest means ‘include highest’.
If you try
hist(sample(0:15, 10000, replace=T), br=-1:15)
the results will look correct, because every value from 0 to 15 then falls into its own cell.
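The interval convention quoted from the help page can be seen directly in the counts. The sketch below assumes the 0 to 13 population from the question; h, h_right and h_left are illustrative names, and plot = FALSE is used so that only the computed cells are inspected.

```r
set.seed(1)
b <- sample(0:13, 100000, replace = TRUE)

# Default breaks: inspect which cut points hist picked
h <- hist(b, plot = FALSE)
h$breaks
h$counts[1]    # compare with sum(b == 0) and sum(b <= 1)

# right = TRUE (default): cells (a, b], so start one below the minimum
h_right <- hist(b, breaks = -1:13, plot = FALSE)

# right = FALSE: cells [a, b), so end one above the maximum
h_left <- hist(b, breaks = 0:14, right = FALSE, plot = FALSE)

all(h_right$counts == as.vector(table(b)))
all(h_left$counts == as.vector(table(b)))
```

Either choice of breaks gives one cell per value, which is why the bars come out even.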

Loop for subsetting data.frame

I am working with the neuralnet package to predict stock values (for my diploma thesis). Example data are below:
predict<-runif(23,min=0,max=1)
day<-c(369:391)
ChoosedN<-c(2,5,5,5,5,5,4,3,5,5,5,2,1,1,5,5,4,3,2,3,4,3,2)
Profit<-runif(23,min=-2,max=5)
df<-data.frame(predict,day,ChoosedN,Profit)
colnames(df)<-c('predict','day','ChoosedN','Profit')
However, the investment period (ChoosedN) is not always the same. To backtest the neural network, I have to skip the days when I am still in a position, even if the network says 'buy it' (i.e. predict > 0.5). The data frame looks like this:
predict day ChoosedN Profit
1 0.6762981061 369 2 -1.6288823350
2 0.0195611224 370 5 1.5682195597
3 0.2442795106 371 5 0.6195915225
4 0.9587601107 372 5 -1.9701975542
5 0.7415729680 373 5 3.7826137026
6 0.4814927997 374 5 4.1228808255
7 0.1340754859 375 4 3.7818792837
8 0.6316874851 376 3 0.7670884461
9 0.1107241728 377 5 -1.3367400097
10 0.5850426450 378 5 2.2848396166
11 0.2809308425 379 5 2.5234691438
12 0.2835292015 380 2 -0.3291319925
13 0.3328713216 381 1 4.7425349397
14 0.4766904986 382 1 -0.4062103292
15 0.5005860797 383 5 4.8612083721
16 0.2734292494 384 5 -0.2320077328
17 0.1488479455 385 4 2.6195679584
18 0.9446908936 386 3 0.4889716264
19 0.8222738281 387 2 0.7362413658
20 0.7570014759 388 3 4.6661250258
21 0.9988698252 389 4 2.6340743946
22 0.8384663551 390 3 1.0428046484
23 0.1938821415 391 2 0.8855748393
I need to create a new data.frame as follows. For example: if predict in the first row is > 0.5, delete the second and third rows (ChoosedN in the first row is 2, so the two rows after it have to be deleted, because we were still in a position there). Then continue from the fourth row in the same way (if predict in the fourth row is > 0.5, delete the next five rows, and so on). And of course, if predict <= 0.5, delete that row too.
Is there a straightforward way to do this with a loop?
Thanks
I would create a new data frame, then bind the rows you want using rbind inside a while loop:
newDF <- data.frame()                  # New, empty data frame
i <- 1                                 # Loop index variable
while (i <= nrow(df)) {                # <= so the last row is checked too
  if (df$predict[i] > 0.5) {           # If predict > 0.5,
    newDF <- rbind(newDF, df[i, ])     # bind the row
    i <- i + df$ChoosedN[i]            # and skip the next ChoosedN rows
  }
  i <- i + 1                           # Move to the next row
}
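Calling rbind on every iteration copies the growing data frame each time, which can get slow on long series. One possible variant, sketched below, collects the row indices first and subsets once at the end. The vector keep is an illustrative name, and the data are regenerated with a fixed seed only so the example runs on its own.

```r
set.seed(1)
df <- data.frame(
  predict  = runif(23, min = 0, max = 1),
  day      = 369:391,
  ChoosedN = c(2,5,5,5,5,5,4,3,5,5,5,2,1,1,5,5,4,3,2,3,4,3,2),
  Profit   = runif(23, min = -2, max = 5)
)

keep <- integer(0)             # indices of rows where a position is opened
i <- 1
while (i <= nrow(df)) {
  if (df$predict[i] > 0.5) {
    keep <- c(keep, i)         # remember this row
    i <- i + df$ChoosedN[i]    # skip the rows covered by the open position
  }
  i <- i + 1                   # move to the next candidate row
}

newDF <- df[keep, ]            # subset once
newDF
```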
