I have binned my data using the cut function
breaks<-seq(0, 250, by=5)
data<-split(df2, cut(df2$val, breaks))
My split dataframe looks like
... ...
$`(15,20]`
val ks_Result c
15 60 237
18 70 247
... ...
$`(20,25]`
val ks_Result c
21 20 317
24 10 140
... ...
My bins looks like
> table(data)
data
(0,5] (5,10] (10,15] (15,20] (20,25] (25,30] (30,35]
0 0 0 7 128 2748 2307
(35,40] (40,45] (45,50] (50,55] (55,60] (60,65] (65,70]
1404 11472 1064 536 7389 1008 1714
(70,75] (75,80] (80,85] (85,90] (90,95] (95,100] (100,105]
2047 700 329 1107 399 376 323
(105,110] (110,115] (115,120] (120,125] (125,130] (130,135] (135,140]
314 79 1008 77 474 158 381
(140,145] (145,150] (150,155] (155,160] (160,165] (165,170] (170,175]
89 660 15 1090 109 824 247
(175,180] (180,185] (185,190] (190,195] (195,200] (200,205] (205,210]
1226 139 531 174 1041 107 257
(210,215] (215,220] (220,225] (225,230] (230,235] (235,240] (240,245]
72 671 98 212 70 95 25
(245,250]
494
When I mean the bins, I get on an average of ~900 samples
> mean(table(data))
[1] 915.9
I want to tell R to make irregular bins in such a way that each bin will contain on an average 900 samples (e.g. (0, 27] = 900, (27,28.5] = 900, and so on). I found something similar here, which deals with only one variable, not the whole dataframe.
I also tried Hmisc package, unfortunately the bins don't contain equal frequency!!
library(Hmisc)
data<-split(df2, cut2(df2$val, g=30, oneval=TRUE))
data<-split(df2, cut2(df2$val, m=1000, oneval=TRUE))
Assuming you want 50 equal sized buckets (based on your seq) statement, you can use something like:
df <- data.frame(var=runif(500, 0, 100)) # make data
cut.vec <- cut(
df$var,
breaks=quantile(df$var, 0:50/50), # breaks along 1/50 quantiles
include.lowest=T
)
df.split <- split(df, cut.vec)
Hmisc::cut2 has this option built in as well.
Can be done by the function provided here by Joris Meys
EqualFreq2 <- function(x,n){
nx <- length(x)
nrepl <- floor(nx/n)
nplus <- sample(1:n,nx - nrepl*n)
nrep <- rep(nrepl,n)
nrep[nplus] <- nrepl+1
x[order(x)] <- rep(seq.int(n),nrep)
x
}
data<-split(df2, EqualFreq2(df2$val, 25))
Related
I have a data frame reporting the count of answers per question (this is just a part of it), and I'd like to obtain the answer percentage for each question. I've found adorn_percentages, but it computes the percentage by dividing the values for the whole data frame, meanwhile, I just want the percentage for each column. Each column has a total of 2230 answers.
I was thinking to use something like (x/2230)*100 but I don't know how to go on.
df<-data.frame(q1=c(159,139,1048,571,93), q2=c(106,284,1043,672,125), q3=c(99,222,981,843,94))
q1 q2 q3
1 159 106 99
2 139 284 222
3 1048 1043 981
4 571 672 843
5 93 125 94
We may use colSums to do the division after making the lengths same
100 * df/colSums(df)[col(df)]
or use sweep
100 * sweep(df, 2, colSums(df), `/`)
Or use proportions
df[paste0(names(df), "_prop")] <- 100 * proportions(as.matrix(df), 2)
-output
> df
q1 q2 q3 q1_prop q2_prop q3_prop
1 159 106 99 7.910448 4.753363 4.421617
2 139 284 222 6.915423 12.735426 9.915141
3 1048 1043 981 52.139303 46.771300 43.814203
4 571 672 843 28.407960 30.134529 37.650737
5 93 125 94 4.626866 5.605381 4.198303
You can apply prop.table for each column -
library(dplyr)
df %>% mutate(across(.fns = prop.table, .names = '{col}_prop') * 100)
# q1 q2 q3 q1_prop q2_prop q3_prop
#1 159 106 99 7.910448 4.753363 4.421617
#2 139 284 222 6.915423 12.735426 9.915141
#3 1048 1043 981 52.139303 46.771300 43.814203
#4 571 672 843 28.407960 30.134529 37.650737
#5 93 125 94 4.626866 5.605381 4.198303
I have 20 intervals:
10 intervals from 1 to 250 of size 25:
[1.25] [26.50] [51.75] [76.100] [101.125] [126.150] ... [226.250]
10 intervals from 251 to 1000 of size 75:
[251,325] [326,400] [401,475] [476,550] [551,625] ... [926,1000]
I would like to create a vector composed of the first 5 elements of each interval like:
(1,2,3,5, 26,27,28,29,30, 51,52,53,54,55, 76,77,78,79,80, ....,
251,252,253,254,255, 326,327,328,329,330, ...)
How create this vector using R?
Let's assume you have two interval like :
interval1 <- seq(1.25, 226.250, 25)
interval2 <- seq(251, 1000, 75)
We can create a new interval combining the two and then use mapply to create sequence
new_interval <- c(as.integer(interval1), interval2)
c(mapply(`:`, new_interval, new_interval + 4))
#[1] 1 2 3 4 5 26 27 28 29 30 51 52 53 54 .....
#[89] ..... 779 780 851 852 853 854 855 926 927 928 929 930
I have a dataframe with a series of dates, here's a simplified version of it:
> eventdates
dr.rank dr.start dr.end
1 14 1964-09-30 1964-10-06
2 16 1964-11-01 1964-12-24
I also have a time series of dates with values etc. associated with that, here's a much simplified version of the timeseries:
ts1964 <- data.frame(DATE = seq(from = as.Date("1964-01-01"), to = as.Date("1964-12-31"), by = "days"),
Q = 1:366)
What I am trying to do is subset by each date in eventdates, i.e.:
> filter(ts1964, ts1964$DATE >= eventdates[1,2] & ts1964$DATE <= eventdates[1,3])
DATE Q
1 1964-09-30 274
2 1964-10-01 275
3 1964-10-02 276
4 1964-10-03 277
5 1964-10-04 278
6 1964-10-05 279
7 1964-10-06 280
8 1964-10-07 281
9 1964-10-08 282
10 1964-10-09 283
11 1964-10-10 284
12 1964-10-11 285
13 1964-10-12 286
14 1964-10-13 287
15 1964-10-14 288
16 1964-10-15 289
17 1964-10-16 290
18 1964-10-17 291
19 1964-10-18 292
20 1964-10-19 293
21 1964-10-20 294
22 1964-10-21 295
23 1964-10-22 296
24 1964-10-23 297
25 1964-10-24 298
26 1964-10-25 299
27 1964-10-26 300
28 1964-10-27 301
29 1964-10-28 302
30 1964-10-29 303
31 1964-10-30 304
32 1964-10-31 305
33 1964-11-01 306
>
But I need to do this hundreds of times. What I would like to do is have each subset form an element in a list. I would normally be considering to using something like dlply in plyr but this isn't an option when I'm using dplyr. Could anyone advise on how I might achieve this otherwise? Thanks
We can use Map
Map(function(x,y) filter(ts1964, DATE >= x & DATE <= y),
eventdates$dr.start, eventdates$dr.end)
Sorry for the confusing title, but i wasn't sure how to title what i am trying to do. My objective is to create a dataset of 1000 obs each would be the length of the run. I have created a phase1 dataset, from which a set of control limits are produced. What i am trying to do now is create a phase2 dataset most likely using rnorm. what im trying to do is create a repeat loop that will continuously create values in the phase2 dataset until one of those values is outside of the control limits produced from the phase1 dataset. for example if i had 3.0 and -3.0 as control limits the phase2 dataset would create a bunch of observations until obs 398 when the value here happens to be 3.45, thus stopping the creation of data. my objective is then to record the number 398. Furthermore, I am then trying to loop the code back to the phase1 dataset/ control limits portion and create a new set of control limits and then run another phase2, until i have 1000 run lengths recorded. the code i have for the phase1/ control limits works fine and looks like this:
nphase1=50
nphase2=1000
varcount=1
meanshift= 0
sigmashift= 1
##### phase1 dataset/ control limits #####
phase1 <- matrix(rnorm(nphase1*varcount, 0, 1), nrow = nphase1, ncol=varcount)
mean_var <- apply(phase1, 2, mean)
std_var <- apply(phase1, 2, sd)
df_var <- data.frame(mean_var, std_var)
Upper_SPC_Limit_Method1 <- with(df_var, mean_var + 3 * std_var)
Lower_SPC_Limit_Method1 <- with(df_var, mean_var - 3 * std_var)
df_control_limits<- data.frame(Upper_SPC_Limit_Method1, Lower_SPC_Limit_Method1)
I have previously created this code in SAS and it looks like this. might be a better reference for what i am trying to achieve then me trying to explain it.
%macro phase2_dataset (n=,varcount=, meanshift=, sigmashift=, nphase1=,simID=,);
%do z=1 %to &n;
%phase1_dataset (n=&nphase1, varcount=&varcount);
data phase2; set control_limits n=lastobs;
call streaminit(0);
do until (phase2_var1<Lower_SPC_limit_method1_var1 or
phase2_var1>Upper_SPC_limit_method1_var1);
phase2_var1 = rand("normal", &meanshift, &sigmashift);
output;
end;
run;
ods exclude all;
proc means data=phase2;
var phase2_var1;
ods output summary=x;
run;
ods select all;
data run_length; set x;
keep Phase2_var1_n;
run;
proc append base= QA.Phase2_dataset&simID data=Run_length force; run;
%end;
%mend;
Also been doing research about using a while loop in replace of the repeat loop.
Im new to R so Any ideas you are able to throw my way are greatly appreciated. Thanks!
Using a while loop indeed seems to be the way to go. Here's what I think you're looking for:
set.seed(10) #Making results reproducible
replicate(100, { #100 is easier to display here
phase1 <- matrix(rnorm(nphase1*varcount, 0, 1), nrow = nphase1, ncol=varcount)
mean_var <- colMeans(phase1) #Slightly better than apply
std_var <- apply(phase1, 2, sd)
df_var <- data.frame(mean_var, std_var)
Upper_SPC_Limit_Method1 <- with(df_var, mean_var + 3 * std_var)
Lower_SPC_Limit_Method1 <- with(df_var, mean_var - 3 * std_var)
df_control_limits<- data.frame(Upper_SPC_Limit_Method1, Lower_SPC_Limit_Method1)
#Phase 2
x <- 0
count <- 0
while(x > Lower_SPC_Limit_Method1 && x < Upper_SPC_Limit_Method1) {
x <- rnorm(1)
count <- count + 1
}
count
})
The result is:
[1] 225 91 97 118 304 275 550 58 115 6 218 63 176 100 308 844 90 2758
[19] 161 311 1462 717 2446 74 175 91 331 210 118 1517 420 32 39 201 350 89
[37] 64 385 212 4 72 730 151 7 1159 65 36 333 97 306 531 1502 26 18
[55] 67 329 75 532 64 427 39 352 283 483 19 9 2 1018 137 160 223 98
[73] 15 182 98 41 25 1136 405 474 1025 1331 159 70 84 129 233 2 41 66
[91] 1 23 8 325 10 455 363 351 108 3
If performance becomes a problem, perhaps it would be interesting to explore some improvements, like creating more numbers with rnorm() at a time and then counting how many are necessary to exceed the limits and repeat if necessary.
Let's say I'd like to calculate the magnitude of the range over a few columns, on a row-by-row basis.
set.seed(1)
dat <- data.frame(x=sample(1:1000,1000),
y=sample(1:1000,1000),
z=sample(1:1000,1000))
Using data.frame(), I would do something like this:
dat$diff_range <- apply(dat,1,function(x) diff(range(x)))
To put it more simply, I'm looking for this operation, over each row:
diff(range(dat[1,]) # for i 1:nrow(dat)
If I were doing this for the entire table, it would be something like:
setDT(dat)[,diff_range := apply(dat,1,function(x) diff(range(x)))]
But how would I do it for only named (or numbered) rows?
pmax and pmin find the min and max across columns in a vectorized way, which is much better than splitting and working with each row separately. It's also pretty concise:
dat[, r := do.call(pmax,.SD) - do.call(pmin,.SD)]
x y z r
1: 266 531 872 606
2: 372 685 967 595
3: 572 383 866 483
4: 906 953 437 516
5: 201 118 192 83
---
996: 768 945 292 653
997: 61 231 965 904
998: 771 145 18 753
999: 841 148 839 693
1000: 857 252 218 639
How about this:
D[,list(I=.I,x,y,z)][,diff(range(x,y,z)),by=I][c(1:4,15:18)]
# I V1
#1: 1 971
#2: 2 877
#3: 3 988
#4: 4 241
#5: 15 622
#6: 16 684
#7: 17 971
#8: 18 835
#actually this will be faster
D[c(1:4,15:18),list(I=.I,x,y,z)][,diff(range(x,y,z)),by=I]
use .I to give you an index to call with the by= parameter, then you can run the function on each row. The second call pre-filters by any list of row numbers, or you can add a key and filter on that if your real table looks different.
You can do it by subsetting before/during the function. If you only want every second row for example
dat_Diffs <- apply(dat[seq(2,1000,by=2),],1,function(x) diff(range(x)))
Or for rownames 1:10 (since their names weren't specified they are just numbers counting up)
dat_Diffs <- apply(dat[rownames(dat) %in% 1:10,],1,function(x) diff(range(x)))
But why not just calculate per row then subset later?