How to use an IQR outlier function, based on a key, in R

I want to use this IQR function:
smooth_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)  # 1st and 3rd quartiles
  H <- 1.3 * IQR(x, na.rm = na.rm)                             # fence width: 1.3 * IQR
  y <- x
  y[x < (qnt[1] - H)] <- round(qnt[1] - H)  # clamp low outliers to the lower fence
  y[x > (qnt[2] + H)] <- round(qnt[2] + H)  # clamp high outliers to the upper fence
  y
}
on the data frame below, on the total column, separately for each value in the key column:
key total
US4ZNB 10
US4ZNB 1075
US4ZNB 10000
US4ZNB 1138
US4ZNB 1156
US4YYM 1114
US4YYM 1072
US4YYM 50
US4YYM 1181
US4YYM 8000
JM4YYM 15000
JM4YYM 2000
JM4YYM 100
JM4YYM 2200
JM4YYM 2300

ddply from the plyr package does exactly this. It applies a function to each subset of the data based on a column.
plyr::ddply(df, "key", plyr::numcolwise(smooth_outliers))
The first argument is your data with "key" and "total"; the second argument is the grouping variable, in this case "key".
The final argument is the function you want to apply. The numcolwise() wrapper is used here so that the function is applied to each numeric column within a group rather than to whole rows; in other words, it turns the row-based smooth_outliers function into a column-based one.
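If numcolwise() is unfamiliar, here is a quick toy illustration (not using the question's data): it wraps a vector function so that it is applied to every numeric column of a data frame, dropping non-numeric columns.
# numcolwise(f) applies f to each numeric column, ignoring the character column g
plyr::numcolwise(mean)(data.frame(g = "a", x = 1:4, y = 5:8))
#     x   y
# 1 2.5 6.5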
Then voila.
You'll get a data frame that lists each key and its smoothed total as calculated by the smooth_outliers function.
Here's the result.
key total
1 JM4YYM 1421
2 JM4YYM 1712
3 JM4YYM 1709
4 US4YYM 1114
5 US4YYM 1473
6 US4YYM 1181
7 US4YYM 1767
8 US4YYM 1005
9 US4ZAW 1138
10 US4ZAW 1156
11 US4ZAW 1982
12 US4ZNB 1338
13 US4ZNB 1075
14 US4ZNB 1806
As you can see, each key is matched up with one of the outputs from the smooth_outliers function.

After elaborating on these ideas, I managed to find a solution to my issue. I just used dplyr::group_by:
library(dplyr)
df.new <- df %>%
  group_by(key) %>%
  mutate(val = smooth_outliers(total))
Thank you all.

Related

Aggregate column intervals into new columns in data.table

I would like to aggregate a data.table based on intervals of a column (time). The idea here is that each interval should become a separate, distinctly named column in the output.
I've seen a similar question on SO, but I couldn't get my head around the problem. Help?
Reproducible example
library(data.table)
# sample data
set.seed(1L)
dt <- data.table( id= sample(LETTERS,50,replace=TRUE),
time= sample(60,50,replace=TRUE),
points= sample(1000,50,replace=TRUE))
# simple summary by `id`
dt[, .(total = sum(points)), by=id]
> id total
> 1: J 2058
> 2: T 1427
> 3: C 1020
In the desired output, each column would be named after the interval it originates from. For example, with three intervals, say time < 10, time < 20, time < 30, the head of the output should be:
id | total | subtotal_under10 | subtotal_under20 | subtotal_under30
Exclusive Subtotal Categories
set.seed(1L);
N <- 50L;
dt <- data.table(id=sample(LETTERS,N,T),time=sample(60L,N,T),points=sample(1000L,N,T));
breaks <- seq(0L,as.integer(ceiling((max(dt$time)+1L)/10)*10),10L);
cuts <- cut(dt$time,breaks,labels=paste0('subtotal_under',breaks[-1L]),right=F);
res <- dcast(dt[,.(subtotal=sum(points)),.(id,cut=cuts)],id~cut,value.var='subtotal');
res <- res[dt[,.(total=sum(points)),id]][order(id)];
res;
## id subtotal_under10 subtotal_under20 subtotal_under30 subtotal_under40 subtotal_under50 subtotal_under60 total
## 1: A NA NA 176 NA NA 512 688
## 2: B NA NA 599 NA NA NA 599
## 3: C 527 NA NA NA NA NA 527
## 4: D NA NA 174 NA NA NA 174
## 5: E NA 732 643 NA NA NA 1375
## 6: F 634 NA NA NA NA 1473 2107
## 7: G NA NA 1410 NA NA NA 1410
## 8: I NA NA NA NA NA 596 596
## 9: J 447 NA 640 NA NA 354 1441
## 10: K 508 NA NA NA NA 454 962
## 11: M NA 14 1358 NA NA NA 1372
## 12: N NA NA NA NA 730 NA 730
## 13: O NA NA 271 NA NA 259 530
## 14: P NA NA NA NA 78 NA 78
## 15: Q 602 NA 485 NA 925 NA 2012
## 16: R NA 599 357 479 NA NA 1435
## 17: S NA 986 716 865 NA NA 2567
## 18: T NA NA NA NA 105 NA 105
## 19: U NA NA NA 239 1163 641 2043
## 20: V NA 683 NA NA 929 NA 1612
## 21: W NA NA NA NA 229 NA 229
## 22: X 214 993 NA NA NA NA 1207
## 23: Y NA 130 992 NA NA NA 1122
## 24: Z NA NA NA NA 104 NA 104
## id subtotal_under10 subtotal_under20 subtotal_under30 subtotal_under40 subtotal_under50 subtotal_under60 total
Cumulative Subtotal Categories
I've come up with a new solution based on the requirement of cumulative subtotals.
My objective was to avoid looping operations such as lapply(), since I realized that it should be possible to compute the desired result using only vectorized operations such as findInterval(), vectorized/cumulative operations such as cumsum(), and vector indexing.
I succeeded, but I should warn you that the algorithm is fairly intricate, in terms of its logic. I'll try to explain it below.
breaks <- seq(0L,as.integer(ceiling((max(dt$time)+1L)/10)*10),10L);
ints <- findInterval(dt$time,breaks);
res <- dt[,{ y <- ints[.I]; o <- order(y); y <- y[o]; w <- which(c(y[-length(y)]!=y[-1L],T)); v <- rep(c(NA,w),diff(c(1L,y[w],length(breaks)))); c(sum(points),as.list(cumsum(points[o])[v])); },id][order(id)];
setnames(res,2:ncol(res),c('total',paste0('subtotal_under',breaks[-1L])));
res;
## id total subtotal_under10 subtotal_under20 subtotal_under30 subtotal_under40 subtotal_under50 subtotal_under60
## 1: A 688 NA NA 176 176 176 688
## 2: B 599 NA NA 599 599 599 599
## 3: C 527 527 527 527 527 527 527
## 4: D 174 NA NA 174 174 174 174
## 5: E 1375 NA 732 1375 1375 1375 1375
## 6: F 2107 634 634 634 634 634 2107
## 7: G 1410 NA NA 1410 1410 1410 1410
## 8: I 596 NA NA NA NA NA 596
## 9: J 1441 447 447 1087 1087 1087 1441
## 10: K 962 508 508 508 508 508 962
## 11: M 1372 NA 14 1372 1372 1372 1372
## 12: N 730 NA NA NA NA 730 730
## 13: O 530 NA NA 271 271 271 530
## 14: P 78 NA NA NA NA 78 78
## 15: Q 2012 602 602 1087 1087 2012 2012
## 16: R 1435 NA 599 956 1435 1435 1435
## 17: S 2567 NA 986 1702 2567 2567 2567
## 18: T 105 NA NA NA NA 105 105
## 19: U 2043 NA NA NA 239 1402 2043
## 20: V 1612 NA 683 683 683 1612 1612
## 21: W 229 NA NA NA NA 229 229
## 22: X 1207 214 1207 1207 1207 1207 1207
## 23: Y 1122 NA 130 1122 1122 1122 1122
## 24: Z 104 NA NA NA NA 104 104
## id total subtotal_under10 subtotal_under20 subtotal_under30 subtotal_under40 subtotal_under50 subtotal_under60
Explanation
breaks <- seq(0L,as.integer(ceiling((max(dt$time)+1L)/10)*10),10L);
breaks <- seq(0,ceiling(max(dt$time)/10)*10,10); ## old derivation, for reference
First, we derive breaks as before. I should mention that I realized there was a subtle bug in my original derivation algorithm. Namely, if the maximum time value is a multiple of 10, then the derived breaks vector would've been short by 1. Consider if we had a maximum time value of 60. The original calculation of the upper limit of the sequence would've been ceiling(60/10)*10, which is just 60 again. But it should be 70, since the value 60 technically belongs in the 60 <= time < 70 interval. I fixed this in the new code (and retroactively amended the old code) by adding 1 to the maximum time value when computing the upper limit of the sequence. I also changed two of the literals to integers and added an as.integer() coercion to preserve integerness.
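A minimal sketch of that boundary case (illustrative values, not from the answer's data):
## suppose the maximum time value is 60
seq(0, ceiling(60/10)*10, 10);                     ## 0 10 20 30 40 50 60 -- no [60,70) interval
seq(0L, as.integer(ceiling((60+1L)/10)*10), 10L);  ## 0 10 20 30 40 50 60 70 -- fixed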
ints <- findInterval(dt$time,breaks);
Second, we precompute the interval indexes into which each time value falls. We can precompute this once for the entire table, because we'll be able to index out each id group's subset within the j argument of the subsequent data.table indexing operation. Note that findInterval() behaves perfectly for our purposes using the default arguments; we don't need to mess with rightmost.closed, all.inside, or left.open. This is because findInterval() by default uses lower <= value < upper logic, and it's impossible for values to fall below the lowest break (which is zero) or on or above the highest break (which must be greater than the maximum time value because of the way we derived it).
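An illustrative call (not from the original answer) showing the default lower <= value < upper behavior:
findInterval(c(0, 9, 10, 59), seq(0L, 70L, 10L));
## [1] 1 1 2 6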
res <- dt[,{ y <- ints[.I]; o <- order(y); y <- y[o]; w <- which(c(y[-length(y)]!=y[-1L],T)); v <- rep(c(NA,w),diff(c(1L,y[w],length(breaks)))); c(sum(points),as.list(cumsum(points[o])[v])); },id][order(id)];
Third, we compute the aggregation using a data.table indexing operation, grouping by id. (Afterward we sort by id using a chained indexing operation, but that's not significant.) The j argument consists of 6 statements executed in a braced block which I will now explain one at a time.
y <- ints[.I];
This pulls out the interval indexes for the current id group in input order.
o <- order(y);
This captures the order of the group's records by interval. We will need this order for the cumulative summation of points, as well as the derivation of which indexes in that cumulative sum represent the desired interval subtotals. Note that the within-interval orders (i.e. ties) are irrelevant, since we're only going to extract the final subtotals of each interval, which will be the same regardless if and how order() breaks ties.
y <- y[o];
This actually reorders y to interval order.
w <- which(c(y[-length(y)]!=y[-1L],T));
This computes the endpoints of each interval run, in other words the indexes of only those elements that comprise the final element of an interval. This vector will always contain at least one index, will never contain more indexes than there are intervals, and will be unique.
v <- rep(c(NA,w),diff(c(1L,y[w],length(breaks))));
This repeats each element of w according to its distance (as measured in intervals) from its following element. We use diff() on y[w] to compute these distances, requiring an appended length(breaks) element to properly treat the final element of w. We also need to cover if the first interval (and zero or more subsequent intervals) is not represented in the group, in which case we must pad it with NAs. This requires prepending an NA to w and prepending a 1 to the argument vector to diff().
c(sum(points),as.list(cumsum(points[o])[v]));
Finally, we can compute the group aggregation result. Since you want a total column and then separate subtotal columns, we need a list starting with the total aggregation, followed by one list component per subtotal value. points[o] gives us the target summation operand in interval order, which we then cumulatively sum, and then index with v to produce the correct sequence of cumulative subtotals. We must coerce the vector to a list using as.list(), and then prepend the list with the total aggregation, which is simply the sum of the entire points vector. The resulting list is then returned from the j expression.
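To make the indexing logic concrete, here is a small worked example (hypothetical values, not from the answer) tracing w, v, and the final extraction for a single group, assuming length(breaks) is 7 (i.e. 6 intervals):
y <- c(1L,1L,3L,3L,4L);                  ## ordered interval indexes for the group
w <- which(c(y[-length(y)]!=y[-1L],T));  ## 2 4 5 -- last position of each interval run
v <- rep(c(NA,w),diff(c(1L,y[w],7L)));   ## 2 2 4 5 5 5 -- one cumsum index per interval
p <- c(5,10,20,40,80);                   ## points, already in interval order
cumsum(p)[v];                            ## 15 15 75 155 155 155 -- cumulative subtotals
Intervals with no new points simply repeat the previous cumulative subtotal, and if the first interval were empty, the prepended NA would pad it.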
setnames(res,2:ncol(res),c('total',paste0('subtotal_under',breaks[-1L])));
Last, we set the column names. It is more performant to set them once after-the-fact, as opposed to having them set repeatedly in the j expression.
Benchmarking
For benchmarking, I wrapped my code in a function, and did the same for Mike's code. I decided to make my breaks variable a parameter with its derivation as the default argument, and I did the same for Mike's my_nums variable, but without a default argument.
Also note that for the identical() proofs-of-equivalence, I coerce the two results to matrix, because Mike's code always computes the total and subtotal columns as doubles, whereas my code preserves the type of the input points column (i.e. integer if it was integer, double if it was double). Coercing to matrix was the easiest way I could think of to verify that the actual data is equivalent.
library(data.table);
library(microbenchmark);
bgoldst <- function(dt,breaks=seq(0L,as.integer(ceiling((max(dt$time)+1L)/10)*10),10L)) { ints <- findInterval(dt$time,breaks); res <- dt[,{ y <- ints[.I]; o <- order(y); y <- y[o]; w <- which(c(y[-length(y)]!=y[-1L],T)); v <- rep(c(NA,w),diff(c(1L,y[w],length(breaks)))); c(sum(points),as.list(cumsum(points[o])[v])); },id][order(id)]; setnames(res,2:ncol(res),c('total',paste0('subtotal_under',breaks[-1L]))); res; };
mike <- function(dt,my_nums) { cols <- sapply(1:length(my_nums),function(x){return(paste0("subtotal_under",my_nums[x]))}); dt[,(cols) := lapply(my_nums,function(x) ifelse(time<x,points,NA))]; dt[,total := points]; dt[,lapply(.SD,function(x){ if (all(is.na(x))){ as.numeric(NA) } else{ as.numeric(sum(x,na.rm=TRUE)) } }),by=id, .SDcols=c("total",cols) ][order(id)]; };
## OP's sample input
set.seed(1L);
N <- 50L;
dt <- data.table(id=sample(LETTERS,N,T),time=sample(60L,N,T),points=sample(1000L,N,T));
identical(as.matrix(bgoldst(copy(dt))),as.matrix(mike(copy(dt),c(10,20,30,40,50,60))));
## [1] TRUE
microbenchmark(bgoldst(copy(dt)),mike(copy(dt),c(10,20,30,40,50,60)));
## Unit: milliseconds
## expr min lq mean median uq max neval
## bgoldst(copy(dt)) 3.281380 3.484301 3.793532 3.588221 3.780023 6.322846 100
## mike(copy(dt), c(10, 20, 30, 40, 50, 60)) 3.243746 3.442819 3.731326 3.526425 3.702832 5.618502 100
Mike's code is actually faster (usually) by a small amount for the OP's sample input.
## large input 1
set.seed(1L);
N <- 1e5L;
dt <- data.table(id=sample(LETTERS,N,T),time=sample(60L,N,T),points=sample(1000L,N,T));
identical(as.matrix(bgoldst(copy(dt))),as.matrix(mike(copy(dt),c(10,20,30,40,50,60,70))));
## [1] TRUE
microbenchmark(bgoldst(copy(dt)),mike(copy(dt),c(10,20,30,40,50,60,70)));
## Unit: milliseconds
## expr min lq mean median uq max neval
## bgoldst(copy(dt)) 19.44409 19.96711 22.26597 20.36012 21.26289 62.37914 100
## mike(copy(dt), c(10, 20, 30, 40, 50, 60, 70)) 94.35002 96.50347 101.06882 97.71544 100.07052 146.65323 100
For this much larger input, my code significantly outperforms Mike's.
In case you're wondering why I had to add the 70 to Mike's my_nums argument, it's because with so many more records, the probability of getting a 60 in the random generation of dt$time is extremely high, which requires the additional interval. You can see that the identical() call gives TRUE, so this is correct.
## large input 2
set.seed(1L);
N <- 1e6L;
dt <- data.table(id=sample(LETTERS,N,T),time=sample(60L,N,T),points=sample(1000L,N,T));
identical(as.matrix(bgoldst(copy(dt))),as.matrix(mike(copy(dt),c(10,20,30,40,50,60,70))));
## [1] TRUE
microbenchmark(bgoldst(copy(dt)),mike(copy(dt),c(10,20,30,40,50,60,70)));
## Unit: milliseconds
## expr min lq mean median uq max neval
## bgoldst(copy(dt)) 204.8841 207.2305 225.0254 210.6545 249.5497 312.0077 100
## mike(copy(dt), c(10, 20, 30, 40, 50, 60, 70)) 1039.4480 1086.3435 1125.8285 1116.2700 1158.4772 1412.6840 100
For this even larger input, the performance difference is slightly more pronounced.
I'm pretty sure something like this might work as well:
# sample data
set.seed(1)
dt <- data.table( id= sample(LETTERS,50,replace=TRUE),
time= sample(60,50,replace=TRUE),
points= sample(1000,50,replace=TRUE))
#Input numbers
my_nums <- c(10,20,30)
#Defining columns
cols <- sapply(1:length(my_nums),function(x){return(paste0("subtotal_under",my_nums[x]))})
dt[,(cols) := lapply(my_nums,function(x) ifelse(time<x,points,NA))]
dt[,total := sum((points)),by=id]
dt[,(cols):= lapply(.SD,sum,na.rm=TRUE),by=id, .SDcols=cols ]
head(dt)
id time points subtotal_under10 subtotal_under20 subtotal_under30 total
1: G 29 655 0 0 1410 1410
2: J 52 354 447 447 1087 1441
3: O 27 271 0 0 271 530
4: X 15 993 214 1207 1207 1207
5: F 5 634 634 634 634 2107
6: X 6 214 214 1207 1207 1207
Edit: To aggregate columns, you can simply change to:
#Defining columns
cols <- sapply(1:length(my_nums),function(x){return(paste0("subtotal_under",my_nums[x]))})
dt[,(cols) := lapply(my_nums,function(x) ifelse(time<x,points,NA))]
dt[,total := points]
dt[,lapply(.SD,function(x){
if (all(is.na(x))){
as.numeric(NA)
} else{
as.numeric(sum(x,na.rm=TRUE))
}
}),by=id, .SDcols=c("total",cols) ]
This should give the expected output of 1 row per ID.
Edit: Per the OP's comment below, changed so that 0s are NA, and changed so that an as.numeric() call is not needed when building the columns.
After thinking about this for a while, I think I've arrived at a very simple and fast solution based on conditional sums! The small problem is that I haven't figured out how to automate this code to create a larger number of columns without writing each one out. Any help here would be really welcome!
library(data.table)
dt[, .( total = sum(points)
, subtotal_under10 = sum(points[which( time < 10)])
, subtotal_under20 = sum(points[which( time < 20)])
, subtotal_under30 = sum(points[which( time < 30)])
, subtotal_under40 = sum(points[which( time < 40)])
, subtotal_under50 = sum(points[which( time < 50)])
, subtotal_under60 = sum(points[which( time < 60)])), by=id][order(id)]
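One possible way to generate the columns programmatically (a sketch, not from the original answer; it reintroduces lapply(), so it may give up some of the speed advantage):
my_nums <- seq(10, 60, by = 10)
dt[, c(list(total = sum(points)),
       setNames(lapply(my_nums, function(b) sum(points[time < b])),
                paste0("subtotal_under", my_nums))),
   by = id][order(id)]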
microbenchmark
Using the same benchmark proposed by @bgoldst in another answer, this simple solution is much faster than the alternatives:
set.seed(1L)
N <- 1e6L
dt <- data.table(id=sample(LETTERS,N,T),time=sample(60L,N,T),points=sample(1000L,N,T))
library(microbenchmark)
microbenchmark(rafa(copy(dt)),bgoldst(copy(dt)),mike(copy(dt),c(10,20,30,40,50,60)))
# expr min lq mean median uq max neval cld
# rafa(copy(dt)) 95.79 102.45 117.25 110.09 116.95 278.50 100 a
# bgoldst(copy(dt)) 192.53 201.85 211.04 207.50 213.26 354.17 100 b
# mike(copy(dt), c(10, 20, 30, 40, 50, 60)) 844.80 890.53 955.29 921.27 1041.96 1112.18 100 c

R: sum vector by vector of conditions

I am trying to obtain a vector that contains the sums of elements fitting a condition.
values = runif(5000)
bin = seq(0, 0.9, by = 0.1)
sum(values < bin)
I expected sum to return 10 values: for each element of bin, the count of values elements that satisfy the < condition.
However, it returns only one value.
How can I achieve the result without using a while loop?
I understand this to mean that you want, for each value in bin, the number of elements in values that are less than that bin value. So I think you want vapply() here:
vapply(bin, function(x) sum(values < x), 1L)
# [1] 0 497 1025 1501 1981 2461 2955 3446 3981 4526
If you want a little table for reference, you could add names
v <- vapply(bin, function(x) sum(values < x), 1L)
setNames(v, bin)
# 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
# 0 497 1025 1501 1981 2461 2955 3446 3981 4526
I personally prefer data.table over tapply or vapply, and findInterval over cut.
set.seed(1)
library(data.table)
dt <- data.table(values, groups=findInterval(values, bin))
setkey(dt, groups)
dt[,.(n=.N, v=sum(values)), groups][,list(cumsum(n), cumsum(v)),]
# V1 V2
# 1: 537 26.43445
# 2: 1041 101.55686
# 3: 1537 226.12625
# 4: 2059 410.41487
# 5: 2564 637.18782
# 6: 3050 904.65876
# 7: 3473 1180.53342
# 8: 3951 1540.18559
# 9: 4464 1976.23067
#10: 5000 2485.44920
cbind(vapply(bin, function(x) sum(values < x), 1L)[-1],
cumsum(tapply( values, cut(values, bin), sum)))
# [,1] [,2]
#(0,0.1] 537 26.43445
#(0.1,0.2] 1041 101.55686
#(0.2,0.3] 1537 226.12625
#(0.3,0.4] 2059 410.41487
#(0.4,0.5] 2564 637.18782
#(0.5,0.6] 3050 904.65876
#(0.6,0.7] 3473 1180.53342
#(0.7,0.8] 3951 1540.18559
#(0.8,0.9] 4464 1976.23067
Using tapply with a cut()-constructed INDEX vector seems to deliver:
tapply( values, cut(values, bin), sum)
(0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6]
25.43052 71.06897 129.99698 167.56887 222.74620 277.16395
(0.6,0.7] (0.7,0.8] (0.8,0.9]
332.18292 368.49341 435.01104
Although I'm guessing you would want the cut-vector to extend to 1.0:
bin = seq(0, 1, by = 0.1)
tapply( values, cut(values, bin), sum)
(0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6]
25.48087 69.87902 129.37348 169.46013 224.81064 282.22455
(0.6,0.7] (0.7,0.8] (0.8,0.9] (0.9,1]
335.43991 371.60885 425.66550 463.37312
I see that I understood the question differently from Richard. If you want his result, you can apply cumsum to my result, as shown below.
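For example, cumulative per-interval sums:
cumsum(tapply(values, cut(values, bin), sum))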
Using dplyr:
set.seed(1)
library(dplyr)
# assuming df holds the binned data, e.g. df <- data.frame(values, groups = cut(values, bin))
df %>% group_by(groups) %>%
  summarise(count = n(), sum = sum(values)) %>%
  mutate(cumcount = cumsum(count), cumsum = cumsum(sum))
Output:
groups count sum cumcount cumsum
1 (0,0.1] 537 26.43445 537 26.43445
2 (0.1,0.2] 504 75.12241 1041 101.55686
3 (0.2,0.3] 496 124.56939 1537 226.12625
4 (0.3,0.4] 522 184.28862 2059 410.41487
5 (0.4,0.5] 505 226.77295 2564 637.18782
6 (0.5,0.6] 486 267.47094 3050 904.65876
7 (0.6,0.7] 423 275.87466 3473 1180.53342
8 (0.7,0.8] 478 359.65217 3951 1540.18559
9 (0.8,0.9] 513 436.04508 4464 1976.23067
10 NA 536 509.21853 5000 2485.44920

R - setting equiprobability over a specific variable when sampling

I have a data set with more than 2 millions entries which I load into a data frame.
I'm trying to grab a subset of the data. I need around 10000 entries, but I need them to be picked with equal probability with respect to one variable.
This is what my data looks like with str(data):
'data.frame': 2685628 obs. of 3 variables:
$ category : num 3289 3289 3289 3289 3289 ...
$ id: num 8064180 8990447 747922 9725245 9833082 ...
$ text : chr "text1" "text2" "text3" "text4" ...
As you can see, I have 3 variables: category, id and text.
I have tried the following :
> sample_data <- data[sample(nrow(data),10000,replace=FALSE),]
Of course this works, but the sampling probability is not equal across categories. Here is the output of count(sample_data$category):
x freq
1 3289 707
2 3401 341
3 3482 160
4 3502 243
5 3601 1513
6 3783 716
7 4029 423
8 4166 21
9 4178 894
10 4785 31
11 5108 121
12 5245 2178
13 5637 387
14 5946 1484
15 5977 117
16 6139 664
Update: Here is the output of count(data$category) :
x freq
1 3289 198142
2 3401 97864
3 3482 38172
4 3502 59386
5 3601 391800
6 3783 201409
7 4029 111075
8 4166 6749
9 4178 239978
10 4785 6473
11 5108 32083
12 5245 590060
13 5637 98785
14 5946 401625
15 5977 28769
16 6139 183258
But when I try setting the probabilities, I get the following error:
> catCount <- length(unique(data$category))
> probabilities <- rep(c(1/catCount),catCount)
> train_set <- data[sample(nrow(data),10000,prob=probabilities),]
Error in sample.int(x, size, replace, prob) :
incorrect number of probabilities
I understand that the sample function picks randomly among row numbers, but I can't figure out how to tie that to a probability over the categories.
Question: How can I sample my data with equal probability over the category variable?
Thanks in advance.
I guess you could do this with some simple base R operations. Keep in mind, though, that you are using probabilities within sample, so you won't get an exact count per category with this method, although you can get close enough for a large enough sample.
Here's some example data:
set.seed(123)
data <- data.frame(category = sample(rep(letters[1:10], seq(1000, 10000, by = 1000)), 55000))
Then
probs <- 1/prop.table(table(data$category)) # Calculating relative probabilities
data$probs <- probs[match(data$category, names(probs))] # Matching them to the correct rows
set.seed(123)
train_set <- data[sample(nrow(data), 1000, prob = data$probs), ] # Sampling
table(train_set$category) # Checking frequencies
# a b c d e f g h i j
# 94 103 96 107 105 99 100 96 107 93
Edit: So here's a possible data.table equivalent
library(data.table)
setDT(data)[, probs := .N, category][, probs := .N/probs]
train_set <- data[sample(.N, 1000, prob = probs)]
Edit #2: Here's a very nice solution using the dplyr package, contributed by @Khashaa and @docendodiscimus.
The nice thing about this solution is that it returns the exact sample size within each group:
library(dplyr)
train_set <- data %>%
group_by(category) %>%
sample_n(1000)
Edit #3:
It seems that data.table equivalent to dplyr::sample_n would be
library(data.table)
train_set <- setDT(data)[data[, sample(.I, 1000), category]$V1]
Which will also return the exact sample size within each group

R: Rollapply lm regression on zoo matrix objects

I would like to perform a rolling regression using lm on many pairs of data series within a single zoo object.
While I am able to perform a rolling regression on one single pair of data series in a zoo object by the following codes:
FunLm <- function(x,Param,Days) coef(lm(AAA ~ Z, data = as.data.frame(x), weights=Param*(seq(Days,1,by=-1))))
DataLmCoef <- rollapplyr(Data, Days, FunLm, Param, Days, by.column = FALSE)
with zoo of this structure:
Z AAA
2012-07-01 1 853
2012-07-04 2 864
2012-07-05 3 865
2012-07-06 4 873
2012-07-07 5 870
2012-07-08 6 874
My question is, if I have the following zoo object:
Z AAA BBB CCC
2012-07-01 1 853 123 65
2012-07-04 2 864 124 62
2012-07-05 3 865 126 63
2012-07-06 4 873 120 66
2012-07-07 5 870 121 68
2012-07-08 6 874 123 69
without using a loop, how can I similarly perform rolling regressions on Z~AAA, Z~BBB, Z~CCC, Z~DDD, ... and get two zoo matrix objects, one storing intercepts and the other storing slopes?
Following the example from the rollapply man page, you can fit more than one model inside the rolling function. For example:
> seat <- as.zoo(log(UKDriverDeaths))
> time(seat) <- as.yearmon(time(seat))
> seat <- merge(y = seat, y1 = lag(seat, k = -1),
y12 = lag(seat, k = -12), all = FALSE)
> fm <- rollapply(seat, width = 36,
FUN = function(z)
data.frame(
test1 = t(coef(lm(y ~ y1 + y12, data = as.data.frame(z)))),
test3 = t(coef(lm(y ~ y12, data = as.data.frame(z))))
) ,
by.column = FALSE, align = "right")
And the result
> head(fm)
test1..Intercept. test1.y1 test1.y12 test3..Intercept. test3.y12
Dec 1972 0.9629793 0.15344243 0.7240740 1.530598 0.8026003
Jan 1973 1.1336058 0.13920023 0.7155899 1.570067 0.7973688
Feb 1973 0.9978077 0.14346100 0.7293183 1.440635 0.8145803
Mar 1973 0.9879002 0.12929214 0.7442218 1.375245 0.8226257
Apr 1973 1.2281307 0.11700612 0.7250115 1.545356 0.8003661
May 1973 1.4483700 0.08860055 0.7245032 1.706343 0.7792279
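Applying the same pattern to the question's own data, a sketch along these lines might work (Data and Days are the objects from the question; the weights argument is omitted for brevity, and reformulate() is used to build each col ~ Z formula):
library(zoo)

# fit one lm per response column and flatten the coefficients to a vector,
# so rollapplyr() collects them into a multi-column zoo object
FunLmAll <- function(x, cols) {
  d <- as.data.frame(x)
  c(sapply(cols, function(col) coef(lm(reformulate("Z", col), data = d))))
}

cols <- c("AAA", "BBB", "CCC")
fits <- rollapplyr(Data, Days, FunLmAll, cols = cols, by.column = FALSE)

# coefficients alternate (intercept, slope) per series:
intercepts <- fits[, seq(1, ncol(fits), by = 2)]
slopes     <- fits[, seq(2, ncol(fits), by = 2)]
colnames(intercepts) <- colnames(slopes) <- cols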

Sum every n points

I have a vector and I need to sum every n numbers and return the results. This is the way I plan on doing it currently. Any better way to do this?
v = 1:100
n = 10
sidx = seq.int(from=1, to=length(v), by=n)
eidx = c((sidx-1)[2:length(sidx)], length(v))
thesum = sapply(1:length(sidx), function(i) sum(v[sidx[i]:eidx[i]]))
This gives:
thesum
[1] 55 155 255 355 455 555 655 755 855 955
unname(tapply(v, (seq_along(v)-1) %/% n, sum))
# [1] 55 155 255 355 455 555 655 755 855 955
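The grouping index (seq_along(v) - 1) %/% n assigns each position to a zero-based block of n consecutive elements; a quick illustration:
(seq_along(1:10) - 1) %/% 3
# [1] 0 0 0 1 1 1 2 2 2 3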
UPDATE:
If you want to sum every n consecutive numbers use colSums
If you want to sum every nth number use rowSums
as per Josh's comment, this will only work if n divides length(v) nicely.
rowSums(matrix(v, nrow=n))
 [1] 460 470 480 490 500 510 520 530 540 550
colSums(matrix(v, nrow=n))
[1] 55 155 255 355 455 555 655 755 855 955
Update
The old version didn't work. Here is a new answer that uses rep to create the grouping factor. No need to use cut:
n <- 5
vv <- sample(1:1000,100)
seqs <- seq_along(vv)
tapply(vv,rep(seqs,each=n)[seqs],FUN=sum)
You can use tapply
tapply(1:100,cut(1:100,10),FUN=sum)
or to get a list
by(1:100,cut(1:100,10),FUN=sum)
EDIT
In case you have 1:92, you can replace your cut with this:
cut(1:92,seq(1,92,10),include.lowest=T)
One way is to convert your vector to a matrix and then take the column sums:
colSums(matrix(v, nrow=n))
[1] 55 155 255 355 455 555 655 755 855 955
Just be careful: this implicitly assumes that your input vector can in fact be reshaped to a matrix. If it can't, R will recycle elements of your vector to complete the matrix.
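A quick sketch of that pitfall (illustrative, with n = 10):
v <- 1:92
colSums(matrix(v, nrow = 10))  # warns: 92 is not a multiple of 10
# the last column holds 91, 92 plus the recycled values 1:8,
# so the final "sum" is 219 rather than the 183 (= 91 + 92) you'd want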
v <- 1:100
n <- 10
cutpoints <- seq( 1 , length( v ) , by = n )
categories <- findInterval( 1:length( v ) , cutpoints )
tapply( v , categories , sum )
I will add one more way of doing it without any function from the apply family:
v <- 1:100
n <- 10
diff(c(0, cumsum(v)[slice.index(v, 1)%%n == 0]))
## [1] 55 155 255 355 455 555 655 755 855 955
Here are some of the main variants offered so far
f0 <- function(v, n) {
sidx = seq.int(from=1, to=length(v), by=n)
eidx = c((sidx-1)[2:length(sidx)], length(v))
sapply(1:length(sidx), function(i) sum(v[sidx[i]:eidx[i]]))
}
f1 <- function(v, n, na.rm=TRUE) { # 'tapply'
unname(tapply(v, (seq_along(v)-1) %/% n, sum, na.rm=na.rm))
}
f2 <- function(v, n, na.rm=TRUE) { # 'matrix'
nv <- length(v)
if (nv %% n)
v[ceiling(nv / n) * n] <- NA
colSums(matrix(v, n), na.rm=na.rm)
}
f3 <- function(v, n) { # 'cumsum'
nv = length(v)
i <- c(seq_len(nv %/% n) * n, if (nv %% n) nv else NULL)
diff(c(0L, cumsum(v)[i]))
}
Basic test cases might be
v = list(1:4, 1:5, c(NA, 2:4), integer())
n = 2
f0 fails with the final test, but this could probably be fixed
> f0(integer(), n)
Error in sidx[i]:eidx[i] : NA/NaN argument
The cumsum approach f3 is subject to rounding error, and the presence of an NA early in v 'poisons' later results
> f3(c(NA, 2:4), n)
[1] NA NA
In terms of performance, the original solution is not bad
> library(rbenchmark)
> cols <- c("test", "elapsed", "relative")
> v <- 1:100; n <- 10
> benchmark(f0(v, n), f1(v, n), f2(v, n), f3(v, n),
+ columns=cols)
test elapsed relative
1 f0(v, n) 0.012 3.00
2 f1(v, n) 0.065 16.25
3 f2(v, n) 0.004 1.00
4 f3(v, n) 0.004 1.00
but the matrix solution f2 seems to be both fast and flexible (e.g., adjusting the handling of that trailing chunk of fewer than n elements)
> v <- runif(1e6); n <- 10
> benchmark(f0(v, n), f2(v, n), f3(v, n), columns=cols, replications=10)
test elapsed relative
1 f0(v, n) 5.804 34.141
2 f2(v, n) 0.170 1.000
3 f3(v, n) 0.251 1.476
One way is to use rollapply from zoo:
rollapply(v, width=n, FUN=sum, by=n)
# [1] 55 155 255 355 455 555 655 755 855 955
And in case length(v) is not a multiple of n:
v <- 1:92
rollapply(v, width=n, FUN=sum, by=n, partial=T, align="left")
# [1] 55 155 255 355 455 555 655 755 855 183
A little late to the party, but I don't see a rowsum() answer yet. rowsum() has proven more efficient than tapply(), and I think it would also be quite efficient relative to a few of the other responses as well.
rowsum(v, rep(seq_len(length(v)/n), each=n))[,1]
# 1 2 3 4 5 6 7 8 9 10
# 55 155 255 355 455 555 655 755 855 955
Using @Josh O'Brien's grouping technique would likely improve efficiency even more.
rowsum(v, (seq_along(v)-1) %/% n)[,1]
# 0 1 2 3 4 5 6 7 8 9
# 55 155 255 355 455 555 655 755 855 955
Simply wrap in unname() to drop the group names.