I would like to aggregate an R data.frame by equal amounts of the cumulative sum of one of the variables in the data.frame. I googled quite a lot, but probably I don't know the correct terminology to find anything useful.
Suppose I have this data.frame:
> x <- data.frame(cbind(p=rnorm(100, 10, 0.1), v=round(runif(100, 1, 10))))
> head(x, 20)
p v
1 10.002904 4
2 10.132200 2
3 10.026105 6
4 10.001146 2
5 9.990267 2
6 10.115907 6
7 10.199895 9
8 9.949996 8
9 10.165848 8
10 9.953283 6
11 10.072947 10
12 10.020379 2
13 10.084002 3
14 9.949108 8
15 10.065247 6
16 9.801699 3
17 10.014612 8
18 9.954638 5
19 9.958256 9
20 10.031041 7
I would like to reduce x to a smaller data.frame where each row contains the weighted average of p, weighted by v, corresponding to n units of v. Something along these lines:
> n <- 100
> cum.v <- cumsum(x$v)
> f <- cum.v %/% n
> x.agg <- aggregate(cbind(v*p, v) ~ f, data=x, FUN=sum)
> x.agg$'v * p' <- x.agg$'v * p' / x.agg$v
> x.agg
f v * p v
1 0 10.039369 98
2 1 9.952049 94
3 2 10.015058 104
4 3 9.938271 103
5 4 9.967244 100
6 5 9.995071 69
First question: I was wondering if there is a better (more efficient) approach to the code above. The second, more important, question is how to correct the code above in order to obtain more precise bucketing. Namely, each row in x.agg should contain exactly 100 units of v, not just approximately as is the case above. For example, the first row contains the aggregate of the first 17 rows of x, which correspond to 98 units of v. The next row (row 18) contains 5 units of v and is fully included in the next bucket. What I would like to achieve instead is to attribute 2 units of row 18 to the first bucket and the remaining 3 units to the following one.
Thanks in advance for any help provided.
Here's another method that does this without repeating each p v times. The way I understand it, the row where the cumulative sum crosses 100 (see below)
18 9.954638 5 98
19 9.958256 9 107
should be changed to:
18 9.954638 5 98
19.1 9.958256 2 100 # ---> 2 units will be considered with previous group
19.2 9.958256 7 107 # ----> remaining 7 units will be split for next group
The code:
n <- 100
# get cumulative sum, an id column (for retrace) and current group id
x <- transform(x, cv = cumsum(x$v), id = seq_len(nrow(x)), grp = cumsum(x$v) %/% n)
# Paste these two lines in R to install IRanges
source("http://bioconductor.org/biocLite.R")
biocLite("IRanges")
require(IRanges)
ir1 <- successiveIRanges(x$v)
ir2 <- IRanges(seq(n, max(x$cv), by=n), width=1)
o <- findOverlaps(ir1, ir2)
# gets position where multiple of n(=100) occurs
# (where we'll have to do something about it)
pos <- queryHits(o)
# how much do the values differ from multiple of 100?
val <- start(ir2)[subjectHits(o)] - start(ir1)[queryHits(o)] + 1
# we need "pos" new rows of "pos" indices
x1 <- x[pos, ]
x1$v <- val # corresponding values
# reduce the group by 1, so that multiples of 100 will
# belong to the previous row
x1$grp <- x1$grp - 1
# subtract val in the original data x
x$v[pos] <- x$v[pos] - val
# bind and order them
x <- rbind(x1,x)
x <- x[with(x, order(id)), ]
# remove unnecessary entries
x <- x[!(duplicated(x$id) & x$v == 0), ]
x$cv <- cumsum(x$v) # updated cumsum
x$id <- NULL
require(data.table)
x.dt <- data.table(x, key="grp")
x.dt[, list(res = sum(p*v)/sum(v), cv = tail(cv, 1)), by=grp]
Running on your data:
# grp res cv
# 1: 0 10.037747 100
# 2: 1 9.994648 114
Running on #geektrader's data:
# grp res cv
# 1: 0 9.999680 100
# 2: 1 10.040139 200
# 3: 2 9.976425 300
# 4: 3 10.026622 400
# 5: 4 10.068623 500
# 6: 5 9.982733 562
Here's a benchmark on relatively big data:
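FN1 is not defined in the excerpt; presumably it wraps the steps above into a single function, along the lines of this sketch (assuming IRanges and data.table are already loaded):
FN1 <- function(x, n = 100) {
  x <- transform(x, cv = cumsum(x$v), id = seq_len(nrow(x)), grp = cumsum(x$v) %/% n)
  ir1 <- successiveIRanges(x$v)
  ir2 <- IRanges(seq(n, max(x$cv), by = n), width = 1)
  o <- findOverlaps(ir1, ir2)
  pos <- queryHits(o)
  val <- start(ir2)[subjectHits(o)] - start(ir1)[queryHits(o)] + 1
  x1 <- x[pos, ]
  x1$v <- val
  x1$grp <- x1$grp - 1
  x$v[pos] <- x$v[pos] - val
  x <- rbind(x1, x)
  x <- x[with(x, order(id)), ]
  x <- x[!(duplicated(x$id) & x$v == 0), ]
  x$cv <- cumsum(x$v)
  x$id <- NULL
  x.dt <- data.table(x, key = "grp")
  x.dt[, list(res = sum(p * v) / sum(v), cv = tail(cv, 1)), by = grp]
}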
set.seed(12345)
x <- data.frame(cbind(p=rnorm(1e5, 10, 0.1), v=round(runif(1e5, 1, 10))))
require(rbenchmark)
benchmark(out <- FN1(x), replications=10)
# test replications elapsed relative user.self
# 1 out <- FN1(x) 10 13.817 1 12.586
It takes about 1.4 seconds on 1e5 rows.
If you are looking for precise bucketing, I am assuming the value of p is the same for the two "split" parts of v,
i.e. in your example, the value of p for the 2 units of row 18 that go into the first bucket is 9.954638.
With the above assumption, you can do the following for datasets that are not too large:
> set.seed(12345)
> x <- data.frame(cbind(p=rnorm(100, 10, 0.1), v=round(runif(100, 1, 10))))
> z <- unlist(mapply(function(x,y) rep(x,y), x$p, x$v, SIMPLIFY=T))
This creates a vector in which each value of p is repeated v times for its row; the result is combined into a single vector using unlist.
After this, aggregation is trivial using the aggregate function:
> aggregate(z, by=list((1:length(z)-0.5)%/%100), FUN=mean)
Group.1 x
1 0 9.999680
2 1 10.040139
3 2 9.976425
4 3 10.026622
5 4 10.068623
6 5 9.982733
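As a side note (a simplification of my own, not in the original answer): rep() is vectorized over its times argument, so the same expansion can be written without mapply/unlist:
z <- rep(x$p, x$v)   # each p repeated v times, equivalent to the mapply/unlist expansion above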
Related
I would like to remove outliers from each cluster of a dataset. The dataset contains 3 columns with different variables and a column indicating the cluster to which each point is allocated. If even just one of the 3 variables is an outlier, the entire row should be removed. Outliers are identified by determining the interval spanning the mean plus/minus three standard deviations, but I could also use the outlier function.
I am able to remove outliers without considering clusters, using:
#data: each row has 3 different variables and the allocating cluster (k)
dat <- cbind.data.frame(v1=c(sample(5:10, 100, replace=T), sample(1:5, 5)),
                        v2=c(sample(20:25, 100, replace=T), sample(5:10, 5)),
                        v3=c(sample(30:35, 100, replace=T), sample(10:20, 5)),
                        k=c(rep(1:5, 21)))
### find outliers without considering clusters
#(obviously only the last 5 samples in this example)
rmv <- c()
for (i in 1:3) {
  variable <- dat[, i]
  rmv.tm <- which(variable >= (mean(variable) + sd(variable) * 3)
                  | variable <= (mean(variable) - sd(variable) * 3))
  rmv <- c(rmv, rmv.tm)
}
rmv<-unique(rmv)
rmv
### remove outliers
dat_clean <- dat[-rmv,]
However, I am not able to detect outliers CONSIDERING clusters, i.e. determining the intervals inside each cluster rather than over the entire population. I thought of nesting another loop, but I am finding it difficult to code.
Any help would be much appreciated.
Here's a dplyr-approach:
library(dplyr)
dat %>%
  group_by(k) %>%
  filter_all(all_vars(abs(mean(.) - .) < 3 * sd(.)))
# # A tibble: 100 x 4
# # Groups: k [5]
# v1 v2 v3 k
# <int> <int> <int> <int>
# 1 9 20 30 1
# 2 5 24 35 2
# 3 8 20 30 3
# 4 8 23 32 4
# 5 6 23 35 5
# 6 9 24 32 1
# 7 9 22 33 2
# 8 9 23 31 3
# 9 7 21 35 4
# 10 9 23 32 5
# # ... with 90 more rows
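A note on the dplyr API: filter_all()/all_vars() are superseded in current dplyr; the same per-cluster filter can be written with if_all() (a sketch of my own, assuming dplyr >= 1.0 and that only v1:v3 should be checked):
dat %>%
  group_by(k) %>%
  filter(if_all(v1:v3, ~ abs(mean(.x) - .x) < 3 * sd(.x)))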
Base R:
dat <- cbind.data.frame(v1=c(sample(5:10, 100, replace=T), sample(1:5, 5)),
                        v2=c(sample(20:25, 100, replace=T), sample(5:10, 5)),
                        v3=c(sample(30:35, 100, replace=T), sample(10:20, 5)),
                        k=c(rep(1:5, 21)))
get_remove <- function(x, index, a = 3) {
  lower_limit <- tapply(x, index, function(x) mean(x) - a * sd(x))
  upper_limit <- tapply(x, index, function(x) mean(x) + a * sd(x))
  vals <- split(x, index)
  res <- lapply(seq_along(vals), function(i)
    (vals[[i]] < lower_limit[i]) | (vals[[i]] > upper_limit[i]))
  # unsplit() puts the per-group flags back into the original row order
  unsplit(res, index)
}
mask <- apply(do.call(cbind,
                      lapply(dat[, c("v1", "v2", "v3")],
                             get_remove, dat$k)),
              MARGIN = 1, any)
dat[!mask, ]
print("removed:")
dat[mask, ]
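For comparison, here is a shorter base-R sketch of the same per-cluster rule using ave(), which computes group-wise means and standard deviations directly in the original row order (a variant of my own, not part of the answer above):
outlier_flag <- function(x, k, a = 3) {
  # TRUE where x lies more than a group standard deviations away from its group mean
  abs(x - ave(x, k)) > a * ave(x, k, FUN = sd)
}
mask2 <- Reduce(`|`, lapply(dat[c("v1", "v2", "v3")], outlier_flag, k = dat$k))
dat[!mask2, ]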
I have a simple problem which can be solved in a dirty way, but I'm looking for a clean way using data.table.
I have the following data.table with n columns belonging to m unequal groups. Here is an example of my data.table:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
A A A A A A
1 -0.7431185 -0.06356047 -0.2247782 -0.15423889 -0.03894069 0.1165187
2 -1.5891905 -0.44468389 -0.1186977 0.02270782 -0.64950716 -0.6844163
A A A A B B B
1 -1.277307 1.8164195 -0.3957006 -0.6489105 0.3498384 -0.463272 0.8458673
2 -1.644389 0.6360258 0.5612634 0.3559574 1.9658743 1.858222 -1.4502839
B B B B B B B
1 0.3167216 -0.2919079 0.5146733 0.6628149 0.5481958 -0.01721261 -0.5986918
2 -0.8104386 1.2335948 -0.6837159 0.4735597 -0.4686109 0.02647807 0.6389771
B B B B C C
1 -1.2980799 0.3834073 -0.04559749 0.8715914 1.1619585 -1.26236232
2 -0.3551722 -0.6587208 0.44822253 -0.1943887 -0.4958392 0.09581703
C C C C
1 -0.1387091 -0.4638417 -2.3897681 0.6853864
2 0.1680119 -0.5990310 0.9779425 1.0819789
What I want to do is to take a random subset of the columns (of a specific size), keeping the same number of columns per group (if the chosen sample size is larger than the number of columns belonging to one group, take all of the columns of that group).
I have tried an updated version of the method mentioned in this question:
sample rows of subgroups from dataframe with dplyr
but I'm not able to map the column names to the by argument.
Can someone help me with this?
Here's another approach, IIUC:
idx <- split(seq_along(dframe), names(dframe))
keep <- unlist(Map(sample, idx, pmin(7, lengths(idx))))
dframe[, keep]
Explanation:
The first step splits the column indices according to the column names:
idx
# $A
# [1] 1 2 3 4 5 6 7 8 9 10
#
# $B
# [1] 11 12 13 14 15 16 17 18 19 20 21 22 23 24
#
# $C
# [1] 25 26 27 28 29 30
In the next step we use
pmin(7, lengths(idx))
#[1] 7 7 6
to determine the sample size in each group and apply this to each list element (group) in idx using Map. We then unlist the result to get a single vector of column indices.
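Wrapped as a small reusable helper (my own wrapper around the lines above, with the per-group sample size as an argument):
sample_cols <- function(df, n_per_group) {
  idx <- split(seq_along(df), names(df))
  keep <- unlist(Map(sample, idx, pmin(n_per_group, lengths(idx))))
  # note: if a group ever had a single column, sample()'s length-1 behaviour would need handling
  df[, keep]
}
set.seed(1)
dim(sample_cols(dframe, 7))   # 2 rows, 7 + 7 + 6 = 20 columns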
Not sure if you want a solution with dplyr, but here's one with just lapply:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
# Number of columns to sample per group
nc <- 8
res <- do.call(cbind,
               lapply(unique(colnames(dframe)),
                      function(x) {
                        cols <- which(colnames(dframe) == x)
                        dframe[, if (length(cols) <= nc) cols
                               else sample(cols, nc, replace = FALSE)]
                      }))
It might look complicated, but it really just takes all columns of a group if there are at most nc of them, and samples nc columns at random otherwise.
And to restore your original column-name scheme, gsub does the trick:
colnames(res) <- gsub('\\.[[:digit:]]+', '', colnames(res))
I am basically new to using R.
I have a list of repeating codes (numeric/categorical) from an Excel file. I need to add another column of values (even random ones) such that every occurrence of the same code gets the same value.
Codes Value
1 122
1 122
2 155
2 155
2 155
4 101
4 101
5 251
5 251
Thank you.
We can use match:
n <- length(code0 <- unique(code))
value <- sample(4 * n, n)[match(code, code0)]
or factor:
n <- length(unique(code))
value <- sample(4 * n, n)[factor(code)]
The random integers generated are between 1 and 4 * n. The number 4 is arbitrary; you can also put 100.
Example
set.seed(0); code <- rep(1:5, sample(5))
code
# [1] 1 1 1 1 1 2 2 3 3 3 3 4 4 4 5
n <- length(code0 <- unique(code))
sample(4 * n, n)[match(code, code0)]
# [1] 5 5 5 5 5 18 18 19 19 19 19 12 12 12 11
Comment
The above gives the most general treatment, assuming that code is not readily sorted or taking consecutive values.
If code is sorted (no matter what value it takes), we can also use rle:
if (!is.unsorted(code)) {
n <- length(k <- rle(code)$lengths)
value <- rep.int(sample(4 * n, n), k)
}
If code takes consecutive values 1, 2, ..., n (but not necessarily sorted), we can skip match or factor and do:
n <- max(code)
value <- sample(4 * n, n)[code]
A further note: if code is not numerical but categorical, the match and factor methods will still work.
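For example, with character codes (a quick sketch of my own):
code <- c("a", "a", "b", "b", "b", "d", "d")
n <- length(code0 <- unique(code))
value <- sample(4 * n, n)[match(code, code0)]
# every "a" gets one random integer, every "b" another, and so on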
What you could also do is the following; it is perhaps more intuitive for a beginner:
data <- data.frame('a' = c(122,122,155,155,155,101,101,251,251))
duplicates <- unique(data)
duplicates[, 'b'] <- rnorm(nrow(duplicates))
data <- merge(data, duplicates, by='a')
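One caveat (a note of my own): merge() sorts the result by the by column, so the original row order may be lost; if that matters, the same lookup can be done with match(), which keeps the order:
data <- data.frame(a = c(122, 122, 155, 155, 155, 101, 101, 251, 251))
lookup <- unique(data["a"])
lookup$b <- rnorm(nrow(lookup))
data$b <- lookup$b[match(data$a, lookup$a)]   # same value for every repeated code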
I need to extract summed subsets of a data.frame row-by-row and use the output to return a new data.frame. However, I want to increase the number of columns to sum across by 4 each time. So, for example, I want to extract the 1st column by itself, then the sum of columns 2 to 6 on a row-by-row basis, then columns 7 to 15 and so on.
I have this code that returns the sum of a constant number of columns across a data.frame (by a maximum number of trials) into a new data.frame- I just need to find a way to add the escalating function.
t <- max(as.numeric(df[, 5]))
process.row <- function(x) {
  sapply(1:t, function(i) {
    sum(as.numeric(x[(6 + (i - 1) * 5):(10 + (i - 1) * 5)]))
  })
}
collated.data <- t(apply(df, 1, process.row))
I've been really struggling with a way to do this so thanks very much for any help. I couldn't find an answer to this elsewhere so apologies if I've missed something.
I was thinking you wanted to sum the rows of the selected subset of columns. If so, perhaps this will help.
# fake data
mydf <- as.data.frame(matrix(sample(45*5), nrow=5))
mydf
# prepare matrix of start and end columns
n <- 20
i <- 1:n
ncols <- 1 + (i - 1) * 4                   # block widths: 1, 5, 9, 13, ...
endcols <- cumsum(ncols)                   # 1, 6, 15, 28, 45, ...
startcols <- c(1, head(endcols, -1) + 1)   # 1, 2, 7, 16, 29, ...
mymat <- cbind(startcols, endcols)
# function to sum the rows
myfun <- function(df, m) {
  # keep only the blocks whose end column lies within the given df
  subm <- m[m[, 2] <= dim(df)[2], ]
  # sum up the selected columns of df by rows
  sapply(1:dim(subm)[1], function(j)
    rowSums(df[, subm[j, 1]:subm[j, 2], drop = FALSE]))
}
mydf
myfun(df=mydf, m=mymat)
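To check which column blocks actually get summed for the 45-column example above, you can inspect the start/end pairs that survive the filter inside myfun (using the objects defined above):
mymat[mymat[, "endcols"] <= ncol(mydf), ]
#      startcols endcols
# [1,]         1       1
# [2,]         2       6
# [3,]         7      15
# [4,]        16      28
# [5,]        29      45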
What you are looking for is a function that gives x_i, the lower value of the i-th block; for the sequence part i it is x_1 = 1 and, for i > 1, x_i = 1 + sum over j = 1 .. i-1 of (1 + 4*(j - 1)).
In R, the code looks like this:
# the foo part of the function
foo <- function(x) ifelse(x > 0, 1 + (x - 1) * 4, 0)
# the wrapper of the function
min.val <- function(i){
ifelse(i == 1, 1, 1 + sum(sapply(1:(i - 1), foo)))
}
# takes only one value
min.val(1)
# [1] 1
min.val(2)
# [1] 2
min.val(3)
# [1] 7
# to calculate multiple values, use it like this
sapply(1:5, min.val)
#[1] 1 2 7 16 29
If you want to get the maximum number, you can create another function, which looks like this
max.val <- function(i) min.val(i + 1) - 1
sapply(1:5, max.val)
#[1] 1 6 15 28 45
Testing:
# creating a series to test it
series <- 1:20
min.vals <- sapply(series, min.val)
max.vals <- sapply(series, max.val)
dat <- data.frame(min = min.vals, max = max.vals)
# dat
# min max
# 1 1 1
# 2 2 6
# 3 7 15
# 4 16 28
# 5 29 45
# 6 46 66
# 7 67 91
# 8 92 120
# 9 121 153
# 10 154 190
# 11 191 231
# 12 232 276
# 13 277 325
# 14 326 378
# 15 379 435
# 16 436 496
# 17 497 561
# 18 562 630
# 19 631 703
# 20 704 780
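As a side note, the recursion above also has closed forms (a derivation of my own from foo, not part of the answer), which avoid the nested sapply calls:
min.val2 <- function(i) 1 + (i - 1) * (2 * i - 3)   # lower bound of block i
max.val2 <- function(i) i * (2 * i - 1)             # upper bound of block i
min.val2(1:5)  # 1  2  7 16 29
max.val2(1:5)  # 1  6 15 28 45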
Does that give you what you want?
I have two data frames that look like this:
>df.A
NAME pvalues index
A 0.9 1
B 0.8 2
C 0.7 3
D 0.6 4
E 0.5 5
F 0.4 6
G 0.3 7
H 0.2 8
I 0.1 9
>df.B
NAME VALUE index
B 100 1
G 99 2
H 98 3
C 97 4
D 96 5
F 95 6
A 94 7
I 93 8
E 92 9
I want to sample elements from df.A a number of times (e.g. 100), selecting 4 elements each time, then look up which indexes of df.B match the names in each sample, and at the end sum those indexes.
My approach is the following:
res <- list(data.frame())
for (i in 1:100) {
  res[[i]] <- as.data.frame(sample(df.A$NAME, 4))
  names(res[[i]]) <- 'NAME'
}
func <- function(x,y){merge(x, y, by.x=names(x)[1], by.y=names(y)[1])}
rand <- lapply(res, func, df.B)
sum.random = sapply(rand, function(x) sum(x$index))
Both data.frames are a summary of my real data (13000 rows and 4 columns), therefore, speed of the process will be important. My current approach takes a long time, I think because of the merge step.
I've also tried with data.table
dt = data.table(df.B, key='NAME')
fn = function(x){dt[x]};
rand2 = lapply(res, fn)
but it is also too slow.
Any ideas for improving the code? Surely I'm missing something obvious.
Thanks
Maybe like this:
n.samp <- 4
n <- 100
samp.names <- replicate(n, sample(df.A$NAME, n.samp))
library(data.table)
DT.B <- as.data.table(df.B)
setkey(DT.B, NAME)
tmp <- DT.B[c(samp.names),]
tmp[, grp := rep(seq_len(n), each=n.samp) ]
sum.random <- tmp[,sum(index), by=grp]$V1
You didn't make it clear why you can't sample df.B directly. Are names not unique? If that's the case, you could use this:
samp.names <- data.table(NAME=c(samp.names),
grp = rep(seq_len(n), each=n.samp), key="NAME")
tmp <- DT.B[samp.names, all=TRUE]
tmp[,sum(index, na.rm=TRUE), by=grp]$V1
Try this approach and you can avoid the for loop in R (I assume a sample size of 3):
Data preparation
mydf1<-list(df1)
mydf1a<-rep(mydf1,100)
mydf2<-list(df2)
mydf2a<-rep(mydf2,100)
sampleno<-as.list(1:100)
Sampling
set.seed(1)
kk <- Map(function(x) x[sample(1:nrow(x), 3, replace=FALSE), ], mydf1a)
ll <- Map(function(x, y, z)
            cbind(sampleno = z,
                  mysum = sum(x[which(x[, 1] %in% y[, 1]), 3])),
          kk, mydf2a, sampleno)
myresult <- data.frame(do.call(rbind, ll))
head(myresult)
sampleno mysum
1 1 17
2 2 18
3 3 20
4 4 11
5 5 17
6 6 18
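One note of caution (my own observation, not part of the answer): mysum as written adds up the index column of the sampled df1 rows that have a match in df2. If, like the merge() approach in the question, you want the matching indexes from df2 instead, swap the roles of x and y in the lookup:
# hypothetical variant: sum df2's index for the sampled names
ll2 <- Map(function(x, y, z)
             cbind(sampleno = z,
                   mysum = sum(y[which(y[, 1] %in% x[, 1]), 3])),
           kk, mydf2a, sampleno)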