R: sum a vector by a vector of conditions

I am trying to obtain a vector that contains the sum of the elements that satisfy a condition.
values = runif(5000)
bin = seq(0, 0.9, by = 0.1)
sum(values < bin)
I expected sum to return 10 values: for each element of "bin", the sum of the "values" elements that satisfy the "<" condition.
However, it returns only one value.
How can I achieve the result without using a while loop?

I understand this to mean that you want, for each value in bin, the number of elements in values that are less than that bin value. So I think you want vapply() here:
vapply(bin, function(x) sum(values < x), 1L)
# [1] 0 497 1025 1501 1981 2461 2955 3446 3981 4526
If you want a little table for reference, you could add names
v <- vapply(bin, function(x) sum(values < x), 1L)
setNames(v, bin)
# 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
# 0 497 1025 1501 1981 2461 2955 3446 3981 4526

I personally prefer data.table over tapply or vapply, and findInterval over cut.
set.seed(1)
library(data.table)
values <- runif(5000)  # assumed: values regenerated after setting the seed, for reproducibility
dt <- data.table(values, groups = findInterval(values, bin))
setkey(dt, groups)
dt[, .(n = .N, v = sum(values)), groups][, .(cumsum(n), cumsum(v))]
# V1 V2
# 1: 537 26.43445
# 2: 1041 101.55686
# 3: 1537 226.12625
# 4: 2059 410.41487
# 5: 2564 637.18782
# 6: 3050 904.65876
# 7: 3473 1180.53342
# 8: 3951 1540.18559
# 9: 4464 1976.23067
#10: 5000 2485.44920
cbind(vapply(bin, function(x) sum(values < x), 1L)[-1],
cumsum(tapply( values, cut(values, bin), sum)))
# [,1] [,2]
#(0,0.1] 537 26.43445
#(0.1,0.2] 1041 101.55686
#(0.2,0.3] 1537 226.12625
#(0.3,0.4] 2059 410.41487
#(0.4,0.5] 2564 637.18782
#(0.5,0.6] 3050 904.65876
#(0.6,0.7] 3473 1180.53342
#(0.7,0.8] 3951 1540.18559
#(0.8,0.9] 4464 1976.23067

Using tapply with a cut()-constructed INDEX vector seems to deliver:
tapply( values, cut(values, bin), sum)
(0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6]
25.43052 71.06897 129.99698 167.56887 222.74620 277.16395
(0.6,0.7] (0.7,0.8] (0.8,0.9]
332.18292 368.49341 435.01104
Although I'm guessing you would want the cut-vector to extend to 1.0:
bin = seq(0, 1, by = 0.1)
tapply( values, cut(values, bin), sum)
(0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6]
25.48087 69.87902 129.37348 169.46013 224.81064 282.22455
(0.6,0.7] (0.7,0.8] (0.8,0.9] (0.9,1]
335.43991 371.60885 425.66550 463.37312
I see that I understood the question differently than Richard did. If you want his result, you can apply cumsum to my result.
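For example, a minimal sketch of that (reusing the extended bin vector from above; the comparison comments are mine, not from the original answer):
s <- tapply(values, cut(values, bin), sum)
cumsum(s)                        # running totals of the per-bin sums
cumsum(table(cut(values, bin)))  # running counts per bin, comparable to the vapply() counts above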

Using dplyr:
set.seed(1)
values <- runif(5000)                                # assumed: values regenerated after setting the seed
bin <- seq(0, 0.9, by = 0.1)
df <- data.frame(values, groups = cut(values, bin))  # assumed: `df` is not defined in the original answer
library(dplyr)
df %>% group_by(groups) %>%
  summarise(count = n(), sum = sum(values)) %>%
  mutate(cumcount = cumsum(count), cumsum = cumsum(sum))
Output:
groups count sum cumcount cumsum
1 (0,0.1] 537 26.43445 537 26.43445
2 (0.1,0.2] 504 75.12241 1041 101.55686
3 (0.2,0.3] 496 124.56939 1537 226.12625
4 (0.3,0.4] 522 184.28862 2059 410.41487
5 (0.4,0.5] 505 226.77295 2564 637.18782
6 (0.5,0.6] 486 267.47094 3050 904.65876
7 (0.6,0.7] 423 275.87466 3473 1180.53342
8 (0.7,0.8] 478 359.65217 3951 1540.18559
9 (0.8,0.9] 513 436.04508 4464 1976.23067
10 NA 536 509.21853 5000 2485.44920

Related

Divide colums by other columns and itself depending on index in dplyr

library(dplyr)
set.seed(1)
df <- data.frame(dddt_a = sample(1:1000, 1000, replace=T),
dddt_b = sample(1:1000, 1000, replace=T),
dddt_c = sample(1:1000, 1000, replace=T),
dddt_d = sample(1:1000, 1000, replace=T),
index = as.character(sample(c("a", "b"), 1000, replace=T)))
I want to divide each column by either dddt_a or dddt_b depending on what the index is. If the index is a, then divide all columns except the index by dddt_a, and if index==b, divide all columns except the index by dddt_b. The way it is set up now, this only divides dddt_a but not the other columns (likewise if index==b).
df1 <- df %>%
mutate_at(.vars = vars(starts_with("dddt")),
.funs = list(~ifelse(index=="a", ./dddt_a, ./dddt_b)))
head(df1)
dddt_a dddt_b dddt_c dddt_d index
1 1.0000000 686 474 756 a
2 0.7388466 1 681 726 b
3 1.0000000 218 570 448 a
4 2.0086393 1 830 958 b
5 1.0000000 989 590 128 a
6 1.0000000 128 978 144 a
A workaround is to store the denominator variable outside, split the data by index, divide everything, and put it back together (I ran it only for index==a here). However, this should be possible in dplyr, I'm sure...?
ind_a <- df$dddt_a[df$index=="a"]
dfa <- df %>%
filter(index=="a")%>%
mutate_at(.vars = vars(starts_with("dddt")),
.funs = ~ ./!!ind_a)
Related to what seems to be the same issue: in a next step I want to sum the values up, again depending on the index variable:
df2 <- df1 %>%
mutate(SUMS = ifelse(index=="a",
1+dddt_b+dddt_c+dddt_d,
1+dddt_a+dddt_c+dddt_d))
However, this sums all variables up...
head(df2)
dddt_a dddt_b dddt_c dddt_d index SUMS
1 1.0000000 686 474 756 a 1917.000
2 0.7388466 1 681 726 b 1408.739
3 1.0000000 218 570 448 a 1237.000
4 2.0086393 1 830 958 b 1791.009
5 1.0000000 989 590 128 a 1708.000
6 1.0000000 128 978 144 a 1251.000
But for the first row, for example, SUMS should be equal to 1916:
rowSums(df2[1,2:4]) #the result should be 1916 not 1917
1916
Thanks for the help.
Create new columns after dividing, rather than overwriting the originals. mutate_at() evaluates the selected columns one at a time, so once dddt_a has been overwritten its new value is used as the denominator for the remaining columns; writing the results to "_new" columns keeps the original denominators intact.
library(dplyr)
df %>%
mutate_at(vars(starts_with("dddt")),
list(new = ~ifelse(index=="a", ./dddt_a, ./dddt_b))) %>%
head
# dddt_a dddt_b dddt_c dddt_d index dddt_a_new dddt_b_new dddt_c_new dddt_d_new
#1 836 686 474 756 a 1.000 0.821 0.567 0.904
#2 679 919 681 726 b 0.739 1.000 0.741 0.790
#3 129 218 570 448 a 1.000 1.690 4.419 3.473
#4 930 463 830 958 b 2.009 1.000 1.793 2.069
#5 509 989 590 128 a 1.000 1.943 1.159 0.251
#6 471 128 978 144 a 1.000 0.272 2.076 0.306
If you want, you can then select only the "_new" columns or rename them to names of your choice.
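For instance, a minimal sketch of keeping only the transformed columns (the column selection here is illustrative, not from the original answer):
df %>%
  mutate_at(vars(starts_with("dddt")),
            list(new = ~ifelse(index == "a", ./dddt_a, ./dddt_b))) %>%
  select(index, ends_with("_new")) %>%
  head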
We can also use case_when
library(dplyr)
df %>%
mutate_at(vars(starts_with("dddt")),
list(new = ~case_when(index=="a" ~ ./dddt_a, TRUE ~ ./dddt_b)))

How to efficiently determine the maximum difference between each row's value and the subsequent values of the same variable within a group in data.table in R

What is the most efficient way to determine the maximum positive difference between the value (X) for each row and the subsequent values of the same variable (X) within a group (Y) in data.table in R?
Example:
set.seed(1)
dt <- data.table(X = sample(100:200, 500455, replace = TRUE),
Y = unlist(sapply(10:1000, function(x) rep(x, x))))
Here's my solution, which I consider inefficient and slow:
dt[, max_diff := vapply(1:.N, function(x) max(X[x:.N] - X[x]), numeric(1)), by = Y]
head(dt, 21)
X Y max_diff
1: 126 10 69
2: 137 10 58
3: 157 10 38
4: 191 10 4
5: 120 10 75
6: 190 10 5
7: 195 10 0
8: 166 10 0
9: 163 10 0
10: 106 10 0
11: 120 11 80
12: 117 11 83
13: 169 11 31
14: 138 11 62
15: 177 11 23
16: 150 11 50
17: 172 11 28
18: 200 11 0
19: 138 11 56
20: 178 11 16
21: 194 11 0
Can you advise a more efficient (faster) solution?
Here's a dplyr solution that is about 20x faster and gets the same results. I presume the data.table equivalent would be yet faster. (EDIT: see bottom - it is!)
The speedup comes from reducing how many comparisons need to be performed. The largest difference will always be found against the largest remaining number in the group, so it's faster to identify that number first and do only the one subtraction per row.
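As a small illustration of that idea (not from the original answer), applied to the first group's values from the example above: reversing, taking the cumulative maximum, and reversing back gives the largest remaining value at each position, so a single subtraction per row reproduces max_diff.
x <- c(126, 137, 157, 191, 120, 190, 195, 166, 163, 106)  # X values of group Y == 10
rev(cummax(rev(x))) - x
# [1] 69 58 38  4 75  5  0  0  0  0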
First, the original solution takes about 4 sec on my machine:
tictoc::tic("OP data.table")
dt[, max_diff := vapply(1:.N, function(x) max(X[x:.N] - X[x]), numeric(1)), by = Y]
tictoc::toc()
# OP data.table: 4.594 sec elapsed
But in only 0.2 sec we can take that data.table, convert to a data frame, add the orig_row row number, group by Y, reverse sort by orig_row, take the difference between X and the cumulative max of X, ungroup, and rearrange in original order:
library(dplyr)
tictoc::tic("dplyr")
dt2 <- dt %>%
as_data_frame() %>%
mutate(orig_row = row_number()) %>%
group_by(Y) %>%
arrange(-orig_row) %>%
mutate(max_diff2 = cummax(X) - X) %>%
ungroup() %>%
arrange(orig_row)
tictoc::toc()
# dplyr: 0.166 sec elapsed
all.equal(dt2$max_diff, dt2$max_diff2)
#[1] TRUE
EDIT: as @david-arenburg suggests in the comments, this can be done lightning-fast in data.table with one elegant line:
dt[.N:1, max_diff2 := cummax(X) - X, by = Y]
On my computer, that's about 2-4x faster than the dplyr solution above.
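If you want to reproduce the comparison yourself, a sketch with the microbenchmark package (assumed installed; copy() keeps the by-reference assignments from accumulating on dt across runs) could look like:
library(microbenchmark)
microbenchmark(
  original  = copy(dt)[, max_diff := vapply(1:.N, function(x) max(X[x:.N] - X[x]), numeric(1)), by = Y],
  cummax_dt = copy(dt)[.N:1, max_diff2 := cummax(X) - X, by = Y],
  times = 5
)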

How to use IQR outlier function, based on a key, in R

I want to use this IQR function:
smooth_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)  # 25th and 75th percentiles
  H <- 1.3 * IQR(x, na.rm = na.rm)                             # fence width: 1.3 * interquartile range
  y <- x
  y[x < (qnt[1] - H)] <- round(qnt[1] - H)                     # cap low outliers at the lower fence
  y[x > (qnt[2] + H)] <- round(qnt[2] + H)                     # cap high outliers at the upper fence
  y
}
on the df below, applied to the total column for every specific key, based on the key column:
key total
US4ZNB 10
US4ZNB 1075
US4ZNB 10000
US4ZNB 1138
US4ZNB 1156
US4YYM 1114
US4YYM 1072
US4YYM 50
US4YYM 1181
US4YYM 8000
JM4YYM 15000
JM4YYM 2000
JM4YYM 100
JM4YYM 2200
JM4YYM 2300
ddply from the plyr package does exactly this. It applies a function to each subset of the data, based on a column.
plyr::ddply(df, "key", plyr::numcolwise(smooth_outliers))
The first argument is your data with "key" and "total"; the second argument is the grouping variable, in this case "key".
The final argument is the function you want to apply. numcolwise() is used here so the function is applied to each numeric column rather than to whole rows; it turns the vector-based smooth_outliers function into a column-wise one.
Then voila.
You'll get a data frame that lists each key with its totals as smoothed by the smooth_outliers function.
Here's the result.
key total
1 JM4YYM 1421
2 JM4YYM 1712
3 JM4YYM 1709
4 US4YYM 1114
5 US4YYM 1473
6 US4YYM 1181
7 US4YYM 1767
8 US4YYM 1005
9 US4ZAW 1138
10 US4ZAW 1156
11 US4ZAW 1982
12 US4ZNB 1338
13 US4ZNB 1075
14 US4ZNB 1806
As you can see, each key is matched up with one of the outputs from the smooth_outliers function.
After elaborating on these ideas, I managed to find a solution to my issue. I just used dplyr::group_by:
df.new <- df %>%
  group_by(key) %>%
  mutate(val = smooth_outliers(total))
Thank you all.

Binning a dataframe with equal frequency of samples

I have binned my data using the cut function
breaks<-seq(0, 250, by=5)
data<-split(df2, cut(df2$val, breaks))
My split dataframe looks like
... ...
$`(15,20]`
val ks_Result c
15 60 237
18 70 247
... ...
$`(20,25]`
val ks_Result c
21 20 317
24 10 140
... ...
My bins look like
> table(data)
data
(0,5] (5,10] (10,15] (15,20] (20,25] (25,30] (30,35]
0 0 0 7 128 2748 2307
(35,40] (40,45] (45,50] (50,55] (55,60] (60,65] (65,70]
1404 11472 1064 536 7389 1008 1714
(70,75] (75,80] (80,85] (85,90] (90,95] (95,100] (100,105]
2047 700 329 1107 399 376 323
(105,110] (110,115] (115,120] (120,125] (125,130] (130,135] (135,140]
314 79 1008 77 474 158 381
(140,145] (145,150] (150,155] (155,160] (160,165] (165,170] (170,175]
89 660 15 1090 109 824 247
(175,180] (180,185] (185,190] (190,195] (195,200] (200,205] (205,210]
1226 139 531 174 1041 107 257
(210,215] (215,220] (220,225] (225,230] (230,235] (235,240] (240,245]
72 671 98 212 70 95 25
(245,250]
494
When I take the mean of the bin counts, I get on average ~900 samples per bin:
> mean(table(data))
[1] 915.9
I want to tell R to make irregular bins in such a way that each bin will contain on average 900 samples (e.g. (0, 27] = 900, (27, 28.5] = 900, and so on). I found something similar here, which deals with only one variable, not the whole dataframe.
I also tried the Hmisc package; unfortunately, the bins don't contain equal frequencies!
library(Hmisc)
data<-split(df2, cut2(df2$val, g=30, oneval=TRUE))
data<-split(df2, cut2(df2$val, m=1000, oneval=TRUE))
Assuming you want 50 equal-sized buckets (based on your seq statement), you can use something like:
df <- data.frame(var=runif(500, 0, 100)) # make data
cut.vec <- cut(
df$var,
breaks=quantile(df$var, 0:50/50), # breaks along 1/50 quantiles
include.lowest=T
)
df.split <- split(df, cut.vec)
Hmisc::cut2 has this option built in as well.
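For example, a minimal sketch with cut2 (the g argument requests that many quantile groups):
library(Hmisc)
cut.vec2 <- cut2(df$var, g = 50)
df.split2 <- split(df, cut.vec2)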
This can be done with the function provided here by Joris Meys:
EqualFreq2 <- function(x, n){
  nx <- length(x)
  nrepl <- floor(nx/n)                 # base number of observations per bin
  nplus <- sample(1:n, nx - nrepl*n)   # bins that receive one extra observation
  nrep <- rep(nrepl, n)
  nrep[nplus] <- nrepl + 1
  x[order(x)] <- rep(seq.int(n), nrep) # assign bin ids to the sorted values
  x
}
data<-split(df2, EqualFreq2(df2$val, 25))
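To check that the resulting bins are (nearly) equal-sized, you could inspect the group counts, e.g.:
sapply(data, nrow)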

How to divide a set of overlapping ranges into non-overlapping ranges, but in R

Let's say we have two datasets:
assays:
BHID<-c(127,127,127,127,128)
FROM<-c(950,959,960,961,955)
TO<-c(958,960,961,966,969)
Cu<-c(0.3,0.9,2.5,1.2,0.5)
assays<-data.frame(BHID,FROM,TO,Cu)
and litho:
BHID<-c(125,127,127,127)
FROM<-c(940,949,960,962)
TO<-c(949,960,961,969)
ROCK<-c(1,1,2,3)
litho<-data.frame(BHID,FROM,TO,ROCK)
and I want to join the two sets; the result after running the algorithm would be:
BHID FROM TO CU ROCK
125 940 970 - 1
127 949 950 - 1
127 950 958 0.3 1
127 958 959 - 1
127 959 960 0.9 1
127 960 961 2.5 2
127 961 962 1.2 -
127 962 966 1.2 3
127 966 969 - 3
128 955 962 0.5 -
Use merge
merge(assays, litho, all=T)
In essence, all=T is the equivalent of SQL's FULL OUTER JOIN. I haven't specified any columns because, in this case, the merge function will perform the join across the columns with the same names.
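Written out explicitly, that default join is equivalent to something like (a sketch; merge's default by is the intersection of the column names, here BHID, FROM and TO):
merge(assays, litho, by = c("BHID", "FROM", "TO"), all = TRUE)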
Tough one, but the code seems to work. The idea is to first expand each row into many, each representing a one-unit increment from FROM to TO. After merging, identify contiguous rows and un-expand them... Obviously it is not a very efficient approach, so it may or may not work if your real data has very large FROM and TO ranges.
library(plyr)
ASSAYS <- adply(assays, 1, with, {
  SEQ <- seq(FROM, TO)
  data.frame(BHID,
             FROM = head(SEQ, -1),
             TO = tail(SEQ, -1),
             Cu)
})
LITHO <- adply(litho, 1, with, {
  SEQ <- seq(FROM, TO)
  data.frame(BHID,
             FROM = head(SEQ, -1),
             TO = tail(SEQ, -1),
             ROCK)
})
# TRUE wherever a value differs from the previous element (including changes to/from NA)
not.as.previous <- function(x) {
  x1 <- head(x, -1)
  x2 <- tail(x, -1)
  c(TRUE, !is.na(x1) & !is.na(x2) & x1 != x2 |
          is.na(x1) & !is.na(x2) |
          !is.na(x1) & is.na(x2))
}
MERGED <- merge(ASSAYS, LITHO, all = TRUE)
MERGED <- transform(MERGED,
                    gp.id = cumsum(not.as.previous(BHID) |
                                   not.as.previous(Cu) |
                                   not.as.previous(ROCK)))
merged <- ddply(MERGED, "gp.id", function(x) {
  out <- head(x, 1)
  out$TO <- tail(x$TO, 1)
  out
})
merged
merged
# BHID FROM TO Cu ROCK gp.id
# 1 125 940 949 NA 1 1
# 2 127 949 950 NA 1 2
# 3 127 950 958 0.3 1 3
# 4 127 958 959 NA 1 4
# 5 127 959 960 0.9 1 5
# 6 127 960 961 2.5 2 6
# 7 127 961 962 1.2 NA 7
# 8 127 962 966 1.2 3 8
# 9 127 966 969 NA 3 9
# 10 128 955 969 0.5 NA 10
Note that the first row is not exactly the same as in your expected output, but I think mine makes more sense.
