I have the following data.frame (which is longer than this example):
sub height group
1 1.55 a
2 1.65 a
3 1.76 b
4 1.77 a
5 1.58 c
6 1.65 d
7 1.82 c
8 1.91 c
9 1.77 b
10 1.69 b
11 1.74 a
12 1.75 c
I'm making a data partition with the following code:
library("caret")
train = createDataPartition(df$group, p = 0.50)[[1]]
partition = df[train, ]
So it takes roughly half of the subjects from each group.
My problem with the example above is that sometimes a subject from group d will be picked and sometimes not (because group d is really small). I want to add a constraint so that in every partition I make, at least 1 subject from EVERY group is picked.
Any graceful solution?
I came up with a not-so-graceful solution that looks like this:
allGroupSamples <- c()
for (i in unique(df$group))
{
  allGroupSamples <- c(allGroupSamples, sample(rownames(df[df$group == i, ]), 1, replace = TRUE))
}
allGroupSamples <- as.integer(allGroupSamples)
train = createDataPartition(df$group, p = 0.50)[[1]]
train <- c(allGroupSamples, train)
partition = df[unique(train), ]
You can use split on a data.frame and sample within each group taking half of the records or 1, whichever is greater:
# apply a function over the split data.frame
samples <- lapply(split(df, df$group), function(x) {
  # the function takes a random sample of half the records in each group
  # by using `ceiling`, it guarantees at least one record
  s <- sample(nrow(x), ceiling(nrow(x)/2))
  x[s, ]
})
train <- do.call(rbind, samples)
Edit:
If you need a numeric vector:
s <- tapply(1:nrow(df), df$group, function(x) {
  sample(x, ceiling(length(x)/2))
})
do.call(c, s)
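For completeness, a minimal sketch (assuming the df from the question) of turning that index vector into a training partition and its complement; the variable names here are just illustrative:
train_idx <- do.call(c, s)            # the numeric vector from above
partition <- df[train_idx, ]          # training partition, at least one row per group
holdout   <- df[-train_idx, ]         # the remaining rows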
Related
I have a dataset that looks like the below:
> head(mydata)
id value1 value2
1: 1 200001 300001
2: 2 200002 300002
3: 3 200003 300003
4: 4 200004 300004
5: 5 200005 300005
6: 6 200006 300006
value1 and value2 represent amounts at the beginning and the end of a given year. I would like to linearly interpolate the value for a given month, for each id (i.e. rowwise).
After trying different options that were slower, I am currently using map2 from the purrr package in combination with approx from base R. I create the new variable using assignment by reference from the data.table package. This is still surprisingly slow, as it takes approximately 2.2 min for my code to run on my data (1.7 million rows).
Note that I also use get() to access the variables for the interpolation, as their names need to be dynamic. This is slowing down my code, but it doesn't seem to be the bottleneck. Also, I have tried to use the furrr package to speed up map2 by making the code parallel, but the speed gains were not material.
Below is a reproducible example with 1000 rows of data. Any help to speed up the code is greatly appreciated!
library(data.table)
library(purrr)
mydata <- data.table(id = 1:1000, value1 = 2001:3000, value2 = 3001:4000)
floor_value <- "value1"
ceiling_value <- "value2"
m <- 7
monthly_sum_assured <- function(a, b, m) {
  monthly_value <- approx(x = c(0, 12), c(a, b), xout = m)$y
}
mydata[, interpolated_value := map2(get(floor_value), get(ceiling_value),
                                    ~ monthly_sum_assured(.x, .y, m))]
Just use the formula for linear interpolation to vectorize over the whole data.table.
mydata <- data.table(id = 0:1e6, value1= 2e6:3e6, value2= 3e6:4e6)
floor_value <- "value1"
ceiling_value <- "value2"
m <- 7
monthly_sum_assured <- function(a, b, m) {
monthly_value <- approx(x = c(0, 12), c(a, b), xout = m)$y
}
system.time({
mydata[, interpolated_value := map2(get(floor_value), get(ceiling_value),
~ monthly_sum_assured(.x, .y, m))]
})
#> user system elapsed
#> 41.50 0.53 42.05
system.time({
mydata[, interpolated_value2 := get(floor_value) + m*(get(ceiling_value) - get(floor_value))/12]
})
#> user system elapsed
#> 0 0 0
identical(unlist(mydata$interpolated_value), mydata$interpolated_value2)
#> [1] TRUE
It also works just as fast when m is a vector.
m <- sample(12, 1e6 + 1, 1)
system.time({
mydata[, interpolated_value2 := get(floor_value) + m*(get(ceiling_value) - get(floor_value))/12]
})
#> user system elapsed
#> 0.01 0.00 0.02
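For reference, the same formula can be wrapped in a small helper if a named function is preferred over the inline expression; this is only a sketch, and interp_month is an illustrative name:
# sketch: linear interpolation between a (at month 0) and b (at month 12), evaluated at month m
interp_month <- function(a, b, m) a + m * (b - a) / 12
mydata[, interpolated_value3 := interp_month(get(floor_value), get(ceiling_value), m)]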
A survey shows an average score of 4.2 out of 5, with a sample size of 14. How do I create a dataframe of the combinations of individual results that achieve a score of 4.2?
I tried this, but it got too big:
library(tidyverse)
n <- 14
avg <- 4.2
df <- expand.grid(rep(list(1:5), n))
df <- df %>%
  rowwise() %>%
  mutate(avge = mean(c_across(everything()))) %>%
  filter(avge >= 4)
The aim, given the limited information above, is to know the distribution of combinations of individual scores, see which combinations are more likely to occur, and see how many low and high scores are needed to reach the average above.
Thanks!
If you can tolerate doing this randomly, then
set.seed(42) # only so that you get the same results I show here
n <- 14
iter <- 1000000
scores <- integer(0)
while (iter > 0) {
  tmp <- sample(1:5, size = n, replace = TRUE)
  if (mean(tmp) > 4) {
    scores <- tmp
    break
  }
  iter <- iter - 1
}
mean(scores)
# [1] 4.142857
scores
# [1] 5 3 5 5 5 3 3 5 5 2 5 5 4 3
Notes:
The reason I use iter in there is to preclude the possibility of an "infinite" loop. While this example finds a result quickly and is highly unlikely to hit that limit, if you change the conditions they could become infeasible or just highly improbable. If you don't need the guard, remove iter and use while (TRUE) instead; you can always interrupt R with Escape (or whichever mechanism your IDE provides).
The reason I prefill scores with an empty vector and only assign tmp to it on success is so that you won't accidentally assume that scores holding values means you found your average. That is, if the constraints are too tight, you should find nothing, and scores will stay empty.
FYI: if you're looking for an average of 4.2, two things to note:
change the conditional to be what you need, such as looking for 4.2 ... but ...
looking for floating-point equality is going to bite you hard (see Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754), I suggest looking within a tolerance, perhaps
tol <- 0.02
# ...
if (abs(mean(tmp) - 4.2) < tol) {
scores <- tmp
break
}
# ...
where tol is some meaningful number. Unfortunately, there is no combination of 14 votes (of 1 to 5) that produces a mean within tol = 0.01 of 4.2: with 14 integer votes the mean must be a multiple of 1/14, and the closest achievable value, 59/14 ≈ 4.2143, is more than 0.01 away, so the loop below exhausts its iteration limit:
set.seed(42)
n <- 14
iter <- 100000
scores <- integer(0)
tol <- 0.01
while (iter > 0) {
  tmp <- sample(1:5, size = n, replace = TRUE)
  # if (mean(tmp) > 4) {
  if (abs(mean(tmp) - 4.2) < tol) {
    scores <- tmp
    break
  }
  iter <- iter - 1
}
iter
# [1] 0 # <-- this means the loop exited on the iteration-limit, not something found
scores
# integer(0)
if you instead set tol = 0.02 then you will find something:
tol <- 0.02
# ...
scores
# [1] 4 4 4 4 4 5 4 5 5 5 3 4 3 5
mean(scores)
# [1] 4.214286
You can try the code below
n <- 14
avg <- 4.2
repeat {
  x <- sample(1:5, n, replace = TRUE)
  if (sum(x) == round(avg * n)) break
}
and you will see
> x
[1] 5 5 5 5 5 5 4 5 5 4 1 5 1 4
> mean(x)
[1] 4.214286
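If the real goal is the distribution of combinations rather than one random draw, it may be cheaper to enumerate the counts of each score value instead of all 5^14 score vectors. A minimal sketch of that idea (my own addition, not part of the answers above; it reuses the sum == round(avg * n) criterion from the repeat loop):
n <- 14
target <- round(4.2 * n)  # 59, the closest achievable total to an average of 4.2
counts <- expand.grid(n1 = 0:n, n2 = 0:n, n3 = 0:n, n4 = 0:n)
counts$n5 <- n - rowSums(counts)                 # remaining votes must be 5s
counts <- subset(counts, n5 >= 0 &
                   1*n1 + 2*n2 + 3*n3 + 4*n4 + 5*n5 == target)
nrow(counts)  # number of distinct count combinations (n1 = how many 1s, etc.)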
Good Afternoon R wizards,
I searched through a few posts on replacing outliers in a data set - the two that came closest to answering my question were Changing outliers for NA in all columns in a dataset in R and Replace outliers by quantiles in R.
The code in the 2nd reference works great if you want to update a column or two, but I have 40+ and would like to be able to use an apply function to hit all the columns at once.
I want to set a threshold "max" of quantile(probs = .75) for each column, and replace any x>"max" with "max"
set.seed(1)
x = matrix(rnorm(20), ncol = 2)
x[2, 1] = 100
x[4, 2] = 200
colnames(x) <- c("a","b")
#apply(x,2,quantile,probs = .75)
Winsor75 <- function(x) {
  Max <- quantile(x, probs = .75)
  return(Max)
}
y <- as.data.frame(x)
y$a[y$a > Winsor75(x)] <- Winsor75(x)
The last line of code effectively replaces any defined outliers (in my case values above the 75th percentile), but it uses the 75th percentile of the entire matrix "x", whereas I would like (a) the quantile to be computed per column and (b) the ability to use the function in apply/tapply etc. so I can perform the operation on all columns efficiently.
Any suggestions?
Thanks!
as.data.frame(lapply(y, function(x) pmin(x, quantile(x, 0.75, na.rm = TRUE))))
As a function:
df_winsor <- function(df, p) {
  as.data.frame(lapply(df,
    function(x) pmin(x, quantile(x, probs = p, na.rm = TRUE))))
}
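Usage on the example data (assuming the y data.frame built from x above):
df_winsor(y, 0.75)  # caps each column at its own 75th percentile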
Statistician's Disclaimer: I've solved the programming problem you asked. This should not be taken as an endorsement of the idea of automatically checking for, or doing anything with, so-called "outliers".
One option is to use mutate_all with a custom function to apply the rule to all columns.
Approach:
I have created a replaceOutlier function (based on the OP's function) which calculates Max and then replaces any item greater than Max before returning the vector. replaceOutlier is applied over all columns using dplyr::mutate_all.
library(tidyverse)
replaceOutlier <- function(x) {
  Max <- quantile(x, probs = .75)
  x[x > Max] <- Max
  return(x)
}
x %>% as_tibble() %>% mutate_all(funs(replaceOutlier))
#Results
# # A tibble: 10 x 2
# a b
# <dbl> <dbl>
# 1 -0.626 1.08
# 2 0.698 0.390
# 3 -0.836 -0.621
# 4 0.698 1.08
# 5 0.330 1.08
# 6 -0.820 -0.0449
# 7 0.487 -0.0162
# 8 0.698 0.944
# 9 0.576 0.821
# 10 -0.305 0.594
#
Data
set.seed(1)
x = matrix(rnorm(20), ncol = 2)
x[2, 1] = 100
x[4, 2] = 200
colnames(x) <- c("a","b")
I need to take the difference between every pair of elements from two vectors.
If A<-c(1,2) and B<-c(3,4) then my result R should be c(3-1,3-2,4-1,4-2).
With this snippet
myfunction <- function(N)
{
  A = runif(N)
  B = runif(N)
  R = c()
  for (a in A) {
    for (b in B) {
      R = c(b - a, R)
    }
  }
  R
}
print(system.time(result <- myfunction(300)))
I get this time
user system elapsed
14.27 0.01 14.39
Is there any faster way to do it?
The fastest base solution is the use of outer:
as.vector(outer(B,A,"-"))
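A quick check with the vectors from the question (the ordering differs from the question's example, but all pairwise differences are there):
A <- c(1, 2); B <- c(3, 4)
as.vector(outer(B, A, "-"))
# [1] 2 3 1 2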
At first glance, map2_dbl seems quite a bit faster than outer, but that's because it is not calculating every combination of values in A and B:
test elapsed relative
3 CP(A, B) 7.54 47.125 # using expand.grid
2 JL(A, B) 0.16 1.000 # using map2_dbl
1 JM(A, B) 3.13 19.563 # using outer
But:
> A <- 1:3
> B <- 3:1
> JL(A,B)
[1] -2 0 2
> JM(A,B)
[1] 2 1 0 1 0 -1 0 -1 -2
This is for two vectors of length 1000, and with 100 replications. I didn't include your own solution because that one is ridiculously slow for two reasons:
for loops in R are quite a bit faster than in the old days, but still not as optimal as using functions that have their loops coded in C or equivalent. That's the case for the functions used in the tested code here.
you "grow" your result object. Every loop through the code, that R becomes one value larger, so R has to look for a new place in the memory to store it. That's actually the biggest bottleneck in your code. Try to avoid that kind of construct at all costs, because it's one of the most important causes of terribly slow code.
The benchmark code:
library(tidyverse)
JM <- function(A, B){
  as.vector(outer(B, A, "-"))
}
JL <- function(A, B){
  map2_dbl(.x = A,
           .y = B,
           .f = ~ c(.x - .y))
}
CP <- function(A, B){
  as.data.frame(expand.grid(A, B)) %>%
    mutate(Var3 = Var2 - Var1)
}
library(rbenchmark)
A <- runif(1000)
B <- runif(1000)
benchmark(JM(A,B),
JL(A,B),
CP(A,B),
replications = 100,
columns = c("test","elapsed","relative"))
You can use expand.grid to vectorize the approach:
A <- runif(300)
B <- runif(300)
library(dplyr)
R <- as.data.frame(expand.grid(A,B)) %>%
mutate(Var3 = Var2-Var1)
The first 5 lines of output:
Var1 Var2 Var3
1 0.8516676 0.325261 -0.5264066246
2 0.2126453 0.325261 0.1126156694
3 0.5394620 0.325261 -0.2142010126
4 0.1364876 0.325261 0.1887734290
5 0.3248651 0.325261 0.0003958747
This took:
user system elapsed
0.02 0.00 0.02
Your function took:
user system elapsed
42.39 0.43 42.90
Using purrr::map:
library(tidyverse)
N = 300
A = runif(N)
B = runif(N)
R = c()
print(
system.time(
result <- map(
.x = A,
.f = ~ c(.x - B)) %>% unlist
)
)
Time taken:
user system elapsed
0.02 0 0.02
If I got your attention now, check out this repo for a nice walk through of purrr.
I have a 25-year data set that looks similar to the following:
date name value tag
1 2014-12-01 f -0.338578654 12
2 2014-12-01 a 0.323379254 4
3 2014-12-01 f 0.004163806 9
4 2014-12-01 f 1.365219477 2
5 2014-12-01 l -1.225602543 7
6 2014-12-01 d -0.308544089 9
This is how to replicate it:
set.seed(9)
date <- rep(seq(as.Date("1990-01-01"), as.Date("2015-01-1"), by="months"), each=50)
N <- length(date)
name <- sample(letters, N, replace=T)
value <- rnorm(N)
tag <- sample(c(1:50), N, replace=T)
mydata <- data.frame(date, name, value, tag)
head(mydata)
I would like to create a new matrix that stores values that satisfy multiple criteria. For instance, the sum of values that have a name j and a tag i. I use two for-loops and the which() function to filter out the correct values. Like this:
S <- matrix(data=NA, nrow=length(unique(mydata$tag)), ncol=length(unique(mydata$name)))
for (i in 1:nrow(S)) {
  for (j in 1:ncol(S)) {
    foo <- which(mydata$tag == unique(mydata$tag)[i] & mydata$name == unique(mydata$name)[j])
    S[i,j] <- sum(mydata$value[foo])
  }
}
This is ok for small data sets, but too slow for larger ones. Is it possible to avoid the for-loops or somehow speed up the process?
You can use dcast from package reshape2, with a custom function to sum your values:
library(reshape2)
dcast(mydata, name~tag, value.var='value', fun.aggregate=sum)
Or simply xtabs, base R:
xtabs(value~name+tag, mydata)
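Another base R option worth noting (my addition, not part of the original answer): tapply with a two-element grouping list returns a name-by-tag matrix directly, with NA where a name/tag combination never occurs:
with(mydata, tapply(value, list(name, tag), sum))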
Some benchmarks:
funcPer = function(){
  S <- matrix(data=NA, nrow=length(unique(mydata$tag)), ncol=length(unique(mydata$name)))
  for (i in 1:nrow(S)) {
    for (j in 1:ncol(S)) {
      foo <- which(mydata$tag == unique(mydata$tag)[i] & mydata$name == unique(mydata$name)[j])
      S[i,j] <- sum(mydata$value[foo])
    }
  }
}
colonel1 = function() dcast(mydata, name~tag, value.var='value', fun.aggregate=sum)
colonel2 = function() xtabs(value~name+tag, mydata)
#> system.time(colonel1())
# user system elapsed
# 0.01 0.00 0.01
#> system.time(colonel2())
# user system elapsed
# 0.05 0.00 0.05
#> system.time(funcPer())
# user system elapsed
# 4.67 0.00 4.82