R Function using the result inside the function - r

I have one variable A
0
10
15
20
25
30
35
40
45
50
55
60
65
70
75
80
85
90
which is an input into the following function
NoBeta <- function(A)
{
return(1-(1- B * (1-4000))/EXP(0.007*A)
}
The variable B is the result of this function how do I feed the result back into the function to calculate my next result?
Here is B
0
0.07
0.10
0.13
0.16
0.19
0.22
0.24
0.27
0.30
0.32
0.34
0.37
0.39
0.41
0.43
0.45
0.47
So the function needs to return the values of B but also using B e.g. if we using A 10 as input then the input for B is 0, when the input for A is 15 the input for B is the result from the previous calculation 0.07
B is calculated with the following formula in Excel
=1-(1-B1*(1-4000))/EXP(0.007*$A2)
How do I implement this formula in R?

If I understand your question correctly you wish to reference a previous row in a calculation for the current row.
You can adapt a function that was provided in another SO question here.
rowShift <- function(x, shiftLen = 1L) {
r <- (1L + shiftLen):(length(x) + shiftLen)
r[r<1] <- NA
return(x[r])
}
test <- data.frame(x = c(1:10), y = c(2:11))
test$z <- rowShift(test$x, -1) + rowShift(test$y, -1)
> test
x y z
1 1 2 NA
2 2 3 3
3 3 4 5
4 4 5 7
5 5 6 9
6 6 7 11
7 7 8 13
8 8 9 15
9 9 10 17
10 10 11 19
Then what you want to achieve becomes
test$z2 <- 1- (1-rowShift(test$x, -1)*(1-4000))/exp(0.007*rowShift(test$y, -1))
> head(test)
x y z z2
1 1 2 NA NA
2 2 3 3 -3943.390
3 3 4 5 -7831.772
4 4 5 7 -11665.716
5 5 6 9 -15445.790
6 6 7 11 -19172.560

Related

create new var by product of preceding result per goup id in dplyr

I have the follorwing data, and what I need is to create new var new will obtain by product the preceding row values of var Z per group id. eg. the first value of column new is 0.9, 0.90.1, 0.90.1*0.5 for id x=1.
data <- data.frame(x=c(1,1,1,1,2,2,3,3,3,4,4,4,4),
y=c(4,2,2,6,5,6,6,7,8,2,1,6,5),
z=c(0.9,0.1,0.5,0.12,0.6,1.2,2.1,0.9,0.4,0.8,0.45,1.3,0.85))
desired outcome
x y z new
1 1 4 0.90 0.9000
2 1 2 0.10 0.0900
3 1 2 0.50 0.0450
4 1 6 0.12 0.0054
5 2 5 0.60 0.6000
6 2 6 1.20 0.7200
7 3 6 2.10 2.1000
8 3 7 0.90 1.8900
9 3 8 0.40 0.7560
10 4 2 0.80 0.8000
11 4 1 0.45 0.3600
12 4 6 1.30 0.4680
13 4 5 0.85 0.3978
We can use the cumprod from base R
library(dplyr)
data %>%
group_by(x) %>%
mutate(new = cumprod(z)) %>%
ungroup
Or with base R
data$new <- with(data, ave(z, x, FUN = cumprod))

Simulate unbalanced clustered data

I want to simulate some unbalanced clustered data. The number of clusters is 20 and the average number of observations is 30. However, I would like to create an unbalanced clustered data per cluster where there are 10% more observations than specified (i.e., 33 rather than 30). I then want to randomly exclude an appropriate number of observations (i.e., 60) to arrive at the specified average number of observations per cluster (i.e., 30). The probability of excluding an observation within each cluster was not uniform (i.e., some clusters had no cases removed and others had more excluded). Therefore in the end I still have 600 observations in total. Anyone knows how to realize that in R? Here is a smaller example dataset. The number of observation per cluster doesn't follow the condition specified above though, I just used this to convey my idea.
> y <- rnorm(20)
> x <- rnorm(20)
> z <- rep(1:5, 4)
> w <- rep(1:4, each=5)
> df <- data.frame(id=z,cluster=w,x=x,y=y) #this is a balanced dataset
> df
id cluster x y
1 1 1 0.30003855 0.65325768
2 2 1 -1.00563626 -0.12270866
3 3 1 0.01925927 -0.41367651
4 4 1 -1.07742065 -2.64314895
5 5 1 0.71270333 -0.09294102
6 1 2 1.08477509 0.43028470
7 2 2 -2.22498770 0.53539884
8 3 2 1.23569346 -0.55527835
9 4 2 -1.24104450 1.77950291
10 5 2 0.45476927 0.28642442
11 1 3 0.65990264 0.12631586
12 2 3 -0.19988983 1.27226678
13 3 3 -0.64511396 -0.71846622
14 4 3 0.16532102 -0.45033862
15 5 3 0.43881870 2.39745248
16 1 4 0.88330282 0.01112919
17 2 4 -2.05233698 1.63356842
18 3 4 -1.63637927 -1.43850664
19 4 4 1.43040234 -0.19051680
20 5 4 1.04662885 0.37842390
After randomly adding and deleting some data, the unbalanced data become like this:
id cluster x y
1 1 1 0.895 -0.659
2 2 1 -0.160 -0.366
3 1 2 -0.528 -0.294
4 2 2 -0.919 0.362
5 3 2 -0.901 -0.467
6 1 3 0.275 0.134
7 2 3 0.423 0.534
8 3 3 0.929 -0.953
9 4 3 1.67 0.668
10 5 3 0.286 0.0872
11 1 4 -0.373 -0.109
12 2 4 0.289 0.299
13 3 4 -1.43 -0.677
14 4 4 -0.884 1.70
15 5 4 1.12 0.386
16 1 5 -0.723 0.247
17 2 5 0.463 -2.59
18 3 5 0.234 0.893
19 4 5 -0.313 -1.96
20 5 5 0.848 -0.0613
EDIT
This part of the problem solved (credit goes to jay.sf). Next, I want to repeat this process 1000 times and run regression on each generated dataset. However, I don't want to run regression on the whole dataset but rather on some selected clusters with the clusters being selected randomly (can use this function: df[unlist(cluster[sample.int(k, k, replace = TRUE)], use.names = TRUE), ]. In the end, I would like to get confidence intervals from those 1000 regressions. How to proceed?
As per Ben Bolker's request, I am posting my solution but see jay.sf for a more generalizable answer.
#First create an oversampled dataset:
y <- rnorm(24)
x <- rnorm(24)
z <- rep(1:6, 4)
w <- rep(1:4, each=6)
df <- data.frame(id=z,cluster=w,x=x,y=y)
#Then just slice_sample to arrive at the sample size as desired
df %>% slice_sample(n = 20) %>%
arrange(cluster)
#Or just use base R
a <- df[sample(nrow(df), 20), ]
df2 <- a[order(a$cluster), ]
Let ncl be the desired number of clusters. We may generate a sampling space S which is a sequence of tolerance tol around mean observations per cluster mnobs. From that we draw repeatetly a random sample of size 1 to obtain a list of clusters CL. If the sum of cluster lengths meets ncl*mnobs we break the loop, add random data to the clusters and rbind the result.
FUN <- function(ncl=20, mnobs=30, tol=.1) {
S <- do.call(seq.int, as.list(mnobs*(1 + tol*c(-1, 1))))
repeat({
CL <- lapply(1:ncl, function(x) rep(x, sample(S, 1, replace=T)))
if (sum(lengths(CL)) == ncl*mnobs) break
})
L <- lapply(seq.int(CL), function(i) {
id <- seq.int(CL[[i]])
cbind(id, cluster=i,
matrix(rnorm(max(id)*2),,2, dimnames=list(NULL, c("x", "y"))))
})
do.call(rbind.data.frame, L)
}
Usage
set.seed(42)
res <- FUN() ## using defined `arg` defaults
dim(res)
# [1] 600 4
(res.tab <- table(res$cluster))
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
# 29 29 31 31 30 32 31 30 32 28 28 27 28 31 32 33 31 30 27 30
table(res.tab)
# 27 28 29 30 31 32 33
# 2 3 2 4 5 3 1
sapply(c("mean", "sd"), function(x) do.call(x, list(res.tab)))
# mean sd
# 30.000000 1.747178
Displayable example
set.seed(42)
FUN(4, 5, tol=.3) ## tol needs to be adjusted for smaller samples
# id cluster x y
# 1 1 1 1.51152200 -0.0627141
# 2 2 1 -0.09465904 1.3048697
# 3 3 1 2.01842371 2.2866454
# 4 1 2 -1.38886070 -2.4404669
# 5 2 2 -0.27878877 1.3201133
# 6 3 2 -0.13332134 -0.3066386
# 7 4 2 0.63595040 -1.7813084
# 8 5 2 -0.28425292 -0.1719174
# 9 6 2 -2.65645542 1.2146747
# 10 1 3 1.89519346 -0.6399949
# 11 2 3 -0.43046913 0.4554501
# 12 3 3 -0.25726938 0.7048373
# 13 4 3 -1.76316309 1.0351035
# 14 5 3 0.46009735 -0.6089264
# 15 1 4 0.50495512 0.2059986
# 16 2 4 -1.71700868 -0.3610573
# 17 3 4 -0.78445901 0.7581632
# 18 4 4 -0.85090759 -0.7267048
# 19 5 4 -2.41420765 -1.3682810
# 20 6 4 0.03612261 0.4328180

Compute values relative to specific factor in R data frame

I have a data frame in R of the following form:
BC solvopt istrng tSolv EPB
1 10 1 0 0.10 -78.1450
2 10 1 1 0.15 -78.7174
3 10 1 10 0.14 -78.7175
4 10 1 100 0.12 -78.7184
5 10 1 1000 0.09 -78.7232
6 10 1 2 0.15 -78.7175
7 10 1 20 0.14 -78.7176
8 10 1 200 0.12 -78.7192
30 10 2 0 0.10 -78.1450
31 10 2 1 0.11 -78.7174
32 10 2 10 0.11 -78.7175
33 10 2 100 0.10 -78.7184
34 10 2 1000 0.13 -78.7232
35 10 2 2 0.11 -78.7174
36 10 2 20 0.10 -78.7176
37 10 2 200 0.10 -78.7192
59 10 3 0 0.16 -78.1450
60 10 3 1 0.23 -78.7174
61 10 3 10 0.21 -78.7175
62 10 3 100 0.19 -78.7184
63 10 3 1000 0.17 -78.7232
64 10 3 2 0.22 -78.7175
65 10 3 20 0.21 -78.7176
66 10 3 200 0.18 -78.7192
88 10 4 0 0.44 -78.1450
89 10 4 1 14.48 -78.7162
90 10 4 10 12.27 -78.7175
91 10 4 100 1.23 -78.7184
92 10 4 1000 0.44 -78.7232
93 10 4 2 14.52 -78.7172
94 10 4 20 6.16 -78.7176
95 10 4 200 0.62 -78.7192
I want to add a column to this frame which shows the relative error in the EPB for each value of BC and istrng relative to solvopt=3.
For example, to compute the relative difference in EPB at each row I would subtract the EPB value of the corresponding row with the same value of BC and istrng but with solvopt=3.
Is there an easy way to do this short of splitting this into multiple data frames (for each solvopt) and then remunging it together?
The end goal is to generate plots of relative error vs istrng for each value of BC using qplot.
If you merge the subset where solvopt==3 against the main data on both BC and istrong, and subtract the difference, you should get the result you want, e.g.:
newdat <- merge(dat,dat[dat$solvopt==3,c("BC","istrng","EPB")], by=c("BC","istrng"))
newdat$diff <- with(newdat, EPB.x - EPB.y)
...or do it all in one fell swoop using match and interaction:
dat$diff <- dat$EPB - dat[dat$solvopt==3,"EPB"][match(
with(dat, interaction(BC,istrng) ),
with(dat[dat$solvopt==3,], interaction(BC,istrng) )
)]
A similar option with data.table
library(data.table)
res <- setkey(setDT(dat), BC,istrng)[dat[solvopt==3, c(1,3,5),
with=FALSE]][, diff:= EPB- i.EPB][]

stratified sampling or proportional sampling in R

I have a data set generated as follows:
myData <- data.frame(a=1:N,b=round(rnorm(N),2),group=round(rnorm(N,4),0))
The data looks like as this
I would like to generate a stratified sample set of myData with given sample size, i.e., 50. The resulting sample set should follow the proportion allocation of the original data set in terms of "group". For instance, assume myData has 20 records belonging to group 4, then the resulting data set should have 50*20/200=5 records belonging to group 4. How to do that in R.
You can use my stratified function, specifying a value < 1 as your proportion, like this:
## Sample data. Seed for reproducibility
set.seed(1)
N <- 50
myData <- data.frame(a=1:N,b=round(rnorm(N),2),group=round(rnorm(N,4),0))
## Taking the sample
out <- stratified(myData, "group", .3)
out
# a b group
# 17 17 -0.02 2
# 8 8 0.74 3
# 25 25 0.62 3
# 49 49 -0.11 3
# 4 4 1.60 3
# 26 26 -0.06 4
# 27 27 -0.16 4
# 7 7 0.49 4
# 12 12 0.39 4
# 40 40 0.76 4
# 32 32 -0.10 4
# 9 9 0.58 5
# 42 42 -0.25 5
# 43 43 0.70 5
# 37 37 -0.39 5
# 11 11 1.51 6
Compare the counts in the final group with what we would have expected.
round(table(myData$group) * .3)
#
# 2 3 4 5 6
# 1 4 6 4 1
table(out$group)
#
# 2 3 4 5 6
# 1 4 6 4 1
You can also easily take a fixed number of samples per group, like this:
stratified(myData, "group", 2)
# a b group
# 34 34 -0.05 2
# 17 17 -0.02 2
# 49 49 -0.11 3
# 22 22 0.78 3
# 12 12 0.39 4
# 7 7 0.49 4
# 18 18 0.94 5
# 33 33 0.39 5
# 45 45 -0.69 6
# 11 11 1.51 6

How to merge two data frames on common columns in R with sum of others?

R Version 2.11.1 32-bit on Windows 7
I got two data sets: data_A and data_B:
data_A
USER_A USER_B ACTION
1 11 0.3
1 13 0.25
1 16 0.63
1 17 0.26
2 11 0.14
2 14 0.28
data_B
USER_A USER_B ACTION
1 13 0.17
1 14 0.27
2 11 0.25
Now I want to add the ACTION of data_B to the data_A if their USER_A and USER_B are equal. As the example above, the result would be:
data_A
USER_A USER_B ACTION
1 11 0.3
1 13 0.25+0.17
1 16 0.63
1 17 0.26
2 11 0.14+0.25
2 14 0.28
So how could I achieve it?
You can use ddply in package plyr and combine it with merge:
library(plyr)
ddply(merge(data_A, data_B, all.x=TRUE),
.(USER_A, USER_B), summarise, ACTION=sum(ACTION))
Notice that merge is called with the parameter all.x=TRUE - this returns all of the values in the first data.frame passed to merge, i.e. data_A:
USER_A USER_B ACTION
1 1 11 0.30
2 1 13 0.25
3 1 16 0.63
4 1 17 0.26
5 2 11 0.14
6 2 14 0.28
This sort of thing is quite easy to do with a database-like operation. Here I use package sqldf to do a left (outer) join and then summarise the resulting object:
require(sqldf)
tmp <- sqldf("select * from data_A left join data_B using (USER_A, USER_B)")
This results in:
> tmp
USER_A USER_B ACTION ACTION
1 1 11 0.30 NA
2 1 13 0.25 0.17
3 1 16 0.63 NA
4 1 17 0.26 NA
5 2 11 0.14 0.25
6 2 14 0.28 NA
Now we just need sum the two ACTION columns:
data_C <- transform(data_A, ACTION = rowSums(tmp[, 3:4], na.rm = TRUE))
Which gives the desired result:
> data_C
USER_A USER_B ACTION
1 1 11 0.30
2 1 13 0.42
3 1 16 0.63
4 1 17 0.26
5 2 11 0.39
6 2 14 0.28
This can be done using standard R function merge:
> merge(data_A, data_B, by = c("USER_A","USER_B"), all.x = TRUE)
USER_A USER_B ACTION.x ACTION.y
1 1 11 0.30 NA
2 1 13 0.25 0.17
3 1 16 0.63 NA
4 1 17 0.26 NA
5 2 11 0.14 0.25
6 2 14 0.28 NA
So we can replace the sqldf() call above with:
tmp <- merge(data_A, data_B, by = c("USER_A","USER_B"), all.x = TRUE)
whilst the second line using transform() remains the same.
We can use {powerjoin}:
library(powerjoin)
power_left_join(
data_A, data_B, by = c("USER_A", "USER_B"),
conflict = ~ .x + ifelse(is.na(.y), 0, .y)
)
#> USER_A USER_B ACTION
#> 1 1 11 0.30
#> 2 1 13 0.42
#> 3 1 16 0.63
#> 4 1 17 0.26
#> 5 2 11 0.39
#> 6 2 14 0.28
In case of conflict, the function fed to the conflict argument will be used
on pairs of conflicting columns.
We can also use sum(, na.rm = TRUE) row-wise for the same effect :
power_left_join(data_A,data_B, by = c("USER_A", "USER_B"),
conflict = rw ~ sum(.x, .y, na.rm = TRUE))

Resources