I am trying to generate a random sample that excludes certain "bad data." I do not know whether the data is "bad" until after I sample it. Thus, I need to make a random draw from the population and then test it. If the data is "good" then keep it. If the data is "bad" then randomly draw another and test it. I would like to do this until my sample size reaches 25. Below is a simplified example of my attempt to write a function that does this. Can anyone please tell me what I am missing?
df <- data.frame(NAME=c(rep('Frank',10),rep('Mary',10)), SCORE=rnorm(20))
df
random.sample <- function(x) {
x <- df[sample(nrow(df), 1), ]
if (x$SCORE > 0) return(x)
#if (x$SCORE <= 0) run the function again
}
random.sample(df)
Here is a general use of a while loop:
random.sample <- function(df) {
  success <- FALSE
  while (!success) {
    # do something: draw one row at random
    i <- sample(nrow(df), 1)
    x <- df[i, ]
    # check for success
    success <- x$SCORE > 0
  }
  return(x)
}
An alternative is to use repeat (syntactic sugar for while(TRUE)) and break:
random.sample <- function(df) {
  repeat {
    # do something: draw one row at random
    i <- sample(nrow(df), 1)
    x <- df[i, ]
    # exit if the condition is met
    if (x$SCORE > 0) break
  }
  return(x)
}
where break makes you exit the repeat block. Alternatively, you could have if (x$SCORE > 0) return(x) to exit the function directly.
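The loops above return a single good row, while the question asks for a sample of 25. A minimal sketch of how the same loop could collect n good rows follows. The helper name draw_good_rows, its is_good argument and the max_tries safety cap are assumptions made for illustration and do not come from the posts. Rows are drawn with replacement, so the same row can appear more than once.

```r
# Hypothetical helper (not from the original posts): keep drawing single rows
# until n of them pass the is_good() test.
draw_good_rows <- function(data, n, is_good, max_tries = 10000L) {
  kept <- vector("list", n)
  n_kept <- 0L
  tries <- 0L
  while (n_kept < n) {
    tries <- tries + 1L
    # safety cap (an assumption): avoids an endless loop if no row is good
    if (tries > max_tries) stop("gave up after ", max_tries, " draws")
    candidate <- data[sample(nrow(data), 1), , drop = FALSE]
    if (is_good(candidate)) {
      n_kept <- n_kept + 1L
      kept[[n_kept]] <- candidate
    }
  }
  do.call(rbind, kept)
}

set.seed(1)
df <- data.frame(NAME = rep(c("Frank", "Mary"), each = 10), SCORE = rnorm(20))
good <- draw_good_rows(df, 25, function(row) row$SCORE > 0)
nrow(good)           # 25
all(good$SCORE > 0)  # TRUE
```

If "good" can be tested on the whole population up front, filtering first and sampling once is simpler than looping.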
Use this after drawing your first sample; it redraws only the bad rows until none remain:
while (any(bad <- (x$SCORE <= 0)))
x[bad, ] <- df[sample(nrow(df), sum(bad)), ]
You can just select the rows to sample directly like so (just 5):
> df <- data.frame(NAME=c(rep('Frank',10),rep('Mary',10)), SCORE=rnorm(20))
> df[sample(which(df$SCORE>0), 5),]
NAME SCORE
14 Mary 1.0858854
10 Frank 0.7037989
16 Mary 0.7688913
5 Frank 0.2067499
17 Mary 0.4391216
This samples without replacement; for a bootstrap, add replace=TRUE.
random.sample <- function(x) {
x <- df[sample(nrow(df), 1), ]
if (x$SCORE > 0) return(x)
Recall(x)# run the function again
}
random.sample(df)
# NAME SCORE
#14 Mary 1.252566
It seems to me that this should work as well:
df$SCORE[ df$SCORE > 0 ][ sample(1:sum(df$SCORE > 0), 1) ]
#[1] 0.6579631
Related
Consider the following matrix:
testMat <- matrix(c(1,2,1,3,
3,2,3,3,
1,3,1,3,
2,3,3,3,
3,3,3,3,
3,2,3,1), byrow = T, ncol = 4)
and following condition:
cond <- c(0,0) # binary
Problem: If cond[1] is 0 and there is a 1 in either the first or third column, the corresponding rows will be removed. Similarly if cond[2] is 0 and there is a 1 in either the second or fourth column, the corresponding rows will be removed. For example the new matrix will be:
newMat <- testMat[-c(1,3,6),] # for cond <- c(0,0)
newMat <- testMat[-c(1,3),] # for cond <- c(0,1)
newMat <- testMat[-c(6),] # for cond <- c(1,0)
newMat <- testMat # for cond <- c(1,1)
I tried the following, which is both wrong and clumsy.
newMat <- testMat[-(cond[1] == 0 & testMat[,c(1,3)] == 1),]
newMat <- newMat[-(cond[2] == 0 & newMat[,c(2,4)] == 1),]
Can you help to find a base R solution?
This is ugly, but seems to work:
(generalized for length of cond, assuming that the matrix has length(cond)*2 columns)
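keepRows <- apply(testMat, 1,
                  function(rw, cond) {
                    # one flag per element of cond, so this works for any length of cond
                    logicrow <- logical(length(cond))
                    for (b in seq_along(cond)) {
                      # when cond[b] is 0, keep the row only if neither paired column holds a 1
                      logicrow[b] <- ifelse(cond[b] == 0, rw[b] != 1 & rw[length(cond) + b] != 1, TRUE)
                    }
                    all(logicrow)
                  }, cond = cond)
newMat <- testMat[keepRows, ]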
(edited according to comment)
Assuming 1) cond can be of arbitrary length, 2) testMat has an even number of columns, and 3) the rule is to look at the i-th and (i+2)-th column of testMat
cond=c(0,0)
unlist(
sapply(
1:length(cond),
function(i){
j=rowSums(testMat[,c(i,i+2)]==1)
if (cond[i]==0 & sum(j)>0) which(j>0) else nrow(testMat)+1
}
)
)
[1] 1 3 6
This returns the indices of the rows that satisfy your conditions, which you can then remove:
testMat[-.Last.value,]
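For comparison, here is a loop-free sketch in base R. It assumes, as in the first answer, that the matrix has length(cond)*2 columns and that element i of cond governs columns i and i + length(cond). The function name filter_rows is hypothetical.

```r
testMat <- matrix(c(1,2,1,3,
                    3,2,3,3,
                    1,3,1,3,
                    2,3,3,3,
                    3,3,3,3,
                    3,2,3,1), byrow = TRUE, ncol = 4)

# Hypothetical helper: drop rows that hold a 1 in a column pair whose cond is 0.
filter_rows <- function(mat, cond) {
  k <- length(cond)
  stopifnot(ncol(mat) == 2 * k)
  # has_one[r, i] is TRUE when row r has a 1 in column i or in column i + k
  has_one <- (mat[, seq_len(k), drop = FALSE] == 1) |
             (mat[, k + seq_len(k), drop = FALSE] == 1)
  # only column pairs whose cond is 0 can trigger removal
  drop_row <- rowSums(has_one[, cond == 0, drop = FALSE]) > 0
  mat[!drop_row, , drop = FALSE]
}

filter_rows(testMat, c(0, 0))  # drops rows 1, 3 and 6
filter_rows(testMat, c(1, 1))  # keeps every row
```

Working with a logical keep/drop vector instead of negative indices avoids the special case where nothing is removed.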
In my artificial problem, I need to remove the empty values created during sampling and resample until I have at least one value (nrow(s.df)>0). If the first sample satisfies the condition, I keep the result (res[[i]] <- s.df); if not, I need to draw a new sample. I tried to do this by combining repeat and while with else, without success.
My example:
#Artificial data set
v0<-rnorm(20)
vNA<-rep(NA, 80)
v<-c(v0,vNA)
id<-1:100
df<-data.frame(id,v)
s_size<-c(1,2,3,4,5)
# Sampling using repeat
res<-list()
for(i in 1:length(s_size)){ # Loop for different sample size
s.df<-df[sample(nrow(df), 3), ] #sampling in data set
s.df<-s.df[complete.cases(s.df), ] #remove NAs
s.df
if (nrow(s.df)>0){
res[[i]] <- s.df# add it to the list
}
}
else{
repeat{
s.df<-df[sample(nrow(df), 3), ] #sampling in data set
s.df<-s.df[complete.cases(s.df), ] #remove NAs
if (nrow(s.df)>0){
res[[i]] <- s.df# add it to the list
}
if (nrow(res.circle)>0){break}
}
}
}
big_sample = do.call(rbind, res)
or
# Sampling using while
res<-list()
for(i in 1:length(s_size)){ # Loop
s.df<-df[sample(nrow(df), 3), ] #sampling in data set
s.df<-s.df[complete.cases(s.df), ] #remove NAs
if (nrow(s.df)>0){
res[[i]] <- s.df# add it to the list
}
}
else{
while(nrow(res.circle)>0) {
s.df<-df[sample(nrow(df), 3), ] #sampling in data set
s.df<-s.df[complete.cases(s.df), ] #remove NAs
if (nrow(s.df)>0){
res[[i]] <- s.df# add it to the list
}
}
}
big_sample = do.call(rbind, res)
This approach obviously doesn't work, but if I don't use else{}, I will overwrite the results that already satisfied the first condition. Any ideas, please?
You could put the while loop inside the for loop and make it depend on the outcome of the if condition; then you don't need the else:
set.seed(42)
v0 <- rnorm(20)
vNA <- rep(NA, 80)
v <- c(v0, vNA)
id <- 1:100
df <- data.frame(id, v)
s_size <- c(1, 2, 3, 4, 5)
res <- list()
for (i in 1:length(s_size)) {
condition <- FALSE
while (condition == FALSE) {
s.df <- df[sample(nrow(df), 3),]
s.df <- s.df[complete.cases(s.df),]
if (nrow(s.df) > 0) {
res[[i]] <- s.df
condition <- TRUE
}
}
}
big_sample <- do.call(rbind, res)
big_sample
#> id v
#> 15 15 -0.13332134
#> 8 8 -0.09465904
#> 18 18 -2.65645542
#> 4 4 0.63286260
#> 6 6 -0.10612452
#> 2 2 -0.56469817
Created on 2020-06-11 by the reprex package (v0.3.0)
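The same idea can also be written with repeat and break, which is what the question was originally attempting. This is only a sketch on the question's artificial data; preallocating res with vector("list", ...) is an assumption of this example, not part of the answer above.

```r
set.seed(42)
df <- data.frame(id = 1:100, v = c(rnorm(20), rep(NA, 80)))
s_size <- c(1, 2, 3, 4, 5)

res <- vector("list", length(s_size))
for (i in seq_along(s_size)) {
  repeat {
    s.df <- df[sample(nrow(df), 3), ]      # sample three rows
    s.df <- s.df[complete.cases(s.df), ]   # remove NAs
    if (nrow(s.df) > 0) break              # stop once something survives
  }
  res[[i]] <- s.df                         # stored exactly once per iteration
}
big_sample <- do.call(rbind, res)
```

Because the assignment to res[[i]] happens after the loop exits, a result that already satisfied the condition is never overwritten.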
I am comparing a 20-day moving average against the 50-day, 100-day, 200-day and 333-day moving averages. The condition is essentially just
if(20MA > 50MA > 100MA > 200MA > 333MA) {
return TRUE
} else{
FALSE
}
Is there a way in R to handle this without multiple nested if statements?
if(tail(MA_20,n=1) > tail(MA_50,n=1) > tail(MA_100,n=1) > tail(MA_200,n=1) > tail(MA_333,n=1)) {
score[1] <- 1
} else{
score[1] <- -1
}
Use && (see help("&") for other logical operators)
if (tail(MA_20,n=1) > tail(MA_50,n=1) &&
tail(MA_50,n=1) > tail(MA_100,n=1) &&
tail(MA_100,n=1) > tail(MA_200,n=1) &&
tail(MA_200,n=1) > tail(MA_333,n=1)) {
score[1] <- 1
} else {
score[1] <- -1
}
The condition evaluates to TRUE only if all four of those comparisons are TRUE.
The ifelse function might be helpful for you in speeding up your code as it helps vectorize comparisons.
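As a rough illustration of that vectorized style, the sketch below scores several dates at once. The matrix ma and its values are made up for the example; each row stands for one date and the columns are ordered from the shortest to the longest window.

```r
# Hypothetical data: one row per date, columns ordered by window length.
ma <- cbind(MA_20  = c(105, 100, 99),
            MA_50  = c(103, 101, 98),
            MA_100 = c(101,  99, 97),
            MA_200 = c( 98,  98, 96),
            MA_333 = c( 95,  97, 95))

# Compare every column with its right-hand neighbour; a row qualifies only
# when all ncol(ma) - 1 comparisons are TRUE.
stacked <- rowSums(ma[, -ncol(ma)] > ma[, -1]) == ncol(ma) - 1
score <- ifelse(stacked, 1, -1)
score  # 1 -1 1
```

For a single date, the if/else with && shown above is the better fit; ifelse pays off when there is a whole vector of conditions.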
The following approach looks more complicated but is more flexible. It first collects all variables whose names match the pattern "MA_", uses sapply with tail to extract the last element of each, and then uses diff to check whether those values are in decreasing order.
First make up some data.
library(zoo)
set.seed(1234) # Reproducible results
n <- 1e3
x <- rnorm(n)
MA_20 <- rollmean(x, k = 20)
MA_50 <- rollmean(x, k = 50)
MA_100 <- rollmean(x, k = 100)
MA_200 <- rollmean(x, k = 200)
MA_333 <- rollmean(x, k = 333)
Now the problem.
score <- NULL
ma <- stringr::str_sort(ls(pattern = "^MA_"), numeric = TRUE)
MA_last <- sapply(ma, function(m) tail(get(m), n = 1))
score[1] <- if(all(diff(MA_last) < 0)) 1 else -1  # decreasing: MA_20 > MA_50 > ... > MA_333
I am working with a time-series raster brick. The brick has 365 layers representing a value for each day of the year.
I want to create a new layer in which each cell holds the day-of-year number on which a certain condition is met.
My current approach is the following (APHRO being the raster brick), but returns the error message below:
r <- raster(ncol=40, nrow=20)
r[] <- rnorm(n=ncell(r))
APHRO <- brick(x=c(r, r*2, r))
NewLayer <- calc(APHRO, fun=FindOnsetDate(APHRO))
Returning this error:
Error in .local(x, ...) : not a valid subset
And the function being passed:
FindOnsetDate <- function (s) {
x=0
repeat {
x+1
if(s[[x]] >= 20 | s[[x]] + s[[x+1]] >= 20 & ChkFalseOnset() == FALSE)
{break}
}
return(x);
}
With the function for the 3rd condition being:
ChkFalseOnset <- function (x) {
for (i in 0:13){
if (sum(APHRO[[x+i:x+i+7]]) >= 5)
{return(FALSE); break}
return(TRUE)
}
}
Thank you in advance!!!!
And please let me know if I should provide more information - tried to keep it parsimonious.
The problem is that your function is no good:
FindOnsetDate <- function (s) {
x=0
repeat {
x+1
if(s[[x]] >= 20 | s[[x]] + s[[x+1]] >= 20)
{break}
}
return(x);
}
FindOnsetDate(1:100)
#Error in s[[x]] :
# attempt to select less than one element in get1index <real>
Perhaps something like this:
FindOnsetDate <- function (s) {
j <- s + c(s[-1], 0)
sum(j > 20 | s > 20)
# if all values are positive, just do sum(j > 20)
}
FindOnsetDate(1:20)
#10
This works now:
r <- calc(APHRO, FindOnsetDate)
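Note that the function above counts the days on which the condition holds. If what is wanted is the first day on which it holds, a sketch on a plain numeric vector of daily values could look like the following. The name first_onset_day and the threshold argument are assumptions for illustration; it uses >= as in the question, leaves out the third (false onset) check, and has not been tried inside calc here.

```r
# Hypothetical helper: index of the first day on which either the daily value
# or the two-day total reaches the threshold; NA if that never happens.
first_onset_day <- function(s, threshold = 20) {
  two_day <- s + c(s[-1], 0)  # day i plus day i + 1 (last day padded with 0)
  hit <- which(s >= threshold | two_day >= threshold)
  if (length(hit) == 0) NA_integer_ else hit[1]
}

first_onset_day(c(3, 5, 12, 9, 25, 1))  # 3, because 12 + 9 >= 20
first_onset_day(c(1, 2, 3))             # NA, threshold never reached
```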
I would suggest a basic two-step process. With a 365-day example:
library(raster)
set.seed(123)
r <- raster(ncol=40, nrow=20)
r_list <- list()
for(i in 1:365){
r_list[[i]] <- setValues(r,rnorm(n=ncell(r),mean = 10,sd = 5))
}
APHRO <- brick(r_list)
Use a basic logic test for each iteration:
r_list2 <- list()
for(i in 1:365){
if(i != 365){
r_list2[[i]] <- APHRO[[i]] >= 20 | APHRO[[i]] + APHRO[[i+1]] >= 20
}else{
r_list2[[i]] <- APHRO[[i]] >= 20
}
}
Compute sum by year:
NewLayer <- calc(brick(r_list2), fun=sum)
plot(NewLayer)
I have a glm based on data A and I'd like to score data B to do validation, but some records in B have missing data.
Instead of these ending up without a score (na.omit) or being removed (na.exclude) I'd like them to end up with an outputted prediction that uses the model to determine a value based only on the data with values.
A reproducible example...
data(mtcars)
model<-glm(mpg~.,data=mtcars)
mtcarsNA<-mtcars
NAins <- NAinsert <- function(df, prop = .1){
n <- nrow(df)
m <- ncol(df)
num.to.na <- ceiling(prop*n*m)
id <- sample(0:(m*n-1), num.to.na, replace = FALSE)
rows <- id %/% m + 1
cols <- id %% m + 1
sapply(seq(num.to.na), function(x){
df[rows[x], cols[x]] <<- NA
}
)
return(df)
}
mtcarsNA<-NAins(mtcarsNA,.4)
mtcarsNA$mpg<-mtcars$mpg
predict(model,newdata=mtcarsNA,type="response")
Where I need the last line to return a result (non-NA) for all records. Can you point me in the direction of the code needed?
Based on the conversation in the comments, you want to replace NA values with zero before predicting. This seems dangerous/dubious to me -- use at your own risk.
naZero <- function(x) { x[is.na(x)] <- 0; x }
mtcarszero <- lapply(mtcarsNA,naZero)
predict(model,newdata=mtcarszero,type="response")
should be what you want.
For categorical variables, if you are using default treatment contrasts, then I think the consistent thing to do is something like this:
naZero <- function(x) {
  if (is.numeric(x)) {
    repVal <- 0
  } else if (is.factor(x)) {
    repVal <- levels(x)[1]
  } else stop("uh-oh")
  x[is.na(x)] <- repVal
  x
}
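A small usage sketch of this version follows. The toy data frame d is made up for the example, and naZero is restated only so the block runs on its own. Assigning into d[] keeps the result a data frame rather than the plain list that lapply returns.

```r
# Restated from the answer above so the example is self-contained.
naZero <- function(x) {
  if (is.numeric(x)) {
    repVal <- 0
  } else if (is.factor(x)) {
    repVal <- levels(x)[1]
  } else stop("uh-oh")
  x[is.na(x)] <- repVal
  x
}

# Hypothetical toy data with one numeric and one factor column.
d <- data.frame(a = c(1, NA, 3), g = factor(c("x", NA, "y")))
d[] <- lapply(d, naZero)  # d[] <- ... preserves the data.frame structure
d$a                # 1 0 3
as.character(d$g)  # "x" "x" "y"
```

The same caveat applies as before: replacing NA with zero or with the reference level is a modelling decision, not a neutral default.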