create multiple train, val, test splits using sample - r

I want to create train, val, and test splits (60:20:20), and I repeat the process multiple times.
The test set should contain exactly 2 observations each time, so why does it sometimes contain 1 or 3?
What is the role of replace in sample()? Should I set it to FALSE?
library(dplyr)
tbl <- tibble(id = 1:10)
train = list()
val = list()
test = list()
for (run in 1:5)
{
assignment <- sample(1:3, size = nrow(tbl), prob = c(0.6, 0.2, 0.2), replace = TRUE)
# Create a train, validation and test sets
train[[run]] <- tbl[assignment == 1, ]
val[[run]] <- tbl[assignment == 2, ]
test[[run]] <- tbl[assignment == 3, ]
}

If we need exactly 6, 2, and 2 values for groups 1, 2, and 3, replicate 1, 2, 3 that many times and shuffle the result:
v1 <- sample(rep(1:3, c(6, 2, 2)))
Then do a split
split(tbl, v1)
When we use prob, the group sizes vary from run to run because each row is assigned independently with those probabilities; the 60:20:20 ratio holds only on average. As for replace = TRUE, it is required in the OP's code: 1:3 has only 3 elements whereas size = nrow(tbl) is 10, so without replacement sample() cannot supply the 7 extra draws and stops with an error.
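To repeat the exact 6/2/2 split several times, the shuffle can be wrapped in lapply(). This is a minimal sketch rather than the only way to do it; the labels "train", "val", "test" and the seed are assumptions chosen for illustration.

```r
tbl <- data.frame(id = 1:10)
labels <- c("train", "val", "test")

set.seed(42)  # assumed seed, only so that the runs are reproducible
splits <- lapply(1:5, function(run) {
  # shuffle a vector that contains each label exactly 6, 2 and 2 times
  assignment <- sample(rep(labels, times = c(6, 2, 2)))
  split(tbl, factor(assignment, levels = labels))
})

# number of test rows in each of the 5 runs
sapply(splits, function(s) nrow(s$test))
```

Each element of splits is a list with train, val and test data frames, so splits[[3]]$test is the test set of the third run.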

Related

Impute NA Values in a matrix using linear model

I'm trying to write a function that will take a matrix of any size, find a single NA value, and impute that value based on a LINEAR MODEL of the other values in the matrix. I'm running into the issue that my linear model is not supplying me with a slope and an intercept to calculate the missing value, as I believe it should. If anyone can give me any pointers or help me find a more efficient way of achieving this, I would appreciate it.
Here's what I have so far:
rawVector <- c(1, 2, 3, 4, 5, NA, 7, 8, 9)
testMatrix <- matrix(rawVector, nrow = 3, ncol = 3, byrow = T)
imputeMissingValuesHARD <- function(inputMatrix) {
dimnames(inputMatrix) <- list(rownames(inputMatrix, do.NULL = FALSE, prefix = "row"), colnames(inputMatrix, do.NULL = FALSE, prefix = "col"))
dependent <- ""
colNumber <- -1
for(row in 1:nrow(inputMatrix)) {
for(col in 1:ncol(inputMatrix)) {
if(is.na(inputMatrix[row, col]))
{
dependent <- colnames(inputMatrix)[col]
colNumber <- col
}
}
}
otherCols <- colnames(inputMatrix)[-(colNumber)]
myModel <- as.formula(paste(dependent, paste(otherCols, collapse="+"), sep="~"))
fit <- lm(myModel, as.data.frame(inputMatrix))
print(fit)
return("OK")
}
imputeMissingValuesHARD(testMatrix)
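One likely reason the coefficients come back incomplete: lm() drops the row that contains the NA, so the 3 x 3 example leaves only 2 complete rows to estimate 3 coefficients (an intercept and two slopes). The sketch below assumes a matrix with enough complete rows and uses predict() instead of reading the slope and intercept by hand. The helper name imputeSingleNA and the example matrix are hypothetical.

```r
# Hypothetical helper: impute exactly one NA from a linear model on the other columns.
imputeSingleNA <- function(inputMatrix) {
  df <- as.data.frame(inputMatrix)                  # unnamed columns become V1, V2, ...
  pos <- which(is.na(inputMatrix), arr.ind = TRUE)  # row/col position of the NA
  stopifnot(nrow(pos) == 1)                         # this sketch handles a single NA only
  dependent <- names(df)[pos[1, "col"]]
  predictors <- setdiff(names(df), dependent)
  # lm() silently drops the incomplete row when fitting
  fit <- lm(reformulate(predictors, response = dependent), data = df)
  inputMatrix[pos] <- predict(fit, newdata = df[pos[1, "row"], , drop = FALSE])
  inputMatrix
}

# Made-up data in which column 3 equals column 1 + 2 * column 2
m <- cbind(c(1, 2, 3, 4, 5), c(2, 1, 4, 3, 6), c(5, 4, NA, 10, 17))
filled <- imputeSingleNA(m)
filled[3, 3]
```

Because the made-up third column is an exact linear function of the other two, the imputed value should match 3 + 2 * 4.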

How can I append spearman Rho stat to new object?

I am carrying out a number of spearman's rank correlations and I want to produce a list of all the Rho estimates automatically.
Here is some sample data:
A <- data.frame('Area' = c(4, 6, 5),
'flow' = c(1, 1, 1))
B <- data.frame('Area' = c(6, 8, 4),
'flow' = c(1, 2, 1))
files <- list(A, B)
frames <- list('A', 'B')
I currently have the following code that carries out a correlation for each data frame in the list:
lapply(files, function (x)
cor.test(~flow + Area, data = x, method = 'spearman'))
However, what I would like to do is add another line to this to extract the Rho estimation for each correlation and append this to a new list.
How can I do this?
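One possible way to collect the estimates: the object returned by cor.test() has an estimate element, which holds rho when method = 'spearman'. The sketch below uses hypothetical four-row data frames rather than the sample data, because frame A above has a constant flow column, for which the rank correlation is undefined.

```r
# Hypothetical data without constant columns or ties
A <- data.frame(Area = c(4, 6, 5, 9), flow = c(1, 3, 2, 4))
B <- data.frame(Area = c(6, 8, 4, 7), flow = c(1, 4, 2, 3))
files <- list(A = A, B = B)

# run the test for each data frame and keep only the rho estimate
rhos <- lapply(files, function(x) {
  res <- cor.test(~ flow + Area, data = x, method = "spearman")
  unname(res$estimate)
})
rhos
```

Using sapply() instead of lapply() would return a named numeric vector rather than a list.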

How to automatically set-up and add functions to a model in R?

I am setting up a model, and I am trying to reduce the amount of writing I have to do.
Concretely, I am using the coala R-package to do coalescent simulations, and I am trying to easily implement a stepping-stone migration model.
A reproducible example: 4 linearly distributed populations exchange migrants according to stepping-stone pattern (only the adjacent populations).
model <- coal_model(sample_size = c(5, 5, 5, 5),
loci_number = 1,
loci_length = 10,
ploidy = 1) +
feat_mutation(rate = mut_rate, # e.g. 0.1
model = "HKY",
base_frequencies = c(0.25,0.25,0.25,0.25),
tstv_ratio = 4) +
feat_migration(mig_rate, 1, 2) + # mig_rate can be e.g. 0.5
feat_migration(mig_rate, 2, 1) +
feat_migration(mig_rate, 2, 3) +
feat_migration(mig_rate, 3, 2) +
feat_migration(mig_rate, 3, 4) +
feat_migration(mig_rate, 4, 3) +
sumstat_dna(name = "dna", transformation = identity)
This example works, but the downside is that I have to write many 'feat_migration' lines, although there is a clear pattern that could be automated. That is fine for a small number of populations, but I want to run a large simulation with about 70 populations. Does anyone have a good idea of how to automate this? The documentation has not helped me so far.
I tried two things that didn't work:
feat_migration(mig_rate, c(1,2,2,3,3,4), c(2,1,3,2,4,3))
and something like this:
migration_model <- function(){
for(i in 1:n_pops){
# the result of each iteration is discarded, so nothing reaches the model
feat_migration(mig_rate, i, i+1) +
feat_migration(mig_rate, i+1, i)
}
}
In the latter case, I don't really know how to create all the function calls and pass them correctly into my model.
Good ideas are very welcome! :)
Consider the higher-order functions Map (a wrapper around mapply) and Reduce: build a list of function calls with Map, then add them one by one into the model with Reduce. Reduce suits accumulation problems where the result of each iteration is passed into the next, reducing the list to a single final result.
n_pops <- 4
start_pts <- as.vector(sapply(seq(n_pops-1), function(x) c(x, x+1)))
start_pts
# [1] 1 2 2 3 3 4
end_pts <- as.vector(sapply(seq(n_pops-1), function(x) c(x+1, x)))
end_pts
# [1] 2 1 3 2 4 3
# LIST OF feat_migration()
feats <- Map(function(x, y) feat_migration(mig_rate, x, y), start_pts, end_pts)
# LIST OF FUNCTIONS
funcs <- c(coal_model(sample_size = c(5, 5, 5, 5),
loci_number = 1,
loci_length = 10,
ploidy = 1),
feat_mutation(rate = mut_rate, # e.g. 0.1
model = "HKY",
base_frequencies = c(0.25,0.25,0.25,0.25),
tstv_ratio = 4),
feats,
sumstat_dna(name = "dna", transformation = identity)
)
# MODEL CALL
model <- Reduce(`+`, funcs)
As an aside, a chain of ggplot + calls can be written in functional form with Reduce:
gp <- ggplot(df) + aes_string(x='Time', y='Data') +
geom_point() + scale_x_datetime(limits=date_range)
# EQUIVALENTLY
gp <- Reduce(ggplot2:::`+.gg`, list(ggplot(df), aes_string(x='Time', y='Data'),
geom_point(), scale_x_datetime(limits=date_range)))
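This answer is a slight edit of the solution proposed by Parfait: the fixed components are wrapped in list() before being combined with feats, so that c() yields one flat list of model components. The model initializes without errors and runs in the simulator without errors.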
n_pops <- 4
start_pts <- as.vector(sapply(seq(n_pops-1), function(x) c(x, x+1)))
end_pts <- as.vector(sapply(seq(n_pops-1), function(x) c(x+1, x)))
# LIST OF feat_migration()
feats <- Map(function(x, y) feat_migration(mig_rate, x, y), start_pts, end_pts)
# LIST OF FUNCTIONS
funcs <- c(list(coal_model(sample_size = c(5, 5, 5, 5),
loci_number = 1,
loci_length = 10,
ploidy = 1),
feat_mutation(rate = mut_rate, # e.g. 0.1
model = "HKY",
base_frequencies = c(0.25,0.25,0.25,0.25),
tstv_ratio = 4),
sumstat_dna(name = "dna", transformation = identity)),
feats)
# MODEL CALL
model <- Reduce(`+`, funcs)
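The Map/Reduce pattern can be tried without coala installed. The sketch below uses toy stand-ins: toy_model, toy_feature and the `+` method are hypothetical and are not coala's API. They only mimic objects that are combined with +.

```r
# Toy stand-ins (hypothetical): a model collects the labels of the features added to it.
toy_model <- function() structure(list(features = character(0)), class = "toy_model")
toy_feature <- function(label) structure(list(label = label), class = "toy_feature")
`+.toy_model` <- function(e1, e2) {
  e1$features <- c(e1$features, e2$label)
  e1
}

n_pops <- 4
start_pts <- as.vector(sapply(seq(n_pops - 1), function(x) c(x, x + 1)))
end_pts <- as.vector(sapply(seq(n_pops - 1), function(x) c(x + 1, x)))

# one feature per directed migration edge
feats <- Map(function(x, y) toy_feature(paste0(x, "->", y)), start_pts, end_pts)

# wrap the model in list() so that c() keeps it as a single element
model <- Reduce(`+`, c(list(toy_model()), feats))
model$features
```

Without the list() wrapper, c() would flatten the list-based model object into its components, which is presumably why the edited answer wraps the fixed parts in list(). Setting n_pops to 70 needs no other change.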

R - removing data table rows based on two values

I have a large data frame (tbl_df) with approximately the following information:
data <- data.frame(Energy = sample(1:200, 100, replace = T), strip1 = sample(1:12, 100, replace = T), strip2 = sample(1:12, 100, replace = T))
It has 3 columns. The first is energy, the second and third are strip numbers (where energy was deposited).
Each strip has a different threshold and these are stored in two numeric arrays, each position in the array is for the corresponding strip number:
threshold_strip1 <- c(4, 6, 3, 7, 7, 1, 2, 5, 8, 10, 2, 2)
threshold_strip2 <- c(5, 3, 5, 7, 6, 2, 7, 7, 10, 2, 2, 2)
These tell me the minimum amount of energy the strip can receive. What I want to be able to do is remove the rows from the data frame where BOTH strips do not have over the required threshold.
As an example, if I have the row:
Energy = 4, strip1 = 2, strip2 = 2
Then I would remove this row as although strip2 has a lower threshold than 4, strip1 has a threshold of 6 and so there isn't enough energy here.
Apologies if this question is worded poorly, I couldn't seem to find anything like it in old questions.
# look up each row's threshold by strip number and compare it with the energy
filter1 <- data$Energy >= threshold_strip1[data$strip1]
filter2 <- data$Energy >= threshold_strip2[data$strip2]
data <- subset(data, filter1 & filter2)
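The lookup-by-strip-number idea can be checked on a few hand-made rows, including the example row from the question. This is a small sketch; the toy data frame is made up for illustration, and it assumes that an energy equal to the threshold counts as sufficient.

```r
threshold_strip1 <- c(4, 6, 3, 7, 7, 1, 2, 5, 8, 10, 2, 2)
threshold_strip2 <- c(5, 3, 5, 7, 6, 2, 7, 7, 10, 2, 2, 2)

# made-up rows; the first one is the example from the question
toy <- data.frame(Energy = c(4, 9, 1),
                  strip1 = c(2, 2, 6),
                  strip2 = c(2, 4, 6))

# indexing a threshold vector by strip number gives one threshold per row
ok1 <- toy$Energy >= threshold_strip1[toy$strip1]
ok2 <- toy$Energy >= threshold_strip2[toy$strip2]
kept <- toy[ok1 & ok2, ]
kept
```

Only the second row passes both thresholds: the first fails on strip1 (4 < 6) and the third fails on strip2 (1 < 2).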
I'd maybe do...
library(data.table)
setDT(data)
# structure lower-bound rules
threshes = list(threshold_strip1, threshold_strip2)
lbDT = data.table(
strip_loc = rep(seq_along(threshes), lengths(threshes)),
strip_num = unlist(lapply(threshes, seq_along)),
thresh = unlist(threshes)
)
# loop over strip locations (strip1, strip2, etc)
# marking where threshold is not met
data[, keep := TRUE]
lbDT[, {
onexpr = c(sprintf("strip%s==s", strip_loc), "Energy<th")
data[.(s = strip_num, th = thresh), on=onexpr, keep := FALSE]
NULL
}, by=strip_loc]
What about this? Using dplyr:
require(dplyr)
data2 <- data %>%
mutate(
strip1_value = threshold_strip1[strip1],
strip2_value = threshold_strip2[strip2],
to_keep = Energy > strip1_value & Energy > strip2_value
) %>%
filter(to_keep == TRUE)

Divide vector with grouping vector

I have two vectors that I would like to combine into one data frame. The values of one vector, value, need to be divided into two columns. The second vector, nc, gives the number of values for each observation. If nc is 1, only one value is given in value (it goes into val1) and 999 is written in the second column (val2).
What is an R-ish way to divide the vector value and populate the two columns of df? I suspect I am missing something very obvious, but I can't proceed at the moment. Many thanks!
set.seed(123)
nc <- sample(1:2, 10, replace = TRUE)
value <- sample(1:6, sum(nc), replace = TRUE)
# result by hand
df <- data.frame(nc = nc,
val1 = c(6, 3, 4, 1, 2, 2, 6, 5, 6, 5),
val2 = c(999, 5, 999, 6, 1, 999, 6, 4, 4, 999))
Here's an approach based on this answer:
set.seed(123)
nc <- sample(1:2, 10, replace = TRUE)
value <- sample(1:6, sum(nc), replace = TRUE)
splitUsing <- function(x, pos) {
unname(split(x, cumsum(seq_along(x) %in% cumsum(replace(pos, 1, pos[1] + 1)))))
}
combineValues <- function(vals, nums) {
mydf <- data.frame(cbind(nums, do.call(rbind, splitUsing(vals, nums))))
mydf$V3[mydf$nums == 1] <- 999
return(mydf)
}
df <- combineValues(value, nc)
I think this is what you are looking for. I'm not sure it is the fastest way, but it should do the trick.
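# start from an empty result so the loop does not depend on the hand-made df
df <- data.frame(nc = nc, val1 = NA, val2 = NA)
count <- 0
for (i in seq_along(nc)) {
count <- count + nc[i]
if(nc[i]==1) {
df$val1[i] <- value[count]
df$val2[i] <- 999
} else {
df$val1[i] <- value[count-1]
df$val2[i] <- value[count]
}
}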
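The same bookkeeping can be done without a loop by computing where each observation's values start and end with cumsum(). This is a sketch that assumes nc contains only 1s and 2s; the short input vectors are made up so that the result is easy to follow.

```r
nc <- c(1, 2, 2, 1)
value <- c(6, 3, 5, 4, 1, 2)

end <- cumsum(nc)      # index of the last value of each observation
start <- end - nc + 1  # index of its first value

df <- data.frame(nc = nc,
                 val1 = value[start],
                 val2 = ifelse(nc == 2, value[end], 999))
df
```

For these inputs val1 is 6, 3, 4, 2 and val2 is 999, 5, 1, 999.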
